A quick performance tip for everyone wrestling with occasionally slow queries in PostgreSQL: there's a simple, little-known trick that can be applied at low cost when slow queries are caused by poor planner estimates for your filter conditions. Read on for details.

The problem

The thing with Postgres statistics is that you basically lose them as soon as you transform a column in some way in your query. The most prominent example is probably something like "SELECT … WHERE date_trunc('day', some_indexed_timestamp_column) = '2017-01-01'" – and gone are the (imaginary) index and the column statistics. That query could of course easily be rewritten so that performance doesn't explode, and a functional index would also make sense when a rewrite is not possible or the query is executed often and is selective enough. But for queries/filters that are used very rarely, a full functional index might not be ideal due to its performance side effects (index maintenance slows down inserts, updates and deletes a bit)… so welcome our performance helper of the day – functional BRIN indexes!
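To make the date_trunc case concrete, here is roughly what the rewrite and the full functional index would look like (table and column names are just placeholders matching the example above; the functional index variant assumes a plain timestamp column, as date_trunc on timestamptz is not immutable and would need a time-zone-explicit expression):

```sql
-- The transformed column hides the index and its statistics from the planner
SELECT * FROM events
 WHERE date_trunc('day', some_indexed_timestamp_column) = '2017-01-01';

-- Rewritten as a range condition on the raw column – the plain index works again
SELECT * FROM events
 WHERE some_indexed_timestamp_column >= '2017-01-01'
   AND some_indexed_timestamp_column <  '2017-01-02';

-- Or, when a rewrite is not possible: a matching functional B-tree index
CREATE INDEX ON events (date_trunc('day', some_indexed_timestamp_column));
```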

Example use case

So now let's illustrate it with a dummy example: let's say we want to select all IDs that are divisible by 10 out of 1 million sequentially growing IDs. Let's create the test data and look at the resulting query plan.

krl@postgres=# CREATE TABLE t_event AS SELECT generate_series(1, 1e6) id;
SELECT 1000000

krl@postgres=# CREATE INDEX ON t_event (id);
CREATE INDEX

krl@postgres=# ANALYZE t_event ;
ANALYZE

krl@postgres=# EXPLAIN SELECT * FROM t_event WHERE id % 10 = 0;
                          QUERY PLAN                          
──────────────────────────────────────────────────────────────
 Seq Scan on t_event  (cost=0.00..19425.00 rows=5000 width=6)
   Filter: ((id % '10'::numeric) = '0'::numeric)
(2 rows)

What do we see? Well, we see that Postgres expects to get only 5000 rows, but we know that 100k rows will actually match! Such a 20x misestimation could already end really badly for larger datasets. And where does the number 5000 come from? Here some "insider knowledge" is required: when Postgres has no exact statistics for a column filter (and by transforming our ID column we threw away the statistics we had on it), it falls back to a hard-coded default selectivity – depending on the datatype and the filter type (plain filter vs. join), somewhere between 0.5% and 33% of the data is assumed to be returned, here the equality default of 0.5%. A constant, in other words… which obviously doesn't work too well for our less selective filter. But OK, how to improve the situation?
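The mismatch is easy to verify by comparing the plan with an actual count (same dummy table as above):

```sql
-- Planner's guess: rows=5000, i.e. the 0.5% equality default on 1 million rows
EXPLAIN SELECT * FROM t_event WHERE id % 10 = 0;

-- Reality: every 10th ID matches
SELECT count(*) FROM t_event WHERE id % 10 = 0;   -- 100000
```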

Using functional BRIN indexes for better row amount estimations

So now the trick – let's create our functional BRIN index and see how it changes our original query plan's row estimate. BRIN, by the way, stands for Block Range Index (think min/max values per range of blocks) – a lossy index type (a table re-check is needed) introduced in Postgres 9.5, generally meant for big amounts of data and most effective when the indexed values also appear in the table in a somewhat natural order. So let's create one, rebuild the statistics and do another EXPLAIN.

krl@postgres=# CREATE INDEX ON t_event USING brin ((id % 10));
CREATE INDEX

krl@postgres=# ANALYZE t_event ;
ANALYZE

krl@postgres=# EXPLAIN SELECT * FROM t_event WHERE id % 10 = 0;
                          QUERY PLAN                           
───────────────────────────────────────────────────────────────
 Seq Scan on t_event  (cost=0.00..19425.00 rows=98467 width=6)
   Filter: ((id % '10'::numeric) = '0'::numeric)
(2 rows)

And indeed, the estimate of 98467 matching rows now looks a lot better – quite exact, actually!
Here the database of course still settled on a sequential scan… but at least it did so based on correct assumptions (row count and physical table correlation – see the corresponding pg_stats entry below), and after a quick check, forcing Postgres to use the index, I indeed saw that the sequential scan was faster. And the efficiency gain compared to a normal functional index (which wouldn't be used anyway – only its statistics would be): 384x less space on disk, plus faster data modifications, as BRIN indexes only need maintenance when the min/max threshold is crossed for a range of blocks!
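The "forcing" mentioned above can be done for a quick test by disabling sequential scans for the session (a testing aid only, not something for production code):

```sql
SET enable_seqscan = off;   -- makes seq scans prohibitively expensive for the planner
EXPLAIN ANALYZE SELECT * FROM t_event WHERE id % 10 = 0;
RESET enable_seqscan;
```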

Also note that the row estimation worked so well in our case because the derived column has fewer distinct values than default_statistics_target (100 by default), so Postgres has more or less exact appearance frequencies available for the individual values.
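If a derived expression had more distinct values than that, one could bump the statistics target before re-gathering statistics – e.g. at session level (the value 1000 here is just an illustration worth benchmarking, as larger targets make ANALYZE and planning somewhat slower):

```sql
SET default_statistics_target = 1000;  -- default is 100
ANALYZE t_event;                       -- re-sample with the bigger target
```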

Hope this trick will serve you well 🙂

krl@postgres=# \di+ t_event_*
                             List of relations
 Schema │       Name        │ Type  │ Owner │  Table  │ Size  │ Description 
────────┼───────────────────┼───────┼───────┼─────────┼───────┼─────────────
 public │ t_event_expr_idx  │ index │ krl   │ t_event │ 21 MB │ 
 public │ t_event_expr_idx1 │ index │ krl   │ t_event │ 56 kB │ 
 public │ t_event_id_idx    │ index │ krl   │ t_event │ 21 MB │ 
(3 rows)

krl@postgres=# SELECT * FROM pg_stats WHERE tablename = 't_event_expr_idx1';
─[ RECORD 1 ]──────────┬─────────────────────────────────────────────────────────────────────────────────────
schemaname             │ public
tablename              │ t_event_expr_idx1
attname                │ expr
inherited              │ f
null_frac              │ 0
avg_width              │ 7
n_distinct             │ 10
most_common_vals       │ {2,3,7,6,0,1,4,5,8,9}
most_common_freqs      │ {0.1033,0.102233,0.1013,0.1012,0.1005,0.100033,0.0993,0.0978333,0.0978333,0.0964667}
histogram_bounds       │ ¤
correlation            │ 0.0950125
most_common_elems      │ ¤
most_common_elem_freqs │ ¤
elem_count_histogram   │ ¤