Since BRIN indexes were introduced in PostgreSQL 9.5, many people have gladly adopted this new index type. A lot has been written about this new feature and a lot of positive feedback has been reported. While BRIN indexes are clearly a success and definitely a win, some people tend to exaggerate and use them way too frequently.
Correlation, it is all about correlation
BRIN indexes are cheap, but that does not mean they are always beneficial. If the correlation of a column is low, a BRIN index may provide no gain at all, or may even cause a small loss. Here is an example showing what can happen:
test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 1000000) AS id;
SELECT 1000000
Time: 422.647 ms
test=# VACUUM ANALYZE t_test;
VACUUM
Time: 144.370 ms
We generate a simple PostgreSQL table containing 1 million rows.
Then we look up a single, randomly chosen value:
test=# SELECT count(*) FROM t_test WHERE id = 533455;
 count
-------
     1
(1 row)

Time: 44.555 ms
The sequential scan takes around 44 ms and returns exactly one row.
Now let us try the same thing using a BRIN index:
test=# CREATE INDEX idx_brin ON t_test USING brin(id);
CREATE INDEX
Time: 148.036 ms
test=# SELECT count(*) FROM t_test WHERE id = 533455;
 count
-------
     1
(1 row)

Time: 2.983 ms
In this case the scan is a lot faster and completes in around 3 ms. This is pretty nice.
Here is the execution plan:
test=# explain analyze SELECT count(*) FROM t_test WHERE id = 533455;
                                            QUERY PLAN
------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.02..16.03 rows=1 width=8) (actual time=2.514..2.514 rows=1 loops=1)
   ->  Bitmap Heap Scan on t_test  (cost=12.01..16.02 rows=1 width=0) (actual time=1.228..2.511 rows=1 loops=1)
         Recheck Cond: (id = 533455)
         Rows Removed by Index Recheck: 28927
         Heap Blocks: lossy=128
         ->  Bitmap Index Scan on idx_brin  (cost=0.00..12.01 rows=1 width=0) (actual time=0.029..0.029 rows=1280 loops=1)
               Index Cond: (id = 533455)
 Planning time: 0.059 ms
 Execution time: 2.541 ms
(9 rows)
As we can see, PostgreSQL performs a bitmap scan to fetch the data. Only 128 heap blocks have to be read, which is exactly one BRIN range and therefore as good as it gets.
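One reason BRIN indexes are so popular is that they are tiny compared to a B-tree. If you want to verify this yourself, the on-disk footprint can be checked directly (a small sketch based on the objects created above; idx_btree is a purely hypothetical comparison index):

-- size of the BRIN index created above
SELECT pg_size_pretty(pg_relation_size('idx_brin'));

-- hypothetical comparison: a B-tree on the same column
CREATE INDEX idx_btree ON t_test USING btree(id);
SELECT pg_size_pretty(pg_relation_size('idx_btree'));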
When correlation goes down …
However, the situation is quite different when correlation goes down. Remember: by default, a BRIN index stores the minimum and the maximum value for each range of 128 blocks. If the data is sorted, the index performs nicely because many of these 128 x 8 kB areas can be excluded from the scan.
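The range size of 128 blocks is only the default. If finer pruning is desired, the granularity can be adjusted with the pages_per_range storage parameter when the index is created (a sketch; idx_brin_small and the value 32 are just illustrative choices):

-- smaller ranges mean a slightly larger index but more precise block elimination
CREATE INDEX idx_brin_small ON t_test USING brin(id) WITH (pages_per_range = 32);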
The situation changes dramatically if the data is shuffled (= correlation is low). In this case a chunk of 128 blocks (= 1 MB) will most likely contain a value close to the absolute minimum and one close to the absolute maximum of the column, so the scan cannot exclude a sufficient number of chunks:
test=# CREATE TABLE t_random AS SELECT * FROM t_test ORDER BY random();
SELECT 1000000
Time: 1321.911 ms
test=# VACUUM ANALYZE t_random;
VACUUM
Time: 146.827 ms
test=# CREATE INDEX idx_brin_random ON t_random USING brin(id);
CREATE INDEX
Time: 140.131 ms
As we can see in the next listing, the same query now needs many more blocks than before:
test=# explain analyze SELECT count(*) FROM t_random WHERE id = 533455;
                                            QUERY PLAN
---------------------------------------------------------------------------------------------------
 Aggregate  (cost=16.02..16.03 rows=1 width=8) (actual time=86.122..86.122 rows=1 loops=1)
   ->  Bitmap Heap Scan on t_random  (cost=12.01..16.02 rows=1 width=0) (actual time=73.613..86.117 rows=1 loops=1)
         Recheck Cond: (id = 533455)
         Rows Removed by Index Recheck: 999999
         Heap Blocks: lossy=4425
         ->  Bitmap Index Scan on idx_brin_random  (cost=0.00..12.01 rows=1 width=0) (actual time=0.314..0.314 rows=44800 loops=1)
               Index Cond: (id = 533455)
 Planning time: 0.102 ms
 Execution time: 86.173 ms
(9 rows)

Time: 86.621 ms
In this example the query runtime skyrockets, and so does the number of blocks needed.
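If such a randomly ordered column has to be searched frequently, a conventional B-tree is usually the better choice, even though it needs far more space (a sketch; idx_btree_random is a hypothetical name not used in the listings above):

-- a B-tree does not depend on the physical ordering of the table
CREATE INDEX idx_btree_random ON t_random USING btree(id);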
Conclusion
BRIN indexes are only effective if the physical order of a column's values is somewhat correlated with their logical order, i.e. if the data is more or less “sorted” on disk. In many cases this happens naturally. However, it is certainly not always the case.
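To judge in advance whether a column is a good BRIN candidate, the planner statistics collected by ANALYZE can be consulted (a sketch; a correlation close to 1 or -1 indicates that the physical order matches the logical order well, while values near 0 indicate shuffled data):

-- correlation between physical row order and logical value order
SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('t_test', 't_random')
      AND attname = 'id';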