Indexing is the key to good performance. However, people often ask: Is there an alternative to btree indexing? Can we make indexes in PostgreSQL smaller? Can we create indexes in PostgreSQL faster? And how can we index in a data warehouse?

This blog will answer all those questions and show which options you have to index tables in a data warehouse. We’re going to focus on the epic battle between btree (AKA B-tree) and BRIN indexes.

Types of indexes in PostgreSQL

Unlike many other relational databases, PostgreSQL supports more than just one index type. The reason for that is that not all indexes are suitable for the same type of operation. An index that works fine to index names does not necessarily work for GIS data, and vice versa. That’s why people have the choice of which index to use in PostgreSQL.

How can we figure out which types of indexes actually exist in PostgreSQL? Here’s how it works:


test=# SELECT * FROM pg_am;
 oid  | amname |      amhandler       | amtype
------+--------+----------------------+--------
    2 | heap   | heap_tableam_handler | t
  403 | btree  | bthandler            | i
  405 | hash   | hashhandler          | i
  783 | gist   | gisthandler          | i
 2742 | gin    | ginhandler           | i
 4000 | spgist | spghandler           | i
 3580 | brin   | brinhandler          | i
(7 rows)

The pg_am system table reveals which types of indexes are supported by your system. Note that an index can actually be loaded as an extension, which means that the list you see is not necessarily constant – you might see more entries.

Loading sample data

Before we dive into the details of btree and BRIN indexes, it makes sense to load some sample data. The easiest way to do that is to make use of generate_series. This is the Swiss army knife type of function which is basically used by most PostgreSQL consultants to generate huge tables for testing purposes:

test=# CREATE TABLE t_demo (id_sorted int8, id_random int8);
CREATE TABLE
test=# INSERT INTO t_demo
       SELECT id, hashtext(id::text)
       FROM generate_series(1, 5000000000) AS id;
INSERT 0 5000000000

In this case, we created 5 billion rows which lead to a 206 GB table. Here we have two columns: The first column is ascending. It contains numbers ranging from 1 to 5 billion.

test=# \d+
List of relations
 Schema | Name   | Type  | Owner | Persistence | Access method | Size   | Description
--------+--------+-------+-------+-------------+---------------+--------+-------------
 public | t_demo | table | hs    | permanent   | heap          | 206 GB |
(1 row)

The second column contains a hash value. Take a look:

test=# SELECT * FROM t_demo LIMIT 10 OFFSET 50;
 id_sorted | id_random
-----------+-------------
        51 |   342243648
        52 | -1711900017
        53 |   213436769
        54 |  2112769913
        55 | -1318351987
        56 |   -19937162
        57 |  -299365904
        58 | -1228416573
        59 |    93548776
        60 |  1491665190       
(10 rows)

The point here is that the hash value is pretty much random and should be evenly distributed. It gives us the chance to see how various index types such as btree and BRIN behave in these two quite common use cases.

CREATE INDEX: Performance issues with btree

Let’s create a simple btree. In a first step we use the PostgreSQL default configuration (the only change I made was to set max_wal_size to 100 GB – all other settings are default)

Creating a standard btree index will cost us 35 minutes:

test=# CREATE INDEX idx_id ON t_demo (id_sorted);
CREATE INDEX
Time: 2109651,552 ms (35:09,652)

However, we can do better. If we drop the index again and pump maintenance_work_mem to 1 GB, and if we increase the number of worker processes PostgreSQL is allowed to use, we can speed things up:

test=# SHOW maintenance_work_mem ;
 maintenance_work_mem
----------------------
 64MB
(1 row)

Time: 0,170 ms
test=# SET maintenance_work_mem TO '1 GB';
SET
Time: 0,181 ms
test=# SET max_parallel_maintenance_workers TO 4;
SET

Now let’s create the same btree index again and see what happens:

[hs@hanspc user_test]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system--- ------cpu-----
r b swpd   free   buff cache    si  so bi       bo      in    cs   us sy id wa st
5 0 450048 235408 1800 25689472 0   2  2964     6838    21    41   9  2  88 1  0
5 0 450048 261876 1800 25650596 0   0  1225912  8       19425 8512 39 7  54 0  0
7 0 450048 257548 1800 25660116 0   0  941458   277776  19766 7638 32 7  53 7  0
1 5 450048 232476 1800 25690880 0   0  857176   411110  17791 7603 34 7  50 9  0
5 0 450048 260136 1800 25663332 0   0  756074   485380  16005 7811 28 6  53 13 0
3 4 450048 242340 1800 25679688 0   0  722454   666192  16785 8842 26 10 49 15 0

When looking at vmstat we see that PostgreSQL starts to read hundreds of MB per second, and starts to spill hundreds of MB/second to disk at a time. Since we cannot sort everything in memory, this is exactly what we expected to happen.

During the first phase we see exactly what happens as revealed by the progress monitoring infrastructure:

test=# \x
Expanded display is on.
test=# SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+-------------------------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index: scanning table
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 27027028
blocks_done        | 8577632
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0

PostgreSQL will first scan the table and prepare the data for sorting.
In the process table we can see that all desired cores are working at full speed to make this happen:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
38830 hs 20 0 594336 219780 12152 R 99,0 0,7 0:06.65 postgres
38831 hs 20 0 594336 219748 12120 R 99,0 0,7 0:06.65 postgres
38833 hs 20 0 594336 219920 12288 R 98,7 0,7 0:06.64 postgres
38832 hs 20 0 594336 219916 12284 R 98,3 0,7 0:06.63 postgres
23147 hs 20 0 514964 276408 149056 R 97,7 0,8 106:48.69 postgres

Once this step is done, PostgreSQL can start to sort those partial data sets. Remember, the table is too big to make this happen in memory, so we had to spill to disk:

-[ RECORD 1 ]-------+------------------------------------
pid                 | 23147
datid               | 24576
datname             | test
relid               | 24583
index_relid         | 0
command             | CREATE INDEX
phase               | building index: sorting live tuples
lockers_total       | 0
lockers_done        | 0
current_locker_pid  | 0
blocks_total        | 27027028
blocks_done         | 27027028
tuples_total        | 0
tuples_done         | 0
partitions_total    | 0
partitions_done     | 0

After this stage PostgreSQL will add data to the tree and write the index to disk:

-[ RECORD 1 ]------+---------------------------------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index: loading tuples in tree
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 0
blocks_done        | 0
tuples_total       | 5000000000
tuples_done        | 2005424786
partitions_total   | 0
partitions_done    | 0

We managed to almost double the performance of our indexing process – which is a really great achievement:

test=# CREATE INDEX idx_id ON t_demo (id_sorted);
CREATE INDEX
Time: 1212601,300 ms (20:12,601)

If you want to know more about indexes I can also recommend one of my other blog posts dealing with faster index creation in PostgreSQL.

What is noteworthy here is the size of the index we’ve just created:

test=# \di+
List of relations
 Schema | Name   | Type  | Owner | Table  | Persistence | Access method | Size   | Description
--------+--------+-------+-------+--------+-------------+---------------+--------+-------------
 public | idx_id | index | hs    | t_demo | permanent   | btree         | 105 GB |
(1 row)

105 GB’s is pretty massive. However, it basically confirms the following rule of thumb: Usually an index needs around 25 bytes per index entry – assuming we do not have any duplicate entries.

Create a second index:

test=# CREATE INDEX idx_random ON t_demo (id_random);
CREATE INDEX
Time: 1731088,437 ms (28:51,088)

In this case, we have indexed the randomized column. As our input stream is not sorted anymore it takes a bit longer to create this index.

Running queries using btrees

Run a query and see what the btree index has done so far:

test=# SELECT count(*)
FROM t_demo
WHERE id_sorted BETWEEN 10000000 AND 20000000;
   count
----------
 10000001
(1 row)

Time: 358,804 ms

What we see here is stunning performance. We managed to extract 10 million rows in just 358 milliseconds. Let’s look deep into the execution plans and see what happens:

test=# explain (analyze, buffers) SELECT count(*) 
       FROM  t_demo 
       WHERE id_sorted BETWEEN 10000000 AND 20000000;
                            QUERY PLAN               
---------------------------------------------------------------------
 Finalize Aggregate (cost=296509.46..296509.47 rows=1 width=8) 
                   (actual time=496.450..500.609 rows=1 loops=1)
                   Buffers: shared hit=11 read=27325
    ->  Gather   (cost=296509.24..296509.45 rows=2 width=8) 
                (actual time=496.379..500.603 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=11 read=27325
         ->  Partial Aggregate  (cost=295509.24..295509.25 rows=1 width=8) 
                                (actual time=492.621..492.622 rows=1 loops=3)
               Buffers: shared hit=11 read=27325
               ->  Parallel Index Only Scan using idx_id on t_demo  
                        (cost=0.58..284966.32 rows=4217169 width=0) 
                        (actual time=0.074..340.666 rows=3333334 loops=3)
                     Index Cond: ((id_sorted >= 10000000) 
                            AND (id_sorted <= 20000000))
                     Heap Fetches: 0
                     Buffers: shared hit=11 read=27325
 Planning:
   Buffers: shared hit=7 read=3
 Planning Time: 0.233 ms
 Execution Time: 500.662 ms
(16 rows)

The above is a parallel-index-only scan. Actually this is good news because it means that it’s enough to scan the index. There’s no need to fetch data from the underlying table. All we need to ensure is the existence of the row (which is in this case done through PostgreSQL hint bits).

Now we’ll run the same query, but this time we’ll use the random column.
Note that I reduced the range a bit to ensure that we also get 10 million rows. The size of the result set is therefore roughly identical:

test=# explain SELECT count(*) 
       FROM  t_demo 
       WHERE id_random BETWEEN 10000000 AND 19000000;
                            QUERY PLAN
------------------------------------------------------------------- 
 Aggregate  (cost=22871994.27..22871994.28 rows=1 width=8)
   ->  Index Only Scan using idx_random on t_demo  
            (cost=0.58..22846435.29 rows=10223592 width=0)
         Index Cond: ((id_random >= 10000000) 
            AND (id_random <= 19000000))
(3 rows)

The parallel query using more than one core is gone. But, it’s even worse … see what happens during the query:

[hs@hanspc user_test]$ vmstat 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1 1377280 228168   1032 27286216    0    0 120584   314 17933 30276  2  5 86  7  0
 1  0 1377280 231608   1032 27286224    0    0 121102  1069 17839 30854  2  3 88  7  0
 1  1 1377280 251400   1032 27270360    0    0 121168   869 17669 30177  1  4 88  7  0
 0  1 1377280 263096   1032 27254396    0    0 121674   514 17780 30695  2  2 88  7  0
 0  1 1377280 236788   1032 27275712    0    0 121630   260 18403 31970  2  2 88  7  0

Wow, we see 120 MB / sec sustained I/O which translates to horrible runtime.
Note that the amount of data is the same – all that has changes is the distribution of data on disk:

test=# SELECT count(*)
       FROM t_demo
       WHERE id_random BETWEEN 10000000 AND 19000000;
  count
----------
 10481155
(1 row)
Time: 322747,613 ms (05:22,748)

5 minutes. As we can see here our runtime has exploded. We can figure out why:

test=# explain (analyze, buffers) SELECT count(*) 
       FROM  t_demo 
       WHERE id_random BETWEEN 10000000 AND 19000000;
                           QUERY PLAN                                                   
-------------------------------------------------------------------
 Aggregate  (cost=22871994.27..22871994.28 rows=1 width=8) 
            (actual time=292381.841..292381.843 rows=1 loops=1)
            Buffers: shared hit=6611938 read=3884860
   ->  Index Only Scan using idx_random on t_demo 
        (cost=0.58..22846435.29 rows=10223592 width=0) 
        (actual time=0.515..291336.908 rows=10481155 loops=1)
         Index Cond: ((id_random >= 10000000) 
                  AND (id_random <= 19000000))
         Heap Fetches: 3866504
         Buffers: shared hit=6611938 read=3884860
 Planning Time: 1.080 ms
 Execution Time: 292381.934 ms
(8 rows)

Time: 292384,903 ms (04:52,385)

Data is spread all over the place. Therefore we need millions of 8k pages to handle the query. In those blocks we only need a subset of data. That causes substantial runtime.

BRIN indexes come to the rescue?

BRIN indexes are often seen as a savior for warehousing applications. However, you should not be so sure about that.

First, we’ll create an index:

test=# CREATE INDEX idx_brin ON t_demo USING brin (id_sorted);
CREATE INDEX
Time: 710549,637 ms (11:50,550)

The index is made really quickly. What is also extremely convincing is the size of the BRIN index:

test=# SELECT pg_size_pretty(pg_relation_size('idx_brin'));
 pg_size_pretty
----------------
 7064 kB
(1 row)

it’s only 7 MB which saves us a stunning 10 GB on disk. What we can also see during index creation is that the process looks slightly different:

test=# SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+---------------
pid                | 23147
datid              | 24576
datname            | test
relid              | 24583
index_relid        | 0
command            | CREATE INDEX
phase              | building index
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 27027028
blocks_done        | 16898015
tuples_total       | 0
tuples_done        | 0
partitions_total   | 0
partitions_done    | 0

BRIN is not as extensive a sorting exercise as normal btree index creation. This is important because it translates to less temporary I/O and faster index creation.

Compare queries:

test=# explain (analyze, buffers) SELECT count(*) 
       FROM  t_demo
       WHERE id_sorted BETWEEN 10000000 AND 20000000;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Finalize Aggregate (cost=49930967.53..49930967.54 rows=1 width=8) 
                    (actual time=664.605..671.519 rows=1 loops=1)
                    Buffers: shared hit=311 read=55025
    ->  Gather  (cost=49930967.32..49930967.53 rows=2 width=8) 
                (actual time=664.387..671.502 rows=3 loops=1)
          Workers Planned: 2
          Workers Launched: 2
          Buffers: shared hit=311 read=55025
          ->  Partial Aggregate (cost=49929967.32..49929967.33 rows=1 width=8) 
                                (actual time=653.953..653.955 rows=1 loops=3)
                Buffers: shared hit=311 read=55025
                ->  Parallel Bitmap Heap Scan on t_demo  
                        (cost=5390.21..49920339.39 rows=3851169 width=0) 
                        (actual time=56.944..489.563 rows=3333334 loops=3)
                      Recheck Cond: ((id_sorted >= 10000000) 
                                 AND (id_sorted <= 20000000))
                      Rows Removed by Index Recheck: 5546
                      Heap Blocks: lossy=19046
                      Buffers: shared hit=311 read=55025
                      ->  Bitmap Index Scan on idx_brin  
                                (cost=0.00..3079.51 rows=9258865 width=0) 
                                (actual time=65.293..65.294 rows=541440 loops=1)
                            Index Cond: ((id_sorted >= 10000000) 
                                     AND (id_sorted <= 20000000))
                            Buffers: shared hit=311 read=881
  Planning:
    Buffers: shared hit=1
  Planning Time: 0.210 ms
  Execution Time: 671.623 ms
 (20 rows)

The btree managed to run the query with just 27,000 blocks – now we need around 55,000 – which translates to more runtime.

While a BRIN index is usually a good idea for such use cases, it’s not always the magic bullet. It does need significantly less space and it can be built faster. But is does not magically fix all cases under all circumstances.

Please keep in mind that I AM NOT SAYING that BRIN is bad – it’s usually a speedup compared to btree indexes. What I am saying is that you should compare various setups and decide what is best. Also: Keep in mind that correlation does matter.

Finally …

To find out about reduced B-tree index bloat, check out Laurenz’ post.

If you want to learn more about PostgreSQL and if you want to read something about GIS data right now I can highly recommend Florian Nadler’s post dealing with historical data and MobilityDB.