After all, a B-tree index is just a tree
Image © Laurenz Albe 2019

If you thought that the B-tree index is a technology that was perfected in the 1980s, you are mostly right. But there is still room for improvement, so PostgreSQL v12 (in the tradition of v11) has added some new features in this field. Thanks, Peter Geoghegan!

In this article, I want to explain some of these improvements with examples.

A B-tree index test case

To demonstrate the changes, I’ll create an example table on both PostgreSQL v11 and v12:

CREATE TABLE rel (
   aid bigint NOT NULL,
   bid bigint NOT NULL
);

ALTER TABLE rel
   ADD CONSTRAINT rel_pkey PRIMARY KEY (aid, bid);

CREATE INDEX rel_bid_idx ON rel (bid);

\d rel
                Table "public.rel"
 Column |  Type  | Collation | Nullable | Default 
--------+--------+-----------+----------+---------
 aid    | bigint |           | not null | 
 bid    | bigint |           | not null | 
Indexes:
    "rel_pkey" PRIMARY KEY, btree (aid, bid)
    "rel_bid_idx" btree (bid)

Tables like this are typically used to implement many-to-many relationships between other tables (entities).

The primary key creates a unique composite B-tree index on the table. The table serves two purposes:

  • it ensures that there is at most one entry for each combination of aid and bid
  • it can speed up searches for all bids related to a given aid

The second index speeds up searches for all aids related to a given bid.

Adding test data

Now let’s insert some rows in ascending numeric order, as is common for artificially generated keys:

INSERT INTO rel (aid, bid)
   SELECT i, i / 10000
   FROM generate_series(1, 20000000) AS i;

/* set hint bits and calculate statistics */
VACUUM (ANALYZE) rel;

Here each bid is related to 10000 aids.

B-tree index improvement 1: INSERT into indexes with many duplicates

The first difference becomes obvious when we compare the size of the index on bid:

In PostgreSQL v11:

\di+ rel_bid_idx
                           List of relations
 Schema |    Name     | Type  |  Owner   | Table |  Size  | Description 
--------+-------------+-------+----------+-------+--------+-------------
 public | rel_bid_idx | index | postgres | rel   | 545 MB | 
(1 row)

In PostgreSQL v12:

\di+ rel_bid_idx

 Schema |    Name     | Type  |  Owner   | Table |  Size  | Description 
--------+-------------+-------+----------+-------+--------+-------------
 public | rel_bid_idx | index | postgres | rel   | 408 MB | 
(1 row)

The index is 33% bigger in v11.

Explanation

Every bid occurs 10000 times in the index, therefore there will be many leaf pages where all keys are the same (each leaf page can contain a couple of hundred entries).

 

Before v12

Before v12, PostgreSQL would store such entries in no special order in the index. So if a leaf page had to be split, it was sometimes the rightmost leaf page, but sometimes not. The rightmost leaf page was always split towards the right end to optimize for monotonically increasing inserts. In contrast to this, other leaf pages were split in the middle, which wasted space.

Since v12

Starting with v12, the physical address (“tuple ID” or TID) of the table row is part of the index key, so duplicate index entries are stored in table order. This will cause index scans for such entries to access the table in physical order, which can be a significant performance benefit, particularly on spinning disks. In other words, the correlation for duplicate index entries will be perfect. Moreover, pages that consist only of duplicates will be split at the right end, resulting in a densely packed index (that is what we observed above).

A similar optimization was added for multi-column indexes, but it does not apply to our primary key index, because the duplicates are not in the first column. The primary key index is densely packed in both v11 and v12, because the first column is monotonically increasing, so leaf page splits occur always at the rightmost page. As mentioned above, PostgreSQL already had an optimization for that.

B-tree index improvement 2: compressed storage of internal index pages

The improvements for the primary key index are not as obvious, since they are almost equal in size in v11 and v12. We will have to dig deeper here.

First, observe the small difference in an index-only scan in both v11 and v12 (repeat the statement until the blocks are in cache):

In PostgreSQL v11:

EXPLAIN (ANALYZE, BUFFERS, COSTS off, SUMMARY off, TIMING off)
SELECT aid, bid FROM rel
WHERE aid = 420024 AND bid = 42;

                          QUERY PLAN                           
---------------------------------------------------------------
 Index Only Scan using rel_pkey on rel (actual rows=1 loops=1)
   Index Cond: ((aid = 420024) AND (bid = 42))
   Heap Fetches: 0
   Buffers: shared hit=5
(4 rows)

In PostgreSQL v12:

EXPLAIN (ANALYZE, BUFFERS, COSTS off, SUMMARY off, TIMING off)
SELECT aid, bid FROM rel
WHERE aid = 420024 AND bid = 42;

                          QUERY PLAN                           
---------------------------------------------------------------
 Index Only Scan using rel_pkey on rel (actual rows=1 loops=1)
   Index Cond: ((aid = 420024) AND (bid = 42))
   Heap Fetches: 0
   Buffers: shared hit=4
(4 rows)

In v12, one less (index) block is read, which means that the index has one level less. Since the size of the indexes is almost identical, that must mean that the internal pages can fit more index entries. In v12, the index has a greater fan-out.

Explanation

As described above, PostgreSQL v12 introduced the TID as part of the index key, which would waste an inordinate amount of space in internal index pages. So the same commit introduced truncation of “redundant” index attributes from internal pages. The TID is redundant, as are non-key attributes from an INCLUDE clause (these were also removed from internal index pages in v11). But PostgreSQL v12 can also truncate those index attributes that are not needed for table row identification.

In our primary key index, bid is a redundant column and is truncated from internal index pages, which saves 8 bytes of space per index entry. Let’s have a look at an internal index page with the pageinspect extension:

In PostgreSQL v11:

SELECT * FROM bt_page_items('rel_pkey', 2550);

 itemoffset |    ctid    | itemlen | nulls | vars |                      data                       
------------+------------+---------+-------+------+-------------------------------------------------
          1 | (2667,88)  |      24 | f     | f    | cd 8f 0a 00 00 00 00 00 45 00 00 00 00 00 00 00
          2 | (2462,0)   |       8 | f     | f    | 
          3 | (2463,15)  |      24 | f     | f    | d6 c0 09 00 00 00 00 00 3f 00 00 00 00 00 00 00
          4 | (2464,91)  |      24 | f     | f    | db c1 09 00 00 00 00 00 3f 00 00 00 00 00 00 00
          5 | (2465,167) |      24 | f     | f    | e0 c2 09 00 00 00 00 00 3f 00 00 00 00 00 00 00
          6 | (2466,58)  |      24 | f     | f    | e5 c3 09 00 00 00 00 00 3f 00 00 00 00 00 00 00
          7 | (2467,134) |      24 | f     | f    | ea c4 09 00 00 00 00 00 40 00 00 00 00 00 00 00
          8 | (2468,25)  |      24 | f     | f    | ef c5 09 00 00 00 00 00 40 00 00 00 00 00 00 00
          9 | (2469,101) |      24 | f     | f    | f4 c6 09 00 00 00 00 00 40 00 00 00 00 00 00 00
         10 | (2470,177) |      24 | f     | f    | f9 c7 09 00 00 00 00 00 40 00 00 00 00 00 00 00
...
        205 | (2666,12)  |      24 | f     | f    | c8 8e 0a 00 00 00 00 00 45 00 00 00 00 00 00 00
(205 rows)

In the data entry we see the bytes from aid and bid. The experiment was conducted on a little-endian machine, so the numbers in row 6 would be 0x09C3E5 and 0x3F or (as decimal numbers) 639973 and 63. Each index entry is 24 bytes wide, of which 8 bytes are the tuple header.

In PostgreSQL v12:

SELECT * FROM bt_page_items('rel_pkey', 2700);

 itemoffset |   ctid   | itemlen | nulls | vars |          data           
------------+----------+---------+-------+------+-------------------------
          1 | (2862,1) |      16 | f     | f    | ab 59 0b 00 00 00 00 00
          2 | (2576,0) |       8 | f     | f    | 
          3 | (2577,1) |      16 | f     | f    | 1f 38 0a 00 00 00 00 00
          4 | (2578,1) |      16 | f     | f    | 24 39 0a 00 00 00 00 00
          5 | (2579,1) |      16 | f     | f    | 29 3a 0a 00 00 00 00 00
          6 | (2580,1) |      16 | f     | f    | 2e 3b 0a 00 00 00 00 00
          7 | (2581,1) |      16 | f     | f    | 33 3c 0a 00 00 00 00 00
          8 | (2582,1) |      16 | f     | f    | 38 3d 0a 00 00 00 00 00
          9 | (2583,1) |      16 | f     | f    | 3d 3e 0a 00 00 00 00 00
         10 | (2584,1) |      16 | f     | f    | 42 3f 0a 00 00 00 00 00
...
        286 | (2861,1) |      16 | f     | f    | a6 58 0b 00 00 00 00 00
(286 rows)

The data contain only aid, since bid has been truncated away. This reduces the index entry size to 16, so that more entries fit on an index page.

Upgrade considerations

Since index storage has been changed in v12, a new B-tree index version 4 has been introduced.

Since upgrading with pg_upgrade does not change the data files, indexes will still be in version 3 after an upgrade. PostgreSQL v12 can use these indexes, but the above optimizations will not be available. You need to REINDEX an index to upgrade it to version 4 (this has been made easier with REINDEX CONCURRENTLY in PostgreSQL v12).

Other B-tree index features introduced in v12

There were some other improvements added in PostgreSQL v12. I won’t discuss them in detail, but here is a list:

  • Reduce locking overhead for B-tree index inserts for improved performance.
  • Introduce REINDEX CONCURRENTLY to make it easier to rebuild an index without down-time.
  • Improve performance for index-only scans on indexes with many attributes.
  • Add a view pg_stat_progress_create_index to report progress for CREATE INDEX and REINDEX.

Conclusion

B-tree indexes have become smarter again in PostgreSQL v12, particularly for indexes with many duplicate entries. To fully benefit, you either have to upgrade with dump/restore or you have to REINDEX indexes after pg_upgrade.