1000 killed index tuples
© Laurenz Albe 2018

 

Since I only recently learned about the concept of “killed index tuples”, I thought there might be some others who are not yet familiar with this interesting PostgreSQL concept.

This may give you an explanation the next time you encounter wildly varying execution times for the same execution plan of the same PostgreSQL query.

Before we look more closely at the index, let’s review the life cycle of a table row version (“heap tuple”).

Life, death and visibility in the table heap

It is widely known that the visibility of heap tuples is determined by the system columns xmin and xmax (though there is more to xmax than meets the eye). A heap tuple is “dead” if its xmax is less than the xmin of all active transactions.

Now xmin and xmax are only valid if the respective transactions have been marked committed in the “commit log”. Consequently, any transaction that needs to know if it can see a tuple has to consult the commit log. To save future readers that extra work, the first one that consults the commit log will save the information in the tuple’s “hint bits”.

Dead tuples are eventually reclaimed by VACUUM.

This is all fairly well known, but how is the situation with index entries?

Life, death and visibility in the index

To avoid redundancy and to keep index tuples small, the visibility information is not stored in the index.
The status of an index tuple is determined by the heap tuple it points to, and both are removed by VACUUM at the same time.

As a consequence, an index scan has to inspect the heap tuple to determine if it can “see” an entry. This is the case even if all the columns needed are in the index tuple itself. Even worse, this “heap access” will result in random I/O, which is not very efficient on spinning disks.

This makes index scans in PostgreSQL more expensive than in other database management systems that use a different architecture. To mitigate that, several features have been introduced over the years:

  • PostgreSQL 8.1 introduced the “bitmp index scan”. This scan method first creates a list of heap blocks to visit and then scans them sequentially. This not only reduces the random I/O, but also avoids that the same block is visited several times during an index scan.
  • PostgreSQL 9.2 introduced the “index-only scan”, which avoids fetching the heap tuple. This requires that all the required columns are in the index and the “visibility map” shows that all tuples in the table block are visible to everybody.

But I want to focus on another feature that was also introduced in 8.1, yet never made it into the release notes.

Killed index tuples

The feature was introduced with this commit:

commit 3f4d48802271126b1343289a9d2267ff1ed3788a
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:   Fri May 24 18:57:57 2002 +0000

    Mark index entries "killed" when they are no longer visible to any
    transaction, so as to avoid returning them out of the index AM.  Saves
    repeated heap_fetch operations on frequently-updated rows.  Also detect
    queries on unique keys (equality to all columns of a unique index), and
    don't bother continuing scan once we have found first match.

    Killing is implemented in the btree and hash AMs, but not yet in rtree
    or gist, because there isn't an equally convenient place to do it in
    those AMs (the outer amgetnext routine can't do it without re-pinning
    the index page).

    Did some small cleanup on APIs of HeapTupleSatisfies, heap_fetch, and
    index_insert to make this a little easier.

Whenever an index scan fetches a heap tuple only to find that it is dead (that the entire “HOT chain” of tuples is dead, to be more precise), it marks the index tuple as “killed”. Then future index scans can simply ignore it.

This “hint bit for indexes” can speed up future index scans considerably.

An example

Let’s create a table to demonstrate killed index tuples. We have to disable autovacuum so that it does not spoil the effect by removing the dead tuples.

CREATE UNLOGGED TABLE hole (
   id integer PRIMARY KEY,
   val text NOT NULL
) WITH (autovacuum_enabled = off);

INSERT INTO hole
SELECT i, 'text number ' || i
FROM generate_series(1, 1000000) AS i;

ANALYZE hole;

We create a number of dead tuples by deleting rows from the table:

DELETE FROM hole
WHERE id BETWEEN 501 AND 799500;

ANALYZE hole;

Spot the differences!

Now we run the same query twice:

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMIMG OFF)
SELECT * FROM hole
WHERE id BETWEEN 1 AND 800000;

                          QUERY PLAN
---------------------------------------------------------------
 Index Scan using hole_pkey on hole (actual rows=1000 loops=1)
   Index Cond: ((id >= 1) AND (id <= 800000))
   Buffers: shared hit=7284
 Planning Time: 0.346 ms
 Execution Time: 222.129 ms
(5 rows)

EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, TIMIMG OFF)
SELECT * FROM hole
WHERE id BETWEEN 1 AND 800000;

                          QUERY PLAN
---------------------------------------------------------------
 Index Scan using hole_pkey on hole (actual rows=1000 loops=1)
   Index Cond: ((id >= 1) AND (id <= 800000))
   Buffers: shared hit=2196
 Planning Time: 0.217 ms
 Execution Time: 14.540 ms
(5 rows)

Discussion of the example

There are a number of things that can not account for the difference:

  • Often the second execution is faster because the first run has to read more data from secondary storage, which are already cached on the second run. But you can observe from the execution plan that in both cases, all blocks were already in the cache (“shared hit”) and none had to be read from disk.
  • No hint bits had to be written during the first execution (no buffers were “dirtied”), because that was already done during the sequential scan from the DELETE.

What happened is that the first execution had to visit all the table blocks and killed the 799000 index tuples that pointed to dead tuples.
The second execution didn’t have to do that, which is the reason why it was ten times faster.

Note that the difference in blocks touched by the two queries is not as big as one might suspect, since each table block contains many dead tuples (there is a high correlation).

It is also worth noting that even though they are not shown in the EXPLAIN output, some index pages must have been dirtied in the process, which will cause some disk writes from this read-only query.

Conclusion

Next time you encounter mysterious query run-time differences that cannot be explained by caching effects, consider killed index tuples as the cause. This may be an indication to configure autovacuum to run more often on the affected table, since that will get rid of the dead tuples and speed up the first query execution.