The pre-sales engineer is still going strong: he is telling the prospective customer that a hash index is perfect for hash tags.
© Laurenz Albe 2023

 

Among the many index types in PostgreSQL, the hash index is the most widely ignored. This came home to me when somebody asked me a question about hash index performance recently. High time to explore that little-known corner of PostgreSQL and run some benchmarks!

The history of the hash index

PostgreSQL has had hash indexes since the dawn of time (I looked at the source of version 4.2), but they were not crash safe before v10. Consequently, you could not really use them in older versions. For that very reason, hash indexes have received little attention by the PostgreSQL developers. But even after they became first-class citizens, they have led a niche existence. Time to call them on stage and let them show what they can do.

Hash index implementation details

You can find a detailed description in the hash README file.

A hash index has two or more bucket pages that store the result of a system-defined hash function of the indexed values. Whenever the index exceeds a certain average number of entries per page that depends on the fillfactor, PostgreSQL doubles the number of bucket pages by splitting each of them. For large indexes, PostgreSQL performs this doubling in four batches to spread the work across several DML operations. Consequently, an INSERT into a table with a hash index will occasionally be unpleasantly slow, similar to the effect of cleaning up the pending list in a GIN index. If a hash does not fit in the bucket page where it belongs (but it is not yet time to double the index), PostgreSQL puts it into an overflow pages that it creates for that purpose. This should not happen frequently, unless some of the the indexed values occur very often.

Potential use cases for a hash index

Hash indexes can only have a single column, they only support equality search, and they cannot enforce uniqueness. So they really cannot do anything that B-tree indexes cannot do just as well, with one exception: while the length limit for entries in a B-tree index is a third of the page size, you can use a hash index for values of arbitrary size, because only the hash value is stored in the index. However, this should be an exotic case. How often do you need a query like the following to be efficient:

SELECT id, name
FROM person
WHERE mugshot = '\xffd8ffdb...';

The documentation hints that hash indexes might be better than B-tree indexes in some cases:

Hash indexes are best optimized for SELECT and UPDATE-heavy workloads that use equality scans on larger tables. In a B-tree index, searches must descend through the tree until the leaf page is found. In tables with millions of rows, this descent can increase access time to data. The equivalent of a leaf page in a hash index is referred to as a bucket page. In contrast, a hash index allows accessing the bucket pages directly, thereby potentially reducing index access time in larger tables. This reduction in “logical I/O” becomes even more pronounced on indexes/data larger than shared_buffers/RAM. […]

As a result of the overflow cases, we can say that hash indexes are most suitable for unique, nearly unique data or data with a low number of rows per hash bucket.

A test bed for a hash index performance

We’ll use a function and a table like this:

CREATE FUNCTION mkbig(integer) RETURNS text
   LANGUAGE sql AS
'SELECT repeat(md5($1::text), 20)';

CREATE TABLE hashtest (
   c1 bigint,
   c2 bigint,
   c3 text,
   c4 text
);

Then we create a hash or a b-tree index on one of the columns. After that, we’ll load 10 million rows and see how long that takes:

\timing on

INSERT INTO hashtest
SELECT i, i / 10000,
       mkbig(i), mkbig(i / 10000)
FROM generate_series(1, 10000000) AS g(i);

With the text columns, the idea is to insert values that are not small, but also not big enough to trigger PostgreSQL’s TOAST mechanism. The second and the fourth column contain values that repeat 10000 times, so that we can see how well hash indexes can cope with that. If anywhere, I’d expect hash indexes to shine on the third column, which contains big, but unique values.

Once the indexes are created, I’ll set hint bits and gather statistics, in the hope to get good execution plans and repeatable measurements:

VACUUM (ANALYZE) hashtest;

After that, I will perform 100000 index scans in a DO statement:

\timing on

DO
$$DECLARE
   i bigint;
   r text;
BEGIN
   FOR i IN 1..100000 LOOP
      SELECT * INTO r
      FROM hashtest
      -- for c3 and c4, we will use mkbig(i / 100)
      WHERE c1 = i / 100;
   END LOOP;
END;$$;

Benchmark results

The benchmark was run on my laptop with 8 Intel Core i7 CPU cores and an NVMe hard drive running Linux 6.6.6 and PostgreSQL 16.1 with the standard configuration. To get the net time of an index modification or an index scan, I subtracted the time of the INSERT when run without an index and divided the INSERT overhead and the index scan time by the number of iterations. All measurements were taken three times, and the median value was selected. (I am aware that that doesn’t make my results statistically sound, but I got bored.)

hash index vs. b-tree index performance
average index modification timeaverage index scan timeindex size
hash index on c1 (unique bigint)11.25 μs5.06 μs324 MB
b-tree index on c1 (unique bigint)1.31 μs5.46 μs214 MB
hash index on c2 (repeating bigint)3.85 μs564.95 μs444 MB
b-tree index on c2 (repeating bigint)0.93 μs10.32 μs63 MB
hash index on c3 (unique 640-character text)11.67 μs8.68 μs325 MB
b-tree index on c3 (unique 640-character text)54.63 μs15.61 μs964 MB
hash index on c4 (repeating 640-character text)4.69 μs572.44 μs444 MB
b-tree index on c4 (repeating 640-character text)13.06 μs532.55 μs71 MB

Discussion of the benchmark results

With the bigint columns, a hash index is much slower than a b-tree index when inserting data. With repeated values (c2), the hash index is also much slower when querying the table. Only the select performance for the unique bigint column can compete.

When it comes to the long text column, hash indexes beat the pants off b-tree indexes, particularly during the insert. That can easily be explained by the fact that it is easier to index a 4-byte hash value than a 644-byte string (including the varlena header). The hash index performance of queries is also twice as good for unique strings. The reason is that with large index entries, each b-tree page can only fit a few index entries, and the index tree becomes very narrow and deep. PostgreSQL must traverse many intermediate b-tree pages to get to the leaf nodes. The reason why b-tree indexes can compete in the case where each entry has 10000 repetitions is twofold:

  • during a range scan, PostgreSQL can stay at the leaf page level and does not have to repeatedly traverse the deep b-tree
  • with repeated entries, PostgreSQL can benefit from index key deduplication, a feature introduced in v13 — that reduces the size of the index considerably and consequently speeds up scanning the index

Conclusion

If you got the impressions that hash indexes are better than b-tree indexes, don’t forget that I constructed an exotic test case that is specifically designed to work well with hash indexes. In common cases, b-tree indexes outperform hash indexes. This is not necessarily because hash indexes are fundamentally worse than b-tree indexes: if they had received as much love as b-tree indexes, things might look different. I see no reason why deduplication or bottom-up index tuple deletion (to name two recent b-tree index performance features) should not work for hash indexes as well.

At the current state of affairs, you can safely ignore hash indexes. If you ever get a crazy requirement like fast equality search for free-form long user comments, you might reconsider. Or you create a b-tree index on hashtext(comment) and search for that.