LIKE and ILIKE are two fundamental SQL features. People use those things all over the place in their application and therefore it makes sense to approach the topic from a performance point of view. What can PostgreSQL do to speed up those operations and what can be done in general to first understand the problem and secondly to achieve better PostgreSQL database performance.

Creating sample data

In this blog post you will learn mostly about Gist and GIN indexing. Both index type can handle LIKE as well as ILIKE. These index types are not equally efficient, so it makes sense to dig into the subject matter and figure out what is best when.

Before we get started I have created some sample data. To avoid searching the web for sample data I decided to generate some data. A simple md5 hash is more than sufficient to prove my point here.

test=# CREATE TABLE t_hash AS SELECT id, md5(id::text)
FROM generate_series(1, 50000000) AS id;
SELECT 50000000
test=# VACUUM ANALYZE;
VACUUM

Let us take a look at the data. What we got here are 50 million ids and their hashes. The following listing shows what the data looks like in general:

test=# SELECT * FROM t_hash LIMIT 10;
 id | md5
----+----------------------------------
  1 | c4ca4238a0b923820dcc509a6f75849b
  2 | c81e728d9d4c2f636f067f89cc14862c
  3 | eccbc87e4b5ce2fe28308fd9f2a7baf3
  4 | a87ff679a2f3e71d9181a67b7542122c
  5 | e4da3b7fbbce2345d7772b0674a318d5
  6 | 1679091c5a880faf6fb5e6087eb1b2dc  
  7 | 8f14e45fceea167a5a36dedd4bea2543
  8 | c9f0f895fb98ab9159f51fd0297e236d
  9 | 45c48cce2e2d7fbdea1afc51c7c6ad26
 10 | d3d9446802a44259755d38e6d163e820
(10 rows)

Running simple LIKE queries in PostgreSQL

Let us turn our attention to LIKE: The Following query selects a substring which exists in the data only once. Mind that the percent symbol is not just at the end but also at the beginning of the pattern:

test=# \timing
Timing is on.
test=# SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
       id | md5
----------+----------------------------------
 37211731 | dadb4b54e2345679a8861ab52e4128ea
(1 row)
Time: 4767.415 ms (00:04.767)

 

On my iMac, the query takes 4.7 seconds to complete. In 90+% of all applications out there this is already way too long. The user experience is already going to suffer and there is a good chance that a long running query like that will already increase the load on your server quite substantially.

To see what is going on under the hood I decided to include the execution plan of the SQL statement:

test=# explain SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
                                  QUERY PLAN
------------------------------------------------------------------------------
 Gather (cost=1000.00..678583.88 rows=5000 width=37)
   Workers Planned: 2
   -> Parallel Seq Scan on t_hash (cost=0.00..677083.88 rows=2083 width=37)
      Filter: (md5 ~~ '%e2345679a%'::text)
(4 rows)
Time: 11.531 ms

 

Because of the size of the table the PostgreSQL query optimizer will go for a parallel query. That is basically a good thing because the execution time is cut in half. But: It also means that we are readily sacrificing two CPU cores to answer this query returning just a single row.

The reason for bad performance is that the table is actually quite large, and the database has to read it from the beginning to the end to process the request:

test=# \dt+
List of relations
 Schema |  Name  |  Type | Owner |    Size | Description
--------+--------+-------+-------+---------+-------------
 public | t_hash | table |    hs | 3256 MB |
(1 row)

Reading 3.2 GB to fetch just a single is now to efficient at all.
So what can we do to solve this problem?

pg_trgm: Advanced indexing

Fortunately PostgreSQL offers a module which can do a lot of trickery in the area of pattern matching. The pg_trgm extension implements “trigrams” which is a way to help with fuzzy search. The extension is part of the PostgreSQL contrib package and should therefore be present on the vast majority of systems:

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION
Time: 77.216 ms

As you can see enabling the extension is easy. The natural question arising now is: What is a trigram? Let us take a look and see:

test=# SELECT show_trgm('dadb4b54e2345679a8861ab52e4128ea');
                                                              show_trgm
---------------------------------------------------------------------------------------------------------------------------------------------
 {" d"," da",128,1ab,234,28e,2e4,345,412,456,4b5,4e2,52e,54e,567,61a,679,79a,861,886,8ea,9a8,a88,ab5,adb,b4b,b52,b54,dad,db4,e23,e41,"ea "}
(1 row)

What you can observe is that a trigram is like a sliding 3 character window. All these tokens will show up in the index as you will see later on.

To index LIKE the pg_trgm module supports two PostgreSQL index types: Gist and GIN. Both options will be evaluated.

Gist: Deploying a trigram indexes with Gist

What many people do to speed up fuzzy searching in PostgreSQL is to use Gist indexes. Here is how this type of index can be deployed:

test=# CREATE INDEX idx_gist ON t_hash USING gist (md5 gist_trgm_ops);
CREATE INDEX
Time: 2383678.930 ms (39:43.679)

What you can already see is that the index needs quite some time to build. What is important to mention is that even higher maintenance_work_mem settings will NOT speed up the process. Even with 4 GB of maintenance_work_mem the process will take 40 minutes.

What is also noteworthy is that the index is really large:

test=# \di+
List of relations
 Schema |   Name   |  Type | Owner |  Table |   Size  | Description
--------+----------+-------+-------+--------+---------+-------------
 public | idx_gist | index |    hs | t_hash | 8782 MB |
(1 row)

Keep in mind the table is just 3.5 GB – the index is 2.5 times larges.

But, indexes will always make things faster, right? Well, actually no …

test=# SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
       id | md5
----------+----------------------------------
 37211731 | dadb4b54e2345679a8861ab52e4128ea
(1 row)
Time: 105506.064 ms (01:45.506)

 

We have really “optimized” the query? Instead of 4.7 seconds PostgreSQL needs almost 2 minutes to do the job. Why is that the case? Let us take a look what the execution plan has to say:

test=# explain SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
                                  QUERY PLAN
----------------------------------------------------------------------------
 Bitmap Heap Scan on t_hash (cost=495.30..18812.90 rows=5000 width=37)
    Recheck Cond: (md5 ~~ '%e2345679a%'::text)
    -> Bitmap Index Scan on idx_gist (cost=0.00..494.05 rows=5000 width=0)
       Index Cond: (md5 ~~ '%e2345679a%'::text)
(4 rows)
Time: 13.433 ms

 

The PostgreSQL optimizer has decided to go for a “Bitmap Index Scan”. So maybe a direct index scan is better?

test=# SET enable_bitmapscan TO off;
SET
Time: 11.302 ms
test=# explain analyze SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
                                  QUERY PLAN
-----------------------------------------------------------------------------
Index Scan using idx_gist on t_hash (cost=0.55..20428.04 rows=5000 width=37) 
   (actual time=13750.850..99070.510 rows=1 loops=1)
   Index Cond: (md5 ~~ '%e2345679a%'::text)
 Planning Time: 0.074 ms
 Execution Time: 99070.618 ms
(4 rows)
Time: 99072.657 ms (01:39.073)

Actually the query is still going to show horrible execution times.
In short: A Gist index might not be the right thing to use here. It takes ages to create, it is large, it is a lot slower than a sequential scan.

Using GIN indexes for pattern matching

Fortunately the pg_trgm extensions offers a second operator class to get the job done. GIN indexes are usually used to for PostgreSQL Full Text search (FTS). Let us see if we can win in case of LIKE and ILIKE as well? Before we do that we reset the current connection and drop the old index:

test=# DISCARD ALL;
DISCARD ALL
Time: 12.000 ms
test=# DROP INDEX idx_gist;
DROP INDEX
Time: 3123.336 ms (00:03.123)

In the next step a new index is created:

test=# CREATE INDEX idx_gin ON t_hash USING gin (md5 gin_trgm_ops);
CREATE INDEX
Time: 698063.038 ms (11:38.063)

On my machine it takes 11 minutes which is a lot but actually a lot faster than the Gist index creation. However, index creation only happens once so we should not worry too much in this case. What is usually more important (usually) is query execution time:

test=# explain analyze SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
                                       QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on t_hash (cost=2270.75..20588.36 rows=5000 width=37) 
    (actual time=74.592..74.593 rows=1 loops=1)
   Recheck Cond: (md5 ~~ '%e2345679a%'::text)
   Heap Blocks: exact=1
   -> Bitmap Index Scan on idx_gin (cost=0.00..2269.50 rows=5000 width=0) 
       (actual time=74.584..74.584 rows=1 loops=1)
      Index Cond: (md5 ~~ '%e2345679a%'::text)
Planning Time: 0.066 ms
Execution Time: 74.665 ms
(7 rows)

Time: 75.031 ms

Wow, we can run the query in 75 milliseconds instead of 4.7 seconds respectively 1 minute and 45 seconds. This is a major leap forward. A small index for man – a giant leap for database performance.

As expected the query returns exactly one row:

test=# SELECT * FROM t_hash WHERE md5 LIKE '%e2345679a%';
       id | md5
----------+----------------------------------
 37211731 | dadb4b54e2345679a8861ab52e4128ea
(1 row)

Time: 74.487 ms

What you have seen so far is that the GIN index has solved the problem. However, you might still need a second index here. GIN does not speed up the “=” operator. So if you are looking for a normal lookup you will need a second index as shown in the next example:

test=# CREATE INDEX idx_btree ON t_hash (md5);
CREATE INDEX
Time: 274778.776 ms (04:34.779)
test=# \di+
List of relations
 Schema |    Name   |  Type | Owner |  Table |    Size | Description
--------+-----------+-------+-------+--------+---------+-------------
 public | idx_btree | index |    hs | t_hash | 2816 MB | 
 public |  idx_gist | index |    hs | t_hash | 2807 MB |
(2 rows)

A btree is needed to speed up normal comparisons. A GIN index alone will not be sufficient for that:

test=# SELECT * FROM t_hash WHERE md5 = 'dadb4b54e2345679a8861ab52e4128ea';
       id | md5
----------+----------------------------------
 37211731 | dadb4b54e2345679a8861ab52e4128ea
(1 row)

Time: 0.379 ms

Finally …

PostgreSQL offers truly powerful indexing strategies. There is a lot more to discover than just btree indexes. Gist and GIN have their strengths too. GIN is especially useful for all kinds of full text operations while Gist is ideal for geometric data (GIS).
If you want to find out more about GIN indexes check out my posting about the GIN posting list and VACUUM behavior. It will be very useful if you are looking for good GIN performance.