CYBERTEC Logo

PostgreSQL: More performance for LIKE and ILIKE statements

07.2020 / Category: / Tags: |

LIKE and ILIKE are two fundamental SQL features. People use those things all over the place in their application and therefore it makes sense to approach the topic from a performance point of view. What can PostgreSQL do to speed up those operations and what can be done in general to first understand the problem and secondly to achieve better PostgreSQL database performance.

Creating sample data

In this blog post you will learn mostly about Gist and GIN indexing. Both index type can handle LIKE as well as ILIKE. These index types are not equally efficient, so it makes sense to dig into the subject matter and figure out what is best when.

Before we get started I have created some sample data. To avoid searching the web for sample data I decided to generate some data. A simple md5 hash is more than sufficient to prove my point here.

Let us take a look at the data. What we got here are 50 million ids and their hashes. The following listing shows what the data looks like in general:

Running simple LIKE queries in PostgreSQL

Let us turn our attention to LIKE: The Following query selects a substring which exists in the data only once. Mind that the percent symbol is not just at the end but also at the beginning of the pattern:

On my iMac, the query takes 4.7 seconds to complete. In 90+% of all applications out there this is already way too long. The user experience is already going to suffer and there is a good chance that a long running query like that will already increase the load on your server quite substantially.

To see what is going on under the hood I decided to include the execution plan of the SQL statement:

Because of the size of the table the PostgreSQL query optimizer will go for a parallel query. That is basically a good thing because the execution time is cut in half. But: It also means that we are readily sacrificing two CPU cores to answer this query returning just a single row.

The reason for bad performance is that the table is actually quite large, and the database has to read it from the beginning to the end to process the request:

Reading 3.2 GB to fetch just a single is now to efficient at all.
So what can we do to solve this problem?

pg_trgm: Advanced indexing

Fortunately PostgreSQL offers a module which can do a lot of trickery in the area of pattern matching. The pg_trgm extension implements “trigrams” which is a way to help with fuzzy search. The extension is part of the PostgreSQL contrib package and should therefore be present on the vast majority of systems:

As you can see enabling the extension is easy. The natural question arising now is: What is a trigram? Let us take a look and see:

What you can observe is that a trigram is like a sliding 3 character window. All these tokens will show up in the index as you will see later on.

To index LIKE the pg_trgm module supports two PostgreSQL index types: Gist and GIN. Both options will be evaluated.

Gist: Deploying a trigram indexes with Gist

What many people do to speed up fuzzy searching in PostgreSQL is to use Gist indexes. Here is how this type of index can be deployed:

What you can already see is that the index needs quite some time to build. What is important to mention is that even higher maintenance_work_mem settings will NOT speed up the process. Even with 4 GB of maintenance_work_mem the process will take 40 minutes.

What is also noteworthy is that the index is really large:

Keep in mind the table is just 3.5 GB - the index is 2.5 times larges.

But, indexes will always make things faster, right? Well, actually no...

We have really “optimized” the query? Instead of 4.7 seconds PostgreSQL needs almost 2 minutes to do the job. Why is that the case? Let us take a look what the execution plan has to say:

The PostgreSQL optimizer has decided to go for a “Bitmap Index Scan”. So maybe a direct index scan is better?

Actually the query is still going to show horrible execution times.
In short: A Gist index might not be the right thing to use here. It takes ages to create, it is large, it is a lot slower than a sequential scan.

Using GIN indexes for pattern matching

Fortunately the pg_trgm extensions offers a second operator class to get the job done. GIN indexes are usually used to for PostgreSQL Full Text search (FTS). Let us see if we can win in case of LIKE and ILIKE as well? Before we do that we reset the current connection and drop the old index:

In the next step a new index is created:

On my machine it takes 11 minutes which is a lot but actually a lot faster than the " rel="noopener" target="_blank">Gist index creation. However, index creation only happens once so we should not worry too much in this case. What is usually more important (usually) is query execution time:

Wow, we can run the query in 75 milliseconds instead of 4.7 seconds respectively 1 minute and 45 seconds. This is a major leap forward. A small index for man - a giant leap for database performance.

As expected the query returns exactly one row:

What you have seen so far is that the GIN index has solved the problem. However, you might still need a second index here. GIN does not speed up the “=” operator. So if you are looking for a normal lookup you will need a second index as shown in the next example:

A btree is needed to speed up normal comparisons. A GIN index alone will not be sufficient for that:

Finally …

PostgreSQL offers truly powerful indexing strategies. There is a lot more to discover than just btree indexes. Gist and GIN have their strengths too. GIN is especially useful for all kinds of full text operations while Gist is ideal for geometric data (GIS).
If you want to find out more about GIN indexes check out my posting about the GIN posting list and VACUUM behavior. It will be very useful if you are looking for good GIN performance.

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram