CYBERTEC Logo

What PostgreSQL Full-Text-Search has to do with VACUUM

03.2018 / Category: / Tags: | |

What does PostgreSQL Full-Text-Search have to do with VACUUM? Many readers might actually be surprised that there might be a relevant connection worth talking about at all. However, those two topics are more closely related, than people might actually think. The reason is buried deep inside the code and many people might not be aware of those issues. Therefore I've decided to shed some light on this topic and explain, what is really going on here. The goal is to help end users to speed up their Full-Text-Indexing (FTI) and offer better performance to everybody making use of PostgreSQL.

Controlling VACUUM and autovacuum

Before digging into the real stuff, it is necessary to create some test data. For that purpose, I created a table. Note that I turned autovacuum off so that all operations are fully under my control. This makes it easier to demonstrate, what is going on in PostgreSQL.

In the next step we can create 2 million random texts. For the sake of simplicity, I did not import a real data set containing real texts but simply created a set of md5 hashes, which are absolutely good enough for the job:

Here is what our data looks like:

To make things more efficient, I decided to use the tsvector data type in the table directly. The advantage is that we can directly create a full text index (FTI) on the column:

In PostgreSQL, a GIN index is usually used to take care of “full text search” (FTS).

Finally we run VACUUM to create all those hint bits and make PostgreSQL calculate optimizer statistics.

How GIN indexes work in PostgreSQL

To understand what VACUUM and Full Text Search (FTS) have to do with each other, we first got to see, how GIN indexes actually work: A GIN index is basically a “normal tree” down to the word level. So you can just binary search to find a word easily. However: In contrast to a btree, GIN has a “posting tree” below the word level. So each word only shows up once in the index but points to a potentially large list of entries. For full text search this makes sense because the number of distinct words is limited in real life while a single word might actually show up thousands of times.

The following image shows, what a GIN index looks like:

Let us take a closer look at the posting tree itself: It has one entry for pointer to the underlying table. To make it efficient, the posting tree is sorted. The trouble now is: If you insert into the table, changing the GIN index for each row is pretty expensive. Modifying the posting tree does not come for free. Remember: You have to maintain the right order in your posting tree so changing things comes with some serious overhead.

Fortunately there is a solution to the problem: The “GIN pending list”. When a row is added, it does not go to the main index directly. But instead it is added to a “TODO” list, which is then processed by VACUUM. So after a row is inserted, the index is not really in its final state. What does that mean? It means that when you scan the index, you have to scan the tree AND sequentially read what is still in the pending list. In other words: If the pending list is long, this will have some impact on performance. In many cases it can therefore make sense to vacuum a table used to full text search more aggressively as usual. Remember: VACUUM will process all the entries in the pending list.

Measuring the performance impact of VACUUM

To see what is going on behind the scenes, install pgstattuple:

With pgstattuple you can take a look at the internals of the index:

In this case the pending list is empty. In addition to that the index is also pretty small:

Keep in mind: We had 2 million entries and the index is still close to nothing compared to the size of the table:

Let us run a simple query now. We are looking for a word, which does not exist. Note that the query needs ways less than 1 millisecond:

I would also like to point you to something else: “shared hit = 5”. The query only needed 5 blocks of data to run. This is really really good because even if the query has to go to disk, it will still return within a reasonable amount of time.

Let us add more data. Note that autovacuum is off so there are no hidden operations going on:

The same query, which performanced so nicely before, is now a lot slower:

PostgreSQL needs more than 9 milliseconds to run the query. The reason is that there are many pending tuples in the pending list. Also: The query had to access 331 pages in this case, which is A LOT more than before. The GIN pending list reveals the underlying problem:

5 pages + 326 pages = 331 pages. The pending list explains all the additional use of data pages instantly.

Running VACUUM to speed up Full-Text-Search (FTS) in PostgreSQL

Moving those pending entries to the real index is simple. We simply run VACUUM ANALYZE again:

As you can see the pending list is now empty:

The important part is that the query is also a lot slower again because the number of blocks has decreased again.

I think those examples show pretty conclusively that VACUUM does have a serious impact on the performance of your full text indexing. Of course this is only true if a significant part of your data is changed on a regular basis.

0 0 votes
Article Rating
Subscribe
Notify of
guest
5 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Sergey Ponomarev
Sergey Ponomarev
6 years ago

More than that, this pending list do some checkout from time to time and can cause huge IO spike on a single random insert. It's very hard to predict what vacuum rate is sufficient enough. But you have an option to deal with it – just disable the usage of this pending list with fastupdate=off option. https://www.postgresql.org/docs/current/static/gin-implementation.html

In our installation, we haven't noticed any INSERT performance impact but IO spikes has gone forever. The same would be with your SELECTs

Robins Tharakan
Robins Tharakan
6 years ago

Minor typo... Think that should be "faster" ...

"
Important part is that the query is also a lot slower again because the number of blocks has decreased again.
"

Michael Lennon
Michael Lennon
5 years ago

Yeah that had me scratching my head

Egor Rogov
Egor Rogov
6 years ago

And what about gin_pending_list_limit parameter?

ioguix
ioguix
6 years ago

Nice explanation.

There's another way to clean the pending list, you can use the "gin_clean_pending_list" function as well, which might be lighter than VACUUM under some circumstances.

Cheers,

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    5
    0
    Would love your thoughts, please comment.x
    ()
    x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram