What does PostgreSQL Full-Text-Search have to do with VACUUM? Many readers might actually be surprised that there might be a relevant connection worth talking about at all. However, those two topics are more closely related than people might actually think. The reason is buried deep inside the code and many people might not be aware of those issues. Therefore I decided to shade some light on the topic and explain, what is really going on here. The goal is to help end users to speed up their Full-Text-Indexing (FTI) and offer better performance to everybody making use of PostgreSQL.

Controlling VACUUM and autovacuum

Before digging into the real stuff it is necessary to create some test data. For that purpose I created a table. Note that I turned autovacuum off so that all operations are fully under my control. This makes it easier to demonstrate, what is going on in PostgreSQL.

test=# CREATE TABLE t_fti (payload tsvector) 
   WITH (autovacuum_enabled = off);
CREATE TABLE

In the next step we can create 2 million random texts. For the sake of simplicity I did not import a real data set containing real texts but simply created a set of md5 hashes, which are absolutely good enough for the job:

test=# INSERT INTO t_fti 
    SELECT to_tsvector('english', md5('dummy' || id)) 
    FROM generate_series(1, 2000000) AS id;
INSERT 0 2000000

Here is what our data looks like:

test=# SELECT to_tsvector('english', md5('dummy' || id)) 
   FROM generate_series(1, 5) AS id;
        to_tsvector 
--------------------------------------
'8c2753548775b4161e531c323ea24c08':1
'c0c40e7a94eea7e2c238b75273087710':1
'ffdc12d8d601ae40f258acf3d6e7e1fb':1
'abc5fc01b06bef661bbd671bde23aa39':1
'20b70cebcb94b1c9ba30d17ab542a6dc':1
(5 rows)

To make things more efficient, I decided to use the tsvector data type in the table directly. The advantage is that we can directly create a full text index (FTI) on the column:

test=# CREATE INDEX idx_fti 
         ON t_fti USING gin(payload);
CREATE INDEX

In PostgreSQL a GIN index is usually used to take care of “full text search” (FTS).

Finally we run VACUUM to create all those hint bits and make PostgreSQL calculate optimizer statistics.

test=# VACUUM ANALYZE ;
VACUUM

How GIN indexes work in PostgreSQL

To understand what VACUUM and Full Text Search (FTS) have to do with each other, we got to first see, how GIN indexes actually work: A GIN index is basically a “normal tree” down to the word level. So you can just binary search to find a word easily. However: In contrast to a btree, GIN has a “posting tree” below the word level. So each word only shows up once in the index but points to a potentially large list of entries. For full text search this makes sense because the number of distinct words is limited in real life while a single word might actually show up thousands of times.

The following image shows, what a GIN index looks like:

Let us take a closer look at the posting tree itself: It has one entry for pointer to the underlying table. To make it efficient the posting tree is sorted. The trouble now is: If you insert into the table, changing the GIN index for each row is pretty expensive. Modifying the posting tree does not come for free. Remember: You have to maintain the right order in your posting tree so changing things comes with some serious overhead.

Fortunately there is a solution to the problem: The “GIN pending list”. When a row is added, it does not go to the main index directly. But instead it is added to a “TODO” list, which is then processed by VACUUM. So after a row is inserted, the index is not really in its final state. What does that mean? It means that when you scan the index, you have to scan the tree AND sequentially read what is still in the pending list. In other words: If the pending list is long, this will have some impact on performance. In many cases it can therefore make sense to vacuum a table used to full text search more aggressively as usual. Remember: VACUUM will process all the entries in the pending list.

Measuring the performance impact of VACUUM

To see what is going on behind the screenes, install pgstattuple:

CREATE EXTENSION pgstattuple;

With pgstattuple you can take a look at the internals of the index:

test=# SELECT * FROM pgstatginindex('idx_fti');
version | pending_pages | pending_tuples
---------+---------------+----------------
2 | 0 | 0
(1 row)

In this case the pending list is empty. In addition to that the index is also pretty small:

test=# SELECT pg_relation_size('idx_gin');
pg_relation_size
------------------
188416
(1 row)

Keep in mind: We had 2 million entries and the index is still close to nothing compared to the size of the table:

test=# SELECT pg_relation_size('t_fti');
pg_relation_size
------------------
154329088
(1 row)

Let us run a simple query now. We are looking for a word, which does not exist. Note that the query needs ways less than 1 millisecond:

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
QUERY PLAN
--------------------------------------------------------------------
Bitmap Heap Scan on t_fti (cost=20.77..294.37 rows=67 width=45)
(actual time=0.030..0.030 rows=0 loops=1)
Recheck Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
-> Bitmap Index Scan on idx_fti (cost=0.00..20.75 rows=67 width=0)
(actual time=0.028..0.028 rows=0 loops=1)
Index Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
Planning time: 0.148 ms
Execution time: 0.066 ms

(8 rows)

I would also like to point you to something else: “shared hit = 5”. The query only needed 5 blocks of data to run. This is really really good because even if the query has to go to disk, it will still return within a reasonable amount of time.

Let us add more data. Note that autovacuum is off so there are no hidden operations going on:

test=# INSERT INTO t_fti
SELECT to_tsvector('english', md5('dummy' || id))
FROM generate_series(2000001, 3000000) AS id;
INSERT 0 1000000

The same query, which performanced so nicely before is now a lot slower:

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
QUERY PLAN
-----------------------------------------------------------------
Bitmap Heap Scan on t_fti (cost=1329.02..1737.43 rows=100 width=45)
(actual time=9.377..9.377 rows=0 loops=1)
Recheck Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=331
-> Bitmap Index Scan on idx_fti (cost=0.00..1329.00 rows=100 width=0)
(actual time=9.374..9.374 rows=0 loops=1)
Index Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=331
Planning time: 0.194 ms
Execution time: 9.420 ms
(8 rows)

PostgreSQL needs more than 9 milliseconds to run the query. The reason is that there are many pending tuples in the pending list. Also: The query had to access 331 pages in this case, which is A LOT more than before. The GIN pending list reveals the underlying problem:

test=# SELECT * FROM pgstatginindex('idx_fti');
version | pending_pages | pending_tuples
---------+---------------+----------------
2 | 326 | 50141
(1 row)

5 pages + 326 pages = 331 pages. The pending list explains all the additional use of data pages instantly.

Running VACUUM to speed up Full-Text-Search (FTS) in PostgreSQL

Moving those pending entries to the real index is simple. We simply run VACUUM ANALYZE again:

test=# VACUUM ANALYZE;
VACUUM

As you can see the pending list is now empty:

test=# SELECT * FROM pgstatginindex('idx_fti');
version | pending_pages | pending_tuples
---------+---------------+----------------
2 | 0 | 0
(1 row)

The important part is that the query is also a lot slower again because the number of blocks has decreased again.

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
QUERY PLAN
-----------------------------------------------------------------
Bitmap Heap Scan on t_fti (cost=25.03..433.43 rows=100 width=45)
(actual time=0.033..0.033 rows=0 loops=1)
Recheck Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
-> Bitmap Index Scan on idx_fti (cost=0.00..25.00 rows=100 width=0)
(actual time=0.030..0.030 rows=0 loops=1)
Index Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
Planning time: 0.240 ms
Execution time: 0.075 ms
(8 rows)

I think those examples show pretty conclusively that VACUUM does have a serious impact on the performance of your full text indexing. Of course this is only true if a significant part of your data is changed on a regular basis.