Wikipedia is one of the most relevant collections of text data on the Internet today. It consists of countless entries covering pretty much every topic conceivable, which makes it an interesting corpus to play with. To prepare for my talk at the PostgreSQL conference in Vienna, we decided to import Wikipedia into PostgreSQL. The goal was to find out which settings are most suitable for indexing Wikipedia, using the creation of a GIN index as the benchmark.
The structure of the table is pretty simple:
```
test=# \d wiki_articles
   Table "public.wiki_articles"
  Column  |  Type   | Modifiers
----------+---------+-----------
 page_id  | integer | not null
 title    | text    |
 content  | text    |
 filtered | text    |
Indexes:
    "wiki_articles_pkey" PRIMARY KEY, btree (page_id)
Tablespace: "myspace"
```
Overall, the size of the table adds up to around 27 GB. We store the title, the raw content, as well as a filtered version of the content, which no longer contains any markup.
The size of the filtered column is as follows:
```
test=# \timing
Timing is on.
test=# SELECT sum(length(filtered)) FROM wiki_articles;
     sum
-------------
 15326058215
(1 row)

Time: 181819.396 ms
```
Wikipedia statistics
Given the very special nature of this corpus, it seems interesting to inspect the distribution of values inside those texts. PostgreSQL has some simple means to gather this information:
```
test=# CREATE TABLE wiki_stats AS
           SELECT *
           FROM   ts_stat('SELECT to_tsvector(''english'', filtered)
                           FROM wiki_articles');
Time: 3925449.024 ms
```
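The resulting wiki_stats table can then be inspected directly. ts_stat returns one row per lexeme along with ndoc (the number of documents containing it) and nentry (the total number of occurrences), so a look at the most frequent lexemes is just one query away (the exact output of course depends on the Wikipedia dump used):

```sql
-- top 10 lexemes by total number of occurrences
SELECT word, ndoc, nentry
FROM   wiki_stats
ORDER  BY nentry DESC
LIMIT  10;
```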
While this query executes, a fully busy CPU core consumes only about 4-7 MB/sec from disk, which is far lower than what the test hardware can provide. The operation is surprisingly expensive.
Here is what vmstat says:
```
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free  buff    cache   si   so    bi    bo   in   cs us sy id wa st
 2  0  51012  88076   992 15708304    0    0  3896     0 1154  489 12  0 87  0  0
 1  0  51012  88324   992 15706008    0    0  4024     0 1155  512 12  0 87  0  0
 1  0  51012  94276   992 15696848    0    0  3964     0 1158  488 12  0 87  0  0
 2  0  51012  96260   992 15692184    0    0  3892     0 1161  508 12  0 87  0  0
 1  0  51012  95388   992 15689188    0    0  4104    10 1187  566 12  0 87  0  0
 1  0  51004  96344   992 15686012    8    0  4296     0 1190  660 12  0 87  0  0
```
Let's do some speed tests now. The first observation is that indexing on a function is incredibly slow:
```
test=# SET maintenance_work_mem TO '1 MB';
SET
Time: 0.353 ms
test=# CREATE INDEX idx_gin ON wiki_articles
           USING gin (to_tsvector('english', filtered));
CREATE INDEX
Time: 7411870.102 ms
```
After materializing the tsvector column, things are a lot faster. Calculating the stemmed representation of the input is a major speed issue:
```
test=# CREATE TABLE wiki_ts AS
           SELECT page_id,
                  to_tsvector('english', title) AS title,
                  to_tsvector('english', filtered)
           FROM   wiki_articles;

test=# SET maintenance_work_mem TO '1 MB';
SET
Time: 0.326 ms

test=# \d wiki_ts
       Table "public.wiki_ts"
   Column    |   Type   | Modifiers
-------------+----------+-----------
 page_id     | integer  |
 title       | tsvector |
 to_tsvector | tsvector |
Tablespace: "myspace"

test=# CREATE INDEX idx_ts_to ON wiki_ts USING gin (to_tsvector);
CREATE INDEX
Time: 4478601.021 ms
```
Increasing maintenance_work_mem certainly helps in the case of GIN.
NOTE: This is not necessarily the case if you are using btree indexes: http://www.cybertec-postgresql.com/adjusting-maintenance_work_mem/
```
maintenance_work_mem | index creation time
---------------------+--------------------
 1 MB                | 4478 seconds
 16 MB               | 2307 seconds
 64 MB               | 1410 seconds
 256 MB              | 1173 seconds
 1024 MB             | 1171 seconds
 4096 MB             | 1160 seconds
 12 GB               | 1160 seconds
```
The build process speeds up dramatically with more memory. However, it is interesting to see that the curve flattens out pretty soon. At 256 MB we already see pretty much the best we can expect. More memory does not improve the situation anymore.
It is interesting to note that we are not I/O bound but CPU bound. A better I/O system would not speed things up any further.
```
test=# SELECT pg_size_pretty(pg_relation_size('idx_ts_to'));
 pg_size_pretty
----------------
 2993 MB
(1 row)

Time: 0.524 ms
```
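With the index in place, full-text searches on the materialized column can make use of it. A sketch of such a query, with 'vienna' as a stand-in search term:

```sql
-- the GIN index on the to_tsvector column can serve this search
SELECT page_id
FROM   wiki_ts
WHERE  to_tsvector @@ to_tsquery('english', 'vienna');
```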
If you really want to track down slow queries, massive I/O and lousy performance, there is no way around the pg_stat_statements extension.
However, the pg_stat_statements system view is full of information, and many people get lost in it. It can therefore make sense to come up with a clever query that provides administrators with the truly relevant information. Without finding the really relevant information, tuning is somewhat pointless.
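Before any of this works, the extension has to be in place: pg_stat_statements must be preloaded at server start and enabled in the database you want to inspect. A minimal setup looks like this:

```sql
-- postgresql.conf (requires a server restart):
--   shared_preload_libraries = 'pg_stat_statements'

-- then, in the database you want to inspect:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
```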
Here is my personal favorite query to track down slow queries:
```
test=# SELECT substring(query, 1, 50) AS short_query,
              round(total_exec_time::numeric, 2) AS total_exec_time,
              calls,
              round(mean_exec_time::numeric, 2) AS mean,
              round((100 * total_exec_time /
                     sum(total_exec_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
       FROM   pg_stat_statements
       ORDER  BY total_exec_time DESC
       LIMIT  20;
```
The output contains a shortened version of the query (which can be handy if you are working on a text terminal, as I do). Then there is the total_exec_time of the query, along with the number of calls and the mean execution time.
Personally, I have found it useful to calculate an overall percentage for each query. It gives me a feeling for how much there is to gain if I manage to optimize a certain query. The percentage tells me about relevance, because it is pretty pointless to work on queries that only account for 0.5% of the total time:
```
                    short_query                     | total_exec_time | calls  |  mean  | percentage_cpu
----------------------------------------------------+-----------------+--------+--------+----------------
 UPDATE pgbench_tellers SET tbalance = tbalance + ? |       585005.76 | 143881 |   4.07 |          54.03
 UPDATE pgbench_branches SET bbalance = bbalance +  |       481968.27 | 143881 |   3.35 |          44.51
 UPDATE pgbench_accounts SET abalance = abalance +  |         9801.43 | 143881 |   0.07 |           0.91
 SELECT abalance FROM pgbench_accounts WHERE aid =  |         2928.61 | 143881 |   0.02 |           0.27
 INSERT INTO pgbench_history (tid, bid, aid, delta, |         2236.59 | 143881 |   0.02 |           0.21
 copy pgbench_accounts from stdin                   |          371.97 |      1 | 371.97 |           0.03
 vacuum analyze pgbench_accounts                    |          128.22 |      1 | 128.22 |           0.01
 BEGIN;                                             |          108.18 | 143881 |   0.00 |           0.01
 END;                                               |           88.91 | 143881 |   0.00 |           0.01
 alter table pgbench_accounts add primary key (aid) |           74.88 |      1 |  74.88 |           0.01
 vacuum analyze pgbench_branches                    |           14.99 |      1 |  14.99 |           0.00
 vacuum pgbench_branches                            |           11.57 |      1 |  11.57 |           0.00
 create table pgbench_history(tid int,bid int,aid   |            4.36 |      1 |   4.36 |           0.00
 vacuum analyze pgbench_tellers                     |            2.28 |      1 |   2.28 |           0.00
 alter table pgbench_branches add primary key (bid) |            2.03 |      1 |   2.03 |           0.00
 SELECT substring(query, ?, ?) AS short_query,      |            1.93 |      2 |   0.96 |           0.00
 SELECT substring(query, ?, ?) AS short_query,      |            1.82 |      1 |   1.82 |           0.00
 SELECT a.attname,                                 +|            1.39 |      2 |   0.69 |           0.00
   pg_catalog.format_type(a.attty                   |                 |        |        |
 alter table pgbench_tellers add primary key (tid)  |            1.07 |      1 |   1.07 |           0.00
 vacuum analyze pgbench_history                     |            1.02 |      1 |   1.02 |           0.00
(20 rows)
```
Working top down is usually a good idea.
Of course, everybody will have their own ideas about how to approach the problem, and the information provided by the query is not sufficient to fully optimize a system. However, I have found it useful for gaining a quick overview of what is going on.
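Once a round of optimization is done, it can also help to start measuring from scratch. pg_stat_statements provides a function to wipe the statistics gathered so far (appropriate privileges are required):

```sql
-- discard all statistics collected by pg_stat_statements
SELECT pg_stat_statements_reset();
```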
See also my blog about speeding up slow queries.