Recently we have received a couple of PostgreSQL support calls related to bad performance on various deployments. In many cases the reason for the slowness was that people assumed PostgreSQL automatically creates an index on BOTH sides of a foreign key relationship, which is not the case. By the way: this kind of behavior is not PostgreSQL specific. Oracle and many other database systems behave in exactly the same way, so this piece of advice is not just for PostgreSQL, but applies to many other database products out there.
Missing indexes and foreign keys
The typical scenario most people face is actually pretty simple: two tables and a foreign key between them:
test=# CREATE TABLE a ( a_id int PRIMARY KEY );
CREATE TABLE
test=# CREATE TABLE b ( b_id int,
                        a_id int REFERENCES a(a_id)
                            ON UPDATE CASCADE
                            ON DELETE CASCADE );
CREATE TABLE
To keep the example simple, the tables in our PostgreSQL database contain only the most basic information needed to make this work.
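At this point it can be instructive to confirm that no index exists on the referencing side. The following catalog query (a small sketch against the standard pg_indexes view) lists all indexes on our two tables; at this stage it will only return the implicitly created primary key index on "a", and nothing at all for "b":

test=# SELECT tablename, indexname
       FROM pg_indexes
       WHERE tablename IN ('a', 'b');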
Then some data can be added:
test=# INSERT INTO a SELECT x FROM generate_series(1, 5000000) AS x;
INSERT 0 5000000
test=# INSERT INTO b SELECT x, x FROM generate_series(1, 5000000) AS x;
INSERT 0 5000000
Five million records should be enough to show how bad things are if indexes are missing. Of course the effect will be larger if you add more data.
To rebuild the optimizer statistics, a simple ANALYZE can be used:
test=# ANALYZE;
ANALYZE
Suffering from missing indexes
The trouble with missing indexes in any database is that simple operations become very expensive and destroy performance in a quite reliable way.
Here is what happens:
test=# \timing
Timing is on.
test=# explain analyze DELETE FROM a WHERE a_id = 10;
                                        QUERY PLAN
-----------------------------------------------------------------------------------
 Delete on a  (cost=0.43..8.45 rows=1 width=6) (actual time=0.263..0.263 rows=0 loops=1)
   ->  Index Scan using a_pkey on a  (cost=0.43..8.45 rows=1 width=6) (actual time=0.245..0.246 rows=1 loops=1)
         Index Cond: (a_id = 10)
 Planning time: 5.350 ms
 Trigger for constraint b_a_id_fkey: time=301.526 calls=1
 Execution time: 301.811 ms
(6 rows)

Time: 307.409 ms
As you can see, PostgreSQL uses an index scan on "a" to find the row. BUT: keep in mind that our constraint is defined as "ON UPDATE CASCADE ON DELETE CASCADE", which means that deleting a single row also triggers the deletion of all rows referencing it. Behind the scenes PostgreSQL has to read all 5 million entries in "b" to find the right rows, because there is no index on b.a_id. That is why the operation takes more than 300 ms, which is a total disaster.
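Where does the time go? For every deleted row, the foreign key trigger essentially has to perform a lookup equivalent to the following query. Running EXPLAIN on it (a quick sketch; the exact costs will vary on your system) reveals a sequential scan over all of "b", because there is no index on b.a_id:

test=# EXPLAIN SELECT * FROM b WHERE a_id = 10;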
Deploying missing indexes
Deploying the missing index will be a complete game changer:
test=# CREATE INDEX idx_b ON b (a_id);
CREATE INDEX
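A side note for production systems: a plain CREATE INDEX locks out writes on "b" while the index is being built. On a busy table we could have used the CONCURRENTLY variant instead (a standard PostgreSQL feature, at the cost of a slower build), which avoids blocking concurrent changes:

test=# CREATE INDEX CONCURRENTLY idx_b ON b (a_id);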
The very same operation is now thousands of times faster than before, because all it takes now is two index scans (one on "a" and one on "b"):
test=# explain analyze DELETE FROM a WHERE a_id = 11;
                                        QUERY PLAN
-----------------------------------------------------------------------
 Delete on a  (cost=0.43..8.45 rows=1 width=6) (actual time=0.037..0.037 rows=0 loops=1)
   ->  Index Scan using a_pkey on a  (cost=0.43..8.45 rows=1 width=6) (actual time=0.037..0.037 rows=0 loops=1)
         Index Cond: (a_id = 11)
 Planning time: 0.062 ms
 Execution time: 0.054 ms
(5 rows)

Time: 0.314 ms
As you can see, the runtime has been reduced dramatically, down to a fraction of a millisecond.
Performance hint
If you happen to use foreign keys (and most people do), it definitely makes sense to check for missing indexes, because otherwise cleanups might simply take too long. Consider the following scenario: suppose you wanted to delete 1 million rows without the index. You would have to read 5 million rows 1 million times. Clearly, this strategy leads to enormous performance problems.
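To hunt down such cases systematically, you can query the system catalogs. The following sketch (assuming single-column foreign keys; multi-column constraints need a slightly more elaborate check) lists every foreign key constraint whose referencing column is not the leading column of any index:

SELECT c.conrelid::regclass AS referencing_table,
       c.conname            AS fk_constraint
FROM   pg_constraint c
WHERE  c.contype = 'f'       -- foreign key constraints only
AND    NOT EXISTS (
           SELECT 1
           FROM   pg_index i
           WHERE  i.indrelid = c.conrelid
             AND  i.indkey[0] = c.conkey[1]  -- indkey is 0-based, conkey is 1-based
       );

Every row returned by this query is a candidate for a CREATE INDEX statement like the one shown above.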