Our PostgreSQL blog about “Speeding up count(*)” was widely read and discussed by our followers on the internet. We also saw some people commenting on the post and suggesting using different means to speed up count(*). I want to specifically focus on one of those comments and to warn our readers.
max(id) – min(id) will return the wrong answer
As stated in our previous post, count(*) has to read through the entire table to provide you with a correct count. This can be quite expensive if you want to read through a large table. So why not use an autoincrement field along with max and min to speed up the process? Regardless of the size of the table.
Here is an example:
test=# CREATE TABLE t_demo (id serial, something text); CREATE TABLE test=# INSERT INTO t_demo (something) VALUES ('a'), ('b'), ('c') RETURNING *; id | something ----+----------- 1 | a 2 | b 3 | c (3 rows) INSERT 0 3 test=# SELECT max(id) - min(id) + 1 FROM t_demo; ?column? ---------- 3 (1 row)
This approach also comes with the ability to use indexes.
SELECT min(id), max(id) FROM t_demo
… can be indexed. PostgreSQL will look for the first and the last entry in the index and return both numbers really fast (if you have created an index of course – the serial column alone does NOT provide you with an index or a unique constraint).
Using sequences: What can possibly go wrong?
The question now is: What can possibly go wrong? Well … a lot actually. The core problem starts when transactions enter the picture. On might argue “We don’t use transactions”. Well, that is wrong. There is no such thing as “no transaction” if you are running SQL statements. In PostgreSQL every query is part of a transaction – you cannot escape.
That leaves us with a problem:
test=# BEGIN; BEGIN test=# INSERT INTO t_demo (something) VALUES ('d') RETURNING *; id | something ----+----------- 4 | d (1 row) INSERT 0 1 test=# ROLLBACK; ROLLBACK
So far so good …
Let us add a row and commit:
test=# INSERT INTO t_demo (something) VALUES ('e') RETURNING *; id | something ----+----------- 5 | e (1 row) INSERT 0 1
The important part is that a sequence DOES NOT assure that there are no gaps in the data – it simply ensures that the number provided will go up. Gaps are possible:
test=# SELECT * FROM t_demo; id | something ----+----------- 1 | a 2 | b 3 | c 5 | e (4 rows)
In my example “4” is missing which of course breaks our “count optimization”:
test=# SELECT max(id) - min(id) + 1 FROM t_demo; ?column? ---------- 5 (1 row)
What you should learn from this example is that you should NEVER use sequences to speed up count(*). To make it clear: In PostgreSQL count(*) has to count all the data – there is no way around it. You might be able to estimate the content of the table, but you won’t be able to calculate a precise row count with any of those tricks.