Planner estimates have already been discussed on this blog in my previous posting, and in some earlier postings as well. A couple of years ago I stumbled across an interesting issue which is commonly known as “cross correlation”.
Let us consider the following example:
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
test=# INSERT INTO t_test SELECT 0, 1 FROM generate_series(1, 100000);
INSERT 0 100000
test=# INSERT INTO t_test SELECT 1, 0 FROM generate_series(1, 100000);
INSERT 0 100000
test=# ANALYZE t_test;
ANALYZE
We add 100,000 rows containing 0 and 1 and then another 100,000 rows containing 1 and 0. Then optimizer statistics are built. So, altogether we have 200,000 rows in the table:
test=# explain SELECT count(*) FROM t_test;
                             QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=3385.00..3385.01 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..2885.00 rows=200000 width=0)
 Planning time: 0.095 ms
(3 rows)
So far everything looks just fine. The planner has guessed the number of rows in the table precisely. The same applies to the following query:
test=# explain SELECT count(*) FROM t_test WHERE a = 0;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=3634.28..3634.29 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3385.00 rows=99713 width=0)
         Filter: (a = 0)
 Planning time: 0.052 ms
(4 rows)
99,713 is actually a pretty good estimate and it is totally sufficient to come up with a reasonable plan.
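For comparison, the exact number of rows satisfying a = 0 in our freshly created test table should be 100,000, which is easy to verify with a plain count:

test=# SELECT count(*) FROM t_test WHERE a = 0;
 count
--------
 100000
(1 row)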
cross correlation at work
Let us try something else:
test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=4010.00..4010.01 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=50000 width=0)
         Filter: ((a = 0) AND (b = 0))
 Planning time: 0.068 ms
(4 rows)
Oops, what happened? The planner estimates that 50,000 rows match this condition. The reality is somewhat different:
test=# SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
 count
-------
     0
(1 row)
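If you want to see the estimate and the reality side by side, EXPLAIN ANALYZE executes the query and prints both. Here is a rough sketch of what to expect (costs and timings are left out as "..." because they will differ on every system):

test=# explain analyze SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Aggregate  (cost=... rows=1 width=0) (actual ... rows=1 loops=1)
   ->  Seq Scan on t_test  (cost=... rows=50000 width=0) (actual ... rows=0 loops=1)
         Filter: ((a = 0) AND (b = 0))
         Rows Removed by Filter: 200000
 ...

The Seq Scan node is estimated to return 50,000 rows, while the actual part shows that not a single row survives the filter.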
How did we end up with this terrible over-estimation?
Well, the reason is how PostgreSQL handles statistics. Internally, PostgreSQL stores statistics for every column separately. So, we know that a = 0 represents 50% of the table and b = 0 also represents 50% of the table. If the two columns were independent, it would be perfectly safe to simply multiply those probabilities:
0.5 * 0.5 = 0.25
This is exactly what is going on here. Therefore the estimate is 25% of 200,000 rows = 50,000 rows. In our example this is dead wrong. The problem is that PostgreSQL does not have statistics about the combination of those columns. It does not know that whenever a = 1, b will be 0 (and vice versa).
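You can inspect those per-column statistics yourself through the pg_stats system view. The query below (written for this example) lists the most common values and their frequencies for each column of t_test; you should see 0 and 1 with frequencies of roughly 0.5 per column, but nothing describing the combination of a and b:

test=# SELECT attname, most_common_vals, most_common_freqs
       FROM pg_stats
       WHERE tablename = 't_test';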
As you have seen, this can lead to over-estimation, but also to under-estimation:
test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 1;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=4009.28..4009.30 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=49714 width=0)
         Filter: ((a = 0) AND (b = 1))
 Planning time: 0.035 ms
(4 rows)
In this case the real number of rows returned by the system is 100,000 and not just 49,714.
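Again, a plain count confirms it (the data generated above contains exactly 100,000 rows with a = 0 and b = 1):

test=# SELECT count(*) FROM t_test WHERE a = 0 AND b = 1;
 count
--------
 100000
(1 row)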
Keep in mind that the examples I have created are pretty artificial. But things like that can happen in real life and cause issues along the way. And as always: if you get bad plans, expect bad performance.