Planner estimates have already been discussed on this blog in previous postings. A couple of years ago I stumbled upon an interesting issue which is commonly known as “cross correlation”.

Let us consider the following example:

```sql
test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE
test=# INSERT INTO t_test SELECT 0, 1 FROM generate_series(1, 100000);
INSERT 0 100000
test=# INSERT INTO t_test SELECT 1, 0 FROM generate_series(1, 100000);
INSERT 0 100000
test=# ANALYZE t_test;
ANALYZE
```

We add 100.000 rows containing 0 and 1 and then another 100.000 rows containing 1 and 0. Then optimizer statistics are built. So, all together we have 200.000 rows in the table:

```sql
test=# explain SELECT count(*) FROM t_test;
                            QUERY PLAN
--------------------------------------------------------------------
 Aggregate  (cost=3385.00..3385.01 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..2885.00 rows=200000 width=0)
 Planning time: 0.095 ms
(3 rows)
```

So far everything looks just fine. The planner has guessed the number of rows in the table precisely. The same applies to the following query:

```sql
test=# explain SELECT count(*) FROM t_test WHERE a = 0;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=3634.28..3634.29 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3385.00 rows=99713 width=0)
         Filter: (a = 0)
 Planning time: 0.052 ms
(4 rows)
```

99.713 is actually a pretty good estimate and it is totally sufficient to come up with a reasonable plan.

**cross correlation at work**

Let us try something else:

```sql
test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=4010.00..4010.01 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=50000 width=0)
         Filter: ((a = 0) AND (b = 0))
 Planning time: 0.068 ms
(4 rows)
```

Oops? What happened? The planner estimates that 50.000 rows match this condition. The reality is somewhat different:

```sql
test=# SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
 count
-------
     0
(1 row)
```

How did we end up with this terrible over-estimation?

Well, the reason is how PostgreSQL handles statistics. Internally, PostgreSQL stores statistics for every column separately. So, we know that a = 0 represents 50% of the table and that b = 0 also represents 50% of the table. From a mathematical point of view it might seem safe to just multiply those probabilities:

0.5 * 0.5 = 0.25

This is exactly what is going on here. Therefore the estimate is 25% of 200.000 rows = 50.000 rows. In our example this is dead wrong. The problem is that PostgreSQL does not have statistics about the combination of those columns. It does not know that when a = 1, then b will be 0.
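The independence assumption can be reproduced outside the database. The following Python sketch (an illustration, not PostgreSQL's actual code) simulates the table, multiplies the per-column selectivities the way the planner does, and compares the result with the true count:

```python
# Simulate t_test: 100,000 rows of (0, 1) plus 100,000 rows of (1, 0)
rows = [(0, 1)] * 100_000 + [(1, 0)] * 100_000

# Per-column selectivities -- PostgreSQL keeps one statistic per column
sel_a = sum(1 for a, b in rows if a == 0) / len(rows)   # 0.5
sel_b = sum(1 for a, b in rows if b == 0) / len(rows)   # 0.5

# Independence assumption: simply multiply the selectivities
estimate = sel_a * sel_b * len(rows)

# The real answer for a = 0 AND b = 0
actual = sum(1 for a, b in rows if a == 0 and b == 0)

print(estimate)  # 50000.0
print(actual)    # 0
```

The multiplication is perfectly sound for independent columns; it only goes wrong here because a and b are completely correlated.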

As you have seen, this can lead to over-estimation, but also to under-estimation:

```sql
test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 1;
                            QUERY PLAN
-------------------------------------------------------------------
 Aggregate  (cost=4009.28..4009.30 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=49714 width=0)
         Filter: ((a = 0) AND (b = 1))
 Planning time: 0.035 ms
(4 rows)
```

In this case the real number of rows returned by the system is 100.000 and not just 49.714.
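The same arithmetic explains the under-estimate: each per-column selectivity is roughly 0.5, so the planner again predicts about 25% of the table, while in reality every single row with a = 0 also has b = 1. A quick sketch of the numbers:

```python
total_rows = 200_000

# Per-column selectivities for a = 0 and b = 1 (each matches half the table)
sel_a, sel_b = 0.5, 0.5

# Planner-style estimate under the independence assumption
estimate = sel_a * sel_b * total_rows
print(estimate)           # 50000.0 -- close to the 49714 reported by EXPLAIN

# In reality all 100,000 rows with a = 0 also satisfy b = 1
actual = 100_000
print(actual / estimate)  # 2.0 -- the estimate is off by a factor of two
```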

Keep in mind that the examples I have created are pretty artificial. But things like that can happen in real life and cause issues along the way. And as always: if you get bad plans, expect bad performance.