Planner estimates have already been discussed on this blog in my previous posting and also in some posting before that. A couple of years ago I stumbled over an interesting issue which is commonly known as “cross correlation”.

Let us consider the following example:

test=# CREATE TABLE t_test (a int, b int);

CREATE TABLE

test=# INSERT INTO t_test SELECT 0, 1

                        FROM generate_series(1, 100000);

INSERT 0 100000

test=# INSERT INTO t_test SELECT 1, 0

                        FROM generate_series(1, 100000);

INSERT 0 100000

test=# ANALYZE t_test;

ANALYZE

We add 100.000 rows containing 0 and 1 and then add 100.000 rows containing 1 and 0. Then optimizer statistics are built. So, all together we have 200.000 files in the table:

test=# explain SELECT count(*) FROM t_test;

                             QUERY PLAN                            

--------------------------------------------------------------------

 Aggregate  (cost=3385.00..3385.01 rows=1 width=0)

   ->  Seq Scan on t_test  (cost=0.00..2885.00 rows=200000 width=0)

 Planning time: 0.095 ms

(3 rows)

So far everything looks just fine. The planner has guessed the number of rows in the table precisely. The same applies to the following query:

test=# explain SELECT count(*) FROM t_test WHERE a = 0;

                            QUERY PLAN                            

-------------------------------------------------------------------

 Aggregate  (cost=3634.28..3634.29 rows=1 width=0)

   ->  Seq Scan on t_test  (cost=0.00..3385.00 rows=99713 width=0)

         Filter: (a = 0)

 Planning time: 0.052 ms

(4 rows)

99.713 is actually a pretty good estimate and it is totally sufficient to come up with a reasonable plan.

cross correlation at work

Let us try something else:

test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;

                            QUERY PLAN                            

-------------------------------------------------------------------

 Aggregate  (cost=4010.00..4010.01 rows=1 width=0)

   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=50000 width=0)

         Filter: ((a = 0) AND (b = 0))

 Planning time: 0.068 ms

(4 rows)

Oops? What happened? The planner will estimate that 50.000 rows match this condition. The reality is somewhat different:

test=# SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;

 count

-------

     0

(1 row)

How did we end up with this terrible under-estimation of the problem?

Well, the reason is how PostgreSQL handles statistics. Internally PostgreSQL will store statistics for every column. So, we know that a=0 represents 50% of the table and b=0 will also represent 50% of the table. From a mathematical point of view it might be safe to just multiply those likelihoods:

0.5  * 0.5 = 0.25

This is exactly what is going on here. Therefore the estimate is 25% of 200.000 rows = 50.000 rows. In our example this is dead wrong. The problem is that PostgreSQL does not have statistics about the combination of those columns. It does not know that when a=1 than b will be 0.

As you have seen, this can lead to over estimation – but also to under estimations:

test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 1;

                            QUERY PLAN                             

-------------------------------------------------------------------

 Aggregate  (cost=4009.28..4009.30 rows=1 width=0)

   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=49714 width=0)

         Filter: ((a = 0) AND (b = 1))

 Planning time: 0.035 ms

(4 rows)

In this case the real number of rows returned by the system is 100.000 and not just 49.714.

Keep in mind that the examples I have created are really pretty artificial. But, things like that can happen in real life and cause issues along the way. And as always: If you get bad plans – expect bad performance.