Optimization issues: Cross column correlation

UPDATE Sept. 2023: For more recent information about planner statistics, see the PostgreSQL documentation about extended statistics, or this blog about CREATE STATISTICS.

Table of Contents

You can also find out more recent info in this blog about Improving Group By with Create Statistics.

The blog post below about cross correlation has been preserved for reference.

Planner estimates have already been discussed on this blog in my previous post, and also in some posts before that. A couple of years ago, I stumbled across an interesting issue, which is commonly known as “cross correlation”.

Let us consider the following example:

test=# CREATE TABLE t_test (a int, b int);
CREATE TABLE

test=# INSERT INTO t_test SELECT 0, 1
       FROM generate_series(1, 100000);

INSERT 0 100000

test=# INSERT INTO t_test SELECT 1, 0
       FROM generate_series(1, 100000);
INSERT 0 100000

test=# ANALYZE t_test;
ANALYZE

test=# CREATE TABLE t_test (a int, b int);

CREATE TABLE

test=# INSERT INTO t_test SELECT 0, 1

FROM generate_series(1, 100000);

INSERT 0 100000

test=# INSERT INTO t_test SELECT 1, 0

FROM generate_series(1, 100000);

INSERT 0 100000

test=# ANALYZE t_test;

ANALYZE

We add 100.000 rows containing 0 and 1 and then add 100.000 rows containing 1 and 0. Then optimizer statistics are built. So, all together we have 200.000 files in the table:

test=# explain SELECT count(*) FROM t_test;
                             QUERY PLAN                            
--------------------------------------------------------------------
 Aggregate  (cost=3385.00..3385.01 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..2885.00 rows=200000 width=0)
 Planning time: 0.095 ms

(3 rows)

test=# explain SELECT count(*) FROM t_test;

QUERY PLAN

--------------------------------------------------------------------

Aggregate (cost=3385.00..3385.01 rows=1 width=0)

-> Seq Scan on t_test (cost=0.00..2885.00 rows=200000 width=0)

Planning time: 0.095 ms

(3 rows)

So far everything looks just fine. The planner has guessed the number of rows in the table precisely. The same applies to the following query:

test=# explain SELECT count(*) FROM t_test WHERE a = 0;
                            QUERY PLAN                            
-------------------------------------------------------------------
 Aggregate  (cost=3634.28..3634.29 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3385.00 rows=99713 width=0)
         Filter: (a = 0)
 Planning time: 0.052 ms

(4 rows)

test=# explain SELECT count(*) FROM t_test WHERE a = 0;

QUERY PLAN

-------------------------------------------------------------------

Aggregate (cost=3634.28..3634.29 rows=1 width=0)

-> Seq Scan on t_test (cost=0.00..3385.00 rows=99713 width=0)

Filter: (a = 0)

Planning time: 0.052 ms

(4 rows)

99.713 is actually a pretty good estimate and it is totally sufficient to come up with a reasonable plan.

cross correlation at work

Let us try something else:

test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
                            QUERY PLAN                            
-------------------------------------------------------------------
 Aggregate  (cost=4010.00..4010.01 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=50000 width=0)
         Filter: ((a = 0) AND (b = 0))

 Planning time: 0.068 ms

(4 rows)

test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;

QUERY PLAN

-------------------------------------------------------------------

Aggregate (cost=4010.00..4010.01 rows=1 width=0)

-> Seq Scan on t_test (cost=0.00..3885.00 rows=50000 width=0)

Filter: ((a = 0) AND (b = 0))

Planning time: 0.068 ms

(4 rows)

Oops? What happened? The planner will estimate that 50.000 rows match this condition. The reality is somewhat different:

test=# SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;
 count
-------
     0
(1 row)

test=# SELECT count(*) FROM t_test WHERE a = 0 AND b = 0;

count

-------

(1 row)

How did we end up with this terrible under-estimation of the problem?

Well, the reason is how PostgreSQL handles statistics. Internally PostgreSQL will store statistics for every column. So, we know that a=0 represents 50% of the table and b=0 will also represent 50% of the table. From a mathematical point of view it might be safe to just multiply those likelihoods:

0.5 * 0.5 = 0.25

This is exactly what is going on here. Therefore the estimate is 25% of 200.000 rows = 50.000 rows. In our example this is dead wrong. The problem is that PostgreSQL does not have statistics about the combination of those columns. It does not know that when a=1 than b will be 0.

As you have seen, this can lead to over estimation - but also to under estimations:

test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 1;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Aggregate  (cost=4009.28..4009.30 rows=1 width=0)
   ->  Seq Scan on t_test  (cost=0.00..3885.00 rows=49714 width=0)
         Filter: ((a = 0) AND (b = 1))
 Planning time: 0.035 ms

(4 rows)

test=# explain SELECT count(*) FROM t_test WHERE a = 0 AND b = 1;

QUERY PLAN

-------------------------------------------------------------------

Aggregate (cost=4009.28..4009.30 rows=1 width=0)

-> Seq Scan on t_test (cost=0.00..3885.00 rows=49714 width=0)

Filter: ((a = 0) AND (b = 1))

Planning time: 0.035 ms

(4 rows)

In this case the real number of rows returned by the system is 100.000 and not just 49.714.

Keep in mind that the examples I have created are really pretty artificial. But, things like that can happen in real life and cause issues along the way. And as always: If you get bad plans - expect bad performance.

0 0 votes

Article Rating

1 Comment

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Craig Ringer (2ndQuadrant)

9 years ago

Can I persuade you to add a "date published/edited" to your posts? Things bit-rot on the 'net, and it'd be good to have some indication of when things like this were written and with regards to what PostgreSQL version.

Stay tuned with our

Optimization issues: Cross column correlation

cross correlation at work

How did we end up with this terrible under-estimation of the problem?

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts