PostgreSQL query optimization with CREATE STATISTICS is an important topic. Usually, the PostgreSQL optimizer (query planner) does an excellent job. This is not only true for OLTP but also for data warehousing. However, in some cases the optimizer simply lacks the information to do its job properly. One of these situations has to do with cross-column correlation. Let’s dive in and see what that means.

Violating statistical independence

Often data is not correlated at all. The color of a car might have nothing to do with the amount of gasoline consumed. You wouldn’t expect a correlation between the number of dogs in China and the price of coffee. However, sometimes there is indeed a correlation between columns which can cause a problem during query optimization.

Why is that the case? In PostgreSQL, statistics are stored for each column. PostgreSQL knows about the number of distinct entries, the distribution of data, and so on – by default it does NOT know how values relate to each other. Some examples: you know that age and height correlate (babies are not born 6 ft tall), you know that “country” and “language” spoken are usually correlated. However, the PostgreSQL query planner does not know that.
The solution to the problem is extended statistics (“CREATE STATISTICS”).

Let us create some sample data:

test=# CREATE TABLE t_test (id serial, x int, y int, z int);
test=# INSERT INTO t_test (x, y, z)
          SELECT id % 10000,
                 (id % 10000) + 50000,
                 random() * 100
          FROM generate_series(1, 10000000) AS id;
INSERT 0 10000000
test=# ANALYZE;

What we do here is create 10 million rows. The magic is in the first two columns: “y” is directly derived from “x” – we simply add 50,000 to make sure that the data is perfectly correlated.
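To see that the columns really are correlated, we can count the distinct values ourselves. The following query is merely a sanity check and not part of the original example:

test=# SELECT count(DISTINCT x) AS x_vals,
              count(DISTINCT y) AS y_vals,
              count(DISTINCT (x, y)) AS xy_pairs
       FROM t_test;

Each column contains 10,000 distinct values, and because “y” is derived from “x”, there are also only 10,000 distinct (x, y) pairs – not the 100 million pairs that two independent columns of this size could in theory produce.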

Inspecting optimizer estimates

In the next example we will take a look at a simple query and see how PostgreSQL handles statistics. To make the plan more readable and the statistics more understandable, we will turn parallel queries off:

test=# SET max_parallel_workers_per_gather TO 0;
test=# explain SELECT x, y, count(*) FROM t_test GROUP BY 1, 2;
                            QUERY PLAN
HashAggregate (cost=741567.03..829693.58 rows=1000018 width=16)
     Group Key: x, y
     Planned Partitions: 32
     -> Seq Scan on t_test (cost=0.00..154056.75 rows=10000175 width=8)
(4 rows)

What we see here is highly interesting: The planner assumes that roughly one million rows will be returned. However, this is not true – as the following listing clearly shows:

test=# explain analyze SELECT x, y, count(*) FROM t_test GROUP BY 1, 2;
                                 QUERY PLAN
HashAggregate (cost=741567.03..829693.58 rows=1000018 width=16) 
              (actual time=2952.991..2954.460 rows=10000 loops=1)
     Group Key: x, y
     Planned Partitions: 32 Batches: 1 Memory Usage: 2577kB
     -> Seq Scan on t_test (cost=0.00..154056.75 rows=10000175 width=8) 
                           (actual time=0.036..947.466 rows=10000000 loops=1)
Planning Time: 0.081 ms
Execution Time: 2955.077 ms
(6 rows)

As you can see, the planner greatly overestimates the number of groups. Why is that the case? The planner essentially multiplies the number of distinct entries in “x” (roughly 10,000) by the number of distinct entries in “y” (again roughly 10,000) and caps the result relative to the size of the table – which is how it arrives at roughly one million groups instead of the real 10,000. Note that the numbers are not exact, because PostgreSQL is working with sampled estimates here. A wrong estimate like this can greatly reduce performance and cause serious issues.
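The per-column numbers the planner is working with can be inspected through the pg_stats system view. A short sketch (the view and column names are standard PostgreSQL):

test=# SELECT attname, n_distinct
       FROM pg_stats
       WHERE tablename = 't_test'
         AND attname IN ('x', 'y');

Both columns should show close to 10,000 distinct values – individually correct, but the planner has no way of knowing from these per-column numbers that the two columns always move together.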

CREATE STATISTICS: Solving warehousing problems

Therefore, a solution is needed: “CREATE STATISTICS” allows you to create statistics about the expected number of distinct entries, given a list of columns:

test=# CREATE STATISTICS mygrp (ndistinct) ON x, y FROM t_test;
test=# ANALYZE t_test;

In this case, extended statistics covering x and y are created. Refreshing those statistics can easily be done with ANALYZE – from then on, PostgreSQL will maintain the extended statistics for you automatically.
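If you want to verify what PostgreSQL has collected, the extended statistics are exposed through the pg_stats_ext view (available since PostgreSQL 12). A quick sketch:

test=# SELECT statistics_name, attnames, n_distinct
       FROM pg_stats_ext
       WHERE statistics_name = 'mygrp';

The n_distinct column shows the estimated number of distinct combinations for the column group – in our example, close to 10,000 rather than 100 million.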
The key question is: What does it mean for our estimates? Let us run the same query again and inspect the estimates:

test=# explain SELECT x, y, count(*) FROM t_test GROUP BY 1, 2;
                            QUERY PLAN
HashAggregate (cost=229052.55..229152.39 rows=9984 width=16)
     Group Key: x, y
     -> Seq Scan on t_test (cost=0.00..154053.60 rows=9999860 width=8)
(3 rows)

Wow, 9984 groups. We are right on target.
If your queries are more complex, such improvements can have a huge impact on performance and speed things up considerably.
What you have seen in this example is “CREATE STATISTICS” using the “ndistinct” method. In the case of GROUP BY, this method is what you need. If you want to speed up standard WHERE clauses, you might want to dig into the “dependencies” method to improve statistics accuracy.
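As a sketch of what the “dependencies” method looks like on our sample table (the statistics name is arbitrary):

test=# CREATE STATISTICS mydep (dependencies) ON x, y FROM t_test;
test=# ANALYZE t_test;
test=# explain SELECT * FROM t_test WHERE x = 10 AND y = 50010;

Without functional-dependency statistics, the planner multiplies the selectivities of the two WHERE conditions and drastically underestimates the number of matching rows; with them, it understands that x = 10 already implies y = 50010 in this data set.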

Finally …

If you want to learn more about query optimization in general, you might want to check out my blog post about GROUP BY. It contains some valuable tips on how to run analytical queries faster.
Also, if you have any comments, feel free to share them in the Disqus section below. If there are any topics you are especially interested in, feel free to share those as well – we can certainly cover them in future articles.