PostgreSQL: CREATE STATISTICS - advanced query optimization

02.2021 / Category: / Tags: |

PostgreSQL query optimization with CREATE STATISTICS is an important topic. Usually, the PostgreSQL optimizer (query planner) does an excellent job. This is not only true for OLTP but also for data warehousing. However, in some cases the optimizer simply lacks the information to do its job properly. One of these situations has to do with cross-column correlation. Let’s dive in and see what that means.

Violating statistical independence

Often data is not correlated at all. The color of a car might have nothing to do with the amount of gasoline consumed. You wouldn’t expect a correlation between the number of dogs in China and the price of coffee. However, sometimes there is indeed a correlation between columns which can cause a problem during query optimization.

Why is that the case? In PostgreSQL, statistics are stored for each column. PostgreSQL knows about the number of distinct entries, the distribution of data, and so on - by default it does NOT know how values relate to each other. Some examples: you know that age and height correlate (babies are not born 6 ft tall), you know that “country” and “language” spoken are usually correlated. However, the PostgreSQL query planner does not know that.
The solution to the problem is extended statistics (“CREATE STATISTICS”).

Let us create some same sample data:

What we do here is to create 10 million rows. The magic is in the first two columns: We see that the “y” is directly related to “x” - we simply add 50.000 to make sure that the data is perfectly correlated.

Inspecting optimizer estimates

In the next example we will take a look at a simple query and see how PostgreSQL handles statistics. To make the plan more readable and the statistics more understandable, we will turn parallel queries off:

What we see here is highly interesting: The planner assumes that roughly one million rows will be returned. However, this is not true - as the following listing clearly shows:

As you can see, the planner greatly overestimates the number of groups. Why is that the case? The planner simply multiplies the number of entries in “x” with the number of entries in “y”. So 10.000 x 10.000 makes one million. Note that the result is not precise, because PostgreSQL is using estimates here. A wrong estimate can greatly reduce performance and cause serious issues.

CREATE STATISTICS: Solving warehousing problems

Therefore, a solution is needed: “CREATE STATISTICS” allows you to create statistics about the expected number of distinct entries, given a list of columns:

In this case, statistics for x and y are needed. Refreshing those statistics can easily be done with ANALYZE. PostgreSQL will automatically maintain those extended statistics for you.
The key question is: What does it mean for our estimates? Let us run the same query again and inspect the estimates:

Wow, 9984 groups. We are right on target.
If your queries are more complex, such improvements can have a huge impact on performance and speed things up considerably.
What you have seen in this example is “CREATE STATISTICS” using the “ndistinct” method. In the case of GROUP BY, this method is what you need. If you want to speed up standard WHERE clauses, you might want to dig into the “dependencies” method to improve statistics accuracy.

Finally ...

If you want to learn more about query optimization in general, you might want to check out my blog post about GROUP BY. It contains some valuable tips on how to run analytical queries faster.
Also, If you have any comments, feel free to share them in the Disqus section below. If there are any topics you are especially interested in, feel free to share them as well. We can certainly cover your topics of interest in future articles.

0 0 votes
Article Rating
Notify of
Inline Feedbacks
View all comments
CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf

+43 (0) 2622 93022-0

Get the newest PostgreSQL Info & Tools

    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    CYBERTEC PostgreSQL International GmbH
    Would love your thoughts, please comment.x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram