PostgreSQL: Speeding up GROUP BY and joins

A couple of weeks ago I have seen a horribly designed, 370 billion (with a “b”) row Oracle database monster running some sort of aggregations. Due to the sheer amount of data I naturally thought about how I would implement the same thing in PostgreSQL. What I noticed is that most people would actually implement aggregations slightly wrong in PostgreSQL - so I decided to share some thoughts on the clause GROUP BY and performance.

Loading data into PostgreSQL

Before getting started, two relations are created:

t_gender is a classical “lookup table”. It contains only a handful of rows, which will be joined on. The real “fact table” is in my example t_person:

For this example it should be enough to load a couple of million rows. The following INSERT statement makes sure that genders are evenly distributed in the table. Half of the population is male, and the other half is female:

How most people approach aggregation in PostgreSQL

The goal of the following code is to take our data and turn it into a simple report. Reporting is pretty important these days so you might see many of those queries as outlined below:

The goal is to figure out, how many men and women our database contains. The way to do that is to simply join the lookup table. On my laptop this takes around 961 ms …
The question is: Can we do better? Of course there is always a way to speed up things: More CPUs, more cache, etc. However, this is not the kind of improvement I am talking about. My question is: Can we use a smarter algorithm? Many people might be surprised but the answer is “yes”.

SQL: Finding the bottleneck

If you want to understand a performance problem there is usually no way to get around reading execution plans. Here is what the planner thinks:

PostgreSQL scans both tables sequentially and joins them together. Then the joined data is aggregated. In other words: 5 million rows will be joined with a small table.

High-performance analysis and aggregation in PostgreSQL

However, there is an alternative: What if we aggregate first and join later? What if we just counted those IDs and then lookup the name? The beauty of this approach is that we just had to join 2 rows instead of 5 million rows.

Here is what we could do:

Voila, the same thing happens A LOT faster. The Common Table Expression (CTE) is executed first and then joined. WITH is an “optimization barrier” making sure that the optimizer cannot fool around with things.

Let us take a look at the plan:

Sweet: The CTE is calculated and later joined giving us the extra boost we desired.

PostgreSQL: What the future might have in stock for us

The reason why PostgreSQL doesn't do this automatically is buried deep inside the structure of the planner. As of version 10.x PostgreSQL always has to join first and aggregate later. Currently serious work is done to lift this restriction and give the planner a bit more flexibility. Various developers including people from my team here at CYBERTEC are actively working on this issue, and I am hopeful to see a speedup in PostgreSQL 11 or maybe PostgreSQL 12.

Find out the latest info on PostgreSQL performance tuning in our blogposts.

0 0 votes
Article Rating
Notify of
Newest Most Voted
Inline Feedbacks
View all comments
3 years ago

Hey Hans, curious if this landed in 11 or 12!

Anil Kesariya
Anil Kesariya
5 years ago

Very useful artical.

4 days ago

Oh that article could be problematic in current society 😀

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf

+43 (0) 2622 93022-0

Get the newest PostgreSQL Info & Tools

    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    CYBERTEC PostgreSQL International GmbH
    Would love your thoughts, please comment.x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram