A couple of weeks ago I saw a horribly designed, 370 billion (with a “b”) row Oracle database monster running some sort of aggregations. Due to the sheer amount of data I naturally thought about how I would implement the same thing in PostgreSQL. What I noticed is that most people would actually implement aggregations slightly wrong in PostgreSQL – so I decided to share some thoughts on this issue.

Loading data into PostgreSQL

Before getting started two relations are created:

test=# CREATE TABLE t_gender (id int, name text);
CREATE TABLE
test=# INSERT INTO t_gender
VALUES (1, 'male'), (2, 'female');
INSERT 0 2

t_gender is a classical “lookup table”. It contains only a handful of rows, which will be joined on. The real “fact table” in my example is t_person:

test=# CREATE TABLE t_person (
id serial,
gender int,
data char(40)
);
CREATE TABLE

For the sake of this example it should be enough to load a couple of million rows. The following INSERT statement makes sure that genders are evenly distributed in the table. Half of the population is male and the other half is female:

test=# INSERT INTO t_person (gender, data)
SELECT x % 2 + 1, 'data'
FROM generate_series(1, 5000000) AS x;
INSERT 0 5000000
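
The timings shown below come straight from psql. If you want to reproduce them, it can help to switch on psql's timing display and refresh the optimizer statistics first; here is a small setup sketch (not part of the original benchmark, it merely avoids measuring against stale statistics):

test=# \timing on
Timing is on.
test=# VACUUM ANALYZE t_person;
VACUUM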

How most people approach aggregation in PostgreSQL

The goal of the following code is to take our data and turn it into a simple report. Reporting is pretty important these days, so you are likely to see many queries like the one outlined below:

test=# SELECT name, count(*)
FROM t_gender AS a, t_person AS b
WHERE a.id = b.gender
GROUP BY 1;
  name  |  count
--------+---------
 female | 2500000
 male   | 2500000
(2 rows)
Time: 961.034 ms

The goal is to figure out how many men and women our database contains. The way to do that is to simply join the lookup table. On my laptop this takes around 961 ms …
The question is: Can we do better? Of course there is always a way to speed things up: more CPUs, more cache, etc. However, this is not the kind of improvement I am talking about. My question is: Can we use a smarter algorithm? Many people might be surprised, but the answer is “yes”.

SQL: Finding the bottleneck

If you want to understand a performance problem there is usually no way to get around reading execution plans. Here is what the planner thinks:

explain SELECT name, count(*)
FROM t_gender AS a, t_person AS b
WHERE a.id = b.gender GROUP BY 1;
                     QUERY PLAN
-----------------------------------------------------
 HashAggregate ...
   Group Key: a.name
   ->  Hash Join  (rows=5000034)
         Hash Cond: (b.gender = a.id)
         ->  Seq Scan on t_person b  (rows=5000034)
         ->  Hash  (cost=1.02..1.02 rows=2 width=10)
               ->  Seq Scan on t_gender a  (rows=2)

PostgreSQL scans both tables sequentially and joins them together. Then the joined data is aggregated. In other words: 5 million rows will be joined with a small table.
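
If you want to see where the time is actually spent, EXPLAIN ANALYZE runs the query and reports real per-node timings (just a sketch of the command; the concrete numbers will of course differ from machine to machine):

test=# EXPLAIN ANALYZE
SELECT name, count(*)
FROM t_gender AS a, t_person AS b
WHERE a.id = b.gender
GROUP BY 1;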

High-performance analysis and aggregation in PostgreSQL

However, there is an alternative: What if we aggregate first and join later? What if we just counted those IDs and then looked up the name? The beauty of this approach is that we only have to join 2 rows instead of 5 million rows.

Here is what we could do:

test=# WITH x AS
(
SELECT gender, count(*) AS res
FROM t_person AS a
GROUP BY 1
)
SELECT name, res
FROM x, t_gender AS y
WHERE x.gender = y.id;
... ...
Time: 526.472 ms

Voila, the same thing happens A LOT faster. The Common Table Expression (CTE) is executed first and then joined. WITH acts as an “optimization barrier”, making sure that the optimizer cannot fool around with things.

Let us take a look at the plan:

test=# explain WITH x AS
(
SELECT gender, count(*) AS res
FROM t_person AS a
GROUP BY 1
)
SELECT name, res
FROM x, t_gender AS y
WHERE x.gender = y.id;

                     QUERY PLAN
-----------------------------------------------------
 Hash Join  (rows=2)
   Hash Cond: (y.id = x.gender)
   CTE x
     ->  HashAggregate  (rows=2)
           Group Key: a.gender
           ->  Seq Scan on t_person a  (rows=5000034)
   ->  Seq Scan on t_gender y  (rows=2)
   ->  Hash  (rows=2)
         ->  CTE Scan on x  (rows=2)

Sweet: The CTE is calculated first and joined later, giving us the extra boost we desired.
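
As a side note: the aggregate-before-join idea does not strictly depend on WITH. Writing the aggregation as a derived table in the FROM clause is an alternative formulation; since PostgreSQL does not pull up subqueries containing GROUP BY, the aggregation is still performed before the join. A sketch of that variant (equivalent to the CTE version above, not an additional benchmark):

test=# SELECT name, res
FROM (
    SELECT gender, count(*) AS res   -- aggregate 5 million rows down to 2
    FROM t_person
    GROUP BY 1
) AS x, t_gender AS y                -- join only those 2 rows to the lookup table
WHERE x.gender = y.id;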

PostgreSQL: What the future might have in store for us

The reason why PostgreSQL is not doing this automatically is buried deep inside the structure of the planner. As of version 10.x PostgreSQL always has to join first and aggregate later. Serious work is currently being done to lift this restriction and give the planner a bit more flexibility. Various developers, including people from my team here at Cybertec, are actively working on this issue, and I am hopeful that we will see a speedup in PostgreSQL 11 or maybe PostgreSQL 12.