A couple of weeks ago I have seen a horribly designed, 370 billion (with a “b”) row Oracle database monster running some sort of aggregations. Due to the sheer amount of data I naturally thought about how I would implement the same thing in PostgreSQL. What I noticed is that most people would actually implement aggregations slightly wrong in PostgreSQL – so decided to share some thoughts on this issue.
Loading data into PostgreSQL
Before getting started two relations are created:
test=# CREATE TABLE t_gender (id int, name text); CREATE TABLE test=# INSERT INTO t_gender VALUES (1, 'male'), (2, 'female'); INSERT 0 2
t_gender is a classical “lookup table”. It contains only a handful of rows, which will be joined on. The real “fact table” is in my example t_person:
test=# CREATE TABLE t_person ( id serial, gender int, data char(40) ); CREATE TABLE
For this example it should be enough to load a couple of million rows. The following INSERT statement makes sure that genders are evenly distributed in the table. Half of the population is male, and the other half is female:
test=# INSERT INTO t_person (gender, data) SELECT x % 2 + 1, 'data' FROM generate_series(1, 5000000) AS x; INSERT 0 5000000
How most people approach aggregation in PostgreSQL
The goal of the following code is to take our data and turn it into a simple report. Reporting is pretty important these days so you might see many of those queries as outlined below:
test=# SELECT name, count(*) FROM t_gender AS a, t_person AS b WHERE a.id = b.gender GROUP BY 1; name | count --------+--------- female | 2500000 male | 2500000 (2 rows) Time: 961.034 ms
The goal is to figure out, how many men and women our database contains. The way to do that is to simply join the lookup table. On my laptop this takes around 961 ms …
The question is: Can we do better? Of course there is always a way to speed up things: More CPUs, more cache, etc. However, this is not the kind of improvement I am talking about. My question is: Can we use a smarter algorithm? Many people might be surprised but the answer is “yes”.
SQL: Finding the bottleneck
If you want to understand a performance problem there is usually no way to get around reading execution plans. Here is what the planner thinks:
explain SELECT name, count(*) FROM t_gender AS a, t_person AS b WHERE a.id = b.gender GROUP BY 1; QUERY PLAN ----------------------------------------------------- HashAggregate ... Group Key: a.name -> Hash Join (rows=5000034) Hash Cond: (b.gender = a.id) -> Seq Scan on t_person b (rows=5000034) -> Hash (cost=1.02..1.02 rows=2 width=10) -> Seq Scan on t_gender a (rows=2)
PostgreSQL scans both tables sequentually and joins them together. Then the joined data is aggregated. In other words: 5 million rows will be joined with a small table.
High-performance analysis and aggregation in PostgreSQL
However, there is an alternative: What if we aggregate first and join later? What if we just counted those IDs and then lookup the name? The beauty of this approach is that we just had to join 2 rows instead of 5 million rows.
Here is what we could do:
test=# WITH x AS ( SELECT gender, count(*) AS res FROM t_person AS a GROUP BY 1 ) SELECT name, res FROM x, t_gender AS y WHERE x.gender = y.id; ... ... Time: 526.472 ms
Voila, the same thing happens A LOT faster. The Common Table Expression (CTE) is executed first and then joined. WITH is an “optimization barrier” making sure that the optimizer cannot fool around with things.
Let us take a look at the plan:
test=# explain WITH x AS ( SELECT gender, count(*) AS res FROM t_person AS a GROUP BY 1 ) SELECT name, res FROM x, t_gender AS y WHERE x.gender = y.id; QUERY PLAN --------------------------------- Hash Join (rows=2) Hash Cond: (y.id = x.gender) CTE x -> HashAggregate (rows=2) Group Key: a.gender -> Seq Scan on t_person a (rows=5000034) -> Seq Scan on t_gender y (rows=2) -> Hash (rows=2) -> CTE Scan on x (rows=2)
Sweet: The CTE is calculated and later joined giving us the extra boost we desired.
PostgreSQL: What the future might have in stock for us
The reason why PostgreSQL is not doing this automatically is buried deep inside the structure of the planner. As of version 10.x PostgreSQL always has to join first and aggregate later. Currently serious work is done to lift this restriction and give the planner a bit more flexibility. Various developers including people from my team here at Cybertec are actively working on this issue and I am hopeful to see speedup in PostgreSQL 11 or maybe PostgreSQL 12.