In my previous post on PostgreSQL 9.4 I showed aggregate FILTER clauses, which are a neat way to make partial aggregates more readable. Inspired by some comments on that post, I decided to write a follow-up to see what impact this new FILTER clause has on performance.
Loading some demo data
To see the performance impact people can expect, here is some demo data:
test=# BEGIN;
BEGIN
test=# CREATE TABLE t_test (id int, dummy char(200));
CREATE TABLE
test=# INSERT INTO t_test SELECT *, 'dummy' FROM generate_series(1, 20000000) AS x;
INSERT 0 20000000
test=# COMMIT;
COMMIT
The data set is roughly 4.5 GB. On the test box it still fits nicely into memory:
test=# SELECT pg_size_pretty(pg_relation_size('t_test'));
 pg_size_pretty
----------------
 4596 MB
(1 row)
A classical approach
For years I have used the classical approach. The trick is that aggregate functions ignore NULL values, so the idea behind partial aggregation is to replace unwanted values with NULL on the fly. The code is the same as in the previous post:
test=# SELECT count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
       count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
       FROM t_test;
   even   |   odd
----------+----------
 10000000 | 10000000
(1 row)

Time: 6208.092 ms
In this example the aggregation takes slightly over 6 seconds (with all hint bits set, etc.). This is not too bad.
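The CASE trick is not specific to PostgreSQL; the SQL is portable. As a self-contained illustration, here is a minimal sketch that runs the same query through Python's sqlite3 module (SQLite is used only so the example needs no database server; table and column names mirror the demo above, just on a tiny data set):

```python
import sqlite3

# Runnable sketch of the CASE-based partial count. SQLite is used here
# only so the example needs no server; the SQL itself is portable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_test (id INTEGER)")
conn.executemany("INSERT INTO t_test VALUES (?)",
                 [(i,) for i in range(1, 21)])  # ids 1..20 instead of 20 million

even, odd = conn.execute("""
    SELECT count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
           count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
    FROM t_test
""").fetchone()
print(even, odd)  # 10 10
```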
The modern approach
Let us see now what happens when the new approach is used:
test=# SELECT count(id) FILTER (WHERE id % 2 = 0) AS even,
       count(id) FILTER (WHERE id % 2 = 1) AS odd
       FROM t_test;
   even   |   odd
----------+----------
 10000000 | 10000000
(1 row)

Time: 5353.172 ms
The query is almost a second faster than the original one. The overhead of the new FILTER clause is simply far smaller than that of the CASE construct in the original version of the query. And of course, the more clauses you add, the larger the difference will be.
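The FILTER clause comes from the SQL standard, so it is not a PostgreSQL invention; SQLite, for example, has supported it since version 3.30. That makes it easy to try the modern syntax in a self-contained sketch via Python's sqlite3 module (again on a tiny data set, purely so the example runs without a server):

```python
import sqlite3

# Same partial count as before, but with FILTER instead of CASE.
# Requires SQLite 3.30 or newer (check sqlite3.sqlite_version if unsure).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_test (id INTEGER)")
conn.executemany("INSERT INTO t_test VALUES (?)",
                 [(i,) for i in range(1, 21)])  # ids 1..20

even, odd = conn.execute("""
    SELECT count(id) FILTER (WHERE id % 2 = 0) AS even,
           count(id) FILTER (WHERE id % 2 = 1) AS odd
    FROM t_test
""").fetchone()
print(even, odd)  # 10 10
```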
More realistic scenarios
Before a more complex query is fired I decided to increase work_mem to make sure that all aggregations can happen in memory:
test=# SET work_mem TO '1 GB';
SET
Time: 0.375 ms
This time an additional GROUP BY clause is added to split the data set into smaller chunks. A GROUP BY clause is what most people will have in their queries anyway when FILTER is used:
test=# SELECT id % 3 AS g,
       count(CASE WHEN id % 2 = 0 THEN 1 ELSE NULL END) AS even,
       count(CASE WHEN id % 2 = 1 THEN 1 ELSE NULL END) AS odd
       FROM t_test
       GROUP BY 1;
 g |  even   |   odd
---+---------+---------
 0 | 3333333 | 3333333
 1 | 3333333 | 3333334
 2 | 3333334 | 3333333
(3 rows)

Time: 22119.977 ms
The result comes back in 22 seconds, which is actually quite okay.
The next query shows the same thing using the new syntax:
test=# SELECT id % 3 AS g,
       count(id) FILTER (WHERE id % 2 = 0) AS even,
       count(id) FILTER (WHERE id % 2 = 1) AS odd
       FROM t_test
       GROUP BY 1;
 g |  even   |   odd
---+---------+---------
 0 | 3333333 | 3333333
 1 | 3333333 | 3333334
 2 | 3333334 | 3333333
(3 rows)

Time: 21379.657 ms
The important thing here is that the performance difference is again around one second. This makes sense: the time saved by the FILTER clause does not depend on the grouping around it, so a constant gap is exactly what one would expect.
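The grouped variant can be reproduced the same way in a self-contained sketch. The snippet below (again SQLite 3.30+ via Python's sqlite3 module, as a stand-in for a real PostgreSQL session) uses ids 1 through 18, so each of the three groups contains exactly three even and three odd values:

```python
import sqlite3

# FILTER combined with GROUP BY, on a tiny data set (ids 1..18).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_test (id INTEGER)")
conn.executemany("INSERT INTO t_test VALUES (?)",
                 [(i,) for i in range(1, 19)])

rows = conn.execute("""
    SELECT id % 3 AS g,
           count(id) FILTER (WHERE id % 2 = 0) AS even,
           count(id) FILTER (WHERE id % 2 = 1) AS odd
    FROM t_test
    GROUP BY 1
    ORDER BY 1
""").fetchall()
print(rows)  # [(0, 3, 3), (1, 3, 3), (2, 3, 3)]
```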