After my previous post I received mails asking about what was meant by using “generate_series” in an outer join.
Let me try to address this issue to proof my point: Many people are using PostgreSQL to aggregate data. In this example we assume that we want to aggregate data in an hourly basis.
Here is some sample data:
test=# CREATE TABLE t_data (t timestamp, payload text); CREATE TABLE
We create a table with just two fields:
test=# INSERT INTO t_data VALUES ('2012-04-04 03:12', 'data 1'); INSERT 0 1 test=# INSERT INTO t_data VALUES ('2012-04-04 04:16', 'data 2'); INSERT 0 1 test=# INSERT INTO t_data VALUES ('2012-04-04 04:28', 'data 3'); INSERT 0 1 test=# INSERT INTO t_data VALUES ('2012-04-04 06:45', 'data 4'); INSERT 0 1
What you see here is that we have data for 3am, 4am and 6am … we have intentionally left 5am out. If we start to aggregate the data to see how many rows there are per hour, we can use the following query:
test=# SELECT date_trunc('hour', t), count(payload) FROM t_data GROUP BY 1 ORDER BY 1; date_trunc | count ---------------------+------- 2012-04-04 03:00:00 | 1 2012-04-04 04:00:00 | 2 2012-04-04 06:00:00 | 1 (3 rows)
The SQL challenge
A simple GROUP BY will reveal that there were no entries for the 5am group. However, the zero is not displayed here, because GROUP BY will only return data for valid groups.
In order to create a list of times we want to see in our final report, we can execute generate_series. Let us assume we want to have a complete table from midnight to 6am:
test=# SELECT * FROM generate_series( '2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x; x --------------------- 2012-04-04 00:00:00 2012-04-04 01:00:00 2012-04-04 02:00:00 2012-04-04 03:00:00 2012-04-04 04:00:00 2012-04-04 05:00:00 2012-04-04 06:00:00 (7 rows)
This will return a complete list – exactly what we need to display in the final results. Remember, we also want to display hours with no rows.
Joining things together
Let us join things together using a standard PostgreSQL outer join now:
test=# SELECT * FROM generate_series('2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x LEFT JOIN t_data ON (date_trunc('hour', t) = x); x | t | payload ---------------------+---------------------+--------- 2012-04-04 00:00:00 | | 2012-04-04 01:00:00 | | 2012-04-04 02:00:00 | | 2012-04-04 03:00:00 | 2012-04-04 03:12:00 | data 1 2012-04-04 04:00:00 | 2012-04-04 04:16:00 | data 2 2012-04-04 04:00:00 | 2012-04-04 04:28:00 | data 3 2012-04-04 05:00:00 | | 2012-04-04 06:00:00 | 2012-04-04 06:45:00 | data 4 (8 rows)
We will run date_trunc to make sure that the precision of our dates is actually hours. Then we can use the date column as the join key. Note that the non-existing fields on the “data” side of the join will stay empty.
Watch out for trouble – be precise
Once we have joined the data we have to aggregate to fetch the final result. When doing this aggregation watch out for a commonly made mistake:
test=# SELECT x, count(*) FROM generate_series('2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x LEFT JOIN t_data ON (date_trunc('hour', t) = x) GROUP BY 1; x | count ---------------------+------- 2012-04-04 00:00:00 | 1 2012-04-04 01:00:00 | 1 2012-04-04 02:00:00 | 1 2012-04-04 03:00:00 | 1 2012-04-04 04:00:00 | 2 2012-04-04 05:00:00 | 1 2012-04-04 06:00:00 | 1 (7 rows)
Never EVER use count(*) along with an outer join. The problem is: count(*) will count ROWS – we don’t want to count rows because we are not allowed to count NULL values. As you can see the result is just plain wrong.
To correct the problem we highly advise you to count non-NULL values. In SQL this means that you have to use count(column) rather than count(*). count(column) will only count values, which are not NULL. This is exactly what we want:
test=# SELECT x, count(payload) FROM generate_series('2012-04-04 00:00'::timestamp, '2012-04-04 06:00'::timestamp, '1 hour') AS x LEFT JOIN t_data ON (date_trunc('hour', t) = x) GROUP BY 1; x | count ---------------------+------- 2012-04-04 00:00:00 | 0 2012-04-04 01:00:00 | 0 2012-04-04 02:00:00 | 0 2012-04-04 03:00:00 | 1 2012-04-04 04:00:00 | 2 2012-04-04 05:00:00 | 0 2012-04-04 06:00:00 | 1 (7 rows)
Voila, the result is exactly as desired.
———-
We welcome your feedback on our brand new Facebook page: www.fb.com/cybertec.postgresql