UPDATED AUGUST 2023: After my previous post, I received mails asking about what was meant by using “generate_series” in an outer join.

Let me try to address this issue with an example: Many people use PostgreSQL to aggregate data. In this example, we assume that we want to aggregate data on an hourly basis.

Here’s some sample data:

test=# CREATE TABLE t_data (t timestamp, payload text);
CREATE TABLE

We create a table with just two fields:

test=# INSERT INTO t_data VALUES ('2012-04-04 03:12', 'data 1');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 04:16', 'data 2');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 04:28', 'data 3');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 06:45', 'data 4');
INSERT 0 1

We have data for 3am, 4am and 6am … we’ve intentionally left 5am out. If we want to start to aggregate the data to see how many rows there are per hour, we can use the following query:

test=# SELECT date_trunc('hour', t), count(payload)
       FROM   t_data
       GROUP BY 1
 ORDER BY 1;
          date_trunc | count
---------------------+-------
 2012-04-04 03:00:00 | 1
 2012-04-04 04:00:00 | 2
 2012-04-04 06:00:00 | 1
(3 rows)

The SQL challenge

A simple GROUP BY will reveal that there were no entries for the 5am group. However, the zero is not displayed here, because GROUP BY will only return data for valid groups.

In order to create a list of times we want to see in our final report, we can execute generate_series. Let’s assume we want to have a complete table from midnight to 6am:

test=# SELECT *
       FROM   generate_series( '2012-04-04 00:00'::timestamp, 
                               '2012-04-04 06:00'::timestamp, 
                               '1 hour')
AS x; 
         x
---------------------
 2012-04-04 00:00:00
 2012-04-04 01:00:00
 2012-04-04 02:00:00
 2012-04-04 03:00:00
 2012-04-04 04:00:00
 2012-04-04 05:00:00
 2012-04-04 06:00:00
(7 rows)

This will return a complete list – exactly what we need to display in the final results. Remember, we also want to display hours with no rows.

Joining things together with an outer join

Let’s now join things together using a standard PostgreSQL outer join:

 test=# SELECT *
        FROM  generate_series('2012-04-04 00:00'::timestamp,
                              '2012-04-04 06:00'::timestamp,
                              '1 hour') AS x
        LEFT JOIN t_data 
             ON (date_trunc('hour', t) = x);
        x            |         t           | payload
---------------------+---------------------+---------
 2012-04-04 00:00:00 |                     |
 2012-04-04 01:00:00 |                     |
 2012-04-04 02:00:00 |                     |
 2012-04-04 03:00:00 | 2012-04-04 03:12:00 | data 1
 2012-04-04 04:00:00 | 2012-04-04 04:16:00 | data 2
 2012-04-04 04:00:00 | 2012-04-04 04:28:00 | data 3
 2012-04-04 05:00:00 |                     |
 2012-04-04 06:00:00 | 2012-04-04 06:45:00 | data 4
(8 rows)

We’ll run date_trunc to make sure that the precision of our dates is actually in hours. Then we can use the date column as the join key. Note that the non-existing fields on the “data” side of the join will stay empty.

Watch out for trouble – be precise

Once we have joined the data we have to aggregate to fetch the final result. When doing this aggregation, watch out for a common mistake:

test=# SELECT x, count(*)
       FROM   generate_series('2012-04-04 00:00'::timestamp,
                              '2012-04-04 06:00'::timestamp,
                              '1 hour') AS x
       LEFT JOIN t_data
            ON (date_trunc('hour', t) = x)
       GROUP BY 1;
 x                   | count
---------------------+-------
 2012-04-04 00:00:00 | 1
 2012-04-04 01:00:00 | 1
 2012-04-04 02:00:00 | 1
 2012-04-04 03:00:00 | 1
 2012-04-04 04:00:00 | 2
 2012-04-04 05:00:00 | 1
 2012-04-04 06:00:00 | 1
(7 rows)

Never EVER use count(*) along with an outer join.

The problem is, count(*) will count ROWS – we don’t want to count rows because we’re not allowed to count NULL values. As you can see, the result is just plain wrong.

To correct the problem, I advise you to count non-NULL values. In SQL, this means that you have to use count(column) rather than count(*). count(column) will only count values which are not NULL. That’s exactly what we want:

test=# SELECT x, count(payload)
       FROM   generate_series('2012-04-04 00:00'::timestamp,
                              '2012-04-04 06:00'::timestamp,
                              '1 hour') AS x
             LEFT JOIN t_data
                  ON (date_trunc('hour', t) = x)
      GROUP BY 1;
       x             | count
---------------------+------- 
 2012-04-04 00:00:00 | 0
 2012-04-04 01:00:00 | 0
 2012-04-04 02:00:00 | 0
 2012-04-04 03:00:00 | 1
 2012-04-04 04:00:00 | 2
 2012-04-04 05:00:00 | 0
 2012-04-04 06:00:00 | 1
(7 rows)

Voilà, the result is exactly as desired.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.