Time in PostgreSQL: Outer joins

UPDATED AUGUST 2023: After my previous post, I received mails asking about what was meant by using “generate_series” in an outer join.

Table of Contents

Let me try to address this issue with an example: Many people use PostgreSQL to aggregate data. In this example, we assume that we want to aggregate data on an hourly basis.

Sample data

Here's some sample data:

test=# CREATE TABLE t_data (t timestamp, payload text);
CREATE TABLE

1 2	test=# CREATE TABLE t_data (t timestamp, payload text); CREATE TABLE

We create a table with just two fields:

test=# INSERT INTO t_data VALUES ('2012-04-04 03:12', 'data 1');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 04:16', 'data 2');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 04:28', 'data 3');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 06:45', 'data 4');
INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 03:12', 'data 1');

INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 04:16', 'data 2');

INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 04:28', 'data 3');

INSERT 0 1

test=# INSERT INTO t_data VALUES ('2012-04-04 06:45', 'data 4');

INSERT 0 1

We have data for 3am, 4am and 6am ... we've intentionally left 5am out. If we want to start to aggregate the data to see how many rows there are per hour, we can use the following query:

test=# SELECT date_trunc('hour', t), count(payload)
       FROM   t_data
       GROUP BY 1
 ORDER BY 1;
          date_trunc | count
---------------------+-------
 2012-04-04 03:00:00 | 1
 2012-04-04 04:00:00 | 2
 2012-04-04 06:00:00 | 1
(3 rows)

test=# SELECT date_trunc('hour', t), count(payload)

FROM t_data

GROUP BY 1

ORDER BY 1;

date_trunc | count

---------------------+-------

2012-04-04 03:00:00 | 1

2012-04-04 04:00:00 | 2

2012-04-04 06:00:00 | 1

(3 rows)

The SQL challenge

A simple GROUP BY will reveal that there were no entries for the 5am group. However, the zero is not displayed here, because GROUP BY will only return data for valid groups.

In order to create a list of times we want to see in our final report, we can execute generate_series. Let's assume we want to have a complete table from midnight to 6am:

test=# SELECT *
       FROM   generate_series( '2012-04-04 00:00'::timestamp, 
                               '2012-04-04 06:00'::timestamp, 
                               '1 hour')
AS x; 
         x
---------------------
 2012-04-04 00:00:00
 2012-04-04 01:00:00
 2012-04-04 02:00:00
 2012-04-04 03:00:00
 2012-04-04 04:00:00
 2012-04-04 05:00:00
 2012-04-04 06:00:00
(7 rows)

test=# SELECT *

FROM generate_series( '2012-04-04 00:00'::timestamp,

'2012-04-04 06:00'::timestamp,

'1 hour')

AS x;

---------------------

2012-04-04 00:00:00

2012-04-04 01:00:00

2012-04-04 02:00:00

2012-04-04 03:00:00

2012-04-04 04:00:00

2012-04-04 05:00:00

2012-04-04 06:00:00

(7 rows)

This will return a complete list – exactly what we need to display in the final results. Remember, we also want to display hours with no rows.

Joining things together with an outer join

Let's now join things together using a standard PostgreSQL outer join:

 test=# SELECT *
        FROM  generate_series('2012-04-04 00:00'::timestamp,
                              '2012-04-04 06:00'::timestamp,
                              '1 hour') AS x
        LEFT JOIN t_data 
             ON (date_trunc('hour', t) = x);
        x            |         t           | payload
---------------------+---------------------+---------
 2012-04-04 00:00:00 |                     |
 2012-04-04 01:00:00 |                     |
 2012-04-04 02:00:00 |                     |
 2012-04-04 03:00:00 | 2012-04-04 03:12:00 | data 1
 2012-04-04 04:00:00 | 2012-04-04 04:16:00 | data 2
 2012-04-04 04:00:00 | 2012-04-04 04:28:00 | data 3
 2012-04-04 05:00:00 |                     |
 2012-04-04 06:00:00 | 2012-04-04 06:45:00 | data 4
(8 rows)

test=# SELECT *

FROM generate_series('2012-04-04 00:00'::timestamp,

'2012-04-04 06:00'::timestamp,

'1 hour') AS x

LEFT JOIN t_data

ON (date_trunc('hour', t) = x);

x | t | payload

---------------------+---------------------+---------

2012-04-04 00:00:00 | |

2012-04-04 01:00:00 | |

2012-04-04 02:00:00 | |

2012-04-04 03:00:00 | 2012-04-04 03:12:00 | data 1

2012-04-04 04:00:00 | 2012-04-04 04:16:00 | data 2

2012-04-04 04:00:00 | 2012-04-04 04:28:00 | data 3

2012-04-04 05:00:00 | |

2012-04-04 06:00:00 | 2012-04-04 06:45:00 | data 4

(8 rows)

We'll run date_trunc to make sure that the precision of our dates is actually in hours. Then we can use the date column as the join key. Note that the non-existing fields on the “data” side of the join will stay empty.

Watch out for trouble – be precise

Once we have joined the data we have to aggregate to fetch the final result. When doing this aggregation, watch out for a common mistake:

test=# SELECT x, count(*)
       FROM   generate_series('2012-04-04 00:00'::timestamp,
                              '2012-04-04 06:00'::timestamp,
                              '1 hour') AS x
       LEFT JOIN t_data
            ON (date_trunc('hour', t) = x)
       GROUP BY 1;
 x                   | count
---------------------+-------
 2012-04-04 00:00:00 | 1
 2012-04-04 01:00:00 | 1
 2012-04-04 02:00:00 | 1
 2012-04-04 03:00:00 | 1
 2012-04-04 04:00:00 | 2
 2012-04-04 05:00:00 | 1
 2012-04-04 06:00:00 | 1
(7 rows)

test=# SELECT x, count(*)

FROM generate_series('2012-04-04 00:00'::timestamp,

'2012-04-04 06:00'::timestamp,

'1 hour') AS x

LEFT JOIN t_data

ON (date_trunc('hour', t) = x)

GROUP BY 1;

x | count

---------------------+-------

2012-04-04 00:00:00 | 1

2012-04-04 01:00:00 | 1

2012-04-04 02:00:00 | 1

2012-04-04 03:00:00 | 1

2012-04-04 04:00:00 | 2

2012-04-04 05:00:00 | 1

2012-04-04 06:00:00 | 1

(7 rows)

Never EVER use count(*) along with an outer join.

The problem is, count(*) will count ROWS – we don't want to count rows because we're not allowed to count NULL values. As you can see, the result is just plain wrong.

To correct the problem, I advise you to count non-NULL values. In SQL, this means that you have to use count(column) rather than count(*). count(column) will only count values which are not NULL. That's exactly what we want:

test=# SELECT x, count(payload)
       FROM   generate_series('2012-04-04 00:00'::timestamp,
                              '2012-04-04 06:00'::timestamp,
                              '1 hour') AS x
             LEFT JOIN t_data
                  ON (date_trunc('hour', t) = x)
      GROUP BY 1;
       x             | count
---------------------+------- 
 2012-04-04 00:00:00 | 0
 2012-04-04 01:00:00 | 0
 2012-04-04 02:00:00 | 0
 2012-04-04 03:00:00 | 1
 2012-04-04 04:00:00 | 2
 2012-04-04 05:00:00 | 0
 2012-04-04 06:00:00 | 1
(7 rows)

test=# SELECT x, count(payload)

FROM generate_series('2012-04-04 00:00'::timestamp,

'2012-04-04 06:00'::timestamp,

'1 hour') AS x

LEFT JOIN t_data

ON (date_trunc('hour', t) = x)

GROUP BY 1;

x | count

---------------------+-------

2012-04-04 00:00:00 | 0

2012-04-04 01:00:00 | 0

2012-04-04 02:00:00 | 0

2012-04-04 03:00:00 | 1

2012-04-04 04:00:00 | 2

2012-04-04 05:00:00 | 0

2012-04-04 06:00:00 | 1

(7 rows)

Voilà, the result is exactly as desired.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

Comments are closed.

Stay tuned with our

Time in PostgreSQL: Outer joins

Sample data

The SQL challenge

Joining things together with an outer join

Watch out for trouble – be precise

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts