While doing PostgreSQL consulting for a German client, I stumbled over an interesting issue this week which might be worth sharing with some folks out on the Internet: it’s all about grouping.
Suppose you are measuring the same thing several times on different sensors every, say, 15 minutes. Maybe some temperature, some air pressure, or whatever. The data might look like this:
CREATE TABLE t_data (t time, val int);

COPY t_data FROM stdin;
14:00	12
14:01	22
14:01	43
14:14	32
14:15	33
14:16	27
14:30	19
\.
The human eye can instantly spot that 14:00 and 14:01 could be candidates for grouping (maybe the differences are just related to latency or some slightly inconsistent timing). The same applies to 14:14 to 14:16. You might want to have this data in the same group during aggregation.
The question now is: How can that be achieved with PostgreSQL?
Some dirty SQL trickery
The first thing to do is to check out the differences from one timestamp to the next:
SELECT *, lag(t, 1) OVER (ORDER BY t) FROM t_data;
The lag function offers a nice way to solve this kind of problem. Here is the output:
    t     | val |   lag
----------+-----+----------
 14:00:00 |  12 |
 14:01:00 |  22 | 14:00:00
 14:01:00 |  43 | 14:01:00
 14:14:00 |  32 | 14:01:00
 14:15:00 |  33 | 14:14:00
 14:16:00 |  27 | 14:15:00
 14:30:00 |  19 | 14:16:00
(7 rows)
Now that we have used lag to “move” the previous timestamp into the current row, there is a simple trick which can be applied. All we need is a sequence to hand out the group numbers:
CREATE SEQUENCE seq_a;

SELECT *,
       CASE WHEN t - lag < '10 minutes'
            THEN currval('seq_a')
            ELSE nextval('seq_a')
       END AS g
FROM ( SELECT *, lag(t, 1) OVER (ORDER BY t)
       FROM t_data
     ) AS x;
Moving the lag into a subselect allows us to build on its output and create those groups. The trick is: if the difference from one row to the next is too high, nextval starts a new group; otherwise currval keeps the row in the current group. Note that in the very first row lag is NULL, so the comparison does not evaluate to true and nextval conveniently opens the first group.
This leaves us with a simple result set:
    t     | val |   lag    | g
----------+-----+----------+---
 14:00:00 |  12 |          | 1
 14:01:00 |  22 | 14:00:00 | 1
 14:01:00 |  43 | 14:01:00 | 1
 14:14:00 |  32 | 14:01:00 | 2
 14:15:00 |  33 | 14:14:00 | 2
 14:16:00 |  27 | 14:15:00 | 2
 14:30:00 |  19 | 14:16:00 | 3
(7 rows)
From now on, life is simple. We can take this output and aggregate on it easily: “GROUP BY g” will give us one group for each value of “g”.
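For instance, a minimal sketch of such an aggregation might look like this (the choice of aggregates, min, count, and avg here, is just an assumption about what you might want to compute per group):

SELECT   g,
         min(t)   AS group_start,   -- first timestamp in the group
         count(*) AS group_size,    -- number of measurements in the group
         avg(val) AS avg_val        -- average measured value
FROM
(
    SELECT *,
           CASE WHEN t - lag < '10 minutes'
                THEN currval('seq_a')
                ELSE nextval('seq_a')
           END AS g
    FROM ( SELECT *, lag(t, 1) OVER (ORDER BY t)
           FROM t_data
         ) AS x
) AS y
GROUP BY g
ORDER BY g;

Keep in mind that the sequence keeps counting across executions, so the group numbers will differ from run to run; they are only meant to tell the groups apart, not to serve as stable identifiers.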