CYBERTEC Logo

PostgreSQL: Detecting periods of activity in a timeseries

05.2018 / Category: / Tags:

I have already written about timeseries and PostgreSQL in the past. However, recently I stumbled across an interesting problem, which caught my attention: Sometimes you might want to find “periods” of activity in a timeseries. For example: When was a user active? Or when did we receive data? This blog post tries to give you some ideas and shows, how you can actually approach this kind of problem.

Loading timeseries data into PostgreSQL

The next listing shows a little bit of sample data, which I used to write the SQL code you are about to see:

To make it easier, I just used two columns in my example. Note that my timeseries is not continuous but interrupted. There are three continuous periods in this set of data. Our goal is to find and isolate them to do analysis on each of those continuous periods.

PostgreSQL time series

Preparing for timeseries analysis

When dealing with timeseries, one of the most important things to learn is how to “look forward and backward”. In most cases, it is simply vital to compare the current line with the previous line. To do that in PostgreSQL (or in SQL in general) you can make use of the “lag” function:

As you can see, the last column contains the date of the previous row. Now: How does PostgreSQL know what the previous row actually is? The “ORDER BY”-clause will define exactly that.

Based on this query you have just seen it will be easy to calculate the size of the gap from one row to the next row

What we see now is the difference from one period to the next. That is pretty useful because we can start to create our rules. When do we consider a segment to be over and how long of a gap to we allow for before we consider it to be the next segment / period?

In my example I decided that every gap, which is longer than 2 days should trigger the creation of a new segment (or period): The next challenge is therefore to assign numbers to each period, which are about to detect. Once this is done, we can easily aggregate on the result. The way I have decided to do this is by using the sum function. Remember: When NULL is fed to an aggregate, the aggregate will ignore the input. Otherwise it will simply start to add up the input.

Here is the query:

As you can see the last column contains the period ID as generated by the sum function in our query. From now on analysis will be pretty simple as we can simply aggregate over this result using a simple subselect as shown in the next statement:

The result displays the sum of all data for each period. Of course you can also do more complicated stuff. However, the important thing is to understand, how you can actually detect various periods of continuous activity.

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
3 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Andreas Kretschmer
Andreas Kretschmer
5 years ago

yeah, nice, but your example doesn't work 😉

with your data and sql:

test=*# SELECT *, sum(CASE WHEN diff IS NULL
test(# OR diff = 2 then 1 else NULL end) over (order by t),0) from (select *, t - coalesce(lag(t,1) over (order by t),t) as diff from t_series) foo ;
t | data | diff | ?column?
------------+------+------+----------
2018-03-01 | 12 | 0 | 1
2018-03-02 | 43 | 1 | 1
2018-03-03 | 9 | 1 | 1
2018-03-04 | 13 | 1 | 1
2018-03-09 | 23 | 5 | 2
2018-03-10 | 26 | 1 | 2
2018-03-11 | 28 | 1 | 2
2018-03-14 | 21 | 3 | 3
2018-03-15 | 15 | 1 | 3
(9 Zeilen)

now i'm getting the expected result 😉 But anyway, nice article, thx.

vijay chennupati
vijay chennupati
5 years ago

Hi,

SELECT *, sum(CASE WHEN diff IS NULL
OR diff <2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
FROM t_series
) AS x;
is generating seven periods

1 12
2 43
3 9
4 36
5 26
6 49
7 15

but if I have changed your query as follows, I am getting the correct results.

select period, sum(data) from
(select *, sum(case when diff is null or diff < 2 then 0 else 1 end) over (order by t) as period
from (select *, t - lag(t, 1) over (order by t) as diff from t_series) as x) as y
group by period
order by period;

0 77
1 77
2 36

which is the one should I use?

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    3
    0
    Would love your thoughts, please comment.x
    ()
    x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram