Windowing functions and analytics have been around for quite some time, and many people in the PostgreSQL world already make use of them. Timeseries are an especially important area in this context. However, not all features have been widely adopted, and thus many developers end up implementing functionality at the application level in a painful way instead of just using some of the more advanced SQL techniques.

The idea of this blog post is to demonstrate some of these advanced techniques, so that more people out there can make use of PostgreSQL’s true power.

Preparing data for analytics

For the purpose of this post I have created a simple data set:

test=# CREATE TABLE t_demo AS 
	SELECT   ordinality, day, date_part('week', day) AS week 
	FROM    generate_series('2020-01-02', '2020-01-15', '1 day'::interval) 
			WITH ORDINALITY AS day;
SELECT 14

In PostgreSQL the generate_series function will return one row for each day between January 2nd, 2020 and January 15th, 2020. The WITH ORDINALITY clause tells PostgreSQL to add an “id” column to the result set of the function. The date_part function extracts the week number from our date. The purpose of this column is to have a couple of identical values in our timeseries.
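
If you want to double-check where those week numbers come from, a quick query will do. January 6th falls into ISO week 2, which matches the listing below:

test=# SELECT date_part('week', '2020-01-06'::date);
 date_part 
-----------
         2
(1 row)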

The following listing shows the data set we will use:

test=# SELECT * FROM t_demo;
 ordinality |          day           | week 
------------+------------------------+------
          1 | 2020-01-02 00:00:00+01 |    1
          2 | 2020-01-03 00:00:00+01 |    1
          3 | 2020-01-04 00:00:00+01 |    1
          4 | 2020-01-05 00:00:00+01 |    1
          5 | 2020-01-06 00:00:00+01 |    2
          6 | 2020-01-07 00:00:00+01 |    2
          7 | 2020-01-08 00:00:00+01 |    2
          8 | 2020-01-09 00:00:00+01 |    2
          9 | 2020-01-10 00:00:00+01 |    2
         10 | 2020-01-11 00:00:00+01 |    2
         11 | 2020-01-12 00:00:00+01 |    2
         12 | 2020-01-13 00:00:00+01 |    3
         13 | 2020-01-14 00:00:00+01 |    3
         14 | 2020-01-15 00:00:00+01 |    3
(14 rows)

Sliding windows: Moving over timeseries in SQL

One of the things you often have to do is to use a sliding window. In SQL this can be achieved easily using the OVER clause. Here is an example:

test=# SELECT *, 
        array_agg(ordinality) 
                OVER (ORDER BY day ROWS 
                         BETWEEN 1 PRECEDING AND 1 FOLLOWING),
        avg(ordinality) 
                OVER (ORDER BY day ROWS 
                         BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM t_demo;
 ordinality |          day           | week | array_agg  |         avg         
------------+------------------------+------+------------+---------------------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,2}      |  1.5000000000000000
          2 | 2020-01-03 00:00:00+01 |    1 | {1,2,3}    |  2.0000000000000000
          3 | 2020-01-04 00:00:00+01 |    1 | {2,3,4}    |  3.0000000000000000
          4 | 2020-01-05 00:00:00+01 |    1 | {3,4,5}    |  4.0000000000000000
          5 | 2020-01-06 00:00:00+01 |    2 | {4,5,6}    |  5.0000000000000000
          6 | 2020-01-07 00:00:00+01 |    2 | {5,6,7}    |  6.0000000000000000
          7 | 2020-01-08 00:00:00+01 |    2 | {6,7,8}    |  7.0000000000000000
          8 | 2020-01-09 00:00:00+01 |    2 | {7,8,9}    |  8.0000000000000000
          9 | 2020-01-10 00:00:00+01 |    2 | {8,9,10}   |  9.0000000000000000
         10 | 2020-01-11 00:00:00+01 |    2 | {9,10,11}  | 10.0000000000000000
         11 | 2020-01-12 00:00:00+01 |    2 | {10,11,12} | 11.0000000000000000
         12 | 2020-01-13 00:00:00+01 |    3 | {11,12,13} | 12.0000000000000000
         13 | 2020-01-14 00:00:00+01 |    3 | {12,13,14} | 13.0000000000000000
         14 | 2020-01-15 00:00:00+01 |    3 | {13,14}    | 13.5000000000000000
(14 rows)

The OVER clause allows you to define the data fed to the aggregate function. For the sake of simplicity, I have used the array_agg function, which simply returns the data fed to the aggregate as an array. In a real-life scenario you would use something more common such as avg, sum, min, max, or any other aggregation function. However, array_agg is pretty useful in that it shows which data is really passed to the function and which values we have in use. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING tells the system that we want to use 3 rows: the previous row, the current row, and the row after the current one. At the beginning of the list there is no previous row, so we will only see two values in the array column. At the end of the data set there are also only two values, because there are no rows after the last one.
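
As a side note: the query above spells out the identical frame twice. PostgreSQL allows you to define the window once using a WINDOW clause and reference it by name. The following sketch is equivalent to the previous query:

test=# SELECT *, 
        array_agg(ordinality) OVER w, 
        avg(ordinality) OVER w 
FROM t_demo 
WINDOW w AS (ORDER BY day ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING);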

In some cases it can be useful to exclude the current row from the data passed to the aggregate. The EXCLUDE CURRENT ROW clause has been developed to do exactly that. Here is an example:

test=# SELECT *, array_agg(ordinality) 
		OVER (ORDER BY day ROWS 
			BETWEEN 1 PRECEDING AND 1 FOLLOWING 
			EXCLUDE CURRENT ROW) 
	FROM  t_demo;
 ordinality |          day           | week | array_agg 
------------+------------------------+------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 | {2}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,3}
          3 | 2020-01-04 00:00:00+01 |    1 | {2,4}
          4 | 2020-01-05 00:00:00+01 |    1 | {3,5}
          5 | 2020-01-06 00:00:00+01 |    2 | {4,6}
          6 | 2020-01-07 00:00:00+01 |    2 | {5,7}
          7 | 2020-01-08 00:00:00+01 |    2 | {6,8}
          8 | 2020-01-09 00:00:00+01 |    2 | {7,9}
          9 | 2020-01-10 00:00:00+01 |    2 | {8,10}
         10 | 2020-01-11 00:00:00+01 |    2 | {9,11}
         11 | 2020-01-12 00:00:00+01 |    2 | {10,12}
         12 | 2020-01-13 00:00:00+01 |    3 | {11,13}
         13 | 2020-01-14 00:00:00+01 |    3 | {12,14}
         14 | 2020-01-15 00:00:00+01 |    3 | {13}
(14 rows)

As you can see, the array is a little shorter now. The current value is not part of the array anymore.
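
Why is that useful? A typical application is to compare each value to the average of its neighbors, for example to spot outliers in a timeseries. Here is a minimal sketch based on our demo table (diff_to_neighbors is just an illustrative column name):

test=# SELECT day, ordinality, 
        ordinality - avg(ordinality) 
                OVER (ORDER BY day ROWS 
                        BETWEEN 1 PRECEDING AND 1 FOLLOWING 
                        EXCLUDE CURRENT ROW) AS diff_to_neighbors
FROM t_demo;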

Timeseries with SQL: Making use of EXCLUDE TIES

EXCLUDE TIES has recently been added to PostgreSQL. The idea is to remove duplicates (peers of the current row) from the window frame. Let us take a look at the following example to make things clear:

test=# SELECT   day, week, 
        array_agg(week) OVER (ORDER BY week ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS all,
        array_agg(week) OVER (ORDER BY week ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE TIES) AS ties
FROM    t_demo;
          day           | week |     all     |  ties   
------------------------+------+-------------+---------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,1,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,2,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)

The first array_agg simply collects all values in the frame we have defined. The “ties” column is a bit harder to understand. Let us take a look at the 5th of January: the result is {1,2,2}. As you can see, two incarnations of 1 have been removed; EXCLUDE TIES made sure that those duplicates are gone. However, this has no impact on the values in the “future”, because those values are different from the current row. The documentation states what EXCLUDE TIES is all about: “EXCLUDE TIES excludes any peers of the current row from the frame, but not the current row itself.”
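
For completeness: the default behavior, which keeps all rows in the frame, can also be spelled out explicitly as EXCLUDE NO OTHERS. The following sketch is simply a more verbose way to produce the “all” column shown above:

test=# SELECT   day, week, 
        array_agg(week) OVER (ORDER BY week ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING 
                EXCLUDE NO OTHERS) AS all
FROM    t_demo;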

After this first example we should also consider a fairly common mistake:

test=# SELECT   day, week, 
        array_agg(week) OVER (ORDER BY week, day ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE TIES) AS ties
FROM    t_demo;
          day           | week |    ties     
------------------------+------+-------------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}
(14 rows)

Can you spot the difference between this and the previous example? Take all the time you need …
The problem is that the ORDER BY clause contains two columns in this case. From the ORDER BY’s perspective there are no duplicates anymore, because every (week, day) combination is unique. Thus, PostgreSQL is not going to prune any values from the result set. If you want ties on the week to be pruned, the window has to be ordered by week alone, as in the previous example. This is a common mistake, and the problem can be very subtle and go unnoticed for quite some time.

Excluding entire groups from a windowing function

In some cases one might want to remove an entire group of rows, i.e. the current row and all of its peers, from the window frame. To do that, you can make use of EXCLUDE GROUP. The following example shows how that works and how our timeseries data can be analyzed:

test=# SELECT   *, 
        array_agg(week) OVER (ORDER BY week ROWS 
                                BETWEEN 2 PRECEDING 
			AND 2 FOLLOWING EXCLUDE GROUP) AS week,
        array_agg(week) OVER (ORDER BY day ROWS 
                                BETWEEN 2 PRECEDING 
			AND 2 FOLLOWING EXCLUDE GROUP) AS all
FROM    t_demo;
 ordinality |          day           | week | week  |    all    
------------+------------------------+------+-------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 |       | {1,1}
          2 | 2020-01-03 00:00:00+01 |    1 |       | {1,1,1}
          3 | 2020-01-04 00:00:00+01 |    1 | {2}   | {1,1,1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {2,2} | {1,1,2,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1} | {1,1,2,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1}   | {1,2,2,2}
          7 | 2020-01-08 00:00:00+01 |    2 |       | {2,2,2,2}
          8 | 2020-01-09 00:00:00+01 |    2 |       | {2,2,2,2}
          9 | 2020-01-10 00:00:00+01 |    2 |       | {2,2,2,2}
         10 | 2020-01-11 00:00:00+01 |    2 | {3}   | {2,2,2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {3,3} | {2,2,3,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2} | {2,2,3,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2}   | {2,3,3}
         14 | 2020-01-15 00:00:00+01 |    3 |       | {3,3}
(14 rows)

The first aggregation function orders by week, and array_agg aggregates that very same column. In this case we will see a couple of NULL values, because sometimes all entries within the frame are identical to the current row. The last row in the result set is a good example: all array entries are 3 and the current row contains 3 as well. Thus all incarnations of 3 are removed, which leaves the aggregate with no input at all, and array_agg over zero rows returns NULL.
To calculate the last column, the data is ordered by day. Since day is unique, the current row has no peers, so the excluded group consists of the current row alone; everything else stays in the frame. As you can see, working with timeseries can be fairly easy and quite straightforward.
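
In case those NULL values get in the way, for instance because downstream code expects a real array, you can simply wrap the window function in coalesce. Here is a small sketch that turns NULL into an empty array (week_no_group is just a name made up for this example):

test=# SELECT *, 
        coalesce(array_agg(week) OVER (ORDER BY week ROWS 
                        BETWEEN 2 PRECEDING 
                        AND 2 FOLLOWING EXCLUDE GROUP), '{}') AS week_no_group
FROM    t_demo;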

DISTINCT: Dealing with duplicates from timeseries data

In PostgreSQL an aggregate function is capable of handling DISTINCT. The following example shows what I mean:

test=# SELECT DISTINCT week FROM t_demo;
 week 
------
    3
    1
    2
(3 rows)

test=# SELECT count(DISTINCT week) FROM t_demo;
 count 
-------
     3
(1 row)

“count” does not simply count all values; it filters out the duplicates beforehand. Therefore the result is simply 3.
In PostgreSQL (as of the current version) there is no way to use DISTINCT as part of a windowing function. PostgreSQL will simply error out:

test=# SELECT   *, 
        array_agg(DISTINCT week) OVER (ORDER BY day ROWS 
                                BETWEEN 2 PRECEDING AND 2 FOLLOWING)
FROM    t_demo;
ERROR:  DISTINCT is not implemented for window functions
LINE 2:  array_agg(DISTINCT week) OVER (ORDER BY day ROWS 

The natural question which arises is: how can we achieve the same result without using DISTINCT inside the windowing function? What you have to do is filter the duplicates at a higher level: use a subselect, unnest the array, remove the duplicates and assemble the array again. It is not hard to do, but it is not as elegant as one might expect.

test=# SELECT  *, (SELECT array_agg(DISTINCT unnest) FROM unnest(x)) AS b
FROM 
(
        SELECT  *, 
                array_agg(week) OVER (ORDER BY day ROWS 
                        BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
        FROM    t_demo
) AS a;
 ordinality |          day           | week |      x      |   b   
------------+------------------------+------+-------------+-------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
          3 | 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
          7 | 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          8 | 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          9 | 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
         10 | 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
         14 | 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)
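
In case you need this kind of deduplication in more than one place, you can hide it in a small helper function. Note that array_uniq is not a built-in function; it is a hypothetical helper defined here for illustration:

test=# CREATE FUNCTION array_uniq(anyarray) RETURNS anyarray AS
$$
        -- unnest the array, deduplicate and re-aggregate
        SELECT array_agg(DISTINCT x) FROM unnest($1) AS t(x)
$$ LANGUAGE sql IMMUTABLE;
CREATE FUNCTION

test=# SELECT *, array_uniq(array_agg(week) OVER (ORDER BY day ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING)) AS b
FROM    t_demo;

This should yield the same “b” column as the subselect above, while keeping the main query a lot more readable.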

Further reading

In case you are interested in timeseries and aggregation in general, consider checking out some of our other blog posts, including “Speeding up count(*)” by Laurenz Albe. If you are interested in high-performance PostgreSQL, check out my blog post about finding slow queries.
