UPDATED July 2023: Window functions and analytics have been around for quite some time, and many people in the PostgreSQL world already make use of them. Timeseries are an especially important use case in this context. However, not all features have been widely adopted, so many developers still painfully implement this functionality at the application level instead of simply using some of the more advanced SQL techniques.

The idea of this blog post is to demonstrate some of these advanced analytics, so that more people out there can make use of PostgreSQL’s true power.

Preparing data for analytics

For the purpose of this post I have created a basic data set:

test=# CREATE TABLE t_demo AS 
	SELECT   ordinality, day, date_part('week', day) AS week 
	FROM    generate_series('2020-01-02', '2020-01-15', '1 day'::interval) 
			WITH ORDINALITY AS day;
SELECT 14

Useful functions

In PostgreSQL, the generate_series function returns one row per day, spanning January 2nd, 2020 to January 15th, 2020. The WITH ORDINALITY clause tells PostgreSQL to add an “id” column to the result set of the function. The date_part function extracts the week number from our date. The purpose of this column is to have a couple of identical values in our timeseries.

In the next list you see the data set we’ll use:

test=# SELECT * FROM t_demo;
 ordinality |          day           | week 
------------+------------------------+------
          1 | 2020-01-02 00:00:00+01 |    1
          2 | 2020-01-03 00:00:00+01 |    1
          3 | 2020-01-04 00:00:00+01 |    1
          4 | 2020-01-05 00:00:00+01 |    1
          5 | 2020-01-06 00:00:00+01 |    2
          6 | 2020-01-07 00:00:00+01 |    2
          7 | 2020-01-08 00:00:00+01 |    2
          8 | 2020-01-09 00:00:00+01 |    2
          9 | 2020-01-10 00:00:00+01 |    2
         10 | 2020-01-11 00:00:00+01 |    2
         11 | 2020-01-12 00:00:00+01 |    2
         12 | 2020-01-13 00:00:00+01 |    3
         13 | 2020-01-14 00:00:00+01 |    3
         14 | 2020-01-15 00:00:00+01 |    3
(14 rows)

Sliding windows: Moving over timeseries in SQL

One of the things you often have to do with timeseries is to use a sliding window. In SQL this can easily be achieved using the OVER clause. Here is an example:

test=# SELECT *, 
        array_agg(ordinality) 
                OVER (ORDER BY day ROWS 
                         BETWEEN 1 PRECEDING AND 1 FOLLOWING),
        avg(ordinality) 
                OVER (ORDER BY day ROWS 
                         BETWEEN 1 PRECEDING AND 1 FOLLOWING)
FROM t_demo;
 ordinality |          day           | week | array_agg  |         avg         
------------+------------------------+------+------------+---------------------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,2}      |  1.5000000000000000
          2 | 2020-01-03 00:00:00+01 |    1 | {1,2,3}    |  2.0000000000000000
          3 | 2020-01-04 00:00:00+01 |    1 | {2,3,4}    |  3.0000000000000000
          4 | 2020-01-05 00:00:00+01 |    1 | {3,4,5}    |  4.0000000000000000
          5 | 2020-01-06 00:00:00+01 |    2 | {4,5,6}    |  5.0000000000000000
          6 | 2020-01-07 00:00:00+01 |    2 | {5,6,7}    |  6.0000000000000000
          7 | 2020-01-08 00:00:00+01 |    2 | {6,7,8}    |  7.0000000000000000
          8 | 2020-01-09 00:00:00+01 |    2 | {7,8,9}    |  8.0000000000000000
          9 | 2020-01-10 00:00:00+01 |    2 | {8,9,10}   |  9.0000000000000000
         10 | 2020-01-11 00:00:00+01 |    2 | {9,10,11}  | 10.0000000000000000
         11 | 2020-01-12 00:00:00+01 |    2 | {10,11,12} | 11.0000000000000000
         12 | 2020-01-13 00:00:00+01 |    3 | {11,12,13} | 12.0000000000000000
         13 | 2020-01-14 00:00:00+01 |    3 | {12,13,14} | 13.0000000000000000
         14 | 2020-01-15 00:00:00+01 |    3 | {13,14}    | 13.5000000000000000
(14 rows)

OVER and array_agg

The OVER clause feeds the data to the aggregate function. For the sake of simplicity, I have used the array_agg function, which simply returns the data fed to the aggregate as an array. In a real-life scenario you would use something more common such as avg, sum, min, max, or any other aggregate function. However, array_agg is pretty useful here, because it shows exactly which data is passed to the function. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING tells the system that we want to use 3 rows: the previous row, the current one, and the one after the current row. At the beginning of the list there is no previous row, so we only see two values in the array column. At the end of the data set there are again only two rows, because there are no more values after the last one.
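To illustrate the same sliding window with a more typical aggregate, here is a rolling sum over the same frame (a sketch along the lines of the query above; the alias rolling_sum is my own choice):

test=# SELECT day, ordinality,
        sum(ordinality) 
                OVER (ORDER BY day ROWS 
                         BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS rolling_sum
FROM t_demo;

As before, the frame is truncated at the edges: the first and last rows sum only two values (3 and 27 respectively), while all rows in between sum three.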

The EXCLUDE CURRENT ROW clause

In some cases, it can be useful to exclude the current row from the data passed to the aggregate. The EXCLUDE CURRENT ROW clause has been developed to do exactly that. Here is an example:

test=# SELECT *, array_agg(ordinality) 
		OVER (ORDER BY day ROWS 
			BETWEEN 1 PRECEDING AND 1 FOLLOWING 
			EXCLUDE CURRENT ROW) 
	FROM  t_demo;
 ordinality |          day           | week | array_agg 
------------+------------------------+------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 | {2}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,3}
          3 | 2020-01-04 00:00:00+01 |    1 | {2,4}
          4 | 2020-01-05 00:00:00+01 |    1 | {3,5}
          5 | 2020-01-06 00:00:00+01 |    2 | {4,6}
          6 | 2020-01-07 00:00:00+01 |    2 | {5,7}
          7 | 2020-01-08 00:00:00+01 |    2 | {6,8}
          8 | 2020-01-09 00:00:00+01 |    2 | {7,9}
          9 | 2020-01-10 00:00:00+01 |    2 | {8,10}
         10 | 2020-01-11 00:00:00+01 |    2 | {9,11}
         11 | 2020-01-12 00:00:00+01 |    2 | {10,12}
         12 | 2020-01-13 00:00:00+01 |    3 | {11,13}
         13 | 2020-01-14 00:00:00+01 |    3 | {12,14}
         14 | 2020-01-15 00:00:00+01 |    3 | {13}
(14 rows)

As you can see, the array is a little shorter now. The current value is not part of the array anymore.
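A practical application of EXCLUDE CURRENT ROW (a sketch of my own, not part of the original example) is to compare each value to the average of its neighbors, which is a simple building block for outlier detection:

test=# SELECT day, ordinality,
        avg(ordinality) 
                OVER (ORDER BY day ROWS 
                         BETWEEN 2 PRECEDING AND 2 FOLLOWING 
                         EXCLUDE CURRENT ROW) AS neighbor_avg
FROM t_demo;

If a value deviates strongly from neighbor_avg, the row is a candidate outlier. Including the current row in the average would dilute exactly the deviation we are looking for.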

Timeseries with SQL: Making use of EXCLUDE TIES

The idea behind EXCLUDE TIES is to remove duplicates of the current row (its peers according to the ORDER BY) from the frame. Let’s take a look at the following example to make things clear:

test=# SELECT   day, week, 
        array_agg(week) OVER (ORDER BY week ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS all,
        array_agg(week) OVER (ORDER BY week ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE TIES) AS ties
FROM    t_demo;
          day           | week |     all     |  ties   
------------------------+------+-------------+---------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,1,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,2,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)

EXCLUDE TIES: Explanation

The first array_agg simply collects all values in the frame we have defined. The “ties” column is a bit harder to understand: let’s take a look at the 5th of January. The result is {1,2,2}. As you can see, the two other incarnations of 1 have been removed. EXCLUDE TIES made sure that those duplicates are gone. However, this has no impact on the values that follow, because those are different from the current row. The documentation states what EXCLUDE TIES is all about: “EXCLUDE TIES excludes any peers of the current row from the frame, but not the current row itself.”

After this first example, we should also consider a fairly common mistake:

test=# SELECT   day, week, 
        array_agg(week) OVER (ORDER BY week, day ROWS 
                BETWEEN 2 PRECEDING AND 2 FOLLOWING EXCLUDE TIES) AS ties
FROM    t_demo;
          day           | week |    ties     
------------------------+------+-------------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}
(14 rows)

Spot the difference – find the mistake

Can you spot the difference between this and the previous example? Take all the time you need …
The problem is that the ORDER BY clause has two columns in this case. Because (week, day) is unique, there are no duplicates anymore from the ORDER BY’s perspective, so PostgreSQL does not prune any values from the result set. This is a common mistake, and it can be subtle enough to go unnoticed for quite some time.
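If you need a frame that is genuinely aware of duplicates in the ORDER BY column, PostgreSQL (version 11 and above) also offers the GROUPS frame mode, which counts peer groups instead of individual rows. Here is a sketch, not part of the original example:

test=# SELECT day, week,
        array_agg(week) 
                OVER (ORDER BY week GROUPS 
                         BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS grp
FROM t_demo;

With GROUPS, the frame always spans the previous group, the current group, and the next group of identical week values, regardless of how many rows each group contains.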

Excluding entire groups from a window function

In some cases you might want to remove an entire set of rows from the result set. To do that, you can make use of EXCLUDE GROUP.

The following example shows how that works, and how our timeseries data can be analyzed:

test=# SELECT   *, 
        array_agg(week) OVER (ORDER BY week ROWS 
                                BETWEEN 2 PRECEDING 
			AND 2 FOLLOWING EXCLUDE GROUP) AS week,
        array_agg(week) OVER (ORDER BY day ROWS 
                                BETWEEN 2 PRECEDING 
			AND 2 FOLLOWING EXCLUDE GROUP) AS all
FROM    t_demo;
 ordinality |          day           | week | week  |    all    
------------+------------------------+------+-------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 |       | {1,1}
          2 | 2020-01-03 00:00:00+01 |    1 |       | {1,1,1}
          3 | 2020-01-04 00:00:00+01 |    1 | {2}   | {1,1,1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {2,2} | {1,1,2,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1} | {1,1,2,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1}   | {1,2,2,2}
          7 | 2020-01-08 00:00:00+01 |    2 |       | {2,2,2,2}
          8 | 2020-01-09 00:00:00+01 |    2 |       | {2,2,2,2}
          9 | 2020-01-10 00:00:00+01 |    2 |       | {2,2,2,2}
         10 | 2020-01-11 00:00:00+01 |    2 | {3}   | {2,2,2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {3,3} | {2,2,3,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2} | {2,2,3,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2}   | {2,3,3}
         14 | 2020-01-15 00:00:00+01 |    3 |       | {3,3}
(14 rows)

Understanding the results

The first aggregation function orders by week, and the array_agg aggregates that very same column. In this case we see a couple of NULL values, because sometimes all entries within the frame are identical. The last row in the result set is a good example: all entries in the frame are 3 and the current row contains 3 as well, so every incarnation of 3 is removed. Aggregating over zero rows yields NULL, which is displayed as an empty field.

To calculate the last column, the data is ordered by day. Since every day is unique, each row is its own peer group, and EXCLUDE GROUP therefore removes only the current row itself. For more information on EXCLUDE GROUP, see the documentation on window functions.

DISTINCT: Dealing with duplicates from timeseries data

In PostgreSQL, an aggregate function is capable of handling DISTINCT. The following example shows what I mean:

test=# SELECT DISTINCT week FROM t_demo;
 week 
------
    3
    1
    2
(3 rows)

test=# SELECT count(DISTINCT week) FROM t_demo;
 count 
-------
     3
(1 row)

What does count do?

“count” does not simply count all rows: with DISTINCT, the duplicates are filtered out beforehand, so the result is simply 3.
However, in PostgreSQL there is no way to use DISTINCT as part of a window function. PostgreSQL will simply error out:

test=# SELECT   *, 
        array_agg(DISTINCT week) OVER (ORDER BY day ROWS 
                                BETWEEN 2 PRECEDING AND 2 FOLLOWING)
FROM    t_demo;
ERROR:  DISTINCT is not implemented for window functions
LINE 2:  array_agg(DISTINCT week) OVER (ORDER BY day ROWS 

Avoiding DISTINCT

The natural question that arises is: how can we achieve the same result without using DISTINCT inside the window function? The answer is to filter the duplicates at a higher level: use a subselect, unnest the array, remove the duplicates, and assemble the array again. It is not hard to do, but it is not as elegant as one might expect.

test=# SELECT  *, (SELECT array_agg(DISTINCT unnest) FROM unnest(x)) AS b
FROM 
(
        SELECT  *, 
                array_agg(week) OVER (ORDER BY day ROWS 
                        BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
        FROM    t_demo
) AS a;
 ordinality |          day           | week |      x      |   b   
------------+------------------------+------+-------------+-------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
          3 | 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
          7 | 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          8 | 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          9 | 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
         10 | 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
         14 | 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)

Further reading

In case you are interested in timeseries and aggregation in general, consider checking out some of our other blog posts including “Speeding up count(*)” by Laurenz Albe. If you are interested in high-performance PostgreSQL check out my blog post about finding slow queries.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.