Windowing functions and analytics have been around for quite some time, and many people in the PostgreSQL world already make use of this awesome stuff. Timeseries are an especially important area in this context. However, not all features have been widely adopted, so many developers painfully re-implement functionality at the application level instead of just using some of the more advanced SQL techniques.
The idea of this post is to demonstrate some of that advanced stuff, so that more people out there can make use of PostgreSQL's true power.
Preparing data for analytics
For the purpose of this post I have created a simple data set:
test=# CREATE TABLE t_demo AS
           SELECT ordinality, day, date_part('week', day) AS week
           FROM generate_series('2020-01-02', '2020-01-15', '1 day'::interval)
                WITH ORDINALITY AS day;
SELECT 14
In PostgreSQL, the generate_series function returns one row for each day between January 2nd and January 15th, 2020. The WITH ORDINALITY clause tells PostgreSQL to add an “id” column to the resultset of the function. The date_part function extracts the week number from our date; the purpose of this column is to have a couple of identical values in our timeseries.
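Note that date_part('week', ...) follows ISO 8601 numbering: weeks run Monday through Sunday, and week 1 is the week containing the year's first Thursday. A quick way to convince yourself (this standalone query is not part of the original example):

```sql
-- ISO weeks start on Monday: 2020-01-05 is a Sunday, 2020-01-06 a Monday.
SELECT date_part('week', DATE '2020-01-05') AS sunday,   -- still week 1
       date_part('week', DATE '2020-01-06') AS monday;   -- week 2 begins
```

This is why the week column in our data set changes from 1 to 2 on January 6th.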
In the next list you see the data set we will use:
test=# SELECT * FROM t_demo;
 ordinality |          day           | week
------------+------------------------+------
          1 | 2020-01-02 00:00:00+01 |    1
          2 | 2020-01-03 00:00:00+01 |    1
          3 | 2020-01-04 00:00:00+01 |    1
          4 | 2020-01-05 00:00:00+01 |    1
          5 | 2020-01-06 00:00:00+01 |    2
          6 | 2020-01-07 00:00:00+01 |    2
          7 | 2020-01-08 00:00:00+01 |    2
          8 | 2020-01-09 00:00:00+01 |    2
          9 | 2020-01-10 00:00:00+01 |    2
         10 | 2020-01-11 00:00:00+01 |    2
         11 | 2020-01-12 00:00:00+01 |    2
         12 | 2020-01-13 00:00:00+01 |    3
         13 | 2020-01-14 00:00:00+01 |    3
         14 | 2020-01-15 00:00:00+01 |    3
(14 rows)
Sliding windows: Moving over timeseries in SQL
One of the things you often have to do is to use a sliding window. In SQL this can be achieved easily using the OVER clause. Here is an example:
test=# SELECT *,
           array_agg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
           avg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
       FROM t_demo;
 ordinality |          day           | week | array_agg  |         avg
------------+------------------------+------+------------+---------------------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,2}      |  1.5000000000000000
          2 | 2020-01-03 00:00:00+01 |    1 | {1,2,3}    |  2.0000000000000000
          3 | 2020-01-04 00:00:00+01 |    1 | {2,3,4}    |  3.0000000000000000
          4 | 2020-01-05 00:00:00+01 |    1 | {3,4,5}    |  4.0000000000000000
          5 | 2020-01-06 00:00:00+01 |    2 | {4,5,6}    |  5.0000000000000000
          6 | 2020-01-07 00:00:00+01 |    2 | {5,6,7}    |  6.0000000000000000
          7 | 2020-01-08 00:00:00+01 |    2 | {6,7,8}    |  7.0000000000000000
          8 | 2020-01-09 00:00:00+01 |    2 | {7,8,9}    |  8.0000000000000000
          9 | 2020-01-10 00:00:00+01 |    2 | {8,9,10}   |  9.0000000000000000
         10 | 2020-01-11 00:00:00+01 |    2 | {9,10,11}  | 10.0000000000000000
         11 | 2020-01-12 00:00:00+01 |    2 | {10,11,12} | 11.0000000000000000
         12 | 2020-01-13 00:00:00+01 |    3 | {11,12,13} | 12.0000000000000000
         13 | 2020-01-14 00:00:00+01 |    3 | {12,13,14} | 13.0000000000000000
         14 | 2020-01-15 00:00:00+01 |    3 | {13,14}    | 13.5000000000000000
(14 rows)
The OVER clause allows you to define the data fed to the aggregate function. For the sake of simplicity, I have used the array_agg function, which simply returns the data fed to the aggregate as an array. In a real-life scenario you would use something more common such as avg, sum, min, max, or any other aggregate function. However, array_agg is pretty useful here because it shows exactly which values are passed to the function. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING tells the system that we want to use 3 rows: the previous row, the current one, and the one after the current row. At the beginning of the list there is no previous row, so we only see two values in the array column. At the end of the data set there are again only two, because there are no more rows after the last one.
In some cases it can be useful to exclude the current row from the data passed to the aggregate. The EXCLUDE CURRENT ROW clause has been developed to do exactly that. Here is an example:
test=# SELECT *,
           array_agg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
               EXCLUDE CURRENT ROW)
       FROM t_demo;
 ordinality |          day           | week | array_agg
------------+------------------------+------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 | {2}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,3}
          3 | 2020-01-04 00:00:00+01 |    1 | {2,4}
          4 | 2020-01-05 00:00:00+01 |    1 | {3,5}
          5 | 2020-01-06 00:00:00+01 |    2 | {4,6}
          6 | 2020-01-07 00:00:00+01 |    2 | {5,7}
          7 | 2020-01-08 00:00:00+01 |    2 | {6,8}
          8 | 2020-01-09 00:00:00+01 |    2 | {7,9}
          9 | 2020-01-10 00:00:00+01 |    2 | {8,10}
         10 | 2020-01-11 00:00:00+01 |    2 | {9,11}
         11 | 2020-01-12 00:00:00+01 |    2 | {10,12}
         12 | 2020-01-13 00:00:00+01 |    3 | {11,13}
         13 | 2020-01-14 00:00:00+01 |    3 | {12,14}
         14 | 2020-01-15 00:00:00+01 |    3 | {13}
(14 rows)
As you can see, the array is a little shorter now. The current value is not part of the array anymore.
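A practical use of EXCLUDE CURRENT ROW is comparing each reading to the average of its neighbors, e.g. as a crude outlier check. Here is a sketch against the same t_demo table (the column name deviation is my own choice, not from the original examples):

```sql
-- How far does each value sit from the average of its two neighbors?
-- EXCLUDE CURRENT ROW keeps the row itself out of the average.
SELECT day, ordinality,
       ordinality - avg(ordinality) OVER (ORDER BY day
           ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
           EXCLUDE CURRENT ROW) AS deviation
FROM t_demo;
```

In our perfectly linear data set the deviation is 0 everywhere except at the edges; on real timeseries data a large deviation flags a value that jumps away from its surroundings.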
Timeseries with SQL: Making use of EXCLUDE TIES
EXCLUDE TIES was added in PostgreSQL 11. The idea is to remove the current row's duplicates (its peers) from the frame. Let us take a look at the following example to make things clear:
test=# SELECT day, week,
           array_agg(week) OVER (ORDER BY week
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS all,
           array_agg(week) OVER (ORDER BY week
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE TIES) AS ties
       FROM t_demo;
          day           | week |     all     |  ties
------------------------+------+-------------+---------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,1,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,2,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)
The first array_agg simply collects all values in the frame we have defined. The “ties” column is a bit harder to understand, so let us take a look at the 5th of January. The result is {1,2,2}. As you can see, two occurrences of 1 have been removed: EXCLUDE TIES made sure that the current row's duplicates are gone. However, this has no impact on the values “in the future”, because those values differ from the current row. The documentation states what EXCLUDE TIES is all about: “EXCLUDE TIES excludes any peers of the current row from the frame, but not the current row itself.”
After this first example we should also consider a fairly common mistake:
test=# SELECT day, week,
           array_agg(week) OVER (ORDER BY week, day
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE TIES) AS ties
       FROM t_demo;
          day           | week |    ties
------------------------+------+-------------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}
(14 rows)
Can you spot the difference between this and the previous example? Take all the time you need …
The problem is that the ORDER BY clause now contains two columns. Since day is unique, no two rows are equal on the full sort key, so from the ORDER BY's perspective there are no peers anymore, and PostgreSQL does not prune any values from the resultset. This mistake is common, can be very subtle, and may go unnoticed for quite some time.
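The takeaway: peers are rows that are equal on every ORDER BY column inside the OVER clause. To get the deduplication back, restrict the window's sort key to the column on which rows can actually tie:

```sql
-- Ordering by week alone means rows with the same week are peers again,
-- so EXCLUDE TIES can prune them from the frame.
SELECT day, week,
       array_agg(week) OVER (ORDER BY week
           ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
           EXCLUDE TIES) AS ties
FROM t_demo;
```

If you need a deterministic row order in the output, sort the final resultset with an outer ORDER BY instead of adding tie-breaker columns inside the window definition.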
Excluding entire groups from a windowing function
In some cases one might want to remove an entire set of rows from the result set. To do that, you can make use of EXCLUDE GROUP. The following example shows how that works and how our timeseries data can be analyzed:
test=# SELECT *,
           array_agg(week) OVER (ORDER BY week
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE GROUP) AS week,
           array_agg(week) OVER (ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE GROUP) AS all
       FROM t_demo;
 ordinality |          day           | week | week  |    all
------------+------------------------+------+-------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 |       | {1,1}
          2 | 2020-01-03 00:00:00+01 |    1 |       | {1,1,1}
          3 | 2020-01-04 00:00:00+01 |    1 | {2}   | {1,1,1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {2,2} | {1,1,2,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1} | {1,1,2,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1}   | {1,2,2,2}
          7 | 2020-01-08 00:00:00+01 |    2 |       | {2,2,2,2}
          8 | 2020-01-09 00:00:00+01 |    2 |       | {2,2,2,2}
          9 | 2020-01-10 00:00:00+01 |    2 |       | {2,2,2,2}
         10 | 2020-01-11 00:00:00+01 |    2 | {3}   | {2,2,2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {3,3} | {2,2,3,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2} | {2,2,3,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2}   | {2,3,3}
         14 | 2020-01-15 00:00:00+01 |    3 |       | {3,3}
(14 rows)
The first aggregation orders by week, and array_agg aggregates that very same column. Here we see a couple of NULL values, because in some frames all entries are identical to the current row. The last row in the resultset is a good example: every entry in the frame is 3 and the current row is 3 as well, so EXCLUDE GROUP removes all of them. With no input rows left, array_agg returns NULL.
To calculate the last column, the data is ordered by day. In this case there are no duplicates and therefore nothing can be removed. As you can see, working with timeseries can be fairly easy and quite straightforward.
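On a related note: since PostgreSQL 11 the frame itself can also be defined in terms of peer groups instead of physical rows. With duplicated week values, a GROUPS frame is often closer to what you actually mean, because it counts whole sets of ties. A sketch against the same table (the column alias weeks is my own):

```sql
-- GROUPS counts peer groups, not rows: this frame spans the current week
-- plus one whole week before and after, however many rows each week has.
SELECT day, week,
       array_agg(week) OVER (ORDER BY week
           GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS weeks
FROM t_demo;
```

For the first row of our data set this collects every row from weeks 1 and 2 at once, whereas a ROWS frame would only ever see a fixed number of physical rows.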
DISTINCT: Dealing with duplicates from timeseries data
In PostgreSQL an aggregate function is capable of handling DISTINCT. The following example shows what I mean:
test=# SELECT DISTINCT week FROM t_demo;
 week
------
    3
    1
    2
(3 rows)

test=# SELECT count(DISTINCT week) FROM t_demo;
 count
-------
     3
(1 row)
“count” does not simply count all rows; it filters out the duplicates beforehand. Therefore the result is simply 3.
In PostgreSQL there is currently no way to use DISTINCT as part of a windowing function. PostgreSQL will simply error out:
test=# SELECT *,
           array_agg(DISTINCT week) OVER (ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
       FROM t_demo;
ERROR:  DISTINCT is not implemented for window functions
LINE 2:            array_agg(DISTINCT week) OVER (ORDER BY day ROWS ...
The natural question which arises is: how can we achieve the same result without using DISTINCT inside the windowing function? You have to filter the duplicates on a higher level: use a subselect, unnest the array, remove the duplicates, and assemble the array again. It is not hard to do, but it is not as elegant as one might expect.
test=# SELECT *,
           (SELECT array_agg(DISTINCT unnest)
            FROM unnest(x)) AS b
       FROM (SELECT *,
                array_agg(week) OVER (ORDER BY day
                    ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
             FROM t_demo
            ) AS a;
 ordinality |          day           | week |      x      |   b
------------+------------------------+------+-------------+-------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
          3 | 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
          7 | 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          8 | 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          9 | 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
         10 | 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
         14 | 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)
Further reading
In case you are interested in timeseries and aggregation in general, consider checking out some of our other blog posts including “Speeding up count(*)” by Laurenz Albe. If you are interested in high-performance PostgreSQL check out my blog post about finding slow queries.
If you find this post useful, consider sharing it with your friends. Follow us on Twitter, like us on Facebook and spread the news.