Generating simple data sets

Before we get started I want to introduce my favorite set-returning functions which can help you to generate sample data:


test=# SELECT * FROM generate_series(1, 10) AS x;
x
----
1
2
3
4
5
6
7
8
9
10
(10 rows)

All we do here is simply to generate a list from 1 to 10 and print it on the screen. Let us play around with windowing a bit now: There are two cases we got to keep in mind. If the OVER-clause is empty it means that the entire data set is used. If we use ORDER BY it is only the data set up to the point in the sorted list. The following listing contains an example:


test=# SELECT *,
array_agg(x) OVER (),
array_agg(x) OVER (ORDER BY x)
FROM generate_series(1, 10) AS x;
x | array_agg | array_agg
----+------------------------+------------------------
1 | {1,2,3,4,5,6,7,8,9,10} | {1}
2 | {1,2,3,4,5,6,7,8,9,10} | {1,2}
3 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3}
4 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4}
5 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4,5}
6 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4,5,6}
7 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4,5,6,7}
8 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4,5,6,7,8}
9 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4,5,6,7,8,9}
10 | {1,2,3,4,5,6,7,8,9,10} | {1,2,3,4,5,6,7,8,9,10}
(10 rows)

As you can see the last column keeps accumulating more values.

PostgreSQL: ROWS BETWEEN … PRECEDING …. AND … FOLLOWING

Often it is necessary to limit the set of data used by the aggregation function. ROWS BETWEEN … PRECEDING … AND … FOLLOWING allows you to do exactly that. The following example shows how this works:


test=# SELECT *, array_agg(x) OVER (ORDER BY x ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) FROM generate_series(1, 10) AS x;
x | array_agg
----+-----------
1 | {1,2}
2 | {1,2,3}
3 | {2,3,4}
4 | {3,4,5}
5 | {4,5,6}
6 | {5,6,7}
7 | {6,7,8}
8 | {7,8,9}
9 | {8,9,10}
10 | {9,10}
(10 rows)

What you see is that the data fed to array_agg is seriously restricted. But, the restriction we are using here is a static one. The constants are hardwired. In some cases you might need more flexibility.

Configuring windowing and analytics

More often than not configuration has to be determined on the fly for whatever reason. The beauty is that in PostgreSQL you can use a subselect as part of the OVER-clause which gives you a lot of flexibility.

Before we move on to a demo we got to create a configuration table:


test=# CREATE TABLE t_config (key text, val int);
CREATE TABLE
test=# INSERT INTO t_config VALUES ('before', 1), ('after', 2);
INSERT 0 2

To make it simple I have simply created two entries. The following SELECT statement uses those configuration parameter to do its magic. Here is how it works:


test=# SELECT *, array_agg(x) OVER (ORDER BY x ROWS BETWEEN (SELECT val FROM t_config WHERE key = 'before') PRECEDING AND (SELECT val FROM t_config WHERE key = 'after') FOLLOWING) FROM generate_series(1, 10) AS x;
x | array_agg
----+------------
1 | {1,2,3}
2 | {1,2,3,4}
3 | {2,3,4,5}
4 | {3,4,5,6}
5 | {4,5,6,7}
6 | {5,6,7,8}
7 | {6,7,8,9}
8 | {7,8,9,10}
9 | {8,9,10}
10 | {9,10}
(10 rows)

As you can see the query performs as expected and can be configured dynamically.

Before leaving you with a reference to some other blog post I want to focus your intention on something else: PARTITION BY cannot simply take a column – it can also take an expression to split the data sets. Many people are not aware of this feature which is actually quite useful for many reasons. Here is an example:


test=# SELECT *, array_agg(x) OVER (PARTITION BY x % 2) FROM generate_series(1, 10) AS x;
x | array_agg
----+--------------
10 | {10,2,4,6,8}
2 | {10,2,4,6,8}
4 | {10,2,4,6,8}
6 | {10,2,4,6,8}
8 | {10,2,4,6,8}
9 | {9,7,3,1,5}
7 | {9,7,3,1,5}
3 | {9,7,3,1,5}
1 | {9,7,3,1,5}
5 | {9,7,3,1,5}
(10 rows)

In this case we have split the data into odd and even numbers easily. What I want to point out here is that PostgreSQL offers a lot of flexibility and we encourage you to test out such things.

 

configuring windowing functions

Finally …

Windowing functions are an important aspect and are super important if you are working with all kinds of data. Sometimes you want to find out more about timeseries. One thing we have seen quite often recently is to count how often somebody was active for some time. “Detecting continuous periods of activity” will show you how to calculate these things in PostgreSQL easily.