SQL is a must, if you want to be a Data Analyst or a Data Scientist. However, every once in a while people are wondering, why a result is the way it is. While on the road in Berlin (Germany) the other day, I have found a fairly interesting scenario, which is pretty counterintuitive to most people and which might be worth sharing.
Windowing function: What PostgreSQL does and why
PostgreSQL has provided windowing functions and analytics for quite some time now and this vital feature has been widely adopted by users, who are using PostgreSQL or SQL in general for more than just trivial queries. A modern database is just so much more than a simple data store and windowing functions are therefore certainly something to look into.
So, what is this post all about? Here is an example:
test=# SELECT *, first_value(x) OVER (ORDER BY x) FROM generate_series(1, 5) AS x; x | first_value ---+------------- 1 | 1 2 | 1 3 | 1 4 | 1 5 | 1 (5 rows)
What we want, is the first value in our data set. The ORDER BY clause will ensure that data is fed to first_value in the right order. The result is therefore not surprising.
The same applies if we add DESC to our ORDER BY. The result is totally obvious:
test=# SELECT *, first_value(x) OVER (ORDER BY x), first_value(x) OVER (ORDER BY x DESC) FROM generate_series(1, 5) AS x; x | first_value | first_value ---+-------------+------------- 5 | 1 | 5 4 | 1 | 5 3 | 1 | 5 2 | 1 | 5 1 | 1 | 5 (5 rows)
last_value: Unexpected results
However, what if we use last_value. Here is an example:
test=# SELECT *, last_value(x) OVER (ORDER BY x), last_value(x) OVER (ORDER BY x DESC) FROM generate_series(1, 5) AS x; x | last_value | last_value ---+------------+------------ 5 | 5 | 5 4 | 4 | 4 3 | 3 | 3 2 | 2 | 2 1 | 1 | 1 (5 rows)
What you can see here is that both columns will return the SAME data – regardless of the different sort order provided by the ORDER BY clause. That comes as a surprise to most people. Actually most people would accept one column to contain only “5” and the other column to contain only “1”.
Why is that not the case? Here is the answer:
test=# SELECT *, array_agg(x) OVER (ORDER BY x), array_agg(x) OVER (ORDER BY x DESC) FROM generate_series(1, 5) AS x; x | array_agg | array_agg ---+-------------+------------- 5 | {1,2,3,4,5} | {5} 4 | {1,2,3,4} | {5,4} 3 | {1,2,3} | {5,4,3} 2 | {1,2} | {5,4,3,2} 1 | {1} | {5,4,3,2,1} (5 rows)
Let us take a look at which values last_value will actually see: array_agg will simply put them all into an array so that we can expect things in detail. As you can see, the last value in the array is identical in both cases, which means that both columns will produce exactly the identical output.