As a professional PostgreSQL support company we see a lot of SQL performance stuff, which is often worth sharing. One of those noteworthy things, which might be interesting to you, the reader, happened this week when I was out on a business trip to Berlin, Germany. This (excellent) customer was making extensive use of windowing functions and analytics. However, there is always room to speed things up.

“Improving SQL performance” or “how to trick the optimizer”

Most people simply write their SQL code and execute it assuming that the optimizer will take care of things on its own. While this is usually the case there are still corner cases, where some clever optimization – done by a professional – can give you an edge and better database performance.

One way to improve speed is by rearranging windowing functions in a clever way. Consider the following example:

test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data SELECT * FROM generate_series(1, 5);
INSERT 0 5

Our example is pretty simple: All we need is a table containing 5 rows:

test=# SELECT * FROM data;
id
----
 1
 2
 3
 4
 5
(5 rows)

Let us take a look at a simple example now:

test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data;
 id | array_agg
----+-------------
 1  | {1}
 2  | {1,2}
 3  | {1,2,3}
 4  | {1,2,3,4}
 5  | {1,2,3,4,5}
(5 rows)

What we have here is a simple aggregation. For the sake of simplicity I have used array_agg, which simply shows, how our data is aggregated. Of course we could also use min, max, sum, count or any other window function.
Let me add a second column to the example:

test=# SELECT *,
        array_agg(id) OVER (ORDER BY id),
        array_agg(id) OVER (ORDER BY id DESC)
FROM    data;
 id | array_agg   | array_agg
----+-------------+-------------
 5  | {1,2,3,4,5} | {5}
 4  | {1,2,3,4}   | {5,4}
 3  | {1,2,3}     | {5,4,3}
 2  | {1,2}       | {5,4,3,2}
 1  | {1}         | {5,4,3,2,1}
(5 rows)

In this case there are two columns with two different OVER-clauses. Note that those two aggregations are using different sort orders. One column needs ascending data and one needs descending data.

To understand what is really going on here, we can take a look at the execution plan provided by PostgreSQL:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id),
       array_agg(id) OVER (ORDER BY id DESC)
FROM  data
ORDER BY id;
                               QUERY PLAN
-----------------------------------------------------------------
Sort (cost=557.60..563.97 rows=2550 width=68)
 Sort Key: id
 <- WindowAgg (cost=368.69..413.32 rows=2550 width=68)
    <- Sort (cost=368.69..375.07 rows=2550 width=36)
       Sort Key: id DESC
       <- WindowAgg (cost=179.78..224.41 rows=2550 width=36)
          <- Sort (cost=179.78..186.16 rows=2550 width=4)
             Sort Key: id
             <- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4)
(9 rows)

First of all PostgreSQL has to read the data and sort by “id”. This sorted data is fed to the first aggregation before it is passed on to the next sort step (to sort descendingly). The second sort passes its output to its window aggregation. Finally the data is again sorted by id because we want the final output to be ordered by the first column. Overall our data has to be sorted three times, which is not a good thing to do.

Optimizing windowing and avoiding excessive sorting

Sorting data three times is clearly not a good idea. Maybe we can do better. Let us simply swap two columns in the SELECT clause:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id DESC),
       array_agg(id) OVER (ORDER BY id)
FROM   data
ORDER BY id;
                               QUERY PLAN
-------------------------------------------------------------------
 WindowAgg (cost=368.69..413.32 rows=2550 width=68)
  <- Sort (cost=368.69..375.07 rows=2550 width=36)
    Sort Key: id
    <- WindowAgg (cost=179.78..224.41 rows=2550 width=36)
       <- Sort (cost=179.78..186.16 rows=2550 width=4)
          Sort Key: id DESC
          <- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4)
(7 rows)

Wow, by making this little change we have actually managed to skip one sort step. First of all the data is sorted descendingly because we need it for the first windowing function. However, the next column will need data in exactly the same order as the final ORDER BY at the end of the query. PostgreSQL knows that and can already use the sorted input. If you are processing a big data set, this kind of optimization can make a huge difference and speed up your queries tremendously.

At this point PostgreSQL is not able (yet?) to make those adjustments for you so some manual improvements will definitely help. Try to adjust your windowing functions in a way that columns needing identical sorting are actually next to each other.