PostgreSQL: Speeding up analytics and window functions

This post will deal with how to speed up analytics and window functions in PostgreSQL. As a professional PostgreSQL support company, we see a lot of SQL performance stuff which is worth sharing. One of those noteworthy things happened this week when I was out on a business trip to Berlin, Germany. This (excellent) customer was making extensive use of window functions and analytics. However, there is always room to speed things up.

Table of Contents

“Improving SQL performance” or “how to trick the optimizer”

Most people simply write their SQL code and execute it, assuming that the optimizer will take care of things on its own. While this is usually the case there are still corner cases, where some clever optimization - done by a professional - can give you an edge and better database performance.

How to rearrange window functions for better performance

One way to improve speed is by rearranging window functions in a clever way. Consider the following example:

test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data SELECT * FROM generate_series(1, 5);
INSERT 0 5

test=# CREATE TABLE data (id int);

CREATE TABLE

test=# INSERT INTO data SELECT * FROM generate_series(1, 5);

INSERT 0 5

Our example is pretty simple: All we need is a table containing 5 rows:

test=# SELECT * FROM data;
id
----
 1
 2
 3
 4
 5
(5 rows)

test=# SELECT * FROM data;

----

(5 rows)

Let's take a look at an example:

test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data;
 id | array_agg
----+-------------
 1  | {1}
 2  | {1,2}
 3  | {1,2,3}
 4  | {1,2,3,4}
 5  | {1,2,3,4,5}
(5 rows)

test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data;

id | array_agg

----+-------------

1 | {1}

2 | {1,2}

3 | {1,2,3}

4 | {1,2,3,4}

5 | {1,2,3,4,5}

(5 rows)

What we have here is a simple aggregation. For the sake of simplicity I have used array_agg, which simply shows, how our data is aggregated. Of course we could also use min, max, sum, count or any other window function.

Now, let's add a second column to the example:

test=# SELECT *,
        array_agg(id) OVER (ORDER BY id),
        array_agg(id) OVER (ORDER BY id DESC)
FROM    data;
 id | array_agg   | array_agg
----+-------------+-------------
 5  | {1,2,3,4,5} | {5}
 4  | {1,2,3,4}   | {5,4}
 3  | {1,2,3}     | {5,4,3}
 2  | {1,2}       | {5,4,3,2}
 1  | {1}

test=# SELECT *,

array_agg(id) OVER (ORDER BY id),

array_agg(id) OVER (ORDER BY id DESC)

FROM data;

id | array_agg | array_agg

----+-------------+-------------

5 | {1,2,3,4,5} | {5}

4 | {1,2,3,4} | {5,4}

3 | {1,2,3} | {5,4,3}

2 | {1,2} | {5,4,3,2}

1 | {1}

In this case, there are two columns with two different OVER-clauses. Note that those two aggregations are using different sort orders. One column needs ascending data and one needs descending data.

To understand what is really going on here, we can take a look at the execution plan provided by PostgreSQL:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id),
       array_agg(id) OVER (ORDER BY id DESC)
FROM  data
ORDER BY id;
                               QUERY PLAN
-----------------------------------------------------------------
Sort (cost=557.60..563.97 rows=2550 width=68)
 Sort Key: id
 <- WindowAgg (cost=368.69..413.32 rows=2550 width=68)
    <- Sort (cost=368.69..375.07 rows=2550 width=36)
       Sort Key: id DESC
       <- WindowAgg (cost=179.78..224.41 rows=2550 width=36)
          <- Sort (cost=179.78..186.16 rows=2550 width=4)
             Sort Key: id
             <- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4)
(9 rows)

test=# explain

SELECT *,

array_agg(id) OVER (ORDER BY id),

array_agg(id) OVER (ORDER BY id DESC)

FROM data

ORDER BY id;

QUERY PLAN

-----------------------------------------------------------------

Sort (cost=557.60..563.97 rows=2550 width=68)

Sort Key: id

<- WindowAgg (cost=368.69..413.32 rows=2550 width=68)

<- Sort (cost=368.69..375.07 rows=2550 width=36)

Sort Key: id DESC

<- WindowAgg (cost=179.78..224.41 rows=2550 width=36)

<- Sort (cost=179.78..186.16 rows=2550 width=4)

Sort Key: id

<- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4)

(9 rows)

Sorting the data

First of all, PostgreSQL has to read the data and sort by “id”. This sorted data is fed to the first aggregation, before it is passed on to the next sort step (to sort descending). The second sort passes its output to its window aggregation. Finally, the data is again sorted by id because we want the final output to be ordered by the first column. Overall our data has to be sorted three times, which is not a good thing to do.

Optimizing window functions and avoiding excessive sorting

Sorting data three times is clearly not a good idea. Maybe we can do better. Let us simply swap two columns in the SELECT clause:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id DESC),
       array_agg(id) OVER (ORDER BY id)
FROM   data
ORDER BY id;
                               QUERY PLAN
-------------------------------------------------------------------
 WindowAgg (cost=368.69..413.32 rows=2550 width=68)
  <- Sort (cost=368.69..375.07 rows=2550 width=36)
    Sort Key: id
    <- WindowAgg (cost=179.78..224.41 rows=2550 width=36)
       <- Sort (cost=179.78..186.16 rows=2550 width=4)
          Sort Key: id DESC
          <- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4)
(7 rows)

test=# explain

SELECT *,

array_agg(id) OVER (ORDER BY id DESC),

array_agg(id) OVER (ORDER BY id)

FROM data

ORDER BY id;

QUERY PLAN

-------------------------------------------------------------------

WindowAgg (cost=368.69..413.32 rows=2550 width=68)

<- Sort (cost=368.69..375.07 rows=2550 width=36)

Sort Key: id

<- WindowAgg (cost=179.78..224.41 rows=2550 width=36)

<- Sort (cost=179.78..186.16 rows=2550 width=4)

Sort Key: id DESC

<- Seq Scan on data (cost=0.00..35.50 rows=2550 width=4)

(7 rows)

Skipping a sorting step

Wow, by making this little change we have actually managed to skip one sorting step. First of all, the data is sorted descending because we need it for the first window function. However, the next column will need data in exactly the same order as the final ORDER BY at the end of the query. PostgreSQL knows that and can already use the sorted input. If you are processing a big data set, this kind of optimization can make a huge difference and speed up your queries tremendously.

At this point PostgreSQL is not able (yet?) to make those adjustments for you so some manual improvements will definitely help. Try to adjust your window functions in a way that columns needing identical sorting are actually next to each other.

Read more about window functions in my post about SQL Trickery: Configuring Window Functions

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

PostgreSQL: Speeding up analytics and window functions

“Improving SQL performance” or “how to trick the optimizer”

How to rearrange window functions for better performance

Let's take a look at an example:

Now, let's add a second column to the example:

Sorting the data

Optimizing window functions and avoiding excessive sorting

Skipping a sorting step

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts