In many PostgreSQL databases, you never have to think or worry about tuning autovacuum. It runs automatically in the background and cleans up without getting in your way.
But sometimes the default configuration is not good enough, and you have to tune autovacuum to make it work properly. This article presents some typical problem scenarios and describes what to do in these cases.
There are many autovacuum configuration parameters, which makes tuning complicated. The main reason is that autovacuum has many different tasks. In a way, autovacuum has to fix all the problems arising from PostgreSQL's Multiversioning Concurrency Control (MVCC) implementation:

- it removes dead tuples left behind by UPDATE or DELETE operations
- it updates the visibility map, which enables index-only scans
- it freezes old tuples to prevent transaction ID wraparound
- ANALYZE runs to keep the table statistics updated

Depending on which of these functionalities cause a problem, you need different approaches to tuning autovacuum.
The best-known autovacuum task is cleaning up dead tuples from UPDATE or DELETE operations. If autovacuum cannot keep up with cleaning up dead tuples, you should follow these three tuning steps:
Check the known reasons that keep VACUUM from removing dead tuples. Most often, the culprits are long-running transactions. Unless you can remove these obstacles, tuning autovacuum will be useless.
If you cannot fight the problem at its root, you can use the configuration parameter idle_in_transaction_session_timeout to have PostgreSQL terminate sessions that stay “idle in transaction” for too long. That causes errors on the client side, but may be justified if you have no other way to keep your database operational. Similarly, to fight long-running queries, you can use statement_timeout.
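Both parameters can be set globally, per database, or per role, so you don't have to impose the limits on everyone. A sketch (the role names and values are placeholders - pick limits that fit your workload):

```sql
-- hypothetical role names; a value of 0 (the default) disables the timeout
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';
ALTER ROLE reporting SET statement_timeout = '30s';
```

Settings made with ALTER ROLE take effect for new sessions of that role; existing sessions keep their old values.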
If autovacuum cannot keep up with cleaning up dead tuples, the solution is to make it work faster. This may seem obvious, but many people fall into the trap of thinking that making autovacuum start earlier or run more often will solve the problem - it won't, because that does nothing to increase autovacuum's processing speed.
VACUUM is a resource-intensive operation, so autovacuum by default operates deliberately slowly. The goal is to have it work in the background without being in the way of normal database operation. But if your workload creates lots of dead tuples, you will have to make it more aggressive:

- increase autovacuum_vacuum_cost_limit from its default value of 200 (this is the gentle method)
- reduce autovacuum_vacuum_cost_delay from its default value of 2 (in older versions: 20!) milliseconds (this is the effective method)

Setting autovacuum_vacuum_cost_delay to 0 will make autovacuum as fast as a manual VACUUM – that is, as fast as possible.
Since not all tables grow dead tuples at the same pace, it is usually best not to change the global setting in postgresql.conf, but to change the setting individually for busy tables:
ALTER TABLE busy_table SET (autovacuum_vacuum_cost_delay = 1);
Partitioning a table can also help with getting the job done faster; see below for more.
If nothing else works, you have to see that fewer dead tuples are generated. Perhaps several UPDATEs to a single row could be combined into a single UPDATE?
Often you can significantly reduce the number of dead tuples by using “HOT updates”:

- set the fillfactor for the table to a value less than 100, so that INSERTs leave some free space in each block
- make sure that no column modified by the UPDATE is indexed

Then any SELECT or DML statement can clean up dead tuples, and there is less need for VACUUM.
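A sketch of the first point (the table name and the value of 70 are placeholders - the right fillfactor depends on your row size and update rate):

```sql
-- leave roughly 30% free space in each block for HOT updates
ALTER TABLE busy_table SET (fillfactor = 70);

-- only newly written blocks honor the new fillfactor;
-- rewrite the table if you want it applied to existing data:
VACUUM (FULL) busy_table;
```

Note that VACUUM (FULL) rewrites the whole table and takes an exclusive lock, so schedule it for a quiet period.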
The expensive part of an index scan is looking up the actual table rows. If all columns you want are in the index, it should not be necessary to visit the table at all. But in PostgreSQL you also have to check if a tuple is visible or not, and that information is only stored in the table.
To work around that, PostgreSQL has a “visibility map” for each table. If a table block is marked as “all visible” in the visibility map, you don't have to visit the table for the visibility information.
So to get true index-only scans, autovacuum has to process the table and update the visibility map frequently. How you configure autovacuum for that depends on the kind of data modifications the table receives:

Tables that receive UPDATEs or DELETEs

For these, you reduce autovacuum_vacuum_scale_factor for the table, for example:
ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.01);
It may be a good idea to also speed up autovacuum as described above.
Tables that receive only INSERTs

This is simple from v13 on: tune autovacuum_vacuum_insert_scale_factor as shown above for autovacuum_vacuum_scale_factor.
For older PostgreSQL versions, the best you can do is to significantly lower autovacuum_freeze_max_age
. The best value depends on the rate at which you consume transaction IDs. If you consume 100000 transaction IDs per day, and you want the table to be autovacuumed daily, you can set
ALTER TABLE insert_only SET (autovacuum_freeze_max_age = 100000);
To measure the rate of transaction ID consumption, use the function txid_current()
(or pg_current_xact_id()
from v13 on) twice with a longer time interval in between and take the difference.
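For example (on v13 or later; substitute txid_current() on older versions), two manual readings are enough:

```sql
-- first reading: note the timestamp and the transaction ID
SELECT now() AS ts, pg_current_xact_id() AS xid;

-- ... wait for a longer interval, for example a day ...

-- second reading: the consumption rate is (xid2 - xid1) per (ts2 - ts1)
SELECT now() AS ts, pg_current_xact_id() AS xid;
```

The function call itself consumes one transaction ID, which is negligible at any realistic rate.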
Normally, autovacuum takes care of freezing old tuples and starts a special “anti-wraparound” autovacuum worker whenever the oldest transaction ID in a table is older than autovacuum_freeze_max_age
transactions or the oldest multixact is older than autovacuum_multixact_freeze_max_age
transactions.
Again, you have to make sure that there is nothing that blocks autovacuum from freezing old tuples and advancing pg_database.datfrozenxid and pg_database.datminmxid. Such blockers are largely the same as the ones that keep VACUUM from removing dead tuples (see above).
Another possible blocker is data corruption that makes VACUUM fail; to prevent data corruption, use good hardware and always run the latest PostgreSQL minor release.
UPDATEs or DELETEs for anti-wraparound vacuum

On tables that receive UPDATEs or DELETEs, all you have to do is to see that autovacuum runs fast enough to get done in time (see above).
INSERTs for anti-wraparound vacuum

From PostgreSQL v13 on, there are no special considerations in this case, because you get regular autovacuum runs on such tables as well.
Before that, insert-only tables were problematic: since there are no dead tuples, normal autovacuum runs are never triggered. Then, as soon as autovacuum_freeze_max_age
or autovacuum_multixact_freeze_max_age
are exceeded, you may suddenly get a massive autovacuum run that freezes a whole large table, takes a long time and causes massive I/O.
To avoid that, reduce autovacuum_freeze_max_age
for such a table:
ALTER TABLE insert_only SET (autovacuum_freeze_max_age = 10000000);
With very big tables, it can be advisable to use partitioning. The advantage here is that you can have several autovacuum workers working on several partitions in parallel, so that the partitioned table as a whole is done faster than a single autovacuum worker could manage.
If you have many partitions, you should increase autovacuum_max_workers
, the maximum number of autovacuum workers.
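A sketch (the value of 10 is just an example):

```sql
-- or set it in postgresql.conf; changing this parameter requires a server restart
ALTER SYSTEM SET autovacuum_max_workers = 10;
```

Keep in mind that the autovacuum cost limit is shared among all active workers, so adding workers does not increase the total vacuum throughput unless you also raise autovacuum_vacuum_cost_limit.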
Partitioning can also help with vacuuming tables that receive lots of updates, as long as the updates affect all partitions.
Updating table statistics is a “side job” of autovacuum.
You know that automatic statistics collection does not happen often enough if your query plans get better after a manual ANALYZE
of the table.
In that case, you can lower autovacuum_analyze_scale_factor
so that autoanalyze processes the table more often:
ALTER TABLE mytable SET (autovacuum_analyze_scale_factor = 0.02);
An alternative is not to use the scale factor, but set autovacuum_analyze_threshold
, so that table statistics are calculated whenever a fixed number of rows changes. For example, to configure a table to be analyzed whenever more than a million rows change:
ALTER TABLE mytable SET (
   autovacuum_analyze_scale_factor = 0,
   autovacuum_analyze_threshold = 1000000
);
Depending on your specific problem and your PostgreSQL version, there are different tuning knobs to make autovacuum do its job correctly. The many tasks of autovacuum and the many configuration parameters don't make that any easier.
If the tips in this article are not enough for you, consider getting professional consulting.
GROUP BY is nothing new and is available in any relational database I am aware of. It is an integral part of SQL and PostgreSQL, but what many people might not know is that GROUP BY can do more than just group by simple fields. You can use expressions to group in an even more sophisticated way, and here is how:
To show you what you can do with GROUP BY I have compiled some test data:
test=# CREATE TABLE t_oil (
          region text,
          country text,
          year int,
          production int,
          consumption int
       );
CREATE TABLE
This data set is pretty easy to understand. It tells us how much oil was produced and consumed by which country in the past. To load the data you can either download the file or simply ask COPY to load it directly from the web.
test=# COPY t_oil FROM PROGRAM
          'curl https://www.cybertec-postgresql.com/secret/oil_ext.txt';
COPY 644
To use COPY … FROM PROGRAM you have to be a superuser; otherwise it does not work, for security reasons. In my example, 644 rows have been loaded successfully.
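If you are not a superuser, a possible workaround is psql's client-side \copy command, which runs the program on the client machine instead of the server:

```sql
\copy t_oil FROM PROGRAM 'curl https://www.cybertec-postgresql.com/secret/oil_ext.txt'
```

Here curl runs with your local user's privileges, so no server-side superuser rights are needed - only ordinary INSERT privileges on the table.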
Using GROUP BY is pretty simple. If we want to make some kind of analysis for each group, we basically have two choices:
test=# SELECT region, avg(production) FROM t_oil GROUP BY 1;
    region     |          avg
---------------+-----------------------
 North America | 4541.3623188405797101
 Middle East   | 1992.6036866359447005
(2 rows)
“GROUP BY 1” basically means the same as the following query:
test=# SELECT region, avg(production) FROM t_oil GROUP BY region;
    region     |          avg
---------------+-----------------------
 North America | 4541.3623188405797101
 Middle East   | 1992.6036866359447005
(2 rows)
“GROUP BY region” and “GROUP BY 1” are therefore identical. It is a question of faith which type of syntax you prefer. People have told me more than once that one or the other syntax is “evil”. In reality it makes absolutely no difference; it is just syntactic sugar.
Most people group by one or more columns or no column at all:
test=# SELECT avg(production) FROM t_oil WHERE country = 'USA';
          avg
-----------------------
 9141.3478260869565217
(1 row)
However, there is more: You can also use an expression to determine the groups on the fly.
test=# SELECT production > 9000, count(production)
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY production > 9000;
 ?column? | count
----------+-------
 f        |    20
 t        |    26
(2 rows)
In this case we get two groups: one for rows with production greater than 9000 and one for rows with production less than or equal to 9000. The names of those two groups are therefore “true” (> 9000) and “false” (<= 9000). You can use any expression to calculate those groups on your own. Consider the next example:
test=# SELECT count(production)
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY CASE WHEN year % 2 = 0 THEN true ELSE false END;
 count
-------
    23
    23
(2 rows)
In this example we are counting odd and even years. Note that it is not necessary to list the grouping criteria in the SELECT clause. You might not understand the result if you omit half of the column list, but you are free to do that. What is also interesting is that you can use a full SQL query in the GROUP BY clause.
test=# SELECT count(production)
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY (SELECT CASE WHEN year % 2 = 0 THEN true ELSE false END);
 count
-------
    23
    23
(2 rows)
If you run those queries, it makes sense to take a look at the execution plan. As you can see, the plan looks just like that of any other GROUP BY statement. It is also noteworthy that PostgreSQL does a really good job of estimating the number of groups:
test=# explain SELECT count(production)
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY (SELECT CASE WHEN year % 2 = 0 THEN true ELSE false END);
                         QUERY PLAN
-------------------------------------------------------------
 HashAggregate  (cost=14.97..15.02 rows=2 width=9)
   Group Key: (SubPlan 1)
   ->  Seq Scan on t_oil  (cost=0.00..14.74 rows=46 width=5)
         Filter: (country = 'USA'::text)
   SubPlan 1
     ->  Result  (cost=0.00..0.01 rows=1 width=1)
(6 rows)
PostgreSQL successfully figured out that two groups are to be expected.
HAVING clauses done right
Recently people have asked us about aliases in the HAVING clause: can one use a column alias in a HAVING clause?
test=# SELECT count(production) AS x
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY year < 1990
       HAVING x > 22;
ERROR:  column "x" does not exist
LINE 1: ...il WHERE country = 'USA' GROUP BY year < 1990 HAVING x > 22;
                                                                ^
The answer is simply no. SQL does not allow that. I remember seeing people who actually wanted to implement this feature in PostgreSQL over the years but such a feature is not going to be supported.
If you want to use a HAVING clause it is necessary to explicitly use the entire expression:
test=# SELECT count(production) AS x
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY year < 1990
       HAVING count(production) > 22;
 x
----
 25
(1 row)
test=# SELECT count(production) AS x
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY year < 1990
       HAVING avg(production) > 0;
 x
----
 21
 25
(2 rows)
As you can see we use “count” in the SELECT clause, but we use “avg” in the HAVING clause. This is definitely possible and in some cases really useful. You can use different aggregate functions in GROUP BY and HAVING clauses without any problems.
So far you have seen what GROUP BY can do for you in the simple case, but there is more. I don't want to go into too much detail in this blog, but just one thing: you can do more than one aggregation at the same time to speed up your query. Here is an example:
test=# SELECT year < 1990, count(production) AS x
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY GROUPING SETS ((year < 1990), ());
 ?column? | x
----------+----
          | 46
 f        | 21
 t        | 25
(3 rows)
In this case we get two grouping operations: one containing all rows, and one splitting the rows into before and after 1990.
If you don't like this syntax you can also try ROLLUP which is equivalent to the query you have just seen:
test=# SELECT year < 1990, count(production) AS x
       FROM t_oil
       WHERE country = 'USA'
       GROUP BY ROLLUP (year < 1990);
 ?column? | x
----------+----
          | 46
 f        | 21
 t        | 25
(3 rows)
ROLLUP is basically the “bottom line” - it adds one additional row counting everything.
If you want to find out more about grouping in general I recommend checking out my post about speeding up GROUP BY in general.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
Many people consider recursive queries a difficult topic. Still, they enable you to do things that would otherwise be impossible in SQL.
This article gives a simple introduction with examples and shows the differences from Oracle's implementation of recursive queries.
Common table expressions (WITH clauses)

A common table expression (CTE) can be seen as a view that is only valid for a single query:
WITH ctename AS (
   SELECT ...
)
SELECT ... FROM ctename ...
This could also be written as a subquery in FROM, but there are some advantages to using CTEs:

- the query becomes more readable
- you can reference the CTE several times in the query
- you can use data-modifying statements in the CTE (typically with a RETURNING clause)

Note that before v12, PostgreSQL always materialized CTEs. That means the CTE was calculated independently of the containing query. From v12 on, CTEs can be “inlined” into the query, which provides further optimization potential.
CTEs (as well as the recursive CTEs mentioned below) are defined in the SQL standard, although PostgreSQL does not implement the SEARCH and CYCLE clauses.
Recursive queries are written using recursive CTEs, that is, CTEs containing the RECURSIVE keyword:
keyword:
WITH RECURSIVE ctename AS (
   SELECT ... /* non-recursive branch, cannot reference "ctename" */
   UNION [ALL]
   SELECT ... /* recursive branch referencing "ctename" */
)
SELECT ... FROM ctename ...
PostgreSQL internally uses a working table to process recursive CTEs. This processing is not really recursive, but rather iterative:
First, the working table is initialized by executing the non-recursive branch of the CTE. The result of the CTE is also initialized with this result set. If the recursive CTE uses UNION
rather than UNION ALL
, duplicate rows are removed.
Then, PostgreSQL repeats the following until the working table is empty:

- execute the recursive branch of the CTE, using the working table for the self-reference
- append the result rows to the CTE result; if UNION is used to combine the branches, discard duplicate rows
- replace the working table with the new result rows

Note that the self-referential branch of the CTE is not executed using the complete CTE result so far, but only the rows that are new since the previous iteration (the working table).
If the iteration never ends, the query will just keep running until the result table becomes big enough to cause an error. There are two ways to deal with that:

- Use UNION instead of UNION ALL, which removes duplicate result rows (but of course requires extra processing effort).
- Use a LIMIT clause on the query that uses the CTE, because PostgreSQL stops processing if the recursive CTE has calculated as many rows as are fetched by the parent query. Note that this technique is not portable to other standard-compliant databases.

Let's assume a self-referential table like this:
TABLE emp;

 empno | ename  |    job    | mgr  |  hiredate  |   sal   |  comm   | deptno
-------+--------+-----------+------+------------+---------+---------+--------
  7839 | KING   | PRESIDENT |      | 1981-11-17 | 5000.00 |         |     10
  7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 | 2850.00 |         |     30
  7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 | 2450.00 |         |     10
  7566 | JONES  | MANAGER   | 7839 | 1981-04-02 | 2975.00 |         |     20
  7902 | FORD   | ANALYST   | 7566 | 1981-12-03 | 3000.00 |         |     20
  7369 | SMITH  | CLERK     | 7902 | 1980-12-17 |  800.00 |         |     20
  7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 | 1600.00 |  300.00 |     30
  7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 | 1250.00 |  500.00 |     30
  7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 | 1250.00 | 1400.00 |     30
  7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 | 1500.00 |    0.00 |     30
  7900 | JAMES  | CLERK     | 7698 | 1981-12-03 |  950.00 |         |     30
  7934 | MILLER | CLERK     | 7782 | 1982-01-23 | 1300.00 |         |     10
(12 rows)
We want to find all the subordinates of person 7566, including the person itself.
The non-recursive branch of the query will be:
SELECT empno, ename
FROM emp
WHERE empno = 7566;
The recursive branch will find all subordinates of all entries in the working table:
SELECT emp.empno, emp.ename
FROM emp
   JOIN ctename ON emp.mgr = ctename.empno;
We can assume that the dependencies contain no cycles (nobody is his or her own manager, directly or indirectly). So we can combine the queries with UNION ALL
, because no duplicates can occur.
WITH RECURSIVE ctename AS (
   SELECT empno, ename
   FROM emp
   WHERE empno = 7566
UNION ALL
   SELECT emp.empno, emp.ename
   FROM emp
      JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename
-------+-------
  7566 | JONES
  7902 | FORD
  7369 | SMITH
(3 rows)
Sometimes you want to add more information, like the hierarchical level. You can do that by adding the starting level as a constant in the non-recursive branch. In the recursive branch you simply add 1 to the level:
WITH RECURSIVE ctename AS (
   SELECT empno, ename, 0 AS level
   FROM emp
   WHERE empno = 7566
UNION ALL
   SELECT emp.empno, emp.ename, ctename.level + 1
   FROM emp
      JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename | level
-------+-------+-------
  7566 | JONES |     0
  7902 | FORD  |     1
  7369 | SMITH |     2
(3 rows)
If you use UNION
to avoid duplicated rows in the case of circular references, you cannot use this technique. This is because adding level
will make rows that were identical before different. But in that case a hierarchical level wouldn't make much sense anyway because an entry could appear on infinitely many levels.
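If your data could contain cycles but you still want a level or path column, a common workaround (a variant of which appears in the PostgreSQL documentation) is to carry an array of visited keys along and stop recursing as soon as a key repeats - a sketch using the emp table from above:

```sql
WITH RECURSIVE ctename AS (
   SELECT empno, ename,
          ARRAY[empno] AS seen,       -- keys visited on this path
          false AS is_cycle
   FROM emp
   WHERE empno = 7566
UNION ALL
   SELECT emp.empno, emp.ename,
          ctename.seen || emp.empno,
          emp.empno = ANY (ctename.seen)  -- true if we have been here before
   FROM emp
      JOIN ctename ON emp.mgr = ctename.empno
   WHERE NOT ctename.is_cycle           -- stop one step after a cycle is found
)
SELECT empno, ename FROM ctename WHERE NOT is_cycle;
```

With cycle-free data like the sample table, this returns the same three rows as before; with cyclic data it terminates instead of looping forever.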
A similar technique lets you collect the path from the starting entry to each row:

WITH RECURSIVE ctename AS (
   SELECT empno, ename, ename AS path
   FROM emp
   WHERE empno = 7566
UNION ALL
   SELECT emp.empno, emp.ename,
          ctename.path || ' -> ' || emp.ename
   FROM emp
      JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename |          path
-------+-------+------------------------
  7566 | JONES | JONES
  7902 | FORD  | JONES -> FORD
  7369 | SMITH | JONES -> FORD -> SMITH
Oracle database has a different syntax for recursive queries that does not conform to the SQL standard. The original example would look like this:
SELECT empno, ename
FROM emp
START WITH empno = 7566
CONNECT BY PRIOR empno = mgr;

     EMPNO ENAME
---------- ----------
      7566 JONES
      7902 FORD
      7369 SMITH
This syntax is more concise, but less powerful than recursive CTEs. For more complicated queries involving joins, it can become difficult and confusing.
It is always easy to translate an Oracle “hierarchical query” to a recursive CTE:

- The non-recursive branch is the Oracle query without the CONNECT BY clause, but including the START WITH clause.
- The recursive branch is the Oracle query without the START WITH clause, but including the CONNECT BY clause. You add a join with the name of the recursive CTE and replace all PRIOR columns with columns from that joined CTE.
- If Oracle uses CONNECT BY NOCYCLE, use UNION, otherwise UNION ALL.

Apart from that, Oracle also supports standard-compliant recursive CTEs. These also support the SEARCH and CYCLE clauses that PostgreSQL doesn't implement.
Without recursive CTEs, many things that can be written in procedural languages cannot be written in SQL. That is usually no problem, because SQL is made to query databases. If you need procedural code, you can always write a database function in one of the many procedural languages available in PostgreSQL.
But recursive CTEs make SQL Turing complete, that is, it can perform the same calculations as any other programming language. Obviously, such a “program” would often be quite complicated and inefficient, so this is a theoretical consideration and not of much practical value. Still, the previous examples showed that recursive CTEs can do useful work that you couldn't otherwise perform in SQL.
As an example of the power of recursive queries, here is a recursive CTE that computes the first elements of the Fibonacci sequence:
WITH RECURSIVE fib AS (
   SELECT 1 AS n,
          1::bigint AS "fibₙ",
          1::bigint AS "fibₙ₊₁"
UNION ALL
   SELECT n + 1,
          "fibₙ₊₁",
          "fibₙ" + "fibₙ₊₁"
   FROM fib
)
SELECT n, "fibₙ" FROM fib LIMIT 20;

 n  | fibₙ
----+------
  1 |    1
  2 |    1
  3 |    2
  4 |    3
  5 |    5
  6 |    8
  7 |   13
  8 |   21
  9 |   34
 10 |   55
 11 |   89
 12 |  144
 13 |  233
 14 |  377
 15 |  610
 16 |  987
 17 | 1597
 18 | 2584
 19 | 4181
 20 | 6765
(20 rows)
This example also demonstrates how an endless loop can be avoided with a LIMIT
clause on the parent query.
You shouldn't be afraid of recursive queries, even if the term “recursive” scares you. They are not that hard to understand, and they provide the simplest way to query databases containing hierarchies and graphs. In many cases they are also much more efficient than equivalent procedural code because they avoid repeated queries to the same tables.
Together with window functions, recursive queries are among the most powerful tools that SQL has to offer.
How can I use pg_stat_statements for troubleshooting? First, an introductory story (with some slight “drama” added for fun) from those good old days of on-site consulting 🙂 So...I’m at a client where the database is not behaving nicely among other things...what a crappy DB product indeed I hear, it gets cranky every morning although there should be a constant workload and is saturating the CPU and making some important queries time out!
Hmm very interesting I think, Postgres usually doesn’t tend to care too much about mornings or time in general...Okay, well let’s sit down and look at the metrics I say...oh, surprise-surprise - we discover that there’s no monitoring framework in place! Sounds very bad and rare, right? Well, quite common actually as many people rely only on the server log files...which I wouldn’t recommend doing personally. But not too tragic in this case actually - as luckily their DBA at least had read from somewhere that it’s good to install the pg_stat_statements extension when taking an instance into use - which is of course absolutely true and a highly recommended thing to do!
So I say - well, let’s just quickly install some monitoring tool and have a coffee and talk about other issues while the metrics, that will help us to get to the bottom of this, are being gathered. And then I hear that distinct audible “sigh” and they look at me with despair in their eyes - you know, we just can’t install some software like that...firstly there’s no Internet...and 2nd, it needs to go through managerial and security acceptance processes X,Y and Z and it would take a week at best.
OK, doesn’t make life easier for sure...the poor DBA is also already feeling bad that their organisation is not exactly the candidate for the “Agile company of the year” award...But, then I ask...you know, we do have direct console access to the database still, right? Yes, we do! And then I can finally use my wise Yoda quote: Well great, this is pretty much the same as a monitoring tool - it’s all about the data anyways! The DBA starts to look at me in a very suspicious and puzzled way...what the hell am I talking about? How can a plain text-based query interface be nearly as useful as some modern graphical monitoring tool for troubleshooting?
So let’s shed some light on what I mean. A disclaimer - it’s not the most “humane” method of course, but it will definitely gather all the necessary information and thus do the trick. For most users though, who have the possibility, I’d recommend installing some real monitoring tool - like pgwatch2 or pghero, which are both dead simple to use and have a graphical interface for pinpointing problems.
Make sure the pg_stat_statements extension is loaded via the shared_preload_libraries parameter, and that the extension is created in some database - it doesn't even have to be the problematic one, as the extension works on the cluster level. For new installs, a restart of the PostgreSQL server is needed though, which might be problematic in some cases - so again: it's a really highly recommended step when performing the initial server setup.
Log into the database where pg_stat_statements is available and create a table to hold the periodic snapshots of the tracking information the extension is providing. And this is the crux of the “trick” actually - we want to regularly store snapshots of detailed query execution details so that we can later calculate the so called “deltas” for each query and see who was burning the most resources within a certain timespan! This is actually exactly what most “continuous monitoring” tools are doing in a nicer coat - just storing snapshots of the internal PostgreSQL performance counters for various objects or queries and then calculating deltas or anomalies based on that data and representing it in some nice and digestible, usually graphical, format.
A DDL statement to create a table for such “snapshot” storing purposes would look something like that:
CREATE TABLE stat_statements_snapshots AS
SELECT now() AS ts, *
FROM pg_stat_statements
WHERE false;
Set up periodic gathering of pg_stat_statements data into our “snapshot” table. In the simplest form, on the “psql” console this can be done with the built-in “\watch” metacommand. This command re-executes the previous query at a regular interval - by default every 2 seconds, which is mostly too frequent for general pg_stat_statements monitoring (but might be too infrequent for pg_locks or pg_stat_activity monitoring). In short, it would look something like this:
/* do the first insert normally and then let “\watch” repeat it every 60 seconds */
INSERT INTO stat_statements_snapshots
SELECT now(), * FROM pg_stat_statements;

\watch 60
Remember - “psql” + “\watch” is just one option; a simple Bash loop or a cron job with the same SQL would be as good, and maybe even better if you plan to keep it running for a day or so, as network interruptions or restarts might terminate your “psql” session.
Now that the data collection has been running for a while and our target time range is covered, we obviously need to analyze the data somehow. There are quite a few ways to do that - which resource usage columns to pick depends on the detected problem type, and many SQL query techniques could be used. But to give you a general idea, the chapter below contains some simpler examples of standard questions and the resulting queries for such ad-hoc troubleshooting sessions...but as already said, there are many different paths here, and the queries provided are by far not the only ones.
Which top 3 queries were burning the most CPU time between times t1 and t2 and approximately how much of the total Postgres runtime did that represent? Let's see:
with q_snap_data as (
  select
    ts,
    queryid,
    max(query) as query, /* avoiding groupings over long text values is a good idea generally */
    sum(total_time) as total_time,
    sum(calls) as calls
  from
    stat_statements_snapshots
  where
    not query like 'insert into stat_statements_snapshots%' /* don't include our snapshot collecting query */
  group by
    ts, queryid /* first grouping needed as queries have separate rows for various users, which we want to ignore */
),
q_min_total as (
  select sum(total_time) as min from q_snap_data group by ts order by ts limit 1
),
q_max_total as (
  select sum(total_time) as max from q_snap_data group by ts order by ts desc limit 1
)
select
  queryid,
  max(query) as query,
  max(total_time) - min(total_time) as total_time,
  max(calls) - min(calls) as calls,
  (max(total_time) - min(total_time)) / (max(calls) - min(calls)) as mean_time,
  100 * (max(total_time) - min(total_time))::numeric
      / (select max - min from q_max_total, q_min_total) as approx_pct_of_total
from
  q_snap_data
/* where ts between t1 and t2 */
group by
  queryid
having
  max(calls) > min(calls)
order by
  total_time desc
limit 3;
with q_snap_data as (
  select
    ts,
    queryid,
    max(query) as query, /* avoiding groupings over long text values is a good idea generally */
    sum(total_time) as total_time,
    sum(calls) as calls
  from
    stat_statements_snapshots
  where
    not query like 'insert into stat_statements_snapshots%' /* don't include our snapshot collecting query */
    /* and ts between t1 and t2 */
  group by
    ts, queryid /* first grouping needed as queries have separate rows for various users, which we want to ignore */
),
q_snap_data_hour as (
  select
    extract(hour from ts) as ts,
    queryid,
    max(query) as query,
    max(total_time) - min(total_time) as total_time,
    max(calls) - min(calls) as calls
  from
    q_snap_data
  group by
    extract(hour from ts), queryid
)
select
  *,
  total_time - lag_total_time as total_time_growth
from (
  select
    ts,
    queryid,
    query,
    total_time,
    lag(total_time) over (partition by queryid order by ts) as lag_total_time,
    calls
  from
    q_snap_data_hour
) x
where
  lag_total_time > 0
  and total_time > lag_total_time
order by
  total_time - lag_total_time desc
limit 3;
The results can be considered useful only if there was no Postmaster crash or user-invoked statistics reset in between. If there was, your best bet is almost always to go all in on the awesome window functions!
In some cases, as in the above described fully utilized CPU case, one doesn’t really want to add any unnecessary load on the database server with monitoring activities (especially if you use shorter intervals), as it might actually be that last straw that breaks the camel's back...so how do you minimize such risks a bit?
We could (and mostly should) use the UNLOGGED table modifier for our snapshot gathering table, to bypass the transaction log completely for our frequently inserted data - thus reducing disk flushing and network load (remember - WAL is sent also to the replicas). This modifier of course has the downside that the table is not effectively crash safe and we would lose our data in case of a PostgreSQL crash or a power outage! Nevertheless, this is especially recommended when the whole monitoring operation is planned to be relatively short-lived, as it mostly should, or if the strange “peaks” are appearing very frequently and we could just re-start our capturing any time and still get usable “leads”. I actually use “UNLOGGED” almost always for such ad-hoc troubleshooting - better to be safe than sorry and not to strain the system under monitoring.
We could leave out the longest info (byte-wise) from the pg_stat_statements view snapshots - the actual queries! This is OK to do as the internal query IDs are static and good enough to look up the real query texts later with a JOIN, when we have already identified some suspicious query ID-s.
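Since version 1.2 of the extension, the underlying set-returning function even has a parameter for exactly this purpose: with showtext := false, the query column comes back NULL. A sketch of such a “slim” snapshot insert:

```sql
-- store the counters but not the (potentially long) query texts
INSERT INTO stat_statements_snapshots
SELECT now(), * FROM pg_stat_statements(showtext := false);
```

Later, once suspicious query IDs have been identified, the texts can be fetched live with a plain SELECT on pg_stat_statements filtered by queryid.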
We don’t actually need to store the pg_stat_statement statistics on each interval for all queries...as probably only a subset of them are “active” in the monitored time range. So basically we could add some WHERE filter condition that looks up the very first entry for a specific query and dismisses it if the execution counter has not progressed for the current timestamp - this usually minimizes the data amount at least 2 times based on my experience. A hint though - for fast lookups on that first entry, you might want to also throw some index on the “snapshot” table, so that we again burn a bit less CPU time.
CREATE INDEX ON stat_statements_snapshots (queryid, ts);

INSERT INTO stat_statements_snapshots
SELECT now(), *
FROM pg_stat_statements s
WHERE calls > (SELECT calls
               FROM stat_statements_snapshots
               WHERE queryid = s.queryid AND dbid = s.dbid
               ORDER BY ts
               LIMIT 1);
By the way, exactly the same approach can be used for all the other internal statistics views that are cumulative in nature, as most of them are - the first that comes to mind is the pg_stat_user_tables view, which lets you determine at which times some tables get a lot of sequential scans. There I'd also add some minimum table size limit into the mix, as scans on tiny tables, unless they happen in the billions per day, are not usually a thing to worry about.
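The same snapshot pattern, with a hypothetical 10 MB size cutoff, might look like this for pg_stat_user_tables:

```sql
-- empty snapshot table with the same columns plus a timestamp
CREATE UNLOGGED TABLE stat_user_tables_snapshots AS
SELECT now() AS ts, * FROM pg_stat_user_tables WHERE false;

-- repeat this insert periodically, e.g. via \watch or cron
INSERT INTO stat_user_tables_snapshots
SELECT now(), *
FROM pg_stat_user_tables
WHERE pg_table_size(relid) > 10 * 1024 * 1024;  -- ignore tables under ~10 MB
```

Deltas of seq_scan between snapshots then show exactly when the sequential scans happen.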
Hopefully, that wasn’t too “hacky”, and you have a new trick up your sleeve now.
Read more about putting pg_stat_statements to good use in Hans' blog Find and Fix a Missing PostgreSQL Index.