recursive queries misunderstood
© Laurenz Albe 2020

 

Many people consider recursive queries a difficult topic. Still, they enable you to do things that would otherwise be impossible in SQL.

This articles gives a simple introduction with examples and shows the differences to Oracle’s implementation of recursive queries.

Common table expressions (WITH clauses)

A common table expression (CTE) can be seen as a view that is only valid for a single query:

WITH ctename AS (
   SELECT ...
)
SELECT ...
FROM ctename ...

This could also be written as a subquery in FROM, but there are some advantages to using CTEs:

  • The query becomes more readable.
  • You can reference the CTE several times in the query, and it will be calculated only once.
  • You can use data modifying statements in the CTE (typically with a RETURNING clause).

Note that before v12, PostgreSQL always materialized CTEs. That means, the CTE was calculated independently from the containing query. From v12 on, CTEs can be “inlined” into the query, which provides further optimization potential.

CTEs (as well as the recursive CTEs mentioned below) are defined in the SQL standard, although PostgreSQL does not implement the SEARCH and CYCLE clause.

Syntax of recursive queries

Recursive queries are written using recursive CTEs, that are CTEs containing the RECURSIVE keyword:

WITH RECURSIVE ctename AS (
   SELECT /* non-recursive branch, cannot reference "ctename" */
   UNION [ALL]
   SELECT /* recursive branch referencing "ctename" */
)
SELECT ...
FROM ctename ...

How recursive queries are processed

PostgreSQL internally uses a working table to process recursive CTEs. This processing is not really recursive, but rather iterative:

First, the working table is initialized by executing the non-recursive branch of the CTE. The result of the CTE is also initialized with this result set. If the recursive CTE uses UNION rather than UNION ALL, duplicate rows are removed.

Then, PostgreSQL repeats the following until the working table is empty:

  • Evaluate the recursive branch of the CTE, replacing the reference to the CTE with the working table.
  • Add all resulting rows to the CTE result. If UNION is used to combine the branches, discard duplicate rows.
  • Replace the working table with all new rows from the previous step (excluding any removed duplicates).

Note that the self-referential branch of the CTE is not executed using the complete CTE result so far, but only the rows that are new since the previous iteration (the working table).

You have to be aware of the danger of an endless loop here:

If the iteration never ends, the query will just keep running until the result table becomes big enough to cause an error. There are two ways to deal with that:

  • Often you can avoid infinite recursion by using UNION, which removes duplicate result rows (but of course requires extra processing effort).
  • Another way is to place a LIMIT clause on the query that uses the CTE, because PostgreSQL stops processing if the recursive CTE has calculated as many rows as are fetched by the parent query. Note that this technique is not portable to other standard compliant databases.

A simple example

Let’s assume a self-referential table like this:

TABLE emp;

 empno | ename  |    job    | mgr  |  hiredate  |   sal   |  comm   | deptno 
-------+--------+-----------+------+------------+---------+---------+--------
  7839 | KING   | PRESIDENT |      | 1981-11-17 | 5000.00 |         |     10
  7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 | 2850.00 |         |     30
  7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 | 2450.00 |         |     10
  7566 | JONES  | MANAGER   | 7839 | 1981-04-02 | 2975.00 |         |     20
  7902 | FORD   | ANALYST   | 7566 | 1981-12-03 | 3000.00 |         |     20
  7369 | SMITH  | CLERK     | 7902 | 1980-12-17 |  800.00 |         |     20
  7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 | 1600.00 |  300.00 |     30
  7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 | 1250.00 |  500.00 |     30
  7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 | 1250.00 | 1400.00 |     30
  7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 | 1500.00 |    0.00 |     30
  7900 | JAMES  | CLERK     | 7698 | 1981-12-03 |  950.00 |         |     30
  7934 | MILLER | CLERK     | 7782 | 1982-01-23 | 1300.00 |         |     10
(12 rows)

We want to find all the subordinates of person 7566, including the person itself.

The non-recursive branch of the query will be:

SELECT empno, ename
FROM emp
WHERE empno = 7566;

The recursive branch will find all subordinates of all entries in the working table:

SELECT emp.empno, emp.ename
FROM emp
   JOIN ctename ON emp.mgr = ctename.empno;

We can assume that the dependencies contain no cycles (nobody is his or her own manager, directly or indirectly). So we can combine the queries with UNION ALL, because no duplicates can occur.

So our complete query will be:

WITH RECURSIVE ctename AS (
      SELECT empno, ename
      FROM emp
      WHERE empno = 7566
   UNION ALL
      SELECT emp.empno, emp.ename
      FROM emp
         JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename 
-------+-------
  7566 | JONES
  7902 | FORD
  7369 | SMITH
(3 rows)

Adding generated columns

Sometimes you want to add more information, like the hierarchical level. You can do that by adding the starting level as a constant in the non-recursive branch. In the recursive branch you simply add 1 to the level:

WITH RECURSIVE ctename AS (
      SELECT empno, ename,
             0 AS level
      FROM emp
      WHERE empno = 7566
   UNION ALL
      SELECT emp.empno, emp.ename,
             ctename.level + 1
      FROM emp
         JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename | level 
-------+-------+-------
  7566 | JONES |     0
  7902 | FORD  |     1
  7369 | SMITH |     2
(3 rows)

If you use UNION to avoid duplicated rows in the case of circular references, you cannot use this technique. This is because adding level will make rows that were identical before different. But in that case a hierarchical level wouldn’t make much sense anyway because an entry could appear on infinitely many levels.

Another frequent requirement is to collect all ancestors in a “path”:

     
WITH RECURSIVE ctename AS (
      SELECT empno, ename,
             ename AS path
      FROM emp
      WHERE empno = 7566
   UNION ALL
      SELECT emp.empno, emp.ename,
             ctename.path || ' -> ' || emp.ename
      FROM emp
         JOIN ctename ON emp.mgr = ctename.empno
)
SELECT * FROM ctename;

 empno | ename |          path          
-------+-------+------------------------
  7566 | JONES | JONES
  7902 | FORD  | JONES -> FORD
  7369 | SMITH | JONES -> FORD -> SMITH

Comparison to Oracle

Oracle database has a different syntax for recursive queries that does not conform to the SQL standard. The original example would look like this:

SELECT empno, ename
FROM emp
START WITH empno = 7566
CONNECT BY PRIOR empno = mgr;

     EMPNO ENAME
---------- ----------
      7566 JONES
      7902 FORD
      7369 SMITH

This syntax is more concise, but less powerful than recursive CTEs. For more complicated queries involving joins, it can become difficult and confusing.

It is always easy to translate an Oracle “hierarchical query” to a recursive CTE:

  • The non-recursive branch is the Oracle query without the CONNECT BY clause but including the START WITH clause.
  • The recursive branch is the Oracle query without the START WITH clause but including the CONNECT BY clause. You add a join with the name of the recursive CTE and replace all PRIOR columns with columns from that joined CTE.
  • If the Oracle query uses CONNECT BY NOCYCLE, use UNION, otherwise UNION ALL.

Apart from that, Oracle also supports standard compliant recursive CTEs. These also support the SEARCH and CYCLE clauses that PostgreSQL doesn’t implement.

The true power of recursive queries

Without recursive CTEs, many things that can be written in procedural languages cannot be written in SQL. That is usually no problem, because SQL is made to query databases. If you need procedural code, you can always write a database function in one of the many procedural languages available in PostgreSQL.

But recursive CTEs make SQL turing complete, that is, it can perform the same calculations as any other programming language. Obviously, such a “program” would often be quite complicated and inefficient, and this is a theoretical consideration and not of much practical value. Still, the previous examples showed that recursive CTEs can do useful work that you couldn’t otherwise perform in SQL.

As an example for the power of recursive queries, here is a recursive CTE that computes the first elements of the Fibonacci sequence:

WITH RECURSIVE fib AS (
      SELECT 1 AS n,
             1::bigint AS "fibₙ",
             1::bigint AS "fibₙ₊₁"
   UNION ALL
      SELECT n+1,
             "fibₙ₊₁",
             "fibₙ" + "fibₙ₊₁"
      FROM fib
)
SELECT n, "fibₙ" FROM fib
LIMIT 20;

 n  | fibₙ 
----+------
  1 |    1
  2 |    1
  3 |    2
  4 |    3
  5 |    5
  6 |    8
  7 |   13
  8 |   21
  9 |   34
 10 |   55
 11 |   89
 12 |  144
 13 |  233
 14 |  377
 15 |  610
 16 |  987
 17 | 1597
 18 | 2584
 19 | 4181
 20 | 6765
(20 rows)

This example also demonstrates how an endless loop can be avoided with a LIMIT clause on the parent query.

Conclusion

You shouldn’t be afraid of recursive queries, even if the term “recursive” scares you. They are not that hard to understand, and they provide the simplest way to query databases containing hierarchies and graphs. In many cases they are also much more efficient than equivalent procedural code because they avoid repeated queries to the same tables.

Together with window functions, recursive queries are among the most powerful tools that SQL has to offer.