© Laurenz Albe 2023
SQL allows you to use subqueries almost anywhere where you could have a table or column name. All you have to do is surround the query with parentheses, like
(SELECT ...), and you can use it in arbitrary expressions. This makes SQL a powerful language – and one that can be hard to read. But I don’t want to discuss the beauty or ugliness of SQL. In this article, I want to tell you how to write subqueries that perform well. I’ll start simple, but get to more surprising and complicated topics later.
Correlated and uncorrelated subqueries
In a subquery you can use table columns from the outside, like
SELECT a.col1, (SELECT b.col2 FROM b WHERE b.x = a.x) FROM a;
The subquery will be different for each row in “
a”. Such a subquery is usually called a correlated subquery. An uncorrelated subquery is one that does not reference anything from the outside.
Uncorrelated subqueries are simple. If the PostgreSQL optimizer does not “pull it up” (integrate it in the main query tree), the executor will calculate it in a separate step. You can see that as an
InitPlan (initial plan) in the output of
EXPLAIN. Uncorrelated subqueries are almost never a performance problem. In the rest of this article, I will mostly deal with correlated subqueries.
Scalar and tabular subqueries
If you write a subquery in a place in an SQL statement where you would otherwise have to write a single value, it is a scalar subquery. An example for a scalar subquery is the one in the previous section. A different example would be
SELECT a.col1 FROM a WHERE 1 = (SELECT count(*) FROM b WHERE b.x = a.x);
If a scalar subquery returns no result, the resulting value is NULL. If the query returns more than a single row, you will receive a run-time error:
ERROR: more than one row returned by a subquery used as an expression
A tabular subquery appears in a context where it can return more than one value:
FROM (SELECT ...) AS alias
- a common table expression (CTE):
WITH q AS (SELECT ...) SELECT ...
WHERE a.x IN (SELECT ...)
WHERE NOT EXISTS (SELECT ...)
Scalar subqueries usually are a performance problem
My rule of thumb is: avoid correlated scalar subqueries whenever you can. The reason is that PostgreSQL can only execute a scalar subquery as a nested loop. For example, PostgreSQL will execute the subquery from the first section once for each row in table “
a”. This can be fine if “
a” is a small table (remember, my recommendation is just a rule of thumb). However, if table “
a” is large, even a fast subquery will make the query execution unpleasantly slow.
Rewriting a scalar subquery in the
SELECT list or
If correlated scalar subqueries are bad for performance, how can we avoid them? There is no single, straightforward answer, and you probably won’t be able to rewrite the query to avoid such subqueries in all cases. But usually the solution is to convert the subquery into a join. For our first query, that will look like this:
SELECT a.col1, b.col2 FROM a LEFT JOIN b ON b.x = a.x;
That query is semantically equivalent, with the exception that we don’t get a run-time error if a row in “
a” matches more than one row in “
b”. We need an outer join to account for the case where the subquery returns no result.
For our second example, the rewritten query would look like this:
SELECT a.col1 FROM a JOIN b ON b.x = a.x GROUP BY a.pkey, a.col1 HAVING count(*) = 1;
a.pkey is the primary key of “
a”. Grouping by
a.col1 would not be sufficient, because two different rows from table “
a” could have the same value for
The advantage of rewriting the queries as shown above is that PostgreSQL can choose the optimal join strategy and is not restricted to nested loops. If the table “
a” has few rows, that may not make a difference, since a nested loop join may be the most efficient join strategy anyway. But you also won’t lose by rewriting the query in that case. And if “
a” is large, you will be much faster with a hash or a merge join.
Tabular subqueries and performance
While correlated scalar subqueries are usually bad, the case is not so simple with tabular subqueries. Let’s consider the different cases separately.
CTEs and subqueries in
These cases are almost identical, because you can always rewrite a CTE to a subquery in
FROM unless it is a recursive, a
MATERIALIZED or a data modifying CTE. CTEs are never correlated, so they are never problematic. However, a
FROM clause entry can be correlated in a lateral join:
SELECT a.col1, sub.col2 FROM a CROSS JOIN LATERAL (SELECT b.col2 FROM b WHERE b.x = a.x ORDER BY b.sort LIMIT 1) AS sub;
Again, PostgreSQL will execute such a subquery in a nested loop, which can perform badly for large table “
a”. Therefore, it’s usually a good idea to rewrite the query to avoid a correlated subquery:
SELECT DISTINCT ON (a.pkey) a.col1, b.col2 FROM a JOIN b ON b.x = a.x ORDER BY a.pkey, b.sort;
The rewritten query will perform better if “
a” has many rows, but it could perform worse if “
a” is small and “
b“ is large, but has an index on
This is a special case. So far I have always recommended avoiding correlated subqueries. But with
NOT EXISTS, the PostgreSQL optimizer is able to transform the clause to a semi-join and anti-join, respectively. That allows PostgreSQL to use all join strategies, not only nested loops.
Consequently, PostgreSQL can process correlated subqueries in
NOT EXISTS efficiently.
The tricky case of
You will maybe expect that these two cases behave alike, but that is not the case. A query using
IN with a subquery can always be rewritten to use
EXISTS. For example, the following statement:
SELECT a.col1 FROM a WHERE a.foo IN (SELECT b.col2 FROM b WHERE a.x = b.x);
is equivalent to
SELECT a.col1 FROM a WHERE EXISTS (SELECT NULL FROM b WHERE a.x = b.x AND a.foo = b.col2);
The PostgreSQL optimizer can do that and will process the subquery in
IN as efficiently as the one in
However, the case with
NOT IN is quite different. You can rewrite
NOT IN to
NOT EXISTS similar to the above, but that is not a transformation that PostgreSQL can do automatically, because the rewritten statement is semantically different: If the subquery returns at least one NULL value,
NOT IN will never be TRUE. The
NOT EXISTS clause does not exhibit this surprising behavior.
Now people normally don’t care about this property of
NOT IN (and in fact, too few people know about it). Most people would prefer the behavior of
NOT EXISTS anyway. But you have to rewrite the SQL statement yourself and cannot expect PostgreSQL to do it automatically. So my recommendation is that you never use
NOT IN with a subquery and always use
NOT EXISTS instead.
Using correlated subqueries to force nested loop joins
So far, I have told you how to rewrite an SQL statement to avoid forcing the optimizer to use a nested loop. Yet sometimes you need the exact opposite: you want the optimizer to use a nested loop join, because you happen to know that that is the best join strategy. Then you can deliberately rewrite a regular join to a lateral cross join to force a nested loop. For example, this query
SELECT a.col1, b.col2 FROM a JOIN b ON a.x = b.x;
is semantically equivalent to
SELECT a.col1, sub.col2 FROM a CROSS JOIN LATERAL (SELECT b.col2 FROM b WHERE a.x = b.x) AS sub;
If you want good performance with subqueries, it is often a good idea to follow these guidelines:
- use uncorrelated subqueries as much as you like, as long as they don’t make the statement hard to understand
- avoid correlated subqueries everywhere except in
- always rewrite
Don’t take these rules as iron laws. Sometimes a correlated subquery can actually perform better, and sometimes you can use a correlated subquery to force the optimizer to use a nested loop if you are certain that is the right strategy to use.
If you are interested in improving the performance of a query by rewriting it, you may want to read my article about forcing the join order in PostgreSQL.