© Laurenz Albe 2023
Different from many other database systems, PostgreSQL does not support query hints. That makes it difficult to force the hand of the query planner when it comes to a certain join order that you know to be good. This article explains how you can influence execution plans in PostgreSQL.
Why no query hints?
The PostgreSQL TODO list lists optimizer hints under “Features We Do Not Want” and links to a discussion page that explains why. The discussion page references mailing list discussions that show that this decision has not been unanimous.
However, one of the strong points of PostgreSQL is its extensibility. If you really want query hints, you can get them: all you have to do is install the extension pg_hint_plan. This extension offers a comprehensive set of Oracle-style query hints, and it does not require a modified version of the PostgreSQL server.
But perhaps you don’t want to install third-party software, or your database is running at a hosting provider and you have no access to the operating system. In that case, read on for alternative ways to force the join order.
An example query and its join order
We have three tables
c and want to calculate the natural join between them. The optimizer chooses the following plan:
EXPLAIN (COSTS OFF) SELECT b.b_id, a.value FROM a JOIN b USING (a_id) JOIN c USING (a_id) WHERE c.c_id < 300; QUERY PLAN ═══════════════════════════════════════════════════════ Hash Join Hash Cond: (c.a_id = a.a_id) -> Seq Scan on c Filter: (c_id < 300) -> Hash -> Nested Loop -> Seq Scan on b -> Memoize Cache Key: b.a_id Cache Mode: logical -> Index Scan using a_pkey on a Index Cond: (a_id = b.a_id) (12 rows)
The PostgreSQL optimizer decided to first join
a, then join the result with
c. But we would like to join
Forcing the join order with an optimizer barrier
We would like to write the query in a way that makes the PostgreSQL optimizer choose the plan we want. However, that is not as simple as it may seem. The PostgreSQL optimizer does not only plan the query as you wrote it, but it rearranges the query considerably. Among other things
- it rearranges the join order as it thinks best
- if “pulls up” subqueries to “flatten” the plan tree
- it pushes
WHEREconditions into joins and
Usually that is just what you want: the more ways the optimizer finds to execute the query, the better its chances are of finding the fastest execution plan. But if we want to force the optimizer’s hand, we want to prevent exactly that. Therefore, we are looking for optimizer barriers, that is SQL constructs that prevent PostgreSQL from rearranging the plan.
The following two techniques are pretty similar: both rewrite the query to use a subquery and prevent the optimizer from pulling it up.
OFFSET 0 to force the join order
Here, we write a subquery in the
FROM clause that explicitly joins the desired tables. To keep PostgreSQL from flattening the query, we can use an
LIMIT clause in the subquery. The traditional way to do that is to use
OFFSET 0, which does not change the result of the subquery:
EXPLAIN (COSTS OFF) SELECT subq.b_id, a.value FROM a JOIN (SELECT a_id, b.b_id, c.c_id FROM b JOIN c USING (a_id) WHERE c.c_id < 300 OFFSET 0 ) AS subq USING (a_id); QUERY PLAN ═══════════════════════════════════════════ Nested Loop -> Hash Join Hash Cond: (c.a_id = b.a_id) -> Seq Scan on c Filter: (c_id < 300) -> Hash -> Seq Scan on b -> Memoize Cache Key: b.a_id Cache Mode: logical -> Index Scan using a_pkey on a Index Cond: (a_id = b.a_id) (12 rows)
It would not be hard to teach the optimizer to ignore that “useless” clause, but that would disable this useful trick, so it won’t happen.
Using a common table expression to force the join order
Writing a subquery in the
FROM clause can make the query hard to read. A common table expression (CTE) is a different approach: you write the subquery in the
WITH clause at the beginning of the statement and give it a name. Then you can use that name in the main query, quite like a view, but one that only exists in the context of a single statement.
Before PostgreSQL v12, a CTE was automatically an optimizer barrier. Since v12, PostgreSQL can pull CTEs into the main query, and you have to use the
MATERIALIZED keyword to prevent that:
EXPLAIN (COSTS OFF) WITH subq AS MATERIALIZED ( SELECT a_id, b.b_id, c.c_id FROM b JOIN c USING (a_id) WHERE c.c_id < 300 ) SELECT subq.b_id, a.value FROM a JOIN subq USING (a_id); QUERY PLAN ════════════════════════════════════════ Hash Join Hash Cond: (subq.a_id = a.a_id) CTE subq -> Hash Join Hash Cond: (c.a_id = b.a_id) -> Seq Scan on c Filter: (c_id < 300) -> Hash -> Seq Scan on b -> CTE Scan on subq -> Hash -> Seq Scan on a (12 rows)
The plan is different from the previous one, as PostgreSQL chose a hash join.
Join order and
I mentioned above that the optimizer rearranges the join order of a query. With an inner join of two tables, there are usually seven choices: PostgreSQL can opt for a nested loop, hash or merge join, and for the first two of these, the order of the tables makes a difference as well. With more tables, the number of options explodes, since the result of an inner join is independent of the order in which you join the tables. For three tables, there can be up to 147 combinations.
However, while the optimizer tries to find the best execution plan, it is also important that it does not take too long for planning. After all, PostgreSQL normally does not cache execution plans. To keep planning time moderate, the optimizer draws the line somewhere: if a query joins many tables, the optimizer will only consider all possible combinations for the first eight tables. It joins the remaining tables just like you wrote them in the statement. You can adjust that limit with the parameters
from_collapse_limit. The first one is for statements written with the explicit
JOIN syntax, and the second applies to joins written in the form
FROM a, b WHERE a.col1 = b.col2
If the number of tables reaches 12 (the default value of the parameter
geqo_threshold), PostgreSQL uses an entirely different approach: it randomly generates a number of query plans and plays evolution by recombining the most promising plans over several generations. This genetic query optimizer can result in non-deterministic query plans, which is not always what you want.
Dumbing down the optimizer with
join_collapse_limit = 1
With this approach, we deliberately lobotomize the optimizer by telling it not to rearrange the join order in the SQL statement. Then you have to write the tables in the exact order in which you want them joined:
SET join_collapse_limit = 1; EXPLAIN (COSTS OFF) SELECT b.b_id, a.value FROM b JOIN c USING (a_id) JOIN a USING (a_id) WHERE c.c_id < 300; QUERY PLAN ═══════════════════════════════════════════ Nested Loop -> Hash Join Hash Cond: (c.a_id = b.a_id) -> Seq Scan on c Filter: (c_id < 300) -> Hash -> Seq Scan on b -> Memoize Cache Key: b.a_id Cache Mode: logical -> Index Scan using a_pkey on a Index Cond: (a_id = b.a_id) (12 rows)
We end up with the same execution plan as with
OFFSET 0. You don’t want to leave
join_collapse_limit at 1, because other queries may perform badly with that setting. Here are some ideas how to change a parameter for a single query:
- run the query in an explicit
READ ONLYtransaction and use
SET LOCALto change the parameter, so that it reverts to its previous setting as soon as the transaction is done
- run the query from a database function and use the
CREATE FUNCTIONto change the parameter for the execution of the function
Unless you want to use an extension like
pg_hint_plans, it is not easy to force the PostgreSQL optimizer to do what you want. We have seen how you can force the join oder with optimizer barriers or parameter settings.
If you are interested in query optimization, perhaps you want to read about
or about the different join strategies. There is also an introduction to
UNION ALL and performance