(Updated 2023-02-24) There are three join strategies in PostgreSQL that work quite differently. If PostgreSQL chooses the wrong strategy, query performance can suffer a lot. This article explains the join strategies, how you can support them with indexes, what can go wrong with them and how you can tune your joins for better performance.
Terminology of Join Strategies
A join combines data from two relations. Such a relation can be a table (also called base relation) or the result of any plan node. For example, in a join like
SELECT ... FROM a JOIN (b LEFT JOIN c ON ...) ON ...
the base relation
a will be joined to the result of the join of
A relation can also be the result of an index scan. So when I say below that “PostgreSQL scans the relation sequentially”, I don’t mean that there has to be a sequential scan on a table. It is just that PostgreSQL reads “the thing that is joined” in its entirety.
Inner and outer relation
The execution plan for any join looks like this:
EXPLAIN (COSTS OFF) SELECT * FROM a JOIN b USING (id); QUERY PLAN ---------------------------- Hash Join Hash Cond: (a.id = b.id) -> Seq Scan on a -> Hash -> Seq Scan on b (5 rows)
We call the upper of the joined relations (in this case the sequential scan on
a) the outer relation of the join, and we call the lower relation (the hash computed from
b) the inner relation.
Join condition and join key
A Cartesian product or cross join of two relations is what you get if you combine each row from one relation with each row of the other. The join condition is a filter that excludes some of these combinations. There are several ways to write a join condition, but all can be transformed to
a <join type> JOIN b ON <join condition>
If the join condition is of the form
a.col1 <operator> b.col2 [AND ...]
I call “
a.col1” and “
b.col2” join keys.
Note that for inner joins there is no distinction between the join condition and the
WHERE condition, but that doesn’t hold for outer joins.
Nested loop join strategy
This is the simplest and most general join strategy of all.
PostgreSQL scans the outer relation sequentially, and for each result row it scans the inner relation for matching rows.
Indexes that can help with nested loop joins
Since we scan the outer relation sequentially, no index on the outer relation will help. But an index on the join key of the inner relation can speed up a nested loop join considerably.
Use cases for the nested loop join strategy
Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won’t be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient. If the outer relation is large, nested loop joins are usually very inefficient, even if they are supported by an index on the inner relation.
Apart from that, it is the only join strategy that can be used if no join condition uses the
= operator. So it also serves as a fall-back strategy if no other strategy can be used.
Hash join strategy
First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the
= operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.
This is somewhat similar to a nested loop join. Building the hash table is an extra start-up effort, but probing the hash is much faster than scanning the inner relation.
Indexes that can help with hash joins
Since we scan both relations sequentially, an index on the join condition will not help with a hash join.
Use cases for the hash join strategy
Hash joins are best if none of the involved relations are small, but the hash table for the smaller table fits in
work_mem. This is because otherwise PostgreSQL would build the hash in several batches and store them in temporary disk files, which hurts performance. In such cases the optimizer usually chooses a different join strategy like a merge join.
Looking up values in a hash table only works if the operator in the join condition is
=, so you need at least one join condition with that operator.
Merge join strategy
In a merge join, PostgreSQL picks all join conditions with the
= operator. It then sorts both tables by the join keys (which means that the data types must be sortable). Then it iterates through both sorted lists and finds matching entries.
Indexes that help with a merge join
An index on the sort keys can speed up sorting, so an index on the join keys on both relations can speed up a merge join. However, an explicit sort is often cheaper unless an index only scan can be used.
Use cases for the merge join strategy
The optimizer usually chooses a merge join if the involved relations are both too big for a hash that fits into
work_mem. So this is the best strategy for joining really large tables.
Like a hash join, a merge join is only feasible if there is at least one join condition with the
Summary table for PostgreSQL join strategies
|Nested Loop Join
|For each outer relation row, scan the inner relation
|Build a hash from the inner relation, scan the outer relation, probe the hash
|Sort both relations and merge rows
|Indexes that help
|Index on the join keys of the inner relation
|Indexes on the join keys of both relations
|Good strategy if
|the outer table is small
|the hash table fits into
|both tables are large
Impact on query performance
Choosing the wrong join strategy leads to bad performance:
- If the optimizer underestimates a row count, it may choose a nested loop join by mistake. Then it scans the inner relation more often than it bargained for, which leads to bad performance.
- If the optimizer overestimates a row count, it may choose a hash or merge join by mistake. Then it has to scan both relations completely, which can perform much worse than a nested loop join with an index on the inner relation.
In both cases, a bad row count estimate is the cause of the problem. So while the join may be where we spend most of the execution time, the cause is a misestimate that happened earlier on.
How to make PostgreSQL choose the correct join strategy
Find out what the best join strategy is (perhaps PostgreSQL is doing the right thing anyway). You can disable different join strategies temporarily with the
SET command, which changes a parameter in your current database session:
SET enable_hashjoin = off; SET enable_mergejoin = off; SET enable_nestloop = off;
Note that you cannot really disable nested loop joins, only discourage PostgreSQL from using them. If there is no join condition with an
= operator, a nested loop join is the only way.
Tuning queries is often not a simple, straightforward task. However, here are some guidelines and ideas:
- If the bad join strategy is chosen because of a misestimate, try to improve that estimate.
ANALYZEthe table, perhaps with increased
default_statistics_target, and see if that makes a difference. Try rewriting the query with simpler
WHEREconditions to makes the optimizer’s task easier.
- Try increasing
work_memand see if you get a cheaper hash join.
- Configure the parameters that tell PostgreSQL about your hardware and resources:
effective_io_concurrency. That will allow it to price an index scan correctly.
- You can speed up nested loop and merge joins with index-only scans. For that you must add all required columns to the index (ideally with the
INCLUDEclause) and make sure that the table is vacuumed often enough.
Conclusion – join strategies and query tuning
Understanding the 3 join strategies – nested loop join, hash join, and merge join – is crucial for anyone who wants to understand execution plans and tune queries. The art of query tuning cannot be conveyed in a single article, but I hope I could collect some relevant information here.