join strategies misunderstood
© Laurenz Albe 2020

There are three join strategies in PostgreSQL that work quite differently. If PostgreSQL chooses the wrong strategy, query performance can suffer a lot. This article explains the join strategies, how you can support them with indexes, what can go wrong with them and how you can tune your joins for better performance.

Terminology

Relation

A join combines data from two relations. Such a relation can be a table base relation or the result of any plan node. For example, in a join like

SELECT ...
FROM a
   JOIN (b LEFT JOIN c ON ...)
      ON ...

the base relation a will be joined to the result of the join of b and c.

Inner and outer relation

The execution plan for any join looks like this:

EXPLAIN (COSTS OFF)
SELECT * FROM a JOIN b USING (id);

         QUERY PLAN         
----------------------------
 Hash Join
   Hash Cond: (a.id = b.id)
   ->  Seq Scan on a
   ->  Hash
         ->  Seq Scan on b
(5 rows)

We call the upper of the joined relations (in this case the sequential scan on a) the outer relation of the join, and we call the lower relation (the hash computed from b) the inner relation.

Join condition and join key

A Cartesian product or cross join of two relations is what you get if you combine each row from one relation with each row of the other. The join condition is a filter that excludes some of these combinations. There are several ways to write a join condition, but all can be transformed to

a <join type> JOIN b ON <join condition>

If the join condition is of the form

a.col1 <operator> b.col2 [AND ...]

I call “a.col1” and “b.col2join keys.

Note that for inner joins there is no distinction between the join condition and the WHERE condition, but that doesn’t hold for outer joins.

Nested loop join strategy

This is the simplest and most general join strategy of all.

PostgreSQL scans the outer relation sequentially, and for each result row it scans the inner relation for matching rows.

Indexes that can help with nested loop joins

Since we scan the outer relation sequentially, no index on the outer relation will help. But an index on the join key of the inner relation can speed up a nested loop join considerably.

Use cases for the nested loop join strategy

Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won’t be executed too often. It is the typical join strategy used in OLTP workloads with a normalized data model, where it is highly efficient. If the outer relation is large, nested loop joins are usually very inefficient, even if they are supported by an index on the inner relation.

Apart from that, it is the only join strategy that can be used if no join condition uses the = operator. So it also serves as a fall-back strategy if no other strategy can be used.

Hash join strategy

First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the = operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.

This is somewhat similar to a nested loop join. Building the hash table is an extra start-up effort, but probing the hash is much faster than scanning the inner relation.

Indexes that can help with hash joins

Since we scan both relations sequentially, no index will help with a hash join.

Use cases for the hash join strategy

Hash joins are best if none of the involved relations are small, but the hash table for the smaller table fits in work_mem. This is because otherwise PostgreSQL would build the hash in several batches and store them in temporary disk files, which hurts performance. In such cases the optimizer usually chooses a different join strategy like a merge join.

Looking up values in a hash table only works if the operator in the join condition is =, so you need at least one join condition with that operator.

Merge join strategy

In a merge join, PostgreSQL picks all join conditions with the = operator. It then sorts both tables by the join keys (which means that the data types must be sortable). Then it iterates through both sorted lists and finds matching entries.

Indexes that help with a merge join

An index on the sort keys can speed up sorting, so an index on the join keys on both relations can speed up a merge join. However, an explicit sort is often cheaper unless an index only scan can be used.

Use cases for the merge join strategy

The optimizer usually chooses a merge join if the involved relations are both too big for a hash that fits into work_mem. So this is the best strategy for joining really large tables.

Like a hash join, a merge join is only feasible if there is at least one join condition with the = operator.

Summary table for PostgreSQL join strategies

Nested Loop JoinHash JoinMerge Join
AlgorithmFor each outer relation row, scan the inner relationBuild a hash from the inner relation, scan the outer relation, probe the hashSort both relations and merge rows
Indexes that helpIndex on the join keys of the inner relationNoneIndexes on the join keys of both relations
Good strategy ifthe outer table is smallthe hash table fits into work_memboth tables are large

Impact on query performance

Choosing the wrong join strategy leads to bad performance:

  • If the optimizer underestimates a row count, it may choose a nested loop join by mistake. Then it scans the inner relation more often than it bargained for, which leads to bad performance.
  • If the optimizer overestimates a row count, it may choose a hash or merge join by mistake. Then it has to scan both relations completely, which can perform much worse than a nested loop join with an index on the inner relation.

In both cases, a bad row count estimate is the cause of the problem. So while the join may be where we spend most of the execution time, the cause is a misestimate that happened earlier on.

How to make PostgreSQL choose the correct join strategy

Find out what the best join strategy is (perhaps PostgreSQL is doing the right thing anyway). You can disable different join strategies temporarily with the SET command, which changes a parameter in your current database session:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
SET enable_nestloop = off;

Note that you cannot really disable nested loop joins, only discourage PostgreSQL from using them. If there is no join condition with an = operator, a nested loop join is the only way.

Tuning queries is often not a simple, straightforward task. However, here are some guidelines and ideas:

  • If the bad join strategy is chosen because of a misestimate, try to improve that estimate. ANALYZE the table, perhaps with increased default_statistics_target, and see if that makes a difference. Try rewriting the query with simpler WHERE conditions to makes the optimizer’s task easier.
  • Try increasing work_mem and see if you get a cheaper hash join.
  • Configure the parameters that tell PostgreSQL about your hardware and resources: random_page_cost, effective_cache_size and effective_io_concurrency. That will allow it to price an index scan correctly.
  • You can speed up nested loop and merge joins with index-only scans. For that you must add all required columns to the index (ideally with the INCLUDE clause) and make sure that the table is vacuumed often enough.

Conclusion

Understanding the join strategies is crucial for anyone who wants to understand execution plans and tune queries. The art of query tuning cannot be conveyed in a single article, but I hope I could collect some relevant information here.

If you want to read more about tuning queries with joins, read some of our other articles on the topic, like Joining 1 million tables or Speeding up GROUP BY and joins.