A join is a concept in IT which is widely used and often referred to but rarely really understood. What are the differences between inner joins, outer joins, semi joins and so on? Let’s shed some light on them and see how inner and outer joins really work.

Producing sample data

Before we can get started we need to create some sample data:

test=# CREATE TABLE a (aid int);
CREATE TABLE
test=# CREATE TABLE b (bid int);
CREATE TABLE
test=# INSERT INTO a
VALUES (1), (2), (3), (4);
INSERT 0 4
test=# INSERT INTO b
VALUES (2), (3), (4), (4), (5);
INSERT 0 5

The structure is actually quite simple. It consists of two tables containing a handful of rows which will be used for our little demo.

Explicit vs implicit joins

The question people often ask is: What is the difference between an implicit and an explicit join? The answer is simply syntactic sugar. Here is an implicit join:

test=# SELECT * FROM a, b WHERE aid = bid;
aid | bid
-----+-----
2 | 2
3 | 3
4 | 4
4 | 4
(4 rows)

In this the join actually happens in the WHERE-clause. In the FROM clause we only list the tables we want to join. The alternative is to use an explicit join:

test=# SELECT * FROM a JOIN b ON (aid = bid);
aid | bid
-----+-----
2 | 2
3 | 3
4 | 4
4 | 4
(4 rows)

Both syntax variations are identical. It is merely a question of taste and style that is relevant here. If you look at the execution plan you will see that both plans are absolutely identical.

test=# explain SELECT * FROM a, b WHERE aid = bid;
QUERY PLAN
-----------------------------------------------------------------
Merge Join (cost=393.42..893.85 rows=32512 width=8)
Merge Cond: (a.aid = b.bid)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: a.aid
-> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: b.bid
-> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4)
(8 rows)

Here is the second plan …

test=# explain SELECT * FROM a JOIN b ON (aid = bid);
QUERY PLAN
-----------------------------------------------------------------
Merge Join (cost=393.42..893.85 rows=32512 width=8)
Merge Cond: (a.aid = b.bid)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: a.aid
-> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4)
-> Sort (cost=179.78..186.16 rows=2550 width=4)
Sort Key: b.bid
-> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4)
(8 rows)

Basically the optimizer handles both queries the same way for MOST queries. If you happen to run a typical application there is no difference. However, there are some subtle differences which are worth mentioning. Let us take a look at the following configuration variable which can be found in postgresql.conf:

test=# SHOW join_collapse_limit;
join_collapse_limit
---------------------
8
(1 row)

This means that 8 explicit joins are implicitly planned. If you are using standard implicit joins PostgreSQL will be able to determine the join order for those joins automatically. This is done to speed up the query. In reality it means that PostgreSQL does not necessarily join data in the order we pass it to the query – things can be shuffled by the optimizer if it deems necessary. However, if you are using explicit joins this is only done for up to 8 explicit joins – the rest is taken in the order defined by the query. While this is definitely a difference between implicit and explicit joins it is usually not that relevant to most applications. Joining 10+ tables is usually more the exception than the rule. Also, keep in mind that joining dozens of tables will already lead to genetic query optimization (GEQO) anyway.

In most cases explicit vs implicit joins is a matter of style rather than performance. However, keep in mind that there are indeed differences when looking at really complex queries.

What is an inner join? What is an outer join?

The next question people often have is: What is the difference between an inner and an outer join? The answer is: An inner join will look for common entries in those tables. An outer join will always take all data on one side and find the corresponding matches on the other side.

Here is an example:

test=# SELECT * FROM a LEFT JOIN b ON (aid = bid);
aid | bid
-----+-----
1 |
2 | 2
3 | 3
4 | 4
4 | 4
(5 rows)

What we do here is to display all records on table “a” (= “left side”) and look for those rows matching on the right side. The opposite of a LEFT JOIN is a RIGHT JOIN. The next query is identical to the first one:

test=# SELECT * FROM b RIGHT JOIN a ON (aid = bid);
bid | aid
-----+-----
  | 1
2 | 2
3 | 3
4 | 4
4 | 4
(5 rows)

 

Full joins

However, there is a third option: A full join. What it does is to take ALL data from both sides and find the matches. Either side of the join that does not have a match will be filled up with NULL values. Here is an example:

test=# SELECT * FROM a FULL JOIN b ON (aid = bid);
aid | bid
-----+-----
1 |
2 | 2
3 | 3
4 | 4
4 | 4
| 5
(6 rows)

While this is relatively easy to code, people often make mistakes due to misconceptions about what outer joins do.

Outer joins: Common mistakes and errors

What happens more often than not is that people add an AND clause to the join condition. What does it mean for the result:

test=# SELECT *
FROM a LEFT JOIN b
ON (aid = bid AND bid = 2);
aid | bid
-----+-----
1 |
2 | 2
3 |
4 |
(4 rows)

Most people actually expect that the result set will be smaller than before but this is not the case. The ON condition merely changes which row match – a LEFT JOIN will still produce all rows on the left side. The additional condition simply changes some entries to become NULL but this is mostly it – it does not reduce the amount of data. Therefore a LEFT JOIN that is broken semantically is often hitting our PostgreSQL 24×7 support desk masked as a performance problem.

Especially when aggregates are involved this can easily hide the underlying semantic problem:

test=# SELECT count(*), count(bid)
FROM a LEFT JOIN b
ON (aid = bid AND bid = 2);
count | count
-------+-------
4 | 1
(1 row)

It is quite common to see wrong results because people expect some data to be filtered away by the additional condition.

IN: What is a semi-join?

But there is more. Often people ask: What is a semi-join? The answer is quite simple: Think of an IN statement:

test=# SELECT * FROM a WHERE aid IN (SELECT bid FROM b);
aid
-----
2
3
4
(3 rows)

The IN-statement is an implicit DISTINCT filter which removes duplicate entries. The difference between a join and a semi-join is therefore the way duplicates are handled.

The execution plan clearly reveals that those duplicates are removed. In this case it is done using a HashAggregate (which you will also see in case of a GROUP BY statement in many cases):

test=# explain SELECT * FROM a WHERE aid IN (SELECT bid FROM b);
QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=46.38..102.75 rows=1275 width=4)
Hash Cond: (a.aid = b.bid)
-> Seq Scan on a (cost=0.00..35.50 rows=2550 width=4)
-> Hash (cost=43.88..43.88 rows=200 width=4)
-> HashAggregate (cost=41.88..43.88 rows=200 width=4)
Group Key: b.bid
-> Seq Scan on b (cost=0.00..35.50 rows=2550 width=4)
(7 rows)

The opposite of an IN statement is NOT IN:

test=# SELECT * FROM a WHERE aid NOT IN (SELECT bid FROM b);
aid
-----
1
(1 row)

In this case we remove all other rows from the result.

Finally …
In case you would like to learn more about joins, check out our blogpost about join strategies.