If you happen to be an SQL developer, you will know that joins are really at the core of the language. Joins come in various flavors: Inner joins, left joins, full joins, natural joins, self joins, semi-joins, lateral joins, and so on. However, one of the most important distinctions is the difference between implicit and explicit joins. Over the years, flame wars have been fought over this issue. Still, not many people know what is really going on. Therefore my post might help to shed some light on the situation.
Using implicit joins
Before we dig into practical examples, it is necessary to create some tables that we can later use to perform our joins:
test=# CREATE TABLE a (id int, aid int); CREATE TABLE test=# CREATE TABLE b (id int, bid int); CREATE TABLE
In the next step some rows are added to those tables:
test=# INSERT INTO a VALUES (1, 1), (2, 2), (3, 3); INSERT 0 3 test=# INSERT INTO b VALUES (2, 2), (3, 3), (4, 4); INSERT 0 3
An implicit join is the simplest way to join data. The following example shows an implicit join:
test=# SELECT * FROM a, b WHERE a.id = b.id; id | aid | id | bid ----+-----+----+----- 2 | 2 | 2 | 2 3 | 3 | 3 | 3 (2 rows)
In this case, all tables are listed in the FROM clause and are later connected in the WHERE clause. In my experience, an implicit join is the most common way to connect two tables. However, my observation might be heavily biased, because an implicit join is the way I tend to write things in my daily work.
Using explicit joins
The following example shows an explicit join. Some people prefer the explicit join syntax over implicit joins because of readability or for whatever other reason:
test=# SELECT * FROM a JOIN b ON (aid = bid); id | aid | id | bid ----+-----+----+----- 2 | 2 | 2 | 2 3 | 3 | 3 | 3 (2 rows)
In this case tables are connected directly using an ON-clause. The ON-clause simply contains the conditions we want to use to join those tables together.
Explicit joins support two types of syntax constructs: ON-clauses and USING-clauses. An ON-clause is perfect in case you want to connect different columns with each other. A using clause is different: It has the same meaning, but it can only be used if the columns on both sides have the same name. Otherwise a syntax error is issued:
test=# SELECT * FROM a JOIN b USING (aid = bid); ERROR: syntax error at or near "=" LINE 1: SELECT * FROM a JOIN b USING (aid = bid);
Using is often used to connect keys with each other as shown in the next example:
test=# SELECT * FROM a JOIN b USING (id); id | aid | bid ----+-----+----- 2 | 2 | 2 3 | 3 | 3 (2 rows)
In my tables both column have a column called “id”, so it is possible to use USING here. Keep in mind: USING is mostly syntactic sugar – there is no deeper meaning here.
Often an explicit join is not just used to join two but more tables. To show how that works, I have added one more table:
test=# CREATE TABLE c (id int, cid int); CREATE TABLE
Let us add some data to this table:
test=# INSERT INTO c VALUES (3, 3), (4, 4), (5, 5); INSERT 0 2
To perform an explicit join, just add addition JOIN and USING clauses (respectively ON clauses) to the statement. Here is an example:
test=# SELECT * FROM a INNER JOIN b USING (id) JOIN c USING (id); id | aid | bid | cid ----+-----+-----+----- 3 | 3 | 3 | 3 (1 row)
Of course the same can be done with an implicit join:
test=# SELECT * FROM a, b, c WHERE a.id = b.id AND b.id = c.id; id | aid | id | bid | id | cid ----+-----+----+-----+----+----- 3 | 3 | 3 | 3 | 3 | 3 (1 row)
However, as you can see, there is a small difference. Check the number of columns returned by the query. You will notice that the implicit join returns more. The “id” column will show up more frequently in this case because the implicit join handles the column list in a slightly different way.
The column list is of course a nasty detail, because in a real application it is always better to explicitly list all columns anyway. However, this little detail should be kept in mind.
join_collapse_limit: What the optimizer does
When I am on the road working as PostgreSQL consultant or PostgreSQL support guy, people often ask if there is a performance difference between implicit and explicit joins. The answer is: “Usually not”. Let us take a look at the following statement:
test=# explain SELECT * FROM a INNER JOIN b USING (id); QUERY PLAN ----------------------------------------------------------------- Merge Join (cost=317.01..711.38 rows=25538 width=12) Merge Cond: (a.id = b.id) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: a.id -> Seq Scan on a (cost=0.00..32.60 rows=2260 width=8) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: b.id -> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (8 rows)
The explicit join produces exactly the same plan as the implicit plan shown below:
test=# explain SELECT * FROM a, b WHERE a.id = b.id; QUERY PLAN ----------------------------------------------------------------- Merge Join (cost=317.01..711.38 rows=25538 width=16) Merge Cond: (a.id = b.id) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: a.id -> Seq Scan on a (cost=0.00..32.60 rows=2260 width=8) -> Sort (cost=158.51..164.16 rows=2260 width=8) Sort Key: b.id -> Seq Scan on b (cost=0.00..32.60 rows=2260 width=8) (8 rows)
So in the majority of all cases, an implicit join does exactly the same thing as an explicit join.
However, this is no always the case. In PostgreSQL there is a variable called join_collapse_limit:
test=# SHOW join_collapse_limit; join_collapse_limit --------------------- 8 (1 row)
What does it mean? If you prefer explicit over implicit joins, the planner will always plan the first couple of joins automatically – regardless of which join order you have used inside the query. The optimizer will simply reorder joins the way they seem to be most promising. But if you keep adding joins, the ones exceeding join_collapse_limit will be planned the way you have put them into the query. As you can easily imagine, we are already talking about fairly complicated queries. Joining 9 or more tables is quite a lot and beyond the typical operation in most cases.
There is another parameter called from_collapse_limit that does the same thing for implicit joins and has the same default value. If a query lists more than from_collapse_limit tables in its FROM clause, the ones exceeding the limit will not be reordered, but joined in the order they appear in the statement.
For the typical, “normal” query the performance and the execution plans stay the same and it makes no difference which type of join you prefer.
If you want to read more about joins, consider reading some of our other blog posts: https://www.cybertec-postgresql.com/en/time-in-postgresql-outer-joins/