CYBERTEC PostgreSQL Logo

About cross join in PostgreSQL

06.2021 / Category: / Tags: | | |
CROSS JOINS - promiscuity extreme!
© Laurenz Albe 2021

 

For many people, “cross join” is something to be afraid of. They remember the time when they forgot the join condition and the DBA was angry, because the query hogged the CPU and filled the disk. However, there are valid use cases for cross joins which I want to explore in this article.

What is a cross join?

The term comes from relational algebra, which also calls the Cartesian product between two relations cross product and denotes it by A × B. This is the most basic kind of join: it combines every row of one table with every row of another table.

A simple example:

Table A Table B
name birthday street city
John 1989-08-24 Roman Road Kexborough
Paul 2001-10-03 Earl Street Fitzwilliam
Claude 1996-04-18

The cross product of the tables would be:

A × B
name birthday street city
John 1989-08-24 Roman Road Kexborough
Paul 2001-10-03 Roman Road Kexborough
Claude 1996-04-18 Roman Road Kexborough
John 1989-08-24 Earl Street Fitzwilliam
Paul 2001-10-03 Earl Street Fitzwilliam
Claude 1996-04-18 Earl Street Fitzwilliam

Cross join in SQL

There are two ways to write the cross join of A and B in SQL.

A comma separated list in the FROM clause:

With the explicit CROSS JOIN syntax:

What is problematic about cross joins?

Cross joins are the most basic joins, and you can think of an inner join as a cross join with an additional filter condition. Of course, PostgreSQL doesn't calculate inner joins that way. It uses more efficient join strategies.

If you write your joins using a comma separated table list (like in the first example above), an inner join and a cross join look very similar. The only difference is a WHERE condition. Now it is a frequent mistake to forget a join condition when you develop an SQL query. The result of such an omission is that you get way more result rows than you reckoned with: a cross join between two tables with a million rows each would result in a trillion rows!

Such a query will run forever. If the query contains an ORDER BY clause, the database server has to cache the whole result set in order to sort it. Since this result set doesn't fit into memory, PostgreSQL will start writing temporary files to hold the data. These temporary files can fill up the disk. As soon as the query runs out of disk space, PostgreSQL rolls it back and deletes the temporary files. However, if the timing is bad, even a short out-of-disk condition can cause the database server to crash.

How can I protect myself from unintended cross joins?

Never use the “comma separated list” syntax to write joins!

If you write your inner joins as a JOIN b, it is a syntax error to omit the join condition (ON or USING). A cross join is then explicitly written as CROSS JOIN and cannot happen by mistake.

Valid use cases for cross joins

The above sounds pretty discouraging, but there are situations when a cross join is just what you need. In the following, I present two typical cases:

Lateral cross join

In a lateral join, a join relation (an expression or subquery) can refer to earlier entries in the FROM clause. This is very often used in combination with table functions: if you want to join a row with all the table function results that belong to it, you use a lateral join. In that case,  LATERAL already implies that each row is only joined to the function results that belong to it, so there is no need for an extra join condition.

Here is an example (not recommended for your production database!):

Then the arrays could be unpacked with a lateral expression like this:

Joining with a “variable”

Sometimes you have a more complicated or expensive expression that you want to use in several places with one query. In that case, it can be a good idea to write a common table expression. You have to use that CTE in the FROM clause, typically with a cross join:

Conclusion

You don't need cross joins very often, but sometimes they do come in handy.

Avoid the “comma separated list” join syntax, so that you don't get cross joins by mistake. Such a statement can run forever and use up your database machine's resources.

 In case you need help to run your database in the most efficient way possible, CYBERTEC offers 24/7 support services to customers around the world.

0 0 votes
Article Rating
Subscribe
Notify of
guest
2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
notzippy
notzippy
2 years ago

Another interesting thing you can do is create an index column from the array using ordinality, you then have an index key you can use


SELECT mytab.id, j.elem, ordinality- 1 as index
FROM mytab
CROSS JOIN LATERAL
jsonb_array_elements_text(
mytab.data -> 'key'
) with ordinality AS j(elem);

Högvälborne Efraim Pissmyra
Högvälborne Efraim Pissmyra
3 years ago

Back in the days I used to write reporting queries for a company with multiple brands. The requested format was often a csv with "day, brand, fact columns..." with one row per brand per day, even if a brand had no activity. I often found myself doing "FROM generate_series(...) CROSS JOIN brands", then LEFT JOINing various things onto that, and wrapping fact columns in COALESCE(... ,0).

A similar trick that often comes in handy in for such reporting queries is to cross join a hardcoded set of values to put as one of the leftmost columns: CROSS JOIN (VALUES ('a'),('b'),('c')) foo(bar)

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram