Rewriting OR is not always the best solution
© Laurenz Albe 2022
In my article that reviles
OR, I showed how in certain cases, it is possible to rewrite
OR in a
WHERE condition to a longer query with
UNION that can perform much better (the “ugly”
OR). Now people have asked me repeatedly why the PostgreSQL optimizer does not perform such a transformation automatically. This article explains the problems involved.
In the article referenced , I show the following query that will never perform well with bigger tables:
SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 OR b.id = 42;
The solution was to rewrite
OR like this:
SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE a.id = 42 UNION SELECT id, a.a_val, b.b_val FROM a JOIN b USING (id) WHERE b.id = 42;
WHERE conditions are selective, then indexes on
b.id can turn both parts of the
UNION into fast nested loop joins.
Can I always rewrite
The above transformation is certainly no secret, and many people know about it. But I keep hearing comments that it is the job of the PostgreSQL optimizer to automatically transform queries like this. The following example shows why that is not always possible:
CREATE TABLE a ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, x integer, p integer ); CREATE TABLE b ( id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY, x integer, q integer ); INSERT INTO a (x, p) VALUES (1, 1), (1, 1), (2, 1); INSERT INTO b (x, q) VALUES (1, 3), (2, 3);
The query with
OR in this case is:
SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE a.p = 1 OR b.q = 3; x │ p │ q ═══╪═══╪═══ 1 │ 1 │ 3 1 │ 1 │ 3 2 │ 1 │ 3 (3 rows)
Rewriting the query with
UNION is no solution
Indeed, if we try it, we get a different result:
SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE a.p = 1 UNION SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE b.q = 3; x │ p │ q ═══╪═══╪═══ 2 │ 1 │ 3 1 │ 1 │ 3 (2 rows)
The reason for the difference is that
UNION removes all duplicates in the result set, similar to
DISTINCT. Consequently, PostgreSQL removes the legitimate duplicate result row.
Why then did this transformation work in the query from the beginning of this article? Well, I made a tacit assumption there, namely that the column
id is the primary key of both tables. If we have a unique identifier in the result set, there cannot be any duplicate rows in the result set, and the above transformation is always correct.
Rewriting the query with
UNION ALL is no solution either
UNION removes duplicate results that we need, then we should try the more efficient
UNION ALL, which does not remove duplicates:
SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE a.p = 1 UNION ALL SELECT x, a.p, b.q FROM a JOIN b USING (x) WHERE b.q = 3; x │ p │ q ═══╪═══╪═══ 1 │ 1 │ 3 1 │ 1 │ 3 2 │ 1 │ 3 1 │ 1 │ 3 1 │ 1 │ 3 2 │ 1 │ 3 (6 rows)
Now we have another problem: all rows that satisfy both conditions in the original
OR query appear twice in the result set, once from every branch in the
UNION ALL. In this example, all rows satisfy both conditions, so the whole result set is duplicated.
So neither of the transformations produces the correct result set!
Why the optimizer doesn’t automatically rewrite
When is it safe to transform
UNION ALL? It is safe if there is no result row that satisfies both of the conditions combined with
OR. This is probably often true, but I cannot think of a case where the database could verify that automatically.
On the other hand, let’s consider when it is safe to transform
UNION. The condition is that the original result set does not contain any duplicates. The only way to guarantee this lack of duplicates is if the
SELECT list contains a non-NULL unique key from both tables involved. Now this is a frequent case: way too many people select more columns than they actually need!
So the query optimizer could in theory check if the
SELECT list contains a non-NULL unique key from both tables. If the tables in the join are actual base tables, such a check might not be too hard. If the tables are derived tables (for examples, the result of a join), then that check would be more difficult.
Such a change would certainly make query optimization more complicated. Every query with an
OR in the
WHERE condition would have to be checked:
- if the
FROMexpression is the join of two relations
- if the two relations are base tables
- if the
SELECTlist contains a non-NULL unique key of both base tables
That is, many queries would require more planning time, but only a few queries would benefit. Moreover, I suspect that a change like this is at the very least complicated, because the alternative query would be radically different from the original. Such changes are typically done in the query rewriter, but we cannot rewrite the query unconditionally, since the original query may perform better than the rewritten one (for example if the
WHERE condition is not selective, or if the necessary indexes are missing).
We have seen that it is not always possible to rewrite
UNION, and it is questionable whether automatic rewriting by PostgreSQL is feasible at all.
In practice, you will probably always be able to make the transformation: either you can determine that duplicate result don’t matter and use
UNION ALL, or you can exclude that there cannot be duplicate results anyway and use
UNION. At any rate, you will have to do it by hand.
In case you’re interested in improving query performance, check out Query Parameter Data Types and Performance
If you would learn the basics about SQL queries, see What is an Inner Join? And What is an Outer Join?