Avoiding “OR” for better query performance

To be OR not to be... — © Laurenz Albe 2018

PostgreSQL query tuning is our daily bread at CYBERTEC, and once you have done some of that, you'll start bristling whenever you see an OR in a query, because it is usually the cause of bad query performance.

Of course there is a reason why there is an OR in SQL, and if you cannot avoid it, you have to use it. But you should be aware of the performance implications.

In this article I'll explore “good” and “bad” ORs and what you can do to avoid the latter.

A little sample schema

We'll use this simple setup for demonstration:

CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL);

INSERT INTO a
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL);

INSERT INTO b
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

ALTER TABLE a ADD PRIMARY KEY (id);
ALTER TABLE b ADD PRIMARY KEY (id);
ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a;

VACUUM (ANALYZE) a;
VACUUM (ANALYZE) b;

Suppose that we want to run queries with equality and LIKE conditions on the text columns, so we need some indexes:

CREATE INDEX a_val_idx ON a(a_val text_pattern_ops);
CREATE INDEX b_val_idx ON b(b_val text_pattern_ops);

Have a look at the documentation if you don't understand text_pattern_ops.
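
In short, the text_pattern_ops operator class makes the B-tree index compare strings character by character instead of by the collation's rules, which is what lets the planner consider the index for LIKE patterns with a constant prefix when the database does not use the C collation. The following is only a sketch against the sample tables; the prefix 'abc' is an arbitrary illustration, and the plan you get depends on your version and statistics:

```sql
-- Prefix search: constant prefix, no leading wildcard,
-- so the planner can consider a_val_idx.
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE 'abc%';

-- Equality comparisons can use a text_pattern_ops index as well.
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val = md5('42');
```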

The “good” `OR`

An OR is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.

So if your OR appears in a CASE expression in the SELECT list, don't worry.
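
A minimal sketch of such a harmless OR, using the sample table (the labels 'special' and 'ordinary' and the range in the WHERE clause are made up for illustration). The OR only decides what is displayed for each row; the rows themselves are selected by the range condition, which can use the primary key index:

```sql
SELECT id,
       CASE WHEN id = 42 OR a_val = 'value 42'
            THEN 'special'
            ELSE 'ordinary'
       END AS kind          -- OR is evaluated per result row, not used as a filter
FROM a
WHERE id BETWEEN 40 AND 45; -- the filter itself contains no OR
```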

Unfortunately you usually find the OR where it hurts: in the WHERE clause.

The “bad” `OR`

Now for an example of an OR in a WHERE clause that is still pretty nice:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR a_val = 'value 42';

                        QUERY PLAN                         
-----------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_val_idx
               Index Cond: (a_val = 'value 42'::text)
(7 rows)

PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem memory.
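
If you want to see how the bitmap fared at run time, something like the following sketch can help. With the ANALYZE option, the Bitmap Heap Scan node reports a "Heap Blocks" line; a non-zero "lossy" count there indicates that the bitmap did not fit into work_mem and had to fall back to remembering whole pages instead of individual rows. With only a handful of matching rows, as in this example, you should not expect any lossy blocks:

```sql
-- Current per-operation memory limit for this session.
SHOW work_mem;

-- ANALYZE executes the query; look for "Heap Blocks: exact=... lossy=..."
-- under the Bitmap Heap Scan node.
EXPLAIN (ANALYZE, COSTS off, TIMING off)
SELECT id FROM a
WHERE id = 42
   OR a_val = 'value 42';
```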

A multi-column index on (id, a_val) won't help at all with this query, so there is no cheaper way to execute it.

`IN` is better than `OR`

Now for a more stupid variant of the above query:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR id = 4711;

                 QUERY PLAN                 
--------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (id = 4711))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 4711)
(7 rows)

Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id IN (42, 4711);

                    QUERY PLAN                     
---------------------------------------------------
 Index Only Scan using a_pkey on a
   Index Cond: (id = ANY ('{42,4711}'::integer[]))
(2 rows)

You see? As soon as you get rid of the OR, an efficient index scan can be used!

You might say that this is good for equality conditions, but what about the following query:

SELECT id FROM a
WHERE a_val LIKE 'something%'
   OR a_val LIKE 'other%';

To improve that query, observe that the PostgreSQL optimizer rewrote the IN in the previous query to = ANY.

This is a case of the standard SQL “quantified comparison predicate”: ANY is true if the comparison is TRUE for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).
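
A practical side effect of the array form is that the whole list of values can be passed as a single parameter. Here is a small sketch with a prepared statement (the name by_ids is made up for this example):

```sql
-- One array parameter replaces an arbitrarily long chain of ORs.
PREPARE by_ids(integer[]) AS
   SELECT id FROM a
   WHERE id = ANY ($1);

EXECUTE by_ids(ARRAY[42, 4711]);
EXECUTE by_ids('{1,2,3}');   -- an array literal works too

DEALLOCATE by_ids;
```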

Now LIKE is a comparison operator as well, so we can write:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on a
   Filter: (a_val ~~ ANY ('{something%,other%}'::text[]))
(2 rows)

Unfortunately, the index cannot be used here.

`pg_trgm` to the rescue

But we are not at the end of our wits yet! There is such a wealth of indexes in PostgreSQL; let's try a different one. For this, we need the pg_trgm extension:

CREATE EXTENSION pg_trgm;

Then we can create a GIN trigram index on the column:

CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops);

Now things are looking better:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
   ->  Bitmap Index Scan on a_val_trgm_idx
         Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
(4 rows)

Feel the power of trigram indexes!

Note 1: This index can also be used if the search pattern starts with %.

Note 2: The GIN index can become quite large. To avoid that, you can also use a GiST index, which is much smaller, but less efficient to search.
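
Both notes can be tried out with a sketch like the one below. The pattern '%abc%' and the index name a_val_trgm_gist_idx are arbitrary choices for this illustration; the size comparison simply reads the on-disk size of both indexes, so you can judge the trade-off on your own data:

```sql
-- Note 1: a pattern with a leading wildcard can still use the trigram index.
EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE '%abc%';

-- Note 2: the GiST alternative (hypothetical index name).
CREATE INDEX a_val_trgm_gist_idx ON a USING gist (a_val gist_trgm_ops);

-- Compare the sizes of the two trigram indexes.
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('a_val_trgm_idx', 'a_val_trgm_gist_idx');
```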

The “ugly” `OR`

Things become really bad if OR combines conditions from different tables:

EXPLAIN (COSTS off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
   OR b.id = 42;

                 QUERY PLAN                  
---------------------------------------------
 Merge Join
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.id = 42) OR (b.id = 42))
   ->  Index Scan using a_pkey on a
   ->  Index Scan using b_pkey on b
(5 rows)

Here we have to compute the complete join between the two tables and afterwards discard all rows that do not match the condition. In our example, that would mean computing 100000 rows only to throw away the 99999 that do not match the condition.
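
You can watch this waste happen by running the query with the ANALYZE option, as in the sketch below. The join node then reports a "Rows Removed by Join Filter" line; on the sample data it should come out at about the 99999 rows mentioned above, whichever join strategy your installation picks:

```sql
-- ANALYZE executes the query and reports how many joined rows
-- were discarded by the OR condition.
EXPLAIN (ANALYZE, COSTS off, TIMING off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
   OR b.id = 42;
```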

Avoiding the ugly `OR`

Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:

EXPLAIN (COSTS off)
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE a.id = 42
UNION
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE b.id = 42;

                        QUERY PLAN                        
----------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: a.id, a.a_val, b.b_val
         ->  Append
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b
                           Index Cond: (id = 42)
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a a_1
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b b_1
                           Index Cond: (id = 42)
(14 rows)

Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION will reduce them to one row.

If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL instead of UNION, because that doesn't have to do the extra processing to remove duplicates.
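
One way to be certain is to make the branches distinct yourself: let the second branch exclude everything the first branch already returns. The following is a sketch on the sample tables with conditions on the text columns (the values md5('42') and md5('4711') are chosen for illustration only). IS NOT TRUE is used rather than a plain negation so that the exclusion would also behave correctly if the column could be NULL:

```sql
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE a.a_val = md5('42')
UNION ALL
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE b.b_val = md5('4711')
     -- skip rows that the first branch has already returned
     AND (a.a_val = md5('42')) IS NOT TRUE;
```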

When rewriting an OR as a UNION, you should be aware that the result is not always an equivalent query: if the original query can return identical rows, these would be removed by the UNION. In our case, we don't have to worry, because the primary keys were included in the query result. I find that this is hardly ever a problem in practice.
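
To see the difference on a deliberately contrived case, consider this sketch. The temporary tables ta and tb and their contents are invented for the demonstration; ta has no primary key, so it can hold two identical rows:

```sql
CREATE TEMP TABLE ta(q integer, x integer, col text);
CREATE TEMP TABLE tb(q integer, y integer);

-- two identical rows in ta, both joining to the single row in tb
INSERT INTO ta VALUES (1, 1, 'dup'), (1, 1, 'dup');
INSERT INTO tb VALUES (1, 2);

-- Original query: 2 rows (the legitimate duplicates).
SELECT col FROM ta JOIN tb USING (q)
WHERE ta.x = 1 OR tb.y = 2;

-- UNION: 1 row, because the legitimate duplicate is removed.
SELECT col FROM ta JOIN tb USING (q) WHERE ta.x = 1
UNION
SELECT col FROM ta JOIN tb USING (q) WHERE tb.y = 2;

-- UNION ALL: 4 rows, because each row satisfies both conditions.
SELECT col FROM ta JOIN tb USING (q) WHERE ta.x = 1
UNION ALL
SELECT col FROM ta JOIN tb USING (q) WHERE tb.y = 2;
```

Neither rewrite matches the original here, which is why including a unique key in the select list (as in the article's example) matters.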

15 Comments

Alexander Shelemin

1 year ago

Hey Laurenz!

Nice blog post (and kudos for your blog in general - I've read many articles here).
I've met most of the mentioned problems multiple times in my work around DB performance.

I've long been wondering about OR involving fields from two tables - you write that "Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition".
Indeed, optimizers have problems with this scenario, but do you have any details as to why it's the case?
Why can't the optimizer first filter rows from one table, and then join to the second (also potentially having filtered it beforehand), given that suitable indices are present?
Or rewrite it to something involving UNION?
I never found a good explanation of the essential complexity here.. Probably you know some subtler details, or there's some paper about it, or a nice thread on Postgres hackers or someplace else?
Thanks in advance!

laurenz

1 year ago

Reply to Alexander Shelemin

I actually wrote a whole new article about that.

Alexander Shelemin

1 year ago

Reply to laurenz

yep, it was funny - I listened to some Postgres related podcast, and they mentioned your (new) article, saying something along the lines "Someone asked Laurenz in the comments, and he wrote an article on UNION vs OR". And I thought - hmm, that someone is me!
Thanks for the article 😀

laurenz

1 year ago

Reply to Alexander Shelemin

The only thing the optimizer could possibly do is to rewrite the query to use UNION or UNION ALL, but there is a problem. Consider this query:

SELECT col FROM a JOIN b USING (q) WHERE a.x = 1 OR b.y = 2;

If we rewrite it with UNION ALL, we could get a bad duplicate result if a result row satisfies both conditions. If, on the other hand, we use UNION, that could eliminate a legitimate duplicate result.
In some cases it may be possible to prove that one of the alternatives is indeed correct, but that is beyond the capabilities of the optimizer.

Alexander Shelemin

1 year ago

Reply to laurenz

Thanks - makes sense regarding UNION!
But why is filtering either table before the join not a viable strategy?

laurenz

1 year ago

Reply to Alexander Shelemin

Think about it. Given this query:
SELECT * FROM a JOIN b USING (col) WHERE a.x = 1 OR b.y = 2;
there can be results where a.x is not 1 and results where b.y is not 2. You would miss those results if you filter before you join. That strategy only works for AND.

Marc Rechté

4 years ago

Very interesting post, thanks.
I get a different plan for the "ugly or" (increasing the size of the tables does not change it):
QUERY PLAN
---------------------------------------------
Hash Join
Hash Cond: (a.id = b.id)
Join Filter: ((a.id = 42) OR (b.id = 42))
-> Seq Scan on a
-> Hash
-> Seq Scan on b
(6 rows)

laurenz

4 years ago

Reply to Marc Rechté

The join strategy doesn't matter. Still, it will calculate the whole join and filter away whatever does not satisfy the condition.

Charlie

4 years ago

These "human powered query optimizations" usually result in poor performance down the road when the machine-powered query optimizer has adapted to the anti-case. I've seen many nearly-impossible-to-read SQL queries written 'for performance' that, when re-written to be naive, are 10 times faster.

This seems like a good feature request for a new query optimizer step.

Bruno da Silva

4 years ago

Forgive me for being so gross: does this apply to MySQL too?

laurenz

4 years ago

Reply to Bruno da Silva

OR is hard for all query optimizers, so in principle yes.
Many of the techniques and workarounds in the article are PostgreSQL specific though, so they are not applicable to other databases.
For example, I would be surprised if a database that knows no join algorithm other than nested loops has anything as sophisticated as a bitmap index scan.

Francisco Prado

4 years ago

In PostgreSQL 12 the statement

EXPLAIN (COSTS OFF)
SELECT a.id, a.a_val, b.b_val
FROM a INNER JOIN b ON b.id = a.id
WHERE a.id = 42
OR b.id = 42;

produces

Hash Join
Hash Cond: (a.id = b.id)
Join Filter: ((a.id = 42) OR (b.id = 42))
-> Seq Scan on a
-> Hash
-> Seq Scan on b

I have looked at this from every angle and I don't understand why it gives a different result.

urikanegun

6 years ago

Hello, I'd like to translate the article https://www.cybertec-postgresql.com/avoid-or-for-better-performance/ into Japanese and publish on our tech blog https://techracho.bpsinc.jp/ for sharing it. Is it OK for you?

I make sure to indicate the link to original, title, author name in the case.

Best regards,

Marcin

6 years ago

Great post! I think I don't quite understand one part. You wrote that "a bitmap index scan is much more expensive than a normal index scan — it has to scan the complete index. ". Why is it so? AFAIK, the main difference between an index scan and a bitmap index scan is that the former goes to the table immediately after finding a matching tuple, while the latter adds them to bitmap, merges bitmaps from all indexes and visits the tuples in physical order.

bma

6 years ago

Perhaps an alternative to the "ugly or" query could be to pass the filter to the join condition for the secondary table, which won't change the results as it's an INNER JOIN.

EXPLAIN (COSTS off)
SELECT a.id, a.a_val, b.b_val
FROM a JOIN b ON (a.id=b.id and b.id = 42)
WHERE a.id = 42;

QUERY PLAN
------------------------------------
Nested Loop
-> Index Scan using a_pkey on a
Index Cond: (id = 42)
-> Index Scan using b_pkey on b
Index Cond: (id = 42)
(5 rows)
