To be OR not to be...
© Laurenz Albe 2018

 

PostgreSQL query tuning is our daily bread at Cybertec, and once you have done some of that, you’ll start bristling whenever you see an OR in a query, because they are usually the cause for bad query performance.

Of course there is a reason why there is an OR in SQL, and if you cannot avoid it, you have to use it. But you should be aware of the performance implications.

In this article I’ll explore “good” and “bad” ORs and what you can do to avoid the latter.

A little sample schema

We’ll use this simple setup for demonstration:

CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL);

INSERT INTO a
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL);

INSERT INTO b
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

ALTER TABLE a ADD PRIMARY KEY (id);
ALTER TABLE b ADD PRIMARY KEY (id);
ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a;

VACUUM (ANALYZE) a;
VACUUM (ANALYZE) b;

Suppose that we want to run queries with equality and LIKE conditions on the text columns, so we need some indexes:

CREATE INDEX a_val_idx ON a(a_val text_pattern_ops);
CREATE INDEX b_val_idx ON b(b_val text_pattern_ops);

Have a look at the documentation if you don’t understand text_pattern_ops.

The “good” OR

An OR is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.

So if your OR appears in a CASE expression in the SELECT list, don’t worry.

Unfortunately you usually find the OR where it hurts: in the WHERE clause.

The “bad” OR

Now for an example of an OR in a WHERE clause that is still pretty nice:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR a_val = 'value 42';

                        QUERY PLAN                         
-----------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_val_idx
               Index Cond: (a_val = 'value 42'::text)
(7 rows)

PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem memory.

A multi-column index on (id, a_val) won’t help at all with this query, so there is no cheaper way to execute it.

IN is better than OR

Now for a more stupid variant of the above query:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR id = 4711;

                 QUERY PLAN                 
--------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (id = 4711))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 4711)
(7 rows)

Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id IN (42, 4711);

                    QUERY PLAN                     
---------------------------------------------------
 Index Only Scan using a_pkey on a
   Index Cond: (id = ANY ('{42,4711}'::integer[]))
(2 rows)

You see? As soon as you get rid of the OR, an efficient index scan can be used!

You might say that this is good for equality conditions, but what about the following query:

SELECT id FROM a
WHERE a_val LIKE 'something%'
   OR a_val LIKE 'other%';

To improve that query, observe that the PostgreSQL optimizer rewrote the IN in the previous query to = ANY.

This is a case of the standard SQL “quantified comparison predicate”: <comparison operator> ANY is true if the comparison is TRUE for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).

Now LIKE is a comparison operator as well, so we can write:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on a
   Filter: (a_val ~~ ANY ('{something%,other%}'::text[]))
(2 rows)

Unfortunately, the index cannot be used here.

pg_trgm to the rescue

But we are not at the end of our wits yet! There is such a wealth of indexes in PostgreSQL; let’s try a different one. For this, we need the pg_trgm extension:

CREATE EXTENSION pg_trgm;

Then we can create a GIN trigram index on the column:

CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops);

Now things are looking better:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
   ->  Bitmap Index Scan on a_val_trgm_idx
         Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
(4 rows)

Feel the power of trigram indexes!

Note 1: This index can also be used if the search pattern starts with %

Note 2: The GIN index can become quite large. To avoid that, you can also use a GiST index, which is much smaller, but less efficient to search.

The “ugly” OR

Things become really bad if OR combines conditions from different tables:

EXPLAIN (COSTS off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
   OR b.id = 42;

                 QUERY PLAN                  
---------------------------------------------
 Merge Join
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.id = 42) OR (b.id = 42))
   ->  Index Scan using a_pkey on a
   ->  Index Scan using b_pkey on b
(5 rows)

Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition. In our example, that would mean computing 100000 rows only to throw away the 99999 that do not natch the condition.

Avoiding the ugly OR

Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:

EXPLAIN (COSTS off)
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE a.id = 42
UNION
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE b.id = 42;

                        QUERY PLAN                        
----------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: a.id, a.a_val, b.b_val
         ->  Append
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b
                           Index Cond: (id = 42)
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a a_1
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b b_1
                           Index Cond: (id = 42)
(14 rows)

Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION will reduce them to one row.

If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL instead of UNION, because that doesn’t have to do the extra processing to remove duplicates.

When using this trick, you should be aware that rewriting a query in that fashion does not always result in an equivalent query: if the original query can return identical rows, these would be removed by the UNION. In our case, we don’t have to worry, because the primary keys were included in the query result. I find that this is hardly ever a problem in practice.

While doing PostgreSQL consulting for a German client, I stumbled over an interesting issue this week, which might be worth sharing with some folks out on the Internet, it’s all about grouping.

Suppose you are measuring the same thing various times on different sensors every, say, 15 minutes. Maybe some temperature, some air pressure or whatever. The data might look like it is shown in the next table:


CREATE TABLE t_data (t time, val int);



COPY t_data FROM stdin;

14:00   12

14:01   22

14:01   43

14:14   32

14:15   33

14:16   27

14:30   19

\.

The human eye can instantly spot that 14:00 and 14:01 could be candidates for grouping (maybe the differences are just related to latency or some slighty inconsistent timing). The same applies to 14:14 to 14:16. You might want to have this data in the same group during aggregation.

The question now is: How can that be achieved with PostgreSQL?

Some dirty SQL trickery

The first thing to do is to check out those difference from one timestamp to the next:


SELECT *, lag(t, 1) OVER (ORDER BY t)

FROM    t_data;

The lag function offers a nice way to solve this kind of problem:


t     | val |   lag

----------+-----+----------

14:00:00 |  12 |

14:01:00 |  22 | 14:00:00

14:01:00 |  43 | 14:01:00

14:14:00 |  32 | 14:01:00

14:15:00 |  33 | 14:14:00

14:16:00 |  27 | 14:15:00

14:30:00 |  19 | 14:16:00

(7 rows)

Now that we have used lag to “move” the time to the next row, there is a simple trick, which can be applied:


SELECT  *, CASE WHEN t - lag < '10 minutes'

THEN currval('seq_a')

ELSE nextval('seq_a') END AS g

FROM    ( SELECT *, lag(t, 1) OVER (ORDER BY t)

FROM  t_data) AS x;

Moving the lag to a subselect allows us to start all over again and to create those groups. The trick now is: If the difference from one line to the next is high, start a new group – otherwise stay within the group.

This leaves us with a simple result set:


t     | val |   lag    | g

----------+-----+----------+---

14:00:00 |  12 |          | 1

14:01:00 |  22 | 14:00:00 | 1

14:01:00 |  43 | 14:01:00 | 1

14:14:00 |  32 | 14:01:00 | 2

14:15:00 |  33 | 14:14:00 | 2

14:16:00 |  27 | 14:15:00 | 2

14:30:00 |  19 | 14:16:00 | 3

(7 rows)

From now on, life is simple. We can take this output and aggregate on this data easily. “GROUP BY g” will give us nice groups for each value of “g”.