Cartoon about UNION ALL in PostgreSQL and type matching
© Laurenz Albe 2022

 

A while ago, I wrote about the performance impact of query parameter data types. Recently I encountered a similar problem with UNION ALL that baffled me for a while, so I’d like to share it with you.

UNION ALL to implement polymorphism

Different entities can have something in common. For example, a customer relationship management system has to deal with customers and employees, both of which are people with a name, birthday and other common attributes. Such polymorphism is easy to model in object oriented programming, but notoriously hard to implement in relational databases. One way to approach the problem is to have a separate table for each entity and to model the “superclass” with UNION ALL:

CREATE SEQUENCE seq;

CREATE TABLE bird (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   wingspan real NOT NULL,
   beak_size double precision NOT NULL
);

CREATE TABLE bat (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   wingspan numeric NOT NULL,
   body_temperature numeric NOT NULL
);

CREATE TABLE cat (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   body_temperature numeric NOT NULL,
   tail_length numeric
);

CREATE VIEW flying_animal AS
   SELECT id, wingspan FROM bird
UNION ALL
   SELECT id, wingspan FROM bat;

CREATE VIEW mammal AS
   SELECT id, body_temperature FROM bat
UNION ALL
   SELECT id, body_temperature FROM cat;

Let’s fill the tables with random data so that we can test the performance:

INSERT INTO bird (wingspan, beak_size)
SELECT 20 + random() * 5, 2 + random()
FROM generate_series(1, 1000000);

INSERT INTO bat (wingspan, body_temperature)
SELECT 15 + random() * 5, 40 + random() * 2
FROM generate_series(1, 1000000);

INSERT INTO cat (body_temperature, tail_length)
SELECT 36.5 + random(), 20 + random() * 3
FROM generate_series(1, 1000000);

A surprising performance difference between the UNION ALL views

We want to lookup animals by joining with a small lookup table:

CREATE TABLE lookup (
   id bigint PRIMARY KEY
);

INSERT INTO lookup
VALUES (42), (500000), (1500000), (1700000), (2500000), (2700000);

ANALYZE lookup;

Now let’s check the join performance with both views:

test=> \timing
Timing is on.
test=> SELECT * FROM flying_animal JOIN lookup USING (id);
   id    │ wingspan  
═════════╪═══════════
      42 │ 23.711483
  500000 │ 24.004992
 1500000 │ 15.684511
 1700000 │ 18.573671
(4 rows)

Time: 384.368 ms
test=> SELECT * FROM mammal JOIN lookup USING (id);
   id    │ body_temperature 
═════════╪══════════════════
 1500000 │ 41.1855152884075
 1700000 │ 40.7127191738703
 2500000 │ 36.5138848514496
 2700000 │ 36.9444466002649
(4 rows)

Time: 1.282 ms

Oops! What’s wrong with those flying animals?

Investigating the difference

Let’s look at the respective execution plans:

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM flying_animal JOIN lookup USING (id);

                                         QUERY PLAN                                          
═════════════════════════════════════════════════════════════════════════════════════════════
 Hash Join (actual time=0.139..542.474 rows=4 loops=1)
   Hash Cond: (bird.id = lookup.id)
   ->  Append (actual time=0.081..459.626 rows=2000000 loops=1)
         ->  Seq Scan on bird (actual time=0.079..72.049 rows=1000000 loops=1)
         ->  Subquery Scan on "*SELECT* 2" (actual time=0.016..314.671 rows=1000000 loops=1)
               ->  Seq Scan on bat (actual time=0.008..73.920 rows=1000000 loops=1)
   ->  Hash (actual time=0.018..0.019 rows=6 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on lookup (actual time=0.008..0.011 rows=6 loops=1)
 Planning Time: 0.311 ms
 Execution Time: 542.533 ms

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM mammal JOIN lookup USING (id);

                                       QUERY PLAN                                       
════════════════════════════════════════════════════════════════════════════════════════
 Nested Loop (actual time=0.083..0.116 rows=4 loops=1)
   ->  Seq Scan on lookup (actual time=0.008..0.010 rows=6 loops=1)
   ->  Append (actual time=0.016..0.017 rows=1 loops=6)
         ->  Index Scan using bat_pkey on bat (actual time=0.012..0.012 rows=0 loops=6)
               Index Cond: (id = lookup.id)
         ->  Index Scan using cat_pkey on cat (actual time=0.004..0.004 rows=0 loops=6)
               Index Cond: (id = lookup.id)
 Planning Time: 0.183 ms
 Execution Time: 0.150 ms

The query on mammal works just the way we’d like it: it performs a nested loop join, using the primary key indexes on bat and cat to quickly retrieve the few rows we need. Contrariwise, the query on flying_animal scans all 2 million rows from bird and bat to perform a hash join.

Can you spot the difference?

Investigating the difference between the UNION ALL queries

To get the fast execution plan, PostgreSQL has to propagate the join condition into the UNION ALL (which turns into “Append” in the execution plan). Looking closer, we notice the strange “Subquery Scan on "*SELECT* 2"” in the slow query. Indeed, the branches of UNION ALL are subqueries. But usually the PostgreSQL optimizer “pulls up” subqueries into the main join tree wherever possible. Let’s read the source in pull_up_subqueries_recurse() in src/backend/optimizer/prep/prepjointree.c:

        /*
         * Alternatively, is it a simple UNION ALL subquery?  If so, flatten
         * into an "append relation".
         *
         * It's safe to do this regardless of whether this query is itself an
         * appendrel member.  (If you're thinking we should try to flatten the
         * two levels of appendrel together, you're right; but we handle that
         * in set_append_rel_pathlist, not here.)
         */
        if (rte->rtekind == RTE_SUBQUERY &&
            is_simple_union_all(rte->subquery))
            return pull_up_simple_union_all(root, jtnode, rte);

The function comment of is_simple_union_all() tells us more:

/*
 * We require all the setops to be UNION ALL (no mixing) and there can't be
 * any datatype coercions involved, ie, all the leaf queries must emit the
 * same datatypes.
 */

That gives us the decisive clue. Looking at the table definitions, we notice that wingspan has data type real in bird, but numeric in bat. The reason why the optimizer cannot flatten the UNION ALL properly is this difference in the data types. This preserves the subquery, which prevents the optimizer from finding the ideal execution plan.

The surprising thing that makes it difficult to reach that conclusion is that the column that caused the problem didn’t play an important role in the query. We could even omit it from the SELECT list, and we’d still get the bad execution plan!

UNION ALL and data types

Both branches of a UNION must have the same number of columns, and the columns must have compatible data types. The documentation describes the exact meaning of “compatible”. Essentially, if the data types differ, PostgreSQL chooses the type for which there is an implicit conversion from the other type (the PostgreSQL jargon for this is a coercion). If there is an implicit conversion in either direction, PostgreSQL prefers the data type from the first branch of the UNION.

In our case, there is an implicit cast from numeric to real, but not the other way around, so wingspan is of data type real in flying_animals. While this allowed us to create the view without problems, it caused the performance problem we observed.

Explicit type casts as a solution for the performance problem

There is a simple way to make sure that the data types on both sides of the UNION ALL are identical: apply an explicit type cast! So we can redefine the view as

CREATE OR REPLACE VIEW flying_animal AS
   SELECT id, wingspan FROM bird
UNION ALL
   SELECT id, wingspan::real FROM bat;

And, sure enough, we end up with a fast execution plan:

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM flying_animal JOIN lookup USING (id);

                                        QUERY PLAN                                        
══════════════════════════════════════════════════════════════════════════════════════════
 Nested Loop (actual time=7.714..10.593 rows=4 loops=1)
   ->  Seq Scan on lookup (actual time=0.008..0.015 rows=6 loops=1)
   ->  Append (actual time=1.682..1.759 rows=1 loops=6)
         ->  Index Scan using bird_pkey on bird (actual time=1.445..1.446 rows=0 loops=6)
               Index Cond: (id = lookup.id)
         ->  Index Scan using bat_pkey on bat (actual time=0.308..0.309 rows=0 loops=6)
               Index Cond: (id = lookup.id)
 Planning Time: 0.452 ms
 Execution Time: 10.673 ms

Conclusion

While it is convenient that PostgreSQL allows us to use slightly different data types on both sides of a UNION ALL, we saw that it can lead to bad execution plans and bad performance. To be on the safe side, apply explicit type casts to render the data types identical.

Perhaps it would be possible to make the optimizer smarter in cases like the one we investigated. However, it is easy to work around this limitation.

If you’ve enjoyed reading this blog, check out my blog about rewriting OR to UNION in PostgreSQL queries.