© Laurenz Albe 2022
A while ago, I wrote about the performance impact of query parameter data types. Recently I encountered a similar problem with
UNION ALL that baffled me for a while, so I’d like to share it with you.
UNION ALL to implement polymorphism
Different entities can have something in common. For example, a customer relationship management system has to deal with customers and employees, both of which are people with a name, birthday and other common attributes. Such polymorphism is easy to model in object-oriented programming, but notoriously hard to implement in relational databases. One way to approach the problem is to have a separate table for each entity and to model the “superclass” with a view that combines the tables with
CREATE SEQUENCE seq;

CREATE TABLE bird (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   wingspan real NOT NULL,
   beak_size double precision NOT NULL
);

CREATE TABLE bat (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   wingspan numeric NOT NULL,
   body_temperature numeric NOT NULL
);

CREATE TABLE cat (
   id bigint PRIMARY KEY DEFAULT nextval('seq'),
   body_temperature numeric NOT NULL,
   tail_length numeric
);

CREATE VIEW flying_animal AS
SELECT id, wingspan FROM bird
UNION ALL
SELECT id, wingspan FROM bat;

CREATE VIEW mammal AS
SELECT id, body_temperature FROM bat
UNION ALL
SELECT id, body_temperature FROM cat;
Let’s fill the tables with random data so that we can test the performance:
INSERT INTO bird (wingspan, beak_size)
SELECT 20 + random() * 5,
       2 + random()
FROM generate_series(1, 1000000);

INSERT INTO bat (wingspan, body_temperature)
SELECT 15 + random() * 5,
       40 + random() * 2
FROM generate_series(1, 1000000);

INSERT INTO cat (body_temperature, tail_length)
SELECT 36.5 + random(),
       20 + random() * 3
FROM generate_series(1, 1000000);
A surprising performance difference between the UNION ALL views
We want to look up animals by joining with a small lookup table:
CREATE TABLE lookup (
   id bigint PRIMARY KEY
);

INSERT INTO lookup
VALUES (42), (500000), (1500000), (1700000), (2500000), (2700000);

ANALYZE lookup;
Now let’s check the join performance with both views:
test=> \timing
Timing is on.
test=> SELECT * FROM flying_animal JOIN lookup USING (id);

   id    │ wingspan
═════════╪═══════════
      42 │ 23.711483
  500000 │ 24.004992
 1500000 │ 15.684511
 1700000 │ 18.573671
(4 rows)

Time: 384.368 ms
test=> SELECT * FROM mammal JOIN lookup USING (id);

   id    │ body_temperature
═════════╪══════════════════
 1500000 │ 41.1855152884075
 1700000 │ 40.7127191738703
 2500000 │ 36.5138848514496
 2700000 │ 36.9444466002649
(4 rows)

Time: 1.282 ms
Oops! What’s wrong with those flying animals?
Investigating the difference
Let’s look at the respective execution plans:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM flying_animal JOIN lookup USING (id);

                                         QUERY PLAN
═════════════════════════════════════════════════════════════════════════════════════════════
 Hash Join (actual time=0.139..542.474 rows=4 loops=1)
   Hash Cond: (bird.id = lookup.id)
   ->  Append (actual time=0.081..459.626 rows=2000000 loops=1)
         ->  Seq Scan on bird (actual time=0.079..72.049 rows=1000000 loops=1)
         ->  Subquery Scan on "*SELECT* 2" (actual time=0.016..314.671 rows=1000000 loops=1)
               ->  Seq Scan on bat (actual time=0.008..73.920 rows=1000000 loops=1)
   ->  Hash (actual time=0.018..0.019 rows=6 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on lookup (actual time=0.008..0.011 rows=6 loops=1)
 Planning Time: 0.311 ms
 Execution Time: 542.533 ms

EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM mammal JOIN lookup USING (id);

                                       QUERY PLAN
════════════════════════════════════════════════════════════════════════════════════════
 Nested Loop (actual time=0.083..0.116 rows=4 loops=1)
   ->  Seq Scan on lookup (actual time=0.008..0.010 rows=6 loops=1)
   ->  Append (actual time=0.016..0.017 rows=1 loops=6)
         ->  Index Scan using bat_pkey on bat (actual time=0.012..0.012 rows=0 loops=6)
               Index Cond: (id = lookup.id)
         ->  Index Scan using cat_pkey on cat (actual time=0.004..0.004 rows=0 loops=6)
               Index Cond: (id = lookup.id)
 Planning Time: 0.183 ms
 Execution Time: 0.150 ms
The query on mammal works just the way we’d like it: it performs a nested loop join, using the primary key indexes on bat and cat to quickly retrieve the few rows we need. Contrariwise, the query on flying_animal reads all 2 million rows from bird and bat to perform a hash join.
Can you spot the difference?
Investigating the difference between the UNION ALL queries
To get the fast execution plan, PostgreSQL has to propagate the join condition into the UNION ALL (which turns into an “Append” node in the execution plan). Looking closer, we notice the strange “Subquery Scan on "*SELECT* 2"” in the slow query. Indeed, the branches of a UNION ALL are subqueries. But usually the PostgreSQL optimizer “pulls up” subqueries into the main join tree wherever possible. Let’s read the source in src/backend/optimizer/prep/prepjointree.c:
    /*
     * Alternatively, is it a simple UNION ALL subquery?  If so, flatten
     * into an "append relation".
     *
     * It's safe to do this regardless of whether this query is itself an
     * appendrel member.  (If you're thinking we should try to flatten the
     * two levels of appendrel together, you're right; but we handle that
     * in set_append_rel_pathlist, not here.)
     */
    if (rte->rtekind == RTE_SUBQUERY &&
        is_simple_union_all(rte->subquery))
        return pull_up_simple_union_all(root, jtnode, rte);
The function comment of
is_simple_union_all() tells us more:
    /*
     * We require all the setops to be UNION ALL (no mixing) and there can't be
     * any datatype coercions involved, ie, all the leaf queries must emit the
     * same datatypes.
     */
That gives us the decisive clue. Looking at the table definitions, we notice that wingspan has data type real in bird, but numeric in bat. This difference in the data types is the reason why the optimizer cannot flatten the UNION ALL properly. It preserves the subquery, which prevents the optimizer from finding the ideal execution plan.
The surprising thing that makes it difficult to reach that conclusion is that the column that caused the problem didn’t play an important role in the query. We could even omit it from the
SELECT list, and we’d still get the bad execution plan!
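We can verify that claim with a quick sketch against the tables above (assuming the original view definition is still in place). The wingspan column is never referenced, yet the plan should still contain the subquery scan and the full table scans:

```sql
-- wingspan does not appear in the query at all,
-- but the mismatched column types still block flattening:
EXPLAIN (COSTS OFF)
SELECT id FROM flying_animal JOIN lookup USING (id);
```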
UNION ALL and data types
Both branches of a
UNION must have the same number of columns, and the columns must have compatible data types. The documentation describes the exact meaning of “compatible”. Essentially, if the data types differ, PostgreSQL chooses the type for which there is an implicit conversion from the other type (the PostgreSQL jargon for such a conversion is a coercion). If there is an implicit conversion in either direction, PostgreSQL prefers the data type from the first branch of the UNION.
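You can watch this type resolution directly with pg_typeof() in a minimal sketch that is independent of our tables:

```sql
-- numeric can be implicitly cast to real, so the type of the
-- first branch (real) becomes the common type of the UNION ALL:
SELECT pg_typeof(c)
FROM (SELECT 1::real AS c
      UNION ALL
      SELECT 1::numeric) AS t
LIMIT 1;
```

On a stock installation, this should report real.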
In our case, there is an implicit cast from numeric to real, but not the other way around, so wingspan is of data type real in flying_animal. While this allowed us to create the view without problems, it caused the performance problem we observed.
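The available cast directions can be looked up in the pg_cast catalog, where castcontext is 'i' for implicit, 'a' for assignment-only and 'e' for explicit-only casts:

```sql
-- inspect the casts between real and numeric:
SELECT castsource::regtype AS source,
       casttarget::regtype AS target,
       castcontext
FROM pg_cast
WHERE (castsource, casttarget) IN (('real'::regtype, 'numeric'::regtype),
                                   ('numeric'::regtype, 'real'::regtype));
```

On a stock installation, you should see that the cast from numeric to real is implicit ('i'), while the cast from real to numeric is assignment-only ('a').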
Explicit type casts as a solution for the performance problem
There is a simple way to make sure that the data types on both sides of the
UNION ALL are identical: apply an explicit type cast! So we can redefine the view as
CREATE OR REPLACE VIEW flying_animal AS
SELECT id, wingspan FROM bird
UNION ALL
SELECT id, wingspan::real FROM bat;
And, sure enough, we end up with a fast execution plan:
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM flying_animal JOIN lookup USING (id);

                                        QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════
 Nested Loop (actual time=7.714..10.593 rows=4 loops=1)
   ->  Seq Scan on lookup (actual time=0.008..0.015 rows=6 loops=1)
   ->  Append (actual time=1.682..1.759 rows=1 loops=6)
         ->  Index Scan using bird_pkey on bird (actual time=1.445..1.446 rows=0 loops=6)
               Index Cond: (id = lookup.id)
         ->  Index Scan using bat_pkey on bat (actual time=0.308..0.309 rows=0 loops=6)
               Index Cond: (id = lookup.id)
 Planning Time: 0.452 ms
 Execution Time: 10.673 ms
While it is convenient that PostgreSQL allows us to use slightly different data types on both sides of a
UNION ALL, we saw that it can lead to bad execution plans and bad performance. To be on the safe side, apply explicit type casts to render the data types identical.
Perhaps it would be possible to make the optimizer smarter in cases like the one we investigated. However, it is easy to work around this limitation.
If you’ve enjoyed reading this blog, check out my blog about rewriting OR to UNION in PostgreSQL queries.