Recently, I read Laurenz Albe's blog post about case-insensitive string search. He recommends case-insensitive collations, saying "it won't be any more expensive (and maybe more intuitive) than the current techniques of using upper() or the citext extension are today". In other words, a custom collation can be used instead of citext to achieve case insensitivity with a lower performance penalty. Today we will be testing that claim.
CREATE COLLATION case_insensitive_accent_sensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

CREATE TABLE demo1 (
    word TEXT COLLATE case_insensitive_accent_sensitive
);
We created the collation as case insensitive but accent sensitive, because citext behaves the same way.
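As a side note, ICU lets you lower the comparison strength further. This sketch (my own, not from the original post; the collation name ci_ai is hypothetical) shows how ks-level1 would ignore accents as well, while the ks-level2 collation above keeps them significant:

```sql
-- Hypothetical: primary strength (level 1) ignores case AND accents
CREATE COLLATION ci_ai (
    provider = icu,
    locale = 'und-u-ks-level1',
    deterministic = false
);

SELECT 'Åpple' = 'apple' COLLATE ci_ai;                              -- true: accents ignored too
SELECT 'Åpple' = 'apple' COLLATE case_insensitive_accent_sensitive;  -- false: accents still count
SELECT 'APPLE' = 'apple' COLLATE case_insensitive_accent_sensitive;  -- true: case ignored
```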
create extension citext;

CREATE TABLE demo2 (
    word citext
);

INSERT INTO demo1 VALUES ('apple'), ('APPLE'), ('Åpple'), ('Apple');
INSERT INTO demo2 VALUES ('apple'), ('APPLE'), ('Åpple'), ('Apple');
Then we duplicate all these records until we have ~32M rows in both tables:
insert into demo1 select * from demo1;
insert into demo1 select * from demo1;
.
.
.

select count(*) from demo1;
  count
----------
 33554432
(1 row)

select count(*) from demo2;
  count
----------
 33554432
(1 row)

create index x on demo1(word);
create index y on demo2(word);

update demo1 set word = 'applex' where ctid = '(0,1)';
update demo2 set word = 'applex' where ctid = '(0,1)';
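The row count checks out: each INSERT ... SELECT doubles the table, so starting from the 4 seed rows, 23 doubling rounds give 4 · 2^23 rows:

```sql
-- 4 seed rows doubled 23 times
SELECT 4 * 2^23 AS expected_rows;
-- expected_rows: 33554432
```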
The plans below are representative runs; the reported execution times are the average over 20 executions:
explain(analyze) select * from demo1 where word = 'applex';
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using x on demo1  (cost=0.56..4.58 rows=1 width=6) (actual time=0.092..0.096 rows=1.00 loops=1)
   Index Cond: (word = 'applex'::text)
   Heap Fetches: 1
   Index Searches: 1
   Buffers: shared hit=6
 Planning Time: 0.266 ms
 Execution Time: 0.147 ms
(7 rows)

explain(analyze) select * from demo2 where word = 'applex';
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using y on demo2  (cost=0.56..8.58 rows=1 width=6) (actual time=0.136..0.139 rows=1.00 loops=1)
   Index Cond: (word = 'applex'::citext)
   Heap Fetches: 1
   Index Searches: 1
   Buffers: shared hit=6
 Planning Time: 0.173 ms
 Execution Time: 0.180 ms
(7 rows)
As can be seen, the index scans for both are roughly the same. That's expected, since an index lookup only executes the comparison operator a few times. The interesting part is the sequential scan, so let's check that:
explain(analyze) select * from demo1 where word = 'apple';
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on demo1  (cost=0.00..567901.40 rows=25245236 width=6) (actual time=11.837..4835.425 rows=25165823.00 loops=1)
   Filter: (word = 'apple'::text)
   Rows Removed by Filter: 8388609
   Buffers: shared hit=1316 read=147155
 Planning Time: 0.240 ms
 Execution Time: 5730.945 ms
(6 rows)

explain(analyze) select * from demo2 where word = 'apple';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on demo2  (cost=0.00..567901.40 rows=25287738 width=6) (actual time=10.600..18065.140 rows=25165823.00 loops=1)
   Filter: (word = 'apple'::citext)
   Rows Removed by Filter: 8388609
   Buffers: shared hit=14772 read=133699
 Planning Time: 0.329 ms
 Execution Time: 18922.067 ms
(6 rows)
As we know, there are 5 distinct values in the demo1 and demo2 tables: 'apple', 'APPLE', 'Åpple', 'Apple' and 'applex'. Each makes up about 25% of the table (except 'apple' and 'applex', since the first 'apple' record was updated to 'applex'). Because the comparison is case insensitive but accent sensitive, the condition word = 'apple' matches roughly 75% of the rows. As a result, the query on demo1 with the custom collation is ~3 times faster than the one on demo2.
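As a side note, the nondeterministic collation also changes how grouping behaves: values that compare equal under the collation fall into the same group. A quick sketch (my own, not from the original post) to confirm the distribution above:

```sql
-- With the nondeterministic collation, 'apple', 'APPLE' and 'Apple'
-- form ONE group; which spelling is shown as the group key is arbitrary.
SELECT word, count(*)
FROM demo1
GROUP BY word
ORDER BY word;
-- expected: three groups - the apple/APPLE/Apple group, 'Åpple', and 'applex'
```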
What about <, <=, > and >= operators, do they work? Yes, they do:
select count(*) from demo1 where word >= 'Åpple';
  count
---------
 8388609
(1 row)

select count(*) from demo2 where word >= 'Åpple';
  count
---------
 8388609
(1 row)

explain(analyze) select count(*) from demo1 where word >= 'Åpple';
                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=197860.81..197860.82 rows=1 width=8) (actual time=1300.464..1301.037 rows=1.00 loops=1)
   Buffers: shared hit=58 read=22924
   ->  Gather  (cost=197860.60..197860.81 rows=2 width=8) (actual time=1300.455..1301.030 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=58 read=22924
         ->  Partial Aggregate  (cost=196860.60..196860.61 rows=1 width=8) (actual time=1286.575..1286.576 rows=1.00 loops=3)
               Buffers: shared hit=58 read=22924
               ->  Parallel Index Only Scan using x on demo1  (cost=0.56..188205.18 rows=3462165 width=0) (actual time=8.405..930.739 rows=2796203.00 loops=3)
                     Index Cond: (word >= 'Åpple'::text)
                     Heap Fetches: 110
                     Index Searches: 1
                     Buffers: shared hit=58 read=22924
 Planning Time: 0.200 ms
 Execution Time: 1301.100 ms
(15 rows)

explain(analyze) select count(*) from demo2 where word >= 'Åpple';
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=196863.08..196863.09 rows=1 width=8) (actual time=4824.465..4824.511 rows=1.00 loops=1)
   Buffers: shared hit=10 read=22930
   ->  Gather  (cost=196862.87..196863.08 rows=2 width=8) (actual time=4824.457..4824.505 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=10 read=22930
         ->  Partial Aggregate  (cost=195862.87..195862.88 rows=1 width=8) (actual time=4806.764..4806.765 rows=1.00 loops=3)
               Buffers: shared hit=10 read=22930
               ->  Parallel Index Only Scan using y on demo2  (cost=0.56..187251.73 rows=3444456 width=0) (actual time=0.373..4433.041 rows=2796203.00 loops=3)
                     Index Cond: (word >= 'Åpple'::citext)
                     Heap Fetches: 110
                     Index Searches: 1
                     Buffers: shared hit=10 read=22930
 Planning Time: 0.246 ms
 Execution Time: 4824.576 ms
(15 rows)
In this case as well, the query on the demo1 table runs ~4 times faster than the one on the demo2 table. An important point: as the number of fetched rows increases, the performance gap between the custom collation and citext grows wider.
Note: be careful, 'Åpple' comes before 'applex' in both queries because both implementations are Unicode aware.
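To see that ordering in isolation, here is a small sketch (my own, not from the original post):

```sql
-- Under the ICU collation, 'Åpple' sorts with the A's, before 'applex';
-- under byte-wise "C" ordering, 'Å' would instead sort after all ASCII letters.
SELECT w
FROM (VALUES ('apple'), ('applex'), ('Åpple')) v(w)
ORDER BY w COLLATE case_insensitive_accent_sensitive;
-- expected order: apple, Åpple, applex
```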
This is the tricky part: on PostgreSQL 17 and earlier, the LIKE operator is not supported with nondeterministic collations. For example:
explain(analyze) select * from demo1 where word like 'applex';
ERROR:  nondeterministic collations are not supported for LIKE
However, Laurenz pointed out this commit: starting with PostgreSQL 18, the LIKE operator is supported with nondeterministic collations as well.
explain(analyze) select * from demo1 where word like 'applex%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..324235.94 rows=1 width=6) (actual time=6888.665..6932.894 rows=1.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=40 read=148474
   ->  Parallel Seq Scan on demo1  (cost=0.00..323235.84 rows=1 width=6) (actual time=5384.179..6861.712 rows=0.33 loops=3)
         Filter: (word ~~ 'applex%'::text)
         Rows Removed by Filter: 11184844
         Buffers: shared hit=40 read=148474
 Planning Time: 0.337 ms
 Execution Time: 6933.003 ms
(10 rows)

explain(analyze) select * from demo2 where word like 'applex%';
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..341013.14 rows=167773 width=6) (actual time=19653.864..19655.891 rows=1.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=148472 dirtied=1
   ->  Parallel Seq Scan on demo2  (cost=0.00..323235.84 rows=69905 width=6) (actual time=19615.897..19615.925 rows=0.33 loops=3)
         Filter: (word ~~ 'applex%'::citext)
         Rows Removed by Filter: 11184844
         Buffers: shared read=148472 dirtied=1
 Planning Time: 0.217 ms
 Execution Time: 19655.961 ms
(10 rows)
The query on demo1 with the custom collation runs faster than on demo2. However, both queries use a sequential scan. Why? The reason is that both solutions (custom collation and citext) are Unicode aware. Normally, to speed up LIKE queries on text, char, or varchar columns, we create an index with the xxx_pattern_ops operator class. But since our collation is nondeterministic, creating such an index results in an error:
create index r on demo1(word text_pattern_ops);
ERROR:  nondeterministic collations are not supported for operator class "text_pattern_ops"
For citext, the story is a bit different. To test it, I inserted some rows into demo2:
insert into demo2 select 'b'||x.i from generate_series(1,100) x(i);

\d demo2
              Table "public.demo2"
 Column |  Type  | Collation | Nullable | Default
--------+--------+-----------+----------+---------
 word   | citext |           |          |
Indexes:
    "rsp" btree (lower(word::text) text_pattern_ops)
    "y" btree (word)
    "z" btree (lower(word::text))

explain(analyze) select * from demo2 where lower(word) like 'b1%';
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=3288.11..321530.74 rows=167773 width=6) (actual time=5.512..24.216 rows=12.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=45
   ->  Parallel Bitmap Heap Scan on demo2  (cost=2288.11..303753.44 rows=69905 width=6) (actual time=0.252..0.259 rows=4.00 loops=3)
         Filter: (lower((word)::text) ~~ 'b1%'::text)
         Heap Blocks: exact=2
         Buffers: shared hit=45
         ->  Bitmap Index Scan on rsp  (cost=0.00..2246.17 rows=167773 width=0) (actual time=0.490..0.491 rows=12.00 loops=1)
               Index Cond: ((lower((word)::text) ~>=~ 'b1'::text) AND (lower((word)::text) ~<~ 'b2'::text))
               Index Searches: 1
               Buffers: shared hit=3
 Planning Time: 0.403 ms
 Execution Time: 24.294 ms
(14 rows)
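The post doesn't show how the indexes listed in the \d demo2 output were created; presumably something along these lines (a sketch, under that assumption):

```sql
-- Functional index with text_pattern_ops: usable by LIKE 'prefix%' via lower()
CREATE INDEX rsp ON demo2 (lower(word::text) text_pattern_ops);

-- Plain functional index on lower(): usable for equality on the lowered value
CREATE INDEX z ON demo2 (lower(word::text));
```

The trick is that lower(word::text) is an ordinary deterministic text expression, so the nondeterministic-collation restriction no longer applies to the index.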
In contrast, with a proper index on a citext column, it is possible to use the LIKE operator.
A custom nondeterministic collation can indeed serve as an alternative to citext for case-insensitive queries, often with a much lower performance penalty.

- For equality (=), both approaches perform similarly when indexes are used, but under sequential scans the custom collation is consistently 2-4x faster than citext.
- For range operators (<, <=, >, >=), both solutions work correctly and respect Unicode ordering, but the performance gap widens as the number of matching rows increases, with the custom collation again outperforming citext.
- citext, on the other hand, can still leverage functional indexes, making it more flexible for LIKE queries when proper indexing is required.

In short: for LIKE-heavy workloads, citext remains more practical until nondeterministic collations gain better index support.