Recently, I read Laurenz Albe's blog post about case-insensitive string search. He recommends case-insensitive collations, saying "it won't be any more expensive (and maybe more intuitive) than the current techniques of using upper() or the citext extension are today". In other words, a custom collation can be used instead of citext to achieve case insensitivity with a lower performance penalty. Today we will be testing that claim.
CREATE COLLATION case_insensitive_accent_sensitive (
    provider = icu,
    locale = 'und-u-ks-level2',
    deterministic = false
);

CREATE TABLE demo1 (
    word TEXT COLLATE case_insensitive_accent_sensitive
);
We created the collation as case insensitive but accent sensitive, because citext behaves the same way.
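As a side note, ICU lets you lower the comparison strength further. This sketch (my own, not from the original post; the collation name ci_ai is hypothetical) shows how ks-level1 would ignore accents as well, while the ks-level2 collation above keeps them significant:

```sql
-- Hypothetical: primary strength (level 1) ignores case AND accents
CREATE COLLATION ci_ai (
    provider = icu,
    locale = 'und-u-ks-level1',
    deterministic = false
);

SELECT 'Åpple' = 'apple' COLLATE ci_ai;                              -- true: accents ignored too
SELECT 'Åpple' = 'apple' COLLATE case_insensitive_accent_sensitive;  -- false: accents still count
SELECT 'APPLE' = 'apple' COLLATE case_insensitive_accent_sensitive;  -- true: case ignored
```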
create extension citext;

CREATE TABLE demo2 (
    word citext
);

INSERT INTO demo1 VALUES ('apple'), ('APPLE'), ('Åpple'), ('Apple');
INSERT INTO demo2 VALUES ('apple'), ('APPLE'), ('Åpple'), ('Apple');
Then we duplicate all these records until we have ~32M rows in both tables:
insert into demo1 select * from demo1;
insert into demo1 select * from demo1;
.
.
.

select count(*) from demo1;
  count
----------
 33554432
(1 row)

select count(*) from demo2;
  count
----------
 33554432
(1 row)

create index x on demo1(word);
create index y on demo2(word);

update demo1 set word = 'applex' where ctid = '(0,1)';
update demo2 set word = 'applex' where ctid = '(0,1)';
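The row count checks out: each INSERT ... SELECT doubles the table, so starting from the 4 seed rows, 23 doubling rounds give 4 · 2^23 rows:

```sql
-- 4 seed rows doubled 23 times
SELECT 4 * 2^23 AS expected_rows;
-- expected_rows: 33554432
```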
The plans below are representative runs; the reported execution times are the average over 20 executions:
explain(analyze) select * from demo1 where word = 'applex';
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using x on demo1  (cost=0.56..4.58 rows=1 width=6) (actual time=0.092..0.096 rows=1.00 loops=1)
   Index Cond: (word = 'applex'::text)
   Heap Fetches: 1
   Index Searches: 1
   Buffers: shared hit=6
 Planning Time: 0.266 ms
 Execution Time: 0.147 ms
(7 rows)

explain(analyze) select * from demo2 where word = 'applex';
                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Index Only Scan using y on demo2  (cost=0.56..8.58 rows=1 width=6) (actual time=0.136..0.139 rows=1.00 loops=1)
   Index Cond: (word = 'applex'::citext)
   Heap Fetches: 1
   Index Searches: 1
   Buffers: shared hit=6
 Planning Time: 0.173 ms
 Execution Time: 0.180 ms
(7 rows)
As can be seen, the index scans for both are roughly the same. That's expected, since an index lookup only executes the comparison operator a few times. The interesting part is the sequential scan, so let's check that:
explain(analyze) select * from demo1 where word = 'apple';
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Seq Scan on demo1  (cost=0.00..567901.40 rows=25245236 width=6) (actual time=11.837..4835.425 rows=25165823.00 loops=1)
   Filter: (word = 'apple'::text)
   Rows Removed by Filter: 8388609
   Buffers: shared hit=1316 read=147155
 Planning Time: 0.240 ms
 Execution Time: 5730.945 ms
(6 rows)

explain(analyze) select * from demo2 where word = 'apple';
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Seq Scan on demo2  (cost=0.00..567901.40 rows=25287738 width=6) (actual time=10.600..18065.140 rows=25165823.00 loops=1)
   Filter: (word = 'apple'::citext)
   Rows Removed by Filter: 8388609
   Buffers: shared hit=14772 read=133699
 Planning Time: 0.329 ms
 Execution Time: 18922.067 ms
(6 rows)
As we know, there are 5 distinct values in the demo1 and demo2 tables: 'apple', 'APPLE', 'Åpple', 'Apple' and 'applex'. Each makes up about 25% of the table (except 'apple' and 'applex', since the first 'apple' record was updated to 'applex'). Because the comparison is case insensitive but accent sensitive, the condition word = 'apple' matches roughly 75% of the rows. As a result, the query on demo1 with the custom collation is ~3 times faster than the one on demo2.
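As a side note, the nondeterministic collation also changes how grouping behaves: values that compare equal under the collation fall into the same group. A quick sketch (my own, not from the original post) to confirm the distribution above:

```sql
-- With the nondeterministic collation, 'apple', 'APPLE' and 'Apple'
-- form ONE group; which spelling is shown as the group key is arbitrary.
SELECT word, count(*)
FROM demo1
GROUP BY word
ORDER BY word;
-- expected: three groups - the apple/APPLE/Apple group, 'Åpple', and 'applex'
```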
What about <, <=, > and >= operators, do they work? Yes, they do:
select count(*) from demo1 where word >= 'Åpple';
  count
---------
 8388609
(1 row)

select count(*) from demo2 where word >= 'Åpple';
  count
---------
 8388609
(1 row)

explain(analyze) select count(*) from demo1 where word >= 'Åpple';
                                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=197860.81..197860.82 rows=1 width=8) (actual time=1300.464..1301.037 rows=1.00 loops=1)
   Buffers: shared hit=58 read=22924
   ->  Gather  (cost=197860.60..197860.81 rows=2 width=8) (actual time=1300.455..1301.030 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=58 read=22924
         ->  Partial Aggregate  (cost=196860.60..196860.61 rows=1 width=8) (actual time=1286.575..1286.576 rows=1.00 loops=3)
               Buffers: shared hit=58 read=22924
               ->  Parallel Index Only Scan using x on demo1  (cost=0.56..188205.18 rows=3462165 width=0) (actual time=8.405..930.739 rows=2796203.00 loops=3)
                     Index Cond: (word >= 'Åpple'::text)
                     Heap Fetches: 110
                     Index Searches: 1
                     Buffers: shared hit=58 read=22924
 Planning Time: 0.200 ms
 Execution Time: 1301.100 ms
(15 rows)

explain(analyze) select count(*) from demo2 where word >= 'Åpple';
                                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=196863.08..196863.09 rows=1 width=8) (actual time=4824.465..4824.511 rows=1.00 loops=1)
   Buffers: shared hit=10 read=22930
   ->  Gather  (cost=196862.87..196863.08 rows=2 width=8) (actual time=4824.457..4824.505 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=10 read=22930
         ->  Partial Aggregate  (cost=195862.87..195862.88 rows=1 width=8) (actual time=4806.764..4806.765 rows=1.00 loops=3)
               Buffers: shared hit=10 read=22930
               ->  Parallel Index Only Scan using y on demo2  (cost=0.56..187251.73 rows=3444456 width=0) (actual time=0.373..4433.041 rows=2796203.00 loops=3)
                     Index Cond: (word >= 'Åpple'::citext)
                     Heap Fetches: 110
                     Index Searches: 1
                     Buffers: shared hit=10 read=22930
 Planning Time: 0.246 ms
 Execution Time: 4824.576 ms
(15 rows)
In this case as well, the query on the demo1 table runs ~4 times faster than the one on the demo2 table. An important point: as the number of fetched rows increases, the performance gap between the custom collation and citext grows wider.
Note: be careful, 'Åpple' comes before 'applex' in both queries because both implementations are Unicode aware.
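To see that ordering in isolation, here is a small sketch (my own, not from the original post):

```sql
-- Under the ICU collation, 'Åpple' sorts with the A's, before 'applex';
-- under byte-wise "C" ordering, 'Å' would instead sort after all ASCII letters.
SELECT w
FROM (VALUES ('apple'), ('applex'), ('Åpple')) v(w)
ORDER BY w COLLATE case_insensitive_accent_sensitive;
-- expected order: apple, Åpple, applex
```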
This is the tricky part: on PostgreSQL 17 and earlier, the LIKE operator is not supported with nondeterministic collations. For example:
explain(analyze) select * from demo1 where word like 'applex';
ERROR:  nondeterministic collations are not supported for LIKE
However, Laurenz pointed out this commit: starting with PostgreSQL 18, the LIKE operator is supported with nondeterministic collations as well.
explain(analyze) select * from demo1 where word like 'applex%';
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..324235.94 rows=1 width=6) (actual time=6888.665..6932.894 rows=1.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=40 read=148474
   ->  Parallel Seq Scan on demo1  (cost=0.00..323235.84 rows=1 width=6) (actual time=5384.179..6861.712 rows=0.33 loops=3)
         Filter: (word ~~ 'applex%'::text)
         Rows Removed by Filter: 11184844
         Buffers: shared hit=40 read=148474
 Planning Time: 0.337 ms
 Execution Time: 6933.003 ms
(10 rows)

explain(analyze) select * from demo2 where word like 'applex%';
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..341013.14 rows=167773 width=6) (actual time=19653.864..19655.891 rows=1.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared read=148472 dirtied=1
   ->  Parallel Seq Scan on demo2  (cost=0.00..323235.84 rows=69905 width=6) (actual time=19615.897..19615.925 rows=0.33 loops=3)
         Filter: (word ~~ 'applex%'::citext)
         Rows Removed by Filter: 11184844
         Buffers: shared read=148472 dirtied=1
 Planning Time: 0.217 ms
 Execution Time: 19655.961 ms
(10 rows)
The query on demo1 with the custom collation runs faster than on demo2. However, both queries use a sequential scan. Why? The reason is that both solutions (custom collation and citext) are Unicode aware. Normally, to speed up LIKE queries on text, char, or varchar columns, we create an index with the xxx_pattern_ops operator class. But since our collation is nondeterministic, creating such an index results in an error:
create index r on demo1(word text_pattern_ops);
ERROR:  nondeterministic collations are not supported for operator class "text_pattern_ops"
For citext, the story is a bit different. To test it, I inserted some rows into demo2:
insert into demo2 select 'b'||x.i from generate_series(1,100) x(i);

\d demo2
              Table "public.demo2"
 Column |  Type  | Collation | Nullable | Default
--------+--------+-----------+----------+---------
 word   | citext |           |          |
Indexes:
    "rsp" btree (lower(word::text) text_pattern_ops)
    "y" btree (word)
    "z" btree (lower(word::text))

explain(analyze) select * from demo2 where lower(word) like 'b1%';
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=3288.11..321530.74 rows=167773 width=6) (actual time=5.512..24.216 rows=12.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=45
   ->  Parallel Bitmap Heap Scan on demo2  (cost=2288.11..303753.44 rows=69905 width=6) (actual time=0.252..0.259 rows=4.00 loops=3)
         Filter: (lower((word)::text) ~~ 'b1%'::text)
         Heap Blocks: exact=2
         Buffers: shared hit=45
         ->  Bitmap Index Scan on rsp  (cost=0.00..2246.17 rows=167773 width=0) (actual time=0.490..0.491 rows=12.00 loops=1)
               Index Cond: ((lower((word)::text) ~>=~ 'b1'::text) AND (lower((word)::text) ~<~ 'b2'::text))
               Index Searches: 1
               Buffers: shared hit=3
 Planning Time: 0.403 ms
 Execution Time: 24.294 ms
(14 rows)
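The post doesn't show how the indexes listed in the \d demo2 output were created; presumably something along these lines (a sketch, under that assumption):

```sql
-- Functional index with text_pattern_ops: usable by LIKE 'prefix%' via lower()
CREATE INDEX rsp ON demo2 (lower(word::text) text_pattern_ops);

-- Plain functional index on lower(): usable for equality on the lowered value
CREATE INDEX z ON demo2 (lower(word::text));
```

The trick is that lower(word::text) is an ordinary deterministic text expression, so the nondeterministic-collation restriction no longer applies to the index.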
In contrast, with a proper index on a citext column, it is possible to use the LIKE operator.
A custom nondeterministic collation can indeed serve as an alternative to citext for case-insensitive queries, often with a much lower performance penalty.

- For equality (=), both approaches perform similarly when indexes are used, but under sequential scans the custom collation is consistently 2-4x faster than citext.
- For range operators (<, <=, >, >=), both solutions work correctly and respect Unicode ordering, but the performance gap widens as the number of matching rows increases, with the custom collation again outperforming citext.
- citext, on the other hand, can still leverage functional indexes, making it more flexible for LIKE queries when proper indexing is required.

In short: for LIKE-heavy workloads, citext remains more practical until nondeterministic collations gain better index support.