© Renée Albe 2022
Case-insensitive search is a much-requested feature, partly (I suspect) to maintain compatibility with Microsoft SQL Server. There are several solutions to the problem, one of which is to use case-insensitive ICU collations. This works like a charm, except if you want to perform pattern matching. So let’s have a closer look at the problem and at possible solutions.
Alternatives for case-insensitive search
There are three known solutions to case-insensitive search in PostgreSQL:
Explicit conversion with
A query that uses this method would look as follows:
SELECT id, col FROM tab WHERE lower(col) = lower('search string');
This can be made fast with a B-tree index on
lower(col), but has two disadvantages:
- the solution is implemented at the application level, that is, you have to custom-tailor the query for case-insensitive search
- if the database column contains long values, the whole value has to be converted to lower case, even if only a few characters have to be compared, which leads to bad performance
citext provides a data type
citext, which stands for “case-insensitive
text”. The table is defined as
CREATE TABLE tab ( col citext, ... );
and the query is as simple as
SELECT id, col FROM tab WHERE col = 'search string';
That is simple and convenient, but has some disadvantages as well:
- there is no data type
civarchar, so you can only implement that with a check constraint
- performance for longer values can also be bad, because
lower(col COLLATE "default")before comparing the values
- regular expression matching is not case insensitive, and you have to use the case insensitive operator
Using case-insensitive ICU collations
If you are using PostgreSQL v12 or better, and PostgreSQL was configured
--with-icu, you can define a case-insensitive collation like this:
CREATE COLLATION english_ci ( PROVIDER = icu, -- 'en-US@colStrength=secondary' for old ICU versions LOCALE = 'en-US-u-ks-level2', DETERMINISTIC = FALSE );
The table would be defined as
CREATE TABLE tab ( col text COLLATE english_ci, ... );
and the query again is straightforward:
SELECT id, col FROM tab WHERE col = 'search string';
For more details about ICU collations, read my article on that topic.
The trouble with pattern matching and case-insensitive collations
Case-insensitive collations are the most elegant solution for the problem. There are two shortcomings with this approach:
- Even though PostgreSQL v15 allows you to use ICU collations as the default database collation, this does not extend to case-insensitive collations, so you still have to explicitly specify the collation for each column definition
- You cannot use case-insensitive collations with pattern matching
The first problem is merely an annoyance, but the second problem will require more thought. Here are two examples that exhibit the problem:
SELECT id, long FROM perf_coll WHERE long LIKE 'abcd%'; ERROR: nondeterministic collations are not supported for LIKE SELECT id, long FROM perf_coll WHERE long ~ '^abcd'; ERROR: nondeterministic collations are not supported for regular expressions
Why is that not supported?
The difficult case of German soccer
The ICU documentation details why correct case-insensitive pattern matching is difficult. A good example is the German letter “ß”, which traditionally doesn’t have an upper-case equivalent. So with good German collations (the collation from the GNU C library is not good in that respect), you will get a result like this:
SELECT upper('Fußball' COLLATE "de-DE-x-icu"); upper ══════════ FUSSBALL (1 row)
Now what would be the correct result for the following query in a case-insensitive collation?
SELECT 'Fußball' LIKE 'FUS%';
You could argue that it should be
TRUE, because that’s what you’d get for
upper('Fußball') LIKE 'FUS%'. On the other hand,
SELECT lower('FUSSBALL' COLLATE "de-DE-x-icu"); lower ══════════ fussball (1 row)
so you could just as well argue that the result should be
FALSE. The ICU library goes with the second solution for simplicity. Either solution would be difficult to implement in PostgreSQL, so we have given up on that issue for the time being. Compare this quote from the mailing list:
ICU also provides regexp matching, but not collation-aware, since
character-based patterns don’t play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there’s no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.
A semiotic aside
There is no universally accepted authority for correct German. For example, German speaking Swiss did away with the ß in the twentieth century. On the other hand, the “Rat für deutsche Rechtschreibung” (Committee for German Spelling) introduced an upper case letter ẞ in 2017, but this attempt to simplify the problem has been widely ignored by the German speaking world. Simplifying German, what an outrageous idea!
The end result is even more confusing than before:
SELECT upper('ß' COLLATE "de-DE-x-icu"), lower('ẞ' COLLATE "de-DE-x-icu"); upper │ lower ═══════╪═══════ SS │ ß (1 row)
A solution for case-insensitive pattern matching
We need a practical solution for the problem. Of course we could use
lower() or cast to
citext, but that would re-introduce the performance problems for long strings. So we evade the problem by explicitly using a different collation. For performance reasons, and to avoid the brain-twist of German soccer, we cast to the binary collation, which compares words character for character:
SELECT id, col FROM tab WHERE col COLLATE "C" ILIKE 'search%pattern'; SELECT id, col FROM tab WHERE col COLLATE "C" ~* '^search.*pattern';
Note that for this to work as expected, we have to use the case insensitive versions of
LIKE and the regular expression matching operator
Of course, this solution is not as simple as we would wish: again, the application has to write queries that explicitly specify case insensitive processing. Another difficulty is performance: while case sensitive pattern matching can be supported with B-tree indexes, case-insensitive pattern matching requires a trigram index:
CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX tab_col_pattern_idx ON tab USING gin (col gin_trgm_ops);
Such an index can speed up both of the above statements, but it can grow large and is slower to update than a normal B-tree index.
A performance test for case-insensitive comparisons
For this test, I used tables with ten million rows and a column with a random string of 320 characters. There was no index on the column. In the table
perf_citext, the column is defined as
perf_lower, the data type is
text, and comparison is made using
text with the collation
english_ci as defined above. The database used was PostgreSQL v15 with US English collations. All tables were cached in shared buffers.
|540 ms||536 ms||1675 ms||2500 ms|
|9000 ms||9000 ms||3000 ms||3800 ms|
|830 ms||ERROR||2000 ms||1940 ms|
In this test,
citext comes out ahead, but case-insensitive collations are a decent runner-up. Using
lower() does not perform well.
Pattern matching is not implemented for case-insensitive collations in PostgreSQL, and it is a difficult question what the correct behavior would be in this case. We have a decent workaround by explicitly using case-insensitive pattern matching operators and the binary collation, but the situation is definitely far from perfect.
If you are interested in further blogs on PostgreSQL topics, check out my blog on how to cancel long-running queries.