comic about pattern matching

© Renée Albe 2022

Case-insensitive search is a much-requested feature, partly (I suspect) to maintain compatibility with Microsoft SQL Server. There are several solutions to the problem, one of which is to use case-insensitive ICU collations. This works like a charm, except if you want to perform pattern matching. So let’s have a closer look at the problem and at possible solutions.

Alternatives for case-insensitive search

There are three known solutions to case-insensitive search in PostgreSQL:

Explicit conversion with lower() or upper()

A query that uses this method would look as follows:

SELECT id, col FROM tab
WHERE lower(col) = lower('search string');

This can be made fast with a B-tree index on lower(col), but has two disadvantages:

  • the solution is implemented at the application level, that is, you have to custom-tailor the query for case-insensitive search
  • if the database column contains long values, the whole value has to be converted to lower case, even if only a few characters have to be compared, which leads to bad performance

Using the citext extension

The extension citext provides a data type citext, which stands for “case-insensitive text”. The table is defined as

CREATE TABLE tab (
   col citext,
   ...
);

and the query is as simple as

SELECT id, col FROM tab
WHERE col = 'search string';

That is simple and convenient, but has some disadvantages as well:

  • there is no data type civarchar, so you can only implement that with a check constraint
  • performance for longer values can also be bad, because citext internally calls lower(col COLLATE "default") before comparing the values
  • regular expression matching is not case insensitive, and you have to use the case insensitive operator ~* explicitly

Using case-insensitive ICU collations

If you are using PostgreSQL v12 or better, and PostgreSQL was configured --with-icu, you can define a case-insensitive collation like this:

CREATE COLLATION english_ci (
   PROVIDER = icu,
   -- 'en-US@colStrength=secondary' for old ICU versions
   LOCALE = 'en-US-u-ks-level2',
   DETERMINISTIC = FALSE
);

The table would be defined as

CREATE TABLE tab (
   col text COLLATE english_ci,
   ...
);

and the query again is straightforward:

SELECT id, col FROM tab
WHERE col = 'search string';

For more details about ICU collations, read my article on that topic.

The trouble with pattern matching and case-insensitive collations

Case-insensitive collations are the most elegant solution for the problem. There are two shortcomings with this approach:

  • Even though PostgreSQL v15 allows you to use ICU collations as the default database collation, this does not extend to case-insensitive collations, so you still have to explicitly specify the collation for each column definition
  • You cannot use case-insensitive collations with pattern matching

The first problem is merely an annoyance, but the second problem will require more thought. Here are two examples that exhibit the problem:

SELECT id, long FROM perf_coll WHERE long LIKE 'abcd%';
ERROR:  nondeterministic collations are not supported for LIKE

SELECT id, long FROM perf_coll WHERE long ~ '^abcd';
ERROR:  nondeterministic collations are not supported for regular expressions

Why is that not supported?

The difficult case of German soccer

The ICU documentation details why correct case-insensitive pattern matching is difficult. A good example is the German letter “ß”, which traditionally doesn’t have an upper-case equivalent. So with good German collations (the collation from the GNU C library is not good in that respect), you will get a result like this:

SELECT upper('Fußball' COLLATE "de-DE-x-icu");

  upper   
══════════
 FUSSBALL
(1 row)

Now what would be the correct result for the following query in a case-insensitive collation?

SELECT 'Fußball' LIKE 'FUS%';

You could argue that it should be TRUE, because that’s what you’d get for upper('Fußball') LIKE 'FUS%'. On the other hand,

SELECT lower('FUSSBALL' COLLATE "de-DE-x-icu");

  lower   
══════════
 fussball
(1 row)

so you could just as well argue that the result should be FALSE. The ICU library goes with the second solution for simplicity. Either solution would be difficult to implement in PostgreSQL, so we have given up on that issue for the time being. Compare this quote from the mailing list:

ICU also provides regexp matching, but not collation-aware, since
character-based patterns don’t play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there’s no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.

A semiotic aside

There is no universally accepted authority for correct German. For example, German speaking Swiss did away with the ß in the twentieth century. On the other hand, the “Rat für deutsche Rechtschreibung” (Committee for German Spelling) introduced an upper case letter ẞ in 2017, but this attempt to simplify the problem has been widely ignored by the German speaking world. Simplifying German, what an outrageous idea!

The end result is even more confusing than before:

SELECT upper('ß' COLLATE "de-DE-x-icu"),
       lower('ẞ' COLLATE "de-DE-x-icu");

 upper │ lower 
═══════╪═══════
 SS    │ ß
(1 row)

A solution for case-insensitive pattern matching

We need a practical solution for the problem. Of course we could use lower() or cast to citext, but that would re-introduce the performance problems for long strings. So we evade the problem by explicitly using a different collation. For performance reasons, and to avoid the brain-twist of German soccer, we cast to the binary collation, which compares words character for character:

SELECT id, col FROM tab
WHERE col COLLATE "C" ILIKE 'search%pattern';

SELECT id, col FROM tab
WHERE col COLLATE "C" ~* '^search.*pattern';

Note that for this to work as expected, we have to use the case insensitive versions of LIKE and the regular expression matching operator ~.

Of course, this solution is not as simple as we would wish: again, the application has to write queries that explicitly specify case insensitive processing. Another difficulty is performance: while case sensitive pattern matching can be supported with B-tree indexes, case-insensitive pattern matching requires a trigram index:

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX tab_col_pattern_idx ON tab USING gin
   (col gin_trgm_ops);

Such an index can speed up both of the above statements, but it can grow large and is slower to update than a normal B-tree index.

A performance test for case-insensitive comparisons

For this test, I used tables with ten million rows and a column with a random string of 320 characters. There was no index on the column. In the table perf_citext, the column is defined as citext. In perf_lower, the data type is text, and comparison is made using lower(). Table perf_coll uses text with the collation english_ci as defined above. The database used was PostgreSQL v15 with US English collations. All tables were cached in shared buffers.

Performance comparison of case-insensitive search techniques
WHERE ... = 'abcd'WHERE ... LIKE 'abcd%'WHERE ... COLLATE "C" ILIKE 'abcd%'WHERE ... COLLATE "C" ~* 'abcd%'
(~ for lower())
citext540 ms536 ms1675 ms2500 ms
lower()9000 ms9000 ms3000 ms3800 ms
english_ci830 msERROR2000 ms1940 ms

In this test, citext comes out ahead, but case-insensitive collations are a decent runner-up. Using lower() does not perform well.

Conclusion

Pattern matching is not implemented for case-insensitive collations in PostgreSQL, and it is a difficult question what the correct behavior would be in this case. We have a decent workaround by explicitly using case-insensitive pattern matching operators and the binary collation, but the situation is definitely far from perfect.

If you are interested in further blogs on PostgreSQL topics, check out my blog on how to cancel long-running queries.