Table of Contents
Case-insensitive search is a much-requested feature, partly (I suspect) to maintain compatibility with Microsoft SQL Server. There are several solutions to the problem, one of which is to use case-insensitive ICU collations. This works like a charm, except if you want to perform pattern matching. So let's have a closer look at the problem and at possible solutions.
There are three known solutions to case-insensitive search in PostgreSQL:
lower()
or upper()
A query that uses this method would look as follows:
1 2 |
SELECT id, col FROM tab WHERE lower(col) = lower('search string'); |
This can be made fast with a B-tree index on lower(col)
, but has two disadvantages:
citext
extensionThe extension citext
provides a data type citext
, which stands for “case-insensitive text
”. The table is defined as
1 2 3 4 |
CREATE TABLE tab ( col citext, ... ); |
and the query is as simple as
1 2 |
SELECT id, col FROM tab WHERE col = 'search string'; |
That is simple and convenient, but has some disadvantages as well:
civarchar
, so you can only implement that with a check constraintcitext
internally calls lower(col COLLATE "default")
before comparing the values~*
explicitlyIf you are using PostgreSQL v12 or better, and PostgreSQL was configured --with-icu
, you can define a case-insensitive collation like this:
1 2 3 4 5 6 |
CREATE COLLATION english_ci ( PROVIDER = icu, -- 'en-US@colStrength=secondary' for old ICU versions LOCALE = 'en-US-u-ks-level2', DETERMINISTIC = FALSE ); |
The table would be defined as
1 2 3 4 |
CREATE TABLE tab ( col text COLLATE english_ci, ... ); |
and the query again is straightforward:
1 2 |
SELECT id, col FROM tab WHERE col = 'search string'; |
For more details about ICU collations, read my article on that topic.
Case-insensitive collations are the most elegant solution for the problem. There are two shortcomings with this approach:
The first problem is merely an annoyance, but the second problem will require more thought. Here are two examples that exhibit the problem:
1 2 3 4 5 |
SELECT id, long FROM perf_coll WHERE long LIKE 'abcd%'; ERROR: nondeterministic collations are not supported for LIKE SELECT id, long FROM perf_coll WHERE long ~ '^abcd'; ERROR: nondeterministic collations are not supported for regular expressions |
Why is that not supported?
The ICU documentation details why correct case-insensitive pattern matching is difficult. A good example is the German letter “ß”, which traditionally doesn't have an upper-case equivalent. So with good German collations (the collation from the GNU C library is not good in that respect), you will get a result like this:
1 2 3 4 5 6 |
SELECT upper('Fußball' COLLATE 'de-DE-x-icu'); upper ══════════ FUSSBALL (1 row) |
Now what would be the correct result for the following query in a case-insensitive collation?
1 |
SELECT 'Fußball' LIKE 'FUS%'; |
You could argue that it should be TRUE
, because that's what you'd get for upper('Fußball') LIKE 'FUS%'
. On the other hand,
1 2 3 4 5 6 |
SELECT lower('FUSSBALL' COLLATE 'de-DE-x-icu'); lower ══════════ fussball (1 row) |
so you could just as well argue that the result should be FALSE
. The ICU library goes with the second solution for simplicity. Either solution would be difficult to implement in PostgreSQL, so we have given up on that issue for the time being. Compare this quote from the mailing list:
ICU also provides regexp matching, but not collation-aware, since
character-based patterns don't play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there's no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.
There is no universally accepted authority for correct German. For example, German speaking Swiss did away with the ß in the twentieth century. On the other hand, the “Rat für deutsche Rechtschreibung” (Committee for German Spelling) introduced an upper case letter ẞ in 2017, but this attempt to simplify the problem has been widely ignored by the German speaking world. Simplifying German, what an outrageous idea!
The end result is even more confusing than before:
1 2 3 4 5 6 7 |
SELECT upper('ß' COLLATE 'de-DE-x-icu'), lower('ẞ' COLLATE 'de-DE-x-icu'); upper │ lower ═══════╪═══════ SS │ ß (1 row) |
We need a practical solution for the problem. Of course we could use lower()
or cast to citext
, but that would re-introduce the performance problems for long strings. So we evade the problem by explicitly using a different collation. For performance reasons, and to avoid the brain-twist of German soccer, we cast to the binary collation, which compares words character for character:
1 2 3 4 5 |
SELECT id, col FROM tab WHERE col COLLATE 'C' ILIKE 'search%pattern'; SELECT id, col FROM tab WHERE col COLLATE 'C' ~* '^search.*pattern'; |
Note that for this to work as expected, we have to use the case insensitive versions of LIKE
and the regular expression matching operator ~
.
Of course, this solution is not as simple as we would wish: again, the application has to write queries that explicitly specify case insensitive processing. Another difficulty is performance: while case sensitive pattern matching can be supported with B-tree indexes, case-insensitive pattern matching requires a trigram index:
1 2 3 4 |
CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE INDEX tab_col_pattern_idx ON tab USING gin (col gin_trgm_ops); |
Such an index can speed up both of the above statements, but it can grow large and is slower to update than a normal B-tree index.
For this test, I used tables with ten million rows and a column with a random string of 320 characters. There was no index on the column. In the table perf_citext
, the column is defined as citext
. In perf_lower
, the data type is text
, and comparison is made using lower()
. Table perf_coll
uses text
with the collation english_ci
as defined above. The database used was PostgreSQL v15 with US English collations. All tables were cached in shared buffers.
WHERE ... = 'abcd' |
WHERE ... LIKE 'abcd%' |
WHERE ... COLLATE "C" ILIKE 'abcd%' |
WHERE ... COLLATE "C" ~* 'abcd%' ( ~ for lower() ) |
|
---|---|---|---|---|
citext |
540 ms | 536 ms | 1675 ms | 2500 ms |
lower() |
9000 ms | 9000 ms | 3000 ms | 3800 ms |
english_ci |
830 ms | ERROR | 2000 ms | 1940 ms |
In this test, citext
comes out ahead, but case-insensitive collations are a decent runner-up. Using lower()
does not perform well.
Pattern matching is not implemented for case-insensitive collations in PostgreSQL, and it is a difficult question what the correct behavior would be in this case. We have a decent workaround by explicitly using case-insensitive pattern matching operators and the binary collation, but the situation is definitely far from perfect.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Thanks for the great article.
Can I apply case insensitive in my current database (postgresql 15) I mean at database level
Would you care to elaborate what you mean?
I think ultimately the end user of postgres should be able to decide if they want to accept these limitations and say I think LIKE should work in the mysqlmssql way. 99.99% of the time that is acceptable and less clunky the citext, lower() indexes, etc.
In the case of FUSSBALL I'd create a searchable column and a "real" column. I've seen this a lot in french systems where they store the name in plain ASCII and use that column for sorting and searching. Since when folks show up and are asked their name the employee really doesn't want to be bother to ask about accents. In that case they store it upper case and then LIKE is pretty straightforward already.
However, I don't really want/need this since I'm simply storing ascii. I don't even allow unicode and have no interest in doing so. the rest of the world can hate me, but it's just not worth investment given the systems I integrate with wouldn't except it anyways and I end up throwing it away or losing data when I'm given Unicode.
I would say the in 2022 the biggest thing SQL engines could do is just interoperate with each other even if it means holding your nose and doing it. Would it be so bad for Postgres to support select top 1 * from table; Given the parser can't really be extended by extensions it makes a lot of app conversions harder then they need to be.... in the long run database freedom will lead to more folks on postgres, leading to more contribution, and a better world. I'd also argue having to add collate to every column is more than inconvenient and really just a missing feature of specifying database or schema level defaults.
You are ranting, and you might have a different opinion if you ever learned a foreign language.
Concerning interoperability: there is an SQL standard, which is all about interoperability. The solution is not for PostgreSQL to implement random syntax introduced by database systems that don't care about the standard, it is the other way around.
Thanks for good explanations. In my opinion case-insensitive ICU collations sounds good. It is OK with extra work when defining tables and columns. But it is not acceptable when the like operator doesn't work. In 2022 common queries should work without work arounds. Please just use the same solution as in MS SQL or MySQL to be competitive.
A late answer: I hear you, but PostgreSQL has always put correctness above all, even above performance and usability. We don't care if other databases have lower standards concerning correctness. PostgreSQL doesn't want to compete with "fast and loose" solutions. Some people love PostgreSQL for that.
That said, I think everybody would be happy if you could point the way to a correct solution. How do MySQL and Microsoft SQL Server do it?
It could be because I am coming from an MS-SQL background, but I don't understand why anyone would want, for example, Last_Name = 'DeRoSalia' to not return 'deroSalia', and 'DEROSALIA'. I am at a loss for when I would not want 'a' to equal 'A', other than a few edge cases. Better to program for those rather than the 99.9% where case doesn't matter. (I'm putting the German ß in the .01%. I would think that generally persons would want it to match on either ss or SS.)
Thank you for the article, very informative and a good summary of the options and pros and cons.
Edit: clarity and add thanks.