CYBERTEC Logo

Case-insensitive pattern matching in PostgreSQL

06.2022 / Category: , / Tags: |

 

comic about pattern matching
© Renée Albe 2022

 

Case-insensitive search is a much-requested feature, partly (I suspect) to maintain compatibility with Microsoft SQL Server. There are several solutions to the problem, one of which is to use case-insensitive ICU collations. This works like a charm, except if you want to perform pattern matching. So let's have a closer look at the problem and at possible solutions.

Alternatives for case-insensitive search

There are three known solutions to case-insensitive search in PostgreSQL:

Explicit conversion with lower() or upper()

A query that uses this method would look as follows:

This can be made fast with a B-tree index on lower(col), but has two disadvantages:

  • the solution is implemented at the application level, that is, you have to custom-tailor the query for case-insensitive search
  • if the database column contains long values, the whole value has to be converted to lower case, even if only a few characters have to be compared, which leads to bad performance

Using the citext extension

The extension citext provides a data type citext, which stands for “case-insensitive text”. The table is defined as

and the query is as simple as

That is simple and convenient, but has some disadvantages as well:

  • there is no data type civarchar, so you can only implement that with a check constraint
  • performance for longer values can also be bad, because citext internally calls lower(col COLLATE "default") before comparing the values
  • regular expression matching is not case insensitive, and you have to use the case insensitive operator ~* explicitly

Using case-insensitive ICU collations

If you are using PostgreSQL v12 or better, and PostgreSQL was configured --with-icu, you can define a case-insensitive collation like this:

The table would be defined as

and the query again is straightforward:

For more details about ICU collations, read my article on that topic.

The trouble with pattern matching and case-insensitive collations

Case-insensitive collations are the most elegant solution for the problem. There are two shortcomings with this approach:

  • Even though PostgreSQL v15 allows you to use ICU collations as the default database collation, this does not extend to case-insensitive collations, so you still have to explicitly specify the collation for each column definition
  • You cannot use case-insensitive collations with pattern matching

The first problem is merely an annoyance, but the second problem will require more thought. Here are two examples that exhibit the problem:

Why is that not supported?

The difficult case of German soccer

The ICU documentation details why correct case-insensitive pattern matching is difficult. A good example is the German letter “ß”, which traditionally doesn't have an upper-case equivalent. So with good German collations (the collation from the GNU C library is not good in that respect), you will get a result like this:

Now what would be the correct result for the following query in a case-insensitive collation?

You could argue that it should be TRUE, because that's what you'd get for upper('Fußball') LIKE 'FUS%'. On the other hand,

so you could just as well argue that the result should be FALSE. The ICU library goes with the second solution for simplicity. Either solution would be difficult to implement in PostgreSQL, so we have given up on that issue for the time being. Compare this quote from the mailing list:

ICU also provides regexp matching, but not collation-aware, since
character-based patterns don't play well with the concept of collation.
About a potential collation-aware LIKE, it looks hard to implement,
since the algorithm currently used in like_match.c seems purely
character-based. AFAICS there's no way to plug calls to usearch_*
functions into it, it would need a separate redesign from scratch.

A semiotic aside

There is no universally accepted authority for correct German. For example, German speaking Swiss did away with the ß in the twentieth century. On the other hand, the “Rat für deutsche Rechtschreibung” (Committee for German Spelling) introduced an upper case letter ẞ in 2017, but this attempt to simplify the problem has been widely ignored by the German speaking world. Simplifying German, what an outrageous idea!

The end result is even more confusing than before:

A solution for case-insensitive pattern matching

We need a practical solution for the problem. Of course we could use lower() or cast to citext, but that would re-introduce the performance problems for long strings. So we evade the problem by explicitly using a different collation. For performance reasons, and to avoid the brain-twist of German soccer, we cast to the binary collation, which compares words character for character:

Note that for this to work as expected, we have to use the case insensitive versions of LIKE and the regular expression matching operator ~.

Of course, this solution is not as simple as we would wish: again, the application has to write queries that explicitly specify case insensitive processing. Another difficulty is performance: while case sensitive pattern matching can be supported with B-tree indexes, case-insensitive pattern matching requires a trigram index:

Such an index can speed up both of the above statements, but it can grow large and is slower to update than a normal B-tree index.

A performance test for case-insensitive comparisons

For this test, I used tables with ten million rows and a column with a random string of 320 characters. There was no index on the column. In the table perf_citext, the column is defined as citext. In perf_lower, the data type is text, and comparison is made using lower(). Table perf_coll uses text with the collation english_ci as defined above. The database used was PostgreSQL v15 with US English collations. All tables were cached in shared buffers.

Performance comparison of case-insensitive search techniques
WHERE ... = 'abcd' WHERE ... LIKE 'abcd%' WHERE ... COLLATE "C" ILIKE 'abcd%' WHERE ... COLLATE "C" ~* 'abcd%'
(~ for lower())
citext 540 ms 536 ms 1675 ms 2500 ms
lower() 9000 ms 9000 ms 3000 ms 3800 ms
english_ci 830 ms ERROR 2000 ms 1940 ms

In this test, citext comes out ahead, but case-insensitive collations are a decent runner-up. Using lower() does not perform well.

Conclusion

Pattern matching is not implemented for case-insensitive collations in PostgreSQL, and it is a difficult question what the correct behavior would be in this case. We have a decent workaround by explicitly using case-insensitive pattern matching operators and the binary collation, but the situation is definitely far from perfect.


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
7 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Mahfoud Bouabdallah
Mahfoud Bouabdallah
1 year ago

Thanks for the great article.
Can I apply case insensitive in my current database (postgresql 15) I mean at database level

laurenz
laurenz
1 year ago

Would you care to elaborate what you mean?

twiggy79
twiggy79
1 year ago

I think ultimately the end user of postgres should be able to decide if they want to accept these limitations and say I think LIKE should work in the mysqlmssql way. 99.99% of the time that is acceptable and less clunky the citext, lower() indexes, etc.

In the case of FUSSBALL I'd create a searchable column and a "real" column. I've seen this a lot in french systems where they store the name in plain ASCII and use that column for sorting and searching. Since when folks show up and are asked their name the employee really doesn't want to be bother to ask about accents. In that case they store it upper case and then LIKE is pretty straightforward already.

However, I don't really want/need this since I'm simply storing ascii. I don't even allow unicode and have no interest in doing so. the rest of the world can hate me, but it's just not worth investment given the systems I integrate with wouldn't except it anyways and I end up throwing it away or losing data when I'm given Unicode.

I would say the in 2022 the biggest thing SQL engines could do is just interoperate with each other even if it means holding your nose and doing it. Would it be so bad for Postgres to support select top 1 * from table; Given the parser can't really be extended by extensions it makes a lot of app conversions harder then they need to be.... in the long run database freedom will lead to more folks on postgres, leading to more contribution, and a better world. I'd also argue having to add collate to every column is more than inconvenient and really just a missing feature of specifying database or schema level defaults.

laurenz
laurenz
1 year ago
Reply to  twiggy79

You are ranting, and you might have a different opinion if you ever learned a foreign language.
Concerning interoperability: there is an SQL standard, which is all about interoperability. The solution is not for PostgreSQL to implement random syntax introduced by database systems that don't care about the standard, it is the other way around.

Anders Høgsbro Madsen
Anders Høgsbro Madsen
1 year ago

Thanks for good explanations. In my opinion case-insensitive ICU collations sounds good. It is OK with extra work when defining tables and columns. But it is not acceptable when the like operator doesn't work. In 2022 common queries should work without work arounds. Please just use the same solution as in MS SQL or MySQL to be competitive.

laurenz
laurenz
1 year ago

A late answer: I hear you, but PostgreSQL has always put correctness above all, even above performance and usability. We don't care if other databases have lower standards concerning correctness. PostgreSQL doesn't want to compete with "fast and loose" solutions. Some people love PostgreSQL for that.
That said, I think everybody would be happy if you could point the way to a correct solution. How do MySQL and Microsoft SQL Server do it?

Jonathan Brune
Jonathan Brune
1 year ago

Case-insensitive search is a much-requested feature, partly (I suspect) to maintain compatibility with Microsoft SQL Server

It could be because I am coming from an MS-SQL background, but I don't understand why anyone would want, for example, Last_Name = 'DeRoSalia' to not return 'deroSalia', and 'DEROSALIA'. I am at a loss for when I would not want 'a' to equal 'A', other than a few edge cases. Better to program for those rather than the 99.9% where case doesn't matter. (I'm putting the German ß in the .01%. I would think that generally persons would want it to match on either ss or SS.)

Thank you for the article, very informative and a good summary of the options and pros and cons.

Edit: clarity and add thanks.

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram