ICU collations against PostgreSQL data corruption

Elephant with character troubles wishes for an ICU collation — © Laurenz Albe 2019

Table of Contents

(Updated 2023-07-18) This article was prompted by version 2.28 of the GNU C library, which changed the definition of many collations. Such a change has the potential to cause data corruption in PostgreSQL. I will describe the problem and point out how ICU collations can solve it. The problem is not limited to glibc 2.28; it can hit you whenever you upgrade glibc.

Collation? What is that?

A collation is the set of rules that describe how strings are compared and ordered. This seems trivial at first glance. Indeed, there is a special collation, known as the C or POSIX collation, that simply compares strings character for character and ranks characters by their UNICODE encoding value. This collation is simple and performs very well, and indeed the C collation avoids all the problems that this article talks about, but it is usually not good enough. For example, it would sort upper-case characters before lower-case characters, and most people would argue that “Z” should not be sorted before “a”. The matter is further complicated because sequences of characters (like the Czech digram “ch”) can be sorted differently from their constituent characters. Think of punctuation, accents and numbers (is the string “12” less than “2” or not?), and you will begin to grasp the complexity.

Which collations does PostgreSQL use?

PostgreSQL does not define its own collations, it uses collations defined by third-party libraries. By default, it uses the collations defined by the operating system's C library.

Since v10, PostgreSQL does not only support these “libc collations”, but also “ICU collations” provided by the ICU library. ICU (short for “International Components for Unicode”) describes itself as “a mature, widely used set of C/C++ and Java libraries providing Unicode and Globalization support” and implements, among other things, collations.

From PostgreSQL v15 on, you are able to use ICU collations as database collation (which is the default collation used in a database). For that, you have to create the database with options like LOCALE_PROVIDER icu ICU_LOCALE "en-US". With older PostgreSQL releases, you have to specify the ICU collation explicitly for each string column you create.

We wanted to make ICU collations the default choice for PostgreSQL v16, but there were problems with the implementation, so that effort got postponed to later, hopefully to v17. Anyway, ICU collations are definitely the future.

How can collation changes cause data corruption?

Collations determine how strings are compared, so they affect everything in the database that has to do with the order of strings:

the result of the ORDER BY clause in SELECT statements
the order of the keys in b-tree indexes

Now changes in the behavior of ORDER BY can be annoying, but that doesn't threaten data integrity.
But if your index is suddenly ordered in the wrong fashion, that is data corruption. It can lead to incorrect query results or to duplicate entries in UNIQUE indexes. You can hit this problem whenever the library that provides your collations is updated.

What can you do to avoid the problem?

The straightforward remedy is to rebuild all indexes on strings.

The simple way to do this is to run REINDEX INDEX for all affected indexes. However, that will block all attempts to use the index and will prevent data modifications on the table.

Since PostgreSQL v12, you can use REINDEX CONCURRENTLY to avoid these disadvantages. In older versions of PostgreSQL, you can work around this by using CREATE INDEX CONCURRENTLY to create a copy of the index, then drop the original one. I'll demonstrate that further down.

But I wanted to tell you about ICU collations and their benefits.

Why are ICU collations better?

At first glance, it is not obvious why ICU collations should be any better than C library collations. But there are several advantages:

Unlike operating system collations, they work the same on all operating systems.
They allow you to create your own collations for use with PostgreSQL.
From PostgreSQL v12 on, they allow you to use case- and accent-insensitive collations.
They are versioned, so that you notice if a new version of the ICU library changes a collation.
There is a “language independent collation” which provides a best-effort language agnostic ordering.

To avoid the danger of data corruption, the biggest advantage is:

You have to keep the C library updated, but it is no problem to run with an older ICU library.

Let's look at how to use ICU collations in PostgreSQL.

Understanding ICU collation names

If you run

SELECT * FROM pg_collation WHERE collprovider = 'i';

1	SELECT * FROM pg_collation WHERE collprovider = 'i';

on a v10 or later database, you'll get a long list of collations with names like “vai-Latn-LR-x-icu”.

The “-x-icu” is just a suffix that PostgreSQL adds so that you know at a glance that it is an ICU collation. The rest of the name is identical to the “collcollate” column (although that is not a requirement), which is the name of the ICU collation.

ICU collation names are defined in the Unicode Technical Standard #35 and describe the collation. They are a list of subtags, seperated by “-” or “_”. The subtags are:

The first component is the language subtag and is normally the two-letter language code known from UNIX locales, like “en” for “English”.
The “und” language is for the language independent “root” locale.
The second (optional) component is the script subtag, a four-letter abbreviation that describes the alphabet used, like “Latn” for the latin alphabet.
The third (optional) component is the region subtag, a two-letter description of the country or region (also known from UNIX locales), like “GB” for “Great Britain”.
Finally, offset by “-u-”, follows an optional list of variant subtags and their values that define further modifications.
Some important ones are:
- co: the collation variant, which can take values like “phonebk” for the German phone book ordering or “trad” for traditional Spanish ordering
- kn: if “true”, sequences of digits are ordered numerically rather than alphabetically (“natural ordering”)
- ks: the collation strength; where “level2” means “case insensitive” and “level1” means “case and accent insensitive”
  You can only use these from v12 on with the CREATE COLLATION option deterministic = false.
- kc: if “true”, marks collations that would otherwise be case insensitive as case sensitive (if you use it together with “ks-level1”, you end up with a case sensitive, but accent insensitive collation)
- va: can take the value “posix” to specify binary sort order (by Unicode code point)

Creating your own ICU collations

Once you know the nomenclature, it is easy to create your own ICU collations.

For example, if the pope wants a Latin collation with natural search order (so that “Can. 101” sorts after “Can. 22”), he or she could simply run

CREATE COLLATION 'vat-lat' (provider = icu, locale = 'la-VA-u-kn-true');

1	CREATE COLLATION 'vat-lat' (provider = icu, locale = 'la-VA-u-kn-true');

The script subtag is omitted here, since Latin is only available in the Latin alphabet.

From PostgreSQL v12 on, it is also possible to use case- or accent-insensitive collations in PostgreSQL (an often requested feature!). For that, you have to add the additional clause “deterministic = false” so that PostgreSQL knows that it cannot handle equality comparisons byte-wise. You'll have to pay a certain performance penalty for that. However, it won't be any more expensive (and maybe more intuitive) than the current techniques of using upper() or the citext extension are today.

Versioning of ICU collations

Each ICU collation carries the version with which it was created.

The main benefit of that is that PostgreSQL knows if an upgrade of the ICU library changes the collation. In that case, you will get a warning whenever you use the collation. Then you should REINDEX all indexes that depend on the collation (the documentation has a query to find them). Then, you upgrade the collation using ALTER COLLATION ... REFRESH VERSION. (Update: PostgreSQL v13 has added a similar warning for collations provided by the GNU C library, but unfortunately not for the default collation of the database, which reduces the usefulness considerably.)

Now this is nice, but it still means down time or the risk of database corruption. If you are using ICU collations, and the libicu that ships with your operating system changes, you are in the same predicament as with glibc collations. However, there is a simple solution: don't upgrade the ICU library. While the C library is an integral component of the operating system that you have to keep updated to avoid, for example, security problems, the ICU library is relatively unimportant and doesn't have to be updated.

Switching to ICU collations

It is easy to change the collation of an existing column:

ALTER TABLE mytable ALTER textcol TYPE text COLLATE 'en-x-icu';

1	ALTER TABLE mytable ALTER textcol TYPE text COLLATE 'en-x-icu';

Note that collation names are not strings, but identifiers, so you'll have to surround them with double quotes if they contain uppercase letters or dashes.

Changing the collation of a table column is very fast, because the table does not have to be rewritten for that. However, all indexes on the column that use the column collation have to be rebuilt. Since ALTER TABLE requires an ACCESS EXCLUSIVE lock on the table, this can effectively mean a longer downtime.

So, ALTER TABLE is not the method of choice. How can we do better?

Changing collations with logical replication

PostgreSQL v10 has introduced logical replication, which allows to replicate between databases that are not identical. In particular, the table columns on the primary and on the standby database need not have the same collation. We can make use of that.

These are the steps to move a database to ICU collations:

1. Extract the schema definition from the original database

Create a schema-only dump of the source database:

pg_dump -s -f schema.sql mydb

1	pg_dump -s -f schema.sql mydb

2. Modify the schema dump with ICU collations

This can be done by hand, but it is safer to use a tool like awk for that.

Create an awk script named “modify.awk” like this:

# in CREATE TABLE statements, append a COLLATE clause to all string columns
/^CREATE TABLE/, /^$/ { print gensub(
                                 '^( *([a-z_0-9]*|'[^']*')) (text\y|character( varying)?\([0-9]*\)|([a-z_0-9]*|'[^']*').citext\y)( COLLATE ([a-z_0-9]*|'[^']*')(.([a-z_0-9]*|'[^']*'))?)?',
                                 '\1 \3 COLLATE ' collation,
                                 1
                              );
                        processed=1 }
# unmodified copy of all lines that have not been processed above
{ if (processed == 1) processed = 0; else print }

# in CREATE TABLE statements, append a COLLATE clause to all string columns

/^CREATE TABLE/, /^$/ { print gensub(

'^( *([a-z_0-9]*|'[^']*')) (text\y|character( varying)?$[0-9]*$|([a-z_0-9]*|'[^']*').citext\y)( COLLATE ([a-z_0-9]*|'[^']*')(.([a-z_0-9]*|'[^']*'))?)?',

'\1 \3 COLLATE ' collation,

);

processed=1 }

# unmodified copy of all lines that have not been processed above

{ if (processed == 1) processed = 0; else print }

Then you can call awk like this to modify the file:

awk -v collation=''en-x-icu'' schema.sql >newschema.sql

1	awk -v collation=''en-x-icu'' schema.sql >newschema.sql

Note that this will overwrite all pre-existing non-default column collations.

3. Create a new database and load the dump

createdb mydbcopy
psql -d mydbcopy -1 -f newschema.sql

1 2	createdb mydbcopy psql -d mydbcopy -1 -f newschema.sql

If you create the new database in a different database cluster, you must first create all the users and tablespaces that the database needs.
This can easily be done with the script generated by pg_dumpall -g from the original database.

4. Configure logical replication

On the original database, make sure that “wal_level” is set to “logical”. Then create a publication for all tables:

CREATE PUBLICATION coll_pub FOR ALL TABLES;

1	CREATE PUBLICATION coll_pub FOR ALL TABLES;

Then you create a subscription for that publication on the destination database.

Since both databases are in the same database cluster in our example, we have to create the replication slot explicitly to avoid a deadlock. This is not necessary when you replicate to a different cluster.

On the original database, create a logical replication slot:

SELECT pg_create_logical_replication_slot('coll_sub', 'pgoutput');

1	SELECT pg_create_logical_replication_slot('coll_sub', 'pgoutput');

On the new database, create a subscription that uses the replication slot:

CREATE SUBSCRIPTION coll_sub
   CONNECTION 'dbname=mydb user=postgres'
   PUBLICATION coll_pub
   WITH (slot_name = coll_sub, create_slot = false);

CREATE SUBSCRIPTION coll_sub

CONNECTION 'dbname=mydb user=postgres'

PUBLICATION coll_pub

WITH (slot_name = coll_sub, create_slot = false);

This starts logical replication. First, the existing contents of the tables are copied over, then replication catches up and keeps the two databases synchronized.

5. Switch the application to the new database

First, you'll have to wait until replication catches up. You can monitor the view pg_stat_replication on the original database and wait until “state” is “streaming” and “flush_lsn” is close to the current log sequence number.

Once logical replication has caught up, stop the application, change the connection information to the new database and restart the application. It is a good idea to allow for a few seconds of down time because replication is asynchronous and you want all the changes to be replicated when you start using the new database.

If you want to achieve zero down time, you could use the following approach:

Don't replicate to a different database in the same cluster, but to a database with the same name in a different PostgreSQL cluster.
Make sure that the application's connect string points to both the old and the new database.
This is supported from v10 on, and the connection string will have to look like “host=host1,host2” (if the same port is used for both).
Make sure that your application is fault tolerant: if the database connection is lost, it should automatically re-establish the connection, ideally without reporting an error to the user.
Right before you perform the switch, turn on synchronous replication by adding the name of the subscription to “synchronous_standby_names” on the primary server. Then no transaction can get lost.
To switch databases, simply turn off the primary database. The application will automatically fail over to the new database.

Once you are done, you should drop the subscription on the new database.

Conclusion

ICU collations not only provide you with many features lacking in operating system collations, they also are a way to escape from the collation changes that C library collations undergo regularly. This will reduce the danger of data corruption from corrupted indexes on string columns.

To avoid the danger of collation changes in libicu updates, you can keep the version stable with little impact on the whole system. And if your collations happen to change, you'll be warned by PostgreSQL.

With logical replication, changing your collation with little down time is not too difficult.

For further information about ICU collations, see my blog about case-insensitive pattern matching.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes

Article Rating

10 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Noah Misch

5 years ago

> One can hope that ICU collation changes don’t happen as often as glibc collation changes do.

Last I checked (a few years ago), every ICU version changes every collation version.

laurenz

5 years ago

Reply to Noah Misch

Hmm, that's bad news. They seem to come out pretty often.
That would essentially mean that ICU collations are pretty much unusable if you use your operating system's version of libicu, right?
Fedora seems to come out with a release every couple of months.

Noah Misch

5 years ago

Reply to laurenz

Using your OS libicu may or may not be more stable than using libc. However,
PostgreSQL does detect libicu updates and emit "collation ... has version
mismatch" warnings. PostgreSQL doesn't detect libc updates that modify
collation.

Red Hat Enterprise Linux 7.6 has libicu-50.1.2-17.el7. ICU 50.1.2 is from
2012, suggesting Red Hat may be maintaining collation compatibility throughout
the life of RHEL 7. Ten years of collation stability would be usable.

I recommend using a libicu build private to PostgreSQL, separate from the OS
build.

laurenz

5 years ago

Reply to Noah Misch

You are right, that's the way to go.
I'll add something to this effect to the post.

laurenz

1 year ago

That has nothing to do with collations, but with the encoding of the database.

You have a character (…) that cannot be represented in LATIN5.
You cannot change the encoding of a database.
Don't store that character, or use a different database with encoding UTF8.

user661049

1 year ago

Why the following statement?

> the C collation avoids all the problems that this article talks about

Doesn't the C collation ever change?

laurenz

1 year ago

Reply to user661049

It doesn't ever change. it is just the order defined by memcmp() of the strings, which only depends on the encoding.

user661049

1 year ago

Reply to laurenz

Is there any official official documentation that states that (for me to link to in my project)?

Багир Гварамадзе

1 year ago

Hello, thanx a lot for sharing
I have an
ERROR: character with byte sequence 0xe2 0x80 0xa6 in encoding "UTF8" has no equivalent in encoding "LATIN5"
but database about 2 TB I hesiste to change collation, what could I do

laurenz

1 year ago

Look here:

The C and POSIX collations both specify “traditional C” behavior, in which only the ASCII letters “A” through “Z” are treated as letters, and sorting is done strictly by character code byte values.

Stay tuned with our

ICU collations against PostgreSQL data corruption

Collation? What is that?

Which collations does PostgreSQL use?

How can collation changes cause data corruption?

What can you do to avoid the problem?

Why are ICU collations better?

Understanding ICU collation names

Creating your own ICU collations

Versioning of ICU collations

Switching to ICU collations

Changing collations with logical replication

1. Extract the schema definition from the original database

2. Modify the schema dump with ICU collations

3. Create a new database and load the dump

4. Configure logical replication

5. Switch the application to the new database

Conclusion

Laurenz Albe

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts