Elephant with charater troubles wishes for an ICU collation
© Laurenz Albe 2019

 

As announced by the glibc mailing list and discussed by Daniel Vérité, version 2.28 of the GNU C library will ship with many changes to the collations.

PostgreSQL uses the operating system’s collations by default. When this version hits your Linux operating system as part of a routine bugfix upgrade, and you don’t use the “C” or “POSIX” collation, you stand a good chance that some indexes on strings will become corrupted.

Since version 10, PostgreSQL does not only support “libc collations” but also “ICU collations” which are not affected by this particular problem. This article describes ICU collations and how you can switch your database to using them.

Why is it a problem?

Collations determine how strings are compared, so they affect everything in the database that has to do with the order of strings:

  • the result of the ORDER BY clause in SELECT statements
  • the order of the keys in b-tree indexes

Now changes in the behavior of ORDER BY can be annoying, but that doesn’t threaten data integrity.
But if your index is suddenly ordered in the wrong fashion, that is data corruption. It can lead to incorrect query results or to duplicate entries in UNIQUE indexes.

What can you do to avoid the problem

The straightforward remedy is to rebuild all indexes on strings.

The simple way to do this is to run REINDEX INDEX for all affected indexes. However, that will block all attempts to use the index and will prevent data modifications on the table.

In PostgreSQL v12, you will be able to use REINDEX CONCURRENTLY to avoid these disadvantages. In older versions of PostgreSQL, you can work around this by using CREATE INDEX CONCURRENTLY to create a copy of the index, then drop the original one. I’ll demonstrate that further down.

But I wanted to tell you about ICU collations and their benefits.

ICU collations

ICU is a set of libraries that provide, among other things, standardized Unicode collations. They are available on many operating systems, and since version 10 PostgreSQL supports using ICU collations.

Using ICU collations provides several advantages:

  • Unlike operating system collations, they work the same on all operating systems.
  • They allow you to create your own collations for use with PostgreSQL.
  • From PostgreSQL v12 on, they allow you to use case- and accent-insensitive collations.
  • They are versioned so that you notice if a new version of the ICU library changes a collation (which hopefully won’t happen often).
  • There is a “language independent collation” which provides a best-effort language agnostic ordering.

There is a fly in the ointment:

  • You cannot use ICU collations as default collation for a database or a cluster, so you have to explicitly specify them for all table columns. This restriction may be lifted at some point in the future.

Understanding ICU collation names

If you run

SELECT * FROM pg_collation WHERE collprovider = 'i';

on a v10 or later database, you’ll get a long list of collations with names like “vai-Latn-LR-x-icu”.

The “-x-icu” is just a suffix that PostgreSQL adds so that you know at a glance that it is an ICU collation. The rest of the name is identical to the “collcollate” column (although that is not a requirement), which is the name of the ICU collation.

ICU collation names are defined in the Unicode specification BCP47 and describe the collation. They are a list of subtags, seperated by “-” or “_”. The subtags are:

  • The first component is the language subtag and is normally the two-letter language code known from UNIX locales, like “en” for “English”.
    The “und” language is for the language independent “root” locale.
  • The second (optional) component is the script subtag, a four-letter abbreviation that describes the alphabet used, like “Latn” for the latin alphabet.
  • The third (optional) component is the region subtag, a two-letter description of the country or region (also known from UNIX locales), like “GB” for “Great Britain”.
  • Finally, offset by “-u-”, follows an optional list of variant subtags and their values that define further modifications.
    Some important ones are:

    • co: the collation variant, which can take values like “phonebk” for the German phone book ordering or “trad” for traditional Spanish ordering
    • kn: if “true”, sequences of digits are ordered numerically rather than alphabetically (“natural ordering”)
    • ks: the collation strength; where “level2” means “case insensitive” and “level1” means “case and accent insensitive”
      You can only use these from v12 on with the CREATE COLLATION option deterministic = false.
    • kc: if “true”, marks collations that would otherwise be case insensitive as case sensitive (if you use it together with “ks-level1”, you end up with a case sensitive, but accent insensitive collation)
    • va: can take the value “posix” to specify binary sort order (by Unicode code point)

Creating your own ICU collations

Once you know the nomenclature, it is easy to create your own ICU collations.

For example, if the pope wants a Latin collation with natural search order (so that “Can. 101” sorts after “Can. 22”), he or she could simply run

CREATE COLLATION "vat-lat" (provider = icu, locale = 'la-VA-u-kn-true');

The script subtag is omitted here, since Latin is only available in the Latin alphabet.

From PostgreSQL v12 on, it is also possible to use case- or accent-insensitive collations in PostgreSQL (an often requested feature!). For that, you have to add the additional clause “deterministic = false” so that PostgreSQL knows that it cannot handle equality comparisons byte-wise. You’ll have to pay a certain performance penalty for that. However, it won’t be any more expensive (and maybe more intuitive) than the current techniques of using upper() or the citext extension are today.

Versioning of ICU collations

Each ICU collation carries the version with which it was created.

The main benefit of that is that PostgreSQL knows if an upgrade of the ICU library changes the collation. In that case, you will get a warning whenever you use the collation. Then you should REINDEX all indexes that depend on the collation (the documentation has a query to find them). Then, you upgrade the collation using ALTER COLLATION ... REFRESH VERSION.

Now this is nice, but it still means down time or the risk of database corruption. If you are using ICU collations, and the libicu that ships with your operating system changes, you are in the same predicament as with glibc collations.

But, as Noah Misch pointed out in his comments below, there is a way out: you can maintain your own copy of libicu that you keep stable and build PostgreSQL against that library. Alternatively, you can opt not to upgrade libicu along with the rest of your operating system software. There are way fewer dependencies on libicu than on glibc, and the impact on security if you don’t upgrade will be lower.

Switching to ICU collations

It is easy to change the collation of an existing column:

ALTER TABLE mytable ALTER textcol TYPE text COLLATE "en-x-icu";

This will be very fast, since the data in the table don’t have to be changed, which would require a table rewrite.

Note that collation names are not strings, but identifiers, so you’ll have to surround them with double quotes if they contain uppercase letters or dashes.

Rebuilding indexes without downtime

Changing the column collation is the simple part. The hard part is to rebuild all indexes using the column. As mentioned in the beginning, a REINDEX can take long and means down time on the application.

Before v12 and its REINDEX CONCURRENTLY, you can use the following trick to rebuild an index on a column with only a short ACCESS EXCLUSIVE lock:

CREATE INDEX CONCURRENTLY myindex2 ON mytable (textcol);
DROP INDEX myindex;
ALTER INDEX myindex2 RENAME TO myindex;

There are, however, two disadvantages to this method:

  • CREATE INDEX consumes non-neglectable resources on your production database, and for a while you’ll have to maintain an extra index, which impacts all data modification operations.
  • Until the procedure is done, you work with a corrupted index, which can lead to bad results and data corruption.

So, this is not the method of choice. How can we do better?

Changing collations with logical replication

PostgreSQL v10 has introduced logical replication, which allows to replicate between databases that are not identical. In particular, the table columns on the primary and on the standby database need not have the same collation. We can make use of that.

These are the steps to move a database to ICU collations:

1. Extract the schema definition from the original database

 

Create a schema-only dump of the source database:

pg_dump -s -f schema.sql mydb

2. Modify the schema dump with ICU collations

 

This can be done by hand, but it is safer to use a tool like awk for that.

Create an awk script named “modify.awk” like this:

# in CREATE TABLE statements, append a COLLATE clause to all string columns
/^CREATE TABLE/, /^$/ { print gensub(\
                                 "^( *([a-z_0-9]*|\"[^\"]*\")) (text\\y|character( varying)?\\([0-9]*\\)|([a-z_0-9]*|\"[^\"]*\").citext\\y)( COLLATE ([a-z_0-9]*|\"[^\"]*\")(.([a-z_0-9]*|\"[^\"]*\"))?)?",\
                                 "\\1 \\3 COLLATE " collation,\
                                 1\
                              );\
                        processed=1 }
# unmodified copy of all lines that have not been processed above
{ if (processed == 1) processed = 0; else print }

Then you can call awk like this to modify the file:

awk -v collation='"en-x-icu"' schema.sql >newschema.sql

Note that this will overwrite all pre-existing non-default column collations.

3. Create a new database and load the dump

 

createdb mydbcopy
psql -d mydbcopy -1 -f newschema.sql

If you create the new database in a different database cluster, you must first create all the users and tablespaces that the database needs.
This can easily be done with the script generated by pg_dumpall -g from the original database.

4. Configure logical replication

 

On the original database, make sure that “wal_level” is set to “logical”. Then create a publication for all tables:

CREATE PUBLICATION coll_pub FOR ALL TABLES;

Then you create a subscription for that publication on the destination database.

Since both databases are in the same database cluster in our example, we have to create the replication slot explicitly to avoid a deadlock. This is not necessary when you replicate to a different cluster.

On the original database, create a logical replication slot:

SELECT pg_create_logical_replication_slot('coll_sub', 'pgoutput');

On the new database, create a subscription that uses the replication slot:

CREATE SUBSCRIPTION coll_sub
   CONNECTION 'dbname=mydb user=postgres'
   PUBLICATION coll_pub
   WITH (slot_name = coll_sub, create_slot = false);

This starts logical replication. First, the existing contents of the tables are copied over, then replication catches up and keeps the two databases synchronized.

5. Switch the application to the new database

 

First, you’ll have to wait until replication catches up. You can monitor the view pg_stat_replication on the original database and wait until “state” is “streaming” and “flush_lsn” is close to the current log sequence number.

Once logical replication has caught up, stop the application, change the connection information to the new database and restart the application. It is a good idea to allow for a few seconds of down time because replication is asynchronous and you want all the changes to be replicated when you start using the new database.

If you want to achieve zero down time, you could use the following approach:

  • Don’t replicate to a different database in the same cluster, but to a database with the same name in a different PostgreSQL cluster.
  • Make sure that the application’s connect string points to both the old and the new database.
    This is supported from v10 on, and the connection string will have to look like “host=host1,host2” (if the same port is used for both).
  • Make sure that your application is fault tolerant: if the database connection is lost, it should automatically re-establish the connection, ideally without reporting an error to the user.
  • Right before you perform the switch, turn on synchronous replication by adding the name of the subscription to “synchronous_standby_names” on the primary server. Then no transaction can get lost.
  • To switch databases, simply turn off the primary database. The application will automatically fail over to the new database.

Once you are done, you should drop the subscription on the new database.

Conclusion

ICU collations not only provide you with many features lacking in operating system collations, they also are a way to escape from the collation changes that glibc collations undergo regularly.

This will reduce the danger of data corruption from corrupted indexes on string columns.
To avoid the danger of collation changes in libicu updates, you can keep the version stable with little impact on the whole system. And if your collations happen to change, you’ll be warned by PostgreSQL.

With logical replication, changing your collation with little down time is not too difficult.