© Laurenz Albe 2019
As announced by the glibc mailing list and discussed by Daniel Vérité, version 2.28 of the GNU C library will ship with many changes to the collations.
PostgreSQL uses the operating system’s collations by default. When this version hits your Linux operating system as part of a routine bugfix upgrade, and you don’t use the “C
” or “POSIX
” collation, you stand a good chance that some indexes on strings will become corrupted.
Since version 10, PostgreSQL does not only support “libc collations” but also “ICU collations” which are not affected by this particular problem. This article describes ICU collations and how you can switch your database to using them.
Why is it a problem?
Collations determine how strings are compared, so they affect everything in the database that has to do with the order of strings:
- the result of the
ORDER BY
clause inSELECT
statements - the order of the keys in b-tree indexes
Now changes in the behavior of ORDER BY
can be annoying, but that doesn’t threaten data integrity.
But if your index is suddenly ordered in the wrong fashion, that is data corruption. It can lead to incorrect query results or to duplicate entries in UNIQUE
indexes.
What can you do to avoid the problem
The straightforward remedy is to rebuild all indexes on strings.
The simple way to do this is to run REINDEX INDEX
for all affected indexes. However, that will block all attempts to use the index and will prevent data modifications on the table.
In PostgreSQL v12, you will be able to use REINDEX CONCURRENTLY
to avoid these disadvantages. In older versions of PostgreSQL, you can work around this by using CREATE INDEX CONCURRENTLY
to create a copy of the index, then drop the original one. I’ll demonstrate that further down.
But I wanted to tell you about ICU collations and their benefits.
ICU collations
ICU is a set of libraries that provide, among other things, standardized Unicode collations. They are available on many operating systems, and since version 10 PostgreSQL supports using ICU collations (you have to enable that feature when you build PostgreSQL).
Using ICU collations provides several advantages:
- Unlike operating system collations, they work the same on all operating systems.
- They allow you to create your own collations for use with PostgreSQL.
- From PostgreSQL v12 on, they allow you to use case- and accent-insensitive collations.
- They are versioned so that you notice if a new version of the ICU library changes a collation (which hopefully won’t happen often).
- There is a “language independent collation” which provides a best-effort language agnostic ordering.
From PostgreSQL v15 on, you are able to use ICU collations as default database collation. For that, you have to create the database with options like LOCALE_PROVIDER icu ICU_LOCALE "en-US-x-icu"
. With older PostgreSQL releases, you have to specify the ICU collation explicitly for each string column you create.
Understanding ICU collation names
If you run
SELECT * FROM pg_collation WHERE collprovider = 'i';
on a v10 or later database, you’ll get a long list of collations with names like “vai-Latn-LR-x-icu
”.
The “-x-icu
” is just a suffix that PostgreSQL adds so that you know at a glance that it is an ICU collation. The rest of the name is identical to the “collcollate” column (although that is not a requirement), which is the name of the ICU collation.
ICU collation names are defined in the Unicode Technical Standard #35 and describe the collation. They are a list of subtags, seperated by “-” or “_”. The subtags are:
- The first component is the language subtag and is normally the two-letter language code known from UNIX locales, like “
en
” for “English”.
The “und
” language is for the language independent “root” locale. - The second (optional) component is the script subtag, a four-letter abbreviation that describes the alphabet used, like “
Latn
” for the latin alphabet. - The third (optional) component is the region subtag, a two-letter description of the country or region (also known from UNIX locales), like “GB” for “Great Britain”.
- Finally, offset by “
-u-
”, follows an optional list of variant subtags and their values that define further modifications.
Some important ones are:co
: the collation variant, which can take values like “phonebk
” for the German phone book ordering or “trad
” for traditional Spanish orderingkn
: if “true
”, sequences of digits are ordered numerically rather than alphabetically (“natural ordering”)ks
: the collation strength; where “level2
” means “case insensitive” and “level1
” means “case and accent insensitive”
You can only use these from v12 on with theCREATE COLLATION
optiondeterministic = false
.kc
: if “true
”, marks collations that would otherwise be case insensitive as case sensitive (if you use it together with “ks-level1
”, you end up with a case sensitive, but accent insensitive collation)va
: can take the value “posix
” to specify binary sort order (by Unicode code point)
Creating your own ICU collations
Once you know the nomenclature, it is easy to create your own ICU collations.
For example, if the pope wants a Latin collation with natural search order (so that “Can. 101
” sorts after “Can. 22
”), he or she could simply run
CREATE COLLATION "vat-lat" (provider = icu, locale = 'la-VA-u-kn-true');
The script subtag is omitted here, since Latin is only available in the Latin alphabet.
From PostgreSQL v12 on, it is also possible to use case- or accent-insensitive collations in PostgreSQL (an often requested feature!). For that, you have to add the additional clause “deterministic = false
” so that PostgreSQL knows that it cannot handle equality comparisons byte-wise. You’ll have to pay a certain performance penalty for that. However, it won’t be any more expensive (and maybe more intuitive) than the current techniques of using upper()
or the citext
extension are today.
Versioning of ICU collations
Each ICU collation carries the version with which it was created.
The main benefit of that is that PostgreSQL knows if an upgrade of the ICU library changes the collation. In that case, you will get a warning whenever you use the collation. Then you should REINDEX
all indexes that depend on the collation (the documentation has a query to find them). Then, you upgrade the collation using ALTER COLLATION ... REFRESH VERSION
. (Update: PostgreSQL v13 has added a similar warning for collations provided by the GNU C library, but unfortunately not for the default collation of the database, which reduces the usefulness considerably.)
Now this is nice, but it still means down time or the risk of database corruption. If you are using ICU collations, and the libicu
that ships with your operating system changes, you are in the same predicament as with glibc
collations.
But, as Noah Misch pointed out in his comments below, there is a way out: you can maintain your own copy of libicu
that you keep stable and build PostgreSQL against that library. Alternatively, you can opt not to upgrade libicu
along with the rest of your operating system software. There are way fewer dependencies on libicu
than on glibc
, and the impact on security if you don’t upgrade will be lower.
Switching to ICU collations
It is easy to change the collation of an existing column:
ALTER TABLE mytable ALTER textcol TYPE text COLLATE "en-x-icu";
Note that collation names are not strings, but identifiers, so you’ll have to surround them with double quotes if they contain uppercase letters or dashes.
Changing the collation of a table column is very fast, because the table does not have to be rewritten for that. However, all indexes on the column that use the column collation have to be rebuilt. Since ALTER TABLE
requires an ACCESS EXCLUSIVE
lock on the table, this can effectively mean a longer downtime.
So, ALTER TABLE
is not the method of choice. How can we do better?
Changing collations with logical replication
PostgreSQL v10 has introduced logical replication, which allows to replicate between databases that are not identical. In particular, the table columns on the primary and on the standby database need not have the same collation. We can make use of that.
These are the steps to move a database to ICU collations:
1. Extract the schema definition from the original database
Create a schema-only dump of the source database:
pg_dump -s -f schema.sql mydb
2. Modify the schema dump with ICU collations
This can be done by hand, but it is safer to use a tool like awk
for that.
Create an awk
script named “modify.awk
” like this:
# in CREATE TABLE statements, append a COLLATE clause to all string columns /^CREATE TABLE/, /^$/ { print gensub(\ "^( *([a-z_0-9]*|\"[^\"]*\")) (text\\y|character( varying)?\\([0-9]*\\)|([a-z_0-9]*|\"[^\"]*\").citext\\y)( COLLATE ([a-z_0-9]*|\"[^\"]*\")(.([a-z_0-9]*|\"[^\"]*\"))?)?",\ "\\1 \\3 COLLATE " collation,\ 1\ );\ processed=1 } # unmodified copy of all lines that have not been processed above { if (processed == 1) processed = 0; else print }
Then you can call awk
like this to modify the file:
awk -v collation='"en-x-icu"' schema.sql >newschema.sql
Note that this will overwrite all pre-existing non-default column collations.
3. Create a new database and load the dump
createdb mydbcopy psql -d mydbcopy -1 -f newschema.sql
If you create the new database in a different database cluster, you must first create all the users and tablespaces that the database needs.
This can easily be done with the script generated by pg_dumpall -g
from the original database.
4. Configure logical replication
On the original database, make sure that “wal_level
” is set to “logical
”. Then create a publication for all tables:
CREATE PUBLICATION coll_pub FOR ALL TABLES;
Then you create a subscription for that publication on the destination database.
Since both databases are in the same database cluster in our example, we have to create the replication slot explicitly to avoid a deadlock. This is not necessary when you replicate to a different cluster.
On the original database, create a logical replication slot:
SELECT pg_create_logical_replication_slot('coll_sub', 'pgoutput');
On the new database, create a subscription that uses the replication slot:
CREATE SUBSCRIPTION coll_sub CONNECTION 'dbname=mydb user=postgres' PUBLICATION coll_pub WITH (slot_name = coll_sub, create_slot = false);
This starts logical replication. First, the existing contents of the tables are copied over, then replication catches up and keeps the two databases synchronized.
5. Switch the application to the new database
First, you’ll have to wait until replication catches up. You can monitor the view pg_stat_replication
on the original database and wait until “state
” is “streaming
” and “flush_lsn
” is close to the current log sequence number.
Once logical replication has caught up, stop the application, change the connection information to the new database and restart the application. It is a good idea to allow for a few seconds of down time because replication is asynchronous and you want all the changes to be replicated when you start using the new database.
If you want to achieve zero down time, you could use the following approach:
- Don’t replicate to a different database in the same cluster, but to a database with the same name in a different PostgreSQL cluster.
- Make sure that the application’s connect string points to both the old and the new database.
This is supported from v10 on, and the connection string will have to look like “host=host1,host2
” (if the same port is used for both). - Make sure that your application is fault tolerant: if the database connection is lost, it should automatically re-establish the connection, ideally without reporting an error to the user.
- Right before you perform the switch, turn on synchronous replication by adding the name of the subscription to “
synchronous_standby_names
” on the primary server. Then no transaction can get lost. - To switch databases, simply turn off the primary database. The application will automatically fail over to the new database.
Once you are done, you should drop the subscription on the new database.
Conclusion
ICU collations not only provide you with many features lacking in operating system collations, they also are a way to escape from the collation changes that glibc collations undergo regularly.
This will reduce the danger of data corruption from corrupted indexes on string columns.
To avoid the danger of collation changes in libicu
updates, you can keep the version stable with little impact on the whole system. And if your collations happen to change, you’ll be warned by PostgreSQL.
With logical replication, changing your collation with little down time is not too difficult.