Database people dealing with natural languages are all painfully aware of the fact that encodings, special characters, accents and alike are usually hard to deal with. This is especially true if you want to implement search in a user-friendly way. This post describes the PG extension unaccent.
Consider the following example in PostgreSQL: My name contains a couple of super inconvenient special characters, which will cause issues for people around the globe. The correct spelling of my surname is “Schönig”, which is pretty hard to type on most keyboards I have seen around the world. And: Who cares about my special characters anyway? People might just want to type in “Schonig” into some search form and find information about me (ideally related to PostgreSQL and database work).
Here is the problem:
test=# SELECT 'Schönig' = 'Schonig'; ?column? ---------- f (1 row) test=# SELECT 'Schönig' = 'Schoenig'; ?column? ---------- f (1 row)
The “=” operator compares those two strings and concludes that those two strings are not identical. Therefore, the correct answer is “false”. While that is true from a technical point of view it might be a real issue and end users might be unsatisfied with the result. Problems like that can make daily life pretty hard. A solution to the problem is therefore needed.
PostgreSQL provides a useful extension
If you want to improve your user experience you can turn to the “unaccent” extension, which is shipped as part of the PostgreSQL contrib package. Installing it is really easy:
test=# CREATE EXTENSION unaccent; CREATE EXTENSION
In the next step you can call the “unaccent” function to clean a string and turn it into something more useful. This is what happens when we use this function on my name and the name of my PostgreSQL support company:
test=# SELECT unaccent('Hans-Jürgen Schönig, Gröhrmühlgasse 26, Wiener Neustadt'); unaccent --------------------------------------------------------- Hans-Jurgen Schonig, Grohrmuhlgasse 26, Wiener Neustadt (1 row) test=# SELECT unaccent('Cybertec Schönig & Schönig GmbH'); unaccent --------------------------------- Cybertec Schonig & Schonig GmbH (1 row)
The beauty is that we can easily compare strings in a more tolerant and more user-friendly way:
test=# SELECT unaccent('Schönig') = unaccent('Schonig'); ?column? ---------- t (1 row) test=# SELECT unaccent('Schönig') = unaccent('Schönig'); ?column? ---------- t (1 row)
In both cases, PostgreSQL will return true, which is exactly what we want.
Indexing
When using unaccent there is one thing, which you should keep in mind. Here is an example:
test=# CREATE TABLE t_name (name text); CREATE TABLE test=# CREATE INDEX idx_accent ON t_name (unaccent(name)); ERROR: functions in index expression must be marked IMMUTABLE
PostgreSQL supports the creation of indexes on functions. However, a functional index has to return an immutable result, which is not the case here. If you want to index on an unaccented string you have to create an additional column, which contains a pre-calculated value (“materialized”). Otherwise, it’s just not possible.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.