Database people dealing with natural languages are all painfully aware of the fact that encodings, special characters, accents and alike are usually hard to deal with. This is especially true if you want to implement search in a user friendly way.

Consider the following example in PostgreSQL: My name contains a couple of of super inconvenient special characters, which will cause issues for people around the globe. The correct spelling of my surname is “Schönig”, which is pretty hard to type on most keyboards I have seen around the world. And: Who cares about my special characters anyway? People might just want to type in “Schonig” into some search form and find information about me (ideally related to PostgreSQL and database work).

Here is the problem:

test=# SELECT 'Schönig' = 'Schonig';
 ?column? 
----------
 f
(1 row)

test=# SELECT 'Schönig' = 'Schoenig';
 ?column? 
----------
 f
(1 row)

The “=” operator compares those two strings and comes to the conclusion that those two strings are not identical. Therefore the correct answer is “false”. While that is true from a technical point of view it might be a real issue and end users might be unsatisfied with the result. Problems like that can make daily life pretty hard. A solution to the problem is therefore needed.

unaccent: PostgreSQL provides a useful extension

If you want to improve your user experience you can turn to the “unaccent” extension, which is shipped as part of the PostgreSQL contrib package. Installing “unaccent” is really easy:

test=# CREATE EXTENSION unaccent;
CREATE EXTENSION

In the next step you can call the “unaccent” function to clean a string and turn it into something more useful. This is what happens when we use the “unaccent” function on my name and the name of my PostgreSQL support company:

test=# SELECT unaccent('Hans-Jürgen Schönig, Gröhrmühlgasse 26, Wiener Neustadt');
                        unaccent                         
---------------------------------------------------------
 Hans-Jurgen Schonig, Grohrmuhlgasse 26, Wiener Neustadt
(1 row)

test=# SELECT unaccent('Cybertec Schönig & Schönig GmbH');
            unaccent             
---------------------------------
 Cybertec Schonig & Schonig GmbH
(1 row)

The beauty is that we can easily compare strings in a more tolerant and more user friendly way:

test=# SELECT unaccent('Schönig') = unaccent('Schonig');
 ?column? 
----------
 t
(1 row)

test=# SELECT unaccent('Schönig') = unaccent('Schönig');
 ?column? 
----------
 t
(1 row)

In both cases PostgreSQL will return true, which is exactly what we want.

unaccent and indexing

When using unaccent there is one thing, which you should keep in mind. Here is an example:

test=# CREATE TABLE t_name (name text);
CREATE TABLE
test=# CREATE INDEX idx_accent ON t_name (unaccent(name));
ERROR:  functions in index expression must be marked IMMUTABLE

PostgreSQL supports the creation of indexes on functions. However, a functional index has to return an immutable result, which is not the case here. If you want to index on an unaccented string you have to create an additional column, which contains a pre-calculated value (“materialized”). Otherwise this is not possible.