CYBERTEC PostgreSQL Logo

unaccent: Getting rid of umlauts, accents and special characters

03.2019 / Category: / Tags: |

Database people dealing with natural languages are all painfully aware of the fact that encodings, special characters, accents and alike are usually hard to deal with. This is especially true if you want to implement search in a user-friendly way. This post describes the PG extension unaccent.

Consider the following example in PostgreSQL: My name contains a couple of super inconvenient special characters, which will cause issues for people around the globe. The correct spelling of my surname is “Schönig”, which is pretty hard to type on most keyboards I have seen around the world. And: Who cares about my special characters anyway? People might just want to type in “Schonig” into some search form and find information about me (ideally related to PostgreSQL and database work).

Here is the problem:

The “=” operator compares those two strings and concludes that those two strings are not identical. Therefore, the correct answer is “false”. While that is true from a technical point of view it might be a real issue and end users might be unsatisfied with the result. Problems like that can make daily life pretty hard. A solution to the problem is therefore needed.

PostgreSQL provides a useful extension

If you want to improve your user experience you can turn to the “unaccent” extension, which is shipped as part of the PostgreSQL contrib package. Installing it is really easy:

In the next step you can call the “unaccent” function to clean a string and turn it into something more useful. This is what happens when we use this function on my name and the name of my PostgreSQL support company:

The beauty is that we can easily compare strings in a more tolerant and more user-friendly way:

In both cases, PostgreSQL will return true, which is exactly what we want.

Indexing

When using unaccent there is one thing, which you should keep in mind. Here is an example:

PostgreSQL supports the creation of indexes on functions. However, a functional index has to return an immutable result, which is not the case here. If you want to index on an unaccented string you have to create an additional column, which contains a pre-calculated value (“materialized”). Otherwise, it's just not possible.

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
3 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Javier Zayas
Javier Zayas
5 years ago

Interesting topic, which raises a unique question for me. Could the unaccent() function be used during data creation to strip out the UTF8 characters (our database was built with ASCII so these characters especially cause us problems) to help work around the limitationof ASCII databases not handling these characters?

Georg Klimm
5 years ago

Is then a difference in
ORDER BY unaccent(name)
and
ORDER BY name ?
Sure with precondition of the right language setting 🙂

laurenz
laurenz
5 years ago
Reply to  Georg Klimm

If you have a German collation in pg_collation (something like German on Windows or de_DE.UTF-8 on UNIX), the results should be pretty similar (as long as you are not using characters like č or é).

One notable difference would be the sort performance, since unaccent would be called for every string that gets sorted. If ORDER BY can use an index scan, that shouldn't make a difference.

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram