There are still many people out there, who cannot spell the name of their favorite database. “PostgreSQL”, “PostgresSQL”, “Postgre SQL” … the list goes on and on. Can we blame those people? Actually no. Some words are simply pretty tricky. Each of us has failed once in a while. After all database work is not about blaming people – it is about helping them.
Fuzzy search is a way to solve the problem and to fix user experience. The goal is really to make sure that users got a chance to find something – even if typos are included in the search string.
Contrib modules
PostgreSQL provides a module called “pg_trgm”, which allows users to use trigrams along with indexes. “pg_trgm” is a very capable module and even allows regular expression matches. However, there is more. There are many more algorithms out there, which can be used to measure the distance between words or groups of words.
pg_similarity: A nice addon to PostgreSQL
One module, which has been around for quite a while, ist pg_similarity. It can be downloaded for free from the following website: http://pgsimilarity.projects.pgfoundry.org/ It features a couple of algorithms such as Jaro-Winkler, Q-grams and a lot more.
pg_similarity in action
To see pg_similarity in action we have compiled a couple of examples. Here is a q-gram example:
test=# SELECT qgram('PostgreSQL', 'PostgresSQL') AS q1, qgram('PostgreSQL', 'Postgres SQL') AS q2, qgram('PostgreSQL', 'Bostgres SQL') AS q3; q1 | q2 | q3 -----+-------------------+------------------- 0.8 | 0.769230769230769 | 0.538461538461538 (1 row)
Now the same with Jaro-Winkler:
test=# SELECT jarowinkler('PostgreSQL', 'PostgresSQL') AS j1, jarowinkler('PostgreSQL', 'Postgres SQL') AS j2, jarowinkler('PostgreSQL', 'Bostgres SQL') AS j3; j1 | j2 | j3 -------------------+-------------------+------------------- 0.981818181818182 | 0.966666666666667 | 0.883333333333333 (1 row)
Depending on your needs you can choose, which algorithm is suited for your problem. In some cases even a combination of various algorithms can be useful.