CYBERTEC PostgreSQL Logo

"hired" vs. "fired" - fuzzy search in PostgreSQL

09.2023 / Category: / Tags:

When dealing with data (and life in general) small things can have a major impact. The difference between “hired” and “fired” is just one simple character, but in many cases it does have real world implications. The question is: How can we use good old community Open Source PostgreSQL to do a fuzzy search, in order to get better results? This blog will give you some insights into how this can be done, and which benefits the end user can enjoy when using those techniques. As fuzzy search is a huge topic, I want to limit the post to what is possible using the fuzzystrmatch extension, and point to some further reading wherever it makes sense to do so.

Creating sample data

To approach fuzzy string search in PostgreSQL we first have to create some simple sample data. For the sake of simplicity, it does not need much to show how this works:

All we need is 5 rows and we are able to demonstrate what true Open Source is capable of.

The fuzzystrmatch extension

The first thing people might want to use in PostgreSQL is the fuzzystrmatch extension. It contains basic functionality such as Soundex and Levenshtein. Let's activate the extension, which is shipped as part of the PostgreSQL contrib package:

Soundex is one of the more popular algorithms. It's quite outdated, but it still does serve its purpose here and there. Let's see how it runs:

The soundex function encodes the input string. What we can do now is encode the input text as well, and see if it matches:

Note that “fired” will get the same encoding as “firred” which already gives us a real advantage over a normal search.

PostgreSQL provides us with indexes on expressions, and since soundex is “immutable”, we can index on it. If you want to know more about IMMUTABLE, STABLE and VOLATILE, read our blog post about this vital topic. Here's how it works in our example:

The table is so small that PostgreSQL would not even consider an index scan, but we can convince the database engine to actually use an index by implementing a runtime variable.

A second mechanism provided by fuzzystrmatch is good old Levenshtein. What is this all about? Basically the Levenshtein-distance defines the minimum number of characters one has to change in order to change string A to string B.

The following listing contains an example of how the Levenshtein algorithm can work:

In this case, we have to change 8 characters to turn the first into the second string. We can make use of that in a normal query too. Suppose we want to find everything that has only up to a certain number of typos:

In this case, two rows are available. Levenshtein is really powerful if you have small typos in the data, and if you're relatively sure of what you're looking for in the first place.

Further fuzzy search option: metaphone

If you're not happy with what levenshtein and soundex can do for you, you might want to check out metaphone. It's pretty similar to what soundex does, but offers an additional parameter that defines the length of your desired output:

In the first example, I have encoded the string into 4 characters. In the second case, the length of the string is 6. We can again use the same strategy:

We encode the input string and the content of the column. Of course, there is no result - because “open source” does not sound like failure and does not sound like being fired either.

Advanced similarity search in PostgreSQL

As outlined briefly before, there is more to fuzzy string searches in PostgreSQL than can ever be covered in a single post. If you want to read more, check out the following posts:

We will cover more of those techniques in the near future to make sure people can enjoy the full benefit of true Open Source PostgreSQL. Until then: Don’t get fired - get hired.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram