When dealing with data (and life in general) small things can have a major impact. The difference between “hired” and “fired” is just one simple character, but in many cases it does have real world implications. The question is: How can we use good old community Open Source PostgreSQL to do a fuzzy search, in order to get better results? This blog will give you some insights into how this can be done, and which benefits the end user can enjoy when using those techniques. As fuzzy search is a huge topic, I want to limit the post to what is possible using the
fuzzystrmatch extension, and point to some further reading wherever it makes sense to do so.
Creating sample data
To approach fuzzy string search in PostgreSQL we first have to create some simple sample data. For the sake of simplicity, it does not need much to show how this works:
postgres=# CREATE TABLE t_sample (x text); CREATE TABLE postgres=# INSERT INTO t_sample VALUES ('hired'), ('fired'), ('pump'), ('dump'), ('failure'); INSERT 0 5 postgres=# TABLE t_sample; x --------- hired fired pump dump failure (5 rows)
All we need is 5 rows and we are able to demonstrate what true Open Source is capable of.
The first thing people might want to use in PostgreSQL is the
fuzzystrmatch extension. It contains basic functionality such as Soundex and Levenshtein. Let’s activate the extension, which is shipped as part of the PostgreSQL contrib package:
postgres=# CREATE EXTENSION fuzzystrmatch; CREATE EXTENSION
Soundex is one of the more popular algorithms. It’s quite outdated, but it still does serve its purpose here and there. Let’s see how it runs:
postgres=# SELECT x, soundex(x) FROM t_sample; x | soundex ---------+--------- hired | H630 fired | F630 pump | P510 dump | D510 failure | F460 (5 rows)
soundex function encodes the input string. What we can do now is encode the input text as well, and see if it matches:
postgres=# SELECT x, soundex(x) FROM t_sample WHERE soundex(x) = soundex('firred'); x | soundex -------+--------- fired | F630 (1 row)
Note that “fired” will get the same encoding as “firred” which already gives us a real advantage over a normal search.
PostgreSQL provides us with indexes on expressions, and since
soundex is “immutable”, we can index on it. If you want to know more about IMMUTABLE, STABLE and VOLATILE, read our blog post about this vital topic. Here’s how it works in our example:
postgres=# CREATE INDEX ON t_sample (soundex(x)); CREATE INDEX postgres=# SET enable_seqscan TO off; SET postgres=# explain SELECT x, soundex(x) FROM t_sample WHERE soundex(x) = soundex('firred'); QUERY PLAN --------------------------------------------------- Index Scan using t_sample_soundex_idx on t_sample (cost=0.13..8.15 rows=1 width=64) Index Cond: (soundex(x) = 'F630'::text) (2 rows)
The table is so small that PostgreSQL would not even consider an index scan, but we can convince the database engine to actually use an index by implementing a runtime variable.
A second mechanism provided by
fuzzystrmatch is good old Levenshtein. What is this all about? Basically the Levenshtein-distance defines the minimum number of characters one has to change in order to change string A to string B.
The following listing contains an example of how the Levenshtein algorithm can work:
postgres=# SELECT levenshtein('venture capital', 'adventure game'); levenshtein ------------- 8 (1 row)
In this case, we have to change 8 characters to turn the first into the second string. We can make use of that in a normal query too. Suppose we want to find everything that has only up to a certain number of typos:
postgres=# SELECT x FROM t_sample WHERE levenshtein(x, 'pumpy') <= 2; x ------ pump dump (2 rows)
In this case, two rows are available. Levenshtein is really powerful if you have small typos in the data, and if you’re relatively sure of what you’re looking for in the first place.
Further fuzzy search option:
If you’re not happy with what
soundex can do for you, you might want to check out
metaphone. It’s pretty similar to what
soundex does, but offers an additional parameter that defines the length of your desired output:
postgres=# SELECT metaphone('capital structure', 4); metaphone ----------- KPTL (1 row)
postgres=# SELECT metaphone('capital structure', 6); metaphone ----------- KPTLST (1 row)
In the first example, I have encoded the string into 4 characters. In the second case, the length of the string is 6. We can again use the same strategy:
postgres=# SELECT * FROM t_sample WHERE metaphone('open source', 3) = metaphone(x, 3); x --- (0 rows)
We encode the input string and the content of the column. Of course, there is no result – because “open source” does not sound like failure and does not sound like being fired either.
Advanced similarity search in PostgreSQL
As outlined briefly before, there is more to fuzzy string searches in PostgreSQL than can ever be covered in a single post. If you want to read more, check out the following posts:
- Using pg_similarity
- Case insensitive pattern string matching
- Faster LIKE / ILIKE searches
- Advanced fuzzy string search in PostgreSQL
We will cover more of those techniques in the near future to make sure people can enjoy the full benefit of true Open Source PostgreSQL. Until then: Don’t get fired – get hired.