"hired" vs. "fired" - fuzzy search in PostgreSQL | CYBERTEC PostgreSQL

When dealing with data (and life in general) small things can have a major impact. The difference between “hired” and “fired” is just one simple character, but in many cases it does have real world implications. The question is: How can we use good old community Open Source PostgreSQL to do a fuzzy search, in order to get better results? This blog will give you some insights into how this can be done, and which benefits the end user can enjoy when using those techniques. As fuzzy search is a huge topic, I want to limit the post to what is possible using the fuzzystrmatch extension, and point to some further reading wherever it makes sense to do so.

Table of Contents

Creating sample data

To approach fuzzy string search in PostgreSQL we first have to create some simple sample data. For the sake of simplicity, it does not need much to show how this works:

postgres=# CREATE TABLE t_sample (x text);
CREATE TABLE
postgres=# INSERT INTO t_sample 
VALUES ('hired'), ('fired'), ('pump'), ('dump'),
       ('failure');
INSERT 0 5
postgres=# TABLE t_sample;
    x    
---------
 hired
 fired
 pump
 dump
 failure
(5 rows)

postgres=# CREATE TABLE t_sample (x text);

CREATE TABLE

postgres=# INSERT INTO t_sample

VALUES ('hired'), ('fired'), ('pump'), ('dump'),

('failure');

INSERT 0 5

postgres=# TABLE t_sample;

---------

hired

fired

pump

dump

failure

(5 rows)

All we need is 5 rows and we are able to demonstrate what true Open Source is capable of.

The `fuzzystrmatch` extension

The first thing people might want to use in PostgreSQL is the fuzzystrmatch extension. It contains basic functionality such as Soundex and Levenshtein. Let's activate the extension, which is shipped as part of the PostgreSQL contrib package:

postgres=# CREATE EXTENSION fuzzystrmatch;
CREATE EXTENSION

1 2	postgres=# CREATE EXTENSION fuzzystrmatch; CREATE EXTENSION

Soundex is one of the more popular algorithms. It's quite outdated, but it still does serve its purpose here and there. Let's see how it runs:

postgres=# SELECT x, soundex(x) FROM t_sample;
    x    | soundex 
---------+---------
 hired   | H630
 fired   | F630
 pump    | P510
 dump    | D510
 failure | F460
(5 rows)

postgres=# SELECT x, soundex(x) FROM t_sample;

x | soundex

---------+---------

hired | H630

fired | F630

pump | P510

dump | D510

failure | F460

(5 rows)

The soundex function encodes the input string. What we can do now is encode the input text as well, and see if it matches:

postgres=# SELECT 	x, soundex(x) 
FROM 		t_sample 
WHERE 	soundex(x) = soundex('firred');
   x   | soundex 
-------+---------
 fired | F630
(1 row)

postgres=# SELECT x, soundex(x)

FROM t_sample

WHERE soundex(x) = soundex('firred');

x | soundex

-------+---------

fired | F630

(1 row)

Note that “fired” will get the same encoding as “firred” which already gives us a real advantage over a normal search.

PostgreSQL provides us with indexes on expressions, and since soundex is “immutable”, we can index on it. If you want to know more about IMMUTABLE, STABLE and VOLATILE, read our blog post about this vital topic. Here's how it works in our example:

postgres=# CREATE INDEX ON t_sample (soundex(x));
CREATE INDEX
postgres=# SET enable_seqscan TO off;
SET
postgres=# explain SELECT x, soundex(x) 
FROM 		t_sample 
WHERE 	soundex(x) = soundex('firred');
                      QUERY PLAN                                      
---------------------------------------------------
 Index Scan using t_sample_soundex_idx on t_sample  
(cost=0.13..8.15 rows=1 width=64)
   Index Cond: (soundex(x) = 'F630'::text)
(2 rows)

postgres=# CREATE INDEX ON t_sample (soundex(x));

CREATE INDEX

postgres=# SET enable_seqscan TO off;

SET

postgres=# explain SELECT x, soundex(x)

FROM t_sample

WHERE soundex(x) = soundex('firred');

QUERY PLAN

---------------------------------------------------

Index Scan using t_sample_soundex_idx on t_sample

(cost=0.13..8.15 rows=1 width=64)

Index Cond: (soundex(x) = 'F630'::text)

(2 rows)

The table is so small that PostgreSQL would not even consider an index scan, but we can convince the database engine to actually use an index by implementing a runtime variable.

A second mechanism provided by fuzzystrmatch is good old Levenshtein. What is this all about? Basically the Levenshtein-distance defines the minimum number of characters one has to change in order to change string A to string B.

The following listing contains an example of how the Levenshtein algorithm can work:

postgres=# SELECT levenshtein('venture capital', 'adventure game');
 levenshtein 
-------------
           8
(1 row)

postgres=# SELECT levenshtein('venture capital', 'adventure game');

levenshtein

-------------

(1 row)

In this case, we have to change 8 characters to turn the first into the second string. We can make use of that in a normal query too. Suppose we want to find everything that has only up to a certain number of typos:

postgres=# SELECT 	x 
FROM 		t_sample 
WHERE 	levenshtein(x, 'pumpy') <= 2;
  x   
------
 pump
 dump
(2 rows)

postgres=# SELECT x

FROM t_sample

WHERE levenshtein(x, 'pumpy') <= 2;

------

pump

dump

(2 rows)

In this case, two rows are available. Levenshtein is really powerful if you have small typos in the data, and if you're relatively sure of what you're looking for in the first place.

Further fuzzy search option: `metaphone`

If you're not happy with what levenshtein and soundex can do for you, you might want to check out metaphone. It's pretty similar to what soundex does, but offers an additional parameter that defines the length of your desired output:

postgres=# SELECT metaphone('capital structure', 4);
 metaphone 
-----------
 KPTL
(1 row)

postgres=# SELECT metaphone('capital structure', 4);

metaphone

-----------

KPTL

(1 row)

postgres=# SELECT metaphone('capital structure', 6);
 metaphone 
-----------
 KPTLST
(1 row)

postgres=# SELECT metaphone('capital structure', 6);

metaphone

-----------

KPTLST

(1 row)

In the first example, I have encoded the string into 4 characters. In the second case, the length of the string is 6. We can again use the same strategy:

postgres=# SELECT 	* 
FROM 		t_sample 
WHERE 	metaphone('open source', 3) = metaphone(x, 3);
 x 
---
(0 rows)

postgres=# SELECT *

FROM t_sample

WHERE metaphone('open source', 3) = metaphone(x, 3);

---

(0 rows)

We encode the input string and the content of the column. Of course, there is no result - because “open source” does not sound like failure and does not sound like being fired either.

Advanced similarity search in PostgreSQL

As outlined briefly before, there is more to fuzzy string searches in PostgreSQL than can ever be covered in a single post. If you want to read more, check out the following posts:

We will cover more of those techniques in the near future to make sure people can enjoy the full benefit of true Open Source PostgreSQL. Until then: Don’t get fired - get hired.

"hired" vs. "fired" - fuzzy search in PostgreSQL

Creating sample data

The `fuzzystrmatch` extension

Note that “fired” will get the same encoding as “firred” which already gives us a real advantage over a normal search.

Further fuzzy search option: `metaphone`

Advanced similarity search in PostgreSQL

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts

"hired" vs. "fired" - fuzzy search in PostgreSQL

Creating sample data

The fuzzystrmatch extension

Note that “fired” will get the same encoding as “firred” which already gives us a real advantage over a normal search.

Further fuzzy search option: metaphone

Advanced similarity search in PostgreSQL

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts

The `fuzzystrmatch` extension

Further fuzzy search option: `metaphone`