Twenty years ago it was enough for a database to simply check whether one string was identical to another. Those times are long gone, and over the years several algorithms for fuzzy string matching have been developed. Many of these mechanisms, such as trigrams, regular expressions, soundex and so on, are already available in PostgreSQL.
The algorithms currently available in PostgreSQL have one thing in common: they are all designed for fairly short strings, such as names, addresses and so on. Recently I stumbled over the problem of having to come up with an algorithm for nearest-neighbour search on slightly longer texts (maybe the size of a paragraph or a page).
Trigrams are clearly not feasible for this problem in PostgreSQL, and the same obviously applies to regular expressions (how would that ever work?). I did some research on this subject and found a nice and fairly simple algorithm for measuring text similarity: cosine similarity.
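The idea is simple: each text is turned into a vector of word counts, and the similarity of two texts is the cosine of the angle between their vectors. For two term-frequency vectors $A$ and $B$ this is the standard definition:

\[
\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}
             = \frac{\sum_{i} A_i B_i}{\sqrt{\sum_{i} A_i^2} \; \sqrt{\sum_{i} B_i^2}}
\]

Since word counts are never negative, the result ranges from 0 (no words in common) to 1 (identical word distributions).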
I borrowed some Python code and hacked up a stored procedure to see how things work. To make it short: the results are pretty promising if you ask me:
SELECT 'I am hungry, I really want to eat something now' <#> 'Give me something to eat because I am hungry';
We can see that those texts are pretty close:
    ?column?
----------------
 0.673575314055
(1 row)
Let us try it with different texts:
SELECT 'I am hungry, I really want to eat something now' <#> 'Where is your PostgreSQL database now ?';
As expected, this text is not nearly as close as the one we saw before:
    ?column?
----------------
 0.117851130198
(1 row)
Let us take a look at how these results are achieved.
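For the first example, the term vectors are {I: 2, am: 1, hungry: 1, really: 1, want: 1, to: 1, eat: 1, something: 1, now: 1} and {Give: 1, me: 1, something: 1, to: 1, eat: 1, because: 1, I: 1, am: 1, hungry: 1}. The six shared words (I, am, hungry, to, eat, something) contribute a dot product of 2 + 1 + 1 + 1 + 1 + 1 = 7, the vector lengths are √12 and √9 = 3, and 7 / (3 · √12) ≈ 0.6736, which is exactly the number we saw above.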
Here is a brute-force implementation of cosine text similarity, written as a PL/Python stored procedure:
CREATE OR REPLACE FUNCTION profile_distance(text, text) RETURNS numeric AS $$
import re, math
from collections import Counter

# words are simply sequences of word characters
WORD = re.compile(r'\w+')

def get_cosine(vec1, vec2):
    # dot product over the words occurring in both texts
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    # product of the two vector lengths
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    # turn a text into a term-frequency vector
    words = WORD.findall(text)
    return Counter(words)

text1 = args[0]
text2 = args[1]

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)

cosine = get_cosine(vector1, vector2)
return cosine
$$ LANGUAGE plpythonu IMMUTABLE STRICT;
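At this point the function can already be called directly; the operator defined next is merely convenience syntax on top of it:

SELECT profile_distance('I am hungry, I really want to eat something now',
                        'Give me something to eat because I am hungry');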
Finally, we define a simple operator wrapping the function:
CREATE OPERATOR <#>
(PROCEDURE=profile_distance,
LEFTARG=text,
RIGHTARG=text);
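As a quick sanity check: comparing a text with itself should yield a similarity of 1 (up to floating-point rounding), since the angle between identical vectors is zero:

SELECT 'I am hungry' <#> 'I am hungry';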
From here on it should be trivial to hack up a PostgreSQL operator class and use GiST indexes to find texts dealing with similar topics or issues. To keep this quick-and-dirty test as simple as possible, the current code does not use stemming or any technique of that sort; a possible way to add it is sketched below. It would also be wise to add some heuristics to fix typos and so on.
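If stemming is wanted later on, one straightforward variant would be to run every word through a stemmer before counting. Here is a minimal sketch of a modified text_to_vector, assuming the nltk package is available to the Python interpreter PL/Python uses (that availability is an assumption, not something the procedure above requires):

import re
from collections import Counter
from nltk.stem.porter import PorterStemmer   # assumption: nltk is installed

WORD = re.compile(r'\w+')
stemmer = PorterStemmer()

def text_to_vector(text):
    # lowercase and stem each word, so that e.g. 'Eating' and 'eats'
    # fall into the same dimension before counting
    words = [stemmer.stem(w.lower()) for w in WORD.findall(text)]
    return Counter(words)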