Is it Postgre, PostGreSQL, Postgres or PostgreSQL? We have all seen a couple of wrong ways to spell “PostgreSQL”. The question therefore is: How can one find data even if there are typos? In PostgreSQL there are various solutions to the problem. Depending on what kind of search you need you can choose between various methods.
Table of Contents
Before we get started it is necessary to compile some sample data:
1 2 3 4 5 6 7 8 9 10 11 12 |
CREATE EXTENSION citext; CREATE TABLE t_database ( real_name citext ); INSERT INTO t_database VALUES ('PostgreSQL'), ('Postgres'), ('PostGreSQL'), ('postgres'), ('DB2'), ('DB/2 LUW'), ('DB/2'), ('IBM DB2'), ('Oracle'), ('MS SQL Server'), ('Microsoft SQL Server'); |
All we have here is the name of some database along with a handful of different spellings and typos.
There is one common thing people need quite frequently: Case-insensitive search. Of course one can work around this problem using “upper” and “lower” but in many cases this is simply less convenient than it should be. In addition to that developers have to keep these things in mind at all times which can lead to bugs.
Fortunately, there is a more simplistic solution available: The “citext” datatype (as provided by the “citext” extension) can handle case-insensitive comparisons. The following example shows how it works:
1 2 3 4 5 6 |
test=# SELECT * FROM t_database WHERE real_name = 'postgresql'; real_name ------------ PostgreSQL PostGreSQL (2 rows) |
As you can see, the data in the table is elegantly preserved. Just the comparison function is a bit more relaxed. Solving the case-sensitivity issue on the data type level is really nice and makes life a lot easier for all software making use of the database. The citext extension is available on all public clouds including Amazon AWS, Microsoft Azure and so on.
LIKE and ILIKE are the classical means to do similarity search in PostgreSQL:
1 2 3 4 5 6 7 8 |
test=# SELECT * FROM t_database WHERE real_name ILIKE 'postgr%'; real_name ------------ PostgreSQL Postgres PostGreSQL postgres (4 rows) |
As you can see, we can find all incarnations of PostgreSQL in an easy way. However, LIKE and ILIKE have a major limitation: One has to know quite a lot about the spelling of things you are looking for. In my example we needed at string that contained “postgr” - but what if somebody had used “BostgreSQL”? We would not have found anything. In other words: LIKE and ILIKE are good but in many cases these two keywords are not sufficient to find what is really needed.
Trigrams are an additional way, provided by PostgreSQL, to handle similarity search and are more suitable to handle typos. To use trigrams one has to install the extension:
1 2 |
test=# CREATE EXTENSION pg_trgm; CREATE EXTENSION |
The trigram extension provides us with the distance operator. It helps us to find similar strings. Let us give it a try: Support we want to search for “db3” because we did not quite understand that the IBM product is actually called “DB2”. Here is what happens:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
test=# SELECT real_name <-> 'db3', * FROM t_database ORDER BY real_name <-> 'db3'; ?column? | real_name ------------+---------------------- 0.6666666 | DB2 0.71428573 | DB/2 0.8 | IBM DB2 0.8181818 | DB/2 LUW 1 | Oracle 1 | MS SQL Server 1 | PostgreSQL 1 | Microsoft SQL Server 1 | Postgres 1 | PostGreSQL 1 | postgres (11 rows) |
We can sort by distance which ideally gives us the string closest to what we are searching for. As you can see “DB2” and “DB/2” come out on top which is exactly the way it should be. Of course pg_trgm is not the holy grail and should be used with caution. However, it is one way to tackle the problems caused by wrong data.
While pg_trgm is useful is you are looking for names, addresses and so on full text search has a slightly different purpose. The goal is to look for certain words in text. Here is an example:
1 2 3 4 5 6 7 |
test=# SELECT real_name, to_tsvector(real_name) FROM t_database WHERE to_tsvector(real_name) @@ to_tsquery('server & microsoft'); real_name | to_tsvector ----------------------+---------------------------------- Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2 (1 row) |
We are looking for the words “server” and “microsoft”. Naturally one row is returned. What is noteworthy here is that the order of words can be swapped. We simply want to make sure that all words are in the string - for now order does not matter.
Full text search in PostgreSQL is quite powerful and it offers a variety of additional features which are way beyond the scope of this posting.
However, what can we do if the order of words does matter? The answer is “phrase search”: PostgreSQL allows you to specify in which order words have to appear:
1 2 3 4 5 6 7 |
test=# SELECT real_name, to_tsvector(real_name) FROM t_database WHERE to_tsvector(real_name) @@ to_tsquery('microsoft <-> sql'); real_name | to_tsvector ----------------------+---------------------------------- Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2 (1 row) |
In this case we want “microsoft” followed by “sql”. PostgreSQL will return the right row. However, what if we look for “microsoft” followed by “server”? In this case things will fail because there happens to be a word in between which is not allowed:
1 2 3 4 5 6 |
test=# SELECT real_name, to_tsvector(real_name) FROM t_database WHERE to_tsvector(real_name) @@ to_tsquery('microsoft <-> server'); real_name | to_tsvector -----------+------------- (0 rows) |
Fortunately, PostgreSQL (https://www.postgresql.org/docs/current/functions-textsearch.html) has a feature indicating how far things are allowed to be apart. The following listing shows how this works:
1 2 3 4 5 6 7 |
test=# SELECT real_name, to_tsvector(real_name) FROM t_database WHERE to_tsvector(real_name) @@ to_tsquery('microsoft <2> server'); real_name | to_tsvector ----------------------+---------------------------------- Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2 (1 row) |
PostgreSQL has many options to handle fuzzy search. There are many more things out there (such as pgsimilarity) which can be used to make our lives easier. One of the things people are often not aware of is the fact that full text search and indexing can also have an impact on the optimal VACUUM policy. Check out my posting about the “GIN pending list” to learn more.
+43 (0) 2622 93022-0
office@cybertec.at
You are currently viewing a placeholder content from Facebook. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More InformationYou are currently viewing a placeholder content from X. To access the actual content, click the button below. Please note that doing so will share data with third-party providers.
More Information
Poor title mislead me to coming here
Agree. Nevertheless it was a good and simple article for those struggling to tackle fuzzy matches.
Thank you, I've share this url done.