PostgreSQL vs PostgreSQL vs Postgres vs Oracle vs all the rest

09.2020 | Category: | Tags: development , sql help

09.2020

Category:

Is it Postgre, PostGreSQL, Postgres or PostgreSQL? We have all seen a couple of wrong ways to spell “PostgreSQL”. The question therefore is: How can one find data even if there are typos? In PostgreSQL there are various solutions to the problem. Depending on what kind of search you need you can choose between various methods.

Table of Contents

Before we get started it is necessary to compile some sample data:

CREATE EXTENSION citext;

CREATE TABLE t_database
(
     real_name  citext
);

INSERT INTO t_database VALUES
       ('PostgreSQL'), ('Postgres'), ('PostGreSQL'), ('postgres'),
       ('DB2'), ('DB/2 LUW'), ('DB/2'), ('IBM DB2'),
       ('Oracle'),
       ('MS SQL Server'), ('Microsoft SQL Server');

CREATE EXTENSION citext;

CREATE TABLE t_database

(

real_name citext

);

INSERT INTO t_database VALUES

('PostgreSQL'), ('Postgres'), ('PostGreSQL'), ('postgres'),

('DB2'), ('DB/2 LUW'), ('DB/2'), ('IBM DB2'),

('Oracle'),

('MS SQL Server'), ('Microsoft SQL Server');

All we have here is the name of some database along with a handful of different spellings and typos.

Simple case-insensitive search in PostgreSQL

There is one common thing people need quite frequently: Case-insensitive search. Of course one can work around this problem using “upper” and “lower” but in many cases this is simply less convenient than it should be. In addition to that developers have to keep these things in mind at all times which can lead to bugs.

Fortunately, there is a more simplistic solution available: The “citext” datatype (as provided by the “citext” extension) can handle case-insensitive comparisons. The following example shows how it works:

test=# SELECT * FROM t_database WHERE real_name = 'postgresql';
 real_name
------------
 PostgreSQL
 PostGreSQL
(2 rows)

test=# SELECT * FROM t_database WHERE real_name = 'postgresql';

real_name

------------

PostgreSQL

PostGreSQL

(2 rows)

As you can see, the data in the table is elegantly preserved. Just the comparison function is a bit more relaxed. Solving the case-sensitivity issue on the data type level is really nice and makes life a lot easier for all software making use of the database. The citext extension is available on all public clouds including Amazon AWS, Microsoft Azure and so on.

LIKE queries: When we know at least something

LIKE and ILIKE are the classical means to do similarity search in PostgreSQL:

test=# SELECT * FROM t_database WHERE real_name ILIKE 'postgr%';
real_name
------------
 PostgreSQL
 Postgres
 PostGreSQL
 postgres
(4 rows)

test=# SELECT * FROM t_database WHERE real_name ILIKE 'postgr%';

real_name

------------

PostgreSQL

Postgres

PostGreSQL

postgres

(4 rows)

As you can see, we can find all incarnations of PostgreSQL in an easy way. However, LIKE and ILIKE have a major limitation: One has to know quite a lot about the spelling of things you are looking for. In my example we needed at string that contained “postgr” - but what if somebody had used “BostgreSQL”? We would not have found anything. In other words: LIKE and ILIKE are good but in many cases these two keywords are not sufficient to find what is really needed.

Advanced similarity search

Trigrams are an additional way, provided by PostgreSQL, to handle similarity search and are more suitable to handle typos. To use trigrams one has to install the extension:

test=# CREATE EXTENSION pg_trgm;
CREATE EXTENSION

1 2	test=# CREATE EXTENSION pg_trgm; CREATE EXTENSION

The trigram extension provides us with the distance operator. It helps us to find similar strings. Let us give it a try: Support we want to search for “db3” because we did not quite understand that the IBM product is actually called “DB2”. Here is what happens:

test=# SELECT real_name <-> 'db3', * 
       FROM   t_database 
       ORDER BY real_name <-> 'db3';
  ?column?  | real_name
------------+----------------------
  0.6666666 | DB2
 0.71428573 | DB/2
        0.8 | IBM DB2
  0.8181818 | DB/2 LUW
          1 | Oracle
          1 | MS SQL Server
          1 | PostgreSQL
          1 | Microsoft SQL Server
          1 | Postgres
          1 | PostGreSQL
          1 | postgres
(11 rows)

test=# SELECT real_name <-> 'db3', *

FROM t_database

ORDER BY real_name <-> 'db3';

?column? | real_name

------------+----------------------

0.6666666 | DB2

0.71428573 | DB/2

0.8 | IBM DB2

0.8181818 | DB/2 LUW

1 | Oracle

1 | MS SQL Server

1 | PostgreSQL

1 | Microsoft SQL Server

1 | Postgres

1 | PostGreSQL

1 | postgres

(11 rows)

We can sort by distance which ideally gives us the string closest to what we are searching for. As you can see “DB2” and “DB/2” come out on top which is exactly the way it should be. Of course pg_trgm is not the holy grail and should be used with caution. However, it is one way to tackle the problems caused by wrong data.

Using full text search

While pg_trgm is useful is you are looking for names, addresses and so on full text search has a slightly different purpose. The goal is to look for certain words in text. Here is an example:

test=# SELECT real_name, to_tsvector(real_name) 
       FROM   t_database 
       WHERE  to_tsvector(real_name) @@ to_tsquery('server & microsoft');
       real_name      |          to_tsvector
----------------------+----------------------------------
 Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2
(1 row)

test=# SELECT real_name, to_tsvector(real_name)

FROM t_database

WHERE to_tsvector(real_name) @@ to_tsquery('server & microsoft');

real_name | to_tsvector

----------------------+----------------------------------

Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2

(1 row)

We are looking for the words “server” and “microsoft”. Naturally one row is returned. What is noteworthy here is that the order of words can be swapped. We simply want to make sure that all words are in the string - for now order does not matter.

Full text search in PostgreSQL is quite powerful and it offers a variety of additional features which are way beyond the scope of this posting.

Using phrase search

However, what can we do if the order of words does matter? The answer is “phrase search”: PostgreSQL allows you to specify in which order words have to appear:

test=# SELECT real_name, to_tsvector(real_name) 
       FROM   t_database 
       WHERE  to_tsvector(real_name) @@ to_tsquery('microsoft <-> sql');
       real_name      | to_tsvector
----------------------+----------------------------------
 Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2
(1 row)

test=# SELECT real_name, to_tsvector(real_name)

FROM t_database

WHERE to_tsvector(real_name) @@ to_tsquery('microsoft <-> sql');

real_name | to_tsvector

----------------------+----------------------------------

Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2

(1 row)

In this case we want “microsoft” followed by “sql”. PostgreSQL will return the right row. However, what if we look for “microsoft” followed by “server”? In this case things will fail because there happens to be a word in between which is not allowed:

test=# SELECT real_name, to_tsvector(real_name) 
       FROM   t_database 
       WHERE  to_tsvector(real_name) @@ to_tsquery('microsoft <-> server');
 real_name | to_tsvector
-----------+-------------
(0 rows)

test=# SELECT real_name, to_tsvector(real_name)

FROM t_database

WHERE to_tsvector(real_name) @@ to_tsquery('microsoft <-> server');

real_name | to_tsvector

-----------+-------------

(0 rows)

Fortunately, PostgreSQL (https://www.postgresql.org/docs/current/functions-textsearch.html) has a feature indicating how far things are allowed to be apart. The following listing shows how this works:

test=# SELECT real_name, to_tsvector(real_name) 
       FROM   t_database 
       WHERE  to_tsvector(real_name) @@ to_tsquery('microsoft <2> server');
       real_name      |              to_tsvector
----------------------+----------------------------------
 Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2
(1 row)

test=# SELECT real_name, to_tsvector(real_name)

FROM t_database

WHERE to_tsvector(real_name) @@ to_tsquery('microsoft <2> server');

real_name | to_tsvector

----------------------+----------------------------------

Microsoft SQL Server | 'microsoft':1 'server':3 'sql':2

(1 row)

Finally …

PostgreSQL has many options to handle fuzzy search. There are many more things out there (such as pgsimilarity) which can be used to make our lives easier. One of the things people are often not aware of is the fact that full text search and indexing can also have an impact on the optimal VACUUM policy. Check out my posting about the “GIN pending list” to learn more.

3 responses to “PostgreSQL vs PostgreSQL vs Postgres vs Oracle vs all the rest”

Zixma Liamg says:

September 2, 2020 at 12:01 pm

Thank you, I've share this url done.

Reply
Stirling Butcher says:

September 4, 2020 at 2:56 am

Poor title mislead me to coming here

Reply
- Sameer A says:
  
  September 4, 2020 at 5:35 am
  
  Agree. Nevertheless it was a good and simple article for those struggling to tackle fuzzy matches.
  
  Reply

PostgreSQL vs PostgreSQL vs Postgres vs Oracle vs all the rest

Simple case-insensitive search in PostgreSQL

LIKE queries: When we know at least something

Advanced similarity search

Using full text search

Using phrase search

Finally …

3 responses to “PostgreSQL vs PostgreSQL vs Postgres vs Oracle vs all the rest”

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts