Connecting through a time rift into the AI future: in PostgreSQL v27, LIKE can search for books with similar content (indexing LIKE won't become any easier)
© Laurenz Albe 2023

Unless you use the binary collation, creating a b-tree index to support a LIKE condition in PostgreSQL is not straightforward. This keeps surprising Oracle users, who claim that a regular b-tree index will of course always support LIKE. I decided to explore the differences between Oracle and PostgreSQL when it comes to indexing LIKE conditions. Is Oracle really smarter here?

Indexing LIKE conditions

It is clear that a pattern must not start with a wildcard if we want to support it with a b-tree index. A b-tree index is ordered, and a pattern like %smith could be anywhere in the ordered list. Consider searching a phone book for all names ending in “smith”: you’d have to search all entries to make sure you find all “Blacksmith”, “Hammersmith” and who knows what other “smiths” might be in there. The phone book is a good example for an index: the database could use the b-tree index, but it would have to scan the whole index, and that is not very efficient.

If a pattern does not start with a wildcard, the case should be different. After all, the phone book is efficient if you are searching for all names that start with “black”: you may have to scan several pages, but “Black”, “Blackthorn” and “Blacksmith” are right next to each other in an alphabetically sorted list. So a b-tree index should be helpful in this case.

So what is the difficulty with indexing LIKE conditions in PostgreSQL?

An example to show the difficulties with indexing LIKE in PostgreSQL

We create a table with three columns. The columns will contain the same strings, but I define them with different collations:

CREATE TABLE likeme (
   id       integer PRIMARY KEY,
   s_binary text COLLATE "C"           NOT NULL,
   s_libc   text COLLATE "cs_CZ.utf8"  NOT NULL,
   s_icu    text COLLATE "cs-CZ-x-icu" NOT NULL
);

The three collations I used are

  • The “binary” collation C, also known as POSIX. It compares strings byte by byte, so it is very fast and never changes.
  • The Czech collation as defined by the operating system’s C library.
  • The Czech collation as defined by the ICU library.

Remember that (with the exception of the C collation) PostgreSQL does not implement collations itself. Rather, it uses the collations from the system’s C library of the ICU library. See this article for problems that can arise from this design choice.

We inserts a few rows of data:

INSERT INTO likeme (id, s_binary, s_libc, s_icu) VALUES
   (1, 'abc', 'abc', 'abc'),
   (2, 'abč', 'abč', 'abč'),
   (3, 'abb', 'abb', 'abb'),
   (4, 'abd', 'abd', 'abd'),
   (5, 'abcm', 'abcm', 'abcm'),
   (6, 'abch', 'abch', 'abch'),
   (7, 'abh', 'abh', 'abh'),
   (8, 'abz', 'abz', 'abz'),
   (9, 'Nesmysl', 'Nesmysl', 'Nesmysl');

Then, we’ll fill up the table with enough unimportant data so that the optimizer will prefer an index scan if possible:

INSERT INTO likeme (id, s_binary, s_libc, s_icu)
SELECT i, 'z'||i, 'z'||i, 'z'||i
FROM generate_series(10, 10000) AS i;

Finally, let’s run VACUUM and ANALYZE to set hint bits and gather statistics:

VACUUM (ANALYZE) likeme;

The good and bad sides of the C (binary) collation

Indexing LIKE is trivial: just create a regular b-tree index:

CREATE INDEX binary_ind ON likeme (s_binary);

EXPLAIN (COSTS OFF)
SELECT s_binary
FROM likeme
WHERE s_binary LIKE 'abc%';

                               QUERY PLAN                               
════════════════════════════════════════════════════════════════════════
 Index Scan using binary_ind on likeme
   Index Cond: ((s_binary >= 'abc'::text) AND (s_binary < 'abd'::text))
   Filter: (s_binary ~~ 'abc%'::text)
(3 rows)

PostgreSQL scans the index for everything between 'abc' and 'abd' and eliminates false positive results with a filter. (There are no false positive results in our case, but there could be characters after the wildcard.)

While this works great and just as we want it to, the result of ORDER BY is rather poor:

SELECT s_binary
FROM likeme
ORDER BY s_binary
FETCH FIRST 9 ROWS ONLY;

 s_binary 
══════════
 Nesmysl
 abb
 abc
 abch
 abcm
 abd
 abh
 abz
 abč
(9 rows)

Blech! Not only does the upper-case “N” sort before the lower-case “a”, but “č” sorts after all ASCII characters. So if you need natural language sorting, the C collation is not acceptable. However, if you don’t need to sort strings, it is the best collation to use.

The good and bad sides of natural language collations

Let’s look at s_icu (s_libc will behave just the same). First, PostgreSQL now sorts the strings correctly:

SELECT s_icu
FROM likeme
ORDER BY s_icu
FETCH FIRST 9 ROWS ONLY;

  s_icu  
═════════
 abb
 abc
 abcm
 abč
 abd
 abh
 abch
 abz
 Nesmysl
(9 rows)

Please turn your attention to the position of 'abch': it appears after 'abh'! This is the correct way to sort the digraph “ch” in Czech (and that is why I chose Czech for my example). You see that natural language collations can be quite complicated. For details, read Peter Eisentraut’s excellent blog series how collation works, how collation of punctuation and whitespace works and overview of ICU collation settings.

This peculiarity has no influence on how LIKE works (which the SQL standard defines in ISO/IEC 9075-2, 8.5 (<like_predicate>):

SELECT s_icu FROM likeme WHERE s_icu LIKE 'abc%';

 s_icu 
═══════
 abc
 abch
 abcm
(3 rows)

Let’s create an index on the column and see if it speeds up the LIKE condition:

CREATE INDEX icu_ind ON likeme (s_icu);

EXPLAIN (COSTS OFF)
SELECT s_icu
FROM likeme
WHERE s_icu LIKE 'abc%';

            QUERY PLAN             
═══════════════════════════════════
 Seq Scan on likeme
   Filter: (s_icu ~~ 'abc%'::text)
(2 rows)

PostgreSQL doesn’t (and cannot) use the index. Can you guess the reason? If we scanned the index in the same manner as before, we would not find 'abch', because it is sorted elsewhere! With some natural language collations, you cannot find a string in the index if you only know a prefix. Not all collations are as complicated as that, but since PostgreSQL has no deeper insight into the collations (which are defined by external libraries), it has to stay on the safe side and avoid using a natural language index.

How to go about indexing LIKE in PostgreSQL

If you want to both have your cake and eat it, that is, if you want the benefits of a natural language collation and still like to index LIKE conditions, you have to use a PostgreSQL feature. In PostgreSQL you can specify an operator class in an index definition. That way, you can specify the set of comparison operators that the index supports. Now there is a special operator class text_pattern_ops, whose operators compare strings character for character. That is exactly what we need for indexing LIKE:

CREATE INDEX icu_like_idx ON likeme (s_icu text_pattern_ops);

EXPLAIN (COSTS OFF)
SELECT s_icu
FROM likeme
WHERE s_icu LIKE 'abc%';

                              QUERY PLAN                              
══════════════════════════════════════════════════════════════════════
 Index Only Scan using icu_like_idx on likeme
   Index Cond: ((s_icu ~>=~ 'abc'::text) AND (s_icu ~<~ 'abd'::text))
   Filter: (s_icu ~~ 'abc%'::text)
(3 rows)

You can see the “character-wise comparison operators” in the execution plan. But this index can do more than speed up LIKE conditions: you can also use it to speed up equality comparisons. In fact, such an index is typically the only index you need on a string column, unless you have to index ORDER BY, which clearly needs the plain natural language index.

We have seen how to index LIKE in PostgreSQL, now what about Oracle?

An example for indexing LIKE in Oracle

We create a table similar to the test table in PostgreSQL. We have to define the string columns as NOT NULL, otherwise Oracle won’t be able to speed up ORDER BY with an index (Oracle does not store index entries where all columns are NULL values):

CREATE TABLE likeme (
   id NUMBER(5) CONSTRAINT likeme_pkey PRIMARY KEY,
   s_binary VARCHAR2(100 CHAR)                NOT NULL,
   s_czech  VARCHAR2(100 CHAR) COLLATE CZECH  NOT NULL,
   s_xczech VARCHAR2(100 CHAR) COLLATE XCZECH NOT NULL
);

Oracle uses the binary collation by default. This already goes a long way to explain why people claim that a “simple index” is sufficient for indexing LIKE! But we want to dig deeper, that’s why we define two additional columns with the two natural language collations Oracle supports:

  • CZECH is a “poor man’s natural language collation” that only supports part of the functionality
  • XCZECH is the collation that behavs as a natural language collation should

In the words of the Oracle documentation:

Some monolingual collations have an extended version that handles special linguistic cases. The name of the extended version is prefixed with the letter X. These special cases typically mean that one character is sorted like a sequence of two characters or a sequence of two characters is sorted as one character. For example, ch and ll are treated as a single character in XSPANISH. Extended monolingual collations may also define special language-specific uppercase and lowercase rules that override standard rules of a character set.

To insert data into Oracle, we can simply use pg_dump to export the PostgreSQL data and use that as an SQL script for Oracle. That guarantees that we have the same data in both systems:

pg_dump --inserts --data-only --table=likeme --file=import.sql dbname

After importing the data into Oracle, we calculate optimizer statistics:

ANALYZE TABLE likeme COMPUTE STATISTICS;

Natural language collations in Oracle and ORDER BY

With the CZECH collation, the result is not satisfactory:

SELECT s_czech
FROM likeme
ORDER BY s_czech
FETCH FIRST 9 ROWS ONLY;

S_CZECH
-------
abb
abc
abch
abcm
abč
abd
abh
abz
Nesmysl

9 rows selected.

Like the Oracle documentation stated, Oracle does not sort 'abch' correctly. We need the XCZECH collation for that:

SELECT s_xczech
FROM likeme
ORDER BY s_xczech
FETCH FIRST 9 ROWS ONLY;

S_XCZECH
--------
abb
abc
abcm
abč
abd
abh
abch
abz
Nesmysl

9 rows selected.

If we create an index on s_xczech, that index will speed up the query:

CREATE INDEX xczech_ind ON likeme (s_xczech);

Index created.

SET AUTOTRACE TRACEONLY

SELECT s_xczech
FROM likeme
ORDER BY s_xczech
FETCH FIRST 9 ROWS ONLY;

Execution Plan
--------------

---------------------------------------------------------------------
| Id  | Operation                      | Name       | Rows  | Bytes |
---------------------------------------------------------------------
|   0 | SELECT STATEMENT               |            |     9 |  3753 |
|   1 |  SORT ORDER BY                 |            |     9 |  3753 |
|*  2 |   VIEW                         |            |     9 |  3753 |
|*  3 |    WINDOW NOSORT STOPKEY       |            | 10000 |    17M|
|   4 |     TABLE ACCESS BY INDEX ROWID| LIKEME     | 10000 |    17M|
|   5 |      INDEX FULL SCAN           | XCZECH_IND | 10000 |       |
---------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - filter("from$_subquery$_002"."rowlimit_$$_rownumber"<=9)
   3 - filter(ROW_NUMBER() OVER ( ORDER BY "S_XCZECH")<=9)

I have omitted some parts for brevity. Oracle seems to implement FETCH FIRST 9 ROWS ONLY using a window function, but I guess it is smart enough not to read the whole index. Oracle calls an index that uses a collation different from the binary one a linguistic index.

Indexing LIKE in Oracle

The suspense is mounting: this is what I originally wanted to investigate. Let’s see if Oracle uses the linguistic index:

SELECT s_xczech
FROM likeme
WHERE s_xczech LIKE 'abc%';

Execution Plan
--------------

--------------------------------------------------------------------------
| Id  | Operation                           | Name       | Rows  | Bytes |
--------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |            |     1 |  1809 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| LIKEME     |     1 |  1809 |
|*  2 |   INDEX RANGE SCAN                  | XCZECH_IND |    45 |       |
--------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(INSTR("S_XCZECH",'abc',1,1)=1)
   2 - access(NLSSORT("S_XCZECH",'nls_sort=''XCZECH''')>=HEXTORAW('14191E0001010100') AND
              NLSSORT("S_XCZECH",'nls_sort=''XCZECH''')<HEXTORAW('14191F0001010100'))

Oracle actually uses the index! But how is that possible? How can it find 'abch' that way? The mystery is solved as soon as we look at the actual query result:

SET AUTOTRACE OFF

SELECT s_xczech
FROM likeme
WHERE s_xczech LIKE 'abc%';

S_XCZECH
--------
abc
abcm

Ouch! Oracle gave us the wrong result! 'abch' is missing. With the CZECH collation we get the correct result:

SELECT s_czech
FROM likeme
WHERE s_czech LIKE 'abc%';

S_CZECH
-------
abc
abch
abcm

So Oracle gives us a choice: either get a correct result for ORDER BY and a wrong result for LIKE using the XCZECH collation, or get a wrong result for ORDER BY and a correct result for LIKE using the CZECH collation. To get the correct result, you’d have to add an explicit COLLATE clause:

SELECT s_xczech
FROM likeme
WHERE (s_xczech COLLATE CZECH) LIKE 'abc%';

But I found no way to create an index in Oracle that can support that query.

Conclusion

We have seen how to index LIKE in PostgreSQL. We have also verified that LIKE really never needs a special index in Oracle. However, that is not because Oracle is smarter than PostgreSQL, but because it does not care all that much about correct results.

To see more posts about the topic of collations, check out the following:

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.