PostgreSQL: Speeding up recursive queries and hierarchical data

06.2020 | Category: Performance | Tags: development, sql help

06.2020

A hierarchical query is an SQL query that handles hierarchical model data such as the structure of organizations, living species, and a lot more. All important database engines including PostgreSQL, Oracle, DB2 and MS SQL offer support for this type of query.

Table of Contents

However, in some cases hierarchical queries can come with a price tag. This is especially true if trees are deep, complex and therefore require some effort by the database engine to handle all this data. Simple queries are usually not a problem but if trees grow out of proportion “WITH RECURSIVE” (ANSI SQL) and “CONNECT BY” (the old Oracle implementation) might start to be an issue.

ltree: Speeding up tree structures

The ltree module is part of the standard PostgreSQL contrib package which can be found on most servers and can therefore be loaded as a normal extension. It should also be provided by most cloud services including Amazon RDS, AWS and Microsoft Azure.

Ltree implements a data type ltree for representing labels of data stored in a hierarchical tree-like structure. This blog will show how this module can be used to speed up some quite common use cases.

Preparing sample hierarchical data

The first thing you have to do is to make sure that the ltree extension is loaded into your database server. CREATE EXTENSION will do the job. Then I have created a simple hierarchical structure that has a parent and a child value. This is the standard approach to store this kind of data:

CREATE EXTENSION ltree;

CREATE TABLE t_orga
(
parent_code text,
child_code text,
UNIQUE (parent_code, child_code)
);

CREATE EXTENSION ltree;

CREATE TABLE t_orga

(

parent_code text,

child_code text,

UNIQUE (parent_code, child_code)

);

In the next step I have created some test data.

Note that ltree can handle labels - please don't use strings full of special characters here

(more on that a little later):

INSERT INTO t_orga
VALUES (NULL, 'A'),
('A', 'B'),
('B', 'C'),
('B', 'D'),
('D', 'E'),
('C', 'F'),
('C', 'G'),
('E', 'H'),
('F', 'I'),
('I', 'J'),
('G', 'K'),
('K', 'L'),
('L', 'M')
;

INSERT INTO t_orga

VALUES (NULL, 'A'),

('A', 'B'),

('B', 'C'),

('B', 'D'),

('D', 'E'),

('C', 'F'),

('C', 'G'),

('E', 'H'),

('F', 'I'),

('I', 'J'),

('G', 'K'),

('K', 'L'),

('L', 'M')

;

test=# SELECT * FROM t_orga ;
 parent_code | child_code
-------------+------------
             | A
 A           | B
 B           | C
 B           | D
 D           | E
 C           | F
 C           | G
 E           | H
 F           | I
 I           | J
 G           | K
 K           | L
 L           | M
(13 rows)

test=# SELECT * FROM t_orga ;

parent_code | child_code

-------------+------------

| A

A | B

B | C

B | D

D | E

C | F

C | G

E | H

F | I

I | J

G | K

K | L

L | M

(13 rows)

In my data “A” is on top of the food chain (this is why the parent label is NULL in this case). Basically this data structure is already enough to figure out if you want to know all employees working in “A → B → C”. The question is: If your organization or your data in general is pretty constant - can we do better? Plants, animals and so on are not shifted around between classes on a regular basis. Elephants won't be plants tomorrow and a rose will not turn into an animal either. These data sets are quite stable and we can therefore apply to trick to query them faster.

PostgreSQL: Simple ltree examples

Before taking a look at the trickery I have promised I want to focus your attention on some simple ltree examples so that you can understand how ltree works in general:

test=# SELECT 'A.B.C'::ltree;
ltree
-------
A.B.C
(1 row)

test=# SELECT 'A.B.C'::ltree;

ltree

-------

A.B.C

(1 row)

test=# SELECT ltree_addtext('A.B.C'::ltree, 'D');
ltree_addtext
---------------
A.B.C.D
(1 row)

test=# SELECT ltree_addtext('A.B.C'::ltree, 'D');

ltree_addtext

---------------

A.B.C.D

(1 row)

test=# SELECT ltree_addtext(NULL::ltree, 'D');
ltree_addtext
---------------

(1 row)

test=# SELECT ltree_addtext(NULL::ltree, 'D');

ltree_addtext

---------------

(1 row)

test=# SELECT ltree_addtext(''::ltree, 'D');
ltree_addtext
---------------
D
(1 row)

test=# SELECT ltree_addtext(''::ltree, 'D');

ltree_addtext

---------------

(1 row)

Easily append values, but be careful about it!

As you can see, ltree is a “.”-separated kind of list. You can append values easily. But you have to be careful: You can append to the list if it contains an empty string. If the value is NULL you cannot append. The same is true for other stuff in PostgreSQL as well, and you should be aware of this issue. What is also important is that ltree can hold labels - not arbitrary strings. Here is an example:

test=# SELECT ltree_addtext('A'::ltree, '.');
ERROR: syntax error at position 0

1 2	test=# SELECT ltree_addtext('A'::ltree, '.'); ERROR: syntax error at position 0

Armed with this information we can create a materialized view which pre-aggregates some data.

Pre-calculation hierarchies in SQL

In PostgreSQL you can pre-aggregate data using a materialized view. In this case we want to pre-calculate the hierarchy of data and store it as a separate column:

CREATE MATERIALIZED VIEW t_orga_mat AS
WITH RECURSIVE x AS (
SELECT *, child_code::ltree AS mypath
FROM t_orga
WHERE parent_code IS NULL
UNION ALL
SELECT y.parent_code, y.child_code, ltree_addtext(x.mypath, y.child_code) AS mypath
FROM x, t_orga AS y
WHERE x.child_code = y.parent_code
)
SELECT * FROM x;

CREATE MATERIALIZED VIEW t_orga_mat AS

WITH RECURSIVE x AS (

SELECT *, child_code::ltree AS mypath

FROM t_orga

WHERE parent_code IS NULL

UNION ALL

SELECT y.parent_code, y.child_code, ltree_addtext(x.mypath, y.child_code) AS mypath

FROM x, t_orga AS y

WHERE x.child_code = y.parent_code

)

SELECT * FROM x;

To do that I have used WITH RECURSIVE which is the standard way of doing this kind of operation. How does it work? The SELECT inside the recursion has two components. Something before the UNION ALL and something after it. The first part selects the starting point of the recursion. We basically want to start at the beginning. In our case parent_code has to be NULL - this tells us that we are on top of the food chain. All the data is selected, and a column is added at the end. The first line will only contain the child_code as it is the first entry on top of the tree.
The second part will take the result created by the first part and join it. Then the last column is extended by the current value. We do that until the join does not yield anything anymore.

Wait, wouldn't that lead to an infinite loop?

Yes, it can. Basically one has to use UNION instead of UNION ALL to prevent a nested loop from happening. However, in case of ltree this does not work because to do that, the data type must support hashing which ltree does not. In other words: You must be 100% sure that the dataset does NOT contain loops.

The materialized view will create this kind of result:

test=# SELECT * FROM t_orga_mat;
parent_code | child_code | mypath
-------------+------------+---------------
| A | A
A | B | A.B
B | C | A.B.C
B | D | A.B.D
D | E | A.B.D.E
C | F | A.B.C.F
C | G | A.B.C.G
E | H | A.B.D.E.H
F | I | A.B.C.F.I
G | K | A.B.C.G.K
I | J | A.B.C.F.I.J
K | L | A.B.C.G.K.L
L | M | A.B.C.G.K.L.M
(13 rows)

test=# SELECT * FROM t_orga_mat;

parent_code | child_code | mypath

-------------+------------+---------------

| A | A

A | B | A.B

B | C | A.B.C

B | D | A.B.D

D | E | A.B.D.E

C | F | A.B.C.F

C | G | A.B.C.G

E | H | A.B.D.E.H

F | I | A.B.C.F.I

G | K | A.B.C.G.K

I | J | A.B.C.F.I.J

K | L | A.B.C.G.K.L

L | M | A.B.C.G.K.L.M

(13 rows)

As you can see the last column has all the higher levels. The main question now is: How can we query this data? Here is an example:

test=# SELECT * FROM t_orga_mat WHERE 'A.B.C.G' @> mypath;
parent_code | child_code | mypath
-------------+------------+---------------
C | G | A.B.C.G
G | K | A.B.C.G.K
K | L | A.B.C.G.K.L
L | M | A.B.C.G.K.L.M
(4 rows)

test=# SELECT * FROM t_orga_mat WHERE 'A.B.C.G' @> mypath;

parent_code | child_code | mypath

-------------+------------+---------------

C | G | A.B.C.G

G | K | A.B.C.G.K

K | L | A.B.C.G.K.L

L | M | A.B.C.G.K.L.M

(4 rows)

We are looking for everybody in “A → B → C → G”. Of course ltree provides many more operators which can be found in the official documentation.

Indexing ltree in PostgreSQL

Keep in mind that the table is small so PostgreSQL would naturally use a sequential scan and not consider the index for efficiency reasons. It only does so when the table is larger. To encourage the optimizer to choose an index, we can make sequential scans super expensive:

test=# SET enable_seqscan TO off;
SET
test=# explain SELECT * FROM t_orga_mat WHERE 'A.B.C.G' @> mypath;
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using idx_mat on t_orga_mat (cost=0.13..8.15 rows=1 width=96)
Index Cond: (mypath <@ 'A.B.C.G'::ltree)
(2 rows)

test=# SET enable_seqscan TO off;

SET

test=# explain SELECT * FROM t_orga_mat WHERE 'A.B.C.G' @> mypath;

QUERY PLAN

---------------------------------------------------------------------------

Index Scan using idx_mat on t_orga_mat (cost=0.13..8.15 rows=1 width=96)

Index Cond: (mypath <@ 'A.B.C.G'::ltree)

(2 rows)

You can see that PostgreSQL is using the desired index to speed up this kind of data.

Finally …

Hierarchical data is important. We have covered this topic in the past already. If you want to find out more, read our post about WITH RECURSIVE and company organization.

One response to “PostgreSQL: Speeding up recursive queries and hierarchical data”

Adam Brusselback says:

June 2, 2020 at 3:36 pm

I would love to see some more analysis on the actual performance difference here. I have a good feeling ltree would be faster, but it would be nice to see some "real world" numbers on how much faster that actually is.

Reply

PostgreSQL: Speeding up recursive queries and hierarchical data

ltree: Speeding up tree structures

Preparing sample hierarchical data

Note that ltree can handle labels - please don't use strings full of special characters here

PostgreSQL: Simple ltree examples

Easily append values, but be careful about it!

Pre-calculation hierarchies in SQL

Wait, wouldn't that lead to an infinite loop?

Indexing ltree in PostgreSQL

Finally …

One response to “PostgreSQL: Speeding up recursive queries and hierarchical data”

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts