A hierarchical query is an SQL query that handles hierarchical model data such as the structure of organizations, living species, and a lot more. All important database engines including PostgreSQL, Oracle, DB2 and MS SQL offer support for this type of query.
However, in some cases hierarchical queries can come with a price tag. This is especially true if trees are deep, complex and therefore require some effort by the database engine to handle all this data. Simple queries are usually not a problem but if trees grow out of proportion “WITH RECURSIVE” (ANSI SQL) and “CONNECT BY” (the old Oracle implementation) might start to be an issue.
ltree: Speeding up tree structures
The ltree module is part of the standard PostgreSQL contrib package which can be found on most servers and can therefore be loaded as a normal extension. It should also be provided by most cloud services including Amazon RDS, AWS and Microsoft Azure.
Ltree implements a data type ltree for representing labels of data stored in a hierarchical tree-like structure. This blog will show how this module can be used to speed up some quite common use cases.
Preparing sample data
The first thing you have to do is to make sure that the ltree extension is loaded into your database server. CREATE EXTENSION will do the job. Then I have created a simple hierarchical structure that has a parent and a child value. This is the standard approach to store this kind of data:
CREATE EXTENSION ltree; CREATE TABLE t_orga ( parent_code text, child_code text, UNIQUE (parent_code, child_code) );
In the next step I have created some test data. Note that ltree can handle labels – please don’t use strings full of special characters here (more on that a little later):
INSERT INTO t_orga VALUES (NULL, 'A'), ('A', 'B'), ('B', 'C'), ('B', 'D'), ('D', 'E'), ('C', 'F'), ('C', 'G'), ('E', 'H'), ('F', 'I'), ('I', 'J'), ('G', 'K'), ('K', 'L'), ('L', 'M') ;
test=# SELECT * FROM t_orga ; parent_code | child_code -------------+------------ | A A | B B | C B | D D | E C | F C | G E | H F | I I | J G | K K | L L | M (13 rows)
In my data “A” is on top of the food chain (this is why the parent label is NULL in this case). Basically this data structure is already enough to figure out if you want to know all employees working in “A → B → C”. The question is: If your organization or your data in general is pretty constant – can we do better? Plants, animals and so on are not shifted around between classes on a regular basis. Elephants won’t be plants tomorrow and a rose will not turn into an animal either. These data sets are quite stable and we can therefore apply to trick to query them faster.
PostgreSQL: Simple ltree examples
Before taking a look at the trickery I have promised I want to focus your attention on some simple ltree examples so that you can understand how ltree works in general:
test=# SELECT 'A.B.C'::ltree; ltree ------- A.B.C (1 row)
test=# SELECT ltree_addtext('A.B.C'::ltree, 'D'); ltree_addtext --------------- A.B.C.D (1 row)
test=# SELECT ltree_addtext(NULL::ltree, 'D'); ltree_addtext --------------- (1 row)
test=# SELECT ltree_addtext(''::ltree, 'D'); ltree_addtext --------------- D (1 row)
As you can see ltree is a “.” separated kind of list. You can append values easily. But you have to be careful: You can append to the list if it contains an empty string. If the value is NULL you cannot append. The same is true for other stuff in PostgreSQL as well and you should be aware of this issue. What is also important is that ltree can hold labels – not arbitrary strings. Here is an example:
test=# SELECT ltree_addtext('A'::ltree, '.'); ERROR: syntax error at position 0
Armed with this information we can create a materialized view which pre-aggregates some data.
Pre-calculation hierarchies in SQL
In PostgreSQL you can pre-aggregate data using a materialized view. In this case we want to pre-calculate the hierarchy of data and store it as a separate column:
CREATE MATERIALIZED VIEW t_orga_mat AS WITH RECURSIVE x AS ( SELECT *, child_code::ltree AS mypath FROM t_orga WHERE parent_code IS NULL UNION ALL SELECT y.parent_code, y.child_code, ltree_addtext(x.mypath, y.child_code) AS mypath FROM x, t_orga AS y WHERE x.child_code = y.parent_code ) SELECT * FROM x;
To do that I have used WITH RECURSIVE which is the standard way of doing this kind of operation. How does it work? The SELECT inside the recursion has two components. Something before the UNION ALL and something after it. The first part selects the starting point of the recursion. We basically want to start at the beginning. In our case parent_code has to be NULL – this tells us that we are on top of the food chain. All the data is selected, and a column is added at the end. The first line will only contain the child_code as it is the first entry on top of the tree.
The second part will take the result created by the first part and join it. Then the last column is extended by the current value. We do that until the join does not yield anything anymore.
Wait, wouldn’t that lead to an infinite loop? Yes, it can. Basically one has to use UNION instead of UNION ALL to prevent a nested loop from happening. However, in case of ltree this does not work because to do that, the data type must support hashing which ltree does not. In other words: You must be 100% sure that the dataset does NOT contain loops.
The materialized view will create this kind of result:
test=# SELECT * FROM t_orga_mat; parent_code | child_code | mypath -------------+------------+--------------- | A | A A | B | A.B B | C | A.B.C B | D | A.B.D D | E | A.B.D.E C | F | A.B.C.F C | G | A.B.C.G E | H | A.B.D.E.H F | I | A.B.C.F.I G | K | A.B.C.G.K I | J | A.B.C.F.I.J K | L | A.B.C.G.K.L L | M | A.B.C.G.K.L.M (13 rows)
As you can see the last column has all the higher levels. The main question now is: How can we query this data? Here is an example:
test=# SELECT * FROM t_orga_mat WHERE 'A.B.C.G' @> mypath; parent_code | child_code | mypath -------------+------------+--------------- C | G | A.B.C.G G | K | A.B.C.G.K K | L | A.B.C.G.K.L L | M | A.B.C.G.K.L.M (4 rows)
We are looking for everybody in “A → B → C → G”. Of course ltree provides many more operators which can be found in the official documentation.
Indexing ltree in PostgreSQL
Keep in mind that the table is small so PostgreSQL would naturally use a sequential scan and not consider the index for efficiency reasons. It only does so when the table is larger. To encourage the optimizer to an index we can make sequential scans super expensive:
test=# SET enable_seqscan TO off; SET test=# explain SELECT * FROM t_orga_mat WHERE 'A.B.C.G' @> mypath; QUERY PLAN --------------------------------------------------------------------------- Index Scan using idx_mat on t_orga_mat (cost=0.13..8.15 rows=1 width=96) Index Cond: (mypath <@ 'A.B.C.G'::ltree) (2 rows)
You can see that PostgreSQL is using the desired index to speed up this kind of data.
Hierarchical data is important. We have covered this topic in the past already. If you want to find out more consider reading our post about WITH RECURSIVE and company organization.