While sharding is often advertised as “THE solution to PostgreSQL scalability”, there are technical aspects you have to keep in mind if you care about performance. The rule is: sharding should not be used without a deeper awareness of what it is you are actually doing to the data, and it has to be applied in a clever and thoughtful manner. One of the most common mistakes is to ignore the concept of “data locality”. Data locality matters for many IT problems, but it is crucial in the context of database sharding. Citus is one of the most sophisticated sharding solutions in the PostgreSQL world. It can help you achieve maximum scalability and allows for efficient analytics as well as OLTP. Citus is available on-premises or as part of the Microsoft Azure cloud.
To demonstrate the concept, we first have to create two tables. For the sake of simplicity, we'll use customers and sales:
postgres=# CREATE TABLE t_customer (
        id      int,
        name    text
);
CREATE TABLE
postgres=# CREATE TABLE t_sales (
        id              int,
        customer_id     int,
        whatever        text
);
CREATE TABLE
The data model is really straightforward. In this scenario, the typical way to analyse the data is to join the customer table with the sales table. Why is this relevant? Let's distribute both tables and find out:
postgres=# SELECT create_distributed_table('t_customer', 'id');
 create_distributed_table
--------------------------

(1 row)

postgres=# SELECT create_distributed_table('t_sales', 'id');
 create_distributed_table
--------------------------

(1 row)
Note that both tables are sharded on the “id” column, which in the case of t_sales is not the join criterion; the join will happen on “customer_id”.
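If you want to double-check how a table is distributed before going any further, Citus exposes that information in its metadata. The following query is only a quick sketch and assumes a reasonably recent Citus release that ships the citus_tables view; on older versions, pg_dist_partition together with the column_to_column_name() helper provides the same details:

-- sketch: inspect the distribution column of each table
-- (assumes the citus_tables view, shipped with Citus 10 and later)
SELECT table_name,
       distribution_column,
       colocation_id
FROM   citus_tables;

Now let's load some data: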
postgres=# INSERT INTO t_customer
        SELECT *, 'dummy'
        FROM generate_series(1, 1000000);
INSERT 0 1000000
postgres=# INSERT INTO t_sales
        SELECT id, random()*100000, 'dummy'
        FROM generate_series(1, 1000000) AS id;
INSERT 0 1000000
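As an optional sanity check, you can ask every shard for its row count to see how the data was spread. run_command_on_shards() runs a command on each shard of a table and substitutes the shard name for %s; the query below is just a sketch of that idea:

-- sketch: count the rows stored in every shard of t_sales
SELECT shardid, result AS row_count
FROM   run_command_on_shards('t_sales',
                             $cmd$ SELECT count(*) FROM %s $cmd$);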
Once the data has been loaded, it's time to execute a query. All we want to do is to join and count some data. Here's what happens:
postgres=# SELECT count(*)
        FROM t_customer AS c, t_sales AS s
        WHERE c.id = s.customer_id;
ERROR:  the query contains a join that requires repartitioning
HINT:  Set citus.enable_repartition_joins to on to enable repartitioning
By default, Citus sends an error message and refuses to run the query. Why is that the case? Imagine we want to join two rows on the same machine: PostgreSQL does some magic in RAM locally and voilà, the two rows have been joined (it is not quite as simple as that, as you can see in our blog post about PostgreSQL join strategies, but the logic is easy to follow). Local joins are efficient, but what if all the data has to go through an expensive network stack? Things become orders of magnitude more expensive, and in the case of large data sets this strategy is totally impractical and doomed to fail. Therefore it is better to error out than to lead the user directly into a serious performance problem.
In some cases, it might still be necessary to force such a join. Citus allows it, but only after a configuration variable has been enabled, which can be done locally in the session. Note that this is not something you should do in real life on large data sets:
postgres=# SET citus.enable_repartition_joins = on;
SET
postgres=# SELECT count(*)
        FROM t_customer AS c, t_sales AS s
        WHERE c.id = s.customer_id;
 count
--------
 999993
(1 row)
Due to the small data set we are using here (for the sake of simplicity), the query does indeed return in reasonable time. But as stated before, if you do this on large data sets whose shards live on separate machines rather than on different ports of a single test instance, the query might never finish at all. Let's take a look at the execution plan:
postgres=# explain analyze SELECT count(*)
        FROM t_customer AS c, t_sales AS s
        WHERE c.id = s.customer_id;
                                                          QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=250.00..250.02 rows=1 width=8) (actual time=1283.486..1283.486 rows=1 loops=1)
   ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=8) (actual time=1283.478..1283.480 rows=20 loops=1)
         Task Count: 20
         Tuple data received from nodes: 160 bytes
         Tasks Shown: None, not supported for re-partition queries
         ->  MapMergeJob
               Map Task Count: 32
               Merge Task Count: 20
         ->  MapMergeJob
               Map Task Count: 32
               Merge Task Count: 20
 Planning Time: 10.282 ms
 Execution Time: 1283.670 ms
(13 rows)
We need around 1.3 seconds to run the query. If the shards reside on different machines, and not just on different ports of the same host, this execution time can easily increase many times over.
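In this demo, all workers are simply listening on different ports of the same host. If you want to see between which nodes a repartition join would actually have to ship data, you can list the worker nodes of the cluster; in a production setup these are separate machines. A quick way to check, assuming a standard Citus installation:

-- sketch: list the active worker nodes of the cluster
SELECT * FROM citus_get_active_worker_nodes();

-- or look at the node metadata directly
SELECT nodename, nodeport, noderole
FROM   pg_dist_node;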
To fix the problem, we have to introduce the concept of “data locality”. The importance of this idea cannot be overestimated. The core idea is that data which has to be joined should stay in the same shard: do not move data between shards, and avoid large operations across shards as much as you can. In our case, that means distributing the sales table on “customer_id” and colocating it with t_customer:
postgres=# CREATE TABLE t_sales_proper AS SELECT * FROM t_sales;
SELECT 1000000
postgres=# SELECT create_distributed_table(
        't_sales_proper',
        'customer_id',
        colocate_with => 't_customer'
);
NOTICE:  Copying data from local table...
NOTICE:  copying the data has completed
DETAIL:  The local data in the table is no longer visible, but is still on disk.
HINT:  To remove the local data, run: SELECT truncate_local_data_after_distributing_table($$public.t_sales_proper$$)
 create_distributed_table
--------------------------

(1 row)
Fortunately, Citus is nice enough to warn us that we have just distributed a table while leaving a full local copy of the data behind. The create_distributed_table function does NOT delete the local data, for safety reasons, so we have to truncate it ourselves to avoid storing everything twice:
postgres=# SELECT truncate_local_data_after_distributing_table($$public.t_sales_proper$$);
 truncate_local_data_after_distributing_table
-----------------------------------------------

(1 row)
Storing data locally on the coordinator is usually a bad idea anyway, since you can easily run out of space there. Sharding and Citus are all about scalability, and piling data up in one central place is diametrically opposed to that idea, both because of space constraints and because it reintroduces a bottleneck.
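If you want to verify that the shards of the new table really live on the workers, and how big they are, the citus_shards view (available since Citus 10) shows the placement and size of every shard. Again, this is just a sanity-check sketch:

-- sketch: show where every shard is placed and how large it is
-- (assumes the citus_shards view, Citus 10 and later)
SELECT table_name, shardid, nodename, nodeport, shard_size
FROM   citus_shards
ORDER  BY table_name, shardid;

With the data colocated, let's run the join again: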
postgres=# explain analyze SELECT count(*)
        FROM t_customer AS c, t_sales_proper AS s
        WHERE c.id = s.customer_id;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=250.00..250.02 rows=1 width=8) (actual time=98.915..98.915 rows=1 loops=1)
   ->  Custom Scan (Citus Adaptive)  (cost=0.00..0.00 rows=100000 width=8) (actual time=98.907..98.909 rows=32 loops=1)
         Task Count: 32
         Tuple data received from nodes: 256 bytes
         Tasks Shown: One of 32
         ->  Task
               Tuple data received from node: 8 bytes
               Node: host=localhost port=6003 dbname=postgres
               ->  Aggregate  (cost=1851.57..1851.58 rows=1 width=8) (actual time=25.905..25.907 rows=1 loops=1)
                     ->  Hash Join  (cost=858.76..1773.21 rows=31345 width=0) (actual time=8.929..23.526 rows=31345 loops=1)
                           Hash Cond: (s.customer_id = c.id)
                           ->  Seq Scan on t_sales_proper_102142 s  (cost=0.00..483.45 rows=31345 width=4) (actual time=0.007..2.387 rows=31345 loops=1)
                           ->  Hash  (cost=474.45..474.45 rows=30745 width=4) (actual time=8.863..8.863 rows=30745 loops=1)
                                 Buckets: 32768  Batches: 1  Memory Usage: 1337kB
                                 ->  Seq Scan on t_customer_102078 c  (cost=0.00..474.45 rows=30745 width=4) (actual time=0.005..3.701 rows=30745 loops=1)
               Planning Time: 0.138 ms
               Execution Time: 26.038 ms
 Planning Time: 2.678 ms
 Execution Time: 98.964 ms
In reality, the performance difference is often even larger. Still, the point stands: data locality is important and should be taken seriously.
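If you want to confirm that the speedup really comes from colocation, check that both tables ended up in the same colocation group. Tables sharing a colocation ID have their matching shards placed on the same worker, which is exactly what allows each node to join its shards locally:

-- sketch: tables with the same colocationid can be joined shard by shard on the workers
SELECT logicalrelid, colocationid
FROM   pg_dist_partition
WHERE  logicalrelid IN ('t_customer'::regclass, 't_sales_proper'::regclass);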
If you're new to sharding and Citus for PostgreSQL, you might want to check out some of our other blog posts related to scalability.