Scaling PostgreSQL: Modern servers provide more and more CPU cores; 16 cores or more are not uncommon these days. Unfortunately, PostgreSQL cannot scale a single query across more than one core. A lot of effort is already going into this issue.
To address the issue, we've done some experiments to see what can be done to scale things up. At this point, we've written a small module which takes an existing execution plan and does some post-processing of the plan using “custom plan nodes”.
The idea is experimental: what happens if the approach described in this post is applied to a query such as the following?
SELECT ... a LOT of processing going on here ...
FROM   large_table
GROUP BY some_groups
So the goal in our case is really to scale out the processing done in the SELECT clause.
For a start, 100 million rows have been created to do a little testing. 20 million rows are in each partition:
parallel=# SELECT count(*) FROM t_test;
   count
-----------
 100000000
(1 row)

Time: 8715.324 ms
parallel=# \d+ t_test
                         Table "public.t_test"
 Column |       Type       | Modifiers | Storage | Stats target | Description
--------+------------------+-----------+---------+--------------+-------------
 id     | integer          | not null  | plain   |              |
 r      | double precision |           | plain   |              |
Child tables: t_test_1,
              t_test_2,
              t_test_3,
              t_test_4,
              t_test_5
Our prototype module has a simple runtime variable, which allows us to define the number of processes we want to see. Setting parallel.agg_workers to 1 basically means turning the plan-processing code off.
The data in the table is scanned normally. We need around 29 seconds to process the query:
parallel=# SET parallel.agg_workers = 1;
SET
Time: 1.572 ms
parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;
       sum
------------------
 49998665.9549409
 2538.35037390701
(2 rows)

Time: 29477.039 ms
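As an aside, GROUP BY id < 1000 groups on a boolean expression, so exactly two groups come out: rows with an id below 1000 and all the rest. A tiny Python sketch of that grouping logic (miniature invented data, not the 100 million rows used above):

```python
from collections import defaultdict
import random

# Hypothetical miniature version of t_test: (id, r) pairs
random.seed(42)
rows = [(i, random.random()) for i in range(1, 10001)]

# GROUP BY id < 1000 yields exactly two groups: True and False
sums = defaultdict(float)
for id_, r in rows:
    sums[id_ < 1000] += r

print(len(sums))  # two groups, matching the two result rows above
```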
Let us try the same with more cores: 7 cores will be used for worker processes.
parallel=# SET parallel.agg_workers = 7;
SET
Time: 0.324 ms
parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;
       sum
------------------
 49998665.9549346
 2538.35037390701
(2 rows)

Time: 31279.595 ms
The important part here is that more processes are actually a net loss. The reason is simple: the table is scanned by a single process at the moment, so there is a serious bottleneck on the scan side. The aggregation part is simply not complex enough to provide us with a speedup. Remember: all we do is add up some values, which is way too simple. This is nowhere near enough to justify the overhead of moving data around via shared memory queues.
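The tradeoff is easy to model outside the database. The toy cost model below is my own illustration with invented cost constants, not the prototype's code: once the per-row cost of shipping tuples through shared memory queues exceeds the per-row aggregation cost, extra workers cannot win; only an expensive SELECT clause amortizes the transfer.

```python
def parallel_runtime(rows, workers, scan_cost, transfer_cost, agg_cost):
    """Toy cost model: one scan process feeds `workers` aggregation
    processes through shared memory queues. Illustration only."""
    scan = rows * scan_cost                  # the scan stays single-threaded
    ship = rows * transfer_cost if workers > 1 else 0.0
    agg = rows * agg_cost / workers          # only aggregation parallelizes
    return scan + ship + agg

rows = 100_000_000

# Cheap aggregate (a plain sum): shipping tuples costs more than adding them.
cheap_serial = parallel_runtime(rows, 1, 1.0, 0.5, 0.2)
cheap_parallel = parallel_runtime(rows, 7, 1.0, 0.5, 0.2)
print(cheap_parallel > cheap_serial)   # True: workers are a net loss

# Expensive SELECT clause (dozens of aggregates): workers pay off.
heavy_serial = parallel_runtime(rows, 1, 1.0, 0.5, 10.0)
heavy_parallel = parallel_runtime(rows, 7, 1.0, 0.5, 10.0)
print(heavy_parallel < heavy_serial)   # True
```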
Scaling SELECT clauses
However, the situation starts to change significantly if we try a query involving a lot of processing in the SELECT clause. The more work we do in the SELECT clause, the more benefit we can observe.
parallel=# SELECT sum(CASE WHEN r < 0 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.1 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.2 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.3 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.4 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.5 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.6 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.7 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.8 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.9 THEN r ELSE 0 END),
       sum(r) FILTER (WHERE r < 0)
FROM   t_test
GROUP BY id < 1000, id % 5, id % 10;
In this case, we do a lot of aggregation and therefore the overhead of moving tuples around is not as big compared to the rest of the query.
8 CPU cores:  59226.385 ms
1 CPU core:  206667.727 ms
What we see is a nice 3.5 times speedup. Of course, this is still far from linear; linear scalability can hardly be achieved with this approach. However, if there are 50-100 aggregation functions in the SELECT clause (which is far from uncommon), there is clearly a major performance improvement. That tends to get pretty close to "linear" in many cases.
  PID USER  PR NI   VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
 8411 hs    20  0 507012 183284 178560 R 100.0  1.1 23:19.98 postgres
13014 hs    20  0 503632  25400  24212 R  70.1  0.2  0:08.69 postgres
13013 hs    20  0 503632  25440  24244 R  69.8  0.2  0:08.83 postgres
13015 hs    20  0 503632  25476  24276 R  69.4  0.2  0:08.74 postgres
13017 hs    20  0 503632  25320  24132 R  69.4  0.2  0:08.73 postgres
13012 hs    20  0 503632  25460  24272 R  68.1  0.2  0:08.69 postgres
13018 hs    20  0 503632  24572  23384 R  68.1  0.1  0:08.65 postgres
13016 hs    20  0 503632  25468  24276 S  66.4  0.2  0:08.60 postgres
The process dispatching the data is running at 100%, while the worker processes are not able to make full use of the CPU. It seems that some overhead can still be shaved off to provide users with better overall performance. A prototype might be possible fairly soon.
In my previous post I have shared some GPU performance data, which were mostly related to aggregation. Given the information I have found on the pgstrom wiki page, I decided to give joins a try to see how much speed we can gain by offloading some of the work PostgreSQL has to do to the graphics card. Let's see how PostgreSQL does on a GPU with joins.
First I created some test data:
test=# CREATE TABLE t_test AS
       SELECT id, id % 10 AS ten, id % 20 AS twenty
       FROM   generate_series(1, 25000000) AS id
       ORDER BY id;
SELECT 25000000
test=# CREATE TABLE t_join AS
       SELECT *
       FROM   t_test
       ORDER BY random()
       LIMIT 1000000;
SELECT 1000000
test=# SELECT count(*) FROM t_join;
  count
---------
 1000000
(1 row)

test=# SELECT count(*) FROM t_test;
  count
----------
 25000000
(1 row)
25 million rows are joined with 1 million rows, all of which are contained in the original data.
A first test shows a very nice improvement. First two queries on my CPU:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.ten;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=724919.18..724919.28 rows=10 width=4) (actual time=11283.769..11283.772 rows=10 loops=1)
   Group Key: a.ten
   ->  Hash Join (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=340.218..11061.192 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.046..3590.336 rows=25000000 loops=1)
         ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=338.516..338.516 rows=1000000 loops=1)
               Buckets: 131072 Batches: 16 Memory Usage: 3226kB
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.031..142.085 rows=1000000 loops=1)
 Planning time: 0.587 ms
 Execution time: 11284.411 ms
(10 rows)

test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.twenty;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=724919.18..724919.38 rows=20 width=4) (actual time=10991.615..10991.619 rows=20 loops=1)
   Group Key: a.twenty
   ->  Hash Join (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=317.089..10766.779 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.050..3636.188 rows=25000000 loops=1)
         ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=316.321..316.321 rows=1000000 loops=1)
               Buckets: 131072 Batches: 16 Memory Usage: 3226kB
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.026..141.551 rows=1000000 loops=1)
 Planning time: 0.240 ms
 Execution time: 10992.203 ms
(10 rows)
The CPU in use here is an "AMD FX(tm)-8350 Eight-Core Processor" running at 4 GHz.
Let us try the same thing on the GPU:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.ten;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=175445.25..175445.35 rows=10 width=4) (actual time=2556.159..2556.161 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuPreAgg) (cost=27659.66..174601.72 rows=22 width=8) (actual time=2538.558..2556.123 rows=30 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin) (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1392.571..2437.772 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 13, size: 1069.32MB
               ->  Custom Scan (BulkScan) on t_test a (cost=0.00..385137.60 rows=25000160 width=8) (actual time=16.137..1333.313 rows=25000000 loops=1)
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.725 rows=1000000 loops=1)
 Planning time: 1.720 ms
 Execution time: 3264.747 ms
(14 rows)

test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.twenty;
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=175445.25..175445.45 rows=20 width=4) (actual time=2532.113..2532.121 rows=20 loops=1)
   Group Key: a.twenty
   ->  Custom Scan (GpuPreAgg) (cost=27659.66..174601.94 rows=44 width=8) (actual time=2508.512..2531.991 rows=120 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin) (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1373.296..2410.850 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 11, size: 904.81MB
               ->  Custom Scan (BulkScan) on t_test a (cost=0.00..385137.60 rows=25000160 width=8) (actual time=12.961..1308.733 rows=25000000 loops=1)
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.893 rows=1000000 loops=1)
 Planning time: 0.539 ms
 Execution time: 3225.901 ms
(14 rows)
What we see here is a nice improvement. The query is several times faster.
The interesting part is to see, what happens if indexes are added to both sides of the join:
test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX
test=# CREATE INDEX idx_id2 ON t_join (id);
CREATE INDEX
The standard query, shown before, does not show any difference because too much data is needed inside the join. So, the test is repeated with a reasonable filter:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id AND a.id < 100000
       GROUP BY a.ten;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=32541.70..32541.71 rows=1 width=4) (actual time=45.845..45.846 rows=10 loops=1)
   Group Key: a.ten
   ->  Merge Join (cost=1.49..32519.75 rows=4390 width=4) (actual time=0.076..44.423 rows=4137 loops=1)
         Merge Cond: (a.id = b.id)
         ->  Index Scan using idx_id on t_test a (cost=0.44..3722.05 rows=109749 width=8) (actual time=0.025..27.556 rows=99999 loops=1)
               Index Cond: (id < 100000)
         ->  Index Only Scan using idx_id2 on t_join b (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.022..0.931 rows=4138 loops=1)
               Heap Fetches: 0
 Planning time: 0.449 ms
 Execution time: 45.946 ms
(10 rows)
The CPU version is pretty fast. PostgreSQL scans the indexes and then makes use of the sorted input to perform a nice merge join. The aggregation is pretty fast as well.
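The merge join itself is simple once both inputs arrive sorted: advance two cursors in lockstep and emit matches. A minimal sketch of the idea (my own illustration, not PostgreSQL's executor code; it assumes unique ids on both sides, as in the example tables):

```python
def merge_join(left, right):
    """Join two sorted lists of unique ids; return the matching ids."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:        # match: emit and advance both sides
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:       # left side lags behind
            i += 1
        else:                          # right side lags behind
            j += 1
    return out

# t_test-like ids joined against a t_join-like subset
print(merge_join(list(range(1, 11)), [2, 5, 7, 100]))   # [2, 5, 7]
```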
In the GPU enabled case the plan does not look as efficient in the default setting:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id AND a.id < 100000
       GROUP BY a.ten;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=13404.21..13404.22 rows=1 width=4) (actual time=118.154..118.156 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuJoin) (cost=9649.48..13383.34 rows=4174 width=4) (actual time=70.993..117.099 rows=4137 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:1000000 out:4137, 0.41% planned 0.42%), KDS-Hash (size: 5859.39KB planned 8.58MB, nbatches: 1 planned 1)
         Inner Buffer: (8.58MB), DMA nums: 1, size: 8.58MB
         ->  Custom Scan (BulkScan) on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=7.164..24.709 rows=1000000 loops=1)
         ->  Index Scan using idx_id on t_test a (cost=0.44..3542.55 rows=104349 width=8) (actual time=0.018..21.684 rows=99999 loops=1)
               Index Cond: (id < 100000)
 Planning time: 0.858 ms
 Execution time: 553.543 ms
(12 rows)
Even if the query is slower in this case, it should not be a major issue. Once the cost models are adjusted properly and the planner knows how to decide on the right plan, the system should be able to figure out that the CPU version is the faster one. So the loss in speed is really not an issue here. As soon as all the infrastructure is in place, balancing the cost model and adjusting the planner here and there does not seem like a showstopper (at least not given the achievements already made).
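At its core, this is a cost comparison: the planner estimates each candidate plan and picks the cheapest. The deliberately simplified sketch below (my own illustration, using the cost estimates from the two EXPLAIN outputs above) shows exactly the calibration problem described here: the GPU plan is estimated cheaper, yet it ran slower in reality.

```python
def pick_plan(plans):
    """Return the candidate plan with the lowest estimated total cost,
    the way a cost-based planner chooses among alternatives."""
    return min(plans, key=lambda plan: plan[1])

# (plan name, estimated total cost) -- numbers from the EXPLAIN output above
candidates = [
    ("merge join (CPU, index scans)", 32541.71),
    ("GpuJoin + index scan",          13404.22),
]
print(pick_plan(candidates)[0])   # GpuJoin + index scan
```

Since the GPU plan was actually slower, the estimated costs are what needs tuning, not the selection logic.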
Let us just add more tables to the join to see what happens. To make sure that CPU and GPU get a fair chance, I have removed my two indexes again:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b, t_test AS c,
            t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id AND b.id = c.id AND c.id = d.id
             AND d.id = e.id AND e.id = f.id;
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1745125.48..1745125.49 rows=1 width=0) (actual time=34234.209..34234.209 rows=1 loops=1)
   ->  Hash Join (cost=1266218.99..1745121.48 rows=1600 width=0) (actual time=23054.940..34133.602 rows=1000000 loops=1)
         Hash Cond: (e.id = a.id)
         ->  Seq Scan on t_test e (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.056..2728.388 rows=25000000 loops=1)
         ->  Hash (cost=1266198.99..1266198.99 rows=1600 width=20) (actual time=23054.441..23054.441 rows=1000000 loops=1)
               Buckets: 131072 (originally 2048) Batches: 32 (originally 1) Memory Usage: 4205kB
               ->  Hash Join (cost=787296.49..1266198.99 rows=1600 width=20) (actual time=12247.386..22803.559 rows=1000000 loops=1)
                     Hash Cond: (c.id = a.id)
                     ->  Seq Scan on t_test c (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.004..2727.218 rows=25000000 loops=1)
                     ->  Hash (cost=787276.49..787276.49 rows=1600 width=16) (actual time=12246.958..12246.958 rows=1000000 loops=1)
                           Buckets: 131072 (originally 2048) Batches: 16 (originally 1) Memory Usage: 3960kB
                           ->  Hash Join (cost=768104.49..787276.49 rows=1600 width=16) (actual time=11330.994..12044.482 rows=1000000 loops=1)
                                 Hash Cond: (f.id = a.id)
                                 ->  Seq Scan on t_join f (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.030..111.371 rows=1000000 loops=1)
                                 ->  Hash (cost=767604.49..767604.49 rows=40000 width=12) (actual time=11330.687..11330.687 rows=1000000 loops=1)
                                       Buckets: 131072 (originally 65536) Batches: 16 (originally 1) Memory Usage: 3716kB
                                       ->  Hash Join (cost=63626.00..767604.49 rows=40000 width=12) (actual time=620.425..11124.981 rows=1000000 loops=1)
                                             Hash Cond: (a.id = d.id)
                                             ->  Hash Join (cost=31813.00..719920.49 rows=1000000 width=8) (actual time=306.078..10215.246 rows=1000000 loops=1)
                                                   Hash Cond: (a.id = b.id)
                                                   ->  Seq Scan on t_test a (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.002..3430.792 rows=25000000 loops=1)
                                                   ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=305.466..305.466 rows=1000000 loops=1)
                                                         Buckets: 131072 Batches: 16 Memory Usage: 3226kB
                                                         ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.002..137.925 rows=1000000 loops=1)
                                             ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=314.051..314.051 rows=1000000 loops=1)
                                                   Buckets: 131072 Batches: 16 Memory Usage: 3226kB
                                                   ->  Seq Scan on t_join d (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.006..137.796 rows=1000000 loops=1)
 Planning time: 4.833 ms
 Execution time: 34238.964 ms
(29 rows)
As expected the CPU has a hard time joining all the stuff together. It is a pretty nasty join. At the end 34 seconds are needed.
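Every level of the plan above is a hash join: build a hash table over the smaller input, then probe it while scanning the larger one. A compact sketch of one such level (my own illustration with toy data, not the executor's code):

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Classic in-memory hash join; returns the combined matching rows."""
    table = {}
    for row in build_rows:                       # build phase: smaller side
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                       # probe phase: larger side
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

# Toy stand-ins for t_join (small) and t_test (large)
t_join = [{"id": i} for i in (1, 3, 5)]
t_test = [{"id": i, "ten": i % 10} for i in range(1, 7)]
print(len(hash_join(t_join, t_test, "id", "id")))   # 3 matching rows
```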
The GPU can really shine under those circumstances. Expensive joins really seem to be what this is all about after all:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b, t_test AS c,
            t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id AND b.id = c.id AND c.id = d.id
             AND d.id = e.id AND e.id = f.id;
NOTICE: GpuJoin(0x204508860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
NOTICE: GpuJoin(0x204506060) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
NOTICE: GpuJoin(0x204505860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=471554.15..471554.16 rows=1 width=0) (actual time=7270.748..7270.748 rows=1 loops=1)
   ->  Custom Scan (GpuJoin) (cost=334846.07..471550.15 rows=1600 width=0) (actual time=6212.393..7184.973 rows=1000000 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 64.85MB planned 134.86KB, nbatches: 1 planned 1)
         Inner Buffer: (124.00MB), DMA nums: 9, size: 1116.01MB
         ->  Custom Scan (BulkScan) on t_test e (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.106..1103.447 rows=25000000 loops=1)
         ->  Custom Scan (GpuJoin) (cost=192385.58..329089.66 rows=1600 width=20) (actual time=4240.682..5277.841 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 57.22MB planned 119.23KB, nbatches: 1 planned 1)
               Inner Buffer: (62.00MB), DMA nums: 15, size: 930.01MB
               ->  Custom Scan (BulkScan) on t_test c (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.160..1093.463 rows=25000000 loops=1)
               ->  Custom Scan (GpuJoin) (cost=182921.05..186629.17 rows=1600 width=16) (actual time=3320.939..344.434 rows=1000000 loops=1)
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                              Nrows (in:1000000 out:1000000, 100.00% planned 0.16%), KDS-Hash (size: 57.22MB planned 2978.59KB, nbatches: 1 planned 1)
                     Inner Buffer: (62.00MB), DMA nums: 1, size: 62.00MB
                     ->  Custom Scan (BulkScan) on t_join f (cost=0.00..15406.00 rows=1000000 width=4) (actual time=13.220..45.445 rows=1000000 loops=1)
                     ->  Custom Scan (GpuJoin) (cost=41543.18..176974.98 rows=40000 width=12) (actual time=1736.740..2818.735 rows=1000000 loops=1)
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Depth 2: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:1000000 out:1000000, 100.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Inner Buffer: (82.25MB)x(82.25MB), DMA nums: 10, size: 1645.10MB
                           ->  Custom Scan (BulkScan) on t_test a (cost=0.00..385136.36 rows=25000036 width=4) (actual time=16.505..1599.395 rows=25000000 loops=1)
                           ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.022..111.602 rows=1000000 loops=1)
                           ->  Seq Scan on t_join d (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.005..95.175 rows=1000000 loops=1)
 Planning time: 11.106 ms
 Execution time: 8051.245 ms
(31 rows)
Given the number of rows involved, this is really a nice number to see. 8 seconds just rocks and shows the real potential of a GPU.
Of course, you have to keep in mind that I am running arbitrary SQL statements, which are not too realistic in many cases. I have yet to test pgstrom on a real workload involving the more everyday queries you would typically see in PostgreSQL.
Finally I got around to taking a more extensive look at pgstrom (a module to make use of GPUs). The entire GPU thing fascinates me, and I was excited to see the first real performance data.
Here is some simple test data:
test=# CREATE TABLE t_test AS
       SELECT x, 'a'::char(100) AS y, 'b'::char(100) AS z
       FROM   generate_series(1, 5000000) AS x
       ORDER BY random();
SELECT 5000000
5 million rows should be enough to get a first impression of what is going on.
Queries can benefit
To make sure that a real difference can actually be observed, I have decided to use no indexes. In real life this is not too realistic, because performance would suffer horribly. pgstrom has not been made to speed up index lookups anyway, so this should not be an issue.
The first thing I tried was to filter and group some data on the CPU:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate (cost=242892.24..242892.25 rows=1 width=101) (actual time=3965.362..3965.362 rows=1 loops=1)
   Group Key: y
   ->  Seq Scan on t_test (cost=0.00..234558.91 rows=1666665 width=101) (actual time=0.033..1417.593 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.459 ms
 Execution time: 3965.495 ms
My box (4 GHz AMD) can do that in just under 4 seconds. Note that I am using the standard PostgreSQL storage manager here (no column store or anything like that).
Let us try the same thing on the GPU:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate (cost=176230.88..176230.89 rows=1 width=101) (actual time=2004.355..2004.356 rows=1 loops=1)
   Group Key: y
   ->  Custom Scan (GpuPreAgg) (cost=11929.24..171148.30 rows=75 width=108) (actual time=1151.310..2003.868 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test (cost=8929.24..167897.55 rows=5000001 width=101) (actual time=12.956..1152.273 rows=5000000 loops=1)
 Planning time: 0.550 ms
 Execution time: 2299.633 ms
(9 rows)
We see a nice improvement here. The speedup is considerable, especially when taking into consideration that reading the data alone already takes more than a second. Moving work out to the GPU definitely pays off in this case.
The interesting thing to notice is that the real improvement can be seen because of the GROUP BY clause. A normal filter does not show a benefit:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate (cost=238725.58..238725.59 rows=1 width=0) (actual time=1762.829..1762.829 rows=1 loops=1)
   ->  Seq Scan on t_test (cost=0.00..234558.91 rows=1666665 width=0) (actual time=0.055..1311.220 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.074 ms
 Execution time: 1762.875 ms
(5 rows)

test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Aggregate (cost=172064.21..172064.22 rows=1 width=0) (actual time=1411.036..1411.036 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg) (cost=11929.24..171148.30 rows=75 width=4) (actual time=650.590..1410.837 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test (cost=8929.24..167897.55 rows=5000001 width=0) (actual time=16.034..1160.592 rows=5000000 loops=1)
 Planning time: 1.634 ms
 Execution time: 1745.605 ms
(8 rows)
It certainly makes sense that there is no improvement in this case because moving data around is simply too expensive to make a difference. Remember: GPUs only make sense if things can be done in parallel and if data is coming fast enough. sqrt is not complicated enough to justify the effort of moving data around and PostgreSQL cannot provide data fast enough.
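Conceptually, what GpuPreAgg does is two-phase aggregation: each chunk of rows is reduced to a handful of partial per-group states on the device, and only those small partials travel back for the final HashAggregate. A hedged sketch of the idea in plain Python (no GPU involved, toy data of my own):

```python
def preagg(chunk):
    """'Device-side' step: collapse a chunk of (key, value) rows into
    per-key partial sums -- the GpuPreAgg role."""
    partial = {}
    for key, value in chunk:
        partial[key] = partial.get(key, 0) + value
    return partial

def final_agg(partials):
    """'Host-side' step: merge the small partial results -- the final
    HashAggregate role."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total

rows = [(i % 2, 1) for i in range(1000)]              # toy data, two groups
chunks = [rows[i:i + 100] for i in range(0, 1000, 100)]
print(final_agg(preagg(c) for c in chunks))           # {0: 500, 1: 500}
```

Only ten tiny dictionaries cross the "device boundary" here instead of a thousand rows, which is why the trick only pays off when the reduction per chunk is substantial.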
Or queries can be slower
It is important to mention that many queries won't benefit from the GPU at all. In fact, I would expect that the majority of queries in a usual system will not behave differently.
Here is an example of a query, which is actually slower with pgstrom:
test=# explain analyze SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
                                          QUERY PLAN
------------------------------------------------------------------------------------------------
 WindowAgg (cost=8929.24..188730.88 rows=1666667 width=206) (actual time=4449.716..5848.233 rows=5000000 loops=1)
   ->  Custom Scan (GpuScan) on t_test (cost=8929.24..167897.55 rows=1666667 width=206) (actual time=616.879..1899.651 rows=5000000 loops=1)
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.142 ms
 Execution time: 6260.461 ms
(5 rows)

test=# explain analyze SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
                                           QUERY PLAN
--------------------------------------------------------------------------------------------------
 WindowAgg (cost=0.00..255392.23 rows=1666665 width=206) (actual time=3610.914..4895.518 rows=5000000 loops=1)
   ->  Seq Scan on t_test (cost=0.00..234558.91 rows=1666665 width=206) (actual time=0.038..1389.645 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.324 ms
 Execution time: 5187.048 ms
(5 rows)
In this case the GPU seems like a loss - at least there is no benefit to be expected at this stage.
One word about sorts
According to the main developer of pgstrom sorting is not yet as good as he wants it to be, so I skipped the sort part for now. As sorts are key to many queries, there is still pgstrom functionality I am really looking forward to.
I assume that sorts can greatly benefit from a GPU because there is a lot of intrinsic parallelism in a sort algorithm. Therefore sorting on the GPU could be highly beneficial. The speedup we can expect is hard to predict but I firmly believe that it can be quite substantial.
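Sorting indeed decomposes well. In a merge sort, for example, the two halves can be sorted completely independently, and that is the kind of work a GPU can fan out across thousands of threads (real GPU sorts typically use bitonic or radix schemes, but the structural point is the same). A minimal sketch, sequential here, with the parallelizable split marked:

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def merge_sort(data):
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    # The two recursive calls are fully independent -- this is the
    # intrinsic parallelism a GPU implementation could exploit.
    left = merge_sort(data[:mid])
    right = merge_sort(data[mid:])
    return merge(left, right)

print(merge_sort([5, 1, 4, 2, 3]))   # [1, 2, 3, 4, 5]
```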
Stability
What stunned me is that I have not encountered a single segmentation fault during my tests. I definitely did not expect that. My assumption was that there would be more loose ends but actually things worked as expected most of the time - given the stage of the project I am pretty excited. pgstrom certainly feels like the future ...
NVIDIA's CUDA is definitely a great thing and I have to admit that I already got excited years ago when I first learned about it. For many operations a nicely optimized GPU implementation definitely seems the way to go. GPUs are traditionally used for scientific operations and massively parallel tasks. However, some important work is also going into the PGStrom project, which is all about bringing the power of modern GPUs to PostgreSQL: pgstrom documentation
Installing CUDA
At this point, installing CUDA on Linux might be the hardest part of the entire undertaking. The CUDA installer only works nicely when no X server is running. A simple "init 1" should solve this problem, however.
Before you get started with pgstrom, it is usually a good idea to check whether the GPU has been detected properly:
[hs@laura ~]$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
[hs@laura deviceQuery]$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 970"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4095 MBytes (4294246400 bytes)
  (13) Multiprocessors, (128) CUDA Cores/MP:     1664 CUDA Cores
  GPU Max Clock rate:                            1253 MHz (1.25 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 1835008 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GTX 970
Result = PASS
If the test passes, CUDA is ready for PostgreSQL.
Installing PostgreSQL with CUDA
Installing pg_strom for PostgreSQL is basically not hard. Here is how it works:
$ git clone https://github.com/postgres/postgres.git pgsql
$ cd pgsql
$ ./configure --enable-debug --enable-cassert
$ make
$ sudo make install
$ git clone https://github.com/pg-strom/devel pg_strom
$ cd pg_strom
$ which pg_config
/usr/local/pgsql/bin/pg_config
$ make
$ sudo make install
What happened in my case was that I had to uncomment 3 lines in a pg_strom header file because my version of PostgreSQL was a bit more up to date than expected. However, this is nothing major. It is more of a small fix.
Once pg_strom has been added to shared_preload_libraries, the system is already ready for action. In my case starting the database shows the following listing:
[hs@laura ~]$ pg_ctl -D /data/dbstrom/ start
server starting
LOG: CUDA Runtime version: 7.5.0
LOG: NVIDIA driver version: 352.30
LOG: GPU0 GeForce GTX 970 (1664 CUDA cores, 1253MHz), L2 1792KB, RAM 4095MB (256bits, 3505MHz), capability 5.2
LOG: NVRTC - CUDA Runtime Compilation vertion 7.0
LOG: database system shutdown was interrupted; last known up at 2015-08-27 21:05:41 CEST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: invalid record length at 0/A54973F8
LOG: redo is not required
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
The important point here is that during PostgreSQL startup the CUDA device has to show up in the log messages; otherwise there is a problem with the driver.
The beauty is that pgstrom automatically uses GPU code when it seems useful. The user does not have to worry about where the code is actually executed. The optimizer makes those decisions for you automatically.
So far pgstrom seems pretty promising. Of course, it is not ready for production yet, but it is definitely worth investigating further and running more tests.