In my previous post I shared some GPU performance data, mostly related to aggregation. Given the information I found on the pgstrom wiki page, I decided to give joins a try to see how much speed we can gain by offloading some of PostgreSQL's work to the graphics card.

First I created some test data:


test=# CREATE TABLE t_test AS SELECT id, id % 10 AS ten, id % 20 AS twenty
       FROM generate_series(1, 25000000) AS id
       ORDER BY id;
SELECT 25000000
test=# CREATE TABLE t_join AS SELECT *
       FROM t_test
       ORDER BY random()
       LIMIT 1000000;
SELECT 1000000

test=# SELECT count(*) FROM t_join;
  count
---------
 1000000
(1 row)

test=# SELECT count(*) FROM t_test;
  count
----------
 25000000
(1 row)

25 million rows should be joined with 1 million rows, which are a 100% subset of the original data.
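Since t_join was built straight from t_test, every id in t_join must also exist in t_test. A quick anti-join (my own sanity check, not part of the benchmark) confirms the 100% overlap:

```sql
-- count ids in t_join that have no partner in t_test; 0 means full overlap
SELECT count(*)
FROM   t_join b
       LEFT JOIN t_test a ON a.id = b.id
WHERE  a.id IS NULL;
```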

A first test shows a very nice improvement. First, two queries on my CPU:


test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.ten;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=724919.18..724919.28 rows=10 width=4) (actual time=11283.769..11283.772 rows=10 loops=1)
   Group Key: a.ten
   ->  Hash Join  (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=340.218..11061.192 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a  (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.046..3590.336 rows=25000000 loops=1)
         ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=338.516..338.516 rows=1000000 loops=1)
               Buckets: 131072  Batches: 16  Memory Usage: 3226kB
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.031..142.085 rows=1000000 loops=1)
 Planning time: 0.587 ms
 Execution time: 11284.411 ms
(10 rows)

test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.twenty;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=724919.18..724919.38 rows=20 width=4) (actual time=10991.615..10991.619 rows=20 loops=1)
   Group Key: a.twenty
   ->  Hash Join  (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=317.089..10766.779 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a  (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.050..3636.188 rows=25000000 loops=1)
         ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=316.321..316.321 rows=1000000 loops=1)
               Buckets: 131072  Batches: 16  Memory Usage: 3226kB
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.026..141.551 rows=1000000 loops=1)
 Planning time: 0.240 ms
 Execution time: 10992.203 ms
(10 rows)

The CPU in use here is an “AMD FX(tm)-8350 Eight-Core Processor” running at 4 GHz.

Let us try the same thing on the GPU:


test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.ten;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=175445.25..175445.35 rows=10 width=4) (actual time=2556.159..2556.161 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuPreAgg)  (cost=27659.66..174601.72 rows=22 width=8) (actual time=2538.558..2556.123 rows=30 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1392.571..2437.772 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 13, size: 1069.32MB
               ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385137.60 rows=25000160 width=8) (actual time=16.137..1333.313 rows=25000000 loops=1)
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.725 rows=1000000 loops=1)
 Planning time: 1.720 ms
 Execution time: 3264.747 ms
(14 rows)

test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id GROUP BY a.twenty;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=175445.25..175445.45 rows=20 width=4) (actual time=2532.113..2532.121 rows=20 loops=1)
   Group Key: a.twenty
   ->  Custom Scan (GpuPreAgg)  (cost=27659.66..174601.94 rows=44 width=8) (actual time=2508.512..2531.991 rows=120 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin)  (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1373.296..2410.850 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 11, size: 904.81MB
               ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385137.60 rows=25000160 width=8) (actual time=12.961..1308.733 rows=25000000 loops=1)
               ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.893 rows=1000000 loops=1)
 Planning time: 0.539 ms
 Execution time: 3225.901 ms
(14 rows)

What we see here is a nice improvement: the same query runs roughly 3.5 times faster (around 3.3 seconds instead of around 11 seconds).
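For comparisons like this it is handy that pgstrom can be switched off per session, so both plans can be produced on the same installation. A minimal sketch, assuming the pg_strom.enabled GUC is available in your build:

```sql
SET pg_strom.enabled = off;  -- planner falls back to the regular CPU plan
EXPLAIN ANALYZE SELECT count(*) FROM t_test AS a, t_join AS b
WHERE a.id = b.id GROUP BY a.ten;

SET pg_strom.enabled = on;   -- GPU custom scans are considered again
EXPLAIN ANALYZE SELECT count(*) FROM t_test AS a, t_join AS b
WHERE a.id = b.id GROUP BY a.ten;
```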

Making use of indexes

The interesting part is to see what happens when indexes are added to both sides of the join:


test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX
test=# CREATE INDEX idx_id2 ON t_join (id);
CREATE INDEX

The standard query shown before does not show any difference, because too much data is needed inside the join. So the test is repeated with a reasonable filter:


test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
                                                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=32541.70..32541.71 rows=1 width=4) (actual time=45.845..45.846 rows=10 loops=1)
   Group Key: a.ten
   ->  Merge Join  (cost=1.49..32519.75 rows=4390 width=4) (actual time=0.076..44.423 rows=4137 loops=1)
         Merge Cond: (a.id = b.id)
         ->  Index Scan using idx_id on t_test a  (cost=0.44..3722.05 rows=109749 width=8) (actual time=0.025..27.556 rows=99999 loops=1)
               Index Cond: (id < 100000)
         ->  Index Only Scan using idx_id2 on t_join b  (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.022..0.931 rows=4138 loops=1)
               Heap Fetches: 0
 Planning time: 0.449 ms
 Execution time: 45.946 ms
(10 rows)

The CPU version is pretty fast. PostgreSQL scans both indexes and uses the sorted input to perform a nice merge join. The aggregation is pretty fast as well.

In the GPU-enabled case the plan does not look as efficient with the default settings:


test=# explain analyze SELECT count(*) FROM t_test AS a, t_join AS b WHERE a.id = b.id AND a.id < 100000 GROUP BY a.ten;
                                                                QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=13404.21..13404.22 rows=1 width=4) (actual time=118.154..118.156 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuJoin)  (cost=9649.48..13383.34 rows=4174 width=4) (actual time=70.993..117.099 rows=4137 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:1000000 out:4137, 0.41% planned 0.42%), KDS-Hash (size: 5859.39KB planned 8.58MB, nbatches: 1 planned 1)
         Inner Buffer: (8.58MB), DMA nums: 1, size: 8.58MB
         ->  Custom Scan (BulkScan) on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=7.164..24.709 rows=1000000 loops=1)
         ->  Index Scan using idx_id on t_test a  (cost=0.44..3542.55 rows=104349 width=8) (actual time=0.018..21.684 rows=99999 loops=1)
               Index Cond: (id < 100000)
 Planning time: 0.858 ms
 Execution time: 553.543 ms
(12 rows)

Even though the query is slower in this case, it should not be a major issue. If the cost models are adjusted properly and the planner is taught how to decide on the right plan, the system should be able to figure out that the CPU version is the faster one here. So the loss in speed is really not an issue. Once all the infrastructure is in place, balancing the cost model and adjusting the planner here and there does not seem like a showstopper (at least not given the achievements already made).
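The knobs for such balancing already exist: pgstrom exposes its own planner cost parameters, which can be raised to make GPU plans less attractive for cheap, index-friendly queries. A sketch, assuming parameters named pg_strom.gpu_setup_cost and pg_strom.gpu_operator_cost (the exact names and defaults may differ between versions):

```sql
-- make the planner charge more for GPU setup and per-tuple GPU work,
-- so small, selective queries stay on the CPU
SET pg_strom.gpu_setup_cost = 100000;
SET pg_strom.gpu_operator_cost = 0.01;
```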

Joining more tables

Let us just add more tables to the join and see what happens. To make sure that CPU and GPU get a fair chance, I have removed my two indexes again:
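Removing them is straightforward:

```sql
DROP INDEX idx_id;   -- index on t_test (id)
DROP INDEX idx_id2;  -- index on t_join (id)
```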


test=# explain analyze SELECT count(*)
       FROM   t_test AS a, t_join AS b, t_test AS c,
              t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id
         AND b.id = c.id
         AND c.id = d.id
         AND d.id = e.id
         AND e.id = f.id;
                                                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1745125.48..1745125.49 rows=1 width=0) (actual time=34234.209..34234.209 rows=1 loops=1)
   ->  Hash Join  (cost=1266218.99..1745121.48 rows=1600 width=0) (actual time=23054.940..34133.602 rows=1000000 loops=1)
         Hash Cond: (e.id = a.id)
         ->  Seq Scan on t_test e  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.056..2728.388 rows=25000000 loops=1)
         ->  Hash  (cost=1266198.99..1266198.99 rows=1600 width=20) (actual time=23054.441..23054.441 rows=1000000 loops=1)
               Buckets: 131072 (originally 2048)  Batches: 32 (originally 1)  Memory Usage: 4205kB
               ->  Hash Join  (cost=787296.49..1266198.99 rows=1600 width=20) (actual time=12247.386..22803.559 rows=1000000 loops=1)
                     Hash Cond: (c.id = a.id)
                     ->  Seq Scan on t_test c  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.004..2727.218 rows=25000000 loops=1)
                     ->  Hash  (cost=787276.49..787276.49 rows=1600 width=16) (actual time=12246.958..12246.958 rows=1000000 loops=1)
                           Buckets: 131072 (originally 2048)  Batches: 16 (originally 1)  Memory Usage: 3960kB
                           ->  Hash Join  (cost=768104.49..787276.49 rows=1600 width=16) (actual time=11330.994..12044.482 rows=1000000 loops=1)
                                 Hash Cond: (f.id = a.id)
                                 ->  Seq Scan on t_join f  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.030..111.371 rows=1000000 loops=1)
                                 ->  Hash  (cost=767604.49..767604.49 rows=40000 width=12) (actual time=11330.687..11330.687 rows=1000000 loops=1)
                                       Buckets: 131072 (originally 65536)  Batches: 16 (originally 1)  Memory Usage: 3716kB
                                       ->  Hash Join  (cost=63626.00..767604.49 rows=40000 width=12) (actual time=620.425..11124.981 rows=1000000 loops=1)
                                             Hash Cond: (a.id = d.id)
                                             ->  Hash Join  (cost=31813.00..719920.49 rows=1000000 width=8) (actual time=306.078..10215.246 rows=1000000 loops=1)
                                                   Hash Cond: (a.id = b.id)
                                                   ->  Seq Scan on t_test a  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.002..3430.792 rows=25000000 loops=1)
                                                   ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=305.466..305.466 rows=1000000 loops=1)
                                                         Buckets: 131072  Batches: 16  Memory Usage: 3226kB
                                                         ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.002..137.925 rows=1000000 loops=1)
                                             ->  Hash  (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=314.051..314.051 rows=1000000 loops=1)
                                                   Buckets: 131072  Batches: 16  Memory Usage: 3226kB
                                                   ->  Seq Scan on t_join d  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.006..137.796 rows=1000000 loops=1)
 Planning time: 4.833 ms
 Execution time: 34238.964 ms
(29 rows)

As expected, the CPU has a hard time joining all this together; it is a pretty nasty join. In the end, 34 seconds are needed.

The GPU can really shine under these circumstances. Expensive joins really seem to be what this is all about:


test=# explain analyze SELECT count(*)
       FROM   t_test AS a, t_join AS b, t_test AS c,
              t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id
         AND b.id = c.id
         AND c.id = d.id
         AND d.id = e.id
         AND e.id = f.id;
NOTICE:  GpuJoin(0x204508860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
NOTICE:  GpuJoin(0x204506060) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
NOTICE:  GpuJoin(0x204505860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=471554.15..471554.16 rows=1 width=0) (actual time=7270.748..7270.748 rows=1 loops=1)
   ->  Custom Scan (GpuJoin)  (cost=334846.07..471550.15 rows=1600 width=0) (actual time=6212.393..7184.973 rows=1000000 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 64.85MB planned 134.86KB, nbatches: 1 planned 1)
         Inner Buffer: (124.00MB), DMA nums: 9, size: 1116.01MB
         ->  Custom Scan (BulkScan) on t_test e  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.106..1103.447 rows=25000000 loops=1)
         ->  Custom Scan (GpuJoin)  (cost=192385.58..329089.66 rows=1600 width=20) (actual time=4240.682..5277.841 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 57.22MB planned 119.23KB, nbatches: 1 planned 1)
               Inner Buffer: (62.00MB), DMA nums: 15, size: 930.01MB
               ->  Custom Scan (BulkScan) on t_test c  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.160..1093.463 rows=25000000 loops=1)
               ->  Custom Scan (GpuJoin)  (cost=182921.05..186629.17 rows=1600 width=16) (actual time=3320.939..3448.434 rows=1000000 loops=1)
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                              Nrows (in:1000000 out:1000000, 100.00% planned 0.16%), KDS-Hash (size: 57.22MB planned 2978.59KB, nbatches: 1 planned 1)
                     Inner Buffer: (62.00MB), DMA nums: 1, size: 62.00MB
                     ->  Custom Scan (BulkScan) on t_join f  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=13.220..45.445 rows=1000000 loops=1)
                     ->  Custom Scan (GpuJoin)  (cost=41543.18..176974.98 rows=40000 width=12) (actual time=1736.740..2818.735 rows=1000000 loops=1)
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Depth 2: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:1000000 out:1000000, 100.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Inner Buffer: (82.25MB)x(82.25MB), DMA nums: 10, size: 1645.10MB
                           ->  Custom Scan (BulkScan) on t_test a  (cost=0.00..385136.36 rows=25000036 width=4) (actual time=16.505..1599.395 rows=25000000 loops=1)
                           ->  Seq Scan on t_join b  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.022..111.602 rows=1000000 loops=1)
                           ->  Seq Scan on t_join d  (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.005..95.175 rows=1000000 loops=1)
 Planning time: 11.106 ms
 Execution time: 8051.245 ms
(31 rows)

Given the number of rows involved, this is really a nice number to see. 8 seconds just rocks and shows the real potential of a GPU.

Of course, you have to keep in mind that I am running arbitrary SQL statements, which are not too realistic in many cases. I have yet to test pgstrom on a real workload involving the more everyday queries you would typically see in PostgreSQL.