Scaling PostgreSQL: Modern servers provide more and more CPU cores; 16 cores or more are not uncommon these days. Unfortunately, PostgreSQL cannot scale a single query across more than one core. A lot of effort is already going into this issue.
To address the issue, we've done some experiments to see what can be done to scale things up. At this point, we've written a small module which takes an existing execution plan and does some post-processing of the plan using “custom plan nodes”.
The idea is experimental: what happens if the approach described in this post is applied to a query such as the following?
SELECT ... a LOT of processing going on here ...
FROM   large_table
GROUP BY some_groups
So the goal in our case is really to scale out the processing done in the SELECT clause.
For a start, 100 million rows have been created to do a little testing. 20 million rows are in each partition:
parallel=# SELECT count(*) FROM t_test;
   count
-----------
 100000000
(1 row)

Time: 8715.324 ms
parallel=# \d+ t_test
                         Table "public.t_test"
 Column |       Type       | Modifiers | Storage | Stats target | Description
--------+------------------+-----------+---------+--------------+-------------
 id     | integer          | not null  | plain   |              |
 r      | double precision |           | plain   |              |
Child tables: t_test_1,
              t_test_2,
              t_test_3,
              t_test_4,
              t_test_5
Our prototype module has a simple runtime variable, which allows us to define the number of processes we want to see. Setting parallel.agg_workers to 1 basically means turning the plan-processing code off.
The data in the table is scanned normally. We need around 29 seconds to process the query:
parallel=# SET parallel.agg_workers = 1;
SET
Time: 1.572 ms
parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;
       sum
------------------
 49998665.9549409
 2538.35037390701
(2 rows)

Time: 29477.039 ms
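As an aside, GROUP BY id < 1000 groups on a boolean expression, so exactly two groups come out: rows with an id below 1000 and all the rest. A tiny Python sketch of that grouping logic (miniature invented data, not the 100 million rows used above):

```python
from collections import defaultdict
import random

# Hypothetical miniature version of t_test: (id, r) pairs
random.seed(42)
rows = [(i, random.random()) for i in range(1, 10001)]

# GROUP BY id < 1000 yields exactly two groups: True and False
sums = defaultdict(float)
for id_, r in rows:
    sums[id_ < 1000] += r

print(len(sums))  # two groups, matching the two result rows above
```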
Let us try the same with more cores: 7 cores will be used for worker processes.
parallel=# SET parallel.agg_workers = 7;
SET
Time: 0.324 ms
parallel=# SELECT sum(r) FROM t_test GROUP BY id < 1000;
       sum
------------------
 49998665.9549346
 2538.35037390701
(2 rows)

Time: 31279.595 ms
The important part here is that more processes are actually a net loss. The reason is simple: the table is scanned by a single process at the moment, so there is a serious bottleneck on the scan side. The aggregation part is simply not complex enough to provide us with a speedup. Remember: all we do is add up some values, which is way too simple. This is nowhere near enough to justify the overhead of moving data around via shared memory queues.
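The tradeoff is easy to model outside the database. The toy cost model below is my own illustration with invented cost constants, not the prototype's code: once the per-row cost of shipping tuples through shared memory queues exceeds the per-row aggregation cost, extra workers cannot win; only an expensive SELECT clause amortizes the transfer.

```python
def parallel_runtime(rows, workers, scan_cost, transfer_cost, agg_cost):
    """Toy cost model: one scan process feeds `workers` aggregation
    processes through shared memory queues. Illustration only."""
    scan = rows * scan_cost                  # the scan stays single-threaded
    ship = rows * transfer_cost if workers > 1 else 0.0
    agg = rows * agg_cost / workers          # only aggregation parallelizes
    return scan + ship + agg

rows = 100_000_000

# Cheap aggregate (a plain sum): shipping tuples costs more than adding them.
cheap_serial = parallel_runtime(rows, 1, 1.0, 0.5, 0.2)
cheap_parallel = parallel_runtime(rows, 7, 1.0, 0.5, 0.2)
print(cheap_parallel > cheap_serial)   # True: workers are a net loss

# Expensive SELECT clause (dozens of aggregates): workers pay off.
heavy_serial = parallel_runtime(rows, 1, 1.0, 0.5, 10.0)
heavy_parallel = parallel_runtime(rows, 7, 1.0, 0.5, 10.0)
print(heavy_parallel < heavy_serial)   # True
```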
Scaling SELECT clauses
However, the situation starts to change significantly if we try a query involving a lot of processing in the SELECT clause. The more work we do in the SELECT clause, the more benefit we can observe.
parallel=# SELECT sum(CASE WHEN r < 0 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.1 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.2 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.3 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.4 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.5 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.6 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.7 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.8 THEN r ELSE 0 END),
       sum(CASE WHEN r > 0.9 THEN r ELSE 0 END),
       sum(r) FILTER (WHERE r < 0)
FROM   t_test
GROUP BY id < 1000, id % 5, id % 10;
In this case, we do a lot of aggregation and therefore the overhead of moving tuples around is not as big compared to the rest of the query.
8 CPU cores:  59226.385 ms
1 CPU core:  206667.727 ms
What we see is a nice 3.5 times speedup. Of course, this is still far from linear; linear scalability can hardly be achieved with this approach. However, if there are 50-100 aggregation functions in the SELECT clause (which is far from uncommon), there is clearly a major performance improvement. That tends to get pretty close to "linear" in many cases.
  PID USER  PR NI   VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
 8411 hs    20  0 507012 183284 178560 R 100.0  1.1 23:19.98 postgres
13014 hs    20  0 503632  25400  24212 R  70.1  0.2  0:08.69 postgres
13013 hs    20  0 503632  25440  24244 R  69.8  0.2  0:08.83 postgres
13015 hs    20  0 503632  25476  24276 R  69.4  0.2  0:08.74 postgres
13017 hs    20  0 503632  25320  24132 R  69.4  0.2  0:08.73 postgres
13012 hs    20  0 503632  25460  24272 R  68.1  0.2  0:08.69 postgres
13018 hs    20  0 503632  24572  23384 R  68.1  0.1  0:08.65 postgres
13016 hs    20  0 503632  25468  24276 S  66.4  0.2  0:08.60 postgres
The process dispatching the data is running at 100%, while the worker processes are not able to make full use of the CPU. It seems that some overhead can still be shaved off to provide users with better overall performance. A prototype might be possible fairly soon.
In my previous post I have shared some GPU performance data, which were mostly related to aggregation. Given the information I have found on the pgstrom wiki page, I decided to give joins a try to see how much speed we can gain by offloading some of the work PostgreSQL has to do to the graphics card. Let's see how PostgreSQL does on a GPU with joins.
First I created some test data:
test=# CREATE TABLE t_test AS
       SELECT id, id % 10 AS ten, id % 20 AS twenty
       FROM   generate_series(1, 25000000) AS id
       ORDER BY id;
SELECT 25000000
test=# CREATE TABLE t_join AS
       SELECT *
       FROM   t_test
       ORDER BY random()
       LIMIT 1000000;
SELECT 1000000
test=# SELECT count(*) FROM t_join;
  count
---------
 1000000
(1 row)

test=# SELECT count(*) FROM t_test;
  count
----------
 25000000
(1 row)
25 million rows are joined with 1 million rows, all of which are contained in the original data.
A first test shows a very nice improvement. First two queries on my CPU:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.ten;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=724919.18..724919.28 rows=10 width=4) (actual time=11283.769..11283.772 rows=10 loops=1)
   Group Key: a.ten
   ->  Hash Join (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=340.218..11061.192 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.046..3590.336 rows=25000000 loops=1)
         ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=338.516..338.516 rows=1000000 loops=1)
               Buckets: 131072 Batches: 16 Memory Usage: 3226kB
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.031..142.085 rows=1000000 loops=1)
 Planning time: 0.587 ms
 Execution time: 11284.411 ms
(10 rows)

test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.twenty;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=724919.18..724919.38 rows=20 width=4) (actual time=10991.615..10991.619 rows=20 loops=1)
   Group Key: a.twenty
   ->  Hash Join (cost=31813.00..719919.18 rows=1000000 width=4) (actual time=317.089..10766.779 rows=1000000 loops=1)
         Hash Cond: (a.id = b.id)
         ->  Seq Scan on t_test a (cost=0.00..385135.40 rows=24999940 width=8) (actual time=0.050..3636.188 rows=25000000 loops=1)
         ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=316.321..316.321 rows=1000000 loops=1)
               Buckets: 131072 Batches: 16 Memory Usage: 3226kB
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.026..141.551 rows=1000000 loops=1)
 Planning time: 0.240 ms
 Execution time: 10992.203 ms
(10 rows)
The CPU in use here is an "AMD FX(tm)-8350 Eight-Core Processor" running at 4 GHz.
Let us try the same thing on the GPU:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.ten;
                                                         QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=175445.25..175445.35 rows=10 width=4) (actual time=2556.159..2556.161 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuPreAgg) (cost=27659.66..174601.72 rows=22 width=8) (actual time=2538.558..2556.123 rows=30 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin) (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1392.571..2437.772 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 13, size: 1069.32MB
               ->  Custom Scan (BulkScan) on t_test a (cost=0.00..385137.60 rows=25000160 width=8) (actual time=16.137..1333.313 rows=25000000 loops=1)
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.725 rows=1000000 loops=1)
 Planning time: 1.720 ms
 Execution time: 3264.747 ms
(14 rows)

test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id
       GROUP BY a.twenty;
                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=175445.25..175445.45 rows=20 width=4) (actual time=2532.113..2532.121 rows=20 loops=1)
   Group Key: a.twenty
   ->  Custom Scan (GpuPreAgg) (cost=27659.66..174601.94 rows=44 width=8) (actual time=2508.512..2531.991 rows=120 loops=1)
         Bulkload: Off
         Reduction: Local + Global
         ->  Custom Scan (GpuJoin) (cost=23659.66..170445.25 rows=1000000 width=4) (actual time=1373.296..2410.850 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
               Inner Buffer: (82.25MB), DMA nums: 11, size: 904.81MB
               ->  Custom Scan (BulkScan) on t_test a (cost=0.00..385137.60 rows=25000160 width=8) (actual time=12.961..1308.733 rows=25000000 loops=1)
               ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.018..109.893 rows=1000000 loops=1)
 Planning time: 0.539 ms
 Execution time: 3225.901 ms
(14 rows)
What we see here is a nice improvement. The query is several times faster.
The interesting part is to see, what happens if indexes are added to both sides of the join:
test=# CREATE INDEX idx_id ON t_test (id);
CREATE INDEX
test=# CREATE INDEX idx_id2 ON t_join (id);
CREATE INDEX
The standard query, shown before, does not show any difference because too much data is needed inside the join. So, the test is repeated with a reasonable filter:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id AND a.id < 100000
       GROUP BY a.ten;
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=32541.70..32541.71 rows=1 width=4) (actual time=45.845..45.846 rows=10 loops=1)
   Group Key: a.ten
   ->  Merge Join (cost=1.49..32519.75 rows=4390 width=4) (actual time=0.076..44.423 rows=4137 loops=1)
         Merge Cond: (a.id = b.id)
         ->  Index Scan using idx_id on t_test a (cost=0.44..3722.05 rows=109749 width=8) (actual time=0.025..27.556 rows=99999 loops=1)
               Index Cond: (id < 100000)
         ->  Index Only Scan using idx_id2 on t_join b (cost=0.42..25980.42 rows=1000000 width=4) (actual time=0.022..0.931 rows=4138 loops=1)
               Heap Fetches: 0
 Planning time: 0.449 ms
 Execution time: 45.946 ms
(10 rows)
The CPU version is pretty fast. PostgreSQL scans the indexes and then makes use of the sorted input to perform a nice merge join. The aggregation is pretty fast as well.
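The merge join itself is simple once both inputs arrive sorted: advance two cursors in lockstep and emit matches. A minimal sketch of the idea (my own illustration, not PostgreSQL's executor code; it assumes unique ids on both sides, as in the example tables):

```python
def merge_join(left, right):
    """Join two sorted lists of unique ids; return the matching ids."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:        # match: emit and advance both sides
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:       # left side lags behind
            i += 1
        else:                          # right side lags behind
            j += 1
    return out

# t_test-like ids joined against a t_join-like subset
print(merge_join(list(range(1, 11)), [2, 5, 7, 100]))   # [2, 5, 7]
```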
In the GPU enabled case the plan does not look as efficient in the default setting:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b
       WHERE a.id = b.id AND a.id < 100000
       GROUP BY a.ten;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate (cost=13404.21..13404.22 rows=1 width=4) (actual time=118.154..118.156 rows=10 loops=1)
   Group Key: a.ten
   ->  Custom Scan (GpuJoin) (cost=9649.48..13383.34 rows=4174 width=4) (actual time=70.993..117.099 rows=4137 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:1000000 out:4137, 0.41% planned 0.42%), KDS-Hash (size: 5859.39KB planned 8.58MB, nbatches: 1 planned 1)
         Inner Buffer: (8.58MB), DMA nums: 1, size: 8.58MB
         ->  Custom Scan (BulkScan) on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=7.164..24.709 rows=1000000 loops=1)
         ->  Index Scan using idx_id on t_test a (cost=0.44..3542.55 rows=104349 width=8) (actual time=0.018..21.684 rows=99999 loops=1)
               Index Cond: (id < 100000)
 Planning time: 0.858 ms
 Execution time: 553.543 ms
(12 rows)
Even if the query is slower in this case, it should not be a major issue. Once the cost models are adjusted properly and the planner knows how to decide on the right plan, the system should be able to figure out that the CPU version is the faster one. So the loss in speed is really not an issue here. As soon as all the infrastructure is in place, balancing the cost model and adjusting the planner here and there does not seem like a showstopper (at least not given the achievements already made).
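At its core, this is a cost comparison: the planner estimates each candidate plan and picks the cheapest. The deliberately simplified sketch below (my own illustration, using the cost estimates from the two EXPLAIN outputs above) shows exactly the calibration problem described here: the GPU plan is estimated cheaper, yet it ran slower in reality.

```python
def pick_plan(plans):
    """Return the candidate plan with the lowest estimated total cost,
    the way a cost-based planner chooses among alternatives."""
    return min(plans, key=lambda plan: plan[1])

# (plan name, estimated total cost) -- numbers from the EXPLAIN output above
candidates = [
    ("merge join (CPU, index scans)", 32541.71),
    ("GpuJoin + index scan",          13404.22),
]
print(pick_plan(candidates)[0])   # GpuJoin + index scan
```

Since the GPU plan was actually slower, the estimated costs are what needs tuning, not the selection logic.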
Let us just add more tables to the join to see what happens. To make sure that CPU and GPU get a fair chance, I have removed my two indexes again:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b, t_test AS c,
            t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id AND b.id = c.id AND c.id = d.id
             AND d.id = e.id AND e.id = f.id;
                                                         QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Aggregate (cost=1745125.48..1745125.49 rows=1 width=0) (actual time=34234.209..34234.209 rows=1 loops=1)
   ->  Hash Join (cost=1266218.99..1745121.48 rows=1600 width=0) (actual time=23054.940..34133.602 rows=1000000 loops=1)
         Hash Cond: (e.id = a.id)
         ->  Seq Scan on t_test e (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.056..2728.388 rows=25000000 loops=1)
         ->  Hash (cost=1266198.99..1266198.99 rows=1600 width=20) (actual time=23054.441..23054.441 rows=1000000 loops=1)
               Buckets: 131072 (originally 2048) Batches: 32 (originally 1) Memory Usage: 4205kB
               ->  Hash Join (cost=787296.49..1266198.99 rows=1600 width=20) (actual time=12247.386..22803.559 rows=1000000 loops=1)
                     Hash Cond: (c.id = a.id)
                     ->  Seq Scan on t_test c (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.004..2727.218 rows=25000000 loops=1)
                     ->  Hash (cost=787276.49..787276.49 rows=1600 width=16) (actual time=12246.958..12246.958 rows=1000000 loops=1)
                           Buckets: 131072 (originally 2048) Batches: 16 (originally 1) Memory Usage: 3960kB
                           ->  Hash Join (cost=768104.49..787276.49 rows=1600 width=16) (actual time=11330.994..12044.482 rows=1000000 loops=1)
                                 Hash Cond: (f.id = a.id)
                                 ->  Seq Scan on t_join f (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.030..111.371 rows=1000000 loops=1)
                                 ->  Hash (cost=767604.49..767604.49 rows=40000 width=12) (actual time=11330.687..11330.687 rows=1000000 loops=1)
                                       Buckets: 131072 (originally 65536) Batches: 16 (originally 1) Memory Usage: 3716kB
                                       ->  Hash Join (cost=63626.00..767604.49 rows=40000 width=12) (actual time=620.425..11124.981 rows=1000000 loops=1)
                                             Hash Cond: (a.id = d.id)
                                             ->  Hash Join (cost=31813.00..719920.49 rows=1000000 width=8) (actual time=306.078..10215.246 rows=1000000 loops=1)
                                                   Hash Cond: (a.id = b.id)
                                                   ->  Seq Scan on t_test a (cost=0.00..385136.36 rows=25000036 width=4) (actual time=0.002..3430.792 rows=25000000 loops=1)
                                                   ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=305.466..305.466 rows=1000000 loops=1)
                                                         Buckets: 131072 Batches: 16 Memory Usage: 3226kB
                                                         ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.002..137.925 rows=1000000 loops=1)
                                             ->  Hash (cost=15406.00..15406.00 rows=1000000 width=4) (actual time=314.051..314.051 rows=1000000 loops=1)
                                                   Buckets: 131072 Batches: 16 Memory Usage: 3226kB
                                                   ->  Seq Scan on t_join d (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.006..137.796 rows=1000000 loops=1)
 Planning time: 4.833 ms
 Execution time: 34238.964 ms
(29 rows)
As expected the CPU has a hard time joining all the stuff together. It is a pretty nasty join. At the end 34 seconds are needed.
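Every level of the plan above is a hash join: build a hash table over the smaller input, then probe it while scanning the larger one. A compact sketch of one such level (my own illustration with toy data, not the executor's code):

```python
def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Classic in-memory hash join; returns the combined matching rows."""
    table = {}
    for row in build_rows:                       # build phase: smaller side
        table.setdefault(row[build_key], []).append(row)
    out = []
    for row in probe_rows:                       # probe phase: larger side
        for match in table.get(row[probe_key], []):
            out.append({**match, **row})
    return out

# Toy stand-ins for t_join (small) and t_test (large)
t_join = [{"id": i} for i in (1, 3, 5)]
t_test = [{"id": i, "ten": i % 10} for i in range(1, 7)]
print(len(hash_join(t_join, t_test, "id", "id")))   # 3 matching rows
```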
The GPU can really shine under those circumstances. Expensive joins really seem to be what this is all about after all:
test=# explain analyze SELECT count(*)
       FROM t_test AS a, t_join AS b, t_test AS c,
            t_join AS d, t_test AS e, t_join AS f
       WHERE a.id = b.id AND b.id = c.id AND c.id = d.id
             AND d.id = e.id AND e.id = f.id;
NOTICE: GpuJoin(0x204508860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
NOTICE: GpuJoin(0x204506060) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
NOTICE: GpuJoin(0x204505860) DataStoreNoSpace retry=1 [0.00%] src_nitems: 312465 max_space: 390581=>781162 nrooms: 101578=>390581 Nthreads: (312465, 499=>312465) inners: (0, 1000000) results: [312465, 312465]
                                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate (cost=471554.15..471554.16 rows=1 width=0) (actual time=7270.748..7270.748 rows=1 loops=1)
   ->  Custom Scan (GpuJoin) (cost=334846.07..471550.15 rows=1600 width=0) (actual time=6212.393..7184.973 rows=1000000 loops=1)
         Bulkload: On (density: 100.00%)
         Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                  Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 64.85MB planned 134.86KB, nbatches: 1 planned 1)
         Inner Buffer: (124.00MB), DMA nums: 9, size: 1116.01MB
         ->  Custom Scan (BulkScan) on t_test e (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.106..1103.447 rows=25000000 loops=1)
         ->  Custom Scan (GpuJoin) (cost=192385.58..329089.66 rows=1600 width=20) (actual time=4240.682..5277.841 rows=1000000 loops=1)
               Bulkload: On (density: 100.00%)
               Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                        Nrows (in:25000000 out:1000000, 4.00% planned 0.01%), KDS-Hash (size: 57.22MB planned 119.23KB, nbatches: 1 planned 1)
               Inner Buffer: (62.00MB), DMA nums: 15, size: 930.01MB
               ->  Custom Scan (BulkScan) on t_test c (cost=0.00..385136.36 rows=25000036 width=4) (actual time=13.160..1093.463 rows=25000000 loops=1)
               ->  Custom Scan (GpuJoin) (cost=182921.05..186629.17 rows=1600 width=16) (actual time=3320.939..344.434 rows=1000000 loops=1)
                     Bulkload: On (density: 100.00%)
                     Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                              Nrows (in:1000000 out:1000000, 100.00% planned 0.16%), KDS-Hash (size: 57.22MB planned 2978.59KB, nbatches: 1 planned 1)
                     Inner Buffer: (62.00MB), DMA nums: 1, size: 62.00MB
                     ->  Custom Scan (BulkScan) on t_join f (cost=0.00..15406.00 rows=1000000 width=4) (actual time=13.220..45.445 rows=1000000 loops=1)
                     ->  Custom Scan (GpuJoin) (cost=41543.18..176974.98 rows=40000 width=12) (actual time=1736.740..2818.735 rows=1000000 loops=1)
                           Bulkload: On (density: 100.00%)
                           Depth 1: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:25000000 out:1000000, 4.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Depth 2: GpuHashJoin, HashKeys: (id), JoinQual: (id = id)
                                    Nrows (in:1000000 out:1000000, 100.00% planned 4.00%), KDS-Hash (size: 57.22MB planned 82.25MB, nbatches: 1 planned 1)
                           Inner Buffer: (82.25MB)x(82.25MB), DMA nums: 10, size: 1645.10MB
                           ->  Custom Scan (BulkScan) on t_test a (cost=0.00..385136.36 rows=25000036 width=4) (actual time=16.505..1599.395 rows=25000000 loops=1)
                           ->  Seq Scan on t_join b (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.022..111.602 rows=1000000 loops=1)
                           ->  Seq Scan on t_join d (cost=0.00..15406.00 rows=1000000 width=4) (actual time=0.005..95.175 rows=1000000 loops=1)
 Planning time: 11.106 ms
 Execution time: 8051.245 ms
(31 rows)
Given the number of rows involved, this is really a nice number to see. 8 seconds just rocks and shows the real potential of a GPU.
Of course, you have to keep in mind that I am running arbitrary SQL statements, which are not too realistic in many cases. I have yet to test pgstrom on a real workload involving the more everyday queries you would typically see in PostgreSQL.
Finally I got around to taking a more extensive look at pgstrom (a module to make use of GPUs). The entire GPU thing fascinates me, and I was excited to see the first real performance data.
Here is some simple test data:
test=# CREATE TABLE t_test AS
       SELECT x, 'a'::char(100) AS y, 'b'::char(100) AS z
       FROM   generate_series(1, 5000000) AS x
       ORDER BY random();
SELECT 5000000
5 million rows should be enough to get a first impression of what is going on.
Queries can benefit
To make sure that a real difference can actually be observed, I have decided to use no indexes. In real life this is not too realistic, because performance would suffer horribly. pgstrom has not been made to speed up index lookups anyway, so this should not be an issue.
The first thing I tried was to filter and group some data on the CPU:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate (cost=242892.24..242892.25 rows=1 width=101) (actual time=3965.362..3965.362 rows=1 loops=1)
   Group Key: y
   ->  Seq Scan on t_test (cost=0.00..234558.91 rows=1666665 width=101) (actual time=0.033..1417.593 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.459 ms
 Execution time: 3965.495 ms
My box (4 GHz AMD) can do that in just under 4 seconds. Note that I am using the standard PostgreSQL storage manager here (no column store or anything like that).
Let us try the same thing on the GPU:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0 GROUP BY y;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate (cost=176230.88..176230.89 rows=1 width=101) (actual time=2004.355..2004.356 rows=1 loops=1)
   Group Key: y
   ->  Custom Scan (GpuPreAgg) (cost=11929.24..171148.30 rows=75 width=108) (actual time=1151.310..2003.868 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test (cost=8929.24..167897.55 rows=5000001 width=101) (actual time=12.956..1152.273 rows=5000000 loops=1)
 Planning time: 0.550 ms
 Execution time: 2299.633 ms
(9 rows)
We see a nice improvement here. The speedup is considerable, especially when taking into consideration that reading the data alone already takes more than a second. Moving work out to the GPU definitely pays off in this case.
The interesting thing to notice is that the real improvement can be seen because of the GROUP BY clause. A normal filter does not show a benefit:
test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate (cost=238725.58..238725.59 rows=1 width=0) (actual time=1762.829..1762.829 rows=1 loops=1)
   ->  Seq Scan on t_test (cost=0.00..234558.91 rows=1666665 width=0) (actual time=0.055..1311.220 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.074 ms
 Execution time: 1762.875 ms
(5 rows)

test=# explain analyze SELECT count(*) FROM t_test WHERE sqrt(x) > 0;
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Aggregate (cost=172064.21..172064.22 rows=1 width=0) (actual time=1411.036..1411.036 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg) (cost=11929.24..171148.30 rows=75 width=4) (actual time=650.590..1410.837 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test (cost=8929.24..167897.55 rows=5000001 width=0) (actual time=16.034..1160.592 rows=5000000 loops=1)
 Planning time: 1.634 ms
 Execution time: 1745.605 ms
(8 rows)
It certainly makes sense that there is no improvement in this case because moving data around is simply too expensive to make a difference. Remember: GPUs only make sense if things can be done in parallel and if data is coming fast enough. sqrt is not complicated enough to justify the effort of moving data around and PostgreSQL cannot provide data fast enough.
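Conceptually, what GpuPreAgg does is two-phase aggregation: each chunk of rows is reduced to a handful of partial per-group states on the device, and only those small partials travel back for the final HashAggregate. A hedged sketch of the idea in plain Python (no GPU involved, toy data of my own):

```python
def preagg(chunk):
    """'Device-side' step: collapse a chunk of (key, value) rows into
    per-key partial sums -- the GpuPreAgg role."""
    partial = {}
    for key, value in chunk:
        partial[key] = partial.get(key, 0) + value
    return partial

def final_agg(partials):
    """'Host-side' step: merge the small partial results -- the final
    HashAggregate role."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            total[key] = total.get(key, 0) + value
    return total

rows = [(i % 2, 1) for i in range(1000)]              # toy data, two groups
chunks = [rows[i:i + 100] for i in range(0, 1000, 100)]
print(final_agg(preagg(c) for c in chunks))           # {0: 500, 1: 500}
```

Only ten tiny dictionaries cross the "device boundary" here instead of a thousand rows, which is why the trick only pays off when the reduction per chunk is substantial.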
Or queries can be slower
It is important to mention that many queries won't benefit from the GPU at all. In fact, I would expect that the majority of queries in a usual system will not behave differently.
Here is an example of a query, which is actually slower with pgstrom:
test=# explain analyze SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
                                          QUERY PLAN
------------------------------------------------------------------------------------------------
 WindowAgg (cost=8929.24..188730.88 rows=1666667 width=206) (actual time=4449.716..5848.233 rows=5000000 loops=1)
   ->  Custom Scan (GpuScan) on t_test (cost=8929.24..167897.55 rows=1666667 width=206) (actual time=616.879..1899.651 rows=5000000 loops=1)
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.142 ms
 Execution time: 6260.461 ms
(5 rows)

test=# explain analyze SELECT *, sum(x) OVER () FROM t_test WHERE sqrt(x) > 0;
                                           QUERY PLAN
--------------------------------------------------------------------------------------------------
 WindowAgg (cost=0.00..255392.23 rows=1666665 width=206) (actual time=3610.914..4895.518 rows=5000000 loops=1)
   ->  Seq Scan on t_test (cost=0.00..234558.91 rows=1666665 width=206) (actual time=0.038..1389.645 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.324 ms
 Execution time: 5187.048 ms
(5 rows)
In this case the GPU seems like a loss - at least there is no benefit to be expected at this stage.
One word about sorts
According to the main developer of pgstrom sorting is not yet as good as he wants it to be, so I skipped the sort part for now. As sorts are key to many queries, there is still pgstrom functionality I am really looking forward to.
I assume that sorts can greatly benefit from a GPU because there is a lot of intrinsic parallelism in a sort algorithm. Therefore sorting on the GPU could be highly beneficial. The speedup we can expect is hard to predict but I firmly believe that it can be quite substantial.
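Sorting indeed decomposes well. In a merge sort, for example, the two halves can be sorted completely independently, and that is the kind of work a GPU can fan out across thousands of threads (real GPU sorts typically use bitonic or radix schemes, but the structural point is the same). A minimal sketch, sequential here, with the parallelizable split marked:

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def merge_sort(data):
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    # The two recursive calls are fully independent -- this is the
    # intrinsic parallelism a GPU implementation could exploit.
    left = merge_sort(data[:mid])
    right = merge_sort(data[mid:])
    return merge(left, right)

print(merge_sort([5, 1, 4, 2, 3]))   # [1, 2, 3, 4, 5]
```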
Stability
What stunned me is that I have not encountered a single segmentation fault during my tests. I definitely did not expect that. My assumption was that there would be more loose ends but actually things worked as expected most of the time - given the stage of the project I am pretty excited. pgstrom certainly feels like the future ...
NVIDIA's CUDA is definitely a great thing and I have to admit that I already got excited years ago when I first learned about it. For many operations a nicely optimized GPU implementation definitely seems the way to go. GPUs are traditionally used for scientific operations and massively parallel tasks. However, some important work is also going into the PGStrom project, which is all about bringing the power of modern GPUs to PostgreSQL: pgstrom documentation
Installing CUDA
At this point, installing CUDA on Linux might be the hardest part of the entire undertaking. The CUDA installer only works nicely when no X server is running. A simple "init 1" should solve this problem, however.
Before you get started with pgstrom, it is usually a good idea to check whether the GPU has been detected properly:
[hs@laura ~]$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
[hs@laura deviceQuery]$ ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 970"
  CUDA Driver Version / Runtime Version          7.5 / 7.0
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 4095 MBytes (4294246400 bytes)
  (13) Multiprocessors, (128) CUDA Cores/MP:     1664 CUDA Cores
  GPU Max Clock rate:                            1253 MHz (1.25 GHz)
  Memory Clock rate:                             3505 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 1835008 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z):  (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GeForce GTX 970
Result = PASS
If the test passes, CUDA is ready for PostgreSQL.
Installing PostgreSQL with CUDA
Installing pg_strom for PostgreSQL is basically not hard. Here is how it works:
$ git clone https://github.com/postgres/postgres.git pgsql
$ cd pgsql
$ ./configure --enable-debug --enable-cassert
$ make
$ sudo make install
$ git clone https://github.com/pg-strom/devel pg_strom
$ cd pg_strom
$ which pg_config
/usr/local/pgsql/bin/pg_config
$ make
$ sudo make install
What happened in my case was that I had to uncomment 3 lines in a pg_strom header file because my version of PostgreSQL was a bit more up to date than expected. However, this is nothing major. It is more of a small fix.
Once pg_strom has been added to shared_preload_libraries, the system is already ready for action. In my case starting the database shows the following listing:
[hs@laura ~]$ pg_ctl -D /data/dbstrom/ start
server starting
LOG: CUDA Runtime version: 7.5.0
LOG: NVIDIA driver version: 352.30
LOG: GPU0 GeForce GTX 970 (1664 CUDA cores, 1253MHz), L2 1792KB, RAM 4095MB (256bits, 3505MHz), capability 5.2
LOG: NVRTC - CUDA Runtime Compilation vertion 7.0
LOG: database system shutdown was interrupted; last known up at 2015-08-27 21:05:41 CEST
LOG: database system was not properly shut down; automatic recovery in progress
LOG: invalid record length at 0/A54973F8
LOG: redo is not required
LOG: MultiXact member wraparound protections are now enabled
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
The important point here is that during PostgreSQL startup the CUDA device has to show up in the log messages; otherwise there is a problem with the driver.
The beauty is that pgstrom automatically uses GPU code when it seems useful. The user does not have to worry about where the code is actually executed. The optimizer makes those decisions for you automatically.
So far pgstrom seems pretty promising. Of course, it is not ready for production yet, but it is definitely worth investigating further and running more tests.