Finally I got around to taking a more extensive look at pgstrom (a module that makes use of GPUs). The entire GPU topic fascinates me, and I was excited to see the first real performance data.

Here is some simple test data:


test=# CREATE TABLE t_test AS
       SELECT x, 'a'::char(100) AS y, 'b'::char(100) AS z
       FROM   generate_series(1, 5000000) AS x
       ORDER BY random();
SELECT 5000000

5 million rows should be enough to get a first impression of what is going on.
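To get a feeling for how much data the sequential scans below actually have to read, the on-disk size of the table can be checked with PostgreSQL's standard size functions:

```sql
-- on-disk size of the heap (two char(100) columns make the rows fairly wide)
SELECT pg_size_pretty(pg_relation_size('t_test'));
```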

Queries can benefit

To make sure that a real difference can actually be observed, I decided to use no indexes. In real life this is not very realistic, because performance would suffer horribly. pgstrom has not been made to speed up index lookups anyway, so this should not be an issue.

The first thing I tried was to filter and group some data on the CPU:


test=# explain analyze SELECT count(*)
       FROM   t_test
       WHERE  sqrt(x) > 0
       GROUP BY y;
                                  QUERY PLAN
--------------------------------------------------------------------------------------
 HashAggregate  (cost=242892.24..242892.25 rows=1 width=101)
                (actual time=3965.362..3965.362 rows=1 loops=1)
   Group Key: y
   ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=101)
                           (actual time=0.033..1417.593 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.459 ms
 Execution time: 3965.495 ms

My box (4 GHz AMD) can do that in just under 4 seconds. Note that I am using the standard PostgreSQL storage manager here (no column store or anything like that).
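All CPU-vs-GPU comparisons in this post were run on the same installation. Assuming the `pg_strom.enabled` parameter of the version I tested, the extension can be toggled per session to switch between plain and GPU plans:

```sql
-- assumption: this pgstrom build exposes the pg_strom.enabled parameter
SET pg_strom.enabled = off;  -- planner falls back to plain CPU plans (Seq Scan etc.)
-- ... run the query and note the timing ...
SET pg_strom.enabled = on;   -- allow GPU plans (GpuPreAgg, GpuScan, BulkScan)
```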

Let us try the same thing on the GPU:


test=# explain analyze SELECT count(*)
       FROM   t_test
       WHERE  sqrt(x) > 0
       GROUP BY y;
                                    QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=176230.88..176230.89 rows=1 width=101)
                (actual time=2004.355..2004.356 rows=1 loops=1)
   Group Key: y
   ->  Custom Scan (GpuPreAgg)  (cost=11929.24..171148.30 rows=75 width=108)
                                (actual time=1151.310..2003.868 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: Local + Global
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test  (cost=8929.24..167897.55 rows=5000001 width=101)
                                               (actual time=12.956..1152.273 rows=5000000 loops=1)
 Planning time: 0.550 ms
 Execution time: 2299.633 ms
(9 rows)

We see a nice improvement here. The speedup is substantial – close to a factor of two – especially when taking into consideration that merely reading the data already takes more than a second. Moving the work out to the GPU definitely pays off in this case.

The interesting thing to notice is that the real improvement comes from the GROUP BY clause. A plain filter on its own shows no benefit:


test=# explain analyze SELECT count(*)
       FROM   t_test
       WHERE  sqrt(x) > 0;
                                   QUERY PLAN
--------------------------------------------------------------------------------------------
 Aggregate  (cost=238725.58..238725.59 rows=1 width=0)
            (actual time=1762.829..1762.829 rows=1 loops=1)
   ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=0)
                           (actual time=0.055..1311.220 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.074 ms
 Execution time: 1762.875 ms
(5 rows)



test=# explain analyze SELECT count(*)
       FROM   t_test
       WHERE  sqrt(x) > 0;
                                      QUERY PLAN
-------------------------------------------------------------------------------------------------
 Aggregate  (cost=172064.21..172064.22 rows=1 width=0)
            (actual time=1411.036..1411.036 rows=1 loops=1)
   ->  Custom Scan (GpuPreAgg)  (cost=11929.24..171148.30 rows=75 width=4)
                                (actual time=650.590..1410.837 rows=76 loops=1)
         Bulkload: On (density: 100.00%)
         Reduction: NoGroup
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
         ->  Custom Scan (BulkScan) on t_test  (cost=8929.24..167897.55 rows=5000001 width=0)
                                               (actual time=16.034..1160.592 rows=5000000 loops=1)
 Planning time: 1.634 ms
 Execution time: 1745.605 ms
(8 rows)

It certainly makes sense that there is no improvement in this case: moving the data around is simply too expensive to make a difference. Remember: GPUs only pay off if the work can be done in parallel and if data arrives fast enough. sqrt is not expensive enough to justify shipping the rows to the device, and PostgreSQL cannot provide the data fast enough.
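If the per-row work were heavier, the balance could tip the other way. A hypothetical follow-up test along these lines would simply use a more expensive expression in the filter, giving the GPU more arithmetic to do per row (I have not benchmarked this variant, so no numbers are claimed):

```sql
-- hypothetical heavier filter: more math per row, same amount of data moved
explain analyze SELECT count(*)
FROM   t_test
WHERE  sin(x) * cos(x) + sqrt(x) > 0;
```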

Or queries can be slower

It is important to mention that many queries won't benefit from the GPU at all. In fact, I would expect that the majority of queries in a typical system will not behave differently.

Here is an example of a query which is actually slower with pgstrom:


test=# explain analyze SELECT *, sum(x) OVER ()
       FROM   t_test
       WHERE  sqrt(x) > 0;
                                     QUERY PLAN
------------------------------------------------------------------------------------------------
 WindowAgg  (cost=8929.24..188730.88 rows=1666667 width=206)
            (actual time=4449.716..5848.233 rows=5000000 loops=1)
   ->  Custom Scan (GpuScan) on t_test  (cost=8929.24..167897.55 rows=1666667 width=206)
                                        (actual time=616.879..1899.651 rows=5000000 loops=1)
         Device Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.142 ms
 Execution time: 6260.461 ms
(5 rows)



test=# explain analyze SELECT *, sum(x) OVER ()
       FROM   t_test
       WHERE  sqrt(x) > 0;
                                      QUERY PLAN
--------------------------------------------------------------------------------------------------
 WindowAgg  (cost=0.00..255392.23 rows=1666665 width=206)
            (actual time=3610.914..4895.518 rows=5000000 loops=1)
   ->  Seq Scan on t_test  (cost=0.00..234558.91 rows=1666665 width=206)
                           (actual time=0.038..1389.645 rows=5000000 loops=1)
         Filter: (sqrt((x)::double precision) > '0'::double precision)
 Planning time: 0.324 ms
 Execution time: 5187.048 ms
(5 rows)

In this case the GPU is a net loss: all 5 million rows have to be shipped back for the window function anyway, so at this stage no benefit can be expected.

One word about sorts

According to the main developer of pgstrom, sorting is not yet as good as he wants it to be, so I skipped the sort part for now. As sorts are key to many queries, this is pgstrom functionality I am really looking forward to.

I assume that sorts can greatly benefit from a GPU, because sort algorithms contain a lot of intrinsic parallelism. The speedup is hard to predict, but I firmly believe it can be quite substantial.
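Once GPU sorting matures, a natural benchmark on the same test table would be a plain sort-heavy query like the one below, again comparing the plan and timing with and without pgstrom (this is a sketch of a future test, not something I have measured):

```sql
-- sort-heavy query to revisit once GPU sorting is ready:
-- 5 million wide rows, no index available, so the sort dominates
explain analyze SELECT *
FROM   t_test
ORDER BY y, x;
```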

Stability

What stunned me is that I did not encounter a single segmentation fault during my tests. I definitely did not expect that. My assumption was that there would be more loose ends, but things actually worked as expected most of the time. Given the stage of the project, I am pretty excited – pgstrom certainly feels like the future …