
Scaling PostgreSQL

Modern servers provide more and more CPU cores. 16 cores or more are not uncommon these days. Unfortunately, PostgreSQL cannot scale a single query to more than one core. A lot of effort is already going into this issue.

CYBERTEC experiments

To address the issue, we've done some experiments to see what can be done to scale things up. At this point, we've written a small module which takes an existing execution plan and does some post-processing of the plan using “custom plan nodes”.

The idea is experimental: we simply want to see what happens if the approach described in this post is actually used.

The main challenge we face at a client is queries that spend most of their time on a long list of expensive aggregations in the SELECT clause. So the goal in our case is really to scale out the processing done in the SELECT clause.

Some test data for scaling

For a start, 100 million rows have been created to do a little testing. 20 million rows are in each partition:
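The actual script did not survive in this copy of the post. As a rough sketch (table and column names are mine, not necessarily the ones we used), a comparable data set could be built with old-style inheritance partitioning:

CREATE TABLE t_test (id serial, x numeric);

CREATE TABLE t_test_1 () INHERITS (t_test);
CREATE TABLE t_test_2 () INHERITS (t_test);
CREATE TABLE t_test_3 () INHERITS (t_test);
CREATE TABLE t_test_4 () INHERITS (t_test);
CREATE TABLE t_test_5 () INHERITS (t_test);

-- 20 million rows per partition, 100 million rows in total
INSERT INTO t_test_1 (x) SELECT random() FROM generate_series(1, 20000000);
INSERT INTO t_test_2 (x) SELECT random() FROM generate_series(1, 20000000);
INSERT INTO t_test_3 (x) SELECT random() FROM generate_series(1, 20000000);
INSERT INTO t_test_4 (x) SELECT random() FROM generate_series(1, 20000000);
INSERT INTO t_test_5 (x) SELECT random() FROM generate_series(1, 20000000);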


Our prototype module has a simple runtime variable, which allows us to define the number of processes we want to see. Setting parallel.agg_workers to 1 basically means turning the plan-processing code off.

The data in the table is scanned normally. We need around 29 seconds to process the query:
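The statement itself is not reproduced above; as a hedged sketch, the baseline run against the hypothetical t_test table from the previous listing could look like this:

SET parallel.agg_workers = 1;  -- prototype effectively turned off

SELECT sum(x) FROM t_test;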

Let us try the same with more cores: 7 cores will be used for worker processes.

The 8th core will be taken by the coordinating process:
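Again as a sketch (whether the coordinator is included in the count is a detail of our prototype, so take the exact value with a grain of salt):

SET parallel.agg_workers = 8;  -- 7 workers plus the coordinating process

SELECT sum(x) FROM t_test;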

The important part here is that more processes are actually a net loss. The reason is simple: the table is scanned by a single process at the moment, so there is a serious bottleneck on the scan side. The aggregation part is simply not complex enough to provide us with a speedup. Remember: all we do is add up some values (which is way too simple). That is not nearly enough to justify the overhead of moving data around via shared memory queues.

Handling long SELECT clauses when scaling

However, the situation starts to change significantly if we try a query involving a lot of processing in the SELECT-clause. The more stuff we do in the SELECT-clause, the more benefit we can observe.

Here is an example:
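The original statement is missing here; the point is simply to pile up aggregation work, roughly like this (a sketch against the hypothetical t_test table):

SELECT sum(x), avg(x), min(x), max(x),
       stddev(x), variance(x),
       sum(x * 2), avg(x * x),
       sum(sqrt(x)), avg(sqrt(x))
FROM t_test;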

In this case, we do a lot of aggregation and therefore the overhead of moving tuples around is not as big compared to the rest of the query.

The performance difference is already quite huge:

What we see is a nice 3.5-times speedup. Of course, this is still far from linear, and linear scalability can hardly be achieved with this approach. However, if there are 50-100 aggregation functions in the SELECT clause (which is far from uncommon), there is clearly a major performance improvement, and in many cases it tends to get pretty close to “linear”.

At the moment, actually providing the data to our worker processes is still clearly an issue:

The process dispatching the data is running at 100%, while the worker processes are not able to make full use of the CPU. It seems that some overhead can still be shaved off to provide users with better overall performance. A prototype might be possible fairly soon.

In case you need any assistance, please feel free to contact us.
 



In my previous post I shared some GPU performance data, mostly related to aggregation. Given the information I found on the pgstrom wiki page, I decided to give joins a try to see how much speed we can gain by offloading some of the work PostgreSQL has to do to the graphics card. Let's see how PostgreSQL does with joins on a GPU.

First I created some test data:
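The listing is missing from this copy of the post; a sketch of what it could have looked like (t_big and t_small are names I am making up for illustration):

CREATE TABLE t_big (id int4, x numeric);
CREATE TABLE t_small (id int4, x numeric);

-- 25 million rows on one side, 1 million on the other;
-- every id in t_small also exists in t_big
INSERT INTO t_big SELECT s, random() FROM generate_series(1, 25000000) AS s;
INSERT INTO t_small SELECT s, random() FROM generate_series(1, 1000000) AS s;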

25 million rows should be joined with 1 million rows, which are a 100% subset of the original data.

A first test shows a very nice improvement. First, two queries on my CPU:
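The exact statements are not preserved; a hedged guess at their shape, using the hypothetical tables from above (disabling pg_strom for the CPU run is an assumption on my side):

SET pg_strom.enabled = off;  -- assumed GUC; forces plain CPU execution

SELECT count(*)
FROM t_big AS a
JOIN t_small AS b ON a.id = b.id;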

The CPU in use here is an “AMD FX(tm)-8350 Eight-Core Processor” running at 4 GHz.

Let us try the same thing on the GPU:
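If the module behaves like my build, switching to the GPU is just a matter of flipping the same (assumed) GUC and re-running the query:

SET pg_strom.enabled = on;

SELECT count(*)
FROM t_big AS a
JOIN t_small AS b ON a.id = b.id;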

What we see here is a nice improvement. The query is several times faster.

Making use of indexes

The interesting part is to see what happens if indexes are added to both sides of the join:
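For reference, the indexes could be created like this (index names are mine):

CREATE INDEX idx_big_id ON t_big (id);
CREATE INDEX idx_small_id ON t_small (id);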

The standard query, shown before, does not show any difference because too much data is needed inside the join. So, the test is repeated with a reasonable filter:
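A sketch of such a filtered query (the cutoff value is made up for illustration):

SELECT count(*)
FROM t_big AS a
JOIN t_small AS b ON a.id = b.id
WHERE a.id < 100000;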

The CPU version is pretty fast. PostgreSQL scans the indexes and then makes use of the sorted input to perform a nice merge join. The aggregation is pretty fast as well.

In the GPU-enabled case, the plan does not look as efficient in the default setting:

Even if the query is slower in this case, it should not be a major issue. If the cost models are adjusted properly, the planner should be able to figure out that the CPU version is the faster one, so the loss in speed is really not an issue here. As soon as all the infrastructure is in place, balancing the cost model and adjusting the planner here and there does not seem like a showstopper (at least not given the achievements already made).

Joining more tables with pgstrom

Let us just add more tables to the join to see what happens. To make sure that CPU and GPU get a fair chance, I have removed my two indexes again:
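Dropping the indexes and folding the small table in a couple of times might look roughly like this (a sketch, not the original statements):

DROP INDEX idx_big_id;
DROP INDEX idx_small_id;

SELECT count(*)
FROM t_big AS a
JOIN t_small AS b ON a.id = b.id
JOIN t_small AS c ON b.id = c.id
JOIN t_small AS d ON c.id = d.id;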

As expected, the CPU has a hard time joining all the stuff together. It is a pretty nasty join. In the end, 34 seconds are needed.

The GPU can really shine under these circumstances. Expensive joins seem to be what this is all about after all:

Given the number of rows involved, this is really a nice number to see. 8 seconds just rocks and shows the real potential of a GPU.

Of course, you have to keep in mind that I am running arbitrary SQL statements, which are not too realistic in many cases. I have yet to test pgstrom on a real workload involving more of the everyday queries you would typically see in PostgreSQL.

Finally I got around to taking a more extensive look at pgstrom (a module which makes use of GPUs). The entire GPU thing fascinates me, and I was excited to see the first real performance data.

Here is some simple test data:
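The script is not preserved here; something along these lines would produce a comparable data set (t_demo and x are names of my choosing):

CREATE TABLE t_demo AS
    SELECT s AS id, random() AS x
    FROM generate_series(1, 5000000) AS s;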

5 million rows should be enough to get a first impression of what is going on.

Queries can benefit

To make sure that a real difference can actually be observed, I decided not to use indexes. In real life this is not too realistic, because performance would suffer horribly. But pgstrom has not been made to speed up index lookups anyway, so this should not be an issue.

The first thing I tried was to filter and group some data on the CPU:
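The statement itself got lost in this copy; a filter plus GROUP BY on the hypothetical t_demo table could look like this:

SELECT id % 10 AS grp, count(*), avg(x)
FROM t_demo
WHERE sqrt(x) > 0.5
GROUP BY id % 10;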

My box (a 4 GHz AMD) can do that in just under 4 seconds. Note that I am using the standard PostgreSQL storage manager here (no column store or anything like that).

Let us try the same thing on the GPU:
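With pg_strom loaded, the same statement can be pushed to the graphics card. In my build, this boils down to flipping a single (assumed) switch and re-running the query from above:

SET pg_strom.enabled = on;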

We see a nice improvement here. The speedup is incredible - especially when taking into consideration that getting the data already takes more than a second. It seems moving stuff out to the GPU definitely pays off in this case.

The interesting thing to notice is that the real improvement comes from the GROUP BY clause. A normal filter does not show a benefit:
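Such a plain filter might have looked like this (again a sketch):

SELECT count(*)
FROM t_demo
WHERE sqrt(x) > 0.5;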

It certainly makes sense that there is no improvement in this case, because moving data around is simply too expensive to make a difference. Remember: GPUs only make sense if things can be done in parallel and if data comes in fast enough. sqrt is not complicated enough to justify the effort of moving data around, and PostgreSQL cannot provide the data fast enough.

Or queries can be slower

It is important to mention that many queries won't benefit from the GPU at all. In fact, I would expect that the majority of queries in a usual system will not behave differently.

Here is an example of a query, which is actually slower with pgstrom:
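I cannot reproduce the original statement; a typical candidate would be a query that ships many rows back while doing almost no computation, so that moving data to and from the GPU dominates - something along these lines (purely an illustration):

SELECT *
FROM t_demo
WHERE x > 0.5;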

In this case the GPU seems like a loss - at least there is no benefit to be expected at this stage.

One word about sorts

According to the main developer of pgstrom, sorting is not yet as good as he wants it to be, so I skipped the sort part for now. As sorts are key to many queries, there is still pgstrom functionality I am really looking forward to.

I assume that sorts can greatly benefit from a GPU, because there is a lot of intrinsic parallelism in a sort algorithm. The speedup we can expect is hard to predict, but I firmly believe that it can be quite substantial.

Stability

What stunned me is that I did not encounter a single segmentation fault during my tests. I definitely did not expect that. My assumption was that there would be more loose ends, but things actually worked as expected most of the time. Given the stage of the project, I am pretty excited. pgstrom certainly feels like the future ...


NVIDIA's CUDA is definitely a great thing, and I have to admit that I already got excited years ago when I first learned about it. For many operations a nicely optimized GPU implementation definitely seems the way to go. GPUs are traditionally used for scientific operations and massively parallel tasks. However, some important work is also going into the PGStrom project, which is all about bringing the power of modern GPUs to PostgreSQL (see the pgstrom documentation).

Installing CUDA

At this point, installing CUDA on Linux might be the hardest part of the entire undertaking. The CUDA installer only works nicely when no X-server is running. A simple "init 1" should solve this problem, however.

Before you get started with pgstrom, it is usually a good idea to check if the GPU has been detected properly:

If the test is passed, CUDA is ready for PostgreSQL.

Installing PostgreSQL with CUDA

Installing pg_strom for PostgreSQL is basically not hard: fetch the sources, build them against your PostgreSQL installation, and install the module like any other extension.

What happened in my case was that I had to uncomment 3 lines in a pg_strom header file because my version of PostgreSQL was a bit more up to date than expected. However, this is nothing major. It is more of a small fix.

Once pg_strom has been added to shared_preload_libraries, the system is ready for action. In my case, starting the database logs the CUDA device that has been found.

The important point here is that the CUDA device has to show up in the LOG messages during PostgreSQL startup - otherwise there is a problem with the driver.

The beauty is that pgstrom automatically uses GPU code when it seems useful. The user does not have to worry about where the code is actually executed; the optimizer will make those decisions for you automatically.

So far pgstrom seems pretty promising. Of course, it is not ready for production yet, but it is definitely worth investigating further and running more tests next week.
