PostgreSQL 18 is around the corner, and it is time to take a look at one of the most important improvements added to the core engine: the introduction of asynchronous I/O (AIO), which has been a huge topic for years.
Let's dive into this and understand what the fuss is all about. The standard I/O model works roughly like this:
PostgreSQL opens a file and reads one 8 kB block after the other. Yes, the kernel will do readahead and so on, but essentially the process is:

- request a block from the operating system
- wait until the data has arrived
- process the block
- request the next block
This process is repeated. In real life, this means that the database cannot do calculations while we are waiting for the data to arrive. That is the curse of synchronous I/O. So, wouldn’t it be cool if the OS could already produce the data while the database is doing its calculations? The goal of AIO is to make reading and processing happen in parallel.
Here is a picture that illustrates this in more detail:
How does this work? The database schedules I/O requests and asks the operating system to provide the data for later use. In the meantime, processing of already existing data can happen. This greatly reduces the latency the system will face. Obviously, this is not beneficial to all operations, but it can greatly speed up things such as sequential scans, bitmap scans and so on.
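Before moving on, it is worth knowing a couple of settings that influence how much I/O PostgreSQL keeps in flight. Here is a minimal sketch for inspecting them; the comments paraphrase their purpose, so consult the documentation for the exact semantics:

-- How many concurrent I/O requests PostgreSQL assumes the storage can handle:
SHOW effective_io_concurrency;

-- The largest size of a single combined I/O request (128 kB by default):
SHOW io_combine_limit;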
In reality, one can expect better I/O performance, especially when running large-scale operations.
However, let us not stick to theoretical explanations, but move on to a more practical showcase.
The first thing one can do in PostgreSQL 18 and beyond is to configure the I/O method. Here is how it works:

PostgreSQL 18 supports three I/O modes:

- sync: synchronous I/O, which is the behavior of PostgreSQL 17 and before
- worker: I/O is handed off to a pool of dedicated I/O worker processes (the default)
- io_uring: native asynchronous I/O using the Linux io_uring interface

The parameter in use here is io_method, which can be found in postgresql.conf. We recommend using io_uring for maximum efficiency.
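For illustration, here is a minimal sketch of how to switch the method without editing postgresql.conf by hand, assuming superuser access. Keep in mind that io_method only takes effect after a restart:

-- Written to postgresql.auto.conf; requires a server restart to take effect:
ALTER SYSTEM SET io_method = 'io_uring';

-- After the restart, verify the setting:
SHOW io_method;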
For the purpose of this test we will simply use pgbench, which can provide us with a simple enough data set for the tests we want to perform. In this example, we will use pgbench to initialize a database called “bench”, which will be roughly 1 TB in size:
hs@fedora:$ time pgbench -i bench -s 75000
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
vacuuming...
creating primary keys...
done in 12814.15 s (drop tables 0.00 s, create tables 0.03 s, client-side generate 8285.35 s, vacuum 4.17 s, primary keys 4524.61 s).

real    213m34,164s
user    27m2,342s
sys     1m46,958s
The “-s” parameter says that we want to produce 75000 x 100,000 rows in our sample database.
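A quick sanity check of that arithmetic; note the cast to bigint, because the product would overflow a plain 32-bit integer:

-- 75000 (scale factor) x 100,000 accounts per scale unit = 7.5 billion rows
SELECT 75000::bigint * 100000 AS expected_rows;
-- expected_rows: 7500000000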
Note that we are running the test on a local NVMe disk in a Fedora Linux box. The resulting tables look like this:
bench=# \d+
                                    List of relations
 Schema |       Name       | Type  | Owner | Persistence | Access method |  Size
--------+------------------+-------+-------+-------------+---------------+---------
 public | pgbench_accounts | table | hs    | permanent   | heap          | 938 GB
 public | pgbench_branches | table | hs    | permanent   | heap          | 2688 kB
 public | pgbench_history  | table | hs    | permanent   | heap          | 0 bytes
 public | pgbench_tellers  | table | hs    | permanent   | heap          | 32 MB
(4 rows)
The table that is ideal for the I/O test is the first one in the list. It is sufficiently big not to fit into memory and provides exactly what we need: the chance to run a big, fat, simple sequential scan on a large table.
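To confirm that the table really cannot be cached, one can compare its size to the PostgreSQL buffer cache; a quick check along these lines (the actual value of shared_buffers depends on your configuration):

-- Size of the heap (including TOAST), as also shown by \d+:
SELECT pg_size_pretty(pg_table_size('pgbench_accounts'));

-- Amount of memory dedicated to the PostgreSQL buffer cache:
SHOW shared_buffers;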
But let us check the desired configuration first:
bench=# SHOW io_method;
 io_method
-----------
 io_uring
(1 row)
The simplest query that forces PostgreSQL to read the entire table is the following: we simply count the number of entries in the “bid” column of our big table. There is no index on this column, so the system will trigger a parallel sequential scan. The following listing shows the execution plan of our SQL statement:
bench=# explain SELECT count(bid) FROM pgbench_accounts;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=162014321.55..162014321.56 rows=1 width=8)
   ->  Gather  (cost=162014321.33..162014321.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=162013321.33..162013321.34 rows=1 width=8)
               ->  Parallel Seq Scan on pgbench_accounts  (cost=0.00..154200821.07 rows=3125000107 width=4)
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(8 rows)

Time: 9,635 ms
Let us just run it and see what happens:
bench=# SELECT count(bid) FROM pgbench_accounts;
   count
------------
 7500000000
(1 row)

Time: 288458,463 ms (04:48,458)
The query takes almost 5 minutes. Looking at the I/O throughput of our disk, we can see that it peaks at around 3.4 GB / sec (the “bi” column of the vmstat output contains the real system data, represented as a 2-second average). It is also noteworthy that the “wa” column (I/O wait) is at around 40. During execution, one can see that PostgreSQL uses up around 3 cores and keeps the disk totally busy.
To see what is going on during asynchronous reads, PostgreSQL offers a couple of new functions, which allow us to inspect how much I/O has been scheduled and what is happening in general. The first thing to look at is the pg_get_aios() function, which returns some insights:
bench=# SELECT count(*) FROM pg_get_aios();
 count
-------
    36
(1 row)
On my local machine, the return value happened to be between 30 and 50, given 2 parallel worker processes. However, this number can of course vary depending on hardware, configuration, the structure of your data and a lot more.
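To watch how the number of in-flight requests fluctuates while the scan is running in another session, psql's \watch command comes in handy; a minimal sketch:

-- Re-run the query every second until interrupted:
SELECT count(*) FROM pg_get_aios();
\watch 1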
Of course, we can also take a deeper look at the data returned by the SRF (“set-returning function”):
bench=# \x
Expanded display is on.
bench=# SELECT * FROM pg_get_aios() LIMIT 3;
-[ RECORD 1 ]---+-----------------------------------------------------
pid             | 9665
io_id           | 64
io_generation   | 33591
state           | SUBMITTED
operation       | readv
off             | 532152320
length          | 131072
target          | smgr
handle_data_len | 16
raw_result      |
result          | UNKNOWN
target_desc     | blocks 25755072..25755087 in file "base/16414/16430"
f_sync          | f
f_localmem      | f
f_buffered      | t
-[ RECORD 2 ]---+-----------------------------------------------------
pid             | 9665
io_id           | 65
io_generation   | 33500
state           | SUBMITTED
operation       | readv
off             | 531890176
length          | 131072
target          | smgr
handle_data_len | 16
raw_result      |
result          | UNKNOWN
target_desc     | blocks 25755040..25755055 in file "base/16414/16430"
f_sync          | f
f_localmem      | f
f_buffered      | t
For every I/O request PostgreSQL has scheduled, we will get one entry so that we can understand exactly what is going on behind the scenes.
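Note, by the way, that each request in the listing covers 131072 bytes, i.e. 16 blocks of 8 kB, which matches the default io_combine_limit of 128 kB. Instead of inspecting individual entries, one can also aggregate the output to get a quick overview; a simple sketch:

-- Summarize in-flight I/O requests by state and operation:
SELECT state, operation, count(*)
FROM pg_get_aios()
GROUP BY state, operation
ORDER BY count(*) DESC;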
What is interesting to see here: regardless of the number of worker processes (assuming max_parallel_workers_per_gather is at least 2), the throughput on disk won't change anymore, because we have hit the limit of what the disk is able to handle. We are simply completely I/O bound already.
Back in the old days, all we had was synchronous I/O, and the new behavior is cutting edge. So let us compare the old and the new technology by setting “io_method = sync”, which is what we had in PostgreSQL 17 and before (remember that changing io_method requires a restart):
bench=# SELECT count(bid) FROM pgbench_accounts;
   count
------------
 7500000000
(1 row)

Time: 368744,007 ms (06:08,744)
The changes are substantial: we top out at 2.6 GB / sec already, which is roughly 800 MB / sec less than before. To get closer to what is doable, we have to raise the number of worker processes to compensate:
bench=# SET max_parallel_workers_per_gather to 8;
SET
Time: 0,147 ms
bench=# SELECT count(bid) FROM pgbench_accounts;
   count
------------
 7500000000
(1 row)

Time: 288192,030 ms (04:48,192)
In other words: we have to throw additional CPUs at the problem to make better use of the disk. When we take a look at “vmstat 2 -w”, there is something worth noticing: the disk wait (“wa”) has gone up substantially. We are now at around 75 instead of 40, which has a couple of implications that can affect the rest of your system in a fairly significant way.
Visualizing the results gives us a feeling for what we are dealing with. In summary:

 io_method | parallel workers | runtime
-----------+------------------+--------------------
 io_uring  | 2                | 288458 ms (04:48)
 sync      | 2                | 368744 ms (06:08)
 sync      | 8                | 288192 ms (04:48)
Now, with faster (or more) disks, we can expect even bigger differences and larger gains. However, what is already pretty evident is: “PostgreSQL 18 makes significantly better use of your hardware”, especially if you are running on-premises.