Most people using PostgreSQL database systems are aware of the fact that the database engine has to send changes to the so called “Write Ahead Log” (= WAL) to ensure that in case of a crash the database will be able to recover to a consistent state safely and reliably. However, not everybody is aware of the fact that tuning the database server will actually have an impact on the amount of WAL, which has to be written to disk.
WAL and checkpointing
In PostgreSQL changes have to be written to the WAL (or xlog as some call it) before those changes even make it to the underlying data files. In most cases PostgreSQL will simple log, what has to be changed in the data files. However, sometimes a lot more has to be written.
To ensure that the WAL will not grow infinitely, PostgreSQL does so called checkpoints, which are a perfect opportunity to recycle WAL. The size of your transaction should be more or less constant, so cleaning up WAL from time to time is an absolute must. Fortunately checkpointing is done automatically.
By default PostgreSQL uses the following parameters to determine, when a checkpoint is supposed to happen:
# checkpoint_timeout = 5min # range 30s-1h
# max_wal_size = 1GB
# min_wal_size = 80MB
Those default settings are somewhat ok for a reasonable small database. However, for a big system, having fewer checkpoints will help to increase performance. Setting max_wal_size to, say, 20GB is definitely a good thing to do if your system is reasonably large and heavily loaded (writes).
What is little known is that the distance between two checkpoints does not only improve speed due to reduced checkpointing – fewer checkpoints will also have an impact on the amount of transaction log written.
Reducing the amount of transaction log written
Increasing the distance between checkpoints leads to less WAL – but why does that actually happen? Remember: The whole point of having the transaction log in the first place is to ensure that the system will always survive a crash. Applying those changes in the WAL to the data files will fix the datafiles and recover the system at startup. To do that safely PostgreSQL cannot simply log the changes made to a block – in case a block is changed for THE FIRST TIME after a checkpoint, the entire page has to be sent to the WAL. All subsequent changes can be incremental. The point now is: If checkpoints are close together there are many “first times” and full pages have to be written to the WAL quite frequently. However, if checkpoints are far apart the number of full page writes will drop dramatically leading to a lot less WAL.
On a heavily loaded systems we are not talking about peanuts – the difference can be quite significant.
Measuring the amount of WAL written
To measure how much transaction log the system actually produces during a normal benchmark, I have conducted a simple test using empty database instances:
[hs@linuxpc db]$ createdb test [hs@linuxpc db]$ psql test psql (9.6.1) Type "help" for help.
To reduce the number of disk flushes and to give my SSD a long and prosperous life, I set synchronous_commit to off:
test=# ALTER DATABASE test SET synchronous_commit TO off; ALTER DATABASE
Then a small set of test data is loaded:
[hs@linuxpc db]$ pgbench -i test NOTICE: table "pgbench_history" does not exist, skipping NOTICE: table "pgbench_tellers" does not exist, skipping NOTICE: table "pgbench_accounts" does not exist, skipping NOTICE: table "pgbench_branches" does not exist, skipping creating tables... 100000 of 100000 tuples (100%) done (elapsed 0.09 s, remaining 0.00 s) vacuum... set primary keys... done.
100.000 rows are absolutely enough to conduct this simple test.
Then 4 concurrent connections will perform 2 million transactions each. For the first test default PostgreSQL config parameters are used:
[hs@linuxpc db]$ time pgbench -c 4 -t 2000000 test starting vacuum...end. transaction type: <builtin: TPC-B (sort of)> scaling factor: 1 query mode: simple number of clients: 4 number of threads: 1 number of transactions per client: 2000000 number of transactions actually processed: 8000000/8000000 latency average = 0.558 ms tps = 7173.597623 (including connections establishing) tps = 7173.616874 (excluding connections establishing) real 18m35.222s user 2m23.980s sys 3m4.830s
The amount of WAL produced is around 3.5 GB:
test=# SELECT pg_size_pretty(pg_current_xlog_location() - '0/00000000'::pg_lsn); pg_size_pretty ---------------- 3447 MB (1 row)
However, what happens if the checkpoint distances are decreased dramatically?
checkpoint_timeout = 30s max_wal_size = 32MB min_wal_size = 32MB
The amount of WAL will skyrocket because checkpoints are so close that most changes will be a “first change” made to a block. Using those parameters the WAL will skyrocket to staggering 23 GB. As you can see the amount of WAL can easily multiply if those settings are not ideal.
Increasing the distance between checkpoints will certainly speed things up and have a positive impact on the WAL volume: