Of course, most people do not want to corrupt their databases. These people will profit from avoiding the techniques used in this article. But for some, it might be useful to corrupt a database, for example if you want to test a tool or procedure that will be used to detect or fix data corruption.
We need a database with some data in it, and for some of our experiments, we will need to have some ongoing activity. For that, we can use the built-in PostgreSQL benchmark pgbench. We use scale factor 100, so that the largest table contains 10 million rows:
$ pgbench -q -i -s 100
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 7.44 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 10.12 s (drop tables 0.18 s, create tables 0.01 s, client-side generate 7.52 s, vacuum 0.14 s, primary keys 2.28 s).
Load will be generated with 5 concurrent client sessions:
$ pgbench -c 5 -T 3600
Creating a corrupt database by setting fsync = off

We set fsync = off in postgresql.conf and power off the server while it is under load.
After a few attempts, we can detect data corruption with the amcheck extension:
postgres=# CREATE EXTENSION amcheck;
CREATE EXTENSION
postgres=# SELECT bt_index_parent_check('pgbench_accounts_pkey', TRUE, TRUE);
WARNING:  concurrent delete in progress within table "pgbench_accounts"
ERROR:  could not access status of transaction 1949706
DETAIL:  Could not read from file "pg_subtrans/001D" at offset 196608: read too few bytes.
CONTEXT:  while checking uniqueness of tuple (131074,45) in relation "pgbench_accounts"
What happened? Data were no longer flushed to disk in the correct order, so that data modifications could hit the disk before the WAL did. This leads to data corruption during crash recovery.
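The ordering requirement can be illustrated with a toy model (plain Python, not PostgreSQL internals; the data structures and values are made up for illustration): a transaction that inserts a row dirties both a heap page and an index page, and crash recovery can only restore what the surviving WAL describes.

```python
# Toy model of write-ahead logging (not PostgreSQL code). An insert
# touches two pages: the table (heap) page and the index page. Recovery
# replays whatever WAL records reached disk onto the on-disk pages.

def recover(wal, heap, index):
    """Crash recovery: replay surviving WAL records onto on-disk pages."""
    heap, index = dict(heap), dict(index)
    for rec in wal:
        if rec["page"] == "heap":
            heap[rec["tid"]] = rec["row"]
        else:
            index[rec["key"]] = rec["tid"]
    return heap, index

# With fsync = off, the kernel is free to flush the index page while
# neither the heap page nor the WAL record has reached disk when the
# power fails:
wal_on_disk = []                   # WAL record lost in the crash
heap_on_disk = {}                  # heap page never reached disk
index_on_disk = {42: "(0,1)"}      # but the index page did

heap, index = recover(wal_on_disk, heap_on_disk, index_on_disk)

# The index now points at tuple (0,1), which does not exist in the heap,
# and no WAL record exists that recovery could use to detect or fix this:
assert 42 in index and "(0,1)" not in heap
print("index entry", index[42], "dangles; heap contains", heap)
```

With the correct flush order (WAL first), the crash would instead lose both pages or neither logical change, and replay would rebuild a consistent state.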
Creating a corrupt database from a backup
While pgbench is running, we create a base backup:
$ psql
postgres=# SELECT pg_backup_start('test');
 pg_backup_start 
═════════════════
 1/47F8A130
(1 row)
Note that since I am using PostgreSQL v15, the function to start backup mode is pg_backup_start() rather than pg_start_backup(). This is because the exclusive backup API, which had been deprecated since PostgreSQL 9.6, was finally removed in v15. To find out more, read my updated post on the subject.
Let’s figure out the object IDs of the database and of the primary key index of pgbench_accounts:
postgres=# SELECT relfilenode FROM pg_class
           WHERE relname = 'pgbench_accounts_pkey';
 relfilenode 
═════════════
       16430
(1 row)

postgres=# SELECT oid FROM pg_database
           WHERE datname = 'postgres';
 oid 
═════
   5
(1 row)
We create a backup by copying the data directory. Afterwards, we copy the primary key index of pgbench_accounts and the commit log again to make sure that they are more recent than the rest:
$ cp -r data backup
$ cp data/base/5/16430* backup/base/5
$ cp data/pg_xact/* backup/pg_xact/
$ rm backup/postmaster.pid
The crucial part: do not create a backup_label file.
Now we exit backup mode, but ignore the contents of the backup_label file returned from pg_backup_stop():
postgres=# SELECT labelfile FROM pg_backup_stop();
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
                           labelfile                            
════════════════════════════════════════════════════════════════
 START WAL LOCATION: 1/47F8A130 (file 000000010000000100000047)↵
 CHECKPOINT LOCATION: 1/65CD24F0↵
 BACKUP METHOD: streamed↵
 BACKUP FROM: primary↵
 START TIME: 2022-07-05 08:32:47 CEST↵
 LABEL: test↵
 START TIMELINE: 1↵
(1 row)
Then, let’s make sure that the last checkpoint in the control file is different:
$ pg_controldata -D backup | grep REDO
Latest checkpoint's REDO location:    1/890077D0
Latest checkpoint's REDO WAL file:    000000010000000100000089
Great! Let’s start the server:
$ echo 'port = 5555' >> backup/postgresql.auto.conf
$ pg_ctl -D backup start
waiting for server to start..... done
server started
Now an index scan on pgbench_accounts fails, because the index contains more recent data than the table:
postgres=# SELECT * FROM pgbench_accounts ORDER BY aid;
ERROR:  could not read block 166818 in file "base/5/16422.1": read only 0 of 8192 bytes
What happened? By omitting the backup_label file from the backup, we recovered from the wrong checkpoint, so the data in the table and its index were no longer consistent. Note that we can get the same effect without pg_backup_stop(); I only wanted to emphasize the importance of the backup_label file.
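A toy model (plain Python, not PostgreSQL internals; the numbers stand in for LSNs like 1/47F8A130 and are made up) shows how the recovery start point is chosen: backup_label records the checkpoint that was current when pg_backup_start() ran, while pg_control contains the latest checkpoint, which on a busy server is far past the start of the backup.

```python
# Toy model of choosing where crash recovery starts. A base backup is a
# fuzzy copy taken while the server keeps writing, so recovery must
# replay all WAL from the checkpoint recorded in backup_label. Without
# that file, the server falls back to the newer checkpoint in pg_control
# and never replays the earlier changes the fuzzy copy is missing.

backup_label_checkpoint = 100   # checkpoint when pg_backup_start() ran
pg_control_checkpoint = 250     # latest checkpoint when the copy finished

wal = list(range(100, 300))     # WAL records written during the backup

def replay_from(start):
    """Recovery replays every WAL record at or after the chosen start."""
    return [lsn for lsn in wal if lsn >= start]

with_label = replay_from(backup_label_checkpoint)
without_label = replay_from(pg_control_checkpoint)

missing = len(with_label) - len(without_label)
assert missing == 150
print(f"without backup_label, {missing} WAL records are never replayed")
```

Every skipped record is a change that made it into some copied files (like our deliberately re-copied index) but not others, which is exactly the table/index mismatch seen above.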
Creating a corrupt database with pg_resetwal
While the database is under load from pgbench, we crash it with:

$ pg_ctl stop -m immediate -D data
Then we run pg_resetwal:

$ pg_resetwal -D data
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.

$ pg_resetwal -f -D data
Write-ahead log reset
Then we start the server and use amcheck like before to check the index for integrity:
postgres=# CREATE EXTENSION amcheck;
CREATE EXTENSION
postgres=# SELECT bt_index_parent_check('pgbench_accounts_pkey', TRUE, TRUE);
WARNING:  concurrent delete in progress within table "pgbench_accounts"
ERROR:  could not access status of transaction 51959
DETAIL:  Could not read from file "pg_subtrans/0000" at offset 204800: read too few bytes.
CONTEXT:  while checking uniqueness of tuple (1,1) in relation "pgbench_accounts"
What happened? pg_resetwal is only safe to use on a cluster that was shut down cleanly. The option -f is intended as a last-ditch effort to get a corrupted server to start and salvage some data. Only experts should use it.
Creating a corrupt database with pg_upgrade
We create a second cluster with initdb:
$ initdb -E UTF8 --locale=C -U postgres data2
Then we edit postgresql.conf and choose a different port number. After shutting down the original cluster, we run an “upgrade” in link mode:
$ pg_upgrade -d /home/laurenz/data -D /home/laurenz/data2 \
>            -b /usr/pgsql-15/bin -B /usr/pgsql-15/bin -U postgres --link
Performing Consistency Checks
...
Performing Upgrade
...
Adding ".old" suffix to old global/pg_control               ok

If you want to start the old cluster, you will need to remove
the ".old" suffix from /home/laurenz/data/global/pg_control.old.
Because "link" mode was used, the old cluster cannot be safely
started once the new cluster has been started.
...
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
    /usr/pgsql-15/bin/vacuumdb -U postgres --all --analyze-in-stages

Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh
pg_upgrade renamed the control file of the old cluster, so that it cannot get started by accident. We’ll undo that:
$ mv /home/laurenz/data/global/pg_control.old \
>    /home/laurenz/data/global/pg_control
Now we can start both clusters and run pgbench on both. Soon we will see error messages like these:
ERROR:  unexpected data beyond EOF in block 1 of relation base/5/16397
HINT:  This has been seen to occur with buggy kernels; consider updating your system.

ERROR:  duplicate key value violates unique constraint "pgbench_accounts_pkey"
DETAIL:  Key (aid)=(8040446) already exists.

WARNING:  could not write block 13 of base/5/16404
DETAIL:  Multiple failures --- write error might be permanent.

ERROR:  xlog flush request 0/98AEE3E0 is not satisfied --- flushed only to 0/648CDC58
CONTEXT:  writing block 13 of relation base/5/16404

ERROR:  could not access status of transaction 39798
DETAIL:  Could not read from file "pg_subtrans/0000" at offset 155648: read too few bytes.
What happened? Since “link” mode makes the old and the new cluster share their data files, we ended up with two servers running on the same data files. This is bound to lead to data corruption.
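The failure mode can be sketched with a toy model (plain Python, not PostgreSQL code; the page contents are invented): each server keeps its own buffer cache and writes pages back to disk without knowing about the other.

```python
import os
import tempfile

# Toy model of two database servers operating on the same data file.
# Each "server" caches the page privately and flushes it independently.

path = os.path.join(tempfile.mkdtemp(), "page")
with open(path, "wb") as f:
    f.write(b"balance=100")          # initial on-disk page

# Both servers read the same page into their own shared buffers:
with open(path, "rb") as f:
    buffer_a = f.read()
with open(path, "rb") as f:
    buffer_b = f.read()

# Server A applies and commits a change, then flushes its buffer:
buffer_a = b"balance=200"
with open(path, "wb") as f:
    f.write(buffer_a)

# Server B knows nothing of A's change; when it evicts its stale
# buffer, it overwrites A's committed update on disk:
with open(path, "wb") as f:
    f.write(buffer_b)

with open(path, "rb") as f:
    final = f.read()
assert final == b"balance=100"       # A's committed change is gone
print("server A's committed change was silently overwritten")
```

Real servers also clobber each other's WAL, free space maps, and transaction status, which is why PostgreSQL uses the postmaster.pid lock file to prevent exactly this, a protection that pg_upgrade's renamed control file was meant to back up.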
Creating a corrupt database by manipulating data files
For that, we figure out the file name that belongs to the table pgbench_accounts:

postgres=# SELECT relfilenode FROM pg_class
           WHERE relname = 'pgbench_accounts';
 relfilenode 
═════════════
       16396
(1 row)
Now we stop the server and write some garbage into the first data block:
$ yes 'this is garbage' | dd of=data/base/5/16396 bs=1024 seek=2 count=1 conv=notrunc
0+1 records in
0+1 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.00031255 s, 3.3 MB/s
Then we start the server and try to select from the table:
postgres=# TABLE pgbench_accounts ;
ERROR:  compressed pglz data is corrupt
What happened? We tampered with the data files, so it’s unsurprising that the table is corrupted.
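As a quick sanity check on the dd invocation above, a little arithmetic (a back-of-the-envelope Python sketch, not part of the procedure) shows which page the garbage landed in:

```python
# dd used bs=1024 and seek=2, so it wrote 1024 bytes starting at byte
# offset 2 * 1024 = 2048. PostgreSQL pages are 8192 bytes by default,
# so that range falls entirely inside block 0 - the first data block.

BLOCK_SIZE = 8192                 # default PostgreSQL page size
offset = 2 * 1024                 # where dd started writing
length = 1024                     # how much it wrote

first_block = offset // BLOCK_SIZE
last_block = (offset + length - 1) // BLOCK_SIZE
assert first_block == last_block == 0
print(f"bytes {offset}..{offset + length - 1} lie in block {first_block}")
```

That is why the overwritten bytes hit tuple data in the middle of the first page rather than the page header, producing a decompression error instead of an "invalid page" error.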
Creating a corrupt database with catalog modifications
Who needs ALTER TABLE to drop a table column? We can simply run:
DELETE FROM pg_attribute
WHERE attrelid = 'pgbench_accounts'::regclass
  AND attname = 'bid';
After that, an attempt to query the table will result in an error:
ERROR: pg_attribute catalog is missing 1 attribute(s) for relation OID 16396
What happened? We ignored that dropping a column sets attisdropped to TRUE in pg_attribute rather than actually removing the entry. Moreover, we didn’t check for dependencies in pg_depend, nor did we properly lock the table against concurrent access. Modifying catalog tables is unsupported, and if it breaks the database, you get to keep both pieces.
We have seen a number of ways in which you can corrupt a PostgreSQL database. Some of these were obvious, some might surprise the beginner. If you don’t want a corrupted database,
- don’t mess with the system catalogs
- never modify anything in the data directory (with the exception of configuration files)
- don’t run with fsync = off
- don’t call pg_resetwal -f on a crashed server
- remove the old cluster after an upgrade with pg_upgrade in link mode
- don’t delete or omit the backup_label file of a base backup
- run a supported version of PostgreSQL to avoid known software bugs
- run on reliable hardware
I hope you can save some databases with this information! If you’d like to know more about troubleshooting PostgreSQL performance, read my post on join strategies.