Of course, most people do not want to corrupt their databases. These people will profit from avoiding the techniques used in this article. But for some, it might be useful to corrupt a database, for example if you want to test a tool or procedure that will be used to detect or fix data corruption.
We need a database with some data in it, and for some of our experiments, we will need to have some ongoing activity. For that, we can use the built-in PostgreSQL benchmark pgbench. We use scale factor 100, so that the largest table contains 10 million rows:
$ pgbench -q -i -s 100
dropping old tables...
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 7.44 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 10.12 s (drop tables 0.18 s, create tables 0.01 s, client-side generate 7.52 s, vacuum 0.14 s, primary keys 2.28 s).
Load will be generated with 5 concurrent client sessions:
$ pgbench -c 5 -T 3600
Creating a corrupt database by setting fsync = off

We set fsync = off in postgresql.conf and power off the server while it is under load.
After a few attempts, we can detect data corruption with the amcheck extension:
postgres=# CREATE EXTENSION amcheck;
CREATE EXTENSION
postgres=# SELECT bt_index_parent_check('pgbench_accounts_pkey', TRUE, TRUE);
WARNING:  concurrent delete in progress within table "pgbench_accounts"
ERROR:  could not access status of transaction 1949706
DETAIL:  Could not read from file "pg_subtrans/001D" at offset 196608: read too few bytes.
CONTEXT:  while checking uniqueness of tuple (131074,45) in relation "pgbench_accounts"
What happened? Data were no longer flushed to disk in the correct order, so that data modifications could hit the disk before the WAL did. This leads to data corruption during crash recovery.
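The ordering requirement can be illustrated with a toy model (plain Python, not PostgreSQL internals; the data structures and values are made up for illustration): a transaction that inserts a row dirties both a heap page and an index page, and crash recovery can only restore what the surviving WAL describes.

```python
# Toy model of write-ahead logging (not PostgreSQL code). An insert
# touches two pages: the table (heap) page and the index page. Recovery
# replays whatever WAL records reached disk onto the on-disk pages.

def recover(wal, heap, index):
    """Crash recovery: replay surviving WAL records onto on-disk pages."""
    heap, index = dict(heap), dict(index)
    for rec in wal:
        if rec["page"] == "heap":
            heap[rec["tid"]] = rec["row"]
        else:
            index[rec["key"]] = rec["tid"]
    return heap, index

# With fsync = off, the kernel is free to flush the index page while
# neither the heap page nor the WAL record has reached disk when the
# power fails:
wal_on_disk = []                   # WAL record lost in the crash
heap_on_disk = {}                  # heap page never reached disk
index_on_disk = {42: "(0,1)"}      # but the index page did

heap, index = recover(wal_on_disk, heap_on_disk, index_on_disk)

# The index now points at tuple (0,1), which does not exist in the heap,
# and no WAL record exists that recovery could use to detect or fix this:
assert 42 in index and "(0,1)" not in heap
print("index entry", index[42], "dangles; heap contains", heap)
```

With the correct flush order (WAL first), the crash would instead lose both pages or neither logical change, and replay would rebuild a consistent state.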
Creating a corrupt database from a backup
While pgbench is running, we create a base backup:
$ psql
postgres=# SELECT pg_backup_start('test');
 pg_backup_start 
═════════════════
 1/47F8A130
(1 row)
Note that since I am using PostgreSQL v15, the function to start backup mode is pg_backup_start() rather than pg_start_backup(). This is because the exclusive backup API, which had been deprecated since PostgreSQL 9.6, was finally removed in v15. To find out more, read my updated post on the subject.
Let’s figure out the object IDs of the database and of the primary key index of pgbench_accounts:
postgres=# SELECT relfilenode FROM pg_class
           WHERE relname = 'pgbench_accounts_pkey';
 relfilenode 
═════════════
       16430
(1 row)

postgres=# SELECT oid FROM pg_database
           WHERE datname = 'postgres';
 oid 
═════
   5
(1 row)
We create a backup by copying the data directory. Afterwards, we copy the primary key index of pgbench_accounts and the commit log again to make sure that they are more recent than the rest:
$ cp -r data backup
$ cp data/base/5/16430* backup/base/5
$ cp data/pg_xact/* backup/pg_xact/
$ rm backup/postmaster.pid
The crucial part: do not create a backup_label file.
Now we exit backup mode, but ignore the contents of the backup_label file returned from pg_backup_stop():
postgres=# SELECT labelfile FROM pg_backup_stop();
NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
                           labelfile                            
════════════════════════════════════════════════════════════════
 START WAL LOCATION: 1/47F8A130 (file 000000010000000100000047)↵
 CHECKPOINT LOCATION: 1/65CD24F0↵
 BACKUP METHOD: streamed↵
 BACKUP FROM: primary↵
 START TIME: 2022-07-05 08:32:47 CEST↵
 LABEL: test↵
 START TIMELINE: 1↵
(1 row)
Then, let’s make sure that the last checkpoint in the control file is different:
$ pg_controldata -D backup | grep REDO
Latest checkpoint's REDO location:    1/890077D0
Latest checkpoint's REDO WAL file:    000000010000000100000089
Great! Let’s start the server:
$ echo 'port = 5555' >> backup/postgresql.auto.conf
$ pg_ctl -D backup start
waiting for server to start..... done
server started
Now an index scan on pgbench_accounts fails, because the index contains more recent data than the table:
postgres=# SELECT * FROM pgbench_accounts ORDER BY aid;
ERROR:  could not read block 166818 in file "base/5/16422.1": read only 0 of 8192 bytes
What happened? By omitting the backup_label file from the backup, we recovered from the wrong checkpoint, so the data in the table and its index were no longer consistent. Note that we can get the same effect without pg_backup_stop(); I only wanted to emphasize the importance of the backup_label file.
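A toy model (plain Python, not PostgreSQL internals; the numbers stand in for LSNs like 1/47F8A130 and are made up) shows how the recovery start point is chosen: backup_label records the checkpoint that was current when pg_backup_start() ran, while pg_control contains the latest checkpoint, which on a busy server is far past the start of the backup.

```python
# Toy model of choosing where crash recovery starts. A base backup is a
# fuzzy copy taken while the server keeps writing, so recovery must
# replay all WAL from the checkpoint recorded in backup_label. Without
# that file, the server falls back to the newer checkpoint in pg_control
# and never replays the earlier changes the fuzzy copy is missing.

backup_label_checkpoint = 100   # checkpoint when pg_backup_start() ran
pg_control_checkpoint = 250     # latest checkpoint when the copy finished

wal = list(range(100, 300))     # WAL records written during the backup

def replay_from(start):
    """Recovery replays every WAL record at or after the chosen start."""
    return [lsn for lsn in wal if lsn >= start]

with_label = replay_from(backup_label_checkpoint)
without_label = replay_from(pg_control_checkpoint)

missing = len(with_label) - len(without_label)
assert missing == 150
print(f"without backup_label, {missing} WAL records are never replayed")
```

Every skipped record is a change that made it into some copied files (like our deliberately re-copied index) but not others, which is exactly the table/index mismatch seen above.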
Creating a corrupt database with pg_resetwal
While the database is under load from pgbench, we crash it with:

$ pg_ctl stop -m immediate -D data
Then we run pg_resetwal:

$ pg_resetwal -D data
The database server was not shut down cleanly.
Resetting the write-ahead log might cause data to be lost.
If you want to proceed anyway, use -f to force reset.

$ pg_resetwal -f -D data
Write-ahead log reset
Then we start the server and use amcheck like before to check the index for integrity:
postgres=# CREATE EXTENSION amcheck;
CREATE EXTENSION
postgres=# SELECT bt_index_parent_check('pgbench_accounts_pkey', TRUE, TRUE);
WARNING:  concurrent delete in progress within table "pgbench_accounts"
ERROR:  could not access status of transaction 51959
DETAIL:  Could not read from file "pg_subtrans/0000" at offset 204800: read too few bytes.
CONTEXT:  while checking uniqueness of tuple (1,1) in relation "pgbench_accounts"
What happened? pg_resetwal is only safe to use on a cluster that was shut down cleanly. The option -f is intended as a last-ditch effort to get a corrupted server to start and salvage some data. Only experts should use it.
Creating a corrupt database with pg_upgrade
We create a second cluster with initdb:
$ initdb -E UTF8 --locale=C -U postgres data2
Then we edit postgresql.conf and choose a different port number. After shutting down the original cluster, we run an “upgrade” in link mode:
$ pg_upgrade -d /home/laurenz/data -D /home/laurenz/data2 \
>            -b /usr/pgsql-15/bin -B /usr/pgsql-15/bin -U postgres --link
Performing Consistency Checks
...
Performing Upgrade
...
Adding ".old" suffix to old global/pg_control               ok

If you want to start the old cluster, you will need to remove
the ".old" suffix from /home/laurenz/data/global/pg_control.old.
Because "link" mode was used, the old cluster cannot be safely
started once the new cluster has been started.
...
Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade.
Once you start the new server, consider running:
    /usr/pgsql-15/bin/vacuumdb -U postgres --all --analyze-in-stages

Running this script will delete the old cluster's data files:
    ./delete_old_cluster.sh
pg_upgrade renamed the control file of the old cluster, so that it cannot get started by accident. We’ll undo that:
$ mv /home/laurenz/data/global/pg_control.old \
>    /home/laurenz/data/global/pg_control
Now we can start both clusters and run pgbench on both. Soon we will see error messages like these:
ERROR:  unexpected data beyond EOF in block 1 of relation base/5/16397
HINT:  This has been seen to occur with buggy kernels; consider updating your system.

ERROR:  duplicate key value violates unique constraint "pgbench_accounts_pkey"
DETAIL:  Key (aid)=(8040446) already exists.

WARNING:  could not write block 13 of base/5/16404
DETAIL:  Multiple failures --- write error might be permanent.

ERROR:  xlog flush request 0/98AEE3E0 is not satisfied --- flushed only to 0/648CDC58
CONTEXT:  writing block 13 of relation base/5/16404

ERROR:  could not access status of transaction 39798
DETAIL:  Could not read from file "pg_subtrans/0000" at offset 155648: read too few bytes.
What happened? Since “link” mode makes the old and the new cluster share their data files, we ended up with two servers running on the same data files. This is bound to lead to data corruption.
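The failure mode can be sketched with a toy model (plain Python, not PostgreSQL code; the page contents are invented): each server keeps its own buffer cache and writes pages back to disk without knowing about the other.

```python
import os
import tempfile

# Toy model of two database servers operating on the same data file.
# Each "server" caches the page privately and flushes it independently.

path = os.path.join(tempfile.mkdtemp(), "page")
with open(path, "wb") as f:
    f.write(b"balance=100")          # initial on-disk page

# Both servers read the same page into their own shared buffers:
with open(path, "rb") as f:
    buffer_a = f.read()
with open(path, "rb") as f:
    buffer_b = f.read()

# Server A applies and commits a change, then flushes its buffer:
buffer_a = b"balance=200"
with open(path, "wb") as f:
    f.write(buffer_a)

# Server B knows nothing of A's change; when it evicts its stale
# buffer, it overwrites A's committed update on disk:
with open(path, "wb") as f:
    f.write(buffer_b)

with open(path, "rb") as f:
    final = f.read()
assert final == b"balance=100"       # A's committed change is gone
print("server A's committed change was silently overwritten")
```

Real servers also clobber each other's WAL, free space maps, and transaction status, which is why PostgreSQL uses the postmaster.pid lock file to prevent exactly this, a protection that pg_upgrade's renamed control file was meant to back up.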
Creating a corrupt database by manipulating data files
For that, we figure out the file name that belongs to the table pgbench_accounts:

postgres=# SELECT relfilenode FROM pg_class
           WHERE relname = 'pgbench_accounts';
 relfilenode 
═════════════
       16396
(1 row)
Now we stop the server and write some garbage into the first data block:
$ yes 'this is garbage' | dd of=data/base/5/16396 bs=1024 seek=2 count=1 conv=notrunc
0+1 records in
0+1 records out
1024 bytes (1.0 kB, 1.0 KiB) copied, 0.00031255 s, 3.3 MB/s
Then we start the server and try to select from the table:
postgres=# TABLE pgbench_accounts ;
ERROR:  compressed pglz data is corrupt
What happened? We tampered with the data files, so it’s unsurprising that the table is corrupted.
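As a quick sanity check on the dd invocation above, a little arithmetic (a back-of-the-envelope Python sketch, not part of the procedure) shows which page the garbage landed in:

```python
# dd used bs=1024 and seek=2, so it wrote 1024 bytes starting at byte
# offset 2 * 1024 = 2048. PostgreSQL pages are 8192 bytes by default,
# so that range falls entirely inside block 0 - the first data block.

BLOCK_SIZE = 8192                 # default PostgreSQL page size
offset = 2 * 1024                 # where dd started writing
length = 1024                     # how much it wrote

first_block = offset // BLOCK_SIZE
last_block = (offset + length - 1) // BLOCK_SIZE
assert first_block == last_block == 0
print(f"bytes {offset}..{offset + length - 1} lie in block {first_block}")
```

That is why the overwritten bytes hit tuple data in the middle of the first page rather than the page header, producing a decompression error instead of an "invalid page" error.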
Creating a corrupt database with catalog modifications
Who needs ALTER TABLE to drop a table column? We can simply run:
DELETE FROM pg_attribute
WHERE attrelid = 'pgbench_accounts'::regclass
  AND attname = 'bid';
After that, an attempt to query the table will result in an error:
ERROR: pg_attribute catalog is missing 1 attribute(s) for relation OID 16396
What happened? We ignored that dropping a column sets attisdropped to TRUE in pg_attribute rather than actually removing the entry. Moreover, we didn’t check for dependencies in pg_depend, nor did we properly lock the table against concurrent access. Modifying catalog tables is unsupported, and if it breaks the database, you get to keep both pieces.
We have seen a number of ways in which you can corrupt a PostgreSQL database. Some of these were obvious, some might surprise the beginner. If you don’t want a corrupted database,
- don’t mess with the system catalogs
- never modify anything in the data directory (with the exception of configuration files)
- don’t run with fsync = off
- don’t call pg_resetwal -f on a crashed server
- remove the old cluster after an upgrade with pg_upgrade in link mode
- don’t delete or omit the backup_label file of a base backup
- run a supported version of PostgreSQL to avoid known software bugs
- run on reliable hardware
I hope you can save some databases with this information! If you’d like to know more about troubleshooting PostgreSQL performance, read my post on join strategies.