I received a question about how to fork PostgreSQL databases the way you can on Heroku. As I did not find any good examples of how to do this, I decided to do a quick write-up.

Forking a database means taking a copy of a database in which you can make changes that don’t affect the original source database. The obvious solution is to just take a copy of the database (also known as a backup) and start a new database instance on that copy. There is plenty of documentation on backup procedures, so I will not dwell on this point. The issue with the obvious solution is performance: copying a large amount of data takes a lot of time. If we want to use database forks to refresh our staging or (confidentiality requirements permitting) testing environments, we can’t afford to wait several hours each time we do this.

We need to somehow take a copy without actually spending the time to copy everything. The good news is that with a little help from your storage layer this is possible. The technical term for what we are looking for is a copy-on-write snapshot, a feature provided by specialized storage devices, but also in software by the Linux LVM layer and by the Btrfs and ZFS filesystems.

For this demonstration I will go with Btrfs: it is the simplest to set up, and it is what I happened to have lying around. I have a 99% full RAID1 Btrfs filesystem mounted at /mnt/data, backed by three Western Digital Green spinning disks (<6000 RPM). So basically a storage system about as slow as you can make it.

First I set up a database to run our tests on. I create a new subvolume for the master database, initialize a new PostgreSQL cluster in it, and start it up on port 6000.

/mnt/data/dbforking$ btrfs subvolume create master
Create subvolume './master'
/mnt/data/dbforking$ initdb master/
     ... snipped
/mnt/data/dbforking$ sed -i 's/#port =.*/port = 6000/' master/postgresql.conf
/mnt/data/dbforking$ pg_ctl -D master/ -l master/postgres.log start
server starting
/mnt/data/dbforking$ createdb -p 6000
/mnt/data/dbforking$ psql -p 6000 -c "SELECT 'hello world'"
  ?column?   
-------------
 hello world
(1 row)

Now that we have a master server running, let’s generate some dummy data in it. I will use pgbench to do this.

/mnt/data/dbforking$ time pgbench -p 6000 -i -s 1000
    ... few minutes of progress reports go by
100000000 of 100000000 tuples (100%) done (elapsed 173.58 s, remaining 0.00 s).
set primary keys...

real    8m35.011s
user    0m21.746s
sys     0m0.739s
/mnt/data/dbforking$ du -sh master
15G     master/

At this point I have a master database with 15GB of data in it. Let’s also modify some data in the master so we can track our forks. To do this, we will set the filler column on a row in the pgbench_branches table.

/mnt/data/dbforking$ psql -p 6000 -c "UPDATE pgbench_branches SET filler = 'master before forking' WHERE bid = 1"
/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |        filler         
-----+----------+-----------------------
   1 |        0 | master before forking
(1 row)

The normal PostgreSQL hot backup procedure is to start your backup with pg_start_backup(), copy the database contents over, end the backup with pg_stop_backup(), and then copy the xlogs over. However, Btrfs supports atomic snapshots. With an atomic snapshot we can just create the snapshot and get the exact same on-disk contents we would have had if we had SIGKILL’ed PostgreSQL at that point in time and taken a regular copy. PostgreSQL’s durability mechanisms ensure that recovery will produce a consistent state that includes every transaction that had successfully committed at that moment, and nothing more. The backup management commands are not required in our case.
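For contrast, the traditional procedure could be sketched roughly like this. This is only an illustration: the function name, ports and paths are hypothetical, and a real base backup also needs WAL archiving configured. pg_start_backup() and pg_stop_backup() are the actual PostgreSQL administration functions of this era.

```shell
# Sketch of the classic hot-backup dance that atomic snapshots let us skip.
# DATA_DIR and BACKUP_DIR are hypothetical paths; adapt to your setup.
classic_hot_backup() {
    DATA_DIR=$1
    BACKUP_DIR=$2
    # Tell PostgreSQL a base backup is starting (forces a checkpoint)
    psql -p 6000 -c "SELECT pg_start_backup('fork-backup')"
    # Copy the cluster files while it keeps serving traffic
    rsync -a --exclude pg_xlog "$DATA_DIR/" "$BACKUP_DIR/"
    # End backup mode
    psql -p 6000 -c "SELECT pg_stop_backup()"
    # Finally copy over the transaction logs generated during the copy
    rsync -a "$DATA_DIR/pg_xlog/" "$BACKUP_DIR/pg_xlog/"
}
```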

To make things more interesting, I will start a workload on the master in a second shell to demonstrate that we can easily do this on a production database. 16 clients trying to keep a steady load of 50 transactions per second should do it. 50 transactions per second is not a huge load, but it is a considerable challenge for the slow storage system we are using here.

/mnt/data/dbforking$ pgbench -p 6000 --rate=50 --client=16 --progress=5 --time=1200
starting vacuum...end.
progress: 5.0 s, 44.6 tps, lat 129.484 ms stddev 30.302, lag 0.304 ms

We are ready to fork the database, let’s go ahead and do it:

/mnt/data/dbforking$ time btrfs subvolume snapshot master fork1
Create a snapshot of 'master' in './fork1'

real    0m6.295s
user    0m0.000s
sys     0m0.202s
/mnt/data/dbforking$ du -sh fork1/
15G     fork1/

And the performance stats from that point in time:

progress: 30.0 s, 54.5 tps, lat 134.819 ms stddev 34.012, lag 0.500 ms
progress: 35.1 s, 44.4 tps, lat 199.910 ms stddev 75.235, lag 3.243 ms
progress: 40.1 s, 40.5 tps, lat 1281.642 ms stddev 791.303, lag 970.009 ms
progress: 45.0 s, 62.1 tps, lat 349.229 ms stddev 313.576, lag 145.631 ms
progress: 50.0 s, 50.0 tps, lat 146.155 ms stddev 45.599, lag 2.543 ms
progress: 55.1 s, 53.0 tps, lat 146.554 ms stddev 40.694, lag 0.562 ms

So we managed to fork a 15 GB database in 6 seconds with only a small hiccup in performance. (du still reports the full 15 GB for the snapshot because it counts shared extents; thanks to copy-on-write, the fork shares its data blocks with the master until one side modifies them.) We are ready to start up the forked database.

To start the fork we have a couple of things to do. First we have to remove the pid file left over from the master. Usually PostgreSQL can remove a stale pid file on its own, but because the master is still running on the same machine, the pid in the file belongs to a live process, so PostgreSQL won’t consider it safe to remove and we have to do it manually. Second, we have to configure the fork to run on a different port from the master. When this is done, we can start the server up and observe that it successfully recovers from a crash:

/mnt/data/dbforking$ rm fork1/postmaster.pid
/mnt/data/dbforking$ sed -i 's/port =.*/port = 6001/' fork1/postgresql.conf
/mnt/data/dbforking$ pg_ctl -D fork1 -l fork1/postgres.log start
/mnt/data/dbforking$ tail fork1/postgres.log
LOG:  database system was interrupted; last known up at 2015-01-09 14:47:29 EET
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/7086E90
LOG:  record with zero length at 0/8BDD888
LOG:  redo done at 0/8BDD858
LOG:  last completed transaction was at log time 2015-01-09 14:51:44.56637+02
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Next we can verify that we indeed have two different databases running. Let’s update our tracking row in both the master and fork databases and verify that they are independent.

Let’s make a change on the forked database:

/mnt/data/dbforking$ psql -p 6001 -c "UPDATE pgbench_branches SET filler = 'fork1 after forking' WHERE bid = 1"
/mnt/data/dbforking$ psql -p 6001 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |        filler         
-----+----------+-----------------------
   1 |     1025 | fork1 after forking
(1 row)

And check that master database still has the old data (except the bbalance column that pgbench has updated):

/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |        filler         
-----+----------+-----------------------
   1 |   -33546 | master before forking
(1 row)

With this we have a recipe for successfully forking a running production database with only a minor hiccup. You probably don’t want to run your staging tests or dev environment on the same machine as the master, though. To have your forks on a separate machine, run a streaming replication standby on your staging or dev environment and fork new databases off that standby. Just don’t forget to remove or replace recovery.conf, otherwise your snapshot will keep running as a standby.
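The local fork procedure above can be wrapped in a small helper. This is only a sketch under this post’s assumptions (the cluster lives on its own Btrfs subvolume, and the port is set directly in postgresql.conf); fork_db is a hypothetical name, not an existing tool.

```shell
# Sketch: fork a running PostgreSQL cluster that lives on a Btrfs subvolume.
# Arguments: source subvolume, fork directory, port for the fork.
fork_db() {
    SRC=$1; DST=$2; PORT=$3
    # Atomic copy-on-write snapshot of the whole data directory
    btrfs subvolume snapshot "$SRC" "$DST"
    # The copied pid file still points at the live master process
    rm -f "$DST/postmaster.pid"
    # Run the fork on its own port
    sed -i "s/^#\?port =.*/port = $PORT/" "$DST/postgresql.conf"
    # If the source was a streaming standby, drop its recovery settings
    # so the fork does not keep following the master
    rm -f "$DST/recovery.conf"
    # Start the fork; it will perform normal crash recovery
    pg_ctl -D "$DST" -l "$DST/postgres.log" start
}
```

Calling it would look like `fork_db master fork1 6001`.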

It’s also advisable to switch the fork to a new timeline so PostgreSQL knows it’s a fork and will give you errors if you try to do something stupid, like having the fork replicate from the master (or vice versa). To do this, create a recovery.conf that contains the single line restore_command = '/bin/false'. This switches PostgreSQL into point-in-time recovery mode (as opposed to regular crash recovery), creating a timeline switch at the end of the transaction log. /bin/false is there to signify that there is no archive to fetch additional transaction logs from.
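Creating that file is a one-liner; fork1 here stands for the snapshot directory from this post, so substitute your own path:

```shell
# Force a timeline switch on the fork: a restore_command that always
# fails puts PostgreSQL into point-in-time recovery, which ends with
# a new timeline instead of plain crash recovery.
mkdir -p fork1   # stand-in for the snapshot directory; normally it already exists
echo "restore_command = '/bin/false'" > fork1/recovery.conf
cat fork1/recovery.conf
```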

Happy forking.