CYBERTEC Logo

Forking databases - the art of copying without copying

01.2015 / Category: , / Tags: |

I received a question about how to fork PostgreSQL databases like you can do on Heroku. As I did not find any good examples on how to do this, I decided to do a quick write up.

What does "fork databases" mean?

Forking a database means taking a copy of a database where you can make changes that don't affect the original source database. The obvious solution here is to just take a copy of a database (also known as a backup) and start a new database instance on the copy. There are enough articles and documentation for backup procedures so I will not dwell on this point. The issue with the obvious solution is performance - copying a large amount of data takes a lot of time. If we want to use database forks to refresh our staging or (confidentiality requirements permitting) testing environments we can't afford to wait several hours each time we do this.

We need to somehow take a copy without actually taking the time to make a copy of everything. The good news is that with a little help from your storage layer this is possible. The technical term for what we are looking for is a copy-on-write snapshot. A feature provided by specialized storage devices, but also in software by Linux LVM layer and Btrfs and ZFS filesystems.

For this demonstration I will go with Btrfs, it is the simplest to set up and also this is what I happened to have lying around. I have a 99% filled up RAID1 Btrfs filesystem mounted at /mnt/data, backed by three Western Digital Green spinning disks (<6000RPM). So basically a storage system as slow as you can make it. First I set up a database to run our tests on. I create a new subvolume for the master database, create a new PostgreSQL database in it and start it up on port 6000.

Now that we have a master server running, let's generate some dummy data in it. I will use pgbench to do this.

At this point I have a master database with 15GB of data in it. Let's also modify some data in the master so we can track our forks. We will set the filler column on a row in pgbench_branches table to do this.

Normal PostgreSQL hot backup procedure is to start your backup with pg_start_backup(), copy database contents over, end backup with pg_stop_backup() and then copy xlogs over. However Btrfs supports atomic snapshots. With atomic snapshot we can just create a snapshot and have the exact same contents we would have had if we had SIGKILL'ed PostgreSQL at that point in time and taken a regular copy. PostgreSQL durability mechanisms ensure that we will get a consistent state that includes everything that has successfully committed at that time and nothing more. The backup management commands are not required in our case.

To make things more interesting, I will start a workload on the master in a second shell to demonstrate that we can easily do this on a production database. 16 clients trying to keep a steady load of 50 transactions per second should do it. 50 transactions a second is quite slow, but a considerable challenge for the slow storage system we are using here.

We are ready to fork the database, let's go ahead and do it:

And the performance stats from that point in time:

So we managed to take fork a 15 GB database in 6 seconds with only a small hiccup in performance. We are ready to start up the forked database.

To start the fork we have couple of things to do. First we have to remove the pid file of the master. Usually PostgreSQL can remove it on its own, but because the master is still running on the same machine it isn't convinced it is safe in this case and we have to do it manually. Second we have to configure the fork to run on a different port from the master. When this is done, we can start the server up and observe that it successfully recovers from a crash:

Next we can verify that we indeed have two different databases running. Let's update our tracking row in both the master and fork databases and verify that they are independent.

Let's make change on the forked database:

And check that master database still has the old data (except the bbalance column that pgbench has updated):

With this we have a recipe for successfully forking a running production database with only a minor hiccup. You probably don't want to run your staging tests or dev environment on the same machine as the master. To have your forks on a separate machine you would need a streaming replication standby running on your staging or dev environment and then fork new databases off the streaming standby. Just don't forget to remove/replace recovery.conf otherwise your snapshot will still be running as a standby.

It's also advisable to switch the fork to a new timeline so PostgreSQL knows it's a fork and will give you errors if you try to do something stupid, like have the fork replicate from master (or vice versa). To do this create a recovery.conf that contains the single line restore_command = '/bin/false'. This will switch PostgreSQL to point-in-time-recovery mode (as opposed to regular crash recovery), creating a timeline switch at the end of transaction log. /bin/false is there to signify that there is no archive to fetch additional transaction logs from.

Happy forking.

In case you need any assistance, please feel free to contact us.
 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
2 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Anshuman Aggarwal
Anshuman Aggarwal
6 years ago

Needed it. Found it. Loved it.

Awesome solution and since we are already using btrfs, works perfectly.

Thanks

Ryan Long
Ryan Long
4 years ago

This is forking awesome! Thank you.

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    2
    0
    Would love your thoughts, please comment.x
    ()
    x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram