During this year's pgconf.eu we displayed a “cluster in a box” demo case. Many of you asked how we built it, so here is a blog post with all the details.

Cluster in a box

The goal was to provide a hands-on experience of injecting failures into a high-availability cluster and to show off the resiliency and self-healing capabilities of Patroni. Any visitor to our booth could try to bring down the cluster by using the big red shiny switches to cut power to any node. Patroni held up to the task magnificently: it swiftly handled failover and automatically recovered whenever we would expect it to. Manual intervention was only required in a couple of cases where the lack of power-loss protection on our consumer-level SSDs caused database corruption, which was fixed by simply erasing the broken database and letting Patroni reinitialize it.

What is Patroni

Patroni overview

Patroni is an open source tool for building highly available clusters. It is not a complete solution for all your problems, and that is a good thing. Different companies have very different preferences about what an ideal cluster architecture looks like, depending on the number of databases managed, whether they run on hardware, virtual machines or containers, the network topology, durability vs. availability trade-offs, and so on. Patroni is a good building block for a large variety of different cluster architectures. Its role is to be a cluster manager: to make sure that there is always a PostgreSQL master in the cluster, but never more than one. It outsources the hard parts of building an HA cluster to battle-tested and proven tools and integrates them into a whole that is bigger than the sum of its parts.

The first hard problem is replication. To be highly available you need multiple copies of the database. For this Patroni relies on PostgreSQL streaming replication, which has proven to be very reliable and very fast and serves well as the core building block.
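For reference, the replication setup that Patroni generates and manages for you boils down to a standby recovery.conf along these lines (the host and credentials here are made-up placeholders):

```ini
# recovery.conf on a standby (pre-PostgreSQL 12; Patroni writes this for you)
standby_mode = 'on'
primary_conninfo = 'host=10.0.0.1 port=5432 user=replicator password=secret'
```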

The second hard problem is getting cluster-wide agreement on who is the master node. Patroni solves it by delegating to an external purpose-built tool, in Patroni terminology called a distributed consensus store. This consensus store is used to arbitrate leader election and to store cluster state and cluster-wide configuration. Several providers are supported: etcd, Consul, ZooKeeper and Exhibitor. Soon the Kubernetes API will also be usable for consensus. An external consensus store gives us a battle-tested implementation providing clear semantics to the rest of the system, and allows consensus and data nodes to be deployed separately.

The third outsourced problem is routing client connections to the current master node. This can be done in many different ways. Consul or similar products can be used for DNS-based routing. Patroni provides a health-check endpoint to integrate with a TCP/IP load balancer, such as HAProxy, F5 BIG-IP or AWS ELB. If the servers are on the same L2 network, a virtual IP can be moved around based on cluster state. Starting from PostgreSQL 10 you can specify multiple hosts in the connection string and let the client library pick the master server. Or you can roll your own by integrating the Patroni health-check API into your application's connection management.
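As a sketch of the load balancer option: Patroni's REST API answers HTTP 200 on /master only on the current master, so HAProxy can use that as a health check and route all writes to the right node. The node addresses and the REST API port 8008 below are assumptions for illustration:

```
listen postgres-master
    bind *:5432
    option httpchk GET /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 10.0.0.1:5432 check port 8008
    server node2 10.0.0.2:5432 check port 8008
    server node3 10.0.0.3:5432 check port 8008
```

With the PostgreSQL 10 client-side approach you would instead connect with a multi-host URL such as postgresql://10.0.0.1,10.0.0.2,10.0.0.3/postgres?target_session_attrs=read-write and skip the load balancer entirely.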


External quorum Patroni deployment
Small Patroni deployment

For the demonstration cluster we picked a bare-metal deployment with a separate etcd, mostly because having external consensus allowed for more interesting demonstrations with multiple nodes failing simultaneously. This setup is common in cases where the organization already has an etcd deployment, or there is a plan to deploy multiple database clusters. In our case the etcd cluster was simply the head node laptop running a single etcd instance. For real HA deployments you would want three or more etcd servers, each in a separate failure domain with no single point of failure. Client connection routing was done by HAProxy running on the same head node.
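A minimal patroni.yml for one data node in a setup like this could look roughly as follows (the scope, node names, addresses and credentials are invented for the example):

```yaml
scope: demo-cluster        # shared by all members of the cluster
name: node1                # unique per node

etcd:
  host: 10.0.0.10:2379     # the external consensus store (head node)

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.0.1:8008

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.0.1:5432
  data_dir: /var/lib/postgresql/data
  authentication:
    replication:
      username: replicator
      password: secret
```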

But the flexibility of Patroni means that when you just want a small database cluster with the smallest maintenance overhead, you can deploy etcd directly on the database nodes. You still need 3 nodes – it is not possible to make a reliable HA cluster with fewer. But the third node can be a tiny VM that just runs the etcd process. Just don't put that VM on the same physical server as one of the databases; if you lose that server, your cluster will go into read-only mode.

The hardware

Failure Injection Unit

The hardware for our demo cluster was three Intel NUCs, each with a Core i3 CPU, 8 GB of memory and a 256 GB SSD. Much more than needed for this simple demo, but we foresee greater things for them.

The “Failure Injection Unit” is a small plastic box that simply adds a nice and hefty toggle switch to the DC power coming out of the power bricks that came with the NUCs. There was no way we could resist also adding big red toggle switch guards.

What Patroni does for you

As the cluster nodes first boot up, they race to initialize the cluster. One of them will succeed, run initdb and obtain the master lease. The other nodes will fetch a base backup and start replicating from the master. Each node then polls the state of its local PostgreSQL database, restarting it if necessary.

If the master node crashes or becomes disconnected, its lease will expire and the other nodes will be woken up. They will coordinate among each other to make sure the node with the most transaction log available gets promoted. That node will obtain the new master lease, and the other nodes will switch to replicating from it.

For the demonstration cluster we set the master lease TTL to 5 seconds, loop_wait and retry_timeout to 1 second, and the HAProxy check and connect timeouts to 1 second. These rather aggressive settings gave us failover times of under 10 seconds (measured from the last successful commit on the failed node to the next successful commit). The default settings usually fail over within 40 seconds, but are much more resilient to temporary network issues. Given the relative probabilities of a node failing outright versus the network going down for a couple of seconds, most people are best served by the defaults.
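In Patroni these knobs live in the dynamic configuration stored in the consensus store (the bootstrap.dcs section of patroni.yml seeds them). The demo values versus the usual defaults:

```yaml
# demo settings (defaults in comments)
ttl: 5            # default 30 seconds
loop_wait: 1      # default 10 seconds
retry_timeout: 1  # default 10 seconds
```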

If the old master comes back online, it may have some unreplicated changes that need to be backed out before it can rejoin the cluster. Patroni automatically determines when this is needed based on the timeline history and runs pg_rewind for you before rejoining the node to the cluster.

You can pick your own trade-off between availability and durability by tuning maximum_lag_on_failover, master_start_timeout and synchronous_mode. The default settings use standard PostgreSQL asynchronous replication, but don't allow a failover if a replica is more than 1 MB behind. If you turn on synchronous_mode, Patroni will set PostgreSQL up for synchronous replication, but will not automatically fail over if there is any chance of losing transactions.
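These settings also live in the dynamic configuration; a sketch with the default-ish values spelled out:

```yaml
maximum_lag_on_failover: 1048576  # bytes a replica may lag and still be promoted (1 MB)
master_start_timeout: 300         # seconds to wait for a failed master to recover before failing over
synchronous_mode: false           # true makes Patroni manage synchronous replication
```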

You can also use the backup system integrations to initialize new nodes from a backup instead of the master, or to initialize the whole cluster from a backup. You can designate some nodes as special nodes that should not be failed over to. Should you want it, there is the possibility of enabling an extra layer of protection with a watchdog, providing split-brain protection in the face of bugs and operational errors, like some solutions do with a much more complicated STONITH setup.

All in all, we have been extremely satisfied with our deployments of Patroni. It has the best property a piece of software can have – it just works.

I received a question about how to fork PostgreSQL databases the way you can on Heroku. As I did not find any good examples of how to do this, I decided to do a quick write-up.

Forking a database means taking a copy of a database in which you can make changes that don't affect the original source database. The obvious solution is to just take a copy of the database (also known as a backup) and start a new database instance on the copy. There are plenty of articles and documentation on backup procedures, so I will not dwell on this point. The issue with the obvious solution is performance – copying a large amount of data takes a lot of time. If we want to use database forks to refresh our staging or (confidentiality requirements permitting) testing environments, we can't afford to wait several hours each time we do this.

We need to somehow take a copy without actually spending the time to copy everything. The good news is that with a little help from your storage layer this is possible. The technical term for what we are looking for is a copy-on-write snapshot – a feature provided by specialized storage devices, but also in software by the Linux LVM layer and the Btrfs and ZFS filesystems.

For this demonstration I will go with Btrfs: it is the simplest to set up and also what I happened to have lying around. I have a 99% full RAID1 Btrfs filesystem mounted at /mnt/data, backed by three Western Digital Green spinning disks (under 6000 RPM). So basically a storage system as slow as you can make it.

First I set up a database to run our tests on. I create a new subvolume for the master database, create a new PostgreSQL database in it and start it up on port 6000.

/mnt/data/dbforking$ btrfs subvolume create master
Create subvolume './master'
/mnt/data/dbforking$ initdb master/
     ... snipped
/mnt/data/dbforking$ sed -i 's/#port =.*/port = 6000/' master/postgresql.conf
/mnt/data/dbforking$ pg_ctl -D master/ -l master/postgres.log start
server starting
/mnt/data/dbforking$ createdb -p 6000
/mnt/data/dbforking$ psql -p 6000 -c "SELECT 'hello world'"
 hello world
(1 row)

Now that we have a master server running, let’s generate some dummy data in it. I will use pgbench to do this.

/mnt/data/dbforking$ time pgbench -p 6000 -i -s 1000
    ... few minutes of progress reports go by
100000000 of 100000000 tuples (100%) done (elapsed 173.58 s, remaining 0.00 s).
set primary keys...

real    8m35.011s
user    0m21.746s
sys     0m0.739s
/mnt/data/dbforking$ du -sh master
15G     master/

At this point I have a master database with 15 GB of data in it. Let's also modify some data in the master so we can track our forks; we will set the filler column on a row in the pgbench_branches table.

/mnt/data/dbforking$ psql -p 6000 -c "UPDATE pgbench_branches SET filler = 'master before forking' WHERE bid = 1"
/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |        filler         
-----+----------+-----------------------
   1 |        0 | master before forking
(1 row)

The normal PostgreSQL hot backup procedure is to start your backup with pg_start_backup(), copy the database contents over, end the backup with pg_stop_backup() and then copy the xlogs over. However, Btrfs supports atomic snapshots. With an atomic snapshot we can just create the snapshot and get exactly the contents we would have had if we had SIGKILL'ed PostgreSQL at that point in time and taken a regular copy. PostgreSQL's durability mechanisms ensure that we get a consistent state that includes everything that had successfully committed at that time and nothing more. The backup management commands are not required in our case.

To make things more interesting, I will start a workload on the master in a second shell to demonstrate that we can easily do this on a production database. 16 clients trying to keep a steady load of 50 transactions per second should do it. 50 transactions per second is not that much, but it is a considerable challenge for the slow storage system we are using here.

/mnt/data/dbforking$ pgbench -p 6000 --rate=50 --client=16 --progress=5 --time=1200
starting vacuum...end.
progress: 5.0 s, 44.6 tps, lat 129.484 ms stddev 30.302, lag 0.304 ms

We are ready to fork the database, let’s go ahead and do it:

/mnt/data/dbforking$ time btrfs subvolume snapshot master fork1
Create a snapshot of 'master' in './fork1'

real    0m6.295s
user    0m0.000s
sys     0m0.202s
/mnt/data/dbforking$ du -sh fork1/
15G     fork1/

And the performance stats from that point in time:

progress: 30.0 s, 54.5 tps, lat 134.819 ms stddev 34.012, lag 0.500 ms
progress: 35.1 s, 44.4 tps, lat 199.910 ms stddev 75.235, lag 3.243 ms
progress: 40.1 s, 40.5 tps, lat 1281.642 ms stddev 791.303, lag 970.009 ms
progress: 45.0 s, 62.1 tps, lat 349.229 ms stddev 313.576, lag 145.631 ms
progress: 50.0 s, 50.0 tps, lat 146.155 ms stddev 45.599, lag 2.543 ms
progress: 55.1 s, 53.0 tps, lat 146.554 ms stddev 40.694, lag 0.562 ms

So we managed to fork a 15 GB database in 6 seconds, with only a small hiccup in performance. We are ready to start up the forked database.

To start the fork we have a couple of things to do. First we have to remove the master's pid file. Usually PostgreSQL can remove it on its own, but because the master is still running on the same machine, it isn't convinced it is safe to do so in this case and we have to do it manually. Second, we have to configure the fork to run on a different port from the master. When this is done, we can start the server up and observe that it successfully recovers from a "crash":

/mnt/data/dbforking$ rm fork1/postmaster.pid
/mnt/data/dbforking$ sed -i 's/port =.*/port = 6001/' fork1/postgresql.conf
/mnt/data/dbforking$ pg_ctl -D fork1 -l fork1/postgres.log start
/mnt/data/dbforking$ tail fork1/postgres.log
LOG:  database system was interrupted; last known up at 2015-01-09 14:47:29 EET
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/7086E90
LOG:  record with zero length at 0/8BDD888
LOG:  redo done at 0/8BDD858
LOG:  last completed transaction was at log time 2015-01-09 14:51:44.56637+02
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Next we can verify that we indeed have two different databases running. Let’s update our tracking row in both the master and fork databases and verify that they are independent.

Let’s make change on the forked database:

/mnt/data/dbforking$ psql -p 6001 -c "UPDATE pgbench_branches SET filler = 'fork1 after forking' WHERE bid = 1"
/mnt/data/dbforking$ psql -p 6001 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |       filler        
-----+----------+---------------------
   1 |     1025 | fork1 after forking
(1 row)

And check that master database still has the old data (except the bbalance column that pgbench has updated):

/mnt/data/dbforking$ psql -p 6000 -c "SELECT * FROM pgbench_branches WHERE bid = 1"
 bid | bbalance |        filler         
-----+----------+-----------------------
   1 |   -33546 | master before forking
(1 row)

With this we have a recipe for successfully forking a running production database with only a minor hiccup. You probably don't want to run your staging tests or dev environment on the same machine as the master, though. To have your forks on a separate machine, you would need a streaming replication standby running in your staging or dev environment, and then fork new databases off that standby. Just don't forget to remove or replace recovery.conf, otherwise your snapshot will still run as a standby.

It is also advisable to switch the fork to a new timeline, so PostgreSQL knows it is a fork and will give you errors if you try to do something stupid, like having the fork replicate from the master (or vice versa). To do this, create a recovery.conf that contains the single line restore_command = '/bin/false'. This switches PostgreSQL to point-in-time-recovery mode (as opposed to regular crash recovery), creating a timeline switch at the end of the transaction log. /bin/false signifies that there is no archive to fetch additional transaction logs from.

Happy forking.