During this years pgconf.eu we displayed a “cluster in a box” demo case. Many of you inquired about how we built it, so here is a blog post with all details.

Cluster in a box

The goal was to provide a hands on experience of injecting failures to a high availability cluster and to show off the resiliency and self-healing capabilities of Patroni. Any visitor to our booth could try bringing down the cluster by using the big red shiny switches to cut power to any node. Patroni held up to the task magnificently, it swiftly handled failover and automatically recovered when we would expect it to. Manual intervention was only required in a couple of cases when the lack of power loss protection on our consumer level SSDs caused database corruption, which was swiftly fixed by just erasing the broken database and letting Patroni reinitialize it.

What is Patroni

Patroni overview

Patroni is an open source tool for building highly available clusters. It is not a complete solution for all your problems, and that is a good thing. Different companies have very different preferences on what an ideal cluster architecture looks like, depending on amount of databases managed, if they are running on hardware or virtual machines or containers, network topology, durability vs. availability and so on. Patroni is a good building block for a large variety of different cluster architectures. Its role is to be a cluster manager, to make sure that there is always a PostgreSQL master in the cluster, but never more than one. It outsources the hard parts of making a HA cluster to battle tested and proven tools and integrates them together into a whole that is bigger than the sum of its parts.

The first hard problem is that of replication. To be highly available you need multiple copies of the database. For this Patroni relies on PostgreSQL streaming replication. PostgreSQL streaming replication has proven to be very reliable and very fast and serves us well as the core building block.

The second hard problem is getting cluster wide agreement on who is the master node. Patroni solves it by delegating it to an external purpose built tool, in Patroni terminology called a distributed consensus store. This consensus store is used to arbitrate leader election, store cluster state and cluster wide configuration. Various different providers are supported: etcd, Consul, Zookeeper, Exhibitor. Soon Kubernetes API can also be used for consensus. Having an external consensus store gives a battle tested implementation providing clear semantics to rest of the system, and allows for separate deployment of consensus and data nodes.

The third outsourced problem is routing client connections to the current master node. This can be done in many different ways. Consul or similar products can be used to do DNS based routing. Patroni provides a health check endpoint to integrate with a TCP/IP load balancer, such as HAProxy, F5 BigIP or AWS ELB. If the servers are on the same L2 network a virtual IP can be moved around based on cluster state. Starting from PostgreSQL 10 you can specify multiple hosts and let client library pick the master server. Or you can roll your own by integrating Patroni health check API into your application connection management.

Deployment

External quorum Patroni deploymentSmall Patroni deployment

For the demonstration cluster we picked a bare metal deployment with a separate etcd. Mostly because having external consensus allowed for more interesting demonstrations with multiple nodes failing simultaneously. This setup would be common in cases where there already is an etcd deployment in the organization, or there is a plan to deploy multiple database clusters. In our case the etcd cluster was simply the head node laptop running a simple etcd instance. For real HA deployments you would want to have 3 or more etcd servers, each in separate failure domains with no single point of failure. Client connection routing was done by a HAProxy running on the same headnode.

But the flexibility of Patroni means that when you just want a small database cluster with smallest maintenance overhead you can just deploy etcd on the database nodes. You still need 3 nodes – it’s not possible to make a reliable HA cluster with less. But the third node can be a tiny VM that just runs the etcd process. Just don’t put the VM on the same physical server as one of the databases, if you lose that server your cluster will go into read only mode.

The hardware

Failure Injection Unit

The hardware for our demo cluster was 3 Intel NUCs with Core i3 CPUs, 8GB of memory and 256GB of SSD. Much more than needed for this simple demo, but we foresee greater things for them.

The “Failure Injection Unit” is a small plastic box that just adds an nice and hefty toggle switch to the DC power coming out of the power bricks that came with the NUCs. There was no way we could resist also adding big red toggle switch guards.

What Patroni does for you

As cluster nodes first boot up they race to initialize the cluster. One of the will succeed, run initdb and obtain the master lease. Other nodes will fetch a base backup and start replicating from the master. Each node will poll the state of local PostgreSQL database, restarting it if necessary.

If the master node crashes or becomes disconnected its lease will expire and other nodes will get woken up. They will coordinate between each other to make sure the node with most transaction log available will get promoted. That node will obtain the new master lease and other nodes will switch to replicating from it.

For the demonstration cluster we set master lease ttl to 5 seconds, loop_wait and retry_timeout to 1 second and HA proxy check and connect timeouts to 1 second. These rather aggressive settings gave us failover times of under 10s (measured from last successful commit on failed node to successful commit). The default settings usually fail over within 40 seconds, but are much more resilient to temporary network issues. Given the relative probabilities of a node failing outright and network going down for a couple of seconds, most people are best served by the defaults.

If the old master comes back online, it possibly has some unreplicated changes that need to be backed out before rejoining the cluster. Patroni will automatically determine when this is needed based on timeline history and will run pg_rewind for you before rejoining the node to the cluster.

You can pick your own tradeoff between availability and durability by tuning maximum_lag_on_failover, master_start_timeout and synchronous_mode. Default settings use PostgreSQL standard asynchronous replication settings, but don’t allow for a failover if replica is more than 1MB behind. If you turn on synchronous_mode Patroni will set PostgreSQL up for synchronous replication, but will not automatically fail over if there is any chance of losing transactions.

You can also use backup system integrations to image new nodes from a backup system. Initialize the whole cluster from a backup. Designate some nodes as special that should not be failed over to. Should you want it, there is possibility to enable an extra layer of protection with watchdog, providing split-brain protection in face of bugs and operational errors, like some solutions do with a much more complicated STONITH solution.

All in all, we have been extremely satisfied with our deployments of Patroni. It has the best property a piece of software can have – it just works.