BY Kaarel Moppel – Although Postgres is a great all-around product that can’t be beaten in price-performance ratio, there’s one area where things could be a bit better. That’s in the validation of replicas, or maybe even more widely, all clusters.
What’s the problem? After building a new replica there’s actually no way to tell in a “pushbutton” way if everything is absolutely fine with the new replica. The replica building tool pg_basebackup (there are of course some other tools/approaches but I suspect that the pg_basebackup tool has 90% of the “market”) just streams over the datafiles and transaction logs and writes them to disk and that’s it – done. After that, you usually modify the configs if needed and start your server and then see if you can connect to it and maybe also issue some “select count(*) from my_favourite_table” to verify that things are more or less plausible, before grinning in satisfaction and grabbing a coffee to tackle other burning tasks… but to what extent did we actually verify that the cluster is intact?
What could go wrong with replicas?
So what could go wrong when building up a new replica? Of course if pg_basebackup reports success then not too much actually, it is very unlikely that something is wrong. But a couple of things still bug me… but I might be a little paranoid 🙂 Just read on too see what I mean.
First the hard disks: in theory, disk errors should be already a thing of the past. Enterprise-grade disk systems should definitely have some built-in error detection. The thing is, nowadays everything is so abstracted and virtualized. You actually hardly know what’s down there. Maybe it’s some network storage? It for sure requires some drivers on the Operating System side, and also on the hardware side. As we all know, every software sadly has bugs. Google says the industry average is about 15 – 50 errors per 1000 lines of delivered code. So one cannot 100% exclude the chance of silent data corruption on writing datafiles. That’s especially true because there will typically be some new untested hardware for replicas.
Luckily, Postgres already has some remedy against such cases. The “data-checksums” flag is there for this purpose. However, it needs to be set up when initializing the cluster, and also has a small performance impact. So it’s not always used.
What else? Postgres allows building up replicas on different platforms. For example, a master on Ubuntu, or a replica on CentOS – which of course is a great feature in itself. Plus, there is also the fact that minor versions can differ, e.g. master on 9.6.2, replica by some accident on 9.6.0. Even the replication code itself can fail; I can remember at least one occasion some years ago when committed rows where marked uncommitted on replicas. This is very rare, but could technically happen again.
In short, there are some “if”-s. The question is “What can be done here?” As I said, sadly there’s currently no tool from the Postgres project (I’m not sure that there should even be one or if it can even be done) to do full verification. The only 100% reliable option is to do a SQL dump/restore. That could take hours and hours for 1TB+ databases. It’s not really an option if you need to switch over the master in a hush-hush manner.
So a good compromise, in order to have a better feeling before making the switchover, is to just dump the database! This will at least verify that there’s no silent data corruption. However, it could take hours, depending on your DB-size/hardware. There are some tricks to make it faster though – coming right up.
Speeding up dummy SQL dumps on replicas
1. One does not have to actually store the SQL dump – we just want to verify that data can be read out i.e. using “/dev/null” is an excellent performance booster. When using pg_dumpall or pg_dump in plain SQL mode (default) it’s very easy to do:
pg_dumpall -h /var/run/postgresql >/dev/null
2. Make sure you use the Unix socket for connecting to the local server instead of “localhost” TCP stack – just that simple adjustment should already give some 10-15% speedup.
3. Using multiple processes to dump the data! This is a real life-saver for bigger databases when time is of essence. One should just figure out a reasonable value for the “jobs” flag so that the IO-subsystem is pushed to its limit. NB! Multiple jobs absolutely require the “directory” output format (-Fd/–format=directory + -f/–file=dirname) for the pg_dump and sadly it brings also some issues:
- pg_dump “directory” format does not get along with /dev/null as it wants to set up a directory structure. There are of course workarounds, e.g. check out the nullfs FUSE filesystem driver but it’s tedious to say the least.
- Currently when dumping replicas with multiple processes it is not possible to obtain a consistent snapshot. For our “quick verification” goal it’s not too catastrophic though and we can ignore the initial error by providing the “–no-synchronized-snapshots” flag. This will be fixed in upcoming Postgres 10 though.
- pg_dump works “per database” only – so if you have a lot of DBs in your cluster you need some lines of Bash for example. Not a big problem though.
4. Throw together a simple custom script that handles the negative side-effects from approach nr. 3 so you don’t have to think about it. One such that I use myself can for example be found from here. It’s a simple Python script that just spawns a given number of worker processes (half of the CPUs by default) and then dumps all tables to /dev/null starting from the bigger ones – one should just specify the Unix socket as host and also Postgres “bindir” is required.
NB! Before kicking off the dumping process on the replica one should make sure that the query conflict situation has been solved – otherwise it will fail soon if there’s some activity on the master! Options are configuring hot_standby_feedback or allowing replica to “fall behind” by setting the max_standby_streaming_delay to -1 or some big value. More on that here.
That’s it, hope you made it through, and it made you to think a bit. Feedback appreciated as usual!