Although Postgres being a great all-around product that can’t be beaten in price-performance ratio, there’s one area where things could be a bit better – that’s validation of replicas, or maybe even more widely all clusters.
So the problem that not so many people know about is that after building a new replica there’s actually no way to tell in a “pushbutton” way if everything is absolutely fine with the new replica. The replica building tool pg_basebackup (there are of course some other tools/approaches but I suspect that the pg_basebackup tool has 90% of the “market”) just streams over the datafiles and transaction logs and writes them to disk and that’s it – done. After that you usually modify the configs if needed and start your server and then see if you can connect to it and maybe also issue some “select count(*) from my_favourite_table” to verify that things are more or less plausible, before grinning in satisfaction and grabbing a coffee to tackle other burning tasks… but to what extent did we actually verify that the cluster is intact?
What could go wrong with replicas?
So what could go wrong when building up a new replica? Of course if pg_basebackup reports success then not too much actually, it is very unlikely that something is wrong. But a couple of things still bug me… or of course I might be a little paranoid 🙂 But read on too see what I mean.
First the hard disks – in theory disk errors should be already a thing of the past and enterprise-grade disk systems should definitely have some built-in error detection… but the thing is that nowadays everything is so abstracted and virtualized that you actually hardly know what’s down there. Maybe it’s some network storage? It for sure then requires some drivers on Operating System side and also on the hardware side… and as we all know every software sadly has bugs – Google says industry average is about 15 – 50 errors per 1000 lines of delivered code. So one cannot 100% exclude the chance of silent data corruption on writing datafiles, especially when usually there will be some new hardware for replicas, that hasn’t proved itself yet.
Luckily though Postgres already has some remedy against such cases – the “data-checksums” flag, but this needs to be set up when initializing the cluster and also has a small performance impact so it’s not always used.
What else? Postgres allows building up replicas on different platforms (for example master on Ubuntu, replica on CentOS – which of course is a great feature in itself) plus also the fact that minor versions can differ – e.g master on 9.6.2, replica by some accident on 9.6.0. Also just the replication code itself can fail, I can remember at least one occasion some years ago when commited rows where marked uncommited on replicas. This is very rare of course but could for sure technically happen again.
In short – there are some “if”-s and the question is what can be done there? As said currently sadly there’s no tool from the Postgres project (I’m not sure that there should even be one or if it can even be done) to do full verification so the only 100% reliable option is to do a SQL dump/restore… but this of course could take hours and hours for 1TB+ databases and is not really an option if you need to switch over the master in a hush-hush manner.
So a good compromise there I think, to have even a slightly better feeling before making the switchover, is to just dump the database! This at least will verify there’s no silent data corruption. But of course also this could take hours, depending on your DB-size/hardware. There are some tricks to make it faster though – coming right up.
Speeding up dummy SQL dumps on replicas
1. One does not have to actually store the SQL dump – we just want to verify that data can be read out i.e. using “/dev/null” is an excellent performance booster. When using pg_dumpall or pg_dump in plain SQL mode (default) it’s very easy to do:
pg_dumpall -h /var/run/postgresql >/dev/null
2. Make sure you use the Unix socket for connecting to the local server instead of “localhost” TCP stack – just that simple adjustment should already give some 10-15% speedup.
3. Using multiple processes to dump the data! This is a real life-saver for bigger databases when time is of essence. One should just figure out a reasonable value for the “jobs” flag so that the IO-subsystem is pushed to it’s limit. NB! Multiple jobs absolutely require the “directory” output format (-Fd/–format=directory + -f/–file=dirname) for the pg_dump and sadly it brings also some issues:
- pg_dump “directory” format does not get along with /dev/null as it wants to set up a directory structure. There are of course workarounds, e.g. check out the nullfs FUSE filesystem driver but it’s tedious to say the least.
- Currently when dumping replicas with multiple processes it is not possible to obtain a consistent snapshot. For our “quick verification” goal it’s not too catastrophic though and we can ignore the initial error by providing the “–no-synchronized-snapshots” flag. This will be fixed in upcoming Postgres 10 though.
- pg_dump works “per database” only – so if you have a lot of DBs in your cluster you need some lines of Bash for example. Not a big problem though.
4. Throw together a simple custom script that handles the negative side-effects from approach nr. 3 so you don’t have to think about it. One such that I use myself can for example be found from here. It’s a simple Python script that just spawns a given number of worker processes (half of the CPUs by default) and then dumps all tables to /dev/null starting from the bigger ones – one should just specify the Unix socket as host and also Postgres “bindir” is requried.
NB! Before kicking off the dumping process on the replica one should make sure that the query conflict situation has been solved – otherwise it will fail soon if there’s some activity on the master! Options are configuring hot_standby_feedback or allowing replica to “fall behind” by setting the max_standby_streaming_delay to -1 or some big value. More on that here.
That’s it, hope you made it through and it made you to think a bit. Feedback appreciated as usual!