By Kaarel Moppel: So, you’re building the next unicorn startup and are thinking feverishly about a future-proof PostgreSQL architecture to house your bytes? My advice here, having seen dozens of hopelessly over-engineered / oversized solutions as a database consultant over the last 5 years, is short and blunt: Don’t overthink, and keep it simple on the database side! Instead of getting fancy with the database, focus on your application. Turn your microscope to the database only when the need actually arises, m’kay! When that day comes, first of all, try all the common vertical scale-up approaches and tricks. Try to avoid using derivative Postgres products, or employing distributed approaches, or home-brewed sharding at all costs – until you have, say, less than 1 year of breathing room available.
Wow, what kind of advice is that for 2021? I’m talking about a simple, single-node approach in the age of Big Data and hyper-scalability…I surely must be a Luddite or just still dizzy from too much New Year’s Eve champagne. Well, perhaps so, but let’s start from a bit further back…
PostgreSQL and MySQL – brothers from another mother grow up – scaling
Over the holidays, I finally had a bit of time to catch up on my tech reading / watching TODO-list (still dozens of items left though, arghh)…and one pretty good talk was on the past and current state of distributed MySQL architectures by Peter Zaitsev of Percona. Oh, MySQL??? No no, we haven’t changed “horses” suddenly, PostgreSQL is still our main focus 🙂 It’s just that in many key points pertaining to scaling, the same constraints actually also apply to PostgreSQL. After all, they’re both designed as single-node relational database management engines.
In short, I’m summarizing some ideas out of the talk, plus adding some of my own. I would like to provide some food for thought to those who are overly worried about database performance – thus prematurely latching onto some overly complex architectures. In doing so, the “worriers” sacrifice some other good properties of single-node databases – like usability, and being bullet-proof.
All distributed systems are inherently complex, and difficult to get right
If you’re new to this realm, just trust me on the above, OK? There’s a bunch of abandoned or stale projects which have tried to offer some fully or semi-automatically scalable, highly available, easy-to-use and easy-to-manage DBMS…and failed! It’s not an utterly bad thing to try though, since we can learn from it. Actually, some products are getting pretty close to the Holy Grail of distributed SQL databases (CockroachDB comes to mind first). However, I’m afraid we still have to live with the CAP theorem. Also, remember that to go from covering 99.9% of corner cases of complex architectures to covering 99.99% is not a matter of linear complexity/cost, but rather exponential complexity/cost!
Although after a certain amount of time a company like Facebook surely needs some kind of horizontal scaling, maybe you’re not there yet, and maybe stock Postgres can still provide you some years of stress-free cohabitation. Consider: Do you even have a runway for that long?
What does plain PostgreSQL offer?
* A single PostgreSQL instance can easily do hundreds of thousands of transactions per second
For example, on my (pretty average) workstation, I can do ca. 25k simple read transactions per 1 CPU core on an “in memory” pgbench dataset…with the default config for Postgres v13! With some tuning (by the way, tuning reads is much harder in Postgres than tuning writes!) I was able to increase it to ~32k TPS per core, meaning: a top-notch, dedicated hardware server can do about 1 million short reads! With reads, you can also usually employ replicas – so multiply that by 10 if needed! You then need to somehow solve the query routing problem, but there are tools for that. In some cases, the new standard LibPQ connection string syntax (target_session_attrs) can be used – with some shuffling. By the way, Postgres doesn’t limit the number of replicas, though I personally have never witnessed more than 10 replicas. With some cascading, I’m sure you could run dozens without bigger issues.
* A single node can typically do tens of thousands of write transactions per second
On my humble workstation with 6 cores (12 logical CPUs) and NVMe SSD storage, the default very write-heavy (3 UPD, 1 INS, 1 SEL) “pgbench” test greets me with a number of around 45k TPS – for example, after some checkpoint tuning – and there are even more tuning tricks available.
* A single Postgres instance can easily handle dozens of Terabytes of data
Given that you have separated “hot” and “cold” data sets, and there’s some thought put into indexing, etc., a single Postgres instance can cope with quite a lot of data. Backups and standby server provisioning, etc. will be a pain, since you’ll surely meet some physical limits even on the finest hardware. However, these issues are common to all database systems. From the query performance side, there is no reason why it should suddenly be forced to slow down!
* A single node instance is literally bullet-proof as far as data consistency is concerned
Given that 1) you declare your constraints correctly, 2) don’t fool around with some “fsync” or asynchronous commit settings, and 3) your disks don’t explode, a single node instance provides rock-solid data consistency. Again, the last item applies to any data storage, so hopefully, you have “some” backups somewhere…
* Failures are easily comprehensible – thus also recoverable
Meaning: that even if something very bad happens and the primary node is down, the worst outcome is that your application is just currently unavailable. Once you do your recovery magic (or better, let some bot like Patroni take care of that) you’re exactly where you were previously. Now compare that with some partial failure scenarios or data hashing errors in a distributed world! Believe me, when working with critical data, in a lot of cases it’s better to have a short downtime than to have to sort out some runaway datasets for days or weeks to come, which is confusing for yourself and your customers.
Tips to be prepared for scaling
In the beginning of the post, I said that when starting out, you shouldn’t worry too much about scaling from the architectural side. That doesn’t mean that you should ignore some common best practices, in case scaling could theoretically be required later. Some of them might be:
* Don’t be afraid to run your own database
This might be the most important thing on the list – with modern real hardware (or some metal cloud instances) and the full power of config and filesystem tuning and extensions, you’ll typically do just fine on a single node for years. Remember that if you get tired of running your own setup, nowadays you can always migrate to some cloud providers – with minimal downtime – via Logical Replication! If you want to know how, see here. Note that I specifically mentioned “real” hardware above, due to the common misconception that a single cloud vCPU is pretty much equal to a real one…the reality is far from that of course – my own impression over the years has been that there is around a 2-3x performance difference, depending on the provider/region/luck factor in question.
* Try to avoid the serious mistake of having your data “architecture” centered around a single huge table
You’d be surprised how often we see that…so slice and dice early, or set up some partitioning. Partitioning will also do a lot of good to the long-term health of the database, since it allows multiple autovacuum workers on the same logical table, and it can speed up IO considerably on enterprise storage. If IO indeed becomes a bottleneck at some point, you can employ Postgres native remote partitions, so that some older data lives on another node.
* Make sure to “bake in” a proper sharding key for your tables/databases
Initially, the data can just reside on a single physical node. If your data model revolves around the “millions of independent clients” concept for example, then it might even be best to start with many “sharded” databases with identical schemas, so that transferring out the shards to separate hardware nodes will be a piece of cake in the future.
There are benefits to systems that can scale 1000x from day one…but in many cases, there’s also an unreasonable (and costly) desire to be ready for scaling. I get it, it’s very human – I’m also tempted to buy a nice BMW convertible with a maximum speed of 250 kilometers per hour…only to discover that the maximum allowed speed in my country is 110, and even that during the summer months.
The thing that resonated with me from the Youtube talk the most was that there’s a definite downside to such theoretical scaling capability – it throttles development velocity and operational management efficiency at early stages! Having a plain rock-solid database that you know well, and which also actually performs well – if you know how to use it – is most often a great place to start with.
By the way, here’s another good link on a similar note from a nice Github collection and also one pretty detailed overview here about how an Alexa top 250 company managed to get by with a single database for 12 years before needing drastic scaling action!
To sum it all up: this is probably a good place to quote the classics: premature optimization is the root of all evil…