By Kaarel Moppel: Some weeks ago I wrote about some common concepts and performance hacks for (relatively) easily scaling to a terabyte cluster or more. Based on my experience visiting customers from various industries, 80% of them never even reach that threshold. However, I wanted to write another post showing that a couple of terabytes are not the end of the line for Postgres, given you're ready to roll up your sleeves and get your hands dirty. So let's look at some additional Postgres-like projects for cases where you want to apply your Postgres know-how and SQL skills to large amounts of data. Postgres is ready for multi-terabyte scaling.
Be warned, the road will now get bumpy: we usually need to change both the applications and the surrounding infrastructure. We're sharding, meaning data no longer lives on a single node, and SQL aggregates over all the data can get quirky. Also, we're mostly extending the rock-solid core of PostgreSQL with 3rd-party extensions, or using forks with their own constraints. You might have to re-define and re-import the data. You might also need to learn some new query constructs and forget some standard PostgreSQL ones. Generally, be prepared to pull out a bit of hair, if you've got any left :) But OK, here are some projects that you should know of.
Postgres extensions/derivatives for multi-terabyte scale-out
Sharding via PL/Proxy stored procedures
This kind of “old school” solution was created and battle-tested at Skype (a huge user of Postgres, by the way!), which scaled an important cluster to 32 nodes, so it obviously works pretty well. The main upside is that all data and data access is sharded for you automatically once you pick a stored procedure parameter as the shard key, and you can use all of the standard Postgres features… with the downside that, well, all data access needs to go over PL/pgSQL stored procedures, and most developers, I guess, are not so versed in that. In short, PL/Proxy is a kind of glue that routes a stored procedure call to the correct shard, so the performance penalty is minimal. It does not support Postgres 11 yet, though…
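To give a flavor of the approach, a minimal routing function could look something like the below – a sketch only, assuming a PL/Proxy cluster named 'userdb' has already been configured and that each shard defines a matching get_user_email() function (both names are hypothetical):

```sql
-- Runs on the proxy node; PL/Proxy forwards the call to one shard.
-- 'userdb' and get_user_email() are illustrative names only.
CREATE FUNCTION get_user_email(i_username text)
RETURNS SETOF text AS $$
    CLUSTER 'userdb';                -- which configured cluster of shards to use
    RUN ON hashtext(i_username);     -- pick the shard by hashing the key
$$ LANGUAGE plproxy;
```

The application then simply calls SELECT get_user_email('bob') against the proxy database and never needs to know which shard the row actually lives on.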
Sharding with Postgres-XL
Postgres-XL could perhaps be described as a “PostgreSQL-based” sharding framework, and it actually lives somewhat under the umbrella of the PostgreSQL Global Development Group. It lags a major version or two behind Postgres though, and the setup is by far not as easy as “apt install postgresql” due to the distributed nature – with coordinators, transaction managers and data nodes in the picture. It can, however, help you manage and run queries on tens of terabytes of data with relatively few restrictions! Of course there are some caveats (e.g. no triggers, restrictions on foreign keys) as with any PostgreSQL derivative. Also, you can't expect the same level of support from the community when encountering technical problems.
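Table distribution is declared right in the DDL. A rough sketch (table and column names are made up for illustration):

```sql
-- Rows are spread across the data nodes by hashing user_id
CREATE TABLE measurements (
    user_id int,
    ts      timestamptz,
    value   numeric
) DISTRIBUTE BY HASH (user_id);

-- Small lookup tables can instead be copied to every node,
-- which keeps joins against them local
CREATE TABLE countries (
    code text PRIMARY KEY,
    name text
) DISTRIBUTE BY REPLICATION;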
Nevertheless, it’s actively maintained and a good choice if you want to stay “almost” Postgres with your 50TB+ of data. Biggest cluster I’ve heard of by the way has 130 TB of data on it, so this is worth checking out!
Sharding with Citus – another form of scaling
Citus is an extension to standard Postgres and, in that sense, a bit lighter a concept than the previous contenders. It's more “up to date”, but it's not a transparent drop-in replacement for all scaling needs. It adds “distributed table” features similar to Postgres-XL, with a simpler architecture (only data and coordinator nodes). According to the documentation, it is especially well-suited for multi-tenant and “realtime analytics” use cases. It has some caveats like all the others – e.g. “shard-local” constraints, no subqueries in the WHERE clause, no window functions on sharded/partitioned tables. But defining the tables works as usual: you just need some function calls to activate the distributed behavior. The project is also under very active development, with a decent company behind it for those who require support, so this might be a good choice for your next 50 TB+ project.
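For illustration, a minimal sketch of how those function calls look with Citus (the table and column names are invented here):

```sql
-- Plain PostgreSQL DDL first
CREATE TABLE events (
    tenant_id bigint,
    event_id  bigint,
    payload   jsonb
);

-- One function call then shards the table across the worker
-- nodes, using tenant_id as the distribution column – a natural
-- fit for the multi-tenant use case
SELECT create_distributed_table('events', 'tenant_id');
```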
Greenplum – a PostgreSQL fork for Data Warehousing
Greenplum, a massively parallel processing (MPP) database system, might just be the oldest active Postgres fork alive. It started out based on version 8.2 and was developed behind closed doors for a long time; it's been Open Source for a couple of years now, making up for lost time and trying to modernize itself with features from the latest Postgres versions. The architecture seems quite a bit more complex than the above-mentioned alternatives and needs thorough studying. You'd also be giving up on some of the more advanced/recent SQL features, but I can imagine the architecture decisions were made with certain performance aspects in mind, so it might be a worthwhile trade-off. Also, behind the product stands a huge (publicly listed) consulting company named Pivotal, so again, it's a serious alternative.
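Here, too, the distribution key is declared in the DDL, loosely like this (illustrative names only):

```sql
-- Rows are assigned to Greenplum segments by hashing sale_id;
-- pick a column you commonly filter or join on
CREATE TABLE sales (
    sale_id bigint,
    region  text,
    amount  numeric
) DISTRIBUTED BY (sale_id);

-- With no natural key, rows can be spread round-robin instead
CREATE TABLE raw_imports (payload text) DISTRIBUTED RANDOMLY;
```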
To conclude – don’t be afraid to scale up with Postgres! There are quite a few options (I’m pretty sure I forgot some products) for every taste, and professional support is available.
Scaling in the documentation
One more remark to the Postgres community though – I think that making the “scaling” topic a bit more discoverable for newcomers would do a lot of good for general Postgres adoption, and adding a few words to the official documentation might even be appropriate. Currently there’s a bit on HA and replication here, but the word “scaling” is not even mentioned in that context.