Some weeks ago I wrote about some common concepts / performance hacks, how one can (relatively) easily scale to a terabyte cluster or more. And based on my experience visiting customers from various industries, 80% of them are not even reaching that threshold…but just to be clear I wanted to write another post showing that a couple of terabytes are of course not the “end station” for Postgres, given one is ready to roll up his sleeves and get “hands dirty“ so to say. So let’s look here at some additional Postgres-like projects for cases where you still want to make use of your Postgres know-how and SQL skills over big amounts of data.
But be warned, the road is getting bumpy now – we now usually need to change the applications and also the surrounding bits and we’re doing sharding, meaning data does not live on a single node anymore so SQL aggregates over all data can get quirky. Also we’re mostly extending the rock-solid core PostgreSQL with 3rd-party extensions or using forks with constraining characteristics, so you might have to re-define and re-import the data and you might need to learn some new query constructs and forget some standard PostgreSQL ones…so generally be prepared to pull out a bit of hair if you’ve got any left:) But OK, here some projects that you should know of.
Postgres extensions/derivatives for multi-terabyte scale-out
- Sharding via PL/Proxy stored procedures
This kind of “old school” solution was created and battle tested in Skype (huge user of Postgres by the way!) by scaling an important cluster to 32 nodes so it obviously works pretty well. The main upside on the other hand is that all data and data access is sharded for you automatically after you pick a stored procedure parameter as shard key and you can use all of the standard Postgres features…with the downside that, well, all data access needs to go over PL/pgSQL stored procedures which most developers I guess are not so versed in. In short PL/Proxy is just some glue to get the stored procedure call to reach the correct shard so the performance penalty is minimal. Does not support Postgres 11 yet though …
- Sharding with Postgres-XL
Postgres-XL could perhaps be described as an “PostgreSQL-based” sharding framework. And it lives actually somewhat under the umbrella of the PostgreSQL Global Development Group. It lags though a bit behind a Postgres major version or two, and the setup is not so easy as “apt install postgresql” by far due to the distributed nature – with coordinators, transaction managers and data nodes on the picture. But it can help you to manage and run queries on tens of terabytes of data with relatively little restrictions! Of course there are some caveats (e.g. no triggers, constraints on FK-s) as with any PostgreSQL derivative and one can’t also expect the same level of support from the community when countering technical problems – but nevertheless it’s actively maintained and a good choice if you want to stay “almost” Postgres with your 50TB+ of data. Biggest cluster I’ve heard of by the way has even 130 TB of data on it so worth checking out!
- Sharding with Citus
Citus is an extension to standard Postgres so in that sense a bit lighter concept compared to the previous contenders and more “up to date”, but not a transparent drop-in replacement for all scaling needs still. It adds “distributed table” features similar to Postgres-XL but with a bit simpler architecture (only data and coordinator nodes) and according to documentation is especially well-suited for multi-tenant and “realtime analytics“ use cases. It has some caveats like all the others – e.g. “shard-local” constraints, no subqueries in the WHERE clause, no window functions on sharded/partitioned tables, but defining the tables works like normally, one just needs some function calls to activate the distributed behaviour. The project is also under very active development with a decent company behind it for those who require support, so this might be a good choice for your next 50 TB+ project.
- Greenplum – a PostgreSQL fork for Data Warehousing
Greenplum – a massively parallel processing (MPP) database system might just be the oldest active Postgres fork alive. It started based on version 8.2 and was developed behind closed doors for a long time, but being Open Source for a couple of years it’s now making up the lost time and trying to modernize itself to include features from latest Postgres versions. The architecture though seems quite a bit more complex as for the above-mentioned alternatives and needs thorough studying. You’d also be giving up on some more advanced/recent SQL features but I can imagine the architecture decisions are made with certain performance aspects in mind so it might be a worthwhile trade-off and also behind the product stands a huge (publicly noted) consulting company named Pivotal, so again a serious alternative.
To conclude – don’t be afraid that your can’t scale up with Postgres! There are quite some options (I’m pretty sure I forgot some products also) for every taste and also professional support is available.
One more remark though to the Postgres community – I think that making the “scaling” topic a bit more discoverable for newcomers would do a lot of good for general Postgres adoption and adding a few words to the official documentation might even be appropriate – currently there’s a bit on HA and replication here, but the word “scaling” is not even mentioned in this context.