After a recent chat with a DBA who was taking his first steps towards migrating a big database away from Oracle to Postgres, I thought there are not enough articles on PostgreSQL features and “hacks” for achieving scalability – squeezing the last drop out of the hardware to safely accommodate decent multi-terabyte databases. Given that there are quite a few options out there, I was surprised by how much fear there was that PostgreSQL is somehow very limited in scaling matters. Well, maybe that was indeed once true (I started with Postgres in 2011), but in 2018 things are actually pretty solid – so please do read on for some ideas on how to juggle terabytes with ease.
Standard Postgres facilities
If you don’t want to sweat too much and do some pioneering, then the safest way to scale is to stick with the proven out-of-the-box features of Postgres – so first I’d recommend taking a look at the following keywords, with some short explanations; maybe that’s all you need.
- Light-weight / special purpose indexes
For a complex OLTP system supporting hundreds of freaky queries, it is very common that the indexes actually take up much more disk space than the table files holding the data. To improve on that (especially for indexes that are used infrequently), one can reduce index sizes drastically with appropriate use of partial, BRIN, GIN or even the slightly experimental BLOOM indexes. In total there are 7 different index types supported…but most people only know about and use the default B-tree – a big mistake in a multi-TB setting!
Partial indexes cover only a subset of the data – for example, in a sales system we might not be interested in fast access to orders in status “FINISHED” (some nightly reports usually deal with those, and they can take their time), so why index such rows?
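Sticking with that sales example, a partial index could look like this (table and column names are of course made up for illustration):

```sql
-- index only the rows that the application actually queries frequently;
-- the millions of "FINISHED" orders take up no index space at all
CREATE INDEX orders_open_idx ON orders (created_at)
    WHERE status <> 'FINISHED';
```

The planner will use this index automatically whenever the query’s WHERE clause implies `status <> 'FINISHED'`.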
GIN, perhaps the best-known non-default index type, has been around for ages (full-text search) and, in short, is perfect for indexing columns with a lot of repeating values – think all kinds of statuses or good old Mr/Mrs/Miss. GIN stores every unique column value only once, whereas with the default B-tree you’ll have e.g. 1 million leaf nodes with the integer “1” in them.
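As a small sketch – note that for plain scalar columns (as opposed to arrays or tsvectors) GIN needs the “btree_gin” contrib extension; names are again hypothetical:

```sql
-- enable GIN operator classes for ordinary scalar types
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- a status column with only a handful of distinct values is a great GIN candidate
CREATE INDEX orders_status_gin_idx ON orders USING gin (status);
```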
BRIN (block-range a.k.a. min-max index), on the other hand, is something newer and very different – it’s a lossy index type with a very small disk footprint, where not all column values are indexed but only the biggest and smallest values for a range of rows (a 1 MB section of the table by default). This still works very well on ordered values and is, for example, perfect for time series data or other “log” types of tables.
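For an append-only events table (a hypothetical example), a BRIN index is a one-liner and typically weighs in at a tiny fraction of the equivalent B-tree:

```sql
-- rows are inserted in roughly chronological order, so min/max per
-- block range stays selective; pages_per_range can be tuned if needed
CREATE INDEX events_created_brin_idx ON events USING brin (created_at);
```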
BLOOM might be exotic, but if you manage to find a good use case (“bitmap/matrix search”) for it, it can be up to 20x more efficient than traditional indexing – see here for an example use case if it all seems too abstract.
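A minimal sketch of the contrib extension in action, for a table where queries filter on arbitrary combinations of several columns (table and columns invented for illustration):

```sql
CREATE EXTENSION IF NOT EXISTS bloom;

-- one small lossy index instead of a separate B-tree per column combination;
-- matching rows are rechecked against the heap, so false positives are fine
CREATE INDEX features_bloom_idx ON features USING bloom (col1, col2, col3);
```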
But why did I place the somewhat unoriginal topic of indexing at the top of the list? Because the big advantage of this approach is that you don’t need any application changes – the DBA can make it work behind the scenes as a one-time effort! Perfect.
- Table partitioning
Postgres has supported partitioning for some 15 years, I believe…but kind of in a “get your hands dirty” way – one had to do some low-level management of attaching partitions, adding check constraints, and inserting rows into the correct child tables directly or routing them there via insert triggers on the parent table. All of this is history starting from Postgres version 10, which introduced declarative partitioning – and it gets even better with version 11, where the functionality can be called feature complete, with overarching indexes enabling primary and foreign keys.
But why bother? Well, the advantages of partitioning are real: it’s possible to cleanly separate “cold data” and “hot data”, which gives us some nice options like compacting old data maximally with VACUUM FULL or placing it on other media (see “tablespaces” below). As a side effect we get smaller indexes, which take up a lot less space in “shared_buffers”, so we have more room for data there. The effect is biggest in uniformly accessed data scenarios (by name/email/hash), where all parts of a big index would still need to be traversed/read/cached although only a tiny percentage of it is actually used. Also, similar to indexes, under favourable application conditions partitioning can be implemented by DBAs in the background without any code changes.
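The Postgres 10+ declarative syntax is pleasantly terse – a hypothetical orders table partitioned by year might look like this:

```sql
CREATE TABLE orders (
    order_id   bigint       NOT NULL,
    created_at timestamptz  NOT NULL,
    status     text
) PARTITION BY RANGE (created_at);

-- rows are routed to the matching partition automatically on INSERT
CREATE TABLE orders_2017 PARTITION OF orders
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');

CREATE TABLE orders_2018 PARTITION OF orders
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
```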
- Tablespaces
As mentioned above, it is possible to move tables / indexes selectively to different disk media with the help of tablespaces. This can serve different goals: saving money by using slower/affordable disk partitions for “cold” data while keeping only the most recent/important data on fast/expensive media; using specially compressed file systems for data with a lot of repetition; or using network shares or even in-memory file systems on remote nodes for massive non-persistent data – there are quite some options. Management of tablespaces is also quite straightforward, actually; only transferring existing tables / indexes during live operation can be problematic due to full locking.
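In practice it boils down to two statements (the path and table names here are invented; note that the second one takes a full table lock while the data is copied):

```sql
-- the directory must already exist and be owned by the postgres OS user
CREATE TABLESPACE cold_storage LOCATION '/mnt/slow_disk/pg_cold';

-- move an old partition onto the cheap media; locks the table for the duration
ALTER TABLE orders_2015 SET TABLESPACE cold_storage;
```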
- Using multi-process features maximally
Starting from Postgres 9.6 it is possible to parallelize some common operations on data. In Postgres 10/11 the relevant parameter “max_parallel_workers_per_gather” is also enabled by default with a value of 2, thus a maximum of 2 background worker processes per query. For “Big Data” use cases it might make sense to increase it (and also some related params) even more. It is also to be expected that support for parallelizable operations will increase with every new major release, as it has so far. The upcoming version 11, for example, can now do parallel hash joins, index scans and UNION-s.
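A quick way to experiment is to raise the limit for the current session and check the plan – if parallelism kicks in you’ll see a “Gather” node with the workers listed below it:

```sql
-- allow up to 4 workers per Gather node for this session
SET max_parallel_workers_per_gather = 4;

-- aggregates over big tables are classic beneficiaries;
-- "big_table" is a stand-in for your own large relation
EXPLAIN (ANALYZE)
SELECT count(*) FROM big_table;
```

Keep in mind that the cluster-wide `max_worker_processes` (and in v10+ `max_parallel_workers`) pools must be large enough to actually hand out those workers.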
- Query load balancing with replicas
Here we’re moving out of the “single node” or “scaling up” realm…but given the very reasonable hardware prices and the availability of usable Postgres cluster management software (Patroni being our favourite), it is not only for the bigger “shops” but should be doable for everyone. This kind of scaling can of course only be used if you’re mostly just reading data, as currently (and in the coming years) there can officially be only a single “master/primary” node in a cluster that accepts writes. Also, going down this road you’ll probably have to tackle some technical details (especially connection routing) depending on your selected technical stack, but Postgres 10 did add support for the multi-host use case at the driver level – so the batteries are included! See here for more. Also, starting with Postgres 9.6 the replicas can be run in “mirrored” mode, so that it does not matter on which node you run your SELECT! As a friendly warning though – this works well only if the read queries are purely OLTP, i.e. very fast.
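To make the driver-level support concrete – with a Postgres 10+ libpq you can simply list several hosts in one connection string and state what kind of session you need (hostnames here are of course made up):

```sql
-- writes: libpq tries the hosts in order until it finds the primary
-- postgresql://node1.db.local:5432,node2.db.local:5432/mydb?target_session_attrs=read-write

-- reads: any reachable node will do, replicas included
-- postgresql://node1.db.local:5432,node2.db.local:5432/mydb?target_session_attrs=any
```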
Approaches with some compromises
So now we’re done with the conventional stuff…but if you’re ready to step off the beaten path, and maybe make some slight adjustments to your application or try out some funnily named extensions, you could well squeeze the last drop of performance out of single-node hardware. Here’s what I mean:
- Hybrid / Foreign tables
What I call hybrid tables are actually based on Postgres’ excellent SQL/MED standard implementation, also known as Foreign Data Wrappers. They basically look like normal Postgres tables for read queries, but the data might reside in, or be piped over from, literally anywhere – it might be coming from Twitter, LDAP or Amazon S3; see here for the full list of crazy data sources supported. In practice the most common application of Foreign Data Wrappers (FDW-s) is probably making normal (correctly formatted) files look like tables, for example exposing the server log as a table to make monitoring easier.
Where’s the scaling part, you may ask? The FDW approach works very well in the sense that it enables reducing the amount of data by using clever file formats or just compression, which typically shrinks the data 10-20x so that it fits on the node! This works very well for “cold” data, leaving more disk space/cache available for real tables with “hot” data. Since Postgres 10 it is also very easy to implement – sample code here.
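Using the contrib “file_fdw” as the simplest example – here an archived CSV (path and columns invented for illustration) is exposed as a read-only table:

```sql
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

-- the file stays on disk (possibly on a compressed filesystem);
-- Postgres parses it on the fly for every read
CREATE FOREIGN TABLE sales_archive_2015 (
    order_id   bigint,
    created_at timestamptz,
    amount     numeric
) SERVER file_server
  OPTIONS (filename '/archive/sales_2015.csv', format 'csv', header 'true');
```

After that, `SELECT * FROM sales_archive_2015` works like any other table read.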
Another very promising use case is the columnar data storage format (ORC) – take a look at the “cstore_fdw” extension project for more info. It’s especially suited to helping large Data Warehouses scale, with tables up to 10x smaller and queries up to 100% faster.
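A rough sketch of how the extension is wired up, going by its documented usage (the table itself is hypothetical):

```sql
CREATE EXTENSION IF NOT EXISTS cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

-- columnar layout plus compression: great for wide analytics tables
CREATE FOREIGN TABLE events_warehouse (
    event_time timestamptz,
    user_id    bigint,
    payload    text
) SERVER cstore_server
  OPTIONS (compression 'pglz');
```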
Why didn’t I add this feature to the “Standard Postgres facilities” section above, given that the Foreign Data Wrapper infrastructure is built firmly into Postgres? Well, the downside is that you usually cannot change the data via SQL or add indexes or constraints, so its use is a bit limited.
- Foreign table inheritance a.k.a. sharding!
Pretty much the same as the previous point – but bringing in table partitioning and having the child tables reside on remote nodes! The data could be placed on a nearby Postgres server and pulled in automatically over the network as needed. And they don’t even have to be Postgres tables! They could well be MySQL, Oracle, MS SQL or any other popular server that works well for some subset of queries. Now how cool is that? Best results are to be expected from Postgres-to-Postgres interactions though, as only “postgres_fdw” supports all write operations, transactions and clever filter push-downs, so that the amount of data passed over the wire is minimized.
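Combined with the declarative partitioning from above, a minimal sketch of such a “poor man’s shard” could look like this (server names, credentials and ranges are all invented – and assuming an `orders` table partitioned by `created_at` already exists):

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- describe the remote node and how to log in to it
CREATE SERVER shard1 FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'shard1.example.com', dbname 'sales');
CREATE USER MAPPING FOR CURRENT_USER SERVER shard1
    OPTIONS (user 'app', password 'secret');

-- Postgres 10+: a foreign table can be a partition of a local table,
-- so old data lives on the remote node but stays queryable locally
CREATE FOREIGN TABLE orders_2016 PARTITION OF orders
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01')
    SERVER shard1;
```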