Ideas for scaling PostgreSQL to multi-terabyte and beyond

10.2018 | Tags: scaling, sharding

By Kaarel Moppel: After a recent chat with a DBA who was taking his first steps towards migrating a big database away from Oracle to Postgres, it struck me that there are not enough articles on PostgreSQL features and "hacks" to achieve scalability. People want to know how to squeeze the last drop out of the hardware and safely accommodate decent multi-terabyte size databases. Since there are quite a few options out there, I was very surprised that there is so much fear that PostgreSQL is somehow very limited as far as scaling goes. Maybe it used to be so (I started with Postgres in 2011), but in the year 2018, things are pretty solid. So please do read on for some ideas on how to juggle terabytes with ease.

Standard Postgres facilities

If you don’t like to sweat too much or do any pioneering, then the safest way to scale would be to stick with proven out-of-the-box features of Postgres - so first, I’d recommend you take a look at the following keywords with some short explanations. Maybe that's all you need.

Light-weight / special purpose indexes

For a complex OLTP system, supporting hundreds of freaky queries, it is very common that indexes actually take much more disk space than the table files holding the data. To improve that (especially for indexes that are used infrequently) you can drastically reduce the index sizes with appropriate use of partial, BRIN, GIN or even the somewhat experimental BLOOM indexes. In total, there are 7 different index types supported. Most people only know about and use the default B-tree, which is a big mistake in a multi-TB setting!

Partial indexes cover only a subset of the rows, namely those matching a condition you specify. For example, in a sales system we might not be interested in fast access to orders in status “FINISHED” (some nightly reports usually deal with that, and they can take their time), so why should we index such rows?
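To make the idea concrete, here is a minimal Python sketch (a toy model, not Postgres code) of what a partial index does: only rows that satisfy the predicate get an index entry. The table layout, the 95% share of finished orders and the index name in the SQL comment are assumptions made up for illustration.

```python
import bisect
import random

# Hypothetical DDL that this toy model mimics (names are assumptions):
#   CREATE INDEX orders_open_idx ON orders (order_id) WHERE status <> 'FINISHED';

rng = random.Random(42)

# Toy "orders" table of (order_id, status); roughly 95% of rows are FINISHED.
orders = [
    (order_id, "FINISHED" if rng.random() < 0.95 else "OPEN")
    for order_id in range(100_000)
]

# A full index holds one entry per row; a partial index holds entries
# only for rows that satisfy the predicate.
full_index = sorted(order_id for order_id, _ in orders)
partial_index = sorted(
    order_id for order_id, status in orders if status != "FINISHED"
)


def lookup(index, order_id):
    """Binary search, standing in for a B-tree descent."""
    pos = bisect.bisect_left(index, order_id)
    return pos < len(index) and index[pos] == order_id


print(f"full index entries:    {len(full_index)}")
print(f"partial index entries: {len(partial_index)}")
```

The partial index is a small fraction of the full one, yet it still answers lookups for the rows the application actually cares about; finished orders are simply not found through it and would need a table scan.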

GIN

GIN is perhaps the best-known non-default index type and has been around for ages (it powers full-text search). It's well suited to indexing columns with a lot of repeating values (for plain scalar columns this requires the btree_gin extension). Think of all kinds of statuses, or good old Mr/Mrs/Miss. GIN stores every unique column value only once, whereas the default B-tree will have e.g. 1 million leaf entries with the integer “1” in them.

BRIN

BRIN (block-range a.k.a. min-max index), on the other hand, is something newer and very different. It’s a lossy index type with a very small disk footprint: instead of indexing every column value, it stores only the biggest and smallest values for a range of rows (a 1 MB section of a table by default). This works very well when values are stored in roughly sorted order, which makes it perfect for time series data or other “log” types of tables.
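The min-max idea can be illustrated with a small Python toy model (not how Postgres implements it internally). It keeps only the smallest and largest value per block range and rechecks every row of any range that might contain the target. The range size of 1,000 rows and all names are assumptions chosen for the example.

```python
import random
from dataclasses import dataclass

ROWS_PER_RANGE = 1_000  # toy stand-in for a block range; the value is arbitrary


@dataclass
class RangeSummary:
    start: int  # position of the first row in the range
    end: int    # one past the last row in the range
    lo: int     # smallest value stored in the range
    hi: int     # largest value stored in the range


def build_brin(values, rows_per_range=ROWS_PER_RANGE):
    """Summarize each block range by its min and max only."""
    summaries = []
    for start in range(0, len(values), rows_per_range):
        chunk = values[start:start + rows_per_range]
        summaries.append(
            RangeSummary(start, start + len(chunk), min(chunk), max(chunk))
        )
    return summaries


def brin_search(values, summaries, target):
    """Return matching row positions and the number of rows rechecked."""
    hits, scanned = [], 0
    for s in summaries:
        if s.lo <= target <= s.hi:  # the range *may* contain the value
            for pos in range(s.start, s.end):  # lossy: recheck every row
                scanned += 1
                if values[pos] == target:
                    hits.append(pos)
    return hits, scanned


# Ordered data, e.g. an ever-increasing ID or timestamp in a log table.
ordered = list(range(100_000))
summaries = build_brin(ordered)
hits, scanned = brin_search(ordered, summaries, 54_321)

# The same values in random physical order defeat the index.
shuffled = ordered[:]
random.Random(0).shuffle(shuffled)
hits_shuffled, scanned_shuffled = brin_search(
    shuffled, build_brin(shuffled), 54_321
)

print(f"summaries kept: {len(summaries)}")
print(f"rows rechecked on ordered data:  {scanned}")
print(f"rows rechecked on shuffled data: {scanned_shuffled}")
```

On ordered data a single range qualifies, so only a sliver of the table is rechecked; on shuffled data almost every range spans the target and the index saves next to nothing. That is why the physical ordering of the data matters so much for this index type.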

BLOOM

BLOOM might be exotic, but if you manage to find a good use case (“bitmap/matrix search”) for it, it can be up to 20x more efficient than traditional indexing. See this blog on trying out Postgres bloom indexes for an example use case, if it seems too abstract.

But why did I place the somewhat unoriginal topic of indexing at the top of the list? Because the big advantage of this solution is that you don’t need any application changes. The DBA can easily make it work behind the scenes, in a one-time effort! Perfect.

Table partitioning

Postgres has had partitioning for 15 years, I believe...but kind of in a “get your hands dirty” way. You had to do some low-level management of attaching partitions, adding check constraints and inserting rows into the correct child tables directly. Or, you had to route them there via insert triggers on the parent table. All this is history starting from Postgres version 10; from then on, there is declarative partitioning. It gets even better with version 11, where the functionality can be called feature-complete, with overarching indexes enabling primary and foreign keys.

Why bother? Well, partitioning makes it possible to cleanly separate “cold data” and “hot data”. That gives us some nice options, like compacting old data maximally with VACUUM FULL or placing it on other media (see “tablespaces” below). As a side effect we get smaller indexes, which take a lot less space in shared_buffers, so there is more room for data there. The effect is biggest in uniformly accessed data scenarios (lookups by name/email/hash), where all parts of big indexes still need to be traversed/read/cached, but only a tiny percentage of them is actually used. Also, similar to indexes, under favorable application conditions partitioning can be implemented by DBAs in the background, without any code changes.
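As a rough illustration of the declarative syntax available from version 10, the following Python helper renders the DDL for a range-partitioned table with one partition per month. It only builds SQL strings and does not talk to a database. The table name, the column list and the child naming scheme are assumptions invented for this sketch.

```python
from datetime import date


def month_starts(first, count):
    """Yield count + 1 consecutive first-of-month dates starting at `first`."""
    year, month = first.year, first.month
    for _ in range(count + 1):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def monthly_partition_ddl(parent, column, first, count):
    """Render declarative range-partitioning DDL (Postgres 10+ syntax)."""
    statements = [
        f"CREATE TABLE {parent} (\n"
        f"    id bigint NOT NULL,\n"
        f"    {column} timestamptz NOT NULL,\n"
        f"    payload text\n"
        f") PARTITION BY RANGE ({column});"
    ]
    bounds = list(month_starts(first, count))
    # Each partition takes [lower, upper); the upper bound is exclusive.
    for lower, upper in zip(bounds, bounds[1:]):
        child = f"{parent}_y{lower.year}m{lower.month:02d}"
        statements.append(
            f"CREATE TABLE {child} PARTITION OF {parent}\n"
            f"    FOR VALUES FROM ('{lower}') TO ('{upper}');"
        )
    return statements


ddl = monthly_partition_ddl("orders", "created_on", date(2018, 11, 1), 3)
print("\n\n".join(ddl))
```

Rows inserted into the parent are then routed to the matching child automatically, and an old month can later be detached, compacted or moved as a unit.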

Tablespaces

As mentioned above, it is possible to move tables/indexes selectively to various disk media with the help of tablespaces. Here you can achieve many different goals: save money by using slower/affordable disk partitions for “cold” data, keep only the most recent/important data on fast/expensive media, use special compressed file systems for data that has a lot of repetition, or use network shares or even in-memory file systems on remote nodes for massive non-persistent data. There are quite a few options. Management of tablespaces is also quite straightforward; however, transferring existing tables/indexes during live operation can be problematic, because the move takes a full lock on the object.

Using multi-process features maximally

Starting from Postgres 9.6, it is possible to parallelize some common operations on data. In Postgres 10/11 the relevant parameter max_parallel_workers_per_gather is also enabled by default with a value of 2, so at most 2 background worker processes are used per parallel query node. For “Big Data” use cases, it might make sense to increase it even more (and also increase some related parameters, such as max_parallel_workers and max_worker_processes). Also, it is to be expected that the set of operations that can be parallelized grows with every new major release, as it has so far. The upcoming version 11, for example, can do parallel hash joins, parallel B-tree index builds and parallelized UNIONs.

Query load balancing with replicas

Here we’re moving out of the “single node” or “scaling up” realm...but given the very reasonable hardware prices and the availability of usable Postgres cluster management software (Patroni being our favourite), it is not only for the bigger “shops” but should be doable for everyone. This kind of scaling can of course only be used if you’re mostly just reading data...as currently (and for the next few years) there can officially only be a single “master/primary” node in a cluster that accepts writes. Also, going down this road you’ll probably have to tackle some technical details (especially connection routing) depending on your selected technical stack, but Postgres 10 actually did add support for the multi-host use case on the driver level, so the batteries are included! See here for more.
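For a feel of what the driver-level multi-host support looks like, here is a small Python helper that assembles a libpq-style connection URI listing several hosts. It only builds the string; passing it to a libpq-based driver is left out. The host names, database and user are placeholders, and the helper itself is hypothetical rather than part of any library.

```python
from urllib.parse import quote, urlencode


def multi_host_uri(hosts, dbname, user, target_session_attrs="any"):
    """Build a libpq-style URI with several hosts (supported from Postgres 10).

    hosts: list of (hostname, port) pairs; the client tries them in order.
    target_session_attrs: "any" accepts the first node that answers,
    "read-write" keeps trying until a node that accepts writes is found.
    """
    if target_session_attrs not in ("any", "read-write"):
        raise ValueError("Postgres 10 libpq knows only 'any' and 'read-write'")
    host_list = ",".join(f"{host}:{port}" for host, port in hosts)
    params = urlencode({"target_session_attrs": target_session_attrs})
    return f"postgresql://{quote(user)}@{host_list}/{quote(dbname)}?{params}"


# Placeholder host names for a two-node cluster.
nodes = [("pg1.example.com", 5432), ("pg2.example.com", 5432)]

writer_uri = multi_host_uri(nodes, "shop", "app", "read-write")
reader_uri = multi_host_uri(nodes, "shop", "app", "any")
print(writer_uri)
print(reader_uri)
```

With "read-write" the application finds the current primary without knowing which node it is. Since hosts are tried in the order given, read-only clients would probably want to vary that order themselves to spread the load.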

Also, starting with Postgres 9.6 the replicas can be run in “mirrored” mode (synchronous replication with synchronous_commit set to remote_apply), so that it does not matter which node you’re running your SELECT on! As a friendly warning: this works well only if the read queries are purely OLTP (i.e., very fast).

Approaches with some compromises

So now we’re done with the conventional stuff...but if you’re ready to step off the beaten path and make slight adjustments to your application, or try out some funnily-named extensions, you could well squeeze out the last drop of performance from single-node hardware. Here’s what I mean:

Hybrid / Foreign tables

What I call hybrid tables are actually based on Postgres’ excellent implementation of the SQL/MED standard, also known as Foreign Data Wrappers. They basically look like normal Postgres tables for read queries, but the data might reside in, or be piped over from, literally anywhere. It might be coming from Twitter, LDAP or Amazon S3; see the wiki on FDWs for the full list of crazy data sources supported. In practice, the most common application of Foreign Data Wrappers (FDWs) is probably making normal (correctly formatted) files look like tables. For example, exposing the server log as a table to make monitoring easier.

Where’s the scaling part, you may ask? The FDW approach works very well in the sense that it enables you to reduce the amount of data by using some clever file formats or compression. That typically reduces the data size 10-20x, so that the data would fit on the node! This works very well for “cold” data, leaving more disk space/cache available for real tables with “hot” data. As of Postgres v10 it is also very easy to implement – see some sample code here.
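One possible shape of such a setup, and this is an assumption rather than the sample code referred to above, relies on the `program` option that file_fdw gained in Postgres 10: the foreign table reads the output of a decompression command instead of a plain file. The Python helper below just renders the DDL; the table, columns, server name and file path are invented for the example.

```python
import shlex


def compressed_csv_foreign_table(table, columns, gz_path, server="cold_files"):
    """Render DDL exposing a gzip-compressed CSV file as a read-only table.

    Uses file_fdw's `program` option (Postgres 10+); the server name
    `cold_files` is an arbitrary placeholder.
    """
    column_defs = ",\n".join(f"    {name} {sql_type}" for name, sql_type in columns)
    program = f"gzip -dc {shlex.quote(gz_path)}"
    program_literal = program.replace("'", "''")  # escape for a SQL literal
    return [
        "CREATE EXTENSION IF NOT EXISTS file_fdw;",
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER file_fdw;",
        f"CREATE FOREIGN TABLE {table} (\n{column_defs}\n) SERVER {server}\n"
        f"    OPTIONS (program '{program_literal}', format 'csv');",
    ]


statements = compressed_csv_foreign_table(
    "orders_2017_archive",
    [("id", "bigint"), ("created_on", "timestamptz"), ("payload", "text")],
    "/data/archive/orders_2017.csv.gz",
)
print("\n\n".join(statements))
```

Every query against such a table decompresses and parses the whole file, so this fits rarely read “cold” data much better than anything on a hot path. Running external programs from the server also needs elevated privileges.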

Another very promising use case is columnar data storage (the ORC format). Take a look at the “cstore_fdw” extension project for more info. It’s especially suited to scaling large data warehouses, with tables up to 10x smaller and queries up to 100% faster.

Why didn’t I add this feature to the above “Standard Postgres facilities” section, since the Foreign Data Wrapper infrastructure is firmly built into Postgres? Well, the downside is that you usually cannot change data via SQL or add indexes or constraints, so its use is a bit limited.

Foreign table inheritance a.k.a. sharding!

This builds on the foreign tables from the previous point, but brings in table partitioning and has child tables residing on remote nodes! The data could be placed on a nearby Postgres server and pulled in automatically over the network, as needed. Actually, they don’t even have to be Postgres tables! It could well be MySQL, Oracle, MS SQL or any other popular server that works well for some subset of queries. Now how cool is that? The best results are to be expected from Postgres-to-Postgres interactions though, as only “postgres_fdw” supports all write operations, transactions and clever filter push-downs, so that the amount of data passed over the wire is minimized.
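A minimal sketch of the Postgres-to-Postgres variant, using table inheritance as in the section title: the Python helper below renders the DDL that attaches a table living on another server as a child of a local parent via postgres_fdw. Server names, hosts, credentials and table names are placeholders, and the helper only produces SQL text.

```python
def remote_child_ddl(parent, child, server, host, dbname, remote_table,
                     user, password):
    """Render DDL attaching a remote Postgres table as a child of `parent`."""
    return [
        "CREATE EXTENSION IF NOT EXISTS postgres_fdw;",
        # Where the remote node lives.
        f"CREATE SERVER {server} FOREIGN DATA WRAPPER postgres_fdw\n"
        f"    OPTIONS (host '{host}', dbname '{dbname}');",
        # Which remote credentials the current local user connects with.
        f"CREATE USER MAPPING FOR CURRENT_USER SERVER {server}\n"
        f"    OPTIONS (user '{user}', password '{password}');",
        # The child takes its columns from the parent and points at the
        # table on the remote node.
        f"CREATE FOREIGN TABLE {child} ()\n"
        f"    INHERITS ({parent}) SERVER {server}\n"
        f"    OPTIONS (table_name '{remote_table}');",
    ]


shard_ddl = remote_child_ddl(
    parent="orders",
    child="orders_shard1",
    server="shard1",
    host="shard1.example.com",
    dbname="shop",
    remote_table="orders",
    user="app",
    password="secret",
)
print("\n\n".join(shard_ddl))
```

A query against the parent then also visits the remote child. Adding a CHECK constraint that describes which rows each child holds would let the planner skip shards that cannot match.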

Happy terabyte-hunting!