Two weeks ago a new PostgreSQL major version with the number 10 was released (note the new numbering scheme!). I already covered my favorite parts of the official release notes from a DBA’s point of view here, so take a look there if you are interested in the operational side as well. In this blog post, I would like to look at the list of changes from a developer's angle (disclaimer: I'm not doing too much development these days though, just keeping track of things). Of course, some features are shared with the DBA part, but there is still a lot of new stuff – enjoy!
The long-awaited feature enables easy setup and minimal performance penalties for application scenarios where you only want to replicate a single table, a subset of tables or all tables, enabling totally new application scenarios and also zero-downtime upgrades for the following major versions! The top feature of version 10 for me.
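A minimal sketch of what a setup could look like (the table, database and connection details are hypothetical; the table structure must already exist on the subscriber):

-- on the publishing (source) database
CREATE PUBLICATION my_pub FOR TABLE my_table;

-- on the subscribing (target) database
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=source_host dbname=mydb user=replicator'
    PUBLICATION my_pub;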
The old way of managing partitions via inheritance and creating triggers to re-route inserts to the correct tables was bothersome to say the least, not to mention the performance impact. Currently supported are "range" and "list" partitioning schemes. If someone is missing the "hash" partitioning available in some DB engines, one could use "list" partitioning with expressions to achieve the same.
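A minimal sketch of the new declarative syntax (table and column names are made up for illustration):

-- the parent table only defines the partitioning scheme, it holds no data itself
CREATE TABLE measurements (
    city_id  int  NOT NULL,
    logdate  date NOT NULL,
    peaktemp int
) PARTITION BY RANGE (logdate);

-- one partition per year, inserts are routed automatically
CREATE TABLE measurements_2017 PARTITION OF measurements
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');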
Libpq will connect to the first responsive server in the list. This helps greatly in providing some basic high availability to applications without any extra software! FYI – you need to update the Postgres client libraries (and possibly also your database driver of choice) to make use of the feature.
Now there’s a new libpq connect parameter called target_session_attrs, with possible values “read-write” and “any”. By listing replicas first in your connect string (see the previous list item) and choosing “any”, one can do simple read balancing. For writes one needs another connection (separate connection pools recommended) with “read-write”. That is not really transparent (a transparent setup is possible with pgpool2, for example) but might be just good enough.
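To illustrate both of the above items, a sketch of such connect strings (host names are hypothetical):

# connect to the first responsive node, e.g. for reads (replicas listed first)
psql "host=replica1,replica2,master dbname=mydb target_session_attrs=any"

# only accept a node that allows writes
psql "host=replica1,replica2,master dbname=mydb target_session_attrs=read-write"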
For highly critical systems it is now possible to check whether the last transaction was committed, given that the txid was stored locally. It can, for example, happen that the transaction was committed just fine and only the acknowledgement didn’t make it over the network back to the client.
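This is enabled by the new txid_status() function; a rough sketch of the flow (the accounts table and the txid value are hypothetical):

BEGIN;
SELECT txid_current();   -- store this value (e.g. 123456) on the client side
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
COMMIT;                  -- assume the acknowledgement got lost on the network

-- later, from a new connection, check what actually happened
SELECT txid_status(123456);   -- 'committed', 'aborted' or 'in progress'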
For highly critical systems it’s now possible to require that, in addition to the connected server, the commit has to be propagated to any given number of other servers. Previously the only option was a priority list, which made handling server failures more difficult.
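This is the new quorum-based synchronous replication; a configuration sketch (standby names are hypothetical):

# postgresql.conf on the primary – commit must be confirmed by any 2 of the 3 standbys
synchronous_standby_names = 'ANY 2 (node_a, node_b, node_c)'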
Such stats need to be created manually on a set of columns of a table to point out that the values are actually somehow dependent on each other. This makes it possible to counter slow-query problems where the planner thinks very little data will be returned (multiplication of probabilities usually yields very small numbers) and, for example, chooses a "nested loop" join, which usually does not perform too well on bigger amounts of data.
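A sketch of how such extended statistics could be created (table and column names are made up; a city and its zip code are obviously correlated):

CREATE STATISTICS stat_city_zip (dependencies) ON city, zip FROM addresses;
ANALYZE addresses;   -- the statistics are gathered during the next ANALYZE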
Hash indexes are now WAL-logged, thus crash-safe and ready for wider use. They have also received some performance improvements for simple searches, so that they should actually be faster than standard B-tree indexes for bigger amounts of data. The downsides: a bigger index size, and they are only suitable for equality searches.
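Creating one is as simple as (table and column names hypothetical):

CREATE INDEX ON my_table USING hash (some_column);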
Quite an awesome feature – it is now possible to look at both the old and the new data of the affected rows in a statement-level trigger. This can be used, for example, to calculate some aggregates, or to reject an update if suspiciously many rows were changed.
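A sketch of such a statement-level trigger using the new transition tables (the accounts table and the 1000-row threshold are made-up examples):

CREATE FUNCTION check_mass_update() RETURNS trigger AS $$
BEGIN
    -- new_rows / old_rows are the transition tables declared in the trigger below
    IF (SELECT count(*) FROM new_rows) > 1000 THEN
        RAISE EXCEPTION 'suspiciously many rows changed, aborting';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_check_mass_update
    AFTER UPDATE ON accounts
    REFERENCING OLD TABLE AS old_rows NEW TABLE AS new_rows
    FOR EACH STATEMENT EXECUTE PROCEDURE check_mass_update();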
Now one can drop a function based on its name only. This is one of those small things that add up if you work a lot with stored procedures, as 99% of the time the names are indeed unique. Less typing is good 🙂
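For example (function name hypothetical; this works only when the name is unambiguous):

DROP FUNCTION my_unique_function;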
Previously this could only be done by hacking the system catalogs... which could bite you if you weren't careful.
In addition to some smaller JSONB improvements, this addition once again assures us that JSONB is a first-class citizen in Postgres, and mixed (NoSQL + SQL) applications have even more options.
Postgres has long had the ability to treat simple files as virtual tables, but this feature could be a huge win for “warm data” / “cold data” scenarios where archive data grows too big but is occasionally still needed. By the way, when compressing Postgres data one can easily win 10-30x on disk size. A sample from the Postgres Wiki:
CREATE FOREIGN TABLE test (a int, b text)
    SERVER csv
    OPTIONS (program 'gunzip -c /tmp/data.csv.gz');
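For completeness – the "csv" foreign server referenced above has to exist first; assuming the sample relies on the file_fdw extension (as in the Wiki example), the setup would roughly be:

CREATE EXTENSION file_fdw;
CREATE SERVER csv FOREIGN DATA WRAPPER file_fdw;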
That was that – stop by next year for Postgres 11!
Last week a new PostgreSQL major version with the number 10 was released! The announcement, release notes and the "What's new" overview can be found here, here and here – highly recommended reading, so check them out. As usual, there have already been quite a few blog posts covering all the new stuff, but I guess everyone has their own angle on what is important, so as with version 9.6 I'm again throwing in my impressions on the most interesting/relevant features here.
As always, users who upgrade or initialize a fresh cluster will enjoy huge performance wins (e.g. better parallelization with parallel index scans, merge joins and uncorrelated sub-queries, faster aggregations, smarter joins and aggregates on remote servers) out of the box without doing anything, but here I would like to look more at the things you won't get out of the box – where you actually need to take some steps to start benefiting from them. The list of highlighted features below is compiled from a DBA's viewpoint; a post on changes from a developer's point of view will follow soon.
First, some hints on upgrading from an existing setup – this time there are some small things that could cause problems when migrating from 9.6 or even older versions, so before the real deal one should definitely test the upgrade on a separate replica and go through the full list of possible troublemakers in the release notes. The most likely pitfalls to watch out for:
The latter naming could be confused with normal server logs, so it's a "just in case" change. If you use any 3rd party backup/replication/HA tools, check that they are all at their latest versions.
Make sure to verify that your log parsing/grepping scripts (if you have any) still work.
If you use the default PostgreSQL 10 postgresql.conf settings on a machine with a low number of CPUs, you may see resource usage spikes, as parallel processing is now enabled by default – which is a good thing though, as it should mean faster queries. Set max_parallel_workers_per_gather to 0 if the old behaviour is needed.
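A quick way to do that without editing the config file by hand (requires superuser rights):

ALTER SYSTEM SET max_parallel_workers_per_gather = 0;
SELECT pg_reload_conf();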
To ease testing etc., localhost and local Unix socket replication connections are now enabled in "trust" mode (without password) in pg_hba.conf! So if other non-DBA users also have access to real production machines, make sure you change the config.
The long-awaited feature enables easy setup and minimal performance penalties for application scenarios where you only want to replicate a single table, a subset of tables or all tables, which also means zero-downtime upgrades for the following major versions! Historically (Postgres 9.4+ required) this could only be achieved with a 3rd party extension or slowish trigger-based solutions. The top feature of version 10 for me.
The old way of managing partitions via inheritance and creating triggers to re-route inserts to the correct tables was bothersome to say the least, not to mention the performance impact. Currently supported are "range" and "list" partitioning schemes. If someone is missing the "hash" partitioning available in some DB engines, one could use "list" partitioning with expressions to achieve the same.
Hash indexes are now WAL-logged, thus crash-safe, and received some performance improvements, so that for simple searches they're actually faster than standard B-tree indexes for bigger amounts of data. Bigger index size though, too.
Such stats need to be created manually on a set of columns of a table to point out that the values are actually somehow dependent on each other. This makes it possible to counter slow-query problems where the planner thinks very little data will be returned (multiplication of probabilities usually yields very small numbers) and, for example, chooses a "nested loop" join that does not perform well on bigger amounts of data.
Now one can use the pg_dump tool to speed up backups on standby servers enormously by using multiple processes (the --jobs flag).
See the max_parallel_workers and min_parallel_table_scan_size / min_parallel_index_scan_size parameters. I would recommend increasing the default values of the latter two (8MB and 512KB respectively) a bit though.
New roles pg_monitor, pg_read_all_settings, pg_read_all_stats, and pg_stat_scan_tables make life a lot easier for all kinds of monitoring tasks - previously one had to use superuser accounts or some SECURITY DEFINER wrapper functions.
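For example, to give a (hypothetical) dedicated monitoring account access to all stats views and settings:

GRANT pg_monitor TO monitoring_user;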
It does a couple of smart checks to discover structural inconsistencies and stuff not covered by page-level checksums. I hope to check it out more deeply in the near future.
This would, for example, enable having a single maintenance/monitoring script with version-specific branching (different column names for pg_stat* views etc.) instead of many version-specific scripts.
SELECT :'VERSION_NAME' = '10.0' AS is_v10 \gset
\if :is_v10
SELECT 'yippee' AS msg;
\else
SELECT 'time to upgrade!' AS msg;
\endif
That's it for this time! A lot of other stuff didn't get listed of course, so for full-time DBAs I'd definitely suggest looking at the release notes more thoroughly. And a big thanks to those 300+ people who contributed their effort to this particularly exciting release!
If you need further information, please feel free to contact us directly.
BY Kaarel Moppel - Although Postgres is a great all-around product that can’t be beaten in price-performance ratio, there’s one area where things could be a bit better: the validation of replicas, or maybe even more broadly, of whole clusters.
What's the problem? After building a new replica there’s actually no way to tell in a “pushbutton” way if everything is absolutely fine with the new replica. The replica building tool pg_basebackup (there are of course some other tools/approaches but I suspect that the pg_basebackup tool has 90% of the “market”) just streams over the datafiles and transaction logs and writes them to disk and that’s it – done. After that, you usually modify the configs if needed and start your server and then see if you can connect to it and maybe also issue some “select count(*) from my_favourite_table” to verify that things are more or less plausible, before grinning in satisfaction and grabbing a coffee to tackle other burning tasks... but to what extent did we actually verify that the cluster is intact?
So what could go wrong when building up a new replica? If pg_basebackup reports success then not too much, actually – it is very unlikely that something is wrong. But a couple of things still bug me... though I might be a little paranoid 🙂 Just read on to see what I mean.
First the hard disks: in theory, disk errors should be already a thing of the past. Enterprise-grade disk systems should definitely have some built-in error detection. The thing is, nowadays everything is so abstracted and virtualized. You actually hardly know what’s down there. Maybe it’s some network storage? It for sure requires some drivers on the Operating System side, and also on the hardware side. As we all know, every software sadly has bugs. Google says the industry average is about 15 - 50 errors per 1000 lines of delivered code. So one cannot 100% exclude the chance of silent data corruption on writing datafiles. That's especially true because there will typically be some new untested hardware for replicas.
Luckily, Postgres already has some remedy against such cases. The “data-checksums” flag is there for exactly this purpose. However, it needs to be set when initializing the cluster, and it also has a small performance impact, so it’s not always used.
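As a reminder, checksums can only be enabled at initdb time, roughly like this (the data directory path is just an example):

initdb --data-checksums -D /var/lib/postgresql/10/main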
What else? Postgres allows building up replicas across different platforms, for example a master on Ubuntu and a replica on CentOS – which of course is a great feature in itself. Plus, there is also the fact that minor versions can differ, e.g. master on 9.6.2, replica by some accident on 9.6.0. Even the replication code itself can fail; I can remember at least one occasion some years ago when committed rows were marked uncommitted on replicas. This is very rare, but could technically happen again.
In short, there are some “if”-s. The question is "What can be done here?" As I said, sadly there’s currently no tool from the Postgres project (I’m not sure that there should even be one or if it can even be done) to do full verification. The only 100% reliable option is to do a SQL dump/restore. That could take hours and hours for 1TB+ databases. It's not really an option if you need to switch over the master in a hush-hush manner.
So a good compromise, in order to have a better feeling before making the switchover, is to just dump the database! This will at least verify that there’s no silent data corruption. However, it could take hours, depending on your DB-size/hardware. There are some tricks to make it faster though - coming right up.
1. One does not have to actually store the SQL dump – we just want to verify that the data can be read out, i.e. using “/dev/null” is an excellent performance booster. When using pg_dumpall or pg_dump in plain SQL mode (the default) it’s very easy to do:
pg_dumpall -h /var/run/postgresql >/dev/null
2. Make sure you use the Unix socket for connecting to the local server instead of “localhost” TCP stack – just that simple adjustment should already give some 10-15% speedup.
3. Use multiple processes to dump the data! This is a real life-saver for bigger databases when time is of the essence. One should just figure out a reasonable value for the “jobs” flag so that the IO-subsystem is pushed to its limit. NB! Multiple jobs absolutely require the “directory” output format (-Fd/--format=directory + -f/--file=dirname) for pg_dump, and sadly that also brings some issues of its own (see point 4 below and the sketch that follows).
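A sketch of such an invocation (database name and target directory are made up):

pg_dump -h /var/run/postgresql --format=directory --jobs=4 --file=/tmp/dump_dir mydb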
4. Throw together a simple custom script that handles the negative side-effects of approach no. 3, so you don’t have to think about them. One such script that I use myself can, for example, be found here. It’s a simple Python script that just spawns a given number of worker processes (half of the CPUs by default) and then dumps all tables to /dev/null, starting from the biggest ones – one just needs to specify the Unix socket as the host, and the Postgres “bindir” is also required.
NB! Before kicking off the dumping process on the replica, one should make sure that the query conflict situation has been addressed – otherwise the dump will soon fail if there’s some activity on the master! Options are configuring hot_standby_feedback or allowing the replica to “fall behind” by setting max_standby_streaming_delay to -1 or some big value. More on that here.
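A configuration sketch for the replica – pick one of the two approaches:

# postgresql.conf on the replica
hot_standby_feedback = on
# ...or alternatively let the replica lag behind for the duration of the dump
max_standby_streaming_delay = -1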
That’s it – hope you made it through, and that it made you think a bit. Feedback appreciated as usual!
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter.
PostgreSQL 10 will provide end users with countless new features. One of those features is related to “Foreign Data Wrappers” and is generally known as “aggregate pushdown”. To be honest, this is one of my favorite new features of PostgreSQL 10, and therefore it might be worth sharing this piece of information with a broader audience. So if you are interested in remote aggregation, distributed queries, distributed aggregates and aggregate pushdown in PostgreSQL, keep reading.
To show what the optimizer is already capable of, we need two databases:
iMac:~ hs$ createdb db01
iMac:~ hs$ createdb db02
Then we can deploy some simple test data in db02:
BEGIN;

CREATE TABLE t_test (id serial, name text);

INSERT INTO t_test (name)
    SELECT 'dummy' FROM generate_series(1, 1000000);

ANALYZE;

COMMIT;
The script generates 1 million rows and just a single name (= “dummy”).
For many years now, PostgreSQL has provided the means to access remote data sources using “Foreign Data Wrappers” (FDWs). The following script connects db01 to db02:
CREATE EXTENSION postgres_fdw;

CREATE SERVER pgserver FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'db02', host 'localhost');

CREATE USER MAPPING FOR CURRENT_USER SERVER pgserver
    OPTIONS (user 'hs');

IMPORT FOREIGN SCHEMA public FROM SERVER pgserver INTO public;
The script shown here loads the postgres_fdw extension, which allows us to connect to a remote PostgreSQL database. Then a virtual server pointing to db02 is created in db01. Finally, a user mapping is created and the foreign schema is imported. All tables in the remote database that can be found in the “public” schema will be linked and become visible in db01.
Once the test data is in place, we can give PostgreSQL a try and see how it behaves in the case of aggregates. Here is an example:
db01=# explain (analyze, verbose) SELECT name, count(*) FROM t_test GROUP BY 1;
                                   QUERY PLAN
------------------------------------------------------------------------------
 Foreign Scan  (cost=107.31..137.93 rows=200 width=40) (actual time=192.244..192.245 rows=1 loops=1)
   Output: name, (count(*))
   Relations: Aggregate on (public.t_test)
   Remote SQL: SELECT name, count(*) FROM public.t_test GROUP BY name
 Planning time: 0.063 ms
 Execution time: 192.581 ms
(6 rows)
The most important observation here is that PostgreSQL is able to push over the complete aggregate. As you can see, the remote SQL is basically the same as the local query. The main advantage is that by pushing over the aggregate, PostgreSQL can drastically reduce the load on your local machine as well as the amount of data that has to be sent over the network.
However, at this point it is necessary to issue a word of caution: yes, aggregates can be pushed down to a remote server. The thing is: joins happen before the aggregate. In other words, PostgreSQL has to transfer all the data from the remote host in this case:
db01=# explain (analyze, verbose) SELECT a.name, count(*)
       FROM t_test AS a, t_test AS b
       WHERE a.id = b.id
       GROUP BY 1
       ORDER BY 2, 1;
                                   QUERY PLAN
-------------------------------------------------------------------------------------------------
 Sort  (cost=1003.90..1004.40 rows=200 width=40) (actual time=4012.290..4012.290 rows=1 loops=1)
   Output: a.name, (count(*))
   Sort Key: (count(*)), a.name
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=994.26..996.26 rows=200 width=40) (actual time=4012.283..4012.283 rows=1 loops=1)
         Output: a.name, count(*)
         Group Key: a.name
         ->  Merge Join  (cost=588.18..894.45 rows=19963 width=32) (actual time=3382.674..3848.202 rows=1000000 loops=1)
               Output: a.name
               Merge Cond: (a.id = b.id)
               ->  Sort  (cost=222.03..225.44 rows=1365 width=36) (actual time=1691.089..1788.210 rows=1000000 loops=1)
                     Output: a.name, a.id
                     Sort Key: a.id
                     Sort Method: external sort  Disk: 21528kB
                     ->  Foreign Scan on public.t_test a  (cost=100.00..150.95 rows=1365 width=36) (actual time=0.311..1232.045 rows=1000000 loops=1)
                           Output: a.name, a.id
                           Remote SQL: SELECT id, name FROM public.t_test
               ->  Sort  (cost=366.15..373.46 rows=2925 width=4) (actual time=1691.579..1785.666 rows=1000000 loops=1)
                     Output: b.id
                     Sort Key: b.id
                     Sort Method: external sort  Disk: 17616kB
                     ->  Foreign Scan on public.t_test b  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.400..1203.757 rows=1000000 loops=1)
                           Output: b.id
                           Remote SQL: SELECT id FROM public.t_test
 Planning time: 0.105 ms
 Execution time: 4071.736 ms
For PostgreSQL 11.0 we are working on a patch which will hopefully make it into core. It allows PostgreSQL to perform many aggregations before the join happens, which makes joining cheaper because less data ends up in the join. Many more improvements are possible and may be added to the planner in the near future.
However, as of PostgreSQL 10 a large step forward has been made already to allow PostgreSQL to dynamically distribute queries in a cluster.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.