On some rare occasions, I’ve been asked if Postgres supports system-versioned temporal tables – in other words, time travel on table data for your queries. As is quite often the case with Postgres, thanks to its extensibility, I can answer: well, officially it does not…but you can make it work nevertheless 🙂

Why are people even asking such a crazy question? Well, the idea has something to it. For example, when something fishy suddenly happens with your data – say, an UPDATE returns “1000 rows affected” when you were expecting only 10 rows to be affected – then it would be nice to quickly “scroll back the time” and see how the table actually looked 5 minutes ago, and maybe reinstate the lost data quickly if it indeed was an error. Turning to real cold backups in such a case could mean hours spent – too slow!

Temporal table options

The first possibility for meeting such a requirement on any database system is meticulous schema design – basically never deleting data at all, i.e. “insert only” data modelling. In some cases this can be pretty convenient, so if the concept is new to you, I’d recommend checking out how it plays out in the PostgreSQL context in earlier posts here and here.
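To give a quick flavour of the idea, a minimal sketch could look something like the one below – table and column names are of course purely illustrative.

-- "insert only": balance changes are appended, never updated or deleted
CREATE TABLE account_balances (
    account_id  int          NOT NULL,
    balance     numeric      NOT NULL,
    valid_from  timestamptz  NOT NULL DEFAULT now()
);

-- the state of account 42 as of one hour ago = the latest row before that point in time
SELECT balance
  FROM account_balances
 WHERE account_id = 42
   AND valid_from <= now() - interval '1 hour'
 ORDER BY valid_from DESC
 LIMIT 1;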

On the system-implementation level, Oracle has had built-in support for this kind of time travel for years, in the form of the “AS OF” syntax, and MS SQL has supported something similar since its 2016 release. I must say that I’ve never used that particular Oracle feature myself, but I’ve heard that it has some limitations for huge databases with lots of changes happening – so probably something disk / WAL storage related. Which I can well believe – database engines almost always need to pick and optimize for a few important attributes; they can’t have them all.
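For reference, a “time travel” query on those systems looks roughly like this (table names are made up):

-- Oracle ("Flashback Query")
SELECT * FROM accounts AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '5' MINUTE);

-- MS SQL (system-versioned temporal table)
SELECT * FROM dbo.accounts FOR SYSTEM_TIME AS OF '2021-03-01 12:00:00';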

In short, it’s true that PostgreSQL has no such native feature out of the box, but the idea is definitely not a new one – as can be seen here. As always with Postgres, there are some related extensions or plain schema design scripts available, which I’ve seen quite a few of. The most promising of them might be this one, but I’ve yet to try it out.

My workaround solution for today is a mishmash of two rather “rareish” PostgreSQL features: dblink + recovery apply delay.

Time delayed replication

This cool Postgres feature largely lives in the shadows, but it basically allows you to have an extra regular streaming replication node that is constantly time-delayed! In other words, instead of applying changes coming from the primary node immediately (the default), you can tell the replica to apply them only after some set amount of time, for example 1 hour.

The parameter to set the “apply delay” is called recovery_min_apply_delay (official documentation here) and can be conveniently set in human readable units. Note the “min” or minimum part – meaning, if you have some longer queries on the node, or the hardware is weak, the actual delay could be longer.
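On the replica, in postgresql.conf, this can look as simple as the line below (all the usual time units like min, h and d are accepted) – we’ll set it via ALTER SYSTEM a bit later.

# hold back applying the incoming WAL for (at least) one hour
recovery_min_apply_delay = '1h'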

Bear in mind that for huge and very busy databases, in practice this feature can get “costly” in terms of disk space – the WAL pulled in must be stored for the selected time period (and usually even a bit more) until a “restartpoint” is performed. It’s better to start conservatively, with lags measured in hours, not days. If this really becomes a show-stopper for longer lag periods, there are alternatives: one can switch to the old “archive recovery” replication method (which was the only binary replication possibility before v9.0) and only pull in WAL when needed, or even regularly re-sync with a PITR backup system like pgBackRest, which performs very snappily in “delta” mode.
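As a rough sketch of that alternative path (the pgBackRest stanza name “demo” is just a placeholder), the replica would then fetch WAL from the archive on demand instead of streaming it, or get re-synced from a backup periodically:

# postgresql.conf on the replica - WAL is fetched from the archive only when needed
restore_command = 'pgbackrest --stanza=demo archive-get %f "%p"'
recovery_min_apply_delay = '1h'

# ...or periodically re-sync the whole replica from a backup; "delta" mode copies only changed files
pgbackrest --stanza=demo --delta restore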

Postgres-to-Postgres communication over “dblink”

This feature – actually an extension – is probably better known than the previous one, since something equivalent is provided by pretty much every other DBMS. It basically allows you to access arbitrary remote Postgres nodes on-the-fly, in a fully dynamic way – i.e. we don’t need to set up anything on the target system or the executing system! The queries can be arbitrarily complex, given that we know which data types are being returned – the syntax requires that we declare them via a column alias list. So a “Hello world” for “dblink” in its simplest form would look something like this:


CREATE EXTENSION dblink;

/* opening a remote connection to the same instance I’m on */
SELECT * from dblink('host=localhost port=5432', 'select 1') as t(x int);
 x 
───
 1
(1 row)

By the way, if you’re planning to make heavier use of this functionality, there are also some more elaborate access patterns available, like connection caching or privilege escalation – see the documentation for details.
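As a tiny illustration of the connection caching part, a named connection can be opened once and then reused by any number of later dblink() calls:

/* open a named, cached connection once... */
SELECT dblink_connect('myconn', 'host=localhost port=5432');

/* ...reuse it for any number of queries... */
SELECT * FROM dblink('myconn', 'select 1') AS t(x int);

/* ...and close it when we're done */
SELECT dblink_disconnect('myconn');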

1 + 1 = 10

So what do we get when we cross-pollinate the two features previously explained? Yep, you guessed it – a dose of extra coolness in the form of “temporal tables”!

There are some small annoyances to be dealt with. As explained in the “dblink” section – we need to explicitly know our returned data types, and the SQL code is not going to look pretty, intermingled with some connect string info. Also, since it’s not really “native”, performance could potentially be sub-optimal, and would need verification in each individual use case.

Additionally, we cannot just select a new “timeline” within a query as we please, but rather we need to meddle with the replica configuration. We can only conveniently scroll the time forward – say from 2h lag to 1h lag. Going to a 3h lag would require a rebuild or PITR restore. But we can have many replicas running simultaneously with different lags, with a little extra hardware cost.

In the end, it works, so let’s cut to the code and set up a minimal useful example with 2 nodes, using the latest PostgreSQL version – v13.2.

Test setup

For test hardware, I used 2 DigitalOcean nodes with 4 vCPUs, 8GB of RAM and 160GB of SSD storage in the same AMS3 AZ. In the sample code below, 10.110.0.4 is the primary and 10.110.0.5 the time-delayed replica node. As the test dataset, I used the good old default pgbench schema with a scaling factor of 100, resulting in about 1.3GB of data – so that everything fits into memory and we’re not affected by disk slowness, since we don’t want to measure disk access here.

The only changes I made to the server configuration on both nodes were the following: “work_mem=512MB” and “shared_buffers=2GB” + “trust” access in pg_hba.conf between the two hosts’ private IPs.
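In other words, something along these lines on both nodes (the pg_hba.conf rules are of course a simplification, assuming a trusted private network):

# postgresql.conf
work_mem = 512MB
shared_buffers = 2GB

# pg_hba.conf - "trust" between the two private IPs, for both normal and replication connections
host  all          all  10.110.0.4/32  trust
host  all          all  10.110.0.5/32  trust
host  replication  all  10.110.0.4/32  trust
host  replication  all  10.110.0.5/32  trust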

Note that I’m skipping the build-up of the initial streaming replication, as it’s a 100% standard setup – we’ll jump directly to the more interesting stuff.
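(For reference, a minimal version of that standard setup could look something like the following on the replica-to-be node – the data directory path is of course environment-specific.)

# on the replica node - clone the primary and auto-configure streaming replication (-R)
pg_basebackup -h 10.110.0.4 -U postgres -D /var/lib/postgresql/13/main -R --wal-method=stream --progress
pg_ctl -D /var/lib/postgresql/13/main start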


# on the primary node, assuming you are logged in as the "postgres" user
# let's initialize 10 million test bank accounts
pgbench -i -s 100
psql -c "CREATE EXTENSION dblink"

Let’s see if “dblink” works by accessing the secondary node from the primary one – at this point, the secondary is still a completely normal, up-to-date streaming replication HA replica, ready for a theoretical failover.


SELECT *                 
    FROM dblink('host=10.110.0.5 dbname=postgres',
                'select aid, abalance from pgbench_accounts limit 1')
      AS t1( aid int, abalance int);
 aid | abalance 
-----+----------
   1 |        0
(1 row)

Yay, it works! So now we can go and enable the “apply lag” on the replica side, and verify that the feature indeed functions as it should.

Applying the “apply lag”

On node 2, we can enable the “apply lag” dynamically – editing the configuration file plus a reload (or restart) would of course also work.


postgres=# ALTER SYSTEM SET recovery_min_apply_delay TO '1h';
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)

Now we’re ready to generate some test transactions on the primary, and verify that the time-delayed replica does not yet reflect the change:


# let's run the "simple" version of the built-in TPC-B test for a duration of 1 second on the primary
pgbench -T 1 -N -n

# and check the sum of all account balances
psql -c "SELECT sum(abalance) from pgbench_accounts"
  sum   
--------
 110304
(1 row)

But on the replica:


psql -c "SELECT sum(abalance) from pgbench_accounts"
 sum 
-----
   0
(1 row)

We’re still at zero – the “lag” feature is indeed working!
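By the way, the actual applied lag can also be inspected directly on the replica, for example like so:

# on the replica node
psql -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag"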

Use case – discovering potentially harmful data changes

Let’s assume that we want to use our “pseudo temporal tables” to see which rows have changed during the last hour – which might well be a common use case. We then basically need to calculate a diff, combining data from both the primary and the time-delayed replica. For the number of changed rows, let’s take about 1% of total rows, i.e. 100k rows in our case. So on the primary, let’s execute the following:


# let's also enable asynchronous commit here to speed up the test - we're not really
# interested in testing or waiting on the disk
PGOPTIONS="-c synchronous_commit=off" pgbench -t 100000 -N -n -M prepared

# let’s check how many rows actually have been changed - some rows were changed many times
# due to random selection
psql -c "SELECT count(*) FROM pgbench_accounts WHERE abalance <> 0"
 count 
-------
 99201
(1 row)

And now, to the diffing part. The query itself in “meta-SQL” could, in its simplest form, look something like this:

SELECT *
  FROM $current c
  FULL OUTER JOIN $lag l USING (id)
 WHERE c.$datacol IS DISTINCT FROM l.$datacol
    OR c.id IS NULL
    OR l.id IS NULL;

Note that this is one of those rare cases where we definitely want to use a FULL OUTER JOIN as we also want to know about new / disappeared rows. If we were only interested in new and changed rows, we could just use a LEFT JOIN, and the query would probably be quite a bit faster.
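In “meta-SQL” again, that new-and-changed-rows-only variant would look something like this:

SELECT c.*
  FROM $current c
  LEFT JOIN $lag l USING (id)
 WHERE c.$datacol IS DISTINCT FROM l.$datacol
    OR l.id IS NULL;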

How does it perform?

Now to the more important question – how does “dblink” perform with our test query? Can we actually use it in the real world to compare a significant amount of rows? Let’s find out…


postgres=# EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
WITH lag_data AS (
  SELECT aid, abalance
  FROM dblink('host=10.110.0.5  dbname=postgres',
              'select aid, abalance from pgbench_accounts')
       AS pgbench_accounts_lag(aid int, abalance int)
)
SELECT * FROM pgbench_accounts c
  FULL OUTER JOIN lag_data l USING (aid)
 WHERE c.abalance IS DISTINCT FROM l.abalance
    OR c.aid IS NULL OR l.aid IS NULL;
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Hash Full Join (actual rows=99201 loops=1)
   Hash Cond: (l.aid = c.aid)
   Filter: ((c.abalance IS DISTINCT FROM l.abalance) OR (c.aid IS NULL) OR (l.aid IS NULL))
   Rows Removed by Filter: 9900799
   CTE lag_data
     ->  Function Scan on dblink pgbench_accounts_lag (actual rows=10000000 loops=1)
   ->  CTE Scan on lag_data l (actual rows=10000000 loops=1)
   ->  Hash (actual rows=10000000 loops=1)
         Buckets: 16777216  Batches: 1  Memory Usage: 1390838kB
         ->  Seq Scan on pgbench_accounts c (actual rows=10000000 loops=1)
 Planning Time: 0.136 ms
 Execution Time: 20854.470 ms
(12 rows)

Around 20 seconds – not bad, I believe, given that 10 million rows were scanned on both sides, plus those 10 million rows were shipped over the wire once.

NB! Note that I executed the query a couple of times first to get everything cached, since we want to take disk access out of the equation. Also, I didn’t actually want to fetch the changed rows themselves, so I used EXPLAIN ANALYZE – with costs and, more importantly, per-node timings disabled – to minimize the “observer effect”.

A bit of relativity theory

Along the lines of query performance…I started to wonder: was the number we got actually any good? How would it perform if the data was totally local and persistent, i.e. like a normal table? So on the primary side, I pulled in the whole lagged dataset from the standby into a separate table!


CREATE UNLOGGED TABLE pgbench_accounts_lag_local (LIKE pgbench_accounts INCLUDING ALL);

INSERT INTO pgbench_accounts_lag_local
  SELECT *
    FROM dblink('host=10.110.0.5',
                'select aid, bid, abalance, filler from pgbench_accounts')
         AS x(aid int, bid int, abalance int, filler text);

VACUUM ANALYZE pgbench_accounts_lag_local;

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM pgbench_accounts c
  FULL OUTER JOIN pgbench_accounts_lag_local l USING (aid)
 WHERE c.abalance IS DISTINCT FROM l.abalance
    OR c.aid IS NULL OR l.aid IS NULL;
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Hash Full Join (actual rows=99201 loops=1)
   Hash Cond: (c.aid = l.aid)
   Filter: ((c.abalance IS DISTINCT FROM l.abalance) OR (c.aid IS NULL) OR (l.aid IS NULL))
   Rows Removed by Filter: 9900799
   ->  Seq Scan on pgbench_accounts c (actual rows=10000000 loops=1)
   ->  Hash (actual rows=10000000 loops=1)
         Buckets: 16777216  Batches: 1  Memory Usage: 1390838kB
         ->  Seq Scan on pgbench_accounts_lag_local l (actual rows=10000000 loops=1)
 Planning Time: 0.130 ms
 Execution Time: 11233.989 ms
(10 rows)

After a couple of runs, we arrived at around 11 seconds. As expected – considerably faster locally 🙂 But the “dblink” version is also not really tragic, I would say – at least not slower by an order of magnitude. I could live with that, given that something like this is a rare occasion. Well, at least it should be 🙂

Summary

As we saw, this somewhat obscure approach of “dblink” + “recovery delay” mixed together can definitely deliver good results within a reasonable amount of time for our simple test case – so the approach might be worth making a mental note of, if you haven’t seen it before. The performance side could get problematic for really large data amounts – but hey, we’re competing with much more expensive and complex database systems, and we still get the job done, without paying a dime!

Also, the test buildup could surely be improved – consider that we were changing quite a lot of data – 1% of some huge transactional table usually does not get changed within an hour. In a real-world scenario, we could potentially take advantage of some indexes, whereas here we only had full scans.
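Just to illustrate the idea (purely hypothetically – pgbench_accounts has no such column), if the table carried some kind of “last modified” timestamp maintained by the application or a trigger, the comparison could be narrowed down to recently touched rows via an index, instead of diffing all 10 million rows:

-- hypothetical: assumes a "last_modified" column kept up to date by the app or a trigger
CREATE INDEX ON pgbench_accounts (last_modified);

-- only rows touched within the last hour would need to be compared against the lagged replica
SELECT aid, abalance
  FROM pgbench_accounts
 WHERE last_modified > now() - interval '1 hour';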

In the future, the coolest thing would be some level of support for something like this on the “postgres_fdw” side – I actually even tried it, but currently it’s not an intended use case, so recursion went on behind the scenes, with lots of connections being opened until some ‘out of connection’ errors popped up.

Thanks for reading!