written by Granthana Biswas
pgBackRest is an open source backup tool for PostgreSQL which offers easy configuration and reliable backups. If you want to protect your database and create backups easily, pgBackRest is a good solution to make that happen. In this blog, we are going to go through the basic steps of using pgBackRest for full and differential backups of PostgreSQL.
Some of the key features of pgBackRest are:
- parallel backup and restore
- full, differential, and incremental backups
- local or remote operation
- backup rotation and archive expiration
- backup integrity checks
- compression and encryption
To read about them in more detail, please visit pgbackrest.org.
Here, we are going to build pgBackRest from the source and install it on the host where a test DB cluster is running.
Installing from Debian / Ubuntu packages:
sudo apt-get install pgbackrest
For manual installation, download the source on a build host. Please avoid building the source on a production server, as the tools required should not be installed on a production machine:
sudo wget -q -O - https://github.com/pgbackrest/pgbackrest/archive/release/2.14.tar.gz | sudo tar zx -C /root
Install the build dependencies and verify that Perl was built with 64-bit integer support:

sudo apt-get install build-essential libssl-dev libxml2-dev libperl-dev zlib1g-dev
perl -V | grep USE_64_BIT_INT
Configure and build the source:

(cd /root/pgbackrest-release-2.14/src && ./configure)
make -s -C /root/pgbackrest-release-2.14/src
Copy the pgbackrest binary from the build host to the database host and make it executable:

sudo scp BUILD_HOST:/root/pgbackrest-release-2.14/src/pgbackrest /usr/bin/
sudo chmod 755 /usr/bin/pgbackrest
Install the Perl library required by pgBackRest on the database host:

sudo apt-get install libdbd-pg-perl
Create the log and configuration directories and set ownership and permissions:

sudo mkdir -p -m 770 /var/log/pgbackrest
sudo chown postgres:postgres /var/log/pgbackrest
sudo mkdir -p /etc/pgbackrest
sudo mkdir -p /etc/pgbackrest/conf.d
sudo touch /etc/pgbackrest/pgbackrest.conf
sudo chmod 640 /etc/pgbackrest/pgbackrest.conf
sudo chown postgres:postgres /etc/pgbackrest/pgbackrest.conf
Create the directory that will hold the backup repository:

sudo mkdir -p /var/lib/pgbackrest
sudo chmod 750 /var/lib/pgbackrest
sudo chown postgres:postgres /var/lib/pgbackrest
Next, configure the stanza and the repository path. The configuration file should look like this:

cat /etc/pgbackrest/pgbackrest.conf
[demo]
pg1-path=/data/postgres/pgdata/data1
[global]
repo1-path=/var/lib/pgbackrest
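The backup output later in this post shows --repo1-retention-full=2 in effect. Retention is typically configured in the [global] section of the same file; here is a sketch of the extended configuration (the retention value of 2 full backups is an example, not a requirement):

[demo]
pg1-path=/data/postgres/pgdata/data1

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2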
Change the following parameters in postgresql.conf:
archive_command = 'pgbackrest --stanza=demo archive-push %p'
archive_mode = on
listen_addresses = '*'
log_line_prefix = ''
max_wal_senders = 3
wal_level = replica
After restarting PostgreSQL (changing archive_mode requires a restart), run the following commands as the postgres user. The stanza-create command must be run on the host where the repository is located to initialize the stanza. It is recommended that the check command be run after stanza-create to ensure archiving and backups are properly configured.
$ pgbackrest --stanza=demo --log-level-console=info stanza-create
2019-07-03 12:26:40.060 P00 INFO: stanza-create command begin 2.14: --log-level-console=info --pg1-path=/data/postgres/pgdata/data1 --repo1-path=/var/lib/pgbackrest --stanza=demo
2019-07-03 12:26:40.494 P00 INFO: stanza-create command end: completed successfully (435ms)

$ pgbackrest --stanza=demo --log-level-console=info check
2019-07-03 12:27:11.996 P00 INFO: check command begin 2.14: --log-level-console=info --pg1-path=/data/postgres/pgdata/data1 --repo1-path=/var/lib/pgbackrest --stanza=demo
2019-07-03 12:27:13.386 P00 INFO: WAL segment 000000010000000000000003 successfully stored in the archive at '/var/lib/pgbackrest/archive/demo/10-1/0000000100000000/000000010000000000000003-b346d07d4b31e54e31d9204204816cde3cfcca3a.gz'
2019-07-03 12:27:13.387 P00 INFO: check command end: completed successfully (1392ms)
Use the info command to get information about the backups. Since we haven't made any backups yet, we will get the following result:
$ pgbackrest info
stanza: demo
    status: error (no valid backups)
    cipher: none

    db (current)
        wal archive min/max (10-1): 000000010000000000000001/000000010000000000000003
Let's make the first backup. The first backup is always taken as a full backup: even if we request an incremental or differential backup, pgBackRest will switch to full, as the WARN line in the output shows:
$ pgbackrest --stanza=demo --log-level-console=info backup
2019-07-03 12:37:38.366 P00 INFO: backup command begin 2.14: --log-level-console=info --pg1-path=/data/postgres/pgdata/data1 --repo1-path=/var/lib/pgbackrest --repo1-retention-full=2 --stanza=demo
WARN: no prior backup exists, incr backup has been changed to full
2019-07-03 12:37:39.200 P00 INFO: execute non-exclusive pg_start_backup() with label 'pgBackRest backup started at 2019-07-03 12:37:38': backup begins after the next regular checkpoint completes
2019-07-03 12:37:39.500 P00 INFO: backup start archive = 000000010000000000000005, lsn = 0/5000028
. . .
2019-07-03 12:37:45.212 P00 INFO: full backup size = 22.5MB
2019-07-03 12:37:45.212 P00 INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments to archive
2019-07-03 12:37:45.313 P00 INFO: backup stop archive = 000000010000000000000005, lsn = 0/5000130
2019-07-03 12:37:45.558 P00 INFO: new backup label = 20190703-123738F
2019-07-03 12:37:45.586 P00 INFO: backup command end: completed successfully (7221ms)
2019-07-03 12:37:45.586 P00 INFO: expire command begin
2019-07-03 12:37:45.594 P00 INFO: full backup total < 2 - using oldest full backup for 10-1 archive retention
2019-07-03 12:37:45.596 P00 INFO: expire command end: completed successfully (10ms)
Now when we run the info command again:
$ pgbackrest info
stanza: demo
    status: ok
    cipher: none

    db (current)
        wal archive min/max (10-1): 000000010000000000000005/000000010000000000000005

        full backup: 20190703-123738F
            timestamp start/stop: 2019-07-03 12:37:38 / 2019-07-03 12:37:45
            wal start/stop: 000000010000000000000005 / 000000010000000000000005
            database size: 22.6MB, backup size: 22.6MB
            repository size: 2.7MB, repository backup size: 2.7MB
Now that a full backup exists, let's take a differential backup:

$ pgbackrest --stanza=demo --log-level-console=info --type=diff backup
2019-07-03 12:40:05.749 P00 INFO: backup command begin 2.14: --log-level-console=info --pg1-path=/data/postgres/pgdata/data1 --repo1-path=/var/lib/pgbackrest --repo1-retention-full=2 --stanza=demo --type=diff
2019-07-03 12:40:05.951 P00 INFO: last backup label = 20190703-123738F, version = 2.14
2019-07-03 12:40:06.657 P00 INFO: execute non-exclusive pg_start_backup() with label 'pgBackRest backup started at 2019-07-03 12:40:05': backup begins after the next regular checkpoint completes
2019-07-03 12:40:06.958 P00 INFO: backup start archive = 000000010000000000000007, lsn = 0/7000028
2019-07-03 12:40:08.414 P01 INFO: backup file /data/postgres/pgdata/data1/global/pg_control (8KB, 99%) checksum c8b3635ef4701b19bff56fcd5ca33d41eaf3ce5b
2019-07-03 12:40:08.421 P01 INFO: backup file /data/postgres/pgdata/data1/pg_logical/replorigin_checkpoint (8B, 100%) checksum 347fc8f2df71bd4436e38bd1516ccd7ea0d46532
2019-07-03 12:40:08.439 P00 INFO: diff backup size = 8KB
2019-07-03 12:40:08.439 P00 INFO: execute non-exclusive pg_stop_backup() and wait for all WAL segments to archive
2019-07-03 12:40:08.540 P00 INFO: backup stop archive = 000000010000000000000007, lsn = 0/70000F8
2019-07-03 12:40:08.843 P00 INFO: new backup label = 20190703-123738F_20190703-124005D
2019-07-03 12:40:08.938 P00 INFO: backup command end: completed successfully (3189ms)
2019-07-03 12:40:08.938 P00 INFO: expire command begin
2019-07-03 12:40:08.949 P00 INFO: full backup total < 2 - using oldest full backup for 10-1 archive retention
2019-07-03 12:40:08.951 P00 INFO: expire command end: completed successfully (13ms)
And check the info output once more:
$ pgbackrest info
stanza: demo
    status: ok
    cipher: none

    db (current)
        wal archive min/max (10-1): 000000010000000000000005/000000010000000000000007

        full backup: 20190703-123738F
            timestamp start/stop: 2019-07-03 12:37:38 / 2019-07-03 12:37:45
            wal start/stop: 000000010000000000000005 / 000000010000000000000005
            database size: 22.6MB, backup size: 22.6MB
            repository size: 2.7MB, repository backup size: 2.7MB

        diff backup: 20190703-123738F_20190703-124005D
            timestamp start/stop: 2019-07-03 12:40:05 / 2019-07-03 12:40:08
            wal start/stop: 000000010000000000000007 / 000000010000000000000007
            database size: 22.6MB, backup size: 8.2KB
            repository size: 2.7MB, repository backup size: 468B
            backup reference list: 20190703-123738F
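With backups in place, restoring is a single command. A minimal sketch, assuming a systemd service named postgresql (the name may differ on your system); the cluster must be stopped first, and --delta reuses the existing data directory, rewriting only files that differ from the backup:

sudo systemctl stop postgresql        # service name is an assumption
sudo -u postgres pgbackrest --stanza=demo --delta restore
sudo systemctl start postgresql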
If you want to learn more about how to protect data, check out my blog on PostgreSQL TDE ("Transparent Data Encryption").
Also if you want to make sure that your database performs well, check out our blog posts on PostgreSQL performance.
Row Level Security (RLS) is one of the key features in PostgreSQL. It can be used to dramatically improve security and to protect data. However, there are a couple of corner cases which most people are not aware of. So if you are running PostgreSQL and you happen to use RLS in a high-security environment, this might be the most important piece of text about database security you have ever read.
To prepare for my examples let me create some data first. The following code is executed as superuser:
CREATE USER bob NOSUPERUSER;
CREATE USER alice NOSUPERUSER;

CREATE TABLE t_service (service_type text, service text);

INSERT INTO t_service VALUES
    ('open_source', 'PostgreSQL consulting'),
    ('open_source', 'PostgreSQL training'),
    ('open_source', 'PostgreSQL 24x7 support'),
    ('closed_source', 'Oracle tuning'),
    ('closed_source', 'Oracle license management'),
    ('closed_source', 'IBM DB2 training');

GRANT ALL ON SCHEMA PUBLIC TO bob, alice;
GRANT ALL ON TABLE t_service TO bob, alice;
For the sake of simplicity there are only three users: postgres, bob, and alice. The t_service table contains six different services. Some are related to PostgreSQL and some to Oracle. The goal is to ensure that bob is only allowed to see Open Source stuff while alice is mostly an Oracle girl.
While hacking up the example, we want to see who we are and which chunks of code are executed as which user at all times. Therefore I have written a debug function which just throws out a message and returns true:
CREATE FUNCTION debug_me(text) RETURNS boolean AS
$$
BEGIN
    RAISE NOTICE 'called as session_user=%, current_user=% for ''%''',
        session_user, current_user, $1;
    RETURN true;
END;
$$ LANGUAGE 'plpgsql';

GRANT ALL ON FUNCTION debug_me(text) TO bob, alice;
Now that the infrastructure is in place, RLS can be enabled for this table:
ALTER TABLE t_service ENABLE ROW LEVEL SECURITY;
Note that the superuser is still able to see all the data, because superusers bypass row level security. Normal users are not allowed to see anything yet; to them, the table will appear to be empty.
Of course, people want to see data. In order to expose data to people, policies have to be created. In my example there will be two policies:
CREATE POLICY bob_pol ON t_service
    FOR SELECT
    TO bob
    USING (debug_me(service) AND service_type = 'open_source');

CREATE POLICY alice_pol ON t_service
    FOR SELECT
    TO alice
    USING (debug_me(service) AND service_type = 'closed_source');
What we see here is that bob is really the Open Source guy while alice is more on the Oracle side. I added the debug_me function to the policy so that you can see which users are active.
Let us set the current role to bob and run a simple SELECT statement:
test=# SET ROLE bob;
SET
test=> SELECT * FROM t_service;
psql: NOTICE: called as session_user=hs, current_user=bob for 'PostgreSQL consulting'
psql: NOTICE: called as session_user=hs, current_user=bob for 'PostgreSQL training'
psql: NOTICE: called as session_user=hs, current_user=bob for 'PostgreSQL 24x7 support'
 service_type |         service
--------------+-------------------------
 open_source  | PostgreSQL consulting
 open_source  | PostgreSQL training
 open_source  | PostgreSQL 24x7 support
(3 rows)
The policy does exactly what you would expect for bob. The same thing is true for alice:
test=> SET ROLE alice;
SET
test=> SELECT * FROM t_service;
psql: NOTICE: called as session_user=hs, current_user=alice for 'Oracle tuning'
psql: NOTICE: called as session_user=hs, current_user=alice for 'Oracle license management'
psql: NOTICE: called as session_user=hs, current_user=alice for 'IBM DB2 training'
 service_type  |          service
---------------+---------------------------
 closed_source | Oracle tuning
 closed_source | Oracle license management
 closed_source | IBM DB2 training
(3 rows)
As a PostgreSQL consultant and PostgreSQL support company there is one specific question which keeps coming to us again and again: What happens if RLS (Row Level Security) is used in combination with views? This kind of question is not as easy to answer as some people might think. Expect some corner cases which require a little bit of thinking to get stuff right.
To show how things work, I will switch back to user “postgres” and create two identical views:
test=> SET ROLE postgres;
SET
test=# CREATE VIEW v1 AS SELECT *, session_user, current_user FROM t_service;
CREATE VIEW
test=# CREATE VIEW v2 AS SELECT *, session_user, current_user FROM t_service;
CREATE VIEW
test=# GRANT SELECT ON v1 TO bob, alice;
GRANT
test=# ALTER VIEW v2 OWNER TO alice;
ALTER VIEW
test=# GRANT SELECT ON v2 TO bob;
GRANT
SELECT permissions are granted on both views, but there is one more difference: alice is the owner of v2, while v1 is owned by the postgres user. This tiny detail makes a major difference later on, as you will see. To sum it up: v1 is owned by postgres, v2 is owned by our commercial database lady alice, and everybody is allowed to read from both views.
test=# SET ROLE bob;
SET
test=> SELECT * FROM v1;
 service_type  |          service           | session_user | current_user
---------------+----------------------------+--------------+--------------
 open_source   | PostgreSQL consulting      | hs           | bob
 open_source   | PostgreSQL training        | hs           | bob
 open_source   | PostgreSQL 24x7 support    | hs           | bob
 closed_source | Oracle tuning              | hs           | bob
 closed_source | Oracle license management  | hs           | bob
 closed_source | IBM DB2 training           | hs           | bob
(6 rows)
Ooops! What is going on here? “bob” is allowed to see all the data. There is a reason for that: The view is owned by “postgres”. That means that the row level policy on t_service will not be taken into account. The RLS policies (Row Level Security) have been defined for bob and alice. However, in this case they are not taken into consideration, because the view is owned by the superuser, and the superuser has given us SELECT permissions on this view so we can see all that data. That is important: Imagine some sort of aggregation (e.g. SELECT sum(turnover) FROM sales). A user might see the aggregate but not the raw data. In that case, skipping the policy is perfectly fine.
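To see why skipping the policy can be desirable, consider a minimal sketch of such an aggregate-only view. The t_sales table and its columns are made up for illustration:

-- hypothetical table, for illustration only
CREATE TABLE t_sales (salesperson text, turnover numeric);

-- view owned by postgres: readers get the aggregate,
-- but never the underlying raw rows
CREATE VIEW v_total_sales AS
    SELECT sum(turnover) AS total_turnover
    FROM t_sales;

GRANT SELECT ON v_total_sales TO bob, alice;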
test=> SELECT * FROM v2;
psql: NOTICE: called as session_user=hs, current_user=bob for 'Oracle tuning'
psql: NOTICE: called as session_user=hs, current_user=bob for 'Oracle license management'
psql: NOTICE: called as session_user=hs, current_user=bob for 'IBM DB2 training'
 service_type  |          service           | session_user | current_user
---------------+----------------------------+--------------+--------------
 closed_source | Oracle tuning              | hs           | bob
 closed_source | Oracle license management  | hs           | bob
 closed_source | IBM DB2 training           | hs           | bob
(3 rows)
The “current_user” is still “bob” BUT what we see is only closed source data, which basically belongs to alice. Why does that happen? The reason is: v2 belongs to alice and therefore PostgreSQL will check alice's RLS policy. Remember, she is supposed to see closed source data and as the “owner” of the data she is in charge. The result is: bob will see closed source data, but no open source data (which happens to be his domain). Keep these corner cases in mind - not being aware of this behavior can create nasty security problems. Always ask yourself which policy PostgreSQL will actually use behind the scenes. Having a small test case at hand can be really useful in this context.
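If you want a view to enforce the policies of the user who actually runs the query rather than those of the view owner, PostgreSQL 15 and later offer the security_invoker option for views. A minimal sketch (the view name v3 is made up):

-- PostgreSQL 15+: privileges and RLS policies are checked
-- against the invoking user, not the view owner
CREATE VIEW v3 WITH (security_invoker = true) AS
    SELECT * FROM t_service;

GRANT SELECT ON v3 TO bob, alice;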
What you have seen are some corner cases many people are not aware of. Our PostgreSQL consultants have seen some horrible mistakes in this area already, and we would like to ensure that other people out there don't make the same mistakes.
Let us drop those policies:
test=> SET ROLE postgres;
SET
test=# DROP POLICY bob_pol ON t_service;
DROP POLICY
test=# DROP POLICY alice_pol ON t_service;
DROP POLICY
Usually policies are not assigned to individual people, but to a group of people or sometimes even to “public” (basically everybody who does not happen to be a superuser in this context). The following code snippet shows a simple example:
test=# CREATE POLICY general_pol ON t_service
           FOR SELECT
           TO public
           USING (CASE WHEN CURRENT_USER = 'bob'
                       THEN service_type = 'open_source'
                       ELSE service_type = 'closed_source'
                  END);
CREATE POLICY
If the CURRENT_USER is bob, the system is supposed to show Open Source data. Otherwise it is all about closed source.
Let us take a look at what happens:
test=# SET ROLE bob;
SET
test=> SELECT * FROM v2;
 service_type |         service         | session_user | current_user
--------------+-------------------------+--------------+--------------
 open_source  | PostgreSQL consulting   | hs           | bob
 open_source  | PostgreSQL training     | hs           | bob
 open_source  | PostgreSQL 24x7 support | hs           | bob
(3 rows)
The most important observation is that the policy applies to everybody who is neither a superuser nor marked with BYPASSRLS. As expected, alice will only see her subset of the data:
test=> SET ROLE alice;
SET
test=> SELECT * FROM v2;
 service_type  |          service           | session_user | current_user
---------------+----------------------------+--------------+--------------
 closed_source | Oracle tuning              | hs           | alice
 closed_source | Oracle license management  | hs           | alice
 closed_source | IBM DB2 training           | hs           | alice
(3 rows)
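CASE constructions like the one above get unwieldy quickly. An alternative worth sketching is to base the policy on group role membership via pg_has_role; the role name below is made up, and general_pol would have to be dropped first, since multiple permissive policies are combined with OR:

-- hypothetical group role for open source readers
CREATE ROLE os_readers NOLOGIN;
GRANT os_readers TO bob;

-- members of os_readers see open source rows,
-- everybody else sees closed source rows
CREATE POLICY group_pol ON t_service
    FOR SELECT
    TO public
    USING (CASE WHEN pg_has_role(current_user, 'os_readers', 'MEMBER')
                THEN service_type = 'open_source'
                ELSE service_type = 'closed_source'
           END);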
The key takeaway is that defining policies has to be done with great care. ALWAYS make sure that your setup is well tested and that no leaks can happen. Security is one of the most important topics in any modern IT system, and nobody wants to take chances in this area.
As a PostgreSQL consulting company we can help to make sure that your databases are indeed secure. Leaks must not happen and we can help to achieve that.
If you want to learn more about PostgreSQL security in general, check out our PostgreSQL products including Data Masking for PostgreSQL which helps you to obfuscate data, and check out PL/pgSQL_sec which has been designed explicitly to protect your code.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
UPDATED July 2023: Window functions and analytics have been around for quite some time and many people already make use of this awesome stuff in the PostgreSQL world. Timeseries are an especially important area in this context. However, not all features have been widely adopted and thus many developers have to implement functionality at the application level in a painful way instead of just using some of the more advanced SQL techniques.
The idea of this blog is to demonstrate some of the advanced analytics so that more people out there can make use of PostgreSQL's true power.
For the purpose of this post I have created a basic data set:
test=# CREATE TABLE t_demo AS
           SELECT ordinality, day, date_part('week', day) AS week
           FROM generate_series('2020-01-02', '2020-01-15', '1 day'::interval)
                WITH ORDINALITY AS x(day);
SELECT 14
In PostgreSQL, the generate_series function will return one row for each day spanning January 2nd, 2020 to January 15th, 2020. The WITH ORDINALITY clause tells PostgreSQL to add an “id” column to the result set of the function. The date_part function will extract the number of the week out of our date. The purpose of this column is to have a couple of identical values in our timeseries.
The following listing shows the data set we'll use:
test=# SELECT * FROM t_demo;
 ordinality |          day           | week
------------+------------------------+------
          1 | 2020-01-02 00:00:00+01 |    1
          2 | 2020-01-03 00:00:00+01 |    1
          3 | 2020-01-04 00:00:00+01 |    1
          4 | 2020-01-05 00:00:00+01 |    1
          5 | 2020-01-06 00:00:00+01 |    2
          6 | 2020-01-07 00:00:00+01 |    2
          7 | 2020-01-08 00:00:00+01 |    2
          8 | 2020-01-09 00:00:00+01 |    2
          9 | 2020-01-10 00:00:00+01 |    2
         10 | 2020-01-11 00:00:00+01 |    2
         11 | 2020-01-12 00:00:00+01 |    2
         12 | 2020-01-13 00:00:00+01 |    3
         13 | 2020-01-14 00:00:00+01 |    3
         14 | 2020-01-15 00:00:00+01 |    3
(14 rows)
One of the things you often have to do is to use a sliding window. In SQL this can easily be achieved using the OVER clause. Here is an example:
test=# SELECT *,
           array_agg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING),
           avg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
       FROM t_demo;
 ordinality |          day           | week | array_agg  |         avg
------------+------------------------+------+------------+---------------------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,2}      |  1.5000000000000000
          2 | 2020-01-03 00:00:00+01 |    1 | {1,2,3}    |  2.0000000000000000
          3 | 2020-01-04 00:00:00+01 |    1 | {2,3,4}    |  3.0000000000000000
          4 | 2020-01-05 00:00:00+01 |    1 | {3,4,5}    |  4.0000000000000000
          5 | 2020-01-06 00:00:00+01 |    2 | {4,5,6}    |  5.0000000000000000
          6 | 2020-01-07 00:00:00+01 |    2 | {5,6,7}    |  6.0000000000000000
          7 | 2020-01-08 00:00:00+01 |    2 | {6,7,8}    |  7.0000000000000000
          8 | 2020-01-09 00:00:00+01 |    2 | {7,8,9}    |  8.0000000000000000
          9 | 2020-01-10 00:00:00+01 |    2 | {8,9,10}   |  9.0000000000000000
         10 | 2020-01-11 00:00:00+01 |    2 | {9,10,11}  | 10.0000000000000000
         11 | 2020-01-12 00:00:00+01 |    2 | {10,11,12} | 11.0000000000000000
         12 | 2020-01-13 00:00:00+01 |    3 | {11,12,13} | 12.0000000000000000
         13 | 2020-01-14 00:00:00+01 |    3 | {12,13,14} | 13.0000000000000000
         14 | 2020-01-15 00:00:00+01 |    3 | {13,14}    | 13.5000000000000000
(14 rows)
The OVER clause allows you to feed the data to the aggregate function. For the sake of simplicity, I have used the array_agg function, which simply returns the data fed to the aggregate as an array. In a real-life scenario you would use something more common, such as avg, sum, min, max, or any other aggregation function. However, array_agg is pretty useful here, in that it shows exactly which values are really passed to the function. ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING tells the system that we want to use 3 rows: the previous row, the current one, and the one after the current row. At the beginning of the list, there is no previous row, so we will only see two values in the array column. At the end of the data set, there are again only two values, because there are no more rows after the last one.
In some cases, it can be useful to exclude the current row from the data passed to the aggregate. The EXCLUDE CURRENT ROW clause has been developed to do exactly that. Here is an example:
test=# SELECT *,
           array_agg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
               EXCLUDE CURRENT ROW)
       FROM t_demo;
 ordinality |          day           | week | array_agg
------------+------------------------+------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 | {2}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,3}
          3 | 2020-01-04 00:00:00+01 |    1 | {2,4}
          4 | 2020-01-05 00:00:00+01 |    1 | {3,5}
          5 | 2020-01-06 00:00:00+01 |    2 | {4,6}
          6 | 2020-01-07 00:00:00+01 |    2 | {5,7}
          7 | 2020-01-08 00:00:00+01 |    2 | {6,8}
          8 | 2020-01-09 00:00:00+01 |    2 | {7,9}
          9 | 2020-01-10 00:00:00+01 |    2 | {8,10}
         10 | 2020-01-11 00:00:00+01 |    2 | {9,11}
         11 | 2020-01-12 00:00:00+01 |    2 | {10,12}
         12 | 2020-01-13 00:00:00+01 |    3 | {11,13}
         13 | 2020-01-14 00:00:00+01 |    3 | {12,14}
         14 | 2020-01-15 00:00:00+01 |    3 | {13}
(14 rows)
As you can see, the array is a little shorter now. The current value is not part of the array anymore.
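A practical use of EXCLUDE CURRENT ROW is to compare each value against the average of its neighbors, for example as a simple outlier check. A sketch against the same table:

test=# SELECT day, ordinality,
           avg(ordinality) OVER (ORDER BY day
               ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
               EXCLUDE CURRENT ROW) AS neighbor_avg
       FROM t_demo;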
The idea behind EXCLUDE TIES is to remove duplicates. Let's take a look at the following to make things clear:
test=# SELECT day, week,
           array_agg(week) OVER (ORDER BY week
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS all,
           array_agg(week) OVER (ORDER BY week
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE TIES) AS ties
       FROM t_demo;
          day           | week |     all     |  ties
------------------------+------+-------------+---------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,1,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,2,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)
The first array_agg simply collects all values in the frame we have defined. The “ties” column is a bit more complicated to understand. Let's take a look at the 5th of January: the result is {1,2,2}. As you can see, two incarnations of 1 have been removed; EXCLUDE TIES made sure that those duplicates are gone. However, this has no impact on the values in the “future”, because those values are different from the current row. The documentation states what EXCLUDE TIES is all about: “EXCLUDE TIES excludes any peers of the current row from the frame, but not the current row itself.”
After this first example, we should also consider a fairly common mistake:
test=# SELECT day, week,
           array_agg(week) OVER (ORDER BY week, day
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE TIES) AS ties
       FROM t_demo;
          day           | week |    ties
------------------------+------+-------------
 2020-01-02 00:00:00+01 |    1 | {1,1,1}
 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}
 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2}
 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2}
 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2}
 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2}
 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2}
 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3}
 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3}
 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3}
 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}
 2020-01-15 00:00:00+01 |    3 | {3,3,3}
(14 rows)
Can you spot the difference between this and the previous example? Take all the time you need …
The problem is that the ORDER BY clause has two columns in this case. That means that there are no duplicates anymore from the ORDER BY's perspective. Thus, PostgreSQL is not going to prune values from the result set. I can assure you that this is a common mistake seen in many cases. The problem can be very subtle and go unnoticed for quite some time.
In some cases you might want to remove an entire set of rows from the result set. To do that, you can make use of EXCLUDE GROUP.
The following example shows how that works, and how our timeseries data can be analyzed:
test=# SELECT *,
           array_agg(week) OVER (ORDER BY week
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE GROUP) AS week,
           array_agg(week) OVER (ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING
               EXCLUDE GROUP) AS all
       FROM t_demo;
 ordinality |          day           | week | week  |    all
------------+------------------------+------+-------+-----------
          1 | 2020-01-02 00:00:00+01 |    1 |       | {1,1}
          2 | 2020-01-03 00:00:00+01 |    1 |       | {1,1,1}
          3 | 2020-01-04 00:00:00+01 |    1 | {2}   | {1,1,1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {2,2} | {1,1,2,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1} | {1,1,2,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1}   | {1,2,2,2}
          7 | 2020-01-08 00:00:00+01 |    2 |       | {2,2,2,2}
          8 | 2020-01-09 00:00:00+01 |    2 |       | {2,2,2,2}
          9 | 2020-01-10 00:00:00+01 |    2 |       | {2,2,2,2}
         10 | 2020-01-11 00:00:00+01 |    2 | {3}   | {2,2,2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {3,3} | {2,2,3,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2} | {2,2,3,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2}   | {2,3,3}
         14 | 2020-01-15 00:00:00+01 |    3 |       | {3,3}
(14 rows)
The first window function orders by week, and its array_agg aggregates that very same column. Here we see a couple of NULL values, because in some cases all entries in the frame belong to the current row's group. The last row in the result set is a good example: all array entries are 3 and the current row contains 3 as well, so all incarnations of 3 are removed, which leaves nothing to aggregate.
To calculate the last column, the data is ordered by day. In this case, there are no duplicates and therefore no data can be removed. For more information on EXCLUDE GROUP, see the documentation on window functions.
In PostgreSQL, an aggregate function is capable of handling DISTINCT. The following example shows what I mean:
test=# SELECT DISTINCT week FROM t_demo;
 week
------
    3
    1
    2
(3 rows)

test=# SELECT count(DISTINCT week) FROM t_demo;
 count
-------
     3
(1 row)
“count” does not simply count all rows; it filters out the duplicate values first. Therefore the result is simply 3.
In PostgreSQL there is no way to use DISTINCT as part of a window function. PostgreSQL will simply error out:
test=# SELECT *,
           array_agg(DISTINCT week) OVER (ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING)
       FROM t_demo;
ERROR:  DISTINCT is not implemented for window functions
LINE 2:     array_agg(DISTINCT week) OVER (ORDER BY day ROWS
The natural question which arises is: How can we achieve the same result without using DISTINCT inside the window function? What you have to do is to filter the duplicates on a higher level. You can use a subselect, unroll the array, remove the duplicates and assemble the array again. It is not hard to do, but it is not as elegant as one might expect.
test=# SELECT *, (SELECT array_agg(DISTINCT unnest)
                  FROM unnest(x)) AS b
       FROM (SELECT *,
                 array_agg(week) OVER (ORDER BY day
                     ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING) AS x
             FROM t_demo) AS a;
 ordinality |          day           | week |      x      |   b
------------+------------------------+------+-------------+-------
          1 | 2020-01-02 00:00:00+01 |    1 | {1,1,1}     | {1}
          2 | 2020-01-03 00:00:00+01 |    1 | {1,1,1,1}   | {1}
          3 | 2020-01-04 00:00:00+01 |    1 | {1,1,1,1,2} | {1,2}
          4 | 2020-01-05 00:00:00+01 |    1 | {1,1,1,2,2} | {1,2}
          5 | 2020-01-06 00:00:00+01 |    2 | {1,1,2,2,2} | {1,2}
          6 | 2020-01-07 00:00:00+01 |    2 | {1,2,2,2,2} | {1,2}
          7 | 2020-01-08 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          8 | 2020-01-09 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
          9 | 2020-01-10 00:00:00+01 |    2 | {2,2,2,2,2} | {2}
         10 | 2020-01-11 00:00:00+01 |    2 | {2,2,2,2,3} | {2,3}
         11 | 2020-01-12 00:00:00+01 |    2 | {2,2,2,3,3} | {2,3}
         12 | 2020-01-13 00:00:00+01 |    3 | {2,2,3,3,3} | {2,3}
         13 | 2020-01-14 00:00:00+01 |    3 | {2,3,3,3}   | {2,3}
         14 | 2020-01-15 00:00:00+01 |    3 | {3,3,3}     | {3}
(14 rows)
In case you are interested in timeseries and aggregation in general, consider checking out some of our other blog posts including “Speeding up count(*)” by Laurenz Albe. If you are interested in high-performance PostgreSQL check out my blog post about finding slow queries.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
Timeseries are an increasingly important topic - not just in PostgreSQL. Recently I gave a presentation at AGIT in Salzburg about timeseries, and I demonstrated some super simple examples. The presentation was well received, so I decided to share this stuff in the form of a blog, so that more people can learn about window functions and SQL in general. A link to the video is available at the end of the post, so that you can listen to the original material in German.
To show how data can be loaded, I compiled a basic dataset which can be found on my website. Here is how it works:
test=# CREATE TABLE t_oil
       (
           region       text,
           country      text,
           year         int,
           production   int,
           consumption  int
       );
CREATE TABLE
test=# COPY t_oil FROM PROGRAM 'curl /secret/oil_ext.txt';
COPY 644
The cool thing is that, if you happen to be a superuser, you can easily load the data directly from the web: COPY FROM PROGRAM executes a program on the database server and pipes its output straight into PostgreSQL. For security reasons, this only works if you are a PostgreSQL superuser.
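If you are not a superuser, psql's client-side \copy command can achieve something similar, since the program then runs on the client machine rather than on the server. A sketch, with the URL elided as in the original:

test=> \copy t_oil FROM PROGRAM 'curl /secret/oil_ext.txt'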
If you are dealing with timeseries, calculating the difference to the previous period is really important. Fortunately, SQL allows you to do that pretty easily. Here is how it works:
test=# SELECT year, production,
           lag(production, 1) OVER (ORDER BY year)
       FROM t_oil
       WHERE country = 'USA'
       LIMIT 5;
 year | production |  lag
------+------------+-------
 1965 |       9014 |
 1966 |       9579 |  9014
 1967 |      10219 |  9579
 1968 |      10600 | 10219
 1969 |      10828 | 10600
(5 rows)
The lag function takes two parameters: the first one defines the column to be used, the second one is the offset and is optional. If you skip it, lag(production) is equivalent to lag(production, 1). In my example, the lag column is off by one row. However, you can use any integer to move data up or down, given the order defined in the OVER clause.
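As a side note, lag has a forward-looking counterpart called lead, and both accept arbitrary offsets. A quick sketch against the same table (output omitted):

test=# SELECT year, production,
           lead(production, 2) OVER (ORDER BY year) AS in_two_years
       FROM t_oil
       WHERE country = 'USA'
       LIMIT 5;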
What we have so far is the value of the previous period. Let us calculate the difference next:
test=# SELECT year, production,
           production - lag(production, 1) OVER (ORDER BY year) AS diff
       FROM t_oil
       WHERE country = 'USA'
       LIMIT 5;
 year | production | diff
------+------------+------
 1965 |       9014 |
 1966 |       9579 |  565
 1967 |      10219 |  640
 1968 |      10600 |  381
 1969 |      10828 |  228
(5 rows)
That was easy. All we have to do is to take the current row and subtract the previous row.
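Building on the same idea, a relative change is just as easy. Here is a sketch computing the percentage change from one year to the next (the rounding to two digits is just for readability):

test=# SELECT year, production,
           round(100.0 * (production - lag(production) OVER (ORDER BY year))
                 / lag(production) OVER (ORDER BY year), 2) AS pct_change
       FROM t_oil
       WHERE country = 'USA'
       LIMIT 5;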
Window functions are far more powerful than shown here, but maybe this example will help to get you started in the first place.
You may want to calculate the correlation between columns. PostgreSQL offers the “corr” function to do exactly that. The following listing shows a simple example:
test=# SELECT country, corr(production, consumption)
       FROM t_oil
       GROUP BY 1
       ORDER BY 2 DESC NULLS LAST;
       country        |        corr
----------------------+--------------------
 Mexico               |  0.962790640608018
 Canada               |  0.932931452462893
 Qatar                |  0.925552359601189
 United Arab Emirates |  0.882953285119214
 Saudi Arabien        |  0.642815458284221
As you can see, the correlation is highest for Mexico and Canada.
In the past we presented other examples related to timeseries and analysis in general. One of the most interesting posts is found here.
If you want to see the entire short presentation in German consider checking out the following video.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.