Tips and tricks to kick-start Postgres year 2020

01.2020 | Category: How To | Tags: development, sql help

01.2020

Category: How To

Quite often when visiting customers or doing trainings, I see that people are quite happy seeing and learning about some little time-saving PostgreSQL tricks, commands or SQL snippets on the screen and they usually ask to slow down and explain or just say “wow, didn’t know you could do that”. As these tips and tricks are not too much material on their own, I thought I’d just make note of such titbits and write a post on that later - and now it seems that this “later” arrived 🙂 So below there is a list without any particular ordering on some Postgres features and tricks that I’ve gathered over the last year and that might be news for more casual Postgres users. Hope you’ll find something useful and a happy new Postgres year!

Table of Contents

Dynamic SQL loops with psql’s “gexec”

This could easily be the most useful one, especially for DBA-s doing some maintenance operations. Since a couple of years the PostgreSQL’s native query tool can do kind of “dynamic programming” - meaning you generate some SQL-s based on a queries output (i.e. the resultset is dynamic) and then you execute all of those generated SQL-s in sequence. Typical uses might be: “vacuuming” or gathering statistics for all the biggest tables that haven’t gotten any automatic “care” recently, getting row counts for all tables, generating thousands of tables for test purposes, rebuilding most bloated indexes, etc - in short stuff that previously required leaving the psql environment and writing a small script.

-- counting rows from all user tables dynamically
select format('select count(*) from %s.%s', 
quote_ident(table_schema), quote_ident(table_name)) as sql from information_schema.tables 
where table_type ~ 'BASE' 
and not table_schema LIKE ANY(array[E'pg\_%', 'information_schema']) order by 
table_schema, table_name gexec

-- counting rows from all user tables dynamically

select format('select count(*) from %s.%s',

quote_ident(table_schema), quote_ident(table_name)) as sql from information_schema.tables

where table_type ~ 'BASE'

and not table_schema LIKE ANY(array[E'pg\_%', 'information_schema']) order by

table_schema, table_name gexec

Note that gexec is a psql command, not SQL!

Session variables

Officially session level user variables are not really advertised in the documentation and not as convenient as in some other database systems (this might change in v13 with a proposed new “schema variables” feature) but it works, and they can be freely used in your applications when needed. How does it work? A small sample below:

/* setting variables */
SET myscope.var1 = 1;
SET myscope.var2 = 'a';  -- notice that data types are not enforced
SELECT set_config('myscope.var1', ‘3.14’, false); -- ‘false’ means session not transaction scope
/* reading variables */

SHOW myscope.var2;
 myscope.var2 
──────────────
 a
(1 row)

SELECT current_setting('myscope.var1')::numeric * 10 as pi_x_10;

/* setting variables */

SET myscope.var1 = 1;

SET myscope.var2 = 'a'; -- notice that data types are not enforced

SELECT set_config('myscope.var1', ‘3.14’, false); -- ‘false’ means session not transaction scope

/* reading variables */

SHOW myscope.var2;

myscope.var2

──────────────

(1 row)

SELECT current_setting('myscope.var1')::numeric * 10 as pi_x_10;

Sleeping more conveniently

The sleep functions are not overly useful and rather for demonstration purposes (creating some locking situations and then trying to find out who blocked whom) but after using the pg_sleep() function for years, that takes input in seconds, to my horror one day I discovered that there are more convenient functions that accept even human readable input!

SELECT now();
SELECT pg_sleep_for('5 minutes');
SELECT  /* then do something …. */
SELECT pg_sleep_until('tomorrow 03:00');

SELECT now();

SELECT pg_sleep_for('5 minutes');

SELECT /* then do something …. */

SELECT pg_sleep_until('tomorrow 03:00');

Setting per function planner constants

This is not something that everyone can use, as the heyday of stored procedures is surely over, but they’re nevertheless very useful and performance friendly if applied in correct situations. So one transparent performance tuning aspect of stored procedures is that you can tune all the PostgreSQL configuration parameters available for your procedure of interest. It runs the fastest, without changing any application code or affecting any other queries / procedures. And this is actually a huge benefit and a serious argument to use stored procedures in a tiered / compartmentalized environment (think banks, etc). Code for such changes looks like:

ALTER FUNCTION f1(int) SET enable_seqscan TO off;
ALTER FUNCTION f1(int) SET work_mem = '1GB';
ALTER FUNCTION f1(int) SET statement_timeout TO '5min';

ALTER FUNCTION f1(int) SET enable_seqscan TO off;

ALTER FUNCTION f1(int) SET work_mem = '1GB';

ALTER FUNCTION f1(int) SET statement_timeout TO '5min';

Duplicating databases for testing purposes

There are a couple of different ways to achieve that (with pg_dump most commonly used) but actually the fastest way to duplicate databases is to use the CREATE DATABASE syntax with the TEMPLATE modifier. Something like:

CREATE DATABASE prod_app1_test TEMPLATE prod_app1;
/* prod_app1 is the SOURCE db and prod_app1_test will be the new DUPLICATE */

1 2	CREATE DATABASE prod_app1_test TEMPLATE prod_app1; /* prod_app1 is the SOURCE db and prod_app1_test will be the new DUPLICATE */

What does this command do then? Well sadly on operational systems it will hang first...as the grave downside of the method is that it expects no active sessions on the copied DB - thus it’s not suitable for 24/7 live systems but for test systems or where some off-hours interruption is OK. But after terminating all sessions, it will binary copy all the existing files from the specified template DB into a new DB, thus no parsing, exporting, importing, building indexes. FYI - for snapshotting file systems like ZFS, Btrfs, EBS it can even be used during operation as the copy doesn’t cost initially anything there. So a more real life sample would look something like that:

SELECT count(*) FROM (select pg_terminate_backend(pid) from pg_stat_activity 
where datname = ‘prod_app1’) x;
CREATE DATABASE prod_app1_test TEMPLATE prod_app1;
# or from bash
psql -qc 'select pg_terminate_backend(pid) from pg_stat_activity 
where datname = 'prod_app1'' && createdb -T prod_app1 prod_app1_test

SELECT count(*) FROM (select pg_terminate_backend(pid) from pg_stat_activity

where datname = ‘prod_app1’) x;

CREATE DATABASE prod_app1_test TEMPLATE prod_app1;

# or from bash

psql -qc 'select pg_terminate_backend(pid) from pg_stat_activity

where datname = 'prod_app1'' && createdb -T prod_app1 prod_app1_test

Defining cluster initialization defaults (Debian)

If we are already talking about test instances, or cases where for some reason you need a lot of instances on a single compute node i.e. multi-tenant on instance level or when running some test farms - there is a couple of quite unknown file locations that enable you to override the default configs and instance initialization parameters. Server configs should be (and mostly are) somehow versioned and managed, so I’ve used it rather to set initialization settings, namely enabling PostgreSQL side data checksums for new clusters. This cluster defaults file can be found at “/etc/postgresql-common/createcluster.conf” and I usually change only one line:

initdb_options = '-k'

1	initdb_options = '-k'

FYI - besides the default server config and initialization settings there are also some more cluster based files to customize the behaviour of “pg_ctl” (e.g. timeouts, restart mode) and service autostart behaviour, but the defaults there are quite sane.

PS this titbit only applies to Debian based systems (and PGDG repos i.e. the “official” packages from the Postgres project) and assumes that you use the according wrappers like “pg_createcluster”.

Sample config files

This is what around 98.375% of people find surprising - every PostgreSQL installation also has a bunch of sample configuration files on board, both for the server and client utilities! For example for those cases where you maybe accidentally erased some of them or you’re just curious what were the original defaults (although this can also be investigated via the pg_settings system view). For Debian based systems these files are located in “/usr/share/postgresql/$PGVER/” and for Redhat based ones in “/usr/pgsql-$PGVER/share/”.

A fun fact on config files - postgresql.conf is not even absolutely required as all settings have a default value compiled in! Only at least one access defining row in pg_hba.conf is required.

Selecting random sample rows quickly

Back to SQL land.

* The TABLESAMPLE SQL command

Although fixed in SQL standard for a while (SQL standard 2003) and implemented in PostgreSQL years ago, the TABLESAMPLE clause still seems to be quite unknown. What does it do? Well, as the name implies, it helps us to quickly select a random test subset of a table’s data. Quite handy actually when data browsing, to get a feel for the schema at hand. There are 2 variations of the command in PostgreSQL where the BERNOULLI method could be substantially slower, so if “good randomness” is not important better prefer SYSTEM.

CREATE TEMP TABLE test (id int);
INSERT INTO test SELECT * FROM generate_series(1, 1000000); -- 1 million rows
/* a ~0.1% sample of rows - fast */
SELECT * FROM test TABLESAMPLE SYSTEM (0.1);
/* a ~0.1% sample of rows - slower but more random */
SELECT * FROM test TABLESAMPLE BERNOULLI (0.1);

CREATE TEMP TABLE test (id int);

INSERT INTO test SELECT * FROM generate_series(1, 1000000); -- 1 million rows

/* a ~0.1% sample of rows - fast */

SELECT * FROM test TABLESAMPLE SYSTEM (0.1);

/* a ~0.1% sample of rows - slower but more random */

SELECT * FROM test TABLESAMPLE BERNOULLI (0.1);

* Good old random()

Another possibility that always works also for some very old Postgres versions and could be handy also for smallish tables and small sample percentages where TABLESAMPLE, being somewhat probabilistic, sometimes fails to select any rows at all. Warning - it will be slow for bigger tables and causes a sequential scan + expensive sort operation.

SELECT * FROM test ORDER BY random() LIMIT 10;
/* or shaving off a couple of more “characters typed” with TABLE */
TABLE test ORDER BY random() LIMIT 10;

SELECT * FROM test ORDER BY random() LIMIT 10;

/* or shaving off a couple of more “characters typed” with TABLE */

TABLE test ORDER BY random() LIMIT 10;

* Crazy CTID stuff

For ultra-big data warehouse type of tables where good randomness is required on a regular basis for some weird reason, one could out-smart the slow TABLESAMPLE BERNOULLI with the following hack:

select * from (
  select * from big_table where ctid in (
    select
      format(
        '(%s,%s)',
        (relpages * random())::int,
        ((select reltuples / relpages  from pg_class where relname = 'big_table') 
* random())::int
      )::tid
    from
      pg_class, generate_series(1, 200) rows /* need ca 2x more candidate tuples as avg page 
density still varies across pages */
    where relname = 'big_table'
  )
) x
limit 100;

select * from (

select * from big_table where ctid in (

select

format(

'(%s,%s)',

(relpages * random())::int,

((select reltuples / relpages from pg_class where relname = 'big_table')

* random())::int

)::tid

from

pg_class, generate_series(1, 200) rows /* need ca 2x more candidate tuples as avg page

density still varies across pages */

where relname = 'big_table'

)

) x

limit 100;

If you grokked the above, you can feel good about your Postgres knowledge and have a small celebration break with soda pop or Club Mate 😀

NOTNULL

To finish off - another very simple character saving tip when doing NULL comparisons:

-- usually written as 
SELECT * FROM tbl WHERE somecol IS NOT NULL;
-- but this goes also
SELECT * FROM tbl WHERE somecol NOTNULL;  -- yay, 4 chars 'saved'

-- usually written as

SELECT * FROM tbl WHERE somecol IS NOT NULL;

-- but this goes also

SELECT * FROM tbl WHERE somecol NOTNULL; -- yay, 4 chars 'saved'

2 responses to “Tips and tricks to kick-start Postgres year 2020”

Damien Sawyer says:

January 17, 2020 at 12:54 am

Thank you!!! I've been looking for a way to do session variables in Postgres for a while since coming from SQL Server. Combining that with multiple Postgres result sets from Azure Data Studio (the best PG IDE I've found from a long list) and I'm having a good day.

Reply
Tobias Bussmann says:

January 22, 2020 at 3:53 pm

Regarding the first query: I like you utilised the format function to construct the dynamic statement instead of concatenation operations. It would be even a bit more elegantly when using %I as the format specifier as that saves you from quote_identing the parameters: format('SELECT count(*) FROM %I.%I', table_schema, table_name)
For literals that would normally be passed through quote_nullable there is also a %L specifier.

Reply

Tips and tricks to kick-start Postgres year 2020

Dynamic SQL loops with psql’s “gexec”

Session variables

Sleeping more conveniently

Setting per function planner constants

Duplicating databases for testing purposes

Defining cluster initialization defaults (Debian)

Sample config files

Selecting random sample rows quickly

NOTNULL

2 responses to “Tips and tricks to kick-start Postgres year 2020”

Leave a Reply Cancel reply

CYBERTEC Guest

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts