Quite often when visiting customers or doing trainings, I see that people are quite happy seeing and learning about some little time-saving PostgreSQL tricks, commands or SQL snippets on the screen and they usually ask to slow down and explain or just say “wow, didn’t know you could do that”. As these tips and tricks are not too much material on their own, I thought I’d just make note of such titbits and write a post on that later – and now it seems that this “later” arrived 🙂 So below there is a list without any particular ordering on some Postgres features and tricks that I’ve gathered over the last year and that might be news for more casual Postgres users. Hope you’ll find something useful and a happy new Postgres year!
Dynamic SQL loops with psql’s “gexec”
This could easily be the most useful one, especially for DBA-s doing some maintenance operations. Since a couple of years the PostgreSQL’s native query tool can do kind of “dynamic programming” – meaning you generate some SQL-s based on a queries output (i.e. the resultset is dynamic) and then you execute all of those generated SQL-s in sequence. Typical uses might be: “vacuuming” or gathering statistics for all the biggest tables that haven’t gotten any automatic “care” recently, getting row counts for all tables, generating thousands of tables for test purposes, rebuilding most bloated indexes, etc – in short stuff that previously required leaving the psql environment and writing a small script.
-- counting rows from all user tables dynamically select format('select count(*) from %s.%s', quote_ident(table_schema), quote_ident(table_name)) as sql from information_schema.tables where table_type ~ 'BASE' and not table_schema LIKE ANY(array[E'pg\\_%', 'information_schema']) order by table_schema, table_name \gexec
Note that gexec is a psql command, not SQL!
Officially session level user variables are not really advertised in the documentation and not as convenient as in some other database systems (this might change in v13 with a proposed new “schema variables” feature) but it works, and they can be freely used in your applications when needed. How does it work? A small sample below:
/* setting variables */ SET myscope.var1 = 1; SET myscope.var2 = 'a'; -- notice that data types are not enforced SELECT set_config('myscope.var1', ‘3.14’, false); -- ‘false’ means session not transaction scope /* reading variables */ SHOW myscope.var2; myscope.var2 ────────────── a (1 row) SELECT current_setting('myscope.var1')::numeric * 10 as pi_x_10;
Sleeping more conveniently
The sleep functions are not overly useful and rather for demonstration purposes (creating some locking situations and then trying to find out who blocked whom) but after using the pg_sleep() function for years, that takes input in seconds, to my horror one day I discovered that there are more convenient functions that accept even human readable input!
SELECT now(); SELECT pg_sleep_for('5 minutes'); SELECT /* then do something …. */ SELECT pg_sleep_until('tomorrow 03:00');
Setting per function planner constants
This is not something that everyone can use, as the heyday of stored procedures is surely over, but they’re nevertheless very useful and performance friendly if applied in correct situations. So one transparent performance tuning aspect of stored procedures is that you can tune all the PostgreSQL configuration parameters available for your procedure of interest. It runs the fastest, without changing any application code or affecting any other queries / procedures. And this is actually a huge benefit and a serious argument to use stored procedures in a tiered / compartmentalized environment (think banks, etc). Code for such changes looks like:
ALTER FUNCTION f1(int) SET enable_seqscan TO off; ALTER FUNCTION f1(int) SET work_mem = '1GB'; ALTER FUNCTION f1(int) SET statement_timeout TO '5min';
Duplicating databases for testing purposes
There are a couple of different ways to achieve that (with pg_dump most commonly used) but actually the fastest way to duplicate databases is to use the CREATE DATABASE syntax with the TEMPLATE modifier. Something like:
CREATE DATABASE prod_app1_test TEMPLATE prod_app1; /* prod_app1 is the SOURCE db and prod_app1_test will be the new DUPLICATE */
What does this command do then? Well sadly on operational systems it will hang first…as the grave downside of the method is that it expects no active sessions on the copied DB – thus it’s not suitable for 24/7 live systems but for test systems or where some off-hours interruption is OK. But after terminating all sessions, it will binary copy all the existing files from the specified template DB into a new DB, thus no parsing, exporting, importing, building indexes. FYI – for snapshotting file systems like ZFS, Btrfs, EBS it can even be used during operation as the copy doesn’t cost initially anything there. So a more real life sample would look something like that:
SELECT count(*) FROM (select pg_terminate_backend(pid) from pg_stat_activity where datname = ‘prod_app1’) x; CREATE DATABASE prod_app1_test TEMPLATE prod_app1; # or from bash psql -qc "select pg_terminate_backend(pid) from pg_stat_activity where datname = 'prod_app1'" && createdb -T prod_app1 prod_app1_test
Defining cluster initialization defaults (Debian)
If we are already talking about test instances, or cases where for some reason you need a lot of instances on a single compute node i.e. multi-tenant on instance level or when running some test farms – there is a couple of quite unknown file locations that enable you to override the default configs and instance initialization parameters. Server configs should be (and mostly are) somehow versioned and managed, so I’ve used it rather to set initialization settings, namely enabling PostgreSQL side data checksums for new clusters. This cluster defaults file can be found at “/etc/postgresql-common/createcluster.conf” and I usually change only one line:
initdb_options = '-k'
FYI – besides the default server config and initialization settings there are also some more cluster based files to customize the behaviour of “pg_ctl” (e.g. timeouts, restart mode) and service autostart behaviour, but the defaults there are quite sane.
PS this titbit only applies to Debian based systems (and PGDG repos i.e. the “official” packages from the Postgres project) and assumes that you use the according wrappers like “pg_createcluster”.
Sample config files
This is what around 98.375% of people find surprising – every PostgreSQL installation also has a bunch of sample configuration files on board, both for the server and client utilities! For example for those cases where you maybe accidentally erased some of them or you’re just curious what were the original defaults (although this can also be investigated via the pg_settings system view). For Debian based systems these files are located in “/usr/share/postgresql/$PGVER/” and for Redhat based ones in “/usr/pgsql-$PGVER/share/”.
A fun fact on config files – postgresql.conf is not even absolutely required as all settings have a default value compiled in! Only at least one access defining row in pg_hba.conf is required.
Selecting random sample rows quickly
Back to SQL land.
* The TABLESAMPLE SQL command
Although fixed in SQL standard for a while (SQL standard 2003) and implemented in PostgreSQL years ago, the TABLESAMPLE clause still seems to be quite unknown. What does it do? Well, as the name implies, it helps us to quickly select a random test subset of a table’s data. Quite handy actually when data browsing, to get a feel for the schema at hand. There are 2 variations of the command in PostgreSQL where the BERNOULLI method could be substantially slower, so if “good randomness” is not important better prefer SYSTEM.
CREATE TEMP TABLE test (id int); INSERT INTO test SELECT * FROM generate_series(1, 1000000); -- 1 million rows /* a ~0.1% sample of rows - fast */ SELECT * FROM test TABLESAMPLE SYSTEM (0.1); /* a ~0.1% sample of rows - slower but more random */ SELECT * FROM test TABLESAMPLE BERNOULLI (0.1);
* Good old random()
Another possibility that always works also for some very old Postgres versions and could be handy also for smallish tables and small sample percentages where TABLESAMPLE, being somewhat probabilistic, sometimes fails to select any rows at all. Warning – it will be slow for bigger tables and causes a sequential scan + expensive sort operation.
SELECT * FROM test ORDER BY random() LIMIT 10; /* or shaving off a couple of more “characters typed” with TABLE */ TABLE test ORDER BY random() LIMIT 10;
* Crazy CTID stuff
For ultra-big data warehouse type of tables where good randomness is required on a regular basis for some weird reason, one could out-smart the slow TABLESAMPLE BERNOULLI with the following hack:
select * from ( select * from big_table where ctid in ( select format( '(%s,%s)', (relpages * random())::int, ((select reltuples / relpages from pg_class where relname = 'big_table') * random())::int )::tid from pg_class, generate_series(1, 200) rows /* need ca 2x more candidate tuples as avg page density still varies across pages */ where relname = 'big_table' ) ) x limit 100;
If you grokked the above, you can feel good about your Postgres knowledge and have a small celebration break with soda pop or Club Mate 😀
To finish off – another very simple character saving tip when doing NULL comparisons:
-- usually written as SELECT * FROM tbl WHERE somecol IS NOT NULL; -- but this goes also SELECT * FROM tbl WHERE somecol NOTNULL; -- yay, 4 chars "saved"