Transaction performance is as relevant a topic as ever, in spite of constant hardware improvements, and we're quite often asked how to improve in that area. The truth, though, is that once the most important PostgreSQL configuration parameters are more or less tuned, it is usually really hard to magically squeeze that extra something out of a setup without also modifying the schema.

What I've noticed, though, is that a lot of people with performance issues still seem to be using "old" spinning disks. There are of course many different reasons for that (maybe long-term server lease contracts, or losing support when changing hardware configs), but for cases where it's done because there's just way too much data and faster storage would get very expensive, there might be some remedy. Most people don't seem to realize that for OLTP systems there's already a huge win if one can move just the indexes of the busiest tables to SSDs or some other low-latency media. So here's a quick overview of how to do that, together with some indicative numbers.

Moving indexes to fast storage

The process is quite simple.

1. Install/connect/mount the media. This is probably the hardest part.

2. Create a Postgres tablespace (superuser needed). It would then also make sense to adjust the random_page_cost parameter for it.

CREATE TABLESPACE fast_storage LOCATION '/some/mount'
  WITH (random_page_cost=1.25);
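
If the tablespace already exists without that option (or you want to change it later), the parameter can also be set afterwards – a minimal sketch:

-- adjust the costing hint on an existing tablespace
ALTER TABLESPACE fast_storage SET (random_page_cost = 1.25);
-- current per-tablespace options can be inspected with:
SELECT spcname, spcoptions FROM pg_tablespace;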

3. Move the impactful (i.e. the busiest) or all existing indexes to that tablespace. NB! This will lock the indexes exclusively for the duration of the move. Since 9.4 it's also possible to move all indexes residing in a tablespace at once with "ALTER INDEX ALL IN TABLESPACE", but this basically means downtime, as everything is locked first and only then the moving starts. One can also do it in a more controlled/manual way via "CREATE INDEX CONCURRENTLY … TABLESPACE … + DROP INDEX + RENAME INDEX" (see the sketch after the examples below), or maybe use the pg_squeeze/pg_repack extensions, which can basically do the same.

-- single index
ALTER INDEX pgbench_accounts_aid_idx SET TABLESPACE fast_storage;

-- or all indexes on a table with a small DO block (could be extended to also handle schemas)
DO $$
DECLARE
  r record;
BEGIN
  FOR r IN SELECT ci.relname FROM pg_class c
                  JOIN pg_index i ON i.indrelid = c.oid
                  JOIN pg_class ci ON i.indexrelid = ci.oid
                  WHERE c.relname = 'pgbench_accounts'
  LOOP
    EXECUTE 'ALTER INDEX ' || quote_ident(r.relname) || ' SET TABLESPACE fast_storage';
  END LOOP;
END
$$;

-- or by leveraging psql's newish "\gexec" (less locking, as every generated statement runs in its own transaction)
SELECT 'ALTER INDEX ' || quote_ident(ci.relname) || ' SET TABLESPACE fast_storage'
  FROM pg_class c
  JOIN pg_index i ON i.indrelid = c.oid
  JOIN pg_class ci ON i.indexrelid = ci.oid
 WHERE c.relname = 'pgbench_accounts' \gexec
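
For reference, the lower-locking manual route mentioned above could look roughly like the sketch below (reusing the index name from the first example; note that for indexes backing primary key or unique constraints this simple swap is not directly applicable):

-- build a duplicate index on the fast tablespace without blocking writes
-- (cannot be run inside a transaction block)
CREATE INDEX CONCURRENTLY pgbench_accounts_aid_idx_new
    ON pgbench_accounts (aid) TABLESPACE fast_storage;
-- drop the old copy without blocking running queries
DROP INDEX CONCURRENTLY pgbench_accounts_aid_idx;
-- take over the old name; quick, but needs a short exclusive lock
ALTER INDEX pgbench_accounts_aid_idx_new RENAME TO pgbench_accounts_aid_idx;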

4. Optionally, it might also be a good idea to set this new tablespace as the default one, if the schema is somewhat static.

ALTER SYSTEM SET default_tablespace TO fast_storage;
SELECT pg_reload_conf();
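
To verify where the indexes actually ended up, a quick sanity check against the pg_indexes view could look like this:

SELECT schemaname, tablename, indexname, tablespace
  FROM pg_indexes
 WHERE tablename = 'pgbench_accounts';
-- NULL in the "tablespace" column means the index sits in the default tablespace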

Trying to compare HDD vs SSD

To "visualize" the possible performance benefits (there are some for sure) I performed a small and somewhat simplistic test, comparing HDD and SSD transaction latencies under a lot of random IO. To really hit the disk a lot, I chose a very small amount of RAM (only ~5% of the dataset fits into shared_buffers/kernel cache), but increased max_wal_size a lot so that we wouldn't stall during the test, giving more predictable latencies. To generate random IO easily I decided to just create 9 extra indexes on the pgbench schema – and having 10 to 20 indexes on a central OLTP table is actually quite common. Also, to only illustrate the HDD vs SSD difference on multi-index update latencies, I removed other activities: WAL logging was avoided by using unlogged tables, the background writer was disabled, and the default pgbench transaction was changed so that only the UPDATE on the single pgbench_accounts table would be executed.

HW info: Google Compute Engine (europe-west-1), 1 vCPU, 1 GB RAM, 100GB Standard persistent disk / SSD persistent disk
Postgres info: PG 10.3, max_wal_size=50GB, checkpoint_timeout=1d, shared_buffers=256MB, bgwriter_lru_maxpages=0
Test script:

SCALE=684 # estimated ~10 GB DB, see jsfiddle.net/kmoppel/6zrfwbas/ for the formula
TEST_DURATION_SEC=900
CLIENTS=1
NR_OF_AID_INDEXES=8
MAX_TPS=10

pgbench --unlogged-tables -i -s $SCALE

for i in $(seq 1 $NR_OF_AID_INDEXES) ; do psql -qXc "create index on pgbench_accounts (aid)" ; done
psql -qXc "create index on pgbench_accounts (abalance)"	# an extra index to effectively disable HOT updates

psql -qXc "checkpoint"

cat << EOF > bench_upd_only.sql
\set aid random(1, 100000 * :scale)
\set delta random(-5000, 5000)
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
EOF

pgbench -f bench_upd_only.sql -R $MAX_TPS -T $TEST_DURATION_SEC -P $TEST_DURATION_SEC -c $CLIENTS

The results

And the results were actually quite surprising – almost a 38x difference in average update latency on 10 indexes! I somehow thought it would be a bit less, maybe 5-10x…

Disk Type    Latency avg.    Latency Stddev.
HDD          141 ms          158 ms
SSD          3.7 ms          2.5 ms

NB! The test doesn't make any claims at absolute truths – I used separate Google Cloud machines (same non-disk specs though), which could have had different utilization levels, but to counteract that I limited the transaction rate to a very low 10 TPS, making it not a throughput test but rather a latency test; in the end it should at least give some idea of the possible performance gains. We can also see that HDD latencies (at least in shared cloud environments) jump around quite a lot on random updates, with "Latency Stddev" being bigger than "Latency avg".

Pgbench is a very well-known and handy built-in tool that Postgres DBAs can use for quick performance benchmarking. The main functionality/flow is super simple, but it also has some great optional features, like running custom scripts and specifying different probabilities for them. One can also use shell commands to fill query variables, for example.
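
For illustration, a hypothetical custom script using a shell command to fill a variable could look like the snippet below (the file name and the query are of course just made up); mixing it with the built-in TPC-B-like transaction at, say, a 20/80 ratio would then be a matter of "pgbench -f my_custom.sql@2 -b tpcb-like@8 -T 60".

-- my_custom.sql: pick a random account and tag the query with a shell-generated value
\set aid random(1, 100000 * :scale)
\setshell run_epoch date +%s
SELECT abalance, :run_epoch AS run_tag FROM pgbench_accounts WHERE aid = :aid;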

But the thing that has always annoyed me a bit is the fact that one cannot specify the desired database or table size, but has to think in so-called "scaling factor" numbers. And we know from the documentation that a scaling factor of 1 means 100k rows in the main data table "pgbench_accounts". But how does a scaling factor of 100 (i.e. 10 million bank accounts) translate to disk size? Which factor do I need to input when I want to quickly generate a database of 50 GB, to see how random updates would perform when the dataset does not fit into RAM/shared buffers? Currently there's a bit of trial and error involved 🙁 So how to get rid of the guesswork and be a bit more deterministic? Well, I guess one needs a formula that translates a target DB size into a scaling factor!

Test data generation

One way to arrive at the magic formula is to generate a lot of sample data for various scaling factors, measure the resulting on-disk sizes and derive a formula out of it. That's exactly what I've done.

So I hatched up the script below and booted up a GCE instance, destined to churn the disks nonstop for a bit more than a whole day, as it turned out – I'm sorry, little computer 🙂 The script is very simple – it runs the pgbench schema initialization with different scaling factors from 1 to 50k and stores the resulting DB/table/index sizes in a table, so that later some statistics/maths can be applied to infer a formula. NB! The Postgres version used was 10.1.

psql -c "create table if not exists init_results (scale int not null, table_size_mb int not null, index_size_mb int not null, db_size_mb int not null);"

SCALES="1 10 50 100 150 200 250 300 350 400 450 500 750 1000 1500 2000 2500 3000 3500 4000 4500 5000 6000 7000 8000 9000 10000 12500 15000 17500 19999 20000 25000 30000 35000 40000 45000 50000"

for x in $SCALES ; do

pgbench -i --unlogged-tables -s $x -n &>/tmp/pgbench.log

SQL=$(cat <<-EOF
insert into init_results select
  ${x},
  (select round(pg_table_size('pgbench_accounts') / 1024^2)),
  (select round(pg_relation_size('pgbench_accounts_pkey') / 1024^2)),
  (select round(pg_database_size(current_database()) / 1024^2));
EOF
)

echo "$SQL" | psql

done

Turning test data into a formula

After the script finished (25h later… note that without the "--unlogged-tables" flag it would have taken a lot longer), I had a nice table showing how different scaling factors translate to database size on disk, measured in MB (MiB/mebibytes to be exact, i.e. 1 MB = 1048576 bytes), as this is what the "psql" client uses when reporting object sizes.

So how to turn this data around so that we get a formula to calculate the scaling factor from a target DB size? Well, I'm sure there are quite a few ways and countless statistics/data-mining techniques could be applied, but as data science is not my strongest skill I thought I'd first take the simplest possible road of using the built-in statistics functions of a spreadsheet program and see how it fares. If the estimation doesn't look good I can still look for something more sophisticated, scikit-learn or such. But luckily it worked out quite well!
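
By the way, one could also skip the spreadsheet altogether and let Postgres itself do the linear fit on the gathered init_results table, using its built-in regression aggregates – just a sketch, which should come out close to the constants derived below:

SELECT regr_slope(scale, db_size_mb)     AS slope,      -- scale units per MB
       regr_intercept(scale, db_size_mb) AS intercept
  FROM init_results;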

Simple regression analysis with Libreoffice

Libreoffice (my office suite of choice) is generally nowhere near MS Excel, but it does have some quite good statistical regression functionality built in, and one has the option to calculate regression constants together with accuracy coefficients, or to calculate the formulas and draw them on charts as "trend lines". I went the visual way. It goes something like this – create a normal chart, activate it with a double-click, then right-click on some data point on the chart and select "Insert Trend Line" from the popup menu. On the next screen one can choose whether the "to be fitted" formula for your data should be a simple linear one or logarithmic/exponential/polynomial. For the pgbench data at hand though, after trying out all of them, it appeared that there was no real extra accuracy from the more complex formulas, so I decided to apply KISS here, meaning a simple linear formula. If you look at the script you'll see that we actually also gathered size data for the "pgbench_accounts" table and its primary key index, so for completeness I also generated the formulas for calculating scaling factors from target "pgbench_accounts" table/index sizes, but mostly they should not be of interest, as that one prominent table makes up 99% of the DB size.

The magic formula

So without further ado, below are the resulting formulas and a graph showing how pgbench scale changes according to DB size. I also created a small JSFiddle that you can bookmark here if you happen to work with pgbench a lot.

NB! Accuracy here is of course not 100% perfect, as there are some non-linear components (pgbench_accounts.aid changes to int8 starting from scale 20k, index leaf density varies minimally), but it is really "good enough", meaning <1% estimation error.

Hope you find it useful – and it would be nice if something similar ended up in the "pgbench" tool itself one day.

Target object                    Scale formula
DB                               0.0669 * DB_Size_Target_MB - 0.5
Table (pgbench_accounts)         0.0781 * Table_Size_Target_MB
Index (pgbench_accounts_pkey)    0.4668 * Index_Size_Target_MB
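
As a quick sanity check of the DB formula – say the target from the earlier example is a ~50 GB (51200 MB) database; plugging that in yields a scale of roughly 3425:

SELECT round(0.0669 * 50 * 1024 - 0.5) AS scale_for_50gb_db;   -- ≈ 3425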


DB size to Scale graph

The beginning of the year luckily happens to be somewhat slow at customers, so over the last couple of weeks I've had the chance to work on our Open Source PostgreSQL monitoring tool called pgwatch2 and implement some stuff that was in the queue. The changes include only a couple of fixes and a lot of new features… so hell, I'm calling it "Feature Pack 3", as we've had some bigger additions in the past as well. Read on for a more detailed overview of the most important new stuff.

Becoming “Enterprizy”

The focus word this time could be "Enterprise". Meaning – firstly, trying to make pgwatch2 easier to deploy for larger companies that may need to think about security or container orchestration, and secondly, adding robustness. Security even got some special attention – there's now a pgwatch2 version suitable for running on OpenShift, i.e. the Docker process runs under an unprivileged user. But there are also quite a few new dashboards (see the screenshots at the end), like the long-awaited "Top N queries", which should delight all "standard" users, plus some other smaller UI improvements.

And please do let us know on Github if you're still missing something in the tool or are having difficulties with something – I think I've lost the ability to look at the tool with beginner's eyes myself. Thus – any feedback would be highly appreciated!

Project Github link – here.

Version 1.3 full changelog – here.

Most important changes for v1.3.0

  • “Non-root” Docker support

Suitable for example for OpenShift deployments.

  • Explicit Docker volumes added to Postgres, Grafana, InfluxDB and pgwatch2 config

Using volumes was of course also possible before, but people didn't seem to think about it, so now it's more explicit. Volumes help a lot for long-term Docker deployments, as they make updating to new image versions quite easy compared to dump/restore.

  • Dockerfiles for deploying “non-root” components separately

These can be used to build some advanced orchestrated setups. See the "docker" folder for details. The latest images are also available on Docker Hub, but this is meant more for DIY still.

  • Better Grafana access control via env. variables

Now it's possible to turn off anonymous access and set the admin user/password.

  • New “AWS CloudWatch overview” dashboard

One can now easily monitor/alert on a mix of on-prem and AWS RDS managed DBs.

  • New dashboard for PgBouncer statistics

Visualizes the output of PgBouncer's "SHOW STATS" command. NB! Requires a config DB schema
change for existing setups, as a new "datasource type" field was introduced.

  • New dashboard for “Top N” time consuming/frequent/slowest/IO-hungry queries added

Now one can see, based on pg_stat_statements info, exactly which queries are the costliest (4 criteria available) over a user-selected time period. Lots of people have asked for that, so please, enjoy!

  • New dashboard – “Biggest relations treemap”

Helps to visually highlight biggest tables/indexes. Uses a custom plugin for Grafana.

  • InfluxDB HA support added, i.e. writing metrics to 2 independent DBs

The metrics gathering daemon can now store all gathered metrics in 2 independent InfluxDBs… so a kind of home-brewed clustering. For real clustering one needs the commercial Influx version, so be sure to check that out as well if you need extra robustness – it seems like a great product.

  • Gatherer daemon robustness – a ringbuffer of max 100k metrics datapoints introduced

Previously, if Influx was down, metrics were gathered and kept in the daemon's memory until things blew up. 100k datapoints translate to ~2 GB of RAM.

  • InfluxDB metrics retention period now configurable via ENV

Variable is called PW2_IRETENTIONDAYS. Default is 90 days.

  • “Change detection” doesn’t expect superuser rights on the monitored DB anymore
  • Web UI security – respecting the “only SSL connections” ENV variable also for Postgres
  • Web UI security – now possible to disable exposing of component logs
  • README improvements – a new section on custom deployments
  • README improvements – a new docu on all available ENV variables, per subcomponent

Screenshots of new Dashboards

Stat Statements Top

PgBouncer Stats

Biggest relations

AWS RDS overview

Something from the security realm this time – normally this side of the "database world" doesn't get too much attention and "love", because it's difficult/tedious and requires some deep understanding of how the system works… but ignoring security, as we know, can catch up with you badly.

Anyway, I'm not telling a real horror story here, but some weeks ago, when a friend who's a "casual DBA" (install, start/stop, backup/restore) asked for help with a situation where a column was "leaking" data, I had to think a bit longer than usual to find an appropriate place to fix the issue. The solution I suggested in the end included making use of the Rule System of PostgreSQL. I hadn't seen rules used in years (one should generally try to avoid having stuff happen behind the scenes) and had to refresh on them a bit myself, but I think this was the right use case for applying them, so I thought I'd share it as well.

The security leak

So the application they had was a bought-in "black box" webapp that did some quite trivial CRUD on some data forms, and all was fine… until they noticed that on some form, a column with address data was visible to all users of the system instead of only a specific login role. As said, they couldn't immediately change the code and were looking for a quick fix to hide the data in that column altogether, for all users, on the DB side (where they could make changes), until the app gets properly fixed. So what means/tricks are available for this kind of stuff? Not too many, and they mostly have implications… but let's look at them.

Available options on restricting column access

  • Column-level privileges

When data is accessed by different login roles, normally the best way to limit access to single columns would be to use the "column privileges" feature – just do a "REVOKE SELECT" followed by a "GRANT SELECT (col1, col2, …)", leaving the "secret" columns out of the list. With our app though it wouldn't work, as only one backend role was used, and all the SELECTs touching that column would start to error out, also on other forms showing the column – the SQL would need to be adjusted. A no-go.

  • Rename the column, create a new (nullable) column with the same name

This would be the easiest trick (very easily rolled back) for a temporary fix… but it only works when no new data is being added – in our case though, new customer registrations were still coming in from the web. A no-go.

  • A view instead of a table

Renaming the old table and creating a simple view with the same name, including all the wanted columns (and adding the appropriate GRANTs, for example copying the existing ones with "pg_dump -s -t problem_table | grep ^GRANT"), is relatively easy – we could replace our "secret" column's value with, for example, an empty string and all would be fine… until we again get some INSERTs or UPDATEs. Then we would see an error like the one below, although simple DML on views is in itself allowed by default in Postgres.

ERROR:  cannot insert into column "leaky_data" of view "problem_table"
DETAIL:  View columns that are not columns of their base relation are not updatable.

So how to get around this INSERT/UPDATE problem?

  • A view instead of a table with Rules!

To overcome the errors on adding/changing data we need to re-route the data into the “base table” of our view. This can easily be done with rules and this would be a good fit for this use case. Read on for a sample.

The solution

So let’s say this is our leaking table:

CREATE TABLE  problem_table(id int, leaky_data text);

And now we want to just replace all values in the leaky_data column with empty strings. So we write something like this:

BEGIN;

ALTER TABLE problem_table RENAME TO problem_table_real;

CREATE VIEW problem_table AS SELECT id, ''::text AS leaky_data FROM problem_table_real;

CREATE RULE problem_table_ins AS ON INSERT TO problem_table
    DO INSTEAD
    INSERT INTO problem_table_real VALUES (
           NEW.id,
           NEW.leaky_data
    );

CREATE RULE problem_table_upd AS ON UPDATE TO problem_table
    DO INSTEAD
    UPDATE  problem_table_real
       SET id = NEW.id,
               leaky_data = NEW.leaky_data
     WHERE id = NEW.id;

COMMIT;

Now indeed it does what we want:


INSERT INTO problem_table(id, leaky_data) VALUES (1, 'secret');

SELECT * FROM problem_table;

 id │ leaky_data 
────┼────────────
  1 │ 
(1 row)

NB! When the application is using constructs like "UPDATE … RETURNING *", it's also possible to take care of that loophole in the rule definition, by adding a "RETURNING …, ''::text AS leaky_data" part. Hope it helps!
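
For illustration, such a rule with a RETURNING clause could look roughly like the sketch below (mirroring the INSERT rule from above; the UPDATE rule can be extended the same way):

CREATE OR REPLACE RULE problem_table_ins AS ON INSERT TO problem_table
    DO INSTEAD
    INSERT INTO problem_table_real VALUES (
           NEW.id,
           NEW.leaky_data
    )
    RETURNING id, ''::text AS leaky_data;   -- mask the column also for INSERT ... RETURNING *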

A quick performance tip for all those wrestling with occasional poorly performing queries in PostgreSQL: there's one quite simple trick that many don't know about and that can be applied at "low cost" when slow queries are caused by poor planner estimates for your filter conditions. Read on for details.

The problem

So the thing with Postgres statistics is that you basically lose them as soon as you transform the data column in some way in your query – the most prominent example would probably be something like "SELECT … WHERE date_trunc('day', some_indexed_timestamp_column) = '2017-01-01'" … and gone are the (imaginary) index and the column statistics. Such a query could of course easily be rewritten so that performance doesn't explode, and a functional index would also make sense when a rewrite is not possible or the query is executed often and is selective enough. But for queries/filters that are used very rarely, a full functional index might not be ideal due to its performance side effects (index maintenance slows down updates, deletes and inserts a bit)… so welcome our performance helper of the day – functional BRIN indexes!
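
To illustrate the "rewrite" option mentioned above – assuming a hypothetical table "orders" with an indexed "created_on" timestamp column, the statistics- and index-friendly version of such a filter would be a plain range condition:

-- hypothetical table/column names, just to show the rewrite
SELECT *
  FROM orders
 WHERE created_on >= '2017-01-01'
   AND created_on <  '2017-01-02';
-- whereas WHERE date_trunc('day', created_on) = '2017-01-01' could use neither the
-- plain index nor the column statistics, unless a matching functional index exists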

Example use case

So now let's illustrate it with a dummy example: let's say we want to select all IDs divisible by 10 out of 1 million sequentially growing IDs. Let's create the test data and look at the resulting query plan.

krl@postgres=# CREATE TABLE t_event AS SELECT generate_series(1, 1e6) id;
SELECT 1000000

krl@postgres=# CREATE INDEX ON t_event (id);
CREATE INDEX

krl@postgres=# ANALYZE t_event ;
ANALYZE

krl@postgres=# EXPLAIN SELECT * FROM t_event WHERE id % 10 = 0;
                          QUERY PLAN                          
──────────────────────────────────────────────────────────────
 Seq Scan on t_event  (cost=0.00..19425.00 rows=5000 width=6)
   Filter: ((id % '10'::numeric) = '0'::numeric)
(2 rows)

What do we see? Well, we see that Postgres expects to get only 5000 rows, but we actually know that 100k rows would be matched! Such a 20x misestimation could already end really badly for larger datasets. And where does the number 5000 come from? Some "insider knowledge" is required here – one just needs to know that when Postgres doesn't have exact statistics for a column filter (and we did transform the ID column that we have statistics on), it by default figures that something between 0.33333 and 1% of the data will be returned, depending on the datatype and filter type (normal/join) – here 0.5%, and 0.5% of 1 million rows gives exactly the 5000 seen in the plan. So a constant… which obviously doesn't work too well for our less selective filter. But OK, how to improve the situation?

Using functional BRIN indexes for better row amount estimations

So now the trick – let's create our functional BRIN index and see how it changes the row estimate in our original query plan. BRIN, by the way, stands for Block Range (a.k.a. Min/Max) Index – a lossy (table re-check needed) index type introduced in Postgres 9.5, generally meant for big amounts of data and most effective when the indexed columns appear in the table in a somewhat naturally ordered way. So let's create one, rebuild the statistics and do another EXPLAIN.

krl@postgres=# CREATE INDEX ON t_event USING brin ((id % 10));
CREATE INDEX

krl@postgres=# ANALYZE t_event ;
ANALYZE

krl@postgres=# EXPLAIN SELECT * FROM t_event WHERE id % 10 = 0;
                          QUERY PLAN                           
───────────────────────────────────────────────────────────────
 Seq Scan on t_event  (cost=0.00..19425.00 rows=98467 width=6)
   Filter: ((id % '10'::numeric) = '0'::numeric)
(2 rows)

And indeed, the estimate of 98467 matching rows now looks a lot better – quite exact actually!
Here the database of course still decided on a sequential scan… but at least it did so based on the right assumptions (row count and table location correlation, see below for the corresponding pg_stats entry), and after a quick check, forcing Postgres to use the index, I indeed saw that the sequential scan was the faster choice. And the efficiency gain compared to a normal functional index (which wouldn't be used anyway, just its statistics) – 384x less space on disk, plus faster data modifications, as BRIN indexes only need maintenance when the min/max threshold is crossed for a range of blocks!
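
The "quick check" mentioned above can be done by temporarily discouraging sequential scans for the current session only, along these lines:

SET enable_seqscan TO off;   -- the planner will now avoid seq scans whenever an index is usable
EXPLAIN ANALYZE SELECT * FROM t_event WHERE id % 10 = 0;
RESET enable_seqscan;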

Also note that in our case the row estimation worked so well because we have fewer than default_statistics_target (100 by default) unique values in our derived column, so Postgres has more or less exact appearance frequencies available for the individual values.

Hope this trick will serve you well 🙂

krl@postgres=# \di+ t_event_*
                             List of relations
 Schema │       Name        │ Type  │ Owner │  Table  │ Size  │ Description 
────────┼───────────────────┼───────┼───────┼─────────┼───────┼─────────────
 public │ t_event_expr_idx  │ index │ krl   │ t_event │ 21 MB │ 
 public │ t_event_expr_idx1 │ index │ krl   │ t_event │ 56 kB │ 
 public │ t_event_id_idx    │ index │ krl   │ t_event │ 21 MB │ 
(3 rows)

krl@postgres=# SELECT * FROM pg_stats WHERE tablename = 't_event_expr_idx1';
─[ RECORD 1 ]──────────┬─────────────────────────────────────────────────────────────────────────────────────
schemaname             │ public
tablename              │ t_event_expr_idx1
attname                │ expr
inherited              │ f
null_frac              │ 0
avg_width              │ 7
n_distinct             │ 10
most_common_vals       │ {2,3,7,6,0,1,4,5,8,9}
most_common_freqs      │ {0.1033,0.102233,0.1013,0.1012,0.1005,0.100033,0.0993,0.0978333,0.0978333,0.0964667}
histogram_bounds       │ ¤
correlation            │ 0.0950125
most_common_elems      │ ¤
most_common_elem_freqs │ ¤
elem_count_histogram   │ ¤