Recently, I was asked for advice on how to reasonably implement a common task: table change tracking – meaning a so-called “audit trail” of all row changes over time. The keyword “compliance” might also ring a bell in this context, at least for those who work in finance or for government institutions. But as always with technology, there are a couple of different approaches with different benefits / drawbacks to choose from; let’s lay them out for the Internet Citizens! There’s a summary of pros and cons down below if you’re in a hurry.
TL;DR: sadly, there are no real standard built-in tools for these types of tasks, but in real life it mostly boils down to creating some “shadow” tables and writing some triggers.
The log_statement configuration parameter
This server parameter named log_statement has 4 possible values – none, ddl, mod, all. By default, it’s configured conservatively (in this case, with ‘none’), as all PostgreSQL parameters generally are, in order to avoid accidentally wasting resources. That way, it doesn’t log normal SELECT or DML statements to the server log – only warnings and errors. But when set to ‘mod’ (or ‘all’ which is a superset of ‘mod’), all top level data-changing SQL statements (INSERT, UPDATE, DELETE, TRUNCATE) will be stored in the server log!
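Enabling it is a one-liner, and no restart is needed – for example, to switch to ‘mod’ at runtime (superuser rights assumed; the database and role names below are of course just placeholders):

```sql
-- enable logging of all data-modifying top level statements, cluster-wide
ALTER SYSTEM SET log_statement = 'mod';
SELECT pg_reload_conf();  -- log_statement only requires a config reload

-- or scope it to a single database / role instead
ALTER DATABASE mydb SET log_statement = 'mod';
ALTER ROLE app_user SET log_statement = 'mod';
```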
Make sure to note the “top level” part though – this means that if you have some business logic exercising triggers on your tables and a chain of changes is triggered, you’ll not see any traces of these subsequent changes in the server log! Also, log_statement doesn’t work well if you rely on stored procedures to change data – since they’re usually called via normal SELECT-s, which don’t get flagged as data-changing operations with the ‘mod’ log_statement level. Worse, if you need to do some dynamic SQL within your stored procedures – even the ‘all’ level won’t catch them!
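To illustrate the stored procedure caveat – here’s a small sketch with a hypothetical accounts table and a debit function. The top level statement arriving at the server is a plain SELECT, so the ‘mod’ level won’t log the UPDATE hidden inside it:

```sql
-- logged with log_statement = 'mod' (a top level UPDATE)
UPDATE accounts SET balance = balance - 50 WHERE id = 1;

-- the very same change, wrapped into a function...
CREATE FUNCTION debit(p_id int, p_amount numeric) RETURNS void AS $$
    UPDATE accounts SET balance = balance - p_amount WHERE id = p_id;
$$ LANGUAGE sql;

-- ...is invoked via SELECT and thus not caught at the 'mod' level
SELECT debit(1, 50);
```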
In short – the use cases for the whole approach are somewhat limited to basic CRUD patterns, and log_statement is not necessarily suitable for enterprise requirements.
PS – also note that superusers can change this log_statement setting on the fly, thus bypassing the auditing and doing stuff secretly – without any traces left! To remind you – “by design” with PostgreSQL, it’s never really possible to guard the system against a malicious superuser; some methods just require a bit more work to get around. In short: be careful about whom you hand out superuser rights to. Preferably, do not even allow remote superuser access – but that’s another topic; see here for more info, if you’re interested in security.
Pros

- Absolutely the easiest setup ever – only one parameter to enable. This is even true during live operation, so it can also be used for ad-hoc troubleshooting.

Cons

- Catches only the top level DML statement issued by the user.
- Only explicitly defined column values will be stored in the server log, thus nothing usable from a statement such as: UPDATE tbl SET x = x + 1.
- No row context for multi-row updates – i.e. you’ll have no indication of how many rows were altered by some action.
- No table / schema based filtering – all or nothing will be logged.
- Information storage is purely text-based – you may need to deal with huge log files where the information is all tangled up and searching is difficult. Might need additional extraction / parsing / indexing for fast search.
- Queries from failed transactions are also included.
The pgaudit extension
In short, pgaudit is a 3rd-party PostgreSQL extension that tries to improve on the quite limited default PostgreSQL text-based auditing / logging capabilities. It has been around for ages, so it’s stable enough to use, and there are even packages provided by PGDG repos for Debian / RH-based distros.
Its main drawback is the same as with the previous method, though – everything goes into the same server log together with normal status / error messages. There’s no clear separation of “concerns” – thus searching will be a bit of work, and for fast “trail” access, you’ll probably need to parse the files and store them in some other system, properly indexed. It’s also the same story for the generated volume of logs: at default settings (when just enabling everything), it’s way more write-heavy than the log_statement-based approach, so be sure to tune the plentiful parameters to your needs. To warn users about that, the project README also nicely says: … be sure to assess the performance impact while testing and allocate plenty of space on the log volume.
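A minimal setup could look something like the below. Note that the pgaudit.log value ‘write’ (covering INSERT / UPDATE / DELETE / TRUNCATE and writing COPY-s) is just one of the available statement classes – see the project README for the full list:

```sql
-- prerequisite: shared_preload_libraries = 'pgaudit' in postgresql.conf,
-- followed by a server restart
CREATE EXTENSION pgaudit;

-- log all write operations, with fully qualified relation names included
ALTER SYSTEM SET pgaudit.log = 'write';
ALTER SYSTEM SET pgaudit.log_relation = on;
SELECT pg_reload_conf();
```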
Pros

- Quite granular logging / auditing options, configurable by change type or by some role’s access privileges to a table, view, etc.
- Internally / dynamically generated SQL statements can also be logged.

Cons

- A 3rd party extension.
- Possibly heavy disk / IO footprint, same as for the previous method.
- No row context for multi-row updates.
- Information storage is purely text-based – you may need to deal with huge log files where the information is all tangled up and searching is difficult. Might need additional extraction / parsing / indexing for fast search.
- Queries from failed transactions are also included.
Custom audit tables and triggers for all tables
Custom audit tables and triggers must be the most classic / common approach to auditing, and all those working with RDBMS systems for a longer period have surely seen or implemented something like these features. The main idea – create a “shadow” history tracking / audit table for all relevant tables, where all changes to the main table will be logged, via triggers. However, since triggers are considered black magic by a lot of developers these days, I’ve also seen implementations via application code…but this can’t be recommended, as only in a non-existent perfect world are there no ad-hoc manual data changes.
The setup process here would look something like what you see below for every target table X, where we want to track who changed what rows / columns and when:
- Create the “shadow” table for X, typically X_LOG with some typical audit columns like “change time”, “user doing the change”, “type of change” and then all or only important data columns from X.
- Create a trigger function FX which inserts the new or old values (it’s a matter of taste) into X_LOG with above declared audit and data columns filled.
- Declare a trigger on table X to call our tracking function FX for each altered row. The trigger would typically be an AFTER trigger, as we don’t want to alter anything but just record the change; however, when doing heavy multi-row transactions (thousands of rows per TX), it would make sense to test BEFORE triggers as well, as they should be more resource-friendly (given rollbacks / exceptions are rare).
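As a sketch of the above steps – assuming a hypothetical target table X with an id and one data column (column names and the narrow column selection are illustrative only):

```sql
-- step 1: the "shadow" table with audit columns plus the data columns of X
CREATE TABLE x_log (
    mtime     timestamptz NOT NULL DEFAULT now(),
    username  text NOT NULL DEFAULT session_user,
    action    char NOT NULL CHECK (action IN ('I', 'U', 'D')),
    id        int NOT NULL,
    data      text
);

-- step 2: the trigger function FX, storing the new (or old, for deletes) values
CREATE OR REPLACE FUNCTION fx() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO x_log (action, id, data) VALUES ('D', OLD.id, OLD.data);
    ELSE
        INSERT INTO x_log (action, id, data) VALUES (left(TG_OP, 1), NEW.id, NEW.data);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER ... FOR EACH ROW triggers
END;
$$;

-- step 3: the row level trigger on X itself
CREATE TRIGGER x_audit AFTER INSERT OR UPDATE OR DELETE ON x
    FOR EACH ROW EXECUTE FUNCTION fx();
```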
Pros

- Explicit audit targeting – track only exactly what is important.
- Isolation of concerns – changes are in a nice table, ready for fast inspection.
- Fast search – normal tables can easily be indexed per need.

Cons

- Need to write stored procedures and manage trigger definitions. FYI – triggers can also be written in other supported PL languages like Python, if you happen to dislike the de-facto trigger language of PL/pgSQL.
- Some schema duplication.
- Database growth. Previously, changes were written into server logs that are usually recycled (and often federated into some central logging system) in any case, so it was not a big concern. The audit tables, however, may need explicit “care”.
One generic audit trigger and table for all audited tables
On a high level, this method is very similar to the previous one; the only change being that instead of having custom audit tables / trigger code for all “business” tables, we create a generic trigger function that can be used on all target tables, and that also logs all changes into a single table! By doing that, we’ve minimized the amount of table / code duplication – which could be of real value for big and complex systems – remember, DRY!
And what, you may wonder, would be the best way to implement it? Well, the best way to achieve such generic behaviour is to utilize the superb JSON functions of PostgreSQL, preferably the JSONB data type (available since v9.4), due to its space savings and faster search capabilities. BTW, if you happen to be running some earlier version, you should really think about upgrading, as versions 9.3 and below are not officially supported any more… and soon (February 13, 2020) PostgreSQL 9.4 will also stop receiving fixes.
Since this approach is relatively unknown to the wider public, a piece of sample code probably wouldn’t hurt; check below for a sample. Note, however, that fiddling with JSONB – given that it’s basically a NoSQL type of storage – is not quite as efficient as normal tables / columns; you’ll have to pay a small performance and storage premium for this “generic” convenience.
```sql
CREATE TABLE generic_log (
    mtime       timestamptz NOT NULL DEFAULT now(),
    action      char NOT NULL CHECK (action IN ('I', 'U', 'D')),
    username    text NOT NULL,
    table_name  text NOT NULL,
    row_data    jsonb NOT NULL
);

CREATE INDEX ON generic_log USING brin (mtime);
-- note the ->> operator returning text, as JSONB itself can't be indexed with B-tree
CREATE INDEX ON generic_log ((row_data ->> 'my_pk')) WHERE row_data ->> 'my_pk' IS NOT NULL;
CREATE EXTENSION IF NOT EXISTS btree_gin;
CREATE INDEX ON generic_log USING gin (table_name);  -- GIN is better for lots of repeating values

CREATE OR REPLACE FUNCTION public.generic_log_trigger()
  RETURNS trigger
  LANGUAGE plpgsql
AS $function$
BEGIN
    IF TG_OP = 'DELETE' THEN
        INSERT INTO generic_log VALUES (now(), 'D', session_user, TG_TABLE_NAME, to_jsonb(OLD));
    ELSE
        -- the cast to char truncates TG_OP to its first letter ('I' or 'U')
        INSERT INTO generic_log VALUES (now(), TG_OP::char, session_user, TG_TABLE_NAME, to_jsonb(NEW));
    END IF;
    RETURN NULL;
END;
$function$;

CREATE TRIGGER log_generic
  AFTER INSERT OR UPDATE OR DELETE ON some_table
  FOR EACH ROW EXECUTE FUNCTION generic_log_trigger();
```
Pros

- Less code to manage.
- Automatic attachment of audit trail triggers can be easily configured for new tables, e.g. via event triggers.

Cons

- A bit more resources burned compared to custom per-table triggers.
- Some more exotic indexing (GIN) may be needed.
- SQL for searching may become a bit more complex.
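To illustrate the last point – digging out the change history of a single row by its primary key could look something like this (assuming the generic_log table and the hypothetical ‘my_pk’ key from the sample above):

```sql
-- full change history for one row of some_table, newest first
SELECT mtime, action, username, row_data
FROM generic_log
WHERE table_name = 'some_table'
  AND row_data ->> 'my_pk' = '42'
ORDER BY mtime DESC;
```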
PS – note again, that with both of these trigger-based methods, superusers (or table owners) can temporarily disable the triggers and thus bypass our audit trail.
Logical replication

Logical replication, also known as “pub-sub” replication, is a relatively new thing in PostgreSQL (introduced in v10) and was originally not really an auditing feature, but rather a near-zero-downtime upgrade mechanism (read this blog post with more details on that).
It can also be “abused” for auditing or CDC (Change Data Capture) purposes…and actually quite well! The biggest benefit – storage of any extra auditing data can be “outsourced” to an external system, and so-called “write amplification” can be avoided – meaning generally better performance, since extra writing to the disk happens somewhere else.
You need to choose between two implementation options, though – the PostgreSQL-native way, or the custom application way.
Logical replication – PostgreSQL native
PostgreSQL-native logical replication means that you build up a second server structured similarly to the source server, re-adjust the schema a bit – dropping PK / unique constraints, creating some triggers that tweak or throw away uninteresting data or store it in “shadow” tables (just like with the normal trigger-based approaches) – and then configure data streaming with the CREATE PUBLICATION / CREATE SUBSCRIPTION commands.
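The streaming part itself is just a couple of commands – a publication is declared on the source server, and a subscription pointing at it on the audit server (with the adjusted schema already in place). Publication / subscription names and the connection string below are of course placeholders:

```sql
-- on the source (publisher) server
CREATE PUBLICATION audit_pub FOR ALL TABLES;

-- on the audit (subscriber) server
CREATE SUBSCRIPTION audit_sub
    CONNECTION 'host=source.example.com dbname=prod user=replicator'
    PUBLICATION audit_pub;
```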
As usual, some constraints still apply – you might need to alter the schema slightly to get going. Large objects (up to 4TB blobs) are not supported, and with default settings you’d only be getting the primary key and changed column data, i.e. not the latest full “row image”. Also, it’s generally more hassle to set up and run – an extra node and monitoring are needed, since the publisher and subscriber will be sort of “physically coupled”, and there will be operational risks for the publisher (source server) – if the subscriber goes on a “vacation” for too long, the publisher might run out of disk space, as all data changes will be reserved and stored as transaction logs (WAL) until they’re fetched (or the slot is deleted). The latter point actually applies to the “custom application” approach as well, so you should definitely spend a minute in careful consideration before jumping into some implementation.
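Given the described disk-fill risk, monitoring replication slot lag on the publisher is a must; a simple query on the pg_replication_slots catalog view does the trick:

```sql
-- bytes of WAL retained for each slot - alert if the number keeps growing
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots;
```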
On a positive note from the security side – “PostgreSQL native” can actually be configured in such a way that it’s not even possible for superusers on the source system to disable / bypass the auditing process and change something secretly! (i.e. temporarily leaving out some tables from the replication stream so that the subscriber doesn’t notice!) However, this only works with the standard (for upgrades at least) FOR ALL TABLES setup.
Logical replication – with custom applications
The “application way” means using some programming language (C, Python, Java, etc) where the PostgreSQL driver supports logical decoding. You’ll always be streaming the changes as they happen, and then inspect or stash away the data in your favourite format, or propagate into some other database system altogether. See here for a sample PostgreSQL implementation that also can easily be tested on the command line. To simplify it a bit – you can live-stream JSON change-sets out of PostgreSQL and do whatever you like with the data.
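For quick testing, one doesn’t even need a driver – the built-in test_decoding output plugin can be exercised with plain SQL (for real JSON change-sets, a plugin like wal2json would be needed instead; the slot name is arbitrary):

```sql
-- create a logical replication slot using the built-in test_decoding plugin
SELECT * FROM pg_create_logical_replication_slot('audit_slot', 'test_decoding');

-- ...do some data changes...

-- peek at the decoded change stream without consuming it
SELECT * FROM pg_logical_slot_peek_changes('audit_slot', NULL, NULL);

-- clean up, releasing the retained WAL
SELECT pg_drop_replication_slot('audit_slot');
```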
Pros

- Minimal extra resource usage penalty on the source system.
- Can be configured flexibly on the table level – one could leave out some tables, or only stream certain changes like INSERTs for some tables.
- Can be the safest option with regard to data integrity.
- Subscribers can also be purely virtual, i.e. applications doing “something” with the changed data.

Cons

- Somewhat complex to set up and run.
- The Postgres-native way requires careful syncing of future schema changes.
- Means coupling of 2 servers via replication slots, so monitoring is needed.
Summary of top pros and cons
| Approach | Pros | Cons |
| --- | --- | --- |
| log_statement=’mod’ | Simplest way for basic requirements – just flip the built-in switch, even during runtime. | Text-based: volume and search can be problematic.<br>Captures only top level statements issued by users.<br>Does not capture bulk update details.<br>No table level configuration. |
| pgaudit extension | Options to configure processing according to operation type and object / role.<br>Logs also dynamically generated statements. | Text-based: volume and search can be problematic.<br>Does not capture bulk update details.<br>A 3rd party extension. |
| Explicit audit tables and triggers for all (relevant) tables | Less resources burnt than with text-based approaches.<br>Easily customizable per table. | Lots of code to manage and some structure duplication introduced. |
| A single audit table and trigger for all (relevant) tables | Less resources burnt than with text-based approaches.<br>Still a fast search.<br>Customizable per table. | Audit search queries might need some JSON skills. |
| Logical replication | Least amount of resources burnt on the source system.<br>Highly customizable on object level.<br>Can be well secured to guard data integrity. | Needs extra hardware / custom application.<br>Typically requires some schema changes and extra care when the schema evolves. |
Hope you got some ideas for your next auditing project with PostgreSQL!