By Kaarel Moppel - Auto-rebuild bloated tables with pg_squeeze: One of the few areas where PostgreSQL's out-of-the-box functionality is not 100% satisfying is the “bloat problem”. Until now, combating bloat, or just trying to ensure that your table data is physically ordered according to some column(s) (a.k.a. clustering), required accepting some inconvenient compromises: extended periods of full table locking (no read or write activity) with the built-in VACUUM FULL or CLUSTER commands, or third-party tooling, usually meaning “pg_repack”. “pg_repack” offers good benefits like a much smaller full-lock window and ordering by specific columns, but needs a bit of fiddling around - installing the extension, identifying bloated tables, running its command line client - and for larger tables it can also temporarily grow disk usage unnecessarily, as it uses triggers to store the modifications made to tables during the pre-building phase.
To alleviate the situation, on behalf of the Cybertec development team I'm really glad to announce a new bloat-painkiller called “pg_squeeze”! I myself, with my stereotypically calm Nordic temper, don't usually get too excited about a piece of software, but this time, as a day-to-day PostgreSQL user, I must say I'm really impressed - an absolutely great piece of work! I also wonder why nothing like it came about earlier.
pg_squeeze is a PostgreSQL extension implementing a background worker process (one per database) that periodically monitors tables defined by the user, and when it detects a table crossing the “bloat threshold”, kicks in and rebuilds that table automatically! Rebuilding happens concurrently in the background with minimal storage and computational overhead, because it uses Postgres' built-in replication slots together with logical decoding to extract from the XLOG any table changes happening during the rebuild. The bloat threshold is configurable, and the bloat ratio calculation is based on the free space map or, under certain conditions, on concepts of the “pgstattuple” extension. Additionally, a minimum table size can be set, with smaller tables being ignored. A further requirement for a table to be considered for rebuilding is that it must have a primary key or unique constraint defined.
# Download and install the extension
git clone …
export PGCONFIG=/usr/bin/pg_config   # point it to your desired Postgres installation
make && sudo make install

cat <<-EOF >> testcluster/postgresql.conf
wal_level = logical
max_replication_slots = 1
shared_preload_libraries = 'pg_squeeze'
EOF

pg_ctl -D testcluster restart

psql -c "CREATE EXTENSION pg_squeeze"
psql -c "INSERT INTO squeeze.tables (tabschema, tabname, first_check) VALUES ('public', 'foo', now());"
psql -c "SELECT squeeze.start_worker()"   # PS! not needed when we define the list of "squeezed" databases in postgresql.conf
In addition to the above-mentioned option to list the databases and tables taking part in the auto-rebuild, the following “rebuild decision” aspects can be configured for every table by adjusting values in the “squeeze.tables” table. NB! This table and the start/stop_worker() calls are meant to be the only “user interface” of the extension; the other tables/functions are meant for internal use (although it is possible to call them).
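For illustration, a hedged sketch of tuning one table - the column names (free_space_extra, min_size, clustering_index) are taken from the pg_squeeze README and may differ between versions, and the index name is made up:

-- a sketch, not gospel: column names per the pg_squeeze README of the version at hand
UPDATE squeeze.tables
   SET free_space_extra = 30,                 -- consider >=30% extra free space as bloat
       min_size = 100,                        -- ignore the table while it is below ~100 MB
       clustering_index = 'foo_created_idx'   -- hypothetical index to physically order rows by
 WHERE tabschema = 'public' AND tabname = 'foo';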
Additionally, the following can be configured on the global (database) level:
Also, because logical decoding is used, only newer versions of PostgreSQL, starting from version 9.4, can be taken into consideration.
Currently “pg_squeeze” supports the latest PostgreSQL 16; see the GitHub page for the latest info.
If you want the background worker to start automatically during startup of the whole PostgreSQL cluster, add entries like this to “postgresql.conf”, selecting appropriate databases and a role for the worker. More explanations on that are available from the README.
squeeze.worker_autostart = 'mydb1 mydb2'
squeeze.worker_role = postgres
Grab the code here and try it out! Questions and feedback welcome.
In case you need any assistance, please feel free to contact us.
Last week a new PostgreSQL major version, numbered 9.6, was released, and I already covered my favorite parts of the official release notes from a DBA's point of view in a blogpost here. Now I would like to look at the same list of changes from a different angle, transforming in my mind into a database developer 🙂 Some features of shared interest are of course re-listed, but it's mostly new stuff, so put on your dev-hat and enjoy.
Needs enabling via the “max_parallel_workers_per_gather” parameter, which can luckily be done on the user level, thus making per-query parallelization possible. One thing to note here is that the total number of worker processes is limited by the “max_worker_processes” parameter, so this might need increasing (default is 8) on good hardware when running parallel queries from lots of concurrent sessions.
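As a quick illustration (the table name is of course a placeholder), parallelism can be toggled per session:

-- session-level: allow up to 4 parallel workers per Gather node
SET max_parallel_workers_per_gather = 4;
-- big_table is hypothetical; a large aggregate like this can now show a
-- "Gather" node with parallel workers in the plan
EXPLAIN ANALYZE SELECT count(*) FROM big_table;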
Transparent sharding ahoy! That was already possible with 9.5 (foreign table inheritance + constraint exclusion), but now things also perform well for non-simplistic use cases. Great stuff.
Allows efficient execution of remote queries involving extension operators. Again, great for sharding scenarios.
Together with the new "remote_apply" setting for the "synchronous_commit" parameter user have power to create “mirrored” multi-machine clusters. Awesome feature!
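A minimal sketch of the primary's configuration, assuming a single standby whose application_name is "standby1" (a placeholder):

# postgresql.conf on the primary
synchronous_standby_names = 'standby1'
synchronous_commit = remote_apply   # COMMIT returns only after the standby has also applied the change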
A pilot feature for generic index access methods, this new index type brings Bloom filters to Postgres. Basically, a Bloom index can tell you definitively that your entry “is not” there, while “is there” is lossy and can give false positives. More info on the algorithm here. It supports only equality queries and the “int” and “text” datatypes, but is a lot more efficient than a B-tree for multi-column query conditions. Contrib module.
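A small sketch following the contrib module's documentation (table and column names are made up; the WITH options tune the signature length and the bits generated per column):

CREATE EXTENSION bloom;
-- 80-bit signatures, 2 bits set per indexed column (illustrative values)
CREATE INDEX t_bloom_idx ON t USING bloom (c1, c2, c3)
    WITH (length = 80, col1 = 2, col2 = 2, col3 = 2);
-- equality conditions on any subset of the indexed columns can use the index
SELECT * FROM t WHERE c1 = 1 AND c3 = 42;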
Meaning the order of your “search words” is now respected, for example:
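-- the new <-> (followed by) operator and phraseto_tsquery() respect word order
SELECT to_tsvector('english', 'the cat sat on the mat')
       @@ phraseto_tsquery('english', 'cat sat');   -- true
SELECT to_tsvector('english', 'the cat sat on the mat')
       @@ phraseto_tsquery('english', 'sat cat');   -- false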
Previously only whole words could be fuzzy-compared with pg_trgm; now the new “word_similarity” function can also determine how well an input matches the best-matching part of another string. Contrib module.
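A quick taste (values approximate):

CREATE EXTENSION pg_trgm;
SELECT word_similarity('word', 'two words');   -- ~0.8: best match against a part of the string
SELECT similarity('word', 'two words');        -- ~0.36: the old whole-string comparison, for contrast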
This enables canceling running queries and terminating sessions of other database users. More built-in roles are to be expected in the future.
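Assuming this refers to the new "pg_signal_backend" built-in role, granting it might look like this ("dbmonitor" is a placeholder role):

-- non-superusers holding this role may signal other users' backends
GRANT pg_signal_backend TO dbmonitor;
-- dbmonitor can now run pg_cancel_backend(pid) and pg_terminate_backend(pid)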
The specified operations are carried out in the order in which the options are given, and then psql terminates.
This is useful after getting an unexpected error.
With those two pgbench changes, one can now build and test "close to real life" test cases faster and more easily.
Last week a new PostgreSQL major version, numbered 9.6, was released! The announcement, release notes and the official “What's new” overview can be found here, here and here - all highly recommended reading, so check them out. As always, a slew of blog-posts from excited members of the global Postgres community follows (check out Planet PostgreSQL here if not yet subscribed), each with a bit of a different angle. Now I would like to add my own impressions of the most interesting/relevant features, summarized for easy digestion.
As always, users who upgrade or initialize a fresh cluster will enjoy huge performance wins (avoiding scanning frozen pages unnecessarily during vacuum freeze, scalability on multi-CPU-socket servers, checkpoint writes in sorted order, index-only scans for partial indexes) out of the box, without doing - or being able to do - anything. Here, though, I would like to look at the things you won't get out of the box, where you actually need to take some steps to start benefiting from them. The list below highlights PostgreSQL 9.6 features compiled from a DBA's viewpoint. A similar article looking at the changes from a developer's point of view will follow this week.
First, here's a list of things that could most likely cause problems when migrating to PostgreSQL 9.6 from an older version. Before migrating one should of course test on a separate replica and go through the full list of possibly incompatible changes from the release notes.
"waiting" column has been replaced with "wait_event_type" and "wait_event".
Pretty simple stuff, "pg_upgrade" will give you an error "The old cluster contains roles starting with 'pg_'".
In case a ".psqlrc" file exists, this could cause your Cron scripts to generate some unwanted output (usually translating to emails), even with the "-q/--quiet" flag.
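For monitoring scripts that used the old "waiting" boolean, a minimal replacement sketch using the new columns could look like this:

-- list sessions that are currently waiting, using the new 9.6 columns
SELECT pid, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL;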
Needs enabling via the “max_parallel_workers_per_gather” parameter, which can luckily be done on the user level too, thus making per-query parallelization possible. One thing to note here is that the total number of worker processes is limited by the “max_worker_processes” parameter, so this might need increasing (default is 8) on good hardware when running parallel queries from lots of concurrent sessions.
Beyond the threshold, old data may be vacuumed away, and users will get a “snapshot too old” error when trying to read such old rows. Warning! From the documentation: “When this feature is enabled, freed space at the end of a relation cannot be released to the operating system”... so this is basically a double-edged sword, and it's not enabled by default.
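The parameter behind this, "old_snapshot_threshold", defaults to -1 (off); enabling it could look like so:

# postgresql.conf - requires a server restart; -1 (the default) keeps the feature off
old_snapshot_threshold = 1h   # snapshots older than ~1 hour may start getting the error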
Together with the new "remote_apply" setting for the "synchronous_commit" parameter user have power to create “mirrored” multi-machine clusters. Awesome feature!
From the release notes: “This allows, for example, rewinding a promoted standby back to some state of the old master's timeline”. Meaning you could promote a replica, do some migration testing, say, and then convert it back into a normal replica. Great!
Enables reading information equivalent to the “pg_controldata” utility via SQL. Previously one had to work around this via a custom PL/Python stored procedure or even a custom extension when wanting to expose the “database system identifier” for monitoring queries, for example.
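Now it's a simple query, e.g. for the system identifier:

-- one of the new control-data functions
SELECT system_identifier FROM pg_control_system();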
As the release notes formulate it well - "Historically users have obtained such information using a self-join on the pg_locks view. However, it is unreasonably tedious..." - this means one can hugely simplify monitoring scripts and ad-hoc troubleshooting, paired with some joins to pg_stat_activity.
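A typical troubleshooting query could now look like this:

-- who is blocked, and by whom - no pg_locks self-join needed
SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;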
Useful to prevent forgotten transactions from holding locks or blocking vacuum cleanup for too long. Bye-bye, Cron scripts trying to do the same by regularly reading pg_stat_activity and terminating misbehaving transactions.
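Assuming this refers to the new "idle_in_transaction_session_timeout" parameter, a global setting could look like so:

-- sessions idling in a transaction longer than this get terminated
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();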
postgresql.conf needs adjusting when migrating, unless you are already using “logical”!
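Presumably this refers to 9.6 merging the old "archive" and "hot_standby" values of "wal_level" into a single new value, so old config files need a small edit:

# pre-9.6: wal_level = hot_standby (or archive)
wal_level = replica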
More built-in roles are to be expected in the future.
The specified operations are carried out in the order in which the options are given, and then psql terminates.
When doing PostgreSQL consulting the other day, the talk turned to the topic of connection pools - namely, what approaches and products are commonly used and perform well? The topic is pretty wide in itself, but mostly well-known to old-timers. Nevertheless, it's worth a small write-up on basic concepts plus a small comparison of the two most common "near to Postgres" products you should know about - PgBouncer and pgpool-II.
First, a basic intro - connection pools are middleware that speak the database protocol and cache database connections, so that clients can spare the time needed to negotiate the connection, do authentication and set client defaults (encoding, work_mem) when opening a new connection, and so that the database server is relieved from storing too much client state in memory. Applications thus connect to the pool, thinking it's the database.
Common approaches for deploying pools:
When talking about separate pooling servers in a Postgres context, two products stand out: PgBouncer and pgpool-II. Both products are written in C and seem to be actively maintained, with pgpool-II seeing more action, as it also has a higher number of ambitious features. Besides the source code, packages for common Linux distros are available, and deployment is pretty simple: customize the configuration files and start the daemon, then redirect your clients to the pool instead of the real DB (both use non-standard ports by default though - 6432 and 9999 respectively). Based on the available online documentation (PgBouncer in current version 1.7.2 and pgpool-II in version 3.5.4, the latter sadly having some outdated parts), I compiled the following outline of the features, so that you can decide on the suitability for your needs yourself.
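To illustrate the "pretty simple" deployment claim, a minimal PgBouncer configuration sketch (key names per the 1.7 docs; database name, paths and sizes are placeholders):

; pgbouncer.ini
[databases]
bench = host=127.0.0.1 port=5432 dbname=bench

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session        ; matches the mode used in the test below
max_client_conn = 100
default_pool_size = 20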
pgpool-II gotchas
PgBouncer gotchas
Out of scepticism - to test the claims of my colleague Ants that PgBouncer is significantly faster than pgpool-II - I decided to run a quick set of tests, with all the components running on my laptop. Test setup: a small 13MB "pgbench" in-memory dataset in "--select-only" mode to get fast responses, as we only want to test connection overhead here. The pools were configured without SSL and such that the tested amount of 8 concurrent connections would always be kept cached by the pools, with no connection re-establishing taking place during the test. For PgBouncer, the default "session pooling" mode was used.
pgbench -i -s 1 bench   # init the bench schema, ~13MB

for port in 5432 6432 9999 ; do
    for i in {1..3} ; do
        pgbench --select-only --connect -T300 -c8 -j2 -p $port bench
    done
done
Side note - before I could really fire off with testing I ran into a distro-specific problem where connections started to fail after some time, and it required changing some kernel parameters. More info here.
Results (as always, given with a YMMV disclaimer) were as follows:
Both well-known and battle-tested products, PgBouncer and pgpool-II provide a good way to grab that low-hanging performance fruit (a very noticeable difference when doing very short and simple transactions) and also to add some flexibility to your setup by hiding the database from direct access, making it easier to do minor maintenance. For most use cases (no replicas, or using external HA solutions) PgBouncer would be my pick, due to its lightweight architecture and superior performance.
In case you have any questions, feel free to contact us.