I recently had an interesting support case that shows how the cause of a problem can sometimes be where you would least suspect it.
About table bloat
DELETE, PostgreSQL keeps old versions of a table row around. This way, concurrent sessions that want to read the row don’t have to wait. But eventually this “garbage” will have to be cleaned up. That is the task of the autovacuum daemon.
Usually you don’t have to worry about that, but sometimes something goes wrong. Then old row versions don’t get deleted, and the table keeps growing. Apart from the wasted storage space, this will also slow down sequential scans and – to some extent – index scans.
To get rid of the bloat, you can use
VACUUM (FULL) and other tools like pg_squeeze. But it is important to find and fix the cause of the table bloat so that it does not reappear.
I got called by a customer who experienced table bloat in the
pg_attribute system catalog table that contains the table column metadata.
This can happen if table columns get modified or dropped frequently. Most of the time, these are temporary tables which are automatically dropped when the session or transaction ends.
The customer made heavy use of temporary tables. But they have several Linux machines with databases that experience the same workload, and only some of those had the problem.
Searching the cause
I went through the list of common causes for table bloat:
- Database transactions that remain open (state “idle in transaction”).
These will keep autovacuum from cleaning up row versions that have become obsolete after the start of the transaction.
- A rate of data modification that is so high that autovacuum cannot keep up.
In this case, the correct answer is to make autovacuum more aggressive.
Both were not the case; the second option could be ruled out because that would cause bloat on all machines and not only on some.
Then I had a look at the usage statistics for the affected table:
dbname=> SELECT * FROM pg_stat_sys_tables dbname-> WHERE relname = 'pg_attribute'; -[ RECORD 1 ]-------+------------- relid | 1249 schemaname | pg_catalog relname | pg_attribute seq_scan | 167081 seq_tup_read | 484738 idx_scan | 1506941 idx_tup_fetch | 4163837 n_tup_ins | 0 n_tup_upd | 0 n_tup_del | 0 n_tup_hot_upd | 0 n_live_tup | 0 n_dead_tup | 0 n_mod_since_analyze | 0 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 0
This confirms that autovacuum has never run. But more interesting is that we get a hint why it didn’t run:
PostgreSQL thinks that the number of dead tuples (row versions that could be removed) is 0, so it didn’t even try to remove them.
The statistics collector
A suspicion became certainty after I found the following message in the server logs:
using stale statistics instead of current ones because stats collector is not responding
The statistics collector process is the PostgreSQL backend process that collects usage statistics.
After each activity, PostgreSQL backends send statistics about their activity. These statistics updates are sent through a UDP socket on
localhost; that is created at PostgreSQL startup time. The statistics collector reads from the socket and aggregates the collected statistics.
Closing in on the problem
The statistics collector was running:
918 1 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/ 947 918 postgres: logger process 964 918 postgres: checkpointer process 965 918 postgres: writer process 966 918 postgres: wal writer process 967 918 postgres: autovacuum launcher process 968 918 postgres: stats collector process 969 918 postgres: bgworker: logical replication launcher
To see what the statistics collector was doing and to spot any problems it had, I traced its execution:
# strace -p 968 strace: Process 968 attached epoll_pwait(3,
The statistics collector was waiting for messages on the UDP socket, but no messages were coming through!
I had a look at the UPD socket:
# netstat -u -n -p Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name udp6 0 0 ::1:59517 ::1:59517 ESTABLISHED 918/postmaster
Nothing suspicious so far.
But when I tried the same thing on a machine that didn’t have the problem, I got something different:
# netstat -u -n -p Active Internet connections (w/o servers) Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name udp 0 0 127.0.0.1:46031 127.0.0.1:46031 ESTABLISHED 9303/postmaster
But all machines had IPv6 disabled for the loopback interface:
# ifconfig lo lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536 inet 127.0.0.1 netmask 255.0.0.0 loop txqueuelen 1000 (Local Loopback) RX packets 6897 bytes 2372420 (2.2 MiB) RX errors 0 dropped 0 overruns 0 frame 0 TX packets 6897 bytes 2372420 (2.2 MiB) TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Nailing the bug to the wall
PostgreSQL uses the POSIX function
getaddrinfo(3) to resolve
Since PostgreSQL knows quite well that it is important to have a working statistics collection, it will loop through all the addresses returned by that call, create a UDP socket and test it until it has a socket that works.
So we know that IPv6 must have worked when PostgreSQL was started!
Further analysis revealed that IPv6 was disabled during the boot sequence, but there was a race condition:
Sometimes IPv6 would be disabled before PostgreSQL started, sometimes afterwards. And these latter machines were the ones where the statistics collector stopped working and tables got bloated!
After changing the boot sequence to always disable IPv6 before starting PostgreSQL, the problem was fixed.
This shows (again) how the cause of a computer problem can be in an entirely different place than you would suspect at first glance.
It also shows why a good database administrator needs to know the operating system well.