I recently helped a customer with a slow query. Eventually, an ANALYZE on a partitioned table was enough to fix the problem. This came as a surprise to the customer, since autovacuum was enabled. So I decided to write an article on how PostgreSQL collects partitioned table statistics and how they affect PostgreSQL's estimates.
I am talking about optimizer statistics as defined in my article about preserving statistics during an upgrade. These are the data that the query optimizer uses to estimate the cost of execution plans. You can view the optimizer statistics for table columns in the system view pg_stats.
In PostgreSQL, a partitioned table is a logical construct, somewhat similar to a view. A partitioned table does not hold any data itself. The data reside in the partitions of the partitioned table. Inserting a row into a partitioned table actually inserts the row into the appropriate partition. In PostgreSQL, partitions are ordinary tables that just have a “side job” as a partition of a partitioned table.
When I talk about partitioned table statistics, I mean the optimizer statistics for the partitioned table itself, as opposed to the optimizer statistics of its partitions.
To demonstrate statistics on a partitioned table, I am going to use a simple example:
CREATE TABLE tab (
   pk integer NOT NULL
) PARTITION BY LIST ((pk % 2));

CREATE UNLOGGED TABLE tab_even
   PARTITION OF tab FOR VALUES IN (0);

CREATE UNLOGGED TABLE tab_odd
   PARTITION OF tab FOR VALUES IN (1);
I used unlogged tables for better performance, because we don't need crash recovery. Each of the partitions will receive half of the rows I INSERT into the table:
INSERT INTO tab
SELECT * FROM generate_series(1, 1000000);
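If you want to verify that the rows really got split evenly, you can group by the partition in which each row is physically stored (the system column tableoid identifies that table); this should report 500000 rows for each of tab_even and tab_odd:

SELECT tableoid::regclass AS partition,
       count(*) AS row_count
FROM tab
GROUP BY tableoid;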
With the default configuration, the autovacuum launcher will sleep for up to a minute before becoming active. Then it will process our new tables and run VACUUM and ANALYZE. So let's wait for two minutes before we take a look at the statistics:
SELECT * FROM pg_stats
WHERE tablename = 'tab_even'
  AND attname = 'pk';

-[ RECORD 1 ]----------+----------------------
schemaname             | laurenz
tablename              | tab_even
attname                | pk
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {82,10896,21078,...}
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
range_length_histogram |
range_empty_frac       |
range_bounds_histogram |

SELECT * FROM pg_stats
WHERE tablename = 'tab'
  AND attname = 'pk';

(0 rows)
PostgreSQL has statistics for the partitions, but no partitioned table statistics! We can find the explanation in the documentation:
Partitioned tables do not directly store tuples and consequently are not processed by autovacuum. (Autovacuum does process table partitions just like other tables.) Unfortunately, this means that autovacuum does not run ANALYZE on partitioned tables, and this can cause suboptimal plans for queries that reference partitioned table statistics. You can work around this problem by manually running ANALYZE on partitioned tables when they are first populated, and again whenever the distribution of data in their partitions changes significantly.
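You can see this behavior in the monitoring statistics as well: last_autoanalyze should be set for the two partitions, but stay NULL for the partitioned table itself:

SELECT relname, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('tab', 'tab_even', 'tab_odd');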
So we see that we don't get statistics on a partitioned table automatically. The question is how important these statistics are. After all, the partitioned table contains no data, and we have statistics on the partitions. For many SQL statements, that is good enough. Here is an example:
EXPLAIN (ANALYZE) SELECT * FROM tab WHERE pk < 1000;

                                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..12872.14 rows=913 width=4) (actual time=0.558..27.241 rows=999.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=4426
   ->  Parallel Append  (cost=0.00..11780.84 rows=380 width=4) (actual time=6.016..20.488 rows=333.00 loops=3)
         Buffers: shared hit=4426
         ->  Parallel Seq Scan on tab_even tab_1  (cost=0.00..5889.47 rows=249 width=4) (actual time=3.035..9.128 rows=166.33 loops=3)
               Filter: (pk < 1000)
               Rows Removed by Filter: 166500
               Buffers: shared hit=2213
         ->  Parallel Seq Scan on tab_odd tab_2  (cost=0.00..5889.47 rows=288 width=4) (actual time=6.741..16.962 rows=250.00 loops=2)
               Filter: (pk < 1000)
               Rows Removed by Filter: 249750
               Buffers: shared hit=2213
 Planning Time: 0.199 ms
 Execution Time: 27.359 ms
As you can see, the estimates are fairly accurate.
However, PostgreSQL uses partitioned table statistics to estimate the result row count of a join. Look at the following example:
EXPLAIN (ANALYZE)
SELECT * FROM generate_series(1, 100) AS gs(id)
   JOIN tab ON gs.id = tab.pk;

                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2.25..28178.25 rows=500000 width=8) (actual time=0.164..145.455 rows=100.00 loops=1)
   Hash Cond: (tab.pk = gs.id)
   Buffers: shared hit=4426
   ->  Append  (cost=0.00..19426.00 rows=1000000 width=4) (actual time=0.018..84.210 rows=1000000.00 loops=1)
         Buffers: shared hit=4426
         ->  Seq Scan on tab_even tab_1  (cost=0.00..7213.00 rows=500000 width=4) (actual time=0.017..21.568 rows=500000.00 loops=1)
               Buffers: shared hit=2213
         ->  Seq Scan on tab_odd tab_2  (cost=0.00..7213.00 rows=500000 width=4) (actual time=0.019..16.421 rows=500000.00 loops=1)
               Buffers: shared hit=2213
   ->  Hash  (cost=1.00..1.00 rows=100 width=4) (actual time=0.119..0.120 rows=100.00 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Function Scan on generate_series gs  (cost=0.00..1.00 rows=100 width=4) (actual time=0.058..0.071 rows=100.00 loops=1)
 Planning:
   Buffers: shared hit=73
 Planning Time: 0.743 ms
 Execution Time: 145.540 ms
All the row count estimates are correct, except the one for the hash join: it is off by a factor of 5000! Let's ANALYZE the partitioned table and see if that improves the estimate:
ANALYZE tab;

EXPLAIN (ANALYZE)
SELECT * FROM generate_series(1, 100) AS gs(id)
   JOIN tab ON gs.id = tab.pk;

                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1002.25..14966.87 rows=100 width=8) (actual time=0.382..57.992 rows=100.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=4426
   ->  Hash Join  (cost=2.25..13956.87 rows=42 width=8) (actual time=16.681..52.161 rows=33.33 loops=3)
         Hash Cond: (tab.pk = gs.id)
         Buffers: shared hit=4426
         ->  Parallel Append  (cost=0.00..12391.69 rows=416668 width=4) (actual time=0.008..29.572 rows=333333.33 loops=3)
               Buffers: shared hit=4426
               ->  Parallel Seq Scan on tab_even tab_1  (cost=0.00..5154.18 rows=294118 width=4) (actual time=0.004..6.317 rows=166666.67 loops=3)
                     Buffers: shared hit=2213
               ->  Parallel Seq Scan on tab_odd tab_2  (cost=0.00..5154.18 rows=294118 width=4) (actual time=0.007..10.296 rows=250000.00 loops=2)
                     Buffers: shared hit=2213
         ->  Hash  (cost=1.00..1.00 rows=100 width=4) (actual time=0.058..0.058 rows=100.00 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 12kB
               ->  Function Scan on generate_series gs  (cost=0.00..1.00 rows=100 width=4) (actual time=0.027..0.036 rows=100.00 loops=3)
 Planning:
   Buffers: shared hit=12
 Planning Time: 0.237 ms
 Execution Time: 58.041 ms
That did the trick! Now we can also see data in pg_stats (note that inherited is t, because these statistics describe the data in all partitions taken together):
SELECT * FROM pg_stats
WHERE tablename = 'tab'
  AND attname = 'pk';

-[ RECORD 1 ]----------+----------------------
schemaname             | laurenz
tablename              | tab
attname                | pk
inherited              | t
null_frac              | 0
avg_width              | 4
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {29,10972,20818,...}
correlation            | 0.5027541
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |
range_length_histogram |
range_empty_frac       |
range_bounds_histogram |
It has become clear that we need to collect partitioned table statistics, at least for certain queries. On the other hand, autovacuum doesn't do the job for us. So we'll have to schedule explicit ANALYZE runs for our partitioned tables. We can use an operating system scheduler like cron, a PostgreSQL extension like pg_timetable or something you build into your application.
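For example, a crontab entry like the following would collect the statistics every night at 2:00 a.m. (the database name mydb is a placeholder, and you have to make sure that the job can authenticate as a user with the required privileges):

# ANALYZE the partitioned table every night at 02:00
0 2 * * * psql -d mydb -c 'ANALYZE tab;'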
How often should you ANALYZE your partitioned tables? That depends on your use case, but I'd expect that once a day or once a week should be good enough in most cases. After all, you normally partition large tables, and it is unlikely that a large fraction of the data changes within a day or a week.
You may wonder why autovacuum cannot do that for us. The answer is probably that nobody has got around to implementing that yet.
For ordinary tables, PostgreSQL uses the monitoring statistics in pg_stat_all_tables.n_mod_since_analyze to determine if a table needs autoanalyze. Since a partitioned table does not contain any data itself, this number is always zero for a partitioned table. There are two possible approaches for autoanalyze of partitioned tables:

- whenever a data modification updates n_mod_since_analyze for a partition, update the counter for the partitioned table as well
- whenever autovacuum checks whether it should ANALYZE a partitioned table, add up n_mod_since_analyze from all its partitions

The first approach would burden all data modifying statements, so I'd say that the second approach is more promising. Perhaps some reader will feel inspired to improve PostgreSQL!
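Until then, you could implement a simple version of the second approach yourself: add up n_mod_since_analyze over all partitions and run an explicit ANALYZE whenever the sum exceeds a threshold you are comfortable with. For our example table tab (and assuming a single level of partitioning), a query like this calculates the sum:

SELECT sum(s.n_mod_since_analyze) AS modifications
FROM pg_inherits AS i
   JOIN pg_stat_all_tables AS s ON s.relid = i.inhrelid
WHERE i.inhparent = 'tab'::regclass;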
We have seen that autovacuum does not collect partitioned table statistics. We have to collect them with an explicit ANALYZE if we want to have good execution plans.
I am indebted to Andrey Lepikhov for guidance on where such statistics actually matter!