Histograms are a common way for a relational database to store statistical information about data. What is this kind of statistical information good for? Well, statistics are the rocket fuel behind performance. If statistics are bad, the optimizer will come up with bad decisions and poor execution plans will be the result. The PostgreSQL optimizer is a damn sophisticated piece of software and it provides end users with good plans in general – still, without statistics even the best SQL optimizer will yield bad performance.

The good news is that when things are a little tight, the end user can have some influence on the way PostgreSQL stores statistics. The ALTER TABLE … SET STATISTICS command will allow users to define the size of the internal histograms storing statistical information.

Before we dig into those details we create a simple table and add some data to it:

test=# CREATE TABLE t_test AS SELECT * FROM generate_series(1, 1000000) AS x;
SELECT 1000000

This will create a table containing one million numbers. To make sure that the optimizer will know about the content of the table, we can run ANALYZE (a command in charge of creating statistics):

test=# ANALYZE ;

How statistics is used

Whenever PostgreSQL has to plan a query, it will rely on the statistics we have made just before. Let us take a look at a very simplistic plan:

test=# explain SELECT * FROM t_test WHERE x < 10000;
                      QUERY PLAN
 Seq Scan on t_test (cost=0.00..16925.00 rows=9109 width=4)
   Filter: (x < 10000)

(2 rows)

We try to find all values lower than 10.000. The planner estimates that we will find roughly 9.100 rows. Overall this is pretty good and totally sufficient to plan the query efficiently. Remember, estimating statistics is not about having a final answer right away – it is about having a reasonably well estimate.

Let us see, which statistics PostgreSQL has made. To do so we can take a look at a system view called pg_stats:

test=# \d pg_stats
          View "pg_catalog.pg_stats"
       Column           |   Type   | Modifiers
 schemaname             | name     |
 tablename              | name     |
 attname                | name     |
 inherited              | boolean  |
 null_frac              | real     |
 avg_width              | integer  |
 n_distinct             | real     |
 most_common_vals       | anyarray |
 most_common_freqs      | real[]   |
 histogram_bounds       | anyarray |
 correlation            | real     |
 most_common_elems      | anyarray |
 most_common_elem_freqs | real[]   |
 elem_count_histogram   | real[]   |

If you are interested in the content of pg_stats we recommend to take a look at the PostgreSQL documentation. It outlines nicely how things work in this area: http://www.postgresql.org/docs/9.3/static/view-pg-stats.html

Changing the size of the histogram

Sometimes it can be quite beneficial to change the size of a column’s histogram. Reducing the size if the histogram will make estimates a lot less precise – but, it will reduce the time needed to plan a query. If you are dealing with a table with 1 billion rows and all you do is hitting the primary key to fetch exactly one row at a time, a narrow histogram is definitely not a big issue.

However, if you are dealing with more complex operations, keeping the default to 100 or to raise this value can be quite beneficial.

This is how you can change the size of the histogram of a specific column:


Once this has been done, we can recreate the stats:

test=# ANALYZE t_test;

Querying the stats table reveals what PostgreSQL knows about the column:

test=# \x
Expanded display is on.
test=# SELECT * FROM pg_stats WHERE tablename = 't_test' AND attname = 'x';

-[ RECORD 1 ]----------+----------------------------------------------------------------------------
schemaname             | public
tablename              | t_test
attname                | x
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {492,100971,194099,290435,398419,485780,587410,692588,793607,898079,999686}
correlation            | 1
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

Note that the histogram consists of just 11 values now (which gives us 10 bars).

The impact of a change

We can toy around with this setting now and see what happens to the plan made by PostgreSQL once the statistics target has been risen to 1.000:


test=# ANALYZE ;

test=# explain SELECT * FROM t_test WHERE x < 10000;
                        QUERY PLAN
 Seq Scan on t_test (cost=0.00..16925.00 rows=9921 width=4)
     Filter: (x < 10000)
(2 rows)

In our example the estimate is a little bit more precise than initially. However, this comes at a cost. Planning time has risen – the plan is still the same (because there is no other choice for the planner anyway).

Btw, keep in mind that those values should not be changed on a daily basis and in many cases it is not even necessary to change this stuff at all – it is nice to have the opportunity to do so, however, when beneficial.

Visit us on facebook: www.fb.com/cybertec.postgresql