A lot has been written about effective_cache_size in postgresql.conf and about PostgreSQL performance in general. However, few people know what this famous parameter really does. Let me share some more insights.

What the PostgreSQL optimizer does

The idea behind SQL is actually quite simple: The end user sends a query and the optimizer is supposed to find the best strategy to execute this query. The output of the optimizer is what people call an “execution plan”. The question now is: What makes one execution plan better than some other plan? What is it that makes a strategy greater than some other strategy? In PostgreSQL everything boils down to the concept of “costs”. The planner will assign costs to every operation. At the end of the day the cheapest plan is selected and executed.

The magic is therefore in the way the optimizer handles costs and this is exactly what effective_cache_size is all about.

Understanding effective_cache_size

To achieve good performance it is important to figure out whether to use an index or not. A question often asked is: Why not always use an index? Traversing and index might not be cheap at and using an index does not mean that there is no need to touch the table as well. The optimizer therefore has to decide whether to go for an index or not.

The way costs are estimated depend on various factors: Amount of I/O needed, number of operators called, number of tuples processed, selectivity, and a lot more. However, what is the cost of I/O? Obviously it makes a difference if data is already in cache or if data has to be read from disk. That brings us to the idea behind effective_cache_size which tells the optimizer how much cache to expect in the system. The important part is that “cache” is not only the amount of memory knows about (this part is pretty clear). The system also has to consider the size of the filesystem cache, CPU caches, and so on. effective_cache_size is the sum of all those caching components. What you will learn in this post is how the optimizer uses this kind of information.

effective_cache_size in action

Before we lose ourselves in theoretical explanation it makes sense to dig into a practical example. For the purpose of this blog I have generated two tables:

test=# CREATE TABLE t_random AS SELECT id, random() AS r 
          FROM generate_series(1, 1000000) AS id ORDER BY random();
SELECT 1000000
test=# CREATE TABLE t_ordered AS SELECT id, random() AS r 
          FROM generate_series(1, 1000000) AS id;
SELECT 1000000
test=# CREATE INDEX idx_random ON t_random (id);
CREATE INDEX
test=# CREATE INDEX idx_ordered ON t_ordered (id);
CREATE INDEX
test=# VACUUM ANALYZE ;
VACUUM

Mind that both tables contain the same set of data. One table is ordered – the other one is not. Let us set the effective_cache_size to a really small value. The optimizer will assume that there is really not much memory in the system:

test=# SET effective_cache_size TO '1 MB';
SET
test=# SET enable_bitmapscan TO off;
SET
test=# explain SELECT * FROM t_random WHERE id < 1000;
                                      QUERY PLAN
---------------------------------------------------------------------------------
Index Scan using idx_random on t_random (cost=0.42..3611.96 rows=909 width=12)
Index Cond: (id < 1000)
(2 rows)

Normally PostgreSQL would go for a bitmap index scan, but we want to see what happens in case of an index scan. Therefore we turn bitmap scans off (= making them insanely expensive to the optimizer).

Let us compare the plan with the one we have just seen before:

test=# SET effective_cache_size TO '1000 GB';
SET
test=# explain SELECT * FROM t_random WHERE id < 1000;
                                    QUERY PLAN
--------------------------------------------------------------------------------
 Index Scan using idx_random on t_random (cost=0.42..3383.99 rows=909 width=12)
Index Cond: (id < 1000)
(2 rows)

As you can see, the price of the index scan has gone down. Why is that relevant? We have to see costs as “relative”. The absolute number is not important – it is important how expensive a plan is compared to some other plan. If the price of a sequential scan stays the same and the price of an index scan goes down relative to a seq scan PostgreSQL will favor indexing more often than it otherwise would. This is exactly what effective_cache_size at its core is all about: Making index scans more likely if there is a lot of RAM around.

Limitations and special cases

When people talk about postgresql.conf and effective_cache_size they are often not aware of the fact that the parameter does not always work miracles. The following scenario shows when there is no impact:

test=# SET effective_cache_size TO '1 MB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                       QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered (cost=0.42..39.17 rows=1014 width=12)
       Index Cond: (id < 1000)
(2 rows)

test=# SET effective_cache_size TO '1000 GB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                     QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered (cost=0.42..39.17 rows=1014 width=12)
       Index Cond: (id < 1000)
(2 rows)

The table statistics used by the optimizer contain information about physical “correlation”. If correlation is 1 (= all data is sorted perfectly on disk) effective_cache_size will NOT change anything.

The same is true if the is only one column as shown in the next example:

test=# ALTER TABLE t_random DROP COLUMN r;
ALTER TABLE
test=# SET effective_cache_size TO '1 MB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                       QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered (cost=0.42..39.17 rows=1014 width=12)
       Index Cond: (id < 1000)
(2 rows)

test=# SET effective_cache_size TO '1000 GB';
SET
test=# explain SELECT * FROM t_ordered WHERE id < 1000;
                                      QUERY PLAN
----------------------------------------------------------------------------------
 Index Scan using idx_ordered on t_ordered (cost=0.42..39.17 rows=1014 width=12)
       Index Cond: (id < 1000)
(2 rows)

That comes as a surprise to most users and therefore I considered it worth mentioning.

Tuning effective_cache_size in PostgreSQL

I found it useful to use a simple formula to get a rough estimate for a good setting:

effective_cache_size = RAM * 0.7

Some people have also used 0.8 successfully. Of course this is only true when we are talking about a dedicated database server. Feel free to experiment.

Finally …

If you want to learn more about PostgreSQL I want to recommend a post written by my friend and work mate Florian Nadler who has written an excellent post about MobilityDB which is a PostgreSQL extension optimized for GPS tracking data.