Better correlation helps in real life too

After you ANALYZE a PostgreSQL table to collect value distribution statistics, you will find the gathered statistics for each column in the pg_stats system view. This article will explain the meaning of the correlation column and its impact on index scans.

Physical vs. logical ordering

Most common PostgreSQL data types have an ordering: they support the operators <, <=, =, >= and >.
Such data types can be used with a B-tree index (the “standard” index type).

The values in a column of such a type provide a logical ordering of the table rows. An index on this column will be sorted according to that ordering.

A PostgreSQL table consists of one or more files of 8KB blocks. The order in which the rows are stored in the file is the physical ordering.
You can examine the physical ordering of the rows by selecting the ctid system column: it contains the block number and the item number inside the block, which describe the physical location of the table row.
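As a quick sketch (using the corr table with an id column from the example later in this article; any table works), the physical position of each row can be inspected like this:

```sql
-- ctid is (block number, item number within the block),
-- i.e. the physical location of each row
SELECT ctid, id
FROM corr
ORDER BY ctid
LIMIT 5;
```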


The correlation for a column is a value between -1 and 1. It tells how good the match between logical and physical ordering is.

  • If the correlation is 1, the rows are stored in the table file in ascending column order; if it is -1, they are stored in descending order.
  • Values between -1 and 1 mean a less perfect match.
  • A value of 0 means that there is no connection between the physical and the logical order.

Why should I care?

You will create indexes on your tables for faster access (but not too many!).
The correlation of a column has an impact on the performance of an index scan.

During an index scan, the whole index or part of it is read in index order. For each entry that is found, the corresponding row is fetched from the table (this step is skipped in an “index only scan”, but that is a different story).

If the correlation of the indexed column is close to zero, the fetched rows will be from all over the table. This will result in many randomly distributed reads of many different table blocks.

However, if the correlation is close to 1 or -1, the next row fetched during the index scan tends to be in the same or the next table block as the previous row.

High correlation has two advantages:

  1. Blocks read by the database are cached in shared memory. Consequently, if many of the table rows fetched during the index scan are located in the same table block, only a few blocks have to be read from storage.
  2. The blocks that have to be read from storage are next to each other. This leads to sequential I/O, which on spinning disks is substantially faster than random I/O.

An example

Let’s create two tables with identical content, but different correlation:

CREATE TABLE corr (id, val) AS
   SELECT i, 'some text ' || i
   FROM generate_series(1, 100000) AS i;

CREATE INDEX corr_idx ON corr (id);

ANALYZE corr;

SELECT correlation FROM pg_stats
WHERE tablename = 'corr' AND attname = 'id';

 correlation 
-------------
           1
(1 row)

CREATE TABLE uncorr (id, val) AS
   SELECT * FROM corr
   ORDER BY random();

CREATE INDEX uncorr_idx ON uncorr (id);

ANALYZE uncorr;

SELECT correlation FROM pg_stats
WHERE tablename = 'uncorr' AND attname = 'id';

(1 row)

This time the reported correlation will be close to 0; the exact value varies from run to run, because the rows were shuffled randomly.

We disable bitmap index scans so that we can compare index scans on both tables.
Then we check how index scans perform:

SET enable_bitmapscan = off;

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM corr WHERE id BETWEEN 1001 AND 1300;

                    QUERY PLAN
 Index Scan using corr_idx on corr
       (cost=0.29..15.23 rows=297 width=19)
       (actual time=0.108..0.732 rows=300 loops=1)
   Index Cond: ((id >= 1001) AND (id <= 1300))
   Buffers: shared hit=6
 Planning time: 0.456 ms
 Execution time: 1.049 ms
(5 rows)

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM uncorr WHERE id BETWEEN 1001 AND 1300;

                    QUERY PLAN
 Index Scan using uncorr_idx on uncorr
       (cost=0.29..978.15 rows=298 width=19)
       (actual time=0.105..2.352 rows=300 loops=1)
   Index Cond: ((id >= 1001) AND (id <= 1300))
   Buffers: shared hit=303
 Planning time: 0.548 ms
 Execution time: 2.736 ms
(5 rows)

Now 2.7 milliseconds is not so bad, but that is only because all blocks were already in shared buffers.
If a part of these blocks has to be read from disk, the 303 blocks from the second query will do much worse than the 6 from the first!

In the second query, each result row was found in a different table block. This caused 300 blocks to be touched. The remaining three blocks are index blocks.

The first query touches only three table blocks:

SELECT ctid, id FROM corr
WHERE id BETWEEN 1001 AND 1300;

   ctid   |  id  
----------+------
  (6,58)  | 1001
  (6,59)  | 1002
  (6,60)  | 1003
  (6,61)  | 1004
  (6,62)  | 1005
  (6,63)  | 1006
  (6,64)  | 1007
      ...
  (8,37)  | 1294
  (8,38)  | 1295
  (8,39)  | 1296
  (8,40)  | 1297
  (8,41)  | 1298
  (8,42)  | 1299
  (8,43)  | 1300
(300 rows)

Indeed, all rows are contained in the table blocks 6, 7 and 8!

Correlation and the optimizer

The PostgreSQL optimizer estimates the cost of the possible ways to execute an SQL statement.

With the use of the correlation it can give better estimates of the cost of an index scan, leading to better plan choices.

The PostgreSQL optimizer will prefer index scans if the correlation is close to 1 or -1.

Correlation and BRIN indexes

PostgreSQL 9.5 introduced the BRIN index (block range index).

This index works by storing the minimum and maximum of all values for ranges of table blocks. It is only useful for columns with perfect correlation. Its advantage over the B-tree index is its much smaller size, which makes it an interesting option for large tables.
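As a sketch (reusing the corr table and corr_idx index from the example above; the BRIN index name is my own), the size difference is easy to see:

```sql
-- a BRIN index stores only min/max values per block range,
-- so it is tiny compared to a B-tree on the same column
CREATE INDEX corr_brin_idx ON corr USING brin (id);

-- compare the sizes of the two indexes
SELECT pg_size_pretty(pg_relation_size('corr_idx'))      AS btree_size,
       pg_size_pretty(pg_relation_size('corr_brin_idx')) AS brin_size;
</imports>
```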

How to make use of correlation?

If you need to efficiently scan bigger portions of an index, it is good to keep the table in index order.

There are no “index ordered tables” in PostgreSQL.
Still, high correlation for a column can be maintained in two ways:

  1. Automatically:

    If the table rows are inserted in logical column order and there are no updates or deletes on the table, the physical ordering will be identical to the logical ordering. Good examples for that are primary key columns generated by sequences or measurements with a timestamp.

    Since correlation is always perfect in this case, a BRIN index can be an interesting option.

    If you want to remove old data from a table without disrupting the physical ordering, you can use table partitioning.

  2. Clustering:

    The SQL statement CLUSTER can be used to rewrite a table so that the physical ordering is identical to the logical ordering of an index.

    However, subsequent modifications of the table will reduce the correlation again. Because of that, you need to re-cluster the table regularly to maintain high correlation. This is annoying, because CLUSTER blocks all concurrent access to the table.
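For the uncorr table from the example above, a minimal sketch looks like this:

```sql
-- rewrite the table in the order of uncorr_idx
-- (this takes an ACCESS EXCLUSIVE lock for the duration)
CLUSTER uncorr USING uncorr_idx;
ANALYZE uncorr;

-- the correlation of the id column is now 1 again
SELECT correlation FROM pg_stats
WHERE tablename = 'uncorr' AND attname = 'id';
```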

PostgreSQL 11 will shortly be released, and it is therefore time to take a look at one of its most important new features: the ability to create indexes in parallel. Various commercial database vendors have offered this feature for many years, and we are glad that PostgreSQL has joined this elite club. Multi-core index creation will dramatically improve the usability of large database deployments in the future.

Creating large tables in PostgreSQL

Since version 11 PostgreSQL supports classical “stored procedures”. The beauty is that a procedure can run more than one transaction, which is ideal if you want to generate huge amounts of random data. When you call generate_series to generate 1 million rows in a single statement, PostgreSQL has to keep this data in memory. Therefore generating hundreds of millions of random rows in more than one transaction can be really useful to reduce the memory footprint. Here is how it works:

CREATE TABLE t_demo (data numeric);

CREATE OR REPLACE PROCEDURE insert_data(buckets integer)
LANGUAGE plpgsql
AS $$
   DECLARE
      i int := 0;
   BEGIN
      WHILE i < buckets
      LOOP
         INSERT INTO t_demo SELECT random()
            FROM generate_series(1, 1000000);
         i := i + 1;
         RAISE NOTICE 'inserted % buckets', i;
         COMMIT;
      END LOOP;
   END;
$$;

CALL insert_data(500);

This tiny bit of code loads 500 million random numeric values, which should be enough to demonstrate how CREATE INDEX can be improved in PostgreSQL 11. The COMMIT inside the loop ends a transaction after each bucket of 1 million rows. In our example 500 million rows translate to roughly 21 GB of data:

test=# \d+
 List of relations
 Schema | Name   | Type  | Owner | Size  | Description
 public | t_demo | table | hs    | 21 GB |
(1 row)

The reason why I went for numeric is that numeric causes the most overhead of all number data types. Creating a numeric index is a lot more costly than indexing, say, int4 or int8. The goal is to see how much CPU time we can save by building a large index on a fairly expensive field.

CREATE INDEX: Using just 1 CPU core

In PostgreSQL 11 parallel index creation is on by default. The parameter in charge of this is called max_parallel_maintenance_workers, which can be set in postgresql.conf:

test=# SHOW max_parallel_maintenance_workers;
 max_parallel_maintenance_workers 
----------------------------------
 2
(1 row)

The default value here tells PostgreSQL that if the table is sufficiently large, it can launch two workers to help with index creation. To compare a “traditional” way to create the index with the new settings, I have set max_parallel_maintenance_workers to 0. This will ensure that no multicore indexing is available:

test=# SET max_parallel_maintenance_workers TO 0;

The consequence is that indexing will take forever. When running the CREATE INDEX statement we will see a lot of I/O and a lot of CPU. To make things worse I left all memory parameters at their default values, which means that the index creation has to work with only 64 MB of maintenance_work_mem, which is nothing given the size of the table.

Here are the results on my “Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz”:

test=# CREATE INDEX idx1 ON t_demo (data);
Time: 1031650.658 ms (17:11.651)

17 minutes, not too bad. Remember, we are talking about 500 million rows of really nasty data.

Using more than just one core

Let us run the same type of indexing on 2 cores:

test=# SET max_parallel_maintenance_workers TO 2;

test=# CREATE INDEX idx2 ON t_demo (data);
Time: 660672.867 ms (11:00.673)

Wow, we are down to 11 minutes. Of course the speedup is not completely linear, because we have to keep in mind that the partial results have to be merged together and all that. But there is a catch: we set max_parallel_maintenance_workers to 2 and what we saw were 2 additional workers, right? What if we set the value to 4? In my case 4 is the number of physical cores in the machine, so it makes no sense to use any higher value. What you will see is that PostgreSQL still uses only two workers.

How can we change that? The answer can be found in the next listing: ALTER TABLE … SET … allows us to lift this restriction and use more workers:

test=# ALTER TABLE t_demo SET (parallel_workers = 4);

test=# SET max_parallel_maintenance_workers TO 4;

In this case both max_parallel_maintenance_workers and the table's parallel_workers parameter are set to 4. What we will see now is that PostgreSQL utilizes 5 processes: one main process and 4 worker processes helping with index creation. That might not be totally obvious, but it makes sense when you think about it.

Of course we cannot add an infinite number of workers and expect performance to grow linearly. At this stage our (single) SSD will also start to run into performance limitations, and we won't see a twofold increase anymore:

test=# CREATE INDEX idx3 ON t_demo (data);
Time: 534775.040 ms (08:54.775)

Everybody is doing the same thing pretty much at the same time so we will see wild swings in our I/O curve, which naturally makes the entire thing a bit slower and not linear. Still, we managed to speed up our index creation from 17 minutes to close to 9 minutes by simply adding more cores to the system.

Using more memory for CREATE INDEX

CPU cores are not the only limiting factor during index creation. Memory is also of significant importance. By default maintenance_work_mem is set to a really low value (64 MB), which greatly limits the amount of data that can be sorted in memory. Therefore the next logical step is to increase this parameter and set it to a higher value before creating the new index:

test=# SET maintenance_work_mem TO '4 GB';

In my case I decided to pump the value up to 4 GB. My server has 32 GB of memory, and we have to keep in mind that we might not be the only ones creating an index, so 4 GB times 5 processes might already be a really aggressive value in a real-world scenario.

What we will see while creating the index is a lot more parallelism going on in the first phase of the index creation, which is exactly what we expected. You can also see quite clearly that towards the end CPU usage is pretty low and PostgreSQL is waiting on the disk to do its job. The entire system has been set up with default values, so writes have not been optimized yet and are therefore going to be an issue.

However, we will still see a nice improvement:

test=# CREATE INDEX idx4 ON t_demo (data);
Time: 448498.535 ms (07:28.499)

7 minutes and 28 seconds. That is already very nice. But let us see if we can do even better. What we have seen so far is that checkpoints and I/O have started to become a limiting factor. Therefore we will try to improve on that by telling PostgreSQL to use larger checkpoint distances. In this example I have decided to change postgresql.conf to the following values:

checkpoint_timeout = 120min
max_wal_size = 50GB
min_wal_size = 80MB

Those settings can easily be activated by reloading the config file:

test=# SELECT pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)

Let us create a new index using those larger checkpoint distances.

When looking at the process table while building the index, you can notice that PostgreSQL spends quite a lot of time on writing the WAL to disk. As long as we stick to a single SSD there is not much more we can do about it. However, what will happen if we play our next trump card: additional hardware? What if we created all our temporary data on one disk, sent the WAL to the main disk and created the index on a third SSD? This way we could split the amount of I/O needed quite nicely and see what happens.

Using tablespaces in PostgreSQL to speed up indexing

As already stated, adding more hardware by using tablespaces might be a good idea. I am well aware that this might not be possible in a modern cloud environment. However, on my test server I still have the luxury items: a couple of real physical SSD drives.

So let us give them a try and create two tablespaces, which can store the data. On top of that I will tell PostgreSQL to use those tablespaces for sorting and to store the new index:

test=# CREATE TABLESPACE indexspace LOCATION '/ssd1/tabspace1';

test=# CREATE TABLESPACE sortspace LOCATION '/ssd2/tabspace2';

Then we can tell PostgreSQL where to put temporary data:

test=# SET temp_tablespaces TO sortspace;

In the next step the index creation can start:

test=# CREATE INDEX idx6 ON t_demo (data) TABLESPACE indexspace;
Time: 408508.976 ms (06:48.509)

What we see here during the index creation is that our throughput peaks at higher values than before because more than one SSD can work at the same time. Instead of 500 MB / sec peak our throughput goes up to as much as 900 MB / sec at times. The overall speed has improved as well. We are already below 7 minutes, which is really nice.

If you add more hardware to the box, it might be worth considering creating one filesystem using all disks at once. I did not have time to test this option, but I assume that it might yield similar and maybe even better results than what I was able to come up with in this first test.



TIP: Don’t underestimate the importance of the data type in use. If we did the same test using normal integer values, we could create the index in 3 min 51 seconds. In other words: The data type is of significant importance.

In this post you have seen that creating indexes can be improved. However, keep in mind that new indexes are not always beneficial; pointless indexes can even slow things down. To figure out which indexes might not be needed, consider reading a post written by Laurenz Albe, who explains how to tackle this kind of problem.

Many of you out there using PostgreSQL streaming replication might wonder what the hot_standby_feedback parameter in postgresql.conf really does. Support customers keep asking this question, so it might be useful to share this knowledge with a broader audience of PostgreSQL users.

What VACUUM does in PostgreSQL

VACUUM is an essential command in PostgreSQL: its goal is to clean out dead rows, which are not needed by anyone anymore. The idea is to reuse the space inside a table later as new data comes in. The important thing is: the purpose of VACUUM is to reuse space inside a table – this does not necessarily imply that a relation will shrink. Also keep in mind that VACUUM can only clean out dead rows if they are not needed anymore by some other transaction running on your PostgreSQL server.

Consider the following image:

hot_standby_feedback in PostgreSQL
How hot_standby_feedback and VACUUM work together in PostgreSQL

As you can see we have two connections here. The first connection on the left side is running a lengthy SELECT statement. Now keep in mind: An SQL statement will basically “freeze” its view of the data. Within an SQL statement the world does not “change” – the query will always see the same set of data regardless of changes made concurrently. That is really really important to understand.

Let us take a look at the second transaction. It will delete some data and commit. The question that naturally arises is: when can PostgreSQL really remove this row from disk? DELETE itself cannot clean the row from disk, because there might still be a ROLLBACK instead of a COMMIT. In other words, a row must not be removed on DELETE; PostgreSQL can only mark it as dead. As you can see, other transactions might still be able to see those deleted rows.
However, even COMMIT does not have the right to really clean out the row. Remember: the transaction on the left side can still see the dead row, because the SELECT statement does not change its snapshot while it is running. COMMIT is therefore too early to clean out the row.

This is when VACUUM enters the scenario. VACUUM is here to clean rows, which cannot be seen by any other transaction anymore. In my image there are two VACUUM operations going on. The first one cannot clean the dead row yet because it is still seen by the left transaction.
However, the second VACUUM can clean this row because it is not used by the reading transaction anymore.

On a single server the situation is therefore pretty clear. VACUUM can clean out rows, which are not seen anymore.
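A minimal sketch of this behavior (the table name and numbers are made up for illustration):

```sql
CREATE TABLE vacuum_demo AS
   SELECT i AS id FROM generate_series(1, 100000) AS i;

-- mark half of the rows as dead
DELETE FROM vacuum_demo WHERE id <= 50000;

-- VACUUM makes the dead rows reusable, but the table file
-- does not necessarily shrink
VACUUM vacuum_demo;
SELECT pg_size_pretty(pg_relation_size('vacuum_demo'));
```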

Replication conflicts in PostgreSQL

What happens in a master / slave scenario? The situation is slightly more complicated, because how can the master know that some long-running transaction is going on on one of the slaves?

Here is an image showing a typical scenario:

PostgreSQL VACUUM and table bloat
Prevent table bloat with VACUUM in PostgreSQL

In this case a SELECT statement on the replica is running for a couple of minutes. In the meantime a change is made on the master (UPDATE, DELETE, etc.). This is still no problem. Remember: DELETE does not really delete the row – it simply marks it as dead, but it is still visible to other transactions that are allowed to see the “dead” row. The situation becomes critical if a VACUUM on the master is allowed to really remove the row from disk. VACUUM is allowed to do that, because it has no idea that somebody on a slave still needs the row. The result is a replication conflict. By default a replication conflict is resolved after 30 seconds:

ERROR: canceling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed

If you have ever seen a message like that – this is exactly the kind of problem we are talking about here.

hot_standby_feedback can prevent replication conflicts

To solve this kind of problem, we can teach the slave to periodically inform the master about the oldest transaction running on the slave. If the master knows about old transactions on the slave, it can make VACUUM keep rows until the slaves are done.
This is exactly what hot_standby_feedback does. It prevents rows from being deleted too early from a slave’s point of view. The idea is to inform the master about the oldest transaction ID on the slave so that VACUUM can delay its cleanup action for certain rows.
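Enabling the parameter is a one-liner (a sketch; run it on the standby, where the default is off):

```sql
-- on the standby server
ALTER SYSTEM SET hot_standby_feedback = on;
SELECT pg_reload_conf();

-- verify the new setting
SHOW hot_standby_feedback;
```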

The benefit is obvious: hot_standby_feedback will dramatically reduce the number of replication conflicts. However, there are also downsides: Remember, VACUUM will delay its cleanup operations. If the slave never terminates a query, it can lead to table bloat on the master, which can be dangerous in the long run.

Sorting is a very important aspect of PostgreSQL performance tuning. However, tuning sorts is often misunderstood or simply overlooked. So I decided to come up with a PostgreSQL blog showing how sorts can be tuned in PostgreSQL.

Creating sample data

To show how sorting works, I created a couple of million rows first:

test=# CREATE TABLE t_test (x numeric);
test=# INSERT INTO t_test SELECT random()
       FROM generate_series(1, 5000000);
INSERT 0 5000000
test=# ANALYZE ;

What the code does is create a table and load 5 million random values. As you will notice, the data can be loaded within seconds.

Sorting data in PostgreSQL

Let us try to sort the data. For the sake of simplicity I am using the most simplistic statements possible. What you can see is that PostgreSQL has to sort on disk because the data we want to sort does not fit into memory. In this case a bit more than 100 MB of data is moved to disk:

test=# explain analyze SELECT * FROM t_test ORDER BY x;
                                QUERY PLAN
Sort (cost=804270.42..816770.42 rows=5000000 width=11)
     (actual time=4503.484..6475.222 rows=5000000 loops=1)
     Sort Key: x
     Sort Method: external merge Disk: 102896kB
     -> Seq Scan on t_test (cost=0.00..77028.00 rows=5000000 width=11)
        (actual time=0.035..282.427 rows=5000000 loops=1)
Planning time: 0.139 ms
Execution time: 6606.637 ms
(6 rows)

Why does PostgreSQL not simply sort stuff in memory? The reason is the work_mem parameter, which is by default set to 4 MB:

test=# SHOW work_mem;
 work_mem 
----------
 4MB
(1 row)

work_mem tells the server that up to 4 MB can be used per operation (per sort, grouping operation, etc.). If you sort too much data, PostgreSQL has to move the excessive amount of data to disk, which is of course slow.

Fortunately changing work_mem is simple and can even be done at the session level.

Speeding up sorts in PostgreSQL – using more work_mem

Let us change work_mem for our current session and see what happens to our example shown before.

test=# SET work_mem TO '1 GB';

The easiest way to change work_mem on the fly is to use SET. In this case I have set the parameter to 1 GB. Now PostgreSQL has enough RAM to do stuff in memory:

test=# explain analyze SELECT * FROM t_test ORDER BY x;
                             QUERY PLAN
 Sort (cost=633365.42..645865.42 rows=5000000 width=11)
      (actual time=1794.953..2529.398 rows=5000000 loops=1)
      Sort Key: x
      Sort Method: quicksort Memory: 430984kB
      -> Seq Scan on t_test (cost=0.00..77028.00 rows=5000000 width=11)
         (actual time=0.075..296.167 rows=5000000 loops=1)
Planning time: 0.067 ms
Execution time: 2686.635 ms
(6 rows)

The performance impact is incredible. The speed has improved from 6.6 seconds to around 2.7 seconds, which is around 60% less. As you can see, PostgreSQL uses “quicksort” instead of “external merge Disk”. If you want to speed up and tune sorting in PostgreSQL, there is no way of doing that without changing work_mem. The work_mem parameter is THE most important knob you have. The cool thing is that work_mem is not only used to speed up sorts – it will also have a positive impact on aggregations and so on.

Taking care of partial sorts

As of PostgreSQL 10 there are 3 types of sort algorithms in PostgreSQL:

  • external sort Disk
  • quicksort
  • top-N heapsort

“top-N heapsort” is used if you only want a couple of sorted rows. For example: The highest 10 values, the lowest 10 values and so on. “top-N heapsort” is pretty efficient and returns the desired data in almost no time:

test=# explain analyze SELECT * FROM t_test ORDER BY x LIMIT 10;
                               QUERY PLAN
 Limit (cost=185076.20..185076.23 rows=10 width=11)
       (actual time=896.739..896.740 rows=10 loops=1)
        -> Sort (cost=185076.20..197576.20 rows=5000000 width=11)
                (actual time=896.737..896.738 rows=10 loops=1)
           Sort Key: x
           Sort Method: top-N heapsort Memory: 25kB
           -> Seq Scan on t_test (cost=0.00..77028.00 rows=5000000 width=11) 
                                 (actual time=1.154..282.408 rows=5000000 loops=1)
Planning time: 0.087 ms
Execution time: 896.768 ms
(7 rows)

Wow, the query returns in less than one second.

Improving sorting: Consider indexing …

work_mem is ideal to speed up sorts. However, in many cases it can make sense to avoid sorting in the first place. Indexes are a good way to provide the database engine with “sorted input”. In fact: A btree is somewhat similar to a sorted list.

Building indexes (btrees) also requires some sorting. Many years ago PostgreSQL used work_mem to tell the CREATE INDEX command how much memory to use for index creation. This is not the case anymore: in modern versions of PostgreSQL, the maintenance_work_mem parameter tells DDLs how much memory to use.

Here is an example:

test=# \timing
Timing is on.
test=# CREATE INDEX idx_x ON t_test (x);
Time: 4648.530 ms (00:04.649)

The default setting for maintenance_work_mem is 64 MB, but this can of course be changed:

test=# SET maintenance_work_mem TO '1 GB';
Time: 0.469 ms

The index creation will be considerably faster with more memory:

test=# CREATE INDEX idx_x2 ON t_test (x);
Time: 3083.661 ms (00:03.084)

In this case CREATE INDEX can use up to 1 GB of RAM to sort the data, which is of course a lot faster than going to disk. This is especially useful if you want to create large indexes.

The query will be a lot faster if you have proper indexes in place. Here is an example:

test=# explain analyze SELECT * FROM t_test ORDER BY x LIMIT 10;
                                  QUERY PLAN
Limit (cost=0.43..0.95 rows=10 width=11)
      (actual time=0.068..0.087 rows=10 loops=1)
      -> Index Only Scan using idx_x2 on t_test
               (cost=0.43..260132.21 rows=5000000 width=11)
               (actual time=0.066..0.084 rows=10 loops=1)
               Heap Fetches: 10
Planning time: 0.130 ms
Execution time: 0.119 ms
(5 rows)

In my example the query needs way less than a millisecond. If your database happens to sort a lot of data all the time, consider using better indexes to speed things up rather than pumping work_mem to higher and higher values.

Sorting in PostgreSQL and tablespaces

Many people out there are using tablespaces to scale I/O. By default PostgreSQL only uses a single tablespace, which can easily turn into a bottleneck. Tablespaces are a good way to provide PostgreSQL with more hardware.

Let us assume you have to sort a lot of data repeatedly: temp_tablespaces is a parameter that allows administrators to control the location of temporary files sent to disk. Using a separate tablespace for temporary files can also help to speed up sorting.

If you are not sure how to configure work_mem, consider checking out – it is an easy tool helping people to configure PostgreSQL.

When digging into PostgreSQL performance it is always good to know which options one has to spot performance problems and to figure out what is really going on on a server. Finding slow queries and performance weak spots is therefore exactly what this post is all about.

There are many ways to approach performance problems. However, three methods have proven to be really useful to quickly assess a problem. Here are my top three suggestions to handle bad performance:

  • Make use of the slow query log
  • Checking execution plans with auto_explain
  • Relying on aggregate information in pg_stat_statements

Each method has its own advantages and disadvantages, which will be discussed in this document.

Making use of the PostgreSQL slow query log

A more traditional way to attack slow queries is to make use of PostgreSQL’s slow query log. The idea is: If a query takes longer than a certain amount of time, a line will be sent to the log. This way slow queries can easily be spotted so that developers and administrators can quickly react and know where to look.

In a default configuration the slow query log is not active, so it is necessary to turn it on. You have various choices: if you want to turn the slow query log on globally, you can change postgresql.conf:

log_min_duration_statement = 5000

If you set log_min_duration_statement in postgresql.conf to 5000, PostgreSQL will consider queries that take longer than 5 seconds to be slow queries and send them to the logfile. If you change this line in postgresql.conf, there is no need for a server restart. A “reload” will be enough:

postgres=# SELECT pg_reload_conf();
 pg_reload_conf 
----------------
 t
(1 row)

You can do that using an init script or simply by calling the SQL function shown above.

If you change postgresql.conf the change will be done for the entire instance, which might be too much. In many cases you want to be a lot more precise. Therefore it can make sense to make the change only for a certain user or for a certain database:

postgres=# ALTER DATABASE test SET log_min_duration_statement = 5000;

ALTER DATABASE allows you to change the configuration parameter for a single database. Let us reconnect and run a slow query:

postgres=# \c test
You are now connected to database "test" as user "hs".
test=# SELECT pg_sleep(10);
 pg_sleep 
----------
 
(1 row)

In my example I am using pg_sleep to just make the system wait for 10 seconds. When inspecting the logfile, we will already see the desired entry:

2018-08-20 08:19:28.151 CEST [22845] LOG: duration: 10010.353 ms statement: SELECT pg_sleep(10);

One can now take the statement and analyze why it is slow. A good way to do that is to run “explain analyze”, which will run the statement and provide you with an execution plan.

The advantage of the slow query log is that you can instantly inspect a slow query. Whenever something is slow, you can respond instantly to any individual query that exceeds the desired threshold. However, the strength of this approach is also its main weakness. The slow query log tracks single queries. But what if bad performance is caused by a ton of not quite so slow queries? We can all agree that 10 seconds can be seen as an expensive query. But what if we are running 1 million queries, which take 500 milliseconds each? All those queries will never show up in the slow query log because they are still considered to be “fast”. What you might find in the log instead are backups, CREATE INDEX, bulk loads and so on. You might never find the root cause if you only rely on the slow query log. Its purpose is to track down individual slow statements.
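The “many medium-fast queries” case is where the aggregate statistics mentioned above come in. As a sketch (pg_stat_statements must be listed in shared_preload_libraries first; the column is called total_time up to PostgreSQL 12):

```sql
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- which statements consume the most total execution time?
SELECT query, calls, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```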

Checking unstable execution plans

The same applies to our next method. Sometimes your database is just fine but once in a while a query goes crazy. The goal is now to find those queries and fix them. One way to do that is to make use of the auto_explain module.

The idea is similar to what the slow query log does: Whenever something is slow, create log entries. In case of auto_explain you will find the complete execution plan in the logfile – not just the query. Why does it matter? Consider the following example:

test=# CREATE TABLE t_demo AS
          SELECT * FROM generate_series(1, 10000000) AS id;
SELECT 10000000
test=# CREATE INDEX idx_id ON t_demo (id);
test=# ANALYZE;

The table I have just created contains 10 million rows. In addition to that an index has been defined. Let us take a look at two almost identical queries:

test=# explain SELECT * FROM t_demo WHERE id < 10;
                                    QUERY PLAN
Index Only Scan using idx_id on t_demo (cost=0.43..8.61 rows=10 width=4)
    Index Cond: (id < 10)
(2 rows)

test=# explain SELECT * FROM t_demo WHERE id < 1000000000;
                                    QUERY PLAN
 Seq Scan on t_demo (cost=0.00..169248.60 rows=10000048 width=4)
     Filter: (id < 1000000000)
 JIT:
     Functions: 2
     Inlining: false
     Optimization: false
(6 rows)

The queries are basically the same, but PostgreSQL will use totally different execution plans. The first query will only fetch a handful of rows and therefore go for an index scan. The second query will fetch all the data and therefore prefer a sequential scan. Although the queries appear to be similar, the runtime will be totally different. The first query will execute in a millisecond or so, while the second query might very well take up to half a second or even a second (depending on hardware, load, caching and all that). The trouble now is: a million queries might be fast because the parameters are suitable. However, in some rare cases somebody might want something which leads to a bad plan or simply returns a lot of data.

Finding a query which takes too long for whatever reason is exactly what auto_explain is for. Here is the idea: if a query exceeds a certain threshold, PostgreSQL can send the plan to the logfile for later inspection.

Here is an example:

test=# LOAD 'auto_explain';
test=# SET auto_explain.log_analyze TO on;
test=# SET auto_explain.log_min_duration TO 500;

The LOAD command will load the auto_explain module into a database connection. For the demo we can do that easily. In a production system one would use postgresql.conf or ALTER DATABASE / ALTER USER to load the module. If you want to make the change in postgresql.conf, consider adding the following line to the config file:

session_preload_libraries = 'auto_explain';

session_preload_libraries will ensure that the module is loaded into every database connection by default. There is no need for the LOAD command anymore. Once the change has been made to the configuration (don’t forget to call pg_reload_conf() ) you can try to run the following query:

test=# SELECT count(*) FROM t_demo GROUP BY id % 2;
  count
---------
 5000000
 5000000
(2 rows)

The query will need more than 500ms and therefore show up in the logfile as expected:

2018-08-20 09:51:59.056 CEST [23256] LOG: duration: 4280.595 ms plan:
      Query Text: SELECT count(*) FROM t_demo GROUP BY id % 2;
      GroupAggregate (cost=1605370.36..1805371.32 rows=10000048 width=12)
            (actual time=3667.207..4280.582 rows=2 loops=1)
       Group Key: ((id % 2))
       -> Sort (cost=1605370.36..1630370.48 rows=10000048 width=4)
               (actual time=3057.351..3866.446 rows=10000000 loops=1)
          Sort Key: ((id % 2))
          Sort Method: external merge Disk: 137000kB
          -> Seq Scan on t_demo (cost=0.00..169248.60 rows=10000048 width=4)
                  (actual time=65.470..876.695 rows=10000000 loops=1)

As you can see a full “explain analyze” will be sent to the logfile.

The advantage of this approach is that you can take a deep look at certain slow queries and see when a query decides on a bad plan. However, it is still hard to gather overall information, because there might be millions of queries running on your system.
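If you want auto_explain to be active permanently, a postgresql.conf sketch might look like this (the 500 ms threshold is an assumption; adjust it to your workload):

```
# load auto_explain into every connection
session_preload_libraries = 'auto_explain'
# log plans of statements running longer than 500 ms
auto_explain.log_min_duration = '500ms'
# include actual runtimes (this executes the statement fully)
auto_explain.log_analyze = on
```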

Checking pg_stat_statements

The third method is to use pg_stat_statements. The idea behind pg_stat_statements is to group identical queries that are just used with different parameters, and to aggregate runtime information in a system view.

In my personal judgement, pg_stat_statements is really like a Swiss army knife. It allows you to understand what is really going on in your system. To enable pg_stat_statements, add the following line to postgresql.conf and restart your server:

shared_preload_libraries = 'pg_stat_statements'

Then run “CREATE EXTENSION pg_stat_statements” in your database. PostgreSQL will create a view for you:

test=# CREATE EXTENSION pg_stat_statements;
test=# \d pg_stat_statements
View "public.pg_stat_statements"
       Column        |       Type       | Collation | Nullable | Default
userid               | oid              |           |          |
dbid                 | oid              |           |          |
queryid              | bigint           |           |          |
query                | text             |           |          |
calls                | bigint           |           |          |
total_time           | double precision |           |          |
min_time             | double precision |           |          |
max_time             | double precision |           |          |
mean_time            | double precision |           |          |
stddev_time          | double precision |           |          |
rows                 | bigint           |           |          |
shared_blks_hit      | bigint           |           |          |
shared_blks_read     | bigint           |           |          |
shared_blks_dirtied  | bigint           |           |          |
shared_blks_written  | bigint           |           |          |
local_blks_hit       | bigint           |           |          |
local_blks_read      | bigint           |           |          |
local_blks_dirtied   | bigint           |           |          |
local_blks_written   | bigint           |           |          |
temp_blks_read       | bigint           |           |          |
temp_blks_written    | bigint           |           |          |
blk_read_time        | double precision |           |          |
blk_write_time       | double precision |           |          |

The view will tell us which kind of query has been executed how often, its total runtime, and the distribution of runtimes for those particular queries. The data presented by pg_stat_statements can then be analyzed. Some time ago I wrote a blog post about this issue, which can be found on our website.
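To get a first impression, a query like the following can be used. It is only a sketch; the columns are taken from the view definition shown above:

```sql
-- The ten most expensive types of queries by total runtime,
-- along with how often they ran and the average duration.
SELECT query, calls, total_time, mean_time, rows
FROM   pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;
```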

The advantage of this module is that you will even be able to find millions of fairly fast queries, which can be the reason for high load. In addition to that pg_stat_statements will tell you about the I/O behavior of various types of queries. The downside is that it can be fairly hard to track down individual slow queries, which are usually fast but sometimes slow. On top of that pg_stat_statements does not contain parameters. One of my workmates (Julian Markwort) is working on a patch to fix this issue for (most likely) PostgreSQL 12.


Tracking down slow queries and bottlenecks in PostgreSQL is easy, assuming that you know which technique to use when. This post should simply give you a fast overview of what is possible and what can be done to track down performance issues.

Triggers with transition tables rule!
© Laurenz Albe 2018


Inspired by my co-worker’s recent blog post, I decided to revisit the old question of rules vs. triggers and run a little benchmark to see which one does better.

About rules

While triggers are well known to most application developers and database administrators, rules are less well known. The full name “query rewrite rule” explains what they are doing: Before the query is optimized, a rule can either replace the query with a different one or add additional queries. These are then planned and executed instead of or together with the original query.

While rules are very powerful, they are also tricky to get right, particularly when data modification is involved. It is usually recommended to use the more modern triggers, and there have been attempts to deprecate rules. But since rules are used to implement views, they are probably here to stay.

Moreover, they are said to be faster than triggers when many rows are affected. That’s what I tried to measure.

The test case

We have a simple table:

CREATE UNLOGGED TABLE upd (
   id  integer NOT NULL,
   val text    NOT NULL
) WITH (autovacuum_enabled = off);

Since we are only interested in the speed of processing, I decided to use an unlogged table to bypass disk I/O as much as possible. Autovacuum is disabled to avoid interference from it.

Every update to this table should be logged to a second table:

CREATE UNLOGGED TABLE log (
   id       integer     NOT NULL,
   log_time timestamptz NOT NULL
            DEFAULT current_timestamp,
   new_val  text        NOT NULL
) WITH (autovacuum_enabled = off);

The test will repeatedly perform the following in a loop:


INSERT INTO upd (id, val)
   SELECT i, 'text number ' || i
   FROM generate_series(1, 100000) i;


UPDATE upd SET val = val;


I’ll perform two baseline measurements: one without the UPDATE (called “zero”), and one with just the UPDATE, but no logging (called “base”). This will enable me to measure the net time spent in the UPDATE by subtraction so I can calculate how much performance is slowed down by the different logging techniques.

logging with a rule

For this measurement, which I call “rule”, I’ll use the following rule to log updates:

CREATE RULE upd_rule AS ON UPDATE TO upd
   DO ALSO INSERT INTO log (id, new_val)
           VALUES (NEW.id, NEW.val);

logging with a row level trigger

For this measurement, which I call “trig_row”, I’ll use the following trigger to log updates:

CREATE FUNCTION upd_row_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   INSERT INTO log (id, new_val)
   VALUES (NEW.id, NEW.val);
   RETURN NEW;
END;$$;

CREATE TRIGGER upd_row_trig
   AFTER UPDATE ON upd FOR EACH ROW
   EXECUTE PROCEDURE upd_row_trig();

logging with a statement level trigger

For this measurement, which I call “trig_stmt”, I’ll use the following trigger to log updates:

CREATE FUNCTION upd_stmt_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   INSERT INTO log (id, new_val)
   SELECT id, val FROM newrows;
   RETURN NULL;
END;$$;

CREATE TRIGGER upd_stmt_trig
   AFTER UPDATE ON upd
   REFERENCING NEW TABLE AS newrows
   FOR EACH STATEMENT
   EXECUTE PROCEDURE upd_stmt_trig();

This uses transition tables which are new in PostgreSQL v10. The REFERENCING clause will make all rows modified by the statement available in a “temporary” table newrows.

Test environment

All commands are run from a shell script on my Linux laptop using psql -c "...". The script loops through the statements twenty times, and the elapsed time is measured with time.

Each test is executed three times to ensure that the execution time does not vary, then the middle value is used.

Test results

The zero measurements have been subtracted and the result divided by twenty to obtain the following numbers:

Performance of rule and triggers for logging bulk updates
                    | baseline | rule   | row level trigger | statement level trigger
execution time      | 295 ms   | 454 ms | 3816 ms           | 642 ms
overhead percentage | 0 %      | 54 %   | 1193 %            | 117 %


The test confirms that rules perform better than anything else when it comes to logging bulk updates. However, while logging with row level triggers caused a slowdown by a forbidding factor of 13, statement level triggers with transition tables only doubled the execution time.

So with PostgreSQL v10 we have a good alternative that performs half as well as rules, and there is no need to use rules unless you need cutting edge performance.

When people talk about database performance monitoring, they usually think of inspecting one PostgreSQL database server at a time. While this is certainly useful, it can also be quite beneficial to inspect the status of an entire database cluster or a set of servers working together at once. Fortunately, there are easy means to achieve that with PostgreSQL; this post outlines how.

pg_stat_statements: The best tool to monitor PostgreSQL performance

If you want to take a deep look at PostgreSQL performance, there is really no way around pg_stat_statements. It offers a lot of information and is really easy to use.

To install pg_stat_statements, the following steps are necessary:

  • run “CREATE EXTENSION pg_stat_statements” in your desired database
  • add the following line to postgresql.conf:
    • shared_preload_libraries = 'pg_stat_statements'
  • restart PostgreSQL

Once this is done, PostgreSQL will already be busy collecting data on your database hosts. However, how can we create a “clusterwide pg_stat_statements” view so that we can inspect an entire set of servers at once?

Using pg_stat_statements to check an entire database cluster

Our goal is to show data from a list of servers in a single view. One way to do that is to make use of PostgreSQL’s foreign data wrapper infrastructure. We can simply connect to all servers in the cluster and unify the data in a single view.

Let us assume we have 3 servers, a local machine, “a_server”, and “b_server”. Let us get started by connecting to the local server to run the following commands:

CREATE USER dbmonitoring;
GRANT USAGE ON SCHEMA pg_catalog TO dbmonitoring;
GRANT ALL ON pg_stat_statements TO dbmonitoring;

In the first step I created a simple user to do the database monitoring. Of course you can handle users and so on differently but it seems like an attractive idea to use a special user for that purpose.

The next command enables the postgres_fdw extension, which is necessary to connect to those remote servers we want to access:

CREATE EXTENSION postgres_fdw;

Then we can already create “foreign servers”. Here is how those servers can be created:

CREATE SERVER pg1 FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host 'a_server', dbname 'a');

CREATE SERVER pg2 FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host 'b_server', dbname 'b');

Just replace the hostnames and the database names with your data and run those commands. The next step is already about user mapping: It might easily happen that local users are not present on the other side so it is necessary to create some sort of mapping between local and remote users:

CREATE USER MAPPING FOR dbmonitoring
       SERVER pg1
       OPTIONS (user 'postgres', password 'abcd');

CREATE USER MAPPING FOR dbmonitoring
       SERVER pg2
       OPTIONS (user 'postgres', password 'abcd');

In this case we will login as user “postgres”. Now that two servers and the user mappings are ready, we can import the remote schema into a local schema:

CREATE SCHEMA monitoring_a;
IMPORT FOREIGN SCHEMA public
       LIMIT TO (pg_stat_statements)
       FROM SERVER pg1
       INTO monitoring_a;

CREATE SCHEMA monitoring_b;
IMPORT FOREIGN SCHEMA public
       LIMIT TO (pg_stat_statements)
       FROM SERVER pg2
       INTO monitoring_b;

There will be a separate schema for each server. This makes it very easy to drop things again and to handle various incarnations of the same data structure.

Wiring things together

The last thing to do in our main database, is to connect those remote tables with our local data. The easiest way to achieve that is to use a simple view:

CREATE VIEW monitoring_performance AS
SELECT 'localhost'::text AS node, *
FROM pg_stat_statements
UNION ALL
SELECT 'server a'::text AS node, *
FROM monitoring_a.pg_stat_statements
UNION ALL
SELECT 'server b'::text AS node, *
FROM monitoring_b.pg_stat_statements;

The view will simply unify all the data and add an additional column at the beginning.

PostgreSQL performance monitoring for clusters

Our system is now ready to use and we can already start to run useful analysis:

SELECT *,
       sum(total_time) OVER () AS cluster_total_time,
       sum(total_time) OVER (PARTITION BY node) AS node_total_time,
       round((100 * total_time / sum(total_time) OVER ())::numeric, 4) AS percentage_total,
       round((100 * total_time / sum(total_time) OVER (PARTITION BY node))::numeric, 4) AS percentage_node
FROM   monitoring_performance;

The query will return all the raw data and add some percentage numbers on top of this data.
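Building on the same view, a sketch of a per-node drill-down could look like this:

```sql
-- The single most expensive statement on each node
-- (DISTINCT ON keeps the first row per node after sorting).
SELECT DISTINCT ON (node) node, query, total_time
FROM   monitoring_performance
ORDER BY node, total_time DESC;
```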

If you are interested in further information on pg_stat_statements, consider reading the following blog post too:

A bad query plan ...
© Laurenz Albe 2018


We all know that you have to pay a price for a new index you create — data modifying operations will become slower, and indexes use disk space. That’s why you try to have no more indexes than you actually need.

But most people think that SELECT performance will never suffer from a new index. The worst that can happen is that the new index is not used.

However, this is not always true, as I have seen more than once in the field. I’ll show you such a case and tell you what you can do about it.

An example

We will experiment with this table:

CREATE TABLE skewed (
   sort        integer NOT NULL,
   category    integer NOT NULL,
   interesting boolean NOT NULL
);

INSERT INTO skewed
   SELECT i, i%1000, i>50000
   FROM generate_series(1, 1000000) i;

CREATE INDEX skewed_category_idx ON skewed (category);


We want to find the first twenty interesting rows in category 42:

SELECT * FROM skewed
WHERE interesting AND category = 42
ORDER BY sort
LIMIT 20;
This performs fine:

                             QUERY PLAN
 Limit  (cost=2528.75..2528.80 rows=20 width=9)
        (actual time=4.548..4.558 rows=20 loops=1)
   Buffers: shared hit=1000 read=6
   ->  Sort  (cost=2528.75..2531.05 rows=919 width=9)
             (actual time=4.545..4.549 rows=20 loops=1)
         Sort Key: sort
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=1000 read=6
         ->  Bitmap Heap Scan on skewed
                        (cost=19.91..2504.30 rows=919 width=9)
                        (actual time=0.685..4.108 rows=950 loops=1)
               Recheck Cond: (category = 42)
               Filter: interesting
               Rows Removed by Filter: 50
               Heap Blocks: exact=1000
               Buffers: shared hit=1000 read=6
               ->  Bitmap Index Scan on skewed_category_idx
                        (cost=0.00..19.68 rows=967 width=0)
                        (actual time=0.368..0.368 rows=1000 loops=1)
                     Index Cond: (category = 42)
                     Buffers: shared read=6
 Planning time: 0.371 ms
 Execution time: 4.625 ms

PostgreSQL uses the index to find the 1000 rows with category 42, filters out the ones that are not interesting, sorts them and returns the top 20. 5 milliseconds is fine.

A new index makes things go sour

Now we add an index that can help us with sorting. That is definitely interesting if we often have to find the top 20 results:

CREATE INDEX skewed_sort_idx ON skewed (sort);

And suddenly, things are looking worse:

                          QUERY PLAN
 Limit  (cost=0.42..736.34 rows=20 width=9)
        (actual time=21.658..28.568 rows=20 loops=1)
   Buffers: shared hit=374 read=191
   ->  Index Scan using skewed_sort_idx on skewed
                (cost=0.42..33889.43 rows=921 width=9)
                (actual time=21.655..28.555 rows=20 loops=1)
         Filter: (interesting AND (category = 42))
         Rows Removed by Filter: 69022
         Buffers: shared hit=374 read=191
 Planning time: 0.507 ms
 Execution time: 28.632 ms

What happened?

PostgreSQL thinks that it will be faster if it examines the rows in sort order using the index until it has found 20 matches. But it doesn’t know how the matching rows are distributed with respect to the sort order, so it is not aware that it will have to scan 69042 rows until it has found its 20 matches (see Rows Removed by Filter: 69022 in the above execution plan).

What can we do to get the better plan?

PostgreSQL v10 has added extended statistics to track how the values in different columns are correlated, but that does not track the distribution of the values, so it will not help us here.
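For illustration, this is roughly what such extended statistics would look like. The statistics name is made up for this sketch, and as explained above, it will not change the plan for our query:

```sql
-- Functional dependencies between "category" and "interesting"
-- are tracked, but not where matching rows lie in the sort order.
CREATE STATISTICS skewed_stats (dependencies)
   ON category, interesting FROM skewed;
ANALYZE skewed;
```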

There are two workarounds:

  1. Drop the index that misleads PostgreSQL. If that is possible, it is a simple solution. But usually one cannot do that, because the index is either used to enforce a unique constraint, or it is needed by other queries that benefit from it.
  2. Rewrite the query so that PostgreSQL cannot use the offending index. Of the many possible solutions for this, I want to present two:
    • A subquery with OFFSET 0:
      SELECT *
      FROM (SELECT * FROM skewed
            WHERE interesting AND category = 42
            OFFSET 0) q
      ORDER BY sort
      LIMIT 20;

      This makes use of the fact that OFFSET and LIMIT prevent a subquery from being “flattened”, even if they have no effect on the query result.

    • Using an expression as sort key:
      SELECT * FROM skewed
      WHERE interesting AND category = 42
      ORDER BY sort + 0
      LIMIT 20;

      This makes use of the fact that PostgreSQL cannot deduce that sort + 0 is the same as sort. Remember that PostgreSQL is extensible, and you can define your own + operator!

I have already written about timeseries and PostgreSQL in the past. However, recently I stumbled across an interesting problem, which caught my attention: Sometimes you might want to find “periods” of activity in a timeseries. For example: When was a user active? Or when did we receive data? This blog post tries to give you some ideas and shows how you can approach this kind of problem.

Loading timeseries data into PostgreSQL

The next listing shows a little bit of sample data, which I used to write the SQL code you are about to see:

CREATE TABLE t_series (t date, data int);

COPY t_series FROM stdin DELIMITER ';';
2018-03-01;12
2018-03-02;43
2018-03-03;9
2018-03-04;13
2018-03-09;23
2018-03-10;26
2018-03-11;28
2018-03-14;21
2018-03-15;15
\.

For the sake of simplicity I just used two columns in my example. Note that my timeseries is not continuous but interrupted. There are three continuous periods in this set of data. Our goal is to find and isolate them to do analysis on each of those continuous periods.

PostgreSQL time series

Preparing for timeseries analysis

When dealing with timeseries one of the most important things to learn is how to “look forward and backward”. In most cases it is simply vital to compare the current line with the previous line. To do that in PostgreSQL (or in SQL in general) you can make use of the “lag” function:

test=# SELECT *, lag(t, 1) OVER (ORDER BY t)
       FROM t_series;
          t | data | lag
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 2018-03-01
 2018-03-03 |    9 | 2018-03-02
 2018-03-04 |   13 | 2018-03-03
 2018-03-09 |   23 | 2018-03-04
 2018-03-10 |   26 | 2018-03-09
 2018-03-11 |   28 | 2018-03-10
 2018-03-14 |   21 | 2018-03-11
 2018-03-15 |   15 | 2018-03-14
(9 rows)

As you can see the last column contains the date of the previous row. Now: How does PostgreSQL know what the previous row actually is? The “ORDER BY”-clause will define exactly that.

Based on the query you have just seen, it will be easy to calculate the size of the gap from one row to the next:

test=# SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
       FROM t_series;
          t | data | diff 
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 1
 2018-03-03 |    9 | 1
 2018-03-04 |   13 | 1
 2018-03-09 |   23 | 5
 2018-03-10 |   26 | 1
 2018-03-11 |   28 | 1
 2018-03-14 |   21 | 3
 2018-03-15 |   15 | 1
(9 rows)

What we see now is the difference from one row to the next. That is pretty useful, because we can start to create our rules: when do we consider a segment to be over, and how long a gap do we allow before we consider the next segment / period to have started?

In my example I decided that every gap longer than 2 days should trigger the creation of a new segment (or period). The next challenge is therefore to assign numbers to the periods we are about to detect. Once this is done, we can easily aggregate on the result. I decided to do this using the sum function. Remember: when NULL is fed to an aggregate, the aggregate ignores the input; otherwise it simply adds up the input.

Here is the query:

test=# SELECT *, sum(CASE WHEN diff IS NULL 
                     OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
       FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
             FROM   t_series
       ) AS x;
          t | data | diff | period 
 2018-03-01 |   12 |      | 1
 2018-03-02 |   43 |    1 | 1
 2018-03-03 |    9 |    1 | 1
 2018-03-04 |   13 |    1 | 1
 2018-03-09 |   23 |    5 | 2
 2018-03-10 |   26 |    1 | 2
 2018-03-11 |   28 |    1 | 2
 2018-03-14 |   21 |    3 | 3
 2018-03-15 |   15 |    1 | 3
(9 rows)

As you can see the last column contains the period ID as generated by the sum function in our query. From now on analysis will be pretty simple as we can simply aggregate over this result using a simple subselect as shown in the next statement:

test=# SELECT period, sum(data) 
       FROM (SELECT *, sum(CASE WHEN diff IS NULL 
                    OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
             FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
                   FROM t_series
                  ) AS x
       ) AS y
GROUP BY period 
ORDER BY period;
 period | sum 
      1 | 77
      2 | 77
      3 | 36
(3 rows)

The result displays the sum of all data for each period. Of course you can also do more complicated things; the important point is to understand how you can detect various periods of continuous activity.
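For example, a sketch of a query showing the boundaries of each detected period, built on the same subquery:

```sql
-- Start, end and number of data points of each continuous period.
SELECT period,
       min(t) AS period_start,
       max(t) AS period_end,
       count(*) AS days_with_data
FROM (SELECT *, sum(CASE WHEN diff IS NULL
                    OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
      FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
            FROM t_series
           ) AS x
     ) AS y
GROUP BY period
ORDER BY period;
```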

In PostgreSQL, a view is a virtual table based on an SQL statement. It is an abstraction layer that allows you to access the result of a more complex SQL statement quickly and easily. The fields in a view are fields from one or more real tables in the database. The question many people ask is: if a view is based on a table, what happens if the data structure of the underlying table changes?


To show what PostgreSQL will do, I created a simple table:

view_demo=# CREATE TABLE t_product (
        id         serial,
        name       text,
        price      numeric(16, 4)
);
My table has just three simple columns and does not contain anything special. Here is the layout of the table:

view_demo=# \d t_product
  Table "public.t_product"
 Column |     Type      | Collation | Nullable | Default
 id     | integer       |           | not null | nextval('t_product_id_seq'::regclass)
 name   | text          |           |          |
 price  | numeric(16,4) |           |          |


Making changes to tables and views

The first thing to do in order to get our demo going is to create a view:

view_demo=# CREATE VIEW v AS SELECT * FROM t_product;

The important thing here to see is how PostgreSQL handles the view. In the following listing you can see that the view definition does not contain a “*” anymore. PostgreSQL has silently replaced the “*” with the actual column list. Note that this is an important thing because it will have serious implications:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
 SELECT t_product.id,
    t_product.name,
    t_product.price
   FROM t_product;

What happens if we simply try to rename the table the view is based on:

view_demo=# ALTER TABLE t_product RENAME TO t_cool_product;

view_demo=# \d+ v
View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
 SELECT t_cool_product.id,
    t_cool_product.name,
    t_cool_product.price
   FROM t_cool_product;

As you can see, the view has changed as well. The reason for that is simple: PostgreSQL does not store the view as a string. Instead, it keeps a binary copy of the definition around, which is largely based on object ids. The beauty is that if the name of a table or a column changes, those objects still have the same object id, and therefore there is no problem for the view. The view will not break, become invalid or face deletion.

The same happens when you change the name of a column:

view_demo=# ALTER TABLE t_cool_product
RENAME COLUMN price TO produce_price;

Again the view will not be harmed:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
 SELECT t_cool_product.id,
    t_cool_product.name,
    t_cool_product.produce_price AS price
   FROM t_cool_product;

What is really important and noteworthy here is that the view does not change its output. The columns provided by the view stay the same. In other words: applications relying on the view won't break just because some other column has changed somewhere.

What PostgreSQL does behind the scenes

Behind the scenes a view is handled by the rewrite system. In the system catalog there is a table called pg_rewrite, which will store a binary representation of the view:

view_demo=# \d pg_rewrite
  Table "pg_catalog.pg_rewrite"
 Column     | Type         | Collation | Nullable | Default
 rulename   | name         |           | not null |
 ev_class   | oid          |           | not null |
 ev_type    | "char"       |           | not null |
 ev_enabled | "char"       |           | not null |
 is_instead | boolean      |           | not null |
 ev_qual    | pg_node_tree |           |          |
 ev_action  | pg_node_tree |           |          |
Indexes:
  "pg_rewrite_oid_index" UNIQUE, btree (oid)
  "pg_rewrite_rel_rulename_index" UNIQUE, btree (ev_class, rulename)

Basically this is an internal thing. However, I decided to show how it works behind the scenes, as it might be interesting to know.

Views and dropping columns

However, in some cases PostgreSQL has to error out. Suppose somebody wants to drop a column on which a view depends. In this case PostgreSQL has to error out, because it cannot silently delete the column from the view.

view_demo=# ALTER TABLE t_cool_product DROP COLUMN name;
ERROR: cannot drop table t_cool_product column name because other objects depend on it
DETAIL: view v depends on table t_cool_product column name
HINT: Use DROP ... CASCADE to drop the dependent objects too.

In this case PostgreSQL complains that the view cannot be kept around because columns are missing. You can now decide whether to not drop the column or whether to drop the view along with the column.
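If you do decide that the view should go, a sketch of the CASCADE variant looks like this:

```sql
-- Drops the column and, because of CASCADE, the dependent view v.
ALTER TABLE t_cool_product DROP COLUMN name CASCADE;
```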

For those of you working with PostgreSQL in a professional way, migrating from Oracle to PostgreSQL might be one of the most beloved tasks available. One of the first things most people will notice, however, is that the data types available in Oracle might not be quite the same in PostgreSQL. This blog will try to shed some light and show how things work.

Data types in Oracle and PostgreSQL

While there are many similarities between Oracle and PostgreSQL, there are a couple of differences which are quite noteworthy. The first thing many people might notice is: PostgreSQL has many more data types than Oracle. As of version 10.0, an empty PostgreSQL database will expose the staggering number of 92 data types. Of course not all of them are useful, and many of them exist purely for internal purposes.

Still: at the end of the day there are simply more data types available to applications, which is of course an advantage.

Let us take a look and see which types can be matched. The following table contains a lot of potential options:


Oracle type                    | Possible PostgreSQL types
CHAR                           | char, varchar, text
NCHAR                          | char, varchar, text
VARCHAR                        | char, varchar, text
VARCHAR2                       | char, varchar, text, json
NVARCHAR2                      | char, varchar, text
CLOB                           | char, varchar, text, json
LONG                           | char, varchar, text
RAW                            | uuid, bytea
BFILE                          | bytea (read-only)
NUMBER                         | numeric, float4, float8, char, varchar, text
NUMBER(n,m) with m<=0          | numeric, float4, float8, int2, int4, int8, boolean, char, varchar, text
FLOAT                          | numeric, float4, float8, char, varchar, text
BINARY_FLOAT                   | numeric, float4, float8, char, varchar, text
BINARY_DOUBLE                  | numeric, float4, float8, char, varchar, text
DATE                           | date, timestamp, timestamptz, char, varchar, text
TIMESTAMP                      | date, timestamp, timestamptz, char, varchar, text
TIMESTAMP WITH TIME ZONE       | date, timestamp, timestamptz, char, varchar, text
TIMESTAMP WITH LOCAL TIME ZONE | date, timestamp, timestamptz, char, varchar, text
INTERVAL YEAR TO MONTH         | interval, char, varchar, text
INTERVAL DAY TO SECOND         | interval, char, varchar, text
MDSYS.SDO_GEOMETRY             | geometry (see “PostGIS support” below)

As you can see, every type can be represented as text or varchar. However, that is of course not desirable. There is more: in Oracle there is no such thing as “integer”. Everything is represented as “number”. For example, in Oracle you might see “number(5, 0)”. The way to map this is to use “integer” or “bigint” on the PostgreSQL side. In general, “CPU data types” such as integer and bigint are much faster than “numeric”.

A second important thing is “varchar2”: on the PostgreSQL side it can easily be mapped to varchar or text. A small detail is that in Oracle a varchar2 length can denote either a number of bytes or a number of characters. Consider the following example: VARCHAR2(20 BYTE) vs. VARCHAR2(10 CHAR). You have to think twice about what to do on the PostgreSQL side, because in Postgres we are always talking about characters.
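Putting the two rules above together, a hand-written mapping could look like the following sketch. Table and column names are invented for illustration; the Oracle definitions are shown as comments:

```sql
-- Oracle:  CREATE TABLE t (id NUMBER(5,0), price NUMBER(10,2),
--                          code VARCHAR2(10 CHAR), note VARCHAR2(200 BYTE));
CREATE TABLE t (
    id    integer,        -- NUMBER(5,0) fits a native integer; faster than numeric
    price numeric(10,2),  -- exact decimal arithmetic, like Oracle's NUMBER(10,2)
    code  varchar(10),    -- CHAR length semantics carry over directly
    note  text            -- BYTE length semantics have no direct equivalent; text is safe
);
```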

Oracle → PostgreSQL: Automatic conversion

Of course there is no need to do most of the work by hand. The ora_migrator tool uses oracle_fdw (in effect, a database link to Oracle) to map data types for you. ora_migrator will also help with indexes, constraints, and so on to make migrations as easy and as fast as possible. Also have a look at our Cybertec Enterprise Migrator tool page.

© Laurenz Albe 2018


In a recent wrestling match with the Linux “out-of-memory killer” for a Cybertec customer I got acquainted with Linux control groups (“cgroups”), and I want to give you a short introduction how they can be used with PostgreSQL and discuss their usefulness.

Warning: This was done on my RedHat Fedora 27 system running Linux 4.16.5 with cgroups v1 managed by systemd version 234. Both cgroups and systemd's handling of them seem to be undergoing changes, so your mileage may vary considerably. Still, it should be a useful starting point if you want to explore cgroups.

What are Linux cgroups?

From the cgroups manual page:

Control groups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

cgroups are managed with special commands that start with “cg”, but can also be managed through a special cgroups file system and systemd.
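For example, you can explore the hierarchy through that file system without any special tools. The mount point /sys/fs/cgroup used below is the common default, but may differ on your system:

```shell
# show which cgroups the current shell belongs to
cat /proc/self/cgroup
# list the mounted cgroup controllers (if the cgroup file system is mounted)
ls /sys/fs/cgroup 2>/dev/null
```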

Now a running PostgreSQL cluster is a group of processes, so that’s a perfect fit.

There are several subsystems defined (also called “controllers” in cgroups terminology). Of these, the following are interesting for PostgreSQL:

  • memory: useful for limiting the total memory usage
  • blkio: useful for limiting the I/O throughput
  • cpu: useful to define upper and lower limits to the CPU time available for the processes
  • cpuset: useful for binding the processes to a subset of the available CPU cores

Configuring cgroups

During system startup, cgroups are created as defined in the /etc/cgconfig.conf configuration file.

Let’s create a cgroup to build a cage for a PostgreSQL cluster:

group db_cage {
    # user and group "postgres" can manage these cgroups
    perm {
        task {
            uid = postgres;
            gid = postgres;
            fperm = 774;
        }
        admin {
            uid = postgres;
            gid = postgres;
            dperm = 775;
            fperm = 774;
        }
    }

    # limit memory to 1 GB and disable swap
    memory {
        memory.limit_in_bytes = 1G;
        memory.memsw.limit_in_bytes = 1G;
    }

    # limit read and write I/O to 10MB/s each on device 8:0
    blkio {
        blkio.throttle.read_bps_device = "8:0 10485760";
        blkio.throttle.write_bps_device = "8:0 10485760";
    }

    # limit CPU time to 0.25 seconds out of each second
    cpu {
        cpu.cfs_period_us = 1000000;
        cpu.cfs_quota_us = 250000;
    }

    # only CPUs 0-3 and memory node 0 can be used
    cpuset {
        cpuset.cpus = 0-3;
        cpuset.mems = 0;
    }
}
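The device number “8:0” and the byte value in the blkio settings can be derived like this (the stat path is an example; point it at the directory that holds your PostgreSQL data):

```shell
# 10 MB/s expressed in bytes, as required by blkio.throttle.*_bps_device
echo $((10 * 1024 * 1024))   # prints 10485760

# hex major:minor number of the block device backing a path
# (here "/" as an example; use your PostgreSQL data directory instead)
stat -c '%t:%T' /
```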
To activate it, run the following as root:

# /usr/sbin/cgconfigparser -l /etc/cgconfig.conf -s 1664

To have that done automatically at server start, I tell systemd to enable the cgconfig service:

# systemctl enable cgconfig
# systemctl start cgconfig

Starting PostgreSQL in a cgroup

To start PostgreSQL in the cgroups we defined above, use the cgexec executable (you may have to install an operating system package called libcgroup or libcgroup-tools for that):

$ cgexec -g cpu,memory,blkio:db_cage \
   /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data start

We can confirm that PostgreSQL is running in the correct cgroup:

$ head -1 /var/lib/pgsql/10/data/postmaster.pid
16284

$ cat /proc/16284/cgroup | egrep '\b(cpu|blkio|memory)\b'

To change a running process to a cgroup, you can use cgclassify (but then you have to change all running PostgreSQL processes).
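A sketch of that, assuming the db_cage cgroup from above, is shown below as a dry run that only prints the commands. Drop the echo to actually apply them; pgrep and cgclassify must be installed:

```shell
# print a cgclassify command for every running PostgreSQL process;
# remove the "echo" to actually move the processes into the cgroup
for pid in $(pgrep -u postgres postgres || true); do
    echo cgclassify -g cpu,memory,blkio:db_cage "$pid"
done
```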

Using cgroups with systemd

systemd provides a simpler interface to Linux cgroups, so you don’t have to do any of the above. systemd can create cgroups “on the fly” for the services it starts.

If your PostgreSQL service is called postgresql-10, simply create a file /etc/systemd/system/postgresql-10.service like this:

# include the original service file rather than editing it
# so that changes don't get lost during an upgrade
.include /usr/lib/systemd/system/postgresql-10.service

[Service]
# limit memory to 1GB
# sets "memory.limit_in_bytes"
MemoryMax=1G
# limit memory + swap space to 1GB
# this should set "memory.memsw.limit_in_bytes" but it only
# works with cgroups v2 ...
# MemorySwapMax=1G

# limit read I/O on block device 8:0 to 10MB per second
# sets "blkio.throttle.read_bps_device"
IOReadBandwidthMax=/dev/block/8:0 10M
# limit write I/O on block device 8:0 to 10MB per second
# sets "blkio.throttle.write_bps_device"
IOWriteBandwidthMax=/dev/block/8:0 10M

# limit CPU time to a quarter of the available
# sets "cpu.cfs_quota_us"
CPUQuota=25%

# there are no settings to control "cpuset" cgroups

Now you have to tell systemd that you changed the configuration and restart the service:

# systemctl daemon-reload
# systemctl restart postgresql-10

As you see, not all cgroup settings are available with systemd. As a workaround, you can define cgroups in /etc/cgconfig.conf and use cgexec to start the service.

How useful are cgroups for PostgreSQL?

I would say that it depends on the subsystem.


memory

At first glance, it sounds interesting to limit memory usage with cgroups. But there are several drawbacks:

  • If PostgreSQL is allowed to use swap space, it will start swapping when the memory quota is exceeded.
  • If PostgreSQL is not allowed to use swap space, the Linux OOM killer will kill PostgreSQL when the quota is exceeded (alternatively, you can configure the cgroup so that the process is paused until memory is freed, but this might never happen).
  • The memory quota also limits the amount of memory available for the file system cache.

None of this is very appealing — there is no option to make malloc fail so that PostgreSQL can handle the problem.

I think that it is better to use the traditional way of limiting PostgreSQL’s memory footprint by setting shared_buffers, work_mem and max_connections so that PostgreSQL won’t use too much memory.
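As a rough back-of-the-envelope check with made-up example values (a simplification: a complex query can use work_mem several times over, and maintenance operations need extra memory):

```shell
shared_buffers_mb=1024   # example values, not recommendations
work_mem_mb=4
max_connections=100

# crude lower bound for peak memory use in MB:
# shared_buffers plus one work_mem allocation per connection
echo $(( shared_buffers_mb + max_connections * work_mem_mb ))   # prints 1424
```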

That also has the advantage that all PostgreSQL clusters on the machine can share the file system cache, so that clusters that need it can get more of that resource, while no cluster can become completely memory starved (everybody is guaranteed shared_buffers).


blkio

I think that cgroups are a very useful way of limiting I/O bandwidth for PostgreSQL.

The only drawback is maybe that PostgreSQL cannot use more than its allotted quota even if the I/O system is idle.


cpu

cgroups are also a good way of limiting CPU usage by a PostgreSQL cluster.

Again, it would be nice if PostgreSQL were allowed to exceed its quota if the CPUs are idle.


cpuset

This is only useful on big machines with a NUMA architecture. On such machines, binding PostgreSQL to the CPUs and memory of one NUMA node will make sure that all memory access is local to that node and consequently fast.

You can thus partition your NUMA machine between several PostgreSQL clusters.
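To see whether your machine has more than one NUMA node at all (so that cpuset pinning is worth considering), you can check sysfs:

```shell
# count the NUMA nodes of this machine; a count of 1 (or 0 in some
# containers) means cpuset pinning will not gain anything
ls -d /sys/devices/system/node/node* 2>/dev/null | wc -l
```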