Better correlation helps in real life too

After you ANALYZE a PostgreSQL table to collect value distribution statistics, you will find the gathered statistics for each column in the pg_stats system view. This article will explain the meaning of the correlation column and its impact on index scans.

Physical vs. logical ordering

Most common PostgreSQL data types have an ordering: they support the operators <, <=, =, >= and >.
Such data types can be used with a B-tree index (the “standard” index type).

The values in a column of such a type provide a logical ordering of the table rows. An index on this column will be sorted according to that ordering.

A PostgreSQL table consists of one or more files of 8KB blocks. The order in which the rows are stored in the file is the physical ordering.
You can examine the physical ordering of the rows by selecting the ctid system column: it contains the block number and the item number inside the block, which describe the physical location of the table row.


The correlation for a column is a value between -1 and 1. It tells how good the match between logical and physical ordering is.

  • If the correlation is 1, the rows are stored in the table file in ascending column order; if it is -1, they are stored in descending order.
  • Values between -1 and 1 mean a less perfect match.
  • A value of 0 means that there is no connection between the physical and the logical order.

Why should I care?

You will create indexes on your tables for faster access (but not too many!).
The correlation of a column has an impact on the performance of an index scan.

During an index scan, the whole index or part of it are read in index sequential order. For each entry that is found, the corresponding row is fetched from the table (this is skipped in an “index only scan”, but that is a different story).

If the correlation of the indexed column is close to zero, the fetched rows will be from all over the table. This will result in many randomly distributed reads of many different table blocks.

However, if the correlation is close to 1 or -1, the next row fetched during the index scan tends to be in the same or the next table block as the previous row.

High correlation has two advantages:

  1. Blocks read by the database are cached in shared memory. Consequently, if many of the table rows fetched during the index scan are located in the same table block, only few blocks have to be read from storage.
  2. The blocks that have to be read from storage are next to each other. This leads to sequential I/O, which on spinning disks is substantially faster than random I/O.

An example

Let’s create two tables with identical content, but different correlation:

CREATE TABLE corr (id, val) AS
   SELECT i, 'some text ' || i
   FROM generate_series(1, 100000) AS i;

CREATE INDEX corr_idx ON corr (id);


SELECT correlation FROM pg_stats
WHERE tablename = 'corr' AND attname = 'id';

(1 row)

   SELECT * FROM corr
   ORDER BY random();

CREATE INDEX uncorr_idx ON uncorr (id);


SELECT correlation FROM pg_stats
WHERE tablename = 'uncorr' AND attname = 'id';

(1 row)

We disable bitmap index scans so that we can compare index scans on both tables.
Then we check how index scans perform:

SET enable_bitmapscan = off;

SELECT * FROM corr WHERE id BETWEEN 1001 AND 1300;

                    QUERY PLAN
 Index Scan using corr_idx on corr
       (cost=0.29..15.23 rows=297 width=19)
       (actual time=0.108..0.732 rows=300 loops=1)
   Index Cond: ((id >= 1001) AND (id <= 1300))
   Buffers: shared hit=6
 Planning time: 0.456 ms
 Execution time: 1.049 ms
(5 rows)

SELECT * FROM uncorr WHERE id BETWEEN 1001 AND 1300;

                    QUERY PLAN
 Index Scan using uncorr_idx on uncorr
       (cost=0.29..978.15 rows=298 width=19)
       (actual time=0.105..2.352 rows=300 loops=1)
   Index Cond: ((id >= 1001) AND (id <= 1300))
   Buffers: shared hit=303
 Planning time: 0.548 ms
 Execution time: 2.736 ms
(5 rows)

Now 2.7 milliseconds is not so bad, but that is only because all blocks were already in shared buffers.
If a part of these blocks has to be read from disk, the 303 blocks from the second query will do much worse than the 6 from the first!

In the second query, each result row was found in a different table block. This caused 300 blocks to be touched. The remaining three blocks are index blocks.

The first query touches only three table blocks:

SELECT ctid, id FROM corr
WHERE id BETWEEN 1001 AND 1300;

  ctid   |  id  
 (6,58)  | 1001
 (6,59)  | 1002
 (6,60)  | 1003
 (6,61)  | 1004
 (6,62)  | 1005
 (6,63)  | 1006
 (6,64)  | 1007
 (8,37)  | 1294
 (8,38)  | 1295
 (8,39)  | 1296
 (8,40)  | 1297
 (8,41)  | 1298
 (8,42)  | 1299
 (8,43)  | 1300
(300 rows)

Indeed, all rows are contained in the table blocks 6, 7 and 8!

Correlation and the optimizer

The PostgreSQL optimizer estimates the cost of the possible ways to execute an SQL statement.

With the use of the correlation it can give better estimates of the cost of an index scan, leading to better plan choices.

The PostgreSQL optimizer will prefer index scans if the correlation is close to 1 or -1.

Correlation and BRIN indexes

PostgreSQL 9.5 introduced the BRIN index (block range index).

This index works be storing the minimum and maximum of all values for ranges of table blocks. It is only useful for columns with perfect correlation. Its advantage over the B-tree index is its much smaller size, which makes it an interesting option for large tables.

How to make use of correlation?

If you need to efficiently scan bigger portions of an index, it is good to keep the table in index order.

There are no “index ordered tables” in PostgreSQL.
Still, high correlation for a column can be maintained in two ways:

  1. Automatically:

    If the table rows are inserted in logical column order and there are no updates or deletes on the table, the physical ordering will be identical to the logical ordering. Good examples for that are primary key columns generated by sequences or measurements with a timestamp.

    Since correlation is always perfect in this case, a BRIN index can be an interesting option.

    If you want to remove old data from a table without disrupting the physical ordering, you can use table partitioning.

  2. Clustering:

    The SQL statement CLUSTER can be used to rewrite a table so that the physical ordering is identical to the logical ordering of an index.

    However, subsequent modifications of the table will reduce the correlation again. Because of that, you need to re-cluster the table regularly to maintain high correlation. This is annoying, because CLUSTER blocks all concurrent access to the table.

Triggers with transition tables rule!
© Laurenz Albe 2018


Inspired by my co-worker’s recent blog post, I decided to revisit the old question of rules vs. triggers and run a little benchmark to see which one does better.

About rules

While triggers are well known to most application developers and database administrators, rules are less well known. The full name “query rewrite rule” explains what they are doing: Before the query is optimized, a rule can either replace the query with a different one or add additional queries. These are then planned and executed instead of or together with the original query.

While rules are very powerful, they are also tricky to get right, particularly when data modification is involved. It is usually recommended to use the more modern triggers, and there have been attempts to deprecate them. But since rules are used to implement views, they are probable here to stay.

Moreover, they are said to be faster than triggers when many rows are affected. That’s what I tried to measure.

The test case

We have a simple table:

   id  integer NOT NULL,
   val text    NOT NULL
) WITH (autovacuum_enabled = off);

Since we are only interested in the speed of processing, I decided to use an unlogged table to bypass disk I/O as much as possible. Autovacuum is disabled to avoid inferences with it.

Every update to this table should be logged to a second table:

   id       integer     NOT NULL,
   log_time timestamptz NOT NULL
            DEFAULT current_timestamp,
   new_val  text        NOT NULL
) WITH (autovacuum_enabled = off);

The test will repeatedly perform the following in a loop:


INSERT INTO upd (id, val)
   SELECT i, 'text number ' || i
   FROM generate_series(1, 100000) i;


UPDATE upd SET val = val;


I’ll perform two baseline measurements: one without the UPDATE (called “zero”), and one with just the UPDATE, but no logging (called “base”). This will enable me to measure the net time spent in the UPDATE by subtraction so I can calculate how much performance is slowed down by the different logging techniques.

logging with a rule

For this measurement, which I call “rule”, I’ll use the following rule to log updates:

   DO ALSO INSERT INTO log (id, new_val)
           VALUES (, NEW.val);

logging with a row level trigger

For this measurement, which I call “trig_row”, I’ll use the following trigger to log updates:

CREATE FUNCTION upd_row_trig() RETURNS trigger
   LANGUAGE plpgsql AS
   INSERT INTO log (id, new_val)
   VALUES (, NEW.val);


CREATE TRIGGER upd_row_trig
   EXECUTE PROCEDURE upd_row_trig();

logging with a statement level trigger

For this measurement, which I call “trig_stmt”, I’ll use the following trigger to log updates:

CREATE FUNCTION upd_stmt_trig() RETURNS trigger
   LANGUAGE plpgsql AS
   INSERT INTO log (id, new_val)
   SELECT id, val FROM newrows;


   EXECUTE PROCEDURE upd_stmt_trig();

This uses transition tables which are new in PostgreSQL v10. The REFERENCING clause will make all rows modified by the statement available in a “temporary” table newrows.

Test environment

All commands are run from a shell script on my Linux laptop using psql -c "...". The script loops through the statements twenty times, and the elapsed time is measured with time.

Each test is executed three times to ensure that the execution time does not vary, then the middle value is used.

Test results

The zero measurements have been subtracted and the result divided by twenty to obtain the following numbers:

Performance of rule and triggers for logging bulk updates
baselinerulerow level triggerstatement level trigger
execution time295 ms454 ms3816 ms642 ms
overhead percentage0 % 54 % 1193 % 117 % 


The test confirms that rules perform better than anything else when it comes to logging bulk updates. However, while logging with row level triggers caused a slowdown by a forbidding factor of 13, statement level triggers with transition tables only doubled the execution time.

So with PostgreSQL v10 we have a good alternative that performs half as well as rules, and there is no need to use rules unless you need cutting edge performance.

A bad query plan ...
© Laurenz Albe 2018


We all know that you have to pay a price for a new index you create — data modifying operations will become slower, and indexes use disk space. That’s why you try to have no more indexes than you actually need.

But most people think that SELECT performance will never suffer from a new index. The worst that can happen is that the new index is not used.

However, this is not always true, as I have seen more than once in the field. I’ll show you such a case and tell you what you can do about it.

An example

We will experiment with this table:

   sort        integer NOT NULL,
   category    integer NOT NULL,
   interesting boolean NOT NULL

   SELECT i, i%1000, i>50000
   FROM generate_series(1, 1000000) i;

CREATE INDEX skewed_category_idx ON skewed (category);


We want to find the first twenty interesting rows in category 42:

SELECT * FROM skewed
WHERE interesting AND category = 42

This performs fine:

                             QUERY PLAN
 Limit  (cost=2528.75..2528.80 rows=20 width=9)
        (actual time=4.548..4.558 rows=20 loops=1)
   Buffers: shared hit=1000 read=6
   ->  Sort  (cost=2528.75..2531.05 rows=919 width=9)
             (actual time=4.545..4.549 rows=20 loops=1)
         Sort Key: sort
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=1000 read=6
         ->  Bitmap Heap Scan on skewed
                        (cost=19.91..2504.30 rows=919 width=9)
                        (actual time=0.685..4.108 rows=950 loops=1)
               Recheck Cond: (category = 42)
               Filter: interesting
               Rows Removed by Filter: 50
               Heap Blocks: exact=1000
               Buffers: shared hit=1000 read=6
               ->  Bitmap Index Scan on skewed_category_idx
                        (cost=0.00..19.68 rows=967 width=0)
                        (actual time=0.368..0.368 rows=1000 loops=1)
                     Index Cond: (category = 42)
                     Buffers: shared read=6
 Planning time: 0.371 ms
 Execution time: 4.625 ms

PostgreSQL uses the index to find the 1000 rows with category 42, filters out the ones that are not interesting, sorts them and returns the top 20. 5 milliseconds is fine.

A new index makes things go sour

Now we add an index that can help us with sorting. That is definitely interesting if we often have to find the top 20 results:

CREATE INDEX skewed_sort_idx ON skewed (sort);

And suddenly, things are looking worse:

                          QUERY PLAN
 Limit  (cost=0.42..736.34 rows=20 width=9)
        (actual time=21.658..28.568 rows=20 loops=1)
   Buffers: shared hit=374 read=191
   ->  Index Scan using skewed_sort_idx on skewed
                (cost=0.42..33889.43 rows=921 width=9)
                (actual time=21.655..28.555 rows=20 loops=1)
         Filter: (interesting AND (category = 42))
         Rows Removed by Filter: 69022
         Buffers: shared hit=374 read=191
 Planning time: 0.507 ms
 Execution time: 28.632 ms

What happened?

PostgreSQL thinks that it will be faster if it examines the rows in sort order using the index until it has found 20 matches. But it doesn’t know how the matching rows are distributed with respect to the sort order, so it is not aware that it will have to scan 69042 rows until it has found its 20 matches (see Rows Removed by Filter: 69022 in the above execution plan).

What can we do to get the better plan?

PostgreSQL v10 has added extended statistics to track how the values in different columns are correlated, but that does no track the distributions of the values, so it will not help us here.

There are two workarounds:

  1. Drop the index that misleads PostgreSQL.If that is possible, it is a simple solution. But usually one cannot do that, because the index is either used to enforce a unique constraint, or it is needed by other queries that benefit from it.
  2. Rewrite the query so that PostgreSQL cannot use the offending index.Of the many possible solutions for this, I want to present two:
    • A subquery with OFFSET 0:
      SELECT *
      FROM (SELECT * FROM skewed
            WHERE interesting AND category = 42
            OFFSET 0) q
      ORDER BY sort
      LIMIT 20;

      This makes use of the fact that OFFSET and LIMIT prevent a subquery from being “flattened”, even if they have no effect on the query result.

    • Using an expression as sort key:
      SELECT * FROM skewed
      WHERE interesting AND category = 42
      ORDER BY sort + 0
      LIMIT 20;

      This makes use of the fact that PostgreSQL cannot deduce that sort + 0 is the same as sort. Remember that PostgreSQL is extensible, and you can define your own + operator!


There are several ways of building PostgreSQL under Windows. Official manual stands using Visual Studio is the simplest one, wiki describes how to use mingw and mingw-w64.

As for me, using new MSYS2 building platform for Windows is a bless. With its help not only PostgreSQL sources are built smoothly, but even extensions are not a problem anymore. And if you were playing with extensions under Windows, you know what I’m talking about.

At its core MSYS2 is an independent rewrite of MSYS, based on modern Cygwin (POSIX compatibility layer) and MinGW-w64 with the aim of better interoperability with native Windows software. It provides a bash shell, Autotools, revision control systems and the like for building native Windows applications using MinGW-w64 toolchains.


It is pretty straightforward. On the official page you have the step-by-step guide, but in case of troubles, you may check detailed install guide.

After installation and upgrading run in cmd console

C:\msys64\msys2_shell.cmd -mingw64

or simply just click on the Start menu “MSYS2 MinGW 64-bit” shortcut:
MSYS2 MinGW 64-bit shortcut

Then update packages:

pacman -Syu

Install only needed packages:

pacman --needed -S git mingw-w64-x86_64-gcc base-devel

Let’s build it!

Get the PostgreSQL sources. Here you have two options:
1. Download them from the official site.
2. Use git repository:

git clone git://

Enter to the source folder:

cd postgresql

And run configure followed by make and make install:

./configure --host=x86_64-w64-mingw32 --prefix=/c/pgsql/ && make && make install

Here we tell our target machine will be 64-bit and we want our binaries to be copied to C:\pgsql\.


As you may see building PostgreSQL with MSYS2 for Windows is simple enough. But the main profit of such toolchain is building extensions for PostgreSQL. And there it shines in all its glory. Stay tuned.

Another way to get rid of unused indexes...
© Laurenz Albe 2018

Why should I get rid of unused indexes?

Everybody knows that a database index is a good thing because it can speed up SQL queries. But this does not come for free.

The disadvantages of indexes are:

  • Indexes use up space. It is not unusual for database indexes to use as much storage space as the data themselves. And the kind of reliable, fast storage you want for a database is not necessarily cheap.
    The space used up by indexes also increases the size and duration of physical backups.
  • Indexes slow down data modification. Whenever you INSERT into or DELETE from a table, all indexes have to be modified, in addition to the table itself (the “heap”).
    And it is much more expensive to modify the complicated data structure of an index than the heap itself, which has its name precisely because it is basically an unordered “pile” of data (and as everybody knows, maintaining order is more work than having a mess). Modifying an indexed table can easily be an order of magnitude more expensive than modifying an unindexed table.
  • Indexes prevent HOT updates. Because of the architecture of PostgreSQL, every UPDATE causes a new row version (“tuple”) to be written, and that causes a new entry in every index on the table.
    This behavior has been dubbed “write amplification” and has drawn a lot of fire. This undesirable effect can be avoided if a) the new tuple fits into the same table block as the old one and b) no indexed row is modified. Then PostgreSQL creates the new tuple as a “Heap Only Tuple” (hence HOT), which is much more efficient and also reduces the work VACUUM has to do.

The many uses of indexes

Now we know that we don’t want unnecessary indexes. The problem is that indexes serve so many purposes that it is difficult to determine if a certain index is needed or not.

Here is a list of all benefits of indexes in PostgreSQL:

  1. Indexes can speed up queries that use indexed columns (or expressions) in the WHERE clause.
    Everybody knows that one!
    The traditional B-tree index supports the <, <=, =, >= and > operators, while the many other index types in PostgreSQL can support more exotic operators like “overlaps” (for ranges or geometries), “distance” (for words) or regular expression matches.
  2. B-tree indexes can speed up the max() and min() aggregates.
  3. B-tree indexes can speed up ORDER BY clauses.
  4. Indexes can speed up joins. This depends on the “join strategy” chosen by the optimizer: hash joins, for example, will never make use of an index.
  5. A B-tree index on the origin of a FOREIGN KEY constraint avoids a sequential scan when rows are deleted (or keys modified) in the target table. A scan on the origin of the constraint is necessary to make sure that the constraint will not be not violated by the modification.
  6. Indexes are used to enforce constraints. Unique B-tree indexes are used to enforce PRIMARY KEY and UNIQUE constraints, while exclusion constraints use GiST indexes.
  7. Indexes can provide the optimizer with better value distribution statistics.
    If you create an index on an expression, ANALYZE and the autoanalyze daemon will not only collect statistics for the data distribution in table columns, but also for each expression that occurs in an index. This helps the optimizer to get a good estimate for the “selectivity” of complicated conditions that contain the indexed expression, which causes better plans to be chosen.This is a widely ignored benefit of indexes!

Find the unused indexes!

The following query that we at Cybertec use will show you all indexes that serve none of the above mentioned purposes.

It makes use of the fact that all uses of indexes in the above list with the exception of the last two result in an index scan.

For completeness’ sake, I have to add that the parameter track_counts has to remain “on” for the query to work, otherwise index usage is not tracked in pg_stat_user_indexes. But you must not change that parameter anyway, otherwise autovacuum will stop working.

To find the indexes that have never been used since the last statistics reset with pg_stat_reset(), use

SELECT s.schemaname,
       s.relname AS tablename,
       s.indexrelname AS indexname,
       pg_relation_size(s.indexrelid) AS index_size
FROM pg_catalog.pg_stat_user_indexes s
   JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0      -- has never been scanned
  AND 0 <>ALL (i.indkey)  -- no index column is an expression
  AND NOT EXISTS          -- does not enforce a constraint
         (SELECT 1 FROM pg_catalog.pg_constraint c
          WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;

Some remarks:

  • Don’t do that on your test database, but on the production database!
  • If your software it running at several customer sites, run the query on all of them.
    Different users have different ways to use a software, which can cause different indexes to be used.
  • You can replace s.idx_scan = 0 in the query with a different condition, e.g. s.idx_scan < 10. Indexes that are very rarely used are also good candidates for removal.


Unlock the tremendous energy of the vacuum!
© xkcd.xom (Randall Munroe) under the Creative Commons Attribution-NonCommercial 2.5 License

Whenever rows in a PostgreSQL table are updated or deleted, dead rows are left behind. VACUUM gets rid of them so that the space can be reused. If a table doesn’t get vacuumed, it will get bloated, which wastes disk space and slows down sequential table scans (and – to a smaller extent – index scans).

VACUUM also takes care of freezing table rows so to avoid problems when the transaction ID counter wraps around, but that’s a different story.

Normally you don’t have to take care of all that, because the autovacuum daemon built into PostgreSQL does that for you.

The problem

If your tables get bloated, the first thing you check is whether autovacuum has processed them or not:

SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup
    / (n_live_tup
       * current_setting('autovacuum_vacuum_scale_factor')::float8
          + current_setting('autovacuum_vacuum_threshold')::float8)

If your bloated table does not show up here, n_dead_tup is zero and last_autovacuum is NULL, you might have a problem with the statistics collector.

If the bloated table is right there on top, but last_autovacuum is NULL, you might need to configure autovacuum to be more aggressive so that it gets done with the table.

But sometimes the result will look like this:

 schemaname |    relname   | n_live_tup | n_dead_tup |   last_autovacuum
 laurenz    | vacme        |      50000 |      50000 | 2018-02-22 13:20:16
 pg_catalog | pg_attribute |         42 |        165 |
 pg_catalog | pg_amop      |        871 |        162 |
 pg_catalog | pg_class     |          9 |         31 |
 pg_catalog | pg_type      |         17 |         27 |
 pg_catalog | pg_index     |          5 |         15 |
 pg_catalog | pg_depend    |       9162 |        471 |
 pg_catalog | pg_trigger   |          0 |         12 |
 pg_catalog | pg_proc      |        183 |         16 |
 pg_catalog | pg_shdepend  |          7 |          6 |
(10 rows)

Here autovacuum has recently run, but it didn’t free the dead tuples!

We can verify the problem by running VACUUM (VERBOSE):

test=> VACUUM (VERBOSE) vacme;
INFO:  vacuuming "laurenz.vacme"
INFO:  "vacme": found 0 removable, 100000 nonremovable row versions in
       443 out of 443 pages
DETAIL:  50000 dead row versions cannot be removed yet,
         oldest xmin: 22300
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.

Why won’t VACUUM remove the dead rows?

VACUUM can only remove those row versions (also known as “tuples”) that are not needed any more. A tuple is not needed if the transaction ID of the deleting transaction (as stored in the xmax system column) is older than the oldest transaction still active in the PostgreSQL database (or the whole cluster for shared tables).

This value (22300 in the VACUUM output above) is called the “xmin horizon”.

There are three things that can hold back this xmin horizon in a PostgreSQL cluster:

  1. Long-running transactions:

    You can find those and their xmin value with the following query:

    SELECT pid, datname, usename, state, backend_xmin
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC;

    You can use the pg_terminate_backend() function to terminate the database session that is blocking your VACUUM.

  2. Abandoned replication slots:

    A replication slot is a data structure that keeps the PostgreSQL server from discarding information that is still needed by a standby server to catch up with the primary.

    If replication is delayed or the standby server is down, the replication slot will prevent VACUUM from deleting old rows.

    You can find all replication slots and their xmin value with this query:

    SELECT slot_name, slot_type, database, xmin
    FROM pg_replication_slots
    ORDER BY age(xmin) DESC;

    Use the pg_drop_replication_slot() function to drop replication slots that are no longer needed.

    Note: This can only happen with physical replication if hot_standby_feedback = on. For logical replication there is a similar hazard, but only system catalogs are affected. Examine the column catalog_xmin in that case.

  3. Orphaned prepared transactions:

    During two-phase commit, a distributed transaction is first prepared with the PREPARE statement and then committed with the COMMIT PREPARED statement.

    Once a transaction has been prepared, it is kept “hanging around” until it is committed or aborted. It even has to survive a server restart! Normally, transactions don’t remain in the prepared state for long, but sometimes things go wrong and a prepared transaction has to be removed manually by an administrator.

    You can find all prepared transactions and their xmin value with the following query:

    SELECT gid, prepared, owner, database, transaction AS xmin
    FROM pg_prepared_xacts
    ORDER BY age(transaction) DESC;

    Use the ROLLBACK PREPARED SQL statement to remove prepared transactions.

About sequences

Sequences are used to generate artificial numeric primary key columns for tables.
A sequence provides a “new ID” that is guaranteed to be unique, even if many database sessions are using the sequence at the same time.

Sequences are not transaction safe, because they are not supposed to block the caller. That is not a shortcoming, but intentional.

As a consequence, a transaction that requests a new value from the sequence and then rolls back will leave a “gap” in the values committed to the database. In the rare case that you really need a “gap-less” series of values, a sequence is not the right solution for you.

PostgreSQL’s traditional way of using sequences (nextval('my_seq')) differs from the SQL standard, which uses NEXT VALUE FOR <sequence generator name>.

New developments in PostgreSQL v10

Identity columns

PostgreSQL v10 has introduced the standard SQL way of defining a table with an automatically generated unique value:

GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY [ ( sequence_options ) ]

Here is an example:


Behind the scenes, this uses a sequence, and it is roughly equivalent to the traditional

   id bigserial PRIMARY KEY,

which is a shorthand for

CREATE SEQUENCE my_tab_id_seq;

   id bigint PRIMARY KEY DEFAULT nextval('my_tab_id_seq'::regclass),


The problem with such a primary key column is that the generated value is a default value, so if the user explicitly inserts a different value into this column, it will override the generated one.

This is usually not what you want, because it will lead to a constraint violation error as soon as the sequence counter reaches the same value. Rather, you want the explicit insertion to fail, since it is probably a mistake.

For this you use GENERATED ALWAYS:


You can still override the generated value, but you’ll have to use the OVERRIDING SYSTEM VALUE clause for that, which makes it much harder for such an INSERT to happen by mistake:


New system catalog pg_sequence

Before PostgreSQL v10, a sequence’s metadata (starting value, increment and others) were stored in the sequence itself.

This information is now stored in a new catalog table pg_sequence.

The only data that remain in the sequence are the data changed by the sequence manipulation functions nextval, currval, lastval and setval.

Transactional DDL for sequences

A sequence in PostgreSQL is a “special table” with a single row.

In “normal tables”, an UPDATE does not modify the existing row, but writes a new version of it and marks the old version as obsolete. Since sequence operations should be fast and are never rolled back, PostgreSQL can be more efficient by just modifying the single row of a sequence in place whenever its values change.

Since prior to PostgreSQL v10 all metadata of a sequence were kept in the sequence (as explained in the previous section), this had the down side that ALTER SEQUENCE, which also modified the single row of a sequence, could not be rolled back.

Since PostgreSQL v10 has given us pg_sequence, and catalog modifications are transaction safe in PostgreSQL, this limitation could be removed with the latest release.

Performance regression with ALTER SEQUENCE

When I said above that ALTER SEQUENCE has become transaction safe just by introducing a new catalog table, I cheated a little. There is one variant of ALTER SEQUENCE that modifies the values stored in a sequence:


If only some variants of ALTER SEQUENCE were transaction safe and others weren’t, this would lead to surprising and buggy behavior.

That problem was fixed with this commit:

commit 3d79013b970d4cc336c06eb77ed526b44308c03e
Author: Andres Freund <>
Date:   Wed May 31 16:39:27 2017 -0700

    Make ALTER SEQUENCE, including RESTART, fully transactional.
    Previously the changes to the "data" part of the sequence, i.e. the
    one containing the current value, were not transactional, whereas the
    definition, including minimum and maximum value were.  That leads to
    odd behaviour if a schema change is rolled back, with the potential
    that out-of-bound sequence values can be returned.
    To avoid the issue create a new relfilenode fork whenever ALTER
    SEQUENCE is executed, similar to how TRUNCATE ... RESTART IDENTITY
    already is already handled.
    This commit also makes ALTER SEQUENCE RESTART transactional, as it
    seems to be too confusing to have some forms of ALTER SEQUENCE behave
    transactionally, some forms not.  This way setval() and nextval() are
    not transactional, but DDL is, which seems to make sense.
    This commit also rolls back parts of the changes made in 3d092fe540
    and f8dc1985f as they're now not needed anymore.
    Author: Andres Freund
    Backpatch: Bug is in master/v10 only

This means that every ALTER SEQUENCE statement will now create a new data file for the sequence; the old one gets deleted during COMMIT. This is similar to the way TRUNCATE, CLUSTER, VACUUM (FULL) and some ALTER TABLE statements are implemented.

Of course this makes ALTER SEQUENCE much slower in PostgreSQL v10 than in previous releases, but you can expect this statement to be rare enough that it should not cause a performance problem.

However, there is this old blog post by depesz that recommends the following function to efficiently get a gap-less block of sequence values:

   use_seqname text,
   use_increment integer
) RETURNS bigint AS $$
   reply bigint;
   PERFORM pg_advisory_lock(123);
   EXECUTE 'ALTER SEQUENCE ' || quote_ident(use_seqname)
           || ' INCREMENT BY ' || use_increment::text;
   reply := nextval(use_seqname);
   EXECUTE 'ALTER SEQUENCE ' || quote_ident(use_seqname)
           || ' INCREMENT BY 1';
   PERFORM pg_advisory_unlock(123);
   RETURN reply;
$$ LANGUAGE 'plpgsql';

This function returns the last value of the gap-less sequence value block (and does not work correctly when called on a newly created sequence).

Since this function calls ALTER SEQUENCE not only once but twice, you can imagine that every application that uses it a lot will experience quite a performance hit when upgrading to PostgreSQL v10.

Fortunately you can achieve the same thing with the normal sequence manipulation functions, so you can have a version of the function that will continue performing well in PostgreSQL v10:

   use_seqname regclass,
   use_increment integer
) RETURNS bigint AS $$
   reply bigint;
   lock_id bigint := (use_seqname::bigint - 2147483648)::integer;
   PERFORM pg_advisory_lock(lock_id);
   reply := nextval(use_seqname);
   PERFORM setval(use_seqname, reply + use_increment - 1, TRUE);
   PERFORM pg_advisory_unlock(lock_id);
   RETURN reply + increment - 1;
$$ LANGUAGE plpgsql;

If you want to get the first value of the sequence value block, use RETURN reply;

Note that both the original function and the improved one use advisory locks, so they will only work reliably if the sequence is only used with that function.

While doing PostgreSQL consulting for a German client, I stumbled over an interesting issue this week, which might be worth sharing with some folks out on the Internet, it’s all about grouping.

Suppose you are measuring the same thing various times on different sensors every, say, 15 minutes. Maybe some temperature, some air pressure or whatever. The data might look like it is shown in the next table:

CREATE TABLE t_data (t time, val int);

COPY t_data FROM stdin;

14:00   12

14:01   22

14:01   43

14:14   32

14:15   33

14:16   27

14:30   19


The human eye can instantly spot that 14:00 and 14:01 could be candidates for grouping (maybe the differences are just related to latency or some slighty inconsistent timing). The same applies to 14:14 to 14:16. You might want to have this data in the same group during aggregation.

The question now is: How can that be achieved with PostgreSQL?

Some dirty SQL trickery

The first thing to do is to check out those difference from one timestamp to the next:

SELECT *, lag(t, 1) OVER (ORDER BY t)

FROM    t_data;

The lag function offers a nice way to solve this kind of problem:

t     | val |   lag


14:00:00 |  12 |

14:01:00 |  22 | 14:00:00

14:01:00 |  43 | 14:01:00

14:14:00 |  32 | 14:01:00

14:15:00 |  33 | 14:14:00

14:16:00 |  27 | 14:15:00

14:30:00 |  19 | 14:16:00

(7 rows)

Now that we have used lag to “move” the time to the next row, there is a simple trick, which can be applied:

SELECT  *, CASE WHEN t - lag < '10 minutes'

THEN currval('seq_a')

ELSE nextval('seq_a') END AS g

FROM    ( SELECT *, lag(t, 1) OVER (ORDER BY t)

FROM  t_data) AS x;

Moving the lag to a subselect allows us to start all over again and to create those groups. The trick now is: If the difference from one line to the next is high, start a new group – otherwise stay within the group.

This leaves us with a simple result set:

t     | val |   lag    | g


14:00:00 |  12 |          | 1

14:01:00 |  22 | 14:00:00 | 1

14:01:00 |  43 | 14:01:00 | 1

14:14:00 |  32 | 14:01:00 | 2

14:15:00 |  33 | 14:14:00 | 2

14:16:00 |  27 | 14:15:00 | 2

14:30:00 |  19 | 14:16:00 | 3

(7 rows)

From now on, life is simple. We can take this output and aggregate on this data easily. “GROUP BY g” will give us nice groups for each value of “g”.