Triggers with transition tables rule!
© Laurenz Albe 2018


Inspired by my co-worker’s recent blog post, I decided to revisit the old question of rules vs. triggers and run a little benchmark to see which one does better.

About rules

While triggers are well known to most application developers and database administrators, rules are less well known. The full name “query rewrite rule” explains what they are doing: Before the query is optimized, a rule can either replace the query with a different one or add additional queries. These are then planned and executed instead of or together with the original query.

While rules are very powerful, they are also tricky to get right, particularly when data modification is involved. It is usually recommended to use the more modern triggers, and there have been attempts to deprecate rules. But since rules are used to implement views, they are probably here to stay.

Moreover, they are said to be faster than triggers when many rows are affected. That’s what I tried to measure.

The test case

We have a simple table:

CREATE UNLOGGED TABLE upd (
   id  integer NOT NULL,
   val text    NOT NULL
) WITH (autovacuum_enabled = off);

Since we are only interested in the speed of processing, I decided to use an unlogged table to bypass disk I/O as much as possible. Autovacuum is disabled to avoid interference from it.

Every update to this table should be logged to a second table:

CREATE UNLOGGED TABLE log (
   id       integer     NOT NULL,
   log_time timestamptz NOT NULL
            DEFAULT current_timestamp,
   new_val  text        NOT NULL
) WITH (autovacuum_enabled = off);

The test will repeatedly perform the following in a loop:

TRUNCATE upd;

INSERT INTO upd (id, val)
   SELECT i, 'text number ' || i
   FROM generate_series(1, 100000) i;


UPDATE upd SET val = val;


I’ll perform two baseline measurements: one without the UPDATE (called “zero”), and one with just the UPDATE but no logging (called “base”). Subtracting the “zero” time gives the net time spent in the UPDATE, so I can calculate how much the different logging techniques slow it down.

Logging with a rule

For this measurement, which I call “rule”, I’ll use the following rule to log updates:

CREATE RULE upd_rule AS ON UPDATE TO upd
   DO ALSO INSERT INTO log (id, new_val)
           VALUES (NEW.id, NEW.val);

Logging with a row level trigger

For this measurement, which I call “trig_row”, I’ll use the following trigger to log updates:

CREATE FUNCTION upd_row_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   INSERT INTO log (id, new_val)
   VALUES (NEW.id, NEW.val);

   RETURN NEW;
END;$$;

CREATE TRIGGER upd_row_trig
   AFTER UPDATE ON upd FOR EACH ROW
   EXECUTE PROCEDURE upd_row_trig();

Logging with a statement level trigger

For this measurement, which I call “trig_stmt”, I’ll use the following trigger to log updates:

CREATE FUNCTION upd_stmt_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   INSERT INTO log (id, new_val)
   SELECT id, val FROM newrows;

   RETURN NULL;
END;$$;

CREATE TRIGGER upd_stmt_trig
   AFTER UPDATE ON upd
   REFERENCING NEW TABLE AS newrows
   FOR EACH STATEMENT
   EXECUTE PROCEDURE upd_stmt_trig();

This uses transition tables which are new in PostgreSQL v10. The REFERENCING clause will make all rows modified by the statement available in a “temporary” table newrows.

Test environment

All commands are run from a shell script on my Linux laptop using psql -c "...". The script loops through the statements twenty times, and the elapsed time is measured with time.

Each test is executed three times to ensure that the execution time does not vary, then the middle value is used.

Test results

The zero measurements have been subtracted and the result divided by twenty to obtain the following numbers:

Performance of rule and triggers for logging bulk updates

                      baseline   rule     row level trigger   statement level trigger
execution time        295 ms     454 ms   3816 ms             642 ms
overhead percentage   0 %        54 %     1193 %              117 %
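The overhead percentages follow directly from the measured times. As a quick cross-check, here is the arithmetic in a small Python sketch (the millisecond values are the ones from the table; rounding of the published percentages may differ slightly):

```python
# Cross-check the overhead percentages from the measured times.
# "base" is the net UPDATE time with no logging at all.
base = 295.0  # milliseconds

def overhead_pct(measured_ms: float) -> float:
    """Extra time relative to the plain UPDATE, in percent."""
    return 100.0 * (measured_ms - base) / base

for name, ms in [("rule", 454.0),
                 ("row level trigger", 3816.0),
                 ("statement level trigger", 642.0)]:
    print(f"{name}: {overhead_pct(ms):.0f} %")
```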


The test confirms that rules perform better than anything else when it comes to logging bulk updates. However, while logging with row level triggers caused a slowdown by a forbidding factor of 13, statement level triggers with transition tables only doubled the execution time.

So with PostgreSQL v10 we have a good alternative whose overhead is only about twice that of rules, and there is no need to use rules unless you need cutting edge performance.

When people talk about database performance monitoring, they usually think of inspecting one PostgreSQL database server at a time. While that is certainly useful, it can also be quite beneficial to inspect the status of an entire database cluster, or a set of servers working together, at once. Fortunately there are easy ways to achieve that with PostgreSQL. This post outlines how it works.

pg_stat_statements: The best tool to monitor PostgreSQL performance

If you want to take a deep look at PostgreSQL performance, there is really no way around pg_stat_statements. It offers a lot of information and is really easy to use.

To install pg_stat_statements, the following steps are necessary:

  • run “CREATE EXTENSION pg_stat_statements” in your desired database
  • add the following line to postgresql.conf:
    • shared_preload_libraries = ‘pg_stat_statements’
  • restart PostgreSQL

Once this is done, PostgreSQL will already be busy collecting data on your database hosts. However, how can we create a “clusterwide pg_stat_statements” view so that we can inspect an entire set of servers at once?

Using pg_stat_statements to check an entire database cluster

Our goal is to show data from a list of servers in a single view. One way to do that is to make use of PostgreSQL’s foreign data wrapper infrastructure. We can simply connect to all servers in the cluster and unify the data in a single view.

Let us assume we have 3 servers, a local machine, “a_server”, and “b_server”. Let us get started by connecting to the local server to run the following commands:

CREATE USER dbmonitoring LOGIN PASSWORD 'abcd';
GRANT USAGE ON SCHEMA pg_catalog TO dbmonitoring;
GRANT ALL ON pg_stat_statements TO dbmonitoring;

In the first step I created a simple user to do the database monitoring. Of course you can handle users and so on differently but it seems like an attractive idea to use a special user for that purpose.

The next command enables the postgres_fdw extension, which is necessary to connect to those remote servers we want to access:

CREATE EXTENSION postgres_fdw;

Then we can already create “foreign servers”. Here is how those servers can be created:

CREATE SERVER pg1 FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host 'a_server', dbname 'a');

CREATE SERVER pg2 FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host 'b_server', dbname 'b');

Just replace the hostnames and the database names with your data and run those commands. The next step is already about user mapping: It might easily happen that local users are not present on the other side so it is necessary to create some sort of mapping between local and remote users:

CREATE USER MAPPING FOR dbmonitoring
       SERVER pg1
       OPTIONS (user 'postgres', password 'abcd');

CREATE USER MAPPING FOR dbmonitoring
       SERVER pg2
       OPTIONS (user 'postgres', password 'abcd');

In this case we will login as user “postgres”. Now that two servers and the user mappings are ready, we can import the remote schema into a local schema:

CREATE SCHEMA monitoring_a;
IMPORT FOREIGN SCHEMA public
       LIMIT TO (pg_stat_statements)
       FROM SERVER pg1
       INTO monitoring_a;

CREATE SCHEMA monitoring_b;
IMPORT FOREIGN SCHEMA public
       LIMIT TO (pg_stat_statements)
       FROM SERVER pg2
       INTO monitoring_b;

For each server there will be a separate schema. This makes it very easy to drop things again and to handle various incarnations of the same data structure.

Wiring things together

The last thing to do in our main database, is to connect those remote tables with our local data. The easiest way to achieve that is to use a simple view:

CREATE VIEW monitoring_performance AS
SELECT 'localhost'::text AS node, *
FROM pg_stat_statements
UNION ALL
SELECT 'server a'::text AS node, *
FROM monitoring_a.pg_stat_statements
UNION ALL
SELECT 'server b'::text AS node, *
FROM monitoring_b.pg_stat_statements;

The view will simply unify all the data and add an additional column at the beginning.

PostgreSQL performance monitoring for clusters

Our system is now ready to use and we can already start to run useful analysis:

SELECT *,
       sum(total_time) OVER () AS cluster_total_time,
       sum(total_time) OVER (PARTITION BY node) AS node_total_time,
       round((100 * total_time / sum(total_time) OVER ())::numeric, 4) AS percentage_total,
       round((100 * total_time / sum(total_time) OVER (PARTITION BY node))::numeric, 4) AS percentage_node
FROM   monitoring_performance;

The query will return all the raw data and add some percentage numbers on top of this data.
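If the window-function arithmetic is not obvious, here is the same computation in plain Python on toy data (the node names and timing numbers below are invented purely for illustration):

```python
# Mirror of the query's window functions on toy data:
# sum(total_time) OVER ()                  -> cluster-wide total
# sum(total_time) OVER (PARTITION BY node) -> per-node total
rows = [("localhost", 60.0), ("localhost", 40.0),
        ("server a", 80.0), ("server b", 20.0)]

cluster_total_time = sum(t for _, t in rows)

node_total_time = {}
for node, t in rows:
    node_total_time[node] = node_total_time.get(node, 0.0) + t

for node, t in rows:
    percentage_total = 100 * t / cluster_total_time
    percentage_node = 100 * t / node_total_time[node]
    print(node, t, round(percentage_total, 4), round(percentage_node, 4))
```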

If you are interested in further information on pg_stat_statements, consider reading the following blog post too:

A bad query plan ...


We all know that you have to pay a price for a new index you create — data modifying operations will become slower, and indexes use disk space. That’s why you try to have no more indexes than you actually need.

But most people think that SELECT performance will never suffer from a new index. The worst that can happen is that the new index is not used.

However, this is not always true, as I have seen more than once in the field. I’ll show you such a case and tell you what you can do about it.

An example

We will experiment with this table:

CREATE TABLE skewed (
   sort        integer NOT NULL,
   category    integer NOT NULL,
   interesting boolean NOT NULL
);

INSERT INTO skewed
   SELECT i, i%1000, i>50000
   FROM generate_series(1, 1000000) i;

CREATE INDEX skewed_category_idx ON skewed (category);


We want to find the first twenty interesting rows in category 42:

SELECT * FROM skewed
WHERE interesting AND category = 42
ORDER BY sort
LIMIT 20;

This performs fine:

                             QUERY PLAN
 Limit  (cost=2528.75..2528.80 rows=20 width=9)
        (actual time=4.548..4.558 rows=20 loops=1)
   Buffers: shared hit=1000 read=6
   ->  Sort  (cost=2528.75..2531.05 rows=919 width=9)
             (actual time=4.545..4.549 rows=20 loops=1)
         Sort Key: sort
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=1000 read=6
         ->  Bitmap Heap Scan on skewed
                        (cost=19.91..2504.30 rows=919 width=9)
                        (actual time=0.685..4.108 rows=950 loops=1)
               Recheck Cond: (category = 42)
               Filter: interesting
               Rows Removed by Filter: 50
               Heap Blocks: exact=1000
               Buffers: shared hit=1000 read=6
               ->  Bitmap Index Scan on skewed_category_idx
                        (cost=0.00..19.68 rows=967 width=0)
                        (actual time=0.368..0.368 rows=1000 loops=1)
                     Index Cond: (category = 42)
                     Buffers: shared read=6
 Planning time: 0.371 ms
 Execution time: 4.625 ms

PostgreSQL uses the index to find the 1000 rows with category 42, filters out the ones that are not interesting, sorts them and returns the top 20. 5 milliseconds is fine.

A new index makes things go sour

Now we add an index that can help us with sorting. That is definitely interesting if we often have to find the top 20 results:

CREATE INDEX skewed_sort_idx ON skewed (sort);

And suddenly, things are looking worse:

                          QUERY PLAN
 Limit  (cost=0.42..736.34 rows=20 width=9)
        (actual time=21.658..28.568 rows=20 loops=1)
   Buffers: shared hit=374 read=191
   ->  Index Scan using skewed_sort_idx on skewed
                (cost=0.42..33889.43 rows=921 width=9)
                (actual time=21.655..28.555 rows=20 loops=1)
         Filter: (interesting AND (category = 42))
         Rows Removed by Filter: 69022
         Buffers: shared hit=374 read=191
 Planning time: 0.507 ms
 Execution time: 28.632 ms

What happened?

PostgreSQL thinks that it will be faster if it examines the rows in sort order using the index until it has found 20 matches. But it doesn’t know how the matching rows are distributed with respect to the sort order, so it is not aware that it will have to scan 69042 rows until it has found its 20 matches (see Rows Removed by Filter: 69022 in the above execution plan).
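You can reproduce both numbers from the way the table was filled (sort = i, category = i % 1000, interesting for i > 50000): a short Python loop that walks the rows in sort order, just like the index scan does, until it has 20 matches:

```python
# Walk the rows in "sort" order, exactly like the index scan does,
# and count how many rows are touched before the 20th match.
scanned = 0
matches = 0
for i in range(1, 1000001):           # sort = i
    scanned += 1
    if i % 1000 == 42 and i > 50000:  # category = 42 AND interesting
        matches += 1
        if matches == 20:
            break

print(scanned)            # 69042 rows examined by the index scan
print(scanned - matches)  # 69022, the "Rows Removed by Filter"
```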

What can we do to get the better plan?

PostgreSQL v10 has added extended statistics to track how the values in different columns are correlated, but that does not track the distribution of the values, so it will not help us here.

There are two workarounds:

  1. Drop the index that misleads PostgreSQL. If that is possible, it is a simple solution. But usually one cannot do that, because the index is either used to enforce a unique constraint, or it is needed by other queries that benefit from it.
  2. Rewrite the query so that PostgreSQL cannot use the offending index. Of the many possible solutions for this, I want to present two:
    • A subquery with OFFSET 0:
      SELECT *
      FROM (SELECT * FROM skewed
            WHERE interesting AND category = 42
            OFFSET 0) q
      ORDER BY sort
      LIMIT 20;

      This makes use of the fact that OFFSET and LIMIT prevent a subquery from being “flattened”, even if they have no effect on the query result.

    • Using an expression as sort key:
      SELECT * FROM skewed
      WHERE interesting AND category = 42
      ORDER BY sort + 0
      LIMIT 20;

      This makes use of the fact that PostgreSQL cannot deduce that sort + 0 is the same as sort. Remember that PostgreSQL is extensible, and you can define your own + operator!

I have already written about timeseries and PostgreSQL in the past. However, recently I stumbled across an interesting problem which caught my attention: Sometimes you might want to find “periods” of activity in a timeseries. For example: When was a user active? Or when did we receive data? This blog post tries to give you some ideas and shows how you can actually approach this kind of problem.

Loading timeseries data into PostgreSQL

The next listing shows a little bit of sample data, which I used to write the SQL code you are about to see:

CREATE TABLE t_series (t date, data int);

COPY t_series FROM stdin DELIMITER ';';
2018-03-01;12
2018-03-02;43
2018-03-03;9
2018-03-04;13
2018-03-09;23
2018-03-10;26
2018-03-11;28
2018-03-14;21
2018-03-15;15
\.
For the sake of simplicity I just used two columns in my example. Note that my timeseries is not continuous but interrupted. There are three continuous periods in this set of data. Our goal is to find and isolate them to do analysis on each of those continuous periods.


Preparing for timeseries analysis

When dealing with timeseries one of the most important things to learn is how to “look forward and backward”. In most cases it is simply vital to compare the current line with the previous line. To do that in PostgreSQL (or in SQL in general) you can make use of the “lag” function:

test=# SELECT *, lag(t, 1) OVER (ORDER BY t)
       FROM t_series;
          t | data | lag
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 2018-03-01
 2018-03-03 |    9 | 2018-03-02
 2018-03-04 |   13 | 2018-03-03
 2018-03-09 |   23 | 2018-03-04
 2018-03-10 |   26 | 2018-03-09
 2018-03-11 |   28 | 2018-03-10
 2018-03-14 |   21 | 2018-03-11
 2018-03-15 |   15 | 2018-03-14
(9 rows)

As you can see the last column contains the date of the previous row. Now: How does PostgreSQL know what the previous row actually is? The “ORDER BY”-clause will define exactly that.

Based on the query you have just seen, it is easy to calculate the size of the gap from one row to the next:

test=# SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
       FROM t_series;
          t | data | diff 
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 1
 2018-03-03 |    9 | 1
 2018-03-04 |   13 | 1
 2018-03-09 |   23 | 5
 2018-03-10 |   26 | 1
 2018-03-11 |   28 | 1
 2018-03-14 |   21 | 3
 2018-03-15 |   15 | 1
(9 rows)

What we see now is the difference from one row to the next. That is pretty useful, because we can start to create our rules: When do we consider a segment to be over, and how long a gap do we allow before we consider the data to belong to the next segment / period?

In my example I decided that every gap longer than 2 days should trigger the creation of a new segment (or period). The next challenge is therefore to assign numbers to the periods which we are about to detect. Once this is done, we can easily aggregate on the result. I decided to do this with the sum function. Remember: When NULL is fed to an aggregate, the aggregate ignores the input. Otherwise it simply adds up the input.

Here is the query:

test=# SELECT *, sum(CASE WHEN diff IS NULL 
                     OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
       FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
             FROM   t_series
       ) AS x;
          t | data | diff | period 
 2018-03-01 |   12 |      | 1
 2018-03-02 |   43 |    1 | 1
 2018-03-03 |    9 |    1 | 1
 2018-03-04 |   13 |    1 | 1
 2018-03-09 |   23 |    5 | 2
 2018-03-10 |   26 |    1 | 2
 2018-03-11 |   28 |    1 | 2
 2018-03-14 |   21 |    3 | 3
 2018-03-15 |   15 |    1 | 3
(9 rows)

As you can see the last column contains the period ID as generated by the sum function in our query. From now on analysis will be pretty simple as we can simply aggregate over this result using a simple subselect as shown in the next statement:

test=# SELECT period, sum(data) 
       FROM (SELECT *, sum(CASE WHEN diff IS NULL 
                    OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
             FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
                   FROM t_series
                  ) AS x
       ) AS y
GROUP BY period 
ORDER BY period;
 period | sum 
      1 | 77
      2 | 77
      3 | 36
(3 rows)

The result displays the sum of all data for each period. Of course you can also do more complicated stuff. However, the important thing is to understand how you can detect various periods of continuous activity.
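The same period-detection logic is easy to restate outside SQL. Here is a Python sketch over the sample data above that reproduces the per-period sums (the loop plays the role of lag() plus the windowed sum of CASE flags):

```python
from datetime import date

# Sample timeseries from above; a gap of more than 2 days starts
# a new period, just like in the SQL version.
series = [
    (date(2018, 3, 1), 12), (date(2018, 3, 2), 43), (date(2018, 3, 3), 9),
    (date(2018, 3, 4), 13), (date(2018, 3, 9), 23), (date(2018, 3, 10), 26),
    (date(2018, 3, 11), 28), (date(2018, 3, 14), 21), (date(2018, 3, 15), 15),
]

period = 0
prev = None
sums = {}
for t, data in series:
    if prev is None or (t - prev).days > 2:
        period += 1              # NULL diff or gap > 2 days: new period
    prev = t
    sums[period] = sums.get(period, 0) + data

print(sums)  # {1: 77, 2: 77, 3: 36}
```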

In PostgreSQL, a view is a virtual table based on an SQL statement. It is an abstraction layer that allows you to access the result of a more complex SQL statement quickly and easily. The fields in a view are fields from one or more real tables in the database. The question many people ask is: if a view is based on a table, what happens if the data structure of the underlying table changes?


To show what PostgreSQL will do, I created a simple table:

view_demo=# CREATE TABLE t_product (
        id         serial,
        name       text,
        price      numeric(16, 4)
);
My table has just three simple columns and does not contain anything special. Here is the layout of the table:

view_demo=# \d t_product
  Table "public.t_product"
 Column |     Type      | Collation | Nullable | Default
 id     | integer       |           | not null | nextval('t_product_id_seq'::regclass)
 name   | text          |           |          |
 price  | numeric(16,4) |           |          |


Making changes to tables and views

The first thing to do in order to get our demo going is to create a view:

view_demo=# CREATE VIEW v AS SELECT * FROM t_product;

The important thing to see here is how PostgreSQL handles the view. In the following listing you can see that the view definition does not contain a “*” anymore. PostgreSQL has silently replaced the “*” with the actual column list. This is important, because it has serious implications:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
 SELECT t_product.id,
    t_product.name,
    t_product.price
   FROM t_product;

What happens if we simply try to rename the table the view is based on:

view_demo=# ALTER TABLE t_product RENAME TO t_cool_product;

view_demo=# \d+ v
View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
 SELECT t_cool_product.id,
    t_cool_product.name,
    t_cool_product.price
   FROM t_cool_product;

As you can see, the view has changed as well. The reason is simple: PostgreSQL does not store the view as a string. Instead, it keeps a binary copy of the definition around, which is largely based on object ids. The beauty is that if the name of a table or column changes, those objects still have the same object id, and therefore there is no problem for the view. The view will not break, become invalid or face deletion.

The same happens when you change the name of a column:

view_demo=# ALTER TABLE t_cool_product
RENAME COLUMN price TO produce_price;

Again the view will not be harmed:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
 SELECT t_cool_product.id,
    t_cool_product.name,
    t_cool_product.produce_price AS price
   FROM t_cool_product;

What is really important and noteworthy here is that the view does not change its output. The columns provided by the view stay the same. In other words: Applications relying on the view won’t break just because some other column has changed somewhere.

What PostgreSQL does behind the scenes

Behind the scenes a view is handled by the rewrite system. In the system catalog there is a table called pg_rewrite, which will store a binary representation of the view:

view_demo=# \d pg_rewrite
  Table "pg_catalog.pg_rewrite"
 Column     | Type         | Collation | Nullable | Default
 rulename   | name         |           | not null |
 ev_class   | oid          |           | not null |
 ev_type    | "char"       |           | not null |
 ev_enabled | "char"       |           | not null |
 is_instead | boolean      |           | not null |
 ev_qual    | pg_node_tree |           |          |
 ev_action  | pg_node_tree |           |          |
  "pg_rewrite_oid_index" UNIQUE, btree (oid)
  "pg_rewrite_rel_rulename_index" UNIQUE, btree (ev_class, rulename)

Basically this is an internal thing. However, I decided to show how it works behind the scenes, as it might be interesting to know.

Views and dropping columns

However, in some cases PostgreSQL has to error out. Suppose somebody wants to drop a column on which a view depends. In this case PostgreSQL has to error out, because it cannot silently delete the column from the view.

view_demo=# ALTER TABLE t_cool_product DROP COLUMN name;
ERROR: cannot drop table t_cool_product column name because other objects depend on it
DETAIL: view v depends on table t_cool_product column name
HINT: Use DROP ... CASCADE to drop the dependent objects too.

In this case PostgreSQL complains that the view cannot be kept around because columns are missing. You can now decide whether to not drop the column or whether to drop the view along with the column.

For those of you out there working with PostgreSQL in a professional way, migrating from Oracle to PostgreSQL might be one of the most beloved tasks available. One of the first things most people will notice, however, is that the data types available in Oracle are not quite the same as in PostgreSQL. This blog will try to shed some light and show how things work.

Data types in Oracle and PostgreSQL

While there are many similarities between Oracle and PostgreSQL, there are a couple of differences which are quite noteworthy. The first thing many people might notice is: PostgreSQL has many more data types than Oracle. As of version 10.0, an empty PostgreSQL database will expose the staggering number of 92 data types. Of course not all of them are useful, and many of them are purely for internal use.

Still: At the end of the day there are simply more data types that can be used by applications.

Let us take a look and see which types can be matched. The following table contains a lot of potential options:


Oracle type                       Possible PostgreSQL types
CHAR                              char, varchar, text
NCHAR                             char, varchar, text
VARCHAR                           char, varchar, text
VARCHAR2                          char, varchar, text, json
NVARCHAR2                         char, varchar, text
CLOB                              char, varchar, text, json
LONG                              char, varchar, text
RAW                               uuid, bytea
BFILE                             bytea (read-only)
NUMBER                            numeric, float4, float8, char, varchar, text
NUMBER(n,m) with m<=0             numeric, float4, float8, int2, int4, int8,
                                  boolean, char, varchar, text
FLOAT                             numeric, float4, float8, char, varchar, text
BINARY_FLOAT                      numeric, float4, float8, char, varchar, text
BINARY_DOUBLE                     numeric, float4, float8, char, varchar, text
DATE                              date, timestamp, timestamptz, char, varchar, text
TIMESTAMP                         date, timestamp, timestamptz, char, varchar, text
TIMESTAMP WITH TIME ZONE          date, timestamp, timestamptz, char, varchar, text
TIMESTAMP WITH LOCAL TIME ZONE    date, timestamp, timestamptz, char, varchar, text
INTERVAL YEAR TO MONTH            interval, char, varchar, text
INTERVAL DAY TO SECOND            interval, char, varchar, text
MDSYS.SDO_GEOMETRY                geometry (see “PostGIS support” below)

As you can see, all types can always be represented as text or varchar. However, this is of course not desirable. There is more: In Oracle there is no such thing as “integer”. Everything is represented as “number”. For example: In Oracle you might see “number(5, 0)”. The way to map this is to use “integer” or “bigint” on the PostgreSQL side. In general, “CPU data types” such as integer and bigint are much faster than “numeric”.

A second important thing is “varchar2”: On the PostgreSQL side it can easily be mapped to varchar or text. A small detail is that in Oracle a varchar2 limit can be specified as a number of bytes or a number of characters. Consider the following example: VARCHAR2(20 BYTE) vs. VARCHAR2(10 CHAR). You have to think twice what to do on the PostgreSQL side, because in Postgres we are always talking about “characters”.
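The difference only shows up with multi-byte encodings, but there it is real. Here is a tiny Python illustration of characters vs. bytes (the sample string is arbitrary):

```python
# PostgreSQL's varchar(n) counts characters; Oracle's VARCHAR2(n BYTE)
# counts bytes, and the two diverge under multi-byte encodings like UTF-8.
s = "Grüße"                       # 5 characters
chars = len(s)                    # what varchar(10) would check
octets = len(s.encode("utf-8"))   # what VARCHAR2(10 BYTE) would check
print(chars, octets)              # 5 7  (ü and ß take two bytes each)
```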

Oracle → PostgreSQL: Automatic conversion

Of course there is no need to do most of the work by hand. The ora_migrator tool will use the oracle_fdw (essentially a database link to Oracle) to map data types for you. ora_migrator will also help with indexes, constraints, and so on, to make migrations as easy and as fast as possible. Also have a look at our Cybertec Enterprise Migrator tool page.



In a recent wrestling match with the Linux “out-of-memory killer” for a Cybertec customer I got acquainted with Linux control groups (“cgroups”), and I want to give you a short introduction how they can be used with PostgreSQL and discuss their usefulness.

Warning: This was done on my RedHat Fedora 27 system running Linux 4.16.5 with cgroups v1 managed by systemd version 234. Both cgroups and systemd's handling of them seem to be undergoing changes, so your mileage may vary considerably. Still, it should be a useful starting point if you want to explore cgroups.

What are Linux cgroups?

From the cgroups manual page:

Control groups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

cgroups are managed with special commands that start with “cg”, but can also be managed through a special cgroups file system and systemd.

Now a running PostgreSQL cluster is a group of processes, so that’s a perfect fit.

There are several subsystems defined (also called “controllers” in cgroups terminology). Of these, the following are interesting for PostgreSQL:

  • memory: useful for limiting the total memory usage
  • blkio: useful for limiting the I/O throughput
  • cpu: useful to define upper and lower limits to the CPU time available for the processes
  • cpuset: useful for binding the processes to a subset of the available CPU cores

Configuring cgroups

During system startup, cgroups are created as defined in the /etc/cgconfig.conf configuration file.

Let’s create a cgroup to build a cage for a PostgreSQL cluster:

group db_cage {
    # user and group "postgres" can manage these cgroups
    perm {
        task {
            uid = postgres;
            gid = postgres;
            fperm = 774;
        }
        admin {
            uid = postgres;
            gid = postgres;
            dperm = 775;
            fperm = 774;
        }
    }

    # limit memory to 1 GB and disable swap
    memory {
        memory.limit_in_bytes = 1G;
        memory.memsw.limit_in_bytes = 1G;
    }

    # limit read and write I/O to 10MB/s each on device 8:0
    blkio {
        blkio.throttle.read_bps_device = "8:0 10485760";
        blkio.throttle.write_bps_device = "8:0 10485760";
    }

    # limit CPU time to 0.25 seconds out of each second
    cpu {
        cpu.cfs_period_us = 1000000;
        cpu.cfs_quota_us = 250000;
    }

    # only CPUs 0-3 and memory node 0 can be used
    cpuset {
        cpuset.cpus = 0-3;
        cpuset.mems = 0;
    }
}
To activate it, run the following as root:

# /usr/sbin/cgconfigparser -l /etc/cgconfig.conf -s 1664

To have that done automatically at server start, I tell systemd to enable the cgconfig service:

# systemctl enable cgconfig
# systemctl start cgconfig
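In case the raw numbers in the configuration file look opaque: the blkio values are bytes per second and the cpu values are microseconds. A quick arithmetic check:

```python
# blkio throttle values are bytes per second: "8:0 10485760" is 10 MB/s
bps_limit = 10 * 1024 * 1024

# the cpu controller grants cfs_quota_us of CPU time per cfs_period_us:
# 250000 us out of every 1000000 us is a quarter of one CPU
cpu_fraction = 250000 / 1000000

print(bps_limit, cpu_fraction)  # 10485760 0.25
```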

Starting PostgreSQL in a cgroup

To start PostgreSQL in the cgroups we defined above, use the cgexec executable (you may have to install an operating system package called libcgroup or libcgroup-tools for that):

$ cgexec -g cpu,memory,blkio:db_cage \
   /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data start

We can confirm that PostgreSQL is running in the correct cgroup:

$ head -1 /var/lib/pgsql/10/data/ 

$ cat /proc/16284/cgroup | egrep '\b(cpu|blkio|memory)\b'

To change a running process to a cgroup, you can use cgclassify (but then you have to change all running PostgreSQL processes).

Using cgroups with systemd

systemd provides a simpler interface to Linux cgroups, so you don’t have to do any of the above. systemd can create cgroups “on the fly” for the services it starts.

If your PostgreSQL service is called postgresql-10, simply create a file /etc/systemd/system/postgresql-10.service like this:

# include the original service file rather than editing it
# so that changes don't get lost during an upgrade
.include /usr/lib/systemd/system/postgresql-10.service

[Service]
# limit memory to 1GB
# sets "memory.limit_in_bytes"
MemoryMax=1G
# limit memory + swap space to 1GB
# this should set "memory.memsw.limit_in_bytes" but it only
# works with cgroups v2 ...
# MemorySwapMax=1G

# limit read I/O on block device 8:0 to 10MB per second
# sets "blkio.throttle.read_bps_device"
IOReadBandwidthMax=/dev/block/8:0 10M
# limit write I/O on block device 8:0 to 10MB per second
# sets "blkio.throttle.write_bps_device"
IOWriteBandwidthMax=/dev/block/8:0 10M

# limit CPU time to a quarter of the available
# sets "cpu.cfs_quota_us"
CPUQuota=25%
# there are no settings to control "cpuset" cgroups

Now you have to tell systemd that you changed the configuration and restart the service:

# systemctl daemon-reload
# systemctl restart postgresql-10

As you see, not all cgroup settings are available with systemd. As a workaround, you can define cgroups in /etc/cgconfig.conf and use cgexec to start the service.

How useful are cgroups for PostgreSQL?

I would say that it depends on the subsystem.


memory

At first glance, it sounds interesting to limit memory usage with cgroups. But there are several drawbacks:

  • If PostgreSQL is allowed to use swap space, it will start swapping when the memory quota is exceeded.
  • If PostgreSQL is not allowed to use swap space, the Linux OOM killer will kill PostgreSQL when the quota is exceeded (alternatively, you can configure the cgroup so that the process is paused until memory is freed, but this might never happen).
  • The memory quota also limits the amount of memory available for the file system cache.

None of this is very appealing — there is no option to make malloc fail so that PostgreSQL can handle the problem.

I think that it is better to use the traditional way of limiting PostgreSQL’s memory footprint by setting shared_buffers, work_mem and max_connections so that PostgreSQL won’t use too much memory.

That also has the advantage that all PostgreSQL clusters on the machine can share the file system cache, so that clusters that need it can get more of that resource, while no cluster can become completely memory starved (everybody is guaranteed shared_buffers).
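As an illustrative sketch (the numbers are invented, not a recommendation), such a budget in postgresql.conf might look like this; the rough worst case is shared_buffers plus about max_connections times work_mem:

```
# postgresql.conf -- illustrative memory budget
shared_buffers = 1GB     # fixed shared memory, guaranteed to this cluster
work_mem = 8MB           # per sort/hash operation, so a single query can
                         # use a multiple of this value
max_connections = 100    # bounds the number of concurrent backends
```

Everything not claimed by these settings remains available to the file system cache, which all clusters on the machine share.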


I/O

I think that cgroups are a very useful way of limiting I/O bandwidth for PostgreSQL.

The only drawback is maybe that PostgreSQL cannot use more than its allotted quota even if the I/O system is idle.


CPU

cgroups are also a good way of limiting CPU usage by a PostgreSQL cluster.

Again, it would be nice if PostgreSQL were allowed to exceed its quota if the CPUs are idle.


cpuset

This is only useful on big machines with a NUMA architecture. On such machines, binding PostgreSQL to the CPUs and memory of one NUMA node will make sure that all memory access is local to that node and consequently fast.

You can thus partition your NUMA machine between several PostgreSQL clusters.

To be OR not to be...
© Laurenz Albe 2018


PostgreSQL query tuning is our daily bread at Cybertec, and once you have done some of that, you’ll start bristling whenever you see an OR in a query, because it is usually a cause of bad query performance.

Of course there is a reason why there is an OR in SQL, and if you cannot avoid it, you have to use it. But you should be aware of the performance implications.

In this article I’ll explore “good” and “bad” ORs and what you can do to avoid the latter.

A little sample schema

We’ll use this simple setup for demonstration:

CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL);

INSERT INTO a (id, a_val)
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

ALTER TABLE a ADD PRIMARY KEY (id);

CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL);

INSERT INTO b (id, b_val)
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

ALTER TABLE b ADD PRIMARY KEY (id);



Suppose that we want to run queries with equality and LIKE conditions on the text columns, so we need some indexes:

CREATE INDEX a_val_idx ON a(a_val text_pattern_ops);
CREATE INDEX b_val_idx ON b(b_val text_pattern_ops);

Have a look at the documentation if you don’t understand text_pattern_ops.

The “good” OR

An OR is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.

So if your OR appears in a CASE expression in the SELECT list, don’t worry.
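For example, in this (made-up) query the OR only determines a computed column and filters nothing, so it cannot affect the chosen execution plan:

```sql
SELECT id,
       CASE WHEN a_val LIKE 'a%' OR a_val LIKE 'b%'
            THEN 'a or b'
            ELSE 'other'
       END AS category
FROM a;
```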

Unfortunately you usually find the OR where it hurts: in the WHERE clause.

The “bad” OR

Now for an example of an OR in a WHERE clause that is still pretty nice:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR a_val = 'value 42';

                        QUERY PLAN                         
-----------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_val_idx
               Index Cond: (a_val = 'value 42'::text)
(7 rows)

PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem memory.

A multi-column index on (id, a_val) won’t help at all with this query, so there is no cheaper way to execute it.

IN is better than OR

Now for a more stupid variant of the above query:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR id = 4711;

                 QUERY PLAN                 
--------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (id = 4711))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 4711)
(7 rows)

Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id IN (42, 4711);

                    QUERY PLAN                     
---------------------------------------------------
 Index Only Scan using a_pkey on a
   Index Cond: (id = ANY ('{42,4711}'::integer[]))
(2 rows)

You see? As soon as you get rid of the OR, an efficient index scan can be used!

You might say that this is good for equality conditions, but what about the following query:

SELECT id FROM a
WHERE a_val LIKE 'something%'
   OR a_val LIKE 'other%';

To improve that query, observe that the PostgreSQL optimizer rewrote the IN in the previous query to = ANY.

This is a case of the standard SQL “quantified comparison predicate”: <comparison operator> ANY is true if the comparison is TRUE for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).

Now LIKE is a comparison operator as well, so we can write:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on a
   Filter: (a_val ~~ ANY ('{something%,other%}'::text[]))
(2 rows)

Unfortunately, the index cannot be used here.

pg_trgm to the rescue

But we are not at the end of our wits yet! There is such a wealth of index types in PostgreSQL; let’s try a different one. For this, we need the pg_trgm extension:

CREATE EXTENSION pg_trgm;

Then we can create a GIN trigram index on the column:

CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops);

Now things are looking better:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
   ->  Bitmap Index Scan on a_val_trgm_idx
         Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
(4 rows)

Feel the power of trigram indexes!

Note 1: This index can also be used if the search pattern starts with %.

Note 2: The GIN index can become quite large. To avoid that, you can also use a GiST index, which is much smaller, but less efficient to search.

The “ugly” OR

Things become really bad if OR combines conditions from different tables:

EXPLAIN (COSTS off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
   OR b.id = 42;

                 QUERY PLAN                  
---------------------------------------------
 Merge Join
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.id = 42) OR (b.id = 42))
   ->  Index Scan using a_pkey on a
   ->  Index Scan using b_pkey on b
(5 rows)

Here we have to compute the complete join between the two tables and afterwards filter out all rows not matching the condition. In our example, that means computing 100000 rows only to throw away the 99999 that do not match the condition.

Avoiding the ugly OR

Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:

EXPLAIN (COSTS off)
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE a.id = 42
UNION
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE b.id = 42;

                        QUERY PLAN                        
----------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: a.id, a.a_val, b.b_val
         ->  Append
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b
                           Index Cond: (id = 42)
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a a_1
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b b_1
                           Index Cond: (id = 42)
(14 rows)

Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION will reduce them to one row.

If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL instead of UNION, because that doesn’t have to do the extra processing to remove duplicates.
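For our query, a sketch of that variant could add a mutual-exclusion condition to make the two branches provably disjoint:

```sql
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
UNION ALL
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE b.id = 42
  AND a.id <> 42;   -- excludes rows the first branch already returned
```

With the branches disjoint, PostgreSQL does not need the Sort/Unique step at the top of the plan.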

When using this trick, you should be aware that rewriting a query in that fashion does not always result in an equivalent query: if the original query can return identical rows, these would be removed by the UNION. In our case, we don’t have to worry, because the primary keys were included in the query result. I find that this is hardly ever a problem in practice.

Recently I did some PostgreSQL consulting in the Berlin area (Germany) when I stumbled over an interesting request: How can data be shared across function calls in PostgreSQL? I recalled one of the older features of PostgreSQL (15+ years old or so) that solves the issue. Here is how it works.

Stored procedures in PostgreSQL

As many of you might know, PostgreSQL allows you to write stored procedures in many different languages. Two of the more popular ones are Perl and Python, which have been around for quite some time. The cool thing is: both languages offer a way to share variables across function calls. In Perl you can make use of the global %_SHARED hash, which is always there.

Here is an example:

CREATE FUNCTION set_var(int) RETURNS int AS $$
   $_SHARED{'some_name'} = $_[0];
   return $_[0];
$$ LANGUAGE plperl;

The code assigns a value to some_name and returns the assigned value. Some other function can then make use of this data, which is stored inside your database connection. Here is an example:

CREATE FUNCTION increment_var() RETURNS int AS $$
   $_SHARED{'some_name'} += 1;
   return $_SHARED{'some_name'};
$$ LANGUAGE plperl;

This function will simply increment the value and return it. As you can see the code is pretty simple and easy to write.
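In PL/Python the analogous mechanism is the GD dictionary, which is shared by all PL/Python functions in a session. A sketch (assuming the plpython3u language is installed; the function names are mine, not from the example above):

```sql
-- GD lives as long as the database session
CREATE FUNCTION set_var_py(val integer) RETURNS integer AS $$
    GD['some_name'] = val
    return val
$$ LANGUAGE plpython3u;

CREATE FUNCTION increment_var_py() RETURNS integer AS $$
    GD['some_name'] += 1
    return GD['some_name']
$$ LANGUAGE plpython3u;
```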

Assigning shared variables

The following listing shows how the code can be used. The first call will assign a value while the second one will simply increment that value:

test=# SELECT set_var(5);
 set_var 
---------
       5
(1 row)

test=# SELECT increment_var(), increment_var();
 increment_var | increment_var 
---------------+---------------
             6 |             7
(1 row)

It is especially noteworthy here that the second function call already sees the change made by the first one, which is exactly what we want here.

Shared variables and transactions

When working with shared variables in PL/Perl or PL/Python you have to keep in mind that, unlike almost everything else in PostgreSQL, these changes are not transactional. Even if you roll back a transaction, you can observe that the values stay incremented:

test=# BEGIN;
test=# SELECT increment_var(), increment_var();
 increment_var | increment_var 
---------------+---------------
             8 |             9
(1 row)
test=# ROLLBACK;
test=# SELECT increment_var(), increment_var();
 increment_var | increment_var 
---------------+---------------
            10 |            11
(1 row)

This behavior actually makes shared values a nice thing to have if you want to preserve data across transactions.

Many people have asked for this feature for years, and PostgreSQL 11 will finally have it: I am of course talking about CREATE PROCEDURE. Traditionally, PostgreSQL has provided all the means to write functions (which were often simply called “stored procedures”). However, in a function you cannot really run transactions – all you can do is use exceptions, which are basically savepoints. Inside a function you cannot just commit a transaction or open a new one. CREATE PROCEDURE will change all that and provide you with the means to run transactions in procedural code.


CREATE PROCEDURE will allow you to write procedures just like in most other modern databases. The syntax is quite simple and definitely not hard to use:

Description: define a new procedure

CREATE [ OR REPLACE ] PROCEDURE
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
  { LANGUAGE lang_name
    | TRANSFORM { FOR TYPE type_name } [, ... ]
    | SET configuration_parameter { TO value | = value | FROM CURRENT }
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
  } …

As you can see there are a couple of similarities to CREATE FUNCTION so things should be really easy for most end users.

The next example shows a simple procedure:

db11=# CREATE PROCEDURE test_proc()
       LANGUAGE plpgsql
AS $$
BEGIN
    CREATE TABLE a (aid int);
    CREATE TABLE b (bid int);
    COMMIT;
    CREATE TABLE c (cid int);
    ROLLBACK;
END;
$$;
CREATE PROCEDURE

The first thing to notice here is that there is a COMMIT inside the procedure. In classical PostgreSQL functions this is not possible for a simple reason. Consider the following code:

SELECT func(id) FROM large_table;

What would happen if some of those function calls simply committed? Total chaos would be the consequence. Therefore real transactions are only possible inside a “procedure”, which is never invoked as part of a query the way a function is. Also note that there is more than one transaction going on inside our procedure. A procedure is therefore more like a “batch job”.

The following example shows how to call the procedure I have just implemented:

db11=# CALL test_proc();

The first two tables were committed – the third table has not been created because of the rollback inside the procedure:

db11=# \d
        List of relations
 Schema | Name | Type  | Owner 
--------+------+-------+-------
 public | a    | table | hs
 public | b    | table | hs
(2 rows)

To me CREATE PROCEDURE is definitely one of the most desirable features of PostgreSQL 11.0. The upcoming release will be great and many people will surely welcome CREATE PROCEDURE the way I do.

As some of you might know, Cybertec has been doing PostgreSQL consulting, tuning and 24×7 support for many years now. However, one should not only watch what is going on in the PostgreSQL market; it also makes sense to look left and right to figure out what the rest of the world is up to. I tend to read a lot about Oracle, the cloud, their new strategy and all that on the Internet these days.

Oracle, PostgreSQL and the cloud

What I found in an Oracle blog post seems to sum up what is going on: “This is a huge change in Oracle’s business model, and the company is taking the transformation very seriously”. The trouble is: Oracle’s business model might change a lot more than they think, because these days more and more people are actually moving to PostgreSQL. So yes, the quote is correct – it is a “huge change” in their business model, but maybe not the one they intended.

As license fees and support costs seem to be ever increasing for Oracle customers, more and more people step back and reflect. The logical consequence is: people are moving to PostgreSQL in ever greater numbers. So why not give them a tool to move to PostgreSQL as fast as possible? In many cases an Oracle database is only used as a simple data store. The majority of systems only contain tables, constraints, indexes, foreign keys and so on. Sure, many databases also contain procedures and more sophisticated stuff. However, I have seen countless systems that are simply trivial.

Migrating to PostgreSQL

While there are tools such as ora2pg out there that help people move from Oracle to PostgreSQL, I have found them in general a bit cumbersome and not as efficient as they could be.
So why not build a migration tool that makes migration as simple as possible? Why not migrate a simple database with a single SQL statement? We want to HURT Oracle sales people after all ;).

ora_migrator is doing exactly that:

CREATE EXTENSION ora_migrator;
SELECT oracle_migrate(server => 'oracle', only_schemas => '{HANS,PAUL}');

Unless you have stored procedures in PL/SQL or some other hyper fancy stuff, that is already it. The entire migration will be done in a SINGLE transaction.

How does it work?

Oracle to PostgreSQL migration


First of all, ora_migrator connects to Oracle using oracle_fdw, which is the real foundation of the software. Then it reads the Oracle system catalog and stores a copy of the table definitions, index definitions and all that in PostgreSQL. oracle_fdw does all the data type mapping and so on for us. Why do we copy the Oracle system catalog to a local table instead of just using it directly? During the migration process you might want to make changes to the underlying data structures (a commercial GUI to do that is available). You might not want to copy table definitions and so on blindly.

Once the definitions are duplicated and you have made your modifications (which might not happen too often – most people prefer a 1:1 copy), ora_migrator actually creates the desired tables in PostgreSQL, loads the data from Oracle, creates indexes and adds constraints. Your transaction commits and, voilà, your migration is done.

One word about Oracle stored procedures

Experience has shown that a fully automated migration process for stored procedures usually fails. We have used ora2pg for a long time, and automatic procedure conversion has always been a pain. Therefore we decided to skip this point entirely. To avoid nasty bugs, bad performance or simply mistakes, we think that it makes more sense to port stored procedures manually. In the past, automatic conversion has led to a couple of subtle bugs (mostly NULL handling issues), so we do not attempt it in the first place. In real life this does not add too much additional work to the migration process – the small amount of additional time is worth spending on quality, in my judgement.

ora_migrator is free

If you want to move from Oracle to PostgreSQL, you can download ora_migrator for free from our GitHub page. We have also created a GUI for ora_migrator, which is available along with our support and consulting services.

More information on ora_migrator can be found here:

Merry xmax!
Based on an image by Robo Android under the Creative Commons Attribution 2.0 license


xmax is a PostgreSQL system column that is used to implement Multiversion Concurrency Control (MVCC). The documentation is somewhat terse:

The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It is possible for this column to be nonzero in a visible row version. That usually indicates that the deleting transaction hasn’t committed yet, or that an attempted deletion was rolled back.

While this is true, the presence of “weasel words” like “usually” indicates that there is more to the picture. This is what I want to explore in this article.

The two meanings of xmax

I’ll follow the PostgreSQL convention to use the word “tuple” for “row version” (remember that PostgreSQL implements MVCC by holding several versions of a row in the table).

xmax is actually used for two purposes in PostgreSQL:

  • It stores the transaction ID (“xid”) of the transaction that deleted the tuple, like the documentation says. Remember that UPDATE also deletes a tuple in PostgreSQL!
  • It stores row locks on the tuple.

This is possible, because a tuple cannot be locked and deleted at the same time: normal locks are only held for the duration of the transaction, and a tuple is deleted only after the deleting transaction has committed.

Storing row locks on the tuple itself has one vast advantage: it avoids overflows of the “lock table”. The lock table is a fixed-size area that is allocated in shared memory during server startup and could easily be too small to hold all the row locks from a bigger transaction. To cope with this, you’d need techniques like “lock escalation” that are difficult to implement, impact concurrency and lead to all kinds of nasty problems.

There is also a down side to storing row locks in the tuple: each row lock modifies the table, and the modified blocks have to be written back to persistent storage. This means that row locks lead to increased I/O load.

But a number of questions remain:

  • How can you tell which of the two meanings xmax has in a tuple?
  • How can I tell if xmax is valid or not?
  • How exactly are row locks stored?

We will dive deeper in the rest of this article to answer these questions.

An example

In the following, I’ll use a simple schema for demonstration. I am using PostgreSQL v10, but this hasn’t changed in the last couple of releases.

CREATE TABLE parent(
   p_id integer PRIMARY KEY,
   p_val text
);

CREATE TABLE child(
   c_id integer PRIMARY KEY,
   p_id integer REFERENCES parent(p_id),
   c_val text
);

INSERT INTO parent (p_id, p_val)
   VALUES (42, 'parent');

Now let’s look at the relevant system columns:

session1=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax | p_id | p_val  
-------+-------+------+------+--------
 (0,1) | 53163 |    0 |   42 | parent
(1 row)

This is the simple view we expect to see: ctid is the physical location of the tuple (Block 0, item 1), xmin contains the ID of the inserting transaction, and xmax is zero because the row is alive.

Now let’s start a transaction in session 1 and delete the row:

session1=# BEGIN;
session1=# DELETE FROM parent WHERE p_id = 42;

Then session 2 can see that xmax has changed:

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax  | p_id | p_val  
-------+-------+-------+------+--------
 (0,1) | 53163 | 53165 |   42 | parent
(1 row)

But wait, we change our mind in session 1 and undo the change:

session1=# ROLLBACK;

To find out what xmax means in this case, let’s call in the cavalry.

pageinspect comes to the rescue

PostgreSQL comes with a “contrib” module called pageinspect that can be used to examine the actual contents of table blocks. It is installed with

CREATE EXTENSION pageinspect;

We’ll use two of its functions:

  • get_raw_page: reads one 8kB block from the table’s data file
  • heap_page_item_attrs: for each tuple in a data block, this returns the tuple metadata and data

Needless to say, these functions are superuser only.

heap_page_item_attrs returns an integer field named t_infomask that contains several flags, some of which tell us the meaning of xmax. To get the full story, you’ll have to read the code in src/include/access/htup_details.h.

Let’s have a look at table block 0, which contains our tuple:

session2=# SELECT lp, 
       t_ctid AS ctid,
       t_xmin AS xmin,
       t_xmax AS xmax,
       (t_infomask & 128)::boolean AS xmax_is_lock,
       (t_infomask & 1024)::boolean AS xmax_committed,
       (t_infomask & 2048)::boolean AS xmax_rolled_back,
       (t_infomask & 4096)::boolean AS xmax_multixact,
       t_attrs[1] AS p_id,
       t_attrs[2] AS p_val
FROM heap_page_item_attrs(
        get_raw_page('parent', 0),
        'parent'
     );
-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53165
xmax_is_lock     | f
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | f
p_id             | \x2a000000
p_val            | \x0f706172656e74

The attributes p_id and p_val are displayed in binary form.

The information in the tuple doesn’t tell us whether the transaction that set xmax has been committed or rolled back, so we (and PostgreSQL when it inspects the tuple) still don’t know what to make of xmax. That is because PostgreSQL does not update the tuple when a transaction ends.

To resolve that uncertainty, we’d have to look at the commit log that stores the state of each transaction. The commit log is persisted in the pg_xact subdirectory of the PostgreSQL data directory (pg_clog in older versions).
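As an aside, from PostgreSQL v10 on you can consult the commit log from SQL with the txid_status function (a sketch; the function returns 'committed', 'aborted' or 'in progress', and NULL once the transaction is older than what the commit log retains):

```sql
SELECT txid_status(53165);
```

For the transaction we rolled back above, this should report 'aborted'.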

A SELECT that modifies data

We cannot examine the commit log from SQL, but when any database transaction reads the tuple and looks up the commit log, it will persist the result in the tuple so that the next reader does not have to do it again (this is called “setting the hint bits”).

So all we have to do is to read the tuple:

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax  | p_id | p_val  
-------+-------+-------+------+--------
 (0,1) | 53163 | 53165 |   42 | parent
(1 row)

This changes the information stored in the tuple. Let’s have another look with pageinspect:

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53165
xmax_is_lock     | f
xmax_committed   | f
xmax_rolled_back | t
xmax_multixact   | f
p_id             | \x2a000000
p_val            | \x0f706172656e74

The SELECT statement has set the flags on the tuple, and now we can see that xmax is from a transaction that was rolled back and should be ignored.

As an aside, that means that the first reader of a tuple modifies the tuple, causing surprising write I/O. This is annoying, but it is the price we pay for instant COMMIT and ROLLBACK. It is also the reason why it is a good idea to either use COPY … (FREEZE) to bulk load data or to VACUUM the data after loading.
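A sketch of the COPY … (FREEZE) variant (the table and file names are made up; FREEZE requires that the table was created or truncated in the same transaction):

```sql
BEGIN;
TRUNCATE loadtable;
COPY loadtable FROM '/tmp/data.csv' WITH (FORMAT csv, FREEZE true);
COMMIT;
```

Rows loaded this way are written with the hint bits (and frozen xmin) already set, so later readers cause no surprise write I/O.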

Now we know how to determine if xmax is from a valid transaction or not, but what about row locks?

Row locks and xmax

Rows are locked by data modifying statements, but there is a simple way to lock a row without inserting or deleting tuples:

session1=# BEGIN;
session1=# SELECT * FROM parent WHERE p_id = 42 FOR UPDATE;

 p_id | p_val  
------+--------
   42 | parent
(1 row)

Now what does pageinspect tell us?

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53166
xmax_is_lock     | t
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | f
p_id             | \x2a000000
p_val            | \x0f706172656e74

We see that the row is locked. In this case, it is a FOR UPDATE lock, but the query does not distinguish between the lock modes for simplicity’s sake. You’ll notice that xmax again is neither committed nor rolled back, but we don’t care because we know it is a row lock.

xmax is set to 53166, which is the transaction ID of the locking transaction. Let’s close that transaction to continue:

session1=# COMMIT;

PostgreSQL does not have to set hint bits here — if xmax contains a row lock, the row is active, no matter what the state of the locking transaction is.

If you think you have seen it all, you are in for a surprise.

Multiple locks on a single row

In the previous example we have seen that PostgreSQL stores the transaction ID of the locking transaction in xmax. This works fine as long as only a single transaction holds a lock on that tuple. With exclusive locks like the one that SELECT … FOR UPDATE takes, this is always the case.

But PostgreSQL also knows other row locks, for example the FOR KEY SHARE lock that is taken on the destination of a foreign key constraint to prevent concurrent modification of the keys in that row. Let’s insert some rows in the child table:

session1=# BEGIN;
session1=# INSERT INTO child (c_id, p_id, c_val)
   VALUES (1, 42, 'first');

session2=# BEGIN;
session2=# INSERT INTO child (c_id, p_id, c_val)
   VALUES (2, 42, 'second');

Now let’s look at our parent row again:

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 3
xmax_is_lock     | t
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | t
p_id             | \x2a000000
p_val            | \x0f706172656e74

That “3” in xmax cannot be a transaction ID (those are already past 53000 in this database and keep counting up), and the xmax_multixact flag is set.

This is the ID of a “multiple transaction object”, called “multixact” in PostgreSQL jargon for lack of a better word. Such objects are created whenever more than one transaction locks a row, and their IDs are also counted up (you can tell that this database needs few of them). Multixacts are persisted in the pg_multixact subdirectory of the data directory.

You can get information about the members of a multixact with the undocumented pg_get_multixact_members function:

session2=# SELECT * FROM pg_get_multixact_members('3');
  xid  | mode  
-------+-------
 53167 | keysh
 53168 | keysh
(2 rows)

Now you really know what is in an xmax!