A bad query plan ...
© Laurenz Albe 2018

 

We all know that you have to pay a price for a new index you create — data modifying operations will become slower, and indexes use disk space. That’s why you try to have no more indexes than you actually need.

But most people think that SELECT performance will never suffer from a new index. The worst that can happen is that the new index is not used.

However, this is not always true, as I have seen more than once in the field. I’ll show you such a case and tell you what you can do about it.

An example

We will experiment with this table:

CREATE TABLE skewed (
   sort        integer NOT NULL,
   category    integer NOT NULL,
   interesting boolean NOT NULL
);

INSERT INTO skewed
   SELECT i, i%1000, i>50000
   FROM generate_series(1, 1000000) i;

CREATE INDEX skewed_category_idx ON skewed (category);

VACUUM (ANALYZE) skewed;

We want to find the first twenty interesting rows in category 42:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM skewed
WHERE interesting AND category = 42
ORDER BY sort
LIMIT 20;

This performs fine:

                             QUERY PLAN
--------------------------------------------------------------------
 Limit  (cost=2528.75..2528.80 rows=20 width=9)
        (actual time=4.548..4.558 rows=20 loops=1)
   Buffers: shared hit=1000 read=6
   ->  Sort  (cost=2528.75..2531.05 rows=919 width=9)
             (actual time=4.545..4.549 rows=20 loops=1)
         Sort Key: sort
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=1000 read=6
         ->  Bitmap Heap Scan on skewed
                        (cost=19.91..2504.30 rows=919 width=9)
                        (actual time=0.685..4.108 rows=950 loops=1)
               Recheck Cond: (category = 42)
               Filter: interesting
               Rows Removed by Filter: 50
               Heap Blocks: exact=1000
               Buffers: shared hit=1000 read=6
               ->  Bitmap Index Scan on skewed_category_idx
                        (cost=0.00..19.68 rows=967 width=0)
                        (actual time=0.368..0.368 rows=1000 loops=1)
                     Index Cond: (category = 42)
                     Buffers: shared read=6
 Planning time: 0.371 ms
 Execution time: 4.625 ms

PostgreSQL uses the index to find the 1000 rows with category 42, filters out the ones that are not interesting, sorts them and returns the top 20. 5 milliseconds is fine.

A new index makes things go sour

Now we add an index that can help us with sorting. That is definitely interesting if we often have to find the top 20 results:

CREATE INDEX skewed_sort_idx ON skewed (sort);

And suddenly, things are looking worse:

                          QUERY PLAN
-------------------------------------------------------------
 Limit  (cost=0.42..736.34 rows=20 width=9)
        (actual time=21.658..28.568 rows=20 loops=1)
   Buffers: shared hit=374 read=191
   ->  Index Scan using skewed_sort_idx on skewed
                (cost=0.42..33889.43 rows=921 width=9)
                (actual time=21.655..28.555 rows=20 loops=1)
         Filter: (interesting AND (category = 42))
         Rows Removed by Filter: 69022
         Buffers: shared hit=374 read=191
 Planning time: 0.507 ms
 Execution time: 28.632 ms

What happened?

PostgreSQL thinks that it will be faster if it examines the rows in sort order using the index until it has found 20 matches. But it doesn’t know how the matching rows are distributed with respect to the sort order, so it is not aware that it will have to scan 69042 rows until it has found its 20 matches (see Rows Removed by Filter: 69022 in the above execution plan).

What can we do to get the better plan?

PostgreSQL v10 has added extended statistics to track how the values in different columns are correlated, but that does not track the distribution of the values, so it will not help us here.
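For reference, this is roughly what extended statistics look like; a minimal sketch with a made-up statistics name. It records functional dependencies between category and interesting, but not where the matching rows lie in the sort order, which is why it cannot save us here:

CREATE STATISTICS skewed_cat_int_stats (dependencies)
   ON category, interesting FROM skewed;

ANALYZE skewed;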

There are two workarounds:

  1. Drop the index that misleads PostgreSQL. If that is possible, it is a simple solution. But usually one cannot do that, because the index is either used to enforce a unique constraint, or it is needed by other queries that benefit from it.
  2. Rewrite the query so that PostgreSQL cannot use the offending index. Of the many possible solutions for this, I want to present two:
    • A subquery with OFFSET 0:
      SELECT *
      FROM (SELECT * FROM skewed
            WHERE interesting AND category = 42
            OFFSET 0) q
      ORDER BY sort
      LIMIT 20;
      

      This makes use of the fact that OFFSET and LIMIT prevent a subquery from being “flattened”, even if they have no effect on the query result.

    • Using an expression as sort key:
      SELECT * FROM skewed
      WHERE interesting AND category = 42
      ORDER BY sort + 0
      LIMIT 20;
      

      This makes use of the fact that PostgreSQL cannot deduce that sort + 0 is the same as sort. Remember that PostgreSQL is extensible, and you can define your own + operator!

I have already written about timeseries and PostgreSQL in the past. However, recently I stumbled across an interesting problem, which caught my attention: Sometimes you might want to find “periods” of activity in a timeseries. For example: When was a user active? Or when did we receive data? This blog post tries to give you some ideas and shows how you can actually approach this kind of problem.

Loading timeseries data into PostgreSQL

The next listing shows a little bit of sample data, which I used to write the SQL code you are about to see:

CREATE TABLE t_series (t date, data int);

COPY t_series FROM stdin DELIMITER ';';
2018-03-01;12
2018-03-02;43
2018-03-03;9
2018-03-04;13
2018-03-09;23
2018-03-10;26
2018-03-11;28
2018-03-14;21
2018-03-15;15
\.

For the sake of simplicity I just used two columns in my example. Note that my timeseries is not continuous but interrupted. There are three continuous periods in this set of data. Our goal is to find and isolate them to do analysis on each of those continuous periods.


Preparing for timeseries analysis

When dealing with timeseries one of the most important things to learn is how to “look forward and backward”. In most cases it is simply vital to compare the current line with the previous line. To do that in PostgreSQL (or in SQL in general) you can make use of the “lag” function:

test=# SELECT *, lag(t, 1) OVER (ORDER BY t)
       FROM t_series;
          t | data | lag
------------+------+----------
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 2018-03-01
 2018-03-03 |    9 | 2018-03-02
 2018-03-04 |   13 | 2018-03-03
 2018-03-09 |   23 | 2018-03-04
 2018-03-10 |   26 | 2018-03-09
 2018-03-11 |   28 | 2018-03-10
 2018-03-14 |   21 | 2018-03-11
 2018-03-15 |   15 | 2018-03-14
(9 rows)

As you can see the last column contains the date of the previous row. Now: How does PostgreSQL know what the previous row actually is? The “ORDER BY”-clause will define exactly that.

Based on the query you have just seen, it is easy to calculate the size of the gap from one row to the next:

test=# SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
       FROM t_series;
          t | data | diff 
------------+------+------
 2018-03-01 |   12 | 
 2018-03-02 |   43 | 1
 2018-03-03 |    9 | 1
 2018-03-04 |   13 | 1
 2018-03-09 |   23 | 5
 2018-03-10 |   26 | 1
 2018-03-11 |   28 | 1
 2018-03-14 |   21 | 3
 2018-03-15 |   15 | 1
(9 rows)

What we see now is the difference from one row to the next. That is pretty useful, because we can start to create our rules: when do we consider a segment to be over, and how long a gap do we allow for before we consider it to be the next segment / period?

In my example I decided that every gap longer than 2 days should trigger the creation of a new segment (or period). The next challenge is therefore to assign numbers to the periods we are about to detect. Once this is done, we can easily aggregate on the result. The way I have decided to do this is by using the sum function. Remember: when NULL is fed to an aggregate, the aggregate will ignore the input. Otherwise it will simply add up the input.

Here is the query:

test=# SELECT *, sum(CASE WHEN diff IS NULL
                     OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
       FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
             FROM   t_series
       ) AS x;
          t | data | diff | period 
------------+------+------+--------
 2018-03-01 |   12 |      | 1
 2018-03-02 |   43 |    1 | 1
 2018-03-03 |    9 |    1 | 1
 2018-03-04 |   13 |    1 | 1
 2018-03-09 |   23 |    5 | 2
 2018-03-10 |   26 |    1 | 2
 2018-03-11 |   28 |    1 | 2
 2018-03-14 |   21 |    3 | 3
 2018-03-15 |   15 |    1 | 3
(9 rows)

As you can see the last column contains the period ID as generated by the sum function in our query. From now on the analysis is pretty simple: we can aggregate over this result using a simple subselect as shown in the next statement:

test=# SELECT period, sum(data) 
       FROM (SELECT *, sum(CASE WHEN diff IS NULL
                    OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
             FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
                   FROM t_series
                  ) AS x
       ) AS y
GROUP BY period 
ORDER BY period;
 period | sum 
--------+-----
      1 | 77
      2 | 77
      3 | 36
(3 rows)

The result displays the sum of all data for each period. Of course you can also do more complicated stuff. However, the important thing is to understand how you can actually detect various periods of continuous activity.
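If you want to take it one step further, the same subquery can feed other aggregates. Here is a sketch that reports when each period starts and ends along with the total (the column aliases are of course arbitrary):

SELECT period,
       min(t)    AS period_start,
       max(t)    AS period_end,
       count(*)  AS days_with_data,
       sum(data) AS total
FROM (SELECT *, sum(CASE WHEN diff IS NULL
                    OR diff > 2 THEN 1 ELSE NULL END) OVER (ORDER BY t) AS period
      FROM (SELECT *, t - lag(t, 1) OVER (ORDER BY t) AS diff
            FROM t_series
           ) AS x
     ) AS y
GROUP BY period
ORDER BY period;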

In PostgreSQL, a view is a virtual table based on an SQL statement. It is an abstraction layer which allows you to access the result of a more complex SQL statement quickly and easily. The fields in a view are fields from one or more real tables in the database. The question many people ask is: if a view is based on a table, what happens if the data structure of the underlying table changes?

CREATE VIEW in PostgreSQL

To show what PostgreSQL will do, I created a simple table:

view_demo=# CREATE TABLE t_product
(
        id         serial,
        name       text,
        price      numeric(16, 4)
);
CREATE TABLE

My table has just three simple columns and does not contain anything special. Here is the layout of the table:

view_demo=# \d t_product
  Table "public.t_product"
 Column |     Type      | Collation | Nullable | Default
--------+---------------+-----------+----------+---------------------------------------
 id     | integer       |           | not null | nextval('t_product_id_seq'::regclass)
 name   | text          |           |          |
 price  | numeric(16,4) |           |          |

 

Making changes to tables and views

The first thing to do in order to get our demo going is to create a view:

view_demo=# CREATE VIEW v AS SELECT * FROM t_product;
CREATE VIEW

The important thing to see here is how PostgreSQL handles the view. In the following listing you can see that the view definition does not contain a “*” anymore. PostgreSQL has silently replaced the “*” with the actual column list. This is important because it has serious implications:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
--------+---------------+-----------+----------+---------+----------+-------------
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
  SELECT t_product.id,
         t_product.name,
         t_product.price
  FROM t_product;

What happens if we simply try to rename the table the view is based on:

view_demo=# ALTER TABLE t_product RENAME TO t_cool_product;
ALTER TABLE

view_demo=# \d+ v
View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
--------+---------------+-----------+----------+---------+----------+-------------
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
  SELECT t_cool_product.id,
         t_cool_product.name,
         t_cool_product.price
  FROM t_cool_product;

As you can see, the view has changed as well. The reason for that is simple: PostgreSQL does not store the view as a string. Instead, it keeps a binary copy of the definition around, which is largely based on object ids. The beauty is that if the name of a table or a column changes, those objects still have the same object id, and therefore there is no problem for the view. The view will not break, become invalid or face deletion.

The same happens when you change the name of a column:

view_demo=# ALTER TABLE t_cool_product
RENAME COLUMN price TO produce_price;
ALTER TABLE

Again the view will not be harmed:

view_demo=# \d+ v
  View "public.v"
 Column | Type          | Collation | Nullable | Default | Storage  | Description
--------+---------------+-----------+----------+---------+----------+-------------
 id     | integer       |           |          |         | plain    |
 name   | text          |           |          |         | extended |
 price  | numeric(16,4) |           |          |         | main     |
View definition:
   SELECT t_cool_product.id,
          t_cool_product.name,
          t_cool_product.produce_price AS price
   FROM t_cool_product;

What is really important and noteworthy here is that the view does not change its output. The columns provided by the view stay the same. In other words: applications relying on the view won’t break just because some other column has changed somewhere.

What PostgreSQL does behind the scenes

Behind the scenes a view is handled by the rewrite system. In the system catalog there is a table called pg_rewrite, which will store a binary representation of the view:

view_demo=# \d pg_rewrite
  Table "pg_catalog.pg_rewrite"
 Column     | Type         | Collation | Nullable | Default
------------+--------------+-----------+----------+---------
 rulename   | name         |           | not null |
 ev_class   | oid          |           | not null |
 ev_type    | "char"       |           | not null |
 ev_enabled | "char"       |           | not null |
 is_instead | boolean      |           | not null |
 ev_qual    | pg_node_tree |           |          |
 ev_action  | pg_node_tree |           |          |
Indexes:
  "pg_rewrite_oid_index" UNIQUE, btree (oid)
  "pg_rewrite_rel_rulename_index" UNIQUE, btree (ev_class, rulename)

Basically this is an internal thing. However, I decided to show how it works behind the scenes, as it might be interesting to know.
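If you are curious, you can peek at the stored rule for our view. This is just a sketch for the adventurous; the ev_action column contains a long and fairly unreadable node tree:

SELECT ev_class::regclass AS view_name,
       rulename,
       ev_action
FROM pg_rewrite
WHERE ev_class = 'v'::regclass;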

Views and dropping columns

However, in some cases PostgreSQL has to error out. Suppose somebody wants to drop a column on which a view depends. In this case PostgreSQL has to error out because it cannot silently delete the column from the view.

view_demo=# ALTER TABLE t_cool_product DROP COLUMN name;
ERROR: cannot drop table t_cool_product column name because other objects depend on it
DETAIL: view v depends on table t_cool_product column name
HINT: Use DROP ... CASCADE to drop the dependent objects too.

In this case PostgreSQL complains that the view cannot be kept around because a column it needs would be missing. You can now decide to either keep the column or drop the view along with the column.
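If you really want to get rid of the column, a sketch of the two options looks like this; note that CASCADE silently drops the dependent view as well, so use it with care:

-- option 1: drop the column and every object that depends on it
ALTER TABLE t_cool_product DROP COLUMN name CASCADE;

-- option 2: drop the view explicitly first, then the column
DROP VIEW v;
ALTER TABLE t_cool_product DROP COLUMN name;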

For those of you out there working with PostgreSQL in a professional way, migrating from Oracle to PostgreSQL might be one of the most beloved tasks available. One of the first things most people will notice, however, is that those data types available in Oracle might not be quite the same in PostgreSQL. This blog will try to shed some light and show how things work.

Data types in Oracle and PostgreSQL

While there are many similarities between Oracle and PostgreSQL there are a couple of differences, which are quite noteworthy. The first thing many people might notice is: PostgreSQL has many more data types than Oracle. As of version 10.0 an empty PostgreSQL database exposes the staggering number of 92 data types. Of course not all of them are useful and many of them are purely for internal use.

Still: at the end of the day there are just more data types that can be used by applications, which is of course a good thing.

Let us take a look and see which types can be matched. The following table contains a list of potential options:

 

 Oracle type                    | Possible PostgreSQL types
--------------------------------+--------------------------------------------------------------------------
 CHAR                           | char, varchar, text
 NCHAR                          | char, varchar, text
 VARCHAR                        | char, varchar, text
 VARCHAR2                       | char, varchar, text, json
 NVARCHAR2                      | char, varchar, text
 CLOB                           | char, varchar, text, json
 LONG                           | char, varchar, text
 RAW                            | uuid, bytea
 BLOB                           | bytea
 BFILE                          | bytea (read-only)
 LONG RAW                       | bytea
 NUMBER                         | numeric, float4, float8, char, varchar, text
 NUMBER(n,m) with m<=0          | numeric, float4, float8, int2, int4, int8, boolean, char, varchar, text
 FLOAT                          | numeric, float4, float8, char, varchar, text
 BINARY_FLOAT                   | numeric, float4, float8, char, varchar, text
 BINARY_DOUBLE                  | numeric, float4, float8, char, varchar, text
 DATE                           | date, timestamp, timestamptz, char, varchar, text
 TIMESTAMP                      | date, timestamp, timestamptz, char, varchar, text
 TIMESTAMP WITH TIME ZONE       | date, timestamp, timestamptz, char, varchar, text
 TIMESTAMP WITH LOCAL TIME ZONE | date, timestamp, timestamptz, char, varchar, text
 INTERVAL YEAR TO MONTH         | interval, char, varchar, text
 INTERVAL DAY TO SECOND         | interval, char, varchar, text
 MDSYS.SDO_GEOMETRY             | geometry (see “PostGIS support” below)

As you can see, all types can always be represented as text and as varchar. However, this is of course not desirable. But there is more: in Oracle there is no such thing as “integer”. Everything is represented as “number”. For example: in Oracle you might see “number(5, 0)”. The way to map this is to use “integer” or “bigint” on the PostgreSQL side. In general, “CPU data types” such as integer and bigint are much faster than “numeric”.

A second important thing is “varchar2”: on the PostgreSQL side it can easily be mapped to varchar or text. A small detail is that in Oracle the length of a varchar2 can be given as a number of bytes or a number of characters. Consider the following example: VARCHAR2(20 BYTE) vs. VARCHAR2(10 CHAR). You have to think twice about what to do on the PostgreSQL side, because in Postgres we are always talking about “characters”.
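To illustrate the mapping, here is a hypothetical Oracle table and one possible PostgreSQL equivalent (table and column names are made up):

-- Oracle:
--   CREATE TABLE customer (
--      id    NUMBER(10,0),
--      name  VARCHAR2(20 CHAR),
--      note  CLOB
--   );

-- one possible PostgreSQL equivalent:
CREATE TABLE customer (
   id    bigint,       -- NUMBER(10,0) does not always fit into integer
   name  varchar(20),  -- the length limit is in characters in PostgreSQL
   note  text          -- CLOB maps nicely to text
);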

Oracle → PostgreSQL: Automatic conversion

Of course there is no need to do most of the work by hand. The ora_migrator tool (https://github.com/cybertec-postgresql/ora_migrator) will use the oracle_fdw (= database link to Oracle) to map data types for you. ora_migrator will also help with indexes, constraints, and so on to make migrations as easy and as fast as possible. Also have a look at our Cybertec Enterprise Migrator tool page.

caged elephant
© Laurenz Albe 2018

 

In a recent wrestling match with the Linux “out-of-memory killer” for a Cybertec customer I got acquainted with Linux control groups (“cgroups”), and I want to give you a short introduction how they can be used with PostgreSQL and discuss their usefulness.

Warning: This was done on my RedHat Fedora 27 system running Linux 4.16.5 with cgroups v1 managed by systemd version 234. Both cgroups and systemd's handling of them seem to be undergoing changes, so your mileage may vary considerably. Still, it should be a useful starting point if you want to explore cgroups.

What are Linux cgroups?

From the cgroups manual page:

Control groups, usually referred to as cgroups, are a Linux kernel feature which allow processes to be organized into hierarchical groups whose usage of various types of resources can then be limited and monitored.

cgroups are managed with special commands that start with “cg”, but can also be managed through a special cgroups file system and systemd.

Now a running PostgreSQL cluster is a group of processes, so that’s a perfect fit.

There are several subsystems defined (also called “controllers” in cgroups terminology). Of these, the following are interesting for PostgreSQL:

  • memory: useful for limiting the total memory usage
  • blkio: useful for limiting the I/O throughput
  • cpu: useful to define upper and lower limits to the CPU time available for the processes
  • cpuset: useful for binding the processes to a subset of the available CPU cores

Configuring cgroups

During system startup, cgroups are created as defined in the /etc/cgconfig.conf configuration file.

Let’s create a cgroup to build a cage for a PostgreSQL cluster:

group db_cage {
    # user and group "postgres" can manage these cgroups
    perm {
        task {
            uid = postgres;
            gid = postgres;
            fperm = 774;
        }
        admin {
            uid = postgres;
            gid = postgres;
            dperm = 775;
            fperm = 774;
        }
    }

    # limit memory to 1 GB and disable swap
    memory {
        memory.limit_in_bytes = 1G;
        memory.memsw.limit_in_bytes = 1G;
    }

    # limit read and write I/O to 10MB/s each on device 8:0
    blkio {
        blkio.throttle.read_bps_device = "8:0 10485760";
        blkio.throttle.write_bps_device = "8:0 10485760";
    }

    # limit CPU time to 0.25 seconds out of each second
    cpu {
        cpu.cfs_period_us = 1000000;
        cpu.cfs_quota_us = 250000;
    }

    # only CPUs 0-3 and memory node 0 can be used
    cpuset {
        cpuset.cpus = 0-3;
        cpuset.mems = 0;
    }
}

To activate it, run the following as root:

# /usr/sbin/cgconfigparser -l /etc/cgconfig.conf -s 1664

To have that done automatically at server start, I tell systemd to enable the cgconfig service:

# systemctl enable cgconfig
# systemctl start cgconfig

Starting PostgreSQL in a cgroup

To start PostgreSQL in the cgroups we defined above, use the cgexec executable (you may have to install an operating system package called libcgroup or libcgroup-tools for that):

$ cgexec -g cpu,memory,blkio:db_cage \
   /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data start

We can confirm that PostgreSQL is running in the correct cgroup:

$ head -1 /var/lib/pgsql/10/data/postmaster.pid 
16284

$ cat /proc/16284/cgroup | egrep '\b(cpu|blkio|memory)\b'
10:cpu,cpuacct:/db_cage
9:blkio:/db_cage
4:memory:/db_cage

To change a running process to a cgroup, you can use cgclassify (but then you have to change all running PostgreSQL processes).

Using cgroups with systemd

systemd provides a simpler interface to Linux cgroups, so you don’t have to do any of the above. systemd can create cgroups “on the fly” for the services it starts.

If your PostgreSQL service is called postgresql-10, simply create a file /etc/systemd/system/postgresql-10.service like this:

# include the original service file rather than editing it
# so that changes don't get lost during an upgrade
.include /usr/lib/systemd/system/postgresql-10.service

[Service]
# limit memory to 1GB
# sets "memory.limit_in_bytes"
MemoryMax=1G
# limit memory + swap space to 1GB
# this should set "memory.memsw.limit_in_bytes" but it only
# works with cgroups v2 ...
# MemorySwapMax=1G

# limit read I/O on block device 8:0 to 10MB per second
# sets "blkio.throttle.read_bps_device"
IOReadBandwidthMax=/dev/block/8:0 10M
# limit write I/O on block device 8:0 to 10MB per second
# sets "blkio.throttle.write_bps_device"
IOWriteBandwidthMax=/dev/block/8:0 10M

# limit CPU time to a quarter of the available
# sets "cpu.cfs_quota_us"
CPUQuota=25%

# there are no settings to control "cpuset" cgroups

Now you have to tell systemd that you changed the configuration and restart the service:

# systemctl daemon-reload
# systemctl restart postgresql-10

As you see, not all cgroup settings are available with systemd. As a workaround, you can define cgroups in /etc/cgconfig.conf and use cgexec to start the service.

How useful are cgroups for PostgreSQL?

I would say that it depends on the subsystem.

memory

At first glance, it sounds interesting to limit memory usage with cgroups. But there are several drawbacks:

  • If PostgreSQL is allowed to use swap space, it will start swapping when the memory quota is exceeded.
  • If PostgreSQL is not allowed to use swap space, the Linux OOM killer will kill PostgreSQL when the quota is exceeded (alternatively, you can configure the cgroup so that the process is paused until memory is freed, but this might never happen).
  • The memory quota also limits the amount of memory available for the file system cache.

None of this is very appealing — there is no option to make malloc fail so that PostgreSQL can handle the problem.

I think that it is better to use the traditional way of limiting PostgreSQL’s memory footprint by setting shared_buffers, work_mem and max_connections so that PostgreSQL won’t use too much memory.

That also has the advantage that all PostgreSQL clusters on the machine can share the file system cache, so that clusters that need it can get more of that resource, while no cluster can become completely memory starved (everybody is guaranteed shared_buffers).
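As a purely illustrative sketch, those limits are set with ordinary configuration parameters; the values below are arbitrary and have to be tuned for your workload:

ALTER SYSTEM SET shared_buffers = '1GB';
ALTER SYSTEM SET work_mem = '32MB';
ALTER SYSTEM SET max_connections = 100;

-- work_mem takes effect after a reload,
-- shared_buffers and max_connections require a restart
SELECT pg_reload_conf();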

blkio

I think that cgroups are a very useful way of limiting I/O bandwidth for PostgreSQL.

The only drawback is maybe that PostgreSQL cannot use more than its allotted quota even if the I/O system is idle.

cpu

cgroups are also a good way of limiting CPU usage by a PostgreSQL cluster.

Again, it would be nice if PostgreSQL were allowed to exceed its quota if the CPUs are idle.

cpuset

This is only useful on big machines with a NUMA architecture. On such machines, binding PostgreSQL to the CPUs and memory of one NUMA node will make sure that all memory access is local to that node and consequently fast.

You can thus partition your NUMA machine between several PostgreSQL clusters.

To be OR not to be...
© Laurenz Albe 2018

 

PostgreSQL query tuning is our daily bread at Cybertec, and once you have done some of that, you’ll start bristling whenever you see an OR in a query, because they are usually the cause for bad query performance.

Of course there is a reason why there is an OR in SQL, and if you cannot avoid it, you have to use it. But you should be aware of the performance implications.

In this article I’ll explore “good” and “bad” ORs and what you can do to avoid the latter.

A little sample schema

We’ll use this simple setup for demonstration:

CREATE TABLE a(id integer NOT NULL, a_val text NOT NULL);

INSERT INTO a
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

CREATE TABLE b(id integer NOT NULL, b_val text NOT NULL);

INSERT INTO b
   SELECT i, md5(i::text)
   FROM generate_series(1, 100000) i;

ALTER TABLE a ADD PRIMARY KEY (id);
ALTER TABLE b ADD PRIMARY KEY (id);
ALTER TABLE b ADD FOREIGN KEY (id) REFERENCES a;

VACUUM (ANALYZE) a;
VACUUM (ANALYZE) b;

Suppose that we want to run queries with equality and LIKE conditions on the text columns, so we need some indexes:

CREATE INDEX a_val_idx ON a(a_val text_pattern_ops);
CREATE INDEX b_val_idx ON b(b_val text_pattern_ops);

Have a look at the documentation if you don’t understand text_pattern_ops.

The “good” OR

An OR is fine in most parts of an SQL query: if it is not used to filter out rows from your query result, it will have no negative effect on query performance.

So if your OR appears in a CASE expression in the SELECT list, don’t worry.

Unfortunately you usually find the OR where it hurts: in the WHERE clause.

The “bad” OR

Now for an example of an OR in a WHERE clause that is still pretty nice:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR a_val = 'value 42';

                        QUERY PLAN                         
-----------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (a_val = 'value 42'::text))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_val_idx
               Index Cond: (a_val = 'value 42'::text)
(7 rows)

PostgreSQL can actually use an index scan for the query, because it can combine the bitmaps for both indexes with a “bitmap OR”.
Note, however, that a bitmap index scan is more expensive than a normal index scan, since it has to build the bitmap. Moreover, it uses much more RAM; each of these bitmaps can use up to work_mem memory.

A multi-column index on (id, a_val) won’t help at all with this query, so there is no cheaper way to execute it.

IN is better than OR

Now for a more stupid variant of the above query:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id = 42
   OR id = 4711;

                 QUERY PLAN                 
--------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: ((id = 42) OR (id = 4711))
   ->  BitmapOr
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 42)
         ->  Bitmap Index Scan on a_pkey
               Index Cond: (id = 4711)
(7 rows)

Again, a bitmap index scan is used. But there is a simple method to rewrite that query without the pesky OR:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE id IN (42, 4711);

                    QUERY PLAN                     
---------------------------------------------------
 Index Only Scan using a_pkey on a
   Index Cond: (id = ANY ('{42,4711}'::integer[]))
(2 rows)

You see? As soon as you get rid of the OR, an efficient index scan can be used!

You might say that this is good for equality conditions, but what about the following query:

SELECT id FROM a
WHERE a_val LIKE 'something%'
   OR a_val LIKE 'other%';

To improve that query, observe that the PostgreSQL optimizer rewrote the IN in the previous query to = ANY.

This is a case of the standard SQL “quantified comparison predicate”: <comparison operator> ANY is true if the comparison is TRUE for any of the values on the right-hand side (the standard only defines this for subqueries on the right-hand side, but PostgreSQL extends the syntax to arrays).

Now LIKE is a comparison operator as well, so we can write:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                        QUERY PLAN                        
----------------------------------------------------------
 Seq Scan on a
   Filter: (a_val ~~ ANY ('{something%,other%}'::text[]))
(2 rows)

Unfortunately, the index cannot be used here.

pg_trgm to the rescue

But we are not at the end of our wits yet! There is such a wealth of indexes in PostgreSQL; let’s try a different one. For this, we need the pg_trgm extension:

CREATE EXTENSION pg_trgm;

Then we can create a GIN trigram index on the column:

CREATE INDEX a_val_trgm_idx ON a USING gin (a_val gin_trgm_ops);

Now things are looking better:

EXPLAIN (COSTS off)
SELECT id FROM a
WHERE a_val LIKE ANY (ARRAY['something%', 'other%']);

                             QUERY PLAN                             
--------------------------------------------------------------------
 Bitmap Heap Scan on a
   Recheck Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
   ->  Bitmap Index Scan on a_val_trgm_idx
         Index Cond: (a_val ~~ ANY ('{something%,other%}'::text[]))
(4 rows)

Feel the power of trigram indexes!

Note 1: This index can also be used if the search pattern starts with %

Note 2: The GIN index can become quite large. To avoid that, you can also use a GiST index, which is much smaller, but less efficient to search.
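For reference, the GiST variant of the trigram index is created like this; a sketch with a made-up index name:

CREATE INDEX a_val_trgm_gist_idx ON a USING gist (a_val gist_trgm_ops);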

The “ugly” OR

Things become really bad if OR combines conditions from different tables:

EXPLAIN (COSTS off)
SELECT id, a.a_val, b.b_val
FROM a JOIN b USING (id)
WHERE a.id = 42
   OR b.id = 42;

                 QUERY PLAN                  
---------------------------------------------
 Merge Join
   Merge Cond: (a.id = b.id)
   Join Filter: ((a.id = 42) OR (b.id = 42))
   ->  Index Scan using a_pkey on a
   ->  Index Scan using b_pkey on b
(5 rows)

Here we have to compute the complete join between the two tables and afterwards filter out all rows matching the condition. In our example, that would mean computing 100000 rows only to throw away the 99999 that do not match the condition.

Avoiding the ugly OR

Fortunately, there is an equivalent query that is longer to write, but much cheaper to execute:

EXPLAIN (COSTS off)
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE a.id = 42
UNION
   SELECT id, a.a_val, b.b_val
   FROM a JOIN b USING (id)
   WHERE b.id = 42;

                        QUERY PLAN                        
----------------------------------------------------------
 Unique
   ->  Sort
         Sort Key: a.id, a.a_val, b.b_val
         ->  Append
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b
                           Index Cond: (id = 42)
               ->  Nested Loop
                     ->  Index Scan using a_pkey on a a_1
                           Index Cond: (id = 42)
                     ->  Index Scan using b_pkey on b b_1
                           Index Cond: (id = 42)
(14 rows)

Both parts of the query can make use of efficient index scans and return one row, and since the rows happen to be identical, UNION will reduce them to one row.

If you can be certain that both branches of the query will return distinct sets, it is better to use UNION ALL instead of UNION, because that doesn’t have to do the extra processing to remove duplicates.

When using this trick, you should be aware that rewriting a query in that fashion does not always result in an equivalent query: if the original query can return identical rows, these would be removed by the UNION. In our case, we don’t have to worry, because the primary keys were included in the query result. I find that this is hardly ever a problem in practice.

Recently I did some PostgreSQL consulting in the Berlin area (Germany) when I stumbled over an interesting request: How can data be shared across function calls in PostgreSQL? I recalled one of the older features of PostgreSQL (15+ years old or so) to solve the issue. Here is how it works.

Stored procedures in PostgreSQL

As many of you might know PostgreSQL allows you to write stored procedures in many different languages. Two of the more popular ones are Perl and Python, which have been around for quite some time. The cool thing is: Both languages offer a way to share variables across function calls. In Perl you can make use of the $_SHARED variable, which is always there.

Here is an example:

CREATE OR REPLACE FUNCTION set_var(int)
RETURNS int AS $$
   $_SHARED{'some_name'} = $_[0];
   return $_[0];
$$ LANGUAGE plperl;

What the code does is to assign a value to some_name and return the assigned value. Some other function can then make use of this data, which is stored inside your database connection. Here is an example:

CREATE OR REPLACE FUNCTION increment_var()
RETURNS int AS $$
   $_SHARED{'some_name'} += 1;
   return $_SHARED{'some_name'};
$$ LANGUAGE plperl;

This function will simply increment the value and return it. As you can see the code is pretty simple and easy to write.
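PL/Python offers the same kind of sharing through the global dictionary GD. Here is a minimal sketch, assuming the plpython3u language is installed (function and key names are made up):

CREATE OR REPLACE FUNCTION set_var_py(val int)
RETURNS int AS $$
   # GD is shared by all PL/Python functions in this database session
   GD['some_name'] = val
   return val
$$ LANGUAGE plpython3u;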

Assigning shared variables

The following listing shows how the code can be used. The first call will assign a value to the variable, while the second one will simply increment that value:

test=# SELECT set_var(5);
 set_var
---------
 5
(1 row)

test=# SELECT increment_var(), increment_var();
 increment_var | increment_var
---------------+---------------
             6 | 7
(1 row)

It is especially noteworthy here that the second column will already see the changes made by the first column, which is exactly what we want here.

Shared variables and transactions

When working with shared variables in PL/Perl or PL/Python you have to keep in mind that those changes are not transactional, the way everything else in PostgreSQL is. Even if you roll back a transaction, you can observe that those values stay incremented:

test=# BEGIN;
BEGIN
test=# SELECT increment_var(), increment_var();
increment_var  | increment_var
---------------+---------------
             8 | 9
(1 row)
test=# ROLLBACK;
ROLLBACK
test=# SELECT increment_var(), increment_var();
 increment_var | increment_var
---------------+---------------
            10 | 11
(1 row)

This behavior makes shared values actually a nice thing to have if you want to preserve data across transactions.

Many people have asked for this feature for years and PostgreSQL 11 will finally have it. I am of course talking about CREATE PROCEDURE. Traditionally PostgreSQL has provided all the means to write functions (which were often simply called “stored procedures”). However, in a function you cannot really run transactions – all you can do is to use exceptions, which are basically savepoints. Inside a function you cannot just commit a transaction or open a new one. CREATE PROCEDURE will change all that and provide you with the means to run transactions in procedural code.

Using CREATE PROCEDURE in PostgreSQL

CREATE PROCEDURE will allow you to write procedures just like in most other modern databases. The syntax is quite simple and definitely not hard to use:

db11=# \h CREATE PROCEDURE
Command: CREATE PROCEDURE
Description: define a new procedure
Syntax:
CREATE [ OR REPLACE ] PROCEDURE
    name ( [ [ argmode ] [ argname ] argtype [ { DEFAULT | = } default_expr ] [, ...] ] )
  { LANGUAGE lang_name
    | TRANSFORM { FOR TYPE type_name } [, ... ]
    | [ EXTERNAL ] SECURITY INVOKER | [ EXTERNAL ] SECURITY DEFINER
    | SET configuration_parameter { TO value | = value | FROM CURRENT }
    | AS 'definition'
    | AS 'obj_file', 'link_symbol'
} …

As you can see there are a couple of similarities to CREATE FUNCTION so things should be really easy for most end users.

The next example shows a simple procedure:

db11=# CREATE PROCEDURE test_proc()
       LANGUAGE plpgsql
AS $$
  BEGIN
    CREATE TABLE a (aid int);
    CREATE TABLE b (bid int);
    COMMIT;
    CREATE TABLE c (cid int);
    ROLLBACK;
  END;
$$;
CREATE PROCEDURE

The first thing to notice here is that there is a COMMIT inside the procedure. In classical PostgreSQL functions this is not possible for a simple reason. Consider the following code:

SELECT func(id) FROM large_table;

What would happen if some function call simply commits? Total chaos would be the consequence. Therefore real transactions are only possible inside a “procedure”, which is never called the way a function is executed. Also note that there is more than just one transaction going on inside our procedure. A procedure is therefore more of a “batch job”.

The following example shows, how to call the procedure I have just implemented:

db11=# CALL test_proc();
CALL

The first two tables were committed; the third table has not been created because of the rollback inside the procedure.

db11=# \d
        List of relations
 Schema | Name | Type  | Owner
--------+------+-------+-------
 public | a    | table | hs
 public | b    | table | hs
(2 rows)

To me CREATE PROCEDURE is definitely one of the most desirable features of PostgreSQL 11.0. The upcoming release will be great and many people will surely welcome CREATE PROCEDURE the way I do.

As some of you might know Cybertec has been doing PostgreSQL consulting, tuning and 24×7 support for many years now. However, one should not only watch what is going on in the PostgreSQL market. It also makes sense to look left and right to figure out what the rest of the world is up to. I tend to read a lot about Oracle, the cloud, their new strategy and all that on the Internet these days.

Oracle, PostgreSQL and the cloud

What I found in an Oracle blog post seems to sum up what is going on: “This is a huge change in Oracle’s business model, and the company is taking the transformation very seriously”. The trouble is: Oracle’s business model might change a lot more than they think, because these days more and more people are actually moving to PostgreSQL. So yes, the quote is correct – it is a “huge change” in their business model, but maybe not the one they intended.

As license fees and support costs seem to be ever increasing for Oracle customers, more and more people step back and reflect. The logical consequence is: people are moving to PostgreSQL in ever greater numbers. So why not give people a tool to move to PostgreSQL as fast as possible? In many cases an Oracle database is only used as a simple data store. The majority of systems only contain tables, constraints, indexes, foreign keys and so on. Sure, many databases will also contain procedures and more sophisticated stuff. However, I have seen countless systems which are simply trivial.

Migrating to PostgreSQL

While there are tools such as ora2pg out there, which help people move from Oracle to PostgreSQL, I found them in general a bit cumbersome and not as efficient as they could be.
So why not build a migration tool which makes migration as simple as possible? Why not migrate a simple database with a single SQL statement? We want to HURT Oracle sales people after all ;).

ora_migrator is doing exactly that:

CREATE EXTENSION ora_migrator;
SELECT oracle_migrate(server => 'oracle', only_schemas => '{HANS,PAUL}');

Unless you have stored procedures in PL/SQL or some other hyper fancy stuff, this is already it. The entire migration will be done in a SINGLE transaction.

How does it work?

Oracle to PostgreSQL migration

 

First of all the ora_migrator will connect to Oracle using the oracle_fdw, which is the real foundation of the software. Then we will read the Oracle system catalog and store a copy of the table definitions, index definitions and all that in PostgreSQL. oracle_fdw will do all the data type mapping and so on for us. Why do we copy the Oracle system catalog to a local table and not just use it directly? During the migration process you might want to make changes to the underlying data structure and so on (a commercial GUI to do that is available). You might not want to copy table definitions and so on blindly.

Once the definitions are duplicated and once you have made your modifications (which might not be the case too often – most people prefer a 1:1 copy), the ora_migrator will actually create the desired tables in PostgreSQL, load the data from Oracle, create indexes and add constraints. Your transaction will commit and, voila, your migration is done.

One word about Oracle stored procedures

Experience has shown that a fully automated migration process for stored procedures usually fails. We have used ora2pg for a long time and procedure conversion has always been a pain when attempted automatically. Therefore we decided to skip this point entirely. To avoid nasty bugs, bad performance or simply mistakes, we think that it makes more sense to port stored procedures manually. In the past automatic conversion has led to a couple of subtle bugs (mostly NULL handling issues), and therefore we do not attempt it in the first place. In real life this is not an issue and does not add too much additional work to the migration process – the minimal amount of additional time is worth spending on quality in my judgement.

ora_migrator is free

If you want to move from Oracle to PostgreSQL, you can download ora_migrator for free from our github page. We also have created a GUI for the ora_migrator, which will be available along with our support and consulting services.

More infos on ora_migrator can be found here: https://www.cybertec-postgresql.com/en/products/cybertec-enterprise-migrator-modern-oracle-postgresql-migration/

Merry xmax!
Based on an image by Robo Android under the Creative Commons Attribution 2.0 license

 

xmax is a PostgreSQL system column that is used to implement Multiversion Concurrency Control (MVCC). The documentation is somewhat terse:

The identity (transaction ID) of the deleting transaction, or zero for an undeleted row version. It is possible for this column to be nonzero in a visible row version. That usually indicates that the deleting transaction hasn’t committed yet, or that an attempted deletion was rolled back.

While this is true, the presence of “weasel words” like “usually” indicates that there is more to the picture. This is what I want to explore in this article.

The two meanings of xmax

I’ll follow the PostgreSQL convention to use the word “tuple” for “row version” (remember that PostgreSQL implements MVCC by holding several versions of a row in the table).

xmax is actually used for two purposes in PostgreSQL:

  • It stores the transaction ID (“xid”) of the transaction that deleted the tuple, like the documentation says. Remember that UPDATE also deletes a tuple in PostgreSQL!
  • It stores row locks on the tuple.

This is possible, because a tuple cannot be locked and deleted at the same time: normal locks are only held for the duration of the transaction, and a tuple is deleted only after the deleting transaction has committed.

Storing row locks on the tuple itself has one vast advantage: it avoids overflows of the “lock table”. The lock table is a fixed-size area that is allocated in shared memory during server startup and could easily be too small to hold all the row locks from a bigger transaction. To cope with this, you’d need techniques like “lock escalation” that are difficult to implement, impact concurrency and lead to all kinds of nasty problems.

There is also a down side to storing row locks in the tuple: each row lock modifies the table, and the modified blocks have to be written back to persistent storage. This means that row locks lead to increased I/O load.

But a number of questions remain:

  • How can you tell which of the two meanings xmax has in a tuple?
  • How can I tell if xmax is valid or not?
  • How exactly are row locks stored?

We will dive deeper in the rest of this article to answer these questions.

An example

In the following, I’ll use a simple schema for demonstration. I am using PostgreSQL v10, but this hasn’t changed in the last couple of releases.

CREATE TABLE parent(
   p_id integer PRIMARY KEY,
   p_val text
);

CREATE TABLE child(
   c_id integer PRIMARY KEY,
   p_id integer REFERENCES parent(p_id),
   c_val text
);

INSERT INTO parent (p_id, p_val)
   VALUES (42, 'parent');

Now let’s look at the relevant system columns:

session1=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax | p_id | p_val  
-------+-------+------+------+--------
 (0,1) | 53163 |    0 |   42 | parent
(1 row)

This is the simple view we expect to see: ctid is the physical location of the tuple (Block 0, item 1), xmin contains the ID of the inserting transaction, and xmax is zero because the row is alive.

Now let’s start a transaction in session 1 and delete the row:

session1=# BEGIN;
session1=# DELETE FROM parent WHERE p_id = 42;

Then session 2 can see that xmax has changed:

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax  | p_id | p_val  
-------+-------+-------+------+--------
 (0,1) | 53163 | 53165 |   42 | parent
(1 row)

But wait, we change our mind in session 1 and undo the change:

session1=# ROLLBACK;

To find out what xmax means in this case, let’s call in the cavalry.

pageinspect comes to the rescue

PostgreSQL comes with a “contrib” module called pageinspect that can be used to examine the actual contents of table blocks. It is installed with

CREATE EXTENSION pageinspect;

We’ll use two of its functions:

  • get_raw_page: reads one 8kB block from the table’s data file
  • heap_page_item_attrs: for each tuple in a data block, this returns the tuple metadata and data

Needless to say, these functions are superuser only.

heap_page_item_attrs returns an integer field named t_infomask that contains several flags, some of which tell us the meaning of xmax. To get the full story, you’ll have to read the code in src/include/access/htup_details.h.

Let’s have a look at table block 0, which contains our tuple:

session2=# SELECT lp, 
       t_ctid AS ctid,
       t_xmin AS xmin,
       t_xmax AS xmax,
       (t_infomask & 128)::boolean AS xmax_is_lock,
       (t_infomask & 1024)::boolean AS xmax_committed,
       (t_infomask & 2048)::boolean AS xmax_rolled_back,
       (t_infomask & 4096)::boolean AS xmax_multixact,
       t_attrs[1] AS p_id,
       t_attrs[2] AS p_val
FROM heap_page_item_attrs(
        get_raw_page('parent', 0), 
        'parent'
     );  

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53165
xmax_is_lock     | f
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | f
p_id             | \x2a000000
p_val            | \x0f706172656e74

The attributes p_id and p_val are displayed in binary form.

The information in the tuple doesn’t tell us whether the transaction that set xmax has been committed or rolled back, so we (and PostgreSQL when it inspects the tuple) still don’t know what to make of xmax. That is because PostgreSQL does not update the tuple when a transaction ends.

To resolve that uncertainty, we’d have to look at the commit log that stores the state of each transaction. The commit log is persisted in the pg_xact subdirectory of the PostgreSQL data directory (pg_clog in older versions).

A SELECT that modifies data

We cannot examine the commit log from SQL, but when any database transaction reads the tuple and looks up the commit log, it will persist the result in the tuple so that the next reader does not have to do it again (this is called “setting the hint bits”).

So all we have to do is to read the tuple:

session2=# SELECT ctid, xmin, xmax, p_id, p_val FROM parent;

 ctid  | xmin  | xmax  | p_id | p_val  
-------+-------+-------+------+--------
 (0,1) | 53163 | 53165 |   42 | parent
(1 row)

This changes the information stored in the tuple. Let’s have another look with pageinspect:

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53165
xmax_is_lock     | f
xmax_committed   | f
xmax_rolled_back | t
xmax_multixact   | f
p_id             | \x2a000000
p_val            | \x0f706172656e74

The SELECT statement has set the flags on the tuple, and now we can see that xmax is from a transaction that was rolled back and should be ignored.

As an aside, that means that the first reader of a tuple modifies the tuple, causing surprising write I/O. This is annoying, but it is the price we pay for instant COMMIT and ROLLBACK. It is also the reason why it is a good idea to either use COPY … (FREEZE) to bulk load data or to VACUUM the data after loading.
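A sketch of the COPY … (FREEZE) approach, with a made-up staging table and file path; note that FREEZE only works if the table was created or truncated in the same transaction:

BEGIN;
CREATE TABLE bulk_data (id integer, val text);   -- hypothetical staging table
COPY bulk_data FROM '/tmp/bulk_data.csv' WITH (FORMAT csv, FREEZE);
COMMIT;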

Now we know how to determine if xmax is from a valid transaction or not, but what about row locks?

Row locks and xmax

Rows are locked by data modifying statements, but there is a simple way to lock a row without inserting or deleting tuples:

session1=# BEGIN;
session1=# SELECT * FROM parent WHERE p_id = 42 FOR UPDATE;

 p_id | p_val  
------+--------
   42 | parent
(1 row)

Now what does pageinspect tell us?

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 53166
xmax_is_lock     | t
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | f
p_id             | \x2a000000
p_val            | \x0f706172656e74

We see that the row is locked. In this case, it is a FOR UPDATE lock, but the query does not distinguish between the lock modes for simplicity’s sake. You’ll notice that xmax again is neither committed nor rolled back, but we don’t care because we know it is a row lock.

xmax is set to 53166, which is the transaction ID of the locking transaction. Let’s close that transaction to continue:

session1=# COMMIT;

PostgreSQL does not have to set hint bits here — if xmax contains a row lock, the row is active, no matter what the state of the locking transaction is.

If you think you have seen it all, you are in for a surprise.

Multiple locks on a single row

In the previous example we have seen that PostgreSQL stores the transaction ID of the locking transaction in xmax. This works fine as long as only a single transaction holds a lock on that tuple. With exclusive locks like the one that SELECT … FOR UPDATE takes, this is always the case.

But PostgreSQL also knows other row locks, for example the FOR KEY SHARE lock that is taken on the destination of a foreign key constraint to prevent concurrent modification of the keys in that row. Let’s insert some rows in the child table:

session1=# BEGIN;
session1=# INSERT INTO child (c_id, p_id, c_val)
   VALUES (1, 42, 'first');

session2=# BEGIN;
session2=# INSERT INTO child (c_id, p_id, c_val)
   VALUES (2, 42, 'second');

Now let’s look at our parent row again:

-[ RECORD 1 ]----+-----------------
lp               | 1
ctid             | (0,1)
xmin             | 53163
xmax             | 3
xmax_is_lock     | t
xmax_committed   | f
xmax_rolled_back | f
xmax_multixact   | t
p_id             | \x2a000000
p_val            | \x0f706172656e74

That “3” in xmax cannot be a transaction ID (they keep counting up), and the xmax_multixact flag is set.

This is the ID of a “multiple transaction object”, called “multixact” in PostgreSQL jargon for lack of a better word. Such objects are created whenever more than one transaction locks a row, and their IDs are also counted up (you can tell that this database needs few of them). Multixacts are persisted in the pg_multixact subdirectory of the data directory.

You can get information about the members of a multixact with the undocumented pg_get_multixact_members function:

session2=# SELECT * FROM pg_get_multixact_members('3');
  xid  | mode  
-------+-------
 53167 | keysh
 53168 | keysh
(2 rows)

Now you really know what is in an xmax!

Another way to get rid of unused indexes...
© Laurenz Albe 2018

Why should I get rid of unused indexes?

Everybody knows that a database index is a good thing because it can speed up SQL queries. But this does not come for free.

The disadvantages of indexes are:

  • Indexes use up space. It is not unusual for database indexes to use as much storage space as the data themselves. And the kind of reliable, fast storage you want for a database is not necessarily cheap.
    The space used up by indexes also increases the size and duration of physical backups.
  • Indexes slow down data modification. Whenever you INSERT into or DELETE from a table, all indexes have to be modified, in addition to the table itself (the “heap”).
    And it is much more expensive to modify the complicated data structure of an index than the heap itself, which has its name precisely because it is basically an unordered “pile” of data (and as everybody knows, maintaining order is more work than having a mess). Modifying an indexed table can easily be an order of magnitude more expensive than modifying an unindexed table.
  • Indexes prevent HOT updates. Because of the architecture of PostgreSQL, every UPDATE causes a new row version (“tuple”) to be written, and that causes a new entry in every index on the table.
    This behavior has been dubbed “write amplification” and has drawn a lot of fire. This undesirable effect can be avoided if a) the new tuple fits into the same table block as the old one and b) no indexed row is modified. Then PostgreSQL creates the new tuple as a “Heap Only Tuple” (hence HOT), which is much more efficient and also reduces the work VACUUM has to do.

The many uses of indexes

Now we know that we don’t want unnecessary indexes. The problem is that indexes serve so many purposes that it is difficult to determine if a certain index is needed or not.

Here is a list of all benefits of indexes in PostgreSQL:

  1. Indexes can speed up queries that use indexed columns (or expressions) in the WHERE clause.
    Everybody knows that one!
    The traditional B-tree index supports the <, <=, =, >= and > operators, while the many other index types in PostgreSQL can support more exotic operators like “overlaps” (for ranges or geometries), “distance” (for words) or regular expression matches.
  2. B-tree indexes can speed up the max() and min() aggregates.
  3. B-tree indexes can speed up ORDER BY clauses.
  4. Indexes can speed up joins. This depends on the “join strategy” chosen by the optimizer: hash joins, for example, will never make use of an index.
  5. A B-tree index on the origin of a FOREIGN KEY constraint avoids a sequential scan when rows are deleted (or keys modified) in the target table. A scan on the origin of the constraint is necessary to make sure that the constraint will not be violated by the modification.
  6. Indexes are used to enforce constraints. Unique B-tree indexes are used to enforce PRIMARY KEY and UNIQUE constraints, while exclusion constraints use GiST indexes.
  7. Indexes can provide the optimizer with better value distribution statistics.
    If you create an index on an expression, ANALYZE and the autoanalyze daemon will not only collect statistics for the data distribution in table columns, but also for each expression that occurs in an index. This helps the optimizer to get a good estimate for the “selectivity” of complicated conditions that contain the indexed expression, which causes better plans to be chosen. This is a widely ignored benefit of indexes! (A small example follows this list.)
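
As a small illustration of that last point, consider this sketch (table, column and index names are made up): after ANALYZE, the statistics for the indexed expression appear in pg_stats under the name of the index.

CREATE TABLE person (id integer PRIMARY KEY, name text NOT NULL);

INSERT INTO person
   SELECT i, 'name ' || i % 100
   FROM generate_series(1, 10000) AS i;

CREATE INDEX person_lower_name_idx ON person (lower(name));

ANALYZE person;

-- the statistics for the expression are stored under the index's name
SELECT tablename, attname, n_distinct
FROM pg_stats
WHERE tablename = 'person_lower_name_idx';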

Find the unused indexes!

The following query that we at Cybertec use will show you all indexes that serve none of the above-mentioned purposes.

It makes use of the fact that all uses of indexes in the above list with the exception of the last two result in an index scan.

For completeness’ sake, I have to add that the parameter track_counts has to remain “on” for the query to work, otherwise index usage is not tracked in pg_stat_user_indexes. But you must not change that parameter anyway, otherwise autovacuum will stop working.

To find the indexes that have never been used since the last statistics reset with pg_stat_reset(), use

SELECT s.schemaname,
       s.relname AS tablename,
       s.indexrelname AS indexname,
       pg_relation_size(s.indexrelid) AS index_size
FROM pg_catalog.pg_stat_user_indexes s
   JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0      -- has never been scanned
  AND 0 <> ALL (i.indkey) -- no index column is an expression
  AND NOT EXISTS          -- does not enforce a constraint
         (SELECT 1 FROM pg_catalog.pg_constraint c
          WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;

Some remarks:

  • Don’t do that on your test database, but on the production database!
  • If your software is running at several customer sites, run the query on all of them.
    Different users have different ways of using the software, which can cause different indexes to be used.
  • You can replace s.idx_scan = 0 in the query with a different condition, e.g. s.idx_scan < 10. Indexes that are very rarely used are also good candidates for removal.
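
Once you have convinced yourself that an index from the result is really unneeded, you can drop it without blocking concurrent queries (the index name here is just a placeholder):

-- note: DROP INDEX CONCURRENTLY cannot run inside a transaction block
DROP INDEX CONCURRENTLY some_unused_idx;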

Why VACUUM?

Unlock the tremendous energy of the vacuum!
© xkcd.com (Randall Munroe) under the Creative Commons Attribution-NonCommercial 2.5 License

Whenever rows in a PostgreSQL table are updated or deleted, dead rows are left behind. VACUUM gets rid of them so that the space can be reused. If a table doesn’t get vacuumed, it will get bloated, which wastes disk space and slows down sequential table scans (and – to a smaller extent – index scans).

VACUUM also takes care of freezing table rows to avoid problems when the transaction ID counter wraps around, but that’s a different story.

Normally you don’t have to take care of all that, because the autovacuum daemon built into PostgreSQL does that for you.

The problem

If your tables get bloated, the first thing you check is whether autovacuum has processed them or not:

SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup
    / (n_live_tup
       * current_setting('autovacuum_vacuum_scale_factor')::float8
          + current_setting('autovacuum_vacuum_threshold')::float8)
     DESC
LIMIT 10;

If your bloated table does not show up here because n_dead_tup is zero and last_autovacuum is NULL, you might have a problem with the statistics collector.

If the bloated table is right there on top, but last_autovacuum is NULL, you might need to configure autovacuum to be more aggressive so that it gets done with the table.
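
One way to do that is to lower the autovacuum thresholds and the cost delay for just that table; a sketch (the table name and the values are placeholders, not a recommendation):

ALTER TABLE bloated_tab SET (
   autovacuum_vacuum_scale_factor = 0.01,  -- vacuum when roughly 1% of the rows are dead
   autovacuum_vacuum_cost_delay = 0        -- let autovacuum run at full speed
);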

But sometimes the result will look like this:

 schemaname |    relname   | n_live_tup | n_dead_tup |   last_autovacuum
------------+--------------+------------+------------+---------------------
 laurenz    | vacme        |      50000 |      50000 | 2018-02-22 13:20:16
 pg_catalog | pg_attribute |         42 |        165 |
 pg_catalog | pg_amop      |        871 |        162 |
 pg_catalog | pg_class     |          9 |         31 |
 pg_catalog | pg_type      |         17 |         27 |
 pg_catalog | pg_index     |          5 |         15 |
 pg_catalog | pg_depend    |       9162 |        471 |
 pg_catalog | pg_trigger   |          0 |         12 |
 pg_catalog | pg_proc      |        183 |         16 |
 pg_catalog | pg_shdepend  |          7 |          6 |
(10 rows)

Here autovacuum has recently run, but it didn’t free the dead tuples!

We can verify the problem by running VACUUM (VERBOSE):

test=> VACUUM (VERBOSE) vacme;
INFO:  vacuuming "laurenz.vacme"
INFO:  "vacme": found 0 removable, 100000 nonremovable row versions in
       443 out of 443 pages
DETAIL:  50000 dead row versions cannot be removed yet,
         oldest xmin: 22300
There were 0 unused item pointers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.01 s, system: 0.00 s, elapsed: 0.01 s.

Why won’t VACUUM remove the dead rows?

VACUUM can only remove those row versions (also known as “tuples”) that are not needed any more. A tuple is not needed if the transaction ID of the deleting transaction (as stored in the xmax system column) is older than the oldest transaction still active in the PostgreSQL database (or the whole cluster for shared tables).

This value (22300 in the VACUUM output above) is called the “xmin horizon”.

There are three things that can hold back this xmin horizon in a PostgreSQL cluster:

  1. Long-running transactions:

    You can find those and their xmin value with the following query:

    SELECT pid, datname, usename, state, backend_xmin
    FROM pg_stat_activity
    WHERE backend_xmin IS NOT NULL
    ORDER BY age(backend_xmin) DESC;
    

    You can use the pg_terminate_backend() function to terminate the database session that is blocking your VACUUM.

  2. Abandoned replication slots:

    A replication slot is a data structure that keeps the PostgreSQL server from discarding information that is still needed by a standby server to catch up with the primary.

    If replication is delayed or the standby server is down, the replication slot will prevent VACUUM from deleting old rows.

    You can find all replication slots and their xmin value with this query:

    SELECT slot_name, slot_type, database, xmin
    FROM pg_replication_slots
    ORDER BY age(xmin) DESC;
    

    Use the pg_drop_replication_slot() function to drop replication slots that are no longer needed.

    Note: This can only happen with physical replication if hot_standby_feedback = on. For logical replication there is a similar hazard, but only system catalogs are affected. Examine the column catalog_xmin in that case.

  3. Orphaned prepared transactions:

    During two-phase commit, a distributed transaction is first prepared with the PREPARE statement and then committed with the COMMIT PREPARED statement.

    Once a transaction has been prepared, it is kept “hanging around” until it is committed or aborted. It even has to survive a server restart! Normally, transactions don’t remain in the prepared state for long, but sometimes things go wrong and a prepared transaction has to be removed manually by an administrator.

    You can find all prepared transactions and their xmin value with the following query:

    SELECT gid, prepared, owner, database, transaction AS xmin
    FROM pg_prepared_xacts
    ORDER BY age(transaction) DESC;
    

    Use the ROLLBACK PREPARED SQL statement to remove prepared transactions; a combined sketch of all three remedies follows this list.
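
Putting the three remedies together, the calls look like this (a sketch; the pid, slot name and transaction identifier are placeholders that you would take from the queries above):

-- 1. terminate the session with the long-running transaction
SELECT pg_terminate_backend(12345);

-- 2. drop the abandoned replication slot
SELECT pg_drop_replication_slot('obsolete_slot');

-- 3. get rid of the orphaned prepared transaction
ROLLBACK PREPARED 'orphaned_gid';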