Options for automatic partition creation
© Laurenz Albe 2022

Table partitioning is one of the most popular of the more recent PostgreSQL features. However, there is no support for automatic partition creation yet. This article shows what you can do to remedy that.

Use cases for automatic partition creation

There are essentially two use cases:

  • Create partitions triggered by time, for example for the next month at the end of the current month.
  • Create partitions on demand if a row is inserted that does not fit in any existing partition.

I will call the first option time-triggered partitioning and the second on-demand partitioning.

Automatic partition creation for time-triggered partitioning

The lack of PostgreSQL support in this area is an inconvenience at worst. There are several ways to work around the limitation:

Using the operating system scheduler

You can use the operating system scheduler (cron on Unix-like systems or “Task Scheduler” on Windows). The job would connect to the database using psql and run the appropriate CREATE TABLE and ALTER TABLE ... ATTACH PARTITION statements.

Typically, you would want to schedule such a job directly on the database server machine. Sometimes you cannot do that, for example, because you are using a hosted database and have no access to the operating system. In that case, you can schedule the job on a different machine.
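As a sketch, the statements such a job might run could look like this (the partition name and the monthly range partitioning scheme are assumptions for the example):

```sql
/* create next month's partition; names are hypothetical */
CREATE TABLE tab_2022_06 (LIKE tab INCLUDING ALL);

/* attach it to the partitioned table */
ALTER TABLE tab ATTACH PARTITION tab_2022_06
   FOR VALUES FROM ('2022-06-01') TO ('2022-07-01');
```

Creating the table first and attaching it in a second step has the advantage that, from PostgreSQL v12 on, ATTACH PARTITION only requires a SHARE UPDATE EXCLUSIVE lock on the partitioned table, so it does not block concurrent data modifications.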

Using a job scheduler in the database

PostgreSQL doesn’t have a built-in job scheduling system. However, there are extensions like pg_timetable or pg_cron that allow you to schedule database jobs. Installing a third-party extension adds another dependency to your software architecture, and it may not be possible on a hosted database. On the other hand, it will make your job independent from the operating system scheduler and is less likely to be forgotten when you move the database.

Using pg_partman

pg_partman is a PostgreSQL extension that specializes in partition management. It was useful and necessary in the bad old days before v10 introduced “declarative partitioning”: it created triggers and constraints for you. Today, most of this functionality is unnecessary, but the support for automatic partition creation can still come in handy.

Automatic partition creation for on-demand partitioning

This is more difficult, and here the lack of support from PostgreSQL core is more painful.

The naïve approach to try is to have a BEFORE INSERT trigger that creates and attaches a new partition. However, that will cause one of the following “object in use” errors:

ERROR:  cannot CREATE TABLE .. PARTITION OF "tab"
        because it is being used by active queries in this session

or

ERROR:  cannot ALTER TABLE "tab"
        because it is being used by active queries in this session

Essentially, by the time the INSERT has started, you cannot modify the table definition any more.

LISTEN and NOTIFY as a workaround

What we need is asynchronous processing: you have to create the new partition in a different transaction. Since we certainly cannot create the partition before we INSERT, it has to be afterwards. However, the partition has to exist at the time of the INSERT, so we would seem to be at an impasse.

But there is another option: instead of creating the partition in a single step, we can first create a table and then attach it later, as a partition of the partitioned table. The sequence of events could then be like this:

  • in the BEFORE INSERT trigger, create a new table that is not yet a partition (skipped if the table already exists)
  • insert the new row into the prospective partition instead of into the partitioned table
  • at some later time, attach the newly-created table as a partition

The remaining question is how to perform the attachment “at some later time”. There is a convenient feature in PostgreSQL for asynchronous processing: LISTEN and NOTIFY. A session can register for notifications on a channel (often a table name) with LISTEN and will then be notified asynchronously whenever another backend calls NOTIFY for that channel. The sending backend can add a message string to the notification.
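A quick illustration in psql:

```sql
-- session 1: register for notifications on channel "tab"
LISTEN tab;

-- session 2: notify the channel, with a payload string
NOTIFY tab, '2022-05-01';
```

When session 1 is next idle between commands, psql displays something like: Asynchronous notification "tab" with payload "2022-05-01" received from server process with PID 12345.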

We can use this together with a row level BEFORE INSERT trigger as follows:

  1. create a new table that will become a new partition (if it does not yet exist)
  2. insert the new row into that table
  3. if a new table was created in the first step, NOTIFY with the new partition key as the argument
  4. skip the original INSERT, since we already inserted the row in the partition

Note that this requires that we have a default partition, otherwise inserting a row that does not match an existing partition will fail.

In a different session, a daemon process waits for notifications and attaches the new table to the partitioned table.
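Concretely, if the prospective partitions are named after the partition key (a hypothetical naming scheme for illustration), the statement the daemon runs would look like this:

```sql
/* the generated name contains hyphens, so it must be quoted */
ALTER TABLE tab ATTACH PARTITION "tab_2022-05-01"
   FOR VALUES IN ('2022-05-01');
```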

A sample implementation of automatic partition creation on demand

Here is the definition of the partitioned table:

CREATE TABLE tab (
   id bigint GENERATED ALWAYS AS IDENTITY,
   ts timestamp NOT NULL,
   data text
) PARTITION BY LIST ((ts::date));

CREATE TABLE tab_def PARTITION OF tab DEFAULT;

I chose list partitioning to simplify the listener code, although that is not required.

The trigger function

The trigger looks like this:

CREATE FUNCTION part_trig() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   BEGIN
      /* try to create a table for the new partition */
      EXECUTE
         format(
            'CREATE TABLE %I (LIKE tab INCLUDING INDEXES)',
            'tab_' || to_char(NEW.ts, 'YYYY-MM-DD')
         );

      /*
       * tell listener to attach the partition
       * (only if a new table was created)
       */
      EXECUTE
         format(
            'NOTIFY tab, %L',
            to_char(NEW.ts, 'YYYY-MM-DD')
         );
   EXCEPTION
      WHEN duplicate_table THEN
         NULL;  -- ignore
   END;

   /* insert into the new partition */
   EXECUTE
      format(
         'INSERT INTO %I VALUES ($1.*)',
         'tab_' || to_char(NEW.ts, 'YYYY-MM-DD')
      )
      USING NEW;

   /* skip insert into the partitioned table */
   RETURN NULL;
END;$$;

CREATE TRIGGER part_trig
   BEFORE INSERT ON tab FOR EACH ROW
   WHEN (pg_trigger_depth() < 1)
   EXECUTE FUNCTION part_trig();

To understand why the trigger function returns NULL, read my blog post on the topic. The WHEN clause in the trigger definition avoids infinite recursion: without it, inserting a row into a partition would call the trigger function again.
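To see the trigger in action, we can insert a row for a date that has no partition yet; the row should end up in a newly created table rather than in the default partition:

```sql
INSERT INTO tab (ts, data) VALUES ('2022-05-01 12:00:00', 'something');

/* until the listener attaches it, the new table is not yet part of "tab" */
SELECT data FROM "tab_2022-05-01";
```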

The listener

For the listener, we need to use a PostgreSQL client API that supports asynchronous notification. I chose to use C:

#include <libpq-fe.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <errno.h>

static int attach_partition(PGconn *conn,
                            char *table,
                            char *datestr)
{
    PGresult *res;
    char *esc_tab, *esc_part, *esc_date;
    /* identifiers are at most 63 bytes long */
    char stmt[400], part[64];

    /* escape table and partition name */
    esc_tab = PQescapeIdentifier(conn, table, 63);
    snprintf(part, sizeof(part), "%s_%s", table, datestr);
    esc_part = PQescapeIdentifier(conn, part, 63);
    esc_date = PQescapeLiteral(conn, datestr, 10);

    /* construct statement */
    sprintf(stmt,
            "ALTER TABLE %s ATTACH PARTITION %s FOR VALUES IN (%s)",
            esc_tab,
            esc_part,
            esc_date);
    free(esc_tab);
    free(esc_part);
    free(esc_date);

    /* execute statement */
    if ((res = PQexec(conn, stmt)) == NULL)
    {
        fprintf(stderr,
                "Out of memory sending statement\n");
        return 0;
    }
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr,
                "Error attaching partition: %s\n",
                PQresultErrorMessage(res));
        PQclear(res);
        return 0;
    }

    PQclear(res);

    return 1;
}

int main (int argc, char **argv)
{
    PGconn *conn;
    int sock;
    PGresult *res;

    /* connect to PostgreSQL with default parameters */
    if ((conn = PQconnectdb("application_name=listener")) == NULL)
    {
        fprintf(stderr,
                "Out of memory connecting to PostgreSQL\n");
        return 1;
    }
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr,
                "Error connecting to PostgreSQL: %s\n",
                PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* get network socket for the connection */
    if ((sock = PQsocket(conn)) < 0)
    {
        fprintf(stderr,
                "Error getting connection network socket\n");
        PQfinish(conn);
        return 1;
    }

    /* listen on a channel */
    if ((res = PQexec(conn, "LISTEN tab")) == NULL)
    {
        fprintf(stderr,
                "Out of memory sending LISTEN\n");
        PQfinish(conn);
        return 1;
    }
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr,
                "Error listening on channel: %s\n",
                PQresultErrorMessage(res));
        PQfinish(conn);
        return 1;
    }
    PQclear(res);

    while(1)
    {
        fd_set sockets;
        PGnotify *note;

        /* block waiting for activity on the network socket */
        FD_ZERO(&sockets);
        FD_SET(sock, &sockets);
        errno = 0;
        if (select(sock + 1, &sockets, NULL, NULL, NULL) < 0)
        {
            perror("Error waiting for notification");
            PQfinish(conn);
            return 1;
        }

        /* consume the input */
        if (PQconsumeInput(conn) == 0)
        {
            fprintf(stderr,
                    "Error receiving data: %s\n",
                    PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        /* check for notifications; there may be more than one */
        while ((note = PQnotifies(conn)) != NULL)
        {
            if (!attach_partition(conn, note->relname, note->extra))
            {
                PQfinish(conn);
                return 1;
            }

            PQfreemem(note);
        }
    }
}

Discussion of performance and robustness

The sample code above should be free from race conditions. However, performance will not be good. Let’s compare the speed of 100000 inserts with and without the trigger:

test=# TRUNCATE tab;
TRUNCATE TABLE
test=# \timing
Timing is on.
test=# INSERT INTO tab (ts, data)
       SELECT clock_timestamp(), 'something'
       FROM generate_series(1, 100000);
INSERT 0 0
Time: 3354.275 ms (00:03.354)

test=# ALTER TABLE tab DISABLE TRIGGER part_trig;
ALTER TABLE

test=# TRUNCATE tab;
TRUNCATE TABLE
Time: 20.005 ms
test=# INSERT INTO tab (ts, data)
       SELECT clock_timestamp(), 'something'
       FROM generate_series(1, 100000);
INSERT 0 100000
Time: 120.869 ms

So while the code works, the trigger incurs an overhead of about 32 microseconds per row, which is quite a lot. The whole operation takes almost 28 times as long as without the trigger, but that factor is only so high because there are no indexes on the table. It would be possible to reduce the overhead by writing the trigger function in C as well, but that would add still more complexity.

Conclusion

While it is fairly easy to work around the lack of time-triggered partition creation, it is more difficult to create partitions on demand when rows are inserted. We found a solution using triggers, LISTEN and NOTIFY, but the solution was complicated and the performance impact was severe. It would be good to get better support from PostgreSQL core for this!

If you are interested in partitioning, you may want to read our articles about partition management and partitioning live data on foot and with pg_rewrite.