December 2020

the shibboleth of PostgreSQL — © Laurenz Albe 2020

After all the technical articles I have written, I thought it would be nice to write about PostgreSQL sociology for a change.

Language and community

A community like PostgreSQL has no clearly defined borders. There is no membership certificate; you belong to it if you feel that you belong. That said, you can only get that feeling if you feel accepted.

In all communities, language is an important factor. In the case of an international technical community where much of the communication is in written form, this is not so much about the correct accent or general mastership of the language: it is more the choice of technical terms that separates the “in” crowd from the outsiders. This “jargon”, which primarily serves to unambiguously convey certain technical concepts, also has a social dimension.

What is a shibboleth?

A shibboleth, as Wikipedia defines it, is

any custom or tradition, usually a choice of phrasing or even a single word, that distinguishes one group of people from another.

The term comes from the Biblical Book of Judges (12:5-6):

The Gileadites seized the fords of the Jordan and held them against Ephraim. When any Ephraimite who had escaped wished to cross, the men of Gilead would ask, ‘Are you an Ephraimite?’ and if he said ‘No,’ they would retort, ‘Say “Shibboleth.”’ He would say ‘Sibboleth,” and because he could not pronounce the word properly, they seized him and killed him at the fords. At that time fourty-two thousand men of Ephraim lost their lives.

Now, I am not suggesting that the PostgreSQL crowd engages in genocide, but there is a word that immediately exposes you as an outsider if you pronounce or write it wrong: The name “PostgreSQL” itself.

How to write and pronounce PostgreSQL

The official name is of course “PostgreSQL”, and the documentation states that

Many people continue to refer to PostgreSQL as “Postgres” (now rarely in all capital letters) because of tradition or because it is easier to pronounce. This usage is widely accepted as a nickname or alias.

Also, it is not unusual to refer to the product as “PGSQL” or, even shorter, “PG”.

Now “PostgreSQL” is not a word that lends itself to pronunciation easily (and there have been repeated attempts to change the name), so you can imagine that the pronunciation varies even more wildly. My research showed a surprising number of Youtube videos teaching what they perceive as the correct pronunciation, and there has even been a survey by the project about people's preferred pronunciation. The widely accepted forms are “POST-gres-cue-ell” and “POST-gres”.

So given that diversity, you'd think that you cannot get it wrong - but you can.

How not to write and pronounce PostgreSQL

There are two ways to write and pronounce “PostgreSQL” that will immediately identify you as an outsider:

“Postgre” (pronounced “POST-gray”)
“Postgres SQL” (pronounced “POST-gres-ESS-cue-ell” or “post-gres-SEE-quel”)

Nobody will seize and kill you if you use these, but people will assume that you don't know a thing about PostgreSQL.

Conclusion

Fake it till you make it! Show that you belong and avoid calling the product “Postgre” or “Postgres SQL”.

I wrote about the new pg_timetable 3 major release not so long ago. Two essential features were highlighted:
- new session locking implementation
- new jackc/pgx Golang library used

Today I want to reveal one more advanced feature! Fasten your seat belts!
First, we need to distinguish exclusive client session mode from exclusive chain execution mode.

Exclusive client session

Each pg_timetable instance, known as a client, is pre-set to work in the exclusive mode. Meaning there is no possibility each target database can hold more than one active client connection with the same client name. How we distinguish clients? Simple, the command-line parameter:
pg_timetable --clientname=worker01 ...

It is possible to run as many different clients as you want. Take a look:

Of course, life is complicated, and sometimes more than one pg_timetable instance with the same client name may connect to the target database. In this case, pg_timetable ensures that one and only one is executing chains exclusively. The second will wait until the active one will disconnect.

Frankly speaking, you can run whatever number of pg_timetable clients with the same name. The only one will be active (primary), and all others will be waiting (standby). The obvious drawback of such a situation is losing database connection slots to idle clients.

Exclusive chain

Let's get back to chains. By default, all chains are supposed to run simultaneously at their schedule. This number is physically limited by the number of pg_timetable instance parallel workers. Right now, this number is 16 for scheduled chains workers and 16 for interval chains workers. These workers are implemented using lightweight goroutines. That doesn't mean you cannot have more than 32 chains scheduled for the same time. It means no more than 16 scheduled are executed at any given time, and others are waiting in the queue.

Maybe later, we can add a settings parameter to specify the exact number of workers to run. However, we have never yet hit the situation where the default settings were not enough.

Imagine there are several pg_timetable instances working simultaneously against the same PostgreSQL database. And some task (exclusive or not) has no specified executor (by default). That leads all clients will execute the same chain in their own session.

One more time. We can only guarantee exclusive chain execution within the pg_timetable client session. We won't block any chain's simultaneous execution in several parallel pg_timetable sessions by design. Please use the client_name field of the timetable.chain_execution_config table during chain creation to prevent this situation.

Why exclusive chains matter? I'm sure you can find many use cases: report building, software updates, external non-transactional data updates (files), other process waiting, etc.

Let's sum up the concept of the exclusive task:

exclusive chain adds itself to the queue
exclusive chain waits until all working chains and jobs finished
exclusive chain starts and blocks all other chains until it's done

Implementation details

If you look at the problem closer, you might see the classic readers-writers problems.

Readers-writers problems (there are some variations, check the wiki for details) occur when a shared piece of resource needs to be accessed by multiple executors. We assume our exclusive chains are writers, and non-exclusive (regular) chains are readers.

⚠ This is not related to a database, file system, or any other IO read-write. Just the concept itself is identical, so we can use the same algorithms efficiently.

Thank God we have standard Golang sync package which provides RWMutex type to solve this in a simple way.

From the Golang docs:

An RWMutex is a reader/writer mutual exclusion lock. The lock can be held by an arbitrary number of readers or a single writer. The zero value for an RWMutex is an unlocked mutex.

In other words, non-exclusive regular chains (readers) don't have to wait for each other. They only have to wait for exclusive chains (writers) holding the lock. And only one exclusive (writer) lock is possible at any given time.

During parallel operation, regular chains will call RWMutex.RLock() before and RWMutex.RUnlock() after transaction is done.

And exclusive chains will call RWMutex.Lock() before and RWMutex.Unlock() after. Simple and effective. I really like to Go!

Finally

This was the first in a series of posts dedicated to the new pg_timetable v3 features. Stay tuned for the components to be highlighted:

Asynchronous chains!
Debug mode for developing!

The previous post can be found here.

Stay safe, healthy, and wealthy!
Be happy! Peace! Love! ❤

UPDATE is not the same as DELETE + INSERT! — © Laurenz Albe 2020

Introduction

We know that PostgreSQL does not update a table row in place. Rather, it writes a new version of the row (the PostgreSQL term for a row version is “tuple”) and leaves the old row version in place to serve concurrent read requests. VACUUM later removes these “dead tuples”.

If you delete a row and insert a new one, the effect is similar: we have one dead tuple and one new live tuple. This is why many people (me, among others) explain to beginners that “an UPDATE in PostgreSQL is almost the same as a DELETE, followed by an INSERT”.

This article is about that “almost”.

Example showing the difference between `UPDATE` and `DELETE` + `INSERT`

Let's take this simple test table:

CREATE TABLE uptest (
   id smallint PRIMARY KEY,
   val smallint NOT NULL
);

INSERT INTO uptest VALUES (1, 42);

CREATE TABLE uptest (

id smallint PRIMARY KEY,

val smallint NOT NULL

);

INSERT INTO uptest VALUES (1, 42);

In the two following tests, we will issue statements from two concurrent sessions.

First, the UPDATE:

 Session 1                     Session 2

BEGIN;

UPDATE uptest SET id = 2
   WHERE val = 42;
                               SELECT id FROM uptest
                                  WHERE val = 42
                                  FOR UPDATE;  -- hangs
COMMIT;
                               -- one row is returned

Session 1 Session 2

BEGIN;

UPDATE uptest SET id = 2

WHERE val = 42;

SELECT id FROM uptest

WHERE val = 42

FOR UPDATE; -- hangs

COMMIT;

-- one row is returned

Let's reset the table before the second test;

TRUNCATE uptest;
INSERT INTO uptest VALUES (1, 42);

1 2	TRUNCATE uptest; INSERT INTO uptest VALUES (1, 42);

Now let's repeat the experiment with DELETE and INSERT:

 Session 1                     Session 2

BEGIN;

DELETE FROM uptest
   WHERE id = 1;

INSERT INTO uptest VALUES (2, 42);
                               SELECT id FROM uptest
                                  WHERE val = 42
                                  FOR UPDATE;  -- hangs
COMMIT;
                               -- no row is returned

Session 1 Session 2

BEGIN;

DELETE FROM uptest

WHERE id = 1;

INSERT INTO uptest VALUES (2, 42);

SELECT id FROM uptest

WHERE val = 42

FOR UPDATE; -- hangs

COMMIT;

-- no row is returned

Explanation for the observed difference

The documentation describes what happens when an SQL statement runs into a lock in a transaction with the default READ COMMITTED isolation level:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row. The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition. If so, the second updater proceeds with its operation using the updated version of the row. In the case of SELECT FOR UPDATE and SELECT FOR SHARE, this means it is the updated version of the row that is locked and returned to the client.

The above shows that there is some way for PostgreSQL to find the new version of an updated row. That is why the first experiment returned a result row. In the second experiment, there was no connection between the old, deleted row and the newly inserted one, that's why we get no result in that case.

To figure out how the old and the new version are connected, we have to look deeper.

Looking at `UPDATE` with “pageinspect”

The extension pageinspect allows us to see all data in a PostgreSQL data page. It requires superuser permissions.

Let's use it to see what's on disk after the first experiment:

TRUNCATE uptest;
INSERT INTO uptest VALUES (1, 42);

UPDATE uptest SET id = 2 WHERE val = 42;

SELECT lp,
       t_xmin AS xmin,
       t_xmax AS xmax,
       t_ctid,
       to_hex(t_infomask2) AS infomask2,
       to_hex(t_infomask) AS infomask,
       t_attrs       
FROM heap_page_item_attrs(get_raw_page('uptest', 0), 'uptest');

 lp |  xmin  |  xmax  | t_ctid | infomask2 | infomask |        t_attrs        
----+--------+--------+--------+-----------+----------+-----------------------
  1 | 385688 | 385689 | (0,2)  | 2002      | 100      | {'\x0100','\x2a00'}
  2 | 385689 |      0 | (0,2)  | 2         | 2800     | {'\x0200','\x2a00'}
(2 rows)

TRUNCATE uptest;

INSERT INTO uptest VALUES (1, 42);

UPDATE uptest SET id = 2 WHERE val = 42;

SELECT lp,

t_xmin AS xmin,

t_xmax AS xmax,

t_ctid,

to_hex(t_infomask2) AS infomask2,

to_hex(t_infomask) AS infomask,

t_attrs

FROM heap_page_item_attrs(get_raw_page('uptest', 0), 'uptest');

----+--------+--------+--------+-----------+----------+-----------------------

1 | 385688 | 385689 | (0,2) | 2002 | 100 | {'\x0100','\x2a00'}

2 | 385689 | 0 | (0,2) | 2 | 2800 | {'\x0200','\x2a00'}

(2 rows)

The first entry is the old version of the row, the second the new version.

The lp is the line pointer number, which stands for the number of the tuple within the data page. Together with the page number, this constitutes the physical address (tuple ID or tid) of a tuple.

It follows that the t_ctid stored in the tuple header is usually redundant, since it is implicit in the line pointer. However, it becomes relevant after an UPDATE: then t_ctid contains the tuple identifier of the updated version of the row.

This is the “missing link” between the old row version and the updated one!

Looking at `DELETE` + `DELETE` with “pageinspect”

Let's compare this to DELETE + INSERT:

TRUNCATE uptest;
INSERT INTO uptest VALUES (1, 42);

BEGIN;
DELETE FROM uptest WHERE id = 1;
INSERT INTO uptest VALUES (2, 42);
COMMIT;

SELECT lp,
       t_xmin AS xmin,
       t_xmax AS xmax,
       t_ctid,
       to_hex(t_infomask2) AS infomask2,
       to_hex(t_infomask) AS infomask,
       t_attrs       
FROM heap_page_item_attrs(get_raw_page('uptest', 0), 'uptest');

 lp |  xmin  |  xmax  | t_ctid | infomask2 | infomask |        t_attrs        
----+--------+--------+--------+-----------+----------+-----------------------
  1 | 385691 | 385692 | (0,1)  | 2002      | 100      | {'\x0100','\x2a00'}
  2 | 385692 |      0 | (0,2)  | 2         | 800      | {'\x0200','\x2a00'}
(2 rows)

TRUNCATE uptest;

INSERT INTO uptest VALUES (1, 42);

BEGIN;

DELETE FROM uptest WHERE id = 1;

INSERT INTO uptest VALUES (2, 42);

COMMIT;

SELECT lp,

t_xmin AS xmin,

t_xmax AS xmax,

t_ctid,

to_hex(t_infomask2) AS infomask2,

to_hex(t_infomask) AS infomask,

t_attrs

FROM heap_page_item_attrs(get_raw_page('uptest', 0), 'uptest');

----+--------+--------+--------+-----------+----------+-----------------------

1 | 385691 | 385692 | (0,1) | 2002 | 100 | {'\x0100','\x2a00'}

2 | 385692 | 0 | (0,2) | 2 | 800 | {'\x0200','\x2a00'}

(2 rows)

Here the t_ctid column from the old, deleted tuple is unchanged and there is no link to the new tuple. The second tuple is not found by the SELECT ... FOR UPDATE, since it is “invisible” to the “snapshot” used for scanning the table.

Differences in `infomask` and `infomask2`

There are also some relevant differences in the attributes infomask and infomask2. You can see the meaning of these flags in the PostgreSQL source file src/include/access/htup_details.h.

for the old tuple, the values are the same in both cases:
- infomask2: 2 is the number of columns, and HEAP_KEYS_UPDATED (0x2000) means that the tuple is deleted or updated
- infomask: HEAP_XMIN_COMMITTED (0x0100) means that the tuple was valid before it was removed (a hint bit)
for the new tuple, there are some differences:
- infomask: both cases have HEAP_XMAX_INVALID (0x0800) set (they have not been deleted), but the UPDATE case also has HEAP_UPDATED (0x2000), which shows that this is the result of an UPDATE

Conclusion

To understand the difference between UPDATE and DELETE+INSERT, we had a closer look at the tuple headers. We saw infomask, infomask2 and t_ctid, where the latter provides the link between the old and the new version of a row.

PostgreSQL's row header occupies 23 bytes, which is more storage overhead than in other databases, but is required for PostgreSQL's special multiversioning and tuple visibility implementation.

UPDATE can be challenging in PostgreSQL: if you want to read more about its problems and how to deal with them, read my article on HOT update.

Blog post updated 05.12.2022 People keep asking: "What are the best ways to upgrade and update my PostgreSQL to the latest version?" This blog post covers how you can upgrade or update to the latest PostgreSQL release.

Before we get started, we have to make a distinction between two things:

Updating PostgreSQL
Upgrading PostgreSQL

Let’s take a look at both scenarios.

Updating PostgreSQL

The first scenario I want to shed some light on is updating within a major version of PostgreSQL: What does that mean? Suppose you want to move from PostgreSQL 15.0 to PostgreSQL 15.1. In that case, all you have to do is to install the new binaries and restart the database.
There is no need to reload the data. The downtime required is minimal. The binary format does not change; a quick restart is all you need. The same is true if you are running a HA (high availability) cluster solution such as Patroni. Simply restart all nodes in your PostgreSQL cluster and you are ready to go.
The following table contains a little summary:

Tooling	Task	Reloading data	Downtime needed
`yum`/`dnf`/`apt`	Minor release update	Not needed	Close to zero

Upgrading PostgreSQL

Now let’s take a look at upgrades: if you want to move from PostgreSQL 9.6, 10, 13 or some other older version to PostgreSQL 15, an upgrade is needed. To do that, you have various options:

pg_dumpall: Dump / reload
pg_upgrade: Copy data on a binary level
pg_upgrade --link: In-place upgrades

If you dump and reload data, it might take a lot of time. The bigger your database is, the more time you will need to do the upgrade. It follows that pg_dump/pg_dumpall and restore are not the right tools to upgrade a large multi-terabyte database.
pg_upgrade is here to do a binary upgrade. It copies all the data files from the old directory to the new one. Depending on the amount of data, this can take quite a lot of time and cause serious downtime. However, if the new and the old data directory are on the same filesystem, there is a better option: “pg_upgrade --link”. Instead of copying all the files, pg_upgrade will create hard links for those data files. The amount of data is not a limiting factor anymore, because hard links can be created quickly. “pg_upgrade --link” therefore promises close to zero downtime.

Tooling	Task	Reloading data	Downtime needed
`pg_upgrade`	Major release update	Metadata and files are copied	Downtime is needed
`pg_upgrade --link`	Major release update	Only metadata are copied	Close to zero

What is important to note here is that pg_upgrade is never destructive. If something goes wrong, you can always delete the new data directory and start from scratch.

Preparing a sample database

To show pg_upgrade in action, I have created a little sample database to demonstrate how things work in real life:

test=# CREATE TABLE a AS SELECT id AS a, id AS b, id AS c
       FROM generate_series(1, 50000000) AS id;
SELECT 50000000
test=# CREATE TABLE b AS SELECT *
       FROM a;
SELECT 50000000
test=# CREATE INDEX idx_a_a ON a (a);
CREATE INDEX
test=# CREATE INDEX idx_a_b ON a (b);
CREATE INDEX
test=# CREATE INDEX idx_b_a ON b (a);
CREATE INDEX

test=# CREATE TABLE a AS SELECT id AS a, id AS b, id AS c

FROM generate_series(1, 50000000) AS id;

SELECT 50000000

test=# CREATE TABLE b AS SELECT *

FROM a;

SELECT 50000000

test=# CREATE INDEX idx_a_a ON a (a);

CREATE INDEX

test=# CREATE INDEX idx_a_b ON a (b);

CREATE INDEX

test=# CREATE INDEX idx_b_a ON b (a);

CREATE INDEX

This is a fairly small database, but it is already large enough so that users can feel the difference when doing the upgrade:

test=# SELECT pg_size_pretty(pg_database_size('test'));
pg_size_pretty
----------------
7444 MB
(1 row)

test=# SELECT pg_size_pretty(pg_database_size('test'));

pg_size_pretty

----------------

7444 MB

(1 row)

7.4 GB are ready to be upgraded. Let’s see how it works.

pg_upgrade in action

To upgrade a database, three steps are needed:

initdb ...
copy pg_hba.conf and pg_ident.conf and adapt the new postgresql.conf
pg_upgrade ...

Let us start with initdb:

iMac:tmp hs$ initdb -D /data/db15

The files belonging to this database system will be owned by user 'hs'.

This user must also own the server process.

The database cluster will be initialized with locales
       COLLATE: en_US
       CTYPE: UTF-8
       MESSAGES: en_US
       MONETARY: en_US
       NUMERIC: en_US
       TIME: en_US
The default database encoding has accordingly been set to 'UTF8'.
initdb: could not find suitable text search configuration for locale 'UTF-8'
The default text search configuration will be set to 'simple'.

Data page checksums are disabled.

creating directory /data/db15 ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting default time zone ... Europe/Vienna
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

initdb: warning: enabling 'trust' authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

       pg_ctl -D /data/db15 -l logfile start

iMac:tmp hs$ initdb -D /data/db15

The files belonging to this database system will be owned by user 'hs'.

This user must also own the server process.

The database cluster will be initialized with locales

COLLATE: en_US

CTYPE: UTF-8

MESSAGES: en_US

MONETARY: en_US

NUMERIC: en_US

TIME: en_US

The default database encoding has accordingly been set to 'UTF8'.

initdb: could not find suitable text search configuration for locale 'UTF-8'

The default text search configuration will be set to 'simple'.

Data page checksums are disabled.

creating directory /data/db15 ... ok

creating subdirectories ... ok

selecting dynamic shared memory implementation ... posix

selecting default max_connections ... 100

selecting default shared_buffers ... 128MB

selecting default time zone ... Europe/Vienna

creating configuration files ... ok

running bootstrap script ... ok

performing post-bootstrap initialization ... ok

syncing data to disk ... ok

initdb: warning: enabling 'trust' authentication for local connections

You can change this by editing pg_hba.conf or using the option -A, or

--auth-local and --auth-host, the next time you run initdb.

Success. You can now start the database server using:

pg_ctl -D /data/db15 -l logfile start

Note that pg_upgrade is only going to work in case the encodings of the old and the new database instance match. Otherwise, it will fail.

After adapting the configuration files, we can run pg_upgrade: Basically, we need four pieces of information here: The old and the new data directory as well as the path of the old and the new binaries:

iMac:tmp hs$ time pg_upgrade -d /data/db12/ 
       -D /data/db15 
       -b /path/pg12/bin/ 
       -B /usr/local/Cellar/postgresql/15.1/bin/

Performing Consistency Checks
-----------------------------
Checking cluster versions                                   ok
Checking database user is the install user                  ok
Checking database connection settings                       ok
Checking for prepared transactions                          ok
Checking for reg* data types in user tables                 ok
Checking for contrib/isn with bigint-passing mismatch       ok
Creating dump of global objects                             ok
Creating dump of database schemas                           ok
Checking for presence of required libraries                 ok
Checking database user is the install user                  ok
Checking for prepared transactions                          ok

If pg_upgrade fails after this point, you must re-initdb the
new cluster before continuing.

Performing Upgrade
------------------
Analyzing all rows in the new cluster                       ok
Freezing all rows in the new cluster                        ok
Deleting files from new pg_xact                             ok
Copying old pg_xact to new server                           ok
Setting next transaction ID and epoch for new cluster       ok
Deleting files from new pg_multixact/offsets                ok
Copying old pg_multixact/offsets to new server              ok
Deleting files from new pg_multixact/members                ok
Copying old pg_multixact/members to new server              ok
Setting next multixact ID and offset for new cluster        ok
Resetting WAL archives                                      ok
Setting frozenxid and minmxid counters in new cluster       ok
Restoring global objects in the new cluster                 ok
Restoring database schemas in the new cluster               ok
Copying user relation files                                 ok
Setting next OID for new cluster                            ok
Sync data directory to disk                                 ok
Creating script to analyze new cluster                      ok
Creating script to delete old cluster                       ok

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
        ./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
        ./delete_old_cluster.sh

real 4m11.702s
user 0m0.255s
sys 0m13.385s

iMac:tmp hs$ time pg_upgrade -d /data/db12/

-D /data/db15

-b /path/pg12/bin/

-B /usr/local/Cellar/postgresql/15.1/bin/

Performing Consistency Checks

-----------------------------

Checking cluster versions ok

Checking database user is the install user ok

Checking database connection settings ok

Checking for prepared transactions ok

Checking for reg* data types in user tables ok

Checking for contrib/isn with bigint-passing mismatch ok

Creating dump of global objects ok

Creating dump of database schemas ok

Checking for presence of required libraries ok

Checking database user is the install user ok

Checking for prepared transactions ok

If pg_upgrade fails after this point, you must re-initdb the

new cluster before continuing.

Performing Upgrade

------------------

Analyzing all rows in the new cluster ok

Freezing all rows in the new cluster ok

Deleting files from new pg_xact ok

Copying old pg_xact to new server ok

Setting next transaction ID and epoch for new cluster ok

Deleting files from new pg_multixact/offsets ok

Copying old pg_multixact/offsets to new server ok

Deleting files from new pg_multixact/members ok

Copying old pg_multixact/members to new server ok

Setting next multixact ID and offset for new cluster ok

Resetting WAL archives ok

Setting frozenxid and minmxid counters in new cluster ok

Restoring global objects in the new cluster ok

Restoring database schemas in the new cluster ok

Copying user relation files ok

Setting next OID for new cluster ok

Sync data directory to disk ok

Creating script to analyze new cluster ok

Creating script to delete old cluster ok

Upgrade Complete

----------------

Optimizer statistics are not transferred by pg_upgrade so,

once you start the new server, consider running:

./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:

./delete_old_cluster.sh

real 4m11.702s

user 0m0.255s

sys 0m13.385s

What is worth mentioning here is that the upgrade process takes over four minutes because all the data had to be copied to the new directory. My old Mac is not very fast and copying takes a very long time.

To reduce downtime, we can clean out the directory, run initdb again and add the --link option:

iMac:~ hs$ time pg_upgrade -d /data/db12/ 
       -D /data/db15 
       -b /path/pg12/bin/ 
       -B /usr/local/Cellar/postgresql/15.1/bin/ 
       --link

Performing Consistency Checks
-----------------------------
Checking cluster versions                                    ok
Checking database user is the install user                   ok

...

Upgrade Complete
----------------
Optimizer statistics are not transferred by pg_upgrade so,
once you start the new server, consider running:
        ./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:
        ./delete_old_cluster.sh

real 0m3.538s
user 0m0.198s
sys 0m0.322s

iMac:~ hs$ time pg_upgrade -d /data/db12/

-D /data/db15

-b /path/pg12/bin/

-B /usr/local/Cellar/postgresql/15.1/bin/

--link

Performing Consistency Checks

-----------------------------

Checking cluster versions ok

Checking database user is the install user ok

...

Upgrade Complete

----------------

Optimizer statistics are not transferred by pg_upgrade so,

once you start the new server, consider running:

./analyze_new_cluster.sh

Running this script will delete the old cluster's data files:

./delete_old_cluster.sh

real 0m3.538s

user 0m0.198s

sys 0m0.322s

In this case, it took only 3.5 seconds to upgrade. We can start the database instance directly and keep working with the new instance.

`pg_upgrade --link` under the hood

The --link option can reduce the downtime to close to zero. The question is what happens under the hood? Let's check out the data files:

-rw------- 2 hs wheel 1073741824 Nov 22 11:55 16388
-rw------- 2 hs wheel 1073741824 Nov 22 11:58 16388.1
-rw------- 2 hs wheel 66576384 Nov 22 12:01 16388.2
-rw------- 2 hs wheel 1073741824 Nov 22 12:12 16391
-rw------- 2 hs wheel 1073741824 Nov 22 12:13 16391.1
-rw------- 2 hs wheel 66576384 Nov 22 12:14 16391.2
-rw------- 2 hs wheel 1073741824 Nov 22 12:01 16394
-rw------- 2 hs wheel 49373184 Nov 22 12:01 16394.1
-rw------- 2 hs wheel 1073741824 Nov 22 12:03 16395
-rw------- 2 hs wheel 49373184 Nov 22 12:03 16395.1
-rw------- 2 hs wheel 1073741824 Nov 22 12:15 16396
-rw------- 2 hs wheel 49373184 Nov 22 12:15 16396.1
-rw------- 1 hs wheel 8192 Nov 24 10:03 174
-rw------- 1 hs wheel 8192 Nov 24 10:03 175

-rw------- 2 hs wheel 1073741824 Nov 22 11:55 16388

-rw------- 2 hs wheel 1073741824 Nov 22 11:58 16388.1

-rw------- 2 hs wheel 66576384 Nov 22 12:01 16388.2

-rw------- 2 hs wheel 1073741824 Nov 22 12:12 16391

-rw------- 2 hs wheel 1073741824 Nov 22 12:13 16391.1

-rw------- 2 hs wheel 66576384 Nov 22 12:14 16391.2

-rw------- 2 hs wheel 1073741824 Nov 22 12:01 16394

-rw------- 2 hs wheel 49373184 Nov 22 12:01 16394.1

-rw------- 2 hs wheel 1073741824 Nov 22 12:03 16395

-rw------- 2 hs wheel 49373184 Nov 22 12:03 16395.1

-rw------- 2 hs wheel 1073741824 Nov 22 12:15 16396

-rw------- 2 hs wheel 49373184 Nov 22 12:15 16396.1

-rw------- 1 hs wheel 8192 Nov 24 10:03 174

-rw------- 1 hs wheel 8192 Nov 24 10:03 175

What you see here is that the relevant data files have two entries: “2” means that the same file (inode) shows up as two directories. pg_upgrade creates hard links, and this is exactly what we see here.

That's why it is super important to make sure that the mount point is /data and not /data/db12. Otherwise, hard links are not possible.

Finally …

pg_upgrade is an important tool. However, there are more ways to upgrade: Logical replication is an alternative method. Check out our blog post about this exciting technology.

PostgreSQL offers a nice BLOB interface which is widely used. However, recently we came across problems faced by various customers, and it makes sense to reflect a bit and figure out how PostgreSQL handles BLOBs - and especially BLOB cleanup.

Using the PostgreSQL BLOB interface

In PostgreSQL, you can use various means to store binary data. The simplest form is definitely to make use of the “bytea” (= byte array) data type. In this case a binary field is basically seen as part of a row.
Here is how it works:

test=# CREATE TABLE t_image (id int, name text, image bytea);
CREATE TABLE
test=# d t_image
Table 'public.t_image'
Column | Type    | Collation | Nullable | Default
-------+---------+-----------+----------+---------
id     | integer |           |          |
name   | text    |           |          |
image  | bytea   |           |          |

test=# CREATE TABLE t_image (id int, name text, image bytea);

CREATE TABLE

test=# d t_image

Table 'public.t_image'

Column | Type | Collation | Nullable | Default

-------+---------+-----------+----------+---------

id | integer | | |

name | text | | |

image | bytea | | |

As you can see, this is a normal column and it can be used just like a normal column. The only thing worth mentioning is the encoding one has to use on the SQL level. PostgreSQL uses a variable to configure this behavior:

test=# SHOW bytea_output;
 bytea_output
--------------
 hex
(1 row)

test=# SHOW bytea_output;

bytea_output

--------------

hex

(1 row)

Bytea_output

The bytea_output variable accepts two values: “hex” tells PostgreSQL to send the data in hex format. “escape” means that data has to be fed in as an octal string. There is not much the application has to worry about here, apart from the maximum size of 1 GB per field.
However, PostgreSQL has a second interface to handle binary data: The BLOB interface. Let me show an example of this powerful tool in action:

test=# SELECT lo_import('/etc/hosts');
 lo_import
-----------
     80343
(1 row)

test=# SELECT lo_import('/etc/hosts');

lo_import

-----------

80343

(1 row)

In this case, the content of /etc/hosts has been imported into the database. Note that PostgreSQL has a copy of the data - it is not a link to the filesystem. What is noteworthy here is that the database will return the OID (object ID) of the new entry. To keep track of these OIDs, some developers do the following:

test=# CREATE TABLE t_file (
          id          int,
          name        text, 
          object_id   oid
);
CREATE TABLE
test=# INSERT INTO t_file
           VALUES (1, 'some_name', lo_import('/etc/hosts'))
       RETURNING *;
id  | name          | object_id
----+---------------+-----------
1   | some_name     | 80350
(1 row)

test=# CREATE TABLE t_file (

id int,

name text,

object_id oid

);

CREATE TABLE

test=# INSERT INTO t_file

VALUES (1, 'some_name', lo_import('/etc/hosts'))

RETURNING *;

id | name | object_id

----+---------------+-----------

1 | some_name | 80350

(1 row)

INSERT 0 1

1	INSERT 0 1

This is absolutely fine, unless you do something like below:

test=# DELETE FROM t_file WHERE id = 1;
DELETE 1

1 2	test=# DELETE FROM t_file WHERE id = 1; DELETE 1

The problem is that the object id has been forgotten. However, the object is still there. pg_largeobject is the system table in charge of storing the binary data inside PostgreSQL. All lo_functions will simply talk to this system table in order to handle thesethings:

test=# x
Expanded display is on.
test=# SELECT * FROM pg_largeobject WHERE loid = 80350;
-[ RECORD 1 ]------------------------------------------
loid   | 80350
pageno | 0
data   | ##\012# Host Database\012#\012# localhost ...

test=# x

Expanded display is on.

test=# SELECT * FROM pg_largeobject WHERE loid = 80350;

-[ RECORD 1 ]------------------------------------------

loid | 80350

pageno | 0

data | ##\012# Host Database\012#\012# localhost ...

Why is that a problem? The reason is simple: Your database will grow and the number of “dead objects” will accumulate.

Therefore the correct way to kill a BLOB entry is as follows:

test=# x
Expanded display is off.
test=#
test=# SELECT lo_unlink(80350);
 lo_unlink
-----------
 1
(1 row)

test=# SELECT * FROM pg_largeobject WHERE loid = 80350;
loid  | pageno | data
------+--------+------
(0 rows)

test=# x

Expanded display is off.

test=#

test=# SELECT lo_unlink(80350);

lo_unlink

-----------

(1 row)

test=# SELECT * FROM pg_largeobject WHERE loid = 80350;

loid | pageno | data

------+--------+------

(0 rows)

If you forget to unlink the object, you will suffer in the long run - and we have often seen that happen. It is a major issue if you are using the BLOB interface.

vacuumlo: Cleaning up dead large objects

However, how can one fix the problem once you have accumulated thousands, or maybe millions, of dead BLOBs? The answer is a command line tool called “vacuumlo”.
Let us first create a dead entry:

test=# SELECT lo_import('/etc/hosts');
 lo_import
-----------
 80351
(1 row)

test=# SELECT lo_import('/etc/hosts');

lo_import

-----------

80351

(1 row)

Then we can run vacuumlo from any client:

iMac:~ hs$ vacuumlo -h localhost -v test
Connected to database 'test'
Checking object_id in public.t_file
Successfully removed 2 large objects from database 'test'.

iMac:~ hs$ vacuumlo -h localhost -v test

Connected to database 'test'

Checking object_id in public.t_file

Successfully removed 2 large objects from database 'test'.

As you can see, two dead objects have been killed by the tool. vacuumlo is the easiest way to clean out orphan objects.

Additional functionality

However, there is more than just lo_import and lo_unlink. PostgreSQL offers a variety of functions to handle large objects in a nice way:

test=# df lo_*
                            	List of functions
   Schema   | 	Name  	    | Result data type | Argument data types       | Type
------------+---------------+------------------+---------------------------+------
 pg_catalog | lo_close      | integer          | integer                   | func
 pg_catalog | lo_creat      | oid              | integer                   | func
 pg_catalog | lo_create     | oid              | oid                       | func
 pg_catalog | lo_export     | integer          | oid, text                 | func
 pg_catalog | lo_from_bytea | oid              | oid, bytea                | func
 pg_catalog | lo_get        | bytea            | oid                       | func
 pg_catalog | lo_get        | bytea            | oid, bigint, integer      | func
 pg_catalog | lo_import     | oid              | text                      | func
 pg_catalog | lo_import     | oid              | text, oid                 | func
 pg_catalog | lo_lseek      | integer          | integer, integer, integer | func
 pg_catalog | lo_lseek64    | bigint           | integer, bigint, integer  | func
 pg_catalog | lo_open       | integer          | oid, integer              | func
 pg_catalog | lo_put        | void             | oid, bigint, bytea        | func
 pg_catalog | lo_tell       | integer          | integer                   | func
 pg_catalog | lo_tell64     | bigint           | integer                   | func
 pg_catalog | lo_truncate   | integer          | integer, integer          | func
 pg_catalog | lo_truncate64 | integer          | integer, bigint           | func
 pg_catalog | lo_unlink     | integer          | oid                       | func
(18 rows)

test=# df lo_*

List of functions

Schema | Name | Result data type | Argument data types | Type

------------+---------------+------------------+---------------------------+------

pg_catalog | lo_close | integer | integer | func

pg_catalog | lo_creat | oid | integer | func

pg_catalog | lo_create | oid | oid | func

pg_catalog | lo_export | integer | oid, text | func

pg_catalog | lo_from_bytea | oid | oid, bytea | func

pg_catalog | lo_get | bytea | oid | func

pg_catalog | lo_get | bytea | oid, bigint, integer | func

pg_catalog | lo_import | oid | text | func

pg_catalog | lo_import | oid | text, oid | func

pg_catalog | lo_lseek | integer | integer, integer, integer | func

pg_catalog | lo_lseek64 | bigint | integer, bigint, integer | func

pg_catalog | lo_open | integer | oid, integer | func

pg_catalog | lo_put | void | oid, bigint, bytea | func

pg_catalog | lo_tell | integer | integer | func

pg_catalog | lo_tell64 | bigint | integer | func

pg_catalog | lo_truncate | integer | integer, integer | func

pg_catalog | lo_truncate64 | integer | integer, bigint | func

pg_catalog | lo_unlink | integer | oid | func

(18 rows)

There are two more functions which don’t follow the naming convention for historic reasons: loread and lowrite:

pg_catalog | loread  | bytea   | integer, integer | func
pg_catalog | lowrite | integer | integer, bytea   | func

1 2	pg_catalog \| loread \| bytea \| integer, integer \| func pg_catalog \| lowrite \| integer \| integer, bytea \| func

They are functions whose names cannot easily be changed anymore. However, it is worth noting that they exist.

Finally …

The PostgreSQL BLOB interface is really useful and can be used for many things. The beauty is that it is fully transactional and therefore binary content and metadata cannot go out of sync anymore.

If you want to learn more about triggers to enforce constraints in PostgreSQL, we recommend you check out our blog post written by Laurenz Albe. It will shed some light on this important topic.

In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

Language and community

What is a shibboleth?

How to write and pronounce PostgreSQL

How not to write and pronounce PostgreSQL

Conclusion

Exclusive client session

Exclusive chain

Implementation details

Finally

Introduction

Example showing the difference between UPDATE and DELETE + INSERT

Explanation for the observed difference

Looking at UPDATE with “pageinspect”

Looking at DELETE + DELETE with “pageinspect”

Differences in infomask and infomask2

Conclusion

Updating PostgreSQL

Upgrading PostgreSQL

Preparing a sample database

pg_upgrade in action

pg_upgrade --link under the hood

Finally …

Using the PostgreSQL BLOB interface

Bytea_output

vacuumlo: Cleaning up dead large objects

Additional functionality

Finally …

Example showing the difference between `UPDATE` and `DELETE` + `INSERT`

Looking at `UPDATE` with “pageinspect”

Looking at `DELETE` + `DELETE` with “pageinspect”

Differences in `infomask` and `infomask2`

`pg_upgrade --link` under the hood