Some time ago, I posted some information about zheap, a storage engine for PostgreSQL. The idea behind zheap is to make UPDATE statements run faster in order to keep table bloat under control. Currently, PostgreSQL copies data on UPDATE and puts the copy of the row into the same data file. While this isn’t a bad strategy, it can in some cases lead to table bloat. The idea behind zheap undo is to manage old copies of rows, which is similar to Oracle. Until now, the problem was that this “zheap undo” was not cleaned up. That’s why we implemented a “discard worker” to handle the cleanup. By discarding zheap UNDO logs, the worker keeps UPDATE-related table bloat to a minimum.

Getting PostgreSQL and zheap

Before we dig into what the discard worker does, you need to install a zheap-enabled version of PostgreSQL. To make it easier for everybody, we created a container to allow users to give zheap a try out-of-the-box.

The original idea was to have zheap as an extension. However, it’s too invasive, therefore you can’t just install it on top of PostgreSQL, as you can with other extensions. zheap does need core patches.

Also: I want to point out that this is not a production-ready version, but rather a tech preview. We’re happy to share it with the community to get feedback and maybe gather some bug reports.

Here is the freely available docker container for you to try out. Installation is simple and very straightforward:

[hs@fedora ~]$ docker run --name zHeap -p 5432:5432 \
-d cybertecpostgresql/zheap

To connect to the newly created instance, just run psql as you normally would. The default username and password are both postgres.

Username: postgres
Password: postgres

[hs@fedora ~]$ psql -h localhost -p 5432 -U postgres
UNDO: Cleaning up using the discard worker

Testing zheap UNDO in PostgreSQL

Once your zheap-enabled version of PostgreSQL is up and running, we can inspect the process table:

[hs@fedora ~]$ ps axf | grep post
2403084 ? Ss 0:00 \_ /home/hs/pgzheap/bin/postgres -D ./dbzheap
2403086 ? Ss 0:00 \_ postgres: checkpointer 
2403087 ? Ss 0:00 \_ postgres: background writer 
2403088 ? Ss 0:00 \_ postgres: walwriter 
2403089 ? Ss 0:00 \_ postgres: autovacuum launcher 
2403090 ? Ss 0:00 \_ postgres: stats collector 
2403091 ? Ss 0:00 \_ postgres: undo discard worker 
2403093 ? Ss 0:00 \_ postgres: logical replication launcher 

The discard worker is launched automatically. Let’s see what it does, and how the undo log is stored.

Before we create a table and fill it with data, we’ve got to adjust a variable:

test=# SHOW default_table_access_method;
(1 row)

test=# SET default_table_access_method TO zheap;

This setting tells PostgreSQL which storage engine you want to use by default when creating a new table. Usually we want heaps. However, in this case zheap is the desired choice. We can add the zheap option to each CREATE TABLE. In many cases you might want to set it for the entire session as I just did:

test=# CREATE TABLE t_large (id serial, name text);

The layout of the table doesn’t really matter. We can use any layout.

Testing zheap UNDO with INSERT load

Let’s load some data:

test=# BEGIN;
test=*# INSERT INTO t_large (name) SELECT 'dummy' FROM generate_series(1, 10000000);
INSERT 0 10000000
test=*# \d+
List of relations
 Schema |      Name      |   Type   |  Owner   | Persistence | Access method |    Size    | Description 
 public |        t_large |    table |       hs |   permanent |         zheap |     193 MB | 
 public | t_large_id_seq | sequence |       hs |   permanent |               | 8192 bytes | 
(2 rows)

The important thing is that zheap is able to rollback in case of error. Therefore, a lot of undo has to be written: Let’s take a look at the undo directories before we commit the transaction:

[hs@fedora dbzheap]$ ls -l ./pg_undo/ && ls -l ./base/undo/
total 8
-rw-------. 1 hs hs 52 12. Nov 11:47 00000001A84012A0
-rw-------. 1 hs hs 52 12. Nov 11:47 00000001C9F48970
total 432128
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.0065A00000
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.0065B00000
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.0065C00000
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.007FD00000
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.007FE00000
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.007FF00000

You can see that we’ve created a fairly large number of files in our base/undo directory. Let’s commit and see what happens:

test=*# COMMIT;

PostgreSQL will recycle (= delete) these logs when the new discard worker kicks in. Let’s take a look and see:

[hs@fedora dbzheap]$ ls -l ./pg_undo/ && ls -l ./base/undo/
total 8
-rw-------. 1 hs hs 52 12. Nov 11:47 00000001A84012A0
-rw-------. 1 hs hs 52 12. Nov 11:47 00000001C9F48970
total 2048
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.007FF00000
-rw-------. 1 hs hs 1048576 12. Nov 11:47 000000.0080000000

The files are gone, which is what we expected.

Testing zheap UNDO with UPDATE load

The same will happen when we run UPDATE statements:

test=# BEGIN;
test=*# UPDATE t_large SET id = id - 1;
UPDATE 10000000

The important aspect is that there is actually a lot of WAL created. As you can see, we are talking about around 600 MB:

[hs@fedora dbzheap]$ cd base/undo/
[hs@fedora undo]$ du -h

Now, let’s commit the transaction:

test=*# COMMIT;

What’s interesting is what happens immediately after the commit statement:

[hs@fedora undo]$ date && du -h
Fr 12. Nov 11:55:50 CET 2021
603M .
[hs@fedora undo]$ date && du -h
Fr 12. Nov 11:55:57 CET 2021
2,0M .

So that’s it – the zheap UNDO logs are gone. The cleanup is NOT part of a commit, but is actually done by the discard worker which is in charge of making sure that log is not killed too early. Keep in mind: You might not be the only transaction at work here, so the cleanup still has to be asynchronous (just like a vacuum is).


Check out our Table Bloat Archive for more important information about zheap!

New GIS Site!

We also recently released a website that offers ready-to-use GIS data. If you use PostGIS and OSM data, this is the place to go. We provide you with ready-made database dumps featuring the entire planet. Check out our new GIS site.