
Patroni has dramatically simplified the setup of PostgreSQL High-Availability clusters. Personally, I consider Patroni to be by far the best tool out there to cluster PostgreSQL and to ensure HA in the most reliable and straightforward way possible.

The main questions asked by many people are: How can you make a PostgreSQL cluster transparent? How does the application figure out where the primary database is? The solution to the problem is a tool we implemented some time ago: Let me introduce you to vip-manager (Virtual IP Manager), a simple-to-use tool to handle service IPs in a PostgreSQL Patroni cluster.

vip-manager: Automatic IP failover


Before we dig a bit deeper into vip-manager, I want to explain the basic architecture of Patroni itself briefly. As you might know, Patroni is a tool developed initially by Zalando, which was a milestone in the development of PostgreSQL High-Availability. Over the years, we (CYBERTEC) have contributed countless patches to Patroni. We hope that we have made a difference in this crucial area relevant to so many people relying on PostgreSQL HA.

Patroni uses a key-value store (usually etcd, ZooKeeper or Consul) to handle consensus. The idea is to keep a leader key in the key-value store which tells us where the current primary is. In other words: the cluster knows who happens to be primary and who happens to be secondary. Why not use this information to move a service IP around inside the cluster and assign it to the active primary? This is precisely what vip-manager does. It runs on every database node and checks whether the node it runs on happens to be the "chosen one". If vip-manager sees that it runs on the current primary, it grabs the IP and creates an IP alias. If it sees that the node it is running on is not the primary, it ensures that the IP is removed.

In a cluster consisting of 3 database servers, you will therefore need 4 IPs: one static IP per server, plus one floating IP that moves around and follows the primary. Of course, the floating IP is what you should use in your applications.

A typical node in a Patroni cluster has to run two services: Patroni and vip-manager. Patroni controls PostgreSQL (stop, start, sync, etc.), while vip-manager is a simple Go application that does nothing but control the floating IP pointing to the primary.

Configuring vip-manager

Configuring vip-manager is simple. Here is an example:
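What follows is an illustrative sketch rather than the exact file from this post - key names can differ between vip-manager versions, and the addresses, key paths and credentials are made up - but it shows the shape of the configuration:

    # vip-manager.yml - illustrative sketch, adjust to your environment
    interval: 1000                            # health check interval in milliseconds
    trigger-key: "/service/pgcluster/leader"  # leader key maintained by Patroni
    trigger-value: "pgnode1"                  # name of the node this instance runs on
    ip: 192.168.0.100                         # the floating service IP
    netmask: 24
    interface: eth0                           # interface the IP alias is created on
    dcs-type: etcd
    dcs-endpoints:
      - http://192.168.0.1:2379
    etcd-user: "patroni"
    etcd-password: "secret"
    retry-num: 2
    retry-after: 250                          # milliseconds between retries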

Basically, the configuration boils down to a handful of questions: First of all, vip-manager has to know where etcd is. Then it has to know where it can find the leader key. Finally, it needs to know which IP has to be bound to which network interface. The rest is the login information for etcd and some retry configuration. The core really is: Where is my etcd and how can I log in? Where in etcd is my leader key, and which IP is assigned to which device? That is the entire magic.

vip-manager helps to make a cluster fully transparent. It is available as binary packages for the most common Linux distributions and for Windows. We maintain the packages and add improvements as needed. One improvement worth pointing out is that vip-manager can also work on Hetzner (a leading German hosting company): it is able to talk to the Hetzner API to move IPs around. The reason I am mentioning this is that if you've got special requirements, vip-manager can be adjusted to your needs with reasonable effort.

Finally …

I hope you enjoyed this posting about virtual IPs and PostgreSQL clustering. If you want to know more about recently released software, I want to point you to Scalefield, a solution to automate PostgreSQL deployments using Kubernetes.
If you want to read something right now, I want to tell you about one more tool we have recently implemented in Go (golang). pg_timetable is a cutting-edge job scheduler for PostgreSQL, which allows you to run SQL code as well as built-ins (email, etc.) and other executables. Check out our posts about that here.

In SQL, the concept of foreign keys is an important one that can be found in all professional databases used in the industry. The core idea is to prevent your PostgreSQL database from storing inconsistent data by enforcing constraints ensuring the correctness of your tables (at least as far as relations between objects are concerned). Referential integrity is therefore one of the most important concepts ever invented in IT.

However, foreign keys will introduce some issues which you have to take care of when writing applications. If there are no foreign keys, you can insert data into any table in any order. PostgreSQL does not care. However, if a foreign key is in place, order starts to matter (at least in a typical scenario but more on that later).

Foreign keys and order

To show the importance of order, we have to create a data model first:
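The exact schema is not reproduced here; a minimal model along the following lines (table and column names are assumptions) is enough to follow the argument:

    CREATE TABLE t_currency
    (
        id        int PRIMARY KEY,
        shortcut  text NOT NULL
    );

    CREATE TABLE t_product
    (
        id           int PRIMARY KEY,
        currency_id  int REFERENCES t_currency (id),
        name         text NOT NULL
    );

    CREATE TABLE t_product_desc
    (
        id           int PRIMARY KEY,
        product_id   int NOT NULL REFERENCES t_product (id),
        description  text NOT NULL
    );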

We want to store currencies, products, as well as product descriptions. Basically, it is a very simple data model. Let us see what happens if we insert into the product table first:
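For instance (values made up):

    INSERT INTO t_product (id, currency_id, name)
        VALUES (1, 1, 'PostgreSQL consulting');
    -- fails: violates the foreign key constraint on currency_id

    INSERT INTO t_product (id, currency_id, name)
        VALUES (1, NULL, 'PostgreSQL consulting');
    -- works: the currency is simply unknown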

Logically, the first INSERT is going to fail because currency number 1 does not exist yet. If we want to INSERT anyway, we have to use a NULL value (= unknown currency). In other words: We have to fill the currency table first, then the product table, and so on. The order does matter in the default case.

Determining the correct insertion order for foreign keys

If you have to start using an existing data model, it can be a bit hard to wrap your head around this stuff. Populating an empty data model can be a bit tricky. So why not write a query telling us the order in which we are supposed to insert data?

Well, here is that magic query...
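The original query is not reproduced here; the following is a sketch of the idea using the system catalogs: tables that reference nothing are level 1, tables that only reference level-1 tables are level 2, and so on. It assumes the tables live in the public schema and that there are no circular foreign key chains:

    WITH RECURSIVE fkeys AS (
        /* source and target tables of all foreign keys (self-references ignored) */
        SELECT conrelid  AS source,
               confrelid AS target
        FROM   pg_constraint
        WHERE  contype = 'f'
          AND  conrelid <> confrelid
    ), tab AS (
        /* tables that reference no other table are at level 1 */
        SELECT oid AS tab_oid, 1 AS level
        FROM   pg_class
        WHERE  relkind = 'r'
          AND  relnamespace = 'public'::regnamespace
          AND  oid NOT IN (SELECT source FROM fkeys)
        UNION ALL
        /* a table referencing a level-n table is at least at level n + 1 */
        SELECT f.source, t.level + 1
        FROM   fkeys f
        JOIN   tab   t ON t.tab_oid = f.target
    )
    SELECT tab_oid::regclass AS table_name,
           max(level)        AS level
    FROM   tab
    GROUP  BY tab_oid
    ORDER  BY level, table_name;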

The query is not trivial to read, but I have done my best to document it a bit. Basically, the PostgreSQL system tables have all the information we need to determine the correct order. Here is the output:
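With the three tables sketched above, the result would look like this:

       table_name    | level
    -----------------+-------
     t_currency      |     1
     t_product       |     2
     t_product_desc  |     3
    (3 rows)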

As you can see, the query has correctly given us the tables in the desired order. First, we have to insert into all tables at level one and so on. If we stick to this order, referential integrity will always be ensured (assuming the data is correct).

Making use of “initially deferred” constraints

In some cases, the insertion order can be a nasty thing to deal with. What if we had the means to tell PostgreSQL to ignore the order and check integrity on commit instead? This is exactly what "initially deferred" does. Here is how it works:
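A sketch, reusing the tables assumed above (the constraint name assumes PostgreSQL's default naming):

    ALTER TABLE t_product
        DROP CONSTRAINT t_product_currency_id_fkey,
        ADD  CONSTRAINT t_product_currency_id_fkey
             FOREIGN KEY (currency_id) REFERENCES t_currency (id)
             DEFERRABLE INITIALLY DEFERRED;

    BEGIN;
    INSERT INTO t_product  (id, currency_id, name) VALUES (2, 1, 'support contract');
    INSERT INTO t_currency (id, shortcut)          VALUES (1, 'EUR');
    COMMIT;    -- the foreign key is only checked here, and it succeeds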

In this case, we can modify data in any order we want. As long as integrity is guaranteed to be intact at the end of the transaction, PostgreSQL is perfectly fine. PostgreSQL will postpone the constraint check and take some burden off the developer.

Finally …

If you want to learn more about advanced SQL, you might want to take a look at my blog post about some more advanced windowing functions (with ties). So put on your tie and read on to learn more.

Partitioning data is a well-known way to split big tables into smaller ones based on a given attribute. And while creating a partitioned table structure itself is not complicated, the migration of a table on a live system can be tricky. Our solution offers a simple way to solve this problem.

 

The usual way can be summarized as follows:

  1. create a table and necessary triggers for the future partitioned data
  2. create a trigger on the existing live table that performs the same operation on the partitioned table on every insert/update/delete, to maintain data integrity
  3. copy the data from the live table to the partitioned table - likely in batches and in a transaction
  4. switch in the application code to the partitioned table or rename the tables
  5. delete the original live table

 

Disadvantage of this method

The disadvantage of the method above is that it is relatively complex; the triggers that secure data integrity must be tested. It also creates data redundancy, because the data must be duplicated during migration. It also requires a migration plan and a change in the application when switching from the old table to the new one - which may or may not be an issue.

During data migration the particular difficulty is: how can we be certain that if a given batch of data is being migrated while entries of the same batch are updated or deleted, the correct values end up in the new, partitioned table? It may mean some freeze of inserts/updates during migration (see answer 13), or some logic that re-scans the data in the batch after migration for any changes and gives priority to the original table. At think project! we needed both to be sure that we have no data loss - which may not be an issue in some scenarios - and that the downtime/service time is minimal or, even better, non-existent.

Another solution

Below, we would like to present a simple and robust solution that requires neither duplication of the data nor downtime or any freeze during migration, and that is overall minimally intrusive to the system. The solution uses the original table as the mother table for the partitioning. The data is moved out of the mother table by a BEFORE UPDATE trigger; a migration process updates an extra column of each row which is added solely for this purpose. In case a row is still in the mother table when queried, Postgres will simply deliver it from there.

Below we demonstrate how to set up a basic partitioning, and after that our proposal for the migration process is discussed.

Preparation for partitioning

Let's take the source table that acts as our live table. The part_value will be the base of our partitioning.
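The table definitions from the original post are not shown; assume a live table shaped roughly like this, with part_value driving the partitioning (names are made up):

    CREATE TABLE livedata
    (
        id          bigserial PRIMARY KEY,
        part_value  int  NOT NULL,
        payload     text
    );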

Let's create the child tables for partitioning:
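Since the routing is done with triggers, the children can be classic inheritance partitions; a sketch with two part_value ranges:

    CREATE TABLE livedata_0 (
        CHECK (part_value BETWEEN 0 AND 999)
    ) INHERITS (livedata);

    CREATE TABLE livedata_1000 (
        CHECK (part_value BETWEEN 1000 AND 1999)
    ) INHERITS (livedata);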

Now, add the insert trigger to the livedata table to forward the new inserts to the proper child tables:
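A minimal routing function, assuming the two child tables sketched above (the pattern is the same one shown in the PostgreSQL documentation for trigger-based partitioning):

    CREATE OR REPLACE FUNCTION insert_trigger() RETURNS trigger AS
    $$
    BEGIN
        IF NEW.part_value BETWEEN 0 AND 999 THEN
            INSERT INTO livedata_0 VALUES (NEW.*);
        ELSIF NEW.part_value BETWEEN 1000 AND 1999 THEN
            INSERT INTO livedata_1000 VALUES (NEW.*);
        ELSE
            RAISE EXCEPTION 'part_value out of range: %', NEW.part_value;
        END IF;
        -- returning NULL prevents the row from also landing in the mother table
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;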

Bind the function to a trigger:
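Binding it as a BEFORE INSERT row trigger on the mother table (EXECUTE FUNCTION needs PostgreSQL 11+; older releases use EXECUTE PROCEDURE):

    CREATE TRIGGER livedata_insert_trigger
        BEFORE INSERT ON livedata
        FOR EACH ROW EXECUTE FUNCTION insert_trigger();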

Up to this point we have prepared the partitions and the insert trigger, following the standard way. As soon as the function is bound to a trigger, the mother table inserts will land in the child tables, and will be delivered from there. The existing data will be delivered from the mother table.
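As a quick check (values made up), insert a row through the mother table and query the matching child table directly:

    INSERT INTO livedata (part_value, payload) VALUES (42, 'hello');

    SELECT * FROM livedata_0 WHERE part_value = 42;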

This should show the new entry in the partition table, as expected.

Migration through UPDATE

Now, let's migrate the data out of the mother table. Our logic uses an UPDATE trigger. For the controlled migration we need a value to update, so we choose to add an extra BOOL column without a default value. This column is also handy for monitoring the migration process, and the application/client can be unaware of its existence. Without default values and indexes, this modification runs in 10-100 ms even if the table is huge.
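Adding the flag is a plain ALTER TABLE; with inheritance the column automatically appears in the child tables as well:

    ALTER TABLE livedata ADD COLUMN in_partition bool;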

BEFORE UPDATE Trigger

Now let's add a BEFORE UPDATE trigger that copies the row from the mother table to the proper partition and sets the in_partition flag on the migrated row. This will be our migration function. The migration itself is then simply an UPDATE on the rows of the mother table, setting the in_partition flag to true.
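One possible shape of that migration function, assuming the livedata table sketched above with its primary key id (the original implementation may differ in detail): the trigger marks the row as migrated, re-inserts it so that the INSERT trigger routes it into the proper child, removes the physical row from the mother table, and cancels the original UPDATE by returning NULL. Because row triggers on an inheritance parent only fire for rows physically stored in the parent, rows that already live in a child table are not touched.

    CREATE OR REPLACE FUNCTION migrate_data() RETURNS trigger AS
    $$
    BEGIN
        -- mark the row as migrated and let the INSERT trigger route it to a child
        NEW.in_partition := true;
        INSERT INTO livedata VALUES (NEW.*);

        -- remove the physical row from the mother table
        DELETE FROM ONLY livedata WHERE id = OLD.id;

        -- cancel the original UPDATE on the mother table row
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER livedata_migrate_trigger
        BEFORE UPDATE ON livedata
        FOR EACH ROW EXECUTE FUNCTION migrate_data();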

The migration happens when there is any update on a not yet migrated row.

To migrate the values through some migration process, we use the in_partition BOOL flag. E.g. to do a batch migration by part_value, simply execute the following:
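For example (range values made up; ONLY restricts the statement to rows still stored in the mother table):

    UPDATE ONLY livedata
       SET in_partition = true
     WHERE part_value BETWEEN 0 AND 999;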

Any race conditions are handled by the database itself. The application code requires no change; all the migration happens inside the database. The data is not duplicated during the process, integrity is maintained by the database, and very little extra code is required. To check whether the migration is finished, either the size of the source table can be checked using the PostgreSQL internals, or we can set the in_partition attribute to true on each insert by adding the following to insert_trigger(), directly after the IF ... BETWEEN statement:
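The addition is a single assignment (a fragment of the routing function, not a standalone statement):

    NEW.in_partition := true;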

In this case there is no need to set this attribute to true in migrate_data(). Using this logic, if the following query returns no rows, the source table is fully migrated:
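One way to express that check with the names assumed above:

    SELECT 1
    FROM   livedata
    WHERE  in_partition IS NOT TRUE
    LIMIT  1;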

 

 

This blog post was written by László Forró and André Plöger.

 

To dig a bit deeper into zheap and PostgreSQL storage technology in general, I decided to provide some more empirical information about space consumption. As stated in my previous blog post, zheap is more efficient than the standard heap in terms of storage consumption. The reasons for that are discussed below.

The question is: while those theoretical statements are true, one wants to know what they mean in a real-world scenario. This blog will shed some light on this question and give you some more empirical insights into what is to be expected as soon as zheap is production-ready (which it is NOT as of October 2020).

Creating sample data for PostgreSQL heap and zheap

To show the differences in storage consumption I have created some sample data. To make it fair I have first added data to a temporary table which is in memory. This way there are no undesired side effects:
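The exact statements are not reproduced here; sample data of the described size can be generated along these lines (the column layout is an assumption):

    CREATE TEMPORARY TABLE raw (id int, x numeric);

    INSERT INTO raw
        SELECT id, random()
        FROM   generate_series(1, 10000000) AS id;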

10 million rows will roughly translate to half a gigabyte of data:
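The size can be checked with pg_size_pretty:

    SELECT pg_size_pretty(pg_relation_size('raw'));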

A standard temporary table is absolutely fine for our purpose.

Populating a zheap table

One of my favorite features in PostgreSQL is CREATE TABLE … LIKE …. It allows you to quickly create identical tables. This feature is especially useful if you want to clone a table containing a large number of columns and you don't want to list them all, manually create all indexes, etc.
Copying the data from "raw" into a normal heap table takes around 7.5 seconds:
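A sketch of the two statements (the table name t_heap is an assumption):

    CREATE TABLE t_heap (LIKE raw INCLUDING ALL);

    INSERT INTO t_heap SELECT * FROM raw;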

Let us do the same thing. This time we will use a zheap table. Note that to use zheap one has to add a USING-clause to the statement:
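The same thing with the zheap access method (again, t_zheap is an assumed name):

    CREATE TABLE t_zheap (LIKE raw INCLUDING ALL) USING zheap;

    INSERT INTO t_zheap SELECT * FROM raw;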

As you can see creating the content of the table takes a bit longer but the difference in table size is absolutely stunning:
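Comparing the two, using the names assumed above:

    SELECT pg_size_pretty(pg_relation_size('t_heap'))  AS heap_size,
           pg_size_pretty(pg_relation_size('t_zheap')) AS zheap_size;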

The zheap table is roughly 50% smaller than the normal PostgreSQL storage format. The main question naturally arising is: Why is that the case? There are basically two important factors: a much smaller tuple header, and the removal of alignment padding inside the row.

Let us consider the tuple header first: The new tuple header is only 5 bytes, which is almost 20 bytes less per row. That alone saves us around 200 MB of storage space. The reason for the smaller tuple header is that the visibility information has been moved from the row to the page level ("transaction slots"). The more columns you've got, the lower the overall percentage of savings will be, but if your table is really narrow, the difference between heap and zheap is very significant.

NOTE: Reduced storage consumption is mostly an issue for tables containing just a few columns - if your table contains hundreds of columns, it is less of an issue.

UPDATE and zheap space consumption

UPDATE has traditionally been an important thing when talking about zheap in general. So let us see what happens when a table is modified:
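For instance, an UPDATE that rewrites every row without changing its size (the statement is illustrative):

    UPDATE t_zheap SET id = id - 1;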

In my case the size of the row does not change - we simply want to change the ID of the data. What is important to notice here is that the size of the table stays identical. In the case of heap, the size of the data file would have doubled.
To support transactions UPDATE must not forget the old rows. Therefore the data has to be "somewhere". This "somewhere" is called "undo":

The undo area contains a couple of files (1 MB each) holding the undo data which is necessary to handle rollback properly (= putting the old data back into the table).
In other words: Being able to handle transactions is not free - the space needed to do that is simply handled in a different way.

One word about ROLLBACK

If you are using a normal heap ROLLBACK is basically free because a transaction can simply leave all its obsolete trash behind. Not so if you are using zheap to store data.
Let us compare and see:
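A sketch of the test for the heap table (the timings quoted in the text are the author's):

    \timing on

    BEGIN;
    UPDATE t_heap SET id = id - 1;
    ROLLBACK;    -- returns almost instantly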

As you can see the ROLLBACK is really quick - it does basically nothing. The situation is quite different in the case of zheap:
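The same exercise against the zheap table:

    BEGIN;
    UPDATE t_zheap SET id = id - 1;
    ROLLBACK;    -- has to apply the undo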

41 milliseconds is not much, but it is still a lot more than a fraction of a millisecond. Of course, things are slower, but keep in mind that zheap is all about avoiding table bloat, which has major advantages in the long run. One should therefore see this performance data in a different light. One should also keep in mind that COMMIT is (in most cases) way more likely than ROLLBACK. Thus putting a price tag on ROLLBACK might not be so problematic after all.

zheap: Trying it all out

If you want to give zheap a try, we suggest taking a look at our GitHub repo. All the code is there. At the moment we have not prepared binaries yet, but we will soon release Docker containers to make it easier for users to try out this awesome new technology.

Finally…

We again want to point out that zheap is still in development - it is not production-ready. However, this is a really incredible technology and we again want to thank Heroic Labs for the support we are receiving. We also want to thank EDB for the work on zheap they have done over the years.

If you want to learn more about storage efficiency, alignment, etc., we recommend checking out my blog post about column order. In addition, if you want to know more about specific aspects of zheap, feel free to leave a comment below so that we can maybe address those issues in the near future and dedicate entire articles to them.

Temporary tables have been around forever and are widely used by application developers. However, there is more to temporary tables than meets the eye. PostgreSQL allows you to configure the lifespan of a temporary table in a nice way and helps to avoid some common pitfalls.

CREATE TEMPORARY TABLE …

By default, a temporary table will live as long as your database connection. It will be dropped as soon as you disconnect. In many cases this is the behavior people want:
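For example (the table name is made up):

    CREATE TEMPORARY TABLE t_tmp (id int);
    INSERT INTO t_tmp VALUES (1);

    -- disconnect, reconnect, and try again:
    SELECT * FROM t_tmp;
    -- ERROR:  relation "t_tmp" does not exist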

Once we have reconnected, the table is gone for good. Also, keep in mind that the temporary table is only visible within your session. Other connections are not going to see the table (which is, of course, the desired behavior). This also implies that many sessions can create a temporary table having the same name.

However, a temporary table can do more. The most important thing is the ability to control what happens on commit:
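The ON COMMIT clause of CREATE TEMPORARY TABLE is what controls this; its synopsis is:

    CREATE TEMPORARY TABLE name ( ... )
        ON COMMIT { PRESERVE ROWS | DELETE ROWS | DROP };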

3 Options available

As you can see, there are three options. "PRESERVE ROWS" is the behavior you have just witnessed. Sometimes you don't want that. It is therefore also possible to empty a temporary table on commit:
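A sketch (names made up):

    CREATE TEMPORARY TABLE t_tmp (id int) ON COMMIT DELETE ROWS;

    BEGIN;
    INSERT INTO t_tmp VALUES (1);
    COMMIT;

    SELECT count(*) FROM t_tmp;    -- 0: the content is gone, the table remains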

In this case, PostgreSQL simply leaves us with an empty table as soon as the transaction ends. The table itself is still around and can be used.

Let's drop the table for now:
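Using the name from the sketch above:

    DROP TABLE t_tmp;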

Sometimes you want the entire table to be gone at the end of the transaction: "ON COMMIT DROP" can be used to achieve exactly that:
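Again a sketch:

    BEGIN;
    CREATE TEMPORARY TABLE t_tmp (id int) ON COMMIT DROP;
    INSERT INTO t_tmp VALUES (1);
    COMMIT;

    SELECT * FROM t_tmp;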

PostgreSQL will throw an error because the table is already gone. What is noteworthy here is that you can still use WITH HOLD cursors as shown in the next example:
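A sketch of that combination:

    BEGIN;
    CREATE TEMPORARY TABLE t_tmp (id int) ON COMMIT DROP;
    INSERT INTO t_tmp VALUES (1);
    DECLARE c CURSOR WITH HOLD FOR SELECT * FROM t_tmp;
    COMMIT;

    FETCH ALL FROM c;    -- still returns the row, although the table is gone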

The table itself is still gone, but the WITH HOLD cursors will ensure that the "content" of the cursor will survive the end of the transaction. Many people don't expect this kind of behavior, but it makes sense and can come in pretty handy.

Controlling memory usage …

If you are using temporary tables, it makes sense to keep them relatively small. In some cases, however, a temporary table might be quite large for whatever reason. To ensure that performance stays good, you can tell PostgreSQL to keep more of a temporary table in RAM. temp_buffers is the parameter in postgresql.conf you should be looking at in this case:
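The current value can be inspected and, if necessary, raised per session; note that it can only be changed before the session touches its first temporary table:

    SHOW temp_buffers;               -- 8MB by default

    SET temp_buffers TO '256MB';     -- only effective before the first use of
                                     -- temporary tables in this session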

The default value is 8 MB. If your temporary tables are large, increasing this value certainly makes sense.

Finally ...

If you want to find out more about PostgreSQL database performance in general, consider checking out my post about three ways to detect and fix slow queries.

Is your database growing at a rapid rate? Does your database system slow down all the time? And maybe you have trouble understanding why this happens? Maybe it is time to take a look at pg_squeeze and fix your database once and for all. pg_squeeze has been designed to shrink your database tables without downtime. No more need for VACUUM FULL - pg_squeeze has it all.

The first question any PostgreSQL person will ask is: Why not use VACUUM or VACUUM FULL? There are various reasons: A normal VACUUM does not really shrink the table on disk. It will look for free space, but it won't return this space to the operating system. VACUUM FULL does return space to the operating system, but it needs a table lock. In case your table is small this usually does not matter. However, what if your table is many TBs in size? You cannot simply lock up a large table for hours just to shrink it after table bloat has ruined performance. pg_squeeze can shrink large tables using only a small, short lock.

However, there is more. The following listing contains some of the operations pg_squeeze can do with minimal locking:

After this basic introduction it is time to take a look and see how pg_squeeze can be installed and configured.

PostgreSQL: Installing pg_squeeze

pg_squeeze can be downloaded for free from our GitHub repository; binary packages are also available for most Linux distributions. If you happen to run Solaris, AIX, FreeBSD or some other less widespread operating system, just get in touch with us. We are eager to help.

After you have compiled pg_squeeze or installed the binaries some changes have to be made to postgresql.conf:
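The settings described below boil down to a few lines in postgresql.conf, followed by a restart (the values are illustrative):

    wal_level = logical                        # pg_squeeze relies on logical decoding
    max_replication_slots = 10                 # at least one slot for pg_squeeze
    shared_preload_libraries = 'pg_squeeze'    # load the library at server start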

The most important thing is to set the wal_level to logical. Internally, pg_squeeze works as follows: It creates a new datafile (snapshot) and then applies changes made to the table while this snapshot is copied over. This is done using logical decoding. Of course logical decoding needs replication slots. Finally the library has to be loaded when PostgreSQL is started. This is basically it - pg_squeeze is ready for action.

Understanding table bloat in PostgreSQL

Before we dive deeper into pg_squeeze it is important to understand table bloat in general. Let us take a look at the following example:
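A sketch of such an example - a single-column table with 2 million integers, which is consistent with the sizes quoted in the text (the table name t_test is taken from the scheduling example further down):

    CREATE TABLE t_test (id int);

    INSERT INTO t_test
        SELECT * FROM generate_series(1, 2000000);

    SELECT pg_size_pretty(pg_relation_size('t_test'));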

Once we have imported 2 million rows the size of the table is 69 MB. What happens if we update these rows and simply add one?
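The update in question could be as simple as:

    UPDATE t_test SET id = id + 1;

    SELECT pg_size_pretty(pg_relation_size('t_test'));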

The size of the table is going to double. Remember, UPDATE has to duplicate the row which of course eats up some space. The most important observation, however, is: If you run VACUUM the size of the table on disk is still 138 MB - storage IS NOT returned to the operating system. VACUUM can shrink tables in some rare instances. However, in reality the table is basically never going to return space to the filesystem which is a major issue. Table bloat is one of the most frequent reasons for bad performance, so it is important to either prevent it or make sure the table is allowed to shrink again.

PostgreSQL: Shrinking tables again

If you want to use pg_squeeze you have to make sure that a table has a primary key. It is NOT enough to have unique indexes - it really has to be a primary key. The reason is that we use replica identities internally, so we basically suffer from the same restrictions as other tools using logical decoding.
Let us add a primary key and squeeze the table:
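A sketch of both steps; depending on the pg_squeeze version, squeeze_table() may take additional arguments (clustering index, target tablespaces) which can simply be passed as NULL:

    ALTER TABLE t_test ADD PRIMARY KEY (id);

    SELECT squeeze.squeeze_table('public', 't_test');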

Calling pg_squeeze manually is one way to handle a table. It is the preferred method if you want to shrink a table once. As you can see the table is smaller than before:
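Checking the size again:

    SELECT pg_size_pretty(pg_relation_size('t_test'));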

The beauty is that minimal locking was needed to do that.

Scheduling table reorganization

pg_squeeze has a builtin job scheduler which can operate in many ways. It can tell the system to squeeze a table within a certain timeframe or trigger a process in case some thresholds have been reached. Internally pg_squeeze uses configuration tables to control its behavior. Here is how it works:

The last column here is worth mentioning: It is a custom data type capable of holding cron-style scheduling information. The custom data type looks as follows:

If you want to make sure that pg_squeeze takes care of a table, simply insert the configuration into the table:
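An entry along these lines produces the schedule described below; the composite layout of the schedule column (minutes, hours, days of month, months, days of week) follows the pg_squeeze documentation, so double-check it against your installed version:

    INSERT INTO squeeze.tables (tabschema, tabname, schedule)
        VALUES ('public', 't_test', ('{30}', '{22}', NULL, NULL, '{3,5}'));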

In this case public.t_test will be squeezed at 22:30h in the evening every 3rd and 5th day of the week. The main question is: when is that? In our setup, days 0 and 7 are Sundays, so 3 and 5 mean Wednesday and Friday at 22:30h.
Let us check what the configuration looks like:
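A plain SELECT over the configuration table does the trick:

    SELECT * FROM squeeze.tables;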

Once this configuration is in place, pg_squeeze will automatically take care of things. Everything is controlled by configuration tables so you can easily control and monitor the inner workings of pg_squeeze.

Handling errors

If pg_squeeze decides to take care of a table, it can happen that the reorg process actually fails. Why is that the case? One might drop a table and recreate it, the structure might change, or pg_squeeze might not be able to get the brief lock at the end. Of course it is also possible that the tablespace you want to move a table to does not have enough space. There are many issues which can lead to errors. Therefore one has to track those reorg processes.
The way to do that is to inspect squeeze.errors:
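Again, a plain SELECT is all it takes:

    SELECT * FROM squeeze.errors;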

This log table contains all the relevant information needed to track things fast and easily.

Finally …

pg_squeeze is not the only Open Source tool we have published for PostgreSQL. If you are looking for a cutting edge scheduler we recommend taking a look at what pg_timetable has to offer.

In PostgreSQL, table bloat has been a primary concern since the original MVCC model was conceived. Therefore we have decided to do a series of blog posts discussing this issue in more detail. What is table bloat in the first place? Table bloat means that a table and/or its indexes are growing in size even if the amount of data stored in the database does not grow at all. If one wants to support transactions, it is absolutely necessary not to overwrite data when it is modified, because one has to keep in mind that people might want to read an old row while it is being modified, or roll back a transaction.

Therefore bloat is an intrinsic thing related to MVCC in PostgreSQL. However, the way PostgreSQL stores data and handles transactions is not the only way a database can handle transactions and concurrency. Let us see which other options there are:

In MS SQL you will find a thing called tempdb while Oracle and MySQL put old versions into the redo log. As you might know PostgreSQL copies rows on UPDATE and stores them in the same table. Firebird is also storing old row versions inline.

There are two main points I want to make here: every engine has to get rid of old row versions at some point, and no solution is without tradeoffs.

Getting rid of rows is definitely an issue. In PostgreSQL removing old rows is usually done by VACUUM. However, in some cases VACUUM cannot keep up or space is growing for some other reasons (usually long transactions). We at CYBERTEC have blogged extensively about that scenario.

“No solution is without tradeoffs” is also an important aspect of storage. There is no such thing as a perfect storage engine - there are only storage engines serving a certain workload well. The same is true for PostgreSQL: The current table format is ideal for many workloads. However, there is also a dark side which leads us back to where we started: Table bloat. If you are running UPDATE-intense workloads it happens more often than not that the size of a table is hard to keep under control. This is especially true if developers and system administrators are not fully aware of the inner workings of PostgreSQL in the first place.

zheap: Keeping bloat under control

zheap is a way to keep table bloat under control by implementing a storage engine capable of running UPDATE-intense workloads a lot more efficiently. The project has originally been started by EnterpriseDB and a lot of effort has been put into the project already.

To make zheap ready for production we are proud to announce that our partners at Heroic Labs have committed to fund the further development of zheap and release all code to the community. CYBERTEC has decided to double the amount of funding and to put up additional expertise and manpower to move zheap forward. If there are people, companies, etc. who are also interested in helping us move zheap forward we are eager to team up with everybody willing to make this great technology succeed.

Let us take a look at the key design goals:

So let us see how those goals can be achieved in general.

The basic design of zheap

zheap is a completely new storage engine and it therefore makes sense to dive into the basic architecture. Three essential components have to work together:

Let us take a look at the layout of a zheap page first. As you know PostgreSQL typically sees tables as a sequence of 8k blocks, so the layout of a page is definitely important:

At first glance, this image looks almost like a standard PostgreSQL 8k page but in fact it is not. The first thing you might notice is that tuples are stored in the same order as item entries at the beginning of the page to allow for faster scans. The next thing we see here is the presence of “slot” entries at the end of the page. In a standard PostgreSQL table visibility information is stored as part of the row which needs a lot of space. In zheap transaction information has been moved to the page which significantly reduces the size of data (which in turn translates to better performance).

A transaction occupies 16 bytes of storage and contains the following information: transaction id, epoch and the latest undo record pointer of that transaction. A row points to a transaction slot. The default number of transaction slots in the table is 4 which is usually ok for big tables. However, sometimes more transaction slots are needed. In this case, zheap has something called “TPD” which is nothing more than an overflow area to store additional transaction information as needed.

Here is the basic layout:

 

Sometimes many transaction slots are needed for a single page. TPD offers a flexible way to handle that. The question is: Where does zheap store TPD data? The answer is: These special pages are interleaved with the standard data pages. They are just marked in a special way to ensure that sequential scans won't touch them. To keep track of these special-purpose pages, zheap uses a meta page:

TPD is simply a way to make transaction slots more scalable. Having some slots in the block itself reduces the need to excessively touch pages. If more are needed, TPD is an elegant way out. In a way it is the best of both worlds.

Transaction slots can be reused after a transaction ends.

zheap: Tuple formats

The next important part of the puzzle is the layout of a single tuple: In PostgreSQL a standard heap tuple has a 20+ byte header because all the transactional information is stored in a tuple. Not so in this case. All transactional information has been moved to page level structures (transaction slots). This is super important: The header has therefore been reduced to merely 5 bytes. But there are more optimizations going on here: A standard tuple has to use CPU alignment (padding) between the tuple header and the real data in the row. This can burn some bytes for every single row in the table. zheap doesn't do that, which leads to more tightly packed storage.

Additional space is saved by removing the padding from pass-by-value data types. All those optimizations mean that we can save valuable space in every single row of the table.

 

Here is what a standard PostgreSQL tuple header looks like:

Now let us compare this to a zheap tuple:

 

As you can see a zheap tuple is a lot smaller than a normal heap tuple. As the transactional information has been unified in the transaction slot machinery, we don’t have to handle visibility on the row level anymore but can do it more efficiently on the page level.

By shrinking the storage footprint zheap will contribute to good performance.

UNDO: Keeping things in order

One of the most important things when talking about zheap is the notion of “undo”. What is the purpose of this thing in the first place? Let us take a look and see: Consider the following operation:
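Something as mundane as the following will do (table and values made up):

    UPDATE t_demo SET x = 0 WHERE id = 4711;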

To ensure that transactions can operate correctly UPDATE cannot just overwrite the old value and forget about it. There are two reasons for that: First of all, we want to support concurrency. Many users should be able to read data while it is modified. The second problem is that updating a row does not necessarily mean that it will be committed. Thus we need a way to handle ROLLBACK in a useful way. The classical PostgreSQL storage format will simply copy the row INSIDE standard heap which leads to all those bloat related issues we have discussed on our blog already.

The way zheap approaches things here is a bit different: In case a modification is made the system writes “undo” information to fix it in case the transaction has to be aborted for whatever reason. This is the fundamental concept applicable to INSERT, UPDATE, and DELETE. Let us go through those operations one by one and see how it works:

INSERT: Adding rows

In case of INSERT zheap has to allocate a transaction slot and then emit an undo entry to fix things on error. In case of INSERT the TID is the most relevant information needed by undo. Space can be reclaimed instantly after an INSERT has been rolled back which is a major difference between zheap and standard heap tables in PostgreSQL.

UPDATE: Modifying data

An UPDATE statement is far more complicated. There are basically two cases: either the new row fits into the space of the old one, or it does not.

In case the new row fits into the space of the old one, we can simply overwrite it in place and emit an undo entry holding the complete old row. In short: We hold the new row in zheap and a copy of the old row in undo, so that we can copy it back into the old structure in case it is needed.

What happens if the new row does not fit in? In this case performance will be worse because zheap essentially has to perform a DELETE / INSERT operation which is of course not as efficient as an in-place UPDATE.

Space can instantly be reclaimed in the following cases:

DELETE: Removing rows

Finally there is DELETE. To handle the removal of a row zheap has to emit an undo record to put the old row back in place in case of ROLLBACK. The row has to be removed from the zheap during DELETE.

UNDO and ROLLBACK in action

Up to now we have spoken quite a bit about undo and rollback. However, let us dive a bit deeper and see how undo, rollback, and so on interact with each other.

In case a ROLLBACK happens the undo has to make sure that the old state of the table is restored. Thus the undo action we have scheduled before has to be executed. In case of errors the undo action is applied as part of a new transaction to ensure success.

Ideally, all undo action associated with a single page is applied at once to cut down on the amount of WAL that has to be written. A nice side effect of this strategy is also that we can reduce page-level locking to the absolute minimum which reduces contention and therefore helps contribute to good performance.

So far this sounds easy but let us consider an important use case: What happens in the event of a really long transaction? What happens if a terabyte of data has to be rolled back at once? End users certainly don’t appreciate never-ending rollbacks. It is also worth keeping in mind that we must also be prepared for a crash during rollback.

What happens is that if the undo action is larger than a certain configurable threshold, the job is done by a background worker process. This is a really elegant solution that helps to maintain a good end-user experience.

Undo itself can be removed in three cases:

Let us take a look at a basic architecture diagram:

As you can see the process is quite sophisticated.

Indexing: A brief remark

To ensure that zheap is a drop-in replacement for the current heap it is important to keep the indexing code untouched. Zheap can work with PostgreSQL’s standard access methods. There is of course room to make things even more efficient. However, at this point no changes to the indexing code are needed. This also implies that all index types available in PostgreSQL are fully available without known restrictions.

Finally …

Currently zheap is still under development and we are glad for the contributions made by Heroic Labs to develop this technology further. So far we have already implemented logical decoding for zheap and added support for PostgreSQL. We will continue to allocate more resources to push the tool to make it production-ready.

If you want to read more about PostgreSQL and VACUUM right now consider checking our previous posts on the subject. In addition, we also want to invite you to keep visiting our blog on a regular basis to learn more about it and other interesting technologies.
