
PostGIS setup with Ubuntu 20.04.2

In one of his last blog posts, our CEO Hans-Jürgen Schönig explained how to set up PostgreSQL on Ubuntu. He consciously avoided any discussion of how to deal with our beloved PostGIS extension. Let's fix that right now by following these steps to set up PostGIS:

  1. Status quo assessment
  2. Setup on the OS level
  3. Setup on the database level

Status quo assessment

Let's assume you followed Hans' blog post (see link above) and installed PostgreSQL 13 on Ubuntu.
By utilizing apt list --installed, we can quickly list the installed packages related to PostgreSQL.
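For example (filtering the output for PostgreSQL-related packages is just a convenience, not part of the original command):

    apt list --installed 2>/dev/null | grep -i postgres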

From the console's output, we can confirm that both server and client have been installed in version 13.2.1. Together with this version string, the PostGIS support matrix tells us which PostGIS versions are supported.
The table below displays a relevant excerpt of this matrix:

PostgreSQL version | PostGIS 2.4 | PostGIS 2.5 | PostGIS 3.0 | PostGIS 3.1 (master)
PostGIS release date | 2017/09/30 | 2018/09/23 | 2019/10/20 | 2020/XX/XX
13 | No | No | Yes* | Yes*
12 | No | Yes | Yes | Yes
11 | Yes* | Yes | Yes | Yes
10 | Yes | Yes | Yes | Yes
9.6 | Yes | Yes | Yes | Yes
9.5 | Yes | Yes | Yes | No

PostGIS setup on the OS level

The support matrix suggests installing PostGIS 3.0 or 3.1 on top of PostgreSQL 13. Now let's verify that our main apt repository contains the necessary packages by utilizing apt-cache search.

The listing does not include packages for every minor PostGIS version. So how can we install PostGIS 3.0 instead of the latest PostGIS 3.1? As an interim step, let's output the packages' version table first.

From the output, we realize that the repository used does not provide PostGIS 3.0 packages for PostgreSQL 13. So let's execute apt-cache policy for postgresql-12 to see and understand the difference:

The version table now includes minor PostGIS version strings too, which can be handed over to apt-get install, in order to choose one particular PostGIS version. If this version string is not provided, the latest PostGIS version will be installed by default.

For demonstration purposes, let's install PostGIS 3.1.1 by passing the version string as follows:
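A sketch of what that can look like; the exact version suffix has to be copied from the apt-cache policy output and will differ between repository snapshots:

    # install a specific PostGIS package version (version string taken from apt-cache policy)
    sudo apt-get install postgresql-13-postgis-3=3.1.1+dfsg-1.pgdg20.04+1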

By executing apt list --installed again, we can confirm our successful PostGIS 3 installation on Ubuntu.

PostGIS setup on the database level

PostGIS must be enabled within each database separately. This implies that PostGIS must be registered as an extension within PostgreSQL's ecosystem. PostgreSQL kindly provides a view containing all available extensions to verify this.

From these results, we realize there is more than one extension related to PostGIS. Why is that the case, and how should we handle it? PostGIS is a large extension, containing various data types and functions. To group types and functions by topic, it is organized into several sub-extensions. The most recent refactoring in this area split out the raster functionality, for instance. That means that someone who deals solely with vector data is not forced to install postgis_raster too.

Finally, let's create a new database and enable PostGIS.
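A minimal sketch - the database name gisdb is just an example:

    CREATE DATABASE gisdb;
    \c gisdb
    CREATE EXTENSION postgis;
    -- only needed if you also want to work with raster data:
    CREATE EXTENSION postgis_raster;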

Querying pg_extension lists the extensions installed in the context of the current database:

To gather even more information about the PostGIS version installed, you can query postgis_full_version() - in order to see details about which libraries were referenced, too.
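For example:

    SELECT postgis_full_version();
    -- reports the PostGIS version plus the GEOS, PROJ, GDAL and LIBXML versions it was built against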

Conclusion

We successfully completed our task: PostGIS setup on a fresh and clean Ubuntu. Good preparation should prevent almost anything from going wrong. When it comes to upgrades, things are a bit more complicated. Do not forget to consult the PostGIS support matrix, and the folks at CYBERTEC, for further support.

Manage constraints over multiple rows

In PostgreSQL and many other relational databases, constraints are an integral part of the feature set. Many people are aware of primary keys, foreign keys, CHECK constraints, and table constraints. However, from time to time, the situation is way more complicated. That's when some more advanced techniques are needed to enforce integrity in the way it is desired by end users.
In this post, we will elaborate on restrictions spanning more than one row. What does that mean? Let us envision a simple scenario: We want to store information about airplane ownership. In aviation, it happens more often than not that a single plane has more than one owner. In this example, we want to make sure that the ownership parts actually add up to 100% of the whole. Similar challenges can be found over and over again in real business applications.

Creating sample data to illustrate the use of constraints over multiple rows

Let's first create some sample data, and see how we can actually solve the problem of ensuring that ownership always has to end up at 100% for a specific airplane:
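A minimal sketch of such a schema (the table and column names for the ownership side, and the CHECK constraint, are assumptions):

    CREATE TABLE t_plane
    (
        id        serial PRIMARY KEY,
        call_sign text   UNIQUE NOT NULL
    );

    CREATE TABLE t_owner
    (
        id         serial  PRIMARY KEY,
        plane_id   integer REFERENCES t_plane (id),
        owner_name text    NOT NULL,
        fraction   numeric NOT NULL CHECK (fraction > 0 AND fraction <= 100)
    );

    INSERT INTO t_plane (call_sign) VALUES ('D-EHWI'), ('D-ONUT');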

In this case, we've got two tables: The t_plane table contains a unique ID and the call sign of the plane. The call sign is similar to a license plate - it identifies the aircraft. "D" indicates Germany, "OE" means Austria and "N" would indicate that we are dealing with a US-registered aircraft.

As you can see, we have used a 1 : N relationship here. One plane can have many owners. The trouble is, if we add up the owners' percentages of a plane, we always have to end up with 100%. The nasty part is concurrency. What if many people make changes at the same time?

Let's take a look at the basic problem of concurrency: two sessions (user 1 and user 2) add an owner for the same plane at the same time, and the final check returns 200 for D-EHWI.

Here you can see that two people are about to modify the same airplane concurrently. There is nothing to stop those two users from actually doing that. The end result is that D-EHWI ends up with 200% ownership - a big no-no.

There are various ways to avoid such concurrency predicaments.

Using LOCK TABLE to protect against concurrency when coding constraints over multiple rows

One of the more commonly chosen methods is to use a table lock. Let's take a look and see how this works. The timeline of the two sessions (user 1 and user 2) is as follows:

  1. in case user 1 is slightly ahead, user 2 has to wait
  2. user 2 has to wait ...
  3. the check returns 100, we are fine
  4. user 2 wakes up
  5. user 2 gets 200
  6. we must roll back because the constraint is violated
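A minimal sketch of what one of these sessions could look like, based on the hypothetical schema from above (owner names are made up):

    BEGIN;

    -- EXCLUSIVE blocks concurrent writers, but readers can still read
    LOCK TABLE t_owner IN EXCLUSIVE MODE;

    INSERT INTO t_owner (plane_id, owner_name, fraction)
        SELECT id, 'Alice', 100
        FROM   t_plane
        WHERE  call_sign = 'D-EHWI';

    -- verify the invariant before committing
    SELECT sum(fraction)
    FROM   t_owner o
    JOIN   t_plane p ON p.id = o.plane_id
    WHERE  p.call_sign = 'D-EHWI';

    COMMIT;   -- or ROLLBACK if the sum is not 100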

In this case, we have used a table lock to solve the problem. The important point here is that PostgreSQL offers 8 different table-level lock modes:

ACCESS SHARE simply means that somebody is reading a row. All it does is to prevent things such as DROP TABLE from happening. However, what we need to do here is to ensure that people cannot insert or modify concurrently. The solution to the problem is lock level EXCLUSIVE. Reading is still possible, but only a single transaction can modify the table. ACCESS EXCLUSIVE would prevent both concurrent reads and writes, but that would be overkill in this case.

What is the problem with this approach? The answer is scalability: In case of a table lock, only ONE transaction can modify the same table at the same time. If 100 airplanes are modified at the same time, 99 transactions have to wait until one transaction is actually able to commit. One could argue that airplane ownership does not change that often. However, this might not be true for other use-cases; we need to keep in mind that scalability does matter. It makes no sense to buy a 32-core server if 31 cores are doomed to idle due to locking.
A more sophisticated solution is needed.

Transaction isolation levels: SERIALIZABLE

One solution to the problem is to use a higher transaction isolation level. Basically, PostgreSQL supports three out of four transaction isolation levels proposed by the ANSI SQL standard:

Currently, READ UNCOMMITTED is mapped to READ COMMITTED, which makes a lot of sense in an MVCC context. However, it's not enough here.

SERIALIZABLE is the method of choice in this case. What is the basic idea behind SERIALIZABLE? It provides us with the illusion of sequential execution - however, in the background, things happen with as much parallelism as possible.

Let's take a closer look at SERIALIZABLE. The timeline of the two sessions is as follows:

  1. start a transaction using the right isolation level
  2. make the desired changes
  3. check to make sure that ownership is correct (the app will decide whether to commit or roll back)
  4. meanwhile, the other session makes conflicting changes violating ownership
  5. the app decided to commit
  6. in this case, we will fail because transactions are not guaranteed to be independent anymore
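A minimal sketch of such a transaction, again based on the hypothetical schema from above:

    BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    INSERT INTO t_owner (plane_id, owner_name, fraction)
        SELECT id, 'Bob', 50 FROM t_plane WHERE call_sign = 'D-ONUT';

    INSERT INTO t_owner (plane_id, owner_name, fraction)
        SELECT id, 'Carol', 50 FROM t_plane WHERE call_sign = 'D-ONUT';

    -- verify the invariant; if a concurrent transaction touched the same data,
    -- PostgreSQL aborts with a serialization error instead of committing
    SELECT sum(fraction)
    FROM   t_owner o
    JOIN   t_plane p ON p.id = o.plane_id
    WHERE  p.call_sign = 'D-ONUT';

    COMMIT;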

We have started to insert data for D-ONUT. The idea is to insert it directly and check at the end of the transaction to see what happens. Note that there is no LOCK TABLE, no SELECT FOR UPDATE or anything of that kind. It is a plain transaction in isolation level SERIALIZABLE. It inserts two owners and then it checks for correctness. The important part is that SERIALIZABLE is actually going to error out:

PostgreSQL noticed that the same data was touched here and that the SELECT statement can surely not return the desired data. The transaction would actually have worked, had we not modified the data. But we did, and therefore the system figured that it cannot uphold the illusion of sequential execution.

There are a couple of questions people frequently ask concerning constraints over multiple rows and SERIALIZABLE:

Let's focus on SELECT FOR UPDATE first: It is true that SELECT FOR UPDATE does lock rows, and that we cannot modify them concurrently (UPDATE, DELETE). But there is a problem: SELECT FOR UPDATE protects against changes made to existing rows. It does not lock "future" rows. In other words: SELECT FOR UPDATE does not prevent other transactions from inserting data, which of course would allow the violation of our constraint (= total ownership has to be NULL or add up to 100%).

The second argument is heard more often than not: It is true that airplane ownership does not change very often. But what we have seen here is that getting things right is really easy and does not require much effort. So why not write proper code to avoid problems in the first place? Managing locking and transaction isolation well is important - not just for consistency but also for scalability. That's why I recommend you take coding constraints over multiple rows seriously.

Finally …

Transaction isolation is a highly important topic. Another crucial topic is how to know when to use what data type. If you want to find out more about data types in PostgreSQL, we recommend checking out our post on differentiating integer, float, and numeric types. It will give you some insight into what is possible on the data type front.

 



scram-sha-256 rejects a client connection
ยฉ Laurenz Albe 2021

 

Since v10, PostgreSQL has provided support for scram-sha-256 for password hashing and authentication. This article describes how you can adapt your application safely.

Why do we need scram-sha-256?

PostgreSQL uses cryptographic hashing for two purposes:

Now, the MD5 hashing method has weaknesses that make it unsuitable for cryptography. In particular, it is too easy to construct a string with a given MD5 hash. These shortcomings do not apply to the way PostgreSQL uses MD5, but it still makes sense to use a better hashing algorithm:

Hence the introduction of scram-sha-256 support in v10. If you can, start using the new hashing method. The increased difficulty of brute force password attacks makes it worth the effort.

Problems with switching over to scram-sha-256

There are two problems that make it hard to switch over from MD5 to scram-sha-256:

The error message you get with an old version of libpq when you attempt to connect to a server that requires scram-sha-256 authentication is:

An old JDBC driver will tell you:

Old versions of Npgsql will come back with:

Step-by-step instructions for switching to scram-sha-256

It is actually not difficult to convert to scram-sha-256, if you follow these guidelines:

1. Upgrade the client software

Upgrade all PostgreSQL client software and drivers that are too old to support the new authentication method. This is a good idea anyway, as it is never smart to get stuck with old, unmaintained software.

2. Change the password_encryption parameter

Edit postgresql.conf and change the parameter to
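That is, the (uncommented) line in postgresql.conf should read:

    password_encryption = scram-sha-256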

Make sure you remove the hash (#) at the beginning of the line. Then reload the server by running

where /postgres/datadir is the PostgreSQL data directory. Alternatively, you can run this SQL statement:
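A configuration reload from SQL can be done with the built-in function:

    SELECT pg_reload_conf();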

Look into the log file to see if the reload was successful, and check the new value via SQL:

Note that even though you changed the parameter, the old MD5 passwords still work, as long as the authentication method in pg_hba.conf is set to md5.

3. Set all passwords again

All password authenticated users have to change their password. In psql, a superuser can change any user's password with
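One way to do that in psql (the role name is just an example; \password prompts twice for the new password and sends it to the server already hashed according to password_encryption):

    \password some_user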

Even if the user sets the same password as before, the new password hash will use scram-sha-256. Before proceeding with the next step, examine the table pg_authid and make sure that it contains no more MD5-hashed passwords.

4. Change the authentication method in pg_hba.conf

This step is not strictly necessary, because PostgreSQL will use scram-sha-256 authentication for scram-sha-256-hashed passwords, even if the authentication method is set to md5 in pg_hba.conf. This is a compatibility feature.

Still, you should adapt pg_hba.conf by replacing all occurrences of "md5" with "scram-sha-256". That will prevent users who still have an old MD5 password from authenticating.

After that, reload the configuration as above. Then check the log file or examine the view pg_hba_file_rules to see if the reload was successful.

Conclusion

You can see from the above that it is not so difficult to change from md5 to scram-sha-256. The hard parts are that you have to set all the passwords again, and that you may have to upgrade the client software.

If you want to know how to protect yourself from common security problems, read Kaarel's blog or my article about SECURITY DEFINER functions.



A "materialized view" is a database object which stores the result of a precalculated database query and makes it easy to refresh this result as needed. Materialized views are an integral feature of pretty much all advanced database systems. Naturally, PostgreSQL also provides support for materialized views, and offers the end-user a powerful tool to handle more time-consuming requests.

The main questions are now: What are the pitfalls, and how can you make use of materialized views in the first place? Let's dive in and find out.

Creating a materialized view

Before we can actually take a look at materialized views and figure out how they work, we have to import some sample data which we can use as the basis for our calculations:
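A sketch of what such sample data could look like - the table name t_demo and its columns are assumptions; the point is simply 10 million rows split into 2 groups:

    CREATE TABLE t_demo (grp int, data numeric);

    INSERT INTO t_demo
        SELECT x % 2 + 1, random()
        FROM   generate_series(1, 10000000) AS x;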

We have created 10 million rows organized in 2 groups.

To create a materialized view in PostgreSQL, we can make use of the following syntax specification:

Basically, a materialized view has a name, some parameters, and is based on a query.

Here is an example of a materialized view:
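A simple example, aggregating the sample table from above (the view name and column aliases are made up):

    CREATE MATERIALIZED VIEW mat_view AS
        SELECT grp, avg(data) AS avg_data, count(*) AS total
        FROM   t_demo
        GROUP BY grp;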

What is really important to note here is the size of the materialized view compared to the underlying table:

423 MB vs. 16 KB is a real difference. If you want to query the materialized view, you can do it just like you would query a normal PostgreSQL table:

Looking at the execution plan (explain) shows that there is no difference between a normal table and a materialized view. We can simply query the data.
At this point, the only suboptimal thing is the query optimizer's estimate (= 1130 rows). However, that can easily be fixed.

The process is the same as for any other table:

ANALYZE has created fresh optimizer statistics and fixed the estimate nicely. If you want to know more about ANALYZE in general, we recommend our posts about PostgreSQL optimizer statistics and autovacuum. Autovacuum is also able to recreate optimizer statistics.

CREATE MATERIALIZED VIEW … USING …

However, there is more. PostgreSQL offers some additional important features which are relevant in this field. One of those features is the USING clause. What is the purpose of this one? At the moment, PostgreSQL supports only one storage format (heap). However, in the future, we hope to see more storage formats such as zheap or zedstore.

The idea of USING is to have the ability to select the best storage format for the given requirement. At the moment, heap is the default format, and that's perfectly fine.

CREATE MATERIALIZED VIEW … WITH …

The WITH-clause is the same as in CREATE TABLE. It allows you to define storage parameters such as autovacuum behavior, FILLFACTOR and so on.

Here's an example:
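A sketch, again based on the hypothetical t_demo table (the parameter values are arbitrary examples):

    CREATE MATERIALIZED VIEW mat_view_params
        WITH (fillfactor = 90, autovacuum_enabled = off) AS
        SELECT grp, avg(data) AS avg_data
        FROM   t_demo
        GROUP BY grp;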

In general, a materialized view is stored just like a table, so simply follow the same recommendations as you would follow when creating a standard table.

CREATE MATERIALIZED VIEW … TABLESPACE …

The same is true for the TABLESPACE keyword. In the default setup, the materialized view will be stored in the default tablespace which is $PGDATA:

SHOW is an easy way to figure out where that is. If you want to find out more about tablespaces, consider checking out our blog post about tablespaces in PostgreSQL.

CREATE MATERIALIZED VIEW … DATA vs. NO DATA …

By default, a materialized view is created in a way that it contains the result. However, if the object is created using the NO DATA option, it is empty. Only the definition is created.

One has to be aware of the fact that in this case, the view cannot be queried until it has been refreshed to contain data. We have not seen the NO DATA option being used too frequently in database support. It's still worth mentioning that this option does exist.

Modifying a materialized view in PostgreSQL

Sometimes a materialized view has to be modified later on. A classical example would be to rename a column of the view without recreating it.

Let's take a look at the definition of the view:

Modifying a materialized view is done using the ALTER MATERIALIZED VIEW command which is a powerful tool to facilitate all kinds of changes.

The syntax is pretty similar to what ALTER TABLE can do for you:

What is noteworthy here is that you can also do things such as CLUSTER (= index organize), SET STATISTICS (= adjust size of histogram for the PostgreSQL optimizer) and so on. The process is the same as for a normal table.

Now, let's try it out and rename a column:
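Sticking with the column names assumed earlier, it could look like this:

    ALTER MATERIALIZED VIEW mat_view RENAME COLUMN total TO row_count;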

Before we move on to refreshing, I want to focus your attention on indexing:

It is a good idea to index a materialized view to ensure fast querying. Again, the same rules as for a table exist. Simply index entries which are diverse and offer good selectivity when the materialized view is read.

Refresh materialized view

A materialized view contains a snapshot of the query result. It is not updated periodically, unless the user forces PostgreSQL to do so. In Oracle, materialized view support is a bit more sophisticated. Hopefully, PostgreSQL will soon catch up in this area a bit.

However, at the moment, a materialized view in PostgreSQL has to be refreshed.

Refresh manually using the REFRESH MATERIALIZED VIEW command:

Running the refresh is simple:
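For the example view from above:

    REFRESH MATERIALIZED VIEW mat_view;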

PostgreSQL knows the query which has to be re-executed to refresh the content of the materialized view. There is just one downside: PostgreSQL has to lock the object for the time of the refresh, which means that nobody can access it while it is refreshed.

Refreshing materialized views concurrently

Sometimes it is simply not acceptable to tolerate a locked materialized view. In that case, concurrent updates are needed. However, there is a problem:

The materialized view needs a unique column to support concurrent refreshing.

That means you have to determine a "primary key" and create a unique index:
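In the example view, grp is unique, so it can serve that purpose:

    CREATE UNIQUE INDEX idx_mat_view_grp ON mat_view (grp);

    REFRESH MATERIALIZED VIEW CONCURRENTLY mat_view;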

You need to be a bit careful and make sure that table bloat does not take over. Technically, the refresh is a DELETE / INSERT, which is prone to cause table bloat. Make sure a proper VACUUM policy is in place.

Finally …

Materialized views are an important feature in most databases, including PostgreSQL. They can help to speed up large calculations - or at least to cache them.

If you want to make sure that your materialized views are up to date, and if you want to read more about PostgreSQL right now, check out our blog about pg_timetable, which shows you how to schedule jobs in PostgreSQL. Why is pg_timetable so useful? Our scheduler makes sure that identical jobs cannot overlap; they simply don't execute again if the same job is already running. For long-running jobs, this is super important - especially if you want to use materialized views.
 



CROSS JOINS - promiscuity extreme!
ยฉ Laurenz Albe 2021

 

For many people, โ€œcross joinโ€ is something to be afraid of. They remember the time when they forgot the join condition and the DBA was angry, because the query hogged the CPU and filled the disk. However, there are valid use cases for cross joins which I want to explore in this article.

What is a cross join?

The term comes from relational algebra, which also calls the Cartesian product between two relations a cross product and denotes it by A × B. This is the most basic kind of join: it combines every row of one table with every row of another table.

A simple example:

Table A:

name | birthday
John | 1989-08-24
Paul | 2001-10-03
Claude | 1996-04-18

Table B:

street | city
Roman Road | Kexborough
Earl Street | Fitzwilliam

The cross product of the tables would be:

A × B:

name | birthday | street | city
John | 1989-08-24 | Roman Road | Kexborough
Paul | 2001-10-03 | Roman Road | Kexborough
Claude | 1996-04-18 | Roman Road | Kexborough
John | 1989-08-24 | Earl Street | Fitzwilliam
Paul | 2001-10-03 | Earl Street | Fitzwilliam
Claude | 1996-04-18 | Earl Street | Fitzwilliam

Cross join in SQL

There are two ways to write the cross join of A and B in SQL.

A comma separated list in the FROM clause:
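For example, assuming the two tables are simply called a and b:

    SELECT * FROM a, b;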

With the explicit CROSS JOIN syntax:
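The same cross join written explicitly:

    SELECT * FROM a CROSS JOIN b;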

What is problematic about cross joins?

Cross joins are the most basic joins, and you can think of an inner join as a cross join with an additional filter condition. Of course, PostgreSQL doesn't calculate inner joins that way. It uses more efficient join strategies.

If you write your joins using a comma separated table list (like in the first example above), an inner join and a cross join look very similar. The only difference is a WHERE condition. Now it is a frequent mistake to forget a join condition when you develop an SQL query. The result of such an omission is that you get way more result rows than you reckoned with: a cross join between two tables with a million rows each would result in a trillion rows!

Such a query will run forever. If the query contains an ORDER BY clause, the database server has to cache the whole result set in order to sort it. Since this result set doesn't fit into memory, PostgreSQL will start writing temporary files to hold the data. These temporary files can fill up the disk. As soon as the query runs out of disk space, PostgreSQL rolls it back and deletes the temporary files. However, if the timing is bad, even a short out-of-disk condition can cause the database server to crash.

How can I protect myself from unintended cross joins?

Never use the "comma separated list" syntax to write joins!

If you write your inner joins as a JOIN b, it is a syntax error to omit the join condition (ON or USING). A cross join is then explicitly written as CROSS JOIN and cannot happen by mistake.

Valid use cases for cross joins

The above sounds pretty discouraging, but there are situations when a cross join is just what you need. In the following, I present two typical cases:

Lateral cross join

In a lateral join, a join relation (an expression or subquery) can refer to earlier entries in the FROM clause. This is very often used in combination with table functions: if you want to join a row with all the table function results that belong to it, you use a lateral join. In that case, LATERAL already implies that each row is only joined to the function results that belong to it, so there is no need for an extra join condition.

Here is an example (not recommended for your production database!):
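A possible setup (all names are made up): a small table in which every row carries an array of values.

    CREATE TABLE t_array (id bigint, vals integer[]);

    INSERT INTO t_array VALUES
        (1, ARRAY[1, 2, 3]),
        (2, ARRAY[4, 5]);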

Then the arrays could be unpacked with a lateral expression like this:
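Using the hypothetical table from above:

    SELECT a.id, u.val
    FROM   t_array AS a
           CROSS JOIN LATERAL unnest(a.vals) AS u(val);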

Joining with a "variable"

Sometimes you have a more complicated or expensive expression that you want to use in several places in one query. In that case, it can be a good idea to write a common table expression. You have to use that CTE in the FROM clause, typically with a cross join:
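A sketch of the pattern - the table and the "expensive" expression are invented for illustration:

    WITH params AS (
        SELECT current_date - 30 AS cutoff   -- computed once, used like a variable
    )
    SELECT o.order_id, o.order_date
    FROM   orders AS o
           CROSS JOIN params AS p            -- params has exactly one row
    WHERE  o.order_date >= p.cutoff;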

Conclusion

You don't need cross joins very often, but sometimes they do come in handy.

Avoid the "comma separated list" join syntax, so that you don't get cross joins by mistake. Such a statement can run forever and use up your database machine's resources.

In case you need help to run your database in the most efficient way possible, CYBERTEC offers 24/7 support services to customers around the world.

PostgreSQL contains some hidden gems which have been around for many years and help to silently speed up your queries. They optimize your SQL statements in a clever and totally transparent way. One of those hidden gems is the ability to synchronize sequential scans. Actually, this feature has been around for 15+ years, but has gone mostly unnoticed by many end-users. However, if you are running data warehouses and analytical workloads, you might have already used synchronized seq scans without actually knowing it.

Reducing I/O for large SQL queries

Before we draw any conclusions, it is necessary to understand the problem we're trying to solve in the first place. Consider the following scenario: 10 users are concurrently running analytics on a large table. For the sake of simplicity, we assume that the size of the table is 1 TB and we can read 1 GB / second. In case there is just 1 user, we can read the data in around 17 minutes (assuming there is zero caching). But what if people are reading the table concurrently, and there is still zero caching because those reads are spread all over the PostgreSQL table? In that case, we face a problem:

Number of users | MB / second | Time
1 | 1000 | 16.6 minutes
2 | 500 | 33.2 minutes
4 | 250 | 67 minutes
10 | 100 | 166 minutes

Assuming that the overall throughput is a constant (which of course it is not - especially not on spinning disks) we really have a big problem.

Queries take longer the more users are added to the warehouse!

What is important to understand is that queries take longer and longer as more users are added to your warehouse.

Synchronized sequential scans in PostgreSQL

Synchronized sequential scans have been around for many years and are a good way to reduce I/O. Let's take a look at a picture explaining the solution to the problems we face in many data warehouses:

(Figure: Data warehousing - making use of synchronized seq scans)

Suppose query 1 starts a sequential scan on the table. After it has read 4 TB, a second query also needs to perform a sequential scan. Instead of starting at the beginning, the second query can start reading the table at the current position of query 1. The idea is to serve both sequential scans with a single I/O request. The scans proceed in lock step until they reach the end of the table; then the second scan continues reading at the start of the table until it has read the data it missed in the beginning.

More users can be added as needed. By keeping the scans together, I/O can be reduced dramatically in many cases - especially if the amount of data is really large. What is important to note here is that PostgreSQL synchronizes the scans once, but does not force them to stay together. If one of the scans is significantly slower than the other one, the scans won't stick together, but will go separate ways, to make sure that neither scan is forced to execute slowly.

However, since data warehousing and analytics in general are often I/O bound, we've got a good chance to see better performance.

To control this behavior, PostgreSQL offers the following variable:
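You can inspect the current setting like this:

    SHOW synchronize_seqscans;
    -- on (the default)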

By default, synchronize_seqscans is on, which is a good value. However, if you want to change the default behavior, it is possible to change the value in postgresql.conf.

Finally …

If you want to learn more about PostgreSQL performance tuning, consider checking out our blog about improving PostgreSQL's ability to handle time series.

In case you need help to run your database in the most efficient way possible, CYBERTEC offers 24x7 support services to customers around the world.

 

Real life is running in autocommit mode
ยฉ Laurenz Albe 2021

 

When analyzing customers' problems, I have seen the mess you can get into if you disable autocommit in your interactive client, so I'd like to bring this topic to a wider audience.

What is autocommit?

In PostgreSQL, like in any other ACID-compliant database, each statement runs in a transaction: if the statement causes an error, PostgreSQL undoes all its effects.

You can start transactions explicitly using the START TRANSACTION statement (or BEGIN in PostgreSQL). However, if you don't do that, different database systems behave differently:

The latter mode is called autocommit mode.

PostgreSQL and autocommit

PostgreSQL operates in autocommit mode, and there is no way to change that behavior on the server side.

Back in 2002, release 7.3 introduced a database parameter, autocommit, that allowed users to turn autocommit mode off. This proved to be a bad mistake: now the effects of SQL statements varied depending on that database setting, which led to all kinds of confusion and problems. Consequently, release 7.4 in 2003 removed that feature again.

Instead, PostgreSQL now implements autocommit mode on the client side. Almost all database clients and APIs offer a way to turn autocommit off. A few examples:

Note that in the above list, all clients except for psycopg2 run in autocommit mode by default.

The way that clients disable autocommit is by automatically sending a BEGIN statement before the first statement after the end of a transaction.

The appeal of disabling autocommit

If you are inside a database transaction, nothing much can go wrong: if you make a mistake, you can simply issue ROLLBACK and everything you did is undone. This is appealing because it allows you to recover from mistakes like running a DELETE statement without a WHERE clause.

Moreover, users who are used to database systems that do not operate in autocommit mode (for example, Oracle), like to disable autocommit to get the behavior that they are accustomed to.

This seems to be a safe and useful setting, so many people are surprised to learn that it is a really bad idea in PostgreSQL.

Problems with "autocommit off"

If you disable autocommit, and you don't explicitly commit or roll back, your database session will remain in the state "idle in transaction". Such long-running transactions cause two kinds of problems in the database:

Problem: locks in the database

PostgreSQL holds the locks that SQL statements take until the transaction ends. Now if you ran a data modifying statement after your last commit, some rows in your database are locked. Any other transaction that tries to modify such a row will hang until your transaction is ended.

To make things worse, each SQL statement causes a table-level lock on the tables it accesses: even a "harmless" SELECT will take an ACCESS SHARE lock on the tables it reads. This is to prevent concurrent data definition statements like DROP TABLE or ALTER TABLE from deleting the table or changing its definition while the SELECT is reading it.

As a consequence, a session with an open transaction can block data definition statements indefinitely. In a DevOps environment, where new versions are deployed frequently, this will lead to serious disruptions. So even a simple SELECT from a client with autocommit disabled can do damage!

Problem: autovacuum maintenance

Whenever you update or delete a row in PostgreSQL, it retains the old version of the row to satisfy concurrent read requests. These old versions soon become obsolete and have to be removed. This is the job of the autovacuum daemon in PostgreSQL. If your database session has a long-running open transaction, it is possible that autovacuum cannot delete any row versions that are younger than the start of your transaction, because you might still need them.

The consequence is that autovacuum keeps running to get rid of those "dead tuples", but it cannot delete them and cannot make any progress. This leads to table bloat, where a table grows out of proportion while containing mostly junk data. Once you have a bloated table, you usually have to take downtime to run VACUUM (FULL) to fix the problem. There are tools like pg_squeeze to make this less painful.

How to defend against "autocommit off"

There are a couple of things that the database administrator can do to mitigate these problems:
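A sketch of the corresponding postgresql.conf settings - the two parameters mentioned in the conclusion below; the values are merely examples:

    # terminate sessions whose transaction has been idle for more than 5 minutes
    idle_in_transaction_session_timeout = '5min'

    # allow cleanup of row versions older than roughly 6 hours, at the price of
    # "snapshot too old" errors for very long-running queries
    old_snapshot_threshold = '6h'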

While these parameters allow you to prevent the worst, they are "not nice", because they will lead to errors and rollbacks on the client side.

Conclusion

Nice as it may seem, resist the temptation to disable autocommit in your interactive client. Get used to PostgreSQL's autocommit mode, otherwise you will endanger the health of your database. The administrator can set idle_in_transaction_session_timeout and old_snapshot_threshold to prevent such damage, at the price of causing errors on the client side.

UPDATED August 2023: Autovacuum has been part of PostgreSQL for a long time. But how does it really work? Can you simply turn it on and off? People keep asking us these questions about enabling and disabling autovacuum. PostgreSQL relies on MVCC to handle concurrency in a multiuser environment. The problem which arises with concurrent transactions is that dead tuples must be cleaned up. In PostgreSQL this is handled by the VACUUM command, which we already have covered in some other posts. However, running VACUUM manually is a thing of the past. Most people rely on the PostgreSQL autovacuum daemon to handle cleanup.

How autovacuum works

The first thing to understand is that autovacuum really does what it says: basically, it is automation around manual VACUUM. All it does is to sleep for a while and check periodically if a table requires processing. There are three things autovacuum takes care of:

In PostgreSQL, autovacuum is a server-side daemon which is always there. Yes, that's right: ALWAYS. Even if you turn autovacuum off in postgresql.conf (or by using ALTER SYSTEM to adjust postgresql.auto.conf), the daemon will still be around - by design - to help with wraparound protection.

The way autovacuum works is: it periodically checks if work has to be done, and notifies the postmaster in case new workers have to be launched to take care of a table. Autovacuum does not launch a worker directly, but works indirectly through the postmaster to make sure that all processes are on one level. The fact that it works through the postmaster clearly helps to make the system more reliable.

Let's take a closer look at what autovacuum does.

Autovacuum creates optimizer statistics

The PostgreSQL optimizer relies heavily on statistics. It estimates the number of rows returned by various operations and tries to guess the best way to optimize a query.
The optimizer uses the statistical distribution of data in a table in order to do that. In case the content of a table changes, the optimizer has to use stale data, which in turn can lead to bad performance.

Therefore autovacuum kicks in on a regular basis to update all the statistics. Autovacuum relies on various configuration parameters (to be found in postgresql.conf) which can be changed to optimize this behavior:
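The two settings in question, shown here with their default values:

    autovacuum_analyze_threshold = 50       # minimum number of changed rows
    autovacuum_analyze_scale_factor = 0.1   # fraction of the table that has to change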

These parameters tell autovacuum when to create new statistics. With the settings above, the rule is as follows: a table is re-analyzed as soon as the number of rows changed since the last ANALYZE exceeds autovacuum_analyze_threshold plus autovacuum_analyze_scale_factor times the number of rows in the table.

Given the default configuration, autovacuum usually does a good job to maintain statistics automatically. However, sometimes it is necessary to exercise some more precise control:
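Per-table settings override the global default; the table name here is just an example:

    ALTER TABLE t_bigtable SET (autovacuum_analyze_scale_factor = 0.05);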

In this example, I have adjusted autovacuum_analyze_scale_factor to 0.05 to make autovacuum more aggressive. Once in a while, this is necessary - especially if tables are really, really large in size.

Autovacuum cleans out dead rows

Creating fresh optimizer statistics is quite important. However, there's more: cleaning out dead rows. The purpose of VACUUM in general is to make sure that dead rows are removed from heaps and indexes nicely. We have written extensively about VACUUM and cleanup in the past.

Usually, the default settings are ok. However, just like in the case of ANALYZE, it can make sense to adjust these parameters to make the autovacuum daemon either more aggressive, or to make it behave in a more relaxed way.

Setting autovacuum parameters can be done globally (check out postgresql.conf or postgresql.auto.conf), during table creation, or later on in the process, as shown in the next listing:

In this case, we have changed the scale factor and made autovacuum kick in less frequently.

Autovacuum handles transaction wraparound

There is a third thing autovacuum does: it prevents transaction wraparound. If you want to find out more about this topic, you can also check out my article about wraparound protection in PostgreSQL.

Enabling and disabling autovacuum in PostgreSQL: transaction identifiers

Wraparound protection is an important thing which must not be taken lightly; it can cause serious downtime and cause problems for various workloads.

Enabling and disabling autovacuum

Autovacuum can be turned off globally. However, this does NOT mean that the daemon stops - it merely means that it ONLY does wraparound protection. The reason why this is done is to make sure that downtime is minimized as much as possible. It's easier to live with table bloat than with downtime.

Therefore, stopping autovacuum is not really possible - it can only be suspended for the majority of tasks. It is not a good idea to turn off the autovacuum completely, anyway. In most cases, all turning it off does is cause trouble.

What does make sense is to disable autovacuum within special scenarios on individual tables. Here is a scenario in which it might make sense:

Why would anybody care about wraparound, cleanup and so on in this case? When it is clear that a table is thrown away anyway, why clean it up? Keep in mind: We're talking about very specific scenarios here, and we're definitely not talking about general-purpose tables.

Here's how you can turn autovacuum on and off for a single table:
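This is done via the table's storage parameters (the table name is just an example):

    -- suspend autovacuum for this one table
    ALTER TABLE t_scratch SET (autovacuum_enabled = off);

    -- and turn it back on
    ALTER TABLE t_scratch SET (autovacuum_enabled = on);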

Please understand that turning autovacuum off is certainly not a good idea unless you face a specific use case with a strong justification for doing so.

Manual VACUUM vs. automatic VACUUM

What happens if the autovacuum is running but somebody starts a manual VACUUM? The general rule is: autovacuum always loses out. If you start a VACUUM but there is already an autovacuum worker running, PostgreSQL will terminate the worker and give your manual process priority.

The same is true for pretty much all other operations. Suppose somebody wants to drop a column, a table, etc. In case of a conflict, PostgreSQL will always kill the autovacuum process and make sure that normal user operations aren't harmed by background operations.

Why is autovacuum so slow?

Many people keep asking why autovacuum is slower than manual VACUUM. First of all, in the default configuration, it is indeed true that autovacuum is a lot slower than manual VACUUM. The reason is a thing generally known as cost delay:
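The settings involved, shown with current default values (note that manual VACUUM runs without any delay by default):

    vacuum_cost_delay = 0                # manual VACUUM: no throttling by default
    autovacuum_vacuum_cost_delay = 2ms   # autovacuum sleeps regularly (20ms in older releases)
    autovacuum_vacuum_cost_limit = -1    # -1 means: fall back to vacuum_cost_limit (200)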

When VACUUM hits a table, it usually does so "full speed", which means that cleaning up a table can cause massive I/O bottlenecks. A single VACUUM job can suck hundreds of megabytes of I/O per second - which leaves less capacity for other operations, and can, in turn, lead to bad response times.

The solution is to punish autovacuum for I/O and add artificial delays to the process. Processing a table will take longer, but it leads to a lower overall impact on other processes. In older versions of PostgreSQL, the delay was set at 20 milliseconds. However, it is set to 2 milliseconds in PostgreSQL 12 and beyond.

The artificial delays (which can be turned off on a per-table basis) are the core reason why end-users observe runtime differences.

Finally …

If you want to know more about PostgreSQL performance, we also recommend checking out our consulting services. We help you to tune your database and make sure that your servers operate perfectly.

 
