The worker's union likes PostgreSQL's implementation of DROP ROLE
© Laurenz Albe 2022

 

You might, at first glance, believe that DROP ROLE (or DROP USER, which is the same) is a simple matter. However, that's not always the case. So I thought it might be a good idea to show you the problems involved and how to solve them.

Why is there no DROP ROLE ... CASCADE?

PostgreSQL tracks dependencies between objects in the system catalogs pg_depend and pg_shdepend. Sometimes, dropping an object automatically drops other associated objects: for example, dropping a table will drop all indexes and triggers associated with the table. In other cases, you get an error if you drop an object with dependencies: for example, you can't drop a schema that still contains other objects.

The usual way to automatically drop all dependent objects is to use the CASCADE option. For example, “DROP SCHEMA schema_name CASCADE” will automatically try to drop all objects in the schema. So it comes as a surprise that there is no CASCADE option for DROP ROLE. Now what is the reason? Consider these facts:

- roles, unlike most other objects, are shared across all databases of the cluster
- a role can own objects and hold permissions in every database of the cluster
- an SQL statement can only affect objects in the database to which you are currently connected

As a consequence, PostgreSQL cannot guarantee that a single statement will get rid of all objects that depend on a role. So we can't have a CASCADE option for DROP ROLE that can drop all dependent objects.

How can you avoid trouble with DROP ROLE?

The best thing you can do is to never drop a role that owns objects or has permissions. That's easy to accomplish if you use “group roles” to carry permissions and ownership, and if you make sure that roles that might be dropped are members of these group roles. That way, the roles inherit all privileges, and you can still drop them at any time. For example, you could have a role “readonly” that has read permissions on everything, and if you temporarily need a role with read access, you just make that role a member of readonly.
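
A minimal sketch of that approach (role, schema and privilege choices are just an illustration):

    -- a group role that carries the permissions
    CREATE ROLE readonly NOLOGIN;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;

    -- a login role that can be dropped at any time
    CREATE ROLE alice LOGIN;
    GRANT readonly TO alice;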

Another advantage of using group roles is that you don't have to grant the same permissions to several roles. Not only is that cumbersome and error-prone, but it also inflates the “access control lists” on the objects.

The big helpers: DROP OWNED BY and REASSIGN OWNED BY

PostgreSQL has a utility command DROP OWNED BY that drops all objects owned by a role in the current database. It also revokes all permissions granted to the role in the current database. So you can get rid of most objects with dependencies on a role by connecting to all databases in turn and running “DROP OWNED BY role_name” in each of them.

Note that this may still fail: some of the objects that DROP OWNED BY wants to drop might have dependencies on objects owned by other roles, which prevent PostgreSQL from dropping them. In that case, you can use “DROP OWNED BY role_name CASCADE” to have the DROP cascade to these dependent objects.

An alternative to dropping the objects is to change their ownership to a different role. You can do that in bulk with REASSIGN OWNED BY. Again, this only affects objects in the current database, so you have to issue REASSIGN OWNED BY in all databases with objects owned by the role you want to get rid of. Note: REASSIGN OWNED BY does not modify or remove permissions granted to the role, so you may want to follow it with DROP OWNED BY to revoke those permissions.
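
Put together, getting rid of a role could look like this (role names are made up; run the first two statements in every database of the cluster):

    REASSIGN OWNED BY alice TO postgres;  -- hand over ownership to another role
    DROP OWNED BY alice;                  -- revoke the remaining permissions
    DROP ROLE alice;                      -- run once, after all databases are clean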

Objects left behind by DROP OWNED BY

There are only two kinds of objects that might be left behind by DROP OWNED BY, both of which are not part of any database:

- databases owned by the role
- tablespaces owned by the role

Getting rid of the databases is easy: you connect to database postgres and drop them or change their ownership.

Tablespaces are more difficult: you cannot drop a tablespace unless there are no more objects in that tablespace. This is a similar case to roles, since a tablespace can contain objects from different databases. Consequently, there is also no CASCADE option for DROP TABLESPACE. In practice, you will probably end up changing the ownership of the tablespace to a different role.
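
In both cases, changing the ownership is a single statement (names are made up):

    ALTER DATABASE appdb OWNER TO postgres;      -- or: DROP DATABASE appdb;
    ALTER TABLESPACE fast_ssd OWNER TO postgres;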

But I still get an error if I try to drop the role “postgres”!

There is one special role in each PostgreSQL cluster: the bootstrap superuser created during initdb. Usually that role is called “postgres”, but it always has the object ID 10. If you attempt to drop that role or use DROP OWNED BY and REASSIGN OWNED BY on it, you will get error messages telling you that the role and the objects it owns are required by the database system.

You always need the bootstrap superuser, so you can't remove it. If you don't like the name of the user, you can simply rename it with ALTER ROLE ... RENAME TO .... Object names are only tags in PostgreSQL and you can change them at any time.
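
For example (run this as some other superuser, since you cannot rename the role you are currently connected as):

    ALTER ROLE postgres RENAME TO admin;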

Conclusion

With DROP OWNED BY and REASSIGN OWNED BY it is not difficult to get rid of a role (user), even if it has many dependencies. Still better is to avoid the problem from the start by not granting permissions to roles that might get deleted. Use group roles for that!

If you are interested in tracking permissions in a PostgreSQL database, read more about it here in Hans' blog: pg_permissions.

 



PostgreSQL gets a query parameter of type
© Laurenz Albe 2022

 

Recently, I could help solve a “mysterious” performance problem for a customer. The problem turned out to be a badly chosen query parameter type, which led to poor query performance. After working on this customer's case, I realized that the meaning of the data type of a query parameter is not universally understood, which is why I decided to write this article.

What are query parameters?

In SQL, we often encounter queries that only differ in constant values, for example
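
For example (using a hypothetical table tab):

    SELECT val FROM tab WHERE id = 42;

    SELECT val FROM tab WHERE id = 4711;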

It makes sense to consider them as different instances of the same query:
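
Continuing the hypothetical example:

    SELECT val FROM tab WHERE id = $1;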

Here, $1 is a placeholder for an arbitrary value. Such a placeholder is called a query parameter (also known as a bind variable). When you execute the query, you have to supply an actual value for the parameter.

Using a placeholder has several advantages:

- you don't have to convert the value to its text representation and quote it properly
- there is no danger of SQL injection
- if you use prepared statements, PostgreSQL does not have to parse and analyze the statement again for each new set of parameter values

Support for query parameters in PostgreSQL

The PostgreSQL frontend/backend protocol provides two ways to send an SQL statement:

- the simple query protocol, where the statement is sent as a single string and cannot contain parameters
- the extended query protocol, where the statement and the parameter values are sent separately, in several steps

Note that you can use query parameters only for constants. It is not possible to use parameters for identifiers like table, column or schema names. Also, parameters are only supported in SELECT, INSERT, UPDATE and DELETE.

Prepared statements

When you prepare a statement, you can assign it a name, so that the same statement can be executed with different parameter values. This is called a prepared statement, and every database client API provides support for that.

Using JDBC in Java, you would create a PreparedStatement for the query, set the parameter values with the appropriate set* methods (setLong() for a bigint parameter, for example) and then execute the statement.

The placeholder is always ?, and the first argument of the set* methods specifies the parameter number.

In PostgreSQL, you can also use prepared statements in SQL. PREPARE creates a prepared statement, and EXECUTE executes it:
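
With the hypothetical query from above, that could look like this:

    PREPARE stmt(bigint) AS
       SELECT val FROM tab WHERE id = $1;

    EXECUTE stmt(42);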

Query parameter data types and type conversion

You may have noticed that in both of the above cases, there was a data type associated with the parameter: long in Java and bigint in SQL. However, you don't have to specify a data type; in that case, PostgreSQL will try to infer the data type itself. Such an “untyped value” is represented by the data type “unknown” in PostgreSQL.

But PostgreSQL will also consider type conversions for data types other than unknown. For example, there is no equality operator to compare bigint and numeric in PostgreSQL:
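
One way to verify that (not necessarily how it was checked originally) is to look for such an operator in the system catalog:

    SELECT oprname
    FROM pg_operator
    WHERE oprname = '='
      AND oprleft = 'bigint'::regtype
      AND oprright = 'numeric'::regtype;
    -- no rows returned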

Still, you can compare a bigint with a numeric, because PostgreSQL will implicitly convert the bigint to a numeric (there is an implicit cast in that direction, but not the other way around). You can find the exact rules that govern these conversions in the documentation. These rules are quite complicated, because PostgreSQL supports overloading for functions and operators.

An example of the performance impact of query parameter types

To illustrate that, let's consider this simple example:
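
Any big table with a bigint primary key will do; a made-up table like this is enough for the experiment:

    CREATE TABLE large (
       id  bigint PRIMARY KEY,
       val text NOT NULL
    );

    INSERT INTO large
    SELECT i, 'some text'
    FROM generate_series(1, 1000000) AS i;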

Then we write a Java program that queries the database with the code snippet from above, but we set the parameter in three different ways. Using the auto_explain extension, we capture the execution plan on the server in each case.

Using integer as the parameter type

The parameter is set with setInt(), and the resulting execution plan is an index scan using the primary key index.

This works fine, since setInt marks the parameter as type integer, and there is an equality operator to compare integer and bigint.

Using unknown as the parameter type

The parameter is set as type java.sql.Types.OTHER with setObject(), and the resulting execution plan is again an index scan using the primary key index.

This works just as well, because a parameter of type unknown is inferred to have the same type as id, which is bigint.

Using numeric as the parameter type

The parameter is set with setBigDecimal(), and the resulting execution plan is a sequential scan with a filter on id::numeric.

Surprisingly, PostgreSQL used a sequential scan, and the performance is much worse. The reason is that the JDBC driver maps java.math.BigDecimal to numeric, so the parameter has the data type numeric. As we have seen above, there is no equality operator for bigint and numeric, so PostgreSQL converts both to numeric. This is clearly visible in the execution plan above. Since the primary key index is defined on “id” and not on “id::numeric”, it cannot be used in this case.

How to avoid problems with query parameter data types

From the above, we can learn the following lesson:

- either use the parameter data type that corresponds to the PostgreSQL data type of the column you compare it with,
- or send the parameter as type unknown and let PostgreSQL infer the correct data type

Sometimes both of these strategies won't work, because there is no data type in your host language that matches the PostgreSQL data type - and PostgreSQL guesses wrong. An example would be this query:
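
For example, a query like this one (the table and column are made up; created is a timestamp with time zone column):

    SELECT * FROM events
    WHERE created > current_timestamp - $1;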

We want to supply an interval like “1 year” or “1-0”. Java does not have a data type for intervals, so we would have to send an unknown parameter. However, PostgreSQL infers that we mean the operator timestamp with time zone - timestamp with time zone rather than timestamp with time zone - interval, so we will end up with error messages complaining that “1 year” is not a valid timestamp. In this case, the solution is to add an explicit type cast to the statement:
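
For the hypothetical query from above:

    SELECT * FROM events
    WHERE created > current_timestamp - CAST($1 AS interval);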

So we have to amend the above list with a third point:

- if neither of these options works, add an explicit type cast to the SQL statement, so that PostgreSQL knows which data type to expect

Conclusion

It is easy to overlook the fact that the data type of a query parameter can matter for query performance. To avoid problems, either don't specify a type and let PostgreSQL do the right thing, or explicitly specify the correct type.

If you want to read more about PostgreSQL query protocol and performance, perhaps you would be interested in my article on pipeline mode.

pipeline mode in real life
© Laurenz Albe 2022

 

It is known that high network latency is bad for database performance. PostgreSQL v14 has introduced “pipeline mode” for the libpq C API, which is particularly useful to get decent performance over high-latency network connections. If you are using a hosted database in “the cloud”, then this article might be interesting for you.

The PostgreSQL extended query protocol

To understand pipeline mode, we have to understand the message flow between client and server. Using the extended query protocol, statements get processed as follows:

- the client sends a “Parse” message containing the statement
- the client sends a “Bind” message with the parameter values
- the client sends a “Describe” message to request a description of the result
- the client sends an “Execute” message to execute the statement

The database transaction is finished by sending a “Sync” message. These messages are typically sent in a single TCP packet. All of the above messages are generated by a single call to the libpq function PQexecParams. In the case of a prepared statement (PQprepare followed by PQexecPrepared), the “Parse” step is separated from the “Bind” and “Execute” steps.

After “Sync” has been sent, the client waits for a response from the server. The server processes the statement and replies with

- “ParseComplete” and “BindComplete” messages, plus a description of the result
- the result rows as “DataRow” messages (if the statement returns rows)
- a “CommandComplete” message once the statement is done

Finally, the server sends a “ReadyForQuery” message to indicate that the transaction is complete and it is ready for more. Again, these messages are typically sent in a single TCP packet.

How pipeline mode works

Pipeline mode is nothing new on the frontend/backend protocol level. It just relies on the fact that you can send more than one statement before sending “Sync”. This allows you to send multiple statements in a single transaction without waiting for a response from the server. What is new is support for this from the libpq API. PostgreSQL v14 introduced the following new functions:

- PQenterPipelineMode and PQexitPipelineMode to switch pipeline mode on and off
- PQpipelineSync to send a “Sync” message
- PQsendFlushRequest to send a “Flush” message
- PQpipelineStatus to check whether a connection is currently in pipeline mode

The statements themselves are sent using the asynchronous query execution functions PQsendQuery, PQsendQueryParams and PQsendQueryPrepared, and after the “Sync” message has been sent, PQgetResult is used to receive the responses.

Since all that does not rely on new features in the frontend/backend protocol, you can use pipeline mode with older versions of the PostgreSQL server.

Performance advantages of pipeline mode

Let's assume the case of a simple money transfer using a table like
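The exact definition doesn't matter much; something like this will do (a made-up sketch):

    CREATE TABLE account (
       id     bigint PRIMARY KEY,
       amount numeric NOT NULL
    );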

To transfer money from one account to the other, we have to run a transaction like
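
For example, for a transfer of 100 between two (made-up) accounts:

    BEGIN;
    UPDATE account SET amount = amount - 100 WHERE id = 1;
    UPDATE account SET amount = amount + 100 WHERE id = 2;
    COMMIT;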

With normal processing, that makes four round trips from the client to the server, so the whole transaction will incur eight times the network latency.

Figure: normal mode PostgreSQL performance

Using the pipeline mode, you can get away with only twice the network latency:

Figure: pipeline mode for better PostgreSQL performance

A code sample using pipeline mode

This is C code that can be used to process the above transaction. It uses a prepared statement stmt for this UPDATE statement:
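
The prepared statement is presumably something like this (the parameter order is an assumption):

    UPDATE account SET amount = amount + $2 WHERE id = $1;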

In order to focus on the matter at hand, I have omitted the code to establish a database connection and prepare the statement.

Measuring the speed improvement with pipeline mode

To verify the improvement in speed, I used the tc utility on my Linux system to artificially add a 50 millisecond latency to the loopback interface (tc qdisc add dev lo root netem delay 50ms).

The added latency can be removed again with tc qdisc del dev lo root.

I measured the time spent in the above function, as well as the time for a function that used no pipeline and an explicit transaction:

                     no pipeline (8 times network latency)   pipeline (2 times network latency)
    first attempt    406 ms                                   111 ms
    second attempt   414 ms                                   104 ms
    third attempt    414 ms                                   103 ms

With short SQL statements like these, the speed gain from pipelining is almost proportional to the client-server round trips saved.

An alternative way to get a similar speedup

If you don't want to use the libpq C API or directly speak the frontend/backend protocol, there is another way to get a similar performance improvement: you can write a PL/pgSQL function or procedure.
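
For the money transfer above, such a procedure could look like this (a sketch; the name and parameter names are made up):

    CREATE PROCEDURE transfer(p_from bigint, p_to bigint, p_amount numeric)
    LANGUAGE plpgsql AS
    $$BEGIN
       UPDATE account SET amount = amount - p_amount WHERE id = p_from;
       UPDATE account SET amount = amount + p_amount WHERE id = p_to;
    END;$$;

    CALL transfer(1, 2, 100);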

This will also run inside a single transaction and is just as fast, because calling the function or procedure takes only a single client-server round trip, and all the SQL statements inside it are executed right in the database, with no network latency between them.

Writing a PL/pgSQL procedure is probably the simpler solution in this case. However, pipeline mode gives you precise control over the message and data flow between client and server, which you won't get with a function.

You may not be able to use this solution if you operate in an environment where database functions are anathema. But then, strictly held religious beliefs can be expected to make you suffer occasionally!

Conclusion

Pipeline mode, new with the libpq C API in PostgreSQL v14, allows for considerable performance improvements with laggy network connections. Since it doesn't use new features in the frontend/backend protocol, it can be used with old server versions as well. Often, similar performance gains can be achieved using PL/pgSQL functions.

If you are interested in data manipulation performance, you may want to read my article about HOT updates.

There are certain types of questions which accompany any PostgreSQL consultant throughout his or her entire career. Invariably, one of those questions is: “How can we compare the content of two tables?” This is an important thing to know, because we might want to ensure that data is identical after a migration, or perhaps we want to verify that data has not been corrupted during storage. These questions arise so frequently that it was necessary to come up with a speedy solution for comparing the content of tables. The answer lies in creating checksums for tables.

md5sum on Linux

On a Linux system you normally use a program such as md5sum to get the checksum, or "digital fingerprint", of a file. Here's how it works:

md5sum is a widely-used method to obtain checksums. Checksums can be compared more easily than entire files - especially when those files are large.

The question which naturally arises is: How can we achieve the same thing in a database? What can we do in SQL to create such a checksum?

Ordered sets in SQL

The solution to the problem in SQL is to use an “ordered set”. Now what is that? Here is an example of a simple ordered set:
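
For example (the query matches the explanation that follows):

    SELECT id % 2 AS group_id,
           percentile_disc(0.5) WITHIN GROUP (ORDER BY id)
    FROM generate_series(1, 43243) AS id
    GROUP BY id % 2;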

This is the SQL way to calculate a median value. What happens is: The generate_series function creates values from 1 to 43243 (one row per value). The GROUP BY clause dissects them into 2 groups. One group will contain odd numbers and the other group will contain even numbers. The trick is: If we calculate the average (mean) number, the order of data does not matter. Suppose we have 1, 2, 3, 4, 5: The average will always be 3 - regardless of the order. However, how does a median work? We order the data and then we go 50% into the ordered data set - percentile_disc(0.5) does exactly that. Finding the “middle” value does not work without order.

The same is true for checksums. If we want to create a fingerprint for a table, the order of entries does matter. Identical data in a different order will lead to different results - and we want to avoid that.

Checksums in PostgreSQL

To solve this problem, we implement an ordered set which consumes sorted data and produces a checksum.

Before we implement the ordered set we need to see what's already available in PostgreSQL:
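
A quick look at the system catalog (one way to check) shows plain md5 functions, but no aggregate:

    SELECT proname, prokind, pg_get_function_arguments(oid) AS args
    FROM pg_proc
    WHERE proname = 'md5';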

A simple md5 function is available, but there's no corresponding aggregate. We need to define one. Here's how it works:
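
Based on the description below, the definition could look like this (a reconstruction; the exact body of the transition function may differ):

    CREATE FUNCTION md5_agg_sfunc(text, anyelement)
       RETURNS text
       LANGUAGE sql
       AS $$SELECT md5($1 || $2::text)$$;

    CREATE AGGREGATE md5_agg (ORDER BY anyelement)
    (
       STYPE    = text,
       SFUNC    = md5_agg_sfunc,
       INITCOND = ''
    );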

An “ordered set” needs an INITCOND. We initialize an empty string. The general idea is: We take a string, add the next row and turn it into a checksum. Then we add a row, turn it into a checksum and so on. We do that as long as there is data. Basically, it is a checksum over checksums, to ensure that the amount of data we're working on is never too large to fit into memory. For every row in the loop, we call SFUNC (in our case, md5_agg_sfunc).

Order

In the context of an ordered set, one thing is of great importance: You might have guessed it - it is order. We use a little trick here:
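
First, we need some data to play with (a made-up table):

    CREATE TABLE t_test (a int, b int);

    INSERT INTO t_test
    SELECT x, x + 1
    FROM generate_series(1, 1000) AS x;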

After loading some data, we can identify what's possible:
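
For example, you can sort by the table itself:

    SELECT * FROM t_test ORDER BY t_test;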

Note that we're actually sorting by the “table”. What that means is that we're sorting by every column (so in our case, ORDER BY a, b). This is important, as you will see in a moment.

It's also possible to use an entire row just like a column:
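
For example, with the t_test table from above:

    SELECT t_test AS whole_row FROM t_test LIMIT 3;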

That is convenient: Let's recap the definition of the aggregate:

CREATE AGGREGATE md5_agg (ORDER BY anyelement) ...

What we see here is that the aggregate can basically order by “anything”. So, as in this case, we fully exploit that. As long as a row can be sorted, we can use it for our checksums:
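
With the t_test table from above, that looks like this:

    SELECT md5_agg() WITHIN GROUP (ORDER BY t_test)
    FROM t_test;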

We've successfully created a checksum for this table. Of course, we can also checksum subsets of data:
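
For example, one checksum per group (again with the made-up t_test table):

    SELECT a % 2 AS group_id,
           md5_agg() WITHIN GROUP (ORDER BY t_test)
    FROM t_test
    GROUP BY a % 2;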

We can always use a GROUP BY statement to split the data into groups.

Finally…

Creating checksums can be incredibly useful to help ensure the quality of your data. However, there is more. If you really want to protect your data and ensure maximum security, we suggest taking a look at PostgreSQL TDE which offers a way to encrypt data on disk. Check out our blog post about managing encryption keys with PostgreSQL TDE to read more.

 



We want to announce that the beta release of pgwatch2 v1.9 is now available for download. This release contains previews of all features that will be available in the final release of pgwatch2 v1.9, though some details of the release could change before then.

You can find information about all of the new features and changes in pgwatch2 v1.9 within the release notes.

In the spirit of open-source software, we strongly encourage you to test the new features of pgwatch2 v1.9 on your systems to help us eliminate any bugs or other issues that may exist. While we don't advise you to run pgwatch2 v1.9.0-beta in your production environments, we encourage you to find ways to run your typical application workloads against this beta release.

pgwatch2 v1.9 Feature Highlights

New Features

It has been a long time since our previous release. And even though pgwatch2 v1.9 is not considered a major upgrade, there are a lot of impressive changes.

First of all, we have a new maintainer for pgwatch2. That's me, Pavlo Golub. I hope Kaarel Moppel, the author of pgwatch2, will still work on the project - but now without the extra burden of support and maintenance.

We've used the latest Go v1.17 for all binaries and Docker images.

We now have full PostgreSQL v14 support and support for the latest Grafana v8 visualization solution. For those who still use the previous versions of Grafana, we also added support for v7 and v6 dashboards.

To improve our CI/CD experience and automate some tasks, we've introduced three GitHub Actions in a series of planned workflows:

CodeQL analysis stands guard over code security. It will thoroughly investigate each pull request and each push.

It's not a surprise that many issues are created for open-source projects which never get a follow-up contribution from the topic starter. We have enabled a Close Stale Issues and PRs workflow to keep up with the continuous storm of new issues and PRs. It will mark issues and PRs with no activity as stale, and eventually close them.

We added many more cool features, like new gatherer options, e.g. --try-create-listed-exts-if-missing. Or like new metrics for monitoring "wait_events".

Performance

We provided many code modifications to improve performance.

Prometheus-mode monitoring received essential optimizations.

We rewrote the gatherer code to skip tables locked with AccessExclusiveLock instead of waiting for the lock to be released.

We've added a new --min-db-size-mb flag to ignore "empty" databases. It allows the gatherer to skip metrics fetching for empty or very small databases.

pgwatch2 v1.9 comes with fully redesigned connection pooling, which now relies on sqlx.DB.

Timeouts have been greatly increased for database and table size fetching - to 5min.

Metric fetching was made less disruptive by setting a short lock_timeout.

The new --no-helper-functions parameter allows you to skip metric definitions which rely on helper functions. This mode makes working with managed instances more fluid, with fewer errors in logs. It uses the SU or superuser version of a metric immediately when available, and not after the first failed call.

The new --emergency-pause-triggerfile flag aims to quickly shut down collectors. The main idea of the feature is to quickly free monitored databases and networks of any extra "monitoring effect" load.

Additional Features

You'll find many other new features and improvements in pgwatch2 v1.9. Some may be more important to your use case than those highlighted above. Please see the documentation for the latest information.

Testing for Bugs & Compatibility

The stability of each pgwatch2 v1.9 release significantly depends on the community's effort to test the upcoming version with your workloads and testing tools, in order to find bugs and regressions before pgwatch2 v1.9 becomes generally available. As this is a Beta, minor changes and bug fixes are still possible. Your feedback and testing will help determine the final tweaks on the new features, so please test as soon as possible. The quality of user testing helps decide when we can make an official release.

A list of open issues is publicly available in the pgwatch2 repository. You can report bugs using this form: https://github.com/cybertec-postgresql/pgwatch2/issues.
