
It is possible to write functions in PostgreSQL in almost any widespread language such as Perl, Python or C. In general this offers a great deal of flexibility and acceptable performance. However, in some cases customers might say: “We have the feeling that procedures are slow”. The thing is: PostgreSQL might not be to blame for this observation - in many cases it is simply “pilot error”. Let me elaborate on how function calls can become a key source of bad performance.

How PostgreSQL treats procedures

As mentioned already, it is possible to write functions in basically any language. PostgreSQL simply passes the code of a function to the external language and takes back the result. In a way a function is a kind of black box - PostgreSQL rarely knows what is going on inside a procedure.

Here is an example of a function call:
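Picture a trivial PL/pgSQL function along these lines (mysum is just a placeholder name used for illustration):

CREATE FUNCTION mysum(a int, b int) RETURNS int AS
$$
BEGIN
    -- trivial logic, but hidden from the optimizer inside PL/pgSQL
    RETURN a + b;
END;
$$ LANGUAGE plpgsql;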

The results are not very spectacular:
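Calling the hypothetical mysum function works just as you would expect:

SELECT mysum(10, 10);
-- returns 20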

The important thing here is: The PL/pgSQL function is a complete black box. The planner has no idea what the “external” language does in this case. This has some important implications.

Consider the following example:
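As a sketch, assume a table called t_test holding a couple of million sequential integers plus an index on its id column (the names are illustrative):

CREATE TABLE t_test (id int);
INSERT INTO t_test
    SELECT * FROM generate_series(1, 5000000);
CREATE INDEX idx_test_id ON t_test (id);
ANALYZE t_test;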

The table is already large enough to consider indexes:
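Querying for a constant value (still assuming the t_test sketch from above) lets the optimizer use the index:

EXPLAIN SELECT * FROM t_test WHERE id = 20;
-- the plan shows an index scan on idx_test_id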

Function calls in PostgreSQL are usually black boxes

The problem is: The situation changes completely if we start to use the function I have just shown:
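Using the hypothetical mysum function inside the WHERE clause changes the picture:

EXPLAIN SELECT * FROM t_test WHERE id = mysum(10, 10);
-- the optimizer falls back to a sequential scan, because mysum is VOLATILE by default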

PostgreSQL has no idea that the function will return 20. It is clear to humans, but nobody told the machine that this is guaranteed to happen. To PostgreSQL, the result of a function is considered to be “volatile” - anything can happen. Therefore, it cannot simply ask the index for the correct row. The first function call might not return the same as the second call - even if the parameters are identical. The optimizer has to play it safe and will go for a sequential scan, which is definitely going to produce the correct result.

In PostgreSQL a function can be:

• VOLATILE
• STABLE
• IMMUTABLE

If a function is marked as VOLATILE, it can return anything if you call it multiple times using the very same input parameters. In case of STABLE the function is going to return the same result given the same parameters within the same transaction.

The most prominent STABLE function is now(), which will always return the same result within the same transaction:
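A quick sketch of that behavior:

BEGIN;
SELECT now();      -- returns the transaction start time
SELECT pg_sleep(2);
SELECT now();      -- exactly the same timestamp as the first call
COMMIT;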

Some functions are even IMMUTABLE: In this case the result given the same input parameters will be constant, regardless of the transaction. Cosine would be an example of one of these function calls:
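For example:

BEGIN;
SELECT cos(10);    -- some fixed value
COMMIT;

BEGIN;
SELECT cos(10);    -- the very same value again, in a new transaction
COMMIT;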

Even in the second transaction the cosine of a number will be the same.

Reducing the number of function calls

To fix our problem we have to drop the existing function …

… and recreate it:
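Sticking with the mysum sketch from above, the fix boils down to recreating the function as IMMUTABLE:

DROP FUNCTION mysum(int, int);

CREATE FUNCTION mysum(a int, b int) RETURNS int AS
$$
BEGIN
    RETURN a + b;
END;
$$ LANGUAGE plpgsql IMMUTABLE;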

The planner will detect that the function is supposed to return a static value given the same input values and go for an index scan:
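Re-running the query from before:

EXPLAIN SELECT * FROM t_test WHERE id = mysum(10, 10);
-- mysum(10, 10) is folded into the constant 20 and the index on id is used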

Of course the index scan is orders of magnitude faster and will return in basically no time.

Detecting bottlenecks

Fortunately, PostgreSQL has a system view which can shed some light on functions that might be a problem. The first thing you have to do is to set “track_functions = 'all'” - it tells PostgreSQL to collect function statistics:
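One way to enable it (a sketch; you can just as well put the setting into postgresql.conf and reload):

ALTER SYSTEM SET track_functions = 'all';
SELECT pg_reload_conf();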

Once this setting has been turned on, pg_stat_user_functions will contain valuable information:
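A query along these lines shows which functions are called how often and how much time they consume:

SELECT schemaname, funcname, calls, total_time, self_time
FROM pg_stat_user_functions
ORDER BY calls DESC;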

If you happen to see that a function is called insanely often, it can make sense to inspect it and check whether it is VOLATILE for no good reason. Changing the function definition can significantly improve speed.

Creating indexes on functions in PostgreSQL

If you want to create indexes on a function, you have to make sure that the function itself is in fact IMMUTABLE. Otherwise PostgreSQL won't create the index for you. The reason is simple: PostgreSQL has to make sure that the content of the index is stable and does not have to be changed over time if the underlying data is unchanged.
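A sketch using a hypothetical t_customer table: lower() is IMMUTABLE, so an index on it is fine, while an expression containing a VOLATILE function is rejected.

CREATE INDEX idx_customer_lower_name ON t_customer (lower(name));
-- works, because lower() is IMMUTABLE

-- CREATE INDEX idx_bad ON t_customer (my_volatile_func(name));
-- would fail, because the function is not IMMUTABLE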

Read the latest blogs about the PostgreSQL optimizer, or find out more about fixing slow queries.

Configure PostgreSQL quickly and easily

We are proud to announce the release of our latest tool to the PostgreSQL community: the CYBERTEC pgconfigurator, a visual tool for configuring PostgreSQL. We know that many people want a visual tool to configure PostgreSQL database servers. CYBERTEC's pgconfigurator quickly determines the most important config parameters to be used in postgresql.conf.

The tool can be found and used freely at the Cybertec pgconfigurator product page on our website.


What pgconfigurator will do for you

pgconfigurator gives you a quick and easy way to generate or to cross-check existing config files (postgresql.conf). Use the results of our vast database experience to help tune your database and achieve better performance.

pgconfigurator will determine all vital performance parameters and come up with useful suggestions. Keep in mind that an automatic tool cannot replace a human in all cases, and it still makes sense to hire a professional in case of doubt. However, pgconfigurator can help you to get started and to configure your servers to maximum advantage.

Our tool for visually configuring PostgreSQL will determine the most important types of performance parameters for you.

Just download the ready-made config and add your personal parameters (for logging, etc.)

Feedback is welcome

We hope that our first release of pgconfigurator is helpful. Of course feedback is always welcome and we are eager to hear from you. Tell us what you want us to improve and tell us about settings you don't agree with. We will collect all recommendations and use them to make pgconfigurator even better.

We are planning to maintain this tool long-term and add support for PostgreSQL 11 and higher in the future.

Also: Share the news! Tell your friends, workmates and everybody you know about the CYBERTEC pgconfigurator.


(Updated 2023-04-20) I recently had an interesting support case with table bloat that shows how the cause of a problem can sometimes be where you would least suspect it. Note: the problem I describe in this article can only occur in PostgreSQL v14 and below, since the statistics collector has been rewritten in v15.

About table bloat

After an UPDATE or DELETE, PostgreSQL keeps old versions of a table row around. That way, concurrent sessions that want to read the row don't have to wait. But eventually this “garbage” will have to be cleaned up. That is the task of the autovacuum daemon.

Usually you don't have to worry about that, but sometimes something goes wrong. Then old row versions don't get deleted, and the table keeps growing. Apart from the wasted storage space, this will also slow down sequential scans and – to some extent – index scans.

To get rid of the bloat, you can use VACUUM (FULL) and other tools like pg_squeeze. But it is important to find and fix the cause of the table bloat so that it does not reappear.
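To give an idea, a bloated table (t_bloated is just a placeholder name) can be compacted like this - at the price of an exclusive lock while the command runs:

VACUUM (FULL, VERBOSE) t_bloated;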

Description of the table bloat problem

I got called by a customer who experienced table bloat in the pg_attribute system catalog table that contains the table column metadata.

This can happen if table columns get modified or dropped frequently. Most of the time, these are temporary tables which are automatically dropped when the session or transaction ends.

The customer made heavy use of temporary tables. But they have several Linux machines with databases that experience the same workload, and only some of those had the problem.

Searching for the cause of the table bloat

I went through the list of common causes for table bloat:

Neither was the case; the second option could be ruled out anyway, because it would cause bloat on all machines and not only on some.

Then I had a look at the usage statistics for the affected table:
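The relevant counters live in pg_stat_all_tables (pg_attribute is a system catalog); a query along these lines is enough to see them:

SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, autovacuum_count
FROM pg_stat_all_tables
WHERE relname = 'pg_attribute';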

This confirms that autovacuum never ran. But more interesting is that we get a hint why it didn't run:

PostgreSQL thinks that the number of dead tuples (row versions that could be removed) is 0, so it didn't even try to remove them.

The statistics collector

A suspicion became certainty after I found the following message in the server logs:

The statistics collector is the PostgreSQL background process that collects usage statistics.

PostgreSQL backends send statistics about their activity to the statistics collector. These statistics updates are sent through a UDP socket on localhost that is created at PostgreSQL startup time. The statistics collector reads from the socket and aggregates the collected statistics.

Closing in on the table bloat problem

The statistics collector was running:

To see what the statistics collector was doing and to spot any problems it had, I traced its execution:

The statistics collector was waiting for messages on the UDP socket, but no messages were coming through!

I had a look at the UDP socket:

Nothing suspicious so far.

But when I tried the same thing on a machine that didn't have the problem, I got something different:

It turned out that on all systems that experienced table bloat, the statistics collector socket was created on the IPv6 address for localhost, while all working systems were using the IPv4 address!

But all machines had IPv6 disabled for the loopback interface:

Nailing the bug to the wall

PostgreSQL uses the POSIX function getaddrinfo(3) to resolve localhost.

Since PostgreSQL knows quite well how important working statistics collection is, it loops through all the addresses returned by that call, creating and testing a UDP socket for each one until it has a socket that works.

So we know that IPv6 must have worked when PostgreSQL was started!

Further analysis revealed that IPv6 was disabled during the boot sequence, but there was a race condition:
Sometimes IPv6 would be disabled before PostgreSQL started, sometimes afterwards. And these latter machines were the ones where the statistics collector stopped working and tables got bloated!

After changing the boot sequence to always disable IPv6 before starting PostgreSQL, the problem was fixed.

Conclusion

This shows (again) how the cause of a computer problem can be in an entirely different place than you would suspect at first glance.

It also shows why a good database administrator needs to know the operating system well.

You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me how much “coding” people actually need to prepare the data. In many cases the input data is “just a matrix”, and people spend a lot of time (and sometimes memory) preparing it.

The question, which naturally arises in my case is: Why not prepare the data in SQL? It is easier, less work and a lot more flexible.

Creating some sample data

To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:
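A minimal sketch (t_test and the column x are just illustrative names):

CREATE TABLE t_test (x numeric);

INSERT INTO t_test
    SELECT random() FROM generate_series(1, 20);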

What we have now is a table containing 20 random values.

In reality you will already have some existing data to work with.

Thoughts on sampling, training, and verification

If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:

• Training data for your AI model
• Test data for your AI model

The training data is used to teach your model. The test data is then used to check whether your model performs well. Splitting the data properly is important and is one of the keys to success.

Table sampling in PostgreSQL

Since version 9.5, we have sampling in PostgreSQL core. Here is an example:
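A sample drawing roughly half of the t_test table sketched above might look like this (the percentage and the seed are arbitrary):

SELECT *
FROM t_test TABLESAMPLE BERNOULLI (50) REPEATABLE (87);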

You have to add the TABLESAMPLE clause to the table and specify how it is supposed to work.

In this example, I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).

Machine learning: Lots and lots of data …

So far everything sounds nice. However, there is a bit of a downside to this approach. A state-of-the-art model needs A LOT of input data to perform well. We are talking about many millions of rows. At some point we will be facing a query like the one shown below (written in pseudo code):
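In pseudo code (all table and column names are placeholders) the split looks roughly like this:

-- materialize the test set somewhere
CREATE TABLE t_test_set AS
    SELECT * FROM t_data TABLESAMPLE BERNOULLI (10) REPEATABLE (87);

-- training set: everything that is NOT part of the test set
SELECT *
FROM t_data
WHERE id NOT IN (SELECT id FROM t_test_set);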

There are two problems here:

• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive

So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.

Preparing data for machine learning in PostgreSQL

As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:
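A view along these lines does the trick (t_test and x as sketched earlier; the slice column is simply called abs here to mirror the expression):

CREATE VIEW t_sample AS
    SELECT abs(hashtext(x::text)) % 100 AS abs, x
    FROM t_test;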

The idea is to use the hashtext function to turn the input data into a hash. hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” creates 100 slices of data (1% each). Be aware of the fact that hashtext can return negative values; the abs function turns them into positive ones.

When looking at the view, the result will be pretty useful already:
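For instance:

SELECT * FROM t_sample ORDER BY abs;
-- every row now carries a slice number between 0 and 99 next to its original value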

Now we can filter out data. For example, “abs < 50” might be the training data and all the rest might be used to validate and check our models. The approach is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small - in that case, it is better to use the sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.).

Why is that the case? If you are using a small data set (e.g. only a couple of hundred rows), you run the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original data set - so the same percentage of men and women. The solution to this problem is called “stratification” and is supported by libraries such as sklearn and others. In my simple SQL example, I work under the assumption that the amount of data fed to the model is really large and therefore stratification is not an issue.

High availability in PostgreSQL: When running an application in production, it might happen that the data structure has to be changed once in a while. Adding columns, dropping columns, etc. might simply be necessary. However, changing data structures should not be done mindlessly – there are some things you have to take care of. The main issue with DDLs is that in some cases locks are held for quite a long time, which can have serious consequences if you are running PostgreSQL on critical production systems. I hope this blog helps people run ALTER TABLE and other DDLs more wisely.

Creating some demo data

As usual, some demo data is needed to show how things work. In this case the following script simply creates 10 million integer values:
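Something along these lines will do; the table is simply called data here to match the examples below:

CREATE TABLE data (id int);

INSERT INTO data
    SELECT * FROM generate_series(1, 10000000);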

Be aware of locking: Indexing

One of the most common issues related to DDLs is the creation of indexes. If you create an index in PostgreSQL, writes are “normally” blocked while the index is being created. In the case of very large tables this can be quite counterproductive. A large index simply takes time to build, and concurrent writes might suffer, as shown in the next example:

Connection 1                                 | Connection 2
BEGIN;                                       |
CREATE UNIQUE INDEX idx_id ON data (id);     | BEGIN;
-- running                                   | INSERT INTO data VALUES (0);
-- running                                   | -- waiting
COMMIT;                                      | -- will proceed
                                             | COMMIT;

If the data table is really large, the INSERT might have to wait for an unacceptable amount of time. This is especially critical if the number of concurrent requests is high. Connection pools might get into trouble or your webserver might run out of connections. In short: a solution is needed.

Note that reading is still possible – PostgreSQL will only block writes while the index is built. This is extremely important to point out.

If you cannot afford to lock a table while building an index, CREATE INDEX CONCURRENTLY is the tool of choice. CREATE INDEX CONCURRENTLY takes longer than a “normal” CREATE INDEX, but it will allow for concurrent writes and it helps to avoid excessive table locking. Note that CREATE INDEX CONCURRENTLY is not guaranteed to succeed and might leave you with an invalid index, which has to be dropped manually in certain cases. However, if you are indexing a 1+TB table or so there is no way around concurrent indexing if you are under constant load, which is often the case.

Here is an example - as you can see, no conflicts happen and nobody has to wait for an unacceptable amount of time:

Connection 1                                 | Connection 2
CREATE UNIQUE INDEX CONCURRENTLY             |
    idx_id2 ON data (id);                    |
-- running                                   | INSERT INTO data VALUES (-1);
-- running                                   | INSERT INTO data VALUES (-2);
-- done                                      | INSERT INTO data VALUES (-3);

Adding columns to tables in PostgreSQL

Adding a new column to a table in PostgreSQL is most likely the most common schema change people around the globe ask about. The important thing here is that there are actually two cases:

  1. Adding a column without a default value
  2. Adding a column with a default value

Let us see what those two cases have in store for us:

Connection 1                                 | Connection 2
BEGIN;                                       |
ALTER TABLE data ADD COLUMN x int;           |
Time: 1.241 ms                               | INSERT INTO data VALUES (-4);
COMMIT;                                      | -- waits a VERY short time
BEGIN;                                       |
ALTER TABLE data                             |
    ADD COLUMN y int DEFAULT 0;              |
-- we need time                              | INSERT INTO data VALUES (-5);
Time: 11888.547 ms                           | -- waiting for a long time
COMMIT;                                      | -- finally proceeds

As you can see the ALTER TABLE … ADD COLUMN without the default value is basically done in around 1 millisecond, which is totally fine. The concurrent write does not have to wait for long. As long as the DDL can get the table lock it needs for a short moment, everything will be fine. In PostgreSQL adding a column with no default value is merely a metadata change in the catalog – there is no need to rewrite the data files. Keep in mind that this is not true for all other database engines – especially not for some expensive commercial ones.

The story is a bit different if you need a default value. In this case it has to be written to disk. Our general recommendation is to avoid this kind of operation if possible. If your table is large, rewriting things is not too attractive.

Making use of tablespaces

Another important thing to consider is the use of tablespaces. While tablespaces are a nice way to scale I/O in general, moving tables around does not come for free. Keep in mind that ALTER TABLE … SET TABLESPACE locks the table and blocks write operations.

Here is an example:

Connection 1                                 | Connection 2
BEGIN;                                       |
ALTER TABLE data                             |
    SET TABLESPACE some_storage;             |
-- usually time consuming                    | INSERT INTO data VALUES (-6);
-- usually time consuming                    | -- waiting for a long time
COMMIT;                                      | -- finally done

The good thing here is that there is actually a way to get around the problem. pg_squeeze (https://github.com/cybertec-postgresql/pg_squeeze) is a tool designed to shrink a bloated table. However, it can do a lot more: you can “abuse” it to move a table from one tablespace to some other storage location. Using Cybertec pg_squeeze you can not only shrink your storage footprint – you can also elegantly avoid heavy locking. Go and check it out.

Adding foreign keys

Foreign keys and constraints are also an important factor. If you want to add foreign keys or CHECK constraints, PostgreSQL will verify the content of the table to ensure the correctness of the key. Those checks don’t come for free – PostgreSQL will again lock the table:
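A sketch (t_parent is a hypothetical referenced table): while the constraint is added, PostgreSQL scans the existing rows to validate them and holds a lock on the table for the duration.

ALTER TABLE data
    ADD CONSTRAINT data_id_fkey
    FOREIGN KEY (id) REFERENCES t_parent (id);
-- every existing row in "data" is checked before the ALTER returns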

Adding keys ex-post without locking the table is impossible. Therefore, it is important to think ahead and plan your moves.

Dropping columns in PostgreSQL

Let us come to a more relaxing thing: DROP COLUMN. In PostgreSQL, DROP COLUMN does not rewrite the table. It simply makes the column invisible. VACUUM will take care of the cleanup asynchronously. DROP COLUMN is therefore really fast. However, keep in mind that a DROP COLUMN can only start once there are no concurrent reads on the table:

Connection 1                                 | Connection 2
                                             | BEGIN;
                                             | SELECT count(*) FROM data;
ALTER TABLE data                             |
    DROP COLUMN y;                           | -- takes a while
-- has to wait                               | -- takes a while
-- has to wait                               | -- takes a while
-- finally proceeds                          | COMMIT;

Otherwise DROP COLUMN will simply return in no time.

There is always more …

There is always more to say. Still, the examples outlined in this blog cover some of the most common issues people are facing around the world. Feel free to add your comments and send some feedback.

Read our latest blogs about high availability to find out more.
