It is possible to write functions in PostgreSQL in almost any widespread language, such as Perl, Python or C. In general this offers a great deal of flexibility and acceptable performance. However, in some cases customers might say: “We have the feeling that procedures are slow”. The thing is: PostgreSQL might not be to blame for this observation - in many cases it is simply “pilot error”. Let me try to elaborate on how function calls relate to bad performance.
As mentioned already, it is possible to write functions in basically any language. PostgreSQL simply passes the code of a function to the external language and takes back the result. In a way a function is a kind of black box - PostgreSQL rarely knows what is going on inside a procedure.
Here is an example of a function call:
test=# CREATE OR REPLACE FUNCTION mymax(int, int)
       RETURNS int AS
$$
    BEGIN
        RETURN CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
    END;
$$ LANGUAGE 'plpgsql';
CREATE FUNCTION
test=# SELECT mymax(20, 30);
 mymax
-------
    30
(1 row)
The important thing here is: the PL/pgSQL function is a complete black box. The planner has no idea what the “external” language does in this case. This has some important implications.
test=# CREATE TABLE demo AS
       SELECT * FROM generate_series(1, 1000000) AS id;
SELECT 1000000
test=# CREATE INDEX idx_id ON demo(id);
CREATE INDEX
The table is already large enough for the planner to consider indexes:
test=# explain SELECT * FROM demo WHERE id = 20;
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using idx_id on demo  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 20)
(2 rows)
The problem is: The situation changes completely if we start to use the function I have just shown:
test=# explain SELECT * FROM demo WHERE id = mymax(20, 20);
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on demo  (cost=0.00..266925.00 rows=1 width=4)
   Filter: (id = mymax(20, 20))
(2 rows)
PostgreSQL has no idea that the function will return 20. It is clear to humans, but nobody told the machine that this is guaranteed to happen. To PostgreSQL, the result of a function is considered to be “volatile” - anything can happen. Therefore, it cannot simply ask the index for the correct row. The first function call might not return the same as the second call - even if the parameters are identical. The optimizer has to play it safe and will go for a sequential scan, which is definitely going to produce the correct result.
In PostgreSQL a function can be:
• VOLATILE
• STABLE
• IMMUTABLE
If a function is marked as VOLATILE, it can return anything if you call it multiple times using the very same input parameters. A STABLE function, on the other hand, is guaranteed to return the same result for the same parameters within the same transaction. A classic example is now(), which will always return the same result within the same transaction:
test=# SELECT now();
              now
-------------------------------
 2018-01-09 11:48:46.385457+01
(1 row)

test=# BEGIN;
BEGIN
test=# SELECT now();
              now
-------------------------------
 2018-01-09 11:48:51.073123+01
(1 row)

test=# SELECT now();
              now
-------------------------------
 2018-01-09 11:48:51.073123+01
(1 row)

test=# COMMIT;
COMMIT
test=# SELECT now();
              now
-------------------------------
 2018-01-09 11:48:59.640697+01
(1 row)
Some functions are even IMMUTABLE: in this case the result for the same input parameters is constant, regardless of the transaction. Cosine is an example of such a function:
test=# SELECT cos(10), cos(20);
        cos         |        cos
--------------------+-------------------
 -0.839071529076452 | 0.408082061813392
(1 row)

test=# SELECT cos(10), cos(20);
        cos         |        cos
--------------------+-------------------
 -0.839071529076452 | 0.408082061813392
(1 row)
Even in the second transaction the cosine of a number will be the same.
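The declared volatility of any function can be inspected in the pg_proc system catalog: the provolatile column contains 'v', 's' or 'i' for VOLATILE, STABLE and IMMUTABLE respectively. A quick sketch (random is volatile, now is stable, cos is immutable):

```sql
test=# SELECT proname, provolatile
       FROM   pg_proc
       WHERE  proname IN ('random', 'now', 'cos');
```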
To fix our problem we have to drop the existing function …
test=# DROP FUNCTION mymax(int, int);
DROP FUNCTION
… and recreate it:
test=# CREATE OR REPLACE FUNCTION mymax(int, int)
       RETURNS int AS
$$
    BEGIN
        RETURN CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
    END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;
CREATE FUNCTION
test=# explain SELECT * FROM demo WHERE id = mymax(20, 20);
                               QUERY PLAN
------------------------------------------------------------------------
 Index Only Scan using idx_id on demo  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 20)
(2 rows)
Of course the index scan is orders of magnitude faster and will return in basically no time.
Fortunately, PostgreSQL has a system view which might shed some light on functions that could be a problem. The first thing you have to do is to set track_functions = 'all' - it tells PostgreSQL to collect function statistics:
test=# SELECT * FROM demo WHERE id = mymax(20, 20);
 id
----
 20
(1 row)
test=# SELECT * FROM pg_stat_user_functions;
 funcid | schemaname | funcname | calls | total_time | self_time
--------+------------+----------+-------+------------+-----------
  16429 | public     | mymax    |     1 |      0.025 |     0.025
(1 row)
If you happen to see that a function is called insanely often, it can make sense to inspect it and check whether it is VOLATILE for no good reason. Changing the function definition can significantly improve speed.
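Note that you do not necessarily have to drop and recreate a function to change its volatility - ALTER FUNCTION can change it in place, which is handy when other objects already depend on the function. A sketch, using our mymax example:

```sql
test=# ALTER FUNCTION mymax(int, int) IMMUTABLE;
```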
If you want to create indexes on a function, you have to make sure that the function itself is in fact IMMUTABLE. Otherwise PostgreSQL won't create the index for you. The reason is simple: PostgreSQL has to make sure that the content of the index is stable and does not have to be changed over time if the underlying data is unchanged.
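This is easy to verify: once mymax is marked IMMUTABLE, a functional index using it can be created; with the original VOLATILE version, the very same statement is rejected. A sketch, reusing the demo table from above (the index name is just an example):

```sql
test=# CREATE INDEX idx_mymax ON demo (mymax(id, 500000));
```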
Read the latest blogs about the PostgreSQL optimizer, or find out more about fixing slow queries.
We are proud to announce the release of our latest tool to the PostgreSQL community: the CYBERTEC pgconfigurator, a visual tool for configuring PostgreSQL. We know that many people want a visual tool to configure PostgreSQL database servers. CYBERTEC's pgconfigurator quickly determines the most important config parameters to be used in postgresql.conf.
The tool can be found and used freely at the Cybertec pgconfigurator product page on our website.
pgconfigurator gives you a quick and easy way to generate or to cross-check existing config files (postgresql.conf). Use the results of our vast database experience to help tune your database and achieve better performance.
pgconfigurator will determine all vital performance parameters and come up with useful suggestions. Keep in mind that an automatic tool cannot replace a human in all cases, and it still makes sense to hire a professional in case of doubt. However, pgconfigurator can help you to get started and to configure your servers to maximum advantage.
Our tool for visually configuring PostgreSQL will determine all vital parameter types for you. Just download the ready-made config and add your personal parameters (for logging, etc.).
We hope that our first release of pgconfigurator is helpful. Of course feedback is always welcome and we are eager to hear from you. Tell us what you want us to improve and tell us about settings you don't agree with. We will collect all recommendations and use them to make pgconfigurator even better.
We are planning to maintain this tool long-term and add support for PostgreSQL 11 and higher in the future.
Also: Share the news! Tell your friends, workmates and everybody you know about the CYBERTEC pgconfigurator.
(Updated 2023-04-20) I recently had an interesting support case with table bloat that shows how the cause of a problem can sometimes be where you would least suspect it. Note: the problem I describe in this article can only occur in PostgreSQL v14 and below, since the statistics collector has been rewritten in v15.
After an UPDATE or DELETE, PostgreSQL keeps old versions of a table row around. That way, concurrent sessions that want to read the row don't have to wait. But eventually this “garbage” will have to be cleaned up. That is the task of the autovacuum daemon.
Usually you don't have to worry about that, but sometimes something goes wrong. Then old row versions don't get deleted, and the table keeps growing. Apart from the wasted storage space, this will also slow down sequential scans and – to some extent – index scans.
To get rid of the bloat, you can use VACUUM (FULL) and other tools like pg_squeeze. But it is important to find and fix the cause of the table bloat so that it does not reappear.
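To quantify how bad the bloat actually is, the pgstattuple extension is a handy tool. A sketch, assuming the contrib packages are installed:

```sql
test=# CREATE EXTENSION pgstattuple;
test=# SELECT table_len, dead_tuple_count,
              dead_tuple_percent, free_percent
       FROM   pgstattuple('pg_attribute');
```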
I got called by a customer who experienced table bloat in the pg_attribute system catalog table, which contains the table column metadata.
This can happen if table columns get modified or dropped frequently. Most of the time, these are temporary tables which are automatically dropped when the session or transaction ends.
The customer made heavy use of temporary tables. But they have several Linux machines with databases that experience the same workload, and only some of those had the problem.
I went through the list of common causes for table bloat:
Both were not the case; the second option could be ruled out because that would cause bloat on all machines and not only on some.
Then I had a look at the usage statistics for the affected table:
dbname=> SELECT * FROM pg_stat_sys_tables
dbname->  WHERE relname = 'pg_attribute';
-[ RECORD 1 ]-------+-------------
relid               | 1249
schemaname          | pg_catalog
relname             | pg_attribute
seq_scan            | 167081
seq_tup_read        | 484738
idx_scan            | 1506941
idx_tup_fetch       | 4163837
n_tup_ins           | 0
n_tup_upd           | 0
n_tup_del           | 0
n_tup_hot_upd       | 0
n_live_tup          | 0
n_dead_tup          | 0
n_mod_since_analyze | 0
last_vacuum         |
last_autovacuum     |
last_analyze        |
last_autoanalyze    |
vacuum_count        | 0
autovacuum_count    | 0
analyze_count       | 0
autoanalyze_count   | 0
This confirms that autovacuum never ran. But more interesting is that we get a hint why it didn't run:
PostgreSQL thinks that the number of dead tuples (row versions that could be removed) is 0, so it didn't even try to remove them.
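This matters because autovacuum only processes a table once the number of dead tuples exceeds a threshold derived from the table size - by default autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples, i.e. 50 + 0.2 * reltuples. With n_dead_tup stuck at 0, that condition can never become true. A sketch of a query that shows which tables currently qualify, assuming default settings:

```sql
SELECT s.relname,
       s.n_dead_tup,
       50 + 0.2 * c.reltuples AS vacuum_threshold
FROM   pg_stat_all_tables s
       JOIN pg_class c ON c.oid = s.relid
WHERE  s.n_dead_tup > 50 + 0.2 * c.reltuples;
```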
A suspicion became certainty after I found the following message in the server logs:
using stale statistics instead of current ones because stats collector is not responding
The statistics collector is the PostgreSQL background process that collects usage statistics.
After each activity, PostgreSQL backends send statistics about their activity. These statistics updates are sent through a UDP socket on localhost that is created at PostgreSQL startup time. The statistics collector reads from the socket and aggregates the collected statistics.
The statistics collector was running:
  918     1 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
  947   918 postgres: logger process
  964   918 postgres: checkpointer process
  965   918 postgres: writer process
  966   918 postgres: wal writer process
  967   918 postgres: autovacuum launcher process
  968   918 postgres: stats collector process
  969   918 postgres: bgworker: logical replication launcher
To see what the statistics collector was doing and to spot any problems it had, I traced its execution:
# strace -p 968
strace: Process 968 attached
epoll_pwait(3,
The statistics collector was waiting for messages on the UDP socket, but no messages were coming through!
I had a look at the UDP socket:
# netstat -u -n -p
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address    Foreign Address  State        PID/Program name
udp6       0      0 ::1:59517        ::1:59517        ESTABLISHED  918/postmaster
Nothing suspicious so far.
But when I tried the same thing on a machine that didn't have the problem, I got something different:
# netstat -u -n -p
Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address    Foreign Address  State        PID/Program name
udp        0      0 127.0.0.1:46031  127.0.0.1:46031  ESTABLISHED  9303/postmaster
It turned out that on all systems that experienced table bloat, the statistics collector socket was created on the IPv6 address for localhost, while all working systems were using the IPv4 address!
But all machines had IPv6 disabled for the loopback interface:
# ifconfig lo
lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 6897  bytes 2372420 (2.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 6897  bytes 2372420 (2.2 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
PostgreSQL uses the POSIX function getaddrinfo(3) to resolve localhost.
Since PostgreSQL knows quite well that it is important to have a working statistics collection, it will loop through all the addresses returned by that call, create a UDP socket and test it until it has a socket that works.
So we know that IPv6 must have worked when PostgreSQL was started!
Further analysis revealed that IPv6 was disabled during the boot sequence, but there was a race condition:
Sometimes IPv6 would be disabled before PostgreSQL started, sometimes afterwards. And these latter machines were the ones where the statistics collector stopped working and tables got bloated!
After changing the boot sequence to always disable IPv6 before starting PostgreSQL, the problem was fixed.
This shows (again) how the cause of a computer problem can be in an entirely different place than you would suspect at first glance.
It also shows why a good database administrator needs to know the operating system well.
You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic, which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me how much “coding” people actually use to prepare the data. In many cases the input data is “just a matrix”, and people spend a lot of time (and sometimes memory) preparing it.
The question that naturally arises in my case is: why not prepare the data in SQL? It is easier, less work, and a lot more flexible.
To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:
test=# CREATE TABLE inputdata (
    id    int,
    data  numeric DEFAULT random() * 1000
);
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata
       SELECT * FROM generate_series(1, 20);
INSERT 0 20
What we have now is a table containing 20 random values.
test=# SELECT * FROM inputdata;
 id |       data
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)
If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:
• Training data for your AI model
• Test data for your AI model
The training data is used to teach your model. The test data is then used to check whether your model performs well. Splitting the data is important and is key to success.
Since version 9.5, we have sampling in PostgreSQL core. Here is an example:
test=# SELECT * FROM inputdata
        TABLESAMPLE BERNOULLI (50) REPEATABLE (87);
 id |       data
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)
In this example, I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).
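Besides BERNOULLI, which inspects every single row, PostgreSQL core also ships the SYSTEM method, which samples whole blocks. It is usually much faster on large tables, at the price of a less random sample. A sketch:

```sql
test=# SELECT count(*)
       FROM   inputdata TABLESAMPLE SYSTEM (50) REPEATABLE (87);
```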
So far everything sounds nice. However, there is a bit of a downside to this approach. A state-of-the-art model needs A LOT of input data to perform well. We are talking about many million rows. At some point we will be facing a query as shown below (written in pseudo code):
SELECT *
FROM   inputdata
WHERE  id NOT IN (SELECT id FROM test_data_set);
This kind of query has two problems:
• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive
So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.
As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:
test=# CREATE VIEW preparation_step1 AS
       SELECT *, abs(hashtext(id::text) % 100)
       FROM   inputdata;
CREATE VIEW
The first idea is to use the hashtext function to turn the input data into a hash. hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” creates 100 slices of data (1% each). Be aware that the hashtext function can return negative values; the “abs” function turns those into positive values.
test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal
test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)
Now we can filter out data. An example: “abs < 50” might be training data, and all the rest might be used to validate and check our models.

The approach here is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small; in that case, it is better to use the sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.). Why is that the case? If you are using a small dataset (e.g. only a couple hundred rows), you are running the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original dataset - so the same percentage of men and women. The solution to this problem is called “stratification” and is supported by libraries such as sklearn and others.

In my simple SQL example, I work under the assumption that the amount of data fed to the model is really large and that stratification is therefore not an issue.
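Putting it together, the actual split boils down to two cheap range filters on the hash slice - no copy of the data has to be stored anywhere (the 50/50 ratio is just an example):

```sql
-- training set: roughly 50% of the rows
SELECT id, data FROM preparation_step1 WHERE abs < 50;

-- test set: everything else
SELECT id, data FROM preparation_step1 WHERE abs >= 50;
```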
High availability in PostgreSQL: when running an application in production, it might happen that the data structure has to be changed once in a while. Adding columns, dropping columns, and so on may simply become necessary. However, changing data structures should not be done mindlessly – there are some things you have to take care of. The main issue with DDL is that in some cases locks are held for quite a long time, which can have serious consequences if you are running PostgreSQL on critical production systems. I hope this blog helps people to run DDL more wisely with ALTER TABLE.
As usual, some demo data is needed to show how things work. The following script simply creates 10 million integer values:
test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data
       SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000
One of the most common issues related to DDLs is the creation of indexes. If you create an index in PostgreSQL, “normally” writes will be blocked while the index is in creation. In case of very large tables this can be quite counterproductive. A large index simply takes time to build and concurrent writes might suffer as shown in the next example:
Connection 1                              | Connection 2
------------------------------------------+-------------------------------
BEGIN;                                    |
CREATE UNIQUE INDEX idx_id ON data (id);  | BEGIN;
-- running                                | INSERT INTO data VALUES (0);
-- running                                | -- waiting
COMMIT;                                   | -- will proceed
                                          | COMMIT;
If data is really large, the INSERT might have to wait for an unacceptable amount of time. This is especially critical, if the number of concurrent requests is high. Connection pools might get into trouble or your webserver might run out of connections. In short: A solution is needed.
Note that reading is still possible – PostgreSQL will only block writes while the index is built. This is extremely important to point out.
If you cannot afford to lock a table while building an index, CREATE INDEX CONCURRENTLY is the tool of choice. CREATE INDEX CONCURRENTLY takes longer than a “normal” CREATE INDEX, but it will allow for concurrent writes and it helps to avoid excessive table locking. Note that CREATE INDEX CONCURRENTLY is not guaranteed to succeed and might leave you with an invalid index, which has to be dropped manually in certain cases. However, if you are indexing a 1+TB table or so there is no way around concurrent indexing if you are under constant load, which is often the case.
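If a CREATE INDEX CONCURRENTLY run fails, the leftover invalid index still has to be maintained on every write, so it should be cleaned up. A sketch of how to find such leftovers:

```sql
test=# SELECT indexrelid::regclass AS index_name
       FROM   pg_index
       WHERE  NOT indisvalid;
-- then drop any leftovers, e.g.:
-- DROP INDEX CONCURRENTLY idx_id2;
```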
Here is an example - as you can see, no conflicts happen and nobody has to wait for an unacceptable amount of time:
Connection 1                              | Connection 2
------------------------------------------+-------------------------------
CREATE UNIQUE INDEX CONCURRENTLY          |
    idx_id2 ON data (id);                 |
-- running                                | INSERT INTO data VALUES (-1);
-- running                                | INSERT INTO data VALUES (-2);
-- done                                   | INSERT INTO data VALUES (-3);
Adding a new column to a table in PostgreSQL is most likely the most common problem reported by people around the globe. The important thing here is that there are actually two cases:
• Adding a column without a default value
• Adding a column with a default value
Let us see what those two cases have in store for us:
Connection 1                              | Connection 2
------------------------------------------+-------------------------------
BEGIN;                                    |
ALTER TABLE data ADD COLUMN x int;        |
Time: 1.241 ms                            | INSERT INTO data VALUES (-4);
COMMIT;                                   | -- waits a VERY short time
BEGIN;                                    |
ALTER TABLE data                          |
    ADD COLUMN y int DEFAULT 0;           |
-- we need time                           | INSERT INTO data VALUES (-5);
Time: 11888.547 ms                        | -- waiting for a long time
COMMIT;                                   | -- finally proceeds
As you can see the ALTER TABLE … ADD COLUMN without the default value is basically done in around 1 millisecond, which is totally fine. The concurrent write does not have to wait for long. As long as the DDL can get the table lock it needs for a short moment, everything will be fine. In PostgreSQL adding a column with no default value is merely a metadata change in the catalog – there is no need to rewrite the data files. Keep in mind that this is not true for all other database engines – especially not for some expensive commercial ones.
The story is a bit different if you need a default value. In this case it has to be written to disk. Our general recommendation is to avoid this kind of operation if possible. If your table is large, rewriting things is not too attractive.
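A common defensive pattern for any ALTER TABLE on a busy system is to set lock_timeout first: if the DDL cannot acquire its lock quickly, it gives up instead of queueing up and blocking all the sessions behind it. A sketch (the timeout value and column name are just examples):

```sql
test=# SET lock_timeout = '1s';
test=# ALTER TABLE data ADD COLUMN z int;
-- if the lock cannot be obtained within 1 second, the statement
-- fails with an error and can simply be retried later
```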
Another important thing to consider is the use of tablespaces. While tablespaces are a nice way to scale I/O in general, moving tables around does not come for free. Keep in mind that ALTER TABLE … SET TABLESPACE locks the table and blocks writing operation.
Here is an example:
Connection 1                              | Connection 2
------------------------------------------+-------------------------------
BEGIN;                                    |
ALTER TABLE data                          |
    SET TABLESPACE some_storage;          |
-- usually time consuming                 | INSERT INTO data VALUES (-6);
-- usually time consuming                 | -- waiting for a long time
COMMIT;                                   | -- finally done
The good thing here is that there is actually a way to get around the problem. pg_squeeze (https://github.com/cybertec-postgresql/pg_squeeze) is a tool designed to shrink a bloated table. However, it can do a lot more: you can “abuse” it to move a table from one tablespace to some other storage location. Using Cybertec pg_squeeze you can not only shrink your storage footprint – you can also elegantly avoid heavy locking. Go and check it out.
Foreign keys and constraints are also an important factor. If you want to add foreign keys or CHECK constraints PostgreSQL will verify the content of the table to ensure the correctness of the key. Those checks don’t come for free – PostgreSQL will again lock the table:
test=# ALTER TABLE data
       ADD CONSTRAINT constname
       FOREIGN KEY (id) REFERENCES data (id);
Adding keys ex-post without locking the table is impossible. Therefore, it is important to think ahead and plan your moves.
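On reasonably recent PostgreSQL versions there is at least a way to keep the lock short: add the constraint as NOT VALID, which skips the full table check and only locks briefly, then validate it later - VALIDATE CONSTRAINT takes a much weaker lock. A sketch (the constraint name is just an example):

```sql
test=# ALTER TABLE data ADD CONSTRAINT constname_nv
       FOREIGN KEY (id) REFERENCES data (id) NOT VALID;
-- later, without blocking concurrent writes for long:
test=# ALTER TABLE data VALIDATE CONSTRAINT constname_nv;
```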
Let us come to a more relaxing thing: DROP COLUMN. In PostgreSQL, DROP COLUMN does not rewrite the table. It simply makes the column invisible; VACUUM will take care of the cleanup asynchronously. DROP COLUMN is therefore really fast. However, keep in mind that a DROP COLUMN can only start once concurrent reads have finished:
Connection 1                              | Connection 2
------------------------------------------+----------------------------------
BEGIN;                                    |
SELECT count(*) FROM data;                |
-- takes a while                          | ALTER TABLE data DROP COLUMN y;
-- takes a while                          | -- has to wait
-- takes a while                          | -- has to wait
COMMIT;                                   | -- finally proceeds
Otherwise DROP COLUMN will simply return in no time.
There is always more to say. Still, the examples outlined in this blog cover some of the most common issues people are facing around the world. Feel free to add your comments and send some feedback.
Read our latest blogs about high availability to find out more.