When running PostgreSQL on a production system, it might happen that you are facing table bloat. As you might know, PostgreSQL has to copy a row on UPDATE to ensure that concurrent transactions can still see the old data. At some point VACUUM can clean out dead rows, but if transactions are too long, this cleanup might happen quite late and table bloat (= your table keeps growing dramatically) is the logical consequence. A table which has grown out of proportion will have all kinds of bad side effects, including but not limited to bad performance. idle_in_transaction_session_timeout was added in PostgreSQL 9.6 to prevent bad things from happening in case long idle transactions are around.

MVCC: Why PostgreSQL has to copy rows on UPDATE

Why does PostgreSQL have to copy rows on UPDATE after all? Here is an example:

Session 1                  | Session 2
---------------------------+----------------------
CREATE TABLE a (aid int);  |
INSERT INTO a VALUES (5);  |
BEGIN;                     |
SELECT sum(aid) FROM a;    |
… running …                | UPDATE a SET aid = 9;
… running …                | SELECT * FROM a;
… will return 5 …          | … will return 9 …

As you can see two results will be returned at the same time at the end of our example. Logically PostgreSQL has to keep both versions of a row. Just imagine if you want to UPDATE 100 million rows – your table will have to keep an additional 100 million rows.

Let us take a look at a second example:

Session 1                      | Session 2
-------------------------------+----------------------
BEGIN;                         |
DELETE FROM a;                 |
… running …                    | SELECT * FROM a;
… running …                    | … we will see rows …
COMMIT;                        |
VACUUM a;                      |
… now we can clean out rows …  |

DELETE is not allowed to actually remove those rows. Remember, we can still issue a ROLLBACK so we cannot destroy data yet. The same applies to COMMIT. Concurrent transactions might still see the data. VACUUM can only really reclaim those deleted rows if no other transactions can still see them. And this is exactly where our problem starts: What if a transaction starts but is not closed for a long long time …

In PostgreSQL long transactions can cause table bloat

A long transaction is actually not a problem – the problem starts if a long transaction and many small changes have to coexist. Remember: The long transaction can prevent VACUUM from cleaning out your dead rows.

Are long transactions evil in general? No: If a long transaction does useful work, it should be allowed to proceed unharmed. But what if a transaction is kept open because of bad coding or for some other reason?

Here is an example:

Session 1          | Session 2
-------------------+----------------------
BEGIN;             |
SELECT …           |
… doing nothing …  | UPDATE “huge change”
… doing nothing …  | DELETE “huge change”
… doing nothing …  | INSERT “huge change”
… doing nothing …  | UPDATE “huge change”

In this case we will end up in trouble at some point. VACUUM might actually run, but it is never allowed to clean out dead rows because a single transaction might still be allowed to see old data. Thus dead rows will keep accumulating as long as “Session 1” exists. PostgreSQL cannot clean dead tuples – even if you keep running VACUUM.
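If you suspect that this is happening on your system, pg_stat_activity is the place to look. A query along the following lines will list all sessions that are currently idle inside a transaction (terminating one of them manually is then just a matter of calling pg_terminate_backend):

SELECT pid, usename, state, xact_start
FROM   pg_stat_activity
WHERE  state = 'idle in transaction'
ORDER  BY xact_start;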

idle_in_transaction_session_timeout: Killing idle transactions in PostgreSQL

If a transaction is working, it is there for a reason – but if it just hangs around, why not just kill it? This is exactly what idle_in_transaction_session_timeout will do for you. Here is how it works:

test=# SET idle_in_transaction_session_timeout TO '3000';
SET
test=# BEGIN;
BEGIN
test=# SELECT pg_sleep(5);
 pg_sleep
----------

(1 row)

-- going out for a coffee ...
test=# SELECT pg_sleep(5);
FATAL: terminating connection due to idle-in-transaction timeout
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

In this example the timeout is set to 3 seconds (3000 milliseconds). Then we will sleep for 5 seconds, which is no problem at all. However, before the next SELECT there is a long pause – and this is when the session in question will be killed.

In other words: Transactions cannot stay open accidentally anymore as PostgreSQL will clean things out for you.

Note that you don’t have to set things in postgresql.conf globally. The beauty is that you can actually set this variable for a certain database or simply for a specific user. There is no need for making the change globally and suffering from potential side effects.

Setting things for a single user is actually pretty simple. Here is how it works:

test=# CREATE USER joe;
CREATE ROLE
test=# ALTER USER joe SET idle_in_transaction_session_timeout TO 10000;
ALTER ROLE
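The same mechanism works on the database level as well. Here is a possible example, assuming you want a timeout of one minute for the entire "test" database:

ALTER DATABASE test SET idle_in_transaction_session_timeout TO '1min';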

Is there table bloat in my PostgreSQL database?

If you want to figure out if there is table bloat in your database or not: Consider checking out the pgstattuple extension, which has been covered in one of our older posts: https://www.cybertec-postgresql.com/en/detecting-table-bloat/
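If pgstattuple is available on your installation, a quick check might look like this (using the tiny table "a" from the example above; on a real table you would primarily look at the dead tuple figures):

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstattuple('a');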

SQL is a must if you want to be a Data Analyst or a Data Scientist. However, once in a while people wonder why a result is the way it is. While on the road in Berlin (Germany) the other day, I came across a fairly interesting scenario, which is pretty counterintuitive to most people and which might be worth sharing.

Windowing function: What PostgreSQL does and why

PostgreSQL has provided windowing functions and analytics for quite some time now and this vital feature has been widely adopted by users, who are using PostgreSQL or SQL in general for more than just trivial queries. A modern database is just so much more than a simple data store and windowing functions are therefore certainly something to look into.

So, what is this post all about? Here is an example:

test=# SELECT *, 
 first_value(x) OVER (ORDER BY x) 
 FROM generate_series(1, 5) AS x;
 x | first_value 
---+-------------
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

What we want is the first value in our data set. The ORDER BY clause will ensure that data is fed to first_value in the right order. The result is therefore not surprising.

The same applies if we add DESC to our ORDER BY. The result is totally obvious:

test=# SELECT *, 
         first_value(x) OVER (ORDER BY x),
         first_value(x) OVER (ORDER BY x DESC)
   FROM generate_series(1, 5) AS x;
 x | first_value | first_value 
---+-------------+-------------
 5 | 1 | 5
 4 | 1 | 5
 3 | 1 | 5
 2 | 1 | 5
 1 | 1 | 5
(5 rows)

last_value: Unexpected results

However, what if we use last_value? Here is an example:

test=# SELECT *, 
         last_value(x) OVER (ORDER BY x),
         last_value(x) OVER (ORDER BY x DESC)
       FROM generate_series(1, 5) AS x;
 x | last_value | last_value 
---+------------+------------
 5 | 5 | 5
 4 | 4 | 4
 3 | 3 | 3
 2 | 2 | 2
 1 | 1 | 1
(5 rows)

What you can see here is that both columns will return the SAME data – regardless of the different sort order provided by the ORDER BY clause. That comes as a surprise to most people. Actually most people would expect one column to contain only “5” and the other column to contain only “1”.

Why is that not the case? Here is the answer:

test=# SELECT *, 
         array_agg(x) OVER (ORDER BY x),
         array_agg(x) OVER (ORDER BY x DESC)
       FROM generate_series(1, 5) AS x;
 x | array_agg   | array_agg 
---+-------------+-------------
 5 | {1,2,3,4,5} | {5}
 4 | {1,2,3,4}   | {5,4}
 3 | {1,2,3}     | {5,4,3}
 2 | {1,2}       | {5,4,3,2}
 1 | {1}         | {5,4,3,2,1}
(5 rows)

Let us take a look and see which values last_value will actually see: array_agg will simply put them all into an array so that we can inspect things in detail. As you can see, the last value in the array is identical in both cases, which means that both columns will produce exactly the identical output.
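The underlying reason is the default window frame, which only reaches up to the current row. If you really want last_value to look at the entire data set, the frame has to be extended explicitly. A possible way to do that:

SELECT *,
       last_value(x) OVER (ORDER BY x
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING),
       last_value(x) OVER (ORDER BY x DESC
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND UNBOUNDED FOLLOWING)
FROM generate_series(1, 5) AS x;

With this frame the first column should contain 5 in every row and the second one 1, which is what most people expect in the first place.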

As a professional PostgreSQL support company we see a lot of SQL performance stuff, which is often worth sharing. One of those noteworthy things, which might be interesting to you, the reader, happened this week when I was out on a business trip to Berlin, Germany. This (excellent) customer was making extensive use of windowing functions and analytics. However, there is always room to speed things up.

“Improving SQL performance” or “how to trick the optimizer”

Most people simply write their SQL code and execute it assuming that the optimizer will take care of things on its own. While this is usually the case there are still corner cases, where some clever optimization – done by a professional – can give you an edge and better database performance.

One way to improve speed is by rearranging windowing functions in a clever way. Consider the following example:

test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data SELECT * FROM generate_series(1, 5);
INSERT 0 5

Our example is pretty simple: All we need is a table containing 5 rows:

test=# SELECT * FROM data;
id
----
 1
 2
 3
 4
 5
(5 rows)

Let us take a look at a simple example now:

test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data;
 id | array_agg
----+-------------
 1  | {1}
 2  | {1,2}
 3  | {1,2,3}
 4  | {1,2,3,4}
 5  | {1,2,3,4,5}
(5 rows)

What we have here is a simple aggregation. For the sake of simplicity I have used array_agg, which simply shows, how our data is aggregated. Of course we could also use min, max, sum, count or any other window function.
Let me add a second column to the example:

test=# SELECT *,
        array_agg(id) OVER (ORDER BY id),
        array_agg(id) OVER (ORDER BY id DESC)
FROM    data;
 id | array_agg   | array_agg
----+-------------+-------------
 5  | {1,2,3,4,5} | {5}
 4  | {1,2,3,4}   | {5,4}
 3  | {1,2,3}     | {5,4,3}
 2  | {1,2}       | {5,4,3,2}
 1  | {1}         | {5,4,3,2,1}
(5 rows)

In this case there are two columns with two different OVER-clauses. Note that those two aggregations are using different sort orders. One column needs ascending data and one needs descending data.

To understand what is really going on here, we can take a look at the execution plan provided by PostgreSQL:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id),
       array_agg(id) OVER (ORDER BY id DESC)
FROM  data
ORDER BY id;
                               QUERY PLAN
-----------------------------------------------------------------
 Sort  (cost=557.60..563.97 rows=2550 width=68)
   Sort Key: id
   ->  WindowAgg  (cost=368.69..413.32 rows=2550 width=68)
         ->  Sort  (cost=368.69..375.07 rows=2550 width=36)
               Sort Key: id DESC
               ->  WindowAgg  (cost=179.78..224.41 rows=2550 width=36)
                     ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
                           Sort Key: id
                           ->  Seq Scan on data  (cost=0.00..35.50 rows=2550 width=4)
(9 rows)

First of all PostgreSQL has to read the data and sort by “id”. This sorted data is fed to the first aggregation before it is passed on to the next sort step (to sort descendingly). The second sort passes its output to its window aggregation. Finally the data is again sorted by id because we want the final output to be ordered by the first column. Overall our data has to be sorted three times, which is not a good thing to do.

Optimizing windowing and avoiding excessive sorting

Sorting data three times is clearly not a good idea. Maybe we can do better. Let us simply swap two columns in the SELECT clause:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id DESC),
       array_agg(id) OVER (ORDER BY id)
FROM   data
ORDER BY id;
                               QUERY PLAN
-------------------------------------------------------------------
 WindowAgg  (cost=368.69..413.32 rows=2550 width=68)
   ->  Sort  (cost=368.69..375.07 rows=2550 width=36)
         Sort Key: id
         ->  WindowAgg  (cost=179.78..224.41 rows=2550 width=36)
               ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
                     Sort Key: id DESC
                     ->  Seq Scan on data  (cost=0.00..35.50 rows=2550 width=4)
(7 rows)

Wow, by making this little change we have actually managed to skip one sort step. First of all the data is sorted descendingly because we need it for the first windowing function. However, the next column will need data in exactly the same order as the final ORDER BY at the end of the query. PostgreSQL knows that and can already use the sorted input. If you are processing a big data set, this kind of optimization can make a huge difference and speed up your queries tremendously.

At this point PostgreSQL is not able (yet?) to make those adjustments for you so some manual improvements will definitely help. Try to adjust your windowing functions in a way that columns needing identical sorting are actually next to each other.
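A closely related trick: if several aggregates need exactly the same sort order anyway, a named WINDOW clause makes that explicit and guarantees that they share a single sort. A small sketch (the aggregates are chosen purely for illustration):

SELECT *,
       array_agg(id) OVER w,
       min(id)       OVER w,
       max(id)       OVER w
FROM   data
WINDOW w AS (ORDER BY id)
ORDER BY id;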

What does PostgreSQL Full-Text-Search have to do with VACUUM? Many readers might actually be surprised that there is a relevant connection worth talking about at all. However, those two topics are more closely related than people might think. The reason is buried deep inside the code and many people might not be aware of those issues. Therefore I decided to shed some light on the topic and explain what is really going on here. The goal is to help end users speed up their Full-Text-Indexing (FTI) and offer better performance to everybody making use of PostgreSQL.

Controlling VACUUM and autovacuum

Before digging into the real stuff it is necessary to create some test data. For that purpose I created a table. Note that I turned autovacuum off so that all operations are fully under my control. This makes it easier to demonstrate, what is going on in PostgreSQL.

test=# CREATE TABLE t_fti (payload tsvector) 
   WITH (autovacuum_enabled = off);
CREATE TABLE

In the next step we can create 2 million random texts. For the sake of simplicity I did not import a real data set containing real texts but simply created a set of md5 hashes, which are absolutely good enough for the job:

test=# INSERT INTO t_fti 
    SELECT to_tsvector('english', md5('dummy' || id)) 
    FROM generate_series(1, 2000000) AS id;
INSERT 0 2000000

Here is what our data looks like:

test=# SELECT to_tsvector('english', md5('dummy' || id)) 
   FROM generate_series(1, 5) AS id;
        to_tsvector 
--------------------------------------
'8c2753548775b4161e531c323ea24c08':1
'c0c40e7a94eea7e2c238b75273087710':1
'ffdc12d8d601ae40f258acf3d6e7e1fb':1
'abc5fc01b06bef661bbd671bde23aa39':1
'20b70cebcb94b1c9ba30d17ab542a6dc':1
(5 rows)

To make things more efficient, I decided to use the tsvector data type in the table directly. The advantage is that we can directly create a full text index (FTI) on the column:

test=# CREATE INDEX idx_fti 
         ON t_fti USING gin(payload);
CREATE INDEX

In PostgreSQL a GIN index is usually used to take care of “full text search” (FTS).

Finally we run VACUUM to create all those hint bits and make PostgreSQL calculate optimizer statistics.

test=# VACUUM ANALYZE ;
VACUUM

How GIN indexes work in PostgreSQL

To understand what VACUUM and Full Text Search (FTS) have to do with each other, we first have to see how GIN indexes actually work: A GIN index is basically a “normal tree” down to the word level. So you can just binary search to find a word easily. However: In contrast to a btree, GIN has a “posting tree” below the word level. So each word only shows up once in the index but points to a potentially large list of entries. For full text search this makes sense because the number of distinct words is limited in real life while a single word might actually show up thousands of times.

The following image shows, what a GIN index looks like:

Let us take a closer look at the posting tree itself: It has one entry per row pointer to the underlying table. To make lookups efficient the posting tree is sorted. The trouble now is: If you insert into the table, changing the GIN index for each row is pretty expensive. Modifying the posting tree does not come for free. Remember: You have to maintain the right order in your posting tree, so changing things comes with some serious overhead.

Fortunately there is a solution to the problem: The “GIN pending list”. When a row is added, it does not go to the main index directly. Instead it is added to a “TODO” list, which is then processed by VACUUM. So after a row is inserted, the index is not really in its final state. What does that mean? It means that when you scan the index, you have to scan the tree AND sequentially read what is still in the pending list. In other words: If the pending list is long, this will have some impact on performance. In many cases it can therefore make sense to vacuum a table used for full text search more aggressively than usual. Remember: VACUUM will process all the entries in the pending list.
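If you cannot run VACUUM often enough manually, per-table autovacuum settings are a possible way out. The following is merely a starting point, assuming the table from this example (the gin_pending_list_limit storage parameter, available in more recent versions, can additionally cap the size of the pending list in kB):

ALTER TABLE t_fti SET (autovacuum_enabled = on,
                       autovacuum_vacuum_scale_factor = 0.01);
ALTER INDEX idx_fti SET (gin_pending_list_limit = 512);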

Measuring the performance impact of VACUUM

To see what is going on behind the scenes, install pgstattuple:

CREATE EXTENSION pgstattuple;

With pgstattuple you can take a look at the internals of the index:

test=# SELECT * FROM pgstatginindex('idx_fti');
version | pending_pages | pending_tuples
---------+---------------+----------------
2 | 0 | 0
(1 row)

In this case the pending list is empty. In addition to that the index is also pretty small:

test=# SELECT pg_relation_size('idx_fti');
pg_relation_size
------------------
188416
(1 row)

Keep in mind: We had 2 million entries and the index is still close to nothing compared to the size of the table:

test=# SELECT pg_relation_size('t_fti');
pg_relation_size
------------------
154329088
(1 row)

Let us run a simple query now. We are looking for a word, which does not exist. Note that the query needs way less than 1 millisecond:

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
QUERY PLAN
--------------------------------------------------------------------
Bitmap Heap Scan on t_fti (cost=20.77..294.37 rows=67 width=45)
(actual time=0.030..0.030 rows=0 loops=1)
Recheck Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_fti (cost=0.00..20.75 rows=67 width=0)
(actual time=0.028..0.028 rows=0 loops=1)
Index Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
Planning time: 0.148 ms
Execution time: 0.066 ms

(8 rows)

I would also like to point you to something else: “shared hit = 5”. The query only needed 5 blocks of data to run. This is really really good because even if the query has to go to disk, it will still return within a reasonable amount of time.

Let us add more data. Note that autovacuum is off so there are no hidden operations going on:

test=# INSERT INTO t_fti
SELECT to_tsvector('english', md5('dummy' || id))
FROM generate_series(2000001, 3000000) AS id;
INSERT 0 1000000

The same query, which performed so nicely before, is now a lot slower:

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
QUERY PLAN
-----------------------------------------------------------------
Bitmap Heap Scan on t_fti (cost=1329.02..1737.43 rows=100 width=45)
(actual time=9.377..9.377 rows=0 loops=1)
Recheck Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=331
   ->  Bitmap Index Scan on idx_fti (cost=0.00..1329.00 rows=100 width=0)
(actual time=9.374..9.374 rows=0 loops=1)
Index Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=331
Planning time: 0.194 ms
Execution time: 9.420 ms
(8 rows)

PostgreSQL needs more than 9 milliseconds to run the query. The reason is that there are many pending tuples in the pending list. Also: The query had to access 331 pages in this case, which is A LOT more than before. The GIN pending list reveals the underlying problem:

test=# SELECT * FROM pgstatginindex('idx_fti');
version | pending_pages | pending_tuples
---------+---------------+----------------
2 | 326 | 50141
(1 row)

5 pages + 326 pages = 331 pages. The pending list explains all the additional use of data pages instantly.

Running VACUUM to speed up Full-Text-Search (FTS) in PostgreSQL

Moving those pending entries to the real index is simple. We simply run VACUUM ANALYZE again:

test=# VACUUM ANALYZE;
VACUUM

As you can see the pending list is now empty:

test=# SELECT * FROM pgstatginindex('idx_fti');
version | pending_pages | pending_tuples
---------+---------------+----------------
2 | 0 | 0
(1 row)

The important part is that the query is also a lot faster again because the number of blocks has decreased again.

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
QUERY PLAN
-----------------------------------------------------------------
Bitmap Heap Scan on t_fti (cost=25.03..433.43 rows=100 width=45)
(actual time=0.033..0.033 rows=0 loops=1)
Recheck Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_fti (cost=0.00..25.00 rows=100 width=0)
(actual time=0.030..0.030 rows=0 loops=1)
Index Cond: (payload @@ to_tsquery('whatever'::text))
Buffers: shared hit=5
Planning time: 0.240 ms
Execution time: 0.075 ms
(8 rows)

I think those examples show pretty conclusively that VACUUM does have a serious impact on the performance of your full text indexing. Of course this is only true if a significant part of your data is changed on a regular basis.

The COPY command in PostgreSQL is a simple means to copy data between a file and a table. COPY can either export the content of a table to a file or import the content of a file into a table. Traditionally data was copied between PostgreSQL and a file. However, recently a pretty cool feature was added to PostgreSQL: It is now possible to send data directly to a UNIX pipe.

COPY … TO PROGRAM: Sending data to the pipe

The ability to send data directly to the UNIX pipe (or Linux command line) can be pretty useful. You might want to compress your data or change the format on the fly. The beauty of the UNIX shell is that it allows you all kinds of trickery.

If you want to send data to an external program – here is how it works:

test=# COPY (SELECT * FROM pg_available_extensions)
TO PROGRAM 'gzip -c > /tmp/file.txt.gz';
COPY 43

In this case the output of the query is sent to gzip, which compresses the data coming from PostgreSQL and stores the output in a file. As you can see this is pretty easy and really straightforward.
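The reverse direction works as well: COPY … FROM PROGRAM reads the output of a program. Assuming there is a table called mytable with a matching structure, the compressed file could be loaded back like this:

COPY mytable FROM PROGRAM 'gunzip -c /tmp/file.txt.gz';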

Copying data between PostgreSQL and other machines

However, in some cases users might desire to store data on some other machine. Note that the program is executed on the database server and not on the client. It is also important to note that only superusers can run COPY … TO PROGRAM. Otherwise people would face tremendous security problems, which is not desirable at all.

Once in a while people might not want to store the data exported from the database on the server but send the result to some other host. In this case SSH comes to the rescue. SSH offers an easy way to move data.

Here is an example:

echo "Lots of data" | ssh user@some.example.com 'cat > /directory/big.txt'

In this case “Lots of data” will be copied over SSH and stored in /directory/big.txt.

The beauty is that we can apply the same technique to PostgreSQL:

test=# COPY (SELECT * FROM pg_available_extensions)
TO PROGRAM 'ssh user@some.example.com ''cat > /tmp/result.txt'' ';
COPY 43

To make this work in real life you have to make sure that SSH keys are in place and ready to use. Otherwise the system will prompt for a password, which is of course not desirable at all. Also keep in mind that the SSH command is executed as “postgres” user (in case your OS user is called “postgres” too).

“How does the PostgreSQL optimizer handle views?” or “Are views good or bad?” I assume that every database consultant and every SQL performance expert has heard this kind of question already in the past. Most likely this does not only hold true for PostgreSQL experts but is also true for Oracle, DB2 and MS SQL consultants. Given the fact that views are a really essential feature of SQL it makes sense to take a closer look at the topic in general and hopefully help some people to write better and faster code.

PostgreSQL: What it does to views

Let us create a simple table containing just 10 rows, which can be used throughout the blog to show, how PostgreSQL works and how the optimizer treats things:

test=# CREATE TABLE data AS
	SELECT 	*
	FROM 	generate_series(1, 10) AS id;
SELECT 10

Then I have created a very simple view:

test=# CREATE VIEW v1 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4;
CREATE VIEW

The idea here is simply to filter some data and return all the columns.

Inlining and flattening

The key thing is: The optimizer will process the view just like a “preprocessor” directive. It will try to inline the code and flatten it. Here is an example:

test=# explain SELECT * FROM v1;
                      QUERY PLAN                       
-------------------------------------------------------
 Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
   Filter: (id < 4)
(2 rows)

When we try to read from the view it is just like running the SQL statement directly. The optimizer will perform the following steps:

SELECT	*
FROM	(SELECT 	*
		FROM 	data
		WHERE 	id < 4
	) AS v1;

In the next step the subselect will be flattened out completely, which leaves us with:

SELECT * FROM data WHERE id < 4;

We could have done this transformation ourselves but it is easier to make the optimizer do it for us and enjoy the convenience offered by the view.

Joining views

Basically the same mechanism applies, when you are working with joins. PostgreSQL will again inline, flatten and then create a plan to join those tables together. The output is therefore not surprising:

test=# explain SELECT 	*
	FROM 	v1 AS a, v1 AS b
	WHERE 	a.id = 1
			AND a.id = b.id;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Nested Loop  (cost=0.00..96.71 rows=16 width=8)
   ->  Seq Scan on data  (cost=0.00..48.25 rows=4 width=4)
         Filter: ((id < 4) AND (id = 1))
   ->  Materialize  (cost=0.00..48.27 rows=4 width=4)
         ->  Seq Scan on data data_1  (cost=0.00..48.25 rows=4 width=4)
               Filter: ((id < 4) AND (id = 1))
(6 rows)

Again PostgreSQL provides us with a simple plan.

Destructive behavior

However, views are not only good. Sometimes they can be a little destructive too. Consider the following example:

test=# CREATE VIEW v2 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4
	ORDER BY 1;
CREATE VIEW

In this case the view provides us with some ordered data. Let us see what happens in the following query:

test=# explain SELECT * FROM v2 ORDER BY 1 DESC;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Sort  (cost=135.22..137.34 rows=850 width=4)
   Sort Key: data.id DESC
   ->  Sort  (cost=83.23..85.36 rows=850 width=4)
         Sort Key: data.id
         ->  Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
               Filter: (id < 4)
(6 rows)

As you can see PostgreSQL will sort the data twice, which is definitely not good for performance. PostgreSQL won’t optimize this for you, so you have to take care of it yourself and be very aware of the issue.

In general it is a good idea to make a mental distinction between two types of views:

• Final results
• Intermediate results

If a view already contains the “final” results you can do sorting and all kinds of stuff. However, if you are using a view to perform some large calculation on top of it, it might be a good idea to avoid formatting, sorting and the like.
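In our example this could mean keeping the view itself unordered and sorting only at the very end. A possible sketch (v2_raw is just a name I made up):

CREATE VIEW v2_raw AS
    SELECT *
    FROM   data
    WHERE  id < 4;

SELECT * FROM v2_raw ORDER BY 1 DESC;

This way the data only has to be sorted once.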

However, there is more to views. Consider the following example:

test=# CREATE VIEW v3 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4
	LIMIT 100000000000;
CREATE VIEW

In this case the view contains a LIMIT clause, which basically says that “all” the data will be returned. However, during those formal transformations the optimizer cannot “rely” on the fact that all data will be included in the final result. Therefore the optimizer cannot silently skip the LIMIT clause, which of course leads to the following plan:

test=# explain SELECT * FROM v3 ORDER BY 1 DESC;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Sort  (cost=91.73..93.86 rows=850 width=4)
   Sort Key: data.id DESC
   ->  Limit  (cost=0.00..41.88 rows=850 width=4)
         ->  Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
               Filter: (id < 4)
(5 rows)

Conclusion

As always the world is neither only black nor white. Views are a valuable feature and when used nicely there is nothing wrong with them. However, keep in mind what is going on behind the scenes and avoid building views on top of views on top of views and so on.

Is there a single significantly large and important database in the world, which does not suffer from performance problems once in a while? I bet that there are not too many. Therefore every DBA (database administrator) in charge of PostgreSQL should know, how to track down potential performance problems to figure out, what is really going on.

Improving PostgreSQL performance beyond parameter tuning

Many people think that changing parameters in postgresql.conf is the real way to success. However, this is not always the case. Sure, more often than not good database config parameters are highly beneficial. Still, in many cases the real problems will come from some strange query hidden deep in some application logic. It is even quite likely that those queries causing real issues are not the ones you happen to focus on. The natural question now arising is: How can we track down those queries and figure out what is really going on? My favorite tool to do that is pg_stat_statements, which in my judgement should always be enabled if you are using PostgreSQL 9.2 or higher (please do not use it in older versions).

Enabling pg_stat_statements

To enable pg_stat_statements on your server change the following line in postgresql.conf and restart PostgreSQL:

shared_preload_libraries = 'pg_stat_statements'

Once this module has been loaded into the server, PostgreSQL will automatically start to collect information. The good thing is that the overhead of the module is really low (the overhead is basically just “noise”).

Then run the following command to create the necessary view to access the data:

CREATE EXTENSION pg_stat_statements;

The extension will deploy a view called pg_stat_statements and make the data easily accessible.

Detecting slow queries in PostgreSQL

The easiest way to find the most interesting queries is to sort the output of pg_stat_statements by total_time:

SELECT * FROM pg_stat_statements ORDER BY total_time DESC;

The beauty here is that the type of query which is consuming most of the time will naturally show up on top of the listing. The best way is to work your way down from the first to the, say, 10th query and see what is going on there.
In my judgement there is no way to tune a system without looking at the most time consuming queries on the database server.

Taking a deep look at PostgreSQL performance

pg_stat_statements has a lot more to offer than just the query and the time it has eaten. Here is the structure of the view:

test=# \d pg_stat_statements
           View "public.pg_stat_statements"
       Column        |       Type       | Collation | Nullable | Default
---------------------+------------------+-----------+----------+---------
 userid              | oid              |           |          |
 dbid                | oid              |           |          |
 queryid             | bigint           |           |          |
 query               | text             |           |          |
 calls               | bigint           |           |          |
 total_time          | double precision |           |          |
 min_time            | double precision |           |          |
 max_time            | double precision |           |          |
 mean_time           | double precision |           |          |
 stddev_time         | double precision |           |          |
 rows                | bigint           |           |          |
 shared_blks_hit     | bigint           |           |          |
 shared_blks_read    | bigint           |           |          |
 shared_blks_dirtied | bigint           |           |          |
 shared_blks_written | bigint           |           |          |
 local_blks_hit      | bigint           |           |          |
 local_blks_read     | bigint           |           |          |
 local_blks_dirtied  | bigint           |           |          |
 local_blks_written  | bigint           |           |          |
 temp_blks_read      | bigint           |           |          |
 temp_blks_written   | bigint           |           |          |
 blk_read_time       | double precision |           |          |
 blk_write_time      | double precision |           |          |

It can be quite useful to take a look at the stddev_time column as well. It will tell you if queries of a certain type tend to have similar runtimes or not. If the standard deviation is high you can expect some of those queries to be fast and some of them to be slow, which might lead to bad user experience.

The “rows” column can also be quite informative. Suppose 1000 calls have returned 1,000,000,000 rows: It actually means that every call has returned 1 million rows on average. It is easy to see that returning so much data all the time is not a good thing to do.

If you want to check whether a certain type of query shows bad caching performance, the shared_* columns will be of interest. In short: PostgreSQL is able to tell you the cache hit rate of every single type of query in case pg_stat_statements has been enabled.
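A query along the following lines can reveal statements with a poor cache hit rate (the LIMIT is of course arbitrary):

SELECT substring(query, 1, 40) AS short_query,
       shared_blks_hit,
       shared_blks_read,
       round(100 * shared_blks_hit::numeric /
             nullif(shared_blks_hit + shared_blks_read, 0), 2) AS hit_ratio
FROM   pg_stat_statements
ORDER  BY shared_blks_read DESC
LIMIT  10;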

It also makes sense to take a look at the temp_blks_* fields. Whenever PostgreSQL has to go to disk to sort or to materialize, temporary blocks will be needed.

Finally there are blk_read_time and blk_write_time. Usually those fields are empty unless track_io_timing is turned on. The idea here is to be able to measure the amount of time a certain type of query spends on I/O. It will allow you to answer the question whether your system is I/O bound or CPU bound. In most cases it is a good idea to turn on I/O timing because it will give you vital information.
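If you decide to do so, one possible way to enable it without restarting the server is:

ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();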

Dealing with Java and Hibernate

pg_stat_statements delivers good information. However, in some cases it can cut off the query because of a config variable:

test=# SHOW track_activity_query_size;
track_activity_query_size
---------------------------
1024
(1 row)

For most applications 1024 bytes are absolutely enough. However, this is usually not the case if you are running Hibernate or Java. Hibernate tends to send insanely long queries to the database and thus the SQL code might be cut off long before the relevant parts start (e.g. the FROM-clause etc.). Therefore it makes a lot of sense to increase track_activity_query_size to a higher value (maybe 32768).
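Keep in mind that track_activity_query_size can only be set in postgresql.conf and requires a restart. The relevant line could simply look like this:

track_activity_query_size = 32768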

Helpful queries to detect bottlenecks in PostgreSQL

There is one query I have found especially useful in this context: The following query shows 20 statements, which need a lot of time:

test=# SELECT substring(query, 1, 50) AS short_query,
round(total_time::numeric, 2) AS total_time,
calls,
round(mean_time::numeric, 2) AS mean,
round((100 * total_time / sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
short_query | total_time | calls | mean | percentage_cpu
----------------------------------------------------+------------+-------+------+----------------
SELECT name FROM (SELECT pg_catalog.lower(name) A | 11.85 | 7 | 1.69 | 38.63
DROP SCHEMA IF EXISTS performance_check CASCADE; | 4.49 | 4 | 1.12 | 14.64
CREATE OR REPLACE FUNCTION performance_check.pg_st | 2.23 | 4 | 0.56 | 7.27
SELECT pg_catalog.quote_ident(c.relname) FROM pg_c | 1.78 | 2 | 0.89 | 5.81
SELECT a.attname, +| 1.28 | 1 | 1.28 | 4.18
pg_catalog.format_type(a.attty | | | |
SELECT substring(query, ?, ?) AS short_query,roun | 1.18 | 3 | 0.39 | 3.86
CREATE OR REPLACE FUNCTION performance_check.pg_st | 1.17 | 4 | 0.29 | 3.81
SELECT query FROM pg_stat_activity LIMIT ?; | 1.17 | 2 | 0.59 | 3.82
CREATE SCHEMA performance_check; | 1.01 | 4 | 0.25 | 3.30
SELECT pg_catalog.quote_ident(c.relname) FROM pg_c | 0.92 | 2 | 0.46 | 3.00
SELECT query FROM performance_check.pg_stat_activi | 0.74 | 1 | 0.74 | 2.43
SELECT * FROM pg_stat_statements ORDER BY total_ti | 0.56 | 1 | 0.56 | 1.82
SELECT query FROM pg_stat_statements LIMIT ?; | 0.45 | 4 | 0.11 | 1.45
GRANT EXECUTE ON FUNCTION performance_check.pg_sta | 0.35 | 4 | 0.09 | 1.13
SELECT query FROM performance_check.pg_stat_statem | 0.30 | 1 | 0.30 | 0.96
SELECT query FROM performance_check.pg_stat_activi | 0.22 | 1 | 0.22 | 0.72
GRANT ALL ON SCHEMA performance_check TO schoenig_ | 0.20 | 3 | 0.07 | 0.66
SELECT query FROM performance_check.pg_stat_statem | 0.20 | 1 | 0.20 | 0.67
GRANT EXECUTE ON FUNCTION performance_check.pg_sta | 0.19 | 4 | 0.05 | 0.62
SELECT query FROM performance_check.pg_stat_statem | 0.17 | 1 | 0.17 | 0.56
(20 rows)

The last column is especially noteworthy: It tells us the percentage of total time burned by a single query. It will help you to figure out whether a single statement is relevant to overall performance or not.

Many people keep asking about index scans in PostgreSQL. This blog is meant to be a basic introduction to the topic because many people do not seem to be aware of what the optimizer does, when a single query is processed. I decided to give a brief introduction showing, how a table can be accessed in some of the most basic cases. Let us get started with PostgreSQL indexing.

Indexes are somehow the backbone of good performance. Without proper indexing your PostgreSQL database might be in dire straits and end users might complain about slow queries and bad response times. It therefore makes sense to see, which choices PostgreSQL makes, when a single column is queried.

Preparing some demo data in PostgreSQL

To show you how things work we can use a simple table:

test=# CREATE TABLE sampletable (x numeric);
CREATE TABLE

If your table is almost empty, you will never see an index scan because it might be too much overhead to consult an index – it is cheaper to just scan the table directly and throw away some rows, which don’t match your query.

So to demonstrate how an index actually works, we can add 10 million random rows to the table we just created before:

test=# INSERT INTO sampletable
         SELECT random() * 10000
         FROM generate_series(1, 10000000);
INSERT 0 10000000

Then an index is created:

test=# CREATE INDEX idx_x ON sampletable(x);
CREATE INDEX

After loading so much data it can be a good idea to create optimizer statistics in case autovacuum has not caught up yet. The PostgreSQL optimizer needs these statistics to decide whether to use an index or not:

test=# ANALYZE ; 
ANALYZE

In PostgreSQL a btree uses Lehman-Yao High-Concurrency btrees (which will be covered in more detail in a later blog).

 

Selecting a small subset of data in PostgreSQL

When only a small set of rows is selected, PostgreSQL can directly ask the index. In this case it can even use an “Index Only Scan” because all columns needed are actually already in the index:

test=# explain SELECT * FROM sampletable WHERE x = 42353;
                               QUERY PLAN
-----------------------------------------------------------------------
Index Only Scan using idx_x on sampletable (cost=0.43..8.45 rows=1 width=11)
      Index Cond: (x = '42353'::numeric)
(2 rows)

Selecting only a handful of rows will be super efficient using the index. However, if more data is selected, scanning the index AND the table will be too expensive.

PostgreSQL indexing: Selecting a lot of data from a table in PostgreSQL

However, if you select a LOT of data from a table, PostgreSQL will fall back to a sequential scan. In this case reading the entire table and just filtering out a couple of rows is the cheapest way to do things.

Here is how it works:

test=# explain SELECT * FROM sampletable WHERE x < 42353;
                            QUERY PLAN
---------------------------------------------------------------
Seq Scan on sampletable (cost=0.00..179054.03 rows=9999922 width=11)
    Filter: (x < '42353'::numeric)
(2 rows)

PostgreSQL will filter out those unnecessary rows and just return the rest. This is really the best thing to do in this case. A sequential scan is therefore not always bad – there are use cases where a sequential scan is actually perfect.

Still: Keep in mind that scanning large tables sequentially too often will take its toll at some point.

PostgreSQL: Making use of bitmap scans

If you only select a handful of rows, PostgreSQL will decide on an index scan – if you select a majority of the rows, PostgreSQL will decide to read the table completely. But what if you read too much for an index scan to be efficient but too little for a sequential scan? The solution to the problem is to use a bitmap scan. The idea behind a bitmap scan is that a single block is only used once during a scan. It can also be very helpful if you want to use more than one index to scan a single table.

Here is what happens:

test=# explain SELECT * FROM sampletable WHERE x < 423;
                                    QUERY PLAN
----------------------------------------------------------------------------
Bitmap Heap Scan on sampletable (cost=9313.62..68396.35 rows=402218 width=11)
       Recheck Cond: (x < '423'::numeric)
       -> Bitmap Index Scan on idx_x (cost=0.00..9213.07 rows=402218 width=0)
          Index Cond: (x < '423'::numeric)
(4 rows)

PostgreSQL will first scan the index and compile those rows / blocks, which are needed at the end of the scan. Then PostgreSQL will take this list and go to the table to really fetch those rows. The beauty is that this mechanism even works if you are using more than just one index.

Bitmap scans are therefore a wonderful contribution to performance.

In PostgreSQL it is possible to write stored procedures in almost any widespread language such as Perl, Python or C. In general this offers a great deal of flexibility and acceptable performance. However, in some cases customers might say: “We have the feeling that procedures are slow”. The thing is: PostgreSQL might not be to blame for this observation – in many cases it is simply “pilot error”. Let me try to elaborate on one of the key issues related to bad performance.

How PostgreSQL treats procedures

As mentioned already it is possible to write stored procedures in basically any language. PostgreSQL simply passes the code of a procedure to the external language and takes back the result. In a way a stored procedure is a kind of black box – PostgreSQL rarely knows what is going on inside a procedure.

Here is an example:

test=# CREATE OR REPLACE FUNCTION mymax(int, int)
RETURNS int
AS
$$
  BEGIN
       RETURN CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
  END;
$$ LANGUAGE 'plpgsql';
CREATE FUNCTION

The results are not very spectacular:

test=# SELECT mymax(20, 30);
 mymax
-------
    30
(1 row)

The important thing here is: The PL/pgSQL function is a complete black box. The planner has no idea what the “external” language does in this case. This has some important implications. Consider the following example:

test=# CREATE TABLE demo AS
	SELECT 	*
	FROM generate_series(1, 1000000) AS id;
SELECT 1000000
test=# CREATE INDEX idx_id ON demo(id);
CREATE INDEX

The table is already large enough to consider indexes:

test=# explain SELECT * FROM demo WHERE id = 20;
                      QUERY PLAN                               
---------------------------------------------------------------
 Index Only Scan using idx_id on demo  
	(cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 20)
(2 rows)

Functions in PostgreSQL are usually blackboxes

The problem is: The situation changes completely if we start to use the procedure I have just shown:

test=# explain SELECT *
	FROM 	demo
	WHERE 	id = mymax(20, 20);
                       QUERY PLAN                        
---------------------------------------------------------
 Seq Scan on demo  (cost=0.00..266925.00 rows=1 width=4)
   Filter: (id = mymax(20, 20))
(2 rows)

PostgreSQL has no idea that the function will return 20. It is clear to humans but nobody told the machine that this is guaranteed to happen. To PostgreSQL the result of a function is considered to be “volatile” – anything can happen. Therefore it cannot simply ask the index for the correct row. The first function call might not return the same as the second call – even if the parameters are identical. The optimizer has to play safe and will go for a sequential scan, which is definitely going to produce the correct result.

In PostgreSQL a function can be:

• VOLATILE
• STABLE
• IMMUTABLE

If a function is marked as VOLATILE it can return anything if you call it multiple times using the very same input parameters. In case of STABLE the function is going to return the same result given the same parameters within the same transaction.

The most prominent STABLE function is now(), which will always return the same result within the same transaction:

 test=# SELECT now();
              now              
-------------------------------
 2018-01-09 11:48:46.385457+01
(1 row)

test=# BEGIN;
BEGIN
test=# SELECT now();
              now              
-------------------------------
 2018-01-09 11:48:51.073123+01
(1 row)

test=# SELECT now();
              now              
-------------------------------
 2018-01-09 11:48:51.073123+01
(1 row)

test=# COMMIT;
COMMIT
test=# SELECT now();
              now              
-------------------------------
 2018-01-09 11:48:59.640697+01
(1 row)

Some functions are even IMMUTABLE: In this case the result given the same input parameters will be constant regardless of the transaction. Cosine would be an example in this case:

test=# SELECT cos(10), cos(20);
        cos         |        cos        
--------------------+-------------------
 -0.839071529076452 | 0.408082061813392
(1 row)

test=# SELECT cos(10), cos(20);
        cos         |        cos        
--------------------+-------------------
 -0.839071529076452 | 0.408082061813392
(1 row)

Even in the second transaction the cosine of a number will be the same.

Reducing the number of function calls

To fix our problem we have to drop the existing function …

test=# DROP FUNCTION mymax(int, int);
DROP FUNCTION

… and recreate it:

test=# CREATE OR REPLACE FUNCTION mymax(int, int)
RETURNS int
AS
$$
  BEGIN
       RETURN CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
  END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;
CREATE FUNCTION

The planner will detect that the function is supposed to return a static value given the same input values and will go for an index scan:

test=# explain SELECT *
	FROM 	demo
	WHERE 	id = mymax(20, 20);
                      QUERY PLAN                               
------------------------------------------------------
 Index Only Scan using idx_id on demo  
	(cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 20)
(2 rows)

Of course the index scan is orders of magnitude faster and will return in basically no time.

Detecting bottlenecks

Fortunately PostgreSQL has a system view, which might shed some light on functions that might be a problem. The first thing you have to do is to set "track_functions = 'all'" – it will tell PostgreSQL to collect function statistics:
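For a quick test the parameter can simply be enabled in the current session (to make the change permanent, put it into postgresql.conf):

test=# SET track_functions TO 'all';
SET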

test=# SELECT * FROM demo WHERE id = mymax(20, 20);
 id
----
 20
(1 row)

Once this setting has been turned on, pg_stat_user_functions will contain valuable information:

test=# SELECT * FROM pg_stat_user_functions ;
 funcid | schemaname | funcname | calls | total_time | self_time
--------+------------+----------+-------+------------+-----------
  16429 | public     | mymax    |     1 |      0.025 |     0.025
(1 row)

If you happen to see that a function is called insanely often, it can make sense to inspect it and check, if it happens to be VOLATILE for no good reason. Changing the function definition can significantly improve speed.

Creating indexes on functions in PostgreSQL

If you want to create indexes on a function, you have to make sure that the function itself is in fact IMMUTABLE. Otherwise PostgreSQL won’t create the index for you. The reason is simple: PostgreSQL has to make sure that the content of the index is stable and does not have to be changed over time if the underlying data is unchanged.
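To illustrate the point: once mymax has been marked IMMUTABLE as shown above, an expression index using it can be created (the index name is just an example):

CREATE INDEX idx_demo_mymax ON demo (mymax(id, 10));

With a VOLATILE function the very same statement would fail with "functions in index expression must be marked IMMUTABLE".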

We are proud to announce our latest tool released to the PostgreSQL community. We know that many people want a visual tool to configure their PostgreSQL database servers. Therefore we proudly want to announce Cybertec pgconfigurator, which has been designed to give users a tool, to quickly determine the most important config parameters to be used in postgresql.conf.

The tool can be found and used freely at the Cybertec pgconfigurator product page on our website.

What pgconfigurator will do for you

The idea behind Cybertec pgconfigurator is to give people a quick and easy way to generate or to cross-check existing config files (postgresql.conf). We have used our experience to help users tune their database and achieve better performance.

pgconfigurator will determine all vital performance parameters and come up with clever suggestions. Keep in mind that an automatic tool cannot replace a human in all cases and it still makes sense to hire a professional in case of doubt. However, pgconfigurator can help you to get started and to cleverly configure your servers.

Our configuration tool will visually determine the following types of parameters for you:

  • PostgreSQL memory settings (shared_buffers, work_mem, etc.)
  • PostgreSQL replication settings (max_wal_senders, etc.)
  • PostgreSQL parallel query parameters
  • PostgreSQL I/O and checkpointing related settings
  • Preloaded libraries

Just download the ready-made config and add your personal parameters (for logging, etc.)

Feedback is welcome

We hope that our first release of pgconfigurator is helpful to you. Of course feedback is always welcome and we are eager to hear from you. Tell us what you want us to improve and tell us about settings you don’t agree with. We will do our best to collect all those recommendations and make pgconfigurator even better.

We are planning to maintain pgconfigurator long term and add support for PostgreSQL 11 and higher in the future.

Also: Share the news! Tell your friends, workmates and everybody you know about Cybertec pgconfigurator.

 

You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic, which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me how much “coding” people actually use to prepare the data. In many cases the input data is “just a matrix” and people spend a lot of time (and sometimes memory) to prepare it.

The question, which naturally arises in my case is: Why not prepare the data in SQL? It is easier, less work and a lot more flexible.

Creating some sample data

To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:

test=# CREATE TABLE inputdata
(
	id 		int,
	data 	numeric 	DEFAULT random() * 1000
);
CREATE TABLE

Then the table is populated:

test=# INSERT INTO inputdata
	SELECT * FROM generate_series(1, 20);
INSERT 0 20

What we got now is a table containing 20 random values. In reality you will already have some existing data, which you will use to work with:

test=# SELECT * FROM inputdata;
 id |       data       
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)

Thoughts on sampling, training, and verification

If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:

• Training data for your AI model
• Test data for your AI model

The training data is used to teach your model. The test data is then used to check, if your model performs well. Splitting the data is important and somewhat key to success.

Table sampling in PostgreSQL

Since version 9.5 we already have sampling in PostgreSQL core. Here is an example:

test=# SELECT *
	FROM inputdata TABLESAMPLE BERNOULLI (50)
				REPEATABLE (87);
 id |       data       
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)

You have to add the TABLESAMPLE clause to the table and specify how it is supposed to work. In this example I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).

Machine learning: Lots and lots of data …

So far everything sounds nice. However, there is a bit of a downside to this approach. A state of the art model needs A LOT of input data to perform well. We are talking about many million rows. At some point we will be facing a query as shown below (written in pseudo code):

SELECT	*
FROM 	inputdata
WHERE	id NOT IN (SELECT id FROM test_data_set);

There are two problems here:

• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive

So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.

Preparing data for machine learning in PostgreSQL

As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:

test=# CREATE VIEW preparation_step1 AS
	SELECT 	*, abs(hashtext(id::text) % 100)
	FROM 	inputdata ;
CREATE VIEW

The first idea here is to use the hashtext function to turn the input data into a hash. The nice thing here is that hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” will create 100 slices of data (1% each). Be aware of the fact that the hashtext function can return negative values. The “abs” function will turn the values into positive ones.

test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal

When looking at the view, the result will be pretty useful already:

test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)

We can now simply filter out data. An example: “abs < 50” might be training data and all the rest might be used to validate and check our models.
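Building on that, two additional views might already do the trick (the names are of course just suggestions):

CREATE VIEW training_data AS
    SELECT id, data
    FROM   preparation_step1
    WHERE  abs < 50;

CREATE VIEW test_data AS
    SELECT id, data
    FROM   preparation_step1
    WHERE  abs >= 50;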

The approach here is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small. In this case it might be better to use sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.). Why is that the case? If you are using a small data set (e.g. only a couple of hundred rows) you are running the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original data set – so the same percentage of men and women. The solution to the problem is called “stratification” and is supported by libraries such as sklearn and others. In my simple SQL example I am working under the assumption that the amount of data fed to the model is really large and therefore stratification is not an issue.

When running an application in production it might happen that the data structure has to be changed once in a while. Adding columns, dropping columns, etc. might simply be necessary once in a while. However, changing data structures should not be done mindlessly – there are some things you have to take care of.

The main issue with DDLs is that in some cases locks are held for quite a long time, which can have serious consequences if you are running PostgreSQL on critical production systems. I hope that this blog can shed some light and help people to run DDLs more wisely.

Creating some demo data

As usual some demo data is needed to show how things work. The following script simply creates a table holding 10 million simple integer values:


test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data
          SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000

Be aware of locking: Indexing

One of the most common issues related to DDLs is the creation of indexes. If you create an index in PostgreSQL, “normally” writes will be blocked while the index is in creation. In case of very large tables this can be quite counterproductive. A large index simply takes time to build and concurrent writes might suffer as shown in the next example:

Connection 1                              | Connection 2
------------------------------------------+------------------------------
BEGIN;                                    |
CREATE UNIQUE INDEX idx_id ON data (id);  | BEGIN;
— running                                 | INSERT INTO data VALUES (0);
— running                                 | — waiting
COMMIT;                                   | — will proceed
                                          | COMMIT;

If data is really large, the INSERT might have to wait for an unacceptable amount of time. This is especially critical if the number of concurrent requests is high. Connection pools might get into trouble or your webserver might run out of connections. In short: A solution is needed.

Note that reading is still possible – PostgreSQL will only block writes while the index is built. This is extremely important to point out.

If you cannot afford to lock a table while building an index, CREATE INDEX CONCURRENTLY is the tool of choice. CREATE INDEX CONCURRENTLY takes longer than a “normal” CREATE INDEX but it will allow for concurrent writes and it helps to avoid excessive table locking. Note that CREATE INDEX CONCURRENTLY is not guaranteed to succeed and might leave you with an invalid index, which has to be dropped manually in certain cases. However, if you are indexing a 1+TB table or so there is no way around concurrent indexing if you are under constant load, which is often the case.

Here is an example: As you can see no conflicts happen and nobody has to wait for an unacceptable amount of time.

Connection 1                              | Connection 2
------------------------------------------+------------------------------
CREATE UNIQUE INDEX CONCURRENTLY          |
    idx_id2 ON data (id);                 |
— running                                 | INSERT INTO data VALUES (-1);
— running                                 | INSERT INTO data VALUES (-2);
— done                                    | INSERT INTO data VALUES (-3);
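In case CREATE INDEX CONCURRENTLY does fail and leaves an invalid index behind, a query like the following helps to spot it, so that it can be dropped and recreated:

SELECT indexrelid::regclass AS invalid_index
FROM   pg_index
WHERE  NOT indisvalid;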

Adding columns to tables in PostgreSQL

Adding a new column to a table in PostgreSQL is most likely the most common problem reported by people around the globe. The important thing here is that there are actually two cases:

  1. Adding a column without a default value
  2. Adding a column with a default value

Let us see, what those two cases have in stock for us:

Connection 1                        | Connection 2
------------------------------------+--------------------------------
BEGIN;                              |
ALTER TABLE data ADD COLUMN x int;  |
Time: 1.241 ms                      | INSERT INTO data VALUES (-4);
COMMIT;                             | — waits a VERY short time
BEGIN;                              |
ALTER TABLE data                    |
    ADD COLUMN y int DEFAULT 0;     |
— we need time                      | INSERT INTO data VALUES (-5);
Time: 11888.547 ms                  | — waiting for a long time
COMMIT;                             | — finally proceeds

As you can see the ALTER TABLE … ADD COLUMN without the default value is basically done in around 1 millisecond, which is totally fine. The concurrent write does not have to wait for long. As long as the DDL can get the table lock it needs for a short moment, everything will be fine. In PostgreSQL adding a column with no default value is merely a metadata change in the catalog – there is no need to rewrite the data files. Keep in mind that this is not true for all other database engines – especially not for some expensive commercial ones.

The story is a bit different if you need a default value. In this case it has to be written to disk. Our general recommendation is to avoid this kind of operation if possible. If your table is large, rewriting things is not too attractive.
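If a huge table really needs a new column with a default value, one common workaround (sketched here with a hypothetical column z) is to split the work: add the column without a default, set the default for future rows only, and backfill the existing rows in small batches:

ALTER TABLE data ADD COLUMN z int;              -- metadata change only
ALTER TABLE data ALTER COLUMN z SET DEFAULT 0;  -- affects future rows only
UPDATE data SET z = 0 WHERE id BETWEEN 1 AND 1000000;
-- repeat the UPDATE for the remaining id ranges, ideally with VACUUM in between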

Making use of tablespaces

Another important thing to take into account is the use of tablespaces. While tablespaces are a nice way to scale I/O in general, moving tables around does not come for free. Keep in mind that ALTER TABLE … SET TABLESPACE locks the table and blocks writing operation.

Here is an example:

Connection 1                      | Connection 2
----------------------------------+-------------------------------
BEGIN;                            |
ALTER TABLE data                  |
    SET TABLESPACE some_storage;  |
— usually time consuming          | INSERT INTO data VALUES (-6);
— usually time consuming          | — waiting for a long time
COMMIT;                           | — finally done

The good thing here is that there is actually a way to get around the problem. pg_squeeze (https://github.com/cybertec-postgresql/pg_squeeze) is a tool designed to shrink a bloated table. However, it can do a lot more: You can “abuse” it to move a table from one tablespace to some other storage location. Using Cybertec pg_squeeze you can not only shrink your storage footprint – you can also elegantly avoid heavy locking. Go and check it out.

Adding foreign keys

Foreign keys and constraints are also an important factor. If you want to add foreign keys or CHECK constraints PostgreSQL will verify the content of the table to ensure the correctness of the key. Those checks don’t come for free – PostgreSQL will again lock the table:


test=# ALTER TABLE data
           ADD CONSTRAINT constname
           FOREIGN KEY (id)
           REFERENCES data(id);

Adding keys ex-post without locking the table is impossible. Therefore it is important to think ahead and plan your moves.
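What can be shortened, however, is the time the heavy lock is held: a foreign key can be added as NOT VALID, which is quick, and validated in a second step that no longer blocks writes for the whole duration of the check. A sketch (the constraint name is just an example):

ALTER TABLE data
      ADD CONSTRAINT data_id_fk
      FOREIGN KEY (id) REFERENCES data (id)
      NOT VALID;

ALTER TABLE data VALIDATE CONSTRAINT data_id_fk;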

Dropping columns in PostgreSQL

Let us come to a more relaxing thing: DROP COLUMN. In PostgreSQL DROP COLUMN does not rewrite the table. It simply makes the column invisible. VACUUM will take care of the cleanup asynchronously. DROP COLUMN is therefore really fast. However, keep in mind that a DROP COLUMN can only start if there are no concurrent reads:

Connection 1          | Connection 2
----------------------+----------------------------
                      | BEGIN;
                      | SELECT count(*) FROM data
ALTER TABLE data      |
    DROP COLUMN y;    | — takes a while
— has to wait         | — takes a while
— has to wait         | — takes a while
— finally proceeds    | COMMIT;

Otherwise DROP COLUMN will simply return in no time.

There is always more …

There is always more to say. Still, the examples outlined in this blog cover some of the most common issues people are facing around the world. Feel free to add your comments and send some feedback.