When running PostgreSQL on a production system, you may eventually face table bloat. As you might know, PostgreSQL has to copy a row on UPDATE to ensure that concurrent transactions can still see the old data. At some point VACUUM can clean out those dead rows, but if transactions stay open for too long, this cleanup happens quite late and table bloat (= your table keeps growing dramatically) is the logical consequence. A table that has grown out of proportion has all kinds of bad side effects, including but not limited to poor performance. idle_in_transaction_session_timeout was added in PostgreSQL 9.6 to prevent bad things from happening when long idle transactions are around.

MVCC: Why PostgreSQL has to copy rows on UPDATE

Why does PostgreSQL have to copy rows on UPDATE after all? Here is an example:

Session 1                      | Session 2
-------------------------------+-------------------------------
CREATE TABLE a (aid int);      |
INSERT INTO a VALUES (5);      |
BEGIN;                         |
SELECT sum(aid) FROM a;        |
… running …                    | UPDATE a SET aid = 9;
… running …                    | SELECT * FROM a;
… will return 5 …              | … will return 9 …

As you can see, two different results are visible at the same time at the end of our example. Logically PostgreSQL has to keep both versions of the row. Just imagine you want to UPDATE 100 million rows – your table will have to keep an additional 100 million rows.

Let us take a look at a second example:

Session 1                      | Session 2
-------------------------------+-------------------------------
BEGIN;                         |
DELETE FROM a;                 |
… running …                    | SELECT * FROM a;
… running …                    | … we will see rows …
COMMIT;                        |
VACUUM a;                      |
… now we can clean out rows …  |

DELETE is not allowed to actually remove those rows. Remember, we can still issue a ROLLBACK, so we cannot destroy the data yet. The same applies after a COMMIT: concurrent transactions might still see the data. VACUUM can only really reclaim those deleted rows if no other transaction can still see them. And this is exactly where our problem starts: What if a transaction is started but not closed for a very long time?

In PostgreSQL long transactions can cause table bloat

A long transaction by itself is actually not a problem – the problem starts when a long transaction coexists with many small changes. Remember: the long transaction can prevent VACUUM from cleaning out your dead rows.

Are long transactions evil in general? No: If a long transaction does useful work, it should be allowed to proceed unharmed. But what if a transaction is kept open because of bad coding or for some other reason?

Here is an example:

Session 1                      | Session 2
-------------------------------+-------------------------------
BEGIN;                         |
SELECT …                       |
… doing nothing …              | UPDATE “huge change”
… doing nothing …              | DELETE “huge change”
… doing nothing …              | INSERT “huge change”
… doing nothing …              | UPDATE “huge change”

In this case we will end up in trouble at some point. VACUUM might actually run, but it is never allowed to clean out the dead rows because a single transaction might still be allowed to see the old data. Thus dead rows keep accumulating as long as “Session 1” exists. PostgreSQL cannot clean those dead tuples – even if you keep running VACUUM.
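Long-running idle transactions are easy to spot in pg_stat_activity. Here is a small sketch – not part of the original example, but based on columns that have been available for many releases:

test=# SELECT pid, usename, state, xact_start, query
       FROM   pg_stat_activity
       WHERE  state = 'idle in transaction'
       ORDER BY xact_start;

Sessions that show up here with an old xact_start are exactly the ones that keep VACUUM from doing its job.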

idle_in_transaction_session_timeout: Killing idle transactions in PostgreSQL

If a transaction is working, it is there for a reason – but if it just hangs around, why not just kill it? This is exactly what idle_in_transaction_session_timeout will do for you. Here is how it works:

test=# SET idle_in_transaction_session_timeout TO '3000';
SET
test=# BEGIN;
BEGIN
test=# SELECT pg_sleep(5);
 pg_sleep
----------

(1 row)

-- going out for a coffee ...
test=# SELECT pg_sleep(5);
FATAL: terminating connection due to idle-in-transaction timeout
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

In this example the timeout is set to 3 seconds (3000 milliseconds). Then we sleep for 5 seconds, which is no problem at all because the session is busy running a query and therefore not idle. However, before the next SELECT there is a long pause – and this is when the session in question is terminated.

In other words: Transactions cannot stay open accidentally anymore as PostgreSQL will clean things out for you.

Note that you don’t have to set things globally in postgresql.conf. The beauty is that you can set this variable for a certain database or simply for a specific user. There is no need to make the change globally and suffer from potential side effects.

Setting things for a single user is actually pretty simple. Here is how it works:

test=# CREATE USER joe;
CREATE ROLE
test=# ALTER USER joe SET idle_in_transaction_session_timeout TO 10000;
ALTER ROLE
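The same can be done for an entire database. Just a quick sketch, assuming the database used throughout this post is called "test" (new connections to that database will pick up the setting):

test=# ALTER DATABASE test SET idle_in_transaction_session_timeout TO 10000;
ALTER DATABASE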

Is there table bloat in my PostgreSQL database?

If you want to figure out whether there is table bloat in your database, consider checking out the pgstattuple extension, which has been covered in one of our older posts: https://www.cybertec-postgresql.com/en/detecting-table-bloat/
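As a teaser, here is a minimal sketch of what such a check might look like, using pgstattuple on the table "a" from the examples above (dead_tuple_count and dead_tuple_percent are good first indicators of bloat):

test=# CREATE EXTENSION pgstattuple;
CREATE EXTENSION
test=# SELECT dead_tuple_count, dead_tuple_percent
       FROM   pgstattuple('a');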

SQL is a must if you want to be a Data Analyst or a Data Scientist. However, once in a while people wonder why a result is the way it is. While on the road in Berlin (Germany) the other day, I came across a fairly interesting scenario, which is pretty counterintuitive to most people and might be worth sharing.

Windowing function: What PostgreSQL does and why

PostgreSQL has provided windowing functions and analytics for quite some time now, and this vital feature has been widely adopted by users who use PostgreSQL or SQL in general for more than just trivial queries. A modern database is so much more than a simple data store, and windowing functions are therefore certainly something to look into.

So, what is this post all about? Here is an example:

test=# SELECT *, 
 first_value(x) OVER (ORDER BY x) 
 FROM generate_series(1, 5) AS x;
 x | first_value 
---+-------------
 1 | 1
 2 | 1
 3 | 1
 4 | 1
 5 | 1
(5 rows)

What we want is the first value in our data set. The ORDER BY clause will ensure that data is fed to first_value in the right order. The result is therefore not surprising.

The same applies if we add DESC to our ORDER BY. The result is totally obvious:

test=# SELECT *, 
         first_value(x) OVER (ORDER BY x),
         first_value(x) OVER (ORDER BY x DESC)
   FROM generate_series(1, 5) AS x;
 x | first_value | first_value 
---+-------------+-------------
 5 | 1 | 5
 4 | 1 | 5
 3 | 1 | 5
 2 | 1 | 5
 1 | 1 | 5
(5 rows)

last_value: Unexpected results

However, what happens if we use last_value? Here is an example:

test=# SELECT *, 
         last_value(x) OVER (ORDER BY x),
         last_value(x) OVER (ORDER BY x DESC)
       FROM generate_series(1, 5) AS x;
 x | last_value | last_value 
---+------------+------------
 5 | 5 | 5
 4 | 4 | 4
 3 | 3 | 3
 2 | 2 | 2
 1 | 1 | 1
(5 rows)

What you can see here is that both columns return the SAME data – regardless of the different sort order provided by the ORDER BY clause. That comes as a surprise to most people. Actually, most people would expect one column to contain only “5” and the other column to contain only “1”.

Why is that not the case? Here is the answer:

test=# SELECT *, 
         array_agg(x) OVER (ORDER BY x),
         array_agg(x) OVER (ORDER BY x DESC)
       FROM generate_series(1, 5) AS x;
 x | array_agg   | array_agg 
---+-------------+-------------
 5 | {1,2,3,4,5} | {5}
 4 | {1,2,3,4}   | {5,4}
 3 | {1,2,3}     | {5,4,3}
 2 | {1,2}       | {5,4,3,2}
 1 | {1}         | {5,4,3,2,1}
(5 rows)

Let us take a look and see which values last_value actually gets to see: array_agg simply puts them all into an array so that we can inspect things in detail. As you can see, the last value in each array is always the value of the current row, which is why both columns produce exactly the same output. The underlying reason is the default window frame: with an ORDER BY clause the frame only reaches from the start of the partition to the current row (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so last_value simply returns the current row’s value.
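If what you actually want is the overall last value per sort order, you can state the window frame explicitly. This is not part of the original example, but standard SQL that PostgreSQL supports:

test=# SELECT *,
         last_value(x) OVER (ORDER BY x
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
         last_value(x) OVER (ORDER BY x DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
       FROM generate_series(1, 5) AS x;

With the frame extended to the end of the partition, one column contains only 5 and the other only 1 – exactly what most people expect in the first place.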

As a professional PostgreSQL support company we see a lot of SQL performance issues, many of which are worth sharing. One of those noteworthy things, which might be interesting to you, the reader, happened this week when I was on a business trip to Berlin, Germany. This (excellent) customer was making extensive use of windowing functions and analytics. However, there is always room to speed things up.

“Improving SQL performance” or “how to trick the optimizer”

Most people simply write their SQL code and execute it, assuming that the optimizer will take care of things on its own. While this is usually the case, there are still corner cases where some clever optimization – done by a professional – can give you an edge and better database performance.

One way to improve speed is by rearranging windowing functions in a clever way. Consider the following example:

test=# CREATE TABLE data (id int);
CREATE TABLE
test=# INSERT INTO data SELECT * FROM generate_series(1, 5);
INSERT 0 5

Our example is pretty simple: All we need is a table containing 5 rows:

test=# SELECT * FROM data;
id
----
 1
 2
 3
 4
 5
(5 rows)

Let us take a look at a simple example now:

test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data;
 id | array_agg
----+-------------
 1  | {1}
 2  | {1,2}
 3  | {1,2,3}
 4  | {1,2,3,4}
 5  | {1,2,3,4,5}
(5 rows)

What we have here is a simple aggregation. For the sake of simplicity I have used array_agg, which simply shows how our data is aggregated. Of course we could also use min, max, sum, count or any other window function.
Let me add a second column to the example:

test=# SELECT *,
        array_agg(id) OVER (ORDER BY id),
        array_agg(id) OVER (ORDER BY id DESC)
FROM    data;
 id | array_agg   | array_agg
----+-------------+-------------
 5  | {1,2,3,4,5} | {5}
 4  | {1,2,3,4}   | {5,4}
 3  | {1,2,3}     | {5,4,3}
 2  | {1,2}       | {5,4,3,2}
 1  | {1}         | {5,4,3,2,1}
(5 rows)

In this case there are two columns with two different OVER-clauses. Note that those two aggregations are using different sort orders. One column needs ascending data and one needs descending data.

To understand what is really going on here, we can take a look at the execution plan provided by PostgreSQL:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id),
       array_agg(id) OVER (ORDER BY id DESC)
FROM  data
ORDER BY id;
                               QUERY PLAN
-----------------------------------------------------------------
 Sort  (cost=557.60..563.97 rows=2550 width=68)
   Sort Key: id
   ->  WindowAgg  (cost=368.69..413.32 rows=2550 width=68)
         ->  Sort  (cost=368.69..375.07 rows=2550 width=36)
               Sort Key: id DESC
               ->  WindowAgg  (cost=179.78..224.41 rows=2550 width=36)
                     ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
                           Sort Key: id
                           ->  Seq Scan on data  (cost=0.00..35.50 rows=2550 width=4)
(9 rows)

First of all PostgreSQL has to read the data and sort it by “id”. This sorted data is fed to the first window aggregation before it is passed on to the next sort step (this time sorting in descending order). The second sort passes its output to its window aggregation. Finally the data is sorted by id once again because we want the final output to be ordered by the first column. Overall our data has to be sorted three times, which is not a good thing to do.

Optimizing windowing and avoiding excessive sorting

Sorting data three times is clearly not a good idea. Maybe we can do better. Let us simply swap two columns in the SELECT clause:

test=# explain 
SELECT *,
       array_agg(id) OVER (ORDER BY id DESC),
       array_agg(id) OVER (ORDER BY id)
FROM   data
ORDER BY id;
                               QUERY PLAN
-------------------------------------------------------------------
 WindowAgg  (cost=368.69..413.32 rows=2550 width=68)
   ->  Sort  (cost=368.69..375.07 rows=2550 width=36)
         Sort Key: id
         ->  WindowAgg  (cost=179.78..224.41 rows=2550 width=36)
               ->  Sort  (cost=179.78..186.16 rows=2550 width=4)
                     Sort Key: id DESC
                     ->  Seq Scan on data  (cost=0.00..35.50 rows=2550 width=4)
(7 rows)

Wow, by making this little change we have actually managed to skip one sort step. First the data is sorted in descending order because the first windowing function needs it. However, the second windowing function needs data in exactly the same order as the final ORDER BY at the end of the query. PostgreSQL knows that and can reuse that sort for the final output as well. If you are processing a big data set, this kind of optimization can make a huge difference and speed up your queries tremendously.

At this point PostgreSQL is not able (yet?) to make those adjustments for you, so some manual improvements will definitely help. Try to arrange your windowing functions in a way that columns needing identical sorting are next to each other.

What does PostgreSQL Full-Text-Search have to do with VACUUM? Many readers might actually be surprised that there is a relevant connection worth talking about at all. However, those two topics are more closely related than people might think. The reason is buried deep inside the code, and many people might not be aware of those issues. Therefore I decided to shed some light on the topic and explain what is really going on here. The goal is to help end users speed up their Full-Text-Indexing (FTI) and offer better performance to everybody making use of PostgreSQL.

Controlling VACUUM and autovacuum

Before digging into the real stuff it is necessary to create some test data. For that purpose I have created a table. Note that I turned autovacuum off so that all operations are fully under my control. This makes it easier to demonstrate what is going on in PostgreSQL.

test=# CREATE TABLE t_fti (payload tsvector) 
   WITH (autovacuum_enabled = off);
CREATE TABLE

In the next step we can create 2 million random texts. For the sake of simplicity I did not import a real data set containing real texts but simply created a set of md5 hashes, which are absolutely good enough for the job:

test=# INSERT INTO t_fti 
    SELECT to_tsvector('english', md5('dummy' || id)) 
    FROM generate_series(1, 2000000) AS id;
INSERT 0 2000000

Here is what our data looks like:

test=# SELECT to_tsvector('english', md5('dummy' || id)) 
   FROM generate_series(1, 5) AS id;
        to_tsvector 
--------------------------------------
'8c2753548775b4161e531c323ea24c08':1
'c0c40e7a94eea7e2c238b75273087710':1
'ffdc12d8d601ae40f258acf3d6e7e1fb':1
'abc5fc01b06bef661bbd671bde23aa39':1
'20b70cebcb94b1c9ba30d17ab542a6dc':1
(5 rows)

To make things more efficient, I decided to use the tsvector data type in the table directly. The advantage is that we can directly create a full text index (FTI) on the column:

test=# CREATE INDEX idx_fti 
         ON t_fti USING gin(payload);
CREATE INDEX

In PostgreSQL a GIN index is usually used to take care of “full text search” (FTS).

Finally we run VACUUM to create all those hint bits and make PostgreSQL calculate optimizer statistics.

test=# VACUUM ANALYZE ;
VACUUM

How GIN indexes work in PostgreSQL

To understand what VACUUM and Full Text Search (FTS) have to do with each other, we first have to see how GIN indexes actually work: a GIN index is basically a “normal tree” down to the word level, so you can simply binary-search for a word. However, in contrast to a btree, GIN has a “posting tree” below the word level. Each word therefore shows up only once in the index but points to a potentially long list of entries. For full text search this makes sense because the number of distinct words is limited in real life, while a single word might show up thousands of times.

The following image shows what a GIN index looks like:

Let us take a closer look at the posting tree itself: it contains one entry for every pointer to the underlying table. To keep lookups efficient, the posting tree is sorted. The trouble now is: if you insert into the table, changing the GIN index for each row is pretty expensive. Modifying the posting tree does not come for free. Remember: you have to maintain the right order in your posting tree, so changing things comes with some serious overhead.

Fortunately there is a solution to the problem: the “GIN pending list”. When a row is added, it does not go to the main index directly. Instead it is added to a “TODO” list, which is then processed by VACUUM. So after a row is inserted, the index is not really in its final state. What does that mean? It means that when you scan the index, you have to scan the tree AND sequentially read what is still in the pending list. In other words: if the pending list is long, this has a noticeable impact on performance. In many cases it can therefore make sense to vacuum a table used for full text search more aggressively than usual. Remember: VACUUM will process all the entries in the pending list.
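On top of plain VACUUM there are a couple of knobs that deal directly with the pending list. The following is just a hedged sketch of standard PostgreSQL features that are not part of the original example (gin_clean_pending_list requires PostgreSQL 9.6 or later):

-- flush the pending list of one GIN index without running a full VACUUM
SELECT gin_clean_pending_list('idx_fti');

-- lower the pending list threshold for this index (value is in kB)
ALTER INDEX idx_fti SET (gin_pending_list_limit = 512);

-- or disable the pending list entirely and pay the price on every INSERT
ALTER INDEX idx_fti SET (fastupdate = off);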

Measuring the performance impact of VACUUM

To see what is going on behind the scenes, install pgstattuple:

CREATE EXTENSION pgstattuple;

With pgstattuple you can take a look at the internals of the index:

test=# SELECT * FROM pgstatginindex('idx_fti');
 version | pending_pages | pending_tuples
---------+---------------+----------------
       2 |             0 |              0
(1 row)

In this case the pending list is empty. In addition to that the index is also pretty small:

test=# SELECT pg_relation_size('idx_fti');
 pg_relation_size
------------------
           188416
(1 row)

Keep in mind: We had 2 million entries and the index is still close to nothing compared to the size of the table:

test=# SELECT pg_relation_size('t_fti');
 pg_relation_size
------------------
        154329088
(1 row)
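As a small aside that is not part of the original example, pg_size_pretty makes these numbers easier to read:

test=# SELECT pg_size_pretty(pg_relation_size('idx_fti')) AS index_size,
              pg_size_pretty(pg_relation_size('t_fti'))   AS table_size;

With the numbers above this comes out to roughly 184 kB for the index versus around 147 MB for the table.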

Let us run a simple query now. We are looking for a word which does not exist. Note that the query needs way less than 1 millisecond:

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
                              QUERY PLAN
--------------------------------------------------------------------
 Bitmap Heap Scan on t_fti  (cost=20.77..294.37 rows=67 width=45)
                            (actual time=0.030..0.030 rows=0 loops=1)
   Recheck Cond: (payload @@ to_tsquery('whatever'::text))
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_fti  (cost=0.00..20.75 rows=67 width=0)
                            (actual time=0.028..0.028 rows=0 loops=1)
         Index Cond: (payload @@ to_tsquery('whatever'::text))
         Buffers: shared hit=5
 Planning time: 0.148 ms
 Execution time: 0.066 ms
(8 rows)

I would also like to point you to something else: “shared hit = 5”. The query only needed 5 blocks of data to run. This is really really good because even if the query has to go to disk, it will still return within a reasonable amount of time.

Let us add more data. Note that autovacuum is off so there are no hidden operations going on:

test=# INSERT INTO t_fti
SELECT to_tsvector('english', md5('dummy' || id))
FROM generate_series(2000001, 3000000) AS id;
INSERT 0 1000000

The same query, which performed so nicely before, is now a lot slower:

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
                              QUERY PLAN
--------------------------------------------------------------------
 Bitmap Heap Scan on t_fti  (cost=1329.02..1737.43 rows=100 width=45)
                            (actual time=9.377..9.377 rows=0 loops=1)
   Recheck Cond: (payload @@ to_tsquery('whatever'::text))
   Buffers: shared hit=331
   ->  Bitmap Index Scan on idx_fti  (cost=0.00..1329.00 rows=100 width=0)
                            (actual time=9.374..9.374 rows=0 loops=1)
         Index Cond: (payload @@ to_tsquery('whatever'::text))
         Buffers: shared hit=331
 Planning time: 0.194 ms
 Execution time: 9.420 ms
(8 rows)

PostgreSQL needs more than 9 milliseconds to run the query. The reason is that there are many pending tuples in the pending list. Also: The query had to access 331 pages in this case, which is A LOT more than before. The GIN pending list reveals the underlying problem:

test=# SELECT * FROM pgstatginindex('idx_fti');
 version | pending_pages | pending_tuples
---------+---------------+----------------
       2 |           326 |          50141
(1 row)

5 pages + 326 pages = 331 pages. The pending list instantly explains the additional data pages.

Running VACUUM to speed up Full-Text-Search (FTS) in PostgreSQL

Moving those pending entries to the real index is simple. We simply run VACUUM ANALYZE again:

test=# VACUUM ANALYZE;
VACUUM

As you can see the pending list is now empty:

test=# SELECT * FROM pgstatginindex('idx_fti');
 version | pending_pages | pending_tuples
---------+---------------+----------------
       2 |             0 |              0
(1 row)

The important part is that the query is also a lot faster again because the number of blocks it has to touch has decreased again.

test=# explain (analyze, buffers) SELECT *
FROM t_fti
WHERE payload @@ to_tsquery('whatever');
                              QUERY PLAN
--------------------------------------------------------------------
 Bitmap Heap Scan on t_fti  (cost=25.03..433.43 rows=100 width=45)
                            (actual time=0.033..0.033 rows=0 loops=1)
   Recheck Cond: (payload @@ to_tsquery('whatever'::text))
   Buffers: shared hit=5
   ->  Bitmap Index Scan on idx_fti  (cost=0.00..25.00 rows=100 width=0)
                            (actual time=0.030..0.030 rows=0 loops=1)
         Index Cond: (payload @@ to_tsquery('whatever'::text))
         Buffers: shared hit=5
 Planning time: 0.240 ms
 Execution time: 0.075 ms
(8 rows)

I think those examples show pretty conclusively that VACUUM does have a serious impact on the performance of your full text indexing. Of course this is only true if a significant part of your data is changed on a regular basis.

The COPY command in PostgreSQL is a simple means to copy data between a file and a table. COPY can either copy the content of a table to a file or load data from a file into a table. Traditionally data was copied between PostgreSQL and a file. However, recently a pretty cool feature was added to PostgreSQL: It is now possible to send data directly to a UNIX pipe.

COPY … TO PROGRAM: Sending data to the pipe

The ability to send data directly to a UNIX pipe (or Linux command line) can be pretty useful. You might want to compress your data or change the format on the fly. The beauty of the UNIX shell is that it allows for all kinds of trickery.

If you want to send data to an external program – here is how it works:

test=# COPY (SELECT * FROM pg_available_extensions)
TO PROGRAM 'gzip -c > /tmp/file.txt.gz';
COPY 43

In this case the output of the query is sent to gzip, which compresses the data coming from PostgreSQL and stores the output in a file. As you can see, this is pretty easy and really straightforward.
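The counterpart exists as well: COPY … FROM PROGRAM reads the output of a program into a table (and is equally restricted to superusers). Here is a hedged sketch – the target table ext_list is purely hypothetical and simply mirrors the columns of pg_available_extensions so the dump from above can be loaded back:

test=# CREATE TABLE ext_list (name text, default_version text,
                              installed_version text, comment text);
CREATE TABLE
test=# COPY ext_list FROM PROGRAM 'gzip -dc /tmp/file.txt.gz';
COPY 43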

Copying data between PostgreSQL and other machines

However, in some cases users might desire to store data on some other machine. Note that the program is executed on the database server and not on the client. It is also important to note that only superusers can run COPY … TO PROGRAM. Otherwise people would face tremendous security problems, which is not desirable at all.

Once in a while people might not want to store the data exported from the database on the server but send the result to some other host. In this case SSH comes to the rescue. SSH offers an easy way to move data.

Here is an example:

echo "Lots of data" | ssh user@some.example.com 'cat > /directory/big.txt'

In this case “Lots of data” will be copied over SSH and stored in /directory/big.txt.

The beauty is that we can apply the same technique to PostgreSQL:

test=# COPY (SELECT * FROM pg_available_extensions)
TO PROGRAM 'ssh user@some.example.com ''cat > /tmp/result.txt'' ';
COPY 43

To make this work in real life you have to make sure that SSH keys are in place and ready to use. Otherwise the system will prompt for a password, which is of course not desirable at all. Also keep in mind that the SSH command is executed by the operating system user running PostgreSQL (typically “postgres”), so that user’s keys are the ones that matter.

“How does the PostgreSQL optimizer handle views?” or “Are views good or bad?” I assume that every database consultant and every SQL performance expert has heard this kind of question in the past. Most likely this does not only hold true for PostgreSQL experts but also for Oracle, DB2 and MS SQL consultants. Given that views are a really essential feature of SQL, it makes sense to take a closer look at the topic in general and hopefully help some people write better and faster code.

PostgreSQL: What it does to views

Let us create a simple table containing just 10 rows, which can be used throughout this post to show how PostgreSQL works and how the optimizer treats things:

test=# CREATE TABLE data AS
	SELECT 	*
	FROM 	generate_series(1, 10) AS id;
SELECT 10

Then I have created a very simple view:

test=# CREATE VIEW v1 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4;
CREATE VIEW

The idea here is simply to filter some data and return all the columns.

Inlining and flattening

The key thing is: The optimizer will process the view just like a “preprocessor” directive. It will try to inline the code and flatten it. Here is an example:

test=# explain SELECT * FROM v1;
                      QUERY PLAN                       
-------------------------------------------------------
 Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
   Filter: (id < 4)
(2 rows)

When we try to read from the view it is just like running the SQL statement directly. The optimizer will perform the following steps:

SELECT	*
FROM	(SELECT 	*
		FROM 	data
		WHERE 	id < 4
	) AS v1;

In the next step the subselect will be flattened out completely, which leaves us with:

SELECT * FROM data WHERE id < 4;

We could have done this transformation ourselves, but it is easier to let the optimizer do it for us and enjoy the convenience offered by the view.
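To watch the flattening at work, you can add another predicate on top of the view. Just a quick sketch – the exact plan output (costs and so on) will vary, but both conditions should end up in a single Filter on the underlying table, just like in the join example below:

test=# explain SELECT * FROM v1 WHERE id = 2;
-- expected: a Seq Scan on data with Filter: ((id < 4) AND (id = 2))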

Joining views

Basically the same mechanism applies, when you are working with joins. PostgreSQL will again inline, flatten and then create a plan to join those tables together. The output is therefore not surprising:

test=# explain SELECT 	*
	FROM 	v1 AS a, v1 AS b
	WHERE 	a.id = 1
			AND a.id = b.id;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Nested Loop  (cost=0.00..96.71 rows=16 width=8)
   ->  Seq Scan on data  (cost=0.00..48.25 rows=4 width=4)
         Filter: ((id < 4) AND (id = 1))
   ->  Materialize  (cost=0.00..48.27 rows=4 width=4)
         ->  Seq Scan on data data_1  (cost=0.00..48.25 rows=4 width=4)
               Filter: ((id < 4) AND (id = 1))
(6 rows)

Again PostgreSQL provides us with a simple plan.

Destructive behavior

However, views are not only good. Sometimes they can be a little destructive too. Consider the following example:

test=# CREATE VIEW v2 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4
	ORDER BY 1;
CREATE VIEW

In this case the view provides us with some ordered data. Let us see what happens in the following query:

test=# explain SELECT * FROM v2 ORDER BY 1 DESC;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Sort  (cost=135.22..137.34 rows=850 width=4)
   Sort Key: data.id DESC
   ->  Sort  (cost=83.23..85.36 rows=850 width=4)
         Sort Key: data.id
         ->  Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
               Filter: (id < 4)
(6 rows)

As you can see PostgreSQL will sort the data twice, which is definitely not good for performance. PostgreSQL won’t optimize this away for you, so you have to take care of it yourself and be very aware of the issue.
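One way out is to keep ORDER BY out of views that only serve as intermediate results and to sort in the outer query instead. A quick sketch using v1 from above (plan output omitted, costs will vary):

test=# explain SELECT * FROM v1 ORDER BY 1 DESC;
-- v1 contains no ORDER BY of its own, so only a single Sort
-- (on top of the Seq Scan) should show up in the plan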

In general it is a good idea to make a mental distinction between two types of views:

• Final results
• Intermediate results

If a view already contains the “final” result, you can do sorting and all kinds of formatting in it. However, if you are going to run large calculations on top of a view, it might be a good idea to avoid formatting, sorting and the like inside the view.

However, there is more to views. Consider the following example:

test=# CREATE VIEW v3 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4
	LIMIT 100000000000;
CREATE VIEW

In this case the view contains a LIMIT clause, which basically says that “all” the data will be returned. However, during those formal transformations the optimizer cannot “rely” on the fact that all data will be included in the final result. Therefore the optimizer cannot silently drop the LIMIT clause, which of course leads to the following plan:

test=# explain SELECT * FROM v3 ORDER BY 1 DESC;
                            QUERY PLAN                             
-------------------------------------------------------------------
 Sort  (cost=91.73..93.86 rows=850 width=4)
   Sort Key: data.id DESC
   ->  Limit  (cost=0.00..41.88 rows=850 width=4)
         ->  Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
               Filter: (id < 4)
(5 rows)

Conclusion

As always the world is neither only black nor white. Views are a valuable feature, and when used sensibly there is nothing wrong with them. However, keep in mind what is going on behind the scenes and avoid building views on top of views on top of views and so on.