Have you ever heard about cursors in PostgreSQL, or in SQL in general? If not, you should definitely read this article in depth and learn how to easily reduce memory consumption in PostgreSQL. Cursors have been around for many years and are, in my judgement, one of the most underappreciated features of all time. It therefore makes sense to take a closer look at cursors and see what they can be used for.

The purpose of a cursor in PostgreSQL

Consider the following example:

test=# CREATE TABLE t_large (id int);
CREATE TABLE
test=# INSERT INTO t_large 
	SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000

I have created a table containing 10 million rows so that we can play with the data. Let us run a simple query now:

test=# SELECT * FROM t_large;
    id    
----------
        1
        2
        3
…

The first thing you will notice is that the query does not return immediately. There is a reason for that: PostgreSQL will send all the data to the client, and the client will only return once ALL the data has been received. If you happen to select a couple of thousand rows, life is good and everything will be just fine. However, what happens if you run "SELECT * …" on a table containing 10 billion rows? Usually the client will die with an "out of memory" error, taking your application with it. There is no way to keep such a large table in memory, and throwing ever more RAM at the problem is not feasible either (and pretty stupid too).
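The same problem is easy to reproduce on the client side. Here is a minimal Python analogy (no database involved; the function names are mine): materializing every row at once needs memory for the whole result set, while streaming keeps only one row in memory at a time.

```python
# Analogy only: a list materializes every "row" at once, a generator streams them.
def all_rows(n):
    return list(range(1, n + 1))      # needs memory for all n rows at once

def streamed_rows(n):
    for i in range(1, n + 1):         # yields one row at a time
        yield i

# Summing 10 million streamed rows never holds more than one row in memory:
total = sum(streamed_rows(10_000_000))
```

Cursors give you exactly this streaming behavior on the database side.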

Using DECLARE CURSOR and FETCH

DECLARE CURSOR and FETCH can come to the rescue. What is the core idea? We can fetch data in small chunks and only prepare the data at the time it is fetched – not earlier. Here is how it works:

test=# BEGIN;
BEGIN
test=# DECLARE mycur CURSOR FOR 
	SELECT * FROM t_large WHERE id > 0;
DECLARE CURSOR
test=# FETCH NEXT FROM mycur;
 id 
----
  1
(1 row)

test=# FETCH 4 FROM mycur;
 id 
----
  2
  3
  4
  5
(4 rows)

test=# COMMIT;
COMMIT

The first important thing to notice is that a cursor can only be declared inside a transaction. The second important thing is that DECLARE CURSOR itself is lightning fast: it does not compute the data yet, but only prepares the query so that the data can be produced when you call FETCH. To gather all the data from the server you simply run FETCH until the result set is empty. At the end you can simply commit the transaction.
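In application code this FETCH loop usually looks like the sketch below. It is plain Python with a stand-in for the database round trip; in real code `fetch_batch` would send something like `FETCH 1000 FROM mycur` through your driver (the helper names here are illustrative, not part of PostgreSQL):

```python
def drain_cursor(fetch_batch, handle_rows):
    """Call fetch_batch() until it returns an empty batch, mirroring
    'run FETCH until the result set is empty'. Returns the total row count."""
    total = 0
    while True:
        rows = fetch_batch()
        if not rows:              # empty result set: the cursor is exhausted
            break
        handle_rows(rows)
        total += len(rows)
    return total

# Stand-in for the server side: hands out rows in chunks of 4, like FETCH 4.
def make_fake_cursor(data, chunk=4):
    it = iter(data)
    def fetch_batch():
        batch = []
        for _ in range(chunk):
            try:
                batch.append(next(it))
            except StopIteration:
                break
        return batch
    return fetch_batch

seen = []
n = drain_cursor(make_fake_cursor(range(1, 11)), seen.extend)
```

The loop terminates precisely because an exhausted cursor returns an empty result set, which is the condition the article describes.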

Note that a cursor is closed on commit as you can see in the next listing:

test=# FETCH 4 FROM mycur;
ERROR:  cursor "mycur" does not exist
test=#

The FETCH command is way more powerful than most people think. It allows you to navigate your result set and fetch rows as desired:

test=# \h FETCH
Command:     FETCH
Description: retrieve rows from a query using a cursor
Syntax:
FETCH [ direction [ FROM | IN ] ] cursor_name

where direction can be empty or one of:

    NEXT
    PRIOR
    FIRST
    LAST
    ABSOLUTE count
    RELATIVE count
    count
    ALL
    FORWARD
    FORWARD count
    FORWARD ALL
    BACKWARD
    BACKWARD count
    BACKWARD ALL

Cursors are an easy and efficient way to retrieve data from the server. However, you have to keep one thing in mind: latency. Asking the network for one row at a time adds considerable overhead. It therefore makes sense to fetch data in reasonably large chunks. I found it useful to fetch 10,000 rows at a time: 10,000 rows easily fit in memory while still keeping the networking overhead reasonably low. Of course, I highly encourage you to run your own experiments and see what works best in your specific case.
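The trade-off is easy to quantify. Assuming the 10-million-row table from above, the number of FETCH round trips is simply the row count divided by the chunk size, rounded up (a back-of-the-envelope sketch, not a measurement):

```python
import math

def round_trips(total_rows, chunk_size):
    """Number of FETCH commands needed to drain total_rows."""
    return math.ceil(total_rows / chunk_size)

rows = 10_000_000
per_row = round_trips(rows, 1)        # one row per FETCH: latency dominates
chunked = round_trips(rows, 10_000)   # 10,000 rows per FETCH: far fewer trips
```

Going from chunk size 1 to 10,000 cuts the number of round trips by a factor of 10,000, which is why per-row fetching feels so slow over a network.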


Cursors and the PostgreSQL optimizer

Cursors are treated by the optimizer in a special way. If you are running a "normal" statement, PostgreSQL will optimize for total runtime: it assumes that you really want all the data and optimizes accordingly. However, in the case of a cursor it assumes that only a fraction of the data will actually be consumed by the client. The following example shows how this works:

test=# CREATE TABLE t_random AS 
	SELECT random() AS r 
	FROM generate_series(1, 1000000);
SELECT 1000000
test=# CREATE INDEX idx_random ON t_random (r);
CREATE INDEX
test=# ANALYZE ;
ANALYZE

I have created a table containing 1 million random rows and a simple index on top of it. To make sure the example works, I have told the optimizer that index scans are super expensive (random_page_cost):

test=# SET random_page_cost TO 100;
SET

Let us take a look at an example now: if the query is executed as a cursor, you will notice that PostgreSQL goes for an index scan to speed up the creation of the first 10% of the data. If the entire result set is fetched at once, PostgreSQL opts for a sequential scan and a sort instead, because the index scan is considered too expensive:

test=# BEGIN;
BEGIN
test=# explain DECLARE cur CURSOR FOR SELECT * FROM t_random ORDER BY r;
                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 Index Only Scan using idx_random on t_random  (cost=0.42..732000.04 rows=1000000 width=8)
(1 row)

test=# explain SELECT * FROM t_random ORDER BY r;
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Gather Merge  (cost=132326.50..229555.59 rows=833334 width=8)
   Workers Planned: 2
   ->  Sort  (cost=131326.48..132368.15 rows=416667 width=8)
         Sort Key: r
         ->  Parallel Seq Scan on t_random  (cost=0.00..8591.67 rows=416667 width=8)
(5 rows)

test=# COMMIT;
COMMIT

The main question arising now is: how does the optimizer know that the first 10% should be fast and that we are not looking for the entire result set? A runtime setting, cursor_tuple_fraction, controls this behavior:

test=# SHOW cursor_tuple_fraction;
 cursor_tuple_fraction 
-----------------------
 0.1
(1 row)

The default value is 0.1, which means that PostgreSQL optimizes for the first 10% of the result set. The parameter can easily be changed in postgresql.conf, or just for your current session using SET.

Using cursors across transactions

So far you have seen that a cursor can only be used inside a transaction: COMMIT or ROLLBACK will destroy the cursor. However, in some (usually rare) cases it can be necessary to have cursors that survive the end of a transaction. Fortunately, PostgreSQL has a solution to the problem: WITH HOLD cursors.

Here is how it works:

test=# BEGIN;
BEGIN
test=# DECLARE cur CURSOR WITH HOLD FOR 
	SELECT * FROM t_random ORDER BY r;
DECLARE CURSOR
test=# \timing
Timing is on.
test=# COMMIT;
COMMIT
Time: 651.211 ms

As you can see, the WITH HOLD cursor has been declared just like a normal cursor. The interesting part is the COMMIT: to make sure that the data can survive the transaction, PostgreSQL has to materialize the result, which is why the COMMIT takes quite some time. In return, FETCH can now happen after the COMMIT:

test=# FETCH 4 FROM cur;
          r           
----------------------
 2.76602804660797e-07
 6.17466866970062e-07
 3.60095873475075e-06
 4.77954745292664e-06
(4 rows)

If you are making use of WITH HOLD cursors, keep in mind that such a cursor has to be closed explicitly. Otherwise your connection will keep accumulating cursors and their materialized results:

test=# \h CLOSE
Command:     CLOSE
Description: close a cursor
Syntax:
CLOSE { name | ALL }
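A defensive pattern on the client side is to tie CLOSE to a try/finally block, so the cursor is released even if processing fails mid-way. Here is a hedged sketch where a generic `execute` callback stands in for your driver's statement execution (the function names are mine, not PostgreSQL's):

```python
def with_hold_cursor(execute, name, query, consume):
    """Declare a WITH HOLD cursor, hand it to `consume`, and always CLOSE it."""
    execute(f"DECLARE {name} CURSOR WITH HOLD FOR {query}")
    try:
        return consume(name)
    finally:
        execute(f"CLOSE {name}")   # runs even if consume() raises

# Demonstration with a recording stub instead of a real connection:
log = []
with_hold_cursor(log.append, "cur", "SELECT * FROM t_random ORDER BY r",
                 lambda name: log.append(f"FETCH 4 FROM {name}"))
```

Because CLOSE sits in the finally clause, the materialized result is freed no matter how the fetch phase ends.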

Do you want to learn more about PostgreSQL and the optimizer in general? Check out one of our older posts right now.