We all know and value SQL functions as a handy shortcut. PostgreSQL v14 has introduced a new, better way to write SQL functions. This article will show the advantages of the new syntax.
Let's create a simple example of an SQL function with the โclassicalโ syntax so that we have some material for demonstrations:
CREATE EXTENSION unaccent;

CREATE FUNCTION mangle(t text) RETURNS text
   LANGUAGE sql
   AS 'SELECT lower(unaccent(t))';
You can use the new function like other database functions:
SELECT mangle('Schön dumm');

   mangle
────────────
 schon dumm
(1 row)
You may ask what good an SQL function is. After all, the main purpose of a database function is to be able to run procedural code inside the database, something you cannot do with SQL. But SQL functions have their use:
- you need a function, for example, when you define custom aggregates with CREATE AGGREGATE or custom operators with CREATE OPERATOR (a minimal sketch follows below)
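To illustrate that last point, here is a minimal sketch, assuming the unaccent extension created above; the helper function mangle_eq and the operator name ==* are made up for this example:

CREATE FUNCTION mangle_eq(a text, b text) RETURNS boolean
   LANGUAGE sql
   AS 'SELECT lower(unaccent(a)) = lower(unaccent(b))';

-- a case- and accent-insensitive equality operator built on the SQL function
CREATE OPERATOR ==* (
   LEFTARG  = text,
   RIGHTARG = text,
   FUNCTION = mangle_eq
);

SELECT 'Schön' ==* 'schon';   -- returns true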
Moreover, simple SQL functions can be inlined, that is, the optimizer can replace the function call with the function definition at query planning time. This can make SQL functions singularly efficient:
We can see function inlining if we use EXPLAIN (VERBOSE) on our example function:
EXPLAIN (VERBOSE, COSTS OFF) SELECT mangle('Schön dumm');

                  QUERY PLAN
───────────────────────────────────────────────
 Result
   Output: lower(unaccent('Schön dumm'::text))
(2 rows)
PostgreSQL functions are great. One of the nice aspects is that you are not restricted to a single programming language. Out of the box, PostgreSQL supports functions written in SQL, C, PL/pgSQL (a clone of Oracle's PL/SQL), Perl, Python and Tcl. But that is not all: in PostgreSQL, you can write a plugin that allows you to use any language of your choice inside the database. To allow that flexibility, the function body of a PostgreSQL function is simply a string constant that the call handler of the procedural language interprets when PostgreSQL executes the function. This has some undesirable side effects:
Usually, PostgreSQL tracks dependencies between database objects in the pg_depend and pg_shdepend catalog tables. That way, the database knows the relationships between objects: it will either prevent you from dropping objects on which other objects depend (like a table with a foreign key reference) or drop dependent objects automatically (like dropping a table drops all indexes on the table).
Since the body of a function is just a string constant that PostgreSQL cannot interpret, it won't track dependencies between a function and objects used in the function. A procedural language can provide a validator that checks the function body for syntactic correctness (if check_function_bodies = on). The validator can also test if the objects referenced in the function exist, but it cannot keep you from later dropping an object used by the function.
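As a side note, here is a small sketch of what the validator does (and does not) catch; the function broken() and the referenced no_such_function() are made up for this example:

SET check_function_bodies = off;

-- succeeds, because the body is stored as an unchecked string
CREATE FUNCTION broken() RETURNS text
   LANGUAGE sql
   AS 'SELECT no_such_function(42)';

-- the problem only surfaces at execution time:
-- ERROR: function no_such_function(integer) does not exist
SELECT broken();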
Let's demonstrate that with our example:
DROP EXTENSION unaccent;

SELECT mangle('boom');
ERROR:  function unaccent(text) does not exist
LINE 1: SELECT lower(unaccent(t))
                     ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT lower(unaccent(t))
CONTEXT:  SQL function "mangle" during inlining
We will fix the problem by creating the extension again. However, it would be better to get an error message when we run DROP EXTENSION without using the CASCADE option.
search_path as a security problem

Since PostgreSQL parses the function body at query execution time, it uses the current setting of search_path to resolve all references to database objects that are not qualified with the schema name. That is not limited to tables and views, but also extends to functions and operators. We can use our example function to demonstrate the problem:
SET search_path = pg_catalog;

SELECT public.mangle('boom');
ERROR:  function unaccent(text) does not exist
LINE 1: SELECT lower(unaccent(t))
                     ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT lower(unaccent(t))
CONTEXT:  SQL function "mangle" during inlining
In our example, it is a mere annoyance that we can avoid by using public.unaccent() in the function call. But it can be worse than that, particularly with SECURITY DEFINER functions. Since it is cumbersome to schema-qualify each function and operator, the recommended solution is to force a search_path on the function:
ALTER FUNCTION mangle(text) SET search_path = public;
Note that the schemas on the search_path should allow CREATE only to privileged users, so the above is not a good idea on versions older than v15!
An unpleasant downside of setting a search_path is that it prevents the inlining of the SQL function.
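We can verify that with EXPLAIN; the exact output may differ slightly, but the point is that the plan now contains the function call itself instead of the expanded expression:

EXPLAIN (VERBOSE, COSTS OFF) SELECT mangle('Schön dumm');

              QUERY PLAN
───────────────────────────────────────
 Result
   Output: mangle('Schön dumm'::text)
(2 rows)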
From PostgreSQL v14 on, the body of SQL functions and procedures need no longer be a string constant. You can now use one of the following forms for the function body:
CREATE FUNCTION function_name(...)
RETURNS ...
RETURN expression;

CREATE FUNCTION function_name(...)
RETURNS ...
BEGIN ATOMIC
   statement;
   ...
END;
The first form requires the function body to be an expression. So if you want to perform a query, you have to wrap it in parentheses (turning it into a subquery, which is a valid expression). For example:
CREATE FUNCTION get_data(v_id bigint) RETURNS text
RETURN (SELECT value FROM data WHERE id = v_id);
The second form allows you to write a function with more than one SQL statement. As it used to be with multi-statement SQL functions, the result of the function will be the result of the final SQL statement. You can also use the second form of the new syntax to create SQL procedures. The first form is obviously not suitable for a procedure, since procedures don't have a return value.
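Here is a minimal sketch of a procedure using the second form; the log table mangle_log and the procedure name log_mangle are made up for this example:

CREATE TABLE mangle_log (input text, mangled text);

CREATE PROCEDURE log_mangle(t text)
LANGUAGE sql
BEGIN ATOMIC
   -- store the original value and its mangled form
   INSERT INTO mangle_log (input, mangled) VALUES (t, lower(unaccent(t)));
END;

CALL log_mangle('Schön dumm');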
We can easily rewrite our example function to use the new syntax:
CREATE OR REPLACE FUNCTION mangle(t text) RETURNS text
RETURN lower(unaccent(t));
Note that these new SQL functions can be inlined into SQL statements just like the old ones!
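A quick way to convince yourself; the output should look roughly like the earlier example:

EXPLAIN (VERBOSE, COSTS OFF) SELECT mangle('Schön dumm');

                  QUERY PLAN
───────────────────────────────────────────────
 Result
   Output: lower(unaccent('Schön dumm'::text))
(2 rows)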
The main difference is that the new-style SQL functions and procedures are parsed at function definition time and stored in parsed form in the prosqlbody column of the pg_proc system catalog. As a consequence, the two shortcomings noted above are gone:
Because the function body is available in parsed form, PostgreSQL can track dependencies. Let's try that with our redefined example function:
DROP EXTENSION unaccent;
ERROR:  cannot drop extension unaccent because other objects depend on it
DETAIL:  function mangle(text) depends on function unaccent(text)
HINT:  Use DROP ... CASCADE to drop the dependent objects too.
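If you are curious, you can also see the recorded dependency directly in pg_depend; a rough sketch (the exact deptype value you see may vary):

SELECT objid::regprocedure AS dependent_object, deptype
FROM   pg_depend
WHERE  classid = 'pg_proc'::regclass
  AND  refclassid = 'pg_proc'::regclass
  AND  refobjid = 'unaccent(text)'::regprocedure;

 dependent_object | deptype
------------------+---------
 mangle(text)     | n
(1 row)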
search_path with new-style SQL functions

search_path is only relevant when SQL is parsed. Since this now happens when CREATE FUNCTION runs, we don't have to worry about the current setting of that parameter at function execution time:
SET search_path = pg_catalog;

SELECT public.mangle('Schön besser');
    mangle
──────────────
 schon besser
(1 row)
You may notice that the multi-statement form for defining SQL functions contains semicolons to terminate the SQL statements. That will not only confuse the usual suspects like HeidiSQL (which never learned dollar quoting), but it will be a problem for any client that recognizes semicolons as separators between SQL statements. Even older versions of psql have a problem with that syntax:
psql (13.7, server 15beta2)
WARNING: psql major version 13, server major version 15.
         Some psql features might not work.
Type "help" for help.

test=> CREATE FUNCTION tryme() RETURNS integer
BEGIN ATOMIC
SELECT 42;
END;
ERROR:  syntax error at end of input
LINE 3: SELECT 42;
                  ^
WARNING:  there is no transaction in progress
COMMIT
psql thinks that the semicolon after "SELECT 42" terminates the CREATE FUNCTION statement. The truncated statement causes an error. The final END is treated as its own statement, which is a synonym for COMMIT and causes a warning.
In v14 and above, psql handles such statements correctly. pgAdmin 4 has learned the new syntax with version 6.3. But I am sure that there are many clients out there that have not got the message yet.
The new syntax for SQL functions introduced by PostgreSQL v14 has great advantages for usability and security. Get a client that supports the new syntax and start using it for your SQL functions. You should consider rewriting your existing functions to make use of these benefits.
Read another great post to increase your PostgreSQL syntax savvy: my post on Cross Join in PostgreSQL.
Find out more about how to get the most performance out of your PostgreSQL database with Hans' post on how to find and fix a missing index.
For PostgreSQL power users, automating repeated steps is becoming more and more necessary, and \gexec can help. This blog will show you how to use the || operator and the \gexec command to avoid unnecessary repetition in your workflow.
The CLI client that ships with PostgreSQL is called psql. Like many CLI clients, it is often overlooked and replaced with something with a GUI, or it is only used for the most basic tasks, while more complex operations are carried out elsewhere. However, psql is a very capable tool with lots of useful features.
One common pattern is the need to run the same command with different arguments. Often, users simply rewrite the command over and over, or sometimes they may opt to use a text editor to write the command once, then copy and paste and edit it to accommodate different arguments.
Sometimes it can be useful to automate such steps, not only in the interest of saving time, but also in the interest of avoiding errors due to typos or copy-pasting. PostgreSQL can take the results of queries and add text to create commands with those results as arguments.
For this purpose, we can prepend or append text to any query result using the || operator.
The || operator

Let's assume a new user needs access to some tables in a schema, e.g. all those tables that match a certain prefix.
Now, we could do this manually, or ask the database to automate the boring stuff.
1. Let's retrieve the relevant tables with names starting with pgbench
postgres=# SELECT tablename FROM pg_tables WHERE tablename ~ '^pgbench';
    tablename
------------------
 pgbench_accounts
 pgbench_branches
 pgbench_history
 pgbench_tellers
(4 rows)
2. Let's use || to prepend and append command fragments to create a valid command with the tablename as a parameter.
postgres=# SELECT 'GRANT SELECT ON TABLE ' || tablename || ' TO someuser;'
           FROM pg_tables WHERE tablename ~ '^pgbench';
                      ?column?
-----------------------------------------------------
 GRANT SELECT ON TABLE pgbench_accounts TO someuser;
 GRANT SELECT ON TABLE pgbench_branches TO someuser;
 GRANT SELECT ON TABLE pgbench_history TO someuser;
 GRANT SELECT ON TABLE pgbench_tellers TO someuser;
(4 rows)
Note that the strings end or begin with additional spaces, as the tablename itself does not contain the necessary spaces for argument separation. The semicolon ; was also added so these commands could be run straight away.
Please keep in mind that, while it is convenient to use || to concatenate things, it is not considered good practice, as it can be vulnerable to SQL injection attacks, as a helpful commenter detailed below:
Do NOT blindly concatenate table names with queries. Use quote_ident(), or format() with %I, instead. These apply correct escaping as necessary.
A safer approach to achieve the same results would be something like this:
postgres=# SELECT format('GRANT SELECT ON TABLE %I TO someuser;', tablename)
           FROM pg_tables WHERE tablename ~ '^pgbench';
                       format
-----------------------------------------------------
 GRANT SELECT ON TABLE pgbench_accounts TO someuser;
 GRANT SELECT ON TABLE pgbench_branches TO someuser;
 GRANT SELECT ON TABLE pgbench_history TO someuser;
 GRANT SELECT ON TABLE pgbench_tellers TO someuser;
(4 rows)
Now, these commands could be copied and then pasted straight into the prompt.
I've even seen people take such lines, store them in a file and then have psql execute all commands from the file.
But thankfully, a much easier way exists.
\gexec

In psql, there are many shortcuts and helpers to quickly gather info about the database, schemas, tables, privileges and much more.
The psql shell allows for working on the input and output buffers, and this can be used together with \gexec to have psql execute each command from the output buffer.
Using \gexec

Reusing the query to generate the necessary commands, we can call \gexec to execute each line from the previous output.
postgres=# SELECT 'GRANT SELECT ON TABLE ' || tablename || ' TO someuser;'
           FROM pg_tables WHERE tablename ~ '^pgbench';
                      ?column?
-----------------------------------------------------
 GRANT SELECT ON TABLE pgbench_accounts TO someuser;
 GRANT SELECT ON TABLE pgbench_branches TO someuser;
 GRANT SELECT ON TABLE pgbench_history TO someuser;
 GRANT SELECT ON TABLE pgbench_tellers TO someuser;
(4 rows)

postgres=# \gexec
GRANT
GRANT
GRANT
GRANT
\gexec with more arguments

Assuming that you want to do something involving more arguments, you can always add more || to add more command fragments around the results from a query.
Suppose you need to grant privileges to insert, update, and delete from those tables as well.
A simple cross join gives us the desired action (constructed as a relation using the VALUES constructor) for each of the table names.
postgres=# SELECT action, tablename
           FROM pg_tables
           CROSS JOIN (VALUES ('INSERT'),('UPDATE'),('DELETE')) AS t(action)
           WHERE tablename ~ '^pgbench';
 action |    tablename
--------+------------------
 INSERT | pgbench_accounts
 UPDATE | pgbench_accounts
 DELETE | pgbench_accounts
 INSERT | pgbench_branches
 UPDATE | pgbench_branches
 DELETE | pgbench_branches
 INSERT | pgbench_history
 UPDATE | pgbench_history
 DELETE | pgbench_history
 INSERT | pgbench_tellers
 UPDATE | pgbench_tellers
 DELETE | pgbench_tellers
(12 rows)
Note that we explicitly assign the action column name using AS t(action) to the table generated using VALUES.
postgres=# SELECT 'GRANT ' || action || ' ON TABLE ' || tablename || ' TO someuser;'
           FROM pg_tables
           CROSS JOIN (VALUES ('INSERT'),('UPDATE'),('DELETE')) AS t(action)
           WHERE tablename ~ '^pgbench';
                      ?column?
-----------------------------------------------------
 GRANT INSERT ON TABLE pgbench_accounts TO someuser;
 GRANT UPDATE ON TABLE pgbench_accounts TO someuser;
 GRANT DELETE ON TABLE pgbench_accounts TO someuser;
 GRANT INSERT ON TABLE pgbench_branches TO someuser;
 GRANT UPDATE ON TABLE pgbench_branches TO someuser;
 GRANT DELETE ON TABLE pgbench_branches TO someuser;
 GRANT INSERT ON TABLE pgbench_history TO someuser;
 GRANT UPDATE ON TABLE pgbench_history TO someuser;
 GRANT DELETE ON TABLE pgbench_history TO someuser;
 GRANT INSERT ON TABLE pgbench_tellers TO someuser;
 GRANT UPDATE ON TABLE pgbench_tellers TO someuser;
 GRANT DELETE ON TABLE pgbench_tellers TO someuser;
(12 rows)
This output can then again be executed using \gexec.
postgres=# \gexec
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
GRANT
Depending on the circumstances, it may be required to add additional quotes to the output, for example when table names contain capitalization or spaces. In such cases, matching double quotes " can be added to the strings prepended and appended to arguments.
postgres=# SELECT 'GRANT SELECT ON TABLE "' || tablename || '" TO someuser;'
           FROM pg_tables WHERE schemaname = 'public';
                      ?column?
------------------------------------------------------
 GRANT SELECT ON TABLE "with spaces" TO someuser;
 GRANT SELECT ON TABLE "Capitalization" TO someuser;
 GRANT SELECT ON TABLE "capitalization" TO someuser;
(3 rows)

postgres=# \gexec
GRANT
GRANT
GRANT
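As mentioned earlier, format() with %I achieves the same thing and quotes identifiers only when necessary, so you don't have to hard-code the double quotes; a sketch, assuming the same three tables:

postgres=# SELECT format('GRANT SELECT ON TABLE %I TO someuser;', tablename)
           FROM pg_tables WHERE schemaname = 'public';
                      format
------------------------------------------------------
 GRANT SELECT ON TABLE "with spaces" TO someuser;
 GRANT SELECT ON TABLE "Capitalization" TO someuser;
 GRANT SELECT ON TABLE capitalization TO someuser;
(3 rows)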
Now that you know how to use \gexec, why not take the next step? Take a look at our blog on column order in PostgreSQL to see it used in another practical example.
If you would like to learn more about security in PostgreSQL, see my blog about Transport Layer Security.
We at CYBERTEC are proud to release the CYBERTEC Migrator Standard Edition, the fastest tool to migrate a database to PostgreSQL. And the best part of it: the CYBERTEC Migrator Standard Edition is available free of charge.
What? Another free (as in beer) tool to migrate databases to PostgreSQL? We already have free tools to migrate Oracle databases to PostgreSQL and they are even open source. So why should I care?
If you are not a curious person and don't care about the why, skip the next section and jump to the interesting part of this blog post, where we walk you through the migration of Oracle's HR schema using the CYBERTEC Migrator Standard Edition. In case you don't have the time to read the article, watch the screen cast.
This article is going to be the first of a series of blog posts about how to migrate from Oracle to PostgreSQL. The articles will cover the challenges you may face, possible solutions and how the CYBERTEC Migrator reduces both your effort and the risks of botched migrations.
So you haven't skipped this section and are curious. Why did we develop the CYBERTEC Migrator?
The simplest answer would be to have an additional option on how to migrate a database to PostgreSQL. And we all agree on one truth: "A world with one Oracle database less and one PostgreSQL database more is a better world".
As you may know, we at CYBERTEC use open source tools and even contribute to the open source community. One of our most used software products - the Foreign Data Wrapper for Oracle - is used by many open source projects.
So why develop the Migrator?
At the beginning there was speed! Or better, the lack of speed. "Necessity is the mother of invention."
Here at CYBERTEC we often see customers with extreme requirements. We assume that's probably why they come to us, because people out there tell them "If someone knows how to do this with PostgreSQL, and is crazy enough to try it out, try CYBERTEC!". We are their last hope to solve their problems.
Long story short - some of our customers have databases in the terabyte regions. And they ask for ways to reduce the service downtime imposed by a database migration.
Previously, we didn't have the resources to implement an asynchronous data replication pipeline. (Such a pipeline captures data changes in the source database during the migration, and synchronizes those changes to PostgreSQL after the migration finishes.)
This left us with the option to reduce migration time by improving the speed of the data transfer. The solution to this problem was ora_migrator, a plugin based on our Foreign Data Wrapper for Oracle. But this solution had some disadvantages:
Looking back at what features we implemented, the why and the how, we noticed another motivation for the existence of our Migrator: convenience.
We do not like to do repetitive, boring stuff. We want to concentrate on the challenges rather than the kinds of mundane tasks we had to do when we used the existing tooling.
And since we are faced with migrating databases to PostgreSQL on a daily basis, we want to have tools which get the boring stuff out of the way. We want to be productive where it matters, so we need tools which provide a working solution.
Don't get me wrong, using vim (as everyone knows, the only true editor) makes me productive. But switching between different terminals, exporting code to the file system, checking log files on another machine, and ... "what state was the migration in before I left for coffee?" All wrapped up in a plethora of shell and SQL scripts, which at the end may fail because the production system differs from the test system! That's not productive.
The raison d'etre of the CYBERTEC Migrator is so that customers do not have to know how to map Oracle data types to PostgreSQL or all the details explained in the technical guide on how to migrate from Oracle to PostgreSQL.
If you want a tool which frees the cognitive capacity of your brain from the boring stuff so you can use it on porting the database over to PostgreSQL, that's what the CYBERTEC Migrator is - the easiest way to migrate to PostgreSQL - in a professional way.
And by the way - it is also the fastest way to PostgreSQL!
So why would you give such a tool away for free?
We wanted to provide persons and organisations with small budgets an easy way to migrate their databases from Oracle Express or/and Personal Edition to PostgreSQL. We even bundle an hour of consulting for free to provide them with the impetus to migrate to PostgreSQL.
The CYBERTEC Migrator Professional and Enterprise Edition are meant for customers with stringent database SLAs and high availability. More details about the differences between the editions can be found on the CYBERTEC Migrator web page.
First you'll see how to install the CYBERTEC Migrator Standard Edition. Then we'll walk you through the migration of Oracle's HR demo database to PostgreSQL.
The CYBERTEC Migrator is a so-called web application, which means its main functionality, the Migrator core, runs on a server. The user configures and controls the database migration via a graphical user interface that runs on a web browser.
As we mentioned above, one of the hassles we had with existing tools was a complicated installation process. The CYBERTEC Migrator is deployed as a set of container images. CYBERTEC provides a public Github repository (cybertec_migrator) which facilitates the installation process. We use Docker Compose to configure and orchestrate the containers. The images for the Migrator Professional and Enterprise Edition are pulled from the Docker Hub container registry.
As a side note - we also have customers who run the Migrator on Kubernetes in their private cloud, mainly on OpenShift.
For an installation on an air gapped system we provide an archive file which contains the git repository mentioned above including the container images. The CYBERTEC Migrator Standard Edition is available as an archive file. Visit this CYBERTEC Migrator link to get it.
Desktop systems like Docker Desktop for Mac and Windows include Docker Engine and Docker Compose as part of those desktop installs. If you use an MS Windows desktop system, install Docker for Desktop with the WSL 2 backend.
Git comes installed with most of the Windows Subsystem for Linux distributions. Should this not be the case, open a shell in Ubuntu/Debian and enter the command:
$ sudo apt-get install git
Assuming all preconditions are met, follow the instructions provided in the offline installation section on the Migrator's installation page:
$ tar xf cybertec_migrator-v3.7.0-standard.tar.gz
$ cd cybertec_migrator/
$ ./migrator configure
[OK] Generated environment file
[INFO] Run './migrator install' to complete setup
$ ./migrator install --archive ../cybertec_migrator-v3.7.0-standard.tar.gz
[INFO] Extracting upgrade information
6be90f1a2d3f: Loading layer [==========================>]  72.54MB/72.54MB
...
Loaded image: cybertecpostgresql/cybertec_migrator-core:v3.7.0-standard
fd95118eade9: Loading layer [==========================>]  83.9MB/83.9MB
...
Loaded image: cybertecpostgresql/cybertec_migrator-web_gui:v3.7.0-standard
...
Loaded image: postgres:13-alpine
[INFO] Extracted upgrade information
[INFO] Upgraded to v3.7.0-standard
[INFO] Run './migrator up' to switch to new version
[WARN] Switching will abort running migrations
$ ./migrator up
Creating network 'cybertec_migrator_common' with the default driver
Creating volume 'cybertec_migrator_core_db-data' with default driver
Creating cybertec_migrator_core_db_1   ... done
Creating cybertec_migrator_core_1      ... done
Creating cybertec_migrator_web_gui_1   ... done
[OK] Started on 'http://orcus'
Note: the Migrator version and the hashes may differ from the file you downloaded.
orcus is the name of the host; it will differ for your installation. Open the web browser of your choice and visit the URL shown in the terminal, in this case http://orcus. If you can't reach the Migrator on the URL provided due to name resolution problems in your Docker setup, try http://localhost, which should work.
You probably are asking yourself, "What? HTTP is not secure!"
For the sake of simplicity we do not use HTTPS in this demo run. It would just complicate the setup and configuration process and not contribute to the goal of this article. In a production environment, you would configure the NGINX server providing the Web GUI to use your SSL certificates.
If all went well, you should see the Migrator Dashboard, which provides the CPU and RAM status of the host where the Migrator is running. The dashboard also displays the most recently changed migrations. Since we haven't created one yet, the page is empty.
In case the Migrator failed to start with the error Bind for 0.0.0.0:80 failed: port is already allocated, you have to change the default port configuration, since the default port 80 is already in use. Change the value of the environment variable EXTERNAL_HTTP_PORT in the .env file located in the installation directory to an unused port.
Migrator installed and running - check.
For the demo, we need an Oracle database containing the HR sample schema and a PostgreSQL database. In case you have access to both, you are set up for the demo and you can skip to the next section, Database Users and Privileges.
If you don't have access to both database servers, we provide a git repository containing a helper script demo-env which facilitates the setup of the demo environment. Clone the git repository of the demo database environment:
$ git clone https://github.com/cybertec-postgresql/cybertec_migrator_demo demo
The following command will start a PostgreSQL server demo_db providing a demo database owned by the demo user, having the password demo. Easy to remember, right?
$ ./demo/demo-env up demo_db
Creating demo_demo_db_1 ... done
In case you get an error because the port is already in use, change the environment variable EXTERNAL_DEMO_POSTGRES_PORT in the demo/.env file.
All we are missing now is the Oracle source database with the HR schema. The following command is going to pull Oracle's Express Edition provided by Oracle's Container Registry. Be aware that you need around 5 GB for the Oracle container. And yes, Oracle is bigger and slower than an elephant.
$ ./demo/demo-env up oracle
Pulling oracle (container-registry.oracle.com/database/express:21.3.0-xe)...
21.3.0-xe: Pulling from database/express
9347a8f0b307: Pull complete
92de79849cc5: Pull complete
d3ad8d0938bf: Pull complete
d5ad62ab629c: Pull complete
4f4fb700ef54: Pull complete
a97d1b8f3aa9: Pull complete
Digest: sha256:20e2eaf39538eada633040e390ee7308a3abc2d6256075adb2e4bb73bf128f1f
Status: Downloaded newer image for container-registry.oracle.com/database/express:21.3.0-xe
Creating demo_oracle_1 ... done
We have to wait until Oracle is up and running, showing the status healthy.
Once again, in case the Oracle default port 1521 is already occupied, the command above will fail. You know the drill: adjust the environment variable EXTERNAL_DEMO_ORACLE_PORT in the file demo/.env.
$ docker ps --format '{{ .Names }}\t{{ .Status }}\t{{ .Ports }}'
demo_oracle_1                 Up About a minute (healthy)   0.0.0.0:1521->1521/tcp, :::1521->1521/tcp
demo_demo_db_1                Up About a minute             0.0.0.0:5432->5432/tcp, :::5432->5432/tcp
cybertec_migrator_web_gui_1   Up About an hour              0.0.0.0:80->80/tcp, :::80->80/tcp
cybertec_migrator_core_1      Up About an hour
cybertec_migrator_core_db_1   Up About an hour (healthy)    5432/tcp
The output shows the mapped ports for the various services:
- demo_db on port 5432, providing the database demo
- oracle on port 1521, providing the database xepdb1
Both database services started by the demo environment do not use volume containers. In case you're not familiar with containers, it means the data is not persisted. On restarting the demo environment, you end up with an empty PostgreSQL demo database.
After you finish the demo, you can shut down the demo environment with
$ demo/demo-env down
The Oracle database user we are going to employ needs read-only privileges. No changes to the source database are necessary when migrating the database:
- SELECT_CATALOG_ROLE grants the user SELECT privileges on the data dictionary views
- SELECT ANY TABLE allows the user to access views and tables in other schemas
- FLASHBACK ANY TABLE allows the user to issue an Oracle Flashback Query. This guarantees that we migrate a consistent snapshot at a specific point in time (SCN).

The demo environment from the previous chapter provides a user "migrator" with the above privileges.
-- On the Oracle database
CREATE USER migrator IDENTIFIED BY migrator;
GRANT CONNECT, SELECT_CATALOG_ROLE, SELECT ANY TABLE, FLASHBACK ANY TABLE TO migrator;
For the target database, the CYBERTEC Migrator expects a PostgreSQL database and a user with CREATE privilege on the database. The demo user provided with the database demo environment was created with the following statements:
-- As PostgreSQL superuser
CREATE DATABASE demo;
CREATE USER demo WITH PASSWORD 'demo';
GRANT CREATE ON DATABASE demo TO demo;
If you use the demo environment, we are ready to go.
So, the Migrator is installed, the demo environment is set up. Let's migrate Oracle's HR demo database to PostgreSQL. We start by creating a migration.
On the left side of the Dashboard select MIGRATION to change to the migrations page. Select the button "ADD NEW MIGRATION", which opens the wizard for creating a migration.
For the source connection, enter:

- Connection string: oracle://oracle:1521/xepdb1
- Username: migrator
- Password: migrator
In case you don't want to use the database demo environment, you will need to adjust the values to your environment.
Selecting CHECK performs a connectivity check from the Migrator core with the source connection provided. If the connection is successful, we can proceed to the Schema Selection with NEXT.
In case the connectivity check fails, an error message is displayed. Check the connection string and your credentials. In case the database user is missing privileges, they are shown.
Let's select the Oracle HR demo schema we want to migrate, and proceed with NEXT to provide the Target Connection.
Now we have to provide the connection to the target database and check the connectivity:
- Connection string: postgresql://demo_db:5432/demo
- Username: demo
- Password: demo
On selecting NEXT, the Migrator reads the meta-data from the Oracle system catalogs. Assuming we don't encounter an unexpected error (yes, sometimes Oracle databases have a screwed up system catalog), the Migrator persists the meta-data of the source database and creates a configuration based on this meta-data.
The migration configuration is later used to migrate the data structure to PostgreSQL. It takes care of mapping data types, table structure, constraints and default values, adjusting indexes and much more.
The creation of the migration configuration is the result of the cumulative knowledge gathered over years of migrating hundreds of databases to PostgreSQL.
As you see in the screenshot above, one of the few features the Standard Edition does not contain is the migration of partitioned tables. The Standard Edition will migrate partitioned tables as normal tables.
Selecting NEXT creates a migration for the HR schema, and we end up in the migration overview page.
When we create a migration, it captures a set of data: the source and target connections, the meta-data read from the source database, and the migration configuration created from it.

Note that once a migration is created, you cannot change the source database, the target database, or the meta-data of the source database. This is an important concept which has an impact on how to roll out migrations to production systems. The life cycle of a migration and how it is used merits its own article.
The screenshot below shows the landing page of the newly created migration for the HR schema.
The migration page is divided into four areas:
For now, let's take a detailed look at the overview page.
The Overview tab shows analytical information about the source database. The left hand side shows a table with the database objects grouped by object type, their cardinality and the stage (more about stages in the next sections) and the order in which they are going to be migrated. The right hand side of the overview page provides information about the column data types and their cardinality. In both tables we may drill down to get more specific information.
Our PostgreSQL experts are able to use the Overview page to make a rough assessment of how much effort it will take to migrate the database.
Wait, what? Do we need an expert to make the assessment?
Don't worry - it is on our roadmap and we are currently working on it. The migration assessment will include our 20 years of knowledge of how to migrate databases to PostgreSQL, including an analysis of the PL/SQL code. Coming soon!
If you want to know what we are currently working on, check out the CYBERTEC Migrator Feature Roadmap at the end of this article. But enough about "what's going to be" - let's start with the migration.
Selecting the Stages tab will lead you to the heart of our Migrator. This is the place where you run, change, resume and replay a migration.
Before we start with the migration of the HR schema, a word about migration stages. The CYBERTEC Migrator divides the execution of a migration into so-called migration stages, natural synchronization points in the migration process:
Now for a demonstration...
Select the START button which - you guessed right - starts the migration. Starting a migration job will automatically open the Log view and follow the log messages showing what's currently happening. Migrating the HR schema, we end up with an error at the Logic stage.
In the default settings, starting a migration executes all the stages until it succeeds with a complete migration or it reaches the first error. In our example, the Migrator executed the Structure, Data and Integrity stages successfully. However, it hit an error in the Logic stage on migrating procedure:"HR"."ADD_JOB_HISTORY". This means our PostgreSQL demo database contains the migrated tables, with the data, constraints and indexes.
The HR schema contains only a few rows of data. This means the Data and Integrity stage are fast.
The takeaways here are:
The CYBERTEC Migrator Standard Edition transfers at maximum three tables in parallel. The same goes for creating indexes and foreign keys. If you need a faster migration, you can upgrade to the Migrator Professional Edition.
Back to our demo and the Log view.
Reading the Log view from the bottom up, we are informed about the failed object, procedure:"HR"."ADD_JOB_HISTORY", and the cause: a syntax error at or near "IS".

Note that all DBOs in the log view are hyperlinked. Selecting the hyperlink for procedure:"HR"."ADD_JOB_HISTORY" opens the code editor for the stored procedure. Hyperlinking error messages to the DBO that caused them saves me at least a couple weeks of work (and frustration) each year.
The screenshot below shows the code editor of the erroneous procedure.
On top of the editor we see the DBO's fully qualified name as provided in the source database, in our case HR > ADD_JOB_HISTORY. The code editor contains the PL/SQL extracted from the Oracle source database and visually marks the same syntax error reported by the migration.
So why this error? Anyone with experience porting code from Oracle PL/SQL to PL/pgSQL knows that the SQL procedural languages between database technologies are not 100% compatible. So Oracle's PL/SQL code has to be translated to PL/pgSQL, PostgreSQL's SQL procedural language.
But Max, why doesn't the CYBERTEC Migrator translate PL/SQL code to PL/pgSQL automatically?
The answer is simple. Our in-house PostgreSQL experts are really skeptical about automatic code conversion tools. This probably stems from the situation that at CYBERTEC, we mostly deal with databases with complex PL/SQL code. To achieve the quality of code translation we would like to have, we would need an Oracle PL/SQL parser - which you guessed right - is on our feature roadmap.
That's why our experts use the Migrator Search and Replace feature.
For now, let's change the code in the editor by replacing "IS BEGIN" with "LANGUAGE SQL AS $$" and "END add_job_history;" with "$$" (without double quotes). Selecting SHOW DIFFERENCE shows the difference between the original Oracle code and the changed procedure code.
What you do not see is what happens behind the scenes. In the background, the CYBERTEC Migrator Standard Edition validates the content of the code editor against the target database. It drops and recreates the procedure, and in case of an error, we show it in the editor. Don't worry about unwanted changes: thanks to PostgreSQL's transactional DDL, the validation is performed in a transaction with rollback.
The background validation is maybe not so much of use in a stored procedure, but saves a lot of time when adapting views. It prevents unnecessary round trips due to typos in column or table names.
Save the changes by selecting the SAVE button, which persists the changes we just made in the Migrator's internal database. Leave the editor and go back to the "Stages" tab by selecting the back arrow on top of the page, near the migration name.
The migration controls provide us with the following options:
We want to resume the migration where it was stopped due to the first error. To avoid the "resume, show error, fix error" cycle over and over again, we disable "Abort stage on first error". This will execute the stage until its end and report all detected errors.
Select RESUME. This will create the fixed procedure but leave us with errors in another procedure, view and trigger.
Let's fix the trigger and select the link in the error log line of trigger:"HR"."UPDATE_JOB_HISTORY", which opens the code editor.
An Oracle trigger is migrated to PostgreSQL by wrapping the trigger code into a PostgreSQL function which is called by the trigger. The function name matches the trigger name.
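To give you an idea of the pattern, here is a rough, simplified sketch of what the migrated trigger ends up looking like once all the fixes described below have been applied (column lists and trigger timing are abbreviated; the real HR definitions are more detailed):

CREATE FUNCTION update_job_history() RETURNS trigger
   LANGUAGE plpgsql AS
$$BEGIN
   -- the wrapped trigger body: call the procedure, discarding any result
   PERFORM add_job_history(old.employee_id, old.hire_date, current_date,
                           old.job_id, old.department_id);
   RETURN new;
END;$$;

CREATE TRIGGER update_job_history
   AFTER UPDATE OF job_id, department_id ON employees
   FOR EACH ROW
   EXECUTE FUNCTION update_job_history();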
The screenshot below shows an error near the call to add_job_history.
PostgreSQL is picky when calling a function with no return value. So let's change the code to PERFORM add_job_history. This will lead to the editor complaining about a syntax error at or near ":".
:old and :new are variables unknown to PostgreSQL data triggers. Since this shows up multiple times, we are going to use the Search and Replace functionality. At the bottom of the migration window, open the Search view.
We are going to use regular expressions to replace all occurrences of :old with old, and :new with new.
Select the "regex" mode by selecting .* and search for :(new|old). This uses a regex capture group to capture the parameter $1, which we use as a replacement, removing the colon.
In the HR schema we have only one trigger function, so not much to replace. Note that the scope for the Search and Replace operation can be tweaked on the right hand side. The use of the Search and Replace feature deserves its own article.
Selecting Replace All fixes the rest of the errors in the trigger function.
Let's fix the remaining errors. Switch back to the Log view and select the view:"HR"."EMP_DETAILS_VIEW" link, which opens the code editor with the erroneous view. Scroll to the bottom and remove the last line WITH READ ONLY, which has no equivalent in PostgreSQL.
I already mentioned the code validation performed by the migrator against the database. Try to change a table name in the view and see what happens.
Undo the change to the table name (CTRL-z) so the view works again, and select the last erroneous DBO procedure:"HR"."SECURE_DML" in the log view.
Let's assume we are not interested in migrating this procedure and exclude it from the migration by enabling Exclude in the editor's upper right corner. The sidebar shows the exclusion of the procedure with SECURE_DML.
Now we SAVE our changes and resume the execution. Try the floating migration control on the right side of the screen to resume the migration. It is handy when you don't want to return to the Stages tab.
Congratulations - you just migrated your first database using the CYBERTEC Migrator. If you go back to the migration page all your stages will show up green.
If needed, you can download the content of the log view for audit purposes by selecting the download icon. Yes, some customers need this.
If you reached this part of the article and you followed all the steps - kudos.
One important part left to demonstrate is the table editor. Selecting one of the tables on the sidebar will open the table editor. This is the place where you can change the database objects which end up in PostgreSQL.
There you can exclude one or more columns from the table, or the whole table. Change the data type, the nullable attribute, default values. Constraints, indices, triggers and partitions can be adapted.
There are a couple of features left to mention. These will be featured in future articles:
We at CYBERTEC are constantly improving the Migrator. The priority of the features included depends on customer feedback and the migration projects CYBERTEC is currently working on.
Our paying customers have access to the monthly feature releases of the Migrator.
How many of these features and in which order we can implement them depends on the resources we can reserve.
If you want to help the PostgreSQL community, download the free Migrator, try it out, provide feedback and spread the word about a new tool to migrate databases to PostgreSQL.
I've recently seen some really broad tables (hundreds of columns) in a somewhat inefficiently structured database. Our PostgreSQL support customer complained about strange runtime behavior which could not be easily explained. To help other PostgreSQL users in this same situation, I decided to reveal the secrets of a fairly common performance problem many people don't understand: column order and column access.
The first question is: how can we create a table containing many columns? The easiest way is to simply generate the CREATE TABLE statement using generate_series:
test=# SELECT 'CREATE TABLE t_broad ('
       || string_agg('t_' || x || ' varchar(10) DEFAULT ''a'' ', ', ')
       || ' )'
       FROM generate_series(1, 4) AS x;
                           ?column?
----------------------------------------------------------
 CREATE TABLE t_broad (t_1 varchar(10) DEFAULT 'a' , t_2 varchar(10) DEFAULT 'a' , t_3 varchar(10) DEFAULT 'a' , t_4 varchar(10) DEFAULT 'a'  )
(1 row)

test=# \gexec
CREATE TABLE
For the sake of simplicity I have only used 4 columns here. Once the command has been generated, we can use \gexec to execute the string we have just compiled. \gexec is a really powerful thing: it treats the previous result as SQL input, which is exactly what we want here. It leaves us with a table containing 4 columns.
However, let's drop the table and create a really large one.
test=# DROP TABLE t_broad;
DROP TABLE
The following statement creates a table containing 1500 columns. Mind that the upper limit is 1600 columns:
test=# SELECT 'CREATE TABLE t_broad ('
       || string_agg('t_' || x || ' varchar(10) DEFAULT ''a'' ', ', ')
       || ' )'
       FROM generate_series(1, 1500) AS x;
In real life such a table is far from efficient and should usually not be used to store data. It will simply create too much overhead and in most cases it is not good modelling in the first place.
Let's populate the table and add 1 million rows:
test=# \timing
Timing is on.
test=# INSERT INTO t_broad SELECT 'a' FROM generate_series(1, 1000000);
INSERT 0 1000000
Time: 67457,107 ms (01:07,457)
test=# VACUUM ANALYZE;
VACUUM
Time: 155935,761 ms (02:35,936)
Note that the table has default values, so we can be sure that those columns actually contain something. Finally, I have executed VACUUM to make sure that all hint bits and the like are set.
The table we have just created is roughly 4 GB in size which can easily be determined using the following line:
test=# SELECT pg_size_pretty(pg_total_relation_size('t_broad'));
 pg_size_pretty
----------------
 3907 MB
(1 row)
PostgreSQL stores data in rows. As you might know data can be stored column- or row-oriented. Depending on your use case one or the other option might be beneficial. In the case of OLTP a row-based approach is usually far more efficient.
Let's do a count(*) and see how long it takes:
test=# SELECT count(*) FROM t_broad;
  count
---------
 1000000
(1 row)
Time: 416,732 ms
We can run the query in around 400 ms which is quite ok. As expected, the optimizer will go for a parallel sequential scan:
test=# explain SELECT count(*) FROM t_broad;
                                   QUERY PLAN
--------------------------------------------------------------------
 Finalize Aggregate  (cost=506208.55..506208.56 rows=1 width=8)
   ->  Gather  (cost=506208.33..506208.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=505208.33..505208.34 rows=1 width=8)
               ->  Parallel Seq Scan on t_broad  (cost=0.00..504166.67 rows=416667 width=0)
 JIT:
   Functions: 4
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(8 rows)
Let's compare this to a count on the first column. You'll see a small difference in performance. The reason is that count(*) has to check for the existence of the row, while count(column) has to check if a NULL value is fed to the aggregate or not. In case of NULL, the value has to be ignored:
test=# SELECT count(t_1) FROM t_broad;
  count
---------
 1000000
(1 row)
Time: 432,803 ms
But let's see what happens if we access column number 100. The time to do that will differ significantly:
test=# SELECT count(t_100) FROM t_broad;
  count
---------
 1000000
(1 row)
Time: 857,897 ms
The execution time has basically doubled. The performance is even worse if we do a count on column number 1000:
test=# SELECT count(t_1000) FROM t_broad;
  count
---------
 1000000
(1 row)
Time: 8570,238 ms (00:08,570)
Wow, we are already 20 times slower than before. This is not a small difference but a major problem which has to be understood.
To understand why the problem happens in the first place, we need to take a look at how PostgreSQL stores data: after the tuple header, which is present in every row, we get a couple of varchar columns. We just used varchar here to prove the point. The same issues will happen with other data types - the problem is simply more apparent with varchar, as it is more complicated internally than, say, integer.
How does PostgreSQL access a column? It will fetch the row and then dissect this tuple to calculate the position of the desired column inside the row. So if we want to access column #1000, it means that we have to figure out how long those first 999 columns before our chosen one really are. This can be quite complex. For integer we simply have to add 4, but in the case of varchar, the operation turns into something really expensive. Let's inspect how PostgreSQL stores varchar (just to see why it is so expensive):
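A varchar value is stored as a variable-length datum: a short header holding the length, followed by the actual bytes (possibly compressed or moved out of line). So the width of each value differs from row to row, and the only way to find column #1000 is to walk over the 999 variable-width values in front of it. A quick way to see the variable-length nature is pg_column_size (the exact byte counts depend on the representation, but the two sizes will differ):

test=# SELECT pg_column_size('a'::varchar(10))          AS short_value,
              pg_column_size('aaaaaaaaaa'::varchar(10)) AS long_value;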
Now imagine what that means if we need to loop over 1000 columns? It does create some non-trivial overhead.
The key insight here is that using extremely large tables is often not beneficial from a performance standpoint. It makes sense to use sensible table layouts to have a good compromise between performance and convenience.
If you are interested in other ways to improve performance, read my blog on CLUSTER.
In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.
Bonus cards, "Miles & More", bonus points - don't we all love and hate them at the same time? Recently we had an interesting use case which made me think about sharing some of the techniques we used in this area to reduce client code by writing some clever SQL. This post will show you how to efficiently code bonus programs in SQL.
Suppose we want to run a bonus program. What we want is to know how many bonus points somebody had at any given point in time. This is how we might want to store the data:
CREATE TABLE t_bonus_card
(
    card_number  text NOT NULL,
    d            date,
    points       int
);
For each bonus card, we want to store how many points were awarded when. So far this is relatively easy. Let's load some sample data:
COPY t_bonus_card FROM stdin DELIMITER ';';
A4711;2022-01-01;8
A4711;2022-01-04;7
A4711;2022-02-12;3
A4711;2022-05-05;2
A4711;2022-06-07;9
A4711;2023-02-02;4
A4711;2023-03-03;7
A4711;2023-05-02;1
B9876;2022-01-07;8
B9876;2022-02-03;5
B9876;2022-02-09;4
B9876;2022-10-18;7
\.
In my example, we have data for two bonus cards which receive some rewards from time to time. To run our bonus program using PostgreSQL, we might want to answer some basic questions:
Let's answer these questions using...
To answer all these questions we can use windowing functions along with some advanced, fancy frame clauses. Let's take a look at a basic query:
SELECT *,
       array_agg(points) OVER (ORDER BY d
                               RANGE BETWEEN '6 months' PRECEDING AND CURRENT ROW)
FROM   t_bonus_card
WHERE  card_number = 'A4711';

 card_number |     d      | points |  array_agg
-------------+------------+--------+-------------
 A4711       | 2022-01-01 |      8 | {8}
 A4711       | 2022-01-04 |      7 | {8,7}
 A4711       | 2022-02-12 |      3 | {8,7,3}
 A4711       | 2022-05-05 |      2 | {8,7,3,2}
 A4711       | 2022-06-07 |      9 | {8,7,3,2,9}
 A4711       | 2023-02-02 |      4 | {4}
 A4711       | 2023-03-03 |      7 | {4,7}
 A4711       | 2023-05-02 |      1 | {4,7,1}
(8 rows)
What this does is simple: It goes through our data set line by line (sorted by date). Then it checks if there are rows between our current row and a value 6 months earlier. For debugging purposes, we aggregate those values into an array. What we see is that on June 7th we have 5 entries. But keep in mind: The rules of our bonus program say that points awarded are taken away after 6 months. By using a sliding window, we can easily achieve this goal.
Note that in SQL we have "ROWS", "RANGE" and "GROUPS" as possible keywords in our frame clause. ROWS means that we want to see a specific number of older rows in our frame. However, this makes no sense here - what we need is an interval, and this is exactly what RANGE can do for us. Rewards might be granted at random points in time, so we certainly need to operate with intervals here.
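To make the difference tangible, here is a small side-by-side sketch: the ROWS frame always looks at the two previous rows regardless of their dates, while the RANGE frame looks back a fixed interval:

SELECT d, points,
       sum(points) OVER (ORDER BY d ROWS  BETWEEN 2 PRECEDING AND CURRENT ROW) AS sum_last_3_rows,
       sum(points) OVER (ORDER BY d RANGE BETWEEN '6 months' PRECEDING AND CURRENT ROW) AS sum_last_6_months
FROM   t_bonus_card
WHERE  card_number = 'A4711'
ORDER BY d;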
The array_agg function is really useful to debug things. However, in a real-world scenario, we need to add up those numbers using sum:
SELECT *,
       sum(points) OVER (ORDER BY d
                         RANGE BETWEEN '6 months' PRECEDING AND CURRENT ROW)
FROM   t_bonus_card
WHERE  card_number = 'A4711';

 card_number |     d      | points | sum
-------------+------------+--------+-----
 A4711       | 2022-01-01 |      8 |   8
 A4711       | 2022-01-04 |      7 |  15
 A4711       | 2022-02-12 |      3 |  18
 A4711       | 2022-05-05 |      2 |  20
 A4711       | 2022-06-07 |      9 |  29
 A4711       | 2023-02-02 |      4 |   4
 A4711       | 2023-03-03 |      7 |  11
 A4711       | 2023-05-02 |      1 |  12
(8 rows)
We have seen that points drop in 2023 again. That's exactly what we wanted.
PARTITION BY

Maybe you have noticed that we did the entire calculation for just one card number. However, what has to be done to make this work for any number of cards? The answer is PARTITION BY:
SELECT *,
       sum(points) OVER (PARTITION BY card_number, date_trunc('year', d)
                         ORDER BY d
                         RANGE BETWEEN '6 months' PRECEDING AND CURRENT ROW)
FROM   t_bonus_card;

 card_number |     d      | points | sum
-------------+------------+--------+-----
 A4711       | 2022-01-01 |      8 |   8
 A4711       | 2022-01-04 |      7 |  15
 A4711       | 2022-02-12 |      3 |  18
 A4711       | 2022-05-05 |      2 |  20
 A4711       | 2022-06-07 |      9 |  29
 A4711       | 2023-02-02 |      4 |   4
 A4711       | 2023-03-03 |      7 |  11
 A4711       | 2023-05-02 |      1 |  12
 B9876       | 2022-01-07 |      8 |   8
 B9876       | 2022-02-03 |      5 |  13
 B9876       | 2022-02-09 |      4 |  17
 B9876       | 2022-10-18 |      7 |   7
(12 rows)
PARTITION BY card_number ensures that our calculations are done for each incarnation of card_number separately. In other words: user A's points cannot be mixed with user B's points anymore. But there is more to this query: we want the points to be reset to zero at the beginning of every year, and counting to resume from there. We can achieve this by using PARTITION BY as well: by truncating the dates to full years, we can use the year as an additional partitioning criterion.
As you can see, SQL is really powerful. A lot can be done without having to write a single line of client code. A handful of SQL statements can produce terrific results, so it makes sense to leverage SQL instead of reimplementing the same logic in your application.
If you want to know more about PostgreSQL 15 and if you are interested in merging data, check out my post about MERGE - which can be found here.