“How does the PostgreSQL optimizer handle views?” or “Are views good or bad?” I assume that every database consultant and every SQL performance expert has heard this kind of question before. Most likely this holds true not only for PostgreSQL experts but also for Oracle, DB2 and MS SQL consultants. Given the fact that views are a really essential feature of SQL, it makes sense to take a closer look at the topic in general and hopefully help some people to write better and faster code.

PostgreSQL: What it does to views

Let us create a simple table containing just 10 rows, which can be used throughout the blog to show how PostgreSQL works and how the optimizer treats things:

test=# CREATE TABLE data AS
	SELECT 	*
	FROM 	generate_series(1, 10) AS id;

Then I have created a very simple view:

test=# CREATE VIEW v1 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4;

The idea here is simply to filter some data and return all the columns.

Inlining and flattening

The key thing is: The optimizer will process the view just like a “preprocessor” directive. It will try to inline the code and flatten it. Here is an example:

test=# explain SELECT * FROM v1;
                      QUERY PLAN                       
 Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
   Filter: (id < 4)
(2 rows)

When we try to read from the view it is just like running the SQL statement directly. The optimizer will perform the following steps. First the view is inlined as a subselect:

SELECT * FROM
	(SELECT	*
		FROM 	data
		WHERE 	id < 4
	) AS v1;

In the next step the subselect will be flattened out completely, which leaves us with:

SELECT * FROM data WHERE id < 4;

We could have done this transformation ourselves but it is easier to make the optimizer do it for us and enjoy the convenience offered by the view.

Joining views

Basically the same mechanism applies, when you are working with joins. PostgreSQL will again inline, flatten and then create a plan to join those tables together. The output is therefore not surprising:

test=# explain SELECT 	*
	FROM 	v1 AS a, v1 AS b
	WHERE 	a.id = 1
			AND a.id = b.id;
                               QUERY PLAN                               
 Nested Loop  (cost=0.00..96.71 rows=16 width=8)
   ->  Seq Scan on data  (cost=0.00..48.25 rows=4 width=4)
         Filter: ((id < 4) AND (id = 1))
   ->  Materialize  (cost=0.00..48.27 rows=4 width=4)
         ->  Seq Scan on data data_1  (cost=0.00..48.25 rows=4 width=4)
               Filter: ((id < 4) AND (id = 1))
(6 rows)

Again PostgreSQL provides us with a simple plan.

Destructive behavior

However, views are not only good. Sometimes they can be a little destructive too. Consider the following example:

test=# CREATE VIEW v2 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4
	ORDER BY id;

In this case the view provides us with some ordered data. Let us see what happens in the following query:

test=# explain SELECT * FROM v2 ORDER BY 1 DESC;
                            QUERY PLAN                             
 Sort  (cost=135.22..137.34 rows=850 width=4)
   Sort Key: data.id DESC
   ->  Sort  (cost=83.23..85.36 rows=850 width=4)
         Sort Key: data.id
         ->  Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
               Filter: (id < 4)
(6 rows)

As you can see PostgreSQL will sort the data twice, which is definitely not good for performance. PostgreSQL won’t optimize this for you, so you have to take care of it yourself and be very aware of the issue.

In general it is a good idea to make a mental distinction between two types of views:

• Final results
• Intermediate results

If a view already has the “final” results you can do sorting and all kinds of stuff. However, if you are using a view to perform some large calculations on top of it, it might be a good idea to avoid formatting, sorting and alike.
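To illustrate the distinction, here is a small sketch: the intermediate view keeps only the filter, and the sort happens exactly once at the outermost level. The view name v_raw is a hypothetical addition, not part of the examples above.

-- intermediate view: filter only, no ORDER BY
CREATE VIEW v_raw AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4;

-- sort once, at the very end
SELECT * FROM v_raw ORDER BY id DESC;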

However, there is more to views. Consider the following example:

test=# CREATE VIEW v3 AS
	SELECT 	*
	FROM 	data
	WHERE 	id < 4
	LIMIT 100000000000;

In this case the view contains a LIMIT clause, which basically says that “all” the data will be returned. However, during those formal transformations the optimizer cannot “rely” on the fact that all data will be included in the final result. Therefore the optimizer cannot silently skip the LIMIT clause, which of course leads to the following plan:

test=# explain SELECT * FROM v3 ORDER BY 1 DESC;
                            QUERY PLAN                             
 Sort  (cost=91.73..93.86 rows=850 width=4)
   Sort Key: data.id DESC
   ->  Limit  (cost=0.00..41.88 rows=850 width=4)
         ->  Seq Scan on data  (cost=0.00..41.88 rows=850 width=4)
               Filter: (id < 4)
(5 rows)


As always the world is neither only black nor white. Views are a valuable feature and when used nicely there is nothing wrong with them. However, keep in mind what is going on behind the scenes and avoid building views on top of view on top of views and so on.

Is there a single significantly large and important database in the world which does not suffer from performance problems once in a while? I bet that there are not too many. Therefore every DBA (database administrator) in charge of PostgreSQL should know how to track down potential performance problems to figure out what is really going on.

Improving PostgreSQL performance beyond parameter tuning

Many people think that changing parameters in postgresql.conf is the real path to success. However, this is not always the case. Sure, more often than not good database config parameters are highly beneficial. Still, in many cases the real problems will come from some strange query hidden deep in some application logic. It is even quite likely that the queries causing real issues are not the ones you happen to focus on. The natural question arising now is: How can we track down those queries and figure out what is really going on? My favorite tool to do that is pg_stat_statements, which in my judgement should always be enabled if you are using PostgreSQL 9.2 or higher (please do not use it in older versions).

Enabling pg_stat_statements

To enable pg_stat_statements on your server change the following line in postgresql.conf and restart PostgreSQL:

shared_preload_libraries = 'pg_stat_statements'

Once this module has been loaded into the server, PostgreSQL will automatically start to collect information. The good thing is that the overhead of the module is really, really low (the overhead is basically just “noise”).

Then run the following command to create the necessary view to access the data:

CREATE EXTENSION pg_stat_statements;

The extension will deploy a view called pg_stat_statements and make the data easily accessible.
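If you want to start a fresh measurement period at some point, the statistics collected so far can be discarded with a function the extension also provides:

test=# SELECT pg_stat_statements_reset();

This is useful before a benchmark run, so that the view only reflects the workload you actually want to analyze.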

Detecting slow queries in PostgreSQL

The easiest way to find the most interesting queries is to sort the output of pg_stat_statements by total_time:

SELECT * FROM pg_stat_statements ORDER BY total_time DESC;

The beauty here is that the type of query which is consuming most of the time will naturally show up on top of the listing. The best way is to work your way down from the first to the, say, 10th query and see what is going on there.
In my judgement there is no way to tune a system without looking at the most time consuming queries on the database server.

Taking a deep look at PostgreSQL performance

pg_stat_statements has a lot more to offer than just the query and the time it has eaten. Here is the structure of the view:

test=# \d pg_stat_statements
          View "public.pg_stat_statements"
       Column        |       Type       | Collation | Nullable | Default
---------------------+------------------+-----------+----------+---------
 userid              | oid              |           |          |
 dbid                | oid              |           |          |
 queryid             | bigint           |           |          |
 query               | text             |           |          |
 calls               | bigint           |           |          |
 total_time          | double precision |           |          |
 min_time            | double precision |           |          |
 max_time            | double precision |           |          |
 mean_time           | double precision |           |          |
 stddev_time         | double precision |           |          |
 rows                | bigint           |           |          |
 shared_blks_hit     | bigint           |           |          |
 shared_blks_read    | bigint           |           |          |
 shared_blks_dirtied | bigint           |           |          |
 shared_blks_written | bigint           |           |          |
 local_blks_hit      | bigint           |           |          |
 local_blks_read     | bigint           |           |          |
 local_blks_dirtied  | bigint           |           |          |
 local_blks_written  | bigint           |           |          |
 temp_blks_read      | bigint          	|           |          |
 temp_blks_written   | bigint           |           |          |
 blk_read_time       | double precision |           |          |
 blk_write_time      | double precision |           |          |

It can be quite useful to take a look at the stddev_time column as well. It will tell you if queries of a certain type tend to have similar runtimes or not. If the standard deviation is high you can expect some of those queries to be fast and some of them to be slow, which might lead to bad user experience.

The “rows” column can also be quite informative. Suppose 1000 calls have returned 1,000,000,000 rows: it actually means that every call has returned 1 million rows on average. It is easy to see that returning so much data all the time is not a good thing to do.

If you want to check whether a certain type of query shows bad caching performance, the shared_* columns will be of interest. In short: PostgreSQL is able to tell you the cache hit rate of every single type of query in case pg_stat_statements has been enabled.
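The hit rate can be computed directly from those columns. Here is a sketch; the nullif call is only there to avoid a division by zero for queries which touched no blocks at all:

SELECT	substring(query, 1, 40) AS short_query,
	shared_blks_hit,
	shared_blks_read,
	round(100.0 * shared_blks_hit /
		nullif(shared_blks_hit + shared_blks_read, 0), 2) AS hit_percent
FROM 	pg_stat_statements
ORDER BY shared_blks_read DESC;

Queries with a low hit_percent and many shared_blks_read are the ones constantly going to disk.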

It also makes sense to take a look at the temp_blks_* fields. Whenever PostgreSQL has to go to disk to sort or to materialize, temporary blocks will be needed.

Finally there are blk_read_time and blk_write_time. Usually those fields are empty unless track_io_timing is turned on. The idea here is to be able to measure the amount of time a certain type of query spends on I/O. It will allow you to answer the question whether your system is I/O bound or CPU bound. In most cases it is a good idea to turn on I/O timing because it will give you vital information.
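One way to turn I/O timing on at runtime (a restart is not needed for this parameter; it can also simply be set in postgresql.conf):

test=# ALTER SYSTEM SET track_io_timing = on;
test=# SELECT pg_reload_conf();

Once it is active, blk_read_time and blk_write_time will start to fill up for newly executed queries.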

Dealing with Java and Hibernate

pg_stat_statements delivers good information. However, in some cases it can cut off the query because of a config variable:

test=# SHOW track_activity_query_size;
 track_activity_query_size 
 1024
(1 row)

For most applications 1024 bytes are absolutely enough. However, this is usually not the case if you are running Hibernate or Java. Hibernate tends to send insanely long queries to the database and thus the SQL code might be cut off long before the relevant parts start (e.g. the FROM-clause etc.). Therefore it makes a lot of sense to increase track_activity_query_size to a higher value (maybe 32768).
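For example, the parameter can be raised like this. Note that track_activity_query_size can only be changed with a server restart, so ALTER SYSTEM alone is not enough:

test=# ALTER SYSTEM SET track_activity_query_size = 32768;

After restarting PostgreSQL, up to 32 kB of every statement will be retained.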

Helpful queries to detect bottlenecks in PostgreSQL

There is one query I have found especially useful in this context: The following query shows 20 statements, which need a lot of time:

test=# SELECT substring(query, 1, 50) AS short_query,
	round(total_time::numeric, 2) AS total_time,
	calls,
	round(mean_time::numeric, 2) AS mean,
	round((100 * total_time / sum(total_time::numeric) OVER ())::numeric, 2) AS percentage_cpu
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;
short_query | total_time | calls | mean | percentage_cpu
SELECT name FROM (SELECT pg_catalog.lower(name) A | 11.85 | 7 | 1.69 | 38.63
DROP SCHEMA IF EXISTS performance_check CASCADE; | 4.49 | 4 | 1.12 | 14.64
CREATE OR REPLACE FUNCTION performance_check.pg_st | 2.23 | 4 | 0.56 | 7.27
SELECT pg_catalog.quote_ident(c.relname) FROM pg_c | 1.78 | 2 | 0.89 | 5.81
SELECT a.attname, +| 1.28 | 1 | 1.28 | 4.18
pg_catalog.format_type(a.attty | | | |
SELECT substring(query, ?, ?) AS short_query,roun | 1.18 | 3 | 0.39 | 3.86
CREATE OR REPLACE FUNCTION performance_check.pg_st | 1.17 | 4 | 0.29 | 3.81
SELECT query FROM pg_stat_activity LIMIT ?; | 1.17 | 2 | 0.59 | 3.82
CREATE SCHEMA performance_check; | 1.01 | 4 | 0.25 | 3.30
SELECT pg_catalog.quote_ident(c.relname) FROM pg_c | 0.92 | 2 | 0.46 | 3.00
SELECT query FROM performance_check.pg_stat_activi | 0.74 | 1 | 0.74 | 2.43
SELECT * FROM pg_stat_statements ORDER BY total_ti | 0.56 | 1 | 0.56 | 1.82
SELECT query FROM pg_stat_statements LIMIT ?; | 0.45 | 4 | 0.11 | 1.45
GRANT EXECUTE ON FUNCTION performance_check.pg_sta | 0.35 | 4 | 0.09 | 1.13
SELECT query FROM performance_check.pg_stat_statem | 0.30 | 1 | 0.30 | 0.96
SELECT query FROM performance_check.pg_stat_activi | 0.22 | 1 | 0.22 | 0.72
GRANT ALL ON SCHEMA performance_check TO schoenig_ | 0.20 | 3 | 0.07 | 0.66
SELECT query FROM performance_check.pg_stat_statem | 0.20 | 1 | 0.20 | 0.67
GRANT EXECUTE ON FUNCTION performance_check.pg_sta | 0.19 | 4 | 0.05 | 0.62
SELECT query FROM performance_check.pg_stat_statem | 0.17 | 1 | 0.17 | 0.56
(20 rows)

The last column is especially noteworthy: It tells us the percentage of total time burned by a single query. It will help you to figure out whether a single statement is relevant to overall performance or not.

Many people keep asking about index scans in PostgreSQL. This blog is meant to be a basic introduction to the topic because many people do not seem to be aware of what the optimizer does when a single query is processed. I decided to give a brief introduction showing how a table can be accessed in some of the most basic cases. Let us get started with PostgreSQL indexing.

Indexes are the backbone of good performance. Without proper indexing your PostgreSQL database might be in dire straits and end users might complain about slow queries and bad response times. It therefore makes sense to see which choices PostgreSQL makes when a single column is queried.

Preparing some demo data in PostgreSQL

To show you how things work we can use a simple table:

test=# CREATE TABLE sampletable (x numeric);

If your table is almost empty, you will never see an index scan because it might be too much overhead to consult an index – it is cheaper to just scan the table directly and throw away the rows which don’t match your query.

So to demonstrate how an index actually works, we can add 10 million random rows to the table we just created before:

test=# INSERT INTO sampletable
         SELECT random() * 10000
         FROM generate_series(1, 10000000);
INSERT 0 10000000

Then an index is created:

test=# CREATE INDEX idx_x ON sampletable(x);

After loading so much data it can be a good idea to create optimizer statistics in case autovacuum has not caught up yet. The PostgreSQL optimizer needs these statistics to decide whether to use an index or not:

test=# ANALYZE;

In PostgreSQL a btree uses Lehman-Yao High-Concurrency btrees (which will be covered in more detail in a later blog).


Selecting a small subset of data in PostgreSQL

When only a small set of rows is selected, PostgreSQL can directly ask the index. In this case it can even use an “Index Only Scan” because all columns needed are actually already in the index:

test=# explain SELECT * FROM sampletable WHERE x = 42353;
                               QUERY PLAN
Index Only Scan using idx_x on sampletable (cost=0.43..8.45 rows=1 width=11)
      Index Cond: (x = '42353'::numeric)
(2 rows)

Selecting only a handful of rows will be super efficient using the index. However, if more data is selected, scanning the index AND the table will be too expensive.

PostgreSQL indexing: Selecting a lot of data from a table in PostgreSQL

However, if you select a LOT of data from a table, PostgreSQL will fall back to a sequential scan. In this case reading the entire table and just filtering out a couple of rows is the most efficient way to do things.

Here is how it works:

test=# explain SELECT * FROM sampletable WHERE x < 42353;
                            QUERY PLAN
Seq Scan on sampletable (cost=0.00..179054.03 rows=9999922 width=11)
    Filter: (x < '42353'::numeric)
(2 rows)

PostgreSQL will filter out those unnecessary rows and just return the rest. This is really the best thing to do in this case. A sequential scan is therefore not always bad – there are use cases where a sequential scan is actually perfect.

Still: Keep in mind that scanning large tables sequentially too often will take its toll at some point.

PostgreSQL: Making use of bitmap scans

If you only select a handful of rows, PostgreSQL will decide on an index scan – if you select a majority of the rows, PostgreSQL will decide to read the table completely. But what if you read too much for an index scan to be efficient but too little for a sequential scan? The solution to the problem is to use a bitmap scan. The idea behind a bitmap scan is that a single block is only used once during a scan. It can also be very helpful if you want to use more than one index to scan a single table.

Here is what happens:

test=# explain SELECT * FROM sampletable WHERE x < 423;
                                    QUERY PLAN
Bitmap Heap Scan on sampletable (cost=9313.62..68396.35 rows=402218 width=11)
       Recheck Cond: (x < '423'::numeric)
       -> Bitmap Index Scan on idx_x (cost=0.00..9213.07 rows=402218 width=0)
          Index Cond: (x < '423'::numeric)
(4 rows)

PostgreSQL will first scan the index and compile a list of those rows / blocks which are needed at the end of the scan. Then PostgreSQL will take this list and fetch those rows from the table. The beauty is that this mechanism even works if you are using more than just one index.

Bitmaps scans are therefore a wonderful contribution to performance.
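To sketch the multi-index case: suppose sampletable had a second numeric column and a second index (both are hypothetical additions for the sake of the example). An OR condition can then be answered by two bitmap index scans whose results are combined before the table is visited:

-- hypothetical second column and index
ALTER TABLE sampletable ADD COLUMN y numeric;
CREATE INDEX idx_y ON sampletable (y);

-- both indexes can be scanned and the bitmaps combined (BitmapOr)
EXPLAIN SELECT * FROM sampletable WHERE x < 423 OR y < 423;

Each heap block is still fetched only once, no matter how many index conditions pointed at it.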

In PostgreSQL it is possible to write stored procedures in almost any widespread language such as Perl, Python or C. In general this offers a great deal of flexibility and acceptable performance. However, in some cases customers might say: “We have the feeling that procedures are slow”. The thing is: PostgreSQL might not be to blame for this observation – in many cases it is simply “pilot error”. Let me try to elaborate on one of the key issues related to bad performance.

How PostgreSQL treats procedures

As mentioned already it is possible to write stored procedures in basically any language. PostgreSQL simply passes the code of a procedure to the external language and takes back the result. In a way a stored procedure is a kind of black box – PostgreSQL rarely knows what is going on inside a procedure.

Here is an example:

test=# CREATE OR REPLACE FUNCTION mymax(int, int)
       RETURNS int AS
$$
	BEGIN
		RETURN CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
	END;
$$ LANGUAGE 'plpgsql';

The results are not very spectacular:

test=# SELECT mymax(20, 30);
 mymax 
    30
(1 row)

The important thing here is: The PL/pgSQL function is a complete black box. The planner has no idea what the “external” language does in this case. This has some important implications. Consider the following example:

test=# CREATE TABLE demo AS
	SELECT 	*
	FROM generate_series(1, 1000000) AS id;
SELECT 1000000
test=# CREATE INDEX idx_id ON demo(id);

The table is already large enough to consider indexes:

test=# explain SELECT * FROM demo WHERE id = 20;
                      QUERY PLAN                               
 Index Only Scan using idx_id on demo  
	(cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 20)
(2 rows)

Functions in PostgreSQL are usually blackboxes

The problem is: The situation changes completely if we start to use the procedure I have just shown:

test=# explain SELECT *
	FROM 	demo
	WHERE 	id = mymax(20, 20);
                       QUERY PLAN                        
 Seq Scan on demo  (cost=0.00..266925.00 rows=1 width=4)
   Filter: (id = mymax(20, 20))
(2 rows)

PostgreSQL has no idea that the function will return 20. It is clear to humans but nobody told the machine that this is guaranteed to happen. To PostgreSQL the result of a function is considered to be “volatile” – anything can happen. Therefore it cannot simply ask the index for the correct row. The first function call might not return the same as the second call – even if the parameters are identical. The optimizer has to play safe and will go for a sequential scan, which is definitely going to produce the correct result.

In PostgreSQL a function can be:

• VOLATILE
• STABLE
• IMMUTABLE

If a function is marked as VOLATILE it can return anything if you call it multiple times using the very same input parameters. In case of STABLE the function is going to return the same result given the same parameters within the same transaction.

The most prominent STABLE function is now(), which will always return the same result within the same transaction:

 test=# SELECT now();
 2018-01-09 11:48:46.385457+01
(1 row)

test=# BEGIN;
test=# SELECT now();
 2018-01-09 11:48:51.073123+01
(1 row)

test=# SELECT now();
 2018-01-09 11:48:51.073123+01
(1 row)

test=# COMMIT;
test=# SELECT now();
 2018-01-09 11:48:59.640697+01
(1 row)

Some functions are even IMMUTABLE: In this case the result given the same input parameters will be constant regardless of the transaction. Cosine would be an example in this case:

test=# SELECT cos(10), cos(20);
        cos         |        cos        
 -0.839071529076452 | 0.408082061813392
(1 row)

test=# SELECT cos(10), cos(20);
        cos         |        cos        
 -0.839071529076452 | 0.408082061813392
(1 row)

Even in the second transaction the cosine of a number will be the same.

Reducing the number of function calls

To fix our problem we have to drop the existing function …

test=# DROP FUNCTION mymax(int, int);

… and recreate it:

test=# CREATE OR REPLACE FUNCTION mymax(int, int)
       RETURNS int AS
$$
	BEGIN
		RETURN CASE WHEN $1 > $2 THEN $1 ELSE $2 END;
	END;
$$ LANGUAGE 'plpgsql' IMMUTABLE;

The planner will detect that the function is guaranteed to return a static value given the same input values and go for an index scan:

test=# explain SELECT *
	FROM 	demo
	WHERE 	id = mymax(20, 20);
                      QUERY PLAN                               
 Index Only Scan using idx_id on demo  
	(cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 20)
(2 rows)

Of course the index scan is orders of magnitude faster and will return in basically no time.

Detecting bottlenecks

Fortunately PostgreSQL has a system view, which might shed some light on functions that might be a problem. The first thing you have to do is to set track_functions = 'all' – it will tell PostgreSQL to collect function statistics:

test=# SET track_functions = 'all';
SET
test=# SELECT * FROM demo WHERE id = mymax(20, 20);
 id 
 20
(1 row)

Once this setting has been turned on, pg_stat_user_functions will contain valuable information:

test=# SELECT * FROM pg_stat_user_functions ;
 funcid | schemaname | funcname | calls | total_time | self_time
  16429 | public     | mymax    |     1 |      0.025 |     0.025
(1 row)

If you happen to see that a function is called insanely often, it can make sense to inspect it and check whether it happens to be VOLATILE for no good reason. Changing the function definition can significantly improve speed.
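Sorting the view makes the hotspots easy to spot. A simple sketch:

-- which functions consume the most time overall?
SELECT	funcname, calls, total_time, self_time
FROM 	pg_stat_user_functions
ORDER BY total_time DESC
LIMIT 10;

self_time excludes time spent in other functions called from within the function, so a large gap between total_time and self_time points at expensive callees.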

Creating indexes on functions in PostgreSQL

If you want to create indexes on a function, you have to make sure that the function itself is in fact IMMUTABLE. Otherwise PostgreSQL won’t create the index for you. The reason is simple: PostgreSQL has to make sure that the content of the index is stable and does not have to be changed over time if the underlying data is unchanged.
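A small sketch of what is and is not allowed (the table t and its column name are hypothetical):

-- lower() is IMMUTABLE, so a functional index is fine
CREATE INDEX idx_lower_name ON t (lower(name));

-- this would be rejected, because random() is VOLATILE:
-- CREATE INDEX idx_bad ON t ((random()));

PostgreSQL will refuse the second statement with an error stating that functions in index expressions must be marked IMMUTABLE.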

We are proud to announce our latest tool released to the PostgreSQL community. We know that many people want a visual tool to configure their PostgreSQL database servers. Therefore we proudly announce Cybertec pgconfigurator, which has been designed to give users a tool to quickly determine the most important config parameters to be used in postgresql.conf.

The tool can be found and used freely at the Cybertec pgconfigurator product page on our website.

What pgconfigurator will do for you

The idea behind Cybertec pgconfigurator is to give people a quick and easy way to generate or to cross-check existing config files (postgresql.conf). We have used our experience to help users tune their database and achieve better performance.

pgconfigurator will determine all vital performance parameters and come up with clever suggestions. Keep in mind that an automatic tool cannot replace a human in all cases and it still makes sense to hire a professional in case of doubt. However, pgconfigurator can help you to get started and to cleverly configure your servers.

Our configuration tool will visually determine the following types of parameters for you:

  • PostgreSQL memory settings (shared_buffers, work_mem, etc.)
  • PostgreSQL replication settings (max_wal_senders, etc.)
  • PostgreSQL parallel query parameters
  • PostgreSQL I/O and checkpointing related settings
  • Preloaded libraries

Just download the ready-made config and add your personal parameters (for logging, etc.).

Feedback is welcome

We hope that our first release of pgconfigurator is helpful to you. Of course feedback is always welcome and we are eager to hear from you. Tell us what you want us to improve and tell us about settings you don’t agree with. We will do our best to collect all those recommendations and make pgconfigurator even better.

We are planning to maintain pgconfigurator long term and add support for PostgreSQL 11 and higher in the future.

Also: Share the news! Tell your friends, workmates and everybody you know about Cybertec pgconfigurator.


You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic, which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me how much “coding” people actually use to prepare the data. In many cases the input data is “just a matrix” and people spend a lot of time (and sometimes memory) to prepare it.

The question, which naturally arises in my case is: Why not prepare the data in SQL? It is easier, less work and a lot more flexible.

Creating some sample data

To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:

test=# CREATE TABLE inputdata (
	id 		int,
	data 	numeric 	DEFAULT random() * 1000
);

Then the table is populated:

test=# INSERT INTO inputdata
	SELECT * FROM generate_series(1, 20);

What we got now is a table containing 20 random values. In reality you will already have some existing data, which you will use to work with:

test=# SELECT * FROM inputdata;
 id |       data       
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)

Thoughts on sampling, training, and verification

If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:

• Training data for your AI model
• Test data for your AI model

The training data is used to teach your model. The test data is then used to check if your model performs well. Splitting the data is important and somewhat key to success.

Table sampling in PostgreSQL

Since version 9.5 we already have sampling in PostgreSQL core. Here is an example:

test=# SELECT *
	FROM 	inputdata TABLESAMPLE BERNOULLI (50)
				REPEATABLE (87);
 id |       data       
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)

You have to add the TABLESAMPLE clause to the table and specify how it is supposed to work. In this example I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).
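Besides Bernoulli sampling, PostgreSQL core also ships a SYSTEM sampling method. A sketch of the alternative:

-- SYSTEM sampling picks whole blocks rather than individual rows;
-- it is cheaper than BERNOULLI but can be less random if the data
-- is physically clustered on disk
SELECT	*
FROM 	inputdata TABLESAMPLE SYSTEM (50) REPEATABLE (87);

For machine learning purposes BERNOULLI is usually the safer choice, because a block-level sample can carry the physical ordering of the table into your training set.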

Machine learning: Lots and lots of data …

So far everything sounds nice. However, there is a bit of a downside to this approach. A state of the art model needs A LOT of input data to perform well. We are talking about many million rows. At some point we will be facing a query as shown below (written in pseudo code):

SELECT	*
FROM 	inputdata
WHERE	id NOT IN (SELECT id FROM test_data_set);

There are two problems here:

• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive

So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.

Preparing data for machine learning in PostgreSQL

As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:

test=# CREATE VIEW preparation_step1 AS
	SELECT 	*, abs(hashtext(id::text) % 100)
	FROM 	inputdata ;

The first idea here is to use the hashtext function to turn the input data into a hash. The nice thing is that hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” will create 100 slices of data (1% each). Be aware of the fact that hashtext can return negative values; the “abs” function turns those into positive values.

test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal

When looking at the view, the result will be pretty useful already:

test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)

We can now simply filter out data. An example: “abs < 50” might be training data and all the rest might be used to validate and check our models.
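The split can be made permanent with two more views on top. The view names and the 50/50 split are just illustrative choices:

-- slices 0-49 train the model, slices 50-99 validate it
CREATE VIEW training_data AS
	SELECT	id, data
	FROM 	preparation_step1
	WHERE 	abs < 50;

CREATE VIEW validation_data AS
	SELECT	id, data
	FROM 	preparation_step1
	WHERE 	abs >= 50;

Because hashtext is deterministic, both views will always contain exactly the same rows as long as the underlying data does not change.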

The approach here is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small. In this case it might be better to use the sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.). Why is that the case? If you are using a small data set (e.g. only a couple hundred rows) you run the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original data set – so the same percentage of men and women. The solution to this problem is called “stratification” and is supported by libraries such as sklearn and others. In my simple SQL example I am working under the assumption that the amount of data fed to the model is really large and therefore stratification is not an issue.