You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic, which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me, how much “coding” people actually use to prepare the data. In many cases the input data is “just a matrix” and people spent a lot of time (and sometimes memory) to prepare it.

The question, which naturally arises in my case is: Why not prepare the data in SQL? It is easier, less work and a lot more flexible.

## Creating some sample data

To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:

test=# CREATE TABLE inputdata
(
id 		int,
data 	numeric 	DEFAULT random() * 1000
);
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata
SELECT * FROM generate_series(1, 20);
INSERT 0 20

What we got now is a table containing 20 random values. In reality you will already have some existing data, which you will use to work with:

test=# SELECT * FROM inputdata;
id |       data
----+------------------
1 |  542.76927607134
2 | 813.954454381019
3 |  215.18046176061
4 | 989.989245776087
5 | 142.890753224492
6 | 326.086463406682
7 | 24.8975520953536
8 | 266.512574627995
9 | 86.0621216706932
10 | 801.756543107331
11 | 790.149183012545
12 | 317.997705657035
13 | 975.230060052127
14 | 385.490739252418
15 | 746.592517476529
16 | 621.084009762853
17 | 208.689162041992
18 | 529.119417071342
19 | 260.399237740785
20 | 563.285110052675
(20 rows)

## Thoughts on sampling, training, and verification

If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:

• Training data for your AI model
• Test data for your AI model

The training data is used to teach your model. The test data is then used to check, if your model performs well. Splitting the data is important and somewhat the key to success.

## Table sampling in PostgreSQL

Since version 9.5 we already have sampling in PostgreSQL core. Here is an example:

test=# SELECT *
FROM inputdata TABLESAMPLE BERNOULLI (50)
REPEATABLE (87);
id |       data
----+------------------
3 |  215.18046176061
4 | 989.989245776087
5 | 142.890753224492
10 | 801.756543107331
11 | 790.149183012545
12 | 317.997705657035
14 | 385.490739252418
15 | 746.592517476529
16 | 621.084009762853
19 | 260.399237740785
20 | 563.285110052675
(11 rows)

You have to add the TABLESAMPLE clause to the table and specify how it is supposed to work. In this example I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).

## Machine learning: Lots and lots of data …

So far everything sounds nice. However, there is a bit of a downside to this approach. A state-of-the-art model needs A LOT of input data to perform well. We are talking about many million rows. At some point we will be facing a query as shown below (written in pseudo code):

SELECT	*
FROM 	inputdata
WHERE	id NOT IN (SELECT id FROM test_data_set);

There are two problems here:

• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive

So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.

## Preparing data for machine learning in PostgreSQL

As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:

test=# CREATE VIEW preparation_step1 AS
SELECT 	*, abs(hashtext(id::text) % 100)
FROM 	inputdata ;
CREATE VIEW

The first idea is to use the hashtext function to turn the input data into a hash. The hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” will create 100 slices of data (1% each). Be aware of the fact that hashtext function can return negative values. The “abs” function will turn the value into positive values.

test=# \x
Expanded display is on.
test=# \df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal

When looking at the view, the result will be pretty useful already:

test=# SELECT * FROM preparation_step1 LIMIT 10;
id |       data       | abs
----+------------------+-----
1 |  542.76927607134 |  47
2 | 813.954454381019 |  26
3 |  215.18046176061 |   4
4 | 989.989245776087 |  92
5 | 142.890753224492 |  58
6 | 326.086463406682 |  12
7 | 24.8975520953536 |  95
8 | 266.512574627995 |  88
9 | 86.0621216706932 |  36
10 | 801.756543107331 |  81
(10 rows)

Now we can filter out data. An example: “abs < 50” might be training data and all the rest might be used to validate and check our models.

The approach here is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small. In this case, it would be better to use sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.). Why is that the case? If you are using a small data set (e.g. only a couple hundred of rows) you are running the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original data set – so the same percentage of men and women. The solution to the problem is called “stratification” and is supported by libraries such as sklearn and others. In my simple SQL example I am working under the assumption that the amount of data fed to the model is really large and therefore stratification is not an issue.