Preparing data for machine learning in PostgreSQL

You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic, which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me, how much “coding” people actually use to prepare the data. In many cases the input data is “just a matrix” and people spent a lot of time (and sometimes memory) to prepare it.

Table of Contents

The question, which naturally arises in my case is: Why not prepare the data in SQL? It is easier, less work and a lot more flexible.

Creating some sample data

To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:

test=# CREATE TABLE inputdata
(
	id          int,
	data        numeric 	DEFAULT random() * 1000
);
CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata
	SELECT * FROM generate_series(1, 20);
INSERT 0 20

test=# CREATE TABLE inputdata

(

id int,

data numeric DEFAULT random() * 1000

);

CREATE TABLE

The table is populated:

test=# INSERT INTO inputdata

SELECT * FROM generate_series(1, 20);

INSERT 0 20

What we have now is a table containing 20 random values.

In reality you will already have some existing data, which you will use to work with:

test=# SELECT * FROM inputdata;
 id |       data       
----+------------------
  1 |  542.76927607134
  2 | 813.954454381019
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
  6 | 326.086463406682
  7 | 24.8975520953536
  8 | 266.512574627995
  9 | 86.0621216706932
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 13 | 975.230060052127
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 17 | 208.689162041992
 18 | 529.119417071342
 19 | 260.399237740785
 20 | 563.285110052675
(20 rows)

test=# SELECT * FROM inputdata;

id | data

----+------------------

1 | 542.76927607134

2 | 813.954454381019

3 | 215.18046176061

4 | 989.989245776087

5 | 142.890753224492

6 | 326.086463406682

7 | 24.8975520953536

8 | 266.512574627995

9 | 86.0621216706932

10 | 801.756543107331

11 | 790.149183012545

12 | 317.997705657035

13 | 975.230060052127

14 | 385.490739252418

15 | 746.592517476529

16 | 621.084009762853

17 | 208.689162041992

18 | 529.119417071342

19 | 260.399237740785

20 | 563.285110052675

(20 rows)

Thoughts on sampling, training, and verification

If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:

• Training data for your AI model
• Test data for your AI model

The training data is used to teach your model. The test data is then used to check, if your model performs well. Splitting the data is important and somewhat the key to success.

Table sampling in PostgreSQL

Since version 9.5, we have sampling in PostgreSQL core. Here is an example:

test=# SELECT *
	FROM inputdata TABLESAMPLE BERNOULLI (50)
				REPEATABLE (87);
 id |       data       
----+------------------
  3 |  215.18046176061
  4 | 989.989245776087
  5 | 142.890753224492
 10 | 801.756543107331
 11 | 790.149183012545
 12 | 317.997705657035
 14 | 385.490739252418
 15 | 746.592517476529
 16 | 621.084009762853
 19 | 260.399237740785
 20 | 563.285110052675
(11 rows)

test=# SELECT *

FROM inputdata TABLESAMPLE BERNOULLI (50)

REPEATABLE (87);

id | data

----+------------------

3 | 215.18046176061

4 | 989.989245776087

5 | 142.890753224492

10 | 801.756543107331

11 | 790.149183012545

12 | 317.997705657035

14 | 385.490739252418

15 | 746.592517476529

16 | 621.084009762853

19 | 260.399237740785

20 | 563.285110052675

(11 rows)

You have to add the TABLESAMPLE clause to the table and specify how it is supposed to work.

In this example, I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).

Machine learning: Lots and lots of data …

So far everything sounds nice. However, there is a bit of a downside to this approach. A state-of-the-art model needs A LOT of input data to perform well. We are talking about many million rows. At some point we will be facing a query as shown below (written in pseudo code):

SELECT	*
FROM 	inputdata
WHERE	id NOT IN (SELECT id FROM test_data_set);

SELECT *

FROM inputdata

WHERE id NOT IN (SELECT id FROM test_data_set);

There are two problems here:

• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive

So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.

Preparing data for machine learning in PostgreSQL

As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:

test=# CREATE VIEW preparation_step1 AS
	SELECT 	*, abs(hashtext(id::text) % 100)
	FROM 	inputdata ;
CREATE VIEW

test=# CREATE VIEW preparation_step1 AS

SELECT *, abs(hashtext(id::text) % 100)

FROM inputdata ;

CREATE VIEW

The first idea is to use the hashtext function to turn the input data into a hash. The hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” will create 100 slices of data (1% each). Be aware of the fact that hashtext function can return negative values. The “abs” function will turn the value into positive values.

test=# x
Expanded display is on.
test=# df *hashtext*
List of functions
-[ RECORD 1 ]-------+-----------
Schema              | pg_catalog
Name                | hashtext
Result data type    | integer
Argument data types | text
Type                | normal

test=# x

Expanded display is on.

test=# df *hashtext*

List of functions

-[ RECORD 1 ]-------+-----------

Schema | pg_catalog

Name | hashtext

Result data type | integer

Argument data types | text

Type | normal

When looking at the view, the result will be pretty useful already:

test=# SELECT * FROM preparation_step1 LIMIT 10;
 id |       data       | abs
----+------------------+-----
  1 |  542.76927607134 |  47
  2 | 813.954454381019 |  26
  3 |  215.18046176061 |   4
  4 | 989.989245776087 |  92
  5 | 142.890753224492 |  58
  6 | 326.086463406682 |  12
  7 | 24.8975520953536 |  95
  8 | 266.512574627995 |  88
  9 | 86.0621216706932 |  36
 10 | 801.756543107331 |  81
(10 rows)

test=# SELECT * FROM preparation_step1 LIMIT 10;

id | data | abs

----+------------------+-----

1 | 542.76927607134 | 47

2 | 813.954454381019 | 26

3 | 215.18046176061 | 4

4 | 989.989245776087 | 92

5 | 142.890753224492 | 58

6 | 326.086463406682 | 12

7 | 24.8975520953536 | 95

8 | 266.512574627995 | 88

9 | 86.0621216706932 | 36

10 | 801.756543107331 | 81

(10 rows)

Now we can filter out data. An example: “abs < 50” might be training data and all the rest might be used to validate and check our models. The approach here is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small. In this case, it would be better to use sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.). Why is that the case? If you are using a small data set (e.g. only a couple hundred of rows) you are running the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original data set - so the same percentage of men and women. The solution to the problem is called “stratification” and is supported by libraries such as sklearn and others. In my simple SQL example, I work under the assumption that the amount of data fed to the model is really large and therefore stratification is not an issue.

Preparing data for machine learning in PostgreSQL

Creating some sample data

In reality you will already have some existing data, which you will use to work with:

Thoughts on sampling, training, and verification

Table sampling in PostgreSQL

Machine learning: Lots and lots of data …

There are two problems here:

Preparing data for machine learning in PostgreSQL

Leave a Reply Cancel reply

Hans-Jürgen Schönig

Blog Tags

NEWSLETTER

Articles by our PostgreSQL Experts