Preparing data for machine learning in PostgreSQL

01.2018 / Category: / Tags: |

You have probably noticed that everyone is talking about Artificial Intelligence and Machine Learning these days. Quite rightly, because it is a very important topic, which is going to shape our future for sure. However, when looking at most of the code related to machine learning available on the net, it strikes me, how much “coding” people actually use to prepare the data. In many cases the input data is “just a matrix” and people spent a lot of time (and sometimes memory) to prepare it.

The question, which naturally arises in my case is: Why not prepare the data in SQL? It is easier, less work and a lot more flexible.

Creating some sample data

To demonstrate what we can do on the PostgreSQL side, I will create a simple table containing just 20 random values. Let us define a table first:

What we have now is a table containing 20 random values.

In reality you will already have some existing data, which you will use to work with:

Thoughts on sampling, training, and verification

If you are training an AI model (maybe a Support Vector Machine / SVM, a neural network, or whatever) you will always start by splitting the data into various parts:

• Training data for your AI model
• Test data for your AI model

The training data is used to teach your model. The test data is then used to check, if your model performs well. Splitting the data is important and somewhat the key to success.

Table sampling in PostgreSQL

Since version 9.5, we have sampling in PostgreSQL core. Here is an example:

You have to add the TABLESAMPLE clause to the table and specify how it is supposed to work.

In this example, I decided to use the Bernoulli table sampling method (TSM) and told it to be repeatable. In case of machine learning it really makes sense to use the REPEATABLE clause as we want to make sure that we can train our model over and over again given the same input data. We can run the code as often as we want and PostgreSQL will always return the same sample (assuming of course that the underlying data does not change).

Machine learning: Lots and lots of data …

So far everything sounds nice. However, there is a bit of a downside to this approach. A state-of-the-art model needs A LOT of input data to perform well. We are talking about many million rows. At some point we will be facing a query as shown below (written in pseudo code):

There are two problems here:

• We have to store the test data set somewhere, which needs a lot of space
• The large NOT IN statement is fairly expensive

So maybe there is some other way to do that? What I came up with is a pretty simple approach, which does the job.

Preparing data for machine learning in PostgreSQL

As there is no “NOT IN TABLESAMPLE”-clause in PostgreSQL and as we do not want to duplicate our data, the idea is to use a view, which can be used to extract the sample:

The first idea is to use the hashtext function to turn the input data into a hash. The hashtext returns evenly distributed numbers, which is exactly what we want here. Adding “modulo 100” will create 100 slices of data (1% each). Be aware of the fact that hashtext function can return negative values. The “abs” function will turn the value into positive values.

When looking at the view, the result will be pretty useful already:

Now we can filter out data. An example: “abs < 50” might be training data and all the rest might be used to validate and check our models. The approach here is basically fine if your dataset is really large (xxx-million rows or so). It might not be ideal if your dataset is too small. In this case, it would be better to use sampling mechanisms provided by your favorite library (TensorFlow, sklearn, etc.). Why is that the case? If you are using a small data set (e.g. only a couple hundred of rows) you are running the risk of creating a biased sample. What does that mean? Suppose you have a dataset containing information about men and women. You want the sample to show the same distribution as the original data set - so the same percentage of men and women. The solution to the problem is called “stratification” and is supported by libraries such as sklearn and others. In my simple SQL example, I work under the assumption that the amount of data fed to the model is really large and therefore stratification is not an issue.

0 0 votes
Article Rating
Notify of
Inline Feedbacks
View all comments
CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf

+43 (0) 2622 93022-0

Get the newest PostgreSQL Info & Tools

    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    CYBERTEC PostgreSQL International GmbH
    Would love your thoughts, please comment.x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram