Machine Learning in 10 Lines

Everyone who reads newspapers, magazines or any other general-interest media has at least a basic idea of what Machine Learning is. And this is not just a fashion: Machine Learning is already part of our everyday life and will be even more so in the future. From personalized advertisement on the Internet to robot dentists or autonomous cars, Machine Learning seems to be some kind of superpower capable of everything.

 

But what is Machine Learning really? It is mainly a set of statistical algorithms that, based on existing data, are capable of deriving insights from it. These algorithms fall into two broad families: supervised and unsupervised learning. In supervised learning, the objective is to perform some kind of prediction: for example, whether an e-mail message is spam or not (classification), or how many beers will be sold next week in a supermarket (regression). Unsupervised learning, on the contrary, focuses on answering the question: how are my cases divided into groups? What these algorithms do (each with its own particularities) is bring similar items as close together as possible and keep items that differ from each other as far apart as possible.

The popularisation of Machine Learning has revolutionized the way we do business. Regardless of whether your company has 10 or 10,000 employees, if you do not use your data to make decisions, you are definitely running behind your competitors.

 

Machine Learning without leaving the Database

Relational databases are the most essential tool when it comes to data persistence. Although there are alternatives that may be suitable for certain purposes, there is probably no company with even a minimal IT infrastructure that doesn't have a database.

So if every company has a database, it contains data worth using. This means that every company has the opportunity to improve its decision-making process with minimal effort through the use of machine learning. The drawback, however, is that with most DBMSs you need to perform your Machine Learning processing outside the database. This is not the case with PostgreSQL.

Since PostgreSQL offers extensions for multiple languages, you can train and use your Machine Learning algorithms without leaving PostgreSQL.

Let’s take a look at how to run k-means, one of the most popular unsupervised learning algorithms, directly within PostgreSQL using PL/Python.

 

KMeans in PostgreSQL Step by Step

Step 1: Load sample Data

For this example, we are going to use the iris dataset, which is publicly available. To do this, we first have to download the data from this website to our local machine.

After that, you will need to create the iris table in your Database:


CREATE TABLE iris(
    sepal_length REAL,
    sepal_width REAL,
    petal_length REAL,
    petal_width REAL,
    species VARCHAR(20)
);

Once the table is created, we can populate it with the data we just downloaded. Before running the following command, please delete the last empty line of iris.data:

COPY iris FROM '/path/to/iris.data' DELIMITER ',';
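The trailing empty line mentioned above can also be removed programmatically instead of by hand. A minimal sketch in Python (the path is the same placeholder as in the COPY command):

```python
def strip_trailing_blank_lines(path):
    """Remove empty or whitespace-only lines from the end of a file, in place."""
    with open(path) as f:
        lines = f.readlines()
    # Drop blank lines from the end of the file only.
    while lines and not lines[-1].strip():
        lines.pop()
    with open(path, "w") as f:
        f.writelines(lines)

# usage: strip_trailing_blank_lines('/path/to/iris.data')
```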

Now that we have the data we are going to use, let’s jump to the k-means core function.

 

Step 2: Installation of dependencies

Before creating our function, let’s install the requirements:

  1. Python: Follow the instructions here that correspond to the OS of the machine where PostgreSQL is installed. Please note that there are currently two main versions of Python, 2.7 and 3.5. The Python community is transitioning from one to the other, so you will need to decide which one to stick to. The Python code presented here runs on both (on Python 3, import pickle instead of cPickle), but if you are planning to use PL/Python further in the future, you should take this into account. You can also install both side by side; the only limitation is that both cannot be run during the same session. For more information visit this link
  2. PL/Python: PL/Python is the extension that allows you to run Python code without leaving Postgres. Starting at least from PostgreSQL 9.5, there are two PL/Python extensions available: one for Python 2 (plpythonu) and one for Python 3 (plpython3u). Again, you can have both installed at the same time.
  3. Once you have them installed, you need to create the extension. To do so, connect to the database and type

CREATE EXTENSION plpythonu;

and/or

CREATE EXTENSION plpython3u;
  4. Install additional Python libraries: use your favorite Python package manager (pip, conda, etc.) to install scikit-learn and pandas. By the way, both packages are a must if you want to start digging into Machine Learning with Python.

 

Step 3: Kmeans in PostgreSQL in a nutshell

Functions written in PL/Python can be called like any other SQL function. Since Python has endless libraries for Machine Learning, the integration is very simple. Moreover, apart from giving full support to Python, PL/Python also provides a set of convenience functions to run any parametrized query. So, executing Machine Learning algorithms can be a matter of a couple of lines. Let’s take a look:

CREATE OR REPLACE FUNCTION kmeans(input_table text, columns text[], clus_num int) RETURNS bytea AS
$$
from pandas import DataFrame
from sklearn.cluster import KMeans
from cPickle import dumps

# Build the column list, falling back to * when no columns are given.
all_columns = ",".join(columns)
if all_columns == "":
    all_columns = "*"

rv = plpy.execute('SELECT %s FROM %s;' % (all_columns, plpy.quote_ident(input_table)))

# Load the result rows into a pandas DataFrame.
frame = []
for i in rv:
    frame.append(i)
df = DataFrame(frame).convert_objects(convert_numeric=True)

# Fit k-means on the numeric columns only and return the pickled model.
kmeans = KMeans(n_clusters=clus_num, random_state=0).fit(df._get_numeric_data())
return dumps(kmeans)
$$ LANGUAGE plpythonu;

As you can see, the script is very simple. First, we import the functions we need; then we join the passed columns into a single string (or replace it with * if an empty array is passed); finally, we build and run the query using PL/Python’s execute function. Although it is out of the scope of this article, I strongly recommend reading about how to parametrize queries in PL/Python.

Once the query is executed, we need to convert the result into a data frame and cast the numeric variables to a numeric type (they may be interpreted as something else by default). Then we call k-means, where the passed clus_num parameter is used as the number of clusters to obtain. Finally, we dump the fitted model with cPickle and return the pickled object. Pickling is necessary to restore the model later; otherwise, Python would not be able to rebuild the kmeans object directly from a byte array coming from PostgreSQL.
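This pickle round trip is the trick that lets the model travel through a bytea column and come back as a live Python object. A minimal sketch using only the standard library, with a plain dict standing in for the fitted kmeans object (on Python 2, the article’s cPickle behaves the same):

```python
import pickle

# Any picklable object can play the role of the fitted model.
model = {"n_clusters": 3, "centroids": [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]}

# dumps() turns the object into bytes -- exactly what a bytea column stores.
blob = pickle.dumps(model)

# loads() rebuilds an equivalent object from those bytes.
restored = pickle.loads(blob)
print(restored == model)  # True
```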

The final line specifies the extension language: in this case, we are using Python 2 and, for that reason, the language is called plpythonu. If you would like to run it on Python 3, you should use the language named plpython3u instead.

 

Step 4: Storing the Model

It doesn’t make much sense to create a model and not do anything with it. So, we will need to store it.

To do so, let’s create a models table first:

CREATE TABLE models (
    id SERIAL PRIMARY KEY,
    model BYTEA NOT NULL
);

In this case, our table has just a primary key and a byte array field that holds the serialized model. Note that it has the same data type as the one returned by the kmeans function we defined.

Once we have the table, we can easily insert a new record with the model:

INSERT INTO models(model) SELECT kmeans('iris', array[]::text[], 3);

In this case, we pass the columns parameter as an empty array in order to cluster on all the numeric variables in the table. Please consider that this is just an example: in a production case you may want to add, say, some extra fields that make it easier to identify the different models.

 

Step 5: Displaying Model Info

So far, we have been able to create a model and store it, but retrieving it directly from the database isn’t very useful. You can check this by running:

select * from models;

For that reason, we will need to get back to Python to display useful information about our model. This is the function we are going to use:

CREATE OR REPLACE FUNCTION get_kmeans_centroids(model_table text, model_column text, model_id int) RETURNS real[] AS
$$
from cPickle import loads

rv = plpy.execute('SELECT %s FROM %s WHERE id = %s;' % (plpy.quote_ident(model_column), plpy.quote_ident(model_table), model_id))

# Rebuild the kmeans object from the stored byte array.
model = loads(rv[0][model_column])

# cluster_centers_ is a numpy array; convert it to a list of lists
# so PL/Python can return it as real[].
ret = map(list, model.cluster_centers_)
return ret
$$ LANGUAGE plpythonu;

Let’s start from the beginning: we pass, again, the table containing the models, the column that holds the binary, and the id of the model we want. The query result is read by cPickle’s loads function (here you can see how results from a PL/Python query are loaded into Python).

Once the model is loaded, we can use the kmeans attribute cluster_centers_, which is where the centroids are stored. Centroids are the mean vectors of each group, i.e., the mean of each variable within each group. Natively, they are stored as a numpy array, but since PL/Python cannot handle numpy arrays, we need to convert them to a list of lists. That is why we list every row, producing a list of lists where each sub-list represents a group’s centroid.
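The conversion itself is a one-liner. Sketched here with plain tuples standing in for the rows of the numpy array (numpy is left out so the snippet stays self-contained; the values are made up):

```python
# Each tuple plays the role of one row of model.cluster_centers_.
centers = [(4.39, 1.43, 5.90, 2.75), (1.46, 0.24, 5.01, 3.42)]

# map(list, ...) turns every row into a plain Python list,
# which PL/Python can hand back to PostgreSQL as real[].
as_lists = list(map(list, centers))
print(as_lists)  # [[4.39, 1.43, 5.9, 2.75], [1.46, 0.24, 5.01, 3.42]]
```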

This is just an example of how to output a certain characteristic of a model. You can create similar functions to return other characteristics or even all together.

Let’s take a look at what it returns:

hernan=> select get_kmeans_centroids('models','model',1);
                                     get_kmeans_centroids
--------------------------------------------------------------------------------------------
 {{4.39355,1.43387,5.90161,2.74839},{1.464,0.244,5.006,3.418},{5.74211,2.07105,6.85,3.07368}}
(1 row)

Each element enclosed by braces represents a group, and its values are that group’s vector of means.
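To make “vector of means” concrete, here is how a single centroid could be computed by hand for one small group of cases (pure Python; the values are made up for illustration):

```python
# Three cases belonging to the same group, four variables each
# (sepal_length, sepal_width, petal_length, petal_width).
group = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
]

# The centroid is the per-variable mean across the group's cases.
centroid = [sum(col) / len(group) for col in zip(*group)]
print(centroid)  # roughly [4.9, 3.23, 1.37, 0.2]
```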

 

Step 6: Making Predictions

Now that we have a model, let’s use it to make predictions! In k-means, this means passing an array of values (one per variable) and getting back the number of the group it belongs to. The function is very similar to the previous one:

CREATE OR REPLACE FUNCTION predict_kmeans(model_table text, model_column text, model_id int, input_values real[]) RETURNS int[] AS
$$
from cPickle import loads

rv = plpy.execute('SELECT %s FROM %s WHERE id = %s;' % (plpy.quote_ident(model_column), plpy.quote_ident(model_table), model_id))

# Rebuild the kmeans object and predict the group of the given case.
model = loads(rv[0][model_column])

# predict() returns a numpy array; convert it to a plain list of ints
# before handing it back to PostgreSQL.
ret = map(int, model.predict(input_values))
return ret
$$ LANGUAGE plpythonu;

Compared to the previous function, we add one input parameter (input_values): the values describing a case (one value per variable) for which we want to obtain the assigned group.

Instead of an array of floats, this function returns an array of integers, because we are talking about a group index:

hernan=> select predict_kmeans('models','model',1,array[[0.5,0.5,0.5,0.5]]);
 predict_kmeans 
----------------
 {1}
(1 row)

Please notice that you need to pass an array of arrays, even if you are predicting only one case. This has to do with how scikit-learn expects its input: a two-dimensional structure with one row per case.
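Under the hood, predicting with k-means just means finding the nearest centroid. A minimal pure-Python sketch of what model.predict does (the centroids are made up; note the input is a list of cases, matching the array-of-arrays shape above):

```python
def predict(centroids, cases):
    """Assign each case the index of its nearest centroid
    (by squared Euclidean distance)."""
    labels = []
    for case in cases:
        dists = [sum((x - c) ** 2 for x, c in zip(case, centroid))
                 for centroid in centroids]
        labels.append(dists.index(min(dists)))
    return labels

centroids = [[4.4, 1.4, 5.9, 2.7], [1.5, 0.2, 5.0, 3.4], [5.7, 2.1, 6.9, 3.1]]
print(predict(centroids, [[0.5, 0.5, 0.5, 0.5]]))  # [1]
```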

We can also pass column names to the function, for example:

select species, predict_kmeans('models','model',1,array[[sepal_length,sepal_width,petal_length,petal_width]]) from iris;

As you can see, the assigned group is strongly correlated with the species of each case.

 

Conclusion

We have seen in this article that you can train and use machine learning models without leaving PostgreSQL. However, you do need some Python knowledge to prepare everything. Still, this can be a very good way to provide a complete machine learning toolkit inside PostgreSQL for those who may not know how to do it in Python or any other language.