Machine Learning, Deep Learning, Artificial Intelligence, and Data Science have become important topics these days. Everybody is looking into these technologies and trying to make use of recent developments. PostgreSQL can help to achieve those goals, because it offers reliable backend storage for your data.
PostgreSQL vs. CSV files vs. commercial databases
In the past we have seen that many people in the Data Science and AI community still use CSV files to process data. While there is a lot of tooling available (e.g. the Python pandas library), it might not be the best of all decisions to build your models on CSV files.
So, what is wrong with a CSV file? Well, actually nothing as long as …
- the format of your file is correct
- the amount of data is reasonable
- ALL the data in the file is needed
- nothing changes
All those points might seem obvious, but my practical experience shows that they are not, and that these issues are seriously underappreciated. What do I mean by that?
Recently we saw a customer who wanted to store interest rates for some financial models. What did we find in the data? Let me quote: “3 percent” (as a string). Now: how do you expect your AI models to work if the data is that broken? To do data mining you need high-quality data – not some crap. Which leads to an important observation:
“If you are doing Artificial Intelligence, 80% of your effort has to go into data preparation – only 20% is the actual model”
In short: you cannot seriously expect a model to learn anything about “interest rates” if the raw data does not even ensure that the interest rate is actually a number.
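To illustrate what that data-preparation effort looks like in practice, here is a minimal sketch of cleaning up such a column before it goes anywhere near a model. The `parse_rate` helper and the example strings are made up for illustration; they mirror the “3 percent” case from above.

```python
import re

def parse_rate(raw):
    """Parse messy interest-rate strings such as '3 percent' or '3%'
    into a plain float. Returns None if the value is unusable."""
    if raw is None:
        return None
    text = raw.strip().lower()
    # Strip textual noise such as 'percent' or '%' before converting
    match = re.match(r"^([0-9]+(?:\.[0-9]+)?)\s*(%|percent)?$", text)
    if match is None:
        return None
    value = float(match.group(1))
    # '3 percent' and '3%' mean 0.03; a bare '0.03' is taken as-is
    if match.group(2):
        value /= 100.0
    return value

print(parse_rate("3 percent"))  # 0.03
print(parse_rate("0.03"))       # 0.03
print(parse_rate("lots"))       # None
```

Of course, the better fix is to make the database enforce the type in the first place: declare the column as `numeric` and the string “3 percent” can never be inserted at all.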
The second remark is about the amount of data: suppose you have a terabyte of data and you want to process or maybe sort it. What do you think – who is better at sorting? A database, which has been optimized to sort data for 30 years, or some handwritten Python program you wrote over a weekend? From experience I can tell you: you have no chance. The database is going to beat you at filtering, sorting, joining and a lot more in 99% of all cases. So why not just use SQL? It can do all those things for you without thousands of lines of code. You want to use a moving average to enrich your data for training? No problem. In SQL a moving average is a one-liner. Now think about it: how many lines of code does such a simple thing need in Java?
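To back up the one-liner claim, here is a minimal sketch. It uses Python's built-in `sqlite3` module only so the example runs without a server; the `AVG(...) OVER (...)` window-function syntax is standard SQL and works the same way in PostgreSQL. Table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (day INTEGER, value REAL)")
conn.executemany("INSERT INTO measurements VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)])

# The moving average itself: one SELECT, no loops, no helper code.
# A 3-row window: the current row and the two rows before it.
rows = conn.execute("""
    SELECT day,
           AVG(value) OVER (ORDER BY day
                            ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
    FROM measurements
""").fetchall()

for day, avg in rows:
    print(day, avg)
```

The window clause (`ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`) is where you would otherwise write the buffering and averaging logic by hand.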
The third remark is about filtering. A database such as PostgreSQL or Oracle has been built to filter data. That is the purpose of a database: filtering, aggregation, transformation. So why does everybody want to reimplement those things in Python? The real advantage of a database becomes obvious once your training data and your models change. An SQL statement can easily be changed to add one more group to your data; it can easily be adapted to cut out or normalize some fields. If you have to change a handwritten C++ application to normalize one more field, that translates to quite some effort. We have seen a client burn the stunning amount of 24 million euros to implement a “risk engine” which could easily have been done for 1 million euros with PostgreSQL and a proper toolchain. Writing stuff from scratch in C++ means reinventing the wheel, which seriously delays your time to market. Your competitors are racing ahead while your engineers are busy fixing segfaults in your filtering and sampling code.
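The “add one more group” point can be sketched in a couple of lines. Again `sqlite3` stands in for PostgreSQL so the example is self-contained; the `sales` table and its columns are invented for illustration. Note that the second query differs from the first only in the `GROUP BY` list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("EU", "gadget", 100.0),
    ("EU", "widget",  50.0),
    ("US", "gadget", 200.0),
])

# Aggregation per region:
by_region = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()

# Adding one more dimension is a one-line change to the statement --
# no application code has to be rewritten:
by_region_product = conn.execute(
    "SELECT region, product, SUM(amount) "
    "FROM sales GROUP BY region, product"
).fetchall()
```

In a compiled, hand-rolled aggregation pipeline, the same change typically means new data structures, new code paths, and a redeployment.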
Maybe you can see why a database makes sense. Once you have made up your mind: do you really want to spend a million euros on licenses? Why not spend the money on engineers producing things of value? If you decide to use PostgreSQL instead of Oracle, DB2, MS SQL Server or Informix, you can spend your money on important things – not on licenses. Imagine what good engineers can produce if you give them a million.
Classical SQL analytics vs. Machine Learning
The next thing is more of a marketing topic: “We want to use machine learning to calculate our yearly revenue”. Can anybody tell me what you want to use Machine Learning for in this case? What you want is to add up some numbers. That has nothing to do with AI or Data Science. 30 years ago, people added up invoices to determine their yearly revenue. That was boring then, and it is still boring 30 years later. This is bookkeeping – not Machine Learning – and any database system will readily provide you with the means to perform those operations. You want to know the likelihood that a customer is going to return a package you have sent? Again: that is not Data Science. It is simply counting what happened in the past.
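“Counting what happened in the past” really is just an aggregate query. A minimal sketch (made-up `shipments` table, `sqlite3` again standing in for PostgreSQL): the historical return rate per customer is nothing more than an average over a 0/1 flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shipments (customer TEXT, returned INTEGER)")
conn.executemany("INSERT INTO shipments VALUES (?, ?)", [
    ("alice", 0), ("alice", 1), ("alice", 0),
    ("bob", 0),
])

# Averaging a 0/1 column IS the historical return rate.
rows = conn.execute("""
    SELECT customer, AVG(returned) AS return_rate
    FROM shipments
    GROUP BY customer
    ORDER BY customer
""").fetchall()
# alice returned 1 of 3 packages, bob 0 of 1
```

No model is trained, nothing is predicted; a `GROUP BY` and an `AVG` answer the question.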
You want to know WHY a customer sent stuff back? Well, in this case things might be more interesting. You might want to use the Python NLTK to do some text mining, or train a Random Forest to see what is going on and to predict things. However, if you only want to know how many packages will likely come back, this is not Machine Learning. It is simple statistics. The fact is: it is not cool, it is not modern, and it does not need a buzzword. Any database engineer can help you in this case.
“Not everybody who wants Machine Learning actually needs it”
You always have to ask yourself: do you want to be cool or do you want to be productive? In the long run productivity is king, and PostgreSQL will give you exactly that: productivity.
Deciding on a toolchain
Now suppose you have decided that Machine Learning and Data Science are really needed. Maybe you want to do image recognition (e.g. detecting cancer on medical images) or maybe you want to create a mathematical model for time-series prediction. What I found useful is to use PostgreSQL as a solid relational database, which perfectly ensures data consistency, data types, and so on. It also gives you an easy way to do data sampling (TABLESAMPLE in SQL), filtering, joining and so on.
In general, we found that Python and R are really good platforms for Data Science and Machine Learning. R Shiny is a tool which quickly helps you visualize data and understand your results. TensorFlow and sklearn are great tools to work with as well.
If you want to learn more about R Shiny, check out what my personal friend and Cybertec Data Science expert Hernan Resnizky has to say in his book. In case you want to learn about sampling in PostgreSQL consider checking out my blog on Machine Learning.