Timeseries are an increasingly important topic – not just in PostgreSQL. Recently I gave a presentation @AGIT in Salzburg about timeseries and I demonstrated some super simple examples. The presentation was well received, so I decided to share this stuff in the form of a blog PostgreSQL, so that more people can learn about windowing functions and SQL in general. A link to the video is available at the end of the post so that you can listen to the original material in German.
Loading timeseries data the easy way
To show how data can be loaded, I compiled a simple dataset, which can be found on my website. Here is how it works:
test=# CREATE TABLE t_oil ( region text, country text, year int, production int, consumption int ); CREATE TABLE test=# COPY t_oil FROM PROGRAM 'curl https://www.cybertec-postgresql.com/secret/oil_ext.txt'; COPY 644
The cool thing is that if you happen to be a superuser, you can easily load the data from the web directly. COPY FROM PROGRAM allows you to execute code on the server and pipe it directly to PostgreSQL, which is super simple. But keep in mind: This only works if you are a PostgreSQL superuser (for security reasons).
lag: The backbone of timeseries analysis
If you are dealing with timeseries, calculating the difference to the previous period is really important and is needed in many cases. Fortunately SQL allows you to do that pretty easily. Here is how it works:
test=# SELECT year, production, lag(production, 1) OVER (ORDER BY year) FROM t_oil WHERE country = 'USA' LIMIT 5; year | production | lag ------+------------+------- 1965 | 9014 | 1966 | 9579 | 9014 1967 | 10219 | 9579 1968 | 10600 | 10219 1969 | 10828 | 10600 (5 rows)
The lag functions takes two parameters: The first column defines the column, which should be used in this case. The second parameter is optional. If you skip it, the expression will be equivalent to lag(production, 1). In my example the lag column will be off by one. However, you can use any integer number to move data up or down given the order defined in the OVER clause.
What we got so far is the value of the previous period. Let us calculate the difference next:
test=# SELECT year, production, production - lag(production, 1) OVER (ORDER BY year) AS diff FROM t_oil WHERE country = 'USA' LIMIT 5; year | production | diff ------+------------+------ 1965 | 9014 | 1966 | 9579 | 565 1967 | 10219 | 640 1968 | 10600 | 381 1969 | 10828 | 228 (5 rows)
That was easy. All we have to do is to take the current row and subtract the previous row.
Windowing functions are way more powerful than shown here but maybe this example is good to getting you started in the first place.
Calculating correlations using PostgreSQL
Sometimes you might want to calculate the correlation between columns. PostgreSQL offers the “corr” function to do exactly that. The following listing shows a simple example:
test=# SELECT country, corr(production, consumption) FROM t_oil GROUP BY 1 ORDER BY 2 DESC NULLS LAST; country | corr ----------------------+-------------------- Mexico | 0.962790640608018 Canada | 0.932931452462893 Qatar | 0.925552359601189 United Arab Emirates | 0.882953285119214 Saudi Arabien | 0.642815458284221
As you can see the correlation in Mexico and Canada are highest.
Checking for continuous activity
In the past we presented other examples related to timeseries and analysis in general. One of the most interesting posts is found here.
If you want to see the entire short presentation in German consider checking out the following video.