
A couple of weeks ago I saw a horribly designed, 370 billion (with a “b”) row Oracle database monster running some sort of aggregations. Due to the sheer amount of data, I naturally thought about how I would implement the same thing in PostgreSQL. What I noticed is that most people would actually implement aggregations slightly wrong in PostgreSQL - so I decided to share some thoughts on the GROUP BY clause and performance.

Loading data into PostgreSQL

Before getting started, two relations are created:

t_gender is a classical “lookup table”. It contains only a handful of rows, which will be joined on. The real “fact table” is in my example t_person:
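A minimal version of these two relations could look as follows (the exact column names are assumptions on my side, not taken from the original post):

CREATE TABLE t_gender
(
    id     int PRIMARY KEY,
    name   text
);

CREATE TABLE t_person
(
    id      serial,
    gender  int    -- references t_gender.id
);

INSERT INTO t_gender VALUES (1, 'male'), (2, 'female');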

For this example it should be enough to load a couple of million rows. The following INSERT statement makes sure that genders are evenly distributed in the table. Half of the population is male, and the other half is female:
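A sketch of such an INSERT, assuming the layout shown above:

INSERT INTO t_person (gender)
    SELECT (x % 2) + 1          -- alternate between gender 1 and gender 2
    FROM generate_series(1, 5000000) AS x;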

How most people approach aggregation in PostgreSQL

The goal of the following code is to take our data and turn it into a simple report. Reporting is pretty important these days so you might see many of those queries as outlined below:
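Assuming the schema sketched above, such a report query might look like this:

SELECT g.name, count(*)
FROM   t_person AS p
       JOIN t_gender AS g ON p.gender = g.id
GROUP BY g.name;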

The goal is to figure out how many men and women our database contains. The way to do that is to simply join the lookup table. On my laptop this takes around 961 ms …
The question is: Can we do better? Of course there is always a way to speed things up: more CPUs, more cache, etc. However, this is not the kind of improvement I am talking about. My question is: Can we use a smarter algorithm? Many people might be surprised, but the answer is “yes”.

SQL: Finding the bottleneck

If you want to understand a performance problem there is usually no way to get around reading execution plans. Here is how to ask the planner what it thinks:
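The plan can be obtained by simply prefixing the query from above with EXPLAIN (or EXPLAIN ANALYZE to also execute it):

EXPLAIN ANALYZE
SELECT g.name, count(*)
FROM   t_person AS p
       JOIN t_gender AS g ON p.gender = g.id
GROUP BY g.name;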

PostgreSQL scans both tables sequentially and joins them together. Then the joined data is aggregated. In other words: 5 million rows will be joined with a small table.

High-performance analysis and aggregation in PostgreSQL

However, there is an alternative: What if we aggregate first and join later? What if we just counted those IDs and then looked up the names? The beauty of this approach is that we only have to join 2 rows instead of 5 million rows.

Here is what we could do:
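A sketch of the rewritten query - aggregate inside a CTE first, then join the two resulting rows to the lookup table:

WITH x AS
(
    SELECT gender, count(*) AS cnt
    FROM   t_person
    GROUP BY gender
)
SELECT g.name, x.cnt
FROM   x
       JOIN t_gender AS g ON x.gender = g.id;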

Voilà, the same thing happens A LOT faster. The Common Table Expression (CTE) is executed first and then joined. WITH acts as an “optimization barrier”, making sure that the optimizer cannot fool around with things.

Let us take a look at the plan:

Sweet: The CTE is calculated and later joined giving us the extra boost we desired.

PostgreSQL: What the future might have in store for us

The reason why PostgreSQL doesn't do this automatically is buried deep inside the structure of the planner. As of version 10.x, PostgreSQL always has to join first and aggregate later. Currently, serious work is being done to lift this restriction and give the planner a bit more flexibility. Various developers, including people from my team here at CYBERTEC, are actively working on this issue, and I am hopeful to see a speedup in PostgreSQL 11 or maybe PostgreSQL 12.

Find out the latest info on PostgreSQL performance tuning in our blogposts.

By Kaarel Moppel - Restricting column access - Something from the security realm this time. Normally this side of the “database world” doesn't get too much attention and “love”, because it's difficult/tedious and requires some deep understanding of how the system works... but ignoring security, as we know, can catch up with you badly.

Anyway, I'm not telling a real horror story here, but some weeks ago a friend who’s a “casual DBA” (install, start/stop, backup/restore) asked for help with a situation where a column was “leaking” data, and I had to think a bit longer than usual to find an appropriate place where the issue could be fixed. The solution I suggested in the end made use of the PostgreSQL Rule System. I hadn’t seen rules being used in years (one should generally try to avoid having stuff happen behind the scenes) and had to refresh on them a bit myself, but this was the correct use case for applying them, so I thought I’d share it as well.

The security leak

So, the application they had was a bought “black box” webapp that did some quite trivial CRUD on some data forms, and all was fine - until they noticed that in one form, a column with address data was visible to all users of the system instead of only to a specific login role. They couldn’t immediately change the code and were looking for a quick fix on the DB side (where they could make changes) to hide the data in that column altogether for all users, until the app gets properly fixed. So what means/tricks are available for this kind of thing? Not too many, and they usually have implications... but let’s have a look at them.

Available options on restricting column access

When data is accessed by different login roles, normally the best way to limit access to single columns would be to use the “column privileges” feature - just do a “REVOKE SELECT” followed by a “GRANT SELECT (col1, col2, ...)”, leaving the “secret” columns out of the list. With our app this wouldn’t work, as only one backend role was used, so all the SELECTs on that table would start to throw errors. Additionally, the SQL behind other forms showing the column would need to be adjusted. A no-go.
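For completeness, the column-privilege approach would look roughly like this (role, table and column names are made up here):

REVOKE SELECT ON customer FROM webapp_role;
GRANT SELECT (id, username) ON customer TO webapp_role;
-- any "SELECT *" issued by webapp_role on this table now fails with a permission error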

Another simple trick (very easily rolled back) could serve as a temporary fix, but it would only work when no new data is being added - in our case, though, new customer registrations were still coming in from the web. A no-go.

Renaming the old table and creating a simple view with the same name, including all the wanted columns (and adding appropriate GRANTs, for example copying the existing ones with "pg_dump -s -t problem_table | grep ^GRANT"), is relatively easy. We could, for example, replace the “secret” column’s value with an empty string and all would be fine… until we again get some INSERTs or UPDATEs. Those would then throw an error, because a view column defined as an expression like that is not automatically updatable, even though simple DML on views is allowed by default in Postgres.

So how do you get around this INSERT/UPDATE problem?

To overcome the errors on adding/changing data, we need to re-route the data into the “base table” of our view. This can be easily done with rules and it would be a good fit for this use case. Read on for a sample.

The solution

So let’s say this is our leaking table:
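A minimal stand-in for such a table (the real schema is of course unknown, so the names here are made up; only the leaky_data column matters):

CREATE TABLE customer
(
    id          int PRIMARY KEY,
    username    text NOT NULL,
    leaky_data  text
);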

And now we want to replace all values in leaky_data column with empty strings. So, we need to write something like this:
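A sketch of the fix, based on the stand-in table above - rename the table, hide the column behind a view with the original name, and re-route writes into the base table with rules:

ALTER TABLE customer RENAME TO customer_real;

CREATE VIEW customer AS
    SELECT id, username, ''::text AS leaky_data
    FROM customer_real;

CREATE RULE customer_ins AS ON INSERT TO customer DO INSTEAD
    INSERT INTO customer_real VALUES (NEW.id, NEW.username, NEW.leaky_data);

CREATE RULE customer_upd AS ON UPDATE TO customer DO INSTEAD
    UPDATE customer_real
       SET username = NEW.username, leaky_data = NEW.leaky_data
     WHERE id = OLD.id;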

Now indeed it does what we want:
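A quick test against that sketch:

INSERT INTO customer VALUES (1, 'bob', 'very secret address');

SELECT * FROM customer;        -- leaky_data shows up as an empty string
SELECT * FROM customer_real;   -- the real value is still stored underneath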

NB! When the application is using constructs like “UPDATE … RETURNING *”, it’s also possible to take care of the loophole there in the rule definition, by adding the “RETURNING …, ‘’::text AS leaky_data” part. Hope it helps!

Find out the latest on security matters in PostgreSQL by reading our security blog spot.

While I was at a database training session to migrate some poor fellows from Oracle to PostgreSQL, I was confronted with this simple question: “How many users can you have in PostgreSQL?”. Obviously somebody had some bad experiences with other databases, or with Oracle in particular, to come up with this kind of question. “Well, as many as you want. How many do you need?”. After some brief discussion it seemed that people just did not believe what I said. So I decided to show them how easily a large number of database users can be created in no time.

Creating a large number of users

The good thing about PostgreSQL is that DDLs can happen inside transactions. As user creation is done by DDLs, it is possible to create all those new accounts in a single transaction. Also 1 million users is a lot, so we clearly don’t want to compile this list by hand. And: The entire process has to be quick and take only 5 minutes at a training course.

In short: I needed something fast, so I decided to come up with the following script...
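A minimal sketch of such a script, run from psql (the user name prefix is made up):

BEGIN;
SELECT 'CREATE USER dummy_user_' || id FROM generate_series(1, 1000000) AS id;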

It simply generates a couple of SQL commands, which do exactly what I want. Here is the output:
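With the sketch above, psql prints one generated statement per row, roughly like this:

         ?column?
---------------------------
 CREATE USER dummy_user_1
 CREATE USER dummy_user_2
 CREATE USER dummy_user_3
 ...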

The idea is to get 1 million statements.

Before the script is executed, none of those users exist yet.

Here comes the trick: \gexec executes the SQL we have just generated directly. Finally, the transaction is committed:
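Continuing the sketch from above - \gexec re-runs the generating query and executes every resulting row as an SQL statement of its own:

\gexec
COMMIT;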

On my laptop, generating 1 million users in a single transaction takes around 1 minute. Many people might be surprised that it is actually possible to create this number of users. Others might be delighted to see that it only takes one minute. The first feedback I got from my folks here at Cybertec HQ was: “Why does it take so long 😉 ”. Well, people in the PostgreSQL community see things differently sometimes 🙂

What happens behind the scenes is simple: psql runs each of the generated CREATE USER statements one after the other, and all the new accounts become visible at once when the transaction commits.

The bottom line is really that PostgreSQL can easily handle a million users without any problems. It is good to see that it is possible to create so many accounts with just 4 lines of code.

Ever since Hannes Eder published the idea of the SKYLINE OF operator on the PostgreSQL mailing list years ago, I have been somewhat hooked on the idea of being able to make more intelligent queries in PostgreSQL. So, what is the idea of a “Skyline query”? Here is the basic concept: Imagine you want to go on holiday, and you are looking for a nice hotel on the beach. The trouble is: The hotels with a nice view of the beach are way too expensive – the hotels further back are cheap but far away from the sea. The question is: What is the best compromise?

That's exactly what this post is all about.

Here is an example:
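A possible table for this example (the table name and data types are assumptions):

CREATE TABLE t_hotel
(
    id        serial PRIMARY KEY,
    name      text,
    price     numeric,    -- price per night
    distance  numeric     -- distance to the beach in meters
);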

The table stores the name of a hotel, the price and the distance to the beach. Let us add a couple of rows manually:
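A handful of purely hypothetical rows, just to have something to sort:

INSERT INTO t_hotel (name, price, distance) VALUES
    ('Shabby Hut',          45, 2200),
    ('Budget Inn',          60, 1500),
    ('Crapstone',           90,   80),
    ('Ocean Grand',        300,   30),
    ('Beachfront Palace',  480,   10);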

If we select our hotels sorted by price, we will see that we will most likely end up far away from the beach in a cheap, low-quality hotel. Clearly, this is not desirable:
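Against the hypothetical data from above, that query is simply:

SELECT * FROM t_hotel ORDER BY price;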

However, if we sort by distance, we will end up close to the beach, but we won't be able to afford it. The trouble is that none of those queries will actually offer us a good compromise:
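Sorting by distance instead is just as simple:

SELECT * FROM t_hotel ORDER BY distance;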

More advanced ordering in PostgreSQL

Fortunately, PostgreSQL allows us to use more sophisticated sort criteria. Sorting by a single column is boring. What we want is to somehow treat different columns differently. In this case, customers might feel that distance is not really linear: whether you are 20 or 50 meters away from the beach does not really matter anymore, but whether you are 50 meters or 1 km away matters a lot. To keep it easy, I decided to go for the square root of the distance, while still taking the price as it is. The result looks way more promising than before:
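One plausible way to express that combined criterion (the exact formula in the original post may well differ) is to order by the product of the price and the square root of the distance:

SELECT *
FROM   t_hotel
ORDER BY price * sqrt(distance)
LIMIT 3;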

It seems that the Crapstone hotel is the best bargain here. It is not the cheapest hotel, but it is pretty close and still reasonably priced, so maybe it is most fitting to book that one.

The trouble starts when we look at the execution plan of this tiny PostgreSQL query:

PostgreSQL will read all the data and sort it by our custom criterion. While this is fine for a small data set, it will kill us if the amount of data keeps growing.

Scaling up: Increasing the size of our data set

Let us see what happens if we load 5 million rows into our table:
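A sketch of how such a data set could be generated:

INSERT INTO t_hotel (name, price, distance)
    SELECT 'hotel_' || id,
           (random() * 500)::numeric(10, 2),     -- random price
           (random() * 5000)::numeric(10, 2)     -- random distance in meters
    FROM generate_series(1, 5000000) AS id;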

Loading all this data is clearly not a problem, but check out what happens now:

It took almost 19 seconds (on my laptop) to run the query. Most users would not tolerate this kind of behavior too often, so we somehow need to improve the runtime.

The SKYLINE OF operator does not exist in PostgreSQL (nor in any other database engine I am aware of). However: PostgreSQL offers functional indexes, which are ideal in this case:
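A small wrapper function for the ranking expression might look like this (the function name and the formula are carried over from the sketch above, not from the original post):

CREATE FUNCTION hotel_rating(price numeric, distance numeric)
    RETURNS numeric AS
$$
    SELECT price * sqrt(distance);
$$ LANGUAGE sql IMMUTABLE;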

The important thing here is to use an IMMUTABLE function. We must ensure that the function used to rank the data is perfectly deterministic and that its result does not change over time given the same input parameters.
Creating the index is easy:
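Based on that function:

CREATE INDEX idx_hotel_rating ON t_hotel (hotel_rating(price, distance));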

Speeding up the skyline query using an index

Our new index really boosts things and reduces the runtime of this query to around 1 millisecond, which is around 20,000 times faster than before. The result is the same:
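The query now has to order by the very same indexed expression, for example:

SELECT *
FROM   t_hotel
ORDER BY hotel_rating(price, distance)
LIMIT 3;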

The execution plan shows that PostgreSQL will directly go to the index and fetch the data needed. Indexes in PostgreSQL return sorted data, so there is no need for sorting and no need to touch more than a handful of rows.

Of course, the approach is somewhat different from the one described in the paper about Skyline queries. However, the method I have shown you is easy to implement, efficient, and serves most real-world purposes. You can make your rating function reasonably sophisticated and still maintain good performance.

Read on about PostgreSQL in our latest performance blogs.
