A couple of weeks ago I saw a horribly designed, 370 billion (with a "b") row Oracle database monster running some sort of aggregations. Given the sheer amount of data, I naturally thought about how I would implement the same thing in PostgreSQL. What I noticed is that most people implement aggregations in PostgreSQL in a slightly suboptimal way - so I decided to share some thoughts on the GROUP BY clause and performance.
Before getting started, two relations are created:
test=# CREATE TABLE t_gender (id int, name text);
CREATE TABLE
test=# INSERT INTO t_gender VALUES (1, 'male'), (2, 'female');
INSERT 0 2
t_gender is a classical "lookup table". It contains only a handful of rows, which will be joined on. The real "fact table" in my example is t_person:
test=# CREATE TABLE t_person (
    id              serial,
    gender          int,
    data            char(40)
);
CREATE TABLE
For this example it should be enough to load a couple of million rows. The following INSERT statement makes sure that genders are evenly distributed in the table: half of the population is male, the other half is female:
test=# INSERT INTO t_person (gender, data)
       SELECT x % 2 + 1, 'data'
       FROM generate_series(1, 5000000) AS x;
INSERT 0 5000000
The goal of the following code is to take our data and turn it into a simple report. Reporting is pretty important these days, so you will see many queries like the one outlined below:
test=# SELECT name, count(*)
       FROM t_gender AS a, t_person AS b
       WHERE a.id = b.gender
       GROUP BY 1;
  name  |  count
--------+---------
 female | 2500000
 male   | 2500000
(2 rows)

Time: 961.034 ms
The goal is to figure out how many men and women our database contains. The way to do that is to simply join the lookup table. On my laptop this takes around 961 ms.
The question is: Can we do better? Of course there is always a way to speed things up: more CPUs, more cache, etc. However, that is not the kind of improvement I am talking about. My question is: Can we use a smarter algorithm? Many people might be surprised, but the answer is "yes".
If you want to understand a performance problem there is usually no way to get around reading execution plans. Here is what the planner thinks:
explain SELECT name, count(*)
        FROM t_gender AS a, t_person AS b
        WHERE a.id = b.gender
        GROUP BY 1;

                    QUERY PLAN
-----------------------------------------------------
 HashAggregate ...
   Group Key: a.name
   ->  Hash Join  (rows=5000034)
         Hash Cond: (b.gender = a.id)
         ->  Seq Scan on t_person b  (rows=5000034)
         ->  Hash  (cost=1.02..1.02 rows=2 width=10)
               ->  Seq Scan on t_gender a  (rows=2)
PostgreSQL scans both tables sequentially and joins them together. Then the joined data is aggregated. In other words: 5 million rows will be joined with a small table.
However, there is an alternative: What if we aggregate first and join later? What if we just count the IDs and then look up the names? The beauty of this approach is that we only have to join 2 rows instead of 5 million.
Here is what we could do:
test=# WITH x AS
       (
            SELECT gender, count(*) AS res
            FROM t_person AS a
            GROUP BY 1
       )
       SELECT name, res
       FROM x, t_gender AS y
       WHERE x.gender = y.id;
...
...
Time: 526.472 ms
Voila, the same thing happens A LOT faster. The Common Table Expression (CTE) is executed first and then joined. WITH acts as an "optimization barrier", making sure that the optimizer cannot rearrange things across the CTE boundary.
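(A side note for readers on newer versions: starting with PostgreSQL 12, plain CTEs may be inlined by the optimizer, so the barrier behavior described here is no longer guaranteed. To force the CTE to be materialized first, the MATERIALIZED keyword can be spelled out explicitly - a sketch based on the query above:)

```sql
WITH x AS MATERIALIZED
(
    SELECT gender, count(*) AS res
    FROM t_person AS a
    GROUP BY 1
)
SELECT name, res
FROM x, t_gender AS y
WHERE x.gender = y.id;
```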
Let us take a look at the plan:
test=# explain WITH x AS
       (
            SELECT gender, count(*) AS res
            FROM t_person AS a
            GROUP BY 1
       )
       SELECT name, res
       FROM x, t_gender AS y
       WHERE x.gender = y.id;

           QUERY PLAN
---------------------------------
 Hash Join  (rows=2)
   Hash Cond: (y.id = x.gender)
   CTE x
     ->  HashAggregate  (rows=2)
           Group Key: a.gender
           ->  Seq Scan on t_person a  (rows=5000034)
   ->  Seq Scan on t_gender y  (rows=2)
   ->  Hash  (rows=2)
         ->  CTE Scan on x  (rows=2)
Sweet: The CTE is calculated and later joined, giving us the extra boost we desired.
The reason why PostgreSQL doesn't do this automatically is buried deep inside the structure of the planner. As of version 10.x, PostgreSQL always has to join first and aggregate later. Serious work is currently being done to lift this restriction and give the planner a bit more flexibility. Various developers, including people from my team here at CYBERTEC, are actively working on this issue, and I am hopeful to see a speedup in PostgreSQL 11 or maybe PostgreSQL 12.
Find out the latest info on PostgreSQL performance tuning in our blogposts.
By Kaarel Moppel - Restricting column access - Something from the security realm this time. Normally this side of the "database world" doesn't get too much attention and "love", because it's difficult/tedious and requires some deep understanding of how the system works... but ignoring security, as we know, can catch up with you badly.
Anyway, I'm not telling a real horror story here. Some weeks ago a friend who's a "casual DBA" (install, start/stop, backup/restore) asked for help with a situation where a column was "leaking" data, and I had to think a bit longer than usual to find an appropriate place where this issue could be fixed. The solution I suggested in the end made use of the rule system of PostgreSQL. I hadn't seen rules being used in years (one should generally try to avoid having stuff happen behind the scenes) and had to refresh on them a bit, but this was the correct use case for applying them, so I thought I'd share it as well.
So, the application they had was a bought "black box" webapp that did some quite trivial CRUD on some data forms, and all was fine - until they noticed that in one form a column with address data was visible to all users of the system, instead of just a specific login role. They couldn't immediately change the code and were looking for a quick fix to hide the data in that column altogether for all users on the DB side (where they could make changes), until the app got properly fixed. So what means/tricks are available for this kind of stuff? Not too many, and they usually have implications... but let's have a look at them.
When data is accessed by different login roles, normally the best way to limit access to single columns is the "column privileges" feature - just do a REVOKE SELECT followed by a GRANT SELECT (col1, col2, ...), leaving the "secret" columns out of the list. With our app it wouldn't work, as only one backend role was used, and all the SELECTs on that table would start to throw errors. Additionally, the SQL behind every other form showing the column would need to be adjusted. A no-go.
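For reference, the column privileges approach would look something like this (the role name app_user is made up for illustration; the table matches the example used later in this post):

```sql
-- remove blanket table access, then grant back only the harmless columns
REVOKE SELECT ON problem_table FROM app_user;
GRANT SELECT (id) ON problem_table TO app_user;

-- now "SELECT id FROM problem_table" still works for app_user,
-- while "SELECT * FROM problem_table" and any query touching
-- leaky_data fail with a "permission denied" error
```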
This would be the easiest trick (and very easily rolled back) for a temporary fix... but it only works when no new data is being added - in our case, though, new customer registrations were still coming in from the web. Another no-go.
Renaming the old table and creating a simple view with the same name, including all the wanted columns (and adding the appropriate GRANTs, for example by copying the existing ones with "pg_dump -s -t problem_table | grep ^GRANT"), is relatively easy. We could replace our "secret" column's value with, for example, an empty string and all would be fine... until we again get some INSERTs or UPDATEs. Then we would see an error like the one below, although simple DML is allowed on views by default in Postgres:
ERROR:  cannot insert into column "leaky_data" of view "problem_table"
DETAIL:  View columns that are not columns of their base relation are not updatable.
So how do you get around this INSERT/UPDATE problem?
To overcome the errors on adding/changing data, we need to re-route the data into the “base table” of our view. This can be easily done with rules and it would be a good fit for this use case. Read on for a sample.
So let’s say this is our leaking table:
CREATE TABLE problem_table(id int, leaky_data text);
And now we want to replace all values in the leaky_data column with empty strings, so we write something like this:
BEGIN;

ALTER TABLE problem_table RENAME TO problem_table_real;

CREATE VIEW problem_table AS
    SELECT id, ''::text AS leaky_data
    FROM problem_table_real;

CREATE RULE problem_table_ins AS
    ON INSERT TO problem_table
    DO INSTEAD
    INSERT INTO problem_table_real
    VALUES (NEW.id, NEW.leaky_data);

CREATE RULE problem_table_upd AS
    ON UPDATE TO problem_table
    DO INSTEAD
    UPDATE problem_table_real
    SET id = NEW.id,
        leaky_data = NEW.leaky_data
    WHERE id = NEW.id;

COMMIT;
Now indeed it does what we want:
INSERT INTO problem_table (id, leaky_data) VALUES (1, 'secret');

SELECT * FROM problem_table;
 id │ leaky_data
────┼────────────
  1 │
(1 row)
NB! If the application uses constructs like UPDATE ... RETURNING *, it's also possible to close the loophole there in the rule definition, by adding a "RETURNING ..., ''::text AS leaky_data" part. Hope it helps!
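As a sketch (an untested variation of the UPDATE rule above), the sanitized RETURNING clause could look like this:

```sql
CREATE OR REPLACE RULE problem_table_upd AS
    ON UPDATE TO problem_table
    DO INSTEAD
    UPDATE problem_table_real
    SET id = NEW.id,
        leaky_data = NEW.leaky_data
    WHERE id = NEW.id
    -- the RETURNING list must match the view's columns,
    -- so we can substitute an empty string for the real value:
    RETURNING problem_table_real.id,
              ''::text AS leaky_data;
```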
Find out the latest on security matters in PostgreSQL by reading our security blog posts.
While I was at a database training to migrate some poor fellows from Oracle to PostgreSQL, I was confronted with this simple question: "How many users can you have in PostgreSQL?". Obviously somebody had made some bad experiences with other databases, or Oracle in particular, to come up with this kind of question. "Well, as many as you want. How many do you need?". After some brief discussion it seemed that people just did not believe what I said. So I decided to show them how easily a large number of database users can be created in no time.
The good thing about PostgreSQL is that DDL statements can run inside transactions. As user creation is done via DDL, it is possible to create all those new accounts in a single transaction. Also, 1 million users is a lot, so we clearly don't want to compile this list by hand. And: the entire process has to be quick and take only 5 minutes at a training course.
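To illustrate the point about transactional DDL, here is a quick sketch (the user name is made up for this example): user creation can be rolled back like any other statement, which is exactly what makes the single-transaction approach safe.

```sql
BEGIN;
CREATE USER throwaway_demo;
ROLLBACK;

-- as far as the system is concerned, the user never existed:
SELECT count(*) FROM pg_user WHERE usename = 'throwaway_demo';
-- returns 0
```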
In short: I needed something fast, so I decided to come up with the following script...
BEGIN;
SELECT 'CREATE USER xy' || id
FROM generate_series(1, 1000000) AS id
\gexec
COMMIT;
It simply generates one million SQL commands, which do exactly what I want. Here is the output of the SELECT:
test=# SELECT 'CREATE USER xy' || id
       FROM generate_series(1, 1000000) AS id;
       ?column?
-----------------------
 CREATE USER xy1
 CREATE USER xy2
 CREATE USER xy3
 CREATE USER xy4
 CREATE USER xy5
 CREATE USER xy6
 CREATE USER xy7
...
The idea is to get 1 million statements. Before running the script, pg_user contains just a single user:
test=# SELECT count(*) FROM pg_user;
 count
-------
     1
(1 row)
Here comes the trick: \gexec takes each row returned by the query and executes it directly as an SQL statement. Finally the transaction is committed. Running the whole script from a file looks like this:
[hs@zenbook master]$ time psql test < /tmp/a.sql > /dev/null

real    1m0.009s
user    0m4.212s
sys     0m8.600s
On my laptop, generating 1 million users in a single transaction takes around 1 minute. Many people might be surprised that it is actually possible to create this number of users. Others might be delighted to see that it only takes one minute. The first feedback I got from my folks here at Cybertec HQ was: “Why does it take so long 😉 ”. Well, people in the PostgreSQL community see things differently sometimes 🙂
test=# SELECT count(*) FROM pg_user;
  count
---------
 1000001
(1 row)
The bottom line is really that PostgreSQL can easily handle a million users without any problems. It is good to see that it is possible to create so many accounts with just 4 lines of code.
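Cleaning up afterwards works the same way; a sketch using the same \gexec trick in reverse:

```sql
BEGIN;
SELECT 'DROP USER xy' || id
FROM generate_series(1, 1000000) AS id
\gexec
COMMIT;
```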
Ever since Hannes Eder published the idea of the SKYLINE OF operator on the PostgreSQL mailing list years ago, I have been somewhat hooked on the idea of being able to write more intelligent queries in PostgreSQL. So, what is the idea of a "Skyline query"? Here is the basic concept: Imagine you want to go on holiday, and you are looking for a nice hotel on the beach. The trouble is: The hotels with a nice view of the beach are way too expensive, while the hotels further back are cheap but far away from the sea. The question is: What is the best compromise?
That's exactly what this post is all about.
Here is an example:
test=# CREATE TABLE t_hotel (
    id              serial,
    name            text,
    price           numeric,
    distance_beach  numeric
);
CREATE TABLE
The table stores the name of a hotel, the price and the distance to the beach. Let us add a couple of rows manually:
test=# INSERT INTO t_hotel (name, price, distance_beach)
       VALUES ('ABC Motel', 120, 2.4);
INSERT 0 1
test=# INSERT INTO t_hotel (name, price, distance_beach)
       VALUES ('Crapstone Hotel', 90, 2.2);
INSERT 0 1
test=# INSERT INTO t_hotel (name, price, distance_beach)
       VALUES ('Luxury Watch Spa Hotel', 495, 0.2);
INSERT 0 1
test=# INSERT INTO t_hotel (name, price, distance_beach)
       VALUES ('Nowhere Middle Hotel', 45, 9.6);
INSERT 0 1
If we select our hotels sorted by price, we will most likely end up far away from the beach in a cheap, low-quality hotel. Clearly, this is not desirable:
test=# SELECT * FROM t_hotel ORDER BY price;
 id |          name          | price | distance_beach
----+------------------------+-------+----------------
  4 | Nowhere Middle Hotel   |    45 |            9.6
  2 | Crapstone Hotel        |    90 |            2.2
  1 | ABC Motel              |   120 |            2.4
  3 | Luxury Watch Spa Hotel |   495 |            0.2
(4 rows)
However, if we sort by distance, we will end up close to the beach, but we won't be able to afford it. The trouble is that none of those queries will actually offer us a good compromise:
test=# SELECT * FROM t_hotel ORDER BY distance_beach;
 id |          name          | price | distance_beach
----+------------------------+-------+----------------
  3 | Luxury Watch Spa Hotel |   495 |            0.2
  2 | Crapstone Hotel        |    90 |            2.2
  1 | ABC Motel              |   120 |            2.4
  4 | Nowhere Middle Hotel   |    45 |            9.6
(4 rows)
Fortunately, PostgreSQL allows us to use more sophisticated sort criteria. Sorting by a single column is boring; what we want is to weight the columns differently. In this case, customers might feel that distance is not really linear: whether you are 20 or 50 meters away from the beach hardly matters anymore, but being 50 meters versus 1 km away matters a lot. To keep it simple, I decided to go for the square root of the distance, while taking the price as it is. The result looks way more promising than before:
test=# SELECT price * sqrt(distance_beach), *
       FROM t_hotel
       ORDER BY 1;
   ?column?   | id |          name          | price | distance_beach
--------------+----+------------------------+-------+----------------
 133.491572   |  2 | Crapstone Hotel        |    90 |            2.2
 139.427400   |  4 | Nowhere Middle Hotel   |    45 |            9.6
 185.903200   |  1 | ABC Motel              |   120 |            2.4
 221.37072977 |  3 | Luxury Watch Spa Hotel |   495 |            0.2
(4 rows)
It seems that the Crapstone hotel is the best bargain here. It is not the cheapest hotel, but it is pretty close and still reasonably priced, so maybe it is most fitting to book that one.
The trouble starts when we look at the execution plan of this tiny PostgreSQL query:
test=# explain SELECT price * sqrt(distance_beach), *
       FROM t_hotel
       ORDER BY 1;
                            QUERY PLAN
------------------------------------------------------------------
 Sort  (cost=48.74..50.32 rows=630 width=132)
   Sort Key: ((price * sqrt(distance_beach)))
   ->  Seq Scan on t_hotel  (cost=0.00..19.45 rows=630 width=132)
(3 rows)
PostgreSQL will read all the data and sort it by our custom criterion. While this is fine for a small data set, it will kill us as the amount of data keeps growing.
Let us see what happens if we load 5 million rows into our table:
test=# TRUNCATE t_hotel;
TRUNCATE TABLE
test=# INSERT INTO t_hotel (price, distance_beach)
       SELECT 40 + random() * 200, random() * 15
       FROM generate_series(1, 5000000);
INSERT 0 5000000
Loading all this data is clearly not a problem, but check out what happens now:
test=# \timing
Timing is on.
test=# SELECT price * sqrt(distance_beach), price, distance_beach
       FROM t_hotel
       ORDER BY 1
       LIMIT 10;
           ?column?           |      price       |     distance_beach
------------------------------+------------------+-------------------------
 0.15700293278521447089153382 | 199.127877652645 | 0.000000621657818555832
 0.3200968902259212440086465  | 77.0173728093505 | 0.0000172737054526806
 0.3452837023672139082940331  | 59.0800635144114 | 0.0000341562554240227
...
(10 rows)

Time: 18916.807 ms (00:18.917)
It took almost 19 seconds (on my laptop) to run the query. Most users would not tolerate this kind of behavior too often, so we somehow need to improve the runtime.
The SKYLINE OF operator does not exist in PostgreSQL (nor in any other database engine I am aware of). However, PostgreSQL offers functional indexes, which are ideal in this case:
test=# CREATE FUNCTION rate_hotel(numeric, numeric)
       RETURNS numeric AS
$$
    SELECT $1 * sqrt($2)
$$ LANGUAGE 'sql' IMMUTABLE;
CREATE FUNCTION
The important thing here is to use an IMMUTABLE function. We must ensure that the function used to rank the data is perfectly deterministic: given the same input parameters, its result must never change.
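To see why the keyword matters, here is a quick sketch of what happens without it (the function name is made up for this demonstration): a function without an explicit volatility declaration defaults to VOLATILE, and PostgreSQL refuses to use it in an index expression.

```sql
-- same body, but VOLATILE by default because nothing is declared
CREATE FUNCTION rate_hotel_volatile(numeric, numeric)
RETURNS numeric AS
$$
    SELECT $1 * sqrt($2)
$$ LANGUAGE 'sql';

CREATE INDEX idx_bad ON t_hotel (rate_hotel_volatile(price, distance_beach));
-- ERROR:  functions in index expression must be marked IMMUTABLE
```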
Creating the index is easy:
test=# CREATE INDEX idx_fix_hotel
       ON t_hotel (rate_hotel(price, distance_beach));
CREATE INDEX
Time: 22706.882 ms (00:22.707)
test=# ANALYZE;
ANALYZE
Time: 354.660 ms
Our new index really boosts things and reduces the runtime of this query to around 1 millisecond, which is roughly 20,000 times faster than before. The result is the same:
test=# SELECT rate_hotel(price, distance_beach), price, distance_beach
       FROM t_hotel
       ORDER BY 1
       LIMIT 10;
          rate_hotel          |      price       |     distance_beach
------------------------------+------------------+-------------------------
 0.15700293278521447089153382 | 199.127877652645 | 0.000000621657818555832
 0.3200968902259212440086465  | 77.0173728093505 | 0.0000172737054526806
 0.3452837023672139082940331  | 59.0800635144114 | 0.0000341562554240227
...
(10 rows)

Time: 1.024 ms
The execution plan shows that PostgreSQL will directly go to the index and fetch the data needed. Indexes in PostgreSQL return sorted data so there is no need for sorting and no need to touch more than a handful of rows:
test=# explain SELECT rate_hotel(price, distance_beach), price, distance_beach
       FROM t_hotel
       ORDER BY 1
       LIMIT 10;
                                QUERY PLAN
--------------------------------------------------------------------------
 Limit  (cost=0.43..1.12 rows=10 width=55)
   ->  Index Scan using idx_fix_hotel on t_hotel
         (cost=0.43..346214.73 rows=4999993 width=55)
(2 rows)
Of course, the approach is somewhat different from the one described in the paper about Skyline queries. However, the method I have shown is easy to implement, efficient, and serves most real-world purposes. You can make your rating function reasonably sophisticated and still maintain good performance.
Read on about PostgreSQL in our latest performance blogs.