PostgreSQL 10 will provide end users with countless new features. One of them is related to “Foreign Data Wrappers” and is generally known as “aggregate pushdown”. To be honest: this is one of my favorite new features of PostgreSQL 10, so it is worth sharing this piece of information with a broader audience. If you are interested in remote aggregation and distributed queries, keep reading.
Preparing PostgreSQL for a test
To show what the optimizer is already capable of, we need two databases:
iMac:~ hs$ createdb db01
iMac:~ hs$ createdb db02
Then we can deploy some simple test data in db02:
BEGIN;

CREATE TABLE t_test (id serial, name text);

INSERT INTO t_test (name)
    SELECT 'dummy' FROM generate_series(1, 1000000);

ANALYZE;
COMMIT;
The script generates 1 million rows, all carrying the same name (= “dummy”).
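If you want to verify the test data, a quick sanity check on db02 (an optional query, not part of the original script) should show 1 million rows and a single distinct name:

db02=# SELECT count(*), count(DISTINCT name) FROM t_test;
-- expected: count = 1000000, one distinct name ('dummy')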
Create a “database link” in PostgreSQL
For many years now, PostgreSQL has provided the means to access remote data sources using “Foreign Data Wrappers” (FDWs).
CREATE EXTENSION postgres_fdw;

CREATE SERVER pgserver
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (dbname 'db02', host 'localhost');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER pgserver
    OPTIONS (user 'hs');

IMPORT FOREIGN SCHEMA public
    FROM SERVER pgserver INTO public;
The script shown here loads the postgres_fdw extension, which allows us to connect to a remote PostgreSQL database. Then a virtual server pointing to db02 is created in db01. Finally, a user mapping is created and the foreign schema is imported. All tables in the remote database that can be found in the “public” schema will be linked and visible in db01.
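To double-check that the import worked, you can list the foreign tables now known to db01 (an optional check, not part of the original listing):

db01=# SELECT foreign_table_name, foreign_server_name
       FROM information_schema.foreign_tables;
-- t_test should show up here, served by pgserver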
Running a simple query in PostgreSQL
Once the test data is in place, we can give PostgreSQL a try and see how it behaves in the case of aggregates. Here is an example:
db01=# explain (analyze, verbose) SELECT name, count(*) FROM t_test GROUP BY 1;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Foreign Scan  (cost=107.31..137.93 rows=200 width=40) (actual time=192.244..192.245 rows=1 loops=1)
   Output: name, (count(*))
   Relations: Aggregate on (public.t_test)
   Remote SQL: SELECT name, count(*) FROM public.t_test GROUP BY name
 Planning time: 0.063 ms
 Execution time: 192.581 ms
(6 rows)
The most important observation here is that PostgreSQL is able to push down the complete aggregate. As you can see, the remote SQL is basically the same as the local query. The main advantage is that, by pushing down the aggregate, PostgreSQL can drastically reduce the load on your local machine as well as the amount of data that has to be sent over the network.
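If you want to see what the remote side actually does, you can take the “Remote SQL” shown in the plan and run it directly on db02 (timings will of course differ on your system):

db02=# explain analyze SELECT name, count(*) FROM public.t_test GROUP BY name;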
PostgreSQL Foreign Data Wrappers and joins
However, at this point a word of caution is necessary: yes, aggregates can be pushed down to a remote server. The thing is: joins happen before the aggregate. In other words, PostgreSQL has to transfer all the data from the remote host in the following case:
db01=# explain (analyze, verbose) SELECT a.name, count(*) FROM t_test AS a, t_test AS b WHERE a.id = b.id GROUP BY 1 ORDER BY 2, 1;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=1003.90..1004.40 rows=200 width=40) (actual time=4012.290..4012.290 rows=1 loops=1)
   Output: a.name, (count(*))
   Sort Key: (count(*)), a.name
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=994.26..996.26 rows=200 width=40) (actual time=4012.283..4012.283 rows=1 loops=1)
         Output: a.name, count(*)
         Group Key: a.name
         ->  Merge Join  (cost=588.18..894.45 rows=19963 width=32) (actual time=3382.674..3848.202 rows=1000000 loops=1)
               Output: a.name
               Merge Cond: (a.id = b.id)
               ->  Sort  (cost=222.03..225.44 rows=1365 width=36) (actual time=1691.089..1788.210 rows=1000000 loops=1)
                     Output: a.name, a.id
                     Sort Key: a.id
                     Sort Method: external sort  Disk: 21528kB
                     ->  Foreign Scan on public.t_test a  (cost=100.00..150.95 rows=1365 width=36) (actual time=0.311..1232.045 rows=1000000 loops=1)
                           Output: a.name, a.id
                           Remote SQL: SELECT id, name FROM public.t_test
               ->  Sort  (cost=366.15..373.46 rows=2925 width=4) (actual time=1691.579..1785.666 rows=1000000 loops=1)
                     Output: b.id
                     Sort Key: b.id
                     Sort Method: external sort  Disk: 17616kB
                     ->  Foreign Scan on public.t_test b  (cost=100.00..197.75 rows=2925 width=4) (actual time=0.400..1203.757 rows=1000000 loops=1)
                           Output: b.id
                           Remote SQL: SELECT id FROM public.t_test
 Planning time: 0.105 ms
 Execution time: 4071.736 ms
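One pragmatic workaround, assuming both tables really live in the same remote database, is to let the remote side do the join and the aggregation itself, for example by wrapping the query in a view on db02 and importing that view into db01 (the view name v_name_count is my own, this is just a sketch):

-- on db02:
CREATE VIEW v_name_count AS
    SELECT a.name, count(*) AS cnt
    FROM t_test AS a, t_test AS b
    WHERE a.id = b.id
    GROUP BY 1;

-- on db01:
IMPORT FOREIGN SCHEMA public LIMIT TO (v_name_count)
    FROM SERVER pgserver INTO public;

SELECT name, cnt FROM v_name_count ORDER BY cnt, name;

This way only the aggregated result travels over the network instead of both full table scans’ worth of rows.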
For PostgreSQL 11.0 we are working on a patch, which will hopefully make it into core, that allows PostgreSQL to perform as much of the aggregation as possible before the join happens. This makes joining cheaper, because less data ends up in the join. Many more improvements are possible and might be added to the planner in the near future.
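To get a feeling for what “aggregation before the join” means, here is a hand-written rewrite of the query above; the patch is about teaching the planner to consider this kind of transformation automatically when it is valid and cheaper (the subquery alias and column name are mine):

SELECT a.name, sum(b.cnt) AS count
FROM t_test AS a
     JOIN (SELECT id, count(*) AS cnt
           FROM t_test
           GROUP BY id) AS b ON a.id = b.id
GROUP BY 1
ORDER BY 2, 1;

Whether the inner aggregate is then actually pushed down to the remote server depends on the planner and your setup; the point here is simply to illustrate why joining pre-aggregated data is cheaper.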
However, as of PostgreSQL 10 a large step forward has already been made towards allowing PostgreSQL to dynamically distribute queries in a cluster.