GROUP BY is nothing new and is available in any relational database I am aware of. It is an integral part of SQL and PostgreSQL but what many people might not know is the fact that GROUP BY can do more than just group by simple fields. You can use expressions to group in an even more sophisticated way and here is how:

Importing test data

To show you what you can do with GROUP BY I have compiled some test data:

test=# CREATE TABLE t_oil (
          region       text,
          country      text,
          year         int,
          production   int,
          consumption  int
);
CREATE TABLE

This data set is pretty easy to understand. It tells us how much oil was produced and consumed by which country in the past. To load the data you can either download the file or simply ask COPY to load it directly from the web.

test=# COPY t_oil FROM PROGRAM
          'curl https://www.cybertec-postgresql.com/secret/oil_ext.txt';
COPY 644

To use COPY … FROM PROGRAM you have to be superuser otherwise it does not work for security reasons. In my example 644 rows have been loaded successfully.

Simple GROUP BY statements

Using GROUP BY is pretty simple. If we want to make some kind of analysis for each group we basically got two choices:

test=# SELECT region, avg(production) FROM t_oil GROUP BY 1;
      region   |          avg
---------------+-----------------------
 North America | 4541.3623188405797101
   Middle East | 1992.6036866359447005
(2 rows)

“GROUP BY 1” basically means the same as the following query:

test=# SELECT region, avg(production) FROM t_oil GROUP BY region;
     region    |          avg
---------------+-----------------------
 North America | 4541.3623188405797101
   Middle East | 1992.6036866359447005
(2 rows)

“GROUP BY region” and “GROUP BY 1” are therefore identical. It is a question of faith which type of syntax you prefer. People have told me once than once that one or the other syntax is “evil”. In reality it makes absolutely no difference. It is just syntactic sugar.

GROUP BY “expression”

Most people group by one or more columns or no column at all:

test=# SELECT avg(production) FROM t_oil WHERE country = 'USA';
          avg
-----------------------
 9141.3478260869565217
(1 row)

However, there is more: You can also use an expression to determine the groups on the fly.
Here is an example:

test=# SELECT production > 9000, count(production) 
       FROM   t_oil 
       WHERE  country = 'USA' 
       GROUP BY production > 9000;
 ?column? | count
----------+-------
        f |   20
        t |   26
(2 rows)

In this case we got two groups: One group is for rows greater than 9000 and one for rows lower or equal than 9000. The name of those two groups is therefore “true” (> 9000) or “false” (<= 9000). You can use any expression to calculate those groups on your own.

Consider the next example:

test=# SELECT count(production) 
       FROM     t_oil 
       WHERE    country = 'USA' 
       GROUP BY CASE WHEN year % 2 = 0 THEN true ELSE false END; 
 count 
-------
 23
 23 
(2 rows)

In this example we are counting odd and even years. Note that it is not necessary to list the grouping criterial in the SELECT clause. You might not understand the result if you miss half of the column list but you are free to do that. What is also interesting is that you can use a full SQL query in the GROUP BY clause.

The following statement is identical to what you have just seen:

test=# SELECT count(production) 
       FROM   t_oil WHERE country = 'USA' 
       GROUP BY (SELECT CASE WHEN year % 2 = 0 THEN true ELSE false END);
 count 
-------
 23
 23 
(2 rows)

If you run those queries it makes sense to take a look at the execution plan. As you can see the plan is just like any other GROUP BY statement. It is also noteworthy that PostgreSQL does a really good job to estimate the number of groups:

test=# explain SELECT count(production) 
       FROM    t_oil 
       WHERE   country = 'USA' 
       GROUP BY (SELECT CASE WHEN year % 2 = 0 THEN true ELSE false END); 
                           QUERY PLAN 
-------------------------------------------------------------
 HashAggregate (cost=14.97..15.02 rows=2 width=9)
   Group Key: (SubPlan 1) 
   -> Seq Scan on t_oil (cost=0.00..14.74 rows=46 width=5)
      Filter: (country = 'USA'::text)
      SubPlan 1
      -> Result (cost=0.00..0.01 rows=1 width=1)
(6 rows)

PostgreSQL successfully figured out that two groups are to be expected.
HAVING clauses done right

Recently people have asked us about aliases in the HAVING clause: Can one use aliases in a HAVING clause?

test=# SELECT count(production) AS x 
       FROM   t_oil 
       WHERE country = 'USA' 
       GROUP BY year < 1990 HAVING x > 22;
ERROR: column "x" does not exist
LINE 1: ...il WHERE country = 'USA' GROUP BY year < 1990 HAVING x > 22;
^

The answer is simply no. SQL does not allow that. I remember seeing people who actually wanted to implement this feature in PostgreSQL over the years but such a feature is not going to be supported.

If you want to use a HAVING clause it is necessary to explicitly use the entire expression:

test=# SELECT count(production) AS x 
       FROM   t_oil 
       WHERE country = 'USA' 
       GROUP BY year < 1990 HAVING count(production) > 22;
 x
----
 25
(1 row)

However, consider the following example:

test=# SELECT count(production) AS x 
       FROM   t_oil 
       WHERE  country = 'USA' 
       GROUP BY year < 1990 HAVING avg(production) > 0;
 x
----
21
25
(2 rows)

As you can see we use “count” in the SELECT clause, but we use “avg” in the HAVING clause. This is definitely possible and in some cases really useful. You can use different aggregate functions in GROUP BY and HAVING clauses without any problems.

Making use of grouping sets

So far you have seen what GROUP BY can do for you in the simple case butthere is more. I don’t want to go into too much detail in this blog but just one thing: You can do more than just one aggregation at the same time to speed up your query. Here is an example:

test=# SELECT year < 1990, count(production) AS x 
       FROM   t_oil 
       WHERE  country = 'USA' 
       GROUP BY GROUPING SETS ((year < 1990), ());
 ?column? | x
----------+----
          | 46
        f | 21
        t | 25
(3 rows)

In this case we got two grouping operations: One containing all rows and one for before respectively after 1990.

If you don’t like this syntax you can also try ROLLUP which is equivalent to the query you have just seen:

test=# SELECT year < 1990, count(production) AS x 
       FROM   t_oil 
       WHERE  country = 'USA' 
       GROUP BY ROLLUP (year < 1990);
 ?column? | x
----------+----
          | 46
        f | 21
        t | 25
(3 rows)

ROLLUP is basically the “bottom line” – it adds one additional row counting everything.

Finally …

If you want to find out more about grouping in general I recommend checking out my post about speeding up GROUP BY in general.