The problem with parallel query and limits: three runners should stop as soon as they have, in sum, reached one kilometer
© Laurenz Albe 2024

 

I recently investigated a surprising behavior of the DBeaver database client in connection with PostgreSQL parallel query, which I’d like to share with you. This might be interesting for everybody who accesses PostgreSQL using the JDBC driver.

Some basics about PostgreSQL parallel query

Parallel query concepts

Parallel query was introduced in PostgreSQL 9.6 and has been improved in later versions. It breaks with the “traditional” PostgreSQL architecture of using a single backend process per database connection to process SQL statements. If the optimizer thinks that parallel processing would reduce the execution time, it will plan additional parallel worker processes. These processes are created by the query executor and live only for the duration of a single SQL statement. The parallel worker processes calculate intermediate results, which eventually are collected at the original backend process. This collection happens in the “Gather” node of the execution plan: PostgreSQL executes all steps below the Gather node in parallel, while everything above Gather is single-threaded.

Parallel query limits

While parallel query can speed up query execution, it incurs some overhead:

  • starting parallel worker processes is an expensive operation
  • exchanging data between worker processes and the backend requires inter-process communication via dynamic shared memory segments (DSM)

The PostgreSQL optimizer takes this overhead into account by planning parallel query only expensive statements that process large tables or indexes. There are some additional limits to prevent parallel query from consuming too many resources:

If the pool defined by max_parallel_workers is exhausted, the query executor cannot start all the parallel processes that the optimizer planned.

This brief introduction is far from exhaustive. For more information, read the PostgreSQL documentation and our articles about parallel query optimization, parallel DDL, parallel aggregate or the interaction of parallel query and SERIALIZABLE isolation.

The surprising DBeaver problem with parallel query

The query I will use to demonstrate the problem is a simple query that counts the number of rows in a table with 5 million rows. PostgreSQL plans a parallel sequential scan on the table:

EXPLAIN (COSTS OFF) SELECT count(*) FROM bar;

                 QUERY PLAN                 
════════════════════════════════════════════
 Finalize Aggregate
   ->  Gather
         Workers Planned: 2
         ->  Partial Aggregate
               ->  Parallel Seq Scan on bar

Using DBeaver, I can use EXPLAIN (ANALYZE) to verify that the executor actually launches the parallel worker processes:

DBeaver running EXPLAIN (ANALYZE), which shows that parallel query is used

But running the query in DBeaver actually takes much longer than the 491 milliseconds. That is strange, because EXPLAIN (ANALYZE) adds notable overhead to the execution of a statement! To find out what actually happens on the server, I use the auto_explain extension to collect the actual execution plan in the PostgreSQL server log:

LOG:  duration: 712.369 ms  plan:
Query Text: SELECT count(*) FROM bar
Finalize Aggregate  (...) (actual time=712.361..712.363 rows=1 loops=1)
  ->  Gather  (...) (actual time=712.355..712.356 rows=1 loops=1)
        Workers Planned: 2
        Workers Launched: 0
        ->  Partial Aggregate  (...) (actual time=712.354..712.354 rows=1 loops=1)
              ->  Parallel Seq Scan on bar  (...) (actual time=0.006..499.970 rows=5000000 loops=1)

PostgreSQL didn’t start any parallel workers, so the query had to execute single-threaded. That explains the longer duration. But what kept PostgreSQL from starting parallel workers? The first suspicion would be that the pool of max_parallel_workers was exhausted. But no other queries are running on this server! Time to dig deeper.

Limitations for parallel query

The PostgreSQL documentation has a chapter on the limitations of parallel query. For example, most data modifying statements don’t support parallel query. Also, queries that use PARALLEL UNSAFE functions or are executed in a cursor cannot use parallel query. But none of these limitations can affect us here, because then the optimizer wouldn’t have planned parallel workers at all. The problem must occur during the execution of the query.

At the very end of the page, we find the following limitation:

Even when parallel query plan is generated for a particular query, there are several circumstances under which it will be impossible to execute that plan in parallel at execution time. […]

  • The client sends an Execute message with a non-zero fetch count. See the discussion of the extended query protocol. Since libpq currently provides no way to send such a message, this can only occur when using a client that does not rely on libpq.

Ha! Looks like we have struck gold. DBeaver uses the JDBC driver to access PostgreSQL, which does not use libpq to access PostgreSQL.

Reproducing the problem with JDBC

You can see a description of the “execute” message mentioned above in the PostgreSQL documentation. The final argument is the fetch count. The PostgreSQL JDBC driver uses that feature to implement the method setMaxRows(int), available in java.sql.Statement (and inherited by java.sql.PreparedStatement and java.sql.CallableStatement).

This simple Java program reproduces the problem:

public class Parallel {
    public static void main(String[] args) throws ClassNotFoundException, java.sql.SQLException {
        Class.forName("org.postgresql.Driver");

        java.sql.Connection conn =
            java.sql.DriverManager.getConnection(
                "jdbc:postgresql://127.0.0.1:5432/test?user=xxx&password=xxx"
            );

        java.sql.Statement stmt = conn.createStatement();
        stmt.setMaxRows(200);
        java.sql.ResultSet rs =
            stmt.executeQuery("SELECT count(*) FROM bar");
        rs.next();
        System.out.println(rs.getString(1));

        rs.close();
        stmt.close();
        conn.close();
    }
}

If you set the row limit to 0 (the default), the problem goes away.

Getting parallel query to work with DBeaver

The remaining piece of the puzzle is to figure out how to configure the row count limit in DBeaver. After some digging in the documentation, I found that the answer was right before my eyes when I looked at the “Data Editor” pane:

DBeaver with an arrow pointing to where you can configure the row limit

All you have to do is to change the limit to 0! Alternatively, you can click the “blue flower” left of the row limit to bring up the “Properties” dialog. There you can change the ”ResultSet fetch size” in the ”Data Editor” properties. If you want to change the setting for good, switch to the “Global settings” in the top right corner.

Once you change the setting, PostgreSQL will start parallel workers as desired.

So I cannot have parallel processing with a result set limit?

Sure you can: use the LIMIT clause rather than setMaxRows(). This query can use parallel workers:

EXPLAIN (COSTS OFF) SELECT count(*) FROM bar LIMIT 200;

                    QUERY PLAN                    
══════════════════════════════════════════════════
 Limit
   ->  Finalize Aggregate
         ->  Gather
               Workers Planned: 2
               ->  Partial Aggregate
                     ->  Parallel Seq Scan on bar

The difference between LIMIT and setMaxRows() is that the former is known at query planning time, while the latter only at query execution time. In the above query, the optimizer knows that the LIMIT is applied after the Gather node, when there is no more parallelism. The executor doesn’t know that, so it disables parallelism to be on the safe side. There is another reason why it is better to use LIMIT: if the optimizer knows that you only need the first couple of result rows, it can choose a plan that yields those first rows quickly rather than optimizing for the whole result set.

Conclusion

The default configuration of DBeaver uses java.sql.Statement.setMaxRows(int) to limit the size of a result set to 200. This triggers a little known limitation in PostgreSQL that prevents it from starting parallel worker processes. The problem is easy to avoid by setting the limit to 0, once you have understood what’s going on.

Use LIMIT at query planning time rather than a row count at execution time if you want parallelism.