
How PostgreSQL estimates parallel queries

October 2019

Parallel queries were introduced back in PostgreSQL 9.6, and the feature has been extended ever since. In PostgreSQL 11 and PostgreSQL 12, even more functionality has been added to the database engine. However, there remain some questions related to parallel queries which often pop up during training and which definitely deserve some clarification.

Estimating the cost of a sequential scan

To show you how the process works, I have created a simple table containing just two columns:
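A minimal sketch of how such a table could be set up (the table name t_test is used later in the post; the exact data-generating statement is an assumption):

CREATE TABLE t_test (many int, few int);

-- 10 million rows: "many" is unique, "few" only ever holds 0 or 1
INSERT INTO t_test
    SELECT x, x % 2
    FROM generate_series(1, 10000000) AS x;

ANALYZE t_test;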

The "many" column contains 10 million different entries. The "few" column will contain two different ones. However, for the sake of this example, all reasonably large tables will do.

The query we want to use to show how the PostgreSQL optimizer works is pretty simple:
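In line with the counting logic discussed below, this is presumably a plain count over the whole table (the exact statement is an assumption):

SELECT count(*) FROM t_test;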

The default configuration will automatically make PostgreSQL go for a parallel sequential scan; we want to prevent it from doing that in order to make things easier to read.

Turning off parallel queries can be done by setting the max_parallel_workers_per_gather parameter to 0. As you can see in the execution plan, the cost of the sequential scan is estimated to be 144,248, while the total cost of the query is expected to be 169,248. But how does PostgreSQL come up with these numbers? Let's take a look at the following listings:
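A sketch of what those listings could look like (the plan shape and exact decimals are assumptions; the headline figures match the numbers discussed here):

SET max_parallel_workers_per_gather TO 0;

EXPLAIN SELECT count(*) FROM t_test;
--  Aggregate  (cost=169248.60..169248.61 rows=1 width=8)
--    ->  Seq Scan on t_test  (cost=0.00..144248.48 rows=10000048 width=0)

SELECT pg_size_pretty(pg_relation_size('t_test')) AS table_size,
       pg_relation_size('t_test') / 8192 AS blocks;
-- around 350 MB / 44248 blocks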

The t_test table is around 350 MB in size and consists of 44,248 blocks. Each block has to be read and processed sequentially, and all rows in those blocks have to be counted to produce the final result. The optimizer uses the following formula to estimate the cost:
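With the default cost parameters (seq_page_cost = 1, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025) and the row estimate of 10,000,048, the calculation can be reproduced with a query like this (a sketch of the arithmetic, not the optimizer's actual code):

SELECT current_setting('seq_page_cost')::numeric     * 44248      -- read 44248 blocks sequentially
     + current_setting('cpu_tuple_cost')::numeric    * 10000048   -- push every row through the CPU
     + current_setting('cpu_operator_cost')::numeric * 10000048;  -- "+1" for every row counted

-- 44248 + 100000.48 + 25000.12 = 169248.60; the first two terms alone
-- account for the 144,248 attributed to the sequential scan itself.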

As you can see, a couple of parameters are being used here: seq_page_cost tells the optimizer the cost of reading a block sequentially. On top of that, we have to account for the fact that all those rows have to travel through the CPU (cpu_tuple_cost) before they are finally counted. cpu_operator_cost is used because counting is basically the same as calling "+1" for every row. The total cost of the query is therefore 169,248, which is exactly what we see in the plan.

Estimating parallel sequential scans

The way PostgreSQL estimates parallel sequential scans is often a bit obscure to people, and during database training here at CYBERTEC many people ask about this topic. Let's therefore take a look at the execution plan and see what happens:
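With max_parallel_workers_per_gather back at its default value, the plan might look roughly like this (the plan shape is a sketch; the worker count and the rows=4166687 estimate are the figures discussed below):

RESET max_parallel_workers_per_gather;

EXPLAIN SELECT count(*) FROM t_test;
--  Finalize Aggregate
--    ->  Gather
--          Workers Planned: 2
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on t_test  (... rows=4166687 ...)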

As you can see, PostgreSQL decided on using two CPU cores. But how did PostgreSQL come up with this part of the plan: "rows=4166687"?

The answer lies in the following formula:

10000048.0 / (2 + (1 - 0.3 * 2)) = 4166686.66 rows

10,000,048 rows is the number of rows PostgreSQL expects to be in the table (as determined by ANALYZE beforehand). The next thing is that PostgreSQL tries to determine how much work has to be done by a single core. But what does the formula actually mean?

estimate = estimated_rows / (number_of_cores + (1 - leader_contribution * number_of_cores))


The leader process often contributes a fair amount of work to the result. However, PostgreSQL assumes that the leader spends around 30% of its time per worker servicing the worker processes. The leader's contribution therefore goes down as the number of cores grows. If there are 4 or more cores at work, the leader is not expected to make a meaningful contribution to the scan anymore, so PostgreSQL will simply divide the estimated number of rows by the number of cores instead of using the formula just shown.
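Just to illustrate, the divisor can be mimicked in plain SQL (this is a sketch of the formula above, not a function PostgreSQL exposes):

SELECT workers,
       workers + greatest(0, 1 - 0.3 * workers) AS parallel_divisor,
       round(10000048 / (workers + greatest(0, 1 - 0.3 * workers))) AS rows_per_worker
FROM generate_series(1, 6) AS workers;

-- With 2 workers the divisor is 2.4, which yields the rows=4166687 estimate seen in the plan;
-- from 4 workers onwards the leader's share drops out and the divisor is just the worker count.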

Other parallel operations

Other parallel operations use similar divisors to estimate the amount of effort needed; bitmap scans and so on work the same way. If you want to learn more about PostgreSQL, and in particular how to speed up CREATE INDEX, consider checking out our blog post.

Comment from Glyn Astill (4 years ago):

Interesting post.

Could have been completed by showing how postgres comes up with the parallel cost though:

SELECT current_setting('seq_page_cost')::numeric * 44248
+ current_setting('cpu_tuple_cost')::numeric * 4166686.66
+ current_setting('cpu_operator_cost')::numeric * 4166686.66
+ current_setting('parallel_setup_cost')::numeric;

97331.583250
