
About sequences

Sequences are used to generate values for artificial numeric primary key columns in tables.
A sequence provides a “new ID” that is guaranteed to be unique, even if many database sessions are using the sequence at the same time.

Sequences are not transaction safe, because they are not supposed to block the caller. That is not a shortcoming, but intentional.

As a consequence, a transaction that requests a new value from the sequence and then rolls back will leave a “gap” in the values committed to the database. In the rare case that you really need a “gap-less” series of values, a sequence is not the right solution for you.

PostgreSQL's traditional way of using sequences (nextval('my_seq')) differs from the SQL standard, which uses NEXT VALUE FOR my_seq.

New developments in PostgreSQL v10

Identity columns

PostgreSQL v10 has introduced identity columns, the standard SQL way of defining a table with an automatically generated unique value.

Here is an example:
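-- table and column names are invented for this example
CREATE TABLE my_tab (
   id  bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
   val text NOT NULL
);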

Behind the scenes, this uses a sequence, and it is roughly equivalent to the traditional serial column:
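CREATE TABLE my_tab (
   id  bigserial PRIMARY KEY,
   val text NOT NULL
);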

which is a shorthand for
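CREATE SEQUENCE my_tab_id_seq;

CREATE TABLE my_tab (
   id  bigint PRIMARY KEY DEFAULT nextval('my_tab_id_seq'),
   val text NOT NULL
);

ALTER SEQUENCE my_tab_id_seq OWNED BY my_tab.id;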

The problem with such a primary key column is that the generated value is a default value, so if the user explicitly inserts a different value into this column, it will override the generated one.

This is usually not what you want, because it will lead to a constraint violation error as soon as the sequence counter reaches the same value. Rather, you want the explicit insertion to fail, since it is probably a mistake.

For this you use GENERATED ALWAYS:
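CREATE TABLE my_tab (
   id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
   val text NOT NULL
);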

You can still override the generated value, but you'll have to use the OVERRIDING SYSTEM VALUE clause for that, which makes it much harder for such an INSERT to happen by mistake:
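-- the inserted values are just examples
INSERT INTO my_tab (id, val)
   OVERRIDING SYSTEM VALUE
   VALUES (42, 'some value');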

New system catalog pg_sequence

Before PostgreSQL v10, Postgres stored a sequence's metadata (starting value, increment and others) in the sequence itself.

This information is now stored in a new catalog table pg_sequence.

The only data that remain in the sequence itself are the data changed by the sequence manipulation functions nextval and setval.
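For example, using the sequence from the examples above:

SELECT * FROM pg_sequence
WHERE seqrelid = 'my_tab_id_seq'::regclass;

-- only last_value, log_cnt and is_called are left in the sequence itself:
SELECT * FROM my_tab_id_seq;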

Transactional DDL for sequences

A sequence in PostgreSQL is a “special table” with a single row.

In “normal tables”, an UPDATE does not modify the existing row, but writes a new version of it and marks the old version as obsolete. Since sequence operations should be fast and are never rolled back, PostgreSQL can be more efficient by just modifying the single row of a sequence in place whenever its values change.

Since prior to PostgreSQL v10 all metadata of a sequence were kept in the sequence (as explained in the previous section), this had the downside that ALTER SEQUENCE, which also modified the single row of a sequence, could not be rolled back.

Since PostgreSQL v10 has given us pg_sequence, and catalog modifications are transaction safe in PostgreSQL, this limitation could be removed in v10.

Performance regression with ALTER SEQUENCE

When I said above that ALTER SEQUENCE has become transaction safe just by introducing a new catalog table, I cheated a little. There is one variant of ALTER SEQUENCE that modifies the values stored in a sequence:
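-- sequence name assumed from the examples above
ALTER SEQUENCE my_tab_id_seq RESTART WITH 1000;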

If only some variants of ALTER SEQUENCE were transaction safe and others weren't, this would lead to surprising and buggy behavior.

That problem was fixed by a commit during PostgreSQL v10 development.

This means that every ALTER SEQUENCE statement will now create a new data file for the sequence; the old one gets deleted during COMMIT. This is similar to the way TRUNCATE, CLUSTER, VACUUM (FULL) and some ALTER TABLE statements are implemented.

Of course this makes ALTER SEQUENCE much slower in PostgreSQL v10 than in previous releases, but you can expect this statement to be rare enough that it should not cause a performance problem.

However, there is this old blog post by depesz that recommends the following function to efficiently get a gap-less block of sequence values:
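In essence, the function does something like this (a rough sketch - names and the advisory lock key are simplified here):

CREATE OR REPLACE FUNCTION multi_nextval(use_seqname regclass, use_increment integer)
   RETURNS bigint AS
$$
DECLARE
   reply bigint;
BEGIN
   -- serialize concurrent callers (fixed lock key for simplicity)
   PERFORM pg_advisory_lock(4711);
   EXECUTE format('ALTER SEQUENCE %s INCREMENT BY %s', use_seqname, use_increment);
   -- with the larger increment, nextval() jumps to the last value of the block
   reply := nextval(use_seqname);
   EXECUTE format('ALTER SEQUENCE %s INCREMENT BY 1', use_seqname);
   PERFORM pg_advisory_unlock(4711);
   RETURN reply;
END;
$$ LANGUAGE plpgsql;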

This function returns the last value of the gap-less sequence value block (and does not work correctly when called on a newly created sequence).

Since this function calls ALTER SEQUENCE not only once but twice, you can imagine that every application that uses it a lot will experience quite a performance hit when upgrading to PostgreSQL v10.

Fortunately you can achieve the same thing with the normal sequence manipulation functions, so you can have a version of the function that will continue performing well in PostgreSQL v10:
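A sketch of such an improved version, using nextval and setval instead of ALTER SEQUENCE (again with a simplified advisory lock key):

CREATE OR REPLACE FUNCTION multi_nextval(use_seqname regclass, use_increment integer)
   RETURNS bigint AS
$$
DECLARE
   reply bigint;
BEGIN
   PERFORM pg_advisory_lock(4711);
   -- reserve a block of use_increment values starting at the next value
   reply := nextval(use_seqname);
   PERFORM setval(use_seqname, reply + use_increment - 1, TRUE);
   PERFORM pg_advisory_unlock(4711);
   -- return the last value of the reserved block
   RETURN reply + use_increment - 1;
END;
$$ LANGUAGE plpgsql;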

If you want to get the first value of the sequence value block, use RETURN reply;

Note that both the original function and the improved one use advisory locks. That means they will only work reliably if the sequence is only used via that function.

 



“How does the PostgreSQL optimizer handle views?” or “Are views good or bad?” I assume that every database consultant and every SQL performance expert has heard this kind of question already. Given the fact that views are a really essential feature of SQL, it makes sense to take a closer look at the topic in general, and hopefully help some people to write better and faster code.

PostgreSQL and the optimizer: What they do to views

Let us create a simple table containing just 10 rows, which can be used throughout the blog to show how PostgreSQL works and how the optimizer treats things:
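-- names are invented for this example
CREATE TABLE t_demo (id int, name text);

INSERT INTO t_demo
   SELECT i, 'dummy ' || i
   FROM generate_series(1, 10) AS i;

-- a simple view filtering the data
CREATE VIEW v AS
   SELECT *
   FROM t_demo
   WHERE id < 5;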

The idea here is to filter some data and return all the columns.

Inlining and flattening

The key aspect is: The optimizer will process the view just like a “pre-processor” directive. It will try to inline the code and to flatten it. Here is an example:
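-- reading from the sample view with an additional filter
EXPLAIN SELECT * FROM v WHERE id > 2;

-- on this tiny table, expect a plain sequential scan of t_demo
-- with both filters applied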

When we try to read from the view it is just like running the SQL statement directly. The optimizer will perform the following steps:
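First, the view reference is replaced by its definition (inlining), which conceptually gives us:

SELECT *
FROM  (SELECT * FROM t_demo WHERE id < 5) AS v
WHERE id > 2;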

In the next step the subselect will be flattened out completely, which leaves us with:
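-- using the sample names from above
SELECT *
FROM  t_demo
WHERE id < 5
  AND id > 2;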

We could have done this transformation by ourselves, but it is easier to make the optimizer do it for us and enjoy the convenience offered by the view.

Joining views

Basically the same mechanism applies when you are working with joins. PostgreSQL will again inline, flatten and then create a plan to join those tables together. The output is therefore not surprising:
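-- joining the sample view to itself
EXPLAIN SELECT *
FROM  v AS a
      JOIN v AS b ON a.id = b.id;

-- both view references are inlined and flattened, so the plan simply
-- joins t_demo to itself (e.g. two sequential scans fed into a join)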

Again PostgreSQL provides us with a simple plan.

Destructive behavior

However, views are not only good. Sometimes they can be a little destructive too. Consider the following example:
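-- continuing the example from above: a view that already orders its output
CREATE VIEW v_ordered AS
   SELECT *
   FROM t_demo
   WHERE id < 5
   ORDER BY name;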

In this case the view provides us with some ordered data. Let us see what happens in the following query:
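EXPLAIN SELECT * FROM v_ordered ORDER BY id;

-- expect two Sort steps in the plan: one for ORDER BY name inside the view
-- and one for ORDER BY id in the outer query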

As you can see PostgreSQL will sort the data twice, which is definitely not good for performance. PostgreSQL won't optimize this for you, so you have to take action yourself and be very aware of the issue.

In general it is a good idea to make a mental distinction between two types of views:

• Final results
• Intermediate results

If a view already contains the “final” results, you can do sorting and all kinds of formatting. However, if you are using a view only as an intermediate result to perform some large calculations on top of it, it might be a good idea to avoid formatting, sorting and the like.

However, there is more to views. Consider the following example:
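-- a LIMIT far higher than the number of rows in the table
CREATE VIEW v_limited AS
   SELECT *
   FROM t_demo
   LIMIT 100000000;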

In this case the view contains a LIMIT clause which, given the size of the table, basically says that “all” the data will be returned. However, during these formal transformations the optimizer cannot “rely” on the fact that all data will be included in the final result. Therefore the optimizer cannot silently skip the LIMIT clause, which of course leads to the following plan:
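EXPLAIN SELECT * FROM v_limited WHERE id = 3;

-- expect the Limit node to survive in the plan: the filter on id is applied
-- on top of it and the subquery cannot be flattened away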

Conclusion

As always, the world is neither only black nor white. Views are a valuable feature, and when used sensibly there is nothing wrong with them. However, keep in mind what is going on behind the scenes and avoid building views on top of views on top of views, and so on. Here is the documentation on query planning and the optimizer, for reference.

To learn more about query tuning, see our latest blogs on the topic.

Pgbench is a very well-known and handy built-in tool that Postgres DBAs can use for quick performance benchmarking. Its main functionality/flow is super simple, but it also has some great optional features, like running custom scripts and specifying different probabilities for them. One can also use bash commands to fill query variables for example.

But the thing that has always annoyed me a bit is the fact that one cannot specify the desired database or table size, but has to think in so-called "scaling factor" numbers. And we know from the documentation that a scaling factor of "1" means 100k rows in the main data table "pgbench_accounts". But how does a scaling factor of 100 (i.e. 10 million bank accounts) translate to disk size? Which factor do I need to input when I want to quickly generate a database of 50 GB, to see how random updates would perform in case the dataset does not fit into RAM/shared buffers? Currently there is a bit of trial and error involved 🙁 So how to get rid of the guesswork and be a bit more deterministic? Well, I guess one needs a formula that translates a target DB size into a scaling factor!

Test data generation

One way to arrive at the magic formula would be to generate a lot of sample data for various scaling factors, measure the resulting on-disk sizes and derive a formula from them. That's exactly what I've done.

So I hatched up the script below and booted up a GCE instance, destined to churn the disks nonstop for a bit more than a whole day, as it turned out - I'm sorry, little computer 🙂 The script is very simple - it runs the pgbench schema initialization with different scaling factors from 1 to 50k and stores the resulting DB/table/index sizes in a table, so that later some statistics/maths can be applied to infer a formula. NB! The Postgres version used was 10.1.
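The measuring part boils down to something like this (a simplified sketch; the surrounding shell loop runs "pgbench -i -s $SCALE --unlogged-tables" for each scale before these statements, and names are mine):

CREATE TABLE IF NOT EXISTS pgbench_sizes (
    scale     int,
    db_mb     numeric,
    table_mb  numeric,
    index_mb  numeric
);

-- :scale is a psql variable set by the surrounding shell loop
INSERT INTO pgbench_sizes
SELECT :scale,
       pg_database_size(current_database())       / 1048576.0,
       pg_table_size('pgbench_accounts')          / 1048576.0,
       pg_relation_size('pgbench_accounts_pkey')  / 1048576.0;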

Turning test data into a formula

After the script finished (25h later... note that without the "--unlogged-tables" flag it would take a lot longer), I had a nice table showing how different scaling factors translate to database size on disk, measured in MB (MiB/mebibytes to be exact, i.e. 1 MiB = 1,048,576 bytes), as this is what the "psql" client uses when reporting object sizes.

So now how to turn this data around so that we get a formula to calculate scaling factors, based on an input target DB size? Well, I'm sure there are quite a few ways and countless statistics/data-mining techniques that could be applied, but as data science is not my strongest skill, I thought I'd first take the simplest possible road of using the built-in statistics functions of a spreadsheet program and see how it fares. If the estimation didn't look good, I could look for something more sophisticated, scikit-learn or such. But luckily it worked out quite well!

Simple regression analysis with LibreOffice

LibreOffice (my office suite of choice) is nowhere near MS Excel, but it does have some quite good statistical regression functionality built in: one can calculate regression constants together with accuracy coefficients, or calculate the formulas and draw them on charts as "trend lines". I went the visual way. It goes something like this – create a normal chart, activate it with a double-click, then right-click on some data point on the chart and select "Insert Trend Line" from the pop-up menu. On the next screen one can choose whether the "to be fitted" formula for your data should be a simple linear one or a logarithmic/exponential/polynomial one. In the case of the pgbench data at hand though, after trying out all of them, it appeared that there was no real extra accuracy from the more complex formulas, so I decided to apply KISS here, meaning a simple linear formula. If you look at the script you'll see that we also gathered size data for the "pgbench_accounts" table and its index; thus, for completeness, I also generated the formulas for calculating scaling factors from target "pgbench_accounts" table/index sizes, but mostly they should not be of interest, as that one table makes up ~99% of the DB size.

The magic formula

So without further ado, below are the resulting formulas and a graph showing how pgbench scale changes according to DB size. I also created a small JSFiddle that you can bookmark here if you happen to work with pgbench a lot.

NB! Accuracy here is not 100% perfect of course, as there are some non-linear components (pgbench_accounts.aid changes to int8 from scale 20k, index leaf density varies minimally), but it is really "good enough", meaning <1% accuracy error. I hope you find it useful, and it would be nice if something similar ended up in the "pgbench" tool itself one day.

Target object                  | Scale formula
DB                             | 0.0669 * DB_Size_Target_MB - 0.5
Table (pgbench_accounts)       | 0.0781 * Table_Size_Target_MB
Index (pgbench_accounts_pkey)  | 0.4668 * Index_Size_Target_MB
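For example, for the 50 GB database mentioned at the beginning (50 * 1024 = 51200 MiB), the DB formula gives roughly:

SELECT round(0.0669 * 51200 - 0.5) AS required_scale;   -- ≈ 3425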

 

DB size to Scale graph

Detect PostgreSQL performance problems with ease: Is there a single significantly large and important database in the world which does not suffer from performance problems once in a while? I bet that there are not too many. Therefore, every DBA (database administrator) in charge of PostgreSQL should know how to track down potential performance problems to figure out what is really going on.

Improving PostgreSQL performance beyond parameter tuning

Many people think that changing parameters in postgresql.conf is the real way to success. However, this is not always the case. Sure, more often than not, good database config parameters are highly beneficial. Still, in many cases the real problems will come from some strange query hidden deep in some application logic. It is even quite likely that those queries causing real issues are not the ones you happen to focus on. The natural question now arising is: how can we track down those queries and figure out what is really going on? My favorite tool to do that is pg_stat_statements, which in my judgement should always be enabled if you are using PostgreSQL 9.2 or higher (please do not use it in older versions).

Enabling pg_stat_statements

To enable pg_stat_statements on your server change the following line in postgresql.conf and restart PostgreSQL:

shared_preload_libraries = 'pg_stat_statements'

Once this module has been loaded into the server, PostgreSQL will automatically start to collect information. The good thing is that the overhead of the module is really really low (the overhead is basically just “noise”).

Then run the following command to create the necessary view to access the data:
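CREATE EXTENSION pg_stat_statements;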

The extension will deploy a view called pg_stat_statements and make the data easily accessible.

Detecting slow queries in PostgreSQL

The easiest way to find the most interesting queries is to sort the output of pg_stat_statements by total_time:
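SELECT query, calls, total_time
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;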

The beauty here is that the type of query which is consuming most of the time will naturally show up at the top of the listing. The best way is to work your way down from the first to, say, the 10th query and see what is going on there.
In my judgement there is no way to tune a system without looking at the most time-consuming queries on the database server.

Read more about detecting slow queries in PostgreSQL. 

Taking a deep look at PostgreSQL performance problems

pg_stat_statements has a lot more to offer than just the query and the time it has eaten. Here is the structure of the view:
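The exact set of columns varies a bit between PostgreSQL versions, but the most interesting ones can be seen with a query like this:

SELECT calls, total_time, mean_time, stddev_time, rows,
       shared_blks_hit, shared_blks_read,
       temp_blks_read, temp_blks_written,
       blk_read_time, blk_write_time,
       query
FROM pg_stat_statements
LIMIT 5;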

It can be quite useful to take a look at the stddev_time column as well. It will tell you if queries of a certain type tend to have similar runtimes or not. If the standard deviation is high you can expect some of those queries to be fast and some of them to be slow, which might lead to bad user experience.

The “rows” column can also be quite informative. Suppose 1000 calls have returned 1,000,000,000 rows in total: that means that every call has returned 1 million rows on average. It is easy to see that returning so much data all the time is not a good thing to do.

If you want to check whether a certain type of query shows bad caching performance, the shared_blks_* columns will be of interest. In short: PostgreSQL is able to tell you the cache hit rate of every single type of query in case pg_stat_statements has been enabled.
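For example, a per-statement cache hit ratio can be computed along these lines:

SELECT substring(query, 1, 40) AS short_query,
       shared_blks_hit,
       shared_blks_read,
       round(100.0 * shared_blks_hit /
             nullif(shared_blks_hit + shared_blks_read, 0), 2) AS hit_percent
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 10;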

It also makes sense to take a look at the temp_blks_* fields. Whenever PostgreSQL has to go to disk to sort or to materialize, temporary blocks will be needed.

Finally there are blk_read_time and blk_write_time. Usually those fields are empty unless track_io_timing is turned on. The idea here is to be able to measure the amount of time a certain type of query spends on I/O. It will allow you to answer the question whether your system is I/O bound or CPU bound. In most cases it is a good idea to turn on I/O timing because it will give you vital information.
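It can be switched on for the whole instance like this (a superuser is required):

ALTER SYSTEM SET track_io_timing = on;
SELECT pg_reload_conf();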

Dealing with Java and Hibernate

pg_stat_statements delivers good information. However, in some cases it can cut off the query because of a config variable:
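SHOW track_activity_query_size;
-- defaults to 1024 (bytes)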

For most applications 1024 bytes are absolutely enough. However, this is usually not the case if you are running Hibernate or Java. Hibernate tends to send insanely long queries to the database, and thus the SQL code might be cut off long before the relevant parts start (e.g. the FROM-clause etc.). Therefore it makes a lot of sense to increase track_activity_query_size to a higher value (maybe 32,768).

Helpful queries to detect bottlenecks in PostgreSQL

There is one query I have found especially useful in this context: The following query shows 20 statements, which need a lot of time:
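It goes something like this (a sketch - adapt the column list to taste):

SELECT substring(query, 1, 50) AS short_query,
       round(total_time::numeric, 2) AS total_time,
       calls,
       round(mean_time::numeric, 2) AS mean,
       round((100 * total_time / sum(total_time) OVER ())::numeric, 2) AS percentage_overall
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 20;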

The last column is especially noteworthy: It tells us the percentage of total time burned by a single query. It will help you to figure out whether a single statement is relevant to overall performance problems or not.

Many people ask about index scans in PostgreSQL. This blog is meant to be a basic introduction to the topic. Many people aren't aware of what the optimizer does when a single query is processed. I decided to show how a table can be accessed and give examples. Let's get started with PostgreSQL indexing.

Indexes are the backbone of good performance. Without proper indexing, your PostgreSQL database might be in dire straits and end users might complain about slow queries and bad response times. It therefore makes sense to see which choices PostgreSQL makes when a single column is queried.

Preparing some demo data in PostgreSQL

To show you how things work we can use a table:
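-- table and column names are arbitrary
CREATE TABLE sampletable (x numeric);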

If your table is almost empty, you will never see an index scan, because it might be too much overhead to consult an index - it is cheaper to just scan the table directly and throw away whatever rows don't match your query.

So to demonstrate how an index actually works, we can add 10 million random rows to the table we just created before:
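INSERT INTO sampletable (x)
   SELECT random() * 10000
   FROM generate_series(1, 10000000);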

Then an index is created:
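-- the index name is arbitrary
CREATE INDEX idx_x ON sampletable (x);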

After loading so much data, it is a good idea to create optimizer statistics in case autovacuum has not caught up yet. The PostgreSQL optimizer needs these statistics to decide on whether to use an index or not:
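ANALYZE sampletable;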

In PostgreSQL, btree indexes are Lehman-Yao high-concurrency B-trees (which will be covered in more detail in a later blog).

 

Selecting a small subset of data in PostgreSQL

When only a small set of rows is selected, PostgreSQL can directly ask the index. In this case it can even use an “Index Only Scan” because all columns needed are actually already in the index:
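-- continuing with the sample table from above
EXPLAIN SELECT x FROM sampletable WHERE x = 42.0;

-- expect something like an "Index Only Scan using idx_x on sampletable",
-- because x is the only column needed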

Selecting only a handful of rows will be super efficient using the index. However, if more data is selected, scanning the index AND the table will be too expensive.

PostgreSQL indexing: Selecting a lot of data from a table in PostgreSQL

However, if you select a LOT of data from a table, PostgreSQL will fall back to a sequential scan. In this case, reading the entire table and just filtering out a couple of rows is the best way to do things.

Here is how it works:
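EXPLAIN SELECT * FROM sampletable WHERE x < 9000;

-- roughly 90% of the rows qualify, so expect a sequential scan with a filter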

PostgreSQL will filter out those unnecessary rows and just return the rest. This is really the ideal thing to do in this case. A sequential scan is therefore not always bad - there are use cases where a sequential scan is actually perfect.

Still: Keep in mind that scanning large tables sequentially too often will take its toll at some point.

PostgreSQL: Making use of bitmap scans

If you only select a handful of rows, PostgreSQL will decide on an index scan - if you select a majority of the rows, PostgreSQL will decide to read the table completely. But what if you read too much for an index scan to be efficient but too little for a sequential scan? The solution to the problem is to use a bitmap scan. The idea behind a bitmap scan is that a single block is only used once during a scan. It can also be very helpful if you want to use more than one index to scan a single table.

Here is what happens:
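EXPLAIN SELECT * FROM sampletable WHERE x < 400;

-- only a few percent of the rows qualify: expect a Bitmap Index Scan on idx_x
-- feeding a Bitmap Heap Scan on sampletable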

PostgreSQL will first scan the index and compile a list of those rows / blocks which are needed. Then PostgreSQL will take this list and go to the table to really fetch those rows. The beauty is that this mechanism even works if you are using more than just one index.

Bitmap scans are therefore a wonderful contribution to performance.

Read more about how foreign key indexing affects performance, and how to find missing indexes in Laurenz Albe's blog.

 



By Kaarel Moppel - The last couple of weeks, I've had the chance to work on our Open Source PostgreSQL monitoring tool called pgwatch2 and implemented some stuff that was in the queue. Changes include only a couple of fixes and a lot of new features... so hell, I'm calling it "Feature Pack 3"... as we've had some bigger additions in the past. Read on for a more detailed overview on the most important new stuff.

Becoming "Enterprizy"

The focus word this time could be "Enterprise". Meaning - firstly, trying to make pgwatch2 easier to deploy for larger companies, who maybe need to think about security or container orchestration, and secondly, adding robustness. Security even got some special attention – now there's a pgwatch2 version suitable for running on OpenShift, i.e. the Docker process runs under an unprivileged user. But there are also quite some new dashboards (see the screenshots at the end), like the long-awaited "Top N queries", that should delight all "standard" users, and also some other smaller UI improvements.

And please do let us know on GitHub if you’re still missing something in the tool or are having difficulties with something – I myself think I've lost the ability to look at the tool with beginner eyes. Thus - any feedback would be highly appreciated!

Project GitHub link – here.

Version 1.3 full changelog – here.

Most important changes for v1.3.0

Suitable for example for OpenShift deployments.

Using volumes was of course also possible before, but people didn't seem to think about it, so now it's more explicit. Volumes help a lot for long-term Docker deployments, as they make updating to new image versions quite easy compared to dump/restore.

These can be used to do some advanced orchestrated setups. See the "docker" folder for details. There are also up-to-date images available on Docker Hub, but it's still meant more for DIY use.

Now it is possible to turn off anonymous access and set admin user/password.

One can now easily monitor/alert on a mix of on-prem and AWS RDS managed DBs.

Visualizes PgBouncer "SHOW STATS" output. NB! Requires a config DB schema change for existing setups, as a new "datasource type" field was introduced.

Now one can see, based on pg_stat_statements info, exactly which queries are the costliest (4 criteria available) over a user-selected time period. Lots of people have asked for that, so please, enjoy!

Helps to visually highlight biggest tables/indexes. Uses a custom plugin for Grafana.

The metrics gathering daemon can now store all gathered metrics in 2 independent InfluxDBs... so a kind of home-brewed clustering. For real clustering one needs the commercial Influx version, so be sure to check it out as well if you need extra robustness - it seems like a great product.

Previously, if Influx was down, metrics were gathered and saved in the daemon's memory until things blew up - 100k datasets translate into ~2 GB of RAM.

Variable is called PW2_IRETENTIONDAYS. Default is 90 days.

 

pgwatch2 is constantly being improved and new features are added. Learn more >>

Screenshots of new Dashboards

Stat Statements Top

PgBouncer Stats

Biggest relations

AWS RDS overview
