
Index missing?

When an index is missing,
good performance won’t be kissing
a PostgreSQL user looking for efficiency
but instead feels like a legacy.

To satisfy a DBA’s desire and thirst,
let us load some data first.

pgbench is the tool of the day
but the next listing will explain that anyway:
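A pgbench call along these lines will do the trick (the database name "test" is just an assumption; scale factor 100 yields roughly 10 million rows in pgbench_accounts):

    pgbench -i -s 100 test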

Loading millions of rows into PostgreSQL is not hard,
that is not the tricky part.
Data so quickly created,
is usually not hated.

10 seconds for 10 million rows,
the DBA in charge bows.
Performance will be good
ensuring all users’ good mood.

To celebrate the success
let the database run again and confess.

This time we are looking for fast reads,
an important thing when aiming for high speeds:
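A read-only pgbench run roughly like this one (again assuming a database called "test") produces the numbers described below:

    pgbench -S -c 10 -j 10 -T 30 test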

pgbench is the name of the game.
My slow box, 101 thousand transactions, insane.
30 seconds to execute the test,
10 concurrent connections that did not rest.
Even 10 threads for the client code,
running in super quick mode.

However, only indexes kept us happy and fine,
bringing good performance, almost divine.
Without indexing life is harsh to the one that suffers,
just like misconfigured shared_buffers.

Our holy lord the superior b-tree.
without it bad performance we see.
A single missing index for a test,
the entire database feels like it is at rest.

Do you want to know why?
Let us give it a try:

Killing an index, only one …
All the performance will be gone:
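Dropping the primary key of pgbench_accounts is enough to make the point (a sketch - do not try this on a production system):

    ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;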

Let me run the test again.
Good performance? Back then?
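The same read-only run as before will answer that, e.g.:

    pgbench -S -c 10 -j 10 -T 30 test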

Good lord? What happened to my data?
Will speed be back later?

8 transactions per second will reveal
that bad performance is hard to conceal.
A single index is broken,
all end users have been walking.
Calling support hotlines like crazy,
No more chances for DBAs to be lazy.

Conclusion

The conclusion of the day.
Make sure no important indexes go away.
It is a really important affair
to index data with care.
Don’t forget a single thing
and performance will be king.

I hope this little poem makes people happy.
Make sure your databases are not crappy.

Creating auto increment columns in PostgreSQL is easy. Simply use the pseudo data types serial or serial8 (bigserial), and PostgreSQL will automatically take care of your auto increment columns. However, once in a while problems can still occur. Let us take a look and see.

Sequences: Avoid manual values

To understand the underlying problem, one has to understand how auto increment and sequences work in PostgreSQL:
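A minimal example (table and column names are made up for illustration):

    CREATE TABLE t_data (id serial PRIMARY KEY, payload text);

    -- the serial column is backed by a sequence (t_data_id_seq)
    -- which supplies the default values 1, 2, 3, ...
    INSERT INTO t_data (payload) VALUES ('a') RETURNING id;
    INSERT INTO t_data (payload) VALUES ('b') RETURNING id;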

In this case, everything is fine. But what happens if a manual value is added? Note that PostgreSQL does allow this. Other databases, such as Informix, explicitly prohibit it. However, in PostgreSQL an autoincrement value is really just a sequence which adds a default value. Nothing stops you from breaking things:
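Continuing the sketch from above:

    -- insert a value manually, bypassing the sequence
    INSERT INTO t_data (id, payload) VALUES (3, 'c');

    -- the sequence is still at 2, so the next automatic insert
    -- also tries to use id = 3 and fails with a duplicate key error
    INSERT INTO t_data (payload) VALUES ('d');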

The trouble is that the manual insertion does not increment the sequence. Thus, the second INSERT statement is going to fail because the value is already there. This often happens when people import data into the database.

So the main question is: How can we fix these sequences? The answer is: Use pg_sequence_fixer.

Using pg_sequence_fixer

The idea behind pg_sequence_fixer is to have a tool that can easily fix these things with one simple stored procedure call. It automatically looks for all sequences associated with a column, finds the max value, and pumps it up to what is needed.

First clone the repository which can be found here: https://github.com/cybertec-postgresql/pg_sequence_fixer

Then simply install the extension:
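Assuming the usual PGXS workflow, something like this will do (pg_config has to be in your PATH):

    cd pg_sequence_fixer
    make install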

The extension is now available and one can enable it:
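In the target database, the usual command does the job:

    CREATE EXTENSION pg_sequence_fixer;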

Finally, you can run the procedure:
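Assuming the function carries the same name as the extension, the call looks roughly like this:

    SELECT pg_sequence_fixer(1000, false);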

The first parameter (= 1000) means that we want to add a safety margin of 1000 to the max value determined by the system. The second parameter controls whether we want to lock those tables during the operation. Usually, locks are not desirable, but sometimes they are necessary in case you cannot take down the application while you are fixing your sequences.

Note that for each sequence fixed, the tool makes PostgreSQL issue a log entry (= NOTICE).

Voila, your database is sound and healthy again.

Finally...

Fixing sequences is an important task, and you can handle it easily using pg_sequence_fixer. Also: we want to point out that this is just the first incarnation of this tool, so if you are willing to contribute - feel free to submit changes.

If you want to learn more about PostgreSQL right now, consider checking out my post about parallel CREATE INDEX in PostgreSQL.

Our team is proud to introduce a new major pg_timetable v4 release!

This time a huge amount of work has been done and the pg_timetable v4 release is backward-incompatible with previous versions. However, job migration is straightforward and shouldn't stop anyone from updating.

We've added detailed Read the Docs documentation and will enhance it further. Let us know what areas should be highlighted or what topics we should add.

Now, let's check out some of the great new features! You can review the official release page for a complete changelog with links to issues, commits, and pull requests.

I want to remind you that pg_timetable is a community project. So, please, don't hesitate to ask any questions, to report bugs, to star the pg_timetable project, and to tell the world about it.

pg_timetable v4 vs v3

The previous release is still considered stable. There are no confirmed bugs nor critical performance issues, and it is absolutely safe to stick with it.

However, it won't be backpatched if new bugs are discovered. In addition, new features won't be implemented for the v3 branch.

Performance

To increase performance, we worked on several fronts.

First of all, we reworked the database schema and simplified it as much as possible. This way, the new pg_timetable executes fewer system queries, leaving more bandwidth and connections to the real workers.

Secondly, we rewrote the cron-related functions in pure SQL instead of PL/pgSQL. This not only increases performance but also gets rid of tedious debugging.

And the third pillar is the comprehensive logging system. See details below in the separate section.

Configuration File Support

As the functionality grows, it becomes error-prone and tedious to handle configuration using only command-line parameters and environment variables.

Common sense suggests using a configuration file in this case. We've added a detailed, self-explanatory config.example.yaml file for you.

Of course, you are free to use a custom set of options or not to use a configuration file at all. For example, it seems like a good idea to provide general options in the configuration file and connection options using environment variables, for the sake of security.

CopyFromFile built-in task

One of the most frequent tasks for database schedulers is to import data from external sources into the database, e.g. sensors, logs, official statistics, etc. The CSV format is the de facto standard for such kinds of data. PostgreSQL can use the COPY command to import CSV files. But to use this command, the file has to be placed in the server's file system, so that the PostgreSQL server process can access it. That is not always the case for production environments, especially in the cloud era.

To overcome this limitation, in the previous version of pg_timetable we suggested using psql's \copy command, which performs a frontend (client) copy. This is an operation that runs an SQL COPY command, but instead of the server reading or writing the specified file, psql reads or writes the file and routes the data between the server and the local file system. This means that file accessibility and privileges are those of the local user, not the server, and no SQL superuser privileges are required.

From now on, you don't need any additional programs to perform a client copy: you can use the new built-in CopyFromFile task. You may find the full example in the manual. Below you will notice that the task accepts JSON parameters in the form of an object, e.g.
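Roughly along these lines - the key names, table and file path below are illustrative, so please check the manual for the exact format:

    {
        "sql":      "COPY sensor_data FROM STDIN (FORMAT csv, HEADER true)",
        "filename": "/tmp/sensor_data.csv"
    }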

This task works perfectly for all formats supported by PostgreSQL: text, csv and binary.

Enhanced logging

One of the key architectural decisions made for pg_timetable was the database-driven concept, meaning that scheduler sessions are fully controlled by the database and every piece of output is stored back in the database for further analysis.

File logs

That said, people were using output redirection if they wanted to have logs stored in files, e.g.
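For example (the connection options shown are illustrative):

    pg_timetable --clientname=worker001 --dbname=timetable > pg_timetable.log 2>&1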

The drawback of this approach is that the information is stored as plain text, making it hard to process automatically. In the new version, we have introduced the ability to duplicate logs to the file system using the --log-file command line parameter. Moreover, the user can choose the exact format of the underlying file by specifying --log-file-format: json or text.

Database log performance

Another bottleneck we were trying to fix was the database logging. Heavy clients were producing a lot of information to be stored by separate INSERT statements, causing intensive use of connections. We have come up with an outstanding log hook implementation for the famous logrus package.

The key benefits are:

  1. Usage of the COPY machinery;
  2. Cache controlled by the time and the number of entries to be logged;
  3. Ability to skip entry logging in case of extremely high load.

I personally highly encourage you to use our log hook in your projects. Let us know if you think we should make it a standalone project.

Tasks output

pg_timetable has been storing the output of program tasks from the very beginning. It's useful for debugging and back-in-time incident analysis. We never thought this feature could be applied to the built-in SQL tasks until Nikolay Samokhvalov proposed it. So now one can have an SQL task with a SELECT statement whose output will be saved upon execution, e.g.

The output of an SQL task is the tag of the command executed and the number of rows returned or affected. We are not storing the result set itself, since commands can be arbitrarily complicated or return huge results. Users should take care of this and explicitly store anything important, e.g.
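For instance, a task could write the interesting part of the result into a regular table (important_result is, of course, just a made-up name):

    -- persist the data we care about instead of relying on the task output
    INSERT INTO important_result (snapshot_time, active_backends)
    SELECT now(), count(*)
    FROM pg_stat_activity;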

Resource Options

We want our scheduler to be as flexible as possible. That's why we have introduced the options category for resource management:

Finally

This was the first in a series of posts dedicated to the new pg_timetable v4 features. Stay tuned for the coolest features to be highlighted.

You can find previous publications following the tag pg_timetable.

Stay safe, healthy, and wealthy!
Be happy! Peace! Love! ❤

When I recently stumbled upon an article comparing some main properties, and also the read-write performance, of two very popular pieces of caching software, often used to speed up applications heavy on database queries, I immediately felt an itch - how would PostgreSQL vs Redis vs Memcached actually look on the performance side? Meaning: what if one just skipped the cache and hit the database directly?

Especially after the first look, I wasn't actually too impressed with the numbers presented for Redis and Memcached. It could also have been due to the fact that the actual test data, extracted from a linked paper, was from 2016 - that's surely ages in "internet years". Either way, I was curious to get some kind of a comparison point, so I quickly threw together a small benchmarking script in Python and let my workstation get to work.

Why Python you might ask, if the original test used Java? Well, Java is just one of my least favorite languages for database projects and I somehow have the impression that Python is very popular both for the Web and also ad-hoc database scripting, so probably a good fit here - might be wrong of course on that.

Test setup

Some characteristics of my test setup:

PostgreSQL vs Redis vs Memcached: Write Operation

Represented in a similar way to the original numbers. The calculated time to write key-value pairs is in milliseconds.

                         Number of records
    Database               1,000    10,000    100,000    1,000,000
    Redis (v3.0.7)            34       214      1,666       14,638
    Memcached (v1.4.14)       23       100        276        2,813
    PostgreSQL (v13.3)      29.6       304      2,888       31,230

PostgreSQL vs Redis vs Memcached: Read Operation

The calculated time to read key-value pairs (ms).

                         Number of records
    Database               1,000    10,000    100,000    1,000,000
    Redis (v3.0.7)             8         6          8            8
    Memcached (v1.4.14)        9        14         14           30
    PostgreSQL (v13.3)     0.026     0.028      0.027        0.029

Conclusion

In short - the numbers looked surprisingly/suspiciously good for the reading test for Postgres! I can't imagine how the original test managed to get such high single-digit millisecond results for random key reads across the whole dataset. In my test, I only saw 1ms+ worst cases for the biggest rowcount. By the way, this data can also be looked up in the "results" table, which holds the "pg_stat_statements" snapshots.

So sadly, I still can't get too ecstatic, as a lot of information was missing on the details of exactly how the original tests were performed, so it might have been a bit of an "apples to oranges" situation in the end, I'm afraid. The average key readout times for Redis/Memcached seemed just way too slow in comparison to Postgres. I suspect they still used a remote machine for the cache, although the paper didn't mention it and talked about a single Core i7 node.

But about the writing speed of key-value data - well: Postgres doesn't really compete at higher row counts. 🙁 But this was also more or less expected! Why? A full-blown relational database engine like Postgres goes to great lengths to ensure we cannot insert invalid data violating some constraints, plus there is the WAL writing (minimal for unlogged tables, but still) and the on-disk-format overhead - internal columns, alignment, some index bloat on page splits, etc. That all amplifies writes a lot! So basically, I think the results still aren't bad in the end - only about 2x slower than Redis for the 100k and 1M row counts. The main idea of caches is that they're only useful if we read from them much more than we write/update anyway!

But however the comparison with the other DBs looks in absolute numbers, it was good to see that the relative stability of Postgres responses to growing datasets was very, very good - even beating Memcached, which deteriorated 2x at 1M rows! This all probably shows that the algorithms selected for PostgreSQL are mathematically sound and well implemented!

But what can definitely be said: Postgres performance is at least good enough for a "caching use case" with a smaller number of rows. So with your next project it might be worth asking - do you really need another external component?

Or maybe just:

  1. add some more memory/CPU to the DB and hammer it harder - pretty safe to do for reads actually;
  2. use your HA replicas for load balancing and achieve better total resource utilization, and enjoy the benefits of not having to change the mental model when dealing with relational data on the “real persistence” level anyways.

Other thoughts/notes

You can find more of my posts. Cheers.

PostgreSQL is one of the best OLTP databases (OLTP = online transaction processing) in the world. However, it can do more than just OLTP. PostgreSQL offers many additional features relevant to a more OLAP-style workload. One of those features is called “GROUPING SETS”. 

Before we dive into the details, I've compiled some sample data which you can easily load into your SQL database: 
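Something along these lines will do (the table, the column names and most of the values are simply made up; only the Argentinian numbers match the figures discussed further down):

    CREATE TABLE t_sales (
        country       text,
        product_name  text,
        amount        numeric
    );

    INSERT INTO t_sales (country, product_name, amount) VALUES
        ('Argentina', 'Shoes', 116),
        ('Argentina', 'Hats',   27),
        ('Germany',   'Shoes', 241),
        ('Germany',   'Hats',   89),
        ('USA',       'Shoes', 993),
        ('USA',       'Hats',  401);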

Note that everything you are going to see in this blog is pretty SQL-standard compliant, so you can expect most of the stuff to work in other professional SQL databases as well. 

Let's get started with a simple aggregation:

There's not much to say here, apart from the fact that we will get one sum for each group. However, there's a bit of a philosophical discussion going on. “GROUP BY 1” basically means “GROUP BY country” which is the equivalent of the first column in the SELECT clause. Therefore “GROUP BY country” and “GROUP BY 1” are the same thing:

Of course, this works with more than one column as well. However, I want to point out something else. Consider the following example: 
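Again based on the assumed t_sales table, one might group by an expression such as this one:

    SELECT (country = 'USA') AS is_usa, sum(amount)
    FROM t_sales
    GROUP BY (country = 'USA');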

Most people group by a column. In some cases, it can make sense to group by an expression. In my case, we are forming groups on the fly (= one group for the US and one for non-US sales). This feature is often underappreciated. However, it is useful in many real-world scenarios. Keep in mind that all the things you are going to see also work with expressions, meaning more flexible grouping is possible. 

GROUPING SETS: The basic building blocks

GROUP BY will turn every distinct entry in a column into a group. Sometimes you might want to do more than one grouping at once. Why is that necessary? Suppose you are processing a 10 TB table. Clearly, reading this data is usually the limiting factor in terms of performance. So reading the data once and producing more than one result at the same time is appealing. That's exactly what you can do with GROUPING SETS. Suppose we want to produce two results at once:

Here's how it works: 

In this case, PostgreSQL simply appends the results. The first three lines represent "GROUP BY country". The next two lines contain the result of "GROUP BY product_name". Logically, it's the equivalent of the following query:
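Sketched with the assumed t_sales table:

    SELECT country, NULL AS product_name, sum(amount)
    FROM t_sales
    GROUP BY country
    UNION ALL
    SELECT NULL, product_name, sum(amount)
    FROM t_sales
    GROUP BY product_name;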

However, the GROUPING SETS version is way more efficient, because it only has to read the data once.

ROLLUP: Adding the “bottom line”

When creating reports, you will often need the “bottom line” which sums up what has been shown in the table. The way to do that in SQL is to use “GROUP BY ROLLUP”:
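With the assumed t_sales data, it could look like this (the ORDER BY is only there to make the output easier to read):

    SELECT country, product_name, sum(amount)
    FROM t_sales
    GROUP BY ROLLUP (country, product_name)
    ORDER BY 1, 2;

      country  | product_name | sum
    -----------+--------------+------
     Argentina | Hats         |   27
     Argentina | Shoes        |  116
     Argentina |              |  137
     Germany   | Hats         |   89
     Germany   | Shoes        |  241
     Germany   |              |  330
     USA       | Hats         |  401
     USA       | Shoes        |  993
     USA       |              | 1394
               |              | 1861
    (10 rows)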

PostgreSQL will inject a couple of rows into the result. As you can see, "Argentina" returns 3 and not just 2 rows. The "product_name = NULL" entry was added by ROLLUP. It contains the sum of all Argentinian sales (116 + 27 = 137). Additional rows are injected for the other two countries. Finally, a row is added for the overall sales worldwide.

Often those NULL entries are not what people want to see, so it can make sense to replace them with some other kind of entry. The way to do that is to use a subselect which checks for the NULL entry and does the replacement. Here's how it works:
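Based on the ROLLUP query and the assumed t_sales data:

    SELECT COALESCE(country, 'TOTAL')      AS country,
           COALESCE(product_name, 'TOTAL') AS product_name,
           sum
    FROM (SELECT country, product_name, sum(amount) AS sum
          FROM t_sales
          GROUP BY ROLLUP (country, product_name)
         ) AS x
    ORDER BY 1, 2;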

As you can see, all NULL entries have been replaced with “TOTAL”, which in many cases is the more desirable way to display this data.

CUBE: Creating data cubes in PostgreSQL efficiently

ROLLUP is useful if you want to add the “bottom line”. However, you often want to see all combinations of countries and products. GROUP BY CUBE will do exactly that:
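Using the same assumed t_sales table:

    SELECT country, product_name, sum(amount)
    FROM t_sales
    GROUP BY CUBE (country, product_name)
    ORDER BY 1, 2;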

In this case, we've got all the combinations. Technically, it's the same as: GROUP BY country + GROUP BY product_name + GROUP BY (country, product_name) + GROUP BY (). We could do that using more than one statement, but doing it all at once is easier - and a lot more efficient.

Again, NULL values have been added to indicate various aggregation levels. For a tutorial, check out my YouTube video about GROUP BY CUBE.

Grouping sets: Execution plans

Grouping sets don’t just simply rewrite the query to turn it into a UNION ALL - there is actually specific code in the database engine to perform those aggregations. 

What you will see is a “MixedAggregate” which is capable of aggregating at various levels at once. Here is an example: 
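For instance, the plan of the CUBE query from above can be inspected like this (the exact plan depends on your data, settings and PostgreSQL version):

    EXPLAIN SELECT country, product_name, sum(amount)
            FROM t_sales
            GROUP BY CUBE (country, product_name);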

Looking at the MixedAggregate also reveals which aggregations are performed as part of the grouping set. 

Finally ...

In general, grouping sets are a really cool feature which is often unknown or overlooked. We highly recommend making use of this awesome stuff to speed up your aggregations. It's particularly useful if you are dealing with a large data set. 

If you want to know more about PostgreSQL and SQL in general, you might also like my post about "string encoding using SQL".
