
Generating simple data sets

Before we get started, I want to introduce my favorite set-returning functions which can help you to generate sample data:
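A minimal listing based on generate_series (the classic workhorse for this kind of job) might look like this:

-- generate the numbers 1 through 10
SELECT * FROM generate_series(1, 10) AS x;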

All we do here is generate a list of numbers from 1 to 10 and print it on the screen. Now let us play around with window functions a bit. There are two cases we need to keep in mind: if the OVER clause is empty, the window function sees the entire data set; if we add ORDER BY, it sees only the data set up to the current row in the sorted order. The following listing contains an example:
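A sketch of such a query; the column aliases are just placeholders:

SELECT x,
       array_agg(x) OVER ()           AS whole_set,     -- empty OVER clause: all rows
       array_agg(x) OVER (ORDER BY x) AS up_to_current  -- ORDER BY: rows up to the current one
FROM generate_series(1, 10) AS x;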

As you can see, the last column keeps accumulating more values.

PostgreSQL: ROWS BETWEEN … PRECEDING … AND … FOLLOWING

Often it is necessary to limit the set of data (the window) used by the window function. ROWS BETWEEN … PRECEDING … AND … FOLLOWING allows you to do exactly that. The following example shows how this works:
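A sketch using a sliding window of one row before and one row after the current row (the exact constants are just an assumption here):

SELECT x,
       array_agg(x) OVER (ORDER BY x
                          ROWS BETWEEN 1 PRECEDING
                                   AND 1 FOLLOWING) AS sliding_window
FROM generate_series(1, 10) AS x;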

What you see is that the data fed to array_agg is seriously restricted. But the restriction we are using here is a static one - the constants are hardwired. In some cases, you might need more flexibility.

Configuring window functions and analytics

More often than not, the configuration has to be determined on the fly. The beauty is that in PostgreSQL you can use a subselect as part of the OVER clause, which gives you a lot of flexibility.

Before we move on to a demo, we need to create a configuration table:
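A minimal sketch of such a table; the name t_config, its columns, and the two entries are just assumptions for this demo:

CREATE TABLE t_config (
    name  text PRIMARY KEY,
    value int4 NOT NULL
);

INSERT INTO t_config (name, value)
VALUES ('prec', 1),     -- rows preceding the current row
       ('follow', 2);   -- rows following the current row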

To keep it simple, I've created just two entries. The following SELECT statement uses those configuration parameters to do its magic. Here is how it works:
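A sketch of such a query, feeding the frame boundaries from the configuration table via scalar subselects:

SELECT x,
       array_agg(x) OVER (ORDER BY x
            ROWS BETWEEN (SELECT value FROM t_config WHERE name = 'prec') PRECEDING
                     AND (SELECT value FROM t_config WHERE name = 'follow') FOLLOWING)
FROM generate_series(1, 10) AS x;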

As you can see, the query performs as expected and can be configured dynamically.

PARTITION BY and expressions

Another important note: PARTITION BY can take not only a column but also an expression to split the data set. Many people are not aware of this feature, which is actually quite useful. Here is an example:
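A sketch, partitioning by the expression x % 2 to separate odd from even numbers:

SELECT x,
       x % 2 AS remainder,
       array_agg(x) OVER (PARTITION BY x % 2 ORDER BY x)
FROM generate_series(1, 10) AS x;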

In this case, we had no problem splitting the data into odd and even numbers. What I want to point out here is that PostgreSQL offers a lot of flexibility. We encourage you to test it out for yourself.

Finally …

Window functions are super important if you need to relate the rows in a result set to each other. You order them, you partition them, and then you define a window from which you can compute additional result columns.

Sometimes you want to find out more about a time series. One thing we have seen quite often recently is the need to count how long somebody was continuously active. “Detecting continuous periods of activity” will show you how to calculate such things easily in PostgreSQL.



By Kaarel Moppel

The ability to use table partitions to speed up queries and make tables more manageable as data volumes grow has been available in Postgres for a long time. Nicer declarative support arrived in v10, so in general it's a well-known technique for developers. What is not so uniformly clear, however, is how low-level partition management should be done: Postgres leaves it up to users, and no real standard tools or even concepts have emerged.

So it happened again that a customer approached me with a plan to use partitions for an action-logging use case. The customer couldn't find a tool to his liking after a session of googling and was looking for some advice. Sure, I know some tools. But my actual advice seemed quite unconventional and surprising to him: why all the craziness with tools for this simple task? Why not throw together something simple yourself?

The problem with tools

Of course, tools in the broader sense are necessary and great - we wouldn't be here today without our stone axes. Software developers thrive on the skillful use of editors, programming languages and so on. But the thing with tools is that you should really know when you actually need one! When it comes to simple, isolated tasks, I'd argue it's better to delay picking up a tool until you absolutely need it - and until you understand the problem space sufficiently well.

Adopting a deeply integrated software tool also brings risks to the table - such as the added complexity. When something goes sour, you're usually in for some serious sweating / swearing, since you suddenly need a deep understanding of the tool's internals (it was probably set up X months or years ago, and who remembers all that stuff). That's a tough thing to do when the production database is unable to process user queries and your manager is breathing down your neck.

Also, most tools or products in the PostgreSQL realm are in the end just GitHub repos with code, without an associated business model or support services behind them. That means you can forget about any guarantees / SLAs whatsoever unless you're covered by some Postgres support provider - and even then, 3rd-party tools / extensions typically don't all fall under the SLA.

But sure - if you weigh the associated dangers, learn the basics, and evaluate the project's liveliness (a must for Open Source tools, I think), you can look for a tool that more or less automatically takes care of the tasks at hand. For PostgreSQL partitioning, by the way, I think the most common ones are pg_partman and the very similarly named pg_pathman.

Rolling your own

Disclaimer: this approach can only be recommended if you're familiar with the problem space and have some programming skills - or rather, SQL skills, when it comes to database partitioning. What I want to show with this blog post is that it's not really hard to do, at least for the simple task of table partitioning. I personally prefer this way over some of the more complex tools, where by the time you're done with the README and have digested all the concepts and configuration options, you could have written and tested all the code needed yourself. So below you'll find some sample code describing one approach to implementing your own time-based partition management (the most common partitioning scheme, I believe) - i.e. pre-creating partitions for future time ranges and dropping old ones.

Our sample business schema

From the business-logic side, let's assume we have a schema for capturing millions of user interaction events per day (not uncommon for bigger shops), but luckily we need to store those events for only 6 months - a perfect use case for partitions!
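A minimal sketch of such a schema (table and column names are assumptions for this demo), range-partitioned by the event timestamp:

CREATE TABLE events (
    created_on  timestamptz NOT NULL DEFAULT now(),
    user_id     int8        NOT NULL,
    event_type  text        NOT NULL,
    details     jsonb
) PARTITION BY RANGE (created_on);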

Now let’s create the first partitions manually.
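A sketch, assuming we start in January 2021 (the schema and table names are placeholders):

CREATE SCHEMA IF NOT EXISTS events_partitions;

CREATE TABLE events_partitions.events_y2021m01
    PARTITION OF events
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01');

CREATE TABLE events_partitions.events_y2021m02
    PARTITION OF events
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01');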

That way we can get going, send the application to QA, and have some time to figure out how we want to solve the automation part. NB! Note that I'm creating a separate schema for the sub-partitions! This is actually a best practice when it's foreseeable that the number of partitions will grow beyond a dozen or so (not the case for our demo schema, though): the partitions stay out of sight when you inspect the schema with a query tool, and since table access still happens via the main table, they can be considered an implementation detail.

Sample partition management code

So now to the most important bits - creating new partitions and dropping old ones. The query below generates SQL which, when executed (for example by uncommenting the \gexec directive when using psql), pre-creates a partition for the next month. Running it from a weekly cron job should be enough.
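A sketch of such a generator query, reusing the demo naming scheme from above:

SELECT format($sql$
    CREATE TABLE IF NOT EXISTS events_partitions.events_y%sm%s
        PARTITION OF events FOR VALUES FROM ('%s') TO ('%s')
    $sql$,
    to_char(p_start, 'YYYY'),
    to_char(p_start, 'MM'),
    p_start,
    p_start + '1 month'::interval)
FROM (SELECT date_trunc('month', now() + '1 month'::interval) AS p_start) t;
-- \gexec   (uncomment this psql directive to actually execute the generated DDL)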

And SQL for dropping old ones:
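Again a sketch based on the naming scheme above - it parses the month out of each partition name and drops everything older than our 6-month retention period:

SELECT format('DROP TABLE IF EXISTS %I.%I', schemaname, tablename)
FROM pg_tables
WHERE schemaname = 'events_partitions'
  AND tablename ~ '^events_y[0-9]{4}m[0-9]{2}$'
  AND to_date(substring(tablename FROM '([0-9]{4}m[0-9]{2})$'), 'YYYY"m"MM')
      < date_trunc('month', now() - '6 months'::interval);
-- \gexec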

For real production use, I'd recommend wrapping such things in nice, descriptively named stored procedures. Then call them from good old cron jobs, or use the pg_cron extension if you have it installed already, or try out pg_timetable (our job-scheduling software with advanced rules / chaining) or similar. Also, don't forget to set up some alerting and retry mechanisms for production usage, as hiccups will appear sooner or later - table locking issues when attaching / dropping partitions, accidentally killed sessions, etc.

For multi-role access models, I'd also recommend setting up reasonable DEFAULT PRIVILEGES for the schemas involved. That way, all necessary application users automatically gain the correct access rights to new sub-partition tables. This is not actually directly related to partitioning and is a good thing to set up for any schema.
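A minimal example, assuming a hypothetical app_user role that needs read / write access to newly created sub-partitions:

ALTER DEFAULT PRIVILEGES IN SCHEMA events_partitions
    GRANT SELECT, INSERT ON TABLES TO app_user;  -- app_user is a placeholder role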

Some more ideas to manage partitions

For a lot of environments, the simple task of running periodic jobs is actually not so simple: the environment might be very dynamic, with database and scheduling nodes jumping around the infrastructure; connectivity might become a problem; scheduling nodes might have different criticality classes and thus no access; etc. For such cases, there are some alternative partitioning tricks that can be applied. Mostly they look something like this:

* Calling some partition management function with every insert

If the volume of inserts is not too great, it's more or less OK to call the partition management code transparently in the background via triggers. The trigger then looks a bit into the future and pre-creates the next sub-partitions on the fly if they don't already exist - usually applying some dynamic SQL techniques.

For non-native (pre-v10, inheritance-based) partitioning, you could also look at the actual row (see here for a code sample) and verify / create only the required partitions; the newer declarative partitioning mechanism, however, doesn't allow “before” triggers, so you need to look into the future a bit.

* Calling the partition management functions only for some transactions

Although from a performance point of view the previous method doesn't hurt as much as it might seem - the system catalog information is highly likely to be cached, and most of the time no partitions need to be created or dropped - under heavy volumes it could still start to hurt. A better idea might then be to apply the whole partition management procedure only to a fraction of rows, most commonly by using the built-in random() function. Note that in this case a good trick is to apply the randomness filter (0.1% might be a good starting choice) at the trigger definition level (the WHEN clause), not at the trigger function level - this again saves some CPU cycles. A sketch combining both ideas follows below.
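A minimal sketch combining both ideas - an AFTER INSERT trigger (declarative partitioning doesn't allow “before” triggers, as mentioned) with the randomness filter in the WHEN clause. It reuses the demo names from above, which are assumptions:

CREATE OR REPLACE FUNCTION ensure_next_partition() RETURNS trigger AS
$$
DECLARE
    p_start timestamptz := date_trunc('month', now() + '1 month'::interval);
BEGIN
    -- pre-create next month's partition via dynamic SQL, if it doesn't exist yet
    EXECUTE format($sql$CREATE TABLE IF NOT EXISTS events_partitions.events_y%sm%s
        PARTITION OF events FOR VALUES FROM ('%s') TO ('%s')$sql$,
        to_char(p_start, 'YYYY'), to_char(p_start, 'MM'),
        p_start, p_start + '1 month'::interval);
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_ensure_next_partition
    AFTER INSERT ON events
    FOR EACH ROW
    WHEN (random() < 0.001)  -- fire for only ~0.1% of inserted rows
    EXECUTE FUNCTION ensure_next_partition();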

Conclusion

To formulate some kind of “takeaway” for this post: we saw above that partition management can basically be implemented with only two SQL statements (minus validation and retry mechanisms) - so why tie yourself to some 3rd-party project X that basically doesn't give you any guarantees, may require installing an extension, and might make you wait for compatibility with a newly released PostgreSQL version?

We are so used to using all kinds of tools for everything, and so constantly on the lookout for quick wins and those very rare “silver bullets”, that we sometimes forget to stop and ask: do I really need a tool or a utility for the task at hand?

So my advice: if the task looks relatively simple, stop and rethink your needs. Is the new tool “investment” really worth it? What percentage of the tool's features are you actually interested in? In a lot of cases when dealing with PostgreSQL, you'll see that things are not really that difficult to implement yourself... and by doing so, your system gains one very important attribute - you stay in full control, know exactly what's going on, and can easily make changes when needed.

