CYBERTEC PostgreSQL Logo

Citus: 7 commonly used advanced SQL tools

09.2023 / Category: / Tags: | |

When you run advanced SQL in Citus, what's possible? Which SQL statements work, and which ones don't? Citus is a PostgreSQL extension that adds powerful sharding capabilities to PostgreSQL. However, every solution does have limitations. Therefore, it makes sense to take a look at the latest version of Citus and learn how to properly use the most frequently-needed SQL features. Note that this is not a comprehensive overview, rather it is a guide through 6 of the most commonly-needed SQL tools:

  1. Naming Citus databases
  2. The first thing people typically notice is that with Citus, managing databases is not as straightforward as in a “normal PostgreSQL deployment” anymore. You might notice that most Citus examples use the “postgres” database for demonstration purposes. The reason is simple:

    The Citus workers are bound to a certain database because in PostgreSQL, a database is a strict separation. Therefore people are supposed to use schemas rather than multiple databases. However, if you really run analytics at scale, you'll most likely not need multiple databases on the same infrastructure anyway. You can build multiple clusters in Kubernetes using Patroni to ensure high availability.

    The first takeaway is therefore:

    Just use the postgres database and you'll be OK.

  3. Loading data using COPY
  4. Bulk loading data using COPY is the key to loading data quickly. Yes, you can use the PostgreSQL COPY command with Citus. Here's how it works:

    Data was loaded successfully. The system will automatically route the tuples to the right shard and ensure that there is full transparency.

    Takeaway:

    Bulk loading data with COPY in Citus works just fine.

  5. Advanced SQL: Using ordered sets
  6. What is an ordered set? We're all familiar with calculating the “average”. However, this is not what we want in many cases. Let us imagine that we have three people: Person A earns 1 USD, person B earns 3 USD and person C earns 1 million USD. On average everybody makes north of 300k USD so we should be fine? Actually, no, because two people are starving. The average of a data set is therefore often not what reveals the real situation. A more meaningful value is the “median” of the data set, which is the value in the middle if you order the data set. In the case of 1, 3 and 1000000, the middle value is 3. 50% of the values in the data set are smaller than the median, and 50% of the values are bigger. In SQL, this means that we have to sort the data and get the middle of this ordered data set. WITHIN GROUP (ORDER BY …) is necessary for aggregates like the median that require a certain ordering.

    So let's see if we can run ordered sets in a Citus-enabled PostgreSQL database:

    Here is the execution plan of the query:

    Voila, it works.

    The takeaway is:

    Yes, ordered set aggregates can be used with Citus DB.

  7. Advanced SQL: Window functions with Citus and PostgreSQL
  8. When analyzing time series, it's often necessary to calculate the difference between rows. One example would be to calculate the difference between a value and the difference to the previous period. We might want to know if production has risen or fallen. The following example does exactly that:

    We can indeed run window functions and analytics in Citus. Of course, we have to keep in mind that we are operating in a sharded environment. Window functions usually need sorted input.

    Takeaway:

    Citus supports window functions, but your data should be aligned in a way that the system can provide sorted input to achieve a decent level of efficiency.

  9. Rethinking grouping sets
  10. If you are working on big reporting projects, you'll inevitably need some kind of grouping sets. What is a grouping set? The goal is to perform more than one aggregation at once. Maybe we want to calculate the average production per country, but also have a bottom line that contains the “overall average” of all rows. The way to do that in advanced SQL is to use ROLLUP:

    Basically, the idea behind ROLLUP is to group by “country” as well as by “nothing” which leaves us with two grouping criteria:

    Neither feature is in Citus (yet?), therefore we have to work around this missing feature.

    The way to model grouping sets is to make use of UNION ALL:

    The downside is that we have to read the data twice (once for the country list and once for the overall average).

    Takeaway:

    Emulating grouping sets with UNION ALL in a column store will still perform better than using grouping sets with a row store.

  11. Using triggers with Citus
  12. Often, people want to run triggers on their data. However, there is a catch.

    Let's write a basic trigger and see what happens:

    We CANNOT use triggers on distributed Citus tables.

    The takeaway here is:

    It's not possible to use triggers on distributed tables.

  13. Running ALTER TABLE in Citus
  14. Once in a while, the data structure of a table must be changed. In SQL, the command to do that is ALTER TABLE. However, there are some implications which we have to keep in mind:

    Adding columns is actually simple and can be easily done. However, dropping columns is more delicate:

    In case the partitioning column is touched, Citus will error out and PostgreSQL will not drop the column. This kind of behavior is totally expected and it is logical to prevent this operation from happening, because it would destroy the entire setup.

    What is possible is to rename the sharding column. Citus and PostgreSQL will handle this nicely for us:

Takeaway:

There are some restrictions to changing the table structure, particularly when the sharding column is involved.

Finally …


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

Leave a Reply

Your email address will not be published. Required fields are marked *

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram