CYBERTEC PostgreSQL Logo

How to count hits on a website - concurrent counting in PostgreSQL

05.2020 / Category: / Tags:

Concurrent counting in PostgreSQL: Recently we covered “count” quite extensively in this blog. We discussed optimizing count(*) and also talked about “max(id) - min(id)” which is of course a bad idea to count data in any relational database (not just in PostgreSQL). Today I want to focus your attention on a different kind of problem and its solution: Suppose you want to grant a user access to a certain piece of data only a limited number of times. How can you safely implement that?

SQL: Trying the trivial approach to achieve concurrent counting

One might argue that this is easy. Here is some pseudo code:

Looks easy, doesn't it? Well, not really.

What happens if two people run this concurrently. Both sessions will receive the same result for count(*) and therefore one session will come to the wrong conclusion. Both transactions will proceed and do the work even if only one is allowed to do it. That is clearly a problem.

To solve the problem one could do...

... to eliminate concurrency all together. But, this is clearly a problem because it would block pg_dump or some analytics jobs or even VACUUM. Obviously not an attractive option. Alternatively we could try a SERIALIZABLE transaction but that might lead to many transactions failing due to concurrency issue.

One more problem is that count(*) might be really slow if there are already many entries. In short this approach has many problems:

  • It is buggy
  • It is slow
  • It does not scale
  • Runtime is unpredictable

But how can we do better?

Introducing a “count” table

To solve the problem we can introduce a “count table” but let us start right at the beginning. Let me create a table structure first:

First of all I have installed an extension which is needed for one of the tables (more on that later).The customer table contains an id, some login email address and some other fields for whatever purpose. Note that this is mostly a prototype implementation.

Then comes an important table: t_customer_limit. It tells us how many clicks per day which customer can have. PostgreSQL supports “range types”. A period can be stored in a single column. Using range types has a couple of advantages. PostgreSQL will automatically check if the period is valid (end > start and so on). In addition to that the database provides a couple of operators to make handling really easy.

In the next step I have created a table to track all requests thrown at us. Creating such a table has the advantage that you can do a lot of analytics later on:

If you decide to store whatever there is in the HTTP header you might be able to identify spammers, access patterns and a lot more later on. This data can be really beneficial later on.

But due to the problems outlined before it is pretty tricky and inefficient to limit the number of entries per day for a single user. Therefore I have added a “count” table which has some interesting information:

For each day we store the counter_limit which is the number of hits served to the client on a given day. This number will never go higher than the limit of the user for the day. Then we got the counter_limit_real which contains the number of requests the user has really sent. If counter_limit_real is higher than counter_limit we know that we had to reject requests due to high consumption. It is also trivial to figure out how many requests we already had to reject on a given day for a user.

Let us add some sample data:

'hans_company' is allowed to make 4 requests per day for the period I have specified.

Now that the data structure is in place I have created a simple SQL function which contains the entire magic we need to make this work efficiently:

This might look a bit nasty at first glance but let us dissect the query bit by bit. The first CTE (x) tells today's limits for a specific user. We will need this information later on. “x” is not needed explicitly because we can also use a simple subselect. But if you want to sophisticate my prototype it might come in handy if “x” can be reused in the query multiple times.

Then the query inserts a row into t_log. This should be a pretty straight forward operation. Then we attempt to insert into t_log_count. If there is no entry for the customer for the day yet the insert will go through. The important part is that if two INSERT statements happen at the same time PostgreSQL will report a key violation and we cannot end up with a duplicate entry under and circumstances.

But what happens if the key is violated? In this case the ON CONFLICT clause we show its worth: If the INSERT goes wrong we want to perform an UPDATE. If we have reached the limit we only increment counter_limit_real - otherwise we increment both values. Then the updated row is returned. If counter_limit_real and counter_limit are identical we can allow the request - otherwise the user has to be rejected.

The count table has some nice attributes: When it is updated only ONE row will be locked at a time. That means that no operations on t_log will be blocked for reporting, VACUUM and so on. The row lock on the second table will perfectly synchronize things and there is no chance of serving requests which are not allowed. Minimal locking, maximum performance while maintaining flexibility.

Of course this is only a prototype which leaves some room for improvements (such as partitioning). From a locking and efficiency point of view this is promising.

A simple test on my (relatively old) iMac has shown that I can run around 20.000 requests in around 10 seconds (single connection). This is more than enough for most applications assuming that there will be many clients and a lot of parallelism.

What you have to keep in mind is bloat. Keep an eye on the updated table at all time and make sure that it does not grow out of proportion.

concurrent counting

Finally …

Avoiding bloat is important. Check out our post about VACUUM and PostgreSQL database bloat here. You will learn all you need to know to ensure your database is small and efficient.

Also: If you happen to be interested in stored procedures we recommend to check out our latest blog dealing with stored procedures in PostgreSQL.

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Facebook or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Maletin von Oertzen
Maletin von Oertzen
4 years ago

how is the performance, when i just add a FOR UPDATE?
select * from t_log_count where customer_id = 10 and tdate = '2020-04-09' FOR UPDATE;.
normally i just check the return-value of my update (0 or 1 rows):
update t_log_count set counter_limit_real = counter_limit_real 1 where customer_id = 1 and tdate = current_date and counter_limit_real < counter_limit;

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram