
UPDATED July 2023: PostgreSQL has a sophisticated security system capable of handling complex policies, and user permissions are an essential component of every productive setup. However, over the past couple of years I have noticed that many users fall into the same trap: the public schema.

Unexpected results

To make life easier, PostgreSQL provides a thing called the “public” schema, which can be found in every database. Prior to version 15, everybody could create objects in the public schema. This could lead to some interesting and unexpected behavior. (For info about PG v15 and higher, see the last paragraph.) Let's assume I'm connected to a database as superuser in v14:
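
Something along these lines will do (the user and table names, john and t_important, are simply the ones chosen for this example):

-- as superuser
CREATE USER john;

CREATE TABLE t_important (id int, payload text);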

I've created a new user as well as a table belonging to the superuser. Let's see what the new user is capable of doing.

As expected, our new user is not allowed to read from existing tables, and he is not allowed to create a new schema or even drop an existing one:
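
Connected as john, the attempts might look like this (the error texts in the comments are the ones PostgreSQL emits in these situations):

-- connected to the database as user "john"
SELECT * FROM t_important;
-- ERROR:  permission denied for table t_important

CREATE SCHEMA sales;
-- fails as well: creating a schema requires CREATE privilege on the database

DROP SCHEMA public;
-- ERROR:  must be owner of schema public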

So far everything is going according to plan. But, let us take a look at the next example:
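
The following works without any complaint (the table name is again just an example):

-- still connected as "john"
CREATE TABLE t_something (id int);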

The user can easily create a new table. Most people would have expected some “permission denied” here. However, this does not happen, because the public schema can be utilized by everybody. \d reveals what is going on here:
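
The listing will look roughly like this (assuming the superuser is called postgres; the exact output depends on your setup):

         List of relations
 Schema |    Name     | Type  |  Owner
--------+-------------+-------+----------
 public | t_important | table | postgres
 public | t_something | table | john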

We can see that the new relation belongs to John.

Closing the hole (PG prior to v15)

To make sure that things like this can't happen, we have to remove permissions from the public schema. The superuser can execute the following command to achieve that:
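
The classic recipe looks like this:

REVOKE ALL ON SCHEMA public FROM public;
-- strips both the CREATE and the USAGE privilege from the pseudo-role "public",
-- i.e. from everybody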

This will lead exactly to the kind of behavior we expected before:
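
Reconnected as john, both attempts now fail (the error messages in the comments are what PostgreSQL prints):

-- connected as "john" again
CREATE TABLE t_fail (id int);
-- ERROR:  no schema has been selected to create in

CREATE TABLE public.t_fail (id int);
-- ERROR:  permission denied for schema public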

Don't be misled by the error message in the first example. It only shows up because PostgreSQL does not know where to put the table. Prefixing the table name with the schema reveals the real cause of the problem and leads to the expected “permission denied” error.

PostgreSQL v15: new default for public schema

In PostgreSQL v15 and higher, the default for the public schema has been changed: the CREATE privilege is no longer granted to all users, so by default only the database owner can create objects in it.

That may lead you to receive this same error message - to learn how to fix it, see my blog about ERROR: Permission Denied: schema public.


After digging through some customer source code yesterday I thought that it might be worth publishing a post about space consumption and enum types. It seems to be quite common to store status information as text fields. This is highly inefficient.

Here is an example:
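
A typical table of this kind might look like this (the names are made up for the example):

CREATE TABLE t_product (
    id      serial,
    name    text,
    status  text
);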

Unless we have defined a check constraint we can add any data we want:
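
For instance, nothing stops us from inserting a status that makes no sense at all:

INSERT INTO t_product (name, status)
    VALUES ('t-shirt', 'some nonsense value');
-- accepted: the column is plain text, so anything goes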

The problem I want to point out here is that a string needs a fair amount of space. In addition to that, it is pretty likely that there is only a handful of status flags available anyway. Just imagine that you have to store billions of rows – a couple of bytes can definitely make a difference.

Space optimization

To reduce the amount of space needed, we can create an enum type. The advantage here is that an enum value occupies just four bytes on disk, no matter how long its textual label is. An enum is also an implicit “check constraint”, making sure that only those values we really want are allowed:
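
A sketch of such a type (the name and the three values are, of course, just examples):

CREATE TYPE product_status AS ENUM ('available', 'sold_out', 'discontinued');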

In our example three values are allowed.

We can now use this type in our table:
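
For example:

-- recreate the table, this time using the enum type
DROP TABLE t_product;

CREATE TABLE t_product (
    id      serial,
    name    text,
    status  product_status
);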

When we try to insert data, we will see that bad rows are not accepted:
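
A valid value goes in just fine, an invalid one is rejected (the error text is the one PostgreSQL prints for enums):

INSERT INTO t_product (name, status) VALUES ('t-shirt', 'available');
-- works

INSERT INTO t_product (name, status) VALUES ('t-shirt', 'some nonsense value');
-- ERROR:  invalid input value for enum product_status: "some nonsense value"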

Enum data types are really a nice way to model a finite list of textual values in an efficient way. One cool feature is also that those types preserve order. What does that mean? It basically means that if you sort an enum column, the data will be returned in the order in which the values were declared when the enum was created – not in alphabetical order.

Changing enums later on in the process is definitely possible. ALTER TYPE provides a nice interface to add values to an enum type.
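
For example, sticking to the hypothetical product_status type from above:

ALTER TYPE product_status ADD VALUE 'preordered' AFTER 'available';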


If a good programmer is facing a technical problem, he will always try to abstract the problem. In a procedural language such as C people will create typedefs, functions and so on – in object oriented programming people will simply create objects. But what about data type abstraction in SQL? Most people seem to miss this possibility completely.

CREATE DOMAIN – an approach

PostgreSQL (and most other relational databases I am aware of) provides an instruction called CREATE DOMAIN. The idea is to have a mechanism capable of providing data type abstraction.

Let us assume you are writing an application, which has to store information about a simple shop. You might have sales prices, discounts, taxes, and so on. In short: There will be many places in the data structure where you actually want to store a price or some sort of financial information.

Here is an example:
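
Something like the following, where the table and column names are made up for the example:

CREATE TABLE t_order (
    id             serial,
    product_name   text,
    price          numeric(10, 2) NOT NULL CHECK (price > 0),
    special_price  numeric(10, 2) NOT NULL CHECK (special_price > 0)
);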

In this case we are storing two prices, and we have to repeat the same definition over and over again. If you store currency values in ten different tables, there is always a danger that one of your programmers will have a different view of what such a column should look like – and that will inevitably lead to inconsistencies in your model.

Data type abstraction

CREATE DOMAIN is exactly what we need here. The syntax is actually quite simple:
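
In slightly simplified form, the synopsis is:

CREATE DOMAIN name [ AS ] data_type
    [ DEFAULT expression ]
    [ constraint [ ... ] ]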

In our case we want to make sure that our column has exactly 10 digits (overall), it must not be NULL and the value inserted must be higher than zero. Here is the command to achieve exactly that:
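
A possible definition (the name currency and the choice of two fractional digits are just assumptions for this example):

CREATE DOMAIN currency AS numeric(10, 2)
    NOT NULL
    CHECK (VALUE > 0);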

Once we have created the domain we can use it just like any other data type:
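
For instance:

CREATE TABLE t_invoice (
    id             serial,
    product_name   text,
    price          currency,
    special_price  currency
);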

The main advantage is that those checks we have assigned to the domain will now be used by all columns using the data type. It is a consistent way to store data.

If you want to modify things later on you can still use ALTER DOMAIN so you are not stuck with a mistake.
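
A couple of sketched examples of what is possible:

-- relax the domain
ALTER DOMAIN currency DROP NOT NULL;

-- or add an additional rule later on
ALTER DOMAIN currency ADD CONSTRAINT currency_upper_limit CHECK (VALUE < 1000000);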


Histograms are a common way for a relational database to store statistical information about data. What is this kind of statistical information good for? Well, statistics are the rocket fuel behind performance. If statistics are bad, the optimizer will come up with bad decisions and poor execution plans will be the result. The PostgreSQL optimizer is a damn sophisticated piece of software and it provides end users with good plans in general – still, without statistics even the best SQL optimizer will yield bad performance.

The good news is that when things are a little tight, the end user can have some influence on the way PostgreSQL stores statistics. The ALTER TABLE ... SET STATISTICS command will allow users to define the size of the internal histograms storing statistical information.

Before we dig into those details we create a simple table and add some data to it:
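
A simple test setup could look like this (table and column names are arbitrary):

CREATE TABLE t_test (id int);

INSERT INTO t_test
    SELECT * FROM generate_series(1, 1000000);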

This will create a table containing one million numbers. To make sure that the optimizer will know about the content of the table, we can run ANALYZE (a command in charge of creating statistics):
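
In our case this is simply:

ANALYZE t_test;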

How statistics are used

Whenever PostgreSQL has to plan a query, it will rely on the statistics we have just created. Let us take a look at a very simplistic plan:
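
Something like this (the exact cost and row numbers will of course differ on your system):

EXPLAIN SELECT * FROM t_test WHERE id < 10000;
-- expect a sequential scan with a row estimate in the vicinity of 9,100 rows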

We try to find all values lower than 10,000. The planner estimates that we will find roughly 9,100 rows. Overall this is pretty good and totally sufficient to plan the query efficiently. Remember, statistics are not about having the final answer right away – they are about having a reasonably good estimate.

Let us see which statistics PostgreSQL has gathered. To do so, we can take a look at a system view called pg_stats:
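
The relevant part for our column can be fetched like this:

SELECT histogram_bounds
FROM   pg_stats
WHERE  tablename = 't_test'
       AND attname = 'id';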

If you are interested in the content of pg_stats, we recommend taking a look at the PostgreSQL documentation. It outlines nicely how things work in this area: http://www.postgresql.org/docs/9.3/static/view-pg-stats.html

Changing the size of the histogram

Sometimes it can be quite beneficial to change the size of a column's histogram. Reducing the size of the histogram will make estimates a lot less precise – but it will also reduce the time needed to plan a query. If you are dealing with a table with 1 billion rows and all you do is hit the primary key to fetch exactly one row at a time, a narrow histogram is definitely not a big issue.

However, if you are dealing with more complex operations, keeping the default of 100 or even raising this value can be quite beneficial.

This is how you can change the size of the histogram of a specific column:
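
To shrink the histogram of our id column to 10 buckets:

ALTER TABLE t_test
    ALTER COLUMN id SET STATISTICS 10;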

Once this has been done, we can recreate the stats:
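
ANALYZE rebuilds the histogram using the new statistics target:

ANALYZE t_test;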

Querying the stats table reveals what PostgreSQL knows about the column:
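
Running the same query against pg_stats again:

SELECT histogram_bounds
FROM   pg_stats
WHERE  tablename = 't_test'
       AND attname = 'id';
-- the histogram_bounds array now contains just 11 boundary values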

Note that the histogram consists of just 11 values now (which gives us 10 bars).

The impact of a change

We can toy around with this setting now and see what happens to the plan made by PostgreSQL once the statistics target has been raised to 1,000:
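
A sketch of the experiment (again, the concrete numbers you will see depend on your machine):

ALTER TABLE t_test
    ALTER COLUMN id SET STATISTICS 1000;

ANALYZE t_test;

EXPLAIN SELECT * FROM t_test WHERE id < 10000;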

In our example the estimate is a little bit more precise than initially. However, this comes at a cost. Planning time has risen – the plan is still the same (because there is no other choice for the planner anyway).

Btw, keep in mind that those values should not be changed on a daily basis and in many cases it is not even necessary to change this stuff at all – it is nice to have the opportunity to do so, however, when beneficial.


“Waiting is the slowest way to execute an operation” - this is how a professor at my university in Vienna usually described bad locking and low concurrency. He could not have been more right. The same applies to performing calculations: The fastest way to calculate something is to try to skip it entirely. In this posting we want to focus on the first half of the story: Reducing the impact of locking.

Let us assume we have three tables:
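
For the sake of the example, three trivial tables will do (names and layout are made up):

CREATE TABLE a (aid int);
CREATE TABLE b (bid int);
CREATE TABLE c (cid int);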

We can populate those tables with a handful of values to get our example started:
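
For instance:

INSERT INTO a VALUES (1), (2), (3);
INSERT INTO b VALUES (2), (3), (4);
INSERT INTO c VALUES (3), (4), (5);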

SELECT … FOR UPDATE

Concurrency (= doing things in parallel) is an essential thing to consider when writing code. In most applications there is more than just one user at the same time and those users might very well focus their attention on the same set of data.

To make sure that nobody can modify a row while we are processing it, all modern relational databases such as Oracle and PostgreSQL provide a syntax component called “SELECT .. FOR UPDATE”. All rows returned by “SELECT FOR UPDATE” are locked as if they were about to be updated. This will help us to avoid concurrency issues and strange behavior (race conditions).

In our example we want to perform a join between table A and table B and lock all rows returned by SELECT. Here is the code:
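
Roughly like this:

-- session 1
BEGIN;

SELECT *
FROM   a, b
WHERE  aid = bid
FOR UPDATE;
-- the matching rows in BOTH a and b are now locked; the transaction stays open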

As you can see, data is returned as expected. For the sake of this example, we keep the transaction open. Let us now run a second transaction, which is executed while the first session is still active:
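
For example, a transaction touching b and c:

-- session 2
BEGIN;

SELECT *
FROM   b, c
WHERE  bid = cid
FOR UPDATE;
-- this statement blocks: session 1 already holds locks on rows in b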

The point here is: The second session has to wait because session 1 and session 2 will have to lock rows in the B table. This is exactly the desired behavior.

Improving concurrency by locking less

But: What if B was just a simple lookup table? What if A was a payment table and B just a postal code table? What if C was a table storing data about people?

If somebody modifying a payment has to look up a postal code for this purpose, we would automatically lock out other transactions that want to modify data about people. The lock on the postal code table would magically spread to other areas of the code that have nothing to do with payments.

If changes to payments are frequent, the postal code table would be the poison pill for all writing transactions touching the table storing people. Clearly, this is not acceptable. How can you ever scale to 64 CPUs or beyond if you are shot down by a simple row level lock?

SELECT ... FOR UPDATE OF

PostgreSQL has the right solution to this kind of problem: SELECT ... FOR UPDATE OF allows you to specify which table in the join you want to lock. It works like this:
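
In our example:

-- session 1
BEGIN;

SELECT *
FROM   a, b
WHERE  aid = bid
FOR UPDATE OF a;
-- only the rows returned from a are locked; b is not locked at all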

In this example we are telling the system that we are joining A and B but we only want to lock A. This makes sure that C can be locked undisturbed:
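
The second session can then do the same thing on its side of the data:

-- session 2
BEGIN;

SELECT *
FROM   b, c
WHERE  bid = cid
FOR UPDATE OF c;
-- returns immediately: only rows of c are locked, and nobody else holds locks on them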

This SELECT statement will only take care of table C and make sure that B is not touched because we don't want to modify it anyway.

As you can see, a small change can have a significant impact on the overall performance, because we have magically allowed a second transaction (and thus a second CPU) to proceed without having to wait on a statement.

The golden rule of thumb is: the more concurrency you have in your system, the faster you are overall. Never lock more than needed.


Just like most other database systems I am aware of, PostgreSQL provides a feature called tablespaces. In general, a tablespace is a collection of storage files residing under a certain mount point. The main idea of a tablespace is to give administrators a chance to scale out their I/O and to make things more efficient. Typically a tablespace makes a lot of sense if you want to access more than just one I/O subsystem.

More uses for tablespaces

As I have mentioned before, a tablespace can be used to add more hardware to the system – but, is there some other way to make use of a tablespace? Well, there is some sort of hidden “abuse”, which can be highly beneficial to performance: Using tablespaces to help the optimizer.

In PostgreSQL the optimizer is in charge of finding good execution plans for a given query. To come up with a good plan, PostgreSQL has to calculate artificial costs for certain operations. To estimate I/O, two essential cost parameters are used: seq_page_cost (the cost of reading a block sequentially, 1 by default) and random_page_cost (the cost of reading a block through random I/O, 4 by default).
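
You can inspect the current values directly (the numbers in the comments are the defaults):

SHOW seq_page_cost;     -- 1
SHOW random_page_cost;  -- 4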

In reality (on mechanical disks) random I/O is a lot more expensive than sequential I/O because the disk has to seek each block. This is why PostgreSQL assigns 4 penalty points to such a block.

The quest for “real” costs

Let us assume for a second there is no cache around. In a case like that, random I/O can very well be 50-100 times more expensive than sequential I/O. The question is: why is the cost parameter still at 4 instead of at 100 then? The answer is that we can assume that not all blocks will come from disk. Some will be cached, and we can assume that on average 4 is a pretty good number.

But what if a table (maybe 10 TB?) is so large that we can never expect an index lookup to benefit from a cache hit? What if we had additional small lookup tables which will always be in cache because they are used frequently? Clearly, it seems unfair to use the same cost parameters for those different types of tables.

Assigning costs to tablespaces

A tablespace can come to the rescue here. PostgreSQL allows people to assign different cost parameters to different tablespaces. We could pack tables we expect to be uncached into one tablespace and lookup tables into some other tablespace. Those two tablespaces could reside on the same set of drives – we would just use a tablespace to honour different cost models.

Then we can make sure that the optimizer will consider access to a certain tablespace to be more expensive. Here is how it works:
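
A sketch of the idea (the tablespace name and the location are placeholders):

CREATE TABLESPACE name_of_tablespace LOCATION '/path/to/some/mount/point';

ALTER TABLESPACE name_of_tablespace
    SET (random_page_cost = 20);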

In this case we have told the system that data residing on a tablespace called name_of_tablespace is 5 times more expensive than a typical random read.

The benefit of per-tablespace costs

Especially when data sets are composed of both very large and fairly small tables, using different cost parameters for various tablespaces can be highly beneficial, because it allows the optimizer to estimate costs more reliably. Of course, it takes some time to properly adjust those parameters, but it is well worth the effort.

