PostgreSQL tablespaces misunderstood
© Laurenz Albe 2021

 

Users with an Oracle background consider tablespaces very important and are surprised that you can find so little information about them in PostgreSQL. This article will explain what they are, when they are useful and whether or not you should use them.

What is a tablespace?

Essentially, a tablespace in PostgreSQL is a directory containing data files. These data files are the storage behind objects with a state: tables, sequences, indexes and materialized views. In PostgreSQL, each such object has its own data file. If an object grows beyond 1GB, it is split into several data files, called segments, each of them at most 1GB in size.

PostgreSQL uses the operating system's file system for its storage. This is different from Oracle, which essentially implements its own "file system".

Let's compare the terms for clarity:

Terminology in Oracle vs. PostgreSQL

Oracle      | PostgreSQL or operating system
------------+--------------------------------
tablespace  | file system
datafile    | logical/physical volume
segment     | all data files of a table
extent      | segment / data file

Default tablespaces

Each PostgreSQL database cluster initially has two tablespaces. You can list them with \db in psql:
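For example (assuming the bootstrap superuser is called postgres):

    postgres=# \db
              List of tablespaces
        Name    |  Owner   | Location
    ------------+----------+----------
     pg_default | postgres |
     pg_global  | postgres |
    (2 rows)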

You'll notice that there is no location specified. That is because they always correspond to fixed subdirectories of the PostgreSQL data directory: the default tablespace (pg_default) is the "base" subdirectory, and the global tablespace (pg_global) is the "global" subdirectory.

By default, all data files will be stored in the default tablespace. Only certain objects are stored in the global tablespace: the catalog tables pg_database, pg_authid, pg_tablespace and pg_shdepend and all their indexes. These are the only catalog tables shared by all databases.

Creating and using new tablespaces

To create a new tablespace, you first have to create a new directory. Don't create that directory in the PostgreSQL data directory!

Note that the directory has to belong to the "postgres" operating system user (to be exact, the user has to have permissions to change the directory's permissions).
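A minimal sketch, assuming a directory /data/tblspc and the operating system user postgres:

    mkdir /data/tblspc
    chown postgres:postgres /data/tblspc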

Then you can create the tablespace:
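For example (the tablespace name and location are assumptions):

    CREATE TABLESPACE mytbsp LOCATION '/data/tblspc';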

To use the tablespace, you can create a table or another object with storage in it:
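A sketch, using the tablespace from above:

    CREATE TABLE mytab (id bigint) TABLESPACE mytbsp;
    CREATE INDEX mytab_id_idx ON mytab (id) TABLESPACE mytbsp;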

Note that indexes are not automatically created in the same tablespace as the table.

You can also create a database in a tablespace:
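For example:

    CREATE DATABASE mydb TABLESPACE mytbsp;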

Then all objects you create in that database will automatically be placed in the database's tablespace.

There are ALTER commands to change the tablespace of any object. Moving an object to another tablespace copies the data files, and the object is inaccessible while it is being moved.

Backup and tablespaces

If you perform a file system backup of a database with tablespaces, you have to back up all tablespaces. You cannot back up or restore a single tablespace, and there is no equivalent to Oracle's "transportable tablespaces".

pg_basebackup with the plain format will try to save tablespaces in the same place as on the database server (the -D option only specifies the location of the data directory). To back up data from a tablespace to a different location, you have to use the option --tablespace-mapping=olddir=newdir. You can use this option more than once for multiple tablespaces.

Using tablespaces makes database administration more complicated, because the data directory no longer contains all the data.

When should I create a tablespace?

In the vast majority of cases, you shouldn't create extra tablespaces in PostgreSQL. In particular, it never makes sense to create a tablespace on the same file system as the data directory or on the same file system as another tablespace.

So what are the benefits of tablespaces that justify the administrative complexity?

If you are running in a virtualized environment with virtualized storage, all these points are moot, with the exception of the third. Since almost everybody uses virtualization these days, tablespaces are becoming an increasingly irrelevant PostgreSQL feature.

Dispelling an old myth

There is a tenacious myth circulating among database administrators that you should put tables and indexes on different disks for good performance.

You will hear people making elaborate arguments as to why the particular interplay of access patterns during an index scan makes this efficient on spinning disks. But spinning disks are on their way out, and you typically only saturate your storage system with several concurrent SQL statements, at which point all such access patterns are disrupted anyway.

The truth behind the myth is that it is certainly beneficial to spread the I/O load over multiple devices. If you use striping on the operating system level, you will get a better spread than you will by carefully placing tables and indexes.

Conclusion

Tablespaces are rarely relevant in PostgreSQL. Resist the temptation to create tablespaces and leave all data in the default tablespace.

Just like any advanced relational database, PostgreSQL uses a cost-based query optimizer that tries to turn your SQL queries into something efficient that executes in as little time as possible. For many people, the workings of the optimizer itself remain a mystery, so we have decided to give users some insight into what is really going on behind the scenes.

So let's take a tour through the PostgreSQL optimizer and get an overview of some of the most important techniques the optimizer uses to speed up queries. Note that the techniques listed here are in no way complete. There is a lot more going on, but it makes sense to take a look at the most basic things in order to gain a good understanding of the process.

PostgreSQL constant folding

Constant folding is one of the easier processes to understand. Nonetheless, it's extremely important.

Let's see what happens during the constant folding process:
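A minimal sketch, assuming a table t_demo with an integer column x:

    EXPLAIN SELECT * FROM t_demo WHERE x = 7 + 1;
    -- the plan's filter condition shows up as: Filter: (x = 8)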

Here we add a filter to the query: x = 7 + 1. What the system does is to "fold" the constant and use "x = 8" instead. Why is that important? If "x" is an indexed table column, we can easily look up 8 in the index.

Note the following:
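The counter-example at this point presumably put the arithmetic on the column side, roughly like this (a sketch):

    EXPLAIN SELECT * FROM t_demo WHERE x - 1 = 7;
    -- the filter stays as: Filter: ((x - 1) = 7), so a plain index on x cannot be used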

PostgreSQL does not transform the expression to "x = 8" in this case. That's why you should try to make sure that the expression is on the right-hand side of the comparison, and not on the column you might want to index.

PostgreSQL query optimizer: function inlining

One more important technique is the idea of function inlining. The goal is to reduce function calls as much as possible and thus speed up the query.

Let's create a function to calculate a logarithm:
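A sketch of such a function as a plain SQL function (the name ld matches the discussion below):

    CREATE FUNCTION ld(numeric) RETURNS numeric AS
    $$ SELECT log(2, $1) $$
    LANGUAGE sql;

    SELECT ld(1024);   -- returns 10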

2^10 = 1024. This looks right.

Now, let's see what happens in a real query that uses function inlining:
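For example (again assuming the table t_demo from before):

    EXPLAIN SELECT * FROM t_demo WHERE ld(x) = 10;
    -- the filter in the plan reads log(2, x) = 10 rather than ld(x) = 10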

Look at the WHERE clause: the ld function has been replaced with the underlying log function. Note that this is only possible in the case of SQL functions. PL/pgSQL and other stored procedure languages are black boxes to the optimizer, so whether these things are possible or not depends on the type of language used.

Here is an example using PL/pgSQL:
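A PL/pgSQL variant of the same function might look like this:

    CREATE FUNCTION ld_plpgsql(val numeric) RETURNS numeric AS
    $$ BEGIN RETURN log(2, val); END; $$
    LANGUAGE plpgsql;

    EXPLAIN SELECT * FROM t_demo WHERE ld_plpgsql(x) = 10;
    -- the filter still shows ld_plpgsql(x): no inlining happens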

In this case, inlining is not possible. While the code is basically the same, the programming language does make a major difference.

Function stability: VOLATILE vs. STABLE vs. IMMUTABLE

Something that is often overlooked is the concept of function stability. When creating a function, it makes a difference if a function is created as VOLATILE (default), STABLE, or as IMMUTABLE. It can even make a major difference - especially if you are using indexes. Let's create some sample data and sort these differences out:

VOLATILE means that a function is not guaranteed to return the same result within the same transaction given the same input parameters. In other words, the PostgreSQL optimizer cannot see the function as a constant, and has to execute it for every row.

Here's an example where the optimizer has to execute the function for every row:
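First, the test data - presumably built along these lines (table and column names are assumptions):

    CREATE TABLE t_time AS
        SELECT * FROM generate_series('1900-01-01'::timestamptz,
                                      '2021-09-01'::timestamptz,
                                      '1 minute') AS t;
    CREATE INDEX idx_time ON t_time (t);
    VACUUM ANALYZE t_time;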

We have generated one row per minute since January 1900, which produces roughly 64 million entries.

Let's run the query using a VOLATILE function:
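A sketch of such a query; because clock_timestamp() changes with every call, the index on t cannot be used:

    EXPLAIN ANALYZE
    SELECT * FROM t_time WHERE t = clock_timestamp() - interval '100 years';
    -- the plan is a (parallel) sequential scan over all 64 million rows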

In this case, the query needs a whopping 2.6 seconds and eats up a ton of resources. The reason is that clock_timestamp() is VOLATILE.

Now, what if we try to do the same thing using a STABLE function?
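The STABLE counterpart, using now() instead of clock_timestamp():

    EXPLAIN ANALYZE
    SELECT * FROM t_time WHERE t = now() - interval '100 years';
    -- now() is evaluated only once, so the plan becomes an index scan on idx_time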

The query is many thousands of times faster, because now PostgreSQL can turn it into a constant and thus use the index. If you want to learn more about function stability in PostgreSQL, here is more information.

Equality constraints

The next optimization on our list is the concept of equality constraints. What PostgreSQL tries to do here is to derive implicit knowledge about the query.

First, let's create some sample data for our equality constraints:
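A minimal sketch of such sample data (names are assumptions):

    CREATE TABLE t_eq (x int, y int);
    INSERT INTO t_eq SELECT i, i FROM generate_series(1, 1000) AS i;
    CREATE INDEX idx_eq_x ON t_eq (x);
    ANALYZE t_eq;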

What we have here are 1000 rows. We'll run a simple query:
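For example:

    EXPLAIN SELECT * FROM t_eq WHERE y = 4 AND x = y;
    -- the plan's filter shows (x = 4) AND (y = 4): the constant has been
    -- propagated from y to x, so the index on x can be used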

Again, the magic is in the execution plan. You can see that PostgreSQL has figured out that both x and y are 4.

That opens the door for important optimizations:

Without this optimization, it would be absolutely impossible to use the index we have just created. In order to optimize the query, PostgreSQL will automatically figure out that we can use an index here.

Indexes are important!

View inlining and subselect flattening

When talking about the PostgreSQL optimizer and query optimization there is no way to ignore views and subselect handling.

Let's create a view and query it:
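A sketch, reusing the t_eq table from above:

    CREATE VIEW v_demo AS SELECT * FROM t_eq WHERE x > 100;

    EXPLAIN SELECT * FROM v_demo WHERE y < 200;
    -- the plan only shows a scan of t_eq with both filters combined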

When we look at the execution plan, the view is nowhere to be seen.

The reason is that PostgreSQL has inlined the view as a subselect and flattened it out. How does that work?

This query is turned into …
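… roughly this (a sketch):

    SELECT * FROM (SELECT * FROM t_eq WHERE x > 100) AS v_demo WHERE y < 200;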

Then the subselect is flattened out, which leaves us with …
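… something like:

    SELECT * FROM t_eq WHERE x > 100 AND y < 200;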

There is a parameter in the PostgreSQL query optimizer, from_collapse_limit, which controls this behavior:
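Its default value is 8:

    SHOW from_collapse_limit;
    -- from_collapse_limit
    -- ---------------------
    --  8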

The meaning of this parameter is that only up to 8 subselects in the FROM clause will be flattened out. If there are more than 8 subselects, they will be executed without being flattened out. In most real-world use cases, this is not a problem. It can only become an issue if the SQL statements used are very complex. More information about joins and join_collapse_limit can be found in our blog.

Keep in mind that inlining and flattening are not always possible - the PostgreSQL developers are well aware of those limitations.

Optimizing joins in PostgreSQL

Joins are used in most queries and are therefore of incredible importance to good performance. We'll now focus on some of the techniques relevant to joins in general.

Optimizing join orders

The next important thing on our list is the way the PostgreSQL optimizer handles join orders. In a PostgreSQL database, joins are not necessarily done in the order proposed by the end user - quite the opposite: The query optimizer tries to figure out as many join options as possible.

Let's create some sample data and figure out how the optimizer works when it comes to join orders:
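A sketch of such sample data - three tables with a single id column each (names and sizes are assumptions):

    CREATE TABLE a (id int);
    CREATE TABLE b (id int);
    CREATE TABLE c (id int);

    INSERT INTO a SELECT * FROM generate_series(1, 1000000);
    INSERT INTO b SELECT * FROM generate_series(1, 1000000);
    INSERT INTO c SELECT * FROM generate_series(1, 1000000);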

In the next steps, a couple of indexes are created:
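For instance:

    CREATE INDEX a_id_idx ON a (id);
    CREATE INDEX b_id_idx ON b (id);
    CREATE INDEX c_id_idx ON c (id);
    ANALYZE;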

We now have three tables. We'll query them and see what happens:
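A sketch of such a query (the selective constant is an assumption):

    EXPLAIN SELECT *
    FROM c JOIN a ON c.id = a.id
           JOIN b ON a.id = b.id
    WHERE c.id = 10;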

Note that the query joins "c and a" and then "a and b". However, let's look at the plan more closely. PostgreSQL starts with index scans on a and b. The result is then joined with c. Three indexes are used. This happens because of the equality constraints we discussed before. To find out about forcing the join order, see this blog.

Implicit vs. explicit joins in PostgreSQL

Many people keep asking about explicit versus implicit joins. Basically, both variants are the same.

Let's check out two queries, one with an explicit and one with an implicit join:
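For example (reusing tables a and b):

    -- explicit join
    SELECT * FROM a JOIN b ON a.id = b.id;

    -- implicit join
    SELECT * FROM a, b WHERE a.id = b.id;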

Both queries are identical, and the planner will treat them the same way for most commonly seen queries. Mind that explicit joins work with and without parentheses.

However, there is one parameter that is of great importance here, join_collapse_limit:
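Like from_collapse_limit, it defaults to 8:

    SHOW join_collapse_limit;
    -- join_collapse_limit
    -- ---------------------
    --  8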

The join_collapse_limit parameter controls how many explicit joins are planned implicitly. In other words, an implicit join is just like an explicit join, but only up to a certain number of joins controlled by this parameter. See this blog for more information. It is also possible to use join_collapse_limit to force the join order, as explained in this blog.

For the sake of simplicity, we can assume that it makes no difference for 95% of all queries and for most customers.

Determine the join strategy

PostgreSQL offers various join strategies. These strategies include hash joins, merge joins, nested loops, and a lot more. We have already shared some of this information in previous posts. More on PostgreSQL join strategies can be found here.

Optimizing outer joins (LEFT JOIN, etc.)

Optimizing outer joins (LEFT JOIN, RIGHT JOIN, etc.) is an important topic. Usually, the planner has fewer options here than in the case of inner joins. The following optimizations are possible:
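The identities in question are presumably quoted from the PostgreSQL optimizer README (src/backend/optimizer/README); the first one reads as follows (the README lists several more):

    (A leftjoin B on (Pab)) innerjoin C on (Pac)
        = (A innerjoin C on (Pac)) leftjoin B on (Pab)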

where Pac is a predicate referencing A and C, etc (in this case, clearly Pac cannot reference B, or the transformation is nonsensical).

While this theoretical explanation is correct, most people will have no clue what it means in real life.

Therefore I have compiled a real-world example showing how PostgreSQL actually reorders a real join:
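A sketch of such an example, built from set-returning functions so that the row-count estimates drive the join order (the sizes are assumptions, not the numbers from the original measurement):

    EXPLAIN SELECT *
    FROM generate_series(1, 100)            AS x (id)
         LEFT JOIN generate_series(1, 1000) AS y (id) ON x.id = y.id
         LEFT JOIN generate_series(1, 1000) AS z (id) ON x.id = z.id;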

What we see here is that the PostgreSQL optimizer decides on joining x with y and then with z. In other words, the PostgreSQL optimizer has simply followed the join order as used in the SQL statement.

But what happens if we decide to tweak the parameters a bit?
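A sketch of the altered version, where y is made much larger and z much smaller:

    EXPLAIN SELECT *
    FROM generate_series(1, 100)               AS x (id)
         LEFT JOIN generate_series(1, 1000000) AS y (id) ON x.id = y.id
         LEFT JOIN generate_series(1, 10)      AS z (id) ON x.id = z.id;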

This is the same query, but with slightly altered parameters.

The difference is that PostgreSQL again starts with x but then joins z first, before adding y.

Note that this optimization happens automatically. One reason why the optimizer can make this decision is the existence of optimizer support functions, which were added to PostgreSQL a while ago. The reordering works because support functions offer the planner a chance to figure out how many rows are returned from which part. If you use tables instead of set-returning functions, support functions are irrelevant, because the planner gets its estimates from the table statistics. PostgreSQL v16 has also added support for right anti-joins.

Automatic join pruning in PostgreSQL

Not every join in a query is actually executed by PostgreSQL. The optimizer knows the concept of join pruning and is able to get rid of pointless joins quite efficiently. The main question is: When is that possible, and how can we figure out what's going on?

The next listing shows how some suitable sample data can be created:
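A sketch (names and sizes are assumptions):

    CREATE TABLE t_left  (id int, payload text);
    CREATE TABLE t_right (id int, payload text);

    INSERT INTO t_left  SELECT i, 'left'  FROM generate_series(1, 100000) AS i;
    INSERT INTO t_right SELECT i, 'right' FROM generate_series(1, 100000) AS i;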

In this case, we need to make sure that both sides actually have primary keys, or some kind of unique constraint:
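For example:

    ALTER TABLE t_left  ADD PRIMARY KEY (id);
    ALTER TABLE t_right ADD PRIMARY KEY (id);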

To show how the PostgreSQL query planner handles join pruning, we'll take a look at two different SQL statements:
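The first statement reads columns from both sides (a sketch):

    EXPLAIN SELECT * FROM t_left AS l LEFT JOIN t_right AS r ON l.id = r.id;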

In this case, the join has to be executed. As you can see, PostgreSQL has decided on a hash join.

The next example contains only a small variation of the query:

Join pruning can happen if we DO NOT read data from the right side, and if the right side is unique. If the right side is not unique, the join might actually increase the number of rows returned; so pruning is only possible in case the right side is unique.

Let's try out join pruning:
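A sketch of the pruned variant, which only reads the left side:

    EXPLAIN SELECT l.* FROM t_left AS l LEFT JOIN t_right AS r ON l.id = r.id;
    -- r.id is unique and no column of t_right is needed,
    -- so the plan contains nothing but a sequential scan on t_left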

While it is certainly a good thing to have join pruning in the PostgreSQL optimizer, you have to be aware of the fact that the planner is basically fixing something which should not exist anyway. Write queries efficiently in the first place; don't add pointless joins.

EXISTS and anti-joins

There is also something you will see everywhere in SQL code: EXISTS. Here's an example:
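A sketch; the anti-join arises from NOT EXISTS (table names are the ones from above):

    EXPLAIN SELECT *
    FROM t_left AS l
    WHERE NOT EXISTS (SELECT 1 FROM t_right AS r WHERE r.id = l.id);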

This might not look like a big deal, but consider the alternatives. What PostgreSQL does here is to create a "hash anti-join". This is way more efficient than some sort of nested loop. In short: the nested loop is replaced with a join, which can yield significant performance gains.

Making use of sorted output

Every database relies heavily on sorting, which is necessary to handle many different types of queries and optimize various workloads. One of the key optimizations in this area is that PostgreSQL can use indexes to optimize ORDER BY in a very clever way. See more about how PostgreSQL can optimize subqueries in the blog Subqueries and Performance in PostgreSQL.

Let's see how optimizing ORDER BY looks:
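A sketch of such a data set (names are assumptions):

    CREATE TABLE t_sort (id int, name text);
    INSERT INTO t_sort
        SELECT i, 'dummy' FROM generate_series(1, 1000000) AS i ORDER BY random();
    VACUUM ANALYZE t_sort;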

The listing creates a list of 1 million entries that is stored on disk in random order. Subsequently, VACUUM was called to ensure that all PostgreSQL hint-bit-related issues were sorted out before the test was executed. If you want to know what hint bits are and how they operate, check out our post about hint bits in PostgreSQL.

Let's run explain analyze:
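A sketch of such a query (the LIMIT matches the discussion of "enough rows" below):

    EXPLAIN ANALYZE SELECT id FROM t_sort ORDER BY id LIMIT 10;
    -- without an index, all 1 million rows have to be sorted (in parallel)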

PostgreSQL needs more than one CPU core to process the query in 86 milliseconds, which is a lot.

Now, what happens if we create an index?
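For example:

    CREATE INDEX idx_sort_id ON t_sort (id);

    EXPLAIN ANALYZE SELECT id FROM t_sort ORDER BY id LIMIT 10;
    -- the plan is now Limit over an Index Only Scan: no sort step at all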

After adding an index, the query executed in a fraction of a millisecond. However, what is most important here is that we do NOT see a sort-step. PostgreSQL knows that the index returns data in sorted order (sorted by id) and thus there is no need to sort the data all over again. PostgreSQL consults the index and can simply take the data as it is and feed it to the client until enough rows have been found. In this special case, even an index-only scan is possible, because we are only looking for columns which actually exist in the index.

Reducing the amount of data to be sorted is vital to performance, and thus important to the user experience.

min and max can be used to reduce the amount of data to be sorted:
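A sketch:

    EXPLAIN SELECT min(id), max(id) FROM t_sort;
    -- both aggregates are answered by init plans that fetch the first and
    -- last non-NULL index entry instead of scanning the table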

The minimum value is the first non-NULL value in a sorted list; the maximum value is the last non-NULL value in a sequence of sorted values. PostgreSQL can take this into consideration and replace the standard way of processing aggregates with a subplan that simply consults the index.

In a way, calling "min" is the same as …
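… a query along these lines:

    SELECT id FROM t_sort WHERE id IS NOT NULL ORDER BY id LIMIT 1;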

Fortunately, PostgreSQL does this for you and optimizes the query perfectly.

Partition pruning and constraint exclusion at work

Partitioning is one of the favorite features of many PostgreSQL users. It offers you the ability to reduce table sizes and break up data into smaller chunks, which are easier to handle and in many cases (not all of them) faster to query. However, keep in mind that only SOME queries might be faster - the increased planning time can backfire, and this is not a rare corner case. We have seen partitioning decrease speed countless times in cases where partitioning wasn't suitable.

Logically, the PostgreSQL optimizer has to handle partitioning in a clever way and make sure that only those partitions are touched which might actually contain some of the data.

The following is an example of how the optimizer manages partitioning issues:
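A sketch of such a table (names are assumptions):

    CREATE TABLE t_part (x int) PARTITION BY RANGE (x);
    CREATE TABLE t_part_negative PARTITION OF t_part FOR VALUES FROM (MINVALUE) TO (0);
    CREATE TABLE t_part_positive PARTITION OF t_part FOR VALUES FROM (0) TO (MAXVALUE);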

In this case, a table with two partitions has been created.

What is the planner going to do if we are only looking for positive values?
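For example:

    EXPLAIN SELECT * FROM t_part WHERE x > 0;
    -- only t_part_positive appears in the plan; the other partition is pruned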

The PostgreSQL optimizer correctly figured out that the data cannot be in one of the partitions and removed it from the execution plan.

If we want to query all values below 1000, pruning is not possible, and all partitions are correctly queried:
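A sketch:

    EXPLAIN SELECT * FROM t_part WHERE x < 1000;
    -- both partitions can contain matching rows, so both are scanned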

Conclusion …

The optimizations you have seen on this page are only the beginning of everything that PostgreSQL can do for you. Itโ€™s a good start on getting an impression of what is going on behind the scenes. Usually, database optimizers are some sort of black box and people rarely know which optimizations really happen. The goal of this page is to shed some light on the mathematical transformations going on.



PostgreSQL is a secure database and we want to keep it that way. It makes sense, then, to consider SSL to encrypt the connection between client and server. This posting will help you to set up SSL authentication for PostgreSQL properly, and hopefully also to understand some background information to make your database more secure.

SSL authentication with PostgreSQL

At the end of this post, you should be able to configure PostgreSQL and handle secure client server connections in the easiest way possible.

Configuring PostgreSQL for OpenSSL

The first thing we have to do to set up OpenSSL is to change postgresql.conf. There are a couple of parameters which are related to encryption:
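In a stock postgresql.conf, the relevant block looks roughly like this (paths and the cipher list may differ between versions and distributions):

    ssl = off
    #ssl_ca_file = ''
    #ssl_cert_file = 'server.crt'
    #ssl_crl_file = ''
    #ssl_key_file = 'server.key'
    #ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
    #ssl_prefer_server_ciphers = on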

Once ssl = on, the server will negotiate SSL connections in case they are possible. The remaining parameters define the location of the key files and the strength of the ciphers. Please note that turning SSL on does not require a database restart - the variable can be set with a plain reload. However, you will still need a restart later on; otherwise PostgreSQL will not accept SSL connections. This is an important point which frequently causes problems for users:
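One way to change and verify the setting from psql (editing postgresql.conf directly works just as well):

    ALTER SYSTEM SET ssl = on;
    SELECT pg_reload_conf();
    SHOW ssl;    -- now reports "on"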

The SHOW command is an easy way to make sure that the setting has indeed been changed. Technically, pg_reload_conf() is not needed at this stage. It is necessary to restart later anyway. We just reloaded to show the effect on the variable.

In the next step, we have to adjust pg_hba.conf to ensure that PostgreSQL will handle our connections in a secure way:
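For example, to require SSL for remote connections (address range and authentication method are assumptions; adjust them to your environment):

    # TYPE    DATABASE   USER   ADDRESS        METHOD
    hostssl   all        all    10.0.3.0/24    scram-sha-256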

Then restart the database instance to make sure SSL is enabled.

The next thing you have to do is to create certificates.

In order to keep things simple, we will create self-signed certificates here. However, it is of course also possible to use certificates issued by a certification authority. Here is how it works:
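A self-signed certificate can be created as described in the PostgreSQL documentation (the CN is an assumption):

    openssl req -new -x509 -days 365 -nodes -text \
        -out server.crt -keyout server.key \
        -subj "/CN=dbhost.example.com"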

This certificate will be valid for 365 days.

Set permissions

Next we have to set permissions to ensure the certificate can be used. If those permissions are too relaxed, the server will not accept the certificate:
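Following the PostgreSQL documentation:

    chmod og-rwx server.key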

Self-signed certificates are nice. However, to create a server certificate whose identity and origin can be validated by clients, first create a certificate signing request and a public/private key file:
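A sketch, again following the documentation (the CN is an assumption):

    openssl req -new -nodes -text -out root.csr \
        -keyout root.key -subj "/CN=root.example.com"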

Again, we have to make sure that those permissions are exactly the way they should be:
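As before:

    chmod og-rwx root.key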

Then we sign the request.

To do that with OpenSSL, we first have to find out where openssl.cnf can be found. We have seen that it is not always in the same place - so make sure you are using the right path:
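The directory can be queried from OpenSSL itself:

    openssl version -d
    # OPENSSLDIR: "/usr/lib/ssl"     (the exact path varies between systems)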

We use this path when we sign the request:
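A sketch, assuming openssl.cnf lives in /usr/lib/ssl:

    openssl x509 -req -in root.csr -text -days 3650 \
        -extfile /usr/lib/ssl/openssl.cnf -extensions v3_ca \
        -signkey root.key -out root.crt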

Let's create the certificate with the new root authority:
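A sketch of creating and signing the server certificate (the CN is an assumption):

    openssl req -new -nodes -text -out server.csr \
        -keyout server.key -subj "/CN=dbhost.example.com"
    chmod og-rwx server.key

    openssl x509 -req -in server.csr -text -days 365 \
        -CA root.crt -CAkey root.key -CAcreateserial \
        -out server.crt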

server.crt and server.key should be stored on the server in your data directory, as configured in postgresql.conf.

But there's more: root.crt should be stored on the client, so the client can verify that the server's certificate was signed by the certification authority. root.key should be stored offline for use in creating future certificates.

The following files are needed:

File name                          | Purpose of the file                              | Remarks
-----------------------------------+--------------------------------------------------+---------
ssl_cert_file ($PGDATA/server.crt) | server certificate                               | sent to the client to indicate the server's identity
ssl_key_file ($PGDATA/server.key)  | server private key                               | proves the server certificate was sent by its owner; does not indicate that the certificate owner is trustworthy
ssl_ca_file                        | trusted certificate authorities                  | checks that the client certificate is signed by a trusted certificate authority
ssl_crl_file                       | certificates revoked by certificate authorities  | client certificate must not be on this list

Checking your setup

Now that all the certificates are in place, it is time to restart the server:
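For example:

    pg_ctl -D $PGDATA restart
    # or, on systemd-based installations: systemctl restart postgresql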

Without a restart, the connection would fail with an error message ("psql: error: FATAL: no pg_hba.conf entry for host "10.0.3.200", user "postgres", database "test", SSL off").

However, after the restart, the process should work as expected:

psql indicates that the connection is encrypted. To figure out if the connection is indeed encrypted, we need to check the content of pg_stat_ssl:

Let us query the system view and see what it contains:
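A sketch of such a query:

    SELECT pid, ssl, version, cipher, bits
    FROM pg_stat_ssl
    WHERE pid = pg_backend_pid();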

The connection has been successfully encrypted. If "ssl = true", then we have succeeded.

Different levels of SSL supported by PostgreSQL

Two SSL setups are not necessarily identical. There are various levels which allow you to control the desired level of security and protection. The following table outlines those SSL modes as supported by PostgreSQL:


sslmode     | Eavesdropping protection | MITM (man-in-the-middle) protection | Statement
------------+--------------------------+-------------------------------------+----------
disable     | No                       | No                    | No SSL, no encryption and thus no overhead.
allow       | Maybe                    | No                    | The client attempts an unencrypted connection, but uses an encrypted connection if the server insists.
prefer      | Maybe                    | No                    | The reverse of the "allow" mode: the client attempts an encrypted connection, but uses an unencrypted connection if the server insists.
require     | Yes                      | No                    | Data should be encrypted and the overhead of doing so is accepted. The network is trusted and will send me to the desired server.
verify-ca   | Yes                      | Depends on CA policy  | Data must be encrypted. Systems must be doubly sure that the connection to the right server is established.
verify-full | Yes                      | Yes                   | Strongest protection possible. Full encryption and full validation of the desired target server.

The overhead really depends on the mode you are using. First let's take a look at the general mechanism:

Figure: SSL authentication with PostgreSQL

The main question now is: How does one specify the mode to be used? The answer is that it is specified as part of the connect string, as shown in the next example:
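A sketch, using the server address from the error message above (host name and credentials are assumptions):

    psql 'host=10.0.3.200 dbname=test user=postgres sslmode=verify-ca'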

In this case, verify-ca does not work because to do that the root.* files have to be copied to the client, and the certificates have to be ones which allow for proper validation of the target server.

Encrypting your entire server: PostgreSQL TDE

So far, you have learned how to encrypt the connection between client and server. However, sometimes it is necessary to encrypt the entire server, including storage. PostgreSQL TDE does exactly that:

Transparent Data Encryption of PostgreSQL

To find out more, check out our website about PostgreSQL TDE. We offer a fully encrypted stack to help you achieve maximum security. PostgreSQL TDE is available for free (Open Source).

What might also interest you ...

More recent information about encryption keys: Manage Encryption Keys with PostgreSQL TDE

Materialized views are an important feature in most databases, including PostgreSQL. They can help to speed up large calculations, or at least to cache them.

If you want to make sure that your materialized views are up to date, and if you want to read more about PostgreSQL right now, check out our blog about pg_timetable, which shows you how to schedule jobs in PostgreSQL. Why is pg_timetable so useful? Our scheduler makes sure that identical jobs cannot overlap; they simply don't execute again if the same job is already running. For long-running jobs, using a scheduler is super important - especially if you want to use materialized views.
 

Last time, we experimented with lesser known PostGIS functions to extract areas of interest for sales. Now, let's extend our example regarding catchment areas by optimizing trips within the area of interest we generated in our previous example, which is around Hamburg. Let's ask the following question:
which order should we visit our major cities in so that we can optimize our trip's costs?

This optimization problem is commonly known as the Traveling Salesman Problem (or TSP).

Apart from PostGIS, pgRouting will be used to tackle this challenge within PostgreSQL only. pgRouting is shipped as a PostgreSQL extension, which adds a lot of geospatial routing functionality on top of PostGIS.
I recommend you check out its online documentation, located at https://pgrouting.org/, to get an overview.



Figure 1 Area of interest around Hamburg / Figure 2 Area of interest around Hamburg, Zoom

This article is organized as follows:

  1. Dataset import
  2. Layer setup
  3. Extract shortest round trip
  4. Final thoughts and outlook

Dataset import

We will re-use osm files and datasets processed in my last blogpost, so first, please play through the blogpost sections involved. Subsequently, an intermediate table "tspPoints" must be created, which contains only the major cities and airports covered by our preferred area around Hamburg (see the annex for the DDL).
To solve TSP by utilizing our pgRouting functionality, we must generate a graph out of osm data. Different tools like osm2pgrouting or osm2po exist, which generate a routable graph from plain osm data. Due to limited memory resources on my development machine, I decided to give osm2po a try.

So then - let's start. After downloading the latest osm2po release from https://osm2po.de/releases/,
we need to adapt its configuration in order to generate the desired graph as an SQL file.
Please uncomment the following line in osm2po.config to accomplish this task.

postp.0.class = de.cm.osm2po.plugins.postp.PgRoutingWriter

Now we're ready to generate our graph by executing

The output SQL file can easily be imported into PostGIS by calling
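psql on the generated file - assuming it is called hh_2po_4pgr.sql and the target database is called routing:

    psql -d routing -f hh_2po_4pgr.sql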

The resulting graph table contains geometries for edges only (see the annex for the DDL). To generate source and target nodes too, let's utilize pgr_createVerticesTable as follows:
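A sketch of that call, assuming osm2po's default column names (geom_way, source, target) and the table hh_2po_4pgr:

    SELECT pgr_createVerticesTable('hh_2po_4pgr', 'geom_way', 'source', 'target');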

Layer setup

Slowly, we're getting closer. Figures 3 to 5 represent our stacked layer setup on top of OpenStreetMap, which is based upon the following tables:


Figure 3 Intermediate trip stops


Figure 4 Edges around Hamburg


Figure 5 Edges + nodes around Hamburg

Solving TSP

Now let's ask pgRouting to solve our traveling salesman problem by querying
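pgr_TSP on a cost matrix. A sketch of such a query, assuming the edge table hh_2po_4pgr and a tspPoints table that stores the nearest graph node of each stop in a column called node_id:

    SELECT *
    FROM pgr_TSP(
        $$SELECT * FROM pgr_dijkstraCostMatrix(
              'SELECT id, source, target, cost, reverse_cost FROM hh_2po_4pgr',
              (SELECT array_agg(node_id) FROM tsppoints),
              directed => false)$$);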

This command results in the optimal order of intermediate stops (edges next to our cities) forming our trip. It's worth pointing out that the optimal route (order) depends on the cost function used to build the cost matrix, which defines the edges' weights. In our case, the cost function is prescribed by osm2po and expressed as length/kmh.

To export the sequence of city names to pass through, the nodes returned by pgr_TSP must be joined back to tspPoints. Please consult the annex for this particular query extension. From figure 6 we now realize that the optimal route goes from Hamburg Airport to Kiel, Lübeck, Rostock and Hamburg before bringing us back to our airport.

Figure 6 Optimal order of stops

Final thoughts and outlook

This time, we took a quick look at pgRouting to solve a basic traveling salesman problem within PostGIS. Feel free to experiment with further cost functions and to move on to calculating detailed routes between cities by utilizing further pgRouting functions.

Annex

by Kaarel Moppel

The big question we hear quite often is, "Can and should we run production Postgres workloads in Docker? Does it work?" The answer in short: yes, it will work... if you really want it to... or if it's all only fun and play, i.e. for throwaway stuff like testing.

Containers, commonly also just called Docker, have definitely been a thing for quite a few years now. (There are other popular container runtimes out there, and it's not a proprietary technology per se, but let's just say Docker to save on typing.) More and more people are "jumping on the container-ship" and want to try out Docker, or have already given this technology a go. However, containers were originally designed more as a vehicle for code; they were initially intended to provide a worry-free "batteries included" deployment experience. The idea is that it "just works" anywhere and is basically immutable. That way, quality can easily be tested and guaranteed across the board.

Those are all perfectly desirable properties indeed for developers...but what if you're in the business of data and database management? Databases, as we know, are not really immutable - they maintain a state, so that code can stay relatively "dumb" and doesn't have to "worry" about state. Statelessness enables rapid feature development and deployment, and even push-button scaling - just add more containers!

Running Postgres in Docker

Should I use Postgres with Docker?

If your sensors are halfway functional, you might have picked up on some concerned tones in that last statement, meaning there are some "buts" - as usual. So why not fully embrace this great modern technology and go all in? Especially since I already said it definitely works.

The reason is that there are some aspects you should at least take into account to avoid cold sweats and swearing later on. To summarise: you'll benefit greatly for your production-grade use cases only if you're ready to do the following:

a) live fully on a container framework like Kubernetes / OpenShift

b) depend on some additional 3rd party software projects not directly affiliated with the PostgreSQL Global Development Group

c) or maintain either your own Docker images, including some commonly needed extensions, or some scripts to perform common operational tasks like upgrading between major versions.

To reiterate - yes, containers are mostly a great technology, and this type of stuff is interesting and probably would look cool on your CV...but: the origins of container technologies do not stem from persistent use cases. Also, the PostgreSQL project does not really do much for you here besides giving you a quick and convenient way to launch a standard PostgreSQL instance on version X.

A tester's dream

Not to sound too discouraging - there is definitely at least one perfectly valid use case out there for Docker / containers: it's perfect for all kinds of testing, especially for integration and smoke testing!

Since containers are basically implemented as super light-weight "mini VMs", you can start and discard them in seconds! That, however, assumes you have already downloaded the image. If not, then the first launch will take a minute or two, depending on how good your internet connection is 🙂

As a matter of fact, I personally usually have all the recent (9.0+) versions of Postgres constantly running on my workstation in the background, via Docker! Of course, I don't use all those versions too frequently - however, since they don't ask for too much attention and don't use up too many resources when "idling", they don't bother me. Also, they're always there for me when I need to test out some Postgres statistics-fetching queries for our Postgres monitoring tool called pgwatch2. The only annoying thing that could pester you a bit: if you happen to also run Postgres on the host machine and want to take a look at a process listing to figure out what it's doing (e.g. ps -efH | grep postgres), the "in container" processes show up and somewhat "litter" the picture.

Slonik in a box - a quickstart

OK, so I want to benefit from those light-weight pre-built "all-inclusive" database images that everyone is talking about and launch one - how do I get started? Which images should I use?

As always, you can't go wrong with the official stuff - and luckily, the PostgreSQL project provides all modern major versions (going back to v8.4 by the way, released in 2009!) via the official Docker Hub. You also need to know some "Docker foo". For a simple test run, you usually want something similar to what you can see in the code below.

NB! As a first step, you need to install the Docker runtime / engine (if it is not already installed). I'll not be covering that, as it should be a simple process of following the official documentation line by line.

Also note: when launching images, we always need to explicitly expose or "remap" the default Postgres port to a free port of our preference. Ports are the "service interface" for Docker images, over which all communication normally happens. So we actually don't need to care about how the service is internally implemented!

Note that you don't have to actually use "trust" authentication, but can also set a password for the default "postgres" superuser via the POSTGRES_PASSWORD env variable.
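A minimal sketch of such a launch (image tag, container name and host port are arbitrary choices):

    docker run -d --name pg13 -p 5433:5432 \
        -e POSTGRES_HOST_AUTH_METHOD=trust postgres:13

    psql -h localhost -p 5433 -U postgres -c "SELECT version()"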

Once you've had enough of Slonik's services for the time being, just throw away the container and all the stored tables / files etc with the following code:
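For example:

    docker stop pg13
    docker rm pg13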

Couldn't be any simpler!

NB! Note that I could also explicitly mark the launched container as "temporary" with the '--rm' flag when launching the container, so that any data remains would automatically be destroyed upon stopping.

Peeking inside the container

Now that we have seen how basic container usage works, complete Docker beginners might get curious here - how does it actually function? What is actually running down there inside the container box?

First, we should probably clear up the two concepts that people often initially mix up: images and containers.

Let's make sense of this visually:

Other common tasks when working with Docker might be:

Checking the logs of a specific container, for example, to get more insights into query errors
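For example (container name is an assumption):

    docker logs pg13        # add -f to keep following the output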

Listing the IP address of the container
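A common incantation for that (container name is an assumption):

    docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' pg13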

Note that by default, all Docker containers can speak to each other, since they get assigned to the default subnet of 172.17.0.0/16. If you don't like that, you can also create custom networks to cordon off some containers, whereby then they can also access each other using the container name!

Executing custom commands on the container

Note that this should be a rather rare occasion, and is usually only necessary for some troubleshooting purposes. You should try not to install new programs and change the files directly, as this kind of defeats the concept of immutability. Luckily, in the case of official Postgres images you can easily do that, since it runs under "root" and the Debian repositories are also still connected - which a lot of images remove, in order to prevent all sorts of maintenance nightmares.

Here is an example of how to install a 3rd party extension. By default, we only get the "contrib extensions" that are part of the official Postgres project.
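A sketch, assuming the Debian-based postgres:13 image and an extension that is packaged in the connected repositories (the package name is an assumption):

    docker exec -it pg13 bash -c \
        'apt-get update && apt-get install -y postgresql-13-pg-cron'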

Changing the PostgreSQL configuration

Quite often when doing some application testing, you want to measure how much time the queries really take - i.e. measure things from the DB engine side via the indispensable "pg_stat_statements" extension. You can do it relatively easily, without going "into" the container! Starting from Postgres version 9.5, to be exact...
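One way to do it is to pass the configuration on the command line when launching the container (a sketch; container name and port are arbitrary):

    docker run -d --name pg13-stat -p 5434:5432 \
        -e POSTGRES_HOST_AUTH_METHOD=trust postgres:13 \
        -c shared_preload_libraries=pg_stat_statements \
        -c pg_stat_statements.track=all

    psql -h localhost -p 5434 -U postgres \
        -c "CREATE EXTENSION pg_stat_statements"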

Don't forget about the volumes

As stated in the Docker documentation: "Ideally, very little data is written to a container's writable layer, and you use Docker volumes to write data."

The thing about containers' data layer is that it's not really meant to be changed! Remember, containers should be kind of immutable. The way it works internally is via "copy-on-write". There's also a bunch of different storage drivers used across historical Docker runtime versions, plus some differences which spring from different host OS versions. It can get quite complex, and most importantly, slow on the disk access level via the "virtualized" file access layer. It's best to listen to what the documentation says, and set up volumes for your data to begin with.

Aha, but what are volumes, exactly?

They're directly connected and persistent OS folders where Docker tries to stay out of the way as much as possible. That way, you don't actually lose out on file system performance and features. The latter is not really guaranteed, though - and can be platform-dependent. Things might look a bit hairy, especially on Windows (as usual), where one nice issue comes to mind. The most important keyword here might be "persistent" - meaning volumes don't disappear, even when a container is deleted! So they can also be used to "migrate" from one version of the software to another.

How should you use volumes, in practice? There are two ways to use volumes: the implicit and the explicit. The "fine print", by the way, is available here.

Also, note that we actually need to know beforehand what paths should be directly accessed, i.e. "volumized"! How can you find out such paths? Well, you could start from the Docker Hub "postgres" page, or locate the instruction files (the Dockerfile) that are used to build the Postgres images and search for the "VOLUME" keyword. The latter can be found for Postgres here.
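An explicit volume mapping could then look like this (the host path is an assumption; /var/lib/postgresql/data is the path declared by the official image):

    docker run -d --name pg13-vol -p 5435:5432 \
        -e POSTGRES_HOST_AUTH_METHOD=trust \
        -v /srv/pgdata:/var/lib/postgresql/data \
        postgres:13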

Some drops of tar - big benefits possible, with some drawbacks

To tie up the knots on this posting - if you like containers in general, and also need to run some PostgreSQL services - go ahead! Containers can be made to work pretty well, and for bigger organizations running hundreds of PostgreSQL services, it can actually make life a lot easier and more standardized once everything has been automated. Most of the time, the containers won't bite you.

But at the same time, you had better be aware of the pitfalls:

TLDR;

Don't want to sound like a luddite again, but before going "all in" on containers you should acknowledge two things. One, there are major benefits to production-level database containers only if you're using some container automation platform like Kubernetes. Two, the benefits will come only if you are willing to make yourself somewhat dependent on some 3rd-party software vendors. Those vendors are not out to simplify the life of smaller shops, but rather cater to bigger "K8s for the win" organizations, and they often encode that way of thinking into their frameworks, which might not align well with your way of doing things.

Also, not all aspects of the typical database lifecycle are well covered. My recommendation is: if it currently works for you "as is", and you're not 100% migrating to some container-orchestration framework for all other parts of your software stack, be aware that you're only winning in the ease of the initial deployment and typically also in automatic high-availability (which is great of course!) - but not necessarily in all aspects of the whole lifecycle (fast major version upgrades, backups the way you like them, access control, etc).

On the other hand - if you feel comfortable with some container framework like Kubernetes and/or can foresee that youโ€™ll be running oodles of database instances - give it a go! -- after you research possible problem points, of course.

On the positive side - since I am in communication with a pretty wide crowd of DBAs, I can say that many bigger organizations do not want to look back at the traditional way of running databases after learning to trust containers.

Anyway, it went a bit long - thanks for reading, and please do let me know in the comments section if you have some thoughts on the topic!

Intro

GitHub Actions (GHA) are altogether a piece of excellent machinery for continuous integration or other automated tasks on your repo. I started to use them from the release day on as a replacement for CircleCI. Not that I think CircleCI is a bad product; I love to have everything in one place if possible. However, using a young product is a challenge. Even now, there is no easy way to debug actions.

I came up with many solutions: Docker-based actions, actions downloading binaries, etc. But this post will cover using the latest GitHub Actions Virtual Environments, which have PostgreSQL installed by default. Handy, huh? 🙂

Available GitHub Actions Virtual Environments

Here is the table listing all available GitHub Actions Virtual Environments at the moment:

Environment         | YAML Label                     | Included Software
--------------------+--------------------------------+-------------------
Ubuntu 20.04        | ubuntu-20.04                   | ubuntu-20.04
Ubuntu 18.04        | ubuntu-latest or ubuntu-18.04  | ubuntu-18.04
Ubuntu 16.04        | ubuntu-16.04                   | ubuntu-16.04
macOS 11.0          | macos-11.0                     | macOS-11.0
macOS 10.15         | macos-latest or macos-10.15    | macOS-10.15
Windows Server 2019 | windows-latest or windows-2019 | windows-2019
Windows Server 2016 | windows-2016                   | windows-2016

 

In this post, I will use three of them: windows-latest, ubuntu-latest, and macos-latest. However, you may use any of the environments available. These actions were first written for pg_timetable testing, but now they are used as a template for all Cybertec PostgreSQL-related actions.

Each of the actions below will:

Of course, you may want to add more steps in real life, e.g., import test data, checkout, build, test, gather coverage, release, etc.

PostgreSQL GitHub Action for Ubuntu

Nothing unusual here for Ubuntu users. We use systemctl to start PostgreSQL and the pg_isready utility to check if the server is running.
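The run step boils down to something like this (a sketch):

    sudo systemctl start postgresql.service
    pg_isready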

To create a scheduler user, we use a psql client in non-interactive mode. We send two commands to it:

First, we create the user. Second, we output the list of users for control.
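A sketch (the user name follows the text, the password is a placeholder):

    sudo -u postgres psql --command="CREATE USER scheduler PASSWORD 'somestrong'" \
                          --command="\du"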

💡 To remember psql commands, try to decode them. For example, \dt - describe tables, \du - describe users, etc.

To create a timetable database, we use the createdb utility. Pay attention to the fact that sudo -u postgres allows us to not specify connection credentials, because a system user is allowed to connect locally without any restrictions. Then, just like in the previous step, list the databases with psql for control.
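For example:

    sudo -u postgres createdb --owner=scheduler timetable
    sudo -u postgres psql --command="\l"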

PostgreSQL GitHub Action for MacOS

There are not so many differences from Ubuntu for MacOS:
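A sketch of the macOS steps, where PostgreSQL is installed via Homebrew:

    brew services start postgresql
    pg_isready
    psql postgres --command="CREATE USER scheduler PASSWORD 'somestrong'" --command="\du"
    createdb --owner=scheduler timetable
    psql postgres --command="\l"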

As you probably noticed, we may skip sudo -u postgres prefixes since the current user has all the rights needed in this environment.

PostgreSQL GitHub Action for Windows

With Windows, everything is different, but not as scary as haters usually paint it to be:

I think the code is straightforward. The only thing I want to highlight is the ampersand sign (& aka call operator) before some commands.

The call operator (&) allows you to execute a command, script, or function. Since the PostgreSQL utilities psql and createdb are not in the PATH, we need to specify the full path to them. We use the environmental variable PGBIN for that. But after the concatenation operation, we have a regular string. Therefore, a call operator allows us to execute a resulting command.
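Putting it together, the PowerShell steps might look like this (a sketch; the password is a placeholder):

    & "$env:PGBIN\psql" --command="CREATE USER scheduler PASSWORD 'somestrong'" --command="\du"
    & "$env:PGBIN\createdb" --owner=scheduler timetable
    & "$env:PGBIN\psql" --command="\l"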

💡 PowerShell (pwsh) is available for all environments in GitHub Actions, including Ubuntu and macOS. That's a pretty powerful shell, I must say! Try to give it a chance and you will be surprised.

Conclusion

There are also other ways to run PostgreSQL within GitHub Actions. Let me know if you are interested in this topic, and I will write again with even more examples.

In conclusion, I wish you all the best! ♥️
Please, stay safe! So we can meet in person at one of the conferences, meetups, or training sessions.

target_session_attrs used in real life - well, almost real
© Laurenz Albe 2021

 

PostgreSQL commit ee28cacf61 has added new options for the target_session_attrs connection parameter. This article explores how the new options can be used to simplify high availability and horizontal scaling solutions.

What is target_session_attrs?

The PostgreSQL client shared library libpq has support for connection strings to more than one database server:
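In keyword/value format, such a connection string could look like this (database name and user are assumptions):

    host=server1.my.org,server2.my.org port=5432 dbname=mydb user=myuser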

In the URL connection string format, that would look like
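(again, database name and user are assumptions):

    postgresql://server1.my.org,server2.my.org/mydb?user=myuser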

Then the client will first try to connect to server1.my.org, and if it cannot reach that server, it will try to connect to server2.my.org.

This is useful in combination with high-availability solutions, because it allows the client to fail over to a secondary server without the need for additional load balancing software. However, there is one shortcoming: if we don't know which of the servers is the currently active primary server, we might accidentally end up connecting to a standby server. This is usually not what you want, because you cannot run data modifying statements there.

To make up for this, you can specify the additional parameter "target_session_attrs", which determines when a server is acceptable. If you specify

target_session_attrs=read-write

the PostgreSQL client will not accept a connection to a server where it cannot modify data. This allows you to connect to the streaming replication primary, regardless of the order of the servers in the connection string.

Improved options for target_session_attrs in v14

PostgreSQL v14 introduces these new options: read-only, primary, standby and prefer-standby.

You are probably confused by the distinction between "read-only" and "standby", or between "read-write" and "primary".
The difference is subtle: if the parameter "default_transaction_read_only" is set to "on", a server can be read-only even if it is not a streaming replication standby server.

Which client software can use target_session_attrs?

Since this functionality is part of the client C library libpq, all clients that link with that library automatically support this connection parameter. This includes the Python library psycopg2, drivers for Go, Rust and most other programming languages.

Notable exceptions to this are the PostgreSQL JDBC driver and the .NET provider Npgsql, which do not link with libpq. However, the JDBC driver already has support for this functionality with the "targetServerType" connection parameter, which you can set to "primary", "secondary" and "preferSecondary" to achieve the same behavior. Npgsql does not offer this functionality.

Making use of target_session_attrs for high availability

The application is obvious: you always want to connect to the primary server. While target_session_attrs=read-write served the purpose fairly well, you should switch to using target_session_attrs=primary. This will do the right thing even in the unlikely case that somebody set default_transaction_read_only on the primary server.

Making use of target_session_attrs for horizontal scaling

You can use PostgreSQL streaming replication for horizontal scaling by directing read requests to standby servers. There are some limitations to this approach:

Other than that, you can use target_session_attrs to implement scaling without the need for additional load balancing software:
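For example, a read-only reporting application could use a connection string like this (host names are assumptions):

    postgresql://primary.my.org,standby1.my.org,standby2.my.org/mydb?target_session_attrs=prefer-standby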

Conclusion

The new settings for target_session_attrs allow a more reliable high-availability setup and some degree of horizontal scaling without the need for additional software. This can simplify your architecture, thus making it more reliable.

To make it easier for our readers to install PostgreSQL on Ubuntu, we have compiled a quick guide to getting started. Don't worry, it's not too hard. You will learn:

Note that this is an introduction. If you are looking to create a PostgreSQL High-Availability cluster, we recommend checking out some of our other content.

Preparing the system

For this demonstration I have created a virtual machine using VirtualBox on my local machine. We use Ubuntu 20.10 in this tutorial.
Installing Ubuntu on VirtualBox is easy. Simply download the Ubuntu ISO file from the website and create a virtual machine:


Then ensure that the ISO file is inserted into the virtual CD drive.


You can then boot the machine. Simply follow the instructions. Once Ubuntu is installed, we can proceed with the installation of PostgreSQL itself.

Installing repositories and keys

To download PostgreSQL, we suggest checking out the official PostgreSQL website. Those PostgreSQL packages provided by the community are high quality and we recommend using them for your deployment: https://www.postgresql.org/download/

Please select your desired operating system. In our example, I have selected the latest version of Ubuntu (20.10). The following steps are now necessary to install PostgreSQL:

The first thing you have to do is to add the repository to your Ubuntu installation. Here is how it works:
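At the time of writing, the command suggested on the download page looks like this:

    sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'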

Then you can add the keys to the system to make sure the repository is trustworthy:
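Again following the download page:

    wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -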

Next, we can update the package list and ensure that our system has the latest stuff:
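For example:

    sudo apt-get update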

Once the repositories are ready to use, we can actually go and install PostgreSQL on our Ubuntu server.

Installing PostgreSQL on Ubuntu

Basically, all we need to do is run "apt-get -y install postgresql". This will automatically deploy the latest version of PostgreSQL. If we want to deploy, say, PostgreSQL 12 instead of the current PostgreSQL, we would use "apt-get install postgresql-12" instead.

Now let's install PostgreSQL 13:
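The versioned package makes sure we get exactly that release:

    sudo apt-get -y install postgresql-13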

The beauty of this setup is that everything was done automatically by the packages. This includes but is not limited to:

Now, all that remains to be done is to enable the services to ensure that the database is running correctly and automatically - in case the system is restarted.

Ensuring the service is running (systemd)

Enabling the service is easy and can be done using the standard systemd procedure. Basically the service should already be active but it makes sense to double check to make sure that stuff is really enabled:
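For example:

    sudo systemctl enable postgresql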

Once the service has been enabled, we can start the instance so that we can use it directly, without a restart:
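Like this:

    sudo systemctl start postgresql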

Ubuntu did not issue any errors. However, it is usually a good idea to check if the database is indeed running. There are two ways to do that:

First, let's check the process table:
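For example:

    ps axf | grep postgres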

The second option is to use pg_isready:
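A sketch:

    sudo -u postgres pg_isready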

Switch to the "postgres" user, and then call pg_isready. If you want a little bit more information, pg_isready offers additional options (see pg_isready --help). The beauty is that pg_isready returns a POSIX-compliant return code, so you can easily use it in your scripts.

Testing your database

After we have installed and enabled PostgreSQL on our Ubuntu system, we can check which databases there are:
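For example:

    sudo -u postgres psql -l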

"psql -l" connects to the database and lists the databases in our PostgreSQL cluster. This works perfectly - so we can already log in, and easily create a new database:
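A sketch (the database name is an assumption):

    sudo -u postgres createdb mydb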

PostgreSQL is now ready for action. You can already go ahead and use the system according to your needs. If you want to tune your database and find good configuration parameters, we recommend checking out the PostgreSQL configuration tool (pgconfigurator). It will help you to achieve better performance by adjusting some key parameters in the postgresql.conf configuration file.

Finally …

If you want to improve your database setup even more, we recommend checking out a way to cluster PostgreSQL to achieve higher availability. We have compiled a post for you showing how HA can easily be achieved. Check it out.
