EXPLAIN ANALYZE as developers would want it
© Laurenz Albe 2021

 

UPDATED July 2023: EXPLAIN ANALYZE is the key to optimizing SQL statements in PostgreSQL. This article does not attempt to explain everything there is to it. Rather, I want to give you a brief introduction, explain what to look for and show you some helpful tools to visualize the output.

Also, I want to call your attention to a new patch I wrote which has been included in PostgreSQL 16: EXPLAIN (GENERIC_PLAN). You can read about it here.

How to call EXPLAIN ANALYZE?

If you have an SQL statement that executes too slowly, you want to know what is going on and how to fix it. In SQL, it is harder to see how the engine spends its time, because it is not a procedural language, but a declarative language: you describe the result you want to get, not how to calculate it. The EXPLAIN command shows you the execution plan that is generated by the optimizer. This is helpful, but typically you need more information.

You can obtain such additional information by adding EXPLAIN options in parentheses.

The most important EXPLAIN options

Typically, the best way to call EXPLAIN is:
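A minimal sketch of such a call (the query shown is only a stand-in for your own statement):

```sql
EXPLAIN (ANALYZE, BUFFERS, SETTINGS)
SELECT count(*) FROM pg_class;
```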

Include SETTINGS if you are on v12 or better and WAL for data modifying statements from v13 on.

It is highly recommended to set track_io_timing = on to get data about the I/O performance.
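For example, for the current session (use ALTER SYSTEM plus a reload to enable it server-wide):

```sql
SET track_io_timing = on;
```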

Caveats and limitations

You cannot use EXPLAIN for all kinds of statements: it only supports SELECT, INSERT, UPDATE, DELETE, EXECUTE (of a prepared statement), CREATE TABLE ... AS and DECLARE (of a cursor).

Note that EXPLAIN ANALYZE adds a noticeable overhead to the query execution time, so don't worry if the statement takes longer.

There is always a certain variation in query execution time, as data may not be in cache during the first execution. That's why it is valuable to repeat EXPLAIN ANALYZE a couple of times and see if the result changes.

What information do you get from EXPLAIN ANALYZE?

PostgreSQL displays some of this information for each node of the execution plan, and some of it in the footer of the output.

EXPLAIN without options

Plain EXPLAIN will give you the estimated cost, the estimated number of rows and the estimated size of the average result row. The unit for the estimated query cost is artificial (1 is the cost for reading an 8kB page during a sequential scan). There are two cost values: the startup cost (cost to return the first row) and the total cost (cost to return all rows).

Output from the ANALYZE option

ANALYZE gives you a second set of parentheses with the actual execution time in milliseconds, the actual row count and a loop count that shows how often that node was executed. It also shows the number of rows that filters have removed.

In the footer, you see how long PostgreSQL took to plan and execute the query. You can suppress that information with SUMMARY OFF.

Output from the BUFFERS option

This option shows how many data blocks were found in the cache (hit) for each node, how many had to be read from disk, how many were written and how many dirtied. In recent versions, the footer contains the same information for the work done by the optimizer, if it didn't find all its data in the cache.

If track_io_timing = on, you will get timing information for all I/O operations.

How to read EXPLAIN ANALYZE output

You see that you end up with quite a bit of information even for simple queries. To extract meaningful information, you have to know how to read it.
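To have something concrete in front of us, here is what the output of EXPLAIN (ANALYZE, BUFFERS) for a count over a join of two hypothetical tables (orders and customers) might look like; the plan is heavily abbreviated and the numbers are left out, it only illustrates the shape of the output:

```
Aggregate  (cost=... rows=1 width=8) (actual time=... rows=1 loops=1)
  Buffers: shared hit=...
  ->  Hash Join  (cost=...) (actual time=... rows=... loops=1)
        Hash Cond: (o.customer_id = c.id)
        ->  Seq Scan on orders o  (cost=...) (actual time=... rows=... loops=1)
        ->  Hash  (cost=...) (actual time=... rows=... loops=1)
              ->  Seq Scan on customers c  (cost=...) (actual time=... rows=... loops=1)
Planning Time: ... ms
Execution Time: ... ms
```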

First, you have to understand that a PostgreSQL execution plan is a tree structure consisting of several nodes. The top node (the Aggregate in the example above) comes first; lower nodes are indented and start with an arrow (->). Nodes with the same indentation are on the same level (for example, the two relations combined with a join).

PostgreSQL executes a plan top down, that is, it starts with producing the first result row for the top node. The executor processes lower nodes “on demand”, that is, it fetches only as many result rows from them as it needs to calculate the next result of the upper node. This influences how you have to read “cost” and “time”: the startup time for the upper node is at least as high as the startup time of the lower nodes, and the same holds for the total time. If you want to find the net time spent in a node, you have to subtract the time spent in the lower nodes. Parallel queries make that even more complicated.

On top of that, you have to multiply the cost and the time by the number of “loops” to get the total effort spent in a node.

What to focus on in EXPLAIN ANALYZE output

Tools to interpret EXPLAIN ANALYZE output

Since reading a longer execution plan is quite cumbersome, there are a few tools that attempt to visualize this “sea of text”:

Depesz' EXPLAIN ANALYZE visualizer

This tool can be found at https://explain.depesz.com/. If you paste the execution plan in the text area and hit “Submit”, you will get output like this:

EXPLAIN ANALYZE as seen by Depesz

The execution plan looks somewhat similar to the original, but visually more pleasing. There are useful additional features:

What I particularly like about this tool is that all the original EXPLAIN text is right there for you to see, once you have focused on an interesting node. The look and feel is decidedly more “old school” and no-nonsense, and this site has been around for a long time.

Dalibo's EXPLAIN ANALYZE visualizer

This tool can be found at https://explain.dalibo.com/. Again, you paste the raw execution plan and hit “Submit”. The output is presented as a tree:

EXPLAIN ANALYZE as seen by Dalibo

Initially, the display hides the details, but you can show them by clicking on a node, as I have done with the second node in the image above. On the left, you see a small overview of all nodes, and from there you can jump to the right side to get details. Features that add value are:

The nice thing about this tool is that it makes the tree structure of the execution plan clearly visible. The look and feel is more up-to-date. On the downside, it hides the detailed information somewhat, and you have to learn where to search for it.

Conclusion

EXPLAIN (ANALYZE, BUFFERS) (with track_io_timing turned on) will show you everything you need to know to diagnose SQL statement performance problems. To keep from drowning in a sea of text, you can use Depesz' or Dalibo's visualizer. Both provide roughly the same features.

For more information on detailed query performance analysis, see my blog about using EXPLAIN (ANALYZE) with parameterized statements.

Before you can start tuning queries, you have to find the ones that use most of your server's resources. Here is an article that describes how you can use pg_stat_statements to do just that.



Bulk loading is the quickest way to import large amounts of data into a PostgreSQL database. There are various ways to facilitate large-scale imports, and many different ways to scale are also available. This post will show you how to use some of these tricks, and explain how fast importing works. You can use this knowledge to optimize data warehousing or any other data-intensive workload.

There are several things to take into consideration in order to speed up bulk loading of massive amounts of data using PostgreSQL:

- INSERT vs. COPY
- checkpoint configuration
- logged vs. unlogged tables
- creating indexes before vs. after the import
- enabled vs. disabled triggers
- column order

Let us take a look at these things in greater detail.

INSERT vs. COPY

The first thing to consider is that COPY is usually a LOT better than plain INSERT statements. The reason is that INSERT has a lot of overhead. People often ask: What kind of overhead is there? What makes COPY so much faster than INSERT? There are a variety of reasons: in the case of INSERT, every statement has to check for locks, check for the existence of the table and the columns in the table, check for permissions, look up data types and so on. In the case of COPY, this is only done once, which is a lot faster. Whenever you want to write large amounts of data, COPY is usually the way to go.

To show what kind of impact this change has in terms of performance, I have compiled a short example.

Let’s create a table as well as some sample data:
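The original definition is not reproduced here; a table along these lines, with 4 columns, serves the purpose (all names are made up, and the sample data is generated by the scripts shown below):

```sql
CREATE TABLE t_sample (
    a varchar(50),
    b int,
    c varchar(50),
    d int
);
```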

The sample table consists of 4 columns, which is pretty simple.

In the next step, we will compile a script containing 1 million INSERT statements in a single transaction:
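The generator itself is not shown here; the resulting script looks roughly like this (abbreviated to two statements, the full script contains 1 million of them):

```sql
BEGIN;
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
INSERT INTO t_sample VALUES ('abcd', 1, 'abcd', 1);
-- ... 1 million INSERT statements in total ...
COMMIT;
```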

Running the script can be done using psql:

We need around 81 seconds to run this simple test, which is A LOT of time. Single INSERT statements are therefore obviously not the solution to perform quick imports and efficient bulk loading.

As I have already mentioned, COPY is a lot more efficient than INSERT, so let’s use the same data, but feed it to COPY instead:
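Again heavily abbreviated, a COPY version of the script could look like this (the full script contains 1 million data lines; CSV format is used here just to keep the sketch readable):

```sql
COPY t_sample FROM stdin (FORMAT csv);
abcd,1,abcd,1
abcd,1,abcd,1
\.
```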

Running the script is again an easy thing to do:

The speed has improved from 81 to just 2.6 seconds.

Keep in mind that I have executed this test on a fairly old machine, on a totally untuned database. On modern hardware and on a more modern operating system, a lot more can be achieved than on my local iMac desktop machine. Loading 1 million lines or more is not uncommon in the real world. Of course, this data depends on the length of a “record” and so on. However, it is important to get a feeling for what is possible and what is not.

NOTE: Runtimes might vary. This has many reasons. One of them is certainly related to the hardware in use here. We have seen that many SSDs provide us with quite unstable response times.

Adjusting checkpoints for faster bulk loading

The PostgreSQL configuration does have an impact on bulk loading performance. There are many configuration parameters which are vital to database performance, and loading in particular. However, I explicitly want to focus your attention on checkpoint and I/O performance. If you want to load billions of rows, I/O is king. There are various angles to approach the topic:

The following settings are important:
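The original list is not reproduced here; what is usually meant in this context are the checkpoint-related parameters in postgresql.conf, for example (the values are merely examples, not universal recommendations):

```
max_wal_size = 100GB
checkpoint_timeout = 30min
checkpoint_completion_target = 0.9
```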

In general it is a REALLY good idea to stretch checkpoints by a significant amount.

Setting max_wal_size to 100 or 200 GB in the case of bulk-load intense workloads is definitely not out of scope.

Keep in mind that increased checkpoint distances DO NOT put your server at risk. They merely affect the way PostgreSQL writes data. Also keep in mind that more disk space will be consumed, and crash recovery might take longer.

If you want to learn more about checkpointing, check out this article about reducing the amount of WAL written.

CREATE TABLE vs. CREATE UNLOGGED TABLE

However, what if there were a way to get rid of WAL altogether? Well, there is one. It is called an “unlogged table”. What is the general idea? Often we see the following sequence of events: data is loaded into a staging table, processed and aggregated there, and the staging data is then thrown away again.

This is the ideal scenario to use the WAL bypass provided by unlogged tables:
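A minimal sketch, reusing the sample table from above as an unlogged staging table:

```sql
DROP TABLE IF EXISTS t_sample;

CREATE UNLOGGED TABLE t_sample (
    a varchar(50),
    b int,
    c varchar(50),
    d int
);
```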

Let’s load the same data again:

As you can see, the entire thing is a lot faster. 81 seconds vs. 59 seconds and 2.6 vs. 0.6 seconds. The difference is massive.

The reason is that an unlogged table does not have to write the data twice (no WAL needed). However, this comes with a price tag attached to it:

- An unlogged table is not crash-safe: its content is truncated after a crash or unclean shutdown.
- Unlogged tables are not replicated to standby servers, because replication is based on the WAL.

These restrictions imply that an unlogged table is not suitable for storing “normal” data. However, it is ideal for staging areas and bulk loading.

Tables can be switched between LOGGED and UNLOGGED. Many people expect these to be cheap operations, but that is not true.

Let’s take a look and see what happens:
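A sketch of the experiment (the LSNs further down are placeholders, not real measurements):

```sql
-- measure the WAL position before the switch
SELECT pg_current_wal_insert_lsn();

ALTER TABLE t_sample SET LOGGED;

-- measure the WAL position again afterwards
SELECT pg_current_wal_insert_lsn();

-- the difference between the two positions is the WAL volume produced
SELECT pg_size_pretty(pg_wal_lsn_diff('1/8BF49000', '1/86A33000'));  -- placeholder LSNs
```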

In addition to setting the table from UNLOGGED to LOGGED, I have measured the current WAL position.

What we can see is that a lot of data has been written:

Wow, we have produced 80 MB of WAL (if you do exactly one COPY on an empty table - the amount will grow if you run more imports). In case of COPY + INSERT, the volume will be a lot higher.

From this, we draw the conclusion that if we want to do efficient bulk loading, setting a table from LOGGED to UNLOGGED, importing the data and setting it back to LOGGED might not be the best of all ideas: as soon as a table is set back to LOGGED, the entire content of the table has to be written to the WAL, to make sure that the replicas can receive the content of the table.

Direct import vs. recreating indexes
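The original test is not reproduced here; a sketch of the first variant, creating the index before loading the data (all names are made up for illustration), could look like this:

```sql
-- variant 1: create the index first, then load the data
CREATE TABLE t_target (LIKE t_sample);
CREATE INDEX idx_target_b ON t_target (b);

INSERT INTO t_target SELECT * FROM t_sample;
```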

It takes around 8 seconds to copy the data over. Let’s try the same thing by creating the indexes later:
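A sketch of the second variant, again with made-up names:

```sql
-- variant 2: load the data first, then create the index
CREATE TABLE t_target2 (LIKE t_sample);

INSERT INTO t_target2 SELECT * FROM t_sample;
CREATE INDEX idx_target2_b ON t_target2 (b);
```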

We can see that the copy process (= INSERT) is a lot faster than before. In total, it is quicker to create the index later. Also keep in mind that I am using synthetic data on macOS (not too efficient) here. If you repeat the test with a lot more real data, the difference will be even bigger.

The bottom line is:

Create indexes after importing data if possible.

Enabled triggers vs. disabled triggers

Triggers are also an important factor. One could say that triggers are “the natural enemy” of bulk loading performance. Let’s take a look at the following example:
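The original trigger is not reproduced here; a sketch of what such a trigger could look like on our sample table (names are made up):

```sql
CREATE FUNCTION f_sample() RETURNS trigger AS
$$
BEGIN
    -- modify two fields of the incoming row
    NEW.b := NEW.b + 1;
    NEW.d := NEW.d + 1;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sample
    BEFORE INSERT ON t_sample
    FOR EACH ROW
    EXECUTE FUNCTION f_sample();
```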

Our trigger is really simple. All it does is modify two fields in each row. However, the trigger adds an extra function call for every row, which really adds up.
In our test, the variant with the trigger is around 3 times slower. However, the real difference highly depends on the complexity of the trigger, the size of a row and a lot more. There is no way to state that “a trigger slows things down by a factor of X”. One has to see, case by case, what happens.

Optimizing column order for bulk loading

There is more to importing large amounts of data into PostgreSQL than meets the eye. So far, we have optimized checkpoints, touched indexes, triggers and so on. But what about the column order? Let’s try to find out.

In PostgreSQL, column order does make a real difference. It is generally a good idea to put “fixed length” columns in front. In other words: int8, int4, timestamptz and so on should be at the beginning of the table. Variable length data types such as varchar, text and so on should be at the end of the table. The reason for this is CPU alignment: fixed-length columns have alignment requirements, which leads to padding on disk. This is true for normal heap tables (not for zheap).
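A sketch to illustrate the idea (table and column names are made up):

```sql
-- suboptimal: variable-length columns first
CREATE TABLE t_bad (
    name     text,
    active   boolean,
    id       int8,
    created  timestamptz
);

-- better: fixed-length columns first, variable-length columns last
CREATE TABLE t_good (
    id       int8,
    created  timestamptz,
    active   boolean,
    name     text
);
```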

Shrinking the size of a table without changing the content can speed things up, because it helps to avoid or reduce one of the key bottlenecks when bulk loading data: I/O. Check out this article to find out more.

Tooling for bulk loading

If what you have seen so far is still not enough, there are tools which improve bulk loading even more. The following tools can be recommended:

Both tools are very well known and widely used. You can use them safely.

If you have further questions regarding these tools, please don’t hesitate to ask in the comment section, or send us an email.

Finally …

If you want to know more about PostgreSQL performance, we also recommend checking out our consulting services. We help you to tune your database and make sure that your servers are operating perfectly.

Autogenerated primary keys violate human rights!
© Laurenz Albe 2021

 

UPDATED 14.05.2022: Sometimes customers ask me about the best choice for auto-generated primary keys. In this article, I'll explore the options and give recommendations.

Why auto-generated primary keys?

Every table needs a primary key. In a relational database, it is important to be able to identify an individual table row. If you wonder why, search the internet for the thousands of questions asking for help with removing duplicate entries from a table.

You are well advised to choose a primary key that is not only unique, but also never changes during the lifetime of a table row. This is because foreign key constraints typically reference primary keys, and changing a primary key that is referenced elsewhere causes trouble or unnecessary work.

Now, sometimes a table has a natural primary key, for example the social security number of a country's citizens. But typically, there is no such attribute, and you have to generate an artificial primary key. Some people even argue that you should use an artificial primary key even if there is a natural one, but I won't go into that “holy war”.

Techniques for auto-generated primary keys in PostgreSQL

There are two basic techniques:

- generating numbers with a database sequence
- generating UUIDs

Generating keys with a sequence

A sequence is a database object whose sole purpose in life is to generate unique numbers. It does this using an internal counter that it increments.

Sequences are highly optimized for concurrent access, and they will never issue the same number twice. Still, accessing a sequence from many concurrent SQL statements could become a bottleneck, so there is the CACHE option that makes the sequence hand out several values at once to database sessions.

Sequences don't follow the normal transactional rules: if a transaction rolls back, the sequence does not reset its counter. This is required for good performance, and it does not constitute a problem. If you are looking for a way to generate a gapless sequence of numbers, a sequence is not the right choice, and you will have to resort to less efficient and more complicated techniques.

To fetch the next value from a sequence you use the nextval function like this:
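For example, with a sequence created for the purpose (the name is arbitrary):

```sql
CREATE SEQUENCE test_seq;

SELECT nextval('test_seq');
```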

See the documentation for other functions to manipulate sequences.

Generating UUIDs

A UUID (universally unique identifier) is a 128-bit number that is generated with an algorithm that effectively guarantees uniqueness. There are several standardized algorithms for that. In PostgreSQL, there are a number of functions that generate UUIDs:
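For example (gen_random_uuid() is part of core PostgreSQL from v13 on; older versions can use the uuid-ossp extension):

```sql
SELECT gen_random_uuid();

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
SELECT uuid_generate_v4();
```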

Note that you should always use the PostgreSQL data type uuid for UUIDs. Don't try to convert them to strings or numeric — you will waste space and lose performance.

Defining auto-generated primary keys

There are four ways to define a column with automatically generated values:

- with a DEFAULT clause
- with the serial and bigserial pseudo-types
- with identity columns
- with a BEFORE INSERT trigger

Using the DEFAULT clause

You can use this method with sequences and UUIDs. Here are some examples:
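A sketch with made-up table names, one using a sequence and one using UUIDs:

```sql
CREATE SEQUENCE tab_seq_id_seq;

CREATE TABLE tab_seq (
    id  bigint PRIMARY KEY DEFAULT nextval('tab_seq_id_seq'),
    val text
);

CREATE TABLE tab_uuid (
    id  uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    val text
);
```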

PostgreSQL uses the DEFAULT value whenever the INSERT statement doesn't explicitly insert that column.

Using the serial and bigserial pseudo-types

This method is a shortcut for defining a sequence and setting a DEFAULT clause as above. With this method, you define a table as follows:
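For example (names are made up):

```sql
CREATE TABLE tab_serial (
    id  bigserial PRIMARY KEY,
    val text
);
```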

That is equivalent to the following:
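Schematically, the definition above corresponds to:

```sql
CREATE SEQUENCE tab_serial_id_seq;

CREATE TABLE tab_serial (
    id  bigint PRIMARY KEY DEFAULT nextval('tab_serial_id_seq'),
    val text
);

ALTER SEQUENCE tab_serial_id_seq OWNED BY tab_serial.id;
```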

The “OWNED BY” clause adds a dependency between the column and the sequence, so that dropping the column automatically drops the sequence.

Using serial will create an integer column, while bigserial will create a bigint column.

Using identity columns

This is another way to use a sequence, because PostgreSQL uses sequences “behind the scenes” to implement identity columns.
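For example (table name made up):

```sql
CREATE TABLE tab_identity (
    id  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    val text
);
```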

There is also “GENERATED BY DEFAULT AS IDENTITY”, which is the same, except that you won't get an error message if you try to explicitly insert a value for the column (much like with a DEFAULT clause). See below for more!

You can specify sequence options for identity columns:
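For example, to start at 1000 and increment by 10 (values chosen arbitrarily):

```sql
CREATE TABLE tab_identity_opts (
    id  bigint GENERATED ALWAYS AS IDENTITY (START WITH 1000 INCREMENT BY 10) PRIMARY KEY,
    val text
);
```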

Using BEFORE INSERT triggers

This is similar to DEFAULT values, but it allows you to unconditionally override a value inserted by the user with a generated value. The big disadvantage of a trigger is the performance impact.
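A sketch of such a trigger (all names are made up):

```sql
CREATE SEQUENCE tab_trig_id_seq;

CREATE TABLE tab_trig (
    id  bigint PRIMARY KEY,
    val text
);

CREATE FUNCTION set_tab_trig_id() RETURNS trigger AS
$$
BEGIN
    -- unconditionally override whatever value the user supplied
    NEW.id := nextval('tab_trig_id_seq');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tab_trig_id
    BEFORE INSERT ON tab_trig
    FOR EACH ROW
    EXECUTE FUNCTION set_tab_trig_id();
```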

Should I use integer (serial) or bigint (bigserial) for my auto-generated primary key?

You should always use bigint.

True, an integer occupies four bytes, while a bigint needs eight. But:

With bigint, you are certain to never exceed the maximum of 9223372036854775807: even if you insert 10000 rows per second without any pause, you have almost 30 million years before you reach the limit.

Should I use bigserial or an identity column for my auto-generated primary key?

You should use an identity column, unless you have to support old PostgreSQL versions.

Identity columns were introduced in PostgreSQL v11, and they have two advantages over bigserial:

So unless you have to support PostgreSQL v10 or below, there is no reason to use bigserial.

Should I use bigint or uuid for an auto-generated primary key?

My advice is to use a sequence unless you use database sharding or have some other reason to generate primary keys in a "decentralized" fashion (outside a single database).

Real differences

The advantages of bigint are clear:

One disadvantage of using a sequence is that it is a single object in a single database. So if you use sharding, where you distribute your data across several databases, you cannot use a sequence. In such a case, UUIDs are an obvious choice. (You could use sequences defined with an INCREMENT greater than 1 and different START values, but that might lead to problems when you add additional shards.)

Of course, if your primary key is not auto-generated by the database, but created in an application distributed across several application servers, you will also prefer UUIDs.

Imaginary differences

There are people that argue that UUIDs are better, because they spread the writes across different pages of the primary key index. That is supposed to reduce contention and lead to a more balanced or less fragmented index. The first is true, but that may actually be a disadvantage, because it requires the whole index to be cached for good performance. The second is definitely wrong, since B-tree indexes are always balanced. Also, a change in PostgreSQL v11 made sure that monotonically increasing values will fill an index more efficiently than random inserts ever could (but subsequent deletes will of course cause fragmentation). In short, any such advantages are either marginal or non-existent, and they are more than balanced by the fact that uuid uses twice as much storage, which will make the index bigger, causing more writes and occupying more of your cache.

Security considerations

People have argued (see for example the comments below) that sequence-generated primary keys can leak information, because they allow you to deduce the approximate order on which rows were inserted into a table. That is true, even though I personally find it hard to imagine a case where this is a real security problem. However, if that worries you, use UUIDs and stop worrying!

Benchmarking bigint versus uuid

My co-worker Kaarel ran a small performance test a while ago and found that uuid was slower than bigint when it came to bigger joins.

I decided to run a small insert-only benchmark with these two tables:
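The original table definitions are not reproduced here; two minimal tables of this kind would look like this:

```sql
CREATE TABLE test_bigint (
    id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY
);

CREATE TABLE test_uuid (
    id uuid DEFAULT gen_random_uuid() PRIMARY KEY
);
```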

I performed the benchmark on my laptop (SSD, 8 cores) with a pgbench custom script that had 6 concurrent clients repeatedly run transactions of 1000 prepared INSERT statements for five minutes:

Performance comparison of bigint versus uuid auto-generated primary keys:

                           bigint          uuid
  inserts per second       107090          74947
  index growth per row     30.5 bytes      41.7 bytes

Using bigint clearly wins, but the difference is not spectacular.

Conclusion

Numbers generated by a sequence and UUIDs are both useful as auto-generated primary keys.

Use identity columns unless you need to generate primary keys outside a single database, and make sure all your primary key columns are of type bigint.

If you are interested in reading more about primary keys, see also Hans' post on Primary Keys vs. Unique Constraints.

In other news, in PostgreSQL 15 default permissions for the public schema were modified, which can cause errors.
Find out more about this important change here.



UPDATED March 2023: In this post, we'll focus our attention on PostgreSQL performance and detecting slow queries. Performance tuning does not only mean adjusting postgresql.conf properly, or making sure that your kernel parameters are properly tuned. Performance tuning also implies that you need to first find performance bottlenecks, isolate slow queries and understand what the system is doing in more detail.

Your best friend to detect slow queries: pg_stat_statements

After all these years, I still strongly believe that the best and most efficient way to detect performance problems is to make use of pg_stat_statements, which is an excellent extension shipped with PostgreSQL. It's used to inspect general query statistics. It helps you to instantly figure out which queries cause bad performance, and how often they are executed.

What is performance tuning?

Over the years, I have begun to get the impression that for most people “tuning” is limited to adjusting some magical PostgreSQL parameters. Sure, parameter tuning does help, but I can assure you, there is no such thing as “speed = on”. It does not exist and it most likely never will. Instead, we have to fall back on some pretty “boring techniques” such as inspecting queries to figure out what is going on in your database.

Where does database load come from?

There is one law of nature that has been true for the past 20 years and will most likely still hold true 20 years from now:

Queries cause database load

And slow queries (though not only those) are the main reason for such load.

Armed with this important, yet highly under-appreciated wisdom, we can use pg_stat_statements to figure out which queries have caused the most load, and tackle those – instead of wasting time on irrelevant guess work.

Installing pg_stat_statements in PostgreSQL

As mentioned above, pg_stat_statements comes as part of PostgreSQL. All you have to do to enjoy the full benefit of this exceptionally useful extension, and to take a bite out of your slow queries, is to enable it.

The first thing you have to do is to change shared_preload_libraries in postgresql.conf:
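The entry in postgresql.conf looks like this:

```
shared_preload_libraries = 'pg_stat_statements'
```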

Then restart PostgreSQL.

Finally, the module can be enabled in your desired database:
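This is done with a single statement:

```sql
CREATE EXTENSION pg_stat_statements;
```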

The last step deploys a view, which we will need in order to inspect the data collected by the pg_stat_statements machinery.

Taking a look at pg_stat_statements

pg_stat_statements provides a ton of useful information. Here is the definition of the view, as of PostgreSQL 15. Note that the view has been growing over the years and more and more vital information is added as PostgreSQL is steadily extended:

The danger here is that people get lost in the sheer volume of information. Therefore it makes sense to process this data a bit to extract useful information.

Making use of pg_stat_statements in PostgreSQL

To make it easier for our readers to extract as much information as possible from pg_stat_statements, we have compiled a couple of queries that we have found useful over the years.
The most important one is used to find out which operations are the most time-consuming.

Here is the query:
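The original query is not reproduced verbatim; a query along these lines (column names as of PostgreSQL 13 and later) produces the output described below:

```sql
SELECT substring(query, 1, 40) AS short_query,
       round(total_exec_time::numeric, 2) AS total_exec_time,
       calls,
       round(mean_exec_time::numeric, 2) AS mean_exec_time,
       round((100 * total_exec_time /
              sum(total_exec_time) OVER ())::numeric, 2) AS percentage
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```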

What you see here is how often a certain type of query has been executed, and how many milliseconds of total execution time we have measured. Finally, I have put the numbers into context and calculated the percentage of total runtime for each type of query. Also, note that I have used substring to make the query shorter. The reason for that is to make it easier to display the data on the website. In real life, you usually want to see the full query.

The upshot

The main observations in my example are that the first two queries already need 80% of the total runtime. In other words: The rest is totally irrelevant. One can also see that the slowest query by far (= the loading process) has absolutely no significance in the bigger picture. The real power of this query is that we instantly spot the most time-consuming operations.
However, sometimes it is not only about runtime – often I/O is the real issue. pg_stat_statements can help you with that one as well. But let’s first reset the content of the view:
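Resetting is done with a function call:

```sql
SELECT pg_stat_statements_reset();
```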
 

 
Measuring I/O time is simple. The track_io_timing parameter can be adjusted to measure this vital KPI. You can turn it on in postgresql.conf for the entire server, or simply adjust things on the database level if you want more fine-grained data:
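For example, on the database level (the database name is a placeholder):

```sql
ALTER DATABASE test SET track_io_timing = on;
```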
 

 
In this example, the parameter has been set for this one specific database. The advantage is that we can now inspect I/O performance and see whether we have an I/O or a CPU problem:
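A sketch of such a query (blk_read_time and blk_write_time are only populated if track_io_timing is on):

```sql
SELECT substring(query, 1, 40) AS short_query,
       round(total_exec_time::numeric, 2) AS total_exec_time,
       round(blk_read_time::numeric, 2) AS blk_read_time,
       round(blk_write_time::numeric, 2) AS blk_write_time
FROM pg_stat_statements
ORDER BY blk_read_time + blk_write_time DESC
LIMIT 10;
```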
 

 
What we have done here is to compare the total_exec_time (= total execution time) to the time we actually spent on I/O. If the I/O time is a significant fraction of the overall time, we are bound by the I/O system; otherwise, additional disks (= more IOPS) won’t be beneficial. track_io_timing is therefore essential to determine whether the bottleneck is the CPU or the disk.
But there is more: If you are looking for good performance, it makes sense to consider temporary I/O as a potential factor. temp_blks_read and temp_blks_written are the important parameters here. But keep in mind that simply throwing work_mem at temporary I/O is not usually the solution. In the case of normal everyday operations, the real problem is often a missing index.

Finally…

pg_stat_statements is an important topic, but it is often overlooked. Hopefully more and more PostgreSQL users will spread the word about this great extension! If you want to find out more about performance, we have tons of other useful tips for you. Maybe you want to check out our post about 3 ways to detect slow queries, or find out about checkpoints and checkpoint performance in PostgreSQL.

Also, try watching some of the exciting new videos on our YouTube channel. We post fresh content there on a regular basis. If you enjoy the videos, like us and subscribe!

 



There are two types of replication available in PostgreSQL at the moment: Streaming replication & Logical replication. If you are looking to set up streaming replication for PostgreSQL 13, this is the page you have been looking for. This tutorial will show you how to configure PostgreSQL replication and how to set up your database servers quickly.

PostgreSQL replication: What we want to achieve

Before we get started with configuring PostgreSQL, it makes sense to take a look at what we want to achieve. The goal of this tutorial is to create a primary server replicating data to a secondary one, using asynchronous replication.

Here is what the desired setup will look like: one primary server (node1) streaming its transaction log asynchronously to a single replica.

The entire setup will be done using CentOS 8.3. The process on RHEL (Redhat Enterprise Linux) is expected to be the same. Simply follow the same procedures.

To show how the setup works, we use two virtual machines with the following IPs:

- 10.0.3.200: the primary (node1)
- 10.0.3.201: the replica

Let’s prepare these systems step-by-step.

Installing PostgreSQL

Once you have installed CentOS / RHEL you can already prepare the installation of PostgreSQL itself. The way to do that is to go to the PostgreSQL website and follow the instructions.
The following script shows how things work. You can simply copy / paste the script:

Let’s check what we should now see on node1:

After this process, you should have:

Now let’s move on to the next step: disabling the firewall on the primary.

Why is that necessary? The replica will connect to the primary on port 5432. If the firewall is still active, the replica will not be able to access port 5432. In our example, the firewall will be disabled completely to make things easier for the reader. In a more secure setup, you might want to do this in a more precise manner.

Configuring the primary for replication

There are four things we have to do on the primary server:

- adjust listen_addresses in postgresql.conf
- create a user for replication
- allow the replication user to connect in pg_hba.conf
- restart (or reload) PostgreSQL

We can perform these things step-by-step.

The first thing is to change postgresql.conf.

The file can be found in /var/lib/pgsql/13/data/postgresql.conf. However, if you have no clue where to find postgresql.conf you can ask PostgreSQL itself to point you to the configuration file. Here is how it works:
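Simply ask for the setting:

```sql
SHOW config_file;
```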

The following parameter has to be changed in postgresql.conf:

listen_addresses = '*'

What does listen_addresses mean? By default, PostgreSQL only listens on localhost. Remote access is not allowed by default for security reasons. Therefore, we have to teach PostgreSQL to listen for remote connections as well. In other words: listen_addresses defines the bind addresses of our database service. Without it, remote access is not possible (even if you change pg_hba.conf later on).

Then we can create the user in the database:
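A minimal version (add a password if you need one):

```sql
CREATE USER repuser REPLICATION;
```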

Of course, you can also set a password. What is important here is that the user has the REPLICATION flag set. The basic idea is to avoid using the superuser to stream the transaction log from the primary to the replica.

The next thing we can do is to change pg_hba.conf, which controls who is allowed to connect to PostgreSQL from which IP.

Please add the following line to the configuration file:

host replication repuser 10.0.3.201/32 trust

We want to allow the repuser coming from 10.0.3.201 to log in and stream the transaction log from the primary. Keep in mind that 10.0.3.200 is the primary in our setup and 10.0.3.201 is the replica.

Finally, we can restart the primary, because we have changed listen_addresses in postgresql.conf.
If you only changed pg_hba.conf, a reload is enough:
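A reload can be triggered from SQL (a restart, on the other hand, is done via the service manager):

```sql
SELECT pg_reload_conf();
```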

Your system is now ready, and we can focus our attention on the replica.

Creating a base backup

The next step is to create the replica. There are various things we need to do to make this work. The first thing is to make sure that the replica is stopped and that the data directory is empty. Let’s first make sure the service is stopped:

Then, we need to make sure the data directory is empty:

Note that this step is not necessary if you have skipped the initdb step during installation.

However, it is necessary if you want to turn an existing server into a replica.

pg_basebackup will connect to the primary and simply copy all the data files over. The connection has to be made as repuser. To ensure that the copy process starts instantly, it makes sense to tell PostgreSQL to quickly checkpoint. The -D flag defines the destination directory where we want to store the data on the replica. The -R flag automatically configures our replica for replication.

No more configuration is needed on the secondary server. Finally, we created a replication slot. What is the purpose of a replication slot in PostgreSQL? Basically, the primary server is able to recycle the WAL if it is not needed anymore on the primary. But what if the replica has not consumed it yet? In that case, the replica will fall behind and eventually fail, unless there is a replication slot ensuring that the primary only recycles WAL that the replica has fully consumed. We at CYBERTEC recommend using replication slots in most common cases.
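One way to create such a slot on the primary (the slot name is made up):

```sql
SELECT pg_create_physical_replication_slot('node2_slot');
```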

Let’s figure out what pg_basebackup has done:

pg_basebackup has copied everything over. However, there is more. The standby.signal file has been created which tells the replica that it is indeed a replica.

Finally, the tooling has adjusted the postgresql.auto.conf file, which contains all the configuration needed to make the replica connect to the primary server (node1):

Voilà, we are done and we can proceed to start the replica.

Firing up the replica

We are ready to start the replica using systemctl:

It’s a good idea to check that the processes are indeed running. It’s especially important to check for the existence of the walreceiver process. walreceiver is in charge of fetching the WAL from the primary. In case it is not there, your setup has failed.
Also make sure that the service is enabled.

Checking your PostgreSQL replication setup

Once the setup has been completed, it makes sense to take a look at monitoring. In general, it makes sense to use a tool such as pgwatch2 to professionally monitor your database.

Let us first check the primary:
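The check boils down to querying a system view:

```sql
SELECT * FROM pg_stat_replication;
```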

The existence of a row in pg_stat_replication tells us that WAL is flowing from the primary to a secondary.

However, we can also make a check on the replica:
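The counterpart on the replica is pg_stat_wal_receiver:

```sql
SELECT * FROM pg_stat_wal_receiver;
```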

A row in pg_stat_wal_receiver ensures that the WAL receiver does indeed exist, and that data is flowing.

Conclusion

That’s it! I hope you have enjoyed this tutorial. For more information on choosing between synchronous and asynchronous replication, take a look at this page.

Analysing data within PostGIS is just one side of the coin. What about publishing our datasets as maps that various clients can consume and profit from? Let’s have a look at how this can be accomplished with Geoserver. Geoserver acts in this regard as a full-fledged open source mapping server, supporting several OGC compliant services, such as OGC Web Mapping (WMS) or Web Feature Service (WFS).
OGC is an abbreviation for Open Geospatial Consortium (formerly Open GIS Consortium), which is a community working on improving access to spatial information by developing standards in this area.

Here’s a quick overview of this article’s structure:

  1. Geoserver setup
  2. Dataset and layer configuration
  3. Service configuration
  4. Service consumption

Geoserver setup

Let’s first start by setting up Geoserver as a single, non-clustered instance. For our demo, I recommend using one of the available Docker images to start up quickly. The steps listed below show how to pull our geoserver image (kartoza) first, and subsequently start the container with its default configuration within a tmux session, which allows us to easily monitor geoserver’s output.

If everything runs smoothly, Apache Tomcat reports back with

org.apache.catalina.startup.Catalina.start Server startup in XX ms

Dataset and layer configuration

In one of my last blog posts, I imported a csv containing locations of airports (https://ourairports.com/data/airports.csv). Let’s build on this and use these point geometries as the main vector layer for our visualization.

By using kartoza’s image defaults, Apache Tomcat will listen on port 8080 inside the container, which in our setup is mapped to port 8090 on the host, serving Geoserver’s web interface at http://10.0.0.216:8090/geoserver. So, let’s continue and log in with geoserver/admin.

Figure 1 Geoserver Web Administrator


Figure 1 represents Geoserver’s main web screen after logging in. To publish our airport dataset, we must register our PostGIS database as a store (data source). To do so, I would first like to introduce the concept of workspaces within Geoserver. Workspaces are containers to which data stores, layers and services belong. Workspaces can be configured in detail in terms of accessibility and functionality.
Setting up at least one workspace is required. Check out Figures 2 and 3 to understand how this can easily be accomplished.

Figure 2 Create new workspace / Figure 3 Workspace overview

Let’s come back to our initial task, and configure our PostGIS database as a vector data source.

1. Click on store and choose PostGIS as your vector data source:

Figure 4 Add new data source

2. Choose your workspace and configure your PostGIS connection:

Figure 5 Configure new vector data source


3. If everything runs according to plan, saving our new data source results in a list of available spatial tables. Picking airports from this list brings us one step closer to our goal.

Figure 6 Publish individual layer

Unfortunately, we are not completely done with our initial layer configuration ☹. Successfully publishing a vector layer within Geoserver requires defining at least the coordinate reference system and the layer’s extent (bounding box). Geoserver tries to extract the coordinate reference system from the data source’s metadata, so if it is registered correctly within your data source, there should be no need to define this by hand. The layer’s bounding box can then be computed from the actual data, or be overridden manually. Don’t worry about forgetting to define those parameters: Geoserver will force you to define these mandatory settings. Once the job has been done, you are rewarded for your efforts by finding a “fully” configured layer item in the layers section.

Figure 7 Layer section

Service configuration

As mentioned in the beginning, Geoserver acts as a full-fledged open source mapping server. What do I mean by calling Geoserver full-fledged? On the one hand, I’m referring to its huge list of supported data sources and formats; on the other hand, this statement is based on the wide range of spatial services implemented. By default, Geoserver is prebuilt with support for the following services: TMS, WMS-C, WMTS, CSW, WCS, WFS, WMS, and WPS. What do all these abbreviations mean? Most of them are acronyms for OGC-compliant services. But no worries, I won’t bother you with the details of OGC standards this time 😉. In this article, we will have a closer look at Web Mapping Services (WMS), but I recommend checking out the OGC standards published at https://www.ogc.org/docs/is to get a comprehensive understanding of what those services are about.

WMS stands for Web Mapping Service, which is an OGC-compliant mapping service whose main purpose is simply producing maps. WMS creates maps on the fly; this fact enables clients to individualize their map requests in terms of styling, extent and projection.
Within Geoserver, WMS is enabled globally by default, thus we can move on and test our service. Fortunately, Geoserver offers a useful layer preview, which allows us to verify various services within its own ecosystem.

Let’s open the Layer preview from the navigation pane and pick WMS/JPEG from the formats dropdown.

Figure 8 Layer preview – WMS


This action immediately results in a get request against our WMS endpoint, which responds with an image representing our airports.

http://10.0.0.216:8090/geoserver/cybertec/wms?service=WMS&version=1.1.0&request=GetMap&layers=cybertec%3Aairports&bbox=-179.8769989013672%2C-90.0%2C179.9757080078125%2C82.75&width=768&height=368&srs=EPSG%3A4326&styles=&format=image%2Fjpeg

Figure 9 presents the generated map. Not quite appealing, you might think; I must agree. As we did not define and refine any styling parameters for our airport layer, Geoserver applies a default style to the layer. Styling is another big story which will not be covered today.

Figure 9 WMS map

Service consumption

It’s time to request our Web Mapping Service from our preferred GIS client. I am a huge fan of Quantum GIS (QGIS), a mature open-source GIS client, but any other desktop or web-based client with WMS support will do the job too.
After opening QGIS, add and configure a new WMS connection from the navigation pane (see Figures 10 and 11). By utilizing the layer preview, we can easily extract the HTTP address our WMS listens on from the URL generated previously.

http://10.0.0.216:8090/geoserver/cybertec/wms

Figure 10 Add new WMS connection / Figure 11 Configure WMS

What happens now in the background is quite interesting. QGIS first requests http://10.0.0.216:8090/geoserver/cybertec/wms?service=WMS&version=1.1.0&request=GetCapabilities to gather comprehensive metadata about the service. The XML document returned covers details about published layers, projections, formats and much more.

If everything has gone to plan, QGIS processes the GetCapabilities response and finally lists our airport layer underneath the WMS connection. Figure 12 proudly presents the result of our demonstration: airports displayed on top of OpenStreetMap tiles, which slightly improve the overall map style.

Figure 12 Airports on top of OSM tiles

Conclusion and outlook

We set up and configured Geoserver to publish an individual spatial dataset via WMS. By offering various data sources and OGC-compliant services out of the box, it’s no exaggeration to claim that Geoserver is a very powerful mapping server. This makes it easy to start and build up your spatial infrastructure. When it comes to productive environments, things turn out differently, since further topics such as performance and security must be considered too. If you would like more information on how to build up or improve your spatial infrastructure together with Geoserver and PostGIS, get in contact with the folks at CYBERTEC!

 

In PostgreSQL, every database connection is a server-side process. This makes PostgreSQL a robust multi-process rather than a multi-threaded solution. However, occasionally people want to terminate database connections. Maybe something has gone wrong, maybe some kind of query is taking too long, or maybe there is a maintenance window approaching.

In this blog you will learn how to terminate queries and database connections in PostgreSQL.

How to cancel PostgreSQL queries

In PostgreSQL there are two functions we need to take into consideration when talking about cancellation or termination:

- pg_cancel_backend(pid): cancels the query currently running in a backend
- pg_terminate_backend(pid): terminates the entire database connection

pg_cancel_backend is part of the standard PostgreSQL distribution and can be used quite easily:

As you can see, all that’s necessary is to pass the process ID (pid) to the function. The main question is therefore: How can I find the ID of a process to make sure that the right query is cancelled?

The solution to the problem is a system view: pg_stat_activity.
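A query along these lines returns the columns discussed below:

```sql
SELECT pid, leader_pid, datname, usename, state, query
FROM pg_stat_activity;
```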

There are a couple of things to mention here: First of all, you might want to kill a query in a specific database. The “datname” field shows you which database a connection has been established to. The “query” column contains the query string. Usually this is the best way to identify the query you want to end. However, the fact that the “query” column contains a string does not mean that the affiliated command is actually active at the moment. We also need to take the “state” column into consideration. “active” means that this query is currently running. Other entries might indicate that the connection is waiting for more user input, or that nothing is happening at all.

“leader_pid” is also important: PostgreSQL supports parallel query execution. In the case of a parallel query, the additional workers run as separate processes. To see which process belongs to which query, the “leader_pid” column gives us a clue.

Once the query has been found, we can do the following:
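For example (1234 is a placeholder for the pid found in pg_stat_activity):

```sql
SELECT pg_cancel_backend(1234);
```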

In case the query has been cancelled properly, we will get “true”. However, if the query is not there anymore, or in case the database connection does not exist, “false” will be returned.

How to terminate PostgreSQL database connections

So far, you have seen how a query can be ended. However, how can we terminate the entire connection?
The function we need for this purpose is pg_terminate_backend. It takes the process ID of the backend and returns a boolean indicating whether termination was successful.

Calling it works according to the same concept as shown before. The following listing shows an example of how this can be done:
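Again, 1234 is a placeholder for a pid taken from pg_stat_activity:

```sql
SELECT pg_terminate_backend(1234);
```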

However, sometimes kicking out a single user is not enough.

Kicking out every single user

More often than not, we have to terminate all database connections except our own. Fortunately, this is reasonably easy to do. We can again use pg_stat_activity:
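A sketch of such a statement (adjust the filters to your needs):

```sql
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()
  AND datname IS NOT NULL
  AND leader_pid IS NULL;
```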

The first thing to exclude in the WHERE clause is our own PID, which can be determined using the pg_backend_pid() function.

The next important filter is to exclude rows where the database name is NULL. Why is that important? In old versions of PostgreSQL, the system view only provided us with information about database connections. Current versions also list information about other processes. Those processes are not associated with a database (e.g. the background writer), and therefore we should exclude them. The same is true for parallel worker processes: parallel workers will die anyway if we kill the parent process.

Finally…

If you are interested in security, I would like to recommend “PostgreSQL TDE” which is a more secure version of PostgreSQL, capable of encrypting data on disk. It can be downloaded from our website. In addition to that, CYBERTEC offers 24x7 product support for PostgreSQL TDE.
