debugging deadlocks made unnecessary
© Laurenz Albe 2022

Even if you understand what a deadlock is, debugging deadlocks can be tricky. This article shows some techniques on how to figure out the cause of a deadlock.

A simple deadlock example

Setting the stage

We will test our techniques for debugging deadlocks with the following example:

CREATE TABLE parent (
   pid bigint PRIMARY KEY,
   pdata text NOT NULL
);

CREATE TABLE child (
   cid bigint PRIMARY KEY,
   pid bigint REFERENCES parent NOT NULL,
   cdata text NOT NULL
);

INSERT INTO parent VALUES (1, 'no children yet');

The deadlock

To provoke the deadlock, we run the following transactions in parallel. Each transaction adds a row to the child table and then tries to modify the corresponding row in parent.

-- session 1
BEGIN;

INSERT INTO child VALUES (100, 1, 'new child');

   -- session 2
   BEGIN;

   INSERT INTO child VALUES (101, 1, 'another child');

   SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
   -- session 2 blocks

-- session 1
SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
-- session 1 blocks, but after one second, it receives an error
ERROR:  deadlock detected
DETAIL:  Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718.
Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (0,1) in relation "parent"

   -- session 2 is unblocked and receives a result
       pdata    
   ═════════════
    no children
   (1 row)

How a deadlock appears in the log

The PostgreSQL log reveals more details:

ERROR:  deadlock detected
DETAIL:  Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718.
        Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674.
        Process 19674: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
        Process 19718: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
HINT:  See server log for query details.
CONTEXT:  while locking tuple (0,1) in relation "parent"
STATEMENT:  SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;

Why is debugging deadlocks difficult?

The error message sent to the client does not contain any meaningful detail. That is because the SQL statement from the other sessions participating in the deadlock may contain data that you have no right to see. The PostgreSQL log contains more helpful data, but also not enough to understand the deadlock. To understand the cause of the deadlock, you would have to know all statements from the involved transactions, because locks are always held until the end of a transaction. The two statements from the error message alone could never cause a deadlock. So some earlier statement in these transactions must already have taken locks that are part of the problem. However, PostgreSQL does not retain any memory of these previous statements, so we have to find some other way to get the whole picture.

Usually you need help from the application’s developers. Perhaps you are lucky, the SQL statements from the error message are enough to identify the database transactions involved, and it is easy to reconstruct the SQL statements that were issued from each transaction. By examining the locks taken by these statements, you should be able to reconstruct the cause of the deadlock.

The cause of the deadlock in our example

If we know the statements from our transactions in the example above, we can deduce the following:

  • The INSERT statements take a FOR KEY SHARE lock on the referenced rows in parent. That is necessary so that no concurrent transaction can delete the row referenced by our not yet committed INSERT. As you can see in the compatibility matrix, these locks don’t conflict with each other.
  • The FOR UPDATE row locks from the SELECT ... FOR UPDATE statements conflict with the FOR KEY SHARE locks from the other transaction, which leads to the deadlock.

Once we know that, the solution is simple: change “SELECT ... FOR UPDATE” to “SELECT ... FOR NO KEY UPDATE“, and there will be no more lock conflicts and consequently no deadlock. Note that a FOR UPDATE row lock is only required if you want to DELETE a row, or if you intend to update a column with a unique or primary key constraint on it (a key). The lock name FOR UPDATE is counter-intuitive, since it suggests that this is the correct lock for an UPDATE, and it is a common trap for PostgreSQL beginners.

Great, with our complete knowledge we could fix the problem. But how do we get the statements from the transactions?

Techniques for debugging deadlocks

There are several approaches:

Debugging deadlocks by logging on the application side

If the application logs information about each error and each database transaction started, it is easy to determine the transactions that caused the problem. However, that requires that you either have control over the application code, or that the application has sufficiently well-designed tracing facilities. You are not always in that lucky position.

Debugging deadlocks by annotating queries on the application side

This is also a method that requires that you can modify the application. The trick here is to add a comment to the query, like

SELECT /* function "add_data", source file "app.c", line 1234 */
pdata FROM parent WHERE pid = 1 FOR UPDATE;

The information can be anything that allows you to tell which transaction the statement belongs to.

Debugging deadlocks by logging on the database side

In order to enable us to trace statements, we have to make sure that the process ID and transaction number are logged with each statement. With CSV logging, that is always the case, but with stderr logging, you have to change log_line_prefix:

log_line_prefix = '%m [pid=%p] %q[txid=%x] %u@%d '

Also, you have to log all statements:

log_statement = 'all'

Now the amount and performance impact of logging all statements may be forbidding for busy applications. PostgreSQL v12 introduced log_transaction_sample_rate, which allows you to log only a certain percentage of all transactions. However, in many cases that won’t be good enough, because you’ll typically only see the statements from one of the transactions involved.

The log file for our example would look as follows:

2022-06-20 17:17:50.888 CEST [pid=19718] [txid=0] laurenz@test LOG:  statement: BEGIN;
2022-06-20 17:17:52.988 CEST [pid=19718] [txid=0] laurenz@test LOG:  statement: INSERT INTO child VALUES (100, 1, 'new child');
2022-06-20 17:17:58.215 CEST [pid=19674] [txid=0] laurenz@test LOG:  statement: BEGIN;
2022-06-20 17:18:03.626 CEST [pid=19674] [txid=0] laurenz@test LOG:  statement: INSERT INTO child VALUES (101, 1, 'another child');
2022-06-20 17:18:18.979 CEST [pid=19674] [txid=1108] laurenz@test LOG:  statement: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
2022-06-20 17:18:31.550 CEST [pid=19718] [txid=1109] laurenz@test LOG:  statement: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test ERROR:  deadlock detected
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test DETAIL:  Process 19718 waits for ShareLock on transaction 1108; blocked by process 19674.
        Process 19674 waits for ShareLock on transaction 1109; blocked by process 19718.
        Process 19718: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
        Process 19674: SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test HINT:  See server log for query details.
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test CONTEXT:  while locking tuple (0,1) in relation "parent"
2022-06-20 17:18:32.551 CEST [pid=19718] [txid=1109] laurenz@test STATEMENT:  SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;

Note that the transaction ID is logged as 0 until after the first data modifying statement in the transaction. This is because PostgreSQL assigns transaction IDs lazily, only when they are needed.

Debugging deadlocks by increasing deadlock_timeout

The PostgreSQL deadlock detector doesn’t kick in immediately, but waits until after deadlock_timeout has passed. This is because the deadlock detector is a computationally expensive component, so we only want to activate it if there is very likely a problem. By default, the parameter is set to one second. Unfortunately, one second is much too short to debug the problem. Now if you can afford to let deadlocked sessions “hang” until you can debug them, you can increase the parameter. The maximum allowed value is almost 600 days, which is longer than you need. With proper monitoring, you could for example set deadlock_timeout to 15 minutes to catch deadlocks in the act.

Once you have an active deadlock, you can find client_addr and client_port from pg_stat_activity for the hanging sessions:

SELECT client_addr, client_port, query
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'Lock'
ORDER BY state_change;

  client_addr  │ client_port │                       query                        
═══════════════╪═════════════╪════════════════════════════════════════════════════
 192.168.2.120 │       49578 │ SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
 192.168.2.122 │       47432 │ SELECT pdata FROM parent WHERE pid = 1 FOR UPDATE;
(2 rows)

We sort by state_change, because other sessions might later get stuck behind the sessions that originally caused the deadlock. For that reason, you want to focus on the oldest entries.

Next, log into the client machine and find out which process has a TCP connection on port 49578:

# lsof -P | grep 49578
myapp  3974  laurenz  [...]  TCP localhost:49578->dbserver:5432 (ESTABLISHED)

Now you know that the client process is 3974, running the program myapp. You can now attach to the program with a debugger and see the current execution stack. That should give you a clue as to what is being executed, so you can figure out the SQL statements run in that transaction.

Needless to say, this is yet another method that requires control over the application code.

Conclusion

We have seen several methods for debugging deadlocks. Most of them require that you have control over the application code. Not all techniques are useful in every case: for example, you may not be able to afford logging all statements or leaving deadlocks hanging for a longer time.

For more information on deadlocks, see Hans’ blog on Understanding Deadlocks.


Please leave your comments below. In order to receive regular updates on important changes in PostgreSQL, subscribe to the CYBERTEC newsletter, or follow us on TwitterFacebook, or LinkedIn