Before there was TCP keepalive
© Laurenz Albe 2021

If you’ve heard about TCP keepalive but aren’t sure what that is, read on. If you’ve ever been surprised by error messages like:

  • server closed the connection unexpectedly
  • SSL SYSCALL error: EOF detected
  • unexpected EOF on client connection
  • could not receive data from client: Connection reset by peer

then this article is for you.

Causes for broken connections

There are several possible causes for broken connections:

Database server crashes

The first two messages in the above list can be the consequence of a PostgreSQL server problem. If the server crashes for whatever reason, you’ll see a message like that. To investigate whether there is a server problem, you should first look into the PostgreSQL log and see if you can find a matching crash report.

We won’t deal with that case in the following, since it isn’t a network problem.

Connections abandoned by the client

If the client exits without properly closing the database connection, the server will get an end-of-file or an error while communicating on the network socket. With the new session statistics introduced in v14, you can track the number of such “abandoned” database connections in pg_stat_database.sessions_abandoned.

For example, if an application server fails and is restarted, it typically won’t close the connections to the database server. This isn’t alarming, and the database server will quickly detect it when the server tries to send data to the client. But if the database session is idle, the server process is waiting for the client to send the next statement (you can see the wait_eventClientRead” in pg_stat_activity). Then the server won’t immediately notice that the client is no longer there! Such lingering backend processes occupy a process slot and can cause you to exceed max_connections.

PostgreSQL v14 has introduced a new parameter idle_session_timeout which closes idle connections after a while. But that will terminate “healthy” idle connections as well, so it isn’t a very good solution. TCP keepalive provides a much better solution to this problem.

Connections closed by a network component

Sometimes both ends of the database connection experience the same problem: each sees that the other end “hung up on them”. In that case, the problem lies somewhere between the database client and the server.

Network connections can get disconnected if there is a real connectivity problem. There’s nothing you can do to change that on the software level. But very often, disconnections are caused by the way firewalls or routers are configured. The network component may have to “memorize” the state of each open connection, and the resources for that are limited. So it can seem expedient to “forget” and drop connections that have been idle for a longer time.

Since a lot of today’s TCP traffic is via HTTP, and HTTP is stateless, that’s not normally a problem. If your HTTP connection is broken, you simply establish a new connection for your next request, which isn’t very expensive. But databases are different:

  • it is expensive to establish a database connection
  • database connections are not stateless; for example, with a closed connection you lose open transactions, temporary tables and prepared statements
  • it is normal for database sessions to be idle for a longer time, for example if you are using a connection pool, or when the client is waiting for the result from a long-running analytical query

This is where TCP keepalive comes in handy as a way to keep idle connections open.

What is TCP keepalive?

Keepalive is a functionality of the TCP protocol. When you set the SO_KEEPALIVE option on a TCP network socket, a timer will start running as soon as the socket becomes idle. When the keepalive idle time has expired without further activity on the socket, the kernel will send a “keepalive packet” to the communication partner. If the partner answers, the connection is considered good, and the timer starts running again.

If there is no answer, the kernel waits for the keepalive interval before sending another keepalive packet. This process is repeated until the number of keepalive packets sent reaches the keepalive count. After that, the connection is considered dead, and attempts to use the network socket will result in an error.

Note that it is the operating system kernel, not the application (database server or client) that sends keepalive messages. The application is not aware of this process.

TCP keepalive serves two purposes:

  • keep network connections from being idle
  • detect if the other communication end has left without closing the network connection
    (The name “keepalive” does not describe that well – “detectdead” would be more to the point).

TCP keepalive default settings

The default values for the keepalive parameters vary from operating system to operating system. On Linux and Windows, the default values are:

  • keepalive idle time: 2 hours on both Linux and Windows
  • keepalive interval: 75 seconds on Linux, 1 second on Windows
  • keepalive count: 9 on Linux, 10 on Windows (this value cannot be changed on Windows)

I could not find the default settings for MacOS.

Using TCP keepalive to keep an idle database session alive

To keep firewalls and routers from closing an idle connection, we need a much lower setting for the keepalive idle time. Then keepalive packets get sent before the connection is closed. This will trick the offending network component into believing that the connection isn’t idle, even if neither database client nor server send any data.

For this use case, keepalive count and keepalive interval are irrelevant. All we need is for the first keepalive packet to be sent early enough.

Using TCP keepalive to detect dead connections

For this use case, reducing the keepalive idle time is often not enough. If the server sends nine keepalive packets with an interval of 75 seconds, it will take more than 10 minutes before a dead connection is detected. So we’ll also reduce the keepalive count, or the keepalive interval, or both – as in this case.

There is still one missing piece to the puzzle: even if the operating system detects that a network connection is broken, the database server won’t notice, unless it tries to use the network socket. If it’s waiting for a request from the client, that will happen immediately. But if the server is busy executing a long-running SQL statement, it won’t notice the dead connection until the query is done and it tries to send the result back to the client! To prevent this from happening, PostgreSQL v14 has introduced the new parameter client_connection_check_interval, which is currently only supported on Linux. Setting this parameter causes the server to “poll” the socket regularly, even if it has nothing to send yet. That way, it can detect a closed connection and interrupt the execution of the SQL statement.

Setting TCP keepalive parameters on the PostgreSQL server

The PostgreSQL server always sets SO_KEEPALIVE on TCP sockets to detect broken connections, but the default idle timeout of two hours is very long.

You can set the configuration parameters tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count (the last one is not supported on Windows) to change the settings for all server sockets.

This is the most convenient way to configure TCP keepalive for all database connections, regardless of the client used.

Setting TCP keepalive parameters on the PostgreSQL client

The PostgreSQL client shared library libpq has the connection parameters keepalives_idle, keepalives_interval and keepalives_count (again, the latter is not supported on Windows) to configure keepalive on the client side.

These parameters can be used in PostgreSQL connection strings with all client interfaces that link with libpq, for example, Psycopg or PHP.

The PostgreSQL JDBC driver, which does not use libpq, only has a connection parameter tcpKeepAlive to enable TCP keepalive (it is disabled by default), but no parameter to configure the keepalive idle time and other keepalive settings.

Setting TCP keepalive parameters on the operating system

Instead of configuring keepalive settings specifically for PostgreSQL connections, you can change the operating system default values for all TCP connections – which can be useful, if you are using a PostgreSQL client application that doesn’t allow you to set keepalive connection parameters.

On Linux, this is done by editing the /etc/sysctl.conf file:

# detect dead connections after 70 seconds
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 5
net.ipv4.tcp_keepalive_probes = 3

To activate the settings without rebooting the machine, run

sysctl -p

On Windows, you change the TCP keepalive settings by adding these registry keys:

HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveTime
HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\KeepAliveInterval

As noted above, there is no setting for the number of keepalive probes, which is hard-coded to 10. The registry keys must be of type DWORD, and the values are in milliseconds rather than in seconds.

After changing these keys, restart Windows to activate them.

Conclusion

Configuring TCP keepalive can improve your PostgreSQL experience, either by keeping idle database connections open, or through the timely detection of broken connections. You can do configure keepalive on the PostgreSQL client, the server, or on the operating system.

In addition to configuring keepalive, set the new parameter client_connection_check_interval to cancel long-running queries when the client has abandoned the session.

 

To learn more about terminating database connections, see our blogpost here.

If you would like to learn about connections settings, see my post about max_connections.