PostgreSQL 14 introduced the parameter
idle_session_timeout, and, unfortunately, many DBAs jumped to use it without understanding, or simply ignoring, the consequences. In a short span of time, it has become one of the most misused parameters in many PostgreSQL installations. There is nothing wrong with
idle_session_timeout from a technical perspective; even without this parameter, I have seen frustrated DBAs run scheduled shell scripts to keep terminating old idle sessions. The new parameter just made it easy for them to nuke all idle connections. The problem is the collateral damage it can cause.
Note for novice DBAs: There are human and organizational factors for almost every incident, rather than just technical reasons. Trying to do a quick/dirty fix by changing parameters will have far-reaching implications. Not everything can be addressed on the database side. This is more evident in connection-related incidents. Blame games and bad decisions following one incident/outage can lead to further bad decisions and further problems.
For example, many such incidents will have a common root cause: something like an unexpected, sudden connection explosion, a big bang of connections with hundreds or thousands of idle sessions. The RCA investigation may stop at the technical side of the problem, concluding that loose limits invited the disaster. But the right question to address would be: Why did the DBAs end up setting such loose limits? What factors influenced and forced them to do so? Unfortunately, that question is addressed very rarely in many organizations.
Common factors contributing to connection-related outages
1. Unrestricted connection settings: Leaving connection settings, such as max_connections and per-user limits, wide open exposes the system to malicious Denial-of-Service (DoS) attacks. These attacks can rapidly consume available connections, leaving legitimate users unable to connect and disrupting system functionality.
2. Inadequate application-side pooling: Suboptimal application-side pooling can lead to opening a large number of connections, and they may remain open and unused, consuming valuable resources. These connections can accumulate, eventually exhausting server resources and causing the system to become unresponsive.
3. Unexpected connection leaks: Unanticipated connection leaks can also arise from programming errors or faulty application logic. These leaks can quickly deplete the available connection pool, hindering system performance and potentially leading to crashes.
4. Human factor: In an ideal case, there should be stringent restrictions and enforcement on what load and how many connections can be onboarded to the database. The scrutiny should be as tight as airline security.
The problem generally begins as a conflict of interests. App-Dev often demands unrestricted, uncontrolled access to the database. A “NO” from the Ops/DBAs at the right time may save the entire system from outages. However, gatekeeping is not an easy job. Pressure builds up, and DBAs give up in most cases and allow settings that the database server cannot accommodate. The first decision goes wrong here. Obviously, an incident is expected down the line.
5. Organizational factor of wrong decisions: A surprising number of wrong decisions happen after the investigation that follows a major incident. The investigation may point fingers at the excessive number of connections, and it is common for the Ops/DBA team to be made responsible for managing the “idle” connections, because “it is a database problem”. I would say the very wrong person takes charge.
The point forgotten is that connections are owned by clients but maintained by the database. It’s like a bank account: the customer owns it, but the bank maintains it. We shouldn’t be terminating or closing connections from the database side. Instead of investigating who owns them and addressing the issue at its root, DBAs often go for shortcuts like idle_session_timeout or a script to terminate all “idle” connections. The boundaries of ownership are violated, inviting the next level of problems.
There are a lot of misunderstandings about “idle” connections among the user community. We should remember that a connection will be “idle” immediately after it is established. It becomes “active” when there is an SQL statement to execute and goes back to “idle” when the processing is over. Most connections from any connection pooler will also be “idle” most of the time. So, there is nothing wrong with a connection being “idle”. The point I want to make is that a connection appearing as “idle” does not make it a candidate for termination/killing. This may sound too obvious to an experienced user, but the reality is that I have ended up sitting on emergency calls to explain exactly this.
What matters is the number of connections. The right question to ask in an RCA call is, “Who is responsible for creating the large number of connections that are idle most of the time?”
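If you need to answer that question, pg_stat_activity already holds the information. Here is a minimal sketch (the grouping columns are just one reasonable choice; adjust them as needed):

```sql
-- Break down client connections by owner, application, and state
SELECT usename, application_name, client_addr, state, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY usename, application_name, client_addr, state
ORDER BY connections DESC;
```

The output usually makes it obvious which user or application is holding the bulk of the idle connections, so the discussion can move to the owning team instead of to a kill script.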
Common symptoms of poor connection management
- Gradual server performance degradation due to the high number of connections
- Often, the server becomes unresponsive and/or crashes
- The server runs out of memory, and the OOM killer terminates PostgreSQL
- The server is running out of available connections
- Applications reporting connection failures and unexpected connection drops
Problems of backend termination
I often get the question, “What is the problem with terminating connections from the backend?” In layman’s terms, it is like a bank closing your account and taking all the money in it. That can come as a big surprise to the database client/application. Most applications are not designed and tested to handle such surprises. Even a psql session gets taken by surprise:
```
postgres=> \l+
FATAL:  terminating connection due to idle-session timeout
FATAL:  server conn crashed?
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
```
So unless the application is designed, developed, and tested for handling unexpected connection drops, it won’t be able to handle this. The majority of applications just exit with error reporting. In critical systems, this causes serious outages for business.
Despite being tailored for PostgreSQL, pgBouncer is not immune to sudden connection terminations:
```
2023-11-14 16:17:21.967 UTC [4630] WARNING S-0x10b7940: postgres/authuser@127.0.0.1:5432 got packet 'E' from server when not linked
2023-11-14 16:17:21.967 UTC [4630] LOG S-0x10b7940: postgres/authuser@127.0.0.1:5432 closing because: server conn crashed? (age=218s)
2023-11-14 16:17:22.927 UTC [4630] LOG stats: 0 xacts/s, 0 queries/s, 0 client parses/s, 0 server parses/s, 0 binds/s, in 0 B/s, out 11 B/s, xact 0 us, query 0 us, wait 12 us
2023-11-14 16:17:40.533 UTC [4630] LOG S-0x10b7bf8: postgres/pgbtestapp@[::1]:5432 closing because: server conn crashed? (age=236s)
2023-11-14 16:17:40.533 UTC [4630] LOG C-0x10af130: postgres/pgbtestapp@[::1]:39470 closing because: server conn crashed? (age=28s)
2023-11-14 16:17:40.533 UTC [4630] WARNING C-0x10af130: postgres/pgbtestapp@[::1]:39470 pooler error: server conn crashed?
2023-11-14 16:18:06.370 UTC [4630] LOG S-0x10b7bf8: postgres/authuser@127.0.0.1:5432 new connection to server (from 127.0.0.1:46632)
```
Even if we ignore all other problems, the connection pooler will end up re-establishing pretty much every connection in the pool. This completely destroys the very purpose of the connection pooler.
It is not that the PostgreSQL community is not aware of this possible abuse of parameters. This has been discussed in the community, and the implications on the pooler are known and well documented.
However, users still overlook the implications, and serious mess-ups happen.
From the discussions in the PostgreSQL community, the only strong argument in favor of this parameter is one-off users who directly log in to the database to execute custom statements. They may accidentally leave their sessions open for a long time, which causes damage.
There were discussions about making the documentation more explicit, along these lines: consider setting this for specific users instead of as a server default; client connections managed by connection poolers, or initiated indirectly like those by a remote postgres_fdw-using server, should probably be excluded from this timeout.
What can we do about it without causing any serious consequences?
Obviously, DBAs may have questions like: what if user accounts or application connections really are abusive? As mentioned above, the solution is not just technical; understanding the system and implementing appropriate strategies is what matters.
Segregation of database accounts and role-based settings
Accounts used by individual users for interactive logins need to be differentiated, in terms of user/role-level settings, from the accounts used by applications, the service accounts. PostgreSQL allows us to set parameters at different levels and scopes. More stringent restrictions should be placed on the interactive login accounts. It is very much OK to have settings like idle_session_timeout for those accounts, preferably not exceeding five minutes. Most importantly, idle_in_transaction_session_timeout for these accounts should not exceed a few seconds.
On the other hand, I would recommend NOT using idle_session_timeout for service accounts. However, setting idle_in_transaction_session_timeout to a couple of minutes is acceptable, as it helps us find problematic application logic.
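As a rough sketch of what such role-level settings could look like (the role names reporting_user and app_service are hypothetical; pick values that fit your environment):

```sql
-- Interactive login account: strict timeouts are acceptable here
ALTER ROLE reporting_user SET idle_session_timeout = '5min';
ALTER ROLE reporting_user SET idle_in_transaction_session_timeout = '10s';

-- Application/service account: no idle_session_timeout,
-- but a modest idle_in_transaction_session_timeout helps expose problematic logic
ALTER ROLE app_service SET idle_in_transaction_session_timeout = '2min';
```

Settings applied this way take effect for new sessions of those roles, so existing connections are not disturbed.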
Use network timeouts
There are alternative approaches if we suspect that connections are simply leaked, with no application actually holding them. TCP timeouts could be the best solution for such cases. PostgreSQL exposes the TCP keepalive-related timeouts as parameters, which can be specified like any other PostgreSQL parameter.
tcp_keepalives_idle: This parameter specifies the number of seconds of inactivity after which a keepalive message will be sent to the client by the PostgreSQL server. For example, if we set a value of 120, the server will send a keepalive message to the client after two minutes of inactivity and wait for a response. If there is an acknowledgment from the client, the connection is validated. What if there is no acknowledgment from the client? That is where the next parameter helps.
tcp_keepalives_interval: This is the number of seconds after which the TCP keepalive message will be retransmitted if there is no acknowledgment from the client. For example, setting a value of 20 results in sending keepalive packets every 20 seconds until an acknowledgment is obtained. What if there is no acknowledgment for any of the keepalive attempts? This is where the next parameter comes into play.
tcp_keepalives_count: This is the number of keepalive messages that can be lost before the client connection is considered “dead” and terminated.
So, the overall settings summary discussed is:
```
tcp_keepalives_idle = 120
tcp_keepalives_interval = 20
tcp_keepalives_count = 6
```
But you may adjust the values according to the needs of your environment.
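For instance, one way to apply the values above cluster-wide is through ALTER SYSTEM; these keepalive parameters do not require a restart, a configuration reload is enough:

```sql
ALTER SYSTEM SET tcp_keepalives_idle = 120;
ALTER SYSTEM SET tcp_keepalives_interval = 20;
ALTER SYSTEM SET tcp_keepalives_count = 6;

-- Reload the configuration and verify the effective value
SELECT pg_reload_conf();
SHOW tcp_keepalives_idle;
```

Keep in mind that these settings apply only to TCP connections; they are ignored for connections over Unix-domain sockets.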
There is one more parameter, tcp_user_timeout, which comes with documented caveats and reportedly affects how the other keepalive parameters behave. So this is not something we should consider on day one.
From PostgreSQL 14 onwards, we have an additional option, client_connection_check_interval, which can detect whether the connection to the client has gone away while a query is running. This is most useful if a client dies during a long-running statement, like those complex reporting queries from OLAP systems. By default, this check is off (value 0). Setting it to 5 to 10 seconds will be of great value if you expect client connectivity issues. Please note that the value is specified in milliseconds. PostgreSQL allows us to set this value even at the user level.
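Since the parameter can be set at the role level, a sketch could look like this (the role name olap_reports is hypothetical):

```sql
-- Check for a vanished client every 10 seconds during long-running queries
ALTER ROLE olap_reports SET client_connection_check_interval = '10s';
```

Note that this check fires only while a query is executing, so it complements rather than replaces the TCP keepalive settings above.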
With all the TCP timeouts in place, you may see entries like the following in the PostgreSQL logs when a timeout occurs.
```
2023-12-20 04:37:12.026 UTC [1230] LOG: could not receive data from client: Connection timed out
2023-12-20 04:37:12.026 UTC [1230] LOG: disconnection: session time: 0:22:23.419 user=jobin database=postgres host=fd42:8aba:4133:fb7d:216:3eff:fe3c:7bcb port=52984
```
Summary
There should be a cap on what a system can accept as a load. At the PostgreSQL level, it is the max_connections. I would consider it the primary duty of any Production DBA to protect the system from any possible abuse.
Dear DBAs, please consider the following points to keep the system stable and protect it from abuse.
- Configure connection settings appropriately. Set connection limits, such as max_connections and per-user limits, to reasonable values that align with expected usage patterns, and regularly review and adjust these settings as needed. The max_connections setting of a PostgreSQL instance should not be too far from 10x the number of available CPUs (see the sketch after this list).
- Encourage the use of robust application-side pooling: Implement well-configured application-side pooling mechanisms to manage connections and prevent leaks efficiently. Employ libraries or frameworks that provide reliable pooling capabilities. In case the application framework doesn’t have a good connection pooler, consider external connection poolers like pgBouncer.
- Connections are owned by those who create them. If there is an excessive number of idle connections, the part of the system that is creating them needs to be fixed. Terminating them silently from the backend is not a solution; it invites more problems.
- Never terminate connections from the backend unless there is a request to do so from the application/client who owns the connections.
- Avoid using parameters like idle_session_timeout at the global level; use it only at the user/role level, especially for interactive login accounts.
- Avoid idle_session_timeout for application/service accounts. It can be used for interactive login accounts, where the consequence of a timeout is predictable.
- Use the alternate options discussed above, such as TCP keepalives and client_connection_check_interval, to handle leaked connections.
- Monitor the connections; they are a precious resource.
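As a sketch of the connection caps mentioned in the first point (the numbers and the role name are purely illustrative; note that max_connections requires a server restart to take effect):

```sql
-- Cluster-wide cap on concurrent connections; needs a restart
ALTER SYSTEM SET max_connections = 200;

-- Per-user cap for a service account
ALTER ROLE app_service CONNECTION LIMIT 80;
```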
And most importantly, the crucial part: empower yourself or your DBAs with the ability to say NO to connection abuse. Junior DBAs are more vulnerable to pressure from other parts of the organization to take wrong steps.