Yes, this post is about connection queueing, not just pooling, because “connection pooling” – pre-created connections kept in a pool – is a much-celebrated feature. Almost every discussion on connection pooling/pgBouncer starts with the overhead of establishing a new connection to PostgreSQL… and how pre-created connections in the pool can save the world.
But there is a less-celebrated feature in pgBouncer (not denying the others) that can address some really big operational challenges: connection queueing. Many new PostgreSQL users don’t even know it exists. In this blog post, I plan to discuss some of the problems related to connection management first, and then explore how connection queueing can address those problems.
Problem 1: A Spike in Load Can Jam and Halt the Server
PostgreSQL has a dedicated backend server process for every user connection, so there will be one backend process running on a CPU core for every active query/session. This means there is a one-to-one mapping between active sessions and running processes on the server. If we consider parallel execution of SQL statements, there will be even more running processes than active sessions. In many real-world cases, a sudden spike in load can result in hundreds of active queries starting at once while the server is equipped with only a small number of CPUs (sometimes just virtual CPUs with only a fraction of the performance). As the number of active sessions/processes increases, the overhead of scheduling and context switches takes over. Many times, the host server becomes unresponsive, and even opening a bash shell/terminal can take time. This is quite easy to simulate: just 10 active connections on a two-virtual-CPU server with a SELECT-only workload can cause it.
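A minimal sketch of how such a load can be generated with pgbench (a SELECT-only run with 10 client connections; the host name pghost simply mirrors the examples below and is only illustrative):

pgbench -c 10 -j 10 -h pghost -U postgres -S -T 100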
With two active sessions:
$ time ssh -t postgres@pghost 'ls -l'
real 0m0.474s
user 0m0.017s
sys 0m0.006s
With 10 active sessions on PostgreSQL, just establishing an ssh connection to the server took 15 seconds:
real 0m15.307s
user 0m0.026s
sys 0m0.015s
*These are indicative numbers from a very specific system and do not qualify as a benchmark.
Generally, we can see that as the number of active sessions approaches double the number of CPU cores, the performance penalty starts to increase heavily.
Many times, the problem won’t end there. Session-level resource allocations (work_mem, temporary tables, etc.) add to the overall server resource consumption. As the host server slows down, each session takes more time to complete while holding on to its resources, which leads to even more active sessions accumulating. It is a vicious spiral. There are many real-world cases where this ended in a complete halt of the host server, or in the OOM killer kicking in, terminating the PostgreSQL process and forcing it into crash recovery.
Problem 2: “Too Many Clients Already” Errors
A few smart DBAs will prevent this database disaster by setting max_connections to a smaller value than what the server can actually handle, which is the right thing to do from the DB server’s perspective. Allowing an excessive number of connections to the database can lead to different types of abuse, attacks, and disasters. But the flip side is that an application opening too many connections may be greeted with a message like the following:
FATAL: sorry, too many clients already
The same will be logged in the PostgreSQL log.
2021-02-15 13:40:50.017 UTC [12829] FATAL: sorry, too many clients already
Unfortunately, this can lead to an application crash or misbehavior. From a business point of view, we have just shifted the problem from the database to the application.
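A quick way to see how much headroom remains before clients start hitting this error (a simple illustrative check, not part of the original test):

postgres=# SHOW max_connections;
postgres=# SELECT count(*) FROM pg_stat_activity;

Comparing the two shows how close the server is to rejecting new clients; keep in mind that a few slots are additionally reserved for superusers via superuser_reserved_connections.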
Problem 3: Big max_connections Value and Overhead
Because of the above-mentioned problem, it is common to see max_connections set to a very high value. The overhead of connections is probably one of the most actively discussed topics these days because PostgreSQL 14 is expected to include some connection scalability fixes. Andres Freund blogged about the details of analyzing the connection scalability problem and how it is addressed.
Even idle connections may occupy server resources like memory. The overhead is generally considered very low on a properly configured server; however, the impact can be heavy in reality. Again, a lot depends on the workload, but there are at least a few cases reporting up to 50MB of consumption per session. That means 500 idle connections could result in up to 25GB of memory usage.
In addition to this, more connections can lead to more lock-management-related overhead. And don’t forget that the system becomes more vulnerable to sudden spikes as max_connections is increased.
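A simple way to check how many of those connections are actually doing work at any moment (again, just an illustrative check):

postgres=# SELECT state, count(*) FROM pg_stat_activity GROUP BY state;

A large count of “idle” rows is a sign that many connections are paying the per-session overhead without contributing any work.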
Solution: Connection Queueing
Connection queueing is just what it sounds like: instead of rejecting excess connections outright or sending them straight to the server and jamming it, they are put into a queue. The queue absorbs sudden spikes in load and streamlines execution; the PostgreSQL server can keep doing what it does best rather than dealing with a jam.
Let me demonstrate with an example. For this demonstration, I set max_connections to “2”, assuming that this is the maximum the server can accommodate without causing too many context switches. With this limit in place, too many connections cannot come in and overload my database.
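As a side note, here is a minimal sketch of how such a limit might be set (max_connections only takes effect after a restart; how you restart depends on how the service is managed):

postgres=# ALTER SYSTEM SET max_connections = 2;
ALTER SYSTEM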
postgres=# show max_connections ;
2
A third connection to the database will result in an error as expected.
psql: error: FATAL: sorry, too many clients already
Now let’s use pgBouncer for the connection queue. Many users may not know that queueing is there in pgBouncer by default. I used the following pgBouncer configuration for testing:
[databases]
pgbounce = host=172.16.2.16 port=5432 dbname=postgres
[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = pgbouncer.log
pidfile = pgbouncer.pid
admin_users = postgres
application_name_add_host=1
default_pool_size=1
min_pool_size=1
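With this configuration in place, pgBouncer can be started and clients pointed at port 6432 instead of 5432. The config file path and host name below are only placeholders for this sketch:

$ pgbouncer -d /etc/pgbouncer/pgbouncer.ini
$ psql -h <pgbouncer-host> -p 6432 -U postgres pgbounce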
Yes, the pooler will establish only one connection to the database. pgBouncer establishes this one connection when the first client connection arrives, because min_pool_size is 1. The pgBouncer log says:
2021-02-16 15:19:30.725 UTC [2152] LOG C-0x1fb0c58: pgbounce/postgres@172.16.2.56:54754 login attempt: db=pgbounce user=postgres tls=no
2021-02-16 15:19:30.726 UTC [2152] LOG S-0x1fbba80: pgbounce/postgres@172.16.2.16:5432 new connection to server (from 172.16.2.144:58968)
The pgBouncer pool statistics also show the details:
pgbouncer=# show pools;
database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
pgbounce | postgres | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | session
But the beauty is that we won’t get any more “FATAL: sorry, too many clients already” errors. All client connections are accepted and put into the connection queue. For example, here I have five client connections; please note the value of cl_active:
pgbouncer=# show pools;
database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
pgbounce | postgres | 5 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | session
As the client connections become active (by issuing SQL statements), only one can be served by the single server connection; the rest are put into waiting:
pgbouncer=# show pools;
database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
pgbounce | postgres | 1 | 4 | 1 | 0 | 0 | 0 | 0 | 28 | 438170 | session
Each client connection’s work is executed over the available database connection, one after another. This is the case with a single database connection. If the connection count and pool size are increased, multiple client connections can hit the server at the same time and the queue size drops. The following is the case with two connections to the database (pool size two; a sketch of how the pool size can be raised follows the output).
database | user | cl_active | cl_waiting | sv_active | sv_idle | sv_used | sv_tested | sv_login | maxwait | maxwait_us | pool_mode
-----------+-----------+-----------+------------+-----------+---------+---------+-----------+----------+---------+------------+-----------
pgbounce | postgres | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 4 | 978081 | session
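Raising the pool size is a small change in the [pgbouncer] section of the configuration (a sketch; the values mirror the two-connection case above):

default_pool_size=2
min_pool_size=2

and then a reload from the pgBouncer admin console:

pgbouncer=# RELOAD;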
Putting Connection Queueing into a Simple Test
This is not extensive benchmarking, but a quick test to see the benefit on a typical host with two virtual CPUs. I created 20 active connections to PostgreSQL with a select-only load using pgbench:
pgbench -c 20 -j 20 -h pghost -U postgres -S -T 100
As the 20 server processes started running, the load average went through the roof.
As you can see in the screenshot, the load average spiked to 17+. And as expected, the server response also became consistently poor:
$ time ssh -t postgres@pghost 'ls -l'
real 0m17.768s
user 0m0.019s
sys 0m0.007s
At this stage, I tried sending the same 20 active connections through pgBouncer’s connection queue with a pool size of four (default_pool_size=4). pgBouncer runs on the client side.
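The same pgbench run, now pointed at pgBouncer’s listen port and the pgbounce database alias from the configuration above (the host name is a placeholder):

pgbench -c 20 -j 20 -h <pgbouncer-host> -p 6432 -U postgres -S -T 100 pgbounce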
Since there are only four server-side processes, the load average dropped drastically. The maximum I could see was 1.73:
The server response is also very good.
$ time ssh -t postgres@pghost 'ls -l'
real 0m0.559s
user 0m0.021s
sys 0m0.008s
A load average of 17+ vs. 1.73! That sounds too good to be true.
There was a bit of skepticism about whether the low load on the server and the better server response were coming at the cost of database throughput; I was expecting to see not-so-great throughput numbers. So I took the same test to a more consistently performing platform (AWS r5.large with two virtual CPUs). To my surprise, the numbers were even better.
The following are the numbers I got (three runs each). At the very least it is not worse; it is better, roughly 5% better.
Direct       | With Queueing
-------------+--------------
20190.301781 | 21663.454921
20308.115945 | 21646.195661
20434.780692 | 21218.44989
Since we are using just four connections on the database side in this case, it also gives us the opportunity to reduce the max_connections value on the database side. Another check was whether switching to transaction-level pooling could save even more, because the database connection goes back to the pool after each transaction and can serve another client connection, which could result in better concurrency (the one-line pool_mode change is sketched after the results).
Queue + max_connections=5 | Queue + max_connections=5 + transaction-level pool
--------------------------+---------------------------------------------------
21897.685891              | 23318.655016
21913.841813              | 23486.485856
21933.129685              | 23633.39607
As expected, it delivered even better numbers. I would like to encourage readers to do more tests and proper benchmarking.
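For reference, switching to transaction-level pooling is a one-line change in the [pgbouncer] section, followed by a RELOAD from the admin console (a sketch; note that transaction pooling does not play well with session-level features such as prepared statements or temporary tables):

pool_mode = transaction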
Summary and References
A really large number of application/client connections can be multiplexed over a very few database connections using connection queueing. The queue helps absorb any spike in connections without overloading the database server. Instead of session processes piling up on the OS scheduler/run queue, the connections are held safely outside, and the database server can operate at full throttle without contention. Streamlined database traffic results in better throughput, too.