Logical Replication/Decoding Improvements in PostgreSQL 13 and 14

Logical Replication/Decoding Improvements in PostgreSQL

I recently blogged about How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster. In fact, nothing else hurt logical replication as much as this problem. Even as I write this post, I see customers/users without Patroni struggling to address it. Thanks to the Patroni community for addressing this problem in the most magnificent way: no patch to PostgreSQL, no extensions required! A completely non-invasive solution.

With the biggest roadblock/deterrent gone, we expect more and more users to start looking into or reconsidering logical replication, especially those who discarded it due to practical difficulties. I want to let them know that there are many more exciting new features related to logical replication/decoding in newer versions like PostgreSQL 13 and 14.

Before getting into the new features, let us look at what else was hurting logical replication in older PostgreSQL versions.

Memory Usage and Disk Usage

PostgreSQL used to keep only 4096 changes (max_changes_in_memory) per transaction in memory. If there was a very lengthy transaction, the rest of the changes were spilled to disk as spill files. This has two important implications. First, if each change is really big and there are sub-transactions, memory consumption can easily run into several gigabytes; this can even affect the host machine's stability, and the chance of the OOM killer kicking in increases. On the other hand, if the changes are very small but numerous, they will still be spilled to disk if the transaction is lengthy, causing I/O overhead.

Massive Replication Delays and CPU Load

It was almost a regular complaint from many users that they kept seeing huge replication lags. A closer inspection would show the WAL sender consuming a lot of CPU, with single-core saturation being the commonly reported case. Many times, further investigation revealed a long-running transaction or a bulk data load causing the generation of spill files. The system was busy going through the spill files and preparing the commit order to be sent to the logical replica.

Again, we witnessed a few cases where users opted for logical replication to reduce the load on the primary, but the cost (CPU and I/O usage) of logical decoding by the WAL sender wiped out all potential gains.

These problems were not unknown to the PostgreSQL community. In fact, discussions about the problems and their fixes started around the time PostgreSQL 10 was released. The good news is that all of these are addressed in recent development.

On behalf of the user community, I would like to express gratitude to those who contributed their time and effort to developing these wonderful solutions: Tomas Vondra, Amit Kapila, Dilip Kumar, Masahiko Sawada, Vignesh C, and many more who gave very valuable input, like Peter Eisentraut and Andres Freund.

Improvements in PostgreSQL 13

The problem of memory and disk usage is addressed in PostgreSQL 13. Now max_changes_in_memory (4096) is not used while adding changes. Instead, the total memory usage across all transactions and the memory usage of each individual transaction are tracked, and a new parameter, logical_decoding_work_mem, is introduced. The buffer is spilled to disk only if this limit is exceeded, and only the largest transaction, the one consuming the most memory, is the victim to be spilled. This is smarter and also reduces unwanted disk spills.

Reference: ReorderBufferCheckMemoryLimit (src/backend/replication/logical/reorderbuffer.c)

Improvements in PostgreSQL 14

Spilling to disk when logical_decoding_work_mem is full is one idea. But what about transmitting the changes directly to subscribers instead of spilling them to disk? This is the major change/improvement in PostgreSQL 14. It is not as easy as it sounds, because we are dealing with ongoing transactions; the overall logic of logical replication had to undergo huge changes. But yes, PostgreSQL 14 introduces the option to stream the in-progress transaction to the subscriber rather than spilling to disk first. Obviously, this new feature required improvements to the replication protocol: new message formats like "Stream Start", "Stream Stop", "Stream Commit", and "Stream Abort" were added. Please refer to the PostgreSQL documentation: https://www.postgresql.org/docs/14/protocol-logicalrep-message-formats.html for more details.
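For a quick look at streamed changes without setting up a subscriber, the test_decoding plugin can be used: in PostgreSQL 14 it accepts a stream-changes option that makes the SQL decoding functions emit the stream block boundaries. This is only a sketch; the slot name s1 is illustrative, and a transaction larger than logical_decoding_work_mem is needed before any streamed blocks appear:

```sql
-- Sketch: observing streamed changes with test_decoding (PostgreSQL 14+).
-- The slot name 's1' is illustrative.
SELECT pg_create_logical_replication_slot('s1', 'test_decoding');

-- 'stream-changes' asks the plugin to emit opening/closing markers for
-- each streamed block of a large in-progress transaction.
SELECT data FROM pg_logical_slot_get_changes('s1', NULL, NULL, 'stream-changes', '1');

SELECT pg_drop_replication_slot('s1');
```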

Corresponding changes were also required in the output plugin interface; this, too, is an improvement in PG 14. Please refer to commit 45fdc9738b and the PostgreSQL docs for more details.

Streaming is considered when logical_decoding_work_mem is exceeded. This doesn't mean the buffer is never spilled to disk: spilling remains the fallback when streaming is not possible, which happens if the information currently available is not sufficient to decode the transaction.

The commit message of 7259736a6e5b7c7588fff9578370736a6648acbb summarizes the big improvement:

Instead of serializing the transaction to disk after reaching the logical_decoding_work_mem limit in memory, we consume the changes we have in memory and invoke stream API methods added by commit 45fdc9738b. However, sometimes if we have incomplete toast or speculative insert we spill to the disk because we can’t generate the complete tuple and stream. And, as soon as we get the complete tuple we stream the transaction including the serialized changes.

We can do this incremental processing thanks to having assignments (associating subxact with toplevel xacts) in WAL right away, and thanks to logging the invalidation messages at each command end. These features are added by commits 0bead9af48 and c55040ccd0 respectively.

Now that we can stream in-progress transactions, the concurrent aborts may cause failures when the output plugin consults catalogs (both system and user-defined). We handle such failures by returning ERRCODE_TRANSACTION_ROLLBACK sqlerrcode from system table scan APIs to the backend or WALSender decoding a specific uncommitted transaction. The decoding logic on the receipt of such a sqlerrcode aborts the decoding of the current transaction and continue with the decoding of other transactions.

How to Setup

The necessary features are available only from PostgreSQL 14, and the client needs to initiate a replication connection with "streaming" on. To facilitate this, CREATE SUBSCRIPTION takes a new parameter, "streaming", which is off by default. The following is an example:

CREATE SUBSCRIPTION sub1 CONNECTION 'host=pg0 port=5432 dbname=postgres user=postgres password=xxxx' PUBLICATION tap_pub WITH (streaming = on);

Please make a note of the new parameter streaming = on, which specifies whether streaming of in-progress transactions should be enabled for this subscription.

Alternatively, an existing subscription can be modified to enable streaming.
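For example (sub1 is the subscription created above):

```sql
ALTER SUBSCRIPTION sub1 SET (streaming = on);
```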


Monitoring Improvements

There are two major improvements in terms of monitoring.

Monitoring the Initial Data Copy

PostgreSQL 14 allows users to monitor the progress of the COPY command using a new monitoring view, pg_stat_progress_copy. This is a great value addition when someone is setting up logical replication. Please refer to the documentation for more details.

The following is a sample output of:

select * from pg_stat_progress_copy ;

from the PUBLISHER side, using psql's \watch to repeat it every second:


Wed 23 Feb 2022 07:01:46 AM UTC (every 1s)

 pid  | datid | datname  | relid | command | type | bytes_processed | bytes_total | tuples_processed | tuples_excluded 
 2034 | 16401 | postgres | 16390 | COPY TO | PIPE |       932960052 |           0 |          9540522 |               0
(1 row)

                                      Wed 23 Feb 2022 07:01:47 AM UTC (every 1s)

 pid  | datid | datname  | relid | command | type | bytes_processed | bytes_total | tuples_processed | tuples_excluded 
 2034 | 16401 | postgres | 16390 | COPY TO | PIPE |       976060287 |           0 |          9979509 |               0
(1 row)

Since we know how many tuples are in the table, we have no difficulty in understanding how far it has progressed.

Similar monitoring is possible from the SUBSCRIBER side also:

Wed 23 Feb 2022 07:01:46 AM UTC (every 1s)

 pid  | datid | datname  | relid |  command  |   type   | bytes_processed | bytes_total | tuples_processed | tuples_excluded 
 1204 | 14486 | postgres | 16385 | COPY FROM | CALLBACK |       912168274 |           0 |          9328360 |               0
(1 row)

                                         Wed 23 Feb 2022 07:01:47 AM UTC (every 1s)

 pid  | datid | datname  | relid |  command  |   type   | bytes_processed | bytes_total | tuples_processed | tuples_excluded 
 1204 | 14486 | postgres | 16385 | COPY FROM | CALLBACK |       948074690 |           0 |          9694752 |               0
(1 row)

Monitoring the Logical Replication

Monitoring logical replication is possible through a new view available from PostgreSQL 14: pg_stat_replication_slots, on the PUBLISHER side. (The name is similar to pg_replication_slots.) This view is a great improvement.

This is of great use even if we are not using the new streaming feature, because there is then a higher chance of spill files being generated.

postgres=# select * from pg_stat_replication_slots ;
 slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes | stats_reset 
 sub       |          1 |          34 |  2250000000 |           0 |            0 |            0 |       2701 |     1766040 | 
(1 row)

As we can see in the above case, there was a single bulky transaction that caused a lot of spill files.

Statistics related to a particular slot can be reset using the function pg_stat_reset_replication_slot():

postgres=# select pg_stat_reset_replication_slot('sub');
(1 row)

postgres=# select * from pg_stat_replication_slots ;
 slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes |          stats_reset          
 sub       |          0 |           0 |           0 |           0 |            0 |            0 |          0 |           0 | 2022-02-23 15:39:08.472519+00
(1 row)

With the streaming enabled, we can get details of streaming of ongoing transactions:

Wed 23 Feb 2022 03:58:53 PM UTC (every 2s)

 slot_name | spill_txns | spill_count | spill_bytes | stream_txns | stream_count | stream_bytes | total_txns | total_bytes |         stats_reset         
 sub       |          1 |           9 |   603980550 |           0 |           29 |   1914455250 |        242 |  1914488162 | 2022-02-23 15:55:46.8994+00
(1 row)

It is recommended to adjust logical_decoding_work_mem (default is 64MB) to set the maximum amount of memory each walsender process can use. With this, we can avoid many disk spills while also avoiding excessive memory usage.

For example:

postgres=# ALTER SYSTEM SET logical_decoding_work_mem = '512MB';
postgres=# select pg_reload_conf();


In this blog post, I want to encourage PostgreSQL users who abandoned logical replication in the past due to its shortcomings to reconsider it, as PostgreSQL 13 and 14, along with Patroni, solve most of the difficulties. Lengthy, bulky transactions are known to cause severe problems for logical replication. The effect was very serious in previous versions, but with the new improvements it is mitigated to a good extent and is expected to reduce the load on the publisher side considerably.

However, this doesn't mean that it is perfect. The community and developers are aware of many more areas for improvement, especially on the subscriber side. We should expect such changes in upcoming versions.


PostgreSQL 14 and Recent SCRAM Authentication Changes – Should I Migrate to SCRAM?

PostgreSQL Migrate to SCRAM

Recently, a few PostgreSQL users reported that they got connection failures after switching to PostgreSQL 14.

"Why do I get the error FATAL: password authentication failed for user in the new server?" has become one of the most intriguing questions.

At least in one case, it was a bit of a surprise that the application message was as follows:

FATAL: Connection to database failed: connection to server at “localhost” (::1), port 5432 failed: fe_sendauth: no password supplied

The reason for these errors is that the default password encryption method changed to SCRAM authentication in new versions of PostgreSQL. Even though the last error appears to have nothing directly to do with SCRAM, it was in fact caused by a post-installation script that failed because it was looking for "md5".

SCRAM authentication is not new in PostgreSQL. It has been there from PostgreSQL 10 onwards, but it never affected DBA life in general because it was never the default; it was an opt-in feature enabled by explicitly changing the default settings. Those who opt in generally understand what they are doing, and it has never been known to cause problems. The PostgreSQL community was reluctant to make it the primary method for years because many client/application libraries were not ready for SCRAM authentication.

But that changed in PostgreSQL 14. With PostgreSQL 9.6 going out of support, the landscape is changing: we now expect all old client libraries to be upgraded, and SCRAM is becoming the primary password authentication method. Those who are completely unaware are going to be greeted with a surprise one day or another. The purpose of this post is to create quick awareness for those who are not yet aware, and to address some commonly asked questions.

What is SCRAM Authentication?

In simple words, the database client and the server prove to each other that they know the password without exchanging the password or the password hash. Yes, this is possible, by doing salted challenges and responses: SCRAM-SHA-256, as specified by RFC 7677. This way of storing, communicating, and verifying passwords makes it very hard to break a password.

This method is more resistant to:

  • Dictionary attacks
  • Replay attacks
  • Stolen hashes

Overall, it becomes very hard to break password-based authentication.
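We can see the difference in how the verifier is stored by peeking at pg_authid (superuser access assumed; the role name app_user is illustrative):

```sql
-- With scram-sha-256, the catalog stores a salted, iterated verifier,
-- not a reusable hash of the password.
SET password_encryption = 'scram-sha-256';
CREATE USER app_user PASSWORD 'secret';

-- rolpassword now looks like: SCRAM-SHA-256$4096:<salt>$<StoredKey>:<ServerKey>
SELECT rolname, left(rolpassword, 14) AS verifier_prefix
FROM pg_authid
WHERE rolname = 'app_user';
```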

What Has Changed Over Time?

Channel Binding

Authentication is only one part of secure communication. After authentication, a rogue server in the middle can potentially take over and fool the client connection. PostgreSQL 11 introduced SCRAM-SHA-256-PLUS, which supports channel binding. This is to make sure that there is no rogue server acting as the real server or performing a man-in-the-middle attack.

From PostgreSQL 13 onwards, a client can request and even insist on channel binding.

For example:

psql -U postgres -h c76pri channel_binding=prefer
psql -U postgres -h c76pri channel_binding=require

Channel binding works over SSL/TLS, so SSL/TLS configuration is mandatory to make channel binding work.

Setting Password Encryption

Before PostgreSQL 10, md5 was the only available option for password encryption, so the password_encryption setting merely indicated that "password encryption is required" and defaulted to md5.

-- Up to PG 13
postgres=# set password_encryption TO ON;

Due to the same reason, the above statement was effectively the same as:

postgres=# set password_encryption TO MD5;

We could even use "true", "1", or "yes" instead of "on" as an equivalent value.

But now we have multiple encryption methods, and "ON" doesn't really convey what we want. So, from PostgreSQL 14 onwards, the system expects us to specify the encryption method explicitly:

postgres=# set password_encryption TO 'scram-sha-256';

postgres=# set password_encryption TO 'md5';

Any attempt to use "on", "true", or "yes" will be rejected with an error:

-- From PG 14
postgres=# set password_encryption TO 'on';
ERROR:  invalid value for parameter "password_encryption": "on"
HINT:  Available values: md5, scram-sha-256.

So please check your scripts and make sure that they don't use the old way of "enabling" encryption.

Some Frequently Asked Questions

  1. Does my logical backup and restore get affected?
    Logical backup and restore of PostgreSQL globals (pg_dumpall) does not affect SCRAM authentication; the same password will work after the restore. In fact, it is interesting to recall that SCRAM authentication is more resilient to changes. For example, if we rename a user, the old MD5 password no longer works, because PostgreSQL includes the username when generating the MD5 hash.

    postgres=# ALTER USER jobin1 RENAME TO jobin;
    NOTICE:  MD5 password cleared because of role rename

    As the NOTICE indicates, the password hash in pg_authid is cleared because the old one is no longer valid. But this is not the case with SCRAM authentication: we can rename users without affecting the password.

    postgres=# ALTER USER jobin RENAME TO jobin1;
  2. The existing/old method of encryption (md5) was a big vulnerability. Was there a big risk?
    This worry mainly comes from the name "MD5", which is far too weak for modern hardware. However, the way PostgreSQL uses md5 is different: it is not just a hash of the password, it takes the username into account as well. Additionally, what goes over the wire is a hash prepared with a random salt provided by the server, so what is communicated differs from the stored password hash. It is therefore not as vulnerable as the name suggests, but it is still prone to dictionary attacks and to problems with leaked username/password hashes.
  3. Does the new SCRAM authentication add complexity? Is my connection request going to take more time?
    The SCRAM wire protocol is very efficient and not known to cause any degradation in connection time. Moreover, compared to other overheads of server-side connection management, the overhead created by SCRAM is negligible.
  4. Is it mandatory to use SCRAM authentication from PostgreSQL 14 and force all my user accounts to switch to it?
    Definitely not; only the defaults have changed. The old md5 method is still valid and works fine, and if access to the PostgreSQL environment is restricted by firewall/hba rules, the risk of using md5 is already low.
  5. Why do I get the "FATAL: password authentication failed for user" error when I switch to PostgreSQL 14?
    The most probable reason is the pg_hba.conf entries. If we specify "md5" as the authentication method, PostgreSQL will allow SCRAM authentication as well, but the reverse won't work. When you created the PostgreSQL 14 environment, it most probably got "scram-sha-256" as the authentication method; in some PostgreSQL packages, the installation script does this for you automatically. If authentication works from the PostgreSQL client tools but not from the application, please check the driver version and the scope for an upgrade.
  6. Why do I get other types of authentication errors?
    The most probable reasons are post-installation scripts. It is a regular practice in many organizations to use DevOps tools (Ansible/Chef) or even shell scripts for post-installation customization. Many of those do a range of things involving steps like set password_encryption TO ON; or modifications to pg_hba.conf using sed, which are expected to fail if they try to modify an entry that is no longer there.
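As FAQ items 1 and 2 describe, the stored MD5 verifier is literally the string "md5" followed by md5(password || username), which is why renaming a role invalidates it. A minimal sketch of that construction (the helper name pg_md5_verifier is illustrative; md5sum is assumed to be available):

```shell
# Builds a PostgreSQL-style MD5 verifier: "md5" + md5(password || username).
# Because the username is part of the hashed input, renaming the role
# changes the input, and the stored verifier no longer matches.
pg_md5_verifier() {
  printf 'md5%s' "$(printf '%s%s' "$1" "$2" | md5sum | awk '{print $1}')"
}

pg_md5_verifier secret jobin    # 35-character string starting with "md5"
pg_md5_verifier secret jobin1   # a different verifier for the renamed role
```

Note that this is only the storage format for md5-type passwords; what travels over the wire during authentication is a further hash salted by the server.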

Why Should I Care and What To Do

Anything from automation/deployment scripts, tools, and application connections to connection poolers could potentially break. One of the major arguments for delaying this change until PostgreSQL 14 was that the oldest supported version (9.6) is going out of support soon. So this is the right time to inspect your environments for old PostgreSQL libraries (9.6 or older) and plan for an upgrade, as old PostgreSQL client libraries cannot handle SCRAM negotiation.

In summary, having a good plan to migrate will help, even though it is not urgent.

    1. Inspect the environments and application drivers to see whether any of them are still using old versions of PostgreSQL client libraries and upgrade them wherever required.
      Please refer to: https://wiki.postgresql.org/wiki/List_of_drivers
      Encourage / Drive the upgrade of client libraries with a timeline
    2. If the existing environment is using md5, encourage users to switch to SCRAM authentication.
      Remember that an authentication method of "md5" in pg_hba.conf continues to work for both SCRAM and MD5 authentication in PostgreSQL 14.
    3. Take every opportunity to test and migrate automation, connection poolers, and other infrastructure to SCRAM authentication.
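As a starting point for item 2 above, a query like the following (a sketch; superuser access assumed) lists the roles that still carry MD5 verifiers:

```sql
-- Roles whose stored verifier is still MD5; resetting their passwords
-- while password_encryption = 'scram-sha-256' converts them to SCRAM.
SELECT rolname FROM pg_authid WHERE rolpassword LIKE 'md5%';
```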

By changing the default authentication method, the PostgreSQL community is showing a clear direction for the future.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!


How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster

PostgreSQL Patroni Logical Replication Slot Failover

Failover of the logical replication slot has always been the pain point of using logical replication in PostgreSQL. This missing feature undermined the use of logical replication and acted as one of its biggest deterrents. The stakes and impact were so high that many organizations had to discard their plans around logical replication, and it affected many plans for migrations to PostgreSQL. It was painful to see that many had to opt for proprietary/vendor-specific solutions instead.

At Percona, we have written about this in the past: Missing Piece: Failover of the Logical Replication Slot.  In that post, we discussed one of the possible approaches to solve this problem, but there was no reliable mechanism to copy the slot information to a Physical standby and maintain it.

The problem, in a nutshell, is that the replication slot is always maintained only on the primary node. If there is a switchover/failover promoting one of the standbys, the new primary has no idea about the replication slot maintained by the previous primary. This breaks logical replication from downstream systems, or, if a new slot is created, it becomes unsafe to use.

The good news is that the Patroni developers and maintainers addressed this problem from version 2.1.0 onwards and provided a working solution without any invasive methods/extensions. For me, this is work that deserves a big round of applause from the Patroni community, and the intention of this blog post is to make sure a bigger crowd is aware of it.

How to Set it Up

A ready-to-use Patroni package is available from the Percona repository. But you are free to use Patroni from any source.

Basic Configuration

In case you are excited about this and want to try it, the following steps might be helpful.

The entire discussion is about logical replication, so the minimum requirement is wal_level set to "logical". If the existing Patroni configuration has wal_level set to "replica" and you want to use this feature, just edit the Patroni configuration.

Patroni configuration
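In patronictl edit-config terms, the change looks roughly like this (a sketch of the dynamic configuration section):

```yaml
postgresql:
  parameters:
    wal_level: logical
```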

However, this change requires the PostgreSQL restart:

The "Pending restart" flag, shown with a * mark, indicates the same.

You may use Patroni's "switchover" feature to restart a node so the change takes effect, because the demoted node goes for a restart.

If there are any remaining nodes, they can be restarted later.

Creating Logical Slots

Now we can add a permanent logical replication slot to PostgreSQL which will be maintained by Patroni.

Edit the patroni configuration:

$ patronictl -c /etc/patroni/patroni.yml edit-config

A slot specification can be added as follows (the slot name logicreplia matches the subscription examples later in this post):

slots:
  logicreplia:
    database: postgres
    plugin: pgoutput
    type: logical

The “slots:” section defines permanent replication slots. These slots will be preserved during switchover/failover. “pgoutput” is the decoding plugin for PostgreSQL logical replication.

Once the change is applied, the logical replication slot is created on the primary node. This can be verified by querying:

select * from pg_replication_slots;

The following is a sample output:

patroni output

Now here is the first level of magic! The same replication slot will be created on the standbys also. Yes, Patroni does it: Patroni internally copies the replication slot information from the primary to all eligible standby nodes!

We can run the same query against pg_replication_slots on the standby and see similar information.

The following is an example showing the same replication slot reflecting on the standby side:

replication slot

This slot can be used by a subscription by explicitly specifying the slot name while creating the subscription:

CREATE SUBSCRIPTION sub2 CONNECTION '<connection_string>' PUBLICATION <publication_name> WITH (copy_data = true, create_slot = false, enabled = true, slot_name = logicreplia);

Alternatively, an existing subscription can be modified to use the new slot, which is what I generally prefer to do.

For example:

ALTER SUBSCRIPTION name SET (slot_name=logicreplia);

Corresponding PostgreSQL log entries can confirm the slot name change:

2021-12-27 15:56:58.294 UTC [20319] LOG:  logical replication apply worker for subscription "sub2" will restart because the replication slot name was changed
2021-12-27 15:56:58.304 UTC [20353] LOG:  logical replication apply worker for subscription "sub2" has started

From the publisher side, we can confirm the slot usage by checking the active_pid and the advancing LSN for the slot.
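A sketch of such a check on the publisher (the slot name logicreplia comes from the earlier example):

```sql
-- active_pid shows the walsender using the slot;
-- confirmed_flush_lsn should keep advancing as replication progresses.
SELECT slot_name, active, active_pid, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_name = 'logicreplia';
```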

The second level of surprise! The replication slot information on all the standby nodes of the Patroni cluster is also advanced as the logical replication progresses on the primary side.

At a higher level, this is exactly what this feature is doing:

  1. Automatically creates/copies the replication slot information from the primary node of the Patroni cluster to all eligible standby nodes.
  2. Automatically advances the LSN on the slots of standby nodes as the LSN advances on the corresponding slot on the primary.

After a Switchover/Failover

In the event of a switchover or failover, we do not lose any slot information, as it is already maintained on the standby nodes.

After the switchover, the topology looks like this:

Now, any downstream logical replica can be repointed to the new primary.

postgres=# ALTER SUBSCRIPTION sub2 CONNECTION 'host= port=5432 dbname=postgres user=postgres password=vagrant';                                                                                                                                                  

Replication then continues, and the pg_replication_slots information can confirm this.

Summary + Key Points

The logical replication slot is conceptually possible only on the primary instance, because that is where logical decoding happens. With this improvement, Patroni makes sure that the slot information is available on the standbys also, so a standby is ready to take over connections from subscribers.

  • This solution requires PostgreSQL 11 or above, because it uses the pg_replication_slot_advance() function, available from PostgreSQL 11 onwards, to advance the slot.
  • The downstream connection can use HAProxy so that the connection will be automatically routed to the primary (not covered in this post). No modification to PostgreSQL code or Creation of any extension is required.
  • The copying of the slot happens over the PostgreSQL protocol (libpq) rather than any OS-specific tools/methods. Patroni uses rewind or superuser credentials, and it reads the slot information with the pg_read_binary_file() function. Source code Reference.
  • Once the logical slot is created on the replica side, Patroni uses pg_replication_slot_advance() to move the slot forward.
  • The permanent slot information is added to the DCS and continuously maintained by the primary instance of the Patroni cluster. A new DCS key named "status" is introduced and supported across all DCS options (ZooKeeper, etcd, Consul, etc.).
  • hot_standby_feedback must be enabled on all standby nodes where the logical replication slot needs to be maintained.
  • Patroni parameter postgresql.use_slots must be enabled to make sure that every standby node uses a slot on the primary node.

Why Linux HugePages are Super Important for Database Servers: A Case with PostgreSQL

Linux HugePages PostgreSQL

Often users come to us with incidents of database crashes due to the OOM killer. The Out Of Memory killer terminates PostgreSQL processes and remains the top reason for most of the PostgreSQL database crashes reported to us. There could be multiple reasons why a host machine runs out of memory; the most common are:

  1. Poorly tuned memory on the host machine.
  2. A high value of work_mem specified globally (at the instance level). Users often underestimate the multiplying effect of such blanket decisions.
  3. A high number of connections. Users ignore the fact that even a non-active connection can hold a good amount of memory.
  4. Other programs co-hosted on the same machine consuming resources.

Even though we regularly assist with tuning both host machines and databases, we do not always take the time to explain how and why HugePages are important and justify it with data. Thanks to repeated probing by my friend and colleague Fernando, I couldn't resist doing so this time.

The Problem

Let me explain the problem with a testable and repeatable case. This might be helpful if anyone wants to test the case in their own way.

Test Environment

The test machine is equipped with 40 CPU cores (80 vCPUs) and 192 GB of installed memory. I don't want to overload this server with too many connections, so only 80 connections are used for the test. Yes, just 80 connections, which we should expect in any environment, and very realistic. Transparent HugePages (THP) is disabled. I don't want to divert the topic by explaining why THP is not a good idea for a database server, but I promise to cover that in another blog post.

In order to have relatively persistent connections, just like the ones from application-side poolers (or even external connection poolers), pgBouncer is used to keep all 80 connections persistent throughout the tests. The following is the pgBouncer configuration used:

[databases]
sbtest2 = host=localhost port=5432 dbname=sbtest2

[pgbouncer]
listen_port = 6432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
logfile = /tmp/pgbouncer.log
pidfile = /tmp/pgbouncer.pid
admin_users = postgres

As we can see, the pooler's server connection lifetime parameter is kept at a high value so that connections from the pooler to PostgreSQL are not destroyed. The following PostgreSQL parameter modifications are incorporated to mimic some common customer environment settings:

logging_collector = 'on'
max_connections = '1000'
work_mem = '32MB'
checkpoint_timeout = '30min'
checkpoint_completion_target = '0.92'
shared_buffers = '138GB'
shared_preload_libraries = 'pg_stat_statements'

The test load is created using sysbench:

sysbench /usr/share/sysbench/oltp_point_select.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-port=6432 --pgsql-db=sbtest2 --pgsql-user=postgres --pgsql-password=vagrant --threads=80 --report-interval=1 --tables=100 --table-size=37000000 prepare

and then

sysbench /usr/share/sysbench/oltp_point_select.lua --db-driver=pgsql --pgsql-host=localhost --pgsql-port=6432 --pgsql-db=sbtest2 --pgsql-user=postgres --pgsql-password=vagrant --threads=80 --report-interval=1 --time=86400  --tables=80 --table-size=37000000  run

The first prepare stage puts a write load on the server and the second one a read-only load.

I am not attempting to explain the theory and concepts behind HugePages here, but to concentrate on the impact analysis. Please refer to the LWN article Five-Level Page Tables and Andres Freund’s blog post Measuring the Memory Overhead of a Postgres Connection to understand some of the concepts.

Test Observations

During the test, memory consumption was checked using the Linux free utility command. When making use of the regular pool of memory pages, the consumption started at a really low value, but it kept increasing steadily (please see the screenshot below). The “Available” memory depleted at a fast rate.

Available Memory

Towards the end, swap activity also started, as captured in the output below:

swap activity

Information from /proc/meminfo reveals that the total page table size has grown from the initial 45 MB to more than 25 GB.

This is not just memory wastage; it is a huge overhead impacting the overall execution of the programs and the operating system. This size is the total of the lower page table entries of the 80+ PostgreSQL processes.

The same can be verified by checking each PostgreSQL process. The following is a sample:

So the total page table size (25 GB) should be approximately this value * 80 (connections). Since this synthetic benchmark sends an almost identical workload through all connections, all individual processes have values very close to what was captured above.
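The ~25 GB figure is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes x86_64 defaults (4 KB pages, 8 bytes per page table entry) and that every backend eventually touches most of the 138 GB of shared buffers; it is an estimate, not a measurement:

```python
# Rough estimate of per-process and total lower page table size.
# Assumptions: 4 KB pages, 8 bytes per PTE (x86_64), 80 backends each
# mapping the full 138 GB shared_buffers.
SHARED_BUFFERS = 138 * 2**30   # bytes
PAGE_SIZE = 4096               # regular page size, bytes
PTE_SIZE = 8                   # bytes per page table entry
CONNECTIONS = 80

entries = SHARED_BUFFERS // PAGE_SIZE          # PTEs needed per process
per_process_mib = entries * PTE_SIZE / 2**20   # page table size per backend
total_gib = entries * PTE_SIZE * CONNECTIONS / 2**30

print(round(per_process_mib))  # 276 (MiB per backend)
print(round(total_gib, 1))     # 21.6 (GiB across 80 backends)
```

That lands in the same ballpark as the 25+ GB observed; the measured value also includes mappings other than shared buffers.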

The following shell line can be used for checking the Pss (Proportional set size). Since PostgreSQL uses Linux shared memory, focusing on Rss won’t be meaningful.

for PID in $(pgrep "postgres|postmaster") ; do awk '/Pss/ {PSS+=$2} END{getline cmd < "/proc/'$PID'/cmdline"; sub("\0", " ", cmd);printf "%.0f --> %s (%s)\n", PSS, cmd, '$PID'}' /proc/$PID/smaps ; done|sort -n

Without Pss information, there is no easy method to understand the memory responsibility per process.
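For scripting the same check in Python, here is a minimal sketch (the helper name is ours, not a standard API) that sums the Pss lines of an smaps file; sample text is inlined so the snippet is self-contained:

```python
def total_pss_kb(smaps_text):
    """Sum all Pss entries (in kB) from /proc/<pid>/smaps content."""
    return sum(int(line.split()[1])
               for line in smaps_text.splitlines()
               if line.startswith("Pss:"))

# Sample smaps fragment (real files have one block per memory mapping):
sample = """\
Rss:                 340 kB
Pss:                 120 kB
Rss:                 200 kB
Pss:                  80 kB
"""
print(total_pss_kb(sample))  # 200
```

For a live process, read the text from open(f"/proc/{pid}/smaps") instead of the sample string.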

In a typical database system with a considerable DML load, PostgreSQL background processes such as the Checkpointer, Background Writer, or Autovacuum workers will touch more pages in the shared memory, so the corresponding Pss will be higher for those processes.

Postgres Process

This should explain why the Checkpointer, Background Writer, or even the Postmaster often becomes the usual victim/target of the OOM Killer: as we can see above, they carry the biggest responsibility for the shared memory.

After several hours of execution, the individual sessions touched more shared memory pages. As a consequence, the per-process Pss values were rearranged: the Checkpointer is responsible for less, as other sessions shared the responsibility.

However, checkpointer retains the highest share.

Even though it is not important for this test, it is worth mentioning that this kind of load pattern is specific to synthetic benchmarking, because every session does pretty much the same job. That is not a good approximation of a typical application load, where we usually see the Checkpointer and Background Writer carry the major responsibility.

The Solution: Enable HugePages

The solution for such bloated page tables and associated problems is to make use of HugePages instead. We can figure out how much memory should be allocated to HugePages by checking the VmPeak of the postmaster process. For example, if 4357 is the PID of the postmaster:

grep ^VmPeak /proc/4357/status

This gives the amount of memory required in KB:

VmPeak: 148392404 kB

This much memory needs to fit into huge pages. Converting this value into 2 MB pages:

postgres=# select 148392404/1024/2;
 ?column? 
----------
    72457
(1 row)

Specify this value in /etc/sysctl.conf, for example:

vm.nr_hugepages = 72457
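The whole conversion can be scripted. The sketch below (the helper name is ours, not a PostgreSQL or kernel API) turns a VmPeak value in kB into a vm.nr_hugepages setting for 2 MB pages, mirroring the SQL arithmetic above:

```python
def hugepages_needed(vmpeak_kb, hugepage_kb=2048):
    """Number of 2 MB huge pages needed to cover VmPeak.

    Integer division, same as select <vmpeak>/1024/2 in the SQL above.
    """
    return vmpeak_kb // hugepage_kb

n = hugepages_needed(148392404)
print(n)                         # 72457
print(f"vm.nr_hugepages = {n}")  # the line to place in the sysctl config
```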

Now shut down the PostgreSQL instance and execute:

sysctl -p

We can verify whether the requested number of huge pages was created:

grep ^Huge /proc/meminfo
HugePages_Total:   72457
HugePages_Free:    72457
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        148391936 kB

If we start up PostgreSQL at this stage, we can see that HugePages_Rsvd pages get reserved:

$ grep ^Huge /proc/meminfo
HugePages_Total:   72457
HugePages_Free:    70919
HugePages_Rsvd:    70833
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:        148391936 kB

If everything is fine, I would prefer to make sure that PostgreSQL always uses HugePages, because I would rather have PostgreSQL fail to start than face problems/crashes later:

postgres=# ALTER SYSTEM SET huge_pages = on;

The above change needs a restart of the PostgreSQL instance.

Tests with HugePages “ON”

The HugePages are created in advance, even before the PostgreSQL startup; PostgreSQL just allocates them and uses them. So there won’t be any noticeable change in the free output before and after the startup. PostgreSQL allocates its shared memory into these HugePages if they are already available. PostgreSQL’s shared_buffers is the biggest occupant of this shared memory.

HugePages PostgreSQL

The first free -h output in the above screenshot was generated before the PostgreSQL startup and the second one after it. As we can see, there is no noticeable change.

I ran the same test for several hours and there wasn’t any change; the only noticeable change even after many hours of run is the shift of “free” memory to the filesystem cache, which is expected and what we want to achieve. The total “available” memory remained pretty much constant, as we can see in the following screenshot.

Available Memory PostgreSQL

The total page tables size remained pretty much the same:

As we can see, the difference is huge: just 61 MB with HugePages instead of the 25+ GB previously. The Pss per session also reduced drastically:

Pss per session

The biggest advantage I could observe is that the Checkpointer and Background Writer are no longer accountable for several GBs of RAM.

Instead, they are accountable only for a few MBs of consumption. Obviously, they won’t be candidate victims for the OOM Killer anymore.


In this blog post, we discussed how Linux HugePages can potentially save the database server from the OOM Killer and associated crashes. We could see two improvements:

  1. Overall memory consumption was reduced by a big margin. Without HugePages, the server almost ran out of memory (available memory completely depleted, and swapping activity started). However, once we switched to HugePages, 38-39 GB remained as available memory/Linux filesystem cache. This is a huge saving.
  2. With HugePages enabled, the PostgreSQL background processes are not accountable for a large share of the shared memory, so they won’t easily become candidate victims/targets for the OOM Killer.

These improvements can potentially save the system if it is on the brink of OOM condition, but I don’t want to make any claim that this will protect the database from all OOM conditions forever.

HugePages (hugetlbfs) originally landed in the Linux kernel in 2002 to address the requirements of database systems that need to work with a large amount of memory. I could see that the design goals are still valid.

There are other additional indirect benefits of using HugePages:

  1. HugePages never get swapped out. When PostgreSQL shared buffers are in HugePages, they can produce more consistent and predictable performance. I shall discuss that in another article.
  2. Linux uses a multi-level page lookup method. HugePages are implemented using direct pointers to pages from the middle layer (a 2 MB huge page is found directly at the PMD level, with no intervening PTE page). The address translation becomes considerably simpler. Since this is a high-frequency operation in a database server with a large amount of memory, the gains are multiplied.

Note: The HugePages discussed in this blog post are about fixed size (2MB) huge pages.

Additionally, as a side note, I want to mention that there have been a lot of improvements in Transparent HugePages (THP) over the years, which allow applications to use huge pages without any code modification. THP is often considered a replacement for regular HugePages (hugetlbfs) for generic workloads. However, the usage of THP is discouraged on database systems, as it can lead to memory fragmentation and increased delays. I want to cover that topic in another post and just want to mention that these are not PostgreSQL-specific problems; they affect every database system. For example:

  1. Oracle recommends disabling THP. Reference link
  2. MongoDB recommends disabling THP. Reference link
  3. “THP is known to cause performance degradation with PostgreSQL for some users on some Linux versions.” Reference link.

As more companies look at migrating away from Oracle or implementing new databases alongside their applications, PostgreSQL is often the best option for those who want to run on open source databases.

Read Our New White Paper:

Why Customers Choose Percona for PostgreSQL


Impact of Network and Cursor on Query Performance of PostgreSQL

Query Performance of PostgreSQL

Many times, we see PostgreSQL users getting confused about the query/statement duration reported in PostgreSQL logs. Other PostgreSQL tools like pgBadger present the same data based on the log file, which further increases the confusion. Knowing the full impact of network-related overhead and cursors is important not only to alleviate the confusion but also to get the best performance.

One might ask, “Why discuss network overhead and cursors specifically?” Well, they are the hidden costs that kick in during and after the query execution. Once the query execution starts and there is some data to be given to the client, these are the factors that mainly affect the performance. The important point to keep in mind is that since all of this happens outside the query execution, the corresponding information will not be available through EXPLAIN (ANALYZE).

Impact of Network

For the demonstration, I am using a query based on pgBench tables.

select a.bid, b.cnt from pgbench_accounts a,
   (select bid,count(*) cnt from pgbench_accounts group by bid) b
where b.bid > a.bid;

There is no significance to this query; a random query was selected that takes some time to execute in the database.

In order to capture the EXPLAIN ANALYZE output, auto_explain is used. Settings are made to capture all statements that take more than 250 ms:

ALTER SYSTEM SET auto_explain.log_min_duration = 250;

A few other settings to capture additional details with EXPLAIN ANALYZE are:

ALTER SYSTEM SET auto_explain.log_analyze = on;
ALTER SYSTEM SET auto_explain.log_buffers=on;
ALTER SYSTEM SET auto_explain.log_timing=on;
ALTER SYSTEM SET auto_explain.log_verbose=on;
ALTER SYSTEM SET auto_explain.log_triggers=on;
ALTER SYSTEM SET auto_explain.log_wal=on;

To illustrate the difference, the same query will be executed from:

1. The database host server

2. An application host server connected over the network

Queries Returning a Large Number of Rows

Case 1. Executing on the Database Host Itself

Following are the few lines from PostgreSQL logs generated by auto_explain:

2021-08-02 03:27:56.347 UTC [25537] LOG:  duration: 1591.784 ms  plan:
        Query Text: select a.bid, b.cnt from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid;
        Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=119.739..1069.924 rows=1000000 loops=1)
          Output: a.bid, b.cnt
          Join Filter: (b.bid > a.bid)
          Rows Removed by Join Filter: 1500000

As we can see, the outer nested loop of the query completed in 1069.924 ms, returning 1 million records, but the total duration of the query is reported as 1591.784 ms. Where does the difference come from?

A plain EXPLAIN ANALYZE shows that the planning time is sub-millisecond for this simple query, where data comes from a single table without any index. So planning time shouldn’t be the reason.

Case 2. Executing from a Remote Application Host

Again the information from the PostgreSQL log looks different:

2021-08-02 04:08:58.955 UTC [25617] LOG:  duration: 6568.659 ms  plan:
        Query Text: select a.bid, b.cnt from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid;
        Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=140.644..1069.153 rows=1000000 loops=1)
          Output: a.bid, b.cnt
          Join Filter: (b.bid > a.bid)
          Rows Removed by Join Filter: 1500000

As we can see, the statement duration jumped to 6568.659 ms, even though the actual execution of the query remained pretty much the same at 1069.153 ms. That’s a huge difference. What could be the reason?

Queries Returning Fewer Rows

The above-mentioned query can be slightly modified to return only the max values. The modified test query may look like this:

select max(a.bid), max(b.cnt) from pgbench_accounts a,
   (select bid,count(*) cnt from pgbench_accounts group by bid) b
where b.bid > a.bid ;

The query plan and time don’t change much, other than an additional aggregate node. Even if there were a change in the plan, it would not be relevant for this topic, because we are considering only the time difference between the completion of the outer node of the query execution and the duration reported by PostgreSQL.

Case 1: Executing on the Database Host Itself

2021-08-03 06:58:14.364 UTC [28129] LOG:  duration: 1011.143 ms  plan:
        Query Text: select max(a.bid), max(b.cnt) from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid ;
        Aggregate  (cost=67187.12..67187.13 rows=1 width=12) (actual time=1010.914..1011.109 rows=1 loops=1)
          Output: max(a.bid), max(b.cnt)
          Buffers: shared hit=12622 read=3786
          ->  Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=135.635..908.315 rows=1000000 loops=1)
                Output: a.bid, b.cnt
                Join Filter: (b.bid > a.bid)
                Rows Removed by Join Filter: 1500000
                Buffers: shared hit=12622 read=3786

As we can see, there is not much difference between the completion of the query (1011.109 ms) and the duration reported (1011.143 ms). The observations so far indicate that the extra time is consumed when a large number of rows is returned.

Case 2: Executing the Statement from the Remote Host

2021-08-03 06:55:37.221 UTC [28111] LOG:  duration: 1193.387 ms  plan:
        Query Text: select max(a.bid), max(b.cnt) from pgbench_accounts a,
           (select bid,count(*) cnt from pgbench_accounts group by bid) b
        where b.bid > a.bid ;
        Aggregate  (cost=67187.12..67187.13 rows=1 width=12) (actual time=1193.139..1193.340 rows=1 loops=1)
          Output: max(a.bid), max(b.cnt)
          Buffers: shared hit=11598 read=4810
          ->  Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=124.508..1067.409 rows=1000000 loops=1)
                Output: a.bid, b.cnt
                Join Filter: (b.bid > a.bid)
                Rows Removed by Join Filter: 1500000
                Buffers: shared hit=11598 read=4810

Again, there is not much difference: 1193.340 vs. 1193.387 ms. Overall, I feel safe assuming from the results that if the data transfer is minimal, an application server on a different host machine doesn’t make much difference; meanwhile, the impact is huge if a lot of result transfer is involved.
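Using the numbers measured above, we can attribute the time directly: the gap between the reported duration and the plan’s execution time is what was spent shipping rows to the client. A quick calculation (illustrative only; the figures come from this particular run):

```python
# Large result set, remote host: reported duration vs. plan execution time.
duration_ms = 6568.659   # duration reported in the PostgreSQL log
exec_ms = 1069.153       # actual time of the outermost plan node
rows = 1_000_000

overhead_ms = duration_ms - exec_ms
print(round(overhead_ms))                    # ~5500 ms spent outside plan execution
print(round(overhead_ms * 1000 / rows, 1))   # ~5.5 microseconds per row
```

That per-row cost is why the small-result queries above show virtually no gap.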

Analyzing the Wait Events

Luckily, newer versions of PostgreSQL provide us with an excellent way to monitor “wait event” information from the pg_stat_activity view for each session/connection.

At Percona Support, we use a script from the pg_gather snippet for gathering performance information, including the wait events, by collecting multiple samples. The script collects samples of wait events at 10 ms intervals, so there will be 2000 samples in 20 seconds. Apart from the UI, the gathered information can also be analyzed using backend queries.

The following is what I could see about PID: 25617 (the case of returning a large number of rows to remote host).

postgres=# select pid,wait_event,count(*) from pg_pid_wait where pid=25617 group by 1,2 order by 3 desc;
  pid  |  wait_event  | count 
-------+--------------+-------
 25617 | ClientWrite  |   286
 25617 |              |    75
 25617 | DataFileRead |     3
(3 rows)

The session is spending most of its time on the “ClientWrite” wait event. As per the PostgreSQL documentation:

ClientWrite: Waiting to write data to the client.

It is the time spent writing the data to the client. A NULL wait_event indicates CPU utilization.
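The sampling approach is easy to reason about: each 10 ms sample records what the session was waiting on (NULL meaning on-CPU), and the counts approximate time shares. A small sketch using the sample counts from above (an assumed layout for illustration, not pg_gather’s actual code):

```python
from collections import Counter

# 364 samples for PID 25617: the wait_event seen in each sample
# (None = no wait event, i.e., the backend was on CPU).
samples = ["ClientWrite"] * 286 + [None] * 75 + ["DataFileRead"] * 3

counts = Counter(samples)
total = sum(counts.values())
for event, n in counts.most_common():
    print(f"{event or '(CPU)':12s} {n:4d}  {100 * n / total:5.1f}%")
# ClientWrite dominates with ~78.6% of the samples.
```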

Impact of Cursors

Typically, after a query execution, the resulting data needs to be processed by the application. Cursors are used for holding the result of queries and processing them. The impact on query performance is mainly decided by where the cursor resides: on the PostgreSQL server side or on the client side. The location of the cursor is expected to matter when the query is issued from a separate application host, so I am testing only that case.

Client-Side Cursors

Generally, this is the case with most PostgreSQL clients and applications: the data is retrieved fully to the database client end and then processed one by one.

Here is a simple Python snippet (to mimic the application connection) for testing the same; only the relevant lines are copied:

import time

import psycopg2

conn = psycopg2.connect(connectionString)
cur = conn.cursor()
cur.itersize = 2000
cur.execute("select a.bid, b.cnt from pgbench_accounts a, (select bid,count(*) cnt from pgbench_accounts group by bid) b where b.bid > a.bid")
row = cur.fetchone()
while row is not None:
    time.sleep(0.001)  # 1 ms delay in each loop iteration
    row = cur.fetchone()


As we can see, itersize is specified, so only that many records are fetched at a time for processing, and there is a 1-millisecond delay in each loop iteration using time.sleep.

But none of this affects the server-side query performance, because the cursor is already cached on the client side. The query time and duration reported are similar to those of executing from a remote application host with a large number of rows. As expected, the impact of the network is clearly visible:

2021-08-03 17:39:17.119 UTC [29180] LOG:  duration: 5447.793 ms  plan:
        Query Text: select a.bid, b.cnt from pgbench_accounts a, (select bid,count(*) cnt from pgbench_accounts group by bid) b where b.bid > a.bid
        Nested Loop  (cost=12322.13..63020.46 rows=833333 width=12) (actual time=130.919..1240.727 rows=1000000 loops=1)
          Output: a.bid, b.cnt
          Join Filter: (b.bid > a.bid)
          Rows Removed by Join Filter: 1500000
          Buffers: shared hit=1647 read=14761
          ->  Seq Scan on public.pgbench_accounts a  (cost=0.00..13197.00 rows=500000 width=4) (actual time=0.086..183.503 rows=500000 loops=1)
                Output: a.aid, a.bid, a.abalance, a.filler
                Buffers: shared hit=864 read=7333

Server-Side Cursors

The way the cursor is created and used changes completely if we have a named cursor that stays on the server side, and the statement

cur = conn.cursor()

is modified to include a name like

cur = conn.cursor('curname')


As the psycopg2 (the Python connector for PostgreSQL) documentation says:

“Psycopg wraps the database server-side cursor in named cursors. A named cursor is created using the cursor() method specifying the name parameter”

Surprisingly, the PostgreSQL log does not give any more information about the query, even though auto_explain is configured. No duration information for the full query either; there is only a single line of info per fetch:

2021-08-03 18:02:45.184 UTC [29249] LOG:  duration: 224.501 ms  statement: FETCH FORWARD 1 FROM "curname"

PostgreSQL cursors support various FETCH options with custom size specifications. Please refer to the documentation of FETCH for more details. It is up to the language driver/connector to wrap this functionality into corresponding functions.

The Python driver for PostgreSQL – psycopg2 – wraps this functionality to run FETCH with a custom batch size, specified as follows:

rows = cur.fetchmany(20000)

which produces a PostgreSQL log entry like:

2021-08-05 05:13:30.728 UTC [32374] LOG:  duration: 262.931 ms  statement: FETCH FORWARD 20000 FROM "curname"

As we expected, the fetch size is increased.

But the important point to note here is: even though the application iterating over a server-side cursor took a very long time (running into several minutes), there is almost zero information about the query or the session in the PostgreSQL logs.

Ahh! This could be the weirdest thing one would encounter.

Wait Event Analysis

Again, wait event analysis comes in handy for understanding what’s happening. Out of the 2000 wait event samples collected by the pg_gather script, the wait events look like this:

  pid  |  wait_event  | count 
-------+--------------+-------
 30115 | ClientRead   |  1754
 30115 | (CPU)        |   245
 30115 | DataFileRead |     1

The time is spent on the “ClientRead” wait event. This means that the server side is waiting for the client to send the next request. So a slow network between the application server and the database server can adversely affect server-side cursors. But no information about the statement will be available in the PostgreSQL logs.


In this blog post, I tried to assess the impact of transferring a large amount of data from a Database Host to an Application Host.

PLEASE NOTE: The numbers discussed in this blog post, in terms of the number of rows and the time recorded, don’t have any absolute relevance and might change from system to system with many environmental factors.

The discussion focuses more on the areas of impact, what we should expect, and how to analyze them as an end-user, rather than on any absolute numbers.

  1. Always try to specify the minimum number of columns in the query. Avoid “SELECT *” or columns that are not used on the application side, to avoid unnecessary data transfer.
  2. Avoid fetching a large number of rows to the application at a time.  If required, use proper paging with LIMIT and OFFSET.
  3. Avoid server-side cursors wherever possible. PostgreSQL reports only the duration of the first fetch, so the actual query performance may go unnoticed, and poorly performing queries could be hiding behind it.
  4. Client-side cursors and data processing time won’t affect the query performance at the PostgreSQL server-side.
  5. Wait event analysis is very handy in understanding where the server is spending time.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!


Understanding pg_repack: What Can Go Wrong – and How to Avoid It

Understanding pg_repack

pg_repack is one of the oldest and most widely used extension projects for PostgreSQL. It is so popular that even DBaaS providers couldn’t avoid it. It is a “power tool” in the hands of a DBA to deal with bloated/fragmented tables; I can’t imagine a serious production deployment without it these days. It magically replaces a bloated, fragmented table with a fresh, fully packed table, without holding an exclusive lock on the table during its processing**. This extension made PostgreSQL’s built-in commands like VACUUM FULL and CLUSTER almost useless.

**Momentary AccessExclusive locks are required. Please see the discussion below.

But unfortunately, regular and repeated usage of pg_repack has increased everyone's comfort level with it (myself included). As a consequence, it is randomly recommended as a solution for everything, as a panacea. I recently started coming across cases like a scheduled job doing a pg_repack of every table. This rang an alarm bell and started to feel like the overuse of antibiotics. So I thought about writing about how pg_repack works and how it can affect your system while it is running, for the regular user who doesn’t want to do a detailed study, because a better understanding might help with an informed decision on where it should be used.

pg_repack has a lot of functionalities (options). In this blog post, my intention is to discuss only how it does the basic “repack”ing of a fragmented/bloated table.

Pre-Checks and Ensuring Sanity

This stage ensures that pg_repack is installed, the extension is available in the database, we are running pg_repack as a superuser, and so on. It creates a provision for cleaning up temporary objects created by pg_repack. It collects metadata about tables and associated objects like indexes, TOAST tables, and triggers, because tracking the tables and their associated objects is very important. Objects like invalid indexes, conflicting triggers, etc., are checked. We are not going to discuss this in detail.

The actual processing starts with obtaining an advisory lock on the table’s OID, to make sure no other pg_repack is working on the table. This is not so much a concern with full-table repacks, but mainly so that index-only repacks don’t interfere with each other or with a full-table repack. If other pg_repacks are in progress, the repack attempt may get a message as follows and exit:

ERROR: Another pg_repack command may be running on the table. Please try again later.

Create temporary objects while holding AccessExclusive lock on the table

Yes, correct: pg_repack needs the heavyweight AccessExclusive lock, but only temporarily. pg_repack attempts to gain an ACCESS EXCLUSIVE lock on the table by executing a statement as follows:

LOCK TABLE public.t1 IN ACCESS EXCLUSIVE MODE
An ACCESS EXCLUSIVE lock requires that no other sessions are accessing the table, not even with a SELECT query. pg_repack waits for a “wait-time” (60 seconds by default); this wait-time can be changed using the optional parameter --wait-timeout. Once this wait-time is over, pg_repack will start trying to cancel the conflicting statements. Users may see messages as follows for each of the attempts:
WARNING: canceling conflicted backends
WARNING: canceling conflicted backends

So the point to be noted is:

Please avoid running pg_repack when there is a lot of concurrent activity on the table, because new sessions are allowed to take conflicting locks like ACCESS SHARE concurrently, and a session waiting for an ACCESS EXCLUSIVE lock may need to wait indefinitely. We should select a proper time window of low activity for the pg_repack work.

These cancel attempts will continue for another round of “wait-timeout” seconds. But if, even after this second round, the AccessExclusive lock is not obtained, pg_repack escalates to terminating every conflicting session. So pg_repack will terminate sessions if the total wait time exceeds double the “wait-time”. This could be problematic for application connections that get abruptly terminated, and can lead to outages if the application layer does not handle it gracefully.

The pg_repack may emit a message as follows:

WARNING: terminating conflicted backends

And there will be PostgreSQL log entries as follows for each of the sessions that are terminated:

2021-06-17 06:28:46.317 UTC [2761] FATAL: terminating connection due to administrator command
2021-06-17 06:28:46.317 UTC [2758] FATAL: terminating connection due to administrator command

So the point to be noted is: pg_repack can terminate sessions that stand in the way of obtaining the AccessExclusive lock, which can lead to an unexpected outage or misbehavior of the application.

Once double the wait-time is crossed, pg_repack may just proceed with terminating the sessions. But again, this behavior can be controlled using the parameter --no-kill-backend. If this parameter is specified, pg_repack will respect all concurrent sessions and cancel itself instead of attempting to cancel or terminate other sessions. pg_repack may emit a message as follows:

WARNING: timed out, do not cancel conflicting backends
INFO: Skipping repack public.t1 due to timeout

I believe this is more desirable in production systems. So the point to be noted is:

Always remember to specify --no-kill-backend whenever you deal with critical systems
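The escalation sequence described above can be summarized as a small decision function. This is a sketch of the behavior under the defaults described here (60-second wait-timeout, termination after twice that), not pg_repack’s actual implementation:

```python
def lock_wait_action(elapsed_s, wait_timeout=60, no_kill_backend=False):
    """What pg_repack does while waiting for the ACCESS EXCLUSIVE lock."""
    if elapsed_s < wait_timeout:
        return "keep waiting"
    if no_kill_backend:
        # Respect concurrent sessions: give up on this table instead.
        return "skip repack"
    if elapsed_s < 2 * wait_timeout:
        return "cancel conflicting backends"
    return "terminate conflicting backends"

print(lock_wait_action(30))                        # keep waiting
print(lock_wait_action(90))                        # cancel conflicting backends
print(lock_wait_action(130))                       # terminate conflicting backends
print(lock_wait_action(90, no_kill_backend=True))  # skip repack
```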

While holding the AccessExclusive lock, pg_repack creates all the temporary objects, including the substitute table with the same structure.

It creates a primary key TYPE as per the original table. For example, if we are repacking a table with a primary key on a single field “id”, the primary key type definition would look like:

CREATE TYPE repack.pk_16423 AS (id integer)

This TYPE definition is useful because there can be composite keys. The following is an example of dealing with a table with a composite key:

CREATE TYPE repack.pk_16824 AS (id integer, id1 integer, id2 timestamp without time zone);

Then it proceeds to create a “log” table to capture all the data changes (CDC: change data capture) during the pg_repack operation. This table has a BIGINT primary key, a column of the primary key type created in the step above, and a “row” column of the row type of the table we are repacking; this row column can hold the entire tuple information of the table. The log table definition is easy because of the TYPE definition and the “row” type. Here is an example:

CREATE TABLE repack.log_16423 (id bigserial PRIMARY KEY, pk repack.pk_16423, row public.t1)

Now pg_repack creates a trigger on the table to be repacked, so that whenever there is a DML on the table, the corresponding information is captured into the log table created above. This is done using an AFTER INSERT OR DELETE OR UPDATE trigger. For example:

CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.t1 FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('INSERT INTO repack.log_16423(pk, row) VALUES( CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.id)::repack.pk_16423) END, $2)')

Since pg_repack is holding the AccessExclusive lock at this stage, there won’t be any concurrent DMLs yet; that is going to change in the following stages.

pg_repack plays an important trick at this stage, before releasing the AccessExclusive lock, as the comment in the source code says:

/* While we are still holding an AccessExclusive lock on the table, submit
 * the request for an AccessShare lock asynchronously from conn2.
 * We want to submit this query in conn2 while connection's
 * transaction still holds its lock, so that no DDL may sneak in
 * between the time that connection commits and conn2 gets its lock.
 */

Yes, pg_repack uses another connection to the database and sends an AccessShare lock request through it. This prevents any DDL from sneaking in during the lock switch: the moment the main connection commits, the AccessShare lock will be granted to the second connection. So pg_repack uses two database connections to carry out its work.

But still, there is a chance that some DDL can interfere. So pg_repack kills any concurrent DDL against the table by default.

Once this stage is complete, pg_repack can proceed with releasing the AccessExclusive lock on the first connection by COMMITting the transaction, so that the AccessShare lock request by the second connection is granted. This COMMIT is special: the log table and the triggers on the table are committed with it, so they become visible to the entire system from this point onwards.

Copying rows/tuples to a temporary table

Copying tuples from the table to the new table is performed with SERIALIZABLE isolation, because there shouldn’t be any inconsistency between the data getting into the log table and the temporary, substitute table which pg_repack is going to create.

Since the AccessExclusive lock is released, concurrent sessions can proceed with their DMLs and SELECT queries; only DDLs will be blocked. So we can say that the table is available for transactions and queries.

Prior to the data copy, the log table is emptied, because pg_repack needs only the log data captured from the start of the data copy onward:

DELETE FROM repack.log_16423

Again, pg_repack will attempt to kill any session that might be waiting to run a DDL, and then takes an AccessShare lock on the table. Since the other connection is already holding an AccessShare lock, this can be acquired without much problem.

At this stage, pg_repack creates a substitute table that is going to replace the original table, with the exact same structure as the original but without any data. It is a CTAS statement, something like:

CREATE TABLE repack.table_16423... AS SELECT col1,col2,col3... FROM ONLY public.t1 WITH NO DATA

followed by the data copy:

INSERT INTO repack.table_16423 SELECT col1,col2,col3... FROM ONLY public.t1

Once the data copy is over, the transaction is COMMITted. This completes the heaviest stage of repacking in terms of load and WAL generation.

Indexes and keys are then created on this temporary, substitute table.
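The DDL generated for this looks roughly like the following (the index name, column, and OID suffix are assumptions for illustration):

```sql
CREATE UNIQUE INDEX index_16428 ON repack.table_16423 USING btree (id)
```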

Apply the CDC log to a temporary table

Please remember that pg_repack's main connection is not holding any lock on the table at this stage (the second connection still holds its AccessShare lock), so there is nothing blocking transactions (DMLs). Depending on how long the data copy in the previous stage took, and on the concurrent transactions during the copy, there could be a lot of CDC (Change Data Capture) entries in the log table. These need to be applied to the new temporary/substitute table.

This logic is implemented as a C function in pg_repack; you may refer to the source code of repack_apply. It reads all the data from the log table and processes the INSERTs, UPDATEs, and DELETEs. To speed up the repeated operations, prepared statements are used. Finally, the processed entries are deleted from the log table.

Swapping the original table with a temporary table

This is performed by the second connection because it already holds an AccessShare lock, but it has to escalate that lock to an AccessExclusive lock.


The CDC apply is performed once again (the same repack_apply) while holding the AccessExclusive lock on the table, so any new entries that appeared in the log table are also processed.

The original table and the temporary table are then swapped quickly by executing the repack_swap function:

SELECT repack.repack_swap('16423');

This is the most beautiful and powerful part of pg_repack, implemented in the C function repack_swap. Not just the tables are swapped: ownership, associated TOAST relations (table and index), indexes, and dependencies are swapped as well. Even the OIDs are swapped, so the OID of the table remains the same after pg_repack. The swapping work completes with a COMMIT.

Final cleanup

pg_repack uses its built-in C function repack_drop to clean up all the temporary objects. To prevent any concurrent session from acquiring a lock on the table that could block the cleanup, an AccessExclusive lock is obtained before the cleanup. This is the third time an AccessExclusive lock is placed on the table.


pg_repack is one of the most powerful, popular, and useful extensions. We encourage its usage wherever applicable, with proper supervision, but please avoid over-usage. As I tried to explain, we should expect a good amount of data movement: from the original table to the temporary table, trigger writes to the log table, data copy from the log table to the temporary table, etc. So we should also expect higher WAL generation. Considering all the implications, pg_repack should be run during a low-activity time window to avoid undesirable consequences.

Some of the important points to reiterate for end-users are:

  1. pg_repack needs to acquire the heavyweight AccessExclusive lock multiple times, but only temporarily.
  2. In a high-concurrency situation, it can be almost impossible to get an AccessExclusive lock.
  3. pg_repack, by default, will attempt to cancel conflicting statements if it cannot gain the AccessExclusive lock within the wait time.
  4. It may proceed to terminate sessions if the total wait exceeds twice the wait time. This could lead to undesirable outcomes and outages.
  5. The defaults of pg_repack may not be good for critical systems. Use the --no-kill-backend option to make it more gentle.
  6. No DDLs are allowed against a table that is undergoing pg_repack, and any session that attempts one might get killed.
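Putting these points into practice, a gentler invocation might look like this (the database and table names are placeholders; --wait-timeout and --no-kill-backend are existing pg_repack options):

```shell
# Wait up to 120 seconds for each AccessExclusive lock, and give up
# instead of cancelling or terminating other sessions.
pg_repack --dbname=mydb --table=public.t1 \
          --wait-timeout=120 \
          --no-kill-backend
```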

PostgreSQL HA with Patroni: Your Turn to Test Failure Scenarios


A couple of weeks ago, Jobin and I did a short presentation during Percona Live Online bearing a similar title to this post: "PostgreSQL HA With Patroni: Looking at Failure Scenarios and How the Cluster Recovers From Them". We deployed a 3-node PostgreSQL environment with some recycled hardware we had lying around and set about "breaking" it in different ways: unplugging network and power cables, killing main processes, attempting to saturate processors. All of this while continuously writing and reading data from PostgreSQL. The idea was to see how Patroni would handle the failures and manage the cluster to continue delivering service. It was a fun demo!

We promised a follow-up post explaining how we set up the environment, so you could give it a try yourselves, and this is it. We hope you also have fun attempting to reproduce our small experiment, but mostly that you use it as an opportunity to learn how a PostgreSQL HA environment managed by Patroni works in practice: there is nothing like a hands-on lab for this!

Initial Setup

We recycled three 10-year-old Intel Atom mini-computers for our experiment, but you could use virtual machines instead: even though you will miss the excitement of unplugging real cables, this can still be simulated with a VM. We installed the server version of Ubuntu 20.04 and configured the machines to know "each other" by hostname; here is what the hosts file of the first node looked like:

$ cat /etc/hosts
localhost node1
node1
node2
node3


Patroni supports a myriad of systems for its Distributed Configuration Store (DCS), but etcd remains a popular choice. We installed the version available from the Ubuntu repository on all three nodes:

sudo apt-get install etcd

It is necessary to initialize the etcd cluster from one of the nodes and we did that from node1 using the following configuration file:

$ cat /etc/default/etcd

Note how ETCD_INITIAL_CLUSTER_STATE is defined as "new".
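A minimal sketch of such a configuration for node1 (the URLs, ports, and token below are assumptions; adjust them to your network):

```shell
ETCD_NAME="node1"
ETCD_INITIAL_CLUSTER="node1=http://node1:2380"
ETCD_INITIAL_CLUSTER_TOKEN="stampede-etcd"
ETCD_INITIAL_CLUSTER_STATE="new"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://node1:2380"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://node1:2379"
```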

We then restarted the service:

sudo systemctl restart etcd

We can then move on to install etcd on node2. Its configuration file follows the same structure as node1's, except that we are adding node2 to an existing cluster, so we should indicate the other node(s):
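For node2 that might look like the following (again, the URLs and token are assumptions):

```shell
ETCD_NAME="node2"
# node2 joins an existing cluster, so list the existing member(s) as well
ETCD_INITIAL_CLUSTER="node1=http://node1:2380,node2=http://node2:2380"
ETCD_INITIAL_CLUSTER_TOKEN="stampede-etcd"
ETCD_INITIAL_CLUSTER_STATE="existing"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://node2:2380"
ETCD_LISTEN_PEER_URLS="http://0.0.0.0:2380"
ETCD_LISTEN_CLIENT_URLS="http://0.0.0.0:2379"
ETCD_ADVERTISE_CLIENT_URLS="http://node2:2379"
```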


Before we restart the service, we need to formally add node2 to the etcd cluster by running the following command on node1:

sudo etcdctl member add node2

We can then restart the etcd service on node2:

sudo systemctl restart etcd

The configuration file for node3 looks like this:


Remember we need to add node3 to the cluster by running the following command on node1:

sudo etcdctl member add node3

before we can restart the service on node3:

sudo systemctl restart etcd

We can verify the cluster state to confirm it has been deployed successfully by running the following command from any of the nodes:

$ sudo etcdctl member list
2ed43136d81039b4: name=node3 peerURLs= clientURLs= isLeader=false
d571a1ada5a5afcf: name=node1 peerURLs= clientURLs= isLeader=true
ecec6c549ebb23bc: name=node2 peerURLs= clientURLs= isLeader=false

As we can see above, node1 is the leader at this point, which is expected since the etcd cluster has been bootstrapped from it. If you get a different result, check for etcd entries logged to /var/log/syslog on each node.


Watchdog

Quoting Patroni's manual:

Watchdog devices are software or hardware mechanisms that will reset the whole system when they do not get a keepalive heartbeat within a specified timeframe. This adds an additional layer of fail safe in case usual Patroni split-brain protection mechanisms fail.

While the use of a watchdog mechanism with Patroni is optional, you shouldn’t really consider deploying a PostgreSQL HA environment in production without it.

For our tests, we used the standard software implementation for watchdog that is shipped with Ubuntu 20.04, a module called softdog. Here’s the procedure we used in all three nodes to configure the module to load:

sudo sh -c 'echo "softdog" >> /etc/modules'

Patroni will be the component interacting with the watchdog device. Since Patroni is run by the postgres user, we need to either set the permissions of the watchdog device open enough so the postgres user can write to it or make the device owned by postgres itself, which we consider a safer approach (as it is more restrictive):

sudo sh -c 'echo "KERNEL==\"watchdog\", OWNER=\"postgres\", GROUP=\"postgres\"" >> /etc/udev/rules.d/61-watchdog.rules'

These two steps looked like all that would be required for the watchdog to work, but to our surprise, the softdog module wasn't loaded after restarting the servers. After spending quite some time digging around, we figured out the module was blacklisted by default and there was a stray file with such a directive still lingering around:

$ grep blacklist /lib/modprobe.d/* /etc/modprobe.d/* |grep softdog
/lib/modprobe.d/blacklist_linux_5.4.0-72-generic.conf:blacklist softdog

Editing that file in each of the nodes to remove the line above and restarting the servers did the trick:

$ lsmod | grep softdog
softdog                16384  0

$ ls -l /dev/watchdog*
crw-rw---- 1 postgres postgres  10, 130 May 21 21:30 /dev/watchdog
crw------- 1 root     root     245,   0 May 21 21:30 /dev/watchdog0


PostgreSQL

Percona Distribution for PostgreSQL can be installed from the Percona Repository in a few easy steps:

sudo apt-get update -y; sudo apt-get install -y wget gnupg2 lsb-release curl
wget https://repo.percona.com/apt/percona-release_latest.generic_all.deb
sudo dpkg -i percona-release_latest.generic_all.deb
sudo apt-get update
sudo percona-release setup ppg-12
sudo apt-get install percona-postgresql-12

An important concept to understand in a PostgreSQL HA environment like this one is that PostgreSQL should not be started automatically by systemd during the server initialization: we should leave it to Patroni to fully manage it, including the process of starting and stopping the server. Thus, we should disable the service:

sudo systemctl disable postgresql

For our tests, we want to start with a fresh new PostgreSQL setup and let Patroni bootstrap the cluster, so we stop the server and remove the data directory that has been created as part of the PostgreSQL installation:

sudo systemctl stop postgresql
sudo rm -fr /var/lib/postgresql/12/main

These steps should be repeated in nodes 2 and 3 as well.


Patroni

The Percona Repository also includes a package for Patroni, so with the repository already configured on the nodes we can install Patroni with a simple:

sudo apt-get install percona-patroni

Here’s the configuration file we have used for node1:

$ cat /etc/patroni/config.yml
scope: stampede
name: node1

restapi:
  connect_address: node1:8008

etcd:
  host: node1:2379

bootstrap:
  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
#    master_start_timeout: 300
#    synchronous_mode: false
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        wal_level: replica
        hot_standby: "on"
        logging_collector: 'on'
        max_wal_senders: 5
        max_replication_slots: 5
        wal_log_hints: "on"
        #archive_mode: "on"
        #archive_timeout: 600
        #archive_command: "cp -f %p /home/postgres/archived/%f"
        #restore_command: cp /home/postgres/archived/%f %p

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  - host replication replicator md5
  - host replication replicator trust
  - host all all md5
  - host all all md5
#  - hostssl all all md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
# post_init: /usr/local/bin/setup_cluster.sh
  # Some additional users which need to be created after initializing new cluster
  users:
    admin:
      password: admin
      options:
        - createrole
        - createdb

postgresql:
  connect_address: node1:5432
  data_dir: "/var/lib/postgresql/12/main"
  bin_dir: "/usr/lib/postgresql/12/bin"
#  config_dir:
  pgpass: /tmp/pgpass0
  authentication:
    replication:
      username: replicator
      password: vagrant
    superuser:
      username: postgres
      password: vagrant
  parameters:
    unix_socket_directories: '/var/run/postgresql'

watchdog:
  mode: required # Allowed values: off, automatic, required
  device: /dev/watchdog
  safety_margin: 5

tags:
    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

With the configuration file in place, and now that we already have the etcd cluster up, all that is required is to restart the Patroni service:

sudo systemctl restart patroni

When Patroni starts, it takes care of initializing PostgreSQL (because the service is not currently running and the data directory is empty) following the directives in the bootstrap section of Patroni's configuration file. If everything went according to plan, you should be able to connect to PostgreSQL using the credentials in the configuration file (the password is vagrant):

$ psql -U postgres
psql (12.6 (Ubuntu 2:12.6-2.focal))
Type "help" for help.


Repeat the operation to install Patroni on nodes 2 and 3: the only difference is that you will need to replace the references to node1 in the configuration file (there are four of them) with the respective node name.

You can also check the state of the Patroni cluster we just created with:

$ sudo patronictl -c /etc/patroni/config.yml list
| Cluster  | Member |  Host |  Role  |  State  | TL | Lag in MB |
| stampede | node1  | node1 | Leader | running |  2 |           |
| stampede | node2  | node2 |        | running |  2 |         0 |
| stampede | node3  | node3 |        | running |  2 |         0 |

node1 started the Patroni cluster so it was automatically made the leader – and thus the primary/master PostgreSQL server. Nodes 2 and 3 are configured as read replicas (as the hot_standby option was enabled in Patroni’s configuration file).


HAproxy

A common implementation of high availability in a PostgreSQL environment makes use of a proxy: instead of connecting directly to the database server, the application connects to the proxy, which forwards the request to PostgreSQL. When HAproxy is used for this, it is also possible to route read requests to one or more replicas, for load balancing. However, this is not a transparent process: the application needs to be aware of it and split read-only from read-write traffic itself. With HAproxy, this is done by providing two different ports for the application to connect to. We opted for the following setup:

  • Writes → 5000
  • Reads → 5001

HAproxy can be installed as an independent server (and you can have as many as you want) but it can also be installed on the application server or the database server itself – it is a light enough service. For our tests, we planned on using our own Linux workstations (which also run Ubuntu 20.04) to simulate application traffic so we installed HAproxy on them:

sudo apt-get install haproxy

With the software installed, we modified the main configuration file as follows:

$ cat /etc/haproxy/haproxy.cfg
global
    maxconn 100

defaults
    log global
    mode tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s
listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

listen standbys
    balance roundrobin
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

Note there are two sections: primary, using port 5000, and standbys, using port 5001. All three nodes are included in both sections because they are all potential candidates to be either primary or secondary. For HAproxy to know which role each node currently holds, it sends an HTTP request to port 8008 of the node, and Patroni answers. Patroni provides built-in REST API support for health-check monitoring that integrates perfectly with HAproxy for this:

$ curl -s http://node1:8008
{"state": "running", "postmaster_start_time": "2021-05-24 14:50:11.707 UTC", "role": "master", "server_version": 120006, "cluster_unlocked": false, "xlog": {"location": 25615248}, "timeline": 1, "database_system_identifier": "6965869170583425899", "patroni": {"version": "1.6.4", "scope": "stampede"}}

We configured the standbys group to balance read requests in a round-robin fashion, so each connection request (or reconnection) will alternate between the available replicas. We can test this in practice; first, let's save the postgres user password in a file to facilitate the process:

echo "localhost:5000:postgres:postgres:vagrant" > ~/.pgpass
echo "localhost:5001:postgres:postgres:vagrant" >> ~/.pgpass
chmod 0600 ~/.pgpass

We can then execute two read-requests to verify the round-robin mechanism is working as intended:

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"

as well as test the writer access:

$ psql -Upostgres -hlocalhost -p5000 -t -c "select inet_server_addr()"

You can also check the state of HAproxy by visiting http://localhost:7000/ on your browser.


To best simulate a production environment for testing our failure scenarios, we wanted to have continuous reads and writes to the database. We could have used a benchmark tool such as Sysbench or Pgbench, but we were more interested in observing the switch of source server upon a failure than in the load itself. Jobin wrote a simple Python script that is perfect for this, HAtester. As was the case with HAproxy, we ran the script from our Linux workstation. Since it is a Python script, you need a PostgreSQL driver for Python installed to execute it:

sudo apt-get install python3-psycopg2
curl -LO https://raw.githubusercontent.com/jobinau/pgscripts/main/patroni/HAtester.py
chmod +x HAtester.py

Edit the script with the credentials to access the PostgreSQL servers (through HAproxy) if you are using settings different from ours. The only requirement for it to work is to have the target table created beforehand, so first connect to the postgres database (unless you are using a different target) on the primary and run:


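Whatever DDL the script expects, it boils down to a small table it can insert into and read back; a hypothetical minimal shape (the table name and columns here are illustrative only, check HAtester.py for the actual definition) would be:

```sql
CREATE TABLE hatest (id serial PRIMARY KEY, ts timestamptz DEFAULT now());
```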
You can then start two different sessions:

  1. One for writes:

    ./HAtester.py 5000
  2. One for reads:
    ./HAtester.py 5001

The idea is to observe what happens with database traffic when the environment experiences a failure; that is, how HAproxy will route reads and writes as Patroni adjusts the PostgreSQL cluster. You can continuously monitor Patroni from the point of view of the nodes by opening a session in each of them and running the following command:

sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

To facilitate observability and better follow the changes in real-time, we used the terminal multiplexer Tmux to visualize all 5 sessions on the same screen:

  • On the left side, we have one session open for each of the 3 nodes, continuously running:

    sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

    It’s better to have the Patroni view for each node independently because when you start the failure tests you will lose connection to a part of the cluster.

  • On the right side, we are executing the HAtester.py script from our workstation:
    • Sending writes through port 5000:

      ./HAtester.py 5000
    • and reads through port 5001:

      ./HAtester.py 5001

A couple of notes on the execution of the HAtester.py script:

  • Pressing Ctrl+C will break the connection, but the script will reconnect, this time to a different replica (in the case of reads), because the standbys group in HAproxy is configured with round-robin balancing.
  • When a switchover or failover takes place and the nodes are re-arranged in the cluster, you may temporarily see writes sent to a node that used to be a replica and was just promoted to primary, and reads sent to a node that used to be the primary and was demoted to a secondary: that's a limitation of the HAtester.py script, but "by design"; we favored faster reconnections and minimal checks on the node's role for demonstration purposes. In a production application, this part ought to be implemented differently.
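The reconnect behavior the script relies on can be sketched as a small retry loop (a simplified sketch, not the actual HAtester.py code, which uses psycopg2 and its own logic):

```python
import time

def connect_with_retry(connect, retries=5, delay=0.5):
    """Call `connect` until it succeeds or the retries are exhausted.

    `connect` is any zero-argument callable that returns a connection
    or raises on failure; in the real script it would wrap
    psycopg2.connect() pointed at the HAproxy port, so each successful
    reconnection may land on a different backend node.
    """
    last_error = None
    for _ in range(retries):
        try:
            return connect()
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = exc
            time.sleep(delay)  # give HAproxy/Patroni a moment to re-route
    raise RuntimeError("gave up reconnecting") from last_error
```

Because the role check is minimal, a freshly promoted or demoted node can briefly receive the "wrong" kind of traffic, exactly as described above.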

Testing Failure Scenarios

The fun part starts now! We leave it to you to test and play around to see what happens with the PostgreSQL cluster in practice following a failure. We leave as suggestions the tests we did in our presentation. For each failure scenario, observe how the cluster re-adjusts itself and the impact on read and write traffic.

1) Loss of Network Communication

  • Unplug the network cable from one of the nodes (or simulate this condition in your VM):
    • First from a replica
    • Then from the primary
  • Unplug the network cable from one replica and the primary at the same time:
    • Does Patroni experience a split-brain situation?

2) Power Outage

  • Unplug the power cable from the primary
  • Wait until the cluster is re-adjusted then plug the power cable back and start the node


3) PostgreSQL Process Crash

Simulate an OOM kill or crash by killing the postmaster process on one of the nodes with kill -9.

4) Killing Patroni

Remember that Patroni is managing PostgreSQL. What happens if the Patroni process (and not PostgreSQL) is killed?

5) CPU Saturation

Simulate CPU saturation with a benchmark tool such as Sysbench, for example:

sysbench cpu --threads=10 --time=0 run

This one is a bit tricky, as the reads and writes are each single-threaded operations. You may need to decrease the priority of the HAtester.py processes with renice, and possibly increase that of Sysbench's.
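With standard Linux tooling, that adjustment could look like this (the pgrep patterns are assumptions based on the process names used above):

```shell
# Lower HAtester.py's CPU priority (a higher nice value means lower priority)
sudo renice -n 10 -p $(pgrep -f HAtester.py)
# Raise Sysbench's priority (negative nice values require root)
sudo renice -n -10 -p $(pgrep -x sysbench)
```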

6) Manual Switchover

Patroni facilitates changes in the PostgreSQL hierarchy, and switchover operations can even be scheduled. The command below is interactive and will prompt you with options:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover

Alternatively, you can be specific and tell Patroni exactly what to do:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover --master node1 --candidate node2 --force

We hope you had fun with this hands-on lab! If you have questions or comments, leave us a note in the comments section below!
