Understanding pg_repack: What Can Go Wrong – and How to Avoid It

Understanding pg_repack

pg_repack is one of the oldest, widely used, extension projects for PostgreSQL. It is so much popular that even DBaaS service providers couldn’t avoid it. It is a “power tool” in the hands of a DBA to deal with bloated/fragmented tables. I can’t imagine a serious production deployment without it these days. It magically replaces the bloated, fragmented tables with a fresh fully packed table without holding an exclusive lock on the table during its processing**. This extension made PostgreSQL’s built-in commands like VACUUM FULL and CLUSTER almost useless.

**Momentarily AccessExclusive locks are required. Please see the discussion below.

But unfortunately, regular and repeated usage of pg_repack increased the comfort levels of everyone (myself included). As a consequence, it is randomly recommended as a solution for everything, as a panacea. But I recently started coming across cases like a scheduled job doing pg_repack of every table. This time it rang an alarm bell and it started to feel like overuse of antibiotics. So I thought about writing about how pg_repack works and how it can affect your system while it is running for a regular user who doesn’t want to do a detailed study. because a better understanding might help for an informed decision on where it should be used.

pg_repack has a lot of functionalities (options). In this blog post, my intention is to discuss only how it does the basic “repack”ing of a fragmented/bloated table.

Pre-Checks and Ensuring Sanity

This is to ensure that pg_repack is installed and extension is available in the database, we are running the pg_repack as a superuser, etc. It creates a provision for cleaning up temporary objects created by pg_repack. It collects metadata about tables and associated objects like Indexes, Toast, Triggers, etc. because tracking of tables and associated objects is very important. Objects like invalid indexes, conflicting triggers, etc. are checked. We are not going to discuss this in detail.

The actual processing starts with obtaining an advisory lock on the table’s OID, to make sure no other pg_repack is working on the table. This is not so much a concern with full-table repacks, but mainly so that index-only repacks don’t interfere with each other or a full-table repack. If there are other pg_repaks in progress, repack attempts may get a message as follows and exit.

ERROR: Another pg_repack command may be running on the table. Please try again later.

Create temporary objects while holding AccessExclusive lock on the table

Yes. Correct, pg_repack needs the heavyweight AccessExclusive lock. But temporarily.  pg_repack attempts to gain a table ACCESS EXCLUSIVE lock by executing a statement as follows:


An Access Exclusive lock requires that no other sessions are accessing the table, Not even a SELECT query. pg_repack wait for a “wait-time” (60 seconds by default). This wait-time can be changed using the optional parameter --wait-timeout. Once this wait-time is over pg_repack will start trying to cancel the conflicting statements.  Users may see messages as follows for each of the attempts.
WARNING: canceling conflicted backends
WARNING: canceling conflicted backends

So the point to be noted is:

So please avoid running pg_repack when there is a lot of concurrent activities on the table. because the new sessions are allowed to get conflicting locks like ACCESS SHARE concurrently and a session waiting for ACCESS EXCLUSIVE lock may need to wait indefinitely. We should be selecting a proper time window of low activity for pg_repack work

These cancel attempts will continue for another round of “wait-timeout” seconds. But even after attempting for this second round of wait-timeout, if the AcessExclusive lock is not obtained, it escalates to termination of every conflicting session. So pg_repack will terminate sessions if the total wait time exceeds double the “wait-time”. This could be problematic for application connections that get abruptly get terminated. This can lead to outages if the application layer does not handle it gracefully.

The pg_repack may emit a message as follows:

WARNING: terminating conflicted backends

And there will be PostgreSQL log entries as follows for each of the sessions which is killed

2021-06-17 06:28:46.317 UTC [2761] FATAL: terminating connection due to administrator command
2021-06-17 06:28:46.317 UTC [2758] FATAL: terminating connection due to administrator command

So the point to be noted is: pg_repack can terminate sessions if they stand against obtaining AcessExclusive lock which can lead to unexpected outage or misbehavior of the application.

If the double the wait time is crossed, pg_repack may just proceed with terminate the sessions. But, again this behavior also can be controlled using a parameter --no-kill-backend.  if this parameter is specified, pg_repack will respect all concurrent sessions and cancel itself instead of attempting to canceling or terminating other sessions
pg_repack may emit a message as follows

WARNING: timed out, do not cancel conflicting backends
INFO: Skipping repack public.t1 due to timeout

I believe this is more desirable in production systems. So the point to be noted is:

Always remember to specify --no-kill-backend whenever you deal with critical systems

When AccessExclusive is locked, pg_repack creates all temporary objects including the substituted table with the same structure.

It creates a primary key TYPE as per the original table. For example, if we are repacking a table with a primary key of a single field “id”, the primary key type definition would look like:

CREATE TYPE repack.pk_16423 AS (id integer)

This TYPE of definition is useful because there can be composite keys. The following is an example of it dealing with tables of a composite key.
CREATE TYPE repack.pk_16824 AS (id integer, id1 integer, id2 timestamp without time zone);

Then It proceeds to create a “log” table to capture all the data changes (CDC-Change data capture) during the pg_repack operation. This table will have a primary key of BIGINT data type, Primary key type of the original table created in the above step, and “row” datatype of the table which we are repacking. This row can hold the entire tuple information of the table which we repack.
This log table definition is easy because of the TYPE definition and the “row” type. here is an example

CREATE TABLE repack.log_16423 (id bigserial PRIMARY KEY, pk repack.pk_16423, row public.t1)

Now pg_repack creates a trigger on the table to be repacked so that whenever there is DML on the table, corresponding information needs to be captured to the log table created above. This is done using an AFTER INSERT OR DELETE OR UPDATE trigger. For example:

CREATE TRIGGER repack_trigger AFTER INSERT OR DELETE OR UPDATE ON public.t1 FOR EACH ROW EXECUTE PROCEDURE repack.repack_trigger('INSERT INTO repack.log_16423(pk, row) VALUES( CASE WHEN $1 IS NULL THEN NULL ELSE (ROW($1.id)::repack.pk_16423) END, $2)')

Since the pg_repack is holding the AccessExclusive lock at this stage, there won’t be any concurrent DMLs at this stage, which is going to change in the following stages.

pg_repack plays an important trick at this stage before releasing AcessExclusive lock as the comment in the source code says:

/* While we are still holding an AccessExclusive lock on the table, submit
* the request for an AccessShare lock asynchronously from conn2.
* We want to submit this query in conn2 while connection's
* transaction still holds its lock, so that no DDL may sneak in
* between the time that connection commits and conn2 gets its lock.

Yes, pg_repack uses another connection to the database and sends an AccessShare lock request through that. This prevents any DDL in between the switch to AccessShare lock. The moment the main connection commits, An AccessShare lock will be granted to the second connection. So pg_repack uses two database connections to carry out the work.

But still, there is a chance that some DDL can interfere. So pg_repack kills any concurrent DDL against the table by default.

Once this stage is complete pg_repack can proceed with releasing AccessExclusive lock on the first connection by COMMITing the transaction. So that the AccessShare lock request by the second connection will be granted. This COMMIT is very special that all the log table and triggers on the table will be committed so that it is available to the entire system from this point onwards.

Copying rows/tuples to a temporary table

Copying of tuples from the table to a new table is performed with SERIALIZABLE isolation. because there shouldn’t be any inconsistency between the data getting into the log table and the temporary, substitute table which pg_repack is going to create.


Since the AccessExclusive lock is removed, concurrent sessions can proceed with their DMLs and Select queries. Only DDLs will be blocked. So we can say that the table is available for transactions and queries.

Prior to the data copy, the log table is truncated. because pg_repack needs only those log data that is captured from the starting of data copy.

DELETE FROM repack.log_16423

Again pg_repack will attempt to kill any session which might be waiting for doing a DDL and get an AcessShare lock on the table. Since the other connection is already holding an AccessShare lock, This can be gained without much problem.

At this stage, pg_repack creates a substitute table that is going to replace the original table with the exact same structure as the original table but without any data. It will be a CTAS statement, something like :

CREATE TABLE repack.table_16423... AS SELECT col1,col2,col3... FROM ONLY public.t1 WITH NO DATA

followed by the data copy:

INSERT INTO repack.table_16423 SELECT col1,col2,co3... FROM ONLY public.t1

Once the data copy is over, the transaction will be COMMIT ed. This completes one heaviest stage in repacking in terms of load and WAL generation

Indexes, Keys will be created at this stage on this temporary, substitute table at this stage.

Apply the CDC log to a temporary table

Please remember that pg_repack’s main connection is not holding any lock on the table at this stage (other than the second connection’s AccessShare lock).  So There is nothing blocking the transactions (DMLs) at this stage. Depending on the time it took for the data copy in the previous stage, and concurrent transactions during the data copy, There could be a lot of CDC(Change Data Capture) entries in the log file. This needs to be copied to the new temporary/substitute table.

This logic is implemented as C function in pg_repack. you may refer to the source code of repack_apply. It reads all the data from the log table and processes INSERTS, UPDATES, and DELETES. In order to speed up the repeated operations, Prepared Statements are used. Finally, all those data from the log table which is processed will be deleted from the log table.

Swapping the original table with a temporary table

This is performed by the second connection because it already holds an AccessShare lock, But it will escalate the lock to AccessExclusive lock.


The CDC apply will be performed once again (The same “repack_apply”) while holding the AccessExclusive lock on the table. So if there is any new entry that appears in the log table, that also will be processed.

The original table and the temporary table quickly by executing repack_swap function like:

SELECT repack.repack_swap('16423');

This is the most beautiful and powerful part of pg_repack. which is implemented in a C function repack_swap. Not just tables are swapped, ownership, associated toasts (table and index), indexes, and dependencies are also swapped. Oids are swapped so that the oid of the table remains the same even after pg_repack. The Swapping work complies with a COMMIT

Final cleanup

pg_repack uses its built-in C Function repack_drop for doing the cleanup of all temporary objects. To prevent any concurrent sessions from acquiring a lock on the table which could prevent the cleanup, An AccessExclusive lock is obtained before the cleanup. This is the third time an AccessExcluive lock is placed on the table.


pg_repack is one of the most powerful, popular, and useful extensions. We encourage the usage wherever applicable with proper supervision. But please avoid over-usage. As I tried to explain,  we should expect a good amount of data movement between the original table to the temporary table, trigger writing to the log table, data copy from the log table to the temporary table, etc. So we should be expecting a higher WAL generation also. Considering all the implications, pg_repack needs to be performed on a low activity time window to avoid undesirable consequences.

Some of the important points to reiterate for end-users are:

  1. pg_repack needs to acquire heavyweight AccessExclusive lock multiple times. But Temporarily.
  2. In a high concurrency situation, it will be almost impossible to get an AccessExclusive lock
  3. pg_repack, by default, will attempt canceling the conflicting statements if it is not able to gain AcessExclusive lock-in wait-time
  4. It may proceed to terminate sessions if the total wait exceeds double the amount of wait time. This could lead to undesirable outcomes and outages.
  5. Defaults of pg_repack may not be good for critical systems. use --no-kill-backend option to make it more gentle.
  6. No DDLs are allowed against the table which is undergoing pg_repack and any session that attempts to do so might get killed.

PostgreSQL HA with Patroni: Your Turn to Test Failure Scenarios

PostgreSQL HA with Patroni

A couple of weeks ago, Jobin and I did a short presentation during Percona Live Online bearing a similar title as the one for this post: “PostgreSQL HA With Patroni: Looking at Failure Scenarios and How the Cluster Recovers From Them”. We deployed a 3-node PostgreSQL environment with some recycled hardware we had lying around and set ourselves at “breaking” it in different ways: by unplugging network and power cables, killing main processes, attempting to saturate processors. All of this while continuously writing and reading data from PostgreSQL. The idea was to see how Patroni would handle the failures and manage the cluster to continue delivering service. It was a fun demo!

We promised a follow-up post explaining how we set up the environment, so you could give it a try yourselves, and this is it. We hope you also have fun attempting to reproduce our small experiment, but mostly that you use it as an opportunity to learn how a PostgreSQL HA environment managed by Patroni works in practice: there is nothing like a hands-on lab for this!

Initial Setup

We recycled three 10-year old Intel Atom mini-computers for our experiment but you could use some virtual machines instead: even though you will miss the excitement of unplugging real cables, this can still be simulated with a VM. We installed the server version of Ubuntu 20.04 and configured them to know “each other” by hostname; here’s how the hosts file of the first node looked like:

$ cat /etc/hosts localhost node1 node1 node2 node3


Patroni supports a myriad of systems for Distribution Configuration Store but etcd remains a popular choice. We installed the version available from the Ubuntu repository on all three nodes:

sudo apt-get install etcd

It is necessary to initialize the etcd cluster from one of the nodes and we did that from node1 using the following configuration file:

$ cat /etc/default/etcd

Note how ETCD_INITIAL_CLUSTER_STATE is defined with “new”.

We then restarted the service:

sudo systemctl restart etcd

We can then move on to install etcd on node2. The configuration file follows the same structure as that of node1, except that we are adding node2 to an existing cluster so we should indicate the other node(s):


Before we restart the service, we need to formally add node2 to the etcd cluster by running the following command on node1:

sudo etcdctl member add node2

We can then restart the etcd service on node2:

sudo systemctl restart etcd

The configuration file for node3 looks like this:


Remember we need to add node3 to the cluster by running the following command on node1:

sudo etcdctl member add node3

before we can restart the service on node3:

sudo systemctl restart etcd

We can verify the cluster state to confirm it has been deployed successfully by running the following command from any of the nodes:

$ sudo etcdctl member list
2ed43136d81039b4: name=node3 peerURLs= clientURLs= isLeader=false
d571a1ada5a5afcf: name=node1 peerURLs= clientURLs= isLeader=true
ecec6c549ebb23bc: name=node2 peerURLs= clientURLs= isLeader=false

As we can see above, node1 is the leader at this point, which is expected since the etcd cluster has been bootstrapped from it. If you get a different result, check for etcd entries logged to /var/log/syslog on each node.


Quoting Patroni’s manual:

Watchdog devices are software or hardware mechanisms that will reset the whole system when they do not get a keepalive heartbeat within a specified timeframe. This adds an additional layer of fail safe in case usual Patroni split-brain protection mechanisms fail.

While the use of a watchdog mechanism with Patroni is optional, you shouldn’t really consider deploying a PostgreSQL HA environment in production without it.

For our tests, we used the standard software implementation for watchdog that is shipped with Ubuntu 20.04, a module called softdog. Here’s the procedure we used in all three nodes to configure the module to load:

sudo sh -c 'echo "softdog" >> /etc/modules'

Patroni will be the component interacting with the watchdog device. Since Patroni is run by the postgres user, we need to either set the permissions of the watchdog device open enough so the postgres user can write to it or make the device owned by postgres itself, which we consider a safer approach (as it is more restrictive):

sudo sh -c 'echo "KERNEL==\"watchdog\", OWNER=\"postgres\", GROUP=\"postgres\"" >> /etc/udev/rules.d/61-watchdog.rules'

These two steps looked like all that would be required for watchdog to work but to our surprise, the softdog module wasn’t loaded after restarting the servers. After spending quite some time digging around we figured the module was blacklisted by default and there was a strain file with such a directive still lingering around:

$ grep blacklist /lib/modprobe.d/* /etc/modprobe.d/* |grep softdog
/lib/modprobe.d/blacklist_linux_5.4.0-72-generic.conf:blacklist softdog

Editing that file in each of the nodes to remove the line above and restarting the servers did the trick:

$ lsmod | grep softdog
softdog                16384  0

$ ls -l /dev/watchdog*
crw-rw---- 1 postgres postgres  10, 130 May 21 21:30 /dev/watchdog
crw------- 1 root     root     245,   0 May 21 21:30 /dev/watchdog0


Percona Distribution for PostgreSQL can be easily installed from the Percona Repository in a few easy steps:

sudo apt-get update -y; sudo apt-get install -y wget gnupg2 lsb-release curl
wget https://repo.percona.com/apt/percona-release_latest.generic_all.deb
sudo dpkg -i percona-release_latest.generic_all.deb
sudo apt-get update
sudo percona-release setup ppg-12
sudo apt-get install percona-postgresql-12

An important concept to understand in a PostgreSQL HA environment like this one is that PostgreSQL should not be started automatically by systemd during the server initialization: we should leave it to Patroni to fully manage it, including the process of starting and stopping the server. Thus, we should disable the service:

sudo systemctl disable postgresql

For our tests, we want to start with a fresh new PostgreSQL setup and let Patroni bootstrap the cluster, so we stop the server and remove the data directory that has been created as part of the PostgreSQL installation:

sudo systemctl stop postgresql
sudo rm -fr /var/lib/postgresql/12/main

These steps should be repeated in nodes 2 and 3 as well.


The Percona Repository also includes a package for Patroni so with it already configured in the nodes we can install Patroni with a simple:

sudo apt-get install percona-patroni

Here’s the configuration file we have used for node1:

$ cat /etc/patroni/config.yml
scope: stampede
name: node1

  connect_address: node1:8008

  host: node1:2379

  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
#    master_start_timeout: 300
#    synchronous_mode: false
      use_pg_rewind: true
      use_slots: true
        wal_level: replica
        hot_standby: "on"
        logging_collector: 'on'
        max_wal_senders: 5
        max_replication_slots: 5
        wal_log_hints: "on"
        #archive_mode: "on"
        #archive_timeout: 600
        #archive_command: "cp -f %p /home/postgres/archived/%f"
        #restore_command: cp /home/postgres/archived/%f %p

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  - host replication replicator md5
  - host replication replicator trust
  - host all all md5
  - host all all md5
#  - hostssl all all md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
# post_init: /usr/local/bin/setup_cluster.sh
  # Some additional users users which needs to be created after initializing new cluster
      password: admin
        - createrole
        - createdb

  connect_address: node1:5432
  data_dir: "/var/lib/postgresql/12/main"
  bin_dir: "/usr/lib/postgresql/12/bin"
#  config_dir:
  pgpass: /tmp/pgpass0
      username: replicator
      password: vagrant
      username: postgres
      password: vagrant
    unix_socket_directories: '/var/run/postgresql'

  mode: required # Allowed values: off, automatic, required
  device: /dev/watchdog
  safety_margin: 5

    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

With the configuration file in place, and now that we already have the etcd cluster up, all that is required is to restart the Patroni service:

sudo systemctl restart patroni

When Patroni starts, it will take care of initializing PostgreSQL (because the service is not currently running and the data directory is empty) following the directives in the bootstrap section of Patroni’s configuration file. If everything went according to the plan, you should be able to connect to PostgreSQL using the credentials in the configuration file (password is vagrant):

$ psql -U postgres
psql (12.6 (Ubuntu 2:12.6-2.focal))
Type "help" for help.


Repeat the operation for installing Patroni on nodes 2 and 3: the only difference is that you will need to replace the references to node1 in the configuration file (there are four of them, shown in bold) with the respective node name.

You can also check the state of the Patroni cluster we just created with:

$ sudo patronictl -c /etc/patroni/config.yml list
| Cluster  | Member |  Host |  Role  |  State  | TL | Lag in MB |
| stampede | node1  | node1 | Leader | running |  2 |           |
| stampede | node2  | node2 |        | running |  2 |         0 |
| stampede | node3  | node3 |        | running |  2 |         0 |

node1 started the Patroni cluster so it was automatically made the leader – and thus the primary/master PostgreSQL server. Nodes 2 and 3 are configured as read replicas (as the hot_standby option was enabled in Patroni’s configuration file).


A common implementation of high availability in a PostgreSQL environment makes use of a proxy: instead of connecting directly to the database server, the application will be connecting to the proxy instead, which will forward the request to PostgreSQL. When HAproxy is used for this, it is also possible to route read requests to one or more replicas, for load balancing. However, this is not a transparent process: the application needs to be aware of this and split read-only from read-write traffic itself. With HAproxy, this is done by providing two different ports for the application to connect. We opted for the following setup:

  • Writes   ?  5000
  • Reads   ?  5001

HAproxy can be installed as an independent server (and you can have as many as you want) but it can also be installed on the application server or the database server itself – it is a light enough service. For our tests, we planned on using our own Linux workstations (which also run Ubuntu 20.04) to simulate application traffic so we installed HAproxy on them:

sudo apt-get install haproxy

With the software installed, we modified the main configuration file as follows:

$ cat /etc/haproxy/haproxy.cfg
    maxconn 100

    log    global
    mode    tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

listen standbys
    balance roundrobin
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

Note there are two sections: primary, using port 5000, and standbys, using port 5001. All three nodes are included in both sections: that’s because they are all potential candidates to be either primary or secondary. For HAproxy to know which role each node currently has, it will send an HTTP request to port 8008 of the node: Patroni will answer. Patroni provides a built-in REST API support for health check monitoring that integrates perfectly with HAproxy for this:

$ curl -s http://node1:8008
{"state": "running", "postmaster_start_time": "2021-05-24 14:50:11.707 UTC", "role": "master", "server_version": 120006, "cluster_unlocked": false, "xlog": {"location": 25615248}, "timeline": 1, "database_system_identifier": "6965869170583425899", "patroni": {"version": "1.6.4", "scope": "stampede"}}

We configured the standbys group to balance read-requests in a round-robin fashion, so each connection request (or reconnection) will alternate between the available replicas. We can test this in practice, let’s save the postgres user password in a file to facilitate the process:

echo "localhost:5000:postgres:postgres:vagrant" > ~/.pgpass
echo "localhost:5001:postgres:postgres:vagrant" >> ~/.pgpass
chmod 0600 ~/.pgpass

We can then execute two read-requests to verify the round-robin mechanism is working as intended:

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"

as well as test the writer access:

$ psql -Upostgres -hlocalhost -p5000 -t -c "select inet_server_addr()"

You can also check the state of HAproxy by visiting http://localhost:7000/ on your browser.


To best simulate a production environment to test our failure scenarios, we wanted to have continuous reads and writes to the database. We could have used a benchmark tool such as Sysbench or Pgbench but we were more interested in observing the switch of source server upon a server failure than load itself. Jobin wrote a simple Python script that is perfect for this, HAtester. As was the case with HAproxy, we run the script from our Linux workstation. Since it is a Python script, you need to have a PostgreSQL driver for Python installed to execute it:

sudo apt-get install python3-psycopg2
curl -LO https://raw.githubusercontent.com/jobinau/pgscripts/main/patroni/HAtester.py
chmod +x HAtester.py

Edit the script with the credentials to access the PostgreSQL servers (through HAproxy) if you are using different settings from ours. The only requirement for it to work is to have the target table created beforehand, so first connect to the postgres database (unless you are using a different target) in the Primary and run:


You can then start two different sessions:

  1. One for writes:

    ./HAtester.py 5000
  2. One for reads:
    ./HAtester.py 5001

The idea is to observe what happens with database traffic when the environment experiences a failure; that is, how HAproxy will route reads and writes as Patroni adjusts the PostgreSQL cluster. You can continuously monitor Patroni from the point of view of the nodes by opening a session in each of them and running the following command:

sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

To facilitate observability and better follow the changes in real-time, we used the terminal multiplexer Tmux to visualize all 5 sessions on the same screen:

  • On the left side, we have one session open for each of the 3 nodes, continuously running:

    sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

    It’s better to have the Patroni view for each node independently because when you start the failure tests you will lose connection to a part of the cluster.

  • On the right side, we are executing the HAtester.py script from our workstation:
    • Sending writes through port 5000:

      ./HAtester.py 5000
    • and reads through port 5001:

      ./HAtester.py 5001

A couple of notes on the execution of the HAtester.py script:

  • Pressing Ctrl+C will break the connection but the script will reconnect, this time to a different replica (in the case of reads) due to having the Standbys group on HAproxy configured with round-robin balancing.
  • When a switchover or failover takes place and the nodes are re-arranged in the cluster, you may temporarily see writes sent to a node that used to be a replica and was just promoted as primary and reads send to a node that used to be the primary and was demoted as secondary: that’s a limitation of the HAtester.py script but “by design”; we favored faster reconnections and minimal checks on the node’s role for demonstration purposes. On a production application, this part ought to be implemented differently.

Testing Failure Scenarios

The fun part starts now! We leave it to you to test and play around to see what happens with the PostgreSQL cluster in practice following a failure. We leave as suggestions the tests we did in our presentation. For each failure scenario, observe how the cluster re-adjusts itself and the impact on read and write traffic.

1) Loss of Network Communication

  • Unplug the network cable from one of the nodes (or simulate this condition in your VM):
    • First from a replica
    • Then from the primary
  • Unplug the network cable from one replica and the primary at the same time:
    • Does Patroni experience a split-brain situation?

2) Power Outage

  • Unplug the power cable from the primary
  • Wait until the cluster is re-adjusted then plug the power cable back and start the node


Simulate an OOM/crash by killing the postmaster process in one of the nodes with kill -9.

4) Killing Patroni

Remember that Patroni is managing PostgreSQL. What happens if the Patroni process (and not PostgreSQL) is killed?

5) CPU Saturation

Simulate CPU saturation with a benchmark tool such as Sysbench, for example:

sysbench cpu --threads=10 --time=0 run

This one is a bit tricky as the reads and writes are each single-threaded operation. You may need to decrease the priority of the HAtester.py processes with renice, and possibly increase that of Sysbench’s.

6) Manual Switchover

Patroni facilitates changes in the PostgreSQL hierarchy. Switchover operations can be scheduled, the command below is interactive and will prompt you with options:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover

Alternatively, you can be specific and tell Patroni exactly what to do:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover --master node1 --candidate node2 --force

We hope you had fun with this hands-on lab! If you have questions or comments, leave us a note in the comments section below!


postgres_fdw Enhancement in PostgreSQL 14

postgres_fdw Postgresql 14

It’s exciting times in the PostgreSQL world with the version 14 beta released a few days ago. It’s now time to look under the hood and find out what has changed from version 13, see what has improved, and what behaviors are changed that we should be aware of. Putting it all in a single blog would take weeks to write and days to read, so this one will focus solely on the changes expected in the Foreign Data Wrapper in the GA release of PostgreSQL version 14.

Foreign Data Wrappers (FDWs) provide a mechanism by which regular SQL queries can be used to access data that resides outside PostgreSQL. There are many different FDWs available, however, and PostgreSQL comes with a “File FDW” and a “PostgreSQL FDW”. PostgreSQL FDW may seem counterintuitive, but, it is an extremely useful feature. And there have been some very useful updates to this FDW.

So, let’s start understanding what has changed.

Performance Feature

If you are already using PostgreSQL FDW for any use case, take note of the performance improvements.

1 – Parallel /Async Foreign Scans

(Allow a query referencing multiple foreign tables to perform foreign table scans in parallel)

Remote aggregations and remote joins might have been a performance nightmare when performed across multiple servers. The performance benefit comes from the parallelization of ForeignScan which can now be executed in parallel asynchronously. The sequential execution previously was very slow, and in some cases, too slow. For this, a new server option is added “async_capable” which allows for parallel planning and execution of ForeignScan.

Create Servers and User-Mappings

-- Create foreign server 1.
CREATE SERVER postgres_svr1
OPTIONS (host '', async_capable "true");

-- Create foreign server 2.
CREATE SERVER postgres_svr2
OPTIONS (host '', async_capable "true");

SERVER postgres_svr1
OPTIONS (user 'postgres', password 'pass');

SERVER postgres_svr2
OPTIONS (user 'postgres', password 'pass');

Create Local Tables

CREATE TABLE child_local1 (a INTEGER, b CHAR, c TEXT, d VARCHAR(255));
CREATE TABLE child_local2 (a int, b CHAR, c text, d VARCHAR(255));
GRANT ALL ON child_local1 to postgres;
GRANT ALL ON child_local2 to postgres;

Create Foreign TABLES

PARTITION OF parent_local VALUES FROM 1000 TO 2000
SERVER postgres_svr1 OPTIONS table_name 'child_local1');

PARTITION OF parent_local
SERVER postgres_svr2 OPTIONS table_name 'child_local2');

Now try that, and see the plan tree, now you can see two Async Foreign plans in the tree.

CREATE TABLE sample_table (a INTEGER, b CHAR, c TEXT, d VARCHAR(255));

INSERT INTO sample_table SELECT * FROM parent_local WHERE a % 100 = 0;
                                          QUERY PLAN
 Insert on public.sample_table
   ->  Append
         ->  Async Foreign Scan on public.parent_remote1 parent_local_1
               Output: parent_local_1.a, parent_local_1.b, parent_local_1.c, parent_local_1.d
               Remote SQL: SELECT a, b, c, d FROM public.child_local1 WHERE (((a % 100) = 0))
         ->  Async Foreign Scan on public.parent_remote2 parent_local_2
               Output: parent_local_2.a, parent_local_2.b, parent_local_2.c, parent_local_2.d
               Remote SQL: SELECT a, b, c, d FROM public.child_local2 WHERE (((a % 100) = 0))
(8 rows)

2 – Bulk Insert

(Allow postgres_fdw to INSERT rows in bulk.)

Now bulk insert functionality has been added to Foreign Data Wrapper, and postgres_fdw is now supporting that function. There is an opportunity for other foreign data wrappers to implement bulk insert. A complete blog can be seen here.

Functional Features


Allow TRUNCATE to operate on foreign tables.

Foreign Data Wrapper is enhanced to support TRUNCATE command on the foreign table as the target. That means it issues a TRUNCATE command shipped to the foreign server and executes on the table. Fortunately, this functionality is implemented in postgres_fdw. Here is an example of that.

CREATE SERVER postgres_svr 
       FOREIGN DATA WRAPPER postgres_fdw 
       OPTIONS (host '');

       SERVER postgres_svr
       OPTIONS (user 'postgres', password 'pass');

                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
       SERVER postgres_svr
       OPTIONS(table_name 'foo_local');

Now a foreign table can be truncated using the TRUNCATE command.

postgres=# TRUNCATE foo_remote;

2 – LIMIT TO Child Partitioning

(Allow postgres_fdw to import table partitions if specified by IMPORT FOREIGN SCHEMA … LIMIT TO.)

The postgres_fdw does not allow the import of table partitions, because data can be accessed using the root partition. But in case the user wants to import the partitioned table partitioning, PostgreSQL 14 added a new option call “LIMIT TO”. Create a new schema on a remote machine and add a parent table “foo_schema.foo_table_parent” and one child table “foo_schema.foo_table_child”.

postgres=# \d+ foo_schema.*
                                   Table "foo_schema.foo_table_child"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
 a      | integer |           |          |         | plain   |             |              | 
Partition of: foo_schema.foo_table_parent FOR VALUES FROM (0) TO (10)

Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 10))
Access method: heap

                            Partitioned table "foo_schema.foo_table_parent"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
 a      | integer |           |          |         | plain   |             |              | 
Partition key: RANGE (a)
Partitions: foo_schema.foo_table_child FOR VALUES FROM (0) TO (10)

Import schema without specifying the LIMIT TO, you can see only the parent table is imported.

IMPORT FOREIGN SCHEMA foo_schema FROM SERVER postgres_svr INTO bar_schema;
postgres=# \d+ bar_schema.*
                                 Foreign table "bar_schema.foo_table_parent"
 Column |  Type   | Collation | Nullable | Default |    FDW options    | Storage | Stats target | Description 
 a      | integer |           |          |         | (column_name 'a') | plain   |              | 
Server: postgres_svr
FDW options: (schema_name 'foo_schema', table_name 'foo_table_parent')

If you explicitly specified partitioned table into LIMIT TO clause then it will import that table.

postgres=# IMPORT FOREIGN SCHEMA foo_schema LIMIT TO (foo_table_parent, foo_table_child)
            FROM SERVER loopback INTO bar_schema;

postgres=# \d+ bar_schema.*
                                  Foreign table "bar_schema.foo_table_child"
 Column |  Type   | Collation | Nullable | Default |    FDW options    | Storage | Stats target | Description 
 a      | integer |           |          |         | (column_name 'a') | plain   |              | 
Server: loopback
FDW options: (schema_name 'foo_schema', table_name 'foo_table_child')

                                 Foreign table "bar_schema.foo_table_parent"

 Column |  Type   | Collation | Nullable | Default |    FDW options    | Storage | Stats target | Description 
 a      | integer |           |          |         | (column_name 'a') | plain   |              | 
Server: loopback
FDW options: (schema_name 'foo_schema', table_name 'foo_table_parent')

3 – Active and Valid Connection List

(Add postgres_fdw function postgres_fdw_get_connections to report open foreign server connections)

 A new function postgres_fdw_get_connections() is added. The function returns the open connection names local session to the foreign servers of postgres_fdw. It also outputs the validity of the connection.

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
    server_name    | valid 
 postgres_svr      | t
 postgres_svr_bulk | t
(2 rows)

Now disconnect all connections connection and try the query again.

postgres=# SELECT 1 FROM postgres_fdw_disconnect_all();
(1 row)

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name | valid 
(0 rows)

4 – Keep Connections

(Add postgres_fdw functions to discard cached connections)

A new option keep_connections is added to keep the connections alive so that subsequent queries can reuse them. By default, this option is on, but when turned off, the connections will be discarded as the transactions end.

Set the option off

ALTER SERVER loopback OPTIONS (keep_connections 'off');

Establish the connection using the remote query.

postgres=# BEGIN;

postgres=*# select * from foo_remote;
 a | b | c | d 
(0 rows)

postgres=*# END;

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name | valid 
(0 rows)

Set the keep_connections option on

ALTER SERVER postgres_svr options (set keep_connections 'on');


postgres=# BEGIN;
postgres=*# select * from foo_remote;
 a | b | c | d 
(0 rows)

-- Establish the connection using the remote query.

postgres=*# END;

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name  | valid 
 postgres_svr | t
(1 row)

5 – Reestablish Brocken Connection

(Allow postgres_fdw to reestablish foreign server connections if necessary)

Previously when the remote server restarted and the postgres_fdw connection was broken, then the error was thrown because the cached connection is no longer available. This is fixed in PostgreSQL, and, in any case, the connection is broken and no longer exists in the cache, and postgres_fdw will establish the connection.


It is quite promising that the Foreign Data Wrapper API is expanding over every release, but PostgreSQL 14 provides some user-centric new features. The performance-related improvements give another reason to use FDWs for many relevant use cases. The functionality will surely be added in the next few versions, making these more performant and easy to use.


New Features in PostgreSQL 14: Bulk Inserts for Foreign Data Wrappers

PostgreSQL 14 Bulk Inserts for Foreign Data Wrappers

Foreign Data Wrapper based on SQL-MED is one the coolest features of PostgreSQL. The feature set of foreign data wrapper is expanding since version 9.1. We know that the PostgreSQL 14 beta is out and GA will be available shortly, therefore it is helpful to study the upcoming features of PostgreSQL 14. There are a lot of them, along with some improvements in foreign data wrapper. A new performance feature, “Bulk Insert“, is added in PostgreSQL 14. The API is extended and allows bulk insert of the data into the foreign table, therefore, using that API, any foreign data wrapper now can implement Bulk Insert. It is definitely more efficient than inserting individual rows.

The API contains two new functions, which can be used to implement the bulk insert.

There is no need to explain these functions here because it is useful for people interested in having that functionality into their foreign data wrapper like mysql_fdw, mongo_fdw, and oracle_fdw. If someone is interested to see that, they can see it in the PostgreSQL documentation. But the good news is, postgres_fdw already implement that and have that in PostgreSQL 14.

There is a new server option is added which is batch_size, and you can specify that when creating the foreign server or creating a foreign table.

  • Create a postgres_fdw extension
CREATE EXTENSION postgres_fdw;

  • Create a foreign server without batch_size
CREATE SERVER postgres_svr 
       FOREIGN DATA WRAPPER postgres_fdw 
       OPTIONS (host '');

       SERVER postgres_svr
       OPTIONS (user 'postgres', password 'pass');

                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
       SERVER postgres_svr
       OPTIONS(table_name 'foo_local');

EXPLAIN (VERBOSE, COSTS OFF) insert into foo_remote values (generate_series(1, 1), 'c', 'text', 'varchar');
                                                QUERY PLAN                                                 
 Insert on public.foo_remote
   Remote SQL: INSERT INTO public.foo_local(a, b, c, d) VALUES ($1, $2, $3, $4)
   Batch Size: 1
   ->  ProjectSet
         Output: generate_series(1, 1), 'c'::character(1), 'text'::text, 'varchar'::character varying(255)
         ->  Result
(6 rows)

  • Execution time with batch_size not specified
        INSERT INTO foo_remote
        VALUES (generate_series(1, 100000000),
                                                       QUERY PLAN                                                        
Insert on foo_remote  (cost=0.00..500000.02 rows=0 width=0) (actual time=4591443.250..4591443.250 rows=0 loops=1)
   ->  ProjectSet  (cost=0.00..500000.02 rows=100000000 width=560) (actual time=0.006..31749.132 rows=100000000 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
Planning Time: 4.988 ms
Execution Time: 4591447.101 ms -- timing is important
(5 rows)

  • Create a foreign table with batch_size = 10, in case no batch_size is specified with server creation
                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
        SERVER postgres_svr OPTIONS(table_name 'foo_local', batch_size '10');

  • Create a foreign server with batch_size = 10, now every table of that server will use the batch_size 10
CREATE SERVER postgres_svr_bulk
       FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host '', batch_size = '10'); -- new option batch_size

       SERVER postgres_svr
       OPTIONS (user 'postgres', password 'pass');

                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
        SERVER postgres_svr OPTIONS(table_name 'foo_local');

EXPLAIN (VERBOSE, COSTS OFF) insert into foo_remote_bulk values (generate_series(1, 1), 'c', 'text', 'varchar');

                                                QUERY PLAN                                                 

Insert on public.foo_remote_bulk
   Remote SQL: INSERT INTO public.foo_local_bulk(a, b, c, d) VALUES ($1, $2, $3, $4)
   Batch Size: 10
   ->  ProjectSet
         Output: generate_series(1, 1), 'c'::character(1), 'text'::text, 'varchar'::character varying(255)
         ->  Result
(6 rows)

  • Execution time with batch_size = 10:
        INSERT INTO foo_remote_bulk
        VALUES (generate_series(1, 100000000),
                                                       QUERY PLAN                                                        
 Insert on foo_remote_bulk  (cost=0.00..500000.02 rows=0 width=0) (actual time=822224.678..822224.678 rows=0 loops=1)
   ->  ProjectSet  (cost=0.00..500000.02 rows=100000000 width=560) (actual time=0.005..10543.845 rows=100000000 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
 Planning Time: 0.250 ms
 Execution Time: 822239.178 ms -- timing is important
(5 rows)


PostgreSQL is expanding the feature list of foreign data wrappers, and Bulk Insert is another good addition. As this feature is added to the core, I hope all other foreign data wrappers will implement it as well.


Percona Distribution for PostgreSQL Operator Technical Preview Released

Percona Distribution for PostgreSQL Operator

Percona is championing open source database software and we are committed to running our products on Kubernetes. We don’t only want to run the software, but make sure that the database is highly available, secure, and follows best practices. We also focus on day-2 operations such as scaling, backup and restore, disaster recovery, and customization.

To get there, we have Operators – the software framework that extends Kubernetes APIs and provides control over database deployment and operations through the control plane. Until May we had two Operators:

The only missing piece was Percona Distribution for PostgreSQL, for which we introduced the Operator during Percona Live in May 2021. This completes our vision for deploying our software on Kubernetes. See the release notes of the initial version here.

Kubernetes Operator FAQ

This blog post is intended to answer some frequently asked questions we received from our community about Percona Distribution for PostgreSQL Operator.

Is This a Brand New Operator?

No. Our Operator is based on PGO, the Postgres Operator from Crunchy Data, which we modified and enhanced in order to support our PostgreSQL distribution.

Why CrunchyData Operator?

As noted above, we are committed to running our database software on Kubernetes. There are multiple ways to achieve this goal:

  1. Develop a new Operator from scratch
  2. Collaborate and contribute necessary changes to an existing Operator
  3. Fork an  existing Operator

Option (1) looks great, but it is time and effort-intensive, and we might be re-inventing existing wheels. Our goal is to minimize Time To Market (TTM), so we dropped this option right away.

For options (2) and (3), there are at least three different PostgreSQL Operators in active development:

Stackgres is written in Java, and our engineering team is more familiar with C/C++ and Golang. We do not see that changing in the near future. Zalando Operator is great and provides a lot of functionality out of the box, but our Engineering team estimated the effort to perform the needed changes almost similar to writing the Operator from scratch.

PGO is written in Golang and provides the features we were looking for: high availability with Patroni, scaling, backups, and many more. Our engineering team did not flag any complexity of introducing the changes and we jumped to work.

Will Percona use PGO as an Upstream?

For now – yes. We will be merging new features implemented in PGO into our fork. Version 0.1.0 of our Operator is based on the 4.6.2 version of PGO. Version 0.2.0 will be based on version 4.7.X. At the same time, we want to contribute back some of our changes to the upstream and have already sent some pull requests (one, two). We’ll continue submitting patches to the upstream project.

What is Different About Percona Operator?

The main differences were highlighted in the release notes. Here they are:

  • Percona Distribution for PostgreSQL is now used as the main container image. CrunchyData container images are provided under Crunchy Data Developer Program, which means that without an active contract they could not be used for production. Percona container images are fully open source and do not have any limitations for use.
  • It is possible to specify custom images for all components separately. For example, users can easily build and use custom images for one or several components (e.g. pgBouncer) while all other images will be the official ones. Also, users can build and use all custom images.
  • All container images are reworked and simplified. They are built on Red Hat Universal Base Image (UBI) 8.
  • The Operator has built-in integration with Percona Monitoring and Management (PMM) v2.
  • A build/test infrastructure was created, and we have started adding e2e tests to be sure that all pieces of the cluster work together as expected.
  • We have phased out the PGO CLI tool, and the Custom Resource UX will be completely aligned with other Percona Operators in the following release.

For future releases, our goal is to cover the feature and UX parity between the Operators, so that our users will have the same look and feel for all three database engines.

What Does Tech Preview Mean?

Tech Preview Features are not yet ready for production use and are not included in support via SLA (Service License Agreement). They are included in this release so that users can provide feedback prior to the full release of the feature in a future GA (General Availability) release (or removal of the feature if it is deemed not useful). This functionality can change (APIs, CLIs, etc.) from tech preview to GA.

When is GA Coming and What is Going to be Included?

Our goal is to release the GA version early in Q3. The changes in this version would include:

  • Moving control over replicas to the main Custom Resource instead of managing them separately
  • Change the main manifest to provide the same look and feel as in other Percona Operators
  • Rework scheduled backups and control them with main CR and Kubernetes primitives
  • Add support for Google Cloud Storage (this will be merged from upstream)

Call for Action

To install our Operator and learn more about it please read the documentation.

Our Operator is licensed under Apache 2.0 and can be found in percona-postgresql-operator repository on Github. There are multiple ways to contact us or share your ideas:

  • To report a bug use jira.percona.com and create the bug in K8SPG project
  • For general questions and sharing your thoughts, we have a community forum or Discord where we chat about open source, databases, Kubernetes, and many more.
  • We have a public roadmap where you can see what is planned and what is under development. You can share your ideas about new features there as well.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com