PostgreSQL Database Security: OS – Authentication

PostgreSQL Database Security OS - Authentication

Security is everybody’s concern when talking about data and information, and therefore it becomes the main foundation of every database. Security means protecting your data from unauthorized access. That means only authorized users can log in to a system called authentication; a user can only do what they are authorized to do (authorization) and log the user activity (accounting). I have explained these in my main security post, PostgreSQL Database Security: What You Need To Know.

When we are talking about security, authentication is the first line of defense. PostgreSQL provides various methods of authentication, which are categorized into three categories.

In most cases, PostgreSQL is configured to be used with internal authentication. Therefore I have discussed all internal authentication in the previous blog post I mentioned above. In this blog, we will discuss the operating system-based authentication methods for PostgreSQL. There are three methods to do OS-Based authentication.


Ident authentication only supports TCP/IP connections. Its ident server provides a mechanism to map the client’s operating system username onto the database username. It also has the option for username mapping.

# TYPE  DATABASE        USER            ADDRESS                 METHOD
# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all               ident
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the

$ psql postgres -h -U postgres
psql: error: connection to server at "", port 5432 failed: FATAL:  Ident authentication failed for user "postgres"

If no ident server is installed, you will need to install the ident2 on your ubuntu box or oidentd on CentOS 7. Once you have downloaded and configured the ident server, it is now time to configure PostgreSQL. It starts with creating a user map in “pg_ident.conf” file.

# Put your actual configuration here
# ----------------------------------
PG_USER         vagrant                 postgres

Here we have mapped our system user “vagrant” user with PostgreSQL’s “postgres.” Time to login using the user vagrant.

$ psql postgres -h -U postgres
psql (15devel)
Type "help" for help.


Note: The Identification Protocol is not intended as an authorization or access control protocol.

PAM (Pluggable Authentication Modules)

PAM (Pluggable Authentication Modules) authentication works similarly to “passwords.” You’d have to create a PAM service file that should enable PAM-based authentication. The service name should be set to “PostgreSQL.”

Once the service is created, PAM can now validate user name/password pairs and optionally the connected remote hostname or IP address. The user must already exist in the database for PAM authentication to work.

$ psql postgres -h -U postgres
Password for user postgres: 
2021-08-11 13:16:38.332 UTC [13828] LOG:  pam_authenticate failed: Authentication failure
2021-08-11 13:16:38.332 UTC [13828] FATAL:  PAM authentication failed for user "postgres"
2021-08-11 13:16:38.332 UTC [13828] DETAIL:  Connection matched pg_hba.conf line 91: "host    all             all               pam"
psql: error: connection to server at "", port 5432 failed: FATAL:  PAM authentication failed for user "postgres"

Ensure that the PostgreSQL server supports PAM authentication. It is a compile-time option that must be set when the server binaries were built. You can check if your PostgreSQL server supports PAM authentication using the following command.

$ pg_config | grep with-pam

CONFIGURE =  '--enable-tap-tests' '--enable-cassert' '--prefix=/usr/local/pgsql/' '--with-pam'

In case there is no PAM server file for PostgreSQL under /etc/pam.d, you’d have to create it manually. You may choose any name for the file; however, I prefer to name it “postgresql.”

$ /etc/pam.d/PostgreSQL

@include common-auth
@include common-account
@include common-session
@include common-password

Since the PostgreSQL user cannot read the password files, install sssd (SSSD – System Security Services Daemon) to bypass this limitation.

sudo apt-get install sssd

Add postgresql to the “ad_gpo_map_remote_interactive” to the “/etc/sssd/sssd.conf”

$ cat /etc/sssd/sssd.conf
ad_gpo_map_remote_interactive = +postgresql

Start sssd service, and check the status that it has properly started.

$ sudo systemctl start sssd

$ sudo systemctl status sssd

sssd.service - System Security Services Daemon

     Loaded: loaded (/lib/systemd/system/sssd.service; enabled; vendor preset: enabled)

     Active: active (running) since Wed 2021-08-11 16:18:41 UTC; 12min ago

   Main PID: 1393 (sssd)

      Tasks: 2 (limit: 1071)

     Memory: 5.7M

     CGroup: /system.slice/sssd.service

             ??1393 /usr/sbin/sssd -i --logger=files

             ??1394 /usr/libexec/sssd/sssd_be --domain shadowutils --uid 0 --gid 0 --logger=files

Time now to configure pg_hba.conf to use the PAM authentication. We need to specify the PAM service name (pamservice) as part of authentication options. This should be the same as the file you have created in the /etc/pam.d folder, which in my case is postgresql.

# "local" is for Unix domain socket connections only

local   all             all                                     trust

# IPv4 local connections:

host    all             all               pam pamservice=postgresql

# IPv6 local connections:

host    all             all             ::1/128                 trust

# Allow replication connections from localhost, by a user with the

We must now reload (or restart) the PostgreSQL server. After this, you can try to login into the PostgreSQL server.

vagrant@ubuntu-focal:~$ psql postgres -h -U postgres
psql (15devel)
Type "help" for help.


If PAM is set up to read /etc/shadow, authentication will fail because the PostgreSQL server is started by a non-root user. However, this is not an issue when PAM is configured to use LDAP or other authentication methods.


Peer authentication is “ident”ical; i.e., Very much like the ident authentication! The only subtle differences are there are no ident servers, and this method works on local connections rather than over TCP/IP.

The peer authentication provides a mechanism to map the client’s operating system username onto the database username. It also has the option for username mapping.  The configuration is very similar to how we configured for ident authentication except that the authentication method is specified as “peer” instead of “ident.”

$ cat $PGDATA/pg_hba.conf

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                   peer map=PG_USER
# IPv4 local connections:
host    all             all               trust

$ psql postgres -U postgres

2021-08-12 10:51:11.855 UTC [1976] LOG:  no match in usermap "PG_USER" for user "postgres" authenticated as "vagrant"

2021-08-12 10:51:11.855 UTC [1976] FATAL:  Peer authentication failed for user "postgres"

2021-08-12 10:51:11.855 UTC [1976] DETAIL:  Connection matched pg_hba.conf line 89: "local   all             all                                     peer map=PG_USER"

psql: error: connection to server on socket "/tmp/.s.PGSQL.5432" failed: FATAL:  Peer authentication failed for user "postgres"

$PGDATA/pg_hba.conf configuration will look something like this:

$ cat $PGDATA/pg_hba.conf
# TYPE  DATABASE        USER            ADDRESS                 METHOD
# "local" is for Unix domain socket connections only

local   all             all                                     peer map=PG_USER
# IPv4 local connections:


# Put your actual configuration here
# ----------------------------------
PG_USER         vagrant                postgres

vagrant@ubuntu-focal:~$ psql postgres -U postgres
psql (15devel)
Type "help" for help.


We’ve covered several different authentication methods in this blog. These basic authentication methods involve the PostgreSQL server, kernel, and the ident server; options are available natively without any major external dependencies. It is, however, important that the database is secured properly to prevent unauthorized access to the data.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!


Migrating PostgreSQL to Kubernetes

Migrating PostgreSQL to Kubernetes

More and more companies are adopting Kubernetes. For some it is about being cutting-edge, for some, it is a well-defined strategy and a business transformation. Developers and operations teams all over the world are struggling with moving applications that aren’t cloud-native friendly to containers and Kubernetes.

Migrating databases is always a challenge, which comes with risks and downtime for businesses. Today I’m going to show how easy it is to migrate a PostgreSQL database to Kubernetes with minimal downtime with Percona Distribution for PostgreSQL Operator.


To perform the migration I’m going to use the following setup:

Migrating PostgreSQL to Kubernetes

  1. PostgreSQL database deployed on-prem or somewhere in the cloud. It will be the Source.
  2. Google Kubernetes Engine (GKE) cluster where Percona Operator deploys and manages PostgreSQL cluster (the Target) and pgBackRest Pod
  3. PostgreSQL backups and Write Ahead Logs are uploaded to some Object Storage bucket (GCS in my case)
  4. pgBackRest Pod reads the data from the bucket
  5. pgBackRest Pod restores the data continuously to the PostgreSQL cluster in Kubernetes

The data should be continuously synchronized. In the end, I want to shut down PostgreSQL running on-prem and only keep the cluster in GKE.



To replicate the setup you will need the following:

  • PostgreSQL (v 12 or 13) running somewhere
  • pgBackRest installed
  • Google Cloud Storage or any S3 bucket. My examples will be about GCS.
  • Kubernetes cluster

Configure The Source

I have Percona Distribution for PostgreSQL version 13 running on some Linux machines.

1. Configure pgBackrest

# cat /etc/pgbackrest.conf


  • pg1-path should point to PostgreSQL data directory
  • repo1-type is set to GCS as we want our backups to go there
  • The key is in /tmp/gcs.key file. The key can be obtained through Google Cloud UI. Read more about it here.
  • The backups are going to be stored in on-prem-pg folder in sp-test-1 bucket

2. Edit


config to enable archival through pgBackrest 

archive_mode = on   
archive_command = 'pgbackrest --stanza=db archive-push %p'

Restart is required after changing the configuration.

3. Operator requires to have a


file in the data directory. It is enough to have an empty file:

touch /var/lib/postgresql/13/main/postgresql.conf



must be created on the Source to ensure replication is correctly set up by the Operator. 

# create user primaryuser with encrypted password '<PRIMARYUSER PASSWORD>' replication;

Configure The Target

1. Deploy Percona Distribution for PostgreSQL Operator on Kubernetes. Read more about it in the documentation here.

# create the namespace
kubectl create namespace pgo

# clone the git repository
git clone -b v0.2.0
cd percona-postgresql-operator

# deploy the operator
kubectl apply -f deploy/operator.yaml

2. Edit main custom resource manifest – deploy/cr.yaml.

  • I’m not going to change the cluster name and keep it cluster1
  • the cluster is going to operate in Standby mode, which means it is going to sync the data from the GCS bucket. Set




  • configure GCS itself.

    section would look like this (




    are the same as in pgBackrest configuration above)

    repoPath: "/on-prem-pg"
        type: gcs
        region: us-central1-a
        uriStyle: path
        verifyTLS: false
        bucket: sp-test-1
    storageTypes: [

  • I would like to have at least one Replica in my PostgreSQL cluster. Set

    to 1.

3. Operator should be able to authenticate with GCS. To do that we need to create a secret object called




in data. It should be the same key we used on the Source. See the example of this secret here.

kubectl apply -f gcs.yaml

4. Create users by creating Secret objects:




(the one we created on the Source). See the examples of users Secrets here. The passwords should be the same as on the Source.

kubectl apply -f users.yaml

5. Now let’s deploy our cluster on Kubernetes by applying the



kubectl apply -f deploy/cr.yaml

Verify and Troubleshoot

If everything is done correctly you should see the following in the Primary Pod logs:

kubectl -n pgo logs -f --tail=20 cluster1-5dfb96f77d-7m2rs
2021-07-30 10:41:08,286 INFO: Reaped pid=548, exit status=0
2021-07-30 10:41:08,298 INFO: establishing a new patroni connection to the postgres cluster
2021-07-30 10:41:08,359 INFO: initialized a new cluster
Fri Jul 30 10:41:09 UTC 2021 INFO: PGHA_INIT is 'true', waiting to initialize as primary
Fri Jul 30 10:41:09 UTC 2021 INFO: Node cluster1-5dfb96f77d-7m2rs fully initialized for cluster cluster1 and is ready for use
2021-07-30 10:41:18,781 INFO: Lock owner: cluster1-5dfb96f77d-7m2rs; I am cluster1-5dfb96f77d-7m2rs                                 2021-07-30 10:41:18,810 INFO: no action.  i am the standby leader with the lock                                                     2021-07-30 10:41:28,781 INFO: Lock owner: cluster1-5dfb96f77d-7m2rs; I am cluster1-5dfb96f77d-7m2rs                                 2021-07-30 10:41:28,832 INFO: no action.  i am the standby leader with the lock

Change some data on the Source and ensure that it is properly synchronized to the Target cluster.

Common Issues

The following error message indicates that you forgot to create


file in the data directory:

FileNotFoundError: [Errno 2] No such file or directory: '/pgdata/cluster1/postgresql.conf' -> '/pgdata/cluster1/postgresql.base.conf'

Sometimes it is easy to forget to create the


  and see the following in the logs:

psycopg2.OperationalError: FATAL:  password authentication failed for user "primaryuser"

Wrong or missing object store credentials will trigger the following error:

WARN: repo1: [CryptoError] unable to load info file '/on-prem-pg/backup/db/' or '/on-prem-pg/backup/db/':      CryptoError: raised from remote-0 protocol on 'cluster1-backrest-shared-repo': unable to read PEM: [218529960] wrong tag            HINT: is or was the repo encrypted?                                                                                                 CryptoError: raised from remote-0 protocol on 'cluster1-backrest-shared-repo': unable to read PEM: [218595386] nested asn1 error
      HINT: is or was the repo encrypted?
      HINT: cannot be opened and is required to perform a backup.
      HINT: has a stanza-create been performed?
ERROR: [075]: no backup set found to restore
Fri Jul 30 10:54:00 UTC 2021 ERROR: pgBackRest standby Creation: pgBackRest restore failed when creating standby


Everything looks good and it is time to perform the cutover. In this blog post, I cover only the database side but do not forget that your application should be reconfigured to point to the correct PostgreSQL cluster. It might be a good idea to stop the application before the cutover.

1. Stop the source PostgreSQL cluster to ensure no data is written

systemctl stop postgresql

2. Promote the Target cluster to primary. To do that remove


, change


to false in


, and apply the changes:

kubectl apply -f deploy/cr.yaml

PostgreSQL will be restarted automatically and you will see the following in the logs:

2021-07-30 11:16:20,020 INFO: updated leader lock during promote
2021-07-30 11:16:20,025 INFO: Changed archive_mode from on to True (restart might be required)
2021-07-30 11:16:20,025 INFO: Changed max_wal_senders from 10 to 6 (restart might be required)
2021-07-30 11:16:20,027 INFO: Reloading PostgreSQL configuration.
server signaled
2021-07-30 11:16:21,037 INFO: Lock owner: cluster1-5dfb96f77d-n4c79; I am cluster1-5dfb96f77d-n4c79
2021-07-30 11:16:21,132 INFO: no action.  i am the leader with the lock


Deploying and managing database clusters is not an easy task. Recently released Percona Distribution for PostgreSQL Operator automates day-1 and day-2 operations and turns running PostgreSQL on Kubernetes into a smooth and pleasant journey.

With Kubernetes becoming the default control plane, the most common task for developers and operations teams is to perform the migration, which usually turns into a complex project. This blog post shows that database migration can be an easy task with minimal downtime.

We encourage you to try out our operator. See our github repository and check out the documentation.

Found a bug or have a feature idea? Feel free to submit it in JIRA.

For general questions please raise the topic in the community forum.

Are you a developer and looking to contribute? Please read our and send the Pull Request.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!


Improve PostgreSQL Query Performance Insights with pg_stat_monitor

Improve PostgreSQL Query Performance pg_stat_monitor

Understanding query performance patterns is essentially the foundation for query performance tuning. It, in many ways, dictates how a database cluster evolves. And then there are obviously direct and indirect cost connotations as well.

PostgreSQL provides very detailed statistics through a number of catalog views and extensions that can be easily added to provide more detailed query statistics. With each view focused on a particular aspect, the picture almost always needs to be stitched together by combining different datasets. That requires effort and still, the whole picture might not be complete.

The pg_stat_monitor extension attempts to provide a more holistic picture by providing much-needed query performance insights in a single view. The extension has been evolving over the past year and is now nearing the GA release.

Some Useful Extensions

Currently, you may be relying on a number of extensions to understand how a query behaves, the time taken in planning and execution phases, min/max/meantime values, index hits, query plan, and client application details. Here are some extensions that you might already be very familiar with.


This view is available by default with PostgreSQL. It provides one row per server process along with current activity and query text.

In case you’d like to learn more about it, hop over to the official PostgreSQL documentation here.


This extension is part of the contrib packages provided with the PostgreSQL server. However, you’d have to create the extension manually. It’s a query-wise aggregation of statistical data with min/max/mean/standard deviation for execution and planning times and various useful information and query text.

You can read more about pg_stat_statements at the official PostgreSQL documentation site.


Another useful extension is provided by the PostgreSQL server. It dumps query plans in the server log for any query exceeding a time threshold specified by a GUC

(Grand Unified Configuration).

You can find more about auto_explain here.


Whilst all previously mentioned views/extensions are great in their own right, one needs to manually combine client/connection information from pg_stat_activity, statistical data from pg_stat_statements, and query plan from auto_analyze to complete the dataset to understand query performance patterns

And that’s precisely the pain that pg_stat_monitor alleviates.

The feature set has been growing over the past year, with it providing, in a single view, all performance-related information that you may need to debug a low performant query. For more information about the extension see our GitHub repository, or for user-specific documentation, see our user guide.

Feature Set

Some features that were part of earlier releases are already discussed in this blog, however, for completeness, I’m going to discuss those here as well.

  • Time Interval Grouping: Instead of supplying one set of ever-increasing counts, pg_stat_monitor computes stats for a configured number of time intervals; time buckets. This allows for much better data accuracy, especially in the case of high resolution or unreliable networks.
  • Multi-Dimensional Grouping:  While pg_stat_statements groups counters by (userid, dbid, queryid),  pg_stat_monitor uses a more detailed group for higher precision:
    • Bucket ID (bucket),
    • User ID (userid),
    • Database ID (dbid),
    • Query ID (queryid),
    • Client IP Address (client_ip),
    • Plan ID (planid),
    • Application Name (application_name).

This allows you to drill down into the performance of queries coming from particular client addresses and applications, which we at Percona have found to be very valuable in a number of cases.

  • Capture Actual Parameters in the Queries: pg_stat_monitor allows you to choose if you want to see queries with placeholders for parameters or actual query examples.
  • Query Plan: Each SQL is now accompanied by its actual plan that was constructed for its execution. Also, we found having query parameter values is very helpful, as you can run EXPLAIN on it, or easily play with modifying the query to make it run better, as well as making communication about the query clearer when discussing with other DBAs and application developers.
  • Tables Access Statistics for a Statement: This allows us to easily identify all queries that accessed a given table. This set is at par with the information provided by the pg_stat_statements.
  • Histogram: Visual representation is very helpful when it can help identify issues. With the help of the histogram function, you can now view a timing/calling data histogram in response to a SQL query. And yes, it even works in psql.
SELECT * FROM histogram(0, 'F44CD1B4B33A47AF') AS a(range TEXT, freq INT, bar TEXT);
       range        | freq |              bar
  (0 - 3)}          |    2 | ??????????????????????????????
  (3 - 10)}         |    0 |
  (10 - 31)}        |    1 | ???????????????
  (31 - 100)}       |    0 |
  (100 - 316)}      |    0 |
  (316 - 1000)}     |    0 |
  (1000 - 3162)}    |    0 |
  (3162 - 10000)}   |    0 |
  (10000 - 31622)}  |    0 |
  (31622 - 100000)} |    0 |
(10 rows)

  • Functions: It may come as a surprise, but we do understand that functions may internally execute statements!!! To help ease the tracking and analysis, pg_stat_monitor now provides a column that specifically helps keep track of the top query for a statement so that you can backtrack to the originating function.
  • Relation Names: Relations used in a query are available in the “relations” column in the pg_stat_monitor view. This reduces work at your and makes analysis simpler and quicker.
  • Query Types: With query classification as SELECT, INSERT, UPDATE or DELETE, analysis becomes simpler. It’s another effort reduced at your end, and another simplification by pg_stat_monitor.
SELECT bucket, substr(query,0, 50) AS query, cmd_type FROM pg_stat_monitor WHERE elevel = 0;
 bucket |                       query                       | cmd_type 
      4 | END                                               | 
      4 | SELECT abalance FROM pgbench_accounts WHERE aid = | SELECT
      4 | vacuum pgbench_branches                           | 
      4 | select count(*) from pgbench_branches             | SELECT
      4 | UPDATE pgbench_accounts SET abalance = abalance + | UPDATE
      4 | truncate pgbench_history                          | 
      4 | INSERT INTO pgbench_history (tid, bid, aid, delta | INSERT

  • Query Metadata: Google’s Sqlcommenter is a useful tool that in a way bridges that gap between ORM libraries and understanding database performance. And we support it. So, you can now put any key value data in the comments in /* … */ syntax (see Sqlcommenter documentation for details) in your SQL statements, and the information will be parsed by pg_stat_monitor and made available in the comments column in pg_stat_monitor view.
CREATE FUNCTION text_to_hstore(s text) RETURNS hstore AS $$
    RETURN hstore(s::text[]);

SELECT 1 AS num /* { "application", java_app, "real_ip",} */;
(1 row)

SELECT query, text_to_hstore(comments)->'real_ip' AS real_ip from pg_stat_monitor;
query                                                                       |  real_ip 
 SELECT $1 AS num /* { "application", psql_app, "real_ip", */  |

  • Logging Error and Warning: As seen in different monitoring/statics collector tools, most of the tools/extensions only monitor the successful queries. But in many cases, monitoring ERROR, WARNING, and LOG give meaningful information to debug the issue. pg_stat_monitor not only monitors the ERROR/WARNINGS/LOG but also collects the statistics about these queries. In PostgreSQL queries with ERROR/WARNING there is an error level (elevel), SQL Code (sqlcode), and an error message is attached. Pg_stat_monitor collects all this information along with its aggregates.
SELECT substr(query,0,50) AS query, decode_error_level(elevel) AS elevel,sqlcode, calls, substr(message,0,50) message 
FROM pg_stat_monitor;
                       query                       | elevel | sqlcode | calls |                      message                      
 select substr(query,$1,$2) as query, decode_error |        |       0 |     1 | 
 select bucket,substr(query,$1,$2),decode_error_le |        |       0 |     3 | 
 select 1/0;                                       | ERROR  |     130 |     1 | division by zero

We’ve Come a Long Way

What started as a concept is now nearing its final approach. The pg_stat_monitor extension has evolved and has become very feature-rich. We have no doubt about its usefulness for DBAs, performance engineers, application developers, and anyone who needs to look at query performance. We believe it can help save many hours and help identify unexpected query behaviors. 

pg_stat_monitor is available on Github. We are releasing it to get feedback from the community on what we’re doing right and what we should do differently before we release pg_stat_monitor as a generally available version to be supported for years to come. Please check it out,  drop us a note, file an issue, or make a pull request!

Try Percona Distribution for PostgreSQL today; it’s free to download and use! 


Understanding pg_repack: What Can Go Wrong – and How to Avoid It

Understanding pg_repack

pg_repack is one of the oldest, widely used, extension projects for PostgreSQL. It is so much popular that even DBaaS service providers couldn’t avoid it. It is a “power tool” in the hands of a DBA to deal with bloated/fragmented tables. I can’t imagine a serious production deployment without it these days. It magically replaces the bloated, fragmented tables with a fresh fully packed table without holding an exclusive lock on the table during its processing**. This extension made PostgreSQL’s built-in commands like VACUUM FULL and CLUSTER almost useless.

**Momentarily AccessExclusive locks are required. Please see the discussion below.

But unfortunately, regular and repeated usage of pg_repack increased the comfort levels of everyone (myself included). As a consequence, it is randomly recommended as a solution for everything, as a panacea. But I recently started coming across cases like a scheduled job doing pg_repack of every table. This time it rang an alarm bell and it started to feel like overuse of antibiotics. So I thought about writing about how pg_repack works and how it can affect your system while it is running for a regular user who doesn’t want to do a detailed study. because a better understanding might help for an informed decision on where it should be used.

pg_repack has a lot of functionalities (options). In this blog post, my intention is to discuss only how it does the basic “repack”ing of a fragmented/bloated table.

Pre-Checks and Ensuring Sanity

This is to ensure that pg_repack is installed and extension is available in the database, we are running the pg_repack as a superuser, etc. It creates a provision for cleaning up temporary objects created by pg_repack. It collects metadata about tables and associated objects like Indexes, Toast, Triggers, etc. because tracking of tables and associated objects is very important. Objects like invalid indexes, conflicting triggers, etc. are checked. We are not going to discuss this in detail.

The actual processing starts with obtaining an advisory lock on the table’s OID, to make sure no other pg_repack is working on the table. This is not so much a concern with full-table repacks, but mainly so that index-only repacks don’t interfere with each other or a full-table repack. If there are other pg_repaks in progress, repack attempts may get a message as follows and exit.

ERROR: Another pg_repack command may be running on the table. Please try again later.

Create temporary objects while holding AccessExclusive lock on the table

Yes. Correct, pg_repack needs the heavyweight AccessExclusive lock. But temporarily.  pg_repack attempts to gain a table ACCESS EXCLUSIVE lock by executing a statement as follows:


An Access Exclusive lock requires that no other sessions are accessing the table, Not even a SELECT query. pg_repack wait for a “wait-time” (60 seconds by default). This wait-time can be changed using the optional parameter --wait-timeout. Once this wait-time is over pg_repack will start trying to cancel the conflicting statements.  Users may see messages as follows for each of the attempts.
WARNING: canceling conflicted backends
WARNING: canceling conflicted backends

So the point to be noted is:

So please avoid running pg_repack when there is a lot of concurrent activities on the table. because the new sessions are allowed to get conflicting locks like ACCESS SHARE concurrently and a session waiting for ACCESS EXCLUSIVE lock may need to wait indefinitely. We should be selecting a proper time window of low activity for pg_repack work

These cancel attempts will continue for another round of “wait-timeout” seconds. But even after attempting for this second round of wait-timeout, if the AcessExclusive lock is not obtained, it escalates to termination of every conflicting session. So pg_repack will terminate sessions if the total wait time exceeds double the “wait-time”. This could be problematic for application connections that get abruptly get terminated. This can lead to outages if the application layer does not handle it gracefully.

The pg_repack may emit a message as follows:

WARNING: terminating conflicted backends

And there will be PostgreSQL log entries as follows for each of the sessions which is killed

2021-06-17 06:28:46.317 UTC [2761] FATAL: terminating connection due to administrator command
2021-06-17 06:28:46.317 UTC [2758] FATAL: terminating connection due to administrator command

So the point to be noted is: pg_repack can terminate sessions if they stand against obtaining AcessExclusive lock which can lead to unexpected outage or misbehavior of the application.

If the double the wait time is crossed, pg_repack may just proceed with terminate the sessions. But, again this behavior also can be controlled using a parameter --no-kill-backend.  if this parameter is specified, pg_repack will respect all concurrent sessions and cancel itself instead of attempting to canceling or terminating other sessions
pg_repack may emit a message as follows

WARNING: timed out, do not cancel conflicting backends
INFO: Skipping repack public.t1 due to timeout

I believe this is more desirable in production systems. So the point to be noted is:

Always remember to specify --no-kill-backend whenever you deal with critical systems

When AccessExclusive is locked, pg_repack creates all temporary objects including the substituted table with the same structure.

It creates a primary key TYPE as per the original table. For example, if we are repacking a table with a primary key of a single field “id”, the primary key type definition would look like:

CREATE TYPE repack.pk_16423 AS (id integer)

This TYPE of definition is useful because there can be composite keys. The following is an example of it dealing with tables of a composite key.
CREATE TYPE repack.pk_16824 AS (id integer, id1 integer, id2 timestamp without time zone);

Then It proceeds to create a “log” table to capture all the data changes (CDC-Change data capture) during the pg_repack operation. This table will have a primary key of BIGINT data type, Primary key type of the original table created in the above step, and “row” datatype of the table which we are repacking. This row can hold the entire tuple information of the table which we repack.
This log table definition is easy because of the TYPE definition and the “row” type. here is an example

CREATE TABLE repack.log_16423 (id bigserial PRIMARY KEY, pk repack.pk_16423, row public.t1)

Now pg_repack creates a trigger on the table to be repacked so that whenever there is DML on the table, corresponding information needs to be captured to the log table created above. This is done using an AFTER INSERT OR DELETE OR UPDATE trigger. For example:


Since the pg_repack is holding the AccessExclusive lock at this stage, there won’t be any concurrent DMLs at this stage, which is going to change in the following stages.

pg_repack plays an important trick at this stage before releasing AcessExclusive lock as the comment in the source code says:

/* While we are still holding an AccessExclusive lock on the table, submit
* the request for an AccessShare lock asynchronously from conn2.
* We want to submit this query in conn2 while connection's
* transaction still holds its lock, so that no DDL may sneak in
* between the time that connection commits and conn2 gets its lock.

Yes, pg_repack uses another connection to the database and sends an AccessShare lock request through that. This prevents any DDL in between the switch to AccessShare lock. The moment the main connection commits, An AccessShare lock will be granted to the second connection. So pg_repack uses two database connections to carry out the work.

But still, there is a chance that some DDL can interfere. So pg_repack kills any concurrent DDL against the table by default.

Once this stage is complete pg_repack can proceed with releasing AccessExclusive lock on the first connection by COMMITing the transaction. So that the AccessShare lock request by the second connection will be granted. This COMMIT is very special that all the log table and triggers on the table will be committed so that it is available to the entire system from this point onwards.

Copying rows/tuples to a temporary table

Copying of tuples from the table to a new table is performed with SERIALIZABLE isolation. because there shouldn’t be any inconsistency between the data getting into the log table and the temporary, substitute table which pg_repack is going to create.


Since the AccessExclusive lock is removed, concurrent sessions can proceed with their DMLs and Select queries. Only DDLs will be blocked. So we can say that the table is available for transactions and queries.

Prior to the data copy, the log table is truncated. because pg_repack needs only those log data that is captured from the starting of data copy.

DELETE FROM repack.log_16423

Again pg_repack will attempt to kill any session which might be waiting for doing a DDL and get an AcessShare lock on the table. Since the other connection is already holding an AccessShare lock, This can be gained without much problem.

At this stage, pg_repack creates a substitute table that is going to replace the original table with the exact same structure as the original table but without any data. It will be a CTAS statement, something like :

CREATE TABLE repack.table_16423... AS SELECT col1,col2,col3... FROM ONLY public.t1 WITH NO DATA

followed by the data copy:

INSERT INTO repack.table_16423 SELECT col1,col2,co3... FROM ONLY public.t1

Once the data copy is over, the transaction will be COMMIT ed. This completes one heaviest stage in repacking in terms of load and WAL generation

Indexes, Keys will be created at this stage on this temporary, substitute table at this stage.

Apply the CDC log to a temporary table

Please remember that pg_repack’s main connection is not holding any lock on the table at this stage (other than the second connection’s AccessShare lock).  So There is nothing blocking the transactions (DMLs) at this stage. Depending on the time it took for the data copy in the previous stage, and concurrent transactions during the data copy, There could be a lot of CDC(Change Data Capture) entries in the log file. This needs to be copied to the new temporary/substitute table.

This logic is implemented as C function in pg_repack. you may refer to the source code of repack_apply. It reads all the data from the log table and processes INSERTS, UPDATES, and DELETES. In order to speed up the repeated operations, Prepared Statements are used. Finally, all those data from the log table which is processed will be deleted from the log table.

Swapping the original table with a temporary table

This is performed by the second connection because it already holds an AccessShare lock, But it will escalate the lock to AccessExclusive lock.


The CDC apply will be performed once again (The same “repack_apply”) while holding the AccessExclusive lock on the table. So if there is any new entry that appears in the log table, that also will be processed.

The original table and the temporary table quickly by executing repack_swap function like:

SELECT repack.repack_swap('16423');

This is the most beautiful and powerful part of pg_repack. which is implemented in a C function repack_swap. Not just tables are swapped, ownership, associated toasts (table and index), indexes, and dependencies are also swapped. Oids are swapped so that the oid of the table remains the same even after pg_repack. The Swapping work complies with a COMMIT

Final cleanup

pg_repack uses its built-in C Function repack_drop for doing the cleanup of all temporary objects. To prevent any concurrent sessions from acquiring a lock on the table which could prevent the cleanup, An AccessExclusive lock is obtained before the cleanup. This is the third time an AccessExcluive lock is placed on the table.


pg_repack is one of the most powerful, popular, and useful extensions. We encourage the usage wherever applicable with proper supervision. But please avoid over-usage. As I tried to explain,  we should expect a good amount of data movement between the original table to the temporary table, trigger writing to the log table, data copy from the log table to the temporary table, etc. So we should be expecting a higher WAL generation also. Considering all the implications, pg_repack needs to be performed on a low activity time window to avoid undesirable consequences.

Some of the important points to reiterate for end-users are:

  1. pg_repack needs to acquire heavyweight AccessExclusive lock multiple times. But Temporarily.
  2. In a high concurrency situation, it will be almost impossible to get an AccessExclusive lock
  3. pg_repack, by default, will attempt canceling the conflicting statements if it is not able to gain AcessExclusive lock-in wait-time
  4. It may proceed to terminate sessions if the total wait exceeds double the amount of wait time. This could lead to undesirable outcomes and outages.
  5. Defaults of pg_repack may not be good for critical systems. use --no-kill-backend option to make it more gentle.
  6. No DDLs are allowed against the table which is undergoing pg_repack and any session that attempts to do so might get killed.

PostgreSQL HA with Patroni: Your Turn to Test Failure Scenarios

PostgreSQL HA with Patroni

A couple of weeks ago, Jobin and I did a short presentation during Percona Live Online bearing a similar title as the one for this post: “PostgreSQL HA With Patroni: Looking at Failure Scenarios and How the Cluster Recovers From Them”. We deployed a 3-node PostgreSQL environment with some recycled hardware we had lying around and set ourselves at “breaking” it in different ways: by unplugging network and power cables, killing main processes, attempting to saturate processors. All of this while continuously writing and reading data from PostgreSQL. The idea was to see how Patroni would handle the failures and manage the cluster to continue delivering service. It was a fun demo!

We promised a follow-up post explaining how we set up the environment, so you could give it a try yourselves, and this is it. We hope you also have fun attempting to reproduce our small experiment, but mostly that you use it as an opportunity to learn how a PostgreSQL HA environment managed by Patroni works in practice: there is nothing like a hands-on lab for this!

Initial Setup

We recycled three 10-year old Intel Atom mini-computers for our experiment but you could use some virtual machines instead: even though you will miss the excitement of unplugging real cables, this can still be simulated with a VM. We installed the server version of Ubuntu 20.04 and configured them to know “each other” by hostname; here’s how the hosts file of the first node looked like:

$ cat /etc/hosts localhost node1 node1 node2 node3


Patroni supports a myriad of systems for Distribution Configuration Store but etcd remains a popular choice. We installed the version available from the Ubuntu repository on all three nodes:

sudo apt-get install etcd

It is necessary to initialize the etcd cluster from one of the nodes and we did that from node1 using the following configuration file:

$ cat /etc/default/etcd

Note how ETCD_INITIAL_CLUSTER_STATE is defined with “new”.

We then restarted the service:

sudo systemctl restart etcd

We can then move on to install etcd on node2. The configuration file follows the same structure as that of node1, except that we are adding node2 to an existing cluster so we should indicate the other node(s):


Before we restart the service, we need to formally add node2 to the etcd cluster by running the following command on node1:

sudo etcdctl member add node2

We can then restart the etcd service on node2:

sudo systemctl restart etcd

The configuration file for node3 looks like this:


Remember we need to add node3 to the cluster by running the following command on node1:

sudo etcdctl member add node3

before we can restart the service on node3:

sudo systemctl restart etcd

We can verify the cluster state to confirm it has been deployed successfully by running the following command from any of the nodes:

$ sudo etcdctl member list
2ed43136d81039b4: name=node3 peerURLs= clientURLs= isLeader=false
d571a1ada5a5afcf: name=node1 peerURLs= clientURLs= isLeader=true
ecec6c549ebb23bc: name=node2 peerURLs= clientURLs= isLeader=false

As we can see above, node1 is the leader at this point, which is expected since the etcd cluster has been bootstrapped from it. If you get a different result, check for etcd entries logged to /var/log/syslog on each node.


Quoting Patroni’s manual:

Watchdog devices are software or hardware mechanisms that will reset the whole system when they do not get a keepalive heartbeat within a specified timeframe. This adds an additional layer of fail safe in case usual Patroni split-brain protection mechanisms fail.

While the use of a watchdog mechanism with Patroni is optional, you shouldn’t really consider deploying a PostgreSQL HA environment in production without it.

For our tests, we used the standard software implementation for watchdog that is shipped with Ubuntu 20.04, a module called softdog. Here’s the procedure we used in all three nodes to configure the module to load:

sudo sh -c 'echo "softdog" >> /etc/modules'

Patroni will be the component interacting with the watchdog device. Since Patroni is run by the postgres user, we need to either set the permissions of the watchdog device open enough so the postgres user can write to it or make the device owned by postgres itself, which we consider a safer approach (as it is more restrictive):

sudo sh -c 'echo "KERNEL==\"watchdog\", OWNER=\"postgres\", GROUP=\"postgres\"" >> /etc/udev/rules.d/61-watchdog.rules'

These two steps looked like all that would be required for watchdog to work but to our surprise, the softdog module wasn’t loaded after restarting the servers. After spending quite some time digging around we figured the module was blacklisted by default and there was a strain file with such a directive still lingering around:

$ grep blacklist /lib/modprobe.d/* /etc/modprobe.d/* |grep softdog
/lib/modprobe.d/blacklist_linux_5.4.0-72-generic.conf:blacklist softdog

Editing that file in each of the nodes to remove the line above and restarting the servers did the trick:

$ lsmod | grep softdog
softdog                16384  0

$ ls -l /dev/watchdog*
crw-rw---- 1 postgres postgres  10, 130 May 21 21:30 /dev/watchdog
crw------- 1 root     root     245,   0 May 21 21:30 /dev/watchdog0


Percona Distribution for PostgreSQL can be easily installed from the Percona Repository in a few easy steps:

sudo apt-get update -y; sudo apt-get install -y wget gnupg2 lsb-release curl
sudo dpkg -i percona-release_latest.generic_all.deb
sudo apt-get update
sudo percona-release setup ppg-12
sudo apt-get install percona-postgresql-12

An important concept to understand in a PostgreSQL HA environment like this one is that PostgreSQL should not be started automatically by systemd during the server initialization: we should leave it to Patroni to fully manage it, including the process of starting and stopping the server. Thus, we should disable the service:

sudo systemctl disable postgresql

For our tests, we want to start with a fresh new PostgreSQL setup and let Patroni bootstrap the cluster, so we stop the server and remove the data directory that has been created as part of the PostgreSQL installation:

sudo systemctl stop postgresql
sudo rm -fr /var/lib/postgresql/12/main

These steps should be repeated in nodes 2 and 3 as well.


The Percona Repository also includes a package for Patroni so with it already configured in the nodes we can install Patroni with a simple:

sudo apt-get install percona-patroni

Here’s the configuration file we have used for node1:

$ cat /etc/patroni/config.yml
scope: stampede
name: node1

  connect_address: node1:8008

  host: node1:2379

  # this section will be written into Etcd:/<namespace>/<scope>/config after initializing new cluster
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
#    master_start_timeout: 300
#    synchronous_mode: false
      use_pg_rewind: true
      use_slots: true
        wal_level: replica
        hot_standby: "on"
        logging_collector: 'on'
        max_wal_senders: 5
        max_replication_slots: 5
        wal_log_hints: "on"
        #archive_mode: "on"
        #archive_timeout: 600
        #archive_command: "cp -f %p /home/postgres/archived/%f"
        #restore_command: cp /home/postgres/archived/%f %p

  # some desired options for 'initdb'
  initdb:  # Note: It needs to be a list (some options need values, others are switches)
  - encoding: UTF8
  - data-checksums

  pg_hba:  # Add following lines to pg_hba.conf after running 'initdb'
  - host replication replicator md5
  - host replication replicator trust
  - host all all md5
  - host all all md5
#  - hostssl all all md5

  # Additional script to be launched after initial cluster creation (will be passed the connection URL as parameter)
# post_init: /usr/local/bin/
  # Some additional users users which needs to be created after initializing new cluster
      password: admin
        - createrole
        - createdb

  connect_address: node1:5432
  data_dir: "/var/lib/postgresql/12/main"
  bin_dir: "/usr/lib/postgresql/12/bin"
#  config_dir:
  pgpass: /tmp/pgpass0
      username: replicator
      password: vagrant
      username: postgres
      password: vagrant
    unix_socket_directories: '/var/run/postgresql'

  mode: required # Allowed values: off, automatic, required
  device: /dev/watchdog
  safety_margin: 5

    nofailover: false
    noloadbalance: false
    clonefrom: false
    nosync: false

With the configuration file in place, and now that we already have the etcd cluster up, all that is required is to restart the Patroni service:

sudo systemctl restart patroni

When Patroni starts, it will take care of initializing PostgreSQL (because the service is not currently running and the data directory is empty) following the directives in the bootstrap section of Patroni’s configuration file. If everything went according to the plan, you should be able to connect to PostgreSQL using the credentials in the configuration file (password is vagrant):

$ psql -U postgres
psql (12.6 (Ubuntu 2:12.6-2.focal))
Type "help" for help.


Repeat the operation for installing Patroni on nodes 2 and 3: the only difference is that you will need to replace the references to node1 in the configuration file (there are four of them, shown in bold) with the respective node name.

You can also check the state of the Patroni cluster we just created with:

$ sudo patronictl -c /etc/patroni/config.yml list
| Cluster  | Member |  Host |  Role  |  State  | TL | Lag in MB |
| stampede | node1  | node1 | Leader | running |  2 |           |
| stampede | node2  | node2 |        | running |  2 |         0 |
| stampede | node3  | node3 |        | running |  2 |         0 |

node1 started the Patroni cluster so it was automatically made the leader – and thus the primary/master PostgreSQL server. Nodes 2 and 3 are configured as read replicas (as the hot_standby option was enabled in Patroni’s configuration file).


A common implementation of high availability in a PostgreSQL environment makes use of a proxy: instead of connecting directly to the database server, the application will be connecting to the proxy instead, which will forward the request to PostgreSQL. When HAproxy is used for this, it is also possible to route read requests to one or more replicas, for load balancing. However, this is not a transparent process: the application needs to be aware of this and split read-only from read-write traffic itself. With HAproxy, this is done by providing two different ports for the application to connect. We opted for the following setup:

  • Writes   ?  5000
  • Reads   ?  5001

HAproxy can be installed as an independent server (and you can have as many as you want) but it can also be installed on the application server or the database server itself – it is a light enough service. For our tests, we planned on using our own Linux workstations (which also run Ubuntu 20.04) to simulate application traffic so we installed HAproxy on them:

sudo apt-get install haproxy

With the software installed, we modified the main configuration file as follows:

$ cat /etc/haproxy/haproxy.cfg
    maxconn 100

    log    global
    mode    tcp
    retries 2
    timeout client 30m
    timeout connect 4s
    timeout server 30m
    timeout check 5s

listen stats
    mode http
    bind *:7000
    stats enable
    stats uri /

listen primary
    bind *:5000
    option httpchk OPTIONS /master
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

listen standbys
    balance roundrobin
    bind *:5001
    option httpchk OPTIONS /replica
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server node1 node1:5432 maxconn 100 check port 8008
    server node2 node2:5432 maxconn 100 check port 8008
    server node3 node3:5432 maxconn 100 check port 8008

Note there are two sections: primary, using port 5000, and standbys, using port 5001. All three nodes are included in both sections: that’s because they are all potential candidates to be either primary or secondary. For HAproxy to know which role each node currently has, it will send an HTTP request to port 8008 of the node: Patroni will answer. Patroni provides a built-in REST API support for health check monitoring that integrates perfectly with HAproxy for this:

$ curl -s http://node1:8008
{"state": "running", "postmaster_start_time": "2021-05-24 14:50:11.707 UTC", "role": "master", "server_version": 120006, "cluster_unlocked": false, "xlog": {"location": 25615248}, "timeline": 1, "database_system_identifier": "6965869170583425899", "patroni": {"version": "1.6.4", "scope": "stampede"}}

We configured the standbys group to balance read-requests in a round-robin fashion, so each connection request (or reconnection) will alternate between the available replicas. We can test this in practice, let’s save the postgres user password in a file to facilitate the process:

echo "localhost:5000:postgres:postgres:vagrant" > ~/.pgpass
echo "localhost:5001:postgres:postgres:vagrant" >> ~/.pgpass
chmod 0600 ~/.pgpass

We can then execute two read-requests to verify the round-robin mechanism is working as intended:

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"

$ psql -Upostgres -hlocalhost -p5001 -t -c "select inet_server_addr()"

as well as test the writer access:

$ psql -Upostgres -hlocalhost -p5000 -t -c "select inet_server_addr()"

You can also check the state of HAproxy by visiting http://localhost:7000/ on your browser.


To best simulate a production environment to test our failure scenarios, we wanted to have continuous reads and writes to the database. We could have used a benchmark tool such as Sysbench or Pgbench but we were more interested in observing the switch of source server upon a server failure than load itself. Jobin wrote a simple Python script that is perfect for this, HAtester. As was the case with HAproxy, we run the script from our Linux workstation. Since it is a Python script, you need to have a PostgreSQL driver for Python installed to execute it:

sudo apt-get install python3-psycopg2
curl -LO
chmod +x

Edit the script with the credentials to access the PostgreSQL servers (through HAproxy) if you are using different settings from ours. The only requirement for it to work is to have the target table created beforehand, so first connect to the postgres database (unless you are using a different target) in the Primary and run:


You can then start two different sessions:

  1. One for writes:

    ./ 5000
  2. One for reads:
    ./ 5001

The idea is to observe what happens with database traffic when the environment experiences a failure; that is, how HAproxy will route reads and writes as Patroni adjusts the PostgreSQL cluster. You can continuously monitor Patroni from the point of view of the nodes by opening a session in each of them and running the following command:

sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

To facilitate observability and better follow the changes in real-time, we used the terminal multiplexer Tmux to visualize all 5 sessions on the same screen:

  • On the left side, we have one session open for each of the 3 nodes, continuously running:

    sudo -u postgres watch patronictl -c /etc/patroni/config.yml list

    It’s better to have the Patroni view for each node independently because when you start the failure tests you will lose connection to a part of the cluster.

  • On the right side, we are executing the script from our workstation:
    • Sending writes through port 5000:

      ./ 5000
    • and reads through port 5001:

      ./ 5001

A couple of notes on the execution of the script:

  • Pressing Ctrl+C will break the connection but the script will reconnect, this time to a different replica (in the case of reads) due to having the Standbys group on HAproxy configured with round-robin balancing.
  • When a switchover or failover takes place and the nodes are re-arranged in the cluster, you may temporarily see writes sent to a node that used to be a replica and was just promoted as primary and reads send to a node that used to be the primary and was demoted as secondary: that’s a limitation of the script but “by design”; we favored faster reconnections and minimal checks on the node’s role for demonstration purposes. On a production application, this part ought to be implemented differently.

Testing Failure Scenarios

The fun part starts now! We leave it to you to test and play around to see what happens with the PostgreSQL cluster in practice following a failure. We leave as suggestions the tests we did in our presentation. For each failure scenario, observe how the cluster re-adjusts itself and the impact on read and write traffic.

1) Loss of Network Communication

  • Unplug the network cable from one of the nodes (or simulate this condition in your VM):
    • First from a replica
    • Then from the primary
  • Unplug the network cable from one replica and the primary at the same time:
    • Does Patroni experience a split-brain situation?

2) Power Outage

  • Unplug the power cable from the primary
  • Wait until the cluster is re-adjusted then plug the power cable back and start the node


Simulate an OOM/crash by killing the postmaster process in one of the nodes with kill -9.

4) Killing Patroni

Remember that Patroni is managing PostgreSQL. What happens if the Patroni process (and not PostgreSQL) is killed?

5) CPU Saturation

Simulate CPU saturation with a benchmark tool such as Sysbench, for example:

sysbench cpu --threads=10 --time=0 run

This one is a bit tricky as the reads and writes are each single-threaded operation. You may need to decrease the priority of the processes with renice, and possibly increase that of Sysbench’s.

6) Manual Switchover

Patroni facilitates changes in the PostgreSQL hierarchy. Switchover operations can be scheduled, the command below is interactive and will prompt you with options:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover

Alternatively, you can be specific and tell Patroni exactly what to do:

sudo -u postgres patronictl -c /etc/patroni/config.yml switchover --master node1 --candidate node2 --force

We hope you had fun with this hands-on lab! If you have questions or comments, leave us a note in the comments section below!


postgres_fdw Enhancement in PostgreSQL 14

postgres_fdw Postgresql 14

It’s exciting times in the PostgreSQL world with the version 14 beta released a few days ago. It’s now time to look under the hood and find out what has changed from version 13, see what has improved, and what behaviors are changed that we should be aware of. Putting it all in a single blog would take weeks to write and days to read, so this one will focus solely on the changes expected in the Foreign Data Wrapper in the GA release of PostgreSQL version 14.

Foreign Data Wrappers (FDWs) provide a mechanism by which regular SQL queries can be used to access data that resides outside PostgreSQL. There are many different FDWs available, however, and PostgreSQL comes with a “File FDW” and a “PostgreSQL FDW”. PostgreSQL FDW may seem counterintuitive, but, it is an extremely useful feature. And there have been some very useful updates to this FDW.

So, let’s start understanding what has changed.

Performance Feature

If you are already using PostgreSQL FDW for any use case, take note of the performance improvements.

1 – Parallel /Async Foreign Scans

(Allow a query referencing multiple foreign tables to perform foreign table scans in parallel)

Remote aggregations and remote joins might have been a performance nightmare when performed across multiple servers. The performance benefit comes from the parallelization of ForeignScan which can now be executed in parallel asynchronously. The sequential execution previously was very slow, and in some cases, too slow. For this, a new server option is added “async_capable” which allows for parallel planning and execution of ForeignScan.

Create Servers and User-Mappings

-- Create foreign server 1.
CREATE SERVER postgres_svr1
OPTIONS (host '', async_capable "true");

-- Create foreign server 2.
CREATE SERVER postgres_svr2
OPTIONS (host '', async_capable "true");

SERVER postgres_svr1
OPTIONS (user 'postgres', password 'pass');

SERVER postgres_svr2
OPTIONS (user 'postgres', password 'pass');

Create Local Tables

CREATE TABLE child_local1 (a INTEGER, b CHAR, c TEXT, d VARCHAR(255));
CREATE TABLE child_local2 (a int, b CHAR, c text, d VARCHAR(255));
GRANT ALL ON child_local1 to postgres;
GRANT ALL ON child_local2 to postgres;

Create Foreign TABLES

PARTITION OF parent_local VALUES FROM 1000 TO 2000
SERVER postgres_svr1 OPTIONS table_name 'child_local1');

PARTITION OF parent_local
SERVER postgres_svr2 OPTIONS table_name 'child_local2');

Now try that, and see the plan tree, now you can see two Async Foreign plans in the tree.

CREATE TABLE sample_table (a INTEGER, b CHAR, c TEXT, d VARCHAR(255));

INSERT INTO sample_table SELECT * FROM parent_local WHERE a % 100 = 0;
                                          QUERY PLAN
 Insert on public.sample_table
   ->  Append
         ->  Async Foreign Scan on public.parent_remote1 parent_local_1
               Output: parent_local_1.a, parent_local_1.b, parent_local_1.c, parent_local_1.d
               Remote SQL: SELECT a, b, c, d FROM public.child_local1 WHERE (((a % 100) = 0))
         ->  Async Foreign Scan on public.parent_remote2 parent_local_2
               Output: parent_local_2.a, parent_local_2.b, parent_local_2.c, parent_local_2.d
               Remote SQL: SELECT a, b, c, d FROM public.child_local2 WHERE (((a % 100) = 0))
(8 rows)

2 – Bulk Insert

(Allow postgres_fdw to INSERT rows in bulk.)

Now bulk insert functionality has been added to Foreign Data Wrapper, and postgres_fdw is now supporting that function. There is an opportunity for other foreign data wrappers to implement bulk insert. A complete blog can be seen here.

Functional Features


Allow TRUNCATE to operate on foreign tables.

Foreign Data Wrapper is enhanced to support TRUNCATE command on the foreign table as the target. That means it issues a TRUNCATE command shipped to the foreign server and executes on the table. Fortunately, this functionality is implemented in postgres_fdw. Here is an example of that.

CREATE SERVER postgres_svr 
       FOREIGN DATA WRAPPER postgres_fdw 
       OPTIONS (host '');

       SERVER postgres_svr
       OPTIONS (user 'postgres', password 'pass');

                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
       SERVER postgres_svr
       OPTIONS(table_name 'foo_local');

Now a foreign table can be truncated using the TRUNCATE command.

postgres=# TRUNCATE foo_remote;

2 – LIMIT TO Child Partitioning

(Allow postgres_fdw to import table partitions if specified by IMPORT FOREIGN SCHEMA … LIMIT TO.)

The postgres_fdw does not allow the import of table partitions, because data can be accessed using the root partition. But in case the user wants to import the partitioned table partitioning, PostgreSQL 14 added a new option call “LIMIT TO”. Create a new schema on a remote machine and add a parent table “foo_schema.foo_table_parent” and one child table “foo_schema.foo_table_child”.

postgres=# \d+ foo_schema.*
                                   Table "foo_schema.foo_table_child"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
 a      | integer |           |          |         | plain   |             |              | 
Partition of: foo_schema.foo_table_parent FOR VALUES FROM (0) TO (10)

Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 10))
Access method: heap

                            Partitioned table "foo_schema.foo_table_parent"
 Column |  Type   | Collation | Nullable | Default | Storage | Compression | Stats target | Description 
 a      | integer |           |          |         | plain   |             |              | 
Partition key: RANGE (a)
Partitions: foo_schema.foo_table_child FOR VALUES FROM (0) TO (10)

Import schema without specifying the LIMIT TO, you can see only the parent table is imported.

IMPORT FOREIGN SCHEMA foo_schema FROM SERVER postgres_svr INTO bar_schema;
postgres=# \d+ bar_schema.*
                                 Foreign table "bar_schema.foo_table_parent"
 Column |  Type   | Collation | Nullable | Default |    FDW options    | Storage | Stats target | Description 
 a      | integer |           |          |         | (column_name 'a') | plain   |              | 
Server: postgres_svr
FDW options: (schema_name 'foo_schema', table_name 'foo_table_parent')

If you explicitly specified partitioned table into LIMIT TO clause then it will import that table.

postgres=# IMPORT FOREIGN SCHEMA foo_schema LIMIT TO (foo_table_parent, foo_table_child)
            FROM SERVER loopback INTO bar_schema;

postgres=# \d+ bar_schema.*
                                  Foreign table "bar_schema.foo_table_child"
 Column |  Type   | Collation | Nullable | Default |    FDW options    | Storage | Stats target | Description 
 a      | integer |           |          |         | (column_name 'a') | plain   |              | 
Server: loopback
FDW options: (schema_name 'foo_schema', table_name 'foo_table_child')

                                 Foreign table "bar_schema.foo_table_parent"

 Column |  Type   | Collation | Nullable | Default |    FDW options    | Storage | Stats target | Description 
 a      | integer |           |          |         | (column_name 'a') | plain   |              | 
Server: loopback
FDW options: (schema_name 'foo_schema', table_name 'foo_table_parent')

3 – Active and Valid Connection List

(Add postgres_fdw function postgres_fdw_get_connections to report open foreign server connections)

 A new function postgres_fdw_get_connections() is added. The function returns the open connection names local session to the foreign servers of postgres_fdw. It also outputs the validity of the connection.

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
    server_name    | valid 
 postgres_svr      | t
 postgres_svr_bulk | t
(2 rows)

Now disconnect all connections connection and try the query again.

postgres=# SELECT 1 FROM postgres_fdw_disconnect_all();
(1 row)

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name | valid 
(0 rows)

4 – Keep Connections

(Add postgres_fdw functions to discard cached connections)

A new option keep_connections is added to keep the connections alive so that subsequent queries can reuse them. By default, this option is on, but when turned off, the connections will be discarded as the transactions end.

Set the option off

ALTER SERVER loopback OPTIONS (keep_connections 'off');

Establish the connection using the remote query.

postgres=# BEGIN;

postgres=*# select * from foo_remote;
 a | b | c | d 
(0 rows)

postgres=*# END;

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name | valid 
(0 rows)

Set the keep_connections option on

ALTER SERVER postgres_svr options (set keep_connections 'on');


postgres=# BEGIN;
postgres=*# select * from foo_remote;
 a | b | c | d 
(0 rows)

-- Establish the connection using the remote query.

postgres=*# END;

postgres=# SELECT * FROM postgres_fdw_get_connections() ORDER BY 1;
 server_name  | valid 
 postgres_svr | t
(1 row)

5 – Reestablish Brocken Connection

(Allow postgres_fdw to reestablish foreign server connections if necessary)

Previously when the remote server restarted and the postgres_fdw connection was broken, then the error was thrown because the cached connection is no longer available. This is fixed in PostgreSQL, and, in any case, the connection is broken and no longer exists in the cache, and postgres_fdw will establish the connection.


It is quite promising that the Foreign Data Wrapper API is expanding over every release, but PostgreSQL 14 provides some user-centric new features. The performance-related improvements give another reason to use FDWs for many relevant use cases. The functionality will surely be added in the next few versions, making these more performant and easy to use.


New Features in PostgreSQL 14: Bulk Inserts for Foreign Data Wrappers

PostgreSQL 14 Bulk Inserts for Foreign Data Wrappers

Foreign Data Wrapper based on SQL-MED is one the coolest features of PostgreSQL. The feature set of foreign data wrapper is expanding since version 9.1. We know that the PostgreSQL 14 beta is out and GA will be available shortly, therefore it is helpful to study the upcoming features of PostgreSQL 14. There are a lot of them, along with some improvements in foreign data wrapper. A new performance feature, “Bulk Insert“, is added in PostgreSQL 14. The API is extended and allows bulk insert of the data into the foreign table, therefore, using that API, any foreign data wrapper now can implement Bulk Insert. It is definitely more efficient than inserting individual rows.

The API contains two new functions, which can be used to implement the bulk insert.

There is no need to explain these functions here because it is useful for people interested in having that functionality into their foreign data wrapper like mysql_fdw, mongo_fdw, and oracle_fdw. If someone is interested to see that, they can see it in the PostgreSQL documentation. But the good news is, postgres_fdw already implement that and have that in PostgreSQL 14.

There is a new server option is added which is batch_size, and you can specify that when creating the foreign server or creating a foreign table.

  • Create a postgres_fdw extension
CREATE EXTENSION postgres_fdw;

  • Create a foreign server without batch_size
CREATE SERVER postgres_svr 
       FOREIGN DATA WRAPPER postgres_fdw 
       OPTIONS (host '');

       SERVER postgres_svr
       OPTIONS (user 'postgres', password 'pass');

                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
       SERVER postgres_svr
       OPTIONS(table_name 'foo_local');

EXPLAIN (VERBOSE, COSTS OFF) insert into foo_remote values (generate_series(1, 1), 'c', 'text', 'varchar');
                                                QUERY PLAN                                                 
 Insert on public.foo_remote
   Remote SQL: INSERT INTO public.foo_local(a, b, c, d) VALUES ($1, $2, $3, $4)
   Batch Size: 1
   ->  ProjectSet
         Output: generate_series(1, 1), 'c'::character(1), 'text'::text, 'varchar'::character varying(255)
         ->  Result
(6 rows)

  • Execution time with batch_size not specified
        INSERT INTO foo_remote
        VALUES (generate_series(1, 100000000),
                                                       QUERY PLAN                                                        
Insert on foo_remote  (cost=0.00..500000.02 rows=0 width=0) (actual time=4591443.250..4591443.250 rows=0 loops=1)
   ->  ProjectSet  (cost=0.00..500000.02 rows=100000000 width=560) (actual time=0.006..31749.132 rows=100000000 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.002 rows=1 loops=1)
Planning Time: 4.988 ms
Execution Time: 4591447.101 ms -- timing is important
(5 rows)

  • Create a foreign table with batch_size = 10, in case no batch_size is specified with server creation
                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
        SERVER postgres_svr OPTIONS(table_name 'foo_local', batch_size '10');

  • Create a foreign server with batch_size = 10, now every table of that server will use the batch_size 10
CREATE SERVER postgres_svr_bulk
       FOREIGN DATA WRAPPER postgres_fdw
       OPTIONS (host '', batch_size = '10'); -- new option batch_size

       SERVER postgres_svr
       OPTIONS (user 'postgres', password 'pass');

                                 b CHAR,
                                 c TEXT,
                                 d VARCHAR(255))
        SERVER postgres_svr OPTIONS(table_name 'foo_local');

EXPLAIN (VERBOSE, COSTS OFF) insert into foo_remote_bulk values (generate_series(1, 1), 'c', 'text', 'varchar');

                                                QUERY PLAN                                                 

Insert on public.foo_remote_bulk
   Remote SQL: INSERT INTO public.foo_local_bulk(a, b, c, d) VALUES ($1, $2, $3, $4)
   Batch Size: 10
   ->  ProjectSet
         Output: generate_series(1, 1), 'c'::character(1), 'text'::text, 'varchar'::character varying(255)
         ->  Result
(6 rows)

  • Execution time with batch_size = 10:
        INSERT INTO foo_remote_bulk
        VALUES (generate_series(1, 100000000),
                                                       QUERY PLAN                                                        
 Insert on foo_remote_bulk  (cost=0.00..500000.02 rows=0 width=0) (actual time=822224.678..822224.678 rows=0 loops=1)
   ->  ProjectSet  (cost=0.00..500000.02 rows=100000000 width=560) (actual time=0.005..10543.845 rows=100000000 loops=1)
         ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.001..0.002 rows=1 loops=1)
 Planning Time: 0.250 ms
 Execution Time: 822239.178 ms -- timing is important
(5 rows)


PostgreSQL is expanding the feature list of foreign data wrappers, and Bulk Insert is another good addition. As this feature is added to the core, I hope all other foreign data wrappers will implement it as well.


Percona Distribution for PostgreSQL Operator Technical Preview Released

Percona Distribution for PostgreSQL Operator

Percona is championing open source database software and we are committed to running our products on Kubernetes. We don’t only want to run the software, but make sure that the database is highly available, secure, and follows best practices. We also focus on day-2 operations such as scaling, backup and restore, disaster recovery, and customization.

To get there, we have Operators – the software framework that extends Kubernetes APIs and provides control over database deployment and operations through the control plane. Until May we had two Operators:

The only missing piece was Percona Distribution for PostgreSQL, for which we introduced the Operator during Percona Live in May 2021. This completes our vision for deploying our software on Kubernetes. See the release notes of the initial version here.

Kubernetes Operator FAQ

This blog post is intended to answer some frequently asked questions we received from our community about Percona Distribution for PostgreSQL Operator.

Is This a Brand New Operator?

No. Our Operator is based on PGO, the Postgres Operator from Crunchy Data, which we modified and enhanced in order to support our PostgreSQL distribution.

Why CrunchyData Operator?

As noted above, we are committed to running our database software on Kubernetes. There are multiple ways to achieve this goal:

  1. Develop a new Operator from scratch
  2. Collaborate and contribute necessary changes to an existing Operator
  3. Fork an  existing Operator

Option (1) looks great, but it is time and effort-intensive, and we might be re-inventing existing wheels. Our goal is to minimize Time To Market (TTM), so we dropped this option right away.

For options (2) and (3), there are at least three different PostgreSQL Operators in active development:

Stackgres is written in Java, and our engineering team is more familiar with C/C++ and Golang. We do not see that changing in the near future. Zalando Operator is great and provides a lot of functionality out of the box, but our Engineering team estimated the effort to perform the needed changes almost similar to writing the Operator from scratch.

PGO is written in Golang and provides the features we were looking for: high availability with Patroni, scaling, backups, and many more. Our engineering team did not flag any complexity of introducing the changes and we jumped to work.

Will Percona use PGO as an Upstream?

For now – yes. We will be merging new features implemented in PGO into our fork. Version 0.1.0 of our Operator is based on the 4.6.2 version of PGO. Version 0.2.0 will be based on version 4.7.X. At the same time, we want to contribute back some of our changes to the upstream and have already sent some pull requests (one, two). We’ll continue submitting patches to the upstream project.

What is Different About Percona Operator?

The main differences were highlighted in the release notes. Here they are:

  • Percona Distribution for PostgreSQL is now used as the main container image. CrunchyData container images are provided under Crunchy Data Developer Program, which means that without an active contract they could not be used for production. Percona container images are fully open source and do not have any limitations for use.
  • It is possible to specify custom images for all components separately. For example, users can easily build and use custom images for one or several components (e.g. pgBouncer) while all other images will be the official ones. Also, users can build and use all custom images.
  • All container images are reworked and simplified. They are built on Red Hat Universal Base Image (UBI) 8.
  • The Operator has built-in integration with Percona Monitoring and Management (PMM) v2.
  • A build/test infrastructure was created, and we have started adding e2e tests to be sure that all pieces of the cluster work together as expected.
  • We have phased out the PGO CLI tool, and the Custom Resource UX will be completely aligned with other Percona Operators in the following release.

For future releases, our goal is to cover the feature and UX parity between the Operators, so that our users will have the same look and feel for all three database engines.

What Does Tech Preview Mean?

Tech Preview Features are not yet ready for production use and are not included in support via SLA (Service License Agreement). They are included in this release so that users can provide feedback prior to the full release of the feature in a future GA (General Availability) release (or removal of the feature if it is deemed not useful). This functionality can change (APIs, CLIs, etc.) from tech preview to GA.

When is GA Coming and What is Going to be Included?

Our goal is to release the GA version early in Q3. The changes in this version would include:

  • Moving control over replicas to the main Custom Resource instead of managing them separately
  • Change the main manifest to provide the same look and feel as in other Percona Operators
  • Rework scheduled backups and control them with main CR and Kubernetes primitives
  • Add support for Google Cloud Storage (this will be merged from upstream)

Call for Action

To install our Operator and learn more about it please read the documentation.

Our Operator is licensed under Apache 2.0 and can be found in percona-postgresql-operator repository on Github. There are multiple ways to contact us or share your ideas:

  • To report a bug use and create the bug in K8SPG project
  • For general questions and sharing your thoughts, we have a community forum or Discord where we chat about open source, databases, Kubernetes, and many more.
  • We have a public roadmap where you can see what is planned and what is under development. You can share your ideas about new features there as well.

Powered by WordPress | Theme: Aeros 2.0 by