Grafana Dashboards: A PoC Implementing the PostgreSQL Extension pg_stat_monitor

This PoC demonstrates how to install and configure pg_stat_monitor in order to extract useful and actionable metrics from a PostgreSQL database and display them on a Grafana dashboard.
About the environment
- Grafana: version 10.0.0
- Grafana database backend: Prometheus version 2.15.2+d
- PostgreSQL version 13
- pgbench version 13
To explore which constructive and useful PostgreSQL-derived metrics can be surfaced in Grafana, load is generated using pgbench.
Configuring Grafana
For our purposes, the Grafana datasource used in this PoC is also the Postgres data cluster that is generating the data to be monitored.

pg_stat_monitor
About
pg_stat_monitor is a Query Performance Monitoring tool for PostgreSQL. It collects various statistics data such as query statistics, query plan, SQL comments, and other performance insights. The collected data is aggregated and presented in a single view.
pg_stat_monitor takes its inspiration from pg_stat_statements. Unlike pg_stat_statements, which aggregates its metrics from the last time it was zeroed, pg_stat_monitor can bucket its output into a set number of time-based aggregates, sparing users the effort of doing that aggregation themselves.
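For instance, per-bucket totals for a single statement can be pulled directly from the view. This is an illustrative sketch against the view's documented columns; the :'target_queryid' psql variable is my own placeholder:

```sql
-- one row per time bucket for the chosen statement
select bucket, bucket_start_time, calls, total_exec_time
from pg_stat_monitor
where queryid = :'target_queryid'
order by bucket_start_time;
```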
pg_stat_monitor tracks the following operations:
- statements
- queries
- functions
- stored procedures and other non-utility statements
Features
- Time Interval Grouping: Instead of supplying one set of ever-increasing counts, pg_stat_monitor computes stats for a configured number of time intervals – time buckets. This allows for much better data accuracy, especially in the case of high-resolution or unreliable networks.
- Multi-Dimensional Grouping: While pg_stat_statements groups counters by userid, dbid, queryid, pg_stat_monitor uses a more detailed group for higher precision. This allows a user to drill down into the performance of queries.
- Capture Actual Parameters in the Queries: pg_stat_monitor allows you to choose whether you want to see queries with placeholders for parameters or with the actual parameter data. This simplifies debugging and analysis by enabling users to re-execute the same query with real values.
- Query Plan: Each SQL is now accompanied by its actual plan that was constructed for its execution. That’s a huge advantage if you want to understand why a particular query is slower than expected.
- Tables Access Statistics for a Statement: This allows us to easily identify all queries that accessed a given table. This set is at par with the information provided by the pg_stat_statements.
- Histogram: Visual representation is very helpful as it can help identify issues. With the help of the histogram function, one can now view a timing/calling data histogram in response to an SQL query. And yes, it even works in psql.
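The underlying histogram data also surfaces in the view's resp_calls column (a text[] of per-range call counts, visible in the view definition below). A minimal sketch of unpacking it, assuming default histogram settings:

```sql
-- expand the response-time histogram ranges for busy statements
select bucket, queryid, unnest(resp_calls) as calls_in_range
from pg_stat_monitor
where calls > 0;
```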
Installation (example: CENTOS8, pg14)
The simplest way to get pg_stat_monitor is to install it via Percona Distribution for PostgreSQL.
The following instructions demonstrate installing Percona Distribution for PostgreSQL and pg_stat_monitor on a CENTOS8 OS Linux distribution:
# Install the Percona repository
dnf install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ppg14

# Install the PostgreSQL community repository
dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf update -y
dnf install -y pg_stat_monitor_14

# Perform standard initialization and systemd configuration
/usr/pgsql-14/bin/postgresql-14-setup initdb

# Configure postgres to use pg_stat_monitor
echo "
shared_preload_libraries=pg_stat_monitor
" >> /var/lib/pgsql/14/data/postgresql.auto.conf

# Complete postgres configuration
systemctl enable postgresql-14
systemctl start postgresql-14
Create extension
The pg_stat_monitor extension can be created in any database, but for the purposes of this PoC, it is placed in the database pgbench.
postgres=# create database pgbench;
postgres=# \c pgbench
pgbench=# create extension pg_stat_monitor;
pgbench=# \d
List of relations
Schema | Name | Type | Owner
--------+-----------------+------+----------
public | pg_stat_monitor | view | postgres
View "public.pg_stat_monitor"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
bucket | bigint | | |
bucket_start_time | timestamp with time zone | | |
userid | oid | | |
username | text | | |
dbid | oid | | |
datname | text | | |
client_ip | inet | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
toplevel | boolean | | |
top_queryid | bigint | | |
query | text | | |
comments | text | | |
planid | bigint | | |
query_plan | text | | |
top_query | text | | |
application_name | text | | |
relations | text[] | | |
cmd_type | integer | | |
cmd_type_text | text | | |
elevel | integer | | |
sqlcode | text | | |
message | text | | |
calls | bigint | | |
total_exec_time | double precision | | |
min_exec_time | double precision | | |
max_exec_time | double precision | | |
mean_exec_time | double precision | | |
stddev_exec_time | double precision | | |
rows | bigint | | |
shared_blks_hit | bigint | | |
shared_blks_read | bigint | | |
shared_blks_dirtied | bigint | | |
shared_blks_written | bigint | | |
local_blks_hit | bigint | | |
local_blks_read | bigint | | |
local_blks_dirtied | bigint | | |
local_blks_written | bigint | | |
temp_blks_read | bigint | | |
temp_blks_written | bigint | | |
blk_read_time | double precision | | |
blk_write_time | double precision | | |
resp_calls | text[] | | |
cpu_user_time | double precision | | |
cpu_sys_time | double precision | | |
wal_records | bigint | | |
wal_fpi | bigint | | |
wal_bytes | numeric | | |
bucket_done | boolean | | |
plans | bigint | | |
total_plan_time | double precision | | |
min_plan_time | double precision | | |
max_plan_time | double precision | | |
mean_plan_time | double precision | | |
stddev_plan_time | double precision | | |
About pgbench
pgbench is a simple program executing benchmark tests on PostgreSQL by running the same sequence of SQL commands over and over. pgbench is capable of executing multiple concurrent database sessions and can calculate the average transaction rate (TPS) at the end of a run. Although the default configuration simulates loading based loosely upon TPC-B, it is nevertheless easy to test other use cases by writing one’s own transaction script files.
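As a sketch of that custom-script capability, a hypothetical transaction file (the file name and query are my own, not part of this PoC) could look like:

```sql
-- select_only.sql: a minimal custom pgbench transaction script
-- \set assigns a random account id once per transaction
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

It would then be run with, e.g., pgbench -f select_only.sql -c 4 -T 60 pgbench.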
Querying the data
While it is reasonable to create panels showing real-time load, the better way to explore the types of queries that can be run against pg_stat_monitor is to copy the data into tables once the benchmarking run has completed and query it there.
Table: pg_stat_monitor_archive
Save the data generated from a recently completed benchmark run into an archive table:
select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time asc;
Table "public.pg_stat_monitor_archive"
Column | Type | Collation | Nullable | Default
--------------------+--------------------------+-----------+----------+---------
bucket | bigint | | |
bucket_start_time | timestamp with time zone | | |
userid | oid | | |
username | text | | |
dbid | oid | | |
datname | text | | |
client_ip | inet | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
toplevel | boolean | | |
top_queryid | bigint | | |
query | text | | |
comments | text | | |
planid | bigint | | |
query_plan | text | | |
top_query | text | | |
application_name | text | | |
relations | text[] | | |
cmd_type | integer | | |
cmd_type_text | text | | |
elevel | integer | | |
sqlcode | text | | |
message | text | | |
calls | bigint | | |
total_exec_time | double precision | | |
min_exec_time | double precision | | |
max_exec_time | double precision | | |
mean_exec_time | double precision | | |
stddev_exec_time | double precision | | |
rows | bigint | | |
shared_blks_hit | bigint | | |
shared_blks_read | bigint | | |
shared_blks_dirtied | bigint | | |
shared_blks_written | bigint | | |
local_blks_hit | bigint | | |
local_blks_read | bigint | | |
local_blks_dirtied | bigint | | |
local_blks_written | bigint | | |
temp_blks_read | bigint | | |
temp_blks_written | bigint | | |
blk_read_time | double precision | | |
blk_write_time | double precision | | |
resp_calls | text[] | | |
cpu_user_time | double precision | | |
cpu_sys_time | double precision | | |
wal_records | bigint | | |
wal_fpi | bigint | | |
wal_bytes | numeric | | |
bucket_done | boolean | | |
plans | bigint | | |
total_plan_time | double precision | | |
min_plan_time | double precision | | |
max_plan_time | double precision | | |
mean_plan_time | double precision | | |
stddev_plan_time | double precision | | |
Table: pg_stat_monitor_qry
Extract this metric of interest, i.e., time vs total execution time:
select bucket_start_time, pgsm_query_id, queryid, total_exec_time into pg_stat_monitor_qry from pg_stat_monitor order by bucket_start_time asc;
pgbench=# \d pg_stat_monitor_qry
Table "public.pg_stat_monitor_qry"
Column | Type | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
bucket_start_time | timestamp with time zone | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
total_exec_time | double precision | | |
Table: pg_stat_monitor_shared_blk_io
Extract this metric of interest, i.e., time vs shared_blk io:
select bucket_start_time, pgsm_query_id, queryid,
shared_blks_hit, shared_blks_read,
shared_blks_dirtied, shared_blks_written
into pg_stat_monitor_shared_blk_io
from pg_stat_monitor_archive
order by bucket_start_time asc;
pgbench=# \d pg_stat_monitor_shared_blk_io
Table "public.pg_stat_monitor_shared_blk_io"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
bucket_start_time | timestamp with time zone | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
shared_blks_hit | bigint | | |
shared_blks_read | bigint | | |
shared_blks_dirtied | bigint | | |
shared_blks_written | bigint | | |
Table: pg_stat_monitor_blk_io
Note: this metric requires the runtime parameter track_io_timing to be set to on.
Extract this metric of interest, i.e., time vs. blk io:
select bucket_start_time, pgsm_query_id, queryid, blk_read_time, blk_write_time
into pg_stat_monitor_blk_io
from pg_stat_monitor_archive
order by bucket_start_time asc;
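Enabling I/O timing is standard PostgreSQL and does not require a restart; a sketch:

```sql
-- blk_read_time / blk_write_time stay at zero unless this is on
alter system set track_io_timing = on;
select pg_reload_conf();
```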
Table: pg_stat_monitor_uniq_id
Save a copy of all unique query IDs in order to parse out future queries from the view.
Column pgsm_query_id identifies the query in such a manner that one can still identify the same query even when generated on other platforms under different loading conditions with different data:
with a as (select distinct on (pgsm_query_id) *
from pg_stat_monitor_archive
where application_name='pgbench')
select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
into pg_stat_monitor_uniq_id
from a
order by cmd_type;
pgbench=# \d pg_stat_monitor_uniq_id
Table "public.pg_stat_monitor_uniq_id"
Column | Type | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
cmd_type | integer | | |
cmd_type_text | text | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
example_query | text | | |
This is an example set of queries generated by pgbench. Note the numbers in column pgsm_query_id are always the same values irrespective of hosts or environments:
select cmd_type_text, pgsm_query_id, example_query from pg_stat_monitor_uniq_id where cmd_type > 0;
cmd_type_text | pgsm_query_id | example_query
---------------+----------------------+-----------------------------------------------------------------
SELECT | -7455620703706695456 | SELECT abalance FROM pgbench_accounts WHERE aid = 16416498
UPDATE | -510321339504955469 | UPDATE pgbench_accounts SET abalance = abalance + 2063
WHERE aid = 1482568
UPDATE | 5276535447716615446 | UPDATE pgbench_branches SET bbalance = bbalance + 1384
WHERE bid = 7
UPDATE | 3629195281782908951 | UPDATE pgbench_tellers SET tbalance = tbalance + -2966
WHERE tid = 330
INSERT | -8751124061964589929 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
VALUES (321, 56, 21104880, 4211, CURRENT_TIMESTAMP)
Benchmarking
Two types of performance monitoring are profiled:
- Real-time loading performance
- Aggregate performance over a specific time period, i.e., a snapshot.
Although the results of the benchmarking can be viewed by querying the view pg_stat_monitor, you will note, as demonstrated by the bash script and SQL statements below, that the contents of the view are immediately copied and saved into a collection of tables. This is because the data disappears over time as pg_stat_monitor cycles through its allotted number of buckets.
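The retention window is simply the bucket time multiplied by the bucket count; with the settings used in this benchmark, 75 one-minute buckets cover the 4,500-second (75-minute) run with nothing to spare. A quick sanity check:

```sql
show pg_stat_monitor.pgsm_bucket_time;   -- 1min in this setup
show pg_stat_monitor.pgsm_max_buckets;   -- 75 in this setup
select 75 * interval '1 minute' as retention_window;  -- 01:15:00
```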
A script executing a benchmarking run:
#!/bin/bash
# REFERENCE
# https://docs.percona.com/pg-stat-monitor/reference.html
#
set -e
export PGPASSWORD="MYPASSWORD"
PGHOST='MYHOST'
PGPORT=5434
PGUSER=postgres
#
# initialize benchmarking database
#
dropdb --if-exists pgbench
createdb pgbench
/usr/pgsql-12/bin/pgbench -i --foreign-keys -s 300 pgbench
psql pgbench -c 'create extension pg_stat_monitor'
#
# configure pg_stat_monitor, requires system restart
#
psql postgres <<_eof_
-- set bucket time range, default is normally 60 seconds
alter system set pg_stat_monitor.pgsm_bucket_time = '1min';
-- set number of buckets, default is normally 10
alter system set pg_stat_monitor.pgsm_max_buckets = 75;
_eof_
systemctl restart postgresql@13-main
psql postgres <<_eof_
-- zero pg_stat_monitor stats
select * from pg_stat_monitor_reset();
_eof_
#
# begin benchmarking run
#
# 4500 seconds (75 minutes)
/usr/pgsql-12/bin/pgbench -U postgres -c 4 -j 2 -T 4500 -P 5 -b tpcb-like pgbench
#
# copy and save the benchmarking run into tables
#
psql postgres <<_eof_
drop table if exists pg_stat_monitor_archive,
pg_stat_monitor_qry,
pg_stat_monitor_uniq_id;
select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time;
select bucket_start_time, pgsm_query_id, queryid, total_exec_time
into pg_stat_monitor_qry
from pg_stat_monitor_archive
where application_name='pgbench';
with a as (select distinct on (pgsm_query_id) *
from pg_stat_monitor_archive
where application_name='pgbench')
select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
into pg_stat_monitor_uniq_id
from a
order by cmd_type;
_eof_
echo "DONE"
progress: 4435.0 s, 341.2 tps, lat 11.718 ms stddev 3.951
progress: 4440.0 s, 361.2 tps, lat 11.075 ms stddev 3.519
progress: 4445.0 s, 348.0 tps, lat 11.483 ms stddev 5.246
progress: 4450.0 s, 383.8 tps, lat 10.418 ms stddev 4.514
progress: 4455.0 s, 363.6 tps, lat 10.988 ms stddev 4.326
progress: 4460.0 s, 344.0 tps, lat 11.621 ms stddev 3.981
progress: 4465.0 s, 360.4 tps, lat 11.093 ms stddev 4.457
progress: 4470.0 s, 383.8 tps, lat 10.423 ms stddev 5.615
progress: 4475.0 s, 369.6 tps, lat 10.811 ms stddev 3.784
progress: 4480.0 s, 355.6 tps, lat 11.227 ms stddev 3.954
progress: 4485.0 s, 378.8 tps, lat 10.580 ms stddev 2.890
progress: 4490.0 s, 370.8 tps, lat 10.770 ms stddev 2.879
progress: 4495.0 s, 365.2 tps, lat 10.947 ms stddev 4.997
progress: 4500.0 s, 379.2 tps, lat 10.549 ms stddev 2.832
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 4
number of threads: 2
duration: 4500 s
number of transactions actually processed: 1564704
latency average = 11.497 ms
latency stddev = 4.800 ms
tps = 347.711175 (including connections establishing)
tps = 347.711731 (excluding connections establishing)
Dashboard example 1: Querying saved data
Top panel (Query execution time vs. DML)
Five (5) SQL statements are used to create this panel:
-- SELECT --
select bucket_start_time, total_exec_time as "SELECT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select bucket_start_time, total_exec_time as "INSERT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select bucket_start_time, total_exec_time as "UPDATE 1"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select bucket_start_time, total_exec_time as "UPDATE 2"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select bucket_start_time, total_exec_time as "UPDATE 3"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;
Bottom panel (Query execution time vs. shared blocks)
-- INSERT (ins_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as ins_hit, shared_blks_read as ins_read,
shared_blks_dirtied as ins_dirt, shared_blks_written as ins_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;
-- UPDATE 1 (update1_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as update1_hit, shared_blks_read as update1_read,
shared_blks_dirtied as update1_dirt, shared_blks_written as update1_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;
-- UPDATE 2 (update2_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as update2_hit, shared_blks_read as update2_read,
shared_blks_dirtied as update2_dirt, shared_blks_written as update2_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;
-- UPDATE 3 (update3_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as update3_hit, shared_blks_read as update3_read,
shared_blks_dirtied as update3_dirt, shared_blks_written as update3_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Analysis
Here are some example patterns that can be discerned:
- The SELECT statements are the fastest DML operations (top panel).
- Although SQL statement UPDATE 1 (top panel) takes up the most time, its contents do not have much presence in the shared buffer relative to the other update statements.
- Inserts (top) are the second-slowest set of statements, yet they account for very little shared-buffer activity compared to the UPDATEs (bottom).
Dashboard example 2: Monitoring in real time
These two panels show read/write IO performance to the persistent storage while benchmarking a live run.
Top panel (Execution time vs. DML)
-- SELECT --
select bucket_start_time, total_exec_time as "SELECT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select bucket_start_time, total_exec_time as "INSERT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select bucket_start_time, total_exec_time as "UPDATE 1"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select bucket_start_time, total_exec_time as "UPDATE 2"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select bucket_start_time, total_exec_time as "UPDATE 3"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;
Bottom panel (Time vs. IO)
-- time vs read/write blocks (blk_read_time, blk_write_time) --
-- requires track_io_timing = on
select bucket_start_time, blk_read_time, blk_write_time
from public.pg_stat_monitor;

Analysis
It’s quite easy to observe that SQL statement UPDATE 1 represents the bulk of the read operations.
-- example SQL statement
UPDATE pgbench_accounts SET abalance = abalance + 2063 WHERE aid = 1482568
Interestingly, writes are not as significant as reads.
Conclusion
I’m excited about pg_stat_monitor. Not only can it be used in Grafana, but it’s easily implemented in any monitoring solution, including our own Percona Monitoring and Management. It’s also incorporated in our latest version of Percona Operator for PostgreSQL.
pg_stat_monitor is an obvious, common-sense improvement over pg_stat_statements' greatest limitation, i.e., its inability to bucket metrics over time intervals. Frankly, I can see the pg_stat_monitor extension eventually replacing pg_stat_statements as the de facto extension for monitoring Postgres when it comes to real-time analysis.
Happy monitoring!
How To Scale a Single-Host PostgreSQL Database With Citus

When it comes to Citus, successfully building out and scaling a PostgreSQL cluster across multiple nodes and even across data centers can feel, at times, to be an art form because there are so many ways of building it out.
There’s an axiom that I think aptly applies to this situation describing the differences between science and art:
– Science: 1 problem -> 1 solution
– Art: 1 problem -> 1,000 solutions
While the two previous Citus blogs that I wrote covered the fairly straightforward concepts of creating columnar tables and leveraging data redundancy, this blog explores the architectural design considerations using this freely downloadable and fully featured PostgreSQL extension.
Rather than listing the concepts, function calls, etc, available in Citus, which frankly is a bit boring, I’m going to explore scaling out a database system starting with a single host. I won’t cover all the features but show just enough that you’ll want to see more of what you can learn to accomplish for yourself.
About the cluster

Following a step-by-step process, the objective is to create a four-node cluster consisting of:
- PostgreSQL version 15
- Citus extension (I’ll be using version 11, but there are newer ones available.)
- PostgreSQL cluster
  - One coordinator node
    - citus-coord-01
  - Three worker nodes
    - citus1
    - citus2
    - citus3
- Hardware (AWS instance)
  - Ubuntu Server 20.04, SSD volume type
  - 64-bit (x86)
  - c5.xlarge: 4 vCPU, 8GB RAM
  - Storage: 80GB EBS volume (root), gp2 (IOPS 240/3000)
As well, high availability will be integrated, guaranteeing cluster viability in the case that one worker node goes down.
Leveraging pgbench, a benchmarking utility that comes bundled with PostgreSQL, I will put the cluster through its paces by executing a series of DML operations. Using its default tpcb-like benchmark, one can stress test a database of any size, ranging from a few clients to simulating thousands interacting with a system sized in the terabytes if need be.
Steps
Provisioning
The first step is to provision the four nodes with both PostgreSQL and Citus. Refer here or here for getting and installing the extension.
-- update each postgres node and enable the extension
-- (requires a server restart)
alter system set shared_preload_libraries='citus';
On each of the four nodes, create the database with the Citus extension:
-- using psql
create database pgbench;
\c pgbench
create extension citus;
One worker node

Beginning with a single worker node, log onto citus-coord-01, declare the coordinator host, and add the first worker node, citus1:
-- execute on citus-coord-01
psql pgbench <<_eof1_
select citus_set_coordinator_host('citus-coord-01', 5432);
select citus_add_node('citus1', 5432);
_eof1_
While still logged in database pgbench, get a list of the worker nodes, which at this time is just the one:
select * from master_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 citus1    |      5432
Now, things get a little interesting; the normal initialization of pgbench is to create and populate the tables with the appropriate constraints. In this case, however, only the tables, without data or constraints, are initialized:
-- execute the following command on citus-coord-01
pgbench -iI t pgbench
pgbench=# \d+
List of relations
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description
--------+------------------+-------+----------+-------------+---------------+---------+-------------
public | citus_tables | view | postgres | permanent | | 0 bytes |
public | pgbench_accounts | table | postgres | permanent | heap | 0 bytes |
public | pgbench_branches | table | postgres | permanent | heap | 0 bytes |
public | pgbench_history | table | postgres | permanent | heap | 0 bytes |
public | pgbench_tellers | table | postgres | permanent | heap | 0 bytes |
This next set of commands, executed on the coordinator node, distributes the pgbench tables across node citus1:
psql pgbench <<_eof1_
BEGIN;
select create_distributed_table('pgbench_history', 'aid');
select create_distributed_table('pgbench_accounts', 'aid');
select create_distributed_table('pgbench_branches', 'bid');
select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;
_eof1_
And now, we can begin populating the tables. The following invocation generates almost 4GB of data. Notice I still haven’t added any constraints:
pgbench -iI g -s 300 pgbench
This command shows the total number of shards on the node. The default number of shards generated for a given table is 32.
select * from citus_tables;
    table_name    | citus_table_type | distribution_column | colocation_id | table_size | shard_count
------------------+------------------+---------------------+---------------+------------+-------------
 pgbench_accounts | distributed      | aid                 |             1 | 3844 MB    |          32
 pgbench_branches | distributed      | bid                 |             1 | 256 kB     |          32
 pgbench_history  | distributed      | aid                 |             1 | 0 bytes    |          32
 pgbench_tellers  | distributed      | tid                 |             1 | 256 kB     |          32
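The shard count per table is governed by the citus.shard_count setting, whose default is 32; it can be raised before distributing tables if finer-grained rebalancing is anticipated (the value 64 below is purely illustrative):

```sql
show citus.shard_count;       -- 32 by default
set citus.shard_count = 64;   -- applies to tables distributed afterwards
```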
You’ll also notice that only some of the shards have been populated with data, which is normal when initializing pgbench with data:
select * from citus_shards;
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+----------+----------+------------
 pgbench_accounts |  102040 | pgbench_accounts_102040 | distributed      |             1 | citus1   |     5432 |  125853696
 pgbench_accounts |  102041 | pgbench_accounts_102041 | distributed      |             1 | citus1   |     5432 |  126173184
 pgbench_accounts |  102042 | pgbench_accounts_102042 | distributed      |             1 | citus1   |     5432 |  125739008
 pgbench_accounts |  102043 | pgbench_accounts_102043 | distributed      |             1 | citus1   |     5432 |  125968384
 .
 .
 pgbench_branches |  102072 | pgbench_branches_102072 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102073 | pgbench_branches_102073 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102074 | pgbench_branches_102074 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102075 | pgbench_branches_102075 | distributed      |             1 | citus1   |     5432 |       8192
 .
 .
 pgbench_history  |  102008 | pgbench_history_102008  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102009 | pgbench_history_102009  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102010 | pgbench_history_102010  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102011 | pgbench_history_102011  | distributed      |             1 | citus1   |     5432 |          0
 .
 .
 pgbench_tellers  |  102104 | pgbench_tellers_102104  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102105 | pgbench_tellers_102105  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102106 | pgbench_tellers_102106  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102107 | pgbench_tellers_102107  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102108 | pgbench_tellers_102108  | distributed      |             1 | citus1   |     5432 |       8192
 .
 .
And now, execute the benchmark:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
The results are not pretty.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 1524
number of failed transactions: 0 (0.000%)
latency average = 791.374 ms
latency stddev = 157.710 ms
initial connection time = 15.704 ms
tps = 25.197427 (without initial connection time)
ATTENTION: Refer to the bottom of this blog, which summarizes a tabulation of the various benchmarks as this cluster evolves.
To speed up the benchmark, indexes must be added. But this standard pgbench method of adding indexes fails:
postgres@citus-coord-01:~$ pgbench -iI p pgbench
creating primary keys...
pgbench: error: query failed: ERROR:  cannot create constraint without a name on a distributed table
pgbench: detail: Query was: alter table pgbench_branches add primary key (bid)
Successfully adding indexes and constraints requires them to be explicitly named and created in the following fashion. Note that table pgbench_history requires a REPLICA identity because it doesn’t have a primary key:
# THIS WORKS!
psql pgbench <<_eof1_
--
-- indexes and constraints must be explicitly named
--
BEGIN;
create unique index pgbench_accounts_pk on pgbench_accounts(aid);
create unique index pgbench_branches_pk on pgbench_branches(bid);
create unique index pgbench_tellers_pk on pgbench_tellers(tid);
alter table pgbench_accounts add constraint pk_accounts primary key using index pgbench_accounts_pk;
alter table pgbench_branches add constraint pk_branches primary key using index pgbench_branches_pk;
alter table pgbench_tellers add constraint pk_tellers primary key using index pgbench_tellers_pk;
\qecho adding REPLICA IDENTITY (no PK present)
alter table pgbench_history replica identity full;
COMMIT;
_eof1_
Repeating the benchmark run yields a much better result:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 135273
number of failed transactions: 0 (0.000%)
latency average = 8.865 ms
latency stddev = 9.452 ms
initial connection time = 15.556 ms
tps = 2254.544852 (without initial connection time)
Two worker nodes

Adding a second worker node is straightforward, requiring only two steps:
- Adding the node.
- Rebalancing the shards across the two nodes.
Execute this on the coordinator node, citus-coord-01:
psql pgbench <<_eof1_
--
-- it's understood that database pgbench already exists on citus2
-- and extension citus has been created
--
\qecho adding node citus2 ...
select citus_add_node('citus2', 5432);
\qecho rebalancing shards across TWO nodes ...
select * from rebalance_table_shards();
ANALYZE;
_eof1_
Here’s a partial output as the tables are rebalanced across the two nodes:
adding node citus2 ...
citus_add_node
----------------
3
rebalancing shards across TWO nodes ...
NOTICE: Moving shard 102008 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102009 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102010 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102011 from citus1:5432 to citus2:5432 ...
.
.
NOTICE: Moving shard 102020 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102021 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102022 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102023 from citus1:5432 to citus2:5432 ...
Repeating the benchmark once again demonstrates that performance has degraded somewhat. This is to be expected, considering my POC consists of containers on a single machine, all sharing the same system resources:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 130190
number of failed transactions: 0 (0.000%)
latency average = 9.214 ms
latency stddev = 19.159 ms
initial connection time = 15.776 ms
tps = 2169.559969 (without initial connection time)
Three worker nodes

As per the two-worker node configuration, adding another node is straightforward.
psql pgbench <<_eof1_
\qecho adding node citus3 ...
select citus_add_node('citus3', 5432);
\qecho rebalancing shards across THREE nodes ...
select * from rebalance_table_shards();
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 142630
number of failed transactions: 0 (0.000%)
latency average = 8.410 ms
latency stddev = 5.542 ms
initial connection time = 16.407 ms
tps = 2377.227920 (without initial connection time)
Three worker nodes, with redundancy/replication

This next step is interesting because it adds redundancy by increasing the replication factor from one to two, such that each table shard exists in two copies placed on two distinct nodes.
ATTENTION: A newly updated replication factor affects only tables created after the change. Pre-existing tables are not affected by the new replication factor:
psql pgbench <<_eof1_
\qecho increasing replication factor to TWO
alter system set citus.shard_replication_factor=2;
select pg_reload_conf();
_eof1_
TIP: In order to replicate tables, set the replication factor before creating them.
In order to replicate the existing tables across the nodes, they are first undistributed, which centralizes them onto the coordinator node. Redistributing them across the three-node cluster then automatically adds the replication.
This centralizes the tables onto the coordinator node:
psql pgbench <<_eof1_
BEGIN;
select undistribute_table('pgbench_history');
select undistribute_table('pgbench_accounts');
select undistribute_table('pgbench_branches');
select undistribute_table('pgbench_tellers');
COMMIT;
_eof1_
The tables are redistributed and rebalanced across the three-node cluster with the correct redundancy of two:
psql pgbench <<_eof1_
BEGIN;
select create_distributed_table('pgbench_history', 'aid');
select create_distributed_table('pgbench_accounts', 'aid');
select create_distributed_table('pgbench_branches', 'bid');
select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;
-- remove legacy tables from the coordinator node
BEGIN;
select truncate_local_data_after_distributing_table($$public.pgbench_history$$);
select truncate_local_data_after_distributing_table($$public.pgbench_accounts$$);
select truncate_local_data_after_distributing_table($$public.pgbench_branches$$);
select truncate_local_data_after_distributing_table($$public.pgbench_tellers$$);
COMMIT;
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 68834
number of failed transactions: 0 (0.000%)
latency average = 17.430 ms
latency stddev = 12.818 ms
initial connection time = 16.321 ms
tps = 1147.093369 (without initial connection time)
Benchmark results
Here’s a summary of the various benchmarking runs, giving you an idea of relative performance.
| RUN | NUM NODES | WITH IDX | Replication Factor | TPS  |
|-----|-----------|----------|--------------------|------|
| 1   | 1         | no       | 1                  | 25   |
| 2   | 1         | yes      | 1                  | 2255 |
| 3   | 2         | yes      | 1                  | 2170 |
| 4   | 3         | yes      | 1                  | 2377 |
| 5   | 3         | yes      | 2                  | 1147 |
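The TPS figures in this table were copied by hand from each run's summary. If you capture the runs to files, a small shell sketch (assuming the standard pgbench summary format shown above) can pull the figure out for tabulation:

```shell
# Extract the tps figure from captured pgbench output.
# Matches the "tps = NNN (without initial connection time)" summary line.
extract_tps() {
  awk '/^tps/ {print $3}'
}

# illustrative use against one captured line:
echo 'tps = 2254.544852 (without initial connection time)' | extract_tps
# prints 2254.544852
```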
Caveat
Although not mentioned in the blog, there are a few things you’ll need to keep in mind as you work with CitusDB:
- The password for the account used to propagate operations across the cluster must be stored in the home directory’s .pgpass file on all nodes.
- The “wal_level” is set to logical.
- Redundancy can potentially decrease overall performance.
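To illustrate the first caveat, a .pgpass file covering this POC's nodes might look like the following sketch (the hostnames are this cluster's; the password value is, of course, a placeholder assumption):

```
# ~/.pgpass in the postgres home directory on every node; chmod 600
# hostname:port:database:username:password
citus-coord-01:5432:*:postgres:mysecretpassword
citus1:5432:*:postgres:mysecretpassword
citus2:5432:*:postgres:mysecretpassword
citus3:5432:*:postgres:mysecretpassword
citus4:5432:*:postgres:mysecretpassword
```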
Keep in mind that the performance metrics depend not only on the cluster’s hardware and system resources but also on the level of tuning, which, in this case, has not been addressed.
A future blog will continue my exploration into Citus by scaling out pgbench into other architectures.
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.
29
2023
Data Redundancy With the PostgreSQL Citus Extension

Over the years, I’ve had the opportunity to architect all sorts of configurations using Postgres as a backend. I’ve always found it very cool and satisfying to implement sophisticated business rules, often in more ways than one has fingers and toes. So, it’s not an understatement when I say that Citus is one of the more interesting technologies that I’ve come across when scaling PostgreSQL.
Citus is an extension that horizontally scales PostgreSQL across multiple machines by sharding and creating redundant copies of tables. Its query engine not only parallelizes incoming SQL queries, which suits time-series workloads, but also supports columnar tables, making it ideal for OLAP analysis duties.
Among its many capabilities, a Citus cluster can:
- Create distributed tables that are sharded across a cluster of PostgreSQL nodes to combine their CPU, memory, storage, and I/O capacity.
- Replicate reference tables to all nodes for joins and foreign keys from distributed tables and for maximum read performance.
- Route and parallelize SELECT, DML, and other operations on distributed tables across the cluster with the distributed query engine.
- Compress data with columnar storage of tables, speeding up scans and supporting fast projections, both on regular and distributed tables.
Data redundancy, a database version of a RAID
When pondering high availability and redundancy, the classic approach is to replicate data by creating a replica via streaming replication.
Now let’s stretch our imagination and consider a second method of high availability, à la Citus.
The best way to describe the Citus way of doing things is to reflect how data is managed by a disk RAID array. Depending on the configuration, one can tune a hardware RAID for either performance or redundancy. The same can be said for Citus data sharding.
Here is an example of a table named replica2x, which has 2X redundancy across a cluster of four (4) nodes. The colors indicate duplicated shards of the table. For example, if node citus1 goes offline, the sharded table it holds still has copies on nodes citus2 and citus4. Likewise, it can be said that if node citus2 goes offline, the same data is still available on nodes 1, 3, and 4.

About this POC
I’ll be upfront: I love working with Linux Containers, LXD. Much of what I will show you makes heavy use of them. You won’t need LXD to replicate this POC, of course, but I can’t say enough how flexible and versatile such an environment can be when prototyping a cluster of Postgres nodes, let alone an entire multi-data center infrastructure on a single workstation.
There are two parts to this POC:
- Part A: Setup
- Part B: Redundancy demonstration
Part A: POC setup
Step one: Getting and installing Citus
Referring to this earlier blog, you’ll see how to get and install Citus into your environment.
Step two: Creating the Citus nodes
The Citus cluster consists of five (5) nodes:
- citus-coord-01: coordinator
- citus1: worker
- citus2: worker
- citus3: worker
- citus4: worker
By using LXD, I created a single template container with Citus on Ubuntu 20.04, from which the various nodes were copied.
for u in citus-coord-01 citus1 citus2 citus3 citus4
do
echo "==== $u ===="
lxc rm --force $u 2>/dev/null
lxc cp template-ubuntu-2004-citusdb $u
lxc start $u
done
And here’s the resultant cluster:
lxc ls -c ns4 citus
+----------------+---------+----------------------+
|      NAME      |  STATE  |         IPV4         |
+----------------+---------+----------------------+
| citus1         | RUNNING | 10.231.38.140 (eth0) |
+----------------+---------+----------------------+
| citus2         | RUNNING | 10.231.38.151 (eth0) |
+----------------+---------+----------------------+
| citus3         | RUNNING | 10.231.38.171 (eth0) |
+----------------+---------+----------------------+
| citus4         | RUNNING | 10.231.38.204 (eth0) |
+----------------+---------+----------------------+
| citus-coord-01 | RUNNING | 10.231.38.34 (eth0)  |
+----------------+---------+----------------------+
It’s understood that on each of the five nodes:
- Database db01 has already been created.
- The postgresql.conf configuration file has been appropriately edited for remote access.
- The .pgpass file has been configured to supply the superuser password for all nodes.
- Extension citus has been created in database db01.
Step three: Check packages
Inspecting the nodes confirms Citus has been correctly installed:
for u in citus1 citus2 citus3 citus4 citus-coord-01
do
    echo "==== NODE: $u ===="
    lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
select extname, extversion from pg_extension;
_eof_
done | less -S
==== NODE: citus1 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus2 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus3 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus4 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus-coord-01 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
Properly installed, the Citus runtime variables are now available:
lxc exec citus1 -- su postgres -c psql db01<<_eof_ | less -S
select name,setting,unit from pg_settings where name ~ 'citus' order by 1;
_eof_
                        name                        |     setting     | unit
----------------------------------------------------+-----------------+------
 citus.all_modifications_commutative                | off             |
 citus.background_task_queue_interval               | 5000            | ms
 citus.cluster_name                                 | default         |
 citus.coordinator_aggregation_strategy             | row-gather      |
 citus.count_distinct_error_rate                    | 0               |
 citus.cpu_priority                                 | 0               |
 citus.cpu_priority_for_logical_replication_senders | inherit         |
 citus.defer_drop_after_shard_move                  | on              |
 citus.defer_drop_after_shard_split                 | on              |
 citus.defer_shard_delete_interval                  | 15000           | ms
 citus.desired_percent_disk_available_after_move    | 10              |
 citus.distributed_deadlock_detection_factor        | 2               |
 citus.enable_binary_protocol                       | on              |
 citus.enable_create_role_propagation               | on              |
 citus.enable_deadlock_prevention                   | on              |
 citus.enable_local_execution                       | on              |
 citus.enable_local_reference_table_foreign_keys    | on              |
 citus.enable_repartition_joins                     | off             |
 citus.enable_statistics_collection                 | off             |
 citus.explain_all_tasks                            | off             |
 citus.explain_analyze_sort_method                  | execution-time  |
 citus.limit_clause_row_fetch_count                 | -1              |
 citus.local_hostname                               | localhost       |
 citus.local_shared_pool_size                       | 50              |
 citus.local_table_join_policy                      | auto            |
 citus.log_remote_commands                          | off             |
 citus.max_adaptive_executor_pool_size              | 16              |
 citus.max_cached_connection_lifetime               | 600000          | ms
 citus.max_cached_conns_per_worker                  | 1               |
 citus.max_client_connections                       | -1              |
 citus.max_high_priority_background_processes       | 2               |
 citus.max_intermediate_result_size                 | 1048576         | kB
 citus.max_matview_size_to_auto_recreate            | 1024            | MB
 citus.max_shared_pool_size                         | 100             |
 citus.max_worker_nodes_tracked                     | 2048            |
 citus.multi_shard_modify_mode                      | parallel        |
 citus.multi_task_query_log_level                   | off             |
 citus.node_connection_timeout                      | 30000           | ms
 citus.node_conninfo                                | sslmode=require |
 citus.propagate_set_commands                       | none            |
 citus.recover_2pc_interval                         | 60000           | ms
 citus.remote_task_check_interval                   | 10              | ms
 citus.shard_count                                  | 32              |
 citus.shard_replication_factor                     | 1               |
 citus.show_shards_for_app_name_prefixes            |                 |
 citus.skip_constraint_validation                   | off             |
 citus.skip_jsonb_validation_in_copy                | on              |
 citus.stat_statements_track                        | none            |
 citus.task_assignment_policy                       | greedy          |
 citus.task_executor_type                           | adaptive        |
 citus.use_citus_managed_tables                     | off             |
 citus.use_secondary_nodes                          | never           |
 citus.values_materialization_threshold             | 100             |
 citus.version                                      | 11.1.4          |
 citus.worker_min_messages                          | notice          |
 citus.writable_standby_coordinator                 | off             |
Step four: Define/configure the cluster
Log into the coordinator in order to declare and configure the cluster:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
select citus_set_coordinator_host('citus-coord-01', 5432);
insert into pg_dist_node(nodename)
values ('citus1')
,('citus2')
,('citus3')
,('citus4');
_eof_
And here’s the cluster’s organization:
db01=# select nodeid,nodename,groupid,isactive from pg_dist_node order by 1;
nodeid | nodename | groupid | isactive
--------+----------------+---------+----------
1 | citus-coord-01 | 0 | t
2 | citus1 | 1 | t
3 | citus2 | 2 | t
4 | citus3 | 3 | t
5 | citus4 | 4 | t
Step five: Create, distribute, and populate a single table
Table myevents is created, and the newly inserted records are evenly distributed across the cluster of nodes.
Log into the coordinator and execute the following commands. Notice that all DDL and DML statements are executed on the coordinator node:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
-- create the table
create table myevents (
    device_id bigint,
    event_id bigserial,
    event_time timestamptz default now(),
    data jsonb not null,
    primary key (device_id, event_id)
);
-- distribute the events among the nodes
select create_distributed_table('myevents', 'device_id');
-- populate the table
insert into myevents (device_id, data)
select s % 100, ('{"measurement":'||random()||'}')::jsonb
from generate_series(1,1000000) s;
_eof_
Querying the coordinator:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
select *
from myevents
where device_id = 1
order by event_time desc, event_id desc
limit 10;
_eof_
device_id | event_id | event_time | data
----------+----------+-------------------------------+-------------------------------------
1 | 999901 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2868659956537316}
1 | 999801 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7931493079697731}
1 | 999701 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.4875322951757288}
1 | 999601 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.6491362745752653}
1 | 999501 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9629266554851366}
1 | 999401 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.1185674800281864}
1 | 999301 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.5133762596297742}
1 | 999201 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7307634886202119}
1 | 999101 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2997471209159892}
1 | 999001 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9692520484104021}
Step six: A review of table distributions across the cluster
This demonstrates clearly that table myevents is hash-sharded across every node member:
- Notice how well-balanced the sharded tables are in size.
- Notice the table name numbering order; consecutive shard numbers land on successive nodes.
for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
\dt+
_eof_
done | less -S
==== NODE: citus1 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102008 | table | postgres | permanent | heap | 2832 kB |
public | myevents_102012 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102016 | table | postgres | permanent | heap | 5624 kB |
public | myevents_102020 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102024 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102028 | table | postgres | permanent | heap | 5632 kB |
public | myevents_102032 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102036 | table | postgres | permanent | heap | 6560 kB |
==== NODE: citus2 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102009 | table | postgres | permanent | heap | 4696 kB |
public | myevents_102013 | table | postgres | permanent | heap | 976 kB |
public | myevents_102017 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102021 | table | postgres | permanent | heap | 3768 kB |
public | myevents_102025 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102029 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102033 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102037 | table | postgres | permanent | heap | 2840 kB |
==== NODE: citus3 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+------------+-------------
public | myevents_102010 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102014 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102018 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102022 | table | postgres | permanent | heap | 4696 kB |
public | myevents_102026 | table | postgres | permanent | heap | 8192 bytes |
public | myevents_102030 | table | postgres | permanent | heap | 2832 kB |
public | myevents_102034 | table | postgres | permanent | heap | 976 kB |
public | myevents_102038 | table | postgres | permanent | heap | 4696 kB |
==== NODE: citus4 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102011 | table | postgres | permanent | heap | 5632 kB |
public | myevents_102015 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102019 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102023 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102027 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102031 | table | postgres | permanent | heap | 2832 kB |
public | myevents_102035 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102039 | table | postgres | permanent | heap | 4696 kB |
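Rather than looping over the workers, the shard-to-node map can also be read in a single query from the coordinator, via the citus_shards view (the same view whose columns appear in the shard listing earlier in this series); a sketch:

```sql
-- run on the coordinator node
select table_name, shardid, nodename, nodeport, shard_size
from citus_shards
where table_name = 'myevents'::regclass
order by shardid;
```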
Part B: Redundancy demonstration
Method
- Step 1: Update shard replication factor from 1X to 2X
- Step 2: Create table myevents2x with 2X redundancy
- Step 3: Identify shard myevents2x_102040 across citus1 and citus2
- Step 4: Identify some records to query from shards known to be on nodes citus1 and citus2
- Step 5: Test
- Shut down citus1; perform the aforementioned query
- Start citus1, shut down citus2; perform the aforementioned query
- Restart citus2; perform the aforementioned query
ATTENTION: Please note that you may have to edit your own queries as the values may be different for your setup.
Step one:
Update shard replication factor from 1X to 2X:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
show citus.shard_replication_factor;
alter system set citus.shard_replication_factor=2;
select pg_reload_conf();
_eof_
# validate
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
show citus.shard_replication_factor;
_eof_
citus.shard_replication_factor
--------------------------------
 2
Step two:
Table myevents2x is created and populated with a redundancy of 2X across the cluster:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
-- create a new table with 2X redundancy
create table myevents2x (
    device_id bigint,
    event_id bigserial,
    event_time timestamptz default now(),
    data jsonb not null,
    primary key (device_id, event_id)
);
-- distribute the events among the nodes
select create_distributed_table('myevents2x', 'device_id');
-- confirm table has been added across the cluster
select * from master_get_active_worker_nodes() order by 1;
-- populate the table
insert into myevents2x (device_id, data)
select s % 100, ('{"measurement":'||random()||'}')::jsonb
from generate_series(1,1000000) s;
_eof_
Here’s the output:
CREATE TABLE
 create_distributed_table
--------------------------

(1 row)

 node_name | node_port
-----------+-----------
 citus1    |      5432
 citus2    |      5432
 citus3    |      5432
 citus4    |      5432
(4 rows)

INSERT 0 1000000
Step three:
Locate shard myevents2x_102040, which should be on nodes citus1 and citus2:
for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
select tablename from pg_tables where tablename~'myevents2x' order by 1
_eof_
done | less -S
Here’s the output:
==== NODE: citus1 ====
tablename
-------------------
myevents2x_102040
myevents2x_102043
myevents2x_102044
myevents2x_102047
myevents2x_102048
myevents2x_102051
myevents2x_102052
myevents2x_102055
myevents2x_102056
myevents2x_102059
myevents2x_102060
myevents2x_102063
myevents2x_102064
myevents2x_102067
myevents2x_102068
myevents2x_102071
(16 rows)
==== NODE: citus2 ====
tablename
-------------------
myevents2x_102040
myevents2x_102041
myevents2x_102044
myevents2x_102045
myevents2x_102048
myevents2x_102049
myevents2x_102052
myevents2x_102053
myevents2x_102056
myevents2x_102057
myevents2x_102060
myevents2x_102061
myevents2x_102064
myevents2x_102065
myevents2x_102068
myevents2x_102069
==== NODE: citus3 ====
tablename
-------------------
myevents2x_102041
myevents2x_102042
myevents2x_102045
myevents2x_102046
myevents2x_102049
myevents2x_102050
myevents2x_102053
myevents2x_102054
myevents2x_102057
myevents2x_102058
myevents2x_102061
myevents2x_102062
myevents2x_102065
myevents2x_102066
myevents2x_102069
myevents2x_102070
==== NODE: citus4 ====
tablename
-------------------
myevents2x_102042
myevents2x_102043
myevents2x_102046
myevents2x_102047
myevents2x_102050
myevents2x_102051
myevents2x_102054
myevents2x_102055
myevents2x_102058
myevents2x_102059
myevents2x_102062
myevents2x_102063
myevents2x_102066
myevents2x_102067
myevents2x_102070
myevents2x_102071
Step four:
Locate and return the first three records of shard myevents2x_102040 on nodes citus1 and citus2:
lxc exec citus1 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho ==== citus1 ====
\qecho select * from myevents2x_102040 order by 1,2 limit 3
select * from myevents2x_102040 order by 1,2 limit 3;
\c 'host=citus2 dbname=db01 user=postgres'
\qecho ==== citus2 ====
\qecho select * from myevents2x_102040 order by 1,2 limit 3
select * from myevents2x_102040 order by 1,2 limit 3;
\c 'host=citus-coord-01 dbname=db01 user=postgres'
\qecho ==== coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output:
==== citus1 ====
select * from myevents2x_102040 order by 1,2 limit 3
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
You are now connected to database "db01" as user "postgres" on host "citus2" (address "fd42:cb6a:5384:9a60:216:3eff:fe38>
==== citus2 ====
select * from myevents2x_102040 order by 1,2 limit 3
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
You are now connected to database "db01" as user "postgres" on host "citus-coord-01" (address "fd42:cb6a:5384:9a60:216:3>
==== coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step five:
The next few steps demonstrate our ability to continuously query and return those records found in shard myevents2x_102040.
Step 5a: Test, shutdown node citus1
lxc stop citus1
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== citus1 is shutdown ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output:
"==== citus1 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step 5b: Test, restart citus1, and shutdown citus2
lxc start citus1
lxc stop citus2
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== citus2 is shutdown ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output; note that it’s exactly the same as the previous test:
"==== citus2 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step 5c: Test, restart node citus2
lxc start citus2
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== cluster restored ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output, of course!
"==== cluster restored ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Conclusion
Data redundancy is the hallmark of high availability. But with Citus, we’ve raised the bar. Can you think of a better system that can stay up without losing time initiating failovers when a node fails?
Have fun!
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.
21
2023
Configuring PostgreSQL and LDAP Using StartTLS

Effectively working with LDAP as an authentication mechanism for PostgreSQL typically requires extensive knowledge of both domains. While trying to be as complete yet succinct as possible, I’m detailing how to enable TLS between a PostgreSQL server and an OpenLDAP server.
Ironically, the most complicated aspect has nothing to do with either PostgreSQL or OpenLDAP but with the steps of creating and signing private keys and certificates.
Note: I had seriously thought about leaving out much of the OpenLDAP commands, but I figured it might benefit you if you’re like me since I don’t touch this very often.
Summary steps
The underlying assumptions are:
- The PostgreSQL and LDAP services are installed on the same host.
- Username accounts, but not passwords, are created on the Postgres server.
- Both username accounts and passwords are configured on the LDAP server.
- Private/public keys and their certificates are to be created.
- A self-signed Certificate Authority, CA, exists on the host and has signed the aforementioned Postgres and LDAP certificates.
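Since the key and certificate work is called out above as the most complicated aspect, here is a minimal openssl sketch of that last assumption. The file names (ca.key, ca.crt, my-ldap.*) and subjects are illustrative assumptions; adjust them to your environment:

```shell
# 1. Create the self-signed Certificate Authority (key + certificate)
openssl req -new -x509 -days 365 -nodes -subj "/CN=my-ca" \
        -keyout ca.key -out ca.crt

# 2. Create the host's private key and a certificate signing request
openssl req -new -nodes -subj "/CN=my-ldap" \
        -keyout my-ldap.key -out my-ldap.csr

# 3. Sign the request with the CA, yielding the host certificate
openssl x509 -req -in my-ldap.csr -CA ca.crt -CAkey ca.key \
        -CAcreateserial -days 365 -out my-ldap.crt

# 4. Confirm the certificate chains back to the CA
openssl verify -CAfile ca.crt my-ldap.crt
```

The same pattern is repeated for the Postgres server certificate when the two services use separate keys.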
The environment
The Proof Of Concept described in this document consists of a single stand-alone server:
- Linux OS, Ubuntu 18.04
- Host name: my-ldap
- PostgreSQL version 14
- OpenLDAP version 2.4.45
- Distinguished Name: “cn=admin,dc=nodomain”
- password: admin
- topmost domain: “dn: dc=pg_user,dc=nodomain”
- superuser, postgres: “dn: cn=postgres,dc=pg_user,dc=nodomain”
- user1: “dn: cn=user1,dc=pg_user,dc=nodomain”
- user2: “dn: cn=user2,dc=pg_user,dc=nodomain”
- user accounts:
- postgres (password postgres)
- user1 (password user1)
- user2 (password user2)
Installation PostgreSQL
Apart from the standard steps of installing and configuring Postgres for remote access, edit the host-based authentication file (pg_hba.conf) so that Postgres refers to the LDAP service for authentication.
ROLEs and user accounts used in Postgres must be declared. Keep in mind that assigning passwords is NOT required:
-- Example
create role user1 with login password null;
create role user2 with login password null;
pg_hba.conf assumptions:
- The LDAP server address is 127.0.0.1 (localhost), or the hostname if desired.
- Rules for both IPv4 and IPv6 are declared.
- NO encryption is enabled between the Postgres and LDAP services.
# IPv4 local connections:
host all all 0.0.0.0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"
# IPv6 local connections:
host all all ::0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"
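Under this configuration, Postgres performs a simple LDAP bind by concatenating ldapprefix, the login username, and ldapsuffix into a distinguished name. A minimal sketch of that construction, using the values from the rule above:

```shell
#!/bin/sh
# How Postgres builds the bind DN in simple-bind LDAP mode:
#   DN = ldapprefix + username + ldapsuffix
# Values below are taken from the pg_hba.conf rule above.
PREFIX='cn='
SUFFIX=', dc=pg_user, dc=nodomain'
USERNAME='user1'

DN="${PREFIX}${USERNAME}${SUFFIX}"
echo "$DN"
```

Authentication succeeds only when a bind as that DN, using the password supplied to psql, succeeds on the LDAP server.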
Installation LDAP
apt-get install -y slapd ldap-utils
Running netstat (netstat -tlnp) returns the following:
Proto Recv-Q Send-Q Local Address    Foreign Address  State    PID/Program name
tcp        0      0 0.0.0.0:389      0.0.0.0:*        LISTEN   7742/slapd
tcp6       0      0 :::389           :::*             LISTEN   7742/slapd
Command line utilities
You can control the behavior of OpenLDAP by using these command line utilities:
- ldapmodify
- ldapadd
- ldapdelete
- ldapmodrdn
- ldapsearch
- ldapwhoami
Setting the administrator password
Administering the LDAP server requires setting the administrator password. Although the LDAP installation already sets this password (admin, in our case), it can be reset at will by executing the following command:
# Select "No" when asked to configure the database with dbconfig-common.
# Set the domain name for your LDAP server, for example, "example.com".
# Set the organization name for your LDAP server, for example, "Example Inc".
# Set the administrator password for your LDAP server.
dpkg-reconfigure slapd
Configuration
The following bash script demonstrates configuring OpenLDAP to authenticate three Postgres ROLES, i.e., postgres, user1, and user2:
#!/bin/bash
set -e
##########################################
#
# admin password is "admin"
# the top domain is assigned as "nodomain"
#
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# create the topmost domain
dn: dc=pg_user,dc=nodomain
objectClass: dcObject
objectClass: organization
dc: pg_user
o: Postgres Users
description: all postgres users reside here

# create SUPERUSER ROLE postgres
dn: cn=postgres,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: postgres
sn: postgres
userPassword: postgres
_eof_
#
##########################################
# ADD ROLES
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# creating other user accounts, down the road

# create role user1
dn: cn=user1,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user1
sn: user1
userPassword: user1

# create role user2
dn: cn=user2,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user2
sn: user2
userPassword: user2
_eof_
Test connectivity without TLS
A simple login confirms LDAP and PostgreSQL are working correctly. Even though there is an encrypted session between psql and the Postgres server, there is no encrypted session between Postgres and LDAP while authentication is performed:
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=postgres password=postgres' -c "select 'ping' as test_connectivity"
 test_connectivity
-------------------
 ping

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user1 password=user1' -c "select 'ping from user1' as test_connectivity"
 test_connectivity
-------------------
 ping from user1

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user2 password=user2' -c "select 'ping from user2' as test_connectivity"
 test_connectivity
-------------------
 ping from user2
Working With TLS Encryption
Install Additional Packages
To work with SSL certificates, these packages should be present, e.g., for the Ubuntu distribution:
apt install -y gnutls-bin ssl-cert
Managing the Certificate Authority
Authentication between the Postgres and LDAP servers includes verifying that the hosts making connection attempts are, in fact, legitimate. For that reason, the certificates for both servers must be signed, i.e., a Certificate Authority mechanism is required.
#
# Generate private key for self-signed Certificate Authority
#
certtool --generate-privkey --bits 4096 --outfile /etc/ssl/private/mycakey.pem
In this case, the CA certificate is configured to expire in ten years:
#
# Define CA certificate attributes
#
echo "cn = my-ldap ca
cert_signing_key
expiration_days = 3650" > /etc/ssl/ca.info
An internal system can get away with using self-signed CA certificates; otherwise, it is strongly recommended that your certificates be signed by an authorized CA. In this case, the certificate, once signed, is placed in the same directory as the other CA certificates, i.e., /usr/local/share/ca-certificates:
#
# Generate a self-signed CA certificate and
# copy the CRT to the other trusted CA certificates used by
# both the postgres and ldap servers (/usr/local/share/ca-certificates/)
#
certtool --generate-self-signed \
    --load-privkey /etc/ssl/private/mycakey.pem \
    --template /etc/ssl/ca.info \
    --outfile /usr/local/share/ca-certificates/mycacert.crt
ATTENTION: Once signed, the CA certificate is copied onto the Postgres and LDAP servers respectively. In this case, as they are on the same host, it is located on host my-ldap. Otherwise, it MUST be copied to all Postgres and LDAP hosts.
Once copied into the correct directory, the list of CA certificates is updated, adding the self-signed CA certificate:
#
# Update the list of CA certificates on both
# postgres and LDAP servers
#
update-ca-certificates
Generating the LDAP public/private key
From here on, it is important to use the fully qualified domain name of the LDAP host, i.e., my-ldap, in the certificate.
Generate a private key for LDAP server:
certtool --generate-privkey --bits 2048 --outfile /etc/ldap/my-ldap_slapd_key.pem
Define LDAP certificate attributes:
#
# for a certificate request which expires
# in one year
#
echo "organization = mycompany
cn = my-ldap
tls_www_server
encryption_key
signing_key
expiration_days = 365" > /etc/ssl/my-ldap.info
Sign the LDAP private key using the self-signed Certificate Authority certificate and its private key:
certtool --generate-certificate \
    --load-privkey /etc/ldap/my-ldap_slapd_key.pem \
    --load-ca-certificate /etc/ssl/certs/mycacert.pem \
    --load-ca-privkey /etc/ssl/private/mycakey.pem \
    --template /etc/ssl/my-ldap.info \
    --outfile /etc/ldap/my-ldap_slapd_cert.pem
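For readers more familiar with openssl than certtool, here is a self-contained sketch of the same generate/sign/verify cycle, run in a throwaway directory; the subject names mirror this PoC, and none of the system paths above are touched:

```shell
#!/bin/sh
# Demonstrate the CA workflow with openssl in a temporary directory.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# self-signed CA (analogous to certtool --generate-self-signed)
openssl req -x509 -newkey rsa:2048 -nodes -keyout cakey.pem \
    -subj "/CN=my-ldap ca" -days 3650 -out cacert.pem

# server key and certificate request (analogous to certtool --generate-privkey
# plus the template file)
openssl req -newkey rsa:2048 -nodes -keyout slapd_key.pem \
    -subj "/CN=my-ldap" -out slapd.csr

# sign the request with the CA (analogous to certtool --generate-certificate)
openssl x509 -req -in slapd.csr -CA cacert.pem -CAkey cakey.pem \
    -CAcreateserial -days 365 -out slapd_cert.pem

# the signed certificate must verify against the CA
openssl verify -CAfile cacert.pem slapd_cert.pem
```

The final command prints “slapd_cert.pem: OK”, which is exactly the relationship the certtool steps establish between mycacert.crt and my-ldap_slapd_cert.pem.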
Update access permissions of the private key:
sudo chgrp openldap /etc/ldap/my-ldap_slapd_key.pem
sudo chmod 0640 /etc/ldap/my-ldap_slapd_key.pem
Create and save the LDAP TLS configuration file, certinfo.ldif:
echo "dn: cn=config
add: olcTLSCACertificateFile
olcTLSCACertificateFile: /etc/ssl/certs/mycacert.pem
-
add: olcTLSCertificateFile
olcTLSCertificateFile: /etc/ldap/my-ldap_slapd_cert.pem
-
add: olcTLSCertificateKeyFile
olcTLSCertificateKeyFile: /etc/ldap/my-ldap_slapd_key.pem" > /etc/ldap/schema/certinfo.ldif
Enable OpenLDAP to use TLS:
ldapmodify -Y EXTERNAL -H ldapi:/// -f /etc/ldap/schema/certinfo.ldif
Validation test #1
Validation returns the string “anonymous”:
ldapwhoami -x -ZZ -H ldap://my-ldap
Validation test #2
This test confirms that the previous behavior without TLS still works:
# without TLS
ldapsearch -x -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin
This test should return exactly the same output as the previous ldapsearch. The “-ZZ” switch enforces TLS; failure is indicated by a short error message.
# with TLS
ldapsearch -x -ZZ -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin
Working with PostgreSQL using TLS
Configuration
This is the host-based rule with TLS configured; notice the addition of the ldaptls=1 argument:
# IPv4 local connections:
hostssl all all 0.0.0.0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1
# IPv6 local connections:
host all all ::0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1
Validation test: TLS
After updating pg_hba.conf with the new argument, ldaptls=1, a server reload is executed, followed by a psql ping:
systemctl reload postgresql@14-main
root@my-ldap:~# psql 'host=10.231.38.243 dbname=postgres user=postgres password=postgres' -c "select 'ping from postgres' as test_connectivity"
 test_connectivity
--------------------
 ping from postgres
05
2023
Configuring PgBouncer for Multi-Port Access

From time to time, situations occur where unusual circumstances dictate out-of-the-box thinking.
For example, suppose you have a system where you’ve installed multiple data clusters onto a single host. What you end up with is a host splitting its valuable system resources, i.e., CPU, RAM, disk, etc., between multiple instances of PostgreSQL data clusters. While easy to do in a development environment, it represents technical debt once it reaches production. Sooner or later, this issue must be addressed; otherwise, you can suffer the consequences of handicapping your entire database infrastructure.
Let’s now move forward in time: your system has scaled, of course, and this shortcut of using multiple data clusters on a single host has become a performance bottleneck. The problem is that you either won’t or can’t refactor your application servers; something, maybe, about not having enough time in the day. And, as you may already know, while Postgres can listen on a UNIX domain socket as well as IPv4 and IPv6 addresses, it is nevertheless constrained to a single port.
So what do you do?
For the experienced sysadmin, there are actually quite a number of “magical” techniques. However, in this case, with a little help from systemd, which manages all service processes, we will solve this little puzzle using PgBouncer with a concoction of configuration files.
Scenario
Configure the system such that Postgres resides on its default port of 5432 and PgBouncer sits on three ports, i.e., 6432, 6433, and 6434, accessing the resident Postgres server.
The PgBouncer connection pooler will use an administrative account, the ROLE pgbouncer, for the purpose of user authentication. Authentication is achieved by accessing the Postgres server’s pg_shadow table and comparing the resultant hash against all incoming connections (this won’t work for cloud setups such as, for example, Amazon RDS).
A set of Systemd configuration files will be created and edited in order to manage the PgBouncer service.
About the files
Below is a summary of the files and how they will be edited. Remember, these configuration files are of a hypothetical nature using minimal settings, which, of course, will need to be updated to match a realistic production environment.
├── pgbouncer
│ ├── pgbouncer.ini
│ └── userlist.txt
├── postgres
│ ├── add_function.sql
│ ├── add_user.sql
└── systemd
├── pgbouncer_override
└── pgbouncer.socket
pgbouncer.ini
This configuration file defines all behavior and is installed in its default location, “/etc/pgbouncer”.
Only one domain socket is used. For our purposes, the listen_port runtime parameter is just noise and is superseded by the other runtime parameters as declared in file pgbouncer.socket.
TIP: Back up the original pgbouncer.ini, as it references ALL runtime parameters.
[databases]
* = host=localhost

[users]
# left blank

[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

;; these parameters are implicitly disabled
listen_addr = 0.0.0.0
listen_port = 6432

unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer
auth_query = SELECT p_user, p_password FROM public.lookup($1)
pool_mode = session

;; Use <appname - host> as application_name on server.
application_name_add_host = 1
userlist.txt
Contains a single user account and its password for authentication. Note that the md5-hashed password is pgbouncer.
"pgbouncer" "md5be5544d3807b54dd0637f2439ecb03b9"
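The hash follows PostgreSQL’s md5 password format: the literal prefix md5 followed by md5(password || username). A sketch of how one might generate a userlist.txt entry (the helper below is illustrative, not part of PgBouncer):

```shell
#!/bin/sh
# Build a userlist.txt line for a given user/password pair.
# PostgreSQL md5 format: "md5" + md5(password concatenated with username)
USERNAME='pgbouncer'
PASSWORD='pgbouncer'

HASH="md5$(printf '%s%s' "$PASSWORD" "$USERNAME" | md5sum | cut -d' ' -f1)"
echo "\"$USERNAME\" \"$HASH\""
```

Running this for pgbouncer/pgbouncer reproduces the entry shown above.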
add_user.sql
Adds the ROLE “pgbouncer” to the Postgres data cluster. While under normal circumstances this is not required, PgBouncer uses this ROLE in order to validate all logins.
For our purposes, the password is “pgbouncer”:
--
-- EXECUTE AS SUPERUSER postgres
--
CREATE ROLE pgbouncer WITH LOGIN PASSWORD 'pgbouncer';
add_function.sql
This is a user-defined function that the ROLE pgbouncer executes in order to obtain the password hash from pg_shadow.
This SQL statement must be executed against each and every database that is to be accessed by any ROLE connection using PgBouncer.
TIP: Execute this SQL against database template1, and the function call will be included with every newly created database thereafter.
--
-- EXECUTE AS SUPERUSER postgres
--
-- execute on each database user accounts will login
--
CREATE FUNCTION public.lookup (
    INOUT p_user     name,
    OUT   p_password text
) RETURNS record
LANGUAGE sql
SECURITY DEFINER
SET search_path = pg_catalog
AS $$SELECT usename, passwd FROM pg_shadow WHERE usename = p_user$$;

-- make sure only "pgbouncer" can use the function
REVOKE EXECUTE ON FUNCTION public.lookup(name) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.lookup(name) TO pgbouncer;
pgbouncer_override
This systemd drop-in file overrides key options in the default PgBouncer unit file and will never be overwritten even after updating the PgBouncer Linux RPM/DEB packages.
As root, execute the following command and paste the contents of the provided file pgbouncer_override:
systemctl edit pgbouncer
#
# systemctl edit pgbouncer
# systemctl daemon-reload
#
[Unit]
Requires=pgbouncer.socket
pgbouncer.socket
This is the secret sauce; this file enables PgBouncer to listen on all three ports, 6432, 6433, and 6434. You will note that adding or removing ports is a simple matter of adding or removing addresses as per the format shown in the file below.
As root: create this file and perform a daemon reload:
vi /etc/systemd/system/pgbouncer.socket
#
# vi /etc/systemd/system/pgbouncer.socket
# systemctl daemon-reload
#
[Unit]
Description=sockets for PgBouncer
PartOf=pgbouncer.service

[Socket]
ListenStream=6432
ListenStream=6433
ListenStream=6434
ListenStream=/var/run/postgresql/.s.PGSQL.6432
ReusePort=true
RemoveOnStop=true

[Install]
WantedBy=sockets.target
systemctl daemon-reload
Putting it all together
After all the configuration files have been created and edited, this is what we get:
systemctl restart pgbouncer
netstat -tlnp
Proto Recv-Q Send-Q Local Address    Foreign Address  State
tcp        0      0 0.0.0.0:22       0.0.0.0:*        LISTEN
tcp        0      0 127.0.0.1:5432   0.0.0.0:*        LISTEN
tcp        0      0 127.0.0.53:53    0.0.0.0:*        LISTEN
tcp6       0      0 :::22            :::*             LISTEN
tcp6       0      0 ::1:5432         :::*             LISTEN
tcp6       0      0 :::6432          :::*             LISTEN
tcp6       0      0 :::6433          :::*             LISTEN
tcp6       0      0 :::6434          :::*             LISTEN
Finally, perform the following:
- sudo as Postgres
- create a database db01
- create a ROLE usr1 accessing database db01
- update pg_hba.conf and postgresql.conf allowing localhost connections by PgBouncer
Now validate the ports:
for u in 5432 6432 6433 6434
do
echo "==== port: $u ==="
export PGHOST=localhost PGPORT=$u PGDATABASE=db01
psql "user=usr1 password=usr1" -c "select 'hello world' as greetings"
done
And here’s the output:
==== port: 5432 ===
  greetings
-------------
 hello world
(1 row)

==== port: 6432 ===
  greetings
-------------
 hello world
(1 row)

==== port: 6433 ===
  greetings
-------------
 hello world
(1 row)

==== port: 6434 ===
  greetings
-------------
 hello world
(1 row)
Conclusion
Because PostgreSQL was designed from the ground up to work in conjunction with the OS, i.e., a UNIX type of operating system, we’ve been able to solve an interesting problem using a novel approach. And while systemd is ubiquitous, it isn’t normally considered part of a solution to a Postgres puzzle. We call this hacking 🙂
Happy Travels!
02
2023
An Argument for Logical Failover Slots

These days, the typical PostgreSQL cluster consists not only of a multi-node replication cluster, which is ordinarily an asynchronous streaming replication model but can sometimes include a logical replication component.

Recall that logical replication works by using the PUB/SUB mode, where individual tables are published and are then subscribed by remotely connected databases/tables. DML operations, such as INSERT, UPDATE, DELETE, TRUNCATE, etc., are then replicated from one or more publishers to one or more subscribers.
For the most part, it’s a pretty straightforward operation. The only real limitation is that one can only publish tables from a read-write server such as the PRIMARY.

There’s a problem, however, when it comes to executing failovers. While a STANDBY/REPLICA host can quickly take over with a simple select pg_promote() SQL statement, the same cannot be said regarding logical replication. Instead, failing over is, in fact, not a single action but a collection of discrete steps that must be performed in an exact and precise manner.
Consider the following steps typifying a failover:
- Promote the REPLICA, thus becoming the new PRIMARY.
- When present, redirect other REPLICAs pointing to the new PRIMARY.
- Failover the logical replication slot:
- Block all logical replication from the failed PRIMARY.
- Create a logical replication slot on the newly promoted PRIMARY.
- Alter the subscription and point to the newly promoted PRIMARY.
- Resume/restore logical replication between publisher and subscriber.
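The steps above can be sketched as a plan of SQL commands. This dry-run sketch only prints the plan; the host, slot, and subscription names are hypothetical placeholders, not from a real cluster:

```shell
#!/bin/sh
# Dry-run plan for a logical-slot failover; nothing here is executed
# against a live cluster. NEW_PRIMARY, SLOT, and SUB are placeholders.
NEW_PRIMARY='pg2'
SLOT='failover_slot'
SUB='my_sub'

PLAN="-- 1. promote the REPLICA (run on the REPLICA)
SELECT pg_promote();
-- 2. recreate the logical slot on the newly promoted PRIMARY
SELECT pg_create_logical_replication_slot('$SLOT', 'pgoutput');
-- 3. block, repoint, and resume the subscription (run on the subscriber)
ALTER SUBSCRIPTION $SUB DISABLE;
ALTER SUBSCRIPTION $SUB CONNECTION 'host=$NEW_PRIMARY dbname=postgres';
ALTER SUBSCRIPTION $SUB ENABLE;"

echo "$PLAN"
```

Each command must succeed in order, which is precisely why the manual procedure is fragile compared to a built-in failover slot.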
The challenge, of course, is reconstituting the logical replication process as quickly as possible, ideally without any data loss.
Two common solutions come to mind:
- Block all incoming connections before promoting the STANDBY while simultaneously installing the logical replication slot.
- Promote the STANDBY and then flush and recopy the entire contents of the subscribed table(s).
In both cases, these solutions make a fast recovery problematic, not to mention the additional effort required by the SRE/DBA/DEV preparing the requisite instructions.

In an ideal world, the perfect implementation of a logical slot failover would have one already in place on the STANDBY, thus guaranteeing data consistency without fear of data loss no matter how fast or slow the actual promotion or altering the subscription’s connectivity parameters may take.

The good news is that there are hacks that can speed up the entire recovery process, although they all have their own quirky limitations. Patroni, for example, implements one such technique by copying the file named state, which is located in the data cluster’s subdirectory pg_replslot, over to the REPLICA’s data cluster in the same location.
# PGDATA on an Ubuntu install
/var/lib/postgresql/15/main/pg_replslot/pg3
└── state
However, the caveat is that it requires a full server restart of the REPLICA, in addition to its promotion, before the logical slot is fully active. And, of course, timing is critical.

So there you have it.
Too bad such a feature doesn’t exist… or does it?

Stay tuned folks; the game is afoot. 
For the curious, I’ve included some references in past blogs, courtesy of Jobin:
- Failover of Logical Replication Slots in PostgreSQL
- How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster
And here’s another set of references: a valiant attempt to incorporate this feature into PostgreSQL a few years ago.
- PostgreSQL wiki: here
- Part of a mail thread back in 2016: here
- An old blog located on http://webarchive.org from a now-defunct company: here