Grafana Dashboards: A PoC Implementing the PostgreSQL Extension pg_stat_monitor

This PoC demonstrates how to install and configure pg_stat_monitor in order to extract useful and actionable metrics from a PostgreSQL database and display them on a Grafana dashboard.
About the environment
- Grafana: version 10.0.0
- Grafana database backend: Prometheus version 2.15.2+d
- PostgreSQL version 13
- pgbench version 13
To explore which constructive and useful PostgreSQL-derived metrics can be surfaced in Grafana, load is generated using pgbench.
Configuring Grafana
For our purposes, the Grafana datasource used in this PoC is also the Postgres data cluster that is generating the data to be monitored.

pg_stat_monitor
About
pg_stat_monitor is a Query Performance Monitoring tool for PostgreSQL. It collects various statistics data such as query statistics, query plan, SQL comments, and other performance insights. The collected data is aggregated and presented in a single view.
pg_stat_monitor takes its inspiration from pg_stat_statements. Unlike pg_stat_statements, which aggregates its metrics from the last time it was zeroed, pg_stat_monitor can bucket its output into a set number of time-based aggregates, sparing users the effort of doing that aggregation themselves.
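For instance, per-bucket totals for a single statement can be pulled directly from the view. This is an illustrative sketch against the view's documented columns; the :'target_queryid' psql variable is my own placeholder:

```sql
-- one row per time bucket for the chosen statement
select bucket, bucket_start_time, calls, total_exec_time
from pg_stat_monitor
where queryid = :'target_queryid'
order by bucket_start_time;
```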
pg_stat_monitor tracks the following operations:
- statements
- queries
- functions
- stored procedures and other non-utility statements
Features
- Time Interval Grouping: Instead of supplying one set of ever-increasing counts, pg_stat_monitor computes stats for a configured number of time intervals – time buckets. This allows for much better data accuracy, especially in the case of high-resolution or unreliable networks.
- Multi-Dimensional Grouping: While pg_stat_statements groups counters by userid, dbid, queryid, pg_stat_monitor uses a more detailed group for higher precision. This allows a user to drill down into the performance of queries.
- Capture Actual Parameters in the Queries: pg_stat_monitor allows you to choose whether you want to see queries with placeholders for parameters or with the actual parameter data. This simplifies debugging and analysis by enabling users to re-execute the same query with real values.
- Query Plan: Each SQL is now accompanied by its actual plan that was constructed for its execution. That’s a huge advantage if you want to understand why a particular query is slower than expected.
- Tables Access Statistics for a Statement: This allows us to easily identify all queries that accessed a given table. This set is at par with the information provided by the pg_stat_statements.
- Histogram: Visual representation is very helpful as it can help identify issues. With the help of the histogram function, one can now view a timing/calling data histogram in response to an SQL query. And yes, it even works in psql.
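The underlying histogram data also surfaces in the view's resp_calls column (a text[] of per-range call counts, visible in the view definition below). A minimal sketch of unpacking it, assuming default histogram settings:

```sql
-- expand the response-time histogram ranges for busy statements
select bucket, queryid, unnest(resp_calls) as calls_in_range
from pg_stat_monitor
where calls > 0;
```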
Installation (example: CENTOS8, pg14)
The simplest way to get pg_stat_monitor is to install it via Percona Distribution for PostgreSQL.
The following instructions demonstrate installing Percona Distribution for PostgreSQL and pg_stat_monitor on a CENTOS8 OS Linux distribution:
# Install the Percona repository
dnf install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ppg14

# Install the PostgreSQL community repository
dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf update -y
dnf install -y pg_stat_monitor_14

# Perform standard initialization and systemd configuration
/usr/pgsql-14/bin/postgresql-14-setup initdb

# Configure postgres to use pg_stat_monitor
echo "
shared_preload_libraries=pg_stat_monitor
" >> /var/lib/pgsql/14/data/postgresql.auto.conf

# Complete postgres configuration
systemctl enable postgresql-14
systemctl start postgresql-14
Create extension
The pg_stat_monitor extension can be created in any database, but for the purposes of this PoC, it is placed in the database pgbench.
postgres=# create database pgbench;
postgres=# \c pgbench
pgbench=# create extension pg_stat_monitor;
pgbench=# \d
List of relations
Schema | Name | Type | Owner
--------+-----------------+------+----------
public | pg_stat_monitor | view | postgres
View "public.pg_stat_monitor"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
bucket | bigint | | |
bucket_start_time | timestamp with time zone | | |
userid | oid | | |
username | text | | |
dbid | oid | | |
datname | text | | |
client_ip | inet | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
toplevel | boolean | | |
top_queryid | bigint | | |
query | text | | |
comments | text | | |
planid | bigint | | |
query_plan | text | | |
top_query | text | | |
application_name | text | | |
relations | text[] | | |
cmd_type | integer | | |
cmd_type_text | text | | |
elevel | integer | | |
sqlcode | text | | |
message | text | | |
calls | bigint | | |
total_exec_time | double precision | | |
min_exec_time | double precision | | |
max_exec_time | double precision | | |
mean_exec_time | double precision | | |
stddev_exec_time | double precision | | |
rows | bigint | | |
shared_blks_hit | bigint | | |
shared_blks_read | bigint | | |
shared_blks_dirtied | bigint | | |
shared_blks_written | bigint | | |
local_blks_hit | bigint | | |
local_blks_read | bigint | | |
local_blks_dirtied | bigint | | |
local_blks_written | bigint | | |
temp_blks_read | bigint | | |
temp_blks_written | bigint | | |
blk_read_time | double precision | | |
blk_write_time | double precision | | |
resp_calls | text[] | | |
cpu_user_time | double precision | | |
cpu_sys_time | double precision | | |
wal_records | bigint | | |
wal_fpi | bigint | | |
wal_bytes | numeric | | |
bucket_done | boolean | | |
plans | bigint | | |
total_plan_time | double precision | | |
min_plan_time | double precision | | |
max_plan_time | double precision | | |
mean_plan_time | double precision | | |
stddev_plan_time | double precision | | |
About pgbench
pgbench is a simple program executing benchmark tests on PostgreSQL by running the same sequence of SQL commands over and over. pgbench is capable of executing multiple concurrent database sessions and can calculate the average transaction rate (TPS) at the end of a run. Although the default configuration simulates loading based loosely upon TPC-B, it is nevertheless easy to test other use cases by writing one’s own transaction script files.
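As a sketch of that custom-script capability, a hypothetical transaction file (the file name and query are my own, not part of this PoC) could look like:

```sql
-- select_only.sql: a minimal custom pgbench transaction script
-- \set assigns a random account id once per transaction
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
```

It would then be run with, e.g., pgbench -f select_only.sql -c 4 -T 60 pgbench.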
Querying the data
While it is reasonable to create panels showing real-time load, the better way to explore the types of queries that can be run against pg_stat_monitor is to copy the data into tables once the benchmarking run has completed and query it there.
Table: pg_stat_monitor_archive
Save the data generated from a recently completed benchmark run into an archive table:
select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time asc;
Table "public.pg_stat_monitor_archive"
Column | Type | Collation | Nullable | Default
--------------------+--------------------------+-----------+----------+---------
bucket | bigint | | |
bucket_start_time | timestamp with time zone | | |
userid | oid | | |
username | text | | |
dbid | oid | | |
datname | text | | |
client_ip | inet | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
toplevel | boolean | | |
top_queryid | bigint | | |
query | text | | |
comments | text | | |
planid | bigint | | |
query_plan | text | | |
top_query | text | | |
application_name | text | | |
relations | text[] | | |
cmd_type | integer | | |
cmd_type_text | text | | |
elevel | integer | | |
sqlcode | text | | |
message | text | | |
calls | bigint | | |
total_exec_time | double precision | | |
min_exec_time | double precision | | |
max_exec_time | double precision | | |
mean_exec_time | double precision | | |
stddev_exec_time | double precision | | |
rows | bigint | | |
shared_blks_hit | bigint | | |
shared_blks_read | bigint | | |
shared_blks_dirtied | bigint | | |
shared_blks_written | bigint | | |
local_blks_hit | bigint | | |
local_blks_read | bigint | | |
local_blks_dirtied | bigint | | |
local_blks_written | bigint | | |
temp_blks_read | bigint | | |
temp_blks_written | bigint | | |
blk_read_time | double precision | | |
blk_write_time | double precision | | |
resp_calls | text[] | | |
cpu_user_time | double precision | | |
cpu_sys_time | double precision | | |
wal_records | bigint | | |
wal_fpi | bigint | | |
wal_bytes | numeric | | |
bucket_done | boolean | | |
plans | bigint | | |
total_plan_time | double precision | | |
min_plan_time | double precision | | |
max_plan_time | double precision | | |
mean_plan_time | double precision | | |
stddev_plan_time | double precision | | |
Table: pg_stat_monitor_qry
Extract this metric of interest, i.e., time vs total execution time:
select bucket_start_time, pgsm_query_id, queryid, total_exec_time into pg_stat_monitor_qry from pg_stat_monitor order by bucket_start_time asc;
pgbench=# \d pg_stat_monitor_qry
Table "public.pg_stat_monitor_qry"
Column | Type | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
bucket_start_time | timestamp with time zone | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
total_exec_time | double precision | | |
Table: pg_stat_monitor_shared_blk_io
Extract this metric of interest, i.e., time vs shared_blk io:
select bucket_start_time, pgsm_query_id, queryid,
shared_blks_hit, shared_blks_read,
shared_blks_dirtied, shared_blks_written
into pg_stat_monitor_shared_blk_io
from pg_stat_monitor_archive
order by bucket_start_time asc;
pgbench=# \d pg_stat_monitor_shared_blk_io
Table "public.pg_stat_monitor_shared_blk_io"
Column | Type | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
bucket_start_time | timestamp with time zone | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
shared_blks_hit | bigint | | |
shared_blks_read | bigint | | |
shared_blks_dirtied | bigint | | |
shared_blks_written | bigint | | |
Table: pg_stat_monitor_blk_io
Note: this metric requires the runtime parameter track_io_timing to be set to on.
Extract this metric of interest, i.e., time vs. blk io:
select bucket_start_time, pgsm_query_id, queryid, blk_read_time, blk_write_time
into pg_stat_monitor_blk_io
from pg_stat_monitor_archive
order by bucket_start_time asc;
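Enabling I/O timing is standard PostgreSQL and does not require a restart; a sketch:

```sql
-- blk_read_time / blk_write_time stay at zero unless this is on
alter system set track_io_timing = on;
select pg_reload_conf();
```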
Table: pg_stat_monitor_uniq_id
Save a copy of all unique query IDs in order to parse out future queries from the view.
Column pgsm_query_id identifies the query in such a manner that one can still identify the same query even when generated on other platforms under different loading conditions with different data:
with a as (select distinct on (pgsm_query_id) *
from pg_stat_monitor_archive
where application_name='pgbench')
select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
into pg_stat_monitor_uniq_id
from a
order by cmd_type;
pgbench=# \d pg_stat_monitor_uniq_id
Table "public.pg_stat_monitor_uniq_id"
Column | Type | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
cmd_type | integer | | |
cmd_type_text | text | | |
pgsm_query_id | bigint | | |
queryid | bigint | | |
example_query | text | | |
This is an example set of queries generated by pgbench. Note the numbers in column pgsm_query_id are always the same values irrespective of hosts or environments:
select cmd_type_text, pgsm_query_id, example_query from pg_stat_monitor_uniq_id where cmd_type > 0;
cmd_type_text | pgsm_query_id | example_query
---------------+----------------------+-----------------------------------------------------------------
SELECT | -7455620703706695456 | SELECT abalance FROM pgbench_accounts WHERE aid = 16416498
UPDATE | -510321339504955469 | UPDATE pgbench_accounts SET abalance = abalance + 2063
WHERE aid = 1482568
UPDATE | 5276535447716615446 | UPDATE pgbench_branches SET bbalance = bbalance + 1384
WHERE bid = 7
UPDATE | 3629195281782908951 | UPDATE pgbench_tellers SET tbalance = tbalance + -2966
WHERE tid = 330
INSERT | -8751124061964589929 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
VALUES (321, 56, 21104880, 4211, CURRENT_TIMESTAMP)
Benchmarking
Two types of performance monitoring are profiled:
- Real-time loading performance
- Aggregate performance over a specific time period, i.e., a snapshot.
Although the results of the benchmarking can be viewed by querying the view pg_stat_monitor, you will note, as demonstrated by the bash script and SQL statements below, that the contents of the view are immediately copied and saved into a collection of tables. This is because the data disappears over time as pg_stat_monitor cycles through its allotted number of buckets.
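The retention window is simply the bucket time multiplied by the bucket count; with the settings used in this benchmark, 75 one-minute buckets cover the 4,500-second (75-minute) run with nothing to spare. A quick sanity check:

```sql
show pg_stat_monitor.pgsm_bucket_time;   -- 1min in this setup
show pg_stat_monitor.pgsm_max_buckets;   -- 75 in this setup
select 75 * interval '1 minute' as retention_window;  -- 01:15:00
```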
A script executing a benchmarking run:
#!/bin/bash
# REFERENCE
# https://docs.percona.com/pg-stat-monitor/reference.html
#
set -e
export PGPASSWORD="MYPASSWORD"
PGHOST='MYHOST'
PGPORT=5434
PGUSER=postgres
#
# initialize benchmarking database
#
dropdb --if-exists pgbench
createdb pgbench
/usr/pgsql-12/bin/pgbench -i --foreign-keys -s 300 pgbench
psql pgbench -c 'create extension pg_stat_monitor'
#
# configure pg_stat_monitor, requires system restart
#
psql postgres <<_eof_
-- set bucket time range, default is normally 60 seconds
alter system set pg_stat_monitor.pgsm_bucket_time = '1min';
-- set number of buckets, default is normally 10
alter system set pg_stat_monitor.pgsm_max_buckets = 75;
_eof_
systemctl restart postgresql@13-main
psql postgres <<_eof_
-- zero pg_stat_monitor stats
select * from pg_stat_monitor_reset();
_eof_
#
# begin benchmarking run
#
# 4500 seconds (75 minutes)
/usr/pgsql-12/bin/pgbench -U postgres -c 4 -j 2 -T 4500 -P 5 -b tpcb-like pgbench
#
# copy and save the benchmarking run into tables
#
psql postgres <<_eof_
drop table if exists pg_stat_monitor_archive,
pg_stat_monitor_qry,
pg_stat_monitor_uniq_id;
select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time;
select bucket_start_time, pgsm_query_id, queryid, total_exec_time
into pg_stat_monitor_qry
from pg_stat_monitor_archive
where application_name='pgbench';
with a as (select distinct on (pgsm_query_id) *
from pg_stat_monitor_archive
where application_name='pgbench')
select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
into pg_stat_monitor_uniq_id
from a
order by cmd_type;
_eof_
echo "DONE"
progress: 4435.0 s, 341.2 tps, lat 11.718 ms stddev 3.951
progress: 4440.0 s, 361.2 tps, lat 11.075 ms stddev 3.519
progress: 4445.0 s, 348.0 tps, lat 11.483 ms stddev 5.246
progress: 4450.0 s, 383.8 tps, lat 10.418 ms stddev 4.514
progress: 4455.0 s, 363.6 tps, lat 10.988 ms stddev 4.326
progress: 4460.0 s, 344.0 tps, lat 11.621 ms stddev 3.981
progress: 4465.0 s, 360.4 tps, lat 11.093 ms stddev 4.457
progress: 4470.0 s, 383.8 tps, lat 10.423 ms stddev 5.615
progress: 4475.0 s, 369.6 tps, lat 10.811 ms stddev 3.784
progress: 4480.0 s, 355.6 tps, lat 11.227 ms stddev 3.954
progress: 4485.0 s, 378.8 tps, lat 10.580 ms stddev 2.890
progress: 4490.0 s, 370.8 tps, lat 10.770 ms stddev 2.879
progress: 4495.0 s, 365.2 tps, lat 10.947 ms stddev 4.997
progress: 4500.0 s, 379.2 tps, lat 10.549 ms stddev 2.832
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 4
number of threads: 2
duration: 4500 s
number of transactions actually processed: 1564704
latency average = 11.497 ms
latency stddev = 4.800 ms
tps = 347.711175 (including connections establishing)
tps = 347.711731 (excluding connections establishing)
Dashboard example 1: Querying saved data
Top panel (Query execution time vs. DML)
Five (5) SQL statements are used to create this panel:
-- SELECT --
select bucket_start_time, total_exec_time as "SELECT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select bucket_start_time, total_exec_time as "INSERT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select bucket_start_time, total_exec_time as "UPDATE 1"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select bucket_start_time, total_exec_time as "UPDATE 2"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select bucket_start_time, total_exec_time as "UPDATE 3"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;
Bottom panel (Query execution time vs. shared blocks)
-- INSERT (ins_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as ins_hit, shared_blks_read as ins_read,
shared_blks_dirtied as ins_dirt, shared_blks_written as ins_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;
-- UPDATE 1 (update1_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as update1_hit, shared_blks_read as update1_read,
shared_blks_dirtied as update1_dirt, shared_blks_written as update1_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;
-- UPDATE 2 (update2_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as update2_hit, shared_blks_read as update2_read,
shared_blks_dirtied as update2_dirt, shared_blks_written as update2_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;
-- UPDATE 3 (update3_[hit|read|dirty|write]) --
select bucket_start_time,
shared_blks_hit as update3_hit, shared_blks_read as update3_read,
shared_blks_dirtied as update3_dirt, shared_blks_written as update3_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Analysis
Here are some example patterns that can be discerned:
- The SELECT statements are the fastest DML operations (top panel).
- Although SQL statement UPDATE 1 (top panel) takes up the most time, its contents do not have much presence in the shared buffer relative to the other update statements.
- Inserts (top) are the second-slowest set of statements, yet they account for very little shared-buffer activity compared to the UPDATEs (bottom).
Dashboard example 2: Monitoring in real time
These two panels show read/write IO performance to the persistent storage while benchmarking a live run.
Top panel (Execution time vs. DML)
-- SELECT --
select bucket_start_time, total_exec_time as "SELECT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select bucket_start_time, total_exec_time as "INSERT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select bucket_start_time, total_exec_time as "UPDATE 1"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select bucket_start_time, total_exec_time as "UPDATE 2"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select bucket_start_time, total_exec_time as "UPDATE 3"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;
Bottom panel (Time vs. IO)
-- time vs read/write blocks (blk_read_time, blk_write_time) --
-- requires track_io_timing = on
select bucket_start_time, blk_read_time, blk_write_time
from public.pg_stat_monitor;

Analysis
It’s quite easy to observe that SQL statement UPDATE 1 represents the bulk of the read operations.
-- example SQL statement
UPDATE pgbench_accounts SET abalance = abalance + 2063 WHERE aid = 1482568
Interestingly, writes are not as significant as reads.
Conclusion
I’m excited about pg_stat_monitor. Not only can it be used in Grafana, but it’s easily implemented in any monitoring solution, including our own Percona Monitoring and Management. It’s also incorporated in our latest version of Percona Operator for PostgreSQL.
pg_stat_monitor is an obvious, common-sense improvement over pg_stat_statements' greatest limitation, i.e., its inability to bucket metrics over time intervals. Frankly, I can see the pg_stat_monitor extension eventually replacing pg_stat_statements as the de facto extension for monitoring Postgres when it comes to real-time analysis.
Happy monitoring!
How To Scale a Single-Host PostgreSQL Database With Citus

When it comes to Citus, successfully building out and scaling a PostgreSQL cluster across multiple nodes and even across data centers can feel, at times, to be an art form because there are so many ways of building it out.
There’s an axiom that I think aptly applies to this situation describing the differences between science and art:
– Science: 1 problem -> 1 solution
– Art: 1 problem -> 1,000 solutions
While the two previous Citus blogs that I wrote covered the fairly straightforward concepts of creating columnar tables and leveraging data redundancy, this blog explores the architectural design considerations using this freely downloadable and fully featured PostgreSQL extension.
Rather than listing the concepts, function calls, etc, available in Citus, which frankly is a bit boring, I’m going to explore scaling out a database system starting with a single host. I won’t cover all the features but show just enough that you’ll want to see more of what you can learn to accomplish for yourself.
About the cluster

Following a step-by-step process, the objective is to create a four-node cluster consisting of:
- PostgreSQL version 15
- Citus extension (I’ll be using version 11, but there are newer ones available.)
- PostgreSQL cluster
  - One coordinator node
    - citus-coord-01
  - Three worker nodes
    - citus1
    - citus2
    - citus3
- Hardware (AWS instance)
  - Ubuntu Server 20.04, SSD volume type
  - 64-bit (x86)
  - c5.xlarge: 4 vCPU, 8GB RAM
  - Storage: 80GB EBS volume (root), gp2 (IOPS 240/3000)
As well, high availability will be integrated, guaranteeing cluster viability in the case that one worker node goes down.
Leveraging pgbench, a benchmarking utility that comes bundled with PostgreSQL, I will put the cluster through its paces by executing a series of DML operations. Using its default tpcb-like benchmark, one can stress test a database of any size, ranging from a few clients to simulating thousands interacting with a system sized in the terabytes if need be.
Steps
Provisioning
The first step is to provision the four nodes with both PostgreSQL and Citus. Refer here or here for getting and installing the extension.
-- update each postgres node and enable the extension
-- (requires a server restart)
alter system set shared_preload_libraries='citus';
On each of the four nodes, create the database with the Citus extension:
-- using psql
create database pgbench;
\c pgbench
create extension citus;
One worker node

Beginning with a single worker node, log onto citus-coord-01, declare the coordinator host, and add the first worker node, citus1:
-- execute on citus-coord-01
psql pgbench <<_eof1_
select citus_set_coordinator_host('citus-coord-01', 5432);
select citus_add_node('citus1', 5432);
_eof1_
While still logged in database pgbench, get a list of the worker nodes, which at this time is just the one:
select * from master_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 citus1    |      5432
Now, things get a little interesting; the normal initialization of pgbench is to create and populate the tables with the appropriate constraints. In this case, however, only the tables, without data or constraints, are initialized:
-- execute the following command on citus-coord-01
pgbench -iI t pgbench
pgbench=# \d+
List of relations
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description
--------+------------------+-------+----------+-------------+---------------+---------+-------------
public | citus_tables | view | postgres | permanent | | 0 bytes |
public | pgbench_accounts | table | postgres | permanent | heap | 0 bytes |
public | pgbench_branches | table | postgres | permanent | heap | 0 bytes |
public | pgbench_history | table | postgres | permanent | heap | 0 bytes |
public | pgbench_tellers | table | postgres | permanent | heap | 0 bytes |
This next set of commands, executed on the coordinator node, distributes the pgbench tables across node citus1:
psql pgbench <<_eof1_
BEGIN;
select create_distributed_table('pgbench_history', 'aid');
select create_distributed_table('pgbench_accounts', 'aid');
select create_distributed_table('pgbench_branches', 'bid');
select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;
_eof1_
And now, we can begin populating the tables. The following invocation generates almost 4GB of data. Notice I still haven’t added any constraints:
pgbench -iI g -s 300 pgbench
This command shows the total number of shards on the node. The default number of shards generated for a given table is 32.
select * from citus_tables;
    table_name    | citus_table_type | distribution_column | colocation_id | table_size | shard_count
------------------+------------------+---------------------+---------------+------------+-------------
 pgbench_accounts | distributed      | aid                 |             1 | 3844 MB    |          32
 pgbench_branches | distributed      | bid                 |             1 | 256 kB     |          32
 pgbench_history  | distributed      | aid                 |             1 | 0 bytes    |          32
 pgbench_tellers  | distributed      | tid                 |             1 | 256 kB     |          32
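The shard count per table is governed by the citus.shard_count setting, whose default is 32; it can be raised before distributing tables if finer-grained rebalancing is anticipated (the value 64 below is purely illustrative):

```sql
show citus.shard_count;       -- 32 by default
set citus.shard_count = 64;   -- applies to tables distributed afterwards
```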
You’ll also notice that only some of the shards have been populated with data, which is normal when initializing pgbench with data:
select * from citus_shards;
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+----------+----------+------------
 pgbench_accounts |  102040 | pgbench_accounts_102040 | distributed      |             1 | citus1   |     5432 |  125853696
 pgbench_accounts |  102041 | pgbench_accounts_102041 | distributed      |             1 | citus1   |     5432 |  126173184
 pgbench_accounts |  102042 | pgbench_accounts_102042 | distributed      |             1 | citus1   |     5432 |  125739008
 pgbench_accounts |  102043 | pgbench_accounts_102043 | distributed      |             1 | citus1   |     5432 |  125968384
 .
 .
 pgbench_branches |  102072 | pgbench_branches_102072 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102073 | pgbench_branches_102073 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102074 | pgbench_branches_102074 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102075 | pgbench_branches_102075 | distributed      |             1 | citus1   |     5432 |       8192
 .
 .
 pgbench_history  |  102008 | pgbench_history_102008  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102009 | pgbench_history_102009  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102010 | pgbench_history_102010  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102011 | pgbench_history_102011  | distributed      |             1 | citus1   |     5432 |          0
 .
 .
 pgbench_tellers  |  102104 | pgbench_tellers_102104  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102105 | pgbench_tellers_102105  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102106 | pgbench_tellers_102106  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102107 | pgbench_tellers_102107  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102108 | pgbench_tellers_102108  | distributed      |             1 | citus1   |     5432 |       8192
 .
 .
And now, execute the benchmark:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
The results are not pretty.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 1524
number of failed transactions: 0 (0.000%)
latency average = 791.374 ms
latency stddev = 157.710 ms
initial connection time = 15.704 ms
tps = 25.197427 (without initial connection time)
ATTENTION: Refer to the bottom of this blog, which summarizes a tabulation of the various benchmarks as this cluster evolves.
To speed up the benchmark, indexes must be added. But this standard pgbench method of adding indexes fails:
postgres@citus-coord-01:~$ pgbench -iI p pgbench
creating primary keys...
pgbench: error: query failed: ERROR:  cannot create constraint without a name on a distributed table
pgbench: detail: Query was: alter table pgbench_branches add primary key (bid)
Successfully adding indexes and constraints requires them to be explicitly named and created in the following fashion. Note that table pgbench_history requires a REPLICA identity because it doesn’t have a primary key:
# THIS WORKS!
psql pgbench <<_eof1_
--
-- indexes and constraints must be explicitly named
--
BEGIN;
create unique index pgbench_accounts_pk on pgbench_accounts(aid);
create unique index pgbench_branches_pk on pgbench_branches(bid);
create unique index pgbench_tellers_pk on pgbench_tellers(tid);
alter table pgbench_accounts add constraint pk_accounts primary key using index pgbench_accounts_pk;
alter table pgbench_branches add constraint pk_branches primary key using index pgbench_branches_pk;
alter table pgbench_tellers add constraint pk_tellers primary key using index pgbench_tellers_pk;
\qecho adding REPLICA IDENTITY (no PK present)
alter table pgbench_history replica identity full;
COMMIT;
_eof1_
Repeating the benchmark run yields a much better result:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 135273
number of failed transactions: 0 (0.000%)
latency average = 8.865 ms
latency stddev = 9.452 ms
initial connection time = 15.556 ms
tps = 2254.544852 (without initial connection time)
Two worker nodes

Adding a second worker node is straightforward, requiring only two steps:
- Adding the node.
- Rebalancing the shards across the two nodes.
Execute this on the coordinator node, citus-coord-01:
psql pgbench <<_eof1_
--
-- it's understood that database pgbench already exists on citus2
-- and extension citus has been created
--
\qecho adding node citus2 ...
select citus_add_node('citus2', 5432);
\qecho rebalancing shards across TWO nodes ...
select * from rebalance_table_shards();
ANALYZE;
_eof1_
Here’s a partial output as the tables are rebalanced across the two nodes:
adding node citus2 ...
citus_add_node
----------------
3
rebalancing shards across TWO nodes ...
NOTICE: Moving shard 102008 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102009 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102010 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102011 from citus1:5432 to citus2:5432 ...
.
.
NOTICE: Moving shard 102020 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102021 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102022 from citus1:5432 to citus2:5432 ...
NOTICE: Moving shard 102023 from citus1:5432 to citus2:5432 ...
Repeating the benchmark once again demonstrates that performance has degraded somewhat. This is to be expected, considering my POC consists of containers on a single machine, all sharing the same system resources:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 130190
number of failed transactions: 0 (0.000%)
latency average = 9.214 ms
latency stddev = 19.159 ms
initial connection time = 15.776 ms
tps = 2169.559969 (without initial connection time)
Three worker nodes

As per the two-worker node configuration, adding another node is straightforward.
psql pgbench <<_eof1_
\qecho adding node citus3 ...
select citus_add_node('citus3', 5432);
\qecho rebalancing shards across THREE nodes ...
select * from rebalance_table_shards();
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 142630
number of failed transactions: 0 (0.000%)
latency average = 8.410 ms
latency stddev = 5.542 ms
initial connection time = 16.407 ms
tps = 2377.227920 (without initial connection time)
Three worker nodes, with redundancy/replication

This next step is interesting because it adds redundancy by increasing the replication factor from one to two, such that each table shard exists in two copies placed on two distinct nodes.
ATTENTION: A newly updated replication factor affects only tables created after the change. Pre-existing tables are not affected by the new replication factor:
psql pgbench <<_eof1_
\qecho increasing replication factor to TWO
alter system set citus.shard_replication_factor=2;
select pg_reload_conf();
_eof1_
TIP: In order to replicate tables, set the replication factor before creating them.
In order to replicate the existing tables across the nodes, they are first undistributed, which centralizes them onto the coordinator node. Redistributing them across the three-node cluster then automatically adds the replication.
This centralizes the tables onto the coordinator node:
psql pgbench <<_eof1_
BEGIN;
select undistribute_table('pgbench_history');
select undistribute_table('pgbench_accounts');
select undistribute_table('pgbench_branches');
select undistribute_table('pgbench_tellers');
COMMIT;
_eof1_
The tables are redistributed and rebalanced across the three-node cluster with the correct redundancy of two:
psql pgbench <<_eof1_
BEGIN;
select create_distributed_table('pgbench_history', 'aid');
select create_distributed_table('pgbench_accounts', 'aid');
select create_distributed_table('pgbench_branches', 'bid');
select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;
-- remove legacy tables from the coordinator node
BEGIN;
select truncate_local_data_after_distributing_table($$public.pgbench_history$$);
select truncate_local_data_after_distributing_table($$public.pgbench_accounts$$);
select truncate_local_data_after_distributing_table($$public.pgbench_branches$$);
select truncate_local_data_after_distributing_table($$public.pgbench_tellers$$);
COMMIT;
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 68834
number of failed transactions: 0 (0.000%)
latency average = 17.430 ms
latency stddev = 12.818 ms
initial connection time = 16.321 ms
tps = 1147.093369 (without initial connection time)
Benchmark results
Here’s a summary of the various benchmarking runs, giving you an idea of relative performance.
| RUN | NUM NODES | WITH IDX | Replication Factor | TPS  |
|-----|-----------|----------|--------------------|------|
| 1   | 1         | no       | 1                  | 25   |
| 2   | 1         | yes      | 1                  | 2255 |
| 3   | 2         | yes      | 1                  | 2170 |
| 4   | 3         | yes      | 1                  | 2377 |
| 5   | 3         | yes      | 2                  | 1147 |
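The TPS figures in this table were copied by hand from each run's summary. If you capture the runs to files, a small shell sketch (assuming the standard pgbench summary format shown above) can pull the figure out for tabulation:

```shell
# Extract the tps figure from captured pgbench output.
# Matches the "tps = NNN (without initial connection time)" summary line.
extract_tps() {
  awk '/^tps/ {print $3}'
}

# illustrative use against one captured line:
echo 'tps = 2254.544852 (without initial connection time)' | extract_tps
# prints 2254.544852
```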
Caveat
Although not mentioned in the blog, there are a few things you’ll need to keep in mind as you work with CitusDB:
- The password for the account used to propagate operations across the cluster must be stored in the home directory’s .pgpass file on all nodes.
- The “wal_level” is set to logical.
- Redundancy can potentially decrease overall performance.
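To illustrate the first caveat, a .pgpass file covering this POC's nodes might look like the following sketch (the hostnames are this cluster's; the password value is, of course, a placeholder assumption):

```
# ~/.pgpass in the postgres home directory on every node; chmod 600
# hostname:port:database:username:password
citus-coord-01:5432:*:postgres:mysecretpassword
citus1:5432:*:postgres:mysecretpassword
citus2:5432:*:postgres:mysecretpassword
citus3:5432:*:postgres:mysecretpassword
citus4:5432:*:postgres:mysecretpassword
```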
Keep in mind that the performance metrics depend not only on the cluster’s hardware and system resources but also on the level of tuning, which, in this case, has not been addressed.
A future blog will continue my exploration into Citus by scaling out pgbench into other architectures.
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.
29
2023
Data Redundancy With the PostgreSQL Citus Extension

Over the years, I’ve had the opportunity to architect all sorts of configurations using Postgres as a backend. I’ve always found it very cool and satisfying to implement sophisticated business rules, often in more ways than one has fingers and toes. So, it’s not an understatement when I say that Citus is one of the more interesting technologies that I’ve come across when scaling PostgreSQL.
Citus is an extension that horizontally scales PostgreSQL across multiple machines by sharding and creating redundant copies of tables. Its query engine not only parallelizes incoming SQL queries, which suits time-series workloads, but also supports columnar tables, making it ideal for OLAP analysis duties.
Among its many capabilities, a Citus cluster can:
- Create distributed tables that are sharded across a cluster of PostgreSQL nodes to combine their CPU, memory, storage, and I/O capacity.
- Replicate reference tables to all nodes for joins and foreign keys from distributed tables and for maximum read performance.
- Route and parallelize SELECT, DML, and other operations on distributed tables across the cluster with the distributed query engine.
- Compress data with columnar storage of tables, speeding up scans and supporting fast projections, both on regular and distributed tables.
Data redundancy, a database version of a RAID
When pondering high availability and redundancy, the classic approach is to replicate data by creating a replica via streaming replication.
Now let’s stretch our imagination and consider a second method of high availability, à la Citus.
The best way to describe the Citus way of doing things is to reflect how data is managed by a disk RAID array. Depending on the configuration, one can tune a hardware RAID for either performance or redundancy. The same can be said for Citus data sharding.
Here is an example of a table named replica2x, which has 2X redundancy across a cluster of four (4) nodes. The colors indicate duplicated shards of the table. For example, if node citus1 goes offline, the sharded table it holds still has copies on nodes citus2 and citus4. Likewise, it can be said that if node citus2 goes offline, the same data is still available on nodes 1, 3, and 4.

About this POC
I’ll be upfront: I love working with Linux Containers, LXD. Much of what I will show you makes heavy use of them. You won’t need LXD to replicate this POC, of course, but I can’t say enough how flexible and versatile such an environment can be when prototyping a cluster of Postgres nodes, let alone an entire multi-data center infrastructure on a single workstation.
There are two parts to this POC:
- Part A: Setup
- Part B: Redundancy demonstration
Part A: POC setup
Step one: Getting and installing Citus
Referring to this earlier blog, you’ll see how to get and install Citus into your environment.
Step two: Creating the Citus nodes
The Citus cluster consists of five (5) nodes:
- citus-coord-01: coordinator
- citus1: worker
- citus2: worker
- citus3: worker
- citus4: worker
By using LXD, I created a single template container with Citus on Ubuntu 20.04, from which the various nodes were copied.
for u in citus-coord-01 citus1 citus2 citus3 citus4
do
echo "==== $u ===="
lxc rm --force $u 2>/dev/null
lxc cp template-ubuntu-2004-citusdb $u
lxc start $u
done
And here’s the resultant cluster:
lxc ls -c ns4 citus
+----------------+---------+----------------------+
|      NAME      |  STATE  |         IPV4         |
+----------------+---------+----------------------+
| citus1         | RUNNING | 10.231.38.140 (eth0) |
+----------------+---------+----------------------+
| citus2         | RUNNING | 10.231.38.151 (eth0) |
+----------------+---------+----------------------+
| citus3         | RUNNING | 10.231.38.171 (eth0) |
+----------------+---------+----------------------+
| citus4         | RUNNING | 10.231.38.204 (eth0) |
+----------------+---------+----------------------+
| citus-coord-01 | RUNNING | 10.231.38.34 (eth0)  |
+----------------+---------+----------------------+
It’s understood that on each of the five nodes:
- Database db01 has already been created.
- The postgresql.conf configuration file has been appropriately edited for remote access.
- The .pgpass file has been configured to supply the superuser password for all nodes.
- Extension citus has been created in database db01.
Step three: Check packages
Inspecting the nodes confirms Citus has been correctly installed:
for u in citus1 citus2 citus3 citus4 citus-coord-01
do
    echo "==== NODE: $u ===="
    lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
select extname, extversion from pg_extension;
_eof_
done | less -S
==== NODE: citus1 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus2 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus3 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus4 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
==== NODE: citus-coord-01 ====
extname | extversion
----------------+------------
plpgsql | 1.0
citus_columnar | 11.1-1
citus | 11.1-1
Properly installed, the Citus runtime variables are now available:
lxc exec citus1 -- su postgres -c psql db01<<_eof_ | less -S
select name,setting,unit from pg_settings where name ~ 'citus' order by 1;
_eof_
                        name                        |     setting     | unit
----------------------------------------------------+-----------------+------
 citus.all_modifications_commutative                | off             |
 citus.background_task_queue_interval               | 5000            | ms
 citus.cluster_name                                 | default         |
 citus.coordinator_aggregation_strategy             | row-gather      |
 citus.count_distinct_error_rate                    | 0               |
 citus.cpu_priority                                 | 0               |
 citus.cpu_priority_for_logical_replication_senders | inherit         |
 citus.defer_drop_after_shard_move                  | on              |
 citus.defer_drop_after_shard_split                 | on              |
 citus.defer_shard_delete_interval                  | 15000           | ms
 citus.desired_percent_disk_available_after_move    | 10              |
 citus.distributed_deadlock_detection_factor        | 2               |
 citus.enable_binary_protocol                       | on              |
 citus.enable_create_role_propagation               | on              |
 citus.enable_deadlock_prevention                   | on              |
 citus.enable_local_execution                       | on              |
 citus.enable_local_reference_table_foreign_keys    | on              |
 citus.enable_repartition_joins                     | off             |
 citus.enable_statistics_collection                 | off             |
 citus.explain_all_tasks                            | off             |
 citus.explain_analyze_sort_method                  | execution-time  |
 citus.limit_clause_row_fetch_count                 | -1              |
 citus.local_hostname                               | localhost       |
 citus.local_shared_pool_size                       | 50              |
 citus.local_table_join_policy                      | auto            |
 citus.log_remote_commands                          | off             |
 citus.max_adaptive_executor_pool_size              | 16              |
 citus.max_cached_connection_lifetime               | 600000          | ms
 citus.max_cached_conns_per_worker                  | 1               |
 citus.max_client_connections                       | -1              |
 citus.max_high_priority_background_processes       | 2               |
 citus.max_intermediate_result_size                 | 1048576         | kB
 citus.max_matview_size_to_auto_recreate            | 1024            | MB
 citus.max_shared_pool_size                         | 100             |
 citus.max_worker_nodes_tracked                     | 2048            |
 citus.multi_shard_modify_mode                      | parallel        |
 citus.multi_task_query_log_level                   | off             |
 citus.node_connection_timeout                      | 30000           | ms
 citus.node_conninfo                                | sslmode=require |
 citus.propagate_set_commands                       | none            |
 citus.recover_2pc_interval                         | 60000           | ms
 citus.remote_task_check_interval                   | 10              | ms
 citus.shard_count                                  | 32              |
 citus.shard_replication_factor                     | 1               |
 citus.show_shards_for_app_name_prefixes            |                 |
 citus.skip_constraint_validation                   | off             |
 citus.skip_jsonb_validation_in_copy                | on              |
 citus.stat_statements_track                        | none            |
 citus.task_assignment_policy                       | greedy          |
 citus.task_executor_type                           | adaptive        |
 citus.use_citus_managed_tables                     | off             |
 citus.use_secondary_nodes                          | never           |
 citus.values_materialization_threshold             | 100             |
 citus.version                                      | 11.1.4          |
 citus.worker_min_messages                          | notice          |
 citus.writable_standby_coordinator                 | off             |
Step four: Define/configure the cluster
Log into the coordinator in order to declare and configure the cluster:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
select citus_set_coordinator_host('citus-coord-01', 5432);
insert into pg_dist_node(nodename)
values ('citus1')
,('citus2')
,('citus3')
,('citus4');
_eof_
And here’s the cluster’s organization:
db01=# select nodeid,nodename,groupid,isactive from pg_dist_node order by 1;
nodeid | nodename | groupid | isactive
--------+----------------+---------+----------
1 | citus-coord-01 | 0 | t
2 | citus1 | 1 | t
3 | citus2 | 2 | t
4 | citus3 | 3 | t
5 | citus4 | 4 | t
Step five: Create, distribute, and populate a single table
Table myevents is created, and the newly inserted records are evenly distributed across the cluster of nodes.
Log into the coordinator and execute the following commands. Notice that all DDL and DML statements are executed on the coordinator node:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
-- create the table
create table myevents (
    device_id bigint,
    event_id bigserial,
    event_time timestamptz default now(),
    data jsonb not null,
    primary key (device_id, event_id)
);
-- distribute the events among the nodes
select create_distributed_table('myevents', 'device_id');
-- populate the table
insert into myevents (device_id, data)
select s % 100, ('{"measurement":'||random()||'}')::jsonb
from generate_series(1,1000000) s;
_eof_
Querying the coordinator:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
select *
from myevents
where device_id = 1
order by event_time desc, event_id desc
limit 10;
_eof_
device_id | event_id | event_time | data
----------+----------+-------------------------------+-------------------------------------
1 | 999901 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2868659956537316}
1 | 999801 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7931493079697731}
1 | 999701 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.4875322951757288}
1 | 999601 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.6491362745752653}
1 | 999501 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9629266554851366}
1 | 999401 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.1185674800281864}
1 | 999301 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.5133762596297742}
1 | 999201 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7307634886202119}
1 | 999101 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2997471209159892}
1 | 999001 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9692520484104021}
Step six: A review of table distributions across the cluster
This demonstrates clearly that table myevents is hash-sharded across every node member:
- Notice how well-balanced the sharded tables are in size.
- Notice the table name numbering order; consecutive shard numbers land on successive nodes.
for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
\dt+
_eof_
done | less -S
==== NODE: citus1 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102008 | table | postgres | permanent | heap | 2832 kB |
public | myevents_102012 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102016 | table | postgres | permanent | heap | 5624 kB |
public | myevents_102020 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102024 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102028 | table | postgres | permanent | heap | 5632 kB |
public | myevents_102032 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102036 | table | postgres | permanent | heap | 6560 kB |
==== NODE: citus2 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102009 | table | postgres | permanent | heap | 4696 kB |
public | myevents_102013 | table | postgres | permanent | heap | 976 kB |
public | myevents_102017 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102021 | table | postgres | permanent | heap | 3768 kB |
public | myevents_102025 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102029 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102033 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102037 | table | postgres | permanent | heap | 2840 kB |
==== NODE: citus3 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+------------+-------------
public | myevents_102010 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102014 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102018 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102022 | table | postgres | permanent | heap | 4696 kB |
public | myevents_102026 | table | postgres | permanent | heap | 8192 bytes |
public | myevents_102030 | table | postgres | permanent | heap | 2832 kB |
public | myevents_102034 | table | postgres | permanent | heap | 976 kB |
public | myevents_102038 | table | postgres | permanent | heap | 4696 kB |
==== NODE: citus4 ====
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size | Description
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102011 | table | postgres | permanent | heap | 5632 kB |
public | myevents_102015 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102019 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102023 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102027 | table | postgres | permanent | heap | 2840 kB |
public | myevents_102031 | table | postgres | permanent | heap | 2832 kB |
public | myevents_102035 | table | postgres | permanent | heap | 1904 kB |
public | myevents_102039 | table | postgres | permanent | heap | 4696 kB |
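Rather than looping over the workers, the shard-to-node map can also be read in a single query from the coordinator, via the citus_shards view (the same view whose columns appear in the shard listing earlier in this series); a sketch:

```sql
-- run on the coordinator node
select table_name, shardid, nodename, nodeport, shard_size
from citus_shards
where table_name = 'myevents'::regclass
order by shardid;
```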
Part B: Redundancy demonstration
Method
- Step 1: Update shard replication factor from 1X to 2X
- Step 2: Create table myevents2x with 2X redundancy
- Step 3: Identify shard myevents2x_102040 across citus1 and citus2
- Step 4: Identify some records to query from shards known to be on nodes citus1 and citus2
- Step 5: Test
- Shut down citus1; perform the aforementioned query
- Start citus1, shut down citus2; perform the aforementioned query
- Restart citus2; perform the aforementioned query
ATTENTION: Please note that you may have to edit your own queries as the values may be different for your setup.
Step one:
Update shard replication factor from 1X to 2X:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
show citus.shard_replication_factor;
alter system set citus.shard_replication_factor=2;
select pg_reload_conf();
_eof_
# validate
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
show citus.shard_replication_factor;
_eof_
citus.shard_replication_factor
--------------------------------
 2
Step two:
Table myevents2x is created and populated with a redundancy of 2X across the cluster:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
-- create a new table with 2X redundancy
create table myevents2x (
    device_id bigint,
    event_id bigserial,
    event_time timestamptz default now(),
    data jsonb not null,
    primary key (device_id, event_id)
);
-- distribute the events among the nodes
select create_distributed_table('myevents2x', 'device_id');
-- confirm table has been added across the cluster
select * from master_get_active_worker_nodes() order by 1;
-- populate the table
insert into myevents2x (device_id, data)
select s % 100, ('{"measurement":'||random()||'}')::jsonb
from generate_series(1,1000000) s;
_eof_
Here’s the output:
CREATE TABLE
 create_distributed_table
--------------------------

(1 row)

 node_name | node_port
-----------+-----------
 citus1    |      5432
 citus2    |      5432
 citus3    |      5432
 citus4    |      5432
(4 rows)

INSERT 0 1000000
Step three:
Locate shard myevents2x_102040, which should be on nodes citus1 and citus2:
for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
select tablename from pg_tables where tablename~'myevents2x' order by 1
_eof_
done | less -S
Here’s the output:
==== NODE: citus1 ====
tablename
-------------------
myevents2x_102040
myevents2x_102043
myevents2x_102044
myevents2x_102047
myevents2x_102048
myevents2x_102051
myevents2x_102052
myevents2x_102055
myevents2x_102056
myevents2x_102059
myevents2x_102060
myevents2x_102063
myevents2x_102064
myevents2x_102067
myevents2x_102068
myevents2x_102071
(16 rows)
==== NODE: citus2 ====
tablename
-------------------
myevents2x_102040
myevents2x_102041
myevents2x_102044
myevents2x_102045
myevents2x_102048
myevents2x_102049
myevents2x_102052
myevents2x_102053
myevents2x_102056
myevents2x_102057
myevents2x_102060
myevents2x_102061
myevents2x_102064
myevents2x_102065
myevents2x_102068
myevents2x_102069
==== NODE: citus3 ====
tablename
-------------------
myevents2x_102041
myevents2x_102042
myevents2x_102045
myevents2x_102046
myevents2x_102049
myevents2x_102050
myevents2x_102053
myevents2x_102054
myevents2x_102057
myevents2x_102058
myevents2x_102061
myevents2x_102062
myevents2x_102065
myevents2x_102066
myevents2x_102069
myevents2x_102070
==== NODE: citus4 ====
tablename
-------------------
myevents2x_102042
myevents2x_102043
myevents2x_102046
myevents2x_102047
myevents2x_102050
myevents2x_102051
myevents2x_102054
myevents2x_102055
myevents2x_102058
myevents2x_102059
myevents2x_102062
myevents2x_102063
myevents2x_102066
myevents2x_102067
myevents2x_102070
myevents2x_102071
Step four:
Locate and return the first three records of shard myevents2x_102040 on nodes citus1 and citus2:
lxc exec citus1 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho ==== citus1 ====
\qecho select * from myevents2x_102040 order by 1,2 limit 3
select * from myevents2x_102040 order by 1,2 limit 3;
\c 'host=citus2 dbname=db01 user=postgres'
\qecho ==== citus2 ====
\qecho select * from myevents2x_102040 order by 1,2 limit 3
select * from myevents2x_102040 order by 1,2 limit 3;
\c 'host=citus-coord-01 dbname=db01 user=postgres'
\qecho ==== coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output:
==== citus1 ====
select * from myevents2x_102040 order by 1,2 limit 3
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
You are now connected to database "db01" as user "postgres" on host "citus2" (address "fd42:cb6a:5384:9a60:216:3eff:fe38>
==== citus2 ====
select * from myevents2x_102040 order by 1,2 limit 3
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
You are now connected to database "db01" as user "postgres" on host "citus-coord-01" (address "fd42:cb6a:5384:9a60:216:3>
==== coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step five:
The next few steps demonstrate our ability to continuously query and return those records found in shard myevents2x_102040.
Step 5a: Test, shutdown node citus1
lxc stop citus1
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== citus1 is shutdown ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output:
"==== citus1 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step 5b: Test, restart citus1, and shutdown citus2
lxc start citus1
lxc stop citus2
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== citus2 is shutdown ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output; note that it’s exactly the same as the previous test:
"==== citus2 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
-----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step 5c: Test, restart node citus2
lxc start citus2
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== cluster restored ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output, of course!
"==== cluster restored ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id | event_time | data
----------+----------+-------------------------------+-------------------------------------
8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Conclusion
Data redundancy is the hallmark of high availability. But with Citus, we’ve raised the bar. Can you think of a better system that can stay up without losing time initiating failovers when a node fails?
Have fun!
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.
21
2023
Configuring PostgreSQL and LDAP Using StartTLS

Effectively working with LDAP as an authentication mechanism for PostgreSQL typically requires extensive knowledge of both domains. While trying to be as complete yet succinct as possible, I’m detailing how to enable TLS between a PostgreSQL server and an OpenLDAP server.
Ironically, the most complicated aspect has nothing to do with either PostgreSQL or OpenLDAP but with the steps of creating and signing private keys and certificates.
Note: I had seriously thought about leaving out much of the OpenLDAP commands, but I figured it might benefit you if you’re like me since I don’t touch this very often.
Summary steps
The underlying assumptions are:
- The PostgreSQL and LDAP services are installed on the same host.
- Username accounts, but not passwords, are created on the Postgres server.
- Both username accounts and passwords are configured on the LDAP server.
- Private/public keys and their certificates are to be created.
- A self-signed Certificate Authority, CA, exists on the host and has signed the aforementioned Postgres and LDAP certificates.
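Since the key and certificate work is called out above as the most complicated aspect, here is a minimal openssl sketch of that last assumption. The file names (ca.key, ca.crt, my-ldap.*) and subjects are illustrative assumptions; adjust them to your environment:

```shell
# 1. Create the self-signed Certificate Authority (key + certificate)
openssl req -new -x509 -days 365 -nodes -subj "/CN=my-ca" \
        -keyout ca.key -out ca.crt

# 2. Create the host's private key and a certificate signing request
openssl req -new -nodes -subj "/CN=my-ldap" \
        -keyout my-ldap.key -out my-ldap.csr

# 3. Sign the request with the CA, yielding the host certificate
openssl x509 -req -in my-ldap.csr -CA ca.crt -CAkey ca.key \
        -CAcreateserial -days 365 -out my-ldap.crt

# 4. Confirm the certificate chains back to the CA
openssl verify -CAfile ca.crt my-ldap.crt
```

The same pattern is repeated for the Postgres server certificate when the two services use separate keys.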
The environment
The Proof Of Concept described in this document consists of a single stand-alone server:
- Linux OS, Ubuntu 18.04
- Host name: my-ldap
- PostgreSQL version 14
- OpenLDAP version 2.4.45
- Distinguished Name: “cn=admin,dc=nodomain”
- password: admin
- topmost domain: “dn: dc=pg_user,dc=nodomain”
- superuser, postgres: “dn: cn=postgres,dc=pg_user,dc=nodomain”
- user1: “dn: cn=user1,dc=pg_user,dc=nodomain”
- user2: “dn: cn=user2,dc=pg_user,dc=nodomain”
- user accounts:
- postgres (password postgres)
- user1 (password user1)
- user2 (password user2)
Installation PostgreSQL
Apart from the standard steps of installing and configuring Postgres for remote access, edit the host-based authentication file (pg_hba.conf) so that Postgres refers to the LDAP service for authentication.
ROLEs and user accounts used in Postgres must be declared. Keep in mind that assigning passwords is NOT required:
-- Example
create role user1 with login password null;
create role user2 with login password null;
pg_hba.conf assumptions:
- The LDAP server address is 127.0.0.1 (localhost), or the hostname if desired.
- Rules for both IPv4 and IPv6 are declared.
- NO encryption is enabled between the Postgres and LDAP services.
# IPv4 local connections:
host all all 0.0.0.0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"
# IPv6 local connections:
host all all ::0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"
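Under this configuration, Postgres performs a simple LDAP bind by concatenating ldapprefix, the login username, and ldapsuffix into a distinguished name. A minimal sketch of that construction, using the values from the rule above:

```shell
#!/bin/sh
# How Postgres builds the bind DN in simple-bind LDAP mode:
#   DN = ldapprefix + username + ldapsuffix
# Values below are taken from the pg_hba.conf rule above.
PREFIX='cn='
SUFFIX=', dc=pg_user, dc=nodomain'
USERNAME='user1'

DN="${PREFIX}${USERNAME}${SUFFIX}"
echo "$DN"
```

Authentication succeeds only when a bind as that DN, using the password supplied to psql, succeeds on the LDAP server.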
Installation LDAP
apt-get install -y slapd ldap-utils
Running netstat (netstat -tlnp) returns the following:
Proto Recv-Q Send-Q Local Address    Foreign Address  State    PID/Program name
tcp        0      0 0.0.0.0:389      0.0.0.0:*        LISTEN   7742/slapd
tcp6       0      0 :::389           :::*             LISTEN   7742/slapd
Command line utilities
You can control the behavior of OpenLDAP by using these command line utilities:
- ldapmodify
- ldapadd
- ldapdelete
- ldapmodrdn
- ldapsearch
- ldapwhoami
Setting the administrator password
Administering the LDAP server requires setting the administrator password. Although the LDAP installation already sets this password (admin, in our case), it can be reset at will by executing the following command:
# Select "No" when asked to configure the database with dbconfig-common.
# Set the domain name for your LDAP server, for example, "example.com".
# Set the organization name for your LDAP server, for example, "Example Inc".
# Set the administrator password for your LDAP server.
dpkg-reconfigure slapd
Configuration
The following bash script demonstrates configuring OpenLDAP to authenticate three Postgres ROLES, i.e., postgres, user1, and user2:
#!/bin/bash
set -e
##########################################
#
# admin password is "admin"
# the top domain is assigned as "nodomain"
#
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# create the topmost domain
dn: dc=pg_user,dc=nodomain
objectClass: dcObject
objectClass: organization
dc: pg_user
o: Postgres Users
description: all postgres users reside here

# create SUPERUSER ROLE postgres
dn: cn=postgres,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: postgres
sn: postgres
userPassword: postgres
_eof_
#
##########################################
# ADD ROLES
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# creating other user accounts, down the road

# create role user1
dn: cn=user1,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user1
sn: user1
userPassword: user1

# create role user2
dn: cn=user2,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user2
sn: user2
userPassword: user2
_eof_
Test connectivity without TLS
A simple login confirms LDAP and PostgreSQL are working correctly. Even though there is an encrypted session between psql and the Postgres server, there is no encrypted session between Postgres and LDAP while authentication is performed:
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=postgres password=postgres' -c "select 'ping' as test_connectivity"
 test_connectivity
-------------------
 ping

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user1 password=user1' -c "select 'ping from user1' as test_connectivity"
 test_connectivity
-------------------
 ping from user1

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user2 password=user2' -c "select 'ping from user2' as test_connectivity"
 test_connectivity
-------------------
 ping from user2
Working With TLS Encryption
Install Additional Packages
To work with SSL certificates, these packages should be present, e.g., for the Ubuntu distribution:
apt install -y gnutls-bin ssl-cert
Managing the Certificate Authority
Authentication between the Postgres and LDAP servers includes verifying that the hosts making connection attempts are, in fact, legitimate. For that reason, the certificates for both servers must be signed, i.e., a Certificate Authority mechanism is required.
#
# Generate private key for self-signed Certificate Authority
#
certtool --generate-privkey --bits 4096 --outfile /etc/ssl/private/mycakey.pem
In this case, the CA certificate is configured to expire in ten years:
#
# Define CA certificate attributes
#
echo "cn = my-ldap ca
cert_signing_key
expiration_days = 3650" > /etc/ssl/ca.info
An internal system can get away with using self-signed CA certificates; otherwise, it is strongly recommended that your certificates be signed by an authorized CA. In this case, the certificate, once signed, is placed in the same directory as the other CA certificates, i.e., /usr/local/share/ca-certificates:
#
# Generate a self-signed CA certificate and
# copy the CRT to the other trusted CA certificates used by
# both the postgres and ldap servers (/usr/local/share/ca-certificates/)
#
certtool --generate-self-signed \
    --load-privkey /etc/ssl/private/mycakey.pem \
    --template /etc/ssl/ca.info \
    --outfile /usr/local/share/ca-certificates/mycacert.crt
ATTENTION: Once signed, the CA certificate is copied onto the Postgres and LDAP servers respectively. In this case, as they are on the same host, it is located on host my-ldap. Otherwise, it MUST be copied to all Postgres and LDAP hosts.
Once copied into the correct directory, the list of CA certificates is updated, adding the self-signed CA certificate:
#
# Update the list of CA certificates on both
# postgres and LDAP servers
#
update-ca-certificates
Generating the LDAP public/private key
From here on, it is important to use the fully qualified domain name of the LDAP host, i.e., my-ldap, in the certificate.
Generate a private key for LDAP server:
certtool --generate-privkey --bits 2048 --outfile /etc/ldap/my-ldap_slapd_key.pem
Define LDAP certificate attributes:
#
# for a certificate request which expires
# in one year
#
echo "organization = mycompany
cn = my-ldap
tls_www_server
encryption_key
signing_key
expiration_days = 365" > /etc/ssl/my-ldap.info
Sign the LDAP private key using the self-signed Certificate Authority certificate and its private key:
certtool --generate-certificate \
    --load-privkey /etc/ldap/my-ldap_slapd_key.pem \
    --load-ca-certificate /etc/ssl/certs/mycacert.pem \
    --load-ca-privkey /etc/ssl/private/mycakey.pem \
    --template /etc/ssl/my-ldap.info \
    --outfile /etc/ldap/my-ldap_slapd_cert.pem
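For readers more familiar with openssl than certtool, here is a self-contained sketch of the same generate/sign/verify cycle, run in a throwaway directory; the subject names mirror this PoC, and none of the system paths above are touched:

```shell
#!/bin/sh
# Demonstrate the CA workflow with openssl in a temporary directory.
set -e
tmp=$(mktemp -d)
cd "$tmp"

# self-signed CA (analogous to certtool --generate-self-signed)
openssl req -x509 -newkey rsa:2048 -nodes -keyout cakey.pem \
    -subj "/CN=my-ldap ca" -days 3650 -out cacert.pem

# server key and certificate request (analogous to certtool --generate-privkey
# plus the template file)
openssl req -newkey rsa:2048 -nodes -keyout slapd_key.pem \
    -subj "/CN=my-ldap" -out slapd.csr

# sign the request with the CA (analogous to certtool --generate-certificate)
openssl x509 -req -in slapd.csr -CA cacert.pem -CAkey cakey.pem \
    -CAcreateserial -days 365 -out slapd_cert.pem

# the signed certificate must verify against the CA
openssl verify -CAfile cacert.pem slapd_cert.pem
```

The final command prints “slapd_cert.pem: OK”, which is exactly the relationship the certtool steps establish between mycacert.crt and my-ldap_slapd_cert.pem.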
Update access permissions of the private key:
sudo chgrp openldap /etc/ldap/my-ldap_slapd_key.pem
sudo chmod 0640 /etc/ldap/my-ldap_slapd_key.pem
Create and save the LDAP TLS configuration file, certinfo.ldif:
echo "dn: cn=config
add: olcTLSCACertificateFile
olcTLSCACertificateFile: /etc/ssl/certs/mycacert.pem
-
add: olcTLSCertificateFile
olcTLSCertificateFile: /etc/ldap/my-ldap_slapd_cert.pem
-
add: olcTLSCertificateKeyFile
olcTLSCertificateKeyFile: /etc/ldap/my-ldap_slapd_key.pem" > /etc/ldap/schema/certinfo.ldif
Enable OpenLDAP to use TLS:
ldapmodify -Y EXTERNAL -H ldapi:/// -f /etc/ldap/schema/certinfo.ldif
Validation test #1
Validation returns the string “anonymous”:
ldapwhoami -x -ZZ -H ldap://my-ldap
Validation test #2
This test confirms that the previous behavior without TLS still works:
# without TLS
ldapsearch -x -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin
This test should return exactly the same output as the previous ldapsearch. The “-ZZ” switch enforces TLS; failure is indicated by a short error message.
# with TLS
ldapsearch -x -ZZ -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin
Working with PostgreSQL using TLS
Configuration
This is the host-based rule with TLS configured; notice the addition of the ldaptls=1 argument:
# IPv4 local connections:
hostssl all all 0.0.0.0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1
# IPv6 local connections:
host all all ::0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1
Validation test: TLS
After updating pg_hba.conf with the new argument, ldaptls=1, a server reload is executed, followed by a psql ping:
systemctl reload postgresql@14-main
root@my-ldap:~# psql 'host=10.231.38.243 dbname=postgres user=postgres password=postgres' -c "select 'ping from postgres' as test_connectivity"
 test_connectivity
--------------------
 ping from postgres
05
2023
Configuring PgBouncer for Multi-Port Access

From time to time, situations occur where unusual circumstances dictate out-of-the-box thinking.
For example, suppose you have a system where you’ve installed multiple data clusters onto a single host. What you end up with is a host splitting its valuable system resources, i.e., CPU, RAM, disk, etc., between multiple instances of PostgreSQL data clusters. While easy to do in a development environment, it represents technical debt once it reaches production. Sooner or later, this issue must be addressed; otherwise, you can suffer the consequences of handicapping your entire database infrastructure.
Let’s now move forward in time: your system has scaled, of course, and this shortcut of using multiple data clusters on a single host has become a performance bottleneck. The problem is that you either won’t or can’t refactor your application servers; something, maybe, about not having enough time in the day. And, as you may already know, while Postgres can listen on a UNIX domain socket as well as IPv4 and IPv6 addresses, it is nevertheless constrained to a single port.
So what do you do?
For the experienced sysadmin, there are actually quite a number of “magical” techniques. However, in this case, with a little help from systemd, which manages all service processes, we will solve this little puzzle using PgBouncer with a concoction of configuration files.
Scenario
Configure the system such that Postgres resides on its default port of 5432 and PgBouncer sits on three ports, i.e., 6432, 6433, and 6434, accessing the resident Postgres server.
The PgBouncer connection pooler will use an administrative account, the ROLE pgbouncer, for the purpose of user authentication. Authentication is achieved by accessing the Postgres server’s pg_shadow table and comparing the resultant hash against all incoming connections (this won’t work for cloud setups such as, for example, Amazon RDS).
A set of Systemd configuration files will be created and edited in order to manage the PgBouncer service.
About the files
Below is a summary of the files and how they will be edited. Remember, these configuration files are of a hypothetical nature using minimal settings, which, of course, will need to be updated to match a realistic production environment.
├── pgbouncer
│ ├── pgbouncer.ini
│ └── userlist.txt
├── postgres
│ ├── add_function.sql
│ ├── add_user.sql
└── systemd
├── pgbouncer_override
└── pgbouncer.socket
pgbouncer.ini
This configuration file defines all behavior and is installed in its default location, “/etc/pgbouncer”.
Only one domain socket is used. For our purposes, the listen_port runtime parameter is just noise and is superseded by the other runtime parameters as declared in file pgbouncer.socket.
TIP: Back up the original pgbouncer.ini, as it references ALL runtime parameters.
[databases]
* = host=localhost

[users]
# left blank

[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

;; these parameters are implicitly disabled
listen_addr = 0.0.0.0
listen_port = 6432

unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer
auth_query = SELECT p_user, p_password FROM public.lookup($1)
pool_mode = session

;; Use <appname - host> as application_name on server.
application_name_add_host = 1
userlist.txt
Contains a single user account and its password for authentication. Note that the md5-hashed password is pgbouncer.
"pgbouncer" "md5be5544d3807b54dd0637f2439ecb03b9"
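The hash follows PostgreSQL’s md5 password format: the literal prefix md5 followed by md5(password || username). A sketch of how one might generate a userlist.txt entry (the helper below is illustrative, not part of PgBouncer):

```shell
#!/bin/sh
# Build a userlist.txt line for a given user/password pair.
# PostgreSQL md5 format: "md5" + md5(password concatenated with username)
USERNAME='pgbouncer'
PASSWORD='pgbouncer'

HASH="md5$(printf '%s%s' "$PASSWORD" "$USERNAME" | md5sum | cut -d' ' -f1)"
echo "\"$USERNAME\" \"$HASH\""
```

Running this for pgbouncer/pgbouncer reproduces the entry shown above.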
add_user.sql
Adds the ROLE “pgbouncer” to the Postgres data cluster. While under normal circumstances this is not required, PgBouncer uses this ROLE in order to validate all logins.
For our purposes, the password is “pgbouncer”:
--
-- EXECUTE AS SUPERUSER postgres
--
CREATE ROLE pgbouncer WITH LOGIN PASSWORD 'pgbouncer';
add_function.sql
This is a user-defined function that the ROLE pgbouncer executes in order to obtain the password hash from pg_shadow.
This SQL statement must be executed against each and every database that is to be accessed by any ROLE connection using PgBouncer.
TIP: Execute this SQL against database template1, and the function call will be included with every newly created database thereafter.
--
-- EXECUTE AS SUPERUSER postgres
--
-- execute on each database user accounts will login
--
CREATE FUNCTION public.lookup (
    INOUT p_user     name,
    OUT   p_password text
) RETURNS record
LANGUAGE sql
SECURITY DEFINER
SET search_path = pg_catalog
AS $$SELECT usename, passwd FROM pg_shadow WHERE usename = p_user$$;

-- make sure only "pgbouncer" can use the function
REVOKE EXECUTE ON FUNCTION public.lookup(name) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.lookup(name) TO pgbouncer;
pgbouncer_override
This systemd drop-in file overrides key options in the default PgBouncer unit file and will never be overwritten even after updating the PgBouncer Linux RPM/DEB packages.
As root, execute the following command and paste the contents of the provided file pgbouncer_override:
systemctl edit pgbouncer
#
# systemctl edit pgbouncer
# systemctl daemon-reload
#
[Unit]
Requires=pgbouncer.socket
pgbouncer.socket
This is the secret sauce; this file enables PgBouncer to listen on all three ports, 6432, 6433, and 6434. You will note that adding or removing ports is a simple matter of adding or removing addresses as per the format shown in the file below.
As root: create this file and perform a daemon reload:
vi /etc/systemd/system/pgbouncer.socket
#
# vi /etc/systemd/system/pgbouncer.socket
# systemctl daemon-reload
#
[Unit]
Description=sockets for PgBouncer
PartOf=pgbouncer.service

[Socket]
ListenStream=6432
ListenStream=6433
ListenStream=6434
ListenStream=/var/run/postgresql/.s.PGSQL.6432
ReusePort=true
RemoveOnStop=true

[Install]
WantedBy=sockets.target
systemctl daemon-reload
Putting it all together
After all the configuration files have been created and edited, this is what we get:
systemctl restart pgbouncer
netstat -tlnp
Proto Recv-Q Send-Q Local Address    Foreign Address  State
tcp        0      0 0.0.0.0:22       0.0.0.0:*        LISTEN
tcp        0      0 127.0.0.1:5432   0.0.0.0:*        LISTEN
tcp        0      0 127.0.0.53:53    0.0.0.0:*        LISTEN
tcp6       0      0 :::22            :::*             LISTEN
tcp6       0      0 ::1:5432         :::*             LISTEN
tcp6       0      0 :::6432          :::*             LISTEN
tcp6       0      0 :::6433          :::*             LISTEN
tcp6       0      0 :::6434          :::*             LISTEN
Finally, perform the following:
- sudo as Postgres
- create a database db01
- create a ROLE usr1 accessing database db01
- update pg_hba.conf and postgresql.conf allowing localhost connections by PgBouncer
Now validate the ports:
for u in 5432 6432 6433 6434
do
echo "==== port: $u ==="
export PGHOST=localhost PGPORT=$u PGDATABASE=db01
psql "user=usr1 password=usr1" -c "select 'hello world' as greetings"
done
And here’s the output:
==== port: 5432 ===
  greetings
-------------
 hello world
(1 row)

==== port: 6432 ===
  greetings
-------------
 hello world
(1 row)

==== port: 6433 ===
  greetings
-------------
 hello world
(1 row)

==== port: 6434 ===
  greetings
-------------
 hello world
(1 row)
Conclusion
Because PostgreSQL was designed from the ground up to work in conjunction with the OS, i.e., a UNIX type of operating system, we’ve been able to solve an interesting problem using a novel approach. And while systemd is ubiquitous, it isn’t normally considered part of a solution to a Postgres puzzle. We call this hacking 🙂
Happy Travels!
02
2023
An Argument for Logical Failover Slots

These days, the typical PostgreSQL cluster consists not only of a multi-node replication cluster, which is ordinarily an asynchronous streaming replication model but can sometimes include a logical replication component.

Recall that logical replication works by using the PUB/SUB mode, where individual tables are published and are then subscribed by remotely connected databases/tables. DML operations, such as INSERT, UPDATE, DELETE, TRUNCATE, etc., are then replicated from one or more publishers to one or more subscribers.
For the most part, it’s a pretty straightforward operation. The only real limitation is that one can only publish tables from a read-write server such as the PRIMARY.

There’s a problem, however, when it comes to executing failovers. While a STANDBY/REPLICA host can quickly take over with a simple select pg_promote() SQL statement, the same cannot be said regarding logical replication. Instead, failing over is, in fact, not a single action but a collection of discrete steps that must be performed in an exact and precise manner.
Consider the following steps typifying a failover:
- Promote the REPLICA, thus becoming the new PRIMARY.
- When present, redirect other REPLICAs pointing to the new PRIMARY.
- Failover the logical replication slot:
- Block all logical replication from the failed PRIMARY.
- Create a logical replication slot on the newly promoted PRIMARY.
- Alter the subscription and point to the newly promoted PRIMARY.
- Resume/restore logical replication between publisher and subscriber.
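The steps above can be sketched as a plan of SQL commands. This dry-run sketch only prints the plan; the host, slot, and subscription names are hypothetical placeholders, not from a real cluster:

```shell
#!/bin/sh
# Dry-run plan for a logical-slot failover; nothing here is executed
# against a live cluster. NEW_PRIMARY, SLOT, and SUB are placeholders.
NEW_PRIMARY='pg2'
SLOT='failover_slot'
SUB='my_sub'

PLAN="-- 1. promote the REPLICA (run on the REPLICA)
SELECT pg_promote();
-- 2. recreate the logical slot on the newly promoted PRIMARY
SELECT pg_create_logical_replication_slot('$SLOT', 'pgoutput');
-- 3. block, repoint, and resume the subscription (run on the subscriber)
ALTER SUBSCRIPTION $SUB DISABLE;
ALTER SUBSCRIPTION $SUB CONNECTION 'host=$NEW_PRIMARY dbname=postgres';
ALTER SUBSCRIPTION $SUB ENABLE;"

echo "$PLAN"
```

Each command must succeed in order, which is precisely why the manual procedure is fragile compared to a built-in failover slot.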
The challenge, of course, is reconstituting the logical replication process as quickly as possible, ideally without any data loss.
Two common solutions come to mind:
- Block all incoming connections before promoting the STANDBY while simultaneously installing the logical replication slot.
- Promote the STANDBY and then flush and recopy the entire contents of the subscribed table(s).
In both cases, these solutions make a fast recovery problematic, not to mention the additional effort required by the SRE/DBA/DEV preparing the requisite instructions.

In an ideal world, the perfect implementation of a logical slot failover would have one already in place on the STANDBY, thus guaranteeing data consistency without fear of data loss no matter how fast or slow the actual promotion or altering the subscription’s connectivity parameters may take.

The good news is that there are hacks that can speed up the entire recovery process, although they all have their own quirky limitations. Patroni, for example, implements one such technique by copying the file named state, which is located in the data cluster’s subdirectory pg_replslot, over to the REPLICA’s data cluster in the same location.
# PGDATA on an Ubuntu install
/var/lib/postgresql/15/main/pg_replslot/pg3
└── state
However, the caveat is that it requires a full server restart of the REPLICA, in addition to its promotion, before the logical slot is fully active. And, of course, timing is critical.

So there you have it.
Too bad such a feature doesn’t exist… or does it?

Stay tuned folks; the game is afoot. 
For the curious, I’ve included some references in past blogs, courtesy of Jobin:
- Failover of Logical Replication Slots in PostgreSQL
- How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster
And here’s another set of references: a valiant attempt to incorporate this feature into PostgreSQL a few years ago.
- PostgreSQL wiki: here
- Part of a mail thread back in 2016: here
- An old blog located on http://webarchive.org from a now-defunct company: here