Jun
05
2024
--

Administering a Patroni Managed PostgreSQL Cluster

Patroni managed PostgreSQL cluster

There are quite a number of methods these days for installing Patroni. I want to discuss the next step, which is how to start tuning a running system. The manner of updating a running Postgres server typically involves editing the standard Postgres files: postgresql.conf, postgresql.auto.conf, pg_hba.conf. Working with a Patroni HA cluster, one has the added […]

Dec
26
2023
--

Grafana Dashboards: A PoC Implementing the PostgreSQL Extension pg_stat_monitor

Grafana Dashboards: A PoC Implementing the PostgreSQL Extension pg_stat_monitor

This PoC demonstrates how to install and configure pg_stat_monitor in order to extract useful and actionable metrics from a PostgreSQL database and display them on a Grafana dashboard.

About the environment

  • Grafana: version 10.0.0
  • Grafana database backend: Prometheus version 2.15.2+d
  • PostgreSQL version 13
  • pgbench version 13

In order to investigate the opportunities for deriving constructive and useful metrics from PostgreSQL and presenting them in Grafana, it is necessary to generate load using pgbench.

Configuring Grafana

For our purposes, the Grafana datasource used in this PoC is also the Postgres data cluster that is generating the data to be monitored.

Grafana Pgbench

pg_stat_monitor

About

pg_stat_monitor is a Query Performance Monitoring tool for PostgreSQL. It collects various statistics data such as query statistics, query plan, SQL comments, and other performance insights. The collected data is aggregated and presented in a single view. 

pg_stat_monitor takes its inspiration from pg_stat_statements. Unlike pg_stat_statements, which accumulates its metrics from the last time they were zeroed, pg_stat_monitor buckets its output into a configurable number of time-based aggregates, saving users the effort of doing that themselves.
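
As an illustration, the per-bucket breakdown can be pulled straight from the view; this is a minimal sketch, where the query-text filter is purely illustrative and assumes pgbench-style traffic:

-- execution time per bucket for one statement pattern (filter is illustrative)
select bucket, bucket_start_time, calls, total_exec_time
from pg_stat_monitor
where query like 'UPDATE pgbench_accounts%'
order by bucket_start_time desc;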

pg_stat_monitor tracks the following operations:

  • statements
  • queries
  • functions
  • stored procedures and other non-utility statements

Features

  • Time Interval Grouping: Instead of supplying one set of ever-increasing counts, pg_stat_monitor computes stats for a configured number of time intervals – time buckets. This allows for much better data accuracy, especially in the case of high-resolution or unreliable networks.
  • Multi-Dimensional Grouping: While pg_stat_statements groups counters by userid, dbid, queryid, pg_stat_monitor uses a more detailed group for higher precision. This allows a user to drill down into the performance of queries.
  • Capture Actual Parameters in the Queries: pg_stat_monitor allows you to choose whether you want to see queries with placeholders for parameters or the actual parameter values. This simplifies debugging and analysis by letting users re-execute the same query with its real parameters.
  • Query Plan: Each SQL statement is accompanied by the actual plan that was constructed for its execution. That’s a huge advantage if you want to understand why a particular query is slower than expected.
  • Tables Access Statistics for a Statement: This allows us to easily identify all queries that accessed a given table. This set is on par with the information provided by pg_stat_statements.
  • Histogram: Visual representation is very helpful as it can help identify issues. With the help of the histogram function, one can now view a timing/calling data histogram in response to an SQL query. And yes, it even works in PostgreSQL.
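
For example, the response-time distribution is exposed through the view's resp_calls column (call counts per response-time range); a minimal sketch of reading it:

-- calls per response-time range, busiest statements first
select queryid, calls, resp_calls
from pg_stat_monitor
where calls > 0
order by calls desc
limit 5;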

Installation (example: CentOS 8, PostgreSQL 14)

The simplest way to get pg_stat_monitor is to install it via Percona Distribution for PostgreSQL.

The following instructions demonstrate installing Percona Distribution for PostgreSQL and pg_stat_monitor on a CentOS 8 Linux distribution:

# Install The Percona Repository
dnf install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ppg14

# Install The postgres Community Repository
dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf update -y
dnf install -y pg_stat_monitor_14

# perform standard initialization and systemd configurations
/usr/pgsql-14/bin/postgresql-14-setup initdb

# configure postgres to use pg_stat_monitor
echo "
shared_preload_libraries=pg_stat_monitor
" >> /var/lib/pgsql/14/data/postgresql.auto.conf

# complete postgres configuration
systemctl enable postgresql-14
systemctl start postgresql-14

Create extension

The pg_stat_monitor extension can be created in any database, but for the purposes of this PoC, it is placed in the database pgbench.

postgres=# create database pgbench;
postgres=# \c pgbench
pgbench=# create extension pg_stat_monitor;

pgbench=# \d
             List of relations
 Schema |      Name       | Type |  Owner   
--------+-----------------+------+----------
 public | pg_stat_monitor | view | postgres

View "public.pg_stat_monitor"
       Column        |           Type           | Collation | Nullable | Default 
---------------------+--------------------------+-----------+----------+---------
 bucket              | bigint                   |           |          | 
 bucket_start_time   | timestamp with time zone |           |          | 
 userid              | oid                      |           |          | 
 username            | text                     |           |          | 
 dbid                | oid                      |           |          | 
 datname             | text                     |           |          | 
 client_ip           | inet                     |           |          | 
 pgsm_query_id       | bigint                   |           |          | 
 queryid             | bigint                   |           |          | 
 toplevel            | boolean                  |           |          | 
 top_queryid         | bigint                   |           |          | 
 query               | text                     |           |          | 
 comments            | text                     |           |          | 
 planid              | bigint                   |           |          | 
 query_plan          | text                     |           |          | 
 top_query           | text                     |           |          | 
 application_name    | text                     |           |          | 
 relations           | text[]                   |           |          | 
 cmd_type            | integer                  |           |          | 
 cmd_type_text       | text                     |           |          | 
 elevel              | integer                  |           |          | 
 sqlcode             | text                     |           |          | 
 message             | text                     |           |          | 
 calls               | bigint                   |           |          | 
 total_exec_time     | double precision         |           |          | 
 min_exec_time       | double precision         |           |          | 
 max_exec_time       | double precision         |           |          | 
 mean_exec_time      | double precision         |           |          | 
 stddev_exec_time    | double precision         |           |          | 
 rows                | bigint                   |           |          | 
 shared_blks_hit     | bigint                   |           |          | 
 shared_blks_read    | bigint                   |           |          | 
 shared_blks_dirtied | bigint                   |           |          | 
 shared_blks_written | bigint                   |           |          | 
 local_blks_hit      | bigint                   |           |          | 
 local_blks_read     | bigint                   |           |          | 
 local_blks_dirtied  | bigint                   |           |          | 
 local_blks_written  | bigint                   |           |          | 
 temp_blks_read      | bigint                   |           |          | 
 temp_blks_written   | bigint                   |           |          | 
 blk_read_time       | double precision         |           |          | 
 blk_write_time      | double precision         |           |          | 
 resp_calls          | text[]                   |           |          | 
 cpu_user_time       | double precision         |           |          | 
 cpu_sys_time        | double precision         |           |          | 
 wal_records         | bigint                   |           |          | 
 wal_fpi             | bigint                   |           |          | 
 wal_bytes           | numeric                  |           |          | 
 bucket_done         | boolean                  |           |          | 
 plans               | bigint                   |           |          | 
 total_plan_time     | double precision         |           |          | 
 min_plan_time       | double precision         |           |          | 
 max_plan_time       | double precision         |           |          | 
 mean_plan_time      | double precision         |           |          | 
 stddev_plan_time    | double precision         |           |          |

About pgbench

pgbench is a simple program executing benchmark tests on PostgreSQL by running the same sequence of SQL commands over and over. pgbench is capable of executing multiple concurrent database sessions and can calculate the average transaction rate (TPS) at the end of a run. Although the default configuration simulates loading based loosely upon TPC-B, it is nevertheless easy to test other use cases by writing one’s own transaction script files.

Querying the data

While it is reasonable to create panels showing real-time load in order to better explore the types of queries that can be run against pg_stat_monitor, it is more practical to copy the data into tables and query it there after the benchmarking has completed its run.

Table: pg_stat_monitor_archive

Save the data generated from a recently completed benchmark run into an archive table:

select * into pg_stat_monitor_archive
from pg_stat_monitor
order by bucket_start_time asc;

Table "public.pg_stat_monitor_archive"
      Column        |           Type           | Collation | Nullable | Default  
--------------------+--------------------------+-----------+----------+---------
bucket              | bigint                   |           |          |  
bucket_start_time   | timestamp with time zone |           |          |  
userid              | oid                      |           |          |  
username            | text                     |           |          |  
dbid                | oid                      |           |          |  
datname             | text                     |           |          |  
client_ip           | inet                     |           |          |  
pgsm_query_id       | bigint                   |           |          |  
queryid             | bigint                   |           |          |  
toplevel            | boolean                  |           |          |  
top_queryid         | bigint                   |           |          |  
query               | text                     |           |          |  
comments            | text                     |           |          |  
planid              | bigint                   |           |          |  
query_plan          | text                     |           |          |  
top_query           | text                     |           |          |  
application_name    | text                     |           |          |  
relations           | text[]                   |           |          |  
cmd_type            | integer                  |           |          |  
cmd_type_text       | text                     |           |          |  
elevel              | integer                  |           |          |  
sqlcode             | text                     |           |          |  
message             | text                     |           |          |  
calls               | bigint                   |           |          |  
total_exec_time     | double precision         |           |          |  
min_exec_time       | double precision         |           |          |  
max_exec_time       | double precision         |           |          |  
mean_exec_time      | double precision         |           |          |  
stddev_exec_time    | double precision         |           |          |  
rows                | bigint                   |           |          |  
shared_blks_hit     | bigint                   |           |          |  
shared_blks_read    | bigint                   |           |          |  
shared_blks_dirtied | bigint                   |           |          |  
shared_blks_written | bigint                   |           |          |  
local_blks_hit      | bigint                   |           |          |  
local_blks_read     | bigint                   |           |          |  
local_blks_dirtied  | bigint                   |           |          |  
local_blks_written  | bigint                   |           |          |  
temp_blks_read      | bigint                   |           |          |  
temp_blks_written   | bigint                   |           |          |  
blk_read_time       | double precision         |           |          |  
blk_write_time      | double precision         |           |          |  
resp_calls          | text[]                   |           |          |  
cpu_user_time       | double precision         |           |          |  
cpu_sys_time        | double precision         |           |          |  
wal_records         | bigint                   |           |          |  
wal_fpi             | bigint                   |           |          |  
wal_bytes           | numeric                  |           |          |  
bucket_done         | boolean                  |           |          |  
plans               | bigint                   |           |          |  
total_plan_time     | double precision         |           |          |  
min_plan_time       | double precision         |           |          |  
max_plan_time       | double precision         |           |          |  
mean_plan_time      | double precision         |           |          |  
stddev_plan_time    | double precision         |           |          |

Table: pg_stat_monitor_qry

Extract this metric of interest, i.e., time vs total execution time:

select bucket_start_time, pgsm_query_id, queryid, total_exec_time
into pg_stat_monitor_qry
from pg_stat_monitor
order by bucket_start_time asc;

pgbench=# \d pg_stat_monitor_qry
                      Table "public.pg_stat_monitor_qry"
      Column       |           Type           | Collation | Nullable | Default 
-------------------+--------------------------+-----------+----------+---------
 bucket_start_time | timestamp with time zone |           |          | 
 pgsm_query_id     | bigint                   |           |          | 
 queryid           | bigint                   |           |          | 
 total_exec_time   | double precision         |           |          |

Table: pg_stat_monitor_shared_blk_io

Extract this metric of interest, i.e., time vs shared_blk io:

select bucket_start_time, pgsm_query_id, queryid,
       shared_blks_hit, shared_blks_read,
       shared_blks_dirtied, shared_blks_written
    into pg_stat_monitor_shared_blk_io
    from pg_stat_monitor_archive
    order by bucket_start_time asc;

pgbench=# \d pg_stat_monitor_shared_blk_io
                 Table "public.pg_stat_monitor_shared_blk_io"
      Column        |           Type           | Collation | Nullable | Default  
---------------------+--------------------------+-----------+----------+---------
bucket_start_time   | timestamp with time zone |           |          |  
pgsm_query_id       | bigint                   |           |          |  
queryid             | bigint                   |           |          |  
shared_blks_hit     | bigint                   |           |          |  
shared_blks_read    | bigint                   |           |          |  
shared_blks_dirtied | bigint                   |           |          |  
shared_blks_written | bigint                   |           |          |

Table: pg_stat_monitor_blk_io

Note: this metric requires the runtime parameter track_io_timing to be set to on.
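
If it is not already enabled, the parameter can be switched on without a server restart; a minimal sketch:

-- enable I/O timing collection (superuser)
alter system set track_io_timing = on;
select pg_reload_conf();
show track_io_timing;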

Extract this metric of interest, i.e., time vs. blk io:

select bucket_start_time, pgsm_query_id, queryid, blk_read_time, blk_write_time
    into pg_stat_monitor_blk_io
    from pg_stat_monitor_archive
    order by bucket_start_time asc;

Table: pg_stat_monitor_uniq_id

Save a copy of all unique query IDs in order to parse out future queries from the view. 

Column pgsm_query_id identifies a query in such a manner that one can still recognize the same query even when it is generated on other platforms, under different loading conditions, and with different data:

with a as (select distinct on (pgsm_query_id) *
            from pg_stat_monitor_archive
            where application_name='pgbench')
select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
    into pg_stat_monitor_uniq_id
    from a
    order by cmd_type;

pgbench=# \d pg_stat_monitor_uniq_id
          Table "public.pg_stat_monitor_uniq_id"
    Column     |  Type   | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
 cmd_type      | integer |           |          |
 cmd_type_text | text    |           |          |
 pgsm_query_id | bigint  |           |          |
 queryid       | bigint  |           |          |
 example_query | text    |           |          |

This is an example set of queries generated by pgbench. Note that the values in column pgsm_query_id remain the same irrespective of host or environment:

select cmd_type_text, pgsm_query_id, example_query 
from pg_stat_monitor_uniq_id where cmd_type > 0;

cmd_type_text |    pgsm_query_id     |           example_query
---------------+----------------------+-----------------------------------------------------------------
 SELECT        | -7455620703706695456 | SELECT abalance FROM pgbench_accounts WHERE aid = 16416498
 UPDATE        |  -510321339504955469 | UPDATE pgbench_accounts SET abalance = abalance + 2063 
                                        WHERE aid = 1482568
 UPDATE        |  5276535447716615446 | UPDATE pgbench_branches SET bbalance = bbalance + 1384 
                                        WHERE bid = 7
 UPDATE        |  3629195281782908951 | UPDATE pgbench_tellers SET tbalance = tbalance + -2966 
                                        WHERE tid = 330
 INSERT        | -8751124061964589929 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) 
                                        VALUES (321, 56, 21104880, 4211, CURRENT_TIMESTAMP)

Benchmarking

Two types of performance monitoring are profiled:

  1. Real-time loading performance
  2. Aggregate performance over a specific time period, i.e., a snapshot.

Although the results of the benchmarking can be viewed by querying the view pg_stat_monitor, you will note, as demonstrated by the bash script and SQL statements below, that the contents of the view are immediately copied and saved into a collection of tables. This is because the data disappears over time as pg_stat_monitor cycles through its allotted number of buckets.
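
Between full snapshots, completed buckets can also be appended to the archive while a run is in progress; a minimal sketch, assuming the archive table already exists (repeated runs would need deduplication, e.g., on bucket and queryid):

-- append only buckets that pg_stat_monitor has finished writing
insert into pg_stat_monitor_archive
select * from pg_stat_monitor
where bucket_done;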

A script executing a benchmarking run:

#!/bin/bash

# REFERENCE
#   https://docs.percona.com/pg-stat-monitor/reference.html
#
set -e

export PGPASSWORD="MYPASSWORD" \
       PGHOST='MYHOST' \
       PGPORT=5434 \
       PGUSER=postgres
#
# initialize benchmarking database
#
dropdb --if-exists pgbench
createdb pgbench
/usr/pgsql-12/bin/pgbench -i --foreign-keys -s 300 pgbench
psql pgbench -c 'create extension pg_stat_monitor'

#
# configure pg_stat_monitor, requires system restart
#
psql postgres <<_eof_
-- set bucket time range,  default is normally 60 seconds
   alter system set pg_stat_monitor.pgsm_bucket_time = '1min';
-- set number of buckets, default is normally 10
   alter system set pg_stat_monitor.pgsm_max_buckets = 75;
_eof_

systemctl restart postgresql@13-main

psql postgres <<_eof_
-- zero pg_stat_monitor stats
    select * from pg_stat_monitor_reset();
_eof_

#
# begin benchmarking run
#
#   4500 seconds (75 minutes)
/usr/pgsql-12/bin/pgbench -U postgres -c 4 -j 2 -T 4500 -P 5 -b tpcb-like pgbench

#
# copy and save the benchmarking run into tables
#
psql postgres <<_eof_
    drop table if exists pg_stat_monitor_archive, 
                           pg_stat_monitor_qry, 
                           pg_stat_monitor_uniq_id;

    select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time;

    select bucket_start_time, pgsm_query_id, queryid, total_exec_time
    into pg_stat_monitor_qry
    from pg_stat_monitor_archive
    where application_name='pgbench';

    with a as (select distinct on (pgsm_query_id) * 
               from pg_stat_monitor_archive 
               where application_name='pgbench')    
    select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query
        into pg_stat_monitor_uniq_id
        from a
        order by cmd_type;
_eof_


echo "DONE"

progress: 4435.0 s, 341.2 tps, lat 11.718 ms stddev 3.951
progress: 4440.0 s, 361.2 tps, lat 11.075 ms stddev 3.519
progress: 4445.0 s, 348.0 tps, lat 11.483 ms stddev 5.246
progress: 4450.0 s, 383.8 tps, lat 10.418 ms stddev 4.514
progress: 4455.0 s, 363.6 tps, lat 10.988 ms stddev 4.326
progress: 4460.0 s, 344.0 tps, lat 11.621 ms stddev 3.981
progress: 4465.0 s, 360.4 tps, lat 11.093 ms stddev 4.457
progress: 4470.0 s, 383.8 tps, lat 10.423 ms stddev 5.615
progress: 4475.0 s, 369.6 tps, lat 10.811 ms stddev 3.784
progress: 4480.0 s, 355.6 tps, lat 11.227 ms stddev 3.954
progress: 4485.0 s, 378.8 tps, lat 10.580 ms stddev 2.890
progress: 4490.0 s, 370.8 tps, lat 10.770 ms stddev 2.879
progress: 4495.0 s, 365.2 tps, lat 10.947 ms stddev 4.997
progress: 4500.0 s, 379.2 tps, lat 10.549 ms stddev 2.832

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 4
number of threads: 2
duration: 4500 s
number of transactions actually processed: 1564704
latency average = 11.497 ms
latency stddev = 4.800 ms
tps = 347.711175 (including connections establishing)
tps = 347.711731 (excluding connections establishing)

Dashboard example 1: Querying saved data

Top panel (Query execution time vs. DML)

Five (5) SQL statements are used to create this panel:

-- SELECT --
select  bucket_start_time,total_exec_time as "SELECT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select  bucket_start_time,total_exec_time as "INSERT"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select  bucket_start_time,total_exec_time as "UPDATE 1"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select  bucket_start_time,total_exec_time as "UPDATE 2"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select  bucket_start_time,total_exec_time  as "UPDATE 3"
from pg_stat_monitor_qry
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Bottom panel (Query execution time vs. shared blocks)

-- INSERT (ins_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as ins_hit, shared_blks_read as ins_read,
       shared_blks_dirtied as ins_dirt, shared_blks_written as ins_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 (update1_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as update1_hit, shared_blks_read as update1_read,
       shared_blks_dirtied as update1_dirt, shared_blks_written as update1_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 (update2_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as update2_hit, shared_blks_read as update2_read,
       shared_blks_dirtied as update2_dirt, shared_blks_written as update2_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 (update3_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit as update3_hit, shared_blks_read as update3_read,
       shared_blks_dirtied as update3_dirt, shared_blks_written as update3_writ
from pg_stat_monitor_shared_blk_io
join pg_stat_monitor_uniq_id using (pgsm_query_id)
where cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Analysis

Here are some example patterns that can be discerned:

  1. The SELECT statements are the fastest DML operations (top panel).
  2. Although SQL statement UPDATE 1 (top panel) takes up the most time, it does not have much presence in the shared buffer relative to the other UPDATE statements.
  3. Inserts (top panel) are the second-slowest set of statements, yet they generate very little shared-buffer activity compared to the UPDATEs (bottom panel).

Dashboard example 2: Monitoring in real time

These two panels show read/write I/O performance to persistent storage during a live benchmarking run.

Top panel (Execution time vs. DML)

-- SELECT --
select  bucket_start_time,total_exec_time as "SELECT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='SELECT'
order by 1 asc;

-- INSERT --
select  bucket_start_time,total_exec_time as "INSERT"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='INSERT'
order by 1 asc;

-- UPDATE 1 --
select  bucket_start_time,total_exec_time as "UPDATE 1"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = -510321339504955469
order by 1 asc;

-- UPDATE 2 --
select  bucket_start_time,total_exec_time as "UPDATE 2"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 5276535447716615446
order by 1 asc;

-- UPDATE 3 --
select  bucket_start_time,total_exec_time  as "UPDATE 3"
from pg_stat_monitor
join pg_stat_monitor_uniq_id a using (pgsm_query_id)
where a.cmd_type_text='UPDATE'
and pgsm_query_id = 3629195281782908951
order by 1 asc;

Bottom panel (Time vs. IO)

-- time vs read/write blocks (blk_read_time, blk_write_time)
--    track_io_timing is on
select bucket_start_time, blk_read_time, blk_write_time from public.pg_stat_monitor;

Analysis

It’s quite easy to observe that SQL statement UPDATE 1 represents the bulk of the read operations.

-- example SQL statement
UPDATE pgbench_accounts SET abalance = abalance + 2063  WHERE aid = 1482568

Interestingly, writes are not as significant as reads.
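
The same conclusion can be reached from the saved tables; this hedged sketch ranks the captured statements by shared-block reads and writes:

-- which statements generated the most read/write block activity
select u.cmd_type_text, u.pgsm_query_id,
       sum(a.shared_blks_read)    as blks_read,
       sum(a.shared_blks_written) as blks_written
from pg_stat_monitor_archive a
join pg_stat_monitor_uniq_id u using (pgsm_query_id)
group by 1, 2
order by blks_read desc;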

Conclusion

I’m excited about pg_stat_monitor. Not only can it be used in Grafana, but it’s easily implemented in any monitoring solution, including our own Percona Monitoring and Management. It’s also incorporated in our latest version of Percona Operator for PostgreSQL.

pg_stat_monitor is an obvious, common-sense improvement over pg_stat_statements' greatest limitation, i.e., its inability to bucket metrics over time intervals. And to be frank, I can see the pg_stat_monitor extension eventually replacing pg_stat_statements as the de facto extension for monitoring Postgres when it comes to real-time analysis.

Happy monitoring!


Nov
03
2023
--

How To Scale a Single-Host PostgreSQL Database With Citus

Single-Host PostgreSQL Database With Citus

When it comes to Citus, successfully building out and scaling a PostgreSQL cluster across multiple nodes and even across data centers can feel, at times, like an art form because there are so many ways of building it out.

There’s an axiom that I think aptly applies to this situation describing the differences between science and art:

– Science: 1 problem -> 1 solution
– Art: 1 problem -> 1,000 solutions

While the two previous Citus blogs that I wrote covered the fairly straightforward concepts of creating columnar tables and leveraging data redundancy, this blog explores the architectural design considerations using this freely downloadable and fully featured PostgreSQL extension.

Rather than listing the concepts, function calls, etc., available in Citus, which frankly is a bit boring, I’m going to explore scaling out a database system starting with a single host. I won’t cover all the features, but I’ll show just enough to make you want to explore what you can accomplish for yourself.

About the cluster

Citus PostgreSQL

Following a step-by-step process, the objective is to create a four-node cluster consisting of:

  • PostgreSQL version 15
  • Citus extension (I’ll be using version 11, but there are newer ones available.)
  • PostgreSQL Cluster
    • One coordinator node
      • citus-coord-01
    • Three worker nodes
      • citus1
      • citus2
      • citus3
  • Hardware
    • AWS instance: c5.xlarge, 4 vCPU, 8GB RAM
    • Ubuntu Server 20.04, 64-bit (x86), SSD volume type
    • Storage: EBS volume (root), 80GB, gp2 (IOPS 240/3000)

As well, high availability will be integrated, guaranteeing cluster viability should one worker node go down.

Leveraging pgbench, a benchmarking utility that comes bundled with PostgreSQL, I will put the cluster through its paces by executing a series of DML operations. Using its default TPC-B-like benchmark, one can stress test a database of any size, ranging from a few clients to thousands of simulated clients interacting with a system sized into the terabytes if need be.
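
As a quick illustration of such a transaction script file, a read-only variant of the built-in workload might look like the following (hypothetical file select_only.sql, passed to pgbench with the -f option):

-- select_only.sql: a custom pgbench transaction script (hypothetical example)
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

pgbench also ships a comparable built-in (-b select-only); the point here is only the script syntax.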

Steps

Provisioning

The first step is to provision the four nodes with both PostgreSQL and Citus. Refer here or here for getting and installing the extension.

-- on each postgres node, enable the extension
--   (requires a server restart)
alter system set shared_preload_libraries='citus';

On each of the four nodes, create the database with the Citus extension:

-- using psql
create database pgbench;
\c pgbench
create extension citus;

One worker node

Beginning with a single worker node, log onto citus-coord-01, declare the coordinator host, and add the first worker node, citus1:

-- execute on citus-coord-01
    psql pgbench <<_eof1_
        select citus_set_coordinator_host('citus-coord-01', 5432);
        select citus_add_node('citus1', 5432);
_eof1_

While still logged in database pgbench, get a list of the worker nodes, which at this time is just the one:

select * from master_get_active_worker_nodes();

node_name | node_port
-----------+-----------
 citus1    |      5432

Now, things get a little interesting; the normal initialization of pgbench is to create and populate the tables with the appropriate constraints. In this case, however, only the tables, without data or constraints, are initialized:

-- execute the following command on citus-coord-01
pgbench -iI t pgbench

pgbench=# \d+
                                        List of relations
Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size   |
-------+------------------+-------+----------+-------------+---------------+---------+-----
public | citus_tables     | view  | postgres | permanent   |               | 0 bytes |
public | pgbench_accounts | table | postgres | permanent   | heap          | 0 bytes |
public | pgbench_branches | table | postgres | permanent   | heap          | 0 bytes |
public | pgbench_history  | table | postgres | permanent   | heap          | 0 bytes |
public | pgbench_tellers  | table | postgres | permanent   | heap          | 0 bytes |

This next set of commands, executed on the coordinator node, distributes the pgbench tables across node citus1:

psql pgbench <<_eof1_
    BEGIN;
        select create_distributed_table('pgbench_history', 'aid');
        select create_distributed_table('pgbench_accounts', 'aid');
        select create_distributed_table('pgbench_branches', 'bid');
        select create_distributed_table('pgbench_tellers', 'tid');
    COMMIT;
_eof1_

And now, we can begin populating the tables. The following invocation generates almost 4GB of data. Notice I still haven’t added any constraints:

pgbench -iI g -s 300 pgbench

This query shows the shard count for each distributed table. The default number of shards generated for a given table is 32.

select * from citus_tables;

table_name    | citus_table_type | distribution_column | colocation_id | table_size | shard_count 
------------------+------------------+---------------------+---------------+------------+-------------
 pgbench_accounts | distributed      | aid                 |             1 | 3844 MB    |      32 
 pgbench_branches | distributed      | bid                 |             1 | 256 kB     |      32 
 pgbench_history  | distributed      | aid                 |             1 | 0 bytes    |      32 
 pgbench_tellers  | distributed      | tid                 |             1 | 256 kB     |      32

You’ll also notice that only some of the shards have been populated with data, which is normal when initializing pgbench with data:

select * from citus_shards;

table_name    | shardid |       shard_name        | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+----------+----------+------------
 pgbench_accounts |  102040 | pgbench_accounts_102040 | distributed      |             1 | citus1   |     5432 |  125853696
 pgbench_accounts |  102041 | pgbench_accounts_102041 | distributed      |             1 | citus1   |     5432 |  126173184
 pgbench_accounts |  102042 | pgbench_accounts_102042 | distributed      |             1 | citus1   |     5432 |  125739008
 pgbench_accounts |  102043 | pgbench_accounts_102043 | distributed      |             1 | citus1   |     5432 |  125968384
.
.
 pgbench_branches |  102072 | pgbench_branches_102072 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102073 | pgbench_branches_102073 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102074 | pgbench_branches_102074 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102075 | pgbench_branches_102075 | distributed      |             1 | citus1   |     5432 |       8192
.
.
 pgbench_history  |  102008 | pgbench_history_102008  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102009 | pgbench_history_102009  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102010 | pgbench_history_102010  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102011 | pgbench_history_102011  | distributed      |             1 | citus1   |     5432 |          0
.
.
 pgbench_tellers  |  102104 | pgbench_tellers_102104  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102105 | pgbench_tellers_102105  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102106 | pgbench_tellers_102106  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102107 | pgbench_tellers_102107  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102108 | pgbench_tellers_102108  | distributed      |             1 | citus1   |     5432 |       8192
.
.
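
Incidentally, the 32-shard default comes from the citus.shard_count setting; if a different shard granularity is wanted, it must be set before a table is distributed. A hedged sketch with a hypothetical table, not a step in this walkthrough:

-- choose the shard granularity before distributing (hypothetical table)
set citus.shard_count = 64;
create table my_new_table (id bigint primary key, payload text);
select create_distributed_table('my_new_table', 'id');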

And now, execute the benchmark:

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

The results are not pretty.

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 1524
number of failed transactions: 0 (0.000%)
latency average = 791.374 ms
latency stddev = 157.710 ms
initial connection time = 15.704 ms
tps = 25.197427 (without initial connection time)

ATTENTION: Refer to the bottom of this blog, which tabulates the various benchmark runs as this cluster evolves.

In order to speed up the benchmark, indexes must be added. But this standard pgbench method of adding indexes fails:

postgres@citus-coord-01:~$ pgbench -iI p pgbench
creating primary keys...
pgbench: error: query failed: ERROR:  cannot create constraint without a name on a distributed table
pgbench: detail: Query was: alter table pgbench_branches add primary key (bid)

Successfully adding indexes and constraints requires them to be explicitly named and created in the following fashion. Note that table pgbench_history requires a REPLICA identity because it doesn’t have a primary key:

# THIS WORKS!
psql pgbench <<_eof1_
--
-- indexes and constraints must be explicitly named
--
BEGIN;
create unique index pgbench_accounts_pk on pgbench_accounts(aid);
create unique index pgbench_branches_pk on pgbench_branches(bid);
create unique index pgbench_tellers_pk on pgbench_tellers(tid);

alter table pgbench_accounts
   add constraint pk_accounts primary key using index pgbench_accounts_pk;

alter table pgbench_branches
   add constraint pk_branches primary key using index pgbench_branches_pk;

alter table pgbench_tellers
   add constraint pk_tellers primary key using index pgbench_tellers_pk;

\qecho adding REPLICA IDENTITY (no PK present)
alter table pgbench_history replica identity full;
COMMIT;
_eof1_

Repeating the benchmark run yields a much better result:

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 135273
number of failed transactions: 0 (0.000%)
latency average = 8.865 ms
latency stddev = 9.452 ms
initial connection time = 15.556 ms
tps = 2254.544852 (without initial connection time)

Two worker nodes

Adding a second worker node is straightforward, requiring only two steps:

  • Adding the node.
  • Rebalancing the shards across the two nodes.

Execute this on the coordinator node, citus-coord-01:

psql pgbench <<_eof1_
        --
        -- it's understood that database pgbench already exists on citus2
        --  and extension citus has been created
        --
        \qecho adding node citus2 ...
        select citus_add_node('citus2', 5432);
        \qecho rebalancing shards across TWO nodes ...
        select * from rebalance_table_shards();
        ANALYZE;
_eof1_

Here’s a partial output as the tables are rebalanced across the two nodes:

adding node citus2 ...
citus_add_node
----------------
             3

rebalancing shards across TWO nodes ...
NOTICE:  Moving shard 102008 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102009 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102010 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102011 from citus1:5432 to citus2:5432 ...
.
.
NOTICE:  Moving shard 102020 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102021 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102022 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102023 from citus1:5432 to citus2:5432 ...

Repeating the benchmark once again demonstrates that performance has degraded somewhat. This is to be expected, considering my POC consists of containers on a single machine, all sharing the same system resources:

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 130190
number of failed transactions: 0 (0.000%)
latency average = 9.214 ms
latency stddev = 19.159 ms
initial connection time = 15.776 ms
tps = 2169.559969 (without initial connection time)

Three worker nodes

As per the two-worker node configuration, adding another node is straightforward.

psql pgbench <<_eof1_
        \qecho adding node citus3 ...
        select citus_add_node('citus3', 5432);
        \qecho rebalancing shards across THREE nodes ...
        select * from rebalance_table_shards();
_eof1_

Repeating the benchmark gives us the updated performance metrics.

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 142630
number of failed transactions: 0 (0.000%)
latency average = 8.410 ms
latency stddev = 5.542 ms
initial connection time = 16.407 ms
tps = 2377.227920 (without initial connection time)

Three worker nodes, with redundancy/replication

This next step is interesting because it adds redundancy by increasing the replication factor from one to two, such that each data shard of each table exists as two copies located on two distinct nodes.

ATTENTION: A newly updated replication factor affects only tables created after the fact. Pre-existing tables are not affected by the new replication factor:

psql pgbench <<_eof1_
        \qecho increasing replication factor to TWO
        alter system set citus.shard_replication_factor=2;
        select pg_reload_conf();
_eof1_

TIP: In order to replicate tables, set the replication factor before creating them.

In order to replicate the existing tables across the nodes, they are first undistributed by centralizing them onto the coordinator node. Redistributing them across the three-node cluster then automatically adds the replication.

This centralizes the tables onto the coordinator node:

psql pgbench <<_eof1_
    BEGIN;
        select undistribute_table('pgbench_history');
        select undistribute_table('pgbench_accounts');
        select undistribute_table('pgbench_branches');
        select undistribute_table('pgbench_tellers');
    COMMIT;
_eof1_

The tables are redistributed and rebalanced across the three-node cluster with the correct redundancy of two:

psql pgbench <<_eof1_
BEGIN;
  select create_distributed_table('pgbench_history', 'aid');
  select create_distributed_table('pgbench_accounts', 'aid');
  select create_distributed_table('pgbench_branches', 'bid');
  select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;

-- remove legacy tables from the coordinator node
BEGIN;
   select truncate_local_data_after_distributing_table($$public.pgbench_history$$);
   select truncate_local_data_after_distributing_table($$public.pgbench_accounts$$);
   select truncate_local_data_after_distributing_table($$public.pgbench_branches$$);
   select truncate_local_data_after_distributing_table($$public.pgbench_tellers$$);
COMMIT;
_eof1_
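
To confirm that each shard now has two placements on distinct nodes, the citus_shards view seen earlier can be aggregated; a hedged check, assuming citus_shards reports one row per shard placement:

-- every shardid should show two placements on two different nodes
select table_name, shardid,
       count(*) as placements,
       array_agg(nodename order by nodename) as nodes
from citus_shards
group by 1, 2
order by 1, 2
limit 10;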

Repeating the benchmark gives us the updated performance metrics.

-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench

transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 68834
number of failed transactions: 0 (0.000%)
latency average = 17.430 ms
latency stddev = 12.818 ms
initial connection time = 16.321 ms
tps = 1147.093369 (without initial connection time)

Benchmark results

Here’s a summary of the various benchmarking runs, giving you an idea of relative performance.

 RUN | NUM NODES | WITH IDX | REPLICATION FACTOR |  TPS
-----+-----------+----------+--------------------+------
   1 |         1 | no       |                  1 |   25
   2 |         1 | yes      |                  1 | 2255
   3 |         2 | yes      |                  1 | 2170
   4 |         3 | yes      |                  1 | 2377
   5 |         3 | yes      |                  2 | 1147

Caveat

Although not mentioned in the blog, there are a few things you’ll need to keep in mind as you work with CitusDB:

  1. The password for the account used to propagate operations across the cluster is stored in the home account's .pgpass file on all nodes.
  2. The "wal_level" is set to logical (a minimal sketch follows this list).
  3. Redundancy can potentially decrease overall performance.
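
For the second caveat, a minimal sketch of the setting applied on each node; a server restart is required for wal_level to take effect:

-- run on every node, then restart PostgreSQL
alter system set wal_level = logical;
show wal_level;   -- reports the new value only after the restart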

Keep in mind that the performance metrics depend not only upon the cluster's hardware and system resources but also upon the level of tuning, which, in this case, has not been addressed.

A future blog will continue my exploration into Citus by scaling out pgbench into other architectures.


Aug
29
2023
--

Data Redundancy With the PostgreSQL Citus Extension

Data Redundancy With the PostgreSQL Citus

Over the years, I’ve had the opportunity to architect all sorts of configurations using Postgres as a backend. I’ve always found it very cool and satisfying to implement sophisticated business rules, often in more ways than one has fingers and toes. So, it’s not an understatement when I say that Citus is one of the more interesting technologies that I’ve come across when scaling PostgreSQL.

Citus is an extension that horizontally scales PostgreSQL across multiple machines by sharding and creating redundant copies of tables. Its query engine not only parallelizes incoming SQL queries for time series workloads but also has the ability to create columnar tables, making it ideal for OLAP analysis duties.

Among its many capabilities, a Citus cluster can:

  • Create distributed tables that are sharded across a cluster of PostgreSQL nodes to combine their CPU, memory, storage, and I/O capacity.
  • Replicate reference tables to all nodes to support joins and foreign keys from distributed tables and to maximize read performance (see the sketch after this list).
  • Route and parallelize SELECT, DML, and other operations on distributed tables across the cluster with its distributed query engine.
  • Compress data with columnar storage of tables, speeding up scans and supporting fast projections, both on regular and distributed tables.
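
By way of illustration, here is a minimal sketch of a reference table and a columnar table; the table names and columns are hypothetical and not part of this POC:

-- a reference table, replicated in full to every node (hypothetical table)
create table countries (code text primary key, name text);
select create_reference_table('countries');

-- a compressed, scan-friendly table using the columnar access method (hypothetical table)
create table events_archive (id bigint, payload jsonb) using columnar;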

Data redundancy, a database version of a RAID

Pondering the case of high availability and redundancy, one replicates data by creating a replica via streaming replication.

Now let’s stretch our imagination and consider a second method of high availability, ala Citus.

The best way to describe the Citus way of doing things is to reflect on how data is managed by a disk RAID array. Depending on the configuration, one can tune a hardware RAID for either performance or redundancy. The same can be said for Citus data sharding.

Here is an example of a table named replica2x, which has 2X redundancy across a cluster of four (4) nodes. The colors indicate duplicated shards of the table. For example, if node citus1 goes offline, the sharded table it holds still has copies on nodes citus2 and citus4. Likewise, it can be said that if node citus2 goes offline, the same data is still available on nodes 1, 3, and 4.

Citus example
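
A table such as replica2x gets its 2X redundancy from the citus.shard_replication_factor setting at creation time; this is a minimal sketch, with hypothetical columns, of how such a table could be declared:

-- the replication factor must be in place before the table is distributed
set citus.shard_replication_factor = 2;
create table replica2x (id bigint, payload text);   -- columns are hypothetical
select create_distributed_table('replica2x', 'id');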

About this POC

I’ll be upfront: I love working with Linux Containers, LXD. Much of what I will show you makes heavy use of them. You won’t need LXD to replicate this POC, of course, but I can’t say enough how flexible and versatile such an environment can be when prototyping a cluster of Postgres nodes, let alone an entire multi-data center infrastructure on a single workstation.

There are two parts to this POC:

  • Part A: Setup
  • Part B: Redundancy demonstration

Part A: POC setup

Step one: Getting and installing Citus

Referring to this earlier blog, you’ll see how to get and install Citus into your environment.

Step two: Creating the Citus nodes

The Citus cluster consists of a five (5) node cluster:

  • citus-coord-01: coordinator
  • citus1: worker
  • citus2: worker
  • citus3: worker
  • citus4: worker

By using LXD, I created a single templated container with Citus on Ubuntu 20.04, from which the various nodes were copied.

for u in citus-coord-01 citus1 citus2 citus3 citus4
do
    echo "==== $u ===="
    lxc rm --force $u 2>/dev/null
    lxc cp template-ubuntu-2004-citusdb $u
    lxc start $u
done

And here’s the resultant cluster:

lxc ls -c ns4 citus

+----------------+---------+----------------------+
|      NAME      |  STATE  |         IPV4         |
+----------------+---------+----------------------+
| citus1         | RUNNING | 10.231.38.140 (eth0) |
+----------------+---------+----------------------+
| citus2         | RUNNING | 10.231.38.151 (eth0) |
+----------------+---------+----------------------+
| citus3         | RUNNING | 10.231.38.171 (eth0) |
+----------------+---------+----------------------+
| citus4         | RUNNING | 10.231.38.204 (eth0) |
+----------------+---------+----------------------+
| citus-coord-01 | RUNNING | 10.231.38.34 (eth0)  |
+----------------+---------+----------------------+

It’s understood that on each of the five nodes:

  1. Database db01 has already been created.
  2. The postgresql.conf configuration file has been appropriately edited for remote access.
  3. The .pgpass file has been configured to supply the superuser password for all nodes.
  4. Extension citus has been created in database db01.

Step three: Check packages

Inspecting the nodes confirms Citus has been correctly installed:

for u in citus1 citus2 citus3 citus4 citus-coord-01
do

echo "==== NODE: $u ===="

lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
select extname, extversion from pg_extension;
_eof_

done | less -S

==== NODE: citus1 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus2 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus3 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus4 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

==== NODE: citus-coord-01 ====
    extname     | extversion 
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1

Properly installed, the Citus runtime variables are now available:

lxc exec citus1 -- su postgres -c psql db01<<_eof_ | less -S
    select name,setting,unit from pg_settings where name ~ 'citus' order by 1;
_eof_

name                        |     setting     | unit 
----------------------------------------------------+-----------------+------
 citus.all_modifications_commutative                | off             | 
 citus.background_task_queue_interval               | 5000            | ms
 citus.cluster_name                                 | default         | 
 citus.coordinator_aggregation_strategy             | row-gather      | 
 citus.count_distinct_error_rate                    | 0               | 
 citus.cpu_priority                                 | 0               | 
 citus.cpu_priority_for_logical_replication_senders | inherit         | 
 citus.defer_drop_after_shard_move                  | on              | 
 citus.defer_drop_after_shard_split                 | on              | 
 citus.defer_shard_delete_interval                  | 15000           | ms
 citus.desired_percent_disk_available_after_move    | 10              | 
 citus.distributed_deadlock_detection_factor        | 2               | 
 citus.enable_binary_protocol                       | on              | 
 citus.enable_create_role_propagation               | on              | 
 citus.enable_deadlock_prevention                   | on              | 
 citus.enable_local_execution                       | on              | 
 citus.enable_local_reference_table_foreign_keys    | on              | 
 citus.enable_repartition_joins                     | off             | 
 citus.enable_statistics_collection                 | off             | 
 citus.explain_all_tasks                            | off             | 
 citus.explain_analyze_sort_method                  | execution-time  | 
 citus.limit_clause_row_fetch_count                 | -1              | 
 citus.local_hostname                               | localhost       | 
 citus.local_shared_pool_size                       | 50              | 
 citus.local_table_join_policy                      | auto            | 
 citus.log_remote_commands                          | off             | 
 citus.max_adaptive_executor_pool_size              | 16              | 
 citus.max_cached_connection_lifetime               | 600000          | ms
 citus.max_cached_conns_per_worker                  | 1               | 
 citus.max_client_connections                       | -1              | 
 citus.max_high_priority_background_processes       | 2               | 
 citus.max_intermediate_result_size                 | 1048576         | kB
 citus.max_matview_size_to_auto_recreate            | 1024            | MB
 citus.max_shared_pool_size                         | 100             | 
 citus.max_worker_nodes_tracked                     | 2048            | 
 citus.multi_shard_modify_mode                      | parallel        | 
 citus.multi_task_query_log_level                   | off             | 
 citus.node_connection_timeout                      | 30000           | ms
 citus.node_conninfo                                | sslmode=require | 
 citus.propagate_set_commands                       | none            | 
 citus.recover_2pc_interval                         | 60000           | ms
 citus.remote_task_check_interval                   | 10              | ms
 citus.shard_count                                  | 32              | 
 citus.shard_replication_factor                     | 1               | 
 citus.show_shards_for_app_name_prefixes            |                 | 
 citus.skip_constraint_validation                   | off             | 
 citus.skip_jsonb_validation_in_copy                | on              | 
 citus.stat_statements_track                        | none            | 
 citus.task_assignment_policy                       | greedy          | 
 citus.task_executor_type                           | adaptive        | 
 citus.use_citus_managed_tables                     | off             | 
 citus.use_secondary_nodes                          | never           | 
 citus.values_materialization_threshold             | 100             | 
 citus.version                                      | 11.1.4          | 
 citus.worker_min_messages                          | notice          | 
 citus.writable_standby_coordinator                 | off             |

Step four: Define and configure the cluster

Log into the coordinator in order to declare and configure the cluster:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
    select citus_set_coordinator_host('citus-coord-01', 5432);
    insert into pg_dist_node(nodename)
        values ('citus1')
              ,('citus2')
              ,('citus3')
              ,('citus4');
_eof_

And here’s the cluster’s organization:

db01=# select nodeid,nodename,groupid,isactive from pg_dist_node order by 1;
 nodeid |    nodename    | groupid | isactive 
--------+----------------+---------+----------
      1 | citus-coord-01 |       0 | t
      2 | citus1         |       1 | t
      3 | citus2         |       2 | t
      4 | citus3         |       3 | t
      5 | citus4         |       4 | t

Step five: Create, distribute, and populate a single table

Table myevents is created, and the newly inserted records are evenly distributed across the cluster of nodes.

Log in to the coordinator and execute the following commands. Notice that all DML and SQL statements are executed on the coordinator node:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_

-- create the table
    create table myevents (
        device_id
        bigint,
        event_id
        bigserial,
        event_time
        timestamptz default now(),
        data
        jsonb not null,
        primary key (device_id, event_id)
    );

-- distribute the events among the nodes
    select create_distributed_table('myevents', 'device_id');

-- populate the table
    insert into myevents (device_id, data)
        select s % 100, ('{"measurement":'||random()||'}')::jsonb
        from generate_series(1,1000000) s;
_eof_

Querying the coordinator:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
select *
    from myevents
    where device_id = 1
    order by event_time desc, event_id desc
    limit 10;
_eof_

device_id | event_id |          event_time           |                data                  
----------+----------+-------------------------------+-------------------------------------
        1 |   999901 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2868659956537316}
        1 |   999801 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7931493079697731}
        1 |   999701 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.4875322951757288}
        1 |   999601 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.6491362745752653}
        1 |   999501 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9629266554851366}
        1 |   999401 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.1185674800281864}
        1 |   999301 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.5133762596297742}
        1 |   999201 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7307634886202119}
        1 |   999101 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2997471209159892}
        1 |   999001 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9692520484104021}

Step six: A review of table distributions across the cluster

This demonstrates clearly that table myevents is hash-sharded across every node member:

  • Notice how well-balanced the sharded tables are in size.
  • Notice the shard numbering: consecutively numbered shards are placed on successive nodes in round-robin fashion.

for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
    \dt+
_eof_
done | less -S

==== NODE: citus1 ====
                                        List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description  
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102008 | table | postgres | permanent   | heap          | 2832 kB |  
public | myevents_102012 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102016 | table | postgres | permanent   | heap          | 5624 kB |  
public | myevents_102020 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102024 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102028 | table | postgres | permanent   | heap          | 5632 kB |  
public | myevents_102032 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102036 | table | postgres | permanent   | heap          | 6560 kB |  

==== NODE: citus2 ====
                                        List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description  
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102009 | table | postgres | permanent   | heap          | 4696 kB |  
public | myevents_102013 | table | postgres | permanent   | heap          | 976 kB  |  
public | myevents_102017 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102021 | table | postgres | permanent   | heap          | 3768 kB |  
public | myevents_102025 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102029 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102033 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102037 | table | postgres | permanent   | heap          | 2840 kB |  

==== NODE: citus3 ====
                                         List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |    Size    | Description  
-------+-----------------+-------+----------+-------------+---------------+------------+-------------
public | myevents_102010 | table | postgres | permanent   | heap          | 1904 kB    |  
public | myevents_102014 | table | postgres | permanent   | heap          | 1904 kB    |  
public | myevents_102018 | table | postgres | permanent   | heap          | 1904 kB    |  
public | myevents_102022 | table | postgres | permanent   | heap          | 4696 kB    |  
public | myevents_102026 | table | postgres | permanent   | heap          | 8192 bytes |  
public | myevents_102030 | table | postgres | permanent   | heap          | 2832 kB    |  
public | myevents_102034 | table | postgres | permanent   | heap          | 976 kB     |  
public | myevents_102038 | table | postgres | permanent   | heap          | 4696 kB    |  

==== NODE: citus4 ====
                                        List of relations
Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description  
-------+-----------------+-------+----------+-------------+---------------+---------+-------------
public | myevents_102011 | table | postgres | permanent   | heap          | 5632 kB |  
public | myevents_102015 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102019 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102023 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102027 | table | postgres | permanent   | heap          | 2840 kB |  
public | myevents_102031 | table | postgres | permanent   | heap          | 2832 kB |  
public | myevents_102035 | table | postgres | permanent   | heap          | 1904 kB |  
public | myevents_102039 | table | postgres | permanent   | heap          | 4696 kB |
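
You don't have to log into each worker to see this layout. As a sketch (assuming the citus_shards metadata view shipped with recent Citus releases), the same information can be pulled directly from the coordinator:

-- run on the coordinator (db01): list every shard and the node hosting it
select table_name, shardid, nodename, nodeport, shard_size
    from citus_shards
    order by nodename, shardid;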

Part B: Redundancy demonstration

Method

  • Step 1: Update shard replication factor from 1X to 2X
  • Step 2: Create table myevents2x with 2X redundancy
  • Step 3: Identify shard myevents2x_102040 across citus1 and citus2
  • Step 4: Identify some records to query from shards known to be on nodes citus1 and citus2
  • Step 5: Test
    • Shut down citus1; perform the identified query
    • Start citus1 and shut down citus2; perform the identified query
    • Restart citus2; perform the identified query

ATTENTION: Please note that you may have to edit your own queries as the values may be different for your setup.

Step one:

Update shard replication factor from 1X to 2X:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
    show citus.shard_replication_factor;
    alter system set citus.shard_replication_factor=2;
    select pg_reload_conf();
_eof_

# validate
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_
    show citus.shard_replication_factor;
_eof_

citus.shard_replication_factor  
--------------------------------
2

Step two:

Table myevents2x is created and populated with a redundancy of 2X across the cluster:

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
-- create a new table with 2X redundancy
    create table myevents2x (
        device_id bigint,
        event_id bigserial,
        event_time timestamptz default now(),
        data jsonb not null,
        primary key (device_id, event_id)
    );

-- distribute the events among the nodes
    select create_distributed_table('myevents2x', 'device_id');


-- confirm table has been added across the cluster
    select * from master_get_active_worker_nodes() order by 1;

-- populate the table
    insert into myevents2x (device_id, data)
        select s % 100, ('{"measurement":'||random()||'}')::jsonb
        from generate_series(1,1000000) s;
_eof_

Here’s the output:

CREATE TABLE
 create_distributed_table 
--------------------------
 
(1 row)

 node_name | node_port 
-----------+-----------
 citus1    |      5432
 citus2    |      5432
 citus3    |      5432
 citus4    |      5432
(4 rows)

INSERT 0 1000000

Step three:

Locate shard myevents2x_102040, which should be on nodes citus1 and citus2:

for u in citus1 citus2 citus3 citus4
do
echo "==== NODE: $u ===="
lxc exec $u -- su - postgres -c 'psql db01'<<_eof_
    select tablename from pg_tables where tablename~'myevents2x' order by 1
_eof_
done | less -S

Here’s the output:

==== NODE: citus1 ====
     tablename     
-------------------
 myevents2x_102040
 myevents2x_102043
 myevents2x_102044
 myevents2x_102047
 myevents2x_102048
 myevents2x_102051
 myevents2x_102052
 myevents2x_102055
 myevents2x_102056
 myevents2x_102059
 myevents2x_102060
 myevents2x_102063
 myevents2x_102064
 myevents2x_102067
 myevents2x_102068
 myevents2x_102071
(16 rows)

==== NODE: citus2 ====
     tablename     
-------------------
 myevents2x_102040
 myevents2x_102041
 myevents2x_102044
 myevents2x_102045
 myevents2x_102048
 myevents2x_102049
 myevents2x_102052
 myevents2x_102053
 myevents2x_102056
 myevents2x_102057
 myevents2x_102060
 myevents2x_102061
 myevents2x_102064
 myevents2x_102065
 myevents2x_102068
 myevents2x_102069

==== NODE: citus3 ====
     tablename     
-------------------
 myevents2x_102041
 myevents2x_102042
 myevents2x_102045
 myevents2x_102046
 myevents2x_102049
 myevents2x_102050
 myevents2x_102053
 myevents2x_102054
 myevents2x_102057
 myevents2x_102058
 myevents2x_102061
 myevents2x_102062
 myevents2x_102065
 myevents2x_102066
 myevents2x_102069
 myevents2x_102070

==== NODE: citus4 ====
     tablename     
-------------------
 myevents2x_102042
 myevents2x_102043
 myevents2x_102046
 myevents2x_102047
 myevents2x_102050
 myevents2x_102051
 myevents2x_102054
 myevents2x_102055
 myevents2x_102058
 myevents2x_102059
 myevents2x_102062
 myevents2x_102063
 myevents2x_102066
 myevents2x_102067
 myevents2x_102070
 myevents2x_102071
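
Rather than eyeballing the four listings above, the coordinator's metadata can confirm the 2X redundancy directly. Here's a sketch that queries the Citus catalogs pg_dist_shard and pg_dist_placement; every myevents2x shard should report two placements:

-- run on the coordinator (db01)
select s.shardid, count(p.placementid) as placements
    from pg_dist_shard s
    join pg_dist_placement p using (shardid)
    where s.logicalrelid = 'myevents2x'::regclass
    group by s.shardid
    order by s.shardid;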

Step four:

Locate and return the first three records of shard myevents2x_102040 on nodes citus1 and citus2:

lxc exec citus1 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    \qecho ==== citus1 ====
    \qecho select * from myevents2x_102040 order by 1,2 limit 3
    select * from myevents2x_102040 order by 1,2 limit 3;
    \c 'host=citus2 dbname=db01 user=postgres'
    \qecho ==== citus2 ====
    \qecho select * from myevents2x_102040 order by 1,2 limit 3
    select * from myevents2x_102040 order by 1,2 limit 3;
    \c 'host=citus-coord-01 dbname=db01 user=postgres'
    \qecho ==== coordinator ====
    \qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output:

==== citus1 ====
select * from myevents2x_102040 order by 1,2 limit 3
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}


You are now connected to database "db01" as user "postgres" on host "citus2" (address "fd42:cb6a:5384:9a60:216:3eff:fe38>
==== citus2 ====
select * from myevents2x_102040 order by 1,2 limit 3
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}


You are now connected to database "db01" as user "postgres" on host "citus-coord-01" (address "fd42:cb6a:5384:9a60:216:3>
==== coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Step five:

The next few steps demonstrate our ability to continuously query and return those records found in shard myevents2x_102040.

Step 5a: Test, shutdown node citus1

lxc stop citus1

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    qecho "==== citus1 is shutdown ===="
    qecho ==== querying coordinator ====
    qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output:

"==== citus1 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Step 5b: Test, restart citus1, and shutdown citus2

lxc start citus1
lxc stop citus2

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    qecho "==== citus2 is shutdown ===="
    qecho ==== querying coordinator ====
    qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output; note that it’s exactly the same as the previous test:

"==== citus2 is shutdown ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data                 
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Step 5c: Test, restart node citus2

lxc start citus2

lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
    qecho "==== cluster restored ===="
    qecho ==== querying coordinator ====
    qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_

Here’s the output; with the cluster fully restored, the results are once again identical:

"==== cluster restored ===="
==== querying coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
device_id | event_id |          event_time           |                data                  
----------+----------+-------------------------------+-------------------------------------
        8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
        8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
        8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}

Conclusion

Data redundancy is the hallmark of high availability. But with Citus, we’ve raised the bar. Can you think of a better system that can stay up without losing time initiating failovers when a node fails?

Have fun!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!

Aug
21
2023
--

Configuring PostgreSQL and LDAP Using StartTLS

Configuring PostgreSQL and LDAP Using StartTLS

Effectively working with LDAP as an authentication mechanism for PostgreSQL typically requires extensive knowledge in both domains. While trying to be as complete yet succinct as possible, I’m detailing how to enable TLS between a PostgreSQL server and an OpenLDAP server.

Ironically, the most complicated aspect has nothing to do with either PostgreSQL or OpenLDAP but with the steps of creating and signing private keys and certificates.

Note: I had seriously thought about leaving out much of the OpenLDAP commands, but I figured it might benefit you if you’re like me since I don’t touch this very often.

Summary steps

The underlying assumptions are:

  1. PostgreSQL and LDAP are installed on the same host.
  2. Username accounts, but not passwords, are created on the Postgres server.
  3. Both username accounts and passwords are configured on the LDAP server.
  4. Private/public keys, i.e., certificates, are to be created.
  5. A self-signed Certificate Authority, CA, exists on the host and has signed the aforementioned Postgres and LDAP certificates.

The environment

The Proof Of Concept described in this document consists of a single stand-alone server:

  • Linux OS, Ubuntu 18.04
  • Host name: my-ldap
  • PostgreSQL version 14
  • OpenLDAP version 2.4.45
    • Distinguished Name: “cn=admin,dc=nodomain”
    • password: admin
    • topmost domain: “dn: dc=pg_user,dc=nodomain”
    • superuser, postgres: “dn: cn=postgres,dc=pg_user,dc=nodomain”
    • user1: “dn: cn=user1,dc=pg_user,dc=nodomain”
    • user2: “dn: cn=user2,dc=pg_user,dc=nodomain”
  • user accounts:
    • postgres (password postgres)

Installation PostgreSQL

Apart from the standard steps of installing and configuring Postgres for remote access, edit the host-based authentication file enabling Postgres to refer to the LDAP service for authentication.

ROLES, i.e., the user accounts used in Postgres, should be declared. Keep in mind that assigning passwords is NOT required:

-- Example
create role user1 with login password null;
create role user2 with login password null;
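
As a quick sanity check, here's a sketch (querying the pg_authid catalog as a superuser) confirming that the roles can log in but carry no local password, since authentication is delegated to LDAP:

select rolname, rolcanlogin, rolpassword is null as no_local_password
from pg_authid
where rolname in ('user1', 'user2');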

pg_hba.conf assumptions:

  • The LDAP URL is 127.0.0.1, localhost, or the hostname URL if desired.
  • Rules for both IPv4 and IPv6 are declared.
  • NO encryption is enabled between the Postgres and LDAP services.
# IPv4 local connections:
host all all 0.0.0.0/0  ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"

# IPv6 local connections:
host all all ::0/0 ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"

Installation LDAP

apt-get install -y slapd ldap-utils

Running netstat (netstat -tlnp) returns the following:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:389             0.0.0.0:*               LISTEN      7742/slapd          
tcp6       0      0 :::389                  :::*                    LISTEN      7742/slapd

Command line utilities

You can control the behavior of OpenLDAP by using these command line utilities:

  • ldapmodify
  • ldapadd
  • ldapdelete
  • ldapmodrdn
  • ldapsearch
  • ldapwhoami

Setting the administrator password

Administering the LDAP server requires setting the administrator password. Although the LDAP installation already sets this password (admin, in our case), one can reset it at will by executing the following command:

# Select "No" when asked to configure the database with dbconfig-common.
# Set the domain name for your LDAP server, for example, "example.com".
# Set the organization name for your LDAP server, for example, "Example Inc".
# Set the administrator password for your LDAP server.

dpkg-reconfigure slapd

Configuration

The following bash script demonstrates configuring OpenLDAP to authenticate three Postgres ROLES, i.e., postgres, user1, and user2:

#!/bin/bash
set -e
##########################################
#
# admin password is "admin"
# the top domain is assigned as "nodomain"
#
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# create the topmost domain
dn: dc=pg_user,dc=nodomain
objectClass: dcObject
objectClass: organization
dc: pg_user
o: Postgres Users
description: all postgres users reside here

# create SUPERUSER ROLE postgres
dn: cn=postgres,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: postgres
sn: postgres
userPassword: postgres
_eof_

#
##########################################
# ADD ROLES

ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# creating other user accounts, down the road
# create role user1
dn: cn=user1,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user1
sn: user1
userPassword: user1

# create role user2
dn: cn=user2,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user2
sn: user2
userPassword: user2
_eof_

Test connectivity without TLS

A simple login confirms LDAP and PostgreSQL are working correctly. Even though there is an encrypted session between psql and the Postgres server, there is no encrypted session between Postgres and LDAP while authentication is performed:

SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=postgres password=postgres' -c "select 'ping' as test_connectivity"  
test_connectivity  
-------------------
ping

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user1 password=user1' -c "select 'ping from user1' as test_connectivity"                 
test_connectivity  
-------------------
ping from user1

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user2 password=user2' -c "select 'ping from user2' as test_connectivity"    
test_connectivity  
-------------------
ping from user2

Working With TLS Encryption

Install Additional Packages

To work with SSL certificates, these packages should be present, i.e., for the Ubuntu distribution:

apt install -y gnutls-bin ssl-cert

Managing the Certificate Authority

Authentication between the Postgres and LDAP servers requires verifying that the hosts making connection attempts are, in fact, legitimate. For that reason, the certificates for both servers must be signed, i.e., a Certificate Authority mechanism is required.

#
# Generate private key for self-signed Certificate Authority
#
certtool --generate-privkey --bits 4096 --outfile /etc/ssl/private/mycakey.pem

In this case, the CA certificate is configured to expire in ten years:

#
# Define CA certificate attributes
#
echo "cn = my-ldap
ca
cert_signing_key
expiration_days = 3650" > /etc/ssl/ca.info

An internal system can get away with using self-signed CA certificates; otherwise, it is strongly recommended that your certificates be signed by an authorized CA. In this case, the certificate, once signed, is placed in the same directory where the other CA certificates are stored, i.e., /usr/local/share/ca-certificates:

#
# Generate a self-signed CA certificate and
#   copy the CRT to other trusted CA certificates that includes
#   both postgres and ldap servers (/usr/local/share/ca-certificates/)
#
certtool --generate-self-signed \
--load-privkey /etc/ssl/private/mycakey.pem \
--template /etc/ssl/ca.info \
--outfile /usr/local/share/ca-certificates/mycacert.crt

ATTENTION: Once signed, the CA certificate is copied onto both the Postgres and LDAP servers. In this case, as they are on the same host, it is located on host my-ldap. Otherwise, it MUST be copied to all Postgres and LDAP hosts.

Once copied into the correct directory, the list of CA certificates is updated, adding the self-signed CA certificate:

#
# Update list of CA certificate on both
#   postgres and LDAP servers
#
update-ca-certificates

Generating the LDAP public/private key

From here on, it is important to include the fully qualified domain name of the LDAP certificate hostname, i.e., my-ldap.

Generate a private key for LDAP server:

certtool --generate-privkey \
--bits 2048 \
--outfile /etc/ldap/my-ldap_slapd_key.pem

Define LDAP certificate attributes:

#   for a certificate request which expires
#   in one year
#
echo "organization = mycompany
cn = my-ldap
tls_www_server
encryption_key
signing_key
expiration_days = 365" > /etc/ssl/my-ldap.info

Sign the LDAP private key using the self-signed Certificate Authority certificate and its private key:

certtool --generate-certificate \
--load-privkey /etc/ldap/my-ldap_slapd_key.pem \
--load-ca-certificate /etc/ssl/certs/mycacert.pem \
--load-ca-privkey /etc/ssl/private/mycakey.pem \
--template /etc/ssl/my-ldap.info \
--outfile /etc/ldap/my-ldap_slapd_cert.pem

Update access permissions of the private key:

sudo chgrp openldap /etc/ldap/my-ldap_slapd_key.pem
sudo chmod 0640 /etc/ldap/my-ldap_slapd_key.pem

Create and save the LDAP TLS configuration file, certinfo.ldif:

echo "dn: cn=config
add: olcTLSCACertificateFile
olcTLSCACertificateFile: /etc/ssl/certs/mycacert.pem
-
add: olcTLSCertificateFile
olcTLSCertificateFile: /etc/ldap/my-ldap_slapd_cert.pem
-
add: olcTLSCertificateKeyFile
olcTLSCertificateKeyFile: /etc/ldap/my-ldap_slapd_key.pem" > /etc/ldap/schema/certinfo.ldif

Enable OpenLDAP to use TLS:

ldapmodify -Y EXTERNAL -H ldapi:/// -f /etc/ldap/schema/certinfo.ldif

Validation test #1

A successful validation returns the string “anonymous”:

ldapwhoami -x -ZZ -H ldap://my-ldap

Validation test #2

This test confirms that the previous behavior without TLS still works:

# without TLS
ldapsearch -x -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin

This test should return the same output as the previous ldapsearch. A failure produces a short error message because the “-ZZ” switch requires the StartTLS operation to succeed.

# with TLS
ldapsearch -x -ZZ -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin

Working with PostgreSQL using TLS

Configuration

These are the host-based rules with TLS configured; notice the ldaptls=1 parameter appended to each rule:

# IPv4 local connections:
hostssl all all 0.0.0.0/0 ldap ldapserver=my-ldap ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1

# IPv6 local connections:
host all all ::0/0 ldap ldapserver=my-ldap ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1

Validation test: TLS

After updating pg_hba.conf with the new argument, ldaptls=1, a server reload is executed, followed by a psql ping:

systemctl reload postgresql@14-main
root@my-ldap:~# psql 'host=10.231.38.243 dbname=postgres user=postgres password=postgres' -c "select 'ping from postgres' as test_connectivity"
test_connectivity   
--------------------
ping from postgres

 

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Jun
05
2023
--

Configuring PgBouncer for Multi-Port Access

PgBouncer for Multi-Port Access

From time to time, situations occur where unusual circumstances dictate out-of-the-box thinking.

For example, suppose you have a system where you’ve installed multiple data clusters onto a single host. What you end up with is a host carving up and sharing valuable system resources, i.e., CPU, RAM, disk, etc., between multiple instances of PostgreSQL data clusters. While easy to do in a development environment, it does represent technical debt when it reaches production. Sooner or later, one must address this issue; otherwise, you can suffer the consequences of handicapping your entire database infrastructure.

Let’s now move forward in time: your system has scaled, of course, and this shortcut of using multiple data clusters on a single host has now become a performance bottleneck. The problem is you either don’t or can’t refactor your application servers; something, maybe, about not having enough time in the day. And, as you may already know, while Postgres can listen on a UNIX domain socket as well as IPv4 and IPv6 addresses, it is nevertheless constrained to just the one port.

So what do you do?

For the experienced sysadmin, there are actually quite a number of “magical” techniques. However, in this case, with a little help from systemd, which manages all service processes, we will solve this little puzzle using PgBouncer with a concoction of configuration files.

Scenario

Configure the system such that Postgres resides on its default port of 5432 and PgBouncer sits on three ports, i.e., 6432, 6433, and 6434, accessing the resident Postgres server.

The PgBouncer connection pooler will use an administrative account, the ROLE PgBouncer, for the purpose of user authentication. Authentication is to be achieved by accessing the Postgres server’s pg_shadow table and comparing the resultant hash to all incoming connections (this won’t work for cloud setups such as, for example, Amazon RDS).

A set of Systemd configuration files will be created and edited in order to manage the PgBouncer service.

About the files

Below is a summary of the files and how they will be edited. Remember, these configuration files are of a hypothetical nature using minimal settings, which, of course, will need to be updated to match a realistic production environment.

├── pgbouncer
│   ├── pgbouncer.ini
│   └── userlist.txt
├── postgres
│   ├── add_function.sql
│   ├── add_user.sql
└── systemd
    ├── pgbouncer_override
    └── pgbouncer.socket

pgbouncer.ini

This configuration file defines all behavior and is installed in its default location, “/etc/pgbouncer”. 

Only one domain socket is used. For our purposes, the listen_addr and listen_port runtime parameters are just noise; they are superseded by the socket declarations in the systemd unit file pgbouncer.socket.

TIP: Backup the original pgbouncer.ini as it references ALL runtime parameters.

[databases]
* = host=localhost

[users]
# left blank 

[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

;; these parameters are implicitly disabled
listen_addr = 0.0.0.0
listen_port = 6432

unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer
auth_query = SELECT p_user, p_password FROM public.lookup($1)

pool_mode = session

;; Use <appname - host> as application_name on server.
application_name_add_host = 1

userlist.txt

Contains a single user account and its password hash for authentication. Note that the md5 hash below corresponds to the password pgbouncer.

"pgbouncer" "md5be5544d3807b54dd0637f2439ecb03b9"

add_user.sql

Adds the ROLE “pgbouncer” to the Postgres data cluster. While under normal circumstances this is not required, PgBouncer uses this ROLE in order to validate all logins.

For our purposes, the password is “pgbouncer”:

--
-- EXECUTE AS SUPERUSER postgres
--
CREATE ROLE pgbouncer WITH LOGIN PASSWORD 'pgbouncer';
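
Incidentally, the entry in userlist.txt above can be regenerated from any psql session; the legacy md5 format is the literal string md5 followed by md5(password || username). A minimal sketch, which should reproduce the hash shown earlier given this password:

-- 'pgbouncer' happens to be both the username and the password in this example
select 'md5' || md5('pgbouncer' || 'pgbouncer') as userlist_entry;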

add_function.sql

This is a user-defined function that ROLE PgBouncer executes in order to obtain the password hash from pg_shadow.

This SQL statement must be executed against each and every database that is to be accessed by any ROLE connection using PgBouncer.

TIP: Execute this SQL against database template1, and the function call will be included with every newly created database thereafter.

--
-- EXECUTE AS SUPERUSER postgres
--
-- execute on each database user accounts will login
--
CREATE FUNCTION public.lookup (
   INOUT p_user     name,
   OUT   p_password text
) RETURNS record
   LANGUAGE sql SECURITY DEFINER SET search_path = pg_catalog AS
$$SELECT usename, passwd FROM pg_shadow WHERE usename = p_user$$;

-- make sure only "pgbouncer" can use the function
REVOKE EXECUTE ON FUNCTION public.lookup(name) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.lookup(name) TO pgbouncer;
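
As a quick sanity check, the function can be exercised directly to confirm it returns the stored hash for an existing role; this sketch uses the superuser postgres, but any role present in pg_shadow will do:

select p_user, p_password from public.lookup('postgres');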

pgbouncer_override

This systemd drop-in file overrides key options in the default PgBouncer unit file and will never be overwritten even after updating the PgBouncer Linux RPM/DEB packages.

As root, execute the following command and paste the contents of the provided file pgbouncer_override:

systemctl edit pgbouncer

#
# systemctl edit pgbouncer
# systemctl daemon-reload
#

[Unit]
Requires=pgbouncer.socket

pgbouncer.socket

This is the secret sauce; this file enables PgBouncer to listen on all three ports, 6432, 6433, and 6434. You will note that adding or removing ports is a simple matter of adding or removing addresses as per the format shown in the file below.

As root: create this file and perform a daemon reload:

vi /etc/systemd/system/pgbouncer.socket

#
# vi /etc/systemd/system/pgbouncer.socket
# systemctl daemon-reload
#

[Unit]
Description=sockets for PgBouncer
PartOf=pgbouncer.service

[Socket]
ListenStream=6432
ListenStream=6433
ListenStream=6434
ListenStream=/var/run/postgresql/.s.PGSQL.6432

ReusePort=true
RemoveOnStop=true

[Install]
WantedBy=sockets.target

systemctl daemon-reload

Putting it all together

After all the configuration files have been created and edited, this is what we get:

systemctl restart pgbouncer

netstat -tlnp

Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.53:53           0.0.0.0:*               LISTEN     
tcp6       0      0 :::22                   :::*                    LISTEN     
tcp6       0      0 ::1:5432                :::*                    LISTEN     
tcp6       0      0 :::6432                 :::*                    LISTEN     
tcp6       0      0 :::6433                 :::*                    LISTEN     
tcp6       0      0 :::6434                 :::*                    LISTEN

Finally, perform the following:

  1. sudo as Postgres
  2. create a database db01
  3. create a ROLE usr1 accessing database db01
  4. update pg_hba.conf and postgresql.conf allowing localhost connections by PgBouncer

Now validate the ports:

for u in 5432 6432 6433 6434
do
    echo "==== port: $u ==="
    export PGHOST=localhost PGPORT=$u PGDATABASE=db01
    psql "user=usr1 password=usr1" -c "select 'hello world' as greetings"
done

And here’s the output:

==== port: 5432 ===
 greetings   
-------------
hello world
(1 row)

==== port: 6432 ===
 greetings   
-------------
hello world
(1 row)

==== port: 6433 ===
 greetings   
-------------
hello world
(1 row)

==== port: 6434 ===
 greetings   
-------------
hello world
(1 row)

Conclusion

Because PostgreSQL was designed from the ground up to work in conjunction with the OS, i.e., a UNIX type of operating system, we’ve been able to solve an interesting problem using a novel approach. And while systemd is ubiquitous, it isn’t normally considered part of a solution to a Postgres puzzle. We call this hacking 🙂 

Happy Travels!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Feb
02
2023
--

An Argument for Logical Failover Slots

Logical Failover Slots PostgreSQL

These days, the typical PostgreSQL cluster is not only a multi-node replication cluster, which is ordinarily an asynchronous streaming replication model, but can sometimes include a logical replication component too.

 


Recall that logical replication works by using the PUB/SUB mode, where individual tables are published and are then subscribed by remotely connected databases/tables. DML operations, such as INSERT, UPDATE, DELETE, TRUNCATE, etc., are then replicated from one or more publishers to one or more subscribers.

For the most part it’s a pretty straightforward operation. The only real limitation is that one can only publish tables from a read-write server such as the PRIMARY.

 

There’s a problem, however, when it comes to executing failovers. While a STANDBY/REPLICA host can quickly take over with a simple select pg_promote() SQL statement, the same cannot be said regarding logical replication. Instead, failing over is, in fact, not a single action but a collection of discrete steps that must be performed in an exact and precise manner.

Consider the following steps typifying a failover:

  1. Promote the REPLICA, thus becoming the new PRIMARY.
  2. When present, redirect other REPLICAs pointing to the new PRIMARY.
  3. Failover the logical replication slot:
    1. Block all logical replication from the failed PRIMARY.
    2. Create a logical replication slot on the newly promoted PRIMARY.
    3. Alter the subscription and point to the newly promoted PRIMARY.
    4. Resume/restore logical replication between publisher and subscriber.

The challenge, of course, is reconstituting the logical replication process as quickly as possible, ideally without any data loss.
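
For reference, the bare-bones commands behind step three look something like the following minimal sketch; the names sub1 (the subscription, whose replication slot defaults to the same name) and pg2 (the newly promoted PRIMARY) are hypothetical:

-- on the newly promoted PRIMARY: recreate the logical replication slot
-- (it starts at the current WAL position, which is exactly where data loss can creep in)
select pg_create_logical_replication_slot('sub1', 'pgoutput');

-- on the SUBSCRIBER: repoint the subscription at the new PRIMARY and resume
alter subscription sub1 disable;
alter subscription sub1 connection 'host=pg2 dbname=db01 user=replicator';
alter subscription sub1 enable;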

Two common solutions come to mind:

  1. Block all incoming connections before promoting the STANDBY while simultaneously installing the logical replication slot.
  2. Promote the STANDBY and then flush and recopy the entire contents of the subscribed table(s).

In both cases, these solutions make a fast recovery problematic, not to mention the additional effort required by the SRE/DBA/DEV preparing the requisite instructions.


In an ideal world, the perfect implementation of a logical slot failover would have one already in place on the STANDBY, thus guaranteeing data consistency without fear of data loss, no matter how long the actual promotion or the altering of the subscription's connection parameters may take.

The good news is that there are hacks that can speed up the entire recovery process, although they all have their own quirky limitations. Patroni, for example, implements one such technique by copying the file named state, which is located in the data cluster's subdirectory pg_replslot, over to the REPLICA's data cluster in the same location.

# PGDATA on an Ubuntu install
/var/lib/postgresql/15/main/pg_replslot/pg3
└── state

However, the caveat is that it requires a full server restart of the REPLICA, in addition to the promotion, before the logical slot is fully active. And, of course, timing is critical.


So there you have it.

Too bad such a feature doesn’t exist… or does it?

 

 

 

 

Stay tuned, folks; the game is afoot.

For the curious, I’ve included some references in past blogs, courtesy of Jobin:

And here’s another set of references: a valiant attempt to incorporate this feature into PostgreSQL a few years ago.

 

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open source community in a single distribution, designed and tested to work together.

Download Percona Distribution for PostgreSQL Today!

 

Dec
23
2022
--

Diffing PostgreSQL Schema Changes

Diffing PostgreSQL Schema Changes

One of the routine operations when administering PostgreSQL is periodic updates to the database system’s architecture. PostgreSQL does a good job of allowing one to update a schema, add types, functions, triggers, or alter a table by adding and removing columns and updating column data types, etc., in a reliable manner. However, there is no built-in mechanism to help identify the differences, let alone generate the necessary SQL, to accomplish updates in an easy manner from the development to the production environment.

So let’s talk about possible approaches to schema changes.

Using logical dump manifests

The easiest way to identify changes between schemas from one database to another is to compare schema dump manifests.

The following example demonstrates an approach one can take when looking for differences between schemas on different databases:

EXAMPLE:

-- create database schemas
create database db01;
create database db02;


-- db01: version 1
create table t1 (
    c1 int,
    c2 text,
    c4 date
);

create table t2(
    c1 int,
    c2 varchar(3),
    c3 timestamp,
    c4 date
);
-- db02: version 2
create table t1 (
    c1 serial primary key,
    c2 varchar(256),
    c3 date default now()
);

create table t2(
    c1 serial primary key,
    c2 varchar(3),
    c3 varchar(50),
    c4 timestamp with time zone default now(),
    c5 int references t1(c1)
);

create index on t2 (c5);

 

# generate schema dumps
pg_dump -s db01 -Fc > db01.db
pg_dump -s db02 -Fc > db02.db
# generate manifests
pg_restore -l db01.db > db01_manifest.ini
pg_restore -l db02.db > db02_manifest.ini

This snippet demonstrates looking for differences by comparing the md5 checksums:

# EX 1: generate checksums
md5sum \
<(tail -n +16 db01_manifest.ini | cut -d ' ' -f 4-) \
<(tail -n +16 db02_manifest.ini | cut -d ' ' -f 4-)

# output
$ 9d76c028259f2d8bed966308c256943e  /dev/fd/63
$ ba124f9410ea623085c237dc4398388a  /dev/fd/62

This next snippet diffs the two manifests, identifying only those objects and attributes that have changed. Notice that the redundant header information at the top of each manifest is skipped:

# EX 2: perform diff
diff \
<(tail -n +16 db01_manifest.ini | cut -d ' ' -f 4-) \
<(tail -n +16 db02_manifest.ini | cut -d ' ' -f 4-)

 

This resultant diff shows the changes made between the two schemas:

1a2,3
> SEQUENCE public t1_c1_seq postgres
> SEQUENCE OWNED BY public t1_c1_seq postgres
2a5,12
> SEQUENCE public t2_c1_seq postgres
> SEQUENCE OWNED BY public t2_c1_seq postgres
> DEFAULT public t1 c1 postgres
> DEFAULT public t2 c1 postgres
> CONSTRAINT public t1 t1_pkey postgres
> CONSTRAINT public t2 t2_pkey postgres
> INDEX public t2_c5_idx postgres
> FK CONSTRAINT public t2 t2_c5_fkey postgres

The good news is that there are a number of existing tools that can reconcile differences between a proposed schema design and the target schema:

  1. Commercial offerings can diff schemas between databases in an elegant and efficient manner; a quick web search yields the most popular technologies one can use.
  2. In regards to open source solutions, there are a number of projects capable of diffing Postgres database schemas.

Working with the apgdiff utility

The following is an example implementation of the open source tool apgdiff.

Apgdiff can be found in the Postgres community repository. It compares two schema dump files and creates an SQL output file that is, for the most part, suitable for upgrades of old schemata:

Package: apgdiff

Version: 2.7.0-1.pgdg18.04+1
Architecture: all
Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org>
Installed-Size: 173
Depends: default-jre-headless | java2-runtime-headless
Homepage: https://www.apgdiff.com/
Priority: optional
Section: database
Filename: pool/main/a/apgdiff/apgdiff_2.7.0-1.pgdg18.04+1_all.deb
Size: 154800
SHA256: 9a83fcf54aed00e1a28c3d00eabe1c166977af1e26e91035e15f88b5215b181b
SHA1: ea713acb55898f07374dadd1bebb09ec2fa4b589
MD5sum: e70a97903cb23b8df8a887da4c54e945

The following example demonstrates how one can update differences between the development environment and the production database schema using apgdiff.

EXAMPLE:

apt install -y apgdiff
# EX 1: dump as SQL statements
pg_dump -s db01 -Fp > db01.sql
pg_dump -s db02 -Fp > db02.sql
createdb db03 --template=db01

apgdiff --ignore-start-with db01.sql db02.sql > db01-db02.sql

# “psql -1” encapsulates statements within a transaction
psql -1 -f db01-db02.sql db03

 

# EX 2: uses logical dumps
# notice the dumps are standard logical dumps and include data
pg_dump db01 -Fc > db01.db
pg_dump db02 -Fc > db02.db
createdb db03 --template=db01

# this invocation assumes the resultant diff doesn’t require editing
apgdiff --ignore-start-with \
    <(pg_restore -s -f - db01.db) \
    <(pg_restore -s -f - db02.db) \
    | psql -1 db03

There’s more you can accomplish with these simple approaches. By incorporating variations of these, one can create fairly sophisticated shell scripts with little code and, with a little luck, not that much effort.

Dec
14
2022
--

Powering PostgreSQL 15 With Columnar Tables

Powering PostgreSQL 15 With Columnar Tables

Prologue

This blog is the result of my investigation into column-wise tables. Hopefully, you will find it useful in your deliberations.

When I started writing, it was meant to be a “Good News” blurb. I was pretty optimistic that I’d truly have impressive numbers to share. But as it turns out, while there is potential for significant advantage, a detailed testing regime of the CITUS columnar store extension should be carried out before implementing it in your own environment.

The missing sizzle

Sizzle: A phrase used to show affirmation or approval. It references the sound that one hears when cooking certain delicious foods i.e. bacon or fajitas and therefore transfers those feelings to non-edible objects or ideas.

There’s a lot to take in and I’m often impressed with every new version of PostgreSQL that is released. Nevertheless, there’s been a long-standing feature that, to be honest, I’ve always felt was missing. This weakness is being addressed by the industry. Ironically, some of these implementations are in fact Postgres derivatives.

What is this sizzling feature that I speak of you ask?

Column-wise tables!

When to consider column-wise tables

Consider using the CITUS Columnar Extension under one or more of the following conditions:

  • In regards to table attributes:
    • they are wide with “many” columns
    • size should at a minimum be in the tens of GB
    • the typical row is byte “heavy”
  • System space is at a premium i.e. it’s worth your while to manage space utilization as much as possible.
  • OLAP is a major component of overall activity i.e. lots of different kinds of SELECT statements.
  • INSERT performance is not a priority.
  • Indexes are not feasible 
  • Creating EXPRESSION indexes on columnar tables is faster by orders of magnitude than on heap tables.

CAVEAT: You cannot perform UPDATE, DELETE operations on a columnar table.
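
Once the extension (installed below) is in place, here's a minimal sketch illustrating that caveat; the exact error text varies by citus_columnar version, but the UPDATE is expected to be rejected:

create table demo_col (id int, val text) using columnar;
insert into demo_col values (1, 'a');
update demo_col set val = 'b' where id = 1;  -- fails: UPDATE/DELETE are unsupported on columnar storage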

OLTP vs OLAP

Let’s get back to basics. In the database world there are essentially two types of database operations:

    1. Online Transaction Processing, OLTP: Online transaction processing applications have high throughput and are insert- or update-intensive in database management. These applications are used concurrently by hundreds of users. The key goals of OLTP applications are availability, speed, concurrency, and recoverability.

    2. Online analytical processing, OLAP: Online analytical processing applications enable users to analyze multidimensional data interactively from multiple perspectives using three basic analytical operations: 
      1. consolidation (roll-up)
      2. drill-down
      3. slicing and dicing

About columnar tables

As an RDBMS, PostgreSQL is geared more toward OLTP than OLAP operations. Manipulating data by inserting, updating, deleting, etc., it does well. But while eminently capable of performing OLAP, it’s not quite as efficient. The primary reason is actually a pretty common characteristic among most RDBMSes, i.e., they store and process collections of data as individual records, otherwise known as row-wise processing. 

Suppose you are performing a deep-dive analysis involving a table of records containing 50 columns. And further, suppose that your query only needs a single column. It still needs to read all of the data from those 50 columns per row in order to access that single column. And if the query processes 100 million rows, that’s definitely not trivial!

Now let’s consider a reorientation of the table i.e. column-wise. In this case, the query only reads one column and not 50. The result is that the query is now much lighter, requiring less IO and processing power yet achieving the same result in significantly less time.

Introducing the CITUS columnar extension

As a general-purpose database management system, Postgres allows basic behaviors to be reconfigured for different purposes. And as such, it is possible to enhance PostgreSQL with columnar tables by using the CITUS columnar table extension.

The CITUS columnar extension is just one part of a larger set of capabilities that, when fully implemented, creates a fully scalable, distributed Postgres database system.

The CITUS columnar extension feature set includes:

  • Highly compressed tables:
    • Reduces storage requirements
    • Reduces the IO needed to scan the table
  • Projection Pushdown:  Only those column(s) within the SELECT are returned further reducing IO.
  • Chunk Group Filtering: Allows queries to skip over whole groups of unrelated data without processing them.

The complete CITUS feature set which, except for the Columnar storage component, is not covered in this blog, includes:

  • Distributed tables
  • References tables
  • Distributed query engine that routes and parallelizes SELECT, DML, and other operations
  • Columnar storage
    • compresses data
    • speeds up scans, 
    • supports fast projections
  • Query from any node

Getting it

This is a bit of a bad-news, good-news, excellent-news situation.

BAD-NEWS: At the time of writing this blog, the only publicly available packages on the Postgres repositories that I could find were the cstore foreign data wrappers which cover only version 12 and older versions of Postgres.

GOOD-NEWS: The CITUS extension is available on the Citus Data download page.

EXCELLENT-NEWS: As part of the process of writing this blog, custom DEB and RPM packages were created for PostgreSQL version 15 for CENTOS 7, 8, and Ubuntu 18.04, 20.04, which are available for download from github HERE.

First-time installation

Assuming you opted for the easy way, installing the packages made for this blog:

  • Go to the PostgreSQL download page and configure your packaging system for your Linux distribution.
  • Download the appropriate custom-made columnar package for your version of Linux:
# Ubuntu 18.04
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus-columnar_11.1-1-UBUNTU1804_amd64.deb

# Ubuntu 20.04
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb

# Centos 7
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus_columnar-11.1-1-CENTOS7.x86_64.rpm

# Centos 8
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus_columnar-11.1-1-CENTOS8.x86_64.rpm

 

  • Install the package: note that all the dependent packages, including the Postgres server, will be automatically installed onto your host. Pay attention to CENTOS 7 as it requires the epel-release repository too.
# Ubuntu 18.04
apt update
apt-get install ./postgresql-15-citus-columnar_11.1-1-UBUNTU1804_amd64.deb

# Ubuntu 20.04
apt update
apt-get install ./postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb

# Centos 7
# ATTENTION: the epel package must be installed beforehand!
yum install epel-release
yum install ./postgresql-15-citus_columnar-11.1-1-CENTOS7.x86_64.rpm

# Centos 8
dnf -qy module disable postgresql
dnf install ./postgresql-15-citus_columnar-11.1-1-CENTOS8.x86_64.rpm
dnf check-update
  • Initialize the postgres data cluster and configure it appropriately for your distribution of Linux in order to log into postgres.

 

A working example of columnar tables

What follows here are the results of my analysis. Of course, there’s always more that can be said. Hopefully, this will give you enough of an overview of the possibilities of this extension.

Installing the extension into your database is pretty much standard fare:

-- create your extension in the database:
create database db01;
\c db01
create extension citus_columnar;

Here are two tables, of type HEAP and COLUMNAR, that will be used for the initial investigations. You’ll notice that it took less time to create the regular HEAP accessed table than the columnar table:

drop table if exists t1,t2;

\timing

-- Time: 7628.612 ms (00:07.629)
create table if not exists t1(id,qty) 
using heap 
as 
select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);

-- Time: 15635.768 ms (00:15.636)
create table if not exists t2(id,qty) 
using columnar 
as 
select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);

                Table "public.t1"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           |          | 
 qty    | integer |           |          | 
Access method: heap

                Table "public.t2"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           |          | 
 qty    | integer |           |          | 
Access method: columnar

The number of records is 10 million:

with a(row_wise)    as (select count(*) from t1),
     b(column_wise) as (select count(*) from t2)
select row_wise,column_wise from a,b;

 row_wise | column_wise
----------+-------------
  9900001 |     9900001

This is very cool: the columnar compression does a great job of shrinking the table:

\dt+ t[12]
                                   List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method |  Size  |
--------+------+-------+----------+-------------+---------------+--------+
 public | t1   | table | postgres | permanent   | heap          | 346 MB |
 public | t2   | table | postgres | permanent   | columnar      | 27 MB  |
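
If you're curious about what the columnar access method is doing under the hood, EXPLAIN is a good place to start. This is a sketch; recent citus_columnar builds annotate the plan with a ColumnarScan node plus details such as the projected columns and chunk-group filtering, though the exact wording varies by version:

explain (analyze) select sum(qty) from t2 where id = 5;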

SQL statements, preliminary

Let’s begin by comparing basic administration and SELECT statements of a HEAP vs COLUMNAR table. 

Examining the tabulated results, you'll see that the timings indicate columnar tables at best perform similarly to a HEAP table, and most of the time they take longer to execute the same operations.

Using a psql session, the following SQL statements are executed and examined for performance differences:

-- HEAP TABLE (7.6s)
create table if not exists t1(id,qty) using heap as
select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);

-- COLUMNAR TABLE (15.6s)
create table if not exists t2(id,qty) using columnar as
select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);

-- HEAP TABLE, adding 5 million records (13.7s)
do
$$
begin
    for i in 5.1e6..10e6
    loop
    insert into t1 values((random()*10)::int,(random()*1000)::int);
    end loop;
end
$$;

-- COLUMNAR TABLE, adding 5 million records (18.5s)
do
$$
begin
    for i in 5.1e6..10e6
    loop
    insert into t2 values((random()*10)::int,(random()*1000)::int);
    end loop;
end
$$;

-- HEAP TABLE (4.9s)
create index on t1(id);

-- COLUMNAR TABLE (7.8s)
create index on t2(id);

SQL statement query plans

Part one (10 million records, two column tables)

Using the aforementioned table definitions, the following metrics were generated with the runtime parameter max_parallel_workers_per_gather = 4.

It’s quite evident that, at least for these two tables, there’s no performance benefit of a columnar accessed table over a regular heap accessed one:

-- HEAP TABLE (742.411 ms)
explain analyze select id,qty from t1;

-- COLUMNAR TABLE (914.096 ms)
explain analyze select id,qty from t2;

-- HEAP TABLE (6441.425 ms)
explain analyze select id,qty from t1 order by random();

-- COLUMNAR TABLE (5871.620 ms)
explain analyze select id,qty from t2 order by random();

-- HEAP TABLE (329.562 ms)
explain analyze select sum(qty) from t1;

-- COLUMNAR TABLE (902.614 ms)
explain analyze select sum(qty) from t2;

-- HEAP TABLE (531.525 ms)
explain analyze select id,sum(qty) from t1 group by id;

-- COLUMNAR TABLE (1602.756 ms)
explain analyze select id,sum(qty) from t2 group by id;

Part two (five million records, 100-column tables)

In order to get a better idea of performance differences, a second set of tables was created at a greater scale. This time, the number of records was halved while the number of columns was increased from two to one hundred.

Even though most of the columns are simply copies of one another, the columnar table's resultant compression is remarkable: the on-disk size is reduced by a factor of 752X.
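
The full DDL for these wider tables isn't reproduced here; the following is a hypothetical sketch of how such a 100-column table could be generated dynamically (it assumes text columns sharing a single filler default and the columnar access method supplied by the Citus extension):

do
$$
declare
    cols text;
begin
    -- build columns c3 .. c100, all text with the same filler default
    select string_agg(format('c%s text default %L', i, 'filler text'), ', ')
      into cols
      from generate_series(3, 100) as i;

    -- t4 is the columnar variant; swap "using columnar" for "using heap" to build t3
    execute format(
        'create table if not exists t4 (c1 bigserial primary key, c2 bigint, %s) using columnar',
        cols);
end
$$;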

/* TABLES

                                                  Table "public.t[34]"
 Column |  Type  | Collation | Nullable |                                  Default
--------+--------+-----------+----------+----------------------------------------------------------------------------
 c1     | bigint |           | not null | nextval('t1_c1_seq'::regclass)
 c2     | bigint |           |          |
 c3     | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
 c4     | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
.
.
.
 c98    | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
 c99    | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
 c100   | text   |           |          | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text
Indexes:
    "t1_pkey" PRIMARY KEY, btree (c1)
    "t1_c2_idx" btree (c2)

List of relations
 Schema | Name | Type  |  Owner   | Persistence | Access method | Size  | Description
--------+------+-------+----------+-------------+---------------+-------+-------------
 public | t3   | table | postgres | permanent   | heap          | 67 GB |
 public | t4   | table | postgres | permanent   | columnar      | 89 MB |
*/

Examining the indexes, one sees that the corresponding indexes on the heap and columnar tables are the same size.

List of relations
Schema |     Name      | Type  |  Owner   | Table | Persistence | Access method |  Size  |
-------+---------------+-------+----------+-------+-------------+---------------+--------+
public | t3_c2_idx     | index | postgres | t3    | permanent   | btree         | 105 MB |  
public | t3_c2_idx1    | index | postgres | t3    | permanent   | btree         | 105 MB |  
public | t3_length_idx | index | postgres | t3    | permanent   | btree         | 33 MB  |  
public | t3_pkey       | index | postgres | t3    | permanent   | btree         | 107 MB |  
public | t4_c2_idx     | index | postgres | t4    | permanent   | btree         | 105 MB |  
public | t4_length_idx | index | postgres | t4    | permanent   | btree         | 33 MB  |  
public | t4_pkey       | index | postgres | t4    | permanent   | btree         | 107 MB |

Unlike the first set of query plans, these clearly demonstrate a significant performance improvement for the columnar table.

Out of curiosity, max_parallel_workers_per_gather was also varied, but doing so doesn't appear to have changed much (timings are shown for settings of 4 and 1):

-- HEAP TABLE without index: 9.6s (4 workers) / 8.7s (1 worker)
explain analyze select sum(c2) from t3;

-- COLUMNAR TABLE without index: 590.176 ms (4 workers) / 596.459 ms (1 worker)
explain analyze select sum(c2) from t4;

-- HEAP TABLE: 10.4s (4 workers) / 8.8s (1 worker)
explain analyze select count(c3) from t3;

-- COLUMNAR TABLE: 509.209 ms (4 workers) / 541.452 ms (1 worker)
explain analyze select count(c3) from t4;

-- HEAP TABLE: 1m34s (4 workers) / 1m17s (1 worker)
explain analyze select max(length(c25)) from t3;

-- COLUMNAR TABLE: 1.1s (4 workers) / 1.2s (1 worker)
explain analyze select max(length(c25)) from t4;

-- HEAP TABLE: 1m33s (4 workers) / 1m18s (1 worker)
explain analyze select sum(length(c50)) from t3;

-- COLUMNAR TABLE: 1.2s (4 workers) / 1.2s (1 worker)
explain analyze select sum(length(c50)) from t4;

Working with indexes

General observations: Btree indexes are similar in size between HEAP and COLUMNAR tables. Overall, their performance also appears similar, although the columnar table's index is somewhat slower, no doubt due to the extra processing required to uncompress the table's values.
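
As a quick sanity check that the planner actually uses these btree indexes on both table types, the plans for an indexed lookup can be compared directly (a minimal sketch; the literal value 42 is arbitrary):

-- heap table, btree index on c2
explain analyze select c1 from t3 where c2 = 42;

-- columnar table, btree index on c2
explain analyze select c1 from t4 where c2 = 42;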

Regarding Expression Indexes: Creating an EXPRESSION index on a COLUMNAR table is significantly faster than on a HEAP table:

-- 1m17s
create index on t3(length(c90));
-- 14s
create index on t4(length(c90));

Regarding the runtime parameter max_parallel_workers_per_gather: Index performance varies considerably on HEAP tables depending upon the value it is set to.

The following results (timings shown for settings of 4 and 1) highlight that, depending upon the type of table used, this parameter can become important when hardware resources and server costs are a consideration:

-- HEAP TABLE using BTREE index: 467.789 ms (4 workers) / 748.939 ms (1 worker)
explain analyze select sum(c2) from t3;

-- COLUMNAR TABLE using BTREE index: 561.522 ms (4 workers) / 599.629 ms (1 worker)
explain analyze select sum(c2) from t4;

-- HEAP TABLE using EXPRESSION index: 1.614 ms (4 workers) / 2.346 ms (1 worker)
explain analyze select max(length(c90)) from t3;

-- COLUMNAR TABLE using EXPRESSION index: 31.980 ms (4 workers) / 38.766 ms (1 worker)
explain analyze select max(length(c90)) from t4;

About table constraints and access methods

Overall, indexes, constraints, and access methods are still evolving with many of the features still to be implemented.

Let's start with a big issue, i.e., DELETE and UPDATE:

-- fails
delete from t2 where id=5;  
ERROR:  UPDATE and CTID scans not supported for ColumnarScan

-- fails
update t2 set id=5;   
ERROR:  UPDATE and CTID scans not supported for ColumnarScan
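
A possible, if heavy-handed, workaround is the same access-method switch shown later for foreign keys: convert the table to heap, perform the DML, and convert it back. This is only a sketch and assumes a full table rewrite is acceptable:

-- hypothetical workaround; rewrites the whole table twice
alter table t2 set access method heap;
delete from t2 where id=5;
alter table t2 set access method columnar;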

Creating indexes on a columnar table is restricted to btree indexes:

-- works
create index on t2 using btree (id);

-- fails
create index on t2 using columnar (id);
ERROR:  unsupported access method for the index on columnar table t2

Foreign key constraints aren't implemented:

select generate_series as id into t3 from generate_series(0,15);
alter table t3 add primary key(id);

-- works for our standard table t1
alter table t1 add foreign key(id) references t3(id);

-- fails with the columnar table t2
alter table t2 add foreign key(id) references t3(id);
ERROR:  Foreign keys and AFTER ROW triggers are not supported for columnar tables
HINT:  Consider an AFTER STATEMENT trigger instead.

-- works after converting table t2 from COLUMNAR to HEAP
alter table t2 set access method heap;
alter table t2 add foreign key(id) references t3(id);
alter table t2 set access method columnar;

 

Partitioning

Columnar tables can be used as partitions; a partitioned table can be made up of any combination of row and columnar partitions.

An excellent use case is insert-once, read-only table partitions, where one can leverage both columnar compression and better-performing OLAP-type queries for very large tables.
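
For example, a minimal sketch of a range-partitioned table mixing access methods might look like the following (the table and partition names are hypothetical; the columnar access method comes from the Citus extension):

create table events (
    id      bigint,
    ts      timestamptz not null,
    payload text
) partition by range (ts);

-- hot partition: still receiving inserts and updates, so keep it heap
create table events_current partition of events
    for values from ('2022-08-01') to ('2022-09-01')
    using heap;

-- cold partition: written once, now read-only analytics, so store it columnar
create table events_archive partition of events
    for values from ('2022-07-01') to ('2022-08-01')
    using columnar;

Queries against events then transparently read from both storage formats.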

 

Caveat

Columnar Extension Limitations, as of version 11.1:

  • It takes more time to create a columnar table than a standard heap access-based table.
  • Query performance on smallish tables is equal to or slower than on heap-based tables.
  • UPDATE and DELETE are not possible on a columnar table.
  • Indexes are limited to btree, as of version 10.0.
  • There is no logical replication.

There’s actually more documented. Refer here for more information.

AUTHOR'S NOTE: In regard to the custom packages created for this blog: the entire Citus suite is designed to enhance Postgres as a distributed database solution. Only the columnar table component was covered in this blog. The complete feature set is part of this package install and should work, but it has not been tested.

Conclusion

Despite its current limitations, there are use cases where this extension can definitely make a difference. And it speaks well of its future as the team continues development by constantly improving and adding capabilities. Watch for updates on its GitHub source repository.

 

Sep
13
2022
--

Working With Snapshots in PostgreSQL

Working With Snapshots in PostgreSQL

One of the reasons I've never strayed far from PostgreSQL is that I am always discovering features that overcome all sorts of interesting problems. Although, to be honest, from time to time I've had the feeling that some features are a solution in search of a problem. Take, for example, exporting transaction snapshots.

Originally implemented in version 9.2, snapshot exports allow multiple transactions to share identical views of the same database state. Back then I wasn't certain how I could use it, until PostgreSQL 9.3 presented a true use case when parallel (multi-job) dumps were introduced to pg_dump.
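
Before getting to pg_dump, here's the basic shape of the feature: one session exports a snapshot identifier while keeping its transaction open, and any other session can import it so that both see exactly the same database state (a minimal sketch; the identifier will differ on your system):

-- session A: export a snapshot and leave the transaction open
begin transaction isolation level repeatable read;
select pg_export_snapshot();
-- returns something like 00000003-000021CE-1

-- session B: adopt session A's view of the database
begin transaction isolation level repeatable read;
set transaction snapshot '00000003-000021CE-1';

The exported snapshot remains importable only while the exporting transaction stays open.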

Here’s a simple example; let’s start by creating two tables with lots of data:

begin;
    create table t1(
        c1 serial primary key,
        c2 text default lpad('',500,md5(random()::text)),
        c3 timestamp default clock_timestamp()
    );

    create table t2(like t1 including all);

    insert into t1 select generate_series(1,5e6);
    insert into t2 select generate_series(1,5e6);

    analyze;
commit;
List of relations
 Schema | Name | Type  |  Owner   |  Size   | Description 
--------+------+-------+----------+---------+-------------
 public | t1   | table | postgres | 2791 MB | 
 public | t2   | table | postgres | 2791 MB |

TIP: Changing the arguments to the generate_series function will populate the tables to a size that best fits your hardware's capabilities.

The traditional invocation of a logical dump uses a single CPU/thread, archiving the tables one at a time. The CLI utility time generates timing statistics, returning the time it takes to perform the dump using the “compressed” format:

# execution time: 60 seconds
time pg_dump -Fc db01 > db01.db

real    1m0.322s
user    0m50.203s
sys     0m1.309s

An alternate invocation generating the dump uses the “directory” format:

# execution time: 52 seconds
time pg_dump -Fd db01 -f db01_dirA

real    0m52.304s
user    0m50.908s
sys     0m1.152s

Thus far the execution utilizes a single CPU. Now look at the execution time when the ‘-j’ (or ‘--jobs’) switch is used, so that multiple connections generate the dump:

# execution time: 31 seconds
time pg_dump -Fd db01 -f db01_dirB -j 2

real    0m31.115s
user    1m0.242s
sys     0m1.377s

Notice the number of processes running in the background. The third connection is a leader process which coordinates the two worker processes:

$: ps aux | grep -v grep | grep pg_dump
rbernier 1872809  0.1  0.0  14444  5968 pts/1    S+   08:52   0:00 pg_dump -Fd db01 -f db01_dirB -j 2
rbernier 1872811  100  0.0  14804  4480 pts/1    R+   08:52   0:12 pg_dump -Fd db01 -f db01_dirB -j 2
rbernier 1872812  100  0.0  14804  4488 pts/1    R+   08:52   0:12 pg_dump -Fd db01 -f db01_dirB -j 2

Without the synchronized snapshot feature, the different worker jobs wouldn’t be guaranteed to see the same data in each connection, which could lead to an inconsistent backup.

One important caveat to remember is that the pg_dump jobs switch only works with the “directory” format. Attempting to use multiple jobs with a single-file dump fails:

pg_dump -Fc db01 -j 2 > db01.db               
pg_dump: error: parallel backup only supported by the directory format

There is a full description of other caveats using the jobs switch in the PostgreSQL documentation.

So we’ve shown how it works with pg_dump, but can we go further? … YES!

We can replicate the behavior using the snapshot synchronization function pg_export_snapshot().

Continuing with the two previously created tables, let's create another scenario; the steps below show what each session executes:

  • STEP 1: Three psql sessions login and are interacting with tables t1 and t2 in tandem.
  • STEP 2: Session 1 inserts into the tables every five seconds. Session 2 sees the data differently than session 3; note the three-second delay in session 3's query, which makes it problematic for both sessions to see the same dataset at the same time.
  • STEP 3: Session 1 continues updating tables t1, t2, but now sessions 2 and 3 are using the same exported snapshot.
  • STEP 4: While session 1 continues updating tables t1, t2, sessions 2 and 3 see the same datasets, i.e., they are synchronized.
  • STEP 5: Both sessions 2 and 3 are actively copying data at the same time without fear of inconsistency.
  • STEP 6: The COPY is completed so let’s finish up by committing the transactions.

STEP 1

Session 1: psql db01
Session 2: psql db01
Session 3: psql db01

STEP 2

Session 1:
with a(x) as (select max(c1) from t1),
     b as (insert into t1
           select generate_series(x+1,x+10)
           from a returning *),
     c as (insert into t2
           select generate_series(x+1,x+10)
           from a returning *)
select x as increment from a;
\watch 5

Session 2:
with a(x) as (select max(c1) from t1)
select c1,md5(c2),c3 from t1,a where c1=x
union all
select c1,md5(c2),c3 from t2,a where c1=x;

Session 3:
with a(x) as (select max(c1) from t1)
select c1,md5(c2),c3 from t1,a where c1=x
union all
select c1,md5(c2),c3 from t2,a,pg_sleep(3) where c1=x;

STEP 3

Session 1: Step 2 is active

Session 2:
-- let's export the snapshot
-- 00000003-000021CE-1
begin transaction isolation level repeatable read;
select pg_export_snapshot();

Session 3:
-- let's import the snapshot
begin transaction isolation level repeatable read;
set transaction snapshot '00000003-000021CE-1';

STEP 4

Session 1: Step 2 is active

Session 2:
with a(x) as (select max(c1) from t1)
select c1,md5(c2),c3 from t1,a where c1=x
union all
select c1,md5(c2),c3 from t2,a where c1=x;

Session 3:
with a(x) as (select max(c1) from t1)
select c1,md5(c2),c3 from t1,a where c1=x
union all
select c1,md5(c2),c3 from t2,a,pg_sleep(3) where c1=x;

STEP 5

Session 1: Step 2 is active

Session 2:
\copy t1 to program 'gzip > t1.gz'

Session 3:
\copy t2 to program 'gzip > t2.gz'

STEP 6

Session 1: Step 2 is terminated

Session 2:
-- commit or rollback, it's your choice
commit

Session 3:
-- commit or rollback, it's your choice
commit

 

And there you have it, a neat little trick exporting snapshots! Of course, there’s more you can do than just copying tables in a multi-threaded fashion but hopefully, this little exposé will lead to more opportunities to simplify your life while working with Postgres.
