Grafana Dashboards: A PoC Implementing the PostgreSQL Extension pg_stat_monitor
This PoC demonstrates how to install and configure pg_stat_monitor in order to extract useful and actionable metrics from a PostgreSQL database and display them on a Grafana dashboard.
About the environment
- Grafana: version 10.0.0
- Grafana database backend: Prometheus version 2.15.2+d
- PostgreSQL version 13
- pgbench version 13
In order to investigate the potential opportunities for implementing constructive and useful metrics derived from PostgreSQL into Grafana, it is necessary to generate load using pgbench.
Configuring Grafana
For our purposes, the Grafana datasource used in this PoC is also the Postgres data cluster that is generating the data to be monitored.
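Since Grafana connects directly to that cluster, it is worth giving it its own credentials. The following is a minimal sketch, not part of the original PoC; the role name grafana and its password are placeholders:

-- hypothetical read-only role for the Grafana PostgreSQL datasource
create role grafana login password 'CHANGEME';
grant connect on database pgbench to grafana;
grant usage on schema public to grafana;
-- lets Grafana read the pg_stat_monitor view and the archive tables created later in this post
grant select on all tables in schema public to grafana;
alter default privileges in schema public grant select on tables to grafana;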
pg_stat_monitor
About
pg_stat_monitor is a Query Performance Monitoring tool for PostgreSQL. It collects various statistics data such as query statistics, query plan, SQL comments, and other performance insights. The collected data is aggregated and presented in a single view.
pg_stat_monitor takes its inspiration from pg_stat_statements. Unlike pg_stat_statements, which aggregates its metrics only since the last time it was zeroed, pg_stat_monitor can bucket its output into a set number of time-based aggregates, saving users the effort of doing it themselves.
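For example, a per-bucket breakdown of execution time can be read straight from the view; this is a sketch that relies only on columns from the view definition shown later in this post:

-- one row per time bucket for each statement
select bucket, bucket_start_time, queryid, calls, total_exec_time
  from pg_stat_monitor
 order by queryid, bucket_start_time;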
pg_stat_monitor tracks the following operations:
- statements
- queries
- functions
- stored procedures and other non-utility statements
Features
- Time Interval Grouping: Instead of supplying one set of ever-increasing counts, pg_stat_monitor computes stats for a configured number of time intervals – time buckets. This allows for much better data accuracy, especially in the case of high-resolution or unreliable networks.
- Multi-Dimensional Grouping: While pg_stat_statements groups counters by userid, dbid, and queryid, pg_stat_monitor uses a more detailed grouping for higher precision. This allows a user to drill down into the performance of queries (see the example query after this list).
- Capture Actual Parameters in the Queries: pg_stat_monitor allows you to choose whether you want to see queries with placeholders for parameters or the actual parameter data. This simplifies debugging and analysis by enabling users to re-execute the same query with its real values.
- Query Plan: Each SQL is now accompanied by its actual plan that was constructed for its execution. That’s a huge advantage if you want to understand why a particular query is slower than expected.
- Table Access Statistics for a Statement: This allows us to easily identify all queries that accessed a given table. This set is on par with the information provided by pg_stat_statements.
- Histogram: Visual representation is very helpful as it can help identify issues. With the help of the histogram function, one can now view a timing/calling data histogram in response to an SQL query. And yes, it even works in PostgreSQL.
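As an illustration of the multi-dimensional grouping mentioned above, a query along these lines (a sketch relying only on columns from the view definition shown later in this post) drills down by database, user, and client address:

-- aggregate the collected statistics by database, user, and client address
select datname, username, client_ip,
       sum(calls)           as calls,
       sum(total_exec_time) as total_exec_time
  from pg_stat_monitor
 group by datname, username, client_ip
 order by total_exec_time desc;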
Installation (example: CentOS 8, PostgreSQL 14)
The simplest way to get pg_stat_monitor is to install it via Percona Distribution for PostgreSQL.
The following instructions demonstrate installing Percona Distribution for PostgreSQL and pg_stat_monitor on a CentOS 8 Linux distribution:
# Install The Percona Repository
dnf install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
percona-release setup ppg14

# Install The postgres Community Repository
dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
dnf update -y
dnf install -y pg_stat_monitor_14

# perform standard initialization and systemd configurations
/usr/pgsql-14/bin/postgresql-14-setup initdb

# configure postgres to use pg_stat_monitor
echo "
shared_preload_libraries=pg_stat_monitor
" >> /var/lib/pgsql/14/data/postgresql.auto.conf

# complete postgres configuration
systemctl enable postgresql-14
systemctl start postgresql-14
Create extension
The pg_stat_monitor extension can be created in any database, but for the purposes of this PoC, it is placed in the database pgbench.
postgres=# create database pgbench;
postgres=# \c pgbench
pgbench=# create extension pg_stat_monitor;
pgbench=# \d
          List of relations
 Schema |      Name       | Type |  Owner
--------+-----------------+------+----------
 public | pg_stat_monitor | view | postgres
View "public.pg_stat_monitor" Column | Type | Collation | Nullable | Default ---------------------+--------------------------+-----------+----------+--------- bucket | bigint | | | bucket_start_time | timestamp with time zone | | | userid | oid | | | username | text | | | dbid | oid | | | datname | text | | | client_ip | inet | | | pgsm_query_id | bigint | | | queryid | bigint | | | toplevel | boolean | | | top_queryid | bigint | | | query | text | | | comments | text | | | planid | bigint | | | query_plan | text | | | top_query | text | | | application_name | text | | | relations | text[] | | | cmd_type | integer | | | cmd_type_text | text | | | elevel | integer | | | sqlcode | text | | | message | text | | | calls | bigint | | | total_exec_time | double precision | | | min_exec_time | double precision | | | max_exec_time | double precision | | | mean_exec_time | double precision | | | stddev_exec_time | double precision | | | rows | bigint | | | shared_blks_hit | bigint | | | shared_blks_read | bigint | | | shared_blks_dirtied | bigint | | | shared_blks_written | bigint | | | local_blks_hit | bigint | | | local_blks_read | bigint | | | local_blks_dirtied | bigint | | | local_blks_written | bigint | | | temp_blks_read | bigint | | | temp_blks_written | bigint | | | blk_read_time | double precision | | | blk_write_time | double precision | | | resp_calls | text[] | | | cpu_user_time | double precision | | | cpu_sys_time | double precision | | | wal_records | bigint | | | wal_fpi | bigint | | | wal_bytes | numeric | | | bucket_done | boolean | | | plans | bigint | | | total_plan_time | double precision | | | min_plan_time | double precision | | | max_plan_time | double precision | | | mean_plan_time | double precision | | | stddev_plan_time | double precision | | |
About pgbench
pgbench is a simple program executing benchmark tests on PostgreSQL by running the same sequence of SQL commands over and over. pgbench is capable of executing multiple concurrent database sessions and can calculate the average transaction rate (TPS) at the end of a run. Although the default configuration simulates loading based loosely upon TPC-B, it is nevertheless easy to test other use cases by writing one’s own transaction script files.
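As a sketch of that last point, a custom transaction script is just a file of SQL with optional \set commands; the file name my_select.sql below is illustrative only:

-- my_select.sql: a trivial custom pgbench transaction
\set aid random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

# run the custom script instead of the built-in TPC-B-like workload
pgbench -f my_select.sql -c 4 -j 2 -T 60 pgbench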
Querying the data
While it is reasonable to create panels showing real-time load in order to better explore the types of queries that can be run against pg_stat_monitor, it is more practical to copy the data into tables and query it there after the benchmarking run has completed.
Table: pg_stat_monitor_archive
Save the data generated from a recently completed benchmark run into an archive table:
select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time asc
Table "public.pg_stat_monitor_archive" Column | Type | Collation | Nullable | Default --------------------+--------------------------+-----------+----------+--------- bucket | bigint | | | bucket_start_time | timestamp with time zone | | | userid | oid | | | username | text | | | dbid | oid | | | datname | text | | | client_ip | inet | | | pgsm_query_id | bigint | | | queryid | bigint | | | toplevel | boolean | | | top_queryid | bigint | | | query | text | | | comments | text | | | planid | bigint | | | query_plan | text | | | top_query | text | | | application_name | text | | | relations | text[] | | | cmd_type | integer | | | cmd_type_text | text | | | elevel | integer | | | sqlcode | text | | | message | text | | | calls | bigint | | | total_exec_time | double precision | | | min_exec_time | double precision | | | max_exec_time | double precision | | | mean_exec_time | double precision | | | stddev_exec_time | double precision | | | rows | bigint | | | shared_blks_hit | bigint | | | shared_blks_read | bigint | | | shared_blks_dirtied | bigint | | | shared_blks_written | bigint | | | local_blks_hit | bigint | | | local_blks_read | bigint | | | local_blks_dirtied | bigint | | | local_blks_written | bigint | | | temp_blks_read | bigint | | | temp_blks_written | bigint | | | blk_read_time | double precision | | | blk_write_time | double precision | | | resp_calls | text[] | | | cpu_user_time | double precision | | | cpu_sys_time | double precision | | | wal_records | bigint | | | wal_fpi | bigint | | | wal_bytes | numeric | | | bucket_done | boolean | | | plans | bigint | | | total_plan_time | double precision | | | min_plan_time | double precision | | | max_plan_time | double precision | | | mean_plan_time | double precision | | | stddev_plan_time | double precision | | |
Table: pg_stat_monitor_qry
Extract this metric of interest, i.e., time vs total execution time:
select bucket_start_time, pgsm_query_id, queryid, total_exec_time into pg_stat_monitor_qry from pg_stat_monitor order by bucket_start_time asc
pgbench=# \d pg_stat_monitor_qry
                   Table "public.pg_stat_monitor_qry"
      Column       |           Type           | Collation | Nullable | Default
-------------------+--------------------------+-----------+----------+---------
 bucket_start_time | timestamp with time zone |           |          |
 pgsm_query_id     | bigint                   |           |          |
 queryid           | bigint                   |           |          |
 total_exec_time   | double precision         |           |          |
Table: pg_stat_monitor_shared_blk_io
Extract this metric of interest, i.e., time vs shared_blk io:
select bucket_start_time, pgsm_query_id, queryid, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written into pg_stat_monitor_shared_blk_io from pg_stat_monitor_archive order by bucket_start_time asc;
pgbench=# \d pg_stat_monitor_shared_blk_io
              Table "public.pg_stat_monitor_shared_blk_io"
       Column        |           Type           | Collation | Nullable | Default
---------------------+--------------------------+-----------+----------+---------
 bucket_start_time   | timestamp with time zone |           |          |
 pgsm_query_id       | bigint                   |           |          |
 queryid             | bigint                   |           |          |
 shared_blks_hit     | bigint                   |           |          |
 shared_blks_read    | bigint                   |           |          |
 shared_blks_dirtied | bigint                   |           |          |
 shared_blks_written | bigint                   |           |          |
Table: pg_stat_monitor_blk_io
Note: this metric requires the runtime parameter track_io_timing to be set to on.
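If it is not already enabled, track_io_timing can be turned on without a server restart; for example:

-- enable timing of block reads and writes
alter system set track_io_timing = on;
select pg_reload_conf();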
Extract this metric of interest, i.e., time vs. blk io:
select bucket_start_time, pgsm_query_id, queryid, blk_read_time, blk_write_time into pg_stat_monitor_blk_io from pg_stat_monitor_archive order by bucket_start_time asc;
Table: pg_stat_monitor_uniq_id
Save a copy of all unique query IDs in order to parse out future queries from the view.
Column pgsm_query_id identifies the query in such a manner that one can still identify the same query even when generated on other platforms under different loading conditions with different data:
with a as (select distinct on (pgsm_query_id) * from pg_stat_monitor_archive where application_name='pgbench') select cmd_type, cmd_type_text,pgsm_query_id, queryid,query as example_query into pg_stat_monitor_uniq_id from a order by cmd_type;
pgbench=# \d pg_stat_monitor_uniq_id
        Table "public.pg_stat_monitor_uniq_id"
    Column     |  Type   | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
 cmd_type      | integer |           |          |
 cmd_type_text | text    |           |          |
 pgsm_query_id | bigint  |           |          |
 queryid       | bigint  |           |          |
 example_query | text    |           |          |
This is an example set of queries generated by pgbench. Note the numbers in column pgsm_query_id are always the same values irrespective of hosts or environments:
select cmd_type_text, pgsm_query_id, example_query from pg_stat_monitor_uniq_id where cmd_type > 0;
 cmd_type_text |    pgsm_query_id     | example_query
---------------+----------------------+-----------------------------------------------------------------
 SELECT        | -7455620703706695456 | SELECT abalance FROM pgbench_accounts WHERE aid = 16416498
 UPDATE        |  -510321339504955469 | UPDATE pgbench_accounts SET abalance = abalance + 2063 WHERE aid = 1482568
 UPDATE        |  5276535447716615446 | UPDATE pgbench_branches SET bbalance = bbalance + 1384 WHERE bid = 7
 UPDATE        |  3629195281782908951 | UPDATE pgbench_tellers SET tbalance = tbalance + -2966 WHERE tid = 330
 INSERT        | -8751124061964589929 | INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (321, 56, 21104880, 4211, CURRENT_TIMESTAMP)
Benchmarking
Two types of performance monitoring are profiled:
- Real-time loading performance
- Aggregate performance over a specific time period, i.e., a snapshot.
Although the results of the benchmarking can be viewed by querying the view pg_stat_monitor, you will note, as demonstrated by the bash script and SQL statements below, that the contents of the view are immediately copied and saved into a collection of tables. This is because the data disappears over time as pg_stat_monitor cycles through its allotted number of buckets; with the settings used here (75 one-minute buckets), only the most recent 75 minutes of activity are retained, which is just enough to cover the 4,500-second benchmark run.
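Before relying on the view, it is worth confirming how much history the current settings actually retain; a quick check might look like this:

-- bucket width multiplied by the number of buckets gives the retention window
show pg_stat_monitor.pgsm_bucket_time;
show pg_stat_monitor.pgsm_max_buckets;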
A script executing a benchmarking run:
#!/bin/bash
# REFERENCE
#    https://docs.percona.com/pg-stat-monitor/reference.html
#
set -e

export PGPASSWORD="MYPASSWORD" PGHOST='MYHOST' PGPORT=5434 PGUSER=postgres

#
# initialize benchmarking database
#
dropdb --if-exists pgbench
createdb pgbench
/usr/pgsql-12/bin/pgbench -i --foreign-keys -s 300 pgbench
psql pgbench -c 'create extension pg_stat_monitor'

#
# configure pg_stat_monitor, requires system restart
#
psql postgres <<_eof_
    -- set bucket time range, default is normally 60 seconds
    alter system set pg_stat_monitor.pgsm_bucket_time = '1min';
    -- set number of buckets, default is normally 10
    alter system set pg_stat_monitor.pgsm_max_buckets = 75;
_eof_

systemctl restart postgresql@13-main

psql pgbench <<_eof_
    -- zero pg_stat_monitor stats
    select * from pg_stat_monitor_reset();
_eof_

#
# begin benchmarking run
#
# 4500 seconds (75 minutes)
/usr/pgsql-12/bin/pgbench -U postgres -c 4 -j 2 -T 4500 -P 5 -b tpcb-like pgbench

#
# copy and save the benchmarking run into tables
#
psql pgbench <<_eof_
    drop table if exists pg_stat_monitor_archive, pg_stat_monitor_qry, pg_stat_monitor_uniq_id;

    select * into pg_stat_monitor_archive from pg_stat_monitor order by bucket_start_time;

    select bucket_start_time, pgsm_query_id, queryid, total_exec_time
      into pg_stat_monitor_qry
      from pg_stat_monitor_archive
     where application_name='pgbench';

    with a as (select distinct on (pgsm_query_id) * from pg_stat_monitor_archive where application_name='pgbench')
    select cmd_type, cmd_type_text, pgsm_query_id, queryid, query as example_query
      into pg_stat_monitor_uniq_id
      from a
     order by cmd_type;
_eof_

echo "DONE"
progress: 4435.0 s, 341.2 tps, lat 11.718 ms stddev 3.951
progress: 4440.0 s, 361.2 tps, lat 11.075 ms stddev 3.519
progress: 4445.0 s, 348.0 tps, lat 11.483 ms stddev 5.246
progress: 4450.0 s, 383.8 tps, lat 10.418 ms stddev 4.514
progress: 4455.0 s, 363.6 tps, lat 10.988 ms stddev 4.326
progress: 4460.0 s, 344.0 tps, lat 11.621 ms stddev 3.981
progress: 4465.0 s, 360.4 tps, lat 11.093 ms stddev 4.457
progress: 4470.0 s, 383.8 tps, lat 10.423 ms stddev 5.615
progress: 4475.0 s, 369.6 tps, lat 10.811 ms stddev 3.784
progress: 4480.0 s, 355.6 tps, lat 11.227 ms stddev 3.954
progress: 4485.0 s, 378.8 tps, lat 10.580 ms stddev 2.890
progress: 4490.0 s, 370.8 tps, lat 10.770 ms stddev 2.879
progress: 4495.0 s, 365.2 tps, lat 10.947 ms stddev 4.997
progress: 4500.0 s, 379.2 tps, lat 10.549 ms stddev 2.832
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 4
number of threads: 2
duration: 4500 s
number of transactions actually processed: 1564704
latency average = 11.497 ms
latency stddev = 4.800 ms
tps = 347.711175 (including connections establishing)
tps = 347.711731 (excluding connections establishing)
Dashboard example 1: Querying saved data
Top panel (Query execution time vs. DML)
Five (5) SQL statements are used to create this panel:
-- SELECT --
select bucket_start_time, total_exec_time as "SELECT"
  from pg_stat_monitor_qry
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='SELECT'
 order by 1 asc;

-- INSERT --
select bucket_start_time, total_exec_time as "INSERT"
  from pg_stat_monitor_qry
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='INSERT'
 order by 1 asc;

-- UPDATE 1 --
select bucket_start_time, total_exec_time as "UPDATE 1"
  from pg_stat_monitor_qry
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='UPDATE' and pgsm_query_id = -510321339504955469
 order by 1 asc;

-- UPDATE 2 --
select bucket_start_time, total_exec_time as "UPDATE 2"
  from pg_stat_monitor_qry
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='UPDATE' and pgsm_query_id = 5276535447716615446
 order by 1 asc;

-- UPDATE 3 --
select bucket_start_time, total_exec_time as "UPDATE 3"
  from pg_stat_monitor_qry
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='UPDATE' and pgsm_query_id = 3629195281782908951
 order by 1 asc;
Bottom panel (Query execution time vs. shared blocks)
-- INSERT (ins_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit     as ins_hit,
       shared_blks_read    as ins_read,
       shared_blks_dirtied as ins_dirt,
       shared_blks_written as ins_writ
  from pg_stat_monitor_shared_blk_io
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='INSERT'
 order by 1 asc;

-- UPDATE 1 (update1_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit     as update1_hit,
       shared_blks_read    as update1_read,
       shared_blks_dirtied as update1_dirt,
       shared_blks_written as update1_writ
  from pg_stat_monitor_shared_blk_io
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='UPDATE' and pgsm_query_id = -510321339504955469
 order by 1 asc;

-- UPDATE 2 (update2_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit     as update2_hit,
       shared_blks_read    as update2_read,
       shared_blks_dirtied as update2_dirt,
       shared_blks_written as update2_writ
  from pg_stat_monitor_shared_blk_io
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='UPDATE' and pgsm_query_id = 5276535447716615446
 order by 1 asc;

-- UPDATE 3 (update3_[hit|read|dirty|write]) --
select bucket_start_time,
       shared_blks_hit     as update3_hit,
       shared_blks_read    as update3_read,
       shared_blks_dirtied as update3_dirt,
       shared_blks_written as update3_writ
  from pg_stat_monitor_shared_blk_io
  join pg_stat_monitor_uniq_id using (pgsm_query_id)
 where cmd_type_text='UPDATE' and pgsm_query_id = 3629195281782908951
 order by 1 asc;
Analysis
Here are some example patterns that can be discerned:
- The SELECT statements are the fastest DML operations (top panel).
- Although SQL statement UPDATE 1 (top panel) takes up the most time, its contents do not have much presence in the shared buffer relative to the other update statements.
- Inserts (top) are the second-slowest set of statements, yet they account for very little shared-buffer activity compared to the UPDATEs (bottom).
Dashboard example 2: Monitoring in real time
These two panels show read/write IO performance to the persistent storage while benchmarking a live run.
Top panel (Execution time vs. DML)
-- SELECT --
select bucket_start_time, total_exec_time as "SELECT"
  from pg_stat_monitor
  join pg_stat_monitor_uniq_id a using (pgsm_query_id)
 where a.cmd_type_text='SELECT'
 order by 1 asc;

-- INSERT --
select bucket_start_time, total_exec_time as "INSERT"
  from pg_stat_monitor
  join pg_stat_monitor_uniq_id a using (pgsm_query_id)
 where a.cmd_type_text='INSERT'
 order by 1 asc;

-- UPDATE 1 --
select bucket_start_time, total_exec_time as "UPDATE 1"
  from pg_stat_monitor
  join pg_stat_monitor_uniq_id a using (pgsm_query_id)
 where a.cmd_type_text='UPDATE' and pgsm_query_id = -510321339504955469
 order by 1 asc;

-- UPDATE 2 --
select bucket_start_time, total_exec_time as "UPDATE 2"
  from pg_stat_monitor
  join pg_stat_monitor_uniq_id a using (pgsm_query_id)
 where a.cmd_type_text='UPDATE' and pgsm_query_id = 5276535447716615446
 order by 1 asc;

-- UPDATE 3 --
select bucket_start_time, total_exec_time as "UPDATE 3"
  from pg_stat_monitor
  join pg_stat_monitor_uniq_id a using (pgsm_query_id)
 where a.cmd_type_text='UPDATE' and pgsm_query_id = 3629195281782908951
 order by 1 asc;
Bottom panel (Time vs. IO)
-- time vs read/write blocks (blk_read_time, blk_write_time)
-- track_io_timing is on
select bucket_start_time, blk_read_time, blk_write_time
  from public.pg_stat_monitor;
Analysis
It’s quite easy to observe that SQL statement UPDATE 1 represents the bulk of the read operations.
-- example SQL statement
UPDATE pgbench_accounts SET abalance = abalance + 2063 WHERE aid = 1482568
Interestingly, writes are not as significant as reads.
Conclusion
I’m excited about pg_stat_monitor. Not only can it be used in Grafana, but it’s easily implemented in any monitoring solution, including our own Percona Monitoring and Management. It’s also incorporated in our latest version of Percona Operator for PostgreSQL.
Pg_stat_monitor is an obvious, common-sense improvement that addresses pg_stat_statements' greatest limitation, i.e., its inability to bucket metrics over time intervals. Frankly, I can see the pg_stat_monitor extension eventually replacing pg_stat_statements as the de facto extension for monitoring Postgres when it comes to real-time analysis.
Happy monitoring!
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.
How To Scale a Single-Host PostgreSQL Database With Citus
When it comes to Citus, successfully building out and scaling a PostgreSQL cluster across multiple nodes and even across data centers can feel, at times, like an art form because there are so many ways of building it out.
There’s an axiom that I think aptly applies to this situation describing the differences between science and art:
– Science: 1 problem -> 1 solution
– Art: 1 problem -> 1,000 solutions
While the two previous Citus blogs that I wrote covered the fairly straightforward concepts of creating columnar tables and leveraging data redundancy, this blog explores the architectural design considerations using this freely downloadable and fully featured PostgreSQL extension.
Rather than listing the concepts, function calls, etc., available in Citus, which frankly is a bit boring, I'm going to explore scaling out a database system starting with a single host. I won't cover all the features, but I'll show just enough that you'll want to explore more of what you can accomplish for yourself.
About the cluster
Following a step-by-step process, the objective is to create a four-node cluster consisting of:
- PostgreSQL version 15
- Citus extension (I’ll be using version 11, but there are newer ones available.)
- PostgreSQL Cluster
- One coordinator node
- citus-coord-01
- Three worker nodes
- citus1
- citus2
- citus3
- Hardware
AWS Instance
Ubuntu Server 20.04, SSD volume type
64-bit (x86)
c5.xlarge 4vCPU 8GB-RAM
Storage: EBS volume (root)
80GB
gp2 (IOPS 240/3000)
In addition, high availability will be integrated, guaranteeing cluster viability if one worker node goes down.
Leveraging pgbench, a benchmarking utility that comes bundled with PostgreSQL, I will put the cluster through its paces by executing a series of DML operations. Using its default TPC-B-like benchmark, one can stress test a database of any size, ranging from a few clients to simulating thousands interacting with a system sized into the terabytes if need be.
Steps
Provisioning
The first step is to provision the four nodes with both PostgreSQL and Citus. Refer here or here for getting and installing the extension.
-- update each postgres node and enable the extension
-- (requires a server restart)
alter system set shared_preload_libraries='citus';
On each of the four nodes, create the database with the Citus extension:
-- using psql
create database pgbench;
\c pgbench
create extension citus;
One worker node
Beginning with a single worker node, log onto citus-coord-01, declare the coordinator host, and add the first worker node, citus1:
-- execute on citus-coord-01
psql pgbench <<_eof1_
    select citus_set_coordinator_host('citus-coord-01', 5432);
    select citus_add_node('citus1', 5432);
_eof1_
While still logged into database pgbench, get a list of the worker nodes, which at this time is just the one:
select * from master_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 citus1    |      5432
Now, things get a little interesting; the normal initialization of pgbench is to create and populate the tables with the appropriate constraints. In this case, however, only the tables, without data or constraints, are initialized:
-- execute the following command on citus-coord-01
pgbench -iI t pgbench
pgbench=# \d+
                                    List of relations
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size   |
--------+------------------+-------+----------+-------------+---------------+---------+-----
 public | citus_tables     | view  | postgres | permanent   |               | 0 bytes |
 public | pgbench_accounts | table | postgres | permanent   | heap          | 0 bytes |
 public | pgbench_branches | table | postgres | permanent   | heap          | 0 bytes |
 public | pgbench_history  | table | postgres | permanent   | heap          | 0 bytes |
 public | pgbench_tellers  | table | postgres | permanent   | heap          | 0 bytes |
This next set of commands, executed on the coordinator node, distributes the pgbench tables across node citus1:
psql pgbench <<_eof1_
BEGIN;
    select create_distributed_table('pgbench_history', 'aid');
    select create_distributed_table('pgbench_accounts', 'aid');
    select create_distributed_table('pgbench_branches', 'bid');
    select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;
_eof1_
And now, we can begin populating the tables. The following invocation generates almost 4GB of data. Notice I still haven’t added any constraints:
pgbench -iI g -s 300 pgbench
This command shows the total number of shards on the node. The default number of shards generated for a given table is 32.
select * from citus_tables;
    table_name    | citus_table_type | distribution_column | colocation_id | table_size | shard_count
------------------+------------------+---------------------+---------------+------------+-------------
 pgbench_accounts | distributed      | aid                 |             1 | 3844 MB    |          32
 pgbench_branches | distributed      | bid                 |             1 | 256 kB     |          32
 pgbench_history  | distributed      | aid                 |             1 | 0 bytes    |          32
 pgbench_tellers  | distributed      | tid                 |             1 | 256 kB     |          32
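If a different shard count is wanted, the citus.shard_count setting can be changed before a table is distributed; this is only a sketch, and the table name my_new_table, its column id, and the value 64 are examples, not part of this cluster:

-- takes effect for subsequent create_distributed_table() calls;
-- already-distributed tables keep their existing shard count
set citus.shard_count = 64;
select create_distributed_table('my_new_table', 'id');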
You’ll also notice that only some of the shards have been populated with data, which is normal when initializing pgbench with data:
select * from citus_shards;
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+----------+----------+------------
 pgbench_accounts |  102040 | pgbench_accounts_102040 | distributed      |             1 | citus1   |     5432 |  125853696
 pgbench_accounts |  102041 | pgbench_accounts_102041 | distributed      |             1 | citus1   |     5432 |  126173184
 pgbench_accounts |  102042 | pgbench_accounts_102042 | distributed      |             1 | citus1   |     5432 |  125739008
 pgbench_accounts |  102043 | pgbench_accounts_102043 | distributed      |             1 | citus1   |     5432 |  125968384
.
.
 pgbench_branches |  102072 | pgbench_branches_102072 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102073 | pgbench_branches_102073 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102074 | pgbench_branches_102074 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102075 | pgbench_branches_102075 | distributed      |             1 | citus1   |     5432 |       8192
.
.
 pgbench_history  |  102008 | pgbench_history_102008  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102009 | pgbench_history_102009  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102010 | pgbench_history_102010  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102011 | pgbench_history_102011  | distributed      |             1 | citus1   |     5432 |          0
.
.
 pgbench_tellers  |  102104 | pgbench_tellers_102104  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102105 | pgbench_tellers_102105  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102106 | pgbench_tellers_102106  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102107 | pgbench_tellers_102107  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102108 | pgbench_tellers_102108  | distributed      |             1 | citus1   |     5432 |       8192
.
.
And now, execute the benchmark:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
The results are not pretty.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 1524
number of failed transactions: 0 (0.000%)
latency average = 791.374 ms
latency stddev = 157.710 ms
initial connection time = 15.704 ms
tps = 25.197427 (without initial connection time)
ATTENTION: Refer to the bottom of this blog, which summarizes a tabulation of the various benchmarks as this cluster evolves.
In order to speed up the benchmark, indexes must be added. However, the standard pgbench method of adding indexes fails:
postgres@citus-coord-01:~$ pgbench -iI p pgbench
creating primary keys...
pgbench: error: query failed: ERROR:  cannot create constraint without a name on a distributed table
pgbench: detail: Query was: alter table pgbench_branches add primary key (bid)
Successfully adding indexes and constraints requires them to be explicitly named and created in the following fashion. Note that table pgbench_history requires a REPLICA identity because it doesn’t have a primary key:
# THIS WORKS!
psql pgbench <<_eof1_
--
-- indexes and constraints must be explicitly named
--
BEGIN;
    create unique index pgbench_accounts_pk on pgbench_accounts(aid);
    create unique index pgbench_branches_pk on pgbench_branches(bid);
    create unique index pgbench_tellers_pk  on pgbench_tellers(tid);

    alter table pgbench_accounts add constraint pk_accounts primary key using index pgbench_accounts_pk;
    alter table pgbench_branches add constraint pk_branches primary key using index pgbench_branches_pk;
    alter table pgbench_tellers  add constraint pk_tellers  primary key using index pgbench_tellers_pk;

    \qecho adding REPLICA IDENTITY (no PK present)
    alter table pgbench_history replica identity full;
COMMIT;
_eof1_
Repeating the benchmark run yields a much better result:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 135273
number of failed transactions: 0 (0.000%)
latency average = 8.865 ms
latency stddev = 9.452 ms
initial connection time = 15.556 ms
tps = 2254.544852 (without initial connection time)
Two worker nodes
Adding a second worker node is straightforward, requiring only two steps:
- Adding the node.
- Rebalancing the shards across the two nodes.
Execute this on the coordinator node, citus-coord-01:
psql pgbench <<_eof1_
--
-- it's understood that database pgbench already exists on citus2
-- and extension citus has been created
--
\qecho adding node citus2 ...
    select citus_add_node('citus2', 5432);

\qecho rebalancing shards across TWO nodes ...
    select * from rebalance_table_shards();

ANALYZE;
_eof1_
Here’s a partial output as the tables are rebalanced across the two nodes:
adding node citus2 ...
 citus_add_node
----------------
              3
rebalancing shards across TWO nodes ...
NOTICE:  Moving shard 102008 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102009 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102010 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102011 from citus1:5432 to citus2:5432 ...
.
.
NOTICE:  Moving shard 102020 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102021 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102022 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102023 from citus1:5432 to citus2:5432 ...
Repeating the benchmark once again demonstrates that performance has degraded somewhat. This is to be expected, considering my POC consists of containers on a single machine, all sharing the same system resources:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 130190
number of failed transactions: 0 (0.000%)
latency average = 9.214 ms
latency stddev = 19.159 ms
initial connection time = 15.776 ms
tps = 2169.559969 (without initial connection time)
Three worker nodes
As per the two-worker node configuration, adding another node is straightforward.
psql pgbench <<_eof1_
\qecho adding node citus3 ...
    select citus_add_node('citus3', 5432);

\qecho rebalancing shards across THREE nodes ...
    select * from rebalance_table_shards();
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 142630
number of failed transactions: 0 (0.000%)
latency average = 8.410 ms
latency stddev = 5.542 ms
initial connection time = 16.407 ms
tps = 2377.227920 (without initial connection time)
Three worker nodes, with redundancy/replication
This next step is interesting because it adds redundancy by increasing the replication factor from one to two, such that each data shard of each table is duplicated and its two copies are located on two distinct nodes.
ATTENTION: A newly updated replication factor only affects tables created after the change. Pre-existing tables are not affected by the new replication factor:
psql pgbench <<_eof1_
\qecho increasing replication factor to TWO
    alter system set citus.shard_replication_factor=2;
    select pg_reload_conf();
_eof1_
TIP: In order to replicate tables, set the replication factor before creating them.
In order to replicate the existing tables across the nodes, they are first undistributed by centralizing the tables onto the coordinator node. Redistributing them across the three-node cluster then automatically adds replication.
This centralizes the tables onto the coordinator node:
psql pgbench <<_eof1_
BEGIN;
    select undistribute_table('pgbench_history');
    select undistribute_table('pgbench_accounts');
    select undistribute_table('pgbench_branches');
    select undistribute_table('pgbench_tellers');
COMMIT;
_eof1_
The tables are redistributed and rebalanced across the three-node cluster with the correct redundancy of two:
psql pgbench <<_eof1_
BEGIN;
    select create_distributed_table('pgbench_history', 'aid');
    select create_distributed_table('pgbench_accounts', 'aid');
    select create_distributed_table('pgbench_branches', 'bid');
    select create_distributed_table('pgbench_tellers', 'tid');
COMMIT;

-- remove legacy tables from the coordinator node
BEGIN;
    select truncate_local_data_after_distributing_table($$public.pgbench_history$$);
    select truncate_local_data_after_distributing_table($$public.pgbench_accounts$$);
    select truncate_local_data_after_distributing_table($$public.pgbench_branches$$);
    select truncate_local_data_after_distributing_table($$public.pgbench_tellers$$);
COMMIT;
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 68834
number of failed transactions: 0 (0.000%)
latency average = 17.430 ms
latency stddev = 12.818 ms
initial connection time = 16.321 ms
tps = 1147.093369 (without initial connection time)
Benchmark results
Here’s a summary of the various benchmarking runs, giving you an idea of relative performance.
 RUN | NUM NODES | WITH IDX | REPLICATION FACTOR |  TPS
-----+-----------+----------+--------------------+------
   1 |     1     |    no    |         1          |   25
   2 |     1     |   yes    |         1          | 2255
   3 |     2     |   yes    |         1          | 2170
   4 |     3     |   yes    |         1          | 2377
   5 |     3     |   yes    |         2          | 1147
Caveat
Although not mentioned in the blog, there are a few things you’ll need to keep in mind as you work with CitusDB:
- The password for the account used to propagate operations across the cluster is stored in that account's home-directory .pgpass file on all nodes.
- The wal_level parameter must be set to logical (a sketch of this and the .pgpass entry follows this list).
- Redundancy can potentially decrease overall performance.
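A minimal sketch of the first two items follows; the wildcard host entry, port, and password are placeholders:

# ~/.pgpass on every node (file mode 0600)
# hostname:port:database:username:password
*:5432:*:postgres:MYPASSWORD

-- on each node, followed by a server restart
alter system set wal_level = 'logical';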
Keep in mind that the performance metrics depend not only upon the cluster's hardware and system resources but also on the level of tuning, which, in this case, has not been addressed.
A future blog will continue my exploration into Citus by scaling out pgbench into other architectures.
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.
Data Redundancy With the PostgreSQL Citus Extension
Over the years, I’ve had the opportunity to architect all sorts of configurations using Postgres as a backend. I’ve always found it very cool and satisfying to implement sophisticated business rules, often in more ways than one has fingers and toes. So, it’s no exaggeration when I say that Citus is one of the more interesting technologies that I’ve come across when scaling PostgreSQL.
Citus is an extension that horizontally scales PostgreSQL across multiple machines by sharding and creating redundant copies of tables. Its query engine not only parallelizes incoming SQL queries, including time series workloads, but also has the ability to create column-wise tables, making it ideal for OLAP analysis duties.
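For instance, a columnar table is created simply by choosing the columnar access method; the table and column names in this sketch are illustrative only:

-- columnar storage, well suited to append-mostly analytical data
create table perf_events (
    id       bigserial,
    recorded timestamptz default now(),
    payload  jsonb
) using columnar;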
Among its many capabilities, a Citus cluster can:
- Create distributed tables that are sharded across a cluster of PostgreSQL nodes to combine their CPU, memory, storage, and I/O capacity.
- Reference tables can be replicated to all nodes for joins and foreign keys from distributed tables and for maximum read performance.
- The distributed query engine can route and parallelize SELECT, DML, and other operations on distributed tables across the cluster.
- Columnar storage of tables can compress data, speeding up scans and supporting fast projections, both on regular and distributed tables.
Data redundancy, a database version of a RAID
Pondering the case of high availability and redundancy, one replicates data by creating a replica via streaming replication.
Now let’s stretch our imagination and consider a second method of high availability, ala Citus.
The best way to describe the Citus way of doing things is to reflect how data is managed by a disk RAID array. Depending on the configuration, one can tune a hardware RAID for either performance or redundancy. The same can be said for Citus data sharding.
Here is an example of a table named replica2x, which has 2X redundancy across a cluster of four (4) nodes. The colors indicate duplicated shards of the table. For example, if node citus1 goes offline, the sharded table it holds still has copies on nodes citus2 and citus4. Likewise, it can be said that if node citus2 goes offline, the same data is still available on nodes 1, 3, and 4.
About this POC
I’ll be upfront: I love working with Linux Containers, LXD. Much of what I will show you makes heavy use of them. You won’t need LXD to replicate this POC, of course, but I can’t say enough how flexible and versatile such an environment can be when prototyping a cluster of Postgres nodes, let alone an entire multi-data center infrastructure on a single workstation.
There are two parts to this POC:
- Part A: Setup
- Part B: Redundancy demonstration
Part A: POC setup
Step one: Getting and installing Citus
Referring to this earlier blog, you’ll see how to get and install Citus into your environment.
Step two: Creating the Citus nodes
The Citus cluster consists of five (5) nodes:
- citus-coord-01: coordinator
- citus1: worker
- citus2: worker
- citus3: worker
- citus4: worker
By using LXD, I created a single templated container with Citus on Ubuntu 20.04, where the various nodes were copied from this template.
for u in citus-coord-01 citus1 citus2 citus3 citus4
do
    echo "==== $u ===="
    lxc rm --force $u 2>/dev/null
    lxc cp template-ubuntu-2004-citusdb $u
    lxc start $u
done
And here’s the resultant cluster:
lxc ls -c ns4 citus
+----------------+---------+----------------------+
|      NAME      |  STATE  |         IPV4         |
+----------------+---------+----------------------+
| citus1         | RUNNING | 10.231.38.140 (eth0) |
+----------------+---------+----------------------+
| citus2         | RUNNING | 10.231.38.151 (eth0) |
+----------------+---------+----------------------+
| citus3         | RUNNING | 10.231.38.171 (eth0) |
+----------------+---------+----------------------+
| citus4         | RUNNING | 10.231.38.204 (eth0) |
+----------------+---------+----------------------+
| citus-coord-01 | RUNNING | 10.231.38.34 (eth0)  |
+----------------+---------+----------------------+
It’s understood that on each of the five nodes (a configuration sketch follows this list):
- Database db01 has already been created.
- The postgresql.conf configuration file has been appropriately edited for remote access.
- The .pgpass file has been configured to supply the superuser password for all nodes.
- Extension citus has been created in database db01.
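A sketch of what those prerequisites might look like on each node; the subnet and password shown are examples only:

# postgresql.conf: accept remote connections
listen_addresses = '*'

# pg_hba.conf: allow the other cluster members (example subnet)
host    all    postgres    10.231.38.0/24    md5

# ~/.pgpass for the postgres user (file mode 0600)
*:5432:*:postgres:MYPASSWORD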
Step three: Check packages
Inspecting the nodes confirms Citus has been correctly installed:
for u in citus1 citus2 citus3 citus4 citus-coord-01
do
    echo "==== NODE: $u ===="
    lxc exec $u -- su - postgres -c 'psql db01' <<_eof_
        select extname, extversion from pg_extension;
_eof_
done | less -S
==== NODE: citus1 ====
    extname     | extversion
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1
==== NODE: citus2 ====
    extname     | extversion
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1
==== NODE: citus3 ====
    extname     | extversion
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1
==== NODE: citus4 ====
    extname     | extversion
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1
==== NODE: citus-coord-01 ====
    extname     | extversion
----------------+------------
 plpgsql        | 1.0
 citus_columnar | 11.1-1
 citus          | 11.1-1
Properly installed, the Citus runtime variables are now available:
lxc exec citus1 -- su - postgres -c 'psql db01' <<_eof_ | less -S
    select name, setting, unit from pg_settings where name ~ 'citus' order by 1;
_eof_
                        name                        |     setting     | unit
----------------------------------------------------+-----------------+------
 citus.all_modifications_commutative                | off             |
 citus.background_task_queue_interval               | 5000            | ms
 citus.cluster_name                                 | default         |
 citus.coordinator_aggregation_strategy             | row-gather      |
 citus.count_distinct_error_rate                    | 0               |
 citus.cpu_priority                                 | 0               |
 citus.cpu_priority_for_logical_replication_senders | inherit         |
 citus.defer_drop_after_shard_move                  | on              |
 citus.defer_drop_after_shard_split                 | on              |
 citus.defer_shard_delete_interval                  | 15000           | ms
 citus.desired_percent_disk_available_after_move    | 10              |
 citus.distributed_deadlock_detection_factor        | 2               |
 citus.enable_binary_protocol                       | on              |
 citus.enable_create_role_propagation               | on              |
 citus.enable_deadlock_prevention                   | on              |
 citus.enable_local_execution                       | on              |
 citus.enable_local_reference_table_foreign_keys    | on              |
 citus.enable_repartition_joins                     | off             |
 citus.enable_statistics_collection                 | off             |
 citus.explain_all_tasks                            | off             |
 citus.explain_analyze_sort_method                  | execution-time  |
 citus.limit_clause_row_fetch_count                 | -1              |
 citus.local_hostname                               | localhost       |
 citus.local_shared_pool_size                       | 50              |
 citus.local_table_join_policy                      | auto            |
 citus.log_remote_commands                          | off             |
 citus.max_adaptive_executor_pool_size              | 16              |
 citus.max_cached_connection_lifetime               | 600000          | ms
 citus.max_cached_conns_per_worker                  | 1               |
 citus.max_client_connections                       | -1              |
 citus.max_high_priority_background_processes       | 2               |
 citus.max_intermediate_result_size                 | 1048576         | kB
 citus.max_matview_size_to_auto_recreate            | 1024            | MB
 citus.max_shared_pool_size                         | 100             |
 citus.max_worker_nodes_tracked                     | 2048            |
 citus.multi_shard_modify_mode                      | parallel        |
 citus.multi_task_query_log_level                   | off             |
 citus.node_connection_timeout                      | 30000           | ms
 citus.node_conninfo                                | sslmode=require |
 citus.propagate_set_commands                       | none            |
 citus.recover_2pc_interval                         | 60000           | ms
 citus.remote_task_check_interval                   | 10              | ms
 citus.shard_count                                  | 32              |
 citus.shard_replication_factor                     | 1               |
 citus.show_shards_for_app_name_prefixes            |                 |
 citus.skip_constraint_validation                   | off             |
 citus.skip_jsonb_validation_in_copy                | on              |
 citus.stat_statements_track                        | none            |
 citus.task_assignment_policy                       | greedy          |
 citus.task_executor_type                           | adaptive        |
 citus.use_citus_managed_tables                     | off             |
 citus.use_secondary_nodes                          | never           |
 citus.values_materialization_threshold             | 100             |
 citus.version                                      | 11.1.4          |
 citus.worker_min_messages                          | notice          |
 citus.writable_standby_coordinator                 | off             |
Step four: Define/configure the cluster
Log into the coordinator in order to declare and configure the cluster:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01' <<_eof_
    select citus_set_coordinator_host('citus-coord-01', 5432);

    insert into pg_dist_node(nodename)
        values ('citus1')
              ,('citus2')
              ,('citus3')
              ,('citus4');
_eof_
And here’s the cluster’s organization:
db01=# select nodeid,nodename,groupid,isactive from pg_dist_node order by 1;
 nodeid |    nodename    | groupid | isactive
--------+----------------+---------+----------
      1 | citus-coord-01 |       0 | t
      2 | citus1         |       1 | t
      3 | citus2         |       2 | t
      4 | citus3         |       3 | t
      5 | citus4         |       4 | t
Step five: Create, distribute, and populate a single table
Table myevents is created, and the newly inserted records are evenly distributed across the cluster of nodes.
Log in to the coordinator and execute the following commands. Notice that all DML and SQL statements are executed on the coordinator node:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01' <<_eof_
    -- create the table
    create table myevents (
        device_id  bigint,
        event_id   bigserial,
        event_time timestamptz default now(),
        data       jsonb not null,
        primary key (device_id, event_id)
    );

    -- distribute the events among the nodes
    select create_distributed_table('myevents', 'device_id');

    -- populate the table
    insert into myevents (device_id, data)
        select s % 100, ('{"measurement":'||random()||'}')::jsonb
          from generate_series(1,1000000) s;
_eof_
Querying the coordinator:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01' <<_eof_ | less -S
    select * from myevents
     where device_id = 1
     order by event_time desc, event_id desc
     limit 10;
_eof_
 device_id | event_id |          event_time           |                data
-----------+----------+-------------------------------+-------------------------------------
         1 |   999901 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2868659956537316}
         1 |   999801 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7931493079697731}
         1 |   999701 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.4875322951757288}
         1 |   999601 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.6491362745752653}
         1 |   999501 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9629266554851366}
         1 |   999401 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.1185674800281864}
         1 |   999301 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.5133762596297742}
         1 |   999201 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.7307634886202119}
         1 |   999101 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.2997471209159892}
         1 |   999001 | 2023-08-16 14:51:12.618467-07 | {"measurement": 0.9692520484104021}
Step six: A review of table distributions across the cluster
This demonstrates clearly that table myevents is hash-sharded across every node member:
- Notice how well-balanced the sharded tables are in size.
- Notice the table name numbering order: each successive shard number is found on the next node.
for u in citus1 citus2 citus3 citus4
do
    echo "==== NODE: $u ===="
    lxc exec $u -- su - postgres -c 'psql db01' <<_eof_
        \dt+
_eof_
done | less -S
==== NODE: citus1 ====
                                    List of relations
 Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description
--------+-----------------+-------+----------+-------------+---------------+---------+-------------
 public | myevents_102008 | table | postgres | permanent   | heap          | 2832 kB |
 public | myevents_102012 | table | postgres | permanent   | heap          | 2840 kB |
 public | myevents_102016 | table | postgres | permanent   | heap          | 5624 kB |
 public | myevents_102020 | table | postgres | permanent   | heap          | 2840 kB |
 public | myevents_102024 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102028 | table | postgres | permanent   | heap          | 5632 kB |
 public | myevents_102032 | table | postgres | permanent   | heap          | 2840 kB |
 public | myevents_102036 | table | postgres | permanent   | heap          | 6560 kB |
==== NODE: citus2 ====
                                    List of relations
 Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description
--------+-----------------+-------+----------+-------------+---------------+---------+-------------
 public | myevents_102009 | table | postgres | permanent   | heap          | 4696 kB |
 public | myevents_102013 | table | postgres | permanent   | heap          | 976 kB  |
 public | myevents_102017 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102021 | table | postgres | permanent   | heap          | 3768 kB |
 public | myevents_102025 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102029 | table | postgres | permanent   | heap          | 2840 kB |
 public | myevents_102033 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102037 | table | postgres | permanent   | heap          | 2840 kB |
==== NODE: citus3 ====
                                    List of relations
 Schema |      Name       | Type  |  Owner   | Persistence | Access method |    Size    | Description
--------+-----------------+-------+----------+-------------+---------------+------------+-------------
 public | myevents_102010 | table | postgres | permanent   | heap          | 1904 kB    |
 public | myevents_102014 | table | postgres | permanent   | heap          | 1904 kB    |
 public | myevents_102018 | table | postgres | permanent   | heap          | 1904 kB    |
 public | myevents_102022 | table | postgres | permanent   | heap          | 4696 kB    |
 public | myevents_102026 | table | postgres | permanent   | heap          | 8192 bytes |
 public | myevents_102030 | table | postgres | permanent   | heap          | 2832 kB    |
 public | myevents_102034 | table | postgres | permanent   | heap          | 976 kB     |
 public | myevents_102038 | table | postgres | permanent   | heap          | 4696 kB    |
==== NODE: citus4 ====
                                    List of relations
 Schema |      Name       | Type  |  Owner   | Persistence | Access method |  Size   | Description
--------+-----------------+-------+----------+-------------+---------------+---------+-------------
 public | myevents_102011 | table | postgres | permanent   | heap          | 5632 kB |
 public | myevents_102015 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102019 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102023 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102027 | table | postgres | permanent   | heap          | 2840 kB |
 public | myevents_102031 | table | postgres | permanent   | heap          | 2832 kB |
 public | myevents_102035 | table | postgres | permanent   | heap          | 1904 kB |
 public | myevents_102039 | table | postgres | permanent   | heap          | 4696 kB |
Part B: Redundancy demonstration
Method
- Step 1: Update shard replication factor from 1X to 2X
- Step 2: Create table myevents2x with 2X redundancy
- Step 3: Identify shard myevents2x_102040 across citus1 and citus2
- Step 4: Identify some records to query from shards known to be on nodes citus1 and citus2
- Step 5: Test
- Shut down citus1; perform the previously identified query
- Start citus1 and shut down citus2; perform the previously identified query
- Restart citus2; perform the previously identified query
ATTENTION: Please note that you may have to edit your own queries as the values may be different for your setup.
Step one:
Update shard replication factor from 1X to 2X:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01' <<_eof_
    show citus.shard_replication_factor;
    alter system set citus.shard_replication_factor=2;
    select pg_reload_conf();
_eof_
# validate
lxc exec citus-coord-01 -- su - postgres -c 'psql db01' <<_eof_
    show citus.shard_replication_factor;
_eof_
 citus.shard_replication_factor
--------------------------------
 2
Step two:
Table myevents2x is created and populated with a redundancy of 2X across the cluster:
lxc exec citus-coord-01 -- su - postgres -c 'psql db01' <<_eof_ | less -S
    -- create a new table with 2X redundancy
    create table myevents2x (
        device_id  bigint,
        event_id   bigserial,
        event_time timestamptz default now(),
        data       jsonb not null,
        primary key (device_id, event_id)
    );

    -- distribute the events among the nodes
    select create_distributed_table('myevents2x', 'device_id');

    -- confirm table has been added across the cluster
    select * from master_get_active_worker_nodes() order by 1;

    -- populate the table
    insert into myevents2x (device_id, data)
        select s % 100, ('{"measurement":'||random()||'}')::jsonb
          from generate_series(1,1000000) s;
_eof_
Here’s the output:
CREATE TABLE
 create_distributed_table
--------------------------

(1 row)

 node_name | node_port
-----------+-----------
 citus1    |      5432
 citus2    |      5432
 citus3    |      5432
 citus4    |      5432
(4 rows)

INSERT 0 1000000
Step three:
Locate shard myevents2x_102040, which should be on nodes citus1 and citus2:
for u in citus1 citus2 citus3 citus4
do
    echo "==== NODE: $u ===="
    lxc exec $u -- su - postgres -c 'psql db01' <<_eof_
        select tablename from pg_tables where tablename~'myevents2x' order by 1;
_eof_
done | less -S
Here’s the output:
==== NODE: citus1 ====
     tablename
-------------------
 myevents2x_102040
 myevents2x_102043
 myevents2x_102044
 myevents2x_102047
 myevents2x_102048
 myevents2x_102051
 myevents2x_102052
 myevents2x_102055
 myevents2x_102056
 myevents2x_102059
 myevents2x_102060
 myevents2x_102063
 myevents2x_102064
 myevents2x_102067
 myevents2x_102068
 myevents2x_102071
(16 rows)
==== NODE: citus2 ====
     tablename
-------------------
 myevents2x_102040
 myevents2x_102041
 myevents2x_102044
 myevents2x_102045
 myevents2x_102048
 myevents2x_102049
 myevents2x_102052
 myevents2x_102053
 myevents2x_102056
 myevents2x_102057
 myevents2x_102060
 myevents2x_102061
 myevents2x_102064
 myevents2x_102065
 myevents2x_102068
 myevents2x_102069
==== NODE: citus3 ====
     tablename
-------------------
 myevents2x_102041
 myevents2x_102042
 myevents2x_102045
 myevents2x_102046
 myevents2x_102049
 myevents2x_102050
 myevents2x_102053
 myevents2x_102054
 myevents2x_102057
 myevents2x_102058
 myevents2x_102061
 myevents2x_102062
 myevents2x_102065
 myevents2x_102066
 myevents2x_102069
 myevents2x_102070
==== NODE: citus4 ====
     tablename
-------------------
 myevents2x_102042
 myevents2x_102043
 myevents2x_102046
 myevents2x_102047
 myevents2x_102050
 myevents2x_102051
 myevents2x_102054
 myevents2x_102055
 myevents2x_102058
 myevents2x_102059
 myevents2x_102062
 myevents2x_102063
 myevents2x_102066
 myevents2x_102067
 myevents2x_102070
 myevents2x_102071
Step four:
Locate and return the first three records of shard myevents2x_102040 on nodes citus1 and citus2:
lxc exec citus1 -- su - postgres -c 'psql db01' <<_eof_ | less -S
    \qecho ==== citus1 ====
    \qecho select * from myevents2x_102040 order by 1,2 limit 3
    select * from myevents2x_102040 order by 1,2 limit 3;

    \c 'host=citus2 dbname=db01 user=postgres'
    \qecho ==== citus2 ====
    \qecho select * from myevents2x_102040 order by 1,2 limit 3
    select * from myevents2x_102040 order by 1,2 limit 3;

    \c 'host=citus-coord-01 dbname=db01 user=postgres'
    \qecho ==== coordinator ====
    \qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
    select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output:
==== citus1 ====
select * from myevents2x_102040 order by 1,2 limit 3
 device_id | event_id |          event_time           |                data
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
You are now connected to database "db01" as user "postgres" on host "citus2" (address "fd42:cb6a:5384:9a60:216:3eff:fe38>
==== citus2 ====
select * from myevents2x_102040 order by 1,2 limit 3
 device_id | event_id |          event_time           |                data
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
You are now connected to database "db01" as user "postgres" on host "citus-coord-01" (address "fd42:cb6a:5384:9a60:216:3>
==== coordinator ====
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
 device_id | event_id |          event_time           |                data
-----------+----------+-------------------------------+-------------------------------------
         8 |        8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443}
         8 |      108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634}
         8 |      208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step five:
The next few steps demonstrate the cluster's ability to continuously query and return the records found in shard myevents2x_102040, even as worker nodes are taken offline.
Step 5a: Test, shutdown node citus1
lxc stop citus1
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== citus1 is shutdown ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output:
"==== citus1 is shutdown ====" ==== querying coordinator ==== select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2 device_id | event_id | event_time | data -----------+----------+-------------------------------+------------------------------------- 8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443} 8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634} 8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step 5b: Test, restart citus1, and shutdown citus2
lxc start citus1
lxc stop citus2
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== citus2 is shutdown ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output; note that it’s exactly the same as the previous test:
"==== citus2 is shutdown ====" ==== querying coordinator ==== select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2 device_id | event_id | event_time | data -----------+----------+-------------------------------+------------------------------------- 8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443} 8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634} 8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Step 5c: Test, restart node citus2
lxc start citus2
lxc exec citus-coord-01 -- su - postgres -c 'psql db01'<<_eof_ | less -S
\qecho "==== cluster restored ===="
\qecho ==== querying coordinator ====
\qecho select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2
select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2;
_eof_
Here’s the output, of course!
"==== cluster restored ====" ==== querying coordinator ==== select * from myevents2x where device_id=8 and event_id in (8,108,208) order by 1,2 device_id | event_id | event_time | data ----------+----------+-------------------------------+------------------------------------- 8 | 8 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.5055415404961443} 8 | 108 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.9145586163936634} 8 | 208 | 2023-08-17 07:10:51.622082-07 | {"measurement": 0.8032392420487922}
Conclusion
Data redundancy is the hallmark of high availability. But with Citus, we’ve raised the bar. Can you think of a better system that can stay up without losing time initiating failovers when a node fails?
Have fun!
21
2023
Configuring PostgreSQL and LDAP Using StartTLS
Effectively working with LDAP as an authentication mechanism for PostgreSQL typically requires extensive knowledge in both domains. While trying to be as complete yet succinct as possible, I'm detailing how to enable TLS between a PostgreSQL server and the OpenLDAP server.
Ironically, the most complicated aspect has nothing to do with either PostgreSQL or OpenLDAP but with the steps of creating and signing private keys and certificates.
Note: I had seriously thought about leaving out many of the OpenLDAP commands, but I figured they might benefit you if, like me, you don't touch this area very often.
Summary steps
The underlying assumptions are:
- The PostgreSQL and LDAP services are installed on the same host.
- Username accounts, but not passwords, are created on the Postgres server.
- Both username accounts and passwords are configured on the LDAP server.
- Private/public keys, i.e., certificates, are to be created.
- A self-signed Certificate Authority, CA, exists on the host and has signed the aforementioned Postgres and LDAP certificates.
The environment
The Proof Of Concept described in this document consists of a single stand-alone server:
- Linux OS, Ubuntu 18.04
- Host name: my-ldap
- PostgreSQL version 14
- OpenLDAP version 2.4.45
- Distinguished Name: “cn=admin,dc=nodomain”
- password: admin
- topmost domain: “dn: dc=pg_user,dc=nodomain”
- superuser, postgres: “dn: cn=postgres,dc=pg_user,dc=nodomain”
- user1: “dn: cn=user1,dc=pg_user,dc=nodomain”
- user2: “dn: cn=user2,dc=pg_user,dc=nodomain”
- user accounts:
- postgres (password postgres)
- user1 (password user1)
- user2 (password user2)
Installation PostgreSQL
Apart from the standard steps of installing and configuring Postgres for remote access, edit the host-based authentication file enabling Postgres to refer to the LDAP service for authentication.
ROLEs and user accounts used in Postgres should be declared. Keep in mind that assigning passwords is NOT required:
-- Example
create role user1 with login password null;
create role user2 with login password null;
pg_hba.conf assumptions:
- The LDAP server address is 127.0.0.1, localhost, or the hostname if desired.
- Rules for both IPv4 and IPv6 are declared.
- NO encryption is enabled between the Postgres and LDAP services.
# IPv4 local connections:
host    all    all    0.0.0.0/0    ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"
# IPv6 local connections:
host    all    all    ::0/0        ldap ldapserver=127.0.0.1 ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain"
Installation LDAP
apt-get install -y slapd ldap-utils
Running netstat, (netstat -tlnp), returns the following:
Proto Recv-Q Send-Q Local Address    Foreign Address   State    PID/Program name
tcp        0      0 0.0.0.0:389      0.0.0.0:*         LISTEN   7742/slapd
tcp6       0      0 :::389           :::*              LISTEN   7742/slapd
Command line utilities
You can control the behavior of OpenLDAP by using these command line utilities:
- ldapmodify
- ldapadd
- ldapdelete
- ldapmodrdn
- ldapsearch
- ldapwhoami
Setting the administrator password
Administering the LDAP server requires setting the administrator password. Although the LDAP installation already sets this password (admin, in our case), executing the following command lets you reset it at will:
# Select "No" when asked to configure the database with dbconfig-common. # Set the domain name for your LDAP server, for example, "example.com". # Set the organization name for your LDAP server, for example, "Example Inc". # Set the administrator password for your LDAP server. dpkg-reconfigure slapd
Configuration
The following bash script demonstrates configuring OpenLDAP to authenticate three Postgres ROLES, i.e., postgres, user1, and user2:
#!/bin/bash
set -e
##########################################
#
# admin password is "admin"
# the top domain is assigned as "nodomain"
#
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# create the topmost domain
dn: dc=pg_user,dc=nodomain
objectClass: dcObject
objectClass: organization
dc: pg_user
o: Postgres Users
description: all postgres users reside here

# create SUPERUSER ROLE postgres
dn: cn=postgres,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: postgres
sn: postgres
userPassword: postgres
_eof_
#
##########################################
# ADD ROLES
ldapadd -v -xD "cn=admin,dc=nodomain" -w admin <<_eof_
# creating other user accounts, down the road
# create role user1
dn: cn=user1,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user1
sn: user1
userPassword: user1

# create role user2
dn: cn=user2,dc=pg_user,dc=nodomain
objectclass: top
objectclass: person
cn: user2
sn: user2
userPassword: user2
_eof_
Test connectivity without TLS
A simple login confirms LDAP and PostgreSQL are working correctly. Even though there is an encrypted session between psql and the Postgres server, there is no encrypted session between Postgres and LDAP as authentication is performed:
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=postgres password=postgres' -c "select 'ping' as test_connectivity"
 test_connectivity
-------------------
 ping

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user1 password=user1' -c "select 'ping from user1' as test_connectivity"
 test_connectivity
-------------------
 ping from user1

root@my-ldap:~# psql 'host=my-ldap dbname=postgres user=user2 password=user2' -c "select 'ping from user2' as test_connectivity"
 test_connectivity
-------------------
 ping from user2
Working With TLS Encryption
Install Additional Packages
To work with SSL certificates, these packages should be present, i.e., for the Ubuntu distribution:
apt install -y gnutls-bin ssl-cert
Managing the Certificate Authority
Authentication between the Postgres and LDAP servers requires confirming that the hosts making connection attempts are, in fact, legitimate. For that reason, the certificates for both servers must be signed, i.e., a Certificate Authority mechanism is required.
#
# Generate private key for self-signed Certificate Authority
#
certtool --generate-privkey --bits 4096 --outfile /etc/ssl/private/mycakey.pem
In this case, the CA certificate is configured to expire in ten years:
#
# Define CA certificate attributes
#
echo "cn = my-ldap ca
cert_signing_key
expiration_days = 3650" > /etc/ssl/ca.info
An internal system can get away with using self-signed CA certificates; otherwise, it is strongly recommended that your certificates be signed by an authorized CA. In this case, the certificate, once signed, is placed in the same directory where the other CA certificates are stored, i.e., /usr/local/share/ca-certificates:
#
# Generate a self-signed CA certificate and
# copy the CRT to the other trusted CA certificates used by
# both postgres and ldap servers (/usr/local/share/ca-certificates/)
#
certtool --generate-self-signed \
    --load-privkey /etc/ssl/private/mycakey.pem \
    --template /etc/ssl/ca.info \
    --outfile /usr/local/share/ca-certificates/mycacert.crt
ATTENTION: Once signed, the CA certificate is copied onto the Postgres and LDAP servers respectively. In this case, as they are on the same host, it is located on host my-ldap. Otherwise, it MUST be copied to all Postgres and LDAP hosts.
Once copied into the correct directory, the list of CA certificates is updated, adding the self-signed CA certificate:
#
# Update the list of CA certificates on both
# postgres and LDAP servers
#
update-ca-certificates
Generating the LDAP public/private key
From here on, it is important to include the fully qualified domain name of the LDAP certificate hostname, i.e., my-ldap.
Generate a private key for LDAP server:
certtool --generate-privkey --bits 2048 --outfile /etc/ldap/my-ldap_slapd_key.pem
Define LDAP certificate attributes:
# for a certificate request which expires
# in one year
#
echo "organization = mycompany
cn = my-ldap
tls_www_server
encryption_key
signing_key
expiration_days = 365" > /etc/ssl/my-ldap.info
Sign the LDAP private key using the self-signed Certificate Authority certificate and its private key:
certtool --generate-certificate --load-privkey /etc/ldap/my-ldap_slapd_key.pem --load-ca-certificate /etc/ssl/certs/mycacert.pem --load-ca-privkey /etc/ssl/private/mycakey.pem --template /etc/ssl/my-ldap.info --outfile /etc/ldap/my-ldap_slapd_cert.pem
Update access permissions of the private key:
sudo chgrp openldap /etc/ldap/my-ldap_slapd_key.pem
sudo chmod 0640 /etc/ldap/my-ldap_slapd_key.pem
Create and save the LDAP TLS configuration file, certinfo.ldif:
echo "dn: cn=config add: olcTLSCACertificateFile olcTLSCACertificateFile: /etc/ssl/certs/mycacert.pem - add: olcTLSCertificateFile olcTLSCertificateFile: /etc/ldap/my-ldap_slapd_cert.pem - add: olcTLSCertificateKeyFile olcTLSCertificateKeyFile: /etc/ldap/my-ldap_slapd_key.pem" > /etc/ldap/schema/certinfo.ldif
Enable OpenLDAP to use TLS:
ldapmodify -Y EXTERNAL -H ldapi:/// -f /etc/ldap/schema/certinfo.ldif
Validation test #1
Validation returns the string “anonymous”:
ldapwhoami -x -ZZ -H ldap://my-ldap
Validation test #2
This test confirms that the previous behavior without TLS still works:
# without TLS
ldapsearch -x -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin
This test should return the exact output of the previous ldapsearch. A failure would be indicated by a short error message, since the "-ZZ" switch requires a successful StartTLS operation:
# with TLS
ldapsearch -x -ZZ -H ldap://my-ldap -b dc=nodomain -D cn=admin,dc=nodomain -w admin
Working with PostgreSQL using TLS
Configuration
This is the host-based rule with TLS configured; notice the minor edit: the new ldaptls=1 parameter:
# IPv4 local connections:
hostssl    all    all    0.0.0.0/0    ldap ldapserver=my-ldap ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1
# IPv6 local connections:
host       all    all    ::0/0        ldap ldapserver=my-ldap ldapprefix="cn=" ldapsuffix=", dc=pg_user, dc=nodomain" ldaptls=1
Validation test: TLS
After updating pg_hba.conf with the new argument, ldaptls=1, a server reload is executed, followed by a psql ping:
systemctl reload postgresql@14-main
root@my-ldap:~# psql 'host=10.231.38.243 dbname=postgres user=postgres password=postgres' -c "select 'ping from postgres' as test_connectivity"
 test_connectivity
--------------------
 ping from postgres
05
2023
Configuring PgBouncer for Multi-Port Access
From time to time, situations occur where unusual circumstances dictate out-of-the-box thinking.
For example, suppose you have a system where you've installed multiple data clusters onto a single host. What you end up with is a host dividing and sharing valuable system resources, i.e., CPU, RAM, disk, etc., between multiple PostgreSQL data clusters. While easy to do in a development environment, it represents technical debt once it reaches production. Sooner or later, one must address this issue; otherwise, one risks handicapping the entire database infrastructure.
Let's now move forward in time: your system has scaled, of course, and this shortcut of using multiple data clusters on a single host has become a performance bottleneck. The problem is you either don't or can't refactor your application servers; something, maybe, about not having enough time in the day. And, as you may already know, while Postgres can listen on a UNIX domain socket as well as on IPv4 and IPv6 addresses, it is nevertheless constrained to a single port.
So what do you do?
For the experienced sysadmin, there are actually quite a number of “magical” techniques. However, in this case, with a little help from systemd, which manages all service processes, we will solve this little puzzle using PgBouncer with a concoction of configuration files.
Scenario
Configure the system such that Postgres resides on its default port of 5432 and PgBouncer sits on three ports, i.e., 6432, 6433, and 6434, accessing the resident Postgres server.
The PgBouncer connection pooler will use an administrative account, the ROLE pgbouncer, for the purpose of user authentication. Authentication is achieved by accessing the Postgres server's pg_shadow table and comparing the resultant hash against all incoming connections (this won't work for cloud setups such as, for example, Amazon RDS).
A set of Systemd configuration files will be created and edited in order to manage the PgBouncer service.
About the files
Below is a summary of the files and how they will be edited. Remember, these configuration files are of a hypothetical nature using minimal settings, which, of course, will need to be updated to match a realistic production environment.
├── pgbouncer
│   ├── pgbouncer.ini
│   └── userlist.txt
├── postgres
│   ├── add_function.sql
│   └── add_user.sql
└── systemd
    ├── pgbouncer_override
    └── pgbouncer.socket
pgbouncer.ini
This configuration file defines all behavior and is installed in its default location, “/etc/pgbouncer”.
Only one domain socket is used. For our purposes, the listen_port runtime parameter is just noise and is superseded by the other runtime parameters as declared in file pgbouncer.socket.
TIP: Backup the original pgbouncer.ini as it references ALL runtime parameters.
[databases]
* = host=localhost

[users]
# left blank

[pgbouncer]
logfile = /var/log/postgresql/pgbouncer.log
pidfile = /var/run/postgresql/pgbouncer.pid

;; these parameters are implicitly disabled
listen_addr = 0.0.0.0
listen_port = 6432

unix_socket_dir = /var/run/postgresql
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_user = pgbouncer
auth_query = SELECT p_user, p_password FROM public.lookup($1)
pool_mode = session

;; Use <appname - host> as application_name on server.
application_name_add_host = 1
userlist.txt
Contains a single user account and its password for authentication. Note that the md5 hashed password is pgbouncer.
"pgbouncer" "md5be5544d3807b54dd0637f2439ecb03b9"
add_user.sql
Adds the ROLE “pgbouncer” to the Postgres data cluster. While under normal circumstances this is not required, PgBouncer uses this ROLE in order to validate all logins.
For our purposes, the password is “pgbouncer”:
--
-- EXECUTE AS SUPERUSER postgres
--
CREATE ROLE pgbouncer WITH LOGIN PASSWORD 'pgbouncer';
add_function.sql
This is a user-defined function that ROLE PgBouncer executes in order to obtain the password hash from pg_shadow.
This SQL statement must be executed against each and every database that is to be accessed by any ROLE connection using PgBouncer.
TIP: Execute this SQL against database template1, and the function call will be included with every newly created database thereafter.
--
-- EXECUTE AS SUPERUSER postgres
--
-- execute on each database user accounts will login
--
CREATE FUNCTION public.lookup (
    INOUT p_user     name,
    OUT   p_password text
) RETURNS record
    LANGUAGE sql SECURITY DEFINER SET search_path = pg_catalog
    AS $$SELECT usename, passwd FROM pg_shadow WHERE usename = p_user$$;

-- make sure only "pgbouncer" can use the function
REVOKE EXECUTE ON FUNCTION public.lookup(name) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION public.lookup(name) TO pgbouncer;
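A quick way to roll this out is to run the script against template1 (so future databases inherit it) and against each existing database. The invocation below is a hypothetical sketch, assuming the SQL above was saved as add_function.sql and that pg_hba.conf lets the pgbouncer ROLE connect:

# as the postgres OS user
psql -d template1 -f add_function.sql     # inherited by databases created later
psql -d db01      -f add_function.sql     # apply to an existing database
# sanity check: the same auth_query that PgBouncer will run
psql -d db01 -U pgbouncer -c "SELECT * FROM public.lookup('usr1');"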
pgbouncer_override
This systemd drop-in file overrides key options in the default PgBouncer unit file and will never be overwritten even after updating the PgBouncer Linux RPM/DEB packages.
As root, execute the following command and paste the contents of the provided file pgbouncer_override:
systemctl edit pgbouncer
#
# systemctl edit pgbouncer
# systemctl daemon-reload
#
[Unit]
Requires=pgbouncer.socket
pgbouncer.socket
This is the secret sauce; this file enables PgBouncer to listen on all three ports, 6432, 6433, and 6434. You will note that adding or removing ports is a simple matter of adding or removing addresses as per the format shown in the file below.
As root: create this file and perform a daemon reload:
vi /etc/systemd/system/pgbouncer.socket
#
# vi /etc/systemd/system/pgbouncer.socket
# systemctl daemon-reload
#
[Unit]
Description=sockets for PgBouncer
PartOf=pgbouncer.service

[Socket]
ListenStream=6432
ListenStream=6433
ListenStream=6434
ListenStream=/var/run/postgresql/.s.PGSQL.6432
ReusePort=true
RemoveOnStop=true

[Install]
WantedBy=sockets.target
systemctl daemon-reload
Putting it all together
After all the configuration files have been created and edited, this is what we get:
systemctl restart pgbouncer
netstat -tlnp
Proto Recv-Q Send-Q Local Address     Foreign Address   State
tcp        0      0 0.0.0.0:22        0.0.0.0:*         LISTEN
tcp        0      0 127.0.0.1:5432    0.0.0.0:*         LISTEN
tcp        0      0 127.0.0.53:53     0.0.0.0:*         LISTEN
tcp6       0      0 :::22             :::*              LISTEN
tcp6       0      0 ::1:5432          :::*              LISTEN
tcp6       0      0 :::6432           :::*              LISTEN
tcp6       0      0 :::6433           :::*              LISTEN
tcp6       0      0 :::6434           :::*              LISTEN
Finally, perform the following:
- sudo as Postgres
- create a database db01
- create a ROLE usr1 accessing database db01
- update pg_hba.conf and postgresql.conf to allow localhost connections by PgBouncer (a minimal sketch of these steps follows below)
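Here is a minimal sketch of those preparatory steps; the role and database names match the validation loop below, while the pg_hba.conf and postgresql.conf entries are only illustrative assumptions:

-- run as the postgres superuser
CREATE ROLE usr1 WITH LOGIN PASSWORD 'usr1';
CREATE DATABASE db01 OWNER usr1;
-- pg_hba.conf:      host  all  all  127.0.0.1/32  md5
-- postgresql.conf:  listen_addresses = 'localhost'
-- then reload:      systemctl reload postgresql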
Now validate the ports:
for u in 5432 6432 6433 6434
do
    echo "==== port: $u ==="
    export PGHOST=localhost PGPORT=$u PGDATABASE=db01
    psql "user=usr1 password=usr1" -c "select 'hello world' as greetings"
done
And here’s the output:
==== port: 5432 === greetings ------------- hello world (1 row) ==== port: 6432 === greetings ------------- hello world (1 row) ==== port: 6433 === greetings ------------- hello world (1 row) ==== port: 6434 === greetings ------------- hello world (1 row)
Conclusion
Because PostgreSQL was designed from the ground up to work in conjunction with the OS, i.e., a UNIX type of operating system, we’ve been able to solve an interesting problem using a novel approach. And while systemd is ubiquitous, it isn’t normally considered part of a solution to a Postgres puzzle. We call this hacking 🙂
Happy Travels!
02
2023
An Argument for Logical Failover Slots
These days, the typical PostgreSQL cluster is not only a multi-node streaming replication cluster, ordinarily asynchronous, but can sometimes also include a logical replication component.
Recall that logical replication works by using the PUB/SUB mode, where individual tables are published and are then subscribed by remotely connected databases/tables. DML operations, such as INSERT, UPDATE, DELETE, TRUNCATE, etc., are then replicated from one or more publishers to one or more subscribers.
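As a refresher, a minimal PUB/SUB pairing looks something like the following sketch (hypothetical host, database, and table names):

-- on the publisher (the PRIMARY)
CREATE PUBLICATION my_pub FOR TABLE t1;

-- on the subscriber
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary-host dbname=db01 user=replicator'
    PUBLICATION my_pub;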
For the most part it’s a pretty straightforward operation. The only real limitation is that one can only publish tables from a read-write server such as the PRIMARY.
There’s a problem, however, when it comes to executing failovers. While a STANDBY/REPLICA host can quickly take over with a simple select pg_promote() SQL statement, the same cannot be said regarding logical replication. Instead, failing over is, in fact, not a single action but a collection of discrete steps that must be performed in an exact and precise manner.
Consider the following steps typifying a failover:
- Promote the REPLICA, thus becoming the new PRIMARY.
- When present, redirect other REPLICAs to point to the new PRIMARY.
- Failover the logical replication slot:
- Block all logical replication from the failed PRIMARY.
- Create a logical replication slot on the newly promoted PRIMARY.
- Alter the subscription and point to the newly promoted PRIMARY.
- Resume/restore logical replication between publisher and subscriber (a rough sketch of the slot-related steps follows this list).
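The following sketch, using hypothetical names, illustrates those slot-related steps; note that this naive sequence is exactly what offers no guarantee against losing changes that occurred around the promotion:

-- on the newly promoted PRIMARY: recreate the slot the subscription expects
-- (by default the slot name matches the subscription name)
SELECT pg_create_logical_replication_slot('my_sub', 'pgoutput');

-- on the subscriber: repoint the connection and resume
ALTER SUBSCRIPTION my_sub DISABLE;
ALTER SUBSCRIPTION my_sub CONNECTION 'host=new-primary dbname=db01 user=replicator';
ALTER SUBSCRIPTION my_sub ENABLE;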
The challenge, of course, is reconstituting the logical replication process as quickly as possible, ideally without any data loss.
Two common solutions come to mind:
- Block all incoming connections before promoting the STANDBY while simultaneously installing the logical replication slot.
- Promote the STANDBY and then flush and recopy the entire contents of the subscribed table(s).
In both cases, these solutions make a fast recovery problematic, not to mention the additional effort required of the SRE/DBA/DEV preparing the requisite instructions.
In an ideal world, the perfect implementation of a logical slot failover would have one already in place on the STANDBY, thus guaranteeing data consistency, without fear of data loss, no matter how fast or slow the actual promotion, or the altering of the subscription's connection parameters, may be.
The good news is that there are hacks that can speed up the entire recovery process, although they all have their own quirky limitations. Patroni, for example, implements one such technique by copying the file named state, which is located in the data cluster's subdirectory pg_replslot, over to the REPLICA's data cluster in the same location.
# PGDATA on an Ubuntu install
/var/lib/postgresql/15/main/pg_replslot/pg3
└── state
However, the caveat is that it requires a full server restart to the REPLICA, in addition to being promoted, before the logical slot is fully active. And, of course, timing is critical.
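In shell terms, the hack amounts to something like this sketch (hypothetical host name and Ubuntu paths; the timing caveat above still applies):

# copy the slot's on-disk state into the same location on the REPLICA
rsync -a /var/lib/postgresql/15/main/pg_replslot/pg3 replica:/var/lib/postgresql/15/main/pg_replslot/
# a full server restart of the REPLICA is required before the slot becomes active
ssh replica 'systemctl restart postgresql@15-main'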
So there you have it.
Too bad such a feature doesn’t exist… or does it?
Stay tuned folks; the game is afoot.
For the curious, I’ve included some references in past blogs, courtesy of Jobin:
- Failover of Logical Replication Slots in PostgreSQL
- How Patroni Addresses the Problem of the Logical Replication Slot Failover in a PostgreSQL Cluster
And here’s another set of references: a valiant attempt to incorporate this feature into PostgreSQL a few years ago.
- PostgreSQL wiki: here
- Part of a mail thread back in 2016: here
- An old blog located on http://webarchive.org from a now-defunct company: here
23
2022
Diffing PostgreSQL Schema Changes
One of the routine operations when administering PostgreSQL is periodic updates to the database system’s architecture. PostgreSQL does a good job of allowing one to update a schema, add types, functions, triggers, or alter a table by adding and removing columns and updating column data types, etc., in a reliable manner. However, there is no built-in mechanism to help identify the differences, let alone generate the necessary SQL, to accomplish updates in an easy manner from the development to the production environment.
So let’s talk about possible approaches to schema changes.
Using logical dump manifests
The easiest way to identify changes between schemas from one database to another is to compare schema dump manifests.
The following example demonstrates an approach one can take looking for differences between schema on different databases:
EXAMPLE:
-- create database schemas
create database db01;
create database db02;
-- db01: version 1
create table t1 (
    c1 int,
    c2 text,
    c4 date
);
create table t2 (
    c1 int,
    c2 varchar(3),
    c3 timestamp,
    c4 date
);
-- db02: version 2
create table t1 (
    c1 serial primary key,
    c2 varchar(256),
    c3 date default now()
);
create table t2 (
    c1 serial primary key,
    c2 varchar(3),
    c3 varchar(50),
    c4 timestamp with time zone default now(),
    c5 int references t1(c1)
);
create index on t2 (c5);
# generate schema dumps
pg_dump -s db01 -Fc > db01.db
pg_dump -s db02 -Fc > db02.db
# generate manifests
pg_restore -l db01.db > db01_manifest.ini
pg_restore -l db02.db > db02_manifest.ini
This snippet demonstrates looking for differences by comparing the md5 checksums:
# EX 1: generate checksums
md5sum \
    <(tail +16 db01_manifest.ini | cut -d ' ' -f 4-) \
    <(tail +16 db02_manifest.ini | cut -d ' ' -f 4-)

# output
9d76c028259f2d8bed966308c256943e  /dev/fd/63
ba124f9410ea623085c237dc4398388a  /dev/fd/62
This next snippet diffs the two manifests, identifying only those objects and attributes that have changed. Notice that the redundant header information at the top of each manifest is skipped:
# EX 2: perform diff
diff \
    <(tail +16 db01_manifest.ini | cut -d ' ' -f 4-) \
    <(tail +16 db02_manifest.ini | cut -d ' ' -f 4-)
This resultant diff shows the changes made between the two schemas:
1a2,3
> SEQUENCE public t1_c1_seq postgres
> SEQUENCE OWNED BY public t1_c1_seq postgres
2a5,12
> SEQUENCE public t2_c1_seq postgres
> SEQUENCE OWNED BY public t2_c1_seq postgres
> DEFAULT public t1 c1 postgres
> DEFAULT public t2 c1 postgres
> CONSTRAINT public t1 t1_pkey postgres
> CONSTRAINT public t2 t2_pkey postgres
> INDEX public t2_c5_idx postgres
> FK CONSTRAINT public t2 t2_c5_fkey postgres
The good news is that there are a number of existing tools that can reconcile differences between a proposed schema design and the target schema:
- Commercial offerings can differentiate schema between databases in an elegant and efficient manner. Researching, ala Google, yields the most popular technologies one can use.
- In regards to open source solutions, there are a number of projects capable of diffing Postgres database schemas.
Working with the apgdiff extension
The following is an example implementation of the open source tool apgdiff.
Apgdiff can be found in the Postgres community repository. It compares two schema dump files and creates an SQL output file that is, for the most part, suitable for upgrades of old schemata:
Package: apgdiff Version: 2.7.0-1.pgdg18.04+1 Architecture: all Maintainer: Debian PostgreSQL Maintainers <team+postgresql@tracker.debian.org> Installed-Size: 173 Depends: default-jre-headless | java2-runtime-headless Homepage: https://www.apgdiff.com/ Priority: optional Section: database Filename: pool/main/a/apgdiff/apgdiff_2.7.0-1.pgdg18.04+1_all.deb Size: 154800 SHA256: 9a83fcf54aed00e1a28c3d00eabe1c166977af1e26e91035e15f88b5215b181b SHA1: ea713acb55898f07374dadd1bebb09ec2fa4b589 MD5sum: e70a97903cb23b8df8a887da4c54e945
The following example demonstrates how one can update differences between the development environment and the production database schema using apgdiff.
EXAMPLE:
apt install -y apgdiff
# EX 1: dump as SQL statements
pg_dump -s db01 -Fp > db01.sql
pg_dump -s db02 -Fp > db02.sql

createdb db03 --template=db01
apgdiff --ignore-start-with db01.sql db02.sql > db01-db02.sql

# "psql -1" encapsulates statements within a transaction
psql -1 -f db01-db02.sql db03
# EX 2: uses logical dumps
# notice the dumps are standard logical dumps and include data
pg_dump db01 -Fc > db01.db
pg_dump db02 -Fc > db02.db

createdb db03 --template=db01

# this invocation assumes the resultant diff doesn't require editing
apgdiff --ignore-start-with \
    <(pg_restore -s -f - db01.db) \
    <(pg_restore -s -f - db02.db) \
    | psql -1 db03
There’s more you can accomplish with these simple approaches. By incorporating variations of these, one can create fairly sophisticated shell scripts with little code and, with a little luck, not that much effort.
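For instance, a hypothetical wrapper that simply reports whether two databases' schemas have drifted could be as small as this sketch, which reuses the manifest-checksum trick from above:

#!/bin/bash
# schema_drift.sh db01 db02 -- exits non-zero when the schemas differ
set -e
checksum() {
    # checksum the manifest of a schema-only dump, skipping the header lines
    pg_dump -s "$1" -Fc | pg_restore -l | tail -n +16 | cut -d ' ' -f 4- | md5sum | cut -d ' ' -f 1
}
A=$(checksum "$1")
B=$(checksum "$2")
[ "$A" = "$B" ] && echo "schemas match" || { echo "schemas differ"; exit 1; }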
14
2022
Powering PostgreSQL 15 With Columnar Tables
Prologue
This blog is the result of my investigation into column-wise tables. Hopefully, you will find it useful in your deliberations.
When I started writing, this was meant to be a "good news" blurb. I was pretty optimistic that I'd have truly impressive numbers to share, but as it turns out, while there is a potential for significant advantage, a detailed testing regime of the Citus columnar store extension should be carried out before implementing it in your own environment.
The missing sizzle
Sizzle: A phrase used to show affirmation or approval. It references the sound that one hears when cooking certain delicious foods i.e. bacon or fajitas and therefore transfers those feelings to non-edible objects or ideas.
There’s a lot to take in and I’m often impressed with every new version of PostgreSQL that is released. Nevertheless, there’s been a long-standing feature that, to be honest, I’ve always felt was missing. This weakness is being addressed by the industry. Ironically, some of these implementations are in fact Postgres derivatives.
What is this sizzling feature that I speak of you ask?
Column-wise tables!
When to consider column-wise tables
Consider using the CITUS Columnar Extension under one or more of the following conditions:
- In regards to table attributes:
- they are wide with “many” columns
- size should at a minimum be in the tens of GB
- the typical row is byte “heavy”
- System space is at a premium i.e. it’s worth your while to manage space utilization as much as possible.
- OLAP is a major component of overall activity i.e. lots of different kinds of SELECT statements.
- INSERT performance is not a priority.
- Indexes are not feasible
- Creating EXPRESSION indexes on columnar tables is faster by orders of magnitude than on heap tables.
CAVEAT: You cannot perform UPDATE, DELETE operations on a columnar table.
OLTP vs OLAP
Let’s get back to basics. In the database world there are essentially two types of database operations:
- Online Transaction Processing, OLTP: Online transaction processing applications have high throughput and are insert- or update-intensive in database management. These applications are used concurrently by hundreds of users. The key goals of OLTP applications are availability, speed, concurrency, and recoverability.
- Online analytical processing, OLAP: Online analytical processing applications enable users to analyze multidimensional data interactively from multiple perspectives which consist of three basic analytical operations:
- consolidation (roll-up)
- drill-down
- slicing and dicing
About columnar tables
As an RDBMS, PostgreSQL is geared more toward OLTP than OLAP operations. Manipulating data by inserting, updating, deleting, etc., it does well. But while eminently capable of performing OLAP, it’s not quite as efficient. The primary reason is actually a pretty common characteristic among most RDBMS i.e. it stores and processes collections of data as individual records, otherwise known as row-wise processing.
Suppose you are performing a deep-dive analysis involving a table of records containing 50 columns. And further, suppose that your query only needs a single column. It still needs to read all of the data from those 50 columns per row in order to access that single column. And if the query processes 100 million rows, that’s definitely not trivial!
Now let’s consider a reorientation of the table i.e. column-wise. In this case, the query only reads one column and not 50. The result is that the query is now much lighter, requiring less IO and processing power yet achieving the same result in significantly less time.
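As a toy illustration (hypothetical table and column names), a query such as the one below touches only one of the 50 columns; that is precisely the access pattern a column-wise layout rewards:

-- row-wise storage still reads every column of every row it visits;
-- column-wise storage reads only the chunks belonging to c42
SELECT avg(c42)
FROM measurements
WHERE sample_day >= '2022-01-01';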
Introducing the CITUS columnar extension
As a General Purpose Database Management System, basic behaviors can be reconfigured in Postgres for different purposes. And as such, it is possible to enhance PostgreSQL with columnar tables by using the CITUS columnar table extension.
The columnar extension is just one part of the larger Citus feature set which, when fully implemented, creates a fully scalable, distributed Postgres database system.
The CITUS columnar extension feature set includes:
- Highly compressed tables:
- Reduces storage requirements
- Reduces the IO needed to scan the table
- Projection Pushdown: Only those column(s) within the SELECT are returned further reducing IO.
- Chunk Group Filtering: Allows queries to skip over whole groups of unrelated data without processing them.
The complete CITUS feature set which, except for the Columnar storage component, is not covered in this blog, includes:
- Distributed tables
- Reference tables
- Distributed query engine that routes and parallelizes SELECT and DML operations
- Columnar storage
- compresses data
- speeds up scans,
- supports fast projections
- Query from any node
Getting it
This is a bit of a bad-news, good-news, excellent-news situation.
BAD-NEWS: At the time of writing this blog, the only publicly available packages on the Postgres repositories that I could find were the cstore foreign data wrappers which cover only version 12 and older versions of Postgres.
GOOD-NEWS: The CITUS extension is available on the Citus Data download page.
EXCELLENT-NEWS: As part of the process of writing this blog, custom DEB and RPM packages were created for PostgreSQL version 15 for CENTOS 7 and 8 and Ubuntu 18.04 and 20.04, which are available for download from github HERE.
First-time installation
Assuming you opted for the easy way, installing the packages made for this blog:
- Go to the PostgreSQL download page and configure your packaging system for your Linux distribution.
- Download the appropriate custom-made columnar package for your version of Linux:
# Ubuntu 18.04
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus-columnar_11.1-1-UBUNTU1804_amd64.deb

# Ubuntu 20.04
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb

# Centos 7
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus_columnar-11.1-1-CENTOS7.x86_64.rpm

# Centos 8
wget https://github.com/rbernierZulu/columnar-tables-extension/blob/main/postgresql-15-citus_columnar-11.1-1-CENTOS8.x86_64.rpm
- Install the package: note that all the dependent packages, including the Postgres server, will be automatically installed onto your host. Pay attention to CENTOS 7 as it requires the epel-release repository too.
# Ubuntu 18.04
apt update
apt-get install ./postgresql-15-citus-columnar_11.1-1-UBUNTU1804_amd64.deb

# Ubuntu 20.04
apt update
apt-get install ./postgresql-15-citus-columnar_11.1-1-UBUNTU2004_amd64.deb

# Centos 7
# ATTENTION: the epel package must be installed beforehand!
yum install epel-release
yum install ./postgresql-15-citus_columnar-11.1-1-CENTOS7.x86_64.rpm

# Centos 8
dnf -qy module disable postgresql
dnf install ./postgresql-15-citus_columnar-11.1-1-CENTOS8.x86_64.rpm
dnf check-update
- Initialize the postgres data cluster and configure appropriately for your distribution of Linux in order to login to postgres.
A working example of columnar tables
What follows here are the results of my analysis. Of course, there’s always more that can be said. Hopefully, this will give you enough of an overview of the possibilities of this extension.
Installing the extension into your database is pretty much standard fare:
-- create your extension in the database:
create database db01;
\c db01
create extension citus_columnar;
Here are two tables, of type HEAP and COLUMNAR, that will be used for the initial investigations. You’ll notice that it took less time to create the regular HEAP accessed table than the columnar table:
drop table if exists t1,t2;
\timing

-- Time: 7628.612 ms (00:07.629)
create table if not exists t1(id,qty) using heap
    as select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);

-- Time: 15635.768 ms (00:15.636)
create table if not exists t2(id,qty) using columnar
    as select (random()*10)::int, (random()*1000)::int from generate_series(1,10e6);
Table "public.t1" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | | qty | integer | | | Access method: heap Table "public.t2" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- id | integer | | | qty | integer | Access method: columnar
The number of records is roughly 10 million:
with a(row_wise)    as (select count(*) from t1),
     b(column_wise) as (select count(*) from t2)
select row_wise, column_wise from a, b;

 row_wise | column_wise
----------+-------------
  9900001 |     9900001
This is very cool, the columnar compression does a great job of shrinking the table:
\dt+ t[12]
List of relations
Schema | Name | Type | Owner | Persistence | Access method | Size |
--------+------+-------+----------+-------------+---------------+--------+
public | t1 | table | postgres | permanent | heap | 346 MB |
public | t2 | table | postgres | permanent | columnar | 27 MB |
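If you want to experiment with the compression behavior, the extension also exposes per-table settings. The call below is a sketch based on the Citus columnar documentation, so verify the exact function name and options for your version:

-- switch table t2 to zstd compression; subsequently written stripes use the new setting
SELECT alter_columnar_table_set('t2', compression => 'zstd');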
SQL statements, preliminary
Let’s begin by comparing basic administration and SELECT statements of a HEAP vs COLUMNAR table.
Examining the tabulated results, you'll see that the timings indicate columnar tables, at best, perform similarly to a HEAP table, but most of the time they take longer to execute the same operations.
Using a psql session, the following SQL statements are executed and examined for performance differences:
Statement                      | HEAP TABLE | COLUMNAR TABLE
-------------------------------+------------+---------------
#1 (create table)              | 7.6s       | 15.6s
#2 (adding 5 million records)  | 13.7s      | 18.5s
#3                             | 4.9s       | 7.8s
SQL statement query plans
Part one (10 million records, two column tables)
Using the aforementioned table definitions, the following metrics were generated with the runtime parameter max_parallel_workers_per_gather = 4.
It’s quite evident that, at least for these two tables, there’s no performance benefit of a columnar accessed table over a regular heap accessed one:
Query | HEAP TABLE  | COLUMNAR TABLE
------+-------------+---------------
#1    | 742.411 ms  | 914.096 ms
#2    | 6441.425 ms | 5871.620 ms
#3    | 329.562 ms  | 902.614 ms
#4    | 531.525 ms  | 1602.756 ms
Part two (five million records, 100 column tables)
In order to get a better idea of performance differences, a second set of tables at a greater scale were created. However, this time, while the number of records was halved, the number of columns was increased from two to one hundred.
Even if most of the columns are simply copies of one another, the columnar table’s resultant compression is remarkable as the default size is reduced by a factor of 752X.
/* TABLES Table "public.t[34]" Column | Type | Collation | Nullable | Default --------+--------+-----------+----------+---------------------------------------------------------------------------- c1 | bigint | | not null | nextval('t1_c1_seq'::regclass) c2 | bigint | | | c3 | text | | | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text c4 | text | | | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text . . . c98 | text | | | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text c99 | text | | | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text c100 | text | | | 'aowjfa fawjfawofjawofjawoifawevvaerarpfjkaofvaweawe[OJARGOIAJOAFWF'::text Indexes: "t1_pkey" PRIMARY KEY, btree (c1) "t1_c2_idx" btree (c2) List of relations Schema | Name | Type | Owner | Persistence | Access method | Size | Description --------+------+-------+----------+-------------+---------------+-------+------------- public | t3 | table | postgres | permanent | heap | 67 GB | public | t4 | table | postgres | permanent | columnar | 89 MB | */
Examining the indexes one sees them to be the same size.
List of relations Schema | Name | Type | Owner | Table | Persistence | Access method | Size | -------+---------------+-------+----------+-------+-------------+---------------+--------+ public | t3_c2_idx | index | postgres | t3 | permanent | btree | 105 MB | public | t3_c2_idx1 | index | postgres | t3 | permanent | btree | 105 MB | public | t3_length_idx | index | postgres | t3 | permanent | btree | 33 MB | public | t3_pkey | index | postgres | t3 | permanent | btree | 107 MB | public | t4_c2_idx | index | postgres | t4 | permanent | btree | 105 MB | public | t4_length_idx | index | postgres | t4 | permanent | btree | 33 MB | public | t4_pkey | index | postgres | t4 | permanent | btree | 107 MB |
Unlike the first set of query plans, these ones clearly demonstrate a significant performance improvement.
Curious to see what would change in the way of performance, I varied max_parallel_workers_per_gather; it doesn't appear to have changed much.
Query              | HEAP TABLE    | COLUMNAR TABLE
-------------------+---------------+-------------------------
#1 (without index) | 9.6s / 8.7s   | 590.176 ms / 596.459 ms
#2                 | 10.4s / 8.8s  | 509.209 ms / 541.452 ms
#3                 | 1m34s / 1m17s | 1.1s / 1.2s
#4                 | 1m33s / 1m18s | 1.2s / 1.2s

(each cell shows the timings for the two max_parallel_workers_per_gather settings tested)
Working with indexes
General observations: Btree indexes are similar in size between HEAP and COLUMNAR tables. Overall their performance also appears similar although the columnar table’s index is somewhat slower, no doubt due to the extra processing required to uncompress the table’s values.
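For reference, the btree comparison assumes an identical index definition on both access methods, along the lines of this sketch:

-- same btree definition on the heap table (t3) and the columnar table (t4)
create index on t3 (c2);
create index on t4 (c2);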
Regarding Expression Indexes: Creating an EXPRESSION index on COLUMNAR table is significantly faster than that of HEAP:
-- 1m17s
create index on t3(length(c90));

-- 14s
create index on t4(length(c90));
Regarding the runtime parameter max_parallel_workers_per_gather: index performance varies considerably on HEAP tables depending upon the value it is set to.
The following results highlight that, depending upon the type of table used, it can become important when hardware resource and server costs are a consideration:
SQL                                   | workers = 4 | workers = 1
--------------------------------------+-------------+------------
HEAP TABLE using BTREE index          | 467.789 ms  | 748.939 ms
COLUMNAR TABLE using BTREE index      | 561.522 ms  | 599.629 ms
HEAP TABLE using EXPRESSION index     | 1.614 ms    | 2.346 ms
COLUMNAR TABLE using EXPRESSION index | 31.980 ms   | 38.766 ms
About table constraints and access methods
Overall, indexes, constraints, and access methods are still evolving with many of the features still to be implemented.
Let’s start with a big issue i.e. DELETE and UPDATE:
-- fails
delete from t2 where id=5;
ERROR:  UPDATE and CTID scans not supported for ColumnarScan

-- fails
update t2 set id=5;
ERROR:  UPDATE and CTID scans not supported for ColumnarScan
Creating indexes on a columnar table is restricted to btree indexes:
-- works
create index on t2 using btree (id);

-- fails
create index on t2 using columnar (id);
ERROR:  unsupported access method for the index on columnar table t2
Creating foreign key constraints isn't supported:
select generate_series as id into t3 from generate_series(0,15);
alter table t3 add primary key(id);

-- works for our standard table t1
alter table t1 add foreign key(id) references t3(id);

-- fails with the columnar table t2
alter table t2 add foreign key(id) references t3(generate_series);
ERROR:  Foreign keys and AFTER ROW triggers are not supported for columnar tables
HINT:  Consider an AFTER STATEMENT trigger instead.

-- works after converting table t2 from COLUMNAR to HEAP
alter table t2 set access method heap;
alter table t2 add foreign key(id) references t3(generate_series);
alter table t2 set access method columnar;
Partitioning
Columnar tables can be used as partitions; a partitioned table can be made up of any combination of row and columnar partitions.
An excellent use case is INSERT once and READ only table partitions where one can leverage both its compression and better performing OLAP type queries for very large tables.
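A minimal sketch of that pattern, with hypothetical table names: recent, write-heavy data stays row-oriented, while older, read-only partitions are columnar:

create table events (id bigint, created_at date, payload text)
    partition by range (created_at);

-- current month: regular heap partition, accepts UPDATE/DELETE
create table events_2022_11 partition of events
    for values from ('2022-11-01') to ('2022-12-01');

-- previous month: compressed, read-mostly columnar partition
create table events_2022_10 partition of events
    for values from ('2022-10-01') to ('2022-11-01')
    using columnar;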
Caveat
Columnar Extension Limitations, as of version 11.1:
- It takes more time to create the table than standard heap access-based tables
- The query performance is equal or slower with smallish tables compared to heap-based tables.
- There is no update/delete possible in a columnar table.
- The indexes are limited to btree, as of version 10.0.
- There is no logical replication.
There’s actually more documented. Refer here for more information.
AUTHOR’S NOTE: In regard to the custom packages created for this blog. The entire citus suite is designed to enhance Postgres as a distributed database solution. Only the columnar table component was covered in this blog. The complete feature set is part of this package install and should work but has not been tested.
Conclusion
Despite its current limitations, there are use cases where this extension can definitely make a difference. And it speaks well of its future as the team continues development by constantly improving and adding capabilities. Watch for updates on its GitHub source repository.
13
2022
Working With Snapshots in PostgreSQL
One of the reasons I’ve never strayed far from PostgreSQL is that I am always discovering features that overcome all sorts of interesting problems. Although, to be honest, from time to time I’ve had the feeling that sometimes some features are a solution in search of a problem. Take, for example, exporting transaction snapshots.
Originally implemented in version 9.2, snapshot exports allow multiple transactions to share identical views of the same database state. Back then, I wasn't certain how I could use it, until version 9.3 came out presenting a true use case, when parallel jobs were introduced to pg_dump.
Here’s a simple example; let’s start by creating two tables with lots of data:
begin; create table t1( c1 serial primary key, c2 text default lpad('',500,md5(random()::text)), c3 timestamp default clock_timestamp() ); create table t2(like t1 including all); insert into t1 select generate_series(1,5e6); insert into t2 select generate_series(1,5e6); analyze; commit;
List of relations Schema | Name | Type | Owner | Size | Description --------+------+-------+----------+---------+------------- public | t1 | table | postgres | 2791 MB | public | t2 | table | postgres | 2791 MB |
TIP: Changing the arguments to the generate_series function lets you populate the tables to a size that best fits your hardware's capabilities.
The traditional invocation of a logical dump uses a single CPU/thread archiving the tables one at a time. The CLI time generates timing statistics returning the time it takes to perform the dump using the “compressed” format:
# execution time: 60 seconds
time pg_dump -Fc db01 > db01.db

real    1m0.322s
user    0m50.203s
sys     0m1.309s
An alternate invocation generating the dump uses the “directory” format:
# execution time: 52 seconds
time pg_dump -Fd db01 -f db01_dirA

real    0m52.304s
user    0m50.908s
sys     0m1.152s
Thus far, the execution utilizes a single CPU. Now look at the execution time when the '-j', or '--jobs', switch is used, so that multiple connections generate the dump:
# execution time: 31 seconds
time pg_dump -Fd db01 -f db01_dirB -j 2

real    0m31.115s
user    1m0.242s
sys     0m1.377s
Notice the number of processes running in the background. The third connection is a leader process which coordinates the two worker processes:
$: ps aux | grep -v grep | grep pg_dump
rbernier 1872809  0.1  0.0  14444 5968 pts/1 S+ 08:52 0:00 pg_dump -Fd db01 -f db01_dirB -j 2
rbernier 1872811  100  0.0  14804 4480 pts/1 R+ 08:52 0:12 pg_dump -Fd db01 -f db01_dirB -j 2
rbernier 1872812  100  0.0  14804 4488 pts/1 R+ 08:52 0:12 pg_dump -Fd db01 -f db01_dirB -j 2
Without the synchronized snapshot feature, the different worker jobs wouldn’t be guaranteed to see the same data in each connection, which could lead to an inconsistent backup.
One important caveat to remember is that the pg_dump jobs switch only works with the "directory" format. Attempting to use the jobs switch with a single-file, "compressed" format dump fails:
pg_dump -Fc db01 -j 2 > db01.db
pg_dump: error: parallel backup only supported by the directory format
There is a full description of other caveats using the jobs switch in the PostgreSQL documentation.
So we’ve shown how it works with pg_dump, but can we go further? … YES!
We can replicate the behavior using the snapshot synchronization function pg_export_snapshot().
Continuing with the two previously created tables, let’s create another scenario. Look at the table below for each step:
- STEP 1: Three psql sessions login and are interacting with tables t1 and t2 in tandem.
- STEP 2: Session 1 inserts into the tables every five seconds. Session 2 sees the data differently than session 3 (note the three-second delay queried in session 3), making it problematic to see the same dataset at the same time.
- STEP 3: Session 1 continues updating tables t1, t2 but now sessions 2 and 3 are using the same snapshot session.
- STEP 4: While session 1 continues updating tables t1, t2 sessions 2 and 3 see the same datasets i.e. they are synchronized.
- STEP 5: Both sessions 2, and 3 are actively copying data at the same time without fear of inconsistency.
- STEP 6: The COPY is completed so let’s finish up by committing the transactions.
STEP 1
    Session 1: psql db01
    Session 2: psql db01
    Session 3: psql db01

STEP 2
    Session 1:
        with a(x) as (select max(c1) from t1),
             b as (insert into t1 select generate_series(x+1,x+10) from a returning *),
             c as (insert into t2 select generate_series(x+1,x+10) from a returning *)
        select x as increment from a;
        \watch 5
    Session 2:
        with a(x) as (select max(c1) from t1)
        select c1,md5(c2),c3 from t1,a where c1=x
        union all
        select c1,md5(c2),c3 from t2,a where c1=x;
    Session 3:
        with a(x) as (select max(c1) from t1)
        select c1,md5(c2),c3 from t1,a where c1=x
        union all
        select c1,md5(c2),c3 from t2,a,pg_sleep(3) where c1=x;

STEP 3
    Session 1: step 2 is active
    Session 2:
        -- let's export the snapshot -- 00000003-000021CE-1
        begin transaction isolation level repeatable read;
        select pg_export_snapshot();
    Session 3:
        -- let's import the snapshot
        begin transaction isolation level repeatable read;
        set transaction snapshot '00000003-000021CE-1';

STEP 4
    Session 1: step 2 is active
    Session 2:
        with a(x) as (select max(c1) from t1)
        select c1,md5(c2),c3 from t1,a where c1=x
        union all
        select c1,md5(c2),c3 from t2,a where c1=x;
    Session 3:
        with a(x) as (select max(c1) from t1)
        select c1,md5(c2),c3 from t1,a where c1=x
        union all
        select c1,md5(c2),c3 from t2,a,pg_sleep(3) where c1=x;

STEP 5
    Session 1: step 2 is active
    Session 2: \copy t1 to program 'gzip > t1.gz'
    Session 3: \copy t2 to program 'gzip > t2.gz'

STEP 6
    Session 1: step 2 is terminated
    Session 2: -- commit or rollback
               commit;
    Session 3: -- commit or rollback
               commit;
And there you have it, a neat little trick exporting snapshots! Of course, there’s more you can do than just copying tables in a multi-threaded fashion but hopefully, this little exposé will lead to more opportunities to simplify your life while working with Postgres.