When it comes to Citus, successfully building out and scaling a PostgreSQL cluster across multiple nodes, and even across data centers, can feel at times like an art form because there are so many ways of building it out.
There’s an axiom describing the difference between science and art that I think aptly applies to this situation:
– Science: 1 problem -> 1 solution
– Art: 1 problem -> 1,000 solutions
While my two previous Citus blogs covered the fairly straightforward concepts of creating columnar tables and leveraging data redundancy, this blog explores architectural design considerations for this freely downloadable and fully featured PostgreSQL extension.
Rather than listing the concepts, function calls, etc., available in Citus, which frankly is a bit boring, I’m going to explore scaling out a database system starting with a single host. I won’t cover all the features, but I’ll show just enough that you’ll want to explore more for yourself.
About the cluster
Following a step-by-step process, the objective is to create a four-node cluster consisting of:
- PostgreSQL version 15
- Citus extension (I’ll be using version 11, but there are newer ones available.)
- PostgreSQL Cluster
- One coordinator node
- citus-coord-01
- Three worker nodes
- citus1
- citus2
- citus3
- Hardware
- AWS instance
- Ubuntu Server 20.04, SSD volume type
- 64-bit (x86)
- c5.xlarge: 4 vCPU, 8 GB RAM
- Storage: EBS volume (root)
- 80 GB
- gp2 (IOPS 240/3000)
High availability will also be integrated, guaranteeing cluster viability if one worker node goes down.
Leveraging pgbench, a benchmarking utility that comes bundled with PostgreSQL, I will put the cluster through its paces by executing a series of DML operations. Using its default TPC-B benchmark, one can stress test a database of any size, from a few clients to thousands of simulated clients interacting with a system sized into the terabytes if need be.
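For reference, here is a minimal sketch of the pgbench invocations used throughout this post, assuming the database pgbench already exists; at scale factor 300, pgbench_accounts holds roughly 30 million rows, which is where the roughly 4GB of data comes from:

# initialize: create tables, generate data at scale factor 300, add constraints
pgbench -i -s 300 pgbench

# run the default TPC-B-like workload: 20 clients, 3 threads, 60 seconds,
# reporting progress every 3 seconds
pgbench -c 20 -j 3 -T 60 -P 3 pgbench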
Steps
Provisioning
The first step is to provision the four nodes with both PostgreSQL and Citus. Refer here or here for getting and installing the extension.
-- update each postgres node, enabling the extension
-- (requires a server restart)
alter system set shared_preload_libraries = 'citus';
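As a quick sanity check after restarting each node (the systemd unit name below is an assumption and may differ on your installation), confirm the library is preloaded:

# restart so the new shared_preload_libraries value takes effect
sudo systemctl restart postgresql

# confirm the library is loaded
psql -c "show shared_preload_libraries;"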
On each of the four nodes, create the database with the Citus extension:
-- using psql
create database pgbench;
\c pgbench
create extension citus;
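To verify the extension on each node, something like the following works (citus_version() is provided by the extension):

-- using psql, while connected to database pgbench
\dx citus
select citus_version();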
One worker node
Beginning with a single worker node, log onto citus-coord-01, declare it as the coordinator host, and add the first worker node, citus1:
-- execute on citus-coord-01
psql pgbench <<_eof1_
    select citus_set_coordinator_host('citus-coord-01', 5432);
    select citus_add_node('citus1', 5432);
_eof1_
While still logged into database pgbench, get a list of the worker nodes, which at this time is just the one:
select * from master_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 citus1    |      5432
Now, things get a little interesting: the normal pgbench initialization creates and populates the tables with the appropriate constraints. In this case, however, only the tables are created, without data or constraints:
-- execute the following command on citus-coord-01
pgbench -iI t pgbench
pgbench=# \d+
                                  List of relations
 Schema |       Name       | Type  |  Owner   | Persistence | Access method |  Size   |
--------+------------------+-------+----------+-------------+---------------+---------+
 public | citus_tables     | view  | postgres | permanent   |               | 0 bytes |
 public | pgbench_accounts | table | postgres | permanent   | heap          | 0 bytes |
 public | pgbench_branches | table | postgres | permanent   | heap          | 0 bytes |
 public | pgbench_history  | table | postgres | permanent   | heap          | 0 bytes |
 public | pgbench_tellers  | table | postgres | permanent   | heap          | 0 bytes |
This next set of commands, executed on the coordinator node, distributes the pgbench tables across node citus1:
psql pgbench <<_eof1_
    BEGIN;
        select create_distributed_table('pgbench_history', 'aid');
        select create_distributed_table('pgbench_accounts', 'aid');
        select create_distributed_table('pgbench_branches', 'bid');
        select create_distributed_table('pgbench_tellers', 'tid');
    COMMIT;
_eof1_
And now, we can begin populating the tables. The following invocation generates almost 4GB of data. Notice I still haven’t added any constraints:
pgbench -iI g -s 300 pgbench
This query shows the total number of shards on the node. The default number of shards generated for a given table is 32.
select * from citus_tables;
    table_name    | citus_table_type | distribution_column | colocation_id | table_size | shard_count
------------------+------------------+---------------------+---------------+------------+-------------
 pgbench_accounts | distributed      | aid                 |             1 | 3844 MB    |          32
 pgbench_branches | distributed      | bid                 |             1 | 256 kB     |          32
 pgbench_history  | distributed      | aid                 |             1 | 0 bytes    |          32
 pgbench_tellers  | distributed      | tid                 |             1 | 256 kB     |          32
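The per-table shard count comes from the citus.shard_count setting, which defaults to 32 and is only consulted when create_distributed_table() is called. A minimal sketch, purely for illustration:

-- check the current default
show citus.shard_count;

-- example only: tables distributed from this session onward would get 64 shards
set citus.shard_count = 64;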
You’ll also notice that only some of the shards have been populated with data, which is normal when pgbench has just generated its dataset:
select * from citus_shards;
    table_name    | shardid |       shard_name        | citus_table_type | colocation_id | nodename | nodeport | shard_size
------------------+---------+-------------------------+------------------+---------------+----------+----------+------------
 pgbench_accounts |  102040 | pgbench_accounts_102040 | distributed      |             1 | citus1   |     5432 |  125853696
 pgbench_accounts |  102041 | pgbench_accounts_102041 | distributed      |             1 | citus1   |     5432 |  126173184
 pgbench_accounts |  102042 | pgbench_accounts_102042 | distributed      |             1 | citus1   |     5432 |  125739008
 pgbench_accounts |  102043 | pgbench_accounts_102043 | distributed      |             1 | citus1   |     5432 |  125968384
 .
 .
 pgbench_branches |  102072 | pgbench_branches_102072 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102073 | pgbench_branches_102073 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102074 | pgbench_branches_102074 | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_branches |  102075 | pgbench_branches_102075 | distributed      |             1 | citus1   |     5432 |       8192
 .
 .
 pgbench_history  |  102008 | pgbench_history_102008  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102009 | pgbench_history_102009  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102010 | pgbench_history_102010  | distributed      |             1 | citus1   |     5432 |          0
 pgbench_history  |  102011 | pgbench_history_102011  | distributed      |             1 | citus1   |     5432 |          0
 .
 .
 pgbench_tellers  |  102104 | pgbench_tellers_102104  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102105 | pgbench_tellers_102105  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102106 | pgbench_tellers_102106  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102107 | pgbench_tellers_102107  | distributed      |             1 | citus1   |     5432 |       8192
 pgbench_tellers  |  102108 | pgbench_tellers_102108  | distributed      |             1 | citus1   |     5432 |       8192
 .
 .
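A quick way to confirm how many shards actually hold data is to aggregate the citus_shards view shown above; this query is just an illustrative check:

select table_name,
       count(*) as total_shards,
       count(*) filter (where shard_size > 0) as shards_with_data
from citus_shards
group by table_name
order by table_name;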
And now, execute the benchmark:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
The results are not pretty.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 1524
number of failed transactions: 0 (0.000%)
latency average = 791.374 ms
latency stddev = 157.710 ms
initial connection time = 15.704 ms
tps = 25.197427 (without initial connection time)
ATTENTION: Refer to the table at the bottom of this blog, which tabulates the various benchmark runs as this cluster evolves.
In order to speed up the benchmark, indexes must be added. However, the standard pgbench method of adding indexes fails:
postgres@citus-coord-01:~$ pgbench -iI p pgbench
creating primary keys...
pgbench: error: query failed: ERROR:  cannot create constraint without a name on a distributed table
pgbench: detail: Query was: alter table pgbench_branches add primary key (bid)
Successfully adding indexes and constraints requires that they be explicitly named and created in the following fashion. Note that table pgbench_history requires a REPLICA IDENTITY because it doesn’t have a primary key:
# THIS WORKS!
psql pgbench <<_eof1_
--
-- indexes and constraints must be explicitly named
--
BEGIN;
    create unique index pgbench_accounts_pk on pgbench_accounts(aid);
    create unique index pgbench_branches_pk on pgbench_branches(bid);
    create unique index pgbench_tellers_pk  on pgbench_tellers(tid);

    alter table pgbench_accounts add constraint pk_accounts primary key using index pgbench_accounts_pk;
    alter table pgbench_branches add constraint pk_branches primary key using index pgbench_branches_pk;
    alter table pgbench_tellers  add constraint pk_tellers  primary key using index pgbench_tellers_pk;

    \qecho adding REPLICA IDENTITY (no PK present)
    alter table pgbench_history replica identity full;
COMMIT;
_eof1_
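If you want to confirm that the indexes were actually propagated down to the shards on the worker, run_command_on_workers() can be used; the query below is just an illustrative check:

-- using psql on the coordinator: count pgbench-related indexes on each worker
select * from run_command_on_workers($cmd$
    select count(*) from pg_indexes where tablename like 'pgbench%'
$cmd$);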
Repeating the benchmark run yields a much better result:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 135273
number of failed transactions: 0 (0.000%)
latency average = 8.865 ms
latency stddev = 9.452 ms
initial connection time = 15.556 ms
tps = 2254.544852 (without initial connection time)
Two worker nodes
Adding a second worker node is straightforward, requiring only two steps:
- Adding the node.
- Rebalancing the shards across the two nodes.
Execute this on the coordinator node, citus-coord-01:
psql pgbench <<_eof1_
--
-- it's understood that database pgbench already exists on citus2
-- and extension citus has been created
--
    \qecho adding node citus2 ...
    select citus_add_node('citus2', 5432);

    \qecho rebalancing shards across TWO nodes ...
    select * from rebalance_table_shards();

    ANALYZE;
_eof1_
Here’s a partial output as the tables are rebalanced across the two nodes:
adding node citus2 ...
 citus_add_node
----------------
              3
rebalancing shards across TWO nodes ...
NOTICE:  Moving shard 102008 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102009 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102010 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102011 from citus1:5432 to citus2:5432 ...
.
.
NOTICE:  Moving shard 102020 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102021 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102022 from citus1:5432 to citus2:5432 ...
NOTICE:  Moving shard 102023 from citus1:5432 to citus2:5432 ...
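Once the rebalance completes, a quick aggregate over citus_shards shows how the shards are now spread across the two nodes; something along these lines:

select nodename,
       count(*) as shard_count,
       pg_size_pretty(sum(shard_size)) as total_size
from citus_shards
group by nodename
order by nodename;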
Repeating the benchmark once again demonstrates that performance has degraded somewhat. This is to be expected, considering my POC consists of containers on a single machine, all sharing the same system resources:
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 130190
number of failed transactions: 0 (0.000%)
latency average = 9.214 ms
latency stddev = 19.159 ms
initial connection time = 15.776 ms
tps = 2169.559969 (without initial connection time)
Three worker nodes
As with the two-worker configuration, adding another node is straightforward:
psql pgbench <<_eof1_
    \qecho adding node citus3 ...
    select citus_add_node('citus3', 5432);

    \qecho rebalancing shards across THREE nodes ...
    select * from rebalance_table_shards();
_eof1_
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 142630
number of failed transactions: 0 (0.000%)
latency average = 8.410 ms
latency stddev = 5.542 ms
initial connection time = 16.407 ms
tps = 2377.227920 (without initial connection time)
Three worker nodes, with redundancy/replication
This next step is interesting because it adds redundancy by increasing the replication factor from one to two, such that each table shard has two copies, located on two distinct nodes.
ATTENTION: An updated replication factor affects only tables created after the fact; pre-existing tables are not affected by the new replication factor:
psql pgbench <<_eof1_
    \qecho increasing replication factor to TWO
    alter system set citus.shard_replication_factor = 2;
    select pg_reload_conf();
_eof1_
TIP: In order to replicate tables, set the replication factor before creating them.
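Since the setting is only consulted at create_distributed_table() time, it can also be set just for the session that (re)creates the tables; a minimal sketch:

-- using psql, before distributing any tables
set citus.shard_replication_factor = 2;
show citus.shard_replication_factor;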
In order to replicate the existing tables across the nodes, they are first undistributed, i.e., centralized back onto the coordinator node. Redistributing them across the three-node cluster then automatically adds the replication.
This centralizes the tables onto the coordinator node:
psql pgbench <<_eof1_
    BEGIN;
        select undistribute_table('pgbench_history');
        select undistribute_table('pgbench_accounts');
        select undistribute_table('pgbench_branches');
        select undistribute_table('pgbench_tellers');
    COMMIT;
_eof1_
The tables are redistributed and rebalanced across the three-node cluster with the correct redundancy of two:
psql pgbench <<_eof1_
    BEGIN;
        select create_distributed_table('pgbench_history', 'aid');
        select create_distributed_table('pgbench_accounts', 'aid');
        select create_distributed_table('pgbench_branches', 'bid');
        select create_distributed_table('pgbench_tellers', 'tid');
    COMMIT;

    -- remove legacy tables from the coordinator node
    -- (the dollar quotes are escaped so the shell heredoc doesn't expand them)
    BEGIN;
        select truncate_local_data_after_distributing_table(\$\$public.pgbench_history\$\$);
        select truncate_local_data_after_distributing_table(\$\$public.pgbench_accounts\$\$);
        select truncate_local_data_after_distributing_table(\$\$public.pgbench_branches\$\$);
        select truncate_local_data_after_distributing_table(\$\$public.pgbench_tellers\$\$);
    COMMIT;
_eof1_
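To confirm that every shard now has two placements on two distinct nodes, the placements can be counted per shard; the query below against citus_shards is only an illustrative check:

-- each shardid should now report two placements
select shardid, count(distinct nodename) as placements
from citus_shards
where shard_name like 'pgbench_accounts%'
group by shardid
order by shardid
limit 5;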
Repeating the benchmark gives us the updated performance metrics.
-- execute the following on the coordinator node
pgbench -c 20 -j 3 -T 60 -P 3 pgbench
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 300
query mode: simple
number of clients: 20
number of threads: 3
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 68834
number of failed transactions: 0 (0.000%)
latency average = 17.430 ms
latency stddev = 12.818 ms
initial connection time = 16.321 ms
tps = 1147.093369 (without initial connection time)
Benchmark results
Here’s a summary of the various benchmarking runs, giving you an idea of relative performance.
 RUN | NUM NODES | WITH IDX | REPLICATION FACTOR |  TPS
-----+-----------+----------+--------------------+------
   1 |         1 | no       |                  1 |   25
   2 |         1 | yes      |                  1 | 2255
   3 |         2 | yes      |                  1 | 2170
   4 |         3 | yes      |                  1 | 2377
   5 |         3 | yes      |                  2 | 1147
Caveat
Although not covered in this blog, there are a few things you’ll need to keep in mind as you work with Citus:
- The password for the account used to propagate operations across the cluster must be stored in that account’s home-directory .pgpass file on all nodes (see the sketch after this list).
- The “wal_level” is set to logical.
- Redundancy can potentially decrease overall performance.
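For illustration only, assuming the postgres OS account drives the cluster and the hostnames used in this post, the first item translates roughly into the following (the password is a placeholder):

# ~/.pgpass on every node, file mode 0600
# format: hostname:port:database:username:password
citus-coord-01:5432:pgbench:postgres:mysecretpassword
citus1:5432:pgbench:postgres:mysecretpassword
citus2:5432:pgbench:postgres:mysecretpassword
citus3:5432:pgbench:postgres:mysecretpassword

And the WAL level, which requires a server restart:

-- on each node
alter system set wal_level = 'logical';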
Keep in mind that the performance metrics depend not only upon the cluster’s hardware and system resources but also upon the level of tuning, which, in this case, has not been addressed.
A future blog will continue my exploration into Citus by scaling out pgbench into other architectures.
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.