This blog post is part two of a multi-post series about using the Citus extension with Percona Distribution for PostgreSQL. You can read part one here: Scalable Solutions With Percona Distribution for PostgreSQL: Set Up Three PostgreSQL Database Instances.
Citus is an open-source extension for PostgreSQL that expands its capacity to manage databases of varying scales, ranging from individual nodes to extensively distributed database clusters. Citus maintains seamless integration with PostgreSQL.
Citus has many advantages; one of them is its sharding technique, and that is what we are going to explore in this blog post, using it together with Percona Distribution for PostgreSQL.
This is how Citus defines sharding: “Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It involves breaking up a large database or dataset into smaller, more manageable parts called Shards. Each shard contains a subset of the data, and together, they form the complete dataset.”
Source: Citus official website, “Distribute Tables.”
There are three types of tables in a Citus cluster, each serving a distinct purpose. We will be utilizing the Distributed Tables type.
“These appear to be normal tables to SQL statements but are horizontally partitioned across worker nodes.” – Citus
Percona Distribution for PostgreSQL is a collection of tools to assist you in managing your PostgreSQL database system: it installs PostgreSQL and complements it with a selection of extensions and components that help you solve essential practical tasks efficiently, such as HAProxy, Patroni, pgAudit, and more.
The goal of this tutorial is to show a simplified example of how you might use Citus with Percona Distribution for PostgreSQL 15.
Ready?
Let’s start with the prerequisites.
For this tutorial, I am using:
- Operating system: Ubuntu 20.04
- Percona Distribution for PostgreSQL 15 on Ubuntu
- We need three nodes of Percona Distribution for PostgreSQL. Check the “Scalable Solutions with Percona Distribution for PostgreSQL (Part 1)” blog post to set up three nodes.
- Citus 12.0
- The Citus packages usually install PostgreSQL as a dependency, but I already have Percona Distribution for PostgreSQL installed, so in this blog, we will only install Citus.
If you want to check the version of Percona PostgreSQL, you can use the ‘psql’ command-line utility for interacting with PostgreSQL databases. Open a terminal, connect as the “postgres” user, and run the SQL query “SELECT version();” to display the version of Percona PostgreSQL installed on your system.
sudo su postgres
psql
You will have this as an output:
psql (15.4 - Percona Distribution)
Type "help" for help.

postgres=# SELECT VERSION();
                                                           version
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 - Percona Distribution on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)
Installing Citus for Distributed PostgreSQL
Let’s install Citus version 12 on our three nodes of Percona Distribution for PostgreSQL.
# Let's download the deb.sh script to install Citus
curl https://install.citusdata.com/community/deb.sh > add-citus-repo.sh

# Execute the script
sudo bash add-citus-repo.sh

# Install Citus package
sudo apt-get -y install postgresql-15-citus-12.0
Add Citus as a PostgreSQL library
Once Citus is installed, we will add “citus” as a shared library to preload when the PostgreSQL server starts. This should be done on all three nodes of Percona Distribution for PostgreSQL.
sudo su postgres
psql
ALTER SYSTEM SET shared_preload_libraries = citus;
Restart Percona PostgreSQL to apply the changes.
sudo systemctl restart postgresql
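After the restart, you can optionally confirm that the setting took effect. This is a minimal sketch, assuming a setup where no other libraries were previously preloaded:

```sql
-- Optional check, run as the postgres user in psql:
-- confirm that citus now appears in shared_preload_libraries.
-- If other libraries (e.g., pg_stat_statements) are also preloaded,
-- the value will be a comma-separated list.
SHOW shared_preload_libraries;
```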
Create the Citus extension
We will create and enable the Citus extension within the database. Make sure you run these commands with the “postgres” user, also on all three Percona Distribution for PostgreSQL nodes.
sudo su postgres
psql
CREATE EXTENSION citus;
Now, let’s check if the Citus extension has been correctly created.
-- Checking Citus version
SELECT citus_version();
You will see an output similar to this:
                                            citus_version
----------------------------------------------------------------------------------------------------
 Citus 12.0.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)
Set up a multi-node Citus cluster
We start by setting the coordinator host. This is used in a Citus distributed database environment to set the IP address and port of the coordinator node. We will take the first node we created as a coordinator host.
This coordinator node manages the overall cluster and coordinates queries and data distribution among other nodes.
sudo su postgres
psql
SELECT citus_set_coordinator_host('172.31.88.41', 5432);
It is time to add each worker node to the Citus cluster. Make sure to run these commands with the “postgres” user on the coordinator node.
SELECT * FROM citus_add_node('172.31.86.26', 5433);
SELECT * FROM citus_add_node('172.31.89.45', 5434);
Check distributed worker nodes
Now, let’s list the distributed nodes in the cluster, including the coordinator node.
sudo su postgres
psql

postgres=# SELECT * FROM pg_dist_node;
 nodeid | groupid |   nodename   | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+--------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
      1 |       0 | 172.31.88.41 |     5432 | default  | t           | t        | primary  | default     | t              | f
      6 |       5 | 172.31.86.26 |     5433 | default  | t           | t        | primary  | default     | t              | t
      7 |       6 | 172.31.89.45 |     5434 | default  | t           | t        | primary  | default     | t              | t
(3 rows)
Alternatively, you can list only the active Postgres Citus worker nodes in the cluster:
postgres=# SELECT * FROM citus_get_active_worker_nodes();
  node_name   | node_port
--------------+-----------
 172.31.86.26 |      5433
 172.31.89.45 |      5434
(2 rows)
Once this is ready, we will log in to the coordinator node and create our “users” and “events” tables.
Creating the tables
```sql
-- Create basic users table
CREATE TABLE users (
    user_id serial PRIMARY KEY,
    user_name text,
    email text
);

-- Create basic events table
CREATE TABLE events (
    event_id bigserial,
    user_id int,
    event_time timestamp,
    data jsonb,
    PRIMARY KEY (user_id, event_id)
);
```
In this example, we create two tables:
The “users” table with the following columns:
- user_id: This column will be used for sharding.
- user_name: The name of the user (text type).
- email: The email address of the user (text type).
And the “events” table with the following columns:
- event_id: A bigserial column that, together with user_id, forms the primary key to ensure unique event IDs.
- user_id: The ID of the user who generated the event. This column will be used for sharding.
- event_time: A timestamp to record when the event occurred.
- data: A JSONB column for storing event-related data.
Creating distributed tables
Now, let’s distribute the “users” and “events” tables across shards placed locally or on the Postgres Citus worker nodes. In this case, we are distributing by the “user_id” column.
SELECT create_distributed_table('users', 'user_id');
SELECT create_distributed_table('events', 'user_id');
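If you want to verify that both tables are now distributed on “user_id”, one option is to query the citus_tables view that ships with the extension; the shard count and sizes you see will depend on your cluster configuration. A minimal sketch:

```sql
-- Optional check: list distributed tables, their distribution column,
-- and how many shards each one was split into.
SELECT table_name, citus_table_type, distribution_column, shard_count
FROM citus_tables;
```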
Insert data into our sharded tables
Now, insert data into the sharded tables:
-- Insert user data
INSERT INTO users (user_name, email) VALUES
('user1', 'user1@example.com'),
('user2', 'user2@example.com'),
('user3', 'user3@example.com');

-- Insert event data
INSERT INTO events (user_id, event_time, data) VALUES
(1, '2023-09-01 10:00:00', '{"temperature": 25.5, "humidity": 62}'),
(2, '2023-09-01 11:30:00', '{"temperature": 22.0, "humidity": 56}'),
(1, '2023-09-02 08:15:00', '{"temperature": 24.0, "humidity": 59}'),
(3, '2023-09-08 07:06:00', '{"temperature": 25.8, "humidity": 66}'),
(1, '2023-09-10 04:00:00', '{"temperature": 20.0, "humidity": 54}'),
(1, '2023-09-04 10:00:00', '{"temperature": 25.5, "humidity": 62}'),
(2, '2023-09-05 11:30:00', '{"temperature": 22.0, "humidity": 50}'),
(1, '2023-09-03 08:15:00', '{"temperature": 24.0, "humidity": 56}'),
(3, '2023-09-06 07:06:00', '{"temperature": 25.8, "humidity": 65}'),
(1, '2023-09-09 04:00:00', '{"temperature": 20.0, "humidity": 52}');
When you use the user_id column to create a distributed table in Citus, you are effectively “hash-partitioning” the data based on that column. This means that events with the same user_id will be stored together on the same shard.
Citus takes each value in the distribution column, applies a hash function to it, and assigns the row to a shard based on the hash value. Rows that share the same user_id always land on the same shard, while different values are spread across the shards so that the data is distributed evenly.
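To see this mapping in practice, you can ask Citus which shard a particular user_id hashes to and where the shards of the “events” table live. This is a small sketch using the get_shard_id_for_distribution_column() function and the citus_shards view that come with Citus; the shard IDs and node placement in your cluster will differ:

```sql
-- Which shard does user_id = 1 hash to? (the returned shard ID is cluster-specific)
SELECT get_shard_id_for_distribution_column('events', 1);

-- Where are the shards of the "events" table placed across the worker nodes?
SELECT shardid, nodename, nodeport
FROM citus_shards
WHERE table_name = 'events'::regclass
ORDER BY shardid;
```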
Testing and query analysis
Query the sharded table:
SET citus.explain_all_tasks TO on;
EXPLAIN ANALYZE SELECT * FROM events;
“SET citus.explain_all_tasks TO on;” enables the “explain all tasks” feature in the Citus extension for PostgreSQL. This feature is particularly useful when you are working with distributed queries in a Citus cluster.
“EXPLAIN ANALYZE” is used to analyze and explain the execution plan of a SQL query in PostgreSQL.
We can see all the tasks, and the output also shows which Postgres Citus node each one ran on.
We will analyze the first task.
The output showcases the advantage of parallel processing: Citus distributes the workload across multiple nodes, in this case the workers on ports 5433 and 5434, resulting in remarkably fast query execution times.
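For contrast, a query that filters on the distribution column can be routed to a single shard instead of fanning out to every worker. A quick, hedged way to see this with the same tooling:

```sql
-- Filtering on the distribution column lets Citus route the query to one shard,
-- so only a single worker node needs to execute it.
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 1;
```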
This was an example of how to use Citus to shard the “events” table. The main advantage of sharding with Citus is its ability to scale PostgreSQL databases horizontally, spreading data across multiple nodes to accommodate growing datasets and high workloads.
Note: Sharding may not be necessary for all applications; it’s typically most beneficial when dealing with large-scale datasets and high-concurrency workloads.
Learn more about Percona Distribution for PostgreSQL and multi-node Citus, and visit our Percona Community Forum page in case you have questions. We are happy to help!
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.