Oct 20, 2023

Scalable Solutions with Percona Distribution for PostgreSQL (Part 2): Using Citus

This blog post is part two of a multi-post series about using the Citus extension with Percona Distribution for PostgreSQL. You can read part one here: Scalable Solutions With Percona Distribution for PostgreSQL: Set Up Three PostgreSQL Database Instances.

Citus is an open-source extension for PostgreSQL that expands its capacity to manage databases of varying scales, ranging from individual nodes to extensively distributed database clusters. Citus maintains seamless integration with PostgreSQL.

Citus has many advantages, one of which is sharding, and that is what we are going to explore in this blog post: using sharding with Percona Distribution for PostgreSQL.

This is how Citus defines sharding: “Sharding is a technique used in database systems and distributed computing to horizontally partition data across multiple servers or nodes. It involves breaking up a large database or dataset into smaller, more manageable parts called Shards. Each shard contains a subset of the data, and together, they form the complete dataset.”

Image: Citus official website, “Distribute Tables”

There are three types of tables in a Citus cluster (distributed, reference, and local tables), each serving a distinct purpose. We will be using the distributed table type.

“These appear to be normal tables to SQL statements but are horizontally partitioned across worker nodes.” – Citus

Percona Distribution for PostgreSQL is a collection of tools to assist you in managing your PostgreSQL database system: it installs PostgreSQL and complements it with a selection of extensions and tools that enable solving essential practical tasks efficiently. Some of these components are HAProxy, Patroni, pgAudit, and many more.

The goal of this tutorial is to show a simplified example of how you might use Citus with Percona Distribution for PostgreSQL 15.

Ready?

Let’s start with the prerequisites.

For this tutorial, I am using:

  1. Operating system: Ubuntu 20.04
  2. Percona Distribution for PostgreSQL 15 on Ubuntu

If you want to check the version of Percona Distribution for PostgreSQL, you can use the ‘psql’ command-line utility for interacting with PostgreSQL databases. To do this, open a terminal, connect as the “postgres” user, and run the SQL query “SELECT version();”. This will display the version installed on your system.

sudo su postgres
psql

You will see output like this:

psql (15.4 - Percona Distribution)
Type "help" for help.

postgres=# SELECT VERSION();
                                                           version
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 - Percona Distribution on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)

Installing Citus for Distributed PostgreSQL

Let’s install Citus version 12 on all three nodes of Percona Distribution for PostgreSQL.

# Let's download the deb.sh script to install Citus
curl https://install.citusdata.com/community/deb.sh > add-citus-repo.sh

# Execute the script
sudo bash add-citus-repo.sh

# Install Citus package
sudo apt-get -y install postgresql-15-citus-12.0

Add Citus as a PostgreSQL library

Once Citus is installed, we will add “citus” as a shared library to preload when the PostgreSQL server starts. This should be done on all three nodes of Percona Distribution for PostgreSQL. Note that if you preload additional libraries, Citus requires “citus” to be the first entry in the shared_preload_libraries list.

sudo su postgres
psql
ALTER SYSTEM SET shared_preload_libraries=citus;

Restart Percona PostgreSQL to apply the changes.

sudo systemctl restart postgresql
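
Once the server is back up, you can confirm the setting took effect from psql (a quick sanity check):

-- Verify that citus is preloaded
SHOW shared_preload_libraries;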

Create the Citus extension

We will create and enable the Citus extension within the database. Make sure you run these commands as the “postgres” user, again on all three Percona Distribution for PostgreSQL nodes.

sudo su postgres
psql
CREATE EXTENSION citus;

Now, let’s check if the Citus extension has been correctly created.

-- Checking Citus version
SELECT citus_version();

You will see an output similar to this:

citus_version
----------------------------------------------------------------------------------------------------
 Citus 12.0.0 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)
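
You can also confirm the extension is registered in the current database by querying the pg_extension catalog:

-- Confirm the extension and its version in this database
SELECT extname, extversion FROM pg_extension WHERE extname = 'citus';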

Set up a multi-node Citus cluster

We start by setting the coordinator host, which registers the IP address and port of the coordinator node in the cluster metadata. We will use the first node we created as the coordinator host.

This coordinator node manages the overall cluster and coordinates queries and data distribution among other nodes.

sudo su postgres
psql
SELECT citus_set_coordinator_host('172.31.88.41', 5432);

It is time to add each worker node to the Citus cluster. Make sure to run these commands with the “postgres” user on the coordinator node.

SELECT * from citus_add_node('172.31.86.26', 5433);
SELECT * from citus_add_node('172.31.89.45', 5434);

Check distributed worker nodes

Now, let’s list the distributed nodes in the cluster, including the main node.

sudo su postgres
psql
postgres=# SELECT * FROM pg_dist_node;
 nodeid | groupid |   nodename    | nodeport | noderack | hasmetadata | isactive | noderole | nodecluster | metadatasynced | shouldhaveshards
--------+---------+---------------+----------+----------+-------------+----------+----------+-------------+----------------+------------------
      1 |       0 | 172.31.88.41 |     5432 | default  | t           | t        | primary  | default     | t              | f
      6 |       5 | 172.31.86.26 |     5433 | default  | t           | t        | primary  | default     | t              | t
      7 |       6 | 172.31.89.45 |     5434 | default  | t           | t        | primary  | default     | t              | t
(3 rows)

Or you can list only the active Citus worker nodes in the cluster:

postgres=# SELECT * FROM citus_get_active_worker_nodes();
 node_name | node_port
-----------+-----------
 172.31.86.26 |      5433
 172.31.89.45 |      5434
(2 rows)

Once this is ready, we will log in to the main PostgreSQL node and create our “users” and “events” tables.

Creating the tables

```sql
-- Create basic users table
CREATE TABLE users (
    user_id serial PRIMARY KEY,
    user_name text,
    email text
);
-- Create basic events table
CREATE TABLE events (
    event_id bigserial,
    user_id int,
    event_time timestamp,
    data jsonb,
    PRIMARY KEY (user_id, event_id)
);
```

In this example, we create two tables:

The “users” table has the following columns:

  • user_id: The primary key; this column will be used for sharding.
  • user_name: The user’s name (text).
  • email: The user’s email address (text).

And the “events” table has the following columns:

  • event_id: A bigserial column that, together with user_id, forms the composite primary key.
  • user_id: The ID of the user the event belongs to; this column will be used for sharding.
  • event_time: A timestamp recording when the event occurred.
  • data: A JSONB column for storing event-related data.

Creating distributed tables

Now, let’s distribute the “users” and “events” tables across shards placed on the Citus worker nodes. In both cases, we distribute by the “user_id” column, so rows from the two tables that share a user_id end up co-located on the same shard.

SELECT create_distributed_table('users', 'user_id');
SELECT create_distributed_table('events', 'user_id');
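
You can inspect the resulting distribution metadata through the citus_tables view (a quick check; unless you changed citus.shard_count, each table is split into 32 shards by default):

-- Inspect distribution metadata for our tables
SELECT table_name, citus_table_type, distribution_column, shard_count
FROM citus_tables;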

Insert data into our sharded tables

Now, insert data into the sharded tables:

-- Insert user data
INSERT INTO users (user_name, email)
VALUES
    ('user1', 'user1@example.com'),
    ('user2', 'user2@example.com'),
    ('user3', 'user3@example.com');

-- Insert event data
INSERT INTO events ( user_id, event_time, data)
VALUES
    (1, '2023-09-01 10:00:00', '{"temperature": 25.5, "humidity": 62}'),
    (2, '2023-09-01 11:30:00', '{"temperature": 22.0, "humidity": 56}'),
    (1, '2023-09-02 08:15:00', '{"temperature": 24.0, "humidity": 59}'),
    (3, '2023-09-08 07:06:00', '{"temperature": 25.8, "humidity": 66}'),
    (1, '2023-09-10 04:00:00', '{"temperature": 20.0, "humidity": 54}'),
    (1, '2023-09-04 10:00:00', '{"temperature": 25.5, "humidity": 62}'),
    (2, '2023-09-05 11:30:00', '{"temperature": 22.0, "humidity": 50}'),
    (1, '2023-09-03 08:15:00', '{"temperature": 24.0, "humidity": 56}'),
    (3, '2023-09-06 07:06:00', '{"temperature": 25.8, "humidity": 65}'),
    (1, '2023-09-09 04:00:00', '{"temperature": 20.0, "humidity": 52}');

When you use the user_id column to create a distributed table in Citus, it means you’re effectively “hash-partitioning” the data based on the user_id column. This means that events with the same user_id will be stored together on the same shard.

Citus takes the value in the distribution column, applies a hash function to it, and assigns the row to a shard based on the hash value. Rows with the same user_id always land on the same shard, while distinct values are spread across the shards in a way that attempts to distribute the data evenly.
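
You can see this mapping for yourself with a built-in helper function (the exact shard ID returned will differ from cluster to cluster):

-- All events with user_id = 1 map to this one shard
SELECT get_shard_id_for_distribution_column('events', 1);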

Testing and query analysis

Query the sharded table:

SET citus.explain_all_tasks TO on;
EXPLAIN ANALYZE SELECT * FROM events;

“SET citus.explain_all_tasks TO on;” enables the “explain all tasks” feature in the Citus extension for PostgreSQL. This is particularly useful when you’re working with distributed queries in a Citus cluster, because the plan then includes the task sent to every shard rather than a single representative task.

“EXPLAIN ANALYZE” executes the query and reports the actual execution plan, with timings, in PostgreSQL.

The output lists every task and shows which Citus node ran each one.

It also showcases the advantage of parallel processing: Citus distributes the workload across multiple nodes, in this case the workers on ports 5433 and 5434, resulting in remarkably fast query execution times.
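
For comparison, a query that filters on the distribution column can be routed to a single shard instead of being fanned out to all of them. A quick experiment:

-- The plan should show a single task, since user_id is the distribution column
EXPLAIN ANALYZE SELECT * FROM events WHERE user_id = 1;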

This was an example of how to use Citus to shard the “events” table. The main advantage of sharding with Citus is its ability to scale PostgreSQL databases horizontally, spreading data across multiple nodes to accommodate growing datasets and high workloads.

Note: Sharding may not be necessary for all applications; it’s typically most beneficial when dealing with large-scale datasets and high-concurrency workloads.

Read more about Percona PostgreSQL and multi-node Citus, and visit our Percona Community Forum page in case you have questions. We are happy to help!

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!

Oct 19, 2023

Scalable Solutions With Percona Distribution for PostgreSQL (Part 1): Set Up Three PostgreSQL Database Instances

Welcome to the first installment of our series: Scalable Solutions with Percona Distribution for PostgreSQL. In this guide, we will demonstrate how to establish a Citus database spanning multiple nodes by using Percona Distribution for PostgreSQL. This setup empowers your database to efficiently manage increased data volumes, enhance performance, and maintain availability.

This initial section will guide you through setting up Percona Distribution for PostgreSQL on three EC2 instances.

Then, in the next part of this series, Scalable Solutions with Percona Distribution for PostgreSQL (Part 2): Using Citus, we will use these worker nodes to establish a multi-node Citus cluster and implement sharding techniques.

This tutorial will use Ubuntu 20.04 and Percona Distribution for PostgreSQL 15.

What we have so far is this:

  • Three EC2 instances in the same subnet.

  • A security group in AWS for PostgreSQL with inbound rules allowing traffic on ports 5432, 5433, and 5434 for PostgreSQL.

Installing Percona Distribution for PostgreSQL on the three instances

Let’s first install Percona Distribution for PostgreSQL on each EC2 instance. Use the following procedure on all three nodes.

# Download Debian package file for Percona Release
wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb

# Install Percona Debian package
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb

# Refresh local package database
sudo apt update

# Add Percona repository for PostgreSQL to our system's list of repositories.
sudo percona-release setup ppg-15

# Install "percona-ppg-server-15" package
sudo apt install percona-ppg-server-15

Now, let’s verify if the service is operating correctly.

sudo systemctl status postgresql.service

You should see this output, indicating an ‘Active’ status in all the instances.

● postgresql.service - PostgreSQL RDBMS
     Loaded: loaded (/lib/systemd/system/postgresql.service; enabled; vendor preset: enabled)
     Active: active (exited) since Mon 2023-09-18 13:36:07 UTC; 1h 15min ago
    Process: 706 ExecStart=/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 706 (code=exited, status=0/SUCCESS)

If you want to check the version of Percona Distribution for PostgreSQL, you can use the ‘psql’ command-line utility for interacting with PostgreSQL databases. To do this, open a terminal, connect as the “postgres” user, and run the SQL query “SELECT version();”. This will display the version installed on your system.

sudo su postgres
psql

You will see output like this:

postgres=# SELECT VERSION();
                                                           version
------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 15.4 - Percona Distribution on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
(1 row)

Configuring postgresql.conf on three instances

Before starting the database on each node, modifying specific configuration files is essential. To do this, follow these steps:

  1. Navigate to the PostgreSQL configuration directory: Locate the ‘postgresql.conf’ file, typically found at ‘/etc/postgresql/15/main’. You will need to access this file to make the necessary adjustments.
  2. Log in as the ‘postgres’ user: To access and modify the ‘postgresql.conf’ file, log in with the “postgres” user account, which has the necessary permissions for configuration changes:

sudo su postgres

  3. Configure essential settings: Within ‘postgresql.conf’, set each node’s IP address, port, and SSL certificates. Ensure that these values align with your desired database configuration.

Node main

# IP address
listen_addresses = '172.31.88.41'

# Port for the main (coordinator) node
port = 5432

# - SSL -
ssl = on
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'

Node worker01

# IP address
listen_addresses = '172.31.86.26'

# Port for the worker01 node
port = 5433

# - SSL -
ssl = on
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'

Node worker02

# IP address
listen_addresses = '172.31.89.45'

# Port for the worker02 node
port = 5434

# - SSL -
ssl = on
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'
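
Once each server has been restarted (covered below), you can confirm from psql that the node picked up its settings. A quick check:

-- Confirm the effective address and port on each node
SELECT name, setting FROM pg_settings WHERE name IN ('listen_addresses', 'port');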

Configuring pg_hba.conf for three instances

Now, let’s modify ‘pg_hba.conf,’ which can be found in the same directory as the ‘postgresql.conf’ file: ‘/etc/postgresql/15/main’.

Node main

# Database administrative login by Unix domain socket
local   all             postgres                                peer

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     peer

# IPv4 local connections:
host    all             all           127.0.0.1/32              trust

# IPv6 local connections:
host    all             all            ::1/128                 scram-sha-256

# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     peer
host    replication     all             127.0.0.1/32            scram-sha-256
host    replication     all             ::1/128                 scram-sha-256

Node worker01

# Database administrative login by Unix domain socket
local   all             postgres                                peer

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     peer

# IPv4 connections from the coordinator node:
host    all             postgres        172.31.88.41/32         trust

# IPv6 local connections:
host    all             all           ::1/128                   scram-sha-256

# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     peer
host    replication     all             127.0.0.1/32            scram-sha-256
host    replication     all             ::1/128                 scram-sha-256

Node worker02

# Database administrative login by Unix domain socket
local   all             postgres                                peer

# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     peer

# IPv4 connections from the coordinator node:
host    all             postgres        172.31.88.41/32         trust

# IPv6 local connections:
host    all             all             ::1/128                 scram-sha-256

# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     peer
host    replication     all             127.0.0.1/32            scram-sha-256
host    replication     all             ::1/128                 scram-sha-256

Note: Using “trust” in the settings is too permissive. See Citus notes about Increasing Worker Security. The PostgreSQL manual explains how to make them more restrictive.
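
For example, a more restrictive rule could require password authentication instead of “trust” (a sketch; adjust the CIDR range and user to match your environment):

# Require scram-sha-256 authentication from the cluster subnet
host    all             postgres        172.31.0.0/16           scram-sha-256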

Restart all PostgreSQL instances

Now, we will restart the PostgreSQL instances.

sudo systemctl restart postgresql

Verify that the three Percona Distribution for PostgreSQL nodes are listening on ports 5432, 5433, and 5434.

Node main on port 5432

ubuntu@ip-172-31-88-41:~$ ss -tuln
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port    Process
tcp    LISTEN   0       244   172.31.88.41:5432      0.0.0.0:*

Worker01 on port 5433

ubuntu@ip-172-31-86-26:~$ ss -tuln
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port    Process
tcp    LISTEN   0      244    172.31.86.26:5433       0.0.0.0:*

Worker02 on port 5434

ubuntu@ip-172-31-89-45:~$ ss -tuln
Netid  State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port    Process
tcp    LISTEN   0      244    172.31.89.45:5434      0.0.0.0:*
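
You can also confirm that the nodes accept remote connections. For example, from the main node (this assumes the pg_hba.conf rules above permit the “postgres” user from that address):

# From the main node, check that each worker accepts connections
psql -h 172.31.86.26 -p 5433 -U postgres -c 'SELECT 1;'
psql -h 172.31.89.45 -p 5434 -U postgres -c 'SELECT 1;'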

Are you ready to see what we can do with these Percona Distribution for PostgreSQL nodes?

Check the Scalable Solutions with Percona Distribution for PostgreSQL (Part 2): Using Citus, and Percona Distribution for PostgreSQL 15: An Introduction to Scalable Database Solutions.

In the meantime, you can learn more about the Percona Distribution for PostgreSQL. Additionally, if you’re interested in deploying PostgreSQL on Kubernetes, you can explore information about the Percona Operator for PostgreSQL.

You can also contact Percona’s experts for assistance in maximizing your application’s performance. We offer support for open-source databases, managed services, and consulting services, or visit our Percona Community forum; we are happy to help.

Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community, in a single distribution, designed and tested to work together.

 

Download Percona Distribution for PostgreSQL Today!
