Jun 12, 2024

GCache and Record-Set Cache Encryption in Percona XtraDB Cluster – Part Two

Keeping Your Data Safe: An Introduction to Data-at-Rest Encryption in Percona XtraDB Cluster. In the first part of this blog post, we learned how to enable GCache and Record-Set cache encryption in Percona XtraDB Cluster. This part will explore the details of the implementation to understand what happens behind the scenes. How does it work […]

Jun 07, 2024

GCache and Record-Set Cache Encryption in Percona XtraDB Cluster – Part One

Keeping Your Data Safe: An Introduction to Data-at-Rest Encryption in Percona XtraDB Cluster. Data breaches are a major concern today. Whether you’re a business owner or value your privacy, the thought of your sensitive information being exposed is unsettling. Fortunately, data-at-rest encryption can help. Think of it as a sophisticated lock that scrambles your data, […]

Jun 28, 2023

Consequences of Forcing Bootstrap on Percona XtraDB Cluster

Recently, I was working on a very unfortunate case that revolved around diverging clusters, data loss, missed error log messages, and forced commands on Percona XtraDB Cluster (PXC). Even though PXC tries its best to explain what happens in the error log, I can vouch that those messages are easily missed or overlooked when you do not know what to expect.

This blog post is a cautionary tale and an invitation to try it yourself and break stuff (not in production, right?).

TL;DR:
Can you tell right away what happened when you see this log?

2023-06-22T08:23:29.003334Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:group_post_state_exchange():433: Reversing history: 171 -> 44, this member has applied 127 more events than the primary component.Data loss is possible. Must abort.

Demonstration

Using the great https://github.com/datacharmer/dbdeployer:

$ dbdeployer deploy replication --topology=pxc --sandbox-binary=~/opt/pxc  8.0.31

Let’s write some data

$ ./sandboxes/pxc_msb_8_0_31/sysbench oltp_read_write --tables=2 --table-size=1000 prepare

Then let’s suppose someone wants to restart node 1. For some reason, they read somewhere in your internal documentation that they should bootstrap in that situation. With dbdeployer, this will translate to:

$ ./sandboxes/pxc_msb_8_0_31/node1/stop
stop /home/yoann-lc/sandboxes/pxc_msb_8_0_31/node1

$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-new-cluster
......................................................................................................^C

It fails, as it should.

In reality, these bootstrap mistakes happen in homemade start scripts, Puppet or Ansible modules, or internal procedures applied in the wrong situation.

Why did it fail? First error to notice:

2023-06-22T08:00:48.322148Z 0 [ERROR] [MY-000000] [Galera] It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .

Reminder: Bootstrap should only be used when every node has been double-checked to be down; it’s a manual operation. It fails here because it was not forced and because this node was not the last to be stopped in the cluster.

Good reflex: Connect to the other MySQL nodes and check the ‘wsrep_cluster_size’ and ‘wsrep_cluster_status’ statuses before doing anything.

mysql> show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_local_commits','wsrep_received','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected');

Do not: Blindly apply what this log tells you to do.

But we are here to “fix” around and find out, so let’s bootstrap.

$ sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' ./sandboxes/pxc_msb_8_0_31/node1/data/grastate.dat
$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-new-cluster
.. sandbox server started

At this point, notice that from node1, you have:

$ ./sandboxes/pxc_msb_8_0_31/node1/use -e "show global status where variable_name in ('wsrep_cluster_status', 'wsrep_cluster_size')"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 1       |
| wsrep_cluster_status | Primary |
+----------------------+---------+

But from node2 and node3 you will have:

$ ./sandboxes/pxc_msb_8_0_31/node2/use -e "show global status where variable_name in ('wsrep_cluster_status', 'wsrep_cluster_size')"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 2       |
| wsrep_cluster_status | Primary |
+----------------------+---------+

Looks fishy. But does your monitoring really alert you to this?

Let’s write some more data, obviously on node1, because why not? It looks healthy.

$ ./sandboxes/pxc_msb_8_0_31/node1/sysbench oltp_delete --tables=2 --table-size=1000 --events=127  run

127 will be useful later on.

Nightmare ensues

Fast forward a few days. You are still writing to your node. A new reason to restart node1 comes up; maybe you want to apply a parameter change.

$ ./sandboxes/pxc_msb_8_0_31/node1/restart
.............................................................................................................................................................^C

It fails?

Reviewing logs, you would find:

$ less sandboxes/pxc_msb_8_0_31/node1/data/msandbox.err
...
2023-06-22T08:23:29.003334Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:group_post_state_exchange():433: Reversing history: 171 -> 44, this member has applied 127 more events than the primary component.Data loss is possible. Must abort.
...

Voilà, we find our “127” again.

Good reflex: It depends. Recovery would need a post of its own, but this is a serious problem.

Do not: Force SST on this node, because it will work, and all the data inserted on node1 since it diverged will be lost.

What does it mean?

When forcing bootstrap, a node will always start. It won’t even try to connect to the other nodes, healthy or not. The other nodes won’t try to connect to the bootstrapped one either; from their point of view, it just never joined, so it’s not part of their cluster.

When the previously bootstrapped node1 is restarted in non-bootstrap mode, that’s the first time the nodes all see each other in a while.

Each time a transaction is committed, it is replicated along with a sequence number (seqno). The seqno is an ever-growing number. It is used by nodes to determine if incremental state transfer is possible, or if a node state is coherent with others.
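
If you want to see the seqno for yourself in this sandbox, here is a minimal sketch (wsrep_last_committed is the live counterpart of the seqno stored in grastate.dat; the file only gets a real seqno on a clean shutdown and shows -1 while the node is running):

$ ./sandboxes/pxc_msb_8_0_31/node1/use -e "show global status like 'wsrep_last_committed'"
$ grep -E 'seqno|safe_to_bootstrap' ./sandboxes/pxc_msb_8_0_31/node1/data/grastate.dat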

Now that node1 is no longer in bootstrap mode, it connects to the other members and shares its state (last primary members, seqno). The other nodes correctly pick up that this seqno looks suspicious: it is higher than their own, meaning the joining node could have applied more transactions than they did. It could also mean it came from some other cluster.

Because the nodes are in doubt, nothing will happen. Node1 is denied joining and will not do anything: it won’t try to resynchronize automatically, and it won’t touch its data. Node2 and node3 are not impacted; they are kept as is too.

How to proceed from there depends on the situation, as there are no general guidelines. Ideally, a source of truth should be found. If both clusters applied writes, that’s the toughest situation to be in: it’s a split-brain.

Note: seqnos are just numbers. Having equal seqnos does not actually guarantee that the underlying transactions applied are identical, but it’s still useful as a simple sanity check. If we were to mess around even more and apply 127 transactions on node2, or even modify the seqno manually in grastate.dat, we could get “interesting” results. Try it out (not in production, mind you)!

Note: If you are unaware of bootstrapping and how to properly recover, check out the documentation.
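
For reference, when a whole cluster is down and you legitimately need to pick a bootstrap candidate, the usual check looks like the sketch below (same sandbox paths as above): bootstrap only from the node with the highest seqno, and if every node shows -1 after an unclean shutdown, recover the position first with mysqld’s --wsrep-recover option.

$ grep -E 'seqno|safe_to_bootstrap' ./sandboxes/pxc_msb_8_0_31/node*/data/grastate.dat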

Conclusion

Bootstrapping is a last-resort procedure; don’t force it lightly. Don’t force SST right away if a node does not want to join, either. Always check the error log first.

Fortunately, PXC does not blindly let any node join without some sanity checks.


Nov 25, 2022

Exploring Data Dump and Load Utility With MySQL Shell

In this blog post, I will try to cover a few dump and data-loading utilities offered by MySQL Shell.

What is MySQL Shell?

It is a robust client and code editor for MySQL. In addition to having APIs for working with MySQL, MySQL Shell offers scripting capabilities for JavaScript and Python. It provides an AdminAPI to manage MySQL servers and create InnoDB Cluster and ReplicaSet deployments. It supports the X DevAPI to work on both SQL and NoSQL interfaces via the X Protocol. Finally, it provides utilities to make working with MySQL in Python and JavaScript mode easier.
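
For example, after connecting you can switch between the language modes with the built-in \sql, \js, and \py commands; the util.* calls shown later in this post are issued from the JavaScript (or Python) mode:

$ mysqlsh --uri root@localhost
mysql shell > \sql
mysql shell > \py
mysql shell > \js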

Now, let’s jump into some practical stuff and see what these tools offer us.

Dump Utility

The dump utility consists of three tools: util.dumpInstance(), util.dumpSchemas(), and util.dumpTables(), so let’s discuss them one by one.

1.  “util.dumpInstance()” – This was introduced in MySQL Shell 8.0.21 and supports the export of all databases excluding the information_schema, mysql, ndbinfo, performance_schema, and sys schemas. The backups can be kept locally, or a remote MySQL instance can be dumped through the MySQL Shell utility.

Performing a complete instance dump with “util.dumpInstance()”:

a) Log in to the database via MySQL Shell.

shell> mysqlsh --uri root@localhost

b) Prior to doing the actual backup, we will first perform a dryRun procedure to check for any potential problems or compatibility issues.

mysql shell > util.dumpInstance("/home/backup/", {dryRun: true})

Output:

dumping.
Acquiring global read lock
Global read lock acquired
Initializing - done 
1 out of 5 schemas will be dumped and within them 16 tables, 0 views.
3 out of 5 users will be dumped.
Gathering information - done 
All transactions have been started
Global read lock has been released
Writing global DDL files
Writing users DDL
Writing DDL - done         
Starting data dump
0% (0 rows / ~1.60M rows), 0.00 rows/s, 0.00 B/s uncompressed, 0.00 B/s compressed

So, we didn’t get any issues or warnings here, and we are good to proceed with the actual backup process.

c) Running the full instance backup and saving in the target location.

mysql shell >util.dumpInstance("/home/backup/")

Output:

dumping.
Acquiring global read lock
Global read lock acquired
Initializing - done 
1 out of 5 schemas will be dumped and within them 16 tables, 0 views.
3 out of 5 users will be dumped.
Gathering information - done 
All transactions have been started
Global read lock has been released
Writing global DDL files
Writing users DDL
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done       
Writing DDL - done         
Writing table metadata - done         
Starting data dump
101% (1.63M rows / ~1.60M rows), 114.56K rows/s, 22.13 MB/s uncompressed, 10.09 MB/s compressed                 
Dump duration: 00:00:15s                                                                       
Total duration: 00:00:15s                                                                      
Schemas dumped: 1                                                                              
Tables dumped: 16                                                                              
Uncompressed data size: 313.55 MB                                                              
Compressed data size: 142.87 MB                                                                
Compression ratio: 2.2                                                                         
Rows written: 1629368                                                                          
Bytes written: 142.87 MB                                                                       
Average uncompressed throughput: 20.01 MB/s                                                    
Average compressed throughput: 9.12 MB/s

We have successfully finished the backup here. By default, the utility uses four threads and applies the zstd compression algorithm to the database backup.

d) Running the instance dump with more advanced options.

mysql shell >util.dumpInstance("/home/backup/", {threads:8,maxRate:"100M",consistent:true,chunking:true,bytesPerChunk:"64M",compression:"zstd"})

Output:

Acquiring global read lock
Global read lock acquired
Initializing - done 
1 out of 5 schemas will be dumped and within them 16 tables, 0 views.
3 out of 5 users will be dumped.
Gathering information - done 
All transactions have been started
Global read lock has been released
Writing global DDL files
Writing users DDL
Running data dump using 8 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done       
Writing DDL - done         
Writing table metadata - done         
Starting data dump
101% (1.63M rows / ~1.60M rows), 468.47K rows/s, 84.22 MB/s uncompressed, 38.41 MB/s compressed                  
Dump duration: 00:00:03s                                                                       
Total duration: 00:00:03s                                                                      
Schemas dumped: 1                                                                              
Tables dumped: 16                                                                              
Uncompressed data size: 313.55 MB                                                              
Compressed data size: 142.87 MB                                                                
Compression ratio: 2.2                                                                         
Rows written: 1629368                                                                          
Bytes written: 142.87 MB                                                                       
Average uncompressed throughput: 85.24 MB/s                                                    
Average compressed throughput: 38.84 MB/s

There are various options available to control and enhance the dump process.

“dryRun”: Lists all of the compatibility issues before starting the dump.
“consistent”: Automatically acquires a global read lock to provide a consistent backup. It is enabled by default.
“chunking”: Splits each table’s data into a number of files. It is enabled by default.
“bytesPerChunk”: Specifies the approximate number of bytes to be written to each data chunk.
“threads”: The number of parallel threads to use when dumping data from the MySQL instance. Four threads are used by default.
“maxRate”: The maximum data read throughput for each thread during the dump, in bytes per second.

2) “util.dumpSchemas” –  This was introduced in MySQL Shell 8.0.21 and supports the export of selected databases either locally or remotely. 

Performing a database dump with “util.dumpSchemas()”. Here we will back up only a specific database (“sbtest”).

mysql shell > util.dumpSchemas(["sbtest"],"/home/backup/sbtest/",{threads :2})

Output:

dumping.
Acquiring global read lock
Global read lock acquired
Initializing - done 
1 schemas will be dumped and within them 16 tables, 0 views.
Gathering information - done 
All transactions have been started
Global read lock has been released
Writing global DDL files
Running data dump using 2 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done       
Writing DDL - done         
Writing table metadata - done         
Starting data dump
104% (1.63M rows / ~1.56M rows), 424.56K rows/s, 81.96 MB/s uncompressed, 37.35 MB/s compressed                  
Dump duration: 00:00:03s                                                                       
Total duration: 00:00:03s                                                                      
Schemas dumped: 1                                                                              
Tables dumped: 16                                                                              
Uncompressed data size: 313.55 MB                                                              
Compressed data size: 142.87 MB                                                                
Compression ratio: 2.2                                                                         
Rows written: 1629368                                                                          
Bytes written: 142.87 MB                                                                       
Average uncompressed throughput: 81.82 MB/s                                                    
Average compressed throughput: 37.28 MB/s

3) “util.dumpTables” –  In MySQL Shell 8.0.22, the table dump utility “util.dumpTables()” was released.

Performing a selective table dump. Here we dump only the table “sbtest2” from the “sbtest” database.

mysql shell> util.dumpTables("sbtest", [ "sbtest2"], "/home/backup/sbtest/",{threads: 4})

Output:

Writing schema metadata - done       
Writing DDL - done       
Writing table metadata - done       
Starting data dump
101% (101.89K rows / ~100.58K rows), 0.00 rows/s, 0.00 B/s uncompressed, 0.00 B/s compressed                
Dump duration: 00:00:00s                                                                    
Total duration: 00:00:00s                                                                   
Schemas dumped: 1                                                                           
Tables dumped: 1                                                                            
Uncompressed data size: 19.61 MB                                                            
Compressed data size: 8.93 MB                                                               
Compression ratio: 2.2                                                                      
Rows written: 101892                                                                        
Bytes written: 8.93 MB                                                                      
Average uncompressed throughput: 19.61 MB/s                                                 
Average compressed throughput: 8.93 MB/s

Data load utility

So far, we’ve seen how to use various methods to take a dump. We will now demonstrate how to restore the same using a single restoration command for all sorts of backups.

The utility “util.loadDump()” for loading dumps was introduced in MySQL Shell 8.0.21. It enables the parallel loading of tables or table chunks during the data-loading process.

Load the dump files using util.loadDump().

a) Running a dry run process to validate any compatibility issues or errors prior to the actual restoration process.

mysql shell > util.loadDump("/home/backup/", {dryRun: true})

Output:

Loading DDL and Data from '/home/backup/' using 4 threads.
Opening dump...
dryRun enabled, no changes will be made.
Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log
Scanning metadata - done       
Checking for pre-existing objects...
Executing common preamble SQL
Executing DDL - done       
Executing view DDL - done       
Starting data load
Executing common postamble SQL                        
0% (0 bytes / 313.55 MB), 0.00 B/s, 16 / 16 tables done
Recreating indexes - done 
No data loaded.                                        
0 warnings were reported during the load.

Note – no data is loaded; the dry run just prints information about any warnings or errors that would come up during the restoration process.

b) Running the data restoration process.

mysql shell > util.loadDump("/home/backup/", {progressFile :"/home/backuplog.json",threads:4,backgroundThreads:4,maxBytesPerTransaction:"4096"})

Output:

Loading DDL and Data from '/home/backup/' using 4 threads.
Opening dump...
Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log
Scanning metadata - done       
Checking for pre-existing objects...
Executing common preamble SQL
Executing DDL - done       
Executing view DDL - done       
Starting data load
Executing common postamble SQL                                              
100% (313.55 MB / 313.55 MB), 3.03 MB/s, 16 / 16 tables done                
Recreating indexes - done       
16 chunks (1.63M rows, 313.55 MB) for 16 tables in 1 schemas were loaded in 2 min 43 sec (avg throughput 1.92 MB/s)
0 warnings were reported during the load.

Here, we have successfully completed the restoration process.

There are various options available to control and enhance the data loading process.

“threads”: The number of parallel threads to use when loading data into the MySQL instance. Four threads are used by default.
“backgroundThreads”: The number of threads in the background pool used to fetch the file data.
“maxBytesPerTransaction”: The maximum number of bytes that can be loaded from a data file with a single LOAD DATA statement.
“loadIndexes”: When set to false, secondary indexes are not built during the import; you must build them afterwards. This comes in handy if you are loading the DDL files and data files separately, or want to modify the table structure after the DDL files have been loaded (see the sketch below).
“deferTableIndexes”: Postpones the creation of secondary indexes until after the table’s data is loaded, which may speed up loading. “off” means all indexes are built as the table is loaded.
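
As a sketch of combining the last two options (based on the descriptions above, not a verified recipe): load the schema and data first without secondary indexes, then re-run the utility to build them in a second pass. In practice, a progressFile helps the second run skip the data that is already loaded.

mysql shell > util.loadDump("/home/backup/", {threads: 4, loadIndexes: false, progressFile: "/home/backuplog.json"})
mysql shell > util.loadDump("/home/backup/", {loadIndexes: true, deferTableIndexes: "all", progressFile: "/home/backuplog.json"})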

Taking backups on cloud storage (GCP and AWS S3)

MySQL Shell also provides the capability to store dumps remotely in S3-compatible cloud storage such as Amazon S3 and Google Cloud Storage. Here, we’ll show how to do it with Google Cloud Storage and with an Amazon S3 environment.

Let’s start by configuring it for Google Cloud Storage.

Prerequisites

  • Storage Bucket should exist in Google Cloud
  • Access keys and secrets defined in a file(“/home/credentials”) 
[gcb]
aws_access_key_id=xxxxxxxx
aws_secret_access_key=xxxxxxxx

 

a) Running full backup and putting it in the GCP bucket “ajpxctest”.

mysql shell> util.dumpInstance("sbtest",{s3BucketName: "ajpxctest",s3EndpointOverride: "https://storage.googleapis.com", s3CredentialsFile:"/home/credentials", s3Profile: "gcb", threads: 4})

Output:

dumping.
Acquiring global read lock
Global read lock acquired
Initializing - done 
1 out of 5 schemas will be dumped and within them 16 tables, 0 views.
3 out of 5 users will be dumped.
Gathering information - done 
All transactions have been started
Global read lock has been released
Writing global DDL files
Writing users DDL
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done       
Writing DDL - done         
Writing table metadata - done         
Starting data dump
104% (1.63M rows / ~1.56M rows), 172.12K rows/s, 34.60 MB/s uncompressed, 15.75 MB/s compressed                  
Dump duration: 00:00:12s                                                                       
Total duration: 00:00:14s                                                                      
Schemas dumped: 1                                                                              
Tables dumped: 16                                                                              
Uncompressed data size: 313.55 MB                                                              
Compressed data size: 142.87 MB                                                                
Compression ratio: 2.2                                                                         
Rows written: 1629368                                                                          
Bytes written: 142.87 MB                                                                       
Average uncompressed throughput: 25.21 MB/s                                                    
Average compressed throughput: 11.49 MB/s

b) Then, using these same details, we will restore the data from the S3 bucket into the local database.

mysql shell> util.loadDump("sbtest",{s3BucketName: "ajpxctest", s3EndpointOverride: "https://storage.googleapis.com",s3CredentialsFile:"/home/credentials", s3Profile: "gcb", threads: 4})

Output:

Loading DDL and Data from AWS S3 bucket=ajpxctest, prefix='sbtest' using 4 threads.
Opening dump...
Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log
Fetching dump data from remote location...
Listing files - done 
Scanning metadata - done         
Checking for pre-existing objects...
Executing common preamble SQL
Executing DDL - done         
Executing view DDL - done       
Starting data load
1 thds loading / 100% (313.55 MB / 313.55 MB), 6.23 MB/s, 14 / 16 tables done
Recreating indexes - done       
Executing common postamble SQL                                               
16 chunks (1.63M rows, 313.55 MB) for 16 tables in 1 schemas were loaded in 1 min 8 sec (avg throughput 7.41 MB/s)
0 warnings were reported during the load.      

S3 options:

“s3BucketName”: Name of the s3 bucket where the dump will be kept.
“s3EndpointOverride”: s3 endpoint or url.
“s3CredentialsFile”: Credential file containing key & secret details.
“s3Profile”: The name of the s3 profile to identify the credential details.

Let’s configure the same for Amazon S3 

Prerequisites

  • AWS S3 bucket should exist
  • Configure AWS CLI
###~/.aws/credentials### 
[default]
aws_access_key_id=xxxxxxx
aws_secret_access_key=xxxxxxx

###~/.aws/config### 
[default]
region = us-east-1

a) Now, let’s perform the instance backup process which keeps the dump file in “ajtestbkp” S3 bucket.

mysql shell> util.dumpInstance("sbtest",{s3BucketName: "ajtestbkp",threads: 4})

Output:

Acquiring global read lock
Global read lock acquired
Initializing - done 
1 out of 5 schemas will be dumped and within them 16 tables, 0 views.
3 out of 5 users will be dumped.
Gathering information - done 
All transactions have been started
Global read lock has been released
Writing global DDL files
Writing users DDL
Running data dump using 4 threads.
NOTE: Progress information uses estimated values and may not be accurate.
Writing schema metadata - done       
Writing DDL - done         
Writing table metadata - done         
Starting data dump
101% (1.63M rows / ~1.61M rows), 287.53K rows/s, 53.81 MB/s uncompressed, 24.51 MB/s compressed                  
Dump duration: 00:00:06s                                                                       
Total duration: 00:00:07s                                                                      
Schemas dumped: 1                                                                              
Tables dumped: 16                                                                              
Uncompressed data size: 313.55 MB                                                              
Compressed data size: 142.87 MB                                                                
Compression ratio: 2.2                                                                         
Rows written: 1629368                                                                          
Bytes written: 142.87 MB                                                                       
Average uncompressed throughput: 49.93 MB/s                                                    
Average compressed throughput: 22.75 MB/s

b) Now, let’s restore the dump from S3 into our local database.

mysql shell> util.loadDump("sbtest",{s3BucketName: "ajtestbkp", threads: 4})

Output:

Loading DDL and Data from AWS S3 bucket=ajtestbkp, prefix='sbtest' using 4 threads.
Opening dump...
Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log
Fetching dump data from remote location...
Listing files - done 
Scanning metadata - done        
Checking for pre-existing objects...
Executing common preamble SQL
Executing DDL - done         
Executing view DDL - done       
Starting data load
1 thds loading / 100% (313.55 MB / 313.55 MB), 8.80 MB/s, 15 / 16 tables done
Recreating indexes - done       
Executing common postamble SQL                                               
16 chunks (1.63M rows, 313.55 MB) for 16 tables in 1 schemas were loaded in 38 sec (avg throughput 9.14 MB/s)
0 warnings were reported during the load.

Summary

These utilities are a great addition to MySQL Shell and very useful for backing up and restoring large tables with different control options. We can now handle the dump/load process better by defining options such as threads, maxRate, and maxBytesPerTransaction. The dump utilities support MySQL 5.6, 5.7, and 8.0; however, for data loading, the target database should be either MySQL 5.7 or 8.0.

Sep 15, 2022

Percona XtraDB Cluster on Amazon EC2 and Two Interesting Changes in PXC 8.0

This article outlines the basic configurations for setting up and deploying Percona XtraDB Cluster 8.0 (PXC) on Amazon EC2, as well as what is new in the setup compared to Percona XtraDB Cluster 5.7.

What is Percona XtraDB Cluster an ideal fit for?

Percona XtraDB Cluster is a cost-effective, high-performance clustering solution for mission-critical data. It combines all the improvements, and functionality found in MySQL 8 with Percona Server for MySQL‘s Enterprise features and Percona’s upgraded Galera library.

A Percona XtraDB Cluster environment is an ideal fit for applications requiring five-nines (99.999%) uptime with high read workloads, and for industries like finance or healthcare that require in-house or externally dedicated database resources.

How is a three-node cluster configured in an EC2 environment?

To describe the setup procedure, I’ll be using Amazon EC2 instances to build the environment; depending on the business requirements, alternative infrastructure may be used to build the cluster environment.

Amazon EC2 settings are designed to provide uptime and high availability. When designing the architecture in an EC2 environment, it is preferable to place each node in a different Availability Zone so that the loss of an entire AZ does not also mean the loss of the cluster and its data.

If a different region is planned for a node, we can also survive the loss of an entire region and its data. Nodes and regions should have good network connectivity, because network latency between the two regions affects synchronous replication write latency. Alternatively, an async replica in a different region is an option.

I’m not going into too much depth on Amazon EC2 to keep this blog brief and readable. 

To build the three-node Percona XtraDB 8.0 cluster environment we first spin up the following three nodes in EC2. I’m using Amazon Linux but you can also use Ubuntu or any of the Percona-supported operating systems.

It is advised that a cluster’s nodes all have the same configuration. 

PXCNode1   IP Address: 172.31.18.119

PXCNode2   IP Address: 172.31.86.114

PXCNode3  IP Address: 172.31.26.152

To install the Percona repository on all three nodes, use the following command.

$ sudo yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm

Enable the Percona Server for MySQL 8.0 repository in all three nodes by running the following command.

$ sudo percona-release setup pxc-80
* Disabling all Percona Repositories
* Enabling the Percona XtraDB Cluster 8.0 repository
* Enabling the Percona Tools repository
<*> All done!
$

Using the following command, install the Percona XtraDB Cluster packages and software on all three nodes.

$ sudo yum install percona-xtradb-cluster

Before starting the nodes, update the basic variables listed below for the nodes

The following default variables must be modified for a first installation. Those coming from PXC 5.7 might wonder why wsrep_sst_auth is missing. The wsrep_sst_auth variable was removed in PXC 8 since it caused security concerns: the user and password were saved in the config file and easily visible to OS users.

In PXC 8, a temporary user is created when a new node joins the existing cluster. For additional details on this security enhancement, check this article.

$ vi /etc/my.cnf
..
######## wsrep ###############
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_address=gcomm://172.31.18.119,172.31.86.114,172.31.26.152
wsrep_slave_threads=8
wsrep_log_conflicts
wsrep_node_address=172.31.18.119
wsrep_cluster_name=pxc-cluster
#If wsrep_node_name is not specified,  then system hostname will be used
wsrep_node_name=pxc-cluster-node-1
wsrep_sst_method=xtrabackup-v2
..

Percona XtraDB Cluster nodes utilize the following ports by default, so we must open them and ensure that the nodes can communicate with one another.

    3306 is used for MySQL client connections and SST (State Snapshot Transfer) via mysqldump.

    4444 is used for SST via Percona XtraBackup.

    4567 is used for write-set replication traffic (over TCP) and multicast replication (over TCP and UDP).

    4568 is used for IST (Incremental State Transfer).

For example, to test access.

Node 1
#  socat - TCP-LISTEN:4444
hello

Node 2
# echo "hello" | socat - TCP:172.31.18.119:4444
#

How is the first node bootstrapped?

After configuring each PXC node, you must bootstrap the cluster starting with the first node. All of the data that you wish to replicate to additional nodes must be present on the first node. 

Run the command to bootstrap the first node.

# systemctl start mysql@bootstrap.service
#

Since this is a brand new install, a temporary password is generated for the ‘root’ MySQL user which we can find in the mysqld.log

# grep -i "A temporary password is generated " /var/log/mysqld.log
2022-09-13T06:52:37.700818Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: PmqQiGf*#6iy
#

Reset the temporary password using the following statement.

$ mysql -uroot -p
Enter password:
mysql> SET PASSWORD = 'GdKG*12#ULmE';
Query OK, 0 rows affected (0.03 sec)

How can the cluster’s remaining nodes be joined?

Before starting node2, you must copy the SSL certificates from node1 to node2 (and to node3). PXC 8 by default encrypts all replication communication, so this is a critical step that most users miss, causing cluster startup failures.

PXCNode1# scp /var/lib/mysql/*.pem 172.31.86.114:/var/lib/mysql/
PXCNode2# chown -R mysql.mysql /var/lib/mysql/*.pem

Start the node.

PXCNode2# systemctl start mysql
#

Verify the following status variables and make sure they appear as below once the node has been added to the cluster.

PXCNode2$ mysql -uroot -p -e " show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected','wsrep_ready');"
Enter password:
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_cluster_size        | 2       |
| wsrep_cluster_status      | Primary |
| wsrep_connected           | ON      |
| wsrep_local_state         | 4       |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
+---------------------------+---------+

The third node may be added to the cluster using the same procedures, and its status will then look as follows.

PXCNode3$ mysql -uroot -p -e " show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected','wsrep_ready');"
Enter password:
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_cluster_size        | 3       |
| wsrep_cluster_status      | Primary |
| wsrep_connected           | ON      |
| wsrep_local_state         | 4       |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
+---------------------------+---------+

Additional supporting factors

Additionally, the use of failover technologies like ProxySQL is advised; they help remove failing nodes from the active read pool and, in some cases, shift the primary.

It is advised to have backups in place, with the open source tool Percona XtraBackup taking physical copies of the dataset, which are significantly faster to recover. It is also strongly advised to back up binary logs using mysqlbinlog in order to do point-in-time recovery. Backups should be encrypted, compressed, and transferred to S3 as soon as possible.
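
As an illustration only (the backup user name is made up, and the flags are trimmed to the essentials), a physical backup with Percona XtraBackup boils down to a backup phase followed by a prepare phase:

# take the physical copy
$ xtrabackup --backup --user=backup_user --password='***' --target-dir=/backups/full/
# make the copy consistent so it can be restored later
$ xtrabackup --prepare --target-dir=/backups/full/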

To copy binlogs to an s3 bucket the command looks like this.

aws s3 sync /backups/binlogs/ s3://backup-bucket/
upload: ../../../../backups/binlogs/binlog.000001.gz.gpg to 
s3://backup-bucket/binlog.000001.gpg

For query analytics and time-based database performance insights, the open-source tool Percona Monitoring and Management is highly recommended. Using the Amazon Marketplace, this needs to be deployed on a separate host. This monitors the operating system and MySQL metrics and provides sophisticated query analytics.

Aug 11, 2022

How to Build Percona XtraDB Cluster From Sources

Percona XtraDB Cluster (PXC) is a database clustering solution for MySQL. It ensures high availability, prevents downtime and data loss, and provides linear scalability for a growing environment.

Lately, the number of questions about how to build Percona software has increased. More and more people try to add their own patches and modifications and build the software by themselves. But this raises the question of how to do it the same way Percona does, as sometimes a compiler flag can have a drastic impact on the final binary. This is a really essential question, so let’s look at how we build Percona XtraDB Cluster here at Percona.

At first glance, the build process is not difficult. Everyone knows the magic make install command and assumes that, as a result, everything will be built and installed on the system. But databases are essential to applications, and everything needs to be done carefully.

Let’s divide the build process into the stages:

  1. Prepare build environments
  2. Prepare source tarballs
  3. Prepare source package
  4. Prepare binary packages

We can skip the third stage, but in this case, we wouldn’t be able to make repeatable builds so I would suggest always creating source packages.

So let’s move forward and discuss each step.

One of the key points during a build is that the environment should be clean, with all needed dependencies installed. For each version, the dependency list will be different. How do you get the correct dependency list? You can get all build requirements from the spec file (on RPM-based systems) or from the control file (on DEB-based systems).
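
For example, a quick way to list them (the source path is illustrative):

# RPM-based systems: build requirements live in the spec file
$ grep -i '^BuildRequires' /path/to/pxc-source/*.spec
# DEB-based systems: they live in the control file
$ grep -i '^Build-Depends' /path/to/pxc-source/debian/control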

As for source tarball, you can easily do the following:

  • Get source tarball from the website

We publish source tarball for each release we issue, and you can easily get it for any released version (this is the link for the 8.0.28 version):

https://downloads.percona.com/downloads/Percona-XtraDB-Cluster-LATEST/Percona-XtraDB-Cluster-8.0.28/source/tarball/Percona-XtraDB-Cluster-8.0.28-19.tar.gz

  • Download tarball from GitHub

Here is the link:

https://github.com/percona/percona-xtradb-cluster/archive/refs/tags/Percona-XtraDB-Cluster-8.0.28-19.1.tar.gz

You can use build instructions from our website and proceed with the compilation process:

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/install/compile.html#compile

So everything looks easy. But how do we prepare our release packages internally? As I mentioned earlier, each compiler option can have a significant effect.

Everyone tries to make life easier and automate all tasks. That’s great, as automation is one of the keys to success: you can work on other tasks while the release build is in progress. So we have created a build script that is used for making builds. It can be found in the PXC GitHub repo and can be used by anyone to make their own builds.

This script can install all needed dependencies for the build environment and create source RPMs and source DEBs, binary RPMs and DEBs, and the binary tarball, so it covers the whole build cycle. So how do we use it?

The script has various params:

--builddir=DIR Absolute path to the dir where all actions will be performed

--get_sources Source will be downloaded from github

--build_src_rpm If it is 1 src rpm will be built

--build_source_deb If it is 1 source deb package will be built

--build_rpm If it is 1 rpm will be built

--build_deb If it is 1 deb will be built

--build_tarball If it is 1 tarball will be built

--install_deps Install build dependencies(root privileges are required)

--branch Branch for build

--repo Repo for build

--rpm_release RPM version( default = 1)

--deb_release DEB version( default = 1)

--debug Build debug tarball

 

So let’s see how we can make the build process easier:

1. Download the build script:

wget https://raw.githubusercontent.com/percona/percona-xtradb-cluster/8.0/build-ps/pxc_builder.sh

2. Create a build directory where you are going to perform all build actions:

mkdir /tmp/BUILD_PXC

3. Install dependencies(please note root permissions are required):

sudo ./pxc_builder.sh --builddir=/tmp/BUILD_PXC --install_deps=1

4. Download source code:

– From Percona repo (it is used by default):

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --get_sources=1 --branch=Percona-XtraDB-Cluster-8.0.28-19

– From your own repo and branch:

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --get_sources=1 --branch=<your_branch_name> --repo=<link_to_your_repo_on_github>

5. Prepare src rpm:

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_src_rpm=1

Please note if you already have your source tarball, just create a directory named source_tarball in the build directory and put it into it.

6. Prepare source deb:

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_source_deb=1

Please note if you already have your source tarball, just create a directory named source_tarball in the build directory and put it into it.

7. Prepare rpm:

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_rpm=1

8. Prepare deb:

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_deb=1

9. Prepare tarball:

./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_tarball=1

So as you can see, the build procedure becomes easier, and you don’t need to think about which dependencies are needed, which cmake parameters should be used, and so on. The build script does everything for you and uses all the build flags (if you didn’t change them in your sources) that we use for release builds.

Jul 25, 2022

Managing MySQL Configurations with the PXC Kubernetes Operator V1.10.0 Part 1: Introduction

Introduction/FAQ

Question: I need to run a production-grade open source MySQL DB.

Answer: Percona to the rescue! Percona XtraDB Cluster (PXC) is an open source enterprise MySQL solution that helps you to ensure data availability for your applications while improving security and simplifying the development of new applications in the most demanding public, private, and hybrid cloud environments

Question: I forgot to mention that I need to run it on Kubernetes.

Answer: Percona to the rescue again! Percona Distribution for MySQL Operator based on Percona XtraDB Cluster contains everything you need to quickly and consistently deploy and scale Percona XtraDB Cluster instances in a Kubernetes-based environment on-premises or in the cloud.

Question: I have a lot of MySQL configurations to manage.

Answer:  PXC Operator makes it easy to manage MySQL Configurations. Let’s explore.

For the rest of the article, the name of the PXC cluster is assumed to be cluster1, and this can be modified based on the user preference.

How can I change the MySQL configurations?

If you have not done it already, the first thing to do is install the PXC operator. Our Quickstart guide gives detailed instructions on how to get started. 

There are three possible ways to modify the MySQL configurations as described in the Documentation for MySQL options:

  1. Custom Resource PerconaXtraDBCluster (pxc/pxcs/perconaxtradbclusters )
  2. Config map with name cluster1-pxc
  3. Secret with name cluster1-pxc
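
As a minimal sketch of option 2 (the init.cnf key name is an assumption here; check the linked documentation for the exact key and format the operator expects), the ConfigMap can be created straight from a cnf file, and the Secret-based variant of option 3 is analogous:

$ cat my-config.cnf
[mysqld]
max_connections=500

$ kubectl create configmap cluster1-pxc --from-file=init.cnf=my-config.cnf
# Secret-based variant, if the options contain sensitive values:
# kubectl create secret generic cluster1-pxc --from-file=init.cnf=my-config.cnf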

Which option should I choose for managing configurations?

The choice of using the above options depends on the use case and the user’s preferences.

Following are some examples:

Using ConfigMap

  1. If the MySQL configuration is pretty big and/or if you want to maintain the configuration separately rather than updating everything in PXC object.
  2. If you want to provide permission to change MySQL configurations but not the other properties of PXC objects like resources, affinity, etc., K8s RBAC can be used to achieve this. A Role/ClusterRole can be created to provide access only for the ConfigMap which is used for MySQL configuration.

Using Secrets

  1. If there is any sensitive information that needs to be used in the configuration, Secrets are recommended. Even though k8s Secrets are just base64-encoded data, they have the advantage of integrating well with vaults, and it’s always best practice to use k8s Secrets rather than a ConfigMap when there is sensitive data.

What happens when I change MySQL configuration?

Any changes in MySQL configurations will generally recycle the pods in reverse order if the RollingUpdate strategy is used.

Example: If three replicas are used for the PXC cluster, cluster1-pxc-[0,1,2] pods would be created. When MySQL configuration is changed, cluster1-pxc-2 will be terminated first and the system will wait till the new pod cluster1-pxc-2 starts running and becomes healthy, then cluster1-pxc-1 will be terminated, and so on.

The following changes can be observed with kubectl’s --watch option.

# kubectl get po -l app.kubernetes.io/component=pxc --watch
NAME             READY   STATUS    RESTARTS   AGE
cluster1-pxc-0   3/3     Running   0          8m23s
cluster1-pxc-1   3/3     Running   0          10m
cluster1-pxc-2   3/3     Running   0          13m
cluster1-pxc-2   3/3     Terminating   0          13m
cluster1-pxc-2   0/3     Terminating   0          14m
cluster1-pxc-2   0/3     Pending       0          0s
cluster1-pxc-2   0/3     Init:0/1      0          1s
cluster1-pxc-2   0/3     PodInitializing   0          8s
cluster1-pxc-2   2/3     Running           0          10s
cluster1-pxc-2   3/3     Running           0          2m
cluster1-pxc-1   3/3     Terminating       0          14m
cluster1-pxc-1   0/3     Terminating       0          14m
cluster1-pxc-1   0/3     Pending           0          0s
cluster1-pxc-1   0/3     Init:0/1          0          1s
cluster1-pxc-1   0/3     PodInitializing   0          6s
cluster1-pxc-1   2/3     Running           0          8s
cluster1-pxc-1   3/3     Running           0          2m1s
cluster1-pxc-0   3/3     Terminating       0          13m
cluster1-pxc-0   0/3     Terminating       0          14m
cluster1-pxc-0   0/3     Pending           0          0s
cluster1-pxc-0   0/3     Init:0/1          0          0s
cluster1-pxc-0   0/3     PodInitializing   0          6s
cluster1-pxc-0   2/3     Running           0          8s
cluster1-pxc-0   3/3     Running           0          2m

In the upcoming post, we will see the precedence and the changes happening while modifying MySQL configurations. Stay tuned!

Jul 21, 2022

Preventing Stale Reads on Percona XtraDB Cluster by Enforcing Causality Checks on a Per-Query Basis

When we run a SELECT in a replica server and it returns a different result to the one we would have obtained had we queried the source server instead, due to changes in the dataset that have not yet been replicated or synchronized to the replica, we get what is known as a stale read. My colleague Marcelo has blogged about Stale Reads and How to Fix Them before.

The need to avoid stale reads is a constraint that leads people into migrating to Percona XtraDB Cluster (PXC) / Galera Cluster or, more recently, Group Replication. I am focusing on PXC and Galera in this short blog post, which, similar to Group Replication, implements not exactly a synchronous replication model but what is commonly referred to as a “virtually synchronous” replication (but not without some polemic). In short, it means a transaction that is committed in one node is guaranteed (and there’s a certification process behind this) to have “reached” other nodes, but that doesn’t mean the changes this transaction carries with it have been applied on the other nodes already. Thus, stale reads can happen in PXC too. For the long version, see the manual page about Certification in Percona XtraDB Cluster.

Why does PXC implement virtually synchronous replication instead of fully synchronous replication? There are a few challenges to accomplishing this, but we can nail it down to performance constraints. Fully synchronous replication requires not only that the transactions reach the other nodes but also that they be applied there. And then the nodes have to report back the success of the operation. For each transaction. This takes time and increases the overall commit time, which in practice makes writing to the database slow and impacts overall concurrency.
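
To make the timing issue concrete, here is the kind of interleaving that can produce a stale read (a conceptual sketch only, reusing the accounts table from the example further down):

-- on node1
INSERT INTO accounts (id, balance) VALUES (100, 50);
COMMIT;   -- the write set is certified and has "reached" node2

-- immediately afterwards, on node2
SELECT balance FROM accounts WHERE id=100;
-- may still miss the new row if the write set has not been applied on node2 yet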

Evolution of “synchronous” replication on PXC

To be clear, no matter how you configure PXC, it won’t become a fully-synchronous system (neither will Group Replication, for that matter). BUT, it is possible to make it behave in such a way that it enforces data consistency. That’s what the variable wsrep_causal_reads has been used for. When this setting was introduced, it worked like a switch that one could use to kind of enable “synchronous” replication (as in data consistency) across the cluster – either for all transactions (globally) or at a connection basis (per session). In fact, what this setting does in practice is to enforce causality checks; quoting the manual: “When this variable is set to ON, the replica will wait until that event is applied before doing any other queries”. By waiting, it means blocking new writes until the expected consistency is reached.

Being able to enable “synchronous” replication for selected connections brought the potential to mix the best of both modes: operate with virtually synchronous replication by default (good performance) and use “synchronous” replication (once more, enforcing data consistency) when stale reads cannot be afforded.

The problem with this approach is that it requires some planning in advance to determine which connections should be configured with wsrep_causal_reads enabled. Without carefully planning ahead, what ends up happening in practice in many cases is having more connections configured with wsrep_causal_reads enabled than necessary. As a result, the cluster performance gets closer to that of operating with that setting applied globally.

The switch button that was wsrep_causal_reads evolved into the more granular wsrep_sync_wait, which allows the configuration of causality checks for different combinations of statements.
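
For example, to enforce causality checks only for reads, and only on the one connection that cannot afford stale reads, the setting can be changed at the session level (the value 1 covers READ statements; see the manual for the full bitmask):

SET SESSION wsrep_sync_wait = 1;
SELECT balance FROM accounts WHERE id=100;
SET SESSION wsrep_sync_wait = 0;  -- back to the default for this session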

The cherry on top

While wsrep_sync_wait expanded the possibilities for causality checks, it still requires the “planning ahead” discussed above. I recently worked on the troubleshooting of a slow cluster and found it was configured with wsrep_sync_wait set to three (“Perform checks for READ, UPDATE, and DELETE statements”) all around, which is an even more “aggressive” configuration than the legacy wsrep_causal_reads. The justification for operating this way was that it was not possible to identify upfront which connection would be issuing a particular query that could not afford stale reads.

The good news is that starting with Percona XtraDB Cluster 8.0.26-16, which was released earlier this year, we can set wsrep_sync_wait on a per-query basis through the use of the optimizer hint SET_VAR: “The SET_VAR hint sets the session value of a system variable temporarily (for the duration of a single statement)”.

Here’s a simple example of employing SET_VAR to enforce the highest level of causality checks for a simple SELECT:

SELECT /*+ SET_VAR(wsrep_sync_wait=7) */ balance FROM accounts WHERE id=100;

If you find it is complicated to determine beforehand which connections should be configured to avoid stale reads, know that now you have the option to literally hard-code this constraint in your query. And if you end up deploying your database in a regular MySQL server, which is shipped without any wsrep_ settings, the optimizer hint is simply ignored and the query is processed normally.

Dec 09, 2021

A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade

Percona XtraDB Cluster 8.0.25 (PXC) has introduced a new option to perform online schema modifications: NBO (Non-Blocking Operation).

When using PXC, the cluster relies on the wsrep_OSU_method parameter to define the Online Schema Upgrade (OSU) method the node uses to replicate DDL statements. 

Until now, we normally have three options:

  • Use Total Isolation Order (TOI, the default)
  • Use Rolling Schema Upgrade (RSU)
  • Use Percona’s online schema change tool (TOI + PTOSC)

Each method has some positive and negative aspects. TOI will lock the whole cluster from being able to accept data modifications for the entire time it takes to perform the DDL operation. RSU will misalign the schema definition between the nodes, and in any case, the node performing the DDL operation is still locked. Finally, TOI+PTOSC will rely on creating triggers and copying data, so in some cases, this can be very impactful. 

The new Non-Blocking Operation (NBO) method is meant to help reduce the impact on the cluster and make it easier to perform some DDL operations.

At the moment, NBO supports only a limited set of operations:

  • ALTER INDEX
  • CREATE INDEX
  • DROP INDEX

Any other command will result in an error message ER_NOT_SUPPORTED_YET.
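
For instance, trying to add a column while the session is set to NBO is expected to be rejected (a sketch; the exact error text may vary between versions):

DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
Query OK, 0 rows affected (0.00 sec)

DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD COLUMN note char(3);
ERROR 1235 (42000): ... <-- ER_NOT_SUPPORTED_YET, since only the index operations listed above are allowed under NBO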

But let’s see how it works and what the impact is, while also comparing it with the default method, TOI.

What we will do is work with five connections:

1 – to perform the DDL
2 – to insert data into the table being altered
3 – to insert data into a different table
4-5 – to check the operations on the other two nodes

PXC must be at least Version 8.0.25-15.1.

The table we will modify is :

DC1-1(root@localhost) [windmills_s]>show create table windmills_test\G
*************************** 1. row ***************************
       Table: windmills_test
Create Table: CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB AUTO_INCREMENT=8199260 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
1 row in set (0.00 sec)

And contains ~five million rows.

DC1-1(root@localhost) [windmills_s]>select count(*) from windmills_test;
+----------+
| count(*) |
+----------+
|  5002909 |
+----------+
1 row in set (0.44 sec)

The Commands

Connection 1:

ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
  ALTER TABLE windmills_test drop INDEX idx_1, ALGORITHM=INPLACE;

 

Connection 2:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connection 3:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills8  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

Connections 4-5:

while [ 1 = 1 ]; do
  echo "$(date +'%T.%3N')"
  /opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s \
    -e "show full processlist;" | egrep -i -e "(windmills_test|windmills_s)" | grep -i -v localhost
  sleep 1
done

Operations

  • Start the inserts from connections 2 and 3
  • Start the monitoring commands in connections 4-5 on the other nodes
  • Execute: 
    • For TOI 
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=TOI;
    • For NBO
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
    • For both
      • DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=shared;

Let’s Run It

Altering a Table with TOI

DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
Query OK, 0 rows affected (1 min 4.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

Inserts in the altering table (connection 2):

.450
.492
64.993 <--- Alter blocks all inserts on the table we are altering
.788
.609

Inserts on the other table (connection 3):

.455
.461
64.161 <--- Alter blocks all inserts on all the other tables as well
.641
.483

On the other nodes at the same time of the ALTER we can see:

Id  User             db         Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
15	system user		windmills_s	Query	102	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	102238	0	       0    <--- time from start

In short, the whole cluster was locked for ~64 seconds. During this period, all operations that modify data or schema were on hold.

Let’s Now Try With NBO

Inserts in the altering table:

.437
.487
120.758 <---- Execution time increase
.617
.510

Inserts on the other table:

.468
.485
25.061 <---- still a metalock, but not locking the other tables for the whole duration 
.494
.471

On the other nodes at the same time of the ALTER we can see:

Id      User         db             Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
110068	system user	 windmills_s	Connect	86	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	 120420	 0	          0

In this case, what is also interesting to note is that:

  1. We have a brief moment of metadata lock:
    1. 110174 pmm 127.0.0.1:42728 NULL Query 2 Waiting for table metadata lock  SELECT x FROM information_schema.tables  WHERE TABLE_SCHEMA = 'windmills_s' 1486 10    0
    2. 110068  system user connecting host windmills_s Connect 111 closing tables ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 111000 0 0
  2. The execution time is longer

Summarizing:

                                                TOI           NBO
Time on hold for inserts on the altered table   ~64 sec       ~120 sec
Time on hold for inserts on another table       ~64 sec       ~25 sec
Metadata lock held                              whole time    only at the end

What is Happening? What are the Differences and Why Does it Take Longer with NBO?

Let’s see at a very high level how the two work:

  • TOI: when you issue a DDL like ADD INDEX, a metadata lock is taken on the table and is released only at the end of the operation. During this time, you cannot:
    • Perform DMLs on any cluster node
    • Alter another table in the cluster
  • NBO: the metadata lock is taken at the start and at the end for a very brief period of time. The ADD INDEX operation will then work on each node independently. The lock taken at the end is to have all the nodes agree on the operation and commit or rollback (using cluster error voting). This final phase costs a bit more in time and is what adds a few seconds to the operation execution. But during the operation:
    • You can alter another table (using NBO)
    • You can continue to insert data, except in the table(s) you are altering.
    • On node crash, the operation will continue on the other nodes, and if successful it will persist.  

In short, the cluster behavior changes significantly when using NBO, offering much more flexibility than TOI. The cost in time should not increase linearly with the size of the table; it depends more on how efficiently a single node can perform the ALTER operation.
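
If you want to observe this difference yourself, one way (a sketch that assumes the default MySQL 8.0 performance_schema MDL instrumentation is enabled) is to watch the metadata locks on the table being altered from another session:

-- Metadata locks held or requested on the table being altered (run on any node)
SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_DURATION, LOCK_STATUS, OWNER_THREAD_ID
FROM performance_schema.metadata_locks
WHERE OBJECT_SCHEMA = 'windmills_s'
  AND OBJECT_NAME = 'windmills_test';

With TOI the lock should show up for the whole duration of the ALTER; with NBO it should appear only briefly at the start and at the end, matching the description above.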

Conclusion

NBO can significantly help reduce the impact of DDL on the cluster. For now, it is limited to the widely used creation/modification/drop of an index, but we may expand it in the future.

The feature is still a technology preview, so do not trust it in production, but test it and let us know what you think. 

Final comment: another distribution has also introduced NBO, but only in its paid enterprise version.

Percona, which is truly open source in facts and not just words, has implemented NBO in standard PXC, and the code is fully open source. This is not the first such feature, just another of the many that Percona offers for free while others ask you to buy an enterprise version.

Enjoy the product and let us know your feedback! Great MySQL to all!

Dec
01
2021
--

PXC Scheduler Handler: The Missing Piece for Galera/Percona XtraDB Cluster Puzzle

Working on a real case scenario in a five-node Percona XtraDB Cluster (PXC), we were forced to use wsrep_sync_wait = 1, because the app does reads-after-writes and we send reads to all the nodes. We had the idea to leave some nodes in DESYNC mode to reduce the flow control messages during peak load, expecting a steadier write throughput while keeping read consistency.

We decided to test Percona's new PXC Scheduler Handler, which is an application that manages the integration between ProxySQL and Galera/PXC (its scope is to maintain the ProxySQL mysql_servers table when a negative scenario occurs, like failures, service degradation, or maintenance). However, we realized that when a node is in DESYNC mode, it is kicked out of the read hostgroup. That is why we asked Marco Tusa to implement a new feature that removes the node from the read hostgroup if wsrep_local_recv_queue is higher than max_replication_lag.
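
As an illustration only, and assuming the threshold mentioned above is the standard max_replication_lag column of ProxySQL's mysql_servers table (that mapping is our assumption, not a statement about the scheduler's configuration format), setting it from the ProxySQL admin interface would look like:

-- Hypothetical threshold (100) for the reader hostgroup used later in this post (101)
UPDATE mysql_servers SET max_replication_lag = 100 WHERE hostgroup_id = 101;
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;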

Environment

  • 5 PXC nodes
  • 1 ProxySQL server

In db01, we run sysbench to simulate write traffic:

sysbench /usr/share/sysbench/oltp_insert.lua --threads=50 --tables=100 --mysql-password=<pass> --mysql-user=<user> --report-interval=5 --time=600 --tx-rate=0  run

In ProxySQL we run sysbench to simulate read-only traffic:

sysbench /usr/share/sysbench/oltp_read_only.lua --tables=100 --mysql-password=<pass> --mysql-user=<user> --mysql-host=127.0.0.1 run --threads=32 --report-interval=5 --time=600 --db-ps-mode=disable

DESYNC Test

The goal of this test is to see the differences between wsrep_desync ON/OFF and wsrep_sync_wait = 1.
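
For reference, both knobs compared in this test are plain Galera variables; putting a node into DESYNC mode and enforcing read consistency per session look like this (illustrative, executed on the node in question):

-- Desynchronize this node: it stops sending flow control even if its apply queue grows
SET GLOBAL wsrep_desync = ON;

-- Make reads in this session wait until the node has applied all preceding writes
SET SESSION wsrep_sync_wait = 1;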

In the next graph, we see both scenarios side by side: in each panel, wsrep_desync is ON on the left and OFF on the right.

DESYNC Test 

As you can see, there are drops in the read traffic when DESYNC is OFF. They occur during the same periods in which the flow control messages are sent.

This is expected and is not new. The number of queries executed was:

    DESYNC ON  queries:                             30561552 (50934.44 per sec.)
    DESYNC OFF queries:                             28324704 (47195.67 per sec.)

Basically, you can execute about 8% more queries when flow control is not triggered (30,561,552 vs. 28,324,704).
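
To quantify how often flow control kicked in during each run, the standard Galera counters can be checked on the writer node (an illustrative query; the counters are cumulative since the node started):

SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_flow_control_sent', 'wsrep_flow_control_recv', 'wsrep_flow_control_paused');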

Consistency Test

Now, we are going to simulate a scenario when the cluster receives CPU-intensive queries. We are going to execute in db03 (or any other node):

for j in $(seq 1 40) ; do 
  for i in $(seq 1 250); do 
    echo "SELECT pad FROM sbtest.sbtest1 GROUP BY pad LIMIT 10" | mysql --password=<pass> --user=<user> --host=<host> ; 
  done > /dev/null 2>&1 & 
done

It starts 40 threads, each executing the same GROUP BY query 250 times, which is enough for our testing.

And we are going to add a timestamp column to sbtest1 to monitor the lag:

ALTER TABLE sbtest.sbtest1 ADD `tnow` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6);

Now we have these four scenarios, which we are going to monitor with:

select timediff(tnow, current_timestamp(6)) from sbtest1 order by tnow desc limit 1;

And the status variable: wsrep_local_recv_queue.
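
The queue depth itself can be read with a standard status query (illustrative):

-- Current size of this node's receive (apply) queue
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';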

This will be the header of the tables:

| timediff | query time |  wsrep_local_recv_queue |

– With wsrep_desync=OFF and wsrep_sync_wait=1

| 00:00:02.553139 | 2.69 sec |  237 |
| 00:00:00.223150 | 0.26 sec |   72 |

– With wsrep_desync=OFF and wsrep_sync_wait=0

| 00:00:01.711082 | 1.74 sec |  183 |
| 00:00:01.175446 | 1.20 sec |  112 |

– With wsrep_desync=ON and wsrep_sync_wait=1

| -00:00:00.021460 | 1.30 sec | 13542 |
| -00:00:00.009098 | 1.28 sec | 13184 |

– With wsrep_desync=ON and wsrep_sync_wait=0

| -00:01:02.065854 | 0.46 sec | 635387 |
| -00:01:02.882633 | 0.54 sec | 643742 |

With wsrep_desync=OFF, the behavior is similar in both cases, which means the node needs to be in sync first, and only then does it check wsrep_sync_wait.

With wsrep_desync=ON and wsrep_sync_wait=1, we can see that the query is delayed because the node needs to apply the transactions in its apply queue first. That is not the case with wsrep_sync_wait=0, where the data is far behind the writer node and the query is answered immediately.

The two cases that matter are those with wsrep_sync_wait=1. Both are read-consistent even though they show different timediff values: the query time measures the flow control lag when wsrep_desync=OFF and the apply-queue time when wsrep_desync=ON.

Cluster Behavior During Load Increases

It is time to merge both tests and simulate what happens when ProxySQL spreads the CPU-intensive queries over the cluster. We are going to execute the same script through ProxySQL:

for j in $(seq 1 40) ; do 
  for i in $(seq 1 250); do 
    echo "SELECT pad FROM sbtest.sbtest1 GROUP BY pad LIMIT 10" | mysql --password=<pass> --user=<user> --host=<host> ; 
  done > /dev/null 2>&1 & 
done

In the next graph we are going to see how the active connections on different nodes went up and down as the status of the node changed: 

MySQL Node Change

And when the script finished, traffic went back to normal.

In ProxySQL, you will see the status of the servers change from ONLINE to OFFLINE_SOFT and back to ONLINE because of the PXC Scheduler Handler intervention, like this:

proxysql> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id ='101' ;
+--------------+---------------+--------------+
| hostgroup_id | hostname      | status       |
+--------------+---------------+--------------+
| 101          | 10.127.70.242 | ONLINE       |
| 101          | 10.127.70.243 | ONLINE       |
| 101          | 10.127.70.244 | OFFLINE_SOFT |
| 101          | 10.127.70.245 | ONLINE       |
| 101          | 10.127.71.3   | ONLINE       |
+--------------+---------------+--------------+
proxysql> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id ='101' ;
+--------------+---------------+--------------+
| hostgroup_id | hostname      | status       |
+--------------+---------------+--------------+
| 101          | 10.127.70.242 | ONLINE       |
| 101          | 10.127.70.243 | ONLINE       |
| 101          | 10.127.70.244 | ONLINE       |
| 101          | 10.127.70.245 | OFFLINE_SOFT |
| 101          | 10.127.71.3   | ONLINE       |
+--------------+---------------+--------------+

This can also be reviewed in PMM:

percona monitoring and management

This means that PXC Scheduler Handler is helping us to spread the load across DESYNC nodes, improving the response time for the read-only traffic.

Architecture

With traditional replication we had an abstract diagram like this:

traditional replication

We didn’t have any production-ready option to guarantee Read Consistency on the Read Replicas.

With Galera/PXC we don’t need the Replication Manager as it will be replaced with PXC Scheduler Handler: 

PXC Scheduler Handler

We have the same number of nodes with read consistency (thanks to wsrep_sync_wait) and a synced secondary writer node. What is missing, but not difficult to add, is a tool that monitors the WriterNodes, as in a failure scenario we might want to keep no fewer than two synced nodes.

Conclusion

I think that Marco Tusa did a great job with the PXC Scheduler Handler, which allowed us to implement this new architecture. It might help people who need to scale reads, need consistent write throughput, need consistent reads, and don't want reads to affect the flow of the replication process.

If you’re interested in learning more about Percona XtraDB Cluster and ProxySQL, be sure to check out Percona’s Training Services. We offer an advanced two-day hands-on tutorial for XtraDB Cluster, in addition to our one-day ProxySQL intensive. Contact us today to learn more!
