GCache and Record-Set Cache Encryption in Percona XtraDB Cluster – Part Two
Keeping Your Data Safe: An Introduction to Data-at-Rest Encryption in Percona XtraDB Cluster. In the first part of this blog post, we learned how to enable GCache and Record-Set cache encryption in Percona XtraDB Cluster. This part will explore the details of the implementation to understand what happens behind the scenes. How does it work […]
GCache and Record-Set Cache Encryption in Percona XtraDB Cluster – Part One
Keeping Your Data Safe: An Introduction to Data-at-Rest Encryption in Percona XtraDB Cluster. Data breaches are a major concern today. Whether you’re a business owner or value your privacy, the thought of your sensitive information being exposed is unsettling. Fortunately, data-at-rest encryption can help. Think of it as a sophisticated lock that scrambles your data, […]
Consequences of Forcing Bootstrap on Percona XtraDB Cluster
Recently, I was working on a very unfortunate case that revolved around diverging clusters, data loss, missing important log errors, and forcing commands on Percona XtraDB Cluster (PXC). Even though PXC tries its best to explain what happens in the error log, I can vouch that it can be missed or overlooked when you do not know what to expect.
This blog post is a cautionary tale and an invitation to try it yourself and break stuff (not in production, right?).
TLDR:
Do you know right away what happened when seeing this log?
2023-06-22T08:23:29.003334Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:group_post_state_exchange():433: Reversing history: 171 -> 44, this member has applied 127 more events than the primary component.Data loss is possible. Must abort.
Demonstration
Using the great https://github.com/datacharmer/dbdeployer:
$ dbdeployer deploy replication --topology=pxc --sandbox-binary=~/opt/pxc 8.0.31
Let’s write some data
$ ./sandboxes/pxc_msb_8_0_31/sysbench oltp_read_write --tables=2 --table-size=1000 prepare
Then let’s suppose someone wants to restart node 1. For some reason, they read somewhere in your internal documentation that they should bootstrap in that situation. With dbdeployer, this will translate to:
$ ./sandboxes/pxc_msb_8_0_31/node1/stop
stop /home/yoann-lc/sandboxes/pxc_msb_8_0_31/node1
$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-new-cluster
......................................................................................................^C
It fails, as it should.
In reality, those bootstrap mistakes happen in homemade start scripts, puppet or ansible modules, or even internal procedures applied in the wrong situation.
Why did it fail? First error to notice:
2023-06-22T08:00:48.322148Z 0 [ERROR] [MY-000000] [Galera] It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
Reminder: Bootstrap should only be used when every node has been double-checked to be down; it’s a manual operation. It fails here because it was not forced and because this node was not the last to be stopped in the cluster.
Good reflex: Connect to the other mysql nodes and check the ‘wsrep_cluster_size’ and ‘wsrep_cluster_status’ statuses before doing anything.
mysql> show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_local_commits','wsrep_received','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected');
Do not: Apply blindly what this log is telling you to do.
But we are here to “fix” around and find out, so let’s bootstrap.
$ sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' ./sandboxes/pxc_msb_8_0_31/node1/data/grastate.dat
$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-new-cluster
.. sandbox server started
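For reference, grastate.dat is a small text file in the node's data directory; it looks roughly like this (the uuid and seqno values here are illustrative):

# GALERA saved state
version: 2.1
uuid:    6b1e0aeb-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 1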
At this point, notice that from node1, you have:
$ ./sandboxes/pxc_msb_8_0_31/node1/use -e "show global status where variable_name in ('wsrep_cluster_status', 'wsrep_cluster_size')"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 1       |
| wsrep_cluster_status | Primary |
+----------------------+---------+
But from node2 and node3 you will have:
$ ./sandboxes/pxc_msb_8_0_31/node2/use -e "show global status where variable_name in ('wsrep_cluster_status', 'wsrep_cluster_size')"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 2       |
| wsrep_cluster_status | Primary |
+----------------------+---------+
Looks fishy. But does your monitoring really alert you to this?
Let’s write some more data, obviously on node1, because why not? It looks healthy.
$ ./sandboxes/pxc_msb_8_0_31/node1/sysbench oltp_delete --tables=2 --table-size=1000 --events=127 run
127 will be useful later on.
Nightmare ensues
A few days have passed. You are still writing to this node. Then a new reason to restart node1 comes up; maybe you want to apply a parameter change.
$ ./sandboxes/pxc_msb_8_0_31/node1/restart .............................................................................................................................................................^C
It fails?
Reviewing logs, you would find:
$ less sandboxes/pxc_msb_8_0_31/node1/data/msandbox.err
...
2023-06-22T08:23:29.003334Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:group_post_state_exchange():433: Reversing history: 171 -> 44, this member has applied 127 more events than the primary component.Data loss is possible. Must abort.
...
Voilà, we find our "127" again.
Good reflex: Depends. It would need a post of its own, but that’s a serious problem.
Do not: Force SST on this node. It will work, but all the data inserted on node1 since the forced bootstrap will be lost.
What does it mean?
When forcing bootstrap, a node will always start. It won’t ever try to connect to other nodes if they are healthy. The other nodes won’t try to connect to the third one either; from their point of view, it just never joined, so it’s not part of the cluster.
When restarting the previously bootstrapped node1 in non-bootstrapped mode, that’s the first time they all see each other in a while.
Each time a transaction is committed, it is replicated along with a sequence number (seqno). The seqno is an ever-growing number. It is used by nodes to determine if incremental state transfer is possible, or if a node state is coherent with others.
Now that node1 is no longer in bootstrap mode, it connects to the other members and shares its state (last primary members, seqno). The other nodes correctly pick up that this seqno looks suspicious: it is higher than their own, meaning the joining node could have applied more transactions. It could also mean it came from some other cluster.
Because nodes are in doubt, nothing will happen. Node1 is denied joining and will not do anything. It won’t try to resynchronize automatically, and it won’t touch its data. Node2 and node3 are not impacted; they will be kept as is too.
How to proceed from there depends on the situation; there are no general guidelines. Ideally, a source of truth should be found. If both clusters applied writes, that's the toughest situation to be in, and it's a split-brain.
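A hedged starting point for finding that source of truth is to compare wsrep_last_committed (the seqno of the last committed write-set) on each side, for example with the sandboxes used above:

$ ./sandboxes/pxc_msb_8_0_31/node1/use -e "show global status like 'wsrep_last_committed'"
$ ./sandboxes/pxc_msb_8_0_31/node2/use -e "show global status like 'wsrep_last_committed'"

Keep in mind the caveat from the note below: the numbers only tell you how many write-sets each side committed, not whether the underlying data is the same.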
Note: seqno are just numbers. Having equal seqno does not actually guarantee that the underlying transactions applied are identical, but it’s still useful as a simple sanity check. If we were to mess around even more and apply 127 transactions on node2, or even modify seqno manually in grastate.dat, we could have “interesting” results. Try it out (not in production, mind you)!
Note: If you are unaware of bootstrapping and how to properly recover, check out the documentation.
Conclusion
Bootstrap is a last resort procedure, don’t force it lightly. Do not force SST right away if a node does not want to join either. You should always check the error log first.
Fortunately, PXC does not blindly let any node join without some sanity checks.
Exploring Data Dump and Load Utility With MySQL Shell
In this blog post, I will try to cover a few dump and data-loading utilities offered by MySQL Shell.
What is MySQL Shell?
It is a robust client and code editor for MySQL. In addition to having APIs for dealing with MySQL, MySQL Shell offers scripting capabilities for JavaScript and Python. It provides an AdminAPI to manage MySQL servers and create InnoDB Cluster and ReplicaSet deployments. It supports an X DevAPI to work on both SQL and NoSQL interfaces via the X Protocol. Finally, it provides utilities to make working with MySQL in Python and Javascript mode easier.
Now, let’s jump into some practical stuff and see what these tools offer us.
Dump Utility
In the dump utility, there are three tools: util.dumpInstance(), util.dumpSchemas(), and util.dumpTables(), so let's discuss them one by one.
1. “util.dumpInstance()” – This was introduced in MySQL Shell 8.0.21 and supports the export of all databases, excluding the information_schema, mysql, ndbinfo, performance_schema, and sys schemas. The backups can be kept locally or can be taken remotely via the MySQL Shell utility.
Performing a complete instance dump with “util.dumpInstance()”:
a) Log in to the database via MySQL Shell.
shell> mysqlsh --uri root@localhost
b) Prior to doing the actual backup, we will first perform a dryRun procedure to check for any potential problems or compatibility issues.
mysql shell > util.dumpInstance("/home/backup/", {dryRun: true})
Output:
dumping. Acquiring global read lock Global read lock acquired Initializing - done 1 out of 5 schemas will be dumped and within them 16 tables, 0 views. 3 out of 5 users will be dumped. Gathering information - done All transactions have been started Global read lock has been released Writing global DDL files Writing users DDL Writing DDL - done Starting data dump 0% (0 rows / ~1.60M rows), 0.00 rows/s, 0.00 B/s uncompressed, 0.00 B/s compressed
So, here we didn’t get any issues or warnings so we are good to proceed with the actual backup process.
c) Running the full instance backup and saving in the target location.
mysql shell >util.dumpInstance("/home/backup/")
Output:
dumping. Acquiring global read lock Global read lock acquired Initializing - done 1 out of 5 schemas will be dumped and within them 16 tables, 0 views. 3 out of 5 users will be dumped. Gathering information - done All transactions have been started Global read lock has been released Writing global DDL files Writing users DDL Running data dump using 4 threads. NOTE: Progress information uses estimated values and may not be accurate. Writing schema metadata - done Writing DDL - done Writing table metadata - done Starting data dump 101% (1.63M rows / ~1.60M rows), 114.56K rows/s, 22.13 MB/s uncompressed, 10.09 MB/s compressed Dump duration: 00:00:15s Total duration: 00:00:15s Schemas dumped: 1 Tables dumped: 16 Uncompressed data size: 313.55 MB Compressed data size: 142.87 MB Compression ratio: 2.2 Rows written: 1629368 Bytes written: 142.87 MB Average uncompressed throughput: 20.01 MB/s Average compressed throughput: 9.12 MB/s
We have successfully finished the backup here. By default, it employs “4 threads” and applies the zstd compression algorithm to the database backup.
d) Running the instance dump with more advanced options.
mysql shell > util.dumpInstance("/home/backup/", {threads:8, maxRate:"100M", consistent:true, chunking:true, bytesPerChunk:"64M", compression:"zstd"})
Output:
Acquiring global read lock Global read lock acquired Initializing - done 1 out of 5 schemas will be dumped and within them 16 tables, 0 views. 3 out of 5 users will be dumped. Gathering information - done All transactions have been started Global read lock has been released Writing global DDL files Writing users DDL Running data dump using 8 threads. NOTE: Progress information uses estimated values and may not be accurate. Writing schema metadata - done Writing DDL - done Writing table metadata - done Starting data dump 101% (1.63M rows / ~1.60M rows), 468.47K rows/s, 84.22 MB/s uncompressed, 38.41 MB/s compressed Dump duration: 00:00:03s Total duration: 00:00:03s Schemas dumped: 1 Tables dumped: 16 Uncompressed data size: 313.55 MB Compressed data size: 142.87 MB Compression ratio: 2.2 Rows written: 1629368 Bytes written: 142.87 MB Average uncompressed throughput: 85.24 MB/s Average compressed throughput: 38.84 MB/s
There are various options available to control and enhance the dump process.
- “dryRun”: Lists all of the compatibility issues before starting the dump.
- “consistent”: Acquires a global read lock in order to provide a consistent backup. It is enabled by default.
- “chunking”: Splits each table's data into a number of files. It is enabled by default.
- “bytesPerChunk”: Specifies the approximate number of bytes that should be written for each data block.
- “threads”: The number of parallel threads to employ when dumping data from the MySQL instance. 4 threads are used by default.
- “maxRate”: Maximum data read throughput for each thread during the dump, in bytes per second.
2. “util.dumpSchemas()” – This was introduced in MySQL Shell 8.0.21 and supports the export of selected databases, either locally or remotely.
Performing a database dump with “util.dumpSchemas()”. Here we will take a backup of a specific database (“sbtest”) only.
mysql shell > util.dumpSchemas(["sbtest"],"/home/backup/sbtest/",{threads :2})
Output:
dumping. Acquiring global read lock Global read lock acquired Initializing - done 1 schemas will be dumped and within them 16 tables, 0 views. Gathering information - done All transactions have been started Global read lock has been released Writing global DDL files Running data dump using 2 threads. NOTE: Progress information uses estimated values and may not be accurate. Writing schema metadata - done Writing DDL - done Writing table metadata - done Starting data dump 104% (1.63M rows / ~1.56M rows), 424.56K rows/s, 81.96 MB/s uncompressed, 37.35 MB/s compressed Dump duration: 00:00:03s Total duration: 00:00:03s Schemas dumped: 1 Tables dumped: 16 Uncompressed data size: 313.55 MB Compressed data size: 142.87 MB Compression ratio: 2.2 Rows written: 1629368 Bytes written: 142.87 MB Average uncompressed throughput: 81.82 MB/s Average compressed throughput: 37.28 MB/s
3. “util.dumpTables()” – The table dump utility “util.dumpTables()” was released in MySQL Shell 8.0.22.
Performing a selective table dump. Here we dump only the table “sbtest2” from the “sbtest” database.
mysql shell> util.dumpTables("sbtest", [ "sbtest2"], "/home/backup/sbtest/",{threads: 4})
Output:
Writing schema metadata - done Writing DDL - done Writing table metadata - done Starting data dump 101% (101.89K rows / ~100.58K rows), 0.00 rows/s, 0.00 B/s uncompressed, 0.00 B/s compressed Dump duration: 00:00:00s Total duration: 00:00:00s Schemas dumped: 1 Tables dumped: 1 Uncompressed data size: 19.61 MB Compressed data size: 8.93 MB Compression ratio: 2.2 Rows written: 101892 Bytes written: 8.93 MB Average uncompressed throughput: 19.61 MB/s Average compressed throughput: 8.93 MB/s
Data load utility
So far, we’ve seen how to use various methods to take a dump. We will now demonstrate how to restore the same using a single restoration command for all sorts of backups.
The utility “util.loadDump()” for loading dumps was introduced in MySQL Shell 8.0.21. It enables the parallel loading of tables or table chunks during the data-loading process.
Load the dump files using util.loadDump().
a) Running a dry run process to validate any compatibility issues or errors prior to the actual restoration process.
mysql shell > util.loadDump("/home/backup/", {dryRun: true})
Output:
Loading DDL and Data from '/home/backup/' using 4 threads. Opening dump... dryRun enabled, no changes will be made. Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log Scanning metadata - done Checking for pre-existing objects... Executing common preamble SQL Executing DDL - done Executing view DDL - done Starting data load Executing common postamble SQL 0% (0 bytes / 313.55 MB), 0.00 B/s, 16 / 16 tables done Recreating indexes - done No data loaded. 0 warnings were reported during the load.
Note – no data is loaded; the dry run just prints information about any warnings or errors that would occur during the restoration process.
b) Running the data restoration process.
mysql shell > util.loadDump("/home/backup/", {progressFile :"/home/backuplog.json",threads:4,backgroundThreads:4,maxBytesPerTransaction:"4096"})
Output:
Loading DDL and Data from '/home/backup/' using 4 threads. Opening dump... Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log Scanning metadata - done Checking for pre-existing objects... Executing common preamble SQL Executing DDL - done Executing view DDL - done Starting data load Executing common postamble SQL 100% (313.55 MB / 313.55 MB), 3.03 MB/s, 16 / 16 tables done Recreating indexes - done 16 chunks (1.63M rows, 313.55 MB) for 16 tables in 1 schemas were loaded in 2 min 43 sec (avg throughput 1.92 MB/s) 0 warnings were reported during the load.
Here, we have successfully completed the restoration process.
There are various options available to control and enhance the data loading process.
- “threads”: The number of parallel threads to employ when loading data into the MySQL instance. 4 is used by default.
- “backgroundThreads”: The number of threads in the pool of background threads used to retrieve the file data.
- “maxBytesPerTransaction”: The maximum number of bytes that can be loaded from a data file with a single LOAD DATA statement.
- “loadIndexes”: Secondary indexes are not built during the import when this option is set to false; you must build them afterwards. This may come in handy if you're loading the DDL files and data files separately and want to modify the table structure after the DDL files have been loaded.
- “deferTableIndexes”: Postpone creating secondary indexes until after the table's contents are loaded. This may speed up loading. "off" means that all indexes are built as the table is loaded.
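As a quick sketch of the last two options (the exact values should be double-checked against your MySQL Shell version's documentation): the first command below defers all secondary index creation until the data is loaded, while the second skips building secondary indexes entirely so they can be created manually later.

mysql shell > util.loadDump("/home/backup/", {threads: 8, deferTableIndexes: "all"})
mysql shell > util.loadDump("/home/backup/", {threads: 8, loadIndexes: false})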
Taking backups on cloud storage (GCP and AWS S3)
MySQL Shell also provides the capability to store dumps remotely over some S3-compatible cloud storage such as Amazon S3 and Google Cloud storage. Here, we’ll talk about how to do it with Google Cloud storage and an Amazon S3 environment.
Let’s start by configuring it for Google Cloud storage.
Prerequisites
- Storage Bucket should exist in Google Cloud
- Access keys and secrets defined in a file ("/home/credentials")
[gcb]
aws_access_key_id=xxxxxxxx
aws_secret_access_key=xxxxxxxx
- GCP cloud storage endpoint – “https://storage.googleapis.com”
a) Running full backup and putting it in the GCP bucket “ajpxctest”.
mysql shell> util.dumpInstance("sbtest",{s3BucketName: "ajpxctest",s3EndpointOverride: "https://storage.googleapis.com", s3CredentialsFile:"/home/credentials", s3Profile: "gcb", threads: 4})
Output:
dumping. Acquiring global read lock Global read lock acquired Initializing - done 1 out of 5 schemas will be dumped and within them 16 tables, 0 views. 3 out of 5 users will be dumped. Gathering information - done All transactions have been started Global read lock has been released Writing global DDL files Writing users DDL Running data dump using 4 threads. NOTE: Progress information uses estimated values and may not be accurate. Writing schema metadata - done Writing DDL - done Writing table metadata - done Starting data dump 104% (1.63M rows / ~1.56M rows), 172.12K rows/s, 34.60 MB/s uncompressed, 15.75 MB/s compressed Dump duration: 00:00:12s Total duration: 00:00:14s Schemas dumped: 1 Tables dumped: 16 Uncompressed data size: 313.55 MB Compressed data size: 142.87 MB Compression ratio: 2.2 Rows written: 1629368 Bytes written: 142.87 MB Average uncompressed throughput: 25.21 MB/s Average compressed throughput: 11.49 MB/s
b) Then, using these same details, we will restore the data from the S3 bucket into the local database.
mysql shell> util.loadDump("sbtest",{s3BucketName: "ajpxctest", s3EndpointOverride: "https://storage.googleapis.com",s3CredentialsFile:"/home/credentials", s3Profile: "gcb", threads: 4})
Output:
Loading DDL and Data from AWS S3 bucket=ajpxctest, prefix='sbtest' using 4 threads. Opening dump... Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log Fetching dump data from remote location... Listing files - done Scanning metadata - done Checking for pre-existing objects... Executing common preamble SQL Executing DDL - done Executing view DDL - done Starting data load 1 thds loading / 100% (313.55 MB / 313.55 MB), 6.23 MB/s, 14 / 16 tables done Recreating indexes - done Executing common postamble SQL 16 chunks (1.63M rows, 313.55 MB) for 16 tables in 1 schemas were loaded in 1 min 8 sec (avg throughput 7.41 MB/s) 0 warnings were reported during the load.
S3 options:
- “s3BucketName”: Name of the s3 bucket where the dump will be kept.
- “s3EndpointOverride”: s3 endpoint or url.
- “s3CredentialsFile”: Credential file containing key & secret details.
- “s3Profile”: The name of the s3 profile to identify the credential details.
Let’s configure the same for Amazon S3
Prerequisites
- AWS S3 bucket should exist
- Configure AWS CLI
###~/.aws/credentials###
[default]
aws_access_key_id=xxxxxxx
aws_secret_access_key=xxxxxxx

###~/.aws/config###
[default]
region = us-east-1
a) Now, let’s perform the instance backup process which keeps the dump file in “ajtestbkp” S3 bucket.
mysql shell> util.dumpInstance("sbtest",{s3BucketName: "ajtestbkp",threads: 4})
Output:
Acquiring global read lock Global read lock acquired Initializing - done 1 out of 5 schemas will be dumped and within them 16 tables, 0 views. 3 out of 5 users will be dumped. Gathering information - done All transactions have been started Global read lock has been released Writing global DDL files Writing users DDL Running data dump using 4 threads. NOTE: Progress information uses estimated values and may not be accurate. Writing schema metadata - done Writing DDL - done Writing table metadata - done Starting data dump 101% (1.63M rows / ~1.61M rows), 287.53K rows/s, 53.81 MB/s uncompressed, 24.51 MB/s compressed Dump duration: 00:00:06s Total duration: 00:00:07s Schemas dumped: 1 Tables dumped: 16 Uncompressed data size: 313.55 MB Compressed data size: 142.87 MB Compression ratio: 2.2 Rows written: 1629368 Bytes written: 142.87 MB Average uncompressed throughput: 49.93 MB/s Average compressed throughput: 22.75 MB/s
b) Now, let’s restore the dump from S3 into our local database.
mysql shell> util.loadDump("sbtest",{s3BucketName: "ajtestbkp", threads: 4})
Output:
Loading DDL and Data from AWS S3 bucket=ajtestbkp, prefix='sbtest' using 4 threads. Opening dump... Target is MySQL 5.7.38-41-57-log. Dump was produced from MySQL 5.7.38-41-57-log Fetching dump data from remote location... Listing files - done Scanning metadata - done Checking for pre-existing objects... Executing common preamble SQL Executing DDL - done Executing view DDL - done Starting data load 1 thds loading / 100% (313.55 MB / 313.55 MB), 8.80 MB/s, 15 / 16 tables done Recreating indexes - done Executing common postamble SQL 16 chunks (1.63M rows, 313.55 MB) for 16 tables in 1 schemas were loaded in 38 sec (avg throughput 9.14 MB/s) 0 warnings were reported during the load.
Summary
These utilities are a great addition to MySQL Shell and are very useful for backing up and restoring large tables with the different control options. We now have better control of the dump/load process by defining options such as threads, maxRate, and maxBytesPerTransaction. The dump utilities support MySQL 5.6, 5.7, and 8.0; however, for data loading, the target database should be either MySQL 5.7 or 8.0.
Percona XtraDB Cluster on Amazon EC2 and Two Interesting Changes in PXC 8.0
This article outlines the basic configurations for setting up and deploying Percona XtraDB Cluster 8.0 (PXC) on Amazon EC2, as well as what is new in the setup compared to Percona XtraDB Cluster 5.7.
What is Percona XtraDB Cluster an ideal fit for?
Percona XtraDB Cluster is a cost-effective, high-performance clustering solution for mission-critical data. It combines all the improvements, and functionality found in MySQL 8 with Percona Server for MySQL‘s Enterprise features and Percona’s upgraded Galera library.
A Percona XtraDB Cluster environment is an ideal fit for applications requiring five-nines (99.999%) uptime with high read workloads; industries like financial or healthcare businesses that require in-house or externally dedicated database resources.
How is a three-node cluster configured in an EC2 environment?
In order to describe the setup procedures, I'll be using Amazon EC2 instances to build the environment; based on the business requirements, we may utilize alternative infrastructure to build the cluster environment.
Amazon EC2 is designed to provide uptime and high availability. When designing an architecture in an EC2 environment, it is preferable to have one node situated in a different Availability Zone to avoid the loss of an entire AZ and its data.
If a different region is planned for a node, we can also survive the loss of an entire region and its data. The nodes and regions should have appropriate network connectivity, because network latency between the two regions affects synchronous replication write latency. Alternatively, an async replica in a different region is an option.
I’m not going into too much depth on Amazon EC2 to keep this blog brief and readable.
To build the three-node Percona XtraDB 8.0 cluster environment we first spin up the following three nodes in EC2. I’m using Amazon Linux but you can also use Ubuntu or any of the Percona-supported operating systems.
It is advised that a cluster’s nodes all have the same configuration.
PXCNode1 IP Address: 172.31.18.119
PXCNode2 IP Address: 172.31.86.114
PXCNode3 IP Address: 172.31.26.152
To install the Percona repository on all three nodes, use the following command.
$ sudo yum install -y https://repo.percona.com/yum/percona-release-latest.noarch.rpm
Enable the Percona Server for MySQL 8.0 repository in all three nodes by running the following command.
$ sudo percona-release setup pxc-80
* Disabling all Percona Repositories
* Enabling the Percona XtraDB Cluster 8.0 repository
* Enabling the Percona Tools repository
<*> All done!
$
Using the following command, install the Percona XtraDB Cluster packages and software on all three nodes.
$ sudo yum install percona-xtradb-cluster
Before starting the nodes, update the basic variables listed below on each node.
The following default variables must be modified with the first installation. Those coming from PXC 5.7 might wonder why wsrep_sst_auth is missing. The wsrep_sst_auth variable was removed in PXC 8 since it caused security concerns: the user and password were saved in the config file and were easily visible to OS users.
In PXC 8, a temporary user is created when a new node joins the existing cluster. For additional details on this security enhancement, check this article.
$ vi /etc/my.cnf
..
######## wsrep ###############
wsrep_provider=/usr/lib64/galera4/libgalera_smm.so
wsrep_cluster_address=gcomm://172.31.18.119,172.31.86.114,172.31.26.152
wsrep_slave_threads=8
wsrep_log_conflicts
wsrep_node_address=172.31.18.119
wsrep_cluster_name=pxc-cluster
#If wsrep_node_name is not specified, then system hostname will be used
wsrep_node_name=pxc-cluster-node-1
wsrep_sst_method=xtrabackup-v2
..
Percona XtraDB Cluster nodes utilize the following ports by default, so we must open them and ensure that the nodes can communicate with one another.
- 3306 is used for MySQL client connections and SST (State Snapshot Transfer) via mysqldump.
- 4444 is used for SST via Percona XtraBackup.
- 4567 is used for write-set replication traffic (over TCP) and multicast replication (over TCP and UDP).
- 4568 is used for IST (Incremental State Transfer).
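How these ports are opened depends on your setup: in EC2, the security group attached to the instances must allow them between the nodes, and if a host-level firewall such as firewalld is also running, the equivalent rules would look roughly like this (a sketch, not a complete hardening guide):

$ sudo firewall-cmd --permanent --add-port=3306/tcp --add-port=4444/tcp --add-port=4567/tcp --add-port=4567/udp --add-port=4568/tcp
$ sudo firewall-cmd --reload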
For example, to test access.
Node 1
# socat - TCP-LISTEN:4444
hello

Node 2
# echo "hello" | socat - TCP:172.31.18.119:4444
#
How is the first node bootstrapped?
After configuring each PXC node, you must bootstrap the cluster starting with the first node. All of the data that you wish to replicate to additional nodes must be present on the first node.
Run the command to bootstrap the first node.
# systemctl start mysql@bootstrap.service #
Since this is a brand new install, a temporary password is generated for the ‘root’ MySQL user which we can find in the mysqld.log
# grep -i "A temporary password is generated " /var/log/mysqld.log
2022-09-13T06:52:37.700818Z 6 [Note] [MY-010454] [Server] A temporary password is generated for root@localhost: PmqQiGf*#6iy
#
Reset the temporary password using the following statement.
$ mysql -uroot -p
Enter password:
mysql> SET PASSWORD = 'GdKG*12#ULmE';
Query OK, 0 rows affected (0.03 sec)
How can the cluster’s remaining nodes be joined?
Before starting node2, you must copy the SSL certificates from node1 to node2 (and to node3). PXC 8 by default encrypts all replication communication, so this is a critical step that most users miss, causing cluster startup failures.
PXCNode1# scp /var/lib/mysql/*.pem 172.31.86.114:/var/lib/mysql/
PXCNode2# chown -R mysql.mysql /var/lib/mysql/*.pem
Start the node.
PXCNode2# systemctl start mysql #
Verify the following status variables and make sure they appear as below once the node has been added to the cluster.
PXCNode2$ mysql -uroot -p -e " show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected','wsrep_ready');"
Enter password:
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_cluster_size        | 2       |
| wsrep_cluster_status      | Primary |
| wsrep_connected           | ON      |
| wsrep_local_state         | 4       |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
+---------------------------+---------+
The third node may be added to the cluster using the same procedures, and its status will then look as follows.
PXCNode3$ mysql -uroot -p -e " show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected','wsrep_ready');"
Enter password:
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_cluster_size        | 3       |
| wsrep_cluster_status      | Primary |
| wsrep_connected           | ON      |
| wsrep_local_state         | 4       |
| wsrep_local_state_comment | Synced  |
| wsrep_ready               | ON      |
+---------------------------+---------+
Additional supporting factors
Additionally, the use of failover technologies like ProxySQL is advised; these help remove failing nodes from the active read pool and, in some cases, shift the primary.
It is advised to have a backup in place, with the open-source tool Percona XtraBackup taking physical copies of the dataset that are significantly faster to recover. It is strongly advised to back up binary logs using mysqlbinlog in order to do point-in-time recovery. Backups should be encrypted, compressed, and transferred to S3 as soon as possible.
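As a rough sketch of the binary log part (the backup_user account and the starting binlog file name below are placeholders), mysqlbinlog can stream binlogs continuously from one of the nodes to a backup host:

$ mysqlbinlog --read-from-remote-server --host=172.31.18.119 --user=backup_user -p --raw --stop-never binlog.000001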
To copy binlogs to an s3 bucket the command looks like this.
aws s3 sync /backups/binlogs/ s3://backup-bucket/
upload: ../../../../backups/binlogs/binlog.000001.gz.gpg to s3://backup-bucket/binlog.000001.gpg
For query analytics and time-based database performance insights, the open-source tool Percona Monitoring and Management is highly recommended. Using the Amazon Marketplace, this needs to be deployed on a separate host. This monitors the operating system and MySQL metrics and provides sophisticated query analytics.
How to Build Percona XtraDB Cluster From Sources
Percona XtraDB Cluster (PXC) is a database clustering solution for MySQL. It ensures high availability, prevents downtime and data loss, and provides linear scalability for a growing environment.
Lately, the number of questions about how to build Percona software has increased. More and more people try to add their own patches, add some modifications, and build the software by themselves. But this raises the question of how to do this in the same way as Percona does, as sometimes a compiler flag can make a drastic impact on the final binary. This is an essential question, so let's look at how we build Percona XtraDB Cluster here at Percona.
At first glance, the build process is not difficult. Everyone knows the magic make install command and assumes that, as a result, everything will be built and installed on the system. But databases are essential for applications, and everything needs to be done carefully.
Let’s divide the build process into the stages:
- Prepare build environments
- Prepare source tarballs
- Prepare source package
- Prepare binary packages
We can skip the third stage, but in this case, we wouldn’t be able to make repeatable builds so I would suggest always creating source packages.
So let’s move forward and discuss each step.
One of the key points during a build is that the environment should be clean, with all needed dependencies installed, and so on. For each version, the dependency list will be different. How do you get the correct dependency list? You can get all build requirements from the spec file (on rpm-based systems) or from the control file (on deb-based systems).
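For example (a sketch; the exact spec/control file names depend on the source tree you are building), the package managers can resolve those build requirements for you:

$ sudo yum-builddep percona-xtradb-cluster.spec   # rpm-based systems (yum-utils/dnf-plugins-core)
$ sudo mk-build-deps --install debian/control     # deb-based systems (devscripts package)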
As for source tarball, you can easily do the following:
- Get source tarball from the website
We publish source tarball for each release we issue, and you can easily get it for any released version (this is the link for the 8.0.28 version):
- Download tarball from GitHub
Here is the link:
You can use build instructions from our website and proceed with the compilation process:
https://www.percona.com/doc/percona-xtradb-cluster/LATEST/install/compile.html#compile
So everything looks easy. But what is the way we use, internally, to prepare our release packages? As I mentioned earlier, each compiler option can have a significant effect.
Everyone tries to make life easier and automate all tasks. That is great, as automation is one of the keys to success: you can work on other tasks once the release build is in progress. So we have created a build script that we use for making builds. It can be found in the PXC GitHub repo and can be used by anyone to make their own builds.
This script can install all needed dependencies for the build environment and create source RPMs and source debs, binary RPMs and debs, and binary tarballs, so it covers the whole build cycle. How do we use it?
The script has various params:
--builddir=DIR      Absolute path to the dir where all actions will be performed
--get_sources       Source will be downloaded from github
--build_src_rpm     If it is 1 src rpm will be built
--build_source_deb  If it is 1 source deb package will be built
--build_rpm         If it is 1 rpm will be built
--build_deb         If it is 1 deb will be built
--build_tarball     If it is 1 tarball will be built
--install_deps      Install build dependencies (root privileges are required)
--branch            Branch for build
--repo              Repo for build
--rpm_release       RPM version (default = 1)
--deb_release       DEB version (default = 1)
--debug             Build debug tarball
So let’s see how we can make the build process easier:
1. Download the build script:
wget https://raw.githubusercontent.com/percona/percona-xtradb-cluster/8.0/build-ps/pxc_builder.sh
2. Create a build directory where you are going to perform all build actions:
mkdir /tmp/BUILD_PXC
3. Install dependencies (please note root permissions are required):
sudo ./pxc_builder.sh --builddir=/tmp/BUILD_PXC --install_deps=1
4. Download source code:
– From Percona repo (it is used by default):
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --get_sources=1 --branch=Percona-XtraDB-Cluster-8.0.28-19
– From your own repo and branch:
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --get_sources=1 --branch=<your_branch_name> --repo=<link_to_your_repo_on_github>
5. Prepare src rpm:
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_src_rpm=1
Please note if you already have your source tarball, just create a directory named source_tarball in the build directory and put it into it.
6. Prepare source deb:
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_source_deb=1
Please note if you already have your source tarball, just create a directory named source_tarball in the build directory and put it into it.
7. Prepare rpm:
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_rpm=1
8. Prepare deb:
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_deb=1
9. Prepare tarball:
./pxc_builder.sh --builddir=/tmp/BUILD_PXC --build_tarball=1
So as you can see, the build procedure becomes easier, and you don't need to think about which dependencies are needed, what cmake params should be used, etc. This build script does everything for you and uses all the build flags (if you didn't change them in your sources) that we use for release builds.
Managing MySQL Configurations with the PXC Kubernetes Operator V1.10.0 Part 1: Introduction
Introduction/FAQ
Question: I need to run a production-grade open source MySQL DB.
Answer: Percona to the rescue! Percona XtraDB Cluster (PXC) is an open source enterprise MySQL solution that helps you to ensure data availability for your applications while improving security and simplifying the development of new applications in the most demanding public, private, and hybrid cloud environments.
Question: I forgot to mention that I need to run it on Kubernetes.
Answer: Percona to the rescue again! Percona Distribution for MySQL Operator based on Percona XtraDB Cluster contains everything you need to quickly and consistently deploy and scale Percona XtraDB Cluster instances in a Kubernetes-based environment on-premises or in the cloud.
Question: I have a lot of MySQL configurations to manage.
Answer: PXC Operator makes it easy to manage MySQL Configurations. Let’s explore.
For the rest of the article, the name of the PXC cluster is assumed to be cluster1, and this can be modified based on the user preference.
How can I change the MySQL configurations?
If you have not done it already, the first thing to do is install the PXC operator. Our Quickstart guide gives detailed instructions on how to get started.
There are three possible ways to modify the MySQL configurations as described in the Documentation for MySQL options:
- Custom Resource PerconaXtraDBCluster (pxc/pxcs/perconaxtradbclusters )
- Config map with name cluster1-pxc
- Secret with name cluster1-pxc
Which option should I choose for managing configurations?
The choice of using the above options depends on the use case and the user’s preferences.
Following are some examples:
Using ConfigMap
- If the MySQL configuration is pretty big and/or if you want to maintain the configuration separately rather than updating everything in PXC object.
- If you want to provide permission to change MySQL configurations but not the other properties of PXC objects like resources, affinity, etc., K8s RBAC can be used to achieve this. A Role/ClusterRole can be created to provide access only for the ConfigMap which is used for MySQL configuration.
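As a rough sketch of the ConfigMap approach (assuming the cluster is named cluster1 as above, and using init.cnf as the data key, which should be verified against the operator documentation for your version), such a ConfigMap can be created from a plain MySQL options file:

$ cat init.cnf
[mysqld]
max_connections=500
$ kubectl create configmap cluster1-pxc --from-file=init.cnf=./init.cnf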
Using Secrets
- If there is any sensitive information that needs to be used in the configuration, Secrets are recommended. Even though k8s Secrets are just base64-encoded data, they have the advantage of integrating well with vaults, and it's always best practice to use k8s Secrets rather than a ConfigMap when there is sensitive data.
What happens when I change MySQL configuration?
Any changes in MySQL configurations will generally recycle the pods in reverse order if the RollingUpdate strategy is used.
Example: If three replicas are used for the PXC cluster, cluster1-pxc-[0,1,2] pods would be created. When MySQL configuration is changed, cluster1-pxc-2 will be terminated first and the system will wait till the new pod cluster1-pxc-2 starts running and becomes healthy, then cluster1-pxc-1 will be terminated, and so on.
Following are the changes observed with the watch command.
# kubectl get po -l app.kubernetes.io/component=pxc --watch
NAME             READY   STATUS            RESTARTS   AGE
cluster1-pxc-0   3/3     Running           0          8m23s
cluster1-pxc-1   3/3     Running           0          10m
cluster1-pxc-2   3/3     Running           0          13m
cluster1-pxc-2   3/3     Terminating       0          13m
cluster1-pxc-2   0/3     Terminating       0          14m
cluster1-pxc-2   0/3     Pending           0          0s
cluster1-pxc-2   0/3     Init:0/1          0          1s
cluster1-pxc-2   0/3     PodInitializing   0          8s
cluster1-pxc-2   2/3     Running           0          10s
cluster1-pxc-2   3/3     Running           0          2m
cluster1-pxc-1   3/3     Terminating       0          14m
cluster1-pxc-1   0/3     Terminating       0          14m
cluster1-pxc-1   0/3     Pending           0          0s
cluster1-pxc-1   0/3     Init:0/1          0          1s
cluster1-pxc-1   0/3     PodInitializing   0          6s
cluster1-pxc-1   2/3     Running           0          8s
cluster1-pxc-1   3/3     Running           0          2m1s
cluster1-pxc-0   3/3     Terminating       0          13m
cluster1-pxc-0   0/3     Terminating       0          14m
cluster1-pxc-0   0/3     Pending           0          0s
cluster1-pxc-0   0/3     Init:0/1          0          0s
cluster1-pxc-0   0/3     PodInitializing   0          6s
cluster1-pxc-0   2/3     Running           0          8s
cluster1-pxc-0   3/3     Running           0          2m
In the upcoming post, we will see the precedence and the changes happening while modifying MySQL configurations. Stay tuned!
Preventing Stale Reads on Percona XtraDB Cluster by Enforcing Causality Checks on a Per-Query Basis
When we run a SELECT in a replica server and it returns a different result to the one we would have obtained had we queried the source server instead, due to changes in the dataset that have not yet been replicated or synchronized to the replica, we get what is known as a stale read. My colleague Marcelo has blogged about Stale Reads and How to Fix Them before.
The need to avoid stale reads is a constraint that leads people into migrating to Percona XtraDB Cluster (PXC) / Galera Cluster or, more recently, Group Replication. I am focusing on PXC and Galera in this short blog post, which, similar to Group Replication, implements not exactly a synchronous replication model but what is commonly referred to as a “virtually synchronous” replication (but not without some polemic). In short, it means a transaction that is committed in one node is guaranteed (and there’s a certification process behind this) to have “reached” other nodes, but that doesn’t mean the changes this transaction carries with it have been applied on the other nodes already. Thus, stale reads can happen in PXC too. For the long version, see the manual page about Certification in Percona XtraDB Cluster.
Why does PXC implement virtually synchronous replication instead of fully synchronous replication? There are a few challenges to accomplish this but we can nail it down to performance constraints. Fully synchronous replication requires not only for the transactions to reach the other nodes but for the transactions to be applied to them too. And then they have to report back the success of the operation. For each transaction. This takes time and increases the overall commit time, which in practice makes writing to the database slow and impacts overall concurrency.
Evolution of “synchronous” replication on PXC
To be clear, no matter how you configure PXC, it won’t become a fully-synchronous system (neither will Group Replication, for that matter). BUT, it is possible to make it behave in such a way that it enforces data consistency. That’s what the variable wsrep_causal_reads has been used for. When this setting was introduced, it worked like a switch that one could use to kind of enable “synchronous” replication (as in data consistency) across the cluster – either for all transactions (globally) or on a per-connection basis (per session). In fact, what this setting does in practice is to enforce causality checks; quoting the manual: “When this variable is set to ON, the replica will wait until that event is applied before doing any other queries”. By waiting, it means holding new queries until the expected consistency is reached.
Being able to enable “synchronous” replication for selected connections brought the potential to mix the best of both modes: operate with virtually synchronous replication by default (good performance) and use “synchronous” replication (once more, enforcing data consistency) when stale reads cannot be afforded.
The problem with this approach is that it requires some planning in advance to determine which connections should be configured with wsrep_causal_reads enabled. Without carefully planning ahead, what ends up happening in practice in many cases is having more connections configured with wsrep_causal_reads enabled than necessary. As a result, the cluster performance gets closer to that of operating with that setting applied globally.
The switch button that was wsrep_causal_reads evolved into the more granular wsrep_sync_wait, which allows the configuration of causality checks for different combinations of statements.
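For instance, a connection that cannot afford stale reads can enable the causality check just for its reads, and only for as long as it needs it (a minimal sketch; the value 1 covers READ statements, see the manual for the full wsrep_sync_wait bitmask):

SET SESSION wsrep_sync_wait = 1;
SELECT balance FROM accounts WHERE id = 100;
SET SESSION wsrep_sync_wait = 0;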
The cherry on top
While wsrep_sync_wait expanded the possibilities for causality checks, it still requires the “planning ahead” discussed above. I recently worked on the troubleshooting of a slow cluster and found it was configured with wsrep_sync_wait set to three (“Perform checks for READ, UPDATE, and DELETE statements”) all around, which is an even more “aggressive” configuration than the legacy wsrep_causal_reads. The justification for operating this way was that it was not possible to identify upfront which connection would be issuing a particular query that could not afford stale reads.
The good news is that starting with Percona XtraDB Cluster 8.0.26-16, which was released earlier this year, we can set wsrep_sync_wait on a per-query basis through the use of the optimizer hint SET_VAR: “The SET_VAR hint sets the session value of a system variable temporarily (for the duration of a single statement)”.
Here’s a simple example of employing SET_VAR to enforce the highest level of causality checks for a simple SELECT:
SELECT /*+ SET_VAR(wsrep_sync_wait=7) */ balance FROM accounts WHERE id=100;
If you find it is complicated to determine beforehand which connections should be configured to avoid stale reads, know that now you have the option to literally hard-code this constraint in your query. And if you end up deploying your database in a regular MySQL server, which is shipped without any wsrep_ settings, the optimizer hint is simply ignored and the query is processed normally.
A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade
Percona XtraDB Cluster 8.0.25 (PXC) has introduced a new option to perform online schema modifications: NBO (Non-Blocking Operation).
When using PXC, the cluster relies on the wsrep_OSU_method parameter to define the Online Schema Upgrade (OSU) method the node uses to replicate DDL statements.
Until now, we normally have three options:
- Use Total Isolation Order (TOI, the default)
- Use Rolling Schema Upgrade (RSU)
- Use Percona’s online schema change tool (TOI + PTOSC)
Each method has some positive and negative aspects. TOI will lock the whole cluster from being able to accept data modifications for the entire time it takes to perform the DDL operation. RSU will misalign the schema definition between the nodes, and in any case, the node performing the DDL operation is still locked. Finally, TOI+PTOSC will rely on creating triggers and copying data, so in some cases, this can be very impactful.
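Before running any DDL, you can check which method a node is currently using (same prompt style as the examples further down):

DC1-1(root@localhost) [windmills_s]>SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';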
The new Non-Blocking Operation (NBO) method is meant to help reduce the impact on the cluster and make it easier to perform some DDL operations.
At the moment we only support a limited set of operations with NBO like:
- ALTER INDEX
- CREATE INDEX
- DROP INDEX
Any other command will result in an error message ER_NOT_SUPPORTED_YET.
But let us see how it works and what the impact is while we will also compare it with the default method TOI.
What we will do is work with five connections:
1 – to perform the DDL
2 – to insert data into the table being altered
3 – to insert data into a different table
4-5 – to check the operations on the other two nodes
PXC must be at least Version 8.0.25-15.1.
The table we will modify is:
DC1-1(root@localhost) [windmills_s]>show create table windmills_test\G
*************************** 1. row ***************************
       Table: windmills_test
Create Table: CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB AUTO_INCREMENT=8199260 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
1 row in set (0.00 sec)
And contains ~five million rows.
DC1-1(root@localhost) [windmills_s]>select count(*) from windmills_test;
+----------+
| count(*) |
+----------+
|  5002909 |
+----------+
1 row in set (0.44 sec)
The Commands
Connection 1:
ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
ALTER TABLE windmills_test drop INDEX idx_1, ALGORITHM=INPLACE;
Connection 2:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done
Connection 3:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills8 select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done
Connections 4-5:
while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_s)"|grep -i -v localhost;sleep 1;done
Operations
- Start the inserts from connections 2 and 3
- Start the commands in connections 4-5 on the other nodes
- Execute the ALTER:
For TOI:
DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=TOI;
For NBO:
DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
For both:
DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=shared;
Let’s Run It
Altering a Table with TOI
DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
Query OK, 0 rows affected (1 min 4.74 sec)
Records: 0  Duplicates: 0  Warnings: 0
Inserts in the altering table (connection 2):
.450
.492
64.993 <--- Alter blocks all inserts on the table we are altering
.788
.609
Inserts on the other table (connection 3):
.455
.461
64.161 <--- Alter blocks all inserts on all the other tables as well
.641
.483
On the other nodes at the same time of the ALTER we can see:
Id User db Command Time State Info Time_ms Rows_sent Rows_examined 15 system user windmills_s Query 102 altering table ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 102238 0 0 <--- time from start
In short, we have the whole cluster locked for ~64 seconds. During this period of time, all the operations to modify data or structure were on hold.
Let’s Now Try With NBO
Inserts in the altering table:
.437
.487
120.758 <---- Execution time increase
.617
.510
Inserts on the other table:
.468
.485
25.061 <---- still a metalock, but not locking the other tables for the whole duration
.494
.471
On the other nodes at the same time of the ALTER we can see:
Id User db Command Time State Info Time_ms Rows_sent Rows_examined 110068 system user windmills_s Connect 86 altering table ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 120420 0 0
In this case, what is also interesting to note is that:
- We have a moment of metalock:
110174 pmm 127.0.0.1:42728 NULL Query 2 Waiting for table metadata lock SELECT x FROM information_schema.tables WHERE TABLE_SCHEMA = 'windmills_s' 1486 10 0
110068 system user connecting host windmills_s Connect 111 closing tables ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 111000 0 0
- The execution time is longer
Summarizing:
                                            TOI          NBO
Time on hold for insert for altering table  ~64 sec      ~120 sec
Time on hold for insert for another table   ~64 sec      ~25 sec
Metalock                                    whole time   only at the end
What is Happening? What are the Differences and Why Does it Take Longer with NBO?
Let’s see at a very high level how the two work:
- TOI: when you issue a DDL like ADD INDEX a metadata lock is taken on the table and it will be released only at the end of the operation. During this time, you cannot:
- Perform DMLs on any cluster node
- Alter another table in the cluster
- NBO: the metadata lock is taken at the start and at the end for a very brief period of time. The ADD INDEX operation will then work on each node independently. The lock taken at the end is to have all the nodes agree on the operation and commit or rollback (using cluster error voting). This final phase costs a bit more in time and is what adds a few seconds to the operation execution. But during the operation:
- You can alter another table (using NBO)
- You can continue to insert data, except in the table(s) you are altering.
- On node crash, the operation will continue on the other nodes, and if successful it will persist.
In short, the cluster server behavior changes significantly when using NBO, offering significant flexibility compared to TOI. The added cost in time should not increase linearly with the size of the table; it relates more to how efficiently a single node can perform the ALTER operation.
Conclusion
NBO can be significantly helpful to reduce the impact of DDL on the cluster, for now, limited to the widely used creation/modification/drop of an index. But in the future … we may expand it.
The feature is still a technology preview, so do not trust it in production, but test it and let us know what you think.
Final comment: another distribution has introduced NBO, but only if you buy the enterprise version.
Percona, which is truly open source with facts not just words, has implemented NBO in standard PXC, and the code is fully open source. This is not the first one, but just another of the many features Percona is offering for free while others ask you to buy the enterprise version.
Enjoy the product and let us have your feedback! Great MySQL to all!
PXC Scheduler Handler: The Missing Piece for Galera/Percona XtraDB Cluster Puzzle
Working on a real-case scenario with a five-node Percona XtraDB Cluster (PXC), we were forced to use wsrep_sync_wait = 1, because the application does reads after writes and we send reads to all the nodes. We had the idea of leaving some nodes in DESYNC mode to reduce the flow control messages during peak load, expecting a steadier write throughput while keeping read consistency.
We decided to test Percona's new PXC Scheduler Handler, which is an application that manages the integration between ProxySQL and Galera/PXC (its scope is to maintain the ProxySQL mysql_servers table when a negative scenario occurs, such as failures, service degradation, or maintenance). However, we realized that when a node is in DESYNC mode, it is kicked out of the read hostgroup. That is why we asked Marco Tusa to implement this new feature, which removes the node from the read hostgroup if wsrep_local_recv_queue is higher than max_replication_lag.
Environment
- 5 PXC nodes
- 1 ProxySQL server
In db01, we run sysbench to simulate write traffic:
sysbench /usr/share/sysbench/oltp_insert.lua --threads=50 --tables=100 --mysql-password=<pass> --mysql-user=<user> --report-interval=5 --time=600 --tx-rate=0 run
In ProxySQL we run sysbench to simulate read-only traffic:
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=100 --mysql-password=<pass> --mysql-user=<user> --mysql-host=127.0.0.1 run --threads=32 --report-interval=5 --time=600 --db-ps-mode=disable
DESYNC Test
The goal of this test is to see the differences between wsrep_desync ON/OFF and wsrep_sync_wait = 1.
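wsrep_desync is toggled at runtime on the nodes you want to exempt from flow control (a minimal sketch; a desynced node keeps receiving and applying write-sets, it simply no longer throttles the cluster when it falls behind):

mysql> SET GLOBAL wsrep_desync = ON;
mysql> SET GLOBAL wsrep_desync = OFF;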
In the next graph, we are going to see both scenarios in the same graph, on the left of each graph when wsrep_desync is ON and on the right when it is OFF.
As you can see there are decreases in the read traffic when DESYNC is OFF. It occurs in the same period of time when the flow control messages are sent:
This is expected and it is not new. The number of queries executed were:
DESYNC ON  queries: 30561552 (50934.44 per sec.)
DESYNC OFF queries: 28324704 (47195.67 per sec.)
Basically, you can execute about 8% more queries when flow control is not triggered.
Consistency Test
Now, we are going to simulate a scenario when the cluster receives CPU-intensive queries. We are going to execute in db03 (or any other node):
for j in $(seq 1 40) ; do for i in $(seq 1 250); do echo "SELECT pad FROM sbtest.sbtest1 GROUP BY pad LIMIT 10" | mysql --password=<pass> --user=<user> --host=<host> ; done > /dev/null 2>&1 & done
It starts 40 threads, each executing the same GROUP BY query 250 times, which is enough for our testing.
And we are going to add a timestamp column on the sbtest1 to monitor the lag:
ALTER TABLE sbtest.sbtest1 add `tnow` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6);
Now we have these four scenarios, which we are going to monitor with:
select timediff(tnow,current_timestamp(6)) from sbtest1 order by tnow desc limit 1
And the status variable: wsrep_local_recv_queue.
This will be the header of the tables:
| timediff | query time | wsrep_local_recv_queue |
– With wsrep_desync=OFF and wsrep_sync_wait=1
| 00:00:02.553139 | 2.69 sec | 237 |
| 00:00:00.223150 | 0.26 sec | 72 |
– With wsrep_desync=OFF and wsrep_sync_wait=0
| 00:00:01.711082 | 1.74 sec | 183 |
| 00:00:01.175446 | 1.20 sec | 112 |
– With wsrep_desync=ON and wsrep_sync_wait=1
| -00:00:00.021460 | 1.30 sec | 13542 |
| -00:00:00.009098 | 1.28 sec | 13184 |
– With wsrep_desync=ON and wsrep_sync_wait=0
| -00:01:02.065854 | 0.46 sec | 635387 |
| -00:01:02.882633 | 0.54 sec | 643742 |
With wsrep_desync=OFF, the behavior is similar in both cases; this means that the node needs to be in sync first, and only then does it check wsrep_sync_wait.
With wsrep_desync=ON and wsrep_sync_wait=1, we can see that the query is delayed because the node needs to apply the transactions in its receive queue first. That is not the case when wsrep_sync_wait=0, where the data is far behind the writer node and the query is answered immediately.
The two cases that matter are when wsrep_sync_wait=1, and both cases are read-consistent even if they show different timediff values, as the query time is measuring the flow control lag when wsrep_desync=OFF and the apply queue time when wsrep_desync=ON.
Cluster Behavior During Load Increases
It is time to merge both tests and simulate what happens when ProxySQL spreads the CPU-intensive queries over the cluster. We are going to execute the same script in ProxySQL:
for j in $(seq 1 40) ; do for i in $(seq 1 250); do echo "SELECT pad FROM sbtest.sbtest1 GROUP BY pad LIMIT 10" | mysql --password=<pass> --user=<user> --host=<host> ; done > /dev/null 2>&1 & done
In the next graph we are going to see how the active connections on different nodes went up and down as the status of the node changed:
And when the script finished it went back to normal.
In ProxySQL, you will see how the status of the servers changes from ONLINE to OFFLINE_SOFT and back to ONLINE because of the PXC Scheduler Handler intervention, like this:
proxysql> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id ='101' ;
+--------------+---------------+--------------+
| hostgroup_id | hostname      | status       |
+--------------+---------------+--------------+
| 101          | 10.127.70.242 | ONLINE       |
| 101          | 10.127.70.243 | ONLINE       |
| 101          | 10.127.70.244 | OFFLINE_SOFT |
| 101          | 10.127.70.245 | ONLINE       |
| 101          | 10.127.71.3   | ONLINE       |
+--------------+---------------+--------------+

proxysql> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id ='101' ;
+--------------+---------------+--------------+
| hostgroup_id | hostname      | status       |
+--------------+---------------+--------------+
| 101          | 10.127.70.242 | ONLINE       |
| 101          | 10.127.70.243 | ONLINE       |
| 101          | 10.127.70.244 | ONLINE       |
| 101          | 10.127.70.245 | OFFLINE_SOFT |
| 101          | 10.127.71.3   | ONLINE       |
+--------------+---------------+--------------+
This can be reviewed also in PMM:
This means that PXC Scheduler Handler is helping us to spread the load across DESYNC nodes, improving the response time for the read-only traffic.
Architecture
With traditional replication we had an abstract diagram like this:
We didn’t have any production-ready option to guarantee Read Consistency on the Read Replicas.
With Galera/PXC we don’t need the Replication Manager as it will be replaced with PXC Scheduler Handler:
We have the same number of nodes with Read Consistency (thanks to wsrep_sync_wait) and a synced secondary writer node. What is missing, but not difficult to add, is a tool that monitors the WriterNodes, as in a failure scenario we might want to keep no fewer than two synced nodes.
Conclusion
I think that Marco Tusa did a great job with the PXC Scheduler Handler which allowed us to implement this new Architecture that might help people that need to scale reads, need consistent write throughput, need consistent reads, and don’t want reads to affect the flow of the replication process.
If you’re interested in learning more about Percona XtraDB Cluster and ProxySQL, be sure to check out Percona’s Training Services. We offer an advanced two-day hands-on tutorial for XtraDB Cluster, in addition to our one-day ProxySQL intensive. Contact us today to learn more!