In this blog post, we will be discussing the PXC Replication Manager script/tool which basically facilitates both source and replica failover when working with multiple PXC clusters, across different DC/Networks connected via asynchronous replication mechanism. Such topologies emerge from requirements like database version upgrades, reporting or streaming for applications, separate disaster recovery or backup solutions, […]
12
2026
Using PXC Replication Manager to Auto Manage Both Source and Replica Failover in Galera-Based Environments
21
2025
Speeding Up Percona XtraDB Cluster State Transfers with Kubernetes Volume Snapshots
When using the Percona Operator for MySQL based on Percona XtraDB Cluster (PXC), it’s common to encounter scenarios where cluster nodes request a full State Snapshot Transfer (SST) when rejoining the cluster. One typical scenario where a State Snapshot Transfer (SST) is required is when a node has been offline long enough that the GCache […]
14
2024
Exploring Kubernetes CPU Resources in View of Percona XtraDB Cluster’s Flow Control
Even though I used a dedicated Kubernetes cluster to host my test database, I had this belief that by not explicitly allocating (or requesting, in Kubernetes vocabulary) CPU resources to my Percona XtraDB Cluster (PXC) pods or yet making just a small request, Kubernetes could be delaying access to the free CPU cycles available on […]
20
2024
Can We Set up a Replicate Filter Within the Percona XtraDB Cluster?
If you were to ask me the question, “Would it be possible to set up a replicate filter within the Percona XtraDB Cluster?” my immediate response would be to question you right back. Why would you even want the cluster nodes to have different data when you’re using a synchronous replication cluster? That’s because, instinctively, […]
21
2022
Preventing Stale Reads on Percona XtraDB Cluster by Enforcing Causality Checks on a Per-Query Basis

When we run a SELECT in a replica server and it returns a different result to the one we would have obtained had we queried the source server instead, due to changes in the dataset that have not yet been replicated or synchronized to the replica, we get what is known as a stale read. My colleague Marcelo has blogged about Stale Reads and How to Fix Them before.
The need to avoid stale reads is a constraint that leads people into migrating to Percona XtraDB Cluster (PXC) / Galera Cluster or, more recently, Group Replication. I am focusing on PXC and Galera in this short blog post, which, similar to Group Replication, implements not exactly a synchronous replication model but what is commonly referred to as a “virtually synchronous” replication (but not without some polemic). In short, it means a transaction that is committed in one node is guaranteed (and there’s a certification process behind this) to have “reached” other nodes, but that doesn’t mean the changes this transaction carries with it have been applied on the other nodes already. Thus, stale reads can happen in PXC too. For the long version, see the manual page about Certification in Percona XtraDB Cluster.
Why does PXC implement virtually synchronous replication instead of fully synchronous replication? There are a few challenges to accomplish this but we can nail it down to performance constraints. Fully synchronous replication requires not only for the transactions to reach the other nodes but for the transactions to be applied to them too. And then they have to report back the success of the operation. For each transaction. This takes time and increases the overall commit time, which in practice makes writing to the database slow and impacts overall concurrency.
Evolution of “synchronous” replication on PXC
To be clear, no matter how you configure PXC, it won’t become a fully-synchronous system (neither will Group Replication, for that matter). BUT, it is possible to make it behave in such a way that it enforces data consistency. That’s what the variable wsrep_causal_reads has been used for. When this setting was introduced, it worked like a switch that one could use to kind of enable “synchronous” replication (as in data consistency) across the cluster – either for all transactions (globally) or at a connection basis (per session). In fact, what this setting does in practice is to enforce causality checks; quoting the manual: “When this variable is set to ON, the replica will wait until that event is applied before doing any other queries”. By waiting, it means blocking new writes until the expected consistency is reached.
Being able to enable “synchronous” replication for selected connections brought the potential to mix the best of both modes: operate with virtually synchronous replication by default (good performance) and use “synchronous” replication (once more, enforcing data consistency) when stale reads cannot be afforded.
The problem with this approach is that it requires some planning in advance to determine which connections should be configured with wsrep_causal_reads enabled. Without carefully planning ahead, what ends up happening in practice in many cases is having more connections configured with wsrep_causal_reads enabled than necessary. As a result, the cluster performance gets closer to that of operating with that setting applied globally.
The switch button that was wsrep_causal_reads evolved into the more granular wsrep_sync_wait, which allows the configuration of causality checks for different combinations of statements.
The cherry on top
While wsrep_sync_wait expanded the possibilities for causality checks, it still requires the “planning ahead” discussed above. I recently worked on the troubleshooting of a slow cluster and found it was configured with wsrep_sync_wait set to three (“Perform checks for READ, UPDATE, and DELETE statements”) all around, which is an even more “aggressive” configuration than the legacy wsrep_causal_reads. The justification for operating this way was that it was not possible to identify upfront which connection would be issuing a particular query that could not afford stale reads.
The good news is that starting with Percona XtraDB Cluster 8.0.26-16, which was released earlier this year, we can set wsrep_sync_wait on a per-query basis through the use of the optimizer hint SET_VAR: “The SET_VAR hint sets the session value of a system variable temporarily (for the duration of a single statement)”.
Here’s a simple example of employing SET_VAR to enforce the highest level of causality checks for a simple SELECT:
SELECT /*+ SET_VAR(wsrep_sync_wait=7) */ balance FROM accounts WHERE id=100;
If you find it is complicated to determine beforehand which connections should be configured to avoid stale reads, know that now you have the option to literally hard-code this constraint in your query. And if you end up deploying your database in a regular MySQL server, which is shipped without any wsrep_ settings, the optimizer hint is simply ignored and the query is processed normally.
11
2022
Online DDL With Group Replication in MySQL 8.0.27

In April 2021, I wrote an article about Online DDL and Group Replication. At that time we were dealing with MySQL 8.0.23 and also opened a bug report which did not have the right answer to the case presented.
Anyhow, in that article I have shown how an online DDL was de facto locking the whole cluster for a very long time even when using the consistency level set to EVENTUAL.
This article is to give justice to the work done by the MySQL/Oracle engineers to correct that annoying inconvenience.
Before going ahead, let us remember how an Online DDL was propagated in a group replication cluster, and identify the differences with what happens now, all with the consistency level set to EVENTUAL (see).
In MySQL 8.0.23 we were having:
![]() |
![]() |
![]() |
While in MySQL 8.0.27 we have:
![]() |
![]() |
![]() |
As you can see from the images we have three different phases. Phase one is the same between version 8.0.23 and version 8.0.27.
Phases two and three, instead, are quite different. In MySQL 8.0.23 after the DDL is applied on the Primary, it is propagated to the other nodes, but a metalock was also acquired and the control was NOT returned. The result was that not only the session executing the DDL was kept on hold, but also all the other sessions performing modifications.
Only when the operation was over on all secondaries, the DDL was pushed to Binlog and disseminated for Asynchronous replication, lock raised and operation can restart.
Instead, in MySQL 8.0.27, once the operation is over on the primary the DDL is pushed to binlog, disseminated to the secondaries and control returned. The result is that the write operations on primary have no interruption whatsoever and the DDL is distributed to secondary and Asynchronous replication at the same time.
This is a fantastic improvement, available only with consistency level EVENTUAL, but still, fantastic.
Let’s See Some Numbers
To test the operation, I have used the same approach used in the previous tests in the article mentioned above.
Connection 1:
ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;
ALTER TABLE windmills_test drop INDEX idx_1, ALGORITHM=INPLACE;
Connection 2:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_large -e "insert into windmills_test select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmill7 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done
Connection 3:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_large -e "insert into windmill8 select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmill7 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done
Connections 4-5:
while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_large -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_large)"|grep -i -v localhost;sleep 1;done
Modifying a table with ~5 million rows:
node1-DC1 (root@localhost) [windmills_large]>select count(*) from windmills_test; +----------+ | count(*) | +----------+ | 5002909 | +----------+
The numbers below represent the time second/milliseconds taken by the operation to complete. While I was also catching the state of the ALTER on the other node I am not reporting it here given it is not relevant.
EVENTUAL (on the primary only) ------------------- Node 1 same table: .184 .186 <--- no locking during alter on the same node .184 <snip> .184 .217 <--- moment of commit .186 .186 .186 .185 Node 1 another table : .189 .198 <--- no locking during alter on the same node .188 <snip> .191 .211 <--- moment of commit .194
As you can see there is just a very small delay at the moment of commit, but other impacts.
Now if we compare this with the recent tests I have done for Percona XtraDB Cluster (PXC) Non-Blocking operation (see A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade) with the same number of rows and same kind of table/data:
| Action | Group Replication | PXC (NBO) |
|---|---|---|
| Time on hold for insert for altering table | ~ 0.217 sec | ~ 120 sec |
| Time on hold for insert for another table | ~ 0.211 sec | ~ 25 sec |
However, yes there is a however, PXC was maintaining consistency between the different nodes during the DDL execution, while MySQL 8.0.27 with Group Replication was postponing consistency on the secondaries, thus Primary and Secondary were not in sync until full DDL finalization on the secondaries.
Conclusions
MySQL 8.0.27 comes with this nice fix that significantly reduces the impact of an online DDL operation on a busy server. But we can still observe a significant misalignment of the data between the nodes when a DDL is executing.
On the other hand, PXC with NBO is a bit more “expensive” in time, but nodes remain aligned all the time.
In the end, is what is more important for you to choose one or the other solution, consistency vs. operational impact.
Great MySQL to all.
23
2021
MySQL on Kubernetes with GitOps

GitOps workflow was introduced by WeaveWorks as a way to implement Continuous Deployment for cloud-native applications. This technique quickly found its way into devops and developer’s hearts as it greatly simplifies the application delivery pipeline: the change in the manifests in the git repository is reflected in Kubernetes right away. With GitOps there is no need to provide access to the cluster for the developer as all the actions are executed by the Operator.
This blog post is a guide on how to deploy Percona Distribution for MySQL on Kubernetes with Flux – GitOps Operator that keeps your cluster state in sync with the Git repository.

In a nutshell, the flow is the following:
- Developer triggers the change in the GitHub repository
- Flux Operator:
- detects the change
- deploys Percona Distribution for MySQL Operator
- creates the Custom Resource, which triggers the creation of Percona XtraDB Cluster and HAProxy pods
The result is a fully working MySQL service deployed without talking to Kubernetes API directly.
Preparation
Prerequisites:
- Kubernetes cluster
- Github user and account
- For this blog post, I used the manifests from this repository
It is a good practice to create a separate namespace for Flux:
$ kubectl create namespace gitops
Installing and managing Flux is easier with
fluxctl
. In Ubuntu, I use snap to install tools, for other operating systems please refer to the manual here.
$ sudo snap install fluxctl --classic
Install Flux operator to your Kubernetes cluster:
$ fluxctl install --git-email=your@email.com --git-url=git@github.com:spron-in/blog-data.git --git-path=gitops-mysql --manifest-generation=true --git-branch=master --namespace=gitops | kubectl apply -f -
GitHub Sync
As per configuration, Flux will monitor the changes in the spron-in/blog-data repository continuously and sync the state. It is required to grant access to Flux to the repo.
Get the public key that was generated during the installation:
$ fluxctl identity --k8s-fwd-ns gitops
Copy the key, add it as Deploy key with write access in GitHub. Go to Settings -> Deploy keys -> Add deploy key:

Action
All set. Flux reconcile loops check the state for changes every five minutes. To trigger synchronization right away run:
$ fluxctl sync --k8s-fwd-ns gitops
In my case I have two YAMLs in the repo:
-
bundle.yaml
– installs the Operator, creates the Custom Resource Definitions (CRDs)
-
cr.yaml
– deploys PXC and HAProxy pods
Flux is going to deploy them both.
$ kubectl get pods NAME READY STATUS RESTARTS AGE cluster1-haproxy-0 2/2 Running 0 26m cluster1-haproxy-1 2/2 Running 0 25m cluster1-pxc-0 1/1 Running 0 26m cluster1-pxc-1 1/1 Running 0 25m cluster1-pxc-2 1/1 Running 0 23m percona-xtradb-cluster-operator-79966668bd-95plv 1/1 Running 0 26m
Now let’s add one more HAProxy Pod by changing
spec.haproxy.size
from 2 to 3 in
cr.yaml
. After that commit and push the changes. In a production-grade scenario, the Pull Request will go through a thorough review, in my case I push directly to the main branch.
$ git commit cr.yaml -m 'increase haproxy size from 2 to 3' $ git push Enumerating objects: 7, done. Counting objects: 100% (7/7), done. Delta compression using up to 2 threads Compressing objects: 100% (4/4), done. Writing objects: 100% (4/4), 385 bytes | 385.00 KiB/s, done. Total 4 (delta 2), reused 0 (delta 0), pack-reused 0 remote: Resolving deltas: 100% (2/2), completed with 2 local objects. To https://github.com/spron-in/blog-data e1a27b8..d555c77 master -> master
Either trigger the sync with
fluxctl sync
command or wait for approximately 5 minutes for Flux reconcile loop to detect the changes. In the logs of the Flux Operator you will see the event:
ts=2021-06-15T12:59:08.267469963Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/spron-in/blog-data.git branch=master HEAD=d555c77c19ea9d1685392680186e1491905401cc ts=2021-06-15T12:59:08.270678093Z caller=sync.go:61 component=daemon info="trying to sync git changes to the cluster" old=e1a27b8a81e640d3bee9bc2e2c31f9c4189e898a new=d555c77c19ea9d1685392680186e1491905401cc ts=2021-06-15T12:59:08.844068322Z caller=sync.go:540 method=Sync cmd=apply args= count=9 ts=2021-06-15T12:59:09.097835721Z caller=sync.go:606 method=Sync cmd="kubectl apply -f -" took=253.684342ms err=null output="serviceaccount/percona-xtradb-cluster-operator unchanged\nrole.rbac.authorization.k8s.io/percona-xtradb-cluster-operator unchanged\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbbackups.pxc.percona.com configured\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbclusterbackups.pxc.percona.com unchanged\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbclusterrestores.pxc.percona.com unchanged\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbclusters.pxc.percona.com unchanged\nrolebinding.rbac.authorization.k8s.io/service-account-percona-xtradb-cluster-operator unchanged\ndeployment.apps/percona-xtradb-cluster-operator unchanged\nperconaxtradbcluster.pxc.percona.com/cluster1 configured" ts=2021-06-15T12:59:09.099258988Z caller=daemon.go:701 component=daemon event="Sync: d555c77, default:perconaxtradbcluster/cluster1" logupstream=false ts=2021-06-15T12:59:11.387525662Z caller=loop.go:236 component=sync-loop state="tag flux" old=e1a27b8a81e640d3bee9bc2e2c31f9c4189e898a new=d555c77c19ea9d1685392680186e1491905401cc ts=2021-06-15T12:59:12.122386802Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/spron-in/blog-data.git branch=master HEAD=d555c77c19ea9d1685392680186e1491905401cc
The log indicates that the main CR was configured:
perconaxtradbcluster.pxc.percona.com/cluster1 configured
Now we have three HAProxy Pods:
$ kubectl get pods NAME READY STATUS RESTARTS AGE cluster1-haproxy-0 2/2 Running 1 50m cluster1-haproxy-1 2/2 Running 0 48m cluster1-haproxy-2 2/2 Running 0 4m45s
It is important to note that GitOps maintains the sync between Kubernetes and GitHub. It means that if the user manually changes the object on Kubernetes, Flux, or any other GitOps Operator will revert the changes and sync them with GitHub.
GitOps also comes in handy when users want to take the backup or perform the restoration. To do that the user just creates YAML manifests in the GitHub repo and Flux creates corresponding Kubernetes objects. The Database Operator does the rest.
Conclusion
GitOps is a simple approach to deploy and manage applications on Kubernetes:
- Change Management is provided by git version-control and code reviews
- Direct access to Kubernetes API is limited which increases security
- Infrastructure-as-a-Code is here, there is no need to integrate Terraform, Ansible, or any other tool
All Percona Operators can be deployed and managed with GitOps. As a result, you will get production-grade MySQL, MongoDB, or PostgreSQL cluster which just works.
28
2021
Resolving the MySQL Active-Active Replication Dilemma

Multi-writer replication has been a challenge in the MySQL ecosystem for years before truly dedicated solutions were introduced – first Galera (and so Percona XtradDB Cluster (PXC)) replication (around 2011), and then Group Replication (first GA in 2016).
Now, with both multi-writer technologies available, do we still need traditional asynchronous replication, set up in active-active topology? Apparently yes, there are still valid use cases. And you may need it not only when for some reason Galera/PXC or GR are not suitable, but also when you actually use them. Of course, the most typical case is to have a second cluster in a different geographic location, as Disaster Recovery. If you still wonder why you would need it, just read how a whole data center can disappear in the news a few weeks ago, about the OVH incident.
So, a DR site needs to replicate online from the primary cluster and be able to take over the workload very fast if needed. But also it is expected to be able to switch back effortlessly, hence very often the async replication channels are set up in both directions.
A very good writeup about this can be found here: How Not to do MySQL High Availability: Geographic Node Distribution with Galera-Based Replication Misuse.
Now, after seeing repeating problems with active-active setups falling over and over for years, I thought there are still too few warnings out there about the risks, so I decided to add one more little stone to the stack.
Before I continue, I have to mention this great webinar made last year by my colleague Sveta. You should definitely watch if you are interested in the subject: How Safe is Asynchronous Master-Master Setup in MySQL?.
Register for Percona Live ONLINE
A Virtual Event about Open Source Databases
Failure Test
So, let me demonstrate a simple test case, which may be an eye-opener to some.
First, let’s use the great dbdeployer tool to launch two MySQL instances with active-active replication in just one command:
$ dbdeployer deploy replication --topology=all-masters --nodes=2 --concurrent 8.0.23 all-masters directory installed in $HOME/sandboxes/all_masters_msb_8_0_23 run 'dbdeployer usage multiple' for basic instructions' $HOME/sandboxes/all_masters_msb_8_0_23/initialize_ms_nodes # server: 1 # server: 2
Now, create a very simple table with one example data row:
node1 (test) > create table t1 (id int primary key auto_increment, a int); Query OK, 0 rows affected (0.04 sec) node1 (test) > insert into t1 set a=500; Query OK, 1 row affected (0.01 sec)
So, at this point both nodes have the same data:
node1 (test) > select * from test.t1; +----+------+ | id | a | +----+------+ | 1 | 500 | +----+------+ 1 row in set (0.00 sec) node2 (test) > select * from test.t1; +----+------+ | id | a | +----+------+ | 1 | 500 | +----+------+ 1 row in set (0.00 sec)
In the next step, let’s simulate some little replication lag by introducing a delay of one second, as it will allow reproducing the problem at will:
node1 (test) > stop replica sql_thread; CHANGE REPLICATION SOURCE TO SOURCE_DELAY=1 FOR CHANNEL "node2"; start replica sql_thread; Query OK, 0 rows affected (0.01 sec) Query OK, 0 rows affected (0.00 sec) Query OK, 0 rows affected (0.00 sec) node2 (test) > stop replica sql_thread; CHANGE REPLICATION SOURCE TO SOURCE_DELAY=1 FOR CHANNEL "node1"; start replica sql_thread; Query OK, 0 rows affected (0.01 sec) Query OK, 0 rows affected (0.00 sec) Query OK, 0 rows affected (0.00 sec)
OK, so what if we send an update, very close in time, to both nodes, where the same row gets a different value:
$ all_masters_msb_8_0_23/use_all -e "update test.t1 set a=@@server_id where id=1" # server: 1 # server: 2
As a result, both nodes have different column values!
node1 (test) > select * from test.t1; +----+------+ | id | a | +----+------+ | 1 | 200 | +----+------+ 1 row in set (0.00 sec) node2 (test) > select * from test.t1; +----+------+ | id | a | +----+------+ | 1 | 100 | +----+------+ 1 row in set (0.00 sec)
Is Replication Broken?
You may think replication is now broken and some error will alert you about the situation? Not at all!
node1 (test) > show replica status\G *************************** 1. row *************************** ... Replica_IO_Running: Yes Replica_SQL_Running: Yes ... Seconds_Behind_Source: 0 Source_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: ... Executed_Gtid_Set: 00023824-1111-1111-1111-111111111111:1-3, 00023825-2222-2222-2222-222222222222:1 ... node2 (test) > show replica status\G *************************** 1. row *************************** ... Replica_IO_Running: Yes Replica_SQL_Running: Yes ... Seconds_Behind_Source: 0 Source_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: ... Executed_Gtid_Set: 00023824-1111-1111-1111-111111111111:1-3, 00023825-2222-2222-2222-222222222222:1
Or, did you hope enabling the GTID feature would prevent inconsistency from happening? Well, see the Executed_Gtid_Set on both nodes – it is identical, yet the nodes have different data.
What happens next? Well, it depends, maybe replication will eventually fail someday with an error if the same row is modified again, but also it is possible the inconsistency will spread further without you even notice it!
Lesson learned?
I hope this simple example emphasized the need for extra care when dealing with multi-primary replication topologies. The remedy though is usually quite simple:
1 – Make sure to allow only one node at a time for writes (super_read_only is here to help you).
2 – Check the data consistency regularly with appropriate tools (MySQL replication primer with pt-table-checksum and pt-table-sync).
23
2021
Yearly Reminder: DDLs That Fail to Propagate May Cause Percona XtraDB Cluster Inconsistencies

Apologies for the silly title, but the issue is a real one, even though it is not a new thing. Schema upgrades are not an ordinary operation in Galera. For the subject at hand, the bottom line is: under the default Total Order Isolation (TOI) method, “the cluster replicates the schema change query as a statement before its execution.” What this means in practice is that a DDL issued in one node is replicated to other nodes in the cluster before it is even executed in the source node, let alone completed successfully.
As a result of this, it may fail in one node and be successful in another, and this without raising loud alerts or stopping nodes to protect against data inconsistency. This is not a bug in itself but rather a compromise of design. With new changes in MySQL and the recent support for atomic DDLs added in version 8.0, we may see improvements in this area over time in Galera as well. For now, the warning should be maintained: DDLs are not atomic in Galera, even in 8.0.
How Big of an Issue Is This?
To be fair, it’s rare – we do not see this happening often.
How Does the Problem Manifest Itself in Practice? Give Us an Example!
We recently investigated a case where a DBA issued a TRUNCATE TABLE statement in one node, which was taking some time to complete. His reaction at the time was to hit Ctrl+C on the session where the DDL was issued. That session seemed to hang, so he ended up opening a new session and issued a full DELETE on that same table. It also took time to complete, however this time, he was more patient. The problem was: once the statement was completed, he noticed that the two other nodes disappeared from the cluster. Upon closer inspection of the logs, both nodes reported the following error before evicting the cluster:
[ERROR] Slave SQL: Could not execute Delete_rows event on table mydb.target; Can't find record in 'target’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1073, Error_code: 1032
The DELETE was successful on the source node but failed to complete on the other two nodes. Why? Because the target table was already empty: the previous TRUNCATE TABLE was effectively executed in those nodes.
Wasn’t the TRUNCATE TABLE also successful on the source node? It turned out it wasn’t. In fact, the TRUNCATE was recorded in the slow query log of the source node like this:
# Time: (...) # User (...) # Schema: building Last_errno: 1317 Killed: 0 # Query_time: 145.753247 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0 # Bytes_sent: 44 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0 # QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No # Filesort: No Filesort_on_disk: No Merge_passes: 0 # No InnoDB statistics available for this query SET timestamp=(...) truncate target;
Note the Last_errno indicated above:
$ perror 1317 MySQL error code 1317 (ER_QUERY_INTERRUPTED): Query execution was interrupted
Our colleague Przemek reproduced a similar problem, which he described in bug PXC-3536: Modify processing to not allow threads/queries to be killed if the thread is in TOI. The Percona Engineering team has since fixed the abortion issue, leaving the following note on the bug tracker:
Fixed in 5.7[.33-31.49] and 8.0[.22-13.1]. Any attempt to kill the query or connection while in TOI will fail. Note that due to timing, the “kill query” may sometimes return success, but the operation will still have succeeded.
In the case we investigated, the MySQL error log of the source node had the answer; there was some other higher-priority transaction already in place on that node that prevented it from executing the TRUNCATE:
[Note] WSREP: MDL conflict db=mydb table=target ticket=MDL_SHARED_WRITE solved by abort
To clarify, this cluster was receiving writes from multiple nodes, and the higher-priority transaction was applied by a replication thread, which usually has precedence over local events when a conflict is detected.
As you may have suspected, the TRUNCATE was not found in the binary log of the source node; since the statement did not complete, it was not logged. Only the subsequent DELETE statement was there.
And as for the binary logs of the other two nodes, they included both the TRUNCATE as well as the DELETEs covering all rows in the target table. Rows these other nodes no longer had in place and thus were unable to remove caused the data inconsistency that followed.
20
2021
Change Storage Class on Kubernetes on the Fly

Percona Kubernetes Operators support various options for storage: Persistent Volume (PV), hostPath, ephemeral storage, etc. In most of the cases, PVs are used, which are provisioned by the Operator through Storage Classes and Persistent Volume Claims.
Storage Classes define the underlying volume type that should be used (ex. AWS, EBS, gp2, or io1), file system (xfs, ext4), and permissions. In various cases, cluster administrators want to change the Storage Class for already existing volumes:
- DB cluster is underutilized and it is a good cost-saving when switching from io1 to gp2
- The other way – DB cluster is saturated on IO and it is required to upsize the volumes
- Switch the file system for better performance (MongoDB is much better with xfs)
In this blog post, we will show what the best way is to change the Storage Class with Percona Operators and not introduce downtime to the database. We will cover the change in the Storage Class, but not the migration from PVC to other storage types, like hostPath.
Changing Storage Class on Kubernetes
Prerequisites:
- GKE cluster
- Percona XtraDB Cluster (PXC) deployed with Percona Operator. See instructions here.
Goal:
- Change the storage from pd-standard to pd-ssd without downtime for PXC.
Planning
The steps we are going to take are the following:
- Create a new Storage class for pd-ssd volumes
- Change the
storageClassName
in Custom Resource (CR)
- Change the
storageClassName
in the StatefulSet
- Scale up the cluster (optional, to avoid performance degradation)
- Reprovision the Pods one by one to change the storage
- Scale down the cluster
Register for Percona Live ONLINE
A Virtual Event about Open Source Databases
Execution
Create the Storage Class
By default, standard Storage Class is already present in GKE, we need to create the new
StorageClass
for ssd:
$ cat pd-ssd.yaml apiVersion: storage.k8s.io/v1 kind: StorageClass metadata: name: ssd provisioner: kubernetes.io/gce-pd parameters: type: pd-ssd volumeBindingMode: Immediate reclaimPolicy: Delete $ kubectl apply -f pd-ssd.yaml
The new Storage Class will be called
ssd
and will provision the volumes of
type: pd-ssd
.
Change the storageClassName in Custom Resource
We need to change the configuration for Persistent Volume Claims in our Custom Resource. The variable we look for is
storageClassName
and it is located under
spec.pxc.volumeSpec.persistentVolumeClaim
.
spec: pxc: volumeSpec: persistentVolumeClaim: - storageClassName: standard + storageClassName: ssd
Now apply new cr.yaml:
$ kubectl apply -f deploy/cr.yaml
Change the storageClassName in the StatefulSet
StatefulSets are almost immutable and when you try to edit the object you get the warning:
# * spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
We will rely on the fact that the operator controls the StatefulSet. If we delete it, the Operator is going to recreate it with the last applied configuration. For us, it means – with the new storage class. But the deletion of the StatefulSet leads to Pods termination, but our goal is 0 downtime. To get there, we will delete the set, but keep the Pods running. It can be done with –cascade flag:
$ kubectl delete sts cluster1-pxc --cascade=orphan
As a result, the Pods are up and the Operator recreated the StatefulSet with new storageClass:
$ kubectl get sts | grep cluster1-pxc cluster1-pxc 3/3 8s

Scale up the Cluster (Optional)
Changing the storage type would require us to terminate the Pods, which decreases the computational power of the cluster and might cause performance issues. To improve performance during the operation we are going to changing the size of the cluster from 3 to 5 nodes:
spec: pxc: - size: 3 + size: 5 $ kubectl apply -f deploy/cr.yaml
As long as we have changed the StatefulSet already, new PXC Pods will be provisioned with the volumes backed by the new
StorageClass
:
$ kubectl get pvc datadir-cluster1-pxc-0 Bound pvc-6476a94c-fa1b-45fe-b87e-c884f47bd328 6Gi RWO standard 78m datadir-cluster1-pxc-1 Bound pvc-fcfdeb71-2f75-4c36-9d86-8c68e508da75 6Gi RWO standard 76m datadir-cluster1-pxc-2 Bound pvc-08b12c30-a32d-46a8-abf1-59f2903c2a9e 6Gi RWO standard 64m datadir-cluster1-pxc-3 Bound pvc-b96f786e-35d6-46fb-8786-e07f3097da02 6Gi RWO ssd 69m datadir-cluster1-pxc-4 Bound pvc-84b55c3f-a038-4a38-98de-061868fd207d 6Gi RWO ssd 68m
Reprovision the Pods One by One to Change the Storage
This is the step where underlying storage is going to be changed for the database Pods.
Delete the PVC of the Pod that you are going to reprovision. Like for Pod
cluster1-pxc-2
the PVC is called
datadir-cluster1-pxc-2
:
$ kubectl delete pvc datadir-cluster1-pxc-2
The PVC will not be deleted right away as there is a Pod using it. To proceed, delete the Pod:
$ kubectl delete pod cluster1-pxc-2
The Pod will be deleted along with the PVCs. The StatefulSet controller will notice that the pod is gone and will recreate it along with the new PVC of a new Storage Class:
$ kubectl get pods ... cluster1-pxc-2 0/3 Init:0/1 0 7s
$ kubectl get pvc datadir-cluster1-pxc-2 NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE … datadir-cluster1-pxc-2 Bound pvc-08b12c30-a32d-46a8-abf1-59f2903c2a9e 6Gi RWO ssd 10s
The STORAGECLASS column indicates that this PVC is of type ssd.
You might face the situation that the Pod is stuck in a pending state with the following error:
Warning FailedScheduling 63s (x3 over 71s) default-scheduler persistentvolumeclaim "datadir-cluster1-pxc-2" not found
It can happen due to the race condition: Pod was created when old PVC was terminating, and when the Pod is ready to start the PVC is already gone. Just delete the Pod again, so that the PVC is recreated.
Once the Pod is up, the State Snapshot Transfer kicks in and the data is synced from other nodes. It might take a while if your cluster holds lots of data and is heavily utilized. Please wait till the node is fully up and running, sync is finished, and only then proceed to the next Pod.
Scale Down the Cluster
Once all the Pods are running on the new storage it is time to scale down the cluster (if it was scaled up):
spec:
pxc:
- size: 5
+ size: 3
$ kubectl apply -f deploy/cr.yaml
Do not forget to clean up the PVCs for nodes 4 and 5. And done!
Conclusion
Changing the storage type is a simple task on the public clouds, but its simplicity is not synchronized yet with Kubernetes capabilities. Kubernetes and Container Native landscape is evolving and we hope to see this functionality soon. The way of changing Storage Class described in this blog post can be applied to both the Percona Operator for PXC and Percona Operator for MongoDB. If you have ideas on how to automate this and are willing to collaborate with us on the implementation, please submit the Issue to our public roadmap on Github.




