How To Recover Percona XtraDB Cluster 5.7 Node Without SST

Recover Percona XtraDB Cluster 5.7 Node Without SST

The Problem

Recover Percona XtraDB Cluster 5.7 Node Without SSTState Snapshot Transfer can be a very long and expensive process, depending on the size of your Percona XtraDB Cluster (PXC)/Galera cluster, as well as network and disk bandwidth. There are situations where it is needed though, like after long enough node separation, where the gcache on other members was too small to keep all the needed transactions.

Let’s see how we can avoid SST, yet recover fast and without even the need for doing a full backup from another node.

Below, I will present a simple scenario, where one of the cluster nodes was having a broken network for long enough that it will make Incremental State Transfer (IST) no longer possible.

For this solution to work, I am assuming that the cluster has binary logs with GTID mode enabled, and logs with missing transactions were not purged yet. Though it would be still possible without GTID, just slightly more complex.

My example PXC member, node3, gets separated from the cluster due to a network outage. Its last applied transaction status is:

node3 > show global variables like 'gtid_executed';
| Variable_name | Value                                        |
| gtid_executed | 2cd15721-261a-ee14-4166-00c9b4945b0b:1-28578 |
1 row in set (0.01 sec)

node3 > show status like 'wsrep_last_committed';
| Variable_name        | Value |
| wsrep_last_committed | 28610 |
1 row in set (0.00 sec)

However, other available active nodes in the cluster have already rotated the gcache further:

node1 > show status like 'wsrep_local_cached_downto';
| Variable_name             | Value |
| wsrep_local_cached_downto | 42629 |
1 row in set (0.00 sec)

Hence, after the network is restored, it fails to re-join the cluster due to IST failure:

DONOR error log:

2021-06-30T21:52:02.199697Z 2 [Note] WSREP: IST request: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:28610-83551|tcp://
2021-06-30T21:52:02.199743Z 2 [Note] WSREP: IST first seqno 28611 not found from cache, falling back to SST

JOINER error log:

2021-06-30T21:52:02.139242Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 83551)
2021-06-30T21:52:02.139408Z 4 [Note] WSREP: State transfer required:
Group state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:83551
Local state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:28610
2021-06-30T21:52:02.200137Z 0 [Warning] WSREP: 1.0 (node1): State transfer to 0.0 (node3) failed: -61 (No data available)
2021-06-30T21:52:02.200171Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():805: State transfer request failed unrecoverably because the donor seqno had gone forward during IST, but SST request was not prepared from our side due to selected state transfer method (which do not supports SST during node operation). Restart required.
2021-06-30T21:52:02.200191Z 0 [Note] WSREP: gcomm: terminating thread

And node3 shuts down its service as a result.

The Solution

To avoid using full backup transfer from the donor, let’s try asynchronous replication here, to let the failed node catch up with the others so that IST should be possible later.

To achieve that, let’s modify the configuration file first on the separated node, and add these to avoid accidental writes during the operation:

super_read_only = 1

and to disable PXC mode for the time, comment out the provider:


Now, after a restart, node3 becomes a standalone MySQL node, without Galera replication enabled. So, let’s configure async replication channel (repl user was created already on all nodes):

Query OK, 0 rows affected, 2 warnings (0.03 sec)

node3 > start slave;
Query OK, 0 rows affected (0.00 sec)

And then wait for it to catch up with the source node. Once this replica is fully up to date, let’s stop it, remove async channel configuration, and note its new GTID position:

node3 > stop slave;
Query OK, 0 rows affected (0.00 sec)

node3 > reset slave all;
Query OK, 0 rows affected (0.01 sec)

node3 > show global variables like 'gtid_executed';
| Variable_name | Value                                        |
| gtid_executed | 2cd15721-261a-ee14-4166-00c9b4945b0b:1-83553 |
1 row in set (0.00 sec)

Now, we have to find the corresponding cluster’s wsrep sequence, in the source binary log, like this:

$ mysqlbinlog mysql-bin.000005|grep -A1000 '2cd15721-261a-ee14-4166-00c9b4945b0b:83553'|grep Xid|head -1
#210701  0:19:06 server id 100  end_log_pos 1010 CRC32 0x212d2592  Xid = 83557

With this position, the grastate.dat file on the failed node has to be updated, as follows:

$ cat pxc_msb_pxc5_7_33/node3/data/grastate.dat
# GALERA saved state
version: 2.1
uuid:    d32ea8de-d9e5-11eb-be99-ff364b6ba4f4
seqno:   83557
safe_to_bootstrap: 0

The previous configuration file modifications must be now reverted, and the service restarted again.

This time, IST was finally possible:

2021-06-30T22:26:10.563512Z 2 [Note] WSREP: State transfer required:
Group state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:85668
Local state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:83557
2021-06-30T22:26:28.860555Z 2 [Note] WSREP: Receiving IST: 2111 writesets, seqnos 83557-85668
2021-06-30T22:26:28.860812Z 0 [Note] WSREP: Receiving IST...  0.0% (   0/2111 events) complete.
2021-06-30T22:26:29.247313Z 0 [Note] WSREP: Receiving IST...100.0% (2111/2111 events) complete.
2021-06-30T22:26:29.247713Z 2 [Note] WSREP: IST received: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:85668
2021-06-30T22:26:29.247902Z 0 [Note] WSREP: 0.0 (node3): State transfer from 1.0 (node1) complete.
2021-06-30T22:26:29.248074Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 85668)

And node3 joins back the cluster properly:

node3 > show global variables like 'gtid_executed';
| Variable_name | Value                                        |
| gtid_executed | 2cd15721-261a-ee14-4166-00c9b4945b0b:1-85664 |
1 row in set (0.00 sec)

node3 > show status like 'wsrep_last_committed';
| Variable_name        | Value |
| wsrep_last_committed | 85668 |
1 row in set (0.01 sec)


With the help of traditional asynchronous replication, we were able to restore the failed node back to the cluster faster and without all the overhead related to a full backup made by SST.

The only requirement for such a method to work is an enabled binary log, with a long enough rotation policy.

I have tested this on version:

node3 > select @@version,@@version_comment\G
*************************** 1. row ***************************
        @@version: 5.7.33-36-49-log
@@version_comment: Percona XtraDB Cluster binary (GPL) 5.7.33-rel36-49, Revision a1ed9c3, wsrep_31.49
1 row in set (0.00 sec)

Unfortunately, a similar solution does not work with Percona XtraDB Cluster 8.0.x, due to the modified way wsrep positions are kept in the storage engine, hence the trick with updating grastate.dat does not work as expected there.

I would like to also remind here, that in case some node is expected to stay separated from the cluster for too long, there is a way to preserve longer galera cache history for it. So by doing this, the solution I presented may not even be needed – check the relevant article: Want IST Not SST for Node Rejoins? We Have a Solution!


Skipping Percona Server for MySQL Version 8.0.24 and Releasing 8.0.25 Next

Percona Server for MySQL Version 8.0.24

After Oracle released MySQL version 8.0.24 on April 20, 2021, our engineering team got started right away with merging our enhancements to prepare the corresponding 8.0.24 version of Percona Server for MySQL.

However, Oracle released MySQL version 8.0.25 shortly afterward on May 11, 2021, to fix a critical bug that we also observed during our initial testing and reported back to them.

Therefore, we have decided to skip releasing Percona Server for MySQL 8.0.24 both as a standalone product and a distribution as well as the matching Percona XtraDB Cluster release.

Our next public release after Percona Server for MySQL version 8.0.23 will be version 8.0.25, so don’t be surprised if you’re looking for version 8.0.24 and can’t find it. Thank you!


Inconsistent Voting in Percona XtraDB Cluster

Cluster Error Voting

AKA Cluster Error Voting…

Cluster Error VotingWhat is Cluster Error Voting (CEV)?

“Cluster Error Voting is a new feature implemented by Alexey Yurchenko, and it is a protocol for nodes to decide how the cluster will react to problems in replication. When one or several nodes have an issue applying an incoming transaction(s) (e.g., suspected inconsistency), this new feature helps. In a 5-node cluster, if 2-nodes fail to apply the transaction, they get removed, and a DBA can go in to fix what went wrong so that the nodes can rejoin the cluster. (Seppo Jaakola)”

This feature was ported to Percona XtraDB Cluster (PXC) in version 8.0.21. As indicated above, it is about increasing the resilience of the cluster, especially when TWO nodes fail to operate and may drop from the cluster abruptly. The protocol is activated in a cluster with any number of nodes.

Before CEV, if a node has a problem/error during a transaction, the node having the issue would report the error in his own log and exit the cluster:

2021-04-23T15:18:38.568903Z 11 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Write_rows event on table test.test_voting; Duplicate entry '21' for key 'test_voting.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 0, Error_code: MY-001062
2021-04-23T15:18:38.568976Z 11 [Warning] [MY-000000] [WSREP] Event 3 Write_rows apply failed: 121, seqno 16
2021-04-23T15:18:38.569717Z 11 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 224fddf7-a43b-11eb-84d5-2ebf2df70610:16 server_id: d7ae67e4-a43c-11eb-861f-8fbcf4f1cbb8 client_id: 40 trx_id: 115 flags: 3
2021-04-23T15:18:38.575439Z 11 [Note] [MY-000000] [Galera] Closing send monitor...
2021-04-23T15:18:38.575578Z 11 [Note] [MY-000000] [Galera] Closed send monitor.
2021-04-23T15:18:38.575647Z 11 [Note] [MY-000000] [Galera] gcomm: terminating thread
2021-04-23T15:18:38.575737Z 11 [Note] [MY-000000] [Galera] gcomm: joining thread
2021-04-23T15:18:38.576132Z 11 [Note] [MY-000000] [Galera] gcomm: closing backend
2021-04-23T15:18:38.577954Z 11 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,3206d174,5)
memb {
joined {
left {
partitioned {
2021-04-23T15:18:38.578109Z 11 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2021-04-23T15:18:38.578158Z 11 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2021-04-23T15:18:38.578640Z 11 [Note] [MY-000000] [Galera] gcomm: closed
2021-04-23T15:18:38.578747Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

While the other nodes will “just” report the node as out of the view:

2021-04-23T15:18:38.561402Z 0 [Note] [MY-000000] [Galera] forgetting 727c277a (tcp://
2021-04-23T15:18:38.562751Z 0 [Note] [MY-000000] [Galera] Node 3206d174 state primary
2021-04-23T15:18:38.570411Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,3206d174,6)
memb {
joined {
left {
partitioned {
2021-04-23T15:18:38.570679Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2021-04-23T15:18:38.574592Z 0 [Note] [MY-000000] [Galera] forgetting 727c277a (tcp://
2021-04-23T15:18:38.574716Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

With CEV, we have a different process. Let us review it with images first.

Let us start with a cluster…

3 Nodes, where only one works as Primary.

Primary writes and, as expected, writesets are distributed on all nodes.

insert into test_voting values(null,REVERSE(UUID()), NOW()); <-- Few times

DC1-1(root@localhost) [test]>select * from test_voting;
| id | what                                 | when                |
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
5 rows in set (0.00 sec)

Some inexperienced DBA does a manual operation on a secondary using the very unsafe feature wsrep_on…

And then, by mistake or because he did not understand what he is doing…

insert into test_voting values(17,REVERSE(UUID()), NOW()); <-- with few different ids

At the end of the operation of the Secondary node, he will have:

DC1-2(root@localhost) [test]>select * from test_voting;
| id | what                                 | when                |
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
| 16 | 05de43720080-a39a-be11-405a-82715600 | 2021-04-24 14:50:17 |
| 17 | 05de43720080-a39a-be11-405a-f9d62e22 | 2021-04-24 14:51:14 |
| 18 | 05de43720080-a39a-be11-405a-f5624662 | 2021-04-24 14:51:20 |
| 19 | 05de43720080-a39a-be11-405a-cd8cd640 | 2021-04-24 14:50:23 |

This is not in line with the rest of the cluster that still has the previous data. Then our guy put the node back:

At this point, the Primary does another insert in that table and:

Houston, we have a problem! 

The secondary node already has the entry with that ID and cannot perform the insert:

2021-04-24T13:52:51.930184Z 12 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Write_rows event on table test.test_voting; Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 0, Error_code: MY-001062
2021-04-24T13:52:51.930295Z 12 [Warning] [MY-000000] [WSREP] Event 3 Write_rows apply failed: 121, seqno 4928120

But instead of exit from the cluster, it will raise a verification through voting:

2021-04-24T13:52:51.932774Z 0 [Note] [MY-000000] [Galera] Member 0(node2) initiates vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,878ded7898c83a72:  Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062;
2021-04-24T13:52:51.932888Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   878ded7898c83a72:   1/3
Waiting for more votes.
2021-04-24T13:52:51.936525Z 0 [Note] [MY-000000] [Galera] Member 1(node3) responds to vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,0000000000000000: Success
2021-04-24T13:52:51.936626Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   1/3
   878ded7898c83a72:   1/3
Waiting for more votes.
2021-04-24T13:52:52.003615Z 0 [Note] [MY-000000] [Galera] Member 2(node1) responds to vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,0000000000000000: Success
2021-04-24T13:52:52.003722Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   2/3
   878ded7898c83a72:   1/3
Winner: 0000000000000000

As you can see, each node informs the cluster about the success or failure of the operation, and the majority wins.

Once the majority had identified the operation was legit, as such, the node that asked for the voting will need to get out from the cluster:

2021-04-24T13:52:52.038510Z 12 [ERROR] [MY-000000] [Galera] Inconsistency detected: Inconsistent by consensus on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120
	 at galera/src/replicator_smm.cpp:process_apply_error():1433
2021-04-24T13:52:52.062666Z 12 [Note] [MY-000000] [Galera] Closing send monitor...
2021-04-24T13:52:52.062750Z 12 [Note] [MY-000000] [Galera] Closed send monitor.
2021-04-24T13:52:52.062796Z 12 [Note] [MY-000000] [Galera] gcomm: terminating thread
2021-04-24T13:52:52.062880Z 12 [Note] [MY-000000] [Galera] gcomm: joining thread
2021-04-24T13:52:52.063372Z 12 [Note] [MY-000000] [Galera] gcomm: closing backend
2021-04-24T13:52:52.085853Z 12 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,65a111c6-bb0f,23)
memb {
joined {
left {
partitioned {
2021-04-24T13:52:52.086241Z 12 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2021-04-24T13:52:52.086391Z 12 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2021-04-24T13:52:52.150106Z 12 [Note] [MY-000000] [Galera] gcomm: closed
2021-04-24T13:52:52.150340Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

It is also nice to notice that now we have a decent level of information about what happened in the other nodes; the log below is from the Primary:

2021-04-24T13:52:51.932829Z 0 [Note] [MY-000000] [Galera] Member 0(node2) initiates vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,878ded7898c83a72:  Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062;
2021-04-24T13:52:51.978123Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
2021-04-24T13:52:51.981647Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   2/3
   878ded7898c83a72:   1/3
Winner: 0000000000000000
2021-04-24T13:52:51.981887Z 11 [Note] [MY-000000] [Galera] Vote 0 (success) on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120 is consistent with group. Continue.
2021-04-24T13:52:52.064685Z 0 [Note] [MY-000000] [Galera] declaring aae38617-8dd5 at tcp:// stable
2021-04-24T13:52:52.064885Z 0 [Note] [MY-000000] [Galera] forgetting 65a111c6-bb0f (tcp://
2021-04-24T13:52:52.066916Z 0 [Note] [MY-000000] [Galera] Node aae38617-8dd5 state primary
2021-04-24T13:52:52.071577Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,aae38617-8dd5,24)
memb {
joined {
left {
partitioned {
2021-04-24T13:52:52.071683Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2021-04-24T13:52:52.075293Z 0 [Note] [MY-000000] [Galera] forgetting 65a111c6-bb0f (tcp://
2021-04-24T13:52:52.075419Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

At this point, a DBA can start to investigate and manually fix the inconsistency and have the node rejoin the cluster. In the meantime, the rest of the cluster continues to operate:

| id | what                                 | when                |
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
| 18 | 05de43720080-938a-be11-405a-d02c7bc5 | 2021-04-24 14:52:51 |


Cluster Error Voting (CEV) is a nice feature to have. It helps us better understand what goes wrong and increases the stability of the cluster, and with the voting has a better way to manage the node expulsion.

Another aspect is visibility; never underestimate the fact information is available also on other nodes. Having it available on multiple nodes may help investigations if the log on the failing node gets lost (for any reason).

We still do not have active tuple certification, but it is a good step, especially given the history we have seen of data drift in PXC/Galera in these 12 years of utilization.

My LAST comment is that while I agree WSREP_ON can be a very powerful tool in the hands of experts, as indicated in my colleague’s blog How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative). That option remains DANGEROUS, and you should never use it UNLESS your name is Przemys?aw Malkowski, and you really know what you are doing.

Great MySQL to everybody!



Galera Clustering in MariaDB 10.5 and beyond – Seppo Jaakola – MariaDB Server Fest 2020


Register for Percona Live ONLINE
A Virtual Event about Open Source Databases


Yearly Reminder: DDLs That Fail to Propagate May Cause Percona XtraDB Cluster Inconsistencies

DDLs Percona XtraDB Cluster Inconsistencies

DDLs Percona XtraDB Cluster InconsistenciesApologies for the silly title, but the issue is a real one, even though it is not a new thing. Schema upgrades are not an ordinary operation in Galera. For the subject at hand, the bottom line is: under the default Total Order Isolation (TOI) method, “the cluster replicates the schema change query as a statement before its execution.” What this means in practice is that a DDL issued in one node is replicated to other nodes in the cluster before it is even executed in the source node, let alone completed successfully.

As a result of this, it may fail in one node and be successful in another, and this without raising loud alerts or stopping nodes to protect against data inconsistency. This is not a bug in itself but rather a compromise of design. With new changes in MySQL and the recent support for atomic DDLs added in version 8.0, we may see improvements in this area over time in Galera as well. For now, the warning should be maintained: DDLs are not atomic in Galera, even in 8.0.

How Big of an Issue Is This?

To be fair, it’s rare – we do not see this happening often.

How Does the Problem Manifest Itself in Practice? Give Us an Example!

We recently investigated a case where a DBA issued a TRUNCATE TABLE statement in one node, which was taking some time to complete. His reaction at the time was to hit Ctrl+C on the session where the DDL was issued. That session seemed to hang, so he ended up opening a new session and issued a full DELETE on that same table. It also took time to complete, however this time, he was more patient. The problem was: once the statement was completed, he noticed that the two other nodes disappeared from the cluster. Upon closer inspection of the logs, both nodes reported the following error before evicting the cluster:

[ERROR] Slave SQL: Could not execute Delete_rows event on table mydb.target; Can't find record in 'target’, Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1073, Error_code: 1032

The DELETE was successful on the source node but failed to complete on the other two nodes. Why? Because the target table was already empty: the previous TRUNCATE TABLE was effectively executed in those nodes.

Wasn’t the TRUNCATE TABLE also successful on the source node? It turned out it wasn’t. In fact, the TRUNCATE was recorded in the slow query log of the source node like this:

# Time: (...)
# User (...)
# Schema: building Last_errno: 1317 Killed: 0
# Query_time: 145.753247 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0
# Bytes_sent: 44 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0
# No InnoDB statistics available for this query
SET timestamp=(...)
truncate target;

Note the Last_errno indicated above:

$ perror 1317
MySQL error code 1317 (ER_QUERY_INTERRUPTED): Query execution was interrupted

Our colleague Przemek reproduced a similar problem, which he described in bug PXC-3536: Modify processing to not allow threads/queries to be killed if the thread is in TOI. The Percona Engineering team has since fixed the abortion issue, leaving the following note on the bug tracker:

Fixed in 5.7[.33-31.49] and 8.0[.22-13.1].  Any attempt to kill the query or connection while in TOI will fail.  Note that due to timing, the “kill query” may sometimes return success, but the operation will still have succeeded.

In the case we investigated, the MySQL error log of the source node had the answer; there was some other higher-priority transaction already in place on that node that prevented it from executing the TRUNCATE:

[Note] WSREP: MDL conflict db=mydb table=target ticket=MDL_SHARED_WRITE solved by abort

To clarify, this cluster was receiving writes from multiple nodes, and the higher-priority transaction was applied by a replication thread, which usually has precedence over local events when a conflict is detected.

As you may have suspected, the TRUNCATE was not found in the binary log of the source node; since the statement did not complete, it was not logged. Only the subsequent DELETE statement was there.

And as for the binary logs of the other two nodes, they included both the TRUNCATE as well as the DELETEs covering all rows in the target table. Rows these other nodes no longer had in place and thus were unable to remove caused the data inconsistency that followed.


Point-In-Time Recovery in Kubernetes Operator for Percona XtraDB Cluster – Architecture Decisions

Point-In-Time Recovery in Kubernetes Operator

Point-In-Time Recovery in Kubernetes OperatorPoint-In-Time Recovery (PITR) for MySQL databases is an important feature that is essential and covers common use cases, like a recovery to the latest possible transaction or roll-back the database to a specific date before some bad query was executed. Percona Kubernetes Operator for Percona XtraDB Cluster (PXC) added support for PITR in version 1.7, and in this blog post we are going to look into the technical details and decisions we made to implement this feature.

Architecture Decisions

Store Binary Logs on Object Storage

MySQL uses binary logs to perform point-in-time recovery. Usually, they are stored locally along with the data, but it is not an option for us:

  • We run the cluster and we cannot rely on a single node’s local storage.
  • The cloud-native world lives in an ephemeral dimension, where nodes and pods can be terminated and S3-compatible storage is a de facto standard to store data.
  • We should be able to recover the data to another Kubernetes cluster in case of a disaster.

We have decided to add a new Binlog Uploader Pod, which connects to the available PXC member and uploads binary logs to S3. Under the hood, it relies on the mysqlbinlog utility.

Use Global Transaction ID

Binary logs on the clustered nodes are not synced and can have different names and contents. This becomes a problem for the Uploader, as it can connect to different PXC nodes for various reasons.

To solve this problem, we decided to rely on Global Transaction ID (GTID). It is a unique transaction identifier, but it is unique not only to the server on which it originated, but is unique across all servers in a given replication topology.  With the GTID captured in binary logs, we can identify any transaction not depending on the filename or its contents. This allows us to continue streaming binlogs from any PXC member at any moment.

User-Defined Functions

We have a unique identifier for every transaction, but the mysqlbinlog utility still doesn’t have the functionality to determine which binary log file contains which GTID. We decided to extend MySQL with few User Defined Functions and added them to Percona Server for MySQL and Percona XtraDB Cluster versions 8.0.21


This function returns all GTIDs that are stored inside the given binlog file. We put the GTID setlist to a new file next to the binary log on S3.


This function takes GTID set as an input and returns a binlog filename which is stored locally. We use it to figure out which GTIDs are already uploaded and which binlog to upload next. 

binlog uploader pod

Have open source expertise you want to share? Submit your talk for Percona Live ONLINE 2021!

Find the node with the oldest binary log

Our quality assurance team caught a bug before the release which can happen in the cluster only:

  • Add a new node to the Percona XtraDB Cluster (for example scale up from 3 to 5 nodes).
  • Binlog Uploader Pod tries to execute get_binlog_by_gtid_set on the new node but gets the error.
2021/01/19 11:23:19 ERROR: collect binlog files: get last uploaded binlog name by gtid set: scan binlog: sql: Scan error on column index 0, name "get_binlog_by_gtid_set('a8e657ab-5a47-11eb-bea2-f3554c9e5a8d:15')": converting NULL to string is unsupported

The error is valid, as this node is new and there are no binary log files that have the GTID set that Uploader got from S3. If you look into this pull request, the quick patch is to always pick the oldest node in the array or in other words the node, which most likely would have the binary logs we need. In the next release of the Operator, we add more sophisticated logic, to discover the node which has the oldest binary logs for sure.

Storageless binlog uploader

The size of binlogs depends on the cluster usage patterns, so it is hard to predict the size of the storage or memory required for them. We decided to take this complexity away by making our Binary Log Uploader Pod completely storageless. Mysqlbinlog can store remote binlog only into files, but we need to put them to S3. To get there we decided to use a named pipe or FIFO special file. Now mysqlbinlog utility loads the binary log file to a named pipe, our Uploader reads it and streams the data directly to S3.

Also, storageless design means that we never store any state between Uploader restarts. Basically, state is not needed, we only need to know which GTIDs are already uploaded and we have this data on a remote S3 bucket. Such design enables the continuous upload flow of binlogs.

Binlog upload delay

S3 protocol expects that the file is completely uploaded. If the file upload is interrupted (let’s say Uploader Pod is evicted), the file will not be accessible/visible on S3. Potentially we can lose many hours of binary logs because of such interruptions. That’s why we need to split the binlog stream into files and upload them separately.

One of the options that users can configure when enabling point-in-time recovery in Percona XtraDB Cluster Operator is timeBetweenUploads. It sets the number of seconds between uploads for Binlog Uploader Pod. By default, we set it to 60 seconds, but it can go down to one second. We do not recommend setting it too low, as every invocation of the Uploader leads to FLUSH BINARY LOGS command execution on the PXC node. We need to flush the logs to close the binary log file to upload it to external storage, but doing it frequently may negatively affect IO and as a result database performance.


It is all about recovery and it has two steps:

  1. Recover the cluster from a full backup
  2. Apply binary logs

We already have the functionality to restore from a full backup (see here), so let’s get to applying the binary logs.

First, we need to figure out from which GTID set we should start applying binary logs – in other words: where do we start?. As we rely on the Percona XtraBackup utility to take full MySQL backups, what we need to do is read the xtrabackup_info file which has lots of useful metadata. We already have this file on S3 near the full backup.

Second, find the binlog which has the GTID set we need. As you remember, we store a file with binlog’s GTID sets on S3 already, so it boils down to reading these files.

Third, download binary logs and apply them. Here we rely on mysqlbinlog as well, which has the flags we need, like –stop-datetime – which stops recovery when the event with a specific timestamp is caught in the log.

point in time recovery


MySQL is more than 25 years old and has a great tooling ecosystem established around it, but as we saw in this blog post, not all these tools are cloud-native ready. Percona engineering teams are committed to providing users the same features across various environments, whether it is a bare-metal installation in the data center or cutting edge Kubernetes in the cloud.


Wondering How to Run Percona XtraDB Cluster on Kubernetes? Try Our Operator!

Run Percona XtraDB Cluster on Kubernetes

Run Percona XtraDB Cluster on KubernetesKubernetes has been a big trend for a while now, particularly well-suited for microservices. Running your main databases on Kubernetes is probably NOT what you are looking for. However, there’s a niche market for them. My colleague Stephen Thorn did a great job explaining this in The Criticality of a Kubernetes Operator for Databases. If you are considering running your database on Kubernetes, have a look at it first. And, if after reading it you start wondering how the Operator works, Stephen also wrote an Introduction to Percona Kubernetes Operator for Percona XtraDB Cluster (PXC), which presents the Kubernetes architecture and how the Percona Operator simplifies the deployment of a full HA PXC cluster in this environment, proxies included!

Now, if you are curious about how it actually works in practice but are afraid the entry barrier is too high, I can help you with that. In fact, this technology is widespread now, with most cloud providers offering a dedicated Kubernetes engine. In this blog post, I’ll walk you over the steps on how to deploy a Percona XtraDB Cluster (PXC) using the Percona Operator for Kubernetes on Google Cloud Platform (GCP).

Creating a Virtual Environment to Run Kubernetes on GCP

Google Cloud Platform includes among its products the Google Kubernetes Engine (GKE). We can take advantage of their trial offer to create our test cluster there: https://cloud.google.com.

After you sign up, you can access all the bells and whistles in their web interface. Note the Kubernetes Engine API is not enabled by default, you need to do it by visiting the Kubernetes Engine section in the left menu, under COMPUTE.

For the purpose of deploying our environment, we should install their SDK and work from the command line: see https://cloud.google.com/sdk/docs/install and follow the respective installation instructions for your OS (you will probably want to install the SDK on your personal computer).

With the SDK installed, we can initialize our environment, which requires authenticating to the Google Cloud account:

gcloud init

You will be prompted to choose a cloud project to use: there’s one created by default when the account is activated, named “My First Project”. It will receive a unique id, which you can verify in the Google Cloud interface, but usually, it is displayed as the first option presented in the prompt.

Alternatively, you can use gcloud config set to configure your default project and zone, among other settings.

For this exercise, we will be creating a 3-node cluster named k8-test-cluster with n1-standard-4 instances in the us-central1-b zone:

gcloud container clusters create --machine-type n1-standard-4 --num-nodes 3 --zone us-central1-b --cluster-version latest k8-test-cluster

If the command above was successful, you should see your newly created cluster in the list returned by:

gcloud container clusters list

Getting Ready to Work with Kubernetes

Besides the Google Cloud SDK that is used to manage the cloud instances, we also need the Kubernetes command-line tool, kubectl, to manage the Kubernetes cluster. One way to install it is through gcloud itself:

gcloud components install kubectl

This method won’t work for everyone though, as the Cloud SDK component manager is disabled for certain kinds of installation, such as through apt or yum in Linux. I find myself in this group, using Ubuntu, but the failed attempt to install kubectl through gcloud suggested another approach that worked for me:

sudo apt install kubectl

Deploying a PXC Cluster Using the Percona Kubernetes Operator

The Percona operators are available on Github. The most straightforward way to obtain a copy is by cloning the operator’s repository. The latest version of the PXC operator is 1.6.0 and we can clone it with the following command:

git clone -b v1.6.0 https://github.com/percona/percona-xtradb-cluster-operator

Move inside the created directory:

cd percona-xtradb-cluster-operator

and run the following sequence of commands:

  1. Define the Custom Resource Definitions for PXC:

    kubectl apply -f deploy/crd.yaml
  2. Create a namespace on Kubernetes and associate it to your current context:

    kubectl create namespace pxc
    kubectl config set-context $(kubectl config current-context) --namespace=pxc
  3. Define Role-Based Access Control (RBAC) for PXC:

    kubectl apply -f deploy/rbac.yaml
  4. Start the operator within Kubernetes:

    kubectl apply -f deploy/operator.yaml
  5. Configure PXC users and their credentials:

    kubectl apply -f deploy/secrets.yaml
  6. Finally, deploy the cluster:

    kubectl apply -f deploy/cr.yaml

You can find a more detailed explanation of each of these steps, as well as how to customize your installation, in the Percona Kubernetes Operator for Percona XtraDB Cluster online documentation, which includes a quickstart guide for GKE.

Now, it is a matter of waiting for the deployment to complete, which you can monitor with:

kubectl get pods

A successful deployment will show output for the above command similar to:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   0          4m21s
cluster1-haproxy-1                                 2/2     Running   0          2m47s
cluster1-haproxy-2                                 2/2     Running   0          2m21s
cluster1-pxc-0                                     1/1     Running   0          4m22s
cluster1-pxc-1                                     1/1     Running   0          2m52s
cluster1-pxc-2                                     1/1     Running   0          111s
percona-xtradb-cluster-operator-79d786dcfb-9lthw   1/1     Running   0          4m37s

As you can see above, the operator will deploy seven pods with the default settings, and those are distributed across the three GKE n1-standard-4 machines we created at first:

kubectl get nodes
NAME                                             STATUS   ROLES    AGE    VERSION
gke-k8-test-cluster-default-pool-02c370e1-gvfg   Ready    <none>   152m   v1.17.13-gke.1400
gke-k8-test-cluster-default-pool-02c370e1-lvh7   Ready    <none>   152m   v1.17.13-gke.1400
gke-k8-test-cluster-default-pool-02c370e1-qn3p   Ready    <none>   152m   v1.17.13-gke.1400

Accessing the Cluster

One way to access the cluster is by creating an interactive shell in the Kubernetes cluster:

kubectl run -i --rm --tty percona-client --image=percona:8.0 --restart=Never -- bash -il

From there, we can access MySQL through the cluster’s HAproxy writer node:

mysql -h cluster1-haproxy -uroot -proot_password

Note the hostname used above is an alias, the connection being routed to one of the HAproxy servers available in the cluster. It is also possible to connect to a specific node by modifying the host option -h with the node’s name:

mysql -h cluster1-pxc-0 -uroot -proot_password

This is where all the fun and experimentation starts: you can test and break things without worrying too much as you can easily and quickly start again from scratch.

Destroying the Cluster and Deleting the Test Environment

Once you are done playing with your Kubernetes cluster, you can destroy it with:

gcloud container clusters delete --zone=us-central1-b k8-test-cluster

It’s important to note the command above will not discard the persistent disk volumes that were created and used by the nodes, which you can check with the command:

gcloud compute disks list

A final purging command is required to remove those as well:

gcloud compute disks delete <disk_name_1> <disk_name_2> <disk_name_3> --zone=us-central1-b

If you are feeling overzealous, you can double-check that all has been deleted:

gcloud container clusters list

gcloud compute disks list

Learn More About Percona Kubernetes Operator for Percona XtraDB Cluster

Interested In Hands-On Learning?

Be sure to get in touch with Percona’s Training Department to schedule your PXC Kubernetes training engagement. Our expert trainers will guide your team firstly through the basics, cover all the configuration noted above (and then some), and then dive deeper into how the operator functions along with High-Availability exercises, disaster recovery scenarios, backups, restore, and much more.


CVE-2020-15180 – Affects Percona XtraDB Cluster


CVE-2020-15180Galera replication technology, a key component of Percona XtraDB Cluster, suffered from a remote code execution vulnerability. Percona has been working with the vendor since early September on this issue and has made releases available to address the problem.


A malicious party with access to the WSREP service port (4567/TCP) as well as prerequisite knowledge of the configuration of the Galera cluster name is required in order to exploit this vulnerability, which leads to remote code execution via the WSREP protocol. 

Fixes are available in Percona XtraDB Cluster versions:

>= 8.0.20-11.2

>= 5.7.31-31.45.2

>= 5.6.49-28.42.2


Percona would like to thank all the Percona staff involved in the resolution of this issue.

More Information

Release notes


The Criticality of a Kubernetes Operator for Databases

Importance of Kubernetes Operators for Databases

Importance of Kubernetes Operators for DatabasesAs a Solutions Engineer at Percona, one of my responsibilities is to support our customers as they investigate new and emerging technologies. This affords me the opportunity to speak to many current and new customers who partner with Percona. The topic of Kubernetes is becoming more popular as companies are investigating and adopting this technology. The issue most companies are encountering is architecting a stateful database that doesn’t fall victim to an environment tuned for ephemeral workloads. This obviously introduces a level of complexity as to how to run a stateful database in an inherently stateless world, as databases are not natively designed for that.

To make your life easier, as a part of the Percona Cloud-Native Autonomous Database Initiative, our engineering teams have built two Kubernetes Operators: Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB, which allows for Kubernetes Pods to be destroyed, moved, or created with no impact to the application. To see an overview of Kubernetes, you can read this previous blog of mine Introduction to Percona Kubernetes Operator for Percona XtraDB Cluster that covers this topic. It’s common for companies new to Kubernetes to attempt to run their databases in Kubernetes the same way they would in a traditional environment. But, this is not advised as it introduces the possibility of data loss and it is not recommended for production workloads. Why is this dangerous and how has Percona solved this?

Appropriate Workloads for Kubernetes

Kubernetes is not the answer for everyone. It’s even not the answer for most people. Do not be misled into thinking that moving a database into Kubernetes is going to solve any of your problems. Before you consider moving your database into Kubernetes, ensure the rest of your application is cloud-native and can be used with Kubernetes. Moving your database to Kubernetes should happen after you have started both elastic vertical and horizontal scale and need to orchestrate it to control costs.

As more companies are moving to Kubernetes something has to happen to the legacy workloads. Oftentimes we see a lift and shift mentality into Kubernetes, which can be dangerous or cause more work than expected. We have seen two primary ideal use cases for moving database workloads to Kubernetes: Microservices and Unified Abstraction Layer.

Monolithic, large datasets can prohibit some of Kubernetes’ strong points: self-healing and availability. This can be an issue due to the time it takes to physically transmit data to a new Pod instance as it joins the database cluster. If your dataset is too large, this process is slow due to physical limitations and prohibits performance and the availability of your database. Microservices are a great fit due to the relatively smaller datasets, which allows Kubernetes automation to work well with the dataset size.

Companies looking to take full advantage of cloud-native applications and databases can be a really good fit for Kubernetes as well. If you truly want the ability to deploy and run your databases anywhere utilizing the concept of a Unified Abstraction Layer, Kubernetes is a great option. You can move your databases to anywhere that is running Kubernetes and know it will work.

We talked about large unsharded datasets and the limitations Kubernetes presents when handling them, but we should mention a few more workloads better suited for traditional platforms. Applications with a throughput sensitivity may not do well on Kubernetes, or they may not be cost-effective to do so. Kubernetes is fundamentally designed for container orchestration and is not designed to handle highly performant databases that require low latency. This may be possible to achieve, but at what cost? This applies to highly performant distributed applications as well. Lowest latency across all nodes is not a core tenant of Kubernetes, so ensure you have planned and tested against this before you move everything over to Kubernetes.

Pods Are Cattle, Not Pets

If you’re not familiar with Pets vs Cattle, it’s a DevOps concept that differentiates deployment methodologies of unique servers that require attention when issues arise (pets) versus the ability to replace a server with a copy if issues arise (cattle). Due to the nature of how Kubernetes operates, Pods can be destroyed, spun up, and moved at any time due to factors outside of the application’s control, much like how cattle are treated. Kubernetes uses a scheduler, which by design, can destroy and recreate Pods to meet the configuration needs of your Kubernetes Cluster. This is great for stateless applications as any failure in the application will result in a Pod containing the application being destroyed and recreated, eliminating the need for human interaction, and greatly speeding up the process to a resolution. This isn’t ideal for databases as you don’t want your database to suddenly stop working, halt the application, and introduce the potential for lost or corrupted data. One of the tools Kubernetes can utilize to help combat this is called Stateful Sets. These help by keeping a Pod’s identity assigned to it as it is destroyed and re-created. This helps facilitate stateful workloads, but how does this come into play with high availability and utilizing the automation aspects of Kubernetes?

Databases Are Pets, Not Cattle

Databases by design need to keep their identity, information, and most importantly their data safe and accessible at all times. They are the backbones of the application as they are the source of truth an application relies on for normal processing. Any errors in their operations will quickly stop an application from functioning. They are important, to say the least. How can we safely run databases in Kubernetes and still ensure we have highly available database deployments? By using Stateful Sets and Persistent Volumes we can maintain data integrity, but we need an additional set of hands to take on database administrator tasks such as ensuring failover happens, database members are recovered, and re-join the highly available architecture, along with other technology-specific functions. Fortunately, Kubernetes is extensible and has Operators, which aims to automate the key task of a human operator who is managing a service or set of services.

Automation, Automation, Automation

We know the complexities of running a database (safely) in Kubernetes and some of the concepts used to help bridge the gaps between automation and traditional human functions. With the help of Percona’s Kubernetes Operators, we can safely run databases the way they were intended to run. Percona’s Kubernetes Operators are able to automate tasks that are usually done by a database administrator such as:

  • Fully automated deployments with strict consistency and no single point of failure
  • Automated scaling with the ability to change the size parameter to add or remove members of a Cluster or Replica-Set
  • Fully automated backups and restores
  • Fully automated self-healing by automatically recovering from the failure of a single Cluster or Replica-Set member.
  • Automatically manage system users on password rotation
  • Simplified updates

Always Use a Kubernetes Operator

With the complexities of running a highly available database environment and the inherent dangers introduced by using the dynamic Kubernetes environment, an Operator should always be used when deploying databases in Kubernetes. Fortunately, Percona has already solved this by providing Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB. Percona provides full support for databases running in Kubernetes with the Percona Operators. If you are interested in learning more or obtaining support or professional services to maximize your database deployments, please reach out to us.


Various Ways to Perform Schema Upgrades with Percona XtraDB Cluster

Perform Schema Upgrades with Percona XtraDB Cluster

Perform Schema Upgrades with Percona XtraDB ClusterSchema changes are the big challenges in Galera replication. So, it is recommended to understand the schema changes operation for everyone who uses the Percona XtraDB Cluster (PXB)/Galera clusters. In this blog, I am going to explain the operation and impact of the various schema changes methods used in the PXB/Galera cluster.

  • Schema changes with “wsrep_OSU_method = TOI”
  • Schema changes with “wsrep_OSU_method = RSU”
  • Schema changes with “ONLINE ALGORITHMS”
  • Schema changes with “pt-osc”
  • Schema changes with “gh-ost”

For testing:

  • I have configured the 3-node Percona Xtradb Cluster (8.0.19).
  • Executing read/write load using the sysbench.
mysql> select @@wsrep_cluster_address\G
*************************** 1. row ***************************
@@wsrep_cluster_address: gcomm://pxc81,pxc82,pxc83
1 row in set (0.00 sec)

mysql> select @@version, @@version_comment\G
*************************** 1. row ***************************
        @@version: 8.0.19-10
@@version_comment: Percona XtraDB Cluster (GPL), Release rel10, Revision 727f180, WSREP version 26.4.3
1 row in set (0.00 sec)

What is the Impact of Schema Changes in Clusters?

  • By default (TOI), all the nodes in the cluster will be pause during the ALTER process. Because the ALTER needs to be replicated on all the nodes. If the ALTER is big it will affect the performance and could be the cause of the downtime.
  • Rollback is not possible on schema upgrade. 
  • You can’t kill the ALTER query immediately during the operation. So, your application may need to wait until the ALTER completion. 
mysql> pager grep alter
PAGER set to 'grep alter'

mysql> show processlist;
| 19 | root            | localhost | schema_changes | Query   |   18 | altering table           | alter table sbtest1 add index idx_c(c) |         0 |             0 |
7 rows in set (0.00 sec)

mysql> kill 19;
ERROR 1095 (HY000): You are not owner of thread 19

  • MDLs are set only on one node. Not across all the nodes in the cluster. So, you need additional control over this. 

Schema Changes with “wsrep_OSU_method = TOI”

TOI: Total Order Isolation

  • TOI is the default method ( wsrep_OSU_method = TOI ) for schema changes.
  • DDL statements are processed in the same order with regard to other transactions in each node. 
  • The full cluster will be blocked/locked during the DDL operation. 
  • This guarantees data consistency.
mysql> select @@wsrep_OSU_method\G
*************************** 1. row ***************************
@@wsrep_OSU_method: TOI
1 row in set (0.00 sec)


I am going to run the below ALTER on “pxc81”.

alter table sbtest1 add index idx_c(c)

After initiating the ALTER on pxc81, My processlist states the COMMIT and UPDATE ( from sysbench ) statements are paused. Only ALTER is in progress. The COMMIT and UPDATE will be resumed once the ALTER is completed.

| 17 | root            | localhost | schema_changes | Execute |      15 | closing tables                           | COMMIT                                 |         0 |             0 |
| 17 | root            | localhost | schema_changes | Execute |      15 | updating                                 | UPDATE sbtest1 SET c='91668836759-30934071579-18064439108-53267873872-79461377960-32104006456-143369 |         0 |             1 |
| 24 | root            | localhost | schema_changes | Query   |      15 | altering table                           | alter table sbtest1 add index idx_c(c) |         0 |             0 |

But, still, the SELECT statement can be run with “wsrep_sync_wait != 1” because “wsrep_sync_wait = 1” needs the casualty checks from other nodes. So, it will fail.  

SELECT with “wsrep_sync_wait=1”

|  1 | system user     |           | schema_changes | Query   |     0 | altering table         | alter table sbtest1 add index idx_c(c) |         0 |             0 |
| 15 | root            | localhost | schema_changes | Query   |    40 | starting               | select * from sbtest1 where id=1       |         0 |             0 |

mysql> select * from sbtest1 where id=1;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

TOI can be the right choice for quick operations.


Schema Changes with “wsrep_OSU_method = RSU”

RSU – Rolling Schema Upgrade

  • In this method, DDL statements will not replicate across the cluster nodes. Need to execute the DDL individually on all nodes.
  • The node which is executing the DDL will desync from the cluster group. The other nodes in the cluster are still operational and receive the application connections.
  • Once the node executes the DDL, it will start to apply the missing writesets.
  • In this method, the important thing is the WRITEs should not be performed on that particular table until the schema upgrade completes on all the nodes. Users should be very clear on this because the failure will break the cluster and the data may be unrecoverable. 
  • Gcache should be good enough to store the writesets.


At pxc82, I am going to execute the ALTER.

Session 1: (setting up the value to RSU – session-level) 

mysql> set wsrep_OSU_method=RSU;
Query OK, 0 rows affected (0.09 sec)

Session 2: (checking the node status)

mysql> show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
| Variable_name             | Value  |
| wsrep_local_recv_queue    | 0      |
| wsrep_local_state_comment | Synced |
2 rows in set (0.00 sec)

Session 1: (executing the ALTER )

mysql> alter table sbtest1 add index idx_c(c);

Session 2: (checking again the node status )

Here the node went to Donor/Desynced state once the ALTER started. You can see the queue also keeps increasing.

mysql> nopager;  show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
PAGER set to stdout
| Variable_name             | Value          |
| wsrep_local_recv_queue    | 2053           |
| wsrep_local_state_comment | Donor/Desynced |
2 rows in set (0.21 sec)

Session 1: (ALTER completed)

mysql> alter table sbtest1 add index idx_c(c);
Query OK, 0 rows affected (2 min 6.52 sec)
Records: 0  Duplicates: 0  Warnings: 0

Session 2: (Node synced to cluster)

mysql> show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
| Variable_name             | Value  |
| wsrep_local_recv_queue    | 0      |
| wsrep_local_state_comment | Synced |
2 rows in set (0.00 sec)

This step needs to be executed in pxc81 and pxc83 as well. After completing on all nodes, we are good to allow the WRITEs for that table. 

The RSU method is not truly disruption-free, as there are few bugs reported regarding RSU. Users should be very clear and careful about executing the RSU for schema updates:




Schema Changes with “ONLINE ALGORITHMS”

So far, we have 3 algorithms,

  • COPY

With TOI:

“ALGORITHM = INPLACE / COPY” still pauses the cluster during the operation. Galera doesn’t allow transactions when an ALTER TABLE statement is run. So if you are using TOI, any ALTER TABLE will block all transactions on all nodes.

| 17 | root            | localhost | schema_changes | Execute |      12 | closing tables                           | COMMIT                                                               |         0 |             0 |
| 18 | root            | localhost | schema_changes | Execute |      12 | closing tables                           | COMMIT                                                               |         0 |             0 |
| 32 | root            | localhost | schema_changes | Query   |      13 | altering table                           | alter table sbtest1 add index idx_c(c), algorithm=inplace,

“ALGORITHM=INSTANT” is supported and faster in TOI.

mysql> alter table sbtest1 add column test_Ins int , algorithm=instant;
Query OK, 0 rows affected (0.24 sec)
Records: 0  Duplicates: 0  Warnings: 0 lock=none

With RSU:

“ALGORITHM = INPLACE/COPY” is still not beneficial on RSU. It pauses the Galera replication and takes the node to Desync. 

mysql> show processlist;
| 62 | root            | localhost | schema_changes | Query   |    51 | altering table                       | alter table sbtest1 add index idx_c(c), algorithm=inplace, lock=none |         0 |             0 |
5 rows in set (0.06 sec)

mysql> nopager;  show global status where Variable_name like 'wsrep_local_recv_queue' or Variable_name like 'wsrep_local_state_comment';
PAGER set to stdout
| Variable_name             | Value          |
| wsrep_local_recv_queue    | 7335           |
| wsrep_local_state_comment | Donor/Desynced |
2 rows in set (0.03 sec)

“ALGORITHM=INSTANT” is supported and faster in RSU. But, still, you can use TOI to avoid the additional work.

mysql> alter table sbtest1 add column test_Inss int , algorithm=instant;
Query OK, 0 rows affected (0.19 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> select @@wsrep_OSU_method;
| @@wsrep_OSU_method |
| RSU                |
1 row in set (0.02 sec)

I would suggest using the “ALGORITHM = INSTANT ” with TOI wherever you can. But, make sure you have the MySQL 8.x + version. Unfortunately, “ALGORITHM=INSTANT” currently only supports adding new columns. 

Schema Changes with “pt-osc”

pt-osc : Percona-online-schema-change

Personally, I like this approach very much and use this mostly in production environments. Pt-osc provides non-blocking schema upgrades on all nodes in one shot. This should be used with the TOI method. The action flow will be like this:

  • Create a new table “_tablename_new” with the required modification
  • Creates triggers for update the modified rows (insert / update / delete)
  • Copy the records from the original table to the new table using chunk operation.
  • Once the copy is completed, it will swap the table ( original ? _old, _new ? original ) and drop the triggers and old table. Direct DDLs ( RENAME TABLE, DROP TABLE ) will be used for this operation ( wsrep_OSU_method=TOI ). 

For the below ALTER, 

alter table schema_changes.sbtest1 add index idx_test_Ins(test_Ins);

Pt-osc flow in SQL syntax:

Creating new table...
CREATE TABLE `schema_changes`.`_sbtest1_new` (
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  `test_Ins` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=20400335 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Created new table schema_changes._sbtest1_new OK.
Altering new table...
ALTER TABLE `schema_changes`.`_sbtest1_new` add index idx_test_Ins(test_Ins)
Altered `schema_changes`.`_sbtest1_new` OK.
Not creating triggers because this is a dry run.
Not copying rows because this is a dry run.
INSERT LOW_PRIORITY IGNORE INTO `schema_changes`.`_sbtest1_new` (`id`, `k`, `c`, `pad`, `test_ins`) SELECT `id`, `k`, `c`, `pad`, `test_ins` FROM `schema_changes`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) AND ((`id` <= ?)) LOCK IN SHARE MODE /*pt-online-schema-change 9253 copy nibble*/
SELECT /*!40001 SQL_NO_CACHE */ `id` FROM `schema_changes`.`sbtest1` FORCE INDEX(`PRIMARY`) WHERE ((`id` >= ?)) ORDER BY `id` LIMIT ?, 2 /*next chunk boundary*/
Not swapping tables because this is a dry run.
Not dropping old table because this is a dry run.
Not dropping triggers because this is a dry run.
DROP TRIGGER IF EXISTS `schema_changes`.`pt_osc_schema_changes_sbtest1_del`
DROP TRIGGER IF EXISTS `schema_changes`.`pt_osc_schema_changes_sbtest1_upd`
DROP TRIGGER IF EXISTS `schema_changes`.`pt_osc_schema_changes_sbtest1_ins`
2020-09-30T08:31:17 Dropping new table...
DROP TABLE IF EXISTS `schema_changes`.`_sbtest1_new`;
2020-09-30T08:31:17 Dropped new table OK.

Pt-osc provides several options to perform the effective operations.  You can control the connections, active threads, load, chunk size etc ..

For Galera, we have the option “–max-flow-ctrl”. The option will check the average time cluster spent pausing for FC and make the tool pause if it goes over the percentage indicated in the option. By default, the tool will not check the FC.

[root@pxc81 log]# less /bin/pt-online-schema-change  | grep -i pausing
         print STDERR "Pausing because PXC Flow Control is active\n";
         print STDERR "Pausing because "

To make the schema changes on FOREIGN KEY tables, I would suggest using the “alter-foreign-keys-method = rebuild_constraints”. This helps to maintain the consistency of the schema and its relations. In this approach, before dropping the old table, it runs ALTER on all the child tables to drop existing FK and re-add new FK constraints that points to the columns from the new table. Again, adding and dropping the FOREIGN KEY will be the direct ALTER using TOI.

Schema changes with “gh-ost”

Gh-ost is doing a similar approach like “pt-osc”. It also helps to do the non-blocking ALTERs on all cluster nodes in one shot. The main difference is gh-ost is triggerless. Gh-ost uses the binary log to track the changes. So you need the following variables and thresholds to perform the gh-ost operation. 


The flow will be like,

  • Creates gh-ost table with the required modifications
  • Copy the records from the original table to the new table using chunk operation.
  • Apply the live changes by reading the DML events from binary logs.
  • Once the binary log events are applied, it will swap the tables ( original –> _old, gh-ost –> original ) and drop the old table.


[root@pxc81 schema_changes]# gh-ost --alter="add index idx_test_Inss(test_Ins)" --database=schema_changes --table=sbtest1 --user=root --password=Jesus@7sakthI --allow-on-master --execute
[2020/09/30 09:40:56] [info] binlogsyncer.go:133 create BinlogSyncer with config {99999 mysql 3306 root    false false <nil> false UTC true 0 0s 0s 0 false}
[2020/09/30 09:40:56] [info] binlogsyncer.go:354 begin to sync binlog from position (binlog.000027, 196850993)
[2020/09/30 09:40:56] [info] binlogsyncer.go:203 register slave for master server
[2020/09/30 09:40:56] [info] binlogsyncer.go:723 rotate to (binlog.000027, 196850993)
# Migrating `schema_changes`.`sbtest1`; Ghost table is `schema_changes`.`_sbtest1_gho`
# Migrating pxc81:3306; inspecting pxc81:3306; executing on pxc81
# Migration started at Wed Sep 30 09:40:56 +0000 2020
# chunk-size: 1000; max-lag-millis: 1500ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle 
# Serving on unix socket: /tmp/gh-ost.schema_changes.sbtest1.sock
Copy: 0/6563240 0.0%; Applied: 0; Backlog: 0/1000; Time: 0s(total), 0s(copy); streamer: binlog.000027:196853401; Lag: 0.02s, State: migrating; ETA: N/A
Copy: 0/6563240 0.0%; Applied: 0; Backlog: 0/1000; Time: 1s(total), 1s(copy); streamer: binlog.000027:196858195; Lag: 0.01s, State: migrating; ETA: N/A
Copy: 22000/6563240 0.3%; Applied: 0; Backlog: 0/1000; Time: 2s(total), 2s(copy); streamer: binlog.000027:201067135; Lag: 0.01s, State: migrating; ETA: 9m58s


Copy: 5682000/6563240 86.6%; Applied: 0; Backlog: 0/1000; Time: 16m10s(total), 16m10s(copy); streamer: binlog.000028:213168607; Lag: 0.01s, State: migrating; ETA: 2m30s
Copy: 6563000/6563240 100.0%; Applied: 0; Backlog: 0/1000; Time: 20m20s(total), 20m20s(copy); streamer: binlog.000028:382677405; Lag: 0.01s, State: migrating; ETA: 0s

Gh-ost also provides several options to perform effective operations.  You can control the connections, active threads, load, chunk size, etc.

But unfortunately, “–max-flow-ctl” option is not available in gh-ost. 


So, finally, I would say,

  • Always use the direct ALTER with TOI for the metadata changes and INSTANT ALTERs.
  • Use pt-online-schema-change with TOI and use the optimal flow control thresholds for InnoDB tables.
  • Schedule pt-online-schema-change operation in off-peak hours for FOREIGN KEY tables.
  • If you use RSU, never forget that you need to execute the ALTER on all nodes individually and you should block the WRITEs for that particular table. Make sure, your Gcache size is good enough to hold the writesets.
  • If you are concerned with triggers, you can use the gh-ost to make the ALTERs.

ProxySQL 2.0.7 and proxysql-admin Tool Now Available


ProxySQLProxySQL 2.0.7, released by ProxySQL, is now available for download in the Percona Repository along with Percona’s proxysql-admin tool.

ProxySQL is a high-performance proxy, currently for MySQL and database servers in the MySQL ecosystem (like Percona Server for MySQL and MariaDB). It acts as an intermediary for client requests seeking resources from the database. René Cannaò created ProxySQL for DBAs as a means of solving complex replication topology issues.

This release includes ProxySQL 2.0.7 which introduces many new features and enhancements and also fixes a number of bugs.

The proxysql-admin tool now supports MariaDB 10.4.

New Features


  • PSQLADM-195: A new option --with-stats-reset has been added to the proxysql-status script to display the  *_reset tables from the stats database. If this option is not specified, these tables are not displayed by default.

Bugs Fixed

  • PSQLADM-157: In some cases, the proxysql-status script used the cat command to display a file without checking if the file existed and was readable.
  • PSQLADM-181: When run with --update-cluster --write-node=<node_name>, the proxysql-admin tool now verifies that the writer nodes are not read-only.

The ProxySQL 2.0.7 source and binary packages available from the Percona download page for ProxySQL include ProxySQL Admin – a tool developed by Percona to configure Percona XtraDB Cluster nodes into ProxySQL. Docker images for release 2.0.7 are available as well. You can download the original ProxySQL from GitHub. GitHub hosts the documentation in the wiki format.

ProxySQL is available under Open Source license GPLv3.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com