Recently, I was working on a very unfortunate case that revolved around diverging clusters, data loss, missed errors in the logs, and forced commands on Percona XtraDB Cluster (PXC). Even though PXC tries its best to explain what happens in the error log, I can vouch that those messages are easy to miss or overlook when you do not know what to expect.
This blog post is a cautionary tale and an invitation to try it yourself and break stuff (not in production, right?).
TLDR:
Do you know right away what happened when seeing this log?
2023-06-22T08:23:29.003334Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:group_post_state_exchange():433: Reversing history: 171 -> 44, this member has applied 127 more events than the primary component.Data loss is possible. Must abort.
Demonstration
Using the great https://github.com/datacharmer/dbdeployer:
$ dbdeployer deploy replication --topology=pxc --sandbox-binary=~/opt/pxc 8.0.31
Let’s write some data:
$ ./sandboxes/pxc_msb_8_0_31/sysbench oltp_read_write --tables=2 --table-size=1000 prepare
Then let’s suppose someone wants to restart node 1. For some reason, they read somewhere in your internal documentation that they should bootstrap in that situation. With dbdeployer, this will translate to:
$ ./sandboxes/pxc_msb_8_0_31/node1/stop
stop /home/yoann-lc/sandboxes/pxc_msb_8_0_31/node1
$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-new-cluster
......................................................................................................^C
It fails, as it should.
In reality, those bootstrap mistakes happen in homemade start scripts, puppet or ansible modules, or even internal procedures applied in the wrong situation.
Why did it fail? First error to notice:
2023-06-22T08:00:48.322148Z 0 [ERROR] [MY-000000] [Galera] It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
Reminder: Bootstrap should only be used when every node has been double-checked to be down; it’s a manual operation. It fails here because it was not forced and because this node was not the last to be stopped in the cluster.
Good reflex: Connect to the other MySQL nodes and check the ‘wsrep_cluster_size’ and ‘wsrep_cluster_status’ statuses before doing anything:
mysql> show global status where variable_name IN ('wsrep_local_state','wsrep_local_state_comment','wsrep_local_commits','wsrep_received','wsrep_cluster_size','wsrep_cluster_status','wsrep_connected');
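With the sandboxes from this post, the same check can be run against every node in one go. A minimal sketch using dbdeployer's use wrappers; adjust the paths and node names to your environment:
$ for node in node1 node2 node3; do echo "== $node =="; ./sandboxes/pxc_msb_8_0_31/$node/use -e "show global status where variable_name in ('wsrep_cluster_size','wsrep_cluster_status','wsrep_local_state_comment')"; done
All healthy members should agree on the cluster size and report a ‘Primary’ cluster status.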
Do not: Blindly apply what this log is telling you to do.
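If the whole cluster genuinely were down and a bootstrap really were needed, a safer path is to inspect grastate.dat on every node first and pick the right one. A minimal sketch, assuming the dbdeployer sandbox layout used in this post:
$ for node in node1 node2 node3; do echo "== $node =="; cat ./sandboxes/pxc_msb_8_0_31/$node/data/grastate.dat; done
The node showing safe_to_bootstrap: 1 (or, failing that, the highest seqno; a seqno of -1 means an unclean shutdown, and the position would need to be recovered with --wsrep-recover) is the one to bootstrap from.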
But we are here to “fix” around and find out, so let’s bootstrap.
$ sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' ./sandboxes/pxc_msb_8_0_31/node1/data/grastate.dat
$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-new-cluster
.. sandbox server started
At this point, notice that from node1, you have:
$ ./sandboxes/pxc_msb_8_0_31/node1/use -e "show global status where variable_name in ('wsrep_cluster_status', 'wsrep_cluster_size')"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 1       |
| wsrep_cluster_status | Primary |
+----------------------+---------+
But from node2 and node3 you will have:
$ ./sandboxes/pxc_msb_8_0_31/node2/use -e "show global status where variable_name in ('wsrep_cluster_status', 'wsrep_cluster_size')"
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 2       |
| wsrep_cluster_status | Primary |
+----------------------+---------+
Looks fishy. But does your monitoring really alert you to this?
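If your monitoring does not cover this, even a crude check comparing each node's view against the expected membership would catch it. A hypothetical sketch (it assumes the sandbox's use wrapper passes extra mysql client options such as -N -B through to the client, as it does for -e above):
# hypothetical check: every node of a 3-node cluster should agree on the size and be Primary
expected=3
for node in node1 node2 node3; do
  size=$(./sandboxes/pxc_msb_8_0_31/$node/use -N -B -e "show global status like 'wsrep_cluster_size'" | awk '{print $2}')
  status=$(./sandboxes/pxc_msb_8_0_31/$node/use -N -B -e "show global status like 'wsrep_cluster_status'" | awk '{print $2}')
  [ "$size" = "$expected" ] && [ "$status" = "Primary" ] || echo "ALERT: $node sees cluster_size=$size cluster_status=$status"
done
In the state above, node1 reports a size of 1 while node2 and node3 report 2: nobody sees all three members, which is exactly the kind of disagreement worth alerting on.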
Let’s write some more data, obviously on node1, because why not? It looks healthy.
$ ./sandboxes/pxc_msb_8_0_31/node1/sysbench oltp_delete --tables=2 --table-size=1000 --events=127 run
That 127 will be useful later on.
Nightmare ensues
Fast forward a few days. You are still writing to this node. Then a new reason to restart node1 comes up; maybe you want to apply a parameter change.
$ ./sandboxes/pxc_msb_8_0_31/node1/restart .............................................................................................................................................................^C
It fails?
Reviewing logs, you would find:
$ less sandboxes/pxc_msb_8_0_31/node1/data/msandbox.err
...
2023-06-22T08:23:29.003334Z 0 [ERROR] [MY-000000] [Galera] gcs/src/gcs_group.cpp:group_post_state_exchange():433: Reversing history: 171 -> 44, this member has applied 127 more events than the primary component.Data loss is possible. Must abort.
...
Voilà, we find our “127” again.
Good reflex: It depends. Recovering from this would need a post of its own, but that is a serious problem.
Do not: Force SST on this node. It would work, but every piece of data inserted on node1 since the forced bootstrap would be lost.
What does it mean?
When forcing bootstrap, a node will always start. It won’t ever try to join the other nodes, even if they are healthy. The other nodes won’t try to connect to the bootstrapped node either; from their point of view, it simply never joined, so it’s not part of the cluster.
When the previously bootstrapped node1 is restarted in non-bootstrap mode, it is the first time in a while that all the nodes see each other.
Each time a transaction is committed, it is replicated along with a sequence number (seqno), an ever-growing number. Nodes use it to determine whether an incremental state transfer is possible, or whether a node’s state is coherent with the others.
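On a running node, the last applied seqno is exposed as the wsrep_last_committed status variable; on a cleanly stopped node it is persisted in grastate.dat. A quick way to compare positions across the sandbox nodes (a sketch, with the same assumption as before that the use wrapper passes -N -B through):
$ for node in node1 node2 node3; do echo -n "$node: "; ./sandboxes/pxc_msb_8_0_31/$node/use -N -B -e "show global status like 'wsrep_last_committed'" | awk '{print $2}'; done
Run before node1’s restart, this would have shown node1 roughly 127 ahead of node2 and node3 — the same gap reported in the “Reversing history: 171 -> 44” message.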
Now that node1 is no longer in bootstrap mode, it connects to the other members and shares its state (last known primary component members and seqno). The other nodes correctly pick up that this seqno looks suspicious: it is higher than their own, meaning the joining node could have applied more transactions than they did. It could also mean it comes from an entirely different cluster.
Because the nodes are in doubt, nothing happens. Node1 is denied joining and will not do anything: it won’t try to resynchronize automatically, and it won’t touch its data. Node2 and node3 are not impacted; they are kept as they are too.
How to proceed from there depends on the situation; there are no general guidelines. Ideally, a source of truth should be found. If both clusters applied writes, that is the toughest situation to be in: a split-brain.
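Whatever ends up being the source of truth, it is worth preserving the diverged node’s extra writes before any resynchronization. A hypothetical sketch for this sandbox: start node1 isolated (wsrep_provider=none keeps it from trying to join anything) and take a logical dump through dbdeployer’s ./my wrapper around mysqldump; any other way you normally back up that node works just as well:
$ ./sandboxes/pxc_msb_8_0_31/node1/start --wsrep-provider=none
$ ./sandboxes/pxc_msb_8_0_31/node1/my sqldump --all-databases --triggers --routines --events > node1_diverged.sql
$ ./sandboxes/pxc_msb_8_0_31/node1/stop
With that dump on the side, the extra transactions can be reviewed or replayed later, even if node1 is eventually rejoined through SST and loses its local data.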
Note: seqnos are just numbers. Equal seqnos do not actually guarantee that the underlying applied transactions are identical, but the comparison is still useful as a simple sanity check. If we were to mess around even more and apply 127 transactions on node2, or even modify the seqno manually in grastate.dat, we could get “interesting” results. Try it out (not in production, mind you)!
Note: If you are unfamiliar with bootstrapping and how to properly recover a cluster, check out the documentation.
Conclusion
Bootstrap is a last-resort procedure; don’t force it lightly. Do not force SST right away if a node refuses to join, either. Always check the error log first.
Fortunately, PXC does not blindly let any node join without some sanity checks.