This blog post focuses on failover and recovery scenarios inside an InnoDB Cluster and ClusterSet environment. To learn more about deploying these topologies, you can refer to the manuals: InnoDB Cluster and InnoDB ClusterSet setup.
In the snippet below, we have two clusters (cluster1 and cluster2) connected via an asynchronous replication channel; together they are known as a ClusterSet topology. We will use this topology in all of the following scenarios.
MySQL 127.0.0.1:3308 ssl JS > myclusterset.status({extended: 1})
{
"clusters": {
"Cluster1": {
"clusterRole": "PRIMARY",
"globalStatus": "OK",
"primary": "127.0.0.1:3308",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"127.0.0.1:3308": {
"address": "127.0.0.1:3308",
"memberRole": "PRIMARY",
"mode": "R/W",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "SECONDARY",
"mode": "R/O",
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "R/O",
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"status": "ONLINE",
"version": "8.0.31"
}
},
"transactionSet": "39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,b3a79c40-3909-11ee-805d-5254004d77d3:1-124,b3a7a659-3909-11ee-805d-5254004d77d3:1-5"
},
"cluster2": {
"clusterRole": "REPLICA",
"clusterSetReplication": {
"applierStatus": "APPLIED_ALL",
"applierThreadState": "Waiting for an event from Coordinator",
"applierWorkerThreads": 4,
"receiver": "127.0.0.1:3311",
"receiverStatus": "ON",
"receiverThreadState": "Waiting for source to send event",
"source": "127.0.0.1:3308"
},
"clusterSetReplicationStatus": "OK",
"globalStatus": "OK",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"127.0.0.1:3311": {
"address": "127.0.0.1:3311",
"memberRole": "PRIMARY",
"mode": "R/O",
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3312": {
"address": "127.0.0.1:3312",
"memberRole": "SECONDARY",
"mode": "R/O",
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3313": {
"address": "127.0.0.1:3313",
"memberRole": "SECONDARY",
"mode": "R/O",
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"status": "ONLINE",
"version": "8.0.31"
}
},
"transactionSet": "39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,3db995e4-390a-11ee-b678-5254004d77d3:1-5,b3a79c40-3909-11ee-805d-5254004d77d3:1-124,b3a7a659-3909-11ee-805d-5254004d77d3:1-5",
"transactionSetConsistencyStatus": "OK",
"transactionSetErrantGtidSet": "",
"transactionSetMissingGtidSet": ""
}
},
"domainName": "firstclusterset",
"globalPrimaryInstance": "127.0.0.1:3308",
"metadataServer": "127.0.0.1:3308",
"primaryCluster": "Cluster1",
"status": "HEALTHY",
"statusText": "All Clusters available."
}
How failover happens inside a single InnoDB Cluster
- Connect to any node of the first cluster ("cluster1") via MySQL Shell and fetch the details.
MySQL 127.0.0.1:3308 ssl JS > \c root@localhost:3308
MySQL localhost:3308 ssl JS > cluster1=dba.getCluster()
MySQL localhost:3308 ssl JS > cluster1.status()
{
"clusterName": "Cluster1",
"clusterRole": "PRIMARY",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:3308",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"127.0.0.1:3308": {
"address": "127.0.0.1:3308",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}
},
"topologyMode": "Single-Primary"
},
"domainName": "firstclusterset",
"groupInformationSourceMember": "127.0.0.1:3308"
}
- Now, perform a primary switchover from the instance ("127.0.0.1:3308") to the instance ("127.0.0.1:3309").
MySQL localhost:3308 ssl JS > cluster1.setPrimaryInstance("root@127.0.0.1:3309")
Output:
Setting instance '127.0.0.1:3309' as the primary instance of cluster 'Cluster1'...
Instance '127.0.0.1:3308' was switched from PRIMARY to SECONDARY.
Instance '127.0.0.1:3309' was switched from SECONDARY to PRIMARY.
Instance '127.0.0.1:3310' remains SECONDARY.
The instance '127.0.0.1:3309' was successfully elected as primary.
- Finally, the instance ("127.0.0.1:3309") now shows the PRIMARY role.
MySQL localhost:3308 ssl JS > cluster1.status()
...
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
...
How to rejoin the lost instance again
If an instance leaves the cluster or loses its connection and cannot rejoin automatically, we may need to rejoin it manually by issuing the "Cluster.rejoinInstance(instance)" command. Below is a small example demonstrating the usage of this command.
- Create a blocker by stopping Group Replication on the instance ("127.0.0.1:3310").
MySQL localhost:3310 ssl SQL > stop group_replication;
- Looking at the information below, we can see the instance ("127.0.0.1:3310") now shows a "(MISSING)" status.
MySQL localhost:3310 ssl JS > cluster1.status()
...
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.0.31"
}
...
- Now, add the instance (“127.0.0.1:3310”) again with the rejoinInstance() command.
MySQL localhost:3310 ssl JS > cluster1.rejoinInstance('127.0.0.1:3310')
...
Validating instance configuration at 127.0.0.1:3310...
NOTE: Instance detected as a sandbox.
Please note that sandbox instances are only suitable for deploying test clusters for use within the same host.
This instance reports its own address as 127.0.0.1:3310
Instance configuration is suitable.
Rejoining instance '127.0.0.1:3310' to cluster 'Cluster1'...
Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_2558498413'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.
* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated ############################################################ 100%
The instance '127.0.0.1:3310' was successfully rejoined to the cluster.
...
After the above operation, the instance is "ONLINE" again.
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}
How to recover a cluster from a quorum or vote loss
There are situations where running instances fail and a cluster loses its quorum (the ability to vote) because not enough members remain. This can happen when a majority of the instances that make up the cluster fail and can no longer vote on Group Replication operations. When a cluster loses quorum, it can no longer process write transactions or change the cluster's topology.
However, if at least one instance containing the latest InnoDB Cluster metadata is still online, it is possible to restore the cluster's quorum.
Let's see how we can use "forceQuorumUsingPartitionOf()" to recover the cluster from the remaining members.
- First, we will fail a majority of the nodes (the instances on ports 3308 and 3310) with a simple "kill" of their mysqld processes.
root 30281 1 1 07:10 ? 00:02:03 /root/mysql-sandboxes/3308/bin/mysqld --defaults-file=/root/mysql-sandboxes/3308/my.cnf --user=root
root 30788 1 1 07:14 ? 00:01:53 /root/mysql-sandboxes/3309/bin/mysqld --defaults-file=/root/mysql-sandboxes/3309/my.cnf --user=root
root 30912 1 2 07:14 ? 00:02:48 /root/mysql-sandboxes/3310/bin/mysqld --defaults-file=/root/mysql-sandboxes/3310/my.cnf --user=root
[root@localhost ~]# kill -9 30281 30912
- Now check the cluster1 status again. The cluster has lost quorum, and no write activity is allowed.
MySQL localhost:3309 ssl JS > cluster1.status()
{
"clusterName": "Cluster1",
"clusterRole": "PRIMARY",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:3309",
"ssl": "REQUIRED",
"status": "NO_QUORUM",
"statusText": "Cluster has no quorum as visible from '127.0.0.1:3309' and cannot process write transactions. 2 members are not active.",
"topology": {
"127.0.0.1:3308": {
"address": "127.0.0.1:3308",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3308': Can't connect to MySQL server on '127.0.0.1:3308' (111)",
"status": "UNREACHABLE",
"version": "8.0.31"
},
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "PRIMARY",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"memberState": "(MISSING)",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3310': Can't connect to MySQL server on '127.0.0.1:3310' (111)",
"status": "UNREACHABLE",
"version": "8.0.31"
}
},
"topologyMode": "Single-Primary"
},
"domainName": "firstclusterset",
"groupInformationSourceMember": "127.0.0.1:3309"
}
- To fix this situation, we can connect to the available instance ("127.0.0.1:3309") and restore the quorum with the below command.
MySQL localhost:3309 ssl JS > cluster1.forceQuorumUsingPartitionOf("root@localhost:3309")
...
Restoring cluster 'Cluster1' from loss of quorum, by using the partition composed of [127.0.0.1:3309]
Restoring the InnoDB cluster ...
The InnoDB cluster was successfully restored using the partition from the instance 'root@localhost:3309'.
WARNING: To avoid a split-brain scenario, ensure that all other members of the cluster are removed or joined back to the group that was restored.
...
- The instances on ports 3308 and 3310 are still down. However, we have recovered the cluster quorum with the single primary member ("127.0.0.1:3309"), so we can now perform write activity on the cluster without any issues.
MySQL localhost:3309 ssl JS > cluster1.status()
{
"clusterName": "Cluster1",
"clusterRole": "PRIMARY",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:3309",
"ssl": "REQUIRED",
"status": "OK_NO_TOLERANCE_PARTIAL",
"statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.",
"topology": {
"127.0.0.1:3308": {
"address": "127.0.0.1:3308",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3308': Can't connect to MySQL server on '127.0.0.1:3308' (111)",
"status": "(MISSING)"
},
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "n/a",
"readReplicas": {},
"role": "HA",
"shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3310': Can't connect to MySQL server on '127.0.0.1:3310' (111)",
"status": "(MISSING)"
}
},
"topologyMode": "Single-Primary"
},
"domainName": "firstclusterset",
"groupInformationSourceMember": "127.0.0.1:3309"
}
MySQL localhost:3309 ssl SQL > create database sbtest2;
Query OK, 1 row affected (0.0055 sec)
Later on, we can fix the failed instances so they join cluster1 again. Sometimes we might need to perform the "rejoinInstance()" operation to add the nodes back, as sketched below.
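A minimal sketch of that recovery, assuming the sandboxes from this post (ports 3308 and 3310) and the same cluster1 object: restart the stopped sandbox instances and, if they still show as "(MISSING)" in the status output, rejoin them explicitly. Depending on the Shell/server version, restarted members may rejoin automatically.
MySQL localhost:3309 ssl JS > dba.startSandboxInstance(3308)
MySQL localhost:3309 ssl JS > dba.startSandboxInstance(3310)
MySQL localhost:3309 ssl JS > cluster1.rejoinInstance('127.0.0.1:3308')
MySQL localhost:3309 ssl JS > cluster1.rejoinInstance('127.0.0.1:3310')
MySQL localhost:3309 ssl JS > cluster1.status()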
How to recover a complete cluster from a major outage
Sometimes, even when all the nodes are up, internal issues such as Group Replication being stuck or a networking problem can cause a complete outage and leave you unable to perform any writes or other activity on the cluster.
In such circumstances, you can use any one node and its metadata to recover the cluster. You should connect to the most up-to-date instance; otherwise, you may lose data or end up with inconsistencies between the cluster nodes.
Let’s see how we can introduce such a situation manually and then try to fix that.
- First, stop the group replication on all three instances.
MySQL localhost:3308 ssl SQL > stop group_replication;
MySQL localhost:3310 ssl SQL > stop group_replication;
MySQL localhost:3309 ssl SQL > stop group_replication;
"127.0.0.1:3308": {
"address": "127.0.0.1:3308",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.0.31"
},
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.0.31"
},
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"instanceErrors": [
"NOTE: group_replication is stopped."
],
"memberRole": "SECONDARY",
"memberState": "OFFLINE",
"mode": "R/O",
"readReplicas": {},
"role": "HA",
"status": "(MISSING)",
"version": "8.0.31"
}
Now we are completely stuck; even if we try to perform any writes on the last primary member, we see the below error, since "super_read_only" is enabled to protect the transactions on the database.
MySQL localhost:3309 ssl SQL > create database sbtest3;
ERROR: 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
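To confirm the flag on the last primary, a simple check can be run (super_read_only and read_only are standard server variables, not specific to InnoDB Cluster):
MySQL localhost:3309 ssl SQL > SELECT @@global.super_read_only, @@global.read_only;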
- Now, we will fix the issue with the below set of steps.
1. Connect to the instance with the most executed GTIDs. We can confirm this by connecting to each instance and comparing the GTIDs with the below commands.
mysql> SELECT @@GLOBAL.GTID_EXECUTED;
mysql> SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";
2. In our case, the GTID set is the same on all three instances, so we can choose any of the nodes. Let's choose the last primary ("127.0.0.1:3309").
MySQL localhost:3309 ssl SQL > SELECT @@GLOBAL.GTID_EXECUTED;
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,
b3a79c40-3909-11ee-805d-5254004d77d3:1-152:1000071-1000079,
b3a7a659-3909-11ee-805d-5254004d77d3:1-12 |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
MySQL localhost:3309 ssl SQL > SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| received_transaction_set |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,
b3a79c40-3909-11ee-805d-5254004d77d3:1-152:1000071-1000079,
b3a7a659-3909-11ee-805d-5254004d77d3:1-12 |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
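If the GTID sets had differed between the nodes, the built-in GTID_SUBSET() function can help confirm which member is the most up to date; a hypothetical check, where '<gtid_set_from_other_node>' stands for the set copied from another member:
MySQL localhost:3309 ssl SQL > SELECT GTID_SUBSET('<gtid_set_from_other_node>', @@GLOBAL.GTID_EXECUTED) AS other_is_subset;
A result of 1 means all of the other node's transactions are already contained in this node's executed set, so this node is at least as up to date.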
3. Finally, connect to the instance (“127.0.0.1:3309”) and execute the below command to recover from the outage/failure.
MySQL localhost:3309 ssl JS > \c root@localhost:3309
MySQL localhost:3309 ssl JS > mycluster1 = dba.rebootClusterFromCompleteOutage("cluster1",{force: true})
...
Restoring the Cluster 'Cluster1' from complete outage...
Cluster instances: '127.0.0.1:3308' (OFFLINE), '127.0.0.1:3309' (OFFLINE), '127.0.0.1:3310' (OFFLINE)
Validating instance configuration at localhost:3309...
This instance reports its own address as 127.0.0.1:3309
Instance configuration is suitable.
* Waiting for seed instance to become ONLINE...
127.0.0.1:3309 was restored.
Validating instance configuration at 127.0.0.1:3308...
This instance reports its own address as 127.0.0.1:3308
Instance configuration is suitable.
Rejoining instance '127.0.0.1:3308' to cluster 'Cluster1'...
Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_3676500949'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.
* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated ############################################################ 100%
The instance '127.0.0.1:3308' was successfully rejoined to the cluster.
Validating instance configuration at 127.0.0.1:3310...
This instance reports its own address as 127.0.0.1:3310
Instance configuration is suitable.
Rejoining instance '127.0.0.1:3310' to cluster 'Cluster1'...
Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_2558498413'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.
* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated ############################################################ 100%
The instance '127.0.0.1:3310' was successfully rejoined to the cluster.
The Cluster was successfully rebooted.
<Cluster:Cluster1>
...
Note: Be careful with the force option, as it can bypass important checks such as GTID set validation and instance reachability.
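A more conservative approach (a sketch; the exact behavior depends on the Shell version) is to attempt the reboot without the force option first, and re-run with force only if the command refuses to proceed:
MySQL localhost:3309 ssl JS > mycluster1 = dba.rebootClusterFromCompleteOutage("cluster1")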
Now, if we check the status again, we can see that all three nodes are up. The recovery command fixed the problem (Group Replication was started) and rejoined the instances.
MySQL localhost:3309 ssl JS > cluster1=dba.getCluster()
MySQL localhost:3309 ssl JS > cluster1.status()
{
"clusterName": "Cluster1",
"clusterRole": "PRIMARY",
"defaultReplicaSet": {
"name": "default",
"primary": "127.0.0.1:3309",
"ssl": "REQUIRED",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"127.0.0.1:3308": {
"address": "127.0.0.1:3308",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
},
"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}
},
"topologyMode": "Single-Primary"
},
"domainName": "firstclusterset",
"groupInformationSourceMember": "127.0.0.1:3309"
}
How to perform switchover/failover from one cluster to another in a ClusterSet
Sometimes, we need to perform maintenance and upgrade activities on the database instances. To avoid long downtime or a major impact on production, we perform a controlled switchover so the node in question can be taken offline without impacting the running workload.
Inside a ClusterSet, we can do this by switching the clusterRole between the running clusters, promoting the "REPLICA" cluster to "PRIMARY".
Let’s see the exact scenario below.
- Fetch the ClusterSet information.
MySQL 127.0.0.1:3308 ssl JS > myclusterset=dba.getClusterSet()
<ClusterSet:firstclusterset>
MySQL 127.0.0.1:3308 ssl JS > myclusterset.status({extended: 1})
...
"clusters": {
"Cluster1": {
"clusterRole": "PRIMARY",
"globalStatus": "OK",
"primary": "127.0.0.1:3309",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"cluster2": {
"clusterRole": "REPLICA",
"clusterSetReplication": {
"applierStatus": "APPLIED_ALL",
"applierThreadState": "Waiting for an event from Coordinator",
"applierWorkerThreads": 4,
"receiver": "127.0.0.1:3311",
"receiverStatus": "ON",
"receiverThreadState": "Waiting for source to send event",
"source": "127.0.0.1:3309"
...
- Performing a switchover from “cluster1” to “cluster2”.
MySQL 127.0.0.1:3308 ssl JS > myclusterset.setPrimaryCluster('cluster2')
...
Switching the primary cluster of the clusterset to 'cluster2'
* Verifying clusterset status
** Checking cluster Cluster1
Cluster 'Cluster1' is available
** Checking cluster cluster2
Cluster 'cluster2' is available
* Reconciling 5 internally generated GTIDs
* Refreshing replication account of demoted cluster
* Synchronizing transaction backlog at 127.0.0.1:3311
** Transactions replicated ############################################################ 100%
* Updating metadata
* Updating topology
** Changing replication source of 127.0.0.1:3308 to 127.0.0.1:3311
** Changing replication source of 127.0.0.1:3310 to 127.0.0.1:3311
** Changing replication source of 127.0.0.1:3309 to 127.0.0.1:3311
* Acquiring locks in replicaset instances
** Pre-synchronizing SECONDARIES
** Acquiring global lock at PRIMARY
** Acquiring global lock at SECONDARIES
* Synchronizing remaining transactions at promoted primary
** Transactions replicated ############################################################ 100%
* Updating replica clusters
Cluster 'cluster2' was promoted to PRIMARY of the clusterset. The PRIMARY instance is '127.0.0.1:3311'
...
- Now, if we check the status again, we can observe the change in clusterRole.
"clusters": {
"Cluster1": {
"clusterRole": "REPLICA",
"clusterSetReplication": {
"applierStatus": "APPLIED_ALL",
"applierThreadState": "Waiting for an event from Coordinator",
"applierWorkerThreads": 4,
"receiver": "127.0.0.1:3309",
"receiverStatus": "ON",
"receiverThreadState": "Waiting for source to send event",
"source": "127.0.0.1:3311"
},
"cluster2": {
"clusterRole": "PRIMARY",
"globalStatus": "OK",
"primary": "127.0.0.1:3311",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
"topology": {
"127.0.0.1:3311": {
"address": "127.0.0.1:3311",
"memberRole": "PRIMARY",
"mode": "R/W",
"status": "ONLINE",
"version": "8.0.31"
},
In some unfortunate situations, when one of the clusters ("cluster1") is completely down and we have no other choice, we might need to perform an emergency failover. The below command can be handy in those cases.
myclusterset.forcePrimaryCluster("cluster2")
If you are using the MySQL Router interface for routing traffic in the ClusterSet, you also need to change the routing option with "setRoutingOption()" so that traffic is diverted to the new primary cluster ("cluster2"); otherwise, the application will fail and be unable to communicate.
In the next scenario, we will see the usage of “setRoutingOption” in more detail.
How to change traffic routes with MySQLRouter
In the case of Cluster:
Within a single cluster such as "cluster1", writes are routed to the primary and reads are distributed across the secondaries in a balanced order. The Router automatically detects failovers and follows the new primary.
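As a quick illustration, assuming the Router runs locally with the default ports shown later in this post (6446 for read/write, 6447 for read-only), the connections below would land on the primary and on one of the secondaries, respectively:
[root@localhost ~]# mysql -h 127.0.0.1 -P 6446 -uroot -p -e "SELECT @@port;"
[root@localhost ~]# mysql -h 127.0.0.1 -P 6447 -uroot -p -e "SELECT @@port;"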
In the case of ClusterSet:
We can switch the primary routing target from "cluster1" to "cluster2" with the help of the below steps.
- Verifying the current configurations.
MySQL localhost:3309 ssl JS > myclusterset=dba.getClusterSet()
MySQL localhost:3309 ssl JS > myclusterset.listRouters()
{
"domainName": "firstclusterset",
"routers": {
"localhost.localdomain::Router1": {
"hostname": "localhost.localdomain",
"lastCheckIn": "2023-08-12 10:13:22",
"roPort": "6447",
"roXPort": "6449",
"rwPort": "6446",
"rwXPort": "6448",
"targetCluster": null,
"version": "8.0.31"
}
}
}
MySQL localhost:3309 ssl JS > myclusterset.routingOptions()
Output:
{
"domainName": "firstclusterset",
"global": {
"invalidated_cluster_policy": "drop_all",
"stats_updates_frequency": 0,
"target_cluster": "primary"
},
"routers": {
"localhost.localdomain::Router1": {}
}
}
- Now, switching the target cluster from “cluster1” to “cluster2”.
MySQL localhost:3309 ssl JS > myclusterset.setRoutingOption('localhost.localdomain::Router1', 'target_cluster', 'cluster2')
Routing option 'target_cluster' successfully updated in router 'localhost.localdomain::Router1'.
- Checking the status again, we can see the target cluster is now changed to “cluster2”.
MySQL localhost:3309 ssl JS > myclusterset.routingOptions()
{
"domainName": "firstclusterset",
"global": {
"invalidated_cluster_policy": "drop_all",
"stats_updates_frequency": 0,
"target_cluster": "primary"
},
"routers": {
"localhost.localdomain::Router1": {
"target_cluster": "cluster2"
}
}
}
How to rejoin the lost cluster again in the ClusterSet
There are situations where a cluster has been marked as invalidated, or there is some issue with its replication channel. To fix this, we might need to perform the "rejoinCluster()" operation to add the cluster back to the ClusterSet.
We can simulate this with the below steps.
- Stop the asynchronous replication channel on the primary instance ("127.0.0.1:3311") of cluster2, which replicates from the instance ("127.0.0.1:3309") of cluster1.
MySQL localhost:3311 ssl SQL > stop slave;
...
Slave_IO_State:
Master_Host: 127.0.0.1
Master_User: mysql_innodb_cs_63c324c0
Master_Port: 3309
Connect_Retry: 3
Master_Log_File: localhost-bin.000005
Read_Master_Log_Pos: 2248
Relay_Log_File: localhost-relay-bin-clusterset_replication.000002
Relay_Log_Pos: 432
Relay_Master_Log_File: localhost-bin.000005
Slave_IO_Running: No
Slave_SQL_Running: No
...
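Note that STOP SLAVE is deprecated; on MySQL 8.0.22 and later, the equivalent statement, scoped to just the ClusterSet channel visible in the relay log name above, would be:
MySQL localhost:3311 ssl SQL > STOP REPLICA FOR CHANNEL 'clusterset_replication';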
- If we check the status again, we see the replication is stopped now on the cluster2 instance.
MySQL localhost:3311 ssl JS > myclusterset.status({extended:1})
"clusterRole": "REPLICA",
"clusterSetReplication": {
"applierStatus": "OFF",
"applierThreadState": "",
"applierWorkerThreads": 4,
"receiver": "127.0.0.1:3311",
"receiverStatus": "OFF",
"receiverThreadState": "",
"source": "127.0.0.1:3309"
},
"clusterSetReplicationStatus": "STOPPED",
"globalStatus": "OK_NOT_REPLICATING",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
- Now, we will fix this by running the below rejoin command.
MySQL localhost:3311 ssl JS > myclusterset.rejoinCluster('cluster2')
...
Rejoining cluster 'cluster2' to the clusterset
NOTE: Cluster 'cluster2' is not invalidated
* Refreshing replication settings
** Changing replication source of 127.0.0.1:3312 to 127.0.0.1:3309
** Changing replication source of 127.0.0.1:3313 to 127.0.0.1:3309
** Changing replication source of 127.0.0.1:3311 to 127.0.0.1:3309
Cluster 'cluster2' was rejoined to the clusterset
...
After the rejoin, the stopped replication channel started automatically and synced with the primary cluster.
cluster2":
"clusterRole": "REPLICA",
"clusterSetReplication": {
"applierStatus": "APPLIED_ALL",
"applierThreadState": "Waiting for an event from Coordinator",
"applierWorkerThreads": 4,
"receiver": "127.0.0.1:3311",
"receiverStatus": "ON",
"receiverThreadState": "Waiting for source to send event",
"source": "127.0.0.1:3309"
},
"clusterSetReplicationStatus": "OK",
"globalStatus": "OK",
"status": "OK",
"statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",
Summary
Here, we have discussed a few scenarios and methods that can be very useful for recovering cluster nodes and performing manual failovers in the topology. The options discussed above apply to both InnoDB Cluster and InnoDB ClusterSet environments.
Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!
Try Percona Distribution for MySQL today!