Apr
21
2023
--

Fixing Errant GTID With Orchestrator: The Easy Way Out

Fixing Errant GTI With Orchestrator

In this article, we will discuss errant Transaction /GTID and how we can solve them with the Orchestrator tool.

Orchestrator is a MySQL high availability and replication management tool that runs as a service and provides command line access, HTTP API, and Web interface. I will not go into the details of the Orchestrator but will explore one of the features that can help us solve the errant GTID in a replication topology.

What are errant transactions?

Simply stated, they are transactions executed directly on a replica. Thus they only exist on a specific replica. This could result from a mistake (the application wrote to a replica instead of writing to the source) or by design (you need additional tables for reports).

What problem can errant transactions cause?

The major problem it causes during a planned change in a MySQL replication topology is that the transaction is not present in the binlog and hence cannot be sent over to the replica, which causes a replication error.

So let’s jump into generating and fixing an errant transaction. Below is my current topology:

[root@monitor ~]# orchestrator-client -c topology-tabulated -alias testcluster | tr '|' 't'
192.168.56.10:3306  0s ok 5.7.41-44-log rw ROW GTID
+ 192.168.56.20:3306 0s ok 5.7.41-44-log ro ROW GTID
+ 192.168.56.30:3306 0s ok 5.7.41-44-log ro ROW GTID

Now let’s make some changes on any of the replicas, which will generate an errant transaction. On 192.168.56.20:3306, I created a test database:

mysql> create database test;
Query OK, 1 row affected (0.00 sec)

This will result in an errant transaction, so let’s see how the Orchestrator will show the topology.

[root@monitor ~]# orchestrator-client -c topology-tabulated -alias testcluster | tr '|' 't'
192.168.56.10:3306   0s ok 5.7.41-44-log  rw ROW   GTID
+ 192.168.56.20:3306 0s ok 5.7.41-44-log  ro ROW   GTID:errant
+ 192.168.56.30:3306 0s ok 5.7.41-44-log  ro ROW   GTID

Now you can see we have an errant transaction; we can check in more detail by using the Orchestrator API as below:

[root@monitor ~]# orchestrator-client -c which-gtid-errant -i 192.168.56.20:3306
A71a855a-dcdc-11ed-99d7-080027e6334b:1

To know which binlogs have this errant transaction, you check with the below command:

[root@monitor ~]# orchestrator-client -c locate-gtid-errant -i 192.168.56.20:3306
mysqlbinlog.000001

Checking the binlogs is very important. We should know what changes were made to the replica, and you can check that binlog for specific GTIDs.

We can get the output from replication analysis, and you use this API feature in your custom code in case you want to monitor the topology for errant transactions:

[root@monitor ~]# orchestrator-client -c api -path replication-analysis | jq . | grep -A2 -B2 "StructureAnalysis"
      "Analysis": "NoProblem",
      "Description": "",
      "StructureAnalysis": [
        "ErrantGTIDStructureWarning"

There is a more detailed way to compare the ExecutedGtidSet and GtidErrant on the whole topology. So let me show you below:

[root@monitor ~]# sudo orchestrator-client -c api -path cluster/testcluster | jq -C '.[] | {Port: .Key.Port, Hostname: .Key.Hostname,ServerUUID: .ServerUUID, ExecutedGtidSet: .ExecutedGtidSet, GtidErrant:.GtidErrant}'
{
  "Port": 3306,
  "Hostname": "192.168.56.10",
  "ServerUUID": "3b678bc9-dcdc-11ed-b9fc-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10",
  "GtidErrant": ""
}
{
  "Port": 3306,
  "Hostname": "192.168.56.20",
  "ServerUUID": "a71a855a-dcdc-11ed-99d7-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": "a71a855a-dcdc-11ed-99d7-080027e6334b:1"
}
{
  "Port": 3306,
  "Hostname": "192.168.56.30",
  "ServerUUID": "ea6c6af9-dcdc-11ed-9e09-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10",
  "GtidErrant": ""
}

So now we know about the issue, let’s fix it with the Orchestrator.

The first way to fix it is to inject an empty transaction, which can be done as below:

[root@monitor ~]# orchestrator-client -c gtid-errant-inject-empty -i 192.168.56.20:3306
192.168.56.20:3306

[root@monitor ~]# orchestrator-client -c topology-tabulated -alias testcluster | tr '|' 't'

192.168.56.10:3306   0s ok 5.7.41-44-log  rw ROW   GTID

+ 192.168.56.20:3306 0s ok 5.7.41-44-log  ro ROW   GTID

+ 192.168.56.30:3306 0s ok 5.7.41-44-log  ro ROW   GTID

The gtid-errant-inject-empty configuration contains settings related to injecting empty transactions to reconcile Global Transaction Identifiers (GTIDs) in a MySQL replication topology. GTIDs are a way to uniquely identify transactions in a MySQL cluster, and ensuring their consistency is critical for maintaining data integrity.

So with injecting an empty transaction, the Orchestrator will inject the empty transaction from the top, it will replicate to the bottom, and that GTID will be ignored by the replica server, which already has it. So now you can see that the gti-executed set is changed, and it contains the GTID with UUID from the replica 192.168.56.20:3306.

[root@monitor ~]# sudo orchestrator-client -c api -path cluster/testcluster | jq -C '.[] | {Port: .Key.Port, Hostname: .Key.Hostname,ServerUUID: .ServerUUID, ExecutedGtidSet: .ExecutedGtidSet, GtidErrant: .GtidErrant}'
{
  "Port": 3306,
  "Hostname": "192.168.56.10",
  "ServerUUID": "3b678bc9-dcdc-11ed-b9fc-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": ""
}
{
  "Port": 3306,
  "Hostname": "192.168.56.20",
  "ServerUUID": "a71a855a-dcdc-11ed-99d7-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": ""
}
{
  "Port": 3306,
  "Hostname": "192.168.56.30",
  "ServerUUID": "ea6c6af9-dcdc-11ed-9e09-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": ""
}

Another way to fix this is a DANGEROUS way is to reset the master.

Orchestrator has a command gtid-errant-reset-master, applied on an instance:

Then this command “fixes” errant GTID transactions via RESET MASTER; SET GLOBAL gtid_purged…

This command is, of course, destructive to the server’s binary logs. If binary logs are assumed to enable incremental restore, then this command is dangerous.

So an example to fix an errant transaction is:

[root@monitor ~]# orchestrator-client -c topology-tabulated -alias testcluster | tr '|' 't'
192.168.56.10:3306   0s ok 5.7.41-44-log  rw ROW   GTID
+ 192.168.56.20:3306 0s ok 5.7.41-44-log  ro ROW   GTID:errant
+ 192.168.56.30:3306 0s ok 5.7.41-44-log  ro ROW   GTID

[root@monitor ~]# orchestrator-client -c which-gtid-errant -i 192.168.56.20:3306
A71a855a-dcdc-11ed-99d7-080027e6334b:2

This is how it looks:

{
  "Port": 3306,
  "Hostname": "192.168.56.20",
  "ServerUUID": "a71a855a-dcdc-11ed-99d7-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1-2",
  "GtidErrant": "a71a855a-dcdc-11ed-99d7-080027e6334b:2"
}

Let’s reset the master.

[root@monitor ~]# orchestrator-client -c gtid-errant-reset-master -i 192.168.56.20:3306
192.168.56.20:3306

Now you can see that the ExecutedGtidSet is synced with the Source ExecutedGtidSet.

[root@monitor ~]# sudo orchestrator-client -c api -path cluster/testcluster | jq -C '.[] | {Port: .Key.Port, Hostname: .Key.Hostname,ServerUUID: .ServerUUID, ExecutedGtidSet: .ExecutedGtidSet, GtidErrant: .GtidErrant}'
{
  "Port": 3306,
  "Hostname": "192.168.56.10",
  "ServerUUID": "3b678bc9-dcdc-11ed-b9fc-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": ""
}
{
  "Port": 3306,
  "Hostname": "192.168.56.20",
  "ServerUUID": "a71a855a-dcdc-11ed-99d7-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": ""
}
{
  "Port": 3306,
  "Hostname": "192.168.56.30",
  "ServerUUID": "ea6c6af9-dcdc-11ed-9e09-080027e6334b",
  "ExecutedGtidSet": "3b678bc9-dcdc-11ed-b9fc-080027e6334b:1-10,na71a855a-dcdc-11ed-99d7-080027e6334b:1",
  "GtidErrant": ""
}

But this option is risky because this command actually purged the binlogs, and if any app is tailing the logs or if binary logs are assumed to enable incremental restore, then this command is dangerous and not recommended. It’s better to use gtid-errant-inject-empty, and if you still want to use gtid-errant-reset-master on a busy replica, then stop the replication first and make sure to wait for two or three minutes, then use gtid-errant-reset-master.

Conclusion

If you want to switch to GTID-based replication, make sure to check errant transactions before any planned or unplanned replication topology change. And specifically, be careful if you use a tool that reconfigures replication for you. It is always recommended to use the pt-table-checksum and pt-table-sync if you ever get this kind of situation where changes were made to the replica.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com