Jul
02
2018
--

Fixing ER_MASTER_HAS_PURGED_REQUIRED_GTIDS when pointing a slave to a different master

gtid auto position

gtid auto positionGTID replication has made it convenient to setup and maintain MySQL replication. You need not worry about binary log file and position thanks to GTID and auto-positioning. However, things can go wrong when pointing a slave to a different master. Consider a situation where the new master has executed transactions that haven’t been executed on the old master. If the corresponding binary logs have been purged already, how do you point the slave to the new master?

The scenario

Based on technical requirements and architectural change, there is a need to point the slave to a different master by

  1. Pointing it to another node in a PXC cluster
  2. Pointing it to another master in master/master replication
  3. Pointing it to another slave of a master
  4. Pointing it to the slave of a slave of the master … and so on and so forth.

Theoretically, pointing to a new master with GTID replication is easy. All you have to do is run:

STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='new_master_ip';
START SLAVE;
SHOW SLAVE STATUS\G

Alas, in some cases, replication breaks due to missing binary logs:

*************************** 1. row ***************************
Slave_IO_State:
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
**redacted**
Slave_IO_Running: No
Slave_SQL_Running: Yes
** redacted **
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
Master_Info_File: /var/lib/mysql/master.info
** redacted **
Last_IO_Error_Timestamp: 180613 08:08:20
Last_SQL_Error_Timestamp:
** redacted **
Retrieved_Gtid_Set:
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3
Auto_Position: 1

The strange issue here is that if you point the slave back to the old master, replication works just fine. The error says that there are missing binary logs in the new master that the slave needs. If there’s no problem with replication performance and the slave can easily catch up, then it looks like there are transactions executed in the new master that have not been executed in the old master but are recorded in the missing binary logs. The binary logs are most likely lost due to manually purging with PURGE BINARY LOGS or automatic purging if expire_logs_days is set.

At this point, it would be prudent to check and sync old master and new master with tools such as pt-table-checksum and pt-table-sync. However, if a consistency check has been performed and no differences have been found, or there’s confidence that the new master is a good copy—such as another node in the PXC cluster—you can follow the steps below to resolve the problem.

Solution

To solve the problem, the slave needs to execute the missing transactions. But since these transactions have been purged, the steps below provide the workaround.

Step 1 Find the GTID sequences that are purged from the new master that is needed by the slave

To identify which GTID sequences are missing, run SHOW GLOBAL VARIABLES LIKE 'gtid_purged'; and SHOW MASTER STATUS; on the new master and SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave:

New Master:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_purged';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_purged | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-2,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
| mysql-bin.000004 | 741 | | | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Slave:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+------------------------------------------+
| Variable_name | Value |
+---------------+------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3 |
+---------------+------------------------------------------+
1 row in set (0.00 sec)

Take note that 1904cf31-912b-ee17-4906-7dae335b4bfc and 1904cf31-912b-ee17-4906-7dae335b4bfc are UUIDs and refer to the MySQL instance where the transaction originated from.

Based on the output:

  • The slave has executed 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3
  • The new master has executed 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6 and 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
  • The new master has purged 1904cf31-912b-ee17-4906-7dae335b4bfc:1-2 and 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11

This means that the slave has no issue with 1904cf31-912b-ee17-4906-7dae335b4bfc it requires sequences 4-6 and sequences 3-6 are still available in the master. However, the slave cannot fetch sequences 1-11 from 4998aaaa-6ed5-11e8-948c-0242ac120007 because these has been purged from the master.

To summarize, the missing GTID sequences are 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11.

Step 2: Identify where the purged GTID sequences came from

From the SHOW SLAVE STATUS output in the introduction section, it says that the Master_UUID is 4998aaaa-6ed5-11e8-948c-0242ac120007, which means the new master is the source of the missing transactions. You can also verify the new Master’s UUID by running SHOW GLOBAL VARIABLES LIKE 'server_uuid';

mysql> SHOW GLOBAL VARIABLES LIKE 'server_uuid';
+---------------+--------------------------------------+
| Variable_name | Value |
+---------------+--------------------------------------+
| server_uuid | 4998aaaa-6ed5-11e8-948c-0242ac120007 |
+---------------+--------------------------------------+
1 row in set (0.00 sec)

If the new master’s UUID does not match the missing GTID, it is most likely that this missing sequence came from its old master, another master higher up the chain or from another PXC node. If that other master still exists, you can run the same query on those masters to check.

The missing sequences are small such as 1-11. Typically, commands executed locally are due to performing maintenance on this server directly. For example, creating users, fixing privileges or updating passwords. However, you have no guarantee that this is the reason, since the binary logs have already been purged. If you still want to point the slave to the new master, proceed to step 3 or step 4.

Step 3. Injecting the missing transactions on the slave with empty transactions

The workaround is to pretend that those missing GTID sequences have been executed on the slave by injecting 11 empty transactions as instructed here by running:

SET GTID_NEXT='UUID:SEQUENCE_NO';
BEGIN;COMMIT;
SET GTID_NEXT='AUTOMATIC';

It looks tedious, but a simple script can automate this:

cat empty_transaction_generator.sh
#!/bin/bash
uuid=$1
first_sequence_no=$2
last_sequence_no=$3
while [ "$first_sequence_no" -le "$last_sequence_no" ]
do
echo "SET GTID_NEXT='$uuid:$first_sequence_no';"
echo "BEGIN;COMMIT;"
first_sequence_no=`expr $first_sequence_no + 1`
done
echo "SET GTID_NEXT='AUTOMATIC';"
bash empty_transaction_generator.sh 4998aaaa-6ed5-11e8-948c-0242ac120007 1 11
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:1';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:2';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:3';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:4';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:5';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:6';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:7';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:8';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:9';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:10';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:11';
BEGIN;COMMIT;
SET GTID_NEXT='AUTOMATIC';

Before executing the generated output on the slave, stop replication first:

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:1';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:2';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:3';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:4';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:5';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:6';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:7';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:8';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:9';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:10';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:11';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)

There’s also an even easier solution of injecting empty transactions by using mysqlslavetrx from MySQL utilities. By stopping the slave first and running
mysqlslavetrx --gtid-set=4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 --slaves=root:password@:3306 you will achieve the same result as above.

By running SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave you can see that sequences 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 have been executed already:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Resume replication and check if replication is healthy by running START SLAVE; and SHOW SLAVE STATUS\G

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
** redacted **
Seconds_Behind_Master: 0
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
** redacted **
Retrieved_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:4-6
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
Auto_Position: 1
** redacted **
1 row in set (0.00 sec)

At this point, we have already solved the problem. However, there’s another way to restore the slave much faster but at the cost of erasing all the existing binary logs on the slave as mentioned in this article. If you want to do this, proceed to step 4.

Step 4. Add the missing sequences to GTID_EXECUTED by modifying GTID_PURGED.

CRITICAL NOTE:
If you followed the steps in Step 3, you do not need to perform Step 4!

To add the missing transactions, you’ll need to stop the slave, reset the master, place the original value of gtid_executed and the missing sequences in gtid_purged variable. A word of caution on using this method: this will purge the existing binary logs of the slave.

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.02 sec)
mysql> RESET MASTER;
Query OK, 0 rows affected (0.02 sec)
mysql> SET GLOBAL gtid_purged="1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,4998aaaa-6ed5-11e8-948c-0242ac120007:1-11";
Query OK, 0 rows affected (0.02 sec)

Similar to Step 3, running SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave shows that sequence 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 has been executed already:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Run START SLAVE; and SHOW SLAVE STATUS\G to resume replication and check if replication is healthy:

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
** redacted **
Seconds_Behind_Master: 0
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
** redacted **
Retrieved_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:4-6
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
Auto_Position: 1
** redacted **
1 row in set (0.00 sec)

Step 5. Done

Summary

In this article, I demonstrated how to point the slave to a new master even if it’s missing some binary logs that need to be executed. Although, it is possible to do so with the workarounds shared above, it is prudent to check the consistency of the old and new master first before switching the slave to the new master.

The post Fixing ER_MASTER_HAS_PURGED_REQUIRED_GTIDS when pointing a slave to a different master appeared first on Percona Database Performance Blog.

Mar
13
2018
--

The Multi-Source GTID Replication Maze

Multi-Source GTID Replication

In this blog post, we’ll look at how to navigate some of the complexities of multi-source GTID replication.

GTID replication is often a real challenge for DBAs, especially if this has to do with multi-source GTID replication. A while back, I came across a really interesting customer environment with shards where multi-master, multi-source, multi-threaded MySQL 5.6 MIXED replication was active. This is a highly complex environment that has both pros and cons, introducing risks as a trade-off for specific customer requirements.

This is the set up of part of this environment:

I started looking into this setup when a statement broke replication between db1 and db10. Replication broke due to a statement executed on a schema that was not present on db10. This also resulted in changes originating from db1 to not being pushed down to db100 as db10, as we stopped the replication thread (for db1 channel).

On the other hand, replication was not stopped on db2 because the schema in question was present on db2. Replication between db2 and db20 was broken as well because the schema was not present in db20.

In order to fix db1->db10 replication, four GTID sets were injected in db10.

Here are some interesting blog posts regarding how to handle/fix GTID replication issues:

After injecting the GTID sets, we started replication again and everything ran fine.

 

After that, we had to check the db2->db20 replication, which, as I’ve already said, was broken as well. In this case, injecting only the first GTID trx into db20 instead of all of those causing issues on db10 was enough!

You may wonder how this is possible. Right? The answer is that the rest of them were replicated from db10 to db20, although the channel was not the same.

Another strange thing is the fact that although the replication thread for the db2->db20 channel was stopped (broken), checking the slave status on db20 showed that Executed_Gtid_Set was moving for all channels even though Retrieved_Gtid_Set for the broken one was stopped! So what was happening there?

This raised my curiosity, so I decided to do some further investigation and created scenarios regarding other strange things that could happen. An interesting one was about the replication filters. In our case, I thought “What would happen in the following scenario … ?”

Let’s say we write a row from db1 to db123.table789. This row is replicated to db10 (let’s say using channel 1) and to db2 (let’s say using channel2). On channel 1, we filter out the db123.% tables, on channel2 we don’t. db1 writes the row and the entry to the binary log. db2 writes the row after reading the entry from the binary log and subsequently writes the entry to its own binary log and replicates this change to db20. This change is also replicated to db10. So now, on db10 (depending on which channel finds the GTID first) it either gets filtered on channel1 and written to its own bin log at just startcommit with any actual DDL/DML removed, or if it is read first on channel2 (db1->db2 and then db20->db10) then it is NOT filtered out and executed instead. Is this correct? It definitely ISN’T!

Points of interest

You can find answers to the above questions in the points of interest listed below. Although it’s not really clear through the official documentation, this is what happens with GTID replication and multi-source GTID replication:

  • As we know GTID sets are unique across all nodes in a given cluster. In multi-source replication, Executed_Gtid_Set is common for all channels. This means that regardless the originating channel, when a GTID transaction is executed it is recorded in all channels’ Executed_Gtid_Set. Although it’s logical (each database is unique, so if a trx is going to affect a database it shouldn’t be tightened to a single channel regardless of the channel it uses), the documentation doesn’t provide much info around this.
  • When we have multi-source, multi-level replication, there are cases where the GTID sets originating from one master can end up on one slave via different replication paths. It’s not clear if it applies any special algorithm (although it doesn’t seem that there could be one), but the preferred method seems to be FIFO. The fastest wins! This means that GTID sets can travel to the slave via different channels, and it’s related to how fast the upper-level slaves can commit changes. In fact, the path doesn’t really matter as it only executes each GTID trx once.
  • Replication filters are global regardless the channel. This means they apply each filter to all channels. This is normal as we can’t define a replication filter per channel. In order to be able to debug such cases, adding a small replication delay per channel seems a good idea.
Jun
27
2014
--

Failover with the MySQL Utilities – Part 1: mysqlrpladmin

MySQL Utilities are a set of tools provided by Oracle to perform many kinds of administrative tasks. When GTID-replication is enabled, 2 tools can be used for slave promotion: mysqlrpladmin and mysqlfailover. We will review mysqlrpladmin (version 1.4.3) in this post.

Summary

  • mysqlrpladmin can perform manual failover/switchover when GTID-replication is enabled.
  • You need to have your servers configured with --master-info-repository = TABLE or to add the --rpl-user option for the tool to work properly.
  • The check for errant transactions is failing in the current GA version (1.4.3) so be extra careful when using it or watch bug #73110 to see when a fix is committed.
  • There are some limitations, for instance the inability to pre-configure the list of slaves in a configuration file or the inability to check that the tool will work well without actually doing a failover or switchover.

Failover vs switchover

mysqlrpladmin can help you promote a slave to be the new master when the master goes down and then automate replication reconfiguration after this slave promotion. There are 2 separate scenarios: unplanned promotion (failover) and planned promotion (switchover). Beyond the words, it has implications on the way you have to execute the tool.

Setup for this test

To test the tool, our setup will be a master with 2 slaves, all using GTID replication. mysqlrpladmin can show us the current replication topology with the health command:

$ mysqlrpladmin --master=root@localhost:13001 --discover-slaves-login=root health
# Discovering slaves for master at localhost:13001
# Discovering slave at localhost:13002
# Found slave: localhost:13002
# Discovering slave at localhost:13003
# Found slave: localhost:13003
# Checking privileges.
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+
# ...done.

As you can see, we have to specify how to connect to the master (no surprise) but instead of listing all the slaves, we can let the tool discover them.

Simple failover scenario

What will the tool do when performing failover? Essentially we will give it the list of slaves and the list of candidates and it will:

  • Run a few sanity checks
  • Elect a candidate to be the new master
  • Make the candidate as up-to-date as possible by making it a slave of all the other slaves
  • Configure replication on all the other slaves to make them replicate from the new master

After killing -9 the master, let’s try failover:

$ mysqlrpladmin --slaves=root:@localhost:13002,root:@localhost:13003 --candidates=root@localhost:13002 failover

This time, the master is down so the tool has no way to automatically discover the slaves. Thus we have to specify them with the --slaves option.

However we get an error:

# Checking privileges.
# Checking privileges on candidates.
ERROR: You must specify either the --rpl-user or set all slaves to use --master-info-repository=TABLE.

The error message is clear, but it would have been nice to have such details when running the health command (maybe a warning instead of an error). That would allow you to check beforehand that the tool can run smoothly rather than to discover in the middle of an emergency that you have to look at the documentation to find which option is missing.

Let’s choose to specify the replication user:

$ mysqlrpladmin --slaves=root:@localhost:13002,root:@localhost:13003 --candidates=root@localhost:13002 --rpl-user=repl:repl failover
# Checking privileges.
# Checking privileges on candidates.
# Performing failover.
# Candidate slave localhost:13002 will become the new master.
# Checking slaves status (before failover).
# Preparing candidate for failover.
# Creating replication user if it does not exist.
# Stopping slaves.
# Performing STOP on all slaves.
# Switching slaves to new master.
# Disconnecting new master as slave.
# Starting slaves.
# Performing START on all slaves.
# Checking slaves for errors.
# Failover complete.
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13002  | MASTER  | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+
# ...done.

Simple switchover scenario

Let’s now restart the old master and configure it as a slave of the new master (by the way, this can be done with mysqlreplicate, another tool from the MySQL Utilities). If we want to promote the old master, we can run:

$ mysqlrpladmin --master=root@localhost:13002 --new-master=root@localhost:13001 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet switchover
# Discovering slave at localhost:13001
# Found slave: localhost:13001
# Discovering slave at localhost:13003
# Found slave: localhost:13003
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Notice that the master is available in this case so we can use the discover-slaves-login option. Also notice that we can tune the verbosity of the tool by using --quiet or --verbose or even log the output in a file with --log.

We also used --demote-master to make the old master a slave of the new master. Without this option, the old master will be isolated from the other nodes.

Extension points

In general doing switchover/failover at the database level is one thing but informing the other components of the application that something has changed is most often necessary for the application to keep on working correctly.

This is where the extension points are handy: you can execute a script before switchover/failover with --exec-before and after switchover/failover with --exec-after.

For instance with these simple scripts:

# cat /usr/local/bin/check_before
#!/bin/bash
/usr/local/mysql5619/bin/mysql -uroot -S /tmp/node1.sock -Ee 'SHOW SLAVE STATUS' > /tmp/before
# cat /usr/local/bin/check_after
#!/bin/bash
/usr/local/mysql5619/bin/mysql -uroot -S /tmp/node1.sock -Ee 'SHOW SLAVE STATUS' > /tmp/after

We can execute:

$ mysqlrpladmin --master=root@localhost:13001 --new-master=root@localhost:13002 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet --exec-before=/usr/local/bin/check_before --exec-after=/usr/local/bin/check_after switchover

And looking the /tmp/before and /tmp/after, we can see that our scripts have been executed:

# cat /tmp/before
# cat /tmp/after
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: localhost
                  Master_User: repl
                  Master_Port: 13002
[...]

If the external script does not seem to work, using –verbose can be useful to diagnose the issue.

What about errant transactions?

We already mentioned that errant transactions can create lots of issues when a new master is promoted in a cluster running GTIDs. So the question is: how mysqlrpladmin behaves when there is an errant transaction?

Let’s create an errant transaction:

# On localhost:13003
mysql> CREATE DATABASE test2;
mysql> FLUSH LOGS;
mysql> SHOW BINARY LOGS;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000001 |     69309 |
| mysql-bin.000002 |   1237667 |
| mysql-bin.000003 |       617 |
| mysql-bin.000004 |       231 |
+------------------+-----------+
mysql> PURGE BINARY LOGS TO 'mysql-bin.000004';

and let’s try to promote localhost:13003 as the new master:

$ mysqlrpladmin --master=root@localhost:13001 --new-master=root@localhost:13003 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet switchover
[...]
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| host       | port   | role    | state  | gtid_mode  | health                                                                                                                                                                                                                                                                                              |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| localhost  | 13003  | MASTER  | UP     | ON         | OK                                                                                                                                                                                                                                                                                                  |
| localhost  | 13001  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  |
| localhost  | 13002  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Oops! Although it is suggested by the documentation, the tool does not check errant transactions. This is a major issue as you cannot run failover/switchover reliably with GTID replication if errant transactions are not correctly detected.

The documentation suggests errant transactions should be checked and a quick look at the code confirms that, but it does not work! So it has been reported.

Some limitations

Apart from the missing errant transaction check, I also noticed a few limitations:

  • You cannot use a configuration file listing all the slaves. This becomes boring once you have a large amount of slaves. In such a case, you should write a wrapper script around mysqlrpladmin to generate the right command for you
  • The slave election process is either automatic or it relies on the order of the servers given in the --candidates option. This is not very sophisticated.
  • It would be useful to have a –dry-run mode which would validate that everything is configured correctly but without actually failing/switching over. This is something MHA does for instance.

Conclusion

mysqlrpladmin is a very good tool to help you perform manual failover/switchover in a cluster using GTID replication. The main caveat at this point is the failing check for errant transactions, which requires a lot of care before executing the tool.

The post Failover with the MySQL Utilities – Part 1: mysqlrpladmin appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com