Jun
27
2014
--

Failover with the MySQL Utilities – Part 1: mysqlrpladmin

MySQL Utilities are a set of tools provided by Oracle to perform many kinds of administrative tasks. When GTID-replication is enabled, 2 tools can be used for slave promotion: mysqlrpladmin and mysqlfailover. We will review mysqlrpladmin (version 1.4.3) in this post.

Summary

  • mysqlrpladmin can perform manual failover/switchover when GTID-replication is enabled.
  • You need to have your servers configured with --master-info-repository = TABLE or to add the --rpl-user option for the tool to work properly.
  • The check for errant transactions is failing in the current GA version (1.4.3) so be extra careful when using it or watch bug #73110 to see when a fix is committed.
  • There are some limitations, for instance the inability to pre-configure the list of slaves in a configuration file or the inability to check that the tool will work well without actually doing a failover or switchover.

Failover vs switchover

mysqlrpladmin can help you promote a slave to be the new master when the master goes down and then automate replication reconfiguration after this slave promotion. There are 2 separate scenarios: unplanned promotion (failover) and planned promotion (switchover). Beyond the words, it has implications on the way you have to execute the tool.

Setup for this test

To test the tool, our setup will be a master with 2 slaves, all using GTID replication. mysqlrpladmin can show us the current replication topology with the health command:

$ mysqlrpladmin --master=root@localhost:13001 --discover-slaves-login=root health
# Discovering slaves for master at localhost:13001
# Discovering slave at localhost:13002
# Found slave: localhost:13002
# Discovering slave at localhost:13003
# Found slave: localhost:13003
# Checking privileges.
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+
# ...done.

As you can see, we have to specify how to connect to the master (no surprise) but instead of listing all the slaves, we can let the tool discover them.

Simple failover scenario

What will the tool do when performing failover? Essentially we will give it the list of slaves and the list of candidates and it will:

  • Run a few sanity checks
  • Elect a candidate to be the new master
  • Make the candidate as up-to-date as possible by making it a slave of all the other slaves
  • Configure replication on all the other slaves to make them replicate from the new master

After killing -9 the master, let’s try failover:

$ mysqlrpladmin --slaves=root:@localhost:13002,root:@localhost:13003 --candidates=root@localhost:13002 failover

This time, the master is down so the tool has no way to automatically discover the slaves. Thus we have to specify them with the --slaves option.

However we get an error:

# Checking privileges.
# Checking privileges on candidates.
ERROR: You must specify either the --rpl-user or set all slaves to use --master-info-repository=TABLE.

The error message is clear, but it would have been nice to have such details when running the health command (maybe a warning instead of an error). That would allow you to check beforehand that the tool can run smoothly rather than to discover in the middle of an emergency that you have to look at the documentation to find which option is missing.

Let’s choose to specify the replication user:

$ mysqlrpladmin --slaves=root:@localhost:13002,root:@localhost:13003 --candidates=root@localhost:13002 --rpl-user=repl:repl failover
# Checking privileges.
# Checking privileges on candidates.
# Performing failover.
# Candidate slave localhost:13002 will become the new master.
# Checking slaves status (before failover).
# Preparing candidate for failover.
# Creating replication user if it does not exist.
# Stopping slaves.
# Performing STOP on all slaves.
# Switching slaves to new master.
# Disconnecting new master as slave.
# Starting slaves.
# Performing START on all slaves.
# Checking slaves for errors.
# Failover complete.
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13002  | MASTER  | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+
# ...done.

Simple switchover scenario

Let’s now restart the old master and configure it as a slave of the new master (by the way, this can be done with mysqlreplicate, another tool from the MySQL Utilities). If we want to promote the old master, we can run:

$ mysqlrpladmin --master=root@localhost:13002 --new-master=root@localhost:13001 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet switchover
# Discovering slave at localhost:13001
# Found slave: localhost:13001
# Discovering slave at localhost:13003
# Found slave: localhost:13003
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Notice that the master is available in this case so we can use the discover-slaves-login option. Also notice that we can tune the verbosity of the tool by using --quiet or --verbose or even log the output in a file with --log.

We also used --demote-master to make the old master a slave of the new master. Without this option, the old master will be isolated from the other nodes.

Extension points

In general doing switchover/failover at the database level is one thing but informing the other components of the application that something has changed is most often necessary for the application to keep on working correctly.

This is where the extension points are handy: you can execute a script before switchover/failover with --exec-before and after switchover/failover with --exec-after.

For instance with these simple scripts:

# cat /usr/local/bin/check_before
#!/bin/bash
/usr/local/mysql5619/bin/mysql -uroot -S /tmp/node1.sock -Ee 'SHOW SLAVE STATUS' > /tmp/before
# cat /usr/local/bin/check_after
#!/bin/bash
/usr/local/mysql5619/bin/mysql -uroot -S /tmp/node1.sock -Ee 'SHOW SLAVE STATUS' > /tmp/after

We can execute:

$ mysqlrpladmin --master=root@localhost:13001 --new-master=root@localhost:13002 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet --exec-before=/usr/local/bin/check_before --exec-after=/usr/local/bin/check_after switchover

And looking the /tmp/before and /tmp/after, we can see that our scripts have been executed:

# cat /tmp/before
# cat /tmp/after
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: localhost
                  Master_User: repl
                  Master_Port: 13002
[...]

If the external script does not seem to work, using –verbose can be useful to diagnose the issue.

What about errant transactions?

We already mentioned that errant transactions can create lots of issues when a new master is promoted in a cluster running GTIDs. So the question is: how mysqlrpladmin behaves when there is an errant transaction?

Let’s create an errant transaction:

# On localhost:13003
mysql> CREATE DATABASE test2;
mysql> FLUSH LOGS;
mysql> SHOW BINARY LOGS;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000001 |     69309 |
| mysql-bin.000002 |   1237667 |
| mysql-bin.000003 |       617 |
| mysql-bin.000004 |       231 |
+------------------+-----------+
mysql> PURGE BINARY LOGS TO 'mysql-bin.000004';

and let’s try to promote localhost:13003 as the new master:

$ mysqlrpladmin --master=root@localhost:13001 --new-master=root@localhost:13003 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --quiet switchover
[...]
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| host       | port   | role    | state  | gtid_mode  | health                                                                                                                                                                                                                                                                                              |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| localhost  | 13003  | MASTER  | UP     | ON         | OK                                                                                                                                                                                                                                                                                                  |
| localhost  | 13001  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  |
| localhost  | 13002  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Oops! Although it is suggested by the documentation, the tool does not check errant transactions. This is a major issue as you cannot run failover/switchover reliably with GTID replication if errant transactions are not correctly detected.

The documentation suggests errant transactions should be checked and a quick look at the code confirms that, but it does not work! So it has been reported.

Some limitations

Apart from the missing errant transaction check, I also noticed a few limitations:

  • You cannot use a configuration file listing all the slaves. This becomes boring once you have a large amount of slaves. In such a case, you should write a wrapper script around mysqlrpladmin to generate the right command for you
  • The slave election process is either automatic or it relies on the order of the servers given in the --candidates option. This is not very sophisticated.
  • It would be useful to have a –dry-run mode which would validate that everything is configured correctly but without actually failing/switching over. This is something MHA does for instance.

Conclusion

mysqlrpladmin is a very good tool to help you perform manual failover/switchover in a cluster using GTID replication. The main caveat at this point is the failing check for errant transactions, which requires a lot of care before executing the tool.

The post Failover with the MySQL Utilities – Part 1: mysqlrpladmin appeared first on MySQL Performance Blog.

May
18
2014
--

Errant transactions: Major hurdle for GTID-based failover in MySQL 5.6

I have previously written about the new replication protocol that comes with GTIDs in MySQL 5.6. Because of this new replication protocol, you can inadvertently create errant transactions that may turn any failover to a nightmare. Let’s see the problems and the potential solutions.

In short

  • Errant transactions may cause all kinds of data corruption/replication errors when failing over.
  • Detection of errant transactions can be done with the GTID_SUBSET() and GTID_SUBTRACT() functions.
  • If you find an errant transaction on one server, commit an empty transaction with the GTID of the errant one on all other servers.
  • If you are using a tool to perform the failover for you, make sure it can detect errant transactions. At the time of writing, only mysqlfailover and mysqlrpladmin from MySQL Utilities can do that.

What are errant transactions?

Simply stated, they are transactions executed directly on a slave. Thus they only exist on a specific slave. This could result from a mistake (the application wrote to a slave instead of writing to the master) or this could be by design (you need additional tables for reports).

Why can they create problems that did not exist before GTIDs?

Errant transactions have been existing forever. However because of the new replication protocol for GTID-based replication, they can have a significant impact on all servers if a slave holding an errant transaction is promoted as the new master.

Compare what happens in this master-slave setup, first with position-based replication and then with GTID-based replication. A is the master, B is the slave:

# POSITION-BASED REPLICATION
# Creating an errant transaction on B
mysql> create database mydb;
# Make B the master, and A the slave
# What are the databases on A now?
mysql> show databases like 'mydb';
Empty set (0.01 sec)

As expected, the mydb database is not created on A.

# GTID-BASED REPLICATION
# Creating an errant transaction on B
mysql> create database mydb;
# Make B the master, and A the slave
# What are the databases on A now?
mysql> show databases like 'mydb';
+-----------------+
| Database (mydb) |
+-----------------+
| mydb            |
+-----------------+

mydb has been recreated on A because of the new replication protocol: when A connects to B, they exchange their own set of executed GTIDs and the master (B) sends any missing transaction. Here it is the create database statement.

As you can see, the main issue with errant transactions is that when failing over you may execute transactions ‘coming from nowhere’ that can silently corrupt your data or break replication.

How to detect them?

If the master is running, it is quite easy with the GTID_SUBSET() function. As all writes should go to the master, the GTIDs executed on any slave should always be a subset of the GTIDs executed on the master. For instance:

# Master
mysql> show master statusG
*************************** 1. row ***************************
             File: mysql-bin.000017
         Position: 376
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-30,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7
# Slave
mysql> show slave statusG
[...]
Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-29,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9
# Now, let's compare the 2 sets
mysql> > select gtid_subset('8e349184-bc14-11e3-8d4c-0800272864ba:1-29,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9','8e349184-bc14-11e3-8d4c-0800272864ba:1-30,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as slave_is_subset;
+-----------------+
| slave_is_subset |
+-----------------+
|               0 |
+-----------------+

Hum, it looks like the slave has executed more transactions than the master, this indicates that the slave has executed at least 1 errant transaction. Could we know the GTID of these transactions? Sure, let’s use GTID_SUBTRACT():

select gtid_subtract('8e349184-bc14-11e3-8d4c-0800272864ba:1-29,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9','8e349184-bc14-11e3-8d4c-0800272864ba:1-30,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as errant_transactions;
+------------------------------------------+
| errant_transactions                      |
+------------------------------------------+
| 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9 |
+------------------------------------------+

This means that the slave has 2 errant transactions.

Now, how can we check errant transactions if the master is not running (like master has crashed, and we want to fail over to one of the slaves)? In this case, we will have to follow these steps:

  • Check all slaves to see if they have executed transactions that are not found on any other slave: this is the list of potential errant transactions.
  • Discard all transactions originating from the master: now you have the list of errant transactions of each slave

Some of you may wonder how you can know which transactions come from the master as it is not available: SHOW SLAVE STATUS gives you the master’s UUID which is used in the GTIDs of all transactions coming from the master.

How to get rid of them?

This is pretty easy, but it can be tedious if you have many slaves: just inject an empty transaction on all the other servers with the GTID of the errant transaction.

For instance, if you have 3 servers, A (the master), B (slave with an errant transaction: XXX:3), and C (slave with 2 errant transactions: YYY:18-19), you will have to inject the following empty transactions in pseudo-code:

# A
- Inject empty trx(XXX:3)
- Inject empty trx(YYY:18)
- Inject empty trx(YYY:19)
# B
- Inject empty trx(YYY:18)
- Inject empty trx(YYY:19)
# C
- Inject empty trx(XXX:3)

Conclusion

If you want to switch to GTID-based replication, make sure to check errant transactions before any planned or unplanned replication topology change. And be specifically careful if you use a tool that reconfigures replication for you: at the time of writing, only mysqlrpladmin and mysqlfailover from MySQL Utilities can warn you if you are trying to perform an unsafe topology change.

The post Errant transactions: Major hurdle for GTID-based failover in MySQL 5.6 appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com