Jul
02
2014
--

Failover with the MySQL Utilities: Part 2 – mysqlfailover

In the previous post of this series we saw how you could use mysqlrpladmin to perform manual failover/switchover when GTID replication is enabled in MySQL 5.6. Now we will review mysqlfailover (version 1.4.3), another tool from the MySQL Utilities that can be used for automatic failover.

Summary

  • mysqlfailover can perform automatic failover if MySQL 5.6′s GTID-replication is enabled.
  • All slaves must use --master-info-repository=TABLE.
  • The monitoring node is a single point of failure: don’t forget to monitor it!
  • Detection of errant transactions works well, but you have to use the --pedantic option to make sure failover will never happen if there is an errant transaction.
  • There are a few limitations such as the inability to only fail over once, or excessive CPU utilization, but they are probably not showstoppers for most setups.

Setup

We will use the same setup as last time: one master and two slaves, all using GTID replication. We can see the topology using mysqlfailover with the health command:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root health
[...]
MySQL Replication Failover Utility
Failover Mode = auto     Next Interval = Tue Jul  1 10:01:22 2014
Master Information
------------------
Binary Log File   Position  Binlog_Do_DB  Binlog_Ignore_DB
mysql-bin.000003  700
GTID Executed Set
a9a396c6-00f3-11e4-8e66-9cebe8067a3f:1-3
Replication Health Status
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Note that --master-info-repository=TABLE needs to be configured on all slaves or the tool will exit with an error message:

2014-07-01 10:18:55 AM CRITICAL Failover requires --master-info-repository=TABLE for all slaves.
ERROR: Failover requires --master-info-repository=TABLE for all slaves.

Failover

You can use 2 commands to trigger automatic failover:

  • auto: the tool tries to find a candidate in the list of servers specified with --candidates, and if no good server is found in this list, it will look at the other slaves to see if one can be a good candidate. This is the default command
  • elect: same as auto, but if no good candidate is found in the list of candidates, other slaves will not be checked and the tool will exit with an error.

Let’s start the tool with auto:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root auto

The monitoring console is visible and is refreshed every --interval seconds (default: 15). Its output is similar to what you get when using the health command.

Then let’s kill -9 the master to see what happens once the master is detected as down:

Failed to reconnect to the master after 3 attemps.
Failover starting in 'auto' mode...
# Candidate slave localhost:13002 will become the new master.
# Checking slaves status (before failover).
# Preparing candidate for failover.
# Creating replication user if it does not exist.
# Stopping slaves.
# Performing STOP on all slaves.
# Switching slaves to new master.
# Disconnecting new master as slave.
# Starting slaves.
# Performing START on all slaves.
# Checking slaves for errors.
# Failover complete.
# Discovering slaves for master at localhost:13002
Failover console will restart in 5 seconds.
MySQL Replication Failover Utility
Failover Mode = auto     Next Interval = Tue Jul  1 10:59:47 2014
Master Information
------------------
Binary Log File   Position  Binlog_Do_DB  Binlog_Ignore_DB
mysql-bin.000005  191
GTID Executed Set
a9a396c6-00f3-11e4-8e66-9cebe8067a3f:1-3
Replication Health Status
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13002  | MASTER  | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Looks good! The tool is then ready to fail over to another slave if the new master becomes unavailable.

You can also run custom scripts at several points of execution with the --exec-before, --exec-after, --exec-fail-check, --exec-post-failover options.

However it would be great to have a --failover-and-exit option to avoid flapping: the tool would detect master failure, promote one of the slaves, reconfigure replication and then exit (this is what MHA does for instance).

Tool registration

When the tool is started, it registers itself on the master by writing a few things in the specific table:

mysql> SELECT * FROM mysql.failover_console;
+-----------+-------+
| host      | port  |
+-----------+-------+
| localhost | 13001 |
+-----------+-------+

This is nice as it avoids that you start several instances of mysqlfailover to monitor the same master. If we try, this is what we get:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root auto
[...]
Multiple instances of failover console found for master localhost:13001.
If this is an error, restart the console with --force.
Failover mode changed to 'FAIL' for this instance.
Console will start in 10 seconds..........starting Console.

With the fail command, mysqlfailover will monitor replication health and exit in the case of a master failure, without actually performing failover.

Running in the background

In all previous examples, mysqlfailover was running in the foreground. This is very good for demo, but in a production environment you are likely to prefer running it in the background. This can be done with the --daemon option:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root auto --daemon=start --log=/var/log/mysqlfailover.log

and it can be stopped with:

$ mysqlfailover --daemon=stop

Errant transactions

If we create an errant transaction on one of the slaves, it will be detected:

MySQL Replication Failover Utility
Failover Mode = auto     Next Interval = Tue Jul  1 16:29:44 2014
[...]
WARNING: Errant transaction(s) found on slave(s).
Replication Health Status
[...]

However this does not prevent failover from occurring! You have to use --pedantic:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root --pedantic auto
[...]
# WARNING: Errant transaction(s) found on slave(s).
#  - For slave 'localhost@13003': db906eee-012d-11e4-8fe1-9cebe8067a3f:1
2014-07-01 16:44:49 PM CRITICAL Errant transaction(s) found on slave(s). Note: If you want to ignore this issue, please do not use the --pedantic option.
ERROR: Errant transaction(s) found on slave(s). Note: If you want to ignore this issue, please do not use the --pedantic option.

Limitations

  • Like for mysqlrpladmin, the slave election process is not very sophisticated and it cannot be tuned.
  • The server on which mysqlfailover is running is a single point of failure.
  • Excessive CPU utilization: once it is running, mysqlfailover hogs one core. This is quite surprising.

Conclusion

mysqlfailover is a good tool to automate failover in clusters using GTID replication. It is flexible and looks reliable. Its main drawback is that there is no easy way to make it highly available itself: if mysqlfailover crashes, you will have to manually restart it.

The post Failover with the MySQL Utilities: Part 2 – mysqlfailover appeared first on MySQL Performance Blog.

May
18
2014
--

Errant transactions: Major hurdle for GTID-based failover in MySQL 5.6

I have previously written about the new replication protocol that comes with GTIDs in MySQL 5.6. Because of this new replication protocol, you can inadvertently create errant transactions that may turn any failover to a nightmare. Let’s see the problems and the potential solutions.

In short

  • Errant transactions may cause all kinds of data corruption/replication errors when failing over.
  • Detection of errant transactions can be done with the GTID_SUBSET() and GTID_SUBTRACT() functions.
  • If you find an errant transaction on one server, commit an empty transaction with the GTID of the errant one on all other servers.
  • If you are using a tool to perform the failover for you, make sure it can detect errant transactions. At the time of writing, only mysqlfailover and mysqlrpladmin from MySQL Utilities can do that.

What are errant transactions?

Simply stated, they are transactions executed directly on a slave. Thus they only exist on a specific slave. This could result from a mistake (the application wrote to a slave instead of writing to the master) or this could be by design (you need additional tables for reports).

Why can they create problems that did not exist before GTIDs?

Errant transactions have been existing forever. However because of the new replication protocol for GTID-based replication, they can have a significant impact on all servers if a slave holding an errant transaction is promoted as the new master.

Compare what happens in this master-slave setup, first with position-based replication and then with GTID-based replication. A is the master, B is the slave:

# POSITION-BASED REPLICATION
# Creating an errant transaction on B
mysql> create database mydb;
# Make B the master, and A the slave
# What are the databases on A now?
mysql> show databases like 'mydb';
Empty set (0.01 sec)

As expected, the mydb database is not created on A.

# GTID-BASED REPLICATION
# Creating an errant transaction on B
mysql> create database mydb;
# Make B the master, and A the slave
# What are the databases on A now?
mysql> show databases like 'mydb';
+-----------------+
| Database (mydb) |
+-----------------+
| mydb            |
+-----------------+

mydb has been recreated on A because of the new replication protocol: when A connects to B, they exchange their own set of executed GTIDs and the master (B) sends any missing transaction. Here it is the create database statement.

As you can see, the main issue with errant transactions is that when failing over you may execute transactions ‘coming from nowhere’ that can silently corrupt your data or break replication.

How to detect them?

If the master is running, it is quite easy with the GTID_SUBSET() function. As all writes should go to the master, the GTIDs executed on any slave should always be a subset of the GTIDs executed on the master. For instance:

# Master
mysql> show master statusG
*************************** 1. row ***************************
             File: mysql-bin.000017
         Position: 376
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-30,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7
# Slave
mysql> show slave statusG
[...]
Executed_Gtid_Set: 8e349184-bc14-11e3-8d4c-0800272864ba:1-29,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9
# Now, let's compare the 2 sets
mysql> > select gtid_subset('8e349184-bc14-11e3-8d4c-0800272864ba:1-29,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9','8e349184-bc14-11e3-8d4c-0800272864ba:1-30,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as slave_is_subset;
+-----------------+
| slave_is_subset |
+-----------------+
|               0 |
+-----------------+

Hum, it looks like the slave has executed more transactions than the master, this indicates that the slave has executed at least 1 errant transaction. Could we know the GTID of these transactions? Sure, let’s use GTID_SUBTRACT():

select gtid_subtract('8e349184-bc14-11e3-8d4c-0800272864ba:1-29,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-9','8e349184-bc14-11e3-8d4c-0800272864ba:1-30,
8e3648e4-bc14-11e3-8d4c-0800272864ba:1-7') as errant_transactions;
+------------------------------------------+
| errant_transactions                      |
+------------------------------------------+
| 8e3648e4-bc14-11e3-8d4c-0800272864ba:8-9 |
+------------------------------------------+

This means that the slave has 2 errant transactions.

Now, how can we check errant transactions if the master is not running (like master has crashed, and we want to fail over to one of the slaves)? In this case, we will have to follow these steps:

  • Check all slaves to see if they have executed transactions that are not found on any other slave: this is the list of potential errant transactions.
  • Discard all transactions originating from the master: now you have the list of errant transactions of each slave

Some of you may wonder how you can know which transactions come from the master as it is not available: SHOW SLAVE STATUS gives you the master’s UUID which is used in the GTIDs of all transactions coming from the master.

How to get rid of them?

This is pretty easy, but it can be tedious if you have many slaves: just inject an empty transaction on all the other servers with the GTID of the errant transaction.

For instance, if you have 3 servers, A (the master), B (slave with an errant transaction: XXX:3), and C (slave with 2 errant transactions: YYY:18-19), you will have to inject the following empty transactions in pseudo-code:

# A
- Inject empty trx(XXX:3)
- Inject empty trx(YYY:18)
- Inject empty trx(YYY:19)
# B
- Inject empty trx(YYY:18)
- Inject empty trx(YYY:19)
# C
- Inject empty trx(XXX:3)

Conclusion

If you want to switch to GTID-based replication, make sure to check errant transactions before any planned or unplanned replication topology change. And be specifically careful if you use a tool that reconfigures replication for you: at the time of writing, only mysqlrpladmin and mysqlfailover from MySQL Utilities can warn you if you are trying to perform an unsafe topology change.

The post Errant transactions: Major hurdle for GTID-based failover in MySQL 5.6 appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com