Jul
02
2018
--

Fixing ER_MASTER_HAS_PURGED_REQUIRED_GTIDS when pointing a slave to a different master

GTID replication has made it convenient to set up and maintain MySQL replication. You no longer need to worry about the binary log file and position, thanks to GTID and auto-positioning. However, things can go wrong when pointing a slave to a different master. Consider a situation where the new master has executed transactions that haven’t been executed on the old master. If the corresponding binary logs have been purged already, how do you point the slave to the new master?

The scenario

Based on technical requirements or architectural changes, you may need to point a slave to a different master, for example by:

  1. Pointing it to another node in a PXC cluster
  2. Pointing it to another master in master/master replication
  3. Pointing it to another slave of a master
  4. Pointing it to the slave of a slave of the master … and so on and so forth.

Theoretically, pointing to a new master with GTID replication is easy. All you have to do is run:

STOP SLAVE;
CHANGE MASTER TO MASTER_HOST='new_master_ip';
START SLAVE;
SHOW SLAVE STATUS\G

Alas, in some cases, replication breaks due to missing binary logs:

*************************** 1. row ***************************
Slave_IO_State:
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: No
Slave_SQL_Running: Yes
** redacted **
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
Master_Info_File: /var/lib/mysql/master.info
** redacted **
Last_IO_Error_Timestamp: 180613 08:08:20
Last_SQL_Error_Timestamp:
** redacted **
Retrieved_Gtid_Set:
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3
Auto_Position: 1

The strange issue here is that if you point the slave back to the old master, replication works just fine. The error says that there are binary logs missing from the new master that the slave needs. If there’s no problem with replication performance and the slave can easily catch up, then it looks like there are transactions executed on the new master that have not been executed on the old master but were recorded in the missing binary logs. The binary logs were most likely lost due to manual purging with PURGE BINARY LOGS or automatic purging if expire_logs_days is set.
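
To confirm how the binary logs disappeared, you can check the automatic purge setting and the binary logs still present on the new master (standard commands, shown here only as a quick diagnostic):

mysql> SHOW GLOBAL VARIABLES LIKE 'expire_logs_days';
mysql> SHOW BINARY LOGS;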

At this point, it would be prudent to check and sync old master and new master with tools such as pt-table-checksum and pt-table-sync. However, if a consistency check has been performed and no differences have been found, or there’s confidence that the new master is a good copy—such as another node in the PXC cluster—you can follow the steps below to resolve the problem.

Solution

To solve the problem, the slave needs to execute the missing transactions. But since these transactions have been purged, the steps below provide the workaround.

Step 1. Find the GTID sequences purged from the new master that are needed by the slave

To identify which GTID sequences are missing, run SHOW GLOBAL VARIABLES LIKE 'gtid_purged'; and SHOW MASTER STATUS; on the new master and SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave:

New Master:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_purged';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_purged | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-2,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> SHOW MASTER STATUS;
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
| mysql-bin.000004 | 741 | | | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+------------------+----------+--------------+------------------+-------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Slave:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+------------------------------------------+
| Variable_name | Value |
+---------------+------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3 |
+---------------+------------------------------------------+
1 row in set (0.00 sec)

Take note that 1904cf31-912b-ee17-4906-7dae335b4bfc and 4998aaaa-6ed5-11e8-948c-0242ac120007 are UUIDs, each referring to the MySQL instance where the transaction originated.

Based on the output:

  • The slave has executed 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3
  • The new master has executed 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6 and 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
  • The new master has purged 1904cf31-912b-ee17-4906-7dae335b4bfc:1-2 and 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11

This means that the slave has no issue with 1904cf31-912b-ee17-4906-7dae335b4bfc: it requires sequences 4-6, and sequences 3-6 are still available on the master. However, the slave cannot fetch sequences 1-11 from 4998aaaa-6ed5-11e8-948c-0242ac120007, because these have been purged from the master.

To summarize, the missing GTID sequences are 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11.
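
Rather than comparing the sets by eye, you can have MySQL compute the difference with the GTID_SUBTRACT function, passing the new master’s gtid_purged value and the slave’s gtid_executed value (a quick check using the values from this example):

mysql> SELECT GTID_SUBTRACT('1904cf31-912b-ee17-4906-7dae335b4bfc:1-2,4998aaaa-6ed5-11e8-948c-0242ac120007:1-11','1904cf31-912b-ee17-4906-7dae335b4bfc:1-3') AS missing;

This returns 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11, matching the summary above.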

Step 2. Identify where the purged GTID sequences came from

The SHOW SLAVE STATUS output in the introduction shows that Master_UUID is 4998aaaa-6ed5-11e8-948c-0242ac120007, which means the new master itself is the source of the missing transactions. You can verify the new master’s UUID by running SHOW GLOBAL VARIABLES LIKE 'server_uuid';

mysql> SHOW GLOBAL VARIABLES LIKE 'server_uuid';
+---------------+--------------------------------------+
| Variable_name | Value |
+---------------+--------------------------------------+
| server_uuid | 4998aaaa-6ed5-11e8-948c-0242ac120007 |
+---------------+--------------------------------------+
1 row in set (0.00 sec)

If the new master’s UUID does not match the missing GTID, the missing sequences most likely came from its old master, another master higher up the chain, or another PXC node. If that other master still exists, you can run the same query there to check.

In this example, the missing sequence range is small (1-11). Commands executed locally like this are typically the result of performing maintenance directly on the server, such as creating users, fixing privileges or updating passwords. However, you have no guarantee that this is the reason, since the binary logs have already been purged. If you still want to point the slave to the new master, proceed to step 3 or step 4.

Step 3. Inject the missing transactions on the slave with empty transactions

The workaround is to pretend that the missing GTID sequences have been executed on the slave by injecting 11 empty transactions, one per missing sequence number, of this form:

SET GTID_NEXT='UUID:SEQUENCE_NO';
BEGIN;COMMIT;
SET GTID_NEXT='AUTOMATIC';

It looks tedious, but a simple script can automate this:

cat empty_transaction_generator.sh
#!/bin/bash
# Usage: empty_transaction_generator.sh <uuid> <first_sequence_no> <last_sequence_no>
# Prints the statements needed to inject one empty transaction per GTID in the range.
uuid=$1
first_sequence_no=$2
last_sequence_no=$3
while [ "$first_sequence_no" -le "$last_sequence_no" ]; do
    echo "SET GTID_NEXT='$uuid:$first_sequence_no';"
    echo "BEGIN;COMMIT;"
    first_sequence_no=$((first_sequence_no + 1))
done
echo "SET GTID_NEXT='AUTOMATIC';"
bash empty_transaction_generator.sh 4998aaaa-6ed5-11e8-948c-0242ac120007 1 11
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:1';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:2';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:3';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:4';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:5';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:6';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:7';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:8';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:9';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:10';
BEGIN;COMMIT;
SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:11';
BEGIN;COMMIT;
SET GTID_NEXT='AUTOMATIC';

Before executing the generated output on the slave, stop replication first:

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:1';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:2';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:3';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:4';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:5';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:6';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:7';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:8';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:9';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:10';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='4998aaaa-6ed5-11e8-948c-0242ac120007:11';
Query OK, 0 rows affected (0.00 sec)
mysql> BEGIN;COMMIT;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)

There’s an even easier way to inject the empty transactions: the mysqlslavetrx utility from the MySQL Utilities. Stop the slave first, then run
mysqlslavetrx --gtid-set=4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 --slaves=root:password@:3306 to achieve the same result as above.

By running SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave you can see that sequences 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 have been executed already:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Resume replication and check if replication is healthy by running START SLAVE; and SHOW SLAVE STATUS\G

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
** redacted **
Seconds_Behind_Master: 0
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
** redacted **
Retrieved_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:4-6
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
Auto_Position: 1
** redacted **
1 row in set (0.00 sec)

At this point, we have already solved the problem. However, there’s another way to restore the slave that is much faster, at the cost of erasing all the existing binary logs on the slave. If you want to do this, proceed to step 4.

Step 4. Add the missing sequences to GTID_EXECUTED by modifying GTID_PURGED.

CRITICAL NOTE:
If you followed the steps in Step 3, you do not need to perform Step 4!

To add the missing transactions, you’ll need to stop the slave, reset the master, and then place the original value of gtid_executed plus the missing sequences in the gtid_purged variable. A word of caution on using this method: it will purge the existing binary logs of the slave.

mysql> STOP SLAVE;
Query OK, 0 rows affected (0.02 sec)
mysql> RESET MASTER;
Query OK, 0 rows affected (0.02 sec)
mysql> SET GLOBAL gtid_purged="1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,4998aaaa-6ed5-11e8-948c-0242ac120007:1-11";
Query OK, 0 rows affected (0.02 sec)

Similar to Step 3, running SHOW GLOBAL VARIABLES LIKE 'gtid_executed'; on the slave shows that sequences 4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 have been executed already:

mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+-------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+-------------------------------------------------------------------------------------+
| gtid_executed | 1904cf31-912b-ee17-4906-7dae335b4bfc:1-3,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11 |
+---------------+-------------------------------------------------------------------------------------+
1 row in set (0.01 sec)

Run START SLAVE; and SHOW SLAVE STATUS\G to resume replication and check if replication is healthy:

mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: pxc_57_5
Master_User: repl
Master_Port: 3306
** redacted **
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
** redacted **
Seconds_Behind_Master: 0
** redacted **
Master_Server_Id: 1
Master_UUID: 4998aaaa-6ed5-11e8-948c-0242ac120007
** redacted **
Retrieved_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:4-6
Executed_Gtid_Set: 1904cf31-912b-ee17-4906-7dae335b4bfc:1-6,
4998aaaa-6ed5-11e8-948c-0242ac120007:1-11
Auto_Position: 1
** redacted **
1 row in set (0.00 sec)

Step 5. Done

Summary

In this article, I demonstrated how to point the slave to a new master even if the new master is missing some of the binary logs the slave needs to execute. Although it is possible to do so with the workarounds shared above, it is prudent to check the consistency of the old and new master before switching the slave to the new master.


Dec
01
2016
--

Database Daily Ops Series: GTID Replication and Binary Logs Purge

This blog continues the ongoing series on daily operations and GTID replication.

In this blog, I’m going to investigate why the error below has been appearing in a special environment I’ve been working with over the last few days:

Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log:
'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the
master has purged binary logs containing GTIDs that the slave requires.'

The error provides the right message and explains what is going on. But sometimes it can be a bit tricky to solve this issue: you need additional information, discovered after some tests and reading. We try to keep Managed Services scripted, in the sense that our advice and best practices are repeatable and consistent. However, some additional features and practices can be added depending on the customer’s situation.

Some time ago, one of our customer’s database servers presented the above message. At that point, we could see the binary log files on the master in compressed form (gzipped). Of course, MySQL can’t identify a compressed file with a .gz extension as a binary log. We uncompressed the file, but replication presented the same problem, even after making sure the UUID of the current master and the TRX_ID were there. Obviously, I needed to investigate the problem to see what was going on.

After some reading, I re-read the below:

When the server starts, the global value of gtid_purged (formerly called gtid_lost) is initialized to the set of GTIDs contained in the Previous_gtids_log_event of the oldest binary log. When a binary log is purged, gtid_purged is re-read from the binary log that has now become the oldest one.

=> https://dev.mysql.com/doc/refman/5.6/en/replication-options-gtids.html#sysvar_gtid_purged

That makes me think: if something is compressing binlogs on the master without purging them as expected by the GTID mechanism, the server is not going to be able to re-read the existing GTIDs on disk. When the slave replication threads restart, or the DBA issues commands like RESET SLAVE and RESET MASTER (to clean out the inflated GTID sets in Executed_Gtid_Set from the SHOW SLAVE STATUS command, for example), this error can occur. But if I compress the file:

  • Will the slave get lost and not find all the needed GTIDs on the master after a reset slave/reset master?
  • If I purge the logs correctly, using PURGE BINARY LOGS, will the slave be OK when restarting replication threads?

Test 1: Compressing the oldest binary log file on master, restarting slave threads

I would like to test this very methodically. We’ll create one GTID per binary log, and then I will compress the oldest binary log file in order to make it unavailable for the slaves. I’m working with three virtual machines, one master and two slaves. On the second slave, I’m going to run the following sequence: stop slave, reset slave, reset master, start slave, and then, check the results. Let’s see what happens.

On master (tool01):

tool01 [(none)]:> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| mysqld-bin.000001 |       341 |
| mysqld-bin.000002 |       381 |
| mysqld-bin.000003 |       333 |
+-------------------+-----------+
3 rows in set (0.00 sec)
tool01 [(none)]:> show binlog events in 'mysqld-bin.000001';
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| mysqld-bin.000001 |   4 | Format_desc    |         1 |         120 | Server ver: 5.6.32-log, Binlog ver: 4                             |
| mysqld-bin.000001 | 120 | Previous_gtids |         1 |         151 |                                                                   |
| mysqld-bin.000001 | 151 | Gtid           |         1 |         199 | SET @@SESSION.GTID_NEXT= '4fbe2d57-5843-11e6-9268-0800274fb806:1' |
| mysqld-bin.000001 | 199 | Query          |         1 |         293 | create database wb01                                              |
| mysqld-bin.000001 | 293 | Rotate         |         1 |         341 | mysqld-bin.000002;pos=4                                           |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
5 rows in set (0.00 sec)
tool01 [(none)]:> show binlog events in 'mysqld-bin.000002';
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| mysqld-bin.000002 |   4 | Format_desc    |         1 |         120 | Server ver: 5.6.32-log, Binlog ver: 4                             |
| mysqld-bin.000002 | 120 | Previous_gtids |         1 |         191 | 4fbe2d57-5843-11e6-9268-0800274fb806:1                            |
| mysqld-bin.000002 | 191 | Gtid           |         1 |         239 | SET @@SESSION.GTID_NEXT= '4fbe2d57-5843-11e6-9268-0800274fb806:2' |
| mysqld-bin.000002 | 239 | Query          |         1 |         333 | create database wb02                                              |
| mysqld-bin.000002 | 333 | Rotate         |         1 |         381 | mysqld-bin.000003;pos=4                                           |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
5 rows in set (0.00 sec)
tool01 [(none)]:> show binlog events in 'mysqld-bin.000003';
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| mysqld-bin.000003 |   4 | Format_desc    |         1 |         120 | Server ver: 5.6.32-log, Binlog ver: 4                             |
| mysqld-bin.000003 | 120 | Previous_gtids |         1 |         191 | 4fbe2d57-5843-11e6-9268-0800274fb806:1-2                          |
| mysqld-bin.000003 | 191 | Gtid           |         1 |         239 | SET @@SESSION.GTID_NEXT= '4fbe2d57-5843-11e6-9268-0800274fb806:3' |
| mysqld-bin.000003 | 239 | Query          |         1 |         333 | create database wb03                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
4 rows in set (0.00 sec)

Here we see that each existing binary log file has just one transaction. That will make it easier to compress the oldest binary log and make part of the existing GTIDs disappear. When a slave connects to a master, it first sends its Executed_Gtid_Set, and the master then sends all the missing events back to the slave. As Stephane Combaudon said, we will force it to happen! Both slave database servers are currently at the same position:

tool02 [(none)]:> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000003
          Read_Master_Log_Pos: 333
               Relay_Log_File: mysqld-relay-bin.000006
                Relay_Log_Pos: 545
        Relay_Master_Log_File: mysqld-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
            ...
           Retrieved_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            Executed_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            
tool03 [(none)]:> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000003
          Read_Master_Log_Pos: 333
               Relay_Log_File: mysqld-relay-bin.000008
                Relay_Log_Pos: 451
        Relay_Master_Log_File: mysqld-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
           Retrieved_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            Executed_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3

Now, we’ll compress the oldest binary log on master:

[root@tool01 mysql]# ls -lh | grep mysqld-bin.
-rw-rw---- 1 mysql mysql  262 Nov 11 13:55 mysqld-bin.000001.gz #: this is the file containing 4fbe2d57-5843-11e6-9268-0800274fb806:1
-rw-rw---- 1 mysql mysql  381 Nov 11 13:55 mysqld-bin.000002
-rw-rw---- 1 mysql mysql  333 Nov 11 13:55 mysqld-bin.000003
-rw-rw---- 1 mysql mysql   60 Nov 11 13:55 mysqld-bin.index

On tool03, the database server we will use for this test, we will reload replication:

tool03 [(none)]:> stop slave; reset slave; reset master; start slave;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.03 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.02 sec)
tool03 [(none)]:> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 4fbe2d57-5843-11e6-9268-0800274fb806
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp: 161111 14:47:13
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 1
1 row in set (0.00 sec)

Bingo! We broke the replication stream on the slave. Now we know that the GTID missing from the master was inside the compressed file and couldn’t be passed along to the connecting slave during negotiation. Additionally, @@GTID_PURGED was not reloaded the way the online manual says it should be. The test is done, and we confirmed the theory (if you have additional comments, enter them at the end of the blog).

Test 2: Purge the oldest file on master and reload replication on slave

Let’s make it as straightforward as possible. The purge should be done with the PURGE BINARY LOGS command so that it is done the proper way: the binary log index file is part of the purge operation as well, and must be edited to remove the file name entry together with the log file on disk. I’m going to execute the same steps as before, but purge the file manually with the mentioned command.
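
As a side note, PURGE BINARY LOGS also accepts a timestamp form, which is often more convenient for scripted housekeeping than naming a file (shown here only for reference; the test below uses the TO form):

tool01 [(none)]:> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;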

tool01 [(none)]:> show master logs;
+-------------------+-----------+
| Log_name | File_size |
+-------------------+-----------+
| mysqld-bin.000001 | 341 |
| mysqld-bin.000002 | 381 |
| mysqld-bin.000003 | 333 |
+-------------------+-----------+
3 rows in set (0.00 sec)
tool01 [(none)]:> purge binary logs to 'mysqld-bin.000002';
Query OK, 0 rows affected (0.01 sec)
tool01 [(none)]:> show master logs;
+-------------------+-----------+
| Log_name | File_size |
+-------------------+-----------+
| mysqld-bin.000002 | 381 |
| mysqld-bin.000003 | 333 |
+-------------------+-----------+
2 rows in set (0.00 sec)

Now, we’ll execute the commands to check how it goes:

tool03 [(none)]:> stop slave; reset slave; reset master; start slave;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.02 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.02 sec)
tool03 [(none)]:> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 4fbe2d57-5843-11e6-9268-0800274fb806
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp: 161111 16:35:02
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 1
1 row in set (0.00 sec)

The GTID in the purged file is needed by the slave. In both cases, we can set @@GTID_PURGED, as below, to the transaction that we know was purged, and move forward with replication:

tool03 [(none)]:> stop slave; set global gtid_purged='4fbe2d57-5843-11e6-9268-0800274fb806:1';
Query OK, 0 rows affected, 1 warning (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
tool03 [(none)]:> start slave;
Query OK, 0 rows affected (0.01 sec)

The above adjusts @@GTID_PURGED so that auto-positioning only requests the GTIDs that still exist on the master: everything up to the oldest existing GTID minus one is marked as purged, making the slave restart replication from the oldest existing GTID. In our scenario, the replica restarts replication from 4fbe2d57-5843-11e6-9268-0800274fb806:2, which lives in binary log file mysqld-bin.000002. Replication is fixed, as its threads can resume processing the data stream coming from the master.

You will need to execute additional checksum and sync steps for the set of transactions that were skipped when setting a new value for @@GTID_PURGED. If replication continues to break after restarting, I advise you to rebuild the slave (possibly the subject of a future blog).

Good explanations of this can be found in the bugs below, reported by the Facebook team and clarified by Laurynas Biveinis, the Percona Server Lead:

  • MySQL Bugs: #72635: Data inconsistencies when master has truncated binary log with GTID after crash;
  • MySQL Bugs: #73032: Setting gtid_purged may break auto_position and thus slaves;

Conclusion

Be careful when purging binary logs or doing anything manual with them, because @@GTID_PURGED needs to be updated automatically when binary logs are purged. That seems to happen only when expire_logs_days is set to purge the binary logs. Even then, be careful when trusting this variable: it doesn’t consider fractions of a day, and depending on the number of writes on a database server, disks can fill up in minutes. This blog showed that both housekeeping scripts and the PURGE BINARY LOGS command were able to make it happen.

Dec
02
2015
--

Fixing errant transactions with mysqlslavetrx prior to a GTID failover

Errant transactions are a major issue when using GTID replication. Although this isn’t something new, the drawbacks are more notorious with GTID than with regular replication.

The situation where errant transactions bite you arises during a common DBA task: failover. Now that tools like MHA support GTID replication (starting from version 0.56), this protocol is becoming more popular, and so are the issues with errant transactions. Luckily, the fix is as simple as injecting an empty transaction into the databases that lack the transaction. You can easily do this through the master, and it will be propagated to all the slaves.

Let’s consider the following situations:

  • What happens when the master blows up into the air and is out of the picture?
  • What happens when there’s not just one but dozens of errant transactions?
  • What happens when you have a high number of slaves?

Things start to become a little more complex.

A side note for the first case: when your master is no longer available, how can you find errant transactions? Well, you can’t. In this case, you should check for errant transactions between your slaves and your former slave/soon-to-be master.
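
For example, you can run GTID_SUBTRACT on each remaining slave, comparing its gtid_executed with the soon-to-be master’s set (a sketch; the set shown is the master’s set from the example below, so substitute the real value). A non-empty result means that slave has errant transactions:

mysql> SELECT GTID_SUBTRACT(@@GLOBAL.gtid_executed,'66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2') AS errant;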

Let’s think about alternatives. What’s the workaround for injecting empty transactions, one by one, for every single errant transaction on every single slave? The MySQL utility mysqlslavetrx. Basically, this utility allows us to skip multiple transactions on multiple slaves in a single step.

One way to install the MySQL utilities is by executing the following steps:

  • wget http://dev.mysql.com/get/Downloads/MySQLGUITools/mysql-utilities-1.6.2.tar.gz
  • tar -xvzf mysql-utilities-1.6.2.tar.gz
  • cd mysql-utilities-1.6.2
  • python ./setup.py build
  • sudo python ./setup.py install

And you’re ready.

What about some examples? Let’s say we have a Master/Slave server with GTID replication, current status as follows:

mysql> show master status;
+------------------+----------+--------------+------------------+------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                        |
+------------------+----------+--------------+------------------+------------------------------------------+
| mysql-bin.000002 | 530      |              |                  | 66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2 |
+------------------+----------+--------------+------------------+------------------------------------------+
1 row in set (0.00 sec)
mysql> show slave status\G
...
Executed_Gtid_Set: 66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2
Auto_Position: 1
1 row in set (0.00 sec)

Add chaos to the slave in form of a new schema:

mysql> create database percona;
Query OK, 1 row affected (0.00 sec)

Now we have an errant transaction!

The slave status looks different:

mysql> show slave status\G
...
Executed_Gtid_Set: 66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2,
674a625e-976e-11e5-a8fb-125cab082fc3:1
Auto_Position: 1
1 row in set (0.00 sec)

By using the GTID_SUBSET function we can confirm that things go from “all right” to “no-good”:

Before:

mysql> select gtid_subset('66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2','66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2') as is_subset;
+-----------+
| is_subset |
+-----------+
| 1         |
+-----------+
1 row in set (0.00 sec)

After:

mysql> select gtid_subset('66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2,674a625e-976e-11e5-a8fb-125cab082fc3:1','66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2') as is_subset;
+-----------+
| is_subset |
+-----------+
| 0         |
+-----------+
1 row in set (0.00 sec)

All right, it’s a mess, got it. What’s the errant transaction? The GTID_SUBTRACT function will tell us:

mysql> select gtid_subtract('66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2,674a625e-976e-11e5-a8fb-125cab082fc3:1','66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2') as errant;
+----------------------------------------+
| errant                                 |
+----------------------------------------+
| 674a625e-976e-11e5-a8fb-125cab082fc3:1 |
+----------------------------------------+
1 row in set (0.00 sec)

The classic way to fix this is by injecting an empty transaction:

mysql> SET GTID_NEXT='674a625e-976e-11e5-a8fb-125cab082fc3:1';
Query OK, 0 rows affected (0.00 sec)
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> commit;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GTID_NEXT='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)

After this, the errant transaction won’t be errant anymore.

mysql> show master status;
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                |
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------+
| mysql-bin.000002 | 715      |              |                  | 66fbd3be-976e-11e5-a8fb-1256731a26b7:1-2,
674a625e-976e-11e5-a8fb-125cab082fc3:1                                                                                                             |
+------------------+----------+--------------+------------------+----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Okay, let’s add another slave to the mix. Now is the moment where the mysqlslavetrx utility becomes very handy.

What you need to know is:

  • The slave’s IP address
  • The GTID set

It will be simple to execute:

mysqlslavetrx --gtid-set=6aa9a742-8284-11e5-a09b-12aac3869fc9:1 --verbose --slaves=user:password@172.16.1.143:3306,user:password@172.16.1.144

The verbose output will look something like this:

# GTID set to be skipped for each server:
# - 172.16.1.143@3306: 6aa9a742-8284-11e5-a09b-12aac3869fc9:1
# - 172.16.1.144@3306: 6aa9a742-8284-11e5-a09b-12aac3869fc9:1
#
# Injecting empty transactions for '172.16.1.143:3306'...
# - 6aa9a742-8284-11e5-a09b-12aac3869fc9:1
# Injecting empty transactions for '172.16.1.144:3306'...
# - 6aa9a742-8284-11e5-a09b-12aac3869fc9:1
#
# ...done.
#

You can run mysqlslavetrx from anywhere (the master or any slave). You just need to be sure that the user and password are valid and have the SUPER privilege, which is required to set the gtid_next variable.

As a summary: take advantage of the MySQL Utilities. In this particular case, mysqlslavetrx is extremely useful when using GTID replication and you want to perform a clean failover. It can be added as a pre-script for MHA failover (which supports GTID since version 0.56), or it can simply be used to maintain consistency between master and slaves.


Sep
07
2015
--

super_read_only and GTID replication

Percona Server 5.6.21+ and MySQL 5.7.8+ offer the super_read_only option that was first implemented in WebscaleSQL. Unlike read_only, this option prevents all users from running writes (even those with the SUPER privilege). Sure enough, this is a great feature, but what’s the relation with GTID? Read on!

TL;DR

Enabling super_read_only on all slaves when using GTID replication makes your topology far less sensitive to errant transactions. Failover is then easier and safer because creating errant transactions is much harder.

GTID replication is awesome…

For years, all MySQL DBAs in the world have been fighting with positioning when working with replication. Each time you move a slave from one master to another, you must be very careful to start replicating at the correct position. That was boring and error-prone.

GTID replication is a revolution because it allows auto-positioning: when you configure server B to replicate from A, both servers will automatically negotiate which events should be sent by the master. Of course, this assumes the master still has all the missing events in its binlogs; otherwise the slave will complain that it can’t get all the events it needs, and you will see error 1236.

… but there’s a catch

Actually GTID replication has several issues, the main one in MySQL 5.6 being the inability to switch from position-based to GTID-based replication without downtime. This has been fixed since then fortunately.

The issue I was thinking of is errant transactions. Not familiar with this term? Let me clarify.

Say you have a slave (B) replicating from a master (A) using the traditional position-based replication. Now you want to create a new database. This is easy: just connect to B and run:

mysql> CREATE DATABASE new_db;

Ooops! You’ve just made a big mistake: instead of creating the database on the master, you’ve just created it on the slave. But the change is easy to undo: run DROP DATABASE on B, followed by CREATE DATABASE on A.

Nobody will ever know about your mistake, and next time you’ll be more careful.

However with GTID-replication, this is another story: when you run a write statement on B, you create an associated GTID. And this associated GTID will be recorded forever (even if the binlog containing the transaction is purged at some point).

Now you can still undo the transaction but there is no way to undo the GTID. What you’ve created is called an errant transaction.

This minor mistake can have catastrophic consequences: say that 6 months later, B is promoted as the new master. Because of auto-positioning, the errant transaction will be sent to all slaves. But it’s very likely that the corresponding binlog has been purged, so B will be unable to send the errant transaction. As a result replication will be broken everywhere. Not nice…

super_read_only can help

Enter super_read_only. If it is enabled on all slaves, the above scenario won’t happen because the write on B will trigger an error and no GTID will be created.
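
Turning it on is a one-liner on each slave, and you would typically also persist it in the configuration file so it survives a restart (a minimal sketch; setting read_only explicitly as well does no harm):

mysql> SET GLOBAL super_read_only = ON;

# my.cnf on every slave
[mysqld]
read_only = 1
super_read_only = 1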

With super_read_only, tools that were not reliable with GTID replication become reliable enough to be used again. For instance, MHA supports failover in a GTID-based setup but it doesn’t check errant transactions when failing over, making it risky to use with GTID replication. super_read_only makes MHA attractive again with GTID.

However note that super_read_only can’t prevent all errant transactions. The setting is dynamic so if you have privileged access, you can still disable super_read_only, create an errant transaction and enable it back. But at least it should avoid errant transactions that are created by accident.

The post super_read_only and GTID replication appeared first on MySQL Performance Blog.

Feb
10
2015
--

Online GTID rollout now available in Percona Server 5.6

Global Transaction IDs (GTIDs) are one of my favorite features of MySQL 5.6. The main limitation is that you must stop all the servers at the same time to enable GTID replication. Not everyone can afford such downtime, so this requirement has been a showstopper for many people. Starting with Percona Server 5.6.22-72.0, enabling GTID replication can be done with almost no downtime. Let’s see how to do it.

Implementation of the Facebook patch

Finding a solution to migrate to GTIDs with no downtime is not a new idea, and several companies have already developed their own patch. The two best-known implementations are the one from Facebook and the one from Booking.com.

Both options have pros and cons, and we finally chose to port the Facebook patch and add a new setting (gtid_deployment_step).

Performing the migration

Let’s assume we have a master-slaves setup with 4 servers A, B, C and D. A is the master:

The 1st step is to take each slave out of rotation, one at a time, and set gtid_mode = ON and gtid_deployment_step = ON (and also log_bin, log_slave_updates and enforce_gtid_consistency).

gtid_deployment_step = ON means that a server will not generate GTIDs when it executes writes, but it will record a GTID in its binary log if it gets an event from the replication stream tagged with a GTID.
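
In configuration-file form, the settings from the first step would look something like this on each slave (a sketch; adapt to your environment):

[mysqld]
gtid_mode = ON
gtid_deployment_step = ON
log_bin
log_slave_updates
enforce_gtid_consistency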

The 2nd step is to promote one of the slaves to become the new master (for instance C) and to disable gtid_deployment_step. It is a regular slave promotion so you should do it the same way you deal with planned slave promotions (for instance using MHA or your own scripts). Our patch doesn’t help you do this promotion.

At this point replication will break on the old master as it has gtid_mode = OFF and gtid_deployment_step = OFF.

Don’t forget that you need to use CHANGE MASTER TO MASTER_AUTO_POSITION = 1 to enable GTID-based replication.

The 3rd step is to restart the old master to set gtid_mode = ON. Replication will resume automatically, but don’t forget to set MASTER_AUTO_POSITION = 1.
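
On each server you repoint, this boils down to something like the following (the hostname is a placeholder):

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO MASTER_HOST='new_master', MASTER_AUTO_POSITION = 1;
mysql> START SLAVE;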

The final step is to disable gtid_deployment_step on all slaves. This can be done dynamically:

mysql> SET GLOBAL gtid_deployment_step = OFF;

and you should remove the setting from the my.cnf file so that it is not set again when the server is restarted.

Optionally, you can promote the old master back to its original role.

That’s it, GTID replication is now available without having restarted all servers at the same time!

Limitations

At some point during the migration, a slave promotion is needed. And at this point, you are still using position-based replication. The patch will not help you with this promotion so use your regular failover scripts. If you have no scripts to deal with that kind of situation, make sure you know how to proceed.

Also be aware that this patch provides a way to migrate to GTIDs with no downtime, but not a way to migrate away from GTIDs with no downtime. So test carefully and make sure you understand all the new stuff that comes with GTIDs, like the new replication protocol, or how to skip transactions.

Other topologies

If you are using master-master replication or multiple tier replication, you can follow the same steps. With multiple tier replication, simply start by setting gtid_mode = ON and gtid_deployment_step = ON for the leaves first.

Conclusion

If you’re interested in the benefits of GTID replication but taking a downtime has always scared you, you should definitely download the latest Percona Server 5.6 and give it a try!


Jul
02
2014
--

Failover with the MySQL Utilities: Part 2 – mysqlfailover

In the previous post of this series we saw how you could use mysqlrpladmin to perform manual failover/switchover when GTID replication is enabled in MySQL 5.6. Now we will review mysqlfailover (version 1.4.3), another tool from the MySQL Utilities that can be used for automatic failover.

Summary

  • mysqlfailover can perform automatic failover if MySQL 5.6’s GTID replication is enabled.
  • All slaves must use --master-info-repository=TABLE.
  • The monitoring node is a single point of failure: don’t forget to monitor it!
  • Detection of errant transactions works well, but you have to use the --pedantic option to make sure failover will never happen if there is an errant transaction.
  • There are a few limitations, such as the inability to fail over only once (and then exit) or excessive CPU utilization, but they are probably not showstoppers for most setups.

Setup

We will use the same setup as last time: one master and two slaves, all using GTID replication. We can see the topology using mysqlfailover with the health command:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root health
[...]
MySQL Replication Failover Utility
Failover Mode = auto     Next Interval = Tue Jul  1 10:01:22 2014
Master Information
------------------
Binary Log File   Position  Binlog_Do_DB  Binlog_Ignore_DB
mysql-bin.000003  700
GTID Executed Set
a9a396c6-00f3-11e4-8e66-9cebe8067a3f:1-3
Replication Health Status
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Note that --master-info-repository=TABLE needs to be configured on all slaves or the tool will exit with an error message:

2014-07-01 10:18:55 AM CRITICAL Failover requires --master-info-repository=TABLE for all slaves.
ERROR: Failover requires --master-info-repository=TABLE for all slaves.

Failover

You can use 2 commands to trigger automatic failover:

  • auto: the tool tries to find a candidate in the list of servers specified with --candidates, and if no good server is found in this list, it will look at the other slaves to see if one can be a good candidate. This is the default command.
  • elect: same as auto, but if no good candidate is found in the list of candidates, other slaves will not be checked and the tool will exit with an error (see the example below).
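
For example, restricting the election to a preferred slave looks like this (the candidate shown reuses the demo ports and is only an illustration):

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root --candidates=root@localhost:13002 elect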

Let’s start the tool with auto:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root auto

The monitoring console is visible and is refreshed every --interval seconds (default: 15). Its output is similar to what you get when using the health command.

Then let’s kill -9 the master to see what happens once the master is detected as down:

Failed to reconnect to the master after 3 attemps.
Failover starting in 'auto' mode...
# Candidate slave localhost:13002 will become the new master.
# Checking slaves status (before failover).
# Preparing candidate for failover.
# Creating replication user if it does not exist.
# Stopping slaves.
# Performing STOP on all slaves.
# Switching slaves to new master.
# Disconnecting new master as slave.
# Starting slaves.
# Performing START on all slaves.
# Checking slaves for errors.
# Failover complete.
# Discovering slaves for master at localhost:13002
Failover console will restart in 5 seconds.
MySQL Replication Failover Utility
Failover Mode = auto     Next Interval = Tue Jul  1 10:59:47 2014
Master Information
------------------
Binary Log File   Position  Binlog_Do_DB  Binlog_Ignore_DB
mysql-bin.000005  191
GTID Executed Set
a9a396c6-00f3-11e4-8e66-9cebe8067a3f:1-3
Replication Health Status
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13002  | MASTER  | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

Looks good! The tool is then ready to fail over to another slave if the new master becomes unavailable.

You can also run custom scripts at several points of execution with the --exec-before, --exec-after, --exec-fail-check, --exec-post-failover options.
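
For instance, wiring in a notification hook might look like this (the script path is hypothetical; the options are those listed above):

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root --exec-post-failover=/usr/local/bin/notify_dba.sh auto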

However, it would be great to have a --failover-and-exit option to avoid flapping: the tool would detect the master failure, promote one of the slaves, reconfigure replication, and then exit (this is what MHA does, for instance).

Tool registration

When the tool is started, it registers itself on the master by writing a few things into a dedicated table:

mysql> SELECT * FROM mysql.failover_console;
+-----------+-------+
| host      | port  |
+-----------+-------+
| localhost | 13001 |
+-----------+-------+

This is nice, as it prevents you from starting several instances of mysqlfailover to monitor the same master. If we try, this is what we get:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root auto
[...]
Multiple instances of failover console found for master localhost:13001.
If this is an error, restart the console with --force.
Failover mode changed to 'FAIL' for this instance.
Console will start in 10 seconds..........starting Console.

With the fail command, mysqlfailover will monitor replication health and exit in the case of a master failure, without actually performing failover.

Running in the background

In all previous examples, mysqlfailover was running in the foreground. This is very good for demos, but in a production environment you are likely to prefer running it in the background. This can be done with the --daemon option:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root auto --daemon=start --log=/var/log/mysqlfailover.log

and it can be stopped with:

$ mysqlfailover --daemon=stop

Errant transactions

If we create an errant transaction on one of the slaves, it will be detected:

MySQL Replication Failover Utility
Failover Mode = auto     Next Interval = Tue Jul  1 16:29:44 2014
[...]
WARNING: Errant transaction(s) found on slave(s).
Replication Health Status
[...]

However this does not prevent failover from occurring! You have to use --pedantic:

$ mysqlfailover --master=root@localhost:13001 --discover-slaves-login=root --pedantic auto
[...]
# WARNING: Errant transaction(s) found on slave(s).
#  - For slave 'localhost@13003': db906eee-012d-11e4-8fe1-9cebe8067a3f:1
2014-07-01 16:44:49 PM CRITICAL Errant transaction(s) found on slave(s). Note: If you want to ignore this issue, please do not use the --pedantic option.
ERROR: Errant transaction(s) found on slave(s). Note: If you want to ignore this issue, please do not use the --pedantic option.

Limitations

  • As with mysqlrpladmin, the slave election process is not very sophisticated and cannot be tuned.
  • The server on which mysqlfailover is running is a single point of failure.
  • Excessive CPU utilization: once it is running, mysqlfailover hogs one core. This is quite surprising.

Conclusion

mysqlfailover is a good tool to automate failover in clusters using GTID replication. It is flexible and looks reliable. Its main drawback is that there is no easy way to make it highly available itself: if mysqlfailover crashes, you will have to manually restart it.

