Feb
15
2013
--

Replication checksums in MySQL 5.6

MySQL 5.6 has an impressive list of improvements. Among them, replication checksums caught my attention as it seems that many people misunderstand the real added value of this new feature. I heard people think that with replication checksums, data integrity between the master and its replicas is now enforced. As we’ll see, it’s not that easy.

First, here are a few common reasons why data integrity may be broken (the list is not exhaustive):

  • writes executed on a replica instead of the master
  • non deterministic queries
  • bad use of replication filters
  • rollback of transactions mixing transactional and non-transactional tables

The real problem is that replication may keep on running without error, not letting you know that all servers no longer have the same data.

Will replication checksums detect such problems? Unfortunately, no, they won’t.

That being said, the reasons I mentioned above are not the only ones that can break data integrity. One fairly common issue is binary log or relay log corruption. Will replication checksums help you in this case? Yes, they will!

Replication checksums on replicas

Checksums on slaves are controlled by the slave_sql_verify_checksum variable. Let’s first disable them:

slave1 [localhost] {msandbox} ((none)) > set global slave_sql_verify_checksum=OFF;
Query OK, 0 rows affected (0,00 sec)

We’ll now simulate a relay log corruption by manually editing it before the event is executed on the slave. The easiest way is to stop the SQL thread:

slave1 [localhost] {msandbox} ((none)) > stop slave sql_thread;
Query OK, 0 rows affected (0,00 sec)

Now let’s write something on the master:

master [localhost] {msandbox} ((none)) > create database sakila;
Query OK, 1 row affected (0,00 sec)

and corrupt the relay log (I changed sakila to sakilb):

[...]
@^@^@^F^Cstd^D!^@!^@^H^@^L^Asakila^@sakila^@create database sakilb5      Ý

Let’s restart the SQL thread and see what we have on the slave:

slave1 [localhost] {msandbox} ((none)) > start slave sql_thread;
Query OK, 0 rows affected (0,01 sec)

slave1 [localhost] {msandbox} ((none)) > show databases like 'sakil%';
+-------------------+
| Database (sakil%) |
+-------------------+
| sakilb            |
+-------------------+

sakila on the master, sakilb on the slave: data integrity is broken, but SHOW SLAVE STATUS doesn’t show any problem. This is bad, but expected.

We’ll now turn on replication checksums on the slave:

slave1 [localhost] {msandbox} ((none)) > set global slave_sql_verify_checksum=ON;
Query OK, 0 rows affected (0,00 sec)

and check that the master is writing checksums for binary log events:

master [localhost] {msandbox} ((none)) > show global variables like 'binlog_checksum';
+-----------------+-------+
| Variable_name   | Value |
+-----------------+-------+
| binlog_checksum | CRC32 |
+-----------------+-------+

We also want to discard our manually-edited relay log file (as editing such a file manually can lead to further corruption):

slave1 [localhost] {msandbox} ((none)) > stop slave;
Query OK, 0 rows affected (0,01 sec)

# Use Relay_Master_Log_File and Exec_Master_Log_Pos from SHOW SLAVE STATUS
slave1 [localhost] {msandbox} ((none)) > change master to master_log_file='mysql-bin.000001', master_log_pos=2690;
Query OK, 0 rows affected (0,06 sec)

slave1 [localhost] {msandbox} ((none)) > start slave;
Query OK, 0 rows affected (0,01 sec)

And after repeating the same steps as above (stop sql thread, create sakila2 database on the master, corrupt the relay log, restart the sql thread), we now have a nice error message:

slave1 [localhost] {msandbox} ((none)) > show slave status\G
               [...]
 Slave_IO_Running: Yes
Slave_SQL_Running: No
	       [...]
       Last_Errno: 1594
       Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

Great! Now we’re warned that something went wrong. It’s also much easier to fix the problem right now (either manually or with pt-table-checksum/pt-table-sync) than several days or several weeks later, as a single inconsistency may easily translate to massive corruption over time.

If the error message looks too cryptic for you, you can use the new --verify-binlog-checksum option of mysqlbinlog:

# mysqlbinlog --verify-binlog-checksum mysql_sandbox18676-relay-bin.000002
[...]
ERROR: Error in Log_event::read_log_event(): 'Event crc check failed! Most likely there is event corruption.', data_len: 103, event_type: 2
ERROR: Could not read entry at offset 283: Error in log format or read error.
[...]

Replication checksums on the master

We can enable them with the following command (they’re disabled by default):

master [localhost] {msandbox} ((none)) > set global master_verify_checksum=ON;
Query OK, 0 rows affected (0,00 sec)

Now if we have corruption in the binary log, we will still be able to write on the master. One way to show the corruption is to use SHOW BINLOG EVENTS:

master [localhost] {msandbox} ((none)) > show binlog events;
ERROR 1220 (HY000): Error when executing command SHOW BINLOG EVENTS: Wrong offset or I/O error

But on slaves, the corruption will be obvious as replication will stop with an error:

slave1 [localhost] {msandbox} ((none)) > show slave status\G
	       [...]
 Slave_IO_Running: No
Slave_SQL_Running: Yes
	       [...]
    Last_IO_Errno: 1236
    Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'event read from binlog did not pass crc check; the first event 'mysql-bin.000001' at 2793, the last event read from './mysql-bin.000001' at 2793, the last byte read from './mysql-bin.000001' at 2896.'

Conclusion

You should not assume that replication checksums in MySQL 5.6 will ensure data integrity in all scenarios, but they do improve reliability of the whole replication process. This is a great step forward, thank you replication team at Oracle!

The post Replication checksums in MySQL 5.6 appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com