Not everyone may know this, but there are precious few innodb crash recovery tests available.
Some folks have noticed this and asked for something to be done about it, but unfortunately, no tests have been created for the main MySQL branch.
The MySQL at Facebook branch has a number of tests that are quite interesting. They basically create a master-slave pair, subject the master to a transactional load, crash the master, restart it, then ensure the master and slave are in sync (once the test load is stopped).
The team at Percona has been known to tinker with Innodb now and again, and were also very interested in having some tests to ensure things were working as expected. To that end, I have created the innodbCrash suite for kewpie.
These tests follow the Facebook guys’ lead for test load + validation, but we use the random query generator for generating our transactions. This is for two reasons:
- My attempts at copying the Facebook multi-threaded python load-generation code seemed very, very slow
- I just like the randgen (it is fun!)
The tests require a debug build as they make use of the DBUG points noted in the crash test bug:
- DBUG_EXECUTE_IF(“crash_commit_before”, DBUG_SUICIDE(););
- DBUG_EXECUTE_IF(“crash_commit_after_prepare”, DBUG_SUICIDE(););
- DBUG_EXECUTE_IF(“crash_commit_after_log”, DBUG_SUICIDE(););
- DBUG_EXECUTE_IF(“crash_commit_before_unlog”, DBUG_SUICIDE(););
- DBUG_EXECUTE_IF(“crash_commit_after”, DBUG_SUICIDE(););
- DBUG_EXECUTE_IF(“half_binlogged_transaction”, DBUG_SUICIDE(););
What happens is that the randgen grammars have a small chance of hitting a rule that will issue a SET SESSION debug=”d,crash_commit_*”. The test will then continue to issue DML until the server hits that debug crash point. We can ensure that the randgen actually did encounter a server crash by examining the contents of kewpie/workdir/bot0/log/randgen.out:
$ tail workdir/bot0/log/randgen.out
# 2011-12-21T14:25:02 bindir is mysql-5.5/sql
# 2011-12-21T14:25:02 core is
"kewpie/randgen/" is not a core dump: Is a directory
No stack.
kewpie/randgen/backtrace.gdb:20: Error in sourced command file:
No registers.
# 2011-12-21T14:25:03
"kewpie/randgen/" is not a core dump: Is a directory
# 2011-12-21T14:25:03
# 2011-12-21T14:25:03 Test completed with failure status STATUS_SERVER_CRASHED (101)
I also took pains to ensure that the randgen workload was having an effect on the test tables between crash/recovery/validate cycles. To examine this for yourself during a run, you can try the new option –test-debug.
I created this option as many tests might have points where we’d like to see what is going on when things fail (or we are changing things and want to make sure they still work), but that would be annoying for general test runs. For example, we can view the comparisons of master-slave checksums after the master has crashed and restarted:
$ ./kewpie.py --suite=innodbCrash --basedir=mysql-5.5 --test-debug crashCommitAfter_test
20111221-142500 INFO MASTER_PORT: 9307
20111221-142500 INFO SOCKET_FILE: kewpie/workdir/bot0/var_s0/s0.sock
20111221-142500 INFO VARDIR: kewpie/workdir/bot0/var_s0
20111221-142500 INFO STATUS: 1
20111221-142500 TEST_DEBUG Crashes remaining: 1
20111221-142505 TEST_DEBUG A: master_checksum= (('test.A', 1651057258L),) | slave_checksum= (('test.A', 1651057258L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG AA: master_checksum= (('test.AA', 1110194204L),) | slave_checksum= (('test.AA', 1110194204L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG B: master_checksum= (('test.B', 0L),) | slave_checksum= (('test.B', 0L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG BB: master_checksum= (('test.BB', 100984011L),) | slave_checksum= (('test.BB', 100984011L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG C: master_checksum= (('test.C', 2978384584L),) | slave_checksum= (('test.C', 2978384584L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG CC: master_checksum= (('test.CC', 521860080L),) | slave_checksum= (('test.CC', 521860080L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG D: master_checksum= (('test.D', 1148937120L),) | slave_checksum= (('test.D', 1148937120L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142505 TEST_DEBUG DD: master_checksum= (('test.DD', 2110127969L),) | slave_checksum= (('test.DD', 2110127969L),)
20111221-142505 TEST_DEBUG ################################################################################
20111221-142507 ===============================================================
20111221-142507 TEST NAME [ RESULT ] TIME (ms)
20111221-142507 ===============================================================
20111221-142507 innodbCrash.crashCommitAfter_test [ pass ] 6059
20111221-142507 ===============================================================
20111221-142507 INFO Test execution complete in 16 seconds
Finally, one of the most useful features of the randgen is its ability to use –seed values to change the data and/or queries for a test run. Seed values are deterministic as each randgen run with that seed should produce the same queries and data (there are unit tests to monitor this doesn’t break, IIRC). In the past, I have created separate variants of tests that take a –seed=time argument to give wider coverage. However, this approach gets tiresome when one has many test cases (more to maintain!).
My solution to this problem was the addition of another option –randgen-seed. By default this is 1, but one can also say –randgen-seed=time. For those test cases that are written to take advantage of this (like our shiny new Innodb crash tests!), it is very easy to shuffle data and queries for all tests:
./kewpie.py –suite=innodbCrash –basedir=/path/to/basedir –force –randgen-seed=time –repeat=3 gets us:
20111221-142820 TEST_DEBUG Crashes remaining: 1
20111221-142825 TEST_DEBUG A: master_checksum= (('test.A', 2990808711L),) | slave_checksum= (('test.A', 2990808711L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG AA: master_checksum= (('test.AA', 2850101684L),) | slave_checksum= (('test.AA', 2850101684L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG B: master_checksum= (('test.B', 1893458598L),) | slave_checksum= (('test.B', 1893458598L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG BB: master_checksum= (('test.BB', 686510945L),) | slave_checksum= (('test.BB', 686510945L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG C: master_checksum= (('test.C', 1036631423L),) | slave_checksum= (('test.C', 1036631423L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG CC: master_checksum= (('test.CC', 3171836798L),) | slave_checksum= (('test.CC', 3171836798L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG D: master_checksum= (('test.D', 4208439607L),) | slave_checksum= (('test.D', 4208439607L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142825 TEST_DEBUG DD: master_checksum= (('test.DD', 1325583310L),) | slave_checksum= (('test.DD', 1325583310L),)
20111221-142825 TEST_DEBUG ################################################################################
20111221-142827 ===============================================================
20111221-142827 TEST NAME [ RESULT ] TIME (ms)
20111221-142827 ===============================================================
20111221-142827 innodbCrash.crashCommitAfter_test [ pass ] 5743
20111221-142829 TEST_DEBUG Crashes remaining: 1
20111221-142834 TEST_DEBUG A: master_checksum= (('test.A', 0L),) | slave_checksum= (('test.A', 0L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG AA: master_checksum= (('test.AA', 4174656109L),) | slave_checksum= (('test.AA', 4174656109L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG B: master_checksum= (('test.B', 0L),) | slave_checksum= (('test.B', 0L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG BB: master_checksum= (('test.BB', 502335766L),) | slave_checksum= (('test.BB', 502335766L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG C: master_checksum= (('test.C', 2318403278L),) | slave_checksum= (('test.C', 2318403278L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG CC: master_checksum= (('test.CC', 2060894679L),) | slave_checksum= (('test.CC', 2060894679L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG D: master_checksum= (('test.D', 3959154326L),) | slave_checksum= (('test.D', 3959154326L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142834 TEST_DEBUG DD: master_checksum= (('test.DD', 2507050488L),) | slave_checksum= (('test.DD', 2507050488L),)
20111221-142834 TEST_DEBUG ################################################################################
20111221-142840 innodbCrash.crashCommitAfter_test [ pass ] 5603
20111221-142844 TEST_DEBUG Crashes remaining: 1
20111221-142849 TEST_DEBUG A: master_checksum= (('test.A', 488788349L),) | slave_checksum= (('test.A', 488788349L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG AA: master_checksum= (('test.AA', 880803763L),) | slave_checksum= (('test.AA', 880803763L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG B: master_checksum= (('test.B', 0L),) | slave_checksum= (('test.B', 0L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG BB: master_checksum= (('test.BB', 3997853573L),) | slave_checksum= (('test.BB', 3997853573L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG C: master_checksum= (('test.C', 2684719484L),) | slave_checksum= (('test.C', 2684719484L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG CC: master_checksum= (('test.CC', 1041735070L),) | slave_checksum= (('test.CC', 1041735070L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG D: master_checksum= (('test.D', 3171892155L),) | slave_checksum= (('test.D', 3171892155L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142849 TEST_DEBUG DD: master_checksum= (('test.DD', 1865390033L),) | slave_checksum= (('test.DD', 1865390033L),)
20111221-142849 TEST_DEBUG ################################################################################
20111221-142851 innodbCrash.crashCommitAfter_test [ pass ] 5675
20111221-142851 ===============================================================
20111221-142851 INFO Test execution complete in 41 seconds
So far, my tests have been passing without incident, but I also haven’t dug into partitions or blobs >: )
If you’d like to try these out yourself, they are in lp:kewpie (I changed the name of dbqp). You’ll need the DBD::mysql perl module for the randgen and MySQLdb for kewpie, and a debug build of the server, but other than that, they are ready to run out of the box. The Percona development team will be adding Jenkins runs for these tests soon