Checking B+tree leaf nodes list consistency in InnoDB

If we have InnoDB pages there are two ways to learn how many records they contain:

  • PAGE_N_RECS field in the page header
  • Count records while walking over the list of records from infimum to supremum

In some previous revision of the recovery tool a short summary was added to a dump which is produced by the constraints_parser.

But if a page is lost and page_parser hasn’t found it, all records from this page are lost. In other words per-page recovery statistics gives us little idea about whether or not a recovered table is complete.

To cover this flaw a new tool index_check is introduced in the revision 80

As you might know InnoDB stores a table in a clustered index called PRIMARY.

The PRIMARY index is a B+tree structure. It has a nice feature that all leaf nodes have pointers to a next leaf node, so following the pointers we can read whole index in the primary key order. InnoDB extends the structure and stores also a pointer to the previous node. The previous node of the head is NULL as well as the next node of the tail.

Based on this knowledge we can check if we have all elements of the list. If we can get from any InnoDB page to the beginning and the end of the list then our set of pages is complete.

This is exactly what index_chk does – it reads files from a directory with InnoDB pages (that is produced by page_parser) and tries to walk over the list of pages back and forth.

Let’s review an example. I took some corrupt InnoDB tablespace and split it with a page_parser:

# ./page_parser -f /var/lib/mysql/ibdata1
Opening file: /var/lib/mysql/ibdata1
File information:
ID of device containing file:        64512
inode number:                     27037954
protection:                         100660 (regular file)
number of hard links:                   1
user ID of owner:                      107
group ID of owner:                     116
device ID (if special file):             0
blocksize for filesystem I/O:         4096
number of blocks allocated:          86016
time of last access:            1374662334 Wed Jul 24 06:38:54 2013
time of last modification:      1374233938 Fri Jul 19 07:38:58 2013
time of last status change:     1374233938 Fri Jul 19 07:38:58 2013
total size, in bytes:             44040192 (42.000 MiB)
Size to process:                  44040192 (42.000 MiB)
8.26% done. 2013-07-24 08:39:58 ETA(in 00:00 hours). Processing speed: 3637248 B/sec
95.70% done. 2013-07-24 08:40:09 ETA(in 00:00 hours). Processing speed: 4399053 B/sec

Now let’s take some index and check if it has all pages:

# ls pages-1374669586/FIL_PAGE_INDEX/0-410/
00000000-00000145.page  00000000-00000235.page  00000000-00000241.page  00000000-00000247.page  00000000-00000254.page
00000000-00000147.page  00000000-00000236.page  00000000-00000243.page  00000000-00000249.page  00000000-00000255.page
00000000-00000148.page  00000000-00000239.page  00000000-00000244.page  00000000-00000251.page
# ./index_chk -f pages-1374669586/FIL_PAGE_INDEX/0-410
Couldn't open file pages-1374669586/FIL_PAGE_INDEX/0-410/00000000-00000140.page

Bad news, a page with id 140 is missing!

Indeed the previous page before page#145 is page#140, but it’s missing.

# hexdump -C pages-1374669586/FIL_PAGE_INDEX/0-410/00000000-00000145.page | head -2
00000000  d4 cd 68 41 00 00 00 91  00 00 00 8c ff ff ff ff  |..hA............|
00000010  00 00 00 00 2b 6c ea 90  45 bf 00 00 00 00 00 00  |....+l..E.......|

For a cheerful ending let’s check an index that has all pages:

# ls pages-1374669586/FIL_PAGE_INDEX/0-2/* # this is SYS_COLUMNS table
pages-1374669586/FIL_PAGE_INDEX/0-2/00000000-00000010.page  pages-1374669586/FIL_PAGE_INDEX/0-2/00000000-00002253.page
# ./index_chk -f pages-1374669586/FIL_PAGE_INDEX/0-2

Thus, a table is fully recovered if and only if two conditions are met:

  • Command grep “Lost records: YES” table.dump | grep -v “Leaf page: NO” outputs nothing
  • ./index_chk -f pages-1374669586/FIL_PAGE_INDEX/<inde_id of=”” my=”” table=””> reports OK

The post Checking B+tree leaf nodes list consistency in InnoDB appeared first on MySQL Performance Blog.


Recovering temporal types in MySQL 5.6: TIME, TIMESTAMP and DATETIME

MySQL 5.6 introduces a new feature – microseconds resolution in some temporal types. As of 5.6.4 TIME, TIMESTAMP and DATETIME can have a fractional part. To create a field with subseconds you can specify precision in brackets: TIME(3), DATETIME(6) etc.

Obviously, the new feature requires the format change. All three types may now have a tail with a fractional part. The tail may be one, two or three bytes long and can store up to six digits after the point.

The non-fractional part has changed too. Thus, DATETIME uses only five bytes comparing to eight in previous versions.

As of revision 79 the recovery toolkit supports the new format.

Let’s recover a sample table to see how it works. The table is

  `id` int(11) AUTO_INCREMENT,
  `t1` time(3),
  `t2` timestamp(6),
  `t3` datetime(1),
  `t4` datetime,
  PRIMARY KEY (`id`)

mysql> select * from t;
| id | t1           | t2                         | t3                    | t4                  |
|  1 | 05:05:10.000 | 2013-07-04 05:05:10.000000 | 2013-07-04 05:05:10.0 | 2013-07-04 05:05:10 |
|  2 | 05:14:24.414 | 2013-07-04 05:14:24.125000 | 2013-07-04 05:14:24.4 | 2013-07-04 05:14:25 |
|  3 | 05:14:32.566 | 2013-07-04 05:14:32.207031 | 2013-07-04 05:14:32.3 | 2013-07-04 05:14:32 |
|  4 | 05:14:34.344 | 2013-07-04 05:14:34.507813 | 2013-07-04 05:14:34.5 | 2013-07-04 05:14:35 |
|  5 | 05:14:45.348 | 2013-07-04 05:14:45.832031 | 2013-07-04 05:14:45.1 | 2013-07-04 05:14:45 |

First, we should gerenrate a table definition file:

./create_defs.pl --db test --table t > include/table_defs.h

Then recompile the tool and split a table space that contains records of the table t:

./page_parser -f /var/lib/mysql/test/t.ibd

The next step is to get records from the InnoDB index:

$ ./constraints_parser -5f pages-1372929630/FIL_PAGE_INDEX/0-30/00000000-00000003.page  2> /dev/null
-- Page id: 3, Format: COMPACT, Records list: Valid, Expected records: (5 5)
0000000007DB    A6000001A20110  t       1       "05:05:10.0"    "2013-07-04 05:05:10.0" "2013-07-04 05:05:10.0" "2013-07-04 05:05:10.0"
0000000007DE    A8000001530110  t       2       "05:14:24.4140" "2013-07-04 05:14:24.125000"    "2013-07-04 05:14:24.40"        "2013-07-04 05:14:25.0"
0000000007EC    B2000001A40110  t       3       "05:14:32.5660" "2013-07-04 05:14:32.207031"    "2013-07-04 05:14:32.30"        "2013-07-04 05:14:32.0"
0000000007ED    B3000001A50110  t       4       "05:14:34.3440" "2013-07-04 05:14:34.507813"    "2013-07-04 05:14:34.50"        "2013-07-04 05:14:35.0"
0000000007FB    BD000001670110  t       5       "05:14:45.3480" "2013-07-04 05:14:45.832031"    "2013-07-04 05:14:45.10"        "2013-07-04 05:14:45.0"
-- Page id: 3, Found records: 5, Lost records: NO, Leaf page: YES

To load this dump use the LOAD DATA INFILE command that’s generated by the contraints_parser, I then redirected it to /dev/null in the example above.

As you can see it’s pretty straightforward. There are two notes:

  • Because of format differences the tool can detect whether the field is in the new format or old
  • For the TIME field (w/o fractional part) in the new format you need to give a hint to contraints_parser. This is -6 option. Let me illustrate this

Before 5.6.4 TIME was packed in three bytes: DD×24×3600 + HH×3600 + MM×60 + SS. As of 5.6.4 it still uses three bytes, but format is different:

TIME new format 

bits description values
1 sign 1= non-negative, 0= negative
1 unused reserved for future extensions
10 hour 0-838
6 minute 0-59
6 second 0-59

If a field is created without a fractional part it’s impossible to determite the format from a field value. Let’s take a table:

  `id` int(11) AUTO_INCREMENT,
  `t1` time ,
  `t2` time(3) ,
  PRIMARY KEY (`id`)

mysql> select * from t;
| id | t1       | t2           |
|  1 | 11:01:17 | 11:01:17.000 |
|  2 | 11:01:17 | 11:01:17.125 |
|  3 | 11:01:17 | 11:01:17.125 |

If there is no fractional part constraints_parser assumes old format. So if we try to recover the the records from the table above the result will be wrong:

$ ./constraints_parser -5f pages-1373023772/FIL_PAGE_INDEX/0-31/00000000-00000003.page
-- Page id: 3, Format: COMPACT, Records list: Valid, Expected records: (3 3)
000000000807    C5000001AC0110  t8      1       "12:32:17"      "11:01:17.0"
000000000807    C5000001AC011C  t8      2       "12:32:17"      "11:01:17.1250"
000000000807    C5000001AC0128  t8      3       "12:32:17"      "11:01:17.1250"
-- Page id: 3, Found records: 3, Lost records: NO, Leaf page: YES

Thus, we need to give a hint, then TIME values are correct:

./constraints_parser -5f pages-1373023772/FIL_PAGE_INDEX/0-31/00000000-00000003.page -6
-- Page id: 3, Format: COMPACT, Records list: Valid, Expected records: (3 3)
000000000807    C5000001AC0110  t8      1       "11:01:17.0"    "11:01:17.0"
000000000807    C5000001AC011C  t8      2       "11:01:17.0"    "11:01:17.1250"
000000000807    C5000001AC0128  t8      3       "11:01:17.0"    "11:01:17.1250"
-- Page id: 3, Found records: 3, Lost records: NO, Leaf page: YES

The post Recovering temporal types in MySQL 5.6: TIME, TIMESTAMP and DATETIME appeared first on MySQL Performance Blog.


Recovery deleted ibdata1

Recently I had a case when a customer deleted the InnoDB main table space – ibdata1 – and redo logs – ib_logfile*.

MySQL keeps InnoDB files open all the time. The following recovery technique is based on this fact and it allowed to salvage the database.

Actually, the files were deleted long time ago – 6 months or so. As long as file is open physically it still exits in the file system and reachable to processes which have opened it.

Thus, from user perspective nothing has changed after the deletion. By the way, this is a good reason to monitor existence of these files!
But after the restart InnoDB will detect that there is neither system table space nor log files, so it will create empty ones. The InnoDB dictionary will be empty and InnoDB won’t be able to use a bunch of existing ibd files. This situation will be a job for ibdconnect, but as long as MySQL isn’t restarted it is possible to recover database fast. Let me illustrate how.

Let’s simulate the accident. For that I will delete /var/lib/mysql/ib* files, while sysbench generates read/write activity:


root@localhost:~# sysbench --num-threads=16 --max-requests=0 --test=oltp --oltp-table-size=1000000 --max-time=3600 --mysql-user=root run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

No DB drivers specified, using mysql
Running the test with following options:
Number of threads: 16

Doing OLTP test.
Running mixed OLTP test
Using Special distribution (12 iterations,  1 pct of values are returned in 75 pct cases)
Using "BEGIN" for starting transactions
Using auto_inc on the id column
Threads started!


root@localhost:/var/lib/mysql# rm ib*

Now the files have gone, but MySQL is still running. They don’t exist in /var/lib/mysql, but can be reachable in /proc file system:

root@localhost:/var/lib/mysql# ls -la  /proc/14101/fd/ | grep -e ibdata -e ib_
lrwx------ 1 root  root  64 Aug  7 23:29 3 -> /var/lib/mysql/ibdata1 (deleted)
lrwx------ 1 root  root  64 Aug  7 23:29 8 -> /var/lib/mysql/ib_logfile0 (deleted)
lrwx------ 1 root  root  64 Aug  7 23:29 9 -> /var/lib/mysql/ib_logfile1 (deleted)

where 14101 is the PID of mysqld process.

However, we can’t copy them back because at any given point of time, there are modified pages in the buffer pool. Those pages are not written on disk and will be lost if the changes are not permanently written. This can lead to corruption and data loss.

For the same reason we can’t make MySQL backup by just copying the files.

So, we have to make sure all modifications are written to the disk.

For that we have to stop any further writes and wait until InnoDB flushes all pages.

To stop write activity we can either stop application or lock tables:

mysql> flush tables with read lock;
Query OK, 0 rows affected (0.37 sec)

Now let’s wait until all dirty pages are flushed on disk. For that we will monitor checkpoint age. The checkpoint age is a difference between current log sequence number and last checkpoint in “SHOW ENGINE INNODB STATUS” output. If checkpoint age is zero, all pages are flushed:

Log sequence number 363096003
Log flushed up to   363096003
Last checkpoint at  363096003
Max checkpoint age    7782360

To speed up flushing we can set dirty pages percentage to zero:

mysql> set global innodb_max_dirty_pages_pct=0;
Query OK, 0 rows affected (0.01 sec)

It is also important to ensure all other background processes have finished their work.
One of them is the insert buffer thread. Its size should be not more than 1(it’s never less than 1):

Ibuf: size 1, free list len 398, seg size 400,

Another thread writing in background is the purge thread.
It should purge all transactions up to very last

Trx id counter 0 16644
Purge done for trx's n:o < 0 16644 undo n:o < 0 0

But if the last transaction wasn’t something that requires a purge operation (SELECT for example) Trx id counter will be bigger.
In that case at least ensure InnoDB isn’t doing any writes:

I/O thread 0 state: waiting for i/o request (insert buffer thread)
I/O thread 1 state: waiting for i/o request (log thread)
I/O thread 2 state: waiting for i/o request (read thread)
I/O thread 3 state: waiting for i/o request (write thread)
Pending normal aio reads: 0, aio writes: 0,
 ibuf aio reads: 0, log i/o's: 0, sync i/o's: 0
Pending flushes (fsync) log: 0; buffer pool: 0
332 OS file reads, 47 OS file writes, 32 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s

When all modified pages are flushed it is safe now to copy InnoDB files back:

root@localhost:/var/lib/mysql# cp /proc/14101/fd/3 /var/lib/mysql/ibdata1
root@localhost:/var/lib/mysql# cp /proc/14101/fd/8 /var/lib/mysql/ib_logfile0
root@localhost:/var/lib/mysql# cp /proc/14101/fd/9 /var/lib/mysql/ib_logfile1

Let’s fix the owner:

root@localhost:/var/lib/mysql# chown -R mysql ib*

And restart MySQL:

root@localhost:/var/lib/mysql# /etc/init.d/mysql restart

After the restart all InnoDB tables are reachable:

mysql> select count(*) from sbtest;
| count(*) |
|  1000000 |
1 row in set (0.19 sec)


  • Add to your monitoring system checks that InnoDB files ibdata and ib_logfile* do exist
  • Do not restart MySQL until the further recovery strategy is clear

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com