Nov
30
2011
--

An update on Percona Live MySQL Conference & Expo 2012

We announced a while back that we were going to continue the traditional MySQL conference in Santa Clara, because O’Reilly wasn’t doing it anymore. But we haven’t given an update in a while. Here’s the current status:

  • We created a conference committee.
  • We created a conference website that allows people to create an account and submit proposals. We’re currently looking for session proposals; more on that later.
  • We have been working hard on the logistics of the venue, food and beverage, audio/visual, power, networking, exhibit hall, registration, hotel rooms, and a number of other things.

It probably won’t surprise you to hear that we’ve been learning a lot. For example, we had a few rough spots with the conference website; this is a much harder task than it seems. But that’s under control.

Fortunately, we have three great people here at Percona working on this. There’s Mauricio Stekl, the webmaster. Kortney Runyan is our full-time conference director, and if you went to our New York or London Percona Live events you saw her in action. And we have our new Chief Marketing Officer, Terry Erisman, who has experience running big events at previous employers such as DotNetNuke. Terry and Kortney have assembled a team of service providers to help line everything up. The staff at the Santa Clara Hyatt have been invaluable, too.

The result is that the conference is going to be very similar to O’Reilly’s. The A/V and room layout teams, for example, have a lot of experience from previous events, and they’re setting things up identically. We have the same rooms for the sessions, the same expo hall, the same layout, the same stage, the same A/V, and so on. Our guiding principle is that there should be no surprises when you show up in Santa Clara. O’Reilly did a great job with this show, and we think it’s hard to go wrong imitating it.

Now, to make a conference succeed — and I think we all agree that everyone needs this event to succeed — we need a few things. We need attendees, sponsors, exhibitors, speakers, and content.

This is where you come into the picture. If you can help us promote the event, please do. Tell your friends; put promotional banners on your blog; and most of all, help us get great session proposals.

What is a great session proposal? First of all, a great session proposal is audience-focused. Your audience is dual: the people who come to hear you speak, and the conference committee. Submit a session that’s focused on the needs of the attendees, but make sure that it’s also understandable to the committee, who have to assess and vote on the submissions. Make sure that after the committee approves the session, attendees who are browsing through the event program will understand whether the talk is appropriate and interesting to them. There are several good blog posts on how to write a good proposal.

What kind of content do we need? In previous years, I think we had some assumptions that we didn’t recognize. The community submitted a wide variety of talks about interesting things such as how they scaled MySQL, how they abused it in ways its creators didn’t intend and got good (or at least interesting) results, how to do super-advanced and extreme things with MySQL, how to use MySQL for a Ruby on Rails application with MongoDB and Redis too, and so forth. The commercial vendors submitted talks about their solutions and how they can benefit MySQL users. And I think we always just relied on a cadre of the core MySQL developers to submit talks about what is new in MySQL, the MySQL roadmap for the future, and bread-and-butter talks about topics such as how replication works, new features in the query optimizer, the storage engine architecture, and so on. The point is that this year, we need to make this assumption explicit: we need full coverage of the breadth and depth of topics, not just the mad scientist topics.

So, please get your creative juices flowing, and submit those session and tutorial proposals on DRBD, partitioning, community toolkits, security, using MySQL with NoSQL, geospatial functionality, cloud computing, Cluster, search, sharding, memcached, high availability, DevOps, Big Data, SSD storage, data warehousing, and all those other great topics. If you need inspiration, look at the schedule for previous years.

The most important thing is that you share what you know — even if it doesn’t seem conference-worthy to you, trust me, it is. All you have to do is show up and tell a story. Stories win, every time. Your story doesn’t have to be extreme, or expert, or hardcore, or funny, or charismatic. It just needs to be real. It is so much better to listen to someone who’s just a DBA at a company somewhere, telling about his experiences setting up replication, than to come to a session with someone who’s trying to impress you. You’ll be among rock stars who’ve built huge systems with MySQL — don’t try too hard, just be yourself and try to be helpful to your audience.

And please blog, tweet, retweet, tell your friends, tell your local user groups and meetups, and otherwise help spread the word about this event. We’ll see you in Santa Clara in April!

Nov
29
2011
--

Beware the Innodb Table Monitor

As I stated in my last post, I decided to use the Innodb Table Monitor to diagnose an index count mismatch error a customer found in their mysqld.err log, and to verify whether the problem still existed.

The Innodb Table Monitor has existed in Innodb basically forever (the MySQL manual discusses it as far back as 4.1), and it dates from a time when what was going on inside of Innodb was a lot murkier than it is now. To use it, you create a table (in any database you choose), like this:

CREATE TABLE innodb_table_monitor (a INT) ENGINE=INNODB;

Creating this table is the (rather primitive) signal that tells Innodb to start the Table Monitor. The monitor dumps the contents of the data dictionary to the MySQL error log; the output looks something like this:

===========================================
090420 12:09:32 INNODB TABLE MONITOR OUTPUT
===========================================
--------------------------------------
TABLE: name SYS_FOREIGN, id 0 11, columns 7, indexes 3, appr.rows 1
  COLUMNS: ID: DATA_VARCHAR DATA_ENGLISH len 0;
           FOR_NAME: DATA_VARCHAR DATA_ENGLISH len 0;
           REF_NAME: DATA_VARCHAR DATA_ENGLISH len 0;
           N_COLS: DATA_INT len 4;
           DB_ROW_ID: DATA_SYS prtype 256 len 6;
           DB_TRX_ID: DATA_SYS prtype 257 len 6;
  INDEX: name ID_IND, id 0 11, fields 1/6, uniq 1, type 3
   root page 46, appr.key vals 1, leaf pages 1, size pages 1
   FIELDS:  ID DB_TRX_ID DB_ROLL_PTR FOR_NAME REF_NAME N_COLS
  INDEX: name FOR_IND, id 0 12, fields 1/2, uniq 2, type 0
   root page 47, appr.key vals 1, leaf pages 1, size pages 1
   FIELDS:  FOR_NAME ID
  INDEX: name REF_IND, id 0 13, fields 1/2, uniq 2, type 0
   root page 48, appr.key vals 1, leaf pages 1, size pages 1
   FIELDS:  REF_NAME ID
--------------------------------------

Basically, you can see every table, every column, and every index in the output. Once you have what you need, you just DROP the table and the monitor stops. It’s been around forever, so it should be stable, right?
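
Stopping it really is just dropping the same table you created to start it:

```sql
-- dropping the monitor table stops the dumps to the error log
DROP TABLE innodb_table_monitor;
```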

So, back to my customer. It may be important to note that this customer had a few hundred tables. I scheduled a time with the customer during off hours to connect to their master and collect the output so we could confirm whether Innodb really did have all the indexes or not. I collected the data just fine, and found that the indexes Innodb knew about were indeed correct, and that the error messages the customer saw were apparently only transitory.

However, as I was writing an email to the customer to explain this, I got an email from him: he had seen a big spike in io-wait on the server during the time I ran the Innodb Table Monitor (it was just for a minute), and he was very glad we had waited until after hours. The server itself was fine, and there didn’t seem to be any lasting adverse effects from running the monitor, but something was clearly a lot busier than expected during execution.

He had data collected with Percona Toolkit’s pt-stalk during that window and asked me to take a look and see what happened. I looked through the data and found not only high io-wait on the system, but also over a hundred user queries in SHOW PROCESSLIST stuck in the ‘Opening Tables’ state. Uh-oh.

I dug through the source code and found a few interesting things about the table monitor:

  • It holds the dict_sys->mutex for the entire time it runs
  • Before outputting the information for each table, it calls dict_update_statistics and appears to force table statistics to be regenerated for each table
  • It’s possible I’m mistaken, but it sure looks like dict_update_statistics is called twice for each table, at least in the latest Percona Server 5.1 in: storage/innodb_plugin/dict/dict0load.c: line 226, and again in: storage/innodb_plugin/dict/dict0dict.c: line 4883

So, the io-wait spike is explained by table statistics being regenerated, and the SHOW PROCESSLIST connections stuck in ‘Opening Tables’ were waiting for the dict_sys->mutex, which is required for opening tables in Innodb.

It would be better for the Table Monitor if:

  • Table stats weren’t regenerated unless necessary
  • dict_sys->mutex weren’t such a big hammer

These points are probably moot, since I don’t think most people use this feature very often.

Just because it’s old doesn’t mean you should trust it.
Nov
29
2011
--

Innodb vs MySQL index counts

I had a customer recently who found a few strange errors in their mysqld.err log:

[ERROR] Table database_name/table_name contains 8 indexes inside InnoDB, which is different from the number of indexes 7 defined in the MySQL

This customer was running Percona Server 5.1 and they got this error on two tables during a maintenance window when they were adding indexes to the same tables.  We had a suspicion that it had something to do with Fast index creation in Innodb, and that it had been corrected when the ALTER TABLE completed because the errors had not recurred.

Reproducing the error on a test system is simple:

  1. create an Innodb table
  2. make a copy of the .frm file
  3. do an ALTER TABLE to add an index
  4. then copy the old .frm file back into place
  5. re-open the table (you might need a FLUSH TABLES or a mysqld restart here)
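
A sketch of those steps (table and path names are hypothetical; try this only on a throwaway test instance):

```sql
-- 1. create an Innodb table
CREATE TABLE test.t (a INT, b INT) ENGINE=InnoDB;
-- 2. (from the shell) save a copy of the .frm file:
--      cp /var/lib/mysql/test/t.frm /tmp/t.frm.bak
-- 3. add an index; fast index creation updates Innodb's dictionary
ALTER TABLE test.t ADD INDEX idx_b (b);
-- 4. (from the shell) copy the stale .frm back into place:
--      cp /tmp/t.frm.bak /var/lib/mysql/test/t.frm
-- 5. force the table to be closed and re-opened, then watch mysqld.err
FLUSH TABLES;
SELECT * FROM test.t LIMIT 1;
```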

From my testing, I saw that the error only happened when the table was opened and not on every table access.  So, it was a possibility that the indexes were out of sync and we weren’t seeing new errors in the log simply because the table hadn’t been re-opened.

But, before getting too crazy, how can we verify the problem still exists?  We need a way to compare the output of SHOW CREATE TABLE to what Innodb thinks.  What Innodb thinks is in the Innodb Data dictionary.

  • The first recommendation I got was to simply use the INFORMATION_SCHEMA.INNODB_SYS_INDEXES table, which exists in Percona Server 5.1, but doesn’t appear in MySQL until 5.6 (if the manual is to be trusted).  I’d probably consider this on a newer version of Percona Server or MySQL 5.6.
  • Another person (I’m looking at you, Baron) was averse to trusting INNODB_SYS_INDEXES after some bad experiences with it, and suggested the Innodb Table Monitor instead; see my next post for how that turned out. This basically regurgitates the entire Innodb data dictionary to the mysqld error log file.
  • If I had to do it over again, I think I’d simply try FLUSH TABLES table1, table2; to force the tables to be closed and reopened, and see if the error message comes back. That might have something of a performance impact, but it seems to be the most reliable approach.

In this case, it turned out that the indexes were not out of sync, so I didn’t have to do anything to fix it.

However, if I did have to fix it, I found on my test table that the extra index in Innodb could be removed by doing:

ALTER TABLE table_name ENGINE=Innodb;

This, of course, rebuilds the whole table based on the .frm table definition and removes the extra index in Innodb, which might not be desirable, but at least you can re-add it later. It’s not the greatest thing to do on a live production database master, though, if you can help it.

Another solution might be to figure out what index was missing via the Innodb data dictionary (more on that in a minute), create a separate table identical to the existing .frm, add that index to it, and copy the new .frm back over the original.  Kind of scary.

My advice is to ensure the error still exists before trying to fix it.

Nov
29
2011
--

Percona Replication Manager, a solution for MySQL high availability with replication using Pacemaker

Over the last year, the frustration of many of us at Percona with MMM has grown to the point where we started looking at other ways of achieving high availability using MySQL replication. One of the weaknesses of MMM is its communication layer, so instead of reinventing a flat tire, Baron Schwartz and I decided to develop a solution using Pacemaker, a well known and established cluster manager with a bulletproof communication layer. One of the great things about Pacemaker is its flexibility, but flexibility may result in complexity. With the help of people from the Pacemaker community, namely Florian Haas and Raoul Bhatia, I have been able to modify the existing MySQL Pacemaker resource agent so that it survived our replication tests and offered behavior pretty similar to MMM regarding virtual IP address (VIP) management. We decided to call this solution PRM, for Percona Replication Manager. All the parts are open source and available under the GPL license.

Keep in mind this solution is hot off the press; consider it alpha. Like I said above, it survived testing in a very controlled environment, but it is young and many issues/bugs are likely to be found. Also, it is different from Yoshinori Matsunobu’s MHA solution, and in fact it is quite a complement to it. One of my near-term goals is to integrate with MHA for master promotion.

The solution is basically made of 3 pieces:

  • The Pacemaker cluster manager
  • A Pacemaker configuration
  • A MySQL resource agent

Here I will not cover the Pacemaker installation since this is fairly straightforward and covered elsewhere. I’ll discuss the MySQL resource agent and the supporting configuration while assuming basic knowledge of Pacemaker.

But before we start, what does this solution offer?

  • Reader and writer VIP behaviors similar to MMM
  • If the master fails, a new master is promoted from the slaves; no master-to-master setup is needed. Selection of the master is based on scores published by the slaves: the more up-to-date slaves have higher scores for promotion
  • Some nodes can be dedicated to being slaves only, or made less likely to become master
  • A node can be the preferred master
  • If replication on a slave breaks or lags beyond a defined threshold, its reader VIP(s) are removed. MySQL is not restarted.
  • If no slaves are healthy, all the VIPs, reader and writer, will be located on the master
  • During a master switch, connections are killed on the demoted master to avoid replication conflicts
  • All slaves run in read_only mode
  • Simple administrative commands can remove the master role from a node
  • Pacemaker stonith devices are supported
  • No logical limit on the number of nodes
  • Easy to add nodes

In order to set up the solution you’ll need my version of the MySQL resource agent; it is not yet pushed to the main Pacemaker resource agents branch. More testing and cleaning will be needed before that happens. You can get the resource agent from here:

https://github.com/y-trudeau/resource-agents/raw/master/heartbeat/mysql

You can also get the whole branch from here:

https://github.com/y-trudeau/resource-agents/zipball/master

On my Ubuntu Lucid VM, this file goes into the /usr/lib/ocf/resource.d/heartbeat/ directory.

To use this agent, you’ll need a Pacemaker configuration. As a starting point, I’ll discuss the configuration I use during my tests.

node testvirtbox1 \
        attributes IP="10.2.2.160"
node testvirtbox2 \
        attributes IP="10.2.2.161"
node testvirtbox3 \
        attributes IP="10.2.2.162"
primitive p_mysql ocf:heartbeat:mysql \
        params config="/etc/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" \
               socket="/var/run/mysqld/mysqld.sock" replication_user="root" \
               replication_passwd="rootpass" max_slave_lag="15" evict_outdated_slaves="false" \
               binary="/usr/bin/mysqld_safe" test_user="root" \
               test_passwd="rootpass" \
        op monitor interval="5s" role="Master" OCF_CHECK_LEVEL="1" \
        op monitor interval="2s" role="Slave" OCF_CHECK_LEVEL="1"
primitive reader_vip_1 ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.171" nic="eth0"
primitive reader_vip_2 ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.172" nic="eth0"
primitive reader_vip_3 ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.173" nic="eth0"
primitive writer_vip ocf:heartbeat:IPaddr2 \
        params ip="10.2.2.170" nic="eth0" \
        meta target-role="Started"
ms ms_MySQL p_mysql \
        meta master-max="1" master-node-max="1" clone-max="3" clone-node-max="1" notify="true" globally-unique="false" target-role="Master" is-managed="true"
location No-reader-vip-1-loc reader_vip_1 \
        rule $id="No-reader-vip-1-rule" -inf: readerOK eq 0
location No-reader-vip-2-loc reader_vip_2 \
        rule $id="No-reader-vip-2-rule" -inf: readerOK eq 0
location No-reader-vip-3-loc reader_vip_3 \
        rule $id="No-reader-vip-3-rule" -inf: readerOK eq 0
location No-writer-vip-loc writer_vip \
        rule $id="No-writer-vip-rule" -inf: writerOK eq 0
colocation reader_vip_1_dislike_reader_vip_2 -200: reader_vip_1 reader_vip_2
colocation reader_vip_1_dislike_reader_vip_3 -200: reader_vip_1 reader_vip_3
colocation reader_vip_2_dislike_reader_vip_3 -200: reader_vip_2 reader_vip_3
property $id="cib-bootstrap-options" \
        dc-version="1.0.11-a15ead49e20f047e129882619ed075a65c1ebdfe" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="3" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1322236006"
property $id="mysql_replication" \
        replication_info="10.2.2.162|mysql-bin.000090|106"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Let’s review the configuration. It begins with 3 node entries defining the 3 nodes I have in my cluster. One attribute is required for each node: the IP address that will be used for replication. This is a real IP address, not a reader or writer VIP. This attribute allows the use of a private network for replication if needed.

Next is the mysql primitive resource declaration. This primitive defines the mysql resource on each node and has many parameters; here are the ones I had to define:

  • config: The path of the my.cnf file. Remember that Pacemaker will start MySQL, not the regular init.d script
  • pid: The pid file. This is used by Pacemaker to know if MySQL is already running. It should match the my.cnf pid_file setting.
  • socket: The MySQL unix socket file
  • replication_user: The user to use when setting up replication. It is also currently used for the ‘CHANGE MASTER TO’ command, something that should/will change in the future
  • replication_passwd: The password for the above user
  • max_slave_lag: The maximum allowed slave lag in seconds, if a slave lags by more than that value, it will lose its reader VIP(s)
  • evict_outdated_slaves: It is mandatory to set this to false; otherwise Pacemaker will stop MySQL on a slave that lags behind, which will absolutely not help its recovery.
  • test_user and test_passwd: The credentials used to test MySQL. The default is to run SELECT COUNT(*) on the mysql.user table, so the given user should at least have SELECT on that table.
  • op monitor: An entry is needed for each role, Master and Slave. Intervals must not be the same.

Following the mysql primitive declaration come the primitives for the 3 reader VIPs and the one writer VIP. Those are straightforward, so I’ll skip a detailed description. The next interesting element is the master-slave “ms” declaration. This is how Pacemaker defines an asymmetrical resource with a master and slaves. The only thing that may need to change here is clone-max=”3″, which should match the number of database nodes you have.

The handling of the VIPs is the truly new thing in the resource agent. I am grateful to Florian Haas, who told me to use node attributes to keep Pacemaker from overreacting. The availability of reader or writer VIPs on a node is controlled by the readerOK and writerOK attributes and the location rules. An infinite negative weight is given when a VIP should not be on a host. I also added a few colocation rules to help spread the reader VIPs across all the nodes.

As a final thought on the Pacemaker configuration, remember that for a Pacemaker cluster to run correctly with only 2 nodes, you should set the quorum policy to ignore. Also, this example configuration has no stonith devices defined, so stonith is disabled. At the end of the configuration, you’ll notice the replication_info cluster attribute. You don’t have to define this; the mysql RA will add it automatically when the first node is promoted to master.

There are not many requirements regarding the MySQL configuration; Pacemaker will automatically add “skip-start-slave” for saner behavior. One important setting is “log_slave_updates = OFF” (the default value). In some cases, if slaves are logging replication updates, it may cause failover issues. Also, the solution relies on the read_only setting on the slaves, so make sure the application’s database user doesn’t have the SUPER privilege, which overrides read_only.
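
As an illustration only (the paths and values are assumptions matching the example Pacemaker configuration above, not a recommended template), a my.cnf fragment for a PRM-managed node might look like:

```ini
[mysqld]
# must match the pid and socket parameters of the p_mysql primitive
pid-file = /var/run/mysqld/mysqld.pid
socket   = /var/run/mysqld/mysqld.sock
# replication basics; server_id must be unique per node
log_bin   = mysql-bin
server_id = 1
# leave at the default OFF: logging replication updates may cause failover issues here
log_slave_updates = OFF
```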

Like I mentioned above, this project is young. In the future, I’d like to integrate MHA to benefit from its ability to bring all the nodes to a consistent level. Also, the security around the solution should be improved, a fairly easy task I believe. Of course, I’ll work with the maintainers of the Pacemaker resource agents to include it in the main branch once it has matured a bit.

Finally, if you are interested in this solution but have problems setting it up, just contact us at Percona; we’ll be pleased to help.

Nov
29
2011
--

Virident FlashMAX MLC in tpcc-mysql workload

As I mentioned in my previous post on the Virident FlashMAX MLC, besides the sysbench benchmark I also ran tpcc-mysql (to compare the performance of the Virident FlashMAX vs. the Fusion-io ioDrive Duo).

The report with the results is here: http://www.percona.com/files/white-papers/virident-mlc-tpcc.pdf

The graphical result for tpcc-mysql 5000W:

My conclusions from this benchmark:

  • Virident FlashMAX provides stable performance, with a denser (less variable) throughput.
  • In addition to stability, in many cases there is also better throughput in MySQL (up to 40%) using the Virident FlashMAX card.

DISCLOSURE: This benchmark was done as part of our consulting practice, for which we were compensated by Virident. However, the benchmark was run independently of Virident, and reflects our opinion of this product.


Nov
29
2011
--

Avoiding auto-increment holes on InnoDB with INSERT IGNORE

Are you using InnoDB tables on MySQL version 5.1.22 or newer? If so, you probably have gaps in your auto-increment columns. A simple INSERT IGNORE query creates gaps for every ignored insert, but this behaviour is undocumented. A documentation bug has already been submitted.

Firstly, we will start with a simple question: why do we have gaps in auto-increment columns? Secondly, I will show you a trick to mimic the INSERT IGNORE behaviour without losing auto-increment values. Let’s start!

Why do we have gaps?

InnoDB keeps an auto_increment counter for each table; when a new value is needed, it increments that counter and assigns the new value to the column. Prior to MySQL 5.1.22, InnoDB used a method called “Traditional“ to access that counter. This method takes a special table-level lock called AUTO-INC that remains until the end of the statement or transaction. Because of this, two queries can’t hold the AUTO-INC lock at the same time, so we lose concurrency and performance. The problems are even worse with long-running statements like INSERT INTO table1 … SELECT … FROM table2.

In version 5.1.22 and later, the lock algorithm for the auto_increment value is configurable, and you can select among different algorithms using innodb_autoinc_lock_mode. The default value is 1, a new algorithm called “consecutive“. Thanks to this new algorithm, a simple single-row or multi-row INSERT/REPLACE uses a lightweight mutex instead of the AUTO-INC table lock. We have recovered the concurrency and the performance, but at a small cost: queries like INSERT IGNORE and INSERT … ON DUPLICATE KEY UPDATE produce gaps in the auto_increment column.

To avoid this little inconvenience it is possible to return to the traditional method by changing innodb_autoinc_lock_mode to 0, but at the cost of performance and concurrency.
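
To see the gaps for yourself, here is a quick illustration (the table name is hypothetical; this assumes the default innodb_autoinc_lock_mode = 1):

```sql
CREATE TABLE gap_demo (
  id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name INT,
  UNIQUE KEY (name)
) ENGINE=InnoDB;

INSERT IGNORE INTO gap_demo(name) VALUES (1); -- inserted
INSERT IGNORE INTO gap_demo(name) VALUES (1); -- ignored, but the counter still advances
INSERT IGNORE INTO gap_demo(name) VALUES (2); -- inserted, leaving a gap in id
```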

How can I solve this problem for INSERT IGNORE?

As I mentioned before, it is not documented that INSERT IGNORE creates gaps, so maybe you have been unaware of this problem for years. You can mimic the INSERT IGNORE behaviour using a special mutex table, as explained on Baron’s blog, to get rid of the gaps problem.

A “mutex” table is a clever trick that allows joining tables while keeping them independent of each other in a query. This property allows interesting queries that are not otherwise possible.

This is our mutex table. We only need to insert one integer value:

create table mutex(
i int not null primary key
);
insert into mutex(i) values (1);

Our InnoDB table with auto increment column will be like this:

CREATE TABLE `foo` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uniqname` (`name`)
) ENGINE=InnoDB;

Insert a value using a LEFT OUTER JOIN:

insert into foo(name) select 1 from mutex left outer join foo on foo.name=1 where mutex.i = 1 and foo.name is null;
Query OK, 1 row affected (0.00 sec)

Insert the same value multiple times. As you will see, the INSERT is ignored and no rows are inserted. The same behaviour as INSERT IGNORE:

insert into foo(name) select 1 from mutex left outer join foo on foo.name=1 where mutex.i = 1 and foo.name is null;
Query OK, 0 rows affected (0.00 sec)
insert into foo(name) select 1 from mutex left outer join foo on foo.name=1 where mutex.i = 1 and foo.name is null;
Query OK, 0 rows affected (0.00 sec)
insert into foo(name) select 1 from mutex left outer join foo on foo.name=1 where mutex.i = 1 and foo.name is null;
Query OK, 0 rows affected (0.00 sec)

Now check the auto_increment counter:

show create table foo\G
*************************** 1. row ***************************
Table: foo
Create Table: CREATE TABLE `foo` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `uniqname` (`name`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=latin1

No gaps at all.

This trick was discovered by Michael Rikmas when we were working on a consulting case for a customer. So, if this saves you from doing an ALTER TABLE to change the size of your auto-increment column, send him a beer ;)

Nov
25
2011
--

Pipelined ORDER BY: The Third Power

Today’s installment is about the third power of indexing: satisfying order by clauses with an index, thus avoiding the need to sort the data.

The installment consists of two parts: the chapter intro and the first section about the interactions between the order by and the where clause.

Nov
24
2011
--

Helgrinding MySQL with InnoDB for Synchronisation Errors, Fun and Profit

It is no secret that bugs related to multithreading (deadlocks, data races, starvation, etc.) have a big impact on an application’s stability and are at the same time hard to find due to their nondeterministic nature. Any tool that makes finding such bugs easier, preferably before anybody is aware of their existence, is very welcome.

Enter the Helgrind tool of the Valgrind dynamic analysis suite. Now, Valgrind does not need much introduction, especially to MySQL server developers. The Valgrind Memcheck tool, which is synonymous with Valgrind itself, is relatively widely used in MySQL server development: there is support for it in the MySQL Test Framework and DBQP, and the server source is properly annotated to get the most out of Memcheck. Not everything is perfect, though: the manual is only partially correct on how to properly build the server to use it (bug 61587). We at Percona regularly use Memcheck for our development and upstream testing, with useful results (bug 61986, bug 61985, …).

Helgrind does not enjoy the same level of support, and I think that’s a shame. What I have found to be useful in daily work is the ability to run a selected MTR testcase under Helgrind. (For the time being, running the whole MySQL testsuite is not very practical due to the 100x slowdown imposed by Helgrind.) For this purpose, I have patched mysql-test-run.pl to accept a new option: --helgrind. The patch is a copy-paste-modify of the already existing Callgrind support and is not going to win any Cleanest Feature of the Year awards, but hey, this is MTR we are talking about. For a proper implementation for our purposes, Patrick has added preliminary Helgrind support to DBQP.

For 5.1:

mtr-helgrind-51.patch

For 5.5:

mtr-helgrind-55.patch

Now let’s see what kind of goodies Helgrind finds in the MySQL server. Let’s take a recent 5.1 bzr version (pre-5.1.61) and a single test, innodb_plugin.innodb_bug53674. The test was chosen for no particular reason except that it is a quick-running one.
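
With one of the patches above applied, invoking that test under Helgrind would look something like this (a sketch: only the --helgrind option comes from the patch, the rest is the usual MTR invocation from a source build):

```shell
cd mysql-test
./mysql-test-run.pl --helgrind innodb_plugin.innodb_bug53674
```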

==9090== Possible data race during read of size 8 at 0xff27758 by thread #4
 ==9090==    at 0x6B67C51: log_io_complete (sync0sync.ic:150)
 ==9090==    by 0x6B34515: fil_aio_wait (fil0fil.c:4512)
 ==9090==    by 0x6BC58E0: io_handler_thread (srv0start.c:474)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  This conflicts with a previous write of size 8 by thread #1
 ==9090==    at 0x6BC9FFF: mutex_spin_wait (sync0sync.c:441)
 ==9090==    by 0x6B751FC: mtr_commit (sync0sync.ic:221)
 ==9090==    by 0x6B39223: fsp_fill_free_list (fsp0fsp.c:1455)
 ==9090==    by 0x6B3DED8: fsp_header_init (fsp0fsp.c:1010)
 ==9090==    by 0x6BC72CE: innobase_start_or_create_for_mysql (srv0start.c:1514)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==  Address 0xff27758 is 216 bytes inside a block of size 664 alloc'd
 ==9090==    at 0x4C28FDF: malloc (vg_replace_malloc.c:236)
 ==9090==    by 0x6B70A62: mem_heap_create_block (mem0mem.c:333)
 ==9090==    by 0x6B66F3B: log_init (mem0mem.ic:443)
 ==9090==    by 0x6BC63B0: innobase_start_or_create_for_mysql (srv0start.c:1339)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)

The race here is between the reads and writes of mutex_t::waiters in mutex_get_waiters and mutex_set_waiters. Interestingly, there are comments next to them that say /* Here we assume that the [read|write] of a single word from memory is atomic */. While it is true that machine-native word stores/loads are technically atomic, i.e. you cannot observe any intermediate state, other CPU cores reading this same variable may still see the old value for quite some time after the store. Now, of course, it would be extremely unfair to say that the InnoDB developers do not understand parallelism. They do. The stacktraces above are incomplete due to inlining, but it’s not hard to find out that mutex_get_waiters is called from mutex_exit, and there we find a comment:

/* A problem: we assume that mutex_reset_lock word
 is a memory barrier, that is when we read the waiters
 field next, the read must be serialized in memory
 after the reset. A speculative processor might
 perform the read first, which could leave a waiting
 thread hanging indefinitely.
Our current solution call every second
 sync_arr_wake_threads_if_sema_free()
 to wake up possible hanging threads if
 they are missed in mutex_signal_object. */

So, this race is accounted for. Let’s go on.
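To see why the comment worries about a speculative read, it helps to play the four memory accesses out by hand. Below is a small sketch (names and state layout are mine, not InnoDB's actual code) that replays interleavings of the holder's store to lock_word and load of waiters against the waiter's store to waiters and re-check of lock_word. Every interleaving that respects each thread's program order is safe; hoisting the holder's load above its store opens exactly the lost-wakeup window described in the comment.

```python
from itertools import permutations

def replay(order):
    """Replay one interleaving of the four memory accesses in the mutex
    handoff and report whether the waiter is left hung (a lost wakeup)."""
    s = {"lock_word": 1, "waiters": 0,
         "holder_saw_waiters": 0, "waiter_saw_lock": 1}
    ops = {
        "H_store": lambda: s.update(lock_word=0),                      # mutex_reset_lock_word
        "H_load":  lambda: s.update(holder_saw_waiters=s["waiters"]),  # read waiters
        "W_store": lambda: s.update(waiters=1),                        # mutex_set_waiters(1)
        "W_load":  lambda: s.update(waiter_saw_lock=s["lock_word"]),   # re-check before sleeping
    }
    for name in order:
        ops[name]()
    signalled = bool(s["holder_saw_waiters"])  # holder signals only if it saw a waiter
    sleeps = s["waiter_saw_lock"] != 0         # waiter sleeps only if the lock looked held
    return sleeps and not signalled            # True = waiter hangs with no signal coming

# Every interleaving that keeps each thread's program order (store before load
# on both sides, Dekker-style) either signals the waiter or lets it see the
# free lock -- no hang is possible:
all_ops = ["H_store", "H_load", "W_store", "W_load"]
sc = [p for p in permutations(all_ops)
      if p.index("H_store") < p.index("H_load")
      and p.index("W_store") < p.index("W_load")]
assert not any(replay(p) for p in sc)

# But a speculative CPU that hoists the holder's load of waiters above its
# store to lock_word opens the window: both threads miss each other's store.
assert replay(["H_load", "W_load", "H_store", "W_store"])
```

This is why the periodic sync_arr_wake_threads_if_sema_free() sweep exists: it bounds the hang instead of preventing it, where a store-load memory barrier in mutex_exit would prevent it outright.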

==9090== Possible data race during write of size 8 at 0x6e29da8 by thread #15
 ==9090==    at 0x6BC3EFD: srv_monitor_thread (srv0srv.c:1994)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  This conflicts with a previous write of size 8 by thread #14
 ==9090==    at 0x6BC4594: srv_error_monitor_thread (srv0srv.c:1671)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  Location 0x6e29da8 is 0 bytes inside global var "srv_last_monitor_time"
 ==9090==  declared at srv0srv.c:424

This is a race between srv_monitor_thread and srv_error_monitor_thread InnoDB threads storing to srv_last_monitor_time. It’s hard for me to think of any harm that may happen here due to the way this variable is used. Moving on.

==9090== Possible data race during write of size 8 at 0x7386718 by thread #1
 ==9090==    at 0x6B08EA5: buf_page_get_gen (buf0buf.c:1597)
 ==9090==    by 0x6BD7CCF: trx_sys_create_doublewrite_buf (trx0sys.c:265)
 ==9090==    by 0x6BC76AD: innobase_start_or_create_for_mysql (srv0start.c:1692)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)
 ==9090==  This conflicts with a previous read of size 8 by thread #14
 ==9090==    at 0x6B0CAF2: buf_refresh_io_stats (buf0buf.c:3529)
 ==9090==    by 0x6BC45D1: srv_error_monitor_thread (srv0srv.c:1680)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  Address 0x7386718 is 184 bytes inside a block of size 632 alloc'd
 ==9090==    at 0x4C28FDF: malloc (vg_replace_malloc.c:236)
 ==9090==    by 0x6B70A62: mem_heap_create_block (mem0mem.c:333)
 ==9090==    by 0x6B07D3B: buf_pool_init (mem0mem.ic:443)
 ==9090==    by 0x6BC6379: innobase_start_or_create_for_mysql (srv0start.c:1310)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)

This one shows that accesses to the fields of buf_pool->stat are unprotected. These are buffer pool statistics counters: the numbers of pages read, written, evicted, and so on. The more interesting race is not the one shown but the one between different threads bumping the counters at the same time. As a result, some of the stores might be lost, leaving the counter values slightly smaller than they should be.
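The lost-update mechanism is easy to show by hand: a counter increment is really a load, an add, and a store, and two threads can interleave those steps so that one store clobbers the other. A minimal illustration (played out sequentially, so the interleaving is explicit rather than left to the scheduler):

```python
# Two "threads" each increment the counter once, interleaved as
# load / load / store / store -- the classic lost update.
counter = 0
t1_local = counter       # thread 1 loads 0
t2_local = counter       # thread 2 loads 0
counter = t1_local + 1   # thread 1 stores 1
counter = t2_local + 1   # thread 2 stores 1, clobbering thread 1's increment
assert counter == 1      # two increments happened, the counter shows one
```

For statistics counters this is usually tolerated as benign, which is exactly the judgment call discussed at the end of this post.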

==9090== Possible data race during write of size 8 at 0xff27738 by thread #1
 ==9090==    at 0x6B66B85: log_write_low (log0log.c:330)
 ==9090==    by 0x6B75144: mtr_commit (mtr0mtr.c:153)
 ==9090==    by 0x6BD7F87: trx_sys_create_doublewrite_buf (trx0sys.c:396)
 ==9090==    by 0x6BC76AD: innobase_start_or_create_for_mysql (srv0start.c:1692)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)
 ==9090==  This conflicts with a previous read of size 8 by thread #14
 ==9090==    at 0x6BC44E0: srv_error_monitor_thread (log0log.ic:407)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  Address 0xff27738 is 184 bytes inside a block of size 664 alloc'd
 ==9090==    at 0x4C28FDF: malloc (vg_replace_malloc.c:236)
 ==9090==    by 0x6B70A62: mem_heap_create_block (mem0mem.c:333)
 ==9090==    by 0x6B66F3B: log_init (mem0mem.ic:443)
 ==9090==    by 0x6BC63B0: innobase_start_or_create_for_mysql (srv0start.c:1339)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)

This one seems to suggest an unprotected access to the current LSN (log_sys->lsn), with a small wrinkle: the accesses are actually protected. log_write_low asserts that it holds the log system mutex, and log_get_lsn acquires it.

==9090== Possible data race during write of size 8 at 0x6e26200 by thread #1
 ==9090==    at 0x6B77778: os_file_write (os0file.c:2172)
 ==9090==    by 0x6B35B5E: fil_io (fil0fil.c:4432)
 ==9090==    by 0x6B67EDA: log_group_write_buf (log0log.c:1290)
 ==9090==    by 0x6B68496: log_write_up_to.part.12 (log0log.c:1472)
 ==9090==    by 0x6B0E0E9: buf_flush_write_block_low (buf0flu.c:999)
 ==9090==    by 0x6B0EEC1: buf_flush_batch (buf0flu.c:1231)
 ==9090==    by 0x6B69121: log_make_checkpoint_at (log0log.c:1640)
 ==9090==    by 0x6BD7F98: trx_sys_create_doublewrite_buf (trx0sys.c:399)
 ==9090==    by 0x6BC76AD: innobase_start_or_create_for_mysql (srv0start.c:1692)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==  This conflicts with a previous read of size 8 by thread #14
 ==9090==    at 0x6B79F6F: os_aio_refresh_stats (os0file.c:4408)
 ==9090==    by 0x6BC459F: srv_error_monitor_thread (srv0srv.c:1673)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  Location 0x6e26200 is 0 bytes inside global var "os_n_file_writes"
 ==9090==  declared at os0file.c:190

Just like with buffer pool stats, the I/O stats might be slightly smaller than they should be.

==9090== Possible data race during write of size 8 at 0x5e06270 by thread #1
 ==9090==    at 0x6BC83EA: sync_array_reserve_cell (sync0arr.c:366)
 ==9090==    by 0x6BC917F: rw_lock_s_lock_spin (sync0rw.c:415)
 ==9090==    by 0x6B6905A: log_checkpoint (sync0rw.ic:419)
 ==9090==    by 0x6B69141: log_make_checkpoint_at (log0log.c:2059)
 ==9090==    by 0x6BD7F98: trx_sys_create_doublewrite_buf (trx0sys.c:399)
 ==9090==    by 0x6BC76AD: innobase_start_or_create_for_mysql (srv0start.c:1692)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)
 ==9090==  This conflicts with a previous read of size 8 by thread #14
 ==9090==    at 0x6BC8B23: sync_array_print_long_waits (sync0arr.c:937)
 ==9090==    by 0x6BC4551: srv_error_monitor_thread (srv0srv.c:2297)
 ==9090==    by 0x4C29C90: mythread_wrapper (hg_intercepts.c:221)
 ==9090==    by 0x4E37EFB: start_thread (pthread_create.c:304)
 ==9090==    by 0x5A0789C: clone (clone.S:112)
 ==9090==  Address 0x5e06270 is 0 bytes inside a block of size 800000 alloc'd
 ==9090==    at 0x4C28FDF: malloc (vg_replace_malloc.c:236)
 ==9090==    by 0x6BE309B: ut_malloc (ut0mem.c:106)
 ==9090==    by 0x6BC8074: sync_array_create (sync0arr.c:240)
 ==9090==    by 0x6BCA0F0: sync_init (sync0sync.c:1401)
 ==9090==    by 0x6BC2E04: srv_boot (srv0srv.c:1043)
 ==9090==    by 0x6BC60FE: innobase_start_or_create_for_mysql (srv0start.c:1220)
 ==9090==    by 0x6B48855: innobase_init(void*) (ha_innodb.cc:2284)
 ==9090==    by 0x712F17: ha_initialize_handlerton(st_plugin_int*) (handler.cc:435)
 ==9090==    by 0x7A212A: plugin_initialize(st_plugin_int*) (sql_plugin.cc:1048)
 ==9090==    by 0x7A5166: plugin_init(int*, char**, int) (sql_plugin.cc:1275)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)

The RW lock spin stats join the company of the buffer pool and I/O stats.

==9090== Thread #18: lock order "0xF14460 before 0xF141E0" violated
 ==9090==    at 0x4C2A1CB: pthread_mutex_lock (hg_intercepts.c:496)
 ==9090==    by 0x74A32E: show_status_array(THD*, char const*, st_mysql_show_var*, enum_var_type, system_status_var*, char const*, st_table*, bool, Item*) (sql_show.cc:2266)
 ==9090==    by 0x74A894: fill_variables(THD*, TABLE_LIST*, Item*) (sql_show.cc:5522)
 ==9090==    by 0x74E4DF: get_schema_tables_result(JOIN*, enum_schema_table_state) (sql_show.cc:6238)
 ==9090==    by 0x6A5E0C: JOIN::exec() (sql_select.cc:1863)
 ==9090==    by 0x6A7D72: mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*) (sql_select.cc:2553)
 ==9090==    by 0x6A87FC: handle_select(THD*, st_lex*, select_result*, unsigned long) (sql_select.cc:269)
 ==9090==    by 0x63A3E3: execute_sqlcom_select(THD*, TABLE_LIST*) (sql_parse.cc:5179)
 ==9090==    by 0x6435EA: mysql_execute_command(THD*) (sql_parse.cc:2309)
 ==9090==    by 0x782C6D: sp_instr_stmt::exec_core(THD*, unsigned int*) (sp_head.cc:2970)
 ==9090==    by 0x788560: sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*) (sp_head.cc:2791)
 ==9090==    by 0x788973: sp_instr_stmt::execute(THD*, unsigned int*) (sp_head.cc:2913)
 ==9090==   Required order was established by acquisition of lock at 0xF14460
 ==9090==    at 0x4C2A1CB: pthread_mutex_lock (hg_intercepts.c:496)
 ==9090==    by 0x61F605: THD::init() (sql_class.cc:841)
 ==9090==    by 0x6204D0: THD::THD() (sql_class.cc:707)
 ==9090==    by 0x7A503E: plugin_init(int*, char**, int) (sql_plugin.cc:1395)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)
 ==9090==   followed by a later acquisition of lock at 0xF141E0
 ==9090==    at 0x4C2AFD5: pthread_rwlock_rdlock (hg_intercepts.c:1447)
 ==9090==    by 0x7A2EDA: cleanup_variables(THD*, system_variables*) (sql_plugin.cc:2566)
 ==9090==    by 0x7A6667: plugin_thdvar_init(THD*) (sql_plugin.cc:2523)
 ==9090==    by 0x61F60D: THD::init() (sql_class.cc:842)
 ==9090==    by 0x6204D0: THD::THD() (sql_class.cc:707)
 ==9090==    by 0x7A503E: plugin_init(int*, char**, int) (sql_plugin.cc:1395)
 ==9090==    by 0x632436: init_server_components() (mysqld.cc:4035)
 ==9090==    by 0x561388: main (mysqld.cc:4504)

All the previous errors were data races; here we see a new kind of error, a lock order violation, and thus a potential deadlock. The locks in question are LOCK_global_system_variables and LOCK_system_variables_hash. A deadlock is possible when one connection issues SHOW VARIABLES or an equivalent while another connection is just starting up (i.e. its THD is being initialized). I have reported it as bug 63203. Interestingly, Valeriy Kravchuk has confirmed it by finding another, related lock order violation: between LOCK_system_variables_hash and LOCK_plugin.
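The check Helgrind performs here can be sketched in a few lines: remember which locks are held each time a lock is taken, record the resulting (first, second) pairs, and flag a later acquisition that inverts an established pair. This is a toy version of the idea, with the two MySQL locks from the report standing in as names:

```python
class OrderTracker:
    """Minimal lock-order tracker in the spirit of Helgrind's check.
    A sketch for illustration, not Helgrind's actual algorithm."""
    def __init__(self):
        self.held = []      # locks currently held (single-threaded sketch)
        self.order = set()  # observed (acquired-first, acquired-second) pairs

    def acquire(self, name):
        for h in self.held:
            if (name, h) in self.order:  # the inverse order was seen before
                raise RuntimeError(f"lock order '{h} before {name}' violated")
            self.order.add((h, name))
        self.held.append(name)

    def release(self, name):
        self.held.remove(name)

t = OrderTracker()
# THD::init() path: global-variables lock, then variables-hash lock.
t.acquire("LOCK_global_system_variables")
t.acquire("LOCK_system_variables_hash")
t.release("LOCK_system_variables_hash")
t.release("LOCK_global_system_variables")
# SHOW VARIABLES path takes them in the inverse order and trips the checker.
t.acquire("LOCK_system_variables_hash")
try:
    t.acquire("LOCK_global_system_variables")
    violated = False
except RuntimeError:
    violated = True
assert violated
```

Note that, just as with Helgrind, the checker fires on the inconsistent ordering itself; no two threads actually have to deadlock for the hazard to be reported.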

I have only shown and analyzed a small part of all the Helgrind-reported data races. The remaining ones either show the same issues, show similar issues (other InnoDB stat counters), show non-issues or my analysis errors (like the log_sys->lsn “race” above), or I simply haven’t analyzed them yet (yes, there remains a lot of the last kind). What would I do about all this? If I had my way, I’d use atomic access primitives in InnoDB for the benign and minor cases too, if only to declare the intent and bullet-proof the code against future changes that might make those issues not-so-minor. Additionally, I’d consider backporting and using the MySQL 5.5 atomic operation primitives with proper memory barriers, so that there is no need for workarounds like the mutex_get_waiters/mutex_set_waiters one above. Then Helgrind would be even more useful for daily development and automated testing, with a clean log by default :)

Nov
21
2011
--

Should MySQL update the default innodb_log_file_size?

Now that InnoDB is the default storage engine in MySQL, is it time to update the default configuration for the InnoDB log file size (innodb_log_file_size) setting?

In general, there are two InnoDB settings that simply can’t be left at their historical defaults for a production installation: the buffer pool size and the log file size. MySQL 5.5 increased the default buffer pool size to something more sane (128MB instead of 8MB), but the log file size remains at 5MB. That’s 10MB total, because there are two log files by default.

Is it time to update this? I think so. You simply can’t run a “real” server with 10MB of transaction logs. What’s a sane value? I’d say that something between 64MB and 256MB would be okay for a lot of entry-level workloads, but that would consume some disk space that might surprise people. Perhaps this is the reason that the default hasn’t been increased from 5MB. Regardless, I think that 64MB times two is okay as a default.

What do you think? Write your suggestion for a default log file size into the comments.

By the way, if you’re looking to choose a good log file size for your workload, an arbitrary number such as 64MB or 256MB isn’t great. You can get a good start by using the heuristic shown in this old blog post.
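That heuristic boils down to sampling the log sequence number twice (e.g. from SHOW ENGINE INNODB STATUS, where the LSN is a byte offset into the redo stream), computing the redo write rate, and sizing the logs to hold roughly an hour of writes. A sketch of the arithmetic, with hypothetical LSN samples and a function name of my own invention:

```python
def log_file_size_per_file(lsn_start, lsn_end, interval_s, n_files=2, target_hours=1.0):
    """Estimate innodb_log_file_size from two LSN samples taken interval_s
    seconds apart.  The LSN is in bytes, so lsn_end - lsn_start is the
    amount of redo written during the interval."""
    bytes_per_hour = (lsn_end - lsn_start) / interval_s * 3600
    total_log_size = bytes_per_hour * target_hours
    return int(total_log_size / n_files)  # split across the log file group

# Hypothetical samples taken 60 seconds apart during peak load:
size = log_file_size_per_file(3836410803, 3838334638, 60)
print(size // (1024 * 1024), "MB per log file")  # roughly 55 MB here
```

The point of the exercise is that a server writing ~2MB of redo per minute has very different needs from one writing 200MB per minute, which is why a single arbitrary default can never fit everyone.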

Nov
21
2011
--

MySQL opening .frm even when table is in table definition cache

or… “the case of Stewart recognizing parameters to the read() system call in strace output”.

Last week, a colleague asked a question:

I have an instance of MySQL with 100 tables and the table_definition_cache set to 1000. My understanding of this is that MySQL won’t revert to opening the FRM files to read the table definition, but we can see from strace:

[pid 19876] open("./db/t1.frm", O_RDONLY) = 32 <0.000013>
[pid 19876] read(32, ""..., 10)         = 10 <0.000011>
[pid 19876] close(32)                   = 0 <0.000012>
[pid 19876] open("./db/t2.frm", O_RDONLY) = 32 <0.000014>
[pid 19876] read(32, ""..., 10)         = 10 <0.000012>
[pid 19876] close(32)                   = 0 <0.000012>
[pid 19876] open("./db/t3.frm", O_RDONLY) = 32 <0.000014>
[pid 19876] read(32, ""..., 10)         = 10 <0.000011>
[pid 19876] close(32)                   = 0 <0.000011>
[pid 19876] open("./db/t4.frm", O_RDONLY) = 32 <0.000013>

So, why is this? It turns out that this triggered a memory for me from several years ago. I’ve since discovered the blog post in which I mention it: drop table fail (on the road to removing the FRM). That blog post is from 2008, almost three years ago to the day.

Completely reworking how metadata works in Drizzle has enabled us to do some truly wonderful things, including more in-depth testing of the server. Amazingly enough, spin-offs from this work included being able to discover, and then test, that the ENUM limit of 65,535 elements has never been true (but now is in Drizzle), producing a CREATE TABLE statement that took over four minutes to execute, and getting a more complete view of how the Storage Engine API is called.

But back to what the above strace shows. In MySQL 5.5 you can find in sql/datadict.cc a function named dd_frm_type(). In MySQL 5.1, for some reason yet unknown to humans, it lives in sql/sql_view.cc as mysql_frm_type(). What this code snippet does is:

  • open the FRM
  • read 10 bytes (“header”)
  • check if it’s a view by comparing the first bytes of the FRM file against the string “TYPE=VIEW\n”. This is because VIEWs are stored as the plain text of the SQL query inside the FRM file instead of the normal binary FRM format.
  • some legacy check for a generic table type (I think, I haven’t gone back into the deep history of the FRM file format to confirm)
  • return the fourth byte as the DB_TYPE, i.e. which storage engine it is.
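The steps above can be sketched in a few lines. This is an illustrative re-implementation, not MySQL's actual dd_frm_type() code, and the magic bytes and the legacy DB_TYPE value for InnoDB (12) are my recollection of the format rather than anything this post guarantees:

```python
import io

def frm_type(f):
    """Classify an .frm file the way dd_frm_type() does: read the first
    10 bytes, check for the plain-text VIEW marker, otherwise treat the
    fourth byte as the legacy storage-engine (DB_TYPE) number."""
    header = f.read(10)
    if header == b"TYPE=VIEW\n":
        return ("view", None)
    return ("table", header[3])  # fourth byte = legacy DB_TYPE code

# A view's FRM starts with the literal marker followed by the query text:
view = io.BytesIO(b"TYPE=VIEW\nselect 1")
# A table's FRM is binary; byte 3 carries the engine number (12 = InnoDB
# in the legacy enum, if memory serves):
table = io.BytesIO(bytes([0xFE, 0x01, 0x09, 12]) + b"\x00" * 6)

assert frm_type(view) == ("view", None)
assert frm_type(table) == ("table", 12)
```

Note that the classification needs only 10 bytes, which matches the `read(32, ""..., 10)` calls in the strace output above.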

We can ignore the upper limit this places on the number of storage engines for MySQL; understanding the relationship between the range of numbers reserved for dynamic assignment and what this means for on-disk compatibility of data directories is left as an exercise for the reader.

This code is called from several code paths in the server:

  • DROP TABLE
  • RENAME TABLE
  • DROP VIEW
  • open table
  • filling INFORMATION_SCHEMA tables (I think it is actually the TABLES table, but I didn’t look closely)

An example of how this is used: in the DROP TABLE code path, MySQL uses this magic byte to work out which storage engine to ask to drop the table. The main consequence of this bit of code is that MySQL may cause unnecessary disk I/O for information it already has cached (often at least twice: in InnoDB itself and in the table_definition_cache).

Further reading:
