Recall of Percona Server 5.1.47-11.0

Percona Server release 11.0, which we announced a few days ago, was unfortunately released with a bug, introduced while implementing comment stripping in the query cache, that could cause a server crash with certain types of queries when the query cache is enabled. We have released Percona Server release 11.1, which includes a fix for this issue. If you can’t upgrade promptly, you can disable the query cache until you can.

We’re sorry for shipping a release with such a bug.

In more detail, the issue is as follows:
The MySQL query cache works by checking whether an incoming query may be found in the query cache: a simple check on whether the query starts with SEL or a comment. After the query is parsed and executed, MySQL stores the results of SELECT queries in the query cache. There are cases when a query is not considered cacheable during the first check, yet its result is still stored in the query cache. This rare case was not handled properly in the new feature, causing MySQL to crash.

The most typical case of queries causing issues is queries starting with a bracket and having a space before the SELECT, such as ( select * from t1 ). Queries starting with brackets are often used in UNION queries.
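The mismatch between the two checks can be sketched in Python (illustrative pseudologic only, not the server’s actual code; both function names are made up):

```python
# Sketch of the two-stage behavior described above: a fast prefix test
# on the raw query text, and a post-execution decision to store SELECT
# results. A query like "( select ... )" fails the first test yet its
# result is still stored -- the mismatched case behind the crash.

def may_be_cached(query: str) -> bool:
    """Fast pre-parse check: does the query look cacheable?"""
    q = query.lstrip().lower()
    return q.startswith("sel") or q.startswith("/*") or q.startswith("--")

def is_stored_after_execution(query: str) -> bool:
    """After parsing and executing, any SELECT result set is stored."""
    q = query.lstrip("( \t\n").lower()
    return q.startswith("select")

query = "( select * from t1 )"
print(may_be_cached(query))             # False: fails the prefix check
print(is_stored_after_execution(query)) # True: result is still cached
```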

Entry posted by peter |



High availability for MySQL on Amazon EC2 – Part 2 – Setting up the initial instances

This post is the second of a series that started here.

The first step in building the HA solution is to create two working instances, configure them to be EBS based, and create a security group for them. A third instance, the client, will be discussed in part 7. Since this is a proof of concept, I’ll be using m1.small instances, while normally the MySQL host would be much larger; using another type is trivial. I will assume you are using the command line API tools; on Ubuntu, install “ec2-api-tools”. These tools make commands easier to express than the web based console.

Create the security group

The instances involved in the MySQL HA setup need to be inside the same security group, both for networking purposes and to help identify them. To create a security group, simply run these commands:


  1. yves@yves-laptop:~$ export EC2_CERT=cert-yves.pem
  2. yves@yves-laptop:~$ export EC2_PRIVATE_KEY=pk-yves.pem
  3. yves@yves-laptop:~$ ec2-add-group hamysql -d 'nodes for HA MySQL solution'
  4. GROUP   hamysql nodes for HA MySQL solution

From now on, I’ll assume the EC2_CERT and EC2_PRIVATE_KEY environment variables are set in your shell. Next, we need to authorize some communications for the security group. I’ll authorize 3306/tcp (MySQL) from hamysql, 694/udp (Heartbeat) from hamysql, and 22/tcp (SSH) from everywhere. You can be more restrictive for SSH if you want to.


  1. yves@yves-laptop:~$ ec2-authorize hamysql -P tcp -p 3306  -o hamysql -u 834362721059
  2. yves@yves-laptop:~$ ec2-authorize hamysql -P udp -p 694  -o hamysql -u 834362721059
  3. yves@yves-laptop:~$ ec2-authorize hamysql -P tcp -p 22 -s
  4. yves@yves-laptop:~$  ec2-describe-group hamysql
  5. GROUP   834362721059    hamysql nodes for HA MySQL solution
  6. PERMISSION      834362721059    hamysql ALLOWS  tcp     3306    3306    FROM    USER    834362721059    GRPNAME hamysql
  7. PERMISSION      834362721059    hamysql ALLOWS  udp     694     694     FROM    USER    834362721059    GRPNAME hamysql
  8. PERMISSION      834362721059    hamysql ALLOWS  tcp     22      22      FROM    CIDR

Launch the instances

Now we can start creating our instances. Since this is only a proof of concept, I’ll build two m1.small instances; feel free to use other types. At the time I wrote this, the following AMI seemed OK.


  1. yves@yves-laptop:~$  ec2-describe-images ami-1cdf3775
  2. IMAGE   ami-1cdf3775    099720109477/ubuntu-images-testing/ubuntu-lucid-daily-i386-server-20100618      099720109477    available       public          i386    machine aki-aca44cc5         instance-store

So, launching two of these:


  1. yves@yves-laptop:~$  ec2-run-instances ami-1cdf3775 -n 2 -g hamysql -t m1.small -k yves-key
  2. RESERVATION     r-a29c31c9      834362721059    hamysql
  3. INSTANCE        i-a23a21c9      ami-1cdf3775                    pending yves-key        0               m1.small        201006-18T20:11:14+0000        us-east-1c      aki-aca44cc5         monitoring-disabled                                      instance-store
  4. INSTANCE        i-a03a21cb      ami-1cdf3775                    pending yves-key        1               m1.small        201006-18T20:11:14+0000        us-east-1c      aki-aca44cc5         monitoring-disabled
  5. yves@yves-laptop:~$  ec2-describe-instances
  6. RESERVATION     r-a29c31c9      834362721059    hamysql
  7. INSTANCE        i-a23a21c9      ami-1cdf3775    ec2-17412989188.compute1.amazonaws.com      domU-12313902-BD-C5.compute1.internal       running yves-key        0               m1.small      201006-18T20:11:14+0000        us-east-1c      aki-aca44cc5                    monitoring-disabled                   instance-store
  8. INSTANCE        i-a03a21cb      ami-1cdf3775    ec2-174129187170.compute1.amazonaws.com     domU-12313903-A4-62.compute1.internal       running yves-key        1               m1.small      201006-18T20:11:14+0000        us-east-1c      aki-aca44cc5                    monitoring-disabled                  instance-store

I don’t know about you, but I don’t like multi-line output, so I wrote a small filter script that prints the parameters I need on one line, separated by a delimiter.


  yves@yves-laptop:~$ cat filtre_instances.pl
  #!/usr/bin/perl

  $SecGroup = "";
  $IPAdd = "";
  $Instance_ID = "";

  while (<STDIN>) {
          chomp $_;

          #print "Processing: $_\n";

          @fields = split /\t/, $_;

          if (/^RESERVATION/)     {
                  $SecGroup = $fields[3];
          }
          if (/^INSTANCE/) {
                  $IPAdd = $fields[17];
                  $STORE = $fields[20];
                  $Instance_ID = $fields[1];
                  $AMI_ID = $fields[2];
                  $PUBDNS = $fields[3];
                  $STATUS = $fields[5];
                  $START = $fields[10];
                  print "$SecGroup|$IPAdd|$Instance_ID|$AMI_ID|$PUBDNS|$STATUS|$START|$STORE\n";
          }
  }

and now we have


  1. yves@yves-laptop:~$ ec2-describe-instances | ./filtre_instances.pl | grep hamysql
  2. hamysql||i-a23a21c9|ami-1cdf3775|ec2-17412989188.compute1.amazonaws.com|running|201006-18T20:11:14+0000|instance-store
  3. hamysql||i-a03a21cb|ami-1cdf3775|ec2-174129187170.compute1.amazonaws.com|running|201006-18T20:11:14+0000|instance-store

which is, in my opinion, easier to manipulate.

Configuring Heartbeat

Now, let’s configure Heartbeat. The first thing to do is to set the hostname on both hosts. Heartbeat identifies the host it is running on by its hostname, so this is a mandatory step.

First host:


  1. yves@yves-laptop:~$ ssh -i ~/.ssh/yves-key.pem ubuntu@ec2-17412989188.compute1.amazonaws.com
  2. ubuntu@domU-12313907-C8-32:~$ sudo su -
  3. ubuntu@domU-12313907-C8-32:~# hostname monitor

Second host:


  1. yves@yves-laptop:~$ ssh -i ~/.ssh/yves-key.pem ubuntu@ec2-174129187170.compute1.amazonaws.com
  2. ubuntu@domU-12313804-E5-E4:~$ sudo su -
  3. ubuntu@domU-12313804-E5-E4:~# hostname hamysql

We don’t really need to set /etc/hostname since it is overwritten when the instance is started, even when using an EBS based AMI. The next step is to install Heartbeat and Pacemaker on both hosts; with Ubuntu 10.04, it is very straightforward:


  root@monitor:~# apt-get install heartbeat pacemaker

and

  root@hamysql:~# apt-get install heartbeat pacemaker

Then we can proceed and configure Heartbeat; Pacemaker will come later. Heartbeat needs two configuration files: /etc/ha.d/authkeys for cluster authentication and /etc/ha.d/ha.cf, the configuration file per se. The chosen key in the authkeys file must be identical on both hosts, and a good way to generate a unique one is to run “date | md5sum” and grab a substring of the output.
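The same “hash the current time, grab a substring” idea can be scripted; here is a small Python sketch (using hashlib instead of the md5sum binary) that produces the authkeys content:

```python
# Sketch: generate a unique-ish shared key from the current time, the
# same idea as `date | md5sum`, and format an /etc/ha.d/authkeys file.
import hashlib
import time

digest = hashlib.md5(str(time.time()).encode()).hexdigest()
key = digest[:25]  # grab a substring of the hash output

authkeys = f"auth 1\n1 sha1 {key}\n"
print(authkeys)
```

The exact substring length doesn’t matter; what matters is that both hosts end up with the identical file.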


  1. root@monitor:/etc/ha.d# cat authkeys
  2. auth 1
  3. 1 sha1 c97f2bb4b5ae90f149dc314ed

Also, don’t forget to restrict the access rights on the file:


  1. root@monitor:/etc/ha.d# chmod 600 authkeys

For the /etc/ha.d/ha.cf file: since EC2 supports neither broadcast nor multicast within the security group, we need to use unicast (ucast), so the two files will not be identical. The ucast entry on one host contains the internal network IP address of the other host. On the monitor host, we will have:


  1. root@monitor:/etc/ha.d# cat ha.cf
  2. autojoin none
  3. ucast eth0
  4. warntime 5
  5. deadtime 15
  6. initdead 60
  7. keepalive 2
  8. crm respawn
  9. node monitor
  10. node hamysql

and on the hamysql host:


  1. root@hamysql:/etc/ha.d# cat ha.cf
  2. autojoin none
  3. ucast eth0
  4. warntime 5
  5. deadtime 15
  6. initdead 60
  7. keepalive 2
  8. crm respawn
  9. node monitor
  10. node hamysql

Let’s review the configuration file briefly. First, we set “autojoin none”, which means no host not listed explicitly in the configuration file can join the cluster, so we know we have at most two members, “monitor” and “hamysql”. Next is the ucast communication channel to reach the other node, followed by the timing parameters. “warntime” is a soft timeout in seconds after which Heartbeat logs that the other node is late, while “deadtime” is the hard limit after which Heartbeat considers the other node dead and starts actions to restore the service. “initdead” is just a startup delay to allow a host to fully boot before attempting actions, and “crm respawn” starts the Pacemaker resource manager. Finally, we have the two “node” declarations for the cluster members.
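How the two timeouts interact can be illustrated with a tiny sketch (illustrative only, not Heartbeat’s actual code):

```python
# Given the seconds elapsed since the last heartbeat received from the
# peer, decide whether to stay quiet, log a warning, or declare the
# peer dead and start recovery -- mirroring warntime/deadtime above.

WARNTIME = 5   # soft timeout: log that the peer is late
DEADTIME = 15  # hard timeout: declare the peer dead, restore service

def peer_state(seconds_since_last_heartbeat: float) -> str:
    if seconds_since_last_heartbeat >= DEADTIME:
        return "dead"   # start actions to restore the service
    if seconds_since_last_heartbeat >= WARNTIME:
        return "late"   # warning only, no action taken yet
    return "alive"

print(peer_state(2))   # alive: with keepalive 2, this is the normal case
print(peer_state(7))   # late
print(peer_state(20))  # dead
```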

So we are done with the configuration; time to see if it works. On both hosts, run:


  1. service heartbeat start

and if everything is right, after at most a minute, you should be able to see both heartbeat processes chatting over the network:


  1. root@monitor:~# tcpdump -i eth0 port 694
  2. tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
  3. listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
  4. 20:38:36.536302 IP domU-12313804-E5-E4.compute1.internal.57802> domU-12313907-C8-32.compute1.internal.694: UDP, length 211
  5. 20:38:36.928860 IP domU-12313907-C8-32.compute1.internal.34058> domU-12313804-E5-E4.compute1.internal.694: UDP, length 212
  6. 20:38:38.580245 IP domU-12313804-E5-E4.compute1.internal.57802> domU-12313907-C8-32.compute1.internal.694: UDP, length 211
  7. 20:38:38.938814 IP domU-12313907-C8-32.compute1.internal.34058> domU-12313804-E5-E4.compute1.internal.694: UDP, length 212

We can also use the “crm” tool to query the cluster status.


  1. root@monitor:~# crm status
  2. ============
  3. Last updated: Tue Jun 29 13:56:04 2010
  4. Stack: Heartbeat
  5. Current DC: monitor (504f45ea-7aee-4fa5-b0ee-a5ac07975ce4) – partition with quorum
  6. Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
  7. 2 Nodes configured, unknown expected votes
  8. 1 Resources configured.
  9. ============
  11. Online: [ monitor hamysql ]

Install MySQL

For the sake of simplicity, we will just install the MySQL version in the Ubuntu repository by doing:


  1. root@hamysql:~# apt-get install mysql-server-5.1

The MySQL package installs an automatic startup script controlled by init (new in Lucid). That’s fine; it may surprise you, but Pacemaker will not manage MySQL itself, just the host running it. I’ll also skip the RAID configuration of multiple EBS volumes since it is not the main purpose of this blog series.

EBS based AMI

Others have produced excellent articles on how to create EBS based AMIs, so I will not reinvent the wheel. I followed this one: http://www.capsunlock.net/2009/12/create-ebs-boot-ami.html

Upcoming in part 3, the configuration of the HA resources.

Entry posted by Yves Trudeau |



Percona Server 5.1.47-rel11.0

Dear Community,

Percona Server version 5.1.47-rel11.0 is available for download now.

The changes in this release include:

New features

Fixed bugs

  • #573100 Can’t compile 5.1.46
  • #573104 separate name in INNODB_SYS_TABLES table
  • #580324 Security bug in upstream
  • #586579 Percona Server 11 fails to compile with CFLAGS=-DUNIV_DEBUG
  • #569156 CentOS 5: mysql-server conflicts with MySQL-server
  • #589639 Recovery process may hang when tablespaces are deleted during the recovery
  • #570840 deb package conflicts with libdbd-mysql-perl

The binary packages are available on our website.

Starting from this release there is a uniform naming scheme for Percona binaries for 5.0 and 5.1 based packages. RPMs are named as:
Percona-Server-<sub package>-<package suffix>-<mysql version>-<release number>.<revision number>.<os code>.<platform>.rpm


  1. Percona-Server-server-51-5.1.47-rel11.0.46.rhel5.x86_64.rpm
  2. Percona-Server-client-51-5.1.47-rel11.0.46.rhel5.x86_64.rpm


  • “Percona-Server” is the name of the product.
  • “server” or “client” is the sub-package.
  • 51 is the suffix, which means the packages are based on 5.1.x versions.

Together “Percona-Server-server-51” is the package name.

  • 5.1.47 is the MySQL version the package is based on
  • rel11.0 is the release number
  • 46 is the BZR revision
  • rhel5 is the operating system code
  • x86_64 is the platform
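As a sanity check, the scheme above can be expressed as a pattern; here is a Python sketch (it assumes the components are separated by hyphens and dots exactly as described):

```python
# Parse an RPM file name of the form
# Percona-Server-<sub>-<suffix>-<mysql ver>-rel<release>.<rev>.<os>.<platform>.rpm
import re

pattern = re.compile(
    r"^(?P<product>Percona-Server)-(?P<sub>server|client)-(?P<suffix>\d+)-"
    r"(?P<mysql>[\d.]+)-(?P<release>rel[\d.]+)\.(?P<rev>\d+)\."
    r"(?P<os>\w+)\.(?P<platform>\w+)\.rpm$"
)

m = pattern.match("Percona-Server-server-51-5.1.47-rel11.0.46.rhel5.x86_64.rpm")
print(m.groupdict())
```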

Percona Server obsoletes Percona XtraDB on both the RHEL and Debian platforms. To install the RPM or DEB package from the Percona repository you just have to run “yum install Percona-Server-server-51”. If Percona-XtraDB is installed, it will be replaced by Percona Server. However, Percona-Server doesn’t obsolete MySQL, so you have to uninstall MySQL first; conflicts are inevitable otherwise.

If there are MySQL dependent applications already installed, you might want to install the Percona Shared Compat package before installing Percona Server.

The latest source code of Percona Server, including the development branch, can be found on Launchpad.

Please report any bugs found at Bugs in Percona Server.
For general questions use our Percona-discussions group, and for development questions the Percona-dev group.

For support, commercial and sponsorship inquiries, contact Percona.

Entry posted by Aleksandr Kuzminsky |



High availability for MySQL on Amazon EC2 – Part 1 – Intro

Like many, I have been seduced by the power and flexibility of Amazon EC2. Being able to launch new instances at will, depending on the load, is almost too good to be true. But Amazon also has some drawbacks: availability is not guaranteed, and discovery protocols relying on Ethernet broadcast or multicast cannot be used. That means it is not easy to have a truly highly available solution for MySQL on EC2. If a MySQL instance fails, here are some challenges that we face:

  1. Detect the failure
  2. Kill the faulty instance
  3. Free the shared resources (ex: EBS volumes)
  4. Launch a new instance
  5. Reconfigure the shared resources
  6. Reconfigure instance monitoring
  7. Reconfigure the database clients

Facing these challenges, a customer recently asked me if I could build a viable HA solution for MySQL on EC2. The goal is to have a cheap small instance monitor the availability of a large one (any size) and take measures to keep MySQL available. A few weeks later, I ended up with a solution that works and is decently elegant, using Pacemaker and Heartbeat. The setup is fairly complex; being in the Amazon EC2 virtual world is not a simplification, far from it. Because of the complexity, the story will be broken into multiple posts:

  1. Part 1 – Intro (this post)
  2. Part 2 – Setting up the initial instances
  3. Part 3 – Configuring the HA resources
  4. Part 4 – The instance restart script
  5. Part 5 – The instance monitoring script
  6. Part 6 – Publishing the MySQL server location
  7. Part 7 – Pitfalls to avoid

Hopefully I’ll be able to write these posts quickly, but since consulting is my primary duty, I don’t have much control over my workload.

Stay tuned!

Entry posted by Yves Trudeau |



What does Handler_read_rnd mean?

MySQL’s SHOW STATUS command has two counters that are often confusing and result in “what does that mean?” questions:

  1. Handler_read_rnd
  2. Handler_read_rnd_next

As I understand it, there is some historical context to the choice of names here, hearkening back to before I was involved with MySQL, way back when it was a wrapper around ISAM tables — or even earlier, I don’t know. (Unireg, anyone?) In any case, these two variables deserve a little explanation.

Both counters indicate the number of times the corresponding storage engine API function has been called. In olden times, the storage engine API was called the handler API, which is why the variables begin with Handler_.

Handler_read_rnd counts the number of times the handler::rnd_pos() method is called. This method fetches a row from a table based on a “fixed position,” i.e. a random-read. What this actually means varies between storage engines. For MyISAM, position really means a byte offset from the beginning of the file. For InnoDB, it means to read a row based on a primary key value.

Handler_read_rnd_next is incremented when handler::rnd_next() is called. This is basically a cursor operation: read the “next” row in the table. The operation advances the cursor position so the next time it’s called, you get the next row.

The naming convention probably doesn’t make as much sense today as it did historically. Monty told me that “rnd” meant “random,” but now I can’t find a reference to that conversation. In any case, one of the operations is basically a random read (if you think of tuples as having a natural order, which historically they did), and the other is just a read-next operation; there’s nothing random about it, really. I would welcome any corrections from those who know more about this than I do.

Why would either of these be called? Usually Handler_read_rnd is called when a sort operation gathers a list of tuples and their “position” values, sorts the tuples by some criterion, and then traverses the sorted list, using the position to fetch each one. This is quite likely to result in retrieving rows from random points in the table, although that might not actually result in random IO if the data is all in memory. Handler_read_rnd_next is usually called for each row of a full or partial table scan.
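The two access patterns can be sketched in Python (an illustration of the counters, not the handler API itself; the table and position values are made up):

```python
# Simulate the two handler calls: rnd_pos() for sort-and-fetch, which
# bumps Handler_read_rnd once per fetched row, and rnd_next() for a
# table scan, which bumps Handler_read_rnd_next once per row.

counters = {"Handler_read_rnd": 0, "Handler_read_rnd_next": 0}
table = [("pos0", "banana"), ("pos1", "apple"), ("pos2", "cherry")]

def rnd_pos(pos):
    """Fetch a row by its fixed position (a 'random' read)."""
    counters["Handler_read_rnd"] += 1
    return dict(table)[pos]

def rnd_next(cursor):
    """Cursor operation: read the row at the cursor, then advance."""
    counters["Handler_read_rnd_next"] += 1
    return table[cursor] if cursor < len(table) else None

# Sort-and-fetch: gather (sort key, position), sort, fetch by position.
keyed = sorted((row, pos) for pos, row in table)
result = [rnd_pos(pos) for _, pos in keyed]

# Full table scan: walk the cursor over every row.
for cur in range(len(table)):
    rnd_next(cur)

print(result)    # rows come back in sorted order
print(counters)  # one rnd call and one rnd_next call per row
```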

PS: I’m being a bit vague about the sort-and-fetch algorithm; there are really two algorithms, and they are smarter than I’m giving them credit for. But that’s a topic for another blog post.

Entry posted by Baron Schwartz |



EXPLAIN EXTENDED can tell you all kinds of interesting things

While many people are familiar with the MySQL EXPLAIN command, fewer people are familiar with “extended explain”, which was added in MySQL 4.1.

EXPLAIN EXTENDED can show you what the MySQL optimizer does to your query. You might not know this, but MySQL can dramatically change your query before it actually executes it. This process is called query rewriting, and it is a part of any good SQL optimizer. EXPLAIN EXTENDED adds a warning message to the EXPLAIN output which displays additional information, including the rewritten query.

To take a look at EXPLAIN EXTENDED, I’ll start with three empty tables. It is important to note that the tables are empty because the MySQL optimizer treats empty tables (and incidentally, tables with only one row) differently than tables which contain more than one row. More about that in a bit.


  1. mysql> CREATE TABLE j1 (c1 int);
  2. Query OK, 0 rows affected (0.16 sec)
  4. mysql> CREATE TABLE j2 (c1 int);
  5. Query OK, 0 rows affected (0.11 sec)
  7. mysql> CREATE TABLE j3 (c1 int);
  8. Query OK, 0 rows affected (0.10 sec)

  1. mysql> EXPLAIN extended SELECT j1.c1 FROM j1, j2, j3 WHERE j1.c1 = j2.c1 AND j3.c1 = j1.c1;
  2. +—-+————-+——-+——+—————+——+———+——+——+—————————————————–+
  3. | id | select_type | TABLE | type | possible_keys | KEY  | key_len | ref  | rows | Extra                                               |
  4. +—-+————-+——-+——+—————+——+———+——+——+—————————————————–+
  5. 1 | SIMPLE      | NULL  | NULL | NULL          | NULL | NULL    | NULL | NULL | Impossible WHERE noticed after reading const TABLES |
  6. +—-+————-+——-+——+—————+——+———+——+——+—————————————————–+
  7. 1 row IN SET, 1 warning (0.04 sec)
  9. mysql> SHOW warnings;
  10. +——-+——+——————————————————————————-+
  11. | Level | Code | Message                                                                       |
  12. +——-+——+——————————————————————————-+
  13. | Note  | 1003 | SELECT ‘0’ AS `c1` FROM `test`.`j1` JOIN `test`.`j2` JOIN `test`.`j3` WHERE 0 |
  14. +——-+——+——————————————————————————-+
  15. 1 row IN SET (0.00 sec)

You might notice a few odd things about this EXPLAIN. First, there are no tables listed. Looking at the Extra column, we see that MySQL mentions ‘const’ tables. A ‘const’ table is a table that contains 0 or 1 rows, or a table on which all parts of a primary key or unique key lookup are satisfied by the WHERE clause. If a ‘const’ table contains no rows, and it is not used in an OUTER JOIN, then MySQL can immediately return an empty set because it infers that there is no way rows could be returned. MySQL does this by replacing the WHERE clause of the query with ‘where 0’.

Let’s now look at what happens after the value 1 is inserted into each of the tables. Each table then contains a single row, and the value in each table is 1.


  2. Query OK, 1 row affected (0.00 sec)
  4. mysql> EXPLAIN extended SELECT j1.c1 FROM j1, j2, j3 WHERE j1.c1 = j2.c1 AND j3.c1 = j1.c1;
  5. +—-+————-+——-+——–+—————+——+———+——+——+——-+
  6. | id | select_type | TABLE | type   | possible_keys | KEY  | key_len | ref  | rows | Extra |
  7. +—-+————-+——-+——–+—————+——+———+——+——+——-+
  8. 1 | SIMPLE      | j1    | system | NULL          | NULL | NULL    | NULL |    1 |       |
  9. 1 | SIMPLE      | j2    | system | NULL          | NULL | NULL    | NULL |    1 |       |
  10. 1 | SIMPLE      | j3    | system | NULL          | NULL | NULL    | NULL |    1 |       |
  11. +—-+————-+——-+——–+—————+——+———+——+——+——-+
  12. 3 rows IN SET, 1 warning (0.00 sec)
  14. mysql> SHOW warnings;
  15. +——-+——+——————————————————————————-+
  16. | Level | Code | Message                                                                       |
  17. +——-+——+——————————————————————————-+
  18. | Note  | 1003 | SELECT ‘1’ AS `c1` FROM `test`.`j1` JOIN `test`.`j2` JOIN `test`.`j3` WHERE 1 |
  19. +——-+——+——————————————————————————-+
  20. 1 row IN SET (0.00 sec)

You should now notice that the tables are actually listed in the EXPLAIN output, but also notice that the type field is set to ‘system’. A ‘system’ table is a special case of a ‘const’ table, used when a table contains only one row. The contents of these tables are read before the query begins executing. Because of this, MySQL can compare the constant values before completely formulating the plan. You will notice that MySQL replaces the WHERE clause with ‘where 1’ because it knows that all the const tables contain equal values. If they did not, the earlier plan with ‘where 0’ would be generated.

Finally, let’s insert a few more rows and test the plan:


  3. mysql> EXPLAIN extended SELECT j1.c1 FROM j1, j2, j3 WHERE j1.c1 = j2.c1 AND j3.c1 = j1.c1;
  4. +—-+————-+——-+——+—————+——+———+——+——+————-+
  5. | id | select_type | TABLE | type | possible_keys | KEY  | key_len | ref  | rows | Extra       |
  6. +—-+————-+——-+——+—————+——+———+——+——+————-+
  7. 1 | SIMPLE      | j1    | ALL  | NULL          | NULL | NULL    | NULL |    2 |             |
  8. 1 | SIMPLE      | j2    | ALL  | NULL          | NULL | NULL    | NULL |    3 | USING WHERE |
  9. 1 | SIMPLE      | j3    | ALL  | NULL          | NULL | NULL    | NULL |    4 | USING WHERE |
  10. +—-+————-+——-+——+—————+——+———+——+——+————-+
  11. 3 rows IN SET, 1 warning (0.00 sec)
  13. mysql> SHOW warnings;
  14. +——-+——+—————————————————————————————————————————————————————————-+
  15. | Level | Code | Message                                                                                                                                                                    |
  16. +——-+——+—————————————————————————————————————————————————————————-+
  17. | Note  | 1003 | SELECT `test`.`j1`.`c1` AS `c1` FROM `test`.`j1` JOIN `test`.`j2` JOIN `test`.`j3` WHERE ((`test`.`j2`.`c1` = `test`.`j1`.`c1`) AND (`test`.`j3`.`c1` = `test`.`j1`.`c1`)) |
  18. +——-+——+—————————————————————————————————————————————————————————-+
  19. 1 row IN SET (0.00 sec)

The type field has now changed to ALL, which means that the entire table will be read. This is because these tables contain no indexes.

There is another interesting thing, which I probably should have mentioned before. You will notice that the query I explained used the comma syntax, but when MySQL rewrote the query it switched to the JOIN keyword. This should put to rest any debate about a perceived performance difference between comma joins and ANSI JOIN syntax: they are intrinsically the same.

Last, EXPLAIN EXTENDED can show you information about the query rewrites MySQL makes when accessing views that use the MERGE algorithm.

For example:


  1. mysql> CREATE VIEW v1 AS SELECT * FROM j1;
  2. Query OK, 0 rows affected (0.10 sec)
  4. mysql> EXPLAIN extended SELECT * FROM v1 WHERE c1=1;
  5. +—-+————-+——-+——+—————+——+———+——+——+————-+
  6. | id | select_type | TABLE | type | possible_keys | KEY  | key_len | ref  | rows | Extra       |
  7. +—-+————-+——-+——+—————+——+———+——+——+————-+
  8. 1 | SIMPLE      | j1    | ALL  | NULL          | NULL | NULL    | NULL |    2 | USING WHERE |
  9. +—-+————-+——-+——+—————+——+———+——+——+————-+
  10. 1 row IN SET, 1 warning (0.00 sec)
  12. mysql> SHOW warnings;
  13. +——-+——+——————————————————————————-+
  14. | Level | Code | Message                                                                       |
  15. +——-+——+——————————————————————————-+
  16. | Note  | 1003 | SELECT `test`.`j1`.`c1` AS `c1` FROM `test`.`j1` WHERE (`test`.`j1`.`c1` = 1) |
  17. +——-+——+——————————————————————————-+
  18. 1 row IN SET (0.00 sec)

The most important thing to notice is the WHERE clause: the SELECT statement used in the view has been modified to include the WHERE clause I used when accessing the view.

Entry posted by Justin Swanhart |



Virident tachIOn: New player on Flash PCI-E cards market

(Note: The review was done as part of our consulting practice, but is totally independent and fully reflects our opinion)

In my talk at the MySQL Conference and Expo 2010, “An Overview of Flash Storage for Databases”, I mentioned that most likely other players were coming soon. I was actually not aware of any real names at the time, it was just a guess, as the PCI-E market is really attractive, so FusionIO couldn’t stay alone for long. So I am not surprised to see a new card from Virident, and I was lucky enough to test a pre-production sample of the Virident tachIOn 400GB SLC card.

I think it is fair to say that Virident targets the segment where FusionIO currently has a monopoly, and it will finally bring some competition to the market, which I believe is good for end users. I am looking forward to price competition (not having real numbers, I can guess that vendors still put a high margin in the price) as well as high performance in general and stable performance under high load in particular, and also competition in the capacity and data reliability areas.

The price list for Virident tachIOn cards already shows the price competition: the list price for the tachIOn 400GB is $13,600 (that is $34/GB), and the entry-level card is 200GB priced at $6,800 (there is also a 300GB card in the product line). The price for the FusionIO 160GB SLC (from dell.com, as of 14-Jun-2010) is $6,308.99 (that is about $39.4/GB).
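The $/GB figures are simple arithmetic on the quoted list prices:

```python
# Price per gigabyte from the list prices quoted above (USD).
def per_gb(price_usd, capacity_gb):
    return price_usd / capacity_gb

print(round(per_gb(13600, 400), 1))    # Virident tachIOn 400GB SLC -> 34.0
print(round(per_gb(6800, 200), 1))     # Virident tachIOn 200GB -> 34.0
print(round(per_gb(6308.99, 160), 1))  # FusionIO 160GB SLC -> 39.4
```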

A couple of words about the product: I know the Virident engineering team concentrated on getting stable write performance in long-running write activity and in cases when space utilization is close to 100%. As you may know (check my presentation), SSD design requires background “garbage collector” activity, which requires space to operate, and the Virident card reserves enough space to give stable write performance even when the disk is almost full.

As for reliability, I think the design of the card is quite neat. The card itself contains a bunch of replaceable flash modules, and each individual module can be changed in case of failure. Internally, the modules are also joined in a RAID (fully transparent to the end user).

All this provides a good level of confidence in data reliability: if a single module fails, the internal RAID allows operations to continue, and after the module is replaced, it is rebuilt. The controller on the card is still a single point of failure, but in that case all flash modules can be safely relocated to a new card with a working controller. (Note: this was not tested by Percona engineers, but taken from the vendor’s specification.)

As for power failures, the flash modules also come with capacitors, which guarantees data delivery to the final media even if power is lost on the main host. (Note: this was not tested by Percona engineers, but taken from the vendor’s specification.)

Now to the most interesting part, the performance numbers. I used the sysbench fileio benchmark with a 16KB block size to see what maximal performance we can expect.

Server specification is:

  • Supermicro X8DTH series motherboard
  • 2 x Xeon E5520 (2.27GHz) processors w/HT enabled (16 cores)
  • 64GB of ECC/Registered DDR3 DRAM
  • CentOS 5.3, 2.6.18-164 kernel
  • Filesystem is XFS formatted with mkfs.xfs -s size=4096 option ( size=4096, sector size, is very important to have aligned IO requests) and mounted with nobarrier option
  • Benchmark: sysbench fileio on 100GB file, 16KB blocksize

The raw results are available on Wiki

And the graphs for random read, writes and sequential writes:

I think it is very interesting to see the distribution of the 95% response time results (a 0 time is obviously a problem in sysbench, which does not have enough time resolution for such very fast operations).

As you can see, we can get about 400MB/sec of random write bandwidth with 8-16 threads, with a response time below 3.1ms (for 8 threads) and 3.8ms (16 threads) in 95% of cases.

One issue I should mention: despite the good response time results, the maximal response time in some cases can jump to 300ms per request. I was told this corresponds to garbage collector activity and will be fixed in the production release of the driver.

I think it is fair to compare with the FusionIO card, especially for the write pressure case. As you may know, FusionIO recommends reserving space to get sustainable write performance (Tuning Techniques for Writes).

I took a FusionIO ioDrive 160GB SLC card and tested the fully formatted card (file size 145GB), the card formatted with 25% space reservation (file size 110GB), and the Virident card with a 390GB file size. This also lets us see whether the Virident tachIOn card can sustain writes on a fully utilized card.

As a disclaimer, I want to mention that the Virident tachIOn card was fine-tuned by Virident engineers, while the FusionIO card was tuned only by me, and I may not have all the knowledge needed for FusionIO tuning.

The first graph is random reads, to compare read performance:

As you can see, with 1 and 4 threads FusionIO is better, while with more threads the Virident card scales better.

And now random writes:

You can see that FusionIO definitely needs space reservation to provide high write bandwidth, and that comes with a cost hit (25% space reservation -> ~25% increase in $/GB).
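A back-of-envelope sketch of that cost hit, using the file sizes from the tests above (the post's ~25% is a round figure; strictly, the inverse of the usable fraction gives a slightly larger increase):

```python
# Effective $/GB rises by the ratio of full to reserved usable capacity.
full_format_gb = 145      # fully formatted ioDrive 160GB SLC
reserved_format_gb = 110  # same card with 25% space reservation
cost_increase = full_format_gb / reserved_format_gb - 1  # ~0.32, i.e. ~32%
```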

In conclusion I can highlight:

  • I am impressed with the architectural design with replaceable individual flash modules; I think it establishes a new high-end standard for flash devices
  • With a single card you can get over 1GB/sec bandwidth in random reads (16-64 working threads), which is the best result I have seen so far (again, for a single card)
  • Random write bandwidth exceeds 400MB/sec (8-16 working threads)
  • Random read/write mix results are also impressive, which can be quite important for workloads like FlashCache, where the card is under concurrent read and write pressure
  • Quite stable sequential write performance (important for log-related activity in MySQL)

I am looking forward to presenting results for sysbench oltp and tpcc workloads, and also in FlashCache mode.

Entry posted by Vadim |



Purge Thread Spiral of Death

I just wrote a large post on reasons for excessive growth of the Innodb main tablespace, and I thought it would make sense to briefly explain why purge is so frequently not a problem at all, and then, seemingly out of nowhere, the purge thread becomes unable to keep up, the undo tablespace explodes and performance drops. Here is what happens.

When you have a typical OLTP system with small transactions, your undo space is small and fits in the buffer pool. In fact most of the changes do not need to go to disk at all – the space for undo entries is allocated, used and freed without ever touching disk.

Now when a spike in writes or a long running transaction increases your undo space size, it may be evicted from the buffer pool and stored on disk. This is when problems often start to happen. Instead of simply operating in memory, the purge thread now has to perform IO, which slows it down dramatically and makes it unable to handle the amount of changes coming in.
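A toy model of the spiral (all rates here are assumed for illustration, not measured): once the purge thread becomes IO-bound, its effective rate can drop below the incoming change rate, and the backlog only grows from there.

```python
def backlog_after(seconds, incoming_rate, purge_rate):
    """Unpurged undo records after `seconds` of steady load."""
    return max(0, (incoming_rate - purge_rate) * seconds)

# In-memory purge easily keeps up with 10K changes/sec...
in_memory = backlog_after(60, incoming_rate=10_000, purge_rate=50_000)
# ...but once purge must do disk IO, the same load builds a backlog.
on_disk = backlog_after(60, incoming_rate=10_000, purge_rate=2_000)
```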

The solution to this problem may range from pacing the load (helpful for batch job operations, as they can be controlled), to using innodb_max_purge_lag, to enabling a separate purge thread (or threads) via innodb_use_purge_thread if you're running Percona Server.
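In my.cnf terms, the two server-side options mentioned above might look like this (variable names are from the post; the values are illustrative assumptions, not tuning recommendations):

```ini
[mysqld]
# Throttle writers when purge falls more than this many undo records behind
innodb_max_purge_lag = 1000000
# Percona Server / XtraDB only: offload purge to a dedicated thread
innodb_use_purge_thread = 1
```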

Entry posted by peter |


Reasons for run-away main Innodb Tablespace

So you're running MySQL with the innodb_file_per_table option, but your ibdata1 file, which holds the main (or system) tablespace, has grown dramatically from its starting 10MB size.
What could be the reason for this growth, and what can you do about it?

There are a few things which are always stored in the main tablespace: the system tables, also known as the data dictionary (table and index definitions and related data); the doublewrite buffer; the insert buffer, which temporarily stores records to be inserted into index trees; and the undo segment, which holds previous versions of changed rows.

The system tables size depends on the number and complexity of the tables you have in the system. If you have hundreds of thousands of tables it may consume gigabytes of space, but for most installations we're speaking about tens of megabytes or less. The doublewrite buffer is fixed size (2MB = 128 16KB pages) and so will not affect the growth of your main Innodb tablespace. The insert buffer size is restricted to half of the buffer pool size (can be changed via the innodb_ibuf_max_size option), which can be significant for systems with large amounts of memory. Finally, undo space can grow unbounded, depending on your transaction sizes.

The challenge is that both the insert buffer and the undo space grow and shrink during database operation, and unless you catch them while they are large you will not know what caused your system tablespace size to explode – they will shrink back to normal size and all you will have is free space, which you unfortunately can't reclaim without reloading your database. This is where trending can help; for example, the MySQL CACTI Templates have the graphs you need.

If you're looking at SHOW INNODB STATUS, this is how you can see the insert buffer size:


  Ibuf: size 108931, free list len 64619, seg size 173551,

In this case we can see the segment is allocated 173551 pages, which is about 2.7GB. Only about 2/3 of it is in use right now, but for the sake of monitoring tablespace size you need the fully allocated number.
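The conversion from that "seg size" figure to bytes is straightforward, since the value is counted in 16KB pages:

```python
# Insert buffer segment size from SHOW INNODB STATUS, in 16KB pages.
seg_size_pages = 173551
page_bytes = 16 * 1024
gib = seg_size_pages * page_bytes / 2**30  # ~2.65 GiB
```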

Let's now look at the undo space, which is a lot more interesting (and which is the most likely cause of an extremely large system tablespace).

Records can be stored in the undo tablespace for two reasons: first, so a transaction can be rolled back if needed; second, so older row versions remain available for consistent reads (MVCC) by other transactions. The history list length in SHOW INNODB STATUS shows how many transactions are still waiting to be purged:


  History list length 4000567

In this case the history list length is about 4 million, which means there are 4 million transactions which have not yet been purged. This is not a precise measure, as a single transaction may modify a single row and so be responsible for one undo entry, or it may modify millions of rows; row size can also vary a lot. For many OLTP applications, though, which have a lot of tiny transactions, it is a very good indicator.

If you’re running Percona Server the following can be helpful:


  mysql> SELECT * FROM innodb_rseg;
  +---------+----------+----------+---------+------------+-----------+
  | rseg_id | space_id | zip_size | page_no | max_size   | curr_size |
  +---------+----------+----------+---------+------------+-----------+
  |       0 |        0 |        0 |       6 | 4294967294 |     20993 |
  +---------+----------+----------+---------+------------+-----------+
  1 row in set (0.00 sec)

curr_size displays the current size of the rollback segment (RSEG), which defines the undo segment size. It is a complex data structure, though, and you can't easily convert this number into the number of data pages used for undo slots or the number of rows stored.
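While curr_size cannot be mapped to rows or undo slots, it is counted in 16KB pages, so converting it to a rough on-disk footprint is easy:

```python
# Rollback segment size from the innodb_rseg output above, in 16KB pages.
curr_size_pages = 20993
mib = curr_size_pages * 16 * 1024 / 2**20  # ~328 MiB of undo space
```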

From the practical standpoint there are 3 reasons a lot of undo space may be required:

Running a transaction which makes a lot of changes – if a transaction modifies a lot of rows, it has to use a lot of space in the undo segment and there is no way around it. Be careful in particular with UPDATE or DELETE transactions which touch a lot of rows. In many cases doing such processes in chunks, updating/deleting a few thousand rows per transaction, may be better if your application can handle it. Note ALTER TABLE will not require an excessive amount of undo space even for very large tables, as it internally commits every 10000 rows.
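The chunked delete pattern described above can be sketched as follows. This example uses sqlite3 so it is self-contained and runnable; on MySQL you would use DELETE ... LIMIT n with a COMMIT per chunk instead of the rowid subquery:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (id INTEGER PRIMARY KEY, archived INT)")
conn.executemany("INSERT INTO t1 (archived) VALUES (?)",
                 [(i % 2,) for i in range(10_000)])
conn.commit()

CHUNK = 1000
while True:
    cur = conn.execute(
        "DELETE FROM t1 WHERE rowid IN "
        "(SELECT rowid FROM t1 WHERE archived = 1 LIMIT ?)", (CHUNK,))
    conn.commit()  # committing per chunk keeps each undo footprint bounded
    if cur.rowcount == 0:
        break

remaining = conn.execute("SELECT COUNT(*) FROM t1").fetchone()[0]
```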

Running a very long transaction – if you're running a very long transaction, be it even a SELECT, Innodb will be unable to purge records for changes made after this transaction started, in the default REPEATABLE-READ isolation mode. This means very long transactions are very bad, causing a lot of garbage to accumulate in the database, and it is not limited to undo slots. When speaking about long transactions, time is a bad measure. Having a transaction open for weeks on a read-only database does no harm; however, if the database has a very high update rate, say 10K+ rows modified every second, even a 5-minute transaction may be considered long, as it is enough to accumulate about 3 million row changes.
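The rule of thumb at the end of that paragraph is just a rate-times-duration estimate:

```python
# At ~10K row modifications/sec, a 5-minute open transaction pins
# roughly 3 million row versions in undo space.
rows_modified_per_sec = 10_000
transaction_seconds = 5 * 60
pinned_row_versions = rows_modified_per_sec * transaction_seconds
```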

Purge thread falling behind – this is the most dangerous reason. It is possible for database updates to happen faster than the purge thread can purge records once they are no longer needed, which means undo space can grow until it consumes all free space (or the specified max size for the ibdata1 file). The "good" thing is that performance typically starts to suffer terribly well before that, so it gets noticed. There are two things you can do about this problem. First, you can use innodb_max_purge_lag to slow down the threads doing modifications when the purge thread can't keep up; this however does not work in all cases. If you're running XtraDB, you can also use innodb_use_purge_thread to use a dedicated purge thread, which works a bit faster as it does not need to compete with the other activities of the main thread. You can also use more than one purge thread by setting this variable to higher values, though this functionality is a bit experimental.

One related question I get asked often is why the master and slave may end up with very different tablespace sizes. The answer, of course, is that their workloads are very different. On one hand the slave applies all updates from a single thread, so there is less chance for the purge thread to fall behind; on the other hand, a lot of people use the slave for very long queries, which makes the long-transaction reason a lot more likely. So it can go both ways – either the master or the slave can end up with the larger main tablespace.

Writing this blog post I also discovered that even in XtraDB we do not have as much transparency regarding undo space and purging as I would like. I filed a number of feature requests and I hope we'll have a chance to fix this soon.

Entry posted by peter |



Quite frequently I see people confused about what the table locks reported by SHOW INNODB STATUS really mean. Check this out for example:


  ---TRANSACTION 0 4872, ACTIVE 32 sec, process no 7142, OS thread id 1141287232
  2 lock struct(s), heap size 368
  MySQL thread id 8, query id 164 localhost root
  TABLE LOCK table `test/t1` trx id 0 4872 lock mode IX

This output gives the impression that Innodb has taken a table lock on the test/t1 table, and many people tend to think Innodb would, in some circumstances, abandon its row-level locking and use table locks instead. I've seen various theories, ranging from lock escalation to using table locks in special cases, for example when no indexes are defined on the table. None of this is right.

In fact Innodb uses multiple granularity locking: a lock is always taken on the whole table before individual row locks can be taken.
Such locks are called intention locks, hence the abbreviation IX = Intention eXclusive. Intention locks do not work the same way as table locks – an intention exclusive lock does not prevent other threads from taking intention shared or even intention exclusive locks on the same table.

What does "intention" mean? Just what it says: if Innodb sets an intention exclusive lock on a table, it plans to lock some of that table's rows in exclusive mode. What are these locks used for? They make it possible to handle operations on the whole table – for example, to drop a table you need to lock it exclusively, and the intention locks already held tell the server whether that conflicts with ongoing row-level activity.
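The compatibility rules for multiple granularity locking can be sketched as a small table (this is the standard textbook matrix, shown for illustration rather than taken from the Innodb source):

```python
# True means a requested lock can coexist with a held lock on the same table.
COMPATIBLE = {
    ("IS", "IS"): True,  ("IS", "IX"): True,  ("IS", "S"): True,  ("IS", "X"): False,
    ("IX", "IS"): True,  ("IX", "IX"): True,  ("IX", "S"): False, ("IX", "X"): False,
    ("S",  "IS"): True,  ("S",  "IX"): False, ("S",  "S"): True,  ("S",  "X"): False,
    ("X",  "IS"): False, ("X",  "IX"): False, ("X",  "S"): False, ("X",  "X"): False,
}

def compatible(held, requested):
    return COMPATIBLE[(held, requested)]
```

Note that ("IX", "IX") is compatible, which is why two writers can both hold the intention exclusive lock shown in the status output above, while a full table lock ("X") conflicts with any intention lock.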

So do not worry about the intention table locks you may observe in SHOW INNODB STATUS output; they will almost never be the cause of your lock waits or deadlocks.

Entry posted by peter |
