Percona is hiring a consultant

We’re hiring. We are looking for the following qualifications:

  1. Expert knowledge of MySQL. Not just “certified” — years of production experience with it. You need to know server internals, for example. You need to be able to do anything from optimizing difficult queries to moving high-volume services between data centers without interruption.
  2. Expert knowledge of InnoDB. You should understand its inner workings well enough to answer questions about its internals from memory, such as “how does the insert buffer work?” or “how does MVCC work on secondary indexes?” You should also know why it has trouble on some workloads and how to solve that.
  3. Expert knowledge of Linux systems administration. You need to know how to solve issues with filesystems, hardware, and networking. You need to be able to use tools such as gdb, strace, tcpdump, etc to solve weird problems.
  4. Expert with Apache, memcached, and other technologies you’d typically find in a LAMP application. You need to be able to administer and troubleshoot a variety of popular open-source server software.
  5. Proficient with Maatkit and MMM.
  6. Proven success working in a distributed environment where e-mail, IRC and voice calls are your only interaction with clients, colleagues and managers on a daily basis. You must be a self-starter.

We need all of the above, but if you’re a little weaker in some areas that might be OK. Highly desirable skills include:

  1. Non-core MySQL technologies: XtraDB, Xtrabackup, NDB Cluster, third-party storage engines and appliances, etc.
  2. Technologies such as LVS, Nginx, Sphinx, and load balancers.
  3. Other operating systems such as FreeBSD and OpenSolaris.
  4. Amazon’s cloud offerings.
  5. Programming languages and platforms, especially frameworks such as Ruby on Rails and Django.

If you think you’re a good candidate, please fill out the contact form on our website. Thanks!

Entry posted by Baron Schwartz |



What do we optimize with mk-query-digest?

When we look at an mk-query-digest report, we typically look at the queries causing the most impact (the sum of the query execution times) as well as the queries with the longest samples. Why do we look at these?

Queries with the highest impact are important because optimizing them typically improves system capacity, which in the end often improves response time for a variety of queries. Queries causing the highest impact also often correspond to either the most frequent or the slowest transactions in the system, which makes them important to look at.

The slowest queries correspond to bad response times. If there is a 30-second query, somebody had to wait 30 seconds, which is typically not acceptable for interactive queries. It is best to look at the 95th-percentile time rather than the slowest time, because there are often exceptionally high times that are unrelated to the query itself: general system overload, row-level locks, and table locks can all make a query's response time unreasonable. If the slowest query is not interactive, for example part of a batch job, it is often fine, assuming it is not causing system overload; interactive queries, which have real users waiting for them, are usually the highest priority to optimize. It helps if batch jobs use a different MySQL user than the web site itself, as that makes it easier to exclude such queries from the analysis.
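In mk-query-digest terms, the two views above are just different aggregations of the same samples. Here is a minimal sketch of both, assuming we already have (fingerprint, execution time) pairs parsed from a slow log; the fingerprints and timings are invented for illustration:

```python
# Aggregate per-fingerprint impact (total execution time) and the
# 95th-percentile time -- the two numbers the analysis above focuses on.
# The queries and timings below are made-up sample data.
from collections import defaultdict
import math

samples = [
    ("SELECT * FROM orders WHERE user_id=?", 0.002),
    ("SELECT * FROM orders WHERE user_id=?", 0.003),
    ("SELECT * FROM orders WHERE user_id=?", 0.002),
    ("SELECT COUNT(*) FROM logs WHERE day=?", 4.0),
]

times = defaultdict(list)
for fingerprint, t in samples:
    times[fingerprint].append(t)

for fingerprint, ts in times.items():
    ts.sort()
    impact = sum(ts)  # total time consumed: the "impact" view
    # index of the 95th-percentile sample (clamped for tiny lists)
    p95 = ts[min(len(ts) - 1, math.ceil(0.95 * len(ts)) - 1)]
    print(f"{impact:8.3f}s  p95={p95:.3f}s  {fingerprint}")
```

A rare 4-second query can dominate the impact column even though the frequent 2-3 ms query is what interactive users actually experience, which is why both views matter.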

It is important to mention that neither of these two query types guarantees an improvement in application performance (and even less so for the specific transaction you're trying to optimize). It typically works out, though, because the transactions you're trying to optimize will usually manifest themselves in one of these query sets.

If you want to improve the response time of a particular transaction or user interaction, you really have to look at that transaction rather than at data aggregated for the whole server. Understanding which queries are being run (together with other interactions such as memcached calls and remote sites) is the key to understanding what causes performance issues.

If you have slow queries from an interactive application, it is true that somebody is suffering (though it is unclear how critical that suffering is). It is, however, perfectly possible to get bad performance, with MySQL responsible for a large portion of the response time, without any slow queries at all. We have seen pages that take 10,000, 20,000, or even more queries to generate. In such cases, even with queries taking 1 ms on average, you are looking at 10, 20, or more seconds to respond.
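Making the arithmetic explicit: no single query is slow in that scenario, yet the page is. A trivial sketch with the numbers from the text:

```python
# Even "fast" 1 ms queries add up when a page issues thousands of them.
queries_per_page = 10_000
avg_query_time_s = 0.001  # 1 ms on average, nothing a slow log would flag

page_db_time_s = queries_per_page * avg_query_time_s
print(page_db_time_s)  # 10 seconds spent in MySQL alone for one page
```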

Entry posted by peter |


Battle around MySQL heats up

As the deadline approaches for the EC to decide whether Oracle can acquire Sun, and hence MySQL, tensions seem to be heating up. Monty has posted an open letter calling for help lobbying the EC to stop Oracle from buying MySQL.

Entry posted by peter |

There will be no separate Percona Performance Conference in April 2010

Now that O’Reilly has announced they’re going to have a MySQL conference independently of Sun/MySQL, we have decided not to proceed with plans for our own Performance Conference. We are participating in the O’Reilly conference, and we will do everything possible to make that a great success for everyone. See you there!

Entry posted by Baron Schwartz |



Redis Benchmarks on FusionIO (Round 1)

Peter took a look at Redis some time ago; and now, with the impending 1.2 release and a slew of new features, I thought it time to look again.

One of the more interesting features in 1.2 is the ability to operate in “append-only file persistence mode”, meaning Redis has graduated from a semi-persistent to a fully-persistent system! Using the included redis-benchmark tool, I ran the following command


in five modes:

1 – In-Memory
I set “save 900000000 900000000” so nothing would be written to disk during the tests.
2 – Semi-Persistent
I set “save 1 1” so that changes would be flushed to disk every second (assuming there was at least one change the previous second).
3 – Fully Persistent
I set appendonly yes and appendfsync always, which calls an fsync for every operation.
4 – Semi-Persistent
I set appendonly yes and appendfsync no, which is about as persistent as MyISAM. An fsync is never explicitly called, rather we’re relying on the OS.
5 – Semi-Persistent
I set appendonly yes and appendfsync everysec, which explicitly calls an fsync every second.

(By default, redis-benchmark invokes 50 parallel clients, with a keep alive of 1, using a 3-byte payload.)
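For reference, the five modes correspond roughly to these redis.conf fragments (directive names as documented for Redis 1.2; the save thresholds are the ones quoted above):

```
# 1 - In-Memory: thresholds so high that snapshotting effectively never runs
save 900000000 900000000

# 2 - Semi-Persistent: snapshot every second if at least one key changed
save 1 1

# 3 - Fully Persistent: fsync the append-only file on every operation
appendonly yes
appendfsync always

# 4 - Semi-Persistent: never fsync explicitly, rely on the OS
appendonly yes
appendfsync no

# 5 - Semi-Persistent: explicitly fsync the AOF once per second
appendonly yes
appendfsync everysec
```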

Redis SET Operations

I am quite surprised at the performance penalty here for full durability.

Aside from the numbers, there are a couple of other interesting things I noticed during the benchmarking (note that the full benchmark suite was run, not just SET):

– In case 2, the dump.rdb file was 40k, while in cases 3, 4, and 5, appendonly.aof was 1.9M. If you're going to use appendonly in production, be sure to consider the additional storage requirements.
– In every case where appendonly was enabled, I received the following error many times as the test was completing (note that this did not occur during the SET operations):

Error writing to client: Broken pipe

To be fair, this is a release candidate, not a GA release.

* All data easily fit into memory during these tests.
* The FusionIO used was: 160 GB SLC ioDrive
* XFS was the filesystem
* redis version was a git clone from Thu Dec 10 10:41:50 PST 2009

Entry posted by Ryan Lowe |




Dear Community,

As of today, XtraBackup release 1.0 is available.

This release includes the following changes:

  • XtraBackup has been ported to Windows. The .MSI package as well as a .tar.gz is available for the 32-bit platform
  • Be more verbose on reporting scp errors

Fixed bugs:

Binary packages for RHEL 4 and 5, Debian, FreeBSD, and Windows, as well as the XtraBackup source code, are available; MacOS packages will come later.

Debian and RPM packages are available in the Percona repository.

The project lives on Launchpad, and you can report bugs to the Launchpad bug system. The documentation is available on our Wiki.

For general questions use our Percona-discussions group, and for development questions the Percona-dev group.

For support, commercial and sponsorship inquiries contact Percona.

Entry posted by Aleksandr Kuzminsky |



Why delayed flushing can result in less work

I can think of at least two major reasons why systems delay flushing changes to durable storage:

1. So they can do the work when it’s more convenient.
2. So they can do less work in total.

Let’s look at how the second property can be true.

A commenter on Deva’s recent post on InnoDB adaptive flushing asked,

That’s really interesting stuff; am I reading it correctly though that adaptive flush actually increased the IOOPS? Looking at the IO graphs, it looks like both the peak IO rate and average IO rate were higher with adaptive flush enabled (assuming I’m reading properly).

Yes. Adaptive flushing actually increased the overall number of I/O operations performed. Smoothing out the workload can cause more work to be done in total. To see why, remember that InnoDB works in 16KB pages. Suppose someone changes a row. If InnoDB is flushing constantly, it flushes the entire 16KB page. Just afterwards, another row on the same page is changed, and another page flush results. If the first flush had been delayed, the two flushes could have been combined into one. This is called write combining. In some workloads the same rows are updated many, many times, so delaying flushes and permitting write combining can be an enormous reduction in the number of I/O operations.
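A toy model makes the write-combining effect concrete. This is only an illustration of the idea, assuming each row change dirties one 16KB page, not InnoDB's actual flushing logic:

```python
# Toy model of write combining: each row change dirties one page.
# Immediate flushing writes a page once per change; delayed flushing
# writes each distinct dirty page only once per flush cycle.
row_changes = [1, 1, 2, 1, 3, 2, 1]  # page touched by each row change

flushes_immediate = len(row_changes)     # one 16KB page write per change
flushes_delayed = len(set(row_changes))  # one write per distinct dirty page

print(flushes_immediate, flushes_delayed)  # 7 vs 3 page writes
```

The hotter the rows (the more repeated page numbers in the stream of changes), the bigger the gap between the two counts.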

Now on to the next question from that comment:

Seems to imply that if recovery time isn’t a major factor, you’re better off (for this workload at least) running w/o that option enabled?

Maybe. It depends on whether the flushes are consuming a resource that something else needs. If nothing else needs the disk, what the heck — go crazy, write as much as you want. If the data fits in the buffer pool, no reads are happening, so the disk is only used for writes. If everything is lined up at the operating system and RAID controller layers, then a read isn’t required for a write to take place, either. Remember too that these writes are background operations, so they aren’t blocking anything in the foreground, unless there is a side effect such as mutex contention inside MySQL or InnoDB. So whether this matters for you is workload-dependent.

Entry posted by Baron Schwartz |



FusionIO – time for benchmarks

I have posted about FusionIO a couple of times: RAID vs SSD vs FusionIO and Testing FusionIO: strict_sync is too strict…. The problem was that FusionIO either did not provide durability or performed too badly in strict mode, so I lost interest in FusionIO for a couple of months. But I should express respect to the FusionIO team: they did not ignore the problem, and recently I was told that with the latest drivers FusionIO provides durability even without strict_mode. While I do not fully understand how it works internally (FusionIO does not have RAM on board and takes memory from the host, so there is always a cache kept in OS memory), I also do not have a reason to doubt it (but I will test that someday…). So I decided to see what IO performance we can expect from FusionIO in different modes.

First, a few words about the card itself. I have a 160 GB SLC ioDrive, and a simple Google search for “FusionIO” takes you to the Dell shop (so I am not disclosing any private information here). At that link you can get the card for $6,569.99, which gives us ~$41/GB. But we should talk about real space. To provide “steady” write performance, the card comes pre-formatted with only 120GB available (25% of the space is reserved for internal needs), so the real cost jumps to ~$55/GB. Even that is not enough to get maximal write throughput: if you want the announced 600MB/s, you need to use even less space (you will see that in the benchmark results). As the description on the Dell site says, you should expect “write bandwidth of 600 MB/s and read bandwidth of 700 MB/s”; let’s see what we get in real life.
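The per-gigabyte arithmetic, for reference:

```python
# Cost per GB of the ioDrive at the quoted Dell price.
price_usd = 6569.99
raw_gb = 160        # advertised capacity
formatted_gb = 120  # usable capacity after the 25% internal reserve

print(round(price_usd / raw_gb, 2))        # ~41 $/GB on raw capacity
print(round(price_usd / formatted_gb, 2))  # ~55 $/GB on usable capacity
```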

There are a lot of numbers, so let me put my conclusions first; later I will back them up with the numbers.

So conclusions:

  • Reads: you can really get 700MB/s of read bandwidth, but you need 32 working threads
    for that. For a single thread I got 140MB/s, and for 4 threads, 446MB/s.
  • Writes (random writes) are a more complex story:
    – With 100GB filled on a 125GB partition I got 316.15MB/s for 4 threads; for 1 thread, 131MB/s. However, for 8 threads the result drops to 162.96MB/s. The latency is also interesting: for 1 thread, 95% of requests get a response within 0.11ms (yes, 110 microseconds!), but for 8 threads it is already 0.85ms, and for 16 threads, 1.55ms (which is still very good). Something happens at 32 threads: response time jumps to 19.71ms. I think there is some serialization issue inside the driver or firmware.
    – You can get about 600MB/s of write bandwidth on a 16GB file with 16 threads. As the file size increases, write bandwidth drops.
    – I got strange results with sequential writes: as the number of threads increases, write bandwidth drops noticeably. Again, it looks like some serialization problem; I hope FusionIO addresses it as well.

Now to the numbers and benchmarks. All details are available here. I used the sysbench fileio benchmark for the tests (the full script is available under the link). The block size is 16KB and I used directio mode.

Let me show some graphs; numeric results are available in the full report. For reference I also show results for RAID10 over 8 disks, 2.5″, 15K RPM each.

Random reads:

From this graph we see that to fully utilize the FusionIO card we need 4 or more working threads.

Random writes:

Again, 4 threads are the interesting point here: we get a peak at 4 threads, and with more, bandwidth drops. Again I assume it is some serialization / contention in the driver or firmware.

Sequential reads:

With sequential reads it is the same story with 4 threads.

Sequential writes:

This is what I mentioned about sequential writes: instead of increasing, bandwidth drops when we have more than 1 working thread, and we see only 130MB/s maximum throughput.

Write bandwidth vs filesize:

Here I tested how file size (fill factor) affects write performance (8 working threads). The point is that you really can get the promised 600MB/s (or close) write throughput, but only on small file sizes (<=32GB). As the size increases, throughput drops noticeably.

To finalize the post: I think FusionIO provides really good IO read and write performance. If your budget allows and you are fighting IO problems, it could be a good solution. Speaking of budget, $6,569.99 may look like quite a significant price (and it is :) ), but we should also consider the consolidation factor: how many servers we can replace with a single card. A rough estimate says it may be 5-10 servers, and that many servers cost much more.

Entry posted by Vadim |



How many partitions can you have?

I had an interesting case recently. A customer with a large MySQL data warehouse had a table into which data was merged with INSERT ON DUPLICATE KEY UPDATE statements, and performance was extremely slow. It turned out to be caused by the hundreds of daily partitions created for this table. What is most interesting (and surprising) is that not every statement is affected equally, as you can see from the benchmarks below:

I created the following test table:


  CREATE TABLE `p10` (
      `id` int(10) UNSIGNED NOT NULL,
      `c`  int(10) UNSIGNED NOT NULL,
      PRIMARY KEY (`id`),
      KEY (c)
  ) ENGINE=InnoDB
  PARTITION BY RANGE(id) (
      PARTITION p100000 VALUES LESS THAN(100001),
      PARTITION p200000 VALUES LESS THAN(200001),
      PARTITION p300000 VALUES LESS THAN(300001),
      PARTITION p400000 VALUES LESS THAN(400001),
      PARTITION p500000 VALUES LESS THAN(500001),
      PARTITION p600000 VALUES LESS THAN(600001),
      PARTITION p700000 VALUES LESS THAN(700001),
      PARTITION p800000 VALUES LESS THAN(800001),
      PARTITION p900000 VALUES LESS THAN(900001),
      PARTITION p1000000 VALUES LESS THAN(1000001)
  );

I used MySQL 5.1.41 for my tests, increasing the buffer pool to 1G and the log file size to 128M so all data fits comfortably in memory.

I varied the number of partitions from 1 to 1000 and loaded the table with 1,000,000 sequential values from 1 to 1 million (the C column was set to the same value as the ID column), using bulk insert statements of 1000 rows each.

Loading the data took 9 seconds for 1 partition, 10 seconds for 10 partitions, 16 seconds for 100 partitions, and 23 seconds for 1000 partitions, which means it slows down about 2.5 times as the number of partitions grows to 1000.

This regression is somewhat surprising, as in reality only 1 (at most 2) partitions get data inserted into them per insert statement.

When I tested the UPDATE path of INSERT ON DUPLICATE KEY UPDATE (adding ON DUPLICATE KEY UPDATE set c=c+1 to my bulk inserts), the regression became even larger.
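As a sketch, a batch statement for this UPDATE-path test could be generated like so; the table and column names follow the example table above, and the helper itself is hypothetical, not the script actually used in the benchmark:

```python
# Build one bulk INSERT ... ON DUPLICATE KEY UPDATE statement covering
# rows [start, start + n), mirroring the 1000-row batches in the test.
def bulk_upsert(start: int, n: int, table: str = "p10") -> str:
    # id and c get the same value, as in the benchmark's data load
    rows = ", ".join(f"({i}, {i})" for i in range(start, start + n))
    return (f"INSERT INTO {table} (id, c) VALUES {rows} "
            "ON DUPLICATE KEY UPDATE c = c + 1")

stmt = bulk_upsert(1, 1000)  # one 1000-row batch
print(stmt[:60] + "...")
```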

With 1 partition the insert took 50 seconds to complete; with 10 partitions, 52; with 100 partitions, 72; and with 1000 partitions, 290 seconds. The performance loss with a large number of partitions is about 6x in this case, and the pattern is a bit different: the slowdown becomes drastic going from 100 to 1000 partitions (about 4x, compared to only a 1.5x slowdown for the same step with plain inserts).

The difference grows even larger if we remove the index on column C: the “UPDATE” part of INSERT ON DUPLICATE KEY UPDATE completes in 22 seconds for 1 partition and 250 seconds for 1000 partitions, which is over a 10x difference.

Both MyISAM and InnoDB are affected by this issue. The UPDATE path for MyISAM without indexes took 10 seconds for 1 partition vs 52 seconds for 1000 partitions (the table_cache was warm for this test).

I had a couple of suspects in this case: is this the “setup” overhead of the statement opening all partitions for execution, or is this per-row overhead? To check, I ran the statements with different numbers of rows per batch. With 100 rows per batch and with 10000 rows per batch, performance was not significantly different from 1000 rows per batch, so I conclude this is per-row overhead.

This test gives another rather interesting data point: it looks like updates in InnoDB for CPU-bound workloads can be as much as 5 times slower than inserts, which is quite a difference.

CONCLUSION: This is an interesting topic and I should do more tests on it. In the meanwhile, the takeaway is simple: be careful with the number of partitions you use. Creating unused partitions for future use may cost you. From these benchmarks, performance remains reasonable with about 100 partitions (a reasonable number for most applications), though I would expect the numbers to be significantly workload-dependent.

Entry posted by peter |



Effect of adaptive_flushing

I recently had the chance to witness the effects of innodb_adaptive_flushing on the performance of InnoDB Plugin 1.0.5 in the wild, which Yasufumi wrote about previously here and here.

The server in question was Solaris 10 with an 8-disk RAID10 and two 32GB SSDs used for ZIL and L2ARC, 72G of RAM, and a 40G buffer pool. We started it with innodb_adaptive_flushing=OFF and innodb_doublewrite=OFF, then ramped up traffic, and everything looked stable; but I noticed one troubling thing: ~2GB of uncheckpointed data.

Database pages      2318457
Old database pages  855816
Modified db pages   457902
Log flushed up to   10026890404067
Last checkpoint at  10024612103454
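The ~2GB figure comes directly from the two LSNs in that output; InnoDB LSNs are byte offsets into the redo log, so the checkpoint age is a simple difference:

```python
# Checkpoint age = how far the log has advanced past the last checkpoint,
# i.e. how much redo would have to be applied during crash recovery.
log_flushed_up_to = 10026890404067
last_checkpoint_at = 10024612103454

checkpoint_age = log_flushed_up_to - last_checkpoint_at
print(checkpoint_age / 2**30)  # ~2.1 GiB of uncheckpointed changes
```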

We enabled innodb_adaptive_flushing just before 10am, which resulted in a few changes recorded by the Cacti MySQL Templates. The most important change for the client here is recovery time: with adaptive flushing enabled, InnoDB flushed modified data much more aggressively, leaving very little unflushed change, which translates to much faster crash recovery.


There was also a drop in history list length, which may be a result of the more aggressive flushing. In general this is good for performance, because InnoDB does not need to consider as many old versions of rows.


Here we can see the “spiky flushing” before 10am (when adaptive flushing was off), which can be very bad for performance: during those short periods of very high page-write activity, the system and/or InnoDB can become very slow (Yasufumi’s benchmarks show this very well). After enabling adaptive flushing we see a consistent, higher-than-previous-average rate of pages written to disk, which I find rather interesting.


Of course, there is also the question of system performance impact. More aggressive flushing means more disk IO and possibly other effects on the system, which could affect query response time, and so on. And what about CPU, load average, and the like? Actually, those did not change enough to see a difference in the graphs, so I am not including them; after all, we are not pushing this system near its limits, so I am not really surprised. We can see a change in disk IO which mirrors the InnoDB pages-written graph (shown above). I suspect ZFS is heavily buffering writes during the “spiky flushing” period, because the delta between peak-write and low-write is much lower in terms of bytes written than pages written.

It is also worth noting that this system is using 2 SSDs for the ZIL and L2ARC, and as everyone should know by now, an SSD’s lifetime is determined by the number of writes it can perform before failure. This means that more writes per second == shorter lifetime, so a shorter SSD lifetime is a negative side effect of enabling adaptive flushing on this server.


One more point worth mentioning: the normal approach to controlling InnoDB’s checkpoint age is adjusting the log file size. This server is configured with 2 * 1.5G log files and innodb_io_capacity=1000. While adaptive flushing was off, the checkpoint age was comparable to the log file size; the behavior with adaptive flushing enabled (keeping the checkpoint age very small), however, seems too aggressive. It’s possible we would see different results by adjusting innodb_io_capacity here.

Is there a lesson in this? Sometimes tuning InnoDB settings can have quite unexpected results. Setting innodb_adaptive_flushing ON seems better in this case, because there is IO bandwidth to spare and we are concerned about crash recovery time, but as Yasufumi’s previous posts show, it can sometimes have bad results, and one may need to try again and again before finding the “sweet spot” for a given server / workload combination. As always, you should benchmark your particular systems rather than blindly following advice you read on the internet, and have good monitoring and trending tools in place from the start.

Entry posted by Devananda |