You may have seen in the last couple of weekly news posts that Baron mentioned we are working on a new adaptive flushing algorithm in InnoDB. In fact, we already have three such algorithms in Percona Server (reflex, estimate, keep_average), so why do we need one more? Okay, let me first show the current problems, and then we will get to solutions.
The basic problem is that, unfortunately, none of the existing flushing implementations (neither MySQL's native adaptive flushing nor those in Percona Server) handles a sustained, write-intensive workload properly. Our latest invention, "keep_average", does a very good job on systems with SSD/Flash storage, but it is not as good for regular slow hard drives.
Let me state the problem plainly: if you have a lot of memory (and this is not rare nowadays, for example on a Cisco UCS C250), your database fits into memory, and your write rate is significant, then the slow disk is, by definition, not able to keep up with all the changes you make in memory.
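To put rough numbers on that claim, here is a back-of-the-envelope calculation. Both figures below are assumptions for illustration, not measurements from the benchmark that follows: half of a 12GB buffer pool dirty (16KB pages), and two disks in software RAID0 sustaining about 300 random writes per second.

```c
#include <stdio.h>

int main(void)
{
    /* Assumed, not measured: half of a 12GB buffer pool is dirty
       (16KB pages), and the RAID0 pair sustains ~300 random writes/s. */
    double dirty_pages = 0.5 * 12 * 1024 * 1024 * 1024 / (16 * 1024);
    double random_iops = 300.0;

    /* ~393,216 pages / 300 IOPS = ~22 minutes of pure random writes */
    printf("%.0f dirty pages need roughly %.0f minutes to flush\n",
           dirty_pages, dirty_pages / random_iops / 60.0);
    return 0;
}
```

Twenty-odd minutes just to drain the dirty pages is exactly the scale of stall you see on the graph below.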
Here is a graph for the tpcc-mysql benchmark (100 warehouses, about 10GB of data; innodb_buffer_pool_size=12G; innodb_log_file_size=1G) on MySQL 5.5.10 with innodb_adaptive_flushing=ON (the default). The problem here is that the system has quite slow hard drives (two disks in software RAID0).
As you can see, after the initial warmup, throughput drops steadily, all the way to zero, and may stay there for a long time (minutes). This comes from the fact that MySQL performs changes in memory at a faster rate than it can write them to disk. Eventually it gets into an "async" state, where InnoDB tries to flush as many pages as possible in order to keep the checkpoint age within the capacity of the InnoDB transaction logs, and this makes the situation even worse (these are the periods of zero throughput on the graph).
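For readers unfamiliar with the "async" state: InnoDB compares the checkpoint age (how much redo has been written since the last checkpoint) against watermarks derived from the log capacity, roughly like the sketch below. The fractions here are illustrative only; the exact constants are computed in log0log.c and differ somewhat.

```c
typedef unsigned long long lsn_t;

enum flush_state { FLUSH_NONE, FLUSH_ADAPTIVE, FLUSH_ASYNC, FLUSH_SYNC };

/* Simplified view of how InnoDB reacts to checkpoint age. The
   watermark fractions are illustrative, not the real constants. */
enum flush_state
checkpoint_pressure(lsn_t current_lsn, lsn_t checkpoint_lsn,
                    lsn_t log_capacity)
{
    lsn_t age = current_lsn - checkpoint_lsn;

    if (age > log_capacity / 16 * 15)  /* nearly out of log space:       */
        return FLUSH_SYNC;             /* block user threads and flush   */
    if (age > log_capacity / 8 * 7)    /* the "async" state seen in the  */
        return FLUSH_ASYNC;            /* graph: flush as hard as we can */
    if (age > log_capacity * 3 / 4)    /* adaptive flushing territory    */
        return FLUSH_ADAPTIVE;
    return FLUSH_NONE;
}
```

The zero-throughput periods correspond to the top two branches: once the checkpoint age crosses the async watermark, flushing becomes a flood rather than a steady stream.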
MySQL's general recommendation for this case is to increase innodb_io_capacity and decrease innodb_max_dirty_pages_pct so as to have fewer dirty pages. But I call that tuning by shooting yourself in the left foot. (You can't run fast with a broken foot, right?) And in practice it does not work, as MySQL is not able to keep the number of dirty pages within the given innodb_max_dirty_pages_pct limit anyway.
Another possible solution would be to increase innodb_log_file_size, but that 1) only delays the problem; 2) increases crash recovery time (an important factor with slow disks); and 3) runs into a hard limit, as MySQL does not support innodb_log_file_size > 4GB (Percona Server does).
To make things more interesting, let's touch on the flushing of neighbor pages. While researching this, we found that it works in a way we really did not expect. My understanding was that flushing of neighbor pages was implemented to perform sequential writes where possible and avoid random writes, which is quite critical for hard drives. What we found is that InnoDB actually does the following. Say we want to flush page P. InnoDB looks at an area of 128 pages around P and flushes every dirty page in that area. To illustrate, say we have an area of memory like this: ...D...D...D....P....D....D...D....D
where each dot is a page that does not need flushing, each “D” is a dirty page that InnoDB will flush, and P is our page.
So, as a result, instead of performing 1 random write, InnoDB performs 8 random writes; the dirty pages in the area are rarely contiguous, so these are not the sequential writes the mechanism was presumably meant to produce. I do not have a good explanation for why it is implemented this way.
To make the situation even worse, the flushed neighbor pages are counted toward the number of pages we asked to be flushed. Say, for example, we see that to improve the checkpoint age we need to flush 8 pages. InnoDB flushes page P and its 7 "neighbors" (which are not really neighbors), and then stops: there, 8 pages flushed. That is, instead of the 8 pages we expected to be flushed, InnoDB flushes 1 needed page and 7 random pages, which may not even be relevant to improving the checkpoint age.
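In code form, the behavior we observed looks roughly like this. This is a sketch, not the actual buf0flu.c code; page_is_dirty(), flush_page() and oldest_dirty_page() are hypothetical helpers standing in for the real buffer pool machinery.

```c
#define FLUSH_AREA 128          /* pages scanned around the target page */

int  page_is_dirty(unsigned long page_no);   /* hypothetical helper */
void flush_page(unsigned long page_no);      /* hypothetical helper */
unsigned long oldest_dirty_page(void);       /* hypothetical helper */

/* Flush page P plus every dirty page in the surrounding 128-page
   area, and report them all as "flushed". */
static int flush_with_neighbors(unsigned long page_no)
{
    unsigned long low  = (page_no / FLUSH_AREA) * FLUSH_AREA;
    unsigned long high = low + FLUSH_AREA;
    int flushed = 0;

    for (unsigned long p = low; p < high; p++) {
        if (page_is_dirty(p)) {
            flush_page(p);      /* usually a random write: the dirty
                                   pages in the area are rarely adjacent */
            flushed++;
        }
    }
    return flushed;
}

void flush_for_checkpoint(int n_to_flush)
{
    int n_flushed = 0;

    /* Neighbors count toward the quota, so a request for 8 pages can
       be "satisfied" by 1 page that actually moves the checkpoint
       plus 7 pages that do nothing for it. */
    while (n_flushed < n_to_flush)
        n_flushed += flush_with_neighbors(oldest_dirty_page());
}
```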
This makes calculating how many pages we need to flush extremely hard (read “impossible”), and this is one of the reasons why innodb_adaptive_flushing is not able to keep up.
What if we disable the flushing of neighbor pages (Percona Server has the option innodb_flush_neighbor_pages for this)? Well, it may help to some extent, but remember that hard drives love sequential operations: a drive that manages only a couple hundred random 16KB writes per second can write thousands of pages per second sequentially, so your throughput with neighbor flushing disabled may turn out significantly worse.
So what is the solution? This was exactly the topic of our research. We set these goals:
- Provide stable throughput by avoiding big jumps in flushing pages.
- Make the algorithm independent of innodb_io_capacity and innodb_max_dirty_pages_pct. (This is important for systems where I/O capacity may vary a lot, like EC2 systems, or is affected by periodic tasks like backup or heavy background queries.)
- If we see that flushing to disk is not able to keep up with changes in memory, we need a throttling mechanism that limits the rate of changes in memory (see the sketch after this list).
- Optimize the flushing of sequential neighbor pages in a way that makes sense.
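To give a flavor of the throttling goal above, here is a minimal sketch of the concept. The function name, the rate inputs, and the 10ms ceiling are all illustrative assumptions; this is not the algorithm as we are implementing it.

```c
#include <unistd.h>

/* Sketch of the throttling idea: if redo is generated faster than
   the flusher retires it, delay user threads briefly so memory-side
   changes cannot outrun the disk indefinitely. */
void maybe_throttle(double redo_gen_rate,  /* redo bytes/s generated    */
                    double flush_rate)     /* redo bytes/s checkpointed */
{
    if (redo_gen_rate <= flush_rate)
        return;                 /* disk keeps up: no delay needed */

    /* Delay in proportion to how far generation outruns flushing,
       up to an assumed 10ms per batch of work. */
    double excess = (redo_gen_rate - flush_rate) / redo_gen_rate;
    usleep((useconds_t)(excess * 10000.0));
}
```

The point of such a mechanism is to trade a small, smooth slowdown for the long zero-throughput stalls seen in the graph above.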
We have made good progress, and I will keep you posted on our ideas and results.