Sep 03, 2014

MySQL 101: Monitor Disk I/O with pt-diskstats

Here on the Percona Support team we often ask customers to retrieve disk stats so we can monitor disk IO and measure block device IOPS and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the popular ones, and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. pt-diskstats is similar to iostat, but it is more interactive and contains extended information. It reports current disk activity, printing statistics for each sample interval (1 second by default) until interrupted. Under the hood, pt-diskstats collects samples of /proc/diskstats.
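
pt-diskstats needs no special instrumentation – it simply samples the kernel's own counters and turns the deltas between samples into per-second rates. You can peek at the raw source yourself (the numbers below are illustrative only; the counters are cumulative since boot):

$ grep ' sda ' /proc/diskstats
   8    0 sda 122310 25011 4391518 341520 89067 105214 2914632 3497196 0 630292 3838924

The fields after the device name are reads completed, reads merged, sectors read and milliseconds spent reading, then the same four counters for writes, followed by IOs currently in progress, total time spent doing IO, and a weighted busy time.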

In this post, I will share some examples of how to check whether the IO subsystem is performing properly or whether any disks are a limiting factor – all using the pt-diskstats tool.

pt-diskstats output consists of a number of columns, and to interpret it we need to know what each column represents.

  • rd_s is the number of reads per second, while wr_s is the number of writes per second.
  • rd_rt and wr_rt show the average response time in milliseconds for reads and writes respectively. This is similar to the await column in iostat output, except that pt-diskstats reports separate response times for reads and writes at the disk level. (Modern iostat also splits read and write latency out, but most distros don't ship the latest iostat in their sysstat, or equivalent, package.)
  • rd_mrg and wr_mrg are two other important columns. *_mrg tells us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS – in other words, whether the scheduler could consolidate many or few operations. If rd_mrg/wr_mrg is a high percentage, the IO workload is largely sequential; if it is a low percentage, the workload is largely random. Binary logs, redo logs (aka ib_logfile*), the undo log and the doublewrite buffer all use sequential writes.
  • qtime and stime are the last two columns. qtime is the average time a request spends in the disk scheduler queue before being sent to the physical device, while stime is the average service time, i.e. the time the physical device takes to process the request. Note that qtime is not broken out between reads and writes; if response time is dominated by qtime, that points at the disk scheduler. Also note that service time (the stime field in pt-diskstats, svctm in iostat) is not reliable on Linux – if you read the iostat manual you will see it is deprecated.

Along with these, there are many other parameters for pt-diskstats – you can find the full documentation here. Below is an example of pt-diskstats in action. I used the --devices-regex option, which prints only information for devices matching the given Perl regex.

$ pt-diskstats --devices-regex=sd --interval 5
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.1 sda   21.6   22.8    0.5      45%    1.2  29.4 275.5   4.0    1.1      0%   40.0  145.1   65%   158 297.1 155.0   2.1
1.1 sdb   15.0   21.0    0.3      33%    0.1   5.2   0.0   0.0    0.0      0%    0.0    0.0   11%     1  15.0   0.5   4.7
1.1 sdc    5.6   10.0    0.1       0%    0.0   5.2   1.9   6.0    0.0     33%    0.0    2.0    3%     0   7.5   0.4   3.6
1.1 sdd    0.0    0.0    0.0       0%    0.0   0.0   0.0   0.0    0.0      0%    0.0    0.0    0%     0   0.0   0.0   0.0
5.0 sda   17.0   14.8    0.2      64%    3.1  66.7 404.9   4.6    1.8     14%  140.9  298.5  100%   111 421.9 277.6   1.9
5.0 sdb   14.0   19.9    0.3      48%    0.1   5.5   0.4 174.0    0.1     98%    0.0    0.0   11%     0  14.4   0.9   2.4
5.0 sdc    3.6   27.1    0.1      61%    0.0   3.5   2.8   5.7    0.0     30%    0.0    2.0    3%     0   6.4   0.7   2.4
5.0 sdd    0.0    0.0    0.0       0%    0.0   0.0   0.0   0.0    0.0      0%    0.0    0.0    0%     0   0.0   0.0   0.0

These are the stats from 7200 RPM SATA disks. As you can see, the write response time is very high, and most of it is IO queue time. This shows the problem exactly: the IO subsystem cannot handle the write workload because the number of writes being performed is far beyond what it can service, meaning the disks cannot serve every request concurrently. Where the load lands depends a lot on where the hot data is stored, and in this particular case the workload hits only a single disk out of the four. A single 7.2K RPM disk can only do about 100 random writes per second, which is not a lot under a heavy workload.
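
As a back-of-the-envelope check on that figure (assuming a typical ~8ms average seek time): at 7200 RPM the average rotational latency is (60s / 7200) / 2 ≈ 4.2ms, so each random IO costs roughly 8 + 4.2 ≈ 12ms, which works out to about 80-100 random IOPS per drive.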

It's not so much a hardware fault as a hardware capacity issue: the kind of workload present, and the number of writes performed per second, are more than the IO subsystem can handle efficiently. As the disk stats show, this server generates mostly writes.

Let me show you a second example. Here you can see read latency: rd_rt is consistently between 10ms and 30ms, which depends on how fast the disks spin and how many of them there are. Possible solutions are to optimize queries to avoid table scans, use memcached where possible, or use SSDs, which can provide good IO performance at high concurrency. You will find this post on SSDs from our CEO, Peter Zaitsev, useful.

#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.0 sdb    33.0  29.1      0.9     0%    1.1  34.7   7.0   10.3    0.1     61%    0.0   0.4   99%   1    40.0  2.2  19.5
1.0 sdb1    0.0   0.0      0.0     0%    0.0   0.0   7.0   10.3    0.1     61%    0.0   0.4    1%   0     7.0  0.0   0.4
1.0 sdb2   33.0  29.1      0.9     0%    1.1  34.7   0.0    0.0    0.0      0%    0.0   0.0   99%   1    33.0  3.5  30.2
1.0 sdb    81.9  28.5      2.3     0%    1.1  14.0   0.0    0.0    0.0      0%    0.0   0.0   99%   1    81.9  2.0  12.0
1.0 sdb1    0.0   0.0      0.0     0%    0.0   0.0   0.0    0.0    0.0      0%    0.0   0.0    0%   0     0.0  0.0   0.0
1.0 sdb2   81.9  28.5      2.3     0%    1.1  14.0   0.0    0.0    0.0      0%    0.0   0.0   99%   1    81.9  2.0  12.0
1.0 sdb    50.0  25.7      1.3     0%    1.3  25.1  13.0   11.7    0.1     66%    0.0   0.7   99%   1    63.0  3.4  11.3
1.0 sdb1   25.0  21.3      0.5     0%    0.6  25.2  13.0   11.7    0.1     66%    0.0   0.7   46%   1    38.0  3.2   7.3
1.0 sdb2   25.0  30.1      0.7     0%    0.6  25.0   0.0    0.0    0.0      0%    0.0   0.0   56%   0    25.0  3.6  22.2

The diskstats output below suggests that IO is saturated with both reads and writes, which you can see from the high values in the rd_s and wr_s columns. In a case like this, putting the disks in either a RAID 5 array (better for a read-only workload) or a RAID 10 array, along with a battery-backed write cache (BBWC), is a good option, as a single disk can be really bad for performance when you are IO bound.

device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s  wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt  busy in_prg io_s  qtime stime
sdb1  362.0  27.4    9.7     0%    2.7    7.5   525.2  20.2    10.3    35%    6.4    8.0   100%   0    887.2  7.0   0.9
sdb1  439.9  26.5   11.4     0%    3.4    7.7   545.7  20.8    11.1    34%    9.8   11.9   100%   0    985.6  9.6   0.8
sdb1  576.6  26.5   14.9     0%    4.5    7.8   400.2  19.9     7.8    34%    6.7   10.9   100%   0    976.8  8.6   0.8
sdb1  410.8  24.2    9.7     0%    2.9    7.1   403.1  18.3     7.2    34%   10.8   17.7   100%   0    813.9 12.5   1.0
sdb1  378.4  24.6    9.1     0%    2.7    7.3   506.1  16.5     8.2    33%    5.7    7.6   100%   0    884.4  6.6   0.9
sdb1  572.8  26.1   14.6     0%    4.8    8.4   422.6  17.2     7.1    30%    1.7    2.8   100%   0    995.4  4.7   0.8
sdb1  429.2  23.0    9.6     0%    3.2    7.4   511.9  14.5     7.2    31%    1.2    1.7   100%   0    941.2  3.6   0.9

The following example reflects write-heavy activity, but the write response time is very good – under 1ms – which shows the disks are healthy and capable of handling a high number of IOPS.

#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.0  dm-0  530.8 16.0    8.3     0%    0.3    0.5  6124.0  5.1    30.7    0%     1.7    0.3  86%    2   6654.8  0.2  0.1
2.0  dm-0  633.1 16.1    10.0    0%    0.3    0.5  6173.0  6.1    36.6    0%     1.7    0.3  88%    1   6806.1  0.2  0.1
3.0  dm-0  731.8 16.0    11.5    0%    0.4    0.5  6064.2  5.8    34.1    0%     1.9    0.3  90%    2   6795.9  0.2  0.1
4.0  dm-0  711.1 16.0    11.1    0%    0.3    0.5  6448.5  5.4    34.3    0%     1.8    0.3  92%    2   7159.6  0.2  0.1
5.0  dm-0  700.1 16.0    10.9    0%    0.4    0.5  5689.4  5.8    32.2    0%     1.9    0.3  88%    0   6389.5  0.2  0.1
6.0  dm-0  774.1 16.0    12.1    0%    0.3    0.4  6409.5  5.5    34.2    0%     1.7    0.3  86%    0   7183.5  0.2  0.1
7.0  dm-0  849.6 16.0    13.3    0%    0.4    0.5  6151.2  5.4    32.3    0%     1.9    0.3  88%    3   7000.8  0.2  0.1
8.0  dm-0  664.2 16.0    10.4    0%    0.3    0.5  6349.2  5.7    35.1    0%     2.0    0.3  90%    2   7013.4  0.2  0.1
9.0  dm-0  951.0 16.0    14.9    0%    0.4    0.4  5807.0  5.3    29.9    0%     1.8    0.3  90%    3   6758.0  0.2  0.1
10.0 dm-0  742.0 16.0    11.6    0%    0.3    0.5  6461.1  5.1    32.2    0%     1.7    0.3  87%    1   7203.2  0.2  0.1

Let me show you a final example. I used the --interval and --iterations parameters, which tell pt-diskstats how many seconds to wait before printing the next sample and how many samples to collect, respectively. Notice the high latency (rd_rt, wr_rt), mostly for reads, in the third iteration. You can also see high values for queue time (qtime) and service time (stime); qtime is related to the disk IO scheduler settings. For MySQL database servers we usually recommend the noop or deadline scheduler instead of the default cfq – see the quick scheduler check after the output below.

$ pt-diskstats --interval=20 --iterations=3
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
10.4 hda   11.7  4.0     0.0     0%     0.0    1.1  40.7   11.7    0.5    26%    0.1    2.1  10%    0    52.5  0.4   1.5
10.4 hda2   0.0  0.0     0.0     0%     0.0    0.0   0.4    7.0    0.0    43%    0.0    0.1   0%    0     0.4  0.0   0.1
10.4 hda3   0.0  0.0     0.0     0%     0.0    0.0   0.4  107.0    0.0    96%    0.0    0.2   0%    0     0.4  0.0   0.2
10.4 hda5   0.0  0.0     0.0     0%     0.0    0.0   0.7   20.0    0.0    80%    0.0    0.3   0%    0     0.7  0.1   0.2
10.4 hda6   0.0  0.0     0.0     0%     0.0    0.0   0.1    4.0    0.0     0%    0.0    4.0   0%    0     0.1  0.0   4.0
10.4 hda9  11.7  4.0     0.0     0%     0.0    1.1  39.2   10.7    0.4     3%    0.1    2.7   9%    0    50.9  0.5   1.8
10.4 drbd1 11.7  4.0     0.0     0%     0.0    1.1  39.1   10.7    0.4     0%    0.1    2.8   9%    0    50.8  0.5   1.7
20.0 hda   14.6  4.0     0.1     0%     0.0    1.4  39.5   12.3    0.5    26%    0.3    6.4  18%    0    54.1  2.6   2.7
20.0 hda2   0.0  0.0     0.0     0%     0.0    0.0   0.4    9.1    0.0    56%    0.0   42.0   3%    0     0.4  0.0  42.0
20.0 hda3   0.0  0.0     0.0     0%     0.0    0.0   1.5   22.3    0.0    82%    0.0    1.5   0%    0     1.5  1.2   0.3
20.0 hda5   0.0  0.0     0.0     0%     0.0    0.0   1.1   18.9    0.0    79%    0.1   21.4  11%    0     1.1  0.1  21.3
20.0 hda6   0.0  0.0     0.0     0%     0.0    0.0   0.8   10.4    0.0    62%    0.0    1.5   0%    0     0.8  1.3   0.2
20.0 hda9  14.6  4.0     0.1     0%     0.0    1.4  35.8   11.7    0.4     3%    0.2    4.9  18%    0    50.4  0.5   3.5
20.0 drbd1 14.6  4.0     0.1     0%     0.0    1.4  36.4   11.6    0.4     0%    0.2    5.1  17%    0    51.0  0.5   3.4
20.0 hda    0.9  4.0     0.0     0%     0.2   251.9  28.8  61.8    1.7    92%    4.5   13.1  31%    2    29.6 12.8   0.9
20.0 hda2   0.0  0.0     0.0     0%     0.0     0.0   0.6   8.3    0.0    52%    0.1   98.2   6%    0     0.6 48.9  49.3
20.0 hda3   0.0  0.0     0.0     0%     0.0     0.0   2.0  23.2    0.0    83%    0.0    1.4   0%    0     2.0  1.2   0.3
20.0 hda5   0.0  0.0     0.0     0%     0.0     0.0   4.9 249.4    1.2    98%    4.0   13.2   9%    0     4.9 12.9   0.3
20.0 hda6   0.0  0.0     0.0     0%     0.0     0.0   0.0   0.0    0.0     0%    0.0    0.0   0%    0     0.0  0.0   0.0
20.0 hda9   0.9  4.0     0.0     0%     0.2   251.9  21.3  24.2    0.5    32%    0.4   12.9  31%    2    22.2 10.2   9.7
20.0 drbd1  0.9  4.0     0.0     0%     0.2   251.9  30.6  17.0    0.5     0%    0.7   24.1  30%    5    31.4 21.0   9.5
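
A quick illustration of checking and switching the scheduler at runtime (sda here is just an example device – substitute your data volume; the list of available schedulers varies by kernel, and the change must be made as root and repeated after a reboot unless you also set the elevator= kernel boot parameter):

$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
$ echo deadline > /sys/block/sda/queue/scheduler
$ cat /sys/block/sda/queue/scheduler
noop [deadline] cfq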

The busy column in pt-diskstats output is the same as the %util column in iostat: it points to utilization. The busy percentage only tells us for how long the IO subsystem was doing something, not how much capacity remains. So the only time you care about busy is when it is at 100% while, at the same time, latency (await in iostat, rd_rt/wr_rt in pt-diskstats output) climbs above, say, 5ms. You can estimate the capacity of your IO subsystem and then watch the IOPS being consumed (the rd_s + wr_s columns, summed in io_s). Also, since the system can process more than one request in parallel (as with RAID), busy can go beyond 100% in pt-diskstats output.

If you need to check disk throughput and block device IOPS, run the following to capture metrics from your IO subsystem and see whether utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. Output can be grouped by sample or by disk using the --group-by option. You can also use the sysbench benchmark tool to generate a known load and measure database server performance – you will find this link useful for sysbench details, and there is a brief fileio sketch after the command below.

$ pt-diskstats --group-by=all --iterations=7200 > /tmp/pt-diskstats.out;
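
If you want to put a known, repeatable load on the IO subsystem itself (rather than going through MySQL), sysbench's fileio test is a common choice. A minimal sketch – the 8G file size and 300-second runtime here are arbitrary choices for illustration; size the files well beyond RAM to defeat the OS cache:

$ sysbench --test=fileio --file-total-size=8G prepare
$ sysbench --test=fileio --file-total-size=8G --file-test-mode=rndrw --max-time=300 --max-requests=0 run
$ sysbench --test=fileio --file-total-size=8G cleanup

Run pt-diskstats in another terminal while the benchmark runs and watch how rd_rt, wr_rt and busy behave under that load.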


Conclusion:

pt-diskstats is one of the finest tools in Percona Toolkit. With it you can easily spot disk bottlenecks, measure the IO subsystem, and identify how many IOPS your drives can actually handle (i.e. disk capacity).


Jun 25, 2014

Why %util number from iostat is meaningless for MySQL capacity planning

Earlier this month I wrote about vmstat iowait cpu numbers, and some of the comments I received advertised the use of %util as reported by the iostat tool instead. I find this number even more useless for MySQL performance tuning and capacity planning.

Now let me start by saying this is a really tricky and deceptive number. Many DBAs who report that their systems have a very busy IO subsystem say that %util in iostat was above 99%, and therefore they believe this number is a good indicator of an overloaded IO subsystem.

Indeed, when your IO subsystem is busy up to its full capacity, utilization should be very close to 100%. However, it is perfectly possible for the IO subsystem – and MySQL along with it – to have plenty more capacity even while utilization shows 100%, as I will show in an example.

Before that, though, let's see what the iostat manual page has to say on this topic – from the man page we can read:

%util

Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100% for devices serving requests serially. But for devices serving requests in parallel, such as RAID arrays and modern SSDs, this number does not reflect their performance limits.

This says right there that the number is useless for pretty much any production database server, which is likely to be running RAID, Flash/SSD, SAN or cloud storage (such as EBS) capable of handling multiple requests in parallel.

Let's look at the following illustration. I will run sysbench on a system with rather slow storage, a data size larger than the buffer pool, and a uniform distribution, to put pressure on the IO subsystem. I will use a read-only benchmark here as it keeps things simpler.

sysbench --num-threads=1 --max-requests=0 --max-time=6000000 --report-interval=10 --test=oltp --oltp-read-only=on --db-driver=mysql --oltp-table-size=100000000 --oltp-dist-type=uniform --init-rng=on --mysql-user=root --mysql-password= run
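
(For completeness: the run above assumes the test table already exists. With this older sysbench syntax it would be created beforehand by a prepare step along these lines – a sketch, with connection options matching the run command:)

sysbench --test=oltp --oltp-table-size=100000000 --db-driver=mysql --mysql-user=root --mysql-password= prepare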

I’m seeing some 9 transactions per second, while disk utilization from iostat is at nearly 96%:

[ 80s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 171.82ms (95%)
[ 90s] threads: 1, tps: 9.20, reads/s: 128.80, writes/s: 0.00 response time: 157.72ms (95%)
[ 100s] threads: 1, tps: 9.00, reads/s: 126.00, writes/s: 0.00 response time: 215.38ms (95%)
[ 110s] threads: 1, tps: 9.30, reads/s: 130.20, writes/s: 0.00 response time: 141.39ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 127.90 0.70 4070.40 28.00 31.87 1.01 7.83 7.52 96.68

This makes a lot of sense – with a single-threaded read workload, the drive is only busy fetching the data needed by the query, which will not be 100% of the time, as some time is also spent processing the query on the MySQL side and passing the result set back to sysbench.
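
As a sanity check on that figure, utilization is approximately r/s multiplied by svctm: 127.9 reads/s × 7.52ms ≈ 0.96s of device time per second, i.e. roughly the 96.68% iostat reports.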

So, 96% utilization at 9 transactions per second – this is close to full system capacity, with less than 5% of device time to spare, right?

Let's run a benchmark with more concurrency – 4 threads at a time – and we'll see…

[ 110s] threads: 4, tps: 21.10, reads/s: 295.40, writes/s: 0.00 response time: 312.09ms (95%)
[ 120s] threads: 4, tps: 22.00, reads/s: 308.00, writes/s: 0.00 response time: 297.05ms (95%)
[ 130s] threads: 4, tps: 22.40, reads/s: 313.60, writes/s: 0.00 response time: 335.34ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 295.40 0.90 9372.80 35.20 31.75 4.06 13.69 3.38 100.01

So we're seeing 100% utilization now, but here is what's interesting: we were able to reclaim much more than the less-than-5% that utilization said was left – throughput of the system increased about 2.5x.

Finally, let's do the test with 64 threads – more concurrency than exists at the storage level, which on this system is conventional hard drives in RAID…

[ 70s] threads: 64, tps: 42.90, reads/s: 600.60, writes/s: 0.00 response time: 2516.43ms (95%)
[ 80s] threads: 64, tps: 42.40, reads/s: 593.60, writes/s: 0.00 response time: 2613.15ms (95%)
[ 90s] threads: 64, tps: 44.80, reads/s: 627.20, writes/s: 0.00 response time: 2404.49ms (95%)

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
dm-0 0.00 0.00 601.20 0.80 19065.60 33.60 31.73 65.98 108.72 1.66 100.00

In this case we're getting 4.5x the throughput of a single thread at 100% utilization. We're also getting almost double the throughput of the 4-thread run, where 100% utilization was already being reported. This makes sense: there are 4 drives which can work in parallel, and with many outstanding requests they are able to optimize their seeks better, hence giving a bit more than 4x.

So what have we seen? A system that was at 96% utilization could still be driven to provide 4.5x the throughput – it had plenty of extra IO capacity. More powerful storage may have significantly more ability to run requests in parallel, so it is quite possible to see 10x or more headroom after %util starts being reported close to 100%.

So if %util is not very helpful, what can we use to understand our database IO capacity better? First, let's look at the performance reported by those sysbench runs. Looking at the 95% response time, 1 thread and 4 threads stayed relatively close, growing just from about 150ms to 250-300ms. This is the number I really like to look at: if the system can respond to queries with a response time not significantly higher than it has at a concurrency of 1, it is not overloaded. I like using 3x as a multiplier – i.e., when the 95% response time spikes to more than 3x the single-concurrency value, the system might be heading into overload.

With 64 threads, the 95% response time is 15-20x that of a single thread, so the system is surely overloaded.

Do we have anything reported by iostat that we can use in a similar way? It turns out we do! Check out the "await" column, which tells us how long the requester had to wait for an IO request to be serviced. At single concurrency it is 7.8ms, which is what these drives can do for random IO – as good as it gets. With 4 threads it is 13.7ms, less than double the best possible, so also good enough. With a concurrency of 64, however, it is 108ms – over 10x what this device can deliver with no waiting, which is surely a sign of overload.

A couple of words of caution. First, do not look at svctm, which is not designed with parallel processing in mind – you can see that in our case it actually gets better at high concurrency, while the database really had to wait much longer for the requests it submitted. Second, iostat mixes reads and writes together in a single statistic, which for databases – and especially on RAID with a BBU – can be like mixing apples and oranges: writes should go to the writeback cache and be acknowledged essentially instantly, while reads only complete when the actual data can be delivered. The pt-diskstats tool from Percona Toolkit breaks them apart and so can be much more useful for tuning storage for a database workload. Some recent operating systems also ship a sysstat/iostat that breaks await out into r_await and w_await, which is much more useful.
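
If your sysstat is recent enough (the split columns appeared in newer sysstat releases, around version 9.1.2), you can watch them with the usual extended output:

$ iostat -x 10

which prints r_await and w_await alongside the combined await, letting you apply the same "how far above single-concurrency latency are we?" reasoning to reads and writes separately.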

A final note: I used a read-only workload on purpose – when it comes to writes, things can be even more complicated. The MySQL buffer pool can be more efficient with more intensive writes, plus group commit may be able to commit a lot of transactions with a single disk write. Still, the same basic logic will apply.

Summary: The takeaway is simple – %util only shows the fraction of time during which the device had at least one operation to deal with, which does not reflect actual utilization for the majority of modern IO subsystems. So you may have a lot of storage IO capacity left even when utilization is close to 100%.


Apr 07, 2014

Optimizing MySQL Performance: Choosing the Right Tool for the Job

Next Wednesday, I will present a webinar about MySQL performance profiling tools that every MySQL DBA should know.

Application performance is a key aspect of ensuring a good experience for your end users. But finding and fixing performance bottlenecks is difficult in the complex systems that define today’s web applications. Having a method and knowing how to use the tools available can significantly reduce the amount of time between problems manifesting and fixes being deployed.

In the webinar, titled “Optimizing MySQL Performance: Choosing the Right Tool for the Job,” we’ll start with the basics – top, iostat and vmstat – then move on to advanced tools like GDB, OProfile and strace.

I’m looking forward to this webinar and invite you to join us April 16th at 10 a.m. Pacific time. You can learn more and also register here to reserve your spot. I also invite you to submit questions ahead of time by leaving them in the comments section below. Thanks for reading and see you next Wednesday!

