Jul 18, 2018

Why Consumer SSD Reviews are Useless for Database Performance Use Case

If you’re reading consumer SSD reviews and using them to estimate SSD performance under database workloads, you’d better stop. Databases are not your typical consumer applications and they do not use IO in the same way.

Let’s look, for example, at this excellent AnandTech review of the Samsung 960 Pro, a consumer NVMe device that I happen to have in my test lab.

Anandtech Table reviewing consumer SSD performance

The summary table is actually great, showing the performance both at Queue Depth 1 (single threaded) and at Queue Depth 32, a pretty heavy concurrent load.

Even at QD1 we see 50K 4KB writes per second, which should be enough for pretty serious database workloads.

In reality, though, you might be in for some disappointing surprises. While “normal” buffered IO is indeed quite fast, this drive really hates fsync() calls, with a single-thread fsync() latency of 3.5ms, or roughly 300 fsyncs/sec. That’s just slightly more than your old-school spinning drive.
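
You can see this asymmetry yourself with nothing more than dd (a rough illustration, not a benchmark; file names and counts are arbitrary):

dd if=/dev/zero of=/tmp/buffered.tmp bs=4k count=1000               # buffered writes: returns almost instantly
dd if=/dev/zero of=/tmp/synced.tmp bs=4k count=1000 oflag=dsync     # each write waits for the device, much like a WAL commit

On a drive with slow synchronous writes, the second command can take far longer than the first.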

Why is fsync() performance critical for databases?

To achieve durability (the letter “D” in ACID), databases tend to rely on a write-ahead log (WAL), which is written sequentially. The WAL must be synced to disk on every transaction commit using fsync() or similar measures, such as opening the file with the O_SYNC flag, which have similar performance implications.

Other database operations use fsync() too, but writing the WAL is where it usually hurts the most.

In a fully durable configuration, MySQL tends to be impacted even more by poor fsync() performance: it may need to perform as many as three fsync operations per transaction commit. Group commit reduces the impact on throughput, but transaction latency will still be severely affected.
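
For reference, a fully durable MySQL setup usually looks something like this (a sketch; these settings normally live in my.cnf, shown here as runtime changes). With the binary log enabled, a commit roughly involves fsyncs for both the InnoDB redo log and the binary log, which is where the multiple fsync operations per commit come from:

mysql -e "SET GLOBAL innodb_flush_log_at_trx_commit = 1"   # flush and fsync the InnoDB redo log at every commit
mysql -e "SET GLOBAL sync_binlog = 1"                      # fsync the binary log at every commit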

Want more bad news? If the fsync() performance of your consumer SSD looks phenomenal, it might indeed be too good to be true. Over the years, some consumer SSDs have “faked” fsync and accepted possible data loss in the case of power failure. This might not be a big deal if you only use them for testing, but it is a showstopper for any real use.

Want to know more about your drive’s fsync() performance?  You can use these sysbench commands:

sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 prepare
sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 run | grep "avg:"

You can also use ioping, as described in this blog post.
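
If ioping is installed, a quick latency spot check might look like this (the path is an example; point it at the filesystem holding your data directory). Note that this measures plain request latency; see the linked post for exercising synchronous writes specifically:

ioping -c 10 /var/lib/mysql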

I wish that manufacturers’ tech specifications described fsync() latency, along with a clear statement as to whether the drive guarantees no loss of data on power failure. Likewise, I wish folks doing storage reviews would include these in their research.

Interested in fsync() performance for a variety of devices? Yves Trudeau wrote an excellent blog post about fsync() performance on various storage devices a few months ago.

Other technical resources

Principal Support Escalation Specialist Sveta Smirnova presents Troubleshooting MySQL Concurrency Issues with Load Testing Tools. 

You can download a three part series of eBooks by Principal Consultant Alexander Rubin and me on MySQL Performance.

Jun 15, 2016

Troubleshooting hardware resource usage webinar: Q & A

In this blog, I provide answers to the Q & A for the Troubleshooting hardware resource usage webinar.

First, I want to thank everybody who attended the May 26 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: How did you find the memory IO LEAK?

A: Do you mean the replication bug I was talking about in the webinar? I wrote about this bug here. See also comments in the bug report itself.

Q: Do you have common formulas you use to tune MySQL?

A: There are best practices: relate thread concurrency to the number of CPU cores you have, set the InnoDB buffer pool size large enough to hold your working data set (which is not always possible), and do not set the Query Cache larger than 512MB (or, even better, turn it off) to avoid issues with the global lock that is taken when the cache needs to be defragmented. I prefer not to call them “formulas,” because all options need to be adjusted to match the workload. If this weren’t the case, MySQL Server would have an automatic configuration. There is also a separate webinar on configuration (Troubleshooting configuration issues) where I discuss these practices.
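
For illustration, two of these practices could look like this (a sketch; the numbers are examples only and must be adjusted to your data set and available RAM):

mysql -e "SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024"   # dynamic in MySQL 5.7+; on older versions set it in my.cnf and restart
mysql -e "SET GLOBAL query_cache_size = 0"                               # effectively disables the query cache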

Q: Slide 11: is this real time? Can we get this info for a job that has already finished?

A: Yes, this is real time. No, it is not possible to get this info for a thread that does not exist.

Q: Slide 11: what do negative numbers mean?

A: Numbers are taken from the 

CURRENT_NUMBER_OF_BYTES_USED

 field for table

memory_summary_by_thread_by_event_name

 in Performance Schema. These values, in turn, are calculated as (memory allocated by thread) – (memory freed by thread). Negative numbers here mean either a memory leak or incorrect calculation of memory used by the thread. I reported this behavior in the MySQL Bugs database. Please subscribe to the bug report and wait to see how InnoDB and Performance Schema developers answer.
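
If you want to look at the raw numbers yourself, a query along these lines (a sketch) shows the top memory consumers per thread and event:

mysql -e "SELECT THREAD_ID, EVENT_NAME, CURRENT_NUMBER_OF_BYTES_USED FROM performance_schema.memory_summary_by_thread_by_event_name ORDER BY CURRENT_NUMBER_OF_BYTES_USED DESC LIMIT 10"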

Q: Are TokuDB memory usage stats recorded in the sys.memory_by_thread_by_current_bytes table also? Do we have to set something to enable this collection? I ran the query, but it shows 0 for everything.

A: TokuDB currently does not support Performance Schema, thus its memory statistics are not instrumented. See the user manual on how memory instrumentation works.
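
As a quick check of which memory instruments are enabled on your server (for illustration):

mysql -e "SELECT NAME, ENABLED FROM performance_schema.setup_instruments WHERE NAME LIKE 'memory/%' LIMIT 10"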

Q: With disk what we will check for disk I/O?

A: I don’t quite understand the question. Are you asking which disks we should check IO statistics on? Check the disk holding the datadir, as well as any other disks where MySQL stores data and log files (if you set custom locations).

Q: How can we put CPU in parallel to process multiple client requests? Put multiple requests In memory by parallel way. By defining transaction. Or there any query hints?

A: We cannot directly put the CPU in parallel, but we can affect it indirectly by tuning the InnoDB thread-related options (innodb_thread_concurrency, innodb_read_io_threads, innodb_write_io_threads) and by using the thread pool.

Q: Is there any information in the Performance Schema that is not found in the sys schema?

A: Yes. For example, sys schema does not have a view for statistics about prepared statements, while Performance Schema does, because sys schema takes its statement statistics from digest tables (which make no sense for prepared statements).
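
For example, prepared statement statistics can be read directly from Performance Schema (a sketch, assuming MySQL 5.7 or later):

mysql -e "SELECT STATEMENT_NAME, SQL_TEXT, COUNT_EXECUTE, AVG_TIMER_EXECUTE FROM performance_schema.prepared_statements_instances ORDER BY COUNT_EXECUTE DESC LIMIT 5"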

Q: What is your favorite tool to investigate a memory issue with a task/job that has already finished?

A: I don’t know that there is such a tool suitable for use in production. In a test environment, you can use valgrind or similar tools. You can also make core dumps of the mysqld process and investigate them after the issue is gone.
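
If you go the core dump route, the simplest approach is gcore from the gdb package (note that it briefly pauses mysqld and the resulting core file can be large):

sudo gcore -o /tmp/mysqld.core $(pidof mysqld)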

Sep 03, 2014

MySQL 101: Monitor Disk I/O with pt-diskstats

Here on the Percona Support team we often ask customers to retrieve disk stats to monitor disk IO and to measure block device IOPS and latency. There are a number of tools available to monitor IO on Linux. iostat is one of the most popular, and Percona Toolkit, which is free, contains the pt-diskstats tool for this purpose. pt-diskstats is similar to iostat, but it is more interactive and contains extended information. It reports current disk activity, printing statistics for each sample interval (1 second by default), and continues until interrupted. pt-diskstats collects its samples from /proc/diskstats.

In this post, I will share some examples of how to monitor and check whether the IO subsystem is performing properly or whether any disk is a limiting factor, all using the pt-diskstats tool.

pt-diskstats output consists of a number of columns, and in order to interpret it we need to know what each column represents.

  • rd_s is the number of reads per second, while wr_s is the number of writes per second.
  • rd_rt and wr_rt show the average response time in milliseconds for reads and writes respectively. This is similar to the await column in iostat output, but pt-diskstats reports read and write response times separately at the disk level. As a note, modern iostat also splits read and write latency out, but most distros do not ship the latest iostat in their sysstat (or equivalent) package.
  • rd_mrg and wr_mrg are two other important columns in pt-diskstats output. *_mrg tells us how many of the original operations the IO elevator (disk scheduler) was able to merge to reduce IOPS, so it shows whether the scheduler could consolidate many or few operations. If rd_mrg/wr_mrg is a high percentage, the IO workload is sequential; if it is a low percentage, the IO workload is mostly random. Binary logs, redo logs (aka ib_logfile*), the undo log and the doublewrite buffer all use sequential writes.
  • qtime and stime are the last two columns in pt-diskstats output. qtime is the time spent in the disk scheduler queue, i.e. the average queue time before a request is sent to the physical device, while stime is the average service time, i.e. the time the physical device takes to process a request. Note that qtime does not distinguish between reads and writes; if response time is dominated by qtime, it points toward the disk scheduler (queueing) rather than the device itself. Also note that service time (the stime field in pt-diskstats, svctm in iostat) is not reliable on Linux; if you read the iostat manual you will see it is deprecated.

Along with these, there are many other options for pt-diskstats; you can find the full documentation here. Below is an example of pt-diskstats in action. I used the --devices-regex option, which prints information only for devices that match the given Perl regex.

$ pt-diskstats --devices-regex=sd --interval 5
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.1 sda   21.6   22.8    0.5      45%    1.2  29.4 275.5   4.0    1.1      0%   40.0  145.1   65%   158 297.1 155.0   2.1
1.1 sdb   15.0   21.0    0.3      33%    0.1   5.2   0.0   0.0    0.0      0%    0.0    0.0   11%     1  15.0   0.5   4.7
1.1 sdc    5.6   10.0    0.1       0%    0.0   5.2   1.9   6.0    0.0     33%    0.0    2.0    3%     0   7.5   0.4   3.6
1.1 sdd    0.0    0.0    0.0       0%    0.0   0.0   0.0   0.0    0.0      0%    0.0    0.0    0%     0   0.0   0.0   0.0
5.0 sda   17.0   14.8    0.2      64%    3.1  66.7 404.9   4.6    1.8     14%  140.9  298.5  100%   111 421.9 277.6   1.9
5.0 sdb   14.0   19.9    0.3      48%    0.1   5.5   0.4 174.0    0.1     98%    0.0    0.0   11%     0  14.4   0.9   2.4
5.0 sdc    3.6   27.1    0.1      61%    0.0   3.5   2.8   5.7    0.0     30%    0.0    2.0    3%     0   6.4   0.7   2.4
5.0 sdd    0.0    0.0    0.0       0%    0.0   0.0   0.0   0.0    0.0      0%    0.0    0.0    0%     0   0.0   0.0   0.0

These are the stats from 7200 RPM SATA disks. As you can see, the write response time is very high, and most of it is IO queue time. That is exactly the problem: the IO subsystem cannot handle the write workload, because the number of writes being performed is well beyond what it can service, so the disks cannot serve every request concurrently. The workload also depends a lot on where the hot data is stored, and in this particular case it hits only one of the four disks. A single 7200 RPM disk can only do on the order of 100 random writes per second (roughly 8ms of average seek time plus about 4ms of rotational latency per random IO), which is not much for a heavy workload.

It is not so much a hardware fault as a hardware capacity issue: the kind of workload present, and the number of writes performed per second, are more than the IO subsystem can handle efficiently. As the disk stats show, this server generates mostly writes.

Let me show you a second example. Here you can see read latency: rd_rt is consistently between 10ms and 30ms, depending on how fast the disks spin and how many disks there are. Possible ways to deal with it are to optimize queries to avoid table scans, use memcached where possible, or use SSDs, which can provide good IO performance at high concurrency. You will find this post on SSDs from our CEO, Peter Zaitsev, useful.

#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.0 sdb    33.0  29.1      0.9     0%    1.1  34.7   7.0   10.3    0.1     61%    0.0   0.4   99%   1    40.0  2.2  19.5
1.0 sdb1    0.0   0.0      0.0     0%    0.0   0.0   7.0   10.3    0.1     61%    0.0   0.4    1%   0     7.0  0.0   0.4
1.0 sdb2   33.0  29.1      0.9     0%    1.1  34.7   0.0    0.0    0.0      0%    0.0   0.0   99%   1    33.0  3.5  30.2
1.0 sdb    81.9  28.5      2.3     0%    1.1  14.0   0.0    0.0    0.0      0%    0.0   0.0   99%   1    81.9  2.0  12.0
1.0 sdb1    0.0   0.0      0.0     0%    0.0   0.0   0.0    0.0    0.0      0%    0.0   0.0    0%   0     0.0  0.0   0.0
1.0 sdb2   81.9  28.5      2.3     0%    1.1  14.0   0.0    0.0    0.0      0%    0.0   0.0   99%   1    81.9  2.0  12.0
1.0 sdb    50.0  25.7      1.3     0%    1.3  25.1  13.0   11.7    0.1     66%    0.0   0.7   99%   1    63.0  3.4  11.3
1.0 sdb1   25.0  21.3      0.5     0%    0.6  25.2  13.0   11.7    0.1     66%    0.0   0.7   46%   1    38.0  3.2   7.3
1.0 sdb2   25.0  30.1      0.7     0%    0.6  25.0   0.0    0.0    0.0      0%    0.0   0.0   56%   0    25.0  3.6  22.2

The diskstats output below shows IO saturated by both reads and writes, which you can see from the high values in the rd_s and wr_s columns. In a case like this, putting the disks in a RAID 5 array (better for read-only workloads) or a RAID 10 array, along with a battery-backed write cache (BBWC), is a good option, because a single disk can really hurt performance when you are IO bound.

device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s  wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt  busy in_prg io_s  qtime stime
sdb1  362.0  27.4    9.7     0%    2.7    7.5   525.2  20.2    10.3    35%    6.4    8.0   100%   0    887.2  7.0   0.9
sdb1  439.9  26.5   11.4     0%    3.4    7.7   545.7  20.8    11.1    34%    9.8   11.9   100%   0    985.6  9.6   0.8
sdb1  576.6  26.5   14.9     0%    4.5    7.8   400.2  19.9     7.8    34%    6.7   10.9   100%   0    976.8  8.6   0.8
sdb1  410.8  24.2    9.7     0%    2.9    7.1   403.1  18.3     7.2    34%   10.8   17.7   100%   0    813.9 12.5   1.0
sdb1  378.4  24.6    9.1     0%    2.7    7.3   506.1  16.5     8.2    33%    5.7    7.6   100%   0    884.4  6.6   0.9
sdb1  572.8  26.1   14.6     0%    4.8    8.4   422.6  17.2     7.1    30%    1.7    2.8   100%   0    995.4  4.7   0.8
sdb1  429.2  23.0    9.6     0%    3.2    7.4   511.9  14.5     7.2    31%    1.2    1.7   100%   0    941.2  3.6   0.9

The following example reflects write-heavy activity, but the write response time is very good, under 1ms, which shows the disks are healthy and capable of handling a high number of IOPS.

#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
1.0  dm-0  530.8 16.0    8.3     0%    0.3    0.5  6124.0  5.1    30.7    0%     1.7    0.3  86%    2   6654.8  0.2  0.1
2.0  dm-0  633.1 16.1    10.0    0%    0.3    0.5  6173.0  6.1    36.6    0%     1.7    0.3  88%    1   6806.1  0.2  0.1
3.0  dm-0  731.8 16.0    11.5    0%    0.4    0.5  6064.2  5.8    34.1    0%     1.9    0.3  90%    2   6795.9  0.2  0.1
4.0  dm-0  711.1 16.0    11.1    0%    0.3    0.5  6448.5  5.4    34.3    0%     1.8    0.3  92%    2   7159.6  0.2  0.1
5.0  dm-0  700.1 16.0    10.9    0%    0.4    0.5  5689.4  5.8    32.2    0%     1.9    0.3  88%    0   6389.5  0.2  0.1
6.0  dm-0  774.1 16.0    12.1    0%    0.3    0.4  6409.5  5.5    34.2    0%     1.7    0.3  86%    0   7183.5  0.2  0.1
7.0  dm-0  849.6 16.0    13.3    0%    0.4    0.5  6151.2  5.4    32.3    0%     1.9    0.3  88%    3   7000.8  0.2  0.1
8.0  dm-0  664.2 16.0    10.4    0%    0.3    0.5  6349.2  5.7    35.1    0%     2.0    0.3  90%    2   7013.4  0.2  0.1
9.0  dm-0  951.0 16.0    14.9    0%    0.4    0.4  5807.0  5.3    29.9    0%     1.8    0.3  90%    3   6758.0  0.2  0.1
10.0 dm-0  742.0 16.0    11.6    0%    0.3    0.5  6461.1  5.1    32.2    0%     1.7    0.3  87%    1   7203.2  0.2  0.1

Let me show you a final example. I used the --interval and --iterations parameters for pt-diskstats, which tell it how many seconds to wait before printing the next set of disk stats and how many samples to collect, respectively. If you look at the third iteration, you will see high latency (rd_rt, wr_rt), mostly for reads. You can also see high values for queue time (qtime) and service time (stime); qtime is related to the disk IO scheduler settings. For MySQL database servers we usually recommend the noop or deadline scheduler instead of the default cfq; see the quick sketch after the output below.

$ pt-diskstats --interval=20 --iterations=3
#ts device rd_s rd_avkb rd_mb_s rd_mrg rd_cnc rd_rt wr_s wr_avkb wr_mb_s wr_mrg wr_cnc wr_rt busy in_prg io_s qtime stime
10.4 hda   11.7  4.0     0.0     0%     0.0    1.1  40.7   11.7    0.5    26%    0.1    2.1  10%    0    52.5  0.4   1.5
10.4 hda2   0.0  0.0     0.0     0%     0.0    0.0   0.4    7.0    0.0    43%    0.0    0.1   0%    0     0.4  0.0   0.1
10.4 hda3   0.0  0.0     0.0     0%     0.0    0.0   0.4  107.0    0.0    96%    0.0    0.2   0%    0     0.4  0.0   0.2
10.4 hda5   0.0  0.0     0.0     0%     0.0    0.0   0.7   20.0    0.0    80%    0.0    0.3   0%    0     0.7  0.1   0.2
10.4 hda6   0.0  0.0     0.0     0%     0.0    0.0   0.1    4.0    0.0     0%    0.0    4.0   0%    0     0.1  0.0   4.0
10.4 hda9  11.7  4.0     0.0     0%     0.0    1.1  39.2   10.7    0.4     3%    0.1    2.7   9%    0    50.9  0.5   1.8
10.4 drbd1 11.7  4.0     0.0     0%     0.0    1.1  39.1   10.7    0.4     0%    0.1    2.8   9%    0    50.8  0.5   1.7
20.0 hda   14.6  4.0     0.1     0%     0.0    1.4  39.5   12.3    0.5    26%    0.3    6.4  18%    0    54.1  2.6   2.7
20.0 hda2   0.0  0.0     0.0     0%     0.0    0.0   0.4    9.1    0.0    56%    0.0   42.0   3%    0     0.4  0.0  42.0
20.0 hda3   0.0  0.0     0.0     0%     0.0    0.0   1.5   22.3    0.0    82%    0.0    1.5   0%    0     1.5  1.2   0.3
20.0 hda5   0.0  0.0     0.0     0%     0.0    0.0   1.1   18.9    0.0    79%    0.1   21.4  11%    0     1.1  0.1  21.3
20.0 hda6   0.0  0.0     0.0     0%     0.0    0.0   0.8   10.4    0.0    62%    0.0    1.5   0%    0     0.8  1.3   0.2
20.0 hda9  14.6  4.0     0.1     0%     0.0    1.4  35.8   11.7    0.4     3%    0.2    4.9  18%    0    50.4  0.5   3.5
20.0 drbd1 14.6  4.0     0.1     0%     0.0    1.4  36.4   11.6    0.4     0%    0.2    5.1  17%    0    51.0  0.5   3.4
20.0 hda    0.9  4.0     0.0     0%     0.2   251.9  28.8  61.8    1.7    92%    4.5   13.1  31%    2    29.6 12.8   0.9
20.0 hda2   0.0  0.0     0.0     0%     0.0     0.0   0.6   8.3    0.0    52%    0.1   98.2   6%    0     0.6 48.9  49.3
20.0 hda3   0.0  0.0     0.0     0%     0.0     0.0   2.0  23.2    0.0    83%    0.0    1.4   0%    0     2.0  1.2   0.3
20.0 hda5   0.0  0.0     0.0     0%     0.0     0.0   4.9 249.4    1.2    98%    4.0   13.2   9%    0     4.9 12.9   0.3
20.0 hda6   0.0  0.0     0.0     0%     0.0     0.0   0.0   0.0    0.0     0%    0.0    0.0   0%    0     0.0  0.0   0.0
20.0 hda9   0.9  4.0     0.0     0%     0.2   251.9  21.3  24.2    0.5    32%    0.4   12.9  31%    2    22.2 10.2   9.7
20.0 drbd1  0.9  4.0     0.0     0%     0.2   251.9  30.6  17.0    0.5     0%    0.7   24.1  30%    5    31.4 21.0   9.5
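
As mentioned above, for MySQL servers on these kinds of disks we usually prefer the noop or deadline scheduler. A quick way to check and change it (a sketch; sda is an example device, and the change is not persistent across reboots):

cat /sys/block/sda/queue/scheduler                        # the active scheduler is shown in brackets
echo deadline | sudo tee /sys/block/sda/queue/scheduler   # switch to deadline immediately

On newer kernels using blk-mq the available schedulers differ (e.g. mq-deadline, none).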

The busy column in pt-diskstats output is the same as the %util column in iostat: it shows utilization. pt-diskstats is quite similar to iostat, but it is more interactive and contains more information. The busy percentage only tells us how long the IO subsystem was busy; it does not indicate capacity. So the only time you care about %busy is when it is 100% and, at the same time, latency (await in iostat, rd_rt/wr_rt in pt-diskstats output) rises above, say, 5ms. You can estimate the capacity of your IO subsystem and then look at the IOPS being consumed (the rd_s and wr_s columns, or io_s for the total). Also, the system can process more than one request in parallel (in the case of RAID), so %busy can go beyond 100% in pt-diskstats output.

If you need to check disk throughput and block device IOPS, run the following to capture metrics from your IO subsystem and see whether utilization matches other worrisome symptoms. I would suggest capturing disk stats during peak load. The output can be grouped by sample or by disk using the --group-by option. You can also use the sysbench benchmark tool to generate load and measure database server performance; you will find this link useful for sysbench details.

$ pt-diskstats --group-by=all --iterations=7200 > /tmp/pt-diskstats.out;
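
For a shorter capture summarized per disk rather than per sample, something like this works too (the interval and iteration counts are just examples):

pt-diskstats --group-by=disk --interval=10 --iterations=30 > /tmp/pt-diskstats-by-disk.out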


Conclusion:

pt-diskstats is one of the finest tools in Percona Toolkit. By using it you can easily spot disk bottlenecks, measure your IO subsystem, and identify how many IOPS your drives can handle (i.e., disk capacity).

