Feb
08
2018
--

Fsync Performance on Storage Devices

fsync

fsync performanceWhile preparing a post on the design of ZFS based servers for use with MySQL, I stumbled on the topic of fsync call performance. The fsync call is very expensive, but it is essential to databases as it allows for durability (the “D” of the ACID acronym).

Let’s first review the type of disk IO operations executed by InnoDB in MySQL. I’ll assume the default InnoDB variable values.

The first and most obvious type of IO are pages reads and writes from the tablespaces. The pages are most often read one at a time, as 16KB random read operations. Writes to the tablespaces are also typically 16KB random operations, but they are done in batches. After every batch, fsync is called on the tablespace file handle.

To avoid partially written pages in the tablespaces (a source of data corruption), InnoDB performs a doublewrite. During a doublewrite operation, a batch of dirty pages, from 1 to about 100 pages, is first written sequentially to the doublewrite buffer and fsynced. The doublewrite buffer is a fixed area of the ibdata1 file, or a specific file with the latest Percona Server for MySQL 5.7. Only then do the writes to the tablespaces of the previous paragraph occur.

That leaves us with the writes to the InnoDB log files. During those writes, the transaction information — a kind of binary diff of the affected pages — is written to the log files and then the log file is fsynced. The duration of the fsync call can be a major contributor to the COMMIT latency.

Because the fsync call takes time, it greatly affects the performance of MySQL. Because of this, you probably noticed there are many status variables that relate to fsyncs. To overcome the inherent limitations of the storage devices, group commit allows multiple simultaneous transactions to fsync the log file once for all the transactions waiting for the fsync. There is no need for a transaction to call fsync for a write operation that another transaction already forced to disk. A series of write transactions sent over a single database connection cannot benefit from group commit.

Fsync Results

In order to evaluate the fsync performance, I used the following Python script:

#!/usr/bin/python
import os, sys, mmap
# Open a file
fd = os.open( "testfile", os.O_RDWR|os.O_CREAT|os.O_DIRECT )
m = mmap.mmap(-1, 512)
for i in range (1,1000):
   os.lseek(fd,os.SEEK_SET,0)
   m[1] = "1"
   os.write(fd, m)
   os.fsync(fd)
# Close opened file
os.close( fd )

The script opens a file with the O_DIRECT flag, writes and fsyncs it 1000 times and close the file. I added O_DIRECT after an internal discussion with my colleagues, but it doesn’t change the results and it doesn’t remove the need for calling fsync. We’ll discuss in more detail the impacts of O_DIRECT after we reviewed the results. The script is called with the time command like below:

root@lab:/tmp/testfsync# time python /root/fsync.py
real 0m18.320s
user 0m0.060s
sys 0m0.096s

In the above example using a 7.2k rpm drive, the fsync rate is about 56/s for a latency of 18ms. A 7.2k RPM drive performs 120 rotations per second. On average, the fsyncs require a bit more than two rotations to complete. The filesystem appears to make very little differences: ext4 and XFS show similar results. That means if MySQL uses such storage devices for the InnoDB log files, the latency of each transaction is at least 18ms. If the application workload requires 200 write transactions per second, they’ll need to be executed using at least four database connections.

So, let’s begin with rotational devices. These are becoming a bit less common now with databases, especially without a raid controller. I could only find a few.

Drive RPM Rate Latency Notes
WDC WD2500BJKT 5400 22/s 45 ms Laptop SATA from 2009
ST2000LM003 5400 15/s 66 ms USB-3 portable drive
ST3750528AS 7200 40/s 25 ms Desktop grade SATA
WD2502ABYS-18B7A0 7200 56/s 18 ms Desktop grade SATA
HUA723020ALA641 7200 50/s 20 ms Enterprise grade SATA, md mirror
Dell SAS unknown 7200 58/s 17 ms Behind Perc ctrl but no write cache
HDWE150 7200 43/s 23 ms Recent Desktop grade SATA, 5TB

 

I unfortunately didn’t have access to any 10k or 15k RPM drives that were not behind a raid controller with a write cache. If you have access to such drives, run the above script a few times and send me your results, that would help create a more complete picture! So, we can see a correlation between the rotational speed and the fsync rate, which makes sense. The faster a disk turns, the faster it can fsync. The fsync call saves the data and then updates the metadata. Hence, the heads need to move. That’s probably the main explanation for the remaining disparity. A good point, all drives appears to be fully complying with the SATA flush command even though they all have an enabled write cache. Disabling the drives write caches made no difference.

With the above number, the possible transaction rates in fully ACID mode is pretty depressing. But those drives were rotating ones, what about SSD drives? SSD are memory devices and are much faster for random IO operations. There are extremely fast for reads, and good for writes. But as you will see below, not that great for fsyncs.

Drive rate latency notes
SAMSUNG MZ7LN512 160/s 6.3ms Consumer grade SATA
Crucial_CT480M500SSD1 108/s 9.3ms Consumer grade SATA
Intel 520 2031/s 0.49ms Consumer grade SATA
SAMSUNG MZVPV512HDGL 104/s 9.6ms Consumer grade NVMe
Samsung SSD 960 PRO 267/s 3.8ms High-end consumer grade NVMe
Intel PC-3100 1274/s 0.79ms Low-end consumer grade NVMe (cheat?)
Intel 750 2038/s 0.49ms High-end consumer grade NVMe
Intel PC-3700 7380/s 0.14ms High-end enterprise-grade NVMe

 

Again, this is a small sample of the devices I have access to. All SSD/Flash have write caches, but only the high-end devices have capacitors to flush their write cache to the flash with a loss of power. The PC-3100 device is actually in my home server, and it is obviously cheating. If you look at the card specs on the Intel website, it doesn’t have the “Enhanced Power Loss Data Protection” and “End-to-End Data Protection” features. The much more expansive PC-3700 does. I use the PC-3100 as a ZFS L2ARC device, so I am good. In general, the performance of a flash device varies a bit more than rotational devices, since factors like the number of recent writes and the filling factor come into play.

Even when using a high-end NVMe device like the PC-3700, you can’t reach 10k fully ACID transactions per second at low thread concurrency. How do you reach the higher levels? The answer here is the good old raid controller with a protected write cache. The write cache is basically using DRAM memory protected from power loss by a battery. SAN controllers have similar caches. The writes to the InnoDB log files are sequential writes interleaved with fsyncs. The raid controller concatenates the sequential writes, eventually writing one big chunk on disk and… ignoring the fsyncs. Here’s the result from the only device I had access to:

Drive rate latency notes
Dell Perc with BBU 23000/s 0.04ms Array of 7.2k rpm drives

 

That’s extremely fast but, of course, it is memory. I modified the script to loop 10k times instead of 1k. In theory, something a single slave thread doing simple transactions could reach a rate of 20k/s or more while being fully ACID.

Discussion

We must always consider the results we got in the previous section in the context of a given application. For example, a server using an Intel PC-3700 NVMe card can do more than 7000 fully ACID transactions per second even if it is fully durable provided those transactions are issued by a sufficient number of threads. Adding threads will not allow scaling infinitely. At some point, other bottlenecks like mutex contention or page flushing will dominate.

We often say that Galera-based cluster solutions like Percona XtraDB Cluster (PXC) add latency to the transactions, since it involves communication over the network. With the Galera protocol, a commit operation returns only when all the nodes have received the data. Thus, tt is a good practice to relax the local durability and use innodb_flush_log_at_trx_commit set to 0 or 2. On a local network, the ping time is always below 1ms and often below 0.1ms. As a result, the transaction latency is often smaller.

About fdatasync

The fsync system is not the only system call that persists data to disk. There is also the fdatasync call. fdatasync persists the data to disk but does not update the metadata information like the file size and last update time. Said otherwise, it performs one write operation instead of two. In the Python script, if I replace os.fsync with os.fdatasync, here are the results for a subset of devices:

Drive rpm rate latency notes
ST2000LM003 5400 72/s 13 ms USB-3 portable drive
WD2502ABYS-18B7A0 7200 118/s 8.5 ms Desktop grade SATA
SAMSUNG MZ7LN512 N/A 333/s 3.0ms Consumer grade SATA
Crucial_CT480M500SSD1 N/A 213/s 4.7ms Consumer grade SATA
Samsung SSD 960 PRO N/A 714/s 1.4ms High-end consumer grade NVMe

 

In all cases, the resulting rates have more than doubled. The fdatasync call has a troubled history, as there were issues with it many years ago. Because of those issues, InnoDB never uses fdatasync, only fsyncs. You can find the following comments in the InnoDB os/os0file.cc:

/* We let O_SYNC only affect log files; note that we map O_DSYNC to
O_SYNC because the datasync options seemed to corrupt files in 2001
in both Linux and Solaris */

2001 is a long time ago. Given the above results, maybe we should reconsider the use of fdatasync. From the Linux main page on fdatasync, you find:

fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled. For example,
changes to st_atime or st_mtime (respectively, time of last
access and time of last modification; see stat(2)) do not require
flushing because they are not necessary for a subsequent data
read to be handled correctly. On the other hand, a change to
the file size (st_size, as made by say ftruncate(2)), would
require a metadata flush.

So, even with fdatasync, operations like extending an InnoDB tablespace will update the metadata correctly. This appears to be an interesting low-hanging fruit in term of MySQL performance. In fact, webscalesql already have fdatasync available

O_DIRECT

Why do we need a fsync or fdatasync with O_DIRECT? With O_DIRECT, the OS is not buffering anything along the way. So the data should be persisted right? Actually, the OS is not buffering but the device very likely is. Here are a few results to highlight the point using a 7.2k rpm SATA drive:

Test rate latency
O_DIRECT, drive Write cache enabled 4651/s 0.22ms
O_DIRECT, drive Write cache disabled 101/s 9.9ms
ASYNC + fdatasync, Write cache enabled 119/s 8.4ms
ASYNC + fdatasync, Write cache disabled 117/s 8.5ms

 

The drive write cache was enabled/disabled using the hdparm command. Clearly, there’s no way the drive can persist 4651 writes per second. O_DIRECT doesn’t send the SATA flush command to the disk, so we are only writing to the drive write cache. If the drive write cache is disabled, the rate falls to a more reasonable value of 101/s. What is interesting — and I don’t really understand why — is that opening the file in async mode and performing fdatasync is significantly faster. As expected, the presence of the drive write cache has no impacts on ASYNC + fdatasync. When the fdatasync call occurs, the data is still in the OS file cache.

If you want to use only O_DIRECT, you should make sure all the storage write caches are crash safe. That’s why MySQL adds a fsync call after a write to a file opened with O_DIRECT.

ZFS

These days, I find it difficult to write a blog post without talking about ZFS. How does ZFS handles fsyncs and fdatasyncs? ZFS, like a database, performs write ahead logging in the ZIL. That means calls like fsync and fdatasync return when the data has been persisted to the ZIL, and not to the actual filesystem. The real write operation is done a few seconds later by a background thread. That means the added write for the metadata does not impact performance right away. My home server uses ZFS over a pair of 7.2k RPM drive and doesn’t have a SLOG device. The ZIL is thus stored on the 7.2k RPM drives. The results are the following:

Drive rpm rate latency
ZFS fsync 7200 104/s 9.6 ms
ZFS fdatasync 7200 107/s 9.3 ms

 

Remember that with ZFS, you need to disable the O_DIRECT mode. The fdatasync rate appears to be slightly faster, but it is not really significant. With ZFS, the fsync/fdatasync performance relates to where the ZIL is stored. If there is no SLOG device, the ZIL is stored with the data and thus, the persitence performance of the devices used for the data matter. If there is a SLOG device, the persistence performance is governed by the specs of the device(s) on which the SLOG is located. That’s a very important aspect we have to consider when designing a MySQL server that will use ZFS. The design of such server will be discussed in more details in a future post.

Mar
22
2017
--

The Puzzling Performance of the Samsung 960 Pro

samsung 960 pro small

In this blog post, I’ll take a look at the performance of the Samsung 960 Pro SSD NVME.

First, I know the Samsung 960 Pro is a consumer SSD NVME drive, not intended for sustained data center workloads. But the AnandTech review looked good enough that I decided to take it for a test spin to see if it would work well with MySQL benchmarks.

Before that, I decided to do a simple sysbench file IO test to see how the drives handled sustained workloads, and if it would start acting up.

My expectation for a consumer SSD drive is that its write consistency will suffer. Many of those drives can sustain high bursts for short periods of time but have to slow down to keep up with write leveling (and other internal activities SSDs must to do). This is not what I saw, however.

I did a benchmark on E5-2630L V3, 64GB RAM Ubuntu 16.04 LTS, XFS Filesystem, Samsung 960 Pro 512GB (FW:1B6QCXP7):  

sysbench --num-threads=64 --max-time=86400 --max-requests=0 --test=fileio --file-num=1 --file-total-size=260G --file-io-mode=async --file-extra-flags=direct --file-test-mode=rndrd run

Note: I used asynchronous direct IO to keep it close to how MySQL (InnoDB) submits IO requests.

This is what the “Read Throughput” graph looks in Percona Monitoring and Management (PMM):

Samsung 960 Pro

As you can see, in addition to some reasonable ebbs and flows we have some major dips from about 1.5GB/sec of random reads to around 800MB/sec. This almost halves the performance. We can clearly see two of those dips, with the third one starting when the test ended.  

What is really interesting is that as I did a read-write test, it performed much more uniformly:

sysbench --num-threads=64 --max-time=86400 --max-requests=0 --test=fileio --file-num=1 --file-total-size=260G --file-io-mode=async --file-extra-flags=direct --file-test-mode=rndrw run

Samsung 960 Pro

Any ideas on what the cause of such strange periodic IO performance regression for reads could be?

This does not look like overheating throttling. It is much too regular for that (and I checked the temperature – is wasn’t any different during this performance regression).

One theory I have is “read disturb management”: could the SSD need to rewrite the data after so many reads? By my calculations, every cell is read some 166 times during the eight hours between those gaps. This doesn’t sound like a lot.

What are your thoughts?

Sep
09
2016
--

Don’t Spin Your Data, Use SSDs!

ssds

ssdsThis blog post discussed the advantages of SSDs over HDDs for database environments.

For years now, I’ve been telling audiences for my MySQL Performance talk the following: if you are running an I/O-intensive database on spinning disks you’re doing it wrong. But there are still a surprising number of laggards who aren’t embracing SSD storage (whether it’s for cost or reliability reasons).

Let’s look at cost first. As I write this now (September 2016), high-performance server-grade spinning hard drives run for about $240 for 600GB (or $0.40 per GB).  Of course, you can get an 8TB archive drive at about same price (about $0.03 per GB), but it isn’t likely you’d use something like that for your operational database. At the same time, you can get a Samsung 850 EVO drive for approximately $300 (or $0.30 per GB), which is cheaper than the server-grade spinning drive!  

While it’s not the best drive money can buy, it is certainly an order of magnitude faster than any spinning disk drive!

(I’m focusing on the cost per GB rather than the cost of the number of IOPS per drive as SSDs have overtaken HDDs years ago when it comes to IOPS/$.)

If we take a look at Amazon EBS pricing, we will find that Amazon has moved to SSD volumes by default as “General Purpose” storage (gp2). Prices for this volume type run about 2x higher per GB than high-performance HDD-based volumes (st1) and provisioned IOPs volumes. The best volumes for databases will likely run you 4x higher than HDD.

This appears to be a significant cost difference, but keep in mind you can get much more IOPS at much better latency from these volumes. They also handle IO spikes better, which is very important for real workloads.

Whether we’re looking at a cloud or private environment, it is wrong just to look at the cost of the storage alone – you must look at the whole server cost. When using an SSD, you might not need to spend also buy a RAID card with battery-backed-up (BBU) cache, as many SSDs have similar functions built in.

(For some entry-level SSDs, there might be an advantage to purchasing a RAID with BBU, but it doesn’t affect performance nearly as much as for HDDs. This works out well, however, as entry level SSDs aren’t going to cost that much to begin with and won’t make this setup particularly costly, relative to a higher-end SSD.)  

Some vendors can charge insane prices for SSDs, but this is where you should negotiate and your alternative vendor choice powers.

Some folks are concerned they can’t get as much storage per server with SSDs because they are smaller. This was the case a few years back, but not any more. You can find a 2TB 2.5” SSD drive easily, which is larger than the available 2.5” spinning drives. You can go as high as 13TB in the 2.5” form factor

There is a bit of challenge if you’re looking at the NVMe (PCI-E) cards, as you typically can’t have as many of those per server as you could using spinning disks, but the situation is changing here as well with the 6.4TB SX300 from Sandisk/FusionIO or the PM1725 from Samsung. Directly attached storage provides extremely high performance and 10TB-class sizes.  

To get multiple storage units together, you can use hardware RAID, software RAID, LVM striping or some file systems (such as ZFS) can take care of it for you.    

Where do we stand with SSD reliability? In my experience, modern SSDs (even inexpensive ones) are pretty reliable, particularly for online data storage. The shelf life of unpowered SSDs is likely to be less than HDDs, but we do not really keep servers off for long periods of time when running database workloads. Most SSDs also do something like RAID internally (it’s called RAIN) in addition to error correction codes that protect your data from a full single flash chip.

In truth, focusing on storage-level redundancy is overrated for databases. We want to protect most critical applications from complete database server failure, which means using some form of replication, storing several copies of data. In this case, you don’t need bulletproof storage on a single server – just a replication setup where you won’t lose the data and any server loss is easy to handle. For MySQL, solutions like Percona XtraDB Cluster come handy. You can use external tools such as Orchestrator or MHA to make MySQL replication work.  

When it comes to comparing SSD vs. HDD performance, whatever you do with SSDs they will likely still perform better than HDDs. Your RAID5 and RAID6 arrays made from SSDs will beat your RAID10 and RAID0 made from HDDs (unless your RAID card is doing something nasty).

Another concern with SSD reliability is write endurance. SSDs indeed have a specified amount of writes they can handle (after which they are likely to fail). If you’re thinking about replacing HDDs with SSDs, examine how long SSDs would endure under a comparable write load.  

If we’re looking at a high HDD write workload, a single device is likely to handle 200 write IOPS of 16KB (when running InnoDB). Let’s double that. That comes to 6.4MB/sec, which gives us  527GB/day (doing this 24/7). Even with the inexpensive Samsung 850 Pro we get 300TB of official write endurance – enough for 1.5 years. And in reality, drives tend to last well beyond their official specs.    

If you don’t like living on the edge, more expensive server-grade storage options have much better endurance. For example, 6.4TB SX300 offers almost 100x more endurance at 22 Petabytes written.

In my experience, people often overestimate how many writes their application performs on a sustained basis. The best approach is to do the math, but also monitor the drive status with a SMART utility or vendor tool. The tools can alert you in advance when drive wears out.

Whatever your workload is, you will likely find an SSD solution that offers you enough endurance while significantly exceeding the performance of an HDD-based solution.

Finally, there is a third and very important component of SSD reliability for operational database workloads: not losing your data during a power failure. Many “consumer-grade” SSDs come with drive write cache enabled by default, but without proper power loss protection. This means you can lose some writes during a power failure, causing data loss or database corruption.

Disabling write cache is one option, though it can severely reduce write performance and does not guarantee data won’t be lost. Using enterprise-grade SSDs from a reputable vendor is another option, and testing SSDs yourself might be a good idea if you’re on a budget.  

Conclusion

When it comes to operational databases, whether your workload is on-premises or in the cloud,  Don’t spin your data – use SSD. There are choices and options for almost any budget and every workload.

Aug
10
2016
--

Small innodb_page_size as a performance boost for SSD

performance boost for SSD

In this blog post, we’ll discuss how a small innodb_page_size can create a performance boost for SSD.

In my previous post Testing Samsung storage in tpcc-mysql benchmark of Percona Server I compared different Samsung devices. Most solid state drives (SSDs) use 4KiB as an internal page size, and the InnoDB default page size is 16KiB. I wondered how using a different innodb_page_size might affect the overall performance.

Fortunately, MySQL 5.7 comes with the option innodb_page_size, so you can set different InnoDB page sizes than the standard 16KiB. This option is still quite inconvenient to use, however. You can’t change innodb_page_size for the existing database. Instead, you need to create a brand new database with a different innodb_page_size and reload whole data set. This is a serious showstopper for production adoption. Specifying innodb_page_size for individual tables or indexes would be a welcome addition, and you could change it with a simple ALTER TABLE foo page_size=4k.

Anyway, this doesn’t stop us from using innodb_page_size=4k in the testing environment. Let’s see how it affects the results using the same conditions described in my previous post.

performance boost for SSD

Again we see that the PM1725 outperforms the SM863 when we have a limited memory, and the result is almost equal when we have plenty of memory.

But what about innodb_page_size 4k vs 16k.?

Here is a direct comparison chart:

performance boost for SSD

Tabular results (in NOTPM, more is better):

Buffer Pool, GiB pm1725_16k pm1725_4k sam850_16k sam850_4k sam863_16k sam863_4k pm1725 4k/16k
5 42427.57 73287.07 1931.54 2682.29 14709.69 48841.04 1.73
15 78991.67 134466.86 2750.85 6587.72 31655.18 93880.36 1.70
25 108077.56 173988.05 5156.72 10817.23 56777.82 133215.30 1.61
35 122582.17 195116.80 8986.15 11922.59 93828.48 164281.55 1.59
45 127828.82 209513.65 12136.51 20316.91 123979.99 192215.27 1.64
55 130724.59 216793.99 19547.81 24476.74 127971.30 212647.97 1.66
65 131901.38 224729.32 27653.94 23989.01 131020.07 220569.86 1.70
75 133184.70 229089.61 38210.94 23457.18 131410.40 223103.07 1.72
85 133058.50 227588.18 39669.90 24400.27 131657.16 227295.54 1.71
95 133553.49 226241.41 39519.18 24327.22 132882.29 223963.99 1.69
105 134021.26 224831.81 39631.03 24273.07 132126.29 222796.25 1.68
115 134037.09 225632.80 39469.34 24073.36 132683.55 221446.90 1.68

 

It’s interesting to see that 4k pages help to improve the performance up to 70%, but only for the PM1725 and SM863. For the low-end Samsung 850 Pro, using a 4k innodb_page_size actually makes things worse when using a high amount of memory.

I think a 70% performance gain is too significant to ignore, even if manipulating innodb_page_size requires extra work. I think it is worthwhile to evaluate if using different innodb_page_size settings help a fast SSD under your workload.

And hopefully MySQL 8.0 makes it easier to use different page sizes!

May
23
2014
--

How to improve InnoDB performance by 55% for write-bound loads

During April’s Percona Live MySQL Conference and Expo 2014, I attended a talk on MySQL 5.7 performance an scalability given by Dimitri Kravtchuk, the Oracle MySQL benchmark specialist. He mentioned at some point that the InnoDB double write buffer was a real performance killer. For the ones that don’t know what the innodb double write buffer is, it is a disk buffer were pages are written before being written to the actual data file. Upon restart, pages in the double write buffer are rewritten to their data files if complete. This is to avoid data file corruption with half written pages. I knew it has an impact on performance, on ZFS since it is transactional I always disable it, but I never realized how important the performance impact could be. Back from PLMCE, a friend had dropped home a Dell R320 server, asking me to setup the OS and test it. How best to test a new server than to run benchmarks on it, so here we go!

ZFS is not the only transactional filesystem, ext4, with the option “data=journal”, can also be transactional. So, the question is: is it better to have the InnoDB double write buffer enabled or to use the ext4 transaction log. Also, if this is better, how does it compare with xfs, the filesystem I use to propose but which do not support transactions.

Methodology

The goal is to stress the double write buffer so the load has to be write intensive. The server has a simple mirror of two 7.2k rpm drives. There is no controller write cache and the drives write caches are disabled. I decided to use the Percona tpcc-mysql benchmark tool and with 200 warehouses, the total dataset size was around 18G, fitting all within the Innodb buffer pool (server has 24GB). Here’re the relevant part of the my.cnf:

innodb_read_io_threads=4
innodb_write_io_threads=8  #To stress the double write buffer
innodb_buffer_pool_size=20G
innodb_buffer_pool_load_at_startup=ON
innodb_log_file_size = 32M #Small log files, more page flush
innodb_log_files_in_group=2
innodb_file_per_table=1
innodb_log_buffer_size=8M
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
skip-innodb_doublewrite  #commented or not depending on test

So, I generated the dataset for 200 warehouses, added they keys but not the foreign key constraints, loaded all that in the buffer pool with a few queries and dumped the buffer pool. Then, with MySQL stopped, I did a file level backup to a different partition. I used the MySQL 5.6.16 version that comes with Ubuntu 14.04, at the time Percona server was not available for 14.04. Each benchmark followed this procedure:

  1. Stop mysql
  2. umount /var/lib/mysql
  3. comment or uncomment skip-innodb_doublewrite in my.cnf
  4. mount /var/lib/mysql with specific options
  5. copy the reference backup to /var/lib/mysql
  6. Start mysql and wait for the buffer pool load to complete
  7. start tpcc from another server

The tpcc_start I used it the following:

./tpcc_start -h10.2.2.247 -P3306 -dtpcc -utpcc -ptpcc -w200 -c32 -r300 -l3600 -i60

I used 32 connections, let the tool run for 300s of warm up, enough to reach a steady level of dirty pages, and then, I let the benchmark run for one hour, reporting results every minute.

Results

Test: Double write buffer File system options Average NOPTM over 1h
ext4_dw Yes rw 690
ext4_dionolock_dw Yes rw,dioread_nolock 668
ext4_nodw No rw 1107
ext4trx_nodw No rw,data=journal 1066
xfs_dw Yes xfs rw,noatime 754

 

So, from the above table, the first test I did was the common ext4 with the Innodb double write buffer enabled and it yielded 690 new order transactions per minute (NOTPM). Reading the ext4 doc, I also wanted to try the “dioread_nolock” setting that is supposed to reduce mutex contention and this time, I got slightly less 668 NOTPM. The difference is within the measurement error and isn’t significant. Removing the Innodb double write buffer, although unsafe, boosted the throughput to 1107 NOTPM, a 60% increase! Wow, indeed the double write buffer has a huge impact. But what is the impact of asking the file system to replace the innodb double write buffer? Surprisingly, the performance level is only slightly lower at 1066 NOTPM and vmstat did report twice the amount writes. I needed to redo the tests a few times to convince myself. Getting a 55% increase in performance with the same hardware is not common except when some trivial configuration errors are made. Finally, I used to propose xfs with the Innodb double write buffer enabled to customers, that’s about 10% higher than ext4 with the Innodb double write buffer, close to what I was expecting. The graphic below presents the numbers in a more visual form.

TPCC NOTPM for various configurations

TPCC NOTPM for various configurations

In term of performance stability, you’ll find below a graphic of the per minute NOTPM output for three of the tests, ext4 non-transactional with the double write buffer, ext4 transactional without the double write buffer and xfs with the double write buffer. The dispersion is qualitatively similar for all three. The values presented above are just the averages of those data sets.

TPCC NOTPM evolution over time

TPCC NOTPM evolution over time

Safety

Innodb data corruption is not fun and removing the innodb double write buffer is a bit scary. In order to be sure it is safe, I executed the following procedure ten times:

  1. Start mysql and wait for recovery and for the buffer pool load to complete
  2. Check the error log for no corruption
  3. start tpcc from another server
  4. After about 10 minutes, physically unplug the server
  5. Plug back and restart the server

I observed no corruption. I was still a bit preoccupied, what if the test is wrong? I removed the “data=journal” mount option and did a new run. I got corruption the first time. So given what the procedure I followed and the number of crash tests, I think it is reasonable to assume it is safe to replace the InnoDB double write buffer by the ext4 transactional journal.

I also looked at the kernel ext4 sources and changelog. Up to recently, before kernel 3.2, O_DIRECT wasn’t supported with data=journal and MySQL would have issued a warning in the error log. Now, with recent kernels, O_DIRECT is mapped to O_DSYNC and O_DIRECT is faked, always for data=journal, which is exactly what is needed. Indeed, I tried “innodb_flush_method = O_DSYNC” and found the same results. With older kernels I strongly advise to use the “innodb_flush_method = O_DSYNC” setting to make sure files are opened is a way that will cause them to be transactional for ext4. As always, test thoroughfully, I only tested on Ubuntu 14.04.

Impacts on MyISAM

Since we are no longer really using O_DIRECT, even if set in my.cnf, the OS file cache will be used for InnoDB data. If the database is only using InnoDB that’s not a big deal but if MyISAM is significantly used, that may cause performance issues since MyISAM relies on the OS file cache so be warned.

Fast SSDs

If you have a SSD setup that doesn’t offer a transactional file system like the FusionIO directFS, a very interesting setup would be to mix spinning drives and SSDs. For example, let’s suppose we have a mirror of spinning drives handled by a raid controller with a write cache (and a BBU) and an SSD storage on a PCIe card. To reduce the write load to the SSD, we could send the file system journal to the spinning drives using the “journal_path=path” or “journal_dev=devnum” options of ext4. The raid controller write cache would do an awesome job at merging the write operations for the file system journal and the amount of write operations going to the SSD would be cut by half. I don’t have access to such a setup but it seems very promising performance wise.

Conclusion

Like ZFS, ext4 can be transactional and replacing the InnoDB double write buffer with the file system transaction journal yield a 55% increase in performance for write intensive workload. Performance gains are also expected for SSD and mixed spinning/SSD configurations.

The post How to improve InnoDB performance by 55% for write-bound loads appeared first on MySQL Performance Blog.

Aug
29
2013
--

Considering TokuDB as an engine for timeseries data

TokuDBI am working on a customer’s system where the requirement is to store a lot of timeseries data from different sensors.

For performance reasons we are going to use SSD, and therefore there is a list of requirements for the architecture:

  • Provide high insertion rate
  • Provide a good compression rate to store more data on expensive SSDs
  • Engine should be SSD friendly (less writes per timeperiod to help with SSD wear)
  • Provide a reasonable response time (within ~50 ms) on SELECT queries on hot recently inserted data

Looking on these requirements I actually think that TokuDB might be a good fit for this task.

There are several aspects to consider. This time I want to compare TokuDB vs InnoDB on an initial load time and space consumption.

Let’s assume the schema is following

CREATE TABLE `sensordata` (
  `ts` int(10) unsigned NOT NULL DEFAULT '0',
  `sensor_id` int(10) unsigned NOT NULL,
  `data1` double NOT NULL,
  `data2` double NOT NULL,
  `data3` double NOT NULL,
  `data4` double NOT NULL,
  `data6` double NOT NULL,
  `cnt` int(10) unsigned NOT NULL,
  PRIMARY KEY (`sensor_id`,`ts`)
)

where sensor_id is in a range from 1 to about 1000 and ts is monotonically increasing timestamp.

This schema exploits both TokuDB and InnoDB clustering primary key, and all inserts are “almost” sequential, which guarantee that all inserts will not require disk access and work with data in memory.
The same for SELECTS – select queries on the most recent time periods will be executed only by a memory access.

I am doing this research on the Dell PowerEdge R420 box with 48GB of memory (40GB for InnoDB buffer pool size, and default memory allocation for TokuDB, which is 24GB for tokudb cache). The storage is a very fast PCI-e Flash card.

The test export CSV file, suitable for LOAD DATA INFILE is 40GB in size and contains over 1 bln records (1.238.201.948 exactly)

MySQL Versions:

  • For InnoDB tests I used Percona Server 5.6-RC2
  • For TokuDB tests I used mariadb-5.5.30-tokudb-7.0.4 from Tokutek website

So, first, let’s load data into InnoDB, again, I am using LOAD DATA INFILE statement

  • InnoDB, no compression. Load time is 1 hour 26 min 25.77 sec, final table size is 90GB
  • InnoDB, 8K compression. Load time 3 hours 26 min 17.06 sec, the table size is 45GB
  • InnoDB, 4K compression. Load time 17 hours 23 min 43.48 sec, the table size is 26GB

Now for TokuDB:

  • TokuDB, default compression. Load time 33 min 1.18 sec, the table size on disk is 10GB
  • TokuDB, tokudb_small table format. Load time 37 min 2.34 sec, the table size is 4.6GB

So TokuDB is the obvious leader in both load time and compression. Of course just these are not enough, and now we need to see the performance of further INSERTs and SELECTs queries. This is what I am running right now and will post the results when I have them.

The post Considering TokuDB as an engine for timeseries data appeared first on MySQL Performance Blog.

Aug
28
2013
--

Testing Intel, Samsung & SanDisk SATA SSD

While working on the service architecture for one of our projects, I considered several SATA SSD options as the possible main storage for the data. The system will be quite write intensive, so the main interest is the write performance on capacities close to full-size storage.

After some research I picked several candidates (I show prices obviously actual for the date of publishing this post):

  • Samsung 840 Pro, 512GB, the current price $450
  • Samsung 840 Evo, 750GB, price $525. I consider these new SSD because of availability of models with big capacity, 750GB-1TB
  • Intel DC S3500, 480GB, price $650. This model is more expensive in $/GB, as Intel positions these devices for data center usage
  • SanDisk Extreme II 480 GB, price $450. This device caught my attention by good results in some third-party benchmarks

The devices are all attached to a LSI MegaRAID SAS 9260-4i raid controller with 512MB cache, and configured as individual RAID0 virtual devices.

Testing workload is: random writes, 16KiB block size sending in 8 threads as asynchronous IO on ext4 file system. Asynchronous IO is used in the most recent MySQL/InnoDB and theoretically it shows the best possible throughput.

For the Samsung 840 Evo I used 600GiB file size, and for the rest 350GiB, to emulate 70-80% filling

The results in a timeline form to see the stability of the performance:
timeline

The results in a jitter form with a median throughput:
jitter

My thoughts on these results:

  • I frankly expected more from Samsung devices, but they ended up below 50MiB/sec in random writes
  • SanDisk is definitely attractive for absolute throughput, but significant instability is a concern
  • Intel shows a decent performance, but also it is the most expensive

So I am going to run further tests on Intel and SanDisk to find out which one is more suitable for our needs.

For the reference the script I used for tests:

sz=350G
sysbench --test=fileio --file-total-size=$sz --file-num=64 prepare
sysbench --test=fileio --file-total-size=$sz --file-test-mode=rndwr --max-time=180000 --max-requests=0 --num-threads=8 --rand-init=on --file-num=64 --file-io-mode=async --file-extra-flags=direct --file-fsync-freq=0 --file-block-size=16384 --report-interval=10 run

The post Testing Intel, Samsung & SanDisk SATA SSD appeared first on MySQL Performance Blog.

May
16
2013
--

Virident vCache vs. FlashCache: Part 1

Virident vCache vs. FlashCache(This is part one of a two part series) Over the past few weeks I have been looking at a preview release of Virident’s vCache software, which is a kernel module and set of utilities designed to provide functionality similar to that of FlashCache. In particular, Virident engaged Percona to do a usability and feature-set comparison between vCache and FlashCache and also to conduct some benchmarks for the use case where the MySQL working set is significantly larger than the InnoDB buffer pool (thus leading to a lot of buffer pool disk reads) but still small enough to fit into the cache device. In this post and the next, I’ll present some of those results.

Disclosure: The research and testing for this post series was sponsored by Virident.

Usability is, to some extent, a subjective call, as I may have preferences for or against a certain mode of operation that others may not share, so readers may have a different opinion than mine, but on this point I call it an overall draw between vCache and FlashCache.

Ease of basic installation. The setup process was simply a matter of installing two RPMs and running a couple of commands to enable vCache on the PCIe flash card (a Virident FlashMAX II) and set up the cache device with the command-line utilities supplied with one of the RPMs. Moreover, the vCache software is built in to the Virident driver, so there is no additional module to install. FlashCache, on the other hand, requires building a separate kernel module in addition to whatever flash memory driver you’ve already had to install, and then further configuration requires modification to assorted sysctls. I would also argue that the vCache documentation is superior. Winner: vCache.

Ease of post-setup modification / advanced installation. Many of the FlashCache device parameters can be easily modified by echoing the desired value to the appropriate sysctl setting; with vCache, there is a command-line binary which can modify many of the same parameters, but doing so requires a cache flush, detach, and reattach. Winner: FlashCache.

Operational Flexibility: Both solutions share many features here; both of them allow whitelisting and blacklisting of PIDs or simply running in a “cache everything” mode. Both of them have support for not caching sequential IO, adjusting the dirty page threshold, flushing the cache on demand, or having a time-based cache flushing mechanism, but some of these features operate differently with vCache than with FlashCache. For example, when doing a manual cache flush with vCache, this is a blocking operation. With FlashCache, echoing “1″ to the do_sync sysctl of the cache device triggers a cache flush, but it happens in the background, and while countdown messages are written to syslog as the operation proceeds, the device never reports that it’s actually finished. I think both kinds of flushing are useful in different situations, and I’d like to see a non-blocking background flush in vCache, but if I had to choose one or the other, I’ll take blocking and modal over fire-and-forget any day. FlashCache does have the nice ability to switch between FIFO and LRU for its flushing algorithm; vCache does not. This is something that could prove useful in certain situations. Winner: FlashCache.

Operational Monitoring: Both solutions offer plenty of statistics; the main difference is that FlashCache stats can be pulled from /proc but vCache stats have to be retrieved by running the vgc-vcache-monitor command. Personally, I prefer “cat /proc/something” but I’m not sure that’s sufficient to award this category to FlashCache. Winner: None.

Time-based Flushing: This wouldn’t seem like it should be a separate category, but because the behavior seems to be so different between the two cache solutions, I’m listing it here. The vCache manual indicates that “flush period” specifies the time after which dirty blocks will be written to the backing store, whereas FlashCache has a setting called “fallow_delay”, defined in the documentation as the time period before “idle” dirty blocks are cleaned from the cache device. It is not entirely clear whether or not these mechanisms operate in the same fashion, but based on the documentation, it appears that they do not. I find the vCache implementation more useful than the one present in FlashCache. Winner: vCache.

Although nobody likes a tie, if you add up the scores, usability is a 2-2-1 draw between vCache and FlashCache. There are things that I really liked better with FlashCache, and there are other things that I thought vCache did a much better job with. If I absolutely must pick a winner in terms of usability, then I’d give a slight edge to FlashCache due to configuration flexibility, but if the GA release of vCache added some of FlashCache’s additional configuration options and exposed statistics via /proc, I’d vote in the other direction.

Stay tuned for part two of this series, wherein we’ll take a look at some benchmarks. There’s no razor-thin margin of victory for either side here: vCache outperforms FlashCache by a landslide.

The post Virident vCache vs. FlashCache: Part 1 appeared first on MySQL Performance Blog.

Apr
16
2013
--

Testing the Micron P320h

The Micron P320h SSD is an SLC-based PCIe solid-state storage device which claims to provide the highest read throughput of any server-grade SSD, and at Micron’s request, I recently took some time to put the card through its paces, and the numbers are indeed quite impressive.

For reference, the benchmarks for this device were performed primarily on a Dell R720 with 192GB of RAM and two Xeon E5-2660 processors that yield a total of 32 virtual cores. This is the same machine which was used in my previous benchmark run. A small handful of additional tests were also performed using the Cisco UCS C250. The operating system in use was CentOS 6.3, and for the sysbench fileIO tests, the EXT4 filesystem was used. The card itself is the 700GB model.

So let’s take a look at the data.

With the sysbench fileIO test in asynchronous mode, read performance is an extremely steady 3202MiB/sec with almost no deviation. Write performance is also both very strong and very steady, coming in at a bit over 1730MiB/sec with a standard deviation of a bit less than 13MiB/sec.

realssd-asyncIO

 

When we calculate in the fact that the block size in use here is 16KiB, these numbers equate to over 110,000 write IOPS and almost 205,000 read IOPS.

When we switch over to synchronous IO, we find that the card is quite capable of matching the asynchronous performance:

syncIO-throughput

Synchronous read reaches peak capacity somewhere between 32 and 64 threads, and synchronous write tops out somewhere between 64 and 128 threads. The latency numbers are equally impressive; the next two graphs show 95th and 99th-percentile response time, but there really isn’t much difference between the two.

syncIO-latency

At 64 read threads, we reach peak performance with latency of roughly 0.5 milliseconds; and at 128 write threads we have maximum throughput with latency just over 3ms.

How well does it perform with MySQL? Exact results vary, depending upon the usual factors (read/write ratio, working set size, buffer pool size, etc.) but overall the card is extremely quick and handily outperforms the other cards that it was tested against. For example, in the graph below we compare the performance of the P320h on a standard TPCC-MySQL test to the original FusionIO and the Intel i910 with assorted buffer pool sizes:

tpcc-mysql-devicecompare

 

And in this graph we look at the card’s performance on sysbench OLTP:

sysbench-oltp-ext4xfs

It is worth noting here that EXT4 outperforms XFS by a fairly significant margin. The approximate raw numbers, in tabular format, are:

EXT4 XFS
13GiB BP 22000 7500
25GiB BP 17000 9000
50GiB BP 21000 11000
75GiB BP 25000 15000
100GiB BP 31000 19000
125GiB BP 36000 25000

In the final analysis, there may or may not be faster cards out there, but the Micron P320h is the fastest one that I have personally seen to date.

The post Testing the Micron P320h appeared first on MySQL Performance Blog.

Mar
21
2013
--

Testing the Virident FlashMAX II

Approximately 11 months ago, Vadim reported some test results from the Virident FlashMax 1400M, an MLC PCIe SSD device. Since that time, Virident has released the FlashMAX II, which promises both increased capacity and increased performance over the previous model. In this post, we present some benchmark results comparing this new model to its predecessor, and we find that indeed, the FlashMax II is a significant upgrade.

For reference, all of the FlashMax II benchmarks were performed on a Dell R710 with 192GB of RAM. This is a dual-socket Xeon E5-2660 machine with 16 physical and 32 virtual cores. (I had originally planned to use the Cisco UCS C250 that is often used for our test runs, but that machine ran into some unrelated hardware difficulties and was ultimately unavailable.) The operating system in use was CentOS 6.3, and the filesystem used for the test was XFS, mounted with both the noatime,nodiratime options. The card was physically formatted back to factory default settings in between the synchronous and asynchronous test suites. Note that factory default settings for the FlashMax II will cause it to be formatted in “maxcapacity” mode rather than “maxperformance” mode (maxperformance reserves some additional space internally to provide better write performance). In “maxcapacity” mode, the device tested provides approximately 2200GB of space. In “maxperformance” mode, it’s a bit less than 1900GB.

Without further ado, then, here are the numbers.

First, asynchronous random writes:

async-rndwr-warmup-lg

There is a warmup period of around 18 minutes or so, and after about 45 minutes the performance stabilizes and remains effectively constant, as shown by the next graph.

async-rndwr-8-64-lg

Once the write performance reaches equilibrium, it does so at just under 780MiB/sec, which is approximately 40% higher than the 550MiB/sec exhibited by the FlashMax 1400M.

Asynchronous random read is up next:

async-rndrd-128-lg

The behavior of the FlashMax II is very similar to that of the FlashMax 1400M in terms of predictable performance; the standard deviation on the asynchronous random read throughput measurement is only 5.7MiB/sec. However, the overall read throughput is over 1000MiB/sec better with the FlashMax II: we see a read throughput of approximately 2580MiB/sec vs. 1450MiB/sec with the previous generation of hardware, an improvement of roughly 80%.

Finally, we take a look at synchronous random read.

sync-rndrd-128-lg

At 256 threads, read throughput tops out at 2090MiB/sec, which is about 20% less than the asynchronous results; given the small bump in throughput going from 128 to 256 threads and the doubling of latency that was also introduced there, this is likely about as good as it is going to get.

For comparison, the FlashMax 1400M synchronous random read test stopped after 64 threads, reaching a synchronous random read throughput of 1345MiB/sec and a 95th-percentile latency of 1.49ms. With those same 64 threads, the FlashMax II reaches 1883MiB/sec with a 95th-percentile latency of 1.105ms. This represents approximately 40% more throughput, 25% faster.

In every area tested, the FlashMax II outperforms the original FlashMax 1400M by a significant margin, and can be considered a worthy successor.

The post Testing the Virident FlashMAX II appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com