Why Consumer SSD Reviews are Useless for Database Performance Use Case

Anandtech Table reviewing consumer SSD performance

If you’re reading consumer SSD reviews and using them to estimate SSD performance under database workloads, you’d better stop. Databases are not your typical consumer applications and they do not use IO in the same way.

Let’s look, for example, at this excellent AnandTech review of Samsung 960 Pro –  a consumer NVMe device that I happen to have in my test lab.

Anandtech Table reviewing consumer SSD performance

The summary table is actually great, showing the performance both at Queue Depth 1 (single threaded) as well as Queue Depth 32 – a pretty heavy concurrent load.

Even at QD1 we see 50K (4K) writes per second, which should be enough for pretty serious database workloads.

In reality, though, you might be in for some disappointing surprises. While “normal” buffered IO is indeed quite fast, this drive really hates fsync() calls, with a single thread fsync() latency of 3.5ms or roughly 300 fsync/sec. That’s just slightly more than your old school spinning drive.

Why is fsync() performance critical for databases?

To achieve Durabilitythe letter “D” of ACIDdatabases tend to rely on a write ahead log (WAL) which is sequentially written. The WAL must be synced to disk on every transaction commit using fsync() or similar measures, such as opening file with O_SYNC flag. These tend to have similar performance implications.

Other database operations use fsync() too, but writing WAL is where it usually hurts the most.    

In a fully durable configuration MySQL tends to be impacted even more by poor fsync() performance. It may need to perform as many as three fsync operations per transaction commit. Group commit reduces the impact on throughput but transaction latency will still be severely impacted

Want more bad news? If the fsync() performance is phenomenal on your consumer SSD it indeed might be too good to be true. Over the years, some consumer SSDs “faked” fsync and accepted possible data loss in the case of power failure. This might not be a big deal if you only use them for testing but it is a showstopper for any real use.

Want to know more about your drive’s fsync() performance?  You can use these sysbench commands:

sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 prepare
sysbench fileio --time=60 --file-num=1 --file-extra-flags= --file-total-size=4096 --file-block-size=4096 --file-fsync-all=on --file-test-mode=rndwr --file-fsync-freq=0 --file-fsync-end=0  --threads=1 --percentile=99 run | grep "avg:"

You can also use ioping as described in this blog post

I wish that manufacturers’ tech specifications described fsync latency, along with a clear statement as to whether the drive guarantees no loss of data on power failure. Likewise, I wish folk doing storage reviews could include these in their research.

Interested in fsync() performance for variety of devices?  Yves Trudeau wrote an excellent blog post about fsync() performance on various storage devices  a few months ago.

Other technical resources

Principal Support Escalation Specialist Sveta Smirnova presents Troubleshooting MySQL Concurrency Issues with Load Testing Tools. 

You can download a three part series of eBooks by Principal Consultant Alexander Rubin and me on MySQL Performance.

The post Why Consumer SSD Reviews are Useless for Database Performance Use Case appeared first on Percona Database Performance Blog.


Optimizing MySQL for Zabbix

dolphins making friends with zabbix
This blog post was inspired by my visit at the annual Zabbix Conference in Riga, Latvia this year, where I gave a couple of talks on MySQL and beyond.

It was a two day single-track event with some 200 participants, a number of interesting talks on Zabbix (and related technologies) and really well-organized evening activities. I was amazed how well organized the event was and hope to be invited to speak there next year as well. ;)  (Just in case you’re not sure what Zabbix is, it is an enterprise-class open source distributed monitoring solution for networks and applications)

I must secretly confess, it was also the first conference where I honestly enjoyed being on stage and connecting with the audience – I was even looking forward to it rather than being scared as hell (which is what typically happens to me)! I guess it was all about the positive atmosphere, so big thanks to all the speakers and attendees for that. It meant a lot to me.

If I had to mention one negative vibe I heard from attendees, it would be that there was not enough deeply technical content, however, I think this is slightly biased, because people I talked to most, were ones who enjoyed my technical talks and so they were craving for more.

And now, without further ado, let me get to the essence of this blog post.

Zabbix and MySQL

The very first thing I did when I arrived at the conference was to approach people who I knew use Zabbix on a large scale and tried to figure out what were the biggest challenges they face. Apparently, in all of the cases, it was MySQL and more specifically, MySQL disk IO.

With that in mind, I would like to suggest a few optimizations that will help your MySQL get the best out of your disks (and consequentially will help your Zabbix get the best out of MySQL) and the available hardware resources in general.

SSD is a game changer

“Will MySQL run better on SSDs?” I’ve been hearing this question over and over again, both publicly and privately.

I can tell you without a shadow of doubt, if IO is currently your bottle-neck – either because some queries take a long time to run and you see diskstats reporting 100-250 reads per second until the query completes (latency), or because you are overloading the disks with requests and wait time suffers (throughput), SSDs will definitely help and not just by little, by much!

Consider this: the fastest spinning disk (15k rpm) can do 250 random IO operations per second tops (at this point it is limited by physics) and single query will only ever read from one disk even if you have RAID10 made of 16 disks, so if you need to read 15,000 data points to display a graph, reading those data points from disk will take 60s.

Enterprise-class SSD disk, on the other hand, can do 15,000 or even more 16k random reads per second with a single-thread (16k is the size of an InnoDB block). And as you increase the throughput, it only gets better! So that means that the query in the previous example would take 1s instead of 60s, which is a significant difference. Plus you can run more requests on the same SSD at the same time and the total number of IO operations will only increase, while a single spinning disk would have to share the available 250 IO operations between multiple requests.

The only area where SSDs don’t beat spinning disks (yet) is sequential operation, especially single-threaded sequential writes. If that is your typical workload (which might be the case if you’re mostly collecting data and rarely if ever reading it), then you may want to consider other strategies.

MySQL configuration

Besides improving your disk IO subsystem, there’s ways to reduce the pressure on IO and I’m going to cover a few my.cnf variables that will help you with that (and with other things such as internal contention).

Note, most of the tunables are common for any typical high-performance MySQL setup, though some are explicitly suited for Zabbix because you can relax a few parameters for great effect at the price of, in the worst case, loosing up to 1s worth of collected data which, from discussions during the conference, didn’t seem like a big deal to anyone.

– innodb_buffer_pool_size – if you have a dedicated MySQL server, set it as high as you can (ceiling would be 75% of total available memory). Otherwise, you should balance it with other processes on the server, but if it’s only zabbix server, I would still leave it very high, close to 75% of total RAM.

– innodb_buffer_pool_instances – on MySQL 5.5, set it to 4, on MySQL 5.6 – 8 or even 16.

– innodb_flush_log_at_trx_commit = 0 – this is where you compromise durability for significantly improved write throughput, especially if you don’t own a disk subsystem with non-volatile cache. Basically the loss you may incur is up to 1s worth of writes during MySQL or server crash. A lot of websites actually run with that (a lot of websites still run on MyISAM!!!), I’m quite sure it’s not an issue for Zabbix setup.

– innodb_flush_method = O_DIRECT – if you are running Linux, just leave it set to that.

– innodb_log_file_size – you want these transaction logs (there’s two of them by default) to hold 1 to 2 hours worth of writes. To determinte that, you can probably have a look at the Zabbix graphs for your MySQL server, but also you can run the following from the mysql command line:

mysql> pager grep seq; show engine innodb statusG select sleep(3600); show engine innodb statusG
PAGER set to 'grep seq'
Log sequence number 8373513970951
Log sequence number 8373683996767

The difference between the two numbers is how many bytes InnoDB has written during last hour. So on this server above, I would set innodb_log_file_size=128M and would end up with 256M of log file space allowing me to store more than 1h worth of writes in transaction logs (See this on changing the log file size if you run MySQL 5.5 or earlier)

– innodb_read_io_threads, innodb_write_io_threads – don’t overthink these, they are not as important as they may seem, especially if you are using Async IO (you can check that by running “show global variables like ‘innodb_use_native_aio’” in mysql cli). On MySQL 5.5 and 5.6 you generally want to be using Async IO (AIO), so check mysql log to understand why, if you are not. That said, if you are not using AIO and you are not going to, just set these values to 8 and leave them there.

– innodb_old_blocks_time = 1000 – this will help you prevent buffer pool pollution due to occasional scans. This is now default in MySQL 5.6 (On 5.5, it needs to be set explicitly).

– innodb_io_capacity – set this to as many write iops as your disk subsystem can handle. For SSDs this should be at least few thousand (2000 could be a good start) while for rotating disks somewhat lower values – 500-800, depending on number of bearing disks, will do. Best to benchmark disks or do the math for actual numbers, but default of 200 is definitely too low for most systems nowadays.

– sync_binlog=0 – this is the default setting, but just in case it’s above 0, turn it off, unless you run something else besides Zabbix. The price of not synchronising binary logs is that in case of a master crash, replicas can get out of sync, but if you are constantly hitting IO bottle-neck due to binary log synchronisation just because you want to avoid the hassle of synchronising the slave once every five years when master crashes, you should reconsider this option.

– query_cache_size=0, query_cache_type=0 – that will disable the query cache. Most of the time you don’t want query cache. And if it’s not disabled in the kernel by these settings, queries (especially small ones) will likely suffer due to query cache mutex contention.

– sort_buffer_size, join_buffer_size, read_rnd_buffer_size – if you ever configured these variables, cancel those changes (just remove them or comment them out). I find these are the top three mistuned variables on most customer servers, while in many cases it’s best if they are not touched at all. Just leave them at their defaults and you’re set.

– tmpdir – sometimes it’s a good idea to point tmpdir to /dev/shm so that on-disk temporary tables are actually written to memory, but there’s one important caveat starting with MySQL 5.5: if you do that, it disables AIO acorss the board, because tmpfs does not support AIO. So I would monitor the activity on current tmpdir (/tmp usually) and only switch it to /dev/shm if I see it being an issue.

MySQL Partitioning

I know that Zabbix now supports partitions with a purpose of easier data pruining, however I think there are some extra benefits you could get from partitions. Well actually subpartitions if you are already using partitions by date.

The KPI for Zabbix, that you could hear over and over again, is the “new values per second” number that you can find in the status of Zabbix. Basically the higher the value (given you have enough values to monitor), the better is the throughput of your Zabbix. And this is where a lot of people are hitting the Zabbix limits – MySQL can’t insert enough new values per second.

Besides the optimizations I have already mentioned above (they should greatly increase your write throughput!), I would encourage you to try out partitions (if you’re not using partitions already) or subpartitions (if you are) BY HASH as we found that partitioning in some cases can increase the throughput of InnoDB.

I did not test it with Zabbix specifically and as it’s not supported by Zabbix out of the box, you would have to hack it to make it work, but if you’ve done all the changes above and you still can’t get enough new values per second (AND it is not the hardware that is limiting you), try partitioning or subpartitioning the key tables by hash.

If this sounds interesting but you’re not sure where to start, feel free to contact us and we’ll work with you to make it work.

On MySQL High Availability

There are options to make MySQL highly available, even though many believed it’s not the case. We’ve been writing a lot on it on our blog so I will not paraphrase or repeat, instead I would like to point you to a few valuable resources on that topic:


Percona Server, Percona XtraDB Cluster, Percona Toolkit – it’s all FREE!

I’m not really sure why, but many people I talked to at the conference thought that all of the Percona software either needs to be bought or that it has some enterprise features that are not available unless you buy a license.

The truth is that neither of it is true. All Percona software is completely free of charge. Feel free to download it from our website or through repositories and use it as you please.

See you at the Zabbix conference next year!

see you at Zabbix

The post Optimizing MySQL for Zabbix appeared first on MySQL Performance Blog.


Inexpensive SSDs for Database Workloads

nexpensive SSDs for Database WorkloadsThe cost of SSDs has been dropping rapidly, and at the time of this writing, 2.5-drives have reached the 1TB capacity mark.  You can actually get inexpensive drives for as little as 60 cents per GB. Even inexpensive SSDs can perform tens of thousands of IOPs and come with 1.5M – 2M hous MTBF and a 5-year warranty: check out the Intel SC S3500 specs as an example. There is however one important factor you need to take into account when considering  SSDs as opposed to conventional hard drives – Write Endurance.

Many of us have heard about SSDs having limits in terms of how many writes SSDs can handle, many however assume this is what is already accounted for in the warranty period and so if the hard drives claim to have sequential write speed of 450MB/sec and a warranty of 5 years we expect it to be able to sustain this level of writes for the whole period of time…. Others may look at the computations with 50GB or 100GB per day, such as here, giving us an even better feel of 30-60 years of life time.

In addition to warranty you should be looking at “Write Endurance Rating” which is 450TBW (stands for TB Written) for the given model. Note this number may be defined a little bit differently as the amount of writes a drive needs to do internally will depend on the types of writes – sequential writes having less wear than random ones.

Endurance, not the performance, which might be already more than enough for your workload, is a key factor in which consumer-grade and the enterprise-grade SSDs are different. For example if we take a look at a bit larger 1.2TB FusionIO Card we see Write Endurance of 16.26PB which is some 35 times more – so handling the same amount of writes the card would live 35 times longer.

So how much of these 450TB is really used? Well if you look at the specified write rate of 450MB/sec it will just last for 1M seconds which is less than 2 weeks. This is a shockingly low number and I think few people realize how low it really is. If we look at the “Random IO” number instead which is specified at some 11K IOPs or we can check Vadim’s numbers which give us about 120MB/sec for 16KB writes typical for Innodb – we’re going to be looking at a couple of month of life times of the full write load.

There is more interesting math you can do. Those 100GB a day which sounds like a lot is just 1.2MB/sec on average – hundreds of times less than a peak write performance device can sustain. Now lets see what happens if I would like to have my hard drive with 450TBW last for 3 years. This means I can afford to write about 400GB/day or roughly 5MB/sec on average. Frankly as little as it sounds it is not that bad. If you’re looking at random writes for conventional hard drives they can do roughly 150 truly random writes per second which is just 2.5MB/sec using 16KB innodb page size. This means you can drive the SSD drive on the 2x random write performance conventional hard drive will peak out for 3 year before it is expected to fail.

What we can learn from these numbers? If you’re looking for inexpensive SSDs as replacement for conventional drives for database workloads you really need to be looking at endurance and the write rate for your workload. For these kinds of SSDs you should be looking at them as able to provide you with very fast reads, being able to handle burst writes of pretty high volumes and being able to sustain reasonable life time with medium write load (though still better than conventional hard drives). This is quite good already.

So how can you evaluate whenever your workload fits well for such SSDs?  Measure the amount of writes for a day or week (many monitoring applications can show you that) to see how much you are really writing and use it to see how much life time you can get with such a write ratio. You can also use this number to see how much more of a load (compared to conventional hard drives) you can put on the system – remember chances are you will be endurance and life time limited more than performance limited. If your writes are really high, measured in TBs a day or higher, you will be better off with more expensive enterprise storage drives that have a much higher endurance rating.

Specifically if you have Percona Monitoring Plugins already installed you can use the rrdtool to get the data for long period of time by running something like rrdtool fetch rrddata.rrd AVERAGE –start -1w | awk ‘{ SUM += $2 } END {print SUM }’

The second thing you should do is actually measure reality. Because of the Amplification of writes inside SSDs depend on workload, the actual amount of writes the system handles might be more or less. Many vendors provide some SMART attributes that provide you with insight into how the drive is doing. For example Samsung exposes ID 177 as Wear Level Count where Crucial/Micron expose S.M.A.R.T ID 173 as “Average Block Erase Count”. These are very helpful both for checking how drives are doing as well as planning purposes.

Lets finally look at the database – anything we can do to reduce amount of writes our database is doing supporting our application?
Yes! There are both things you can do at application level side as well as on database side.

Optimizing Application to reduce writes to SSD
There are many things you can do from application side to reduce writes to SSD, here are some examples:

Data Clustering – use Innodb’s clustering by primary key to your advantage so when you do modification as small number of pages as possible is modified.
Normalization – Normalizing data often means you just change one row in one place instead of many rows in many places as it happens with de-normalized environments. SSDs have much faster random reads so more normalized data structures may perform well.
Buffering – Buffer updates in the application or some in memory store where possible to write back to the database periodically. Redis, memcached are common solution for this.
Data Diet – Put your data on diet to see how you can use efficient data types, only needed indexes etc – less data to reflect same amount of information means less updates.

Optimizing Database to reduce writes to SSD
There are a lot of things you can do on the database size as well:

Mix HDD and SSD – HDD when paired with RAID with BBU (Write Cache) can perform writes very well if they go to the small area on disk or sequential. You can use it to put some of the data to hard drives instead of SSD to reduce amount of writes to SSD.
Store DoubleWrite Buffer on HDD – Double Write Buffer is a killer for SSDs essentially doubling amount of writes for data changed in your tablespaces. At the same time this is a very small area (few MB in size) which is constantly updated as such it can very well be on the HDD as long as you have the BBU Cache. In Percona Server you can keep Double Write Buffer in the Separate File while in any MySQL version you can simply put your system tablespace on HDD. For most workloads it will be small enough to be cached.
Store Transactional Logs on HDD – Innodb Transactional Logs is another candidate for HDD – they are written sequentially so HDDs+ RAID BBU cache may well keep up very well with your load.
Store Binary Logs on HDD – Binary Logs… another object which is read and written sequentially as such it does not need to be on SSD.
Temporary Space in Memory – Temporary space (tmpdir) can get a lot of writes and you do not want them to be hitting your SSD. The best is to use tmpfs file system which allocates it in memory (or swap file if there is not enough memory) unless you have extremely large temporary space needs.
Large Buffer Pool Size –  The larger your bufferpool hopefully more dirty pages you can keep and as such more writes will come to already dirty pages effectively eliminating IO to SSD. So even if SSDs means you often can do with smaller memory amounts to achieve performance you need you might still want memory to increase your SSD life time.
Large Innodb Log Files – The larger are your Innodb Log files the more you can delay writes and as such the more logical writes can be merged into the single physical write. SSDs are fast so you often can get crash recovery speed you need with a lot larger log files than you had on HDDs. The combined log files size of 10GB may well work for your application.
Innodb Compression – Innodb Compression reduces amount of data written to tablespace hence can increase life time. Note in some MySQL versions compression increases amount of space written to Innodb Log Files, in this case it is good combined with Innodb Log Files being on Hard drive.
Alternative Storage Engines – TokuDB seems to be offering higher compression ratio and more sequential writes so it might be better choice for your application from SSD life standpoint. It would be interesting to see some real life comparisons in how much writes both engine generate for same workload.

Depending on your application, inexpensive SSDs might be good fit for your workload, though enterprise-grade flash storage might be a good choice for very write-intensive applications even if it costs a lot more due to a much higher endurance capacity. Make sure you understand how many writes your application does and keep thinking about how to optimize your system to do less writes with SSDs – as reducing writes is a key to a longer SSD lifetime and as such lowers costs.

The post Inexpensive SSDs for Database Workloads appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com