Jan 15, 2015

Hyper-threading – how does it double CPU throughput?

The other day a customer asked me to do capacity planning for their web server farm. I was looking at the CPU graph for one of the web servers that had Hyper-threading switched ON and thought to myself: “This must be quite a misleading graph – it shows 30% CPU usage. It can’t really be that this server can handle 3 times more work?”

Or can it?

I decided to do what we usually do in such cases: test it and find out the truth. It turns out there’s more to it than meets the eye.

How Intel Hyper-Threading works

Before we get to my benchmark results, let’s talk a little bit about hyper-threading. According to Intel, Intel® Hyper-Threading Technology (Intel® HT Technology) uses processor resources more efficiently, enabling multiple threads to run on each core. As a performance feature, Intel HT Technology also increases processor throughput, improving overall performance on threaded software.

Sounds almost like magic, but in reality (and correct me if I’m wrong), what HT does essentially is – by presenting one CPU core as two CPUs (threads rather), it allows you to offload task scheduling from kernel to CPU.

So for example if you just had one physical CPU core and two tasks with the same priority running in parallel, the kernel would have to constantly switch the context so that both tasks get a fair amount of CPU time. If, however, you have the CPU presented to you as two CPUs, the kernel can give each task a CPU and take a vacation.

On the hardware level, it will still be one CPU doing the same amount of work, but there may be some optimization in how that work is executed.

My hypothesis

Here’s the problem that was driving me nuts: if HT does NOT actually give you twice the power and yet the system reports statistics for each CPU thread separately, then at 50% CPU utilization (as per mpstat on Linux), the CPU should already be maxed out.

So if I tried to model the scalability of that web server – a 12-core system with HT enabled (presented to the system as 24 CPUs), assuming perfect linear scalability, here’s how it should look:

Throughput
(requests per second)
  |
9 |         ,+++++++++++++++
  |        +
  |       +
6 |      +
  |     +
  |    +
3 |   +
  |  +
  | +
0 '-----+----+----+----+----
    1   6   12   18   24

In the example above, a single CPU thread could process a request in 1.2s, which is why you see it max out at 9-10 requests/sec (12/1.2 = 10).
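The model behind the chart above fits in a few lines. This is only a sketch of the hypothesis, not a measurement: it assumes 12 physical cores, a 1.2s service time per request, perfect linear scaling, and that HT adds no capacity at all. The function name is made up for illustration.

```python
def predicted_throughput(concurrency, physical_cores=12, service_time_s=1.2):
    """Requests/sec under the hypothesis that HT adds no extra capacity."""
    # Only as many requests as there are physical cores make progress at once.
    busy_cores = min(concurrency, physical_cores)
    return busy_cores / service_time_s


if __name__ == "__main__":
    for concurrency in (1, 6, 12, 18, 24):
        print(concurrency, round(predicted_throughput(concurrency), 2))
```

Under these assumptions the curve rises linearly up to 12 parallel requests and stays flat afterwards, which is the shape drawn in the chart.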

From the user perspective, this limitation would hit VERY unexpectedly, as one would expect 50% utilization to be… well, exactly that – 50% utilization.

In fact, the CPU utilization graph would look even more frustrating. For example if I were increasing the number of parallel requests linearly from 1 to 24, here’s how that relationship should look:

CPU utilization:
100% |         ++++++++++++++
     |         .
     |         .
     |         .
     |         .
 50% |         .
     |       +
     |     +
     |   +
     | +
  0% '----+----+----+----+----
    0     6   12   18   24

Hence CPU utilization would skyrocket right at 12 cores from 50% to 100%, because in fact the system CPU would be 100% utilized at this point.
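To make the gap between the two views explicit, here is a minimal sketch of the same hypothesis in code. It assumes 12 physical cores with 2 hardware threads each, one busy logical CPU per request up to 12 requests, and a fully saturated system beyond that, mirroring the chart above. The function name and return shape are illustrative only.

```python
def utilization(concurrency, physical_cores=12, threads_per_core=2):
    """Return (reported, physical) utilization as fractions of 1.0.

    'reported' is what a per-logical-CPU tool would show under the
    hypothesis; 'physical' is how busy the real cores would be.
    """
    logical_cpus = physical_cores * threads_per_core
    if concurrency <= physical_cores:
        # Each request keeps one logical CPU busy.
        reported = concurrency / logical_cpus
    else:
        # Hypothesis: past the physical core count everything looks saturated.
        reported = 1.0
    physical = min(concurrency, physical_cores) / physical_cores
    return reported, physical


if __name__ == "__main__":
    for concurrency in (6, 12, 13, 24):
        print(concurrency, utilization(concurrency))
```

With 6 parallel requests this gives 25% reported against 50% physical utilization, and with 12 it gives 50% reported against 100% physical, which is exactly the mismatch the hypothesis predicts.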

What happens in reality

Naturally, I decided to run a benchmark and see if my assumptions were correct. The benchmark was pretty basic – I wrote a CPU-intensive PHP script that took 1.2s to execute on the CPU I was testing, and hammered it over HTTP (Apache) with ab at increasing concurrency. Here’s the result:

[Graph: requests per second at increasing concurrency]

Raw data can be found here.

If this does not blow your mind, please go over the facts again and then look back at the graph.

Still not sure why I find this interesting? Let me explain. If you look carefully, initially – at concurrency of 1 through 8 – it scales perfectly. So if you only had data for threads 1-8 (and you knew processes don’t incur coherency delays due to shared data structures), you’d probably predict that it would scale linearly until it reached ~10 requests/sec at 12 cores, at which point adding more parallel requests would not bring any benefit as the CPU would be saturated.

What happens in reality, though, is that past 8 parallel threads (hence, past 33% virtual CPU utilization), execution time starts to increase and maximum performance is only achieved at 24-32 concurrent requests. It looks like at the 33% mark there’s some kind of “throttling” happening.

In other words, to avoid a sharp performance hit past 50% CPU utilization, at 33% virtual thread utilization (i.e. 66% actual CPU utilization), the system gives the illusion of a performance limit – execution slows down so that the system only reaches the saturation point at 24 threads (visually, at 100% CPU utilization).

Naturally then the question is – does it still make sense to run hyper-threading on a multi-core system? I see at least two drawbacks:

1. You don’t see the real picture of how utilized your system really is – if the CPU graph shows 30% utilization, your system may well be 60% utilized already.
2. Past 60% physical utilization, execution speed of your requests will be throttled intentionally in order to provide higher system throughput.

So if you are optimizing for higher throughput – that may be fine. But if you are optimizing for response time, then you may consider running with HT turned off.

Did I miss something?

The post Hyper-threading – how does it double CPU throughput? appeared first on MySQL Performance Blog.

Nov 14, 2014

Optimizing MySQL for Zabbix

This blog post was inspired by my visit to the annual Zabbix Conference in Riga, Latvia this year, where I gave a couple of talks on MySQL and beyond.

It was a two-day single-track event with some 200 participants, a number of interesting talks on Zabbix (and related technologies) and really well-organized evening activities. I was amazed at how well organized the event was and hope to be invited to speak there next year as well. ;) (Just in case you’re not sure what Zabbix is, it is an enterprise-class open source distributed monitoring solution for networks and applications.)

I must secretly confess, it was also the first conference where I honestly enjoyed being on stage and connecting with the audience – I was even looking forward to it rather than being scared as hell (which is what typically happens to me)! I guess it was all about the positive atmosphere, so big thanks to all the speakers and attendees for that. It meant a lot to me.

If I had to mention one negative vibe I heard from attendees, it would be that there was not enough deeply technical content. However, I think this is slightly biased, because the people I talked to most were the ones who enjoyed my technical talks, and so they were craving more.

And now, without further ado, let me get to the essence of this blog post.

Zabbix and MySQL

The very first thing I did when I arrived at the conference was to approach people who I knew use Zabbix on a large scale and try to figure out what the biggest challenges they face were. Apparently, in all of the cases, it was MySQL and more specifically, MySQL disk IO.

With that in mind, I would like to suggest a few optimizations that will help your MySQL get the best out of your disks (and consequently will help your Zabbix get the best out of MySQL) and the available hardware resources in general.

SSD is a game changer

“Will MySQL run better on SSDs?” I’ve been hearing this question over and over again, both publicly and privately.

I can tell you without a shadow of doubt: if IO is currently your bottleneck – either because some queries take a long time to run and you see diskstats reporting 100-250 reads per second until the query completes (latency), or because you are overloading the disks with requests and wait time suffers (throughput) – SSDs will definitely help, and not just a little but a lot!

Consider this: the fastest spinning disk (15k rpm) can do 250 random IO operations per second tops (at this point it is limited by physics), and a single query will only ever read from one disk even if you have a RAID10 made of 16 disks, so if you need to read 15,000 data points to display a graph, reading those data points from disk will take 60s.

An enterprise-class SSD, on the other hand, can do 15,000 or even more 16k random reads per second with a single thread (16k is the size of an InnoDB page). And as you add concurrent requests, it only gets better! So that means that the query in the previous example would take 1s instead of 60s, which is a significant difference. Plus you can run more requests on the same SSD at the same time and the total number of IO operations will only increase, while a single spinning disk would have to share the available 250 IO operations between multiple requests.
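The arithmetic behind the 60s versus 1s comparison is simple enough to sketch. The snippet below assumes, as the example in the text does, that every data point costs one uncached random read issued by a single thread; the function name is just for illustration.

```python
def read_time_s(random_reads, iops):
    """Seconds needed to perform `random_reads` reads at `iops` reads/sec."""
    return random_reads / iops


if __name__ == "__main__":
    data_points = 15_000
    print("15k rpm disk:", read_time_s(data_points, 250), "s")
    print("SSD:", read_time_s(data_points, 15_000), "s")
```

Plugging in your own device numbers (for example from a disk benchmark) gives a rough lower bound on how long a cold graph query can take.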

The only area where SSDs don’t beat spinning disks (yet) is sequential operation, especially single-threaded sequential writes. If that is your typical workload (which might be the case if you’re mostly collecting data and rarely if ever reading it), then you may want to consider other strategies.

MySQL configuration

Besides improving your disk IO subsystem, there are ways to reduce the pressure on IO and I’m going to cover a few my.cnf variables that will help you with that (and with other things such as internal contention).

Note, most of the tunables are common for any typical high-performance MySQL setup, though some are explicitly suited for Zabbix because you can relax a few parameters for great effect at the price of, in the worst case, losing up to 1s worth of collected data which, from discussions during the conference, didn’t seem like a big deal to anyone.

– innodb_buffer_pool_size – if you have a dedicated MySQL server, set it as high as you can (ceiling would be 75% of total available memory). Otherwise, you should balance it with other processes on the server, but if it’s only the Zabbix server, I would still leave it very high, close to 75% of total RAM.

– innodb_buffer_pool_instances – on MySQL 5.5, set it to 4, on MySQL 5.6 – 8 or even 16.

– innodb_flush_log_at_trx_commit = 0 – this is where you compromise durability for significantly improved write throughput, especially if you don’t own a disk subsystem with non-volatile cache. Basically the loss you may incur is up to 1s worth of writes during a MySQL or server crash. A lot of websites actually run with that (a lot of websites still run on MyISAM!!!), so I’m quite sure it’s not an issue for a Zabbix setup.

– innodb_flush_method = O_DIRECT – if you are running Linux, just leave it set to that.

– innodb_log_file_size – you want these transaction logs (there are two of them by default) to hold 1 to 2 hours worth of writes. To determine that, you can probably have a look at the Zabbix graphs for your MySQL server, but you can also run the following from the mysql command line:

mysql> pager grep seq; show engine innodb status\G select sleep(3600); show engine innodb status\G
PAGER set to 'grep seq'
Log sequence number 8373513970951
...
Log sequence number 8373683996767

The difference between the two numbers is how many bytes InnoDB has written during the last hour. So on the server above, I would set innodb_log_file_size=128M and would end up with 256M of log file space, allowing me to store more than 1h worth of writes in transaction logs. (See this on changing the log file size if you run MySQL 5.5 or earlier.)
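As a rough check of that sizing, the subtraction can be scripted. This is a minimal sketch using the two log sequence numbers from the output above; it assumes they were sampled exactly one hour apart and that there are two log files, and the helper names are hypothetical.

```python
MIB = 1024 * 1024


def log_bytes_written(lsn_start, lsn_end):
    """Bytes written to the InnoDB redo log between two LSN samples."""
    return lsn_end - lsn_start


def hours_covered(log_file_size_mib, log_files, bytes_per_hour):
    """How many hours of writes fit into the combined redo log space."""
    return (log_file_size_mib * MIB * log_files) / bytes_per_hour


if __name__ == "__main__":
    written = log_bytes_written(8373513970951, 8373683996767)
    print("bytes written per hour:", written)  # 170025816
    print("MiB written per hour:", round(written / MIB, 1))
    print("hours covered by 2 x 128M:", round(hours_covered(128, 2, written), 2))
```

On a server that writes more per hour, the same calculation shows quickly whether the combined log space still covers the 1 to 2 hours suggested above.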

– innodb_read_io_threads, innodb_write_io_threads – don’t overthink these, they are not as important as they may seem, especially if you are using Async IO (you can check that by running "show global variables like 'innodb_use_native_aio'" in the mysql CLI). On MySQL 5.5 and 5.6 you generally want to be using Async IO (AIO), so if you are not, check the MySQL log to understand why. That said, if you are not using AIO and you are not going to, just set these values to 8 and leave them there.

– innodb_old_blocks_time = 1000 – this will help you prevent buffer pool pollution due to occasional scans. This is now default in MySQL 5.6 (On 5.5, it needs to be set explicitly).

– innodb_io_capacity – set this to as many write IOPS as your disk subsystem can handle. For SSDs this should be at least a few thousand (2000 could be a good start), while for rotating disks somewhat lower values – 500-800, depending on the number of data-bearing disks – will do. It is best to benchmark the disks or do the math for actual numbers, but the default of 200 is definitely too low for most systems nowadays.

– sync_binlog=0 – this is the default setting, but just in case it’s above 0, turn it off, unless you run something else besides Zabbix. The price of not synchronising binary logs is that in case of a master crash, replicas can get out of sync. But if you are constantly hitting an IO bottleneck due to binary log synchronisation just because you want to avoid the hassle of resynchronising the replica once every five years when the master crashes, you should reconsider this option.

– query_cache_size=0, query_cache_type=0 – that will disable the query cache. Most of the time you don’t want the query cache. And if it is not fully disabled by these settings, queries (especially small ones) will likely suffer due to query cache mutex contention.

– sort_buffer_size, join_buffer_size, read_rnd_buffer_size – if you ever configured these variables, cancel those changes (just remove them or comment them out). I find these are the top three mistuned variables on most customer servers, while in many cases it’s best if they are not touched at all. Just leave them at their defaults and you’re set.

– tmpdir – sometimes it’s a good idea to point tmpdir to /dev/shm so that on-disk temporary tables are actually written to memory, but there’s one important caveat starting with MySQL 5.5: if you do that, it disables AIO across the board, because tmpfs does not support AIO. So I would monitor the activity on the current tmpdir (/tmp usually) and only switch it to /dev/shm if I see it being an issue.

MySQL Partitioning

I know that Zabbix now supports partitions for the purpose of easier data pruning; however, I think there are some extra benefits you could get from partitions. Well, actually subpartitions, if you are already using partitions by date.

The KPI for Zabbix that you hear over and over again is the “new values per second” number that you can find in the status of Zabbix. Basically, the higher the value (given you have enough values to monitor), the better the throughput of your Zabbix. And this is where a lot of people are hitting the Zabbix limits – MySQL can’t insert enough new values per second.

Besides the optimizations I have already mentioned above (they should greatly increase your write throughput!), I would encourage you to try out partitions (if you’re not using partitions already) or subpartitions (if you are) BY HASH as we found that partitioning in some cases can increase the throughput of InnoDB.

I did not test it with Zabbix specifically and as it’s not supported by Zabbix out of the box, you would have to hack it to make it work, but if you’ve done all the changes above and you still can’t get enough new values per second (AND it is not the hardware that is limiting you), try partitioning or subpartitioning the key tables by hash.

If this sounds interesting but you’re not sure where to start, feel free to contact us and we’ll work with you to make it work.

On MySQL High Availability

There are options to make MySQL highly available, even though many believe that’s not the case. We’ve written a lot about it on our blog, so I will not paraphrase or repeat it here; instead I would like to point you to a few valuable resources on that topic:

 

Percona Server, Percona XtraDB Cluster, Percona Toolkit – it’s all FREE!

I’m not really sure why, but many people I talked to at the conference thought that all of the Percona software either needs to be bought or that it has some enterprise features that are not available unless you buy a license.

The truth is that neither is true. All Percona software is completely free of charge. Feel free to download it from our website or through repositories and use it as you please.

See you at the Zabbix conference next year!

The post Optimizing MySQL for Zabbix appeared first on MySQL Performance Blog.