Aug
07
2019
--

Google and Twitter are using AMD’s new EPYC Rome processors in their data centers

Google and Twitter are among the companies now using EPYC Rome processors, AMD announced today during a launch event for the 7nm chips. The release of EPYC Rome marks a major step in AMD’s processor war with Intel, which said last month that its own 7nm chips won’t be available until 2021 (though it is expected to release its 10nm Ice Lake chips this year).

Intel is still the biggest data center processor maker by far, however, and also counts Google and Twitter among its customers. But AMD’s latest releases and its strategy of undercutting competitors with lower pricing have quickly transformed it into a formidable rival.

Google has used other AMD chips before, including in its “Millionth Server,” built in 2008, and says it is now the first company to use second-generation EPYC chips in its data centers. Later this year, Google will also make available to Google Cloud customers virtual machines that run on the chips.

In a press statement, Bart Sano, Google vice president of engineering, said “AMD 2nd Gen Epyc processors will help us continue to do what we do best in our datacenters: innovate. Its scalable compute, memory and I/O performance will expand our ability to drive innovation forward in our infrastructure and will give Google Cloud customers the flexibility to choose the best VM for their workloads.”

Twitter plans to begin using EPYC Rome in its data center infrastructure later this year. Its senior director of engineering, Jennifer Fraser, said the chips will reduce the energy consumption of its data centers. “Using the AMD EPYC 7702 processor, we can scale out our compute clusters with more cores in less space using less power, which translates to 25% lower [total cost of ownership] for Twitter.”

In a comparison test between 2-socket Intel Xeon 6242 and AMD EPYC 7702P processors, AMD claimed that its chips were able to reduce total cost of ownership by up to 50% across “numerous workloads.” EPYC Rome’s flagship, the 64-core, 128-thread 7742 chip, with a 2.25GHz base frequency, 225W default TDP, and 256MB of total cache, starts at $6,950.

Jun
15
2016
--

Troubleshooting hardware resource usage webinar: Q & A


In this blog, I provide answers to the Q & A for the Troubleshooting hardware resource usage webinar.

First, I want to thank everybody who attended the May 26 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: How did you find the memory IO LEAK?

A: Do you mean the replication bug I was talking about in the webinar? I wrote about this bug here. See also comments in the bug report itself.

Q: Do you have common formulas you use to tune MySQL?

A: There are best practices: relate thread concurrency to the number of CPU cores you have, set the InnoDB buffer pool size large enough to contain your entire working dataset (which is not always possible), and do not set the Query Cache size larger than 512MB (or better yet, turn it off) to avoid issues with the global lock that is set when the cache needs to be de-fragmented. I prefer not to call them “formulas,” because all options need to be adjusted to match the workload. If this weren’t the case, MySQL Server would have an automatic configuration. There is also a separate webinar on configuration (Troubleshooting configuration issues) where I discuss these practices.
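
For illustration only, here is roughly how those practices might translate into settings. The numbers are hypothetical (a dedicated MySQL 5.7 host with 64GB of RAM, where these variables are dynamic; on older versions put them in my.cnf and restart) and must be adapted to your workload:

SET GLOBAL innodb_buffer_pool_size = 48 * 1024 * 1024 * 1024; -- ~75% of RAM, sized to the working set
SET GLOBAL innodb_thread_concurrency = 16;                    -- related to the number of CPU cores
SET GLOBAL query_cache_size = 0;                              -- 512MB at most; 0 avoids the global lock
SET GLOBAL query_cache_type = 0;                              -- better yet, turn the query cache off entirely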

Q: Slide 11: is this real time? Can we get this info for a job that has already finished?

A: Yes, this is real time. No, it is not possible to get this info for a thread that does not exist.

Q: Slide 11: what do negative numbers mean?

A: Numbers are taken from the CURRENT_NUMBER_OF_BYTES_USED field of the memory_summary_by_thread_by_event_name table in Performance Schema. These values, in turn, are calculated as (memory allocated by thread) – (memory freed by thread). Negative numbers here mean either a memory leak or an incorrect calculation of the memory used by the thread. I reported this behavior in the MySQL Bugs database. Please subscribe to the bug report and wait to see how the InnoDB and Performance Schema developers answer.
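
A minimal sketch of how you could spot such negative values yourself (the LIMIT is arbitrary):

SELECT thread_id, event_name, current_number_of_bytes_used
FROM performance_schema.memory_summary_by_thread_by_event_name
WHERE current_number_of_bytes_used < 0
ORDER BY current_number_of_bytes_used
LIMIT 10;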

Q: Are TokuDB memory usage stats recorded in the sys.memory_by_thread_by_current_bytes table also? Do we have to set something to enable this collection? I ran the query, but it shows 0 for everything.

A: TokuDB currently does not support Performance Schema, thus its memory statistics are not instrumented. See the user manual on how memory instrumentation works.
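
For engines that are instrumented, this is roughly how you can check and enable memory collection; note this will not make TokuDB numbers appear, since TokuDB exposes no instrumentation points at all:

-- Check which memory instruments are currently enabled:
SELECT name, enabled
FROM performance_schema.setup_instruments
WHERE name LIKE 'memory/%';

-- Enable them at runtime; only memory allocated after this point is counted.
-- (The memory/performance_schema/% instruments are built in and always enabled.)
UPDATE performance_schema.setup_instruments
SET enabled = 'YES'
WHERE name LIKE 'memory/%'
  AND name NOT LIKE 'memory/performance_schema/%';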

Q: With disk what we will check for disk I/O?

A: I don’t quite understand the question. Are you asking which disk’s I/O statistics we should check? Check the disk that holds the datadir, plus any other locations where MySQL stores data and log files (if you set custom locations).

Q: How can we make the CPU process multiple client requests in parallel? By putting multiple requests in memory in a parallel way? By defining a transaction? Or are there any query hints?

A: We cannot directly put the CPU in parallel, but we can affect it indirectly by tuning InnoDB thread-related options (innodb_thread_concurrency, innodb_read_io_threads, innodb_write_io_threads) and by using the thread pool.
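
As a hypothetical illustration (the values here are arbitrary and must be tuned to the hardware and workload):

-- Dynamic; 0 means unlimited, which is the default:
SET GLOBAL innodb_thread_concurrency = 16;

-- The I/O thread counts are not dynamic; set them in my.cnf and restart:
-- innodb_read_io_threads = 8
-- innodb_write_io_threads = 8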

Q: Is there any information in the Performance Schema that is not found in the sys schema?

A: Yes. For example, sys schema does not have a view for statistics about prepared statements, while Performance Schema does, because sys schema takes its statement statistics from digest tables (which make no sense for prepared statements).
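
For example, in MySQL 5.7 you can query prepared statement statistics directly from Performance Schema (timers are in picoseconds, hence the division):

SELECT statement_name, sql_text, count_execute,
       sum_timer_execute / 1000000000000 AS total_exec_seconds
FROM performance_schema.prepared_statements_instances
ORDER BY sum_timer_execute DESC
LIMIT 10;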

Q: What is your favorite tool to investigate a memory issue with a task/job that has already finished?

A: I don’t know that there is such a tool suitable for use in production. In a test environment, you can use valgrind or similar tools. You can also make core dumps of the mysqld process and investigate them after the issue is gone.

Jan
15
2015
--

Hyper-threading – how does it double CPU throughput?

The other day a customer asked me to do capacity planning for their web server farm. I was looking at the CPU graph for one of the web servers that had Hyper-threading switched ON and thought to myself: “This must be quite a misleading graph – it shows 30% CPU usage. It can’t really be that this server can handle 3 times more work?”

Or can it?

I decided to do what we usually do in such cases – I decided to test it and find out the truth. Turns out, there’s more to it than meets the eye.

How Intel Hyper-Threading works

Before we get to my benchmark results, let’s talk a little bit about hyper-threading. According to Intel, Intel® Hyper-Threading Technology (Intel® HT Technology) uses processor resources more efficiently, enabling multiple threads to run on each core. As a performance feature, Intel HT Technology also increases processor throughput, improving overall performance on threaded software.

Sounds almost like magic, but in reality (and correct me if I’m wrong), what HT essentially does is this: by presenting one CPU core as two CPUs (threads, rather), it allows you to offload task scheduling from the kernel to the CPU.

So for example if you just had one physical CPU core and two tasks with the same priority running in parallel, the kernel would have to constantly switch the context so that both tasks get a fair amount of CPU time. If, however, you have the CPU presented to you as two CPUs, the kernel can give each task a CPU and take a vacation.

On the hardware level, it will still be one CPU doing the same amount of work, but there may be some optimization to how that work is executed.

My hypothesis

Here’s the problem that was driving me nuts: if HT does NOT actually give you twice the power, and yet the system reports statistics for each CPU thread separately, then at 50% CPU utilization (as per mpstat on Linux), the CPU should be maxed out.

So if I tried to model the scalability of that web server – a 12-core system with HT enabled (represented as 24 CPUs on a system), assuming perfect linear scalability, here’s how it should look:

Throughput
(requests per second)
  |
9 |         ,+++++++++++++++
  |        +
  |       +
6 |      +
  |     +
  |    +
3 |   +
  |  +
  | +
0 '-----+----+----+----+----
    1   6   12   18   24

In the example above, a single CPU thread could process a request in 1.2s, which is why you see it max out at 9-10 requests/sec (12/1.2).
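
To make the ceiling explicit, assuming perfect linear scaling and no extra capacity from HT:

max throughput = physical cores / CPU time per request = 12 / 1.2s ≈ 10 requests/sec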

From the user perspective, this limitation would hit VERY unexpectedly, as one would expect 50% utilization to be… well, exactly that – 50% utilization.

In fact, the CPU utilization graph would look even more frustrating. For example if I were increasing the number of parallel requests linearly from 1 to 24, here’s how that relationship should look:

CPU utilization:
100% |         ++++++++++++++
     |         .
     |         .
     |         .
     |         .
 50% |         .
     |       +
     |     +
     |   +
     | +
  0% '----+----+----+----+----
    0     6   12   18   24

Hence CPU utilization would skyrocket right at 12 cores from 50% to 100%, because in fact the system CPU would be 100% utilized at this point.

What happens in reality

Naturally, I decided to run a benchmark and see if my assumptions were correct. The benchmark was pretty basic – I wrote a CPU-intensive PHP script that took 1.2s to execute on the CPU I was testing, and hammered it over HTTP (Apache) with ab at increasing concurrency. Here’s the result:

[Graph: requests per second vs. number of concurrent requests]

Raw data can be found here.

If this does not blow your mind, please go over the facts again and then look back at the graph.

Still not sure why I find this interesting? Let me explain. If you look carefully, initially – at concurrency of 1 through 8 – it scales perfectly. So if you only had data for threads 1-8 (and you knew the processes don’t incur coherency delays due to shared data structures), you’d probably predict that it would scale linearly until it reached ~10 requests/sec at 12 cores, at which point adding more parallel requests would not have any benefit, as the CPU would be saturated.

What happens in reality, though, is that past 8 parallel threads (hence, past 33% virtual CPU utilization), execution time starts to increase and maximum performance is only achieved at 24-32 concurrent requests. It looks like at the 33% mark there’s some kind of “throttling” happening.

In other words, to avoid a sharp performance hit past 50% CPU utilization, at 33% virtual thread utilization (i.e. 66% actual CPU utilization), the system gives the illusion of a performance limit – execution slows down so that the system only reaches the saturation point at 24 threads (visually, at 100% CPU utilization).

Naturally then the question is – does it still make sense to run hyper-threading on a multi-core system? I see at least two drawbacks:

1. You don’t see the real picture of how utilized your system really is – if the CPU graph shows 30% utilization, your system may well be 60% utilized already.
2. Past 60% physical utilization, execution speed of your requests will be throttled intentionally in order to provide higher system throughput.

So if you are optimizing for higher throughput – that may be fine. But if you are optimizing for response time, then you may consider running with HT turned off.

Did I miss something?

The post Hyper-threading – how does it double CPU throughput? appeared first on MySQL Performance Blog.

Jul
19
2009
--

Computer hardware poster

The guys over at Hack a day found this computer hardware poster of pretty much all the connectors in use by the industry today. Very useful, especially for someone like me who really hasn’t been a hardware geek for quite some time.

You can also buy the print as a full-sized poster.
