Benchmarking IBM eXFlash™ DIMM with sysbench fileio

Diablo Technologies engaged Percona to benchmark IBM eXFlash™ DIMMs in various aspects. An eXFlash™ DIMM itself is quite an interesting piece of technology. In a nutshell, it’s flash storage, which you can put in the memory DIMM slots. Enabled by Diablo’s Memory Channel Storage™ technology, this practically means low latency and some unique performance characteristics.

These devices are special, because their full performance potential is unlocked by using multiple devices to leverage the parallelism of the memory subsystem. In a modern CPU there is more than one memory controller (4 is typical nowadays) and spreading eXFlash™ DIMMs across them will provide maximum performance. There are quite some details about the device that are omitted in this post, they can be found in this redbook.

Diablo technologies also provided us a 785 GB FusionIO ioDrive 2 card (MLC). We will compare the performance of it to 4 and 8 of the eXFlash DIMMs. The card itself is a good choice for comparison because of it’s popularity, and because as you will see, it provides a good baseline.


CPU: 2x Intel Xeon E5-2690 (hyper threading enabled)
FusionIO driver version: 3.2.6 build 1212
Operating system: CentOS 6.5
Kernel version: 2.6.32-431.el6.x86_64

Sysbench command used:

sysbench --test=fileio

The variables above mean the following:
Size: the devices always had 50% free space.
Test_mode: rndwr, rndrw, rndrd for the different types of workloads (read, write, mixed)
Rtime: 4 hours for asynchronsous tests, 30 minutes for synchronous ones.
Mode: sync or async dependening on which test are we talking about.

From the tdimm devices, we created RAID0 arrays with linux md (this is the recommended way to use them).
In case of 4 devices:

mdadm --create /dev/md0 --level=0 --chunk=4 --raid-devices=4 /dev/td{a,c,e,g}

In case of 4 devices:

mdadm --create /dec/md0 --level=0 --chunk=4 --raid-devices=8 /dev/td{a,b,c,d,e,f,g,h}

The filesystem used for the tests was ext4, it was created on the whole device (md0 or fioa, no parititions), which the block size of 4k.

We tested these with read-only, write-only and mixed workloads with sysbench fileio. The read/write ratio with the mixed test was 1.5. Apart from the workload itself, we varied the IO mode as well (we did tests with synchronous IO as well as asynchronous IO). We did the following tests.

  • Asynchronous read workload (16 threads, 16k block size, 4 hour long)
  • Asynchronous write workload (16 threads, 16k block size, 4 hour long)
  • Asynchronous mixed workload (16 threads, 16k block size, 4 hour long)
  • Synchronous read workload (1..1024 threads, 16k block size, 0.5 hour long)
  • Synchronous write workload (1..1024 threads, 16k block size, 0.5 hour long)
  • Synchronous mixed workload (1..1024 threads, 16k block size, 0.5 hour long)

The asynchronous read tests lasted 4 hours, because we tested the consistency of the device’s performance with those over time.

Asynchronous IO



The throughput is fairly consistent. With 8 dimms, the peak throughput goes above 3 GB/s in case of 8 eXFlash DIMMs.


At the beginning of the testing we expected that latency would drop significantly in case of eXFlash DIMMs. We based this assumption on the fact that the memory is even “closer” to the processor than the PCI-E bus. In case of 8 eXFlash DIMMs, it’s visible that we are able to utilize the memory channels completely, which gives a boost in throughput as well as it gives lower latency.



The consistent asynchronous write  throughput is around 1GB/sec in case of eXFlash DIMM_8. Each case at the beginning has a spike up in throughput: that’s the effect of caching. In eXFlash DIMM itself, the data can arrive much faster on the bus than it could be written to the flash devices.


In case of write latency, the PCI-E based device is Somewhat better, and also the performance is more consistent compared to the case where we were using 4 eXFlash DIMMs. When using 8, similarly to the case where only reads were tested in isolation, both the throughput increased and the latency drops.



In the mixed workload (which is quite relevant to some databases), the eXFlash DIMM shines. The throughput in case of 8 DIMMs can go as high as 1.1 GB/s for reads, while it’s doing roughly 750 MB/s writes.


In case of mixed workload, the latency is the eXFlash DIMMs is lower and far more consistent (the more DIMMs we use, the more consistent it will be).

Synchronous IO



In case of synchronous IO, we reached the peak throughput at 32 threads used in case of the PCI-E device, but needed 64 and 128 threads for the eXFlash DIMM_4 and eXFlash DIMM_8 configurations respectively (in both cases, the throughput at 32 threads was higher than the PCI-E device’s).


The read latency degrades much more gradually in case of the eXFlash DIMMs, which makes it very suitable for workloads like linkbench, where the buffer pool misses are more frequent than writes (many large web application have this kind of read mostly and the database doesn’t fit into memory profile).In order to be able to see the latency differences at a lower number of threads, this graph has to be zoomed in a bit. The following graph is the same (synchronous read latency), but the y axis has a maximum value of 10 ms.


The maximum value is hit at 256 threads in case of the PCI-E device, and performance is already severely degraded at 128 threads (which is the optimal degree of parallelism for the eXFlash DIMM_8 configuration). Massively parallel, synchronous reads is a workload where the eXFlash DIMMs are exceptionally good.



Because of caching mechanisms, we don’t need too much writer threads in order to utilize the devices. This makes sense, writes are easy to cache: after a write, the device can tell the operating system that the write is completed, when in reality it’s only completed later. In case of reads, this is different, when the application requests a block to read, the device can’t give the data to the application, but only read it later. Because of mechanisms like this, consistency is very important in case of write tests. The PCI-E device delivers the most consistent, but the lowest throughput. The reason for this is NUMA’s effect to the MCS performance. Because sysbench was allowed to use any CPU core, it may got lucky by hitting the DIMM which is in the CPU’s memory bank it’s currently running on, or may got unlucky, and the QPI was involved in getting the data. One case is not more rare than the other, hence we see the more wide bars on the graphs.


The latency is better in case of eXFlash DIMMs, but not as much as in case of isolated reads. It only seems to be slightly less in case of eXFlash DIMM_4, so let’s do a similar zoom here as we did in case of writes.


It’s visible on the graph that in case of lower concurrency, the eXFlash DIMM_4 and fusionio measurement is roughly the same, but as the workload becomes more parallel, the eXFlash DIMM will degrade more gradually. This more gradual degradation is also present in the eXFlash DIMM_8 case, but it starts to degrade later.



In case of mixed throughput, the PCI-E device similarly shows more consistent, but lower performance. Using 8 DIMMs means a significant performance boost: the 1.1 GB/s reads and 750 MB/s writes like in case of asynchronous IO is doable with 256 threads and above.


Similarly to the previous cases, the response times are slightly better in case of using eXFlash DIMMs. Because the latency is quite high in case of 1024 threads in each case, a similar zoom like previous cases will show more details.


At a low concurrency, the latency is better on the PCI-E device, at least the maximum latency. We can see some fluctuation in case of the eXFlash DIMMs. The reason for that the NUMA, on a single thread, which CPU has the storage the benchmark is accessing matters a lot. As the concurrency gets higher, the response time degrades in case of the flash device, but in the case of eXFlash DIMMs, the maximum remains the same and the response time becomes more consistent. At 32 threads, the mean latency of the devices are very close, with the PCI-E device being slightly better. From there, the eXFlash DIMMs are degrading more gracefully.


Overall, I am impressed, especially with the read latency at massive parallelism. So far this is the fastest MLC device we have tested. I would be curious how would 16 or 32 DIMMs would perform in a machine which has enough sockets and memory channels for that kind of configuration. Like we moved one step closer to the CPU with PCI-E based flash, MCS is one step closer to the CPU compared to the PCI-E based flash.

The post Benchmarking IBM eXFlash™ DIMM with sysbench fileio appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by