May
16
2012
--

Benchmarking single-row insert performance on Amazon EC2

I have been benchmarking insert performance on Amazon EC2 for a customer, and I have some interesting results that I wanted to share. I used iiBench, a nice and effective tool developed by Tokutek. Though the “1 billion row insert challenge” for which this tool was originally built is long over, the tool still serves well for benchmarking purposes.

OK, let’s start off with the configuration details.

Configuration

First of all let me describe the EC2 instance type that I used.

EC2 Configuration

I chose the m2.4xlarge instance, as that’s the instance type with the highest memory available, and memory is what really matters.

High-Memory Quadruple Extra Large Instance
68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

As for the IO configuration I chose 8 x 200G EBS volumes in software RAID 10.

Now let’s come to the MySQL configuration.

MySQL Configuration

I used Percona Server 5.5.22-55 for the tests. Following is the configuration that I used:

## InnoDB options
innodb_buffer_pool_size         = 55G
innodb_log_file_size            = 1G
innodb_log_files_in_group       = 4
innodb_buffer_pool_instances    = 4
innodb_adaptive_flushing        = 1
innodb_adaptive_flushing_method = estimate
innodb_flush_log_at_trx_commit  = 2
innodb_flush_method             = O_DIRECT
innodb_max_dirty_pages_pct      = 50
innodb_io_capacity              = 800
innodb_read_io_threads          = 8
innodb_write_io_threads         = 4
innodb_file_per_table           = 1

## Disabling query cache
query_cache_size                = 0
query_cache_type                = 0

You can see that the buffer pool is sized at 55G and that I am using 4 buffer pool instances to reduce the contention caused by buffer pool mutexes. Another important setting is the “estimate” flushing method, which is available only in Percona Server. The “estimate” method reduces the impact of traditional InnoDB log flushing, which can cause downward spikes in performance. Other than that, I have also disabled the query cache to avoid the contention it causes on a write-heavy workload.
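As a quick sanity check on the settings above, the derived figures can be worked out in a few lines. This is only a sketch: it treats the “G” of the MySQL settings and the “GB” of the instance description as the same unit, which is an assumption, and the variable names are illustrative.

```python
# Values copied from the instance description and the my.cnf fragment above.
# Assumption: "G" (MySQL) and "GB" (EC2 description) are treated as the same unit.
instance_memory_gb = 68.4
buffer_pool_gb = 55
buffer_pool_instances = 4
log_file_size_gb = 1
log_files_in_group = 4

# Share of the instance memory given to the InnoDB buffer pool.
buffer_pool_share = buffer_pool_gb / instance_memory_gb
# Size of each buffer pool instance.
per_instance_gb = buffer_pool_gb / buffer_pool_instances
# Total redo log space = file size x number of files in the group.
total_redo_log_gb = log_file_size_gb * log_files_in_group

print(f"buffer pool share of memory: {buffer_pool_share:.0%}")
print(f"size per buffer pool instance: {per_instance_gb} G")
print(f"total redo log space: {total_redo_log_gb} G")
```

So roughly 80% of the memory goes to the buffer pool, each instance holds 13.75G, and there is 4G of redo log in total.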

OK, so that was all about the configuration of the EC2 instance and MySQL.

Now as far as the benchmark itself is concerned, I made no code changes to iiBench and used the version available here. I did, however, change the table to use range partitioning, with a partitioning scheme defined such that every partition would hold 100 million rows.
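The partitioning scheme described above is regular enough to be generated rather than typed by hand. The following is a minimal sketch, not part of iiBench; the helper name `range_partitions` is hypothetical, and it simply emits one partition per 100 million `transactionid` values plus a final catch-all partition.

```python
def range_partitions(rows_per_partition, bounded_partitions, column="transactionid"):
    """Build a PARTITION BY RANGE clause with equally sized ranges.

    Hypothetical helper for illustration only.
    """
    parts = []
    for i in range(bounded_partitions):
        upper = (i + 1) * rows_per_partition  # exclusive upper bound of partition i
        parts.append(f"PARTITION p{i} VALUES LESS THAN ({upper}) ENGINE = InnoDB")
    # Final partition catches every value above the last explicit bound.
    parts.append(
        f"PARTITION p{bounded_partitions} VALUES LESS THAN MAXVALUE ENGINE = InnoDB"
    )
    return f"PARTITION BY RANGE ({column})\n(" + ",\n ".join(parts) + ")"


clause = range_partitions(100_000_000, 10)
print(clause)
```

The printed clause can be appended to a `CREATE TABLE` statement; with 10 bounded partitions it covers the first 1 billion rows, and `p10` absorbs anything beyond that.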

Table Structure

The table structure of the table with no secondary indexes is as follows:

CREATE TABLE `purchases_noindex` (
  `transactionid` int(11) NOT NULL AUTO_INCREMENT,
  `dateandtime` datetime DEFAULT NULL,
  `cashregisterid` int(11) NOT NULL,
  `customerid` int(11) NOT NULL,
  `productid` int(11) NOT NULL,
  `price` float NOT NULL,
  PRIMARY KEY (`transactionid`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (transactionid)
(PARTITION p0 VALUES LESS THAN (100000000) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (200000000) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (300000000) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (400000000) ENGINE = InnoDB,
 PARTITION p4 VALUES LESS THAN (500000000) ENGINE = InnoDB,
 PARTITION p5 VALUES LESS THAN (600000000) ENGINE = InnoDB,
 PARTITION p6 VALUES LESS THAN (700000000) ENGINE = InnoDB,
 PARTITION p7 VALUES LESS THAN (800000000) ENGINE = InnoDB,
 PARTITION p8 VALUES LESS THAN (900000000) ENGINE = InnoDB,
 PARTITION p9 VALUES LESS THAN (1000000000) ENGINE = InnoDB,
 PARTITION p10 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

While the structure of the table with secondary indexes is as follows:

CREATE TABLE `purchases_index` (
  `transactionid` int(11) NOT NULL AUTO_INCREMENT,
  `dateandtime` datetime DEFAULT NULL,
  `cashregisterid` int(11) NOT NULL,
  `customerid` int(11) NOT NULL,
  `productid` int(11) NOT NULL,
  `price` float NOT NULL,
  PRIMARY KEY (`transactionid`),
  KEY `marketsegment` (`price`,`customerid`),
  KEY `registersegment` (`cashregisterid`,`price`,`customerid`),
  KEY `pdc` (`price`,`dateandtime`,`customerid`)
) ENGINE=InnoDB AUTO_INCREMENT=11073789 DEFAULT CHARSET=latin1
/*!50100 PARTITION BY RANGE (transactionid)
(PARTITION p0 VALUES LESS THAN (100000000) ENGINE = InnoDB,
 PARTITION p1 VALUES LESS THAN (200000000) ENGINE = InnoDB,
 PARTITION p2 VALUES LESS THAN (300000000) ENGINE = InnoDB,
 PARTITION p3 VALUES LESS THAN (400000000) ENGINE = InnoDB,
 PARTITION p4 VALUES LESS THAN (500000000) ENGINE = InnoDB,
 PARTITION p5 VALUES LESS THAN (600000000) ENGINE = InnoDB,
 PARTITION p6 VALUES LESS THAN (700000000) ENGINE = InnoDB,
 PARTITION p7 VALUES LESS THAN (800000000) ENGINE = InnoDB,
 PARTITION p8 VALUES LESS THAN (900000000) ENGINE = InnoDB,
 PARTITION p9 VALUES LESS THAN (1000000000) ENGINE = InnoDB,
 PARTITION p10 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */

Also, I ran 5 instances of iiBench simultaneously to simulate 5 concurrent connections writing to the table, with each instance of iiBench performing 200 million single-row inserts, for a total of 1 billion rows. I ran the test both against the table purchases_noindex, which has only a primary key and no secondary indexes, and against the table purchases_index, which has 3 secondary indexes. It is also worth noting that the size of the table without secondary indexes is 56G, while the size of the table with secondary indexes is 181G.
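To put those sizes in perspective, here is a rough per-row calculation. It is a sketch under two assumptions that the post does not state explicitly: that “G” means GiB, and that both sizes were measured with the full 1 billion rows loaded.

```python
GIB = 2**30  # assumption: "G" in the post means GiB

instances = 5
rows_per_instance = 200_000_000
total_rows = instances * rows_per_instance

size_noindex = 56 * GIB   # table without secondary indexes
size_index = 181 * GIB    # table with 3 secondary indexes

# Average on-disk footprint per row (assumes sizes are for all 1 billion rows).
bytes_per_row_noindex = size_noindex / total_rows
bytes_per_row_index = size_index / total_rows
# Extra space attributable to the 3 secondary indexes.
index_overhead_gib = (size_index - size_noindex) / GIB

print(f"total rows: {total_rows:,}")
print(f"bytes/row without secondary indexes: {bytes_per_row_noindex:.0f}")
print(f"bytes/row with secondary indexes: {bytes_per_row_index:.0f}")
print(f"secondary index overhead: {index_overhead_gib:.0f} G")
```

Under those assumptions the three secondary indexes account for about 125G, more than twice the size of the data itself.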

Now let’s come down to the interesting part.

Results

With the table purchases_noindex, which has no secondary indexes, I was able to achieve an average insert rate of ~25k INSERTs per second, while with the table purchases_index the average insert rate dropped to ~9k INSERTs per second. Let’s take a look at the graphs to get a better view of the whole picture.

Note, in the above graph, we have “millions of rows” on the x-axis and “INSERTs Per Second” on the y-axis.
I have chosen to show “millions of rows” on the x-axis so that we can see the impact of data-set growth on the insert rate.

We can see that adding the secondary indexes to the table has decreased the insert rate by roughly 3x, and the rate is not even consistent. With the table having no secondary indexes, the insert rate is pretty much constant, remaining between ~25k and ~26k INSERTs per second. With the table having secondary indexes, on the other hand, there are regular spikes in the insert rate, and the variation is large: the rate varies between ~6.5k and ~12.5k INSERTs per second, with noticeable spikes after every 100 million rows inserted.

I noticed that the drop in insert rate was mainly caused by IO pressure from increased flushing and checkpointing activity. This caused spikes in write activity to the point that the insert rate decreased.
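The average rates quoted above translate into a slowdown factor and an approximate wall-clock time for the whole load. This is only back-of-the-envelope arithmetic, assuming the ~25k and ~9k figures are aggregate rates across all 5 connections and hold for the entire 1 billion rows.

```python
total_rows = 1_000_000_000
rate_noindex = 25_000  # approx. INSERTs per second, no secondary indexes
rate_index = 9_000     # approx. INSERTs per second, 3 secondary indexes

# How many times slower inserts are with the secondary indexes in place.
slowdown = rate_noindex / rate_index

# Estimated time to load all rows, in hours
# (assumes the average rate is sustained for the whole run).
hours_noindex = total_rows / rate_noindex / 3600
hours_index = total_rows / rate_index / 3600

print(f"slowdown: {slowdown:.2f}x")
print(f"load time without secondary indexes: {hours_noindex:.1f} h")
print(f"load time with secondary indexes: {hours_index:.1f} h")
```

The slowdown comes out at about 2.8x, which is the “roughly 3x” seen in the graphs, and the estimated load time grows from about 11 hours to about 31 hours.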

Conclusion

As we all know, there are pros and cons to using secondary indexes. While secondary indexes improve read performance, they have an impact on write performance. Most apps rely on read performance, and hence having secondary indexes is an obvious choice. But for applications that are write-mostly or that rely heavily on write performance, reducing the number of secondary indexes, or even doing away with them altogether, can increase write throughput by 2x to 3x. In this particular case, since I was mostly concerned with write performance, I chose a table structure with no secondary indexes. Other important things to consider when you are concerned with write performance are using partitioning to reduce the size of the B+tree, having multiple buffer pool instances to reduce contention caused by buffer pool mutexes, using the “estimate” checkpoint method to reduce the chances of log flush storms, and disabling the query cache.

May
10
2012
--

Testing Fusion-io ioDrive2 Duo

I was lucky enough to get my hands on a new Fusion-io ioDrive2 Duo card, so I decided to run the same series of tests I did for other Flash devices. This is an ioDrive2 Duo 2.4TB card, and it is visible to the OS as two devices (1.2TB each), which can be joined together via software RAID. So I tested in two modes: single drive, and software RAID-0 over two drives.

I should note that to run this card you need external power, for the same reason I mentioned in the previous post: a PCIe slot can provide only 25W of power, which is not enough for the ioDrive2 Duo to deliver full performance. I mention this because it may be a challenge for some servers: some models may not have a connector for external power, and for some you may need a special “power kit”. So you need to make sure you have compatible hardware before getting a Duo card. I personally ended up with a setup like this: I use a separate power supply.

Fusion-io ioDrive2 Firmware v6.0.0, rev 107004 Public, Fusion-io driver version: 3.1.1.

Now to the results.
For this test I again use a Cisco UCS C250 server, and on the graphs I show the results for both a single card and RAID (Duo).

Random writes, async:

We see stable and predictable write performance, with a throughput of 660 MiB/s for single and 1300 MiB/s for Duo.

Random reads:

Again, both modes provide a stable level of throughput: 1350 MiB/s for single and 2300 MiB/s for Duo.

Now with separation per thread for random read synchronous IO:

There are also excellent response time characteristics: 0.25ms and 0.19ms for 8 threads, in the single and Duo cases respectively.

In general, ioDrive2 seems to provide better and more stable performance results compared to the previous-generation ioDrive.


May
09
2012
--

New distribution of random generator for sysbench – Zipf

Sysbench has three distributions for random numbers: uniform, special and gaussian. I mostly use uniform and special, and I feel that neither fully reflects my needs when I run benchmarks. Uniform is stupidly simple: for a table with 1 mln rows, each row gets an equal number of hits. This barely reflects a real system, and it also does not allow effective testing of caching solutions, since each row is equally likely to be put into the cache or removed. That’s why there is the special distribution, which is better but goes to the other extreme: it is skewed to a very small percentage of rows, which makes it good for testing a cache but makes it hard to emulate a high IO load.

That’s why I was looking for alternatives, and the Zipfian distribution seems a decent one. This distribution has a parameter θ (theta), which defines how skewed the distribution is. The physical meaning of this parameter, when applied to database tables, is the following: say row 1 is accessed N times, then row 2 is accessed 2^θ times less often, row 3 is accessed 3^θ times less often, …, and row X is accessed X^θ times less often.
Say θ=1.1; then if row 1 is accessed 1,000,000 times, row 2 is accessed 1,000,000/(2^1.1) = 466,516 times, row 3: 1,000,000/(3^1.1) = 298,652 times, …, row id=10000: 1,000,000/(10,000^1.1) = 39 times.

Obviously with θ=0 we get the uniform distribution: each row is accessed an equal number of times (for row X: 1/(X^0) = 1).
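The arithmetic above, and the idea of drawing row ids with these weights, can be illustrated with a short Python sketch. This is not the sysbench implementation; `zipf_weight`, `expected_hits` and `ZipfSampler` are hypothetical names, and the sampler uses a plain inverse-CDF lookup over cumulative weights, chosen here only because it is easy to follow.

```python
import bisect
import itertools
import random


def zipf_weight(rank, theta):
    """Relative access frequency of the row with the given rank (1-based)."""
    return 1.0 / (rank ** theta)


def expected_hits(rank, theta, hits_row1):
    """Hits for a row, given the number of hits on row 1 (truncated to an integer)."""
    return int(hits_row1 * zipf_weight(rank, theta))


class ZipfSampler:
    """Draw row ids 1..n_rows with probability proportional to 1 / rank^theta."""

    def __init__(self, n_rows, theta, seed=None):
        weights = (zipf_weight(r, theta) for r in range(1, n_rows + 1))
        self.cumulative = list(itertools.accumulate(weights))
        self.rng = random.Random(seed)

    def sample(self):
        # Pick a point in [0, total weight) and find which row's interval holds it.
        u = self.rng.random() * self.cumulative[-1]
        idx = bisect.bisect_right(self.cumulative, u)
        # Guard against floating-point rounding at the very top of the range.
        return min(idx, len(self.cumulative) - 1) + 1


for rank in (2, 3, 10_000):
    print(rank, expected_hits(rank, 1.1, 1_000_000))

sampler = ZipfSampler(n_rows=1_000_000, theta=1.1, seed=42)
print([sampler.sample() for _ in range(10)])
```

The first loop reproduces the numbers in the text for θ=1.1, and with `theta=0` every weight becomes 1, which gives the uniform case.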

There is research showing that user behavior can be described by this distribution: Zipf, Power-laws, and Pareto – a ranking tutorial

To see the distribution on graphs, I took tables with 1 mln rows and ran a row lookup 1 million times.

Here are histograms of how many times each row was selected for different θ: 0.5, 0.9, 1.1:

The curve is very skewed, so I zoomed the graphs to show only the 0-100k level:

I implemented Zipf for sysbench; right now it is in a separate tree, https://code.launchpad.net/~vadim-tk/sysbench/zipf-distribution, and you are welcome to try it if it sounds interesting.

I am going to run a couple of upcoming benchmarks with this distribution.


May
07
2012
--

Testing Fusion-io ioDrive – now with driver 3.1

In my previous post with results for the Fusion-io ioDrive we saw some instability in the results, and it was pointed out to me that this may be fixed in the new VSL 3.1.1 drivers. I am not sure if this driver is available to everyone; if you are interested, please contact your Fusion-io support representative. I installed the new drivers and firmware, and the results did in fact improve.

Information about driver and firmware: Firmware v6.0.0, rev 107006. Fusion-io driver version: 3.1.1 build 172.

The upgrade was not flawless, though: after the firmware upgrade I had to perform low-level formatting, which erases all data. So if you want to do the same, make sure you back up your data first.

So here are the results for driver 3.1 (compared with the previous driver, 2.3).

Random writes:

For random writes there is not much improvement; the throughput is about the same.

Random reads:

But there is a significant improvement for random reads. The result is stable at the 640 MiB/sec level, and it is higher than before.

Sync random reads per threads, throughput:

Response time:

Again, there is an improvement in throughput, both in stability and in absolute value. For response time, in some cases there is a 2x improvement.

So for the Fusion-io ioDrive it seems worth considering an upgrade to the 3.1 driver (remember to back up your data first).


May
04
2012
--

Testing Virident FlashMAX 1400

I continue to run benchmarks of different SSD cards. This time I show numbers for the Virident FlashMAX 1400, an MLC PCIe SSD device. There are a couple of notes on these results.
First, this time I use a different server. For this benchmark it is a Cisco UCS C250, while for previous results I used an HP ProLiant DL380 G6.

The second note is that I use the “turbo=1” mode for the Virident card. What does that mean? Apparently the PCIe specification has a limit on available power. If I am not mistaken it is 25W, but Virident requires 28W to provide full write performance. And while many servers can handle 28W on PCIe, this is a non-standard mode, so Virident uses 25W by default (turbo=0). To force full power, I load the driver with turbo=1. I also use “maxperformance” formatting for Virident, which gives less capacity (1.2TB visible to the user) but reserves more space internally to provide better write performance.

As usual, I start with random writes, async.

Soon after an initial period, the result stabilizes at the 550 MiB/sec level.

Random read, async:

Random read throughput is very close to a perfect flat line, at 1450 MiB/sec.
This is the best read throughput I’ve seen so far in my benchmarks.

To see the distribution of response times, here are the results for random read synchronous IO.

There we can see that 1450 MiB/sec is not quite achievable in sync mode, and only the 64-thread case gets close.

Response time:

In conclusion, of all the cards tested, the Virident FlashMAX shows the most stable results and the best absolute performance so far.

For reference, other results in series:


May
03
2012
--

Testing Fusion-io ioDrive

Following my series of posts on testing different SSDs, in my last post I mentioned that SATA SSD performance is getting closer to that of PCIe cards. It really makes sense to test it under a MySQL workload, but before getting to that, let me review the same workload on a Fusion-io ioDrive PCIe card. This is the previous generation of Fusion-io cards, but it is the one with the biggest installed base.

Driver information: Fusion-io driver version: 2.3.10 build 110; Firmware v5.0.7, rev 107053

Following the format of previous benchmarks, first is the random write async 16KB case.

We can see a wave-like pattern, with throughput of 350-400 MiB/sec.
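Since these tests use a fixed 16KB block, a throughput figure in MiB/sec maps directly to operations per second. A minimal sketch of that conversion follows, assuming “16KB” means 16 KiB (16384 bytes, as in the sysbench command line used in this series); the function name is illustrative.

```python
BLOCK_SIZE_BYTES = 16 * 1024  # assumption: "16KB" means 16 KiB (16384 bytes)
MIB = 1024 * 1024


def mib_per_sec_to_iops(throughput_mib_s, block_size_bytes=BLOCK_SIZE_BYTES):
    """Convert a throughput in MiB/sec into IO operations per second."""
    return throughput_mib_s * MIB / block_size_bytes


low = mib_per_sec_to_iops(350)
high = mib_per_sec_to_iops(400)
print(f"350-400 MiB/sec at 16 KiB blocks is {low:.0f}-{high:.0f} IOPS")
```

In other words, the 350-400 MiB/sec range corresponds to roughly 22-26 thousand 16 KiB writes per second.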

Random reads, async:

It is interesting to see that the throughput is quite unstable, in the range of 450-500 MiB/sec.
It is unusual for a read-only workload to show such variation in throughput.

It gets even more interesting when we go to read sync IO, with 1-64 threads.
Throughput:

Response time:

For some cases (e.g. 4 threads) we see some interesting patterns.
As for response time, it actually does not seem much better than for the Intel 520.
For 8 threads it is 0.6 ms (for the Intel 520: 0.69 ms).

To better understand the patterns in the read synchronous case, let me unfold the results and show them on a timeline (from 0 to 1800 sec):

I am not sure how to explain why the pattern is less stable with 4 and 8 threads than with 32 threads.

It is curious that I have already published results for a bunch of cards:

and each card shows individual patterns and different handling of the write and read IO cases.


May
01
2012
--

Testing Intel SSD 520

Following my previous benchmarks of SATA SSD cards, I got an Intel SSD 520 240GB into my hands. In this post I show the results of raw IO performance for this card.

I described the benchmark methodology in previous posts, so let me jump directly to the results.

The first case is random write asynchronous IO with 8 threads; the test is done just after a secure erase operation on the card.

The card holds a stable 380 MiB/sec level, but after around 4000 sec, as the garbage collector kicks in, we see a performance drop to around 300 MiB/sec with some instability, which I will research in later charts.

Now, random reads, still asynchronous

It gives an almost stable 370 MiB/sec throughput, with some strange small periodic drops.

To better understand response time ranges, we need to switch to synchronous IO and vary the number of threads.

Throughput:

And response times:

We still see small hiccups in throughput and response times, even for a small number of threads.
For 8 threads the 95% response time is 0.69ms.

Now let me get back to the random write case. I will try synchronous IO, varying the number of threads and taking measurements every 1 sec to see how bad the drops are.

So there is more or less stable performance only for 1 thread. For 2 or more, the throughput varies a lot from second to second. I draw boxplots, which show the 25th, 50th and 75th percentiles. There is no growth in throughput beyond 2 threads, and the result averages 300 MiB/sec.
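The 25-50-75 percentile summary that a boxplot shows can be computed from per-second throughput samples as follows. This is only an illustration: the sample values are hypothetical (they are not the measurements behind the graph), and `boxplot_summary` is a made-up helper name.

```python
import statistics


def boxplot_summary(samples):
    """Return the 25th, 50th and 75th percentiles and the interquartile range."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {"p25": q1, "p50": median, "p75": q3, "iqr": q3 - q1}


# Hypothetical per-second throughput samples in MiB/sec, for illustration only.
throughput_mib_s = [210, 395, 260, 340, 305, 220, 380, 290, 315]

summary = boxplot_summary(throughput_mib_s)
print(summary)
```

A wide interquartile range relative to the median is what “varies a lot from second to second” looks like in numbers.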

I am still interested in asynchronous IO, as MySQL 5.5 uses async IO for writes. Maybe 8 threads in the first graph is too many and we should go with 1 thread?

So even with 1 async write thread the throughput jumps around a lot, in the range of 200 – 400 MiB/sec.

In conclusion, I should say that the 300 MiB/sec level for random reads and writes is a very decent result for a SATA card. I think that with this performance SATA is getting closer to the level of PCIe cards. Of course PCIe still provides better numbers, but the question is how much of that MySQL can use. In his keynote, Mark Callaghan mentioned that the Fusion-io cards they use are highly underutilized.

Given the performance variance we see, it is a good question how it affects MySQL performance, and I am going to run some MySQL workloads on these cards to understand it better.

If you are interested in more SSD and MySQL questions, I will be giving a webinar, “MySQL and SSD”, on May 9. It will be the same as my talk at Percona Live MySQL Conference 2012; if you did not attend my talk, you are welcome to join the webinar.


Apr
25
2012
--

Testing STEC SSD MACH16 200GB SLC

Following my previous benchmark of the Samsung 830, today I want to show results for the STEC MACH16 SATA card, 200GB in size. This card is based on SLC and, according to the STEC website, it is enterprise-grade storage.

For tests I use sysbench fileio with a 16KiB block size (to match the workload from InnoDB, as this is the primary usage for me), and recently I switched to async IO mode. There are two reasons for that. First, MySQL/InnoDB uses async writes, so this emulates the database load. Second, async mode lets us see the maximal possible throughput. It does not show reliable latency, though, as it appears there is no reliable way in the Linux asynchronous IO library to get time metrics for a particular IO block.

So my testing command line looks like:

sysbench --test=fileio --file-total-size=${size}G --file-test-mode=rndwr --max-time=18000 --max-requests=0 --num-threads=$numthreads --rand-init=on --file-num=64 --file-io-mode=async --file-extra-flags=direct --file-fsync-freq=0 --file-block-size=16384 --report-interval=10 run

You can see that I gather metrics every 10 sec to see how stable the performance is, and it really helps to observe some artifacts, as you will see in the following graphs.
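When the same command has to be repeated for different test modes and thread counts, it can help to generate the command lines programmatically. The sketch below only builds the strings and does not run anything. The flags are the ones from the command line above; the `rndrd` test mode and the `sync` IO mode are standard sysbench fileio values; the 100G file size and the powers-of-two thread counts are assumptions made for illustration, and the function name is hypothetical.

```python
import shlex


def sysbench_fileio_cmd(size_gb, test_mode, io_mode, num_threads,
                        max_time=18000, report_interval=10):
    """Build a sysbench fileio command line using the flags shown above."""
    args = [
        "sysbench",
        "--test=fileio",
        f"--file-total-size={size_gb}G",
        f"--file-test-mode={test_mode}",   # rndwr = random write, rndrd = random read
        f"--max-time={max_time}",
        "--max-requests=0",
        f"--num-threads={num_threads}",
        "--rand-init=on",
        "--file-num=64",
        f"--file-io-mode={io_mode}",       # async or sync
        "--file-extra-flags=direct",
        "--file-fsync-freq=0",
        "--file-block-size=16384",
        f"--report-interval={report_interval}",
        "run",
    ]
    return shlex.join(args)


# Async random write with 8 threads; the 100G size is a hypothetical value.
write_cmd = sysbench_fileio_cmd(100, "rndwr", "async", 8)

# Sync random read sweep; the exact thread counts are an assumption.
read_sweep = [sysbench_fileio_cmd(100, "rndrd", "sync", t)
              for t in (1, 2, 4, 8, 16, 32, 64)]

print(write_cmd)
for cmd in read_sweep:
    print(cmd)
```

Each generated line can be pasted into a shell or handed to a job runner, so every test case uses identical settings apart from the mode and thread count.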

Hardware for tests: HP ProLiant DL380 G6, filesystem: ext4, mounted with nobarrier.

The results for random write case (8 async IO threads):

In general it shows stable throughput, topping out at 148 MiB/sec, but every 20 min there is a small drop to 87 MiB/sec, which I guess is related to internal garbage collector activity.

The results for random read case:

Very stable throughput, on a flat line at 222 MiB/sec.

To understand better what kind of response time we should expect, I ran random read sync IO mode, now for 1-64 threads.

The throughput:

We are getting to the peak throughput at 8 threads.

And response time:

For 8 threads, we may expect 0.62ms response time.
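The thread count and the response time can be tied back to throughput with Little's law: in synchronous mode each thread has one request in flight, so requests per second is approximately threads divided by response time. The sketch below applies this under the assumption that the quoted 0.62ms approximates the average per-request latency; if it is a high percentile instead, the result is only a lower bound.

```python
threads = 8
response_time_s = 0.62 / 1000   # 0.62 ms; assumed to approximate average latency
block_size_bytes = 16 * 1024    # 16KiB block size used by the tests
MIB = 1024 * 1024

# Little's law: throughput = concurrency / latency (one request in flight per thread).
iops = threads / response_time_s
throughput_mib_s = iops * block_size_bytes / MIB

print(f"estimated IOPS: {iops:.0f}")
print(f"estimated throughput: {throughput_mib_s:.1f} MiB/sec")
```

The estimate of about 200 MiB/sec is in the same range as the 222 MiB/sec seen in the async random read test.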

In general I have had a very good experience with this card, and it seems suitable for working with MySQL. I will publish sysbench oltp benchmarks running MySQL on RAID10 over 4 STEC MACH16 cards.

If you are interested in more SSD and MySQL questions, I will be giving a webinar, “MySQL and SSD”, on May 9. It will be the same as my talk at Percona Live MySQL Conference 2012; if you did not attend my talk, you are welcome to join the webinar.

Disclaimer: This benchmark was done as part of consulting work for STEC, but this post is totally independent and fully reflects our opinion.


Apr
25
2012
--

Testing Samsung SSD SATA 256GB 830 – not all SSD created equal

I personally like PCIe-based Flash, but from a pricing point of view our customers are looking for cheaper alternatives. SATA SSD is one option. There are many products based on MLC technology, and the Intel 320 is, I would say, the most popular. I do not particularly like its write performance (I wrote about it before), which is why I am looking for comparable alternatives. The Samsung 830 256GB looked like a good product, so I decided to test it.

For tests I use sysbench fileio with a 16KiB block size (to match the workload from InnoDB, as this is the primary usage for me), and recently I switched to async IO mode. There are two reasons for that. First, MySQL/InnoDB uses async writes, so this emulates the database load. Second, async mode lets us see the maximal possible throughput. It does not show reliable latency, though, as it appears there is no reliable way in the Linux asynchronous IO library to get time metrics for a particular IO block.

So my testing command line looks like:

sysbench --test=fileio --file-total-size=${size}G --file-test-mode=rndwr --max-time=18000 --max-requests=0 --num-threads=$numthreads --rand-init=on --file-num=64 --file-io-mode=async --file-extra-flags=direct --file-fsync-freq=0 --file-block-size=16384 --report-interval=10 run

You can see that I gather metrics every 10 sec to see how stable the performance is, and it really helps to observe some artifacts, as you will see in the following graphs.

Hardware for tests: HP ProLiant DL380 G6, filesystem: ext4, mounted with nobarrier.

The results for random write case (8 async IO threads):

It seems that InnoDB is not alone with its flushing problems. You can see periodic stalls in throughput (0 throughput for periods of 20-30 sec). When there are no drops, the drive keeps write throughput at the 323 MiB/sec level.

I really thought that these stalls were related to writes, so I was totally surprised to see them in random reads also.
The results for random read case:

I do not have a good explanation for this. When there is no drop, the drive keeps a 375 MiB/sec throughput. I can make a wild guess about the drops: the drive periodically cleans an internal cache or something.

To understand better what kind of response time we should expect, I ran random read sync IO mode, now for 1-64 threads.

The throughput:

We reach the peak throughput at 16-32 threads.

And response time:

For 16 threads, we may expect 0.96ms response time, which increases to 1.62ms under 32 threads.

The periodic drops that I observe for both random reads and random writes do not allow me to recommend this drive for database server usage, even though in general this drive provides much better throughput than the Intel 320 (some results for Intel 320).

If you are interested in more SSD and MySQL questions, I will be giving a webinar, “MySQL and SSD”, on May 9. It will be the same as my talk at Percona Live MySQL Conference 2012; if you did not attend my talk, you are welcome to join the webinar.


Apr
20
2012
--

Benchmarks challenges of XtraDB Cluster

We are running a lot of internal benchmarks on our recently announced Percona XtraDB Cluster, and I am going to publish the results soon.
But before that I wanted to mention that properly benchmarking a distributed system comes with a lot of challenges.
I am saying this not to complain, but to make sure that, if you are going to benchmark XtraDB Cluster yourself, you know there are a lot of things to take into account.

And it seems that one component which was not very important before now appears as a critical piece that can easily become a bottleneck in the benchmarks: the network.

In case of simple client-server setup, the network is not fully utilized.

But as we start testing a cluster setup, the 1Gb network between the client and the switch becomes fully utilized by sysbench communication with 3 nodes.

In this setup it does not make sense to increase number of nodes, as we will not be able to load them properly.

The solution would be to increase network capacity or add additional client boxes.
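A rough sizing calculation shows how quickly a single client link runs out. Everything below except the 1Gb link speed is hypothetical: the per-node traffic figure is a placeholder to be replaced with a measured value, and the helper name is made up for this sketch.

```python
import math

LINK_GBIT = 1                              # client-to-switch link, as in the text
LINK_MB_PER_SEC = LINK_GBIT * 1000 / 8     # 1 Gbit/s = 125 MB/s (decimal units)


def client_boxes_needed(nodes, traffic_per_node_mb_s, link_mb_s=LINK_MB_PER_SEC):
    """Number of client machines needed so that no client link is saturated."""
    total_traffic = nodes * traffic_per_node_mb_s
    return math.ceil(total_traffic / link_mb_s)


# Hypothetical figure: 45 MB/s of sysbench traffic per cluster node.
for nodes in (3, 6, 9):
    print(nodes, "nodes ->", client_boxes_needed(nodes, 45), "client box(es)")
```

With these placeholder numbers, 3 nodes already need more than one 1Gb client link, which matches the saturation described above; a faster network raises `link_mb_s` instead of the number of client boxes.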

Now take into account that there is also internal network communication between the nodes, and that makes network tuning a critical part of a cluster setup. This is not something we paid much attention to before.

The main conclusion of this post is that if you are going to benchmark Percona XtraDB Cluster, or just use it under a communication-intensive workload, pay attention to the network component. It is very easy for a client or a client network to become the bottleneck.

