Dec 21, 2010
--

MySQL 5.5.8 and Percona Server on Fast Flash card (Virident tachIOn)

This is to follow up on my previous post and show the results for MySQL 5.5.8 and Percona Server on the fastest hardware I have in our lab: a Cisco UCS C250 server with 384GB of RAM, powered by a Virident tachIOn 400GB SLC card.

To see different I/O patterns, I used three innodb_buffer_pool_size settings (13G, 52G, and 144G) on a tpcc-mysql workload with 1000W (around 100GB of data). These buffer pool sizes give us different data/memory ratios: with 13G the workload is I/O intensive, with 52G about half of the data fits into the buffer pool, and with 144G all of the data fits into memory. When the data fits into memory, it is especially important to have big transactional log files, because in these cases the main I/O pressure comes from checkpoint activity, and the smaller the log size, the more I/O per second InnoDB needs to perform.

So let me point out the optimizations I used for Percona Server:

  • innodb_log_file_size=4G (innodb_log_files_in_group=2)
  • innodb_flush_neighbor_pages=0
  • innodb_adaptive_checkpoint=keep_average
  • innodb_read_ahead=none

For MySQL 5.5.8, I used:

  • innodb_log_file_size=2000M (innodb_log_files_in_group=2), as the maximal available setting
  • innodb_buffer_pool_instances=8 (for a 13GB buffer pool); 16 (for 52 and 144GB buffer pools), as it seems that in this configuration this setting provides the best throughput
  • innodb_io_capacity=20000; unlike in the FusionIO case, this gives better results for MySQL 5.5.8.

For both servers I used:

  • innodb_flush_log_at_trx_commit=2
  • ibdata1 and innodb_log_files located on separate RAID10 partitions, InnoDB datafiles on the Virident tachIOn 400G card

The raw results, config, and script are in our Benchmarks Wiki.
Here are the graphs:

13G innodb_buffer_pool_size:

In this case, both servers show a straight line, and it seems having 8 innodb_buffer_pool_instances was helpful.

52G innodb_buffer_pool_size:

144G innodb_buffer_pool_size:

The final graph shows the difference between different settings of innodb_io_capacity for MySQL 5.5.8.

Small innodb_io_capacity values are really bad, while 20000 allows us to get a more stable line.

In summary, if we take the average NOTPM for the final 30 minutes of the runs (to avoid the warmup stage), we get the following results:

  • 13GB: MySQL 5.5.8 – 23,513 NOTPM, Percona Server – 30,436 NOTPM, advantage: 1.29x
  • 52GB: MySQL 5.5.8 – 71,774 NOTPM, Percona Server – 88,792 NOTPM, advantage: 1.23x
  • 144GB: MySQL 5.5.8 – 78,091 NOTPM, Percona Server – 109,631 NOTPM, advantage: 1.4x
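
The summary numbers above can be reproduced with a few lines of Python. This is only a sketch: it assumes one NOTPM sample per minute (the real sampling interval is in the raw results on the wiki), and the helper names `tail_average` and `advantage` are made up for illustration. Ratios printed with two decimals may differ in the last digit from the rounded figures in the list.

```python
from statistics import mean


def tail_average(notpm_per_minute, tail_minutes=30):
    """Average the last `tail_minutes` samples, which skips the warmup stage."""
    if len(notpm_per_minute) < tail_minutes:
        raise ValueError("run is shorter than the requested tail window")
    return mean(notpm_per_minute[-tail_minutes:])


def advantage(baseline_notpm, candidate_notpm):
    """Ratio of candidate throughput to baseline throughput."""
    return candidate_notpm / baseline_notpm


# Averages reported in the list above: (MySQL 5.5.8, Percona Server).
results = {
    "13GB": (23513, 30436),
    "52GB": (71774, 88792),
    "144GB": (78091, 109631),
}

for size, (mysql_notpm, percona_notpm) in results.items():
    print(f"{size}: {advantage(mysql_notpm, percona_notpm):.2f}x")
```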

This is actually the first case where I’ve seen NOTPM greater than 100,000 for a tpcc-mysql workload with 1000W.

The main factors that allow us to get a 1.4x improvement in Percona Server are:

  • Big log files. The total size of the logs is 8GB (innodb_log_file_size=4G with innodb_log_files_in_group=2)
  • Disabling flushing of neighborhood pages: innodb_flush_neighbor_pages=0
  • New adaptive checkpointing algorithm innodb_adaptive_checkpoint=keep_average
  • Disabled read-ahead logic: innodb_read_ahead=none
  • Buffer pool scalability fixes (different from innodb_buffer_pool_instances)

We recognize that hardware like the Cisco UCS C250 and the Virident tachIOn card may not be for the mass market yet, but it is a good choice if you are looking for high MySQL performance, and we tune Percona Server to get the most from such hardware. Actually, from my benchmarks, I see that the Virident card is not fully loaded, and we may benefit from running two separate instances of MySQL on a single card. This is a topic for another round.

(Edited by: Fred Linhoss)


Entry posted by Vadim |
10 comments


Dec 20, 2010
--

MySQL 5.5.8 and Percona Server: being adaptive

As we can see, MySQL 5.5.8 comes with great improvements and scalability fixes. Adding up all the new features, you have a great release. However, there is one area I want to touch on in this post. At Percona, we consider it important not only to have the best peak performance, but also stable and predictable performance. I refer you to Peter’s post, Performance Optimization and Six Sigma.

In Percona Server (and actually even before that, in percona-patches builds for 5.0), we added adaptive checkpoint algorithms, and later the InnoDB-plugin included an implementation of “adaptive flushing”. This post shows the differences between them and MySQL.

The post also answers the question of whether we are going to have releases of Percona Server/XtraDB based on the MySQL 5.5 code line. The answer: Yes, we are. My benchmarks here are based on Percona Server 5.5.7. (You can get the source code from lp:~percona-dev/percona-server/5.5.7 , but it is very beta quality at the moment.)

For this post, I made tpcc-runs on our Dell PowerEdge R900 box, using RAID10 over 8 disks and a FusionIO 320GB MLC card.

First, the results for tpcc-mysql, 500w (around 50GB of data) on RAID10. I used innodb_buffer_pool_size=24G, innodb_log_file_size=2000M (innodb_log_files_in_group=2), and innodb_flush_log_at_trx_commit=2. Also, innodb_adaptive_flushing (ON) and innodb_adaptive_checkpoint (estimate) were left at their default values.

The raw results, full config files, and scripts are in our Benchmarks Wiki.

The graphical result below shows the throughput on the server over 8 hours. (Yes, 8 hours, to show MySQL performance over a long time period. It is not a short, 5-minute exercise.)

Although it takes a decent time for the Percona Server results to stabilize, for MySQL 5.5.8 we have regular dips (3 times per hour) from 24900 NOTPM to 17700 NOTPM (dips of around 30%).

Next, the second run on the FusionIO card. Here I should say that we were not able to get stable results with the existing adaptive_checkpoint or adaptive_flushing algorithms. So, Yasufumi invested a lot of research time and came up with the new innodb_adaptive_checkpoint=“keep_average” method. This method requires setting innodb_flush_neighbor_pages=0 to disable flushing of neighborhood pages (this setting is not available in MySQL 5.5.8). The problem with flushing neighborhood pages is that it makes it impossible to calculate exactly how many pages were handled. The feature was created as an optimization for hard drives: InnoDB tries to combine as many pages as possible into a single sequential write, which means that a single I/O may have a size of 32K, 64K, 96K, and so on. That, in turn, makes it impossible to predict how many I/O operations there will be. Furthermore, this optimization is not needed for flash devices, like FusionIO or Virident cards.
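
To illustrate why neighbor flushing makes the I/O count hard to predict, here is a toy model in Python. It is not InnoDB's actual flushing code: it simply assumes 16KB pages and merges adjacent dirty page numbers into one write, and the function name `coalesce_writes` is invented for this sketch.

```python
PAGE_SIZE_KB = 16  # assumed page size for this toy model


def coalesce_writes(dirty_pages):
    """Group adjacent page numbers into runs; each run models one sequential write.

    Returns the size in KB of every modeled write.
    """
    runs = []
    for page in sorted(set(dirty_pages)):
        if runs and page == runs[-1][-1] + 1:
            runs[-1].append(page)  # extends the current contiguous run
        else:
            runs.append([page])    # starts a new run (a new I/O)
    return [len(run) * PAGE_SIZE_KB for run in runs]


# Six dirty pages end up as three writes of different sizes.
io_sizes_kb = coalesce_writes([10, 11, 12, 40, 41, 90])
print(io_sizes_kb)
print(len(io_sizes_kb))
```

The same number of dirty pages can turn into very different numbers of I/O operations depending on how the pages happen to be laid out, which is exactly what an algorithm that counts pages per second cannot account for.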

An additional optimization we have for SSDs is big log files. For this run, I used innodb_log_file_size=4G (innodb_log_files_in_group=2) for Percona Server. That gave 8GB in total size for log files (MySQL 5.5.8 has a 4GB limit). In addition to increasing the log size, we added the innodb_log_block_size option, which allows changing the I/O block size for log files. The default is 512 bytes; in the test with FusionIO I used 4096 bytes, to align I/O with the internal FusionIO block size.
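
The log sizing arithmetic is simple enough to sketch in Python. The helper name `total_log_bytes` is hypothetical; the values are the settings quoted in this post.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def total_log_bytes(log_file_size_bytes, files_in_group):
    """Combined size of the InnoDB log group."""
    return log_file_size_bytes * files_in_group


# Percona Server run: innodb_log_file_size=4G, innodb_log_files_in_group=2
percona_total = total_log_bytes(4 * GIB, 2)
# MySQL 5.5.8 run: innodb_log_file_size=2000M, innodb_log_files_in_group=2
mysql_total = total_log_bytes(2000 * MIB, 2)

print(percona_total / GIB)          # total log size in GiB for Percona Server
print(round(mysql_total / GIB, 2))  # total log size in GiB for MySQL 5.5.8

# A 4096-byte log block is a whole multiple of the 512-byte default.
print(4096 % 512 == 0)
```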

You can see that MySQL 5.5.8 has periodic drops here, too. The margin between Percona Server and MySQL is about 2500-2800 NOTPM (~15% difference).

MySQL 5.5.8 now has features related to having several buffer pool instances that are supposed to fix the buffer pool scalability issue. Let’s see how MySQL performance changes for the last workload if we set innodb_buffer_pool_instances=8 or 16.

As you see, having several buffer pools makes the dips deeper and longer. It seems that for Percona Server the best choice is innodb_buffer_pool_instances=1, as we implemented buffer pool scalability in a different way.

UPDATE
At the request of a commenter, I also include results with different innodb_io_capacity values for MySQL 5.5.8: 500 (which I used in the benchmarks above), 4000, and 20000.

As you can see, there is no improvement from a bigger innodb_io_capacity. This also agrees with my previous experience that a bigger io_capacity tends to give worse results.

For reference, here is the config file used for benchmarks on FusionIO:

CODE:

  [client]
  socket=/var/lib/mysql/mysql.sock
  [mysqld]
  core
  basedir=/usr/local/mysql
  user=root
  socket=/var/lib/mysql/mysql.sock
  skip-grant-tables
  server_id=1
  local_infile=1
  # data directory on the FusionIO card
  datadir=/mnt/fio320
  innodb_buffer_pool_size=24G
  innodb_data_file_path=ibdata1:10M:autoextend
  innodb_file_per_table=1
  innodb_flush_log_at_trx_commit=2
  innodb_log_buffer_size=8M
  # 2 x 4G = 8GB of logs in total
  innodb_log_files_in_group=2
  innodb_log_file_size=4G
  # log I/O block size aligned with the FusionIO internal size
  innodb_log_block_size=4096
  innodb_thread_concurrency=0
  innodb_flush_method = O_DIRECT
  innodb_read_ahead = none
  innodb_flush_neighbor_pages = 0
  innodb_write_io_threads=8
  innodb_read_io_threads=8
  innodb_io_capacity=500
  max_connections=3000
  query_cache_size=0
  skip-name-resolve
  table_cache=10000
  [mysql]
  socket=/tmp/mysql.sock

(post edited by Fred Linhoss)


Entry posted by Vadim |
23 comments


Dec 13, 2010
--

Write performance on Virident tachIOn card

This is a cross-post from http://www.ssdperformanceblog.com/.

Disclaimer: The benchmarks were done as part of our consulting practice, but this post is totally independent and fully reflects our opinion.

One of the biggest problems with solid state drives is that write performance may drop significantly with decreasing free space. I wrote about this before (http://www.ssdperformanceblog.com/2010/07/free-space-and-write-performance/), using a FusionIO 320GB Duo card as the example. In that case, when space utilization increased from 100GB to 200GB, the write performance dropped by a factor of 2.6.

In this regard, Virident claims that tachIOn cards provide “Sustained, predictable random IOPS – Best in the Industry”. Virident generously provided me with a 400GB model, and I ran the benchmark using the same methodology as in my experiment with FusionIO, which was striped for performance. Also, using my script, Virident made runs on the tachIOn 200GB and 800GB model cards and shared the numbers with me. (To be clear, I can certify only the numbers for the 400GB card, but I have no reason to question the numbers for the 200GB and 800GB cards, as they correspond to my results.)

The benchmarks were done on a Cisco UCS C250 box, and the raw results are on the Benchmarks Wiki.



Visually, the drop is not as drastic as it was in the case using FusionIO, but let’s get some numbers.
I am going to take the performance numbers at the points where the available space of the card is 1/3, 1/2, and 2/3 filled, as well as at the point where the card is full. Then I will compute the ratio of the throughput at the 1/3 point to the throughput at each of the other points.

**For the 400GB tachIOn card:**

| Size, GB | Throughput, MiB/sec | Ratio |
|----------|---------------------|-------|
| 130      | 959.17              |       |
| 200      | 849.58              | 1.13  |
| 260      | 685.18              | 1.40  |
| 360      | 417.33              | 2.29  |

That is, at the 2/3 point, the 400GB card is slower by 29% than at the 1/3 point, and at full capacity it is slower by 57%.
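
For readers who want to redo this arithmetic, here is a small Python sketch using the four data points for the 400GB card. The helper name `slowdown` is made up for illustration, and the results may differ from the quoted figures by a point of rounding.

```python
# (used space in GB, throughput in MiB/sec) for the 400GB tachIOn card.
points = [(130, 959.17), (200, 849.58), (260, 685.18), (360, 417.33)]


def slowdown(baseline, value):
    """Return (baseline/value ratio, percent of throughput lost vs. baseline)."""
    return baseline / value, (1 - value / baseline) * 100


baseline_mib_s = points[0][1]  # throughput at the 1/3 point
for size_gb, mib_s in points[1:]:
    ratio, lost_percent = slowdown(baseline_mib_s, mib_s)
    print(f"{size_gb}GB: ratio {ratio:.2f}, slower by {lost_percent:.0f}%")
```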

Observations from looking at the graph:

* You can also see the card never goes below 400MB/sec, even when working at full capacity. This characteristic (i.e., high throughput at full capacity) is very important to know if you are looking to use an SSD card as a cache layer (say, with FlashCache), as, usually for caching, you will try to fill all available space.
* The ratio between the 1/3 capacity point and full capacity point is much smaller compared with FusionIO Duo (without additional spare capacity reservation).
* Also, looking at the graphs for Virident and comparing with the graphs for FusionIO, one might be tempted to say that Virident just has a lot of space reserved internally which is not exposed to the end user, and this is what they use to guarantee a high level of performance. I checked with Virident and they tell me that this is not the case. Actually, from the diagnostic info on the Wiki page you can see: tachIOn0: 0x8000000000 bytes (512 GB), which I assume is the total installed memory. Regardless, it is not a point to worry about. For pricing, Virident defines GB as the capacity available for end users. So, a competitive $/GB level is maintained, and it does not matter how much space is reserved internally.

Now it would be interesting to compare these results with the results for FusionIO. As the cards have different capacities, I made a graph which shows throughput vs. percentage of used capacity for both cards, the FusionIO 320GB Duo SLC and the Virident 400GB SLC.

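| Util % | Duo 320GB | tachIOn 400GB | tachIOn relative to Duo |
|--------|-----------|---------------|-------------------------|
| 20%    | 1,095     | 990           | 90%                     |
| 30%    | 1,006     | 980           | 97%                     |
| 40%    | 825       | 964           | 117%                    |
| 50%    | 613       | 872           | 142%                    |
| 60%    | 397       | 783           | 197%                    |
| 70%    | 308       | 669           | 217%                    |
| 80%    | 237       | 611           | 258%                    |
| 90%    | 117       | 502           | 429%                    |
| 100%   | 99        | 417           | 421%                    |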
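
As a quick check of the table, the sketch below recomputes the last column and finds the utilization level at which the tachIOn card overtakes the Duo. The function names are invented for this example; the numbers are copied from the table.

```python
# (utilization %, FusionIO Duo 320GB throughput, tachIOn 400GB throughput)
rows = [
    (20, 1095, 990), (30, 1006, 980), (40, 825, 964),
    (50, 613, 872), (60, 397, 783), (70, 308, 669),
    (80, 237, 611), (90, 117, 502), (100, 99, 417),
]


def relative_percent(duo, tachion):
    """tachIOn throughput as a percentage of Duo throughput."""
    return round(tachion / duo * 100)


def crossover_utilization(table):
    """First utilization level where tachIOn is faster than the Duo, or None."""
    for util, duo, tachion in table:
        if tachion > duo:
            return util
    return None


for util, duo, tachion in rows:
    print(f"{util}%: {relative_percent(duo, tachion)}%")
print("tachIOn is ahead from", crossover_utilization(rows), "% utilization")
```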

In conclusion:
* On a single Virident card I see random write throughput close to or over 1GB/sec with low space usage, which is comparable with the throughput I got on a striped FusionIO card. I assume Virident maintains a good level of parallelism internally.
* The Virident card maintains a very good throughput level when close to full capacity, which means you do not need to worry (or can worry less) about space reservation or formatting the card with less space.


Entry posted by Vadim |
7 comments


Jul 17, 2010
--

SSD: Free space and write performance

(cross-posted from SSD Performance Blog)
In the previous post, On Benchmarks on SSD, a commenter touched on another interesting point: available free space affects write performance on an SSD card significantly. The reason is again the garbage collector, which operates more efficiently the more free space you have. To read more on the garbage collector and the write problem, you can check the Write amplification wiki page.

To see how performance drops with decreasing free space, let’s run sysbench fileio random write benchmark with different file sizes.
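
A sweep like this can be scripted. The Python sketch below only builds sysbench 0.4-style command lines for several file sizes; it does not run them. The thread count, run time, and list of sizes are assumptions for illustration, and the scripts actually used are the ones on the wiki, not this one.

```python
def sysbench_fileio_cmd(total_size_gib, stage, threads=4, seconds=3600):
    """Build a sysbench fileio random-write command line for one stage.

    `stage` is "prepare" (create the files) or "run" (execute the test).
    """
    cmd = [
        "sysbench",
        "--test=fileio",
        f"--file-total-size={total_size_gib}G",
        "--file-test-mode=rndwr",
    ]
    if stage == "run":
        cmd += [
            f"--num-threads={threads}",
            f"--max-time={seconds}",
            "--max-requests=0",  # no request limit, so the run is bounded by time only
        ]
    cmd.append(stage)
    return cmd


# Assumed sweep: the bigger the file, the less free space on the card.
for size_gib in (50, 100, 150, 200):
    for stage in ("prepare", "run"):
        print(" ".join(sysbench_fileio_cmd(size_gib, stage)))
```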

For the test I took a FusionIO 320 GB SLC PCIe DUO™ ioDrive card, with software striping between the two cards. Here is the graph of how throughput depends on the available free space (the bigger the file, the less free space):

The system specification and the scripts used are on the Benchmark Wiki.

On the graph you can see two lines (yes, there are two lines, even though they are almost identical).
The first line is the result when the FusionIO card is formatted to use its full capacity, and the second line is for the case when I use additional space reservation (25% in this case, which leaves 240GB available). There is no difference in this case; however, additional over-provisioning protects you from overusing space and keeps performance at the corresponding level.

It is clear that the maximal throughput strongly depends on the available free space. With 100GiB utilization we have 933.60 MiB/sec, with 150GiB (half of the capacity) 613.48 MiB/sec, and with 200GiB it drops to 354.37 MiB/sec, which is 2.6 times less than with 100GiB.

So, returning to the question of how to run a proper benchmark: the result depends significantly on what percentage of the space on the card is used. The results for a 100GiB file on a 160 GB card will be different from the results for a 100GiB file on a 320 GB card.

Besides free space, the performance also depends on the garbage collector algorithm itself, and cards from different manufacturers will show different results. Some upcoming cards present high performance under high space utilization as a competitive advantage, and I am going to run the same analysis on different cards.


Entry posted by Vadim |
9 comments


Jul 14, 2010
--

On Benchmarks on SSD

(cross-posted from SSD Performance Blog)
Getting meaningful performance results on SSD storage is not an easy task. Let’s see why.
Here is a graph from a sysbench fileio random write benchmark with 4 threads. The results were taken on a PCI-E SSD card (I do not want to name the vendor here, as the problem is the same for any card).

The benchmark starts on the newly formatted card, and for some period (fresh period A) you see a line with a high result, which at some point in time drops (point B), and after some recovery period there is a steady state (state C).

What happens here? As you may know, an SSD has garbage collector activity, and point B is the time when the garbage collector starts its work. You can read more on this topic on the Write amplification wiki page.

So, as you understand, it is important to know what state the card was in when the benchmark was running. Apparently, many manufacturers like to put the result from fresh period A into the device specification, while I think steady state C is more important for end users. So in my further results I will point out what state the card was in during the benchmark.

However, this makes the task of running a benchmark on an SSD trickier. It is similar to benchmarks on a database, but upside down. The database just after start is in a “cold state”, and you need to make sure you have enough warmup and only take results in the hot state, when all internal buffers are filled and populated.
Well, you may say: just put the card into steady state C and run the benchmark. But that is only part of the problem.

The next issue comes from the TRIM command. TRIM is the command sent to the device when a file is deleted, and it allows the SSD controller to mark all space related to the file as free and reuse it immediately. Not all devices support the TRIM command; for example, the first generation of Intel SSD cards did not support it, while G2 does.
So why is TRIM a problem for the benchmark? Basically, if you delete all files, it returns the card to fresh state A. Many benchmark scenarios (and my initial sysbench fileio scripts) are supposed to create files at the start of the benchmark and remove them afterward. A similar issue arises when you restore a database from backup, run the benchmark, and remove the files. It may then happen that during your run you cover all states A->B->C, and the final result is pretty much useless. So, as the conclusion, if you want to see the result from the steady state, you should make sure you actually have it in your benchmark.
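
One way to make sure you only report steady state C is to look for the point after which throughput stops moving. The Python sketch below is a deliberately simple heuristic, not something taken from my scripts: the function name, the 5% tolerance, the minimum number of samples, and the sample series are all assumptions.

```python
def steady_state_start(throughput, min_samples=5, tolerance=0.05):
    """Return the first index from which ALL remaining samples stay within
    `tolerance` of their own mean, or None if no such tail exists.

    Requiring the whole tail to be stable keeps fresh period A (which is
    also flat, but followed by a drop) from being reported as steady state.
    """
    for start in range(len(throughput) - min_samples + 1):
        tail = throughput[start:]
        avg = sum(tail) / len(tail)
        if avg > 0 and all(abs(x - avg) <= tolerance * avg for x in tail):
            return start
    return None


# Made-up series: fresh period A, drop and recovery around B, steady state C.
series = [900] * 5 + [300, 350, 420] + [500, 505, 495, 500, 498, 502]
print(steady_state_start(series))
```

Only samples from the returned index onward would then go into the reported average.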

As we are speaking about benchmark results, there is another trick from vendors that I want to bring to your attention. Quite often you can see a specification from an imaginary Vendor X that says:

  • Read: Up to 520 MB/s
  • Write: Up to 480 MB/s
  • Sustained Write: Up to 420 MB/s
  • Random Write 4KB : 70,000 IOPS

The good thing here is that the vendor gives both maximal write (most likely from state A) and Sustained Write (I guess from state C).
However, if you multiply 4KB * 70,000 IOPS, you will get 280,000KB/s, or about 274MB/s, which is quite far from the declared 480MB/s.
What is the trick here? The maximal throughput in MB/sec is what you get when you use a big block size, say 64K or 128K, while the maximal throughput in IOPS is what you get when you use a small block size, 4K in this case.

So when you read “Write: Up to 480 MB/s, Random Write 4KB: 70,000 IOPS”, you should know that 480MB/s was obtained with a big block size, and for a 4KB block size you should expect only about 274MB/s (and most likely in fresh state A).
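
The conversion between IOPS and bandwidth is worth keeping at hand when reading specifications. A minimal Python sketch, assuming 1MB = 1024KB and using invented function names:

```python
def throughput_mb_per_sec(iops, block_size_kb):
    """Bandwidth implied by an IOPS figure at a given block size."""
    return iops * block_size_kb / 1024


def iops_for_throughput(mb_per_sec, block_size_kb):
    """IOPS needed to reach a bandwidth figure at a given block size."""
    return mb_per_sec * 1024 / block_size_kb


# 70,000 IOPS with 4KB blocks, as in the Vendor X example.
print(throughput_mb_per_sec(70000, 4))
# Reaching 480MB/s with 128KB blocks needs far fewer operations per second.
print(iops_for_throughput(480, 128))
```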

As the SSD market evolves, we will see more and more benchmark results, so be ready to read them carefully.


Entry posted by Vadim |
3 comments

Jul 14, 2010
--

SLC vs MLC

(cross-posted from SSDPerformanceBlog.com)
All modern solid state drives use NAND memory based on SLC (single level cell) or MLC (multi level cell) technologies.
Without going into physical details: an SLC cell basically stores 1 bit of information, while an MLC cell can store more. The most popular option for MLC is 2 bits, and there is movement in the 3-bit direction.

This fact gives us the following characteristics:

  • SLC provides less capacity
  • SLC is more expensive
  • SLC is known to have better quality chips

Along with that, there is also a limit on the number of write operations. SLC can handle about 100,000 write cycles, while MLC can handle about 10,000 (the numbers are rough, and are changing as the technology improves).

No wonder that vendors very quickly came up with the following separation:

  • SLC for enterprise market ( servers )
  • MLC for consumer market ( desktops, workstations, laptops)

An obvious example here is Intel SSD cards: the X25-E (SLC) is sold as an enterprise-level card, and the X25-M (MLC) is sold for the mass market. Here is another example of the difference in capacity and price:

  • FusionIO 160GB SLC card price $6,308.99
  • FusionIO 320GB MLC card price $6,729.99

That is, for about the same price the MLC card comes with double the capacity.

However, with increasing capacity the difference between MLC and SLC is getting fuzzier. For MLC the most critical part is the software (firmware) algorithm which ensures uniform usage of the available NAND chips, and with bigger capacity it is much easier to implement.
This problem of handling lifetime and managing write cycles for MLC opened the way for hardware solutions like the SandForce controller, and recently Anobit announced that its “Memory Signal Processing (MSP™) technology enables MLC-based solutions at SLC-grade reliability and performance”.

Also important is the increasing capacity of MLC devices. For example, if we take 10,000 writes vs 100,000 writes, then to provide the same lifetime MLC would need about 10x more capacity, and that does not seem to be a problem. I expect that soon we will see MLC cards with 1600GB, which ideally will have the same lifetime as SLC 160GB cards.
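
The capacity argument can be written down as a rough model. The Python sketch below assumes perfect wear leveling and no write amplification, so that the total volume a card can absorb is simply capacity times write cycles; real devices will do worse, and the function name is made up for this example.

```python
def lifetime_write_volume_gb(capacity_gb, write_cycles):
    """Idealized total data volume a card can absorb before wearing out."""
    return capacity_gb * write_cycles


slc_160 = lifetime_write_volume_gb(160, 100_000)   # SLC, about 100,000 cycles
mlc_1600 = lifetime_write_volume_gb(1600, 10_000)  # MLC, about 10,000 cycles

print(slc_160, mlc_1600)
print(slc_160 == mlc_1600)  # same idealized lifetime with 10x the capacity
```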

Along the same lines, it is interesting to see Intel announce that its enterprise line of SSD cards will be based on eMLC (enterprise MLC), where each cell has a lifetime of 30,000 writes, with a maximal capacity of 400GB.

So it seems the market is gradually moving in the “MLC is ready for enterprise” direction, and that sounds like a good way to get devices with high capacity and a reasonable price in the near future.

Some articles on this topic:


Entry posted by Vadim |
No comment

Jun 14, 2010
--

Virident tachIOn: New player on Flash PCI-E cards market

(Note: The review was done as part of our consulting practice, but is totally independent and fully reflects our opinion)

In my talk at the MySQL Conference and Expo 2010, “An Overview of Flash Storage for Databases”, I mentioned that most likely there were other players coming soon. I actually was not aware of any real names at that time; it was just a guess, as the PCI-E market is really attractive, so FusionIO can’t stay alone for a long time. So I am not surprised to see a new card from Virident, and I was lucky enough to test a pre-production sample of the Virident tachIOn 400GB SLC card.

I think it is fair to say that Virident targets the market where FusionIO has a monopoly right now, and it will finally bring some competition to the market, which I believe is good for end users. I am looking forward to price competition (not having real numbers, I can guess that vendors still put a high margin into the price), as well as high performance in general and stable performance under high load in particular, and also competition in the areas of capacity and data reliability.

The price list for Virident tachIOn cards already shows the price competition: the indicative price for the tachIOn 400GB is $13,600 (that is $34/GB), and the entry-level card is 200GB with a price of $6,800 (there is also a 300GB card in the product line). The price for the FusionIO 160GB SLC (from dell.com, price on 14-Jun-2010) is $6,308.99 (that is about $39.4/GB).
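
The $/GB figures follow directly from the list prices. A minimal Python sketch (the helper name is invented, the prices are the ones quoted above):

```python
def price_per_gb(price_usd, capacity_gb):
    """List price divided by user-visible capacity."""
    return price_usd / capacity_gb


cards = {
    "Virident tachIOn 400GB": (13600.00, 400),
    "Virident tachIOn 200GB": (6800.00, 200),
    "FusionIO 160GB SLC": (6308.99, 160),
}

for name, (price_usd, capacity_gb) in cards.items():
    print(f"{name}: {price_per_gb(price_usd, capacity_gb):.2f} $/GB")
```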

A couple of words about the product. I know that the Virident engineering team was concentrating on getting stable write performance in long-running write activities and in cases when space utilization is close to 100%. As you may know (check my presentation), SSD design requires background “garbage collector” activity, which requires space to operate, and the Virident card already has enough space reserved to get stable write performance even when the disk is almost full.

As for reliability, I think the design of the card is quite neat. The card itself contains a bunch of replaceable flash modules, and each individual module can be changed in case of failure. Also, the modules are joined internally in a RAID (it is fully transparent for the end user).

All this gives a good level of confidence in data reliability: if a single module fails, the internal RAID allows operations to continue, and after the module is replaced, the RAID is rebuilt. It still leaves the controller on the card as a single point of failure, but in this case all flash modules can be safely relocated to a new card with a working controller. (Note: this was not tested by Percona engineers, but taken from the vendor’s specification.)

As for power failures, the flash modules also come with capacitors which guarantee data delivery to the final media even if power is lost on the main host. (Note: this was not tested by Percona engineers, but taken from the vendor’s specification.)

Now to the most interesting part: performance numbers. I took the sysbench fileio benchmark with a 16KB blocksize to see what maximal performance we can expect.

Server specification is:

  • Supermicro X8DTH series motherboard
  • 2 x Xeon E5520 (2.27GHz) processors w/HT enabled (16 cores)
  • 64GB of ECC/Registered DDR3 DRAM
  • CentOS 5.3, kernel 2.6.18-164
  • Filesystem is XFS, formatted with the mkfs.xfs -s size=4096 option (size=4096, the sector size, is very important for having aligned I/O requests) and mounted with the nobarrier option
  • Benchmark: sysbench fileio on 100GB file, 16KB blocksize

The raw results are available on the Wiki.

And here are the graphs for random reads, random writes, and sequential writes:

I think it is very interesting to see the distribution of the 95% response time results (a time of 0 is obviously a problem in sysbench, which does not have enough time resolution for such very fast operations).

As you can see, we can get about 400MB/sec random write bandwidth with 8-16 threads, with response time below 3.1ms (for 8 threads) and 3.8ms (for 16 threads) in 95% of cases.
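
For reference, a 95% response time is the value below which 95% of the requests complete. The Python sketch below uses the nearest-rank definition; sysbench computes its own percentile internally and may use a different method, and the sample latencies here are made up.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least `pct` percent of samples are <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole percentages stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up per-request latencies in milliseconds, with one slow outlier.
latencies_ms = [0.4, 0.5, 0.5, 0.6, 0.7, 0.9, 1.2, 2.0, 3.1, 300.0]
print(percentile_nearest_rank(latencies_ms, 95))
print(percentile_nearest_rank(latencies_ms, 50))
```

Note how a single slow request dominates the high percentiles of a small sample, which is why the maximal response time deserves a separate look.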

One issue I should mention here is that, despite the good response time results, the maximal response time in some cases can jump to 300 ms per request. I was told that this corresponds to garbage collector activity and will be fixed in the production release of the driver.

I think it is fair to make a comparison with a FusionIO card, especially for the write pressure case. As you may know, FusionIO recommends space reservation to get sustainable write performance (Tuning Techniques for Writes).

I took a FusionIO ioDrive 160GB SLC card and tested the fully formatted card (file size 145GB), the card formatted with 25% space reservation (file size 110GB), and the Virident card with a 390GB file size. This also allows us to see whether the Virident tachIOn card can sustain writes when fully utilized.

As a disclaimer, I want to mention that the Virident tachIOn card was fine-tuned by Virident engineers, while the FusionIO card was tuned only by me, and I may not have all the knowledge needed for FusionIO tuning.

The first graph is random reads, to compare read performance:

As you can see, with 1 and 4 threads FusionIO is better, while with more threads the Virident card scales better.

And now random writes:

You can see that FusionIO definitely needs space reservation to provide high write bandwidth, and it comes with a cost hit (with 25% of the space reserved, the same price buys only 75% of the capacity, which is about a 33% increase in $/GB).
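
The cost of space reservation can be estimated with a simple model: the price stays the same while the usable capacity shrinks by the reserved fraction. The Python sketch below is only that model, with an invented function name; note that under it, reserving 25% of the space raises the effective $/GB by about a third.

```python
def price_increase_from_reservation(reserved_fraction):
    """Relative increase in $/GB when `reserved_fraction` of capacity is set aside."""
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    return 1 / (1 - reserved_fraction) - 1


increase = price_increase_from_reservation(0.25)
print(f"{increase * 100:.1f}% higher $/GB with 25% of the space reserved")
```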

In conclusion, I can highlight:

  • I am impressed with the architecture design with replaceable individual flash modules; I think it establishes a new high-end standard for flash devices
  • With a single card you can get over 1GB/sec bandwidth in random reads (16-64 working threads), which is the best result I have seen so far (again, for a single card)
  • Random write bandwidth exceeds 400MB/sec (8-16 working threads)
  • Random read/write mix results are also impressive, which can be quite important in workloads like FlashCache, where the card has both concurrent read and write pressure
  • Quite stable sequential write performance (important for log-related activity in MySQL)

I am looking forward to presenting results for the sysbench oltp and tpcc workloads, and also in FlashCache mode.


Entry posted by Vadim |
15 comments


Jun 02, 2010
--

FlashCache: tpcc workload with FusionIO card as cache

This run is very similar to the one I did on the Intel SSD X25-M card, but now I use a FusionIO 80GB SLC card. I chose this card as the smallest available card (and therefore the cheapest; on Dell.com you can see it for about $3K). There is also the FusionIO IO-Xtreme 80GB card, which, however, is MLC based and might not be the best choice for FlashCache usage (as there is a high write rate on FlashCache for both reading from and writing to disks, so the lifetime could be short).

Also, the Facebook team released a WriteThrough module for FlashCache, which could be a good trade-off if you want an extra guarantee of data consistency and your load is mostly read-bound, so I tested this mode also.

The whole setup is similar to the previous post, so let me just post the results with FlashCache on FusionIO in 20% dirty pages, 80% dirty pages, and write-through modes. I used the full 80GB for caching (the total size of the data is about 100GB).

Conclusions from the graph:

  • With 80% dirty pages we have about 4x better throughput (compared to RAID).
  • Write-through mode gives about a 2x gain, but remember that the load is very write intensive and all benefits in write-through mode come only from cached reads, so it is pretty good for this scenario

With this post I finish my runs on FlashCache for now. I think it may be considered for real usage; at least you may evaluate how it works on your workloads.


Entry posted by Vadim |
20 comments


Apr 08, 2010
--

Should I buy a Fast SSD or more memory?

While a scale-out solution has traditionally been popular for MySQL, it’s interesting to see what room we now have to scale up – cheap memory, fast storage, better power efficiency.  There certainly are a lot of options now – I’ve been meeting about a customer/week using Fusion-IO cards.  One interesting choice I’ve seen people make however, is buying an SSD when they still have a lot of pages read/second – I would have preferred to buy memory instead, and use the storage device for writes.

Here’s the benchmark I came up with to confirm if this is the case:

  • Percona-XtraDB-9.1 release
  • Sysbench OLTP workload with 80 million rows (about 18GB worth of data+indexes)
  • XFS Filesystem mounted with nobarrier option.
  • Tests run with:
    • RAID10 with BBU over 8 disks
    • Intel SSD X25-E 32GB
    • FusionIO 320GB MLC
  • For each test, run with a buffer pool of between 2G and 22G (to test performance compared to memory fit).
  • Hardware was our Dell 900 (specs here).

To start with, we have a test on the RAID10 storage to establish a baseline.  The Y axis is transactions/second (more is better), the X axis is the size of innodb_buffer_pool_size:

Let me point out three interesting characteristics about this benchmark:

  • The A arrow is when data fits completely in the buffer pool (best performance). It’s important to point out that once you hit this point, a further increase in memory does not help at all.
  • The B arrow is where the data just started to exceed the size of the buffer pool.  This is the most painful point for many customers – because while memory decreased by only ~10% the performance dropped by 2.6 times!  In production this usually matches the description of “Last week everything was fine.. but it’s just getting slower and slower!”.  I would suggest that adding memory is by far the best thing to do here.
  • The C arrow shows where data is approximately three times the buffer pool.  This is an interesting point to zoom in on – since you may not be able to justify the cost of the memory, but an SSD might be a good fit:

Where the C arrow was, in this graph a Fusion-IO card improves performance by about five times (or 2x with an Intel SSD).  To get the same 2x improvement with memory, you would have needed to add 60% more memory, or 260% more memory for a 5x improvement.  Imagine a situation where your C point is when you have 32GB of RAM and 100GB of data.  Then it gets interesting:

  • Can you easily add another 32G RAM (are your memory slots already filled?)
  • Does your budget allow to install SSD cards? (You may still need more than one, since they are all relatively small.  There are already appliances on the market which use 8 Intel SSD devices).
  • Is a 2x or 5x improvement enough?  There are more wins to be had if you can afford to buy all the memory that is required.

The workload here is designed to keep as much of the data hot as possible, but I guess the main lesson here is not to underestimate the size of your “active set” of data.  For some people who just append data to some sort of logging table it may only need to be a small percentage – but in other cases it can be considerably higher.  If you don’t know what your working set is – ask us!

Important note: This graph and these results are valid only for sysbench uniform. In your particular workload the points B and C may be located differently.

Raw results:

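| Buffer pool, GB | FusionIO | Intel SSD | RAID 10 |
|-----------------|----------|-----------|---------|
| 2               | 450.3    | 186.33    | 80.67   |
| 4               | 538.19   | 230.35    | 99.73   |
| 6               | 608.15   | 268.18    | 121.71  |
| 8               | 679.44   | 324.03    | 201.74  |
| 10              | 769.44   | 407.56    | 252.84  |
| 12              | 855.89   | 511.49    | 324.38  |
| 14              | 976.74   | 664.38    | 429.15  |
| 16              | 1127.23  | 836.17    | 579.29  |
| 18              | 1471.98  | 1236.9    | 934.78  |
| 20              | 2536.16  | 2485.63   | 2486.88 |
| 22              | 2433.13  | 2492.06   | 2448.88 |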
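
The speedups quoted for the C point can be recomputed from the raw results. The Python sketch below assumes that C corresponds to the 6GB buffer pool row (about 18GB of data is three times 6GB); that mapping and the function name are assumptions made for illustration.

```python
# Buffer pool size in GB -> transactions/sec for (FusionIO, Intel SSD, RAID 10).
raw_results = {
    2: (450.3, 186.33, 80.67),
    4: (538.19, 230.35, 99.73),
    6: (608.15, 268.18, 121.71),
    8: (679.44, 324.03, 201.74),
    10: (769.44, 407.56, 252.84),
    12: (855.89, 511.49, 324.38),
    14: (976.74, 664.38, 429.15),
    16: (1127.23, 836.17, 579.29),
    18: (1471.98, 1236.9, 934.78),
    20: (2536.16, 2485.63, 2486.88),
    22: (2433.13, 2492.06, 2448.88),
}


def speedup_over_raid(buffer_pool_gb):
    """Throughput of each flash device relative to RAID 10 at one buffer pool size."""
    fusionio, intel_ssd, raid10 = raw_results[buffer_pool_gb]
    return {"FusionIO": fusionio / raid10, "Intel SSD": intel_ssd / raid10}


for device, factor in speedup_over_raid(6).items():
    print(f"{device}: {factor:.1f}x over RAID 10 with a 6GB buffer pool")
```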

Entry posted by Vadim |
8 comments
