Announcing Percona Live: San Francisco February 16th

Today we’re announcing Percona Live – a one day event to be held at the Bently Reserve on February 16th in San Francisco.  Live is our way of showcasing some of the awesome work that has been going into MySQL recently – and the theme of this event is Beyond MySQL 5.1.

Our first guest speaker is none other than Jeremy Zawodny. Jeremy is well known in the MySQL community as the original author of High Performance MySQL, 1st Edition. He will be presenting on how Craigslist has already upgraded to MySQL 5.5 and is running on Fusion-io SSDs in production.

Tickets are available for early bird registration at $50. To sign up, or for more information, please visit the Percona website.

Entry posted by Morgan Tocker |


Effect of InnoDB log block size 4096 bytes

In my post MySQL 5.5.8 and Percona Server: being adaptive, I mentioned that I used innodb-log-block-size=4096 in Percona Server to get better throughput, but later Dimitri, in his article MySQL Performance: Analyzing Percona’s TPCC-like Workload on MySQL 5.5, expressed doubt that it really makes sense. Here is a quote from his article:

“Question: what is a potential impact on buffered 7MB/sec writes if we’ll use 4K or 512 bytes block size to write to the buffer?.. ;-) )
There will be near no or no impact at all as all writes are managed by the filesystem, and filesystem will use its own block size.. – Of course the things may change if “innodb_flush_log_at_trx_commit=1″ will be used, but it was not a case for the presented tests..”

Well, sure, you do not need to believe me; you should demand real numbers. So I have numbers to show you.

I took a Dell PowerEdge R900 server with 32GB of RAM and a FusionIO 320GB MLC card, and ran the tpcc-mysql benchmark with 500W using Percona Server 5.5.8.

Here is the relevant part of the config I used:


  innodb_buffer_pool_size=26G
  innodb_data_file_path=ibdata1:10M:autoextend
  innodb_file_per_table=1
  innodb_flush_log_at_trx_commit=2
  innodb_log_buffer_size=8M
  innodb_log_files_in_group=2
  innodb_log_file_size=4G
  innodb_adaptive_checkpoint=keep_average
  innodb_thread_concurrency=0
  innodb_flush_method=O_DIRECT
  innodb_read_ahead=none
  innodb_flush_neighbor_pages=0
  innodb_write_io_threads=16
  innodb_read_io_threads=16
  innodb_io_capacity=2000

I made two runs, one with the default innodb-log-block-size (512 bytes), and another with --innodb-log-block-size=4096. The full benchmark command is tpcc_start localhost tpcc500 root "" 500 24 10 3600.
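For reference, here is the invocation annotated argument by argument; the labels are my reading of the tpcc-mysql README, so treat them as assumptions rather than documentation:

```shell
# tpcc_start arguments, as I understand them:
#          host      database user pass warehouses sessions rampup(s) duration(s)
tpcc_start localhost tpcc500  root ""   500        24       10        3600

# The two runs differed only in the log block size the server was started with:
#   mysqld ... --innodb-log-block-size=512    (default, first run)
#   mysqld ... --innodb-log-block-size=4096   (second run)
```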

From the graph you can see that there is quite a significant impact when we use --innodb-log-block-size=4096.

The average throughput for the last 15 minutes is 38090.66 NOTPM in the first run and 49130.13 NOTPM in the second, a 1.29x increase. I can’t say this is “near no or no impact”.

What is the cause of such a difference? I am not really sure. Apparently the FusionIO driver is sensitive to I/O block size. I know that other SSD/Flash drives prefer I/O sized in multiples of their internal block size (which is often 4096 bytes), but I do not know if the effect is the same as on FusionIO.

Here is the CPU usage graph (user and system) for both cases:

You can see that with the 4096-byte block size, USER and SYS CPU are utilized much better, meaning that IDLE time is much lower.
Is this a contention issue in the FusionIO driver with 512-byte I/O? It may be.

I am also not sure what causes the strange hill on the throughput line with 512 bytes, but it is quite repeatable.
My blind guess (and I have no proof) is that, again, something is going on inside the FusionIO driver,
but that is a topic for further research.

For the record, here is the FusionIO card information:


  Found 1 ioDrive in this system
  Fusion-io driver version: 2.2.0 build 82
  fct0    Attached as ‘fioa’ (block device)
          Fusion-io ioDrive 320GB, Product Number:FS1-002321-CS SN:10973
          ioDIMM3, PN:00119401203, Mfr:004, Date:20091118
          Firmware v5.0.5, rev 43674

Entry posted by Vadim |



MySQL 5.5.8 – in search of stability

A couple of days ago, Dimitri published a blog post, Analyzing Percona’s TPCC-like Workload on MySQL 5.5, which was a response to my post, MySQL 5.5.8 and Percona Server: being adaptive. I will refer to Dimitri’s article as article [1]. As always, Dimitri has provided a very detailed and thoughtful article, and I strongly recommend reading it if you want to understand how InnoDB works. In his post, Dimitri questioned some of my conclusions, so I decided to take a more detailed look at my findings. Let me show you my results.

Article [1] recommends using the innodb_max_dirty_pages_pct and innodb_io_capacity parameters to get stable throughput in MySQL 5.5.8. Let’s see what we can do with them. Article [1] also advises that innodb_log_file_size is not important for stable throughput.

For my tests, I again used the Cisco UCS C250 box with 384GB of RAM, and I ran the tpcc-mysql benchmark with 500W (about 50GB of data) on the FusionIO 160GB SLC card. For innodb_buffer_pool_size I used 26GB, to represent about a 1/2 ratio of buffer_pool_size to data.

For the initial tests, I used MySQL 5.5.8 (the tar.gz binary from dev.mysql.com), and for the other tests I used Percona Server based on 5.5.8. Addressing a complaint about my previous post, I am sharing the percona-server-5.5.8.tar.gz I used for testing, but please note: It is very pre-beta and should not be used in production. You can download it from our TESTING area.

In order to test different settings in a short period of time, I used 30-minute runs, which may not be long enough to see the long-term trend, but we will see the effects anyway. The full command line to run the test is: tpcc_start localhost tpcc500w root "" 500 32 10 1800. For a better understanding of the results, for each run I will show several graphs:

  • benchmark throughput – This is New Order Transactions per 10 seconds.
  • dirty page – This graph will contain the percentage of dirty pages in the InnoDB buffer pool. This value is calculated from the output of mysqladmin ext -i10 using this formula: (100*Innodb_buffer_pool_pages_dirty)/(1+Innodb_buffer_pool_pages_data+Innodb_buffer_pool_pages_free). This is the exact formula that InnoDB uses internally to estimate current innodb_dirty_pages_pct.
  • checkpoint age – This is a value in MB or GB and shows what amount of the space in innodb_log_file corresponds to changed pages in the buffer pool. You can compute this value as Log sequence number - Last checkpoint at from SHOW ENGINE INNODB STATUS.
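Both derived metrics can be computed with a short script. The counter names below are the real InnoDB status variables; the sample values are made up purely for illustration:

```python
def dirty_pages_pct(pages_dirty, pages_data, pages_free):
    """InnoDB's internal estimate of the dirty pages percentage,
    from the Innodb_buffer_pool_pages_* status counters."""
    return (100 * pages_dirty) / (1 + pages_data + pages_free)

def checkpoint_age_mb(log_sequence_number, last_checkpoint_at):
    """Checkpoint age: redo log bytes not yet checkpointed, from
    SHOW ENGINE INNODB STATUS, converted to MB."""
    return (log_sequence_number - last_checkpoint_at) / (1024 * 1024)

# Hypothetical sample values:
print(round(dirty_pages_pct(900_000, 1_500_000, 100_000), 1))   # dirty pages, %
print(round(checkpoint_age_mb(5_000_000_000, 2_000_000_000)))   # checkpoint age, MB
```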

Here are the InnoDB settings for the initial run. Later I will change them in searching for optimal values.


  innodb_file_per_table = true
  innodb_data_file_path = ibdata1:10M:autoextend
  innodb_flush_log_at_trx_commit = 2
  innodb_flush_method = O_DIRECT
  innodb_log_buffer_size = 64M
  innodb_buffer_pool_size = 26G
  innodb_buffer_pool_instances=16
  innodb_log_file_size = 2000M
  innodb_log_files_in_group = 2
  innodb_read_io_threads = 16
  innodb_write_io_threads = 16
  innodb_purge_threads=1
  innodb_adaptive_flushing=1
  innodb_doublewrite=1

Please note that initially I used the default value for innodb_max_dirty_pages_pct, which is 75, and the default value for innodb_io_capacity, which is 200. I also enabled innodb_doublewrite. As will become apparent later, it is quite a critical parameter.

So, the results for the initial run, using MySQL 5.5.8:

Let me explain the second graph a little. I put checkpoint age and dirty pages percentage on the same graph to show the relationship between them. Checkpoint age is shown by the red line, using the left Y-axis. Dirty pages are shown by the blue line, using the right Y-axis.

As expected, throughput jumps up and down. Checkpoint age is stable and is about 2854.02 MB. Checkpoint age is the limiting factor here, as InnoDB tries to keep the checkpoint age within 3/4 of the limit of the total log size (total size is 2000MB*2).
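The arithmetic behind that limit is simple; here is a sketch (the 3/4 factor is the rule of thumb described above, not an exact InnoDB constant):

```python
log_file_size_mb = 2000      # innodb_log_file_size
log_files_in_group = 2       # innodb_log_files_in_group

total_log_mb = log_file_size_mb * log_files_in_group  # 4000 MB of redo log
age_limit_mb = total_log_mb * 3 / 4                   # the ~3/4 checkpoint age limit

print(age_limit_mb)  # the observed stable age of ~2854 MB sits just under this
```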

The 15-minute average throughput is 59922.8 NOTPM.

Okay, now following the advice in article [1], we will try to limit the percentage of dirty pages and increase I/O capacity.
So, I will set innodb_max_dirty_pages_pct=50 and innodb_io_capacity=20000.

Here are the results:

As we see, throughput is getting into better shape, but is far from being a straight line.
If we look at the checkpoint age/dirty pages graph, we see that the dirty pages percentage is not respected, and is getting up to 70%. And again we see the limiting factor is checkpoint age, which is getting up to 3000MB during the run.

The 15-minute average result for this test is 41257.6 NOTPM.

So, it seems we are not getting the stable result of article [1], and the difference is the doublewrite area. Doublewrite activity actually adds significant I/O activity. Basically, it doubles the amount of writes :) , as you see from its name. So, let’s see what result we have when we disable doublewrite; that is, set innodb_doublewrite=0.

Now, although throughput is not a perfect line, we see a totally different picture for dirty pages and checkpoint age.
The dirty page maximum of 50% is still not respected by InnoDB, but the checkpoint age drops far below the 3000MB line. It is now on about the 1500MB line.

The 15-minute average result for this test is 63898.13 NOTPM. That is, by disabling the doublewrite area, we improved the result 1.55x.

As it seems hard for InnoDB to keep 50% dirty pages, let’s try 60%.

Here is the run with innodb_max_dirty_pages_pct=60.

Okay, now we finally see throughput more or less flat. The dirty page percentage is kept at the 60% level, and checkpoint age is at the 2000MB level; that is, not bounded by innodb_log_file_size.

The 15-minute average result for this test is 64501.33 NOTPM.

But we still have DOUBLEWRITE=OFF.

Since now we are limited by innodb_max_dirty_pages_pct, what will be the result if we try to increase it to 70% ?

It seems 70% is too big, and now we again hit the limit set by innodb_log_file_size.

The 15-minute average result for this test is 57620.6 NOTPM.

Let me summarize so far. With innodb_doublewrite disabled, we have stable throughput only with innodb_max_dirty_pages_pct=60. Setting this value to 50 or 70 gives us dips in throughput, though for different reasons. In the first case, InnoDB is unable to maintain the 50% level; in the second we are limited by the capacity of REDO logs.

So, what do we get if we again enable innodb_doublewrite, but we now set innodb_max_dirty_pages_pct=60?

This is a bummer. Throughput again jumps up and down. The dirty pages percentage is not respected, and InnoDB is not able to maintain it. And checkpoint age is back to 3000MB and again limited by innodb_log_file_size.

The 15-minute average result is 37509.73 NOTPM.

Okay, so what if we try an even smaller innodb_max_dirty_pages_pct, setting it to 30? (I use a 1-hour run in this case.)

The results:

I can’t say whether the result should be considered stable. There is still a lot of variation.

The 15-minute average result is 37039.73 NOTPM.

Let’s try an even larger decrease, setting innodb_max_dirty_pages_pct=15.

This seems to be the most stable line I can get with MySQL 5.5.8.

The 15-minute average result is 37235.06 NOTPM.

This allows me to draw a conclusion which partially concurs with the conclusion in article [1]. My conclusion is: With doublewrite enabled, you can get a more or less stable line in MySQL 5.5.8 by tuning innodb_max_dirty_pages_pct and innodb_io_capacity; but the limiting factor is still innodb_log_file_size.

To prove it, I took Percona Server based on 5.5.8 and ran it in MySQL mode (that is, using adaptive_flushing from InnoDB and with the adaptive_checkpoint algorithm disabled), but with giant log files. I used a log file of 8000MB*2, just to see what the maximum checkpoint age is.

Okay, here are the results:

Success! With a big log file, we are getting stable throughput. Checkpoint age jumps up to the 3900MB line, but the dirty pages percentage is not kept within the 60% line, going instead up to the 70% limit. That is, to get this stable throughput, we need a total log file size of about 3900MB / 0.75 ≈ 5200MB.
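The same 3/4 rule of thumb can be inverted to estimate how much total log space a given peak checkpoint age requires; a sketch (the factor is an approximation, not an exact InnoDB constant):

```python
def required_log_space_mb(peak_checkpoint_age_mb):
    """Total redo log size needed so the observed peak checkpoint age
    stays within roughly 3/4 of the log."""
    return peak_checkpoint_age_mb / 0.75

# Peak checkpoint ages observed in the runs in this post:
for age_mb in (3900, 7000, 10500):
    print(age_mb, round(required_log_space_mb(age_mb)))
```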

The 15-minute average result for this test is 48983 NOTPM.

But what about innodb_max_dirty_pages_pct; can we get better results if we increase it? It’s not respected anyway.
Let’s try the previous run, but with innodb_max_dirty_pages_pct=75.

The 75% dirty pages line is at a stable level now, but something happened with throughput. It doesn’t have holes, but it still oscillates. Checkpoint age is quite significant, reaching 7000MB in the stable area, meaning you need about 9300MB of log space.

The 15-minute average result for this test is 55073.06 NOTPM.

What can be the reason? Let’s try a guess: flushing of neighbor pages.
Let’s repeat the last run, but with innodb_flush_neighbor_pages=0.

Okay, we are back to a stable level. Checkpoint age is also back to 3000MB, and dirty pages are stable as well, but getting to 77%. I am not sure why it is more than 75%. That is a point for further research, but you are probably as tired of all these graphs as I am.

The 15-minute average result for this test is 52679.93 NOTPM. This is 1.4x better than we have with the stable line in MySQL 5.5.8.

But, finally, let me show the result I got running Percona Server in optimized mode:


  innodb_buffer_pool_size = 26G
  innodb_buffer_pool_instances=1
  innodb_log_file_size = 8000M
  innodb_log_files_in_group = 2
  innodb_read_io_threads = 16
  innodb_write_io_threads = 16
  innodb_io_capacity=500
  innodb_max_dirty_pages_pct = 60
  innodb_purge_threads=1
  innodb_adaptive_flushing=0
  innodb_doublewrite=1
  innodb_flush_neighbor_pages=0
  innodb_adaptive_checkpoint=keep_average

The 15-minute average result is 73529.73 NOTPM.

The throughput is about 1.33x better than in “MySQL compatible mode”, though it requires 10500MB for checkpoint age; that is, 14000MB of log space. And, the Percona Server result is ~2x better than the best result I received with MySQL 5.5.8 (with innodb_doublewrite enabled).

In summary, my conclusion is: You can try to get stable throughput in MySQL 5.5.8 by playing with innodb_max_dirty_pages_pct and innodb_io_capacity and having innodb_doublewrite enabled. But you must have the support of big log files (>4GB) to help increase throughput.

Basically, by lowering innodb_max_dirty_pages_pct, you are killing your throughput. When you disable innodb_doublewrite, you can get stable throughput if you are lucky enough to find a magic innodb_max_dirty_pages_pct value. As you saw in the results above, 50 and 70 are not good enough, and only 60 gives stable throughput.

(Post edited by Fred Linhoss)

Entry posted by Vadim |



Write performance on Virident tachIOn card

This is a crosspost from http://www.ssdperformanceblog.com/.

Disclaimer: The benchmarks were done as part of our consulting practice, but this post is totally independent and fully reflects our opinion.

One of the biggest problems with solid state drives is that write performance may drop significantly as free space decreases. I wrote about this before (http://www.ssdperformanceblog.com/2010/07/free-space-and-write-performance/), using a FusionIO 320GB Duo card as the example. In that case, when space utilization increased from 100GB to 200GB, write performance dropped 2.6 times.

In this regard, Virident claims that tachIOn cards provide “Sustained, predictable random IOPS – Best in the Industry”. Virident generously provided me with a 400GB model, and I ran the benchmark using the same methodology as in my experiment with FusionIO, which was striped for performance. Also, using my script, Virident made runs on 200GB and 800GB tachIOn cards and shared the numbers with me (to be clear, I can certify only the numbers for the 400GB card, but I have no reason to question the numbers for the 200GB and 800GB cards, as they correspond to my results).

The benchmarks were done on a Cisco UCS C250 box, and the raw results are on the Benchmarks Wiki.

Visually, the drop is not as drastic as it was in the FusionIO case, but let’s get some numbers.
I am going to take the performance numbers at the points where the card is 1/3, 1/2, and 2/3 full, as well as at the point where the card is completely full. Then I will compute the ratio of the throughput at the 1/3 point to the throughput at each of those points.

**For the 400GB tachIOn card:**

Size, GB   Throughput, MiB/sec   Ratio
130        959.17                1.00
200        849.58                1.13
260        685.18                1.40
360        417.33                2.29
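The ratio column can be recomputed directly from the throughput numbers; a quick check:

```python
# Throughput (MiB/sec) of the 400GB tachIOn card at each fill level (GB used).
throughput = {130: 959.17, 200: 849.58, 260: 685.18, 360: 417.33}
base = throughput[130]  # the 1/3-full reference point

for size_gb, mibs in throughput.items():
    print(size_gb, round(base / mibs, 2))
```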

That is, at the 2/3 point, the 400GB card is slower by 29% than at the 1/3 point, and at full capacity it is slower by 57%.

Observations from looking at the graph:

* You can also see that the card never goes below 400MB/sec, even when working at full capacity. This characteristic (i.e., high throughput at full capacity) is very important to know if you are looking to use an SSD card as a cache layer (say, with FlashCache), as for caching you will usually try to fill all available space.
* The ratio between the 1/3 capacity point and full capacity point is much smaller compared with FusionIO Duo (without additional spare capacity reservation).
* Also, looking at the graphs for Virident and comparing them with the graphs for FusionIO, one might be tempted to say that Virident just has a lot of space reserved internally which is not exposed to the end user, and that this is what they use to guarantee a high level of performance. I checked with Virident, and they tell me this is not the case. Actually, from the diagnostic info on the Wiki page you can see: tachIOn0: 0x8000000000 bytes (512 GB), which I assume is the total installed memory. Regardless, it is not a point to worry about. For pricing, Virident defines GB as the capacity available to end users. So a competitive $/GB level is maintained, and it does not matter how much space is reserved internally.

Now it would be interesting to compare these results with the results for FusionIO. As the cards have different capacities, I made a graph showing throughput vs. percentage of used capacity for both cards, the FusionIO 320GB Duo SLC and the Virident 400GB SLC:

Util %   Duo 320GB, MiB/sec   tachIOn 400GB, MiB/sec   tachIOn/Duo, %
20%      1,095                990                      90%
30%      1,006                980                      97%
40%      825                  964                      117%
50%      613                  872                      142%
60%      397                  783                      197%
70%      308                  669                      217%
80%      237                  611                      258%
90%      117                  502                      429%
100%     99                   417                      421%

In conclusion:
* On a single Virident card, I see random write throughput close to or over 1GB/sec at low space usage, which is comparable to the throughput I got on a striped FusionIO setup. I assume Virident maintains a good level of parallelism internally.
* The Virident card maintains a very good throughput level close to full capacity, which means you do not need to worry (or can worry less) about space reservation or formatting the card with less space.

Entry posted by Vadim |

Add to: delicious | digg | reddit | netscape | Google Bookmarks


Is there a benefit from having more memory?

My post back in April, http://www.mysqlperformanceblog.com/2010/04/08/fast-ssd-or-more-memory/, generated quite a bit of interest, especially on the topic of SSD vs. memory.

At that time I used a fairly small dataset, so it raised further questions, like: should we have more than 128GB of memory?
And if we use a fast solid state drive, should we still look to increase memory, or does that configuration already provide the best possible performance?

To address this, I took a Cisco UCS C250 server in our lab, with 384GB of memory and a FusionIO 320GB MLC card. I generated 230GB of data for the sysbench benchmark and ran read-only and read-write OLTP workloads with the buffer pool size varying from 50GB to 300GB (with O_DIRECT, so the OS cache is not used).

This allows us to see the effect of having more memory available.

The graph result is:

and the raw numbers are on the Benchmarks Wiki.

So let’s take a detailed look at the numbers for 120GB (as if you had a system with 128GB of RAM) and 250GB:

Buffer pool   Read-only, tps       Read-write, tps
120GB         1866.87              2547.69
250GB         5656.62 (ratio 3x)   7633.38 (ratio 2.99)
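The speedups quoted in the table are straightforward to recompute:

```python
# tps at the two buffer pool sizes, from the table above.
read_only = {120: 1866.87, 250: 5656.62}
read_write = {120: 2547.69, 250: 7633.38}

print(round(read_only[250] / read_only[120], 2))    # read-only speedup
print(round(read_write[250] / read_write[120], 2))  # read-write speedup
```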

So you see, doubling memory gives a 3x (!) performance improvement, despite the data being stored on one of the fastest available storage devices.

So, to get the best possible performance, our advice is still the same: try to fit your active dataset in memory. That is possible nowadays, as systems with 300GB+ of RAM are already available.

Entry posted by Vadim |



SSD: Free space and write performance

(cross-posted from the SSD Performance Blog)
In the previous post, On Benchmarks on SSD, a commenter touched on another interesting point: available free space significantly affects write performance on an SSD card. The reason is again the garbage collector, which operates more efficiently the more free space it has. To read more on garbage collection and the write problem, check the Write amplification wiki page.

To see how performance drops with decreasing free space, let’s run the sysbench fileio random write benchmark with different file sizes.
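A minimal sketch of such a sequence of runs (sysbench 0.4 fileio syntax; the sizes, thread count, and durations here are placeholders, not the exact settings used for the graphs):

```shell
# Larger --file-total-size leaves less free space on the card,
# so looping over sizes shows how write throughput degrades.
for SIZE in 100G 150G 200G; do
    sysbench --test=fileio --file-total-size=$SIZE --file-num=64 prepare
    sysbench --test=fileio --file-total-size=$SIZE --file-num=64 \
             --file-test-mode=rndwr --file-extra-flags=direct \
             --num-threads=4 --max-time=600 --max-requests=0 run
    sysbench --test=fileio --file-total-size=$SIZE --file-num=64 cleanup
done
```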

For the test I took a FusionIO 320 GB SLC PCIe DUO™ ioDrive card, with software striping between the two cards. Here is a graph of how throughput depends on available free space (the bigger the file, the less free space):
The system specification and the scripts used are on the Benchmark Wiki.

On the graph you can see two lines (yes, there are two lines, even though they are almost identical).
The first line is the result when the FusionIO card is formatted to use full capacity, and the second line is the case when I use additional space reservation (25% in this case, that is, 240GB available). There is no difference in this case; however, additional over-provisioning protects you from overusing space and keeps performance at the corresponding level.

It is clear that the maximal throughput strongly depends on the available free space.
With 100GiB utilization we get 933.60 MiB/sec,
with 150GiB (half of capacity) 613.48 MiB/sec, and
with 200GiB it drops to 354.37 MiB/sec, which is 2.6x less compared with 100GiB.

So, returning to the question of how to run a proper benchmark: the result depends significantly on what percentage of the card’s space is used. The results for a 100GiB file on a 160GB card will differ from the results for a 100GiB file on a 320GB card.

Besides free space, performance also depends on the garbage collector algorithm itself, so cards from different manufacturers will show different results. Some newly arriving cards make high performance at high space utilization a competitive advantage, and I am going to run the same analysis on different cards.

Entry posted by Vadim |




SLC vs MLC

(cross-posted from SSDPerformanceBlog.com)
All modern solid state drives use NAND memory based on SLC (single level cell) or MLC (multi level cell) technology.
Without going into physical details: SLC stores 1 bit of information per cell, while MLC can store more. The most popular option for MLC is 2 bits, and there is movement in the 3-bit direction.

This gives us the following characteristics:

  • SLC provides less capacity
  • SLC is more expensive
  • SLC is known to have better quality chips

Along with that, there is also a limit on the number of write operations. SLC can handle about 100,000 write cycles, while MLC handles about 10,000 (the numbers are rough and keep changing with technology improvements).

No wonder that vendors very quickly came up with the following separation:

  • SLC for enterprise market ( servers )
  • MLC for consumer market ( desktops, workstations, laptops)

An obvious example here is Intel’s SSD cards: the X25-E (SLC) is sold as an enterprise-level card, and the X25-M (MLC) is sold for the mass market. As another example of the difference in capacity and price:

  • FusionIO 160GB SLC card price $6,308.99
  • FusionIO 320GB MLC card price $6,729.99

That is, for roughly the same price, the MLC card comes with double the capacity.

However, with increasing capacity, the difference between MLC and SLC is getting fuzzier. For MLC, the most critical part is the software (firmware) algorithm that ensures uniform usage of the available NAND chips, and with bigger capacity it is much easier to implement.
This problem of managing lifetime and write cycles for MLC opened the way for hardware solutions like the SandForce controller, and recently Anobit announced that its “Memory Signal Processing (MSP™) technology enables MLC-based solutions at SLC-grade reliability and performance”.

Increasing capacity also helps MLC devices: if we take 10,000 writes vs. 100,000 writes, then to provide the same lifetime an MLC card needs about 10x more capacity, and that seems not to be a problem. I expect we will soon see 1600GB MLC cards, which ideally will have the same lifetime as 160GB SLC cards.
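The lifetime arithmetic can be sketched as follows (a deliberately crude model that ignores write amplification and wear-leveling overhead):

```python
def write_budget_gb(capacity_gb, write_cycles):
    """Total data (GB) writable before wear-out: capacity * per-cell cycles."""
    return capacity_gb * write_cycles

slc_budget = write_budget_gb(160, 100_000)   # 160GB SLC card
mlc_budget = write_budget_gb(1600, 10_000)   # hypothetical 1600GB MLC card

print(slc_budget == mlc_budget)  # the same raw write budget
```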

Along the same lines, it is interesting to see Intel announce that its enterprise line of SSD cards will be based on enterprise MLC, where each cell has a 30,000-write lifetime, with a maximal capacity of 400GB.

So it seems the market is gradually moving in the “MLC is ready for the enterprise” direction, which sounds like a good way to get devices with high capacity at a reasonable price in the near future.


Entry posted by Vadim |

Virident tachIOn: New player on Flash PCI-E cards market

(Note: This review was done as part of our consulting practice, but it is totally independent and fully reflects our opinion.)

In my talk at the MySQL Conference and Expo 2010, “An Overview of Flash Storage for Databases”, I mentioned that most likely other players were coming soon. I was not actually aware of any real names at the time; it was just a guess, as the PCI-E market is really attractive, so FusionIO could not stay alone for long. So I am not surprised to see a new card from Virident, and I was lucky enough to test a pre-production sample of the Virident tachIOn 400GB SLC card.

I think it is fair to say that Virident targets the space where FusionIO currently has a monopoly, and it will finally bring some competition to the market, which I believe is good for end users. I am looking forward to price competition (not having real numbers, I can guess that vendors still put a high margin into the price), as well as competition in general performance, in stable performance under high load in particular, and also in capacity and data reliability.

The price line for Virident tachIOn cards already shows price competition: the indicative price for the tachIOn 400GB is $13,600 (that is, $34/GB), and the entry-level card is 200GB, priced at $6,800 (there is also a 300GB card in the product line). The price for the FusionIO 160GB SLC (from dell.com, as of 14-Jun-2010) is $6,308.99 (that is, $39.5/GB).

A couple of words about the product. I know that the Virident engineering team concentrated on achieving stable write performance during long-running write activity and in cases when space utilization is close to 100%. As you may know (check my presentation), SSD design requires background “garbage collector” activity, which needs space to operate, and the Virident card already has enough space reservation to provide stable write performance even when the disk is almost full.

As for reliability, I think the design of the card is quite neat. The card itself contains a bunch of replaceable flash modules, and each individual module can be replaced in case of failure. Internally, the modules are joined in a RAID (this is fully transparent to the end user).

All this provides a good level of confidence in data reliability: if a single module fails, the internal RAID allows operations to continue, and after the module is replaced, it will be rebuilt. This still leaves the on-card controller as a single point of failure, but in that case all flash modules can be safely relocated to a new card with a working controller. (Note: this was not tested by Percona engineers, but taken from the vendor’s specification.)

As for power failures: the flash modules also come with capacitors, which guarantee data delivery to the final media even if power is lost on the main host. (Note: this was not tested by Percona engineers, but taken from the vendor’s specification.)

Now to the most interesting part: performance numbers. I used the sysbench fileio benchmark with a 16KB block size to see what maximal performance we can expect.

Server specification is:

  • Supermicro X8DTH series motherboard
  • 2 x Xeon E5520 (2.27GHz) processors w/HT enabled (16 cores)
  • 64GB of ECC/Registered DDR3 DRAM
  • CentOS 5.3, 2.6.18-164 kernel
  • Filesystem: XFS, formatted with the mkfs.xfs -s size=4096 option (size=4096, the sector size, is very important for aligned I/O requests) and mounted with the nobarrier option
  • Benchmark: sysbench fileio on 100GB file, 16KB blocksize
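The filesystem setup from the list above would look roughly like this (the device name is a placeholder):

```shell
mkfs.xfs -s size=4096 /dev/fioa          # 4096-byte sectors, for aligned I/O
mount -o nobarrier /dev/fioa /mnt/bench  # disable write barriers for this card
```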

The raw results are available on the Wiki.

Here are the graphs for random reads, random writes, and sequential writes:

I think it is very interesting to see the distribution of the 95% response time results (a 0 time is obviously a problem in sysbench, which does not have enough time resolution for such fast operations):

As you can see, we can get about 400MB/sec of random write bandwidth with 8-16 threads, with a 95% response time below 3.1ms (8 threads) and 3.8ms (16 threads).

As an issue here, I should mention that despite the good response time results, the maximal response time can in some cases jump to 300 ms per request. I was told this corresponds to garbage collector activity and will be fixed in the production release of the driver.

I think a comparison with the FusionIO card is fair, especially under write pressure. As you may know, FusionIO recommends space reservation to get sustainable write performance (Tuning Techniques for Writes).

I took a FusionIO ioDrive 160GB SLC card and tested the fully formatted card (file size 145GB), the card formatted with 25% space reservation (file size 110GB), and the Virident card with a 390GB file size. This also lets us see whether the Virident tachIOn card can sustain writes on a fully utilized card.

As a disclaimer, I should mention that the Virident tachIOn card was fine-tuned by Virident engineers, while the FusionIO card was tuned only by me, and I may not have all the knowledge needed for FusionIO tuning.

The first graph is random reads, so we can compare read performance:

As you see, with 1 and 4 threads FusionIO is better, while with more threads the Virident card scales better.

And now random writes:

You can see that FusionIO definitely needs space reservation to provide high write bandwidth, and this comes with a cost hit (25% space reservation -> 25% increase in $/GB).

In conclusion, I can highlight:

  • I am impressed with the architectural design with replaceable individual flash modules; I think it establishes a new high-end standard for flash devices
  • With a single card you can get over 1GB/sec of bandwidth in random reads (16-64 working threads), which is the maximal result I have seen so far (again, for a single card)
  • Random write bandwidth exceeds 400MB/sec (8-16 working threads)
  • The random read/write mix results are also impressive, which can be quite important in workloads like FlashCache, where the card is under concurrent read and write pressure
  • Quite stable sequential write performance (important for log-related activity in MySQL)

I am looking forward to presenting results for sysbench oltp and tpcc workloads, and also in FlashCache mode.

Entry posted by Vadim |



FlashCache: tpcc workload with FusionIO card as cache

This run is very similar to the one I did on the Intel SSD X25-M card, but now I use a FusionIO 80GB SLC card. I chose this card as the smallest (and therefore cheapest) available; on Dell.com you can see it for about $3K. There is also the FusionIO ioXtreme 80GB card, which is, however, MLC-based and may not be the best choice for FlashCache usage (FlashCache generates a high write rate for both reads from and writes to the disks, so the lifetime could be short).

Also, the Facebook team released a write-through module for FlashCache, which could be a good trade-off if you want an extra guarantee of data consistency and your load is mostly read-bound, so I tested this mode as well.

The whole setup is similar to the previous post, so let me just post the results with FlashCache on FusionIO in the 20% dirty pages, 80% dirty pages, and write-through modes. I used the full 80GB for caching (the total size of the data is about 100GB).
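For reference, setting up such a cache looks roughly like this. The device names are placeholders, and the sysctl name may differ between FlashCache versions, so treat this as a sketch of the procedure rather than exact commands:

```shell
# Create a write-back cache device from the SSD and the backing disk:
flashcache_create -p back cachedev /dev/fioa /dev/sdb

# Dirty-page threshold, adjusted for the 20% and 80% runs:
sysctl -w dev.flashcache.dirty_thresh_pct=80

# For the write-through runs, the separate write-through module released
# by the Facebook team was used instead of the write-back mode above.
```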

Conclusions from the graph:

  • With 80% dirty pages, we get about 4x better throughput (compared to RAID).
  • Write-through mode gives about a 2x gain; but remember that the load is very write-intensive, and all the benefit in write-through mode comes only from cached reads, so this is pretty good for this scenario.

With this post I finish my FlashCache runs for now. I think FlashCache may be considered for real usage; at the least, you can evaluate how it works on your own workloads.

Entry posted by Vadim |



MySQL 5.5.4 in tpcc-like workload

MySQL 5.5.4 is a great release with performance improvements; let’s see how it performs in a tpcc-like workload.

The full details are on the Wiki page.

I took MySQL 5.5.4 with InnoDB 1.1 and the tpcc-mysql benchmark with 200W (about 18GB of data). The InnoDB log files are 3.8GB in size, and I ran with different buffer pool sizes from 20GB down to 6GB. The storage is a FusionIO 320GB MLC card with XFS (nobarrier).

While the raw results are available on the Wiki, here are the graphical results.

I intentionally put all lines on the same graph to show the trends.

It seems adaptive_flushing is not able to keep up, and you see periodic drops when InnoDB starts flushing. I hope the InnoDB team will fix this before 5.5 GA.

I expect a reasonable request to compare this with Percona Server/XtraDB, so here is the same load on our server:

As you see, our adaptive_checkpoint algorithm performs much more stably.

And for a direct comparison, here are side-by-side results for the 10GB buffer pool case.

So, as you see, InnoDB is doing great at trying to keep performance even; in the previous release, there was about a 1.7x difference. I expect to see more improvements in 5.5 GA.

Entry posted by Vadim |
