Why you can’t rely on a replica for disaster recovery

A couple of weeks ago a colleague and I worked on a data corruption case that reminded me that people sometimes make unsafe assumptions without knowing it. This one involved unsafe SAN snapshotting.

In a nutshell, the client used SAN block-level replication to maintain a standby/failover MySQL system, and there was a failover that didn’t work: both the primary and the fallback machine had identically corrupted data files. After fsck was run on the replica, the InnoDB data files were entirely deleted.

When we arrived on the scene, there was a data directory with an 800+ GB data file, which we determined had been restored from a SAN snapshot. Accessing this file caused a number of errors, including warnings about accessing data outside of the partition boundaries. We were eventually able to coax the filesystem into truncating the data file back to a size that didn’t contain invalid pointers and could be read without errors on the filesystem level. From InnoDB’s point of view, though, it was still completely corrupted. The “InnoDB file” contained blocks of data that were obviously from other files, such as Python exception logs. The SAN snapshot was useless for practical purposes. (The client decided not to try to extract the data from the corrupted file, which we have specialized tools for doing. It’s an intensive process that costs a little money.)

The problem was that the filesystem was ext2, with no journaling and no consistency guarantees. A snapshot on the SAN is just the same as cutting the power to the machine: the block device is in an inconsistent state. A filesystem that can survive that has to write data to the block device in such a way that it can be brought into a consistent state later. The techniques for doing this include things like ordered writes and metadata journaling. But ext2 does not know how to do that. The data seen by the SAN is some jumble of blocks that represents the most efficient way to transfer the changed blocks over the interconnect, without regard to logical consistency at the filesystem level.

Two things can help avoid such a disaster: 1) get qualified advice and 2) don’t trust the advice; backups and disaster recovery plans must be tested periodically.

This case illustrates an important point that I repeat often. The danger of using a replica as a backup is that data loss on the primary can affect the replica, too. This is true no matter what type of replication is being used. In this case it’s block-level SAN replication. DRBD would behave just the same way. At a higher level, MySQL replication has the same weakness. If you rely on a MySQL slave for a “backup,” you’ll be out of luck when someone accidentally runs DROP TABLE on your master. That statement will promptly replicate over and drop the table off your “backup.”

I still see people using a replica as a backup, and I know it’s just a matter of time before they lose data. In my experience, the types of errors that will propagate through replication are much more common than those that’ll be isolated to just one machine, such as hardware failures.

Entry posted by Baron Schwartz



Last open Symfony training in 2010


From September 29 to October 1, I will be holding my last open Symfony training of this year. The venue will most likely be Heilbronn (Baden-Württemberg). The training costs 990 euros per person and lasts 3 days. More information, including the training contents, can be found here.


Storing MySQL Binary logs on NFS Volume

There is a lot of discussion about whether running MySQL with data stored on NFS is a good idea. There are many arguments for and against it, and this post is not about them.
The fact is that a number of people run their databases on NetApp and other forms of NFS storage, and this post is about one discovery in such a setup.

There are good reasons to have binary logs on an NFS volume – binary logs are exactly the thing you want to survive a server crash, since with them you can do point-in-time recovery from a backup.

I was testing high volume replication today using Sysbench:


  sysbench --test=oltp --oltp-table-size=10000000 --db-driver=mysql --mysql-user=root --mysql-db=sbsmall --init-rng=1 --max-requests=100000000 --max-time=600 --oltp-test-mode=nontrx --oltp-nontrx-mode=update_nokey --num-threads=8 run

On this box I got around 12,000 updates/sec, which is not a perfect number, though that was mainly because of contention issues in MySQL 5.0 rather than any NAS issues.
This number was reachable even with the binary log stored on the NFS volume. This number is for sync_binlog=0 and innodb_flush_log_at_trx_commit=2.

I noted, however, that if I enable replication – connecting a slave to this box – the throughput on the master drops to about 2,800 updates/sec, which is very close to the magic number of network roundtrips per second I can get over a 1Gb link. It got even more interesting than that: if I paused replication for a prolonged period of time and let a few GB of binary logs accumulate, performance on the master stayed high even with replication running, but it slowed down as soon as the IO thread on the slave caught up with the master.
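As a quick back-of-the-envelope check (my own sketch; the function name and the 350-microsecond figure are assumptions, not measurements from this post), a synchronous network roundtrip per event caps throughput right around the observed number:

```python
# If every binary log write blocks on one network roundtrip, throughput
# cannot exceed 1 / roundtrip_time, no matter how fast the storage is.

def max_ops_per_sec(roundtrip_seconds):
    """Upper bound on operations/sec when each operation waits out one roundtrip."""
    return 1.0 / roundtrip_seconds

# A 1Gb Ethernet roundtrip of roughly 350 microseconds gives a ceiling
# close to the ~2,800 updates/sec observed with the slave connected.
print(round(max_ops_per_sec(350e-6)))  # 2857
```

This is consistent with the drop: the workload goes from being disk-bound to being latency-bound on the interconnect.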

When I moved the binary logs to local storage I got very similar performance, but there was no degradation when replication was enabled.

I have not checked in detail why this could be the case, but I guess there is something that requires a network roundtrip when the binary log is being written at the same time as the slave-feeding thread is reading it.

I’d be curious to know if someone else can observe this behavior, and whether there is NFS tuning that can be done to avoid it, or whether we need to fix MySQL.

Entry posted by peter



Caching could be the last thing you want to do

I recently had a run-in with a very popular PHP ecommerce package, which prompts me to point out a recurring mistake I see in how many web applications are architected.

What is that mistake?

The ecommerce package I was working with depended on caching.  Out of the box it couldn’t serve 10 pages/second unless I enabled some features which were designed to be “optional” (but clearly they weren’t).

I think with great tools like memcached it is easy to get carried away and use it as the mallet for every performance problem, but in many cases it should not be your first choice.  Here is why:

  • Caching might not work for all visitors – You look at a page, it loads fast.  But is it the same for every user?  Caching can sometimes be an optimization that makes the average user’s experience faster, but in reality you should care more that all users get a good experience (Peter explains why here, talking about six sigma).  In practice it can often be the same user who gets all the cache misses, which can make this problem even worse.
  • Caching can reduce visibility – You look at the performance profile of what takes the most time for a page to load and start trying to apply optimization.  The problem is that the profile you are looking at may skew what you should really be optimizing.  The real need (thinking six sigma again) is to know what the miss path costs, but it is somewhat hidden.
  • Cache management is really hard – have you planned for cache stampeding, or many cache items being invalidated at the same time?
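To make the stampede point concrete, here is a minimal sketch (my own illustration; StampedeSafeCache is a made-up in-memory stand-in for memcached, not part of any library) of one common mitigation: only one caller rebuilds an expired entry while concurrent callers keep serving the stale value.

```python
import threading
import time

class StampedeSafeCache:
    """Toy in-memory cache: one thread recomputes an expired key while
    other threads keep serving the stale value instead of piling up."""

    def __init__(self):
        self._data = {}    # key -> (value, expires_at)
        self._locks = {}   # key -> lock held by the thread rebuilding that key
        self._guard = threading.Lock()

    def get(self, key, ttl, compute):
        now = time.time()
        entry = self._data.get(key)
        if entry and entry[1] > now:
            return entry[0]                      # fresh hit, fast path
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        if lock.acquire(blocking=False):         # we won the right to rebuild
            try:
                value = compute()
                self._data[key] = (value, now + ttl)
                return value
            finally:
                lock.release()
        if entry:
            return entry[0]                      # stale but instant: no stampede
        with lock:                               # first-ever miss: wait for the rebuilder
            return self._data[key][0]
```

With this, cache.get("front_page", 60, render_front_page) calls render_front_page at most once per expiry, no matter how many requests arrive at the same moment.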

What alternative approach should be taken?

Caching should be seen more as a burden that many applications just can’t live without.  You don’t want that burden until you have exhausted all other easily reachable optimizations.

What other optimizations are possible?

Before implementing caching, here is a non-exhaustive checklist to run through:

  • Do you understand every execution plan of every query? If you don’t, set long_query_time=0 and use mk-query-digest to capture queries.  Run them through MySQL’s EXPLAIN command.
  • Do your queries SELECT *, only to use a subset of columns?  Or do you extract many rows, only to use a subset? If so, you are extracting too much data, and (potentially) limiting further optimizations like covering indexes.
  • Do you have information about how many queries were required to generate each page? Or more specifically do you know that each one of those queries is required, and that none of those queries could potentially be eliminated or merged?

I believe this post can be summed up as “Optimization rarely decreases complexity. Avoid adding complexity by only optimizing what is necessary to meet your goals.”  – a quote from Justin’s slides on instrumentation-for-php.  In terms of future-proofing design, many applications are better off keeping it simple and (at least initially) refusing the temptation to try and solve some problems “like the big guys do”.

Entry posted by Morgan Tocker



Estimating Replication Capacity

It is easy for MySQL replication to become a bottleneck even when the master server is not seriously loaded, and the more cores and hard drives we get, the larger the difference becomes, as long as replication remains a single-threaded process. At the same time, it is a lot easier to optimize your system when replication runs normally – if you need to add/remove indexes or make other schema changes, you would probably use methods involving replication if you can’t take your system down. So here comes the catch in many systems: we find the system needs optimization when replication can’t catch up, yet the optimization process we’re going to use relies on replication being functional and able to catch up quickly.

So the question becomes: how can we estimate replication capacity, so we can deal with replication load before the slave is unable to catch up?

Knowing your replication capacity is not only needed when you’re planning to use replication to perform system optimization; it is needed in other cases, too. For example, in a sharded environment you may need to schedule downtime or set an object read-only to move it to another shard. It is much nicer if this can be planned in advance rather than done on an emergency basis when the slave(s) are unable to catch up and the application is suffering because of stale data. This especially applies to Software as a Service providers, which often have very strict SLA agreements with their customers and can have a lot of data per customer, so a move can take a considerable amount of time.

So what is replication capacity? I call replication capacity the multiple of the master’s write load that replication can sustain. If replication is able to replicate 3 times the write load from the master without falling behind, I call that a replication capacity of 3. When used in the context of applying binary logs (for example, point-in-time recovery from a backup), a replication capacity of 1 means you can replay 1 hour’s worth of binary logs within 1 hour. I call “replication load” the inverse of replication capacity – basically, the percentage of time the replication thread was busy replicating events vs. staying idle.

Note that you can speak about idle replication capacity, when the box does not do anything else, as well as loaded replication capacity, when the box serves its normal load. Both are important. Idle replication capacity matters when you have no load on the slave and need it to catch up, or when restoring from a backup; loaded replication capacity matters during normal operation.

So we have defined what replication capacity is. There are, however, no tools which can tell us directly what the replication capacity of a given system is. It also tends to float depending on the load, much like the loadavg metric. Here are some of the ways to measure it:

1) Use the “UserStats” functionality from the Google patches, which is now available in Percona Server and MariaDB. This is probably the easiest and most accurate approach, but it does not work in the Oracle MySQL Server. Set userstat_running=1 and run the following query:


  mysql> SELECT * FROM information_schema.user_statistics WHERE user='#mysql_system#' \G
  *************************** 1. row ***************************
                  USER: #mysql_system#
        CONNECTED_TIME: 446
             BUSY_TIME: 74
              CPU_TIME: 0
            BYTES_SENT: 63
          ROWS_UPDATED: 127576
       TABLE_ROWS_READ: 4085689
       UPDATE_COMMANDS: 119127
        OTHER_COMMANDS: 89557
  1 row in set (0.00 sec)

In this case CONNECTED_TIME is 446 seconds; out of this, the replication thread was busy (BUSY_TIME) for 74 seconds, which means the replication capacity is 446/74 = 6.
You normally would not want to measure it from the start, but rather take the difference in these counters every 5 minutes or at another interval of your choice.
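The calculation is simple enough to automate; a small sketch (my own helper function, fed with the numbers from the output above):

```python
def replication_capacity(busy_delta, elapsed_delta):
    """Capacity = wall-clock seconds elapsed / seconds the SQL thread was busy.
    Pass the *differences* in BUSY_TIME and CONNECTED_TIME between two samples."""
    return elapsed_delta / busy_delta

# The figures from the output above: 74 busy seconds out of 446 connected seconds.
print(round(replication_capacity(74, 446), 1))  # 6.0
```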

2) Use the full slow query log and mk-query-digest. This method is great for one-time execution, especially as it also gives you the list of queries which load replication the most. It works only with statement-level replication, however. You need to set long_query_time=0 and log_slave_slow_statements=1 for this method to work.
Get the log file, which will include all queries the MySQL server ran with their times, and run mk-query-digest with a filter to check only queries from the replication thread:

mk-query-digest slow-log --filter '($event->{user} || "") =~ m/[SLAVE_THREAD]/' > /tmp/report-slave.txt

In the report you will see something like this as a header:


  # 475s user time, 1.2s system time, 80.41M rss, 170.38M vsz
  # Current date: Mon Jul 19 15:12:24 2010
  # Files: slow-log
  # Overall: 1.22M total, 1.27k unique, 558.56 QPS, 0.37x concurrency ______
  #                 total     min     max     avg     95%  stddev  median
  # Exec time        819s     1us     92s   669us   260us   120ms    93us
  # Lock time         28s       0   166ms    23us    49us   192us    25us
  # Rows sent       4.27k       0     325    0.00       0    1.04       0
  # Rows exam      30.88M       0   1.28M   26.48       0   3.07k       0
  # Time range  2010-07-19 14:35:53 to 2010-07-19 15:12:22
  # bytes         350.99M       5 1022.34k 301.01  719.66   5.75k  124.25
  # Bytes sen       1.94M       0   9.42k    1.67       0  110.38       0
  # Killed              0       0       0       0       0       0       0
  # Last errn      34.11M       0   1.55k   29.26       0  185.83       0
  # Merge pas           0       0       0       0       0       0       0
  # Rows affe     875.19k       0  17.55k    0.73    0.99   25.61    0.99
  # Rows read       2.20M       0  14.83k    1.88    1.96   24.68    1.96
  # Tmp disk        4.15k       0       1    0.00       0    0.06       0
  # Tmp table      14.19k       0       2    0.01       0    0.14       0
  # Tmp table       8.30G       0   2.01M   7.12k       0 117.75k       0
  #   0% (5k) Filesort
  #   0% (5k) Full_join
  #   0% (7k) Full_scan
  #   0% (10k) Tmp_table
  #   0% (4k) Tmp_table_on_disk
There is a lot of interesting information you can find in this header, but in relation to replication capacity you can get the replication load, which is the same as the “concurrency” figure (0.37x). The concurrency as reported by mk-query-digest is the sum of query execution time divided by the time range the log file covers. In this case, since we know there is only one replication thread, it is the same as the replication load. This gives us a replication capacity of 1/0.37 = 2.70.

This method should work with the original MySQL Server in theory, though I have not tested it. Some versions had log_slave_slow_statements unreliable, and you may also need to adjust the regular expression for finding the user the replication thread runs as.
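Turning the digest header into a capacity figure can be sketched like this (my own helper; it simply inverts the concurrency number and is only valid when the log contains nothing but the single replication thread):

```python
import re

def capacity_from_digest_header(header):
    """Extract the 'Nx concurrency' figure from an mk-query-digest header and
    invert it; with one replication thread, concurrency equals replication load."""
    load = float(re.search(r'([\d.]+)x concurrency', header).group(1))
    return 1.0 / load

header = "# Overall: 1.22M total, 1.27k unique, 558.56 QPS, 0.37x concurrency"
print(round(capacity_from_digest_header(header), 2))  # 2.7
```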

3) Processlist polling. This method is simple – the slave thread has a different status in SHOW PROCESSLIST depending on whether it is processing a query or simply waiting. By polling the processlist frequently (for example, 10 times a second) we can compute the approximate percentage of time the thread was busy vs. idle. Of course, running SHOW PROCESSLIST very aggressively can be an overhead, especially on a busy system with a lot of connections.


  mysql> SHOW PROCESSLIST;
  +--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
  | Id     | User        | Host      | db   | Command | Time | State                                                                 | Info             |
  +--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
  | 801812 | system user |           | NULL | Connect | 2665 | Waiting for master to send event                                      | NULL             |
  | 801813 | system user |           | NULL | Connect | 0    | Has read all relay log; waiting for the slave I/O thread to update it | NULL             |
  | 802354 | root        | localhost | NULL | Query   | 0    | NULL                                                                  | SHOW PROCESSLIST |
  +--------+-------------+-----------+------+---------+------+-----------------------------------------------------------------------+------------------+
  3 rows in set (0.00 sec)
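The busy/idle accounting from such polling might be sketched like this (my own illustration; the actual SHOW PROCESSLIST sampling loop is omitted, and the idle-state string is the one visible in the output above):

```python
IDLE_MARKER = "has read all relay log"   # the SQL thread's idle state, lowercased

def is_sql_thread_busy(state):
    """True when the sampled State column suggests the SQL thread is applying events."""
    return IDLE_MARKER not in (state or "").lower()

def replication_load(sampled_states):
    """Fraction of polled samples in which the SQL thread was busy."""
    busy = sum(1 for state in sampled_states if is_sql_thread_busy(state))
    return busy / len(sampled_states)

# Ten samples taken over one second might look like this:
samples = ["Has read all relay log; waiting for the slave I/O thread to update it"] * 7 \
        + ["Reading event from the relay log"] * 3
load = replication_load(samples)
print(load, round(1 / load, 1))  # 0.3 3.3
```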

4) Slave catchup/binlog application method. We can get a spare server with backups restored on it and apply binary logs to it. If 1 hour’s worth of binary logs applies in 10 minutes, we have a replication capacity of 6. The challenge, of course, is having a spare server around, and it is quite labor-intensive. At the same time, it can be a good measurement to take during backup recovery trials, when you’re doing this activity anyway. This way you can also measure “cold” vs. “hot” replication capacity, as well as how long replication warmup takes. It is very typical for servers with a cold cache to perform a lot slower than when they are warmed up. Measuring the time for each binary log separately should give you these numbers.

A less intrusive process, which can be done in production (especially if you have a slave used for backups/reporting, etc.), is to stop replication for some time and then see how long it takes to catch up. If you paused replication for 10 minutes and it took 5 minutes to catch up, your replication capacity is 3 (not 2), because you had to process not only the events for the outstanding 10 minutes but also those for the 5 minutes it took to catch up. The formula is (Time_Replication_Paused+Time_Took_To_Catchup)/Time_Took_To_Catchup.
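As a one-liner sketch (my own helper wrapping the formula above):

```python
def capacity_from_catchup(paused, catchup):
    """(Time_Replication_Paused + Time_Took_To_Catchup) / Time_Took_To_Catchup.
    The extra term exists because the slave also had to absorb the events
    generated while it was catching up."""
    return (paused + catchup) / catchup

print(capacity_from_catchup(10, 5))  # 3.0
```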

So how much replication capacity do you need in a healthy system? It depends a lot on many things, including how fast you need to be able to recover from backups and how much your load varies. A lot of systems have special requirements on the time it takes to warm up, too (and there are different things you can do about that). First, I would measure replication capacity over 5-minute intervals (or something similar), because it tends to vary a lot. Then I would suggest ensuring the loaded replication capacity is at least 3 during peak load and 5 during normal load. This applies to normal operational load – if you push heavy ALTER TABLE statements through replication, they will surely drag your replication capacity down for their duration.

One more thing about these methods: methods 1, 2, and 3 work well only if the replication capacity is above 1, i.e. the system is caught up. If it is less than 1 – the master writes more binary logs than the slave can process – they will show a number close to 1. Method 4, however, works even if replication can’t ever catch up: if 1 hour’s worth of binary logs takes 2 hours to apply, your replication capacity is 0.5.

Entry posted by peter



SSD: Free space and write performance

(cross-posting from the SSD Performance Blog)
In my previous post, On Benchmarks on SSD, a commenter touched on another interesting point: available free space significantly affects write performance on an SSD card. The reason is, again, the garbage collector, which operates more efficiently the more free space it has. For more on the garbage collector and the write problem, you can check the Write amplification wiki page.

To see how performance drops with decreasing free space, let’s run sysbench fileio random write benchmark with different file sizes.

For the test I took a FusionIO 320 GB SLC PCIe DUO™ ioDrive card, with software striping between the two cards, and here is a graph of how throughput depends on available free space (the bigger the file, the less free space).

The system specification and the scripts used can be seen on the Benchmark Wiki.

On the graph you can see two lines (yes, there are two lines, even though they are almost identical).
The first line is the result when the FusionIO card is formatted to use its full capacity, and the second line is for the case when I use additional space reservation (25% in this case, that is, 240GB available). There is no difference in this case; however, the additional over-provisioning protects you from overusing space and keeps performance at the corresponding level.

It is clear that the maximal throughput strongly depends on available free space.
With 100GiB utilization we have 933.60 MiB/sec,
with 150GiB (half of capacity) 613.48 MiB/sec, and
with 200GiB it drops to 354.37 MiB/sec, which is 2.6x less compared with 100GiB.
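The quoted slowdown checks out (a trivial verification sketch of my own, using the numbers above):

```python
# GiB of the card in use -> measured random write throughput, MiB/sec
throughput = {100: 933.60, 150: 613.48, 200: 354.37}
print(round(throughput[100] / throughput[200], 1))  # 2.6
```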

So, returning to the question of how to run a proper benchmark: the result depends significantly on what percentage of the card’s space is used. The results for a 100GiB file on a 160 GB card will be different from the results for a 100GiB file on a 320 GB card.

Besides free space, performance also depends on the garbage collector algorithm itself, and cards from different manufacturers will show different results. Some newly arriving cards make high performance under high space utilization a competitive advantage, and I am going to run the same analysis on different cards.

Entry posted by Vadim



Analyzing the distribution of InnoDB log file writes

I recently did a quick analysis of the distribution of writes to InnoDB’s log files. On a high-traffic commodity MySQL server running Percona XtraDB for a gaming workload (mostly inserts to the “moves” table), I used strace to gather statistics about how the log file writes are distributed in terms of write size. InnoDB writes to the log files in multiples of 512 bytes. Mark Callaghan explained this and some of its performance implications here. How many big writes does InnoDB do, and how many small writes?

First, I found out the file descriptor numbers of the log files:

# lsof -p $(pidof mysqld) | grep ib_log
mysqld  29772 mysql    8uW  REG                8,2   268435456   7143989 /var/lib/mysql/ib_logfile0
mysqld  29772 mysql    9uW  REG                8,2   268435456   7143993 /var/lib/mysql/ib_logfile1

The file descriptors are 8 and 9. We’ll need to capture writes to both of those; InnoDB round-robins through them. The following grabs the write sizes out of 100k calls to pwrite() and aggregates them:

# strace -f -p $(pidof mysqld) -e pwrite -s1 -xx 2>&1 \
   | grep 'pwrite([89],' |head -n 100000 \
   | awk '{writes[$5]++}END{for(w in writes){print w, " ", writes[w]}}'

I could have done it better with a little more shell scripting, but the output from this was enough for me to massage into a decent format with another step or two! Here is the final result:

bytes	count
512	44067
1024	30740
1536	15221
2048	7094
2560	1810
3072	570
3584	219
4096	112
4608	39
5120	23
5632	16
6144	15
6656	5
7168	3
7680	8
8192	2
8704	2
9216	1
9728	2
10240	1
10752	2
11264	1
11776	1
14848	1
15360	1
15872	2
16384	4
16896	4
17408	2
17920	2
18432	2
18944	8
19456	7
19968	4
20480	4
21504	1
22016	2
24064	1
40960	1

So, in sum, we see that about 3/4ths of InnoDB log file writes on this workload are 512 or 1024 bytes. (It might vary for other workloads.) Now, what does this actually mean? There are a lot of interesting and complex things to think about here and research further:
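Both the 3/4 figure and the fraction of sub-page writes can be recomputed from the histogram above (my own quick sketch; sizes above 3584 bytes are omitted since they barely register):

```python
# Recompute the headline numbers from the strace histogram above.
hist = {512: 44067, 1024: 30740, 1536: 15221, 2048: 7094,
        2560: 1810, 3072: 570, 3584: 219}  # counts out of 100k pwrite() calls
total = 100000
small = hist[512] + hist[1024]
sub_page = sum(hist.values())            # every bucket here is below 4096 bytes
print(small / total)     # 0.74807 -> about 3/4 are 512 or 1024 bytes
print(sub_page / total)  # 0.99721 -> nearly all writes are below the 4k OS page
```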

  • Most writes are less than 4k, but the operating system page size is 4k. If the page isn’t in the OS buffer cache, then the write will cause a read-around write — the page will have to be read, modified, and then written again, instead of just written. Vadim did some benchmarks earlier showing that the log files need to be in the OS cache for best performance (sorry, can’t find the link right now). This makes intuitive sense given that the writes are smaller than 4k.
  • This server has innodb_flush_log_at_trx_commit set to 2. This means each transaction does a log file write, the same as for a setting of 1. If it’s set to 0, the distribution may be quite different — the writes might accumulate and become much larger.
  • What happens if the log buffer is actually smaller than a transaction? That’s another topic for research. I don’t know off the top of my head.
  • Is the distribution meaningful for determining the necessary log buffer size? The largest write above is just under 40k, which might seem to indicate that only 64k or so of log buffer would be large enough — but is it really? We’d need to benchmark and see. Peter and I were just talking, trying to remember some complex behavior of how space in the buffer is reserved for writes. We could not quite recall what that is, but there is some subtlety. This needs more research. It’s possible that even if the largest write is quite small, a small log buffer size wouldn’t be good and could cause additional lock contention. (There is a single mutex around the log buffer.) We probably need to benchmark to learn more about this.

In the end, the distribution is a simple observation to make, but the InnoDB redo log system is intricate. It’s not something to guess about, but rather something to measure and study more deeply. Perhaps we can follow this up with some more benchmarks or observations under different InnoDB settings and different workloads. Or then again, maybe Yasufumi will read this post when he returns from vacation and already know all the answers by heart!

Entry posted by Baron Schwartz



Data mart or data warehouse?

This is part two in my six part series on business intelligence, with a focus on OLAP analysis.

  1. Part 1 – Intro to OLAP
  2. Identifying the differences between a data warehouse and a data mart. (this post)
  3. Introduction to MDX and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is a data warehouse?
It turns out that this question is a little more difficult to answer than it probably should be. This is because data warehousing has become an overloaded term that includes BI tools (OLAP/data mining), data extraction and transformation tools (ETL), and schema management tools.

To me, the definition of data warehouse is “A relational database schema which stores historical data and metadata from an operational system or systems, in such a way as to facilitate the reporting and analysis of the data, aggregated to various levels.” This definition is a consolidation of various definitions that I have encountered.

There are a few key points here. First, data warehouses rarely contain information that exists nowhere else in an organization. The goal of data warehousing is to collect and make a historical record of the information from another system. This might be an ERP application, the logs from a web application, data from manufacturing systems or even data from radio telescopes. This data is extracted from the source system(s) and then cleaned up and inserted into the data warehouse with ETL tools. This process is usually called “conforming” the source data into the warehouse schema. Another important aspect of the definition is aggregation. A data warehouse is usually used to summarize data over years, months, quarters, or other time dimension attributes. This aids in identifying historical trends and making predictions about future trends. Data is often aggregated in many different ways. Aggregated data may be stored in aggregate tables so that it can be accessed quickly. This is particularly important as fact tables reach into the billions of rows and hundreds of gigabytes of information are accumulated. Accessing this data outside of summarized form often takes a very long time.

Is there a particular schema design which lends itself to this historical analysis?
There are two main methodologies practiced when it comes to designing database schemata for data warehouse applications. These two methodologies approach the problem of storing data in very different ways.

The first methodology was popularized by Bill Inmon, who is considered by many to be the “father” of the data warehouse, or at least the first dw “evangelist” if you will. This approach focuses on the normalization of data. Highly normalized schema are created and maintained by ETL jobs. Creating and maintaining these jobs is often one of the biggest parts of designing and running a data warehouse. A particular combination of ETL jobs which consist of one or more data transformations is usually called a “flow”. An example ETL flow might combine data from item and category information into a single dimension, while also maintaining the historical information about when each item was in each category.

These types of warehouses are almost always “insert only”. Data is very likely never updated or deleted in these databases and they are expected to grow to very large sizes, usually into the terabyte range, but sometimes even into petabytes. Aggregations are the exception to this rule, as they must be updated periodically to reflect the additions of data to the source tables. The goal of this methodology is the 100% accurate description of historical data from the operational system in a normalized manner which ensures that it is able to be updated quickly. It is accepted that analysis of the data will be more complex in this form, but that this complexity is an acceptable trade off for historical accuracy. This is often described as the “top-down” approach.

What is a data mart?
The second approach, popularized by Ralph Kimball, holds that partial de-normalization of the data is beneficial. The goal of this approach is usually multi-dimensional (OLAP) analysis, as it is very hard to create a dimensional model from a highly normalized database schema. It is particularly difficult to build such a model that scales as the volume in the warehouse increases. For this reason OLAP analysis is usually performed on a star schema which partially denormalizes the data. A star schema about a particular subject matter, such as sales, is usually referred to as a “data mart”. Maybe this is because they provide one-stop shopping for all the information about the particular subject matter. That is pretty much what I imagine when I hear the phrase.

Data marts tend to be updated frequently, at least once per day. As I mentioned in my previous post, a star schema consists of a central table called the fact table and additional dimension tables which contain information about the facts, such as lists of customers or products. Because of the partially denormalized nature of a star schema, the dimension tables in a data mart may be updated. In fact, there is a term for such a dimension – A “slowly changing dimension” or SCD. The fact table is usually only inserted to, but older data may be purged out of it. Sometimes the fact table will be aggregated from source data. A website which sells banner ads might roll up all the events for a particular ad to the day level, instead of storing detailed information about every impression and click for the ad.

A normalized data warehouse schema might contain tables called items, categories and item_category. These three tables allow a user to determine which items belong to which categories, but this structure creates a large number of joins when many dimensions are involved. A data mart would collapse all of this information into an item dimension which would include the category information in the same row as the item information. It would be possible to create two different dimensions, product and category, but performance tends to decrease as the number of dimensions increases.
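A toy illustration of that collapse (entirely my own, with made-up rows), going from the three normalized tables to a single item dimension:

```python
# Normalized source: items, categories, and the item_category bridge table.
items = {1: "red widget", 2: "blue gadget"}
categories = {10: "widgets", 20: "gadgets"}
item_category = [(1, 10), (2, 20)]  # (item_id, category_id)

# The star-schema item dimension carries the category on the same row,
# so fact-table queries join one dimension table instead of three tables.
item_dim = [
    {"item_id": item_id, "item_name": items[item_id],
     "category_name": categories[cat_id]}
    for item_id, cat_id in item_category
]
print(item_dim[0])  # {'item_id': 1, 'item_name': 'red widget', 'category_name': 'widgets'}
```

An ETL job maintaining this dimension would rerun this collapse (and track history for a slowly changing dimension) on each load.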

The difference illustrated
In this mock ERD diagram you can see three schemata representing sales orders. The star schema is very denormalized, having only four tables which represent the subject. The data warehouse schema, on the other hand, is very normalized and requires tens of tables to represent the same subject. The snowflake schema is a compromise between the two extremes.

Mock ERD Diagram

Typical star, snowflake and data warehouse schemata.

Is one better than the other?
In the game of data warehousing, a combination of these methods is of course allowed. A company might take the top-down approach, maintaining a large historical data warehouse but also building data marts for OLAP analysis from the warehouse data. A different approach is to build a relational warehouse from multiple data marts, the so-called bottom-up approach to data warehousing.

There is also a cousin of the star schema in which the dimensions are normalized. This type of schema is usually called a snowflake schema. The item/category/item_category trio of tables in the warehouse schema example would be considered a snowflake: a dimension table (item) must be joined to additional tables (item_category, category) to find the category. Snowflake schemas are not as popular as star schemas because they tend not to perform as well, particularly as the volume of data in the database increases.

So what is the big deal?
From an OLAP performance standpoint, many databases will perform better on a star schema than on a snowflake or fully normalized schema at data warehouse volumes. This is in large part because commercial database software supports hash joins, bitmap indexes, table partitioning, parallel query execution, clustered tables and materialized views. These features make working with a star schema much easier than it may be on MySQL, but it is definitely possible to use MySQL as long as the right tools and techniques are used. There are ways in which you can add some of these features to MySQL as well, but that is a topic for a later post.

In the next post I’ll talk more about Mondrian and about MDX, the multi-dimensional query language. Mondrian turns MDX into SQL, so we’ll also look at the kinds of queries which are generated by OLAP analysis.

Entry posted by Justin Swanhart |



On Benchmarks on SSD

(cross post from SSD Performance Blog )
Getting meaningful performance results on SSD storage is not an easy task; let's see why.
Below is a graph from a sysbench fileio random write benchmark with 4 threads. The results were taken on a PCI-E SSD card (I do not want to name the vendor here, as the problem is the same for any card).

The benchmark starts on a newly formatted card. For some period (fresh period A) you see a line with high results, which at some point drops (point B); after some recovery period there is a steady state (state C).

What happens there: as you may know, SSDs have garbage collection activity, and point B is the time when the garbage collector starts its work. You can read more on this topic on the Write amplification wiki page.

So, as you understand, it is important to know what state the card was in when the benchmark was running. Apparently, many manufacturers like to put the result from fresh period A in the device specification, while I think steady state C is more important for end users. So in my further results I will note what state the card was in during the benchmark.

However, this makes the task of running benchmarks on SSDs trickier. It is similar to database benchmarks, but upside down: a database just after start is in a “cold state”, and you need to make sure you have enough warmup and only take results in the hot state, when all internal buffers are filled and populated.
Well, you may say: just put the card in steady state C and run the benchmark. But that is only part of the problem.

The next issue comes from the TRIM command. TRIM is a command sent to the device when a file is deleted, and it allows the SSD controller to mark all space related to the file as free and reuse it immediately. Not all devices support the TRIM command; for example, the first generation of Intel SSD cards did not support it, while the G2 does.
So why is TRIM a problem for benchmarks? Basically, if you delete all files, it returns the card to fresh state A. Many benchmark scenarios (including my initial sysbench fileio scripts) create files at the start of the benchmark and remove them afterward. A similar issue arises when you restore a database from backup, run a benchmark, and then remove the files. It may happen that during your run you cover all states A->B->C, and the final result is pretty much useless. So, in conclusion: if you want to see results from the steady state, you should make sure your benchmark keeps the card there.

As we speak about benchmark results, there is another vendor trick I want to draw your attention to. Quite often you will see something like this in a specification from an imaginary Vendor X:

  • Read: Up to 520 MB/s
  • Write: Up to 480 MB/s
  • Sustained Write: Up to 420 MB/s
  • Random Write 4KB : 70,000 IOPS

The good thing there is that the vendor gives both the maximal write (most likely from state A) and the sustained write (I guess from state C).
However, if you multiply 4KB * 70,000 IOPS, you get 280,000 KB/s ≈ 274 MB/s, which is quite far from the declared 480 MB/s write throughput.
What is the trick there? Maximal throughput in MB/s is achieved with a big block size, say 64K or 128K, while maximal throughput in IOPS is achieved with a small block size, 4K in this case.

So when you read “Write: Up to 480 MB/s, Random Write 4KB: 70,000 IOPS”, you should know that the 480 MB/s was obtained with a big block size, and for a 4KB block size you should expect only 274 MB/s (and most likely in fresh state A).
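The arithmetic is easy to check; a quick sketch using the imaginary Vendor X numbers from above:

```python
# Throughput implied by the random-write spec: 70,000 IOPS at 4KB blocks.
iops = 70000
block_kb = 4
kb_per_s = iops * block_kb        # 280,000 KB/s
mb_per_s = kb_per_s / 1024.0      # far below the 480 MB/s headline figure
print(round(mb_per_s, 1))         # 273.4

# Conversely, hitting 480 MB/s with 128KB blocks needs far fewer IOPS:
iops_at_128k = 480 * 1024 / 128   # 3,840 IOPS
```

So the IOPS figure and the MB/s figure describe two different workloads, and neither can be derived from the other without knowing the block size.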

As the SSD market evolves, we will see more and more benchmark results, so be ready to read them carefully.

Entry posted by Vadim |




(cross posting from SSDPeformanceBlog.com )
All modern solid state drives use NAND memory based on SLC (single level cell) or MLC (multi level cell) technologies.
Without going into physical details: SLC basically stores 1 bit of information per cell, while MLC can store more. The most popular option for MLC is 2 bits, and there is movement in the 3-bit direction.

This fact gives us the following characteristics:

  • SLC provides less capacity
  • SLC is more expensive
  • SLC is known to have better quality chips

Along with that, there is also a limitation on the number of write operations: SLC can handle about 100,000 write cycles, while MLC handles about 10,000 (the numbers are rough, and changing with technology improvements).

No wonder that vendors very quickly came up with the following separation:

  • SLC for enterprise market ( servers )
  • MLC for consumer market ( desktops, workstations, laptops)

An obvious example here is Intel's SSD cards: the X25-E (SLC) is sold as an enterprise-level card, and the X25-M (MLC) is sold for the mass market. As another example of the difference in capacity and price:

  • FusionIO 160GB SLC card price $6,308.99
  • FusionIO 320GB MLC card price $6,729.99

That is, for roughly the same price the MLC card comes with double the capacity.

However, with increasing capacity, the difference between MLC and SLC is getting fuzzier. For MLC, the most critical part is the software (firmware) algorithm which ensures uniform usage of the available NAND chips, and with bigger capacity it is much easier to implement.
This problem of handling lifetime and managing write cycles for MLC opened the way for hardware solutions like the SandForce controller, and recently Anobit announced its “Memory Signal Processing (MSP™) technology [which] enables MLC-based solutions at SLC-grade reliability and performance”.

Also important is the increasing capacity of MLC devices. For example, if we take 10,000 writes vs 100,000 writes, then to provide the same lifetime an MLC device would need about 10x more capacity, and that seems to be no problem. I expect we will soon see 1600GB MLC cards which, ideally, will have the same lifetime as 160GB SLC cards.
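That 10x trade-off is simple to model. The sketch below uses the rough cycle counts quoted above and assumes ideal wear leveling, so it is a back-of-the-envelope estimate, not a vendor specification:

```python
# Total writable volume = capacity * write cycles per cell,
# assuming ideal wear leveling spreads writes evenly across all cells.
def lifetime_writes_gb(capacity_gb, write_cycles):
    return capacity_gb * write_cycles

slc_total = lifetime_writes_gb(160, 100000)   # 160GB SLC card
mlc_total = lifetime_writes_gb(1600, 10000)   # 1600GB MLC card

# 10x the capacity compensates for 10x fewer write cycles:
print(slc_total == mlc_total)  # True
```

In practice write amplification and imperfect wear leveling eat into both figures, but the proportional argument still holds.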

Along these lines, it is interesting to see Intel announce that its enterprise line of SSD cards will be based on enterprise MLC (eMLC), where each cell has a 30,000-write lifetime, with a maximal capacity of 400GB.

So it seems the market is gradually moving in the “MLC is ready for enterprise” direction, which sounds like a good option for getting devices with high capacity at a reasonable price in the near future.


Entry posted by Vadim |


