Jun
08
2011
--

Aligning IO on a hard disk RAID – the Benchmarks

In the first part of this article I have showed how I align IO, now I want to share results of the benchmark that I have been running to see how much benefit can we get from a proper IO alignment on a 4-disk RAID1+0 with 64k stripe element. I haven’t been running any benchmarks in a while so be careful with my results and forgiving to my mistakes :)

The environment

Here is the summary of the system I have been running this on (for brevity I have removed some irrelevant information):

# Aspersa System Summary Report ##############################
    Platform | Linux
     Release | Ubuntu 10.04.2 LTS (lucid)
      Kernel | 2.6.32-31-server
Architecture | CPU = 64-bit, OS = 64-bit
# Processor ##################################################
  Processors | physical = 2, cores = 12, virtual = 24, hyperthreading = yes
      Speeds | 24x1600.000
      Models | 24xIntel(R) Xeon(R) CPU X5650 @ 2.67GHz
      Caches | 24x12288 KB
# Memory #####################################################
       Total | 23.59G
...
  Locator   Size     Speed             Form Factor   Type          Type Detail
  ========= ======== ================= ============= ============= ===========
  DIMM_A1   4096 MB  1333 MHz (0.8 ns) DIMM          {OUT OF SPEC} Other
...
# Disk Schedulers And Queue Size #############################
         sda | [deadline] 128
# RAID Controller ############################################
  Controller | LSI Logic MegaRAID SAS
       Model | MegaRAID SAS 8704EM2, PCIE interface, 8 ports
       Cache | 128MB Memory, BBU Present
         BBU | 100% Charged, Temperature 34C, isSOHGood=

  VirtualDev Size      RAID Level Disks SpnDpth Stripe Status  Cache
  ========== ========= ========== ===== ======= ====== ======= =========
  0(no name) 1.088 TB  1 (1-0-0)      2     2-2     64 Optimal WT, RA

  PhysiclDev Type State   Errors Vendor  Model        Size
  ========== ==== ======= ====== ======= ============ ===========
  Hard Disk  SAS  Online   0/0/0 SEAGATE ST3600057SS  558.911
  Hard Disk  SAS  Online   0/0/0 SEAGATE ST3600057SS  558.911
  Hard Disk  SAS  Online   0/0/0 SEAGATE ST3600057SS  558.911
  Hard Disk  SAS  Online   0/0/0 SEAGATE ST3600057SS  558.911

It says controller cache is set to write-through (WT), though in fact for every benchmark I have repeated it with (a) write-through and (b) write-back to see if write-back cache would minimize the effects of misalignment.

File system of choice was XFS. Barriers and physical disk cache was disabled. The tool I used was sysbench 0.4.10 that came with this Ubuntu system. I have run every fileio benchmark and an IO bound read-write oltp benchmark in autocommit mode.

File IO benchmark

For the FileIO benchmark, I used 64 files – 1GB, 4GB and 16GB total in size with 1, 4 and 8 threads. The operations were done in 16kB units to mimic InnoDB pages. There were couple interesting surprised I faced:

1. After I got (what I thought was) the best configuration, I added LVM on top of that and the performance improved another 20-40%. It took me a while to figure it out, but here’s what happened – for XFS file system on a raw partition I was using full partition size which was slightly over 1TB in size. When I added LVM on top however, I made the logical volume slightly below 1TB. Investigating this I found that 32-bit xfs inodes (which are used by default) have to live in the first terabyte of the device which seems to have affected the performance here (IMO that’s because of where first data extents were placed in this case). When I have mounted the partition with inode64 option however, the effect disappeared and performance without LVM was slightly better than with LVM as expected. I had to redo all of the benchmarks to get the numbers right.

2. I was running vmstat during one of the tests and my eye caught the spike in OS buffers during “prepare” phase of sysbench. I found out that sysbench would not honor –file-extra-flags during “prepare” phase and instead of having files created using direct IO they were buffered in OS cache and so writes to files were serialized until they were fully overwritten and that way flushed from OS buffers. Buffers would be flushed within first few seconds so the effects of this were marginal. Alexey Kopytov fixed this in the sysbench trunk immediately, though I didn’t want to recompile sysbench on this system so I’ve used Domas’ uncache after prepare to make sure caches were clean.

OLTP benchmark

As the goal was to compare performance with different IO alignment, not different MySQL configurations, I didn’t try out different MySQL versions or settings. Moreover, I have been running these benchmarks for a customer so I just used the setting that they would have used anyway. One thing I did change was – I have significantly reduced InnoDB buffer pool to make sure the benchmark is IO bound.

That said, benchmark was running on a Percona Server 5.0.92-87 with the following my.cnf configuration:

[mysqld]
datadir=/data/mysql
socket=/var/run/mysqld/mysqld.sock
innodb_file_per_table = true
innodb_data_file_path = ibdata1:10M:autoextend
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 8M
innodb_buffer_pool_size = 128M
innodb_log_file_size = 64M
innodb_log_files_in_group = 2
innodb_read_io_threads = 8
innodb_write_io_threads = 8
innodb_io_capacity = 200
port = 3306
back_log = 50
max_connections = 2500
max_connect_errors = 10
table_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
thread_cache_size = 32
query_cache_size = 0
tmp_table_size = 64M
key_buffer_size = 8M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
skip-grant-tables

Amount of rows used was 20M, transactions were not used (autocommit), number of threads – 1, 4, 8, 16 and 32.

Benchmark scenarios

Here’s the different settings that I have ran the same benchmark on. As I mentioned earlier, each of those were run twice – first with RAID controller cache set to Write-Through and then to Write-Back.

1. Baseline – misalignment on the partition table, no LVM and no alignment settings in the file system. This is what you would often get on RHEL5, Ubuntu 8.04 or similar “older” systems if you wouldn’t do anything with respect to IO alignment.

2. Misalignment on the partition table, but proper alignment options on the file system. This is what we get when file system tries to balance writes but is not aware that it is not aligned to the beginning of the stripe element.

3. 1M alignment in partition table but no options on the file system. You should get this on RHEL6, Ubuntu 10.04 and similar systems if you wouldn’t do anything with respect to IO alignment yourself. In this case offset is correct, but file system is unaware how to align files properly.

4. Partition table and file system properly aligned; sunit/swidth set during mkfs. No LVM at this point.

5. Partition table aligned properly; sunit/swidth set during mounting but not during mkfs. This is your best option if you have a proper alignment in partition table but you did not set alignment options in xfs when creating it and you don’t want or can’t format the file system. One thing to note however – files that were written before this was set may still be unaligned, though xfs defragmentation may be able to fix that (not verified).

6. Added LVM on top of aligned partition table, used proper file system alignment.

Benchmark results

I had a hard time thinking how it would be best to present results so it’s not too stuffed and actually interesting. I decided that instead of preparing charts for each benchmark, I’ll just describe few less interesting numbers first, then I’ll show graphs for more interesting results. Let me know if you thought this was a bad idea :)

File IO benchmark results

Sequential read results are expectedly the least interesting. Read-ahead kicked in immediately giving ~9’600 iops (~150MB/s) at 1 thread, 14500 iops (~230MB/s) at 4 threads and ~16300 iops (~250MB/s) at 8 threads. Neither IO alignment nor file size made any difference. Adding LVM here reduced single-thread performance by 5-10%.

Sequential write results were a bit more interesting. With WT (write-through) cache enabled, performance was really poor whatsoever and there was virtually no difference whether it was 1 thread, 4 or 8 threads. Different file sizes made no difference too. Write-back cache gave an incredible performance boost – up to 33x in single-threaded workload. File system IO alignment seems to have made a difference – up to 15% with write-back cache enabled. Here’s 1GB seqwr with WT cache:

1GB seqwr WT cache

Here’s same test with WB cache:

1GB seqwr WB cache

And just to show you the difference between sequential writes with WT cache and WB cache:

1GB seqwr WT vs WB

Random read. This is probably the most interesting number for OLTP workload which is usually light on writes (especially if there’s a BBU protected Write-Back cache) and heavy on random reads. Regardless of the file size, the difference between aligned and misaligned reads was the same and, WT -vs- WB cache of course showed no difference at all. Here are the results:

16GB rndrd

As you can see IO alignment makes a difference here and improves performance up to 15% in case of 8 threads running concurrently. Because the customer was running a database which was way bigger than 16G, I’ve repeated the random read (and write) benchmark with 8 threads and total size of 256G. While the number of operations per second was slightly lower, the difference was still 15% — 909 iops unaligned -vs- 1049 aligned.

Random write. This is an important metric for write intensive workloads where there’s a lot of data being modified, inserts are done to random positions (not consecutive PK causing page splits) etc. Benchmark results are fairly consistent regardless of file size, let’s look at them. First, results with WT cache:

16 rndwr WT cache

And here’s with WB cache:

16 rndwr WB cache

Apparently proper IO alignment in this case gives up to 23% improvement when WB cache is used. With WT cache enabled, single thread performance improvement is marginal however WB cache brings single thread random write performance close to what 8 threads can do, and IO alignment gives extra 23% in this case.

I mentioned I did single test on a larger files (same test I did for random reads) i.e. 8 thread random write benchmark on files totaling to 256GB. With WB cache enabled, I got 919 iops unaligned and 1127 iops aligned i.e. the improvement is still 23%.

OLTP benchmark results

From this benchmark, I only have two graphs to show you. First one is with RAID controller set to WT cache:

sysbench OLTP 20M rows, WT cache

The second is with WB cache:

sysbench OLTP 20M rows, WB cache

I couldn’t figure out what exactly happened with setting #3 when WB cache was disabled, what I do know though is that, based on IO stats I was gathering during the benchmarks, the reason was in fact lower number of IO operations and higher response time – so it seems in this case misaligned IO had some collateral effects in a mixed read/write environment. Note that the benchmarks were all scripted and oltp benchmarks would automatically start after file tests so if there was an error in the setting, it would have reflected across all other benchmarks for the same setting.

Summary

For the two workloads that are most relevant to databases – random reads and random writes – IO alignment on a 4-disk RAID10 with standard 64k stripe element size makes a significant difference. When I launched the system that I was benchmarking, I could clearly see the difference in production as I had another machine running sideways with the same hardware, but with a misaligned IO. Here’s diskstats from the two shards running side by side:

Aligned:
  #ts device    rd_s rd_avkb rd_mb_s rd_mrg rd_cnc   rd_rt    wr_s wr_avkb wr_mb_s wr_mrg wr_cnc   wr_rt busy in_prg
{540} dm-0     447.1    34.0     7.4     0%    2.4     5.4    23.4    49.6     0.6     0%    0.0     0.6  85%      0

Misaligned:
  #ts device    rd_s rd_avkb rd_mb_s rd_mrg rd_cnc   rd_rt    wr_s wr_avkb wr_mb_s wr_mrg wr_cnc   wr_rt busy in_prg
{925} dm-0     462.1    34.1     7.7     0%    3.8     8.2    12.1    87.0     0.5     0%    0.0     0.7  93%      0

While number of operations from the OS perspective is very similar, due to high concurrency response time in the first case is significantly better.

It would be interesting however to run similar benchmarks on a larger RAID5 system where it should make even bigger difference on writes. Another interesting setting might be a [mirrored] RAID0 with many more stripes as not having proper file system alignment should have really interesting effects. Large stripe on the other hand should somewhat reduce the effects of misalignment, though it would definitely be interesting to run benchmarks and verify that. If you have some numbers to share, please leave a comment. Next, I plan to look at IO alignment on Flash cards to see what benefits we can get there from proper alignment.

You can find scripts and plain data here on our public wiki.

May
06
2011
--

Shard-Query turbo charges Infobright community edition (ICE)

Shard-Query is an open source tool kit which helps improve the performance of queries against a MySQL database by distributing the work over multiple machines and/or multiple cores. This is similar to the divide and conquer approach that Hive takes in combination with Hadoop. Shard-Query applies a clever approach to parallelism which allows it to significantly improve the performance of queries by spreading the work over all available compute resources. In this test, Shard-Query averages a nearly 6x (max over 10x) improvement over the baseline, as shown in the following graph:

One significant advantage of Shard-Query over Hive is that it works with existing MySQL data sets and queries. Another advantage is that it works with all MySQL storage engines.

This set of benchmarks evaluates how well Infobright community edition (ICE) performs in combination with Shard-Query.

Data set

It was important to choose a data set that was large enough to create queries that would run for a decent amount of time, but not so large that it was difficult to work with. The ontime flight performance statistics data, available online from the United States Bureau of Transportation Statistics (BTS) made a good candidate for testing, as it had been tested before:
Another MPB post
Lucid DB testing

The raw data is a completely denormalized schema (single table). In order to demonstrate the power of Shard-Query it is important to test complex queries involving joins and aggregation. A star schema is the most common OLAP/DW data model, since it typically represents a data mart. See also: “Data mart or data warehouse?”. As it is the most common data model, it is desirable to benchmark using a star schema, even though it involves work to transform the data.

Star schema

Transforming the data was straightforward. I should note that I did this preprocessing with the MyISAM storage engine, then I dumped the data to tab delimited flat files using mysqldump. I started by loading the raw data from the BTS into a single database table called ontime_stage.

Then, the airport information was extracted:

create table dim_airport(
airport_id int auto_increment primary key,
unique key(airport_code)
)
as
select
  Origin as `airport_code`,
  OriginCityName as `CityName`,
  OriginState as `State`,
  OriginStateFips as `StateFips`,
  OriginStateName as `StateName` ,
  OriginWac as `Wac`
FROM ontime_stage
UNION
select
  Dest as `airport_code`,
  DestCityName as `CityName`,
  DestState as `State`,
  DestStateFips as `StateFips`,
  DestStateName as `StateName` ,
  DestWac as `Wac`
FROM ontime_stage;

After extracting flight/airline and date information in a similar fashion, a final table `ontime_fact` is created by joining the newly constructed dimension table tables to the staging tables, omitting the dimension columns from the projection, instead replacing them with the dimension keys:

select dim_date.date_id,
       origin.airport_id as origin_airport_id,
       dest.airport_id as dest_airport_id,
       dim_flight.flight_id,
       ontime_stage.TailNum, ...
from ontime_stage
join dim_date using(FlightDate)
join dim_airport origin on ontime_stage.Origin = origin.airport_code
join dim_airport dest on ontime_stage.Dest = dest.airport_code
join dim_flight using (UniqueCarrier,AirlineID,Carrier,FlightNum);

The data set contains ontime flight information for 22 years, which can be confirmed by examining the contents of the date dimension:

mysql> select count(*),
min(FlightDate),
max(FlightDate)
from dim_date\G
*************************** 1. row ***************************
       count(*): 8401
min(FlightDate): 1988-01-01
max(FlightDate): 2010-12-31
1 row in set (0.00 sec)

The airport dimension is a puppet dimension. It is called a puppet because it serves as both origin and destination dimensions, being referenced by origin_airport_id and destination_airport_id in the fact table, respectively. There are nearly 400 major airports included in the data set.

mysql> select count(*) from dim_airport;
+----------+
| count(*) |
+----------+
|      396 |
+----------+
1 row in set (0.00 sec)

The final dimension is the flight dimension, which contains the flight numbers and air carrier hierarchies. Only the largest air carriers must register and report ontime information with the FAA, so there are only 29 air carriers in the table:

mysql> select count(*),
count(distinct UniqueCarrier)
from dim_flight\G
*************************** 1. row ***************************
                     count(*): 58625
count(distinct UniqueCarrier): 29
1 row in set (0.02 sec)

Each year has tens of millions of flights:

mysql> select count(*) from ontime_one.ontime_fact;
+-----------+
| count(*)  |
+-----------+
| 135125787 |
+-----------+
1 row in set (0.00 sec)

This should be made fully clear by the following schema diagram:

Star schema (ontime_fact, dim_date, dim_flight, dim_airport)

Diagram of the ontime dimensional schema

Test environment

For this benchmark, a test environment consisting of a single commodity database server with 6 cores (+6ht) and 24GB of memory was selected. The selected operating system was Fedora 14. Oracle VirtualBox OSE was used to create six virtual machines, each running Fedora 14. Each of the virtual machines was granted 4GB of memory. A SATA 7200rpm RAID10 battery backed RAID array was used as the underlying storage for the virtual machines.

Baseline:
The MySQL command line client was used to execute the a script file containing the 11 queries. This same SQL script was used in the Shard-Query tests. For the baseline, the results and response times were captured with the \T command. The queries were executed on a single database schema containing all of the data. Before loading, there was approximately 23GB of data. After loading, ICE compressed this data to about 2GB. The test virtual machine was assigned 12 cores in this test.

Scale-up:
Shard-Query was given the following configuration file, which lists only one server. A single schema (ontime_one) contained all of the data. The test virtual machine was assigned 12 cores for this test. The same VM was used as the previous baseline test. This VM was rebooted between tests. A SQL script was fed to the run_query.php script and the output was captured with the ‘tee’ command.

$ cat one.ini
[default]
user=mysql
db=ontime_one
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.102

Scale-out
In addition to adding parallelism via scale-up, Shard-Query can improve performance of almost all queries by spreading them over more than one physical server. This is called “scaling out” and it allows Shard-Query to vastly improve the performance of queries which have to examine a large amount of data. Shard-Query includes a loader (loader.php) which can be used to either split a data into multiple files (for each shard, for later loading) or it can load files directly, in parallel, to multiple hosts.

Shard-Query will execute queries in parallel over all of these machines. With enough machines, massive parallelism is possible and very large data sets may be processed. As each machine examines only a small subset of the data, performance can be improved significantly:

$ cat shards.ini
[default]
user=mysql
db=ontime
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.101
db=ontime

[shard2]
host=192.168.56.102
db=ontime

[shard3]
host=192.168.56.103
db=ontime

[shard4]
host=192.168.56.105
db=ontime

[shard5]
host=192.168.56.106
db=ontime

[shard6]
host=192.168.56.104
db=ontime

In this configuration, each shard has about 335MB-350MB of data (23GB raw data, compressed to about 2GB total data. then spread over six servers). Due to ICE limitations, the data was split before loading. The splitting/loading process will be described in another post.

Complex queries

Shard-Query was tested with the simple single table version of this data set in a previous blog post. Previous testing was limited to a subset of Vadim’s test queries (see that post). As this new test schema is normalized, Vadim’s test queries must be modified to reflect the more complex schema structure. For this benchmark each of the original queries has been rewritten to conform to the new schema, and additionally two new test queries have been added. To model real world complexity, each of the test queries feature at least one join, and many of the filter conditions are placed on attributes in the dimension tables. It will be very interesting to test these queries on various engines and databases.

Following is a list of the queries, followed by a response time table recording the actual response times for each query. The queries should be self-explanatory.

Performance, at a glance

You will notice that Shard-Query is faster in nearly every case. The performance of the queries is improved significantly by scaling out, even more so than scaling up, because additional parallelism is added beyond what can be exploited by scale up. Scale up can improve query performance when there is enough resources to support the added parallelism, and when one of the the following are in used in the query: BETWEEN or IN clauses, subqueries in the FROM clause, UNION or UNION ALL clauses. If none of those features are used, then parallelism can’t be added. Q9 is an example of such a query.
.

Query details:

-- Q1
 SELECT DayOfWeek, count(*) AS c
   from ontime_fact
   JOIN dim_date using (date_id)
  WHERE Year
BETWEEN 2000 AND 2008
  GROUP BY DayOfWeek
  ORDER BY c DESC;

-- Q2
SELECT DayOfWeek, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
 WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008
 GROUP BY DayOfWeek
 ORDER BY c DESC;

-- Q3
SELECT CityName as Origin, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
  JOIN dim_airport origin
    ON origin_airport_id = origin.airport_id
 WHERE DepDelay>10
   AND Year BETWEEN 2000 AND 2008
 GROUP BY 1
 ORDER BY c
 LIMIT 10;


The next queries show how performance is improved when Shard-Query adds parallelism when “subqueries in the from clause” are used. There are benefits with both “scale-up” and “scale-out”, but once again, the “scale-out” results are the most striking.

-- Q4
SELECT Carrier, count(*) as c
  from ontime_fact
  JOIN dim_date using (date_id)
  join dim_flight using(flight_id)
 WHERE DepDelay>10
   AND Year=2007
 GROUP BY Carrier
 ORDER BY c DESC;

-- Q5
SELECT t.Carrier, c, c2, c*1000/c2 as c3
FROM
     (SELECT Carrier, count(*) AS c
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE DepDelay>10
         AND Year=2007
       GROUP BY Carrier) t
JOIN (SELECT Carrier, count(*) AS c2
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE Year=2007
       GROUP BY Carrier) t2
  ON (t.Carrier=t2.Carrier)
ORDER BY c3 DESC;

-- Q6
SELECT t.Year, c1 / c2 as ratio
FROM
     (select Year, count(*)*1000 as c1
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t
JOIN (select Year, count(*) as c2
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t2
  ON (t.Year=t2.Year);

-- Q7
SELECT t.Year, c1 / c2 as ratio
  FROM (select Year, count(Year)*1000 as c1
          from ontime_fact
          join dim_date using (date_id)
         WHERE DepDelay>10
         GROUP BY Year) t
  JOIN (select Year, count(*) as c2
          from ontime_fact
          join dim_date using (date_id)
         GROUP BY Year) t2
    ON (t.Year=t2.Year);


The performance of the following queries depends on the size of the date range:

-- Q8.0
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2001
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.1
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2005
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.2
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.3
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1990 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.4
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1980 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;


Finally, Shard-Query performance continues to improve when grouping and filtering is used. Again, notice Q9. It doesn’t use any features which Shard-Query can use to add parallelism. Thus, in the scale up configuration it does not perform any better than the baseline, and actually performed just a little worse. Since scale out splits the data between servers, it performs about 6x better as the degree of parallelism is controlled by the number of shards.

-- Q9
select Year ,count(Year) as c1
  from ontime_fact
  JOIN dim_date using (date_id)
 group by Year;

-- Q10
SELECT Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2009 and 2011
 GROUP BY Carrier,dest.CityName
 ORDER BY 3 DESC;

-- Q11
SELECT Year, Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2000 and 2003
   AND Carrier = 'AA'
 GROUP BY Year, Carrier,dest.CityName
 ORDER BY 4 DESC;


Conclusion

The divide and conquer approach is very useful when working with large quantities of data. Shard-Query can be used with existing data sets easily, improving the performance of queries significantly if they use common query features like BETWEEN or IN. It is also possible to spread your data over multiple machines, scaling out to improve query response times significantly.

These queries are a great test of Shard-Query features. It is currently approaching RC status. If you decide to test it and encounter issues, please file a bug on the bug tracker. You can get Shard-Query (currently in development release form as a checkout from SVN) here: Shard-Query Google code project

Full disclosure

Justin Swanhart, the author of this article is also the creator and maintainer of Shard-Query. The author has previously worked in cooperation with Infobright, including on benchmarking. These particular tests were performed independently of Infobright, without their knowledge or approval. Infobright was, however, given the chance to review this document before publication, as a courtesy. All findings are represented truthfully, transparently, and without any intended bias.

Feb
01
2011
--

Sample datasets for benchmarking and testing

Sometimes you just need some data to test and stress things. But randomly generated data is awful — it doesn’t have realistic distributions, and it isn’t easy to understand whether your results are meaningful and correct. Real or quasi-real data is best. Whether you’re looking for a couple of megabytes or many terabytes, the following sources of data might help you benchmark and test under more realistic conditions.

Post your favorites in the comments!

Dec
04
2009
--

Effect of adaptive_flushing

I recently had the chance to witness the effects of innodb_adaptive_flushing on the performance of InnoDB Plugin 1.0.5 in the wild, which Yasufumi wrote about previously here and here.

The server in question was Solaris 10 with 8 disk RAID10 and 2 32GB SSDs used for ZIL and L2ARC, 72G RAM and 40G buffer pool. We started it up with innodb_adaptive_flushing=OFF and innodb_doublewrite=OFF, then ramped up traffic and everything looked stable … but I noticed one troubling thing: ~2GB of uncheckpointed data.

mysql> SHOW INNODB STATUS\G
....
Database pages      2318457
Old database pages  855816
Modified db pages   457902
Log flushed up to   10026890404067
Last checkpoint at  10024612103454
....

We enabled innodb_adaptive_flushing just before 10am, which resulted in a few  changes  which were recorded by the Cacti MySQL Templates.  The most important change for the client here is the recovery time — by enabling adaptive flushing, InnoDB purged modified data much more aggressively, resulting in very little unflushed changes, which translates to much faster crash recovery.

db01-checkpt-age-1d

There was also a drop in history list length, which may be because of more aggressive flushing. In general this is good for performance because InnodB does not need to consider as many old versions of rows.

db01-innodb-trx-1d

Here we can see the “spiky flushing” before 10am (when adaptive flushing was off), which can be very bad for performance because during those short periods of very high page write activity, the system and/or innodb can become very slow (Yasufumi’s benchmarks show this very well). After enabling adaptive flushing we see a consistent and higher-than-previous-average rate of pages written to disk, which I find rather interesting

db01-buffer-pool-act-1d

Of course, there is also the question of system performance impact… More aggressive flushing means more disk IO and possibly  other impact on the system, which could impact query response time, and so on. And what about CPU, load average, and so on? Actually, those did not change enough to see a difference in the graphs so I am not including them; after all, we are not pushing this system near its limitations so I am not really surprised. We can see a change in disk IO which mirrors the innodb pages written graph  (shown above). I suspect ZFS is heavily buffering writes during the”spiky flushing” period because the delta between peak-write and low-write is much lower in terms of bytes written than pages written.

It is also worth noting there that this system is using 2 SSD’s for the ZIL and L2ARC, and as everyone should know by now, SSD’s lifetime is determined by the number of writes they can perform before failure. This means that more writes per second == shorter lifetime, and so a shorter SSD lifetime is a negative side effect of enabling adaptive flushing on this server.

db01-zfs-iostat-1d

One more point worth mentioning – the normal approach to control InnoDB’s checkpoint age is by adjusting the log file size. This server is configured with 2 * 1.5G log files, and innodb_io_capacity=1000. While adaptive flushing was off, the checkpoint age was comparable to the log file size, however the behavior with flushing enabled (keeping the checkpoint age very small) seems too aggressive. It’s possible we would see different results by adjusting innodb_io_capacity here.

Is there a lesson in this? Sometimes tuning InnoDB settings can have quite unexpected results…. Setting innodb_adaptive_flushing ON seems better in this case because there is IO bandwidth to spare and we are concerned about crash recovery time, but as Yasufumi’s previous posts show, it can sometimes also have bad results, and one may need to try and try again, before finding the “sweet spot” for your server / workload combination. As always, you should benchmark your particular systems and not blindly follow advice you read on the internet, and have good monitoring and trending tools in place from the start.


Entry posted by Devananda |
No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Dec
04
2009
--

Effect of adaptive_flushing

I recently had the chance to witness the effects of innodb_adaptive_flushing on the performance of InnoDB Plugin 1.0.5 in the wild, which Yasufumi wrote about previously here and here.

The server in question was Solaris 10 with 8 disk RAID10 and 2 32GB SSDs used for ZIL and L2ARC, 72G RAM and 40G buffer pool. We started it up with innodb_adaptive_flushing=OFF and innodb_doublewrite=OFF, then ramped up traffic and everything looked stable … but I noticed one troubling thing: ~2GB of uncheckpointed data.

mysql> SHOW INNODB STATUS\G
....
Database pages      2318457
Old database pages  855816
Modified db pages   457902
Log flushed up to   10026890404067
Last checkpoint at  10024612103454
....

We enabled innodb_adaptive_flushing just before 10am, which resulted in a few  changes  which were recorded by the Cacti MySQL Templates.  The most important change for the client here is the recovery time — by enabling adaptive flushing, InnoDB purged modified data much more aggressively, resulting in very little unflushed changes, which translates to much faster crash recovery.

db01-checkpt-age-1d

There was also a drop in history list length, which may be because of more aggressive flushing. In general this is good for performance because InnodB does not need to consider as many old versions of rows.

db01-innodb-trx-1d

Here we can see the “spiky flushing” before 10am (when adaptive flushing was off), which can be very bad for performance because during those short periods of very high page write activity, the system and/or innodb can become very slow (Yasufumi’s benchmarks show this very well). After enabling adaptive flushing we see a consistent and higher-than-previous-average rate of pages written to disk, which I find rather interesting

db01-buffer-pool-act-1d

Of course, there is also the question of system performance impact… More aggressive flushing means more disk IO and possibly  other impact on the system, which could impact query response time, and so on. And what about CPU, load average, and so on? Actually, those did not change enough to see a difference in the graphs so I am not including them; after all, we are not pushing this system near its limitations so I am not really surprised. We can see a change in disk IO which mirrors the innodb pages written graph  (shown above). I suspect ZFS is heavily buffering writes during the”spiky flushing” period because the delta between peak-write and low-write is much lower in terms of bytes written than pages written.

It is also worth noting there that this system is using 2 SSD’s for the ZIL and L2ARC, and as everyone should know by now, SSD’s lifetime is determined by the number of writes they can perform before failure. This means that more writes per second == shorter lifetime, and so a shorter SSD lifetime is a negative side effect of enabling adaptive flushing on this server.

db01-zfs-iostat-1d

One more point worth mentioning – the normal approach to control InnoDB’s checkpoint age is by adjusting the log file size. This server is configured with 2 * 1.5G log files, and innodb_io_capacity=1000. While adaptive flushing was off, the checkpoint age was comparable to the log file size, however the behavior with flushing enabled (keeping the checkpoint age very small) seems too aggressive. It’s possible we would see different results by adjusting innodb_io_capacity here.

Is there a lesson in this? Sometimes tuning InnoDB settings can have quite unexpected results…. Setting innodb_adaptive_flushing ON seems better in this case because there is IO bandwidth to spare and we are concerned about crash recovery time, but as Yasufumi’s previous posts show, it can sometimes also have bad results, and one may need to try and try again, before finding the “sweet spot” for your server / workload combination. As always, you should benchmark your particular systems and not blindly follow advice you read on the internet, and have good monitoring and trending tools in place from the start.


Entry posted by Devananda |
4 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com