May 16, 2011

Finding an optimal balance of I/O, CPU, and RAM for MySQL

For a long time I’ve wanted to know how MySQL scales as you add more memory to the server. Vadim recently benchmarked the effects of increasing memory and CPU core count. He looked for a balance between utilizing the hardware as much as possible, limiting the system complexity, and lowering the price-to-performance ratio.

The outcome of the research, which was sponsored by Virident, is that as you add CPUs and increase memory size, MySQL doesn’t scale as well as we would like, and solid-state storage — specifically, the Virident tachIOn drive — has more bandwidth than MySQL can fully utilize at present. Therefore, to decrease the price-to-performance ratio and increase the utilization of the tachIOn drive, Vadim sharded the database into smaller instances and colocated them on the same machine. It’s not a new approach, but to date I’m not aware of anyone measuring the different configurations the way Vadim has done.

You can read the full details in our Scaling MySQL With Virident Flash Drives and Multiple Instances of Percona Server white paper.

May 16, 2011

Using any general purpose computer as a special purpose SIMD computer

From a computing perspective, one must often run a function over a large amount of input. Running the same function on many pieces of input is a very expensive process unless the work can be done in parallel.

Shard-Query introduces set-based processing, which on the surface appears similar to other technologies on the market today. However, the scaling features of Shard-Query are just a side effect of the fact that it operates on sets in parallel. Any set can be operated on with any arbitrary degree of parallelism, up to and including the cardinality of the set.
Given that:

  1. It is often possible to arbitrarily transform one type of expression into a different, but compatible type for computational purposes as long as the conversion is bidirectional
  2. A range operation over a set of integers or dates can be transformed into one or more discrete sub-ranges (see the sketch after this list)
  3. Any operation on an entire set is the same as running that operation on each item individually.
  4. After expanding a set operation into N discrete dimensions, it can always be collapsed back into a one dimensional set.
  5. Arrays are sets
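
As a rough illustration of points 2 through 4 above (a sketch only, with hypothetical table and column names rather than Shard-Query's actual rewrites), a range aggregate can be split into discrete sub-range aggregates that run in parallel and are then collapsed back into a single result:

-- original set operation over a date range
SELECT SUM(amount) FROM facts
WHERE purchase_date BETWEEN '2011-01-01' AND '2011-02-28';

-- decomposed into two discrete sub-ranges that can run in parallel
SELECT SUM(amount) FROM facts
WHERE purchase_date BETWEEN '2011-01-01' AND '2011-01-31';

SELECT SUM(amount) FROM facts
WHERE purchase_date BETWEEN '2011-02-01' AND '2011-02-28';

-- collapsed back into a one dimensional result by summing the partial sums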

(more…)

May 15, 2011

Win a free ticket to RailsConf!

I have one free ticket to give away to RailsConf next week in Baltimore! Post a comment to win, and if you aren’t the winner, I’ll give you a discount code for Percona Live as a consolation prize.

Here’s the catch: you have to find at least one thing wrong with the following typical logrotate configuration for MySQL. This should be easy even if you’re not a MySQL expert :-)

# This logname can be set in /etc/my.cnf
# by setting the variable "err-log"
# in the [safe_mysqld] section as follows:
#
# [safe_mysqld]
# err-log=/var/lib/mysql/mysqld.log
#
# If the root user has a password you have to create a
# /root/.my.cnf configuration file with the following
# content:
#
# [mysqladmin]
# password = <secret>
# user= root
#
# where "<secret>" is the password.
#
# ATTENTION: This /root/.my.cnf should be readable ONLY
# for root !

/var/lib/mysql/mysqld.log {
        create 600 mysql mysql
        notifempty
    daily
        rotate 3
        missingok
        compress
    postrotate
    # just if mysqld is really running
    if test -x /usr/bin/mysqladmin && \
       /usr/bin/mysqladmin ping &>/dev/null
    then
       /usr/bin/mysqladmin flush-logs
    fi
    endscript
}
May 14, 2011

Distributed set processing performance analysis with ICE 3.5.2pl1 at 20 nodes.

Demonstrating distributed set processing performance

Shard-Query + ICE scales very well up to at least 20 nodes

This post is a detailed performance analysis of what I’ve coined “distributed set processing”.

Please also read this post’s “sister post” which describes the distributed set processing technique.

Also, remember that Percona can help you get up and running using these tools in no time flat. We are the only ones with experience with them.

Twenty is the maximum number of nodes that I am allowed to create in EC2. I can test further on our 32-core system, but I wanted to do a real-world "cloud" test to show that this works over the network. There is a slight performance oddity at 16 nodes. I suspect the EC2 environment is the reason for this, but it requires further investigation.

Next I compared the performance of 20 m1.large machines cold versus hot. I did not record the cold results on the c1.medium machines, so only the warm results are provided for reference. Remember that the raw input data set was 55GB before being converted to a star schema (21GB) and compressed by ICE to 2.5GB. Many of these queries examine the entire data set, computing origin/count(distinct destination) combinations across two dimensions (origin/dest), each with 400 unique items.
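
As a rough sketch of the shape of such a query (illustrative only; the actual benchmark queries live in queries.sql, referenced below):

SELECT origin_airport_id,
       COUNT(DISTINCT dest_airport_id) AS destinations
FROM ontime_fact
GROUP BY origin_airport_id;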

In the following chart you will see performance at a single node as the tall blue line, and the short cyan line is 20 nodes. In order to avoid too many bars on the chart, response times between 2 and 16 nodes (inclusive) are shown as lines.
Query response time from 1 to 20 nodes

This chart shows the same data in another way:

Concurrency testing is important too. I tested a 20 node m1.large system at 1, 2, 4, 8, 16, and 32 threads of concurrency.

The following simple bash scripts were used for the concurrency test:

$ cat start_bench
#!/bin/bash
if [ "$1" = "" ]
then
  workers=1
else
  workers=$1
fi
if [ "$2" = "" ]
then
  iterations=3
else
  iterations=$2
fi
for i in `seq 1 $workers`
do
  echo "Launching benchmark worker #$i with $iterations iterations"
  ./bench_worker $iterations $workers &
done
wait
$ cat bench_worker
#!/bin/bash
for i in `seq 1 $1`
do
        mkdir -p "results/$2/"
        ./run_query < queries.sql |egrep "rows|^-" > results/$2/raw.$$.$i.txt
done;

Query processing is handled by a Gearman queue which limits the maximum number of concurrent storage node queries. This prevents the system from being overloaded and provides a scalable average response time under increased concurrency. Another queue in front of the storage nodes is probably advisable.

May 14, 2011

Distributed Set Processing with Shard-Query

Can Shard-Query scale to 20 nodes?

Peter asked this question in comments to my previous Shard-Query benchmark. Actually he asked if it could scale to 50, but testing 20 was all I could do due to EC2 and time limits. I think the results at 20 nodes are very useful to understand the performance:

I will shortly release another blog post focusing on ICE performance at 20 nodes, but before that, I want to give a quick preview, then explain exactly how Shard-Query works.
Query response time from 1 to 20 nodes

Yes, Shard-Query scales very well at 20 nodes

Distributed set processing (theory)

What is SQL?

As you probably know, SQL stands for “structured query language”. It isn’t so much the language that is structured, but actually the data. Every SQL statement breaks down into a relational algebra equation. In algebra you learn that some operations are “distributable”: you can split them up into smaller units, and when you put those units back together the result is the same as if you hadn’t taken them apart. Projection, GROUP BY, SUM, COUNT, MIN*, and MAX* are distributable. With a little elbow grease, all non-distributable aggregate functions (AVG, STDDEV, VARIANCE, etc.) can be decomposed into distributable functions using simple substitution rules.

So, to recap, every SQL query is really a cleverly disguised relational algebra mathematical expression. With relational algebraic substitution, every aggregate expression, even a non-distributable one, can be broken into distributable sub-expressions.
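
For example, AVG is not distributable on its own, but it can be rewritten in terms of SUM and COUNT, which are. A minimal sketch (the table and temporary table names here are illustrative, not Shard-Query's generated ones):

-- on each shard, compute the distributable pieces
SELECT SUM(DepDelay) AS sum_DepDelay, COUNT(DepDelay) AS cnt_DepDelay
FROM ontime_fact;

-- on the coordination node, combine the per-shard pieces
SELECT SUM(sum_DepDelay) / SUM(cnt_DepDelay) AS avg_DepDelay
FROM aggregation_tmp;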

What is a result set?

This isn’t really a trick question. The “result set” is a SET created by the output of a relational algebra expression. Relational algebra expressions always produce sets as output. Just to drive it home, SQL is relational algebra, and this algebra only operates on sets. The next important thing to understand about SQL is that it is declarative. That is, you tell the database what you want but not how to get it. Most distributed engines work with rows. They break the queries up into the lowest-level sets possible (rows), which doesn’t make much sense to me, since SQL is set oriented. I just said repeatedly that SQL doesn’t work on rows! Why would you break this paradigm by passing around rows instead of sets? From a distributed processing standpoint, rows are the worst case for performance. Optimal mathematical performance requires operations on reduced sets. Keep in mind that row-based systems work well, but these systems are still much farther from optimal than working directly with relational algebra.

Materialized view techniques applied to distributed computation

I maintain another open source tool called Flexviews which supports incrementally maintaining materialized views. While writing Flexviews, I learned how to distribute the computation of aggregation queries over time. I realised that I could apply these same mathematical concepts to distributed queries as well, but with some minor differences.

Having written Flexviews, I understood that there are special materialized view optimizations that can be applied to INSERT-only workloads, and there are other optimizations that can be applied to views based on only a single base table. Knowing this, the query result set is treated as a materialized view over a union of all the already joined and aggregated data from all the nodes. A single temporary table is used to store the results from all the nodes. Since we are projecting results, this is naturally an INSERT-only workload. The insertions into the base table from each node correspond logically to records in a Flexviews materialized view delta table, and thus the logic for applying changes via ON DUPLICATE KEY UPDATE is the same. All of the fastest incremental materialized view optimizations can be applied.
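
A minimal sketch of that ON DUPLICATE KEY UPDATE logic, with illustrative table and column names (the real statement that Shard-Query generates is shown in the query plan later in this post):

-- coordination table acting as the "materialized view" of the result
CREATE TABLE aggregation_tmp (
  origin_airport_id INT NOT NULL PRIMARY KEY,
  cnt BIGINT NOT NULL
);

-- each partial result row from a storage node is applied as a delta
INSERT INTO aggregation_tmp (origin_airport_id, cnt)
VALUES (12478, 1042)
ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt);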

Shard-Query works only on sets

Shard-Query takes relational algebra to its logical conclusion, splitting a problem up into many small problems and then putting the results back together again. This set logic allows multiple levels of reduction of the set, all in parallel. All joins, aggregation, and filtering are done at the storage node level. This means that Shard-Query does not need any idea of the structure of the data on which your queries operate. It can, however, use a mapper for partition elimination. The mapper is pluggable.

On each storage node (or a mapped subset of nodes) the result set is aggregated using distributable aggregate functions. The results from this aggregation get a second distributed reduction via UPSERTs into the coordination node, which maintains the “materialized view” of the results. Finally, a single-threaded group-by reduction is run on the coordination node, and this projects the final result set.

Inter-node communication

One or more Gearman queues are used to distribute the work. Computation is massively parallel. Shard-Query simply waits for all the queries to finish, then projects the synchronously maintained, incrementally refreshable materialized view that represents the output. No external locking or synchronization is required. If a query fails on a storage node, it can be retried on the same node or, in the future, on a redundant node.

Work on problems of any size.

Set processing is massively parallel

In fact, it is embarrassingly parallel. Because Shard-Query works on sets and features pluggable partition mapping, it allows partitioning resources to any depth and distributing a query over that set of resources. Let’s say you have a database system that is at capacity in data center A, and there is not enough power to add new nodes there. You can add new nodes in data center B and distribute the queries over both data centers. The response time should only be increased by the average latency, since work is done in parallel and there is very little data shipping because of the distributed reduction discussed above. If you have any limitation in resources in a cluster (CPU, memory, disk, power, etc.), then split the problem up into more chunks.

Each Shard-Query instance is essentially a proxy for the distributed, set-based SQL operations executed on the nodes under the proxy. The databases do 99.9% of the work, due to the multiple levels of result set reduction. Finally, Shard-Query can automatically partition sets into subsets using the basic relational algebra substitution rules documented in this blog post. This allows BETWEEN, IN, subqueries in the FROM clause, and UNION operations to operate fully in parallel.

Distributed set processing is database agnostic.

Keep in mind, at each level of partitioning only a list of servers and some credentials are required to make this work. It can even be set up without the aid of your database administrator.

As long as all of the nodes share a common schema model and a common SQL dialect, they can all participate in the distributed query processing. There is almost no data shipping, as only aggregated sets are sent over the network. In the future you will be able to distribute work over any type of compute resource which speaks SQL, but right now only MySQL storage nodes are supported. Amdahl’s law applies to the distributed processing: the slowest node places a lower bound on the overall response time.

Soon you will be able to set up computation over ICE, InnoDB, Vectorwise, and other databases, all at the same time and transparently to the source query. Adding a new storage node SQL dialect is almost trivial. I’ll document that process shortly; I think I need to make some minor abstraction modifications in the data access layer.

Shard-Query can provide query execution plans based on the relational algebra rewrites

So, here is an example of such a plan for a simple query with aggregation, a JOIN, and a WHERE clause that uses an IN clause:

-- INPUT SQL:
select
origin_airport_id,
count(*),
sum(AirTime),
sum(DepDelay),
sum(DepDelay >= 0)
flight_delayed
from ontime_fact
join dim_date
on ontime_fact.date_id = dim_date.date_id
where dim_date.Year IN (2008,2009)
group by 1;

--PARALLEL OPTIMIZATIONS:
* Base level table name: `aggregation_tmp#74082863`
* IN list optimization enabled
* Detected an IN list with 2 items
* IN list compression is not required
-- 2 items is less than the inlist-merge-threshold of 128 items

* The following projections were selected for a UNIQUE CHECK
* on the base table:
`origin_airport_id`

* storage node result set merge optimization enabled:
ON DUPLICATE KEY UPDATE
`origin_airport_id`=VALUES(`origin_airport_id`),
`count(*)`=`count(*)` +  VALUES(`count(*)`),
`sum(AirTime)`=`sum(AirTime)` +  VALUES(`sum(AirTime)`),
`sum(DepDelay)`=`sum(DepDelay)` +  VALUES(`sum(DepDelay)`),
`sum(DepDelay >= 0) flight_delayed`=`sum(DepDelay >= 0) flight_delayed`
  +  VALUES(`sum(DepDelay >= 0) flight_delayed`)

-- SQL TO SEND TO SHARDS 

Array
(
    [0] => SELECT origin_airport_id AS `origin_airport_id`,COUNT(*) AS `count(*)`,
SUM((AirTime)) AS `sum(AirTime)`,SUM((DepDelay)) AS `sum(DepDelay)`,
SUM((DepDelay >= 0)) AS `sum(DepDelay >= 0) flight_delayed`
FROM ontime_fact AS `ontime_fact`
JOIN dim_date AS `dim_date` ON (ontime_fact.date_id = dim_date.date_id)
WHERE dim_date.Year IN  (2008)
GROUP BY 1 ORDER BY NULL

    [1] => SELECT origin_airport_id AS `origin_airport_id`,COUNT(*) AS `count(*)`,
SUM((AirTime)) AS `sum(AirTime)`,SUM((DepDelay)) AS `sum(DepDelay)`,
SUM((DepDelay >= 0)) AS `sum(DepDelay >= 0) flight_delayed`
FROM ontime_fact AS `ontime_fact`
JOIN dim_date AS `dim_date` ON (ontime_fact.date_id = dim_date.date_id)
WHERE dim_date.Year IN  (2009)
GROUP BY 1 ORDER BY NULL
)

-- AGGREGATION SQL:
SELECT `origin_airport_id`,
SUM(`count(*)`) AS `count(*)`,
SUM(`sum(AirTime)`) AS `sum(AirTime)`,
SUM(`sum(DepDelay)`) AS `sum(DepDelay)`,
SUM(`sum(DepDelay >= 0) flight_delayed`) AS `sum(DepDelay >= 0) flight_delayed`
FROM `aggregation_tmp_74082863`
GROUP BY 1 

--POST EXEC:
DROP TABLE IF EXISTS `aggregation_tmp#74082863` ;

Edit

*MIN/MAX are only distributable in INSERT-only workloads

See this information about maintaining materialized views:

http://scholar.google.com/scholar?hl=en&lr=&q=related:Y6OKUW79CGQJ:scholar.google.com

May 12, 2011

Connecting orphaned .ibd files

There are two ways InnoDB can organize tablespaces. The first is when all data, indexes, and system buffers are stored in a single tablespace, typically one or several ibdata files. The well-known innodb_file_per_table option enables the second: tables and system areas are split into different files. Usually the system tablespace is located in the ibdata1 file, and every InnoDB table has two files, e.g. actor.frm and actor.ibd.
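
A quick way to check which mode a server is running in (just a sketch):

SHOW VARIABLES LIKE 'innodb_file_per_table';
SHOW VARIABLES LIKE 'innodb_data_file_path';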

The annoying thing about .ibd files is that you can’t easily copy an .ibd file to another MySQL server. If you try, very often you’ll get an error in the log:

InnoDB: Error: tablespace id is 10 in the data dictionary
InnoDB: but in file ./sakila/actor.ibd it is 15!

However, sometimes you have to connect the .ibd file to an alien ibdata1.

There are several situations when you have to:

1. ibdata1 is erroneously removed

2. ibdata1 is heavily corrupted and innodb_force_recovery doesn’t help

Chris Calender suggests two methods. The first is to create/drop the table many times until the space_id in the InnoDB dictionary and in the .ibd file match. The second is to edit the space_id inside the .ibd file with a hex editor.

I would like to elaborate on the second method.

But first let’s understand what’s going on and why InnoDB refuses to use the suggested .ibd file.

There is an InnoDB dictionary. It consists of several internal tables. For our topic only SYS_TABLES and SYS_INDEXES are relevant. These are ordinary InnoDB tables, but they’re hidden from the user (you can see them in the information_schema database in Percona Server, though).

The structure of these tables is as follows:

SYS_TABLES:

CREATE TABLE `SYS_TABLES` (
`NAME` varchar(255) NOT NULL default '',
`ID` bigint(20) unsigned NOT NULL default '0',
`N_COLS` int(10) default NULL,
`TYPE` int(10) unsigned default NULL,
`MIX_ID` bigint(20) unsigned default NULL,
`MIX_LEN` int(10) unsigned default NULL,
`CLUSTER_NAME` varchar(255) default NULL,
`SPACE` int(10) unsigned default NULL,
PRIMARY KEY  (`NAME`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

SYS_INDEXES:

CREATE TABLE `SYS_INDEXES` (
`TABLE_ID` bigint(20) unsigned NOT NULL default '0',
`ID` bigint(20) unsigned NOT NULL default '0',
`NAME` varchar(120) default NULL,
`N_FIELDS` int(10) unsigned default NULL,
`TYPE` int(10) unsigned default NULL,
`SPACE` int(10) unsigned default NULL,
`PAGE_NO` int(10) unsigned default NULL,
PRIMARY KEY  (`TABLE_ID`,`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Please note the SPACE field. For the actor table it is equal to 15:

mysql> select * from `INNODB_SYS_TABLES` where `SCHEMA` = 'sakila' AND `NAME` = 'actor'\G
*************************** 1. row ***************************
SCHEMA: sakila
NAME: actor
ID: 13
N_COLS: 2147483652
TYPE: 1
MIX_ID: 0
MIX_LEN: 0
CLUSTER_NAME:
SPACE: 15
1 row in set (0.00 sec)

SPACE is equal to 15 in all actor’s indexes:

mysql> select * from INNODB_SYS_INDEXES WHERE TABLE_ID = 13\G
*************************** 1. row ***************************
TABLE_ID: 13
ID: 15
NAME: PRIMARY
N_FIELDS: 1
TYPE: 3
SPACE: 15
PAGE_NO: 3
*************************** 2. row ***************************
TABLE_ID: 13
ID: 16
NAME: idx_actor_last_name
N_FIELDS: 1
TYPE: 0
SPACE: 15
PAGE_NO: 4
2 rows in set (0.00 sec)

In the InnoDB world, actor.ibd is a tablespace. It has a space_id, which is equal to 15 for this particular table on this particular server.

As you can see secondary indexes are stored in actor.ibd as well.

But where is space_id in actor.ibd?

Like any other tablespace, actor.ibd consists of a set of InnoDB pages. A page is 16k long (UNIV_PAGE_SIZE in the source code).

Let’s take a look at the page header:

InnoDB Page Header

Name | Size (bytes) | Description
FIL_PAGE_SPACE_OR_CHKSUM | 4 | in < MySQL-4.0.14 the space id the page belongs to (== 0), but in later versions the ‘new’ checksum of the page
FIL_PAGE_OFFSET | 4 | ordinal page number from start of space
FIL_PAGE_PREV | 4 | offset of previous page in key order
FIL_PAGE_NEXT | 4 | offset of next page in key order
FIL_PAGE_LSN | 8 | log serial number of page’s latest log record
FIL_PAGE_TYPE | 2 | current defined types are: FIL_PAGE_INDEX, FIL_PAGE_UNDO_LOG, FIL_PAGE_INODE, FIL_PAGE_IBUF_FREE_LIST
FIL_PAGE_FILE_FLUSH_LSN | 8 | “the file has been flushed to disk at least up to this lsn” (log serial number), valid only on the first page of the file
FIL_PAGE_ARCH_LOG_NO_OR_SPACE_NO | 4 | starting from 4.1.x this contains the space id of the page

So, the space_id is 4 bytes written to every InnoDB page. An .ibd file can be huge, while ibdata1 is usually smaller. Thus, it is easier to modify the space_id in the InnoDB dictionary once than in every InnoDB page.

Now let’s connect actor.ibd from the sakila database, taken from some other MySQL server.

0. Create an empty InnoDB tablespace.

1. Create the table:

mysql>CREATE TABLE actor (
actor_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
first_name VARCHAR(45) NOT NULL,
last_name VARCHAR(45) NOT NULL,
last_update TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY  (actor_id),
KEY idx_actor_last_name (last_name)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;

This command will create respective records in SYS_TABLES and SYS_INDEXES.

2. Now let’s modify SPACE in the InnoDB dictionary. MySQL must be stopped at this point. There is a tool, ibdconnect, in the Percona InnoDB Recovery Tool. Make sure you’re using the latest version from the trunk.

It reads space_id from an .ibd file and updates the dictionary in ibdata1.

# ./ibdconnect -o /var/lib/mysql/ibdata1 -f /var/lib/mysql/sakila/actor.ibd -d sakila -t actor
actor.ibd
actor.ibd belongs to space #15
... Skipped output...
SYS_TABLES is updated successfully
... Skipped output...
SYS_INDEXES is updated successfully

It is possible that the space_id from actor.ibd is already used by some other table.
In that case, if ibdata1 were updated, MySQL would fail to start with an error:

InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Error: trying to add tablespace 15 of name './sakila/customer.ibd'
InnoDB: to the tablespace memory cache, but tablespace
InnoDB: 15 of name './sakila/actor.ibd' already exists in the tablespace
InnoDB: memory cache!

To prevent such an error, ibdconnect checks whether the space_id is already used.
If so, it will refuse to update ibdata1:

$ ./ibdconnect -o ibdata1 -f t2.ibd -d sakila -t actor
...
Error: Space id: 12 is already used in InnoDB dictionary for table test/t2

In this case you need to drop table `test`.`t2` and create it again. InnoDB will assign another space_id, so 12 will be freed.

3. Now SPACE is modified in the dictionary, but the checksums are bad. To regenerate them, use innochecksum from the same toolset. Run it twice:

# ./innochecksum -f /var/lib/mysql/ibdata1
page 8 invalid (fails new style checksum)
page 8: new style: calculated = 0x B7C5C82C; recorded = 0x BFE71C21
fixing new checksum of page 8
page 11 invalid (fails new style checksum)
page 11: new style: calculated = 0x E4189B9B; recorded = 0x C168689B
fixing new checksum of page 11
#
# ./innochecksum -f /var/lib/mysql/ibdata1
page 8 invalid (fails old style checksum)
page 8: old style: calculated = 0x 8195646B; recorded = 0x DA79A2EE
fixing old checksum of page 8
page 8 invalid (fails new style checksum)
page 8: new style: calculated = 0x 119FD630; recorded = 0x B7C5C82C
fixing new checksum of page 8
page 11 invalid (fails old style checksum)
page 11: old style: calculated = 0x 908297E7; recorded = 0x 6536CEE8
fixing old checksum of page 11
page 11 invalid (fails new style checksum)
page 11: new style: calculated = 0x D5DC3269; recorded = 0x E4189B9B
fixing new checksum of page 11

4. Run it a third time to be sure ibdata1 has valid checksums:

#./innochecksum /var/lib/mysql/ibdata1
#

5. Now you can start MySQL and take a dump of the table. The table is accessible, but for obvious reasons it should not be used in production.
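
For example, one way to pull the recovered rows out before rebuilding the table properly (a sketch; mysqldump works just as well):

SELECT * FROM sakila.actor INTO OUTFILE '/tmp/actor.tsv';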

The tool ibdconnect was tested on MySQL 5.1 on CentOS 5.6 x86_64. However, it is expected to work on all versions of MySQL/InnoDB.

May 11, 2011

Shard-Query EC2 images available

Infobright and InnoDB AMI images are now available

There are now demonstration AMI images for Shard-Query. Each image comes pre-loaded with the data used in the previous Shard-Query blog post. The data in each image is split into 20 “shards”. This blog post will refer to an EC2 instance as a node from here on out. Shard-Query is very flexible in its configuration, so you can use this sample database to spread processing over up to 20 nodes.

The Infobright Community Edition (ICE) images are available in 32 and 64 bit varieties. Due to memory requirements, the InnoDB versions are only available on 64 bit instances. MySQL will fail to start on a micro instance; simply decrease the values in the /etc/my.cnf file if you really want to try micro instances.

*EDIT*
The storage worker currently logs too much information. This can cause the disk to fill up with logs. You can fix this by modifying shard-query/run_worker to contain the following:

#!/bin/bash
while [ 1 ]
do
./worker >> /dev/null 2>&1 < /dev/null
done;

Where to find the images

Amazon ID | Name | Arch | Notes
ami-20b74949 | shard-query-infobright-demo-64bit | x86_64 | ICE 3.5.2pl1. Requires m1.large or larger.
ami-8eb648e7 | shard-query-innodb-demo-64bit | x86_64 | Percona Server 5.5.11 with XtraDB. Requires m1.large or larger.
ami-f65ea19f | shard-query-infobright-demo | i686 | ICE 3.5.2pl1 32bit. Requires m1.small or greater.
snap-073b6e68 | shard-query-demo-data-flatfiles | 30GB ext3 EBS | An ext3 volume which contains the flat files for the demos, if you want to reload on your favorite storage engine or database.

About the cluster

For best performance, there should be an even data distribution in the system. To get an even distribution, the test data was hashed over the values in the date_id column. There will be another blog post about the usage and performance of the splitter. It is multi-threaded (actually multi-process) and is able to hash-split up to 50GB/hour of input data on my i970 test machine. It is possible to distribute splitting and/or loading among multiple nodes as well. Note that in the demonstration, each node will contain redundant but non-accessed data for all configurations of more than one node. This would not be the case in normal circumstances. The extra data will not impact performance because it will never be accessed.
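
To illustrate the idea (this is not the actual splitter, which is a separate multi-process tool; the modulo hash here is just a sketch), assigning a row to one of 20 shards by hashing the date_id value could look like this:

SELECT date_id,
       MOD(date_id, 20) + 1 AS shard_number
FROM dim_date;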

Since both InnoDB and ICE versions of the data are available it is important to examine the differences in size. This will give us some interesting information about how Shard-Query will perform on each database. To do the size comparison, I used the du utility:

InnoDB file size on disk: 42GB (with indexes)

# du -sh *
203M    ibdata1
128M    ib_logfile0
128M    ib_logfile1
988K    mysql
2.1G    ontime1
2.1G    ontime10
2.1G    ontime11
2.1G    ontime12
2.1G    ontime13
2.1G    ontime14
2.1G    ontime15
2.1G    ontime16
2.1G    ontime17
2.1G    ontime18
2.1G    ontime19
2.1G    ontime2
2.1G    ontime20
2.1G    ontime3
2.1G    ontime4
2.1G    ontime5
2.1G    ontime6
2.1G    ontime7
2.1G    ontime8
2.1G    ontime9
212K    performance_schema
0       test

ICE size on disk: 2.5GB

# du -sh *
8.0K    bh.err
11M     BH_RSI_Repository
4.0K    brighthouse.ini
4.0K    brighthouse.log
4.0K    brighthouse.seq
964K    mysql
123M    ontime1
124M    ontime10
123M    ontime11
123M    ontime12
123M    ontime13
123M    ontime14
123M    ontime15
123M    ontime16
123M    ontime17
123M    ontime18
124M    ontime19
124M    ontime2
124M    ontime20
124M    ontime3
123M    ontime4
122M    ontime5
122M    ontime6
122M    ontime7
123M    ontime8
125M    ontime9

The InnoDB data directory size is 42GB, which is twice the original size of the input data. The ICE schema was discussed in the comments of the last post. ICE does not have any indexes (not even primary keys).

Here is the complete InnoDB schema from one shard. The schema is duplicated 20 times (but not the ontime_fact data):

DROP TABLE IF EXISTS `dim_airport`;
CREATE TABLE `dim_airport` (
  `airport_id` int(11) NOT NULL DEFAULT '0',
  `airport_code` char(3) DEFAULT NULL,
  `CityName` varchar(100) DEFAULT NULL,
  `State` char(2) DEFAULT NULL,
  `StateFips` varchar(10) DEFAULT NULL,
  `StateName` varchar(50) NOT NULL,
  `Wac` int(11) DEFAULT NULL,
  PRIMARY KEY (`airport_id`),
  KEY `CityName` (`CityName`),
  KEY `State` (`State`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Data from BTS ontime flight data.  Data for Origin and Destination airport data.';

CREATE TABLE `dim_date` (
  `Year` year(4) DEFAULT NULL,
  `Quarter` tinyint(4) DEFAULT NULL,
  `Month` tinyint(4) DEFAULT NULL,
  `DayofMonth` tinyint(4) DEFAULT NULL,
  `DayOfWeek` tinyint(4) DEFAULT NULL,
  `FlightDate` date NOT NULL,
  `date_id` smallint(6) NOT NULL,
  PRIMARY KEY (`date_id`),
  KEY `FlightDate` (`FlightDate`),
  KEY `Year` (`Year`,`Quarter`,`Month`,`DayOfWeek`),
  KEY `Quarter` (`Quarter`,`Month`,`DayOfWeek`),
  KEY `Month` (`Month`,`DayOfWeek`),
  KEY `DayOfWeek` (`DayOfWeek`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains the date information from the BTS ontime flight data.  Note dates may not be in date_id order';
/*!40101 SET character_set_client = @saved_cs_client */;

CREATE TABLE `dim_flight` (
  `UniqueCarrier` char(7) DEFAULT NULL,
  `AirlineID` int(11) DEFAULT NULL,
  `Carrier` char(2) DEFAULT NULL,
  `FlightNum` varchar(10) DEFAULT NULL,
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `AirlineName` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`flight_id`),
  KEY `UniqueCarrier` (`UniqueCarrier`,`AirlineID`,`Carrier`),
  KEY `AirlineID` (`AirlineID`,`Carrier`),
  KEY `Carrier` (`Carrier`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains information on flights, and what airline offered those flights and the flight number of the flight.  Some data hand updated.';

--
-- Table structure for table `ontime_fact`
--

CREATE TABLE `ontime_fact` (
  `date_id` int(11) NOT NULL DEFAULT '0',
  `origin_airport_id` int(11) NOT NULL DEFAULT '0',
  `dest_airport_id` int(11) NOT NULL DEFAULT '0',
  `flight_id` int(11) NOT NULL DEFAULT '0',
  `TailNum` varchar(50) DEFAULT NULL,
  `CRSDepTime` int(11) DEFAULT NULL,
  `DepTime` int(11) DEFAULT NULL,
  `DepDelay` int(11) DEFAULT NULL,
  `DepDelayMinutes` int(11) DEFAULT NULL,
  `DepDel15` int(11) DEFAULT NULL,
  `DepartureDelayGroups` int(11) DEFAULT NULL,
  `DepTimeBlk` varchar(20) DEFAULT NULL,
  `TaxiOut` int(11) DEFAULT NULL,
  `WheelsOff` int(11) DEFAULT NULL,
  `WheelsOn` int(11) DEFAULT NULL,
  `TaxiIn` int(11) DEFAULT NULL,
  `CRSArrTime` int(11) DEFAULT NULL,
  `ArrTime` int(11) DEFAULT NULL,
  `ArrDelay` int(11) DEFAULT NULL,
  `ArrDelayMinutes` int(11) DEFAULT NULL,
  `ArrDel15` int(11) DEFAULT NULL,
  `ArrivalDelayGroups` int(11) DEFAULT NULL,
  `ArrTimeBlk` varchar(20) DEFAULT NULL,
  `Cancelled` tinyint(4) DEFAULT NULL,
  `CancellationCode` char(1) DEFAULT NULL,
  `Diverted` tinyint(4) DEFAULT NULL,
  `CRSElapsedTime` int(11) DEFAULT NULL,
  `ActualElapsedTime` int(11) DEFAULT NULL,
  `AirTime` int(11) DEFAULT NULL,
  `Flights` int(11) DEFAULT NULL,
  `Distance` int(11) DEFAULT NULL,
  `DistanceGroup` tinyint(4) DEFAULT NULL,
  `CarrierDelay` int(11) DEFAULT NULL,
  `WeatherDelay` int(11) DEFAULT NULL,
  `NASDelay` int(11) DEFAULT NULL,
  `SecurityDelay` int(11) DEFAULT NULL,
  `LateAircraftDelay` int(11) DEFAULT NULL,
  `FirstDepTime` varchar(10) DEFAULT NULL,
  `TotalAddGTime` varchar(10) DEFAULT NULL,
  `LongestAddGTime` varchar(10) DEFAULT NULL,
  `DivAirportLandings` varchar(10) DEFAULT NULL,
  `DivReachedDest` varchar(10) DEFAULT NULL,
  `DivActualElapsedTime` varchar(10) DEFAULT NULL,
  `DivArrDelay` varchar(10) DEFAULT NULL,
  `DivDistance` varchar(10) DEFAULT NULL,
  `Div1Airport` varchar(10) DEFAULT NULL,
  `Div1WheelsOn` varchar(10) DEFAULT NULL,
  `Div1TotalGTime` varchar(10) DEFAULT NULL,
  `Div1LongestGTime` varchar(10) DEFAULT NULL,
  `Div1WheelsOff` varchar(10) DEFAULT NULL,
  `Div1TailNum` varchar(10) DEFAULT NULL,
  `Div2Airport` varchar(10) DEFAULT NULL,
  `Div2WheelsOn` varchar(10) DEFAULT NULL,
  `Div2TotalGTime` varchar(10) DEFAULT NULL,
  `Div2LongestGTime` varchar(10) DEFAULT NULL,
  `Div2WheelsOff` varchar(10) DEFAULT NULL,
  `Div2TailNum` varchar(10) DEFAULT NULL,
  `Div3Airport` varchar(10) DEFAULT NULL,
  `Div3WheelsOn` varchar(10) DEFAULT NULL,
  `Div3TotalGTime` varchar(10) DEFAULT NULL,
  `Div3LongestGTime` varchar(10) DEFAULT NULL,
  `Div3WheelsOff` varchar(10) DEFAULT NULL,
  `Div3TailNum` varchar(10) DEFAULT NULL,
  `Div4Airport` varchar(10) DEFAULT NULL,
  `Div4WheelsOn` varchar(10) DEFAULT NULL,
  `Div4TotalGTime` varchar(10) DEFAULT NULL,
  `Div4LongestGTime` varchar(10) DEFAULT NULL,
  `Div4WheelsOff` varchar(10) DEFAULT NULL,
  `Div4TailNum` varchar(10) DEFAULT NULL,
  `Div5Airport` varchar(10) DEFAULT NULL,
  `Div5WheelsOn` varchar(10) DEFAULT NULL,
  `Div5TotalGTime` varchar(10) DEFAULT NULL,
  `Div5LongestGTime` varchar(10) DEFAULT NULL,
  `Div5WheelsOff` varchar(10) DEFAULT NULL,
  `Div5TailNum` varchar(10) DEFAULT NULL,
  KEY `date_id` (`date_id`),
  KEY `flight_id` (`flight_id`),
  KEY `origin_airport_id` (`origin_airport_id`),
  KEY `dest_airport_id` (`dest_airport_id`),
  KEY `DepDelay` (`DepDelay`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COMMENT='Contains all avaialble data from 1988 to 2010';

mysql> use ontime1;
Database changed

mysql> show table status like 'ontime_fact'\G
*************************** 1. row ***************************
           Name: ontime_fact
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 6697533
 Avg_row_length: 241
    Data_length: 1616904192
Max_data_length: 0
   Index_length: 539279360
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2011-05-10 04:26:14
    Update_time: NULL
     Check_time: NULL
      Collation: latin1_swedish_ci
       Checksum: NULL
 Create_options:
        Comment: Contains all avaialble data from 1988 to 2010
1 row in set (0.00 sec)

With ICE, after compression there is only 2.5GB of data, so ICE gets over a 16:1 compression ratio (compared to InnoDB; about 8:1 compared to raw input data), which is quite nice. Each shard contains only about 128MB of data!

Storage engine makes a big difference

In general, a column store performs about 8x-10x better than a row store for queries which access a significant amount of data. One big reason for this is the excellent compression that RLE techniques provide.
I have not loaded InnoDB compressed tables yet but since InnoDB compression is not RLE, I doubt it will have the same impact.

For large datasets, effective compression means fewer nodes are needed to keep the data entirely in memory. This frees the disk for on-disk temporary storage for hash joins and other background operations. This will have a direct impact on our query response times and throughput.

Setting up a cluster using the AMI images

You can easily test Shard-Query for yourself. Spin up the desired number of EC2 instances using one of the AMI images. For best results, you should spin up a number of instances that evenly divides into 20. There is a helpful utility (included in the image) to help configure the cluster, and it works from a copy of the text on the EC2 console page. To use it, ensure:

  1. That only the instances that you want to use are shown in the EC2 console.
  2. That the "private ip" field is selected in the list of columns to show (click show/hide to change the columns)
  3. That the "public dns" field is selected

SSH to the public DNS entry of the first node in the list. This node will become "shard1".

Now, in the EC2 console hit CTRL-A to select all text on the page and then CTRL-C to copy it. Paste this into a text file on shard1 called "/tmp/servers.txt" and run the following commands:

$ cat servers.txt | grep "10\."| grep -v internal |tee hosts.internal
[host list omitted]

Now you need to set up the hosts file:

sudo su -
# cat hosts.internal | ~ec2-user/tools/mkhosts >> /etc/hosts

# ping shard20
PING shard20 (10.126.15.34) 56(84) bytes of data.
64 bytes from shard20 (10.126.15.34): icmp_seq=1 ttl=61 time=0.637 ms
...

Note: There is no need to put that hosts file on your other nodes unless you want to run workers on them.

Generate a cluster configuration

There is a script provided to generate the shards.ini file for testing a cluster of 1 to 20 nodes.

cd shard-query

#generate a config for 20 shards (adjust to your number of nodes)
php genconfig 20 > shards.ini

Running the test

For best performance, you should run the workers on one or two nodes. You should start two workers per core in the cluster.

First start gearmand:

gearmand -p 7000 -d

Then start the workers on node 1 (assuming a 20 node cluster):

cd shard-query
./start_workers 80

I normally start (2 * TOTAL_CLUSTER_CORES) workers. That is, if you have 20 machines, each with 2 cores, run 80 workers.

Test the system. You should see the following row count (the first number is wall time, the second exec time, the third parse time).

$ echo "select count(*) from ontime_fact;" | ./run_query

Array
(
    [count(*)] => 135125787
)
1 rows returned (0.084244966506958s, 0.078309059143066s, 0.0059359073638916s)

Execute the test:

As seen above, the run_query script will run one or more semicolon-terminated SQL statements. The queries for the benchmark are in ~ec2-user/shard-query/queries.sql.

I have also provided a convenient script, called pivot_results, which will summarize the output from the ./run_query command.

cd shard-query/
$ ./run_query < queries.sql | tee raw |./pivot_results &
[1] 12359
$ tail -f ./raw
-- Q1
...

At the end, you will get a result output that is easy to graph in a spreadsheet:

$ cat raw | ./pivot_results
Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8.0,Q8.1,Q8.2,Q8.3,Q8.4,Q9,Q10,Q11
34.354,60.978,114.175,27.138,45.751,14.905,14.732,34.946,126.599,250.222,529.287,581.295,11.042,63.366,14.573

InnoDB my.cnf

[client]
port=3306
socket=/tmp/mysql-inno.sock

[mysqld]
socket=/tmp/mysql-inno.sock
default-storage-engine=INNODB
innodb-buffer-pool-instances=2
innodb-buffer-pool-size=5600M
innodb-file-format=barracuda
innodb-file-per-table
innodb-flush-log-at-trx-commit=1
innodb-flush-method=O_DIRECT
innodb-ibuf-active-contract=1
innodb-import-table-from-xtrabackup=1
innodb-io-capacity=1000
innodb-log-buffer-size=32M
innodb-log-file-size=128M
innodb-open-files=1000
innodb_fast_checksum
innodb-purge-threads=1
innodb-read-ahead=linear
innodb-read-ahead-threshold=8
innodb-read-io-threads=16
innodb-recovery-stats
innodb-recovery-update-relay-log
innodb-replication-delay=#
innodb-rollback-on-timeout
innodb-rollback-segments=16
innodb-stats-auto-update=0
innodb-stats-on-metadata=0
innodb-stats-sample-pages=256
innodb-stats-update-need-lock=0
innodb-status-file
innodb-strict-mode
innodb-thread-concurrency=0
innodb-thread-concurrency-timer-based
innodb-thread-sleep-delay=0
innodb-use-sys-stats-table
innodb-write-io-threads=4
join-buffer-size=16M
key-buffer-size=64M
local-infile=on
lock-wait-timeout=300
log-error=/var/log/mysqld-innodb.log
max-allowed-packet=1M
net-buffer-length=16K
#we value throughput over response time, get a good plan
optimizer-prune-level=0
partition=ON
port=3306
read-buffer-size=512K
read-rnd-buffer-size=1M
skip-host-cache
skip-name-resolve
sort-buffer-size=512K
sql-mode=STRICT_TRANS_TABLES
symbolic-links
table-definition-cache=16384
table-open-cache=128
thread-cache-size=32
thread-stack=256K
tmp-table-size=64M
transaction-isolation=READ-COMMITTED
user=mysql
wait-timeout=86400

To be continued

You can now set up a cluster from 1 to 20 nodes for testing. This way you can verify the numbers in my next blog post. I will compare performance of various cluster sizes on both storage engines.

May 10, 2011

Introducing our Percona Live speakers

We have mostly finalized the Percona Live schedule at this point, and I thought I’d take a few minutes to introduce who’s going to be speaking and what they’ll cover. A brief explanation first: we’ve personally recruited the speakers, which is why it has been a slow process to finalize and get abstracts on the web. Sometimes you know someone’s a dynamite speaker and you discuss over the phone, and then it takes a long time to get a title and abstract from them. In many cases the better they are the busier they are, so this is expected.

Let me introduce just a few of the great speakers we have lined up for this event: Brendan Gregg, Dr. John Busch, and Vladimir Fedorkov.

Brendan Gregg

Brendan Gregg is the crazy guy who likes to scream at a chassis full of disks and show graphs of the impact the vibration has on the disk latency. If you have not seen this, take a moment and watch it!

Brendan obviously knows his DTrace. He will be talking about how to really use DTrace with MySQL and InnoDB at a very deep level in the storage engine — not the high-level introductions you might have seen elsewhere. I do not think he will be yelling at any JBODs, but who knows. Either way, this will be a talk to attend.

Dr. John Busch

John is CTO and President at Schooner, which makes a high-performance MySQL software appliance. (It used to be hardware-only, if you haven’t been keeping up with them lately.) His session will be about benchmarks of replication alternatives for MySQL. Hasn’t this been done before, and is it really worth doing again? Actually, no, I haven’t seen it done in a scientific way, and yes, it’s an important session for two reasons. One, Dr. Busch is a very distinguished engineer who’s been building complex systems since before I was born. Two, this is a session that will teach you how to think about replication, not what to think about replication. Data replication is one of the hardest and most interesting problems in computing, and I think you will come away from this having learned a lot.

I also want to mention that this is a sponsored session. Our sponsors have been asked to keep it 100% technical, with no marketing, and I believe that every one of them understands the value in that request. Take a look at the speakers, titles, and abstracts, and I think you’ll agree with me: our sponsors are among the most accomplished engineers out there, and they have a lot of wisdom and experience to share. This session is a great example.

Vladimir Fedorkov

Vlad actually used to work with Percona, as a consultant, system administrator, and team lead. Now he’s the director of Professional Services at Sphinx. If I can paraphrase his session, I’d say it is about the things you don’t know about Sphinx’s capabilities. Many people think of Sphinx as a great search engine, but it’s much more powerful than just that. It has all kinds of features that let it do things MySQL is very bad at. For example, think about using MySQL to find the top N most important rows per group of rows, distributed across a cluster of servers, sorted in date order (not importance order) and with an indication of how many rows are in the whole set. That’s a nightmare of temp tables and filesorts and multiple queries and all sorts of things in MySQL — and that’s just for performing this task on a single server, not to mention a cluster of servers. Well, Sphinx is very good at this type of task! There’s a lot more, too.

This might be a good time to mention that Sphinx’s author Andrew Aksyonoff just published a book on Sphinx with O’Reilly! Congratulations Andrew. It’s in my (long) list of reading.

Well, that’s all for this time. If you are not signed up for Percona Live yet, I encourage you to get your tickets now. I don’t know quite what happened today, but registrations really jumped, and there is a good chance that we’ll sell every ticket we have. It happened in San Francisco and it could happen again here, even with the additional rooms we’ve rented (to expand to four tracks).

By the way, we are also planning an evening event with an open bar and substantial food, so you can stay around and mingle, without worrying about paying for drinks or rushing off to find food before you faint of hunger. We don’t have a contract signed yet, but we’ll keep you posted. This will be free for everyone to attend, even if you can’t come to the daytime event.

May 10, 2011

Innodb Caching (part 2)

A few weeks ago I wrote about InnoDB caching, with the main idea that you might need more cache than you think because InnoDB caches data in pages, not rows, so the whole page needs to be in memory even if you need only one row from it. I created a simple benchmark which shows a worst-case scenario by picking a random set of primary key values from a sysbench table and reading them over and over again.

This time I decided to “zoom in” on the point where the result drop happens – a 2x increase in the number of rows per step hides a lot of details, so I’m starting with a number of rows for which everything is still in cache for all runs and increasing the number of rows being tested by 20% per step. I’m trying the standard InnoDB page size, a 4KB page size, and the 16K page size compressed to 4K. The data in this case compresses perfectly (all pages could be compressed to the desired 4K), so it is testing a best case for compression. Here is the graph:

The results are quite interesting from a lot of angles. First, we can see how quickly performance can drop in the worst-case scenario when the data is no longer in cache. For 16K pages we see an over 10x drop, from 7630 qps to 580 qps, as the number of keys accessed increases by only 20%, from 9216 to 11059.

The results for 4K pages start dropping earlier. This is because the database actually takes more space with 4K pages – 2.7GB instead of 2.2GB – because of larger overhead. They also drop in a much more gradual fashion compared to the 16K results. They still retain almost half the performance at 27.5K keys even though the drop starts at about 9.2K keys; that is less than a 3x performance loss for a 3x increase in the number of keys, which is very different from the 10x performance drop for 16K pages. I do not have a very good explanation for why this is happening.

The compression results come as a huge disappointment. The compressed .ibd file was only 620MB, so if InnoDB chose to keep only compressed pages in memory, it should be able to keep about 1/5 of the pages cached in a 128M buffer pool. Under that policy it could be compared to the 4K page case, just with 1/4 of the data (and the overhead needed for page decompression). In reality, however, the results are a lot closer to the 16K page results, with a rather quick drop. About the same number of qps (530) is reached at 15925 keys compared to 11059, which is about 40% more keys.

Where the compression results were good, though, is with a very low number of keys, when everything could be kept uncompressed in the buffer pool, as well as in the case of a completely disk-IO-bound workload. It is hard to see from the graph, but with a large number of keys the compressed results were the best of all, probably because the data size was smallest.

Let’s introduce one more metric. Let’s say an “acceptable” result for our application is 2000 qps, which is about 25% of the performance when the data fits in memory. I just picked this number for simplicity, though I think it is quite reasonable as a target. With such a definition, 16K provides acceptable performance with 9216 keys touched, 4K with 27518, and 16K compressed with 11516. This gives us 3x more keys we can fit in memory with 4K pages, which is pretty good, even though it is not quite the 4x increase we could hope for in theory. This is explained by the fact that the data size is almost 30% larger with 4K pages.

In this case Percona Server 5.5 was tested, which allows changing the page size without recompiling the server, though I expect the results for 16K and 16K compressed to be very similar on MySQL 5.5.

As a summary: if you have a very random access pattern with small rows, consider using 4K pages; compression may not give you much gain in cache-fit efficiency.
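
For reference, the compressed configuration tested above (16K pages compressed to 4K) corresponds to a table created roughly like this; the table definition itself is only an illustrative sysbench-style sketch, not the exact benchmark schema:

SET GLOBAL innodb_file_format = 'Barracuda';
SET GLOBAL innodb_file_per_table = 1;

CREATE TABLE sbtest_compressed (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  k INT UNSIGNED NOT NULL DEFAULT 0,
  c CHAR(120) NOT NULL DEFAULT '',
  pad CHAR(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k (k)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4;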

Next I should look more at the performance of different storage engines in this regard. I’m very curious how PBXT and TokuDB would perform in this kind of benchmark. I should also modify the benchmark to be able to do writes to a fixed number of keys, to see if it makes any difference. Finally, I should post my benchmark script so everyone interested can repeat my results.

May 09, 2011

Upcoming webinar on Percona XtraBackup

On 10th May at 9 AM PST I will be giving a webinar about Percona XtraBackup. If you cannot attend this time, a recorded session will be available soon after the webinar.

We always want to store our data in a safe way, keep it consistent, and be prepared for any kind of disaster. Every DBA carries this recurring task. This webinar will discuss how to make the process of creating consistent physical backups much easier and faster with the Percona XtraBackup utility.

We will discuss:
* XtraBackup operational principles
* Sample usage scenarios
* Full backups
* Incremental and partial backups
* Importing and exporting individual tables
* Streaming and parallel backups
* Limitations and common issues.
You may register here.
