Evaluation of PMP Profiling Tools

In this blog post, we’ll look at some of the available PMP profiling tools.

While debugging or analyzing issues with Percona Server for MySQL, we often need a quick understanding of what’s happening on the server. Percona experts frequently use the pt-pmp tool from Percona Toolkit (inspired by http://poormansprofiler.org).



The pt-pmp tool collects application stack traces with GDB and then post-processes them. From this you get a condensed, ordered list of the stack traces. The list helps you understand where the application spends most of its time: either running something or waiting for something.
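The post-processing step is essentially the classic “poor man’s profiler” pipeline: concatenate each thread’s frames into one line, then count duplicates. A minimal, self-contained sketch (the function name is hypothetical, and the real pt-pmp handles many more edge cases than this):

```shell
# Collapse gdb "thread apply all bt" output into a frequency-ordered list of
# unique stacks. Reads the backtrace text on stdin.
pmp_aggregate() {
    awk '
        /^Thread/ { if (s != "") print s; s = "" }   # new thread: flush stack
        /^#/      { s = (s == "" ? $4 : s "," $4) }  # append function name ($4)
        END       { if (s != "") print s }           # flush the last stack
    ' | sort | uniq -c | sort -rn -k1,1
}
```

Piping the output of a batch gdb run (such as the gdb command lines shown later in this post) through such a filter yields lines like `512 poll,vio_io_wait,...`, hottest stacks first.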

Getting a profile with pt-pmp is handy, but it has a cost: it’s quite intrusive. In order to get stack traces, GDB has to attach to each thread of your application, which results in interruptions. Under high load, these stops can be quite significant (up to 15, 30, or even 60 seconds). This means that the pt-pmp approach is not really usable in production.

Below I’ll describe how to reduce GDB overhead, and also what other tools can be used instead of GDB to get stack traces.

  • GDB
    By default, the symbol resolution process in GDB is very slow. As a result, getting stack traces with GDB is quite intrusive (especially under high load). There are two options available that can notably reduce GDB tracing overhead:

      1. Use the readnever patch. RHEL and other distros based on it include GDB with the readnever patch applied. This patch allows you to avoid unnecessary symbol resolution with the --readnever option. As a result you get up to 10 times better speed.

      2. Use gdb_index. This feature was added to address the symbol resolution issue by creating and embedding a special index into the binaries. The index is quite compact: I created and embedded a gdb_index for the Percona Server binary, and it increased the size by only around 7-8MB. Adding the gdb_index speeds up obtaining stack traces and resolving symbols two to three times.
    # to check if the index already exists:
      readelf -S mysqld | grep gdb_index
    # to generate the index:
      gdb -batch mysqld -ex "save gdb-index /tmp" -ex "quit"
    # to embed the index:
      objcopy --add-section .gdb_index=/tmp/mysqld.gdb-index --set-section-flags .gdb_index=readonly mysqld mysqld
  • eu-stack (elfutils)
    The eu-stack tool from the elfutils package prints the stack of each thread in a process or core file. Symbol resolution is not very optimized in eu-stack either: by default, running it under load takes even more time than GDB. But eu-stack allows you to skip resolving completely, so it can grab stack frames quickly and then resolve them later, without any impact on the workload.
  • Quickstack
    Quickstack is a tool from Facebook that gets stack traces with minimal overhead.

Now let’s compare all of the above profilers. We will measure how long it takes each one to collect all the stack traces from Percona Server for MySQL under high load (sysbench OLTP_RW with 512 threads).

The results show that eu-stack (without resolving) got all stack traces in less than a second, and that Quickstack and GDB (with the readnever patch) got very close results. For other profilers, the time was around two to five times higher. This is quite unacceptable for profiling (especially in production).

There is one more note regarding the pt-pmp tool. The current version only supports GDB as the profiler. However, there is a development version that supports GDB, Quickstack, eu-stack, and eu-stack with offline symbol resolving. It also allows you to look at stack traces for specific threads (tids). So, for instance, in the case of Percona Server for MySQL, we can analyze just the purge, cleaner, or IO threads.

Below are the command lines used in testing:

# gdb & gdb+gdb_index
  time gdb  -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld` > /dev/null
# gdb+readnever
  time gdb --readnever -ex "set pagination 0" -ex "thread apply all bt" -batch -p `pidof mysqld` > /dev/null
# eu-stack
  time eu-stack -s -m -p `pidof mysqld` > /dev/null
# eu-stack without resolving
  time eu-stack -q -p `pidof mysqld` > /dev/null
# quickstack - 1 sample
  time quickstack  -c 1 -p `pidof mysqld` > /dev/null
# quickstack - 1000 samples
  time quickstack  -c 1000 -p `pidof mysqld` > /dev/null


Using dbsake to recover table structure from .frm files and process mysqldump output

We work on data recoveries quite often. In many cases, we recover table structures from the .frm files because there was no backup available. There is already a great blog post by my colleague Miguel Ángel Nieto about how to recover structures from .frm files using MySQL utilities.

This works pretty well, and we prefer to run mysqlfrm with the “--server” option to get all possible information from a .frm file. However, this option expects MySQL to be up and running, so that mysqlfrm can spawn a new MySQL instance and run the structure recovery there.

Recently I came across a tool that makes this job easier. The tool is dbsake, a collection of command-line tools that perform various DBA-related tasks for MySQL. In this blog, we will look at two very useful dbsake commands.

Installation is very easy and straightforward: dbsake ships as an executable Python zip archive with all dependencies included.

# curl -s http://get.dbsake.net > dbsake
# chmod u+x dbsake
# ./dbsake --version
dbsake, version 2.1.0 9525896

Recovering table structures from MySQL .frm files with dbsake

To recover table structures using dbsake, you use the “dbsake frmdump” command followed by the .frm file path. The frmdump command decodes the MySQL .frm file and outputs a “CREATE TABLE” or “CREATE VIEW” statement. The good thing is that it doesn’t require a running MySQL server instance, and it interprets the .frm file with rules similar to those of the MySQL server.

Let’s see an example:

# ./dbsake frmdump /var/lib/mysql/sakila/staff.frm
-- Table structure for table `staff`
-- Created with MySQL Version 5.6.27
CREATE TABLE `staff` (
`staff_id` tinyint(3) unsigned NOT NULL AUTO_INCREMENT,
`first_name` varchar(45) NOT NULL,
`last_name` varchar(45) NOT NULL,
`address_id` smallint(5) unsigned NOT NULL,
`picture` blob,
`email` varchar(50) DEFAULT NULL,
`store_id` tinyint(3) unsigned NOT NULL,
`active` tinyint(1) NOT NULL DEFAULT '1',
`username` varchar(16) NOT NULL,
`password` varchar(40) CHARACTER SET utf8 COLLATE utf8_bin DEFAULT NULL,
PRIMARY KEY (`staff_id`),
KEY `idx_fk_store_id` (`store_id`),
KEY `idx_fk_address_id` (`address_id`)

The result looks pretty good, and the character set and collation information has been recovered as well. We can also see the MySQL version (5.6.27) retrieved from the .frm file. It is important to mention that the command only decodes the information available in the .frm file, which means it cannot recover InnoDB foreign-key references or AUTO_INCREMENT values: these items are stored outside of the .frm file.

The frmdump command makes the recovery process easier and faster. We can easily script it and recover the structure of a large number of tables. For example, if we need to recover the structure of all tables from the world database, we can do the following:

mysql> create database world_recover;
# for tbl in `ls -1 /var/lib/mysql/world/*.frm`; do ./dbsake frmdump $tbl | mysql world_recover; done;
mysql> show tables from world_recover;
| Tables_in_world_recover |
| city                    |
| country                 |
| countrylanguage         |
3 rows in set (0.00 sec)

Filter and transform a mysqldump stream with dbsake

It’s a very common requirement to filter one or more tables from a mysqldump full database backup. The “dbsake sieve [options]” command helps us to filter or transform mysqldump output.

Let’s see how to extract a single table from a mysqldump file.

# mysqldump world > world.sql
# cat world.sql | ./dbsake sieve -t world.city > world.city.sql
Processed . Output: 1 database(s) 1 table(s) and 0 view(s)

The “-t” or “--table” option tells the command to only output the tables matching the given pattern. The command also reports the number of databases, tables, and views processed.

To extract multiple tables, you can pass “-t db.tbl” multiple times.

# cat world.sql | ./dbsake sieve -t world.city -t world.country > world.city_country.sql
Processed . Output: 1 database(s) 2 table(s) and 0 view(s)
# cat world.city_country.sql | grep -i 'create table'
CREATE TABLE `country` (

The latest Percona Server added a new mysqldump option, “--innodb-optimize-keys”. It changes the way InnoDB tables are dumped so that secondary keys are created after the data is loaded, taking advantage of InnoDB fast index creation. This is a really great feature, as it helps us restore data more efficiently than the default incremental rebuild that mysqldump performs.

Using the “dbsake sieve [options]” command, we can transform regular mysqldump output to take advantage of fast index creation. The “--defer-indexes” option rewrites the output of CREATE TABLE statements and arranges for secondary indexes to be created after the table data is loaded. Similarly, the “--defer-foreign-keys” option adds foreign key constraints after loading the table data.

Let’s see an example:

# cat world.sql | ./dbsake sieve --defer-indexes --defer-foreign-keys -t world.city  > world.city.sql
Processed . Output: 1 database(s) 1 table(s) and 0 view(s)

This means that world.city.sql will contain the table structure with the primary key first, then the INSERT statements to load the data, and an additional ALTER TABLE statement that creates the secondary keys (when there is at least one secondary index to add). Foreign keys will also be created along with the secondary indexes.

The original structure of table world.city:

  `Name` char(35) NOT NULL DEFAULT '',
  `CountryCode` char(3) NOT NULL DEFAULT '',
  `District` char(20) NOT NULL DEFAULT '',
  `Population` int(11) NOT NULL DEFAULT '0',
  KEY `CountryCode` (`CountryCode`),
  CONSTRAINT `city_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `country` (`Code`)

And the transformation done by dbsake:

  `Name` char(35) NOT NULL DEFAULT '',
  `CountryCode` char(3) NOT NULL DEFAULT '',
  `District` char(20) NOT NULL DEFAULT '',
  `Population` int(11) NOT NULL DEFAULT '0',
INSERT INTO `city` VALUES .....................
-- InnoDB Fast Index Creation (generated by dbsake)
ALTER TABLE `city`
  ADD KEY `CountryCode` (`CountryCode`),
  ADD CONSTRAINT `city_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `country` (`Code`);

For more sieve command options, please read the online manual.


The dbsake command makes it easier to recover table structures from .frm files. We can also filter and transform the mysqldump output easily without writing a complex awk or sed script. There are some more useful features of this tool that you can read about in the manual.

The post Using dbsake to recover table structure from .frm files and process mysqldump output appeared first on MySQL Performance Blog.


innotop 1.9.1 released

Lefred and I spent a bit of time making innotop 1.9.1.
We’ve released a new version mainly to add MySQL 5.6 support, and to include some bug fixes by Baron Schwartz and Frédéric Descamps.

You can download the .tar.gz and rpm’s (new!) at

Bugs fixed:

  • Issue 71: DBI prints to STDERR instead of just throwing an inactive statement error
  • Issue 73: innotop fails to start if perl is not installed under /usr/bin/
  • Issue 74: Use of uninitialized value $obj in regexp compilation at ./innotop-1.9.0/innotop line 6935.
  • Issue 82&83: Some pages broken with MySQL 5.6
  • Issue 70: .spec doesn’t build rpm if some perl modules are missing
  • Issue 84: feature request: Deadlock clearing transactions set sql_log_bin=0



Percona Toolkit by example – pt-stalk

pt-stalk recipes: Gather forensic data about MySQL when a server problem occurs

It happens to us all from time to time: a server issue arises that leaves you scratching your head. That’s when Percona Toolkit’s pt-stalk comes into play, helping you diagnose the problem by capturing diagnostic data that helps you pinpoint what’s causing the havoc hitting your database.

From the documentation (http://www.percona.com/doc/percona-toolkit/pt-stalk.html):

pt-stalk watches for a trigger condition to become true, and then collects data to help in diagnosing problems. It is designed to run as a daemon with root privileges, so that you can diagnose intermittent problems that you cannot observe directly. You can also use it to execute a custom command, or to gather the data on demand without waiting for the trigger to happen.

There are some common options that you can use in all the examples, so I recommend you read the documentation if you have any specific questions.

Be prepared! It’s wise to have pt-stalk running 24/7, because problems such as MySQL lockups or spikes of activity typically leave no evidence to use in root cause analysis. By continuously running pt-stalk, you’ll have the data it gathers when the trouble occurs.

Let’s look at some specific “pt-stalk recipes.”

Just collect the information:

pt-stalk will collect the information once and exit after that.

$ pt-stalk --no-stalk  -- --defaults-file=./my.default.cnf

Every hour for one day:

Collect the information every hour (--sleep=3600), 24 times (--iterations=24), without waiting for any condition (--threshold=0), and run in the background (--daemonize).

$ pt-stalk --sleep=3600 --threshold=0 --iterations=24 \
  --daemonize -- --defaults-file=./my.default.cnf

A host has more than 10 connections:

Collect all the information when the host matched by --match has more than 10 (--threshold 10) connections open. You can use any column from the “show processlist” output; in this case, I’m using the “Host” column.

$ pt-stalk --function processlist --variable Host \
  --match --threshold 10 -- --defaults-file=./my.default.cnf

More than one variable:

In some cases you want to check more than one variable; in those cases, you will have to write a small script to do this.

The script:
The script must contain a shell function called “trg_plugin”, and that function must return a number. pt-stalk matches this number against the --threshold option.

$ cat pt-stalk-function
# This function counts the commands that come from localhost ($4 in this case)
# and have an execution time ($7) greater than 5000.
# $EXT_ARGV holds the connection options that pt-stalk passes through.
#  $2    $3         $4          $5   $6        $7     $8      $9
#| Id  | User     | Host      | db | Command | Time | State | Info             |

trg_plugin() {
    mysqladmin $EXT_ARGV processlist | awk -F'|' '
        BEGIN { counter=0 }
        { if ($4 ~ /localhost/ && $7 > 5000) counter++ }
        END { print counter }'
}
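The counting logic can be exercised without a running server by piping canned processlist rows through the same awk filter (a hypothetical stand-in that reads stdin instead of calling mysqladmin):

```shell
# Count processlist rows whose Host column ($4) is localhost and whose
# Time column ($7) is greater than 5000.
count_slow_local() {
    awk -F'|' '{ if ($4 ~ /localhost/ && $7 > 5000) counter++ } END { print counter+0 }'
}
```

Feeding it two rows, one local and long-running and one remote, prints 1.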

The pt-stalk command:
Collect all the information when the function trg_plugin inside the script ./pt-stalk-function (--function ./pt-stalk-function) returns more than 100 (--threshold 100):

$ pt-stalk --function ./pt-stalk-function \
  --threshold 100  --  --defaults-file=./my.default.cnf

Custom collector with plugins:

Plugins are useful for collecting information that is not included in pt-stalk by default. For example, if you want to collect pid status information from /proc/[mysqlpid]/status, you can use a plugin for this.

The plugin:
The script in this case contains a shell function called “before_collect”, and pt-stalk runs this function before collecting the information (you can also hook in after collection, or both before and after; please check the documentation for more information).

$ cat pt-stalk-pidplugin
# This plugin collects /proc/[mysqlpid]/status for every mysqld pid running on
# this box, and stores the information in the default directory with the
# default prefix, appending the pid number and "-status" at the end.
# For example: 2013_01_02_22_58_55-2082-status

before_collect() {
    for __p in $(pidof mysqld); do
        cat /proc/${__p}/status > ${OPT_DEST}/${prefix}-${__p}-status
    done
}
The pt-stalk command:
Before collecting the information, pt-stalk will run the plugin ./pt-stalk-pidplugin (--plugin ./pt-stalk-pidplugin):

$ pt-stalk --plugin ./pt-stalk-pidplugin \
  --threshold 10  -- --defaults-file=./my.default.cnf

Have any comments or questions? Just let me know below!



Get Me Some Query Logs!

One of my favorite tools in the Percona Toolkit is pt-query-digest.  This tool is indispensable for identifying your top SQL queries, and analyzing which queries are accounting for your database load.

But the report you get from pt-query-digest is only as good as the log of queries you give it as input.  You need a large enough sample of query logs, collected over a period of time when you have representative traffic on your database.

You also need the log to include all the queries, not just those that take more than N seconds.  The reason is that some queries are individually quick, and would not be logged if you set the long_query_time configuration variable to 1 or more seconds.  You want that threshold to be 0 seconds while you’re collecting logs.

However, activating such high-volume query log collection can be costly. Every statement executed on your database will cause file I/O, even when the query itself was served out of your buffer pool memory. That’s a lot of overhead, so we need to be careful about how and when we collect logs, and for how long we leave that running.

I’ve put together a simple shell script to help automate this.  I have given it the functional but unimaginative name full-slow-log.

The script configures full logging, then sleeps for a number of seconds to allow queries to be collected in the logs. After it finishes sleeping, or if you interrupt the script, it restores the log settings back to the values they started with.

$ full-slow-log [ -v ] [ -s seconds ] [ -c config ]
  • -v is for verbose output.
  • -s seconds allows you to specify the number of seconds to sleep.  The default is 5 seconds, which is probably too short for most sites, but the value is chosen to be as low impact as possible if you forget to give another value.
  • -c config allows you to specify a MySQL config file other than $HOME/.my.cnf, so you can store host, user, and password.

Here’s an example of running it with verbose output:

$ full-slow-log -v
Discovering slow_query_log=1
Discovering slow_query_log_file=mysql-slow.log
Discovering long_query_time=60.000000
Setting long_query_time=0
Setting slow_query_log_file=mysql-slow.log-full-20121122112413
Setting slow_query_log=1
Flushing slow query log
Sleeping 5 seconds... done.
Restoring slow_query_log_file=mysql-slow.log
Restoring long_query_time=60.000000
Restoring slow_query_log=1
Flushing logs during restore

Notice that the script also redirects the slow query log to a new file, with a filename based on the timestamp.  This is so you have a distinct file that contains only the specific time range of logs you collected.
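Generating such a timestamped name is a one-liner (a hypothetical sketch matching the format shown in the example output; the real script may differ in details):

```shell
# Build a per-run slow log file name like mysql-slow.log-full-20121122112413
slow_log_name() {
    echo "mysql-slow.log-full-$(date +%Y%m%d%H%M%S)"
}
```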

The restoration of settings happens in a “trap”, a shell scripting feature that serves as both an exit handler and a signal handler. So if you interrupt the script before it’s done, you have some assurance that it will still restore the settings.
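The trap-based restore pattern can be sketched in a few lines (hypothetical names; the real script issues SET statements against MySQL instead of echoing):

```shell
collect_with_restore() {
    orig_lqt=$1                                  # the discovered original long_query_time
    restore() { echo "Restoring long_query_time=$orig_lqt"; }
    trap restore EXIT INT TERM                   # fires on normal exit AND on interrupt
    echo "Setting long_query_time=0"
    # ... sleep here while queries are collected ...
}
```

Because the trap also fires on EXIT, the restore step runs even when the collection finishes normally, not only when it is interrupted.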

My full-slow-log script is now available on a Github project (along with a few other experimental scripts I have written).  See https://github.com/billkarwin/bk-tools

I hope you find this script a useful complement to my upcoming talks at the Percona Live  MySQL Conference in London, UK on December 3-4 2012:

If you can make it to London in December, we’d love to see you there!  If not, look for future Percona Live conferences.

Tweet the link to this blog for a chance to win a free full conference pass. Make sure to use hashtag #perconalive! Winner will be chosen at the end of the day. 


Here’s a quick way to Foresee if Replication Slave is ever going to catch up and When!

If you have ever had a replication slave that is severely behind, you probably noticed that it’s not catching up with a busy master at a steady pace. Instead, the “Seconds behind master” value goes up and down, so you can’t really tell whether the replica is catching up or not by looking at just a few samples, unless those samples are spread far apart. And even then you can’t tell at a glance when it is going to catch up.

Normally, the “severely behind” thing should not happen, but it does often happen in our consulting practice:

  • sometimes replication breaks and then needs to catch up after it is fixed,
  • other times a new replication slave is built from a backup, which is normally hours behind,
  • or the replication slave became too slow to catch up due to a missing index.

Whatever the case is, the single question I am asked by the customer every time this happens is: “When is the replica going to catch up?”

I used to tell them “I don’t know, it depends…”, and indeed it is not an easy question to answer. There are a few reasons catching up is so unstable:

  1. If you have restarted the server, or started a new one, caches are cold and there’s a lot of IO happening,
  2. Not all queries are created equal – some would run for seconds, while others can be instant,
  3. Batch jobs: some sites run nightly tasks like building statistics tables or table checksums; these are usually very intense and cause the slave to fall behind slightly.

I didn’t like my own answer to the question, so I decided to do something about it. And because I love awk, I did that something in awk:

delay=60   # sampling interval in seconds
cmd="mysql -e 'show slave status\G' | grep Seconds_Behind_Master | awk '{print \$2}'"
while sleep $delay; do
  eval $cmd
done | awk -v delay=$delay '{
   passed += delay;
   if (count++ % 10 == 0)
      printf("s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h\n");
   if (prev == "") {
      prev = $1;
      start = $1;
   }
   speed = (prev-$1)/delay;
   o_speed = (start-($1-passed))/passed;
   if (speed == 0)    speed_d = 1;
   else               speed_d = speed;
   eta = $1/speed_d;
   if (eta < 0)       eta = -86400;
   o_eta = $1/o_speed;
   printf("%8d %8.6f %9.3f %7.3f | %9.3f %7.3f %7.2f\n",
      $1, $1/86400, speed, eta/86400, o_speed, o_eta/86400, o_eta/3600);
   prev = $1;
}'
I don't know if this is ever going to become part of Percona Toolkit; since it's pretty much a one-liner, I just keep it in my snippets pool for easy copy'n'paste.
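The arithmetic can be sanity-checked in isolation by feeding canned Seconds_Behind_Master samples through the same formulas (a hypothetical helper that prints only the raw speeds, without the formatted columns):

```shell
# Print: seconds_behind, catch-up speed over the last interval, and overall
# catch-up speed, for each sample read on stdin. The function argument is delay.
eta_speeds() {
    awk -v delay="$1" '{
        passed += delay
        if (prev == "") { prev = $1; start = $1 }
        speed   = (prev - $1) / delay                 # rate over last interval
        o_speed = (start - ($1 - passed)) / passed    # overall rate since start
        printf("%d %.3f %.3f\n", $1, speed, o_speed)
        prev = $1
    }'
}
```

For instance, a slave going from 1000 to 880 seconds behind over a 60-second interval dropped 120 seconds of lag in 60 seconds, a catch-up rate of 2.0 seconds per second; the ETA is then seconds behind divided by that rate.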

Here's a piece of an output from a server that was almost 27 days behind just yesterday:

// at the beginning:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 2309941 26.735428     0.000  26.735 |     1.000  26.735  641.65
 2309764 26.733380     2.950   9.062 |     2.475  10.801  259.23
 2308946 26.723912    13.633   1.960 |     6.528   4.094   98.25
 2308962 26.724097    -0.267  -1.000 |     5.079   5.262  126.28
 2309022 26.724792    -1.000  -1.000 |     4.063   6.577  157.85
// after one hour:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 2264490 26.209375    39.033   0.671 |    13.418   1.953   46.88
 2262422 26.185440    34.467   0.760 |    13.774   1.901   45.63
 2261702 26.177106    12.000   2.181 |    13.762   1.902   45.65
// after three hours:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 2179124 25.221343    13.383   1.885 |    13.046   1.933   46.40
 2178937 25.219178     3.117   8.092 |    12.997   1.940   46.57
 2178472 25.213796     7.750   3.253 |    12.973   1.943   46.64
// after 12 hours:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 1824590 21.117940    20.233   1.044 |    12.219   1.728   41.48
 1823867 21.109572    12.050   1.752 |    12.221   1.727   41.46
 1823089 21.100567    12.967   1.627 |    12.223   1.726   41.43
// after 21 hours:
s_behind  d_behind   c_sec_s   eta_d | O_c_sec_s O_eta_d O_eta_h
 1501659 17.380312    -0.533  -1.000 |    11.768   1.477   35.44
 1501664 17.380370    -0.083  -1.000 |    11.760   1.478   35.47
 1501689 17.380660    -0.417  -1.000 |    11.751   1.479   35.50

Of course, it is still not perfectly accurate: it does not account for any potential changes in queries, workload, or warm-up, nor for the time it takes to run the mysql cli. But it does give you an idea of the direction the replication slave is heading. Note that negative values mean replication isn't catching up; the values themselves are mostly meaningless.

Here's what the weird acronyms stand for:

  • s_behind - current Seconds_Behind_Master value
  • d_behind - number of days behind based on current s_behind
  • c_sec_s - how many seconds per second were caught up during last interval
  • eta_d - this is ETA based on last interval
  • O_c_sec_s - overall catch-up speed in seconds per second
  • O_eta_d - ETA based on overall catch-up speed (in days)
  • O_eta_h - same as the previous, but in hours

Let me know if you ever find this useful.


Shard-Query turbo charges Infobright community edition (ICE)

Shard-Query is an open source tool kit which helps improve the performance of queries against a MySQL database by distributing the work over multiple machines and/or multiple cores. This is similar to the divide and conquer approach that Hive takes in combination with Hadoop. Shard-Query applies a clever approach to parallelism which allows it to significantly improve the performance of queries by spreading the work over all available compute resources. In this test, Shard-Query averages a nearly 6x (max over 10x) improvement over the baseline, as shown in the following graph:

One significant advantage of Shard-Query over Hive is that it works with existing MySQL data sets and queries. Another advantage is that it works with all MySQL storage engines.

This set of benchmarks evaluates how well Infobright community edition (ICE) performs in combination with Shard-Query.

Data set

It was important to choose a data set that was large enough to create queries that would run for a decent amount of time, but not so large that it was difficult to work with. The ontime flight performance statistics data, available online from the United States Bureau of Transportation Statistics (BTS) made a good candidate for testing, as it had been tested before:
Another MPB post
Lucid DB testing

The raw data is a completely denormalized schema (single table). In order to demonstrate the power of Shard-Query it is important to test complex queries involving joins and aggregation. A star schema is the most common OLAP/DW data model, since it typically represents a data mart. See also: “Data mart or data warehouse?”. As it is the most common data model, it is desirable to benchmark using a star schema, even though it involves work to transform the data.

Star schema

Transforming the data was straightforward. I should note that I did this preprocessing with the MyISAM storage engine, then I dumped the data to tab delimited flat files using mysqldump. I started by loading the raw data from the BTS into a single database table called ontime_stage.

Then, the airport information was extracted:

create table dim_airport(
  airport_id int auto_increment primary key,
  unique key(airport_code)
) as
select distinct
  Origin as `airport_code`,
  OriginCityName as `CityName`,
  OriginState as `State`,
  OriginStateFips as `StateFips`,
  OriginStateName as `StateName`,
  OriginWac as `Wac`
FROM ontime_stage
UNION
select distinct
  Dest as `airport_code`,
  DestCityName as `CityName`,
  DestState as `State`,
  DestStateFips as `StateFips`,
  DestStateName as `StateName`,
  DestWac as `Wac`
FROM ontime_stage;

After extracting flight/airline and date information in a similar fashion, a final table `ontime_fact` is created by joining the newly constructed dimension tables to the staging table, omitting the dimension columns from the projection and replacing them with the dimension keys:

select dim_date.date_id,
       origin.airport_id as origin_airport_id,
       dest.airport_id as dest_airport_id,
       ontime_stage.TailNum, ...
from ontime_stage
join dim_date using(FlightDate)
join dim_airport origin on ontime_stage.Origin = origin.airport_code
join dim_airport dest on ontime_stage.Dest = dest.airport_code
join dim_flight using (UniqueCarrier,AirlineID,Carrier,FlightNum);

The data set contains ontime flight information for 22 years, which can be confirmed by examining the contents of the date dimension:

mysql> select count(*), min(FlightDate), max(FlightDate)
from dim_date\G
*************************** 1. row ***************************
       count(*): 8401
min(FlightDate): 1988-01-01
max(FlightDate): 2010-12-31
1 row in set (0.00 sec)

The airport dimension is a puppet dimension. It is called a puppet because it serves as both the origin and destination dimension, being referenced by origin_airport_id and dest_airport_id in the fact table, respectively. There are nearly 400 major airports included in the data set.

mysql> select count(*) from dim_airport;
| count(*) |
|      396 |
1 row in set (0.00 sec)

The final dimension is the flight dimension, which contains the flight numbers and air carrier hierarchies. Only the largest air carriers must register and report ontime information with the FAA, so there are only 29 air carriers in the table:

mysql> select count(*),
count(distinct UniqueCarrier)
from dim_flight\G
*************************** 1. row ***************************
                     count(*): 58625
count(distinct UniqueCarrier): 29
1 row in set (0.02 sec)

Across the 22 years, the fact table contains over 135 million flights:

mysql> select count(*) from ontime_one.ontime_fact;
| count(*)  |
| 135125787 |
1 row in set (0.00 sec)

This should be made fully clear by the following schema diagram:

Diagram of the ontime star schema (ontime_fact, dim_date, dim_flight, dim_airport)

Test environment

For this benchmark, a test environment consisting of a single commodity database server with 6 cores (+6ht) and 24GB of memory was selected. The selected operating system was Fedora 14. Oracle VirtualBox OSE was used to create six virtual machines, each running Fedora 14. Each of the virtual machines was granted 4GB of memory. A SATA 7200rpm RAID10 battery backed RAID array was used as the underlying storage for the virtual machines.

The MySQL command line client was used to execute a script file containing the 11 queries. The same SQL script was used in the Shard-Query tests. For the baseline, the results and response times were captured with the \T command. The queries were executed on a single database schema containing all of the data. Before loading, there was approximately 23GB of raw data; after loading, ICE compressed it to about 2GB. The test virtual machine was assigned 12 cores for this test.

Shard-Query was given the following configuration file, which lists only one server. A single schema (ontime_one) contained all of the data. The test virtual machine was assigned 12 cores for this test; the same VM was used as in the baseline test, and it was rebooted between tests. A SQL script was fed to the run_query.php script and the output was captured with the ‘tee’ command.

$ cat one.ini


In addition to adding parallelism via scale-up, Shard-Query can improve performance of almost all queries by spreading them over more than one physical server. This is called “scaling out” and it allows Shard-Query to vastly improve the performance of queries which have to examine a large amount of data. Shard-Query includes a loader (loader.php) which can be used to either split a data into multiple files (for each shard, for later loading) or it can load files directly, in parallel, to multiple hosts.

Shard-Query will execute queries in parallel over all of these machines. With enough machines, massive parallelism is possible and very large data sets may be processed. As each machine examines only a small subset of the data, performance can be improved significantly:

$ cat shards.ini







In this configuration, each shard has about 335MB-350MB of data (23GB of raw data, compressed to about 2GB in total, then spread over six servers). Due to ICE limitations, the data was split before loading. The splitting/loading process will be described in another post.
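The scatter-gather execution that makes this possible can be sketched in a few lines of Python. This is a simplified illustration, not Shard-Query’s implementation: the shard “connections” and their row counts below are invented stand-ins for real MySQL connections.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for six shard connections: each "shard"
# simply reports how many fact rows matched the query locally.
shard_row_counts = [120, 95, 110, 130, 88, 104]

def query_shard(rows_matched):
    # In the real tool this would run the rewritten SQL on one shard.
    return rows_matched

# Scatter: run the query on every shard in parallel.
with ThreadPoolExecutor(max_workers=len(shard_row_counts)) as pool:
    partial_results = list(pool.map(query_shard, shard_row_counts))

# Gather: combine the partial aggregates (for COUNT(*), a simple sum).
total = sum(partial_results)
print(total)  # 647
```

Because each worker only touches one shard’s subset of the data, the elapsed time approaches that of the slowest shard rather than the sum of all shards.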

Complex queries

Shard-Query was tested with the simple single-table version of this data set in a previous blog post. Previous testing was limited to a subset of Vadim’s test queries (see that post). As this new test schema is normalized, Vadim’s test queries must be modified to reflect the more complex schema structure. For this benchmark each of the original queries has been rewritten to conform to the new schema, and two new test queries have been added. To model real-world complexity, each of the test queries features at least one join, and many of the filter conditions are placed on attributes in the dimension tables. It will be very interesting to test these queries on various engines and databases.

Following is a list of the queries, followed by a response time table recording the actual response times for each query. The queries should be self-explanatory.

Performance, at a glance

You will notice that Shard-Query is faster in nearly every case. Scaling out improves query performance even more than scaling up, because it adds parallelism beyond what scale-up can exploit. Scale-up improves query performance when there are enough resources to support the added parallelism and when the query uses one of the following: a BETWEEN or IN clause, a subquery in the FROM clause, or a UNION or UNION ALL clause. If none of those features is used, then parallelism can’t be added; Q9 is an example of such a query.
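As a rough illustration of why a BETWEEN clause enables parallelism, here is a Python sketch of the kind of rewrite involved: the range predicate is expanded into one equality predicate per value, and each rewritten query can then be run concurrently. The helper function and query template below are illustrative only, not Shard-Query’s actual rewriter.

```python
def expand_between(sql_template, column, low, high):
    """Expand `column BETWEEN low AND high` into one query per value."""
    return [sql_template.format(cond=f"{column} = {v}")
            for v in range(low, high + 1)]

# Simplified template modeled on Q1; {cond} is the slot for the predicate.
base = ("SELECT DayOfWeek, count(*) AS c FROM ontime_fact "
        "JOIN dim_date USING (date_id) WHERE {cond} GROUP BY DayOfWeek")

queries = expand_between(base, "Year", 2000, 2008)
print(len(queries))  # 9 queries, one per year, runnable in parallel
```

The partial per-year results are then combined at the gather stage (for COUNT with GROUP BY, by summing counts per group key).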

Query details:

-- Q1
 SELECT DayOfWeek, count(*) AS c
   from ontime_fact
   JOIN dim_date using (date_id)
  WHERE Year
BETWEEN 2000 AND 2008
  GROUP BY DayOfWeek

-- Q2
SELECT DayOfWeek, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
 WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008
 GROUP BY DayOfWeek

-- Q3
SELECT CityName as Origin, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
  JOIN dim_airport origin
    ON origin_airport_id = origin.airport_id
 WHERE DepDelay>10
   AND Year BETWEEN 2000 AND 2008
 GROUP BY CityName
 LIMIT 10;

The next queries show how performance improves when Shard-Query adds parallelism for subqueries in the FROM clause. There are benefits from both “scale-up” and “scale-out”, but once again the “scale-out” results are the most striking.

-- Q4
SELECT Carrier, count(*) as c
  from ontime_fact
  JOIN dim_date using (date_id)
  join dim_flight using(flight_id)
 WHERE DepDelay>10
   AND Year=2007
 GROUP BY Carrier

-- Q5
SELECT t.Carrier, c, c2, c*1000/c2 as c3
FROM (SELECT Carrier, count(*) AS c
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE DepDelay>10
         AND Year=2007
       GROUP BY Carrier) t
JOIN (SELECT Carrier, count(*) AS c2
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE Year=2007
       GROUP BY Carrier) t2
  ON (t.Carrier=t2.Carrier)

-- Q6
SELECT t.Year, c1 / c2 as ratio
FROM (select Year, count(*)*1000 as c1
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t
JOIN (select Year, count(*) as c2
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t2
  ON (t.Year=t2.Year);

-- Q7
SELECT t.Year, c1 / c2 as ratio
  FROM (select Year, count(Year)*1000 as c1
          from ontime_fact
          join dim_date using (date_id)
         WHERE DepDelay>10
         GROUP BY Year) t
  JOIN (select Year, count(*) as c2
          from ontime_fact
          join dim_date using (date_id)
         GROUP BY Year) t2
    ON (t.Year=t2.Year);

The performance of the following queries depends on the size of the date range:

-- Q8.0
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2001
 GROUP BY dest.CityName

-- Q8.1
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2005
 GROUP BY dest.CityName

-- Q8.2
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2011
 GROUP BY dest.CityName

-- Q8.3
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1990 and 2011
 GROUP BY dest.CityName

-- Q8.4
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1980 and 2011
 GROUP BY dest.CityName
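Merging COUNT(DISTINCT …) across shards is subtler than merging COUNT(*): per-shard distinct counts cannot simply be summed, because the same origin city can appear on several shards. One correct way to combine them (an illustrative sketch; the tool’s actual strategy may differ) is for each shard to return its distinct value set, with the sets unioned at the gather stage:

```python
# Hypothetical per-shard results for one destination city: each shard
# returns the set of distinct origin cities it saw locally.
shard_results = [
    {"Chicago", "Denver", "Boston"},
    {"Denver", "Seattle"},
    {"Boston", "Miami", "Denver"},
]

# Wrong: summing per-shard distinct counts overcounts duplicates.
naive = sum(len(s) for s in shard_results)   # 8

# Right: union the sets first, then count once.
merged = set().union(*shard_results)
correct = len(merged)                        # 5
print(naive, correct)
```

This is why COUNT(DISTINCT …) queries ship more data back to the coordinator than plain COUNT(*) queries, and why their speed-up from sharding can be less dramatic.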

Finally, Shard-Query performance continues to improve when grouping and filtering are used. Again, notice Q9: it doesn’t use any features which Shard-Query can exploit to add parallelism. Thus, in the scale-up configuration it performs no better than the baseline (in fact, slightly worse). Since scale-out splits the data between servers, it performs about 6x better, as the degree of parallelism is controlled by the number of shards.

-- Q9
select Year ,count(Year) as c1
  from ontime_fact
  JOIN dim_date using (date_id)
 group by Year;

-- Q10
SELECT Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2009 and 2011
 GROUP BY Carrier,dest.CityName

-- Q11
SELECT Year, Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2000 and 2003
   AND Carrier = 'AA'
 GROUP BY Year, Carrier,dest.CityName


The divide-and-conquer approach is very useful when working with large quantities of data. Shard-Query can easily be used with existing data sets, significantly improving the performance of queries that use common features like BETWEEN or IN. It is also possible to spread your data over multiple machines, scaling out to further improve query response times.

These queries are a great test of Shard-Query’s features. It is currently approaching RC status. If you decide to test it and encounter issues, please file a bug in the bug tracker. You can get Shard-Query (currently in development release form, as a checkout from SVN) here: the Shard-Query Google Code project

Full disclosure

Justin Swanhart, the author of this article, is also the creator and maintainer of Shard-Query. The author has previously worked in cooperation with Infobright, including on benchmarking. These particular tests were performed independently of Infobright, without their knowledge or approval. Infobright was, however, given the chance to review this document before publication, as a courtesy. All findings are represented truthfully, transparently, and without any intended bias.


Aspersa tools bit.ly download shortcuts

I use the Aspersa tools a lot, and I find myself going to the website just to download one of the tools all the time. I love that I can download Maatkit with a simple wget maatkit.org/get/tool command, so I made bit.ly shortcuts for all of the current Aspersa tools. Here’s the full list, with my favorite at the top and my least favorite (but nonetheless very useful) at the bottom:

So now it’s as simple as wget bit.ly/aspersa-tool. I hope you will find these useful too.


Update: I made an error with the align tool shortcut, which linked to the root directory of the trunk. I don’t think this is a big deal as I use align least often, but to fix it I created even shorter shortcuts for all the tools, so you can also use wget bit.ly/a-tool


Optimizing slow web pages with mk-query-digest

I don’t use many tools in my consulting practice, but for the ones I do use, I try to know them as well as I can. I’ve been using mk-query-digest for almost as long as it has existed, yet it continues to surprise me in ways I couldn’t have imagined. This time I’d like to share a quick tip on how mk-query-digest lets you slice your data in a completely different way than it does by default.

Disclaimer: this only works when persistent connections or connection pools aren’t used, and it is only accurate when a single MySQL connection is used during the execution of a request.

If you are seeking to reduce the load on the database server and [as a result] improve response time for some random user requests, you are usually interested in the queries consuming the most MySQL time, and that’s how mk-query-digest groups and orders data by default. Fixing the top 10 queries on the list will indeed most likely reduce the load and improve response time for some requests. But what if some pages are still slow to load because of time spent in the database, and you either can’t or don’t want to profile or debug the application to figure out what’s happening under the hood?

That sounds like what I was working on today: I had a slow query log (captured with long_query_time=0 and all the eXtra benefits of the Percona slow query log patch), and I knew some particular pages were taking minutes to load, which is exactly what the customer asked me to focus on. So instead of asking mk-query-digest to list the slowest queries, I asked it to list the slowest sessions:

mk-query-digest --group-by=Thread_id --order-by=Query_time:sum in > out

Spot on: the session I needed to focus on was right at the top. And what do you know, 519 queries were run during that session, which took 148 seconds overall:

# ########################################################################
# Report grouped by Thread_id
# ########################################################################

# Item 1: 3.41 QPS, 0.97x concurrency, ID 0xABCE5AD2A2DD1BA1 at byte 288124661
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.97 [1.0], V/M = 19.02
# Query_time sparkline: | ^______|
# Time range: 2011-04-05 16:12:13 to 16:14:45
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0     519
# Exec time      2    148s    11us     33s   285ms    53ms      2s    26us
# Lock time      0     5ms       0   334us     9us    66us    32us       0
# Rows sent      0      41       0       1    0.08    0.99    0.27       0
# Rows examine   1   4.97M       0 445.49k   9.80k   5.73k  49.33k       0
# Rows affecte   0       2       0       1    0.00       0    0.06       0
# Rows read      1   2.01M       0 250.47k   3.96k    1.96  27.94k    0.99
# Bytes sent     0 241.20k      11   8.01k  475.89  918.49  689.98  258.32
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0      15       0       1    0.03       0    0.17       0
# Tmp disk tbl   0       3       0       1    0.01       0    0.08       0
# Tmp tbl size   0   4.78k       0   4.78k    9.43       0  211.60       0
# Query size     0 100.95k      19   2.71k  199.17  363.48  206.60  151.03
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin   1  67.99k       0  10.64k   1.26k   3.88k   2.47k   31.70
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0
# Boolean:
# Filesort       0% yes,  99% no
# Full scan      7% yes,  92% no
# QC Hit        78% yes,  21% no
# Tmp table      2% yes,  97% no
# Tmp table on   0% yes,  99% no
# String:
# Databases    prod_db
# Hosts        localhost
# InnoDB trxID 1153145C (2/0%), 11531626 (2/0%)... 43 more
# Last errno   0
# Users        prod
# Query_time distribution
#   1us
#  10us  ################################################################
# 100us  #########
#   1ms  #
#  10ms  #
# 100ms  #
#    1s  ###
#  10s+  #

The stats here are aggregated across all queries in the session, which is great, but I still need to figure out which queries were run. I could use mk-log-player to split out the sessions, but unfortunately mk-log-player does not retain all the other useful information, not even query timing. Instead, I used mk-query-digest:

mk-query-digest --filter='$event->{Thread_id} == 160847' in > out

Now I know exactly what needs to be fixed first to make the greatest impact on this page’s response time. I can also convert that into a slow query log listing all the queries executed during this session, in the order they were executed:

mk-query-digest --filter='$event->{Thread_id} == 160847' --no-report --print in > out

Pretty cool, isn’t it? Sure, it would be even better if mk-query-digest could do a nested group-by and order within each group, so I could avoid the extra step; better still would be if it could optimize the queries altogether! Unfortunately mk-query-digest won’t do that for you, but then there’s mk-query-advisor ;)
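Conceptually, what --group-by=Thread_id with --order-by=Query_time:sum does is just re-key the aggregation by session. A rough Python sketch of that grouping over already-parsed events (the event tuples below are invented, sized to match the 148-second session in the report; this is not mk-query-digest’s code):

```python
from collections import defaultdict

# Hypothetical parsed slow-log events: (Thread_id, Query_time in seconds).
events = [
    (160847, 33.0), (160847, 115.0),
    (160911, 1.2),  (160911, 0.8),
    (161002, 4.0),
]

# Sum execution time per session instead of per query fingerprint.
totals = defaultdict(float)
for thread_id, query_time in events:
    totals[thread_id] += query_time

# Order sessions by total query time, slowest first.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # (160847, 148.0) -- the slow session surfaces on top
```

Once the worst session’s Thread_id is known, a second pass filtered to that id (as shown with --filter above) gives the per-query breakdown.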


Maatkit’s mk-query-digest filters

Have you ever seen BIG, weird numbers in an mk-query-digest report that just seem wrong? I have! Here’s one report I got today:

# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time          5088s     1us    171s     2ms   467us   104ms    28us
# Lock time            76s       0      3s    26us    69us     3ms       0
# Rows sent          9.80M       0   1.05M    3.50    0.99  642.32       0
# Rows examine       5.59G       0  82.56M   2.00k    0.99  97.41k       0
# Rows affecte     457.30k       0   2.62k    0.16    0.99    1.68       0
# Rows read          2.16G       0  82.56M  788.53   21.45  82.91k    0.99
# Bytes sent         2.14T       0   4.00G 781.27k   3.52k  47.84M   84.10
# Merge passes     273.47G       0   4.00G  97.69k       0  10.35M       0
# Tmp tables       225.85G       0   4.00G  80.67k       0   7.89M       0
# Tmp disk tbl     381.88G       0   4.00G 136.41k       0  14.08M       0
# Tmp tbl size     255.54G       0   2.39G  91.28k       0   8.59M       0
# Query size       418.38M       6 257.39k  149.45  563.87   1.49k   42.48
# InnoDB:
# IO r bytes       272.92G       0   1.58G 479.74k       0  18.19M       0
# IO r ops         328.05G       0   2.00G 576.65k       0  23.08M       0
# IO r wait        237964272912s       0 3377771733s 398921s       0 22888946s       0
# pages distin     156.50G       0   2.39G 275.11k   27.38  14.48M    4.96
# queue wait       143150489533s       0 3377769328s 239976s       0 16014027s       0
# rec lock wai     216352062699s       0 4085510331s 362690s       0 27625029s       0

That can’t be right! It turns out there are a couple of bugs in the slow log patch. They aren’t really critical, since the numbers are only wrong for administrator commands, yet when aggregated they mess up the whole mk-query-digest output.

There’s a simple solution to that, though: mk-query-digest --filter. With just a small change, it will completely ignore administrator commands:

mk-query-digest --filter='$event->{arg} !~ m/administrator command/' in > out

If these bugs are really bugging you, feel free to sponsor the fix. Otherwise enjoy the power of the almighty mk-query-digest!
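For reference, the snippet passed to --filter is just a Perl predicate applied to each parsed event. The same predicate, expressed in Python over hypothetical parsed events, looks like this (illustrative only; the field names mirror the slow log attributes):

```python
# Invented parsed events; the second one carries a bogus huge timing
# value of the kind the slow log patch bug produces for admin commands.
events = [
    {"arg": "SELECT * FROM t WHERE id = 1", "Query_time": 0.5},
    {"arg": "administrator command: Quit",  "Query_time": 3377771733.0},
    {"arg": "UPDATE t SET x = 2",           "Query_time": 0.25},
]

# Keep only events whose statement text is not an administrator command,
# mirroring: --filter='$event->{arg} !~ m/administrator command/'
kept = [e for e in events if "administrator command" not in e["arg"]]

total_exec = sum(e["Query_time"] for e in kept)
print(len(kept), total_exec)  # 2 0.75
```

Dropping those events before aggregation keeps the totals honest without touching the server or the log itself.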
