May
06
2011
--

Shard-Query turbo charges Infobright community edition (ICE)

Shard-Query is an open source tool kit which helps improve the performance of queries against a MySQL database by distributing the work over multiple machines and/or multiple cores. This is similar to the divide and conquer approach that Hive takes in combination with Hadoop. Shard-Query applies a clever approach to parallelism which allows it to significantly improve the performance of queries by spreading the work over all available compute resources. In this test, Shard-Query averages a nearly 6x (max over 10x) improvement over the baseline, as shown in the following graph:

One significant advantage of Shard-Query over Hive is that it works with existing MySQL data sets and queries. Another advantage is that it works with all MySQL storage engines.

This set of benchmarks evaluates how well Infobright community edition (ICE) performs in combination with Shard-Query.

Data set

It was important to choose a data set that was large enough to create queries that would run for a decent amount of time, but not so large that it was difficult to work with. The ontime flight performance statistics data, available online from the United States Bureau of Transportation Statistics (BTS) made a good candidate for testing, as it had been tested before:
Another MPB post
Lucid DB testing

The raw data is a completely denormalized schema (single table). In order to demonstrate the power of Shard-Query it is important to test complex queries involving joins and aggregation. A star schema is the most common OLAP/DW data model, since it typically represents a data mart. See also: “Data mart or data warehouse?”. As it is the most common data model, it is desirable to benchmark using a star schema, even though it involves work to transform the data.

Star schema

Transforming the data was straightforward. I should note that I did this preprocessing with the MyISAM storage engine, then I dumped the data to tab delimited flat files using mysqldump. I started by loading the raw data from the BTS into a single database table called ontime_stage.

Then, the airport information was extracted:

create table dim_airport(
airport_id int auto_increment primary key,
unique key(airport_code)
)
as
select
  Origin as `airport_code`,
  OriginCityName as `CityName`,
  OriginState as `State`,
  OriginStateFips as `StateFips`,
  OriginStateName as `StateName` ,
  OriginWac as `Wac`
FROM ontime_stage
UNION
select
  Dest as `airport_code`,
  DestCityName as `CityName`,
  DestState as `State`,
  DestStateFips as `StateFips`,
  DestStateName as `StateName` ,
  DestWac as `Wac`
FROM ontime_stage;

After extracting flight/airline and date information in a similar fashion, a final table `ontime_fact` is created by joining the newly constructed dimension table tables to the staging tables, omitting the dimension columns from the projection, instead replacing them with the dimension keys:

select dim_date.date_id,
       origin.airport_id as origin_airport_id,
       dest.airport_id as dest_airport_id,
       dim_flight.flight_id,
       ontime_stage.TailNum, ...
from ontime_stage
join dim_date using(FlightDate)
join dim_airport origin on ontime_stage.Origin = origin.airport_code
join dim_airport dest on ontime_stage.Dest = dest.airport_code
join dim_flight using (UniqueCarrier,AirlineID,Carrier,FlightNum);

The data set contains ontime flight information for 22 years, which can be confirmed by examining the contents of the date dimension:

mysql> select count(*),
min(FlightDate),
max(FlightDate)
from dim_date\G
*************************** 1. row ***************************
       count(*): 8401
min(FlightDate): 1988-01-01
max(FlightDate): 2010-12-31
1 row in set (0.00 sec)

The airport dimension is a puppet dimension. It is called a puppet because it serves as both origin and destination dimensions, being referenced by origin_airport_id and destination_airport_id in the fact table, respectively. There are nearly 400 major airports included in the data set.

mysql> select count(*) from dim_airport;
+----------+
| count(*) |
+----------+
|      396 |
+----------+
1 row in set (0.00 sec)

The final dimension is the flight dimension, which contains the flight numbers and air carrier hierarchies. Only the largest air carriers must register and report ontime information with the FAA, so there are only 29 air carriers in the table:

mysql> select count(*),
count(distinct UniqueCarrier)
from dim_flight\G
*************************** 1. row ***************************
                     count(*): 58625
count(distinct UniqueCarrier): 29
1 row in set (0.02 sec)

Each year has tens of millions of flights:

mysql> select count(*) from ontime_one.ontime_fact;
+-----------+
| count(*)  |
+-----------+
| 135125787 |
+-----------+
1 row in set (0.00 sec)

This should be made fully clear by the following schema diagram:

Star schema (ontime_fact, dim_date, dim_flight, dim_airport)

Diagram of the ontime dimensional schema

Test environment

For this benchmark, a test environment consisting of a single commodity database server with 6 cores (+6ht) and 24GB of memory was selected. The selected operating system was Fedora 14. Oracle VirtualBox OSE was used to create six virtual machines, each running Fedora 14. Each of the virtual machines was granted 4GB of memory. A SATA 7200rpm RAID10 battery backed RAID array was used as the underlying storage for the virtual machines.

Baseline:
The MySQL command line client was used to execute the a script file containing the 11 queries. This same SQL script was used in the Shard-Query tests. For the baseline, the results and response times were captured with the \T command. The queries were executed on a single database schema containing all of the data. Before loading, there was approximately 23GB of data. After loading, ICE compressed this data to about 2GB. The test virtual machine was assigned 12 cores in this test.

Scale-up:
Shard-Query was given the following configuration file, which lists only one server. A single schema (ontime_one) contained all of the data. The test virtual machine was assigned 12 cores for this test. The same VM was used as the previous baseline test. This VM was rebooted between tests. A SQL script was fed to the run_query.php script and the output was captured with the ‘tee’ command.

$ cat one.ini
[default]
user=mysql
db=ontime_one
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.102

Scale-out
In addition to adding parallelism via scale-up, Shard-Query can improve performance of almost all queries by spreading them over more than one physical server. This is called “scaling out” and it allows Shard-Query to vastly improve the performance of queries which have to examine a large amount of data. Shard-Query includes a loader (loader.php) which can be used to either split a data into multiple files (for each shard, for later loading) or it can load files directly, in parallel, to multiple hosts.

Shard-Query will execute queries in parallel over all of these machines. With enough machines, massive parallelism is possible and very large data sets may be processed. As each machine examines only a small subset of the data, performance can be improved significantly:

$ cat shards.ini
[default]
user=mysql
db=ontime
password=
port=5029
column=date_id
mapper=hash
gearman=127.0.0.1:7000
inlist=*
between=*

[shard1]
host=192.168.56.101
db=ontime

[shard2]
host=192.168.56.102
db=ontime

[shard3]
host=192.168.56.103
db=ontime

[shard4]
host=192.168.56.105
db=ontime

[shard5]
host=192.168.56.106
db=ontime

[shard6]
host=192.168.56.104
db=ontime

In this configuration, each shard has about 335MB-350MB of data (23GB raw data, compressed to about 2GB total data. then spread over six servers). Due to ICE limitations, the data was split before loading. The splitting/loading process will be described in another post.

Complex queries

Shard-Query was tested with the simple single table version of this data set in a previous blog post. Previous testing was limited to a subset of Vadim’s test queries (see that post). As this new test schema is normalized, Vadim’s test queries must be modified to reflect the more complex schema structure. For this benchmark each of the original queries has been rewritten to conform to the new schema, and additionally two new test queries have been added. To model real world complexity, each of the test queries feature at least one join, and many of the filter conditions are placed on attributes in the dimension tables. It will be very interesting to test these queries on various engines and databases.

Following is a list of the queries, followed by a response time table recording the actual response times for each query. The queries should be self-explanatory.

Performance, at a glance

You will notice that Shard-Query is faster in nearly every case. The performance of the queries is improved significantly by scaling out, even more so than scaling up, because additional parallelism is added beyond what can be exploited by scale up. Scale up can improve query performance when there is enough resources to support the added parallelism, and when one of the the following are in used in the query: BETWEEN or IN clauses, subqueries in the FROM clause, UNION or UNION ALL clauses. If none of those features are used, then parallelism can’t be added. Q9 is an example of such a query.
.

Query details:

-- Q1
 SELECT DayOfWeek, count(*) AS c
   from ontime_fact
   JOIN dim_date using (date_id)
  WHERE Year
BETWEEN 2000 AND 2008
  GROUP BY DayOfWeek
  ORDER BY c DESC;

-- Q2
SELECT DayOfWeek, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
 WHERE DepDelay>10 AND Year BETWEEN 2000 AND 2008
 GROUP BY DayOfWeek
 ORDER BY c DESC;

-- Q3
SELECT CityName as Origin, count(*) AS c
  from ontime_fact
  JOIN dim_date using (date_id)
  JOIN dim_airport origin
    ON origin_airport_id = origin.airport_id
 WHERE DepDelay>10
   AND Year BETWEEN 2000 AND 2008
 GROUP BY 1
 ORDER BY c
 LIMIT 10;


The next queries show how performance is improved when Shard-Query adds parallelism when “subqueries in the from clause” are used. There are benefits with both “scale-up” and “scale-out”, but once again, the “scale-out” results are the most striking.

-- Q4
SELECT Carrier, count(*) as c
  from ontime_fact
  JOIN dim_date using (date_id)
  join dim_flight using(flight_id)
 WHERE DepDelay>10
   AND Year=2007
 GROUP BY Carrier
 ORDER BY c DESC;

-- Q5
SELECT t.Carrier, c, c2, c*1000/c2 as c3
FROM
     (SELECT Carrier, count(*) AS c
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE DepDelay>10
         AND Year=2007
       GROUP BY Carrier) t
JOIN (SELECT Carrier, count(*) AS c2
        from ontime_fact
        join dim_date using(date_id)
        join dim_flight using(flight_id)
       WHERE Year=2007
       GROUP BY Carrier) t2
  ON (t.Carrier=t2.Carrier)
ORDER BY c3 DESC;

-- Q6
SELECT t.Year, c1 / c2 as ratio
FROM
     (select Year, count(*)*1000 as c1
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t
JOIN (select Year, count(*) as c2
        from ontime_fact
        join dim_date using (date_id)
       WHERE DepDelay>10
       GROUP BY Year) t2
  ON (t.Year=t2.Year);

-- Q7
SELECT t.Year, c1 / c2 as ratio
  FROM (select Year, count(Year)*1000 as c1
          from ontime_fact
          join dim_date using (date_id)
         WHERE DepDelay>10
         GROUP BY Year) t
  JOIN (select Year, count(*) as c2
          from ontime_fact
          join dim_date using (date_id)
         GROUP BY Year) t2
    ON (t.Year=t2.Year);


The performance of the following queries depends on the size of the date range:

-- Q8.0
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2001
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.1
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2005
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.2
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 2001 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.3
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1990 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;

-- Q8.4
SELECT dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
 WHERE Year BETWEEN 1980 and 2011
 GROUP BY dest.CityName
 ORDER BY 2 DESC;


Finally, Shard-Query performance continues to improve when grouping and filtering is used. Again, notice Q9. It doesn’t use any features which Shard-Query can use to add parallelism. Thus, in the scale up configuration it does not perform any better than the baseline, and actually performed just a little worse. Since scale out splits the data between servers, it performs about 6x better as the degree of parallelism is controlled by the number of shards.

-- Q9
select Year ,count(Year) as c1
  from ontime_fact
  JOIN dim_date using (date_id)
 group by Year;

-- Q10
SELECT Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2009 and 2011
 GROUP BY Carrier,dest.CityName
 ORDER BY 3 DESC;

-- Q11
SELECT Year, Carrier, dest.CityName, COUNT( DISTINCT origin.CityName)
  from ontime_fact
  JOIN dim_airport dest on ( dest_airport_id = dest.airport_id)
  JOIN dim_airport origin on ( origin_airport_id = origin.airport_id)
  JOIN dim_date using (date_id)
  JOIN dim_flight using (flight_id)
 WHERE Year BETWEEN 2000 and 2003
   AND Carrier = 'AA'
 GROUP BY Year, Carrier,dest.CityName
 ORDER BY 4 DESC;


Conclusion

The divide and conquer approach is very useful when working with large quantities of data. Shard-Query can be used with existing data sets easily, improving the performance of queries significantly if they use common query features like BETWEEN or IN. It is also possible to spread your data over multiple machines, scaling out to improve query response times significantly.

These queries are a great test of Shard-Query features. It is currently approaching RC status. If you decide to test it and encounter issues, please file a bug on the bug tracker. You can get Shard-Query (currently in development release form as a checkout from SVN) here: Shard-Query Google code project

Full disclosure

Justin Swanhart, the author of this article is also the creator and maintainer of Shard-Query. The author has previously worked in cooperation with Infobright, including on benchmarking. These particular tests were performed independently of Infobright, without their knowledge or approval. Infobright was, however, given the chance to review this document before publication, as a courtesy. All findings are represented truthfully, transparently, and without any intended bias.

Apr
17
2011
--

Aspersa tools bit.ly download shortcuts

I use Aspersa tools a lot and I find myself going to the website just to download one of the tools all the time. I love I can download maatkit with a simple wget maatkit.org/get/tool command so I made bit.ly shortcuts for all of the current aspersa tools. Here’s the full list with my favorite on the top and least favorite (but none the less very useful) on the bottom:

So now it’s as simple as wget bit.ly/aspersa-tool. I hope you will find these useful too.

Aurimas

Update: I made an error with align tool which now links to the root directory of a trunk. I don’t think this is a big deal as I use align least often, but what I did to fix that is created even shorter shortcuts for all the tools, so you can also use wget bit.ly/a-tool

Apr
07
2011
--

Optimizing slow web pages with mk-query-digest

I don’t use many tools in my consulting practice but for the ones I do, I try to know them as best as I can. I’ve been using mk-query-digest for almost as long as it exists but it continues to surprise me in ways I couldn’t imagine it would. This time I’d like to share a quick tip on how mk-query-digest allows you to slice your data in a completely different way than it otherwise would by default.

Disclaimer: this only works when persistent connections or connection pools aren’t used and is only accurate when single mysql connection is used during execution of a request.

If you are seeking to reduce the load on the database server and [as a result] increase response time for some random user requests, you are usually interested in queries that are consuming most MySQL time and that’s how mk-query-digest groups and orders data by default. Fixing top 10 queries on the list indeed will most likely reduce the load and improve response time for some requests. What if some pages are still slow to load because of the time spent in database and you either can’t or don’t want to profile or debug the application to figure out what’s happening under the hood?

That sounds like something I was working on today – I had a slow query log (captured with long_query_time=0 and all the eXtra benefits from Percona slow query log patch), I knew some particular pages were taking minutes to load and that’s exactly what the customer asked me to focus on. So instead of using mk-query-digest to list me top slowest queries, I asked it to list me top slowest sessions:

mk-query-digest --group-by=Thread_id --order-by=Query_time:sum in > out

Spot on, the session I needed to focus on was right at the top. And what do you know, 519 queries were run during that session which took 148s seconds overall:

# ########################################################################
# Report grouped by Thread_id
# ########################################################################

# Item 1: 3.41 QPS, 0.97x concurrency, ID 0xABCE5AD2A2DD1BA1 at byte 288124661
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.97 [1.0], V/M = 19.02
# Query_time sparkline: | ^______|
# Time range: 2011-04-05 16:12:13 to 16:14:45
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0     519
# Exec time      2    148s    11us     33s   285ms    53ms      2s    26us
# Lock time      0     5ms       0   334us     9us    66us    32us       0
# Rows sent      0      41       0       1    0.08    0.99    0.27       0
# Rows examine   1   4.97M       0 445.49k   9.80k   5.73k  49.33k       0
# Rows affecte   0       2       0       1    0.00       0    0.06       0
# Rows read      1   2.01M       0 250.47k   3.96k    1.96  27.94k    0.99
# Bytes sent     0 241.20k      11   8.01k  475.89  918.49  689.98  258.32
# Merge passes   0       0       0       0       0       0       0       0
# Tmp tables     0      15       0       1    0.03       0    0.17       0
# Tmp disk tbl   0       3       0       1    0.01       0    0.08       0
# Tmp tbl size   0   4.78k       0   4.78k    9.43       0  211.60       0
# Query size     0 100.95k      19   2.71k  199.17  363.48  206.60  151.03
# InnoDB:
# IO r bytes     0       0       0       0       0       0       0       0
# IO r ops       0       0       0       0       0       0       0       0
# IO r wait      0       0       0       0       0       0       0       0
# pages distin   1  67.99k       0  10.64k   1.26k   3.88k   2.47k   31.70
# queue wait     0       0       0       0       0       0       0       0
# rec lock wai   0       0       0       0       0       0       0       0
# Boolean:
# Filesort       0% yes,  99% no
# Full scan      7% yes,  92% no
# QC Hit        78% yes,  21% no
# Tmp table      2% yes,  97% no
# Tmp table on   0% yes,  99% no
# String:
# Databases    prod_db
# Hosts        localhost
# InnoDB trxID 1153145C (2/0%), 11531626 (2/0%)... 43 more
# Last errno   0
# Users        prod
# Query_time distribution
#   1us
#  10us  ################################################################
# 100us  #########
#   1ms  #
#  10ms  #
# 100ms  #
#    1s  ###
#  10s+  #
160847

The stats here are aggregated per all queries which is great, but I still need to figure out what queries were run. I could use mk-log-player and split all sessions that way, unfortunately mk-log-player will not have all the other useful information, not even query timing. Instead, I’ve used mk-query-digest:

mk-query-digest --filter='$event->{Thread_id} == 160847' in > out

Now I know exactly what needs to be fixed first to make the greatest impact to this page response time. I can also convert that into a slow query log that lists all the queries that were executed during this session in the order they were executed:

mk-query-digest --filter='$event->{Thread_id} == 160847' --no-report --print in > out

Pretty cool, isn’t it? Sure, it would be even better if mk-query-digest would do a nested group-by and order-by within a group so I would avoid the extra step, but then even better than that would be if it would optimize the queries all together! Unfortunately mk-query-digest won’t do that for you, but then there’s mk-query-advisor ;)

Mar
28
2011
--

Maatkit’s mk-query-digest filters

Have you ever seen BIG weird numbers in mk-query-digest report that just seem wrong? I have! Here’s one report I got today:

...
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time          5088s     1us    171s     2ms   467us   104ms    28us
# Lock time            76s       0      3s    26us    69us     3ms       0
# Rows sent          9.80M       0   1.05M    3.50    0.99  642.32       0
# Rows examine       5.59G       0  82.56M   2.00k    0.99  97.41k       0
# Rows affecte     457.30k       0   2.62k    0.16    0.99    1.68       0
# Rows read          2.16G       0  82.56M  788.53   21.45  82.91k    0.99
# Bytes sent         2.14T       0   4.00G 781.27k   3.52k  47.84M   84.10
# Merge passes     273.47G       0   4.00G  97.69k       0  10.35M       0
# Tmp tables       225.85G       0   4.00G  80.67k       0   7.89M       0
# Tmp disk tbl     381.88G       0   4.00G 136.41k       0  14.08M       0
# Tmp tbl size     255.54G       0   2.39G  91.28k       0   8.59M       0
# Query size       418.38M       6 257.39k  149.45  563.87   1.49k   42.48
# InnoDB:
# IO r bytes       272.92G       0   1.58G 479.74k       0  18.19M       0
# IO r ops         328.05G       0   2.00G 576.65k       0  23.08M       0
# IO r wait        237964272912s       0 3377771733s 398921s       0 22888946s       0
# pages distin     156.50G       0   2.39G 275.11k   27.38  14.48M    4.96
# queue wait       143150489533s       0 3377769328s 239976s       0 16014027s       0
# rec lock wai     216352062699s       0 4085510331s 362690s       0 27625029s       0
...

That can’t be right! Apparently there are couple bugs in the slowlog patch that aren’t really critical as the numbers are only wrong for the administrator commands. Yet when aggregated it kind of messes up the whole mk-query-digest output.

There’s a simple solution to that though — mk-query-digest –filter. With just a small change it will completely ignore administrator commands:

mk-query-digest --filter='$event->{arg} !~ m/administrator command/' in > out

If these bugs are really bugging you, feel free to sponsor the fix. Otherwise enjoy the power of the almighty mk-query-digest!

Nov
09
2010
--

Lost innodb tables, xfs and binary grep

Before I start a story about the data recovery case I worked on yesterday, here’s a quick tip – having a database backup does not mean you can restore from it. Always verify your backup can be used to restore the database! If not automatically, do this manually, at least once a month. No, seriously – in most of the recovery cases I worked on, customers did have some sort of backup, but it just wasn’t working, complete and what not. Someone set it up and never bothered to check if it still works after a while.

Anyway, this post is not really about the backups but rather about few interesting things I learned during last recovery case.

First, some facts about the system and how data was lost:

  • MySQL had a dedicated partition on XFS file system
  • Server was running innodb_file_per_table
  • There was a production master and two slaves, all had same setting
  • Developer accidentally ran DROP DATABASE X on the wrong machine (production master)
  • All slaves followed and dropped their copy of the data
  • The important tables were all InnoDB
  • Having a backup, customer has first attempted to restore from backup on the production master

Luckily (or rather, unfortunately) backup only had table definitions but not the data so no data was written to file system. Mind however that restoring a backup could have been fatal if it would have written some junk data as that would have overwritten the deleted files. Now, here’s what I learned while working on this case:

Recovering from XFS is possible. Just a month ago we had a team meeting in Mallorca where we went through various data loss scenarios. One of them was deleted files on xfs – we all agreed on few things:

  • recovering files from xfs is hard, if at all possible
  • we had no recovery cases on xfs, most likely because:
  • whoever is using xfs, is smart enough to have backups set up properly

Now I’m not picking on the customer or anything – indeed they did have a backup set up, it’s just that some (most important) tables weren’t backed up. We did not try any of the file recovery tools for xfs – apparently they are all targeting specific file types and sure enough InnoDB is not one of the supported files. What we did is we simply ran page_parser on the (already) unmounted file system treating it as a raw device. I was surprised how amazingly simple and fast it was (did you know that latest version of page_parser identifies pages by infimum and supremum records?) – 10G partition was scanned in like 5 minutes and all 4G of innodb pages were successfully written to a separate partition. That’s the easy part though – you run page parser, wait and see what you get.

If InnoDB Data Dictionary was not overwritten by an attempt to restore from the backup, actually second part would’ve been quite easy too, but it was so I could no longer identify correct PK id for specific tables by just mapping data dictionary table records to index records. Instead I had to grep for specific character sequences against all pages. Note however that only works for text in uncompressed text columns (varchar, char, text) but what if tables don’t have any text columns at all? Then, you read further.

GNU grep won’t match binary strings. This isn’t new, I kind of knew grep couldn’t look for binary “junk”, but I really needed it to. Why? Well, here’s few of the scenarios we’ve gone through yesterday:

1. There was this rather big table with integer and enum columns only, where we knew a rather unique PK, well something like 837492636 so we needed a way to find pages that match it. InnoDB would internally store integers in 4-bytes rather than 10 bytes if it were stored as a sequence of characters, so “grep -r 837492636 /dir” would not have worked.

2. There was another table, a small one with 4 smallint columns where all we could match on was a sequence of numbers from a single record – customer knew that there was at least one row with the following sequence: 7, 3, 7, 8. Matching by any of the numbers would be insane as it would match all of the pages while matching on numbers as a sequence of characters would not work for many reasons.

This is where I found bgrep which was exactly the tool for the task. In the case number one, I have just converted number 837492636 to it’s binary representation 0x31EB1F9C and ran “bgrep 31EB1F9C /dir” – there were only like 10 other matches across the 4 gigabytes of pages, some of them probably from the secondary pages, but when you only have that many pages to check it’s really simple.

Second case seemed somewhat complicated, but it really wasn’t – all of the columns were fixed size – 2bytes each, so the thing we had to look for was this sequence: 0007000300070008. I was expecting a lot of mismatches but in fact I ended up with only one match pointing exactly to the right page and so the right index id.

The other thing I would note about bgrep – it was so much faster than matching text using grep, so if you happen to have a lot of data to scan and you have to choose between matching text and number, matching a number using bgrep may work much better.

We are considering shipping bgrep as part of percona recovery toolset, with some additional converters so we can match against various date/time columns as well.


Entry posted by Aurimas Mikalauskas |
10 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Sep
06
2010
--

UDF -vs- MySQL Stored Function

Few days ago I was working on a case where we needed to modify a lot of data before pushing it to sphinx – MySQL did not have a function to do the thing so I thought I’ll write MySQL Stored Function and we’ll be good to go. It worked! But not so well really – building the index, which was taking 10 minutes, was now taking 16 minutes. Then we added another MySQL function for different set of attributes and indexing speed went from 16 minutes to 26 minutes. I knew using UDF would be faster, but I had no idea how much. Have you ever wondered?

So what were the modifications we needed? It was couple very simple things – (1) two varchar columns needed leading nonalpha characters trimmed, so “123 ^&* and some text” would become “and some text”, and (2) same two varchar columns needed some double characters changed to single one so “Picasso” becomes “Picaso”, “Wesselmann” becomes “Weselman” and so on. Why we needed that is another story which this blog post is not about. Note however that only very small portion of data really needed to be modified.

Here are the two MySQL functions I wrote to do the job – ltrim_junk_mysql() and remove_dups_mysql(). Although processing single row seemed to be instantaneous, we needed to process much more than that – and that wasn’t as fast. For example, here’s how long it took to process 100k rows:

mysql> select ltrim_junk_mysql(author), ltrim_junk_mysql(title) from paintings limit 100000;
100000 rows in set (2.97 sec)

mysql> select remove_dups_mysql(author), remove_dups_mysql(title) from paintings limit 100000;
100000 rows in set (2.04 sec)

If you looked carefully at the second function though, you may have noticed I did not necessarily have to write a function, I could have written it as an SQL statement:

mysql> select
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( LOWER(author), 'aa', 'a'), 'bb', 'b'), 'cc', 'c'),
'dd', 'd'), 'ff', 'f'), 'gg', 'g'), 'll', 'l'), 'mm', 'm'), 'nn', 'n'),
'oo', 'o'), 'pp', 'p'), 'rr', 'r'), 'ss', 's'), 'tt', 't'), 'vv', 'v'),
'zz', 'z'),
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE( REPLACE(
REPLACE( REPLACE( LOWER(title), 'aa', 'a'), 'bb', 'b'), 'cc', 'c'),
'dd', 'd'), 'ff', 'f'), 'gg', 'g'), 'll', 'l'), 'mm', 'm'), 'nn', 'n'),
'oo', 'o'), 'pp', 'p'), 'rr', 'r'), 'ss', 's'), 'tt', 't'), 'vv', 'v'),
'zz', 'z') FROM paintings LIMIT 100000;
100000 rows in set (0.33 sec)

Doesn’t look nice, but it already executes more than 6 times faster which is interesting as it shows how much overhead you have by using mysql stored routines interface. So anyway, I asked my colleague Sasha to help me out by rewriting these as UDF functions. Here’s ltrim_junk() function and remove_dups(). Well, guess what:

mysql> select ltrim_junk(author), ltrim_junk(title) from paintings limit 100000;
100000 rows in set (0.13 sec)

mysql> select remove_dups(author), remove_dups(title) from paintings limit 100000;
100000 rows in set (0.17 sec)

So for ltrim_junk() function we got almost 23x improvement and for remove_dups – 12 times if comparing to stored function or 2 times comparing to just using available functions. With that speed I could even scan the whole table of 7 million records:

mysql> select count(*) from paintings where title != ltrim_junk(title);
+----------+
| count(*) |
+----------+
|   101533 |
+----------+
1 row in set (6.82 sec)

mysql> select count(*) from paintings where author != ltrim_junk(author);
+----------+
| count(*) |
+----------+
|    28335 |
+----------+
1 row in set (6.63 sec)

mysql> select count(*) from paintings where author != remove_dups(author) OR title != remove_dups(title);
+----------+
| count(*) |
+----------+
|  2720414 |
+----------+
1 row in set (11.19 sec)

Whereas using stored function used to take minutes!

I don’t mean to say stored functions are bad and you should now rewrite all your functions as UDFs – if you need to process just a few records for a request and you are not burning racks of CPUs to constantly do the job, the speed difference is really negligible. However in case like this one where we have to process many records constantly and every second counts, UDF can really save your day. If you need one and don’t feel confident writing C, you know who to call!


Entry posted by Aurimas Mikalauskas |
5 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Nov
12
2009
--

Tokyo Tyrant -The Extras Part III : Write Bottleneck

This is part 3 of my Tyrant extra’s, part 1 focused on durability, part 2 focused on the perceived performance wall.

#3.  Tokyo Cabinet Can have only a single writer thread, bottlenecking performance

When writing an application using Tokyo Cabinet only one connection can be opened as a “writer”  while the rest are readers.  Tyrant allows for multiple “writes”  to be sent in from multiple applications but it still single threads them when writing out to disk.   If you run several threads all just inserting into Tyrant your will see tyrant hit 100% Cpu on 1 core, and your writes will start to peter out quickly.

Single Threaded Writes

In my tests when I was not disk bound (FS Cache writes) I was able to complete 4Million inserts in a little over 91 seconds using 8 threads.  I actually averaged 43896.98 inserts per second during my 8 thread test.  Moving to 10 threads doing the same 4Million inserts I completed the test in 96 seconds and averaged 41649.42 inserts per second.    Compare this to 4 Million rows using 4 threads which averaged  40933.86 and you start to see that around 40K inserts per second is the most this particular server is capable of ( single threaded ).  Hopefully this is something that maybe able to be fixed internally in the near future.  Until then you may consider breaking up your data into multiple tables each with there own cache.  This limit is per TC DB so this should work.  I had an idea about using the memcached client to distribute the data accross multiple TC database files in the back end.  This should work, I just need to test it :)

Ever notice how as my multi-part posts go on they get shorter and shorter:)  This will be the last Tyrant related post for a little bit.  The 4th & 5th posts were supposed to deal with replication and scaling… this may take a little while.  Thanks for reading!


Entry posted by matt |
3 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Nov
11
2009
--

Tokyo Tyrant – The Extras Part II : The Performance Wall

Continuing my look at Tokyo Tyrant/Cabinet and addressing some of the concerns I have seen people have brought up this is post #2.

#2.  As your data grows does  Tokyo Cabinet slow down?

Yes your performance can degrade. One obvious performance decrease with a larger dataset  is you start to increase the likelihood that your data no longer fits into memory.  This decreases the number of memory operations and trades them for more expensive disk based operations.    As fast as any application is, as you read off disk opposed to memory performance is going to drop off substantially.  One of the more difficult things to test with Tyrant is disk bound performance.  The FS Cache can make Tyrant seem like small amounts of memory will still make it scream.  Once your data set is larger then that, people start to claim they hit the performance “wall”.

In order to help test this I went ahead an mounted the FS with my data files with the sync option which effectively disables the FS cache.  This should help show the real performance of the hash engine.  Here performance dips substantially, as expected :

FS Mounted As Sync

Look at the IO rate:
NoSync:  31 MB/s
Sync:  3.2 MB/s

As one would expect the IO goes crazy when the drive is mounted with the sync option hitting 99% IO wait.  The interesting this here is we are actually bottlenecking on writes and not reads.  You see without the FS cache to buffer the writes when we need to remove data from memory we now have to rely on the internal Tyrant cache and when that is exhausted have to then really write to disk not the FS Cache.  Now Tyrant starts to take on the same characteristics as your classic DB, the bigger the buffer pool the faster the performance:

Difference Memory Sizes for Tyrant
Even here the performance drop-off once you exhaust memory is relative.  The focus here should be the drop off versus other solutions with the same configuration, not the drop off versus a completely cached version.  In this case ask yourself given similar datasets and similar memory requirements what is the performance?  Take the above sync test, when I use 256M of memory and run my test with writes going directly to disk I hit 964 TPS, in previous MySQL tests the same setup (256M BP) netted ~160 TPS.  So 5x improvement all things being equal.  Of course this is a far drop off from the 13K I was getting when everything was effectively in the file system cache or in memory, but 5x is still a very solid improvement.

Next up is looking at Tyrant’s and Cabinet’s write bottleneck.


Entry posted by matt |
3 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Nov
10
2009
--

Tokyo Tyrant – The Extras Part I : Is it Durable?

You know how in addition to the main movie you have extras on the DVD.  Extra commentary, bloopers, extra scenes, etc? Well welcome the Tyrant extras.  With my previous blog posts I was trying to set-up a case for looking at NOSQL tools, and not meant to be a decision making tool.  Each solution has pros and cons that will impact how well the technology works for you.  Based on some of the comments and questions to the other blogs, I thought I would put together a little more detail into some of the deficiencies and strengths of Tokyo Tyrant.

#1.  How durable is Tokyo Tyrant?

Well I went ahead and built a quick script that just inserted data into a TC table ( an id, and a timestamp) and did a kill -9 on the  the server in the middle of it.

Insert:
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355
put error: recv error
159800,1256131127.17364

Here we failed at a time of 1256131127.17355 , before the next record was inserted.

After bringing the server up from a crash:

159795,1256131127.1732
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355

All the records are still there.  So we are good right?  Looking in the code,  Tokyo Cabinet actually utilizes memory mapped files.  I personally have not using mmaped files, so feel free to correct me if you know better then I.  Using mmap here and performing a kill -9 seems to preserve the changes in memory, while powering down the server does not:

163,1257780699.10123
164,1257780699.35172
165,1257780699.60209
166,1257780699.85246

insert yanking of power cord here… gives us Post crash data of:

142,1257780693.84303
143,1257780694.09345

So we basically lost 5 secondish of data.

Looking at the Tyrant & Cabinet  documentation you will see mention of a  SYNC command which they say does the following:

“The function `tcrdbsync’ is used in order to synchronize updated contents of a remote database object with the file and the device.”

Let’s dig a little deeper into the code and see what’s going on:

/* Synchronize updated contents of a hash database object with the file and the device. */
bool tchdbsync(TCHDB *hdb){
assert(hdb);
if(!HDBLOCKMETHOD(hdb, true)) return false;
if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER) || hdb->tran){
tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
HDBUNLOCKMETHOD(hdb);
return false;
}
if(hdb->async && !tchdbflushdrp(hdb)){
HDBUNLOCKMETHOD(hdb);
return false;
}
bool rv = tchdbmemsync(hdb, true);
HDBUNLOCKMETHOD(hdb);
return rv;
}

If it first checks if the file descriptor for the database is less then 0, or your not operating as a writer…  in which case it errors.  Then if checks if your running in async io mode.  If your running async it flushes the records from the delayed record pool.  If your running async and you do not flush your records, then your at the mercy of Tokyo cabinet, or your application to call one of the numerous operations that flushes the delayed record pool ( i.e.  all regular sync operations like tchdbput will flush it ).  I did not test with async, in fact to the best of my knowledge it does not look like tyrant supports async, even though cabinet does.   Which means the meat of the sync command coming from tyrant is tchdbmemsync.

/* Synchronize updating contents on memory of a hash database object. */
bool tchdbmemsync(TCHDB *hdb, bool phys){
assert(hdb);
if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER)){
tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
return false;
}
bool err = false;
char hbuf[HDBHEADSIZ];
tchdbdumpmeta(hdb, hbuf);
memcpy(hdb->map, hbuf, HDBOPAQUEOFF);
if(phys){
size_t xmsiz = (hdb->xmsiz > hdb->msiz) ? hdb->xmsiz : hdb->msiz;
if(msync(hdb->map, xmsiz, MS_SYNC) == -1){
tchdbsetecode(hdb, TCEMMAP, __FILE__, __LINE__, __func__);
err = true;
}
if(fsync(hdb->fd) == -1){
tchdbsetecode(hdb, TCESYNC, __FILE__, __LINE__, __func__);
err = true;
}
}
return !err;
}

Here you see the call to msync.  What does msync do?  The man page says:

“The msync() function writes all modified data to permanent storage locations, if any, in those whole pages containing any part of the address space of the process starting at address addr and continuing for len bytes.”

Basically in the Tokyo Tyrant context msync will flush all the changes to a memory mapped object to disk.  This msync is crucial as you can not guarantee data ever makes it to disk if its not called.  (more below)

The tchdbmemsync function is the only place I saw calling msync. What calls  tchdbmemsync?

tchdbmemsync Called via:
tchdboptimize
tchdbsync
tchdbtranbegin
tchdbtrancommit
tchdbtranabort
tchdbcloseimpl
tchdbcopyimpl

The commands that will indirectly call an msync are : running the optimize command, calling a sync directly, closing a connection to the db, or starting,commiting, or aborting a transaction.  Note a transaction in TC is actually a global transaction and locks all write operations ( used for maintenance ).  What is missing here is a scheduled call to msync.  I looked and traced back the calls from Tyrant into Cabinet and could not find anything that is called by automatically.

The documentation on msync actually says without calling msync there is no guarantee of the data making it to disk.  This implies that it may eventually get written without a direct msync call ( When you purge/lru old data from memory ).    Testing this theory I crashed my server several times and found that data written out to disk without calling msync was very flaky indeed.  I had anywhere from 5 seconds of missing data to 60 seconds post crash.

This means for durability you really need to directly call the sync command.  In my previous post someone pointed out a flaw in this approach saying that they had seen that calling a sync after writes ruined performance.  Looking at the code you can see why calling a sync after each write can severely degrade performance.  Before I explain lets look at the performance hit:

Sync After every Call

Saying there is a performance hit here is an understatement.  The reason for this however is really how msync works and how its used in Tokyo Cabinet.  In a sense it is implemented as a global sync, not a record sync. i.e.  all changes  to the underlying database are flushed at once.  So instead of sync the record you just changed, all of the changed records in the DB will be flushed and synced.  In order to perform this operation a lock is required, which blocks other SYNC calls.   So if you have 32 threads, you could have 1 sync running and 31 others blocked.  This means calling a sync after every call is going to severely degrade performance.

So what can we do to Make Cabinet more durable?   Well the best option in my opinion is to steal a trick from Innodb:

We can easily write a a script that calls a background sync every second ( i.e. like innodb_flush_log_at_trx_commit = 0/2).  I have tested this and I see almost 0 impact on my gaming benchmark from when this is running to when it is not.

Once a Second Sync

You can write this and cron the script or TTSERVER actually provides you a method to call functions periodically:

-ext path : specify the script language extension file.
-extpc name period : specify the function name and the calling period of a periodic command.

Now while I did not see a drop in my benchmark, heavy write operations will see a drop in performance… for instance with 8 threads simply update/inserting data is saw this:

heavy insert sync once a second

Ouch, a 2X hit.  But you can configure the frequency of the sync  up or down as needed to ensure you have the proper recovery -vs- performance setting.


Entry posted by matt |
3 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Sep
16
2009
--

How to generate per-database traffic statistics using mk-query-digest

We often encounter customers who have partitioned their applications among a number of databases within the same instance of MySQL (think application service providers who have a separate database per customer organization … or wordpress-mu type of apps). For example, take the following single MySQL instance with multiple (identical) databases:

SHOW DATABASES;
+----------+
| Database |
+----------+
| db1      |
| db2      |
| db3      |
| db4      |
| mysql    |
+----------+

Separating the data in this manner is a great setup for being able to scale by simply migrating a subset of the databases to a different physical host when the existing host begins to get overloaded. But MySQL doesn’t allow us to examine statistics on a per-database basis.

Enter Maatkit.

There is an often-ignored gem in Maatkit‘s mk-query-digest, and that is the –group-by argument. This can be used to aggregate information by tables, hosts, users, or databases (full documentation is available via perldoc).

%> perl mk-query-digest --limit 100% --group-by db slow.log
...
# Rank Query ID Response time Calls R/Call Item
# ==== ======== ============= ===== ====== ====
#    1 0x       6000 60.0%    6000  0.5124 db3
#    2 0x       2000 20.0%    2000  0.0112 db1
#    3 0x       1500 15.0%    1500  0.1665 db2
#    4 0x        500  5.0%     500  0.0022 db4

So here, we can see that the majority (60%, to be exact) of execution time is spent in db3. If the server is reaching it’s capacity and the next most useful performance optimization is to migrate a database to a different server, you know exactly which database to move (db3) and how much room that will give you on the original host (60% growth) and on the new host (40% growth), which may have a direct bearing on your hardware selection.

Let Baron know how awesome you think this is by getting him a gift from his Amazon Wish List!


Entry posted by Ryan Lowe |
No comment

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com