Nov 16, 2009

“Shard early, shard often”

I wrote a post a while back that explained why you don’t want to shard. In that post I tried to explain that hardware advances, such as 128G of RAM becoming so cheap, are changing the point at which you need to shard, and that the (often omitted) operational issues created by sharding can be painful.

What I didn’t mention was this: if you’ve established that you will eventually need to shard, is it better to just get it out of the way early? My answer is almost always no. That is to say, I disagree with a statement I’ve been hearing recently: “shard early, shard often”. Here’s why:

  • There’s an order of magnitude of performance that can be gained by focusing on query/index/schema optimization. The gains from sharding are usually much lower.
  • If you shard first, and then decide you want to tune query/index/schema to reduce server count, you find yourself in a more difficult position – since you have to apply your changes across all servers.

Or to phrase that another way:
I would never recommend sharding to a customer until I had at least reviewed their slow query log with mk-query-digest and understood exactly why each of the queries in that report was slow.  While we have some customers who have managed to create their own tools for shard automation, it’s always easier to propose major changes to how data is stored before you have a cluster of 50+ servers.


Entry posted by Morgan Tocker | 6 comments


Nov 15, 2009

NDB API examples: the trivia needed to make ndbapi_scan work!

When I did NDB Cluster engagements, it happened a few times that I had to cover the NDB API. Once, a customer asked for an example of how to perform a simple scan. Remembering that some examples ship within the NDB source tree, under ./storage/ndb/ndbapi-examples/, I decided to take a look; why write something when a good example already exists… I was not expecting what I found!

1. Missing headers


$ make
g++ -g -O0 -c -Wall -fno-rtti -fno-exceptions -I/usr/include -I../../../../include -I../../../../storage/ndb/include -I../../../../storage/ndb/include/ndbapi ndbapi_scan.cpp
ndbapi_scan.cpp: In constructor ‘Car::Car()’:
ndbapi_scan.cpp:111: error: ‘memset’ was not declared in this scope
ndbapi_scan.cpp: In function ‘void drop_table(MYSQL&)’:
ndbapi_scan.cpp:124: error: ‘exit’ was not declared in this scope
ndbapi_scan.cpp: In function ‘void create_table(MYSQL&)’:
...

Hmmm, some header files are missing; let’s add them:


#include <string.h>
#include <stdlib.h>

2. The clean up code

With the added header files, the program compiles flawlessly. Let’s try it!


$ LD_LIBRARY_PATH=/usr/local/mysql-cluster-gpl-7.0.8a/lib/mysql ./ndbapi_scan /usr/local/mysql-cluster-gpl-7.0.8a/mysql.sock 127.0.0.
Unable to connect with connect string: nodeid=0,127.0.0.:1186
Retrying every 5 seconds. Attempts left: 4^C

My bad… I made a typo in the IP of the MGM server. Easy to correct…


yves@yves-laptop:/opt/mysql-cluster-gpl-7.0.8/storage/ndb/ndbapi-examples/ndbapi_scan$ LD_LIBRARY_PATH=/usr/local/mysql-cluster-gpl-7.0.8a/lib/mysql ./ndbapi_scan /usr/local/mysql-cluster-gpl-7.0.8a/mysql.sock 127.0.0.1
MySQL Cluster already has example table: GARAGE. Dropping it...
MySQL Cluster already has example table: GARAGE. Dropping it...
MySQL Cluster already has example table: GARAGE. Dropping it...
MySQL Cluster already has example table: GARAGE. Dropping it...
^C

What’s that? Wow… look at the create_table code…


void create_table(MYSQL &mysql)
{
  while (mysql_query(&mysql,
                     "CREATE TABLE"
                     " GARAGE"
                     " (REG_NO INT UNSIGNED NOT NULL,"
                     " BRAND CHAR(20) NOT NULL,"
                     " COLOR CHAR(20) NOT NULL,"
                     " PRIMARY KEY USING HASH (REG_NO))"
                     " ENGINE=NDB"))
  {
    if (mysql_errno(&mysql) != ER_TABLE_EXISTS_ERROR)
      MYSQLERROR(mysql);
    std::cout << "MySQL Cluster already has example table: GARAGE. "
              << "Dropping it..." << std::endl;
    /******************
     * Recreate table *
     ******************/
    drop_table(mysql);
    create_table(mysql);
  }
}

That one is sweet; no need to be a C guru to catch it… it is fairly obvious where the problem is, isn’t it? Let’s remove the recursive “create_table” call.
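For reference, here is the function with just that recursive call removed; the surrounding while loop then retries the CREATE TABLE after the drop and succeeds on the second pass. This is a minimal sketch of the fix, not necessarily the exact patch that later went into the source tree:

void create_table(MYSQL &mysql)
{
  while (mysql_query(&mysql,
                     "CREATE TABLE"
                     " GARAGE"
                     " (REG_NO INT UNSIGNED NOT NULL,"
                     " BRAND CHAR(20) NOT NULL,"
                     " COLOR CHAR(20) NOT NULL,"
                     " PRIMARY KEY USING HASH (REG_NO))"
                     " ENGINE=NDB"))
  {
    if (mysql_errno(&mysql) != ER_TABLE_EXISTS_ERROR)
      MYSQLERROR(mysql);
    std::cout << "MySQL Cluster already has example table: GARAGE. "
              << "Dropping it..." << std::endl;
    /* Drop it and let the while loop re-run the CREATE TABLE. */
    drop_table(mysql);
  }
}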

3. Success

$ LD_LIBRARY_PATH=/usr/local/mysql-cluster-gpl-7.0.8a/lib/mysql ./ndbapi_scan /usr/local/mysql-cluster-gpl-7.0.8a/mysql.sock 127.0.0.1
Connecting…Connected!
Initializing NDB…Done
Looking for table GARAGE…Got it! = GARAGE
Getting the columns…
Done
Populating…populate: Success!
Scanning and printing…0 Mercedes Blue
12 Toyota Pink
5 BMW Black
7 BMW Black
3 Mercedes Blue
9 BMW Black
10 Toyota Pink
14 Toyota Pink
11 Toyota Pink
1 Mercedes Blue
13 Toyota Pink
6 BMW Black
2 Mercedes Blue
4 Mercedes Blue
8 BMW Black
scan_print: Success!

Going to delete all pink cars!
Deleting pink cars…No error
Done
0 Mercedes Blue
5 BMW Black
7 BMW Black
3 Mercedes Blue
9 BMW Black
1 Mercedes Blue
6 BMW Black
2 Mercedes Blue
4 Mercedes Blue
8 BMW Black
scan_print: Success!

Going to update all Blue cars to Black cars!
No error
0 Mercedes Black
5 BMW Black
7 BMW Black
3 Mercedes Black
9 BMW Black
1 Mercedes Black
6 BMW Black
2 Mercedes Black
4 Mercedes Black
8 BMW Black
scan_print: Success!

I know these programs are just examples. Yet they are in some way part of the documentation, and their quality is, let’s say, questionable.


Entry posted by yves | 4 comments


Nov 14, 2009

D&D: My Fatality Compilation (Part 2)

So it seems that I’m pretty error prone and unlucky when it comes to keeping my characters alive while playing with my D&D group.

This is a continuation of a log of all my character fatalities in the same on-going campaign for your enjoyment.


Nov 13, 2009

Finding your MySQL High-Availability solution – Replication

In the last two blog posts about High Availability for MySQL, we introduced definitions and provided a list of questions that you need to ask yourself before choosing an HA solution. In this new post, we will cover the most popular HA solution for MySQL: replication.

High Availability solution for MySQL: Replication

This HA solution is the easiest to implement and to manage. You basically need to set up MySQL replication between a master and one or more slaves. Upon failure of the master, one of the slaves is manually promoted to the master role and replication on the other slaves is re-adjusted to point to the new master. This solution works well with all the MySQL storage engines, including MyISAM (NDB is a special case, discussed later), but it suffers from the limitations of MySQL replication. The main limitation, in terms of HA, is the asynchronous design of MySQL replication, which does not allow the master to be sure the slave has been updated before returning from a commit statement. There is a window of time during which a fully committed transaction may not yet have been pushed to the slave(s), leading to data loss. Many large websites that are fine with some data loss rely on replication for HA and for read scaling.
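To make the manual promotion concrete, here is a rough, hypothetical sketch of the statements involved, driven through the MySQL C API (mysql_query), in the same spirit as the ndbapi_scan example earlier in this digest. The host name, credentials and binary log coordinates are placeholders; in practice you would take the coordinates from SHOW MASTER STATUS on the newly promoted master:

// Hypothetical sketch: promote one slave and repoint the other at it.
#include <mysql.h>
#include <cstdio>
#include <cstdlib>

static void run(MYSQL &conn, const char *sql)
{
  if (mysql_query(&conn, sql)) {
    std::fprintf(stderr, "'%s' failed: %s\n", sql, mysql_error(&conn));
    std::exit(1);
  }
}

void promote(MYSQL &new_master, MYSQL &other_slave)
{
  // On the slave being promoted: stop applying events, start accepting writes.
  run(new_master, "STOP SLAVE");
  run(new_master, "SET GLOBAL read_only = 0");

  // On the remaining slave: point replication at the new master and resume.
  run(other_slave, "STOP SLAVE");
  run(other_slave,
      "CHANGE MASTER TO"
      " MASTER_HOST='new-master.example.com',"
      " MASTER_USER='repl', MASTER_PASSWORD='secret',"
      " MASTER_LOG_FILE='mysql-bin.000123', MASTER_LOG_POS=4");
  run(other_slave, "START SLAVE");
}

The same statements can of course be issued by hand from the mysql client; the point is simply how few steps a manual failover involves, and why automating them pays off.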

In addition to hardware failure, the level of availability of this solution is affected by the availability of the MySQL replication link between the servers. Replication often breaks for various reasons, and while replication is broken, there is no high availability. Also, the availability of this solution is affected by how far the slaves were behind the master when the outage occurred. So, if you want a good level of availability, you need a good monitoring and alerting system to react quickly to replication issues, and you need a rather small write load so that the slaves do not lag too far behind the master. To maximize the level of availability, recovery should be automatic.

Apart from its simplicity, an HA solution based on replication has many interesting properties; no wonder it is so popular. First, if the application is well designed and has separate database handles for read and write operations, this HA solution can scale read operations to a high level. Using the slaves for reads has a second interesting side effect: the caches of the slaves are hot, so failing over to a slave means no degraded performance associated with cache warm-up. Finally, it is well known that with MySQL, altering a table means recreating the whole table, and it is a blocking operation. Altering a large table may take many hours. The trick here is to run the ALTER TABLE on a slave and then, once done, let the slave catch up with the master using the new table schema, fail over to the slave, and repeat the ALTER TABLE on the other server. These online schema changes are easier when a master-to-master topology is used.

The following figure summarizes the simplest HA architecture using MySQL replication. All writes go to the master while reads are spread between the master and the slave. Upon failure of the master, replication is stopped on the slave and all traffic is redirected to the slave, which now handles reads and writes.

HA replication

Pros:
  • Simple
  • Inexpensive
  • All the servers can be used, no idle standby
  • Supports MyISAM
  • Caches on the failover slave are not cold
  • Online schema changes
  • Low-impact backups

Cons:
  • Variable level of availability (98-99.9+%)
  • Not suitable for high write loads
  • Read scaling only if the application splits reads from writes
  • Can lose data

Automatic failover with replication

I already mentioned that for best HA levels, failover or recovery should be automatic. There are tools to manage automatic failover with replication like MMM, Flipper and Tungsten. Here, I will quickly describe the most popular one, MMM.

With MMM, you need to add a separate server, the Manager, which, as the name implies, manages the availability of the MySQL service. A high-availability solution based on MMM requires at least 2 MySQL servers configured in a master-to-master topology. Additional slaves can also be added. An MMM agent runs on all the MySQL servers and is used to perform OS-level operations. The principle of operation of MMM is based on VIPs. There is one write VIP, where write operations are sent, and as many read VIPs as there are MySQL servers. For the write VIP, MMM monitors the state of the current master and, upon failure, tries to kill all the connections to the failing server and transfers the write VIP to the other master. For the read VIPs, MMM monitors the state of the slaves and removes the read VIP from a slave if it has failed or is lagging behind the master by more than a defined threshold. One of the main limitations of MMM is its lack of fencing capability. It is important to stop all the connections to the failing master, and if that server is not responding, maybe because of a network problem, a STONITH device must be used to fence it. I am far from being an expert with MMM (other guys on my team are way better than me), but I have heard that the MMM v1 code base had some deficiencies. MMM v2 is a complete rewrite that addresses some of the shortcomings of v1. Walter Heck from OpenQuery gave an excellent webinar on it recently.

The architecture of a highly available setup using MMM and master-master replication is presented in the figure below. Apart from the minimum requirement of two MySQL servers replicating each other, there is a third server, called the manager, that controls both MySQL servers through an agent running on each server. The manager controls and monitors the state of the replication and assigns virtual IPs for specific roles. There is one VIP where write operations are sent and two or more VIPs where read operations are sent. If replication on one of the MySQL servers lags behind too much, its read VIP will be moved to another server.

master-master

In conclusion, replication can be used in many cases to build effective and scalable highly available solutions, but it has some limitations. In my next blog post, I’ll present another HA solution built around Heartbeat and DRBD.


Entry posted by yves | No comment


Nov 12, 2009

Would you like to ask a question about PBXT?

In the next week, I’ll be interviewing Paul McCullagh, the architect behind the PBXT storage engine. If you have any questions for Paul, please post them here by Monday 16th November.

Note: The idea for this interview was inspired by a previous interview with Heikki Tuuri – creator of InnoDB. You can see the answers here and here.


Entry posted by Morgan Tocker | 5 comments


Nov 12, 2009

Tokyo Tyrant – The Extras Part III: Write Bottleneck

This is part 3 of my Tyrant extras; part 1 focused on durability, part 2 on the perceived performance wall.

#3.  Tokyo Cabinet can have only a single writer thread, bottlenecking performance

When writing an application using Tokyo Cabinet, only one connection can be opened as a “writer” while the rest are readers. Tyrant allows multiple “writers” to send data in from multiple applications, but it still single-threads them when writing out to disk. If you run several threads all just inserting into Tyrant, you will see Tyrant hit 100% CPU on 1 core, and your writes will start to peter out quickly.

Single Threaded Writes

In my tests, when I was not disk bound (FS cache writes), I was able to complete 4 million inserts in a little over 91 seconds using 8 threads, averaging 43896.98 inserts per second during the 8-thread test. Moving to 10 threads doing the same 4 million inserts, I completed the test in 96 seconds and averaged 41649.42 inserts per second. Compare this to 4 million rows using 4 threads, which averaged 40933.86, and you start to see that around 40K inserts per second is the most this particular server is capable of (single-threaded). Hopefully this is something that can be fixed internally in the near future. Until then, you may consider breaking up your data into multiple tables, each with their own cache. This limit is per TC database, so this should work. I had an idea about using the memcached client to distribute the data across multiple TC database files in the back end; this should work, I just need to test it :) A rough sketch of that hashing idea follows below.
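For what it’s worth, here is a hypothetical sketch of that idea using the plain Tokyo Tyrant C API rather than the memcached client: hash each key to one of several ttserver instances, each backed by its own Tokyo Cabinet file (and therefore its own writer thread). The host names, port and hash function are made up for illustration:

// Hypothetical sketch: spread writes over several ttserver instances,
// each with its own Tokyo Cabinet file and its own writer thread.
#include <tcrdb.h>
#include <cstdio>

static const int NSHARDS = 4;
static const char *hosts[NSHARDS] = {"tt1", "tt2", "tt3", "tt4"};  // made-up hosts
static TCRDB *shards[NSHARDS];

static unsigned int shard_for(const char *key)
{
  unsigned int h = 5381;                      // djb2 string hash
  while (*key) h = h * 33 + (unsigned char)*key++;
  return h % NSHARDS;
}

int main()
{
  for (int i = 0; i < NSHARDS; i++) {
    shards[i] = tcrdbnew();
    if (!tcrdbopen(shards[i], hosts[i], 1978)) {
      std::fprintf(stderr, "open %s failed: %s\n",
                   hosts[i], tcrdberrmsg(tcrdbecode(shards[i])));
      return 1;
    }
  }

  // Each put only contends with other writers on its own shard.
  const char *key = "user:42";
  tcrdbput2(shards[shard_for(key)], key, "some value");

  for (int i = 0; i < NSHARDS; i++) { tcrdbclose(shards[i]); tcrdbdel(shards[i]); }
  return 0;
}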

Ever notice how, as my multi-part posts go on, they get shorter and shorter? :)  This will be the last Tyrant-related post for a little bit. The 4th & 5th posts were supposed to deal with replication and scaling… this may take a little while. Thanks for reading!


Entry posted by matt | 3 comments


Nov 11, 2009

Tokyo Tyrant – The Extras Part II: The Performance Wall

Continuing my look at Tokyo Tyrant/Cabinet and addressing some of the concerns I have seen people bring up, this is post #2.

#2.  As your data grows, does Tokyo Cabinet slow down?

Yes, your performance can degrade. One obvious performance decrease with a larger dataset is that you increase the likelihood that your data no longer fits into memory, trading memory operations for more expensive disk-based operations. No matter how fast an application is, once you read off disk instead of memory, performance is going to drop off substantially. One of the more difficult things to test with Tyrant is disk-bound performance. The FS cache can make it seem like Tyrant will still scream with only small amounts of memory. Once your data set is larger than that, people start to claim they hit the performance “wall”.

In order to help test this, I went ahead and mounted the FS holding my data files with the sync option, which effectively disables the FS cache. This should help show the real performance of the hash engine. Here performance dips substantially, as expected:

FS Mounted As Sync

Look at the IO rate:
NoSync:  31 MB/s
Sync:  3.2 MB/s

As one would expect, IO goes crazy when the drive is mounted with the sync option, hitting 99% IO wait. The interesting thing here is that we are actually bottlenecking on writes and not reads. You see, without the FS cache to buffer the writes, when we need to remove data from memory we now have to rely on the internal Tyrant cache, and when that is exhausted we have to really write to disk, not to the FS cache. Now Tyrant starts to take on the same characteristics as your classic DB: the bigger the buffer pool, the faster the performance:

Different Memory Sizes for Tyrant
Even here the performance drop-off once you exhaust memory is relative. The focus should be the drop-off versus other solutions with the same configuration, not the drop-off versus a completely cached version. In this case, ask yourself: given similar datasets and similar memory requirements, what is the performance? Take the above sync test: when I use 256M of memory and run my test with writes going directly to disk, I hit 964 TPS, while in previous MySQL tests the same setup (256M buffer pool) netted ~160 TPS. So a 5x improvement, all things being equal. Of course this is a far cry from the 13K I was getting when everything was effectively in the file system cache or in memory, but 5x is still a very solid improvement.

Next up is looking at Tyrant’s and Cabinet’s write bottleneck.


Entry posted by matt | 3 comments


Nov 10, 2009

Tokyo Tyrant – The Extras Part I: Is it Durable?

You know how, in addition to the main movie, you get extras on a DVD: extra commentary, bloopers, extra scenes, etc.? Well, welcome to the Tyrant extras. My previous blog posts were trying to build a case for looking at NoSQL tools; they were not meant to be a decision-making tool. Each solution has pros and cons that will impact how well the technology works for you. Based on some of the comments and questions on the other posts, I thought I would put together a little more detail on some of the deficiencies and strengths of Tokyo Tyrant.

#1.  How durable is Tokyo Tyrant?

Well, I went ahead and built a quick script that just inserted data into a TC table (an id and a timestamp) and did a kill -9 on the server in the middle of it.

Insert:
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355
put error: recv error
159800,1256131127.17364

Here we failed at a time of 1256131127.17355, before the next record was inserted.

After bringing the server up from a crash:

159795,1256131127.1732
159796,1256131127.17329
159797,1256131127.17338
159798,1256131127.17345
159799,1256131127.17355

All the records are still there. So we are good, right? Looking at the code, Tokyo Cabinet actually utilizes memory-mapped files. I personally have not used mmapped files much, so feel free to correct me if you know better than I do. Using mmap here, performing a kill -9 seems to preserve the changes in memory, while powering down the server does not:

163,1257780699.10123
164,1257780699.35172
165,1257780699.60209
166,1257780699.85246

Insert yanking of the power cord here… which gives us post-crash data of:

142,1257780693.84303
143,1257780694.09345

So we basically lost around 5 seconds of data.
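The behaviour is easy to reproduce outside of Tokyo Cabinet with a bare memory-mapped file. This is a hypothetical stand-alone demo, not Cabinet code: the write goes into the page cache through the mapping, so it survives a kill -9 (the kernel still owns the dirty pages and writes them back later), but it is lost on a power failure unless msync() is called:

/* Hypothetical stand-alone demo, not Tokyo Cabinet code. */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstring>

int main()
{
  const size_t size = 4096;
  int fd = open("mmap_demo.dat", O_RDWR | O_CREAT, 0644);
  if (fd < 0 || ftruncate(fd, size) != 0) return 1;

  char *map = static_cast<char *>(mmap(NULL, size, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0));
  if (map == MAP_FAILED) return 1;

  /* The store only dirties a page in the page cache; no write(), no msync(). */
  std::strcpy(map, "record 42,1257780699.10123\n");

  /* kill -9 here: the kernel still flushes the page and the record survives. */
  /* Pull the power cord here: the record is gone, unless                     */
  /* msync(map, size, MS_SYNC) was called first.                              */
  pause();
  return 0;
}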

Looking at the Tyrant & Cabinet documentation, you will see mention of a SYNC command, which they say does the following:

“The function `tcrdbsync’ is used in order to synchronize updated contents of a remote database object with the file and the device.”

Let’s dig a little deeper into the code and see what’s going on:

/* Synchronize updated contents of a hash database object with the file and the device. */
bool tchdbsync(TCHDB *hdb){
  assert(hdb);
  if(!HDBLOCKMETHOD(hdb, true)) return false;
  if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER) || hdb->tran){
    tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
    HDBUNLOCKMETHOD(hdb);
    return false;
  }
  if(hdb->async && !tchdbflushdrp(hdb)){
    HDBUNLOCKMETHOD(hdb);
    return false;
  }
  bool rv = tchdbmemsync(hdb, true);
  HDBUNLOCKMETHOD(hdb);
  return rv;
}

It first checks whether the file descriptor for the database is less than 0, or whether you are not operating as a writer, in which case it errors. Then it checks whether you are running in async IO mode. If you are running async, it flushes the records from the delayed record pool. If you are running async and you do not flush your records, then you are at the mercy of Tokyo Cabinet, or of your application, to call one of the numerous operations that flush the delayed record pool (i.e. all regular synchronous operations like tchdbput will flush it). I did not test with async; in fact, to the best of my knowledge it does not look like Tyrant supports async, even though Cabinet does. Which means the meat of the sync command coming from Tyrant is tchdbmemsync.

/* Synchronize updating contents on memory of a hash database object. */
bool tchdbmemsync(TCHDB *hdb, bool phys){
  assert(hdb);
  if(hdb->fd < 0 || !(hdb->omode & HDBOWRITER)){
    tchdbsetecode(hdb, TCEINVALID, __FILE__, __LINE__, __func__);
    return false;
  }
  bool err = false;
  char hbuf[HDBHEADSIZ];
  tchdbdumpmeta(hdb, hbuf);
  memcpy(hdb->map, hbuf, HDBOPAQUEOFF);
  if(phys){
    size_t xmsiz = (hdb->xmsiz > hdb->msiz) ? hdb->xmsiz : hdb->msiz;
    if(msync(hdb->map, xmsiz, MS_SYNC) == -1){
      tchdbsetecode(hdb, TCEMMAP, __FILE__, __LINE__, __func__);
      err = true;
    }
    if(fsync(hdb->fd) == -1){
      tchdbsetecode(hdb, TCESYNC, __FILE__, __LINE__, __func__);
      err = true;
    }
  }
  return !err;
}

Here you see the call to msync.  What does msync do?  The man page says:

“The msync() function writes all modified data to permanent storage locations, if any, in those whole pages containing any part of the address space of the process starting at address addr and continuing for len bytes.”

Basically, in the Tokyo Tyrant context, msync will flush all the changes to a memory-mapped object to disk. This msync is crucial, as you cannot guarantee data ever makes it to disk if it is not called (more below).

The tchdbmemsync function is the only place where I saw msync being called. So what calls tchdbmemsync?

tchdbmemsync is called via:
tchdboptimize
tchdbsync
tchdbtranbegin
tchdbtrancommit
tchdbtranabort
tchdbcloseimpl
tchdbcopyimpl

The commands that will indirectly call msync are: running the optimize command, calling a sync directly, closing a connection to the DB, or starting, committing, or aborting a transaction. Note that a transaction in TC is actually a global transaction and locks all write operations (it is used for maintenance). What is missing here is a scheduled call to msync. I looked and traced back the calls from Tyrant into Cabinet and could not find anything that calls it automatically.

The documentation on msync actually says that without calling msync there is no guarantee of the data making it to disk. This implies that it may eventually get written without a direct msync call (when old data is purged/LRUed out of memory). Testing this theory, I crashed my server several times and found that data written out to disk without calling msync was very flaky indeed. I had anywhere from 5 to 60 seconds of missing data post-crash.

This means that for durability you really need to call the sync command directly. In my previous post, someone pointed out a flaw in this approach, saying that they had seen calling a sync after writes ruin performance. Looking at the code, you can see why calling a sync after each write can severely degrade performance. Before I explain, let’s look at the performance hit:

Sync After every Call

Saying there is a performance hit here is an understatement. The reason for this, however, is really how msync works and how it is used in Tokyo Cabinet. In a sense it is implemented as a global sync, not a per-record sync; i.e. all changes to the underlying database are flushed at once. So instead of syncing just the record you changed, all of the changed records in the DB will be flushed and synced. In order to perform this operation a lock is required, which blocks other sync calls. So if you have 32 threads, you could have 1 sync running and 31 others blocked. This means calling a sync after every write is going to severely degrade performance.

So what can we do to make Cabinet more durable? Well, the best option in my opinion is to steal a trick from InnoDB:

We can easily write a script that calls a background sync every second (i.e. like innodb_flush_log_at_trx_commit = 0/2). I have tested this and I see almost zero impact on my gaming benchmark between when it is running and when it is not. A minimal client-side sketch of such a syncer follows the graph below.

Once a Second Sync
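As for the script itself, one minimal client-side sketch (assuming a ttserver on localhost:1978; tcrdbsync is the documented Tyrant sync call quoted above) might look like this:

// Hypothetical background syncer: ask ttserver to flush/msync once a second.
#include <tcrdb.h>
#include <cstdio>
#include <unistd.h>

int main()
{
  TCRDB *rdb = tcrdbnew();
  if (!tcrdbopen(rdb, "localhost", 1978)) {
    std::fprintf(stderr, "open failed: %s\n", tcrdberrmsg(tcrdbecode(rdb)));
    return 1;
  }

  /* One global sync per second, like innodb_flush_log_at_trx_commit = 0/2. */
  for (;;) {
    if (!tcrdbsync(rdb))   // ends up in tchdbsync -> tchdbmemsync -> msync/fsync
      std::fprintf(stderr, "sync failed: %s\n", tcrdberrmsg(tcrdbecode(rdb)));
    sleep(1);              // tune the interval for the recovery vs. performance trade-off
  }
}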

You can write this and cron the script, or ttserver itself provides a way to call functions periodically:

-ext path : specify the script language extension file.
-extpc name period : specify the function name and the calling period of a periodic command.

Now, while I did not see a drop in my benchmark, heavy write operations will see a drop in performance… for instance, with 8 threads simply updating/inserting data I saw this:

heavy insert sync once a second

Ouch, a 2x hit. But you can configure the frequency of the sync up or down as needed to strike the proper recovery-vs-performance balance.


Entry posted by matt | 3 comments


Nov 05, 2009

Air traffic queries in MyISAM and Tokutek (TokuDB)

This is the next post in the series:
Analyzing air traffic performance with InfoBright and MonetDB
Air traffic queries in LucidDB
Air traffic queries in InfiniDB: early alpha

Let me explain the reason for choosing these engines. After the initial three posts I was often asked, “What is the baseline? Can we compare results with standard MySQL engines?”. So here comes MyISAM, to serve as the base point and show how much better the column-oriented analytic engines are here.

However, take into account that for MyISAM we need to choose proper indexes to execute queries effectively, and indexes bring pain: loading data gets slower, and designing proper indexes is additional research, especially when the MySQL optimizer is not smart about picking the best one.

The really nice thing about MonetDB, InfoBright and InfiniDB is that they do not need indexes, so you do not have to worry about maintaining them and picking the best ones. I am not sure about LucidDB; I was told indexes are needed, but creating a new index was really fast even on the full database, so I guess they are not B-Tree indexes. This reflection on indexes turned me in the TokuDB direction.

What is so special about TokuDB? There are two things: indexes have a special structure and are “cheap”; by “cheap” I mean the maintenance cost is constant and independent of data size. With regular B-Tree indexes the cost grows exponentially with data size (Bradley Kuszmaul from Tokutek will correct me if I am wrong in this statement). Another point with TokuDB is that it uses compression, so I expect a smaller size of loaded data and fewer IO operations during query execution.

So, what indexes do we need for the queries? To recall the details, the schema is available in this post:
http://www.mysqlperformanceblog.com/2009/10/02/analyzing-air-traffic-performance-with-infobright-and-monetdb/
and the queries are posted on the “Queries” sheet of my summary spreadsheet.

With Bradley’s help, we chose the following indexes:

KEY `Year` (`Year`,`Month`),
KEY `Year_2` (`Year`,`DayOfWeek`),
KEY `DayOfWeek` (`DayOfWeek`,`Year`,`DepDelay`),
KEY `DestCityName` (`DestCityName`,`OriginCityName`,`Year`),
KEY `Year_3` (`Year`,`DestCityName`,`OriginCityName`),
KEY `Year_4` (`Year`,`Carrier`,`DepDelay`),
KEY `Origin` (`Origin`,`Year`,`DepDelay`)

And I measured the load time for both MyISAM and TokuDB into an empty table with the indexes already created.

Load time for MyISAM: 16608 sec
For TokuDB: 19131 sec

Datasize (including indexes)

MyISAM: 36.7GB
TokuDB: 6.7GB

I am a bit surprised that TokuDB is slower at loading data, but my guess is that it is related to compression, and I expect that with a bigger amount of data TokuDB will be faster than MyISAM.

Now to the queries. Bradley pointed out to me that query Q5:

SELECT t.carrier, c, c2, c*1000/c2 AS c3
FROM (SELECT carrier, count(*) AS c
      FROM ontime WHERE DepDelay>10 AND Year=2007 GROUP BY carrier) t
JOIN (SELECT carrier, count(*) AS c2
      FROM ontime WHERE Year=2007 GROUP BY carrier) t2
  ON (t.Carrier=t2.Carrier)
ORDER BY c3;

can be rewritten as (I name it Query Q5i):

SELECT carrier, totalflights, ndelayed, ndelayed*1000/totalflights AS c3
FROM (SELECT carrier, count(*) AS totalflights,
             sum(if(depdelay>10,1,0)) AS ndelayed
      FROM ontime WHERE year=2007 GROUP BY carrier) t
ORDER BY c3 DESC;

The summary table with query execution times (in sec, less is better):

Query MyISAM TokuDB
Q0 72.84 50.25
Q1 61.03 55.01
Q2 98.12 58.36
Q3 123.04 66.87
Q4 6.92 6.91
Q5 13.61 11.86
Q5i 7.68 6.96
Q6 123.84 69.03
Q7 187.22 159.62
Q8 (1y) 8.75 7.59
Q8 (2y) 102.17 64.95
Q8 (3y) 104.7 69.76
Q8 (4y) 107.05 70.46
Q8 (10y) 119.54 84.64
Q9 69.05 47.67

For reference I used 5.1.36-Tokutek-2.1.0 for both MyISAM and TokuDB tests.

And if you are interested in comparing MyISAM with the previous engines:

Query MyISAM MonetDB InfoBright LucidDB InfiniDB
Q0 72.84 29.9 4.19 103.21 NA
Q1 61.03 7.9 12.13 49.17 6.79
Q2 98.12 0.9 6.73 27.13 4.59
Q3 123.04 1.7 7.29 27.66 4.96
Q4 6.92 0.27 0.99 2.34 0.75
Q5 13.61 0.5 2.92 7.35 NA
Q6 123.84 12.5 21.83 78.42 NA
Q7 187.22 27.9 8.59 106.37 NA
Q8 (1y) 8.75 0.55 1.74 6.76 8.13
Q8 (2y) 102.17 1.1 3.68 28.82 16.54
Q8 (3y) 104.7 1.69 5.44 35.37 24.46
Q8 (4y) 107.05 2.12 7.22 41.66 32.49
Q8 (10y) 119.54 29.14 17.42 72.67 70.35
Q9 69.05 6.3 0.31 76.12 9.54

All the results are available in the summary spreadsheet.

I deliberately did not put TokuDB in the same table as the analytics-oriented databases, to highlight that TokuDB is a general-purpose OLTP engine.
As you can see, it does better than MyISAM in all queries.


Entry posted by Vadim | 25 comments


Nov 05, 2009

New developers training course is almost ready

We’ve been busy expanding our training curriculum to include training for developers building applications with MySQL.  We have reached the point where we’re ready for a pilot teach – and it brings me great pleasure to announce that we’re opening it up for blog readers to attend, free of charge.

The details:
San Francisco
4th December
9:30AM – 5PM

Spaces are limited, so to give everyone a fair chance we’re delaying registration to open at noon tomorrow (Friday) Pacific Time. It’s strictly first in first served, so be quick!  The registration link is here.


Entry posted by Morgan Tocker | One comment

