MySQL Workshops: Chicago & London this April

Percona’s Kenny Gryp leads a recent MySQL workshop

Percona will be in Chicago and London the week of April 8th delivering two 2-day MySQL workshops. For our MPB readers, we are offering a 15% discount. Just use MPB15A when purchasing your tickets to one or both MySQL workshops.

  • Scaling and Optimization for MySQL on Monday & Tuesday, April 8 & 9
    • This MySQL workshop will cover:
      • Apply architecture scaling strategies that make the most sense, including caching, message queuing, database sharding, and replication.
      • Optimize MySQL schema and indexes to support efficient queries.
      • Instrument applications to measure performance.
      • Use successful patterns behind MySQL applications, and avoid design decisions that don’t fit MySQL very well.
      • Solve some common application performance problems like paginating results and full-text search.
      • Understand when to leverage MySQL features such as stored procedures, triggers and events.


  • Operating and Troubleshooting for MySQL on Wednesday & Thursday, April 10 & 11
    • This MySQL workshop will cover:
      • Understand MySQL versions and forks—including which versions and editions you should run.
      • Understand how various system components could be a bottleneck for MySQL: CPU, disks, memory, network; and how to use diagnostic tools in Linux.
      • Investigate database trends via the counters exposed in MySQL’s SHOW GLOBAL STATUS.
      • Backup & restore MySQL and InnoDB data, and describe the advantages of common backup methods (mysqldump, file system snapshot, InnoDB hot backup).
      • Setup and administer MySQL replication.
      • Monitor MySQL servers in production.
      • Discuss security issues with MySQL.
      • Solve via case-study some real-life problems that Percona consultants faced in MySQL.
      • Understand the basic operation of InnoDB—including how data is stored on disk and in memory.
      • Understand what InnoDB does internally when data is written, and how features like the transaction log files and undo space work.
      • See how key features of InnoDB work, including clustered indexes, insert buffer, adaptive hash and double write buffer.
      • Learn to read diagnostic information from InnoDB and make informed decisions while changing configuration.
      • Discover how Percona Server with XtraDB can provide better diagnostics and performance.

If you wish to speak to our Sales team about either of these MySQL Workshops, please contact us.

The post MySQL Workshops: Chicago & London this April appeared first on MySQL Performance Blog.


Aligning IO on a hard disk RAID – the Theory

Now that flash storage is becoming more popular, the question of IO alignment keeps popping up more often than it used to when all we had were rotating hard disk drives. I think the reason is very simple – when a system had only a single hard disk drive (HDD), or a mirrored pair as in RAID1, you couldn’t really have misaligned IO, because HDDs operate in 512-byte sectors and that is also the smallest amount of disk IO a system can do. NAND flash, on the other hand, can have a page size of 512 bytes, 2KB or 4KB (and often you don’t know which size it really is), so the IO alignment question becomes more relevant.

It was, and still is, relevant with HDD RAID storage – a technology we have been using for many years – whenever there’s striping, as in RAID0, 5, 6 or any variation of them (5+0, 1+0, 1+0+0, etc.). While IO inside the RAID is perfectly aligned to disk sectors (again because operations are done in multiples of 512 bytes), outside of the RAID you want to align IO to a stripe element, as you may otherwise end up reading or writing to more disks than you would like. I decided to do some benchmarks on a hard disk array and see when this matters and whether it matters at all.

In this article, however, I will focus on the process of alignment; if you’re curious about benchmark results, here they are.

What is IO alignment

I would like to start with some background on IO alignment. So what is IO alignment, and what does misaligned IO look like? Here is one example:


In this case the RAID controller is using a 32KB stripe unit, which can fit two standard InnoDB pages (16KB in size) as long as they are aligned properly. In the first case, when reading or writing a single InnoDB page, the RAID will only read from or write to a single disk because of the alignment to a stripe unit. In the second example, however, every other page spans two disks, so there are going to be twice as many operations to read or write those pages, which could mean more waiting in some cases and more work for the RAID controller for the same operation. In practice, stripes are bigger by default – I would often see a 64KB (the mdadm default chunk size) or 128KB stripe unit size, so in those cases fewer pages would span multiple disks and the effects of misalignment would be less significant.
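The effect above is easy to sketch with a bit of shell arithmetic. The numbers here are taken from the example (32KB stripe unit, 16KB InnoDB pages); the 8KB file offset is a hypothetical misalignment that produces the "every other page spans two disks" pattern:

```shell
# With a 32KB stripe unit, 16KB pages starting at offset 0 never cross a
# stripe boundary; a file that starts 8KB into a stripe has every other
# page straddling two stripe elements (and therefore two disks).
stripe=$((32 * 1024))
page=$((16 * 1024))
for start in 0 8192; do
  crossings=0
  for i in 0 1 2 3; do
    begin=$((start + i * page))
    end=$((begin + page - 1))
    # a page crosses a boundary when its first and last byte land in
    # different stripe elements
    [ $((begin / stripe)) -ne $((end / stripe)) ] && crossings=$((crossings + 1))
  done
  echo "file offset $start: $crossings of 4 pages cross a stripe boundary"
done
```

Running it prints 0 crossings for the aligned start and 2 of 4 for the misaligned one.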

Here’s another example of misalignment, described in SGI xfs training slides:


D stands for disk here, so the RAID has 4 bearing disks (spindles), and if there’s a misalignment on the file system, you can see how the RAID ends up doing 5 IO operations – two to D4 and one to each of the other three disks – instead of just doing one IO to each disk. In this case, even though this is a single IO request from the OS, it’s guaranteed to be slower for both reading and writing.

So, how do we avoid misalignment? We must ensure alignment at each layer of the stack. Here’s what a typical stack looks like:


Let’s talk about each of them briefly:

InnoDB page

You don’t need to do anything to align InnoDB pages – the file system takes care of it (assuming you configure the file system correctly). I would, however, mention a couple of things about InnoDB storage: first, in Percona Server you can now customize the page size, and it may be a good idea to check that the page size is no bigger than a stripe element; second, logs are actually written in 512-byte units (in Percona Server 5.1 and 5.5 you can customize this), while here I will be talking about InnoDB data pages, which are 16KB in size.

File system

The file system plays a very important role here – it maps a file’s logical addresses to physical addresses (at a certain level), so when writing a file, the file system decides how to distribute the writes so they make the best use of the underlying storage; it also makes sure the file starts in a proper position with respect to stripe size. The size of logical IO units is also up to the file system.

The goal is to write and read as little as possible. If you are mostly going to be writing small (say, 500-byte) files, it’s best to use 512-byte blocks; for bigger files, 4k may make more sense (you can’t use blocks bigger than 4k – the page size – on Linux unless you are using HugePages). Some file systems let you set the stripe width and stripe unit size so they can do proper alignment based on that. Mind, however, that different file systems (and different versions of them) might use different units for these options, so you should refer to the manual on your system to be sure you’re doing the right thing.

Say we have a 6-disk RAID5 (so 5 bearing disks) with a 64k stripe unit size and a 4k file system block size; here’s how we would create the file system:

xfs      - mkfs.xfs -b 4k -d su=64k,sw=5 /dev/ice (alternatively you can specify sunit=X,swidth=Y as options when mounting the device)
ext2/3/4 - mke2fs -b 4096 -E stride=16,stripe-width=80 /dev/ice (some older versions of ext2/3 do not support stripe-width)
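To make the ext2/3/4 numbers above less magic, here is where the 16 and 80 come from: stride is the stripe element expressed in file system blocks, and stripe-width is stride multiplied by the number of data-bearing disks:

```shell
# Derive mke2fs -E stride=...,stripe-width=... from the RAID geometry used
# in the example: 6-disk RAID5 (5 data disks), 64k stripe unit, 4k blocks.
block=4096
stripe_unit=$((64 * 1024))
data_disks=5
stride=$((stripe_unit / block))            # stripe element in fs blocks
stripe_width=$((stride * data_disks))      # full data stripe in fs blocks
echo "stride=$stride stripe-width=$stripe_width"
```

This prints `stride=16 stripe-width=80`, matching the mke2fs invocation above.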

You should be all set with the file system alignment at this point. Let’s get down one more level:


LVM

If you are using LVM, you want to make sure it does not introduce misalignment. On the other hand, it can be used to fix misalignment introduced by the partition table. On the system I have been benchmarking, the defaults worked out just fine because I was using a rather small 64k stripe element. Had I used 128k or 256k RAID stripe elements, I would have ended up with an LVM physical extent starting somewhere in the middle of a stripe element, which would in turn break the file system alignment.

You can only set alignment options early in the process, when using pvcreate to initialize a disk for LVM use; the two options you are interested in are --dataalignment and --dataalignmentoffset. If you have set the offset correctly when creating partitions (see below), you don’t need to use --dataalignmentoffset; otherwise, with this option you can shift the beginning of the data area to the start of the next stripe element. --dataalignment should be set to the size of the stripe element – that way the start of a Physical Extent will always align with the start of a stripe element.

In addition to setting correct options for pvcreate, it is also a good idea to use an appropriate Volume Group Physical Extent Size for vgcreate – I think the default of 4MB should be good enough for most cases; when changing it, however, I would try not to make it smaller than the stripe element size.

To give you a slightly more interesting alignment example, let’s assume we have a RAID with a 256k stripe element size and a misalignment in the partition table – partition /dev/sdb1 starts 11 sectors past the start of a stripe element (reminder: 1 sector = 512 bytes). Now we want to get to the beginning of the next stripe element, i.e. the 256th kbyte, so we need to offset the start by 501 sectors and set proper alignment:

pvcreate --dataalignmentoffset 501s --dataalignment 256k /dev/sdb1
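In case the 501 looks arbitrary, here is the arithmetic behind it, using the numbers from the example (256k stripe element, 11 sectors of misalignment):

```shell
# Sectors needed to shift the LVM data area to the next stripe element
# boundary: stripe element size in sectors minus the existing misalignment.
sector=512
stripe=$((256 * 1024))
misalign_sectors=11
stripe_sectors=$((stripe / sector))            # 512 sectors per stripe element
offset=$((stripe_sectors - misalign_sectors))  # distance to the next boundary
echo "--dataalignmentoffset ${offset}s"
```

This prints `--dataalignmentoffset 501s`, the value used in the pvcreate call above.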

You can check where physical extents will start (or check your current setup) using pvs -o +pe_start. Now let’s move down one more level.

Partition table

This is the most frustrating part of IO alignment, and I think the reason people get frustrated with it is that by default fdisk uses “cylinders” as units instead of sectors. Moreover, some “older” systems like RHEL5 would actually align to “cylinders” and leave the first “cylinder” blank. This comes from older times when disks were really small and they were actual physical disks. The drive geometry displayed here is not real – this RAID does not really have 255 heads and 63 sectors per track:

db2# fdisk -l

Disk /dev/sda: 1198.0 GB, 1197998080000 bytes
255 heads, 63 sectors/track, 145648 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1        1216     9764864   83  Linux
/dev/sda2            1216        1738     4194304   82  Linux swap / Solaris
Partition 2 does not end on cylinder boundary.
/dev/sda3            1738      145649  1155959808   83  Linux
Partition 3 does not end on cylinder boundary.

So it makes a lot more sense to use sectors with fdisk these days, which you can do with -u when invoking it or with “u” when working in interactive mode:

db2# fdisk -ul

Disk /dev/sda: 1198.0 GB, 1197998080000 bytes
255 heads, 63 sectors/track, 145648 cylinders, total 2339840000 sectors
Units = sectors of 1 * 512 = 512 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *        2048    19531775     9764864   83  Linux
/dev/sda2        19531776    27920383     4194304   82  Linux swap / Solaris
Partition 2 does not end on cylinder boundary.
/dev/sda3        27920384  2339839999  1155959808   83  Linux
Partition 3 does not end on cylinder boundary.

The rest of the task is easy – you just have to make sure that the Start sector divides by the number of sectors in a stripe element without a remainder. Let’s check if /dev/sda3 aligns to a 1MB stripe element. 1MB is 2048 sectors, and dividing 27920384 by 2048 we get 13633 with no remainder, so it does align to a 1MB boundary.
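The check above as a one-liner, using the start sector of /dev/sda3 from the fdisk output:

```shell
# A start sector is aligned when it divides by the sectors-per-stripe-element
# without a remainder; 0 means aligned.
start=27920384
stripe_sectors=$((1024 * 1024 / 512))   # 1MB stripe element = 2048 sectors
echo $((start % stripe_sectors))
```

This prints `0`, confirming the 1MB alignment.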

Recent systems like RHEL6 (not verified) and Ubuntu 10.04 (verified) align to 1MB by default if the storage does not supply IO alignment hints, which is good enough for most cases; however, here’s what I got on Ubuntu 8.04 using the defaults (you would get the same on RHEL5 and many other systems):

db1# fdisk -ul

Disk /dev/sda: 1197.9 GB, 1197998080000 bytes
255 heads, 63 sectors/track, 145648 cylinders, total 2339840000 sectors
Units = sectors of 1 * 512 = 512 bytes
Disk identifier: 0x00091218

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *          63    19535039     9767488+  83  Linux
/dev/sda2        19535040    27342629     3903795   82  Linux swap / Solaris
/dev/sda3        27342630  2339835119  1156246245   8e  Linux LVM

sda1 and sda3 do not even align to 1k. sda2 aligns to 32k, but the RAID controller actually has a 64k stripe, so all IO on this system is unaligned (unless compensated for by LVM, see above). So on such a system, when creating partitions with fdisk, don’t use the default value for the start sector; instead, use the next number that divides by the number of sectors in a stripe element without a remainder, and make sure you’re using sectors as units to simplify the math.
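Picking that "next number" is just rounding up to a multiple. For the Ubuntu 8.04 output above, with the controller's 64k stripe element, the default start sector 63 would be rounded up like this:

```shell
# Round a proposed start sector up to the next stripe element boundary.
default_start=63
stripe_sectors=$((64 * 1024 / 512))     # 64k stripe element = 128 sectors
aligned=$(( (default_start + stripe_sectors - 1) / stripe_sectors * stripe_sectors ))
echo "use start sector $aligned instead of $default_start"
```

This suggests starting the first partition at sector 128 rather than fdisk's default of 63.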

Besides the DOS partition table, which you would typically work with using fdisk (or cfdisk, or sfdisk), there’s also a more modern GUID partition table (GPT). The typical tool for working with GPT is parted. If you are already running GPT on your system and want to check if it’s aligned, here’s a command for you:

db2# parted /dev/sda unit s print
Model: LSI MegaRAID 8704EM2 (scsi)
Disk /dev/sda: 2339840000s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start      End          Size         Type     File system     Flags
 1      2048s      19531775s    19529728s    primary  ext4            boot
 2      19531776s  27920383s    8388608s     primary  linux-swap(v1)
 3      27920384s  2339839999s  2311919616s  primary

This is the same output we saw from fdisk earlier. Again, you want to look at the Start sector and make sure it divides by the size of a stripe element without a remainder.

Lastly, if you are not working on a system boot disk, you may not need a partition table at all – you can just use the whole raw /dev/sdb and either format it with mkfs directly or add it as an LVM physical volume. This lets you avoid any mistakes when working on the partition table.

RAID stripe

Further down the storage stack there’s a group of RAID stripe units (elements), sometimes referred to as a stripe, though most tools refer to it as the stripe width. The RAID level, the number of disks and the size of a stripe element determine the stripe width. In the case of RAID1 and JBOD there’s no striping; with RAID0 the number of bearing disks is the actual number of disks (N), with RAID1+0 (RAID10) it’s N/2, with RAID5 it’s N-1 (single parity), and with RAID6 it’s N-2 (double parity). You want to know these numbers when setting file system parameters, but once the RAID is configured, there’s nothing more you can do about them – you just need to know them.
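The bearing-disk rules above, collected into a small helper (the function name is just for illustration):

```shell
# Number of data-bearing disks for common RAID levels, given the total
# number of disks N; this is the multiplier for stripe-width calculations.
data_disks() {
  level=$1; n=$2
  case $level in
    0)  echo "$n" ;;              # RAID0: all disks carry data
    10) echo $((n / 2)) ;;        # RAID1+0: half the disks are mirrors
    5)  echo $((n - 1)) ;;        # RAID5: one disk's worth of parity
    6)  echo $((n - 2)) ;;        # RAID6: two disks' worth of parity
  esac
}
echo "RAID5 with 6 disks: $(data_disks 5 6) bearing disks"
echo "RAID10 with 8 disks: $(data_disks 10 8) bearing disks"
```

For the 6-disk RAID5 used in the file system example earlier, this gives the 5 bearing disks that went into stripe-width=80.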

The stripe unit size is the amount of data written to a single disk before skipping to the next disk in the array. This is also one of the options you usually have to decide on very early, when configuring the RAID.

Disk sectors

Most, if not all, hard disk drives available on the market these days use 512-byte sectors, so most of the time, if not always, you don’t have to care about alignment at this level, and neither do RAID controllers, as they also operate on 512 bytes internally. This gets more complicated with SSD drives, which often operate in 4-kbyte units, though that is surely a topic for separate research.


While it may seem there are many moving parts between the database and the actual disks, it’s not really that difficult to get proper alignment if you’re careful when configuring all of the layers. You don’t always have a fully transparent system, however – for example, in the cloud you don’t really know if the data is properly aligned underneath: you don’t know whether you should be using an offset, what the stripe size is, or how many stripe elements there are. It’s easy to check if you’re aligned – run a benchmark with an offset and compare it to a baseline – but it’s much harder to figure out the proper alignment options if you are not aligned.

Now it may be interesting to see what the real-life effects of misalignment are; my benchmark results are in the second part.


Aspersa tools download shortcuts

I use the Aspersa tools a lot, and I find myself going to the website just to download one of the tools all the time. I love that I can download Maatkit with a simple wget command, so I made shortcuts for all of the current Aspersa tools. Here’s the full list, with my favorite on top and my least favorite (but nonetheless very useful) at the bottom:

So now it’s as simple as a wget. I hope you will find these useful too.


Update: I made an error with the align tool, which now links to the root directory of a trunk. I don’t think this is a big deal, as I use align least often, but to fix it I created even shorter shortcuts for all the tools, so you can also use wget.


Caching could be the last thing you want to do

I recently had a run-in with a very popular PHP ecommerce package which makes me want to voice a recurring mistake I see in how many web applications are architected.

What is that mistake?

The ecommerce package I was working with depended on caching.  Out of the box it couldn’t serve 10 pages/second unless I enabled some features which were designed to be “optional” (but clearly they weren’t).

I think with great tools like memcached it is easy to get carried away and use it as the mallet for every performance problem, but in many cases it should not be your first choice.  Here is why:

  • Caching might not work for all visitors – You look at a page and it loads fast. But is it the same for every user? Caching can sometimes be an optimization that gives the average user a faster experience, but in reality you should care more that all users get a good experience (Peter explains why here, talking about six sigma). In practice it can often be the same user who hits all the cache misses, which can make this problem even worse.
  • Caching can reduce visibility – You look at a performance profile of what takes the most time for a page to load and start applying optimizations. The problem is that the profile you are looking at may skew what you should really be optimizing. The real need (thinking six sigma again) is to know what the miss path costs, but it is somewhat hidden.
  • Cache management is really hard – Have you planned for cache stampedes, or for many cache items being invalidated at the same time?

What alternative approach should be taken?

Caching should be seen more as a burden that many applications just can’t live without.  You don’t want that burden until you have exhausted all other easily reachable optimizations.

What other optimizations are possible?

Before implementing caching, here is a non-exhaustive checklist to run through:

  • Do you understand every execution plan of every query? If you don’t, set long_query_time=0 and use mk-query-digest to capture queries.  Run them through MySQL’s EXPLAIN command.
  • Do your queries SELECT *, only to use a subset of columns?  Or do you extract many rows, only to use a subset? If so, you are extracting too much data, and (potentially) ruling out further optimizations like covering indexes.
  • Do you have information about how many queries were required to generate each page? Or more specifically do you know that each one of those queries is required, and that none of those queries could potentially be eliminated or merged?

I believe this post can be summed up as “Optimization rarely decreases complexity. Avoid adding complexity by only optimizing what is necessary to meet your goals.”  – a quote from Justin’s slides on instrumentation-for-php.  In terms of future-proofing design, many applications are better off keeping it simple and (at least initially) refusing the temptation to try and solve some problems “like the big guys do”.

Entry posted by Morgan Tocker |



Too many connections? No problem!

Did that ever happen to you in production?


  [percona@sandbox msb_5_0_87]$ ./use
  ERROR 1040 (00000): Too many connections

Just happened to one of our customers. Want to know what we did?

For demo purposes I’ll use a sandbox here (so ./use is actually executing the mysql CLI). Oh, and mind that this is not a general-purpose best practice, but rather a break-and-enter hack for when the server is flooded. So, when this happens in production, the problem is: how do you quickly regain access to the mysql server to see what all the sessions are doing, and how do you do that without restarting the application? Here’s the trick:


  [percona@sandbox msb_5_0_87]$ gdb -p $(cat data/ \
                                       -ex "set max_connections=5000" -batch
  [Thread debugging using libthread_db enabled]
  [New Thread 0x2ad3fe33b5c0 (LWP 1809)]
  [New Thread 0x4ed19940 (LWP 27302)]
  [New Thread 0x41a8b940 (LWP 27203)]
  [New Thread 0x42ec5940 (LWP 1813)]
  [New Thread 0x424c4940 (LWP 1812)]
  0x00000035f36cc4c2 in select () from /lib64/

And here’s the result:


  [percona@test9 msb_5_0_87]$ ./use
  Welcome to the MySQL monitor.  Commands end with ; or \g.
  Your MySQL connection id is 8
  Server version: 5.0.87-percona-highperf-log MySQL Percona High Performance Edition, Revision 61 (GPL)

  Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

  mysql [localhost] {msandbox} ((none))> select @@global.max_connections;
  +--------------------------+
  | @@global.max_connections |
  +--------------------------+
  |                     5000 |
  +--------------------------+
  1 row in set (0.00 sec)

Credit for the gdb magic goes to Domas.

Few notes:

  • You would usually have one connection reserved for a SUPER user, but that does not help if your application connects as a SUPER user (which is a bad idea anyway).
  • This worked for me on 5.0.87-percona-highperf, but use it at your own risk and better test it before you actually have to do it in production.
  • This example assumes you had less than 5000 max_connections configured ;)

Entry posted by Aurimas Mikalauskas |



Looking at Redis

Recently I had a chance to take a look at the Redis project, which is a semi-persistent in-memory database with an idea somewhat similar to memcached, but with a richer feature set.

Redis has a simple single-process, event-driven design, which means it does not have to deal with any locks, which are a performance killer for a lot of applications. This, however, limits its scalability to a single core. Still, at 100K+ operations a second, this single-core performance will be good enough for many applications. Also, nothing stops you from running many Redis instances on a single server to take advantage of multiple cores.

I call Redis semi-persistent because it does not store data on disk immediately, but rather dumps its entire database every so often – you have a choice of configuring the time and number of updates between database dumps. Because the dump is basically a serial write, Redis does not need an expensive IO subsystem. Also, because this dump happens in the background, it does not affect read/write performance of the database, which is in memory. In the tests I’ve done, I’ve seen Redis writing some 4MB/sec for probably 50% of the test duration, whereas InnoDB had to write 50MB/sec for about a third of that throughput, doing a lot of random IO as it did so. This is, among other things, because InnoDB has to flush full 16K pages when flushing.

The background flush in Redis is designed the following way – the Redis process forks and just dumps the database it has in the background; Unix copy-on-write takes care of providing another copy of pages as they are modified. This keeps the overhead rather low. The database is dumped to a temporary file which is renamed only after an fsync, which means that if you crash during the dump, you simply discard the partial file.

I also liked the full pipelining support in the protocol – you can send multiple commands at once, any commands, and the Redis server will process them in order and return the results to you. This not only allows for multi-get and multi-set, but for any batch of commands to be submitted. The API support for this is relatively weak, but the feature is there.

When it comes to data types, Redis supports simple key-value storage just as memcached does, but it also adds support for lists, which are similar to a linked list data type, as well as sets, which store sets of strings with support for various set operations.

I kind of miss support for something like an associative array/hash/map in the data types, but I guess there is nothing in the architecture that would stop it from being added later.

Redis also has support for “databases”, which are basically key spaces inside the server. This should allow the server to be used by different applications, different versions, testing, or some other purposes. You’ve got to be careful with these, though – there is only simple per-instance password-based authentication in place, so if an application can talk to the instance, it can access all the databases.

I am also a bit surprised that databases are numbered instead of named. Naming the databases would make it simpler to avoid unwanted conflicts, etc.

Redis also supports master/slave replication out of the box, and it is extremely simple. You just specify which node to replicate from, and that is it. It is even simpler than with MySQL, as you do not need to deal with a snapshot or a binary log position. Replication is asynchronous and low-overhead – Redis will perform a database dump and keep the commands issued since the start of the process; the slave gets the data (which is basically the set of commands needed to populate the database itself) and then receives updates from the master as they come. I did not benchmark the replication capacity, but I’d expect it to be close to the 100K writes/sec a single instance can handle.

The benchmarks I’ve done were for an application which is very update-intensive, with updates being pretty much random single-row updates which are hard to batch. With MySQL/InnoDB, I got the server to handle some 30,000 updates/sec on a 16-core server, with replication able to handle 10,000 updates/sec. This was using about 5 cores, so you could probably run 4 MySQL instances on this server and get up to 100K updates/sec, with up to 40K updates/sec able to replicate.

With Redis I got about 3 times more updates/sec – close to 100,000 updates/sec, with about 1.5 cores being used. I have not tried running multiple instances, and I’m not sure the network and TCP stack would scale linearly in this case, but in any event we’re speaking about hundreds of thousands of updates/sec.

I think Redis can be a great piece of architecture for a number of applications. You can use it as a database or as a cache (it supports data expiration too).

I have not benchmarked it against memcached in terms of performance and memory usage. That may be another project to look at.

Entry posted by peter |


