Percona Live NYC Schedule Posted

The first Percona Live conference in San Francisco in February was a huge success, and we’re planning to run several each year. The next one is May 26th in New York City. We’ve just confirmed a number of sought-after speakers and posted the schedule online. We are in the process of confirming sessions from a number of other top industry experts now. The early-bird special pricing is in effect for another week or so — don’t delay! And if you’re interested in sponsoring the show, act now — the deadline for many of the best opportunities is only a week away.


InnoDB page sizes: plans and ideas

It is a well-known fact that InnoDB's standard page size is 16K (16384 bytes). Some time ago we added a feature to Percona Server to change that to 4K or 8K (innodb_page_size). That may be useful on SSDs, which operate with 4K blocks, and you may get much better throughput with 4K pages (however, we still need to run benchmarks to see the real effect).
The bigger limitation is that this setting applies to the whole database instance, so to change it you need to mysqldump your data, re-create the database, and reload the data.
It would be much better if we were able to set it per table, or even per index; this is what we want to look into in the near future.
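Since the setting is instance-wide, changing it means the full dump-and-reload cycle described above; roughly like this (a sketch only; paths, init scripts, and defaults will differ on your system):

mysqldump --all-databases --routines --triggers > /backup/all.sql   # dump everything first
/etc/init.d/mysql stop
mv /var/lib/mysql /var/lib/mysql-16k    # keep the old 16K datadir aside
# set the new page size in the [mysqld] section of my.cnf:
#   innodb_page_size = 4096
mysql_install_db --user=mysql           # initialize a fresh datadir
/etc/init.d/mysql start
mysql < /backup/all.sql                 # reload all the data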

Another area for improvement: supporting 32K and 64K pages. Why is this needed? For InnoDB compression.
Right now, if you use 16K pages and try to compress, the only good choice is to compress to 8K pages (which gives you only a 2:1 compression ratio); you will rarely be able to compress to 4K.
However, I think with 32K and 64K pages we should be able to compress to 8K or 16K easily and get a 4:1 compression ratio, as we will be operating with a bigger block size that gives more opportunities to compress (of course, it is workload-dependent). This is also in our plans.
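For illustration, this is what today's 16K-page compression choice looks like (a sketch; the table name is hypothetical, and in 5.1 this needs innodb_file_per_table=1 and innodb_file_format=Barracuda):

mysql -e "CREATE TABLE t_comp (id INT PRIMARY KEY, payload TEXT)
          ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8" test

With a hypothetical 64K page, the same table could use KEY_BLOCK_SIZE=16 and reach the 4:1 ratio discussed above.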

Having 4K-64K block sizes, with the ability to set the parameter per table and per index, would make InnoDB really flexible for different types of use cases and storage. As Oracle is adding a lot of good new features in MySQL 5.6, I would like to see this implemented there as well.


Reality of Innodb Caching

I have mentioned a few times that InnoDB caches data in pages, so even if your working set consists of relatively few rows, the working set in terms of pages can be rather large. Now I have run a little benchmark to show this in practice. I'm using the standard “sbtest” table with 10 million rows and a data file of 2,247,098,368 bytes, which gives us 224 bytes of gross storage per row, including all overhead. The actual row size in this table is smaller, but let's use this number for our math. For the benchmark, I select a set number of random IDs repeatedly in random order, which illustrates a data set with some randomly distributed “hot” rows. I read every row in the set once before timing, so when there is enough memory to cache every row there should be no disk reads during the benchmark run itself.
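The harness itself is not included here; a minimal sketch of the idea, assuming the sysbench table lives in database sbtest:

N=12800
mysql -Bse "SELECT id FROM sbtest ORDER BY RAND() LIMIT $N" sbtest > ids.txt

# warm-up pass: read every chosen row once, outside the timed run
awk '{print "SELECT c FROM sbtest WHERE id=" $1 ";"}' ids.txt | mysql -Bs sbtest > /dev/null

# timed pass: 100000 lookups over the same IDs in random order, one connection
time ( shuf -r -n 100000 ids.txt \
  | awk '{print "SELECT c FROM sbtest WHERE id=" $1 ";"}' \
  | mysql -Bs sbtest > /dev/null )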

I'm using a 128M buffer pool for this test, which should fit roughly 500K rows of 224 bytes each. Let's see what the benchmark really shows:

Testing 100 out of 10000000 keys  24.79 seconds,  4034.53 lookups per second
Testing 200 out of 10000000 keys  25.66 seconds,  3896.96 lookups per second
Testing 400 out of 10000000 keys  25.03 seconds,  3995.65 lookups per second
Testing 800 out of 10000000 keys  24.40 seconds,  4097.73 lookups per second
Testing 1600 out of 10000000 keys  24.83 seconds,  4026.59 lookups per second
Testing 3200 out of 10000000 keys  25.47 seconds,  3926.65 lookups per second
Testing 6400 out of 10000000 keys  26.51 seconds,  3771.73 lookups per second
Testing 12800 out of 10000000 keys  386.20 seconds,  258.94 lookups per second
Testing 25600 out of 10000000 keys  640.12 seconds,  156.22 lookups per second
Testing 51200 out of 10000000 keys  775.38 seconds,  128.97 lookups per second
Testing 102400 out of 10000000 keys  841.65 seconds,  118.81 lookups per second

As we can see, in this case the database can really fit only somewhere between 6,400 and 12,800 distinct rows, which is about 1/50 of the “projected size”. This number is very close to what I would have estimated: with 224 bytes per row, we have some 70 rows per 16K page, so with a random distribution you would expect up to 70 times more data to be fetched into the database than you actually need.
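The same estimate in shell arithmetic:

echo $(( 16384 / 224 ))                # ~73 rows fit in one 16K page
echo $(( 128 * 1024 * 1024 / 16384 ))  # a 128M buffer pool holds 8192 pages
# with one "hot" row per page, ~8K distinct rows fill the pool,
# which matches the cliff observed between 6400 and 12800 keys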

I'm wondering whether any other storage engine can show better results in such a benchmark. Falcon, with its planned row cache, might have fared better, and I would also expect better results from PBXT. I should also check the smaller page sizes available in Percona Server; my expectation is that with a 4K page size I can fit about 4x more distinct rows in my cache.


Percona XtraBackup 1.6 for Windows “try me” edition

Previously we had XtraBackup Windows binaries for the 1.0 and 1.2 releases, and they were based on a Cygwin compilation of MySQL.

After MySQL 5.5 moved to CMake, Cygwin compilation was no longer supported, so it took some effort for us to come up with native Windows builds, but they are now available. We consider them “alpha” quality, as there are some known limitations, but you should be able to make a backup and restore from it. If you are interested in Windows backups, I encourage you to give it a try and report your experience. Binaries are available at http://www.percona.com/downloads/XtraBackup/XtraBackup-1.6/Windows-alpha/


Drop table performance

There have been recent discussions about DROP TABLE performance in InnoDB. (You can refer to Peter’s post http://www.mysqlperformanceblog.com/2011/02/03/performance-problem-with-innodb-and-drop-table/ and these bug reports: http://bugs.mysql.com/bug.php?id=51325 and http://bugs.mysql.com/bug.php?id=56332.) It may not sound that serious, but if your workload often uses DROP TABLE and you have a big buffer pool, it may be a significant issue. This can get especially painful, as during this operation InnoDB holds the LOCK_open mutex, which prevents other queries from executing. So, this is a problem for a server with a large amount of memory, like the one we have in our lab: a Cisco UCS C250 with 340GB of RAM.

To fix this problem, we implemented “background table drop”, which is available in Percona Server 5.1.56 and Percona Server 5.5.10. Also, looking at the MySQL 5.6.2 announcement, the InnoDB team has implemented “cleaning thread”, which is also supposed to fix this problem.
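For reference, enabling the feature in Percona Server comes down to one variable; a minimal sketch (the variable name comes from this post; check the documentation of your exact version for whether it can also be set at runtime):

# add to the [mysqld] section of my.cnf and restart:
#   innodb_lazy_drop_table = 1
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_lazy_drop_table'"   # verify it is ON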

I ran a small benchmark to see what improvement we have. Hardware was the Cisco UCS C250, the InnoDB buffer_pool size was 144GB, and I filled the buffer pool with different amounts of data: 104GB, 52GB, 21GB. I ran a simple cycle. First I created 1000 tables:

for i in `seq 1 1000`; do
  mysql -Bse "CREATE TABLE ts_$i (begin_time int(11) NOT NULL, end_time int(11) NOT NULL, PRIMARY KEY (begin_time)) ENGINE=InnoDB" test2
done

And then I dropped them (the drop loop is sketched below). I measured the total time to drop all 1000 tables in MySQL 5.1.56, in Percona Server 5.1.56 with “innodb_lazy_drop_table”, and in MySQL 5.6.2.
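The drop half of the cycle is not shown above; it would look like this:

for i in `seq 1 1000`; do
  mysql -Bse "DROP TABLE ts_$i" test2
done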

Here are my results (results in seconds; fewer is better):

Allocated pages   MySQL 5.1.56   Percona Server 5.1.56   MySQL 5.6.2
21GB              310            95                      234
52GB              776            210                     548
104GB             1635           466                     1210

Also, to show the dynamics, here is a graph showing the number of dropped tables per second (more is better):

As we see, all three versions are affected by a bigger buffer pool: The more memory used, the slower DROP TABLE operates. MySQL 5.6.2 shows some improvement compared to MySQL 5.1.56, but Percona Server 5.1.56 has a better result.

The relative comparison for the 104GB case: Percona Server is 3.5x better than MySQL 5.1.56 and 2.6x better than MySQL 5.6.2.

(Disclaimer: The “innodb_lazy_drop_table” feature in Percona Server and this benchmark are sponsored by a Well Known Social Network.)


MySQL Connection Timeouts

Sometimes on a very busy MySQL server you will see sporadic connection timeouts, such as Can't connect to MySQL server on ‘mydb’ (110). If you time connects in your application, you will see some successful connections taking well over a second. The problem may start slowly and be almost invisible for a long time, for example with one out of a million connection attempts timing out, but as the load grows it may become a lot more frequent.

If you time the connects, you will often see connection times close to 3 and 9 seconds. These are “magic” numbers I remember from years ago; they correspond to a SYN packet being dropped during the connection attempt and resent. 3 seconds corresponds to one dropped packet and 9 seconds to two. If this is happening, it is possible you have network issues, but more likely you have a listen queue overflow. You can check whether this is the case by running netstat -s and finding something like:

    38409 times the listen queue of a socket overflowed
    38409 SYNs to LISTEN sockets dropped
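A quick way to see whether these counters are still growing, rather than being leftovers from an old incident (my own habit, not part of the original advice):

watch -n 2 -d 'netstat -s | grep -i listen'   # -d highlights counters that changed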

This means some SYN packets had to be dropped because the kernel's buffer of connection requests on the LISTEN socket overflowed; MySQL is not accepting connections as quickly as it needs to. There are two tuning knobs you need to consider if this is what is happening.

First – the Linux kernel setting net.ipv4.tcp_max_syn_backlog. This is the size of the kernel buffer for all sockets. The default on my kernel is 2048, though it may vary between versions; you might need to increase it to 8192 or so if you have an intense connection rate. I'll explain the math below.

Second – the MySQL parameter back_log, which has a default value of just 50. You may want to set it to 1000 or even higher. You may also need to increase the net.core.somaxconn kernel setting, which sets the maximum depth of the listen queue allowed. The kernel I'm running has it set to just 128, which would be too low for many busy servers.
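Putting those three knobs into commands, using the values mentioned above (a sketch; size them for your own connection rate, and persist the sysctl values in /etc/sysctl.conf):

sysctl -w net.ipv4.tcp_max_syn_backlog=8192   # kernel backlog of half-open connections
sysctl -w net.core.somaxconn=1024             # cap on any socket's listen queue depth
# back_log is not a dynamic variable in the MySQL versions discussed here,
# so set it in the [mysqld] section of my.cnf and restart mysqld:
#   back_log = 1000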

Now let's look more into the problem and do some math. First, let's look at how MySQL accepts connections. There is a single main thread which accepts connections coming to the LISTEN
socket. When a connection comes in, the main thread needs to create a new socket for it and create a new thread, or take one out of the thread cache. From this point on, MySQL processes network communication in multiple threads and can benefit from multiple cores, but the work done by the main thread cannot.

Usually the main thread is able to accept connections pretty quickly; however, if it stalls waiting on a mutex, or if other work such as launching a new thread takes a long time, the listen queue can overflow. Consider a database which accepts 1000 connects/sec on average. This is a high number, but you can see even higher ones. In most cases, because of the “random arrivals” nature of traffic, you will see some seconds where as many as 3000 connections come in. Under such conditions the default back_log of 50 is enough for just 17 milliseconds, and if the main thread stalls anywhere longer than that, some SYN packets may be lost.

I would suggest sizing your tcp_max_syn_backlog and back_log values to be enough for at least 2 seconds' worth of connection attempts. For example, if I have 100 connects/sec on average, I should plan for 300 connects/sec using 3x as a “peak multiplier”, which means they should be set to at least 600.
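The same math as shell arithmetic, for the two examples above:

echo $(( 50 * 1000 / 3000 ))  # default back_log of 50 at a 3000/sec burst lasts ~16 ms
avg=100                       # average connects/sec
peak=$(( avg * 3 ))           # 3x "peak multiplier" -> 300/sec
echo $(( peak * 2 ))          # two seconds of peak arrivals -> set at least 600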

Setting it to cover much more than 2 seconds does not make much sense, because if a client does not get a response within 3 seconds it will consider the SYN packet lost and send a new one anyway.

There is something else to consider. If you're creating 1000 connections a second to the MySQL server, you might be pushing your luck, and at the very least you're using a lot of resources setting up and tearing down connections. Consider using persistent connections or a connection pool, at least for the applications responsible for most of the connections being created.


Percona’s O’Reilly MySQL Conference Community Awards

Percona, and Yasufumi Kinoshita who works for Percona, received two awards from the O’Reilly MySQL Conference co-chairs on behalf of the committee that voted on the nominees. The awards were: to Yasufumi, O’Reilly MySQL Community Contributor of the Year 2011, for InnoDB performance improvements; and to XtraBackup, O’Reilly MySQL Application of the Year 2011, for adding open-source hot backup functionality to MySQL.

It’s worth saying that XtraBackup is a team effort: Yasufumi wrote the initial version of XtraBackup, and Alexey Kopytov has been working on it recently. Our congratulations to all of the winners: Yasufumi, Giuseppe Maxia, Continuent, and DeNA. Let’s keep making MySQL better!

(Note: I was part of the conference speaker committee, but not part of the award committee. I’m not sure who was on the award committee.)

(Another note: There was a rumor that Yasufumi’s house or apartment was destroyed in the earthquake/tsunami, but that isn’t the case, thankfully.)


How to use tcpdump on very busy hosts

Often I run into problems when trying to use mk-query-digest with tcpdump on “very” busy hosts. You might be thinking, “very busy is a relative and unquantifiable term,” and you’d be right, so I’ll phrase this differently. Let me give a little background to the problem first. Mk-query-digest tries to handle dropped or missing packets gracefully, but even so they can skew results dramatically. Imagine a situation where a single connection sends two queries and gets two responses, with a response time of R for each request, and a wait time of W between the requests. If the first response and second request are dropped by the kernel, the result – from mk-query-digest’s perspective – is that the database took 2R + W time to respond to the first request.

Back to the question of, “what is a very busy host?” In my experience, if you are getting even 5% of tcpdump packets dropped by the kernel, the results can be skewed enough to cause confusion about which queries are really slow. Recently, I got more than 60% dropped packets on a server with roughly 50MB/s of traffic on port 3306, a system load of about 10, and 8 CPU cores. The resulting mk-query-digest output was obviously bogus when compared to the host's slow-query-log (for example, none of the top 5 slow queries reported by mkqd appeared in the actual slow log file). After a little brainstorming, we came up with a few solutions:

  1. use mk-query-digest --filter '$event->{time} && $event->{time} > 1' to report only the queries which it believes took longer than the server's long-query-time of 1 second
  2. tcpdump traffic only from a small and representative subset of clients
  3. tcpdump a modulo of incoming ports (including port 3306, because we must capture the responses from mysqld)

#1 has an obvious flaw: if your long-query-time is 1 second, and mkqd believes that a query which actually took 10ms instead took 0.9s, the results are still useless. That is to say, this doesn't actually solve the real problem of dropped packets; it just applies a mask to the output. #2 seems like the simplest good solution, but when I tested it, I still got a very high percentage of dropped packets (around 30% when filtering only 4 out of hundreds of active clients). While this is lower than without the filter, it is still unusable. #3 actually worked very well and resulted in about 0.2% packet loss on this host, which is acceptable; the variances are statistically smoothed out and don't noticeably affect the results. Here is the tcpdump command used.

tcpdump -i eth0 -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2'
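In case the filter looks cryptic: tcp[1] and tcp[3] are the low-order bytes of the TCP source and destination ports, so “& 7 == 2” keeps only ports equal to 2 modulo 8 (this reading is mine; the post doesn't spell it out). Port 3306 itself passes the test, which is why every response from mysqld is still captured while only about 1/8 of client ports are:

echo $(( 3306 & 7 ))   # prints 2, so mysqld's port survives the modulo filter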

I also had to forward the tcpdump output to another host for processing, because the database host couldn't handle the additional I/O or CPU pressure of either writing it to a file or piping it to mk-query-digest. Here is a draft of a script to automate this; use at your own risk and update to suit your needs.


#!/bin/bash

# Set some defaults
d=$(date +%F-%T | tr :- _)
self=`basename $0`
REMOTEHOST='some.other.hostname'           # CHANGEME
TMP=/tmp                                   # scratch directory on both hosts
LOCKFILE="$self.lock"                      # these three were undefined in the
TMPfile="tcpdump-$d.cap"                   # original draft; pick names that
RESfile="digest-$d.txt"                    # suit your environment
LIMIT=50                                   # limit # of queries in report
SLEEPTIME=1200                             # duration to capture tcpdump data in seconds (1200 = 20 min)

# check and create the lock file
if [ -e "$TMP/$LOCKFILE" ]; then
  echo "$self: lock file $LOCKFILE already exists, aborting"
  exit 1
fi
touch "$TMP/$LOCKFILE"

# set trap to be sure tcpdump doesn't run forever
# and clean up the temp file too
trap 'rm -f $TMP/$LOCKFILE; kill $PID; ssh $REMOTEHOST -- "rm -f $TMP/$TMPfile"; exit' INT TERM EXIT

# run the tcpdump, write to a remote file, and sleep for a bit
tcpdump -i eth0 -s 65535 -x -n -q -tttt 'port 3306 and tcp[1] & 7 == 2 and tcp[3] & 7 == 2' 2>/dev/null \
 | ssh $REMOTEHOST -- "cat - > $TMP/$TMPfile" &
PID=$!             # PID of the ssh end of the pipe; tcpdump exits on the broken pipe
sleep $SLEEPTIME
kill $PID

# set trap to be sure both remote files are removed
trap 'ssh $REMOTEHOST -- "rm -f $TMP/$TMPfile $TMP/$RESfile"; rm -f $TMP/$LOCKFILE $RESfile; exit' INT TERM EXIT

# digest the result on the remote host, then copy the report to localhost
ssh $REMOTEHOST -- "mk-query-digest --type tcpdump --limit $LIMIT < $TMP/$TMPfile > $TMP/$RESfile 2>&1"
scp -q $REMOTEHOST:$TMP/$RESfile $RESfile

# email $RESfile using your preferred transport

# clean up remote and local files
trap - INT TERM EXIT
ssh $REMOTEHOST -- "rm -f $TMP/$TMPfile $TMP/$RESfile"
rm -f $RESfile $TMP/$LOCKFILE

exit 0

Percona Server and XtraBackup Weekly News, April 18th

Last week wasn’t so active in terms of new development, because many of us were at the MySQL conference, of course. However, we did push out the XtraBackup 1.6 release, and began work on backporting some features from MySQL 5.6 to 5.1 for a certain large customer. Several of the features we’ve added in 5.0 and 5.1 are reimplemented in MySQL 5.6, which is great. It means that we can backport Oracle’s code and we don’t have to maintain these features ourselves anymore.

The MySQL conference, and to some extent the Collaborate conference, was a great meeting place for a lot of developers. Many meetings included developers from Oracle, Percona, MariaDB, and Drizzle. Some had participation from senior management at the various companies. There were also a lot of productive meetings with large customers who use MySQL. I take all of this as a sign that the pace of MySQL development could continue to accelerate in the future.

In specific Percona Server news items,

  • We are fixing and backporting multiple purge thread support from 5.6 to 5.1, to replace the similar feature we have had in Percona Server for a while. I usually don't report specific bugs, but we found a bug in 5.1's purge-thread functionality that I think is of interest to some fork maintainers: https://bugs.launchpad.net/bugs/756387
  • Some other backports are in progress or in the request queue, but I don’t see any notice of progress on internal weekly reports or the mailing list, so I won’t speculate.

In XtraBackup news,

  • We released XtraBackup 1.6, with many improvements! We recommend that everyone upgrade to it.
  • Alexey Kopytov ported XtraBackup to Windows.

Percona Server 5.1.56-12.7

Percona Server version 5.1.56-12.7 is now available for download. It is based on MySQL 5.1.56 and is the current stable release in the 5.1 series.

New Features

  • Expanded the applicability of InnoDB fast index creation to mysqldump, ALTER TABLE, and OPTIMIZE TABLE. (Alexey Kopytov)

Variables Changed

  • Variable innodb_stats_method is now implemented in the upstream InnoDB, with the same name and functionality that had previously existed only in XtraDB. (Yasufumi Kinoshita)

Other Changes

  • Implemented support for the upstream InnoDB version of variable innodb_stats_method, including adding a column to table INNODB_SYS_STATS. Bug fixed: #733317. (Yasufumi Kinoshita)
  • Added README-windows with basic commands for building under Windows with VC 2010 Express. (Vadim Tkachenko)
  • Added a helper script, README-Solaris, for building on Solaris. (Vadim Tkachenko)
  • Changes were made to the post-install messages. (Ignacio Nin)

Bugs Fixed

  • Bug #631667 – The mysql binary is now built with readline support so that command line browsing is possible. (Ignacio Nin)
  • Bug #716575 – When using the Expand Table Import feature, importing .ibd files smaller than 1 MB could lead to a server crash. (Yasufumi Kinoshita)
  • Bug #727704 – When using the Expand Table Import feature, importing .ibd files created on MySQL 5.0 or Percona Server versions prior to 5.1.7 could crash the server. (Yasufumi Kinoshita)
  • Bug #735423 – File ownership and permissions are now changed after running mysql_install_db during installation or upgrade, in order to avoid a “permission denied” error when starting mysqld. (Aleksandr Kuzminsky)
  • MySQL bugs 56433 and 51325 – These MySQL bugs have been fixed in Percona Server. (Yasufumi Kinoshita)

