On Good Instrumentation

While troubleshooting applications, I keep thinking how much more efficient things could be if only good instrumentation were available. Most applications out there have very little code to help understand what is going on, and when instrumentation does exist, it frequently reports metrics which are not very helpful.

If you look at the system from a bird's-eye view, it needs to process transactions: you want it to successfully complete a large number of the transactions it gets (this is what is called availability), and you want it to serve them within a certain response time, which is what is called performance. Many variables in the environment may change – load, number of concurrent users, the database, the way users use the system – but in a nutshell, all you really care about is having predictable response time within a certain range. So if we care about response time, this is exactly what our instrumentation should measure.

Response Time Summary

We want to understand exactly where response time comes from. For example, if we define a transaction as the time it took to generate an HTML page, we want to understand how much time was spent waiting on the database, memcache, and other external services, as well as how much CPU time it consumed.
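As a concrete illustration, here is a minimal sketch in Python of what such a per-transaction summary might look like (the class and component names are invented for the example; a real application would wrap its database and cache drivers this way):

```python
import time
from collections import defaultdict

class ResponseTimeSummary:
    """Accumulates wall-clock time and call counts per component
    (e.g. 'mysql_query', 'memcache') for a single transaction."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.calls = defaultdict(int)
        self.start = time.monotonic()

    def timed(self, component):
        summary = self
        class _Timer:
            def __enter__(self):
                self.t0 = time.monotonic()
            def __exit__(self, *exc):
                summary.totals[component] += time.monotonic() - self.t0
                summary.calls[component] += 1
        return _Timer()

    def report(self):
        total = time.monotonic() - self.start
        accounted = sum(self.totals.values())
        return {"total": total, "unaccounted": total - accounted,
                "components": dict(self.totals), "calls": dict(self.calls)}

# Usage: wrap each call to a lower layer.
summary = ResponseTimeSummary()
with summary.timed("mysql_query"):
    time.sleep(0.01)   # stand-in for a real query
with summary.timed("memcache"):
    time.sleep(0.002)  # stand-in for a cache get
print(summary.report())
```

Logging the `report()` dictionary per request gives exactly the kind of queryable per-transaction breakdown discussed below.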

Now, what is important is that we need this information for individual transactions. Ideally this means every transaction, which is best and easily achievable for small-to-medium systems, or at least a large enough sample. It is very important that this information is available for individual transactions, not just an average. The average is useless: 100 transactions each taking 1 second and a mix of 99 transactions taking 1 ms plus one taking nearly 100 seconds have virtually the same average, while for performance analysis these are completely different situations. When you have a transaction sample, make sure it contains a fair population of transactions – capturing only slow transactions is not helpful, as we may want to compare them to fast transactions to understand why they are slow.
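The point is easy to check numerically; the samples below are invented to match the scenario (a uniform set versus 99 fast transactions plus one outlier):

```python
import statistics

uniform = [1.0] * 100               # 100 transactions of 1 s each
skewed = [0.001] * 99 + [99.901]    # 99 at 1 ms plus one huge outlier

# Both samples have (virtually) the same mean of 1 second...
assert abs(statistics.mean(uniform) - statistics.mean(skewed)) < 1e-9

# ...but the distributions are completely different:
print(statistics.median(uniform))  # 1.0
print(statistics.median(skewed))   # 0.001
print(max(uniform), max(skewed))   # 1.0 99.901
```

The mean hides everything; any per-transaction view (median, percentiles, max) immediately exposes the difference.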

What kind of components do you need in the response time summary? All components which are significant enough. If your instrumentation leaves 95% of response time unaccounted for, it is useless. You also do not want blocks that mix apples with oranges: for example, a combined "mysql and memcache" block would not be helpful. I would go even further and split "mysql time" into "connect time" and "query time", as there are situations when one but not the other is affected.

It is important for the response time summary to be stored in logs which are easy to query, so you can analyze the data in many different ways. Sometimes you may find that response time is impacted by queries from a certain user; in other cases it may be attributed to a particular application/web server.

The goal of the Response Time Summary is to quickly point in the direction where the problem happens. Whenever you have a spike in response time, or bad response time for a certain kind of request, you can quickly understand where it comes from: is it wasted CPU time, or slow responses from MySQL or memcache?

I also like to see the number of calls stored together with the attributed response time. For example, I'd like to see the number of MySQL calls in addition to MySQL response time. This helps to understand whether the issue is the number of queries or their performance. If I see 2 queries taking 30 seconds, it is clearly a case of slow queries. If 10,000 queries were executed with a total response time of 4 seconds, I know this is pretty much as good as it gets over a standard Ethernet network, and finding a way to reduce the number of queries is going to be more helpful.

The Glue

Our applications involve multiple layers, and typically a higher layer can only report the response time it took to call a lower layer, not the reason for that response time. For example, we can report the time it took to execute an SQL query from the PHP application side, but we can't say why it took so long. Was it a row-level lock? Waiting on disk IO? Or simply a matter of burning a lot of CPU? This information may, however, be available in the instrumentation of the lower layer – for example, in the MySQL query log. What is important is being able to connect the data from these logs – to glue them together. The easy way to do it is to assign a unique identifier to each request and put it in the logs of requests to the lower layers. With MySQL, a simple way to do this is to put the identifier in a comment on the queries you execute.
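A minimal sketch of this technique in Python (the helper name and the sample query are illustrative):

```python
import uuid

def tag_query(sql, request_id):
    # Prefix the query with a comment carrying the request ID so the
    # same ID appears in both the application log and the MySQL query log.
    return "/* request_id=%s */ %s" % (request_id, sql)

request_id = uuid.uuid4().hex
print(tag_query("SELECT name FROM users WHERE id = 42", request_id))
```

Grepping the MySQL slow query log for `request_id=...` then ties a slow query back to the exact page view that issued it.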

Optional Tracing

The information in lower-layer logs is very helpful; however, it typically has two problems. First, not every layer has good logs. For example, if you're running memcached, you probably do not have logs detailing every request and its response time. Second, the lower layer may only know response time from its own vantage point, which in many cases does not include network communication time – and that can be very important.

Tracing should be optional and normally applied to a small sample of requests, but it needs to be detailed. Typically you would record the calls to lower-level services with a timestamp, and the response summary with a timestamp again. The information about the request has to be complete enough to identify the target, the action, and the response. For example, for memcache I'd like to know which server:port the request was issued to and which key was requested, and on the response I'd like to know whether it was a hit, a miss, or an error.
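A trace record along these lines might look like the following sketch (the field names and server/key values are invented for illustration):

```python
import time

def trace_event(trace, service, action, **detail):
    # Append one timestamped event; detail must be complete enough to
    # identify the target, the action, and the outcome.
    trace.append({"ts": time.time(), "service": service,
                  "action": action, "detail": detail})

trace = []
trace_event(trace, "memcache", "get",
            server="memcache01:11211", key="user:42")
trace_event(trace, "memcache", "response", result="miss")
for event in trace:
    print(event)
```

With records like these, per-server statistics (such as the miss ratio of one memcached instance) fall out of a simple aggregation over the sampled traces.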

The way I use it may be as follows. I see increased response time for a given kind of request. I see the response time is coming from MySQL. I check the number of MySQL queries, and it is 5x what it usually is for this kind of request. Looking at memcache stats, I see a high number of misses. Looking at the available traces shows that the server memcache01 has a very high miss ratio. Checking what is going on with memcache01 shows it was just restarted (and hence has an almost empty cache). This is an important example, as it shows that your increased response time from MySQL may have nothing to do with MySQL itself – but you would not know that unless you're capturing the right data.

If you're looking for a nice example framework for instrumentation, check out instrumentation-for-php – it has everything mentioned here except tracing, which is trivial to add.

Entry posted by peter |

Percona Welcomes Justin Swanhart

Percona is pleased to officially welcome Justin Swanhart to our team of consultants.

Before joining Percona, Justin worked as a MySQL DBA at Gazillion, Yahoo, and Kickfire. He has become a regular contributor here on the MySQL Performance Blog as well as an active blogger in his own right. He is very active in the community, maintaining FlexViews, a stored-procedure-managed solution for materialized view creation and maintenance in MySQL 5.1, as well as instrumentation-for-php, a suite of PHP classes for easing the implementation of instrumentation in applications.

Justin, a big welcome – we are indeed fortunate that you're working with us!

Entry posted by Ryan Lowe |


FlashCache: tpcc workload

This is my last post in the series on FlashCache testing, with the cache placed on an Intel SSD card.

This time I am using a tpcc-like workload with 1000 warehouses (which gives 100GB of data) on a Dell PowerEdge R900 with 32GB of RAM, 22GB of it allocated to the buffer pool, and a 70GB FlashCache partition (deliberately sized to test the case when the data does not fit into the cache partition).

Please note that in this configuration the benchmark is very write-intensive, and it is not going to be easy for FlashCache: in the background it has to write blocks to the RAID anyway, so the final write rate is limited by the RAID. All performance benefits will therefore come from read hits.

The full report and results are available on our benchmark Wiki.

A short version of the results is shown on the graph:

In summary:
on RAID final result: 2556.592 TpmC
on Intel SSD X25-M: 7084.483 TpmC
on FlashCache with 20% dirty pages: 2632.892 TpmC
on FlashCache with 80% dirty pages: 4468.883 TpmC

So with 20% dirty pages the benefit is really miserable, which is quite explainable (see the note above about the write-intensive workload); but the graph suggests that 2h was probably not enough to warm up FlashCache.
And this is an interesting problem I see with FlashCache. Warmup by simply copying files does not work (you need O_DIRECT with the proper block size), so you can only rely on InnoDB in this case, and it takes about 30min+ to fill FlashCache. A solution would be PRELOAD TABLE / INDEX, and it is on our roadmap for XtraDB.

With 80% dirty pages the performance gain is 1.7x, which is pretty decent, as you can get it for $500 (the price of an Intel X25-M card) – for this particular workload; your experience may vary!

At this stage I consider FlashCache pretty stable and ready for evaluation on real workloads (kudos to the Facebook team – they provide a really stable release).

I actually did a pretty brutal test – I turned off power to the SSD drive in the middle of a tpcc-mysql run (just the SSD power, not the whole server). No wonder FlashCache complained about failed writes, and after that I restarted the full system. I was expecting the database to be trashed, but after restart FlashCache was able to attach the previous cache, and MySQL was able to start and finish crash recovery. I am impressed.

In my next round I am looking to run similar benchmarks on a FusionIO card.

P.S. If the CentOS team reads this post – please change the default IO scheduler from CFQ to Deadline. Seriously, it does so much damage to performance on servers with IO-intensive workloads that changing it should be the first action after a CentOS installation. And I doubt there is much usage of CentOS on desktop systems anyway.
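For reference, on Linux the scheduler can be checked and switched at runtime per block device (the device name `sdb` below is just an example):

```shell
# Show available schedulers; the active one appears in brackets.
cat /sys/block/sdb/queue/scheduler
# Switch to deadline at runtime (does not survive a reboot):
echo deadline > /sys/block/sdb/queue/scheduler
# To make it permanent, boot with elevator=deadline on the kernel command line.
```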

Entry posted by Vadim |


Some seats free for the upcoming German symfony training

There are some seats free for my upcoming German symfony training session. This training takes place mid-June (16th-18th June) at a training center in Heilbronn.



Tuning InnoDB Concurrency Tickets

InnoDB has an oft-unused parameter innodb_concurrency_tickets that seems widely misunderstood. From the docs: “The number of threads that can enter InnoDB concurrently is determined by the innodb_thread_concurrency variable. A thread is placed in a queue when it tries to enter InnoDB if the number of threads has already reached the concurrency limit. When a thread is allowed to enter InnoDB, it is given a number of “free tickets” equal to the value of innodb_concurrency_tickets, and the thread can enter and leave InnoDB freely until it has used up its tickets. After that point, the thread again becomes subject to the concurrency check (and possible queuing) the next time it tries to enter InnoDB. The default value is 500…”

What this means from a practical perspective is that each query is allocated 500 tickets when it begins executing. Each time it enters InnoDB, this number is decremented until it reaches zero ("entering InnoDB" appears to occur only when a row is accessed). When it reaches zero, the thread may be put into a queue and have to wait to continue execution. InnoDB doesn't provide a way to determine how many concurrency tickets a query uses, making this parameter notoriously difficult to tune. It is important to note that this variable only comes into play when innodb_thread_concurrency is greater than zero.

On a stock install of MySQL, here are some example queries and the corresponding number of concurrency tickets used for each:


  mysql> CREATE TABLE test_table (
      ->     id int
      -> ) ENGINE=InnoDB; -- 0 Tickets Used
  Query OK, 0 rows affected (0.36 sec)

  mysql> INSERT INTO test_table (id) VALUES (1); -- 0 Tickets Used
  Query OK, 1 row affected (0.00 sec)

  mysql> SELECT * FROM test_table; -- 1 Ticket Used
  +------+
  | id   |
  +------+
  |    1 |
  +------+
  1 row in set (0.00 sec)

  mysql> INSERT INTO test_table (id) VALUES (2),(3); -- 0 Tickets Used
  Query OK, 2 rows affected (0.00 sec)
  Records: 2  Duplicates: 0  Warnings: 0

  mysql> SELECT COUNT(*) FROM test_table; -- 3 Tickets Used
  +----------+
  | COUNT(*) |
  +----------+
  |        3 |
  +----------+
  1 row in set (0.00 sec)

  mysql> UPDATE test_table SET id=4 WHERE id=1; -- 4 Tickets Used (no index, so a table scan is performed)
  Query OK, 1 row affected (0.00 sec)
  Rows matched: 1  Changed: 1  Warnings: 0

  mysql> ALTER TABLE test_table ADD INDEX (id); -- 5 Tickets Used
  Query OK, 3 rows affected (0.01 sec)
  Records: 3  Duplicates: 0  Warnings: 0

And now on to a more interesting scenario: foreign keys.


  mysql> CREATE TABLE parent (id INT NOT NULL,
      ->                      PRIMARY KEY (id)
      -> ) ENGINE=INNODB; -- 0 Tickets Used
  Query OK, 0 rows affected (0.01 sec)

  mysql> CREATE TABLE child (id INT, parent_id INT,
      ->                     INDEX par_ind (parent_id),
      ->                     FOREIGN KEY (parent_id) REFERENCES parent(id)
      ->                       ON DELETE CASCADE
      -> ) ENGINE=INNODB; -- 0 Tickets Used
  Query OK, 0 rows affected (0.00 sec)

  mysql> INSERT INTO parent (id) VALUES (1),(2),(3),(4); -- 3 Tickets Used
  Query OK, 4 rows affected (0.03 sec)

  mysql> INSERT INTO child (id, parent_id) VALUES (1,1),(1,1),(2,1); -- 2 Tickets Used
  Query OK, 3 rows affected (0.00 sec)
  Records: 3  Duplicates: 0  Warnings: 0

  mysql> DELETE FROM child WHERE 1; -- 6 Tickets Used
  Query OK, 3 rows affected (0.02 sec)

  mysql> ALTER TABLE `child` ADD PRIMARY KEY (`id`,`parent_id`); -- 0 Tickets Used
  Query OK, 0 rows affected (0.02 sec)
  Records: 0  Duplicates: 0  Warnings: 0

  mysql> INSERT INTO `child` (`id`,`parent_id`) VALUES (1,1), (1,2), (2,1),(2,2); -- 3 Tickets Used
  Query OK, 4 rows affected (0.01 sec)
  Records: 4  Duplicates: 0  Warnings: 0

So, how can we put this into practice, since this information isn’t available to most users?

INSERT w/PRIMARY KEY defined: number of rows inserted minus 1
INSERT w/FOREIGN KEY constraint: number of rows inserted minus 1
SELECT: 1 ticket per row returned
UPDATE: 1 ticket per row examined + 1 ticket per row updated
DELETE: 1 ticket per row examined + 1 ticket per row deleted
ALTER: (2 × rows in the table) minus 1
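These rules of thumb can be transcribed into a rough estimator. To be clear, this is just the observed pattern above turned into Python, not an official InnoDB formula:

```python
def estimate_tickets(op, rows_inserted=0, rows_examined=0,
                     rows_changed=0, rows_returned=0, table_rows=0):
    """Rough concurrency-ticket estimate per the observations above."""
    if op == "insert":
        return max(rows_inserted - 1, 0)
    if op == "select":
        return rows_returned
    if op in ("update", "delete"):
        return rows_examined + rows_changed
    if op == "alter":
        return max(2 * table_rows - 1, 0)
    raise ValueError("unknown operation: %s" % op)

# Cross-check against the examples: DELETE of 3 rows used 6 tickets,
# INSERT of 4 parent rows used 3, ALTER on a 3-row table used 5.
assert estimate_tickets("delete", rows_examined=3, rows_changed=3) == 6
assert estimate_tickets("insert", rows_inserted=4) == 3
assert estimate_tickets("alter", table_rows=3) == 5
```

For a workload of known query shapes, this kind of estimate is enough to reason about whether a statement will exhaust its 500 default tickets.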

As with any performance optimization effort, you will want to optimize for the common case. If you have a very simple workload, you can calculate these values by hand. But for most workloads with a complex access pattern, we’ll need to estimate or wait for InnoDB to expose this information to us.

What happens in the case where I have two distinct access patterns: single-row primary-key lookups and SELECT statements that examine 900 rows? If innodb_concurrency_tickets is set to 500, then all of the single-row PK lookups will execute without ever being subject to an additional concurrency check (there is always one when a thread first enters InnoDB), while the 900-row SELECT statements will always be subject to one additional concurrency check (we actually care less about the concurrency check itself than about the possibility of being queued). Your first instinct may be to increase innodb_concurrency_tickets to >= 900 in this case, but that isn't necessarily the best decision. As stated in the docs, the number of threads that can enter InnoDB is limited by innodb_thread_concurrency (which is why these two variables are most often tuned in concert). To continue the example, if innodb_thread_concurrency is set to 8 and eight 900-row SELECT statements come in, they will effectively block the PK lookups until one of them is subject to a concurrency check or completes execution and exits InnoDB. If innodb_concurrency_tickets had been increased to >= 900, then ALL of the PK lookups would be blocked until the 900-row SELECT statements complete execution.

With a maximum value of 4,294,967,295, this has the potential to block other queries for a significant amount of time; setting innodb_concurrency_tickets too high can have startlingly negative performance implications. On the other hand, if we determine that 99% of the traffic is these single-row PK lookups and only 1% is the 900-row SELECTs, we may be tempted to lower the setting to 1 to accommodate the "typical case". The effect of this, though, would be to subject the 900-row SELECT statements to 899 concurrency checks. That means 899 potential opportunities to be queued! So, as with most other parameters, this is a balancing act.

It really comes down to the importance of the applicable queries. Imagine those 900-row SELECT statements were actually 10,000-row selects: this would become a much more pressing issue. If they are reporting queries used only internally, then it is not so much of an issue and you can leave innodb_concurrency_tickets rather small. If, on the other hand, these are the queries that generate revenue, you may want to give them a bit more dedicated CPU time so they execute that much faster (even at the expense of the PK lookups). In other words, if you're optimizing for throughput in this scenario, you will tune innodb_concurrency_tickets to the 99th percentile of the small PK lookups; if you're optimizing for response time, you will set it larger to accommodate the larger (important) SELECT statements.

A quick sysbench run gives us the following results (X-axis is innodb_concurrency_tickets, Y-axis is txn/sec. More is better). Since all sysbench queries are 10 rows or less, we don’t really expect to see much of a difference here:



  sysbench --test=oltp --oltp-table-size=80000000 --oltp-read-only=off --init-rng=on --num-threads=16 --max-requests=0 --oltp-dist-type=uniform --max-time=300 --mysql-user=root --mysql-socket=/var/lib/mysql/mysql.sock run

Applicable my.cnf settings:


  innodb_buffer_pool_size         = 24G
  innodb_data_file_path           = ibdata1:10M:autoextend
  innodb_file_per_table           = 1
  innodb_flush_log_at_trx_commit  = 1
  innodb_log_buffer_size          = 8M
  innodb_log_files_in_group       = 2
  innodb_log_file_size            = 1900M
  innodb_thread_concurrency       = 16
  innodb_flush_method             = O_DIRECT
  innodb_write_io_threads         = 8
  innodb_read_io_threads          = 8
  innodb_io_capacity              = 500
  innodb_max_dirty_pages_pct      = 90
  max_connections                 = 3000
  query_cache_size                = 0
  skip-name-resolve
  table_cache                     = 10000

Entry posted by Ryan Lowe |


Percona Server security fix releases

As you may know, MySQL® announced versions 5.0.91 and 5.1.47 with serious security fixes, so we are providing binary releases of Percona Server 5.0 and Percona Server / XtraDB 5.1 with the security patches.

Fixed bugs:

Release Percona Server 5.0.91-rel22 is available in our download area:

Release Percona Server/XtraDB 5.1.45-rel10.2 is also available (we backported the patches from 5.1.47 to 5.1.45):

Updated RPM and DEB are also available in our repositories.

We know that there are still many active MySQL 4.0 and MySQL 4.1 installations, and we understand how hard it is to upgrade 4.0/4.1 to 5.1. With published exploits, these installations are now at serious risk. If you are interested in backporting the patches to 4.0/4.1 or to older 5.0 and 5.1 releases, this is available under a Percona Maintenance package or regular Consulting.

Entry posted by Vadim |


Extending Index for Innodb tables can hurt performance in a surprising way

One schema optimization we often do is extending an index when there are queries which could use more key parts. Typically this is a safe operation: unless index length increases dramatically, queries which could use the old index can also use a prefix of the new index – or can they? It turns out there are special cases when this is not true.


  CREATE TABLE `idxitest` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `a` int(11) NOT NULL,
    `b` int(11) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `a` (`a`)
  ) ENGINE=InnoDB;

  mysql> SELECT count(*) FROM idxitest WHERE a=5 AND b=5;
  +----------+
  | count(*) |
  +----------+
  |    60434 |
  +----------+
  1 row in set (0.69 sec)

  mysql> EXPLAIN SELECT count(*) FROM idxitest WHERE a=5 AND b=5;
  +----+-------------+----------+------+---------------+------+---------+-------+--------+-------------+
  | id | select_type | table    | type | possible_keys | key  | key_len | ref   | rows   | Extra       |
  +----+-------------+----------+------+---------------+------+---------+-------+--------+-------------+
  |  1 | SIMPLE      | idxitest | ref  | a             | a    | 4       | const | 707820 | Using where |
  +----+-------------+----------+------+---------------+------+---------+-------+--------+-------------+
  1 row in set (0.00 sec)

The obvious optimization is to extend the index from column (a) to columns (a,b), which will make this query faster and should not hurt any other queries much, right?


  mysql> ALTER TABLE idxitest DROP KEY a, ADD KEY(a,b);
  Query OK, 0 rows affected (24.84 sec)
  Records: 0  Duplicates: 0  Warnings: 0

  mysql> SELECT count(*) FROM idxitest WHERE a=5 AND b=5;
  +----------+
  | count(*) |
  +----------+
  |    60434 |
  +----------+
  1 row in set (0.02 sec)

  mysql> EXPLAIN SELECT count(*) FROM idxitest WHERE a=5 AND b=5;
  +----+-------------+----------+------+---------------+------+---------+-------------+--------+-------------+
  | id | select_type | table    | type | possible_keys | key  | key_len | ref         | rows   | Extra       |
  +----+-------------+----------+------+---------------+------+---------+-------------+--------+-------------+
  |  1 | SIMPLE      | idxitest | ref  | a             | a    | 8       | const,const | 120640 | Using index |
  +----+-------------+----------+------+---------------+------+---------+-------------+--------+-------------+
  1 row in set (0.00 sec)

Wow. The query runs 30 times faster – not only because it has to scan fewer rows, but also because it is now an index-covering query: it does not need to access the data.

It turns out it is too early to celebrate. The application also had another query which previously ran so fast it could hardly be noticed. It, however, became a lot slower:


  # Old schema

  mysql> SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +---------+-----+---+
  | id      | a   | b |
  +---------+-----+---+
  | 3000000 | 100 | 7 |
  +---------+-----+---+
  1 row in set (0.00 sec)

  mysql> EXPLAIN SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +----+-------------+----------+------+---------------+------+---------+-------+--------+-------------+
  | id | select_type | table    | type | possible_keys | key  | key_len | ref   | rows   | Extra       |
  +----+-------------+----------+------+---------------+------+---------+-------+--------+-------------+
  |  1 | SIMPLE      | idxitest | ref  | a             | a    | 4       | const | 126074 | Using where |
  +----+-------------+----------+------+---------------+------+---------+-------+--------+-------------+
  1 row in set (0.00 sec)

  # New schema

  mysql> SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +---------+-----+---+
  | id      | a   | b |
  +---------+-----+---+
  | 3000000 | 100 | 7 |
  +---------+-----+---+
  1 row in set (1.01 sec)

  mysql> EXPLAIN SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
  | id | select_type | table    | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
  |  1 | SIMPLE      | idxitest | index | a             | PRIMARY | 4       | NULL |   36 | Using where |
  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
  1 row in set (0.00 sec)

  # The plan can also look something like this:

  mysql> EXPLAIN SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +----+-------------+----------+------+---------------+------+---------+-------+------+------------------------------------------+
  | id | select_type | table    | type | possible_keys | key  | key_len | ref   | rows | Extra                                    |
  +----+-------------+----------+------+---------------+------+---------+-------+------+------------------------------------------+
  |  1 | SIMPLE      | idxitest | ref  | a             | a    | 4       | const |    1 | Using where; Using index; Using filesort |
  +----+-------------+----------+------+---------------+------+---------+-------+------+------------------------------------------+
  1 row in set (0.01 sec)

So why did this query become so much slower? The reason is that its plan benefits from an InnoDB-specific feature: index entries are sorted by primary key within each complete key value. So when you have an index on (a) and id is the primary key, the real index is (a,id); when we extend the index to (a,b), it really becomes (a,b,id). So a query which used both the a and id key parts of the original index will quite likely be unable to use the new index to full extent.

What is the solution? You can have "redundant" indexes on (a) and (a,b) at the same time. This is something that is supposed to work, but it often does not:


  CREATE TABLE `idxitest` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `a` int(11) NOT NULL,
    `b` int(11) NOT NULL,
    PRIMARY KEY (`id`),
    KEY `a` (`a`),
    KEY `a_2` (`a`,`b`)
  ) ENGINE=InnoDB;

  mysql> SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +---------+-----+---+
  | id      | a   | b |
  +---------+-----+---+
  | 3000000 | 100 | 7 |
  +---------+-----+---+
  1 row in set (1.03 sec)

  mysql> EXPLAIN SELECT * FROM idxitest WHERE a=100 ORDER BY id DESC LIMIT 1;
  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
  | id | select_type | table    | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
  |  1 | SIMPLE      | idxitest | index | a,a_2         | PRIMARY | 4       | NULL | 2247 | Using where |
  +----+-------------+----------+-------+---------------+---------+---------+------+------+-------------+
  1 row in set (0.00 sec)

The MySQL optimizer considers using both the (a) and (a,b) indexes, and in the end decides to use neither, instead doing a full index scan on the primary key until it finds a=100. This looks like an optimizer glitch in this case, because it estimates it will scan 2247 rows with the selected plan, while using the (a) index the result is guaranteed after scanning only 1 row.

To really get things going, you need FORCE INDEX(a) to make the MySQL optimizer use the right plan.

These results mean you should be very careful applying index changes suggested by mk-duplicate-key-checker when it comes to redundant indexes. If you have query plans depending on InnoDB's ordering of data by primary key inside secondary indexes, they can be significantly affected.

Optimizer behavior may differ between MySQL versions. These tests were done with 5.1.45, though I've seen the same behavior with MySQL 5.0 too.

Entry posted by peter |


Beyond great cache hit ratio

I recently worked with an application which has a great memcached hit ratio – over 99% – but still has an average page response time of over 500ms. The reason? There are hundreds of memcached gets per page, and even though each has a response time of some 0.4ms, they add up to hundreds of ms of total response time.

A great memcached hit ratio is nice, but even more than that you should target eliminating requests altogether. Hit rate is a very bad indicator to begin with. Imagine an application which does 90 memcache gets (hits) to retrieve some data, plus 10 more requests which result in misses and cause MySQL queries. The hit ratio is 90%. Now imagine you have found a way to store the data which required 90 requests as a single object. You now have 1 request (a hit) and 10 misses, which drops your hit rate to below 10%, but performance will likely be a lot better.
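To see why, here is a back-of-the-envelope calculation in Python; the per-call costs are hypothetical but in the right ballpark for the scenario described:

```python
# Hypothetical per-call costs, in microseconds, chosen for illustration:
GET_US = 400          # one memcache round trip (~0.4 ms)
MISS_QUERY_US = 5000  # one MySQL query issued on a cache miss

def page_cost_us(hits, misses):
    # Every get (hit or miss) pays a round trip; misses also pay a query.
    gets = hits + misses
    return gets * GET_US + misses * MISS_QUERY_US

before = page_cost_us(hits=90, misses=10)  # 90% hit ratio
after = page_cost_us(hits=1, misses=10)    # <10% hit ratio, one big object

assert before == 90_000
assert after == 54_400   # faster despite the much worse hit ratio
```

The hit ratio collapses from 90% to under 10%, yet the page gets faster, because what actually matters is the total number of round trips and misses, not the ratio.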

The answer in many cases is to use larger objects in the cache. If you look at a web page, you will typically see some "blocks" which are static and possibly highly cacheable, while others are very dynamic or poorly cacheable. For example, the area which displays the user name and current time may not be cacheable, while a widget displaying the last 10 blog entries is cacheable for at least several minutes.

There are many development implications, component dependencies, etc. in how you can really implement caching, but looking from a bird's-eye view, you want as few blocks on the page as possible (to reduce the number of get requests and the CPU effort required to build the page), while at the same time splitting the page into blocks which allow good caching:

Make non-cacheable blocks as small as possible. If you have something like the current time on the page which is not cacheable, do not mix it with other content which would then also have to be re-created on every cache miss.

Maximize the number of uses of each cache item. It is most efficient if an item in the cache can be used by multiple pages and multiple users; this maximizes cache efficiency while reducing the amount of cache space required. The importance of this advice depends on the use case for your application: applications with only a couple of page views per user per day really need items shared by multiple users to get a good hit rate, while applications with hundreds of page views per user per session may do well enough with caches that are effectively per user.

Control invalidation. Items in the cache expire or get invalidated. It is best if such invalidation only kicks out the data which actually needs to be invalidated, but this may result in blocks that are too small and hence too many cache requests. You may look for balance by caching together data which is invalidated at about the same time.

Multi-Get. Multi-get is the truly nuclear option for caching with memcached, both because it can be so efficient and because it can be so complicated to add to existing code which was not designed with it in mind. Multi-get effectively allows you to manage items (expire, invalidate) separately while fetching them in a single round trip. You can have hundreds of different items fetched from cache for a page without paying for network latency each time.

Does multi-get remove the need to think about using larger cache items? I do not think so. You still pay extra processing cost on both the memcached server and the application side. However, the optimal cache topology will probably be different with multi-get than without it.
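A toy model makes the round-trip accounting concrete (this simulates the access pattern rather than using a real memcached client; the class and key names are invented):

```python
# Toy cache comparing N single gets against one multi-get,
# counting network round trips.
class ToyCache:
    def __init__(self, data):
        self.data = data
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1          # one round trip per key
        return self.data.get(key)

    def get_multi(self, keys):
        self.round_trips += 1          # one round trip for all keys
        return {k: self.data[k] for k in keys if k in self.data}

cache = ToyCache({"post:%d" % i: "entry %d" % i for i in range(10)})

# 10 separate gets cost 10 round trips...
for i in range(10):
    cache.get("post:%d" % i)
assert cache.round_trips == 10

# ...while one multi-get fetches all 10 items in a single round trip.
items = cache.get_multi(["post:%d" % i for i in range(10)])
assert len(items) == 10 and cache.round_trips == 11
```

The latency saving is exactly the difference in round trips; the per-item serialization and processing cost remains, which is why larger items can still win.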

A good theoretical example would be a complicated front page which requires a lot of cache requests and hundreds of milliseconds of CPU processing, and which can be cached almost in its entirety with the exception of the user name in the corner. Storing the complete page HTML in the cache and doing a template replace for the user name can be very efficient.

This may sound complicated, and indeed there is a lot of art and science in designing an optimal cache solution. The good thing is that most web sites do not have to get it absolutely perfect to get good enough performance; it is most important to avoid the biggest design mistakes.

Entry posted by peter |


A workaround for the performance problems of TEMPTABLE views

MySQL supports two different algorithms for views: the MERGE algorithm and the TEMPTABLE algorithm. These two algorithms differ greatly. A view which uses the MERGE algorithm can merge filter conditions into the view query itself, which has significant performance advantages over TEMPTABLE views. A view which uses the TEMPTABLE algorithm has to compute the entire contents of the view before any filter is applied; this requires a temporary table, and many more rows may be accessed than would have been had the filter been merged into the view query.

As long as a view avoids aggregation, DISTINCT, GROUP BY, ORDER BY, etc., it can use the MERGE algorithm. Unfortunately, this means that views of any significant complexity will almost always use the TEMPTABLE algorithm.

This blog post demonstrates a workaround that allows the convenience of complex views, including those with aggregation (that is, views which use the TEMPTABLE algorithm), with some of the performance advantages of MERGE algorithm views.

As a demonstration, consider the following table which contains a combination of integer values. There are enough values in the table such that a scan of all rows takes a second or two.


  mysql> SHOW CREATE TABLE t2\G
  *************************** 1. row ***************************
         Table: t2
    `c1` int(11) DEFAULT NULL,
    `c2` int(11) NOT NULL AUTO_INCREMENT,
    PRIMARY KEY (`c2`),
    KEY `c1` (`c1`)
  1 row in set (0.00 sec)

If we define a view which uses no aggregation, then MySQL will let us use the MERGE algorithm. Notice that there are no warnings:


  mysql> CREATE ALGORITHM=MERGE VIEW v1 AS SELECT c1, c2 FROM t2;
  Query OK, 0 rows affected (0.11 sec)

If we then write an SQL query to aggregate over the rows of the view and we add a filter condition on c1, the view is accessed quickly:


  mysql> SELECT c1, count(*) FROM v1 WHERE c1 = 10 GROUP BY c1;
  +------+----------+
  | c1   | count(*) |
  +------+----------+
  |   10 |      130 |
  +------+----------+
  1 row in set (0.01 sec)

We can use EXPLAIN EXTENDED to see that MySQL rewrites the view query to include the restriction:


  mysql> EXPLAIN EXTENDED SELECT c1, count(*) FROM v1 WHERE c1 = 10 GROUP BY c1;
  +----+-------------+-------+------+---------------+------+---------+-------+------+----------+--------------------------+
  | id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra                    |
  +----+-------------+-------+------+---------------+------+---------+-------+------+----------+--------------------------+
  |  1 | SIMPLE      | t2    | ref  | c1            | c1   | 5       | const |  130 |   100.00 | Using where; Using index |
  +----+-------------+-------+------+---------------+------+---------+-------+------+----------+--------------------------+
  1 row in set, 1 warning (0.00 sec)

Notice the rewritten SQL query:


  mysql> SHOW WARNINGS;
  +-------+------+---------------------------------------------------------------------------------------------------------------------------------+
  | Level | Code | Message                                                                                                                         |
  +-------+------+---------------------------------------------------------------------------------------------------------------------------------+
  | Note  | 1003 | SELECT `test`.`t2`.`c1` AS `c1`,count(0) AS `count(*)` FROM `test`.`t2` WHERE (`test`.`t2`.`c1` = 10) GROUP BY `test`.`t2`.`c1` |
  +-------+------+---------------------------------------------------------------------------------------------------------------------------------+
  1 row in set (0.00 sec)

Now, let's try to use the MERGE algorithm with an aggregation query:


  mysql> CREATE ALGORITHM=MERGE VIEW v2 AS SELECT c1, count(*) FROM v1 GROUP BY c1;
  Query OK, 0 rows affected, 1 warning (0.01 sec)

Notice that MySQL sets ALGORITHM=UNDEFINED, and will choose to use TEMPTABLE when the view is actually used:


  mysql> SHOW WARNINGS;
  +---------+------+-------------------------------------------------------------------------------+
  | Level   | Code | Message                                                                       |
  +---------+------+-------------------------------------------------------------------------------+
  | Warning | 1354 | View merge algorithm can't be used here for now (assumed undefined algorithm) |
  +---------+------+-------------------------------------------------------------------------------+
  1 row in set (0.00 sec)

When you try to restrict the second view using a filter on c1, performance is not good. MySQL is scanning the entire table to produce the result:


  mysql> SELECT * FROM v2 WHERE c1 = 10;
  +------+----------+
  | c1   | count(*) |
  +------+----------+
  |   10 |      130 |
  +------+----------+
  1 row in set (1.66 sec)

Even a query with an impossible where clause takes a long time to process:


  mysql> SELECT * FROM v2 WHERE c1 = 100;
  Empty set (1.64 sec)

You can see that MySQL is accessing millions of rows:


  mysql> EXPLAIN SELECT * FROM v2 WHERE c1 = 100;
  +----+-------------+------------+-------+---------------+------+---------+------+---------+-------------+
  | id | select_type | table      | type  | possible_keys | key  | key_len | ref  | rows    | Extra       |
  +----+-------------+------------+-------+---------------+------+---------+------+---------+-------------+
  |  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL | NULL    | NULL | 1310720 | Using where |
  |  2 | DERIVED     | t2         | index | NULL          | c1   | 5       | NULL | 2785914 | Using index |
  +----+-------------+------------+-------+---------------+------+---------+------+---------+-------------+
  2 rows in set (1.76 sec)

Now to the actual workaround. MySQL won't merge query filters into the view definition, but we can modify the physical definition of the view in a way which provides a different means of supplying filter conditions to the query.

Let's start by creating a parameter table. This table might be view specific, or you could construct it in such a way that it can be shared between views. This example is somewhere in between those two extremes:


  mysql> CREATE TABLE params (
             view_name varchar(250),
             param1_val int,
             connection_id bigint,
             PRIMARY KEY(connection_id, view_name)
         ) ENGINE=InnoDB;
  Query OK, 0 rows affected (0.10 sec)

Notice that MySQL still won't let us use the MERGE algorithm (1 warning):


  CREATE ALGORITHM=MERGE VIEW v2 AS
  SELECT c1, count(*)
    FROM v1
    JOIN params p ON p.connection_id = connection_id()
                 AND p.view_name = 'test.v2'
                 AND p.param1_val = v1.c1
  GROUP BY c1;
  Query OK, 0 rows affected, 1 warning (0.01 sec)

There are two important things to consider here. First, notice the use of connection_id() in the view definition. This makes sure that the only rows examined in the params table are those which belong to the current connection. When rows are inserted into the table (see below), the connection_id() function is used to generate the value for the connection_id column. Also notice the restriction p.param1_val = v1.c1. This restricts the query to only those rows where v1.c1 matches the value in the params table.
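The per-connection scoping can be sketched in Python (a toy model of the params table, not MySQL code; replace_param and run_view are made-up names, and a dict keyed by (connection_id, view_name) stands in for the primary key):

```python
params = {}   # (connection_id, view_name) -> param1_val, like the PRIMARY KEY

def replace_param(connection_id, view_name, value):
    # REPLACE semantics: insert, or overwrite the row with this primary key.
    params[(connection_id, view_name)] = value

def run_view(connection_id, data):
    # The view joins on connection_id() and view_name, so each session
    # only ever sees its own parameter row.
    c1 = params[(connection_id, "test.v2")]
    return [(c1, sum(1 for x in data if x == c1))]

data = [10] * 130 + [20] * 7       # toy stand-in for the c1 column of t2

replace_param(1, "test.v2", 10)    # session 1 asks for c1 = 10
replace_param(2, "test.v2", 20)    # session 2 asks for c1 = 20 concurrently

print(run_view(1, data))   # [(10, 130)]
print(run_view(2, data))   # [(20, 7)]
```

Because the key includes the connection id, concurrent sessions can parameterize the same view without interfering with each other.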

Place parameters into the params table using the REPLACE statement:


  mysql> REPLACE INTO params (`view_name`, `param1_val`, `connection_id`) VALUES ('test.v2', 10, connection_id());
  Query OK, 1 row affected (0.00 sec)

When MySQL executes the new view it will be able to restrict the rows examined based on the param1_val value in the params table:


  mysql> SELECT * FROM v2;
  +------+----------+
  | c1   | count(*) |
  +------+----------+
  |   10 |      130 |
  +------+----------+
  1 row in set (0.01 sec)

Notice the vastly different plan compared to the old one. We no longer examine millions of rows:


  mysql> EXPLAIN SELECT * FROM v2 WHERE c1 = 10;
  +----+-------------+------------+--------+---------------+---------+---------+-------+------+--------------------------+
  | id | select_type | table      | type   | possible_keys | key     | key_len | ref   | rows | Extra                    |
  +----+-------------+------------+--------+---------------+---------+---------+-------+------+--------------------------+
  |  1 | PRIMARY     | <derived2> | system | NULL          | NULL    | NULL    | NULL  |    1 |                          |
  |  2 | DERIVED     | p          | const  | PRIMARY       | PRIMARY | 260     |       |    1 |                          |
  |  2 | DERIVED     | t2         | ref    | c1            | c1      | 5       | const |  130 | Using where; Using index |
  +----+-------------+------------+--------+---------------+---------+---------+-------+------+--------------------------+
  3 rows in set (0.01 sec)

In review, I like views for their convenient encapsulation of SQL logic that would otherwise have to be repeated in SQL statements throughout the application. The convenience sometimes comes at a high cost though, particularly when the TEMPTABLE algorithm is used. The technique outlined in this post can be used to trade increased query complexity for improved view performance.

Entry posted by Justin Swanhart |


Pacemaker, please meet NDB Cluster, or using Pacemaker/Heartbeat to start an NDB Cluster

Customers have always asked me to make NDB Cluster start automatically when the servers boot. For those who know NDB Cluster, making it start automatically is tricky. I know at least 2 sets of scripts to manage NDB startup: ndb-initializer and the one from Johan's configurator. If all the nodes come up at about the same time, it is not too bad, but what if a critical node takes much longer to start because of an fsck on a large ext3 partition? Then a startup script becomes a nightmare. And what if the box on which the script is supposed to run didn't start at all? That's a lot of rules to handle.

Since all aspects of HA interest me, I was recently reading the Pacemaker documentation and I realized that Pacemaker has all the logic required to manage NDB Cluster startup. Okay, it might seem weird to control a cluster with a cluster, but if you think about it, this is, I think, the best solution.

The Linux-HA project has split the old Heartbeat-2 project into 2 parts: the clustering and communication layer, Heartbeat, and the resource manager, Pacemaker. A key new feature that has been added to Pacemaker recently is the clone resource set, which allows an optional startup if only one of 2 similar resources starts. I use this feature to start the data nodes. If, after a major outage, only one of the physical hosts where the data nodes are located comes up, the cluster will still start. The other features of Pacemaker that I need are resource location (rsc_location) and resource ordering (rsc_order).

Let's start at the beginning. My NDB cluster is made of the following 3 nodes:

  • testvirtbox: ndb_mgmd
  • test1: ndbd
  • test2: ndbd

For the sake of simplicity, I am not considering the SQL nodes but given the framework, extending to SQL nodes is trivial. Installing Pacemaker and Heartbeat is very easy on Lucid Lynx, just do the following:


  apt-get install heartbeat pacemaker

On other distributions, you might have to build from sources, look here for help.

There are 2 minimal configuration files:


  root@test2:~# cat /etc/ha.d/authkeys
  auth 1
  1 sha1  yves
  root@test2:~# cat /etc/ha.d/
  autojoin none
  bcast eth0
  warntime 5
  deadtime 15
  initdead 60
  keepalive 2
  node test1
  node test2
  node testvirtbox
  crm respawn

And then, Heartbeat can be started on all nodes with /etc/init.d/heartbeat start.

Next, since Pacemaker is used here to start resources and not to manage them, we don't need to define Stonith devices, so (run on only one node):


  crm_attribute -t crm_config -n stonith-enabled -v false

One last thing before defining the resources: since the Heartbeat cluster is asymmetrical, meaning resources are not allowed to run just anywhere, we must create an "Opt-In" cluster with (run on only one node):


  crm_attribute --attr-name symmetric-cluster --attr-value false

At this point, we have a running cluster controlling nothing. The trick with NDB Cluster is that Heartbeat is required to start the resources but not to stop them. In order to achieve this behavior, I created fake resource scripts that can be fully controlled by Heartbeat while allowing the one-way behavior I wanted.


  root@testvirtbox:~# cat /usr/local/bin/fake_ndb_mgmd
  #!/bin/bash

  /usr/bin/nohup /usr/local/mysql/libexec/ndb_mgmd > /dev/null &

  while [ 1 ]
  do
          /bin/sleep 60
  done

  root@testvirtbox:~# cat /usr/local/bin/fake_ndb_cluster_start
  #!/bin/bash

  #Give some time to the nodes to connect
  /bin/sleep 15

  /usr/local/mysql/bin/ndb_mgm -e 'all start' > /dev/null

  while [ 1 ]
  do
          /bin/sleep 60
  done

  root@test1:~# cat /usr/local/bin/fake_ndbd
  #!/bin/bash

  #Give some time to ndb_mgmd to start
  /bin/sleep 10

  nohup /usr/local/mysql/libexec/ndbd -c > /dev/null &

  while [ 1 ]
  do
          sleep 60
  done

With Pacemaker it is no longer required to manipulate the cib in xml format, but for this post, xml offers a compact way of presenting the configuration. The first thing we need to define is the resources. A very handy resource type for us is the anything resource, which allows an arbitrary script or binary to be run. The resources section will look like:


  <resources>
    <primitive id="mgmd" class="ocf" type="anything" provider="heartbeat">
      <instance_attributes id="params-mgmd">
        <nvpair id="param-mgmd-binfile" name="binfile" value="/usr/local/bin/fake_ndb_mgmd"/>
        <nvpair id="param-mgmd-pidfile" name="pidfile" value="/var/run/heartbeat/"/>
      </instance_attributes>
    </primitive>
    <clone id="ndbdclone">
      <meta_attributes id="ndbdclone-options">
        <nvpair id="ndbdclone-option-1" name="globally-unique" value="false"/>
        <nvpair id="ndbdclone-option-2" name="clone-max" value="2"/>
        <nvpair id="ndbdclone-option-3" name="clone-node-max" value="1"/>
      </meta_attributes>
      <primitive id="ndbd" class="ocf" type="anything" provider="heartbeat">
        <instance_attributes id="params-ndbd">
          <nvpair id="param-ndbd-binfile" name="binfile" value="/usr/local/bin/fake_ndbd"/>
          <nvpair id="param-ndbd-pidfile" name="pidfile" value="/var/run/heartbeat/"/>
        </instance_attributes>
      </primitive>
    </clone>
    <primitive id="ndbcluster" class="ocf" type="anything" provider="heartbeat">
      <instance_attributes id="params-ndbcluster">
        <nvpair id="param-ndbcluster-binfile" name="binfile" value="/usr/local/bin/fake_ndb_cluster_start"/>
        <nvpair id="param-ndbcluster-pidfile" name="pidfile" value="/var/run/heartbeat/"/>
      </instance_attributes>
    </primitive>
  </resources>

Please note the ndbd resource is defined through the use of a clone set. The clone set will allow the cluster to start even if only one node of the ndb node group is available. If you have multiple ndb node groups, you'll need one clone set per node group. The ndb_mgmd nodes or eventual SQL nodes could have been handled the same way, although for SQL nodes, ndb_waiter is very handy. Once the resources are defined, we need to set up the constraints, which cover mandatory locations and ordering.


  <constraints>
    <rsc_location id="loc-1" rsc="mgmd" node="testvirtbox" score="INFINITY"/>
    <rsc_location id="loc-2" rsc="ndbcluster" node="testvirtbox" score="INFINITY"/>
    <rsc_location id="loc-3" rsc="ndbdclone" node="test1" score="INFINITY"/>
    <rsc_location id="loc-4" rsc="ndbdclone" node="test2" score="INFINITY"/>
    <rsc_order id="order-1">
      <resource_set id="ordered-set-1" sequential="true">
        <resource_ref id="mgmd"/>
        <resource_ref id="ndbdclone"/>
        <resource_ref id="ndbcluster"/>
      </resource_set>
    </rsc_order>
  </constraints>

And... that's it. For my part, I configured Pacemaker by dumping the cib in xml format, editing and reloading. In terms of commands, it means:


  cibadmin --query > local.xml
  vi local.xml
  cibadmin --replace --xml-file local.xml

Once NDB is started, you can even stop Heartbeat; it is no longer required.


As suggested by Florian, here is the configuration in CLI format:


  root@testvirtbox:~# crm configure show
  INFO: building help index
  INFO: object order-1 cannot be represented in the CLI notation
  node $id="27687295-f72c-49bd-b82d-25f32dbfe1e2" test2
  node $id="3086852d-abb9-4bdb-93a1-9390e14c148c" test1
  node $id="cad7f678-fc91-4f09-a39e-1dde6d5bcd30" testvirtbox
  primitive mgmd ocf:heartbeat:anything \
          params binfile="/usr/local/bin/fake_ndb_mgmd" pidfile="/var/run/heartbeat/"
  primitive ndbcluster ocf:heartbeat:anything \
          params binfile="/usr/local/bin/fake_ndb_cluster_start" pidfile="/var/run/heartbeat/"
  primitive ndbd1-IP ocf:heartbeat:anything \
          params binfile="/usr/local/bin/fake_ndbd" pidfile="/var/run/heartbeat/"
  clone ndbdclone ndbd1-IP \
          meta globally-unique="false" clone-max="2" clone-node-max="1"
  location loc-1 mgmd inf: testvirtbox
  location loc-2 ndbcluster inf: testvirtbox
  location loc-3 ndbdclone inf: test1
  location loc-4 ndbdclone inf: test2
  xml <rsc_order id="order-1"> \
          <resource_set id="ordered-set-1" sequential="true"> \
                  <resource_ref id="mgmd"/> \
                  <resource_ref id="ndbdclone"/> \
                  <resource_ref id="ndbcluster"/> \
          </resource_set> \
  </rsc_order>

Entry posted by Yves Trudeau |
