Percona Live Featured Tutorial with Morgan Tocker — MySQL 8.0 Optimizer Guide

Welcome to another post in the series of Percona Live featured tutorial speaker blogs! In these blogs, we’ll highlight some of the tutorial speakers who will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Morgan Tocker, MySQL Product Manager at Oracle. His tutorial is “MySQL 8.0 Optimizer Guide.” Many users who follow MySQL development are aware that recent versions introduced a number of improvements to query execution (via the addition of new execution strategies and an improved cost model). But what we don’t talk about enough is that the diagnostic features are also significantly better. I had a chance to speak with Morgan and learn a bit more about the MySQL Optimizer:

Percona: How did you get into database technology? What do you love about it?

Morgan: I started my career as a web developer, mainly focusing on the front end area. As the team I worked on grew and required different skills, I tried my hand at the back end programming. This led me to databases.

I think what I enjoyed about databases at the time was that front end design was a little bit too subjective for my tastes. With databases, you could prove what was “correct” by writing a simple micro-benchmark.  I joined the MySQL team in January 2006, and rejoined it again in 2013 after a five-year hiatus.

I don’t quite subscribe to this same view on micro benchmarks today, since it is very easy to accidentally (or intentionally) write a naïve benchmark. But I am still enjoying myself.

Percona: Your tutorial is called “MySQL 8.0 Optimizer Guide.” What exactly is the MySQL optimizer, and what new things have been added in MySQL 8.0?

Morgan: Because SQL is declarative (i.e., you state “what you want” rather than “do this then that”), there is a process that has to happen internally to prepare a query for execution. I like to describe it as similar to what happens when you enter an address in a GPS navigator. Some software then spits out the best steps on how to get there. In a database server, the Optimizer is that software code area.

There are a number of new optimizer features in MySQL 8.0, both in terms of new syntax supported and performance improvements to existing queries. These will be covered in some talks at the main conference (and also my colleague Øystein’s tutorial). The goal of my tutorial is to focus more on diagnostics than the enhancements themselves.

Percona: How can you use diagnostics to improve queries?

Morgan: To put it in numbers: it is not uncommon to see a user obsess over a configuration change that may yield a 2x improvement, and not spot the 100x improvement available by adding an index!

I like to say that users do not get the performance they are entitled to when they lack visibility into what the server is doing. The diagnostics have improved steadily:

  • In MySQL 5.6, an optimizer trace diagnostic was added so that you can see not only why the optimizer arrived at a particular execution plan, but also why other options were avoided.

  • In MySQL 5.7, the EXPLAIN FORMAT=JSON command now includes the cost information (the internal formula used for the calculations). My experience has been that sharing this detail actually makes the optimizer a lot easier to teach.

Good diagnostics by themselves do not make the improvements, but they bring required changes to the surface. On most systems, there are opportunities for improvements (indexes, hints, slight changes to queries, etc.).
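To make this concrete, here is a minimal sketch of both diagnostics; the table t and column a are hypothetical, not from Morgan’s tutorial:

mysql> SET optimizer_trace='enabled=on';
mysql> SELECT * FROM t WHERE a = 1;
mysql> SELECT trace FROM information_schema.OPTIMIZER_TRACE; -- why this plan was chosen, and what was rejected
mysql> SET optimizer_trace='enabled=off';
mysql> EXPLAIN FORMAT=JSON SELECT * FROM t WHERE a = 1;      -- per-step cost information in 5.7+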

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Morgan: Visibility into running systems has been a huge priority for the MySQL Engineering team over the last few releases. I think in many cases, force of habit leaves users relying on an older generation of diagnostics (EXPLAIN versus EXPLAIN FORMAT=JSON). My goal is to show them the light using the state-of-the-art stack. This is why I called my talk an 8.0 guide, even though much of it is still relevant for 5.7 and 5.6.

I also have a companion website to my tutorial at www.unofficalmysqlguide.com.

Percona: What are you most looking forward to at Percona Live?

Morgan: I’m excited to talk to users about MySQL 8.0, and not just in an optimizer sense. The MySQL engineering team has made a large investment in improving the reliability of MySQL with the introduction of a native data dictionary. I expect it will be the subject of many discussions, and a great opportunity for feedback.

There is also the social aspect for me, too. It will be 11 years since I first attended the predecessor to Percona Live. I have a lot of fond memories, and enjoy catching up with new friends and old over a beer!

You can find out more about Morgan Tocker and his work with the Optimizer at his tutorial website. Want to find out more about Morgan and MySQL query optimization? Register for Percona Live Data Performance Conference 2017, and see his MySQL 8.0 Optimizer Guide tutorial. Use the code FeaturedTalk and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.


Percona Live Europe 2016: “MySQL 8.0: what’s new in Optimizer” with Manyi Lu

Today was the first day of sessions at Percona Live Europe 2016, and it was packed with exciting talks and important information on open source databases. Some of the most anticipated talks are those that cover what to expect in MySQL 8.0.

One of those talks was given by Oracle’s Manyi Lu, Director of Software Development. She discussed MySQL 8.0: what’s new in Optimizer.

In her talk, Manyi discussed what we can look forward to in MySQL 8.0’s optimizer. There are substantial improvements in the optimizer in MySQL 5.7 and MySQL 8.0. Most noticeably, users can now combine relational data with NoSQL using the new JSON features. MySQL also now supports functional indexes through generated columns. MySQL 8.0 further makes DBAs’ lives easier with EXPLAIN for already-running queries and the query rewrite plugin.

On the performance side, MySQL can avoid creating temporary tables for UNION ALL, enable range scans on IN queries with row value expressions, and make use of condition filtering. MySQL 8.0 also adds histograms, invisible indexes, common table expressions and extended JSON support.
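Two of these 8.0 features boil down to one-liners; a hedged sketch with made-up table, column and index names:

mysql> ANALYZE TABLE t UPDATE HISTOGRAM ON c WITH 32 BUCKETS; -- build a histogram on a non-indexed column
mysql> ALTER TABLE t ALTER INDEX idx_c INVISIBLE;             -- hide an index from the optimizer without dropping it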

Percona’s EMEA Field Marketing Manager Kamal Taibi was able to chat with Manyi Lu and get some insights into her talk. Check it out below!


Introduction into storage engine troubleshooting: Q & A

In this blog, I will provide answers to the Q & A for the “Introduction into storage engine troubleshooting” webinar.

First, I want to thank everybody for attending the July 14 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: At which isolation level do … copy data from a table?

A: Both tools do not change the server’s default transaction isolation level. Use either SET TRANSACTION ISOLATION LEVEL or set transaction-isolation in my.cnf.
Q: Can I create an index to optimize a query which has group by A and order by B, both from different tables and A column is from the first table in the two table join?

A: Do you mean a query like SELECT A, B, count(*) FROM a JOIN b ON (a.A=b.id) WHERE b.B < 4 GROUP BY a.A, b.B ORDER BY b.B ASC? Yes, this is possible:

mysql> explain select A, B, count(*) from a join b on(a.A=b.id) WHERE b.B < 4 GROUP BY a.A, b.B ORDER BY b.B ASC;
| id | select_type | table | type  | possible_keys | key  | key_len | ref       | rows | Extra                                                     |
|  1 | SIMPLE      | b     | range | PRIMARY,B     | B    | 5       | NULL      |   15 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | a     | ref   | A             | A    | 5       | test.b.id |    1 | Using index                                               |
2 rows in set (0.00 sec)

Q: Where can I find recommendations on what kind of engine to use for different application types or use cases?

A: Storage engines are always being actively developed, therefore I suggest that you don’t search for generic recommendations. These can be outdated just a few weeks after they are written. Study the engines instead. For example, just a few years ago MyISAM was the only engine (among those officially supported) that could work with FULLTEXT indexes and SPATIAL columns. Now InnoDB supports both: FULLTEXT indexes since version 5.6 and GIS features since 5.7. Today I can recommend InnoDB as a general-purpose engine for all installations, and TokuDB for write-heavy workloads when you cannot use high-speed disks.
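To illustrate the point about InnoDB catching up, here is a minimal sketch; the table and column names are mine, not from the webinar:

CREATE TABLE articles (
  id int not null auto_increment primary key,
  body TEXT,
  location GEOMETRY NOT NULL,
  FULLTEXT KEY ft_body (body),       -- supported by InnoDB since 5.6
  SPATIAL KEY sp_location (location) -- supported by InnoDB since 5.7
) ENGINE=InnoDB;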

Alternative storage engines can help to realize specific business needs. For example, CONNECT brings data to your server from many sources, SphinxSE talks to the Sphinx daemon, etc.

Other alternative storage engines increase the speed of certain workloads. Memory, for example, can be a good fit for temporary tables.

Q: Can you please explain how we find the full text of the query when we query the view ‘statements_with_full_table_scans’?

A: Do you mean the view in the sys schema? Sys schema views take their information from the digest summary tables in Performance Schema, which do not contain full queries (only normalized digests). The full text of individual queries can be found in the events_statements_current, events_statements_history and events_statements_history_long tables in Performance Schema. Note that even the events_statements_history_long table can be rewritten very quickly, and you may want to save data from it periodically.

Q: Hi, is TokuDB supported for the new document protocol?

A: As Alex Rubin showed in his detailed blog post, the new document protocol just converts NoSQL queries into SQL, and is thus not limited to any storage engine. To use documents and collections, a storage engine must support generated columns (which TokuDB currently does not). So support of X Protocol for TokuDB is limited to relational tables access.
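As a hedged sketch of the generated-columns requirement mentioned above (a hypothetical table t with a JSON column doc, on MySQL 5.7+ with InnoDB):

ALTER TABLE t
  ADD COLUMN doc_name VARCHAR(50)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(doc, '$.name'))) VIRTUAL,
  ADD INDEX idx_doc_name (doc_name); -- secondary index on the virtual column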

Q: Please comment on “read committed” versus “repeatable read.”
Q: Repeatable read holds the cursor on the result set for the client versus read committed where the cursor is updated after a transaction.

A: READ COMMITTED and REPEATABLE READ are transaction isolation levels, whose details are explained here. I would not correlate locks set on table rows in different transaction isolation modes with the result set. A transaction with isolation level REPEATABLE READ instead creates a snapshot of the rows accessed by the transaction. Let’s consider a table:

mysql> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.56 sec)
mysql> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.03 sec)
Records: 9  Duplicates: 0  Warnings: 0

Then start the transaction and select a few rows from this table:

mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
| id | f1   |
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
4 rows in set (0.04 sec)

Now let’s update another set of rows in another transaction:

mysql2> update ti set f1 = id*2 where id > 5;
Query OK, 4 rows affected (0.06 sec)
Rows matched: 4  Changed: 4  Warnings: 0
mysql2> select * from ti;
| id | f1   |
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
9 rows in set (0.00 sec)

You see that the first four rows – which we accessed in the first transaction – were not modified, while the last four were. If InnoDB only saved the cursor (as suggested in the question above), we would expect to see the same result if we re-ran the SELECT * query in our old transaction, but it actually shows the whole table content as it was before the modification:

mysql1> select * from ti;
| id | f1   |
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
|  6 |    6 |
|  7 |    7 |
|  8 |    8 |
|  9 |    9 |
9 rows in set (0.00 sec)

So “snapshot” is a better word than “cursor” for the result set. In the case of READ COMMITTED, the first transaction would see the modified rows:

mysql1> drop table ti;
Query OK, 0 rows affected (0.11 sec)
mysql1> create table ti(id int not null primary key, f1 int) engine=innodb;
Query OK, 0 rows affected (0.38 sec)
mysql1> insert into ti values(1,1), (2,2), (3,3), (4,4), (5,5), (6,6), (7,7), (8,8), (9,9);
Query OK, 9 rows affected (0.04 sec)
Records: 9  Duplicates: 0  Warnings: 0
mysql1> set transaction isolation level read committed;
Query OK, 0 rows affected (0.00 sec)
mysql1> begin;
Query OK, 0 rows affected (0.00 sec)
mysql1> select * from ti where id < 5;
| id | f1   |
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
4 rows in set (0.00 sec)

Let’s update all rows in the table this time:

mysql2> update ti set f1 = id*2;
Query OK, 9 rows affected (0.04 sec)
Rows matched: 9  Changed: 9  Warnings: 0

Now the first transaction sees not only the modified rows with id >= 5 (which were not in the initial result set), but also the modified rows with id < 5 (which were in the initial result set):

mysql1> select * from ti;
| id | f1   |
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
|  6 |   12 |
|  7 |   14 |
|  8 |   16 |
|  9 |   18 |
9 rows in set (0.00 sec)


MySQL indexing 101: a challenging single-table query

We discussed in an earlier post how to design indexes for many types of queries using a single table. Here is a real-world example of the challenges you will face when trying to optimize queries: two similar queries, but one is performing a full table scan while the other one is using the index we specially created for these queries. Bug or expected behavior? Read on!

Our two similar queries

# Q1
mysql> explain select col1, col2 from t where ts >= '2015-04-30 00:00:00';
| id | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
|  1 | SIMPLE      | t             | ALL  | ts            | NULL | NULL    | NULL | 4111896 | Using where |
# Q2
mysql> explain select count(*) from t where ts >='2015-04-30 00:00:00';
| id | select_type | table         | type  | possible_keys | key          | key_len | ref  | rows    | Extra                    |
|  1 | SIMPLE      | t             | range | ts            | ts           | 5       | NULL | 1809458 | Using where; Using index |

Q1 runs a full-table scan while Q2 uses the index on ts, which by the way is covering (see “Using index” in the Extra field). Why such different execution plans?

Let’s try to understand what happens with Q1.

This is a query with a single inequality on the ts field, and we have an index on ts. The optimizer tries to see if this index is usable (see the possible_keys field); this is all very logical. Now if we look at the rows field for Q1 and Q2, we can see that the index would allow us to read only 45% of the records (1.8M out of 4.1M). Granted, this is not excellent, but it should be much better than a full table scan anyway, right?

If you think so, read carefully what’s next. Because this assumption is simply not correct!

Estimating the cost of an execution plan (simplified)

First of all, the optimizer does not know if data or indexes are in memory or need to be read from disk, it will simply assume everything is on disk. What it does know however is that sequential reads are much faster than random reads.

So let’s execute Q1 with the index on ts. Step 1 is to perform a range scan on this index to identify the 1.8M records that match the condition: this is a sequential read, so this is quite fast. However now step 2 is to get the col1 and col2 fields for each record that match the condition. The index provides the primary key value for each matching record so we will have to run a primary key lookup for each matching record.

Here is the issue: 1.8M primary key lookups are equivalent to 1.8M random reads, so this will take a lot of time. Much more time than sequentially reading the full table (which means doing a full scan of the primary key, because we are using InnoDB here).

Contrast that with how Q2 can be executed with the index on ts. Step 1 is the same: identify the 1.8M matching records. But the difference is: there’s no step 2! That’s why we call this index a ‘covering index’: we don’t need to run point queries on the primary key to get extra fields. So this time, using the index on ts is much more efficient than reading the full table (which again would mean that we would do a full-table scan of the primary key).

Now there is one more thing to understand: a full-table scan is a sequential operation when you think about it from a logical point of view; however, the InnoDB pages are certainly not stored sequentially on disk. So at the disk level, a full table scan is closer to multiple random reads than to a single large sequential read.

However it is still much faster than a very large number of point queries, and it’s easy to understand why: when you read a 16KB page during a full table scan, all the records in it will be used, while when you read a 16KB page for a random read, you might only use a single record. So in the worst case, reading 1.8M records will require 1.8M random reads, while reading the full table with 4M records will only require around 100K random reads – the full table scan is still an order of magnitude faster.

Optimizing our query

Now that we’ve understood why the optimizer chose a full table scan for Q1, is there a way to make it run faster by using an index? If we can create a covering index, we will no longer need the expensive primary key lookups. Then the optimizer is very likely to choose this index over a full table scan. Creating such a covering index is easy:

ALTER TABLE t ADD INDEX idx_ts_col1_col2 (ts, col1, col2);

Some of you may object that because we have an inequality on ts, the other columns cannot be used. This would be true if we had conditions on col1 or col2 in the WHERE clause, but that does not apply here since we’re only adding these extra columns to get a covering index.
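As a sanity check against the same (hypothetical) table, re-running EXPLAIN after the ALTER should now show key=idx_ts_col1_col2 with “Using index” in the Extra field instead of the full table scan:

mysql> EXPLAIN SELECT col1, col2 FROM t WHERE ts >= '2015-04-30 00:00:00';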


Understanding how indexes can be used to filter, sort or cover is paramount to optimizing queries, even simple ones. Understanding (even approximately) how a query is run according to a given execution plan is also very useful; otherwise you will sometimes be puzzled by the decisions made by the optimizer.

Also note that beginning in MySQL 5.7, the cost model can be tuned. This can help the optimizer make better decisions: for instance random reads are far cheaper on fast storage than on regular disks.

The post MySQL indexing 101: a challenging single-table query appeared first on MySQL Performance Blog.


Optimizer hints in MySQL 5.7.7 – The missed manual

In MySQL 5.7.7, Oracle introduced a promising new feature: optimizer hints. However, it did not publish any documentation for the hints. The only note I found in the user manual about them is:

  • It is now possible to provide hints to the optimizer by including /*+ … */ comments following the SELECT, INSERT, REPLACE, UPDATE, or DELETE keyword of SQL statements. Such statements can also be used with EXPLAIN. Examples:
    SELECT /*+ NO_RANGE_OPTIMIZATION(t3 PRIMARY, f2_idx) */ f1
    FROM t3 WHERE f1 > 30 AND f1 < 33;
    SELECT /*+ BKA(t1, t2) */ * FROM t1 INNER JOIN t2 WHERE ...;
    SELECT /*+ NO_ICP(t1) */ * FROM t1 WHERE ...;

There are also three worklogs: WL #3996, WL #8016 and WL #8017. But they describe the general concept and do not have much information about which optimizations can be used and how. More light on this is shed by slide 59 from Øystein Grøvlen’s session at Percona Live. But that’s all: no “official” full list of possible optimizations, no use cases… nothing.

I tried to sort it out myself.

My first finding is that slide 59 really lists six of the seven possible hints. Confirmation for this exists in one of the two new files under the sql directory of the MySQL source tree, created for this new feature.

$cat sql/opt_hints.h
/*
  Hint types, MAX_HINT_ENUM should be always last.
  This enum should be synchronized with opt_hint_info
  array (see opt_hints.cc).
*/
enum opt_hints_enum
Looking into the file sql/opt_hints.cc, we can find that these optimizations do not give much choice: either enable or disable.

$cat sql/opt_hints.cc
struct st_opt_hint_info opt_hint_info[]=
{
  {"BKA", true, true},
  {"BNL", true, true},
  {"ICP", true, true},
  {"MRR", true, true},
  {"NO_RANGE_OPTIMIZATION", true, true},
  {"MAX_EXECUTION_TIME", false, false},
  {"QB_NAME", false, false},
  {0, 0, 0}
};

The chosen way to include hints in SQL statements, inside comments with a plus sign (/*+ ... */), is compatible with the style of optimizer hints that Oracle Database uses.

We actually had access to these optimizations before: they were accessible via the optimizer_switch variable, at least for BKA, BNL, ICP and MRR. But with the new syntax we can not only modify this access globally or per session, but also turn a particular optimization on or off for a single table or index in a query. I can demonstrate it with this quite artificial, but always accessible, example:

mysql> use mysql
Database changed
mysql> explain select * from user where host in ('%', '');
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra                 |
|  1 | SIMPLE      | user  | NULL       | range | PRIMARY       | PRIMARY | 180     | NULL |    2 |   100.00 | Using index condition |
1 row in set, 1 warning (0.01 sec)
mysql> explain select /*+ NO_RANGE_OPTIMIZATION(user PRIMARY) */ * from user where host in ('%', '');
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows | filtered | Extra       |
|  1 | SIMPLE      | user  | NULL       | ALL  | PRIMARY       | NULL | NULL    | NULL |    5 |    40.00 | Using where |
1 row in set, 1 warning (0.00 sec)

I used one more hint here, one which we could not previously turn on or off directly: range optimization.

One more “intuitively” documented feature is the ability to turn a particular optimization on or off per object. This works only for BKA, BNL, ICP and MRR: you can specify NO_BKA(table[[, table]…]), NO_BNL(table[[, table]…]), NO_ICP(table indexes[[, table indexes]…]) and NO_MRR(table indexes[[, table indexes]…]) to avoid using these algorithms for a particular table or index in the JOIN.
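Several such hints can also be combined in a single comment. A sketch with made-up table and index names (NO_BNL is table-level, NO_ICP is index-level):

SELECT /*+ NO_BNL(t1, t2) NO_ICP(t2 idx_a, idx_b) */ *
FROM t1 JOIN t2 ON t1.id = t2.t1_id WHERE t2.a > 10;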

MAX_EXECUTION_TIME does not take any table or key name. Instead, you specify the maximum time in milliseconds that the query should be allowed to run:

mysql> select /*+ MAX_EXECUTION_TIME(1000) */  sleep(1) from user;
ERROR 3024 (HY000): Query execution was interrupted, max_statement_time exceeded
mysql> select /*+ MAX_EXECUTION_TIME(10000) */  sleep(1) from user;
| sleep(1) |
|        0 |
|        0 |
|        0 |
|        0 |
|        0 |
5 rows in set (5.00 sec)

QB_NAME is more complicated. WL #8017 tells us this is a “custom context”. But what is that? The answer is in the MySQL test suite! Tests for optimizer hints live in the file t/opt_hints.test. For QB_NAME, the very first entry is this query:

EXPLAIN SELECT /*+ NO_ICP(t3@qb1 f3_idx) */ f2 FROM
  (SELECT /*+ QB_NAME(QB1) */ f2, f3, f1 FROM t3 WHERE f1 > 2 AND f3 = 'poiu') AS TD
    WHERE TD.f1 > 2 AND TD.f3 = 'poiu';

So we can assign a custom QB_NAME to any subquery, and then apply an optimizer hint only to that context.

To conclude this quick overview, I want to show a practical example of when query hints are really needed. Last week I worked on an issue where a customer upgraded from MySQL version 5.5 to 5.6 and found that some of their queries started working slower than before. I wrote an answer which could sound funny, but still remains correct: “One of the reasons for such behavior is optimizer improvements. While they are all made for better performance, some queries – optimized for older versions – can start working slower than before.”

To demonstrate a public example of such a query, I will use my favorite source of information: the MySQL Community Bugs Database. Searching for optimizer regression bugs that are still not fixed, we can find bug #68919, demonstrating a regression when the MRR algorithm is used for queries with LIMIT. Running the queries shown in the bug report, we see a huge difference:

mysql> SELECT * FROM t1 WHERE i1>=42 AND i2<=42 LIMIT 1;
| pk | i1 | i2 | i3 |
| 42 | 42 | 42 | 42 |
1 row in set (6.88 sec)
mysql> explain SELECT * FROM t1 WHERE i1>=42 AND i2<=42 LIMIT 1;
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref  | rows    | filtered | Extra                            |
|  1 | SIMPLE      | t1    | NULL       | range | idx           | idx  | 4       | NULL | 9999958 |    33.33 | Using index condition; Using MRR |
1 row in set, 1 warning (0.00 sec)
mysql> SELECT /*+ NO_MRR(t1) */ *  FROM t1  WHERE i1>=42 AND i2<=42 LIMIT 1;
| pk | i1 | i2 | i3 |
| 42 | 42 | 42 | 42 |
1 row in set (0.00 sec)

With MRR, query execution takes 6.88 seconds, and essentially zero when MRR is not used! But the bug report itself suggests using optimizer_switch as a workaround. And this will work perfectly well if you are OK with running

SET optimizer_switch="mrr=off";

every time you run a query that would benefit from having it OFF. With optimizer hints, you can have one algorithm ON for a particular table in the query and OFF for another one. I again took a quite artificial example, but it demonstrates the method:

mysql> explain select /*+ MRR(dept_emp) */ * from dept_emp where to_date in  (select /*+ NO_MRR(salaries)*/ to_date from salaries where salary >40000 and salary <45000) and emp_no >10100 and emp_no < 30200 and dept_no in ('d005', 'd006','d007');
| id | select_type  | table       | partitions | type   | possible_keys          | key        | key_len | ref                        | rows    | filtered | Extra                                         |
|  1 | SIMPLE       | dept_emp    | NULL       | range  | PRIMARY,emp_no,dept_no | dept_no    | 8       | NULL                       |   10578 |   100.00 | Using index condition; Using where; Using MRR |
|  1 | SIMPLE       | <subquery2> | NULL       | eq_ref | <auto_key>             | <auto_key> | 3       | employees.dept_emp.to_date |       1 |   100.00 | NULL                                          |
|  2 | MATERIALIZED | salaries    | NULL       | ALL    | salary                 | NULL       | NULL    | NULL                       | 2838533 |    17.88 | Using where                                   |
3 rows in set, 1 warning (0.00 sec)


The post Optimizer hints in MySQL 5.7.7 – The missed manual appeared first on MySQL Performance Blog.


Q&A: Even More Deadly Mistakes of MySQL Development

On Wednesday I gave a presentation on “How to Avoid Even More Common (but Deadly) MySQL Development Mistakes” for Percona MySQL Webinars.  If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions.  I answered as many as we had time for during the session, but here are all the questions with my complete answers:

Q: Disk bandwidth also not infinite ;-)

Indeed, you’re right!

We discussed in the webinar the impact on network bandwidth from using column wildcards in queries like SELECT *, but it’s also possible that using SELECT * can impact disk operations. Varchar, Blob, or Text columns can be stored on extra pages in the database, and if you include those columns in your query needlessly, it can cause the storage engine to do a lot of seeks and page reads unnecessarily.

For more details on string storage in InnoDB, see Peter Zaitsev’s blog on Blob Storage in Innodb.

Q: How many tables can be joined in a single query? What is the optimal number of joins?

MySQL has a limit of 63 table references in a given query. This limits how many JOIN operations you can do, and also limits the number of UNIONs. Actually, you can go over this limit if your JOIN or UNION doesn’t reference any tables, that is, if it creates a derived table of one row of expressions.

If you do join a lot of tables (or even self-join the same table many times), you’re likely to hit a practical scaling limit long before you reach 63 table references. The practical limit in your case depends on many factors, including the size of the tables, the data types, the type of join expressions in your queries, and your physical server’s capabilities. It’s not a fixed limit I can cite for you.

If you think you need dozens of table references in a single query, you should probably step back and reconsider your database design or your query design.

I often see this type of question (“what is the limit on the number of joins?”) when people try to use key/value tables, also called Entity-Attribute-Value, and they’re trying to pivot attributes from rows into columns, as if the table were stored in a conventional way with one column per attribute. This is a broken design for many reasons, and the scalability of many-way joins is just one problem with it.

Q: How many indexes can be created in a single table? Any limitation? What is the optimal number of indexes?

All MySQL storage engines support at least 16 indexes per table.

As far as the optimal number of indexes goes, I don’t pay attention to the number of indexes (as long as it remains below the engine’s maximum). I try to make sure I have the right indexes for my queries. If you put an arbitrary cap of, for example, 8 or 10 indexes on a given table, then you might be running queries that lack a needed index, and the unnecessary extra cost of running those queries is probably greater than the cost of maintaining the one extra index they need.

That said, there are cases where you have such variation in query types that there’s no way to have optimal indexes covering every possible case. Given that you can have multi-column indexes, and multi-column indexes with columns in different orders, the number of possible indexes on a table with n columns grows factorially.

Q: There is a table with 3 columns: id (int), user_id (int), day (date). There is a high chance the same user_id will ‘exist’ for every day. I read data with “where user_id = some_id” (very high throughput) and delete all entries once a day by cron using “where sent_date = ’2014-01-01′”. With approx. 6M rows per day, deletion is pretty painful. Will partitioning by column ‘day’ help me delete those bulks faster? If yes, how much faster? How much will it slow down SELECTs? Not all entries are deleted, only entries for some specific old day, e.g. WHERE day = ’1 week ago’.

Range partitioning by date would give you the opportunity to ALTER TABLE…DROP PARTITION, so you could remove all data for a given date very quickly, much faster than deleting millions of rows. The performance of DROP PARTITION is like that of DROP TABLE, because each partition is physically stored like a separate table.

Searching for “where user_id = ?” would not be able to take advantage of partition pruning, but it would still be able to use an index on user_id. And if you drop old partitions, the benefit of searching a smaller table could be a good tradeoff.
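A sketch of that approach, assuming the table is named t and the primary key is extended to (id, day), since MySQL requires the partitioning column to be part of every unique key:

ALTER TABLE t PARTITION BY RANGE (TO_DAYS(day)) (
  PARTITION p20140101 VALUES LESS THAN (TO_DAYS('2014-01-02')),
  PARTITION p20140102 VALUES LESS THAN (TO_DAYS('2014-01-03')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);
ALTER TABLE t DROP PARTITION p20140101; -- removes that day's rows almost instantly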

Q: Regarding 20% selectivity as a threshold for the optimizer preferring a table-scan to an index lookup – is that a tunable?

No, it’s not tunable, it’s a fixed behavior of the query optimizer. If you search for a value and the optimizer estimates that > 20% of rows contain the value you search for, it will bypass the index and just do a table-scan.

It’s for the same reason that the index of a book doesn’t contain very common words: the list of pages such a word appears on would be too long, and flipping back and forth from the back of the book to each listed page would actually be more work than just reading the book.

Also keep in mind my figure of 20% is approximate. Your results may vary. This is not a magic threshold in the source code, it’s just a tendency I have observed.

Q: Regarding generating synthetic test data, it sounds like a pretty easy perl script to write.

Yes, it might be easy to do that for one given table. But every table is different, and you might have hundreds of tables in dozens of applications to generate test data for. You might also want to vary the distribution of data values from one test to another.

Writing a test-data generator for one particular case is easy, so you might reasonably do it as a one-off task. Writing a general-purpose test-data generator that you can use for many cases is more work.

Q: Would love to have the set of URLs cited in the presentation without having to go back and mine them out of the presentation.

Open source message queues:

MySQL Performance Blog articles:

Open source test-data generator:

Load-testing tools for web applications:

Load-testing tools to replay query logs:

Further reading for implementing business rules:

Q: How best to use the MySQL query cache?

Any cache is best used if you read from it many times for each time you write to it. So we’d like to estimate the average ratio of query cache reads to writes, to estimate how much leverage it’s giving us.
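These counters are exposed as server status variables; a quick way to pull them:

mysql> SHOW GLOBAL STATUS LIKE 'Qcache%';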


Check the values of Qcache_hits (cases when a query result was read from the query cache) over Qcache_inserts (cases when the desired query result was not in the cache, and the query had to be run and its result then stored in the cache). I like to see a ratio of 1000% or more (i.e., 10:1 hits to inserts).

If you have a poor ratio, for example less than 1:1 or less than 100%, then consider disabling the query cache, because it may be costing more to maintain it than the performance benefit it’s giving you.

Keep in mind that this is only a guideline, because the calculation I described is only an average. It could be that the queries served by the query cache are very expensive, so using the cached result is a great benefit even if it accounts for a small number of hits. The only way to be certain is to load-test your application under your load, and compare overall performance results with the query cache enabled or disabled, and at different sizes.

Q: How can we detect when too many indexes start to affect performance?

Some people are reluctant to create indexes because they have been warned that indexes require synchronous updates when you INSERT, UPDATE, or DELETE rows. Some people also make the generalization that indexes harm writes but benefit reads. Both of these beliefs are inaccurate.

Your DML operations aren’t really updating indexes in real time. InnoDB includes a feature called change buffering, which defers index updates. The change buffer is gradually merged into the index over time. That way, InnoDB can handle a big spike in traffic without it hurting throughput as much. You can monitor how much content in the change buffer remains to be merged:
mysql> SHOW GLOBAL STATUS LIKE 'Innodb_ibuf_size';

It’s also not accurate that indexes hurt writes. UPDATE and DELETE statements usually have a WHERE clause, to apply the changes to particular rows. These conditions use indexes to reduce the examined rows, just like in SELECT statements. But in UPDATE and DELETE statements, it’s even more important to use indexes, because otherwise the statement has to lock a lot of rows to ensure it locks the rows you’re changing.

So I generally say, don’t avoid indexes based only on the number of indexes you have, just make sure your indexes are being employed by the queries you run, and drop indexes that aren’t used. Here are a couple of past blog posts that show how to do this:

Thanks again for attending my webinar!  Here are some more tips:

  • Check out upcoming Percona Training classes in North America and Europe.
  • Join Percona and the MySQL community at Percona Live.
  • Watch more webinars from Percona in the future!

The post Q&A: Even More Deadly Mistakes of MySQL Development appeared first on MySQL Performance Blog.


Many-table joins in MySQL 5.6

I recently worked on an uncommon slow query: less than 100 rows were read and returned, the whole dataset was fitting in memory but the query took several seconds to run. Long story short: the query was a join involving 21 tables, running on MySQL 5.1. But by default MySQL 5.1 is not good at handling joins with such a large number of tables. The good news is that MySQL 5.6 brings welcome improvements.

Isolating the problem

As always with a slow query, finding the execution plan with EXPLAIN is the first step to understanding where time is spent. Here the plan was very good, with almost all joins using the primary key or a unique key, but perhaps the most interesting part was that EXPLAIN was very slow as well. This indicates that the optimizer takes a lot of time to find the optimal execution plan. To double-check, we can run SHOW PROFILE:

mysql> set @@profiling = 1;
mysql> SELECT ....;
| Status             | Duration |
| starting           | 0.000783 |
| Opening tables     | 0.000192 |
| System lock        | 0.000005 |
| Table lock         | 0.000010 |
| init               | 0.000369 |
| optimizing         | 0.000091 |
| statistics         | 3.459529 |
| preparing          | 0.000213 |
| executing          | 0.000005 |
| Sending data       | 0.004597 |
| end                | 0.000008 |
| query end          | 0.000005 |
| freeing items      | 0.000066 |
| logging slow query | 0.000004 |
| cleaning up        | 0.000006 |

‘Statistics’ means ‘finding a good execution plan’. Okay, we have found why the query is slow.

If you have read this post, you already know how to fix the problem: set optimizer_search_depth = 0 and the response time instantly drops to 0.02s. As also mentioned in that post, we should expect MySQL 5.6 to be much better at handling such queries. Well, let’s give it a try!
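For reference, the setting can be changed per session, and a value of 0 tells the server to pick the search depth automatically:

mysql> SET SESSION optimizer_search_depth = 0;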

MySQL 5.6

With the default value of optimizer_search_depth, response time is 0.02s in 5.6 and we can see in SHOW PROFILE that the time spent finding a good execution plan is very small:

| Status               | Duration |
| starting             | 0.002022 |
| checking permissions | 0.000018 |
| checking permissions | 0.000015 |
| Opening tables       | 0.000130 |
| init                 | 0.001093 |
| System lock          | 0.000048 |
| optimizing           | 0.000200 |
| statistics           | 0.009404 |
| preparing            | 0.000432 |
| executing            | 0.000029 |
| Sending data         | 0.005409 |
| end                  | 0.000027 |
| query end            | 0.000009 |
| closing tables       | 0.000034 |
| freeing items        | 0.000082 |
| cleaning up          | 0.000030 |

However ‘Statistics’ is still the most expensive step. What if we set optimizer_search_depth = 0? It is even better:

| statistics           | 0.001896 |

And the response time drops to 0.01s.


  • EXPLAIN is a great tool to understand how the server will run a query, but it does not tell you everything. Additional tools (here, SHOW PROFILES) are often useful as well.
  • By default MySQL 5.6 is much better than the previous versions at handling joins with many tables: great job optimizer team at Oracle!
  • However, even with 5.6, you may get better response time with optimizer_search_depth = 0.

The post Many-table joins in MySQL 5.6 appeared first on MySQL Performance Blog.


Multiple column index vs multiple indexes with MySQL 5.6

A question often comes up when talking about indexing: should we use multiple column indexes or multiple indexes on single columns? Peter Zaitsev wrote about it back in 2008, and the conclusion then was that a multiple column index is most often the best solution. But with all the recent optimizer improvements, is anything different with MySQL 5.6?


For this test, we will use these 2 tables (same structure as in Peter’s post):

CREATE TABLE t1000merge (
  id int not null auto_increment primary key,
  i int(11) NOT NULL,
  j int(11) NOT NULL,
  val char(10) NOT NULL,
  KEY i (i),
  KEY j (j)
);

CREATE TABLE t1000idx2 (
  id int not null auto_increment primary key,
  i int(11) NOT NULL,
  j int(11) NOT NULL,
  val char(10) NOT NULL,
  KEY ij (i,j)
);

Tables were populated with 1M rows for this test; i and j each have 1000 distinct values (independent of each other). The buffer pool is large enough to hold all data and indexes.

We will look at this query on MySQL 5.5.35 and MySQL 5.6.15:

SELECT sum(length(val)) FROM T WHERE j=2 AND i BETWEEN 100 and 200

Why this specific query? With MySQL 5.5, for t1000idx2, the optimizer estimates that the index on (i,j) is not selective enough and it falls back to a full table scan. While for t1000merge, the index on (j) is an obvious good candidate to filter efficiently.

Consequently this query has a better response time on t1000merge (0.01s) than on t1000idx2 (0.45s).

On MySQL 5.6, this query is a good candidate for index condition pushdown (ICP), so we can reasonably hope that response time for t1000idx2 will improve.

ICP: FORCE INDEX to the rescue

Unfortunately, the optimizer still prefers the full table scan, which gives us the same bad response time:

mysql5.6> EXPLAIN SELECT sum(length(val)) FROM t1000idx2 WHERE j=2 AND i BETWEEN 100 and 200;
| id | select_type | table     | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
|  1 | SIMPLE      | t1000idx2 | ALL  | ij            | NULL | NULL    | NULL | 1000545 | Using where |

And what if we use FORCE INDEX?

mysql5.6 > EXPLAIN SELECT sum(length(val)) FROM  t1000idx2 FORCE INDEX(ij) WHERE j=2 AND i BETWEEN 100 and 200;
| id | select_type | table     | type  | possible_keys | key  | key_len | ref  | rows   | Extra                 |
|  1 | SIMPLE      | t1000idx2 | range | ij            | ij   | 8       | NULL | 188460 | Using index condition |

This time ICP is used (see “Using index condition” in the Extra field)!

And the difference in response time is impressive:
– Without FORCE INDEX (full table scan): 0.45s
– With FORCE INDEX (multiple column index + index condition pushdown): 0.04s, a 10x improvement!

Additional thoughts

It is interesting to see that the optimizer fails to find the best execution plan for this simple query. The optimizer trace (read back from the information_schema.OPTIMIZER_TRACE table) sheds some light:

mysql> SET optimizer_trace="enabled=on";
mysql> SELECT sum(length(val)) FROM  T  WHERE j=2 AND i BETWEEN 100 and 200;
"range_analysis": {
                  "table_scan": {
                    "rows": 1000545,
                    "cost": 202835

This is the estimated cost for a full table scan.
Now we will see how the optimizer estimates the cost of the range scan using the ij index:

"range_scan_alternatives": [
                        "index": "ij",
                        "ranges": [
                          "100 <= i <= 200"
                        "index_dives_for_eq_ranges": true,
                        "rowid_ordered": false,
                        "using_mrr": false,
                        "index_only": false,
                        "rows": 188460,
                        "cost": 226153,
                        "chosen": false,
                        "cause": "cost"

At this stage the optimizer does not know if ICP can be used. This probably explains why the cost of the range scan is overestimated.

If we look at the optimizer trace for the query with the FORCE INDEX hint, ICP is only detected after the range scan is chosen:

"refine_plan": [
                "table": "`t1000idx2` FORCE INDEX (`ij`)",
                "pushed_index_condition": "((`t1000idx2`.`j` = 2) and (`t1000idx2`.`i` between 100 and 200))",
                "table_condition_attached": null,
                "access_type": "range"


Multiple column index vs. multiple indexes? Having indexes on single columns often leads the optimizer to use the index_merge access type, which is typically not as good as accessing a single index on multiple columns. MySQL 5.6 makes multiple column indexes more efficient than before with index condition pushdown.

But don’t forget that the optimizer is not perfect: you may have to use index hints to benefit from this feature.

The post Multiple column index vs multiple indexes with MySQL 5.6 appeared first on MySQL Performance Blog.


MySQL Query Patterns, Optimized – Webinar questions followup

On Friday I gave a presentation on “MySQL Query Patterns, Optimized” for Percona MySQL Webinars.  If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions.  I answered as many as we had time for during the session, but here are all the questions with my complete answers:

Q: Can you compare the use of subqueries/multiple joins vs. multiple queries (e.g. temp tables)?

For performance, it’s hard to make a general rule, because there are different types of subqueries (correlated, non-correlated, derived tables, scalar subqueries) and we saw in my presentation that sometimes these perform worse and sometimes better than another equivalent solution.

If you’re talking about using a temporary table to store an interim result set, and then use that to ultimately produce the result you want, that’s another technique that I see used, although it’s less common.  It can be helpful to simplify the development of a report, but whether it’s a benefit to performance is a case-by-case decision.

There are other drawbacks to using temporary tables in this way, for example when using replication, a temporary table could vanish if the slave restarts and then subsequent queries against the temp table fail.

Q: Why doesn’t the query optimizer take care the execution decisions? Since they are all logically the same, I believe it should all translate to the same execution plan. (I know it doesn’t though.)

Why don’t you try writing a SQL query optimizer and see how easy it is?  ;-)  Generative grammars like SQL can produce an infinite variety of  queries.  The query optimizer analyzes queries for common patterns the MySQL designers know can be improved.  But there’s a finite amount of engineer-years spent developing and testing the query optimizer.  The developers have to make a decision about which query patterns the optimizer recognizes, and which are left to the developer to refactor manually.  There could also be some cases where optimizing a query automatically would be more costly than just running the query in a suboptimal form.

Q: Doesn’t the primary key solution for random selection only work when the IDs for movies are distributed uniformly over the range 1..MAX(id)?

Yes, there’s a risk if you have irregular distribution of matching rows that you’ll get skewed results.  For instance, if there’s a gap of id’s between 47000 and 50000 with no movies, but your random number generator picks values in that range with equal frequency as any other range, then the movie immediately following the “gap” will be picked more frequently.

Nearly every solution for randomly choosing rows involves some compromise, either of performance, or of accuracy of randomness.  The ORDER BY RAND() solution is known to have poor performance, but it returns a better random choice.

Another workaround may be to add a column to the table, and populate the rows you want to choose (movies, in this case) with  values known to be consecutive.  Then you could choose a random value, look up the row with “=” instead of “>” and be guaranteed to find exactly one match.  But the tradeoff of this solution is that it requires storing another column, and reinitializing the sequence after making certain insert/update/delete operations.  If you have a dataset that changes infrequently, then this might be a good tradeoff.
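A sketch of that workaround, assuming a hypothetical movies table with a gap-free rand_rank column numbered 1..N:

mysql> SET @r = FLOOR(1 + RAND() * (SELECT MAX(rand_rank) FROM movies));
mysql> SELECT * FROM movies WHERE rand_rank = @r;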

Q: Is the tuple comparison independent of sorting order? Wouldn’t it be a problem if the keywords were returned in a different order from what you specify in the query?

The tuple comparison you’re referring to is this example:

WHERE (k1.keyword, k2.keyword, k3.keyword) = ('espionage', 'nuclear-bomb', 'ejector-seat');

This is okay because each correlation name (k1, k2, k3) is an index lookup.  Look at the EXPLAIN output — it accesses these first, by looking up the keywords you specify.  It doesn’t matter what order you specify the keywords in this tuple, it is equivalent to this expression:

WHERE k1.keyword = 'espionage' AND k2.keyword = 'nuclear-bomb' AND k3.keyword = 'ejector-seat';

The AND operator is commutative, so the order of these terms doesn’t matter.

However, if I were using the tuple syntax to look up values against multiple columns in a compound index, then the order would matter, and it would have to match the order of columns in the index.  Example:

WHERE (last_name, first_name) = ('Karwin', 'Bill');

Q: On the Dynamic Pivot, the straight join, can you explain more about why you thought about doing that and why it helped?

When I did not use the STRAIGHT_JOIN, the query optimizer reordered the tables.  It seemed to prefer an index-scan of 7 rows in the `kind_type` table to be first, then look up matching rows in `title` by a secondary index.  But the result was that it created a temporary table to count the movies per production year for each kind_id.

It was more efficient in this case to force MySQL to scan the `title` table first, grouping by kind_id in index order. This made the first table in the EXPLAIN seem like it was scanning more rows. But by avoiding the temporary table, and making lookups to the `kind_type` table by primary key, the result was a query that took half the time.

An important conclusion of my presentation today is that sometimes you find unexpected differences in performance like this, so it pays to test all different solutions, and measure the resulting performance!

Q: Bill, back to the query where you were trying to determine the last episode of each TV show. Couldn’t you have used a SUBQUERY to fetch and compare MAX(episode_id) as an option there?

Yes, I think what you’re referring to is what I covered as the Derived-Table Solution on slides 55-59.  Apologies if that wasn’t clear, because I didn’t present the details of all the tables and columns.  It turned out this solution was 150x faster for this case, so your suggestion is a good one!

Thanks again for attending my webinar!  Here are some more tips:

The post MySQL Query Patterns, Optimized – Webinar questions followup appeared first on MySQL Performance Blog.


Webinar: SQL Query Patterns, Optimized

Next Friday, May 31 at 10 a.m. Pacific, I’ll present Percona’s next webinar, “SQL Query Patterns, Optimized.”

Based on my experiences solving tough SQL problems for Percona training and consulting, I’ll classify several common types of queries with which developers struggle. I’ll test several SQL solutions for each type of query objective, and show how you can use MySQL 5.6 built-in methods to analyze them for optimal query efficiency.  The discussion will cover optimizer reports, query profiling, and session status to measure performance.

The query patterns will include:

  • Exclusion Join
  • Random Selection
  • Greatest-Per-Group
  • Dynamic Pivot
  • Relational Division

Please register for this webinar and join me next Friday!

The post Webinar: SQL Query Patterns, Optimized appeared first on MySQL Performance Blog.
