Feb
27
2017
--

Webinar Thursday March 2, 2017: MongoDB Query Patterns

MongoDB Query

MongoDB QueryJoin Percona’s Senior Technical Services Engineer Adamo Tonete on Thursday, March 2, 2017, at 11:00 a.m. PST / 2:00 p.m. EST (UTC-8) as he reviews and discusses MongoDB® query patterns.


MongoDB is a fast and simple-to-query schema-free database. It features a smart query optimizer that tries to use the easiest data retrieval method.

In this webinar, Adamo will discuss common query operators and how to use them effectively. The webinar will cover not only common query operations, but also the best practices for their usage.

Register for the webinar here.

MongoDB QueryAdamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL Database Administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24/7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three years, he has moved to NoSQL technologies without giving up relational databases.

Dec
29
2016
--

Percona Live Featured Tutorial with Øystein Grøvlen — How to Analyze and Tune MySQL Queries for Better Performance

Percona Live Featured Tutorial

Percona Live Featured TutorialWelcome to another post in the series of Percona Live featured tutorial speakers blogs! In these blogs, we’ll highlight some of the tutorial speakers that will be at this year’s Percona Live conference. We’ll also discuss how these tutorials can help you improve your database environment. Make sure to read to the end to get a special Percona Live 2017 registration bonus!

In this Percona Live featured tutorial, we’ll meet Øystein Grøvlen, Senior Principal Software Engineer at Oracle. His tutorial is on How to Analyze and Tune MySQL Queries for Better Performance. SQL query performance plays a big role in application performance. If some queries execute slowly, these queries or the database schema may need tuning. I had a chance to speak with Øystein and learn a bit more about the MySQL query tuning:

Percona: How did you get into database technology? What do you love about it?

Øystein: I got into database technology during my Ph.D. studies. I got in touch with a research group in Trondheim, Norway, that did research on highly available distributed database systems. I ended up writing a thesis on query processing in such database systems.

What I love most about my job on the MySQL Optimizer Team is that it involves a lot of problem-solving. Why is a query so slow? What can we do to improve it? I have always been very interested in sports results and statistics. Working with query execution times gives me much of the same feeling. Searching for information is another interest of mine, and that is really what query execution is about.

Percona: What impacts database performance the most?

Øystein: From my point of view – mainly concerned with the performance of read-only queries – the most important performance metric is how much data needs to be accessed in order to answer a query. For update-intensive workloads, it is often about concurrency issues. For SELECT statements, the main thing is to not access more data than necessary.

Users should make sure to design their database schema so that the database system can efficiently access the needed data. This includes creating the right indexes. As MySQL developers, we need to develop the right algorithms to support efficient retrieval. We also need to provide a query optimizer that can pick the best query execution plan.

Of course, there are other performance aspects that are important. Especially if your data cannot fit in your database buffer pool. In that case, the order in which you access the data becomes more important. The best query plan when your data is disk-bound is not necessarily the same as when all data is in memory.

Percona: Your tutorial is called “How to Analyze and Tune MySQL Queries for Better Performance.” What are the most recent MySQL updates that help with tuning queries?

Øystein: I think the biggest improvements came in MySQL 5.6, with increased observability through performance schema and new variants of
EXPLAIN (Structured EXPLAIN (JSON format) and visual EXPLAIN in MySQL Workbench). We also added Optimizer Trace, which gives insight into how the optimizer arrived at a certain query plan. All this made it easier to identify queries that need tuning, understand how a query is executed and what might be done to improve it.

In MySQL 5.7, we added a new syntax for optimizer hints, and provided a lot of new hints that can be used to influence the optimizer to change a non-optimal query plan. We also provided a query rewrite plugin that makes it possible to tune queries even when it is not possible to change the application.

MySQL 5.7 also came with improvements to EXPLAIN. It is now possible to get the query plan for a running query, and Structured EXPLAIN shows both estimated total query cost and the cost per table. A more experimental feature allows you to provide your own cost constants to the optimizer.  This way, you can configure the optimizer to better suit your particular system.

For MySQL 8.0 we are continuing to improve tunability by adding more optimizer hints.  At the same time, we are working hard on features that will reduce the need for tuning. Histograms and awareness of whether data is in memory or on disk make the optimizer able to pick better query plans.

Percona: What do you want attendees to take away from your tutorial session? Why should they attend?

Øystein: While the query optimizer in most cases will come up with a good query plan, there are some cases where it won’t generate the most optimal query plan. This tutorial will show how you can identify which queries need tuning, how you can further investigate the issues and what types of tuning options you have for different types of queries. By attending this tutorial, you will learn how to improve the performance of applications through query tuning.

Percona: What are you most looking forward to at Percona Live?

Øystein: I am looking forward to interacting with MySQL users, discussing the query performance issues they might have, and learning how I can help with their issues.

You can find out more about Øystein Grøvlen and his work with databases at his blog, or follow him on Twitter: @ogrovlen. Want to find out more about Øystein and MySQL query optimization? Register for Percona Live Data Performance Conference 2017, and see his tutorial How to Analyze and Tune MySQL Queries for Better Performance. Use the code FeaturedTalk and receive $30 off the current registration price!

Percona Live Data Performance Conference 2017 is the premier open source event for the data performance ecosystem. It is the place to be for the open source community as well as businesses that thrive in the MySQL, NoSQL, cloud, big data and Internet of Things (IoT) marketplaces. Attendees include DBAs, sysadmins, developers, architects, CTOs, CEOs, and vendors from around the world.

The Percona Live Data Performance Conference will be April 24-27, 2017 at the Hyatt Regency Santa Clara & The Santa Clara Convention Center.

Jun
03
2013
--

MySQL Query Patterns, Optimized – Webinar questions followup

MySQL Query Patterns, OptimizedOn Friday I gave a presentation on “MySQL Query Patterns, Optimized” for Percona MySQL Webinars.  If you missed it, you can still register to view the recording and my slides.

Thanks to everyone who attended, and especially to folks who asked the great questions.  I answered as many as we had time for  during the session, but here are all the questions with my complete answers:

Q: Can you compare the use of subqueries/multiple joins vs. multiple queries (e.g. temp tables)?

For performance, it’s hard to make a general rule, because there are different types of subqueries (correlated, non-correlated, derived tables, scalar subqueries) and we saw in my presentation that sometimes these perform worse and sometimes better than another equivalent solution.

If you’re talking about using a temporary table to store an interim result set, and then use that to ultimately produce the result you want, that’s another technique that I see used, although it’s less common.  It can be helpful to simplify the development of a report, but whether it’s a benefit to performance is a case-by-case decision.

There are other drawbacks to using temporary tables in this way, for example when using replication, a temporary table could vanish if the slave restarts and then subsequent queries against the temp table fail.

Q: Why doesn’t the query optimizer take care the execution decisions? Since they are all logically the same, I believe it should all translate to the same execution plan. (I know it doesn’t though.)

Why don’t you try writing a SQL query optimizer and see how easy it is?  ;-)  Generative grammars like SQL can produce an infinite variety of  queries.  The query optimizer analyzes queries for common patterns the MySQL designers know can be improved.  But there’s a finite amount of engineer-years spent developing and testing the query optimizer.  The developers have to make a decision about which query patterns the optimizer recognizes, and which are left to the developer to refactor manually.  There could also be some cases where optimizing a query automatically would be more costly than just running the query in a suboptimal form.

Q: Doesn’t the primary key solution for random selection only work when the IDs for movies are distributed uniformly over the range 1..MAX(id)?

Yes, there’s a risk if you have irregular distribution of matching rows that you’ll get skewed results.  For instance, if there’s a gap of id’s between 47000 and 50000 with no movies, but your random number generator picks values in that range with equal frequency as any other range, then the movie immediately following the “gap” will be picked more frequently.

Nearly every solution for randomly choosing rows involves some compromise, either of performance, or of accuracy of randomness.  The ORDER BY RAND() solution is known to have poor performance, but it returns a better random choice.

Another workaround may be to add a column to the table, and populate the rows you want to choose (movies, in this case) with  values known to be consecutive.  Then you could choose a random value, look up the row with “=” instead of “>” and be guaranteed to find exactly one match.  But the tradeoff of this solution is that it requires storing another column, and reinitializing the sequence after making certain insert/update/delete operations.  If you have a dataset that changes infrequently, then this might be a good tradeoff.

Q: Is the tuple comparison independent of sorting order? Wouldn’t it be a problem if the keywords were returned in a different order from what you specify in the query?

The tuple comparison you’re referring to is this example:

WHERE (k1.keyword, k2.keyword, k3.keyword) = ('espionage', 'nuclear-bomb', 'ejector-seat');

This is okay because each correlation name (k1, k2, k3) is an index lookup.  Look at the EXPLAIN output — it accesses these first, by looking up the keywords you specify.  It doesn’t matter what order you specify the keywords in this tuple, it is equivalent to this expression:

WHERE k1.keyword = 'espionage' AND k2.keyword = 'nuclear-bomb' AND k3.keyword = 'ejector-seat';

The AND operator is commutative, so the order of these terms doesn’t matter.

However, if I were using the tuple syntax to look up values against multiple columns in a compound index, then the order would matter, and it would have to match the order of columns in the index.  Example:

WHERE (last_name, first_name) = ('Karwin', 'Bill');

Q: On the Dynamic Pivot, the straight join, can you explain more about why you thought about doing that and why it helped?

When I did not use the STRAIGHT_JOIN, the query optimizer reordered the tables.  It seemed to prefer an index-scan of 7 rows in the `kind_type` table to be first, then look up matching rows in `title` by a secondary index.  But the result was that it created a temporary table to count the movies per production year for each kind_id.

It was more efficient in this case to force MySQL to scan the `title` table first, grouping by kind_id in index order.  This made the first table in the EXPLAIN seem like it was scanning more rows.  But by avoiding the temporary table, and making lookups to the `kind_types` table by primary key, the result was a query that took half the time.

An important conclusion of my presentation today is that sometimes you find unexpected differences in performance like this, so it pays to test all different solutions, and measure the resulting performance!

Q: Bill, back to the query where you were trying to determine the last episode of each TV show. Couldn’t you have used a SUBQUERY to fetch and compare MAX(episode_id) as an option there?

Yes, I think what you’re referring to is what I covered as the Derived-Table Solution on slides 55-59.  Apologies if that wasn’t clear, because I didn’t present the details of all the tables and columns.  It turned out this solution was 150x faster for this case, so your suggestion is a good one!

Thanks again for attending my webinar!  Here are some more tips:

The post MySQL Query Patterns, Optimized – Webinar questions followup appeared first on MySQL Performance Blog.

Mar
21
2012
--

Multi Range Read (MRR) in MySQL 5.6 and MariaDB 5.5

This is the second blog post in the series of blog posts leading up to the talk comparing the optimizer enhancements in MySQL 5.6 and MariaDB 5.5. This blog post is aimed at the optimizer enhancement Multi Range Read (MRR). Its available in both MySQL 5.6 and MariaDB 5.5

Now let’s take a look at what this optimization actually is and what benefits it brings.

Multi Range Read

With traditional secondary index lookups, if the columns that are being fetched do not belong to the secondary index definition (and hence covering index optimization is not used), then primary key lookups have to be performed for each secondary key entry fetched. This means that secondary key lookups for column values that do not belong to the secondary index definition can result in a lot of Random I/O. The purpose of MRR is to reduce this Random I/O and make it more sequential, by having a buffer in between where secondary key tuples are buffered and then sorted by the primary key values, and then instead of point primary key lookups, a range lookup is performed on the primary key by using the sorted primary key values.

Let me give you a simple example. Suppose you have the following query executed on the InnoDB table:

SELECT non_key_column FROM tbl WHERE key_column=x

This query will roughly be evaluated in following steps, without MRR:

  1. SELECT key_column, pk_column FROM tbl WHERE key_column=x ORDER BY key_column
    (Note that secondary keys in InnoDB contain primary key columns)
  2. For each pk_column value in step 1 do:
    SELECT non_key_column FROM tbl WHERE pk_column=val

As you can see that the values returned from Step 1 are sorted by the secondary key column ‘key_column’, and then for each value of ‘pk_column’ which is a part of the secondary key tuple, a point primary key lookup is made against base table, the number of these point primary key lookups will be depend on the number of rows that match the condition ‘key_column=x’. You can see that there are a lot of random primary key lookups made.

With MRR, then steps above are changed to the following:

  1. SELECT key_column, pk_column FROM tbl WHERE key_column=x ORDER BY key_column
    (Note that secondary keys in InnoDB contain primary key columns)
  2. Buffer each pk_column value fetched from step 1, and when the buffer is full sort them by pk_column, and do a range primary key lookup as follows:
    SELECT non_key_column from tbl WHERE pk_column IN (…)

As you can see by utilizing the buffer for sorting the secondary key tuples by pk_column, we have converted a lot of point primary key lookups to one or more range primary key lookup. Thereby, converting Random access to one or more sequential access. There is one another interesting thing that has come up here, and that is the importance of the size of the buffer used for sorting the secondary key tuples. If the buffer size is large enough only a single range lookup will be needed, however if the buffer size is small as compared to the combined size of the secondary key tuples fetched, then the number of range lookups will be:

CEIL(S/N)
where,
S is the combined size of the secondary key tuples fetched, and
N is the buffer size.

In MySQL 5.6 the buffer size used by MRR can be controlled by the variable read_rnd_buffer_size, while MariaDB introduces a different variable to control the MRR buffer size mrr_buffer_size. Both buffer sizes default to 256K in MySQL 5.6 and MariaDB 5.5 respectively, which might be low depending on your scenario.

You can read more about the MRR optimization available in MySQL 5.6 here:
http://dev.mysql.com/doc/refman/5.6/en/mrr-optimization.html
and as available in MariaDB 5.5 here:
http://kb.askmonty.org/en/multi-range-read-optimization

Now let’s move on to the benchmarks, to see the difference in numbers.

Benchmark results

For the purpose of this benchmark, I have used TPC-H Query #10 and ran it on TPC-H dataset (InnoDB tables) with a Scale Factor of 2 (InnoDB dataset size ~5G). I did not use Scale Factor of 40 (InnoDB dataset size ~95G), because the query was taking far too long to execute, ~11 hours in case of MySQL 5.5 and ~5 hours in case of MySQL 5.6 and MariaDB 5.5. Note that query cache is disabled during these benchmark runs and that the disks are 4 5.4K disks in Software RAID5.

Also note that the following changes were made in the MySQL config:
optimizer_switch=’index_condition_pushdown=off’
optimizer_switch=’mrr=on’
optimizer_switch=’mrr_sort_keys=on’ (only on MariaDB 5.5)
optimizer_switch=’mrr_cost_based=off’
read_rnd_buffer_size=4M (only on MySQL 5.6)
mrr_buffer_size=4M (only on MariaDB 5.5)

We have turned off ICP optimization for the purpose of this particular benchmark, because we want to see the individual affect of an optimization (where possible). Also note that we have turned off mrr_cost_based, this is because the cost based algorithm used to calculate the cost of MRR when the optimizer is choosing the query execution plan, is not sufficiently tuned and it is recommended to turn this off.

The query used is:

select
        c_custkey,
        c_name,
        sum(l_extendedprice * (1 - l_discount)) as revenue,
        c_acctbal,
        n_name,
        c_address,
        c_phone,
        c_comment
from
        customer,
        orders,
        lineitem,
        nation
where
        c_custkey = o_custkey
        and l_orderkey = o_orderkey
        and o_orderdate >= '1993-08-01'
        and o_orderdate < date_add( '1993-08-01' ,interval '3' month)
        and l_returnflag = 'R'
        and c_nationkey = n_nationkey
group by
        c_custkey,
        c_name,
        c_acctbal,
        c_phone,
        n_name,
        c_address,
        c_comment
order by
        revenue desc
LIMIT 20;

In-memory workload

Now let's see how effective is MRR when the workload fits entirely in memory. For the purpose of benchmarking in-memory workload, the InnoDB buffer pool size is set to 6G and the buffer pool was warmed up, so that the relevant pages were already loaded in the buffer pool. Note that as mentioned at the start of the benchmark results section, the InnoDB dataset size is ~5G. Ok so now let's take a look at the graph:

MRR doesn't really make any positive difference to the query times for MySQL 5.6, when the workload fits entirely in memory, because there is no extra cost for memory access at random locations versus memory access at sequential locations. In fact there is extra cost added by the buffering step introduced by MRR, and hence, there is a slight increase in query time for MySQL 5.6, increase of 0.02s. But the query times for MariaDB 5.5 are greater than both MySQL 5.5 and MySQL 5.6

IO bound workload

Now let's see how effective is MRR when the workload is IO bound. For the purpose of benchmarking IO bound workload, the InnoDB buffer pool size is set to 1G and the buffer pool was not warmed up, so that it does not have the relevant pages loaded up already:

MRR does make a lot of difference when the workload is IO bound, the query time is decreased from ~11min to under a minute. The query time is reduced further when the buffer size is set to 4M. Note also that query time for MariaDB is still a little higher by a couple of seconds, when compared to MySQL 5.6.

Now let's take a look at the status counters.

MySQL Status Counters

These status counters were captured when performing the benchmark on IO bound workload, mentioned above.

Counter Name MySQL 5.5 MySQL 5.6 MySQL 5.6 w/ read_rnd_bufer_size=4M MariaDB 5.5 MariaDB 5.5 w/ mrr_buffer_size=4M
Created_tmp_disk_tables 1 1 1 1 1
Created_tmp_tables 1 1 1 1 1
Handler_mrr_init N/A 0 0 1 1
Handler_mrr_rowid_refills N/A N/A N/A 1 0
Handler_read_key 508833 623738 622184 508913 507516
Handler_read_next 574320 574320 572889 574320 572889
Handler_read_rnd_next 136077 136094 136366 136163 136435
Innodb_buffer_pool_read_ahead 0 20920 23669 20920 23734
Innodb_buffer_pool_read_requests 1361851 1264739 1235472 1263290 1235781
Innodb_buffer_pool_reads 120548 102948 76882 102672 76832
Innodb_data_read 1.84G 1.89G 1.53G 1.89G 1.53G
Innodb_data_reads 120552 123872 100551 103011 77213
Innodb_pages_read 120548 123868 100551 123592 100566
Innodb_rows_read 799239 914146 912318 914146 912318
Select_scan 1 1 1 1 1
Sort_scan 1 1 1 1 1
  • As you can see from the status counters above that both MySQL 5.6 and MariaDB 5.5 are reporting high numbers for Innodb_buffer_pool_read_ahead which shows that the access pattern was sequential and hence InnoDB decided to do read_ahead, while in MySQL 5.5 no read_ahead was done because the access pattern was not sequential. Another thing to note is that more read_ahead is done when the buffer size used by MRR, is set to 4M, which obviously means that the more index tuples that can fit in the buffer the more sequential the access pattern will be.
  • There is one MRR related variable introduced in MySQL 5.6 and MariaDB 5.5 Handler_mrr_init and another additional one introduced in MariaDB 5.5 Handler_mrr_rowid_refills. Handler_mrr_init is incremented when a MRR range scan is performed, but we can see its only incremented in MariaDB 5.5 and not in MySQL 5.6, is that because of a bug in MySQL 5.6 code? As MRR was used in both MySQL 5.6 and MariaDB 5.5. Handler_mrr_rowid_refills counts how many times the buffer used by MRR had to be reinitialized, because the buffer was small and not all index tuples could fit in the buffer. If this is > 0 then it means Handler_mrr_rowid_refills + 1 MRR range scans had to be performed. As in the table above you can with default buffer size of 256K, MariaDB 5.5 shows that Handler_mrr_rowid_refills = 1, which means the buffer is small and there were 2 MRR range scans needed. But with a buffer size of 4M, we can see that Handler_mrr_rowid_refills = 0, which means that the buffer was big enough and only 1 MRR range scan was needed, which is as efficient as it can be. This is also evident in the query times, which is lower by a couple of seconds when buffer size of 4M is used.
  • Another interesting thing to note is that MySQL 5.6 and MariaDB 5.5 are both reading more rows than MySQL 5.5, as can be seen by the numbers reported for the status counter Innodb_rows_read. While MySQL 5.6 is also reporting increased numbers for the counter Handler_read_key. This is because of how status counter values are incremented when index lookup is performed. As I explained at the start of the post that traditional index lookup (for non-index-only columns) involves, reading an index record, and then using the PK column value in the index record to make a lookup in the PK. Both these lookups are performed in a single call to the storage engine and the counters Handler_read_key and Innodb_rows_read are incremented by ONE. However, when MRR is used then there are two separate calls made to the storage engine to perform the index record read and then to perform the MRR range scan on the PK. This causes the counters Handler_read_key and Innodb_rows_read to be incremented by TWO. It does not actually mean that queries with MRR are performing badly. The interesting thing is that though both MariaDB and MySQL 5.6 are reporting high numbers for Innodb_rows_read, which is completely in line with how the counters behave with MRR, but the value for counter Handler_read_key is more or less the same for MariaDB 5.5 when compared to MySQL 5.5, and this does not make sense to me. Probably its due to a bug in how counter is calculated inside MariaDB?

Other Observations

Sometimes both for MariaDB 5.5 and MySQL 5.6, the optimizer chooses the wrong query execution plan. Let's take a look at what are the good and bad query execution plans.

a. Bad Plan

id      select_type     table   type    possible_keys   key     key_len ref     rows    filtered        Extra
1       SIMPLE  nation  ALL     PRIMARY NULL    NULL    NULL    25      100.00  Using temporary; Using filesort
1       SIMPLE  customer        ref     PRIMARY,i_c_nationkey   i_c_nationkey   5       dbt3.nation.n_nationkey 2123    100.00
1       SIMPLE  orders  ref     PRIMARY,i_o_orderdate,i_o_custkey       i_o_custkey     5       dbt3.customer.c_custkey 7       100.00  Using where
1       SIMPLE  lineitem        ref     PRIMARY,i_l_orderkey,i_l_orderkey_quantity      PRIMARY 4       dbt3.orders.o_orderkey  1       100.00  Using where

b. Good Plan

id      select_type     table   type    possible_keys   key     key_len ref     rows    filtered        Extra
1       SIMPLE  orders  range   PRIMARY,i_o_orderdate,i_o_custkey       i_o_orderdate   4       NULL    232722  100.00  Using where; Rowid-ordered scan; Using temporary; Using filesort
1       SIMPLE  customer        eq_ref  PRIMARY,i_c_nationkey   PRIMARY 4       dbt3.orders.o_custkey   1       100.00  Using where
1       SIMPLE  nation  eq_ref  PRIMARY PRIMARY 4       dbt3.customer.c_nationkey       1       100.00
1       SIMPLE  lineitem        ref     PRIMARY,i_l_orderkey,i_l_orderkey_quantity      PRIMARY 4       dbt3.orders.o_orderkey  2       100.00  Using where

So during cold query runs the optimizer would switch to using plan 'a', which does not involve MRR, and the query time for MySQL 5.6 and MariaDB 5.5 jumps to ~11min (this is the query time for MySQL 5.5) While when it sticks to plan 'b' for MySQL 5.6 and MariaDB 5.5, then query times remain under a minute. So when the correct query execution plan is not used, there is no difference in query times between MySQL 5.5 and MySQL 5.6/MariaDB 5.5 This is another area of improvement in the optimizer, as it is clearly a part of the optimizer's job to select the best query execution plan. I had noted a similar thing when benchmarking ICP, the optimizer made a wrong choice. It looks like that there is still improvement and changes needed in the optimizer's cost estimation algorithm.

MariaDB 5.5 expands the concept of MRR to improve the performance of secondary key lookups as well. But this works only with joins and specifically with Block Access Join Algorithms. So I am not going to cover it here, but will cover it in my next post which will be on Block Access Join Algorithms.

Conclusion

There is a huge speedup when the workload is IO bound, the query time goes down from ~11min to under a minute. The query time is reduced further when buffer size is set large enough so that the index tuples fit in the buffer. But there is no performance improvement when the workload is in-memory, in fact MRR adds extra sorting overhead which means that the queries are just a bit slower as compared to MySQL 5.5 MRR clearly changes the access pattern to sequential, and hence InnoDB is able to do many read_aheads. Another thing to take away is that MariaDB is just a bit slower as compared to MySQL 5.6, may be something for the MariaDB guys to look at.

Mar
18
2010
--

When the subselect runs faster

A few weeks ago, we had a query optimization request from one of our customer.

The query was very simple like:

CODE:

  1. SELECT * FROM `table` WHERE (col1=‘A’||col1=‘B’) ORDER BY id DESC LIMIT 20 OFFSET 0

This column in the table is looks like this:

CODE:

  1. `col1` enum(‘A’,‘B’,‘C’,‘CD’,‘DE’,‘F’,‘G’,‘HI’) default NULL

The table have 549252 rows and of course, there is an index on the col1. MySQL estimated the cardinality of that index as 87, though what was of course misleading as index cardinality in this case can’t be over 9, as there is only 8(+ NULL) different possible values for this column.

CODE:

  1. +—-+————-+——-+——-+—————+——+———+——+——–+—————————–+
  2. | id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows   | Extra                       |
  3. +—-+————-+——-+——-+—————+——+———+——+——–+—————————–+
  4. 1 | SIMPLE      | table  | range | col1         | col1 | 2       | NULL | 549252 | Using where; Using filesort |
  5. +—-+————-+——-+——-+—————+——+———+——+——–+—————————–+

This query took more than 5 minutes (the rows are large and table does not fit in cache well)

When you want to run this query mysql first will try to find each row where col1 is A or B using index. Then its going to order by the ID using file sort and then send first 20 rows ignoring the rest.

In this case MySQL has 2 indexes where one is usable to find rows, while other is usable to return them in the right order. MySQL can chose only one of them to execute the query – use index to find rows. This is sensible strategy if there is no LIMIT, however it is poor chose if there is one – it is often a lot faster to retrieve rows in order checking WHERE clause for them until required number of rows were returned. Especially in the cases when WHERE clause is not very selective.

So I tried this:

CODE:

  1. select * from table where id in (SELECT id FROM `table` WHERE (col1=‘A’||col1=‘B’)) ORDER BY id DESC LIMIT 20 OFFSET 0;

In this case we forcing MySQL to do retrieve rows in sorted order and checking if it matches our original WHERE clause with subselects. It looks scary if we look at EXPLAIN but in reality the dependent subquery is only executed enough times to produce 20 rows in result set.

CODE:

  1. +—-+——————–+——-+—————–+—————+———+———+——+——–+————-+
  2. | id | select_type        | table | type            | possible_keys | key     | key_len | ref  | rows   | Extra       |
  3. +—-+——————–+——-+—————–+—————+———+———+——+——–+————-+
  4. 1 | PRIMARY            | table  | index           | NULL          | PRIMARY | 4       | NULL | 765105 | Using where |
  5. 2 | DEPENDENT SUBQUERY | table  | unique_subquery | PRIMARY,col1  | PRIMARY | 4       | func |      1 | Using where |
  6. +—-+——————–+——-+—————–+—————+———+———+——+——–+————-+

The result is a lot better result time:

CODE:

  1. (20 rows in set (0.01 sec))

So by rewriting query using subqueries we actually improved it performance 100 times. So subqueries are
not always slowing things down.

Even though proving subqueries are not always slow this way is not the most optimal. We do not really need separate subselect to make MySQL check WHERE clause while scanning table in index order. We can just use FORCE INDEX hint to override MySQL index choice:

CODE:

  1. mysql> explain select * from table FORCE INDEX(PRIMARY) where (col1=‘A’||col1=‘B’) order by id desc limit 20 offset 0;
  2. +—-+————-+——-+——-+—————+———+———+——+——–+————-+
  3. | id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows   | Extra       |
  4. +—-+————-+——-+——-+—————+———+———+——+——–+————-+
  5. 1 | SIMPLE      | table  | index | NULL          | PRIMARY | 4       | NULL | 549117 | Using where |
  6. +—-+————-+——-+——-+—————+———+———+——+——–+————-+
  7.  
  8. mysql> select * from table FORCE INDEX(PRIMARY) where (col1=‘A’||col1=‘B’) order by id desc limit 20 offset 0;
  9. 20 rows in set (0.00 sec)

This approach works well if WHERE clause is not very selective, otherwise MySQL may need to scan very many rows to find enough matching rows. You can use another trick Peter
wrote. about couple of years ago.


Entry posted by Istvan Podor |
7 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com