Finding bad queries is a big part of optimization. A scientific optimization process can be simplified to asking "can anything be improved for less than it costs not to improve it?" If not, we're done. In databases, we care most about the work the database is doing: queries. There are other things we care about, but not as much. The question "can anything be improved?" requires a process for identifying queries that can be improved, and the easiest way to do that is to look for signs of sub-optimality.
All of the above might seem to be obvious, but I wanted to frame the rest of this post with a logical starting point. Next let’s see what kinds of things about queries might indicate that they aren’t optimal.
A responsible approach to a task such as finding bad queries begins with a definition of badness that we can be pretty sure is complete. I am interested in knowing whether I’m missing anything, but I believe the definition of badness in this article is fairly complete. (Post comments if I’m wrong.)
Here is my definition:
- Queries are bad either singly (an individual query is bad) or in aggregate (a group or class of queries is bad). I think this is a complete definition because the only other kind of number besides “one” and “many” is “zero”, and a nonexistent query can’t be bad.
- Queries, or groups of queries, are bad because
- they are slow and provide a bad user experience, or
- they add too much load to the system, or
- they block other queries from running.
It's this last, three-part list that I can't be sure is complete.
Now, given the definition above, how do we find these queries in some input such as a log of queries? It shouldn’t surprise you that there is a working implementation of most of this in a tool, Maatkit’s mk-query-digest. Here’s how:
- It finds queries (or groups of queries) that are slow and provide a bad user experience via the parameters to the --outliers option: if a query is slower than X more than N times, it's bad. You can look at factors such as the variance-to-mean ratio, the Apdex score, and the explain sparkline to judge whether the query is likely to be optimizable.
- It finds queries or groups of queries that add too much load to the system by grouping queries into classes and summing their execution times. Then it prints out a profile of the classes, so you can see what fraction of total execution time each one contributes. You can use Amdahl's Law to decide which of those are candidates for improvement. (The V/M ratio, Apdex score, and explain sparkline are useful here too.)
- mk-query-digest doesn’t do part 3 — finding queries that block other queries — yet.
The missing piece, part 3 (finding queries that block other queries), is best done by observing clusters in query end times. If there is a sudden burst of queries completing, the first one to complete is likely, but not certain, to have blocked the others from running. This could be due to internal locks, I/O starvation, or what have you.
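The clustering idea can be sketched as follows. This is a minimal illustration, not anything mk-query-digest does; it assumes you already have per-query completion timestamps with sub-second precision, and the gap threshold and minimum burst size are made-up parameters you would tune against your own workload.

```python
def find_bursts(end_times, max_gap=0.001, min_size=3):
    """Group completion timestamps (seconds) into bursts where consecutive
    completions are closer than max_gap seconds apart. Returns bursts of at
    least min_size events; the first event in each burst is the likely (not
    certain) blocker."""
    bursts, current = [], []
    for t in sorted(end_times):
        if current and t - current[-1] > max_gap:
            if len(current) >= min_size:
                bursts.append(current)
            current = []
        current.append(t)
    if len(current) >= min_size:
        bursts.append(current)
    return bursts

# Queries trickle in, then five finish within 2 milliseconds of each other:
ends = [10.000, 10.500, 12.0001, 12.0004, 12.0007, 12.0012, 12.0015, 15.2]
for b in find_bursts(ends):
    print(f"burst of {len(b)}; suspect blocker finished at {b[0]:.4f}")
```

Note that this is exactly where one-second timestamps fail: every query in the burst above would log the same second, and so would unrelated queries, making the cluster invisible.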
In all of the above cases, the additional information in Percona Server’s query execution logs is absolutely vital. Without data to analyze, you’re left to guess at what the server is doing. For example, in Percona Server we have microsecond-precision timestamps for log events. Edit: note that I’m referring to the timestamp when the query executed, not merely the duration the query executed for, which has been available in MySQL for years. This is necessary for determining whether a group of queries completed in a cluster. One-second granularity (what you get in standard MySQL) might be enough precision for finding some cases of queries blocking others, but it’s nowhere near precise enough to find and diagnose the kinds of sub-second performance stalls we are interested in. Without data, you can only guess; and guessing ain’t optimizing.
We are considering improving mk-query-digest to do this missing analytical step. I have done it many times with simple awk scripts and various statistical measures. What do you think? Is there more to be done? Is there a better approach? Are we missing some notion of badness, or some heuristic for detecting bad queries? Let me know by posting in the comments.