Sep
06
2018
--

MongoDB: Investigate Queries with explain() and Index Usage (part 2)

MongoDB explain() method optimization tool

This is the second part of a two-part series. In MongoDB: index usage and MongoDB explain() we introduced the main index types supported by MongoDB, and how to create and use them. In this second article, we are going to see some examples of how to use the explain() method to investigate queries. Do you need to optimize MongoDB queries? You’ll see how to use explain() to find out how your query uses indexes. Or, perhaps, that it doesn’t!

What is explain()

explain() is a method that you can apply to simple queries or to cursors to investigate the query execution plan. The execution plan is how MongoDB resolves a query. Looking at all the information returned by explain() we can find out things like:

  • how many documents were scanned
  • how many documents were returned
  • which index was used
  • how long the query took to be executed
  • which alternative execution plans were evaluated

…and other useful information.

The aim of using explain() is to find out how to improve the query: for example, by creating missing indexes or by rewriting it to make better use of existing indexes. If you are familiar with the EXPLAIN command in MySQL, the goals of MongoDB’s explain() method are exactly the same.

Explainable object

You can apply the explain() method to a query or a cursor in the following way, just as you would any other method:

MongoDB > db.mycollection.find().explain()

However, the preferred way to investigate queries in the mongo shell is to first create the explainable object.

We can create an explainable object like this:

MongoDB > var myexp = db.mycollection.explain()

Once you have created the explainable object, then any kind of operation can be run against it to investigate a query or cursor execution plan. For example:

MongoDB > myexp.find()
MongoDB > myexp.update()
MongoDB > myexp.remove()
MongoDB > myexp.aggregate()

Restaurants test database

To see some examples we need a collection with some data.
For our purposes we can use the New York restaurants database. You can download this from the following url:

https://www.w3resource.com/mongodb-exercises/retaurants.zip

Unzip the archive, and import the JSON file into MongoDB:

$ unzip retaurants.zip
$ mongoimport --host 127.0.0.1 -d test -c restaurants --file retaurants.json

This collection has 3772 documents: all the restaurants in New York City. Here is a sample document.

MongoDB > use test
switched to db test
MongoDB > db.restaurants.find().pretty().limit(1)
{
	"_id" : ObjectId("5b71b3281979e24aa18c0121"),
	"address" : {
		"building" : "1007",
		"coord" : [
			-73.856077,
			40.848447
		],
		"street" : "Morris Park Ave",
		"zipcode" : "10462"
	},
	"borough" : "Bronx",
	"cuisine" : "Bakery",
	"grades" : [
		{
			"date" : ISODate("2014-03-03T00:00:00Z"),
			"grade" : "A",
			"score" : 2
		},
		{
			"date" : ISODate("2013-09-11T00:00:00Z"),
			"grade" : "A",
			"score" : 6
		},
		{
			"date" : ISODate("2013-01-24T00:00:00Z"),
			"grade" : "A",
			"score" : 10
		},
		{
			"date" : ISODate("2011-11-23T00:00:00Z"),
			"grade" : "A",
			"score" : 9
		},
		{
			"date" : ISODate("2011-03-10T00:00:00Z"),
			"grade" : "B",
			"score" : 14
		}
	],
	"name" : "Morris Park Bake Shop",
	"restaurant_id" : "30075445"
}

Explain() verbosity

The explain() method has three verbosity modes.

  • queryPlanner – this is the default mode. At this level, explain provides information about the winning plan, including the index used, or whether a collection scan is needed (COLLSCAN).
  • executionStats – this mode includes all the information provided by queryPlanner, plus the statistics. Statistics include details such as the number of documents examined and returned, the execution time in milliseconds, and so on.
  • allPlansExecution – this mode includes all the information provided by executionStats, plus information about the discarded execution plans.
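
For reference, you can request each verbosity level by name when creating the explainable object:

MongoDB > db.restaurants.explain("queryPlanner")
MongoDB > db.restaurants.explain("executionStats")
MongoDB > db.restaurants.explain("allPlansExecution")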

We’ll see the explain() output in the following examples.

Example 1

It’s time to use the restaurants collection to run our first example: find out all the restaurants in the Manhattan borough.

Let’s first create the explainable object, using the executionStats mode.

MongoDB > var exp = db.restaurants.explain("executionStats")

Then let’s investigate the query.

MongoDB > exp.find( { borough: "Manhattan"} )
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.restaurants",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"borough" : {
				"$eq" : "Manhattan"
			}
		},
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"borough" : {
					"$eq" : "Manhattan"
				}
			},
			"direction" : "forward"
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 1883,
		"executionTimeMillis" : 1,
		"totalKeysExamined" : 0,
		"totalDocsExamined" : 3772,
		"executionStages" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"borough" : {
					"$eq" : "Manhattan"
				}
			},
			"nReturned" : 1883,
			"executionTimeMillisEstimate" : 0,
			"works" : 3774,
			"advanced" : 1883,
			"needTime" : 1890,
			"needYield" : 0,
			"saveState" : 29,
			"restoreState" : 29,
			"isEOF" : 1,
			"invalidates" : 0,
			"direction" : "forward",
			"docsExamined" : 3772
		}
	},
	"serverInfo" : {
		"host" : "Admins-MBP",
		"port" : 27017,
		"version" : "3.6.4",
		"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
	},
	"ok" : 1
}

Here we can see the output of explain(). First of all, we can clearly distinguish the “queryPlanner” and the “executionStats” sections. We won’t describe every value, as most are fairly intuitive. Let’s have a look at just some of them:

queryPlanner.winningPlan.stage = “COLLSCAN”
This provides very important information about the winning plan: it means that MongoDB needs to do a collection scan. The query is not optimized, because all the documents must be read.

queryPlanner.winningPlan.rejectedPlans = []
It’s empty. There are no rejected plans. When the query needs to be executed with COLLSCAN, the only execution plan is the winning plan. We don’t have any indexes in the collection, apart from the one on _id, so there are no other execution plans.

executionStats.nReturned = 1883
The number of documents returned is 1883, the number of restaurants located in Manhattan.

executionStats.totalDocsExamined = 3772
The number of documents examined is exactly the number of documents in the collection. This was expected, because the query uses COLLSCAN.

executionStats.executionTimeMillis = 1
The execution time of the query. It’s just 1 millisecond. This might seem good, but remember that this is the time needed to scan just 3772 documents, a very small test collection. Think about what this time could be in the case of a collection with millions of documents!

How can we improve the query?

In this case it’s simple. Let’s try to create a single field index on borough, the only condition we have in the find(). Then let’s try to explain the same query again.

MongoDB > db.restaurants.createIndex( {borough: 1} )
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 1,
	"numIndexesAfter" : 2,
	"ok" : 1
}
MongoDB > exp.find( { borough: "Manhattan"} )
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.restaurants",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"borough" : {
				"$eq" : "Manhattan"
			}
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"borough" : 1
				},
				"indexName" : "borough_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"borough" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"borough" : [
						"[\"Manhattan\", \"Manhattan\"]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 1883,
		"executionTimeMillis" : 1,
		"totalKeysExamined" : 1883,
		"totalDocsExamined" : 1883,
		"executionStages" : {
			"stage" : "FETCH",
			"nReturned" : 1883,
			"executionTimeMillisEstimate" : 0,
			"works" : 1884,
			"advanced" : 1883,
			"needTime" : 0,
			"needYield" : 0,
			"saveState" : 14,
			"restoreState" : 14,
			"isEOF" : 1,
			"invalidates" : 0,
			"docsExamined" : 1883,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"nReturned" : 1883,
				"executionTimeMillisEstimate" : 0,
				"works" : 1884,
				"advanced" : 1883,
				"needTime" : 0,
				"needYield" : 0,
				"saveState" : 14,
				"restoreState" : 14,
				"isEOF" : 1,
				"invalidates" : 0,
				"keyPattern" : {
					"borough" : 1
				},
				"indexName" : "borough_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"borough" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"borough" : [
						"[\"Manhattan\", \"Manhattan\"]"
					]
				},
				"keysExamined" : 1883,
				"seeks" : 1,
				"dupsTested" : 0,
				"dupsDropped" : 0,
				"seenInvalidated" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "Admins-MBP",
		"port" : 27017,
		"version" : "3.6.4",
		"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
	},
	"ok" : 1
}

Interpreting the output

Now, the output is completely different. Let’s have a look at some of the most relevant values.

queryPlanner.winningPlan.inputStage.stage = “IXSCAN”
This is very important. IXSCAN means that MongoDB no longer needs to do a collection scan: an index can be used to find the documents.

queryPlanner.winningPlan.inputStage.indexName = “borough_1”
The name of the index used. This is the default name of an index: the name of the field followed by _1 for ascending or _-1 for descending order.
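
If you want to check which indexes exist on the collection, together with their names, you can list them with the standard getIndexes() shell helper:

MongoDB > db.restaurants.getIndexes()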

queryPlanner.winningPlan.inputStage.direction = “forward”
MongoDB traverses the index in a forward direction.

executionStats.nReturned = 1883
The number of documents returned. Obviously this is the same as before.

executionStats.totalKeysExamined = 1883
The number of keys examined in the index.

executionStats.totalDocsExamined = 1883
Now the number of documents examined corresponds to the number of index entries examined.

We have optimized the query.

Example 2

Now we would like to examine a query that finds all the restaurants serving Italian cuisine that received a grade score greater than 50.

MongoDB > var exp = db.restaurants.explain()
MongoDB > exp.find({$and: [  {"cuisine" : {$eq :"Italian"}},  {"grades.score" : {$gt : 50}}  ]  })
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.restaurants",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"$and" : [
				{
					"cuisine" : {
						"$eq" : "Italian"
					}
				},
				{
					"grades.score" : {
						"$gt" : 50
					}
				}
			]
		},
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"$and" : [
					{
						"cuisine" : {
							"$eq" : "Italian"
						}
					},
					{
						"grades.score" : {
							"$gt" : 50
						}
					}
				]
			},
			"direction" : "forward"
		},
		"rejectedPlans" : [ ]
	},
	"serverInfo" : {
		"host" : "Admins-MBP",
		"port" : 27017,
		"version" : "3.6.4",
		"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
	},
	"ok" : 1
}

We have a COLLSCAN again. Let’s try to improve the query by creating an index on the cuisine field.

MongoDB > var exp = db.restaurants.explain("executionStats")
MongoDB > db.restaurants.createIndex({cuisine:1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 2,
	"numIndexesAfter" : 3,
	"ok" : 1
}
MongoDB > exp.find({$and: [  {"cuisine" : {$eq :"Italian"}},  {"grades.score" : {$gt : 50}}  ]  })
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.restaurants",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"$and" : [
				{
					"cuisine" : {
						"$eq" : "Italian"
					}
				},
				{
					"grades.score" : {
						"$gt" : 50
					}
				}
			]
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"filter" : {
				"grades.score" : {
					"$gt" : 50
				}
			},
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"cuisine" : 1
				},
				"indexName" : "cuisine_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"cuisine" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"cuisine" : [
						"[\"Italian\", \"Italian\"]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 6,
		"executionTimeMillis" : 4,
		"totalKeysExamined" : 325,
		"totalDocsExamined" : 325,
		"executionStages" : {
			"stage" : "FETCH",
			"filter" : {
				"grades.score" : {
					"$gt" : 50
				}
			},
			"nReturned" : 6,
			"executionTimeMillisEstimate" : 0,
			"works" : 326,
			"advanced" : 6,
			"needTime" : 319,
			"needYield" : 0,
			"saveState" : 2,
			"restoreState" : 2,
			"isEOF" : 1,
			"invalidates" : 0,
			"docsExamined" : 325,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"nReturned" : 325,
				"executionTimeMillisEstimate" : 0,
				"works" : 326,
				"advanced" : 325,
				"needTime" : 0,
				"needYield" : 0,
				"saveState" : 2,
				"restoreState" : 2,
				"isEOF" : 1,
				"invalidates" : 0,
				"keyPattern" : {
					"cuisine" : 1
				},
				"indexName" : "cuisine_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"cuisine" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"cuisine" : [
						"[\"Italian\", \"Italian\"]"
					]
				},
				"keysExamined" : 325,
				"seeks" : 1,
				"dupsTested" : 0,
				"dupsDropped" : 0,
				"seenInvalidated" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "Admins-MBP",
		"port" : 27017,
		"version" : "3.6.4",
		"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
	},
	"ok" : 1
}

The query has improved: the created index cuisine_1 is used. But we still have 325 documents examined, with only 6 documents returned. Let’s see if we can do better by creating instead a compound index that uses both fields in the condition: cuisine and grades.score.

MongoDB > db.restaurants.createIndex({cuisine:1, "grades.score":1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 3,
	"numIndexesAfter" : 4,
	"ok" : 1
}
MongoDB > exp.find({$and: [  {"cuisine" : {$eq :"Italian"}},  {"grades.score" : {$gt : 50}}  ]  })
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "test.restaurants",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"$and" : [
				{
					"cuisine" : {
						"$eq" : "Italian"
					}
				},
				{
					"grades.score" : {
						"$gt" : 50
					}
				}
			]
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"cuisine" : 1,
					"grades.score" : 1
				},
				"indexName" : "cuisine_1_grades.score_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"cuisine" : [ ],
					"grades.score" : [
						"grades"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"cuisine" : [
						"[\"Italian\", \"Italian\"]"
					],
					"grades.score" : [
						"(50.0, inf.0]"
					]
				}
			}
		},
		"rejectedPlans" : [
			{
				"stage" : "FETCH",
				"filter" : {
					"grades.score" : {
						"$gt" : 50
					}
				},
				"inputStage" : {
					"stage" : "IXSCAN",
					"keyPattern" : {
						"cuisine" : 1
					},
					"indexName" : "cuisine_1",
					"isMultiKey" : false,
					"multiKeyPaths" : {
						"cuisine" : [ ]
					},
					"isUnique" : false,
					"isSparse" : false,
					"isPartial" : false,
					"indexVersion" : 2,
					"direction" : "forward",
					"indexBounds" : {
						"cuisine" : [
							"[\"Italian\", \"Italian\"]"
						]
					}
				}
			}
		]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 6,
		"executionTimeMillis" : 1,
		"totalKeysExamined" : 7,
		"totalDocsExamined" : 6,
		"executionStages" : {
			"stage" : "FETCH",
			"nReturned" : 6,
			"executionTimeMillisEstimate" : 0,
			"works" : 9,
			"advanced" : 6,
			"needTime" : 1,
			"needYield" : 0,
			"saveState" : 0,
			"restoreState" : 0,
			"isEOF" : 1,
			"invalidates" : 0,
			"docsExamined" : 6,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"nReturned" : 6,
				"executionTimeMillisEstimate" : 0,
				"works" : 8,
				"advanced" : 6,
				"needTime" : 1,
				"needYield" : 0,
				"saveState" : 0,
				"restoreState" : 0,
				"isEOF" : 1,
				"invalidates" : 0,
				"keyPattern" : {
					"cuisine" : 1,
					"grades.score" : 1
				},
				"indexName" : "cuisine_1_grades.score_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"cuisine" : [ ],
					"grades.score" : [
						"grades"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"cuisine" : [
						"[\"Italian\", \"Italian\"]"
					],
					"grades.score" : [
						"(50.0, inf.0]"
					]
				},
				"keysExamined" : 7,
				"seeks" : 1,
				"dupsTested" : 7,
				"dupsDropped" : 1,
				"seenInvalidated" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "Admins-MBP",
		"port" : 27017,
		"version" : "3.6.4",
		"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
	},
	"ok" : 1
}

Now, the winning plan uses the new compound index cuisine_1_grades.score_1 and we have only 6 documents examined. Please note also that now we have a rejected plan, the one that uses the single field index cuisine_1 previously created.

We have optimized the query.

Example 3

Let’s find all the restaurants in Brooklyn that don’t serve “American” cuisine and that achieved a grade of ‘A’. We want to see the results ordered by cuisine, descending.

At this point you should be a little familiar with the explain() output, so in the next box we truncate it for the sake of simplicity, leaving only the relevant parts.

MongoDB > var exp = db.restaurants.explain("executionStats")
MongoDB > exp.find( {"cuisine" : {$ne : "American "},
... "grades.grade" :"A",
... "borough": "Brooklyn"}).sort({"cuisine":-1})
{
	"queryPlanner" : {
...
...
		"winningPlan" : {
			"stage" : "FETCH",
			"filter" : {
				"$and" : [
					{
						"borough" : {
							"$eq" : "Brooklyn"
						}
					},
					{
						"grades.grade" : {
							"$eq" : "A"
						}
					}
				]
			},
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"cuisine" : 1
				},
				"indexName" : "cuisine_1",
				"isMultiKey" : false,
				"multiKeyPaths" : {
					"cuisine" : [ ]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "backward",
				"indexBounds" : {
					"cuisine" : [
						"[MaxKey, \"American \")",
						"(\"American \", MinKey]"
					]
				}
			}
		},
...
...
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 493,
		"executionTimeMillis" : 9,
		"totalKeysExamined" : 2518,
		"totalDocsExamined" : 2517,
...
...
	"serverInfo" : {
		"host" : "Admins-MBP",
		"port" : 27017,
		"version" : "3.6.4",
		"gitVersion" : "d0181a711f7e7f39e60b5aeb1dc7097bf6ae5856"
	},
	"ok" : 1
}

Looking at the winning plan, we see that we have an IXSCAN on the index cuisine_1. The important thing to notice here is the choice of the index on the cuisine field: this is because we used sort({cuisine:-1}). There is no SORT stage, because the documents are extracted using the index and so they are already sorted. Notice, too, direction:”backward”: this is because we specified descending order in the query. If we execute a slightly different query, changing the sorting to name:1 instead of cuisine:-1, we’ll see a completely different winning plan.

MongoDB > exp.find( {"cuisine" : {$ne : "American "},
... "grades.grade" :"A",
... "borough": "Brooklyn"}).sort({"name":1})
...
		"winningPlan" : {
			"stage" : "SORT",
			"sortPattern" : {
				"name" : -1
			},
			"inputStage" : {
				"stage" : "SORT_KEY_GENERATOR",
				"inputStage" : {
					"stage" : "FETCH",
					"filter" : {
						"$and" : [
							{
								"grades.grade" : {
									"$eq" : "A"
								}
							},
							{
								"$nor" : [
									{
										"cuisine" : {
											"$eq" : "American "
										}
									}
								]
							}
						]
					},
					"inputStage" : {
						"stage" : "IXSCAN",
						"keyPattern" : {
							"borough" : 1
						},
						"indexName" : "borough_1",
						"isMultiKey" : false,
						"multiKeyPaths" : {
							"borough" : [ ]
						},
						"isUnique" : false,
						"isSparse" : false,
						"isPartial" : false,
						"indexVersion" : 2,
						"direction" : "forward",
						"indexBounds" : {
							"borough" : [
								"[\"Brooklyn\", \"Brooklyn\"]"
							]
						}
					}
				}
			}
		},
...
...
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 493,
		"executionTimeMillis" : 13,
		"totalKeysExamined" : 684,
		"totalDocsExamined" : 684,
...
...

In this case, we have fewer documents examined, but since the cuisine_1 index cannot be used for the sort, a SORT stage is needed, and the index used to fetch the documents is borough_1. While MongoDB has examined fewer documents, the execution time is worse because of the extra stage used to sort the documents.

Let’s return now to the original query. We can also notice that the number of documents examined is still too high (2517) compared to the number of documents returned (493). That’s not optimal. Let’s see if we can further improve the query by adding a compound index on (cuisine, borough, grades.grade).

MongoDB > db.restaurants.createIndex({cuisine:1,borough:1,"grades.grade":1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 4,
	"numIndexesAfter" : 5,
	"ok" : 1
}
MongoDB > exp.find( {"cuisine" : {$ne : "American "},
... "grades.grade" :"A",
... "borough": "Brooklyn"}).sort({"cuisine":-1})
...
...		"winningPlan" : {
			"stage" : "FETCH",
			"filter" : {
				"$nor" : [
					{
						"cuisine" : {
							"$eq" : "American "
						}
					}
				]
			},
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"cuisine" : 1,
					"borough" : 1,
					"grades.grade" : 1
				},
				"indexName" : "cuisine_1_borough_1_grades.grade_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"cuisine" : [ ],
					"borough" : [ ],
					"grades.grade" : [
						"grades"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "backward",
				"indexBounds" : {
					"cuisine" : [
						"[MaxKey, \"American \")",
						"(\"American \", MinKey]"
					],
					"borough" : [
						"[\"Brooklyn\", \"Brooklyn\"]"
					],
					"grades.grade" : [
						"[\"A\", \"A\"]"
					]
				}
			}
		},
...
...
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 493,
		"executionTimeMillis" : 6,
		"totalKeysExamined" : 591,
		"totalDocsExamined" : 493,
...
...

Now, MongoDB uses the new index and does not need the extra sorting stage. The number of documents examined is the same as the number of documents returned. Also, the execution time is better.

We have optimized the query.

Example 4

This is the final example. Let’s find out the restaurants where the grades array contains a grade of ‘A’ and a score of 9 for a specific date.

MongoDB > exp.find({"grades.date": ISODate("2014-08-11T00:00:00Z"),
... "grades.grade":"A" ,
... "grades.score" : 9})
{
	"queryPlanner" : {
...
...
		"winningPlan" : {
			"stage" : "COLLSCAN",
			"filter" : {
				"$and" : [
					{
						"grades.date" : {
							"$eq" : ISODate("2014-08-11T00:00:00Z")
						}
					},
					{
						"grades.grade" : {
							"$eq" : "A"
						}
					},
					{
						"grades.score" : {
							"$eq" : 9
						}
					}
				]
			},
			"direction" : "forward"
		},
...

A COLLSCAN again. In the query, we notice that all the conditions refer to fields embedded in an array of objects. So let’s try to create a multikey index. Let’s create the index on the date field only and see what happens.

MongoDB > db.restaurants.createIndex({"grades.date":1})
{
	"createdCollectionAutomatically" : false,
	"numIndexesBefore" : 5,
	"numIndexesAfter" : 6,
	"ok" : 1
}
MongoDB > exp.find({"grades.date": ISODate("2014-08-11T00:00:00Z"),
... "grades.grade":"A" ,
... "grades.score" : 9})
{
	"queryPlanner" : {
...
...
		"winningPlan" : {
			"stage" : "FETCH",
			"filter" : {
				"$and" : [
					{
						"grades.grade" : {
							"$eq" : "A"
						}
					},
					{
						"grades.score" : {
							"$eq" : 9
						}
					}
				]
			},
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"grades.date" : 1
				},
				"indexName" : "grades.date_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"grades.date" : [
						"grades"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"grades.date" : [
						"[new Date(1407715200000), new Date(1407715200000)]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 15,
		"executionTimeMillis" : 0,
		"totalKeysExamined" : 22,
		"totalDocsExamined" : 22,
...
...

MongoDB uses the index and the winning plan is already good enough. As an exercise, you can try to create a compound index including the other embedded fields of the array, like {“grades.date”:1, “grades.grade”:1, “grades.score”:1}, and see what happens. You will probably see that the index we created on the date field alone is good enough. Enlarging the compound index will only generate rejected plans. This is because the date field is the most selective.
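
If you want to try that exercise, the suggested compound index would be created like this:

MongoDB > db.restaurants.createIndex({"grades.date":1, "grades.grade":1, "grades.score":1})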

Hint: when dealing with compound indexes, please remember that the order of the fields is important. The first field should be the most selective one, and the last one the least selective. Or, in cases where you don’t need to put a lot of fields in the index, the most selective ones are probably all you need to improve the queries. A short illustration follows.
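
As a quick illustration of why field order matters (a hypothetical collection and fields, for illustration only): a query can use a compound index only if it filters on a leading prefix of the indexed fields.

MongoDB > db.mycollection.createIndex({a: 1, b: 1})
MongoDB > db.mycollection.find({a: 10})  // can use the index: a is a leading prefix
MongoDB > db.mycollection.find({b: 10})  // cannot use it: b alone is not a prefix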

Conclusion

That’s all folks. In this two-part series, we have seen the indexes available in MongoDB and how to use explain() to investigate queries and find out how to improve their performance. We don’t expect that you’ll now know everything, but we hope this will be a good starting point for you to practice MongoDB query optimization using explain(). You’ll find plenty more information about indexes and explain() in the manual and on the internet, if you are motivated to find out more.

While you are here

If you found this article interesting, you might like to check out some of our other resources. For example, you might find this recorded webinar MongoDB Sharding 101 by my colleague Adamo Tonete to be useful. Or perhaps Tim Vaillancourt’s MongoDB backup and recovery field guide would provide some useful information. If you want to catch our webinars as they happen, please subscribe to receive notifications in good time.

The post MongoDB: Investigate Queries with explain() and Index Usage (part 2) appeared first on Percona Database Performance Blog.

Jun
26
2018
--

MongoDB Explain – Using PMM-QAN for MongoDB Query Analytics

MongoDB explain plan with PMM-QAN

In this blog post, we will walk through PMM Query Analytics (PMM-QAN) for MongoDB. We will see how to analyze MongoDB query performance, review the initial parameters that we need to check, and find out how to compare MongoDB query performance with and without indexes, with the help of the EXPLAIN plan.

The Percona Monitoring and Management QAN (PMM-QAN) dashboard helps DBAs and developers to analyze database queries and identify performance issues more easily. Sometimes it is difficult to find issues by just enabling the profiler and tracking all the slow queries through the mongo shell.

Test Case Environment

We configured the test environment with PMM Server before we ran the test:

PMM Version: 1.11.0
MongoDB Version: 3.4.10
MongoDB Configurations:
3 Member Replicaset (1 Primary, 2 Secondaries)
ns: "test.pro"
Test Query: Find Test case: (with and without Index)
Indexed Field: "product"
Query: db.pro.find({product:"aa"})
Query count: 1000
Total count: 20000

Please note: we have to enable the profiler in the MongoDB environment for metrics to be reflected in the QAN page. The QAN dashboard lists multiple queries, based on the threshold we have set in the profiler. For this demonstration, we have set the profiling level to 2 to get the query reflected in the dashboard.
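
For reference, profiling level 2 (profile all operations) can be enabled from the mongo shell like this:

MongoDB > use test
MongoDB > db.setProfilingLevel(2)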

QAN Analysis

Let’s analyze comparisons of the plans that were collected before and after the index was added to the collection.

Query used:

db.pro.find({product:"aa"})

The QAN dashboard lists the FIND query for the “pro” collection:

Query Time

After selecting this specific query, we will see its details just below the list.

Review parameter: “Query Time”

Without Index

Here Query Time = 18ms. We will check the query count, docs returned, and docs scanned in detail in the explain plan.

With Index

Now Query Time = 15ms. We have improved the query by creating an index, and MongoDB is no longer scanning the whole collection.
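
For this comparison, the index on the indexed field product was created in the usual way (assuming the default ascending order):

MongoDB > db.pro.createIndex({product: 1})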

EXPLAIN PLAN

Analysis of the query from the QAN EXPLAIN plan: you can use the “Expand All” toggle button to check the complete execution stats and plans, as in this case for the “test” database.

COMPARISON OF PERFORMANCE

Here we make a comparison of query performance with and without the index, and take a look at the basics that need to be checked before and after index creation:

Without Index:

Stage = ”COLLSCAN” means that MongoDB scans every document in order to fulfil the find query, so you need to create an index to improve query performance.

With Index:

Stage = ”IXSCAN” indicates that the optimizer is not scanning the whole collection, but only the documents within the index bounds.

Without Index:

It scanned the whole collection, i.e. docsExamined: 20000, and returned nReturned: 1000.

With Index:

MongoDB uses the index and scans only the relevant documents, i.e. docsExamined: 1000, and returns nReturned: 1000. We can see the performance improvement from adding an index.

This is how the query behaves after index creation. We can identify exactly which index is being used with QAN. We can discover whether that index is useful or not, and other performance related events, all with an easy UI.

As this blog post is specific to Query Analysis for MongoDB, QAN graphs and their attributes can be accessed from here, an excellent blog post written by my colleague Vinodh.

The post MongoDB Explain – Using PMM-QAN for MongoDB Query Analytics appeared first on Percona Database Performance Blog.

Jun
20
2018
--

Webinar Thu 6/21: How to Analyze and Tune MySQL Queries for Better Performance

database query tuning

Please join Percona’s MySQL Database Administrator, Brad Mickel, as he presents How to Analyze and Tune MySQL Queries for Better Performance on Thursday, June 21st, 2018, at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

 

Query performance is essential in making any application successful. In order to finely tune your queries you first need to understand how MySQL executes them, and what tools are available to help identify problems.

In this session you will learn:

  1. The common tools for researching problem queries
  2. What an Index is, and why you should use one
  3. Index limitations
  4. When to rewrite the query instead of just adding a new index
Register Now

 

Brad Mickel

MySQL DBA

Bradley began working with MySQL in 2013 as part of his duties in healthcare billing. After three years in healthcare billing, he joined Percona through the bootcamp process. Since completing the bootcamp, he has served as a remote database administrator on the Atlas team for Percona Managed Services.

The post Webinar Thu 6/21: How to Analyze and Tune MySQL Queries for Better Performance appeared first on Percona Database Performance Blog.

Oct
18
2017
--

Webinar Thursday, October 19, 2017: What You Need to Get the Most Out of Indexes – Part 2

Indexes

Join Percona’s Senior Architect, Matthew Boehm, as he presents the What You Need to Get the Most Out of Indexes – Part 2 webinar on Thursday, October 19, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

Proper indexing is key to database performance. Finely tune your query writing and database performance with tips from the experts. MySQL offers a few different types of indexes and uses them in a variety of ways.

In this session you’ll learn:

  • How to use composite indexes
  • Other index usages besides lookup
  • How to find unoptimized queries
  • What is there beyond EXPLAIN?

Register for the webinar.

Matthew Boehm, Architect

Matthew joined Percona in the fall of 2012 as a MySQL consultant. His areas of knowledge include the traditional Linux/Apache/MySQL/PHP stack, memcached, MySQL Cluster, massive sharding topologies, PHP development and a bit of MySQL-C-API development. Previously, Matthew DBAed for the fifth largest MySQL installation at eBay/PayPal. He also hails from managed hosting environments. During his off-hours, Matthew is a nationally ranked, competitive West Coast Swing dancer and travels to competitions around the US. He enjoys working out, camping, biking and playing MMOs with his son.

May
20
2016
--

Introduction to Troubleshooting Performance – Troubleshooting Slow Queries webinar: Q & A

Troubleshooting Slow Queries

In this blog, I will provide answers to the Q & A for the Troubleshooting Slow Queries webinar.

First, I want to thank you for attending the April 28 webinar. The recording and slides for the webinar are available here. Below is the list of your questions that I wasn’t able to answer during the webinar, with responses:

Q: I’ve heard that it is a bad idea to use SELECT *; what do you recommend?

A: When I used SELECT * in my slides, I wanted to underline the idea that sometimes you need to select all columns from the table. There is nothing bad about it if you need them. SELECT * is bad when you need only a few columns from the table: in that case, you retrieve more data than needed, which affects performance. Another issue that SELECT * can cause is if you hard-code the statement into your application and then change the table definition: the application could start retrieving columns in the wrong order and output them incorrectly (e.g., email instead of billing address). Or, even worse, it will try to access a non-existent index in the result set array. The best practice is to explicitly enumerate all columns that your application needs.

Q: I heard that using index_field length will affect the indexing principle during query execution (e.g., one field is varchar and its length is not fixed; some values have short text, some values have long text). If we use this field as an index, what happens?

A: I assume you are asking about the ability to create an index with a length smaller than the column length? They work as follows:

Assume you have a TEXT field which contains these user questions:

  1. I’ve heard that is a bad idea to use select * what do you recommend?
  2. I heard that using index_field length will affect the indexing principle during query execution (e.g., one field is varchar and its length is not fixed, some values have short text, some values have long text. at this time). If we use this field as indexing, what happens?
  3. ….

Since this is a TEXT field, you cannot create an index on it without specifying its length, so you need to make the index as minimal as possible while still uniquely identifying the questions. If you create an index with length 10 it will contain:

  1. I’ve heard
  2. I heard th

You will index only those parts of the questions that are not very distinct from each other, and that do not contain useful information about what the question is. You could instead create an index of length 255:

  1. I’ve heard that is a bad idea to use select * what do you recommend?
  2. I heard that using index_field length will affect the indexing principle during query execution (e.g., one field is varchar and its length is not fixed, some values have short text, some values have long text. at this time). If we use this field as index

In this case, the index includes the whole first question and almost all of the second question. This makes the index too large, and requires us to use more disk space (which causes more IO). Also, indexing that much of the second question is probably unnecessary.

If we make an index of length 75, we will have:

  1. I’ve heard that is a bad idea to use select * what do you recommend?
  2. I heard that using index_field length will affect the indexing principle du

This is more than enough for the first question, and gives a good idea of what is in the second question. It will also potentially have enough unique entries to make its cardinality look more like the cardinality of the real data distribution.

To conclude: choosing the correct index length is something that requires practice and analysis of your actual data. Try to make indexes as short as possible, but long enough that the number of unique entries in the index is similar to the number of unique entries in the table.
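
For example, a 75-character prefix index on a hypothetical questions table (the table, index, and column names here are illustrative only) would be created like this:

mysql> ALTER TABLE questions ADD INDEX idx_question (question(75));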

Q: Which view can we query to see stats?

A: Do you mean index statistics? SHOW INDEX FROM table_name will do it.

Q: We have an InnoDB table with 47 fields (mostly text); some are ft-indexed. I tried to do an alter table, and it ran for 24 hours. What is the best way to run an alter table to add one extra field? The table has 1.9 M rows and 47 columns with many indexes.

A: Adding a column requires a table copy. Therefore, the speed of this operation depends on the table size and the speed of your disk. If you are using version 5.6 or later, adding a column does not block parallel queries (and therefore is not a big deal). If you are using an older version, you can always use the pt-online-schema-change utility from Percona Toolkit. However, it will run even more slowly than a regular ALTER TABLE. Unfortunately, you cannot speed up the execution of ALTER TABLE much. The only thing you can do is to use a faster disk (with options tuned to exploit the speed of the disk).

However, if you do not want the increased IO to affect the production server, you can alter the table on a separate instance, then copy the tablespace to production and apply all changes to the original table from the binary logs. The steps will be something like:

  1. Ensure you use the option innodb_file_per_table and that the big table has an individual tablespace
  2. Ensure that the binary log is enabled
  3. Start a new server (you can also use an existing stand-by slave).
  4. Disable writes to the table
  5. Record the binary log position
  6. Copy the tablespace to the new server as described here.
  7. Enable writes on the production server
  8. Run ALTER TABLE on the new server you created in step 3 (it will still take 24 hours)
  9. Stop writes to the table on the production server
  10. Copy the tablespace, altered in step 8
  11. Apply all writes to this table which are in the binary logs after the position recorded in step 5.
  12. Enable writes to the table

This scenario will take even more time overall, but will have minimal impact on the production server.

Q: If there is a compound index like index1(emp_id,date), will the following query be able to use index? “select * from table1 where emp_id = 10”

A: Yes. At least it should.
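
You can confirm this with EXPLAIN: the key column should show index1 (using the table and index names from the question):

mysql> EXPLAIN SELECT * FROM table1 WHERE emp_id = 10;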

Q: Are filesort and temporary in the extended info for EXPLAIN not good?

A: Regarding filesort: it depends. For example, you will always see “filesort” for tables which perform ORDER BY and cannot use an index for the ORDER BY. This is not always bad. For example, in this query:

mysql> explain select emp_no, first_name from employees where emp_no <20000 order by first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 18722
     filtered: 100.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0,01 sec)

the primary key is used to resolve rows, and the filesort was necessary and not avoidable. You can read about the different filesort algorithms here.

Regarding Using temporary: this means that during query execution a temporary table will be created. This can be bad, especially if the temporary table is large and cannot fit into memory: in that case, it would be written to disk and slow down operations. But, again, sometimes creating temporary tables is not avoidable: for example, if you have both GROUP BY and ORDER BY clauses which list columns differently, as stated in the user manual.
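
As a minimal sketch of that case (with a hypothetical table t and columns col1, col2), grouping by one expression while ordering by another will typically show both flags in the Extra column:

mysql> EXPLAIN SELECT col1, MAX(col2) FROM t GROUP BY col1 ORDER BY MAX(col2);
-- expected Extra: Using temporary; Using filesort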

Q: Is key_len length more of a good thing for query execution?

A: The key_len field is not NULL for all queries that use an index, and simply shows the length of the key part used. It is neither good nor bad; it is just for information. You can use this information, for example, to identify which part of a combined index is used to resolve the query.

Q: Does an alter query go through an optimizer check?

A: No. You can check this either by enabling the optimizer trace, running the ALTER, and finding that the trace is empty; or by enabling the debug option and searching the resulting trace for optimize.

Q: A query involves four columns that are all individually covered by an index. The optimizer didn’t merge indexes because of cost, and didn’t even choose the composite index I created.

A: This depends on the table definition and query you used. I cannot provide a more detailed answer based only on this information.

Q cont.: Finally, only certain composite indexes were suitable; the column order in the composite index mattered a lot. Why couldn’t the optimizer merge the four individual single column indexes, and why did the order of the columns in the composite index matter?

A: Regarding why the optimizer could not merge the four indexes, I need to see how the table is defined and which data is in these indexed columns. Regarding why the order of the columns in the composite index matters, it depends on the query: the optimizer can use an index, say, on (col1, col2) for both the conditions col1=X AND col2=Y and col2=Y AND col1=X. But for the case when you use OR, the order is important. For example, for the condition col1=X OR col2=Y, the part col1=X is always executed, and the part col2=Y is executed only when col1=X is false. The same logic applies to queries like SELECT col1 WHERE col2=Y ORDER BY col3. See the user manual for details.

Q: When I try to obtain the optimizer trace on the console, the result is cut off. Even if I redirect the output to a file, how do I overcome that?

A: Which version of MySQL Server do you use? The TRACE column is defined as longtext NOT NULL, and should not cause such issues. If it does with a newer version, report a bug at http://bugs.mysql.com/.

Q: Are there any free graphical visualizers for either EXPLAIN or the optimizer TRACE output?

A: There is a graphical visualizer for EXPLAIN in MySQL Workbench. But it works with online data only: you cannot run it on EXPLAIN output saved to a file. I don’t know of any visualizer for the optimizer TRACE output. However, since it is JSON you can simply save it to a file and open it in a web browser. That will give a better view than a simple text editor.

Q: When do you use FORCE INDEX instead of USE INDEX hints?

A: According to the user manual, the “USE INDEX (index_list) hint tells MySQL to use only one of the named indexes to find rows in the table” and the “FORCE INDEX hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive . . . a table scan is used only if there is no way to use one of the named indexes to find rows in the table.” This means that when you use USE INDEX, you are giving the optimizer a hint to prefer a particular index over others, but not enforcing index usage if the optimizer prefers a table scan, while FORCE INDEX requires using the index. I myself use only the FORCE and IGNORE index hints.
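
As a quick illustration of the syntax (the index name idx_last_name here is hypothetical):

mysql> SELECT * FROM employees USE INDEX (idx_last_name) WHERE last_name = 'Smith';
mysql> SELECT * FROM employees FORCE INDEX (idx_last_name) WHERE last_name = 'Smith';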

Q: Very informative session. I missed the beginning part. Are you going to distribute the recorded session later?

A: Yes. As usual, the slides and recording are available here.

Apr
06
2016
--

EXPLAIN FORMAT=JSON wrap-up

EXPLAIN FORMAT=JSON

This blog is an EXPLAIN FORMAT=JSON wrap-up for the series of posts I’ve done in the last few months.

In this series, we’ve discussed everything unique to EXPLAIN FORMAT=JSON. I intentionally skipped a description of members such as table_name, access_type or select_id, which are not unique.

In this series, I only mentioned in passing members that replace information from the Extra column in the regular EXPLAIN output, such as using_join_buffer, partitions, using_temporary_table or simply message. You can see these in queries like the following:

mysql> explain format=json select rand() from dual\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "message": "No tables used"
  }
}
1 row in set, 1 warning (0.00 sec)

Or

mysql> explain format=json select emp_no from titles where 'Senior Engineer' = 'Senior Cat'\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "message": "Impossible WHERE"
  }
}
1 row in set, 1 warning (0.01 sec)

Their use is fairly intuitive, similar to the regular EXPLAIN, and I don’t think one would gain much from reading a blog post about each of them.

The only thing left to list is a Table of Contents for the series:

attached_condition: How EXPLAIN FORMAT=JSON can spell-check your queries

rows_examined_per_scan, rows_produced_per_join: EXPLAIN FORMAT=JSON answers on question “What number of filtered rows mean?”

used_columns: EXPLAIN FORMAT=JSON tells when you should use covered indexes

used_key_parts: EXPLAIN FORMAT=JSON provides insight into which part of multiple-column key is used

EXPLAIN FORMAT=JSON: everything about attached_subqueries, optimized_away_subqueries, materialized_from_subquery

EXPLAIN FORMAT=JSON provides insights on optimizer_switch effectiveness

EXPLAIN FORMAT=JSON: order_by_subqueries, group_by_subqueries details on subqueries in ORDER BY and GROUP BY

grouping_operation, duplicates_removal: EXPLAIN FORMAT=JSON has all details about GROUP BY

EXPLAIN FORMAT=JSON has details for subqueries in HAVING, nested selects and subqueries that update values

ordering_operation: EXPLAIN FORMAT=JSON knows everything about ORDER BY processing

EXPLAIN FORMAT=JSON knows everything about UNIONs: union_result and query_specifications

EXPLAIN FORMAT=JSON: buffer_result is not hidden!

EXPLAIN FORMAT=JSON: cost_info knows why optimizer prefers one index to another

EXPLAIN FORMAT=JSON: nested_loop makes JOIN hierarchy transparent

Thanks for following the series!

Feb
29
2016
--

EXPLAIN FORMAT=JSON: nested_loop makes JOIN hierarchy transparent

EXPLAIN FORMAT=JSON

Once again it’s time for another EXPLAIN FORMAT=JSON is cool! post. This post will discuss how EXPLAIN FORMAT=JSON, with its nested_loop member, makes the JOIN operation hierarchy transparent.

The regular EXPLAIN command lists each table that participates in a JOIN operation on a single row. This works perfectly for simple queries:

mysql> explain select * from employees join titles join salaries\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: employees
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299379
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: titles
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 442724
     filtered: 100.00
        Extra: Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2745434
     filtered: 100.00
        Extra: Using join buffer (Block Nested Loop)
3 rows in set, 1 warning (0.00 sec)

You can see that the first accessed table was employees, then titles, and finally salaries. Everything is clear.

EXPLAIN FORMAT=JSON in this case puts everything into the nested_loop array (even if “MySQL isn’t limited to nested-loop joins”):

mysql> explain format=json select * from employees join titles join salaries\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "7.277755124e16"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "employees",
          "access_type": "ALL",
          "rows_examined_per_scan": 299379,
          "rows_produced_per_join": 299379,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "929.00",
            "eval_cost": "59875.80",
            "prefix_cost": "60804.80",
            "data_read_per_join": "13M"
          },
          "used_columns": [
            "emp_no",
            "birth_date",
            "first_name",
            "last_name",
            "gender",
            "hire_date"
          ]
        }
      },
      {
        "table": {
          "table_name": "titles",
          "access_type": "ALL",
          "rows_examined_per_scan": 442724,
          "rows_produced_per_join": 132542268396,
          "filtered": "100.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "62734.88",
            "eval_cost": "26508453679.20",
            "prefix_cost": "26508577218.88",
            "data_read_per_join": "7T"
          },
          "used_columns": [
            "emp_no",
            "title",
            "from_date",
            "to_date"
          ]
        }
      },
      {
        "table": {
          "table_name": "salaries",
          "access_type": "ALL",
          "rows_examined_per_scan": 2745434,
          "rows_produced_per_join": 363886050091503872,
          "filtered": "100.00",
          "using_join_buffer": "Block Nested Loop",
          "cost_info": {
            "read_cost": "314711040856.92",
            "eval_cost": "7.277721002e16",
            "prefix_cost": "7.277755124e16",
            "data_read_per_join": "5171P"
          },
          "used_columns": [
            "emp_no",
            "salary",
            "from_date",
            "to_date"
          ]
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.00 sec)

For a simple query this output does not add much, except the cost info and information about used columns and the efficiency of composite indexes.

But what if you not only join tables, but also use other SQL language features? For example, for the query below, which has two JOIN operations and two subqueries, a regular EXPLAIN returns this plan:

mysql> explain select * from employees join dept_manager using (emp_no) where emp_no in (select emp_no from (select emp_no, salary from salaries where emp_no in (select emp_no from titles where title like '%manager%') group by emp_no, salary having salary > avg(salary) ) t )\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: <subquery2>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: dept_manager
   partitions: NULL
         type: ref
possible_keys: PRIMARY,emp_no
          key: PRIMARY
      key_len: 4
          ref: <subquery2>.emp_no
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: employees
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: <subquery2>.emp_no
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 4. row ***************************
           id: 2
  select_type: MATERIALIZED
        table: <derived3>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 9
     filtered: 100.00
        Extra: NULL
*************************** 5. row ***************************
           id: 3
  select_type: DERIVED
        table: titles
   partitions: NULL
         type: index
possible_keys: PRIMARY,emp_no
          key: emp_no
      key_len: 4
          ref: NULL
         rows: 442724
     filtered: 7.51
        Extra: Using where; Using index; Using temporary; Using filesort; LooseScan
*************************** 6. row ***************************
           id: 3
  select_type: DERIVED
        table: salaries
   partitions: NULL
         type: ref
possible_keys: PRIMARY,emp_no
          key: PRIMARY
      key_len: 4
          ref: employees.titles.emp_no
         rows: 9
     filtered: 100.00
        Extra: NULL
6 rows in set, 1 warning (0.00 sec)

It’s pretty hard to understand which part is a subquery and which is not. It is also hard to find out if DERIVED belongs to the first JOIN or to the second. And I am not quite sure why <subquery2> was marked as PRIMARY, which is supposed to indicate the “Outermost SELECT”.

The real issue here is that the internal representation of a JOIN is hierarchical, and MySQL Server (as in the case of UNION) has trouble representing it as a “flat” table.

EXPLAIN FORMAT=JSON, with its hierarchical nature, can help us in this case.

mysql> explain format=json select * from employees join dept_manager using (emp_no) where emp_no in (select emp_no from (select emp_no, salary from salaries where emp_no in (select emp_no from titles where title like '%manager%') group by emp_no, salary having salary > avg(salary) ) t )\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "39.45"
    },
    "nested_loop": [
      {
        "table": {
          "table_name": "<subquery2>",
          "access_type": "ALL",
          "materialized_from_subquery": {
            "using_temporary_table": true,
            "query_block": {
              "table": {
                "table_name": "t",
                "access_type": "ALL",
                "rows_examined_per_scan": 9,
                "rows_produced_per_join": 9,
                "filtered": "100.00",
                "cost_info": {
                  "read_cost": "10.45",
                  "eval_cost": "1.80",
                  "prefix_cost": "12.25",
                  "data_read_per_join": "144"
                },
                "used_columns": [
                  "emp_no",
                  "salary"
                ],
                "materialized_from_subquery": {
                  "using_temporary_table": true,
                  "dependent": false,
                  "cacheable": true,
                  "query_block": {
                    "select_id": 3,
                    "cost_info": {
                      "query_cost": "176246.11"
                    },
                    "grouping_operation": {
                      "using_temporary_table": true,
                      "using_filesort": true,
                      "cost_info": {
                        "sort_cost": "9.54"
                      },
                      "nested_loop": [
                        {
                          "table": {
                            "table_name": "titles",
                            "access_type": "index",
                            "possible_keys": [
                              "PRIMARY",
                              "emp_no"
                            ],
                            "key": "emp_no",
                            "used_key_parts": [
                              "emp_no"
                            ],
                            "key_length": "4",
                            "rows_examined_per_scan": 442724,
                            "rows_produced_per_join": 33229,
                            "filtered": "7.51",
                            "using_index": true,
                            "loosescan": true,
                            "cost_info": {
                              "read_cost": "3380.56",
                              "eval_cost": "6645.94",
                              "prefix_cost": "63199.96",
                              "data_read_per_join": "2M"
                            },
                            "used_columns": [
                              "emp_no",
                              "title",
                              "from_date"
                            ],
                            "attached_condition": "(`employees`.`titles`.`title` like '%manager%')"
                          }
                        },
                        {
                          "table": {
                            "table_name": "salaries",
                            "access_type": "ref",
                            "possible_keys": [
                              "PRIMARY",
                              "emp_no"
                            ],
                            "key": "PRIMARY",
                            "used_key_parts": [
                              "emp_no"
                            ],
                            "key_length": "4",
                            "ref": [
                              "employees.titles.emp_no"
                            ],
                            "rows_examined_per_scan": 9,
                            "rows_produced_per_join": 9,
                            "filtered": "100.00",
                            "cost_info": {
                              "read_cost": "49622.62",
                              "eval_cost": "1.91",
                              "prefix_cost": "176236.57",
                              "data_read_per_join": "152"
                            },
                            "used_columns": [
                              "emp_no",
                              "salary",
                              "from_date"
                            ]
                          }
                        }
                      ]
                    }
                  }
                }
              }
            }
          }
        }
      },
      {
        "table": {
          "table_name": "dept_manager",
          "access_type": "ref",
          "possible_keys": [
            "PRIMARY",
            "emp_no"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "emp_no"
          ],
          "key_length": "4",
          "ref": [
            "<subquery2>.emp_no"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 9,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "9.00",
            "eval_cost": "1.80",
            "prefix_cost": "23.05",
            "data_read_per_join": "144"
          },
          "used_columns": [
            "dept_no",
            "emp_no",
            "from_date",
            "to_date"
          ]
        }
      },
      {
        "table": {
          "table_name": "employees",
          "access_type": "eq_ref",
          "possible_keys": [
            "PRIMARY"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "emp_no"
          ],
          "key_length": "4",
          "ref": [
            "<subquery2>.emp_no"
          ],
          "rows_examined_per_scan": 1,
          "rows_produced_per_join": 1,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "1.00",
            "eval_cost": "0.20",
            "prefix_cost": "39.45",
            "data_read_per_join": "48"
          },
          "used_columns": [
            "emp_no",
            "birth_date",
            "first_name",
            "last_name",
            "gender",
            "hire_date"
          ]
        }
      }
    ]
  }
}
1 row in set, 1 warning (0.01 sec)

At first we see that all our tables, JOIN operations, and subqueries are in the nested_loop array:

"nested_loop": [
      {
        "table": {
          "table_name": "<subquery2>",
...
      {
        "table": {
          "table_name": "dept_manager",
...
      {
        "table": {
          "table_name": "employees",
...
      }
    ]

Then we see that the first table, <subquery2>, was materialized_from_subquery:

"table": {
          "table_name": "<subquery2>",
          "access_type": "ALL",
          "materialized_from_subquery": {
...

Which, in its turn, was materialized_from_subquery too:

"table": {
          "table_name": "<subquery2>",
          "access_type": "ALL",
          "materialized_from_subquery": {
...
                "materialized_from_subquery": {
...

This last subquery performs a grouping_operation on the inner nested_loop (JOIN) of the tables titles and salaries:

"grouping_operation": {
                      "using_temporary_table": true,
                      "using_filesort": true,
                      "cost_info": {
                        "sort_cost": "9.54"
                      },
                      "nested_loop": [
                        {
                          "table": {
                            "table_name": "titles",
...
                        },
                        {
                          "table": {
                            "table_name": "salaries",
...

Now we have a better picture of how the query was optimized: the tables titles and salaries were joined first, then GROUP BY was executed on the result, then the result was materialized and queried. The result of that query, select emp_no from the materialized table t, was materialized again as <subquery2>, and only then joined with the two other tables.
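
To make the plan concrete, here is a rough manual rewrite that mirrors the optimizer’s strategy. This is only an illustrative sketch: the server performs these materializations internally, and the temporary table names t and subquery2 are just stand-ins for its internal structures.

-- Step 1 (select #3): join titles and salaries, group, and materialize the result as t
CREATE TEMPORARY TABLE t AS
SELECT s.emp_no, s.salary
FROM salaries s
JOIN titles ti ON ti.emp_no = s.emp_no
WHERE ti.title LIKE '%manager%'
GROUP BY s.emp_no, s.salary
HAVING s.salary > AVG(s.salary);

-- Step 2: materialize the distinct emp_no list (the optimizer's <subquery2>)
CREATE TEMPORARY TABLE subquery2 AS
SELECT DISTINCT emp_no FROM t;

-- Step 3 (select #1): join the materialized list with the two remaining tables
SELECT * FROM employees JOIN dept_manager USING (emp_no)
WHERE emp_no IN (SELECT emp_no FROM subquery2);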

Conclusion: EXPLAIN FORMAT=JSON helps us understand how complex queries are optimized.

Feb
22
2016
--

EXPLAIN FORMAT=JSON: cost_info knows why optimizer prefers one index to another

Time for another entry in the EXPLAIN FORMAT=JSON is cool! series of blog posts. This time we’ll discuss how EXPLAIN FORMAT=JSON lets you see, via cost_info, why the optimizer prefers one index to another.

Tables often have more than one index. Any of these indexes can be used to resolve a query, so the optimizer has to make a choice. One of the metrics informing that choice is the potential cost of evaluating the query.

For example, let’s take the table titles from the standard employees database:

mysql> show create table titles\G
*************************** 1. row ***************************
       Table: titles
Create Table: CREATE TABLE `titles` (
  `emp_no` int(11) NOT NULL,
  `title` varchar(50) NOT NULL,
  `from_date` date NOT NULL,
  `to_date` date DEFAULT NULL,
  PRIMARY KEY (`emp_no`,`title`,`from_date`),
  KEY `emp_no` (`emp_no`),
  CONSTRAINT `titles_ibfk_1` FOREIGN KEY (`emp_no`) REFERENCES `employees` (`emp_no`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)

It has two indexes, emp_no and PRIMARY, each of which could be used to resolve the query:

select distinct title from titles where year(from_date) > '1990';

At first glance, emp_no doesn’t really fit for this query. PRIMARY does fit, because it contains both the title and from_date fields. Unfortunately, it cannot be used to resolve the query, because we don’t limit the search by emp_no and title. It can, however, be used to select rows from the index. When we use EXPLAIN, though, it shows us that the optimizer has chosen the index emp_no (every secondary index in InnoDB contains a link to the clustered index anyway):

mysql> explain select distinct title from titles where year(from_date) > '1990'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: titles
   partitions: NULL
         type: index
possible_keys: PRIMARY,emp_no
          key: emp_no
      key_len: 4
          ref: NULL
         rows: 442724
     filtered: 100.00
        Extra: Using where; Using index; Using temporary
1 row in set, 1 warning (0.00 sec)
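
As an aside, note that the condition itself is hostile to indexes: year(from_date) wraps the column in a function, so even a dedicated index on from_date could not be range-scanned. A sargable rewrite of the same predicate is possible (a sketch only; idx_from_date is a hypothetical index that is not part of the standard employees schema):

-- hypothetical index, shown only to illustrate the sargable rewrite
ALTER TABLE titles ADD INDEX idx_from_date (from_date);

-- year(from_date) > '1990' is equivalent to from_date >= '1991-01-01'
SELECT DISTINCT title FROM titles WHERE from_date >= '1991-01-01';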

Back to our query: PRIMARY exists in the field possible_keys, but was not chosen. EXPLAIN FORMAT=JSON can show us why. First let’s run it on the original query:

mysql> explain format=json select distinct title from titles where year(from_date) > '1990'\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "89796.80"
    },
    "duplicates_removal": {
      "using_temporary_table": true,
      "using_filesort": false,
      "table": {
        "table_name": "titles",
        "access_type": "index",
        "possible_keys": [
          "PRIMARY",
          "emp_no"
        ],
        "key": "emp_no",
        "used_key_parts": [
          "emp_no"
        ],
        "key_length": "4",
        "rows_examined_per_scan": 442724,
        "rows_produced_per_join": 442724,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "1252.00",
          "eval_cost": "88544.80",
          "prefix_cost": "89796.80",
          "data_read_per_join": "27M"
        },
        "used_columns": [
          "emp_no",
          "title",
          "from_date"
        ],
        "attached_condition": "(year(`employees`.`titles`.`from_date`) > '1990')"
      }
    }
  }
}
1 row in set, 1 warning (0.01 sec)

The important part here is:

"cost_info": {
      "query_cost": "89796.80"
    },

This shows that the overall query_cost is 89796.80. We don’t really know what the units are for this cost, or how it is actually measured. That isn’t important; the only thing that matters for now is that smaller is better. (Think of it like shopping for a product: it doesn’t matter which store you buy it from, just that you buy it at the lowest price.)

Another important member is the cost_info object that belongs to the table itself:

"cost_info": {
          "read_cost": "1252.00",
          "eval_cost": "88544.80",
          "prefix_cost": "89796.80",
          "data_read_per_join": "27M"
        },

Here we get even more detail, such as the cost of the read operation and of evaluation. prefix_cost is not very useful for this example, because it contains the cost of joining to the next table in a JOIN; since we don’t join the table titles with anything else, it is equivalent to the cost of the full query. data_read_per_join contains the amount of data that should be read for each JOIN operation; in our case it is, once again, the same as how much data we must read to fully evaluate the query.
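
If you want to watch the optimizer actually compare such costs, the optimizer trace (available since MySQL 5.6) exposes the per-plan cost calculations. A minimal sketch:

mysql> SET optimizer_trace="enabled=on";
mysql> SELECT DISTINCT title FROM titles WHERE year(from_date) > '1990';
mysql> SELECT trace FROM information_schema.OPTIMIZER_TRACE\G
mysql> SET optimizer_trace="enabled=off";

That said, the FORCE INDEX comparison below is usually the quicker way to answer this particular question.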

Now let’s force the index PRIMARY and examine the EXPLAIN FORMAT=JSON output:

mysql> explain format=json select distinct title from titles force index(primary) where year(from_date) > '1990'\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "531269.80"
    },
    "duplicates_removal": {
      "using_temporary_table": true,
      "using_filesort": false,
      "table": {
        "table_name": "titles",
        "access_type": "index",
        "possible_keys": [
          "PRIMARY",
          "emp_no"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "emp_no",
          "title",
          "from_date"
        ],
        "key_length": "59",
        "rows_examined_per_scan": 442724,
        "rows_produced_per_join": 442724,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "442725.00",
          "eval_cost": "88544.80",
          "prefix_cost": "531269.80",
          "data_read_per_join": "27M"
        },
        "used_columns": [
          "emp_no",
          "title",
          "from_date"
        ],
        "attached_condition": "(year(`employees`.`titles`.`from_date`) > '1990')"
      }
    }
  }
}
1 row in set, 1 warning (0.01 sec)

Notice the numbers are different this time. The total query cost is 531269.80, which is about 6 times greater than 89796.80:

"cost_info": {
      "query_cost": "531269.80"
    },

read_cost is 442725.00, which is 353 times greater than 1252.00. However, eval_cost and data_read_per_join are the same as for the query that uses the index emp_no:

"cost_info": {
          "read_cost": "442725.00",
          "eval_cost": "88544.80",
          "prefix_cost": "531269.80",
          "data_read_per_join": "27M"
        },

These numbers clearly explain why the optimizer prefers the index emp_no to PRIMARY.

In our example above this preference is correct. In a real-life scenario, if the optimizer’s choice is wrong, these numbers can show either that there is a bug in the optimizer or that the table’s statistics are outdated and need to be updated.
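
If stale statistics are the suspect, refreshing them is a one-liner (the status output shown is typical, though the exact wording can vary by version):

mysql> ANALYZE TABLE titles;
+------------------+---------+----------+----------+
| Table            | Op      | Msg_type | Msg_text |
+------------------+---------+----------+----------+
| employees.titles | analyze | status   | OK       |
+------------------+---------+----------+----------+

After that, re-run EXPLAIN FORMAT=JSON to see whether the cost estimates, and therefore the chosen index, have changed.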

Conclusion: EXPLAIN FORMAT=JSON can be used together with FORCE INDEX to find out why the optimizer prefers one index to another.

Feb
09
2016
--

EXPLAIN FORMAT=JSON: buffer_result is not hidden!

Time for another entry in the EXPLAIN FORMAT=JSON is cool! series. Today we’re going to look at how you can view the buffer result using JSON (instead of the regular EXPLAIN command).

Regular EXPLAIN does not identify whether SQL_BUFFER_RESULT was used at all. To demonstrate, let’s run this query:

mysql> explain select * from salaries\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2557022
     filtered: 100.00
        Extra: NULL
1 row in set, 1 warning (0.01 sec)
Note (Code 1003): /* select#1 */ select `employees`.`salaries`.`emp_no` AS `emp_no`,`employees`.`salaries`.`salary` AS `salary`,`employees`.`salaries`.`from_date` AS `from_date`,`employees`.`salaries`.`to_date` AS `to_date` from `employees`.`salaries`

Now, let’s compare it to this query:

mysql> explain select sql_buffer_result * from salaries\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2557022
     filtered: 100.00
        Extra: Using temporary
1 row in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select sql_buffer_result `employees`.`salaries`.`emp_no` AS `emp_no`,`employees`.`salaries`.`salary` AS `salary`,`employees`.`salaries`.`from_date` AS `from_date`,`employees`.`salaries`.`to_date` AS `to_date` from `employees`.`salaries`

Notice there is no difference, except the expected "Using temporary" value in the "Extra" row of the second query. "Using temporary" is expected here, because SQL_BUFFER_RESULT directly instructs the MySQL server to put the result set into a temporary table so that locks can be freed early. But what if the query uses a temporary table by itself, for example for a grouping operation? In that case, the EXPLAIN result for the original query and for the query that contains the SQL_BUFFER_RESULT clause will be 100% identical.

Compare:

mysql> explain select emp_no, salary/avg(salary) from salaries group by emp_no, salary\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2557022
     filtered: 100.00
        Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select `employees`.`salaries`.`emp_no` AS `emp_no`,(`employees`.`salaries`.`salary` / avg(`employees`.`salaries`.`salary`)) AS `salary/avg(salary)` from `employees`.`salaries` group by `employees`.`salaries`.`emp_no`,`employees`.`salaries`.`salary`

With:

mysql> explain select sql_buffer_result emp_no, salary/avg(salary) from salaries group by emp_no, salary\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2557022
     filtered: 100.00
        Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select sql_buffer_result `employees`.`salaries`.`emp_no` AS `emp_no`,(`employees`.`salaries`.`salary` / avg(`employees`.`salaries`.`salary`)) AS `salary/avg(salary)` from `employees`.`salaries` group by `employees`.`salaries`.`emp_no`,`employees`.`salaries`.`salary`

There is no difference! We are not able to tell whether a temporary table was used to resolve the query, or the result set was simply put into the buffer. The EXPLAIN FORMAT=JSON command can help here as well. Its output is clear, and shows all the details of the query optimization:

mysql> explain format=json select sql_buffer_result emp_no, salary/avg(salary) from salaries group by emp_no, salary\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "3073970.40"
    },
    "grouping_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "2557022.00"
      },
      "buffer_result": {
        "using_temporary_table": true,
        "table": {
          "table_name": "salaries",
          "access_type": "ALL",
          "rows_examined_per_scan": 2557022,
          "rows_produced_per_join": 2557022,
          "filtered": "100.00",
          "cost_info": {
            "read_cost": "5544.00",
            "eval_cost": "511404.40",
            "prefix_cost": "516948.40",
            "data_read_per_join": "39M"
          },
          "used_columns": [
            "emp_no",
            "salary",
            "from_date"
          ]
        }
      }
    }
  }
}
1 row in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select sql_buffer_result `employees`.`salaries`.`emp_no` AS `emp_no`,(`employees`.`salaries`.`salary` / avg(`employees`.`salaries`.`salary`)) AS `salary/avg(salary)` from `employees`.`salaries` group by `employees`.`salaries`.`emp_no`,`employees`.`salaries`.`salary`

Firstly, we can see how the grouping_operation was optimized:

"grouping_operation": {
      "using_temporary_table": true,
      "using_filesort": true,

And it does indeed use the temporary table.

Now we can follow the details for SQL_BUFFER_RESULT:

"buffer_result": {
        "using_temporary_table": true,

With this output, we can be absolutely certain that a temporary table was created both for SQL_BUFFER_RESULT and for the grouping operation. This is especially helpful for support engineers who need the EXPLAIN output to help their customers tune queries, but are reluctant to ask for the same query to be run twice: once with the SQL_BUFFER_RESULT clause and once without.
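
For runtime confirmation outside of EXPLAIN, the session status counters can corroborate the plan. A quick sketch:

mysql> FLUSH STATUS;  -- resets the session status counters
mysql> SELECT SQL_BUFFER_RESULT emp_no, salary/AVG(salary) FROM salaries GROUP BY emp_no, salary;
mysql> SHOW SESSION STATUS LIKE 'Created_tmp%tables';

A non-zero Created_tmp_tables (or Created_tmp_disk_tables, if the result spilled to disk) confirms that temporary tables were actually created while the query ran.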

Conclusion: EXPLAIN FORMAT=JSON does not hide important details of query optimizations.

Jan
29
2016
--

EXPLAIN FORMAT=JSON knows everything about UNIONs: union_result and query_specifications

Ready for another post in the EXPLAIN FORMAT=JSON is Cool series! Great! This post will discuss how to see all the information contained in optimized queries with UNION, using the union_result and query_specifications members.

When optimizing complicated queries with UNION, it is easy to get lost in the regular EXPLAIN output while trying to identify which part of the output belongs to which part of the UNION.

Let’s consider the following example:

mysql> explain
    ->     select emp_no, last_name, 'low_salary' from employees
    ->     where emp_no in (select emp_no from salaries
    ->         where salary < (select avg(salary) from salaries))
    -> union
    ->     select emp_no, last_name, 'high salary' from employees
    ->     where emp_no in (select emp_no from salaries
        where salary >= (select avg(salary) from salaries))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: employees
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299778
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: salaries
   partitions: NULL
         type: ref
possible_keys: PRIMARY,emp_no
          key: PRIMARY
      key_len: 4
          ref: employees.employees.emp_no
         rows: 9
     filtered: 33.33
        Extra: Using where; FirstMatch(employees)
*************************** 3. row ***************************
           id: 3
  select_type: SUBQUERY
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2557022
     filtered: 100.00
        Extra: NULL
*************************** 4. row ***************************
           id: 4
  select_type: UNION
        table: employees
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 299778
     filtered: 100.00
        Extra: NULL
*************************** 5. row ***************************
           id: 4
  select_type: UNION
        table: salaries
   partitions: NULL
         type: ref
possible_keys: PRIMARY,emp_no
          key: PRIMARY
      key_len: 4
          ref: employees.employees.emp_no
         rows: 9
     filtered: 33.33
        Extra: Using where; FirstMatch(employees)
*************************** 6. row ***************************
           id: 6
  select_type: SUBQUERY
        table: salaries
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 2557022
     filtered: 100.00
        Extra: NULL
*************************** 7. row ***************************
           id: NULL
  select_type: UNION RESULT
        table: <union1,4>
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: NULL
     filtered: NULL
        Extra: Using temporary
7 rows in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select `employees`.`employees`.`emp_no` AS `emp_no`,`employees`.`employees`.`last_name` AS `last_name`,'low_salary' AS `low_salary` from `employees`.`employees` semi join (`employees`.`salaries`) where ((`employees`.`salaries`.`emp_no` = `employees`.`employees`.`emp_no`) and (`employees`.`salaries`.`salary` < (/* select#3 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`))) union /* select#4 */ select `employees`.`employees`.`emp_no` AS `emp_no`,`employees`.`employees`.`last_name` AS `last_name`,'high salary' AS `high salary` from `employees`.`employees` semi join (`employees`.`salaries`) where ((`employees`.`salaries`.`emp_no` = `employees`.`employees`.`emp_no`) and (`employees`.`salaries`.`salary` >= (/* select#6 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`)))

While we can guess that subquery 3 belongs to the first query of the union, and subquery 6 belongs to the second (which has the number 4 for some reason), we have to be very careful (especially in our case) when queries use the same tables in both parts of the UNION. The /* select#N */ comments in the rewritten query printed in the Note above are what tie these id values back to the individual SELECTs.

The main issue with the regular EXPLAIN for UNION is that it has to re-present a hierarchical structure as a flat table. The same issue occurs when you want to store objects created in a programming language, such as Java, in a database. EXPLAIN FORMAT=JSON, on the other hand, is itself hierarchical, and more clearly displays how the UNION was optimized:

mysql> explain format=json select emp_no, last_name, 'low_salary' from employees where emp_no in (select emp_no from salaries  where salary < (select avg(salary) from salaries)) union select emp_no, last_name, 'high salary' from employees where emp_no in (select emp_no from salaries where salary >= (select avg(salary) from salaries))\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "union_result": {
      "using_temporary_table": true,
      "table_name": "<union1,4>",
      "access_type": "ALL",
      "query_specifications": [
        {
          "dependent": false,
          "cacheable": true,
          "query_block": {
            "select_id": 1,
            "cost_info": {
              "query_cost": "921684.48"
            },
            "nested_loop": [
              {
                "table": {
                  "table_name": "employees",
                  "access_type": "ALL",
                  "possible_keys": [
                    "PRIMARY"
                  ],
                  "rows_examined_per_scan": 299778,
                  "rows_produced_per_join": 299778,
                  "filtered": "100.00",
                  "cost_info": {
                    "read_cost": "929.00",
                    "eval_cost": "59955.60",
                    "prefix_cost": "60884.60",
                    "data_read_per_join": "13M"
                  },
                  "used_columns": [
                    "emp_no",
                    "last_name"
                  ]
                }
              },
              {
                "table": {
                  "table_name": "salaries",
                  "access_type": "ref",
                  "possible_keys": [
                    "PRIMARY",
                    "emp_no"
                  ],
                  "key": "PRIMARY",
                  "used_key_parts": [
                    "emp_no"
                  ],
                  "key_length": "4",
                  "ref": [
                    "employees.employees.emp_no"
                  ],
                  "rows_examined_per_scan": 9,
                  "rows_produced_per_join": 299778,
                  "filtered": "33.33",
                  "first_match": "employees",
                  "cost_info": {
                    "read_cost": "302445.97",
                    "eval_cost": "59955.60",
                    "prefix_cost": "921684.48",
                    "data_read_per_join": "4M"
                  },
                  "used_columns": [
                    "emp_no",
                    "salary"
                  ],
                  "attached_condition": "(`employees`.`salaries`.`salary` < (/* select#3 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`))",
                  "attached_subqueries": [
                    {
                      "dependent": false,
                      "cacheable": true,
                      "query_block": {
                        "select_id": 3,
                        "cost_info": {
                          "query_cost": "516948.40"
                        },
                        "table": {
                          "table_name": "salaries",
                          "access_type": "ALL",
                          "rows_examined_per_scan": 2557022,
                          "rows_produced_per_join": 2557022,
                          "filtered": "100.00",
                          "cost_info": {
                            "read_cost": "5544.00",
                            "eval_cost": "511404.40",
                            "prefix_cost": "516948.40",
                            "data_read_per_join": "39M"
                          },
                          "used_columns": [
                            "salary"
                          ]
                        }
                      }
                    }
                  ]
                }
              }
            ]
          }
        },
        {
          "dependent": false,
          "cacheable": true,
          "query_block": {
            "select_id": 4,
            "cost_info": {
              "query_cost": "921684.48"
            },
            "nested_loop": [
              {
                "table": {
                  "table_name": "employees",
                  "access_type": "ALL",
                  "possible_keys": [
                    "PRIMARY"
                  ],
                  "rows_examined_per_scan": 299778,
                  "rows_produced_per_join": 299778,
                  "filtered": "100.00",
                  "cost_info": {
                    "read_cost": "929.00",
                    "eval_cost": "59955.60",
                    "prefix_cost": "60884.60",
                    "data_read_per_join": "13M"
                  },
                  "used_columns": [
                    "emp_no",
                    "last_name"
                  ]
                }
              },
              {
                "table": {
                  "table_name": "salaries",
                  "access_type": "ref",
                  "possible_keys": [
                    "PRIMARY",
                    "emp_no"
                  ],
                  "key": "PRIMARY",
                  "used_key_parts": [
                    "emp_no"
                  ],
                  "key_length": "4",
                  "ref": [
                    "employees.employees.emp_no"
                  ],
                  "rows_examined_per_scan": 9,
                  "rows_produced_per_join": 299778,
                  "filtered": "33.33",
                  "first_match": "employees",
                  "cost_info": {
                    "read_cost": "302445.97",
                    "eval_cost": "59955.60",
                    "prefix_cost": "921684.48",
                    "data_read_per_join": "4M"
                  },
                  "used_columns": [
                    "emp_no",
                    "salary"
                  ],
                  "attached_condition": "(`employees`.`salaries`.`salary` >= (/* select#6 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`))",
                  "attached_subqueries": [
                    {
                      "dependent": false,
                      "cacheable": true,
                      "query_block": {
                        "select_id": 6,
                        "cost_info": {
                          "query_cost": "516948.40"
                        },
                        "table": {
                          "table_name": "salaries",
                          "access_type": "ALL",
                          "rows_examined_per_scan": 2557022,
                          "rows_produced_per_join": 2557022,
                          "filtered": "100.00",
                          "cost_info": {
                            "read_cost": "5544.00",
                            "eval_cost": "511404.40",
                            "prefix_cost": "516948.40",
                            "data_read_per_join": "39M"
                          },
                          "used_columns": [
                            "salary"
                          ]
                        }
                      }
                    }
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
1 row in set, 1 warning (0.00 sec)
Note (Code 1003): /* select#1 */ select `employees`.`employees`.`emp_no` AS `emp_no`,`employees`.`employees`.`last_name` AS `last_name`,'low_salary' AS `low_salary` from `employees`.`employees` semi join (`employees`.`salaries`) where ((`employees`.`salaries`.`emp_no` = `employees`.`employees`.`emp_no`) and (`employees`.`salaries`.`salary` < (/* select#3 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`))) union /* select#4 */ select `employees`.`employees`.`emp_no` AS `emp_no`,`employees`.`employees`.`last_name` AS `last_name`,'high salary' AS `high salary` from `employees`.`employees` semi join (`employees`.`salaries`) where ((`employees`.`salaries`.`emp_no` = `employees`.`employees`.`emp_no`) and (`employees`.`salaries`.`salary` >= (/* select#6 */ select avg(`employees`.`salaries`.`salary`) from `employees`.`salaries`)))

First it puts the union_result member in the query_block at the very top level:

EXPLAIN: {
  "query_block": {
    "union_result": {

The union_result object contains information about how the result set of the UNION was processed:

"using_temporary_table": true,
      "table_name": "<union1,4>",
      "access_type": "ALL",

It also contains the query_specifications array, which holds all the details about the individual queries in the UNION:

"query_specifications": [
        {
          "dependent": false,
          "cacheable": true,
          "query_block": {
            "select_id": 1,
<skipped>
        {
          "dependent": false,
          "cacheable": true,
          "query_block": {
            "select_id": 4,

This representation is much clearer, and it contains all the details that the regular EXPLAIN misses.

Conclusion: EXPLAIN FORMAT=JSON not only contains additional optimization information for each query in the UNION, but also has a hierarchical structure that better suits the hierarchical nature of UNION itself.
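
One related check worth knowing: with UNION ALL there are no duplicates to remove, so the server can often skip the temporary table, and the same union_result member makes that visible. A sketch (the exact output is version-dependent):

mysql> EXPLAIN FORMAT=JSON
    ->     SELECT emp_no FROM employees
    ->     UNION ALL
    ->     SELECT emp_no FROM dept_manager\G

On versions where this optimization applies, union_result reports "using_temporary_table": false instead of true.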
