Mar
22
2021
--

Want MongoDB Performance? You Will Need to Add and Remove Indexes!


Good intentions can sometimes end up with bad results. Adding indexes boosts performance until it doesn’t. Avoid over-indexing.

Whether your application is fast, responsive, and scales properly often depends on how you use indexes in the database. MongoDB is no different: its performance (and the overall performance of your application) depends heavily on having the right number of indexes on the right things. A simple index or two can speed up getting data from MongoDB a million-fold for collections with millions of documents. But at the same time, having too many indexes on a large collection can lead to massive slowdowns in overall performance. You need to get your indexes just right.

In this blog, we are going to talk about having too many indexes, and help you find both duplicate and unused indexes. If you are interested in finding out whether you need additional indexes, or whether your queries are using an index, I suggest reading previous Percona articles on query tuning (Part 1 & Part 2 of that series).
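As a quick refresher (a sketch of mine, not taken from those articles), you can check whether a single query uses an index by looking at the winning plan reported by explain():

# "IXSCAN" in the winning plan means an index is used,
# while "COLLSCAN" means a full collection scan.
db.test.find( { name: "corrado" } ).explain("executionStats").queryPlanner.winningPlan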

So, indexes are very good for getting faster queries. But how many indexes do I need to create on a collection? What are the best practices for indexes? How do I find out which indexes are being used and which are not? What if I have duplicate indexes?

Common Performance Problems

After analyzing a lot of different MongoDB environments, I can provide the following list summarizing the typical errors I have seen:

  • Not creating indexes at all, other than the primary key _id created by design.
    • I’m not joking – I have seen databases without any user-created indexes, whose owners were surprised that the server was overloaded and/or the queries were very slow.
  • Over-indexing the collection.
    • Some developers create a lot of indexes without a specific reason, or just to test a query, and then forget to drop them.
    • In some cases, the size of all the indexes was larger than the data. This is not good; indexes should be as small as possible to be really effective. A quick way to check this is sketched right after this list.
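Here is a minimal sketch (the collection name is just a placeholder) for comparing total index size against data size using db.collection.stats():

# Both values are reported in bytes; if the indexes rival or exceed
# the data, the collection is probably over-indexed.
var s = db.mycollection.stats();
print("data size:  " + s.size);
print("index size: " + s.totalIndexSize);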

I’m not going to consider the first case; instead, I’ll discuss the second one.

How Many Indexes Do You Need in a Collection?

It depends – that’s the right answer. Basically, it depends on your application workload. You should consider the following rules when indexing a collection:

  • Create as many indexes as possible for your application.
  • Don’t create a lot of indexes.

What? These rules state the opposite of each other! Well, we can summarize them in just one simple rule:

  • You need to create all the indexes your application really needs for solving the most frequent queries. Not one more, not one less.

That’s it.

Pros and Cons of Indexing

The big advantage of indexes is that they let queries, updates, and deletes run as fast as possible when they are used (every update or delete also needs to do a lookup step first). More indexes on a collection can benefit several different queries.

Unfortunately, indexes require some extra work for MongoDB. Any time you run a write, all the indexes must be updated. The new values are stored in or dropped from the B-Tree structure, some splitting or merging may be needed, and this takes time.

The main problem is that the more indexes you have on a collection, the slower all the writes will be.

A very large collection with just 10 or 15 indexes can show a significant performance loss on writes. Also, remember that indexes have to be copied into the WiredTiger cache, so more indexes mean more pressure on the memory cache. That pressure can then lead to more cache evictions and slowness.

A good example of this: when I was working with a customer a few weeks ago, we found 12 extra indexes on a collection that they did not need. The collection was around 80GB, and the total index size was larger than the data size. They had a significant write load, with frequent inserts and updates running all the time. Dropping these indexes improved their write query execution time by 25-30 percent on average. The improvement observed in this real case won’t be the same amount in other cases, but for sure the fewer indexes you have, the faster all the writes will be.

We need to find a balance: create the indexes the application really needs, but not too many.

How to Reduce Over-Indexing

Very easy to say: drop all the indexes you don’t need.

There are two things you can do to identify the indexes to drop:

  • Check for the duplicates.
  • Check for the unused indexes.

To drop an index, you need to run something like the following:

db.mycollection.dropIndex("index_name")

Find Duplicate Indexes

A duplicate index is an index with exactly the same definition as an index that already exists on the collection. Fortunately, MongoDB is able to check for this, and creating such an index is not permitted.

Let’s do a test using a simple collection with no indexes.

rs_test:PRIMARY> db.test.find()
{ "_id" : ObjectId("60521309d7268c122c7cd630"), "name" : "corrado", "age" : 49 }
{ "_id" : ObjectId("60521313d7268c122c7cd631"), "name" : "simone", "age" : 12 }
{ "_id" : ObjectId("6052131cd7268c122c7cd632"), "name" : "gabriele", "age" : 19 }
{ "_id" : ObjectId("60521323d7268c122c7cd633"), "name" : "luca", "age" : 14 }
{ "_id" : ObjectId("60521328d7268c122c7cd634"), "name" : "lucia", "age" : 49 }

# create an index on name field
rs_test:PRIMARY> db.test.createIndex( { name: 1 } )
{
   "createdCollectionAutomatically" : false,
   "numIndexesBefore" : 1,
   "numIndexesAfter" : 2,
   "commitQuorum" : "votingMembers",
   "ok" : 1,
   "$clusterTime" : {
      "clusterTime" : Timestamp(1615991942, 5),
      "signature" : {
         "hash" : BinData(0,"vQN6SGIL0fAMvTusJ12KgySqKOI="),
         "keyId" : NumberLong("6890926313742270469")
      }
   },
   "operationTime" : Timestamp(1615991942, 5)
}

# check indexes available
rs_test:PRIMARY> db.test.getIndexes()
[
   {
      "v" : 2,
      "key" : {
         "_id" : 1
      },
      "name" : "_id_"
   },
   {
      "v" : 2,
      "key" : {
         "name" : 1
      },
      "name" : "name_1"
   }
]

# try to create the same index again
rs_test:PRIMARY> db.test.createIndex( { name: 1 } )
{
   "numIndexesBefore" : 2,
   "numIndexesAfter" : 2,
   "note" : "all indexes already exist",
   "ok" : 1,
   "$clusterTime" : {
      "clusterTime" : Timestamp(1615991942, 5),
      "signature" : {
         "hash" : BinData(0,"vQN6SGIL0fAMvTusJ12KgySqKOI="),
         "keyId" : NumberLong("6890926313742270469")
      }
   },
   "operationTime" : Timestamp(1615991942, 5)
}

# great, MongoDB can detect the index already exists

# let's try to see if you can create the same index with a different name
rs_test:PRIMARY> db.test.createIndex( { name: 1 }, { name: "this_is_a_different_index_name" } )
{
   "operationTime" : Timestamp(1615991981, 1),
   "ok" : 0,
   "errmsg" : "Index with name: this_is_a_different_index_name already exists with a different name",
   "code" : 85,
   "codeName" : "IndexOptionsConflict",
   "$clusterTime" : {
      "clusterTime" : Timestamp(1615991981, 1),
      "signature" : {
         "hash" : BinData(0,"whkRyQQxyJVBt+7d3HOtFvYY32g="),
         "keyId" : NumberLong("6890926313742270469")
      }
   }
}

# even in this case MongoDB doesn't permit the index creation

MongoDB is then clever enough to avoid creating duplicate indexes. But what about creating an index that is the left-prefix of an existing index? Let’s test it.

# let's drop the previous index we have created
rs_test:PRIMARY> db.test.dropIndex( "name_1" )
{
   "nIndexesWas" : 2,
   "ok" : 1,
   "$clusterTime" : {
      "clusterTime" : Timestamp(1615993029, 1),
      "signature" : {
         "hash" : BinData(0,"njFiuCeyA5VcdNOOP2ASboOpWwo="),
         "keyId" : NumberLong("6890926313742270469")  
      }
   },
   "operationTime" : Timestamp(1615993029, 1)
}

# check indexes. Only _id available
rs_test:PRIMARY> db.test.getIndexes()
[ { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_" } ]

# create a compound index 
rs_test:PRIMARY> db.test.createIndex( { name:1, age: 1 } )
{
   "createdCollectionAutomatically" : false,
   "numIndexesBefore" : 1,
   "numIndexesAfter" : 2,
   "commitQuorum" : "votingMembers",
   "ok" : 1,
   "$clusterTime" : {
      "clusterTime" : Timestamp(1615993054, 5),
      "signature" : {
         "hash" : BinData(0,"gfaPsWsSM745opEiQORCt2L3HYo="),
         "keyId" : NumberLong("6890926313742270469")
      }
   },
   "operationTime" : Timestamp(1615993054, 5)
}

# create another index that is the leftmost prefix of the compound index
rs_test:PRIMARY> db.test.createIndex( { name:1 } )
{
   "createdCollectionAutomatically" : false,
   "numIndexesBefore" : 2,
   "numIndexesAfter" : 3,
   "commitQuorum" : "votingMembers",
   "ok" : 1,
   "$clusterTime" : {
      "clusterTime" : Timestamp(1615993060, 5),
      "signature" : {
         "hash" : BinData(0,"C2XWVA5mi+WWyPMn3Jw2VHTw/Dk="),
         "keyId" : NumberLong("6890926313742270469")
      }
   },
   "operationTime" : Timestamp(1615993060, 5)
}

# check indexes
rs_test:PRIMARY> db.test.getIndexes()
[
   {
      "v" : 2,
      "key" : {
         "_id" : 1 
      },
      "name" : "_id_"
   },
   {
      "v" : 2,
      "key" : {
         "name" : 1,
         "age" : 1
      },
      "name" : "name_1_age_1"
   },
   {
      "v" : 2,
      "key" : {
         "name" : 1
      },
      "name" : "name_1"
   }
]

This time MongoDB allowed the creation. But we consider a leftmost-prefix index a duplicate as well.

To benefit from a compound index, MongoDB doesn’t need to use all the fields of that index; the leftmost prefix is enough. For example, an index on (A,B,C) can be used to satisfy the combinations (A), (A,B), and (A,B,C), but not (B) or (B,C). As a consequence, if I have two different indexes, one on (A,B,C) and another on (A,B), the second is a duplicate, because the first can be used the same way to solve queries on the combinations (A,B) and (A).

So, find all the duplicate indexes and drop them, since they’re useless. Just be aware, and check that your application doesn’t use hint() on the indexes you’re going to drop.

To avoid manually checking all the collections to discover duplicates, here is a JavaScript script that does it:

// Scan every database (skipping admin, config, and local) and flag any
// index whose key pattern is the leftmost prefix of another index on the
// same collection.
var ldb = db.adminCommand( { listDatabases: 1 } );

for ( i = 0; i < ldb.databases.length; i++ ) {

   if ( ldb.databases[i].name != 'admin' && ldb.databases[i].name != 'config' && ldb.databases[i].name != 'local') {

      print('DATABASE ',ldb.databases[i].name);
      print("+++++++++++++++++++++++++++++++++++++++++")

      var db = db.getSiblingDB(ldb.databases[i].name);
      var cpd = db.getCollectionNames();

      for ( j = 0; j < cpd.length; j++ ) {

         if ( cpd[j] != 'system.profile' ) {

            var indexes = JSON.parse(JSON.stringify(db.runCommand( { listIndexes: cpd[j] } ).cursor.firstBatch));
            print("COLL :"+cpd[j]);

            // Normalize each key document into a comparable string,
            // e.g. { name: 1, age: 1 } becomes "name":1_"age":1
            for ( k = 0; k < indexes.length; k++ ) {

               indexes[k] = (((JSON.stringify(indexes[k].key)).replace("{","")).replace("}","")).replace(/,/g ,"_");

            }

            var founddup = false;

            // Compare every pair of indexes: if one key string is the left
            // prefix of another, the shorter index is redundant.
            for ( k1 = 0; k1 < indexes.length; k1++ ) {

               for ( k2 = 0; k2 < indexes.length; k2++ ) {

                  if ( k1 != k2 ) {

                     if (indexes[k1].startsWith(indexes[k2],0)) {

                        print("{ "+indexes[k2]+" } is the left prefix of { "+indexes[k1]+" } and should be dropped");

                        founddup = true;

                     }
                  }
               }
            }

            if (!founddup) {

               print("no duplicate indexes found");

            }

            print("\n");

         }
      }

      print("\n");
   }
}

Note: this script is just an initial test and could be improved, but it should work in most cases.

Find Unused Indexes

MongoDB maintains internal statistics about index usage. Any time an index is used to solve a query, a specific counter is incremented. After running MongoDB for a significant amount of time – days or weeks – the statistics become reliable, and we can find out which indexes have been used and which have not.

To look at the index stats, MongoDB provides an aggregation pipeline stage: $indexStats.

Here you can see an example:

rs_test:PRIMARY> db.restaurants.aggregate([ { $indexStats: {} } ]).pretty()
{
   "name" : "borough_1",
   "key" : {
      "borough" : 1
   },
   "host" : "ip-172-30-2-12:27017",
   "accesses" : {
      "ops" : NumberLong(312),
      "since" : ISODate("2021-03-17T13:48:51.305Z")
   },
   "spec" : {
      "v" : 2,
      "key" : {
         "borough" : 1
      },
      "name" : "borough_1"
   }
}
{
   "name" : "_id_",
   "key" : {
      "_id" : 1
   },
   "host" : "ip-172-30-2-12:27017",
   "accesses" : {
      "ops" : NumberLong(12),
      "since" : ISODate("2021-03-17T13:48:51.305Z")
   },
   "spec" : {
      "v" : 2,
      "key" : {
         "_id" : 1
      },
      "name" : "_id_"
   }
}
{
   "name" : "cuisine_1_borough_1",
   "key" : {
      "cuisine" : 1,
      "borough" : 1
   },
   "host" : "ip-172-30-2-12:27017",
   "accesses" : {
      "ops" : NumberLong(0),
      "since" : ISODate("2021-03-17T13:48:51.305Z")
   },
   "spec" : { 
      "v" : 2,
      "key" : {
         "cuisine" : 1,
         "borough" : 1
      },
      "name" : "cuisine_1_borough_1"
   }
}

The accesses.ops field is the number of times the index has been used. In the example, you can see the index { borough: 1 } has been used 312 times, the index { _id: 1 } 12 times, and the index { cuisine: 1, borough: 1 } 0 times. The last one could be dropped.

If the database has been running for a long time and millions of queries have been executed, an index that was not used will most probably not be used in the future either.

You should then consider dropping the unused indexes in order to improve writes, reduce cache pressure, and save disk space as well.

Using the following script you can find out the index statistics for all the collections:

// Print $indexStats for every collection in every user database,
// skipping the admin, config, and local system databases.
var ldb = db.adminCommand( { listDatabases: 1 } );

for ( i = 0; i < ldb.databases.length; i++ ) {

   if ( ldb.databases[i].name != 'admin' && ldb.databases[i].name != 'config' && ldb.databases[i].name != 'local' ) {

      print('DATABASE ', ldb.databases[i].name);

      var db = db.getSiblingDB(ldb.databases[i].name);
      var cpd = db.getCollectionNames();

      for ( j = 0; j < cpd.length; j++ ) {

         if ( cpd[j] != 'system.profile' ) {

            print(cpd[j]);

            var pui = db.runCommand({ aggregate: cpd[j], pipeline: [ { $indexStats: {} } ], cursor: { batchSize: 100 } });
            printjson(pui);

         }
      }

      print('\n\n');
   }
}

Look for the indexes having "ops" : NumberLong(0).
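Instead of eyeballing the output, you can also filter for the never-used indexes directly in the pipeline; a small sketch using the same restaurants collection:

# $indexStats must be the first stage; the $match then keeps only
# indexes never used since the counters started (the "since" timestamp).
db.restaurants.aggregate([
   { $indexStats: {} },
   { $match: { "accesses.ops": 0 } },
   { $project: { name: 1, "accesses.ops": 1 } }
])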

Conclusion

Creating indexes to solve queries is a good habit, but be careful not to abuse indexing. Excessive indexing can lead to slower writes, more pressure on the memory cache, and more evictions.

You should maintain your indexes from time to time, dropping all the duplicate and unused ones. The scripts provided in this article may help with your index analysis.

Apr
05
2019
--

Peter Kraus dishes on the market

During my recent conversation with Peter Kraus, which was supposed to be focused on Aperture and its launch of the Aperture New World Opportunities Fund, I couldn’t help veering off into tangents about the market in general. Below is Kraus’ take on the availability of alpha generation, the Fed, inflation versus Amazon, housing, the cross-ownership of U.S. equities by a few huge funds and high-frequency trading.

Gregg Schoenberg: Will alpha be more available over the next five years than it has been over the last five?

To think that at some point equities won’t become more volatile and decline 20% to 30%… I think it’s crazy.

Peter Kraus: Do I think it’s more available in the next five years than it was in the last five years? No. Do I think people will pay more attention to it? Yes, because when markets are up 30 percent, if you get another five, it doesn’t matter. When markets are down 30 percent and I save you five by being 25 percent down, you care.

GS: Is the Fed’s next move up or down?

PK: I think the Fed does zero, nothing. In terms of its next interest rate move, in my judgment, there’s a higher probability that it’s down versus up.

Jun
26
2018
--

MongoDB Explain – Using PMM-QAN for MongoDB Query Analytics

MongoDB explain plan with PMM-QAN

In this blog post, we will walk through PMM Query Analytics (PMM-QAN) for MongoDB. We will see how to analyze MongoDB query performance, review the initial parameters that we need to check, and find out how to compare MongoDB query performance with and without indexes, with the help of the EXPLAIN plan.

The Percona Monitoring and Management QAN (PMM-QAN) dashboard helps DBAs and developers analyze database queries and identify performance issues more easily. It is sometimes difficult to find issues by just enabling the profiler and tracking all the slow queries through the mongo shell.

Test Case Environment

We configured the test environment with PMM Server before we ran the test:

PMM Version: 1.11.0
MongoDB Version: 3.4.10
MongoDB Configuration: 3-member replica set (1 primary, 2 secondaries)
Namespace: "test.pro"
Test case: find query, with and without an index
Indexed field: "product"
Query: db.pro.find({product:"aa"})
Query count: 1000
Total document count: 20000

Please note: we have to enable the profiler in the MongoDB environment for the metrics to appear on the QAN page. The QAN dashboard lists queries based on the threshold we set in the profiler. For this demonstration, we set the profiling level to 2 so that every query is reflected in the dashboard.
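A minimal sketch of doing that from the mongo shell (level 2 profiles every operation, so keep it to short test windows):

# Profile all operations on the current database.
db.setProfilingLevel(2)

# Check the current profiling settings.
db.getProfilingStatus()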

QAN Analysis

Let’s analyze comparisons of the plans that were collected before and after the index was added to the collection.

Query used:

db.pro.find({product:"aa"})

The QAN dashboard lists the FIND query for the “pro” collection.


After selecting this specific query, we will see its details just below the list.

Review parameter: “Query Time”

Without Index

Here Query Time = 18ms. We will check the query count, docs returned, and docs scanned in detail in the explain plan.

With Index

Now Query Time = 15ms. We have improved the query by creating an index, and MongoDB is no longer scanning the whole collection.
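The index creation command itself isn’t shown in the dashboard; presumably it was something like this sketch:

# Hypothetical command used to index the "product" field for this test.
db.pro.createIndex( { product: 1 } )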

EXPLAIN PLAN

Analysis of the query from the QAN EXPLAIN plan: you can use the “Expand All” toggle button to check the complete execution stats and plans, as in this case for the “test” database.

COMPARISON OF PERFORMANCE

Here we make a comparison of query performance with and without the index, and take a look at the basics that need to be checked before and after index creation:

Without Index:

Stage = “COLLSCAN” means that MongoDB scans every document in order to fulfil the find query, so you need to create an index to improve query performance.

With Index:

Stage = “IXSCAN” indicates that the optimizer is not scanning all the documents, but only the documents within the index bounds.

Without Index:

It scanned all the documents, i.e. docsExamined: 20000, and returned nReturned: 1000.

With Index:

MongoDB uses the index and scans only the relevant documents, i.e. docsExamined: 1000, and returns nReturned: 1000. We can see the performance improvement from adding an index.

This is how the query behaves after index creation. With QAN we can identify exactly which index is being used, discover whether that index is useful or not, and see other performance-related events, all with an easy UI.

As this blog post is specific to Query Analysis for MongoDB, QAN graphs and their attributes can be accessed from here, an excellent blog post written by my colleague Vinodh.


Feb
05
2018
--

Four Ways MySQL Executes GROUP BY

MySQL GROUP BY

In this blog post, I’ll look into four ways MySQL executes GROUP BY.

In my previous blog post, we learned that indexes or other means of finding data might not be the most expensive part of query execution. For example, MySQL GROUP BY could potentially be responsible for 90% or more of the query execution time. 

The main complexity when MySQL executes GROUP BY is computing aggregate functions in a GROUP BY statement. How this works is shown in the documentation for UDF Aggregate Functions. As we see, the requirement is that UDF functions get all the values that constitute a single group one after another, so that they can compute the aggregate function value for that group before moving on to another group.

The problem, of course, is that in most cases the source data values aren’t grouped. Values coming from a variety of groups follow one another during processing. As such, we need a special step to handle MySQL GROUP BY.

Let’s look at the same table we looked at before:

mysql> show create table tbl\G
*************************** 1. row ***************************
      Table: tbl
Create Table: CREATE TABLE `tbl` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `k` int(11) NOT NULL DEFAULT '0',
 `g` int(10) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `k` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=2340933 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

And the same GROUP BY statements executed in different ways:

1: Index Ordered GROUP BY in MySQL

mysql> select k, count(*) c from tbl group by k order by k limit 5;
+---+---+
| k | c |
+---+---+
| 2 | 3 |
| 4 | 1 |
| 5 | 2 |
| 8 | 1 |
| 9 | 1 |
+---+---+
5 rows in set (0.00 sec)
mysql> explain select k, count(*) c from tbl group by k order by k limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 5
    filtered: 100.00
       Extra: Using index
1 row in set, 1 warning (0.00 sec)

In this case, we have an index on the column we use for GROUP BY. This way, we can just scan data group by group and perform GROUP BY on the fly (inexpensively).

It works especially well when we use LIMIT to restrict the number of groups we retrieve or when a “covering index” is in use, as a sequential index-only scan is a very fast operation.

If you have a small number of groups though, and no covering index, index order scans can cause a lot of IO. So this might not be the most optimal plan.

2: External Sort GROUP BY in MySQL

mysql> explain select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using filesort
1 row in set, 1 warning (0.00 sec)
mysql> select SQL_BIG_RESULT g, count(*) c from tbl group by g limit 5;
+---+---+
| g | c |
+---+---+
| 0 | 1 |
| 1 | 2 |
| 4 | 1 |
| 5 | 1 |
| 6 | 2 |
+---+---+
5 rows in set (0.88 sec)

If we do not have an index that allows us to scan the data in group order, we can instead get data sorted through an external sort (also referred to as “filesort” in MySQL).

You may notice I’m using an SQL_BIG_RESULT hint here to get this plan. Without it, MySQL won’t choose this plan in this case.

In general, MySQL prefers to use this plan only if we have a large number of groups, because in this case sorting is more efficient than having a temporary table (which we will talk about next).

3: Temporary Table GROUP BY in MySQL

mysql> explain select  g, sum(g) s from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary
1 row in set, 1 warning (0.00 sec)
mysql> select  g, sum(g) s from tbl group by g order by null limit 5;
+---+------+
| g | s    |
+---+------+
| 0 |    0 |
| 1 |    2 |
| 4 |    4 |
| 5 |    5 |
| 6 |   12 |
+---+------+
5 rows in set (7.75 sec)

In this case, MySQL also does a full table scan. But instead of running additional sort passes, it creates a temporary table. This temporary table contains one row per group, and with each incoming row the value for the corresponding group is updated. Lots of updates! While this might be reasonable in-memory, it becomes very expensive if the resulting table is so large that updates are going to cause a lot of disk IO. In this case, external sort plans are usually better.

Note that while MySQL selects this plan by default for this use case, if we do not supply any hints it is almost 10x slower than the plan we get using the SQL_BIG_RESULT hint.

You may notice I added “ORDER BY NULL” to this query. This is to show you the “clean” temporary-table-only plan. Without it, we get this plan:

mysql> explain select  g, sum(g) s from tbl group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 998490
    filtered: 100.00
       Extra: Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

In it, we get the “worst of both worlds”, with Using temporary and Using filesort.

MySQL 5.7 always returns GROUP BY results sorted in group order, even if the query doesn’t require it (which can then require an expensive additional sort pass). ORDER BY NULL signals that the application doesn’t need this.

You should note that in some cases – such as JOIN queries with aggregate functions accessing columns from different tables – using temporary tables for GROUP BY might be the only option.

If you want to force MySQL to use a plan that does temporary tables for GROUP BY, you can use the SQL_SMALL_RESULT  hint.
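As a sketch, on the same example table that would look like:

-- Force the temporary-table GROUP BY plan instead of filesort
-- (a sketch; tbl is the example table used throughout this post).
EXPLAIN SELECT SQL_SMALL_RESULT g, COUNT(*) c FROM tbl GROUP BY g;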

4:  Index Skip-Scan-Based GROUP BY in MySQL

The previous three GROUP BY execution methods apply to all aggregate functions. Some of them, however, have a fourth method.

mysql> explain select k,max(id) from tbl group by k\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 2
    filtered: 100.00
       Extra: Using index for group-by
1 row in set, 1 warning (0.00 sec)
mysql> select k,max(id) from tbl group by k;
+---+---------+
| k | max(id) |
+---+---------+
| 0 | 2340920 |
| 1 | 2340916 |
| 2 | 2340932 |
| 3 | 2340928 |
| 4 | 2340924 |
+---+---------+
5 rows in set (0.00 sec)

This method applies only to very special aggregate functions: MIN() and MAX(). These do not really need to go through all the rows in the group to compute the value at all.

They can just jump to the minimum or maximum group value in the group directly (if there is such an index).

How can MySQL find the MAX(ID) value for each group if the index is built only on the (K) column? This is an InnoDB table, and InnoDB tables effectively append the PRIMARY KEY to all secondary indexes: (K) becomes (K, ID), allowing us to use the skip-scan optimization for this query.

This optimization is only enabled if there is a large number of rows per group. Otherwise, MySQL prefers more conventional means to execute this query (like Index Ordered GROUP BY detailed in approach #1).

While we’re on MIN()/MAX() aggregate functions, other optimizations apply to them as well. For example, if you have an aggregate function with no GROUP BY (effectively one group for the whole table), MySQL fetches those values from indexes during the statistics analysis phase and avoids reading the tables during the execution stage altogether:

mysql> explain select max(k) from tbl\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: NULL
  partitions: NULL
        type: NULL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: NULL
    filtered: NULL
       Extra: Select tables optimized away
1 row in set, 1 warning (0.00 sec)

Filtering and Group By

We have looked at four ways MySQL executes GROUP BY.  For simplicity, I used GROUP BY on the whole table, with no filtering applied. The same concepts apply when you have a WHERE clause:

mysql> explain select  g, sum(g) s from tbl where k>4 group by g order by NULL limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary
1 row in set, 1 warning (0.00 sec)

In this case, we use the range on the K column for data filtering/lookup, and do the GROUP BY using a temporary table.

In some cases, the methods do not conflict. In others, however, we have to choose either to use one index for GROUP BY or another index for filtering:

mysql> alter table tbl add key(g);
Query OK, 0 rows affected (4.17 sec)
Records: 0  Duplicates: 0  Warnings: 0
mysql> explain select  g, sum(g) s from tbl where k>1 group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: index
possible_keys: k,g
         key: g
     key_len: 4
         ref: NULL
        rows: 16
    filtered: 50.00
       Extra: Using where
1 row in set, 1 warning (0.00 sec)
mysql> explain select  g, sum(g) s from tbl where k>4 group by g limit 5\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: tbl
  partitions: NULL
        type: range
possible_keys: k,g
         key: k
     key_len: 4
         ref: NULL
        rows: 1
    filtered: 100.00
       Extra: Using index condition; Using temporary; Using filesort
1 row in set, 1 warning (0.00 sec)

Depending on specific constants used in this query, we can see that we either use an index ordered scan for GROUP BY (and  “give up”  benefiting from the index to resolve the WHERE clause), or use an index to resolve the WHERE clause (but use a temporary table to resolve GROUP BY).

In my experience, this is where MySQL GROUP BY does not always make the right choice. You might need to use FORCE INDEX to execute queries the way you want them to.
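For example, a sketch of forcing the index on g (created above) for the GROUP BY, overriding the optimizer’s choice:

-- A sketch: force the GROUP BY index even when the optimizer
-- would rather use the k index for the WHERE clause.
SELECT g, SUM(g) s FROM tbl FORCE INDEX (g) WHERE k > 4 GROUP BY g LIMIT 5;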

Summary

I hope this article provides a good overview of how MySQL executes GROUP BY.  In my next blog post, we will look into techniques you can use to optimize GROUP BY queries.

Oct
18
2017
--

Webinar Thursday, October 19, 2017: What You Need to Get the Most Out of Indexes – Part 2


Join Percona’s Senior Architect, Matthew Boehm, as he presents What You Need to Get the Most Out of Indexes – Part 2 webinar on Thursday, October 19, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

Proper indexing is key to database performance. Finely tune your query writing and database performance with tips from the experts. MySQL offers a few different types of indexes and uses them in a variety of ways.

In this session you’ll learn:

  • How to use composite indexes
  • Other index usages besides lookup
  • How to find unoptimized queries
  • What is there beyond EXPLAIN?

Register for the webinar.

Matthew Boehm, Architect

Matthew joined Percona in the fall of 2012 as a MySQL consultant. His areas of knowledge include the traditional Linux/Apache/MySQL/PHP stack, memcached, MySQL Cluster, massive sharding topologies, PHP development and a bit of MySQL-C-API development. Previously, Matthew DBAed for the fifth largest MySQL installation at eBay/PayPal. He also hails from managed hosting environments. During his off-hours, Matthew is a nationally ranked, competitive West Coast Swing dancer and travels to competitions around the US. He enjoys working out, camping, biking and playing MMOs with his son.

Jul
11
2017
--

Webinar Wednesday July 12, 2017: MongoDB Index Types – How, When and Where Should They Be Used?


Join Percona’s Senior Technical Services Engineer, Adamo Tonete as he presents MongoDB Index Types: How, When and Where Should They Be Used? on Wednesday, July 12, 2017 at 11:00 am PDT / 2:00 pm EDT (UTC-7).

MongoDB has 12 index types. Do you know how each works, or when you should use each of them? This talk will arm you with this knowledge, and help you understand how indexes impact performance, storage and even the sharding of your data. We will also discuss some solid index operational practices, as well as some settings for things like TTL you might not know exist. The contents of this webinar will make you a Rock Star!

Register for the webinar here.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL database administrator for three years. As the main database admin of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24/7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, working mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three years he has moved to NoSQL technologies (without giving up relational databases).

Jul
08
2017
--

MongoDB Indexing Types: How, When and Where Should They Be Used?


In this blog post, we will talk about MongoDB indexing, and the different types of indexes that are available in MongoDB.

Note: We are hosting a webinar on July 12, 2017, where I will talk about MongoDB indexes and how to choose a good indexing plan.

MongoDB is a document-oriented NoSQL database. NoSQL databases share many features with relational databases, and one of them is indexes. The question is: how are such documents indexed in the database?

Remember that because MongoDB stores JSON documents, there is no predefined schema; the same field can be a string in one document and an integer in another.

MongoDB, like other databases, uses B-trees for indexing. With some exceptions, the algorithm is the same as in a relational database.

The B-tree can use integers and strings together to organize data. The most important thing to know is that an index-less database must read all the documents to filter what you want, while an indexed database can find the documents quickly through its indexes. Imagine you are looking for a book in a disorganized library: this is how the query optimizer feels when we are looking for something that is not indexed.

There are several different types of indexes available: single field, compound indexes, hashed indexes, geoIndexes, unique, sparse, partial, text… and so on. All of them help the database in some way, although they obviously also get in the way of write performance if used too much.

  • Single fields. Single-field indexes are simple indexes that index only one field in a collection. MongoDB can usually only use one index per query, but in some cases the database can take advantage of more than one index to reply to a query (this is called index intersection). Also, $or actually executes more than one query at a time, so $or and $in can also use more than one index.
  • Compound indexes. Compound indexes have more than one indexed field; ideally, the most restrictive field should be to the left of the B-tree. If you want to index by sex and birth, for instance, the index should begin with birth, as it is much more restrictive than sex.
  • Hashed indexes. Shards use hashed indexes, and create a hash according to the field value to spread the writes across the sharded instances.
  • GeoIndexes. GeoIndexes are a special index type that allows a search based on location, distance from a point and many other different features.
  • Unique indexes. Unique indexes work as in relational databases. They guarantee that the value doesn’t repeat and raise an error when we try to insert a duplicated value. Unique doesn’t work across shards.
  • Text indexes. Text indexes can work better for searching text than a plain single-field index on a string. There are different flags we can use, like giving weights to control the results or using different languages.
  • Sparse/Partial indexes. Sparse and partial indexes seem very similar. However, sparse indexes only index documents where the field exists, without checking its value, while partial indexes apply a filter (like greater than) to decide what to index. This means a partial index doesn’t index all the documents that have the field, but only the documents that match the create-index filter. Creation commands for several of these types are sketched right after this list.
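For reference, here is a sketch (not from the original post; all collection and field names are made up) of how several of these index types are created in the mongo shell:

// Single field
db.users.createIndex( { email: 1 } )
// Compound: most restrictive field first
db.users.createIndex( { birth: 1, sex: 1 } )
// Hashed, typically used for hash-based sharding
db.users.createIndex( { userId: "hashed" } )
// Geo index for location-based queries
db.places.createIndex( { location: "2dsphere" } )
// Unique
db.users.createIndex( { email: 1 }, { unique: true } )
// Text
db.articles.createIndex( { body: "text" } )
// Sparse: index only documents where the field exists
db.users.createIndex( { phone: 1 }, { sparse: true } )
// Partial: index only documents matching the filter
db.users.createIndex( { age: 1 }, { partialFilterExpression: { age: { $gt: 21 } } } )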

We will talk more about indexes and how they work in my webinar MongoDB® Index Types: How, When and Where Should They Be Used? If you have questions, feel free to ask them in the comments below, and I will try to answer all of them in the webinar (or in a complementary post).

I hope this blog post was useful. Please feel free to reach out to me on Twitter @AdamoTonete or @percona.

Jul
03
2017
--

Webinar Wednesday July 5, 2017: Indexes – What You Need to Know to Get the Most Out of Them


Join Percona’s Senior Architect, Matthew Boehm, as he presents Indexes – What You Need to Know to Get the Most Out of Them on Wednesday, July 5, 2017 at 8:00 am PDT / 11:00 am EDT (UTC-7).

Proper indexing is key to database performance. Find out how MySQL uses indexes for query execution, and then how to come up with an optimal index strategy. In this session, you’ll also learn how to know when you need an index, and how to get rid of indexes that you don’t need, to speed up queries.

Register for the webinar here.

Matthew Boehm, Architect

Matthew joined Percona in the fall of 2012 as a MySQL consultant. He quickly rose to Architect and became one of Percona’s training instructors. His areas of knowledge include the traditional Linux/Apache/MySQL/PHP stack, memcached, MySQL Cluster (NDB), massive sharding topologies, PHP development and a bit of MySQL-C-API development.

Sep
09
2016
--

Basic Housekeeping for MySQL Indexes


In this blog post, we’ll look at some of the basic housekeeping steps for MySQL indexes.

We all know that indexes can be the difference between a high-performance database and a bad/slow/painful query ride. It’s a critical part that deserves some housekeeping once in a while. So, what should you check? In no particular order, here are some things to look at:

1. Unused indexes

With the sys schema, it is pretty easy to find unused indexes: use the schema_unused_indexes view.

mysql> select * from sys.schema_unused_indexes;
+---------------+-----------------+-------------+
| object_schema | object_name     | index_name  |
+---------------+-----------------+-------------+
| world         | City            | CountryCode |
| world         | CountryLanguage | CountryCode |
+---------------+-----------------+-------------+
2 rows in set (0.01 sec)

This view is based on the performance_schema.table_io_waits_summary_by_index_usage table, which will require enabling the Performance Schema, the events_waits_current consumer and the wait/io/table/sql/handler instrument. PRIMARY (key) indexes are ignored.

If you don’t have them enabled, just execute these queries:

update performance_schema.setup_consumers set enabled = 'yes' where name = 'events_waits_current';
update performance_schema.setup_instruments set enabled = 'yes' where name = 'wait/io/table/sql/handler';

Quoting the documentation:

“To trust whether the data from this view is representative of your workload, you should ensure that the server has been up for a representative amount of time before using it.”

And by representative amount, I mean representative: 

  • Do you have a weekly job? Wait at least one week
  • Do you have monthly reports? Wait at least one month
  • Don’t rush!

Once you’ve found unused indexes, remove them.
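Based on the view output above, that would be something along these lines (a sketch; double-check against your workload before dropping anything):

ALTER TABLE world.City DROP INDEX CountryCode;
ALTER TABLE world.CountryLanguage DROP INDEX CountryCode;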

2. Duplicated indexes

You have two options here:

  • pt-duplicate-key-checker
  • the schema_redundant_indexes view from sys_schema

The pt-duplicate-key-checker is part of Percona Toolkit. The basic usage is pretty straightforward:

[root@e51d333b1fbe mysql-sys]# pt-duplicate-key-checker
# ########################################################################
# world.CountryLanguage
# ########################################################################
# CountryCode is a left-prefix of PRIMARY
# Key definitions:
#   KEY `CountryCode` (`CountryCode`),
#   PRIMARY KEY (`CountryCode`,`Language`),
# Column types:
#      	  `countrycode` char(3) not null default ''
#      	  `language` char(30) not null default ''
# To remove this duplicate index, execute:
ALTER TABLE `world`.`CountryLanguage` DROP INDEX `CountryCode`;
# ########################################################################
# Summary of indexes
# ########################################################################
# Size Duplicate Indexes   2952
# Total Duplicate Indexes  1
# Total Indexes            37

Now, the schema_redundant_indexes view is also easy to use once you have the sys schema installed. The difference is that it is based on the information_schema.statistics table:

mysql> select * from schema_redundant_indexes\G
*************************** 1. row ***************************
              table_schema: world
                table_name: CountryLanguage
      redundant_index_name: CountryCode
   redundant_index_columns: CountryCode
redundant_index_non_unique: 1
       dominant_index_name: PRIMARY
    dominant_index_columns: CountryCode,Language
 dominant_index_non_unique: 0
            subpart_exists: 0
            sql_drop_index: ALTER TABLE `world`.`CountryLanguage` DROP INDEX `CountryCode`
1 row in set (0.00 sec)

Again, once you find the redundant index, remove it.

3. Potentially missing indexes

The statements summary tables from the performance schema have several interesting fields. For our case, two of them are pretty important: NO_INDEX_USED (means that the statement performed a table scan without using an index) and NO_GOOD_INDEX_USED (“1” if the server found no good index to use for the statement, “0” otherwise).

Sys schema has one view that is based on the performance_schema.events_statements_summary_by_digest table, and is useful for this purpose: statements_with_full_table_scans, which lists all normalized statements that have done a table scan.

For example:

mysql> select * from world.CountryLanguage where isOfficial = 'F';
...
746 rows in set (0.00 sec)
mysql> select * from statements_with_full_table_scans\G
*************************** 1. row ***************************
                   query: SELECT * FROM `world` . `Count ... guage` WHERE `isOfficial` = ?
                      db: world
              exec_count: 1
           total_latency: 739.87 us
     no_index_used_count: 1
no_good_index_used_count: 0
       no_index_used_pct: 100
               rows_sent: 746
           rows_examined: 984
           rows_sent_avg: 746
       rows_examined_avg: 984
              first_seen: 2016-09-05 19:51:31
               last_seen: 2016-09-05 19:51:31
                  digest: aa637cf0867616c591251fac39e23261
1 row in set (0.01 sec)

The above query doesn’t use an index because there was no good index to use, and thus was reported. See the explain output:

mysql> explain select * from world.CountryLanguage where isOfficial = 'F'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: CountryLanguage
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 984
        Extra: Using where

Note that the “query” field reports the query digest (more like a fingerprint) instead of the actual query.

In this case, the CountryLanguage table is missing an index over the “isOfficial” field. It is your job to decide whether it is worth it to add the index or not.

4. Multiple column indexes order

As explained before, a multiple-column index beats an index merge in all cases where such an index can be used, even though sometimes you might have to use index hints to make it work.

But when using them, don’t forget that the order matters. MySQL will only use a multi-column index if at least one value is specified for the first column in the index.

For example, consider this table:

mysql> show create table CountryLanguage\G
*************************** 1. row ***************************
       Table: CountryLanguage
Create Table: CREATE TABLE `CountryLanguage` (
  `CountryCode` char(3) NOT NULL DEFAULT '',
  `Language` char(30) NOT NULL DEFAULT '',
  `IsOfficial` enum('T','F') NOT NULL DEFAULT 'F',
  `Percentage` float(4,1) NOT NULL DEFAULT '0.0',
  PRIMARY KEY (`CountryCode`,`Language`),
  KEY `CountryCode` (`CountryCode`),
  CONSTRAINT `countryLanguage_ibfk_1` FOREIGN KEY (`CountryCode`) REFERENCES `Country` (`Code`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

A query against the field “Language” won’t use an index:

mysql> explain select * from CountryLanguage where Language = 'English'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: CountryLanguage
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 984
        Extra: Using where

Simply because it is not the leftmost prefix of the Primary Key. If we add the “CountryCode” field, the index will now be used:

mysql> explain select * from CountryLanguage where Language = 'English' and CountryCode = 'CAN'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: CountryLanguage
         type: const
possible_keys: PRIMARY,CountryCode
          key: PRIMARY
      key_len: 33
          ref: const,const
         rows: 1
        Extra: NULL

Now, you’ll have to also consider the selectivity of the fields involved. Which is the preferred order?

In this case, the “Language” field has a higher selectivity than “CountryCode”:

mysql> select count(distinct CountryCode)/count(*), count(distinct Language)/count(*) from CountryLanguage;
+--------------------------------------+-----------------------------------+
| count(distinct CountryCode)/count(*) | count(distinct Language)/count(*) |
+--------------------------------------+-----------------------------------+
|                               0.2368 |                            0.4644 |
+--------------------------------------+-----------------------------------+

So in this case, if we create a multi-column index, the preferred order will be (Language, CountryCode).
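A sketch of creating it (the index name is arbitrary):

-- Multi-column index with the more selective column first.
ALTER TABLE CountryLanguage ADD INDEX `lang_country` (`Language`, `CountryCode`);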

Placing the most selective columns first is a good idea when there is no sorting or grouping to consider, and thus the purpose of the index is only to optimize WHERE lookups. You might need to choose the column order so that it’s as selective as possible for the queries you’ll run most often.

Now, is this good enough? Not really. What about special cases where the table doesn’t have an even distribution? When a single value is present way more times than all the others? In that case, no index will be good enough. Be careful not to assume that average-case performance is representative of special-case performance. Special cases can wreck performance for the whole application.

In conclusion, we depend heavily on proper indexes. Give them some love and care once in a while, and the database will be very grateful.

All the examples were done with the following MySQL and Sys Schema version:

mysql> select * from sys.version;
+-------------+-----------------+
| sys_version | mysql_version   |
+-------------+-----------------+
| 1.5.1       | 5.6.31-77.0-log |
+-------------+-----------------+

Aug
21
2015
--

Find unused indexes on MongoDB and TokuMX

Finding and removing unused indexes is a pretty common technique for improving the overall performance of relational databases. Fewer indexes mean faster inserts and updates, and also less disk space used. The usual way to do it is to log all queries’ execution plans and then get a list of the indexes that are not used. The same theory applies to MongoDB and TokuMX, so in this blog post I’m going to explain how to find those indexes.

Profiling in MongoDB

To understand what profiling is, you only need to think of MySQL’s slow query log: it is basically the same idea. It can be enabled with the following command:

db.setProfilingLevel(level, slowms)

There are three different levels:

0: No profiling enabled.
1: Only those queries slower than “slowms” are profiled.
2: All queries are profiled, similar to long_query_time=0 in MySQL (example below).
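For example, a minimal sketch enabling level 1 with a 100 ms threshold:

// Profile only operations slower than 100 ms on the current database.
db.setProfilingLevel(1, 100)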

Once it is enabled, you can use db.system.profile.find().pretty() to read it. You would need to scan through all the profiles and find those indexes that are never used. To make things easier, there is a JavaScript program that finds the unused indexes after reading all the profile information. Unfortunately, it only works with MongoDB 2.x.

The JavaScript is hosted in this GitHub project: https://github.com/wfreeman/indexalizer You just need to start the mongo shell with indexStats.js and run the db.indexStats() command. This is a sample output:

scanning profile {ns:"test.col"} with 2 records... this could take a while.
{
	"query" : {
		"b" : 1
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
{
	"query" : {
		"b" : 2
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
checking for unused indexes in: col
this index is not being used:
"_id_"
this index is not being used:
"a_1"


So “a_1” is not used and could be dropped. We can ignore “_id_” because that one is needed :)

There is a problem with profiling: it affects performance, so you should run it only for a few hours, usually during off-peak time. That means there is a possibility that not all the queries your application runs will be executed during that maintenance window. What alternative does TokuMX provide?

Finding unused indexes in TokuMX

Good news for all of us: TokuMX doesn’t require you to enable profiling. Index usage statistics are stored as part of every query execution, and you can access them with a simple db.collection.stats() command. Let me show you an example:

> db.col.stats()
[...]
{
"name" : "a_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 0,
"nscanned" : 0,
"nscannedObjects" : 0,
"inserts" : 0,
"deletes" : 0
},
{
"name" : "b_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 2,
"nscanned" : 2,
"nscannedObjects" : 2,
"inserts" : 0,
"deletes" : 0
}
],
"ok" : 1
}


There are our statistics, without profiling enabled. queries is the number of times the index has been used to execute a query. b_1 has been used twice and a_1 has never been used. You can use this small JavaScript snippet I’ve written to scan all the collections inside the current database:

db.forEachCollectionName(function (cname) {
	output = db.runCommand({collstats : cname });
	print("Checking " + output.ns + "...")
	output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
});


An example using the same data:

> db.forEachCollectionName(function (cname) {
... output = db.runCommand({collstats : cname });
... print("Checking " + output.ns + "...")
... output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
...
... });
Checking test.system.indexes...
Checking test.col...
Unused index: a_1


Conclusion

Finding unused indexes is a regular task that every DBA should do. In MongoDB you have to use profiling, while in TokuMX nothing needs to be enabled, because it gathers the information by default without impacting service performance.

