Jan 03, 2019

MongoDB Engines: MMAPV1 Vs WiredTiger


In this post, we’ll take a look at the differences between the MMAPv1 and WiredTiger engines in MongoDB®. I’ve been asked about this by customers many times, so this blog is for you! We’ll cover the key features of these engines so that you can choose the right engine for your requirements.

In MongoDB, the two main engines are MMAPv1 and WiredTiger. Other engines are also available: RocksDB and an in-memory engine with Percona Server for MongoDB (PSMDB), and an in-memory engine with the MongoDB Enterprise version. When MongoDB was introduced, MMAPv1 was the default engine, and it is still shipped with current releases, though MongoDB plans to remove it in 4.2. Those who remember working with version 1.8 might feel nostalgic about it, even if they don’t use MMAPv1 any more! MongoDB acquired WiredTiger Inc. (see https://www.mongodb.com/press/wired-tiger) and made WiredTiger the default engine from version 3.2. This engine paved the way for multi-document transactions, and is chosen mainly for features such as compression and document-level locking. Here we’ll look at the key features of WiredTiger and MMAPv1, and also present them in a table at the end – who doesn’t love a table for checking the differences quickly! It reminds me of my school days. My co-author and friend Aayushi feels the same!

Some differences in detail

Storage Engines

The MongoDB storage engines manage BSON data in memory and on disk to support read and write operations.

MMAPv1: This is the original storage engine for MongoDB, introduced in the first release, but it has been deprecated since version 4.0.

WiredTiger: This is the pluggable engine introduced by MongoDB in version 3.0, and it became the default storage engine from version 3.2.

Data compression

MMAPv1: does not support data compression. It is based on memory-mapped files, so it works well when you can keep your working set in memory. It excels at workloads with high-volume inserts, reads, and in-place updates.

WiredTiger: supports snappy and zlib compression. Consequently, MongoDB with WiredTiger uses far less disk space compared with MMAPv1. It has its own write cache and also uses the filesystem cache. The compressor can also be overridden per collection, as in the shell example after the list below.

  • Snappy: the default algorithm; computationally efficient with a reasonable compression ratio.
  • Zlib: a higher compression ratio at the cost of more CPU.
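
As a hedged illustration (the collection name is hypothetical), this is how you can override the default block compressor for a single collection from the mongo shell, and then check which compressor it was created with:

// Create a collection that uses zlib instead of the default snappy
db.createCollection("events", {
  storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})

// The creationString in the collection stats shows the compressor in use
db.events.stats().wiredTiger.creationString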

Data Directory

Let’s take a look at the data directory on the file system for the same data set and replica set member under each of the engines.

MMAPV1:

total 1.2G
-rw-r--r-- 1 vagrant vagrant    5 Nov 28 04:41 mongod.lock
-rw-rw-r-- 1 vagrant vagrant   69 Nov 28 04:41 storage.bson
-rw------- 1 vagrant vagrant  16M Nov 28 04:58 local.0
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 04:58 journal
-rw------- 1 vagrant vagrant  16M Nov 28 04:58 admin.ns
-rw------- 1 vagrant vagrant  16M Nov 28 04:58 admin.0
-rw------- 1 vagrant vagrant 512M Nov 28 04:59 local.2
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 04:59 diagnostic.data
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 05:16 _tmp
-rw------- 1 vagrant vagrant  16M Nov 28 05:17 test.ns
-rw------- 1 vagrant vagrant  16M Nov 28 05:17 test.0
-rw------- 1 vagrant vagrant  32M Nov 28 05:17 test.1
-rw------- 1 vagrant vagrant  16M Nov 28 09:09 local.ns
-rw------- 1 vagrant vagrant 512M Nov 28 09:09 local.1

WiredTiger:

total 5.4M
-rw-rw-r-- 1 vagrant vagrant   21 Nov 28 07:38 WiredTiger.lock
-rw-rw-r-- 1 vagrant vagrant   49 Nov 28 07:38 WiredTiger
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 07:38 journal
-rw-rw-r-- 1 vagrant vagrant 4.0K Nov 28 07:38 WiredTigerLAS.wt
-rw-rw-r-- 1 vagrant vagrant   95 Nov 28 07:38 storage.bson
-rw-r--r-- 1 vagrant vagrant    5 Nov 28 07:38 mongod.lock
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-7--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-5--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-3--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-1--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 collection-4--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 collection-2--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 collection-0--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-15--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-14--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant 1.8M Nov 28 07:38 index-17--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant 3.2M Nov 28 07:39 collection-16--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:39 collection-13--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  32K Nov 28 07:39 _mdb_catalog.wt
-rw-rw-r-- 1 vagrant vagrant  36K Nov 28 09:09 sizeStorer.wt
-rw-rw-r-- 1 vagrant vagrant  36K Nov 28 09:09 collection-6--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  52K Nov 28 09:09 collection-12--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  76K Nov 28 09:09 WiredTiger.wt
-rw-rw-r-- 1 vagrant vagrant 1003 Nov 28 09:09 WiredTiger.turtle
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 09:09 diagnostic.data

Journaling

MMAPv1: the journal ensures that writes are atomic. If MongoDB goes down or terminates before committing changes to the data files, it can use the journal files to apply the write operations to the data files and maintain a consistent state.

WiredTiger: This engine uses checkpoints, and the journal persists all data modifications made between checkpoints. To recover from a crash or abrupt termination, it replays the journal entries written since the last checkpoint. In many cases the journal is not strictly necessary for this engine: without it, MongoDB can still recover from the last valid checkpoint, but you keep it enabled when you need to recover writes made after that checkpoint. A checkpoint occurs every 60 seconds by default.
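
If you rely on the journal, it also matters at the level of individual writes. A minimal sketch (reusing the testData collection from the locking example later in this post): with the write concern option j: true, the write is acknowledged only after it has reached the on-disk journal.

db.testData.insertOne(
  { x: 1 },
  { writeConcern: { w: 1, j: true } }   // acknowledged only after the journal write completes
)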

Journal directory

This is how journal files appear in the data directory for the different engines:

MMAPV1:

vagrant@m103:/data/mongo1/journal$ ls -lrth
total 35M
-rw------- 1 vagrant vagrant  88 Nov 28 09:17 lsn
-rw------- 1 vagrant vagrant 35M Nov 28 09:17 j._0

WiredTiger:

-rw-rw-r-- 1 vagrant vagrant 100M Nov 28 07:38 WiredTigerPreplog.0000000001
-rw-rw-r-- 1 vagrant vagrant 100M Nov 28 07:38 WiredTigerPreplog.0000000002
-rw-rw-r-- 1 vagrant vagrant 100M Nov 28 09:16 WiredTigerLog.0000000001

Locks and concurrency

MMAPV1

  • Up until version 2.6: uses a readers-writer [1] lock that allows concurrent reads access to a database, but gives exclusive access to a single write operation. When a read lock exists, many read operations may use this lock. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share the lock.
  • From 3.0: the MMAPv1 storage engine uses collection-level locking, an improvement on earlier versions in which the database lock was the finest-grained lock.

WiredTiger: supports document level locking. For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database, and collection levels.

For example, deleting all documents matching {x: 1} from the collection “testData” acquires a write lock at the collection level differently in each storage engine. The shell command below produced the log lines that follow.
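
A hedged reconstruction of that command, based on the delete shown in the logs (q: { x: 1.0 }, limit: 0):

use testDB
db.testData.deleteMany({ x: 1 })   // limit: 0 in the log means "delete all matching documents"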

MMAPV1:

2018-12-17T10:09:46.830+0000 I COMMAND  [conn8] command
testDB.$cmd appName: "MongoDB Shell"
command: delete { delete: "testData",
deletes: [ { q: { x: 1.0 }, limit: 0.0 } ], ordered: true }
numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 100795, w: 100795 } },
MMAPV1Journal: { acquireCount: { w: 100796 }, acquireWaitCount: { w: 12 },
timeAcquiringMicros: { w: 46212 } }, Database: { acquireCount: { w: 100795 } }
, Collection: { acquireCount: { W: 795 } }

where W represents an exclusive (X) lock taken at the collection level

WiredTiger:

2018-12-17T10:17:38.340+0000 I COMMAND  [conn1] command
testDB.$cmd appName: "MongoDB Shell"
command: delete { delete: "testData",
deletes: [ { q: { x: 1.0 }, limit: 0.0 } ], ordered: true }
numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 100795, w: 100795 } },
Database: { acquireCount: { w: 100795 } }, Collection: { acquireCount: { w: 795 } }

where w represents an intent exclusive (IX) lock taken at the collection level

Memory

MMAPv1: MongoDB automatically uses all free memory on the machine as its cache. System resource monitors show that MongoDB uses a lot of memory, but its usage is dynamic. If another process suddenly needs half the server’s RAM, MongoDB will yield cached memory to the other process.

Technically, the operating system’s virtual memory subsystem manages MongoDB’s memory. This means that MongoDB will use as much free memory as it can, swapping to disk as needed. Deployments with enough memory to fit the application’s working data set in RAM will achieve the best performance.
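
A quick way to see this from the mongo shell: the mem section of serverStatus reports resident, virtual and, for MMAPv1, mapped memory (all in MB), where “mapped” roughly tracks the size of the memory-mapped data files.

printjson(db.serverStatus().mem)   // look at resident, virtual, mapped and mappedWithJournal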

WiredTiger: with WiredTiger, MongoDB utilizes both the WiredTiger internal cache and the filesystem cache. Via the filesystem cache, MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes. Starting in 3.4, the WiredTiger internal cache, by default, will use the larger of either (see the shell snippet after this list for how to inspect the cache):

  • 50% of (RAM – 1 GB), or
  • 256 MB.
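
A hedged way to check the configured and currently used cache size from the shell, using the cache counters WiredTiger exposes through serverStatus:

var cache = db.serverStatus().wiredTiger.cache
print("configured bytes: " + cache["maximum bytes configured"])
print("in use bytes:     " + cache["bytes currently in the cache"])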

Quick reference: MMAPV1 vs WiredTiger

Use this table as a quick reference to the differences between MMAPv1 and WiredTiger:

Key Feature | MMAPv1 | WiredTiger
Introduction & default engine | Shipped with MongoDB from the start and the default engine through version 3.0. Deprecated in 4.0 and due to be removed in a future release. | Introduced in version 3.0 and the default engine from version 3.2.
Data compression | Does not support compression. | Supports compression: snappy (default) and zlib, so it occupies less space than MMAPv1.
Journaling | MongoDB writes the in-memory changes first to on-disk journal files. If MongoDB goes down or terminates before committing the changes to the data files, it can use the journal files to apply the write operations and maintain a consistent state. | The journal persists all data modifications between checkpoints. If MongoDB exits between checkpoints, it uses the journal to replay all data modified since the last checkpoint.
Locks & concurrency | Up to 2.6, a readers-writer [1] lock that allows concurrent read access to a database but exclusive access for a single write operation. From 3.0, collection-level locking. | Document-level locking.
Transactions | Operations on a single document are atomic. | Multi-document transactions are available for deployments from version 4.0.
CPU performance | Adding CPU cores does not improve performance much. | Performs better on multi-core systems.
Encryption | Encryption is not possible. | Encryption at rest is available with MongoDB Enterprise and as a beta in PSMDB 3.6.8.
Memory | Automatically uses all free memory on the machine as its cache. | Uses an internal cache and the filesystem cache.
Updates | Excels at workloads with high-volume inserts, reads, and in-place updates. | Does not support in-place updates; an update rewrites the whole document.
Tuning | Little scope for tuning. | Allows more tuning through different variables, e.g. cache size, read/write tickets, checkpoint interval.

Conclusion

The above information does not cover every difference between MMAPv1 and WiredTiger, but it lists the key ones. If you have any key features to add, please feel free to mention them in the comments! Let’s share and let everyone know about them.


Photo by Mathew Schwartz on Unsplash

Mar 08, 2017

Migrating MongoDB Away from MMAPv1


This is another post in the series of blogs on the Percona Server for MongoDB 3.4 bundle release. In this blog post, we’ll discuss moving away from the MMAPv1 storage engine.

Introduction

With the MongoDB v3.0 release in February of 2015, the long-awaited ability to choose storage engines became a reality. As of version 3.0, you could choose from two engines in MongoDB Community Server and, if you use Percona Server for MongoDB, from four. Here’s a table for ease of consumption:


Storage Engine | Percona Server for MongoDB | MongoDB Community Server | MongoDB Enterprise Server (licensed)
MMAPv1         | ✓                          | ✓                        | ✓
WiredTiger     | ✓                          | ✓                        | ✓
MongoRocks     | ✓                          |                          |
In-memory      | ✓                          |                          | ✓
Encrypted      |                            |                          | ✓

Why change engines?

With increased possibilities comes a more difficult decision-making process (a concept that gets reinforced every time I take my mother out to a restaurant with a large menu – ordering is never quick). In all seriousness, the introduction of the storage engine API to MongoDB is possibly the single greatest feature MongoDB, Inc. has released to date.

One of the biggest gripes from the pre-v3.0 days was MongoDB’s lack of scale. This was mostly due to the MMAPv1 storage engine, which suffered from a very primitive locking scheme. If you would like an illustration of the problem, think of the world’s biggest supermarket with one checkout line – you might be able to fit in lots of shoppers, but they’re not going to accomplish their goal quickly. So, the ability to increase performance and concurrency with a simple switch is huge! Additionally, modern storage engines support compression. This should reduce your space utilization by at least 50% when you switch.

All the way up to MongoDB v3.2, the default storage engine was MMAPv1. If you didn’t make a conscious decision about what storage engine to choose when you started using MongoDB, there is a good chance that MMAPv1 is what you’re on. If you’d like to find out for sure what engine you’re using, simply run the command below. The output will be the name of the storage engine. As you can see, I was running the MMAPv1 storage engine on this machine. Now that we understand where we’re at, let’s get into where we can be in the future.

db.serverStatus().storageEngine.name
mmapv1

Public Service Announcement

Before we get into what storage engine(s) to evaluate, we need to talk about testing. In my experience, a majority of the MySQL and MongoDB community members are rolling out changes to production without planning or testing. If you’re in the same boat, you’re in very good company (or at least in a great deal of company). However, you should stop this practice. It’s basic “sample size” in statistics – when engaged in risk-laden behavior, the optimal time to stop increasing the sample size is prior to the probability of failure reaching “1”. In other words, start your testing and planning process today!

At Percona, we recommend that you thoroughly test any database changes in a testing or development environment before you decide to roll them into production. Additionally, prior to rolling the changes into production (with a well thought out plan, of course), you’ll need to have a roll-back plan in case of unintended consequences. Luckily, with MongoDB’s built-in replication and election protocols, both are fairly easy. The key here is to plan. This is doubly true if you are undertaking a major version upgrade, or are jumping over major versions. With major version upgrades (or version jumps) comes the increased likelihood of a change in database behavior as it relates to your application’s response time (or even stability).

What should I think about?

In the table above, we listed the pre-packaged storage engine options that are available in Percona Server for MongoDB and other distributions. We also took a look at why you should consider moving off of MMAPv1 in the preceding section. To be clear, in my opinion a vast majority of MongoDB users that are on MMAPv1 can benefit from a switch. Which engine to switch to is the pressing question. Your first decision should be to evaluate whether or not your workload fits into the sweet spot for MMAPv1 by reading the section below. If that section doesn’t describe your application, then the additional sections should help you narrow down your choices.

Now, let’s take a look at what workloads match up with what storage engines.

MMAPv1

Believe it or not, there are some use cases where MMAPv1 is likely to give you as good (or better) performance as any other engine. If you’re not worried about the size of your database on disk, then you may not want to bother changing engines. Users who are likely to see no benefit from changing have read-heavy (or 100% read) applications. Also, certain update-heavy use cases, where you’re updating small amounts of data or performing $set operations, are likely to be faster on MMAPv1.

WiredTiger

WiredTiger is the new default storage engine for MongoDB. It is a good option for general workloads that are currently running on MMAPv1. WiredTiger will give you good performance for most workloads and will reduce your storage utilization with compression that’s enabled by default. If you have a write-heavy workload, or are approaching high I/O utilization (>55%) with plans for it to rise, then you might benefit from a migration to WiredTiger.

MongoRocks (RocksDB from Facebook)

This is Facebook’s baby, which was forged in the fires of the former Parse business unit. MongoRocks, which uses LSM indexing, is advertised as “designed to work with fast storage.” Don’t let this claim fool you. For workloads that are heavy on writes, highly concurrent or approaching disk bound, MongoRocks could give you great benefits. In terms of compression, MongoRocks has the ability to efficiently handle deeper compression algorithms, which should further decrease your storage requirements.

In-Memory

The in-memory engine, whether we’re speaking about the MongoDB or Percona implementation, should be used for workloads where extreme low latency is the most important requirement. The types of applications that I’m talking about are usually low-latency, “real-time” apps — like decision making or user session tracking. The in-memory engine is not persistent, so it operates strictly out of the cache allocated to MongoDB. Consequently, the data may (and likely will) be lost if the server crashes.

Encrypted

This is for applications in highly secure environments where on-disk encryption is necessary for compliance. This engine will protect the MongoDB data in the case that a disk or server is stolen. On the flip side, this engine will not protect you from a hacker that has access to the server (MongoDB shell), or can intercept your application traffic. Another way to achieve the same level of encryption for compliance is using volume level encryption like LUKS. An additional benefit of volume level encryption, since it works outside the database, is re-use on all compliant servers (not just MongoDB).

Getting to your new engine

Switching to the new engine is actually pretty easy, especially if you’re running a replica set. One important caveat is that, unlike MySQL, the storage engine can only be defined per mongod process (not per database or collection). This means it’s an all-or-nothing operation on a single MongoDB process: you’ll need to reload your data on that server, because the data files from one engine are not compatible with another. Here are the high-level steps, assuming you’re running a replica set (a mongo shell sketch of the replica set commands follows the list):

  1. Make sure you’re not in your production environment
  2. Backup your data (it can’t hurt)
  3. Remove a replica set member
  4. Rename (or delete) the old data directory. The member will re-sync with the replica set
    • Make sure you have enough disk space if you’re going to keep a copy of the old data directory
  5. Update the mongod.conf file to use the new storage engine. Here’s an example for RocksDB from our documentation:
    storage:
      engine: rocksdb
      rocksdb:
        cacheSizeGB: 4
        compression: snappy
  6. Start the MongoDB process again
  7. Join the member to the replica set (initial sync will happen)
  8. When the updated member is all caught up, pick another member and repeat the process.
  9. Continue until the primary is the only server left. At this point, you should step down the primary, but hold off switching storage engines until you are certain that the new storage engine meets your needs.
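
Here is a minimal, hedged sketch of the replica set commands behind those steps, run from the mongo shell (the hostname is hypothetical; substitute your member):

rs.remove("mongo2.example.net:27017")   // step 3: on the primary, drop the member being converted

// steps 4-6 happen on that host: move the data directory aside,
// change the storage engine in the config file, restart mongod

rs.add("mongo2.example.net:27017")      // step 7: re-add the member; initial sync begins
rs.status()                             // step 8: wait until the member reports SECONDARY again
rs.stepDown()                           // step 9: on the primary, once every other member is done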

The Wrap Up

At this point I’ve explained how you can understand your options, where you can gain additional performance and what engines to evaluate. Please don’t forget to test your application with the new setup before launching into production. Please drop a comment below if you found this helpful or, on the other hand, if there’s something that would make it more useful to you. Chances are, if you’d find something helpful, the rest of the community will as well.

Feb 28, 2017

Percona Monitoring and Management (PMM) Graphs Explained: MongoDB MMAPv1


This post is part of the series of Percona’s MongoDB 3.4 bundle release blogs. In this blog post, I hope to cover some areas to watch with Percona Monitoring and Management (PMM) when running MMAPv1. The graph examples from this article are from the MMAPv1 dashboard that will be released for the first time in PMM 1.1.2.

The MMAPv1 storage engine has existed since the very beginning of MongoDB; until MongoDB 3.0 added a pluggable storage engine API, it was the only option. While MMAPv1 often offers good read performance, it has become famous for its poor write performance and fragmentation at scale. This means there are many areas to watch regarding performance and monitoring.

Percona Monitoring and Management (PMM)

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB. It was developed by Percona on top of open-source technology. Behind the scenes, the graphing features this article covers use Prometheus (a popular time-series data store), Grafana (a popular visualization tool), mongodb_exporter (our MongoDB database metric exporter) plus other technologies to provide database and operating system metric graphs for your database instances.

(Beware of) MMAPv1

mmap() is a system call that causes the operating system kernel to map on-disk files into memory while they are being read and written by a program.

As mmap() is a core feature of the Unix/Linux operating system kernel (and not the MongoDB code base), I’ve always felt that calling MMAPv1 a “storage engine” is quite misleading, although it does allow for a simpler explanation. The distinction and drawbacks of the storage logic being in the operating system kernel vs. the actual database code (like most database storage engines) becomes very important when monitoring MMAPv1.

As Unix/Linux are general-purpose operating systems that can have many processes, users and use cases, they offer limited OS-level metrics in terms of activity, latency and performance of mmap(). Those metrics are for the entire operating system, not just for the MongoDB processes.

mmap() uses memory from available OS-level buffers/caches for mapping the MMAPv1 data to RAM — memory that can be “stolen” away by any other operating system process that asks for it. As many deployments “micro-shard” MMAPv1 to reduce write locks, this statement can become exponentially more important. If 3 x MongoDB instances run on a single host, the kernel fights to cache and evict memory pages created by 3 x different instances with no priority or queuing, essentially at random, while creating contention. This causes inefficiencies and less-meaningful monitoring values.

When monitoring MMAPv1, you should consider MongoDB AND the operating system as one “component” more than most engines. Due to this, it is critical that a database host runs a single MongoDB instance with no other processes except database monitoring tools such as PMM’s client. This allows MongoDB to be the only user of the operating system filesystem cache that MMAPv1 relies on. This also makes OS-level memory metrics more accurate because MongoDB is the only user of memory. If you need to “micro-shard” instances, I recommend using containers (Docker or plain cgroups) or virtualization to separate your memory for each MongoDB instance, with just one MongoDB instance per container.

Locking

MMAPv1 has locks for both reads and writes. In the early days the lock was global only. Locking became per-database in v2.2 and per-collection in v3.0.

Locking is the leading cause of the performance issues we see on MMAPv1 systems, particularly write locking. To measure how much locking an MMAPv1 instance is waiting on, first we look at the “MMAPv1 Lock Ratio”:

Another important metric to watch is “MongoDB Lock Wait Time”, which breaks down the amount of time operations spend waiting on locks:

Three factors in combination influence locking (a shell snippet for checking lock waits directly follows this list):

  1. Data hotspots — if every query hits the same collection or database, locking increases
  2. Query performance — a lock is held for the duration of an operation; if that operation is slow, lock time increases
  3. Volume of queries — self-explanatory
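
The lock waits behind these graphs can also be checked directly from the mongo shell: serverStatus exposes per-resource lock counters, and timeAcquiringMicros growing quickly relative to acquireCount means operations are queuing on locks. A hedged sketch:

var locks = db.serverStatus().locks
printjson(locks.Database)     // per-database acquireCount, acquireWaitCount, timeAcquiringMicros
printjson(locks.Collection)   // collection-level lock counters (MMAPv1 3.0+)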

Page Faults

Page faults happen when MMAPv1 data is not available in the cache and needs to be fetched from disk. On systems with data that is smaller than memory, page faults usually only occur on reboot, or if the filesystem cache is dumped. On systems where data exceeds memory, this happens more frequently — MongoDB is asked for data that is not in memory.

How often this happens depends on how your application accesses your data. If it accesses new or frequently-queried data, it is more likely to be in memory. If it accesses old or infrequent data, more page faults occur.

If page faults suddenly start occurring, check to see if your data set has grown beyond the size of memory. You may be able to reduce your data set by removing fragmentation (explained later).
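
Outside of PMM, the raw counter behind these graphs is available from the mongo shell; it is cumulative since mongod started, so watch its rate of change rather than its absolute value:

db.serverStatus().extra_info.page_faults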

Journaling

As MMAPv1 eventually flushes changes to disk in batches, journaling is essential for running MongoDB with any real data integrity guarantees. As well as being included in the lock statistic graphs mentioned above, there are some good metrics for journaling (which is a heavy consumer of disk writes).

Here we have “MMAPv1 Journal Write Activity”, showing the data rates of journaling (max 19MB/sec):

“MMAPv1 Journal Commit Activity” measures the commits to the journal ops/second:

A very useful metric for write query performance is “MMAPv1 Journaling Time” (there is another graph with 99th percentile times):

This is important to watch, as write operations need to wait for a journal commit. In the above example, “write_to_journal” and “write_to_data_files” are the main metrics I tend to look at. “write_to_journal” is the rate of changes being written to the journal, and “write_to_data_files” is the rate that changes are written to on-disk data.

If you see very high journal write times, you may need faster disks or, in sharding scenarios, more shards: adding shards spreads out the disk write load.
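
These journaling numbers correspond to the dur section of serverStatus, which is present when MMAPv1 journaling is enabled. A hedged way to look at them directly from the shell:

var dur = db.serverStatus().dur
printjson({
  commits: dur.commits,                            // journal commits in the last interval
  journaledMB: dur.journaledMB,                    // data written to the journal
  writeToJournalMs: dur.timeMs.writeToJournal,     // time spent writing to the journal
  writeToDataFilesMs: dur.timeMs.writeToDataFiles  // time spent applying changes to data files
})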

Background Flushing

“MMAPv1 Background Flushing Time” graphs the background operation that flushes data files to disk:

This process does not block the database, but does cause more disk activity.

Fragmentation

Due to the way MMAPv1 writes to disk, it creates a high rate of fragmentation (or holes) in its data files. Fragmentation slows down scan operations, wastes some filesystem cache memory and can use much more disk space than there is actual data. On many systems I’ve seen, the size of the MMAPv1 data files on disk is more than twice the true data size.

Currently, our Percona Monitoring and Management MMAPv1 support does not track this, but we plan to add it in the future.

To track it manually, look at the output of the “.stats()” command for a given collection (replace “sbtest1” with your collection name):

> 1 - ( db.sbtest1.stats().size / db.sbtest1.stats().storageSize )
0.14085410557184752

Here we can see this collection is about 14% fragmented on disk. The most common fix is dropping and recreating the collection from a backup. Many people simply remove a replica set member, clear its data and let it perform a new initial sync.
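
Where a full resync is impractical, a hedged alternative is the compact command, which rewrites and defragments a collection’s data and indexes in place. On MMAPv1 it blocks operations on the database it runs against and does not return freed space to the operating system, so it is usually run member by member, secondaries first:

db.runCommand({ compact: "sbtest1" })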

Operating System Memory

In PMM we have graphed the operating system cached memory as it acts as the primary cache for MMAPv1:

For the most part, “Cached” shows the amount of cached MMAPv1 data (assuming the host is only running MongoDB).

We also graph the dirty memory pages:

It is important that dirty pages do not exceed the hard dirty page limit (which causes pauses). It is also important that dirty pages don’t accumulate (which wastes cache memory). The “soft” dirty page limit is the limit that starts dirty page cleanup without pausing.

On this host, you could probably lower the soft limit to clean up memory faster, assuming the increase in disk activity is acceptable. This topic is covered in this post: https://www.percona.com/blog/2016/08/12/tuning-linux-for-mongodb/.

What’s Missing?

As mentioned earlier, fragmentation rates are missing for MMAPv1 (this would be a useful addition). Due to the limited nature of the metrics offered for MMAPv1, PMM probably won’t provide the same level of graphs for MMAPv1 compared to what we provide for WiredTiger or RocksDB. There will likely be fewer additions to the graphing capabilities going forward.

If you are using a highly concurrent system, we highly recommend you upgrade to WiredTiger or RocksDB (both also covered in this monitoring series). These engines provide several solutions to MMAPv1 headaches: document-level locking, built-in compression, checkpointing that causes near-zero fragmentation on disk, and much-improved visibility for monitoring. We just released Percona Server for MongoDB 3.4, and it provides many exciting features (including these engines).

Look out for more monitoring posts from this series!

Jan 06, 2016

MongoDB revs you up: What storage engine is right for you? (Part 1)


Differentiating Between MongoDB Storage Engines

The tremendous data growth of the last decade has affected almost all aspects of applications and application use. Since nearly all applications interact with a database at some point, this means databases needed to adapt to the change in usage conditions as well. Database technology has grown significantly in the last decade to meet the needs of constantly changing applications. Enterprises often need to scale, modify, or replace their databases in order to meet new business demands.

Within a database management system (DBMS), there are many levels that can affect performance, including the choice of your database storage engine. Surprisingly, many enterprises don’t know they have a choice of storage engines, or that specific storage engine types are architected to handle specific scenarios. Often the best option depends on what function the database in question is designed to fulfill.

With Percona’s acquisition of Tokutek, we’ve moved from a mostly-MySQL company to having several MongoDB-based software options available.

MongoDB is a cross-platform, NoSQL, document-oriented database. It doesn’t use the traditional table-based relational database structure, and instead employs JSON-type documents with dynamic schemas. The intention is making the integration of certain application data types easier and faster.

This blog (the first in a series) will briefly review some of the available options for a MongoDB database storage engine, and the pros and cons of each. Hopefully it will help database administrators, IT staff, and enterprises realize that when it comes to MongoDB, you aren’t limited to a single storage engine choice.

What is a Storage Engine?

A database storage engine is the underlying software that a DBMS uses to create, read, update and delete data from a database. The storage engine should be thought of as a “bolt on” to the database (server daemon), which controls the database’s interaction with memory and storage subsystems. Thus, the storage engine is not actually the database, but a service that the database consumes for the storage and retrieval of information. Given that the storage engine is responsible for managing the information stored in the database, it greatly affects the overall performance of the database (or lack thereof, if the wrong engine is chosen).

Most storage engines are organized using one of the following structures: a Log-Structured Merge (LSM) tree, B-Tree or Fractal tree.

  • LSM Tree. An LSM tree has performance characteristics that make it attractive for providing indexed access to files with high insert volume. LSM trees seek to provide the excellent insertion performance of log type storage engines, while minimizing the impact of searches in a data structure that is “sorted” strictly on insertion order. LSMs buffer inserts, updates and deletes by using layers of logs that increase in size, and then get merged in sorted order to increase the efficiency of searches.
  • B-Tree. B-Trees are the most commonly implemented data structure in databases. Having been around since the early 1970s, they are one of the most time-tested storage engine “methodologies.” The B-Tree’s method of data maintenance makes searches very efficient. However, the need to maintain a well-ordered data structure can have a detrimental effect on insertion performance.
  • Fractal Tree. A Fractal Tree index is a tree data structure much like that of a B-tree (designed for efficient searches), but also ingests data into log-like structures for efficient memory usage in order to facilitate high-insertion performance. Fractal Trees were designed to ingest data at high rates of speed in order to interact efficiently with the storage for high bandwidth applications.

Fractal Trees and the LSM trees sound very similar. The main differentiating factor, however, is the manner in which they sort the data into the tree for efficient searches. LSM trees merge data into a tree from a series of logs as the logs fill up. Fractal Trees sort data into log-like structures (message buffers) along the proper data path in the tree.

What storage engine is best?

That question is not a simple one. In order to decide which engine to choose, it’s necessary to determine the core functionality provided in each engine. Core functionality can generally be aggregated into three areas:

  • Locking types. Locking within database engines defines how access and updates to information are controlled. When an object in the database is locked for updating, other processes cannot modify (or in some cases read) the data until the update has completed. Locking not only affects how many different applications can update the information in the database, it can also affect queries on that data. It is important to monitor how queries access data, as the data could be altered or updated as it is being accessed. In general, such delays are minimal. The bulk of the locking mechanism is devoted to preventing multiple processes updating the same data. Since both additions (INSERT statements) and alterations (UPDATE statements) to the data require locking, you can imagine that multiple applications using the same database can have a significant impact. Thus, the “granularity” of the locking mechanism can drastically affect the throughput of the database in “multi-user” (or “highly-concurrent”) environments.
  • Indexing. The indexing method can dramatically increase database performance when searching and recovering data. Different storage engines provide different indexing techniques, and some may be better suited for the type of data you are storing. Typically, every index defined on a collection is another data structure of the particular type the engine uses (B-tree for WiredTiger, Fractal Tree for PerconaFT, and so forth). The efficiency of that data structure in relation to your workload is very important. An easy way of thinking about it is viewing every extra index as having performance overhead. A data structure that is write-optimized will have lower overhead for every index in a high-insert application environment than a non-write optimized data structure would. For use cases that require a large number of indexes, choosing an appropriate storage engine can have a dramatic impact.
  • Transactions. Transactions provide data reliability during the update or insert of information by enabling you to add data to the database, but only to commit that data when other conditions and stages in the application execution have completed successfully. For example, when transferring information (like a monetary credit) from one account to another, you would use transactions to ensure that both the debit from one account and the credit to the other completed successfully. Often, you will hear this referred to as “atomicity.” This means the operations that are bundled together are an indivisible unit: either all operations complete successfully, or none do. Despite the ability of RocksDB, PerconaFT and WiredTiger to support transactions, as of version 3.2 this functionality is not available in the MongoDB storage engine API, so multi-document transactions cannot be used in MongoDB. However, atomicity can be achieved at the single-document level (a minimal example follows this list). According to statements from MongoDB, Inc., multi-document transactions will be supported in the future, but a firm date has not been set as of this writing.
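
A minimal, hedged sketch of single-document atomicity (the collection and field names here are hypothetical): both fields change together or not at all, without any multi-document transaction.

db.accounts.updateOne(
  { _id: "alice", balance: { $gte: 100 } },          // match only if funds are sufficient
  { $inc: { balance: -100, pendingTransfers: 1 } }   // both modifications apply atomically
)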

Now that we’ve established a general framework, we’ll move onto discussing engines. For the first blog in this series, we’ll look at MMAPv1 (the default storage engine in MongoDB up to and including release 3.0).

MMAPv1

Find it in: MongoDB or Percona builds

MMAPv1 is MongoDB’s original storage engine, and was the default engine in MongoDB 3.0 and earlier. It is a B-tree based system that offloads much of the work of storage interaction and memory management to the operating system, and it is based on memory-mapped files.

The MMAP storage engine uses a process called “record allocation” to grab disk space for document storage. All records are contiguously located on disk, and when a document becomes larger than the allocated record, it must allocate a new record. New allocations require moving a document and updating all indexes that refer to the document, which takes more time than in-place updates and leads to storage fragmentation. Furthermore, MMAPv1 in its current iterations usually leads to high space utilization on your filesystem due to over-allocation of record space and its lack of support for compression.

As mentioned previously, a storage engine’s locking scheme is one of the most important factors in overall database performance. MMAPv1 has collection-level locking – meaning only one insert, update or delete operation can use a collection at a time. This type of locking scheme creates a very common scenario in concurrent workloads, where update/delete/insert operations are always waiting for the operation(s) in front of them to complete. Furthermore, oftentimes those operations are flowing in more quickly than they can be completed in serial fashion by the storage engine. To put it in context, imagine a giant supermarket on Sunday afternoon that only has one checkout line open: plenty of customers, but low throughput!

Given the storage engine choices brought about by the storage engine API in MongoDB 3.0, it is hard to imagine an application that demands the MMAPv1 storage engine for optimized performance. If you read between the lines, you could conclude that MongoDB, Inc. would agree given that the default engine was switched to WiredTiger in v3.2.

Conclusion

Most people don’t know that they have a choice when it comes to storage engines, and that the choice should be based on what the database workload will look like. Percona’s Vadim Tkachenko performed an excellent benchmark test comparing the performances of RocksDB, PerconaFT and WiredTiger to help specifically differentiate between these engines.

In the next post, we’ll examine the ins and outs of MongoDB’s new default storage engine, WiredTiger.

 

 
