Jan
03
2019
--

MongoDB Engines: MMAPV1 Vs WiredTiger

Review of MongoDB storage engines MMAPv1 and WiredTiger

In this post, we’ll take a look at the differences between the MMAPv1 and WiredTiger storage engines in MongoDB®. I’ve been asked about this by customers many times, and this blog is for you! We’ll cover the key features of each engine so that you can choose the right engine for your requirements.

In MongoDB, we mainly use the MMAPv1 and WiredTiger engines. Other engines are also available, such as the in-memory and RocksDB engines with Percona Server for MongoDB (PSMDB), and the in-memory engine with the MongoDB Enterprise version. When MongoDB was introduced, MMAPv1 was the default engine, and it is still part of MongoDB releases, though as per MongoDB’s plan it will no longer be available from 4.2. Those who remember the days of working with version 1.8 might miss it, even though they don’t use MMAPv1 currently! MongoDB acquired WiredTiger Inc. (see https://www.mongodb.com/press/wired-tiger) and from version 3.2 made WiredTiger the default engine of MongoDB. This engine later enabled the introduction of multi-document transactions, and it is mainly valued for features such as compression and document-level locking. Here we’ll look at the key features of WiredTiger and MMAPv1, and also present them in a table at the end – who doesn’t love a table to check the differences quickly! It reminds me of my school days :-) My co-author and friend Aayushi feels the same!

Some differences in detail

Storage Engines

The MongoDB storage engines manage BSON data in memory and on disk to support read and write operations.

MMAPV1:  This is the original storage engine for MongoDB, introduced in the first release, but it is deprecated from version 4.0.

WiredTiger:  This is a pluggable engine introduced by MongoDB in version 3.0, and it became the default storage engine from version 3.2.
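
If you are not sure which engine a running mongod uses, the engine name is reported by serverStatus(). A minimal check from the mongo shell (a sketch; the surrounding fields vary slightly between versions):

> db.serverStatus().storageEngine.name
wiredTiger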

Data compression

MMAPV1: does not support data compression. It is based on memory-mapped files, so it works well when you can keep your write set in memory. It excels at workloads with high-volume inserts, reads, and in-place updates.

WiredTiger: supports snappy and zlib compression. Consequently, MongoDB with WiredTiger takes up much less space compared with MMAPv1. It has its own write cache and also uses the filesystem cache.

  • Snappy: the default algorithm; efficient computation with a reasonable compression ratio (see the example below).
  • Zlib: a higher compression ratio at the cost of more CPU.
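
As a quick sketch of how these options are applied (the collection name here is made up, not from the original post): the block compressor can be chosen per collection at creation time, and the effective setting can be read back from the collection statistics. The server-wide default can also be changed at startup with the --wiredTigerCollectionBlockCompressor mongod option.

> db.createCollection("events", { storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } } })
{ "ok" : 1 }
> db.events.stats().wiredTiger.creationString
// returns a long configuration string that includes "block_compressor=zlib"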

Data Directory

Let’s take a look at the file system supporting the same data and replica set member for each of the engines. 

MMAPV1:

total 1.2G
-rw-r--r-- 1 vagrant vagrant    5 Nov 28 04:41 mongod.lock
-rw-rw-r-- 1 vagrant vagrant   69 Nov 28 04:41 storage.bson
-rw------- 1 vagrant vagrant  16M Nov 28 04:58 local.0
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 04:58 journal
-rw------- 1 vagrant vagrant  16M Nov 28 04:58 admin.ns
-rw------- 1 vagrant vagrant  16M Nov 28 04:58 admin.0
-rw------- 1 vagrant vagrant 512M Nov 28 04:59 local.2
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 04:59 diagnostic.data
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 05:16 _tmp
-rw------- 1 vagrant vagrant  16M Nov 28 05:17 test.ns
-rw------- 1 vagrant vagrant  16M Nov 28 05:17 test.0
-rw------- 1 vagrant vagrant  32M Nov 28 05:17 test.1
-rw------- 1 vagrant vagrant  16M Nov 28 09:09 local.ns
-rw------- 1 vagrant vagrant 512M Nov 28 09:09 local.1

WiredTiger:

total 5.4M
-rw-rw-r-- 1 vagrant vagrant   21 Nov 28 07:38 WiredTiger.lock
-rw-rw-r-- 1 vagrant vagrant   49 Nov 28 07:38 WiredTiger
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 07:38 journal
-rw-rw-r-- 1 vagrant vagrant 4.0K Nov 28 07:38 WiredTigerLAS.wt
-rw-rw-r-- 1 vagrant vagrant   95 Nov 28 07:38 storage.bson
-rw-r--r-- 1 vagrant vagrant    5 Nov 28 07:38 mongod.lock
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-7--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-5--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-3--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-1--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 collection-4--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 collection-2--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 collection-0--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-15--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:38 index-14--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant 1.8M Nov 28 07:38 index-17--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant 3.2M Nov 28 07:39 collection-16--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  16K Nov 28 07:39 collection-13--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  32K Nov 28 07:39 _mdb_catalog.wt
-rw-rw-r-- 1 vagrant vagrant  36K Nov 28 09:09 sizeStorer.wt
-rw-rw-r-- 1 vagrant vagrant  36K Nov 28 09:09 collection-6--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  52K Nov 28 09:09 collection-12--2134189858403062482.wt
-rw-rw-r-- 1 vagrant vagrant  76K Nov 28 09:09 WiredTiger.wt
-rw-rw-r-- 1 vagrant vagrant 1003 Nov 28 09:09 WiredTiger.turtle
drwxrwxr-x 2 vagrant vagrant 4.0K Nov 28 09:09 diagnostic.data

Journaling

MMAPV1: Journaling ensures that writes are atomic. If MongoDB goes down or terminates before committing changes to the data files, MongoDB can use the journal files to apply the write operations to the data files and maintain a consistent state.

WiredTiger: uses checkpoints, and the journal persists all data modifications made between checkpoints. For recovery from a database crash or abrupt termination, it replays the journal entries written since the last checkpoint. In most cases the journal is not strictly necessary for this engine: enable it only if you need to be able to recover up to the last successful write before the crash; otherwise MongoDB can usually recover from the last valid checkpoint. A checkpoint occurs every minute by default.
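
If you want to see checkpoint activity on your own instance, the WiredTiger section of serverStatus() exposes checkpoint counters. A small sketch (the sample values are illustrative, and the exact metric names can differ between WiredTiger versions):

> db.serverStatus().wiredTiger.transaction["transaction checkpoints"]
1342
> db.serverStatus().wiredTiger.transaction["transaction checkpoint most recent time (msecs)"]
25

Journaling itself is controlled with storage.journal.enabled in the configuration file (or disabled with the --nojournal startup option).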

Journal directory

This is how journal files appear in the data directory for the different engines:

MMAPV1:

vagrant@m103:/data/mongo1/journal$ ls -lrth
total 35M
-rw------- 1 vagrant vagrant  88 Nov 28 09:17 lsn
-rw------- 1 vagrant vagrant 35M Nov 28 09:17 j._0

WiredTiger:

-rw-rw-r-- 1 vagrant vagrant 100M Nov 28 07:38 WiredTigerPreplog.0000000001
-rw-rw-r-- 1 vagrant vagrant 100M Nov 28 07:38 WiredTigerPreplog.0000000002
-rw-rw-r-- 1 vagrant vagrant 100M Nov 28 09:16 WiredTigerLog.0000000001

Locks and concurrency

MMAPV1

  • Up until version 2.6: uses a readers-writer lock that allows concurrent read access to a database, but gives exclusive access to a single write operation. When a read lock exists, many read operations may share it. However, when a write lock exists, a single write operation holds the lock exclusively, and no other read or write operations may share it.
  • From 3.0: the MMAPv1 storage engine uses collection-level locking, an improvement on earlier versions in which the database lock was the finest-grain lock.

WiredTiger: supports document level locking. For most read and write operations, WiredTiger uses optimistic concurrency control. WiredTiger uses only intent locks at the global, database, and collection levels.

For example, deleting documents matching {x:1} from the collection “testData” acquires the write lock at the collection level differently for each of the storage engines.

MMAPV1:

2018-12-17T10:09:46.830+0000 I COMMAND  [conn8] command
testDB.$cmd appName: "MongoDB Shell"
command: delete { delete: "testData",
deletes: [ { q: { x: 1.0 }, limit: 0.0 } ], ordered: true }
numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 100795, w: 100795 } },
MMAPV1Journal: { acquireCount: { w: 100796 }, acquireWaitCount: { w: 12 },
timeAcquiringMicros: { w: 46212 } }, Database: { acquireCount: { w: 100795 } }
, Collection: { acquireCount: { W: 795 } }

where the uppercase W on the Collection represents an Exclusive (X) lock

WiredTiger:

2018-12-17T10:17:38.340+0000 I COMMAND  [conn1] command
testDB.$cmd appName: "MongoDB Shell"
command: delete { delete: "testData",
deletes: [ { q: { x: 1.0 }, limit: 0.0 } ], ordered: true }
numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 100795, w: 100795 } },
Database: { acquireCount: { w: 100795 } }, Collection: { acquireCount: { w: 795 } }

where the lowercase w on the Collection represents an Intent Exclusive (IX) lock

Memory

MMAPv1: MongoDB automatically uses all free memory on the machine as its cache. System resource monitors show that MongoDB uses a lot of memory, but its usage is dynamic. If another process suddenly needs half the server’s RAM, MongoDB will yield cached memory to the other process.

Technically, the operating system’s virtual memory subsystem manages MongoDB’s memory. This means that MongoDB will use as much free memory as it can, swapping to disk as needed. Deployments with enough memory to fit the application’s working data set in RAM will achieve the best performance.
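
You can watch this behaviour through the mem section of serverStatus(), which with MMAPv1 also reports the mapped size of the data files. A sketch with illustrative values (sizes are in MB):

> db.serverStatus().mem
{ "bits" : 64, "resident" : 1420, "virtual" : 3980, "supported" : true, "mapped" : 1744, "mappedWithJournal" : 3488 }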

WiredTiger: with WiredTiger, MongoDB utilizes both the WiredTiger internal cache and the filesystem cache. Via the filesystem cache, MongoDB automatically uses all free memory that is not used by the WiredTiger cache or by other processes. Starting in 3.4, the WiredTiger internal cache by default uses the larger of either of the following (a sketch of setting and checking the cache size follows the list):

  • 50% of (RAM – 1 GB), or
  • 256 MB.
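
A small sketch of how the cache is usually pinned down explicitly and then inspected at runtime (the 4 GB figure and the output values are only illustrative; the same setting lives under storage.wiredTiger.engineConfig.cacheSizeGB in the configuration file):

$ mongod --dbpath /data/db --wiredTigerCacheSizeGB 4

> db.serverStatus().wiredTiger.cache["maximum bytes configured"]
4294967296
> db.serverStatus().wiredTiger.cache["bytes currently in the cache"]
1073741824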

Quick reference: MMAPV1 vs WiredTiger

Use this table as a quick reference to the differences between MMAPv1 and WiredTiger.

Key Feature | MMAPV1 | WiredTiger
Introduction & default engine | The original engine, shipped with MongoDB from the start and the default until version 3.0. Deprecated in 4.0 and will be removed in a future release. | Introduced in version 3.0 and the default from version 3.2.
Data compression | Doesn’t support compression. | Compression with the default snappy method or with zlib, so it occupies less space than the MMAPv1 engine.
Journaling | MongoDB writes the in-memory changes first to on-disk journal files. If MongoDB goes down or terminates before committing the changes to the data files, it can use the journal files to apply the write operations to the data files and maintain a consistent state. | The WiredTiger journal persists all data modifications between checkpoints. If MongoDB exits between checkpoints, it uses the journal to replay all data modified since the last checkpoint.
Locks & concurrency | Until 2.6, MongoDB used a readers-writer lock that allows concurrent read access to a database but gives exclusive access to a single write operation. From 3.0, it uses collection-level locking. | Supports document-level locking.
Transactions | An operation on a single document is atomic. | Multi-document transactions are only available for deployments from version 4.0.
CPU performance | Adding CPU cores does not improve performance much. | Performs better on multi-core systems.
Encryption | Encryption is not possible. | Encryption at rest is available with MongoDB Enterprise and as beta in PSMDB 3.6.8.
Memory | Automatically uses all free memory on the machine as its cache. | Uses the WiredTiger internal cache and the filesystem cache.
Updates | Excels at workloads with high-volume inserts, reads, and in-place updates. | Does not support in-place updates; an update causes the whole document to be rewritten.
Tuning | Offers little scope for tuning. | Allows more tuning through different variables, e.g. cache size, read/write tickets, checkpoint interval, etc.

Conclusion

The above information does not cover every difference between MMAPV1 and WiredTiger, but it lists the key ones. If you have any other key features to add, please feel free to mention them in the comments! Let’s share and let everyone know about them.


Photo by Mathew Schwartz on Unsplash

Sep
26
2018
--

Scaling IO-Bound Workloads for MySQL in the Cloud – part 2

Rplot07-innodb-iops

This post is a followup to my previous article https://www.percona.com/blog/2018/08/29/scaling-io-bound-workloads-mysql-cloud/

In this instance, I want to show the data in different dimensions, primarily to answer questions around how throughput scales with increasing IOPS.

A recap: for the test I used Amazon instances with Amazon gp2 and io1 volumes. In addition to the original post, I also tested two gp2 volumes combined in software RAID0. I did this for the following reason: Amazon caps single gp2 volume throughput at 160MB/sec, and as we will see from the charts, this limits InnoDB performance.
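
For reference, here is a sketch of how such a RAID0 device is typically built with mdadm, assuming the two gp2 volumes are attached as /dev/xvdf and /dev/xvdg (device names and mount point will differ on your instance):

sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/xvdf /dev/xvdg
sudo mkfs.xfs /dev/md0
sudo mount /dev/md0 /mnt/data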

Also, a reminder from the previous post: we can increase gp2 IOPS by increasing the volume size (up to the top limit of 10,000 IOPS), while for io1 we can increase IOPS by paying for additional provisioned IOPS.
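
To put rough numbers on that (assuming the usual gp2 baseline of about 3 IOPS per GiB, which is worth re-checking against current AWS documentation): a 1,000 GiB gp2 volume gives roughly 3,000 IOPS, and you need a volume of roughly 3,334 GiB to reach the 10,000 IOPS ceiling, whereas io1 lets you provision IOPS directly and bills for them separately.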

Scaling with InnoDB

So for the first result, let’s see how InnoDB scales with increasing IOPS.

There are a few interesting observations here: InnoDB scales linearly with additional IOPS, but it faces a throughput limit that Amazon applies to volumes.

So besides considering IOPS, we should take into account the maximal throughput of volumes.

In the second chart we compare InnoDB performance vs the cost of volumes:

It’s interesting to see here that the slope for gp2 volumes is steeper than for io1 volumes. This means we can get a bigger increase in InnoDB performance per dollar using gp2 volumes, but only until we reach the IOPS and throughput limits that are applied to gp2 volumes.

Scaling with MyRocks

And here’s the similar chart but for MyRocks:

Here we can also see that MyRocks scales linearly, showing identical results on gp2 and io1 volumes. This means that running on gp2 will be cheaper. Also, there is no plateau in throughput, as we saw for InnoDB, which means that MyRocks uses less IO throughput.

And the chart for the cost of running MyRocks:

This chart also shows that it is cheaper to run on a gp2 volume, but only while it provides enough IOPS. I assume that using two gp2 volumes would allow me to double the throughput. (I did not run the test for MyRocks using two volumes.)

Conclusions

  • Both MyRocks and InnoDB can scale (linearly) with additional IOPS on gp2 and io1 Amazon volumes.
  • Take into account that IOPS is not the only factor to consider. There is also a throughput limit, which affects the InnoDB results, so for further scaling you might need to use multiple volumes.


Sep
15
2017
--

Percona Blog Poll Results: What Database Engine Are You Using to Store Time Series Data?

Time Series Data

In this blog post, we talk about the results of Percona’s time series database poll “What Database Engine Are You Using to Store Time Series Data?”

Time series data is some of the most actionable data available when it comes to analyzing trends and making predictions. Simply put, time series data is data that is indexed not just by value, but by time as well – allowing you to view value changes over time as they occur. Obvious uses include the stock market, web traffic, user behavior, etc.

With the increasing number of smart devices in the Internet of Things (IoT), being able to track data over time is more and more important. With time series data, you can measure and make predictions on things like energy consumption, pH values, water consumption, data from environment-aware machines like smart cars, etc. The sensors used in IoT devices and systems generate huge amounts of time-series data.

A couple of months back, we ran a poll on what time series databases were being used by the community. We wanted to quickly report on the results from that poll.

First the results:

(The poll results chart is embedded in the original post.)

Here are some thoughts:

  • The fact that this blog started as a place exclusively for MySQL information probably explains why we skewed high with MySQL respondents – still that doesn’t mean it doesn’t reflect reality.
  • Elastic seems the most common after that, possibly to tie in with MySQL use.
  • InfluxDB is the next most popular. This suggests that Paul Dix’s chosen business model is “AOK”, so to speak. It is unclear whether people use the open source version, or outgrow it and switch to the commercial offering.
  • We lumped together “general purpose NoSQL engine”, but in some cases examples like Cassandra are targeted at time series. Notice that KairosDB, which is built on top of Cassandra itself, is not as popular in our survey.
  • Prometheus is the canonical “not a time series database”, but still used as one. I have a feeling alongside Graphite, this is monitoring related.
  • ClickHouse is a newcomer for time series workloads, and it is surprising that it ranks so highly. It was also relatively unknown outside of its home country, Russia, but now we are seeing it used at places like CloudFlare and more.

Thanks for participating in the poll. We’re still running a poll on operating systems, so don’t forget to register your responses. We’ll report on that poll soon, with a new one on the way shortly.

Mar
07
2017
--

How to Change MongoDB Storage Engines Without Downtime

MongoDB Storage Engines

This blog is another in the series for the Percona Server for MongoDB 3.4 bundle release. Today’s blog post is about how to migrate between Percona Server for MongoDB storage engines without downtime.

Today, the default storage engine for MongoDB is WiredTiger. In previous versions (before 3.2), it was MMAPv1.

Percona Server for MongoDB features some additional storage engines, giving a DBA the freedom to choose the best storage engine based on the application workload.

By design, each storage engine has its own algorithm and disk usage patterns. We simply stop and start Percona Server for MongoDB using different storage engines.

There are two common methods to change storage engines. One requires downtime, and the second doesn’t.

All the database operations are the same, even if it is using a different storage engine. From the database perspective, it doesn’t matter what storage engine gets used. The database layer asks the persistence API to save or retrieve data regardless.

For a single database instance, the best storage engine migration method is to start replication and add a secondary node with a different storage engine. Then run rs.stepDown() on the primary, making the secondary the new primary (and killing the old primary).

However, this isn’t always an option. In this case, create a backup and use the backup to restore the database.

In the following set of steps, we’ll explain how to migrate a replica set storage engine from WiredTiger to RocksDB without downtime. I’m assuming that the replica set is already configured and doesn’t have any replication lag.
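
A quick way to confirm there is no lag before starting is the replication info helper (a sketch; in the 3.x shell this helper is still called rs.printSlaveReplicationInfo(), and the output below is trimmed):

foo:PRIMARY> rs.printSlaveReplicationInfo()
source: test:27018
	0 secs (0 hrs) behind the primary
source: test:27019
	0 secs (0 hrs) behind the primary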

Please follow the instructions below:

  1. Check replica set status and identify the primary and secondaries. (Part of the output has been hidden to make it easier to read.):
    foo:PRIMARY> rs.status()
    {
    	"set" : "foo",
    	"date" : ISODate("2017-02-18T18:47:54.349Z"),
    	"myState" : 2,
    	"term" : NumberLong(2),
    	"syncingTo" : "adamo-percona:27019",
    	"heartbeatIntervalMillis" : NumberLong(2000),
    	"members" : [
    		{
    			"_id" : 0,
    			"name" : "test:27017",
    			"stateStr" : "PRIMARY" (...)
    		},
    		{
    			"_id" : 1,
    			"name" : "test:27018",
    			"stateStr" : "SECONDARY" (...)
    		},
    		{
    			"_id" : 2,
    			"name" : "test:27019",
    			"stateStr" : "SECONDARY" (...)
    		} { ... }
    	],
    	"ok" : 1
    }
  2. Choose the secondary for the new storage engine, and change its priority to 0:

    foo:PRIMARY> cfg = rs.config()

    We are going to work with test:27018 and test:27019; they are at index 1 and index 2, respectively, in the members array.

  3. Start with the last secondary as the first instance whose storage engine will be replaced:

    foo:PRIMARY> cfg.members[2].name
    test:27019
    foo:PRIMARY> cfg.members[2].priority = 0
    0
    foo:PRIMARY> cfg.members[2].hidden = true
    true
    foo:PRIMARY> rs.reconfig(cfg)
    { "ok" : 1 }
  4. Check if the configuration is in place:
    foo:PRIMARY>rs.config()
    {
    	"_id" : "foo",
    	"version" : 4,
    	"protocolVersion" : NumberLong(1),
    	"members" : [
    		{
    			"_id" : 0,
    			"host" : "test:27017",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : false,
    			"priority" : 1,
    			"votes" : 1
    		},
    		{
    			"_id" : 1,
    			"host" : "test:27018",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : false,
    			"priority" : 1,
    			"slaveDelay" : NumberLong(0),
    			"votes" : 1
    		},
    		{
    			"_id" : 2,
    			"host" : "test:27019",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : true, <--
    			"priority" : 0, <--
    			"slaveDelay" : NumberLong(0),
    			"votes" : 1
    		}
    	],
    	"settings" : {...}
    }
  5. Then stop the desired secondary and wipe its database folder. As we are running the replica set in a testing box, I’m going to kill the process running on port 27019; if you run mongod as a service, use sudo service mongod stop on the secondary box instead. Before starting the mongod process again, add the --storageEngine parameter on the command line or set the engine in the config file:

    ps -ef | grep mongodb | grep 27019
    kill <mongod pid>
    rm -rf /data3/*
    ./mongod --dbpath data3 --logpath data3/log3.log --fork --port 27019 --storageEngine=rocksdb --replSet foo

    # or set the engine in the config file:
    storage:
      engine: rocksdb
  6. This instance is now using the RocksDB storage engine and will perform an initial sync to get the data from the primary node. When it finishes, set the hidden flag back to false so that the application can query this box:

    foo:PRIMARY> cfg = rs.config()
    foo:PRIMARY> cfg.members[2].hidden = false
    false
    foo:PRIMARY> rs.reconfig(cfg)
    { "ok" : 1 }
  7. Repeat steps 5 and 6 for the box test:27018. Then use the following commands to restore the priorities, so that either secondary can later be elected primary. Please be sure all secondaries are healthy before proceeding:
    foo:PRIMARY> cfg = rs.config()
    foo:PRIMARY> cfg.members[2].hidden = false
    false
    foo:PRIMARY> cfg.members[2].priority = 1
    foo:PRIMARY> cfg.members[1].priority = 1
  8. When both secondaries are available for reading and in sync with the primary, we need to change the primary’s storage engine. To do so, run rs.stepDown() on the primary, making this instance a secondary. An election is triggered (and may take a few seconds to complete):

    foo:PRIMARY> rs.stepDown()
    2017-02-20T16:34:53.814-0300 E QUERY [thread1] Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:27019' :
    DB.prototype.runCommand@src/mongo/shell/db.js:135:1
    DB.prototype.adminCommand@src/mongo/shell/db.js:153:16
    rs.stepDown@src/mongo/shell/utils.js:1182:12
    @(shell):1:1
    2017-02-20T16:34:53.815-0300 I NETWORK [thread1] trying reconnect to 127.0.0.1:27019 (127.0.0.1) failed
    2017-02-20T16:34:53.816-0300 I NETWORK [thread1] reconnect 127.0.0.1:27019 (127.0.0.1) ok
    foo:SECONDARY> rs.status()
  9. Identify the new primary with rs.status() and repeat steps 5 and 7 with the old primary.

After this process, the instances will run RocksDB without experiencing downtime (just an election to change the primary).

Please feel free to ping us on Twitter @percona with any questions and suggestions for this blog post.

Feb
10
2017
--

Percona Blog Poll: What Database Engine Are You Using to Store Time Series Data?

Time Series Data

Take Percona’s blog poll on what database engine you are using to store time series data.

Time series data is some of the most actionable data available when it comes to analyzing trends and making predictions. Simply put, time series data is data that is indexed not just by value, but by time as well – allowing you to view value changes over time as they occur. Obvious uses include the stock market, web traffic, user behavior, etc.

With the increasing number of smart devices in the Internet of Things (IoT), being able to track data over time is more and more important. With time series data, you can measure and make predictions on things like energy consumption, pH values, water consumption, data from environment-aware machines like smart cars, etc. The sensors used in IoT devices and systems generate huge amounts of time-series data.

How is all of this data collected, segmented and stored? We’d like to hear from you: what database engine are you using to store time series data? Please take a few seconds and answer the following poll. Which are you using? Help the community learn what database engines help solve critical database issues. Please select from one to three database engines as they apply to your environment. Feel free to add comments below if your engine isn’t listed.

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

Jan
06
2016
--

MongoDB revs you up: What storage engine is right for you? (Part 1)

MongoDB

Differentiating Between MongoDB Storage Engines

The tremendous data growth of the last decade has affected almost all aspects of applications and application use. Since nearly all applications interact with a database at some point, this means databases needed to adapt to the change in usage conditions as well. Database technology has grown significantly in the last decade to meet the needs of constantly changing applications. Enterprises often need to scale, modify, or replace their databases in order to meet new business demands.

Within a database management system (DBMS), there are many levels that can affect performance, including the choice of your database storage engine. Surprisingly, many enterprises don’t know they have a choice of storage engines, or that specific storage engine types are architected to handle specific scenarios. Often the best option depends on what function the database in question is designed to fulfill.

With Percona’s acquisition of Tokutek, we’ve moved from a mostly-MySQL company to having several MongoDB-based software options available.

MongoDB is a cross-platform, NoSQL, document-oriented database. It doesn’t use the traditional table-based relational database structure, and instead employs JSON-type documents with dynamic schemas. The intention is making the integration of certain application data types easier and faster.

This blog (the first in a series) will briefly review some of the available options for a MongoDB database storage engine, and the pros and cons of each. Hopefully it will help database administrators, IT staff, and enterprises realize that when it comes to MongoDB, you aren’t limited to a single storage engine choice.

What is a Storage Engine?

A database storage engine is the underlying software that a DBMS uses to create, read, update and delete data from a database. The storage engine should be thought of as a “bolt on” to the database (server daemon), which controls the database’s interaction with memory and storage subsystems. Thus, the storage engine is not actually the database, but a service that the database consumes for the storage and retrieval of information. Given that the storage engine is responsible for managing the information stored in the database, it greatly affects the overall performance of the database (or lack thereof, if the wrong engine is chosen).

Most storage engines are organized using one of the following structures: a Log-Structured Merge (LSM) tree, B-Tree or Fractal tree.

  • LSM Tree. An LSM tree has performance characteristics that make it attractive for providing indexed access to files with high insert volume. LSM trees seek to provide the excellent insertion performance of log type storage engines, while minimizing the impact of searches in a data structure that is “sorted” strictly on insertion order. LSMs buffer inserts, updates and deletes by using layers of logs that increase in size, and then get merged in sorted order to increase the efficiency of searches.
  • B-Tree. B-Trees are the most commonly implemented data structure in databases. Having been around since the early 1970’s, they are one of the most time-tested storage engine “methodologies.” B-Trees method of data maintenance makes searches very efficient. However, the need to maintain a well-ordered data structure can have a detrimental effect on insertion performance.
  • Fractal Tree. A Fractal Tree index is a tree data structure much like that of a B-tree (designed for efficient searches), but also ingests data into log-like structures for efficient memory usage in order to facilitate high-insertion performance. Fractal Trees were designed to ingest data at high rates of speed in order to interact efficiently with the storage for high bandwidth applications.

Fractal Trees and the LSM trees sound very similar. The main differentiating factor, however, is the manner in which they sort the data into the tree for efficient searches. LSM trees merge data into a tree from a series of logs as the logs fill up. Fractal Trees sort data into log-like structures (message buffers) along the proper data path in the tree.

What storage engine is best?

That question is not a simple one. In order to decide which engine to choose, it’s necessary to determine the core functionality provided by each engine. Core functionality can generally be aggregated into three areas:

  • Locking types. Locking within database engines defines how access and updates to information are controlled. When an object in the database is locked for updating, other processes cannot modify (or in some cases read) the data until the update has completed. Locking not only affects how many different applications can update the information in the database, it can also affect queries on that data. It is important to monitor how queries access data, as the data could be altered or updated as it is being accessed. In general, such delays are minimal. The bulk of the locking mechanism is devoted to preventing multiple processes updating the same data. Since both additions (INSERT statements) and alterations (UPDATE statements) to the data require locking, you can imagine that multiple applications using the same database can have a significant impact. Thus, the “granularity” of the locking mechanism can drastically affect the throughput of the database in “multi-user” (or “highly-concurrent”) environments.
  • Indexing. The indexing method can dramatically increase database performance when searching and recovering data. Different storage engines provide different indexing techniques, and some may be better suited for the type of data you are storing. Typically, every index defined on a collection is another data structure of the particular type the engine uses (B-tree for WiredTiger, Fractal Tree for PerconaFT, and so forth). The efficiency of that data structure in relation to your workload is very important. An easy way of thinking about it is viewing every extra index as having performance overhead. A data structure that is write-optimized will have lower overhead for every index in a high-insert application environment than a non-write optimized data structure would. For use cases that require a large number of indexes, choosing an appropriate storage engine can have a dramatic impact.
  • Transactions. Transactions provide data reliability during the update or insert of information by enabling you to add data to the database, but only to commit that data when other conditions and stages in the application execution have completed successfully. For example, when transferring information (like a monetary credit) from one account to another, you would use transactions to ensure that both the debit from one account and the credit to the other completed successfully. Often, you will hear this referred to as “atomicity.” This means the operations that are bundled together are an immutable unit: either all operations complete successfully, or none do. Despite the ability of RocksDB, PerconaFT and WiredTiger to support transactions, as of version 3.2 this functionality is not available in the MongoDB storage engine API. Multi-document transactions cannot be used in MongoDB. However, atomicity can be achieved at the single document level (see the sketch after this list). According to statements from MongoDB, Inc., multi-document transactions will be supported in the future, but a firm date has not been set as of this writing.
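
As an illustration of single-document atomicity (a sketch with made-up collection and field names): moving credit between two balances that live inside the same document is atomic, because a single update statement either applies completely or not at all.

> db.wallets.insert({ _id: 1, checking: 100, savings: 50 })
> db.wallets.update({ _id: 1 }, { $inc: { checking: -30, savings: 30 } })

If the two balances were stored in separate documents, the application would need its own safeguards (or, in later MongoDB versions, multi-document transactions) to get the same guarantee.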

Now that we’ve established a general framework, we’ll move onto discussing engines. For the first blog in this series, we’ll look at MMAPv1 (the default storage engine that comes with MongoDB up until the release 3.0).

MMAPv1

Find it in: MongoDB or Percona builds

MMAPv1 is MongoDB’s original storage engine, and was the default engine in MongoDB 3.0 and earlier. It is a B-tree based system that offloads much of the responsibility for storage interaction and memory management to the operating system, and it is based on memory-mapped files.

The MMAP storage engine uses a process called “record allocation” to grab disk space for document storage. All records are contiguously located on disk, and when a document becomes larger than its allocated record, a new record must be allocated. New allocations require moving the document and updating all indexes that refer to it, which takes more time than in-place updates and leads to storage fragmentation. Furthermore, MMAPv1 in its current iterations usually leads to high space utilization on your filesystem, due to over-allocation of record space and its lack of support for compression.
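
You can see the effects of record allocation in the collection statistics: under MMAPv1, collStats reports a paddingFactor and a storageSize that is usually noticeably larger than the logical data size. A sketch with an illustrative collection name and values (field availability depends on the MongoDB version):

> db.mycollection.stats().storageSize
8388608
> db.mycollection.stats().paddingFactor
1.5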

As mentioned previously, a storage engine’s locking scheme is one of the most important factors in overall database performance. MMAPv1 has collection-level locking – meaning only one insert, update or delete operation can use a collection at a time. This type of locking scheme creates a very common scenario in concurrent workloads, where update/delete/insert operations are always waiting for the operation(s) in front of them to complete. Furthermore, oftentimes those operations are flowing in more quickly than they can be completed in serial fashion by the storage engine. To put it in context, imagine a giant supermarket on Sunday afternoon that only has one checkout line open: plenty of customers, but low throughput!

Given the storage engine choices brought about by the storage engine API in MongoDB 3.0, it is hard to imagine an application that demands the MMAPv1 storage engine for optimized performance. If you read between the lines, you could conclude that MongoDB, Inc. would agree given that the default engine was switched to WiredTiger in v3.2.

Conclusion

Most people don’t know that they have a choice when it comes to storage engines, and that the choice should be based on what the database workload will look like. Percona’s Vadim Tkachenko performed an excellent benchmark test comparing the performances of RocksDB, PerconaFT and WiredTiger to help specifically differentiate between these engines.

In the next post, we’ll examine the ins and outs of MongoDB’s new default storage engine, WiredTiger.

Dec
23
2015
--

Percona Server for MongoDB storage engines in iiBench insert workload

storage engine

We recently released the GA version of Percona Server for MongoDB, which comes with a variety of storage engines: RocksDB, PerconaFT and WiredTiger.

Both RocksDB and PerconaFT are write-optimized engines, so I wanted to compare all engines in a workload oriented to data ingestions.

For a benchmark I used iiBench-mongo (https://github.com/mdcallag/iibench-mongodb), and I inserted one billion rows into a collection with three indexes. Inserts were done in ten parallel threads.

For memory limits, I used a 10GB cache size, with a total limit of 20GB available for the mongod process, enforced with cgroups (so the extra 10GB of memory was available for engine memory allocation and OS cache).
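
For those curious how such a cap is applied in practice, a memory cgroup can be created with the cgroup-tools utilities and mongod started inside it, along these lines (a sketch; the group name and paths are mine, not taken from the benchmark scripts):

sudo cgcreate -g memory:mongod-bench
echo $(( 20 * 1024 * 1024 * 1024 )) | sudo tee /sys/fs/cgroup/memory/mongod-bench/memory.limit_in_bytes
sudo cgexec -g memory:mongod-bench mongod --dbpath /mnt/m500/wiredtiger --storageEngine wiredTiger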

For the storage I used a single Crucial M500 960GB SSD. This is a consumer grade SATA SSD. It does not provide the best performance, but it is a great option price/performance wise.

Every time I mention WiredTiger, someone in the comments asks about the LSM option for WiredTiger. Even though LSM is still not an official mode in MongoDB 3.2, I added WiredTiger-LSM from MongoDB 3.2 into the mix. It won’t have the optimal settings, as there is no documentation on how to use LSM in WiredTiger.

First, let me show a combined graph for all engines:
engines-timeline

And now, let’s zoom in on the individual engines.

WiredTiger:

wt-3.0

RocksDB + PerconaFT:

rocks-perconaft-3.0

UPDATE on 12/30/15
With an input from RocksDB developers at Facebook, after extra tuning of RocksDB (add delayed_write_rate=12582912;soft_rate_limit=0;hard_rate_limit=0; to config) I am able to get much better result for RocksDB:
rocks-3.0-dyn12M

What conclusions can we make?

  1. WiredTiger, while the data fit in memory (about the first one million rows), performed extremely well, achieving over 100,000 inserts/sec. As the data grew and exceeded the memory size, WiredTiger behaved like a traditional B-Tree engine (which is no surprise).
  2. PerconaFT and RocksDB showed closer to constant throughput, with RocksDB being better overall. However, with data growth both engines start to experience challenges: PerconaFT’s throughput varies more as the data grows, and RocksDB shows more stalls (which I think is related to the compaction process).
  3. WiredTiger LSM didn’t show as much variance as the B-Tree, but it still had a decline related to data size, which in general should not be there (as we can see with RocksDB, which is also LSM based).

Inserting data is only one part of the equation. Now we also need to retrieve data from the database (which we’ll cover in another blog post).

Configuration for PerconaFT:

numactl --interleave=all ./mongod --dbpath=/mnt/m500/perconaft --storageEngine=PerconaFT --PerconaFTEngineCacheSize=$(( 10*1024*1024*1024 )) --syncdelay=900 --PerconaFTIndexFanout=128 --PerconaFTCollectionFanout=128 --PerconaFTIndexCompression=quicklz --PerconaFTCollectionCompression=quicklz --PerconaFTIndexReadPageSize=16384 --PerconaFTCollectionReadPageSize=16384

Configuration for RocksDB:

storage.rocksdb.configString:
 "bytes_per_sync=16m;max_background_flushes=3;max_background_compactions=12;max_write_buffer_number=4;max_bytes_for_level_base=1500m;target_file_size_base=200m;level0_slowdown_writes_trigger=12;write_buffer_size=400m;compression_per_level=kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression:kSnappyCompression;optimize_filters_for_hits=true"

Configuration for WiredTiger-3.2 LSM:

storage.wiredTiger.collectionConfig.configString:
 "type=lsm"
 storage.wiredTiger.indexConfig.configString:
 "type=lsm"

Load parameters for iibench:

TEST_RUN_ARGS_LOAD="1000000000 6000 1000 999999 10 256 3 0

Jul
24
2015
--

InnoDB vs TokuDB in LinkBench benchmark

Previously I tested Tokutek’s Fractal Trees (TokuMX & TokuMXse) as MongoDB storage engines – today let’s look into the MySQL area.

I am going to use a modified LinkBench under a heavy IO load.

I compared InnoDB without compression, InnoDB with 8k compression, and TokuDB with quicklz compression.
The uncompressed data size is 115GiB, and the cache size is 12GiB for InnoDB and 8GiB + 4GiB of OS cache for TokuDB.

It is important to note that I used tokudb_fanout=128, which is only available in our latest Percona Server release.
I will write more on Fractal Tree internals and what tokudb_fanout means later. For now, let’s just say it changes the shape of the fractal tree (compared to the default tokudb_fanout=16).
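
For reference, the fanout is controlled by the tokudb_fanout server variable in Percona Server; a minimal sketch of setting it in my.cnf and verifying it (exact defaults and dynamic behaviour depend on the Percona Server release):

[mysqld]
tokudb_fanout = 128

mysql> SHOW GLOBAL VARIABLES LIKE 'tokudb_fanout';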

I am using two storage options:

  • Intel P3600 PCIe SSD 1.6TB (marked as “i3600” on charts) – as a high end performance option
  • Crucial M500 SATA SSD 900GB (marked as “M500” on charts) – as a low end SATA SSD

The full results and engine options are available here

Results on Crucial M500 (throughput, more is better)

Crucial M500

    Engine Throughput [ADD_LINK/10sec]

  • InnoDB: 6029
  • InnoDB 8K: 6911
  • TokuDB: 14633

Here TokuDB outperforms InnoDB by almost two times, but it also shows great variance in results, which I attribute to checkpoint activity.

Results on Intel P3600 (throughput, more is better)

Intel P3600

    Engine Throughput [ADD_LINK/10sec]
  • InnoDB: 27739
  • InnoDB 8K: 9853
  • TokuDB: 20594

To understand why InnoDB shines on fast storage, let’s review the IO usage of all engines.
The following chart shows the reads in KiB that each engine performs, on average, for a single client request.

IO Reads

The following chart shows the writes in KiB that each engine performs, on average, for a single client request.

IO Writes

Here we can make an interesting observation: TokuDB on average performs half as many writes as InnoDB, and this is what allows TokuDB to be better on slow storage. On fast storage, where there is no performance penalty for many writes, InnoDB is able to get ahead, as InnoDB is still better at using CPUs.

Though it is worth remembering that:

  • On fast, expensive storage, TokuDB provides better compression, which allows you to store more data in a limited capacity.
  • TokuDB still writes about half as much as InnoDB, which means roughly twice the lifetime for an SSD (which is still expensive).

Also, looking at the results, I can conclude that InnoDB compression is inefficient in its implementation, as it is not able to benefit either from doing fewer reads (well, it helps it beat uncompressed InnoDB, but not by much) or from fast storage.

