Sep
14
2015
--

MongoDB and Percona TokuMX Security Guidelines

Several reports have been published in the news about how easy it is to access data stored in some NoSQL systems, including MongoDB. This is not surprising, because security was rather relaxed in earlier versions of MongoDB. This post lists some of the common vulnerabilities in MongoDB and Percona TokuMX.

Network Security

One key point is to ensure that the bind_ip setting is correctly adjusted: in MongoDB 2.4 and Percona TokuMX, it is not set, which means that the server will listen on all available network interfaces. If proper firewall rules (iptables, Security Groups in AWS, …) are not in place, your dataset could easily be queried from anywhere in the world!

In MongoDB 2.6+, bind_ip is set by default to 127.0.0.1 in the official .deb and .rpm packages. This is great from a security point of view, but remember that you’ll still have to adjust the setting if the application servers are not running on the same host.
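
For example, assuming the application servers reach the database over a hypothetical private address 10.0.0.5, the setting would look like this (YAML format for MongoDB 2.6+, INI style for 2.4/Percona TokuMX):

# MongoDB 2.6+ (/etc/mongod.conf):
net:
  bindIp: 127.0.0.1,10.0.0.5

# MongoDB 2.4 / Percona TokuMX:
bind_ip = 127.0.0.1,10.0.0.5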

MongoDB has an HTTP interface that can display statistics; it is available on a port that is 1000 higher than the port used for regular connections. So if mongod is listening on port 27017, the HTTP interface can be reached on port 28017. Although this interface is read only, the stats it exposes should not be readable by just anyone. The best option is then to disable this interface on production systems. This is done by default in MongoDB 2.6+ (set nohttpinterface: false if you want to use the HTTP interface).

Operating System Security

The mongod and mongos binaries should be run as a dedicated user that has limited privileges. This is done automatically if you have installed MongoDB/Percona TokuMX from a package, but that may not be the case if you have installed from a tarball.

Also make sure that the permissions of the configuration file do not allow arbitrary users to modify it: any modification would take effect the next time MongoDB is restarted.

User Security

MongoDB does not enable authentication by default. This is very handy for development, but it is of course not suitable for production servers.

For standalone instances, you will have to set security.authorization: enabled in MongoDB 2.6+ or use the auth setting in MongoDB 2.4/Percona TokuMX.
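
In configuration-file terms, that is (a minimal sketch; adjust to your version’s config format):

# MongoDB 2.6+ (YAML):
security:
  authorization: enabled

# MongoDB 2.4 / Percona TokuMX (INI style):
auth = true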

For replica sets and sharded clusters, you will have to use a key file (security.keyFile or keyFile, depending on the version). In this case, security.authorization/auth is implied.
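
A common way to generate and protect such a key file (the same file must then be copied to every member; the path and ownership below are illustrative):

openssl rand -base64 741 > /etc/mongodb-keyfile
chmod 600 /etc/mongodb-keyfile
chown mongodb:mongodb /etc/mongodb-keyfile   # match the user mongod runs as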

Then you will need to create users: start with an administrator user, and then application user(s). This is well explained in the documentation. Note that you will define users and roles in slightly different ways depending on the MongoDB version. It is always good to check the documentation for your specific version.

Also note that the localhost exception allows you to enable authentication before you create the first user.
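
As a minimal sketch for MongoDB 2.6+ (2.4 and Percona TokuMX use db.addUser and slightly different role names; "myapp" and the user names below are just placeholders):

// Connect from localhost (the localhost exception) and create the administrator first:
use admin
db.createUser({
    user: "admin",
    pwd: "a strong password",
    roles: [ { role: "userAdminAnyDatabase", db: "admin" } ]
})

// Then create an application user limited to its own database:
use myapp
db.createUser({
    user: "appuser",
    pwd: "another strong password",
    roles: [ { role: "readWrite", db: "myapp" } ]
})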

Conclusion

As you can see, it is not that difficult to get a decently secured MongoDB deployment. The trap, however, is that some critical pieces (like authentication) are not enabled by default, so if you’re not careful, or if you are under heavy time constraints, it is easy to forget a necessary configuration and end up with a weak setup. Don’t let this happen to you! And of course, a good place to start is the official documentation.


Sep
10
2015
--

Percona TokuMX 2.0.2 is now available

Percona is glad to announce the release of Percona TokuMX 2.0.2 on September 10th, 2015. Downloads are available here and from the Percona Software Repositories.

Based on MongoDB 2.4.10, and including all the bug fixes in it, Percona TokuMX 2.0.2 is the current GA release in the Percona TokuMX 2.0 series.

New Features

  • Percona TokuMX Enterprise has been open-sourced and now replaces the Percona TokuMX Community edition. To install or upgrade to this new release, use the tokumx-enterprise package (see the example after this list). Packages are available from the Percona Software Repositories, which means you’ll need to replace Tokutek’s S3 repository with Percona’s in order to see the new packages.
  • Packages for Fedora and Arch Linux are not available anymore.
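
Once the Percona repository is configured, the install or upgrade is just a package operation, for example (a sketch; exact steps may vary per distribution):

# Debian/Ubuntu:
sudo apt-get update && sudo apt-get install tokumx-enterprise

# RHEL/CentOS:
sudo yum install tokumx-enterprise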

Bugs Fixed

  • Fixed long replica node startup times: the entire oplog is no longer read during startup; it is now read from the point of the last successfully replicated GTID. Git commit #1d5afda.
  • Fixed rare multiple primary issue: In certain scenarios, replicas would attempt to declare themselves as primary when another node had already been declared as primary. A node now performs an extra check against its role during remote primary notification. Git commit #5909737.


Aug
21
2015
--

Find unused indexes on MongoDB and TokuMX

Finding and removing unused indexes is a pretty common technique to improve the overall performance of relational databases. Fewer indexes mean faster inserts and updates, but also less disk space used. The usual way to do it is to log all queries’ execution plans and then extract the list of indexes that are never used. The same theory applies to MongoDB and TokuMX, so in this blog post I’m going to explain how to find them.

Profiling in MongoDB

To understand what profiling is, you only need to think of MySQL’s slow query log: it is basically the same idea. It can be enabled with the following command:

db.setProfilingLevel(level, slowms)

There are three different levels:

0: No profiling enabled.
1: Only those queries slower than “slowms” are profiled.
2: All queries are profiled, similar to long_query_time=0 in MySQL.
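
For example, to capture everything slower than 100 milliseconds on the current database, and to check what is currently in effect:

db.setProfilingLevel(1, 100)   // level 1, slowms = 100
db.getProfilingStatus()        // returns something like { "was" : 1, "slowms" : 100 }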

Once it is enabled you can use db.system.profile.find().pretty() to read the profile data. You would need to scan through all the profiles and find those indexes that are never used. To make things easier, there is a JavaScript program that finds the unused indexes after reading all the profile information. Unfortunately, it only works with MongoDB 2.x.

The JavaScript is hosted in this GitHub project: https://github.com/wfreeman/indexalizer. You just need to start the mongo shell with indexStats.js loaded and run the db.indexStats() command. This is a sample output:

scanning profile {ns:"test.col"} with 2 records... this could take a while.
{
	"query" : {
		"b" : 1
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
{
	"query" : {
		"b" : 2
	},
	"count" : 1,
	"index" : "",
	"cursor" : "BtreeCursor b_1",
	"millis" : 0,
	"nscanned" : 1,
	"n" : 1,
	"scanAndOrder" : false
}
checking for unused indexes in: col
this index is not being used:
"_id_"
this index is not being used:
"a_1"


So “a_1” is not used and could be dropped. We can ignore “_id_” because that one is needed :)
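
Once you are confident an index is really unused (keeping in mind the profiling caveat in the next paragraph), dropping it is a one-liner in the mongo shell:

db.col.dropIndex("a_1")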

There is a problem with profiling: it affects performance, so you should run it only for a few hours, and usually during off-peak times. That means there is a chance that not all the queries your application can issue will actually be executed during that maintenance window. What alternative does TokuMX provide?

Finding unused indexes in TokuMX

Good news for all of us: TokuMX doesn’t require you to enable profiling. Index usage statistics are stored as part of every query execution, and you can access them with a simple db.collection.stats() command. Let me show you an example:

> db.col.stats()
[...]
{
"name" : "a_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 0,
"nscanned" : 0,
"nscannedObjects" : 0,
"inserts" : 0,
"deletes" : 0
},
{
"name" : "b_1",
"count" : 5,
"size" : 140,
"avgObjSize" : 28,
"storageSize" : 16896,
"pageSize" : 4194304,
"readPageSize" : 65536,
"fanout" : 16,
"compression" : "zlib",
"queries" : 2,
"nscanned" : 2,
"nscannedObjects" : 2,
"inserts" : 0,
"deletes" : 0
}
],
"ok" : 1
}


There are our statistics, without profiling enabled. “queries” means the number of times the index has been used by a query execution. b_1 has been used twice, while a_1 has never been used. You can use this small piece of JavaScript I’ve written to scan all collections inside the current database:

// For each collection in the current database, fetch its collection
// stats and report every index whose "queries" counter is still zero.
db.forEachCollectionName(function (cname) {
	var output = db.runCommand({ collstats: cname });
	print("Checking " + output.ns + "...");
	output.indexDetails.forEach(function (idx) {
		if (idx.queries == 0) { print("Unused index: " + idx.name); }
	});
});


An example using the same data:

> db.forEachCollectionName(function (cname) {
... output = db.runCommand({collstats : cname });
... print("Checking " + output.ns + "...")
... output.indexDetails.forEach(function(findUnused) { if (findUnused.queries == 0) { print( "Unused index: " + findUnused.name ); }})
...
... });
Checking test.system.indexes...
Checking test.col...
Unused index: a_1


Conclusion

Finding unused indexes is a regular task that every DBA should do. In MongoDB you have to use profiling, while in TokuMX nothing needs to be enabled, because the information is gathered by default, without impacting service performance.


Aug
13
2015
--

The language of compression


Leif Walsh will talk about the language of compression at Percona Live Amsterdam

Storage. Everyone needs it. Whether your data is in MySQL, a NoSQL, or somewhere in the cloud, with ever-growing data volumes – along with the need for SSDs to cut latency and replication to provide insurance – an organization’s storage footprint is an important place to look for savings. That’s where compression comes in (squeeze!) to save disk space.

Two Sigma software engineer Leif Walsh speaks the language of compression. Fluently. In fact, he’ll be speaking on that exact subject September 22 during the Percona Live conference in Amsterdam.

I asked him about his talk, and about Amsterdam, the other day. Here’s what he had to say.

* * *

Tom: Hi Leif, how will your talk help IT decision-makers cut through the marketing mumbo-jumbo on what’s important to focus on and what is not?
Leif: My talk will have three lessons aimed at those making storage decisions for their company:

  1. What are the key factors to consider when evaluating storage options, and how can they affect your bottom line?  This is not only how storage tech influences your hardware, operations, and management costs, but also how it can facilitate new development initiatives and cut time-to-market for your products.
  2. How should you read benchmarks and marketing materials about storage technology?  You’ll learn what to look for in promotional material, and how to think critically about whether that material is applicable to your business needs.
  3. What’s the most effective way to communicate with storage vendors about your application’s requirements?  A lot of time can be spent in the early stages of a relationship in finding a common language for users and vendors to have meaningful discussions about users’ needs and vendors’ capacity to meet those needs.  With the tools you’ll learn in my talk, you’ll be able to accelerate quickly to the high-bandwidth conversations you need to have in order to make the right decision, and consequently, you’ll be empowered to evaluate more choices to find the best one faster.

Tom: In addition to IT decision-makers, who else should attend your session and what will they take away afterward?
Leif: My talk is primarily about the language that everyone in the storage community should be using to communicate. Therefore, storage vendors should attend to get ideas for how to express their benchmarks and their system’s properties more effectively, and application developers and operations people will learn strategies for getting better support and for making a convincing case to the decision makers in their own company.

Tom: Which session(s) are you most looking forward to besides your own?
Leif: Sam Kottler is a good friend and an intensely experienced systems engineer with a dynamic and boisterous personality, so I can’t wait to hear more about his experiences with Linux tuning.

As one of the original developers of TokuMX, I’ll absolutely have to check out Stephane’s talk about it, but I promise not to heckle. Charity Majors is always hilarious and has great experiences and insights to share, so I’ll definitely check out her talk too.

* * *

Catch Leif’s talk at Percona Live in Amsterdam September 21-23. Enter the promo code “BlogInterview” at registration and save €20! Register now!


Aug
03
2015
--

Checkpoint strikes back

In my recent benchmarks for MongoDB, we can see that the two engines WiredTiger and TokuMX suffer from periodic drops in throughput, which are clearly related to the checkpoint interval – and therefore I attribute them to checkpoint activity.

The funny thing is that I thought we had solved checkpointing issues in InnoDB once and for all. There are a bunch of posts on this issue in InnoDB, dated some 4 years ago. We did a lot of research back then while working on a fix for Percona Server.

But, like Baron said, “History Repeats“… and it seems it also repeats in technical issues.

So, let’s take a look at what checkpointing is, and why it is a problem for storage engines.

As you may know, a transactional engine writes transactional logs (you may see names like “redo logs”, “write-ahead logs” or WAL, etc.) to be able to perform a crash recovery in the case of a database crash or server power outage. To keep the recovery time under control (we all expect the database to start quickly), the engine has to limit how many changes sit in the logs. For this, a transactional engine performs a “checkpoint,” which basically synchronizes changes in memory with the corresponding changes in the logs – so old log records can be deleted. Often this results in writing changed database pages from memory to permanent storage.

InnoDB takes the approach of limiting the size of the log files (in total it equals innodb-log-file-size * innodb-log-files-in-group), which I call “size limited checkpointing”.
Both TokuDB and WiredTiger limit changes by time periods (by default 60 sec), which means the log files may grow unbounded within a given time interval (I call this “time limited checkpointing”).
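
For illustration, with hypothetical values, the total redo space InnoDB can use before checkpoint pressure kicks in would be set like this in my.cnf:

[mysqld]
innodb_log_file_size      = 2G   # hypothetical value; size to your write rate
innodb_log_files_in_group = 2    # total redo capacity: 2 * 2G = 4G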

Another difference is that InnoDB takes a “fuzzy checkpointing” approach (which was not really “fuzzy” until we fixed it): instead of waiting until it reaches the size limit, InnoDB performs checkpointing all the time, and more intensively the closer it gets to the log size limit. This allows it to achieve more or less smooth throughput, without significant drops.

Unlike InnoDB, both TokuDB (of that I am sure) and WiredTiger (here I speculate, as I have not looked into WiredTiger internals) wait until the last moment and perform the checkpoint strictly at the prescribed interval. If it happens that the database holds many changes in memory, the result is a performance stall. The effect is close to “hitting a wall at full speed”: user queries get locked until the engine writes out all the changes in memory it has to write.

Interestingly enough, RocksDB, because it has a different architecture (I may write about it in the future, but for now I will point to the RocksDB Wiki), does not have this problem with checkpoints (it does, however, have its own background activity, like level compactions and tombstone maintenance, but that is a different topic).

I do not know how WiredTiger is going to approach this issue with checkpoints, but we are looking to improve TokuDB to make this less painful for user queries – and eventually to move to a “fuzzy checkpointing” model.


Jul
24
2015
--

InnoDB vs TokuDB in LinkBench benchmark

Previously I tested Tokutek’s Fractal Trees (TokuMX & TokuMXse) as MongoDB storage engines – today let’s look into the MySQL area.

I am going to use a modified LinkBench under a heavy IO load.

I compared InnoDB without compression, InnoDB with 8k compression, and TokuDB with quicklz compression.
The uncompressed data size is 115GiB; the cache size is 12GiB for InnoDB and 8GiB + 4GiB of OS cache for TokuDB.

It is important to note that I used tokudb_fanout=128, which is only available in our latest Percona Server release.
I will write more on Fractal Tree internals and what tokudb_fanout means later. For now let’s just say it changes the shape of the fractal tree (compared to the default tokudb_fanout=16).
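
In my.cnf terms, the option is set like this (a sketch, assuming a Percona Server build with TokuDB enabled):

[mysqld]
tokudb_fanout = 128   # default is 16; a larger fanout changes the tree shape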

I am using two storage options:

  • Intel P3600 PCIe SSD 1.6TB (marked as “i3600” on charts) – as a high end performance option
  • Crucial M500 SATA SSD 900GB (marked as “M500” on charts) – as a low end SATA SSD

The full results and engine options are available here.

Results on Crucial M500 (throughput, more is better)


    Engine Throughput [ADD_LINK/10sec]

  • InnoDB: 6029
  • InnoDB 8K: 6911
  • TokuDB: 14633

Here TokuDB outperforms InnoDB by almost two times, but it also shows great variance in the results, which I attribute to checkpoint activity.

Results on Intel P3600 (throughput, more is better)


    Engine Throughput [ADD_LINK/10sec]
  • InnoDB: 27739
  • InnoDB 8K: 9853
  • TokuDB: 20594

To understand why InnoDB shines on fast storage, let’s review the IO usage of all engines.
The following chart shows the reads in KiB that each engine performs, on average, per client request.

IO Reads

The following chart shows the writes in KiB that each engine performs, on average, per client request.

IO Writes

Here we can make the interesting observation that TokuDB on average performs two times fewer writes than InnoDB, and this is what allows TokuDB to be better on slow storage. On fast storage, where many writes carry no performance penalty, InnoDB is able to get ahead, as InnoDB is still better at using CPUs.

Though it is worth remembering that:

  • On fast, expensive storage, TokuDB provides better compression, which allows you to store more data in a limited capacity
  • TokuDB still writes two times less than InnoDB, which means twice the lifetime for an SSD (still expensive)

Also, looking at the results, I can draw the conclusion that InnoDB compression is inefficient in its implementation: it is not able to get benefits, first, from doing fewer reads (well, it helps it do better than uncompressed InnoDB, but not by much) and, second, from fast storage.


Jul
21
2015
--

Percona now offering 24/7 support for MongoDB and TokuMX

Today Percona announced the immediate availability of 24/7, enterprise-class support for MongoDB and TokuMX. The new support service helps organizations achieve maximum application performance without database bloat. Customers have round-the-clock access (365 days a year) to the most trusted team of database experts in the open source community.

The news means that Percona now offers support across the entire open-source database ecosystem, including the full LAMP stack (Linux, Apache, MySQL, and PHP/Python/Perl), providing a single, expert, proven service provider for companies to turn to in good times (always best to be proactive) – and during emergencies, too.

Today’s support announcement follows Percona’s acquisition of Tokutek, which included the Tokutek distribution of MongoDB – making Percona the first vendor to offer both MySQL and MongoDB software and solutions.

Like Percona’s other support services, support for MongoDB and TokuMX enables organizations to talk directly with Percona’s support experts at any time, day or night.

The Percona Support team is always ready to help resolve database and server instability, initiate data recovery, optimize performance, deal with response and outage issues – and ensure proactive system monitoring and alert responses. Percona also offers support across on-premises, cloud, and hybrid deployments.


Jul
20
2015
--

Fractal Tree library as a Key-Value store

As you may know, Tokutek is now part of Percona, and I would like to explain some internals of TokuDB and TokuMX – what performance benefits they bring, along with further optimizations we are working on.

However, before going into deep details, I feel it is necessary to explain the fundamentals of a Key-Value store, and how a Fractal Tree handles it.

Before that, allow me to say that I have heard opinions that the “Fractal Tree” name does not reflect the internal structure and looks more like a marketing term than a technical one. I will not go into this discussion and will keep using the name “Fractal Tree” out of respect for the inventors. I think they are in a position to give their invention any name they want.

So, with that said, the Fractal Tree library implements a new data structure for more efficient handling of a Key-Value store (with the main focus on insertion, but more on this later).

You may question how Key-Value relates to SQL transactional databases, as it comes more from the NoSQL world. This is partially true, but the Fractal Tree Key-Value library is successfully used in the Percona TokuMX (based on MongoDB 2.4) and Percona TokuMXse (a storage engine for MongoDB 3.0) products.

But if we look at a Key-Value store in general, it may actually be a good fit for structured databases as well. To explain this, let’s take a look at the Key-Value details.

So what is a Key-Value data structure?

We will use the notation (k,v), or key=>val, which basically means we associate some value “v” with a key “k”. For software developers, the following analogies may be familiar: key-value access is implemented as a dictionary in Python, an associative array in PHP, or a map in C++.
(More details in Wikipedia.)

I will define key-value structure as a list of pairs (k,v).

It is important to note that neither key nor value needs to be just a scalar (a single value); both can be compound.
That is, "k1, k2, k3 => v1, v2", which we can read as “give me two values by a 3-part key”.

This brings us closer to a database table structure.
If we apply the additional requirement that all keys (k) in the list (k,v) must be unique, this represents
a PRIMARY KEY for a traditional database table.
To understand this better, let’s take a look at the following table:
CREATE TABLE metrics (
ts timestamp,
device_id int,
metric_id int,
cnt int,
val double,
PRIMARY KEY (ts, device_id, metric_id),
KEY metric_id (metric_id, ts),
KEY device_id (device_id, ts)
)

We can state that the Key-Value structure (ts, device_id, metric_id => cnt, val), with the requirement
that "ts, device_id, metric_id" be unique, represents the PRIMARY KEY for this table; in fact, this is how InnoDB (and TokuDB, for that matter) stores data internally.

Secondary indexes can also be represented in the Key=>Value notation, for example, as used in TokuDB and InnoDB:
(secondary_index_key => primary_key), where the key of the secondary index points to a primary key (so later we can get the values by looking up the primary key). Please note that the secondary_index_key may not be unique (unless we add a UNIQUE constraint to the secondary index).

Or, if we take our table again, the secondary keys are defined as
(metric_id, ts => ts, device_id, metric_id)
and
(device_id, ts => ts, device_id, metric_id)

It is expected that a Key-Value store supports basic data manipulation and extraction operations, such as:

  • Add or Insert: add a (key => value) pair to the collection
  • Update: change (key => value1) to (key => value2), that is, update the "value" assigned to "key"
  • Delete: remove(key), that is, delete the (key => value) pair from the collection
  • Lookup (select): return the "value" assigned to "key"

and I want to add a fifth operation:

  • Range lookup: return all values for keys defined by a range, such as "key > 5" or "key >= 10 and key < 15"

The way software implements the internal structure of a Key-Value store defines the performance of the mentioned operations, especially once the data size of the store grows beyond memory capacity.

For decades, the most popular data structure to represent a Key-Value store on disk has been the B-Tree, and with good reason. I won’t go into B-Tree details (see for example https://en.wikipedia.org/wiki/B-tree), but it provides probably the best possible time for Lookup operations. However, it has challenges when it comes to Insert operations.

And this is the area where newcomers like the Fractal Tree and the LSM-tree (https://en.wikipedia.org/wiki/Log-structured_merge-tree) propose structures that provide better performance for Insert operations (often at the expense of Lookup/Select operations, which may become slower).

To get familiar with the LSM-tree (the structure used by RocksDB), I recommend http://www.benstopford.com/2015/02/14/log-structured-merge-trees/. As for the Fractal Tree, I am going to cover its details in following posts.


Jul
14
2015
--

MongoDB benchmark: sysbench-mongodb IO-bound workload comparison

In this post I’ll share the results of a sysbench-mongodb benchmark I performed on my server. I compared MMAP, WiredTiger, RocksDB, TokuMXse (based on MongoDB 3.0) and TokuMX (based on MongoDB 2.4) in an IO-intensive workload.

The full results are available here, and below I’ll just share the summary chart:

MongoDB benchmarks: sysbench-mongodb IO-bound workload

I would like to highlight that this benchmark was designed to emulate a heavy IO load on a (relatively) slow IO subsystem. This use case, I believe, is totally valid and represents frequently used “cloud” setups with limited memory and slow IO.

The WiredTiger engine, being B-Tree based, is expected to perform worse compared to RocksDB and Toku Fractal Trees, which are designed to handle IO-intensive workloads. My assumption is that WiredTiger will perform better (or even outperform the others) in CPU-intensive, in-memory workloads (see for example Mark Callaghan’s results). WiredTiger is also expected to perform better with faster storage.


Jul
01
2015
--

Using Cgroups to Limit MySQL and MongoDB memory usage

Quite often, especially for benchmarks, I am trying to limit the available memory for a database server (usually for MySQL, but recently for MongoDB as well). This is usually needed to test database performance in scenarios with different memory limits. I have physical servers with a rather high amount of memory (128GB or more), but I am interested in seeing how a database server will perform with, say, only 16GB of memory available.

And while InnoDB usually respects the innodb_buffer_pool_size setting in O_DIRECT mode (the OS cache is not used in this case), other engines (TokuDB for MySQL; MMAP, WiredTiger and RocksDB for MongoDB) usually benefit from the OS cache, and the Linux kernel by default is generous enough to allocate as much memory as is available. I should note that while TokuDB (and TokuMX for MongoDB) supports a DIRECT mode (that is, bypassing the OS cache), we found there is a performance gain if the OS cache is used for compressed pages.

Well, an obvious recommendation on how to restrict available memory would be to use a virtual machine, but I do not like this option because virtualization does not come cheap: usually there are both CPU and IO penalties.

Other popular options I hear are:

  • to use the "mem=" option on the kernel boot line. Besides the fact that it requires a server reboot (so you can’t really script this and leave it to iterate automatically through different memory options), I also suspect it does not work well in a multi-node NUMA environment – it seems the kernel limits memory only on some nodes, not proportionally across all of them
  • to use an auxiliary program that allocates as much memory as you want to make unavailable and executes an mlock call. This option may work, but I again have the impression that the Linux kernel does not always make good choices when there is a huge amount of locked memory that it can’t move around. For example, I saw that in this case Linux starts swapping (instead of decreasing cached pages) even if vm.swappiness is set to 0.

Another option, on the rising wave of Docker and containers (like LXC), is, well, to use Docker or another container… put a database server inside a container and limit resources that way. This, in fact, should work, but if you are as lazy as I am and do not want to deal with containers, we can just use cgroups (https://en.wikipedia.org/wiki/Cgroups), which in fact are used extensively by the aforementioned Docker and LXC.

Using cgroups, our task can be accomplished in a few easy steps.

1. Create the control group: cgcreate -g memory:DBLimitedGroup (make sure the cgroups binaries are installed on your system; consult your favorite Linux distribution manual for how to do that)
2. Specify how much memory will be available for this group:

echo 16G > /sys/fs/cgroup/memory/DBLimitedGroup/memory.limit_in_bytes

This command limits memory to 16G (a nice side effect is that this limits both malloc allocations and the OS cache)
3. Now, it is a good idea to drop the pages that already sit in the cache:

sync; echo 3 > /proc/sys/vm/drop_caches

4. And finally, assign a server to the created control group:

cgclassify -g memory:DBLimitedGroup `pidof mongod`

This will assign the running mongod process to the group, limiting it to only 16GB of memory.
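
To double-check that the limit is in effect, you can read the memory controller’s counters directly (cgroup v1 paths, matching the ones used above):

cat /sys/fs/cgroup/memory/DBLimitedGroup/memory.limit_in_bytes   # should print 17179869184 (16G)
cat /sys/fs/cgroup/memory/DBLimitedGroup/memory.usage_in_bytes   # current malloc + OS cache usage
cat /sys/fs/cgroup/memory/DBLimitedGroup/tasks                   # PIDs assigned to the group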

With this, our task is accomplished… but there is one more thing to keep in mind.

These are the dirty pages in the OS cache. As long as we rely on the OS cache, Linux controls writing from the OS cache to disk with two variables:
/proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio.

These variables set the percentage of memory that the Linux kernel takes as the threshold for flushing dirty pages.

Let’s talk about them a little more. In simple terms:
/proc/sys/vm/dirty_background_ratio, which by default is 10 on my Ubuntu, means that the Linux kernel will start background flushing of dirty pages from the OS cache when the amount of dirty pages reaches 10% of available memory.

/proc/sys/vm/dirty_ratio, which by default is 20 on my Ubuntu, means that the Linux kernel will start foreground flushing of dirty pages from the OS cache when the amount of dirty pages reaches 20% of available memory. Foreground means that user threads executing IO might be blocked… and this is what causes IO stalls for users (which we want to avoid at all costs).

Why is this important to keep in mind? Consider 20% of 256GB (which is what I have on my servers): that is 51.2GB, which a database can make dirty VERY fast in a write-intensive workload. If the server happens to have slow storage (HDD RAID or a slow SATA SSD), it may take the Linux kernel a long time to flush all those pages, stalling the user’s IO activity in the meantime.

So it is worth considering a change to these values (or to the corresponding /proc/sys/vm/dirty_background_bytes and /proc/sys/vm/dirty_bytes if you prefer to operate in bytes rather than percentages).
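
For example, to move to absolute thresholds (the values here are purely illustrative; tune them to your storage speed):

echo 1073741824 > /proc/sys/vm/dirty_background_bytes   # start background flushing at 1GB of dirty pages
echo 4294967296 > /proc/sys/vm/dirty_bytes              # block writers at 4GB of dirty pages

Note that setting the *_bytes variants automatically disables the corresponding *_ratio ones.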

Again, this was not important for our traditional usage of InnoDB in O_DIRECT mode, which is why we did not pay much attention to Linux OS cache tuning before; but as soon as we start to rely on the OS cache, this is something to keep in mind.

Finally, it’s worth remembering that dirty_bytes and dirty_background_bytes relate to ALL memory and are not controlled by cgroups. This also applies to containers: if you are running several Docker or LXC containers on the same box, dirty pages across ALL of them are controlled globally by a single pair of dirty_bytes and dirty_background_bytes.

This may change in future Linux kernels, as I have seen patches to apply dirty_bytes and dirty_background_bytes to cgroups, but it is not available in current kernels.

