
Using Cgroups to Limit MySQL and MongoDB memory usage

Quite often, especially for benchmarks, I try to limit the memory available to a database server (usually MySQL, but recently MongoDB as well). This is usually needed to test database performance in scenarios with different memory limits. I have physical servers with the usual large amount of memory (128GB or more), but I am interested in seeing how a database server will perform if, say, only 16GB of memory is available.

And while InnoDB usually respects the innodb_buffer_pool_size setting in O_DIRECT mode (the OS cache is not used in this case), other engines (TokuDB for MySQL; MMAP, WiredTiger, and RocksDB for MongoDB) usually benefit from the OS cache, and the Linux kernel by default is generous enough to allocate as much memory as is available. Here I should note that while TokuDB (and TokuMX for MongoDB) supports DIRECT mode (that is, bypassing the OS cache), we found there is a performance gain if the OS cache is used for compressed pages.

Well, an obvious recommendation on how to restrict available memory would be to use a virtual machine, but I do not like this because virtualization does not come cheap and usually there are both CPU and IO penalties.

Other popular options I hear are:

  • to use "mem=" option in a kernel boot line. Despite the fact that it requires a server reboot by itself (so you can’t really script this and leave for automatic iterations through different memory options), I also suspect it does not work well in a multi-node NUMA environment – it seems that a kernel limits memory only from some nodes and not from all proportionally
  • use an auxiliary program that allocates as much memory as you want to make unavailable and execute mlock call. This option may work, but I again have an impression that the Linux kernel does not always make good choices when there is a huge amount of locked memory that it can’t move around. For example, I saw that in this case Linux starts swapping (instead of decreasing cached pages) even if vm.swappiness is set to 0.

Another option, on the rising wave of Docker and containers (like LXC), is, well, to use Docker or another container: put a database server inside a container and limit resources this way. This, in fact, should work, but if you are as lazy as I am and do not want to deal with containers, you can just use Cgroups (https://en.wikipedia.org/wiki/Cgroups), which in fact are extensively used by the aforementioned Docker and LXC.

Using cgroups, our task can be accomplished in a few easy steps.

1. Create a control group: cgcreate -g memory:DBLimitedGroup (make sure that the cgroups binaries are installed on your system; consult your favorite Linux distribution manual for how to do that)
2. Specify how much memory will be available for this group:
echo 16G > /sys/fs/cgroup/memory/DBLimitedGroup/memory.limit_in_bytes
This command limits memory to 16G (the good thing is that this limits the memory for both malloc allocations and OS cache)
3. Now, it is a good idea to drop pages already sitting in the cache:
sync; echo 3 > /proc/sys/vm/drop_caches
4. And finally, assign a server to the created control group:

cgclassify -g memory:DBLimitedGroup `pidof mongod`

This will assign a running mongod process to a group limited to only 16GB of memory.
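The four steps above can be collected into one small script. This is only a sketch: the function names limit_db_memory and run and the DRY_RUN switch are my own and not part of the cgroups tooling, and it assumes the cgroup v1 memory controller mounted at /sys/fs/cgroup/memory, as in the commands above. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: wraps the four steps above in one function.
# Assumptions: cgroup v1 memory controller at /sys/fs/cgroup/memory,
# cgcreate/cgclassify installed, and root privileges when DRY_RUN=0.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# run CMD: print CMD in dry-run mode, otherwise execute it through sh -c
# so that redirections and $(...) inside the string are honoured.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

# limit_db_memory GROUP LIMIT PROCESS_NAME (hypothetical helper)
limit_db_memory() {
    group="$1"; limit="$2"; proc="$3"
    run "cgcreate -g memory:$group"
    run "echo $limit > /sys/fs/cgroup/memory/$group/memory.limit_in_bytes"
    run "sync; echo 3 > /proc/sys/vm/drop_caches"
    run "cgclassify -g memory:$group \$(pidof $proc)"
}

limit_db_memory DBLimitedGroup 16G mongod
```

Running it with DRY_RUN=0 as root would actually execute the commands, which makes it easy to loop over several memory limits in an automated benchmark.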

With this, our task is accomplished… but there is one more thing to keep in mind.

That is dirty pages in the OS cache. As long as we rely on the OS cache, Linux will control writing from the OS cache to disk by two variables:
/proc/sys/vm/dirty_background_ratio and /proc/sys/vm/dirty_ratio.

These variables are percentages of memory that the Linux kernel uses as thresholds for flushing dirty pages.

Let’s talk about them a little more. In simple terms:
/proc/sys/vm/dirty_background_ratio, which by default is 10 on my Ubuntu, means that the Linux kernel will start background flushing of dirty pages from the OS cache when the amount of dirty pages reaches 10% of available memory.

/proc/sys/vm/dirty_ratio, which by default is 20 on my Ubuntu, means that the Linux kernel will start foreground flushing of dirty pages from the OS cache when the amount of dirty pages reaches 20% of available memory. Foreground means that user threads executing IO might be blocked… and this is what will cause IO stalls for a user (which we want to avoid at all costs).
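To see where a given box currently stands, something like the following sketch can be used. The helper name dirty_kb is my own; it just extracts the "Dirty" line from /proc/meminfo-style input. The block prints nothing on systems without these /proc files.

```shell
#!/bin/sh
# dirty_kb (hypothetical helper): read /proc/meminfo-style text on stdin
# and print the value of the "Dirty:" line, which is reported in kB.
dirty_kb() {
    awk '$1 == "Dirty:" { print $2 }'
}

# Show the current thresholds and the amount of dirty memory (Linux only).
if [ -r /proc/sys/vm/dirty_ratio ] && [ -r /proc/meminfo ]; then
    echo "dirty_background_ratio: $(cat /proc/sys/vm/dirty_background_ratio)"
    echo "dirty_ratio:            $(cat /proc/sys/vm/dirty_ratio)"
    echo "dirty memory (kB):      $(dirty_kb < /proc/meminfo)"
fi
```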

Why is this important to keep in mind? Let’s consider 20% of 256GB (this is what I have on my servers): this is 51.2GB, which a database can make dirty VERY fast in a write-intensive workload. If the server happens to have slow storage (HDD RAID or a slow SATA SSD), it may take a long time for the Linux kernel to flush all these pages, stalling the user’s IO activity in the meantime.

So it is worth considering changing these values (or the corresponding /proc/sys/vm/dirty_background_bytes and /proc/sys/vm/dirty_bytes if you prefer to operate in bytes rather than percentages).
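As a rough illustration of the arithmetic, the sketch below converts a ratio into bytes and prints sysctl commands for byte-based limits. The helper name ratio_to_bytes is my own, and the 256MB and 1GB values are placeholders chosen purely for illustration, not recommendations; the right numbers depend on how fast your storage can flush. Note that the kernel treats the ratio and bytes forms as alternatives: setting one of the *_bytes values takes the place of the corresponding *_ratio.

```shell
#!/bin/sh
# ratio_to_bytes TOTAL_GB PERCENT (hypothetical helper):
# number of bytes that PERCENT of TOTAL_GB (counted in GiB) represents.
ratio_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 * $2 / 100 ))
}

# What the default ratios mean on a 256GB server.
echo "dirty_background_ratio=10 -> $(ratio_to_bytes 256 10) bytes"
echo "dirty_ratio=20            -> $(ratio_to_bytes 256 20) bytes"

# Placeholder absolute limits (assumed values for illustration only).
bg_bytes=$(( 256 * 1024 * 1024 ))    # 256MB before background flushing
fg_bytes=$(( 1024 * 1024 * 1024 ))   # 1GB before writers get throttled

# Printed rather than executed; run them as root to apply.
echo "sysctl -w vm.dirty_background_bytes=$bg_bytes"
echo "sysctl -w vm.dirty_bytes=$fg_bytes"
```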

Again, it was not important for our traditional usage of InnoDB in O_DIRECT mode, that’s why we did not pay much attention before to Linux OS cache tuning, but as soon as we start to rely on OS cache, this is something to keep in mind.

Finally, it’s worth remembering that dirty_bytes and dirty_background_bytes apply to ALL memory and are not controlled by cgroups. This also applies to containers: if you are running several Docker or LXC containers on the same box, dirty pages among ALL of them are controlled globally by a single pair of dirty_bytes and dirty_background_bytes.

This may change in future Linux kernels, as I saw patches to apply dirty_bytes and dirty_background_bytes to cgroups, but it is not available in current kernels.

Is 80% of RAM how you should tune your innodb_buffer_pool_size?

It seems these days if anyone knows anything about tuning InnoDB, it’s that you MUST tune your innodb_buffer_pool_size to 80% of your physical memory. This is such pervasive tuning advice that it seems ingrained in many a DBA’s mind.  The MySQL manual to this day refers to this rule, so who can blame the DBA?  The question is: does it make sense?

What uses the memory on your server?

Before we question such advice, let’s consider what can take up RAM in a typical MySQL server, in broad categories.  This list isn’t necessarily complete, but I think it outlines the large areas where a MySQL server could consume memory.

  • OS Usage: Kernel, running processes, filesystem cache, etc.
  • MySQL fixed usage: query cache, InnoDB buffer pool size, mysqld rss, etc.
  • MySQL workload based usage: connections, per-query buffers (join buffer, sort buffer, etc.)
  • MySQL replication usage:  binary log cache, replication connections, Galera gcache and cert index, etc.
  • Any other services on the same server: Web server, caching server, cronjobs, etc.

There’s no question that for tuning InnoDB, the innodb_buffer_pool_size is the most important variable.  It’s expected to occupy most of the RAM on a dedicated MySQL/InnoDB server, but of course other local services may affect how it is tuned.  If it (together with other memory consumption on the server) is too large, swapping can kick in and degrade your performance rapidly.

Further, the workload of the MySQL server itself may cause a lot of variation.  Does the server have a lot of open connections and active query workload consuming memory?  The memory consumption caused by this can be dramatically different server to server.

Finally, replication mechanisms like Galera have their own memory usage pattern and can require some adjustments to your buffer pool.

We can see clearly that the 80% rule isn’t as nuanced as reality.

A rule of thumb

However, for the sake of argument, let’s say the 80% rule is a starting point.  A rule of thumb to help us get a quick tuning number to get the server running.  Assuming we don’t know anything really about the workload on the system yet, but we know that the system is dedicated to InnoDB, how might our 80% rule play out?

Total Server RAM    Buffer pool with 80% rule    Remaining RAM
1G                  800MB                        200MB
16G                 13G                          3G
32G                 26G                          6G
64G                 51G                          13G
128G                102G                         26G
256G                205G                         51G
512G                409G                         103G
1024G               819G                         205G
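The table is simple arithmetic, and a sketch like the one below reproduces it for any RAM size. The helper name pool_80 is my own, and it rounds to the nearest GB, so individual rows can differ by 1G from the table above depending on how each value was rounded.

```shell
#!/bin/sh
# pool_80 TOTAL_GB (hypothetical helper):
# 80% of TOTAL_GB, rounded to the nearest whole GB using integer math.
pool_80() {
    echo $(( ($1 * 80 + 50) / 100 ))
}

printf '%-10s %-12s %s\n' "Total RAM" "Buffer pool" "Remaining"
for total in 16 32 64 128 256 512 1024; do
    pool=$(pool_80 "$total")
    printf '%-10s %-12s %s\n' "${total}G" "${pool}G" "$(( total - pool ))G"
done
```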

At lower numbers, our 80% rule looks pretty reasonable.  However, as we get into large servers, it starts to seem less sane.  For the rule to hold true, it must mean that workload memory consumption increases in proportion to needed size of the buffer pool, but that usually isn’t the case.  Our server that has 1TB of RAM likely doesn’t need 205G of that to handle things like connections and queries (likely MySQL couldn’t handle that many active connections and queries anyway).

So, if you really just spent all that money on a beefy server, do you really want to pay a 20% tax on that resource because of this rule of thumb?

The origins of the rule

At one of my first MySQL conferences, probably around 2006-2007 when I worked at Yahoo, I attended an InnoDB tuning talk hosted by Heikki Tuuri (the original author of InnoDB) and Peter Zaitsev.  I distinctly remember asking about the 80% rule because at the time Yahoo had some beefy 64G servers and the rule wasn’t sitting right with me.

Heikki’s answer stuck with me.  He said something to the effect of (not a direct quote): “Well, the server I was testing on had 1GB of RAM and 80% seemed about right”.  He then, if memory serves, clarified it and said it would not apply similarly to larger servers.

How should you tune?

80% is maybe a great start and rule of thumb.  You do want to be sure the server has plenty of free RAM for the OS and the usually unknown workload.  However, as we can see above, the larger the server, the more likely the rule will wind up wasting RAM.   I think for most people it starts and ends at the rule of thumb, mostly because changing the InnoDB buffer pool requires a restart in current releases.

So what’s a better rule of thumb?  My rule is that you tune the innodb_buffer_pool_size as large as possible without using swap when the system is running the production workload.  This sounds good in principle, but again, it requires a bunch of restarts and may be easier said than done.
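One rough way to check the "without using swap" part is to look at how much swap is in use while the production workload is running. The sketch below uses a helper of my own naming, swap_used_kb, which parses /proc/meminfo-style input. Keep in mind that swap in use is not the same thing as ongoing swap activity; for the latter, the si and so columns of vmstat 1 are a better indicator.

```shell
#!/bin/sh
# swap_used_kb (hypothetical helper): read /proc/meminfo-style text on stdin
# and print SwapTotal minus SwapFree, in kB.
swap_used_kb() {
    awk '$1 == "SwapTotal:" { t = $2 }
         $1 == "SwapFree:"  { f = $2 }
         END { print t - f }'
}

# Report swap in use on this machine (Linux only).
if [ -r /proc/meminfo ]; then
    echo "swap in use (kB): $(swap_used_kb < /proc/meminfo)"
fi
```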

Fortunately, MySQL 5.7 and its online buffer pool resize feature should make this an easier principle to follow.  Seeing lots of free RAM (and/or filesystem cache usage)?  Turn the buffer pool up dynamically.  Seeing some swap activity?  Just turn it down with no restart required.   In practice, I suspect there will be some performance-related hiccups when using this feature, but it is at least a big step in the right direction.
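In MySQL 5.7 the resize is done with an ordinary SET GLOBAL statement. The sketch below only builds and prints that statement; the helper name resize_sql and the 24GB figure are my own illustration, and it assumes you would feed the result to a local mysql client with sufficient privileges. The server may round the requested size to a multiple of its buffer pool chunk size.

```shell
#!/bin/sh
# resize_sql SIZE_GB (hypothetical helper):
# print the SQL that sets the buffer pool to SIZE_GB, counted in GiB.
resize_sql() {
    echo "SET GLOBAL innodb_buffer_pool_size = $(( $1 * 1024 * 1024 * 1024 ));"
}

# Print the statement for an illustrative 24GB buffer pool.
resize_sql 24

# To apply it (assumes a local mysql client and suitable privileges):
#   mysql -e "$(resize_sql 24)"
# Resize progress can be checked with:
#   mysql -e "SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status'"
```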

