super_read_only and GTID replication

Percona Server 5.6.21+ and MySQL 5.7.8+ offer the super_read_only option that was first implemented in WebscaleSQL. Unlike read_only, this option prevents all users from running writes (even those with the SUPER privilege). Sure enough, this is a great feature, but what’s the relation with GTID? Read on!


Enabling super_read_only on all slaves when using GTID replication makes your topology far less sensitive to errant transactions. Failover is then easier and safer because creating errant transactions is much harder.

GTID replication is awesome…

For years, all MySQL DBAs in the world have been fighting with positioning when working with replication. Each time you move a slave from one master to another, you must be very careful to start replicating at the correct position. That was boring and error-prone.

GTID replication is a revolution because it allows auto-positioning: when you configure server B to replicate from A, both servers will automatically negociate which events should be sent by the master. Of course this assumes the master has all missing events in its binlogs. Otherwise the slave will complain that it can’t get all the events it needs and you will see an error 1236.

… but there’s a catch

Actually GTID replication has several issues, the main one in MySQL 5.6 being the inability to switch from position-based to GTID-based replication without downtime. This has been fixed since then fortunately.

The issue I was thinking of is errant transactions. Not familiar with this term? Let me clarify.

Say you have a slave (B) replicating from a master (A) using the traditional position-based replication. Now you want to create a new database. This is easy: just connect to B and run:

mysql> CREATE DATABASE new_db;

Ooops! You’ve just made a big mistake: instead of creating the table on the master, you’ve just created it on the slave. But the change is easy to undo: run DROP DATABASE on B, followed by CREATE DATABASE on A.

Nobody will ever known your mistake and next time you’ll be more careful.

However with GTID-replication, this is another story: when you run a write statement on B, you create an associated GTID. And this associated GTID will be recorded forever (even if the binlog containing the transaction is purged at some point).

Now you can still undo the transaction but there is no way to undo the GTID. What you’ve created is called an errant transaction.

This minor mistake can have catastrophic consequences: say that 6 months later, B is promoted as the new master. Because of auto-positioning, the errant transaction will be sent to all slaves. But it’s very likely that the corresponding binlog has been purged, so B will be unable to send the errant transaction. As a result replication will be broken everywhere. Not nice…

super_read_only can help

Enter super_read_only. If it is enabled on all slaves, the above scenario won’t happen because the write on B will trigger an error and no GTID will be created.

With super_read_only, tools that were not reliable with GTID replication become reliable enough to be used again. For instance, MHA supports failover in a GTID-based setup but it doesn’t check errant transactions when failing over, making it risky to use with GTID replication. super_read_only makes MHA attractive again with GTID.

However note that super_read_only can’t prevent all errant transactions. The setting is dynamic so if you have privileged access, you can still disable super_read_only, create an errant transaction and enable it back. But at least it should avoid errant transactions that are created by accident.

The post super_read_only and GTID replication appeared first on MySQL Performance Blog.


How much could you benefit from MySQL 5.6 parallel replication?

I have heard this question quite often: “At busy times, our replicas start lagging quite frequently. We are using N schemas, so which performance boost could we expect from MySQL 5.6 parallel replication?” Here is a quick way to give you a rough estimate of the potential benefit.

General idea

In MySQL 5.6, parallelism is added at the schema level. So in theory, if you have N schemas and if you use N parallel threads, replication could be up to N times faster. This assumes at least 2 things:

  • Replication throughput scales linearly with the number of parallel threads.
  • Writes are evenly distributed across schemas.

Both assumptions are of course not realistic. But it is easy to know the distribution of writes, and that can already give you an idea about how much you could benefit from parallel replication.

Writes are stored in binary logs but it is much easier to work with the slow query log, so we can enable full slow query logging for some time with long_query_time = 0 and then use pt-query-digest to analyze the resulting log file.

An example

I have a test server with 3 schemas, and I’ve run some sysbench load on it to get a decent slow query log file. Once done, I can run this command:

pt-query-digest --filter '$event->{arg} !~ m/^select|^set|^commit|^show|^admin|^rollback|^begin/i' --group-by db --report-format profile slow_query.log > digest.out

and here is the result I get:

# Profile
# Rank Query ID Response time  Calls  R/Call V/M   Item
# ==== ======== ============== ====== ====== ===== ====
#    1 0x       791.6195 52.1% 100028 0.0079  0.70 db3
#    2 0x       525.1231 34.5% 100022 0.0053  0.68 db1
#    3 0x       203.4649 13.4% 100000 0.0020  0.64 db2

In a perfect world, with 3 parallel threads and if each schema would handle 33% of the total write workload, I could expect a 3x performance improvement.

However here we can see in the report that the 3 replication threads will only work simultaneously 25% of the time in the best case (13.4/52.1 = 0.25). We can also expect 2 replication threads to work simultaneously for some part of the workload, but let’s ignore that for clarity.

It means that instead of the theoretical 200% performance improvement (3 parallel threads 100% of the time), we can hardly expect more than a 50% performance improvement (3 parallel threads 25% of the time). And the reality is that the benefit will be much lower than that.


Parallel replication in MySQL 5.6 is a great step forward, however don’t expect too much if your writes are not evenly distributed across all your schemas. The pt-query-digest trick I shared can give you a rough idea whether your workload is a good fit for multi-threaded slaves in 5.6.

I’m expecting much better results for 5.7, partly because parallelism is handled differently, but also because you can tune how efficient parallel replication will be by adjusting the binlog group commit settings.

The post How much could you benefit from MySQL 5.6 parallel replication? appeared first on MySQL Performance Blog.


The MySQL query cache: Worst enemy or best friend?

During the last couple of months I have been involved in an unusually high amount of performance audits for e-commerce applications running with Magento. And although the systems were quite different, they also had one thing in common: the MySQL query cache was very useful. That was counter-intuitive for me as I’ve always expected the query cache to be such a bottleneck that response time is better when the query cache is turned off no matter what. That lead me to run a few experiments to better understand when the query cache can be helpful.

Some context

The query cache is well known for its contentions: a global mutex has to be acquired for any read or write operation, which means that any access is serialized. This was not an issue 15 years ago, but with today’s multi-core servers, such serialization is the best way to kill performance.

However from a performance point of view, any query cache hit is served in a few tens of microseconds while the fastest access with InnoDB (primary lookup) still requires several hundreds of microseconds. Yes, the query cache is at least an order of magnitude faster than any query that goes to InnoDB.

A simple test

To better understand how good or bad the query cache can be, I set up a very simple benchmark:

  • 1M records were inserted in 16 tables.
  • A moderate write load (65 updates/s) was run with a modified version of the update_index.lua sysbench script (see the end of the post for the code).
  • The select.lua sysbench script was run, with several values for the --num-threads option.

Note that the test is designed to be unfavorable to the query cache as the whole dataset fits in the buffer pool and the SELECT statements are very simple. Also note that I configured the query cache to be large enough so that no entry was evicted from the cache due to low memory.

Results – MySQL query cache ON

First here are the results when the query cache is enabled:


This configuration scales well up to 4 concurrent threads, but then the throughput degrades very quickly. With 10 concurrent threads, SHOW PROCESSLIST is enough to show you that all threads spend all their time waiting for the query cache mutex. Okay, this is not a surprise.

Results – MySQL query cache OFF

When the query cache is disabled, this is another story:


Throughput scales well up to somewhere between 10 and 20 threads (for the record the server I was using had 16 cores). But more importantly, even at the higher concurrencies, the overall throughput continued to increase: at 20 concurrent threads, MySQL was able to serve nearly 3x more queries without the query cache.


With Magento, you can expect to have a light write workload, very low concurrency and also quite complex SELECT statements. Given the results of our simple benchmarks, it is finally not that surprising that the MySQL query cache is a good fit in this case.

It is also worth noting that many applications run a database workload where writes are light and concurrency is low: the query cache should then not be discarded immediately. And maybe it is time for Oracle to make plans to improve the query cache as suggested by Peter a few years ago?

Annex: sysbench commands

# Modified update_index.lua
function event(thread_id)
   local table_name
   table_name = "sbtest".. sb_rand_uniform(1, oltp_tables_count)
   rs = db_query("UPDATE ".. table_name .." SET k=k+1 WHERE id=" .. sb_rand(1, oltp_table_size))
   db_query("SELECT SLEEP(0.015)")

# Populate the tables
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-table-size=1000000 --oltp-tables-count=16 --num-threads=16 --test=/usr/share/doc/sysbench/tests/db/insert.lua prepare
# Write workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/update_index.lua --max-requests=1000000 run
# Read workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/select.lua --max-requests=10000000 run

The post The MySQL query cache: Worst enemy or best friend? appeared first on MySQL Performance Blog.


SELinux and the MySQL init script

I recently worked with a customer who had a weird issue: when their MySQL server was started (Percona Server 5.5), if they try to run service mysql start a second time, the init script was not able to detect that an instance was already running. As a result, it tried to start a second instance with the same settings as the first one. Of course this fails and this creates a mess. What was the issue? A missing rule in SELinux. At least it looks like


If SELinux is set to enforcing and if you are using Percona Server on CentOS/RHEL 6 (other versions could be affected), service mysql start doesn’t work properly and a fix is simple to run:

# grep mysqld_safe /var/log/audit/audit.log | audit2allow -M mysqld_safe
# semodule -i mysqld_safe.pp
# service mysql restart

Other options are:

  • Set SELinux to permissive
  • Use the CentOS/RHEL standard MySQL init script (note I didn’t extensively check if that could trigger other errors)

How did we see the issue?

That was pretty easy: if an instance is already running and if you run service mysql start again, you should see something like this in the MySQL error log:

150717 08:47:58 mysqld_safe A mysqld process already exists

But if you rather see tons of error messages like:

2015-07-17 08:47:05 27065 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11
2015-07-17 08:47:05 27065 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files.

it means that the init script is broken somewhere.


When the issue was brought to my attention, I tried to reproduce it on my local box, but with no luck. What was so special in the configuration used by the customer?

The only thing that was slightly out of the ordinary was SELinux which was set to enforcing. Then we set SELinux to permissive, and guess what? service mysql start was now working properly and it didn’t allow 2 concurrent instances to be run!

Next step was to look at the SELinux logs to find any error related to MySQL, and we discovered messages like:

type=SYSCALL msg=audit(1437121845.464:739): arch=c000003e syscall=62 success=no exit=-13
a0=475 a1=0 a2=0 a3=7fff0e954130 items=0 ppid=1 pid=5732 auid=500 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=5 comm="mysqld_safe" exe="/bin/bash"
subj=unconfined_u:system_r:mysqld_safe_t:s0 key=(null)

At this point, we knew that a rule was missing for mysqld_safe, we needed to add a new one.

Deeper investigation

Actually what happens is that SELinux prevents this condition to be executed in mysqld_safe:

if kill -0 $PID > /dev/null 2> /dev/null

and then the script assumes that this means the mysqld process is not running. That’s why a second mysqld is started.

However users of Oracle MySQL will probably never experience that issue, simply because the init script is a bit different: before calling mysqld_safe, the init script tries to ping a potential mysqld instance and if it gets a positive reply or an Access denied error, it concludes that mysqld is already running and it doesn’t invoke mysqld_safe.

The fix

Fortunately, this is quite simple. You can generate the corresponding rule with audit2allow:

grep mysqld_safe /var/log/audit/audit.log | audit2allow -M mysqld_safe

And after checking the corresponding .te file, we were able to load that new module:

semodule -i mysqld_safe.pp

After stopping MySQL, you can now use service mysql start normally.


This issue was quite interesting to work on because finding the culprit was not that easy. Also it only triggers when SELinux is enabled and Percona Server is used. Now should the init script of Percona Server be fixed? I’m not sure of the potential problems that could occur if we did so, but of course feel free to leave your feedback in the comments.

The post SELinux and the MySQL init script appeared first on MySQL Performance Blog.


Percona XtraDB Cluster (PXC): How many nodes do you need?

A question I often hear when customers want to set up a production PXC cluster is: “How many nodes should we use?”

Three nodes is the most common deployment, but when are more nodes needed? They also ask: “Do we always need to use an even number of nodes?”

This is what we’ll clarify in this post.

This is all about quorum

I explained in a previous post that a quorum vote is held each time one node becomes unreachable. With this vote, the remaining nodes will estimate whether it is safe to keep on serving queries. If quorum is not reached, all remaining nodes will set themselves in a state where they cannot process any query (even reads).

To get the right size for you cluster, the only question you should answer is: how many nodes can simultaneously fail while leaving the cluster operational?

  • If the answer is 1 node, then you need 3 nodes: when 1 node fails, the two remaining nodes have quorum.
  • If the answer is 2 nodes, then you need 5 nodes.
  • If the answer is 3 nodes, then you need 7 nodes.
  • And so on and so forth.

Remember that group communication is not free, so the more nodes in the cluster, the more expensive group communication will be. That’s why it would be a bad idea to have a cluster with 15 nodes for instance. In general we recommend that you talk to us if you think you need more than 10 nodes.

What about an even number of nodes?

The recommendation above always specifies odd number of nodes, so is there anything bad with an even number of nodes? Let’s take a 4-node cluster and see what happens if nodes fail:

  • If 1 node fails, 3 nodes are remaining: they have quorum.
  • If 2 nodes fail, 2 nodes are remaining: they no longer have quorum (remember 50% is NOT quorum).

Conclusion: availability of a 4-node cluster is no better than the availability of a 3-node cluster, so why bother with a 4th node?

The next question is: is a 4-node cluster less available than a 3-node cluster? Many people think so, specifically after reading this sentence from the manual:

Clusters that have an even number of nodes risk split-brain conditions.

Many people read this as “as soon as one node fails, this is a split-brain condition and the whole cluster stop working”. This is not correct! In a 4-node cluster, you can lose 1 node without any problem, exactly like in a 3-node cluster. This is not better but not worse.

By the way the manual is not wrong! The sentence makes sense with its context.

There could actually reasons why you might want to have an even number of nodes, but we will discuss that topic in the next section.

Quorum with multiple data centers

To provide more availability, spreading nodes in several datacenters is a common practice: if power fails in one DC, nodes are available elsewhere. The typical implementation is 3 nodes in 2 DCs:


Notice that while this setup can handle any single node failure, it can’t handle all single DC failures: if we lose DC1, 2 nodes leave the cluster and the remaining node has not quorum. You can try with 4, 5 or any number of nodes and it will be easy to convince yourself that in all cases, losing one DC can make the whole cluster stop operating.

If you want to be resilient to a single DC failure, you must have 3 DCs, for instance like this:


Other considerations

Sometimes other factors will make you choose a higher number of nodes. For instance, look at these requirements:

  • All traffic is directed to a single node.
  • The application should be able to fail over to another node in the same datacenter if possible.
  • The cluster must keep operating even if one datacenter fails.

The following architecture is an option (and yes, it has an even number of nodes!):



Regarding availability, it is easy to estimate the number of nodes you need for your PXC cluster. But node failures are not the only aspect to consider: Resilience to a datacenter failure can, for instance, influence the number of nodes you will be using.

The post Percona XtraDB Cluster (PXC): How many nodes do you need? appeared first on MySQL Performance Blog.


Percona XtraDB Cluster: Quorum and Availability of the cluster

Percona XtraDB Cluster (PXC) has become a popular option to provide high availability for MySQL servers. However many people are still having a hard time understanding what will happen to the cluster when one or several nodes leave the cluster (gracefully or ungracefully). This is what we will clarify in this post.

Nodes leaving gracefully

Let’s assume we have a 3-node cluster and all nodes have an equal weight, which is the default.

What happens if Node1 is gracefully stopped (service mysql stop)? When shutting down, Node1 will instruct the other nodes that it is leaving the cluster. We now have a 2-node cluster and the remaining members have 2/2 = 100% of the votes. The cluster keeps running normally.

What happens now if Node2 is gracefully stopped? Same thing, Node3 knows that Node2 is no longer part of the cluster. Node3 then has 1/1 = 100% of the votes and the 1-node cluster can keep on running.

In these scenarios, there is no need for a quorum vote as the remaining node(s) always know what happened to the nodes that are leaving the cluster.

Nodes becoming unreachable

On the same 3-node cluster with all 3 nodes running, what happens now if Node1 crashes?

This time Node2 and Node3 must run a quorum vote to estimate if it is safe continue: they have 2/3 of the votes, 2/3 is > 50%, so the remaining 2 nodes have quorum and they keep on working normally.

Note that the quorum vote does not happen immediately when Node2 and Node3 are not able to join Node1. It only happens after the ‘suspect timeout’ (evs.suspect_timeout) which is 5 seconds by default. Why? It allows the cluster to be resilient to short network failures which can be quite useful when operating the cluster over a WAN. The tradeoff is that if a node crashes, writes are stalled during the suspect timeout.

Now what happens if Node2 also crashes?

Again a quorum vote must be performed. This time Node3 has only 1/2 of the votes: this is not > 50% of the votes. Node3 doesn’t have quorum, so it stops processing reads and writes.

If you look at the wsrep_cluster_status status variable on the remaining node, it will show NON_PRIMARY. This indicates that the node is not part of the Primary Component.

Why does the remaining node stop processing queries?

This is a question I often hear: after all, MySQL is up and running on Node3 so why is it prevented from running any query? The point is that Node3 has no way to know what happened to Node2:

  • Did it crash? In this case, it is safe for the remaining node to keep on running queries.
  • Or is there a network partition between the two nodes? In this case, it is dangerous to process queries because Node2 might also process other queries that will not be replicated because of the broken network link: the result will be two divergent datasets. This is a split-brain situation, and it is a serious issue as it may be impossible to later merge the two datasets. For instance if the same row has been changed in both nodes, which row has the correct value?

Quorum votes are not held because it’s fun, but only because the remaining nodes have to talk together to see if they can safely proceed. And remember that one of the goals of Galera is to provide strong data consistency, so any time the cluster does not know whether it is safe to proceed, it takes a conservative approach and it stops processing queries.

In such a scenario, the status of Node3 will be set to NON_PRIMARY and a manual intervention is needed to re-bootstrap the cluster from this node by running:

SET GLOBAL wsrep_provider_options='pc.boostrap=YES';

An aside question is: now it is clear why writes should be forbidden in this scenario, but what about reads? Couldn’t we allow them?

Actually this is possible from PXC 5.6.24-25.11 with the wsrep_dirty_reads setting.


Split-brain is one of the worst enemies of a Galera cluster. Quorum votes will take place every time one or several nodes suddenly become unreachable and are meant to protect data consistency. The tradeoff is that it can hurt availability, because in some situations a manual intervention is necessary to instruct the remaining nodes that they can accept executing queries.

The post Percona XtraDB Cluster: Quorum and Availability of the cluster appeared first on MySQL Performance Blog.


Optimizing Percona XtraDB Cluster for write hotspots

Some applications have a heavy write workload on a few records – for instance when incrementing a global counter: this is called a write hotspot. Because you cannot update the same row simultaneously from multiple threads, this can lead to performance degradation. When using Percona XtraDB Cluster (PXC), some users try to solve this specific issue by writing on multiple nodes at the same time. Good idea or bad idea? Read on!

Simultaneous writes on a standalone InnoDB server

Say you have these 3 transactions being run simultaneously (id is the primary key of the table):

# T1
UPDATE t SET ... WHERE id = 100
# T2
UPDATE t SET ... WHERE id = 100
# T3
UPDATE t SET ... WHERE id = 101

All transactions will require a row lock on the record they want to modify. So T3 can commit at the same time than T1 and/or T2, because it will not lock the same record as T1 and T2.

But T1 and T2 cannot execute simultaneously because they need to set a lock on the same record. Let’s assume T2 is executed by InnoDB first, how long does T1 need to wait? It is essentially the time needed for T2 to execute.

Simultaneous writes on multiple nodes (PXC)

Now is it any different if you have a 3-node PXC cluster and if you want to run T1 on Node1 on T2 on Node2? Let’s review step by step how the cluster will execute these queries:

1. T1 starts executing on Node1: a row lock is set on the record where id=100. T2 also starts executing on Node2 and also sets a row lock on the record where id=100. How is it possible that 2 transactions set the same lock on the same record? Remember that locking in Galera is optimistic, meaning that a transaction can only set locks on the node where it is executing, but never on the remote nodes: here, T1 sets a lock on Node1 and T2 sets a lock on Node2.


2. Let’s assume T1 reaches commit before T2. T1 is then synchronously replicated on all nodes and it gets a global sequence number. At this point, a certification test is run on Node1 to determine whether there is any write conflicts between T1 and other “in-flight” transactions in the cluster. Go to the next section if you want some clarification about “in-flight” transactions and the certification test.

Back to our transactions: T1 is the first transaction to try to commit, so no other transaction is “in-flight”: the certification test will succeed and InnoDB will be able to apply the transaction on Node1. On Node2, the same certification test will be run and T1 will be put in the apply queue (and it will be applied at some point in the near future).


Ok, wait a minute! No other transaction is “in-flight”, really? What about T2 on Node2? Actually T2 is simply a local transaction on Node2 and it is not known by the cluster yet. Therefore it is not what I called an “in-flight” transaction and it does not play any role in the certification test.

3. Now T2 reaches commit on node2. It is then synchronously replicated to all nodes and a certification test will run on node2. If T1 and T2 did commit simultaneously, there is a good chance that when T2 starts committing, T1 is still in the apply queue of Node2. In this case, the certification test for T2 will fail. Why? To make sure that T1 will commit on all nodes no matter what, any other transaction that wants to set a lock on the record where id=100 will be rejected.

Then if the certification test for T2 fails, T2 is rolled back. The only option to commit T2 is to retry executing it.


Let’s assume that this second try is successful, how long did T2 have to wait before being executing? Essentially we had to execute T2 twice so we had to replicate it twice, each replication taking 1 network RTT, we had to roll T2 back on Node2 (rollback is expensive in InnoDB) and the application had to decide that T2 had to be executed a second time. That is a lot more work and wait compared to the scenario on a single server.

So where is the fundamental problem when we tried to write on several nodes? Galera uses optimistic locking, and we had to go very far in the execution of T2 before realizing that the query will not succeed. Multi-node writing is therefore not a good solution at all when the system sees write hotspots.

“In-flight” transactions and certification test

“In-flight” transactions are transactions that have already been applied on at least one node of the cluster but not on all nodes. Remember that even if replicating transactions is synchronous, applying committed transactions on remote nodes is not. So a transaction Tx can be committed and applied on node1 but not on node2: on node2, Tx will simply sit in an apply queue, waiting to be executed. Tx is then an “in-flight” transaction.

The goal of the certification test is to make sure that no transaction can prevent Tx to execute on node2: as Tx is already on node1 and as we want data consistency, we must make sure that Tx will execute successfully no matter what can happen. And the magic of Galera is that the test is deterministic so group communication is not necessary when a node runs a certification test.


So what are your options with PXC when the workload has write hotspots? The most obvious one is to write on a single node: then you will have the same locking model as with a standalone InnoDB server. Performance will not be as good as with a standalone server as you will have to pay for synchronous replication, but you will avoid the very expensive write conflicts.

Another option would be to rearchitect your application to write less often. For instance, for a global counter, you could maintain it with Redis and only periodically flush the value to PXC.

And if you want to understand more about the benefits and drawback of writing on multiple nodes of a PXC cluster, you can read these two posts.

The post Optimizing Percona XtraDB Cluster for write hotspots appeared first on MySQL Performance Blog.


Storing time-series data with MongoDB and TokuMX

Storing time-series data is a frequent pattern for databases – be it for logs or for any kind of monitoring. Such data has the following properties: records are inserted but also never updated, the insertion rate can be high and records are likely to expire after some time. MongoDB and TokuMX are both good fits because of their flexible schema feature. But how can we handle data expiration efficiently? Several options are available: capped collections, TTL collections and partitioning (TokuMX only), but they all have different features and performance profiles.


  • Capped collections: very good insert performance, but not eligible for sharding and hard to predict when documents will expire.
  • TTL collections: expiration date is easy to enforce and compatible with sharding, but purge or records is inefficient.
  • Partitioning: very good performance for inserts and purge, but not compatible with sharding and only available with TokuMX.

Capped collections

To create an app_stats capped collection with a size of 1GB, use the following command:

> db.createCollection( "app_stats", { capped: true, size: 1024*1024*1024 } )

MongoDB will write to this collections in a circular fashion: once allocated files are full, data at the beginning of the first file is being overwritten. This is very good to make sure your collection will never exceed the size you set. However predicting the size you will need to store 6 months of data can be tricky.

Another benefit of capped collection is that they keep data in insertion order. So you don’t need to add an extra index to sort data by insertion date: this is a good point for good write performance.

Can you update records in a capped collection? Yes but only as long as they don’t increase the original size of the document, but it is recommended to be light on updates as you might experience strange errors with secondaries in some rare cases.

Another limitation is that sharding is not supported.

TTL collections

TTL collections take a different approach: they are normal, but they have an index that has a special option. For instance:

> db.logs.createIndex({insertDate:1}, {expireAfterSeconds: 86400})

This index will make sure that records will be expired automatically after one day. Pretty neat!

And as a TTL collection is a regular collection, you can shard it if you need.

However the main limitation is the purge process: every minute, a background thread will look for documents that are expired and if it finds some, it will remove them. This is not really different from the application running a cron job every minute to remove old documents: it adds a constant purge workload that can be detrimental to insertion performance.

Note that TTL indexes do not work with capped collections.

Partitioning (TokuMX)

With TokuMX you can partition your data like you would with MySQL. For instance, if you want to create a collection partitioned on an insertDate field, you would use these commands:

> db.createCollection('stats_part',{primaryKey:{insertDate:1,_id:1}, partitioned: true})
> db.stats_part.addPartition({insertDate:ISODate("2015-05-31T23:59:59")})
> db.stats_part.addPartition({insertDate:ISODate("2015-06-30T23:59:59")})

The main benefit of partitioning is that expiring data is extremely simple and fast: remove the corresponding partition(s) with the dropPartition() function. Another nice property is that you can be very flexible regarding how large your partitions can grow. This could deserve a whole blog post, but let me give a quick example.

Suppose you want to expire data after 6 months. With regular MongoDB, indexes are using B-Trees so insertion is only fast if indexes fit in memory. So if you could create a partitioned collection, insertions would only be fast if the partition you are writing to (the last one) would fit in memory. Then you would probably have to create a partition for every day. With TokuMX, there is no such limitation and insertion performance will be as good if you have larger partitions like one partition per week or one partition per month.

Any drawback of partitioning? It involves some application overhead as you need to write the logic to periodically drop old partitions and create new ones. And sharding only has a limited support.


There is no one-size-fits-all solution when it comes to storing time-series data in MongoDB and TokuMX, but several options with their own benefits and trade-offs. Note that if your collection needs to be sharded, a TTL index is probably the only option. And if you are using TokuMX, partitioning is something to look at if you were planning a capped collection.

The post Storing time-series data with MongoDB and TokuMX appeared first on MySQL Performance Blog.


MySQL indexing 101: a challenging single-table query

We discussed in an earlier post how to design indexes for many types of queries using a single table. Here is a real-world example of the challenges you will face when trying to optimize queries: two similar queries, but one is performing a full table scan while the other one is using the index we specially created for these queries. Bug or expected behavior? Read on!

Our two similar queries

# Q1
mysql> explain select col1, col2 from t where ts >= '2015-04-30 00:00:00';
| id | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
|  1 | SIMPLE      | t             | ALL  | ts            | NULL | NULL    | NULL | 4111896 | Using where |
# Q2
mysql> explain select count(*) from t where ts >='2015-04-30 00:00:00';
| id | select_type | table         | type  | possible_keys | key          | key_len | ref  | rows    | Extra                    |
|  1 | SIMPLE      | t             | range | ts            | ts           | 5       | NULL | 1809458 | Using where; Using index |

Q1 runs a full-table scan while Q2 is using the index on ts, which by the way is covering – See Using index in the Extra field. Why such different execution plans?

Let’s try to understand what happens with Q1.

This is a query with a single inequality on the ts field and we have an index on ts. The optimizer tries to see if this index is usable (possible_keys field), this is all very logical. Now if we look at the rows field for Q1 and Q2, we can see that the index would allow us to only read 45% of the records (1.8M out of 4.1M). Granted, this is not excellent but this should be much better than a full table scan anyway, right?

If you think so, read carefully what’s next. Because this assumption is simply not correct!

Estimating the cost of an execution plan (simplified)

First of all, the optimizer does not know if data or indexes are in memory or need to be read from disk, it will simply assume everything is on disk. What it does know however is that sequential reads are much faster than random reads.

So let’s execute Q1 with the index on ts. Step 1 is to perform a range scan on this index to identify the 1.8M records that match the condition: this is a sequential read, so this is quite fast. However now step 2 is to get the col1 and col2 fields for each record that match the condition. The index provides the primary key value for each matching record so we will have to run a primary key lookup for each matching record.

Here is the issue: 1.8M primary key lookups is equivalent to 1.8M random reads, therefore this will take a lot of time. Much more time than sequentially reading the full table (which means doing a full scan of the primary key because we are using InnoDB here).

Contrast that with how Q2 can be executed with the index on ts. Step 1 is the same: identify the 1.8M matching records. But the difference is: there’s no step 2! That’s why we call this index a ‘covering index’: we don’t need to run point queries on the primary key to get extra fields. So this time, using the index on ts is much more efficient than reading the full table (which again would mean that we would do a full-table scan of the primary key).

Now there is one more thing to understand: a full-table scan is a sequential operation when you think about it from a logical point of view, however the InnoDB pages are certainly not stored sequentially on disk. So at the disk level, a full table is more like multiple random reads than a single large sequential read.

However it is still much faster than a very large number or point query and it’s easy to understand why: when you read a 16KB page for a full table scan, all records will be used. While when you read a 16KB page for a random read, you might only use a single record. So in the worst case, reading 1.8M records will require 1.8M random reads while reading the full table with 4M records will only require 100K random reads – the full table scan is still an order of magnitude faster.

Optimizing our query

Now that we’ve understood why the optimizer chose a full table scan for Q1, is there a way to make it run faster by using an index? If we can create a covering index, we will no longer need the expensive primary key lookups. Then the optimizer is very likely to choose this index over a full table scan. Creating such a covering index is easy:

ALTER TABLE t ADD INDEX idx_ts_col1_col2 (ts, col1, col2);

Some of you may object that because we have an inequality on ts, the other columns cannot be used. This would be true if we had conditions on col1 or col2 in the WHERE clause, but that does not apply here since we’re only adding these extra columns to get a covering index.


Understanding how indexes can be used to filter, sort or cover is paramount to be able to optimize queries, even simple ones. Understanding (even approximately) how a query is run according to a given execution plan is also very useful. Otherwise you will sometimes be puzzled by the decisions made by the optimizer.

Also note that beginning in MySQL 5.7, the cost model can be tuned. This can help the optimizer make better decisions: for instance random reads are far cheaper on fast storage than on regular disks.

The post MySQL indexing 101: a challenging single-table query appeared first on MySQL Performance Blog.


MongoDB’s flexible schema: How to fix write amplification

Being schemaless is one of the key features of MongoDB. On the bright side this allows developers to easily modify the schema of their collections without waiting for the database to be ready to accept a new schema. However schemaless is not free and one of the drawbacks is write amplification. Let’s focus on that topic.

Write amplification?

The link between schema and write amplification is not obvious at first sight. So let’s first look at a table in the relational world:

mysql> SELECT * FROM user LIMIT 2;
| id | login | first_name | last_name | city      | country                          | zipcode | address                           | password   | birth_year |
|  1 | arcu  | Vernon     | Chloe     | Paulista  | Cook Islands                     | 28529   | P.O. Box 369, 1464 Ac Rd.         | SSC44GZL5R |       1970 |
|  2 | quis  | Rogan      | Lewis     | Nashville | Saint Vincent and The Grenadines | H3T 3S6 | P.O. Box 636, 5236 Elementum, Av. | TSY29YRN6R |       1983 |

As all records have exactly the same fields, the field names are stored once in a separate file (.frm file). So the field names is metadata while the value of each field for each record is of course data.

Now let’s look at an equivalent collection in MongoDB:

                "login": "arcu",
                "first_name": "Vernon",
                "last_name": "Chloe",
                "city": "Paulista",
                "country": "Cook Islands",
                "zipcode": "28529",
                "address": "P.O. Box 369, 1464 Ac Rd.",
                "password": "SSC44GZL5R",
                "birth_year": 1970
                "login": "quis",
                "first_name": "Rogan",
                "last_name": "Lewis",
                "city": "Nashville",
                "country": "Saint Vincent and The Grenadines",
                "zipcode": "H3T 3S6",
                "address": "P.O. Box 636, 5236 Elementum, Av.",
                "password": "TSY29YRN6R",
                "birth_year": 1983

One difference with a table in the relational world is that MongoDB doesn’t know which fields each document will have. Therefore field names are data, not metadata and they must be stored with each document.

Then the question is: how large is the overhead in terms of disk space? To have an idea, I inserted 10M such records in an InnoDB table (adding an index on password and on birth_year to make the table look like a real table): the size on disk is around 1.4GB.

I also inserted the exact same 10M records in a MongoDB collection using the regular MMAPv1 storage engine, again adding an index on password and on birth_year, and this time the size on disk is … 2.97GB!

Of course it is not an apples-to-apples comparison as the InnoDB storage format and the MongoDB storage format are not identical. However a 100% difference is still significant.


One way to deal with write amplification is to use compression. With MongoDB 3.0, the WiredTiger storage engine is available and one of its benefits is compression (default algorithm: snappy). Percona TokuMX also has built-in compression using zlib by default.

Rebuilding the collection with 10M documents and the 2 indexes now gives the following results:
WiredTiger: 1.14GB
TokuMX: 736MB

This is a 2.5x to 4x data size reduction, pretty good!

WiredTiger also provides zlib compression and in this case the collection is only 691MB. However CPU usage is much higher compared to snappy so zlib will not be usable in all situations.


MongoDB schemaless design is attractive but it comes with several tradeoffs. Write amplification is one of them and using either WiredTiger with MongoDB 3.0 or Percona TokuMX is a very simple way to fix the issue.

The post MongoDB’s flexible schema: How to fix write amplification appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by