For a long part of its history, MySQL replication has been limited in terms of performance. Because there was no way of knowing whether transactions or updates were independent, a replica had to execute them in the exact same sequence as the primary server. The only way to guarantee that sequence was to use a single thread. In this post, we’ll take a dive into the MySQL multi-threaded replication (MTR) implementation and explore the available tuning options.
MTR is the culmination of an evolution in parallel replication that followed this path:
- Single-threaded replication
- Per-database replication
- Logical clock replication
We’ll leave aside, for now, the recent dependency tracking feature.
Context
Before we discuss the multi-threaded implementation, let’s review in detail the limitations of replication with a single thread. The most obvious one is CPU processing capacity. With a single thread, a process is bound to a single CPU core. If the updates the replica has to execute are CPU intensive, the replica is likely to lag behind. This situation is, however, quite exceptional: replication loads are rarely CPU intensive, or at least they shouldn’t be. With row-based replication, the only way a replica can run with a high CPU is if the schema is improperly indexed, which is obviously a pathological case.
The latency of IO operations is the most common performance limitation of a single thread. Spinning disks have a latency of a few milliseconds, essentially the time it takes for the platters to turn and the head to move to the desired cylinder. If this latency is 10ms, a single thread won’t be able to perform more than about 100 IO operations per second. The latency of a flash storage device is much lower, but such devices are often accessed over the network, so the latency is still significant.
To illustrate the single-threaded replication issue, here’s a simple sysbench indexed-update benchmark with a setup sensitive to IO latency. In yellow are the com_update rates of the primary server with one to 16 threads, for short two-minute sysbench executions. In green are the corresponding com_update rates of the replica. The area under each curve, for a given number of sysbench threads, is the same: the replica has to execute the same updates, and it is allowed to catch up with the primary before the next run is started. While at one thread the rates are the same, at 16 threads the difference is huge.
In real-world scenarios we rarely see single-threaded traffic on the primary server; modern applications rely on concurrency to scale. This means that single-threaded replication is very likely to suffer from lag, which leads to problems. For one, lag can prevent load balancing because the replica has outdated data. A failover can also take longer because the replica first needs to catch up. Until the arrival of MySQL 5.6 and especially 5.7, these were the all-too-familiar issues with MySQL replication.
Per-Database Replication
Before jumping into the analysis of the actual topic, it is worth mentioning the first type of parallel replication, introduced in MySQL 5.6: per-database (or per-schema) replication. It assumes that transactions running in different schemas can be applied in parallel on the replica. This was an important performance improvement, especially in sharded environments where multiple databases receive writes in parallel on the primary server.
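As an illustration, here’s a hedged sketch of how per-database parallel apply might be enabled on a replica; it is not the configuration used later in this post, and the worker count of 4 is just an example. On 5.6, setting the worker count alone selects this mode, while 5.7 exposes the mode explicitly:

-- Hedged sketch: per-database parallel apply (5.7-era slave_* variable names)
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_type = 'DATABASE';   -- parallelize by schema
SET GLOBAL slave_parallel_workers = 4;         -- illustrative worker count
START SLAVE SQL_THREAD;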
Group Commit
The single-threaded limitation was not the only performance-related issue of MySQL replication. The rate of transactions MySQL/InnoDB could handle was quite limited in a fully durable setup. Every commit, even an implicit one, needed to accomplish three fsync operations. fsync is probably the slowest type of IO operation; even on flash devices, it can take more than a millisecond.
When a file is fsynced, all the pending dirty buffers of the file are flushed to storage, not just the ones written by the thread doing the fsync. If there are 10 threads at the commit stage, the first thread to fsync the InnoDB log file and the binary log file flushes the data of all the other nine threads. This process is called a group commit. Group commit was introduced in 5.6 and considerably improved the scalability of MySQL for such workloads. As we’ll see, it also has a huge impact on replication.
Logical_clock Replication
In order to understand parallel replication, one has to realize that transactions within a group commit are independent: a transaction that depended on another one still at the commit stage would be blocked by InnoDB locks and could not reach the commit stage itself. That information is extremely important, as it allows the transactions within a group commit to be applied in parallel.
MySQL 5.7 added markers in the binary logs to indicate group commit boundaries, along with a new replication mode, logical_clock, to benefit from these markers. You can see these markers in the binary logs with the mysqlbinlog tool. Here’s an example:
root@ip-172-31-10-84:~# mysqlbinlog ip-172-31-10-84-bin.000047 | grep last_committed | awk '{ print $11" "$12}' | more
last_committed=0 sequence_number=1
last_committed=0 sequence_number=2
last_committed=0 sequence_number=3
last_committed=0 sequence_number=4
last_committed=0 sequence_number=5
last_committed=0 sequence_number=6
last_committed=0 sequence_number=7
last_committed=0 sequence_number=8
last_committed=0 sequence_number=9
last_committed=0 sequence_number=10
last_committed=0 sequence_number=11
last_committed=0 sequence_number=12
last_committed=0 sequence_number=13
last_committed=1 sequence_number=14 <= trx 1 committed alone
last_committed=1 sequence_number=15
last_committed=1 sequence_number=16
last_committed=1 sequence_number=17
last_committed=8 sequence_number=18 <= trx 2 to 8 committed together
last_committed=8 sequence_number=19
last_committed=8 sequence_number=20
last_committed=8 sequence_number=21
last_committed=8 sequence_number=22
last_committed=8 sequence_number=23
last_committed=17 sequence_number=24 <= trx 9 to 17 committed together
last_committed=17 sequence_number=25
There is one important thing to remember with multi-threaded replication and logical_clock: if you want the replica to apply transactions in parallel, there must be group commits on the primary.
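On the replica side, switching to this mode takes only a few settings. Here’s a hedged sketch using the replica_* variable spellings available in 8.0.26 (older versions use the slave_* equivalents); the worker count is only an example:

-- Hedged sketch: enable LOGICAL_CLOCK multi-threaded apply on the replica
STOP REPLICA SQL_THREAD;
SET GLOBAL replica_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL replica_parallel_workers = 16;
SET GLOBAL replica_preserve_commit_order = ON;  -- commit on the replica in binary log order
START REPLICA SQL_THREAD;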
Test Configuration
It is not as easy as it may seem to show clean replication results, even with something as simple as an indexed-update benchmark with sysbench. We must wait for replication to sync up, and page flushing easily interferes with the results, adding a lot of noise. To get cleaner results, you need short sysbench runs, delayed flushing, and to wait for flushing to complete before starting a new run. Even then, the results are not as clean as we would have hoped.
We have used AWS EC2 instances for our benchmark along with the new gp3 general-purpose, SSD-based EBS volumes. Our setup can be summarized as follows:
- Primary and replica servers: r5b.large (2 vCPUs, 16GB of RAM)
- gp3 EBS volume of 128GB (stable 3000 IOPS, IO latency of about 0.7ms)
- Dataset of 56GB: 12 tables of 4.7GB (20M rows each)
- Percona Server for MySQL 8.0.26 on Ubuntu 20.04
The size of the dataset was chosen to be more than four times the size of the InnoDB buffer pool (12GB). Since the updates are random, the benchmarks are IO-bound: about 79% of the time, a page needs to be read from storage. The benchmark runs were kept short, one or two minutes, to avoid issues with page flushing.
Sysbench itself was executed from a third EC2 VM, a c6i.xlarge instance, to avoid competition for resources with the database servers. This instance also hosted Percona Monitoring and Management (PMM) to collect performance metrics of the database servers. The figures in this post are taken directly from a custom dashboard in PMM.
How Good is MTR?
We saw earlier the limitations of replication with a single thread. For the same fully durable database cluster, let’s re-run the sysbench indexed-update benchmarks, but this time with replica_parallel_workers set to 16 on the replica.
As we can see, the replica is fully able to keep up with the primary. It starts to lag only when sysbench uses nearly as many threads as we have defined in replica_parallel_workers. Keep in mind, though, that this is a simple and well-defined workload; a real-world workload may not see the same level of improvement.
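If you want to confirm that the worker threads are actually being used, performance_schema exposes per-worker applier status. A minimal hedged example (available columns vary slightly by version):

-- One row per configured applier worker; the transaction columns show progress.
SELECT WORKER_ID, SERVICE_STATE, LAST_APPLIED_TRANSACTION
FROM performance_schema.replication_applier_status_by_worker;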
The Cost of Durability
Durability is the “D” of the ACID acronym. By now you should be well aware of the importance of group commit for replication scalability, and the prime way of achieving group commit is durability. Two variables control durability with MySQL/InnoDB:
- innodb_flush_log_at_trx_commit (1 is durable, 0 and 2 are not)
- sync_binlog (1 is durable, 0 is not)
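Both settings are dynamic. For reference, a hedged sketch of a fully durable configuration looks like this; anything other than 1 trades durability for lower commit latency:

-- Fully durable settings: every commit is fsynced
SET GLOBAL innodb_flush_log_at_trx_commit = 1;  -- fsync the InnoDB redo log at commit
SET GLOBAL sync_binlog = 1;                     -- fsync the binary log at commit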
There are cases, however, where a durable primary server is not desirable or, put otherwise, where durability induces too much latency. An example is ad-click recording: what matters with this workload is to record clicks on web advertisements as fast as possible. When a new promotion is out, there can be a huge spike in clicks, and capturing that spike is more important than the risk of losing a few clicks because of a crash.
The absence of durability on the primary server makes small writes faster, but it essentially disables group commit. It becomes very difficult for the replica to keep up, and we are back to the original replication performance issue.
The following figure shows four sysbench executions of one minute each with 1, 2, 3, and 4 threads. The primary server (yellow) is not durable. Although the replica (green) is configured with 16 parallel worker threads, it is obviously unable to keep up. So, between a “slow” durable primary server that allows replicas to keep up and a “fast” non-durable primary server that replicas can’t follow, is there a possible compromise?
Sync Delay
Instead of relying on the hardware (fsync) to group transactions, what if we could define the duration of the grouping interval ourselves? This is exactly the purpose of the binlog_group_commit_sync_delay variable. It defines a time, in microseconds, during which transactions are grouped: the first transaction starts a timer, and until it expires, the following transactions are grouped with it. The main advantage over the original behavior is that the interval can be tuned. The following figure shows the same non-durable results as the previous one, but this time with a small grouping delay of 5us added.
The figure shows the same com_update rates for the primary (yellow) and replica (green). For our very simple workload, the impact of that small delay on the replica is significant: with four sysbench threads, the replica execution rate is close to 2000/s, an increase of close to 30%. There is also an impact on the primary, especially at the lower sysbench thread counts. With one thread, the rate of sysbench updates is lower by close to 30% with the delay.
So, the cost of adding a delay is somewhat similar to the cost of durability: transaction latency is increased, which mostly affects low-concurrency workloads. One could argue that such workloads are not the most likely to cause replication lag issues. At a higher concurrency level, when more threads are active at the same time, the delay is still present but less visible. For a given throughput, the database will just use a few more threads to overcome the impact of the delay.
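For reference, the grouping delay is a dynamic global variable, so it can be adjusted on a running primary. A hedged sketch using the 5us value from the test above (not a recommendation):

-- Delay is in microseconds; 0 (the default) disables the grouping delay.
SET GLOBAL binlog_group_commit_sync_delay = 5;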
Sync No Delay Count
Although a few more running threads is normally not a big issue, database workloads often present bursts during which the number of running threads rises suddenly. The addition of a grouping delay makes things worse, and contention can rise. To avoid this problem, it is possible to configure the primary so that it stops waiting if too many transactions are already delayed. The variable controlling this behavior is binlog_group_commit_sync_no_delay_count. Its meaning is: if the number of transactions waiting for the sync delay reaches the value of the variable, stop waiting and proceed.
This variable provides a safeguard against the worst impacts of a sync delay. One could set an aggressive delay to help the replica keep up with the normal workload and a low no-delay count to absorb the peaks. As a rule of thumb, if the no-delay count is used, it should be set to the number of replica worker threads.
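Following that rule of thumb, a hedged sketch of an aggressive delay capped by a no-delay count matching the 16 replica workers could look like this; the values are illustrative only, roughly in line with the 12ms delay explored in the next figure:

-- Aggressive grouping window, capped so bursts don't pile up at commit
SET GLOBAL binlog_group_commit_sync_delay = 12000;        -- 12ms grouping window
SET GLOBAL binlog_group_commit_sync_no_delay_count = 16;  -- stop waiting once 16 transactions are queued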
In order to better understand the impact of the no-delay count value, the following figure shows the sysbench update rates for a high sync delay (12ms) with various values of the no-delay count.
In all cases, sysbench used 16 threads and the replica had 16 worker threads. A value of “0” (left) disables the no-delay count feature. This leads to a very low rate on the primary but a perfectly synchronized replica. A value of “1” essentially disables the sync delay feature. This leads to a high rate of transactions on the primary, but the replica struggles. For higher values, the stability of the primary transaction rates surprised me; I was expecting a decrease in performance with higher no-delay count values. The replica, on the other hand, behaves as expected: the larger the grouping, the higher its transaction rate.
Conclusion
In this post, we discussed the MySQL LOGICAL_CLOCK multi-threaded replication implementation introduced in 5.7 and its relation to group commit. As valuable as group commit is, it relies on durability, and not all workloads can deal with the additional latency. For these latency-sensitive workloads, the LOGICAL_CLOCK implementation also supports a more flexible grouping delay, along with a maximum grouping count.
The improvements to replication performance since 5.7 have been simply amazing. They helped remove a performance bottleneck that plagued replication for years. Hopefully, this post will help you understand this piece of technology and how to tune it.