WAN Synchronous Clusters: Dealing with Latency Using Concurrency


In this blog post, we’ll discuss how to use concurrency to offset WAN latency when using synchronous clusters.

WAN Latency Problem

Our customers often ask us for help or advice with WAN clustering problems. Historically, the usual solution for MySQL WAN deployments has been to have the primary site in one data center and a standby backup site in another data center (replicating from the primary asynchronously). These days, however, there is a huge desire to employ the available synchronous replication solutions for MySQL. These solutions include Galera (e.g., Percona XtraDB Cluster) and the recently released MySQL Group Replication. This trend is attributable to the fact that these solutions are less problematic and provide more automatic failover and failback procedures. But it’s also because businesses want to write in both data centers simultaneously.

Unfortunately, WAN link reliability and latency make synchronous replication a big challenge. In many cases, these challenges force geographically separate data centers to still replicate asynchronously.

From a requirements point of view, the official documentation from Galera’s founders has WAN-related recommendations and some dedicated options (like segments), as described in Jay’s blog post. WAN deployments are absolutely possible, and even an advertised option, in Galera. The MySQL Group Replication team, however, seems to discourage such use cases, as we can read:

Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.


While perhaps obvious to some, I would like to point out a simple dependency that might be a viable solution in some deployments that face significant network latency. That solution is concurrency! When write throughput is limited by transaction commit latency, you can employ more writer threads. By using separate connections to MySQL, you can commit more transactions at the same time.
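As a rough back-of-the-envelope model (a simplification for illustration, not a measured result): if every commit waits for about one network round trip and the writer threads do not block each other, the time to commit a batch shrinks in proportion to the number of threads. The function name and its parameters below are hypothetical.

```shell
#!/bin/sh
# Idealized estimate: total seconds = transactions * round-trip latency / threads.
# Assumes one round trip per commit and perfectly parallel writers.
estimate_seconds() {
    # $1 = number of transactions, $2 = round-trip latency in ms, $3 = writer threads
    awk -v n="$1" -v ms="$2" -v t="$3" 'BEGIN { printf "%.3f\n", n * ms / 1000 / t }'
}

# Print the estimate for 500 transactions at 100ms for a few thread counts.
for threads in 1 4 16 64; do
    echo "$threads threads: $(estimate_seconds 500 100 "$threads") s"
done
```

A real cluster will likely fall short of this ideal as concurrency grows, because the server and the replication layer add their own overhead.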

Let me demonstrate with example results based on a very simple test case. I tested both Percona XtraDB Cluster (with Galera replication) and MySQL Group Replication. I configured a minimal cluster of three nodes in each case, running as Docker containers on the same host (simulating a WAN network). For this setup, latency is around 0.05ms on average. Then, I introduced an artificial network latency of 50ms and 100ms on one node’s network interface. I later repeated the same tests using VirtualBox VM instances, running on a completely different server. The results were very similar. The command to simulate additional network latency is:

# tc qdisc add dev eth0 root netem delay 50ms

This delays the ping to the other nodes in the cluster:

# ping -c 2
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=50.0 ms
64 bytes from icmp_seq=2 ttl=64 time=50.1 ms
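For repeated test runs it can help to wrap the tc calls so the delay is easy to add, inspect and remove again. This is a minimal sketch: the helper names and the DRY_RUN switch are hypothetical, and running the commands for real requires root on the node.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch for safe previews).
maybe_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Add a fixed netem delay on an interface, e.g.: add_delay eth0 50ms
add_delay()    { maybe_run tc qdisc add dev "$1" root netem delay "$2"; }

# Show the queueing discipline currently attached to the interface.
show_qdisc()   { maybe_run tc qdisc show dev "$1"; }

# Remove the root qdisc again, which drops the artificial delay.
remove_delay() { maybe_run tc qdisc del dev "$1" root; }
```

For example, `DRY_RUN=1 add_delay eth0 100ms` prints the command for the 100ms case without touching the interface.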

The test is very simple: execute 500 small insert transactions, each inserting just a single row (but that is less relevant now).

For testing, I used a simple mysqlslap command:

mysqlslap --password=*** --host=$IP --user=root --delimiter=";" --number-of-queries=500 --create-schema=test --concurrency=$i --query="insert into t1 set a='fooBa'"

and a simple single table:

`a` char(5) DEFAULT NULL,
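The `$i` in the mysqlslap command suggests it was run once per concurrency level. One way to script that sweep is sketched below, under a few assumptions: `$IP` is set as in the command above, `MYSQL_PASSWORD` is a hypothetical variable holding the password, and `SLAP` is a hypothetical override that lets you swap the binary (for example with `echo`) to preview the calls.

```shell
#!/bin/sh
# Run the 500-insert test once at a given concurrency level.
run_slap() {
    # $1 = number of concurrent client connections
    ${SLAP:-mysqlslap} --password="$MYSQL_PASSWORD" --host="$IP" --user=root \
        --delimiter=";" --number-of-queries=500 --create-schema=test \
        --concurrency="$1" --query="insert into t1 set a='fooBa'"
}

# Repeat the test for each thread count used in the results tables.
sweep() {
    for i in 1 4 8 16 32 64; do
        echo "== concurrency $i =="
        run_slap "$i"
    done
}
```

Calling `sweep` against a node then prints one mysqlslap report per thread count.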

Interestingly, without increased latency, the same test takes much longer against the Group Replication cluster, even though by default Group Replication works with group_replication_single_primary_mode enabled and group_replication_enforce_update_everywhere_checks disabled. Theoretically, it should be a lighter operation, from a “data consistency checks” point of view. Also with WAN-type latencies, Percona XtraDB Cluster seems to be slightly faster in this particular test. Here are the test results for the three different network latencies:

XtraDB Cluster latency/seconds

Threads    100ms     50ms      0.05ms
1          51.671    25.583    0.268
4          13.936    8.359     0.187
8          7.84      4.18      0.146
16         4.641     2.353     0.13
32         2.33      1.16      0.122
64         1.808     0.925     0.098

GR latency/seconds

Threads    100ms     50ms      0.05ms
1          55.513    29.339    5.059
4          14.889    7.916     2.184
8          7.673     4.195     1.294
16         4.52      2.507     0.767
32         3.417     1.479     0.473
64         2.099     0.809     0.267
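To make the effect of concurrency easier to read off the numbers, the run times can be turned into a speedup relative to the single-thread run. The sketch below does this for the XtraDB Cluster 100ms column; the `speedup` helper is a hypothetical name, and the input values are copied from the table above.

```shell
#!/bin/sh
# Speedup of a run relative to a baseline run: base_seconds / seconds.
speedup() {
    awk -v base="$1" -v cur="$2" 'BEGIN { printf "%.1f\n", base / cur }'
}

# Threads and run times from the XtraDB Cluster 100ms column.
while read -r threads seconds; do
    echo "$threads threads: $(speedup 51.671 "$seconds")x faster than 1 thread"
done <<'EOF'
1 51.671
4 13.936
8 7.84
16 4.641
32 2.33
64 1.808
EOF
```

The same loop works for the other columns by swapping the baseline and the data lines.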


I used the same InnoDB settings for both clusters, with each node in a separate Docker container or VirtualBox VM. Results of a similar test could differ a lot in real production systems, where more CPU cores provide better conditions for concurrency.

It also wasn’t my intention to benchmark Galera versus Group Replication, but rather to show that the same dependency of write throughput on concurrency applies to both technologies. I might be missing some tuning on the Group Replication side, so I don’t claim any verified winner here.

Just to provide some more details, I was using Percona XtraDB Cluster 5.7.16 and MySQL with Group Replication 5.7.17.

One important note: when you expect higher concurrency to provide better throughput, you must make sure the concurrency is not limited by server settings. For example, you must look at innodb_thread_concurrency (I used 0, so unlimited), slave_parallel_workers for GR and wsrep_slave_threads for Galera (among others related to IO operations, etc.).
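A quick way to check such limits on a node is to query the server variables directly. This sketch assumes the settings of interest are innodb_thread_concurrency, slave_parallel_workers (applier threads, relevant for Group Replication) and wsrep_slave_threads (Galera applier threads); `MYSQL` is a hypothetical override of the client binary, and credentials are assumed to come from the client's option file.

```shell
#!/bin/sh
# Print the concurrency-related settings of one node.
# A variable that does not exist on that server (e.g. wsrep_* without Galera) is simply not listed.
show_concurrency_settings() {
    # $1 = host name or IP of the node
    ${MYSQL:-mysql} --host="$1" -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
        ('innodb_thread_concurrency', 'slave_parallel_workers', 'wsrep_slave_threads')"
}
```

Running `show_concurrency_settings "$IP"` on each node makes it easy to spot a node whose settings differ from the rest.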

Apart from “concurrency tuning,” which could involve application changes if not an architectural redesign, there are of course more possible optimizations for WAN environments. For example, see https://www.percona.com/blog/2016/03/14/percona-xtradb-cluster-in-a-high-latency-network-environment/ for dealing with latency. There are also variables for minimizing network utilization.

But these are out of the scope of this post. I hope this simple post helps you deal with the speed of light better!
