How Not to do MySQL High Availability: Geographic Node Distribution with Galera-Based Replication Misuse

MySQL High Availability 2

Let’s talk about MySQL high availability (HA) and synchronous replication once more.

It’s part of a longer series on some high availability reference architecture solutions over geographically distributed areas.

Part 1: Reference Architecture(s) for High Availability Solutions in Geographic Distributed Scenarios: Why Should I Care?

Part 2: MySQL High Availability On-Premises: A Geographically Distributed Scenario

The Problem

A question I often get from customers is: How do I achieve high availability in case if I need to spread my data in different, distant locations? Can I use Percona XtraDB Cluster?

Percona XtraDB Cluster (PXC), mariadb-cluster or MySQL-Galera are very stable and well-known solutions to improve MySQL high availability using an approach based on multi-master data-centric synchronous data replication model. Which means that each data-node composing the cluster MUST see the same data, at a given moment in time.

Information/transactions must be stored and visible synchronously on all the nodes at a given time. This is defined as a tightly coupled database cluster. This level of consistency comes with a price, which is that nodes must physically reside close to each other and cannot be geographically diverse.

This is by design (in all synchronous replication mechanisms). This also has to be clarified over and over throughout the years. Despite that we still see installations that span across geographic locations, including AWS Regions.

We still see some solutions breaking the golden rule of proximity, and trying to break the rules of physics as well. The problem/mistake is not different for solutions based on-premises or in the cloud (for whatever cloud provider).

Recently I had to design a couple of customer solutions based on remote geographic locations. In both cases, the customer was misled by an incorrect understanding of how the synchronous solution works, and from a lack of understanding of the network layer. I decided I need to cover this topic again, as I have done previously in Galera geographic replication and Effective way to check network connection in a geographically distributed environment 

What Happen When I Put Things on the Network?

Well, let’s start with the basics.

While light travels at 300 million meters per second, the propagation of the electric fields or electric signaling is slower than that.

The real speed depends by the medium used to transmit it. But it can be said that the real speed normally spans from 0% to 99% of light-speed (depending on the transmission medium).

This means that in optimal conditions the signal travels at approximately 299.72Km per millisecond, in good/mid condition about half that at 149.86Km per millisecond, and in bad conditions it could be 3Km per millisecond or less.

To help you understand, the distance between Rome (Italy) and Mountain View (California) is about 10,062Km. At light-speed it will take 33.54ms. In good conditions (90% of light-speed) the signal will take 37.26ms to reach Mountain View, and in less optimal conditions it can easily double to 74.53 ms.

Keep in mind this is the electric field propagation speed: optimal conditions with no interruption, re-routing and so on. Reality will bring all the kind of interruptions, repeaters and routing.

All the physics above works as a baseline. On top of this, each human construct adds functionalities, flexibility and (unfortunately) overhead – leading to longer times and slower speeds.

The final speed will be different than the simple propagation of the electric fields. It will include the transmission time of complex signaling using ICMP protocol, or even higher delays with the use of a very complex protocol like TCP/IP, which includes handshaking, package rerouting, re-sending and so on. On top of that, when sending things over the internet, we need to realize that it is very improbable we will be the only user sending data over that physical channel. As such, whatever we have “on the road” will need to face bandwidth limitation, traffic congestion and so on.

I had described the difference between protocols (ICMP – TCP/IP) hereclarifying how the TCP/IP scenario is very different from using different protocols like ICMP, or the theoretical approach.

What it all means is that we cannot trust the theoretical performance. We must move to a more empirical approach. But we must understand the right empirical approach or we will be misled.

An Example

I recently worked on a case where a customer had two data centers (DC) at a distance of approximately 400Km, connected with “fiber channel”. Server1 and Server2 were hosted in the same DC, while Server3 was in the secondary DC.

Their ping, with default dimension, to Server3 was ~3ms. Not bad at all, right?

We decided to perform some serious tests, running multiple sets of tests with netperf for many days collecting data. We also used the data to perform additional fine tuning on the TCP/IP layer AND at the network provider.

The results produced a common (for me) scenario (not so common for them):


The red line is the first set of tests BEFORE we optimized. The yellow line is the results after we optimized.

The above graph reports the number of transactions/sec (AVG) we could run against the different dimension of the dataset and changing the destination server. The full roundtrip was calculated.

It is interesting to note that while the absolute numbers were better in the second (yellow) tests, this was true only for a limited dataset dimension. The larger the dataset, the higher the impact. This makes sense if you understand how the TCP/IP stack works (the article I mentioned above explains it).

But what surprised them were the numbers. Keeping aside the extreme cases and focusing instead on the intermediate case, we saw that shifting from a 48k dataset dimension to 512K hugely dropped the performance. The drop for executed transactions was from 2299 to 219 on Server2 (same dc) and from 1472 to 167 Server3 (different DC).

Also, note that Server3 only managed ~35% fewer transactions comparing to Server2 from the start given the latency. Latency moved from a more than decent 2.61ms to 27.39ms for Server2 and 4.27ms to 37.25ms for Server3.

37ms latency is not very high. If that had been the top limit, it would have worked.

But it was not.

In the presence of the optimized channel, with fiber and so on, when the tests were hitting heavy traffic, the congestion was such to compromise the data transmitted. It hit a latency >200ms for Server3. Note those were spikes, but if you are in the presence of a tightly coupled database cluster, those events can become failures in applying the data and can create a lot of instability.

Let me recap a second the situation for Server3:

We had two datacenters.

  • The connection between the two was with fiber
  • Distance Km ~400, but now we MUST consider the distance to go and come back. This because in case of real communication, we have not only the send, but also the receive packages.
  • Theoretical time at light-speed =2.66ms (2 ways)
  • Ping = 3.10ms (signal traveling at ~80% of the light speed) as if the signal had traveled ~930Km (full roundtrip 800 Km)
  • TCP/IP best at 48K = 4.27ms (~62% light speed) as if the signal had traveled ~1,281km
  • TCP/IP best at 512K =37.25ms (~2.6% light speed) as if the signal had traveled ~11,175km

Given the above, we have from ~20%-~40% to ~97% loss from the theoretical transmission rate. Keep in mind that when moving from a simple signal to a more heavy and concurrent transmission, we also have to deal with the bandwidth limitation. This adds additional cost. All in only 400Km distance.

This is not all. Within the 400km we were also dealing with data congestions, and in some cases the tests failed to provide the level of accuracy we required due to transmission failures and too many packages retry.

For comparison, consider Server2 which is in the same DC of Server1. Let see:

  • Ping = 0.027ms that is as if the signal had traveled ~11km light-speed
  • TCP/IP best at 48K = 2.61ms as if traveled for ~783km
  • TCP/IP best at 512K =27.39ms as if traveled for ~8,217km
  • We had performance loss, but the congestion issue and accuracy failures did not happen.

You might say, “But this is just a single case, Marco, you cannot generalize from this behavior!”

You would be right IF that were true (but is not).

The fact is, I have done this level of checks many times and in many different environments. On-premises or using the cloud. Actually, in the cloud (AWS), I had even more instability. The behavior stays the same. Please test it yourself (it is not difficult to use netperf). Just do the right tests with RTT and multiple requests (note at the end of the article).

Anyhow, what I know from the tests is that when working INSIDE a DC with some significant overhead due to the TCP/IP stack (and maybe wrong cabling), I do not encounter the same congestion or bandwidth limits I have when dealing with an external DC.

This allows me to have more predictable behavior and tune the cluster accordingly. Tuning that I cannot do to cover the transmission to Server3 because of unpredictable packages behavior and spikes. >200ms is too high and can cause delivery failures.

If we apply the given knowledge to the virtually-synchronous replication we have with Galera (Percona XtraDB Cluster), we can identify that we are hitting the problems well-explained in Jay’s article Is Synchronous Replication right for your appThere, he explains Callaghan’s Law: [In a Galera cluster] a given row can’t be modified more than once per RTT. 

On top of that, when talking of geographical disperse solutions we have the TCP/IP magnifying the effect at writeset transmission/latency level. This causes nodes NOT residing on the same physical contiguous network delay for all the certification-commit phases for an X amount of time.

When X is predictable, it may range between 80% – 3% of the light speed for the given distance. But you can’t predict the transmission-time of a set of data split into several datagrams, then sent on the internet, when using TCP/IP. So we cannot use the X range as a trustable measure.

The effect is unpredictable delay, and this is read as a network issue from Galera. The node can be evicted from the cluster. Which is exactly what happens, and what we experience when dealing with some “BAD” unspecified network issue. This means that whenever we need to use a solution based on tightly coupled database cluster (like PXC), we cannot locate our nodes at a distance that is longer than the largest RTT time of our shortest desired period of commit.

If our application must apply the data in a maximum of 200ms in one of its functions, our min RTT is 2ms and our max RTT is 250ms. We cannot use this solution, period. To be clear, locating a node on another geolocation, and as such use the internet to transmit/receive data, is by default a NO GO given the unpredictability of that network link.

I doubt that nowadays we have many applications that can wait an unpredictable period to commit their data. The only case when having a node geographically distributed is acceptable is if you accept commits happening in undefined periods of time and with possible failures.

What Is the Right Thing To Do?

The right solution is easier than the wrong one, and there are already tools in place to make it work efficiently. Say you need to define your HA solution between the East and West Coast, or between Paris and Frankfurt. First of all, identify the real capacity of your network in each DC. Then build a tightly coupled database cluster in location A and another tightly coupled database cluster in the other location B. Then link them using ASYNCHRONOUS replication.

Finally, use a tool like Replication Manager for Percona XtraDB Cluster to automatically manage asynchronous replication failover between nodes. On top of all of that use a tool like ProxySQL to manage the application requests.

The full architecture is described here.


The myth of using ANY solution based on tightly coupled database cluster on distributed geographic locations is just that: a myth. It is conceptually wrong and practically dangerous. It MIGHT work when you set it up, it MIGHT work when you test it, it MIGHT even work for some time in production.

By definition, it will break, and it will break when it is least convenient. It will break in an unpredictable moment, but because of a predictable reason. You did the wrong thing by following a myth.

Whenever you need to distribute your data over different geographic locations, and you cannot rely on a single physical channel (fiber) to connect the two locations, use asynchronous replication, period!


Sample test

test_log=/tmp/results_$(date +'%Y-%m-%d_%H_%M_%S').txt
exec 9>>"$test_log"
exec 2>&9
exec 1>&9
echo "$(date +'%Y-%m-%d_%H_%M_%S')" >&9
for ip in 11 12 13; do
  echo "  ==== Processing server 10.0.0.$ip === "
  for size in 1 48 512 1024 4096;do
    echo " --- PING ---"
    ping -M do -c 5  10.0.0.$ip -s $size
    echo "  ---- Record Size $size ---- "
    netperf -H 10.0.0.$ip -4 -p 3307 -I 95,10 -i 3,3 -j -a 4096 -A 4096  -P 1 -v 2 -l 20 -t TCP_RR -- -b 5 -r ${size}K,48K -s 1M -S 1M
    echo "  ---- ================= ---- ";
   echo "  ==== ----------------- === ";



MySQL High Availability On-Premises: A Geographically Distributed Scenario

On-Premises MySQL High Availability

MySQL High Availability

MySQL High Availability.

In this article, we’ll look at an example of an on-premises, geographically distributed MySQL high availability solution. It’s part of a longer series on some high availability reference architecture solutions over geographically distributed areas.

Part 1: Reference Architecture(s) for High Availability Solutions in Geographic Distributed Scenarios: Why Should I Care?

Percona consulting’s main aim is to identify simple solutions to complex problems. We try to focus on identifying the right tool, a more efficient solution, and what can be done to make our customers’ lives easier. We believe in doing the work once, doing it well and have more time afterward for other aspects of life.

In our journey, we often receive requests for help – some simple, some complicated.  


The company “ACME Inc.” is moving its whole business from a monolithic application to a distributed application, split into services. Each different service deals with the requests independently from each other. Some services follow the tightly-bounded transactional model, and others work/answer asynchronously. Each service can access the data storage layer independently.

In this context, ACME Inc. identified the need to distribute the application services over wide geographic regions, focusing on each region achieving scale independently.

The identified regions are:

  • North America
  • Europe
  • China

ACME Inc. is also aware of the fact that different legislation acts on each region. As such, each region requires independent information handling about sales policies, sales campaigns, customers, orders, billing and localized catalogs, but will share the global catalog and some historical aggregated data. While most of the application services will work feeding and reading local distributed caches, the basic data related to the catalog, sales and billing is based on an RDBMS.

Historical data is instead migrated to a “Big Data” platform, and aggregated data is elaborated and push to a DWH solution at HQ. The application components are developed using multiple programming languages, depending on the service.   

The RDBMS identified by ACME Inc. in collaboration with the local authorities was MySQL-oriented. There were several solutions like:

  • PostgreSQL
  • Oracle DB
  • MS SQL server

We excluded closed-source RDBMSs given that some countries imposed a specific audit plugin. This plugin was only available for the mentioned platforms. The cost of parallel development and subsequent maintenance in case of RDBMS diversification was too high. As such all the regions must use the same major RDBMS component.

We excluded PostgreSQL given that compared to the adoption of MySQL, utilization cases were higher and MySQL had a well-defined code producer. Finally, the Business Continuity team of ACME Inc., had defined an ITSC (Information Technology Service Continuity) plan that defined the RPO (Recovery Point Objective), the RTO (Recovery Time Objective) and system redundancy.

That’s it. To fulfill the ITSCP, each region must have the critical system redundantly replicated in the same region, but not on the proximity.

Talking About the Components

This is a not-so-uncommon scenario, and it also presents a lot of complexity if you try to address it with one solution. But let’s analyze it and see how we can simplify the approach while still meeting the needs and requirements of ACME Inc.

When using MySQL-based solutions, the answer to “what should we use?” is use what best fits your business needs. The “nines” availability reference table for the MySQL world (most RDBMSs) can be summarized below:

9 0. 0 0 0 % (36 days) MySQL Replication
9 9. 9 0 0 % (8 hours) Linux Heartbeat with DRBD (Obsolete DRBD)
9 9. 9 0 0 % (8 hours) RHCS with Shared Storage (Active/Passive)
9 9. 9 9 0 % (52 minutes) MHA/Orchestrator with at least three nodes
9 9. 9 9 0 % (52 minutes) DRBD and Replication (Obsolete DRBD)
9 9 .9 9 5 % (26 minutes) Multi-Master (Galera replication) 3 node minimum
9 9. 9 9 9 % (5 minutes) MySQL Cluster

An expert will tell you that it always doesn’t make sense to go for the most “nines” in the list. This because each solution comes with a tradeoff: the more high availability (HA) you get, the higher the complexity of the solution and in managing the solution.

For instance, the approach used in MySQL Cluster (NDB) makes this solution not suitable for generic utilization. It requires proper analysis of the application needs, data utilization and archiving before being selected. It also requires in-depth knowledge to properly manage the cluster, as it is more complex than other similar solutions.

This indirectly makes a solution based on MySQL+Galera replication the one with the highest HA level a better choice, since it is close to the defaults generalized utilizations. 

This is why MySQL+Galera replication has become in the last six years the most used solution for platform looking for very high HA, without the need to diverge from standard MySQL/InnoDB approach. You can read more about Galera replication: 

Read more about Percona XtraDB Cluster.

There are several distributions implementing Galera replication:

*Note that MariaDB Cluster/Server and all related solutions coming from MariaDB have significantly diverged from the MySQL mainstream. This often means that once migrated to MariaDB; your database will not be compatible with other MySQL solutions. In short, you are locked-in to MariaDB. It is recommended that you carefully evaluate the move to MariaDB before making that move.

Choosing the Components


Our advice is to use Percona XtraDB Cluster (PXC), because at the moment it is one of the most flexible and reliable and compatible solutions. PXC is composed of three main components:

The cluster is normally composed of three nodes or more. Each node can be used as a Master, but the preferred and recommended way is to use one node as a Writer and the other as Readers.

Application-wise, accessing the right node can be challenging since this means you need to be aware of which node is the writer, which is the reader, and be able to shift from one to the other if necessary.


To simplify this process, it helps to have an additional component that works as a “proxy” connecting the application layer to the desired node(s). The most popular solutions are:

  • HAProxy
  • ProxySQL

There are several important differences between the two. But in summary, ProxySQL is a Level 7 proxy and is MySQL protocol aware. So, while HAProxy is just passing the connection over as a forward proxy (level 4), ProxySQL is aware of what is going through it and acts as reverse proxy. 

With ProxySQL is possible to decide, based on several parameters, where to send traffic (read/write split and more), what must be stopped, or if we should rewrite an incoming SQL command. A lot of information is available on the ProxySQL website and on the Percona Database Performance Blog .


No RDBMS platform is safe without a well-tested procedure for backup and recovery. The Percona XtraDB Cluster package distribution comes with Percona XtraBackup as the default method for node provisioning. A good backup and restore (B/R) policy start from the consideration of ACME’s ITSCP, to have full and incremental backups, perfectly covering the RPO, and a good recovery procedure to keep the recovery time inside RTO whenever possible.

There are several tools that allow you to plan and execute backup/restore procedure, some coming from vendors other than open source and community-oriented. In respect to being a fully open source and community-oriented, we in consulting normally suggest using:

Pyxbackup is a wrapper around XtraBackup that helps simplify the B/R operations, including the preparation of a full and incremental set. This helps significantly reduce the recovery time.  

Disaster Recovery

Another very important aspect of the ITSC Plan is the capacity of the system to survive to major disasters. The disaster and recovery (DR) solution must be able to act as the main production environment. Therefore, it must be designed and scaled as the main production site in resources. It must be geographically separated, normally hundreds of kilometers or more. It must be completely independent of the main site. It must be as much as possible in sync with the main production site.

While the first three “musts” are easy to understand, the fourth one is often the object of misunderstanding.

The concept of be as much in sync with the production site as possible creates confusion in designing HA solutions with Galera replication involved. The most common misunderstanding is the misuse of the Galera replication layer. Mainly the conceptual confusion between tightly coupled database cluster and loosely coupled database cluster.

Any solution based on Galera replication is a tightly coupled database cluster, because the whole idea is to be data-centric, synchronously distributed and consistent. The price is that this solution cannot be geographically distributed.

Solutions like standard MySQL replication are instead loosely coupled database cluster and they are designed to be asynchronous. Given that, the nodes connected by it are completely independent in processing/apply the transaction, and the solution fits perfectly into ANY geographically distributed replication solution. The price is that data on the receiving front might not be up to date with the one from the source in that specific instant.

The point is that for the DR site the ONLY valid solution is the asynchronous link (loosely coupled database cluster), because by design and requirement the two sites must be separated by a significant number of kilometers. For better understanding about why synchronous replication cannot work in a geographically distributed scenario, see “Misuse of Geographic Node distribution with Galera-based replication“.

In our scenario, the use of Percona XtraDB Cluster helps to create a most robust asynchronous solution. This is because each tightly coupled database cluster, no matter if source or destination, will be seen by the other tightly coupled database cluster as a single entity.

What it means is that we can shift from one node to another inside the two clusters, still confident we will have the same data available and the same asynchronous stream passing from one source to the other.

To ensure this procedure is fully automated, we add to our architecture the last block: replication manager for Percona XtraDB Cluster ( RMfP is another open source tool that simplifies and automates failover inside each PXC cluster such that our asynchronous solution doesn’t suffer if the node is currently acting as Master fails.  

How to Link the Components

Summarizing all the different components of our solution:

  • Application stack
    • Load balancer
    • Application nodes by service
    • Distributed caching
    • Data access service
  • Database stack
    • Data proxy (ProxySQL)
    • RDBMS (Percona XtraDB Cluster)
    • Backup/Restore
      • Xtrabackup
      • Pyxbackup
      • Custom scripts
    • DR
      • Replication Manager for Percona XtraDB Cluster
  • Monitoring
    • PMM (not covered here see <link> for detailed information)


In the solution above, we have two locations separated by several kilometers. On top of them, the load balancer(s)/DNS resolution redirects the incoming traffic to the active site. Each site hosts a full application stack, and applications connect to local ProxySQL.

ProxySQL has read/write enabled to optimize the platform utilization, and is configured to shift writes from one PXC node to another in case of node failure. Asynchronous replication connects the two locations and transmits data from master to slave.

Note that with this solution, it is possible to have multiple geographically distributed sites.

Backups are taken at each site independently and recovery test is performed. RMfP oversees and modifies the replication channels in the case of a node failure.

Finally, Percona Monitoring and Management (PMM) is in place to perform in-depth monitoring of the whole database platform.


We always look for the most efficient, manageable, user-friendly combination of products, because we believe in providing and supporting the community with simple but efficient solutions. What we have presented here is the most robust and stable high availability solution in the MySQL space (except for MySQL NDB that we have excluded). 

It is conceptualized to provide maximum service continuity, with limited bonding between the platforms/sites. It also is a well-tested solution, that has been adopted and adapted in many different scenarios where performance and real HA are a must. I have preferred to keep this digression at a high level, given the details of the implementation have already been discussed elsewhere (see reference section for more reading).

Still, Percona XtraDB Cluster (as any other solution implementing Galera replication) might not fit the final use. Given that, it is important to understand where it does and doesn’t fit. This article is a good summary with examples: Is Synchronous Replication right for your app?.

Check out the next article on How Not to do MySQL High Availability.




This Week in Data with Colin Charles 26: Percona Live Schedule is Near Completion, FOSDEM Underway and a Percona Toolkit Use Case

Colin Charles

Colin CharlesJoin Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Percona Live Santa Clara 2018 update: tutorials have been picked, and the schedule/press release should be announced by next week. We’ve (the committee) rated over 300+ talks, and easily 70% of the schedule should go live next week as well.

There’s a lot happening for FOSDEM this week — so expect a longer report of some sort next week.

A friend, Yanwei Zhou, DBA at Qunar in China, gave an excellent presentation in Chinese on how they use Percona Toolkit. Check it out:

Are you on Twitter? Hope you’re following the @planetmysql account.


Link List

Upcoming appearances


I look forward to feedback/tips via e-mail at or on Twitter @bytebot.


Upcoming HA Webinar Wed 6/21: Percona XtraDB Cluster, Galera Cluster, MySQL Group Replication

High Availability

High AvailabilityJoin Percona’s MySQL Practice Manager Kenny Gryp and QA Engineer, Ramesh Sivaraman as they present a high availability webinar around Percona XtraDB Cluster, Galera Cluster, MySQL Group Replication on Wednesday, June 21, 2017 at 10:00 am PDT / 1:00 pm EDT (UTC-7).

What are the implementation differences between Percona XtraDB Cluster 5.7, Galera Cluster 5.7 and MySQL Group Replication?

  • How do they work?
  • How do they behave differently?
  • Do these methods have any major issues?

This webinar will describe the differences and shed some light on how QA is done for each of the different technologies.

Register for the webinar here.

High AvailabilityRamesh Sivaraman, QA Engineer

Ramesh joined the Percona QA Team in March 2014. He has almost six years of experience in database administration and, before joining Percona, was giving MySQL database support to various service and product based internet companies. Ramesh’s professional interests include writing shell/Perl script to automate routine tasks and new technology. Ramesh lives in Kerala, the southern part of India, close to his family.

High AvailabilityKenny Gryp, MySQL Practice Manager

Kenny is currently MySQL Practice Manager at Percona.


Clarification on “Call me Maybe: MariaDB Galera Cluster”

Recently Aphyr (Kyle Kingsbury) published

The article is technically valid, I am not going to dispute a conclusion Aphyr made, but it is also quite technically involved, so users who just jump to conclusion may get the wrong impression and we’re left with more questions than ever.

So, let me state what is the real conclusion of this article:
“Galera cluster does not support SNAPSHOT ISOLATION LEVEL, in contract to what was stated in the documentation”.
Following that conclusion is using Galera cluster may result in “corrupted” data.

I do not quite like the usage of the word “corrupted” here. For me, the more correct word be to use is “inconsistent”.

So with this clarification, the Aphyr’s conclusion that Galera Cluster (and it affects both MariaDB Galera and Percona XtraDB Cluster products)
does not support SNAPSHOT ISOLATION and may leave data in inconsistent state is valid.
But there I need to add quite IMPORTANT addition: it may leave data in inconsistent state
if you use SPECIAL TYPE of transactions in default isolation levels that Aphyr uses in his test.
Moreover, if we test the same workload on a simple single instance InnoDB, we will get the same result.

Before getting too scary of “inconsistent data”, let’s review what kind of transactions are used and what are practical implications.

Aphyr uses following logic:
Assume we have a table

CREATE TABLE `accounts` (
  `account_id` int(11) NOT NULL,
  `balance` int(11) NOT NULL,
  PRIMARY KEY (`account_id`)

We have N rows in table accounts, and each row populated with initial balance 100.
That results in SUM(balance) FROM accounts == 100*N

Application logic: Execute following transactions concurrently:

BEGIN (start transactions)
// in application: renerate random $fromAccount in a range [1;N]
SELECT balance FROM accounts WHERE account_id=$fromAccount
// in application read balance into $from_balance variable
// in application: renerate random $toAccount in a range [1;N], but != $fromAccount
SELECT balance FROM accounts WHERE account_id=$toAccount
// in application read balance into $to_balance variable
// in application, generate random amount to move from one account to another, as $moveAmt
// in application, calculate new balance for account1: $newBalance1=$from_balance-$moveAmt
// in application, calculate new balance for account2: $newBalance2=$to_balance+$moveAmt
// execute queries:
UPDATE account SET balance=$newBalance1 WHERE account_id=$fromAccount
UPDATE account SET balance=$newBalance2 WHERE account_id=$toAccount

As you see it includes some application logic, so on database side, the transactions looks like:
(assuming we move 25 from account 5 to 8)

SELECT balance FROM accounts WHERE account_id=5
SELECT balance FROM accounts WHERE account_id=8
UPDATE account SET balance=75 WHERE account_id=5
UPDATE account SET balance=125 WHERE account_id=8

Aphyr’s proves that these transactions executed concurrently should keep balances consistent (that is SUM(balances)==N*100) if database support SNAPSHOT ISOLATION or SERIALIZABLE isolation levels.
In his test he shows that running on Galera cluster these transactions executing concurrently results in inconsistent balance, therefore Galera cluster does not support SNAPSHOT ISOLATION level.

This however is totally expected.
Moreover, if you try this test on a single server against InnoDB in REPEATABLE-READ (default) mode,
you also will end up in inconsistent state (you can find my code here:

This is because how InnoDB handles REPEATABLE-READ mode (one may argue that InnoDB’s REPEATABLE-READ is weaker
than standard defined REPEATABLE-READ, and it is more closer to READ-COMMITED. This is a good opportunity for Asphyr to start another FUD “Call Me Maybe: InnoDB”). In simplified terms, InnoDB executes reads in repeatable-read mode, and writes or locked-read in read-committed mode.

What does it mean from practical standpoint?
From my opinion these transactions are little bit artificial (although are totally valid).

If you use this in a real life, the more obvious way to write these transactions is:

SELECT balance FROM accounts WHERE account_id=5
SELECT balance FROM accounts WHERE account_id=8
UPDATE account SET balance=balance-25 WHERE account_id=5
UPDATE account SET balance=balance+25 WHERE account_id=8

If you do this, it will NOT produce an inconsistent state.

Another way to handle this (for a single server InnoDB) is to use SERIALIZABLE isolation level (with an expected performance penalty).
Unfortunately Galera Cluster does not support SERIALIZABLE isolation level, as it does not pass read-set between nodes,
and node communication happens on COMMIT stage.

Third way, MySQL also provides an extension: SELECT .. FOR UPDATE statements to handle cases exactly like these.
So if you want to keep REPEATABLE-READ and original transactions, you will need to rewrite this as

SELECT balance FROM accounts WHERE account_id=5 FOR UPDATE;
SELECT balance FROM accounts WHERE account_id=8 FOR UPDATE;
UPDATE account SET balance=75 WHERE account_id=5
UPDATE account SET balance=125 WHERE account_id=8

This will result in a consistent state for table accounts and will work for both single InnoDB and multi-node Percona XtraDB Cluster deployments.

One thing to remember, that with Percona XtraDB Cluster you may get a DEADLOCK error trying to execute "COMMIT" statement, so your application should be ready to handle this error, rollback and repeat the transaction if needed.

So in conclusion of my post:

Using transactions described in may result in inconsistent state (not in data corrupted state!), for both Galera Cluster and single instance InnoDB. But this is because these transactions do not use properly InnoDB’s REPETABLE-READ isolation level. To reflect InnoDB’s requirement we need to use “SELECT ... FOR UPDATE” or to rewrite transactions in a described way.

UPDATE 18-Sep-2015.
Based on Twitter comments and comments from, I would like to add following.

The post Clarification on “Call me Maybe: MariaDB Galera Cluster” appeared first on MySQL Performance Blog.


MariaDB: Selective binary logs events

In the first post in a series on MariaDB features we find interesting, we begin with selectively skipping replication of binlog events. This feature is available on MariaDB 5.5 and 10.

By default when using MySQL’s standard replication, all events are logged in the binary log and those binary log events are replicated to all slaves (it’s possible to filter out some schema). But with this feature, it’s also possible to bypass some events to be replicated on the slave(s) even if they are written in the binary log. Having those event in the binary logs is always useful for point-in-time recovery.

Indeed, usually when we need to not replicate an event, we set sql_log_bin = 0 and the event is bypassed: neither written into the binlog, neither replicated to slave(s).

So with this new feature, it’s possible to just set a session variable to tag events that will be written into the binary log and bypassed on demand on some slaves.

And it’s really easy to use, on the master you do:

set skip_replication=1;

and on the slave(s) having replicate_events_marked_for_skip='FILTER_ON_MASTER' or 'FILTER_ON_SLAVE' the events skipped on the master won’t be replicated.

The valid values for replicate_events_marked_for_skip are:

  • REPLICATE (default) : skipped events are replicated on the slave
  • FILTER_ON_SLAVE : events so marked will be skipped on the slave and not replicated
  • FILTER_ON_MASTER : the filtering will be done on the master so the slave won’t even receive it and then save network bandwidth

That’s a cool feature but when this can be very useful?

Use case:

For archiving this can be very interesting. Indeed most of the time when people is archiving data, they use something like pt-archiver that deletes the data and copy the removed data on an archive server.

Thanks to this feature, instead of having an archiving server where we copy the deleted data, it’s possible to have a slave where we won’t delete the data. This will be much faster (smarter?) and allows to have an archiving server always up to date. Of course in this case sql_log_bin = 0 would have worked (if we ignore the point-in-time recovery).
But with a Galera Cluster? Yes that’s where this feature is really cool, if we would have used sql_log_bin = 0 on a Galera Cluster node, all other nodes would have ignored the delete and the result would be inconsistency between the nodes.

So if you use an asynchronous slave as an archiving server of a Galera Cluster, this feature is really mandatory.

As illustrated below, you can have a MariaDB Galera Cluster node joining a Percona XtraDB Cluster that will be used to delete historical data using pt-archiver:


pt-archiver is started with --set-vars "skip_replication=1"

The post MariaDB: Selective binary logs events appeared first on MySQL Performance Blog.


A schema change inconsistency with Galera Cluster for MySQL

I recently worked on a case where one node of a Galera cluster had its schema desynchronized with the other nodes. And that was although Total Order Isolation method was in effect to perform the schema changes. Let’s see what happened.


For those of you who are not familiar with how Galera can perform schema changes, here is a short recap:

  • Two methods are available depending on the value of the wsrep_OSU_method setting. Both have benefits and drawbacks, it is not the main topic of this post.
  • With TOI (Total Order Isolation), a DDL statement is performed at the same point in the replication flow on all nodes, giving strong guarantees that the schema is always identical on all nodes.
  • With RSU (Rolling Schema Upgrade), a DDL statement is not replicated to the other nodes. Until the DDL statement has been executed on all nodes, the schema is not consistent everywhere (so you should be careful not to break replication).

You can look at the official document here.

If you read carefully the section on TOI, you will see that “[…] TOI transactions will never fail certification and are guaranteed to be executed.” But also that “The system replicates the TOI query before execution and there is no way to know whether it succeeds or fails. Thus, error checking on TOI queries is switched off.”

Confusing? Not really. It simply means that with TOI, a DDL statement will always pass certification. But if for some reason, the DDL statement fails on one of the nodes, it will not be rolled back on the other nodes. This opens the door for schema inconsistencies between nodes.

A test case

Let’s create a table on a 3-node Percona XtraDB Cluster 5.6 cluster and insert a few rows:

pxc1> create table t (id int not null auto_increment primary key, c varchar(10));
pxc1> insert into t (c) values ('aaaa'),('bbbb');

Then on node 3, let’s introduce a schema change on t that can make other schema changes fail:

pxc3> set global wsrep_OSU_method=RSU;
pxc3> alter table t add d int;
pxc3> set global wsrep_OSU_method=TOI;

As the schema change was done on node 3 with RSU, it is not replicated to the other nodes.

Now let’s try another schema change on node 1:

pxc1> alter table t add d varchar(10);
Query OK, 0 rows affected (0,14 sec)
Records: 0  Duplicates: 0  Warnings: 0

Apparently everything went well and indeed on node 1 and 2, we have the correct schema:

pxc2>show create table tG
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE t (
  c varchar(10) DEFAULT NULL,
  d varchar(10) DEFAULT NULL,

But on node 3, the statement failed so the schema has not been changed:

pxc3> show create table tG
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE t (
  c varchar(10) DEFAULT NULL,
  d int(11) DEFAULT NULL,

The error is visible in the error log of node 3:

2014-07-18 10:37:14 9649 [ERROR] Slave SQL: Error 'Duplicate column name 'd'' on query. Default database: 'repl_test'. Query: 'alter table t add d varchar(10)', Error_code: 1060
2014-07-18 10:37:14 9649 [Warning] WSREP: RBR event 1 Query apply warning: 1, 200
2014-07-18 10:37:14 9649 [Warning] WSREP: Ignoring error for TO isolated action: source: 577ffd51-0e52-11e4-a30e-4bde3a7ad3f2 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 3 trx_id: -1 seqnos (l: 17, g: 200, s: 199, d: 199, ts: 7722177758966)

But of course it is easy to miss. And then a simple INSERT can trigger a shutdown on node3:

pxc2> insert into t (c,d) values ('cccc','dddd');
Query OK, 1 row affected (0,00 sec)

will trigger this on node 3:

2014-07-18 10:42:27 9649 [ERROR] Slave SQL: Column 2 of table 'repl_test.t' cannot be converted from type 'varchar(10)' to type 'int(11)', Error_code: 1677
2014-07-18 10:42:27 9649 [Warning] WSREP: RBR event 3 Write_rows apply warning: 3, 201
2014-07-18 10:42:27 9649 [Warning] WSREP: Failed to apply app buffer: seqno: 201, status: 1
	 at galera/src/trx_handle.cpp:apply():340
2014-07-18 10:42:27 9649 [Note] WSREP: Received NON-PRIMARY.
2014-07-18 10:42:27 9649 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 201)
2014-07-18 10:42:27 9649 [Note] WSREP: Received self-leave message.
2014-07-18 10:42:27 9649 [Note] WSREP: Flow-control interval: [0, 0]
2014-07-18 10:42:27 9649 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2014-07-18 10:42:27 9649 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 201)
2014-07-18 10:42:27 9649 [Note] WSREP: RECV thread exiting 0: Success
2014-07-18 10:42:27 9649 [Note] WSREP: recv_thread() joined.
2014-07-18 10:42:27 9649 [Note] WSREP: Closing replication queue.
2014-07-18 10:42:27 9649 [Note] WSREP: Closing slave action queue.
2014-07-18 10:42:27 9649 [Note] WSREP: bin/mysqld: Terminated.


As on regular MySQL, schema changes are challenging with Galera. Some subtleties can create a lot of troubles if you are not aware of them. So before running DDL statement, make sure you fully understand how TOI and RSU methods work.

The post A schema change inconsistency with Galera Cluster for MySQL appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by