Typical misconceptions on Galera Replication for MySQL

Typical misconceptions on GaleraEven if a Galera node looks like a regular MySQL server, the underlying replication mechanism is very different. This implies some changes in the way you have to configure the Galera nodes. Here are some of the most common misconceptions about Galera when using Percona XtraDB Cluster.

Q: Why should I enable binary logging as it is not needed by Galera replication?
Unlike for regular asynchronous MySQL replication, it is true that you don’t need to enable binary logging to use Galera replication. However what if someone runs an accidental DROP TABLE?

In this case, the statement will be replicated immediately on all nodes. Then your main option to recover lost data is to use a backup. But if binary logging is not enabled, all changes to the table after the backup will be lost.

What if you have an asynchronous slave that is intentionally delayed? This is another good option to quickly recover the lost data, but to be able to set up an async slave, you will also have to enable binary logging!

So don’t forget to add on all your nodes:

binlog_format = ROW

Q: If I set innodb_flush_log_at_trx_commit = 2, I may lose data in some cases, right?
For a regular MySQL master, it is recommended to set innodb_flush_log_at_trx_commit = 1 because it is the only way to make sure that every committed transaction is stored permanently on disk. The main drawback is that it can slow down writes a lot because it involves one fsync at each commit.

With Galera, the difference is that commit is synchronous: this means that a transaction is committed on node #1, it has already been replicated to all the other nodes (it has not necessarily been executed on the remote nodes though).

As Galera implements cluster-wide durability, there is no need to have single server durability and you can safely use innodb_flush_log_at_trx_commit = 2 on all nodes.

Actually this is not totally accurate… For instance if all nodes lose power at the same time, you may lose some transactions. The likeliness of such a failure is related to how nodes are hosted: with each node in a separate datacenter, this is very unlikely. But with 3 nodes being VMs on a single physical host, it could well happen from time to time.

Q: The network RTT between my servers is 100ms. I know that each commit takes at least as much time as a network RTT so can I only hope to run 10 writes/s?

This needs some clarification: commit is synchronous because the full transaction is replicated to all nodes when it is committed. However you pay the same price whether the transaction has a single statements or several statements.

So if all your writes are auto-committed transactions, each write will trigger a commit that will need at least one RTT to complete. If RTT is 100ms, that means 10 writes/s.

But if your transactions have 10 writes, you will only need a commit every 10 writes: with 10 commits/s, you can now run 100 writes/s.

And finally several threads can commit at the same time, increasing the write throughput. With 10 concurrent threads executing transactions with 10 statements, you get 1000 writes/s.

Of course, this is only theory. In the real world, you will probably not get so perfectly aligned numbers. You can look at this post (Comparing Percona XtraDB Cluster with Semi-Sync replication Cross-WAN) to see real numbers with 1 thread and 32 threads.

Conclusion: network latency is a limiting factor for write throughput, that’s true. But that may not be as bad as you can think. And remember Callaghan’s law: “In a Galera cluster a given row can’t be modified more than once per RTT”.

I want to write on all nodes to get write scalability. Is it a good idea?

Let me start by stating that Galera cannot be a true solution for scale writes: the reason is simply that all writes need to be applied on all nodes.

But Galera brings a limited form of write scalability when writing concurrently to several nodes because:

  • Write sets can be applied in parallel on the remote nodes.
  • Galera is using row-based replication, so applying the replicated events can be faster than executing the original write.

There is a caveat however when writing on several nodes: because of optimistic locking, concurrent transactions on different nodes can lead to write conflicts. In this case Galera will roll back one of the transactions, and it is up to the application to retry executing the transaction.

Galera Replication is a nice technology that can help solve challenges around high availability. But a basic understanding of how it works is useful as it can avoid frustration or unreasonable expectations.

The post Typical misconceptions on Galera Replication for MySQL appeared first on MySQL Performance Blog.


Innotop: A real-time, advanced investigation tool for MySQL

GUI monitoring tools for MySQL are not always suitable for all our needs or situations. Most of them are designed to provide historical views into what happens to our database over time rather then real-time insight into current MySQL server status. Excellent free tools for this include Cacti, Zabbix, Ganglia, Nagios, etc. But each of them needs to be properly configured to provide details on what is going on in our MySQL instances. And setting up one of these monitoring solutions is neither quick nor trivial (well, maybe with the exception of Ganglia).

MySQL Workbench provides a substitute for real-time MySQL status views, though it is very limited. It also needs to be installed on a workstation with a graphical environment.

The fact is that you won’t find many text-mode tools of this kind for MySQL. The most widely known are Innotop and mytop. The latter unfortunately seems to be an abandon project as the last significant update happened in 2007. Additionally, mytop offers way less functionality then Innotop.


Innotop can be really handy when you need a quick and easy tool that can provide a lot of details on what MySQL is doing – without using difficult queries to get those details manually. For our Percona Support customers, using Innotop will often be much easier then running complex SHOW statements and filtering their output in order to get some simple facts like slave replication lag, number of busy threads or InnoDB history list length. Also for us, Support engineers, it’s sometimes quicker to get a fast system overview with Innotop before we do deeper investigation using SHOW statements, Information_schema or more complex ways.

But Innotop is even more then that! In this article I will show what are my favorite features of this little tool, that you probably were not aware of. It is not my intention to show all the functionality here, so to get quick idea of the base tool capabilities, check the sample screenshots.


Let’s start from quick configuration examples to demonstrate how you can get the tool up and running in 5 minutes. Btw. I think configuration examples can be really useful as documentation is bit unclear about some of it’s aspects.

After we are successful with installing latest Innotop, simply type ‘innotop’ command in our terminal and this is what most likely you will see:

[RO] Dashboard (? for help)                   localhost, 0.00, 0 QPS, 0/0/0 con/run/cac thds, !localhost
Uptime  MaxSQL  ReplLag  QPS  Cxns  Run  Miss  Lock  Tbls  Repl  SQL
localhost: Access denied for user 'root'@'localhost' (using password: NO)

There are actually at least three ways to establish first working connection to MySQL server(s) for Innotop.

The simplest way to just start working is to run it by using similar host/user/password parameters as you do for basic mysql client:

innotop -h -u root -p mysecret

Another way is to run innotop and configure connection(s) from inside. Quick help screen (invoked with ‘?’ key) tells us: “@ Select/create server connections”, so let’s press ‘@’:

You can enter the name of a new connection to create it.
____________________________ Choose from ____________________________
localhost  DBI:mysql:;host=localhost;mysql_read_default_group=client
Choose connections for this mode:

As we can see there is already ‘localhost’ one created automatically. We can now select this one, or create a new one. However, after you enter ‘localhost’ – it just continues to use current connection and gets you back to previous screen. I have no idea if you can actually edit this existing ‘localhost’ connection, so let’s just create a new one:

Choose connections for this mode: local
There is no connection called 'local'. Create it?: y

Next step:

Typical DSN strings look like
The db and port are optional and can usually be omitted.
If you specify 'mysql_read_default_group=mysql' many options can be read
from your mysql options files (~/.my.cnf, /etc/my.cnf).
Enter a DSN string: DBI:mysql:;host=localhost

so the only thing I put in this ‘connection wizard’s’ step was “DBI:mysql:;host=localhost”, but if you have some custom TCP port or socket, you must specify it here.
Next step:

Optional: enter a table (must not exist) to use when resetting InnoDB deadlock information: test.deadarch
Do you want to specify a username for local?: y
Do you want to specify a password for local?: y
Enter username for local: root
Enter password for 'root' on local:
Save password in plain text in the config file?: y

And after that step, we are immediately back in our Dashboard view, but now it is actually working:

[RO] Dashboard (? for help)                    local, 1h59m, 0.40 QPS, 2/1/0 con/run/cac thds, 5.5.33-31.1-log
Uptime  MaxSQL  ReplLag  QPS   Cxns  Run  Miss  Lock  Tbls  Repl  SQL
 1h59m                   0.40     2       0.00     0    26  Off

This way is for sure much more complicated and longer then just running with -h -u -p parameters, so why bother?
Also, if we quit the tool, and run again – the set up connection is lost!
OK, I’ll show you later why it is actually good to set up connection this way.
First, we make sure the set up connection will be saved! To make so, run it like this:

innotop --write

Then configure the connection again and quit the tool. Now Innotop has created configuration files for us:

[root@centos6-2 ~]# ls .innotop/
innotop.conf plugins

We edit the .innotop/innotop.conf file and… it’s huge! All options inside… Check documentation for all of their meaning, I will only focus on my favourite ones.
For us the most important will be this section:

local=user=root have_user=1 pass=mypaSS have_pass=1 dsn=DBI:mysql:;host=localhost savepass=1
localhost=user= dsn=DBI:mysql:;host=localhost;mysql_read_default_group=client dl_table=test.innotop_dl

Actually, using the connection wizard was useful for me only to create initial first connection definition, which we can use as an example to create connections to more MySQL instances. It is just hard to create it manually just by reading the documentation.
We can see both the default ‘localhost’ connection which was there already, and our new ‘local’ where we did specify connection credentials. Now feel free to remove the line with ‘localhost’ one, we don’t need it.

Another important section is [active_connections] where each tool’s view can have assigned connection name. For example:
means that ‘local’ is default connection for Dashboard view. If we only have one connection configured in Innotop, after we enter another mode, it will get automatically assigned in configuration.

Monitoring Many Hosts

OK, since we now have a new configuration template ready, let’s manually configure more connections, which will be much faster then using the ‘wizard’.
For my tests, I used example replication sandbox (set up in 1 minute with mysqlsandbox), so connections will look like this:

master56=user=msandbox have_user=1 pass=msandbox have_pass=1 dsn=DBI:mysql:;host=;port=19479; savepass=1
slave156=user=msandbox have_user=1 pass=msandbox have_pass=1 dsn=DBI:mysql:;host=;port=19480; savepass=1
slave256=user=msandbox have_user=1 pass=msandbox have_pass=1 dsn=DBI:mysql:;host=;port=19481; savepass=1

Cool, now I can switch between each server for a given view mode using ‘n’ key, or choose a connection from the list after pressing ‘@’.
This is how an example [Q]uery view for slave256 looks like:

[RO] Query List (? for help)                       slave256, 23h, 10.93k QPS, 8/5/0 con/run/cac thds, 5.6.14-log
When   Load  Cxns  QPS     Slow  Se/In/Up/De%  QCacheHit  KCacheHit  BpsIn    BpsOut
Now    0.00     6  10.93k     0  99/ 0/ 0/ 0       0.00%    100.00%    1.05M  787.03k
Total  0.00   151   8.72      0  92/ 0/ 0/ 0       0.00%    100.00%  977.10     3.14k
Cmd      ID      State               User      Host           DB      Time      Query
Connect       2  Slave has read all  system u                         23:24:35
Execute      23  Sending data        msandbox  localhost      sbtest     00:00  SELECT count(k) FROM sbtest WHERE k BETWEEN ? AND ?
Execute      24  Sending data        msandbox  localhost      sbtest     00:00  SELECT count(k) FROM sbtest WHERE k BETWEEN ? AND ?
Execute      25  preparing           msandbox  localhost      sbtest     00:00  SELECT count(k) FROM sbtest WHERE k BETWEEN ? AND ?
Execute      26  Sending data        msandbox  localhost      sbtest     00:00  SELECT count(k) FROM sbtest WHERE k BETWEEN ? AND ?

Server Groups

But that’s not enough! Maybe you have 100 servers, in 10 master->slave(s) groups – you can watch them all using single Innotop instance in a sensible way! All thanks to server groups functionality. From the defined connections, you can easily configure groups and for almost each view mode – you can either choose a single connection or a group. Then switching between groups is as easy as pressing TAB. There is information in the top right corner of the screen telling us what server group or what single server we are watching at.

So let’s create our first group:

56=master56 slave156 slave256

Now we can see all the servers in the group in the same screen, which for some view modes makes more in other less sense. I am sure you will like the group view for replication mode.
Below is an example when one of the slaves is lagging. We can see 52s lag on slave1, I/O threads are keeping up on both slaves with master (pos 596814971), but what I am missing here is the Exec_Master_Log_Pos info for slaves. But the information is enough to see it’s the SQL thread on slave2 not keeping up.

[RO] Replication Status (? for help)                                                   Servers: 56
__________________________ Slave SQL Status ___________________________
CXN       Master     On?  TimeLag  Catchup  Temp  Relay Pos  Last Error
slave156  Yes    00:52     0.00     0  510811815
slave256  Yes    00:00     0.00     0  596815181
________________________________________ Slave I/O Status _________________________________________
CXN       Master     On?  File              Relay Size  Pos        State
slave156  Yes  mysql-bin.000003     569.17M  596814971  Waiting for master to send event
slave256  Yes  mysql-bin.000003     569.17M  596814971  Waiting for master to send event
__________________ Master Status __________________
CXN       File              Position   Binlog Cache
master56  mysql-bin.000003  596814971        99.82%
slave156  mysql-bin.000002  510811605        99.80%
slave256  mysql-bin.000002  596814971        99.82%

One press key (I) and we are in the InnoDB IO threads view, where there we can see one important detail difference about slave1 – it’s doing much more fsyncs and write operations then two other servers.

[RO] InnoDB I/O Info (? for help)                                Servers: 56
______________________________ I/O Threads _______________________________
CXN       Thread  Purpose               Thread Status
master56       0  insert buffer thread  waiting for completed aio requests
master56       1  log thread            waiting for completed aio requests
master56       2  read thread           waiting for completed aio requests
master56       3  read thread           waiting for completed aio requests
master56       4  read thread           waiting for completed aio requests
master56       5  read thread           waiting for completed aio requests
master56       6  write thread          waiting for completed aio requests
master56       7  write thread          waiting for completed aio requests
master56       8  write thread          waiting for completed aio requests
master56       9  write thread          waiting for completed aio requests
slave156       0  insert buffer thread  waiting for completed aio requests
slave156       1  log thread            waiting for completed aio requests
slave156       2  read thread           waiting for completed aio requests
slave156       3  read thread           waiting for completed aio requests
slave156       4  read thread           waiting for completed aio requests
slave156       5  read thread           waiting for completed aio requests
slave156       6  write thread          waiting for completed aio requests
slave156       7  write thread          waiting for completed aio requests
slave156       8  write thread          waiting for completed aio requests
slave156       9  write thread          waiting for completed aio requests
slave256       0  insert buffer thread  waiting for completed aio requests
slave256       1  log thread            waiting for completed aio requests
slave256       2  read thread           waiting for completed aio requests
slave256       3  read thread           waiting for completed aio requests
slave256       4  read thread           waiting for completed aio requests
slave256       5  read thread           waiting for completed aio requests
slave256       6  write thread          waiting for completed aio requests
slave256       7  write thread          waiting for completed aio requests
slave256       8  write thread          waiting for completed aio requests
slave256       9  write thread          waiting for completed aio requests
_________________________________ Pending I/O __________________________________
CXN       Async Rds  Async Wrt  IBuf Async Rds  Sync I/Os  Log Flushes  Log I/Os
master56                                     0          0            0         0
slave156                                     0          0            0         0
slave256                                     0          0            0         0
_____________________________ File I/O Misc ______________________________
CXN       OS Reads  OS Writes  OS fsyncs  Reads/Sec  Writes/Sec  Bytes/Sec
master56      1024     118562       4737       0.00        0.00          0
slave156       565     120586      20333       0.00      144.86          0
slave256       532     119388       4644       0.00        0.00          0
__________________________ Log Statistics __________________________
CXN       Sequence No.  Flushed To  Last Checkpoint  IO Done  IO/Sec
master56  929977732     929977732   929977732          22777    0.00
slave156  929560312     929560312   927035927          17145   95.00
slave256  935238232     935238232   935238232          22767    0.00

If you still did not guess – it’s the sync_relay_log=1 and sync_binlog=1 set on slave1 only resulting in many more write operations.

These kind of views for multiple servers on the same screen are really nice and allow to very fast compare many servers or find the ones having problems.

More Advanced Options

One of the modes, which many of you probably did not know Innotop has, is one where you can view any status variables how changing in real time.
Below is view of Variables & Status mode (S), which btw. you can view as standard (vmstat like) view, pivoted or graph (tload). The one below is “pivoted”.

[RO] Variables & Status (? for help)    master56, 2h33m, InnoDB 1s :-), 5.09k QPS, 5/4/0 con/run/cac thds, 5.6.14-log
name               set_0    set_1    set_2    set_3    set_4    set_5    set_6    set_7    set_8    set_9
QPS                5211.69  4478.65  5457.82  5057.26  5704.39  5522.03  5737.59  5800.76  5188.59  58.52
Commit_PS          260.91   223.28   272.94   252.86   285.74   275.35   286.97   290.46   258.44   2.92
Rollback_Commit    0        0        0        0        0        0        0        0        0        0
Write_Commit       17.97    18.05    17.99    17.99    17.96    18.05    17.98    17.96    18.07    18
R_W_Ratio          0.78     0.78     0.78     0.78     0.78     0.78     0.78     0.78     0.78     0.78
Opens_PS           0        0        0        0        0        0        0        0        0        0.01
Table_Cache_Used   0        0        0        0        0        0        0        0        0        0
Threads_PS         0        0        0        0        0        0        0        0        0        0
Thread_Cache_Used  0        0        0        0        0        0        0        0        0        0
CXN_Used_Ever      3.31     3.31     3.31     3.31     3.31     3.31     3.31     3.31     3.31     3.31
CXN_Used_Now       3.31     3.31     3.31     3.31     3.31     3.31     3.31     3.31     3.31     3.31

What is really cool about this view is that you can select many kinds of variables to display here. With pressing ‘c’ key in this mode you will see the list (truncated as it’s very wide):

________________________________________________________ Choose from _________________________________________________________
          commands   Uptime, Questions, Com_delete, Com_delete_multi, Com_insert, Com_insert_select, Com_replace, Com_replace_
cxns_files_threads   Uptime,Aborted_clients,Aborted_connects,Bytes_received,Bytes_sent,Compression,Connections,Created_tmp_dis
           general   set_precision(Questions/Uptime_hires) as QPS, set_precision(Com_commit/Uptime_hires) as Commit_PS, set_pr
           handler   Uptime,Handler_read_key,Handler_read_first,Handler_read_next,Handler_read_prev,Handler_read_rnd,Handler_r
            innodb   Uptime,Innodb_row_lock_current_waits,Innodb_row_lock_time,Innodb_row_lock_time_avg,Innodb_row_lock_time_m
     innodb_health   dulint_to_int(IB_tx_trx_id_counter) - dulint_to_int(IB_tx_purge_done_for) as OldVersions,IB_sm_mutex_spin
    innodb_health2   percent(1-((Innodb_buffer_pool_pages_free||0)/($cur->{Innodb_buffer_pool_pages_total}||1))) as BP_page_ca
         key_cache   Uptime,Key_blocks_not_flushed,Key_blocks_unused,Key_blocks_used,Key_read_requests,Key_reads,Key_write_req
         prep_stmt   Uptime,Com_dealloc_sql,Com_execute_sql,Com_prepare_sql,Com_reset,Com_stmt_close,Com_stmt_execute,Com_stmt
       query_cache   percent((Qcache_hits||0)/(((Com_select||0)+(Qcache_hits||0))||1)) as Hit_Pct,set_precision((Qcache_hits||
      query_status   Uptime,Select_full_join,Select_full_range_join,Select_range,Select_range_check,Select_scan,Slow_queries,S
      slow_queries   set_precision(Slow_queries/Uptime_hires) as Slow_PS, set_precision(Select_full_join/Uptime_hires) as Full
               txn   Uptime,Com_begin,Com_commit,Com_rollback,Com_savepoint,Com_xa_commit,Com_xa_end,Com_xa_prepare,Com_xa_rec
Choose a set of values to display, or enter the name of a new one:

What is really cool about this mode, is that you can define your own set of values or compose functions making some calculations from many of them!

Not only a passive watch-tool!

Another thing worth mentioning is that Innotop is not only meant for monitoring! You can manage your servers there too (in some limited way).
For example, let’s see the replication mode example:
press ‘O’ key:

[RO] Replication Status (? for help)                                                   Servers: 56
__________________________ Slave SQL Status ___________________________
CXN       Master     On?  TimeLag  Catchup  Temp  Relay Pos  Last Error
slave156  Yes    00:00     0.00     0   57271325
slave256  Yes    00:00     0.00     0   57271325
________________________________________ Slave I/O Status ________________________________________
CXN       Master     On?  File              Relay Size  Pos       State
slave156  Yes  mysql-bin.000005      54.62M  57271115  Waiting for master to send event
slave256  Yes  mysql-bin.000005      54.62M  57271115  Waiting for master to send event
_________________ Master Status __________________
CXN       File              Position  Binlog Cache
master56  mysql-bin.000005  57271115         0.00%
slave156  mysql-bin.000004  57271115         0.00%
slave256  mysql-bin.000005  57271155         0.00%
_________________________________ Choose from _________________________________
master56   DBI:mysql:;host=;port=19479;
slave156   DBI:mysql:;host=;port=19480;
slave256   DBI:mysql:;host=;port=19481;mysql_read_default_group=client
Which servers?: slave156
Command to send: stop slave SQL_thread;
Success on slave156
Press any key to continue

But… this option is actually a hidden gate to just run ANY command that is allowed to the user Innotop uses to connect:

Which servers?: slave156
Command to send: DROP TABLE test.t1;
Success on slave156
Press any key to continue

So beware of using too powerful users for your Innotop client. In the case above I was able to drop table on slave even though read_only=1 since I am connecting with super user.

Innotop and MySQL 5.6

Thanks to mysqlsandbox it is super fast and easy to test various MySQL versions and configurations. So, after we enable GTID replication in our replication sandbox (as easy as starting ‘enable_gtid’ script) – the replication view mode in Innotop looks unfortunately the same, no additional information. There is already a feature request on this, just needs some love:
Note that if you have Innotop version older then 1.9.1 – the InnoDB status related modes will crash with MySQL 5.6, so update to 1.9.1.

Innotop and Galera replication

This would be really cool to see some wsrep status variables in Innotop. Currently, monitoring Percona XtraDB Cluster (PXC) nodes works well, you can configure each cluster as a group, just the replication/node status in cluster part is not seen.
I’ve made a new feature request for this, with a hope that some Perl master takes care of:

However, that does not mean that you cannot monitor Galera status with Innotop at all! Remember the custom set of values for (S) mode?
Quick tutorial:
Press ‘S’ to enter the mode, choose connection to any PXC node you want, then press ‘c’ key:

Choose a set of values to display, or enter the name of a new one: wsrep1

Enter variables for wsrep1: wsrep_cluster_conf_id,wsrep_cluster_size,wsrep_local_recv_queue


[RO] Variables & Status (? for help)      pxc1, 3h, InnoDB 1s :-), 1.90 QPS, 202/1/0 con/run/cac thds, 5.5.29-log
name                    set_0  set_1  set_2  set_3  set_4  set_5  set_6  set_7  set_8  set_9
wsrep_cluster_conf_id   3      3      3      3      3      3      3      3      3      3
wsrep_cluster_size      3      3      3      3      3      3      3      3      3      3
wsrep_local_recv_queue  0      0      0      0      0      0      0      0      0      0

If we have readonly option disabled in innotop.conf, after we exit the tool, our new variable set is saved like this:


Unfortunately, only numeric values are allowed here, also the S mode does not work well with server groups yet :(

There are many more bit hidden features inside Innotop you are probably not aware of, and which can be quite handy, so I encourage you to take a deep look into internal help (press ?) in each of the mode views.

Bugs – help needed!

Unfortunately, my favorite feature – ability of monitoring server groups on single screen is seriously affected by those two bugs:

List of all active bugs and feature request:

Complete manual:

The post Innotop: A real-time, advanced investigation tool for MySQL appeared first on MySQL Performance Blog.


Is Synchronous Replication right for your app?

I talk with lot of people who are really interested in Percona XtraDB Cluster (PXC) and mostly they are interested in PXC as a high-availability solution.  But, what they tend not to think too much about is if moving from async to synchronous replication is right for their application or not.

Facts about Galera replication

There’s a lot of different facts about Galera that come into play here, and it isn’t always obvious how they will affect your database workload.  For example:

  • Transaction commit takes approximately the worst packet round trip time (RTT) between any two nodes in your cluster.
  • Transaction apply on slave nodes is still asynchronous from client commit (except on the original node where the transaction is committed)
  • Galera prevents writing conflicts to these pending transactions while they are inflight in the form of deadlock errors.  (This is actually a form of Eventual Consistency where the client is forced to correct the problem before it can commit.  It is NOT the typical form of Eventual Consistency, known as asynchronous repair, that most people think of).

Callaghan’s Law

But what does that all actually mean?  Well, at the Percona Live conference a few weeks ago I heard a great maxim that really helps encapsulate a lot of this information and puts it into context with your application workload:

[In a Galera cluster] a given row can’t be modified more than once per RTT

This was attributed to Mark Callaghan from Facebook by Alexey Yurchenko from Codership at his conference talk.  Henceforth this will be known as “Callaghan’s law” in Galera circles forever, though Mark didn’t immediately recall saying it.

Applied to a standalone Innodb instance

Let’s break it down a bit.  Our unit of locking in Innodb is a single row (well, the PRIMARY KEY index entry for that row).  This means typically on a single Innodb node we can have all sorts modifications floating around as long as they don’t touch the same row.  Row locks are held for modifications until the transaction commits and that takes an fsync to the redo log by default, so applying Callaghan’s law to single-server Innodb, we’d get:

[On a single node Innodb server] a given row can’t be modified more than the time to fsync

You can obviously relax that by simply not fsyncing every transaction (innodb_flush_log_at_trx_commit != 1), or work around it with by fsyncing to memory (Battery or capacitor-backed write cache), etc., but the principle is basically the same.  If we want this transaction to persist after a crash, it has to get to disk.

This has no effect on standard MySQL replication from this instance, since MySQL replication is asynchronous.

What about semi-sync MySQL replication?

It’s actually much worse than Galera.  As I illustrated in a blog post last year, semi-sync must serialize all transactions and wait for them one at a time.  So, Callaghan’s law applied to semi-sync is:

[On a semi-sync replication master] you can’t commit (at all) more than once per RTT. 

Applied to a Galera cluster

In the cluster we’re protecting the data as well, though not by ensuring it goes to disk (though you can do that).  We protect the data by ensuring it gets to every node in the cluster.

But why every node and not just a quorum?  Well, it turns out transaction ordering really, really matters (really!).  By enforcing replication to all nodes, we can (simultaneously) establish global ordering for the transaction, so by the time the original node gets acknowledgement of the transaction back from all the other nodes, a GTID will also (by design) be established.  We’ll never end up with non-deterministic ordering of transactions as a result.

So this brings us back to Callaghan’s law for Galera.  We must have group communication to replicate and establish global ordering for every transaction, and the expense of doing that for Galera is approximately one RTT between the two nodes in the cluster that are furthest apart (regardless of where the commit comes from!).  The least amount of data we can change in Innodb at a time is a single row, so the most any single row can be modified cluster-wide is once per RTT.

What about WAN clusters?

Callaghan’s law applies to WAN clusters as well.  LANs usually have sub-millisecond RTTs.  WANs usually have anywhere from a few ms up to several hundred.  This really will open a large window where rows won’t be able to be updated more than just a few times a second at best.

Some things the rule does not mean on Galera

  • It does NOT mean you can’t modify different rows simultaneously.  You can.
  • It does NOT mean you can’t modify data on multiple cluster nodes simultaneously.  You can.
  • It does NOT set an lower bound on performance, only a upper bound.  The best performance you can expect is modifying a given row once per RTT, it could get slower if apply times start to lag.

So what about my application?

Think about your workload.  How frequently do you update any given row?  We call rows that are updated heavily “hotspots“.

Examples of hotspots

Example 1: Your application is an online game and you keep track of global achievement statistics in a single table with a row for each stat; there are just a few hundred rows.  When a player makes an achievement, your application updates this table with a statement like this:

UPDATE achievements SET count = count + 1 where achievement = 'killed_troll';

How many players might accomplish this achievement at the same time?

Example 2: You have users and groups in your application.  These are maintained in separate tables and there also exists a users_groups table to define the relationship between them.  When someone joins a group, you run a transaction that adds the relationship row to users_groups, but also updates groups with some metadata:

INSERT INTO users_groups (user_id, group_id) VALUES (100, 1);
UPDATE groups SET last_joined=NOW(), last_user_id=100 WHERE id=1;

How often might multiple users join the same group?


In both of the above examples you can imagine plenty of concurrent clients attempting to modify the same record at once.  But what will actually happen to the clients who try to update the same row within the same RTT?  This depends on which node in the cluster the writes are coming from:

From the same node: This will behave just like standard Innodb.  The first transaction will acquire the necessary row locks while it commits (which will take the 1 RTT).  The other transactions will lock wait until the lock(s) they need are available.  The application just waits in those cases.

From other nodes: First to commit wins.  The others that try to commit AFTER the first and while the first is still in the local apply queue on their nodes will get a deadlock error.

So, the best case (which may not be best for your application database throughput) will be more write latency into the cluster.  The worst case is that your transactions won’t even commit and you have to take some action you normally wouldn’t have had to do.


If your hotspots were really bad in standalone Innodb, you might consider relaxing the fsync:  set innodb_flush_log_at_trx_commit to something besides 1 and suddenly you can update much faster.  I see this tuning very frequently for “performance” reasons when data durability isn’t as crucial.  This is fine as long as you weigh both options carefully.

But in Galera you cannot relax synchronous replication.  You can’t change the law, you can only adapt around it, but how might you do that ?

Write to one node

If your issue is really the deadlock errors and not so much the waiting, you could simply send all your writes to one node.  This should prevent the deadlock errors, but will not change the lock waiting that your application will need to do for hotspots.


If your hotspots are all updates with autocommits, you can rely on wsrep_retry_autocommit to auto-retry the transactions for you.  However, each autocommit is retried only the number of times specified by this variable (default is 1 retry).  This means more waiting, and after the limit is exceeded you will still get the deadlock error.

This is not implemented for full BEGIN … COMMIT multi-statement transactions since it cannot be assumed that those are not applying application logic in between the statements that is not safe to retry after the database state changes.

retry deadlocks

Now we start to get into (*gasp*) territory where your application needs to be modified.  Generally if you use Innodb, you should be able to handle deadlock errors in your application.  Raise your hands if your application has that logic (I usually get less than 5 people who do out of 100).

But, what to do?  Retrying automatically, or giving your end user a chance to retry manually are typical answers.  However, this means more latency waiting for a write to go through, and possibly some poor user experience.

batch writes

Instead of updating global counters one at a time (from Example 1, above), how about maintaining the counter in memcache or redis and only flushing to the database periodically?

if( $last_count % 100 == 0 ) {
  $db->do( "UPDATE achievements SET count = $last_count where achievement = 'killed_troll'";

change your schema

In Example 2, above, how above moving the ‘joined’ column to the users_groups table so we don’t need to update the parent group row so often?

INSERT INTO users_groups (user_id, group_id, joined) VALUES (100, 1, NOW());


Choosing a system to replicate your data to a distributed system requires tradeoffs.  Most of us are used to the tradeoffs we take when deploying conventional stand-alone MySQL Innodb with asynchronous slaves.  We may not think about the tradeoffs, but we’re making them (anyone obsessively testing slave position to ensure it’s caught up with the master?).

Synchronous replication with PXC and Galera is no different in that there are trade-offs, they just aren’t what we commonly expect.

If Callaghan’s law is going to cause you trouble and you are not prepared to adapt to work with it, PXC/Galera Synchronous replication is probably not right for you.

The post Is Synchronous Replication right for your app? appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by