High-load clusters and desynchronized nodes on Percona XtraDB Cluster

There can be a lot of confusion and lack of planning in Percona XtraDB Clusters in regards to nodes becoming desynchronized for various reasons.  This can happen a few ways:

When I say “desynchronized” I mean a node that is permitted to build up a potentially large wsrep_local_recv_queue while some operation is happening.  For example a node taking a backup would set wsrep_desync=ON during the backup and potentially fall behind replication some amount.

Some of these operations may completely block Galera from applying transactions, while others may simply increase load on the server enough that it falls behind and applies at a reduced rate.

In all the cases above, flow control is NOT used while the node cannot apply transactions, but it MAY be used while the node is recovering from the operation.  For an example of this, see my last blog about IST.

If a cluster is fairly busy, then the flow control that CAN happen when the above operations catch up MAY be detrimental to performance.

Example setup

Let us take my typical 3 node cluster with workload on node1.  We are taking a blocking backup of some kind on node3 so we are executing the following steps:

  1. node3> set global wsrep_desync=ON;
  2. Node3’s “backup” starts, this starts with FLUSH TABLES WITH READ LOCK;
  3. Galera is paused on node3 and the wsrep_local_recv_queue grows some amount
  4. Node3’s “backup” finishes, finishing with UNLOCK TABLES;
  5. node3> set global wsrep_desync=OFF;

During the backup

This includes up through step 3 above.  My node1 is unaffected by the backup on node3, I can see it averaging 5-6k writesets(transactions) per second which it did before we began:

Screen Shot 2015-08-19 at 2.38.34 PM


node2 is also unaffected:

Screen Shot 2015-08-19 at 2.38.50 PM

but node3 is not applying and its queue is building up:

Screen Shot 2015-08-19 at 2.39.04 PM

Unlock tables, still wsrep_desync=ON

Let’s examine briefly what happens when node3 is permitted to start applying, but wsrep_desync stays enabled:

Screen Shot 2015-08-19 at 2.42.16 PM

node1’s performance is pretty much the same, node3 is not using flow control yet. However, there is a problem:

Screen Shot 2015-08-19 at 2.43.13 PM

It’s hard to notice, but node3 is NOT catching up, instead it is falling further behind!  We have potentially created a situation where node3 may never catch up.

The PXC nodes were close enough to the red-line of performance that node3 can only apply just about as fast (and somewhat slower until it heats up a bit) as new transactions are coming into node1.

This represents a serious concern in PXC capacity planning:

Nodes do not only need to be fast enough to handle normal workload, but also to catch up after maintenance operations or failures cause them to fall behind.

Experienced MySQL DBA’s will realize this isn’t all that different than Master/Slave replication.

Flow Control as a way to recovery

So here’s the trick:  if we turn off wsrep_desync on node3 now, node3 will use flow control if and only if the incoming replication exceeds node3’s apply rate.  This gives node3 a good chance of catching up, but the tradeoff is reducing write throughput of the cluster.  Let’s see what this looks like in context with all of our steps.  wsrep_desync is turned off at the peak of the replication queue size on node3, around 12:20PM:

Screen Shot 2015-08-19 at 2.47.12 PM

Screen Shot 2015-08-19 at 2.48.07 PM

So at the moment node3 starts utilizing flow control to prevent falling further behind, our write throughput (in this specific environment and workload) is reduced by approximately 1/3rd (YMMV).   The cluster will remain in this state until node3 catches up and returns to the ‘Synced’ state.  This catchup is still happening as I write this post, almost 4 hours after it started and will likely take another hour or two to complete.

I can see a more realtime representation of this by using myq_status on node1, summarizing every minute:

[root@node1 ~]# myq_status -i 1m wsrep
mycluster / node1 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
19:58:47 P   5  3 Sync 0.9ms 3128 2.0M   0   27 213b   0 25.4s   0   0   0 3003k  16k  62%
19:59:47 P   5  3 Sync 1.1ms 3200 2.1M   0   31 248b   0 18.8s   0   0   0 3003k  16k  62%
20:00:47 P   5  3 Sync 0.9ms 3378 2.2M  32   27 217b   0 26.0s   0   0   0 3003k  16k  62%
20:01:47 P   5  3 Sync 0.9ms 3662 2.4M  32   33 266b   0 18.9s   0   0   0 3003k  16k  62%
20:02:47 P   5  3 Sync 0.9ms 3340 2.2M  32   27 215b   0 27.2s   0   0   0 3003k  16k  62%
20:03:47 P   5  3 Sync 0.9ms 3193 2.1M   0   27 215b   0 25.6s   0   0   0 3003k  16k  62%
20:04:47 P   5  3 Sync 0.9ms 3009 1.9M  12   28 224b   0 22.8s   0   0   0 3003k  16k  62%
20:05:47 P   5  3 Sync 0.9ms 3437 2.2M   0   27 218b   0 23.9s   0   0   0 3003k  16k  62%
20:06:47 P   5  3 Sync 0.9ms 3319 2.1M   7   28 220b   0 24.2s   0   0   0 3003k  16k  62%
20:07:47 P   5  3 Sync 1.0ms 3388 2.2M  16   31 251b   0 22.6s   0   0   0 3003k  16k  62%
20:08:47 P   5  3 Sync 1.1ms 3695 2.4M  19   39 312b   0 13.9s   0   0   0 3003k  16k  62%
20:09:47 P   5  3 Sync 0.9ms 3293 2.1M   0   26 211b   0 26.2s   0   0   0 3003k  16k  62%

This reports around 20-25 seconds of flow control every minute, which is consistent with that ~1/3rd of performance reduction we see in the graphs above.

Watching node3 the same way proves it is sending the flow control (FlowC snt):

mycluster / node3 (idx: 2) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
17:38:09 P   5  3 Dono 0.8ms    0   0b   0 4434 2.8M 16m 25.2s  31   0   0 18634  16k  80%
17:39:09 P   5  3 Dono 1.3ms    0   0b   1 5040 3.2M 16m 22.1s  29   0   0 37497  16k  80%
17:40:09 P   5  3 Dono 1.4ms    0   0b   0 4506 2.9M 16m 21.0s  31   0   0 16674  16k  80%
17:41:09 P   5  3 Dono 0.9ms    0   0b   0 5274 3.4M 16m 16.4s  27   0   0 22134  16k  80%
17:42:09 P   5  3 Dono 0.9ms    0   0b   0 4826 3.1M 16m 19.8s  26   0   0 16386  16k  80%
17:43:09 P   5  3 Jned 0.9ms    0   0b   0 4957 3.2M 16m 18.7s  28   0   0 83677  16k  80%
17:44:09 P   5  3 Jned 0.9ms    0   0b   0 3693 2.4M 16m 27.2s  30   0   0  131k  16k  80%
17:45:09 P   5  3 Jned 0.9ms    0   0b   0 4151 2.7M 16m 26.3s  34   0   0  185k  16k  80%
17:46:09 P   5  3 Jned 1.5ms    0   0b   0 4420 2.8M 16m 25.0s  30   0   0  245k  16k  80%
17:47:09 P   5  3 Jned 1.3ms    0   0b   1 4806 3.1M 16m 21.0s  27   0   0  310k  16k  80%

There are a lot of flow control messages (around 30) per minute.  This is a lot of ON/OFF toggles of flow control where writes are briefly delayed rather than a steady “you can’t write” for 20 seconds straight.

It also interestingly spends a long time in the Donor/Desynced state (even though wsrep_desync was turned OFF hours before) and then moves to the Joined state (this has the same meaning as during an IST).

Does it matter?

As always, it depends.

If these are web requests and suddenly the database can only handle ~66% of the traffic, that’s likely a problem, but maybe it just slows down the website somewhat.  I want to emphasize that WRITES are what is affected here.  Reads on any and all nodes should be normal (though you probably don’t want to read from node3 since it is so far behind).

If this were some queue processing that had reduced throughput, I’d expect it to possibly catch up later

This can only be answered for your application, but the takeaways for me are:

  • Don’t underestimate your capacity requirements
  • Being at the redline normally means you are well past the redline for abnormal events.
  • Plan for maintenance and failure recoveries
  • Where possible, build queuing into your workflows so diminished throughput in your architecture doesn’t generate failures.

Happy clustering!

Graphs in this post courtesy of VividCortex.

The post High-load clusters and desynchronized nodes on Percona XtraDB Cluster appeared first on Percona Data Performance Blog.


Optimizing PXC Xtrabackup State Snapshot Transfer

State Snapshot Transfer (SST) at a glance

PXC uses a protocol called State Snapshot Transfer to provision a node joining an existing cluster with all the data it needs to synchronize.  This is analogous to cloning a slave in asynchronous replication:  you take a full backup of one node and copy it to the new one, while tracking the replication position of the backup.

PXC automates this process using scriptable SST methods.  The most common of these methods is the xtrabackup-v2 method which is the default in PXC 5.6.  Xtrabackup generally is more favored over other SST methods because it is non-blocking on the Donor node (the node contributing the backup).

The basic flow of this method is:

  • The Joiner:
    • joins the cluster
    • Learns it needs a full SST and clobbers its local datadir (the SST will replace it)
    • prepares for a state transfer by opening a socat on port 4444 (by default)
    • The socat pipes the incoming files into the datadir/.sst directory
  • The Donor:
    • is picked by the cluster (could be configured or be based on WAN segments)
    • starts a streaming Xtrabackup and pipes the output of that via socat to the Joiner on port 4444.
    • Upon finishing its backup, sends an indication of this and the final Galera GTID of the backup is sent to the Joiner
  • The Joiner:
    • Records all changes from the Donor’s backup’s GTID forward in its gcache (and overflow pages, this is limited by available disk space)
    • runs the –apply-log phase of Xtrabackup on the donor
    • Moves the datadir/.sst directory contents into the datadir
    • Starts mysqld
    • Applies all the transactions it needs (Joining and Joined states just like IST does it)
    • Moves to the ‘Synced’ state and is done.

There are a lot of moving pieces here, and nothing is really tuned by default.  On larger clusters, SST can be quite scary because it may take hours or even days.  Any failure can mean starting over again from the start.

This blog will concentrate on some ways to make a good dent in the time SST can take.  Many of these methods are trade-offs and may not apply to your situations.  Further, there may be other ways I haven’t thought of to speed things up, please share what you’ve found that works!

The Environment

I am testing SST on a PXC 5.6.24 cluster in AWS.  The nodes are c3.4xlarge and the datadirs are RAID-0 over the two ephemeral SSD drives in that instance type.  These instances are all in the same region.

My simulated application is using only node1 in the cluster and is sysbench OLTP with 200 tables with 1M rows each.  This comes out to just under 50G of data.  The test application runs on a separate server with 32 threads.

The PXC cluster itself is tuned to best practices for Innodb and Galera performance


In my first test the cluster is a single member (receiving workload) and I am  joining node2.  This configuration is untuned for SST.  I measured the time from when mysqld started on node2 until it entered the Synced state (i.e., fully caught up).  In the log, it looked like this:

150724 15:59:24 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
... lots of other output ...
2015-07-24 16:48:39 31084 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 4647341)

Doing some math on the above, we find that the SST took 51 minutes to complete.


One of the first things I noticed was that the –apply-log step on the Joiner was very slow.  Anyone who uses Xtrabackup a lot will know that –apply-log will be a lot faster if you give it some extra RAM to use while making the backup consistent via the –use-memory option.  We can set this in our my.cnf like this:


The [sst] section is a special one understood only by the xtrabackup-v2 script.  inno-apply-opts allows me to specify arguments to innobackupex when it runs.

Note that this change only applies to the Joiner (i.e., you don’t have to put it on all your nodes and restart them to take advantage of it).

This change immediately makes a huge improvement to our above scenario (node2 joining node1 under load) and the SST now takes just over 30 minutes.


Another slow part of getting to Synced is how long it takes to apply transactions up to realtime after the backup is restored and in place on the Joiner.  We can improve this throughput by increasing the number of apply threads on the Joiner to make better use of the CPU.  Prior to this wsrep_slave_threads was set to 1, but if I increase this to 32  (there are 16 cores on this instance type) my SST now takes 25m 32s


xtrabackup-v2 supports adding a compression process into the datastream.  On the Donor it compresses and on the Joiner it decompresses.  This allows you to trade CPU for transfer speed.  If your bottleneck turns out to be network transport and you have spare CPU, this can help a lot.

Further, I can use pigz instead of gzip to get parallel compression, but theoretically any compression utilization can work as long as it can compress and decompress standard input to standard output.  I install the ‘pigz’ package on all my nodes and change my my.cnf like this:

decompressor="pigz -d"

Both the Joiner and the Donor must have the respective decompressor and compressor settings or the SST will fail with a vague error message (not actually having pigz installed will do the same thing).

By adding compression, my SST is down to 21 minutes, but there’s a catch.  My application performance starts to take a serious nose-dive during this test.  Pigz is consuming most of the CPU on my Donor, which is also my primary application node.  This may or may not hurt your application workload in the same way, but this emphasizes the importance of understanding (and measuring) the performance impact of SST has on your Donor nodes.

Dedicated donor

To alleviate the problem with the application, I now leave node2 up and spin up node3.  Since I’m expecting node2 to normally not be receiving application traffic directly, I can configure node3 to prefer node2 as its donor like this:

wsrep_sst_donor = node2,

When node3 starts, this setting instructs the cluster that node3 is the preferred donor, but if that’s not available, pick something else (that’s what the trailing comma means).

Donor nodes are permitted to fall behind in replication apply as needed without sending flow control.  Sending application traffic to such a node may see an increase in the amount of stale data as well as certification failures for writes (not to mention the performance issues we saw above with node1).  Since node2 is not getting application traffic, moving into the Donor state and doing an expensive SST with pigz compression should be relatively safe for the rest of the cluster (in this case, node1).

Even if you don’t have a dedicated donor, if you use a load balancer of some kind in front of your cluster, you may elect to consider Donor nodes as failing their health checks so application traffic is diverted during any state transfer.

When I brought up node3, with node2 as the donor, the SST time dropped to 18m 33s


Each of these tunings helped the SST speed, though the later adjustments maybe had less of a direct impact.  Depending on your workload, database size, network and CPU available, your mileage may of course vary.  Your tunings should vary accordingly, but also realize you may actually want to limit (and not increase) the speed of state transfers in some cases to avoid other problems. For example, I’ve seen several clusters get unstable during SST and the only explanation for this is the amount of network bandwidth consumed by the state transfer preventing the actual Galera communication between the nodes. Be sure to consider the overall state of production when tuning your SSTs.

The post Optimizing PXC Xtrabackup State Snapshot Transfer appeared first on MySQL Performance Blog.


PXC – Incremental State transfers in detail

IST Basics

State transfers in Galera remain a mystery to most people.  Incremental State transfers (as opposed to full State Snapshot transfers) are used under the following conditions:

  • The Joiner node reports Galera a valid Galera GTID to the cluster
  • The Donor node selected contains all the transactions the Joiner needs to catch up to the rest of the cluster in its Gcache
  • The Donor node can establish a TCP connection to the Joiner on port 4568 (by default)

IST states

Galera has many internal node states related to Joiner nodes.  They currently are:

  1. Joining
  2. Joining: preparing for State Transfer
  3. Joining: requested State Transfer
  4. Joining: receiving State Transfer
  5. Joining: State Transfer request failed
  6. Joining: State Transfer failed
  7. Joined

I don’t claim any special knowledge of most of these states apart from what their titles indicate.  Many of these states are occur very briefly and it is unlikely you’ll ever actually see them on a node’s wsrep_local_state_comment.

During IST, however, I have observed the following states have the potential to take a long while:

Joining: receiving State Transfer

During this state transactions are being streamed to the Joiner’s wsrep_local_recv_queue.  You can connect to the node at this time and poll state.  If you do, you’ll easily see the inbound queue increasing (usually quickly) but no writesets being ‘received’ (read: applied).  It’s not clear to me if there is a reason why transction apply couldn’t be started during this steam, but it does not do so currently.

The further behind the Joiner is, the longer this can take.  Here’s some output from the latest release of myq-tools showing wsrep stats:

[root@node2 ~]# myq_status wsrep
mycluster / node2 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:40 P   4  2 J:Rc 0.4ms    0   0b   0    1 197b  4k   0ns   0   0   0  367k    0  94%
14:04:41 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  5k   0ns   0   0   0  368k    0  93%
14:04:42 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  6k   0ns   0   0   0  371k    0  92%
14:04:43 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  7k   0ns   0   0   0  373k    0  92%
14:04:44 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  8k   0ns   0   0   0  376k    0  92%
14:04:45 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 10k   0ns   0   0   0  379k    0  92%
14:04:46 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 11k   0ns   0   0   0  382k    0  92%
14:04:47 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 12k   0ns   0   0   0  386k    0  91%
14:04:48 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 13k   0ns   0   0   0  390k    0  91%
14:04:49 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 14k   0ns   0   0   0  394k    0  91%
14:04:50 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 15k   0ns   0   0   0  397k    0  91%
14:04:51 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 16k   0ns   0   0   0  401k    0  91%
14:04:52 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 18k   0ns   0   0   0  404k    0  91%
14:04:53 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 19k   0ns   0   0   0  407k    0  91%
14:04:54 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 20k   0ns   0   0   0  411k    0  91%

The node is in ‘J:Rc’ (Joining: Receiving) state and we can see the Inbound queue growing (wsrep_local_recv_queue). Otherwise this node is not sending or receiving transactions.


Once all the requested transactions are copied over, the Joiner flips to the ‘Joining’ state, during which it starts applying the transactions as quickly as the wsrep_slave_threads can go.  For example:

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:55 P   4  2 Jing 0.6ms    0   0b   0 2243 3.7M 19k   0ns   0   0   0  2236  288  91%
14:04:56 P   4  2 Jing 0.5ms    0   0b   0 4317 7.0M 16k   0ns   0   0   0  6520  199  92%
14:04:57 P   4  2 Jing 0.5ms    0   0b   0 4641 7.5M 12k   0ns   0   0   0 11126  393  92%
14:04:58 P   4  2 Jing 0.4ms    0   0b   0 4485 7.2M  9k   0ns   0   0   0 15575  200  93%
14:04:59 P   4  2 Jing 0.5ms    0   0b   0 4564 7.4M  5k   0ns   0   0   0 20102  112  93%

Notice the Inbound msgs (wsrep_received) starts increasing rapidly and the queue decreases accordingly.


14:05:00 P   4  2 Jned 0.5ms    0   0b   0 4631 7.5M  2k   0ns   0   0   0 24692   96  94%

Towards the end the node briefly switches to the ‘Joined’ state, though that is a fast state in this case. ‘Joining’ and ‘Joined’ are similar states, the difference (I believe) is that:

  • ‘Joining’ is applying transactions acquired via the IST
  • ‘Joined’ is applying transactions that have queued up via standard Galera replication since the IST (i.e., everything has been happening on the cluster since the IST)

Flow control during Joining/Joined states

The Codership documentation says something interesting about ‘Joined’ (from experimentation, I believe the ‘Joining’ state behaves the same here.):

Nodes in this state can apply write-sets. Flow Control here ensures that the node can eventually catch up with the cluster. It specifically ensures that its write-set cache never grows. Because of this, the cluster wide replication rate remains limited by the rate at which a node in this state can apply write-sets. Since applying write-sets is usually several times faster than processing a transaction, nodes in this state hardly ever effect cluster performance.

What this essentially means is that a Joiner’s wsrep_local_recv_queue is allowed to shrink but NEVER GROW during an IST catchup.  Growth will trigger flow control, but why would it grow?  Writes on other cluster nodes must still be replicated to our Joiner and added to the queue.

If the Joiner’s apply rate is less than the rate of writes coming from Cluster replication, flow control will be applied to slow down Cluster replication (read: your application writes).  As far as I can tell, there is no way to tune this or turn it off.  The Codership manual continues here:

The one occasion when nodes in the JOINED state do effect cluster performance is at the very beginning, when the buffer pool on the node in question is empty.

Essentially a Joiner node with a cold cache can really hurt performance on your cluster.  This can possibly be improved by:

  • Better IO and other resources available to the Joiner for a quicker cache warmup.  A huge example of this would be flash over convention storage.
  • Buffer pool preloading
  • More Galera apply threads
  • etc.


From what I can tell, the ‘Joined’ state ends when the wsrep_local_recv_queue drops lower than the node’s configured flow control limit.  At that point it changes to ‘Synced’ and the node behaves more normally (WRT to flow control).

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:05:01 P   4  2 Sync 0.5ms    0   0b   0 3092 5.0M   0   0ns   0   0   0 27748  150  94%
14:05:02 P   4  2 Sync 0.5ms    0   0b   0 1067 1.7M   0   0ns   0   0   0 28804  450  93%
14:05:03 P   4  2 Sync 0.5ms    0   0b   0 1164 1.9M   0   0ns   0   0   0 29954   67  92%
14:05:04 P   4  2 Sync 0.5ms    0   0b   0 1166 1.9M   0   0ns   0   0   0 31107  280  92%
14:05:05 P   4  2 Sync 0.5ms    0   0b   0 1160 1.9M   0   0ns   0   0   0 32258  606  91%
14:05:06 P   4  2 Sync 0.5ms    0   0b   0 1154 1.9M   0   0ns   0   0   0 33401  389  90%
14:05:07 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   1   0ns   0   0   0 34534  297  90%
14:05:08 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   0   0ns   0   0   0 35667  122  89%
14:05:09 P   4  2 Sync 0.5ms    0   0b   0 1121 1.8M   0   0ns   0   0   0 36778  617  88%


You may notice these states during IST if you aren’t watching the Joiner closely, but if your IST is talking a long while, it should be easy using the above situation to understand what is happening.

The post PXC – Incremental State transfers in detail appeared first on MySQL Performance Blog.


Bypassing SST in Percona XtraDB Cluster with binary logs

In my previous post, I used incremental backups in Percona XtraBackup as a method for rebuilding a Percona XtraDB Cluster (PXC) node without triggering an actual SST. Practically this reproduces the SST steps, but it can be handy if you already had backups available to use.

In this post, I want to present another methodology for this that also uses a full backup, but instead of incrementals uses any binary logs that the cluster may be producing.

Binary logs on PXC

Binary logs are not strictly needed in PXC for replication, but you may be using them for backups or for asynchronous slaves of the cluster.  To set them up properly, we need the following settings added to our config:


As I stated, none of these are strictly needed for PXC.

  • server-id=1 — We recommend PXC nodes share the same server-id.
  • log-bin — actually enable the binary log
  • log-slave-updates — log ALL updates to the cluster to this server’s binary log

This doesn’t need to be set on every node, but likely you would set these on at least two nodes in the cluster for redundancy.

Note that this strategy should work with or without 5.6 asynchronous GTIDs.

Recovering data with backups and binary logs

This methodology is conventional point-in-time backup recovery for MySQL.  We have a full backup that was taken at a specific binary log position:

... backup created in the past...
# innobackupex --no-timestamp /backups/full
# cat /backups/full/xtrabackup_binlog_info
node3-bin.000002	735622700

We have this binary log and all binary logs since:

-rw-r-----. 1 root root 1.1G Jul 14 18:53 node3-bin.000002
-rw-r-----. 1 root root 1.1G Jul 14 18:53 node3-bin.000003
-rw-r-----. 1 root root 321M Jul 14 18:53 node3-bin.000004

Recover the full backup

We start by preparing the backup with –apply-log:

# innobackupex --apply-log --use-memory=1G /backups/full
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:60072936
InnoDB: Last MySQL binlog file position 0 735622700, file name node3-bin.000002
# innobackupex --copy-back /backups/full
# chown -R mysql.mysql /var/lib/mysql

The output confirms the same binary log file and position that we knew from before.

Start MySQL without Galera

We need to start mysql, but without Galera so we can apply the binary log changes before trying to join the cluster. We can do this simply by commenting out all the wsrep settings in the MySQL config.

# grep wsrep /etc/my.cnf
#wsrep_cluster_address           = gcomm://pxc.service.consul
#wsrep_cluster_name              = mycluster
#wsrep_node_name                 = node3
#wsrep_node_address              =
#wsrep_provider                  = /usr/lib64/
#wsrep_provider_options          = "gcache.size=8G; gcs.fc_limit=1024"
#wsrep_slave_threads             = 4
#wsrep_sst_method                = xtrabackup-v2
#wsrep_sst_auth                  = sst:secret
# systemctl start mysql

Apply the binary logs

We now check our binary log starting position:

# mysqlbinlog -j 735622700 node3-bin.000002 | grep Xid | head -n 1
#150714 18:38:36 server id 1  end_log_pos 735623273 CRC32 0x8426c6bc 	Xid = 60072937

We can compare the Xid on this binary log position to that of the backup. The Xid in a binary log produced by PXC will be the seqno of the GTID of that transaction. The starting position in the binary log shows us the next Xid is one increment higher, so this makes sense: we can start at this position in the binary log and apply all changes as high as we can go to get the datadir up to a more current position.

# mysqlbinlog -j 735622700 node3-bin.000002 | mysql
# mysqlbinlog node3-bin.000003 | mysql
# mysqlbinlog node3-bin.000004 | mysql

This action isn’t particularly fast as binlog events are being applied by a single connection thread. Remember that if the cluster is taking writes while this is happening, the amount of time you have is limited by the size of gcache and the rate at which it is being filled up.

Prime the grastate

Once the binary logs are applied, we can check the final log’s last position to get the seqno we need:

[root@node3 backups]# mysqlbinlog node3-bin.000004 | tail -n 500
#150714 18:52:52 server id 1  end_log_pos 335782932 CRC32 0xb983e3b3 	Xid = 63105191

This is indeed the seqno we put in our grastate.dat. Like in the last post, we can copy a grastate.dat from another node to get the proper format. However, this time we must put the proper seqno into place:

# cat grastate.dat
# GALERA saved state
version: 2.1
uuid:    1663c027-2a29-11e5-85da-aa5ca45f600f
seqno:   63105191

Be sure the grastate.dat has the proper permissions, uncomment the wsrep settings and restart mysql on the node:

# chown mysql.mysql /var/lib/mysql/grastate.dat
# grep wsrep /etc/my.cnf
wsrep_cluster_address           = gcomm://pxc.service.consul
wsrep_cluster_name              = mycluster
wsrep_node_name                 = node3
wsrep_node_address              =
wsrep_provider                  = /usr/lib64/
wsrep_provider_options          = "gcache.size=8G; gcs.fc_limit=1024"
wsrep_slave_threads             = 4
wsrep_sst_method                = xtrabackup-v2
wsrep_sst_auth                  = sst:secret
# systemctl restart mysql

The node should now attempt to join the cluster with the proper GTID:

2015-07-14 19:28:50 4234 [Note] WSREP: Found saved state: 1663c027-2a29-11e5-85da-aa5ca45f600f:63105191

This, of course, still does not guarantee an IST. See my previous post for more details on the conditions needed for that to happen.

The post Bypassing SST in Percona XtraDB Cluster with binary logs appeared first on MySQL Performance Blog.


Bypassing SST in Percona XtraDB Cluster with incremental backups

Beware the SST

In Percona XtraDB Cluster (PXC) I often run across users who are fearful of SSTs on their clusters. I’ve always maintained that if you can’t cope with a SST, PXC may not be right for you, but that doesn’t change the fact that SSTs with multiple Terabytes of data can be quite costly.

SST, by current definition, is a full backup of a Donor to Joiner.  The most popular method is Percona XtraBackup, so we’re talking about a donor node that must:

  1. Run a full XtraBackup that reads its entire datadir
  2. Keep up with Galera replication to it as much as possible (though laggy donors don’t send flow control)
  3. Possibly still be serving application traffic if you don’t remove Donors from rotation.

So, I’ve been interested in alternative ways to work around state transfers and I want to present one way I’ve found that may be useful to someone out there.

Percona XtraBackup and Incrementals

It is possible to use Percona XtraBackup Full and Incremental backups to build a datadir that might possibly SST.  First we’ll focus on the mechanics of the backups, preparing them and getting the Galera GTID and then later discuss when it may be viable for IST.

Suppose I have fairly recent full Xtrabackup and and one or more incremental backups that I can apply on top of that to get VERY close to realtime on my cluster (more on that ‘VERY’ later).

# innobackupex --no-timestamp /backups/full
... sometime later ...
# innobackupex --incremental /backups/inc1 --no-timestamp --incremental-basedir /backups/full
... sometime later ...
# innobackupex --incremental /backups/inc2 --no-timestamp --incremental-basedir /backups/inc1

In my proof of concept test, I now have a full and two incrementals:

# du -shc /backups/*
909M	full
665M	inc1
812M	inc2
2.4G	total

To recover this data, I follow the normal Xtrabackup incremental apply process:

# cp -av /backups/full /backups/restore
# innobackupex --apply-log --redo-only --use-memory=1G /backups/restore
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:35694784
# innobackupex --apply-log --redo-only /backups/restore --incremental-dir /backups/inc1 --use-memory=1G
# innobackupex --apply-log --redo-only /backups/restore --incremental-dir /backups/inc2 --use-memory=1G
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:46469942
# innobackupex --apply-log /backups/restore --use-memory=1G

I can see that as I roll forward on my incrementals, I get a higher and higher GTID. Galera’s GTID is stored in the Innodb recovery information, so Xtrabackup extracts it after every batch it applies to the datadir we’re restoring.

We now have a datadir that is ready to go, we need to copy it into the datadir of our joiner node and setup a grastate.dat. Without a grastate, starting the node would force an SST no matter what.

# innobackupex --copy-back /backups/restore
# ... copy a grastate.dat from another running node ...
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    1663c027-2a29-11e5-85da-aa5ca45f600f
seqno:   -1
# chown -R mysql.mysql /var/lib/mysql/

If I start the node now, it should see the grastate.dat with the -1 seqo and run –wsrep_recover to extract the GTID from Innodb (I could have also just put that directly into my grastate.dat).

This will allow the node to startup from merged Xtrabackup incrementals with a known Galera GTID.

But will it IST?

That’s the question.  IST happens when the selected donor has all the transactions the joiner needs to get it fully caught up inside of the donor’s gcache.  There are several implications of this:

  • A gcache is mmap allocated and does not persist across restarts on the donor.  A restart essentially purges the mmap.
  • You can query the oldest GTID seqno on a donor by checking the status variable ‘wsrep_local_cached_downto’.  This variable is not available on 5.5, so you are forced to guess if you can IST or not.
  • most PXC 5.6 will auto-select a donor based on IST.  Prior to that (i.e., 5.5) donor selection was not based on IST candidacy at all, meaning you had to be much more careful and do donor selection manually.
  • There’s no direct mapping from the earliest GTID in a gcache to a specific time, so knowing at a glance if a given incremental will be enough to IST is difficult.
  • It’s also difficult to know how big to make your gcache (set in MB/GB/etc.)  with respect to your backups (which are scheduled by the day/hour/etc.)

All that being said, we’re still talking about backups here.  The above method will only work if and only if:

  • You do frequent incremental backups
  • You have a large gcache (hopefully more on this in a future blog post)
  • You can restore a backup faster than it takes for your gcache to overflow

The post Bypassing SST in Percona XtraDB Cluster with incremental backups appeared first on MySQL Performance Blog.


TOI wsrep_RSU_method in PXC 5.6.24 and up

I noticed that in the latest release of Percona XtraDB Cluster (PXC), the behavior of wsrep_RSU_method changed somewhat.  Prior to this release, the variable was GLOBAL only, meaning to use it you would:

mysql> set GLOBAL wsrep_RSU_method='RSU';
mysql> ALTER TABLE ...
mysql> set GLOBAL wsrep_RSU_method='TOI';

This had the (possibly negative) side-effect that ALL DDL’s issued on this node would be affected by the setting while in RSU mode.

So, in this latest release, this variable was made to also have a SESSION value, while retaining GLOBAL as well. This has a couple of side-effects that are common to MySQL variables that are both GLOBAL and SESSION:

  • The SESSION copy is made from whatever the GLOBAL’s value is when a new connection (session) is established.
  • SET GLOBAL does not affect existing connection’s SESSION values.

Therefore, our above workflow would only set the GLOBAL value to RSU and not the SESSION value for the local connection.  Therefore, our ALTER TABLE will be TOI and NOT RSU!

So, for those using RSU, the proper workflow would be to make your connection, set the SESSION copy of the variable and then issue your DDL:

mysql> set SESSION wsrep_RSU_method='RSU';
mysql> ALTER TABLE ...
... disconnect ...

The advantage here is ONLY your session’s DDLs will be affected by RSU (handy if you possibly do DDLs automatically from your application).

The post TOI wsrep_RSU_method in PXC 5.6.24 and up appeared first on MySQL Performance Blog.


Is 80% of RAM how you should tune your innodb_buffer_pool_size?

It seems these days if anyone knows anything about tuning InnoDB, it’s that you MUST tune your innodb_buffer_pool_size to 80% of your physical memory. This is such prolific tuning advice, it seems engrained in many a DBA’s mind.  The MySQL manual to this day refers to this rule, so who can blame the DBA?  The question is: does it makes sense?

What uses the memory on your server?

Before we question such advice, let’s consider what can take up RAM in a typical MySQL server in their broad categories.  This list isn’t necessarily complete, but I think it outlines the large areas a MySQL server could consume memory.

  • OS Usage: Kernel, running processes, filesystem cache, etc.
  • MySQL fixed usage: query cache, InnoDB buffer pool size, mysqld rss, etc.
  • MySQL workload based usage: connections, per-query buffers (join buffer, sort buffer, etc.)
  • MySQL replication usage:  binary log cache, replication connections, Galera gcache and cert index, etc.
  • Any other services on the same server: Web server, caching server, cronjobs, etc.

There’s no question that for tuning InnoDB, the innodb_buffer_pool_size is the most important variable.  It’s expected to occupy most of the RAM on a dedicated MySQL/Innodb server, but of course other local services may affect how it is tuned.  If it (and other memory consumption on the server) is too large, swapping can kick in and degrade your performance rapidly.

Further, the workload of the MySQL server itself may cause a lot of variation.  Does the server have a lot of open connections and active query workload consuming memory?  The memory consumption caused by this can be dramatically different server to server.

Finally, replication mechanisms like Galera have their own memory usage pattern and can require some adjustments to your buffer pool.

We can see clearly that the 80% rule isn’t as nuanced as reality.

A rule of thumb

However, for the sake of argument, let’s say the 80% rule is a starting point.  A rule of thumb to help us get a quick tuning number to get the server running.  Assuming we don’t know anything really about the workload on the system yet, but we know that the system is dedicated to InnoDB, how might our 80% rule play out?

Total Server RAM Buffer pool with 80% rule Remaining RAM
1G 800MB 200MB
16G 13G 3G
32G 26G 6G
64G 51G 13G
128G 102G 26G
256G 205G 51G
512G 409G 103G
1024G 819G 205G

At lower numbers, our 80% rule looks pretty reasonable.  However, as we get into large servers, it starts to seem less sane.  For the rule to hold true, it must mean that workload memory consumption increases in proportion to needed size of the buffer pool, but that usually isn’t the case.  Our server that has 1TB of RAM likely doesn’t need 205G of that to handle things like connections and queries (likely MySQL couldn’t handle that many active connections and queries anyway).

So, if you really just spent all that money on a beefy server do you really want to pay a 20% tax on that resource because of this rule of thumb?

The origins of the rule

At one of my first MySQL conferences, probably around 2006-2007 when I worked at Yahoo, I attended an InnoDB tuning talk hosted by Heikki Tuuri (the original author of InnoDB) and Peter Zaitsev.  I distinctly remember asking about the 80% rule because at the time Yahoo had some beefy 64G servers and the rule wasn’t sitting right with me.

Heikki’s answer stuck with me.  He said something to the effect of (not a direct quote): “Well, the server I was testing on had 1GB of RAM and 80% seemed about right”.  He then, if memory serves, clarified it and said it would not apply similarly to larger servers.

How should you tune?

80% is maybe a great start and rule of thumb.  You do want to be sure the server has plenty of free RAM for the OS and the usually unknown workload.  However, as we can see above, the larger the server, the more likely the rule will wind up wasting RAM.   I think for most people it starts and ends at the rule of thumb, mostly because changing the InnoDB buffer pool requires a restart in current releases.

So what’s a better rule of thumb?  My rule is that you tune the innodb_buffer_pool_size as large as possible without using swap when the system is running the production workload.  This sounds good in principle, but again, it requires a bunch of restarts and may be easier said than done.

Fortunately MySQL 5.7 and it’s online buffer pool resize feature should make this an easier principle to follow.  Seeing lots of free RAM (and/or filesystem cache usage)?  Turn the buffer pool up dynamically.  Seeing some swap activity?  Just turn it down with no restart required.   In practice, I suspect there will be some performance related hiccups of using this feature, but it is at least a big step in the right direction.

The post Is 80% of RAM how you should tune your innodb_buffer_pool_size? appeared first on MySQL Performance Blog.


Testing backup locks during Xtrabackup SST on Percona XtraDB Cluster

Background on Backup Locks

I was very excited to see Backup locks support in release notes for the latest Percona XtraDB Cluster 5.6.21 release. For those who are not aware, backup locks offer an alternative to FLUSH TABLES WITH READ LOCK (FTWRL) in Xtrabackup. While Xtrabackup can hot-copy Innodb, everything else in MySQL must be locked (usually briefly) to get a consistent snapshot that lines up with Innodb. This includes all other storage engines, but also things like table schemas (even on Innodb) and async replication binary logs. You can skip this lock, but it isn’t generally considered a ‘safe’ backup in every case.

Until recently, Xtrabackup (like most other backup tools) used FTWRL to accomplish this. This worked great, but had the unfortunate side-effect of locking every single table, even the Innodb ones.  This functionally meant that even a hot-backup tool for Innodb had to take a (usually short) global lock to get a consistent backup with MySQL overall.

Backup locks change that by introducing a new locking command on Percona Server called ‘LOCK TABLES FOR BACKUP’.  This works by locking writes to non-transactional tables, as well as locking DDL on all tables (including Innodb).  If Xtrabackup (of a recent vintage) detects that it’s backing up a Percona Server (also of recent vintage), it will automatically use LOCK TABLES WITH BACKUP instead of FLUSH TABLES WITH READ LOCK.

The TL;DR of this is that you can keep on modifying your Innodb data through the entire backup, since we don’t need to use FTWRL any longer.

This feature was introduced in Percona Server 5.6.16-64.0 and Percona XtraBackup 2.2.  I do not believe you will find it in any other MySQL variant, though I could be corrected.

What this means for Percona XtraDB Cluster (PXC)

The most common (and logical) SST method for Percona XtraDB Cluster is using Xtrabackup. This latest release of PXC includes support for backup locks, meaning that Xtrabackup donor nodes will no longer need to get a global lock. Practically for PXC users, this means that your Donor nodes can stay in rotation without causing client interruptions due to FTWRL.

Seeing it in action

To test this out, I spun up a 3-node cluster on AWS and fired up a sysbench run on the first node. I forced and SST on the node. Here is a snippet of the innobackup.backup.log (generated by all Xtrabackup donors in Percona XtraDB Cluster):

InnoDB Backup Utility v1.5.1-xtrabackup; Copyright 2003, 2009 Innobase Oy
and Percona LLC and/or its affiliates 2009-2013. All Rights Reserved.
This software is published under
the GNU GENERAL PUBLIC LICENSE Version 2, June 1991.
Get the latest version of Percona XtraBackup, documentation, and help resources:
141218 19:22:01 innobackupex: Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_file=/etc/my.cnf;mysql_read_default_group=xtraback
up;mysql_socket=/var/lib/mysql/mysql.sock' as 'sst' (using password: YES).
141218 19:22:01 innobackupex: Connected to MySQL server
141218 19:22:01 innobackupex: Starting the backup operation
IMPORTANT: Please check that the backup run completes successfully.
 At the end of a successful backup run innobackupex
 prints "completed OK!".
innobackupex: Using server version 5.6.21-70.1-56
innobackupex: Created backup directory /tmp/tmp.Rm0qA740U3
141218 19:22:01 innobackupex: Starting ibbackup with command: xtrabackup --defaults-file="/etc/my.cnf" --defaults-group="mysqld" --backup --suspe
nd-at-end --target-dir=/tmp/tmp.dM03LgPHFY --innodb_data_file_path="ibdata1:12M:autoextend" --tmpdir=/tmp/tmp.dM03LgPHFY --extra-lsndir='/tmp/tmp.dM
03LgPHFY' --stream=xbstream
innobackupex: Waiting for ibbackup (pid=21892) to suspend
innobackupex: Suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_suspended_2'
xtrabackup version 2.2.7 based on MySQL server 5.6.21 Linux (x86_64) (revision id: )
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 0, set to 5000
xtrabackup: using the following InnoDB configuration:
xtrabackup: innodb_data_home_dir = ./
xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup: innodb_log_group_home_dir = ./
xtrabackup: innodb_log_files_in_group = 2
xtrabackup: innodb_log_file_size = 1073741824
xtrabackup: using O_DIRECT
>> log scanned up to (10525811040)
xtrabackup: Generating a list of tablespaces
[01] Streaming ./ibdata1
>> log scanned up to (10529368594)
>> log scanned up to (10532685942)
>> log scanned up to (10536422820)
>> log scanned up to (10539562039)
>> log scanned up to (10543077110)
[01] ...done
[01] Streaming ./mysql/innodb_table_stats.ibd
[01] ...done
[01] Streaming ./mysql/innodb_index_stats.ibd
[01] ...done
[01] Streaming ./mysql/slave_relay_log_info.ibd
[01] ...done
[01] Streaming ./mysql/slave_master_info.ibd
[01] ...done
[01] Streaming ./mysql/slave_worker_info.ibd
[01] ...done
[01] Streaming ./sbtest/sbtest1.ibd
>> log scanned up to (10546490256)
>> log scanned up to (10550321726)
>> log scanned up to (10553628936)
>> log scanned up to (10555422053)
[01] ...done
[01] Streaming ./sbtest/sbtest17.ibd
>> log scanned up to (10831343724)
>> log scanned up to (10834063832)
>> log scanned up to (10837100278)
>> log scanned up to (10840243171)
[01] ...done
xtrabackup: Creating suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_suspended_2' with pid '21892'
>> log scanned up to (10843312323)
141218 19:24:06 innobackupex: Continuing after ibbackup has suspended
141218 19:24:06 innobackupex: Executing LOCK TABLES FOR BACKUP...
141218 19:24:06 innobackupex: Backup tables lock acquired
141218 19:24:06 innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/var/lib/mysql/'
innobackupex: Backing up files '/var/lib/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (74 files)
>> log scanned up to (10846683627)
>> log scanned up to (10847773504)
innobackupex: Backing up files '/var/lib/mysql//sbtest/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (21 files)
innobackupex: Backing up file '/var/lib/mysql//test/db.opt'
innobackupex: Backing up files '/var/lib/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (53 files)
>> log scanned up to (10852976291)
141218 19:24:09 innobackupex: Finished backing up non-InnoDB tables and files
141218 19:24:09 innobackupex: Executing LOCK BINLOG FOR BACKUP...
141218 19:24:09 innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
141218 19:24:09 innobackupex: Waiting for log copying to finish
>> log scanned up to (10856996124)
xtrabackup: The latest check point (for incremental): '9936050111'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (10856996124)
xtrabackup: Creating suspend file '/tmp/tmp.dM03LgPHFY/xtrabackup_log_copied' with pid '21892'
141218 19:24:10 innobackupex: Executing UNLOCK BINLOG
141218 19:24:10 innobackupex: Executing UNLOCK TABLES
141218 19:24:10 innobackupex: All tables unlocked
141218 19:24:10 innobackupex: Waiting for ibbackup (pid=21892) to finish
xtrabackup: Transaction log of lsn (9420426891) to (10856996124) was copied.
innobackupex: Backup created in directory '/tmp/tmp.Rm0qA740U3'
141218 19:24:30 innobackupex: Connection to database server closed
141218 19:24:30 innobackupex: completed OK!

We can see the LOCK TABLES FOR BACKUP issued at 19:24:06 and unlocked at 19:24:10. Let’s see Galera apply stats from this node during that time:

mycluster / ip-10-228-128-220 (idx: 0) / Galera 3.8(rf6147dd)
Wsrep    Cluster  Node Repl  Queue     Ops       Bytes     Conflct   Gcache    Window        Flow
    time P cnf  # Stat Laten   Up   Dn   Up   Dn   Up   Dn  lcf  bfa  ist  idx dst appl comm  p_ms
19:23:55 P   5  3 Dono 698µs    0   72    0 5418  0.0 3.5M    0    0 187k   94  3k    3    2     0
19:23:56 P   5  3 Dono 701µs    0   58    0 5411  0.0 3.5M    0    0 188k  229  3k    3    2     0
19:23:57 P   5  3 Dono 701µs    0    2    0 5721  0.0 3.7M    0    0 188k  120  3k    3    2     0
19:23:58 P   5  3 Dono 689µs    0    5    0 5643  0.0 3.6M    0    0 188k   63  3k    3    2     0
19:23:59 P   5  3 Dono 679µs    0   55    0 5428  0.0 3.5M    0    0 188k  115  3k    3    2     0
19:24:01 P   5  3 Dono 681µs    0    1    0 4623  0.0 3.0M    0    0 188k  104  3k    3    2     0
19:24:02 P   5  3 Dono 690µs    0    0    0 4301  0.0 2.7M    0    0 188k  141  3k    3    2     0
19:24:03 P   5  3 Dono 688µs    0    2    0 4907  0.0 3.1M    0    0 188k  227  3k    3    2     0
19:24:04 P   5  3 Dono 692µs    0   44    0 4894  0.0 3.1M    0    0 188k  116  3k    3    2     0
19:24:05 P   5  3 Dono 706µs    0    0    0 5337  0.0 3.4M    0    0 188k   63  3k    3    2     0

Initially the node is keeping up ok with replication. The Down Queue (wsrep_local_recv_queue) is sticking around 0. We’re applying 4-5k transactions per second (Ops Dn). When the backup lock kicks in, we do see an increase in the queue size, but note that transactions are still applying on this node:

19:24:06 P   5  3 Dono 696µs    0  170    0 5671  0.0 3.6M    0    0 187k  130  3k    3    2     0
19:24:07 P   5  3 Dono 695µs    0 2626    0 3175  0.0 2.0M    0    0 185k 2193  3k    3    2     0
19:24:08 P   5  3 Dono 692µs    0 1248    0 6782  0.0 4.3M    0    0 186k 1800  3k    3    2     0
19:24:09 P   5  3 Dono 693µs    0  611    0 6111  0.0 3.9M    0    0 187k  651  3k    3    2     0
19:24:10 P   5  3 Dono 708µs    0   93    0 5316  0.0 3.4M    0    0 187k  139  3k    3    2     0

So this node isn’t locked from innodb write transactions, it’s just suffering a bit of IO load while the backup finishes copying its files and such. After this, the backup finished up and the node goes back to a Synced state pretty quickly:

19:24:11 P   5  3 Dono 720µs    0    1    0 4486  0.0 2.9M    0    0 188k   78  3k    3    2     0
19:24:12 P   5  3 Dono 715µs    0    0    0 3982  0.0 2.5M    0    0 188k  278  3k    3    2     0
19:24:13 P   5  3 Dono 1.2ms    0    0    0 4337  0.0 2.8M    0    0 188k  143  3k    3    2     0
19:24:14 P   5  3 Dono 1.2ms    0    1    0 4901  0.0 3.1M    0    0 188k  130  3k    3    2     0
19:24:16 P   5  3 Dono 1.1ms    0    0    0 5289  0.0 3.4M    0    0 188k   76  3k    3    2     0
19:24:17 P   5  3 Dono 1.1ms    0   42    0 4998  0.0 3.2M    0    0 188k  319  3k    3    2     0
19:24:18 P   5  3 Dono 1.1ms    0   15    0 3290  0.0 2.1M    0    0 188k   75  3k    3    2     0
19:24:19 P   5  3 Dono 1.1ms    0    0    0 4124  0.0 2.6M    0    0 188k  276  3k    3    2     0
19:24:20 P   5  3 Dono 1.1ms    0    4    0 1635  0.0 1.0M    0    0 188k   70  3k    3    2     0
19:24:21 P   5  3 Dono 1.1ms    0    0    0 5026  0.0 3.2M    0    0 188k  158  3k    3    2     0
19:24:22 P   5  3 Dono 1.1ms    0   20    0 4100  0.0 2.6M    0    0 188k  129  3k    3    2     0
19:24:23 P   5  3 Dono 1.1ms    0    0    0 5412  0.0 3.5M    0    0 188k  159  3k    3    2     0
19:24:24 P   5  3 Dono 1.1ms    0  315    0 4567  0.0 2.9M    0    0 187k  170  3k    3    2     0
19:24:25 P   5  3 Dono 1.0ms    0   24    0 5535  0.0 3.5M    0    0 188k  131  3k    3    2     0
19:24:26 P   5  3 Dono 1.0ms    0    0    0 5427  0.0 3.5M    0    0 188k   71  3k    3    2     0
19:24:27 P   5  3 Dono 1.0ms    0    1    0 5221  0.0 3.3M    0    0 188k  256  3k    3    2     0
19:24:28 P   5  3 Dono 1.0ms    0    0    0 5317  0.0 3.4M    0    0 188k  159  3k    3    2     0
19:24:29 P   5  3 Dono 1.0ms    0    1    0 5491  0.0 3.5M    0    0 188k  163  3k    3    2     0
19:24:30 P   5  3 Sync 1.0ms    0    0    0 5540  0.0 3.5M    0    0 188k  296  3k    3    2     0
19:24:31 P   5  3 Sync 992µs    0  106    0 5594  0.0 3.6M    0    0 187k  130  3k    3    2     0
19:24:33 P   5  3 Sync 984µs    0   19    0 5723  0.0 3.7M    0    0 188k  275  3k    3    2     0
19:24:34 P   5  3 Sync 976µs    0    0    0 5508  0.0 3.5M    0    0 188k  182  3k    3    2     0

Compared to Percona XtraDB Cluster 5.5

The Backup Locking is only a feature of Percona XtraDB Cluster 5.6, so if we repeat the experiment on 5.5, we can see a more severe lock:

141218 20:31:19  innobackupex: Executing FLUSH TABLES WITH READ LOCK...
141218 20:31:19  innobackupex: All tables locked and flushed to disk
141218 20:31:19  innobackupex: Starting to backup non-InnoDB tables and files
innobackupex: in subdirectories of '/var/lib/mysql/'
innobackupex: Backing up files '/var/lib/mysql//sbtest/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (21 files)
innobackupex: Backing up files '/var/lib/mysql//mysql/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (72 files)
>> log scanned up to (6633554484)
innobackupex: Backing up file '/var/lib/mysql//test/db.opt'
innobackupex: Backing up files '/var/lib/mysql//performance_schema/*.{frm,isl,MYD,MYI,MAD,MAI,MRG,TRG,TRN,ARM,ARZ,CSM,CSV,opt,par}' (18 files)
141218 20:31:21  innobackupex: Finished backing up non-InnoDB tables and files
141218 20:31:21  innobackupex: Executing FLUSH NO_WRITE_TO_BINLOG ENGINE LOGS...
141218 20:31:21  innobackupex: Waiting for log copying to finish
xtrabackup: The latest check point (for incremental): '5420681649'
xtrabackup: Stopping log copying thread.
.>> log scanned up to (6633560488)
xtrabackup: Creating suspend file '/tmp/tmp.Cq5JRZEFki/xtrabackup_log_copied' with pid '23130'
141218 20:31:22  innobackupex: All tables unlocked

Our lock lasts from 20:31:19 until 20:31:21, so it’s fairly short. Note that with larger databases with more schemas and tables, this can be quite a bit longer. Let’s see the effect on the apply rate for this node:

mycluster / ip-10-229-68-156 (idx: 0) / Galera 2.11(r318911d)
Wsrep    Cluster  Node Repl  Queue     Ops       Bytes     Conflct   Gcache    Window        Flow
    time P cnf  # Stat Laten   Up   Dn   Up   Dn   Up   Dn  lcf  bfa  ist  idx dst appl comm  p_ms
20:31:13 P   5  3 Dono   N/A    0   73    0 3493  0.0 1.8M    0    0 1.8m  832 746    2    2   0.0
20:31:14 P   5  3 Dono   N/A    0   29    0 3578  0.0 1.9M    0    0 1.8m  850 749    3    2   0.0
20:31:15 P   5  3 Dono   N/A    0    0    0 3513  0.0 1.8M    0    0 1.8m  735 743    2    2   0.0
20:31:16 P   5  3 Dono   N/A    0    0    0 3651  0.0 1.9M    0    0 1.8m  827 748    2    2   0.0
20:31:17 P   5  3 Dono   N/A    0   27    0 3642  0.0 1.9M    0    0 1.8m  840 762    2    2   0.0
20:31:18 P   5  3 Dono   N/A    0    0    0 3840  0.0 2.0M    0    0 1.8m  563 776    2    2   0.0
20:31:19 P   5  3 Dono   N/A    0    0    0 4368  0.0 2.3M    0    0 1.8m  823 745    2    1   0.0
20:31:20 P   5  3 Dono   N/A    0 3952    0  339  0.0 0.2M    0    0 1.8m  678 751    1    1   0.0
20:31:21 P   5  3 Dono   N/A    0 7883    0    0  0.0  0.0    0    0 1.8m  678 751    0    0   0.0
20:31:22 P   5  3 Dono   N/A    0 4917    0 5947  0.0 3.1M    0    0 1.8m 6034  3k    7    6   0.0
20:31:24 P   5  3 Dono   N/A    0   10    0 8238  0.0 4.3M    0    0 1.8m  991  1k    7    6   0.0
20:31:25 P   5  3 Dono   N/A    0    0    0 3016  0.0 1.6M    0    0 1.8m  914 754    2    1   0.0
20:31:26 P   5  3 Dono   N/A    0    0    0 3253  0.0 1.7M    0    0 1.8m  613 766    1    1   0.0
20:31:27 P   5  3 Dono   N/A    0    1    0 3600  0.0 1.9M    0    0 1.8m  583 777    2    1   0.0
20:31:28 P   5  3 Dono   N/A    0    0    0 3640  0.0 1.9M    0    0 1.8m  664 750    2    2   0.0

The drop here is more severe and the apply rate hits 0 (and stays there for the duration of the FTWRL).


Obviously Xtrabackup running on a PXC node will cause some load on the node itself, so there still maybe good reasons to keep a Donor node out of rotation from your application.  However, this is less of an issue than it was in the past, where writes would definitely stall on a Donor node and present potentially intermittent stalls on the application.

How you allow applications to start using a Donor node automatically (or not) depends on how you have your HA between the application and cluster setup.  If you use HAproxy or similar with clustercheck, you can either modify the script itself or change a command line argument. The node is in the Donor/Desynced state below:

[root@ip-10-229-64-35 ~]# /usr/bin/clustercheck clustercheckuser clustercheckpassword!
HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Connection: close
Content-Length: 44
Percona XtraDB Cluster Node is not synced.
[root@ip-10-229-64-35 ~]# /usr/bin/clustercheck clustercheckuser clustercheckpassword! 1
HTTP/1.1 200 OK
Content-Type: text/plain
Connection: close
Content-Length: 40
Percona XtraDB Cluster Node is synced.

For those doing their own custom health checking, you basically just need to pass nodes that have a wsrep_local_state_comment of either ‘Synced’ or ‘Donor/Desynced’.

The post Testing backup locks during Xtrabackup SST on Percona XtraDB Cluster appeared first on MySQL Performance Blog.


MySQL Tutorials: A time to learn at Percona Live 2015

The many hours of intensive tutorials, led by some of the top minds in MySQL, have always been a major draw each year to the Percona Live MySQL Conference and Expo. And 2015’s event will be no exception.

Percona Live 2015 runs April 13-16 in Santa Clara, Calif. and the first day is dedicated to the classroom – so bring your laptops for the combined 45 hours of learning. MySQL tutorials are included with the full-conference pass but a “tutorial-only pass” is also available. Super-saver registration discounts have been extended until Dec. 19. Here’s a look at this year’s tutorials lineup. (There will be a couple more announced in January).

And that’s just on Monday! There will be much more over the four days of the Percona Live MySQL Conference and Expo 2015. I posted a sneak peek of the full Percona Live (initial) roster a couple weeks ago. And remember, super-saver registration discounts have been extended until Dec. 19 so register now – and don’t forgot your laptop (and power cord)!

The post MySQL Tutorials: A time to learn at Percona Live 2015 appeared first on MySQL Performance Blog.


Streamlined Percona XtraDB Cluster (or anything) testing with Consul and Vagrant

Introducing Consul

I’m always interested in what Mitchell Hashimoto and Hashicorp are up to, I typically find their projects valuable.  If you’ve heard of Vagrant, you know their work.

I recently became interested in a newer project they have called ‘Consul‘.  Consul is a bit hard to describe.  It is (in part):

  • Highly consistent metadata store (a bit like Zookeeeper)
  • A monitoring system (lightweight Nagios)
  • A service discovery system, both DNS and HTTP-based. (think of something like haproxy, but instead of tcp load balancing, it provides dns lookups with healthy services)

What this has to do with Percona XtraDB Cluster

I’ve had some more complex testing for Percona XtraDB Cluster (PXC) to do on my plate for quite a while, and I started to explore Consul as a tool to help with this.  I already have Vagrant setups for PXC, but ensuring all the nodes are healthy, kicking off tests, gathering results, etc. were still difficult.

So, my loose goals for Consul are:

  • A single dashboard to ensure my testing environment is healthy
  • Ability to adapt to any size environment — 3 node clusters up to 20+
  • Coordinate starting and stopping load tests running on any number of test clients
  • Have the ability to collect distributed test results

I’ve succeeded on some of these fronts with a Vagrant environment I’ve been working on. This spins up:

  • A Consul cluster (default is a single node)
  • Test server(s)
  • A PXC cluster

Additionally, it integrates the Test servers and PXC nodes with Consul such that:

  • The servers setup a Consul agent in client mode to the  Consul cluster
  • Additionally, they setup a local DNS forwarder that sends all DNS requests to the ‘.consul’ domain to the local agent to be serviced by the Consul cluster.
  • The servers register services with Consul that run local health checks
  • The test server(s) setup a ‘watch’ in consul to wait for starting sysbench on a consul ‘event’.

Seeing it in action

Once I run my ‘vagrant up’, I get a consul UI I can connect to on my localhost at port 8501:

Consul's Node Overview

Consul’s Node Overview


I can see all 5 of my nodes.  I can check the services and see that test1 is failing one health check because sysbench isn’t running yet:

Consul reporting sysbench is not running.

Consul reporting sysbench is not running.

This is expected, because I haven’t started testing yet.  I can see that my PXC cluster is healthy:


Health checks are using clustercheck from the PXC package

Health checks are using clustercheck from the PXC package


Involving Percona Cloud Tools in the system

So far, so good.  This Vagrant configuration (if I provide a PERCONA_AGENT_API_KEY in my environment) also registers my test servers with Percona Cloud Tools, so I can see data being reported there for my nodes:

Percona Cloud Tool's Dashboard for a single node

Percona Cloud Tool’s Dashboard for a single node

So now I am ready to begin my test.  To do so, I simply need to issue a consul event from any of the nodes:

jayj@~/Src/pxc_consul [507]$ vagrant ssh consul1
Last login: Wed Nov 26 14:32:38 2014 from
[root@consul1 ~]# consul event -name='sysbench_update_index'
Event ID: 7c8aab42-fd2e-de6c-cb0c-1de31c02ce95

My pre-configured watchers on my test node knows what to do with that event and launches sysbench.  Consul shows that sysbench is indeed running:

Screen Shot 2014-11-26 at 9.43.29 AM


And I can indeed see traffic start to come in on Percona Cloud Tools:

Screen Shot 2014-11-26 at 9.53.11 AM

I have testing traffic limited for my example, but that’s easily tunable via the Vagrantfile.  To show something a little more impressive, here’s a 5 node cluster running hitting around 2500 tps total throughput:

Screen Shot 2014-11-26 at 1.08.48 PM

So to summarize thus far:

  • I can spin up any size cluster I want and verify it is healthy with Consul’s UI
  • I can spin up any number of test servers and kick off sysbench on all of them simultaneously

Another big trick of Consul’s

That so far so good, but let me point out a few things that may not be obvious.  If you check the Vagrantfile, I use a consul hostname in a few places.  First, on the test servers:

# sysbench setup
            'tables' => 1,
            'rows' => 1000000,
            'threads' => 4 * pxc_nodes,
            'tx_rate' => 10,
            'mysql_host' => 'pxc.service.consul'

then again on the PXC server configuration:

# PXC setup
          "percona_server_version"  => pxc_version,
          'innodb_buffer_pool_size' => '1G',
          'innodb_log_file_size' => '1G',
          'innodb_flush_log_at_trx_commit' => '0',
          'pxc_bootstrap_node' => (i == 1 ? true : false ),
          'wsrep_cluster_address' => 'gcomm://pxc.service.consul',
          'wsrep_provider_options' => 'gcache.size=2G; gcs.fc_limit=1024',

Notice ‘pxc.service.consul’.  This hostname is provided by Consul and resolves to all the IPs of the current servers both having and passing the ‘pxc’ service health check:

[root@test1 ~]# host pxc.service.consul
pxc.service.consul has address
pxc.service.consul has address
pxc.service.consul has address

So I am using this to my advantage in two ways:

  1. My PXC cluster bootstraps the first node automatically, but all the other nodes use this hostname for their wsrep_cluster_address.  This means: no specific hostnames or ips in the my.cnf file, and this hostname will always be up to date with what nodes are active in the cluster; which is the precise list that should be in the wsrep_cluster_address at any given moment.
  2. My test servers connect to this hostname, therefore they always know where to connect and they will round-robin (if I have enough sysbench threads and PXC nodes) to different nodes based on the response of the dns lookup, which returns 3 of the active nodes in a different order each time.

(Some of) The Issues

This is still a work in progress and there are many improvements that could be made:

  • I’m relying on PCT to collect my data, but it’d be nice to utilize Consul’s central key/value store to store results of the independent sysbench runs.
  • Consul’s leader election could be used to help the cluster determine which node should bootstrap on first startup. I am assuming node1 should bootstrap.
  • A variety of bugs in various software still makes this a bit clunky sometimes to manage.  Here is a sample:
    • Consul events sometimes don’t fire in the current release (though it looks to be fixed soon)
    • PXC joining nodes sometimes get stuck putting speed bumps into the automated deploy.
    • Automated installs of percona-agent (which sends data to Percona Cloud Tools) is straight-forward, except when different cluster nodes clobber each other’s credentials.

So, in summary, I am happy with how easily Consul integrates and I’m already finding it useful for a product in its 0.4.1 release.

The post Streamlined Percona XtraDB Cluster (or anything) testing with Consul and Vagrant appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by