Feb
13
2018
--

Want IST Not SST for Node Rejoins? We Have a Solution!

What if we told you that there is a sure way to get IST, not SST, for node rejoins? You can guarantee that a restarting node rejoins using IST. Sound interesting? Keep reading.

Normally, when a node is taken out of the cluster for a short period of time (for maintenance or shutdown), the gcache on the other nodes of the cluster donates the missing write-set(s) when the node rejoins. This works only if you have configured a large enough gcache, or the downtime is short enough. Neither option is great, especially for a production cluster: a larger gcache for the server's lifetime means reserving extra disk space permanently, when the same job can be done with relatively less.

Re-configuring the gcache on a potential DONOR node before downtime requires a node shutdown (dynamic resizing of the gcache is not possible, or rather not needed now). Restoring it to its original size needs another shutdown. That makes three shutdowns for a single maintenance window: not acceptable for busy production clusters, and more opportunities for errors.

Introducing “gcache.freeze_purge_at_seqno”

Given this pain-point, we are introducing gcache.freeze_purge_at_seqno in Percona XtraDB Cluster 5.7.20. It controls the purging of the gcache, thereby retaining more data to facilitate IST when the node rejoins.

Every transaction in the Galera cluster world is assigned a unique global sequence number (seqno). Tracking happens using this seqno (wsrep_last_applied, wsrep_last_committed, wsrep_replicated, wsrep_local_cached_downto, etc.). wsrep_local_cached_downto represents the seqno down to which the gcache has been purged: if wsrep_local_cached_downto = N, then the gcache has data in [N, wsrep_replicated] and has purged data in [1, N).
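
All of these are status variables, so you can check a node's current position and how far back its gcache reaches with, for example:

show global status like 'wsrep_last_committed';
show global status like 'wsrep_local_cached_downto';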

gcache.freeze_purge_at_seqno takes three values:

  1. -1 (default): no freeze, the purge operates as normal.
  2. x (must be a valid seqno in the gcache): freeze purging of write-sets >= x. The best way to select x is to use the wsrep_last_applied value from the node that you plan to shut down, reduced by roughly 10% (x ≈ wsrep_last_applied * 0.9). Retaining this extra 10% leaves headroom for the safety gap heuristic that IST applies.
  3. now: freeze purging of write-sets >= the smallest seqno currently in the gcache, i.e. an instant freeze of gcache purging. (If tracking x (above) is difficult, simply use “now” and you are good.)

Set this on an existing node of the cluster (one that will remain part of the cluster and can act as a potential DONOR). This node then continues to retain the write-sets, allowing the restarting node to rejoin using IST. (You can nominate this node as the preferred DONOR through wsrep_sst_donor when restarting the rejoining node.)

Remember to set it back to -1 once the node rejoins, so the DONOR doesn't keep hogging space beyond the maintenance window. On the next purge cycle, all the old retained write-sets are freed as well, reclaiming the space.

Note:

To find the existing value of gcache.freeze_purge_at_seqno, query wsrep_provider_options:
select @@wsrep_provider_options;
To set gcache.freeze_purge_at_seqno:
set global wsrep_provider_options="gcache.freeze_purge_at_seqno = now";
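
Putting it together, a typical maintenance window looks roughly like this (the node names are illustrative: n1 stays up and acts as the DONOR, n2 is the node under maintenance):

  1. On n1: set global wsrep_provider_options="gcache.freeze_purge_at_seqno = now";
  2. Shut down n2 and perform the maintenance.
  3. Restart n2 with wsrep_sst_donor=n1 in its my.cnf, so that n1 is picked as the DONOR.
  4. Once n2 has rejoined via IST, on n1: set global wsrep_provider_options="gcache.freeze_purge_at_seqno = -1";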

Why should you use it?

  • gcache grows dynamically (using existing pagestore mechanism) and shrinks once the user sets it back to -1. This means you only use disk-space when needed.
  • No restart needed. The user can concentrate on the node under maintenance.
  • No complex math or understanding of seqno involved (simply use “now”).
  • Less prone to error, as SST is one of the major error-prone areas with the cluster.

So why wait? Give it a try! It is part of Percona XtraDB Cluster 5.7.20 onwards, and helps you get IST, not SST, for node rejoins.

Note: If you need more information about gcache, check here and here.

Nov
15
2017
--

Understanding how an IST donor is selected

In a clustering environment, we often see a node that needs to be taken down for maintenance. For the node to rejoin, it must re-sync with the cluster state. In PXC (Percona XtraDB Cluster), there are two ways for the rejoining node to re-sync: State Snapshot Transfer (SST) and Incremental State Transfer (IST). SST involves a full data transfer (which can be time-consuming). IST is an incremental data transfer in which only the missing write-sets are donated by a DONOR to the rejoining node (a.k.a. the JOINER).

In this article I will try to show how a DONOR for the IST process is selected.

Selecting an IST DONOR

First, a word about gcache. Each node retains some write-sets in its cache, known as gcache. Once this gcache is full, older write-sets are purged to make room for new ones. Based on the gcache configuration, each node may retain a different span of write-sets. The wider the span, the greater the probability of the node acting as a prospective DONOR. The lowest seqno in the gcache can be queried using:

show status like 'wsrep_local_cached_downto';

Let’s understand the IST DONOR algorithm with a topology and working example:

  • Say we have 3 node cluster: N1, N2, N3.
  • To start with, all 3 nodes are in sync (wsrep_last_committed is the same for all 3 nodes, let’s say 100).
  • N3 is scheduled for maintenance and is taken down.
  • In the meantime, N1 and N2 process the workload, thereby moving from 100 -> 1100.
  • N1 and N2 also purge their gcache. Let’s say wsrep_local_cached_downto for N1 and N2 is 110 and 90, respectively.
  • Now N3 is restarted and discovers that the cluster has made progress from 100 -> 1100 and so it needs the write-sets from (101, 1100).
  • It starts looking for a prospective DONOR.
    • N1 can service data from (110, 1100), but the request is for (101, 1100), so N1 can’t act as DONOR.
    • N2 can service data from (90, 1100) and the request is for (101, 1100) so N2 can act as DONOR.

Safety gap and how it affects DONOR selection

So far so good. But can N2 reliably act as DONOR? While N3 is evaluating the prospective DONOR, what if N2 purges more data, and wsrep_local_cached_downto on N2 is now 105? To accommodate this, N3's selection algorithm adds a safety gap.

safety gap = (Current State of Cluster – Lowest available seqno from any existing node of the cluster) * 0.008

So the N2 range is considered to be (90 + (1100 – 90) * 0.008, 1100) = (98, 1100).

Can N2 now act as DONOR? Yes: its effective range (98, 1100) contains the requested range (101, 1100), since 98 < 101.

What if N2 had purged up to 95 before N3 started looking for a prospective DONOR?

In this case the N2 range would be (95 + (1100 – 95) * 0.008, 1100) = (103, 1100), ruling N2 out from the prospective DONOR list.
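
If you want to reproduce this arithmetic yourself, it can be checked directly in the mysql client (the numbers below are the ones from this example, not live status values):

select 90 + (1100 - 90) * 0.008;  -- 98.08: N2 can serve an IST request starting at 101
select 95 + (1100 - 95) * 0.008;  -- 103.04: N2 can no longer serve an IST request starting at 101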

Twist at the end

Considering the latter case above (N2 purged up to 95), we established that N2 can’t act as the IST DONOR, and the only way for N3 to join is through SST.

What if I say that N3 still joins back using IST? CONFUSED?

Once N3 falls back from IST to SST, it selects an SST donor. This selection is done sequentially, and it nominates N1 as the first choice. N1 doesn’t have the required write-sets, so SST is forced.

But what if I configure wsrep_sst_donor=N2 on N3? This will cause N2 to get selected instead of N1. But wait: N2 doesn’t seem to qualify either, since with the safety gap its effective range is (103, 1100).

That’s true. But the state transfer request carries both an IST and an SST component, so even though N3 ruled out N2 as the IST DONOR, the request is still sent to N2 for one last try. If N2 can service it using IST, it is allowed to do so; otherwise, it falls back to SST.
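
For completeness, here is a minimal sketch of how that donor preference is expressed on N3 (the node name follows this example; adjust it for your own cluster):

[mysqld]
wsrep_sst_donor=N2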

Interesting! This is a well-thought-out algorithm from Codership: I applaud them for this and the many other important control functions that go on backstage in the Galera cluster.

Nov
30
2016
--

Galera Cache (gcache) is finally recoverable on restart

This post describes how to recover the Galera Cache (or gcache) on restart.

Recently, Codership introduced (with Galera 3.19) a very important and long-awaited feature: users can now recover the Galera cache on restart.

Need

If you gracefully shut down cluster nodes one after another, with some lag time between them, then the last node to shut down holds the latest data. The next time you restart the cluster, that node will be the first one to boot. Any follow-up nodes that join the cluster after the first node will demand an SST.

Why SST, when these nodes already have the data and only a few write-sets are missing? The DONOR node caches missing write-sets in the Galera cache, but on restart this cache is wiped clean and started fresh. So the DONOR node doesn’t have a Galera cache from which to donate the missing write-sets.

This painful setup made it necessary for users to think and plan before gracefully taking down the cluster. With the introduction of this new feature, the user can retain the Galera cache across restarts.

How does this help ?

On restart, the node will revive the galera-cache. This means the node can act as a DONOR and service missing write-sets (facilitating IST instead of SST). Retention of the galera-cache is controlled by the option gcache.recover=yes/no. The default is no (the Galera cache is not retained). The user can set this option on all nodes, or on selected nodes, based on disk usage.
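
Since the option takes effect at startup, it is set through wsrep_provider_options in the configuration file. A minimal sketch (the gcache size shown is illustrative; merge this with any provider options you already use):

# my.cnf on the node(s) that should revive their gcache on restart
wsrep_provider_options="gcache.recover=yes;gcache.size=1G"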

gcache.recover in action

The example below demonstrates how to use this option:

  • Let’s say the user has a three-node cluster (n1, n2, n3), with all nodes in sync.
  • The user gracefully shuts down n2 and n3.
  • n1 is still up and running, and processes some workload, so n1 now has the latest data.
  • n1 is eventually shut down.
  • Now the user decides to restart the cluster. Obviously, the user needs to start n1 first, followed by n2/n3.
  • n1 boots up, forming a new cluster.
  • n2 boots up, joins the cluster, finds there are missing write-sets and demands IST, but given that n1 doesn’t have a gcache, it falls back to SST.

n2 (JOINER node log):

2016-11-18 13:11:06 3277 [Note] WSREP: State transfer required:
 Group state: 839028c7-ad61-11e6-9055-fe766a1886c3:4680
 Local state: 839028c7-ad61-11e6-9055-fe766a1886c3:3893

n1 (DONOR node log), with gcache.recover=no:

2016-11-18 13:11:06 3245 [Note] WSREP: IST request: 839028c7-ad61-11e6-9055-fe766a1886c3:3893-4680|tcp://192.168.1.3:5031
2016-11-18 13:11:06 3245 [Note] WSREP: IST first seqno 3894 not found from cache, falling back to SST

Now let’s re-execute this scenario with gcache.recover=yes.

n2 (JOINER node log):

2016-11-18 13:24:38 4603 [Note] WSREP: State transfer required:
 Group state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495
 Local state: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769
....
2016-11-18 13:24:41 4603 [Note] WSREP: Receiving IST: 726 writesets, seqnos 769-1495
....
2016-11-18 13:24:49 4603 [Note] WSREP: IST received: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:1495

n1 (DONOR node log):

2016-11-18 13:24:38 4573 [Note] WSREP: IST request: ee8ef398-ad63-11e6-92ed-d6c0646c9f13:769-1495|tcp://192.168.1.3:5031
2016-11-18 13:24:38 4573 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

You can also validate this by checking the lowest write-set available in gcache on the DONOR node.

mysql> show status like 'wsrep_local_cached_downto';
+---------------------------+-------+
| Variable_name | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 1 |
+---------------------------+-------+
1 row in set (0.00 sec)

So as you can see, gcache.recover was able to restore the cache on restart and service IST instead of SST. This is a major resource saver for most graceful shutdowns.

gcache revive doesn’t work if . . .

If gcache pages are involved. Gcache pages are still removed on shutdown, and the gcache write-sets up to that point also get cleared.

Again, let’s see an example:

  • Let’s assume the same configuration and workflow as mentioned above. We will just change the workload pattern.
  • n1, n2, n3 are in sync and an average-size workload is executed, such that the write-sets fit in the gcache (seqno = 1..x).
  • n2 and n3 are shut down.
  • n1 continues to operate and executes some average-size workload, followed by a huge transaction that results in the creation of a gcache page (1..x, a, b, c, h, where h is the huge transaction’s seqno).
  • Now n1 is shut down. During shutdown, gcache pages are purged (irrespective of the keep_pages_size setting).
  • The purge ensures that all write-sets with a seqno smaller than the write-set residing in the gcache page are purged, too. This effectively means everything in (1..h) is removed, including (a, b, c).
  • On restart, even though n1 tries to revive the gcache, there is nothing to revive, as all the write-sets were purged.
  • When n2 boots up, it requests IST, but n1 can’t service the missing write-sets (a, b, c, h). This causes an SST to take place.

Summing it up

Needless to say, gcache.recover is a much-needed feature, given that it saves SST pain. (Thanks, Codership.) It would be good to see if the feature can be optimized to work with gcache pages as well.

And yes, Percona XtraDB Cluster inherits this feature in its upcoming release.

Nov
16
2016
--

All You Need to Know About GCache (Galera-Cache)

This blog discusses some important aspects of GCache.

Why do we need GCache?

Percona XtraDB Cluster is a multi-master topology, where a transaction executed on one node is replicated to the other node(s) of the cluster. The transaction is then copied over from the group channel to the Galera-Cache, followed by the apply action.

The cache can be discarded immediately once the transaction is applied, but retaining it can help promote a node as a DONOR node serving write-sets for a newly booted node.

So in short, GCache acts as a temporary storage for replicated transactions.

How is GCache managed?

Naturally, the first choice for caching these write-sets is a memory-allocated pool, which is governed by gcache.mem_size. However, this is deprecated and buggy, and shouldn’t be used.

Next on the list are on-disk files. Galera has two types of on-disk files to manage write-sets:

  • RingBuffer File:
    • A circular file (aka RingBuffer file). As the name suggests, this file is re-usable in a circular queue fashion, and is pre-created when the server starts. The size of this file is preconfigured and can’t be changed dynamically, so selecting a proper size for this file is important.
    • The user can set the size of this file using gcache.size. (There are multiple blogs about how to estimate the size of the Galera Cache, which is generally linked to expected downtime. If properly planned, the next booting node will find all the missing write-sets in the cache, thereby avoiding the need for SST.)
    • Write-sets are appended to this file and, when needed, the file is re-cycled for use.
  • On-demand page store:
    • If the transaction write-set is large enough not to fit in a RingBuffer File (actually large enough not to fit in half of the RingBuffer file) then an independent page (physical disk file) is allocated to cache the write-sets.
    • Again there are two types of pages:
      • Page with standard size: As defined by gcache.page_size (default=128M).
      • Page with non-standard page size: if the transaction is large enough not to fit into a standard page, then a non-standard page is created for it. Say gcache.page_size=1M and the transaction write_set = 1.5M; then a separate page (in turn an on-disk file) will be created with a size of 1.5M.

How long are on-demand pages retained? This is controlled using the following two variables:

  • gcache.keep_pages_size
    • keep_pages_size defines the total size of allocated pages to keep. For example, if keep_pages_size = 10M, then N pages that add up to 10M can be retained. If the N pages add up to more than 10M, pages are removed from the start of the queue until the size falls below the set threshold. A size of 0 means don’t retain any page.
  • gcache.keep_pages_count (PXC specific)
    • But before pages are actually removed, a second check is done based on page_count. Say keep_pages_count = N+M; then even though the N pages add up to 10M, they are retained because the page_count threshold is not yet hit. (The exception to this is non-standard pages at the start of the queue.)

So, in short, both conditions must be satisfied before a page is removed. The recommendation is to use whichever condition is applicable in your environment.

Where are GCache files located?

The default location is the data directory, but this can be changed by setting gcache.dir. Given the temporary nature of the files and their iterative read/write cycles, it may be wise to place them on a faster-IO disk. Also, the default name of the file is galera.cache. This is configurable by setting gcache.name.
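
To tie these knobs together, here is an illustrative wsrep_provider_options fragment for my.cnf (the values and the /data/gcache path are examples only; size and place them for your own workload):

wsrep_provider_options="gcache.dir=/data/gcache;gcache.size=2G;gcache.page_size=128M;gcache.keep_pages_size=0"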

What if one of the nodes is DESYNCED and PAUSED?

If a node desyncs, it will continue to receive write-sets and apply them, so there is no major change in gcache handling.

If the node is desynced and paused, it can’t apply write-sets and needs to keep caching them. This will, of course, affect the desynced/paused node, which will continue to create on-demand page stores. Since one of the cluster nodes can’t proceed, it will not emit a “last committed” message. In turn, the other nodes in the cluster (which could otherwise purge their entries) will continue to retain the write-sets, even though they are not desynced and paused.
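
For context, one common way a node ends up desynced and paused is a blocking backup or a manual sequence along these lines (a sketch; the FLUSH TABLES WITH READ LOCK is what pauses the applier):

set global wsrep_desync = ON;
flush tables with read lock;
-- the node is now desynced and paused; its gcache keeps growing via on-demand pages
unlock tables;
set global wsrep_desync = OFF;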

Jul
16
2015
--

Bypassing SST in Percona XtraDB Cluster with incremental backups

Beware the SST

In Percona XtraDB Cluster (PXC) I often run across users who are fearful of SSTs on their clusters. I’ve always maintained that if you can’t cope with an SST, PXC may not be right for you, but that doesn’t change the fact that SSTs with multiple terabytes of data can be quite costly.

SST, by current definition, is a full backup from a Donor to a Joiner. The most popular method is Percona XtraBackup, so we’re talking about a donor node that must:

  1. Run a full XtraBackup that reads its entire datadir
  2. Keep up with Galera replication to it as much as possible (though laggy donors don’t send flow control)
  3. Possibly still be serving application traffic if you don’t remove Donors from rotation.

So, I’ve been interested in alternative ways to work around state transfers and I want to present one way I’ve found that may be useful to someone out there.

Percona XtraBackup and Incrementals

It is possible to use Percona XtraBackup full and incremental backups to build a datadir that might possibly IST. First we’ll focus on the mechanics of the backups, preparing them and getting the Galera GTID, and then later discuss when it may be viable for IST.

Suppose I have a fairly recent full Xtrabackup and one or more incremental backups that I can apply on top of it to get VERY close to realtime on my cluster (more on that ‘VERY’ later).

# innobackupex --no-timestamp /backups/full
... sometime later ...
# innobackupex --incremental /backups/inc1 --no-timestamp --incremental-basedir /backups/full
... sometime later ...
# innobackupex --incremental /backups/inc2 --no-timestamp --incremental-basedir /backups/inc1

In my proof of concept test, I now have a full and two incrementals:

# du -shc /backups/*
909M	full
665M	inc1
812M	inc2
2.4G	total

To recover this data, I follow the normal Xtrabackup incremental apply process:

# cp -av /backups/full /backups/restore
# innobackupex --apply-log --redo-only --use-memory=1G /backups/restore
...
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:35694784
...
# innobackupex --apply-log --redo-only /backups/restore --incremental-dir /backups/inc1 --use-memory=1G
# innobackupex --apply-log --redo-only /backups/restore --incremental-dir /backups/inc2 --use-memory=1G
...
xtrabackup: Recovered WSREP position: 1663c027-2a29-11e5-85da-aa5ca45f600f:46469942
...
# innobackupex --apply-log /backups/restore --use-memory=1G

I can see that as I roll forward on my incrementals, I get a higher and higher GTID. Galera’s GTID is stored in the Innodb recovery information, so Xtrabackup extracts it after every batch it applies to the datadir we’re restoring.

We now have a datadir that is ready to go. We need to copy it into the datadir of our joiner node and set up a grastate.dat. Without a grastate, starting the node would force an SST no matter what.

# innobackupex --copy-back /backups/restore
# ... copy a grastate.dat from another running node ...
# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    1663c027-2a29-11e5-85da-aa5ca45f600f
seqno:   -1
cert_index:
# chown -R mysql.mysql /var/lib/mysql/

If I start the node now, it should see the grastate.dat with the -1 seqno and run --wsrep_recover to extract the GTID from Innodb (I could have also just put that directly into my grastate.dat).

This will allow the node to start up from the merged Xtrabackup incrementals with a known Galera GTID.
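
As mentioned, the alternative to relying on --wsrep_recover is to write the recovered GTID into grastate.dat yourself. Using the position recovered from the last incremental apply above, that would look roughly like:

# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    1663c027-2a29-11e5-85da-aa5ca45f600f
seqno:   46469942
cert_index: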

But will it IST?

That’s the question. IST happens when the selected donor still has, inside its gcache, all the transactions the joiner needs to get fully caught up. There are several implications of this:

  • A gcache is mmap allocated and does not persist across restarts on the donor.  A restart essentially purges the mmap.
  • You can query the oldest GTID seqno on a donor by checking the status variable ‘wsrep_local_cached_downto’.  This variable is not available on 5.5, so you are forced to guess if you can IST or not.
  • Most PXC 5.6 versions will auto-select a donor based on IST candidacy. Prior to that (i.e., 5.5), donor selection was not based on IST candidacy at all, meaning you had to be much more careful and do donor selection manually.
  • There’s no direct mapping from the earliest GTID in a gcache to a specific time, so knowing at a glance if a given incremental will be enough to IST is difficult.
  • It’s also difficult to know how big to make your gcache (set in MB/GB/etc.)  with respect to your backups (which are scheduled by the day/hour/etc.)

All that being said, we’re still talking about backups here. The above method will work if and only if:

  • You do frequent incremental backups
  • You have a large gcache (hopefully more on this in a future blog post)
  • You can restore a backup faster than it takes for your gcache to overflow


Sep
08
2014
--

How to calculate the correct size of Percona XtraDB Cluster’s gcache

When a write query is sent to Percona XtraDB Cluster, all the nodes store the writeset in a file called gcache. By default the name of that file is galera.cache and it is stored in the MySQL datadir. This is a very important file, and as usual with the most important variables in MySQL, the default value is not good for heavily loaded servers. Let’s see why it’s important and how we can calculate a correct value for the workload of our cluster.

What’s the gcache?
When a node goes out of the cluster (crash or maintenance), it obviously stops receiving changes. When you try to reconnect the node to the cluster, its data will be outdated. The joiner node needs to ask a donor to send the changes that happened during the downtime.

The donor will first try to transfer an incremental (IST), that is, the writesets the cluster received while the node was down. The donor checks the last writeset received by the joiner and then checks its local gcache file. If all the needed writesets are in that cache, the donor sends them to the joiner. The joiner applies them and that’s all: it is up to date and ready to join the cluster. Therefore, IST can only be achieved if all the changes missed by the node that went away are still in the gcache file of the donor.

On the other hand, if the writesets are not there, a full transfer (SST) is needed, using one of the supported methods: XtraBackup, rsync or mysqldump.

In summary, the difference between an IST and an SST is the time a node needs to rejoin the cluster. The difference could be from seconds to hours; in the case of WAN connections and large datasets, maybe days.

That’s why having a correctly sized gcache is important. It works as a circular log: when it is full, it starts to overwrite the writesets at the beginning. With a larger gcache, a node can be out of the cluster for more time without requiring an SST. My colleague Jay Janssen explains in more detail how IST works and how to find the right server to use as donor.

Calculating the correct size
The trick is pretty similar to the one used to calculate the correct InnoDB log file size. We need to check how many bytes are written every minute. The variables to check are:

wsrep_replicated_bytes: Total size (in bytes) of writesets sent to other nodes.

wsrep_received_bytes: Total size (in bytes) of writesets received from other nodes.

mysql> show global status like 'wsrep_received_bytes';
show global status like 'wsrep_replicated_bytes';
select sleep(60);
show global status like 'wsrep_received_bytes';
show global status like 'wsrep_replicated_bytes';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| wsrep_received_bytes | 83976571 |
+----------------------+----------+
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_replicated_bytes | 0     |
+------------------------+-------+
[...]
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| wsrep_received_bytes | 90576957 |
+----------------------+----------+
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| wsrep_replicated_bytes | 800   |
+------------------------+-------+

Therefore:

Bytes per minute:

(second wsrep_received_bytes – first wsrep_received_bytes) + (second wsrep_replicated_bytes – first wsrep_replicated_bytes)

(90576957 – 83976571) + (800 – 0) = 6601186 bytes or 6 MB per minute.

Bytes per hour:

6MB * 60 minutes = 360 MB per hour of writesets received by the cluster.

If you want to allow one hour of maintenance (or downtime) of a node, you need to increase the gcache to that size. If you want more time, just make it bigger.
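
The gcache size itself is set through wsrep_provider_options in my.cnf on each node. For example, to comfortably cover the roughly 360 MB per hour measured above (a value you should adjust to your own write rate and desired maintenance window):

wsrep_provider_options="gcache.size=512M"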


Jan
08
2014
--

Finding a good IST donor in Percona XtraDB Cluster 5.6

Gcache and IST

The Gcache is a memory-based cache of recent Galera transactions that is local to each node in a cluster.  If a node leaves and rejoins the cluster, it can use the gcache from another node that stayed in the cluster (i.e., its donor node) to fetch the transactions it missed (IST) as opposed to doing a full state snapshot transfer (SST).  However, there are a few nuances that are not obvious to the beginner:

  • The Gcache is lost when a node restarts
  • The Gcache is fixed-size and implemented as an LRU.  Once it is full, older transactions roll off.
  • Donor selection is made regardless of the gcache state
  • If the given donor for a restarting node doesn’t have all transactions needed, a full SST (read: full backup) is done instead
  • Until recent developments, there was no way to tell what, precisely, was in the Gcache.

So, with (somewhat) arbitrary donor selection, it was hard to be certain that a node restart would not trigger an SST.  For example:

  • A node crashed overnight, or was otherwise down for some length of time.  How do you know if the gcache on any node is big enough to contain all the transactions necessary for IST?
  • If you brought down two nodes in your cluster at about the same time, the second one you restart might select the first one as its donor and be forced to SST.

Along comes PXC 5.6.15 RC1

Astute readers of the PXC 5.6.15 release notes will have noticed this little tidbit:

New wsrep_local_cached_downto status variable has been introduced. This variable shows the lowest sequence number in gcache. This information can be helpful with determining IST and/or SST.

Until this release there was no visibility into any node’s Gcache and what was likely to happen when restarting a node.  You could make some assumptions, but now it is a bit easier to:

  1. Tell if a given node would be a suitable donor
  2. And hence select a donor manually using wsrep_sst_donor instead of leaving it to chance.

 

What it looks like

Suppose I have a 3 node cluster where load is hitting node1.  I execute the following in sequence:

  1. Shut down node2
  2. Shut down node3
  3. Restart node2

At step 3, node1 is the only viable donor for node2.  Because our restart was quick, we can have some reasonable assurance that node2 will IST correctly (and it does).

However, before we restart node3, let’s check the oldest transaction in the gcache on nodes 1 and 2:

[root@node1 ~]# mysql -e "show global status like 'wsrep_local_cached_downto';"
+---------------------------+--------+
| Variable_name             | Value  |
+---------------------------+--------+
| wsrep_local_cached_downto | 889703 |
+---------------------------+--------+
[root@node2 mysql]# mysql -e "show global status like 'wsrep_local_cached_downto';"
+---------------------------+---------+
| Variable_name             | Value   |
+---------------------------+---------+
| wsrep_local_cached_downto | 1050151 |
+---------------------------+---------+

So we can see that node1 has a much more “complete” gcache than node2 does (i.e., a much smaller seqno). Node2’s gcache was wiped when it restarted, so it only has transactions from after its restart.

To check node3’s GTID, we can either check the grastate.dat, or (if it has crashed and the grastate is zeroed) use --wsrep_recover:

[root@node3 ~]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid:    7206c8e4-7705-11e3-b175-922feecc92a0
seqno:   1039191
cert_index:
[root@node3 ~]# mysqld_safe --wsrep-recover
140107 16:18:37 mysqld_safe Logging to '/var/lib/mysql/error.log'.
140107 16:18:37 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140107 16:18:37 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.pIVkT4' --pid-file='/var/lib/mysql/node3-recover.pid'
140107 16:18:39 mysqld_safe WSREP: Recovered position 7206c8e4-7705-11e3-b175-922feecc92a0:1039191
140107 16:18:41 mysqld_safe mysqld from pid file /var/lib/mysql/node3.pid ended

So, armed with this information, we can tell what would happen to node3, depending on which donor was selected:

Donor selected | Donor’s gcache oldest seqno | Node3’s seqno | Result for node3
node2          | 1050151                     | 1039191       | SST
node1          | 889703                      | 1039191       | IST

So, we can instruct node3 to use node1 as its donor on restart with wsrep_sst_donor:

[root@node3 ~]# service mysql start --wsrep_sst_donor=node1

Note that passing mysqld options on the command line is only supported in RPM packages; Debian requires you to put that setting in your my.cnf.
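
A minimal sketch of the my.cnf form (the node name matches this example; adjust it to your own preferred donor):

[mysqld]
wsrep_sst_donor=node1

We can see from node3’s log that it does properly IST: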

2014-01-07 16:23:26 19834 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.70.4:4568
2014-01-07 16:23:26 19834 [Note] WSREP: Node 0.0 (node3) requested state transfer from 'node1'. Selected 2.0 (node1)(SYNCED) as donor.
...
2014-01-07 16:23:27 19834 [Note] WSREP: Receiving IST: 39359 writesets, seqnos 1039191-1078550
2014-01-07 16:23:27 19834 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.15-56'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release 25.2, Revision 645, wsrep_25.2.r4027
2014-01-07 16:23:41 19834 [Note] WSREP: IST received: 7206c8e4-7705-11e3-b175-922feecc92a0:1078550

Sometime in the future, this may be handled automatically on donor selection, but for now it is very useful that we can at least see the status of the gcache.

