How Does Semisynchronous MySQL Replication Work?

With the recent release of Percona XtraDB Cluster, I am increasingly being asked about MySQL’s semi-synchronous replication. I find that there are often a number of misconceptions about how semi-synchronous replication really works. I think it is very important to understand what guarantees you actually get with semi-synchronous replication, and what you don’t get.

The first thing to understand is that despite the name, semi-synchronous replication is still asynchronous. Semi-synchronous is actually a pretty bad name, because there is no strong coupling between a commit on the master and a commit on the replicas. To understand why, let’s look at what truly synchronous replication means. In truly synchronous replication, when you commit a transaction, the commit does not complete until all replicas have also committed successfully. In MySQL’s semi-synchronous replication, the commit completes before the transaction is even sent to any of the replicas. Therefore, by definition the transaction cannot have committed on any of the replicas. If there’s any problem after the commit happens on the master, it’s possible that the replicas won’t get the transaction, and even after they do, there’s no guarantee they can apply and commit it successfully themselves (duplicate key error, anyone?). If any of these problems happens, it’s too late–the commit is already permanent on the master, and can’t be rolled back.

What should semi-synchronous replication be called instead? I believe that it should be called delayed-acknowledgment commits, because this is what actually happens. When a transaction commits on the master, the commit proceeds as normal, and the transaction is sent to the replicas as normal, but the client connection to the master is not told that the commit has completed until after at least one replica has acknowledged receiving the transaction.

Another way to look at the same thing is that semi-synchronous replication actually forces the client to be synchronized, not the replicas. The client is forced to wait until the transaction has been sent to one of the replicas, but the commit on the master is not forced to wait at all, nor are replicas forced to do anything. The commit has already happened on the master, so the cat’s out of the bag and there’s no way to force replicas to do anything. As a result, the effect is that the client’s activity is throttled so that it cannot outpace the replica’s ability to fetch updates from the master. Have you seen the bumper sticker that says “don’t drive faster than your Guardian Angel can fly?” That is the effect of this throttling.

Semi-synchronous replication also does not guarantee that your replicas will not become delayed. The client connection is forced to wait until at least one of the replicas has retrieved the transaction, but not until the transaction has actually been applied to the replica. As you probably know, it is perfectly possible to send a very long transaction to the replica in a matter of milliseconds. The replica will take a long time to apply this transaction to its own data, and during that time, it will be delayed relative to the master. However, other transactions can continue committing and sending their changes to the replica, because the process of retrieving changes from the master and applying them run in separate threads on the replica.

Finally, semi-synchronous replication does not provide strong guarantees against data loss. What do I mean by a strong guarantee against data loss? I consider the safety of my data strongly guaranteed when at least one other server must have a copy of the data before it can be committed on the master. However, that is not what happens in semi-synchronous replication. And if there is an error in semi-synchronous replication, such as a crash at the wrong moment, or a timeout, then even the throttling is abandoned, and everything defaults back to the traditional mode of replication.

What does semi-synchronous replication guarantee me then? If there are no errors or timeouts, then the guarantee is essentially that only one transaction per client is likely to be lost if the master crashes.

I do not mean to sound negative, or to send the message that semi-synchronous replication is not useful. It is useful, but if you misunderstand it, you could be relying on a strong guarantee that is not actually provided.

If you want to learn more about this, then I encourage you to read the relevant section of the MySQL manual. But read carefully, for example, the following sentences:

When a commit returns successfully, it is known that the data exists in at least two places (on the master and at least one slave). If the master commits but a crash occurs while the master is waiting for acknowledgment from a slave, it is possible that the transaction may not have reached any slave.

Finally, I would be interested to hear how many people are actually running semi-synchronous replication in production. I have a feeling that very few people are, even though a lot of people seem to have heard about it. What are your experiences?


Backing up binary log files with mysqlbinlog

Backing up binary logs are essential part of creating good backup infrastructure as it gives you the possibility for point in time recovery. After restoring a database from backup you have the option to recover changes that happend after taking a backup. The problem with this approach was that you had to do periodic filesystem level backups of the binary log files which could still lead to data loss depending on the interval you back them up.
Recently in MySQL 5.6, mysqlbinlog got a new feature addition that supports connecting to remote MySQL instances and dumping binary log data to local disks ( http://dev.mysql.com/doc/refman/5.6/en/mysqlbinlog-backup.html ). This can be used as a foundation of our live binary log backups.

The wrapper script below will connect to the remote server specified in the config and ensure mysqlbinlog utility is up and running. By default if you do not supply the binary log file, mysqlbinlog deletes and overwrites them all that is undesired behaviour in our case, so we have to supply the name of the last binary log. This last file will be still overwritten hence we make a backup first.


source $1

echo "Backup dir: $BACKUPDIR "

while :
LASTFILE=`ls -1 $BACKUPDIR|grep -v orig|tail -n 1`
TIMESTAMP=`date +%s`
FILESIZE=$(stat -c%s "$LASTFILE")

if [ $FILESIZE -gt 0 ]; then
    echo "Backing up last binlog"
echo "Starting live binlog backup"
$MBL --raw --read-from-remote-server --stop-never --host $MYSQLHOST --port $MYSQLPORT -u $MYSQLUSER -p$MYSQLPASS $LASTFILE 

echo "mysqlbinlog exited with $? trying to reconnect in $RESPAWN seconds."

sleep $RESPAWN

Configuration file:


# time to wait before reconnecting after failure

Starting in the background with logging to /var/log/livebinlog/server2.log:

nohup /media/binlogs/livebinlog.sh /media/binlogs/livebackup.server2.conf 2>&1 > /var/log/livebinlog/server2.log &

As a great addition, older logfiles that have been rotated can be checked against the MySQL server’s version if they are the same or not. For this purpose you can use rsync in “dry-run” mode.

Please note MySQL 5.6 is not yet released as GA but you can use mysqlbinlog to backup MySQL 5.1 and 5.5 databases.


Benchmarks of new innodb_flush_neighbor_pages

In our recent release of Percona Server 5.5.19 we introduced new value for innodb_flush_neighbor_pages=cont.
This way we are trying to deal with the problem of InnoDB flushing.

Actually there is also the second fix to what we think is bug in InnoDB, where it blocks queries while it is not needed (I will refer to it as “sync fix”). In this post I however will focus on innodb_flush_neighbor_pages.

By default InnoDB flushes so named neighbor pages, which really are not neighbors.
Say we want to flush page P. InnoDB is looking in an area of 128 pages around page P, and flushes all the pages in that area that are dirty. To illustrate, say we have an area of memory like this: ...D...D...D....P....D....D...D....D where each dot is a page that does not need flushing, each “D” is a dirty page that InnoDB will flush, and P is our page.
So, as the result of how it works, instead of performing 1 random write, InnoDB will perform 8 random writes.
This is quite far from original intention to flush as many pages as possible in singe sequential write.

So we added new innodb_flush_neighbor_pages=cont method, with it, only really sequential write will be performed
That is case ...D...D...D..DDDPD....D....D...D....D only following pages will be flushed:
...D...D...D..FFFFF....D....D...D....D (marked as “F”)

Beside “cont”, in Percona Server 5.5.19 innodb_flush_neighbor_pages also accepts values “area” (default) and “none” (recommended for SSD).

What kind of effect does it have ? Let’s run some benchmarks.

We repeated the same benchmark I ran in Disaster MySQL 5.5 flushing, but now we used two servers: Cisco UCS C250 and HP ProLiant DL380 G6

First results from HP ProLiant.

Throughput graph:

Response time graph (axe y has logarithmic scale):

As you see with “cont” we are able to get stable line. And even with default innodb_flush_neighbor_pages, Percona Server has smaller dips than MySQL.

So this is to show effect of “sync fix”, let’s compare Percona Server 5.5.18 (without fix) and 5.5.19 (with fix).

You see that the fix helps to have queries running in cases when before it was “hard” stop, and no
transaction processed.

The previous result may give you impression that “cont” guarantees stable line, but unfortunately this is not always the case.

There are results ( throughput and response time) from Cisco UCS 250 server:

You see, on this server we have longer and deeper periods when MySQL stuck in flushing, and in such cases, the
innodb_flush_neighbor_pages=cont only helps to relief the problem, not completely solving it.
Which, I believe, is still better than complete stop for significant amount of time.

The raw results, scripts and different CPU/IO metrics are available from our Benchmarks Launchpad


Percona XtraDB Cluster Feature 1: High Availability

There and in coming posts I am going to cover main features of Percona XtraDB Cluster. The first feature is High Availability.

But before jumping to HA, let’s review general architecture of the Percona XtraDB Cluster.

1. The Cluster consists of Nodes. Recommended configuration is to have at least 3 nodes, but you can make it running with 2 nodes too.
2. Each Node is regular MySQL / Percona Server setup. The point is that you can convert your existing MySQL / Percona Server into Node and roll Cluster using it as base. Or otherwise – you can detach Node from Cluster and use it as just a regular server.
3. Each Node contains the full copy of data. That defines XtraDB Cluster behavior in many ways. And obviously there are benefits and drawbacks.

Cluster architecture

Benefits of such approach:

  • When you execute a query, it is executed locally on the node. All data is available locally, no need for remote access.
  • No central management. You can loose any node at any point of time, and the cluster will continue to function
  • Good solution for scaling a read workload. You can put read queries to any of the nodes


  • Overhead of joining new node. The new node has to copy full dataset from one of existing node. If it is 100GB, it copies 100GB.
  • This can’t be used as an effective write scaling solution. There might be some improvements in write throughput when you run write traffic to 2 nodes vs all traffic to 1 node, but you can’t expect a lot. All writes still have to go on all nodes.
  • You have several duplicates of data. For 3 nodes – 3 duplicates

This basically defines how Percona XtraDB Cluster can be used for High Availability.

Basic setup: you run 3-nodes setup.
The Percona XtraDB Cluster will continue to function when you take any of nodes down.
At any point of time you can shutdown any Node to perform maintenance or make configuration changes.
Or Node may crash or become network unavailable. The Cluster will continue to work, you can continue to run queries on working nodes.

The biggest question there, what will happen when the Node joins the cluster back, and there were changes to data while the node
was down.

Let’s focus on this with details.
There is two ways that Node may use when it joins the cluster: State Snapshot Transfer (SST) and Incremental State Transfer (IST).

  • SST is the full copy if data from one node to another. SST is used when new node joins the cluster, it has to transfer data from existing node.
    There is three methods of SST available in Percona XtraDB Cluster: mysqldump, rsync and xtrabackup ( Percona XtraBackup with support of XtraDB Cluster will be released soon, so far you need to use our source code repository).
    The downside of mysqldump and rsync is that your cluster becomes READ-ONLY for time that takes to copy data from one node to another (SST applies FLUSH TABLES WITH READ LOCK command).
    Xtrabackup SST does not require READ LOCK for full time, only for syncing .frm files (the same as with regular backup).
  • Even with that, SST may be intrusive, that’s why there is IST mechanism. If down your node for short period of time, and then start it, the node is able to fetch only changes made during period it was down.
    This is done using caching mechanism on nodes. Each node contains a cache, ring-buffer, (the size is configurable) of last N changes, and the node is able to transfer part of this cache. Obviously IST can be done only if amount of changes needed to transfer is less than N. If it exceeds N, then the joining node has to perform SST.

You can monitor current state of Node by using
SHOW STATUS LIKE 'wsrep_local_state_comment', when it is ‘Synced (6)’, the node is ready to handle traffic.


Announcing Percona Server 5.5.19-24.0

Percona is glad to announce the release of Percona Server 5.5.19-24.0 on January 13th, 2012 (Downloads are available here and from the Percona Software Repositories).

Based on MySQL 5.5.19, including all the bug fixes in it, Percona Server 5.5.19-24.0 is now the current stable release in the 5.5 series. All of Percona ‘s software is open-source and free, all the details of the release can be found in the 5.5.19-24.0 milestone at Launchpad.

New Features

  • Variable innodb_flush_neighbor_pages can be now set to a new value cont. The previously-available option values 0 and 1 now have more descriptive names none and area. The value of none disables the neighbor page flush and area matches the default InnoDB behavior: any dirty pages in the vicinity of the page selected for flushing may be flushed too. The new option value cont improves the neighbor flushing by considering only contiguous blocks of neighbor pages, thus performing the flush by sequential instead of random I/O. (Yasufumi Kinoshita, Laurynas Biveinis)
  • Improvements to the XtraDB’s sync flush algorithm. If the XtraDB checkpoint age grows dangerously close to its limit and XtraDB is forced to perform a sync flush, these changes should slightly improve the user query performance instead of completely blocking them. (Yasufumi Kinoshita, Laurynas Biveinis)

Bug Fixes

  • Minor MEMORY engine test suite fix: #849921 (Laurynas Biveinis)
  • A fix for testsuite integration into Jenkins: #911237 (Oleg Tsarev)

For this release we also provide EXPERIMENTAL binaries for Mac OS X platform, and you can download tar.gz or dmg there


Percona testing: Quick test clusters with kewpie!

The announcement of Percona XtraDB Cluster seems to have generated a fair bit of interest : )

Although the documentation contains more formal instructions for setting up a test cluster, I wanted to share a quick way to set up an ad-hoc cluster on a single machine to help people play with this (imho) rather amazing bit of software.

To do this, you will need kewpie (PXC will have kewpie in-tree soon)
cd basedir;
bzr branch lp:kewpie

edit the file kewpie.py like so:

=== modified file 'kewpie.py'
--- kewpie.py    2012-01-09 21:17:09 +0000
+++ kewpie.py    2012-01-11 18:32:17 +0000
@@ -49,9 +49,9 @@ from lib.test_mgmt.execution_management
# We base / look for a lot of things based on the location of
# the kewpie.py file
qp_rootdir = os.path.dirname(os.path.abspath(sys.argv[0]))
-#project_name = 'percona-xtradb-cluster'
+project_name = 'percona-xtradb-cluster'
#project_name = 'xtrabackup'
-project_name = None
+#project_name = None
defaults = get_defaults(qp_rootdir,project_name)
variables = test_run_options.parse_qp_options(defaults)
variables['qp_root'] = qp_rootdir

Or you may branch kewpie anywhere and simply pass appropriate –basedir and –wsrep-provider-path instructions and use –default-server-type=galera

* A default location of /usr/lib/galera/libgalera_smm.so is assumed

To get your cluster, run the tests with –start-and-exit:
./kewpie.py  –start-and-exit
This will start up 3 nodes and join them into a cluster:

percona-xtradb-cluster/kewpie$ ./kewpie.py --start-and-exit
Setting --no-secure-file-priv=True for randgen usage...
20120113-125552 INFO Using --no-shm, will not link workdir to shm
20120113-125552 INFO Using mysql source tree:
20120113-125552 INFO basedir: /percona-xtradb-cluster
20120113-125552 INFO clientbindir: /percona-xtradb-cluster/client
20120113-125552 INFO testdir: /percona-xtradb-cluster/kewpie
20120113-125552 INFO server_version: 5.5.17
20120113-125552 INFO server_compile_os: Linux
20120113-125552 INFO server_platform: x86_64
20120113-125552 INFO server_comment: (Source distribution wsrep_22.3.r3683)
20120113-125552 INFO Using default-storage-engine: innodb
20120113-125552 INFO Using testing mode: native
20120113-125552 INFO Processing test suites...
20120113-125552 INFO Found 35 test(s) for execution
20120113-125552 INFO Creating 1 bot(s)
20120113-125604 INFO Taking clean db snapshot...
20120113-125610 INFO Taking clean db snapshot...
20120113-125616 INFO Taking clean db snapshot...
20120113-125621 INFO bot0 server:
20120113-125621 INFO NAME: s0
20120113-125621 INFO MASTER_PORT: 9317
20120113-125621 INFO GALERA_LISTEN_PORT: 9318
20120113-125621 INFO GALERA_RECV_PORT: 9319
20120113-125621 INFO SOCKET_FILE: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s0/my.sock
20120113-125621 INFO VARDIR: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s0
20120113-125621 INFO STATUS: 1
20120113-125621 INFO bot0 server:
20120113-125621 INFO NAME: s1
20120113-125621 INFO MASTER_PORT: 9320
20120113-125621 INFO GALERA_LISTEN_PORT: 9321
20120113-125621 INFO GALERA_RECV_PORT: 9322
20120113-125621 INFO SOCKET_FILE: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s1/my.sock
20120113-125621 INFO VARDIR: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s1
20120113-125621 INFO STATUS: 1
20120113-125621 INFO bot0 server:
20120113-125621 INFO NAME: s2
20120113-125621 INFO MASTER_PORT: 9323
20120113-125621 INFO GALERA_LISTEN_PORT: 9324
20120113-125621 INFO GALERA_RECV_PORT: 9325
20120113-125621 INFO SOCKET_FILE: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s2/my.sock
20120113-125621 INFO VARDIR: /percona-xtradb-cluster/kewpie/workdir/bot0/var_s2
20120113-125621 INFO STATUS: 1
20120113-125621 INFO User specified --start-and-exit.  kewpie.py exiting and leaving servers running...

Now for some play:

$ mysql -uroot --protocol=tcp --port=9317 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6
Server version: 5.5.17-log Source distribution wsrep_22.3.r3683

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create table t1 (a int not null auto_increment, primary key(a));
Query OK, 0 rows affected (0.11 sec)

mysql> insert into t1 values (),(),(),(),();
Query OK, 5 rows affected (0.06 sec)
Records: 5  Duplicates: 0  Warnings: 0

mysql> select * from t1;
| a  |
|  1 |
|  4 |
|  7 |
| 10 |
| 13 |
5 rows in set (0.00 sec)

mysql> exit;
$ mysql -uroot --protocol=tcp --port=9320 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 5.5.17-log Source distribution wsrep_22.3.r3683

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from t1;
| a  |
|  1 |
|  4 |
|  7 |
| 10 |
| 13 |
5 rows in set (0.00 sec)

mysql> exit
$ mysql -uroot --protocol=tcp --port=9323 test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 5.5.17-log Source distribution wsrep_22.3.r3683

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> select * from t1;
| a  |
|  1 |
|  4 |
|  7 |
| 10 |
| 13 |
5 rows in set (0.00 sec)

mysql> exit

Should you wish to alter the number of nodes or their configuration, you can edit the percona_tests/cluster_basic/suite_config.py file:

server_requirements = [[],[],[]]
server_requests = {'join_cluster':[(0,1), (0,2)]}
servers = []

Each ‘[]‘ in the server_requirements list is a server.  You can add new servers by adding a new list.  If you want specific options, put them into the list representing the server:

You will need to add an entry into the server_requests dictionary as well.  If you added a new node and want it in the cluster you would simply change it as follows:
server_requests = {‘join_cluster’:[(0,1), (0,2), (0,3)]}

When you are done, you may use mode=cleanup to kill off any servers:

./kewpie.py --mode=cleanup
Setting --no-secure-file-priv=True for randgen usage...
Setting --start-dirty=True for cleanup mode...
20120113-132229 INFO Using --start-dirty, not attempting to touch directories
20120113-132229 INFO Using mysql source tree:
20120113-132229 INFO basedir: /percona-xtradb-cluster
20120113-132229 INFO clientbindir: /percona-xtradb-cluster/client
20120113-132229 INFO testdir: /percona-xtradb-cluster/kewpie
20120113-132229 INFO server_version: 5.5.17
20120113-132229 INFO server_compile_os: Linux
20120113-132229 INFO server_platform: x86_64
20120113-132229 INFO server_comment: (Source distribution wsrep_22.3.r3683)
20120113-132229 INFO Using default-storage-engine: innodb
20120113-132229 INFO Using testing mode: cleanup
20120113-132229 INFO Killing pid 17040 from /percona-xtradb-cluster/kewpie/workdir/bot0/var_s0/run/my.pid
20120113-132229 INFO Killing pid 17096 from /percona-xtradb-cluster/kewpie/workdir/bot0/var_s2/run/my.pid
20120113-132229 INFO Killing pid 17070 from /percona-xtradb-cluster/kewpie/workdir/bot0/var_s1/run/my.pid
20120113-132229 INFO Stopping all running servers...

Alternately, you can just let the tests run to ensure some basic functionality.  I’ll be writing more about these tests and other testing efforts soon, but I wanted to help people get started with their own explorations.

Happy testing and I hope you dig Percona XtraDB Cluster as much as we do : )


Create 3 nodes XtraDB Cluster in 3 minutes

I understand that the new technology can be scaring and this makes entry barriers to try and get familiar with it.
That’s why I created simple set of scripts which setup N nodes cluster on running EC2 instances.

The script assumes that you have running N EC2 instances (in the same availability zone) running RedHat Enterprise Linux 6.2 64bit (I use m1.xlarge size).

You put hostnames of the instances into ec2hosts.txt file and run install.nodes.sh script.
You also should have your private key to access to your instances (test-cluster.pem in my case).

For my box installing and starting 3 nodes took exactly

time install.nodes.sh
real    2m51.257s

In fact you can deploy as many nodes as you want, I made also running 20 nodes cluster.

After that you have the cluster deployed and ready to accept queries!

The scripts are available from Launchpad:
or as tar.gz file



Partial Results

Todays installment opens the Chapter about efficiently fetching partial results—that is, whenever you don’t need all the rows, but just a few from the beginning. These kind of queries are every often used to show lists in a page wise manner.

The installment consists of two parts: the chapter intro and the first section about the Top-N queries.


Making the impossible: 3 nodes intercontinental replication

In this post I want to show new possibilities which open with Percona XtraDB Cluster.
We will create 3 nodes Cluster with nodes on different continents (Europe, USA, Japan) and each node will accept write queries.
Well, you theoretically could create 3 node traditional MySQL ring replication, but this is not what you want to use day-to-day.

To show how it works I will use Amazon m1.xlarge instances, by one in Tokyo, Ireland and North California, running RedHat Entreprise 6.2 64bit.

In fact to create instances is most time consuming task. After that using my script you will have cluster running in 5 min or less.

There however some precautions needed if you run Amazon instances.
First, you need to open ports in the firewall. For the communication the nodes need, by default, 4444, 4567, 4568 ports
(see our FAQ why)

Second, please take into account that the communication is done using open channels, and you may want to establish an encrypted connection, using it in real life.

Now, as we have running instances, you can install Cluster packages from RPM repositories.
You can follow steps from documentation.

Or I prepared simple script which does all the work: http://percona.com/downloads/Percona-XtraDB-Cluster/scripts/intercont.tar.gz,
you also can follow steps from the script to adjust it for your environment.

You just need to change nodes host names and your keys in the file: install.nodes.sh

When all nodes start, we have running 3 nodes EU<->USA<->JAPAN, and each node is ready to execute both read and write queries.

Of course you may wonder what is query response time in such environment. We can check it.

Simple table: CREATE TABLE t (ID INT) and simple query: INSERT INTO t VALUES (1)

Response time on single node in EU (no cluster setup): 0.005100 sec
Response time on two-nodes (EU<->JAP) cluster: 0.275642 sec
Response time on three-nodes (EU<->JAP<->USA) cluster: 0.294754 sec

Well, one may argue that 0.27 sec for single query is kind of big, but this is the physic law, you cannot
go faster than speed of light, and the round trip between Europe and Japan takes time (or at least until scientists figure out how to attach transaction to Faster than light neutrino).
Also note, that XtraDB Cluster can apply events in parallel and throughput should be less affected by big distance.



Learn about Percona XtraDB Cluster at Percona Live DC

You probably saw that Vadim blogged about the first alpha release of Percona XtraDB Cluster. Just in time: now I can talk about it at my High Availability talk in Percona Live DC! If you’re coming, be sure to catch us in the hallway and ask those specific questions that won’t be addressed in a general survey talk on HA solutions.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com