Mar 15, 2019

MySQL Ripple: The First Impression of a MySQL Binlog Server

Just about a month ago, Pavel Ivanov released Ripple under the Apache-2.0 license. Ripple is a MySQL binlog server: software which receives binary logs from MySQL or MariaDB servers and delivers them to another MySQL or MariaDB server. Practically, this is an intermediary master which does not store any data, except the binary logs themselves, and does not apply events. This solution allows saving a lot of resources on the server, which acts only as a middleman between the master and its actual slave(s).

An intermediary server that keeps binary logs only, and does no other job, is a prevalent use case: it removes IO (binlog read) and network (binlog retrieval) load from the actual master and frees its resources for updates. The intermediary master, which does no other work, distributes binary logs to the slaves connected to it. This way you can attach an increased number of slaves to such a server without affecting the application that is running updates.

Currently, users exploit the Blackhole storage engine to emulate similar behavior. But Blackhole is just a workaround: it still executes all the events in the binary logs, requires a valid MySQL installation, and has a lot of issues. Such a pain!

Therefore, a new product which can do the same job and is released under an open source license is something worth trying.

A simple test

For this blog, I did a simple test. First, I installed it as described in the README file. The instructions are pretty straightforward, and I successfully built the server on my Ubuntu 18.04.2 LTS laptop. The guidelines suggest installing libmariadbclient-dev, and I replaced libmysqlclient-dev, which I already had on my machine. This was probably not needed, but since the tool claims to support both MySQL and MariaDB binary log formats, I preferred to install the MariaDB client.
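For reference, on Ubuntu the dependency install and build looked roughly like this. The exact Bazel target name is my recollection rather than a quote from the README, so treat it as an assumption and defer to the README:

$ sudo apt-get install libmariadbclient-dev
$ bazel build :rippled    # produces ./bazel-bin/rippled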

There is no manual or usage instructions. However, the tool supports the -help option, and it is, again, straightforward.

The server can be started with options:

$./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address=127.0.0.1 -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000

Where:

  • -ripple_datadir: the datadir where Ripple stores binary logs
  • -ripple_master_address: master host
  • -ripple_master_port: master port
  • -ripple_master_user: replication user
  • -ripple_server_ports: comma-separated list of ports on which Ripple will listen

I did not find an option for securing binary log retrieval. The slave can connect to the Ripple server with any credentials. Keep this in mind when deploying Ripple in production.
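Since any client can connect, one workaround is to restrict access to the Ripple ports at the network level. A minimal iptables sketch, assuming a hypothetical 192.168.70.0/24 subnet for the slaves and port 15000 as above:

# allow the slaves' subnet to reach Ripple, drop everyone else (addresses are hypothetical)
$ sudo iptables -A INPUT -p tcp --dport 15000 -s 192.168.70.0/24 -j ACCEPT
$ sudo iptables -A INPUT -p tcp --dport 15000 -j DROP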

Now, let’s run a simple test. I have two servers, both running on localhost: one on port 13001 (master) and the other on port 13002 (slave). The command line which I used to start rippled, shown above, points to the master. Binary logs are stored in the data directory:

$ ls -l data/
total 14920
-rw-rw-r-- 1 sveta sveta 15251024 Mar 6 01:43 binlog.000000
-rw-rw-r-- 1 sveta sveta 71 Mar 6 00:50 binlog.index

I pointed the slave to the Ripple server with the following command:

mysql> change master to master_host='127.0.0.1',master_port=15000, master_user='ripple';
Query OK, 0 rows affected, 1 warning (0.02 sec)

Then I started the slave.

On the master, I created the database sbtest and ran the sysbench oltp_read_write.lua test for a single table (a typical invocation is sketched below).
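For reproducibility, a sysbench 1.0 command line matching the 10,000-row single table below might look like this; host, port, and credentials are assumptions based on my setup:

$ sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=13001 \
    --mysql-user=root --mysql-db=sbtest --tables=1 --table-size=10000 prepare
$ sysbench oltp_read_write --mysql-host=127.0.0.1 --mysql-port=13001 \
    --mysql-user=root --mysql-db=sbtest --tables=1 --table-size=10000 run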

After some time, I stopped the load and checked the content of the table on the master and the slave:

master> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    10000 |
+----------+
1 row in set (0.08 sec)
master> checksum table sbtest1;
+----------------+------------+
| Table          | Checksum   |
+----------------+------------+
| sbtest.sbtest1 | 4162333567 |
+----------------+------------+
1 row in set (0.11 sec)
slave> select count(*) from sbtest1;
+----------+
| count(*) |
+----------+
|    10000 |
+----------+
1 row in set (0.40 sec)
slave> checksum table sbtest1;
+----------------+------------+
| Table          | Checksum   |
+----------------+------------+
| sbtest.sbtest1 | 1797645970 |
+----------------+------------+
1 row in set (0.13 sec)
slave> checksum table sbtest1;
+----------------+------------+
| Table          | Checksum   |
+----------------+------------+
| sbtest.sbtest1 | 4162333567 |
+----------------+------------+
1 row in set (0.10 sec)

It took some time for the slave to catch up, but everything was applied successfully.
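The first checksum differs only because the slave had not finished applying the binlog yet; the second run matches the master. To watch the catch-up, the standard replication status works as usual, for example:

slave> show slave status\G
...
        Seconds_Behind_Master: 0
...

Once Seconds_Behind_Master drops to 0, the checksums agree.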

Ripple has nice verbose logging:

$ ./bazel-bin/rippled -ripple_datadir=./data -ripple_master_address=127.0.0.1 -ripple_master_port=13001 -ripple_master_user=root -ripple_server_ports=15000
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0306 15:57:13.641451 27908 rippled.cc:48] InitPlugins
I0306 15:57:13.642007 27908 rippled.cc:60] Setup
I0306 15:57:13.642937 27908 binlog.cc:307] Starting binlog recovery
I0306 15:57:13.644090 27908 binlog.cc:350] Scanning binlog file: binlog.000000
I0306 15:57:13.872016 27908 binlog.cc:417] Binlog recovery complete
binlog file: binlog.000000, offset: 15251088, gtid: 6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192
I0306 15:57:13.872050 27908 rippled.cc:106] Recovered binlog
I0306 15:57:13.873811 27908 mysql_server_port_tcpip.cc:150] Listen on host: localhost, port: 15000
I0306 15:57:13.874282 27908 rippled.cc:62] Start
I0306 15:57:13.874511 27910 mysql_master_session.cc:181] Master session starting
I0306 15:57:13.882601 27910 mysql_client_connection.cc:148] connected to host: 127.0.0.1, port: 13001
I0306 15:57:13.895349 27910 mysql_master_session.cc:137] Connected to host: 127.0.0.1, port: 13001, server_id: 1, server_name:
W0306 15:57:13.898556 27910 mysql_master_session.cc:197] master does not support semi sync
I0306 15:57:13.898583 27910 mysql_master_session.cc:206] start replicating from '6ddac507-3f90-11e9-8ee9-00163e000000:0-0-7192'
I0306 15:57:13.899031 27910 mysql_master_session.cc:229] Master session entering main loop
I0306 15:57:13.899550 27910 binlog.cc:626] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
I0306 15:57:13.899572 27910 binlog.cc:616] Skip writing event [ Previous_gtids len = 67 ]
I0306 15:57:13.899585 27910 binlog.cc:626] Update binlog position to end_pos: binlog.000000:15251152, gtid: 0-0-7192
...

Conclusion

It may be good to run more tests before using Ripple in production, and to explore its other options, but from a first look it seems to be a very nice and useful product.



Aug 02, 2018

Amazon RDS Multi-AZ Deployments and Read Replicas


Amazon RDS is a managed relational database service that makes it easier to set up, operate, and scale a relational database in the cloud. One of the common questions that we get is: “What is Multi-AZ, how is it different from a Read Replica, and do I need both?” I have tried to answer this question in this blog post; the answer depends on your application’s needs. Are you looking for High Availability (HA), read scalability, or both?

Before we go into detail, let me explain two common terms used with Amazon AWS.

Region – an AWS Region is a separate geographical area, like US East (N. Virginia), Asia Pacific (Mumbai), or EU (London). Each AWS Region has multiple, isolated locations known as Availability Zones.

Availability Zone (AZ) – an AZ is one or more data centers, each with redundant power, networking, and connectivity, housed in separate facilities. The Availability Zones within a Region are geographically isolated from one another.

What is Multi-AZ?

Amazon RDS provides high availability and failover support for DB instances using Multi-AZ deployments.

In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica of the master DB in a different Availability Zone. The primary DB instance is synchronously replicated across Availability Zones to the standby replica to provide data redundancy, failover support and to minimize latency during system backups. In the event of planned database maintenance, DB instance failure, or an AZ failure of your primary DB instance, Amazon RDS automatically performs a failover to the standby so that database operations can resume quickly without administrative intervention.
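Enabling Multi-AZ is a single option at instance creation time, or later via a modification. A hedged AWS CLI sketch, where the identifier, instance class, storage size, and credentials are placeholders:

$ aws rds create-db-instance \
    --db-instance-identifier mydb \
    --engine mysql \
    --db-instance-class db.m4.large \
    --allocated-storage 100 \
    --master-username admin \
    --master-user-password '<password>' \
    --multi-az

An existing instance can be converted later with aws rds modify-db-instance --db-instance-identifier mydb --multi-az --apply-immediately.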

You can check in the AWS management console if a database instance is configured as Multi-AZ. Select the RDS service, click on the DB instance and review the details section.

AWS management console showing that instance is Multi-AZ

This screenshot from AWS management console (above) shows that the database is hosted as Multi-AZ deployment and the standby replica is deployed in us-east-1a AZ.

Benefits of Multi-AZ deployment:

  • Replication to the standby replica is synchronous, which makes it highly durable.
  • When a problem is detected on the primary instance, RDS will automatically fail over to the standby under the following conditions:
    • The primary DB instance fails
    • An Availability Zone outage occurs
    • The DB instance's server type is changed
    • The operating system of the DB instance is undergoing software patching
    • A manual failover of the DB instance was initiated using Reboot with failover
  • The endpoint of the DB instance remains the same after a failover, so the application can resume database operations without manual intervention.
  • If a failure occurs, your availability impact is limited to the time that the automatic failover takes to complete. This helps to achieve increased availability.
  • It reduces the impact of maintenance. RDS performs maintenance on the standby first, promotes the standby to primary master, and then performs maintenance on the old master which is now a standby replica.
  • To prevent any negative impact of the backup process on performance, Amazon RDS creates a backup from the standby replica.

Amazon RDS does not fail over automatically in response to database operations such as long-running queries, deadlocks, or database corruption errors. Also, Multi-AZ deployments are limited to a single region: cross-region Multi-AZ is not currently supported.

Can I use an RDS standby replica for read scaling?

A Multi-AZ deployment is not a read scaling solution: you cannot use the standby replica to serve read traffic. Multi-AZ maintains the standby for HA/failover only, and it becomes available for use only when RDS promotes it to primary. To serve read-only traffic, use a Read Replica instead.

What is a Read Replica?

Read replicas allow you to have a read-only copy of your database.

When you create a Read Replica, you first specify an existing DB instance as the source. Amazon RDS then takes a snapshot of the source instance and creates a read-only instance from the snapshot. MySQL native asynchronous replication keeps the Read Replica up to date with changes on the source. The source DB instance must have automatic backups enabled before you can set up a Read Replica.
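Creating one is likewise a single call. An AWS CLI sketch with placeholder identifiers; Cross-AZ and Cross-Region variants take extra options such as --availability-zone or --source-region:

$ aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-replica-1 \
    --source-db-instance-identifier mydb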

Benefits of Read Replica

  • Read Replicas help decrease the load on the primary DB instance by serving read-only traffic.
  • A Read Replica can be manually promoted to a standalone database instance (see the sketch after this list).
  • You can create Read Replicas within an AZ, Cross-AZ, or Cross-Region.
  • You can have up to five Read Replicas per master, each with its own DNS endpoint. Unlike a Multi-AZ standby replica, you can connect to each Read Replica and use them for read scaling.
  • You can have Read Replicas of Read Replicas.
  • Read Replicas can be Multi-AZ enabled.
  • You can use Read Replicas to take logical backups (mysqldump/mydumper) if you want to store the backups outside of RDS.
  • A Read Replica helps to maintain a copy of your databases in a different region for disaster recovery.
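Promotion, mentioned in the list above, is also one call; the identifier is again a placeholder:

$ aws rds promote-read-replica --db-instance-identifier mydb-replica-1

After promotion the instance stops replicating from its source and accepts writes, which is why this is a common building block in disaster recovery runbooks.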

At AWS re:Invent 2017, AWS announced a preview of Amazon Aurora Multi-Master, which will allow users to create multiple Aurora writer nodes and scale reads and writes across multiple AZs. You can sign up for the preview here.

Conclusion

While both Multi-AZ deployments and Read Replicas maintain a copy of the database, they are different in nature. Use Multi-AZ deployments for High Availability and Read Replicas for read scalability. You can additionally set up a cross-region Read Replica for disaster recovery.


Dec 02, 2013

Useful MySQL 5.6 features you get for free in PXC 5.6

I get a lot of questions about Percona XtraDB Cluster 5.6 (PXC 5.6), specifically about whether such and such MySQL 5.6 Community Edition feature is in PXC 5.6. The short answer is: yes, all features in community MySQL 5.6 are in Percona Server 5.6 and, in turn, are in PXC 5.6. Whether or not a given new feature is useful in PXC really depends on how useful it is in general with Galera.

I thought it would be useful to highlight a few features and try to show them working:

Innodb Fulltext Indexes

Yes, FTS works in Innodb in 5.6, so why wouldn’t it work in PXC 5.6? To test this I used the Sakila database, which contains a single table with a FULLTEXT index. In the sakila-schema.sql file, it is still designated a MyISAM table:

CREATE TABLE film_text (
  film_id SMALLINT NOT NULL,
  title VARCHAR(255) NOT NULL,
  description TEXT,
  PRIMARY KEY  (film_id),
  FULLTEXT KEY idx_title_description (title,description)
)ENGINE=MyISAM DEFAULT CHARSET=utf8;

I edited that file to change MyISAM to Innodb, then loaded the schema and data into my 3-node cluster:

[root@node1 sakila-db]# mysql < sakila-schema.sql
[root@node1 sakila-db]# mysql < sakila-data.sql

and it works seamlessly:

node1 mysql> select title, description, match( title, description) against ('action saga' in natural language mode) as score from sakila.film_text order by score desc limit 5;
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| title           | description                                                                                               | score              |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| FACTORY DRAGON  | A Action-Packed Saga of a Teacher And a Frisbee who must Escape a Lumberjack in The Sahara Desert         | 3.0801234245300293 |
| HIGHBALL POTTER | A Action-Packed Saga of a Husband And a Dog who must Redeem a Database Administrator in The Sahara Desert | 3.0801234245300293 |
| MATRIX SNOWMAN  | A Action-Packed Saga of a Womanizer And a Woman who must Overcome a Student in California                 | 3.0801234245300293 |
| REEF SALUTE     | A Action-Packed Saga of a Teacher And a Lumberjack who must Battle a Dentist in A Baloon                  | 3.0801234245300293 |
| SHANE DARKNESS  | A Action-Packed Saga of a Moose And a Lumberjack who must Find a Woman in Berlin                          | 3.0801234245300293 |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
5 rows in set (0.00 sec)

Sure enough, I can run this query on any node and it works fine:

node3 mysql> select title, description, match( title, description) against ('action saga' in natural language mode) as score from sakila.film_text order by score desc limit 5;
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| title           | description                                                                                               | score              |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
| FACTORY DRAGON  | A Action-Packed Saga of a Teacher And a Frisbee who must Escape a Lumberjack in The Sahara Desert         | 3.0801234245300293 |
| HIGHBALL POTTER | A Action-Packed Saga of a Husband And a Dog who must Redeem a Database Administrator in The Sahara Desert | 3.0801234245300293 |
| MATRIX SNOWMAN  | A Action-Packed Saga of a Womanizer And a Woman who must Overcome a Student in California                 | 3.0801234245300293 |
| REEF SALUTE     | A Action-Packed Saga of a Teacher And a Lumberjack who must Battle a Dentist in A Baloon                  | 3.0801234245300293 |
| SHANE DARKNESS  | A Action-Packed Saga of a Moose And a Lumberjack who must Find a Woman in Berlin                          | 3.0801234245300293 |
+-----------------+-----------------------------------------------------------------------------------------------------------+--------------------+
5 rows in set (0.05 sec)
node3 mysql> show create table sakila.film_text\G
*************************** 1. row ***************************
       Table: film_text
Create Table: CREATE TABLE `film_text` (
  `film_id` smallint(6) NOT NULL,
  `title` varchar(255) NOT NULL,
  `description` text,
  PRIMARY KEY (`film_id`),
  FULLTEXT KEY `idx_title_description` (`title`,`description`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

There might be a few caveats and differences from how FTS works in Innodb vs MyISAM, but it is there.

Minimal replication images

Galera relies heavily on RBR events, but until 5.6 those were entire row copies, even if you only changed a single column in the table. In 5.6 you can change this to send only the updated data using the variable binlog_row_image=minimal.
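binlog_row_image is a dynamic variable in 5.6, so you can try it at runtime and add a matching my.cnf entry so it survives the rolling restart I do below; the config stanza is a sketch:

node1 mysql> set global binlog_row_image=minimal;

# /etc/my.cnf on each node
[mysqld]
binlog_row_image = minimal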

Using a simple sysbench update test for 1 minute, I can determine the baseline size of the replicated data:

node3 mysql> show global status like 'wsrep_received%';
+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_received       | 703       |
| wsrep_received_bytes | 151875070 |
+----------------------+-----------+
2 rows in set (0.04 sec)
... test runs for 1 minute...
node3 mysql> show global status like 'wsrep_received%';
+----------------------+-----------+
| Variable_name        | Value     |
+----------------------+-----------+
| wsrep_received       | 38909     |
| wsrep_received_bytes | 167749809 |
+----------------------+-----------+
2 rows in set (0.17 sec)

This results in 62.3 MB of data replicated in this test.

If I set binlog_row_image=minimal on all nodes and do a rolling restart, I can see how this changes:

node3 mysql> show global status like 'wsrep_received%';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| wsrep_received       | 3     |
| wsrep_received_bytes | 236   |
+----------------------+-------+
2 rows in set (0.07 sec)
... test runs for 1 minute...
node3 mysql> show global status like 'wsrep_received%';
+----------------------+----------+
| Variable_name        | Value    |
+----------------------+----------+
| wsrep_received       | 34005    |
| wsrep_received_bytes | 14122952 |
+----------------------+----------+
2 rows in set (0.13 sec)

This yields a mere 13.4 MB, which is 80% smaller: quite a savings! This benefit, of course, fully depends on the type of workload you are running.

Durable Memcache Cluster

It turns out this feature does not work properly with Galera; see below for an explanation.

5.6 introduces a Memcached interface for Innodb. This means any standard memcache client can talk to our PXC nodes with the memcache protocol, and the data is:

  • Replicated to all nodes
  • Durable across the cluster
  • Highly available
  • Easy to hash memcache clients across all servers for better cache coherency

To set this up, we simply need to load the innodb_memcache schema from the example and restart the daemon to get a listening memcached port:

[root@node1 ~]# mysql < /usr/share/mysql/innodb_memcached_config.sql
[root@node1 ~]# service mysql restart
Shutting down MySQL (Percona XtraDB Cluster)...... SUCCESS!
Starting MySQL (Percona XtraDB Cluster)...... SUCCESS!
[root@node1 ~]# lsof +p`pidof mysqld` | grep LISTEN
mysqld  31961 mysql   11u  IPv4             140592       0t0      TCP *:tram (LISTEN)
mysqld  31961 mysql   55u  IPv4             140639       0t0      TCP *:memcache (LISTEN)
mysqld  31961 mysql   56u  IPv6             140640       0t0      TCP *:memcache (LISTEN)
mysqld  31961 mysql   59u  IPv6             140647       0t0      TCP *:mysql (LISTEN)

This all appears to work and I can fetch the sample AA row from all the nodes with the memcached interface:

node1 mysql> select * from demo_test;
+-----+--------------+------+------+------+
| c1  | c2           | c3   | c4   | c5   |
+-----+--------------+------+------+------+
| AA  | HELLO, HELLO |    8 |    0 |    0 |
+-----+--------------+------+------+------+
[root@node3 ~]# telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
get AA
VALUE AA 8 12
HELLO, HELLO
END

However, if I try to update a row, it does not seem to replicate (even if I set innodb_api_enable_binlog):

[root@node3 ~]# telnet 127.0.0.1 11211
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
set DD 0 0 0
STORED
^]
telnet> quit
Connection closed.
node3 mysql> select * from demo_test;
+----+--------------+------+------+------+
| c1 | c2           | c3   | c4   | c5   |
+----+--------------+------+------+------+
| AA | HELLO, HELLO |    8 |    0 |    0 |
| DD |              |    0 |    1 |    0 |
+----+--------------+------+------+------+
2 rows in set (0.00 sec)
node1 mysql> select * from demo_test;
+-----+--------------+------+------+------+
| c1  | c2           | c3   | c4   | c5   |
+-----+--------------+------+------+------+
| AA  | HELLO, HELLO |    8 |    0 |    0 |
+-----+--------------+------+------+------+

So unfortunately the memcached plugin must use some backdoor to Innodb that Galera is unaware of. I’ve filed a bug on the issue, but it’s not clear if there will be an easy solution or if a whole lot of code will be necessary to make this work properly.

In the short-term, however, you can at least read data from all nodes with the memcached plugin as long as data is only written using the standard SQL interface.

Async replication GTID Integration

Async GTIDs were introduced in 5.6 in order to make CHANGE MASTER easier. You have always been able to use async replication from any cluster node, but with the new GTID support it is much easier to fail over to another node in the cluster as a new master.

Let’s take one node out of our cluster to be a slave and enable GTID binary logging on the other two by adding these settings:

server-id = ##
log-bin = cluster_log
log-slave-updates
gtid_mode = ON
enforce-gtid-consistency

If I generate some writes on the cluster, I can see GTIDs are working:

node1 mysql> show master status\G
*************************** 1. row ***************************
             File: cluster_log.000001
         Position: 573556
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1505
1 row in set (0.00 sec)
node2 mysql> show master status\G
*************************** 1. row ***************************
             File: cluster_log.000001
         Position: 560011
     Binlog_Do_DB:
 Binlog_Ignore_DB:
Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1505
1 row in set (0.00 sec)

Notice that we’re at GTID 1505 on both nodes, even though the binary log position happens to be different.

I set up my slave to replicate from node1 (.70.2).
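The exact command isn’t shown here; with GTID auto-positioning it would look something like this sketch, where the user and password are placeholders:

node3 mysql> change master to master_host='192.168.70.2', master_user='slave', master_password='<password>', master_auto_position=1;
node3 mysql> start slave;

After starting the slave, its status shows: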

node3 mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.70.2
                  Master_User: slave
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
           Retrieved_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1506
            Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-1506
                Auto_Position: 1
1 row in set (0.00 sec)

And it’s all caught up. If I put some load on the cluster, I can easily change to node2 as my master without needing to stop writes:

node3 mysql> stop slave;
Query OK, 0 rows affected (0.09 sec)
node3 mysql> change master to master_host='192.168.70.3', master_auto_position=1;
Query OK, 0 rows affected (0.02 sec)
node3 mysql> start slave;
Query OK, 0 rows affected (0.02 sec)
node3 mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.70.3
                  Master_User: slave
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
            Executed_Gtid_Set: e941e026-ac70-ee1c-6dc9-40f8d3b5db3f:1-3712
                Auto_Position: 1

So this seems to work pretty well. It does turn out there is a bit of a bug, but it’s actually with Xtrabackup: currently the binary logs are not copied by an Xtrabackup SST, and this can cause GTID inconsistencies between nodes in the cluster. I would expect this to get fixed relatively quickly.

Conclusion

MySQL 5.6 introduces a lot of interesting new features that are even more compelling in the PXC/Galera world. If you want to experiment for yourself, I pushed the Vagrant environment I used to Github: https://github.com/jayjanssen/pxc_56_features


Oct 16, 2013

Measuring Max Replication Throughput on Percona XtraDB Cluster with wsrep_desync

Checking throughput with async MySQL replication

Replication throughput is the measure of just how fast the slaves can apply replication (at least by my definition).  In MySQL async replication this is important to know because the single-threaded apply nature of async replication can be a write performance bottleneck.  In a production system, we can tell how fast the slave is currently running (applying writes), and we might have historical data to check for the most throughput ever seen, but that doesn’t give us a solid way of determining where we stand right NOW().

An old consulting trick to answer this question is to simply stop replication on your slave for a minute (usually just the SQL_THREAD), restart it, and watch how long it takes to catch up. We can also watch the slave's apply rate during this interval to get a sense of just how many writes per second it can do, and compare that with the normal rate (during peak hours, for example). This can be a handy way of quickly assessing how close you are to your maximum theoretical throughput.
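In SQL, that trick amounts to something like the following sketch; the 60-second pause is arbitrary:

slave> stop slave sql_thread;   -- keep the IO thread pulling binlogs
slave> do sleep(60);            -- let writes accumulate on the master
slave> start slave sql_thread;
slave> show slave status\G      -- watch Seconds_Behind_Master drain back to 0

Dividing the number of transactions applied during the drain by the time it took gives the slave's maximum apply rate.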

But what about with PXC and Galera? This is easy with async replication because the master doesn’t care, but to be able to do this on PXC we need a way to intentionally lag a node without hanging or causing flow control on the rest of the cluster. And as it turns out, as of version 5.5.33, there’s a pretty easy way.

Measuring an average apply rate on PXC

First we need to pick a node that is not taking reads or writes (or shift some traffic away from one that is). We’re assuming reads and writes are happening normally on the rest of the cluster, and probably also that the node we chose has hardware pretty similar to every other node. Once we have this, we can use myq_status to watch replication coming into the node and being applied.
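The output below comes from myq_status (part of the myq_gadgets toolkit). The exact invocation is my recollection, so treat it as an assumption:

$ myq_status wsrep

This prints one line of wsrep status per second: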

mycluster / ip-10-142-147-72 / Galera 2.7(r157)
Wsrep    Cluster  Node     Queue   Ops     Bytes     Flow    Conflct PApply        Commit
    time P cnf  #  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt lcf bfa dst oooe oool wind
20:18:04 P   3  3 Sync T/T   0   0   0  5k    0 2.7M 0.0   0   0   0 703   82    4    3
20:18:05 P   3  3 Sync T/T   0  15   0  6k    0 3.1M 0.1   5   0   0 804   78    4    3
20:18:06 P   3  3 Sync T/T   0   0   0  6k    0 3.2M 0.1   5   0   0 701   80    4    3
20:18:07 P   3  3 Sync T/T   0  10   0  6k    0 3.1M 0.0   5   0   0 820   81    5    3
20:18:08 P   3  3 Sync T/T   0   0   0  6k    0 3.1M 0.1   3   0   0  1k   77    4    3
20:18:09 P   3  3 Sync T/T   0  57   0  3k    0 1.3M 0.6   2   0   0 758   64    2    2

If we check the rate of growth of wsrep_last_committed over a full minute we can see:

ip-10-142-147-72 mysql> show global status like 'wsrep_last_committed'; select sleep(60); show global status like 'wsrep_last_committed';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 3136415 |
+----------------------+---------+
1 row in set (0.01 sec)
+-----------+
| sleep(60) |
+-----------+
|         0 |
+-----------+
1 row in set (59.99 sec)
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 3443992 |
+----------------------+---------+
1 row in set (0.00 sec)

So we’re averaging about 5.1k TPS applied on this node, and hence across the whole cluster: (3443992 - 3136415) / 60 ≈ 5,126 transactions per second. But how much can we handle at peak?

Measuring Max Replication throughput on PXC

In another window on that same node, we execute this SQL (all at once):

mysql> set global wsrep_desync=ON; flush tables with read lock; show global status like 'wsrep_last_committed'; select sleep( 60 ); unlock tables;
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| wsrep_last_committed | 665368 |
+----------------------+--------+
1 row in set (0.00 sec)

We’ve desynced the node, locked writes on all tables, and checked the last seqno committed on this node (665368). Setting wsrep_desync tells this node to enter the Donor/Desynced state, which means it will not send flow control to the rest of the cluster if its apply queue gets backlogged.

Taking the read lock on all tables pauses the Galera applier; we then sleep 60 seconds, unlock the tables, and watch how long the node takes to recover.

Once the initial FTWRL happens, we can immediately see the node drop to the Donor/Desynced state and watch replication start to queue up on the node:

20:18:10 P   3  3 Sync T/T   0  55   0  4k    0 2.0M 0.4   3   0   0 768   79    4    4
20:18:11 P   3  3 Dono T/T   0  6k   0 577    0 306K 0.0   3   0   0 791   79    0    4
20:18:12 P   3  3 Dono T/T   0 12k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:18:13 P   3  3 Dono T/T   0 18k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:18:14 P   3  3 Dono T/T   0 24k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:18:15 P   3  3 Dono T/T   0 31k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:18:16 P   3  3 Dono T/T   0 37k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:18:17 P   3  3 Dono T/T   0 43k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:18:18 P   3  3 Dono T/T   0 49k   0   0    0    0 0.0   0   0   0 791    0    0    0

The rest of our cluster is operating normally here.

A minute later the queue is backlogged to almost 350k transactions. Then the lock is released, and Galera starts to apply that queue as quickly as possible:

mycluster / ip-10-142-147-72 / Galera 2.7(r157)
Wsrep    Cluster  Node     Queue   Ops     Bytes     Flow    Conflct PApply        Commit
    time P cnf  #  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt lcf bfa dst oooe oool wind
20:19:06 P   3  3 Dono T/T   0 326k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:19:07 P   3  3 Dono T/T   0 332k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:19:09 P   3  3 Dono T/T   0 339k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:19:10 P   3  3 Dono T/T   0 345k   0   0    0    0 0.0   0   0   0 791    0    0    0
20:19:11 P   3  3 Dono T/T   0 342k   0  9k    0 4.6M 0.0   0   0   0  5k  100    1   14
20:19:12 P   3  3 Dono T/T   0 336k   0 13k    0 6.7M 0.0   0   0   0 12k   96    2   10
20:19:13 P   3  3 Dono T/T   0 329k   0 12k    0 6.4M 0.0   0   0   0 15k   96    2   10
20:19:14 P   3  3 Dono T/T   0 322k   0 13k    0 6.8M 0.0   0   0   0 15k   92    2    6
20:19:15 P   3  3 Dono T/T   0 313k   0 13k    0 6.9M 0.3   0   0   0 15k   94    1    8
20:19:16 P   3  3 Dono T/T   0 304k   0 13k    0 6.7M 0.4   0   0   0 15k   91    1    7
20:19:17 P   3  3 Dono T/T   0 298k   0 13k    0 6.7M 0.0   0   0   0 15k   94    1    9

We can see right away that our ‘Ops Dn’ is much higher: peaking at 13k, but how can we get a good average? Let’s watch it catch all the way up:

mycluster / ip-10-142-147-72 / Galera 2.7(r157)
Wsrep    Cluster  Node     Queue   Ops     Bytes     Flow    Conflct PApply        Commit
    time P cnf  #  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt lcf bfa dst oooe oool wind
20:19:54 P   3  3 Dono T/T   0 70k   0 13k    0 6.7M 0.0   0   0   0 15k   95    2    9
20:19:55 P   3  3 Dono T/T   0 63k   0 13k    0 6.8M 0.0   0   0   0 15k   94    2    8
20:19:56 P   3  3 Dono T/T   0 56k   0 13k    0 6.7M 0.0   0   0   0 15k   92    2    7
20:19:57 P   3  3 Dono T/T   0 46k   0 14k    0 7.1M 0.0   0   0   0 15k   97    2   11
20:19:59 P   3  3 Dono T/T   0 39k   0 13k    0 6.9M 0.0   0   0   0 15k   93    2    8
20:20:00 P   3  3 Dono T/T   0 31k   0 13k    0 6.9M 0.1   0   0   0 15k   95    2   10
20:20:01 P   3  3 Dono T/T   0 25k   0 13k    0 6.8M 0.0   0   0   0 15k   92    2    7
20:20:02 P   3  3 Dono T/T   0 18k   0 12k    0 6.3M 0.0   0   0   0 15k   95    2    8
20:20:03 P   3  3 Dono T/T   0  5k   0 14k    0 7.3M 0.0   0   0   0 15k   90    2    5
20:20:04 P   3  3 Dono T/T   0  23   0 11k    0 5.8M 0.0   0   0   0 887   95    3    9
20:20:05 P   3  3 Dono T/T   0   1   0  7k    0 3.4M 0.0   0   0   0 883   67    4    2
20:20:06 P   3  3 Dono T/T   0   0   0  7k    0 3.4M 0.0   0   0   0 920   68    3    2
20:20:07 P   3  3 Dono T/T   0   0   0  5k    0 2.7M 0.0   0   0   0 852   73    4    3

So, it took this node about 50 seconds to catch up. Right at 20:20:05, when the queue zeroed out, I checked wsrep_last_committed again:

ip-10-142-147-72 mysql> show global status like 'wsrep_last_committed';
+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_last_committed | 1332551 |
+----------------------+---------+
1 row in set (0.00 sec)

Be sure to turn off wsrep_desync once you are done and caught up! Note that you can turn off wsrep_desync right away instead, but that puts the node into the JOINED state, which applies limited flow control to help the node catch up. We want our sample to be unbiased by flow control (at least from this node).

ip-10-142-147-72 mysql> set global wsrep_desync=OFF;
Query OK, 0 rows affected, 1 warning (0.00 sec)

So the node drops back into the ‘Synced’ state and FC applies again:

mycluster / ip-10-142-147-72 / Galera 2.7(r157)
Wsrep    Cluster  Node     Queue   Ops     Bytes     Flow    Conflct PApply        Commit
    time P cnf  #  cmt sta  Up  Dn  Up  Dn   Up   Dn pau snt lcf bfa dst oooe oool wind
20:20:10 P   3  3 Dono T/T   0   0   0  6k    0 3.2M 0.1   0   0   0 901   76    4    3
20:20:11 P   3  3 Dono T/T   0   1   0  6k    0 3.3M 0.0   0   0   0 877   82    4    4
20:20:12 P   3  3 Dono T/T   0  24   0  6k    0 3.3M 0.0   0   0   0 877   88    3    7
20:20:13 P   3  3 Dono T/T   0   0   0  7k    0 3.5M 0.0   0   0   0 886   81    4    5
20:20:14 P   3  3 Sync T/T   0   0   0  6k    0 3.2M 0.0   2   0   0 879   69    3    3
20:20:15 P   3  3 Sync T/T   0   0   0  7k    0 3.4M 0.0   1   0   0 873   79    4    3
20:20:17 P   3  3 Sync T/T   0   0   0  6k    0 3.3M 0.1   7   0   0 971   80    4    3

Conclusion

So in 50 seconds the node was able to apply 667,183 transactions (1332551 - 665368, the difference between the two wsrep_last_committed values), which comes out to roughly 13.3k TPS of apply capacity (at least sustained over this burst). Against our normal rate of 5.1k TPS, this tells us we’re running at around 38% of capacity for write throughput. Is that a perfect number? Maybe not, but it at least gives you a rough idea.

However, the point is that thanks to wsrep_desync we can measure this safely within a synchronous replication environment that may normally not allow this type of operation.

