Oct
23
2017
--

MySQL Point in Time Recovery the Right Way

MySQL Point In Time Recovery

MySQL Point In Time RecoveryIn this blog, I’ll look at how to do MySQL point in time recovery (PITR) correctly.

Sometimes we need to restore from a backup, and then replay the transactions that happened after the backup was taken. This is a common procedure in most disaster recovery plans, when for example you accidentally drop a table/database or run an update/delete without the “where” clause and lose data.

The usual way is to get a copy of your binlogs and use mysqlbinlog to replay those transactions. But this approach has many pitfalls that can make the whole PITR process a nightmare. Some examples:

  • You need to make sure to run a single mysqldump command with all related binlogs, and pipe them to mysql at once. Otherwise, if binlog.000001 creates a temporary table, and binlog.000002 requires that temporary table, it will not be present. Each execution of MySQL creates a new connection:
shell> mysqlbinlog binlog.000001 | mysql -u root -p # Creates tmp table X
shell> mysqlbinlog binlog.000002 | mysql -u root -p # Uses tmp table X

  • We can say that it has to be an atomic operation. If it fails halfway through, it will be very difficult to know where it failed and even more difficult to resume from that point forward. There are many reasons for it to fail: InnoDB lock wait timeout / deadlock with some concurrent transaction, server and client have different
    max_allowed_packet

     and you get a Lost connection to MySQL server during query error, and so on.

So how can we overcome those limitations and have a reliable way to do Point In Time Recovery?

We can restore the backup on the desired server, build a second server with just the minimal MySQL required data and move the all binary logs to this “fake” server datadir. Then we need to configure the server where we want the PITR to happen as a slave of the fake server. From this point forward, it’s going to be pure MySQL replication happening.

To illustrate it, I create a Docker container on the machine. I have Percona Server for MySQL running on the box listening on 3306, and have already restored the backup on it. There is a tarball there with all binlogs required. The saved positions for PITR are as follows:

[root@localhost ~]# cat /var/lib/mysql/xtrabackup_binlog_info
master-bin.000007	1518932

I create a folder to store the Docker MySQL datadir:

mkdir /tmp/pitr
chown -R 1001 /tmp/pitr

I start the Docker container. As we can see from xtrabackup_binlog_info, my binlogs are named master-bin and I’ll be setting the same server-id as original master:

docker run --name ps_pitr -v /tmp/pitr:/var/lib/mysql
-p 3307:3306 -e MYSQL_ROOT_PASSWORD=secret
-d percona/percona-server:5.7.18
--log_bin=master-bin --server-id=10

In case you want to make usage of GTID, append --gtid-mode=ON --enforce_gtid_consistency=ON to the end of the Docker command.

The command above starts a MySQL instance, invokes mysqld –initialize, sets the root password to secret and it’s port 3306 is mapped back to my local 3307 port. Now I’ll stop it, remove the binlogs that it created, uncompress and move all required binlogs to its datadir mapped folder and start it again:

docker stop ps_pitr
rm /tmp/pitr/master-bin.*
tar -zxf binlogs.tgz -C /tmp/pitr
chown -R 1001 /tmp/pitr/master-bin.*
docker start ps_pitr

If it all worked correctly, at this point we can see the full list of binary logs on the Docker container by connecting on port 3307:

mysql -u root -psecret -P 3307 --protocol=TCP -e "SHOW BINARY LOGS"
mysql: [Warning] Using a password on the command line interface can be insecure.
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| master-bin.000005 |  26216208 |
| master-bin.000006 |  26214614 |
| master-bin.000007 |  26214902 |
. . .
| master-bin.000074 |       154 |
+-------------------+-----------+

Now, all we need to do is connect to our server, which has the backup restored, and configure it as a slave from 3307:

mysql -u root -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or g.
Your MySQL connection id is 6
Server version: 5.7.18-16 Percona Server (GPL), Release 16, Revision d7301f8
Copyright (c) 2009-2017 Percona LLC and/or its affiliates
Copyright (c) 2000, 2017, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or 'h' for help. Type 'c' to clear the current input statement.
mysql> CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3307, MASTER_USER='root', MASTER_PASSWORD='secret', MASTER_LOG_FILE='master-bin.000007', MASTER_LOG_POS=1518932;
Query OK, 0 rows affected, 2 warnings (0.01 sec)
mysql> START SLAVE;
Query OK, 0 rows affected (0.01 sec)
mysql> SHOW SLAVE STATUSG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: root
                  Master_Port: 3307
                Connect_Retry: 60
              Master_Log_File: master-bin.000008
          Read_Master_Log_Pos: 449696
               Relay_Log_File: localhost-relay-bin.000002
                Relay_Log_Pos: 28957
        Relay_Master_Log_File: master-bin.000007
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 15217950
              Relay_Log_Space: 11476311
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 4382
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
                  Master_UUID: 80b9fe26-a945-11e7-aa1d-0242ac110002
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Opening tables
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.17 sec)
. . .
mysql> SHOW SLAVE STATUSG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: root
                  Master_Port: 3307
                Connect_Retry: 60
              Master_Log_File: master-bin.000074
          Read_Master_Log_Pos: 154
               Relay_Log_File: localhost-relay-bin.000133
                Relay_Log_Pos: 381
        Relay_Master_Log_File: master-bin.000074
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 154
              Relay_Log_Space: 819
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 10
                  Master_UUID: 80b9fe26-a945-11e7-aa1d-0242ac110002
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.01 sec)

If you want to apply logs up to a particular time you can make use of mysqlbinlog to verify what the last position / GTID it should apply, and use START SLAVE UNTIL MASTER_LOG_FILE = 'log_name', MASTER_LOG_POS = log_pos or START SLAVE SQL_THREAD UNTIL SQL_AFTER_GTIDS = 3E11FA47-71CA-11E1-9E33-C80AA9429562:11-56.

Special thanks to Marcos Albe, who originally showed me this MySQL point in time recovery approach.

Aug
23
2017
--

Migrating Data from an Encrypted Amazon MySQL RDS Instance to an Encrypted Amazon Aurora Instance

Migrating Data

Migrating DataIn this blog post, we’ll discuss migrating data from encrypted Amazon MySQL RDS to encrypted Amazon Aurora.

One of my customers wanted to migrate from an encrypted MySQL RDS instance to an encrypted Aurora instance. They have a pretty large database, therefore using mysqldump or a similar tool was not suitable for them. They also wanted to setup replication between old MySQL RDS and new Aurora instances.

Spoiler: this is possible without any logical dump.

At first, I checked Amazon’s documentation on encryption and found nothing about this type of migration. Even more, if I trust the documentation it looks like they don’t support replication or migration between encrypted MySQL RDS and encrypted Aurora. All instructions are for either “MySQL RDS to MySQL RDS” or “Aurora to Aurora” setups. For example, the documentation says here:

You can create Read Replicas of both encrypted and unencrypted DB clusters. The Read Replica must be encrypted if the source DB cluster is encrypted.

When I tried to create an Aurora read replica of my encrypted MySQL RDS instance, however, the “Enable Encryption” select control was grayed out and I could not change “No” to “Yes”.

I had to find a workaround.

Another idea was creating an encrypted MySQL RDS replica and migrating it to Aurora. While creating encrypted MySQL replica is certainly possible (actually all replicas of encrypted instances must be encrypted) it was not possible to migrate it to any other instance using the standard “Migrate Latest Snapshot” option:

However, the documentation specified that Aurora and MySQL RDS use the same AWS KMS key. As a result, both kinds of encryption should be compatible (if not practically the same). Amazon also has the “AWS Database Migration Service“, which has this promising section in its FAQ:

Q. Can I replicate data from encrypted data sources?

Yes, AWS Database Migration Service can read and write from and to encrypted databases. AWS Database Migration Service connects to your database endpoints on the SQL interface layer. If you use the Transparent Data Encryption features of Oracle or SQL Server, AWS Database Migration Service will be able to extract decrypted data from such sources and replicate it to the target. The same applies to storage-level encryption. As long as AWS Database Migration Service has the correct credentials to the database source, it will be able to connect to the source and propagate data (in decrypted form) to the target. We recommend using encryption-at-rest on the target to maintain the confidentiality of your information. If you use application-level encryption, the data will be transmitted through AWS Database Migration Service as is, in encrypted format, and then inserted into the target database.

I decided to give it a try. And it worked!

The next step was to make this newly migrated Aurora encrypted instance a read replica of the original MySQL RDS instance. This is easy in part with the help of great how-to on migration by Adrian Cantrill. As suggested, you only need to find the master’s binary log file, current position and supply them to the stored routine

mysql.rds_set_external_master

. Then start replication using the stored routine

mysql.rds_start_replication

.

Conclusion: While AWS Database Migration Service has limitations for both source and target databases, this solution allows you to migrate encrypted instances easily and securely.

Save

Save

Save

Save

Aug
22
2017
--

The MySQL High Availability Landscape in 2017 (The Adults)

In this blog post, we’ll look at some of the MySQL high availability solution options.

In the previous post of this series, we looked at the MySQL high availability (HA) solutions that have been around for a long time. I called these solutions “the elders.” Some of these solutions (like replication) are heavily used today and have been improved from release to release of MySQL.

This post focuses on the MySQL high availability solutions that have appeared over the last five years and gained a fair amount of traction in the community. I chose to include this group only two solutions: Galera and RDS Aurora. I’ll use the term “Galera” generically: it covers Galera Cluster, MariaDB Cluster and Percona XtraDB Cluster. I debated for some time whether or not to include Aurora. I don’t like the fact that they use closed source code. Given the tight integration with the AWS environment, what is the commercial risk of opening the source code? That question evades me, but I am not on the business side of technology. ?

Galera

When I say “Galera,” it means a replication protocol supported by a library provided by Codeship, a Finnish company. The library needs hooks inside the MySQL source code for the replication protocol to work. In Percona XtraDB cluster, I counted 66 .cc files where the word “wsrep” is present. As you can see, it is not a small task to add support for Galera to MySQL. Not all the implementations are similar. Percona, for example, focused more on stability and usability at the expense of new features.

high availability

Let’s start with a quick description of a Galera-based cluster. Unless you don’t care about split-brain, a Galera cluster needs at least three nodes. The Galera replication protocol is nearly synchronous, which is a huge gain compared to regular MySQL replication. It performs transactions almost simultaneously on all the nodes, and the protocol ensures the same commit order. The transactions are almost synchronous because there are incoming queues on each node to improve performance. The presence of these incoming queues forces an extra step: the certification. The certification compares an incoming transaction with the ones already queued. If there a conflict, it returns a deadlock error.

For performance reasons, the certification process must be quick so that the incoming queue stays in memory. Since the number of transactions defines the size of the queue, the presence of large transactions uses a lot of memory. There are safeguards against memory overload, so be aware that transactions like:

update ATableWithMillionsRows set colA=1;

will likely fail. That’s the first important limitation of a Galera-based cluster: the size of the number transactions is limited.

It is also critical to uniquely identify conflicting rows. The best way to achieve an efficient row comparison is to make sure all the tables have a primary key. In a Galera-based cluster, your tables need primary keys otherwise you’ll run into trouble. That’s the second limitation of a Galera based cluster: the need for primary keys. Personally, I think that a table should always have a primary key – but I have seen many oddities…

Another design characteristic is the need for an acknowledgment by all the nodes when a transaction commits. That means the network link with the largest latency between two nodes will set the floor value of the transactional latency. It is an important factor to consider when deploying a Galera-based cluster over a WAN. Similarly, an overloaded node can slow down the cluster if it cannot acknowledge the transaction in a timely manner. In most cases, adding slave threads will allow you to overcome the throughput limitations imposed by the network latency. Each transaction will suffer from latency, but more of them will be able to run at the same time and maintain the throughput.

The exception here is when there are “hot rows.” A hot row is a row that is hammered by updates all the time. The transactions affecting hot rows cannot be executed in parallel and are thus limited by the network latency.

Since Galera-based clusters are very popular, they must also have some good points. The first and most obvious is the full-durability support. Even if the node on which you executed a transaction crashes a fraction of second after the commit statement returned, the data is present on the other nodes incoming queues. In my opinion, it is the main reason for the demise of the share storage solution. Before Galera-based clusters, the shared storage solution was the only other solution guaranteeing no data loss in case of a crash.

While the standby node is unusable with the shared storage solution, all the nodes of a Galera-based cluster are available and are almost in sync. All the nodes can be used for reads without stressing too much about replication lag. If you accept a higher risk of deadlock errors, you can even write on all nodes.

Finally, and not the least, there is an automatic provisioning service for the new nodes called SST. During the SST process, a joiner node asks the cluster for a donor. One of the existing nodes agrees to be the donor and initiate a full backup. The backup is streamed over the network to the joiner and restored there. When the backup completes, the joiner performs an IST to get the recent updates and, once applied, joins the cluster. The most common SST method uses the Percona XtraBackup utility. When using SST for XtraBackup, the cluster is fully available during the SST, although it may degrade performance. This feature really simplifies the operational side of things.

The technology is very popular. Of course, I am a bit biased since I work for Percona and one of our flagship products is Percona XtraDB Cluster – an implementation of the Galera protocol. Other than standard MySQL replication, it is by far the most common HA solution used by the customers I work with.

RDS Aurora

The second “adult” MySQL high availability solution is RDS Aurora. I hesitated to add Aurora here, mainly because it is not an open-source technology. I must also admit that I haven’t followed the latest developments around Aurora very closely. So, let’s describe Aurora.

There are three major parts in Aurora: at least one database server, the writer node and the storage.

high availability

What makes Aurora special is the storage layer has its own processing logic. I don’t know if the processing logic is part of the writer node AWS instance or part of the storage service directly, since the source code is not available. Anyway, I’ll call that layer the appliers. The applier role is to apply redo log fragments that then allow the writer node to write only the redo log fragments (normally written to the InnoDB log files). The appliers read those fragments and modify the pages in the storage. If a node requests a page that has pending redo fragments to be applied, they get applied before returning the page.

From the writer node perspective, there are much fewer writes. There is also no direct upper bound in terms of a number of fragments to be queued, so it is a bit like having

innodb_log_file_size

 set to an extremely large value. Also, since Aurora doesn’t need to flush pages, if the write node needs to read from the storage, and there are no free pages in the buffer pool, it can just discard one even if it is “dirty.” Actually, there are no dirty pages in the buffer pool.

So far, that seems to be very good for high write loads with spikes. What about the reader nodes? These reader nodes receive the updates from the writer nodes. If the update concerns a page they have in their buffer pool, they can modify it in place or discard it and read again from the storage. Again, without the source code, it is hard to tell the implementation. The point is, the readers have the same data as the master, they just can’t lag behind.

Apart from the impossibility of any reader lag, the other good point of Aurora is the storage. InnoDB pages are saved as objects in an object store, not like in a regular file on a file system. That means you don’t need to over-provision your storage ahead of time. You pay for what you are using – actually the maximum you ever use. InnoDB tablespaces do not shrink, even with Aurora.

Furthermore, if you have a 5TB dataset and your workload is such that you would need ten servers (one writer and nine readers), you still need only 5TB of storage if you are not replicating to other AZ. If we compare with regular MySQL and replication, you would have one master and nine slaves, each with 5TB of storage, for a total of 50TB. On top of that, you’ll have at least ten times the write IOPS.

So, storage wise, we have something that could be very interesting for applications with large datasets and highly concurrent read heavy workloads. You ensure high availability with the ability to promote a reader to writer automatically. You access the primary or the readers through endpoints that automatically connect to the correct instances. Finally, you can replicate the storage to multiple availability zones for DR.

Of course, such an architecture comes with a trade-off. If you experiment with Aurora, you’ll quickly discover that the smallest instance types underperform while the largest ones perform in a more expected manner. It is also quite easy to overload the appliers. Just perform the following queries:

update ATableWithMillionsRows set colA=1;
select count(*) from ATableWithMillionsRows where colA=1;

given that the table ATableWithMillionsRows is larger than the buffer pool. The select will hang for a long time because the appliers are overloaded by the number of pages to update.

In term of adoption, we have some customers at Percona using Aurora, but not that many. It could be that users of Aurora do not naturally go to Percona for services and support. I also wonder about the decision to keep the source code closed. It is certainly not a positive marketing factor in a community like the MySQL community. Since the Aurora technology seems extremely bounded to their ecosystem, is there really a risk for the technology to be reused by a competitor? With a better understanding of the technology through open access to the source, Amazon could have received valuable contributions. It would also be much easier to understand, tune and recommend Aurora.

Further reading:

Jul
20
2017
--

Where Do I Put ProxySQL?

ProxySQL

In this blog post, we’ll look at how to deploy ProxySQL.

ProxySQL is a high-performance proxy, currently for MySQL and its forks (like Percona Server for MySQL and MariaDB). It acts as an intermediary for client requests seeking resources from the database. It was created for DBAs by René Cannaò, as a means of solving complex replication topology issues. When bringing up ProxySQL with my clients, I always get questions about where it fits into the architecture. This post should clarify that.

Before continuing, you might want to know why you should use this software. The features that are of interest include:

  • MySQL firewall
  • Connection pooling
  • Shard lookup and automated routing
  • Ability to read/write split
  • Automatically switch to another master in case of active master failure
  • Query cache
  • Performance metrics
  • Other neat features!

Initial Configuration

In general, you install it on nodes that do not have a running MySQL database. You manage it via the MySQL command line on another port, usually 6032. Once it is started the configuration in /etc is not used, and you do everything within the CLI. The backend database is actually SQLite, and the db file is stored in /var/lib/proxysql.

There are many guides out there on initializing and installing it, so I won’t cover those details here. It can be as simple as:

apt-get install proxysql

ProxySQL Architecture

While most first think to install ProxySQL on a standalone node between the application and database, this has the potential to affect query performance due to the additional latency from network hops.

 

ProxySQL

To have minimal impact on performance (and avoid the additional network hop), many recommend installing ProxySQL on the application servers. The application then connects to ProxySQL (acting as a MySQL server) on localhost, using Unix Domain Socket, and avoiding extra latency. It would then use its routing rules to reach out and talk to the actual MySQL servers with its own connection pooling. The application doesn’t have any idea what happens beyond its connection to ProxySQL.

ProxySQL

Reducing Your Network Attack Surface

Another consideration is reducing your network attack surface. This means attempting to control all of the possible vulnerabilities in your network’s hardware and software that are accessible to unauthenticated users.

Percona generally suggests that you put a ProxySQL instance on each application host, like in the second image above. This suggestion is certainly valid for reducing latency in your database environment (by limiting network jumps). But while this is good for performance, it can be bad for security.

Every instance must be able to talk to:

  • Every master
  • Every slave

As you can imagine, this is a security nightmare. With every instance, you have x many more connections spanning your network. That’s x many more connections an attacker might exploit.

Instead, it can be better to have one or more ProxySQL instances that are between your application and MySQL servers (like the first image above). This provides a reasonable DMZ-type setup that prevents opening too many connections across the network.

That said, both architectures are valid production configurations – depending on your requirements.

Jul
19
2017
--

Multi-Threaded Slave Statistics

Multi-Threaded Slave Statistics

Multi-Threaded Slave StatisticsIn this blog post, I’ll talk about multi-threaded slave statistics printed in MySQL error log file.

MySQL version 5.6 and later allows you to execute replicated events using parallel threads. This feature is called Multi-Threaded Slave (MTS), and to enable it you need to modify the

slave_parallel_workers

 variable to a value greater than 1.

Recently, a few customers asked about the meaning of some new statistics printed in their error log files when they enable MTS. These error messages look similar to the example stated below:

[Note] Multi-threaded slave statistics for channel '': seconds elapsed = 123; events assigned = 57345; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 0 waited (count) when Workers occupied = 0 waited when Workers occupied = 0

The MySQL reference manual doesn’t show information about these statistics. I’ve filled a bug report asking Oracle to add information about these statistics in the MySQL documentation. I reported this bug as #85747.

Before they update the documentation, we can use the MySQL code to get insight as to the statistics meaning. We can also determine how often these statistics are printed in the error log file. Looking into the rpl_slave.cc file, we find that when you enable MTS – and log-warnings variable is greater than 1 (log-error-verbosity greater than 2 for MySQL 5.7) – the time to print these statistics in MySQL error log is 120 seconds. It is determined by a hard-coded constant number. The code below shows this:

/*
  Statistics go to the error log every # of seconds when --log-warnings > 1
*/
const long mts_online_stat_period= 60 * 2;

Does this mean that every 120 seconds MTS prints statistics to your MySQL error log (if enabled)? The answer is no. MTS prints statistics in the mentioned period depending on the level of activity of your slave. The following line in MySQL code verifies the level of the slave’s activity to print the statistics:

if (rli->is_parallel_exec() && rli->mts_events_assigned % 1024 == 1)

From the above code, you need MTS enabled and the modulo operation between the 

mts_events_assigned

 variable and 1024 equal to 1 in order to print the statistics. The 

mts_events_assigned

 variable stores the number of events assigned to the parallel queue. If you’re replicating a low level of events, or not replicating at all, MySQL won’t print the statistics in the error log. On the other hand, if you’re replicating a high number of events all the time, and the

mts_events_assigned

 variable increased its value until the remainder from the division between this variable and 1024 is 1, MySQL prints MTS statistics in the error log almost every 120 seconds.

You can find the explanation these statistics below (collected from information in the source code):

  1. Worker queues filled over overrun level: MTS tends to load balance events between all parallel workers, and the  
    slave_parallel_workers

     variable determines the number of workers. This statistic shows the level of saturation that workers are suffering. If a parallel worker queue is close to full, this counter is incremented and the worker replication event is delayed in order to avoid reaching worker queue limits.

  2. Waited due to a Worker queue full: This statistic is incremented when the coordinator thread must wait because of the worker queue gets overfull.
  3. Waited due to the total size: This statistic shows the number of times that the coordinator thread slept due to reaching the limit of the memory available in the worker queue to hold unapplied events. If this statistic keeps increasing when printed in your MySQL error log, you should resize the
    slave_pending_jobs_size_max

     variable to a higher value to avoid the coordinator thread waiting time.

  4. Waited at clock conflicts: In the case of possible dependencies between transactions, this statistic shows the wait time corresponding to logical timestamp conflict detection and resolution.
  5. Waited (count) when used occupied: A counter of how many times the coordinator saw Workers filled up with “enough” with assignments. The enough definition depends on the scheduler type (per-database or clock-based).
  6. Waited when workers occupied: These are statistics to compute coordinator thread waiting time for any worker available, and applies solely to the Commit-clock scheduler.
Conclusion

Multi-threaded slave is an exciting feature that allows you to replicate events faster, and keep in sync with master instances. By changing the log-warnings variable to a value greater than 1, you can get information from the slave error log file about how multi-threaded performance.

Jun
20
2017
--

The MySQL High Availability Landscape in 2017 (The Elders)

High Availability

In this blog, we’ll look at different MySQL high availability options.

The dynamic MySQL ecosystem is rapidly evolving many technologies built around MySQL. This is especially true for the technologies involved with the high availability (HA) aspects of MySQL. When I joined Percona back in 2009, some of these HA technologies were very popular – but have since been almost forgotten. During the same interval, new technologies have emerged. In order to give some perspective to the reader, and hopefully help to make better choices, I’ll review the MySQL HA landscape as it is in 2017. This review will be in three parts. The first part (this post) will cover the technologies that have been around for a long time: the elders. The second part will focus on the technologies that are very popular today: the adults. Finally, the last part will try to extrapolate which technologies could become popular in the upcoming years: the babies.

Quick disclaimer, I am reporting on the technologies I see the most. There are likely many other solutions not covered here, but I can’t talk about technologies I have barely or never used. Apart from the RDS-related technologies, all the technologies covered are open-source. The target audience for this post are people relatively new to MySQL.

The Elders

Let’s define the technologies in the elders group. These are technologies that anyone involved with MySQL for last ten years is sure to be aware of. I could have called this group the “classics”.  I include the following technologies in this group:

  • Replication
  • Shared storage
  • NDB cluster

Let’s review these technologies in the following sections.

Replication

Simple replication topology

 

MySQL replication is very well known. It is one of the main features behind the wide adoption of MySQL. Replication gets used almost everywhere. The reasons for that are numerous:

  • Replication is simple to setup. There are tons of how-to guides and scripts available to add a slave to a MySQL server. With Amazon RDS, adding a slave is just a few clicks.
  • Slaves allow you to easily scale reads. The slaves are accessible and can be used for reads. This is the most common way of scaling up a MySQL database.
  • Slaves have little impact on the master. Apart from the added network traffic, the presence of slaves does not impact the master performance significantly.
  • It is well known. No surprises here.
  • Used for failover. Your master died, promote a slave and use it as your new master.
  • Used for backups. You don’t want to overload your master with the backups, run them off a slave.

Of course, replication also has some issues:

  • Replication can lag. Replication used to be single-threaded. That means a master with a concurrent load could easily outpace a slave. MySQL 5.6 and MariaDB 10.0 have introduced some parallelism to the slave. Newer versions have further improved to a point where today’s slaves are many times faster than they were.
  • Slaves can diverge. When you modify data on the master, the slave must perform the exact same update. That seems easy, but there are many ways an update can be non-deterministic with statement-based replication. They fixed many issues, and the introduction of row-based replication has been another big step forward. Still, if you write directly to a slave you are asking for trouble. There is a read_only setting, but if the MySQL user has the “SUPER” privilege it is just ignored. That’s why there is now the “super_read_only” setting. Tools like pt-table-checksum and pt-table-sync from the Percona toolkit exist to solve this problem.
  • Replication can impact the master. I wrote above that the presence of slaves does not affect the master, but logging changes are more problematic. The most common issue is the InnoDB table-level locking for auto_increment values with statement-based replication. Only one thread can insert new rows at a time. You can avoid this issue with row-based replication and properly configuring settings.
  • Data gets lost. Replication is asynchronous. That means the master will reply “done” after a commit statement even though the slaves have not received updates yet. Some transactions can get lost if the master crashes.

Although an old technology, a lot of work has been done on replication. It is miles away from the replication implementation of 5.0.x. Here’s a list, likely incomplete, of the evolution of replication:

  • Row based replication (since 5.1). The binary internal representation of the rows is sent instead of the SQL statements. This makes replication more robust against slave divergence.
  • Global transaction ID (since 5.6). Transactions are uniquely identified. Replication can be setup without knowing the binlog file and offset.
  • Checksum (since 5.6). Binlog events have checksum values to validate their integrity.
  • Semi-sync replication (since 5.5). An addition to the replication protocol to make the master aware of the reception of events by the slaves. This helps to avoid losing data when a master crashes.
  • Multi-source replication (since 5.7). Allows a slave to have more than one master.
  • Multi-threaded replication (since 5.6). Allows a slave to use multiple threads. This helps to limit the slave lag.

Managing replication is a tedious job. The community has written many tools to manage replication:

  • MMM. An old Perl tool that used to be quite popular, but had many issues. Now rarely used.
  • MHA. The most popular tool to manage replication. It excels at reconfiguring replication without losing data, and does a decent at handling failover.  It is also simple. No wonder it is popular.
  • PRM. A Pacemaker-based solution developed to replace MMM. It’s quite good at failover, but not as good as MHA at reconfiguring replication. It’s also quite complex, thanks to Pacemaker. Not used much.
  • Orchestrator. The new cool tool. It can manage complex topologies and has a nice web-based interface to monitor and control the topology.

 

Shared Storage

Simple shared storage topology

 

Back when I was working for MySQL ten years ago, shared storage HA setups were very common. A shared storage HA cluster uses one copy of the database files between one of two servers. One server is active, the other one is passive. In order to be shared, the database files reside on a device that can be mounted by both servers. The device can be physical (like a SAN), or logical (like a Linux DRBD device). On top of that, you need a cluster manager (like Pacemaker) to handle the resources and failovers. This solution is very popular because it allows for failover without losing any transactions.

The main drawback of this setup is the need for an idle standby server. The standby server cannot have any other assigned duties since it must always be ready to take over the MySQL server. A shared storage solution is also obviously not resilient to file-level corruption (but that situation is exceptional). Finally, it doesn’t play well with a cloud-based environment.

Today, newly-deployed shared storage HA setups are rare. The only ones I encountered over the last year were either old implementations needing support, or new setups that deployed because of existing corporate technology stacks. That should tell you about the technology’s loss of popularity.

NDB Cluster

A simple NDB Cluster topology

 

An NDB Cluster is a distributed clustering solution that has been around for a long time. I personally started working with this technology back in 2008. An NDB Cluster has three types of nodes: SQL, management and data. A full HA cluster requires a minimum of four nodes.

An NDB Cluster is not a general purpose database due to its distributed nature. For suitable workloads, it is extraordinary good. For unsuitable workloads, it is miserable. A suitable workload for an NDB Cluster contains high concurrency, with a high rate of small primary key oriented transactions. Reaching one million trx/s on an NDB Cluster is nothing exceptional.

At the other end of the spectrum, a poor workload for an NDB Cluster is a single-threaded report query on a star-like schema. I have seen some extreme cases where just the network time of a reporting query amounted to more than 20 minutes.

Although NDB Clusters have improved, and are still improving, their usage has been pushed toward niche-type applications. Overall, the technology is losing ground and is now mostly used for Telcos and online gaming applications.

Jun
13
2017
--

Webinar Thursday, June 15, 2017: Demystifying Postgres Logical Replication

Postgres Logical Replication

Postgres Logical ReplicationJoin Percona’s Senior Technical Services Engineer Emanuel Calvo as he presents Demystifying Postgres Logical Replication on Thursday, June 15, 2017 at 7 am PDT / 10 am EDT (UTC-7).

The Postgres logical decoding feature was added in version 9.4, and thankfully it is continuously improving due to the vibrant open source community. In this webinar, we are going to walk through its concepts, usage and some of the new things coming up in future releases.

Logical decoding is one of the features under the BDR implementation, allowing bidirectional streams of data between Postgres instances. It also allows you to stream data outside Postgres into many other data systems.

Register for the webinar here.

Emanuel CalvoEmanuel Calvo, Percona Sr. Technical Services

Emanuel has worked with MySQL for more than eight years. He is originally from Argentina, but also lived and worked in Spain and other Latin American countries. He lectures and presents at universities and local events. Emanuel currently works as a Sr. Technical Services at Percona, focusing primarily on MySQL. His professional background includes experience at telecommunication companies, educational institutions and data warehousing solutions. In his career, he has worked as a developer, SysAdmin and DBA in companies like Pythian, Blackbird.io/PalominoDB, Siemens IT Solutions, Correo Argentino (Argentinian Postal Services), Globant-EA, SIU – Government Educational Institution and Aedgency among others. As a community member he has lectured and given talks in Argentina, Brazil, United States, Paraguay, Spain and Belgium as well as written several technical papers.

May
25
2017
--

What About ProxySQL and Mirroring?

ProxySQL and Mirroring

In this blog post, we’ll look at how ProxySQL and mirroring go together.

Overview

Let me be clear: I love ProxySQL, and I think it is a great component for expanding architecture flexibility and high availability. But not all that shines is gold! In this post, I want to correctly set some expectations, and avoid selling carbon for gold (carbon has it’s own uses, while gold has others).

First of all, we need to cover the basics of how ProxySQL manages traffic dispatch (I don’t want to call it mirroring, and I’ll explain further below).

ProxySQL receives a connection from the application, and through it we can have a simple SELECT or a more complex transaction. ProxySQL gets each query, passes them to the Query Processor, processes them, identifies if a query is mirrored, duplicates the whole MySQL session ProxySQL internal object and associates it to a mirror queue (which refer to a mirror threads pool). If the pool is free (has an available active slot in the concurrent active threads set) then the query is processed right away. If not, it will stay in the queue. If the queue is full, the query is lost.

Whatever is returned from the query goes to /dev/null, and as such no result set is passed back to the client.

The whole process is not free for a server. If you check the CPU utilization, you will see that the “mirroring” in ProxySQL actually doubles the CPU utilization. This means that the traffic on server A is impacted because of resource contention.

Summarizing, ProxySQL will:

  1. Send the query for execution in different order
  2. Completely ignore any transaction isolation
  3. Have different number of query executed on B with respect to A
  4. Add significant load on the server resources

This point, coupled with the expectations I mention in the reasoning at the end of this article, it is quite clear to me that at the moment we cannot consider ProxySQL as a valid mechanism to duplicate a consistent load from server A to server B.

Personally, I don’t think that the ProxySQL development team (Rene :D) should waste time on fixing this issue, as there are so many other things to cover and improve on in ProxySQL.

After working extensively with ProxySQL, and doing a deep QA on mirroring, I think that either we keep it as basic blind traffic dispatcher. Otherwise, a full re-conceptualization is required. But once we have clarified that, ProxySQL “traffic dispatch” (still don’t want to call it mirroring) remains a very interesting feature that can have useful applications – especially since it is easy to setup.

The following test results should help set the correct expectations.

The tests were simple: load data in a Percona XtraDB Cluster and use ProxySQL to replicate the load on a MySQL master-slave environment.

  • Machines for MySQL/Percona XtraDB Cluster: VM with CentOS 7, 4 CPU 3 GB RAM, attached storage
  • Machine for ProxySQL: VM CentOS 7, 8 CPU 8GB RAM

Why did I choose to give ProxySQL a higher volume of resources? I knew in advance I could need to play a bit with a couple of settings that required more memory and CPU cycles. I wanted to be sure I didn’t get any problems from ProxySQL in relation to CPU and memory.

The application that I was using to add load is a Java application I develop to perform my tests. The app is at https://github.com/Tusamarco/blogs/blob/master/stresstool_base_app.tar.gz, and the whole set I used to do the tests are here:  https://github.com/Tusamarco/blogs/tree/master/proxymirror.

I used four different tables:

+------------------+
| Tables_in_mirror |
+------------------+
| mirtabAUTOINC    |
| mirtabMID        |
| mirtabMIDPart    |
| mirtabMIDUUID    |

Ok so let start. Note that the meaningful tests are the ones below. For the whole set, refer to the whole set package. First setup ProxySQL:

First setup ProxySQL:

delete from mysql_servers where hostgroup_id in (500,501,700,701);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.5',500,3306,60000,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.5',501,3306,100,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.21',501,3306,20000,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.231',501,3306,20000,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.7',700,3306,1,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.7',701,3306,1,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.25',701,3306,1,400);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections) VALUES ('192.168.0.43',701,3306,1,400);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
delete from mysql_users where username='load_RW';
insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent) values ('load_RW','test',1,500,'test',1);
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;
delete from mysql_query_rules where rule_id=202;
insert into mysql_query_rules (rule_id,username,destination_hostgroup,mirror_hostgroup,active,retries,apply) values(202,'load_RW',500,700,1,3,1);
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;

Test 1

The first test is mainly a simple functional test during which I insert records using one single thread in Percona XtraDB Cluster and MySQL. No surprise, here I have 3000 loops and at the end of the test I have 3000 records on both platforms.

To have a baseline we can see that the ProxySQL CPU utilization is quite low:

ProxySQL and Mirroring

At the same time, the number of “questions” against Percona XtraDB Cluster and MySQL very similar:

Percona XtraDB Cluster

ProxySQL and Mirroring

MySQL

ProxySQL and Mirroring

The other two metrics we want to keep an eye on are Mirror_concurrency and Mirror_queue_length. These two refer respectively to mysql-mirror_max_concurrency and mysql-mirror_max_queue_length:

ProxySQL and Mirroring

These two new variables and metrics were introduced in ProxySQL 1.4.0, with the intent to control and manage the load ProxySQL generates internally related to the mirroring feature. In this case, you can see we have a max of three concurrent connections and zero queue entries (all good).

Now that we have a baseline, and that we know at functional level “it works,” let see what happens when increasing the load.

Test 2

The scope of the test was identifying how ProxySQL behaves with a standard configuration and increasing load. It comes up that as soon as ProxySQL has a little bit more load, it starts to lose some queries along the way.

Executing 3000 loops for 40 threads only results in 120,000 rows inserted in all the four tables in Percona XtraDB Cluster. But the table in the secondary (mirrored) platform only has a variable number or inserted rows, between 101,359 and 104,072. This demonstrates consistent loss of data.

After reviewing and comparing the connections running in Percona XtraDB Cluster and the secondary, we can see that (as expected) Percona XtraDB Cluster’s number of connections is scaling and serving the number of incoming requests, while the connections on the secondary are limited by the default value of mysql-mirror_max_concurrency=16.

ProxySQL and Mirroring

Is also interesting to note that the ProxySQL transaction process queue maintains its connection to the Secondary longer than the connection to Percona XtraDB Cluster.

ProxySQL and Mirroring

As we can see above, the queue is an evident bell curve that reaches 6K entries (which is quite below the mysql-mirror_max_queue_length limit (32K)). Yet queries were dropped by ProxySQL, which indicates the queue is not really enough to accommodate the pending work.

ProxySQL and Mirroring

CPU-wise, ProxySQL (as expected) take a few more cycles, but nothing crazy. The overhead for the simple mirroring queue processing can be seen when the main load stops around 12:47.

Another interesting graph to keep an eye on is the one describing the executed commands inside Percona XtraDB Cluster and the secondary:

Percona XtraDB Cluster

ProxySQL and Mirroring

Secondary

ProxySQL and Mirroring

As you can see, the traffic on the secondary was significantly less (669 on average, compared to Percona XtraDB Cluster’s 1.17K). Then it spikes when the main load on the Percona XtraDB Cluster node terminates. In short it is quite clear that ProxySQL is not able to scale following the traffic existing in Percona XtraDB Cluster, and actually loses a significant amount of data on the secondary.

Doubling the load in Test 3 shows the same behavior, with ProxySQL reaches its limit for traffic duplication.

But can this be optimized?

The answer is, of course, yes! This is what the mysql-mirror_max_concurrency is for, so let;’s see what happens if we increase the value from 16 to 100 (just to make it crazy high).

Test 4 (two app node writing)

The first thing that comes to attention is that both Percona XtraDB Cluster and secondary report the same number of rows in the tables (240,000). That is a good first win.

Second, note the number of running connections:

ProxySQL and Mirroring

The graphs are now are much closer, and the queue drops to just a few entries.

Commands executed in Percona XtraDB Cluster:

And commands executed in the secondary:

Average execution reports the same value, and very similar trends.

Finally, what was the CPU cost and effect?

Percona XtraDB Cluster and secondary CPU utilization:

     

As expected, some difference in the CPU usage distribution exists. But the trend is consistent between the two nodes, and the operations.

The ProxySQL CPU utilization is definitely higher than before:

But it’s absolutely manageable, and still reflects the initial distribution.

What about CRUD? So far I’ve only tested the insert operation, but what happen if we run a full CRUD set of tests?

Test 7 (CRUD)

First of all, let’s review the executed commands in Percona XtraDB Cluster:

And the secondary:

While in appearance we have very similar workloads, selects aside the behavior will significantly diverge. This is because in the secondary the different operations are not encapsulated by the transaction. They are executed as they are received. We can see a significant difference in update and delete operations between the two.

Also, the threads in the execution show a different picture between the two platforms:

Percona XtraDB Cluster

Secondary

It appears quite clear that Percona XtraDB Cluster is constantly running more threads and more connections. Nevertheless, both platforms process a similar total number of questions:

Percona XtraDB Cluster

Secondary

Both have an average or around 1.17K/second questions.

This is also another indication of how much the impact of concurrent operation on behavior, with no respect to the isolation or execution order. Below we can clearly see different behavior by reviewing the CPU utilization:

Percona XtraDB Cluster

Secondary

Conclusions

To close this article, I want to go back to the start. We cannot consider the mirror function in ProxySQL as a real mirroring, but more as traffic redirection (check here for more reasoning on mirroring from my side).

Using ProxySQL with this approach is still partially effective in testing the load and the effect it has on a secondary platform. As we know, data consistency is not guaranteed in this scenario, and Selects, Updates and Deletes are affected (given the different data-set and result-set they manage).

The server behaviors change between the original and mirror, if not in the quantity or the quality.

I am convinced that when we need a tool able to test our production load on a different or new platform, we would do better to look to something else. Possibly query Playback, recently reviewed and significantly patched by DropBox (https://github.com/Percona-Lab/query-playback).

In the end, ProxySQL is already a cool tool. If it doesn’t cover mirroring well, I can live with that. I am interested in having it working as it should (and it does in many other functionalities).

Acknowledgments

As usual, to Rene, who worked on fixing and introducing new functionalities associated with mirroring, like queue and concurrency control.

To the Percona team who developed Percona Monitoring and Management (PMM): all the graphs here (except 3) come from PMM (some of them I customized).

Apr
14
2017
--

TokuDB Hotbackup and Replication

TokuDB Hotbackup

TokuDB HotbackupTokuDB Hotbackup is a solution that allows you to do backups on the fly. It works as a library that intercepts certain system calls that duplicate data written to already copied parts of files, so that at the end of the backup process the copied files contain the same content as the original files. There are several blog posts describing how TokuDB Hot Backup works in details:

Replication often uses backups replication to create slaves. For this purpose, we need to know the last executed GTID or binary log position both for the slave and master configurations.

You can obtain the corresponding information with SHOW MASTER/SLAVE STATUS. But if we request this information after the backup, the corresponding binlog position or executed GTID can be changed and the backed-up data won’t correspond to the master/slave state.

It’s possible to lock tables with FLUSH TABLE WITH READ LOCK, or use more smart locks like LOCK TABLES/BINLOG FOR BACKUP. See this  blog post for details: Introducing backup locks in Percona Server.

But the larger question is when can we use the corresponding locks? We could lock binlog or some table before getting a backup, and then release after it’s done. But if the data being backed-up is big enough the backup itself can take time — and all that time the data is locked for changes.

There must be a more suitable solution. Let’s recall how TokuDB Hotbackup works: it uses a special library that intercepts some system calls and it has an API. This API includes such commands as “start capturing” and “stop capturing”. “Capturing” means that if some part of a file gets copied to a backup location, and this part is changed (by the MySQL server in our case), then these changes are also applied to the backup location. TokuDB Hotbackup plugin uses the API, starts capturing, make copies and stops capturing.

There is no need to lock the binlog during the copy process, because when the binlog is flushed the changes are copied to backup by a “capturing” mechanism. After everything has been copied, and with the capturing still running, this is a good time for LOCK BINLOG FOR BACKUP execution. After this statement is executed, the binlog is flushed, the flushed changes are captured and applied to a backup location, and any queries that could change the binlog position or executed GTID are blocked. After this we can stop capturing, get the binlog position or the last executed GTID and unlock the binlog.

This is how it’s implemented in TokuDB Hotbackup. After a backup is taken, there are “tokubackup_slave_info” and “tokubackup_binlog_info” files in the backup directory that contain the corresponding information about the slave and master (in human-readable format). You can use this information to start a new slave from the master or slave. 5.7 supports a multisource replication, and “tokubackup_slave_info” contains information for all replication channels.

There could be a happy ending here, but the devil is in the details. For example, there are several binary logging formats: RBR, SBR, MBR (for details see https://dev.mysql.com/doc/refman/5.7/en/replication-formats.html). In the case of SBR or MBR, a binary log event can contain statements that produce temporary tables on the slave side, and the result of further statements can depend on the content of the temporary tables.

Usually, temporary tables are created in a separate directory that is out of a MySQL data directory, and aren’t backed up. That is why if we create a backup when temporary tables produced by binary log events exist, and then try to restore the backup, the temporary tables aren’t restored (as they were not backed up). If we try to restore the replication from the point saved during the backup, and after this point the binary log contains events that use the content of non-restored temporary tables, the data will be inconsistent.

That is why the so-called “safe-slave” mode is implemented in TokuDB Hotbackup. The same mode is implemented in Percona XtraBackup, and its name was also inherited from Percona XtraBackup. In this mode, along with LOCK BINLOG FOR BACKUP statement execution, the slave SQL thread is stopped. After this it is checked to see if temporary tables produced by slave SQL thread exist or not. If yes, then the slave SQL thread is restarted until there are no temporary tables (or the certain timeout is reached).

For those purposes, we introduced the following new system variables:

  • --tokudb-backup-safe-slave

     – turn on/off safe-slave mode

  • --tokudb-backup-safe-slave-timeout

     – maximum amount of time in seconds to wait until temp tables disappear

Note for the case of multisource replication, the simplified version of this feature is implemented. So if there are several replication channels, and for some of them the SQL thread is started while for others the thread is stopped, the TokuDB Hotbackup does not restore the channels state. It just restarts the SQL threads for all channels, and after the backup SQL threads for all channels will be started.

You can’t use this option for group-replication.

This could be the happy ending, but… well, you know, it’s complicated. Apart from replication features in the MySQL server, there are a lot of engine-specific features. One of them is how frequently the recovery log is synchronized.

For TokuDB, there are two system variables you can use to tune recovery log synchronization frequency:

tokudb_commit_sync

 and

tokudb_fsync_log_period

. By playing with these variables, you can establish some tradeoff between durability and performance. Here is a good blogpost about these parameters: High Insertion Rates into a TokuDB Table with Durable Transactions.

But let’s imagine we have a certain recovery log synchronization period, and the backup finished somewhere in the middle of this period. It’s OK if we use the backup just to restore some data, because the data is restored using recovery and rollback logs during the server’s start.

But if we also want to start the slave using the information stored in the “tokubackup_slave_info” and “tokubackup_binlog_info” files, there might be a situation where after the recovery stage the data in the database is behind the replication position stored during backup. See this bug description https://bugs.launchpad.net/percona-server/+bug/1533174 for details. Now, when capturing is still active, TokuDB forcibly synchronizes the recovery log when the copy process is finished.

So this could be the happy ending but… Well, actually I hope this is a happy ending! At least we’ve implemented the general functionality for creating slaves using TokuDB Hotbackup. However, as “ the show must go on”, or “the spice must flow” etc., we are happy to discuss any concerns, ideas, proposals and bug reports.

Thanks to all who took part in development: George O. Lorch III, Sergey Glushchenko, Shahriyar Rzaev, Alexey Kopytov (as the author of “LOCK BINLOG FOR BACKUP” feature) and other people from Percona team and community.

Feb
24
2017
--

Quest for Better Replication in MySQL: Galera vs. Group Replication

Group Replication

Group ReplicationUPDATE: Some of the language in the original post was considered overly-critical of Oracle by some community members. This was not my intent, and I’ve modified the language to be less so. I’ve also changed term “synchronous” (which the use of is inaccurate and misleading) to “virtually synchronous.” This term is more accurate and already used by both technologies’ founders, and should be less misleading.

I also wanted to thank Jean-François Gagné for pointing out the incorrect sentence about multi-threaded slaves in Group Replication, which I also corrected accordingly.

In today’s blog post, I will briefly compare two major virtually synchronous replication technologies available today for MySQL.

More Than Asynchronous Replication

Thanks to the Galera plugin, founded by the Codership team, we’ve had the choice between asynchronous and virtually synchronous replication in the MySQL ecosystem for quite a few years already. Moreover, we can choose between at least three software providers: Codership, MariaDB and Percona, each with its own Galera implementation.

The situation recently became much more interesting when MySQL Group Replication went into GA (stable) stage in December 2016.

Oracle, the upstream MySQL provider, introduced its own replication implementation that is very similar in concept. Unlike the others mentioned above, it isn’t based on Galera. Group Replication was built from the ground up as a new solution. MySQL Group Replication shares many very similar concepts to Galera. This post doesn’t cover MySQL Cluster, another and fully-synchronous solution, that existed much earlier then Galera — it is a much different solution for different use cases.

In this post, I will point out a couple of interesting differences between Group Replication and Galera, which hopefully will be helpful to those considering switching from one to another (or if they are planning to test them).

This is certainly not a full list of all the differences, but rather things I found interesting during my explorations.

It is also important to know that Group Replication has evolved a lot before it went GA (its whole cluster layer was replaced). I won’t mention how things looked before the GA stage, and will just concentrate on latest available 5.7.17 version. I will not spend too much time on how Galera implementations looked in the past, and will use Percona XtraDB Cluster 5.7 as a reference.

Multi-Master vs. Master-Slave

Galera has always been multi-master by default, so it does not matter to which node you write. Many users use a single writer due to workload specifics and multi-master limitations, but Galera has no single master mode per se.

Group Replication, on the other hand, promotes just one member as primary (master) by default, and other members are put into read-only mode automatically. This is what happens if we try to change data on non-master node:

mysql> truncate test.t1;
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

To change from single primary mode to multi-primary (multi-master), you have to start group replication with the 

group_replication_single_primary_mode

variable disabled.
Another interesting fact is you do not have any influence on which cluster member will be the master in single primary mode: the cluster auto-elects it. You can only check it with a query:

mysql> SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME like 'group_replication%';
+----------------------------------+--------------------------------------+
| VARIABLE_NAME                    | VARIABLE_VALUE                       |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |
+----------------------------------+--------------------------------------+
1 row in set (0.00 sec)

Or just:

mysql> show status like 'group%';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 329333cd-d6d9-11e6-bdd2-0242ac130002 |
+----------------------------------+--------------------------------------+
1 row in set (0.01 sec)

To show the hostname instead of UUID, here:

mysql> select member_host as "primary master" from performance_schema.global_status join performance_schema.replication_group_members where variable_name='group_replication_primary_member' and member_id=variable_value;
+----------------+
| primary master |
+----------------+
| f18ff539956d   |
+----------------+
1 row in set (0.00 sec)

Replication: Majority vs. All

Galera delivers write transactions synchronously to ALL nodes in the cluster. (Later, applying happens asynchronously in both technologies.) However, Group Replication needs just a majority of the nodes confirming the transaction. This means a transaction commit on the writer succeeds and returns to the client even if a minority of nodes still have not received it.

In the example of a three-node cluster, if one node crashes or loses the network connection, the two others continue to accept writes (or just the primary node in Single-Primary mode) even before a faulty node is removed from the cluster.

If the separated node is the primary one, it denies writes due to the lack of a quorum (it will report the error

ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.

). If one of the nodes receives a quorum, it will be elected to primary after the faulty node is removed from the cluster, and will then accept writes.

With that said, the “majority” rule in Group Replication means that there isn’t a guarantee that you won’t lose any data if the majority nodes are lost. There is a chance these could apply some transactions that aren’t delivered to the minority at the moment they crash.

In Galera, a single node network interruption makes the others wait for it, and pending writes can be committed once either the connection is restored or the faulty node removed from cluster after the timeout. So the chance of losing data in a similar scenario is lower, as transactions always reach all nodes. Data can be lost in Percona XtraDB Cluster only in a really bad luck scenario: a network split happens, the remaining majority of nodes form a quorum, the cluster reconfigures and allows new writes, and then shortly after the majority part is damaged.

Schema Requirements

For both technologies, one of the requirements is that all tables must be InnoDB and have a primary key. This requirement is now enforced by default in both Group Replication and Percona XtraDB Cluster 5.7. Let’s look at the differences.

Percona XtraDB Cluster:

mysql> create table nopk (a char(10));
Query OK, 0 rows affected (0.08 sec)
mysql> insert into nopk values ("aaa");
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.nopk) without an explicit primary key with pxc_strict_mode = ENFORCING or MASTER
mysql> create table m1 (id int primary key) engine=myisam;
Query OK, 0 rows affected (0.02 sec)
mysql> insert into m1 values(1);
ERROR 1105 (HY000): Percona-XtraDB-Cluster prohibits use of DML command on a table (test.m1) that resides in non-transactional storage engine with pxc_strict_mode = ENFORCING or MASTER
mysql> set global pxc_strict_mode=0;
Query OK, 0 rows affected (0.00 sec)
mysql> insert into nopk values ("aaa");
Query OK, 1 row affected (0.00 sec)
mysql> insert into m1 values(1);
Query OK, 1 row affected (0.00 sec)

Before Percona XtraDB Cluster 5.7 (or in other Galera implementations), there were no such enforced restrictions. Users unaware of these requirements often ended up with problems.

Group Replication:

mysql> create table nopk (a char(10));
Query OK, 0 rows affected (0.04 sec)
mysql> insert into nopk values ("aaa");
ERROR 3098 (HY000): The table does not comply with the requirements by an external plugin.
2017-01-15T22:48:25.241119Z 139 [ERROR] Plugin group_replication reported: 'Table nopk does not have any PRIMARY KEY. This is not compatible with Group Replication'
mysql> create table m1 (id int primary key) engine=myisam;
ERROR 3161 (HY000): Storage engine MyISAM is disabled (Table creation is disallowed).

I am not aware of any way to disable these restrictions in Group Replication.

GTID

Galera has it’s own Global Transaction ID, which has existed since MySQL 5.5, and is independent from MySQL’s GTID feature introduced in MySQL 5.6. If MySQL’s GTID is enabled on a Galera-based cluster, both numerations exist with their own sequences and UUIDs.

Group Replication is based on a native MySQL GTID feature, and relies on it. Interestingly, a separate sequence block range (initially 1M) is pre-assigned for each cluster member.

WAN Support

The MySQL Group Replication documentation isn’t very optimistic on WAN support, claiming that both “Low latency, high bandwidth network connections are a requirement” and “Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.” These statements are found here and here. However there is network traffic optimization: Message Compression.

I don’t see group communication level tunings available yet, as we find in the Galera evs.* series of

wsrep_provider_options

.

Galera founders actually encourage trying it in geo-distributed environments, and some WAN-dedicated settings are available (the most important being WAN segments).

But both technologies need a reliable network for good performance.

State Transfers

Galera has two types of state transfers that allow syncing data to nodes when needed: incremental (IST) and full (SST). Incremental is used when a node has been out of a cluster for some time, and once it rejoins the other nodes has the missing write sets still in Galera cache. Full SST is helpful if incremental is not possible, especially when a new node is added to the cluster. SST automatically provisions the node with fresh data taken as a snapshot from one of the running nodes (donor). The most common SST method is using Percona XtraBackup, which takes a fast and non-blocking binary data snapshot (hot backup).

In Group Replication, state transfers are fully based on binary logs with GTID positions. If there is no donor with all of the binary logs (included the ones for new nodes), a DBA has to first provision the new node with initial data snapshot. Otherwise, the joiner will fail with a very familiar error:

2017-01-16T23:01:40.517372Z 50 [ERROR] Slave I/O for channel 'group_replication_recovery': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236

The official documentation mentions that provisioning the node before adding it to the cluster may speed up joining (the recovery stage). Another difference is that in the case of state transfer failure, a Galera joiner will abort after the first try, and will shutdown its mysqld instance. The Group Replication joiner will then fall-back to another donor in an attempt to succeed. Here I found something slightly annoying: if no donor can satisfy joiner demands, it will still keep trying the same donors over and over, for a fixed number of attempts:

[root@cd81c1dadb18 /]# grep 'Attempt' /var/log/mysqld.log |tail
2017-01-16T22:57:38.329541Z 12 [Note] Plugin group_replication reported: 'Establishing group recovery connection with a possible donor. Attempt 1/10'
2017-01-16T22:57:38.539984Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 2/10'
2017-01-16T22:57:38.806862Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 3/10'
2017-01-16T22:58:39.024568Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 4/10'
2017-01-16T22:58:39.249039Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 5/10'
2017-01-16T22:59:39.503086Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 6/10'
2017-01-16T22:59:39.736605Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 7/10'
2017-01-16T23:00:39.981073Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 8/10'
2017-01-16T23:00:40.176729Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 9/10'
2017-01-16T23:01:40.404785Z 12 [Note] Plugin group_replication reported: 'Retrying group recovery connection with another donor. Attempt 10/10'

After the last try, even though it fails, mysqld keeps running and allows client connections…

Auto Increment Settings

Galera adjusts the auto_increment_increment and auto_increment_offset values according to the number of members in a cluster. So, for a 3-node cluster,

auto_increment_increment

  will be “3” and

auto_increment_offset

  from “1” to “3” (depending on the node). If a number of nodes change later, these are updated immediately. This feature can be disabled using the 

wsrep_auto_increment_control

 setting. If needed, these settings can be set manually.

Interestingly, in Group Replication the

auto_increment_increment

 seems to be fixed at 7, and only

auto_increment_offset

 is set differently on each node. This is the case even in the default Single-Primary mode! this seems like a waste of available IDs, so make sure that you adjust the

group_replication_auto_increment_increment

 setting to a saner number before you start using Group Replication in production.

Multi-Threaded Slave Side Applying

Galera developed its own multi-threaded slave feature, even in 5.5 versions, for workloads that include tables in the same database. It is controlled with the  wsrep_slave_threads variable. Group Replication uses a feature introduced in MySQL 5.7, where the number of applier threads is controlled with slave_parallel_workers. Galera will do multi-threaded replication based on potential conflicts of changed/locked rows. Group Replication parallelism is based on an improved LOGICAL_CLOCK scheduler, which uses information from writesets dependencies. This can allow it to achieve much better results than in normal asynchronous replication MTS mode. More details can be found here: http://mysqlhighavailability.com/zooming-in-on-group-replication-performance/

Flow Control

Both technologies use a technique to throttle writes when nodes are slow in applying them. Interestingly, the default size of the allowed applier queue in both is much different:

Moreover, Group Replication provides separate certifier queue size, also eligible for the Flow Control trigger:

 group_replication_flow_control_certifier_threshold

. One thing I found difficult, is checking the actual applier queue size, as the only exposed one via performance_schema.replication_group_member_stats is the

Count_Transactions_in_queue

 (which only shows the certifier queue).

Network Hiccup/Partition Handling

In Galera, when the network connection between nodes is lost, those who still have a quorum will form a new cluster view. Those who lost a quorum keep trying to re-connect to the primary component. Once the connection is restored, separated nodes will sync back using IST and rejoin the cluster automatically.

This doesn’t seem to be the case for Group Replication. Separated nodes that lose the quorum will be expelled from the cluster, and won’t join back automatically once the network connection is restored. In its error log we can see:

2017-01-17T11:12:18.562305Z 0 [ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'
2017-01-17T11:12:18.631225Z 0 [Note] Plugin group_replication reported: 'getstart group_id ce427319'
2017-01-17T11:12:21.735374Z 0 [Note] Plugin group_replication reported: 'state 4330 action xa_terminate'
2017-01-17T11:12:21.735519Z 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-01-17T11:12:21.735527Z 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'
2017-01-17T11:12:21.735553Z 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-01-17T11:12:21.735558Z 0 [Note] Plugin group_replication reported: 'new state x_start'

Its status changes to:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 329333cd-d6d9-11e6-bdd2-0242ac130002 | f18ff539956d | 3306 | ERROR |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)

It seems the only way to bring it back into the cluster is to manually restart Group Replication:

mysql> START GROUP_REPLICATION;
ERROR 3093 (HY000): The START GROUP_REPLICATION command failed since the group is already running.
mysql> STOP GROUP_REPLICATION;
Query OK, 0 rows affected (5.00 sec)
mysql> START GROUP_REPLICATION;
Query OK, 0 rows affected (1.96 sec)
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 | 3306 | ONLINE |
| group_replication_applier | 329333cd-d6d9-11e6-bdd2-0242ac130002 | f18ff539956d | 3306 | ONLINE |
| group_replication_applier | ae148d90-d6da-11e6-897e-0242ac130003 | 0af7a73f4d6b | 3306 | ONLINE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec

Note that in the above output, after the network failure, Group Replication did not stop. It waits in an error state. Moreover, in Group Replication a partitioned node keeps serving dirty reads as if nothing happened (for non-super users):

cd81c1dadb18 {test} ((none)) > SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 | 3306 | ERROR |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)
cd81c1dadb18 {test} ((none)) > select * from test1.t1;
+----+-------+
| id | a |
+----+-------+
| 1 | dasda |
| 3 | dasda |
+----+-------+
2 rows in set (0.00 sec)
cd81c1dadb18 {test} ((none)) > show grants;
+-------------------------------------------------------------------------------+
| Grants for test@% |
+-------------------------------------------------------------------------------+
| GRANT SELECT, INSERT, UPDATE, DELETE, REPLICATION CLIENT ON *.* TO 'test'@'%' |
+-------------------------------------------------------------------------------+
1 row in set (0.00 sec)

A privileged user can disable

super_read_only

, but then it won’t be able to write:

cd81c1dadb18 {root} ((none)) > insert into test1.t1 set a="split brain";
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.
cd81c1dadb18 {root} ((none)) > select * from test1.t1;
+----+-------+
| id | a |
+----+-------+
| 1 | dasda |
| 3 | dasda |
+----+-------+
2 rows in set (0.00 sec)

I found an interesting thing here, which I consider to be a bug. In this case, a partitioned node can actually perform DDL, despite the error:

cd81c1dadb18 {root} ((none)) > show tables in test1;
+-----------------+
| Tables_in_test1 |
+-----------------+
| nopk |
| t1 |
+-----------------+
2 rows in set (0.01 sec)
cd81c1dadb18 {root} ((none)) > create table test1.split_brain (id int primary key);
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.
cd81c1dadb18 {root} ((none)) > show tables in test1;
+-----------------+
| Tables_in_test1 |
+-----------------+
| nopk |
| split_brain |
| t1 |
+-----------------+
3 rows in set (0.00 sec)

In a Galera-based cluster, you are automatically protected from that, and a partitioned node refuses to allow both reads and writes. It throws an error: 

ERROR 1047 (08S01): WSREP has not yet prepared node for application use

. You can force dirty reads using the 

wsrep_dirty_reads

 variable.

There many more subtle (and less subtle) differences between these technologies – but this blog post is long enough already. Maybe next time ?

Article with Similar Subject

http://lefred.be/content/group-replication-vs-galera/

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com