Dec
02
2016
--

Business Continuity and MySQL Backups

MySQL Backups

MySQL BackupsThis blog post discusses the business continuity plan around MySQL backups, and how organizations should think about them.

During the years I’ve worked in IT, I’ve learned that backups sometimes are a conceptual subject in organizations. Many companies have them, but don’t document the associated business continuation plan for them. I experienced this the hard way many many years ago, somewhere around when MySQL 5.0 was still widely used.

In most organizations, there are a couple of business continuity subjects that should be described internally. For example, what is the recovery time objective and what is the recovery point objective. Let’s go a bit deeper into both concepts:

Recovery Point Objective:

A recovery point objective describes the utter limit of time data can be lost during a major incident. For example, recovery while a massive data center failure happens. One of the questions you should ask prior to these situations is what is a tolerable time point for lost information? 

If you have a recovery point objective of over a day, your daily backup routines might cover this. However, if you have a recovery point objective that is more stringent, you might be forced to have some additional tools like binary streaming or incremental backup.

Recovery Time Objective

This second term and concept is also essential in building a business continuity plan. Your environment has to remain active to generate traffic and, potentially, revenue.

What are the requirements promised to your customers? Are there any SLA’s described with the customer, or is it best effort? If it’s best effort, what would be the tipping point for your users to start using an alternative service from your competitor. These are all factors to consider while determining your RTO.

In Short

If the recovery point objective and recovery time objective are stringent, this might mean additional costs might be required when buying hardware, or perhaps having a secondary data center becomes mandatory. However, it’s a cost/value discussion: what makes your company lose revenue, and what is acceptable during a crisis?

Based on your business continuity requirements, you can potentially build your DR plans. Make sure your business continuity requirements builds the DR plan, and not vice versa.

What tools do you have at your disposal to create sensible MySQL backups?

Logical backups

MySQLdump. Remember mysqldump, the original tool included in MySQL? The good thing about mysqldump is that you can actually read and even edit the output of the backup before potentially restoring data, which can prove interesting during development work.

mysqldump’s biggest negative is that it’s not scalable, nor fast for backing up large amounts of data. Additionally, restoring data is even slower as you must replay the complete dataset on your new MySQL database servers (rebuild indexes, large IO, etc.).

mysqldump’s advantages include the convenience and flexibility of viewing or even editing the output before restoring. It gives you the ability to clone databases for development, and produce slight variations of an existing database for testing.

mydumper. This tool is comparable to mysqldump, however it does it in parallel, which provides significant benefits in backup time and restoration time.

Binary backups

Binary backups refers to copies made of the entire MySQL dataset. Binary backups are typically faster compared to logical backups, especially  on larger datasets. Several tools come to mind in these cases.

Percona XtrabackupAn opensource binary backup solution for InnoDB. The good thing about XtraBackup is that it is non-locking when using MySQL with the InnoDB storage engine. 

MySQL Enterprise BackupAn InnoDB hot backup solution that is included in the subscription level of MySQL enterprise. 

These tools can offer you incremental and daily backups, however they still don’t bring you point-in-time recovery. If your recovery point objective is very limited, it might mean that that you require to externally store (backup) your binary logs and replay them on your restored database. Keep in mind that this factor potentially impacts your recovery time objective.

Delayed Slaves

This concept is not a backup, but this technology might help you to recover your database and limit the recovery time significantly.

Conclusion

We’ve discussed having a business continuity requirement list, and some potential tools that might assist you in covering them (at least on the MySQL level). One of the last items that is important is actual testing. The number of companies that require data recovery and then notice that their backups are corrupted are way too numerous.

Make sure your organization tests their backups regularly. Are you sure they work properly? Make sure that you perform regression tests for new code – for example on a restoration set of the backups.

If you make sure you trust your backups, you might sleep better at night!   ;-). 

Dec
02
2016
--

Make MySQL 8.0 Better Through Better Benchmarking

MySQL 8.0

MySQL 8.0This blog post discusses how better MySQL 8.0 benchmarks can improve MySQL in general.

Like many in MySQL community, I’m very excited about what MySQL 8.0 offers. There are a lot of great features and architecture improvements. Also like many in the MySQL community, I would like to see MySQL 8.0 perform better. Better performance is what we always want (and expect) from new database software releases.

Rarely do performance improvements happen by accident – they require running benchmarks, finding bottlenecks and eliminating them. This is the area where I think things could use improvement.

If you come to the MySQL Keynote at Oracle OpenWorld, or if you go to MySQL Benchmarks Page, you find a very limited set of benchmarks. They mostly focus around sysbench performance, with large numbers of connections and large numbers of cores. I’m not convinced this effort is the best use of our time these days.

Don’t get me wrong: as one of the original designers of sysbench, it is a great and simple tool that helps spot many bottlenecks. I still use it to find performance issues. But it is only one tool, which is by no means provides full coverage of real-world MySQL workloads.

I agree with Mark Callaghan (see discussion here): we need to run more benchmarks using a wider set of circumstances, to ensure there are no regressions in new releases. This will help move MySQL performance forward for real users.      

Here are some specific ideas on how I think we could benchmark MySQL 8.0 better:

  1. Focus on production-recommended settings. Way too often we see benchmarks run with the doublewrite buffer disabled, InnoDB checksum disabled and no binary log (like in this benchmark). While they produce some excitingly high numbers, they have little practical value for real workloads. At very least I would very much like to see separate numbers for the “street legal car,” versus one designed to set a record on the salt flats.
  2. Go beyond sysbench. Sysbench focuses on PK-only based access for very simple tables, and does not even do JOINs as part of its workload. I would like to see more benchmarks that have tables with many indexes, using secondary key lookups and joins, involving rows with many fields, and medium and large size blobs that are common bottlenecks. We also need more database features covered. Are foreign keys or triggers getting faster or slower? What about stored procedure execution performance? I would love to see these get covered. Mark Callaghan suggests LinkBench, which I think is a fine benchmark to add, but it shouldn’t be the only one.
  3. Do more with sysbench. Sysbench could get more done and cover more important workloads. Workloads with data fitting in memory and not fitting in memory should be shown separately. Testing performance with large numbers of tables is also very important – many MySQL installations for SaaS applications run tens of thousands of tables (sometimes going into millions). I would also suggest running more injection benchmarks with sysbench, as they are more representative of the real world.
  4. Look at latency, not just throughput. The benchmarks results we commonly see are focused on the throughput over a long period of time, without looking at the latency and how performance changes over time. Stalls and performance dips are well known in the MySQL space – especially the famous InnoDB checkpointing woes (though this issue has gotten a lot better) There are other cases and circumstances where stalls and slowdowns can happen.  
  5. Measure resource consumption. I very much like how Mark Callaghan shows the CPU usage and IO usage per transaction/operation, so we can get a better idea of efficiency.
  6. Concurrency. Recently, the focus has been on very high concurrency in terms of connections and active connections, typically on very big iron (using as many as 72 cores). And as much as this is important to “future-proofing” MySQL as we get more and more cores per socket every year, it should not be the only focus. In fact, it is extremely rare for me to see sustained loads of more than 20-40  “threads running” for well-configured systems. With modern solutions like ProxySQL, you can restrict concurrency to the most optimal levels for your server through multiplexing. Not to mention the thread pool, which is available in MySQL Enterprise, Percona Server and MariaDB. I would like to see a much more focused benchmark at medium-to-low concurrency. The fact that single thread performance has gotten slower in every Major MySQL version is not a good thing. As MySQL currently runs a single query in a single thread, it impacts query latencies in many real-world situations.
  7. Virtualization. We need more benchmarks in virtualized environments, as virtualization and the cloud are where most workloads are these days (by number). Yes, big iron and bare metal are where you get the best performance, but it’s not where most users are running MySQL. Whenever you are looking at full blown virtualization or containers, the performance profile can be substantially different from bare metal. Virtualized instances often have smaller CPU cores – getting the best performance with 8-16 virtual cores might be a more relevant data set for many than the performance with 100+ cores.
  8. SSL and encryption. MySQL 5.7 was all about security. We’re supposed to be able to enable SSL easily, but was any work done on making it cheap? The benchmark Ernie Souhrada did a few years back showed a pretty high overhead (in MySQL 5.6). We need more focus on SSL performance, and getting it would allow more people to run MySQL with SSL. I would also love to see more benchmarks with encryption enabled, to understand better how much it costs to have your data encrypted “at rest,” and in what cases.
  9. Protocol X and MySQL Doc Store. These were added after MySQL 5.7 GA, so it would be unfair to complain about the lack of benchmarks comparing the performance of those versus previous versions. But if Protocol X is the future, some benchmarks are in order. It would be great to have official numbers on the amount of overhead using MySQL Doc Store has compared to SQL (especially since we know that queries are converted to SQL for execution).
  10. Replication benchmarks. There are a lot of great replication features in newer MySQL versions: statement/row/mixed, GTID or no GTID, chose multiple formats for row events, enable various forms of semi-sync replication, two ways of parallel replication and multi-source replication. Additionally, MySQL group replication is on the way. There seems to be very little comprehensive benchmarks for these features, however. We really need to understand how they scale and perform under various workloads.
  11. Mixed workloads.  Perhaps one of the biggest differences between benchmarks and real production environments is that in benchmarks the same workload often is used over and over, while in the real world there is a mix of “application workloads.” The real world also has additional tasks such as backups, reporting or running “online” ALTER TABLE operations. Practical performance is performance you can count on while also serving these types of background activities. Sometimes you can get a big surprise from the severity of impact from such background activities.
  12. Compression benchmarks. There have been some InnoDB compression benchmarks (both for new and old methods), but they are a completely separate set of benchmarks that are hard to put in context with everything else. For example, do they scale well with high numbers of connections and large numbers of cores?
  13. Long-running benchmarks. A lot of the benchmarks run are rather short. Many of the things that affect performance take time to accumulate: memory fragmentation on the process (or OS Kernel) side, disk fragmentation and database fragmentation. For a database that is expected to run many months without restarting, it would be great to see some benchmark runs that last several days/weeks to check long term stability, or if there is a regression or resource leak.
  14. Complex queries. While MySQL is not an analytical database, it would still be possible to run complex queries with JOINs while the MySQL optimizer team provides constant improvements to the optimizer. It would be quite valuable to see how optimizer improvements affect query execution. We want to see how these improvements affect scalability with hardware and concurrency as well..

These are just some of ideas on what could be done. Of course, there are only so many things the performance engineering team can focus at the time: one can’t boil the ocean! My main suggestion is this: we have done enough deep optimizing of primary key lookups with sysbench on high concurrency and monster hardware, and it’s time to go wider. This ensures that MySQL doesn’t falter with poor performance on commonly run workloads. Benchmarks like these have much more practical value than beating one million primary key selects a second on a single server.

Dec
02
2016
--

AWS shoots for total cloud domination

screen-shot-2016-11-30-at-11-02-45-am AWS held its annual re:Invent customer conference this week — and as it revealed one new service after another, one thing became clear: the company with a marketshare lead that is by Gartner’s estimate 10 times bigger than its 14 closest competitors combined, has no plans to slow down or rest on its laurels.
If that market lead isn’t enough to shake up the competition… Read More

Dec
01
2016
--

Managing Replication with Percona XtraDB Cluster

Managing Replication

Managing ReplicationThis blog post discusses managing replication with Percona XtraDB Cluster.

Recently a customer asked me to setup replication between two distinct Percona XtraDB Clusters located in geographically separate data centers. The customer goal was to use one of the clusters only in case of disaster recovery. They tried extending the cluster, but because of the WAN latency impact on their writes and the requirements of a node in a third data center for quorum, they walked away from that setup. Since they were not concerned about losing a few transactions in case of a major disaster, they were OK with regular MySQL replication using GTIDs.

Easy enough right! Both clusters are cloud-based, of course, and the provider can stop/restart any node on short notice. This setup caused some concern for the customer around how to handle replication. Since they don’t have dedicated personnel to monitor replication, or at least handle alerts, they asked if we could find a way to automate the process. So, here we go!

We all try to solve the problems with the tools we know. In my case, I like Pacemaker a lot. So using Pacemaker was my first thought. In a cloud environment, a Pacemaker setup is not easy (wouldn’t that be a cluster in a cluster… a bit heavy). But wait! Percona XtraDB Cluster with Galera replication is already handling quorum, and it provides a means of exchanging information between the nodes. Why not use that?

We can detect quorum status the same way the clustercheck scripts do it. To exchange messages, why don’t we simply write to a table. The Galera replication will update the other nodes. I went on and wrote a bash script that is called by cron every minute. The script monitors the node state and the content of the table. If all is right, it updates the table to report its presence (and if it is acting as a slave or not). The script validates the presence of a slave in the cluster. If no reporting slave is found, the script proceeds to the “election” of a new slave, based on the

wsrep_local_index

 value. Basically, the script is a big bunch of “if” statements. The script is here, and the basic documentation on how to set it up here

Of course, if it works for one cluster, it can work for two. I have configured my customer’s two Percona XtraDB Clusters in a master-to-master relationship using this script. I ran through a bunch of failure scenario cases. The script survived all of them! But of course, this is new. If you are going to implement this solution, run your own set of tests! If you find any problem, file an issue on GitHub. I’ll be happy to fix it!

Dec
01
2016
--

Amazon Glue solves sticky data prep problem in cloud

img_20161201_100233 Amazon announced Amazon Glue today at the re:Invent conference in Las Vegas. The tool has been designed to help developers process data, whether the source is in the cloud or on prem. This is sometimes known as extract, transform, load (ETL) of data, and it’s generally considered one of the toughest parts of analytics. As Amazon CTO Werner Vogels explained today on stage at the… Read More

Dec
01
2016
--

AWS lets developers execute Lambda functions on edge locations with Lambda@Edge

screen-shot-2016-12-01-at-10-34-23-am Today, Amazon said it was rolling out a tool called Lambda@Edge, that allows developers to execute Lambda functions at edge locations on a content delivery network, cutting down the time needed to make a process. With Lambda@Edge, developers can do processing at the edge locations without having to go back to the original source. These functions are able to inspect HTTP requests and take… Read More

Dec
01
2016
--

AWS Batch simplifies batch computing in the cloud

img_20161201_101449 Amazon’s new AWS Batch allows engineers to execute a series of jobs automatically, in the cloud. The tool lets you run apps and container images on whatever EC2 instances are required to accomplish a given task.
Amazon recognized that many of its customers were bootstrapping their own batch computing systems. Users were stringing together EC2 instances, containers, notifications, and… Read More

Dec
01
2016
--

YayPay wants to help big companies deal with accounts receivable

stocksnap_gidmt5tduf YayPay is focusing on an interesting yet underserved niche — collecting payments from your customers. The startup is launching the second version of its platform today and now focuses on big companies with accounts receivable teams.
The company first launched on stage at Disrupt London exactly one year ago. At the time, the company wanted to help small and medium companies get paid on time. Read More

Dec
01
2016
--

Database Daily Ops Series: GTID Replication and Binary Logs Purge

GTID replication

GTID replicationThis blog continues the ongoing series on daily operations and GTID replication.

In this blog, I’m going to investigate why the error below has been appearing in a special environment I’ve been working with on the last few days:

Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log:
'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the
master has purged binary logs containing GTIDs that the slave requires.'

The error provides the right message, and explains what is going on. But sometimes, it can be a bit tricky to solve this issue: you need additional information discovered after some tests and readings. We try and keep Managed Services scripted, in the sense that our advice and best practices are repeatable and consistent. However, some additional features and practices can be added depending on the customer situation.

Some time ago one of our customer’s database servers presented the above message. At that point, we could see the binary log files in a compressed form on master (gzipped). Of course, MySQL can’t identify a compressed file with a .gz extension as a binary log. We uncompressed the file, but replication presented the same problem – even after uncompressing the file and making sure the UUID of the current master and the TRX_ID were there. Obviously, I needed to go and investigate the problem to see what was going on.

After some reading, I re-read the below:

When the server starts, the global value of gtid_purged, which was called before as gtid_lost, is initialized to the set of GTIDs contained by the Previous_gtid_log_event of the oldest binary log. When a binary log is purged, gtid_purged is re-read from the binary log that has now become the oldest one.

=> https://dev.mysql.com/doc/refman/5.6/en/replication-options-gtids.html#sysvar_gtid_purged

That makes me think: if something is compressing binlogs on the master without purging them as expected by the GTID mechanism, it’s not going to be able to re-read existing GTIDs on disk. When the slave replication threads restarts, or the DBA issues commands like reset slave and reset master (to clean out the increased GTID sets on Executed_Gtid_Set from the SHOW SLAVE STATUS command, for example), this error can occur. But if I compress the file:

  • Will the slave get lost and not find all the needed GTIDs on the master after a reset slave/reset master?
  • If I purge the logs correctly, using PURGE BINARY LOGS, will the slave be OK when restarting replication threads?

Test 1: Compressing the oldest binary log file on master, restarting slave threads

I would like to test this very methodically. We’ll create one GTID per binary log, and then I will compress the oldest binary log file in order to make it unavailable for the slaves. I’m working with three virtual machines, one master and two slaves. On the second slave, I’m going to run the following sequence: stop slave, reset slave, reset master, start slave, and then, check the results. Let’s see what happens.

On master (tool01):

tool01 [(none)]:> show master logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| mysqld-bin.000001 |       341 |
| mysqld-bin.000002 |       381 |
| mysqld-bin.000003 |       333 |
+-------------------+-----------+
3 rows in set (0.00 sec)
tool01 [(none)]:> show binlog events in 'mysqld-bin.000001';
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| mysqld-bin.000001 |   4 | Format_desc    |         1 |         120 | Server ver: 5.6.32-log, Binlog ver: 4                             |
| mysqld-bin.000001 | 120 | Previous_gtids |         1 |         151 |                                                                   |
| mysqld-bin.000001 | 151 | Gtid           |         1 |         199 | SET @@SESSION.GTID_NEXT= '4fbe2d57-5843-11e6-9268-0800274fb806:1' |
| mysqld-bin.000001 | 199 | Query          |         1 |         293 | create database wb01                                              |
| mysqld-bin.000001 | 293 | Rotate         |         1 |         341 | mysqld-bin.000002;pos=4                                           |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
5 rows in set (0.00 sec)
tool01 [(none)]:> show binlog events in 'mysqld-bin.000002';
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| mysqld-bin.000002 |   4 | Format_desc    |         1 |         120 | Server ver: 5.6.32-log, Binlog ver: 4                             |
| mysqld-bin.000002 | 120 | Previous_gtids |         1 |         191 | 4fbe2d57-5843-11e6-9268-0800274fb806:1                            |
| mysqld-bin.000002 | 191 | Gtid           |         1 |         239 | SET @@SESSION.GTID_NEXT= '4fbe2d57-5843-11e6-9268-0800274fb806:2' |
| mysqld-bin.000002 | 239 | Query          |         1 |         333 | create database wb02                                              |
| mysqld-bin.000002 | 333 | Rotate         |         1 |         381 | mysqld-bin.000003;pos=4                                           |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
5 rows in set (0.00 sec)
tool01 [(none)]:> show binlog events in 'mysqld-bin.000003';
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| Log_name          | Pos | Event_type     | Server_id | End_log_pos | Info                                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
| mysqld-bin.000003 |   4 | Format_desc    |         1 |         120 | Server ver: 5.6.32-log, Binlog ver: 4                             |
| mysqld-bin.000003 | 120 | Previous_gtids |         1 |         191 | 4fbe2d57-5843-11e6-9268-0800274fb806:1-2                          |
| mysqld-bin.000003 | 191 | Gtid           |         1 |         239 | SET @@SESSION.GTID_NEXT= '4fbe2d57-5843-11e6-9268-0800274fb806:3' |
| mysqld-bin.000003 | 239 | Query          |         1 |         333 | create database wb03                                              |
+-------------------+-----+----------------+-----------+-------------+-------------------------------------------------------------------+
4 rows in set (0.00 sec)

Here we see that each existing binary log file has just one transaction. That will make it easier to compress the oldest binary log, and then disappear with part of the existing GTIDs. When the slave connects to a master, it will first send all the Executed_Gtid_Set, and then the master sends all the missing IDs to the slave. As Stephane Combaudon said, we will force it to happen! Slave database servers are both currently in the same position:

tool02 [(none)]:> show slave statusG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000003
          Read_Master_Log_Pos: 333
               Relay_Log_File: mysqld-relay-bin.000006
                Relay_Log_Pos: 545
        Relay_Master_Log_File: mysqld-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
            ...
           Retrieved_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            Executed_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            
tool03 [(none)]:> show slave statusG
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysqld-bin.000003
          Read_Master_Log_Pos: 333
               Relay_Log_File: mysqld-relay-bin.000008
                Relay_Log_Pos: 451
        Relay_Master_Log_File: mysqld-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
...
           Retrieved_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3
            Executed_Gtid_Set: 4fbe2d57-5843-11e6-9268-0800274fb806:1-3

Now, we’ll compress the oldest binary log on master:

[root@tool01 mysql]# ls -lh | grep mysqld-bin.
-rw-rw---- 1 mysql mysql  262 Nov 11 13:55 mysqld-bin.000001.gz #: this is the file containing 4fbe2d57-5843-11e6-9268-0800274fb806:1
-rw-rw---- 1 mysql mysql  381 Nov 11 13:55 mysqld-bin.000002
-rw-rw---- 1 mysql mysql  333 Nov 11 13:55 mysqld-bin.000003
-rw-rw---- 1 mysql mysql   60 Nov 11 13:55 mysqld-bin.index

On tool03, which is the database server that will be used, we will execute the replication reload:

tool03 [(none)]:> stop slave; reset slave; reset master; start slave;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.03 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.02 sec)
tool03 [(none)]:> show slave statusG
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 4fbe2d57-5843-11e6-9268-0800274fb806
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp: 161111 14:47:13
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 1
1 row in set (0.00 sec)

Bingo! We broke the replication streaming on the slave. Now we know that the missing GTID on the master was due to the compressed file, and wasn’t able to be passed along to the connecting slave during their negotiation. Additionally, @@GTID_PURGED was not reloaded as per what the online manual said. The test done and we confirmed the theory (if you have additional comments, enter it at the end of the blog).

Test 2: Purge the oldest file on master and reload replication on slave

Let’s make it as straightforward as possible. The purge can be done manually using the PURGE BINARY LOGS command to get it done a proper way as the binary log index file should be considered a part of this purge operation as well (it should be edited to remove the file name index entry together with the log file on disk). I’m going to execute the same as before, but include purging the file manually with the mentioned command.

tool01 [(none)]:> show master logs;
+-------------------+-----------+
| Log_name | File_size |
+-------------------+-----------+
| mysqld-bin.000001 | 341 |
| mysqld-bin.000002 | 381 |
| mysqld-bin.000003 | 333 |
+-------------------+-----------+
3 rows in set (0.00 sec)
tool01 [(none)]:> purge binary logs to 'mysqld-bin.000002';
Query OK, 0 rows affected (0.01 sec)
tool01 [(none)]:> show master logs;
+-------------------+-----------+
| Log_name | File_size |
+-------------------+-----------+
| mysqld-bin.000002 | 381 |
| mysqld-bin.000003 | 333 |
+-------------------+-----------+
2 rows in set (0.00 sec)

Now, we’ll execute the commands to check how it goes:

tool03 [(none)]:> stop slave; reset slave; reset master; start slave;
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.02 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.02 sec)
tool03 [(none)]:> show slave statusG
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 192.168.0.10
                  Master_User: repl
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 151
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: 4fbe2d57-5843-11e6-9268-0800274fb806
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp: 161111 16:35:02
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 1
1 row in set (0.00 sec)

The GTID on the purged file is needed by the slave. In both cases, we can set the @@GTID_PURGED as below with the transaction that we know was purged, and move forward with replication:

tool03 [(none)]:> stop slave; set global gtid_purged='4fbe2d57-5843-11e6-9268-0800274fb806:1';
Query OK, 0 rows affected, 1 warning (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
tool03 [(none)]:> start slave;
Query OK, 0 rows affected (0.01 sec)

The above adjusts the GTID on @@GTID_PURGED to just request the existing GTIDs, using the oldest existing GTID minus one to make the slave start the replication from the oldest existing GTID. In our scenario above, the replica restarts replication from 4fbe2d57-5843-11e6-9268-0800274fb806:2, which lives on binary log file mysqld-bin.000002. Replication is fixed, as its threads can restart processing the data streaming coming from master.

You will need to execute additional steps in checksum and sync for the set of transactions that were jumped when setting a new value for @@GTID_PURGED. If replication continues to break after restarting, I advise you rebuild the slave (possibly the subject of future blog).

Good explanations about this can be found on the below bug, reported by the Facebook guys and Laurynas Biveinis, the Percona Server Lead (who clarified the issue):

  • MySQL Bugs: #72635: Data inconsistencies when master has truncated binary log with GTID after crash;
  • MySQL Bugs: #73032: Setting gtid_purged may break auto_position and thus slaves;

Conclusion

Be careful when purging or doing something manually with binary logs, because @@GTID_PURGED needs to be automatically updated when binary logs are purged. It seems to happen only when

expire_logs_days

 is set to purge binary logs. Yet you need to be careful when trusting this variable, because it doesn’t consider fraction of days, depending the number of writes on a database server, it can get disks full in minutes. This blog showed that even housekeeping scripts and the PURGER BINARY LOGS command were able to make it happen.

Dec
01
2016
--

AWS Personal Health Dashboard helps developers monitor the state of their cloud apps

img_20161201_091825 DevOps teams will be happy to hear that Amazon is launching its own dashboard for Amazon Web Services. Personal Health Dashboard, as the company calls it, is its latest release from the stage of re:Invent 2016 to support more advanced cloud apps monitoring. The tool puts critical infrastructure data in one place. The dashboard will automatically notify teams of failures and allow them… Read More

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com