Jan 11, 2019

AWS Aurora MySQL – HA, DR, and Durability Explained in Simple Terms

It’s a few weeks after AWS re:Invent 2018 and my head is still spinning from all of the information released at this year’s conference. This year I was able to enjoy a few sessions focused on Aurora deep dives. In fact, I walked away from the conference realizing that my own understanding of High Availability (HA), Disaster Recovery (DR), and Durability in Aurora had been off for quite a while. Consequently, I decided to put this blog out there, both to collect the ideas in one place for myself and to share them in general. Unlike some of our previous blogs, I’m not focused on analyzing Aurora performance or examining the architecture behind Aurora. Instead, I want to focus on how HA, DR, and Durability are defined and implemented within the Aurora ecosystem. We’ll go just deep enough into the weeds to examine these capabilities on their own.

Diagram: introducing the Aurora storage engine

Aurora MySQL – What is it?

We’ll start with a simplified discussion of what Aurora is from a very high level.  In its simplest description, Aurora MySQL is made up of a MySQL-compatible compute layer and a multi-AZ (multi availability zone) storage layer. In the context of an HA discussion, it is important to start at this level, so we understand the redundancy that is built into the platform versus what is optional, or configurable.

Aurora Storage

The Aurora Storage layer presents a volume to the compute layer. This volume is built out in 10GB increments called protection groups.  Each protection group is built from six storage nodes, two from each of three availability zones (AZs).  These are represented in the diagram above in green.  When the compute layer—represented in blue—sends a write I/O to the storage layer, the data gets replicated six times across three AZs.

Durable by Default

In addition to the six-way replication, Aurora employs a 4-of-6 quorum for all write operations. This means that for each commit that happens at the database compute layer, the database node waits until it receives write acknowledgment from at least four out of six storage nodes. By receiving acknowledgment from four storage nodes, we know that the write has been saved in at least two AZs.  The storage layer itself has intelligence built-in to ensure that each of the six storage nodes has a copy of the data. This does not require any interaction with the compute tier. By ensuring that there are always at least four copies of data, across at least two datacenters (AZs), and ensuring that the storage nodes are self-healing and always maintain six copies, it can be said that the Aurora Storage platform has the characteristic of Durable by Default.  The Aurora storage architecture is the same no matter how large or small your Aurora compute architecture is.
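To make the 4-of-6 quorum concrete, here is a small conceptual sketch in Python. It is an illustration of the quorum arithmetic only, not Aurora’s internal protocol: because each AZ holds two of the six storage nodes, any four acknowledgments necessarily span at least two AZs.

```python
# Conceptual sketch of the 4-of-6 write quorum (illustration only, not
# Aurora's actual implementation). Two storage nodes per AZ, six in total.

WRITE_QUORUM = 4  # acknowledgments required before a commit is considered durable

def write_is_durable(acks_per_az):
    """acks_per_az maps an AZ name to the number of storage-node acks (0-2) received."""
    return sum(acks_per_az.values()) >= WRITE_QUORUM

# Quorum met with two AZs responding, even if the third AZ never answers.
print(write_is_durable({"az-a": 2, "az-b": 2, "az-c": 0}))  # True
# Only three acks: the commit keeps waiting.
print(write_is_durable({"az-a": 2, "az-b": 1, "az-c": 0}))  # False
```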

One might think that waiting to receive four acknowledgments represents a lot of I/O time and is therefore an expensive write operation. However, Aurora database nodes do not behave the way a typical MySQL database instance would. Some of the round-trip execution time is mitigated by the way in which Aurora MySQL nodes write transactions to disk. For more information on exactly how this works, check out Amazon Senior Engineering Manager Kamal Gupta’s deep dive into Aurora MySQL from AWS re:Invent 2018.

HA and DR Options

While durability is a default characteristic of the platform, HA and DR are configurable capabilities. Let’s take a look at some of the HA and DR options available. Aurora databases are deployed as members of an Aurora DB Cluster. The cluster configuration is fairly flexible. Database nodes are given the role of either Writer or Reader. In most cases, there will only be one Writer node. The Reader nodes are known as Aurora Replicas. A single Aurora Cluster may contain up to 15 Aurora Replicas. We’ll discuss a few common configurations and the levels of HA and DR they provide. This is only a sample, not an exhaustive list of the configuration options available on the Aurora platform.
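For reference, the Writer and Reader roles (and the failover priority we’ll come to below) are visible through the RDS API. A minimal sketch using boto3; the cluster name and region are hypothetical, and configured AWS credentials are assumed.

```python
# List the members of an Aurora cluster and show which one holds the Writer role.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

cluster = rds.describe_db_clusters(DBClusterIdentifier="my-aurora-cluster")["DBClusters"][0]

for member in cluster["DBClusterMembers"]:
    role = "Writer" if member["IsClusterWriter"] else "Reader (Aurora Replica)"
    print(f'{member["DBInstanceIdentifier"]}: {role}, failover tier {member["PromotionTier"]}')
```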

Single-AZ, Single Instance Deployment

Image: great durability with Aurora, but HA and DR less so

The most basic implementation of Aurora is a single compute instance in a single availability zone. The compute instance is monitored by the Aurora Cluster service and will be restarted if the database instance or compute VM has a failure. In this architecture, there is no redundancy at the compute level. Therefore, there is no database level HA or DR. The storage tier provides the same high level of durability described in the sections above. The image below is a view of what this configuration looks like in the AWS Console.

Single-AZ, Multi-Instance

HA can be added to a basic Aurora implementation by adding an Aurora Replica. We increase our HA level by adding Aurora Replicas within the same AZ. If desired, the Aurora Replicas can also be used to service some of the read traffic for the Aurora Cluster. This configuration cannot be said to provide DR because there are no database nodes outside the single datacenter or AZ. If that datacenter were to fail, then database availability would be lost until it was manually restored in another datacenter (AZ). It’s important to note that while Aurora has a lot of built-in automation, you will only benefit from that automation if your base configuration facilitates a path for the automation to follow. If you have a single-AZ base deployment, then you will not have the benefit of automated Multi-AZ availability. However, as in the previous case, durability remains the same. Again, durability is a characteristic of the storage layer. The image below is a view of what this configuration looks like in the AWS Console. Note that the Writer and Reader are in the same AZ.
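In API terms, adding an Aurora Replica is simply creating another DB instance inside the existing cluster. A minimal sketch with boto3; the identifiers, instance class, and AZ are hypothetical placeholders.

```python
# Add an Aurora Replica in the same AZ as the Writer: HA within the AZ, but no DR.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="my-aurora-replica-1",
    DBClusterIdentifier="my-aurora-cluster",
    Engine="aurora-mysql",            # must match the cluster's engine
    DBInstanceClass="db.r4.large",
    AvailabilityZone="us-east-1a",    # same AZ as the Writer in this configuration
)
```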

Multi-AZ Options

Building on our previous example, we can increase our level of HA and add partial DR capabilities to the configuration by adding more Aurora Replicas. At this point we will add one additional replica in the same AZ, bringing the local AZ count to three database instances. We will also add one replica in each of the two remaining regional AZs. Aurora provides the option to configure automated failover priority for the Aurora Replicas. Failover priority is best defined by your individual business needs. That said, one approach is to set the first failover priority to the local-AZ replicas, and subsequent failover priority to the replicas in the other AZs. It is important to remember that AZs within a region are physical datacenters located within the same metro area. This configuration will provide protection against a disaster localized to one datacenter. It will not, however, provide protection against a city-wide disaster. The image below is a view of what this configuration looks like in the AWS Console. Note that we now have two Readers in the same AZ as the Writer and two Readers in two other AZs.
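Failover priority is expressed per replica as a promotion tier (0 is the highest priority, 15 the lowest). A minimal sketch with boto3 following the priority scheme described above; all instance identifiers are hypothetical.

```python
# Set promotion tiers so the local-AZ replicas are preferred failover targets.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

failover_priority = {
    "my-aurora-replica-1": 0,  # same AZ as the Writer
    "my-aurora-replica-2": 0,  # same AZ as the Writer
    "my-aurora-replica-3": 1,  # second AZ
    "my-aurora-replica-4": 1,  # third AZ
}

for instance_id, tier in failover_priority.items():
    rds.modify_db_instance(
        DBInstanceIdentifier=instance_id,
        PromotionTier=tier,
        ApplyImmediately=True,
    )
```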

Cross-Region Options

The three configuration types we’ve discussed up to this point represent configuration options available within an AZ or metro area. There are also options available for cross-region replication in the form of both logical and physical replication.

Logical Replication

Aurora supports replication to up to five additional regions with logical replication.  It is important to note that, depending on the workload, logical replication across regions can be notably susceptible to replication lag.

Physical Replication

One of the many announcements to come out of re:Invent 2018 is a product called Aurora Global Database. This is Aurora’s implementation of cross-region physical replication. Amazon’s published details on the solution indicate that it is storage-level replication implemented on dedicated cross-region infrastructure with sub-second latency. In general terms, the idea behind a cross-region architecture is that the second region could be an exact duplicate of the primary region. This means that the primary region can have up to 15 Aurora Replicas and the secondary region can also have up to 15 Aurora Replicas. There is one database instance in the secondary region in the role of writer for that region. This instance can be configured to take over as the master for both regions in the case of a regional failure. In that scenario the secondary region becomes primary, and the writer in that region becomes the primary database writer. This configuration provides protection in the case of a regional disaster. It’s going to take some time to test this, but at the moment this architecture appears to provide the most comprehensive combination of Durability, HA, and DR. The trade-offs have yet to be thoroughly explored.
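At the API level, the published description amounts to wrapping an existing regional cluster into a global cluster and attaching a secondary cluster in another region. The sketch below is an assumption-heavy outline using boto3’s create_global_cluster and create_db_cluster calls; all identifiers are hypothetical and the exact required parameters may differ from what is shown.

```python
# Sketch: promote an existing cluster to a Global Database and attach a secondary region.
import boto3

rds_primary = boto3.client("rds", region_name="us-east-1")
rds_secondary = boto3.client("rds", region_name="us-west-2")

# Wrap the existing primary-region cluster into a global cluster.
rds_primary.create_global_cluster(
    GlobalClusterIdentifier="my-global-cluster",
    SourceDBClusterIdentifier="arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster",
)

# Attach a secondary cluster in another region; the storage-level replication
# between regions is handled by the service itself.
rds_secondary.create_db_cluster(
    DBClusterIdentifier="my-aurora-cluster-secondary",
    GlobalClusterIdentifier="my-global-cluster",
    Engine="aurora-mysql",
)
```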

Multi-Master Options

Amazon is in the process of building out a new capability called Aurora Multi-Master. Currently, this feature is in preview phase and has not been released for general availability. While there were a lot of talks at re:Invent 2018 which highlighted some of the components of this feature, there is still no affirmative date for release. Early analysis points to the feature being localized to the AZ. It is not known if cross-region Multi-Master will be supported, but it seems unlikely.

Summary

My post-re:Invent takeaway is that there is an Aurora configuration to fit almost any workload that requires strong performance behind it. Not all heavy workloads also demand HA and DR, and if this describes one of your workloads, then there is an Aurora configuration that fits your needs. On the flip side, it is also important to remember that while data durability is an intrinsic quality of Aurora, HA and DR are not: they are completely configurable. This means that the Aurora architect in your organization must put thought and due diligence into the way they design your Aurora deployment. While we all need to be conscious of costs, don’t let cost consciousness become a blinder to reality. Just because your environment is running in Aurora does not mean you automatically have HA and DR for your database. In Aurora, HA and DR are configuration options, and just like the on-premises world, viable HA and DR have additional costs associated with them.


Nov 26, 2018

Percona at AWS Re:Invent 2018!

Come see Percona at AWS re:Invent from November 26-30, 2018 in booth 1605 in The Venetian Hotel Expo Hall.

Percona is a Bronze sponsor of AWS re:Invent in 2018 and will be there for the whole show! Drop by booth 1605 in The Venetian Expo Hall to discuss how Percona’s unbiased open source database experts can help you with your cloud database and DBaaS deployments!

Our CEO, Peter Zaitsev will be presenting a keynote called MySQL High Availability and Disaster Recovery at AWS re:Invent!

  • When: 27 November at 1:45 PM – 2:45 PM
  • Where: Bellagio Hotel, Level 1, Gauguin 2

Check out our case study with Passportal on how Percona DBaaS expertise helps guarantee uptime for their AWS RDS environment.

Percona has a lot of great content on how we can help improve your AWS open source database deployment, including blogs, case studies, white papers, webinars, and datasheets.

See you at the show!

Nov 16, 2018

Percona on the AWS Database Blog

Percona’s very own Michael Benshoof posted an article called Supercharge your Amazon RDS for MySQL deployment with ProxySQL and Percona Monitoring and Management on the AWS Database Blog.

In this article, Michael looks at how to maximize the performance of your cloud database Amazon RDS deployment. He shows that you can enhance performance by using ProxySQL to handle load balancing and Percona Monitoring and Management to look at performance metrics.

Percona is an Advanced AWS Partner and provides support and managed cloud database services for AWS RDS and Amazon Aurora.

Check the article out!

Nov 14, 2018

Migrating to Amazon Aurora: Reduce the Unknowns

In this Checklist for Success series, we discuss reducing unknowns when hosting in the cloud with, and migrating to, Amazon Aurora. These tips may also apply to other database as a service (DBaaS) offerings.

While DBaaS encapsulates a lot of the moving pieces, it also means relying on this approach for your long-term stability. This encapsulation is a two-edged sword that takes away your visibility into performance outside of the service layer.

Shine a Light on Bad Queries

Bad queries are one of the top causes of downtime, and Aurora doesn’t protect you against them. Performing a query review as part of a routine health check of your workload helps ensure that you do not miss looming issues. It also helps you predict the workload at specific times and events. For example, if you already know your top three queries tend to increase exponentially and are read-bound, you can easily decide to increase the number of read replicas on your cluster.

Having historical query performance data helps make this task easier and less stressful. While historical data allows you to look backward, it’s also very valuable to have a tool that lets you look at active incident scenarios in progress. Knowing which queries are running while you are suffering from performance issues reduces guesswork and helps solve problems faster.
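As a concrete starting point, the statement digest summary in performance_schema (the same source the tools discussed below draw from) already gives a ranked list of your heaviest queries. A minimal sketch using PyMySQL; the endpoint and credentials are placeholders.

```python
# Pull the top three query digests by total latency from performance_schema.
import pymysql

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxx.us-east-1.rds.amazonaws.com",
    user="admin",
    password="secret",
    database="performance_schema",
)

TOP_QUERIES = """
    SELECT schema_name,
           digest_text,
           count_star            AS executions,
           sum_timer_wait / 1e12 AS total_latency_sec  -- timer values are in picoseconds
    FROM events_statements_summary_by_digest
    WHERE schema_name IS NOT NULL
    ORDER BY sum_timer_wait DESC
    LIMIT 3
"""

with conn.cursor() as cur:
    cur.execute(TOP_QUERIES)
    for schema, digest, executions, latency in cur.fetchall():
        print(f"{schema}: {executions} executions, {latency:.1f}s total -> {(digest or '')[:80]}")
```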

Pick Your Tool(s)

There are a number of ways you can achieve query performance excellence. Performance Insights is a built-in offering from AWS that is tightly integrated with RDS. It has a free seven-day retention period, with an extra cost beyond that, and it is available for each instance in a cluster. Performance Insights takes most of its metrics from the Performance Schema, and it includes specific operating system metrics that may not be available from regular CloudWatch metrics.
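Performance Insights is enabled per instance and defaults to the free seven-day retention tier. A minimal sketch with boto3; the instance identifier is hypothetical.

```python
# Enable Performance Insights on an existing instance with the free retention tier.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="my-aurora-instance-1",
    EnablePerformanceInsights=True,
    PerformanceInsightsRetentionPeriod=7,  # days; longer retention is a paid tier
    ApplyImmediately=True,
)
```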

Query Analytics from Percona Monitoring and Management (PMM) uses the same source as Performance Insights: the Performance Schema. Unlike Performance Insights, though, PMM is deployed separately from the cluster. This means you can keep your metrics even if you keep recycling your cluster instances. With PMM, you can also consolidate your query reviews in a single location, and you can monitor your cluster instances from the same location, including an extensive list of performance metrics.

You can enable Performance Insights with the default seven-day retention period, and then combine it with PMM for a longer retention period across all your cluster instances. Note, though, that PMM may incur a cost for the additional API calls used to retrieve Performance Insights metrics.

Outside of the built-in and open source alternatives, VividCortex, New Relic, and Datadog are excellent tools that do everything we discussed above and more. New Relic, for example, gives you a good view of database, application, and external request timings, which in my opinion is very valuable.

Bad queries are not the only potential unknowns. Deleted rows, dropped tables, crippling schema changes, and even AZ/Region failures are realities in the cloud. We will discuss them next! Stay “tuned” for part two: Migrating to Amazon Aurora: Optimize for Binary Log Replication.

Meanwhile, we’d like to hear your success stories in Amazon Aurora in the comments below!

Don’t forget to come by and see us at AWS re:Invent, November 26-30, 2018 in booth 1605! Percona CEO Peter Zaitsev will deliver a keynote on MySQL High Availability & Disaster Recovery, Tuesday, November 27 at 1:45 PM – 2:45 PM in the Bellagio Hotel, Level 1, Gauguin 2

Nov 02, 2018

Maintenance Windows in the Cloud

Recently, I’ve been working with a customer to evaluate the different cloud solutions for MySQL. In this post I am going to focus on maintenance windows and requirements, and what the different cloud platforms offer.

Why is this important at all?

Maintenance windows are required so that the cloud provider can do the necessary updates, patches, and changes to our setup. But there are many questions like:

  • Is this going to impact our production traffic?
  • Is this going to cause any downtime?
  • How long does it take?
  • Any way to avoid it?

Let’s discuss the three most popular cloud providers: AWS, Google, and Microsoft. Each of these has a MySQL-based database service where we can compare the maintenance settings.

AWS

When you create an instance you can define your maintenance window. It’s a 30-minute block during which AWS can update and restart your instances, though it might take longer; AWS does not guarantee the update will be done within 30 minutes. There are two different types of updates: Required and Available.

If you defer a required update, you receive a notice from Amazon RDS indicating when the update will be performed. Other updates are marked as available, and these you can defer indefinitely.

It is even possible to disable auto upgrade for minor versions, and in that case you can decide when you want to do the maintenance.
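Both the maintenance window and the minor-version auto-upgrade behaviour are per-instance settings in RDS. A minimal sketch with boto3; the instance identifier and the window itself are placeholders.

```python
# Set a weekly maintenance window and take control of minor engine upgrades.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.modify_db_instance(
    DBInstanceIdentifier="my-rds-instance",
    PreferredMaintenanceWindow="sun:04:00-sun:04:30",  # ddd:hh24:mi-ddd:hh24:mi, in UTC
    AutoMinorVersionUpgrade=False,  # decide yourself when minor engine upgrades happen
)
```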

AWS separates OS updates from database engine updates.

OS Updates

OS updates require some downtime, but you can minimize it by using Multi-AZ deployments. First, the secondary instance is updated. Then AWS performs a failover and updates the primary instance as well. This means a small outage during the failover.

DB Engine Updates

For DB maintenance, the updates are applied to both instances (primary and secondary) at the same time. That will cause some downtime.

More information: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_UpgradeDBInstance.Maintenance.html#USER_UpgradeDBInstance.Maintenance.Multi-AZ

Google CloudSQL

With CloudSQL you define a one-hour maintenance window, for example 01:00–02:00, and within that hour Google can restart the instances at any time. It is not guaranteed that the update will be done within that hour. The primary and the secondary have the same maintenance window. The read replicas do not have any maintenance window; they can be stopped at any time.

CloudSQL does not differentiate between OS and DB engine updates, or between required and available upgrades. Because the failover replica has the same maintenance window, any upgrade might cause a database outage in that time frame.

More information: https://cloud.google.com/sql/docs/mysql/instance-settings

Microsoft Azure

Azure provides a service called Azure Database for MySQL. I read the documentation and did some research trying to find anything regarding maintenance windows, but I did not find anything.

I spun up an instance in Azure to see if there were any available settings, but I did not find anything, so at this point I do not know how Azure does OS or DB maintenance, or how that impacts production traffic.

If someone knows where I can find this information in the documentation, please let me know.

Conclusion

                                       AWS       CloudSQL   Azure
Maintenance Window                     30m       1h         Unknown
Maintenance Window for Read Replicas   No        No         Unknown
Separate OS and DB updates             Yes       No         Unknown
Outage during update                   Possible  Possible   Unknown
Postpone an update                     Possible  No         Unknown
Different priority for updates         Yes       No         Unknown

While I do not intend to favor or promote any of the providers, for this specific question AWS offers the most options and control over how we want to deal with maintenance.


Photo by Caitlin Oriel on Unsplash

Oct 03, 2018

Percona Blog Poll: How Do You Run Your Database in the Cloud?

Percona’s latest blog poll asks: how do you run your database in the cloud?

Join in!

Are you using a fully managed service or are you self-managing your databases in the cloud? And which provider are you relying on? Perhaps you’re using more than one. Don’t worry, you can tick multiple boxes, so please choose up to four answers. If you don’t see your solution listed, use the comments box on this blog to share your thoughts.

If you’d like to, you’re also welcome to leave a comment to tell us about your choice, or maybe why you’re NOT planning on moving to a cloud solution in the near future. Likewise, if you want to share how you’ve found your cloud deployment so far, feel free to send a comment. Thanks!

Note: There is a poll embedded within this post, please visit the site to participate in this post’s poll.

Sep 07, 2018

Taboola and Percona Host a Meetup in Tel Aviv, Oct 3

From time to time, Percona hosts and co-hosts events around the world, often at no cost for attendees. The common focus for these events is, of course, open source databases. On October 3, 2018, we team up with Taboola for an evening of talks suitable for all levels of expertise, addressing both the how and the why of choosing a database technology. The event will be held at the Taboola offices in Tel Aviv. Here’s the schedule:

  • 6:00 pm Welcome with food & drink served
  • 6:45 pm How to pick database technologies that cover your requirements (Dimitri Vanoverbeke, Percona)
  • 7:20 pm Lessons from the field, extremely heavy loads on Percona Server for MySQL (Ariel Pisetzky, Taboola)
  • 7:40 pm Migrating to the cloud, the why and the how. (Dimitri Vanoverbeke, Percona)
  • 8:00 pm Event close

If you would like to join us, please sign up here so that we can be sure to cater for you.

Meanwhile, if you have not yet bought your tickets for Percona Live Europe, November 5–7 in Frankfurt, then hurry… I have extended Early Bird pricing by one week. We’ll be publishing the exciting, high-quality schedule soon, so watch out for it! (But don’t miss the Early Bird…)

Finally, if you would like to be sure to not miss our future events, please subscribe to our newsletter–we don’t necessarily blog about every opportunity.

Aug 29, 2018

Scaling IO-Bound Workloads for MySQL in the Cloud

Is increasing GP2 volume size or increasing IOPS for IO1 volumes a valid method for scaling IO-bound workloads? In this post I’ll focus on one question: how much can we improve performance if we use faster cloud volumes? This post is a continuation of our previous cloud research posts.

To recap, in Amazon EC2 we can use gp2 and io1 volumes. gp2 performance scales with size: for a gp2 volume of 500GB we get 1500 iops; for 1000GB, 3000 iops; and for 3334GB, 10000 iops (the maximum possible value). For io1 volumes we can “buy” throughput up to 30000 iops.
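The gp2 figures above follow from the 3-IOPS-per-GB baseline rule, capped at the 10000 iops maximum cited here (AWS has since raised that cap). A small worked example:

```python
def gp2_iops(size_gb, max_iops=10000, min_iops=100):
    """Baseline gp2 IOPS: 3 per GB, floored at the 100-IOPS minimum and capped at the volume maximum."""
    return max(min_iops, min(3 * size_gb, max_iops))

for size_gb in (500, 1000, 3334):
    print(f"{size_gb} GB -> {gp2_iops(size_gb)} iops")
# 500 GB -> 1500 iops, 1000 GB -> 3000 iops, 3334 GB -> 10000 iops
```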

So I wanted to check how both InnoDB and RocksDB storage engines perform on these volumes with different throughput.

Benchmark Scenario

I will use the same datasize that I used in Saving With MyRocks in The Cloud, that is sysbench-tpcc, 50 tables, 100W each, about 500GB datasize in InnoDB and 100GB in RocksDB (compressed with LZ4).

Volume settings: gp2 volumes from 500GB (1000GB for InnoDB) to 3400GB in 100GB increments (so each increment increases throughput by 300 iops); io1 volumes: 1TB in size, iops from 1000 to 30000 in 1000-iops increments.

Let’s take a look at the results. I will use a slightly different format than usual, but hopefully it represents the results better. You will see density throughout the plots: a higher and narrower chart represents less variance in the throughput. Each plot represents the distribution of the throughput.

Results on GP2 volumes:

InnoDB/MyRocks throughput on gp2

It’s quite interesting to see how the result scales with better IO throughput. InnoDB does not improve its throughput after a gp2 size of 2600GB, while MyRocks continues to scale linearly. The problem with MyRocks is that there is a lot of variance in throughput (I will show a one-second resolution chart below).

Results on IO1 volumes

InnoDB / MyRocks throughput on IO1

Here MyRocks again shows impressive growth as we add more IO capacity, but it also shows a lot of variance on high-capacity volumes.

Let’s compare how the engines perform at one-second resolution. GP2 volume, 3400GB:

InnoDB/MyRocks throughput on gp2 3400GB

IO1 volume, 30000 iops:

InnoDB/MyRocks throughput on IO1 30000 IOPS

So for MyRocks there seems to be periodic background activity, which does not allow it to achieve a stable throughput.

Raw results, if you’d like to review them, can be found here: https://github.com/Percona-Lab-results/201808-rocksdb-cloudio

Conclusions

If you are looking to improve throughput in IO-bound workloads, either increasing GP2 volumes size or increasing IOPS for IO1 volumes is a valid method, especially for the MyRocks engine.

Aug 28, 2018

Webinar Wed 8/29: Databases in the Hosted Cloud

Please join Percona’s Chief Evangelist, Colin Charles, on Wednesday, August 29th, 2018, as he presents Databases in the Hosted Cloud at 7:00 AM PDT (UTC-7) / 10:00 AM EDT (UTC-4).

Nearly everyone today uses some form of database in the hosted cloud. You can use hosted MySQL, MariaDB, Percona Server, and PostgreSQL in several cloud providers as a database as a service (DBaaS).

In this webinar, Colin Charles explores how to efficiently deploy a cloud database configured for optimal performance, with a particular focus on MySQL.

You’ll learn the differences between the various public cloud offerings, including Amazon RDS and Aurora, Google Cloud SQL, Rackspace OpenStack DBaaS, Microsoft Azure, and Alibaba Cloud, as well as the access methods and the level of control you have. Hosting in the cloud can be a challenge, but after today’s webinar we’ll make sure you walk away with a better understanding of how you can leverage the cloud for your business needs.

Topics include:

  • Backup strategies
  • Planning multiple data centers for availability
  • Where to host your application
  • How to get the most performance out of the solution
  • Cost
  • Monitoring
  • Moving from one DBaaS to another
  • Moving from a DBaaS to your own hosted platform

Register Now.

Aug 24, 2018

This Week in Data with Colin Charles 50: Percona Live Europe Sessions, PostgreSQL in Google Cloud

Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

Grading is underway for talks at Percona Live Europe 2018. I understand that by next week you will see the tutorial schedule released. As part of the program committee, I have enjoyed reviewing tutorials, and I reckon there is great competition for the schedule. I suggest you register now, and don’t forget to book your accommodation (need a discount?).

A video worth watching: How we Live Migrated Millions of BBM Users & its Infrastructure Across the Pacific (Cloud Next ’18). I was a big Blackberry Messenger (BBM) user in its heyday. They moved from their Canadian data center to Google Cloud. Blackberry used MySQL, migrated Oracle to PostgreSQL, as Oracle isn’t sanctioned to run on Google Cloud Platform. They also moved from MySQL to Google CloudSQL since the databases weren’t critical to running the application. Some cases write to BigQuery and have Tableau query them. They make use of Cassandra native replication. With PostgreSQL, there is master-slave replication: they got users onto the new master and then promoted the master. When they did the master promotion, there was minimal (5-10 minutes) downtime while this process was ongoing. This didn’t impact their key services, messages continued to work, and this was seen as acceptable.

Releases

  • MariaDB 10.3.9 – InnoDB from 5.7.23, a new variable innodb_log_optimize_ddl for avoiding delay due to page flushing and allowing concurrent backup, and many fixes and improvements around ALTER TABLE too.
  • Percona Server for MySQL 5.6.41-84.1 – full-text search index with InnoDB table bug fixed, as well as ensuring queries on a table with CHARSET=euckr COLLATE=euckr_bin always return the same results.
  • Percona Server for MySQL 5.5.61-38.13 – bug fixes.

Feedback

I look forward to feedback/tips via e-mail at colin.charles@percona.com or on Twitter @bytebot.
