Nov 19, 2018

Migrating to Amazon Aurora: Design for Flexibility

In this Checklist for Success series, we discuss how to reduce unknowns when hosting in the cloud using, and migrating to, Amazon Aurora. These tips might also apply to other database as a service (DBaaS) offerings.

Previous blogs in the Migrating to Amazon Aurora series:

  • Migrating to Amazon Aurora: Optimize for Binary Log Replication

The whole premise of a database as a service offering is that you do not need to worry about operating the service; you just need to use it. But all DBaaS offerings have limitations as well as strengths. You should not get too comfortable with the niceties of such services: you need to remain flexible and, ultimately, design to prevent database failure.

Have a Grasp of Your Cluster Behavior

Disable lab mode. You should not depend on this setting to take advantage of new features: it can break and leave you in a bad position. If you rely on it and design your application around it, you might find yourself working around the same problem later if, for example, you run the same queries on a non-Aurora deployment. This is not to say that you shouldn’t take advantage of Aurora features. Lab mode, however, is “lab mode” and should not be enabled in a production environment.

Use a separate parameter group per cluster to keep your configuration changes isolated. In some cases you might have a group of clusters that run the same workload and could share a parameter group, but this should be rare, and sharing also prevents you from rolling changes out to one cluster at a time.
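For illustration, here is a minimal sketch, using the AWS CLI and hypothetical names, of creating a dedicated cluster parameter group and attaching it to a single cluster:

# Create a parameter group that belongs to this cluster only (names are placeholders)
aws rds create-db-cluster-parameter-group \
    --db-cluster-parameter-group-name cluster-a-params \
    --db-parameter-group-family aurora-mysql5.7 \
    --description "Parameters for cluster-a only"

# Attach it to the cluster
aws rds modify-db-cluster \
    --db-cluster-identifier cluster-a \
    --db-cluster-parameter-group-name cluster-a-params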

Some might point out that syncing the parameter groups in this situation might be difficult. It really isn’t, and you don’t need any complicated tools to do it. For example, you can use pt-config-diff to regularly inspect the runtime configuration on each cluster and identify or resolve differences.
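A quick sketch, with placeholder endpoints and credentials, of comparing the runtime configuration of two clusters:

# Compare runtime variables between two cluster writer endpoints
pt-config-diff \
    h=cluster-a.cluster-xxxx.us-east-1.rds.amazonaws.com,u=percona,p=secret \
    h=cluster-b.cluster-xxxx.us-east-1.rds.amazonaws.com,u=percona,p=secret
# A non-zero exit status indicates that differences were found, which makes
# this easy to schedule and alert on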

While it is ideal for your clusters to always be up to date, letting them upgrade on their own schedule can be intrusive, especially if your workload does not follow predictable high and low traffic periods. I recommend having more control over the upgrade process, and this excellent community blog post from Renato on how to do just that is worth a read.
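Independently of that post, one hedged way of keeping the timing in your own hands is to review and schedule pending maintenance with the AWS CLI (the cluster ARN below is a placeholder):

# See what Aurora wants to apply to your resources
aws rds describe-pending-maintenance-actions

# Opt in explicitly for your next maintenance window rather than letting it fire ad hoc
aws rds apply-pending-maintenance-action \
    --resource-identifier arn:aws:rds:us-east-1:123456789012:cluster:cluster-a \
    --apply-action system-update \
    --opt-in-type next-maintenance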

Don’t Put All Your Eggs in One Basket

On another note, Aurora can hold up to 64TB of data. Yes, that’s big. It might not be a problem, and some of you might even be excited about this potential. But when you think about it, do you really want to store that amount of data in a single basket? What if you need to analyze this data for a particular time period? Is the cost worth it? Surely, at some point, you will need to transport that data somewhere.

We’ve seen problems even at sizes below 2TB. If you need to rebuild an asynchronous replica, for example, it takes a while. You have to stay well ahead in capacity planning to ensure that you add new read-replicas when needed, which is a challenge when you are put on the spot: no matter how fast Aurora replica provisioning is, a burst of traffic might already be over before the new replica is ready.
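For reference, adding a reader to an existing cluster is a single AWS CLI call (identifiers and instance class below are placeholders); the command is not the bottleneck, the provisioning and warm-up time is:

# Add one more Aurora reader ahead of an expected traffic burst
aws rds create-db-instance \
    --db-cluster-identifier cluster-a \
    --db-instance-identifier cluster-a-reader-3 \
    --db-instance-class db.r4.xlarge \
    --engine aurora-mysql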

Another challenge with datasets that are too big is when you have large tables. Schema changes become increasingly difficult in these situations, especially when such tables are subject to highly concurrent reads and writes. Recall that in the blog Migrating to Amazon Aurora: Optimize for Binary Log Replication we recommend setting binlog_format to ROW so that you can use tools like gh-ost in these types of situations.
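As a rough sketch (host, schema, table, and column are placeholders), a gh-ost run against an Aurora writer could look like this; --assume-rbr tells gh-ost that binlog_format is already ROW, since it cannot change that setting itself on Aurora:

gh-ost \
    --host=cluster-a.cluster-xxxx.us-east-1.rds.amazonaws.com \
    --user=percona --password=secret \
    --database=appdb --table=orders \
    --alter="ADD COLUMN notes VARCHAR(255) NOT NULL DEFAULT ''" \
    --allow-on-master \
    --assume-rbr \
    --max-load=Threads_running=30 \
    --execute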

High Availability On Your Terms

One limitation of Aurora cluster instances is that there is no easy way to take a misbehaving read-replica out of rotation. Sure, you can delete the read-replica, but that surfaces transient errors to the application and impacts performance while you wait for a replacement to be provisioned and pick up the workload.

Similarly, a misbehaving query can easily spoil the whole cluster, even if that query is spread out evenly to the read-replicas. Depending on how quickly you can disable the query, it might result in losing some business in the process. It would be nice if you could blackhole, rewrite or redirect such queries on demand so as to isolate the impact (or even fix it immediately).

Lastly, certain situations require that you restart the cluster. However, doing so could violate your uptime SLA. These situations can occur when you need to apply a non-dynamic cluster parameter, or you need to perform a cluster upgrade.

You can avoid most of these problems by not relying solely on Aurora’s own implementation of high availability. Amazon is continuously improving this process but, for now, you can use tools like ProxySQL to redirect traffic, both in-cluster and between clusters replicating asynchronously. Percona has existing blog posts on this topic: Leveraging ProxySQL with AWS Aurora to Improve Performance, Or How ProxySQL Out-performs Native Aurora Cluster Endpoints and How to Implement ProxySQL with AWS Aurora.
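As an example of what that flexibility buys you, here is a hedged sketch of blocking one bad query digest and sending reporting traffic to a different hostgroup through the ProxySQL admin interface (digests, hostgroup number, and credentials are placeholders):

mysql -u admin -padmin -h 127.0.0.1 -P 6032 <<'SQL'
-- Return an error for a known-bad query instead of letting it hit the cluster
INSERT INTO mysql_query_rules (rule_id, active, match_digest, error_msg, apply)
VALUES (10, 1, '^SELECT .* FROM big_table JOIN huge_table', 'Query disabled by DBA', 1);

-- Route reporting queries to hostgroup 30 (for example, an asynchronous replica cluster)
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (20, 1, '^SELECT .* FROM report_', 30, 1);

LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
SQL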

Meanwhile, we’d like to hear your success stories in migrating to Amazon Aurora in the comments below!

Don’t forget to come by and see us at AWS re:Invent, November 26-30, 2018 in booth 1605! Percona CEO Peter Zaitsev will deliver a keynote on MySQL High Availability & Disaster Recovery, Tuesday, November 27 at 1:45 PM – 2:45 PM in the Bellagio Hotel, Level 1, Gauguin 2

Nov 16, 2018

Migrating to Amazon Aurora: Optimize for Binary Log Replication

In this Checklist for Success series, we discuss how to reduce unknowns when hosting in the cloud using, and migrating to, Amazon Aurora. These tips might also apply to other database as a service (DBaaS) offerings.

In our previous article, we discussed the importance of continuous query performance analysis, especially in Amazon Aurora, where there is less diagnostic visibility than when running on EC2 or on-premises. Aside from uptime, though, we need a lot more from our data, and we definitely cannot keep it isolated in Aurora.

Next on our checklist is the fact that, at one point or another, you will need to use asynchronous replication. Amazon Aurora has an excellent reputation for absorbing intense amounts of writes, and in many cases where you need an asynchronous replica, that replica can struggle to catch up.

Different Clusters for Different Workloads

Critical workloads and datasets cannot rely on a single copy of their data. With Amazon Aurora, predictable performance means avoiding mixing workloads within your production cluster. While read-heavy workloads might fit easily onto read-replicas, reporting or analytics workloads might not be a good fit to run on your main cluster, where read-what-you-write profiles are normally found. You can delegate these workloads to a separate asynchronous replica, either as a separate Amazon Aurora cluster or, for example, as an instance running on EC2. This is also true if, say, your analytics or reporting workload generates a significant amount of disk IOPS.

Amazon Aurora bills IO operations per million. You might save some money by running disk-heavy analytics operations on a replica hosted on an i3 instance with local NVMe storage, for example. Similarly, running an async replica on an EC2 instance or on-premises allows you to take your own independent backups, or simply adds an extra level of redundancy.
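A hedged sketch of wiring that up (endpoints, credentials, and binlog coordinates are placeholders): extend binlog retention on the Aurora side, then point the external replica at the cluster’s writer endpoint. This assumes binary logging is already enabled through binlog_format in the cluster parameter group.

# On the Aurora cluster: keep binary logs around long enough for the external replica
mysql -h cluster-a.cluster-xxxx.us-east-1.rds.amazonaws.com -u percona -p \
    -e "CALL mysql.rds_set_configuration('binlog retention hours', 168);"

# On the EC2 / on-premises replica: point replication at the Aurora writer endpoint
mysql -u root -p -e "
CHANGE MASTER TO
  MASTER_HOST='cluster-a.cluster-xxxx.us-east-1.rds.amazonaws.com',
  MASTER_USER='repl', MASTER_PASSWORD='secret',
  MASTER_LOG_FILE='mysql-bin-changelog.000123', MASTER_LOG_POS=4;
START SLAVE;"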

Multi-Threaded Replication

It is a known fact that MySQL asynchronous replication performance is subject to some limitations. The biggest one is that, by default, it is single-threaded. MySQL 5.6 introduced multi-threaded replication with one thread per database. This did not help the majority of use cases, as workloads vary per database and therefore create an imbalance. With MySQL 5.7 (Aurora 2.0) there have been additional improvements, such as an alternative algorithm for parallelizing thread execution that depends on certain behaviors regarding how the acting primary server writes binary log entries.

With that said, certain multi-threaded replication variables (such as transaction_write_set_extraction) require that the binlog format is set to ROW. This might sound counter-intuitive, because the ROW binlog format can actually increase the replication workload. However, while ROW format reduces the ambiguity from potentially non-deterministic statements that could cause a replica to drift and become inconsistent, critical operations (schema changes) and optimizations (MTS) require that you use the ROW binlog format.

It should be apparent by now that the only reasonable path forward to improving asynchronous replication is via the multi-threaded approach. Along with that comes the need for the ROW binlog format. Any design effort should always include this fact if async replication lag is considered a risk. For the basics, configuration options like slave_compressed_protocol and binlog_row_image can reduce network churn. In the deep end, reducing dataset hotspots, ensuring tables have PRIMARY KEYs, and embracing multi-threaded optimization can also go a long way.
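For the basics above, a hedged sketch of what that might look like on a self-managed (EC2 or on-premises) MySQL 5.7 replica; on Aurora the equivalent settings live in the cluster and instance parameter groups, and binlog_format/binlog_row_image are set on the source side:

mysql -u root -p <<'SQL'
STOP SLAVE;
SET GLOBAL slave_parallel_type       = 'LOGICAL_CLOCK';  -- parallelize by commit group, not by schema
SET GLOBAL slave_parallel_workers    = 8;                -- enable the multi-threaded applier
SET GLOBAL slave_compressed_protocol = ON;               -- compress source-to-replica traffic
START SLAVE;
SQL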

Additional Benefits

While running certain read-heavy queries on an async replica, or ensuring you have access to the physical data files, are common use cases, being able to switch to another location (region), or simply to another cluster, might also be necessary in some instances:

  • Adding or dropping a column/index on a large table can be achieved with either pt-online-schema-change or gh-ost (see the sketch after this list). But for cases where time is a constraint, applying schema changes on an asynchronous cluster and then switching over to it sometimes pays for itself.
  • Configuration changes or upgrades that require a cluster restart can take seconds or even minutes. Wouldn’t it be nice if you already had a cluster with all these changes ready to take over at a moment’s notice? Or even fail back to the original if there was an unforeseen issue?
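A rough sketch of the first bullet (host, schema, table, and column are placeholders); the same change could just as well be applied on an asynchronous cluster first and then switched over to:

pt-online-schema-change \
    --host=cluster-b.cluster-xxxx.us-east-1.rds.amazonaws.com \
    --user=percona --ask-pass \
    --alter "ADD COLUMN archived_at DATETIME NULL" \
    --chunk-size=1000 \
    --max-load Threads_running=30 \
    --execute \
    D=appdb,t=orders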

Stay “tuned” for part three.

Meanwhile, we’d like to hear your success stories in Amazon Aurora in the comments below!

Don’t forget to come by and see us at AWS re:Invent, November 26-30, 2018 in booth 1605! Percona CEO Peter Zaitsev will deliver a keynote on MySQL High Availability & Disaster Recovery, Tuesday, November 27 at 1:45 PM – 2:45 PM in the Bellagio Hotel, Level 1, Gauguin 2
