Nov
24
2021
--

Querying Archived RDS Data Directly From an S3 Bucket

querying archived rds data from s3 bucket.png

querying archived rds data from s3 bucket.pngA recommendation we often give to our customers is along the lines of “archive old data” to reduce your database size. There is a tradeoff between keeping all our data online and archiving part of it to cold storage.

There could also be legal requirements to keep certain data online, or you might want to query old data occasionally without having to go through the hassle of restoring an old backup.

In this post, we will explore a very useful feature of AWS RDS/Aurora that allows us to export data to an S3 bucket and run SQL queries directly against it.

Archiving Data to S3

Let’s start by describing the steps we need to take to put our data into an S3 bucket in the required format, which is called Apache Parquet.

Amazon states the Parquet format is up to 2x faster to export and consumes up to 6x less storage in S3, compared to other text formats.

1. Create a snapshot of the database (or select an existing one)

2. Create a customer-managed KMS key to encrypt the exported data

Archiving Data to S3

 

3. Create an IAM role (e.g. exportrdssnapshottos3role)

4. Create an IAM policy for the export task and assign it to the role

{
    "Version": "2012-10-17",
    "Id": "Policy1636727509941",
    "Statement": [
        {
            "Sid": "Stmt1636727502144",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789:role/service-role/exportrdssnapshottos3role"
            },
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::test-athena",
                "arn:aws:s3:::test-athena/exports/*"
            ]
        }
    ]
}

5. Optional: Create an S3 bucket (or use an existing one)

6. Set a bucket policy to allow the IAM role to perform the export e.g.:

{
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
            "Service": "export.rds.amazonaws.com"
          },
         "Action": "sts:AssumeRole"
       }
     ] 
   }

7. Export the snapshot to Amazon S3 as an Apache Parquet file. You can choose to export specific sets of databases, schemas, or tables

 

Export the snapshot to Amazon S3

IAM role

Querying the Archived Data

When you need to access the data, you can use Amazon Athena to query the data directly from the S3 bucket.

1. Set a query result location

Amazon Athena

2. Create an external table in Athena Query editor. We need to map the MySQL column types to equivalent types in Parquet

CREATE EXTERNAL TABLE log_requests (
  id DECIMAL(20,0),
  name STRING, 
  is_customer TINYINT,
  created_at TIMESTAMP,
  updated_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://aurora-training-s3/exports/2021/log/'
tblproperties ("parquet.compression"="SNAPPY");

3. Now we can query the external table from the Athena Query editor

SELECT name, COUNT(*)
FROM log_requests
WHERE created_at >= CAST('2021-10-01' AS TIMESTAMP)
  AND created_at < CAST('2021-11-01' AS TIMESTAMP)
GROUP BY name;

Removing the Archived Data from the Database

After testing that we can query the desired data from the S3 bucket, it is time to delete archived data from the database for good. We can use the pt-archiver tool for this task.

Having a smaller database has several benefits. To name a few: your backup/restore will be faster, you will be able to keep more data in memory so response times improve, you may even be able to scale down your server specs and save some money.

Complete the 2021 Percona Open Source Data Management Software Survey

Have Your Say!

Aug
26
2020
--

Creating an External Replica of AWS Aurora MySQL with Mydumper

Oftentimes, we need to replicate between Amazon Aurora and an external MySQL server. The idea is to start by taking a point-in-time copy of the dataset. Next, we can configure MySQL replication to roll it forward and keep the data up-to-date.

This process is documented by Amazon, however, it relies on the mysqldump method to create the initial copy of the data. If the dataset is in the high GB/TB range, this single-threaded method could take a very long time. Similarly, there are ways to improve the import phase (which can easily take 2x the time of the export).

Let’s explore some tricks to significantly improve the speed of this process.

Preparation Steps

The first step is to enable binary logs in Aurora. Go to the Cluster-level parameter group and make sure binlog_format is set to ROW. There is no log_bin option in Aurora (in case you are wondering), simply setting binlog_format is enough. The change requires a restart of the writer instance, so it, unfortunately, means a few minutes of downtime.

We can check if a server is generating binary logs as follows:

mysql> SHOW MASTER LOGS;

+----------------------------+-----------+
| Log_name                   | File_size |
+----------------------------+-----------+
| mysql-bin-changelog.034148 | 134219307 |
| mysql-bin-changelog.034149 | 134218251 |
...

Otherwise, you will get an error:

ERROR 1381 (HY000): You are not using binary logging

We also need to ensure a proper binary log retention period. For example, if we expect the initial data export/import to take one day, we can set the retention period to something like three days to be on the safe side. This will help ensure we can roll forward the restored data.

mysql> call mysql.rds_set_configuration('binlog retention hours', 72);
Query OK, 0 rows affected (0.27 sec)

mysql> CALL mysql.rds_show_configuration;
+------------------------+-------+------------------------------------------------------------------------------------------------------+
| name                   | value | description                                                                                          |
+------------------------+-------+------------------------------------------------------------------------------------------------------+
| binlog retention hours | 72    | binlog retention hours specifies the duration in hours before binary logs are automatically deleted. |
+------------------------+-------+------------------------------------------------------------------------------------------------------+
1 row in set (0.25 sec)

The next step is creating a temporary cluster to take the export. We need to do this for a number of reasons: first to avoid overloading the actual production cluster by our export process, also because mydumper relies on FLUSH TABLES WITH READ LOCK to get a consistent backup, which in Aurora is not possible (due to the lack of SUPER privilege).

Go to the RDS console and restore a snapshot that was created AFTER the date/time where you enabled the binary logs. The restored cluster should also have binlog_format set, so select the correct Cluster parameter group.

Next, capture the binary log position for replication. This is done by inspecting the Recent events section in the console. After highlighting your new temporary writer instance in the console, you should see something like this:

Binlog position from crash recovery is mysql-bin-changelog.034259 32068147

So now we have the information to prepare the CHANGE MASTER command to use at the end of the process.

Exporting the Data

To get the data out of the temporary instance, follow these steps:

  1. Backup the schema
  2. Save the user privileges
  3. Backup the data

This gives us added flexibility; we can do some schema changes, add indexes, or extract only a subset of the data.

Let’s create a configuration file with the login details, for example:

tee /backup/aurora.cnf <<EOF
[client]
user=percona
password=percona
host=percona-tmp.cgutr97lnli6.us-west-1.rds.amazonaws.com
EOF

For the schema backup, use mydumper to do a no-rows export:

mydumper --no-data \
--triggers \
--routines \
--events \
-v 3 \
--no-locks \
--outputdir /backup/schema \
--logfile /backup/mydumper.log \
--regex '^(?!(mysql|test|performance_schema|information_schema|sys))' \
--defaults-file /backup/aurora.cnf

To get the user privileges I normally like to use pt-show-grants. Aurora is, however, hiding the password hashes when you run SHOW GRANTS statement, so pt-show-grants will print incomplete statements e.g.:

mysql> SHOW GRANTS FOR 'user'@'%';
+---------------------------------------------------------+
| Grants for user@%                                       |
+---------------------------------------------------------+
| GRANT USAGE ON *.* TO 'user'@'%' IDENTIFIED BY PASSWORD |
| GRANT SELECT ON `db`.* TO 'user'@'%'                    |
+---------------------------------------------------------+

We can still gather the hashes and replace them manually in the pt-show-grants output if there is a small-ish number of users.

pt-show-grants --user=percona -ppercona -hpercona-tmp.cgutr97lnli6.us-west-1.rds.amazonaws.com  > grants.sql

mysql> select user, password from mysql.user;

Finally, run mydumper to export the data:

mydumper -t 8 \
--compress \
--triggers \
--routines \
--events \
—-rows=10000000 \
-v 3 \
--long-query-guard 999999 \
--no-locks \
--outputdir /backup/export \
--logfile /backup/mydumper.log \
--regex '^(?!(mysql|test|performance_schema|information_schema|sys))' \
-O skip.txt \
--defaults-file /backup/aurora.cnf

The number of threads should match the number of CPUs of the instance running mydumper. In the skip.txt file, you can include any tables that you don’t want to copy. The –rows argument will give you the ability to split tables in chunks of X number of rows. Each chunk can run in parallel, so it is a huge speed bump for big tables.

Importing the Data

We need to stand up a MySQL instance to do the data import. In order to speed up the process as much as possible, I suggest doing a number of optimizations to my.cnf as follows:

[mysqld]
pid-file=/var/run/mysqld/mysqld.pid
log-error=/var/log/mysqld.log
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
log_slave_updates
innodb_buffer_pool_size=16G
binlog_format=ROW
innodb_log_file_size=1G
innodb_flush_method=O_DIRECT
innodb_flush_log_at_trx_commit=0
server-id=1000
log-bin=/log/mysql-bin
sync_binlog=0
master_info_repository=TABLE
relay_log_info_repository=TABLE
query_cache_type=0
query_cache_size=0
innodb_flush_neighbors=0
innodb_io_capacity_max=10000
innodb_stats_on_metadata=off
max_allowed_packet=1G
net_read_timeout=60
performance_schema=off
innodb_adaptive_hash_index=off
expire_logs_days=3
sql_mode=NO_ENGINE_SUBSTITUTION
innodb_doublewrite=off

Note that mydumper is smart enough to turn off the binary log for the importer threads.

After the import is complete, it is important to revert these settings to “safer” values: innodb_doublewriteinnodb_flush_log_at_trx_commit, sync_binlog, and also enable performance_schema again.

The next step is to create an empty schema by running myloader:

myloader \
-d /backup/schema \
-v 3 \
-h localhost \
-u root \
-p percona

At this point, we can easily introduce modifications like adding indexes, since the tables are empty. We can also restore the users at this time:

(echo "SET SQL_LOG_BIN=0;" ; cat grants.sql ) | mysql -uroot -ppercona -f

Now we are ready to restore the actual data using myloader. It is recommended to run this inside a screen session:

myloader -t 4 \
-d /backup/export \
-q 100 \
-v 3 \
-h localhost \
-u root \
-p percona

The rule of thumb here is to use half the number of vCPU threads. I also normally like to reduce mydumper default transaction size (1000) to avoid long transactions, but your mileage may vary.

After the import process is done, we can leverage faster methods (like snapshots or Percona Xtrabackup) to seed any remaining external replicas.

Setting Up Replication

The final step is setting up replication from the actual production cluster (not the temporary one!) to your external instance.

It is a good idea to create a dedicated user for this process in the source instance, as follows:

CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';

Now we can start replication, using the binary log coordinates that we captured before:

CHANGE MASTER TO MASTER_HOST='aurora-cluster-gh5s6lnli6.us-west-1.rds.amazonaws.com', MASTER_USER='repl', MASTER_PASSWORD='percona', MASTER_LOG_FILE='mysql-bin-changelog.034259', MASTER_LOG_POS=32068147;
START SLAVE;

Final Words

Unfortunately, there is no quick and easy method to get a large dataset out of an Aurora cluster. We have seen how mydumper and myloader can save a lot of time when creating external replicas, by introducing parallel operations. We also reviewed some good practices and configuration tricks for speeding up the data loading phase as much as possible.


Optimize your database performance with Percona Monitoring and Management, a free, open source database monitoring tool. Designed to work with Amazon RDS MySQL and Amazon Aurora MySQL with a specific dashboard for monitoring Amazon Aurora MySQL using Cloudwatch and direct sampling of MySQL metrics.

Visit the Demo

Jan
29
2020
--

Greylock’s Reid Hoffman and Sarah Guo to talk fundraising at Early Stage SF 2020

Early Stage SF is around the corner, on April 28 in San Francisco, and we are more than excited for this brand new event. The intimate gathering of founders, VCs, operators and tech industry experts is all about giving founders the tools they need to find success, no matter the challenge ahead of them.

Struggling to understand the legal aspects of running a company, like negotiating cap tables or hiring international talent? We’ve got breakout sessions for that. Wondering how to go about fundraising, from getting your first yes to identifying the right investors to planning the timeline for your fundraise sprint? We’ve got breakout sessions for that. Growth marketing? PR/Media? Building a tech stack? Recruiting?

We. Got. You.

Hoffman + Guo

Today, we’re very proud to announce one of our few Main Stage sessions that will be open to all attendees. Reid Hoffman and Sarah Guo will join us for a conversation around “How To Raise Your Series A.”

Reid Hoffman is a legendary entrepreneur and investor in Silicon Valley. He was an Executive VP and founding board member at PayPal before going on to co-found LinkedIn in 2003. He led the company to profitability as CEO before joining Greylock in 2009. He serves on the boards of Airbnb, Apollo Fusion, Aurora, Coda, Convoy, Entrepreneur First, Microsoft, Nauto and Xapo, among others. He’s also an accomplished author, with books like “Blitzscaling,” “The Startup of You” and “The Alliance.”

Sarah Guo has a wealth of experience in the tech world. She started her career in high school at a tech firm founded by her parents, called Casa Systems. She then joined Goldman Sachs, where she invested in growth-stage tech startups such as Zynga and Dropbox, and advised both pre-IPO companies (Workday) and publicly traded firms (Zynga, Netflix and Nvidia). She joined Greylock Partners in 2013 and led the firm’s investment in Cleo, Demisto, Sqreen and Utmost. She has a particular focus on B2B applications, as well as infrastructure, cybersecurity, collaboration tools, AI and healthcare.

The format for Hoffman and Guo’s Main Stage chat will be familiar to folks who have followed the investors. It will be an updated, in-person combination of Hoffman’s famously annotated LinkedIn Series B pitch deck that led to Greylock’s investment, and Sarah Guo’s in-depth breakdown of what she looks for in a pitch.

They’ll lay out a number of universally applicable lessons that folks seeking Series A funding can learn from, tackling each from their own unique perspectives. Hoffman has years of experience in consumer-focused companies, with a special expertise in network effects. Guo is one of the top minds when it comes to investment in enterprise software.

We’re absolutely thrilled about this conversation, and to be honest, the entire Early Stage agenda.

How it works

Here’s how it all works:

There will be about 50+ breakout sessions at the event, and attendees will have an opportunity to attend at least seven. The sessions will cover all the core topics confronting early-stage founders — up through Series A — as they build a company, from raising capital to building a team to growth. Each breakout session will be led by notables in the startup world.

Don’t worry about missing a breakout session, because transcripts from each will be available to show attendees. And most of the folks leading the breakout sessions have agreed to hang at the show for at least half the day and participate in CrunchMatch, TechCrunch’s app to connect founders and investors based on shared interests.

Here’s the fine print. Each of the 50+ breakout sessions is limited to around 100 attendees. We expect a lot more attendees, of course, so signups for each session are on a first-come, first-serve basis. Buy your ticket today and you can sign up for the breakouts that we’ve announced. Pass holders will also receive 24-hour advance notice before we announce the next batch. (And yes, you can “drop” a breakout session in favor of a new one, in the event there is a schedule conflict.)

Grab yourself a ticket and start registering for sessions right here. Interested sponsors can hit up the team here.


Jun
25
2019
--

Adaptive Hash Index on AWS Aurora

Adaptive Hash Index on AWS Aurora

Adaptive Hash Index on AWS AuroraRecently I had a case where queries against Aurora Reader were 2-3 times slower than on the Writer node. In this blog post, we are going to discuss why.

I am not going to go into the details of how Aurora works, as there are other blog posts discussing that. Here I am only going to focus on one part.

The Problem

My customer reported there is a huge performance difference between the Reader and the Writer node just by running selects. I was a bit surprised, as the select queries should run locally on the reader node, the dataset could fit easily in memory, there were no reads on disk level, and everything looked fine.

I was trying to rule out every option when one of my colleagues mentioned I should have a look at the InnoDB_Adaptive_Hash_Indexes. He was right – it was disabled on the Reader nodes, I could see it on the console.

Let’s enable the adaptive hash index

I opened the control panel and I was checking the parameter groups, but the adaptive hash index was already enabled. Ok, I might have made a mistake but I double checked myself many times and it was true. Adaptive hash was disabled on the console but enabled on the control panel. That means the AWS control panel is lying!

I have restarted the nodes multiple times, and I have created new test clusters, etc… but I was not able to enable adaptive hash on the Reader node. It was enabled on the Writer node, and it was working.

Is this causing the performance difference?

Because I was able to enable or disable the adaptive hash index on the Writer node, I continued my tests there and I could confirm that when I disabled it the queries got slower. Same speed as on the Reader node. When I enabled,  AHI queries got faster.

In general with AHI on the Writer node, the customer’s queries were running 2 times faster.

AHI can help for many workloads but not all of them, and you should test your queries/workload both with and without AHI.

Why is it disabled on the Reader?

I have to be honest because I am not an AWS engineer and I do not know the code of Aurora, but I am only guessing here and I might be wrong.

Why can I change it in the parameter group?

We can modify the adaptive hash in the parameter groups, but there is no impact on the Reader nodes at all. Many customers could think they have AHI enabled but actually, they don’t. I think this is a bad practice because if we cannot enable it on the Reader node we should not be able to change it on the control panel.

Is this causing any performance problems for me?

If you are using the Reader node for selects queries, which are based on secondary keys, you are probably suffering from this but it depends on your workload if it is impacting your performance or not. In my customer’s case, the difference was 2 times slower without AHI.

But I want fast queries!

If your queries heavily benefit from AHI, you should run your queries on the Writer node or even on an async slave, or have a look on AWS RDS which does not have this limitation or use EC2 instances. You could also check query cache in Aurora.

Query Cache

In Aurora, they reworked the Query Cache which does not have the limitations like in Community Edition or in Percona Server.  Cacheable queries take out an “exclusive lock” on MySQL’s query cache. In the real world, that means only one query can use the Query Cache at a time and all the other queries have to wait for the mutex. Also in MySQL 8.0 they completely removed the Query Cache.

But in Aurora they redesigned it and they removed this limitation – there is no single global mutex on the Query Cache anymore. I think one of the reasons for this is could be because they knew that Adaptive Hash won’t work.

Does AWS know about this?

I have created a ticket to AWS engineers to get some feedback on this, and they verified my findings and have confirmed Adaptive Hash Index cannot be enabled on the Reader nodes. They are looking into why we can modify it on the control panel.

Conclusion

I would recommend checking your queries on your Reader nodes to make sure they perform well and compare the performance with the Writer node. At this moment, we cannot enable AHI on Reader nodes, and I am not sure if that will change any time soon. But this can impact the performance in some cases, for sure.

Sep
21
2017
--

Percona Support with Amazon RDS

Amazon RDS

This blog post will give a brief overview of Amazon RDS capabilities and limitations, and how Percona Support can help you succeed in your Amazon RDS deployments.

One of the common questions that we get from customers and prospective customers is about Percona Support with Amazon RDS. As many companies have shifted to the cloud, or are considering how to do so, it’s natural to try to understand the limitations inherent in different deployment strategies.

Why Use Amazon RDS?

As more companies move to using the cloud, we’ve seen a shift towards work models in technical teams that require software developers to take on more operational duties than they have traditionally. This makes it essential to abstract infrastructure so it can be interacted with as code, whether through automation or APIs. Amazon RDS presents a compelling DBaaS product with significant flexibility while maintaining ease of deployment.

Use Cases Where RDS Isn’t a Fit

There are a number of use cases where the inherent limitations of RDS make it not a good fit. With RDS, you are trading off the flexibility to deploy complex environment topologies for the ease of deploying with the push of a button, or a simple API call. RDS eliminates most of the operational overhead of running a database in your environment by abstracting away the physical or virtual hardware and the operating system, networking and replication configuration. This, however, means that you can’t get too fancy with replication, networking or the underlying operating system or hardware.

When Using RDS, Which Engine is Right For Me?

Amazon’s RDS has numerous database engines available, each suited to a specific use case. The three RDS database engines we’ll be discussing briefly here are MySQL, MariaDB and Aurora.

Use MySQL when you have an application tuned for MySQL, you need to use MySQL plug-ins or you wish to maintain compatibility to support external replicas in EC2. MySQL with RDS has support for Memcached, including plug-in support and 5.7 compatible query optimizer improvements. Unfortunately, thread pooling and similar features that are available in Percona Server for MySQL are not currently available in the MySQL engine on RDS.

Use MariaDB when you have an application that requires features available for this engine but not in others. Currently, MariaDB engines in RDS support thread pooling, table elimination, user roles and virtual columns. MySQL or Aurora don’t support these. MariaDB engines in RDS support global transaction IDs (GTIDs), but they are based on the MariaDB implementation. They are not compatible with MySQL GTIDs. This can affect replication or migrations in the future.

Use Aurora when you want a simple-to-setup solution with strong availability guarantees and minimal configuration. This RDS database engine is cloud-native, built with elasticity and the vagaries of running in a distributed infrastructure in mind. While it does limit your configuration and optimization capabilities more than other RDS database engines, it handles a lot of things for you – including ensuring availability. Aurora automatically detects database crashes and restarts without the need for crash recovery or to rebuild the database cache. If the entire instance fails, Aurora automatically fails over to one of up to 15 read replicas.

So If RDS Handles Operations, Why Do I Need Support?

Generally speaking, properly using a database implies four quadrants of tasks. RDS only covers one of these four quadrants: the operational piece. Your existing staff (or another provider such as Percona) must cover each of the remaining quadrants.

Amazon RDS
Amazon RDS

The areas where people run into trouble are slow queries, database performance not meeting expectations or other such issues. In these cases they often can contact Amazon’s support line. The AWS Support Engineers are trained and focused on addressing issues specific to the AWS environment, however. They’re not DBAs and do not have the database expertise necessary to fully troubleshoot your database issues in depth. Often, when an RDS user encounters a performance issue, the first instinct is to increase the size of their AWS deployment because it’s a simple solution. A better path would be investigating performance tuning. More hardware is not necessarily the best solution. You often end up spending far more on your monthly cloud hosting bill than necessary by ignoring unoptimized configurations and queries.

As noted above, when using MariaDB or MySQL RDS database engines you can make use of plug-ins and inject additional configuration options that aren’t available in Aurora. This includes the ability to replicate to external instances, such as in an EC2 environment. This provides more configuration flexibility for performance optimization – but does require expertise to make use of it.

Outside support vendors (like Percona) can still help you even when you eliminate the operational elements by lending the expertise to your technical teams and educating them on tuning and optimization strategies.

Aug
09
2017
--

How to Configure Aurora RDS Parameters

Aurora RDS Parameters

Aurora RDS ParametersIn this blog post, we’ll look at some tips on how to configure Aurora RDS parameters.

I was recently deploying a few Aurora RDS instances, a process very similar to configuring a regular RDS instance. I noticed a few minor differences in the way you configure Aurora RDS parameters, and very few articles on how the commands should be structured (for RDS as well as Aurora). The only real literature available is the official Amazon RDS documentation.

This blog provides a concise “how-to” guide to quickly change Aurora RDS parameters using the AWS CLI. Aurora retains the parameter group model introduced with RDS, with new instances having the default read only parameter groups. For a new instance, you need to create and allocate a new parameter group (this requires a DB reboot). After that, you can apply changes to dynamic variables immediately. In other words, the first time you add the DB parameter group you’ll need to reboot even if the variable you are configuring is dynamic. It’s best to create a new DB parameter group when initializing your clusters. Nothing stops you from adding more than one host to the same DB Parameter Group rather than creating one per instance.

In addition to the DB Parameter Group, each instance is also allocated a DB Cluster Parameter Group. The DB Parameter Group is used for instance-level parameters, while the DB Cluster Parameter Group is used for cluster-level parameters (and applies to all instances in a cluster). You’ll find some of the MySQL engine variables can only be found in the DB Cluster Parameter Group. Here you will find a handy reference of all the DB cluster and DB instance parameters that are viewable or configurable for Aurora instances.

To run these commands, you’ll need to have the “aws” cli tool installed and configured. Note that the force-failover option used for RDS instances doesn’t apply to Aurora. You should perform either a controlled failover or let Aurora handle this. Also, the group family to use for Aurora is “oscar5.6”. The commands to set this up are as follows:

aws rds create-db-parameter-group
    --db-parameter-group-name percona-opt
    --db-parameter-group-family oscar5.6
    --description "Percona Optimizations"
aws rds modify-db-parameter-group
    --db-parameter-group-name percona-opt
    --parameters "ParameterName=max_connections,ParameterValue=5000,ApplyMethod=immediate"
# For each instance-name:
aws rds modify-db-instance --db-instance-identifier <instance-name>
    --db-parameter-group-name=percona-opt
aws rds reboot-db-instance
    --db-instance-identifier <instance-name>

Once you create the initial DB parameter group, configure the variables as follows:

aws rds modify-db-parameter-group
    --db-parameter-group-name <instance-name>
    --parameters "ParameterName=max_connect_errors,ParameterValue=999999,ApplyMethod=immediate"
aws rds modify-db-parameter-group
    --db-parameter-group-name <instance-name>
    --parameters "ParameterName=max_connect_errors,ParameterValue=999999,ApplyMethod=immediate"
## Verifying change:
aws rds describe-db-parameters
      --db-parameter-group-name aurora-instance-1
      | grep -B7 -A2 'max_connect_errors'

Please keep in mind, it can take a few seconds to propagate changes to nodes. Give it a moment before checking the values with “show global variables”. You can configure the DB Cluster Parameter group similarly, for example:

# Create a new db cluster parameter group
aws rds create-db-cluster-parameter-group --db-cluster-parameter-group-name percona-cluster --db-parameter-group-family oscar5.6 --description "new cluster group"
# Tune a variable on the db cluster parameter group
aws rds modify-db-cluster-parameter-group --db-cluster-parameter-group-name percona-cluster --parameters "ParameterName=innodb_flush_log_at_trx_commit,ParameterValue=2,ApplyMethod=immediate"
# Allocate the new db cluster parameter to your cluster
aws rds modify-db-cluster --db-cluster-identifier <cluster_identifier> --db-cluster-parameter-group-name=percona-cluster
# And of course, for viewing the cluster parameters
aws rds describe-db-cluster-parameters --db-cluster-parameter-group-name=percona-cluster

I hope you find this article useful, please make sure to share with the community!

Nov
11
2016
--

Amazon AWS Service Tiers

Amazon AWS Service TiersThis blog post discusses the differences between the Amazon AWS service tiers.

Many people want to move to an Amazon environment but are unsure what AWS service makes the most sense (EC2, RDS, Aurora). For database services, the tiering at Amazon starts with EC2, then moves up to RDS, and on to Aurora. Amazon takes on more of the implementation and management of the database As you move up the tiers. This limits the optimization options. Obviously, moving up the tiers increases basic costs, but there are tradeoffs at each level to consider.

  • EC2 (Elastic Compute Cloud) is a basic cloud platform. It provides the user with complete control of the compute environment, while reducing your need to monitor and manage hardware. From a database perspective, you can do almost anything in EC2 that you could do running a database on your own hardware. You can tweak OS and database settings, plus do all of the normal database optimization work you would do in a bare metal environment. In EC2, you can run a single server, master/slave, or a cluster, and you can use MySQL, MongoDB, or any other product. You can use AWS Snapshot Manager to take backups, or you can use another backup tool. This option is ideal if you want all the flexibility of running your own hardware without the hassles of daily hardware maintenance.
  • RDS (Relational Data Service) makes it easy to set up a relational database in the cloud. It offers similar resizing capabilities to EC2, but also automates a lot of tasks. RDS supports Aurora (more on that later), Postgres, MySQL, MariaDB, Oracle, and MSSQL. RDS simplifies deployment and automates some maintenance tasks. This means that you are limited in terms of the tweaks that you can implement at the OS and database configuration level. This means you will focus on query and schema changes to optimize a database in this environment. RDS also includes automated backups and provides options for read replicas that you can spread across multiple availability zones. You must consider and manage all these are all items in the EC2 world. This choice is great if you are looking to implement a database but don’t want (or know how) to take on a lot of the tasks, such as backups and replication setup, that are needed for a stable and highly available environment.
  • Aurora is one of the database options available through RDS. You might hear people refer to it either as Aurora or RDS Aurora (they’re both the same). With Aurora, Amazon takes on even more of the configuration and management options. This limits your optimization capabilities even more. It also means that there are far fewer things to worry about since Amazon handles so much of the administration. Aurora is MySQL-compatible, and is great if you want the power and convenience of MySQL with a minimum of effort on the hardware side. Aurora is designed to automatically detect database crashes and restart without the need for crash recovery or to rebuild the database cache. If the entire instance fails, Aurora will automatically failover to one of up to 15 read replicas.

With data in the cloud, security becomes a bigger concern. You continue to govern access to your content, platform, applications, systems ,and networks, just like you would with data stored in your own datacenter. Amazon’s cloud offerings also support highly secure environments, like HIPAA and PCI compliance. They have designed the cloud environment to be a secure database environment while maintaining the necessary access for use and administration, even in these more regulated environments.

Storing data in the cloud is becoming more common. Amazon offers multiple platform options and allows for easy scalability, availability, and reliability.

May
26
2016
--

AWS Aurora Benchmarking part 2

AWS-Aurora-Benchmarking

AWS Aurora Benchmarking

Some time ago, I published the article on AWS Aurora Benchmarking (AWS Aurora Benchmarking – Blast or Splash?), in which I analyzed the behavior of different solutions using synchronous replication in AWS environment. This blog follows up with some of the comments and suggestions I received regarding that post from the community and Amazon engineers.

I decided to perform another round of tests, keeping in mind comments and suggestions received.

I presented some of the results during the Percona conference in Santa Clara last April 2016. The following is the transposition that presentation, with more details.

Not interested in the preliminary descriptions? Go to the results section

Why new tests?

A very good question, with an easy answer.

Aurora is a product that is still under development and refinement: six months of development could present major changes in performance. Not only that, but the initial tests focused on entry-level solutions, meaning I was analyzing the kind of users that are currently starting their business and looking for a flexible solution that allows them to save money and scale.

This time, I put the focus on enterprise solutions by analyzing what an already well-established company would get when looking for a decent scalable solution.

These are two different scenarios.

Why so many (different) tests?

I used many different benchmarking tools, and I am still planning to run others. Why so? Why not simply use one of them?

Again, a simple answer. I used different tools because in some cases, they provide me a different way of accessing and using data. I also do not trust benchmarking tools, not even the ones I developed. I wanted to test the same thing using different tools and compare the results. ONLY if I see a common pattern, then would I consider the test valid. Personally, I tend to discard any test that is not consistent, or if the analysis performed is using a single benchmarking tool. In my opinion, being lazy is not an option when doing these kind of exercises.

About the tests

It was difficult to compare apples to apples here. And I think that is the main point to keep in mind.

Aurora is not a standard RDS solution, like we are used to. Aurora looks like MySQL, smells like MySQL, but is not vanilla MySQL. To achieve what they have, the engineers had to change many parts. The more you dig in, the more you realize there are significant differences.

Because of that, I had to focus more on identifying what each solution can do and compare the solutions against expectations, rather than comparing the numbers.

I was more interested to see what happen if:

  • I have a burst of connections, and my application goes from 4K to 40K connections. Will it crash? Will it slow down?
  • How long should I wait if a node fails?
  • What should I not have in my schema design, to prevent bottlenecks?

Those are relevant questions in my opinion, more so than discovering that solution A has 3000 rows written/sec, and solution B has 3100. Or that I might (might) have some additional page rotation, file -> memory-> flushes because the amount of memory differs.

That is valuable information, for sure, but less valuable than having a decent understanding of which platform will help my business grow and remain stable.

What is the right tool for the job? This is the question I am addressing.

Tests run

I had run three main kinds of tests:

  • Performance and load stress
  • High availability failover
  • Response time (latency) from the application point of view

Performance and load stress

These tests were the most extensive and demanding.

I analyzed the capacity to serve the load under different conditions, from a light load up to full utilization, and some degree of resource saturation.

  • The first set of tests were to evaluate a simple load on a single table, causing the table to become a hotspot and showing how the platform would manage the increasing contention.
  • The second set of tests were to perform a similar load, but distributing it cross multiple tables and batching the operations. Parallelization, contention, scalability and distributed hotspots were in the picture.

The two above focused on write operations only, and were done using different tools (comparing the results as they were complementary).

  • Third set of tests, using my own stress tool, were focused on R/W oriented usage. The tests were executed against multiple tables, performing CRUD actions, using simple and batch insert, reads by PK, index, by range, IN and exact match conditions.
  • The fourth set of tests were performed using a TPC-C like load (OLTP).
  • The fifth set of tests were using sysbench in OLTP mode, with 250 tables.

The scope of the last three set of tests was to identify how the platforms would manage the load, considering the following:

  • Read and write contention on the same tables
  • High level of parallelism (from the application)
  • Possible hot-spots (TPCC district)
  • Increasing utilization (memory, threads, IO)
  • Saturation (connections)

Finally, all tests were run with fully utilized BufferPool.

The machines

Small boxes (first round of tests):

EIP = 1
VPC = 1
ELB=1
Subnets = 4 (1 public, 3 private)
HAProxy = 6
MHA Monitor (micro ec2) = 1
NAT Instance (EC2) =1 (hosting EIP)
DB Instances (EC2) = 3 (m4.xlarge) 16GB
Application Instances (EC2) = 6 (4)
EBS SSD 3000 PIOS
Aurora RDS node = 3 (db.r3.xlarge) 30GB

Large boxes (latest tests):

EIP = 1
VPC = 1
ELB=1
Subnets = 4 (1 public, 3 private)
HAProxy = 4
MHA Monitor (micro ec2) = 1
NAT Instance (EC2) =1 (hosting EIP)
DB Instances (EC2) = 3 (c3.8xlarge) 60GB
Application Instances (EC2) = 4
EBS SSD 5000 PIOS
Aurora RDS node = 3 (db.r3.8xlarge) 244GB

A note

It was pointed out to me that I deliberately chose to use an Ec2 solution for Percona XtraDB Cluster with less memory than the one available in Aurora. This is true, and we must take that into consideration. The reason for this is that the only Ec2 solution matching the memory of a db_r3.8xlarge is the d2.8xlarge.

I did try it, but the level of scalability I got (from the CPU point of view) was less efficient than the one available with c3.8xlarge. I decided to prefer CPU resources to memory, especially because I was going to test concurrency and parallelism in conjunction with the load increase.

From the result, I feel confident that I chose correctly – but I am open to comment.

The layout

This is what the setup looks like:

AWS Aurora Benchmarking

Where you read Java, those are the application nodes running the different test applications.

Two words about Aurora first

Aurora has a few key concepts that we must have clearly in mind, especially how it manages the writes across replica, and how connections are implemented.

The IO activity

To replicate the information across the different storage, Aurora only replicates FRM files and data coming from IB_LOGS. This is a quite significant advantage to other forms of replication, given the limited number of bytes that are replicated over the network (and also if they are replicated six times).

AWS Aurora Benchmarking

Another significant advantage is that Aurora does not use a double write buffer, which is obviously another blast (see the recent optimization in Percona Server https://www.percona.com/blog/2016/05/09/percona-server-5-7-parallel-doublewrite/ ).

In other words, writes in Aurora are organized by filling its commit queue and pushing the changes as group commit to the storage.

AWS Aurora Benchmarking

In some presentations, you might have seen that all steps are asynchronous. But is important to underline that a commit is acknowledged by Aurora when at least two availability zones (AZ) have received and written the incoming data related to that commit. Writes here mean received in the storage node incoming queue and with a quorum of four over six nodes.

This means that no matter what, data has to travel on the network to reach the final destination, and ACK signals come back before Aurora returns the ACK to the commit operation. The network is in the same region, but still it could represent an incognita about performance. No wonder we could have some latency at this stage!

As you can see, what I am reporting is also confirmed in the image below (and in the observations). The point is that the impact of steps 1 – 2 is not obviously clear.

AWS Aurora Benchmarking

Thread pooling

Aurora also use thread pooling – a lot! That will become very clear later, and as more of the work is based on parallelism, the more efficient thread pooling seems to be.

In most cases we are used to seeing CPUs on database servers not fully utilized, unless there is some heavy ordering operation or a bad query. That behavior is also (not only) a direct consequence of the connection-to-thread model, which implies a period of latency and stand by. In Aurora, the incoming connections are not following the same model. Instead, the pool redistributes the load of the incoming connection to a pool of threads, optimizing the latency period, resulting in a higher CPU utilization. Which is what you want from your resource: to be utilized and not waiting for something else to do its job.

AWS Aurora Benchmarking

 

The results

Without wasting more electronic ink, let see what comes out of this round of tests (not the final one by the way). To simplify the results, I will also report the graphs from the first set of tests, but will focus on the latest.Small Boxes = SB, Large Boxes LB.

Small Boxes = SB, Large Boxes = LB.

First Test: IIBench

As declared previously, my scope was to verify how the two platforms would have reacted to a simple load focus on inserts with a basic single table. The bufferpool was saturated before running the test.

SB

AWS Aurora Benchmarking

LB

AWS Aurora Benchmarking

As we can see, in the presence of a hot spot the solution using Percona XtraDB Cluster outperformed Aurora, in both cases. What is notable, though, is that while XtraDB Cluster remained approximately around the same time/performance, Aurora is significantly reduced the time taken. This shows that Aurora was taking advantage of the more powerful platform, while XtraDB Cluster was not able to.

With further analyzation of the details, we notice that Aurora performs better atomically. It was able to manage more writes/second as well as rows and pages managed. But it was inconsistent: Aurora had performance hiccups at regular intervals. As such the final result was that it took more time to process the whole workload.

I was not able to dig to deeply, given some metrics are not fully available in Aurora. As such I had to rely fully on Aurora engineers, who mentioned to me that hot-spot contention was a possible issue.

Aurora Handler calls:

AWS Aurora Benchmarking

XtraDB Cluster Handlers:

AWS Aurora Benchmarking

The execution in XtraDB Cluster showed fewer calls but constant performance, while Aurora has hiccups.

Aurora page activity write:

AWS Aurora Benchmarking

XtraDB Cluster page activity write:

AWS Aurora Benchmarking

The trend shown by the handlers stayed consistent in the page management and rows insert, as expected.

Second Test: Application Ingest

As mentioned, this test showed many threads from different application servers, inserted by a batch of 50 statements against multiple tables.

The results coming from this test are quite favorable to Aurora, as we can see starting from the time taken to complete the same workload:

LB

AWS Aurora Benchmarking

SB

AWS Aurora Benchmarking

With small ones, the situation was inverted.

But here is where the interesting part starts.

Aurora can manage significantly higher numbers of rows, as the picture below shows:

AWS Aurora Benchmarking

The results are also constant, and don’t decrease significantly like the inserts with XtraDB Cluster.

The number of handler commits, however, are significantly less.

AWS Aurora Benchmarking

Once more they stay the same with the load increase, without impacting performance.

Reviewing all handler calls, we get our first surprise.

XtraDB Cluster handler calls:

AWS Aurora Benchmarking

Aurora handler calls:

AWS Aurora Benchmarking

The gap/drop existing in the two graphs are the different tests (with an increasing number of threads).

Two things to notice here: the first one is that XtraDB Cluster decreases in performance while processing the load, while Aurora does not. The second (you need to zoom the image) is the number of commits is floating in XtraDB Cluster, while it stays fixed in Aurora.

An even bigger surprise comes up when reviewing the connections graphs.

As expected, XtraDB Cluster has all my connections open, and the number of threads running is quite close to the number of connected threads.

AWS Aurora Benchmarking

Both of them follow the increasing number of connected threads.

But this is not the case in Aurora.

AWS Aurora Benchmarking

Also, if my applications are trying to open ~800 threads, the Aurora node see only a part of them, and the number of running is fixed to 32 threads.

The important things to consider here are that a) my applications don’t connect directly to the Aurora instance, but to a connector (MariaDB), and b) that Aurora, in this case, caps the number of running threads to the number of CPU available on the instance (here 32).

Given that, I expected to have worse performance (but I don’t). The fact that Aurora uses one thread for multiple connections seems to be working quite efficiently.

The number of rows inserted is also consistent with the handler calls, and has better performance than XtraDB Cluster.

Aurora rows inserted:

AWS Aurora Benchmarking

XtraDB Cluster rows inserted

AWS Aurora Benchmarking

Again we have the same trend, only, this time, Aurora performs better than XtraDB Cluster.

Third Test: OLTP Application

When run on the small boxes, this test saw XtraDB Cluster performing much better than Aurora. The time taken by Aurora was ~3 times the time taken by XtraDB Cluster.

AWS Aurora Benchmarking

With a large box, I had the inverse result: Aurora is outperforming XtraDB Cluster from 2 to 7 times the speed.

AWS Aurora Benchmarking

Analyzing the number of commands executed with the increasing workload, we can see how XtraDB Cluster can perform better than Aurora with a workload of 128 threads, but starts to have worse performance as the load increases.

On the other hand, Aurora manages the read/write load without significant performance loss, which includes being able to increase the number of commits/sec.

AWS Aurora Benchmarking

Reviewing the handler calls, we see that the handler commit calls are significantly less in Aurora (as already noticed in the ingest tests).

AWS Aurora Benchmarking

Another thing to note is that the number of calls for XtraDB Cluster is significantly higher and not scaling, while Aurora has a nice scaling trend.

Fourth Test: TPCC-mysql

The TPCC test is mainly to test OLTP traffic, with the note that some tables (like district) might become a hotspot. The tests I ran were executed against 400 warehouses, and used 128 threads maximum for the small box and 2048 threads for the large box.

During this test, I hit one of the Aurora limitations and I escalated it to the Aurora engineers (who are aware of the problem).

Small boxes:

AWS Aurora Benchmarking

In the case of small boxes, there is nothing to say: XtraDB Cluster manages the load more efficiently. This trend is not optimal, having significant fluctuation. Aurora is just not able to keep it up.

Large boxes:

 

AWS Aurora Benchmarking

It is a different and a more complex scenario in the case of the use of large boxes. I would like to say that Aurora performs better.

This is true for two of the three tests, and up to when it got stuck by internal limitation Aurora was also performing better on the third. But then its performance just collapsed.

With a more in-depth investigation, I noticed that under the hood Aurora was not performing as well as it appeared. This comes out quite clearly by looking at a comparison between the graphs covering Comm_ execution, open files, handlers and InnoDBrow lock time.

In all of them it is evident how XtraDB Cluster keeps serving the workload with consistent behavior, while Aurora fails the second test on (512 threads) — not just on the third with 2048 threads.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

It is clear that Aurora was better served during the test with 256 threads going over the 450K com select serve (in 10 sec interval), compared with XtraDB Cluster that was not able to go over 350K.

But in the following tests, while XtraDB Cluster was able to keep going (with decreasing performance), Aurora started to struggle with very inconsistent behavior.

This was also confirmed by the open files graph.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

The graphs show the instances of files open during the test, not the ones already open. It reflects the Open_file metric “The number of files that are open. This count includes regular files opened by the server. It does not include other types of files such as sockets or pipes. Also, the count does not include files that storage engines open using their own internal functions rather than asking the server level to do so.”

I was quite surprised by the number of files open by Aurora.

Handlers reflected the same behavior, as well.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

Perfectly in line with the com trend.

So what was increasing in reverse?

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

As you can see from the above, the exactly same workload generated an increasing lock row time, from quite low in the test with 256 threads, up to a crazy high with 2048 threads.

As mentioned, we know that TPCC has a couple of tables that act as hotspots, and we already saw with IIbench how Aurora is not working efficiently in that case.

I also was getting a lot of 188 errors during the test. This is an Aurora internal error. When I reported it, I was told they know about it, and they are planning to work on it.

I hope they do soon, because if this issue is solved it is very likely that Aurora will not only be able to manage the tested workload, but exceed it by far.

I am saying this because also with the identified issues Aurora was able to keep going and manage a more than decent response time during the second test (with 512 threads).

AWS Aurora Benchmarking

Fifth Test: Sysbench

I added the sysbench tests to test scalability, and to see the what happens when the system reaches a saturation point. This test brought up some limitations existing in the Aurora solution, related more to the connector than the Aurora engine itself.

Aurora has a limit of 16k connections. I wanted to see what happens if I got to saturation point or close to it. It doesn’t matter if this is a ridiculously high number or not.

What happened was that Aurora managed traffic up to 4K. The closer I got to the limit, however, the more I had a connectivity issue. At the end I had to run the test with 8K, 12K and 20K threads pointing directly to the Aurora instance, bypassing the connector that was not able to serve the traffic. After that, I was able to hit up to ~15500 threads (but with a lot of inconsistent performance). I am defining the limit of a meaningful test from the previous level of 12K threads.

XtraDB Cluster was able to scale up to 16K no problem.

What also is notable here is that Aurora was able to manage the workload more efficiently regarding transaction handling (i.e., as transactions executed and latency).

AWS Aurora Benchmarking

The number of transactions executed by Aurora was ~three times the one executed by XtraDB Cluster.

AWS Aurora Benchmarking

Regarding latency, Aurora showed less latency then XtraDB Cluster.

Internally, Aurora and XtraDB Cluster operations were once again different regarding how the workload was handled. The most divergent result was the handler calls:

AWS Aurora Benchmarking

Commit calls in Aurora were a fraction of the calls in XtraDB Cluster, while the number of rollbacks was higher.

The read calls had an even more divergent behavior, with XtraDB Cluster performing a higher number of read_keys, while Aurora was having a very limited number of them. Read_rnd are very high in XtraDB Cluster, but totally absent in Aurora (note that in Aurora, read_rnds are reported but seem not to increase). On the other hand, Aurora reported a high number of read_rnd_next, while XtraDB Cluster has none.

AWS Aurora Benchmarking

HA availability

Fail-over time

Both solutions:

AWS Aurora Benchmarking

In this test, the fail-over time for the solution using Galera and HAProxy was more efficient. For both a limited or mid-level load. One assumption is that given Aurora has to verify both the status of the data transmitted and its consistency across the six data store nodes in every case; the process is not as fast as it could be.

It could also be that the cluster connector is not as efficient as it should in redirecting the traffic from one node to another. It would be a very interesting exercise to replace it with some other custom solution.

Note that I was performing the tests following the Amazon recommendation to use the following to simulate a real crash:

ALTER SYSTEM CRASH [INSTANCE|NODE]

As such, I was not doing anything strange or out of the ordinary.

It is worth mentioning that of the eight seconds taken by MySQL/Galera to perform the failover, six were due to the HAProxy settings (which had a 3000 ms interval and two loops in the settings before executing failover).

Execution latency

The purpose of these tests was to identify the latency existing between the moment that application sends the request and the moment MySQL/Aurora took the request in “charge”. The expectation is that the busier the database, the higher the latency.

For this test, I reported both results: the one coming from the old tests with the small box, and the new one with the large box.

Small boxes:

AWS Aurora Benchmarking

Large boxes:

AWS Aurora Benchmarking

It is clear from the graphs that the two tests report different scenarios. In the first, Galera was able to manage the load more efficiently and serve requests with lower latency. For the new tests, I had used a higher number of threads than the ones for the small box. Nevertheless, in the second test the CPU utilization and the number of running threads lead me to think that Aurora was finally able to utilize resources more efficiently and the lower latency.

The latency jumped up again when the number of connections rose above 12K, but that was expected given previous tests results.

Conclusions

High Availability

The two platforms were able to manage the failover operation in a limited time frame (below 1 minute). Nevertheless, MySQL/Galera was shown to be more efficient and consistent. This result is a direct consequence of synchronous replication, which by design prevents MySQL/Galera from allowing an active node to fall behind.

In my opinion, the replication method used in Aurora is efficient, and given that data is shared across the read replicas, fail-over should happen faster.

The tests suffered because of the connector, and I have the feeling that having another solution in place may bring some surprises (actually, I would like to test that as well).

Performance

In this run of tests, Aurora was able to invert the results I had in the first test with the small boxes. In almost all cases, Aurora performed as well or better then XtraDB Cluster. There are still cases where Aurora is penalized, and those are the ones where hotspots are present. The contention in Aurora is killing performance, and raise errors (188). But I hope we will see a significant evolution soon.

General Comments on Aurora

The product is evolving quickly, and benchmark results may become obsolete in very short time (this is why it is important to have repeatable and comparable tests). From my point of view, in this set of tests Aurora clearly shows where it’s a better fit: higher-end levels, where high availability and CPU power is the focus (not concerns about the cost).

There is no reason to use Aurora in small-mid boxes: the platform is not going to be as efficient as a standard solution like XtraDB Cluster. But if cost is not an issue, and the applications require a lot of parallelism, Aurora on db.r3.8xlarge is a good solution.

I still see space for improvements (like for cluster connectors, or the time taken to restart a cluster after a full stop, or contention reduction). But I am also confident that the work led by the development team will fix most of my concerns (and more) soon.

Final note: it would be nice to have the code open source, so that the community could contribute (but I understand the business reasons not to).

About Cost

I don’t think it is this the right place to mention the cost of each solution (especially because each need is different).

As such, I am not reporting any specific numbers. You can, however, follow the links below and do the necessary math:

Aurora cost calculator

AWS cost calculator

 

Nov
02
2015
--

A first look at RDS Aurora

Recently, I happened to have an onsite engagement and the goal of the engagement was to move a database service to RDS Aurora. Like probably most of you, I knew the service by name but I couldn’t say much about it, so, I Googled, I listened to talks and I read about it. Now that my onsite engagement is over, here’s my first impression of Aurora.

First, let’s describe the service itself. It is part of RDS and, at first glance, very similar to a regular RDS instance. In order to setup an Aurora instance, you go to the RDS console and you either launch a new instance choosing Aurora as type or you create a snapshot of a RDS 5.6 instance and migrate it to Aurora. While with a regular MySQL RDS instance you can create slaves, with Aurora you can add reader nodes to an existing cluster. An Aurora cluster minimally consists of a writer node but you can add up to 15 reader nodes (only one writer though). It is at the storage level that things become interesting. Aurora doesn’t rely on a filesystem type storage, at least not from a database standpoint, it has its own special storage service that is replicated locally and to two other AZ automatically for a total of 6 copies. Furthermore, you pay only for what you use and the storage grows/shrinks automatically in increments of 10 GB, which is pretty cool. You can have up to 64 TB in an Aurora cluster.

Now, all that is fine, but what are the benefits of using Aurora? I must say I barely used Aurora; one week is not a field proven experience. These are claims by Amazon, but, as we will discuss, there are some good arguments in favor of these claims.

The first claim is that the write capacity is increased by up to 4x. So, even if only a single instance is used as writer in Aurora, you get up to 400% the write capacity of a normal MySQL instance. That’s quite huge and amazing, but it basically means replication is asynchronous at the storage level, at least for the multi-AZ part since the latency would be a performance killer. Locally Aurora uses a quorum-based approach with the storage nodes. Given that the object store is a separate service with its own high availability configuration, that is a reasonable trade-off. For example, the clustering solutions with Galera like Percona XtraDB Cluster typically lowers the write capacity since all nodes must synchronize on commit. Other claims are that the readers performance is unaffected by the clustering and that the readers have almost no lag with the writer. Furthermore, as if that is not enough, readers can’t diverge from the master. Finally, since there’s no lag, any readers can replace the writer very quickly, so in terms of failover, all is right.

That seems almost too good to be true; how can it be possible? I happen to be interested in object stores, Ceph especially, and I was toying with the idea of using Ceph to store InnoDB pages. It appears that the Amazon team did a super great job at putting an object store under InnoDB and they went way further than what I was thinking. Here, I may be speculating a bit and I would be happy to be found wrong. The writer never writes dirty pages back to the store… it only writes fragments of InnoDB log to the object store as objects, one per transaction, and notifies the readers of the set of pages that have been updated by this fragment log object. Just have a look at the show global status of an Aurora instance and you’ll see what I mean… Said otherwise, it is like having an infinitely large set of InnoDB log files; you can’t reach the max checkpoint age. Also, if the object store supports atomic operations, there’s no need for the double-write buffer, a high source of contention in MySQL. Just those two aspects are enough, in my opinion, to explain the up to 4x performance claim for the write capacity, but also considering the amount of writes and the log files are a kind of binary diff, that’s usually much less stuff to write than whole pages.

Something is needed to remove the fragment log objects, since over time, the accumulation of these log objects and the need to apply them would impact performance, a phenomenon called log amplification. With Aurora, that seems to be handled at the storage level and the storage system is wise enough to know that a requested page is dirty and apply the log fragments before sending it back to the reader. The shared object store can also explain why the readers have almost no lag and why they can’t diverge. The only lag the readers can have is the notification time which has to be short if within the same AZ.

So, how does Aurora compares to a technology like Galera?

Pros:

  • Higher write capacity, writer is unaffected by the other nodes
  • Simpler logic, no need for certification
  • No need for an SST to provision a new node
  • Can’t diverge
  • Scale iops tremendously
  • Fast failover
  • No need for quorum (handled by the object store)
  • Simple to deploy

Cons:

  • Likely asynchronous at the storage level
  • Only one node is writable
  • Not open source

Aurora is a mind shift in term of database and a jewel in the hands of Amazon. Openstack currently has no database service that can offer similar features. I wonder how hard it would be to produce an equivalent solution using well known opensource components like Ceph for the object store and corosync or zookeeper or zeroMQ or else for the communication layer. Also, would there be a use case?

The post A first look at RDS Aurora appeared first on MySQL Performance Blog.

Jan
08
2015
--

Managing data using open source technologies? Learn what’s hot in 2015!

Whether you’re looking at the overall MySQL ecosystem or the all-data management landscape, the choice of technologies has never been larger than it is in 2015.

Having so many options is great but it also can be very hard to make a selection. I’m going to help narrow the list next week during a Webinar titled, “Open Source Technologies you should evaluate in 2015,” January 14 at 10 a.m. PST.

During the hour I’ll share which technologies I think worthy of consideration in 2015 – open source and proprietary technologies that allow you to manage your data easier, increase development pace, scale better and improve availability and security. I’ll also discuss recent developments in MySQL, NoSQL and NewSQL, Cloud and general advances in hardware.

Open source technologies you should evaluate in 2015Specifically, some of the areas I’ll address will include:

  • Cloud-based Database as a Service (DBaaS) such as Amazon RDS for MySQL, Amazon RDS for Aurora, Google Cloud, and OpenStack Trove
  • MySQL 5.7
  • Hybrid database environments with MySQL plus MongoDB or other NoSQL solutions
  • Advanced Monitoring capabilities such as Percona Cloud Tools
  • Other performance enhancements such as solid state devices (SSD) and the TokuDB storage engine

I hope to see you next week! (Register now to reserve your spot!)

The post Managing data using open source technologies? Learn what’s hot in 2015! appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com