Mar
05
2019
--

How to Upgrade Amazon Aurora MySQL from 5.6 to 5.7

Over time, software evolves and it is important to stay up to date if you want to benefit from new features and performance improvements.  Database engines follow the exact same logic and providers are always careful to provide an easy upgrade path. With MySQL, the mysql_upgrade tool serves that purpose.

A database upgrade process becomes more challenging in a managed environment like AWS RDS where you don’t have shell access to the database host and don’t have access to the SUPER MySQL privilege. This post is a collaboration between Fattmerchant and Percona following an engagement focused on the upgrade of the Fattmerchant database from Amazon Aurora MySQL 5.6 to Amazon Aurora MySQL 5.7. Jacques Fu, the CTO of Fattmerchant, is the co-author of this post.  Our initial plan was to follow a path laid out previously by others but we had difficulties finding any complete and detailed procedure outlining the steps. At least, with this post, there is now one.

Issues with the regular upgrade procedure

How do we normally upgrade a busy production server with minimal downtime?  The simplest solution is to use a slave server with the newer version. Such a procedure has the side benefit of providing a “staging” database server which can be used to test the application with the new version. Basically we need to follow these steps:

  1. Enable replication on the old server
  2. Make a consistent backup
  3. Restore the backup on a second server with the newer database version – it can be a temporary server
  4. Run mysql_upgrade if needed
  5. Configure replication with the old server
  6. Test the application against the new version. If the tests include conflicting writes, you may have to jump back to step 3
  7. If tests are OK and the new server is in sync, replication wise, with the old server, stop the application (only for a short while)
  8. Repoint the application to the new server
  9. Reset the slave
  10. Start the application

If the new server was temporary, you’ll need to repeat most of the steps the other way around, this time starting from the new server and ending on the old one.

What we thought would be a simple task turned out to be much more complicated. We were preparing to upgrade our database from Amazon Aurora MySQL 5.6 to 5.7 when we discovered that there was no option for an in-place upgrade. Unlike standard AWS RDS MySQL, which supports an in-place upgrade from 5.6 to 5.7, at the time of this article you cannot perform an in-place upgrade or even restore a backup across major versions of Amazon Aurora MySQL.

We initially chose Amazon Aurora for the benefits of the tuning work that AWS provided out of the box, but we realized with any set of pros there comes a list of cons. In this case, the limitations meant that something that should have been straightforward took us off the documented path.

Our original high-level plan

Since we couldn’t use an RDS snapshot to provision a new Amazon Aurora MySQL 5.7 instance, we had to fall back to the use of a logical backup. The intended steps were:

  1. Back up the Amazon Aurora MySQL 5.6 write node with mysqldump
  2. Spin up an empty Amazon Aurora MySQL 5.7 cluster
  3. Restore the backup
  4. Make the Amazon Aurora MySQL 5.7 write node a slave of the Amazon Aurora MySQL 5.6 write node
  5. Once in sync, transfer the application to the Amazon Aurora MySQL 5.7 cluster

Even those simple steps proved to be challenging.

Backup of the Amazon Aurora MySQL 5.6 cluster

First, the Amazon Aurora MySQL 5.6 write node must generate binary log files. The default cluster parameter group that is generated when creating an Amazon Aurora instance does not enable these settings. Our 5.6 write node was not generating binary log files, so we copied the default cluster parameter group to a new “replication” parameter group and changed the “binlog_format” variable to MIXED.  The parameter is only effective after a reboot, so overnight we rebooted the node. That was a first short downtime.

At that point, we were able to confirm, using “show master status;”, that the write node was indeed generating binlog files.  Since our procedure involves a logical backup and restore, we had to make sure the binary log files were kept for long enough. With a regular MySQL server the variable “expire_logs_days” controls the binary log retention time. With RDS, you have to use the mysql.rds_set_configuration stored procedure. We set the retention time to two weeks:

CALL mysql.rds_set_configuration('binlog retention hours', 336);

You can confirm the new setting is used with:

CALL mysql.rds_show_configuration;
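The value 336 used above is simply two weeks expressed in hours. As a minimal sketch (the helper names `retention_hours` and `retention_call` are hypothetical and not part of RDS), the conversion and the resulting statement could be generated like this:

```python
def retention_hours(days: int) -> int:
    """Convert a retention period in days to the hours RDS expects."""
    if days < 0:
        raise ValueError("retention must not be negative")
    return days * 24


def retention_call(days: int) -> str:
    """Build the CALL statement used in this post for a given number of days."""
    hours = retention_hours(days)
    return f"CALL mysql.rds_set_configuration('binlog retention hours', {hours});"


print(retention_call(14))
```

Pick a retention period comfortably longer than the dump, the restore and the replication catch-up combined, since the new cluster needs every binary log written after the snapshot.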

For the following step, we needed a mysqldump backup along with its consistent replication coordinates. The --master-data option of mysqldump implies “FLUSH TABLES WITH READ LOCK;” while the replication coordinates are read from the server.  That statement requires the SUPER privilege and this privilege is not available in RDS.

Since we wanted to avoid downtime, it was out of the question to pause the application for the time it would take to back up 100GB of data. The solution was to take a snapshot and use it to provision a temporary Amazon Aurora MySQL 5.6 cluster of one node. As part of the creation process, the events tab of the AWS console shows the binary log file and position consistent with the snapshot. It looks like this:

Consistent snapshot replication coordinates


From there, the temporary cluster is idle so it is easy to back it up with mysqldump. Since our dataset is large we considered the use of MyDumper but the added complexity was not worthwhile for a one time operation. The dump of a large database can take many hours. Essentially we performed:

mysqldump -h entrypoint-temporary-cluster -u awsrootuser -pxxxx \
 --no-data --single-transaction -R -E -B db1 db2 db3 > schema.sql
mysqldump -h entrypoint-temporary-cluster -nt --single-transaction \
 -u awsrootuser -pxxxx -B db1 db2 db3 | gzip -1 > dump.sql.gz
pt-show-grants -h entrypoint-temporary-cluster -u awsrootuser -pxxxx > grants.sql

The schema consists of three databases: db1, db2 and db3.  We have not included the mysql schema because it will cause issues with the new 5.7 instance. You’ll see why we dumped the schema and the data separately in the next section.

Restore to an empty Amazon Aurora MySQL 5.7 cluster

With our backup done, we are ready to spin up a brand new Amazon Aurora MySQL 5.7 cluster and restore the backup. Make sure the new Amazon Aurora MySQL 5.7 cluster is in a subnet with access to the Amazon Aurora MySQL 5.6 production cluster. In our schema, there are a few very large tables with a significant number of secondary keys. To speed up the restore, we removed the secondary indexes of these tables from the schema.sql file, saved the result as schema-modified.sql, and created a restore-indexes.sql file with the list of alter table statements needed to recreate them. Then we restored the data using these steps:

cat grants.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
cat schema-modified.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
zcat dump.sql.gz | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
cat restore-indexes.sql | mysql -h entrypoint-new-aurora-57 -u awsroot -pxxxx
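The post does not say how the secondary indexes were removed from the schema file. One way to do it is sketched below, with the caveat that this is an illustration and not the procedure the authors used. The function `split_secondary_indexes` is hypothetical, and it assumes the usual mysqldump layout (one column or key per line inside CREATE TABLE) and that no foreign key depends on the removed indexes. Only plain KEY lines are moved, so PRIMARY KEY and UNIQUE constraints stay in place during the load.

```python
import re

# A plain secondary index line inside a mysqldump CREATE TABLE body, e.g.
#   KEY `FlightDate` (`FlightDate`),
KEY_LINE = re.compile(r"^\s*(KEY `[^`]+` \(.*\)),?\s*$")
CREATE_LINE = re.compile(r"^CREATE TABLE `([^`]+)` \($")


def split_secondary_indexes(schema_sql, big_tables):
    """Return (schema without plain KEYs for big_tables, ALTER TABLE statements)."""
    out, alters = [], []
    table = None  # name of the large table currently being processed, if any
    keys = []
    for line in schema_sql.splitlines():
        m = CREATE_LINE.match(line)
        if m and m.group(1) in big_tables:
            table, keys = m.group(1), []
        elif table and line.startswith(")"):
            # End of CREATE TABLE: drop the comma left on the last kept line.
            out[-1] = out[-1].rstrip().rstrip(",")
            if keys:
                adds = ", ".join(f"ADD {k}" for k in keys)
                alters.append(f"ALTER TABLE `{table}` {adds};")
            table = None
        elif table:
            k = KEY_LINE.match(line)
            if k:
                keys.append(k.group(1))
                continue  # leave the index out of the CREATE TABLE
        out.append(line)
    return "\n".join(out) + "\n", "\n".join(alters) + "\n"
```

The first returned string would be written to schema-modified.sql and the second to restore-indexes.sql. Grouping all indexes of a table into a single ALTER TABLE avoids scanning a large table once per index.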

Configure replication

At this point, we have a new Amazon Aurora MySQL 5.7 cluster provisioned with a dataset at known replication coordinates from the Amazon Aurora MySQL 5.6 production cluster.  It is now very easy to set up replication. First we need to create a replication user in the Amazon Aurora MySQL 5.6 production cluster:

GRANT REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'repl_user'@'%' identified by 'agoodpassword';

Then, in the new Amazon Aurora MySQL 5.7 cluster, you configure replication and start it by:

CALL mysql.rds_set_external_master ('mydbcluster.cluster-123456789012.us-east-1.rds.amazonaws.com', 3306,
  'repl_user', 'agoodpassword', 'mysql-bin-changelog.000018', 65932380, 0);
CALL mysql.rds_start_replication;

The endpoint mydbcluster.cluster-123456789012.us-east-1.rds.amazonaws.com points to the Amazon Aurora MySQL 5.6 production cluster.

Now, if everything went well, the new Amazon Aurora MySQL 5.7 cluster will be actively syncing with its master, the current Amazon Aurora MySQL 5.6 production cluster. This process can take a significant amount of time depending on the write load and the type of instance used for the new cluster. You can monitor the progress with the show slave status\G command; the Seconds_Behind_Master field tells you how far behind, in seconds, the new cluster is compared to the old one.  It is not a measurement of how long it will take to resync.
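If you want to track the lag from a script instead of reading it by eye, the vertical output of show slave status\G is easy to parse. The sketch below assumes you have captured that output as text (for example from the mysql client); the function names are hypothetical.

```python
def parse_vertical_output(text):
    """Parse \\G-style mysql client output into a dict of column -> value."""
    row = {}
    for line in text.splitlines():
        if line.lstrip().startswith("*") or ":" not in line:
            continue  # skip the "*** 1. row ***" separator and blank lines
        name, _, value = line.partition(":")
        row[name.strip()] = value.strip()
    return row


def replication_lag(status):
    """Return Seconds_Behind_Master as an int, or None when it is NULL."""
    raw = status.get("Seconds_Behind_Master", "NULL")
    return None if raw == "NULL" else int(raw)
```

Seconds_Behind_Master is reported as NULL when the SQL thread is not running, which is why the helper returns None instead of 0 in that case.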

You can also monitor throughput using the AWS console. In this screenshot you can see the replication speeding up over time before it peaks when it is completed.

Replication speed

Test with Amazon Aurora MySQL 5.7

At this point, we have an Amazon Aurora MySQL 5.7 cluster in sync with the production Amazon Aurora MySQL 5.6 cluster. Before transferring the production load to the new cluster, you need to test your application with MySQL 5.7. The easiest way is to snapshot the new Amazon Aurora MySQL 5.7 cluster and, using the snapshot, provision a staging Amazon Aurora MySQL 5.7 cluster. Test your application against the staging cluster and, once tested, destroy the staging cluster and any unneeded snapshots.

Switch production to the Amazon Aurora MySQL 5.7 cluster

Now that you have tested your application with the staging cluster and are satisfied with how it behaves with Amazon Aurora MySQL 5.7, the very last step is to migrate the production load. Here are the last steps you need to follow:

  1. Make sure the Amazon Aurora MySQL 5.7 cluster is still in sync with the Amazon Aurora MySQL 5.6 cluster
  2. Stop the application
  3. Validate that the output of “Show master status;” on the 5.6 cluster is no longer moving
  4. Validate from the “Show slave status\G” output in the 5.7 cluster that Master_Log_File, Relay_Master_Log_File and Exec_Master_Log_Pos match the output of “Show master status;” from the 5.6 cluster
  5. Stop the slave in the 5.7 cluster with CALL mysql.rds_stop_replication;
  6. Reset the slave in the 5.7 cluster with CALL mysql.rds_reset_external_master;
  7. Reconfigure the application to use the 5.7 cluster endpoint
  8. Start the application

The application is down from steps 2 to 8.  Although that might appear to be a long time, these steps can easily be executed within a few minutes.
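The comparison in steps 3 and 4 is mechanical, so it can be scripted to keep the downtime window short. The sketch below assumes both statuses have already been fetched into dictionaries keyed by the column names MySQL reports; the function `in_sync` is hypothetical. It checks Relay_Master_Log_File as well, because that is the file the executed position (Exec_Master_Log_Pos) refers to.

```python
def in_sync(master_status, slave_status):
    """True when the replica has read and executed everything the source wrote.

    master_status: columns of SHOW MASTER STATUS on the 5.6 cluster.
    slave_status:  columns of SHOW SLAVE STATUS on the 5.7 cluster.
    """
    same_file = (
        slave_status["Master_Log_File"] == master_status["File"]
        and slave_status["Relay_Master_Log_File"] == master_status["File"]
    )
    same_position = int(slave_status["Exec_Master_Log_Pos"]) == int(
        master_status["Position"]
    )
    return same_file and same_position
```

Only proceed to stopping and resetting replication once this check passes with the application stopped. If it never does, replication has likely stopped or failed and should be investigated first.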

Summary

So, in summary, although RDS Aurora doesn’t support an in-place upgrade between Amazon Aurora MySQL 5.6 and 5.7, there is a possible migration path that minimizes downtime.  In our case, we were able to limit the downtime to only a few minutes.

Co-Author: Jacques Fu, Fattmerchant

 

Jacques is CTO and co-founder at the fintech startup Fattmerchant, author of Time Hacks, and co-founder of the Orlando Devs, the largest developer meetup in Orlando. He has a passion for building products, bringing them to market, and scaling them.

Jan
17
2019
--

Using Parallel Query with Amazon Aurora for MySQL


Parallel query execution is my favorite, non-existent, feature in MySQL. In all versions of MySQL – at least at the time of writing – when you run a single query it will run in one thread, effectively utilizing one CPU core only. Multiple queries run at the same time will be using different threads and will utilize more than one CPU core.

On multi-core machines – which is the majority of the hardware nowadays – and in the cloud, we have multiple cores available for use. With faster disks (e.g. SSDs) we can’t utilize the full potential of IOPS with just one thread.

AWS Aurora (based on MySQL 5.6) now has a version which will support parallelism for SELECT queries (utilizing the read capacity of storage nodes underneath the Aurora cluster). In this article, we will look at how this can improve the reporting/analytical query performance in MySQL. I will compare AWS Aurora with MySQL (Percona Server) 5.6 running on an EC2 instance of the same class.

In Short

Aurora Parallel Query response time (for queries which cannot use indexes) can be 5x-10x better compared to the non-parallel fully cached operations. This is a significant improvement for the slow queries.

Test data and versions

For my test, I need to choose:

  1. Aurora instance type and comparison
  2. Dataset
  3. Queries

Aurora instance type and comparison

According to Jeff Barr’s excellent article (https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/) the following instance classes will support parallel query (PQ):

“The instance class determines the number of parallel queries that can be active at a given time:

  • db.r*.large – 1 concurrent parallel query session
  • db.r*.xlarge – 2 concurrent parallel query sessions
  • db.r*.2xlarge – 4 concurrent parallel query sessions
  • db.r*.4xlarge – 8 concurrent parallel query sessions
  • db.r*.8xlarge – 16 concurrent parallel query sessions
  • db.r4.16xlarge – 16 concurrent parallel query sessions”

As I want to maximize the concurrency of parallel query sessions, I have chosen db.r4.8xlarge. For the EC2 instance I will use the same class: r4.8xlarge.

Aurora:

mysql> show global variables like '%version%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| aurora_version          | 1.18.0                       |
| innodb_version          | 1.2.10                       |
| protocol_version        | 10                           |
| version                 | 5.6.10                       |
| version_comment         | MySQL Community Server (GPL) |
| version_compile_machine | x86_64                       |
| version_compile_os      | Linux                        |
+-------------------------+------------------------------+

MySQL on EC2:

mysql> show global variables like '%version%';
+-------------------------+------------------------------------------------------+
| Variable_name           | Value                                                |
+-------------------------+------------------------------------------------------+
| innodb_version          | 5.6.41-84.1                                          |
| protocol_version        | 10                                                   |
| slave_type_conversions  |                                                      |
| tls_version             | TLSv1.1,TLSv1.2                                      |
| version                 | 5.6.41-84.1                                          |
| version_comment         | Percona Server (GPL), Release 84.1, Revision b308619 |
| version_compile_machine | x86_64                                               |
| version_compile_os      | debian-linux-gnu                                     |
| version_suffix          |                                                      |
+-------------------------+------------------------------------------------------+

Table

I’m using the “Airlines On-Time Performance” database from http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time  (You can find the scripts I used here: https://github.com/Percona-Lab/ontime-airline-performance).

mysql> show table status like 'ontime'\G
*************************** 1. row ***************************
          Name: ontime
        Engine: InnoDB
       Version: 10
    Row_format: Compact
          Rows: 173221661
Avg_row_length: 409
   Data_length: 70850183168
Max_data_length: 0
  Index_length: 0
     Data_free: 7340032
Auto_increment: NULL
   Create_time: 2018-09-26 02:03:28
   Update_time: NULL
    Check_time: NULL
     Collation: latin1_swedish_ci
      Checksum: NULL
Create_options:
       Comment:
1 row in set (0.00 sec)

The table is very wide, 84 columns.
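To put the table status output in perspective, the size in GiB and the average row size can be derived from the figures reported above. This is plain arithmetic on the values in this post. Keep in mind that the Rows value reported for an InnoDB table is an estimate.

```python
rows = 173221661            # Rows from SHOW TABLE STATUS (an estimate for InnoDB)
data_length = 70850183168   # Data_length in bytes

size_gib = data_length / 2**30        # bytes -> GiB
bytes_per_row = data_length / rows    # should agree with Avg_row_length (409)

print(f"{size_gib:.2f} GiB, about {bytes_per_row:.0f} bytes per row")
```

So the scans in the following sections read roughly 66 GiB of table data when no index can be used.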

Working with Aurora PQ (Parallel Query)

Documentation: https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/aurora-mysql-parallel-query.html

Aurora PQ works by doing a full table scan (parallel reads are done on the storage level). The InnoDB buffer pool is not used when Parallel Query is utilized.

For the purposes of the test I turned PQ on and off (normally AWS Aurora uses its own heuristics to determine if the PQ will be helpful or not):

Turn on and force:

mysql> set session aurora_pq = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> set aurora_pq_force = 1;
Query OK, 0 rows affected (0.00 sec)

Turn off:

mysql> set session aurora_pq = 0;
Query OK, 0 rows affected (0.00 sec)

The EXPLAIN plan in Aurora MySQL will also show details about parallel query execution.

Queries

Here, I use the “reporting” queries, running only one query at a time. The queries are similar to those I’ve used in older blog posts comparing MySQL and Apache Spark performance (https://www.percona.com/blog/2016/08/17/apache-spark-makes-slow-mysql-queries-10x-faster/ )

Here is a summary of the queries:

  1. Simple queries:
    • select count(*) from ontime where flightdate > '2017-01-01'
    • select avg(DepDelay/ArrDelay+1) from ontime
  2. Complex filter, single table:

select SQL_CALC_FOUND_ROWS
FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest
FROM ontime
WHERE
  DestState not in ('AK', 'HI', 'PR', 'VI')
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and flightdate > '2015-01-01'
  and ArrDelay < 15
  and cancelled = 0
  and Diverted = 0
  and DivAirportLandings = 0
ORDER by DepDelay DESC
LIMIT 10;

  3. Complex filter, join “reference” table:

select SQL_CALC_FOUND_ROWS
FlightDate, UniqueCarrier, TailNum, FlightNum, Origin, OriginCityName, Dest, DestCityName, DepDelay, ArrDelay
FROM ontime_ind o
JOIN carriers c on o.carrier = c.carrier_code
WHERE
  (carrier_name like 'United%' or carrier_name like 'Delta%')
  and ArrDelay > 30
ORDER by DepDelay DESC
LIMIT 10\G

  4. Select one row only, no index

Query 1a: simple, count(*)

Let’s take a look at the most simple query: count(*). This variant of the “ontime” table has no secondary indexes.

select count(*) from ontime where flightdate > '2017-01-01';

Aurora, pq (parallel query) disabled:

I disabled the PQ first to compare:

mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (8 min 25.49 sec)
mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (2 min 48.81 sec)
mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (2 min 48.25 sec)

Please note: the first run was a “cold run”; data was read from disk. The second and third runs used the cached data.

Now let's enable and force Aurora PQ:

mysql> set session aurora_pq = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> set aurora_pq_force = 1; 
Query OK, 0 rows affected (0.00 sec)
mysql> explain select count(*) from ontime where flightdate > '2017-01-01'\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: ontime
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 173706586
       Extra: Using where; Using parallel query (1 columns, 1 filters, 0 exprs; 0 extra)
1 row in set (0.00 sec)

(from the EXPLAIN plan, we can see that parallel query is used).
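When checking many queries, it can be handy to detect parallel query usage programmatically from the Extra column. The sketch below relies only on the text format visible in the EXPLAIN output in this post; whether that format stays the same across Aurora versions is an assumption, and the function name is hypothetical.

```python
import re

# Format as it appears in the Extra column of the EXPLAIN output in this post.
PQ_PATTERN = re.compile(
    r"Using parallel query \((\d+) columns, (\d+) filters, (\d+) exprs; (\d+) extra\)"
)


def parallel_query_details(extra):
    """Return the parallel query counters from an Extra value, or None if absent."""
    m = PQ_PATTERN.search(extra)
    if m is None:
        return None
    columns, filters, exprs, extra_count = map(int, m.groups())
    return {
        "columns": columns,
        "filters": filters,
        "exprs": exprs,
        "extra": extra_count,
    }
```

A return value of None means the plan did not mention parallel query, which is what you would expect with aurora_pq set to 0.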

Results:

mysql> select count(*) from ontime where flightdate > '2017-01-01';                                                                                                                          
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (16.53 sec)
mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (16.56 sec)
mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (16.36 sec)
mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (16.56 sec)
mysql> select count(*) from ontime where flightdate > '2017-01-01';
+----------+
| count(*) |
+----------+
|  5660651 |
+----------+
1 row in set (16.36 sec)

As we can see, the results are very stable. Parallel query does not use any cache (i.e. the InnoDB buffer pool) either. The result is also interesting: utilizing multiple threads (up to 16 threads) and reading data from disk (using disk cache, probably) can be ~10x faster compared to reading from memory in a single thread.

Result: ~10x performance gain, no index used

Query 1b: simple, avg

set aurora_pq = 1; set aurora_pq_force=1;
select avg(DepDelay) from ontime;
+---------------+
| avg(DepDelay) |
+---------------+
|        8.2666 |
+---------------+
1 row in set (1 min 48.17 sec)
set aurora_pq = 0; set aurora_pq_force=0;  
select avg(DepDelay) from ontime;
+---------------+
| avg(DepDelay) |
+---------------+
|        8.2666 |
+---------------+
1 row in set (2 min 49.95 sec)

Here we can see that PQ gives us a ~1.6x performance increase (2 min 49.95 sec down to 1 min 48.17 sec).

Summary of simple query performance

Here is what we learned comparing Aurora PQ performance to native MySQL query execution:

  1. select count(*), not using index: ~10x performance increase with Aurora PQ.
  2. select avg(…), not using index: ~1.6x performance increase with Aurora PQ.

Query 2: Complex filter, single table

The following query will always be slow in MySQL. This combination of the filters in the WHERE condition makes it extremely hard to prepare a good set of indexes to make this query faster.

select SQL_CALC_FOUND_ROWS
FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest
FROM ontime
WHERE
  DestState not in ('AK', 'HI', 'PR', 'VI')
  and OriginState not in ('AK', 'HI', 'PR', 'VI')
  and flightdate > '2015-01-01'
  and ArrDelay < 15
and cancelled = 0
and Diverted = 0
and DivAirportLandings = '0'
ORDER by DepDelay DESC
LIMIT 10;

Let’s compare the query performance with and without PQ.

PQ disabled:

mysql> set aurora_pq_force = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> set aurora_pq = 0;                                                                                                                                                                  
Query OK, 0 rows affected (0.00 sec)
mysql> explain select SQL_CALC_FOUND_ROWS FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest FROM ontime WHERE    DestState not in ('AK', 'HI', 'PR', 'VI') and OriginState not in ('AK', 'HI', 'PR', 'VI') and flightdate > '2015-01-01'     and ArrDelay < 15 and cancelled = 0 and Diverted = 0 and DivAirportLandings = 0 ORDER by DepDelay DESC LIMIT 10\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: ontime
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 173706586
       Extra: Using where; Using filesort
1 row in set (0.00 sec)
mysql> select SQL_CALC_FOUND_ROWS FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest FROM ontime WHERE    DestState not in ('AK', 'HI', 'PR', 'VI') and OriginState not in ('AK', 'HI', 'PR', 'VI') and flightdate > '2015-01-01'     and ArrDelay < 15 and cancelled = 0 and Diverted = 0 and DivAirportLandings = 0 ORDER by DepDelay DESC LIMIT 10;
+------------+---------+-----------+--------+------+
| FlightDate | carrier | FlightNum | Origin | Dest |
+------------+---------+-----------+--------+------+
| 2017-10-09 | OO      | 5028      | SBP    | SFO  |
| 2015-11-03 | VX      | 969       | SAN    | SFO  |
| 2015-05-29 | VX      | 720       | TUL    | AUS  |
| 2016-03-11 | UA      | 380       | SFO    | BOS  |
| 2016-06-13 | DL      | 2066      | JFK    | SAN  |
| 2016-11-14 | UA      | 1600      | EWR    | LAX  |
| 2016-11-09 | WN      | 2318      | BDL    | LAS  |
| 2016-11-09 | UA      | 1652      | IAD    | LAX  |
| 2016-11-13 | AA      | 23        | JFK    | LAX  |
| 2016-11-12 | UA      | 800       | EWR    | SFO  |
+------------+---------+-----------+--------+------+

10 rows in set (3 min 42.47 sec)

/* another run */

10 rows in set (3 min 46.90 sec)

This query is 100% cached. Here is the graph from PMM showing the number of read requests:

  1. Read requests: logical requests from the buffer pool
  2. Disk reads: physical requests from disk

Buffer pool requests:

Buffer pool requests from PMM

Now let’s enable and force PQ:

PQ enabled:

mysql> set session aurora_pq = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> set aurora_pq_force = 1;
Query OK, 0 rows affected (0.00 sec)
mysql> explain select SQL_CALC_FOUND_ROWS FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest FROM ontime WHERE    DestState not in ('AK', 'HI', 'PR', 'VI') and OriginState not in ('AK', 'HI', 'PR', 'VI') and flightdate > '2015-01-01'     and ArrDelay < 15 and cancelled = 0 and Diverted = 0 and DivAirportLandings = 0 ORDER by DepDelay DESC LIMIT 10\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: ontime
        type: ALL
possible_keys: NULL
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 173706586
       Extra: Using where; Using filesort; Using parallel query (12 columns, 4 filters, 3 exprs; 0 extra)
1 row in set (0.00 sec)
mysql> select SQL_CALC_FOUND_ROWS
   -> FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest
   -> FROM ontime
   -> WHERE
   ->  DestState not in ('AK', 'HI', 'PR', 'VI')
   ->  and OriginState not in ('AK', 'HI', 'PR', 'VI')
   ->  and flightdate > '2015-01-01'
   ->   and ArrDelay < 15
   -> and cancelled = 0
   -> and Diverted = 0
   -> and DivAirportLandings = 0
   ->  ORDER by DepDelay DESC
   -> LIMIT 10;
+------------+---------+-----------+--------+------+
| FlightDate | carrier | FlightNum | Origin | Dest |
+------------+---------+-----------+--------+------+
| 2017-10-09 | OO      | 5028      | SBP    | SFO  |
| 2015-11-03 | VX      | 969       | SAN    | SFO  |
| 2015-05-29 | VX      | 720       | TUL    | AUS  |
| 2016-03-11 | UA      | 380       | SFO    | BOS  |
| 2016-06-13 | DL      | 2066      | JFK    | SAN  |
| 2016-11-14 | UA      | 1600      | EWR    | LAX  |
| 2016-11-09 | WN      | 2318      | BDL    | LAS  |
| 2016-11-09 | UA      | 1652      | IAD    | LAX  |
| 2016-11-13 | AA      | 23        | JFK    | LAX  |
| 2016-11-12 | UA      | 800       | EWR    | SFO  |
+------------+---------+-----------+--------+------+
10 rows in set (41.88 sec)
/* run 2 */
10 rows in set (28.49 sec)
/* run 3 */
10 rows in set (29.60 sec)

Now let’s compare the requests:

InnoDB Buffer Pool Requests

As we can see, Aurora PQ is barely utilizing the buffer pool: there is only a minor number of read requests. Compare the maximum of 4K requests per second with PQ to the constant 600K requests per second in the previous graph.

Result: ~8x performance gain

Query 3: Complex filter, join “reference” table

In this example I join two tables: the main “ontime” table and a reference table. If we have both tables without indexes it will simply be too slow in MySQL. To make it better, I have created an index for both tables and so it will use indexes for the join:

CREATE TABLE `carriers` (
 `carrier_code` varchar(8) NOT NULL DEFAULT '',
 `carrier_name` varchar(200) DEFAULT NULL,
 PRIMARY KEY (`carrier_code`),
 KEY `carrier_name` (`carrier_name`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> show create table ontime_ind\G
...
 PRIMARY KEY (`id`),
 KEY `comb1` (`Carrier`,`Year`,`ArrDelayMinutes`),
 KEY `FlightDate` (`FlightDate`)
) ENGINE=InnoDB AUTO_INCREMENT=178116912 DEFAULT CHARSET=latin1

Query:

select SQL_CALC_FOUND_ROWS
FlightDate, UniqueCarrier, TailNum, FlightNum, Origin, OriginCityName, Dest, DestCityName, DepDelay, ArrDelay
FROM ontime_ind o
JOIN carriers c on o.carrier = c.carrier_code
WHERE
  (carrier_name like 'United%' or carrier_name like 'Delta%')
  and ArrDelay > 30
  ORDER by DepDelay DESC
LIMIT 10\G

PQ disabled, explain plan:

mysql> set aurora_pq_force = 0;
Query OK, 0 rows affected (0.00 sec)
mysql> set aurora_pq = 0;                                                                                                                                                                  
Query OK, 0 rows affected (0.00 sec)
mysql> explain
   -> select SQL_CALC_FOUND_ROWS
   -> FlightDate, UniqueCarrier, TailNum, FlightNum, Origin, OriginCityName, Dest, DestCityName, DepDelay, ArrDelay
   -> FROM ontime_ind o
   -> JOIN carriers c on o.carrier = c.carrier_code
   -> WHERE
   ->  (carrier_name like 'United%' or carrier_name like 'Delta%')
   ->  and ArrDelay > 30
   ->  ORDER by DepDelay DESC
   -> LIMIT 10\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: c
        type: range
possible_keys: PRIMARY,carrier_name
         key: carrier_name
     key_len: 203
         ref: NULL
        rows: 3
       Extra: Using where; Using index; Using temporary; Using filesort
*************************** 2. row ***************************
          id: 1
 select_type: SIMPLE
       table: o
        type: ref
possible_keys: comb1
         key: comb1
     key_len: 3
         ref: ontime.c.carrier_code
        rows: 2711597
       Extra: Using index condition; Using where
2 rows in set (0.01 sec)

As we can see MySQL uses indexes for the join. Response times:

/* run 1 – cold run */

10 rows in set (29 min 17.39 sec)

/* run 2  – warm run */

10 rows in set (2 min 45.16 sec)

PQ enabled, explain plan:

mysql> explain
   -> select SQL_CALC_FOUND_ROWS
   -> FlightDate, UniqueCarrier, TailNum, FlightNum, Origin, OriginCityName, Dest, DestCityName, DepDelay, ArrDelay
   -> FROM ontime_ind o
   -> JOIN carriers c on o.carrier = c.carrier_code
   -> WHERE
   ->  (carrier_name like 'United%' or carrier_name like 'Delta%')
   ->  and ArrDelay > 30
   ->  ORDER by DepDelay DESC
   -> LIMIT 10\G
*************************** 1. row ***************************
          id: 1
 select_type: SIMPLE
       table: c
        type: ALL
possible_keys: PRIMARY,carrier_name
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 1650
       Extra: Using where; Using temporary; Using filesort; Using parallel query (2 columns, 0 filters, 1 exprs; 0 extra)
*************************** 2. row ***************************
          id: 1
 select_type: SIMPLE
       table: o
        type: ALL
possible_keys: comb1
         key: NULL
     key_len: NULL
         ref: NULL
        rows: 173542245
       Extra: Using where; Using join buffer (Hash Join Outer table o); Using parallel query (11 columns, 1 filters, 1 exprs; 0 extra)
2 rows in set (0.00 sec)

As we can see, Aurora does not use any indexes and uses a parallel scan instead.

Response time:

mysql> select SQL_CALC_FOUND_ROWS
   -> FlightDate, UniqueCarrier, TailNum, FlightNum, Origin, OriginCityName, Dest, DestCityName, DepDelay, ArrDelay
   -> FROM ontime_ind o
   -> JOIN carriers c on o.carrier = c.carrier_code
   -> WHERE
   ->  (carrier_name like 'United%' or carrier_name like 'Delta%')
   ->  and ArrDelay > 30
   ->  ORDER by DepDelay DESC
   -> LIMIT 10\G
...
*************************** 4. row ***************************
   FlightDate: 2017-05-04
UniqueCarrier: UA
      TailNum: N68821
    FlightNum: 1205
       Origin: KOA
OriginCityName: Kona, HI
         Dest: LAX
 DestCityName: Los Angeles, CA
     DepDelay: 1457
     ArrDelay: 1459
*************************** 5. row ***************************
   FlightDate: 1991-03-12
UniqueCarrier: DL
      TailNum:
    FlightNum: 1118
       Origin: ATL
OriginCityName: Atlanta, GA
         Dest: STL
 DestCityName: St. Louis, MO
...
10 rows in set (28.78 sec)
mysql> select found_rows();
+--------------+
| found_rows() |
+--------------+
|      4180974 |
+--------------+
1 row in set (0.00 sec)

Result: ~5.7x performance gain (2 min 45.16 sec without PQ versus 28.78 sec with PQ)

(this actually compares the cached, index-based read without PQ to a non-index PQ execution)

Summary

Aurora PQ can significantly improve the performance of reporting queries, which can be extremely hard to optimize in MySQL even when using indexes. In these tests, Aurora PQ response times were roughly 5x-10x better than the non-parallel, fully cached, index-based runs for three of the four queries; the avg(DepDelay) query improved by about 1.6x. Aurora PQ helps complex queries by performing parallel reads.

The following table summarizes the query response times:

Query | Time, No PQ, index | Time, PQ
select count(*) from ontime where flightdate > '2017-01-01' | 2 min 48.81 sec | 16.53 sec
select avg(DepDelay) from ontime; | 2 min 49.95 sec | 1 min 48.17 sec
select SQL_CALC_FOUND_ROWS FlightDate, UniqueCarrier as carrier, FlightNum, Origin, Dest FROM ontime WHERE DestState not in ('AK', 'HI', 'PR', 'VI') and OriginState not in ('AK', 'HI', 'PR', 'VI') and flightdate > '2015-01-01' and ArrDelay < 15 and cancelled = 0 and Diverted = 0 and DivAirportLandings = 0 ORDER by DepDelay DESC LIMIT 10; | 3 min 42.47 sec | 28.49 sec
select SQL_CALC_FOUND_ROWS FlightDate, UniqueCarrier, TailNum, FlightNum, Origin, OriginCityName, Dest, DestCityName, DepDelay, ArrDelay FROM ontime_ind o JOIN carriers c on o.carrier = c.carrier_code WHERE (carrier_name like 'United%' or carrier_name like 'Delta%') and ArrDelay > 30 ORDER by DepDelay DESC LIMIT 10\G | 2 min 45.16 sec | 28.78 sec
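
As a quick check on the table above, the speedups can be recomputed from the reported times. This is a minimal Python sketch: the helper name `to_seconds` and the short row labels are made up for illustration, and only the timings come from the table.

```python
import re

def to_seconds(text):
    """Convert a mysql client timing such as '2 min 48.81 sec' to seconds."""
    match = re.fullmatch(r"(?:(\d+) min )?([\d.]+) sec", text.strip())
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))

# (label, time without PQ, time with PQ), copied from the summary table.
timings = [
    ("count(*) with date filter", "2 min 48.81 sec", "16.53 sec"),
    ("avg(DepDelay)", "2 min 49.95 sec", "1 min 48.17 sec"),
    ("filtered ORDER BY ... LIMIT 10", "3 min 42.47 sec", "28.49 sec"),
    ("join with carriers", "2 min 45.16 sec", "28.78 sec"),
]

# Speedup = time without PQ divided by time with PQ.
speedups = {
    label: to_seconds(no_pq) / to_seconds(pq) for label, no_pq, pq in timings
}

for label, ratio in speedups.items():
    print(f"{label}: {ratio:.1f}x")
```

Run as written, this prints ratios of about 10.2x, 1.6x, 7.8x and 5.7x, so the gain depends heavily on the query: the plain aggregate over the whole table benefits the least.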


Photo by Thomas Lipke on Unsplash

Jan
11
2019
--

AWS Aurora MySQL – HA, DR, and Durability Explained in Simple Terms

It’s a few weeks after AWS re:Invent 2018 and my head is still spinning from all of the information released at this year’s conference. This year I was able to enjoy a few sessions focused on Aurora deep dives. In fact, I walked away from the conference realizing that my own understanding of High Availability (HA), Disaster Recovery (DR), and Durability in Aurora had been off for quite a while. Consequently, I decided to put this blog out there, both to collect the ideas in one place for myself, and to share them in general. Unlike some of our previous blogs, I’m not focused on analyzing Aurora performance or examining the architecture behind Aurora. Instead, I want to focus on how HA, DR, and Durability are defined and implemented within the Aurora ecosystem.  We’ll get just deep enough into the weeds to be able to examine these capabilities alone.

[Diagram: Introducing the Aurora storage engine]

Aurora MySQL – What is it?

We’ll start with a simplified discussion of what Aurora is from a very high level.  In its simplest description, Aurora MySQL is made up of a MySQL-compatible compute layer and a multi-AZ (multi availability zone) storage layer. In the context of an HA discussion, it is important to start at this level, so we understand the redundancy that is built into the platform versus what is optional, or configurable.

Aurora Storage

The Aurora Storage layer presents a volume to the compute layer. This volume is built out in 10GB increments called protection groups.  Each protection group is built from six storage nodes, two from each of three availability zones (AZs).  These are represented in the diagram above in green.  When the compute layer—represented in blue—sends a write I/O to the storage layer, the data gets replicated six times across three AZs.

Durable by Default

In addition to the six-way replication, Aurora employs a 4-of-6 quorum for all write operations. This means that for each commit that happens at the database compute layer, the database node waits until it receives write acknowledgment from at least four out of six storage nodes. By receiving acknowledgment from four storage nodes, we know that the write has been saved in at least two AZs.  The storage layer itself has intelligence built-in to ensure that each of the six storage nodes has a copy of the data. This does not require any interaction with the compute tier. By ensuring that there are always at least four copies of data, across at least two datacenters (AZs), and ensuring that the storage nodes are self-healing and always maintain six copies, it can be said that the Aurora Storage platform has the characteristic of Durable by Default.  The Aurora storage architecture is the same no matter how large or small your Aurora compute architecture is.
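
The claim that four acknowledgments always cover at least two AZs follows from the layout: each AZ holds only two of the six nodes. The short Python sketch below checks this by enumerating every possible set of four acknowledging nodes. The AZ names are placeholders, and this models only the counting argument, not Aurora's actual storage protocol.

```python
from itertools import combinations

# Six storage nodes per protection group: two in each of three AZs.
# The AZ names are placeholders for illustration.
nodes = [(az, copy) for az in ("az-a", "az-b", "az-c") for copy in (1, 2)]

WRITE_QUORUM = 4

def azs_covered(acked_nodes):
    """Return the set of AZs represented in a group of acknowledging nodes."""
    return {az for az, _ in acked_nodes}

# Every possible set of exactly four acknowledging nodes.
quorums = list(combinations(nodes, WRITE_QUORUM))
min_azs = min(len(azs_covered(quorum)) for quorum in quorums)

print(f"{len(quorums)} possible 4-node quorums, each spans >= {min_azs} AZs")
```

This prints `15 possible 4-node quorums, each spans >= 2 AZs`. The same arithmetic shows that with two nodes per AZ, losing a whole AZ still leaves four nodes, which is exactly the write quorum.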

One might think that waiting to receive four acknowledgments represents a lot of I/O time and is therefore an expensive write operation.  However, Aurora database nodes do not behave the way a typical MySQL database instance would. Some of the round-trip execution time is mitigated by the way in which Aurora MySQL nodes write transactions to disk. For more information on exactly how this works, check out Amazon Senior Engineering Manager, Kamal Gupta’s deep-dive into Aurora MySQL from AWS re:Invent 2018.

HA and DR Options

While durability can be said to be a default characteristic of the platform, HA and DR are configurable capabilities. Let’s take a look at some of the HA and DR options available. Aurora databases are deployed as members of an Aurora DB Cluster. The cluster configuration is fairly flexible. Database nodes are given the roles of either Writer or Reader. In most cases, there will only be one Writer node. The Reader nodes are known as Aurora Replicas. A single Aurora Cluster may contain up to 15 Aurora Replicas. We’ll discuss a few common configurations and the associated levels of HA and DR which they provide. This is only a sample of possible configurations: it is not meant to represent an exhaustive list of the possible configuration options available on the Aurora platform.

Single-AZ, Single Instance Deployment

[Image: Great durability with Aurora, but DR and HA less so]

The most basic implementation of Aurora is a single compute instance in a single availability zone. The compute instance is monitored by the Aurora Cluster service and will be restarted if the database instance or compute VM has a failure. In this architecture, there is no redundancy at the compute level. Therefore, there is no database level HA or DR. The storage tier provides the same high level of durability described in the sections above. The image below is a view of what this configuration looks like in the AWS Console.

Single-AZ, Multi-Instance

[Image: Introducing HA into an Amazon Aurora solution]

HA can be added to a basic Aurora implementation by adding an Aurora Replica.  We increase our HA level by adding Aurora Replicas within the same AZ. If desired, the Aurora Replicas can be used to also service some of the read traffic for the Aurora Cluster. This configuration cannot be said to provide DR because there are no database nodes outside the single datacenter or AZ. If that datacenter were to fail, then database availability would be lost until it was manually restored in another datacenter (AZ). It’s important to note that while Aurora has a lot of built-in automation, you will only benefit from that automation if your base configuration facilitates a path for the automation to follow. If you have a single-AZ base deployment, then you will not have the benefit of automated Multi-AZ availability. However, as in the previous case, durability remains the same. Again, durability is a characteristic of the storage layer. The image below is a view of what this configuration looks like in the AWS Console. Note that the Writer and Reader are in the same AZ.

Multi-AZ Options

[Image: Partial disaster recovery with Amazon Aurora]

Building on our previous example, we can increase our level of HA and add partial DR capabilities to the configuration by adding more Aurora Replicas. At this point we will add one additional replica in the same AZ, bringing the local AZ count to three database instances. We will also add one replica in each of the two remaining regional AZs. Aurora provides the option to configure automated failover priority for the Aurora Replicas. Failover priority is best chosen according to individual business needs. That said, one way to define the priority might be to set the first failover to the local-AZ replicas, and subsequent failover priority to the replicas in the other AZs. It is important to remember that AZs within a region are physical datacenters located within the same metro area. This configuration will provide protection for a disaster localized to the datacenter. It will not, however, provide protection for a city-wide disaster. The image below is a view of what this configuration looks like in the AWS Console. Note that we now have two Readers in the same AZ as the Writer and two Readers in two other AZs.
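
The "local AZ first, other AZs second" ordering can be sketched in a few lines of Python. Aurora expresses failover priority as a promotion tier from 0 to 15, with lower tiers promoted first. The instance names, AZ names and the helper `assign_promotion_tiers` are hypothetical, and the sketch only computes the tier numbers; it does not call AWS.

```python
def assign_promotion_tiers(replicas, writer_az, local_tier=0, remote_tier=1):
    """Give replicas in the writer's AZ a lower (earlier) tier than the rest."""
    return {
        name: local_tier if az == writer_az else remote_tier
        for name, az in replicas.items()
    }

# Hypothetical instance names and AZs matching the layout described above:
# two Readers next to the Writer, one Reader in each of the other two AZs.
replicas = {
    "reader-1": "az-a",
    "reader-2": "az-a",
    "reader-3": "az-b",
    "reader-4": "az-c",
}

tiers = assign_promotion_tiers(replicas, writer_az="az-a")

# Lower tier first; the name is only a tie-breaker to keep this output stable.
failover_order = sorted(tiers, key=lambda name: (tiers[name], name))
print(failover_order)
```

The resulting numbers would then be set per instance as its failover priority. How Aurora breaks ties within a tier is its own business and is not modeled here.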

Cross-Region Options

The three configuration types we’ve discussed up to this point represent configuration options available within an AZ or metro area. There are also options available for cross-region replication in the form of both logical and physical replication.

Logical Replication

Aurora supports replication to up to five additional regions with logical replication.  It is important to note that, depending on the workload, logical replication across regions can be notably susceptible to replication lag.

Physical Replication

[Image: Durability, High Availability and Disaster Recovery with Amazon Aurora]

One of the many announcements to come out of re:Invent 2018 is a product called Aurora Global Database. This is Aurora’s implementation of cross-region physical replication. Amazon’s published details on the solution indicate that it is storage level replication implemented on dedicated cross-region infrastructure with sub-second latency. In general terms, the idea behind a cross-region architecture is that the second region could be an exact duplicate of the primary region. This means that the primary region can have up to 15 Aurora Replicas and the secondary region can also have up to 15 Aurora Replicas. There is one database instance in the secondary region in the role of writer for that region. This instance can be configured to take over as the master for both regions in the case of a regional failure. In this scenario the secondary region becomes primary, and the writer in that region becomes the primary database writer. This configuration provides protection in the case of a regional disaster. It’s going to take some time to test this, but at the moment this architecture appears to provide the most comprehensive combination of Durability, HA, and DR. The trade-offs have yet to be thoroughly explored.

Multi-Master Options

Amazon is in the process of building out a new capability called Aurora Multi-Master. Currently, this feature is in preview phase and has not been released for general availability. While there were a lot of talks at re:Invent 2018 which highlighted some of the components of this feature, there is still no affirmative date for release. Early analysis points to the feature being localized to the AZ. It is not known if cross-region Multi-Master will be supported, but it seems unlikely.

Summary

As a post re:Invent takeaway, what I learned was that there is an Aurora configuration to fit almost any workload that requires strong performance behind it. Not all heavy workloads also demand HA and DR. If this describes one of your workloads, then there is an Aurora configuration that fits your needs. On the flip side, it is also important to remember that while data durability is an intrinsic quality of Aurora, HA and DR are not. These are completely configurable. This means that the Aurora architect in your organization must put thought and due diligence into the way they design your Aurora deployment. While we all need to be conscious of costs, don’t let cost consciousness become a blinder to reality. Just because your environment is running in Aurora does not mean you automatically have HA and DR for your database. In Aurora, HA and DR are configuration options, and just like the on-premise world, viable HA and DR have additional costs associated with them.


Jan
09
2019
--

Amazon Aurora Serverless – The Sleeping Beauty

One of the most exciting features Amazon Aurora Serverless brings to the table is its ability to go to sleep (pause) when idle. This is a fantastic feature for development and test environments. You get access to a powerful database to run tests quickly, but it goes easy on your wallet as you only pay for storage when the instance is paused.

You can configure Amazon RDS Aurora Serverless to go to sleep after a specified period of time. This can be set to anywhere between five minutes and 24 hours.
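
As a rough illustration of setting this from code, the sketch below builds the scaling settings and checks the pause delay against the five minute to 24 hour range mentioned above. It assumes the boto3 SDK and its `modify_db_cluster` call with a `ScalingConfiguration` argument; the helper names and the capacity values are hypothetical.

```python
MIN_PAUSE_SECONDS = 5 * 60         # five minutes
MAX_PAUSE_SECONDS = 24 * 60 * 60   # 24 hours

def build_scaling_configuration(min_acu, max_acu, pause_after_seconds):
    """Build the scaling settings, rejecting pause delays outside the allowed range."""
    if not MIN_PAUSE_SECONDS <= pause_after_seconds <= MAX_PAUSE_SECONDS:
        raise ValueError("pause delay must be between five minutes and 24 hours")
    if min_acu > max_acu:
        raise ValueError("minimum capacity cannot exceed maximum capacity")
    return {
        "MinCapacity": min_acu,
        "MaxCapacity": max_acu,
        "AutoPause": True,
        "SecondsUntilAutoPause": pause_after_seconds,
    }

def apply_scaling_configuration(cluster_id, scaling_configuration):
    """Send the settings to AWS. Not called below; needs boto3 and credentials."""
    import boto3  # imported lazily so the helper above works without the AWS SDK

    rds = boto3.client("rds")
    return rds.modify_db_cluster(
        DBClusterIdentifier=cluster_id,
        ScalingConfiguration=scaling_configuration,
    )

config = build_scaling_configuration(min_acu=2, max_acu=8, pause_after_seconds=300)
print(config)
```

Keeping the validation separate from the AWS call means the range check can be exercised in a test/dev script without touching a real cluster.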

[Screenshot: Configure Amazon RDS Aurora Serverless sleep time]

For this feature to work, however, inactivity has to be complete. If you have so much as a single query or even maintain an idle open connection, Amazon Aurora Serverless will not be able to pause.

This means, for example, that pretty much any monitoring you may have enabled, including our own Percona Monitoring and Management (PMM), will prevent the instance from pausing. It would be great if Amazon RDS Aurora Serverless would allow us to specify user accounts to ignore, or additional service endpoints which should not prevent it from pausing, but currently you need to get by without such monitoring and diagnostic tools, or else enable them only for the duration of the test run.

If you’re using Amazon Aurora Serverless to back very low traffic applications, you might consider disabling the automatic pause function, since waking up currently takes quite a while. Otherwise, your users should be prepared for a wait of 30+ seconds while Amazon Aurora Serverless activates.

Having such a high time to activate means you need to be mindful of timeout configuration in your test/dev scripts so you do not have to deal with sporadic failures. Or you can also use something like the mysqladmin ping command to activate the instance before your test run.
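
One way to handle this in a test/dev script is a small retry loop that keeps pinging until the instance answers. The Python sketch below is only an outline of the idea: `wait_until_awake` and `fake_ping` are hypothetical names, and the real `ping` would be whatever your MySQL driver needs to open a connection and run a trivial query.

```python
import time

def wait_until_awake(ping, timeout=120.0, interval=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call ping() until it succeeds; return the seconds spent waiting.

    ping is any callable that raises while the database is still resuming,
    for example a function that opens a connection and runs SELECT 1.
    """
    start = clock()
    while True:
        try:
            ping()
            return clock() - start
        except Exception as exc:  # broad on purpose: driver errors vary
            if clock() - start + interval > timeout:
                raise TimeoutError("database did not resume in time") from exc
            sleep(interval)

# Simulated database that refuses the first two connection attempts.
attempts = {"count": 0}

def fake_ping():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("still resuming")

waited = wait_until_awake(fake_ping, timeout=60, interval=0.01)
print(f"awake after {attempts['count']} attempts")
```

With the simulated ping this prints `awake after 3 attempts`. For a real instance, a timeout comfortably above one minute seems a sensible starting point given the activation times discussed in this post.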

Some activation experiments

Let’s now take a closer look at Amazon RDS Aurora Serverless activation times. These times are measured for MySQL 5.6 based Aurora Serverless, the only edition currently available. I expect numbers could be different in other editions.

[Graph: Amazon RDS Aurora Serverless activation times]

I measured the time it takes to run a trivial query (SELECT 1) after the instance goes to sleep. You’ll see I manually scaled the Amazon RDS Aurora Serverless instance to a desired capacity in ACU (Aurora Compute Units), and then had the script wait for six minutes to allow for pause to happen before running the query. The test was performed 12 times and the Min/Max/Avg times of these test runs for different settings of ACU are presented above.
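
The measurement loop can be outlined as follows. This is a sketch, not the script used for the graph: `stand_in_query` is a placeholder that merely sleeps for 10 ms, where the real test would wait for the pause, connect, and run SELECT 1.

```python
import statistics
import time

def time_call(fn, clock=time.perf_counter):
    """Return how many seconds a single call to fn() takes."""
    start = clock()
    fn()
    return clock() - start

def summarize(samples):
    """Reduce a list of timings to Min/Max/Avg figures."""
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": statistics.mean(samples),
    }

def stand_in_query():
    # Placeholder for: wait for the pause, connect, run SELECT 1.
    time.sleep(0.01)

# Twelve runs per setting, as in the test described above.
samples = [time_call(stand_in_query) for _ in range(12)]
summary = summarize(samples)
print({name: round(value, 3) for name, value in summary.items()})
```

Reporting the maximum alongside the average matters here, because a timeout has to accommodate the slowest activation rather than the typical one.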

You can see there is some variation between min and max times. I would expect to have even higher outliers, so plan for an activation time of more than a minute as a worst case scenario.

Also note that there is an interesting difference in the activation time between instance sizes. While in my tests the smallest possible size (2 ACU) consistently took longer to activate compared to the medium size (8 ACU), the even bigger size (64 ACU) was the slowest of all.

So make no assumptions about how long it will take an instance of a given size to wake up with your workload; test it if this is an important consideration for you.

In some (rare) cases I also observed some internal timeouts during the resume process:

[root@ip-172-31-16-160 serverless]# mysqladmin ping -h serverless-test.cluster-XXXX.us-east-2.rds.amazonaws.com -u user -ppassword
mysqladmin: connect to server at 'serverless-test.cluster-XXXX.us-east-2.rds.amazonaws.com' failed
error: 'Database was unable to resume within timeout period.'

What about Autoscaling?

Finally, you may wonder how Amazon Aurora Serverless pausing plays with Amazon Aurora Serverless Autoscaling.

In my tests, I observed that resume always restores the instance size to the same ACU as it was before it was paused. However, this is where pausing configuration matters a great deal. According to this document, Amazon Aurora Serverless will not scale down more frequently than once per 900 seconds. While the document does not clarify over what period of time the conditions initiating scale down (CPU usage, connection usage, etc.) have to be met for scale down to be triggered, I can see that if the instance is idle for five minutes the scale down is not performed: it is just put to sleep.

At the same time, if you change this default five minute period to a longer time, the idle instance will be automatically scaled down a notch every 900 seconds before it finally goes to sleep. Consequently, when it is awakened it will not be at the last stage at which the load was applied, but instead at the stage it was at when it was scaled down. Also, scaling down is considered an event by itself, which resets the idle counter and delays the pause. For example: if the initial instance scale is 8, and the pause timer is set to 1h, it takes 1h 30 minutes for the pause to actually happen: 30 minutes to do scale down twice, plus 1 hour at the minimum size for pause to trigger.
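
The arithmetic in this example can be written down as a small Python sketch. It rests on two assumptions that go beyond the text: that "a notch" means halving the ACU count, and that the minimum capacity in the example is 2 ACU (which is what makes 8 reach the minimum in two steps). The function names are illustrative.

```python
SCALE_DOWN_COOLDOWN_S = 900  # at most one scale-down step per 900 seconds

def scale_down_steps(current_acu, min_acu):
    """Count the halvings needed to get from current_acu down to min_acu."""
    steps = 0
    while current_acu > min_acu:
        current_acu //= 2
        steps += 1
    return steps

def seconds_until_pause(current_acu, min_acu, pause_timeout_s):
    """Estimate the idle time before the pause, following the observations above."""
    if pause_timeout_s <= SCALE_DOWN_COOLDOWN_S:
        # The pause fires before any scale-down can happen.
        return pause_timeout_s
    # Each scale-down resets the idle counter, so the full pause timeout
    # only starts counting once the minimum size has been reached.
    steps = scale_down_steps(current_acu, min_acu)
    return steps * SCALE_DOWN_COOLDOWN_S + pause_timeout_s

print(seconds_until_pause(8, 2, 3600) / 60)  # minutes
```

This prints `90.0`, matching the 1h 30 minutes above: two scale-downs at 900 seconds each, plus the one hour pause timer. With the default five minute timer the function simply returns 300 seconds, since no scale-down happens first.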

Here is a graph to illustrate this:

[Graph: Amazon Aurora Serverless scale down timings]

This also shows that when the load is re-applied at about 13:47, it recovers to the last number of ACU it had before the pause.

This means that a pause time of more than 15 minutes makes the pause behavior substantially different to the default.

Summary

  • Amazon Aurora Serverless automatic pause is great for test/dev environments.
  • Resume time is relatively long and can reach as much as one minute.
  • Consider disabling automatic pausing for low traffic production applications, or at least let your users know they need to wait when they wake up the application.
  • Pause and Resume behavior is different in practice for a pause timeout of more than 15 minutes. Sticking to the default 5 minutes is recommended unless you really know what you’re doing.
Jan
04
2019
--

Amazon RDS Aurora MySQL – Differences Among Editions

Amazon Aurora with MySQL Compatibility comes in three editions which, at the time of writing, have quite a few differences around the features that they support.  Make sure you don’t assume the newer Aurora 2.x supports everything in Aurora 1.x. On the contrary, right now Aurora 1.x (MySQL 5.6 based) supports the most Aurora features.  The serverless option was launched for this version, and it’s not based on the latest MySQL 5.7.  However, the serverless option, too, has its own set of limitations.

I found a concise comparison of what is available in which Amazon Aurora edition hard to come by so I’ve created one.  The table was compiled based mostly on documentation research, so if you spot some mistakes please let me know and I’ll make a correction.

Please keep in mind, this is expected to change over time. For example, Amazon Aurora 2.x was initially released without Performance_Schema support, which was enabled in later versions.

There seems to be a lag in porting Aurora features from the MySQL 5.6 compatible edition to the MySQL 5.7 compatible one: the current 2.x release does not include features introduced in Aurora 1.16 or later, as per this document.

A comparison table

Feature | MySQL 5.6 Based | MySQL 5.7 Based | Serverless MySQL 5.6 Based
Compatible to MySQL | MySQL 5.6.10a | MySQL 5.7.12 | MySQL 5.6.10a
Aurora Engine Version | 1.18.0 | 2.03.01 | 1.18.0
Parallel Query | Yes | No | No
Backtrack | Yes | No | No
Aurora Global Database | Yes | No | No
Performance Insights | Yes | No | No
SELECT INTO OUTFILE S3 | Yes | Yes | Yes
Amazon Lambda – Native Function | Yes | No | No
Amazon Lambda – Stored Procedure | Yes | Yes | Yes
Hash Joins | Yes | No | Yes
Fast DDL | Yes | Yes | Yes
LOAD DATA FROM S3 | Yes | Yes | No
Spatial Indexing | Yes | Yes | Yes
Asynchronous Key Prefetch (AKP) | Yes | No | Yes
Scan Batching | Yes | No | Yes
S3 Backed Based Migration | Yes | No | No
Advanced Auditing | Yes | Yes | No
Aurora Replicas | Yes | Yes | No
Database Cloning | Yes | Yes | No
IAM database authentication | Yes | Yes | No
Cross-Region Read Replicas | Yes | Yes | No
Restoring Snapshot from MySQL DB | Yes | Yes | No
Enhanced Monitoring | Yes | Yes | No
Log Export to Cloudwatch | Yes | Yes | No
Minor Version Upgrade Control | Yes | Yes | Always On
Data Encryption Configuration | Yes | Yes | Always On
Maintenance Window Configuration | Yes | Yes | No

I hope this helps with selecting which Amazon Aurora edition is right for you when it comes to supported features.
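
To show how the table can be used for selection, here is a small Python sketch that filters editions by a list of required features. Only a handful of rows are copied from the table above, and the function name `editions_supporting` is made up for this example.

```python
EDITIONS = ("MySQL 5.6 Based", "MySQL 5.7 Based", "Serverless MySQL 5.6 Based")

# A few rows copied from the comparison table; True means "Yes".
FEATURES = {
    "Parallel Query": (True, False, False),
    "Backtrack": (True, False, False),
    "Hash Joins": (True, False, True),
    "LOAD DATA FROM S3": (True, True, False),
    "Aurora Replicas": (True, True, False),
    "IAM database authentication": (True, True, False),
}

def editions_supporting(required):
    """Return the editions in which every required feature is available."""
    return [
        edition
        for index, edition in enumerate(EDITIONS)
        if all(FEATURES[feature][index] for feature in required)
    ]

print(editions_supporting(["Hash Joins"]))
print(editions_supporting(["Aurora Replicas", "IAM database authentication"]))
```

The first call returns the MySQL 5.6 based and serverless editions, the second the MySQL 5.6 and 5.7 based editions. Since the table reflects a point in time, the data in such a lookup would need to be refreshed against current documentation.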


Photo by Nathan Dumlao on Unsplash

Dec
20
2018
--

Percona Database Performance Blog 2018 Year in Review: Top Blog Posts

Let’s look at some of the most popular Percona Database Performance Blog posts in 2018.

The closing of a year lends itself to looking back. And making lists. With the Percona Database Performance Blog, Percona staff and leadership work hard to provide the open source community with insights, technical support, predictions and metrics around multiple open source database software technologies. We’ve had nearly 4 million visits to the blog in 2018: thank you! We look forward to providing you with even better articles, news and information in 2019.

As 2018 moves into 2019, let’s take a quick look back at some of the most popular posts on the blog this year.

Top 10 Most Read

These posts had the most views (working down from the highest):

When Should I Use Amazon Aurora and When Should I use RDS MySQL?

Now that Database-as-a-service (DBaaS) is in high demand, there is one question regarding AWS services that cannot always be answered easily: When should I use Aurora and when RDS MySQL?

About ZFS Performance

ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.

Linux OS Tuning for MySQL Database Performance

In this post we will review the most important Linux settings to adjust for performance tuning and optimization of a MySQL database server. We’ll note how some of the Linux parameter settings used for OS tuning may vary according to different system types: physical, virtual or cloud.

A Look at MyRocks Performance

As the MyRocks storage engine (based on the RocksDB key-value store http://rocksdb.org ) is now available as part of Percona Server for MySQL 5.7, I wanted to take a look at how it performs on a relatively high-end server and SSD storage.

How to Restore MySQL Logical Backup at Maximum Speed

The ability to restore MySQL logical backups is a significant part of disaster recovery procedures. It’s a last line of defense.

Why MySQL Stored Procedures, Functions and Triggers Are Bad For Performance

MySQL stored procedures, functions and triggers are tempting constructs for application developers. However, as I discovered, there can be an impact on database performance when using MySQL stored routines. Not being entirely sure of what I was seeing during a customer visit, I set out to create some simple tests to measure the impact of triggers on database performance. The outcome might surprise you.

AMD EPYC Performance Testing… or Don’t get on the wrong side of SystemD

Ever since AMD released their EPYC CPU for servers I wanted to test it, but I did not have the opportunity until recently, when Packet.net started offering bare metal servers for a reasonable price. So I started a couple of instances to test Percona Server for MySQL under this CPU. In this benchmark, I discovered some interesting discrepancies in performance between  AMD and Intel CPUs when running under systemd.

Tuning PostgreSQL Database Parameters to Optimize Performance

Out of the box, the default PostgreSQL configuration is not tuned for any particular workload. Default values are set to ensure that PostgreSQL runs everywhere, with the least resources it can consume and so that it doesn’t cause any vulnerabilities. It is primarily the responsibility of the database administrator or developer to tune PostgreSQL according to their system’s workload. In this blog, we will establish basic guidelines for setting PostgreSQL database parameters to improve database performance according to workload.

Using AWS EC2 instance store vs EBS for MySQL: how to increase performance and decrease cost

If you are using large EBS GP2 volumes for MySQL (i.e. 10TB+) on AWS EC2, you can increase performance and save a significant amount of money by moving to local SSD (NVMe) instance storage. Interested? Then read on for a more detailed examination of how to achieve cost-benefits and increase performance from this implementation.

Why You Should Avoid Using “CREATE TABLE AS SELECT” Statement

In this blog post, I’ll provide an explanation why you should avoid using the CREATE TABLE AS SELECT statement. The SQL statement “create table <table_name> as select …” is used to create a normal or temporary table and materialize the result of the select. Some applications use this construct to create a copy of the table. This is one statement that will do all the work, so you do not need to create a table structure or use another statement to copy the structure.

Honorable Mention:

Is Serverless Just a New Word for Cloud-Based?

Top 10 Most Commented

These posts generated some healthy discussions (not surprisingly, this list overlaps with the first):

Posts Worth Revisiting

Don’t miss these great posts that have excellent information on important topics:

Have a great end of the year celebration, and we look forward to providing more great blog posts in 2019.

Dec
17
2018
--

Amazon RDS Aurora Serverless – The Basics

When I attended AWS re:Invent 2018, I saw there was a lot of attention from both customers and the AWS team on Amazon RDS Aurora Serverless. So I decided to take a deeper look at this technology, and write a series of blog posts on this topic.

In this first post of the series, you will learn about Amazon Aurora Serverless basics and use cases. In later posts, I will share benchmark results and in-depth implementation details.

What Amazon Aurora Serverless Is

A great source of information on this topic is How Amazon Aurora Serverless Works from the official AWS documentation. In this article, you learn what Serverless deployment rather than provisioned deployment means. Instead of specifying an instance size, you specify the minimum and maximum number of “Aurora Capacity Units” you would like to have:

[Screenshot: Choose MySQL version on Aurora]

[Screenshot: Amazon Aurora setup]

[Screenshot: Capacity settings on Amazon Aurora]

Once you set up such an instance it will automatically scale between its minimum and maximum capacity points. You also will be able to scale it manually if you like.

One of the most interesting Aurora Serverless properties, in my opinion, is its ability to pause if it stays idle for a specified period of time.

[Screenshot: Pause capacity on Amazon Aurora]

This feature can save a lot of money for test/dev environments where load can be intermittent.  Be careful, though, using this for production size databases as waking up is far from instant. I’ve seen cases of it taking over 30 seconds in my experiments.

Another thing which may surprise you about Amazon Aurora Serverless, at the time of this writing, is that it is not very well coordinated with other Amazon RDS Aurora products: it is only available as a MySQL 5.6 based edition, is not compatible with recent parallel query innovations either, and comes with a list of other significant limitations. I’m sure Amazon will resolve these in due course, but for now you need to be aware of them.

A simple way to think about it is as follows: Amazon Aurora Serverless is a way to deploy Amazon Aurora so it scales automatically with load; can automatically pause when there is no load; and resume automatically when requests come in.

What Amazon Aurora Serverless is not

When I think about Serverless Computing I think about elastic scalability across multiple servers and resource usage based pricing.  DynamoDB, another database which is advertised as Serverless by Amazon, fits those criteria while Amazon Aurora Serverless does not.

With Amazon Aurora Serverless, for better or for worse, you’re still living in the “classical” instance world.  Aurora Capacity Units (ACUs) are pretty much CPU and Memory Capacity. You still need to understand how many database connections you are allowed to have. You still need to monitor your CPU usage on the instance to understand when auto scaling will happen.

Amazon Aurora Serverless also does not have any magic to scale you beyond single instance performance, which you can get with provisioned Amazon Aurora.

Summary

I’m excited about the new possibilities Amazon Aurora Serverless offers.  As long as you do not expect magic and understand this is one of the newest products in the Amazon Aurora family, you surely should give it a try for applications which fit.

If you’re hungry for more information about Amazon Aurora Serverless and can’t wait for the next articles in this series, this article by Jeremy Daly contains a lot of great information.


Photo by Emily Hon on Unsplash

Nov
26
2018
--

Percona at AWS Re:Invent 2018!

AWS re:Invent

Come see Percona at AWS re:Invent from November 26-30, 2018 in booth 1605 in The Venetian Hotel Expo Hall.

Percona is a Bronze sponsor of AWS re:Invent in 2018 and will be there for the whole show! Drop by booth 1605 in The Venetian Expo Hall to discuss how Percona’s unbiased open source database experts can help you with your cloud database and DBaaS deployments!

Our CEO, Peter Zaitsev will be presenting a keynote called MySQL High Availability and Disaster Recovery at AWS re:Invent!

  • When: 27 November at 1:45 PM – 2:45 PM
  • Where: Bellagio Hotel, Level 1, Gauguin 2

Check out our case study with Passportal on how Percona DBaaS expertise helps guarantee uptime for their AWS RDS environment.

Percona has a lot of great content on how we can help improve your AWS open source database deployment. Check out some of our resources below:

Blogs:

Case Studies:

White Papers:

Webinars:

Datasheets:

See you at the show!

Nov
19
2018
--

Migrating to Amazon Aurora: Design for Flexibility

In this Checklist for Success series, we will discuss reducing unknowns when hosting in the cloud using, and migrating to, Amazon Aurora. These tips might also apply to other database as a service (DBaaS) offerings.

Previous blogs in the migrating to Amazon Aurora series:

The whole premise of a database as a service offering is that you do not need to worry about operating the service; you just need to use it. But all DBaaS offerings have limitations as well as strengths. You should not get too comfortable with all the niceties of such services. You need to remain flexible and ultimately design to prevent database failure.

Have a Grasp of Your Cluster Behavior

Disable lab mode. You should not depend on this setting to take advantage of new features. It can break and leave you in a bad position. If you rely on this feature, and designed your application around it, you might find yourself working around the same problem if, for example, you are running the same queries on a non-Aurora deployment. This is not to say that you shouldn’t take advantage of all Aurora features. Lab mode, however, is “lab mode” and should not be enabled on a production environment.

Use a separate parameter group per cluster to keep your configuration changes isolated. In some cases, you might have a group of clusters that run the same workload and could share one parameter group, but this should be rare, and sharing also prevents you from making rolling changes against each cluster.

Some might point out that syncing the parameter groups in this situation might be difficult. It really isn’t, and you don’t need any complicated tools to do it. For example, you can use pt-config-diff to regularly inspect the differences between the runtime config on each cluster and identify or resolve differences.
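To make the idea of comparing runtime configuration concrete, here is a minimal Python sketch. It is not how pt-config-diff works internally; it only illustrates the kind of comparison involved. The two dictionaries are hypothetical stand-ins for the output of SHOW GLOBAL VARIABLES on each cluster, and the function name is an assumption made for this example.

```python
def diff_settings(left, right):
    """Return {name: (left_value, right_value)} for every setting that differs.

    A setting that is missing on one side is reported with None for that side.
    """
    differences = {}
    for name in sorted(set(left) | set(right)):
        left_value = left.get(name)
        right_value = right.get(name)
        if left_value != right_value:
            differences[name] = (left_value, right_value)
    return differences


# Hypothetical runtime settings captured from two clusters.
cluster_a = {
    "binlog_format": "ROW",
    "max_connections": "1000",
    "innodb_lock_wait_timeout": "50",
}
cluster_b = {
    "binlog_format": "MIXED",
    "max_connections": "1000",
}

for name, (a_value, b_value) in diff_settings(cluster_a, cluster_b).items():
    print(f"{name}: cluster_a={a_value} cluster_b={b_value}")
```

Running a check like this on a schedule makes drift between clusters visible before it turns into a surprise during a failover or a rolling change.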

While it is ideal that your clusters are always up to date, letting upgrades run on their own can be intrusive, especially if your workload has no clear high and low traffic periods. I recommend having more control over the upgrade process, and this excellent community blog post from Renato on how to do just that is worth a read.

Don’t Put All Your Eggs in One Basket

On another note, Aurora can hold up to 64TB of data. Yes, that’s big. It might not be a problem and some of you might even be excited about this potential. But when you think about it, do you really want to store that amount of data in a single basket? What if you need to analyze this data for a particular time period? Is the cost worth it? Surely at some point, you will need to transport that data somewhere.

We’ve seen problems even at sizes less than 2TB. If you need to rebuild an asynchronous replica, for example, it takes a while. You have to be really ahead of capacity planning to ensure that you add new read-replicas when needed. This can be a challenge when you are on the spot. A burst of traffic might already be over before the replica provisioning is complete no matter how fast Aurora replica provisioning is.

Another challenge with datasets that are too big is when you have large tables. Schema changes become increasingly difficult in these situations, especially when such tables are subject to highly concurrent reads and writes. Recall that in the blog Migrating to Amazon Aurora: Optimize for Binary Log Replication we recommend setting binlog_format to ROW to be able to use tools like gh-ost in these types of situations.

High Availability On Your Terms

One limitation with Aurora cluster instances is that there is no easy way of taking a misbehaving read-replica out of rotation. Sure, you can delete the read replica. That leads to transient errors in the application, however, and impacts performance while you wait for a replacement to come up and cover the workload.

Similarly, a misbehaving query can easily spoil the whole cluster, even if that query is spread out evenly to the read-replicas. Depending on how quickly you can disable the query, it might result in losing some business in the process. It would be nice if you could blackhole, rewrite or redirect such queries on demand so as to isolate the impact (or even fix it immediately).

Lastly, certain situations require that you restart the cluster. However, doing so could violate your uptime SLA. These situations can occur when you need to apply a non-dynamic cluster parameter, or you need to perform a cluster upgrade.

You can avoid most of these problems by not solely relying on Aurora’s own implementation of high availability. I say this because they are continuously improving this process. For now, however, you can use tools like ProxySQL to redirect traffic both in-cluster and between clusters replicating asynchronously. Percona has existing blog posts on this topic: Leveraging ProxySQL with AWS Aurora to Improve Performance, Or How ProxySQL Out-performs Native Aurora Cluster Endpoints and How to Implement ProxySQL with AWS Aurora.

Meanwhile, we’d like to hear your success stories in migrating to Amazon Aurora in the comments below!

Don’t forget to come by and see us at AWS re:Invent, November 26-30, 2018 in booth 1605! Percona CEO Peter Zaitsev will deliver a keynote on MySQL High Availability & Disaster Recovery, Tuesday, November 27 at 1:45 PM – 2:45 PM in the Bellagio Hotel, Level 1, Gauguin 2

Nov
16
2018
--

Migrating to Amazon Aurora: Optimize for Binary Log Replication

In this Checklist for Success series, we will discuss reducing unknowns when hosting in the cloud with, and migrating to, Amazon Aurora. These tips might also apply to other database as a service (DBaaS) offerings.

In our previous article, we discussed the importance of continuous query performance analysis, especially in Amazon Aurora where there is less diagnostic visibility compared to running on EC2 or on-premise. Aside from uptime though, we need a lot more from our data, and we definitely cannot isolate it in Aurora.

Next on our checklist is that at one point or another, we will need to use asynchronous replication. Amazon Aurora has an excellent reputation for absorbing intense amounts of writes, so in many cases where you need an asynchronous replica, that replica can have trouble catching up.

Different Clusters for Different Workloads

Critical workloads and datasets cannot rely on a single copy of their data. With Amazon Aurora, predictable performance means avoiding mixing workloads within your production cluster. While read heavy workloads might fit easily into read-replicas, reporting or analytics workloads might not be a good fit to execute on your main cluster where read-what-you-write profiles are normally found. You can delegate these to an asynchronous replica, either a separate Amazon Aurora cluster or one that runs on an EC2 instance (as an example). This is also true if, say, your analytics or reporting workload generates a significant amount of disk IOPS.

Amazon Aurora bills I/O operations per million requests. You might save some money running disk-heavy analytics operations on a replica running on an i3 instance with a local NVMe for example. Similarly, running an async replica on an EC2 instance or on-premise allows you to take your own independent backups, or simply gives you an extra level of redundancy.
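As a rough illustration of per-million billing, the arithmetic can be sketched in a few lines of Python. The price and the request volume below are hypothetical placeholders, not quoted AWS rates; substitute the current price for your region and your own measured I/O.

```python
def aurora_io_cost(io_requests, price_per_million):
    """Cost of a number of I/O requests when billing is per million requests."""
    return io_requests / 1_000_000 * price_per_million


# Hypothetical inputs: adjust both to your own region and workload.
PRICE_PER_MILLION_USD = 0.20              # assumed price per million requests
monthly_analytics_requests = 5_000_000_000  # assumed 5 billion requests per month

cost = aurora_io_cost(monthly_analytics_requests, PRICE_PER_MILLION_USD)
print(f"Estimated monthly I/O cost: {cost:.2f} USD")
```

Comparing this figure with the fixed cost of an instance that has local storage gives a first-order view of whether moving the analytics workload to an asynchronous replica pays off.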

Multi-Threaded Replication

It is a known fact that MySQL asynchronous replication performance is subject to some limitations. The biggest one is that, by default, it is single-threaded. MySQL 5.6 introduced multi-threaded replication at one database per thread. This did not apply to the majority of use cases, as workloads vary per database and therefore create an imbalance. With MySQL 5.7 (Aurora 2.0), there have been additional improvements such as an alternative algorithm in parallelizing thread execution that depends on certain behaviors regarding how acting primary servers write binary log entries.
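The imbalance with one-database-per-thread replication can be shown with a simplified model. The Python sketch below assumes one applier thread per database and that a database's changes must be applied serially, so the busiest database becomes the critical path. The database names and work units are hypothetical.

```python
def per_database_speedup(work_by_database):
    """Best-case speedup over a single thread with one thread per database.

    The busiest database bounds the total apply time, so the speedup is
    the total work divided by that database's share of it.
    """
    total = sum(work_by_database.values())
    busiest = max(work_by_database.values())
    return total / busiest


# Hypothetical units of replication work per database.
balanced = {"db1": 100, "db2": 100, "db3": 100, "db4": 100}
skewed = {"orders": 370, "users": 10, "audit": 10, "reports": 10}

print(f"balanced: {per_database_speedup(balanced):.2f}x")
print(f"skewed:   {per_database_speedup(skewed):.2f}x")
```

With four evenly loaded databases the model gives a fourfold gain, while a single hot database leaves the replica barely faster than single-threaded replication, which is the situation most workloads are in.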

With that said, certain multi-threaded replication variables (transaction_write_set_extraction) require that the binlog format is set to ROW. This might sound counter-intuitive because the ROW binlog format actually can increase the replication workload. However, ROW format reduces the ambiguity from potentially non-deterministic statements that could cause a replica to drift and become inconsistent, and both critical operations (schema changes) and optimizations (MTS) require that you use ROW binlog format.

It should be apparent by now that the only reasonable path forward to improving asynchronous replication is via the multi-threaded approach. Along with that, there is the need for ROW binlog format. Any design effort should always include this fact if async replication lag is considered a risk. For the basics, configuration options like slave_compressed_protocol and binlog_row_image can reduce network churn. In the deep end, reducing dataset hotspots, ensuring tables have PRIMARY KEYs and embracing multi-threaded optimization can also go a long way.
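The basics in this checklist can be turned into a small review helper. The Python sketch below is an illustration only: the function name, the sample values and the wording of the notes are assumptions for this example. For simplicity it takes all settings from one dictionary, although binlog_format and binlog_row_image apply on the source while slave_compressed_protocol applies on the replica. It reports facts rather than recommending values, because the right binlog_row_image depends on the tools you use.

```python
def replication_findings(variables, tables_without_pk):
    """Return notes about the replication settings discussed in the checklist."""
    findings = []
    if variables.get("binlog_format", "").upper() != "ROW":
        findings.append("binlog_format is not ROW")
    if variables.get("slave_compressed_protocol", "").upper() != "ON":
        findings.append("slave_compressed_protocol is not ON")
    if variables.get("binlog_row_image", "").upper() == "FULL":
        findings.append("binlog_row_image is FULL (every column is logged)")
    for table in sorted(tables_without_pk):
        findings.append(f"table {table} has no PRIMARY KEY")
    return findings


# Hypothetical values, as they might be collected from SHOW GLOBAL VARIABLES
# and from a schema review.
sample_variables = {
    "binlog_format": "MIXED",
    "slave_compressed_protocol": "OFF",
    "binlog_row_image": "FULL",
}
sample_tables_without_pk = ["shop.events"]

for note in replication_findings(sample_variables, sample_tables_without_pk):
    print(note)
```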

Additional Benefits

While running certain read-heavy queries on an async replica, or ensuring you have access to physical datafiles are common use cases, being able to switch to another location (region) or just simply another cluster might also be necessary for some instances.

  • Adding or dropping a column/index on a large table can be achieved with either pt-online-schema-change or gh-ost. But for cases where time is a constraint, applying schema changes on an asynchronous cluster and switching to that sometimes pays for itself.
  • Configuration changes or upgrades that require a cluster restart can take seconds or even minutes. Wouldn’t it be nice if you already had a cluster with all these changes ready to take over at a moment’s notice? Or even fail back to the original if there was an unforeseen issue?

Stay “tuned” for part three.

Meanwhile, we’d like to hear your success stories in Amazon Aurora in the comments below!
