May
31
2016
--

What is a big innodb_log_file_size?


In this post, we’ll discuss what constitutes a big innodb_log_file_size, and how it can affect performance.

In the comments for our post on Percona Server 5.7 performance improvements, someone asked why we use innodb_log_file_size=10G, suggesting that it might be too big.

In my previous post (https://www.percona.com/blog/2016/05/17/mysql-5-7-read-write-benchmarks/), the example used innodb_log_file_size=15G. Is that too big? Let’s take a more detailed look at this.

First, let me rephrase my warning: the log file size should be set as big as possible, but not bigger than necessary. A bigger log file size is better for performance, but it has a drawback (a significant one) that you need to worry about: the recovery time after a crash. You need to balance recovery time in the rare event of a crash against maximizing throughput during peak operations. A bigger log can translate into a 20x longer crash recovery process!

But how big is “big enough”? Is it 48MB (the default value), 1-2GB (which I often see in production), or 10-15GB (like we use for benchmarks)?

I wrote about how the innodb_log_file_size is related to background flushing five years ago, and I recommend this post if you are interested in details:

InnoDB Flushing: Theory and solutions

Since that time many improvements have been made both in Percona Server and MySQL, but a small innodb_log_file_size still affects the throughput.

How? Let’s review how writes happen in InnoDB. Practically all data page writes happen in the background. It seems like background writes shouldn’t affect user query performance, but they do. The more intense background writes are, the more resources are taken away from the user foreground workload. There are three big forces that rule background writes:

  1. How close the checkpoint age is to the async point (again, see the previous material: https://www.percona.com/blog/2011/04/04/innodb-flushing-theory-and-solutions/). This is adaptive flushing.
  2. How close innodb_max_dirty_pages_pct is to the percentage of actual dirty pages. You can see this in the LRU flushing metrics.
  3. How many free pages are required by innodb_lru_scan_depth. This is also in the LRU flushing metrics. (A quick query for checking points 2 and 3 is sketched right after this list.)
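
As a rough illustration of forces 2 and 3, here is a hedged sketch (information_schema tables from MySQL/Percona Server 5.7; exact column coverage may vary by version) comparing the configured limits with the live buffer pool state:

-- Configured limits for dirty-page flushing and LRU free-page depth
SELECT @@innodb_max_dirty_pages_pct AS max_dirty_pct,
       @@innodb_lru_scan_depth      AS lru_scan_depth;

-- Actual dirty-page percentage and free pages, per buffer pool instance
SELECT POOL_ID,
       ROUND(MODIFIED_DATABASE_PAGES * 100.0 / POOL_SIZE, 2) AS dirty_pct,
       FREE_BUFFERS
  FROM information_schema.INNODB_BUFFER_POOL_STATS;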

So in this equation, innodb_log_file_size defines the async point, and how big the checkpoint age can be.

To show a practical application of these forces, I’ve provided some chart data. I will use charts from the Percona Monitoring and Management tool and data from Percona Server 5.7.

Before jumping to the graphs, let me remind you that the max checkpoint age is defined not only by innodb_log_file_size, but also by innodb_log_files_in_group (which is usually “2” by default). So innodb_log_file_size=2GB gives 4GB of log space, of which MySQL will use about 3.24GB (MySQL makes extra reservations to avoid completely running out of log space).
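
To check this arithmetic on a running server, a quick, hedged query of the two variables gives the total redo space (the 2GB value is just the example from this post):

-- Total redo log space = innodb_log_file_size * innodb_log_files_in_group
SELECT @@innodb_log_file_size / 1024 / 1024 / 1024 AS log_file_gb,
       @@innodb_log_files_in_group AS files_in_group,
       @@innodb_log_file_size * @@innodb_log_files_in_group / 1024 / 1024 / 1024 AS total_log_gb;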

Below are graphs from a tpcc-mysql benchmark with 1500 warehouses, which provides about 150GB of data. I used innodb_buffer_pool_size=64GB, and I made two runs:

  1. with innodb_log_file_size=2GB
  2. with innodb_log_file_size=15GB

Other details about my setup:

  • CPU: 56 logical CPU threads, Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
  • OS: Ubuntu 16.04
  • Kernel: 4.4.0-21-generic
  • Storage: Samsung SM863 SATA SSD, single device, with ext4 filesystem
  • MySQL version: Percona Server 5.7.11
  • innodb_io_capacity=5000 / innodb_io_capacity_max=7500

On the first chart, let’s look at the max checkpoint age, current checkpoint age and amount of flushed pages per second:

. . . and also a related graph of how many pages are flushed by different forces (LRU flushing and adaptive flushing). You can get this data by enabling innodb_monitor_enable = '%'.
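
As a hedged example, here is how those counters can be turned on and the per-force flushing numbers pulled out (the LIKE patterns are approximate; check the NAME values on your version):

-- Enable all InnoDB metrics counters (adds some overhead; can be narrowed to specific modules)
SET GLOBAL innodb_monitor_enable = '%';

-- Pages flushed by adaptive flushing vs. LRU flushing
SELECT NAME, COUNT
  FROM information_schema.INNODB_METRICS
 WHERE NAME LIKE 'buffer_flush%' OR NAME LIKE 'buffer_LRU_batch%';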

From these charts, we can see that with a 2GB innodb_log_file_size, InnoDB is forced by adaptive flushing to flush (write) more pages, because the current checkpoint age (uncheckpointed bytes) is very close to the max checkpoint age. To see the checkpoint age in MySQL, you can use the innodb_metrics table and the metrics recovery_log_lsn_checkpoint_age and recovery_log_max_modified_age_sync.
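
A hedged way to read those two metrics from the innodb_metrics table (the counters may need to be enabled first, as shown above):

SELECT NAME, SUBSYSTEM, COUNT
  FROM information_schema.INNODB_METRICS
 WHERE NAME LIKE '%checkpoint_age%'
    OR NAME LIKE '%max_modified_age%';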

In the case of innodb_log_file_size=15GB, the main flushing is done via LRU flushing (to keep 5000 pages (innodb_lru_scan_depth) free per buffer pool instance). From the first graph we can see that uncheckpointed bytes never reach 12GB, so in this case innodb_log_file_size=15GB is overkill. We might be fine with innodb_log_file_size=8GB – but we wouldn’t know that unless we set innodb_log_file_size big enough. MySQL 5.7 comes with a very convenient improvement: it is now much easier to change innodb_log_file_size, but it still requires a server restart. I wish we could change it online, like we can for innodb_buffer_pool_size (I do not see technical barriers to this).
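
For reference, here is a hedged sketch of checking the current value and of the 5.7 resize procedure (the 8G value is just the guess from this paragraph; a clean restart is still required):

-- Current redo log configuration (values in bytes)
SELECT @@innodb_log_file_size, @@innodb_log_files_in_group;

-- To resize in 5.7: set the new value in my.cnf under [mysqld], for example
--   innodb_log_file_size = 8G
-- then restart mysqld cleanly; InnoDB recreates the redo log files at the new size on startup.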

Let’s also look into the InnoDB buffer pool content:

We can see that there are more modified pages in the case with 15GB log files (which is good, as more modified pages means less work done in the background).

And the most interesting question: how does it affect throughput?

With innodb_log_file_size=2GB, the throughput is about 20% worse. With a 2GB log size, you can see that often zero transactions are processed within one second – this is bad, and shows that the flushing algorithm still needs improvement for cases when the checkpoint age is close to or at the async point.

This should make a convincing case that using a big innodb_log_file_size is beneficial. In this particular case, 8GB (with innodb_log_files_in_group=2) would probably be enough.

What about the recovery time? To measure this, I killed mysqld when the checkpoint age (uncheckpointed bytes) was about 10GB. It took about 20 minutes to start mysqld after the crash. In another experiment, with 25GB of uncheckpointed bytes, it took 45 minutes. Unfortunately, crash recovery in MySQL is still single-threaded, so it takes a long time to read and apply 10GB worth of changes (even from fast storage).

We can see that recovery is single-threaded from the CPU usage chart during recovery:


The system uses 2% of the CPU (which corresponds to a single CPU).

In many cases, crash recovery is not a huge concern. People don’t always have to wait for MySQL recovery – since even one minute of downtime can be too long, often the instance fails over to a slave (especially with async replication), or the crashed node just leaves the cluster (if you use Percona XtraDB Cluster).

I would still like to see improvements in this area. Crash recovery is the biggest showstopper for using a big innodb_log_file_size, and I think it is possible to add parallelism, similar to multithreaded slaves, to the crash recovery process.

You can find the raw results, scripts and configs here.

 

May
31
2016
--

Robin makes the office smarter with $7 million in new funding

One of the first-world problems most plaguing the new corporate world is the time wasted trying to find and secure a meeting room. At least, that’s the premise Robin is based on.
Robin is a company that helps offices organize, track, and monitor their spaces.
The company just raised a $7 million Series A led by FirstMark Capital, with participation from existing investors such as… Read More

May
30
2016
--

Diane Greene wants to put the enterprise front and center of Google Cloud strategy

When Google bought bebop Technologies last fall for $348 million, it got more than a stealthy startup. It also landed Diane Greene as executive vice president of Google Cloud Enterprise — and that perhaps was the bigger prize. Greene brought with her years of industry experience, having co-founded and been CEO at VMware for a decade, building it into a virtualization powerhouse. In… Read More

May
27
2016
--

Galera Error Failed to Report Last Committed (Interrupted System Call)


In this blog, we’ll discuss the ramifications of the Galera Error Failed to Report Last Committed (Interrupted System Call).

I have recently seen this error with Percona XtraDB Cluster (or Galera):

[Warning] WSREP: Failed to report last committed 549684236, -4 (Interrupted system call)

It was posted on Launchpad as a bug in 2013: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1434646

My colleague Przemek replied, and explained it as:

Reporting the last committed transaction is just a part of the certification index purge process. In case it fails for some reason (it occasionally does), the cert index purge may be a little delayed. But it does not mean the transaction was not applied successfully. This is a warning after all.

If we look up this error in the source code, we realize it is reusing Linux system errors. Specifically:

#define EINTR 4 /* Interrupted system call */

As there isn’t much documentation regarding this error, and internet searches did not bring up useful information, my colleague David Bennett and I delved into the source code (as we do on occasion).

If we look at the Galera source code in gcs_sm.hpp, we see:

289  * @retval -EINTR  - was interrupted by another thread

We also see:

317                 /* was interrupted, will be handled by someone else */

This means that the thread was interrupted, but the server will retry on another thread. As it is just a warning, it isn’t anything to be too concerned about – unless these warnings begin to pile up (which could be a sign of concurrency issues).
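
If you want to check whether these warnings coincide with a growing certification index, a hedged check of two Galera status counters looks like this (a steadily growing index suggests the purge is falling behind; the second counter is simply worth watching alongside it):

SHOW GLOBAL STATUS LIKE 'wsrep_cert_index_size';
SHOW GLOBAL STATUS LIKE 'wsrep_local_cert_failures';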

The specific warning is thrown from galera_service_thd.cpp here:

58                 if (gu_unlikely(ret < 0))
59                 {
60                     log_warn << "Failed to report last committed "
61                              << data.last_committed_ << ", " << ret
62                              << " (" << strerror (-ret) << ')';
63                     // @todo: figure out what to do in this case
64                 }

This warning could be handled better, so that it doesn’t clutter the logs or sound cryptic enough to worry administrators.

May
27
2016
--

Asynchronous Query Execution with MySQL 5.7 X Plugin


In this blog, we will discuss MySQL 5.7 asynchronous query execution using the X Plugin.

Overview

MySQL 5.7 supports the X Plugin / X Protocol, which allows (if the library supports it) asynchronous query execution. In 2014, I published a blog post on how to increase slow query performance with parallel query execution. There, I created a prototype in the bash shell. Here, I’ve tried a similar idea with NodeJS and the mysqlx library (which uses the MySQL X Plugin).

TL;DR version: By using the MySQL X Plugin with NodeJS I was able to increase query performance 10x (some query rewrite required).

X Protocol and NodeJS

Here are the steps required:

  1. First, we will need to enable the X Plugin in MySQL 5.7.12+, which will use a different port (33060 by default); a SQL sketch for this step is shown after this list.
  2. Second, download and install NodeJS (>4.2) and mysql-connector-nodejs-1.0.2.tar.gz (follow the Getting Started with Connector/Node.JS guide).
    # node --version
    v4.4.4
    # wget https://dev.mysql.com/get/Downloads/Connector-Nodejs/mysql-connector-nodejs-1.0.2.tar.gz
    # npm install mysql-connector-nodejs-1.0.2.tar.gz

    Please note: on older systems, you will probably need to upgrade the nodejs version. Follow the Installing Node.js via package manager guide.

  3. All set! Now we can use the asynchronous queries feature.
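
For step 1, here is a hedged sketch of enabling and verifying the X Plugin on MySQL 5.7.12+ (the shared library name below is the Linux one; use mysqlx.dll on Windows):

-- Install the X Plugin shipped with MySQL 5.7.12+
INSTALL PLUGIN mysqlx SONAME 'mysqlx.so';

-- Verify it is loaded and check the port it listens on (33060 by default)
SHOW PLUGINS;
SHOW GLOBAL VARIABLES LIKE 'mysqlx_port';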

Test data 

I’m using the same Wikipedia Page Counts dataset (wikistats) I’ve used for my Apache Spark and MySQL example. Let’s imagine we want to compare the popularity of MySQL versus PostgreSQL in January 2008 (comparing the total page views). Here are the sample queries:

mysql> select sum(tot_visits) from wikistats_by_day_spark where url like '%mysql%';
mysql> select sum(tot_visits) from wikistats_by_day_spark where url like '%postgresql%';

The table only holds data for the English Wikipedia for January 2008, but it still has ~200M rows and is ~16GB in size. Both queries run for ~5 minutes each, and utilize only one CPU core (one connection = one CPU core). The box has 24 CPU cores, Intel(R) Xeon(R) CPU L5639 @ 2.13GHz. Can we run the query in parallel, utilizing all cores?
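
A hedged way to confirm the approximate row count and footprint cited above (TABLE_ROWS is only an estimate for InnoDB; the 'wikistats' schema name matches the one used later in the script):

SELECT TABLE_ROWS,
       ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024 / 1024, 1) AS size_gb
  FROM information_schema.TABLES
 WHERE TABLE_SCHEMA = 'wikistats'
   AND TABLE_NAME = 'wikistats_by_day_spark';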

That is possible now with NodeJS and the X Plugin, but it requires some preparation:

  1. Partition the table using hash, 24 partitions:
    CREATE TABLE `wikistats_by_day_spark_part` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `mydate` date NOT NULL,
      `url` text,
      `cnt` bigint(20) NOT NULL,
      `tot_visits` bigint(20) DEFAULT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=239863472 DEFAULT CHARSET=latin1
    /*!50100 PARTITION BY HASH (id)
    PARTITIONS 24 */
  2. Rewrite the query to run one connection (= one thread) per partition, with each thread choosing its own partition:
    select sum(tot_visits) from wikistats_by_day_spark_part partition (p<N>) where url like '%mysql%';
  3. Wrap it up inside the NodeJS Callback functions / Promises.

The code

var mysqlx = require('mysqlx');
var cs_pre = {
    host: 'localhost',
    port: 33060,
    dbUser: 'root',
    dbPassword: 'mysql'
};
var cs = {
    host: 'localhost',
    port: 33060,
    dbUser: 'root',
    dbPassword: 'mysql'
};
var partitions = [];
var res = [];
var total = 0;
mysqlx.getNodeSession( cs_pre ).then(session_pre => {
        var sql="select partition_name from information_schema.partitions where table_name = 'wikistats_by_day_spark_part' and table_schema = 'wikistats' ";
        session_pre.executeSql(sql)
                .execute(function (row) {
                        partitions.push(row);
                }).catch(err => {
                        console.log(err);
                })
                .then( function () {
                        partitions.forEach(function(p) {
                                  mysqlx.getNodeSession( cs ).then(session => {
                                     var sql="select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(" + p + ") where url like '%mysql%';"
                                     console.log("Started SQL for partiton: " + p);
                                     return Promise.all([
                                        session.executeSql(sql)
                                                 .execute(function (row) {
                                                        console.log(p + ":" + row);
                                                        res.push(row);
                                                        total = Number(total) + Number(row);
                                                }).catch(err => {
                                                        console.log(err);
                                                }),
                                        session.close()
                                     ]);
                                  }).catch(err => {
                                       console.log(err + "partition: " + p);
                                  }).then(function() {
                                        // All done
                                        if (res.length == partitions.length) {
                                                console.log("All done! Total: " + total);
                                                // can now sort "res" array if needed an display
                                        }
                                  });
                        });
                });
        session_pre.close();
});
console.log("Starting...");

The explanation

The idea here is rather simple:

  1. Find all the partitions for the table by using “select partition_name from information_schema.partitions” (shown standalone after this list)
  2. For each partition, run the query in parallel: create a connection, run the query with a specific partition name, define the callback function, then close the connection.
  3. As the callback function is used, the code will not block, but rather proceeds to the next iteration. When the query is finished, the callback function will be executed.
  4. Inside the callback function, I’m saving the result into an array and also calculating the total (actually I only need a total in this example).
    .execute(function (row) {
                             console.log(p + ":" + row);
                             res.push(row);
                             total = Number(total) + Number(row); ...
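
For reference, the partition-discovery query from step 1, runnable on its own in the mysql client (same schema and table names as in the script above):

SELECT PARTITION_NAME
  FROM information_schema.PARTITIONS
 WHERE TABLE_SCHEMA = 'wikistats'
   AND TABLE_NAME = 'wikistats_by_day_spark_part';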

Asynchronous Salad: tomacucumtoes,bersmayonn,aise *

This may blow your mind: because everything is running asynchronously, the callback functions will return when ready. Here is the result of the above script:

$ time node async_wikistats.js
Starting...
Started SQL for partiton: p0
Started SQL for partiton: p1
Started SQL for partiton: p2
Started SQL for partiton: p3
Started SQL for partiton: p4
Started SQL for partiton: p5
Started SQL for partiton: p7
Started SQL for partiton: p8
Started SQL for partiton: p6
Started SQL for partiton: p9
Started SQL for partiton: p10
Started SQL for partiton: p12
Started SQL for partiton: p13
Started SQL for partiton: p11
Started SQL for partiton: p14
Started SQL for partiton: p15
Started SQL for partiton: p16
Started SQL for partiton: p17
Started SQL for partiton: p18
Started SQL for partiton: p19
Started SQL for partiton: p20
Started SQL for partiton: p21
Started SQL for partiton: p22
Started SQL for partiton: p23

… here the script will wait for the async calls to return, and they will return when ready – the order is not defined.

Meanwhile, we can watch MySQL processlist:

+------+------+-----------------+-----------+---------+-------+--------------+-------------------------------------------------------------------------------------------------------------------+
| Id   | User | Host            | db        | Command | Time  | State        | Info                                                                                                              |
+------+------+-----------------+-----------+---------+-------+--------------+-------------------------------------------------------------------------------------------------------------------+
|  186 | root | localhost:44750 | NULL      | Sleep   | 21391 | cleaning up  | PLUGIN                                                                                                            |
| 2290 | root | localhost       | wikistats | Sleep   |  1417 |              | NULL                                                                                                              |
| 2510 | root | localhost:41737 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p0) where url like '%mysql%'  |
| 2511 | root | localhost:41738 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p1) where url like '%mysql%'  |
| 2512 | root | localhost:41739 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p2) where url like '%mysql%'  |
| 2513 | root | localhost:41741 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p4) where url like '%mysql%'  |
| 2514 | root | localhost:41740 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p3) where url like '%mysql%'  |
| 2515 | root | localhost:41742 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p5) where url like '%mysql%'  |
| 2516 | root | localhost:41743 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p6) where url like '%mysql%'  |
| 2517 | root | localhost:41744 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p7) where url like '%mysql%'  |
| 2518 | root | localhost:41745 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p8) where url like '%mysql%'  |
| 2519 | root | localhost:41746 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p9) where url like '%mysql%'  |
| 2520 | root | localhost:41747 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p10) where url like '%mysql%' |
| 2521 | root | localhost:41748 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p11) where url like '%mysql%' |
| 2522 | root | localhost:41749 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p12) where url like '%mysql%' |
| 2523 | root | localhost:41750 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p13) where url like '%mysql%' |
| 2524 | root | localhost:41751 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p14) where url like '%mysql%' |
| 2525 | root | localhost:41752 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p15) where url like '%mysql%' |
| 2526 | root | localhost:41753 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p16) where url like '%mysql%' |
| 2527 | root | localhost:41754 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p17) where url like '%mysql%' |
| 2528 | root | localhost:41755 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p18) where url like '%mysql%' |
| 2529 | root | localhost:41756 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p19) where url like '%mysql%' |
| 2530 | root | localhost:41757 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p20) where url like '%mysql%' |
| 2531 | root | localhost:41758 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p21) where url like '%mysql%' |
| 2532 | root | localhost:41759 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p22) where url like '%mysql%' |
| 2533 | root | localhost:41760 | NULL      | Query   |     2 | Sending data | PLUGIN: select sum(tot_visits) from wikistats.wikistats_by_day_spark_part partition(p23) where url like '%mysql%' |
| 2534 | root | localhost       | NULL      | Query   |     0 | starting     | show full processlist                                                                                             |
+------+------+-----------------+-----------+---------+-------+--------------+-------------------------------------------------------------------------------------------------------------------+

And CPU utilization:

Tasks:  41 total,   1 running,  33 sleeping,   7 stopped,   0 zombie
%Cpu0  : 91.9 us,  1.7 sy,  0.0 ni,  6.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 97.3 us,  2.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 97.0 us,  3.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 97.7 us,  2.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 95.7 us,  2.7 sy,  0.0 ni,  1.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 98.3 us,  1.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  : 98.3 us,  1.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  : 97.7 us,  2.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  : 96.7 us,  3.0 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 98.3 us,  1.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 95.7 us,  4.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 97.7 us,  2.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu12 : 98.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu13 : 98.0 us,  1.7 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu14 : 97.7 us,  2.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 : 97.3 us,  2.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu16 : 98.0 us,  2.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu17 :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu18 : 97.3 us,  2.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu19 : 98.7 us,  1.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu20 : 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu21 : 97.3 us,  2.3 sy,  0.0 ni,  0.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu22 : 97.0 us,  3.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu23 : 96.0 us,  4.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
...
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
18901 mysql     20   0 25.843g 0.017t   7808 S  2386 37.0 295:34.05 mysqld

Now, here is our “salad”:

p1:2499
p23:2366
p2:2297
p0:4735
p12:12349
p14:1412
p3:2045
p16:4157
p20:3160
p18:8717
p17:2967
p13:4519
p15:5462
p10:1312
p5:2815
p7:4644
p9:766
p4:3218
p6:4175
p21:2958
p8:929
p19:4182
p22:3231
p11:4020

As we can see, the partitions return in random order. If needed, we can even sort the result array (which isn’t needed for this example, as we only care about the total). Finally, our result and timing:

All done! Total: 88935
real    0m30.668s
user    0m0.256s
sys     0m0.028s

Timing and Results

  • Original query, single thread: 5 minutes
  • Modified query, 24 threads in Node JS: 30 seconds
  • Performance increase: 10x

If you are interested in the original question (MySQL versus PostgreSQL, Jan 2008):

  • MySQL, total visits: 88935
  • PostgreSQL total visits: 17753

Further Reading:

PS: Original Asynchronous Salad Joke, by Vlad @Crazy_Owl (in Russian)

May
27
2016
--

Join me for the Community Open House for MongoDB

If you can make it to Manhattan Thursday, June 30, 2016, please join me at the Community Open House for MongoDB. The Community Open House for MongoDB, held from 9 AM to 6 PM at the Park Central Hotel, is a free and open event that will feature technical presentations and sessions from key members of the MongoDB open source community. It’s the day after MongoDB World ends, so many of you will already be in town and ready to talk MongoDB – stay one more day to get an even bigger and broader perspective on your MongoDB needs!

The open source community is a diverse and powerful collection of companies, organizations and individuals that have helped to literally change the world. Percona is proud to call itself a member of the open source community, and we strongly feel that upholding the principles of the community is a key to our success. These principles include an open dialog, an open mind, and a zeal for cooperative interaction. Since Percona and ObjectRocket weren’t allowed to participate in MongoDB World, we felt this event would help foster creative and productive dialogue within the community. Together, we can build amazing things!

A reception will be held from 4:30 PM to 6:00 PM, featuring plenty of food, drink and fun.

Event Speakers:

The Community Open House for MongoDB features expert speakers from Percona, ObjectRocket, Facebook, Appboy, Severalnines and The Washington Post. Here’s a partial lineup – more will be announced soon!

  • Open Source Monitoring for MongoDB – Peter Zaitsev, Percona
  • Real-world Operational Concerns for Scaling MongoDB – Kim Wilkins, ObjectRocket
  • Doing More With Less With RocksDB and MongoRocks at Facebook – Islam AbdelRahman, Facebook
  • Tracking Billions of App Installs with MongoDB – Jon Hyman, Appboy
  • WiredTiger Scalability and Performance Optimization – Vadim Tkachenko, Percona
  • Using MongoDB and NodeJS to Build Better Forms at the Washington Post – Kat Styons, The Washington Post
  • Real-world Operational Concerns for Scaling MongoDB – Tim Banks, ObjectRocket
  • How to Automate, Monitor and Manage Existing MongoDB Servers – Art van Scheppingen, Severalnines
  • Open Source Encryption for MongoDB – Tim Banks, ObjectRocket
  • My First Moments with MongoDB – Colin Charles
  • Open Source Backups for MongoDB – David Murphy – Percona
  • MongoDB Chunks – Distribution, Splitting & Merging – Jason Terpko, ObjectRocket
  • MongoDB for MySQL Developers and DBAs – Alexander Rubin, Percona

This event is free of charge and open to all, but we do ask you to register in advance so we can save you a seat.

I hope to see you Thursday in NYC!

May
26
2016
--

Monitoring made easy with Percona App for Grafana


Percona has released a new Percona App for Grafana!

Are you using Grafana 3.x with Prometheus’ time-series database? Now there is a “Percona App” available on Grafana.net! The app provides a set of dashboards for MySQL performance and system monitoring with the Prometheus datasource, and makes it easy for users to install them. The dashboards rely on the alias label in the Prometheus config and depend on a small patch applied to Grafana.

The dashboards in the app are:

  • Cross Server Graphs
  • Disk Performance
  • Disk Space
  • Galera Graphs
  • MySQL InnoDB Metrics
  • MySQL MyISAM Metrics
  • MySQL Overview
  • MySQL Performance Schema
  • MySQL Query Response Time
  • MySQL Replication
  • MySQL Table Statistics
  • MySQL User Statistics
  • Prometheus
  • Summary Dashboard
  • System Overview
  • TokuDB Graphs
  • Trends Dashboard

Percona App for Grafana

The Grafana and Prometheus teams are doing a fantastic job of bringing monitoring and time-series to the next level. They are making collecting and graphing metrics simple and more usable.

See my previous blog post for step-by-step instructions on how to install Grafana and Prometheus. Get the Percona App for Grafana today!

May
26
2016
--

AWS Aurora Benchmarking part 2

AWS Aurora Benchmarking

Some time ago, I published an article on AWS Aurora Benchmarking (AWS Aurora Benchmarking – Blast or Splash?), in which I analyzed the behavior of different solutions using synchronous replication in an AWS environment. This blog follows up on some of the comments and suggestions I received regarding that post from the community and Amazon engineers.

I decided to perform another round of tests, keeping in mind comments and suggestions received.

I presented some of the results during the Percona conference in Santa Clara in April 2016. The following is a transposition of that presentation, with more details.

Not interested in the preliminary descriptions? Go to the results section

Why new tests?

A very good question, with an easy answer.

Aurora is a product that is still under development and refinement: six months of development could present major changes in performance. Not only that, but the initial tests focused on entry-level solutions, meaning I was analyzing the kind of users that are currently starting their business and looking for a flexible solution that allows them to save money and scale.

This time, I put the focus on enterprise solutions by analyzing what an already well-established company would get when looking for a decent scalable solution.

These are two different scenarios.

Why so many (different) tests?

I used many different benchmarking tools, and I am still planning to run others. Why so? Why not simply use one of them?

Again, a simple answer. I used different tools because in some cases they provide me with a different way of accessing and using data. I also do not trust benchmarking tools, not even the ones I developed. I wanted to test the same thing using different tools and compare the results. Only if I see a common pattern would I consider the test valid. Personally, I tend to discard any test that is not consistent, or any analysis performed with a single benchmarking tool. In my opinion, being lazy is not an option when doing these kinds of exercises.

About the tests

It was difficult to compare apples to apples here. And I think that is the main point to keep in mind.

Aurora is not a standard RDS solution, like we are used to. Aurora looks like MySQL, smells like MySQL, but is not vanilla MySQL. To achieve what they have, the engineers had to change many parts. The more you dig in, the more you realize there are significant differences.

Because of that, I had to focus more on identifying what each solution can do and compare the solutions against expectations, rather than comparing the numbers.

I was more interested to see what happens if:

  • I have a burst of connections, and my application goes from 4K to 40K connections. Will it crash? Will it slow down?
  • How long should I wait if a node fails?
  • What should I not have in my schema design, to prevent bottlenecks?

Those are relevant questions, in my opinion, more so than discovering that solution A has 3000 rows written/sec and solution B has 3100. Or that I might (might) have some additional page rotation (file -> memory -> flush) because the amount of memory differs.

That is valuable information, for sure, but less valuable than having a decent understanding of which platform will help my business grow and remain stable.

What is the right tool for the job? This is the question I am addressing.

Tests run

I ran three main kinds of tests:

  • Performance and load stress
  • High availability failover
  • Response time (latency) from the application point of view

Performance and load stress

These tests were the most extensive and demanding.

I analyzed the capacity to serve the load under different conditions, from a light load up to full utilization, and some degree of resource saturation.

  • The first set of tests was to evaluate a simple load on a single table, causing the table to become a hotspot and showing how the platform would manage the increasing contention.
  • The second set of tests was to perform a similar load, but distributing it across multiple tables and batching the operations. Parallelization, contention, scalability and distributed hotspots were in the picture.

The two above focused on write operations only, and were done using different tools (comparing the results as they were complementary).

  • The third set of tests, using my own stress tool, focused on R/W-oriented usage. The tests were executed against multiple tables, performing CRUD actions, using simple and batch inserts, and reads by PK, by index, by range, and with IN and exact match conditions.
  • The fourth set of tests was performed using a TPC-C-like load (OLTP).
  • The fifth set of tests used sysbench in OLTP mode, with 250 tables.

The scope of the last three sets of tests was to identify how the platforms would manage the load, considering the following:

  • Read and write contention on the same tables
  • High level of parallelism (from the application)
  • Possible hot-spots (TPCC district)
  • Increasing utilization (memory, threads, IO)
  • Saturation (connections)

Finally, all tests were run with a fully utilized buffer pool.

The machines

Small boxes (first round of tests):

EIP = 1
VPC = 1
ELB=1
Subnets = 4 (1 public, 3 private)
HAProxy = 6
MHA Monitor (micro ec2) = 1
NAT Instance (EC2) =1 (hosting EIP)
DB Instances (EC2) = 3 (m4.xlarge) 16GB
Application Instances (EC2) = 6 (4)
EBS SSD 3000 PIOPS
Aurora RDS node = 3 (db.r3.xlarge) 30GB

Large boxes (latest tests):

EIP = 1
VPC = 1
ELB=1
Subnets = 4 (1 public, 3 private)
HAProxy = 4
MHA Monitor (micro ec2) = 1
NAT Instance (EC2) =1 (hosting EIP)
DB Instances (EC2) = 3 (c3.8xlarge) 60GB
Application Instances (EC2) = 4
EBS SSD 5000 PIOPS
Aurora RDS node = 3 (db.r3.8xlarge) 244GB

A note

It was pointed out to me that I deliberately chose to use an EC2 solution for Percona XtraDB Cluster with less memory than the one available in Aurora. This is true, and we must take that into consideration. The reason for this is that the only EC2 solution matching the memory of a db.r3.8xlarge is the d2.8xlarge.

I did try it, but the level of scalability I got (from the CPU point of view) was less efficient than the one available with c3.8xlarge. I decided to prefer CPU resources to memory, especially because I was going to test concurrency and parallelism in conjunction with the load increase.

From the result, I feel confident that I chose correctly – but I am open to comment.

The layout

This is what the setup looks like:

AWS Aurora Benchmarking

Where you read Java, those are the application nodes running the different test applications.

Two words about Aurora first

Aurora has a few key concepts that we must have clearly in mind, especially how it manages writes across replicas, and how connections are implemented.

The IO activity

To replicate the information across the different storage nodes, Aurora only replicates FRM files and data coming from IB_LOGS. This is a quite significant advantage over other forms of replication, given the limited number of bytes that are replicated over the network (even if they are replicated six times).

AWS Aurora Benchmarking

Another significant advantage is that Aurora does not use a double write buffer, which is obviously another blast (see the recent optimization in Percona Server https://www.percona.com/blog/2016/05/09/percona-server-5-7-parallel-doublewrite/ ).

In other words, writes in Aurora are organized by filling its commit queue and pushing the changes as group commit to the storage.

AWS Aurora Benchmarking

In some presentations, you might have seen that all steps are asynchronous. But it is important to underline that a commit is acknowledged by Aurora when at least two availability zones (AZs) have received and written the incoming data related to that commit. Writes here mean received in the storage node incoming queue, with a quorum of four out of six nodes.

This means that no matter what, data has to travel on the network to reach the final destination, and ACK signals have to come back before Aurora returns the ACK to the commit operation. The network is in the same region, but it could still represent an unknown in terms of performance. No wonder we could have some latency at this stage!

As you can see, what I am reporting is also confirmed in the image below (and in the observations). The point is that the impact of steps 1 – 2 is not obviously clear.

AWS Aurora Benchmarking

Thread pooling

Aurora also uses thread pooling – a lot! That will become very clear later; the more the work is based on parallelism, the more efficient thread pooling seems to be.

In most cases, we are used to seeing CPUs on database servers not fully utilized, unless there is some heavy ordering operation or a bad query. That behavior is also (but not only) a direct consequence of the connection-to-thread model, which implies periods of latency and standby. In Aurora, incoming connections do not follow the same model. Instead, the pool redistributes the load of the incoming connections to a pool of threads, optimizing the latency period and resulting in higher CPU utilization. Which is what you want from your resource: to be utilized, and not waiting for something else to do its job.

AWS Aurora Benchmarking

 

The results

Without wasting more electronic ink, let’s see what comes out of this round of tests (not the final one, by the way). To simplify the results, I will also report the graphs from the first set of tests, but will focus on the latest.

Small Boxes = SB, Large Boxes = LB.

First Test: IIBench

As stated previously, my scope was to verify how the two platforms would react to a simple insert-focused load on a basic single table. The buffer pool was saturated before running the test.

SB

AWS Aurora Benchmarking

LB

AWS Aurora Benchmarking

As we can see, in the presence of a hot spot the solution using Percona XtraDB Cluster outperformed Aurora in both cases. What is notable, though, is that while XtraDB Cluster remained at approximately the same time/performance, Aurora significantly reduced the time taken. This shows that Aurora was taking advantage of the more powerful platform, while XtraDB Cluster was not able to.

Analyzing the details further, we notice that Aurora performs better atomically: it was able to manage more writes/second, as well as more rows and pages managed. But it was inconsistent: Aurora had performance hiccups at regular intervals. As a result, it took more time to process the whole workload.

I was not able to dig too deeply, given that some metrics are not fully available in Aurora. As such, I had to rely on the Aurora engineers, who mentioned to me that hot-spot contention was a possible issue.

Aurora Handler calls:

AWS Aurora Benchmarking

XtraDB Cluster Handlers:

AWS Aurora Benchmarking

The execution in XtraDB Cluster showed fewer calls but constant performance, while Aurora had hiccups.

Aurora page activity write:

AWS Aurora Benchmarking

XtraDB Cluster page activity write:

AWS Aurora Benchmarking

The trend shown by the handlers stayed consistent in page management and row inserts, as expected.

Second Test: Application Ingest

As mentioned, in this test many threads from different application servers inserted batches of 50 statements against multiple tables.

The results coming from this test are quite favorable to Aurora, as we can see starting from the time taken to complete the same workload:

LB

AWS Aurora Benchmarking

SB

AWS Aurora Benchmarking

With the small boxes, the situation was inverted.

But here is where the interesting part starts.

Aurora can manage significantly higher numbers of rows, as the picture below shows:

AWS Aurora Benchmarking

The results are also constant, and don’t decrease significantly like the inserts with XtraDB Cluster.

The number of handler commits, however, is significantly lower.

AWS Aurora Benchmarking

Once more they stay the same with the load increase, without impacting performance.

Reviewing all handler calls, we get our first surprise.

XtraDB Cluster handler calls:

AWS Aurora Benchmarking

Aurora handler calls:

AWS Aurora Benchmarking

The gaps/drops in the two graphs are the different tests (with an increasing number of threads).

Two things to notice here: the first is that XtraDB Cluster decreases in performance while processing the load, while Aurora does not. The second (you need to zoom into the image) is that the number of commits fluctuates in XtraDB Cluster, while it stays fixed in Aurora.

An even bigger surprise comes up when reviewing the connections graphs.

As expected, XtraDB Cluster has all my connections open, and the number of threads running is quite close to the number of connected threads.

AWS Aurora Benchmarking

Both of them follow the increasing number of connected threads.

But this is not the case in Aurora.

AWS Aurora Benchmarking

Also, while my applications are trying to open ~800 threads, the Aurora node sees only a part of them, and the number of running threads is fixed at 32.

The important things to consider here are that a) my applications don’t connect directly to the Aurora instance, but to a connector (MariaDB), and b) Aurora, in this case, caps the number of running threads to the number of CPUs available on the instance (here 32).

Given that, I expected to have worse performance (but I don’t). The fact that Aurora uses one thread for multiple connections seems to be working quite efficiently.

The number of rows inserted is also consistent with the handler calls, and has better performance than XtraDB Cluster.

Aurora rows inserted:

AWS Aurora Benchmarking

XtraDB Cluster rows inserted

AWS Aurora Benchmarking

Again we see the same trend, only this time Aurora performs better than XtraDB Cluster.

Third Test: OLTP Application

When run on the small boxes, this test saw XtraDB Cluster performing much better than Aurora. The time taken by Aurora was ~3 times the time taken by XtraDB Cluster.

AWS Aurora Benchmarking

With a large box, I had the inverse result: Aurora outperforms XtraDB Cluster by 2 to 7 times in speed.

AWS Aurora Benchmarking

Analyzing the number of commands executed with the increasing workload, we can see how XtraDB Cluster can perform better than Aurora with a workload of 128 threads, but starts to have worse performance as the load increases.

On the other hand, Aurora manages the read/write load without significant performance loss, which includes being able to increase the number of commits/sec.

AWS Aurora Benchmarking

Reviewing the handler calls, we see that the handler commit calls are significantly less in Aurora (as already noticed in the ingest tests).

AWS Aurora Benchmarking

Another thing to note is that the number of calls for XtraDB Cluster is significantly higher and not scaling, while Aurora has a nice scaling trend.

Fourth Test: TPCC-mysql

The TPCC test is mainly to test OLTP traffic, with the note that some tables (like district) might become a hotspot. The tests I ran were executed against 400 warehouses, and used 128 threads maximum for the small box and 2048 threads for the large box.

During this test, I hit one of the Aurora limitations and I escalated it to the Aurora engineers (who are aware of the problem).

Small boxes:

AWS Aurora Benchmarking

In the case of small boxes, there is nothing to say: XtraDB Cluster manages the load more efficiently. Its trend is not optimal, having significant fluctuations, but Aurora is just not able to keep up.

Large boxes:

 

AWS Aurora Benchmarking

The use of large boxes is a different and more complex scenario. I would like to say that Aurora performs better.

This is true for two of the three tests; and on the third, Aurora was also performing better up until it got stuck by an internal limitation. But then its performance just collapsed.

With a more in-depth investigation, I noticed that under the hood Aurora was not performing as well as it appeared. This comes out quite clearly by comparing the graphs covering Com_ execution, open files, handlers and InnoDB row lock time.

In all of them, it is evident how XtraDB Cluster keeps serving the workload with consistent behavior, while Aurora fails already on the second test (512 threads), not just on the third with 2048 threads.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

It is clear that Aurora was better during the test with 256 threads, going over 450K Com_select served (in a 10-second interval), compared with XtraDB Cluster, which was not able to go over 350K.

But in the following tests, while XtraDB Cluster was able to keep going (with decreasing performance), Aurora started to struggle with very inconsistent behavior.

This was also confirmed by the open files graph.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

The graphs show the instances of files opened during the test, not the ones already open. This reflects the Open_files metric: “The number of files that are open. This count includes regular files opened by the server. It does not include other types of files such as sockets or pipes. Also, the count does not include files that storage engines open using their own internal functions rather than asking the server level to do so.”

I was quite surprised by the number of files open by Aurora.

Handlers reflected the same behavior, as well.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

Perfectly in line with the com trend.

So what was increasing in reverse?

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

As you can see from the above, the exact same workload generated an increasing row lock time, from quite low in the test with 256 threads up to crazily high with 2048 threads.

As mentioned, we know that TPCC has a couple of tables that act as hotspots, and we already saw with IIbench how Aurora is not working efficiently in that case.

I also was getting a lot of 188 errors during the test. This is an Aurora internal error. When I reported it, I was told they know about it, and they are planning to work on it.

I hope they do soon, because if this issue is solved it is very likely that Aurora will not only be able to manage the tested workload, but exceed it by far.

I am saying this because, even with the identified issues, Aurora was able to keep going and manage a more than decent response time during the second test (with 512 threads).

AWS Aurora Benchmarking

Fifth Test: Sysbench

I added the sysbench tests to test scalability, and to see what happens when the system reaches a saturation point. This test brought up some limitations existing in the Aurora solution, related more to the connector than to the Aurora engine itself.

Aurora has a limit of 16k connections. I wanted to see what happens if I got to saturation point or close to it. It doesn’t matter if this is a ridiculously high number or not.

What happened was that Aurora managed the traffic up to 4K connections. The closer I got to the limit, however, the more connectivity issues I had. In the end, I had to run the tests with 8K, 12K and 20K threads pointing directly to the Aurora instance, bypassing the connector, which was not able to serve the traffic. After that, I was able to hit up to ~15,500 threads (but with a lot of inconsistent performance). I am taking the previous level of 12K threads as the limit of a meaningful test.

XtraDB Cluster was able to scale up to 16K no problem.

What is also notable here is that Aurora was able to manage the workload more efficiently in terms of transaction handling (i.e., transactions executed and latency).

AWS Aurora Benchmarking

The number of transactions executed by Aurora was ~three times the one executed by XtraDB Cluster.

AWS Aurora Benchmarking

Regarding latency, Aurora showed less latency than XtraDB Cluster.

Internally, Aurora and XtraDB Cluster operations were once again different regarding how the workload was handled. The most divergent result was the handler calls:

AWS Aurora Benchmarking

Commit calls in Aurora were a fraction of the calls in XtraDB Cluster, while the number of rollbacks was higher.

The read calls had an even more divergent behavior, with XtraDB Cluster performing a higher number of read_keys, while Aurora had a very limited number of them. Read_rnd calls are very high in XtraDB Cluster, but totally absent in Aurora (note that in Aurora, read_rnds are reported but seem not to increase). On the other hand, Aurora reported a high number of read_rnd_next calls, while XtraDB Cluster has none.

AWS Aurora Benchmarking

High availability

Fail-over time

Both solutions:

AWS Aurora Benchmarking

In this test, the failover time for the solution using Galera and HAProxy was better, for both a limited and a mid-level load. One assumption is that, given Aurora has to verify both the status of the data transmitted and its consistency across the six data store nodes in every case, the process is not as fast as it could be.

It could also be that the cluster connector is not as efficient as it should be in redirecting traffic from one node to another. It would be a very interesting exercise to replace it with some other custom solution.

Note that I was performing the tests following the Amazon recommendation to use the following to simulate a real crash:

ALTER SYSTEM CRASH [INSTANCE|NODE]

As such, I was not doing anything strange or out of the ordinary.

It is worth mentioning that of the eight seconds taken by MySQL/Galera to perform the failover, six were due to the HAProxy settings (which had a 3000 ms interval and two loops in the settings before executing failover).

Execution latency

The purpose of these tests was to identify the latency between the moment the application sends a request and the moment MySQL/Aurora takes charge of the request. The expectation is that the busier the database, the higher the latency.

For this test, I reported both results: the one coming from the old tests with the small box, and the new one with the large box.

Small boxes:

AWS Aurora Benchmarking

Large boxes:

AWS Aurora Benchmarking

It is clear from the graphs that the two tests report different scenarios. In the first, Galera was able to manage the load more efficiently and serve requests with lower latency. For the new tests, I used a higher number of threads than for the small box. Nevertheless, in the second test the CPU utilization and the number of running threads lead me to think that Aurora was finally able to utilize resources more efficiently, resulting in lower latency.

The latency jumped up again when the number of connections rose above 12K, but that was expected given previous tests results.

Conclusions

High Availability

The two platforms were able to manage the failover operation in a limited time frame (below 1 minute). Nevertheless, MySQL/Galera was shown to be more efficient and consistent. This result is a direct consequence of synchronous replication, which by design prevents MySQL/Galera from allowing an active node to fall behind.

In my opinion, the replication method used in Aurora is efficient, and given that data is shared across the read replicas, fail-over should happen faster.

The tests suffered because of the connector, and I have the feeling that having another solution in place may bring some surprises (actually, I would like to test that as well).

Performance

In this run of tests, Aurora was able to invert the results I had in the first test with the small boxes. In almost all cases, Aurora performed as well as or better than XtraDB Cluster. There are still cases where Aurora is penalized, and those are the ones where hotspots are present. The contention in Aurora kills performance and raises errors (188). But I hope we will see a significant evolution soon.

General Comments on Aurora

The product is evolving quickly, and benchmark results may become obsolete in a very short time (this is why it is important to have repeatable and comparable tests). From my point of view, in this set of tests Aurora clearly shows where it’s a better fit: higher-end levels, where high availability and CPU power are the focus (not concerns about the cost).

There is no reason to use Aurora in small-mid boxes: the platform is not going to be as efficient as a standard solution like XtraDB Cluster. But if cost is not an issue, and the applications require a lot of parallelism, Aurora on db.r3.8xlarge is a good solution.

I still see space for improvements (like for cluster connectors, or the time taken to restart a cluster after a full stop, or contention reduction). But I am also confident that the work led by the development team will fix most of my concerns (and more) soon.

Final note: it would be nice to have the code open source, so that the community could contribute (but I understand the business reasons not to).

About Cost

I don’t think this is the right place to discuss the cost of each solution (especially because every need is different).

As such, I am not reporting any specific numbers. You can, however, follow the links below and do the necessary math:

Aurora cost calculator

AWS cost calculator

 

May
25
2016
--

Meet the latest cohort of Acceleprise startups, and the apps they built to make work easier

Michael Cardamone presenting at Acceleprise in San Francisco. Acceleprise, a San Francisco-based accelerator for enterprise tech and software-as-a-service startups, held a Demo Day for its third cohort of companies today. The accelerator invests $50,000 into each startup admitted to its program via convertible note financing, taking approximately a 5 percent stake under a $1 million cap. It admits 8-12 companies per cohort, and invests out of a $3.5… Read More

May
25
2016
--

Salesforce inks deal with AWS to expand international presence

AWS announced today that it was expanding its relationship with Salesforce.com, with Salesforce naming the cloud giant a preferred cloud provider. The agreement should help Salesforce increase its international presence without having to build its own data centers in countries that have data sovereignty laws, which require that data stays in-country. It’s expensive to build their… Read More
