Amazon Aurora Multi-Primary First Impression

Amazon Aurora Multi-Primary First Impression

Amazon Aurora Multi-Primary First ImpressionFor what reason should I use a real multi-primary setup?

To be clear, not a multi-writer solution where any node can become the active writer in case of needs, as for Percona XtraDB Cluster (PXC) or Percona Server for MySQL using Group_replication. No, we are talking about a multi-primary setup where I can write at the same time on multiple nodes. I want to insist on this “why?”.

After having excluded the possible solutions mentioned above, both covering the famous 99.995% availability, which is 26.30 minutes of downtime in a year, what is left?

Disaster Recovery? Well, that is something I would love to have, but to be a real DR solution we need to put several kilometers (miles for imperial) in the middle.

And we know (see here and here) that aside from some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA, ok, that is a valid reason. Or I may need to scale the number of writes, which is a valid reason as well. This means, in the end, that I am looking to a multi-primary because:

  • Scale writes (more nodes more writes)
    • Consistent reads (what I write on A must be visible on B)
  • Gives me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB cluster. But my (virtual) boss was at AWS re-invent and someone mentioned to him that Aurora Multi-Primary does what I was looking for. This (long) article is my voyage in discovering if that is true or … not.

Given I am focused on the behavior first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use low-level Aurora instances. And will perform tests from two EC2 in the same VPC/region of the nodes.

You can find the details about the tests on GitHub here.

Finally, I will test:

  • Connection speed
  • Stale read
  • Write single node for baseline
  • Write on both node:
    • Scaling splitting the load by schema
    • Scaling same schema

Test Results

Let us start to have some real fun. The first test is …

Connection Speed

The purpose of this test is to evaluate the time taken in opening a new connection and time taken to close it. The action of the open/close connection can be a very expensive operation, especially if applications do not use a connection pool mechanism.

Amazon Aurora Multi-Primary

As we can see, ProxySQL results to be the most efficient way to deal with opening connections, which was expected given the way it is designed to reuse open connections towards the backend.

Different is the close connection operation, in which ProxySQL seems to take a little bit longer.

As a global observation, we can say that by using ProxySQL we have more consistent behavior. Of course, this test is a simplistic one, and we are not checking the scalability (from 1 to N connections) but it is good enough to give us the initial feeling. Specific connection tests will be the focus of the next blog on Aurora MM.

Stale Reads

Aurora multi-primary uses the same mechanism of the default Aurora to update the buffer pool:

Using the Page Cache update, just doing both ways. This means that the Buffer Pool of Node2 is updated with the modification performed in Node1 and vice versa.

To verify if an application would be really able to have consistent reads, I have run this test. This test is meant to measure if, and how many, stale reads we will have when writing on a node and reading from the other.

Amazon Aurora multi-primary has two consistency models:

Aurora consistency model
As an interesting fact, the result was that with the default consistency model (INSTANCE_RAW), we got a 100% stale read.
Given that I focused on identifying the level of the cost that exists when using the other consistency model (REGIONAL_RAW), that allows an application to have consistent reads.

The results indicate an increase of 44% in total execution time, and of 95% (22 times slower) in write execution.

It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models.

The graph below shows in yellow how long the application must wait to see the correct data on the reader node. In blue is the amount of time the application waits to get back the same consistent read because it must wait for the commit on the writer.

lag time in nanoseconds

As you can see, the two are more or less aligned.

Given the performance cost imposed by using REGIONAL_RAW,  all the other tests are done with the default INSTANCE_RAW, unless explicitly stated.

Writing Tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

 sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single Node (Baseline)

Before starting the comparative analysis, I was looking to define what was the “limit” of traffic/load for this platform.

baseline reads/writes

From the graph above, we can see that this setup scales up to 128 threads and after that, the performance remains more or less steady.

Amazon claims that we can mainly double the performance when using both nodes in write mode and use a different schema to avoid conflict.

aurora scalability

Once more, remember I am not interested in the absolute numbers here, but I am expecting the same behavior. Given that, our expectation is to see:

expected scalability

Write on Both Nodes, Different Schemas

So AWS recommend this as the scaling solution:

And I diligently follow the advice. I used two EC2 nodes in the same subnet of the Aurora Node, writing to a different schema (tpcc & tpcc2).


Let us make it short and go straight to the point. Did we get the expected scalability?

Well, no:

We just had a 26% increase, quite far to be the expected 100% Let us see what happened in detail (if not interested just skip and go to the next test).

Node 1

Schema read writes Aurora

Node 2

As you can see, Node1 was (more or less) keeping up with the expectations and being close to the expected performance. But Node2 was just not keeping up, and performances there were just terrible.

The graphs below show what happened.

While Node1 was (again more or less) scaling up to the baseline expectations (128 threads), Node2 collapsed on its knees at 16 threads. Node2 was never able to scale up.


Node 1

Node1 is scaling the reads as expected, but also here and there we can see performance deterioration.

Node 2

Node2 is not scaling Reads at all.


Node 1

Same as Read.

Node 2

Same as read.

Now someone may think I was making a mistake and I was writing on the same schema. I assure you I was not. Check the next test to see what happened if using the same schema.

Write on Both Nodes,  Same Schema


Now, now, Marco, this is unfair. You know this will cause contention. Yes, I do! But nonetheless, I was curious to see what was going to happen and how the platform would deal with that level of contention.

My expectations were to have a lot of performance degradation and an increased number of locks. About conflict I was not wrong, node2 after the test reported:

| table       | index   | PHYSICAL_CONFLICTS_HIST |
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |

Which is obviously a strong indication something was not working right. In terms of performance gain, if we compare ONLY the result with the 128 Threads:

Also with the high level of conflict, we still have 12% of performance gain.

The problem is that in general, we have the two nodes behaving quite badly. If you check the graph below you can see that the level of conflict is such to prevent the nodes not only to scale but to act consistently.

Node 1

Write on Both Nodes,  Same Schema

Node 2


In the following graphs, we can see how node1 had issues and it actually crashed three times, during tests with 32/64/512 threads. Node2 was always up but the performances were very low.

Node 1

Node 2


Node 1

Node 2

Recovery From Crashed Node

About recovery time, reading the AWS documentation and listening to presentations, I often heard that Aurora Multi-Primary is a 0 downtime solution. Or other statements like: “in applications where you can’t afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn’t use the failover mechanism, because it doesn’t need to promote another DB instance to have read/write capability”

To achieve this the suggestion, the solution I found was to have applications pointing directly to the Nodes endpoint and not use the Cluster endpoint.

In this context, the solution pointing to the Nodes should be able to failover within a second or so, while the cluster endpoint:

Recovery From Crashed Node

Personally, I think that designing an architecture where the application is responsible for the connection to the database and failover is some kind of refuse from 2001. But if you feel this is the way, well, go for it.

What I did for testing is to use ProxySQL, as plain as possible with nothing else, and the basic monitor coming from the native monitor. I then compared the results with the tests using the Cluster endpoint. In this way, I adopt the advice of pointing directly at the nodes, but I was doing things in our time.

The results are below and they confirm (more or less) the data coming from Amazon.

A downtime of seven seconds is quite a long time nowadays, especially if I am targeting the 5 nines solution that I want to remember is 864 ms downtime per day. Using ProxySQL is going closer to that, but still too long to be called zero downtime.
I also have fail-back issues when using the AWS cluster endpoint, given it was not able to move the connection to the joining node seamlessly.

Last but not least, when using the consistency level INSTANCE_RAW, I had some data issue as well as PK conflict:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry ‘18828082’ for key ‘PRIMARY’) 


As state at the beginning of this long blog, the reasonable expectations to go for a multi-primary solution were:

  • Scale writes (more nodes more writes)
  • Gives me zero downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Honestly, I feel we have completely failed the scaling point. Probably if I use the largest Aurora I will get much better absolute numbers, and it will take me more to encounter the same issues, but I will. In any case, if the multi-primary solution is designed to provide that scalability, and it should do that with any version.

I did not have zero downtime, but I was able to failover pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is not prone to possible negative effects like stale reads. Given that I consider this requirement not satisfied in full.

Given all the above, I think this solution could eventually be valid only for High Availability (close to being 5 nines), but given it comes with some limitations I do not feel comfortable in preferring it over others just for that, at the end default Aurora is already good enough as a High available solution.


AWS re:Invent 2019: Amazon Aurora Multi-Master: Scaling out database write performance

Working with Aurora multi-master clusters

Improving enterprises ha and disaster recovery solutions reviewed

Robust ha solutions with proxysql

Limitations of multi-master clusters


One Billion Tables in MySQL 8.0 with ZFS

one billion tables MySQL

The short version

I created > one billion InnoDB tables in MySQL 8.0 (tables, not rows) just for fun. Here is the proof:

$ mysql -A
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1425329
Server version: 8.0.12 MySQL Community Server - GPL
Copyright (c) 2000, 2018, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
mysql> select count(*) from information_schema.tables;
| count(*)   |
| 1011570298 |
1 row in set (6 hours 57 min 6.31 sec)

Yes, it took 6 hours and 57 minutes to count them all!

Why does anyone need one billion tables?

In my previous blog post, I created and tested MySQL 8.0 with 40 million tables (that was a real case study). The One Billion Tables project is not a real world scenario, however. I was challenged by Billion Tables Project (BTP) in PostgreSQL, and decided to repeat it with MySQL, creating 1 billion InnoDB tables.

As an aside: I think MySQL 8.0 is the first MySQL version where creating 1 billion InnoDB tables is even practically possible.

Challenges with one billion InnoDB tables

Disk space

The first and one of the most important challenges is disk space. InnoDB allocates data pages on disk when creating .ibd files. Without disk level compression we need > 25Tb of disk. The good news: we have ZFS which provides transparent disk compression. Here’s how the disk utilization looks:

Actual data (apparent-size):

# du -sh --apparent-size /mysqldata/
26T     /mysqldata/

Compressed data:

# du -sh /mysqldata/
2.4T    /mysqldata/

Compression ratio:

# zfs get compression,compressratio
mysqldata/mysql/data             compressratio         7.14x                      -
mysqldata/mysql/data             compression           gzip                       inherited from mysqldata/mysql

(Looks like the compression ratio reported is not 100% correct, we expect ~10x compression ratio.)

Too many tiny files

This is usually the big issue with databases that create a file per table. With MySQL 8.0 we can create a shared tablespace and “assign” a table to it. I created a tablespace per database, and created 1000 tables in each database.

The result:

mysql> select count(*) from information_schema.schemata;
| count(*) |
|  1011575 |
1 row in set (1.31 sec)

Creating tables

Another big challenge is how to create tables fast enough so it will not take months. I have used three approaches:

  1. Disabled all possible consistency checks in MySQL, and decreased the innodb page size to 4K (these config options are NOT for production use)
  2. Created tables in parallel: as the mutex contention bug in MySQL 8.0 has been fixed, creating tables in parallel works fine.
  3. Use local NVMe cards on top of an AWS ec2 i3.8xlarge instance

my.cnf config file (I repeat: do not use this in production):

default-authentication-plugin = mysql_native_password
log-error = /mysqldata/mysql/log/error.log
innodb_log_group_home_dir = /mysqldata/mysql/log/
innodb_doublewrite = 0
innodb_stats_persistent = 0
tablespace_definition_cache = 524288
schema_definition_cache = 524288
table_definition_cache = 524288

ZFS pool:

# zpool status
  pool: mysqldata
 state: ONLINE
  scan: scrub repaired 0B in 1h49m with 0 errors on Sun Oct 14 02:13:17 2018
        NAME        STATE     READ WRITE CKSUM
        mysqldata   ONLINE       0     0     0
          nvme0n1   ONLINE       0     0     0
          nvme1n1   ONLINE       0     0     0
          nvme2n1   ONLINE       0     0     0
          nvme3n1   ONLINE       0     0     0
errors: No known data errors

A simple “deploy” script to create tables in parallel (includes the sysbench table structure):

function do_db {
        db_exist=$(mysql -A -s -Nbe "select 1 from information_schema.schemata where schema_name = '$db'")
        if [ "$db_exist" == "1" ]; then echo "Already exists: $db"; return 0; fi;
        tbspace="create database $db; use $db; CREATE TABLESPACE $db ADD DATAFILE '$db.ibd' engine=InnoDB";
        #echo "Tablespace $db.ibd created!"
        for i in {1..1000}
                table="CREATE TABLE sbtest$i ( id int(10) unsigned NOT NULL AUTO_INCREMENT, k int(10) unsigned NOT NULL DEFAULT '0', c varchar(120) NOT NULL DEFAULT '', pad varchar(60) NOT NULL DEFAULT '', PRIMARY KEY (id), KEY k_1 (k) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 tablespace $db;"
                tables="$tables; $table;"
        echo "$tbspace;$tables" | mysql
echo "starting..."
c=$(mysql -A -s -Nbe "select max(cast(SUBSTRING_INDEX(schema_name, '_', -1) as unsigned)) from information_schema.schemata where schema_name like 'sbtest_%'")
for m in {1..100000}
        echo "m=$m"
        for i in {1..30}
                let c=$c+1
                echo $c
                do_db &

How fast did we create tables? Here are some stats:

# mysqladmin -i 10 -r ex|grep Com_create_table
| Com_create_table                                      | 6497                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| Com_create_table                                      | 6449

So we created ~650 tables per second. The average, above, is per 10 seconds.

Counting the tables

It took > 6 hours to do “count(*) from information_schema.tables”! Here is why:

  1. MySQL 8.0 uses a new data dictionary (this is great as it avoids creating 1 billion frm files). Everything is stored in this file:
    # ls -lah /mysqldata/mysql/data/mysql.ibd
    -rw-r----- 1 mysql mysql 6.1T Oct 18 15:02 /mysqldata/mysql/data/mysql.ibd
  2. The information_schema.tables is actually a view:
mysql> show create table information_schema.tables\G
*************************** 1. row ***************************
                View: TABLES
         Create View: CREATE ALGORITHM=UNDEFINED DEFINER=`mysql.infoschema`@`localhost` SQL SECURITY DEFINER VIEW `information_schema`.`TABLES` AS select `cat`.`name` AS `TABLE_CATALOG`,`sch`.`name` AS `TABLE_SCHEMA`,`tbl`.`name` AS `TABLE_NAME`,`tbl`.`type` AS `TABLE_TYPE`,if((`tbl`.`type` = 'BASE TABLE'),`tbl`.`engine`,NULL) AS `ENGINE`,if((`tbl`.`type` = 'VIEW'),NULL,10) AS `VERSION`,`tbl`.`row_format` AS `ROW_FORMAT`,internal_table_rows(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`table_rows`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `TABLE_ROWS`,internal_avg_row_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`avg_row_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `AVG_ROW_LENGTH`,internal_data_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`data_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `DATA_LENGTH`,internal_max_data_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`max_data_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `MAX_DATA_LENGTH`,internal_index_length(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`index_length`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `INDEX_LENGTH`,internal_data_free(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`data_free`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `DATA_FREE`,internal_auto_increment(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`auto_increment`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0),`tbl`.`se_private_data`) AS `AUTO_INCREMENT`,`tbl`.`created` AS `CREATE_TIME`,internal_update_time(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(cast(`stat`.`update_time` as unsigned),0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `UPDATE_TIME`,internal_check_time(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(cast(`stat`.`check_time` as unsigned),0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `CHECK_TIME`,`col`.`name` AS `TABLE_COLLATION`,internal_checksum(`sch`.`name`,`tbl`.`name`,if(isnull(`tbl`.`partition_type`),`tbl`.`engine`,''),`tbl`.`se_private_id`,(`tbl`.`hidden` <> 'Visible'),`ts`.`se_private_data`,coalesce(`stat`.`checksum`,0),coalesce(cast(`stat`.`cached_time` as unsigned),0)) AS `CHECKSUM`,if((`tbl`.`type` = 'VIEW'),NULL,get_dd_create_options(`tbl`.`options`,if((ifnull(`tbl`.`partition_expression`,'NOT_PART_TBL') = 'NOT_PART_TBL'),0,1))) AS `CREATE_OPTIONS`,internal_get_comment_or_error(`sch`.`name`,`tbl`.`name`,`tbl`.`type`,`tbl`.`options`,`tbl`.`comment`) AS `TABLE_COMMENT` from (((((`mysql`.`tables` `tbl` join `mysql`.`schemata` `sch` on((`tbl`.`schema_id` = `sch`.`id`))) join `mysql`.`catalogs` `cat` on((`cat`.`id` = `sch`.`catalog_id`))) left join `mysql`.`collations` `col` on((`tbl`.`collation_id` = `col`.`id`))) left join `mysql`.`tablespaces` `ts` on((`tbl`.`tablespace_id` = `ts`.`id`))) left join `mysql`.`table_stats` `stat` on(((`tbl`.`name` = `stat`.`table_name`) and (`sch`.`name` = `stat`.`schema_name`)))) where (can_access_table(`sch`.`name`,`tbl`.`name`) and is_visible_dd_object(`tbl`.`hidden`))
character_set_client: utf8
collation_connection: utf8_general_ci

and the explain plan looks like this:

mysql> explain select count(*) from information_schema.tables \G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: cat
   partitions: NULL
         type: index
possible_keys: PRIMARY
          key: name
      key_len: 194
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: tbl
   partitions: NULL
         type: ALL
possible_keys: schema_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1023387060
     filtered: 100.00
        Extra: Using where; Using join buffer (Block Nested Loop)
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: sch
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,catalog_id
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.schema_id
         rows: 1
     filtered: 11.11
        Extra: Using where
*************************** 4. row ***************************
           id: 1
  select_type: SIMPLE
        table: stat
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 388
          ref: mysql.sch.name,mysql.tbl.name
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 5. row ***************************
           id: 1
  select_type: SIMPLE
        table: ts
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.tablespace_id
         rows: 1
     filtered: 100.00
        Extra: Using index
*************************** 6. row ***************************
           id: 1
  select_type: SIMPLE
        table: col
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 8
          ref: mysql.tbl.collation_id
         rows: 1
     filtered: 100.00
        Extra: Using index


  1. I have created more than 1 billion real InnoDB tables with indexes in MySQL 8.0, just for fun, and it worked. It took ~2 weeks to create.
  2. Probably MySQL 8.0 is the first version where it is even practically possible to create billion InnoDB tables
  3. ZFS compression together with NVMe cards makes it reasonably cheap to do, for example, by using i3.4xlarge or i3.8xlarge instances on AWS.

one billion tables MySQL


Scaling Percona Monitoring and Management (PMM)

PMM tested with 1000 nodes

Starting with PMM 1.13,  PMM uses Prometheus 2 for metrics storage, which tends to be heaviest resource consumer of CPU and RAM.  With Prometheus 2 Performance Improvements, PMM can scale to more than 1000 monitored nodes per instance in default configuration. In this blog post we will look into PMM scaling and capacity planning—how to estimate the resources required, and what drives resource consumption.

PMM tested with 1000 nodes

We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM ?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

  1. Number of samples (data points) injected into PMM per second
  2. Number of distinct time series they belong to (cardinality)
  3. Number of distinct query patterns your application uses
  4. Number of queries you have on PMM, through the user interface on the API, and their complexity

These specifically can be impacted by:

  • Software version – modern database software versions expose more metrics)
  • Software configuration – some metrics are only exposed in certain configuration
  • Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality.
  • Exporter configuration – disabling collectors can reduce amount of data collectors
  • Scrape frequency –  controlled by METRICS_RESOLUTION

All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance as a start point for your research.

On the system supporting 1000 instances we observed the following performance:

Performance PMM 1000 nodes load

As you can see, we have more than 2.000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important as Prometheus 2 does not have good options for limiting memory consumption. If you do not have enough memory to handle your workload, then it will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM Installation. A test installation with 1GB of memory is possible. However, it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problem.

With powerful systems (8GB of more) you can have approximately eight systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.

To calculate the CPU usage resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the “Home Dashboard”. This becomes way too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select the host, for example “pmm-server” in your home dashboard and save it, before adding a large amount of hosts to the system.

set home dashboard for PMM

Or you can make some other dashboard of your choice and set it as the home dashboard.


  • More than 1,000 monitored systems is possible per single PMM server
  • Your specific workload and configuration may significantly change the resources required
  • If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM

The post Scaling Percona Monitoring and Management (PMM) appeared first on Percona Database Performance Blog.


Webinar Tues 8/14: Utilizing ProxySQL for Connection Pooling in PHP

ProxySQL for connection pooling

ProxySQL for connection poolingPlease join Percona’s Architect, Tibi Köröcz as he presents Utilizing ProxySQL for Connection Pooling in PHP on Tuesday August 14, 2018, at 8:00 am PDT (UTC-7) / 11:00 am EDT (UTC-4).


ProxySQL is a very powerful tool, with extended capabilities. This presentation will demonstrate how to use ProxySQL to gain functionality (seamless database backend switch) and correct problems (applications missing connection pooling).

The presentation will be a real-life study on how we use ProxySQL for connection pooling, database failover and load balancing the communication between our (third party) PHP-application and our master-master MySQL-cluster.
Also, we will show monitoring and statistics using Percona Monitoring and Management (PMM).

Register Now!

Tibor Köröcz


ProxySQL for Connection Pooling

Tibi joined Percona in 2015 as a Consultant. Before joining Percona, among many other things, he worked at the world’s largest car hire booking service as a Senior Database Engineer. He enjoys trying and working with the latest technologies and applications which can help or work with MySQL together. In his spare time he likes to spend time with his friends, travel around the world and play ultimate frisbee.


The post Webinar Tues 8/14: Utilizing ProxySQL for Connection Pooling in PHP appeared first on Percona Database Performance Blog.


Amazon RDS Multi-AZ Deployments and Read Replicas

RDS Multi-AZ

Amazon RDS is a managed relational database service that makes it easier to set up, operate, and scale a relational database in the cloud. One of the common questions that we get is “What is Multi-AZ and how it’s different from Read Replica, do I need both?”.  I have tried to answer this question in this blog post and it depends on your application needs. Are you looking for High Availability (HA), read scalability … or both?

Before we go to into detail, let me explain two common terms used with Amazon AWS.

Region – an AWS region is a separate geographical area like US East (N. Virginia), Asia Pacific (Mumbai), EU (London) etc. Each AWS Region has multiple, isolated locations known as Availability Zones.

Availability Zone (AZ) – AZ is simply one or more data centers, each with redundant power, networking and connectivity, housed in separate facilities. Data centers are geographically isolated within the same region.

What is Multi-AZ?

Amazon RDS provides high availability and failover support for DB instances using Multi-AZ deployments.

In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica of the master DB in a different Availability Zone. The primary DB instance is synchronously replicated across Availability Zones to the standby replica to provide data redundancy, failover support and to minimize latency during system backups. In the event of planned database maintenance, DB instance failure, or an AZ failure of your primary DB instance, Amazon RDS automatically performs a failover to the standby so that database operations can resume quickly without administrative intervention.

You can check in the AWS management console if a database instance is configured as Multi-AZ. Select the RDS service, click on the DB instance and review the details section.

AWS management console showing that instance is Multi-AZ

This screenshot from AWS management console (above) shows that the database is hosted as Multi-AZ deployment and the standby replica is deployed in us-east-1a AZ.

Benefits of Multi-AZ deployment:

  • Replication to a standby replica is synchronous which is highly durable.
  • When a problem is detected on the primary instance, it will automatically failover to the standby in the following conditions:
    • The primary DB instance fails
    • An Availability Zone outage
    • The DB instance server type is changed
    • The operating system of the DB instance is undergoing software patching.
    • A manual failover of the DB instance was initiated using Reboot with failover.
  • The endpoint of the DB instance remains the same after a failover, the application can resume database operations without manual intervention.
  • If a failure occurs, your availability impact is limited to the time that the automatic failover takes to complete. This helps to achieve increased availability.
  • It reduces the impact of maintenance. RDS performs maintenance on the standby first, promotes the standby to primary master, and then performs maintenance on the old master which is now a standby replica.
  • To prevent any negative impact of the backup process on performance, Amazon RDS creates a backup from the standby replica.

Amazon RDS does not failover automatically in response to database operations such as long-running queries, deadlocks or database corruption errors. Also, the Multi-AZ deployments are limited to a single region only, cross-region Multi-AZ is not currently supported.

Can I use an RDS standby replica for read scaling?

The Multi-AZ deployments are not a read scaling solution, you cannot use a standby replica to serve read traffic. Multi-AZ maintains a standby replica for HA/failover. It is available for use only when RDS promotes the standby instance as the primary. To service read-only traffic, you should use a Read Replica instead.

What is Read Replica?

Read replicas allow you to have a read-only copy of your database.

When you create a Read Replica, you first specify an existing DB instance as the source. Then Amazon RDS takes a snapshot of the source instance and creates a read-only instance from the snapshot. You can use MySQL native asynchronous replication to keep Read Replica up-to-date with the changes. The source DB must have automatic backups enabled for setting up read replica.

Benefits of Read Replica

  • Read Replica helps in decreasing load on the primary DB by serving read-only traffic.
  • A Read Replica can be manually promoted as a standalone database instance.
  • You can create Read Replicas within AZ, Cross-AZ or Cross-Region.
  • You can have up to five Read Replicas per master, each with own DNS endpoint. Unlike a Multi-AZ standby replica, you can connect to each Read Replica and use them for read scaling.
  • You can have Read Replicas of Read Replicas.
  • Read Replicas can be Multi-AZ enabled.
  • You can use Read Replicas to take logical backups (mysqldump/mydumper) if you want to store the backups externally to RDS.
  • Read Replica helps to maintain a copy of databases in a different region for disaster recovery.

At AWS re:Invent 2017, AWS announced the preview for Amazon Aurora Multi-Master, this will allow users to create multiple Aurora writer nodes and helps in scaling reads/writes across multiple AZs. You can sign up for preview here.


While both (Multi-AZ and Read replica) maintain a copy of database but they are different in nature. Use Multi-AZ deployments for High Availability and Read Replica for read scalability. You can further set up a cross-region read replica for disaster recovery.

The post Amazon RDS Multi-AZ Deployments and Read Replicas appeared first on Percona Database Performance Blog.


Webinar Weds 20/6: Percona XtraDB Cluster 5.7 Tutorial Part 2

webinar Percona XtraDB Cluster

Including setting up Percona XtraDB Cluster with ProxySQL and PMM

webinar Percona XtraDB ClusterPlease join Percona’s Architect, Tibi Köröcz as he presents Percona XtraDB Cluster 5.7 Tutorial Part 2 on Wednesday, June 20th, 2018, at 7:00 am PDT (UTC-7) / 10:00 am EDT (UTC-4).


Never used Percona XtraDB Cluster before? This is the webinar for you! In this 45-minute webinar, we will introduce you to a fully functional Percona XtraDB Cluster.

This webinar will show you how to install Percona XtraDB Cluster with ProxySQL, and monitor it with Percona Monitoring and Management (PMM).

We will also cover topics like bootstrap, IST, SST, certification, common-failure situations and online schema changes.

After this webinar, you will have enough knowledge to set up a working Percona XtraDB Cluster with ProxySQL, in order to meet your high availability requirements.

You can see part one of this series here: Percona XtraDB Cluster 5.7 Tutorial Part 1

Register Now!

Tibor Köröcz


ProxySQL for Connection Pooling

Tibi joined Percona in 2015 as a Consultant. Before joining Percona, among many other things, he worked at the world’s largest car hire booking service as a Senior Database Engineer. He enjoys trying and working with the latest technologies and applications which can help or work with MySQL together. In his spare time he likes to spend time with his friends, travel around the world and play ultimate frisbee.


The post Webinar Weds 20/6: Percona XtraDB Cluster 5.7 Tutorial Part 2 appeared first on Percona Database Performance Blog.


Using Apache Hadoop and Impala together with MySQL for data analysis

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from  MySQL to Hadoop, load the data to Cloudera Impala (columnar format) and run a reporting on top of that. For the examples below I will use the “ontime flight performance” data from my previous post (Increasing MySQL performance with parallel query execution). I’ve used the Cloudera Manager v.4 to install Apache Hadoop and Impala. For this test I’ve (intentionally) used an old hardware (servers from 2006) to show that Hadoop can utilize the old hardware and still scale. The test cluster consists of 6 datanodes. Below are the specs:

Purpose Server specs
Namenode, Hive metastore, etc + Datanodes 2x PowerEdge 2950, 2x L5335 CPU @ 2.00GHz, 8 cores, 16G RAM, RAID 10 with 8 SAS drives
Datanodes only 4x PowerEdge SC1425, 2x Xeon CPU @ 3.00GHz, 2 cores, 8G RAM, single 4TB drive

As you can see those a pretty old servers; the only thing I’ve changed is added a 4TB drive to be able to store more data. Hadoop provides redundancy on the server level (it writes 3 copies of the same block to all datanodes) so we do not need RAID on the datanodes (need redundancy for namenodes thou).

Data export

There are a couple of ways to export data from MySQL to Hadoop. For the purpose of this test I have simply exported the ontime table into a text file with:

select * into outfile '/tmp/ontime.psv' 
from ontime;

(you can use “|” or any other symbol as a delimiter) Alternatively, you can download data directly from www.transtats.bts.gov site using this simple script:

for y in {1988..2013}
        for i in {1..12}
                wget $u -o ontime.log
                unzip On_Time_On_Time_Performance_${y}_${i}.zip

Load into Hadoop HDFS

First thing we will need to do is to load data into HDFS as a set of files. Hive or Impala it will work with a directory to which you have imported your data and concatenate all files inside this directory. In our case it is easy to simply copy all our files into the directory inside HDFS

$ hdfs dfs -mkdir /data/ontime/
$ hdfs -v dfs -copyFromLocal On_Time_On_Time_Performance_*.csv /data/ontime/

 Create external table in Impala

Now, when we have all data files loaded we can create an external table:

YearD int ,
Quarter tinyint ,
MonthD tinyint ,
DayofMonth tinyint ,
DayOfWeek tinyint ,
FlightDate string ,
UniqueCarrier string ,
AirlineID int ,
Carrier string ,
TailNum string ,
FlightNum string ,
OriginAirportID int ,
OriginAirportSeqID int ,
OriginCityMarketID int ,
Origin string ,
OriginCityName string ,
OriginState string ,
OriginStateFips string ,
OriginStateName string ,
OriginWac int ,
DestAirportID int ,
DestAirportSeqID int ,
DestCityMarketID int ,
Dest string ,
LOCATION '/data/ontime';

Note the “EXTERNAL” keyword and LOCATION (LOCATION points to a directory inside HDFS, not a file). The impala will create a meta information only (will not modify the table). We can query this table right away, however, impala will need to scan all files (full scan) for queries.


[d30.local:21000] > select yeard, count(*) from ontime_psv  group by yeard;
Query: select yeard, count(*) from ontime_psv  group by yeard
| yeard | count(*) |
| 2010  | 6450117  |
| 2013  | 5349447  |
| 2009  | 6450285  |
| 2002  | 5271359  |
| 2004  | 7129270  |
| 1997  | 5411843  |
| 2012  | 6096762  |
| 2005  | 7140596  |
| 1999  | 5527884  |
| 2007  | 7455458  |
| 1994  | 5180048  |
| 2008  | 7009726  |
| 1988  | 5202096  |
| 2003  | 6488540  |
| 1996  | 5351983  |
| 1989  | 5041200  |
| 2011  | 6085281  |
| 1998  | 5384721  |
| 1991  | 5076925  |
| 2006  | 7141922  |
| 1993  | 5070501  |
| 2001  | 5967780  |
| 1995  | 5327435  |
| 1990  | 5270893  |
| 1992  | 5092157  |
| 2000  | 5683047  |
Returned 26 row(s) in 131.38s

(Note that “group by” will not sort the rows, unlike MySQL. To sort we will need to add “ORDER BY yeard”)

Explain plan:

Query: explain select yeard, count(*) from ontime_csv  group by yeard
| Explain String                                            |
| PLAN FRAGMENT 0                                           |
|   PARTITION: UNPARTITIONED                                |
|                                                           |
|   4:EXCHANGE                                              |
|                                                           |
| PLAN FRAGMENT 1                                           |
|   PARTITION: HASH_PARTITIONED: yeard                      |
|                                                           |
|   STREAM DATA SINK                                        |
|     EXCHANGE ID: 4                                        |
|     UNPARTITIONED                                         |
|                                                           |
|   3:AGGREGATE (merge finalize)                            |
|   |  output: SUM(COUNT(*))                                |
|   |  group by: yeard                                      |
|   |                                                       |
|   2:EXCHANGE                                              |
|                                                           |
| PLAN FRAGMENT 2                                           |
|   PARTITION: RANDOM                                       |
|                                                           |
|   STREAM DATA SINK                                        |
|     EXCHANGE ID: 2                                        |
|     HASH_PARTITIONED: yeard                               |
|                                                           |
|   1:AGGREGATE                                             |
|   |  output: COUNT(*)                                     |
|   |  group by: yeard                                      |
|   |                                                       |
|   0:SCAN HDFS                                             |
|      table=ontime.ontime_csv #partitions=1/1 size=45.68GB |
Returned 31 row(s) in 0.13s

As we can see it will scan 45G of data.

Impala with columnar format and compression

The great benefit of the impala is that it supports columnar format and compression. I’ve tried the new “parquet” format with “snappy” compression codec. As our table is very wide (and de-normalized) it will help alot to use columnar format. To take advantages of the “parquet” format we will need to load data into it, which is easy to do when we already have a table inside impala and files inside HDFS:

[d30.local:21000] > set PARQUET_COMPRESSION_CODEC=snappy;
[d30.local:21000] > create table ontime_parquet_snappy LIKE ontime_parquet_snappy STORED AS PARQUET;
[d30.local:21000] > insert into ontime_parquet_snappy select * from ontime_csv;
Query: insert into ontime_parquet_snappy select * from ontime_csv
Inserted 152657276 rows in 729.76s

Then we can test our query against the new table:

Query: explain select yeard, count(*) from ontime_parquet_snappy  group by yeard
| Explain String                                                      |
| PLAN FRAGMENT 0                                                     |
|   PARTITION: UNPARTITIONED                                          |
|                                                                     |
|   4:EXCHANGE                                                        |
|                                                                     |
| PLAN FRAGMENT 1                                                     |
|   PARTITION: HASH_PARTITIONED: yeard                                |
|                                                                     |
|   STREAM DATA SINK                                                  |
|     EXCHANGE ID: 4                                                  |
|     UNPARTITIONED                                                   |
|                                                                     |
|   3:AGGREGATE (merge finalize)                                      |
|   |  output: SUM(COUNT(*))                                          |
|   |  group by: yeard                                                |
|   |                                                                 |
|   2:EXCHANGE                                                        |
|                                                                     |
| PLAN FRAGMENT 2                                                     |
|   PARTITION: RANDOM                                                 |
|                                                                     |
|   STREAM DATA SINK                                                  |
|     EXCHANGE ID: 2                                                  |
|     HASH_PARTITIONED: yeard                                         |
|                                                                     |
|   1:AGGREGATE                                                       |
|   |  output: COUNT(*)                                               |
|   |  group by: yeard                                                |
|   |                                                                 |
|   0:SCAN HDFS                                                       |
|      table=ontime.ontime_parquet_snappy #partitions=1/1 size=3.95GB |
Returned 31 row(s) in 0.02s

As we can see it will scan much smaller amount of data: 3.95 (with compression) compared to 45GB


Query: select yeard, count(*) from ontime_parquet_snappy  group by yeard
| yeard | count(*) |
| 2010  | 6450117  |
| 2013  | 5349447  |
| 2009  | 6450285  |
Returned 26 row(s) in 4.17s

And the response time is much better as well.

Impala complex query example

I’ve used the complex query from my previous post. I had to adapt it for use with Impala: it does not support “sum(ArrDelayMinutes>30)” notation but “sum(if(ArrDelayMinutes>30, 1, 0)” works fine.

   min(yeard), max(yeard), Carrier, count(*) as cnt,
   sum(if(ArrDelayMinutes>30, 1, 0)) as flights_delayed,
   round(sum(if(ArrDelayMinutes>30, 1, 0))/count(*),2) as rate
FROM ontime_parquet_snappy
   DayOfWeek not in (6,7) and OriginState not in ('AK', 'HI', 'PR', 'VI')
   and DestState not in ('AK', 'HI', 'PR', 'VI')
   and flightdate < '2010-01-01'
GROUP by carrier
HAVING cnt > 100000 and max(yeard) > 1990
ORDER by rate DESC
LIMIT 1000;

The query is intentionally designed the way it does not take advantage of the indexes: most of the conditions will only filter out less than 30% of the data.

Impala results:

| min(yeard) | max(yeard) | carrier | cnt      | flights_delayed | rate |
| 2003       | 2009       | EV      | 1454777  | 237698          | 0.16 |
| 2003       | 2009       | FL      | 1082489  | 158748          | 0.15 |
| 2006       | 2009       | XE      | 1016010  | 152431          | 0.15 |
| 2003       | 2009       | B6      | 683874   | 103677          | 0.15 |
| 2006       | 2009       | YV      | 740608   | 110389          | 0.15 |
| 2003       | 2005       | DH      | 501056   | 69833           | 0.14 |
| 2001       | 2009       | MQ      | 3238137  | 448037          | 0.14 |
| 2004       | 2009       | OH      | 1195868  | 160071          | 0.13 |
| 2003       | 2006       | RU      | 1007248  | 126733          | 0.13 |
| 2003       | 2006       | TZ      | 136735   | 16496           | 0.12 |
| 1988       | 2009       | UA      | 9593284  | 1197053         | 0.12 |
| 1988       | 2009       | AA      | 10600509 | 1185343         | 0.11 |
| 1988       | 2001       | TW      | 2659963  | 280741          | 0.11 |
| 1988       | 2009       | CO      | 6029149  | 673863          | 0.11 |
| 2007       | 2009       | 9E      | 577244   | 59440           | 0.10 |
| 1988       | 2009       | US      | 10276941 | 991016          | 0.10 |
| 2003       | 2009       | OO      | 2654259  | 257069          | 0.10 |
| 1988       | 2009       | NW      | 7601727  | 725460          | 0.10 |
| 1988       | 2009       | DL      | 11869471 | 1156267         | 0.10 |
| 1988       | 2009       | AS      | 1506003  | 146920          | 0.10 |
| 1988       | 2005       | HP      | 2607603  | 235675          | 0.09 |
| 2005       | 2009       | F9      | 307569   | 28679           | 0.09 |
| 1988       | 1991       | PA      | 206841   | 19465           | 0.09 |
| 1988       | 2009       | WN      | 12722174 | 1107840         | 0.09 |
Returned 24 row(s) in 15.28s

15.28 seconds is significantly faster than original MySQL results (15 min 56.40 sec without parallel execution and  5 min 47 with the parallel execution). However, this is not “apple to apple comparison”:

  • MySQL will scan 45G of data and Impala with parquet will only scan 3.5G
  • MySQL will run on a single server, Hadoop + Impala will run in parallel on 6 servers.

Nevertheless, Hadoop + Implala shows impressive performance and ability to scale out the box, which can help a lot with the large data volume analysis.


Hadoop + Impala will give us an easy way to analyze large datasets using SQL with the ability to scale even on the old hardware.

In my next posts I will plan to explore:

As always, please share your thoughts in the comments.

The post Using Apache Hadoop and Impala together with MySQL for data analysis appeared first on MySQL Performance Blog.


Modeling InnoDB Scalability on Multi-Core Servers

Mat Keep’s blog post on InnoDB-vs-MyISAM benchmarks that Oracle recently published prompted me to do some mathematical modeling of InnoDB’s scalability as the number of cores in the server increases. Vadim runs lots of benchmarks that measure what happens under increasing concurrency while holding the hardware constant, but not as many with varying numbers of cores, so I decided to use Mat Keep’s data for this. The modeling I performed is Universal Scalability Law modeling, which can predict both software and hardware scalability, depending on how it is used.

In brief, the benchmarks are sysbench’s read-only and read-write tests, and the server has two Intel SSDs, 64GB of memory, and 4 x 12-core AMD Opteron 6172 “Magny-Cours” 2.1GHz CPUs. It is a reasonably typical commodity machine except for the high core count, which is more than I can remember seeing in the wild. The database was MySQL 5.5.7-rc. I am not sure why they didn’t run the GA version of MySQL for this benchmark. Maybe they wrote the paper before 5.5 went GA.

The following are plots of the read-only and read-write scalability models that I generated, based on these benchmarks.

Read-Only Results

Read-Only Results

Read-Write Results

Read-Write Results

The model predicts that the server will continue to provide more throughput as the core count climbs into the mid-50s, although the bang for the buck isn’t very good at that point. Also, there appears to be some bottleneck that hits more sharply than the model predicts at high core counts. It would be great if the benchmark were re-run with the same core counts and with sysbench on another machine, instead of taking 12 cores away from MySQL and giving them to sysbench. That way we could test with 48 cores and see what happens. My gut feeling is that the results will not be as good as the model predicts at high numbers of cores. But as Neil Gunther says, this wouldn’t mean the model is broken; it would mean that there is potentially something to fix in the server at high core counts. Without the model, there wouldn’t even be a basis for discussion.

The biggest thing I want to point out here is the dramatic improvement over just a few years ago, when you could “upgrade” from 4 to 8 cores and see a reduction in throughput. Oracle (and Percona, and lots of others) have done great work in the last couple of years making InnoDB scale and perform better on modern hardware.


Percona white paper: Forecasting MySQL Scalability

Ewen and I have just published Percona’s latest white paper, Forecasting MySQL Scalability with the Universal Scalability Law. This is essentially a streamlined walk-through of Dr. Neil J. Gunther’s book Guerrilla Capacity Planning, with examples to show how you can apply it to MySQL servers.

One thing alluded to in the paper is extracting the necessary metrics from network traffic. I had this idea after studying the data in Linux’s /proc/diskstats file. It turns out that two simple metrics can provide amazingly rich insight into system performance and scalability, in combination with Little’s Law and queueing theory. These are the busy time and the total time that requests were resident in the system. There are different terms for the latter, but in MySQL we’d call it query response time. After studying these for a few months, I’m so awed by how useful they are that I am going to make a blanket recommendation: if you create server software, you must expose these simple metrics. (I have filed a feature request for Percona Server to add these metrics for MySQL users.)

I’ll probably follow this up with another blog post or white paper at some point in the future, to show how to use the busy time and query response time to predict a system’s scalability.

Entry posted by Baron Schwartz |

Add to: delicious | digg | reddit | netscape | Google Bookmarks

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com