May
26
2016
--

AWS Aurora Benchmarking part 2

AWS-Aurora-Benchmarking

AWS Aurora Benchmarking

Some time ago, I published the article on AWS Aurora Benchmarking (AWS Aurora Benchmarking – Blast or Splash?), in which I analyzed the behavior of different solutions using synchronous replication in AWS environment. This blog follows up with some of the comments and suggestions I received regarding that post from the community and Amazon engineers.

I decided to perform another round of tests, keeping in mind comments and suggestions received.

I presented some of the results during the Percona conference in Santa Clara last April 2016. The following is the transposition that presentation, with more details.

Not interested in the preliminary descriptions? Go to the results section

Why new tests?

A very good question, with an easy answer.

Aurora is a product that is still under development and refinement: six months of development could present major changes in performance. Not only that, but the initial tests focused on entry-level solutions, meaning I was analyzing the kind of users that are currently starting their business and looking for a flexible solution that allows them to save money and scale.

This time, I put the focus on enterprise solutions by analyzing what an already well-established company would get when looking for a decent scalable solution.

These are two different scenarios.

Why so many (different) tests?

I used many different benchmarking tools, and I am still planning to run others. Why so? Why not simply use one of them?

Again, a simple answer. I used different tools because in some cases, they provide me a different way of accessing and using data. I also do not trust benchmarking tools, not even the ones I developed. I wanted to test the same thing using different tools and compare the results. ONLY if I see a common pattern, then would I consider the test valid. Personally, I tend to discard any test that is not consistent, or if the analysis performed is using a single benchmarking tool. In my opinion, being lazy is not an option when doing these kind of exercises.

About the tests

It was difficult to compare apples to apples here. And I think that is the main point to keep in mind.

Aurora is not a standard RDS solution, like we are used to. Aurora looks like MySQL, smells like MySQL, but is not vanilla MySQL. To achieve what they have, the engineers had to change many parts. The more you dig in, the more you realize there are significant differences.

Because of that, I had to focus more on identifying what each solution can do and compare the solutions against expectations, rather than comparing the numbers.

I was more interested to see what happen if:

  • I have a burst of connections, and my application goes from 4K to 40K connections. Will it crash? Will it slow down?
  • How long should I wait if a node fails?
  • What should I not have in my schema design, to prevent bottlenecks?

Those are relevant questions in my opinion, more so than discovering that solution A has 3000 rows written/sec, and solution B has 3100. Or that I might (might) have some additional page rotation, file -> memory-> flushes because the amount of memory differs.

That is valuable information, for sure, but less valuable than having a decent understanding of which platform will help my business grow and remain stable.

What is the right tool for the job? This is the question I am addressing.

Tests run

I had run three main kinds of tests:

  • Performance and load stress
  • High availability failover
  • Response time (latency) from the application point of view

Performance and load stress

These tests were the most extensive and demanding.

I analyzed the capacity to serve the load under different conditions, from a light load up to full utilization, and some degree of resource saturation.

  • The first set of tests were to evaluate a simple load on a single table, causing the table to become a hotspot and showing how the platform would manage the increasing contention.
  • The second set of tests were to perform a similar load, but distributing it cross multiple tables and batching the operations. Parallelization, contention, scalability and distributed hotspots were in the picture.

The two above focused on write operations only, and were done using different tools (comparing the results as they were complementary).

  • Third set of tests, using my own stress tool, were focused on R/W oriented usage. The tests were executed against multiple tables, performing CRUD actions, using simple and batch insert, reads by PK, index, by range, IN and exact match conditions.
  • The fourth set of tests were performed using a TPC-C like load (OLTP).
  • The fifth set of tests were using sysbench in OLTP mode, with 250 tables.

The scope of the last three set of tests was to identify how the platforms would manage the load, considering the following:

  • Read and write contention on the same tables
  • High level of parallelism (from the application)
  • Possible hot-spots (TPCC district)
  • Increasing utilization (memory, threads, IO)
  • Saturation (connections)

Finally, all tests were run with fully utilized BufferPool.

The machines

Small boxes (first round of tests):

EIP = 1
VPC = 1
ELB=1
Subnets = 4 (1 public, 3 private)
HAProxy = 6
MHA Monitor (micro ec2) = 1
NAT Instance (EC2) =1 (hosting EIP)
DB Instances (EC2) = 3 (m4.xlarge) 16GB
Application Instances (EC2) = 6 (4)
EBS SSD 3000 PIOS
Aurora RDS node = 3 (db.r3.xlarge) 30GB

Large boxes (latest tests):

EIP = 1
VPC = 1
ELB=1
Subnets = 4 (1 public, 3 private)
HAProxy = 4
MHA Monitor (micro ec2) = 1
NAT Instance (EC2) =1 (hosting EIP)
DB Instances (EC2) = 3 (c3.8xlarge) 60GB
Application Instances (EC2) = 4
EBS SSD 5000 PIOS
Aurora RDS node = 3 (db.r3.8xlarge) 244GB

A note

It was pointed out to me that I deliberately chose to use an Ec2 solution for Percona XtraDB Cluster with less memory than the one available in Aurora. This is true, and we must take that into consideration. The reason for this is that the only Ec2 solution matching the memory of a db_r3.8xlarge is the d2.8xlarge.

I did try it, but the level of scalability I got (from the CPU point of view) was less efficient than the one available with c3.8xlarge. I decided to prefer CPU resources to memory, especially because I was going to test concurrency and parallelism in conjunction with the load increase.

From the result, I feel confident that I chose correctly – but I am open to comment.

The layout

This is what the setup looks like:

AWS Aurora Benchmarking

Where you read Java, those are the application nodes running the different test applications.

Two words about Aurora first

Aurora has a few key concepts that we must have clearly in mind, especially how it manages the writes across replica, and how connections are implemented.

The IO activity

To replicate the information across the different storage, Aurora only replicates FRM files and data coming from IB_LOGS. This is a quite significant advantage to other forms of replication, given the limited number of bytes that are replicated over the network (and also if they are replicated six times).

AWS Aurora Benchmarking

Another significant advantage is that Aurora does not use a double write buffer, which is obviously another blast (see the recent optimization in Percona Server https://www.percona.com/blog/2016/05/09/percona-server-5-7-parallel-doublewrite/ ).

In other words, writes in Aurora are organized by filling its commit queue and pushing the changes as group commit to the storage.

AWS Aurora Benchmarking

In some presentations, you might have seen that all steps are asynchronous. But is important to underline that a commit is acknowledged by Aurora when at least two availability zones (AZ) have received and written the incoming data related to that commit. Writes here mean received in the storage node incoming queue and with a quorum of four over six nodes.

This means that no matter what, data has to travel on the network to reach the final destination, and ACK signals come back before Aurora returns the ACK to the commit operation. The network is in the same region, but still it could represent an incognita about performance. No wonder we could have some latency at this stage!

As you can see, what I am reporting is also confirmed in the image below (and in the observations). The point is that the impact of steps 1 – 2 is not obviously clear.

AWS Aurora Benchmarking

Thread pooling

Aurora also use thread pooling – a lot! That will become very clear later, and as more of the work is based on parallelism, the more efficient thread pooling seems to be.

In most cases we are used to seeing CPUs on database servers not fully utilized, unless there is some heavy ordering operation or a bad query. That behavior is also (not only) a direct consequence of the connection-to-thread model, which implies a period of latency and stand by. In Aurora, the incoming connections are not following the same model. Instead, the pool redistributes the load of the incoming connection to a pool of threads, optimizing the latency period, resulting in a higher CPU utilization. Which is what you want from your resource: to be utilized and not waiting for something else to do its job.

AWS Aurora Benchmarking

 

The results

Without wasting more electronic ink, let see what comes out of this round of tests (not the final one by the way). To simplify the results, I will also report the graphs from the first set of tests, but will focus on the latest.Small Boxes = SB, Large Boxes LB.

Small Boxes = SB, Large Boxes = LB.

First Test: IIBench

As declared previously, my scope was to verify how the two platforms would have reacted to a simple load focus on inserts with a basic single table. The bufferpool was saturated before running the test.

SB

AWS Aurora Benchmarking

LB

AWS Aurora Benchmarking

As we can see, in the presence of a hot spot the solution using Percona XtraDB Cluster outperformed Aurora, in both cases. What is notable, though, is that while XtraDB Cluster remained approximately around the same time/performance, Aurora is significantly reduced the time taken. This shows that Aurora was taking advantage of the more powerful platform, while XtraDB Cluster was not able to.

With further analyzation of the details, we notice that Aurora performs better atomically. It was able to manage more writes/second as well as rows and pages managed. But it was inconsistent: Aurora had performance hiccups at regular intervals. As such the final result was that it took more time to process the whole workload.

I was not able to dig to deeply, given some metrics are not fully available in Aurora. As such I had to rely fully on Aurora engineers, who mentioned to me that hot-spot contention was a possible issue.

Aurora Handler calls:

AWS Aurora Benchmarking

XtraDB Cluster Handlers:

AWS Aurora Benchmarking

The execution in XtraDB Cluster showed fewer calls but constant performance, while Aurora has hiccups.

Aurora page activity write:

AWS Aurora Benchmarking

XtraDB Cluster page activity write:

AWS Aurora Benchmarking

The trend shown by the handlers stayed consistent in the page management and rows insert, as expected.

Second Test: Application Ingest

As mentioned, this test showed many threads from different application servers, inserted by a batch of 50 statements against multiple tables.

The results coming from this test are quite favorable to Aurora, as we can see starting from the time taken to complete the same workload:

LB

AWS Aurora Benchmarking

SB

AWS Aurora Benchmarking

With small ones, the situation was inverted.

But here is where the interesting part starts.

Aurora can manage significantly higher numbers of rows, as the picture below shows:

AWS Aurora Benchmarking

The results are also constant, and don’t decrease significantly like the inserts with XtraDB Cluster.

The number of handler commits, however, are significantly less.

AWS Aurora Benchmarking

Once more they stay the same with the load increase, without impacting performance.

Reviewing all handler calls, we get our first surprise.

XtraDB Cluster handler calls:

AWS Aurora Benchmarking

Aurora handler calls:

AWS Aurora Benchmarking

The gap/drop existing in the two graphs are the different tests (with an increasing number of threads).

Two things to notice here: the first one is that XtraDB Cluster decreases in performance while processing the load, while Aurora does not. The second (you need to zoom the image) is the number of commits is floating in XtraDB Cluster, while it stays fixed in Aurora.

An even bigger surprise comes up when reviewing the connections graphs.

As expected, XtraDB Cluster has all my connections open, and the number of threads running is quite close to the number of connected threads.

AWS Aurora Benchmarking

Both of them follow the increasing number of connected threads.

But this is not the case in Aurora.

AWS Aurora Benchmarking

Also, if my applications are trying to open ~800 threads, the Aurora node see only a part of them, and the number of running is fixed to 32 threads.

The important things to consider here are that a) my applications don’t connect directly to the Aurora instance, but to a connector (MariaDB), and b) that Aurora, in this case, caps the number of running threads to the number of CPU available on the instance (here 32).

Given that, I expected to have worse performance (but I don’t). The fact that Aurora uses one thread for multiple connections seems to be working quite efficiently.

The number of rows inserted is also consistent with the handler calls, and has better performance than XtraDB Cluster.

Aurora rows inserted:

AWS Aurora Benchmarking

XtraDB Cluster rows inserted

AWS Aurora Benchmarking

Again we have the same trend, only, this time, Aurora performs better than XtraDB Cluster.

Third Test: OLTP Application

When run on the small boxes, this test saw XtraDB Cluster performing much better than Aurora. The time taken by Aurora was ~3 times the time taken by XtraDB Cluster.

AWS Aurora Benchmarking

With a large box, I had the inverse result: Aurora is outperforming XtraDB Cluster from 2 to 7 times the speed.

AWS Aurora Benchmarking

Analyzing the number of commands executed with the increasing workload, we can see how XtraDB Cluster can perform better than Aurora with a workload of 128 threads, but starts to have worse performance as the load increases.

On the other hand, Aurora manages the read/write load without significant performance loss, which includes being able to increase the number of commits/sec.

AWS Aurora Benchmarking

Reviewing the handler calls, we see that the handler commit calls are significantly less in Aurora (as already noticed in the ingest tests).

AWS Aurora Benchmarking

Another thing to note is that the number of calls for XtraDB Cluster is significantly higher and not scaling, while Aurora has a nice scaling trend.

Fourth Test: TPCC-mysql

The TPCC test is mainly to test OLTP traffic, with the note that some tables (like district) might become a hotspot. The tests I ran were executed against 400 warehouses, and used 128 threads maximum for the small box and 2048 threads for the large box.

During this test, I hit one of the Aurora limitations and I escalated it to the Aurora engineers (who are aware of the problem).

Small boxes:

AWS Aurora Benchmarking

In the case of small boxes, there is nothing to say: XtraDB Cluster manages the load more efficiently. This trend is not optimal, having significant fluctuation. Aurora is just not able to keep it up.

Large boxes:

 

AWS Aurora Benchmarking

It is a different and a more complex scenario in the case of the use of large boxes. I would like to say that Aurora performs better.

This is true for two of the three tests, and up to when it got stuck by internal limitation Aurora was also performing better on the third. But then its performance just collapsed.

With a more in-depth investigation, I noticed that under the hood Aurora was not performing as well as it appeared. This comes out quite clearly by looking at a comparison between the graphs covering Comm_ execution, open files, handlers and InnoDBrow lock time.

In all of them it is evident how XtraDB Cluster keeps serving the workload with consistent behavior, while Aurora fails the second test on (512 threads) — not just on the third with 2048 threads.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

It is clear that Aurora was better served during the test with 256 threads going over the 450K com select serve (in 10 sec interval), compared with XtraDB Cluster that was not able to go over 350K.

But in the following tests, while XtraDB Cluster was able to keep going (with decreasing performance), Aurora started to struggle with very inconsistent behavior.

This was also confirmed by the open files graph.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

The graphs show the instances of files open during the test, not the ones already open. It reflects the Open_file metric “The number of files that are open. This count includes regular files opened by the server. It does not include other types of files such as sockets or pipes. Also, the count does not include files that storage engines open using their own internal functions rather than asking the server level to do so.”

I was quite surprised by the number of files open by Aurora.

Handlers reflected the same behavior, as well.

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

Perfectly in line with the com trend.

So what was increasing in reverse?

Aurora:

AWS Aurora Benchmarking

XtraDB Cluster:

AWS Aurora Benchmarking

As you can see from the above, the exactly same workload generated an increasing lock row time, from quite low in the test with 256 threads, up to a crazy high with 2048 threads.

As mentioned, we know that TPCC has a couple of tables that act as hotspots, and we already saw with IIbench how Aurora is not working efficiently in that case.

I also was getting a lot of 188 errors during the test. This is an Aurora internal error. When I reported it, I was told they know about it, and they are planning to work on it.

I hope they do soon, because if this issue is solved it is very likely that Aurora will not only be able to manage the tested workload, but exceed it by far.

I am saying this because also with the identified issues Aurora was able to keep going and manage a more than decent response time during the second test (with 512 threads).

AWS Aurora Benchmarking

Fifth Test: Sysbench

I added the sysbench tests to test scalability, and to see the what happens when the system reaches a saturation point. This test brought up some limitations existing in the Aurora solution, related more to the connector than the Aurora engine itself.

Aurora has a limit of 16k connections. I wanted to see what happens if I got to saturation point or close to it. It doesn’t matter if this is a ridiculously high number or not.

What happened was that Aurora managed traffic up to 4K. The closer I got to the limit, however, the more I had a connectivity issue. At the end I had to run the test with 8K, 12K and 20K threads pointing directly to the Aurora instance, bypassing the connector that was not able to serve the traffic. After that, I was able to hit up to ~15500 threads (but with a lot of inconsistent performance). I am defining the limit of a meaningful test from the previous level of 12K threads.

XtraDB Cluster was able to scale up to 16K no problem.

What also is notable here is that Aurora was able to manage the workload more efficiently regarding transaction handling (i.e., as transactions executed and latency).

AWS Aurora Benchmarking

The number of transactions executed by Aurora was ~three times the one executed by XtraDB Cluster.

AWS Aurora Benchmarking

Regarding latency, Aurora showed less latency then XtraDB Cluster.

Internally, Aurora and XtraDB Cluster operations were once again different regarding how the workload was handled. The most divergent result was the handler calls:

AWS Aurora Benchmarking

Commit calls in Aurora were a fraction of the calls in XtraDB Cluster, while the number of rollbacks was higher.

The read calls had an even more divergent behavior, with XtraDB Cluster performing a higher number of read_keys, while Aurora was having a very limited number of them. Read_rnd are very high in XtraDB Cluster, but totally absent in Aurora (note that in Aurora, read_rnds are reported but seem not to increase). On the other hand, Aurora reported a high number of read_rnd_next, while XtraDB Cluster has none.

AWS Aurora Benchmarking

HA availability

Fail-over time

Both solutions:

AWS Aurora Benchmarking

In this test, the fail-over time for the solution using Galera and HAProxy was more efficient. For both a limited or mid-level load. One assumption is that given Aurora has to verify both the status of the data transmitted and its consistency across the six data store nodes in every case; the process is not as fast as it could be.

It could also be that the cluster connector is not as efficient as it should in redirecting the traffic from one node to another. It would be a very interesting exercise to replace it with some other custom solution.

Note that I was performing the tests following the Amazon recommendation to use the following to simulate a real crash:

ALTER SYSTEM CRASH [INSTANCE|NODE]

As such, I was not doing anything strange or out of the ordinary.

It is worth mentioning that of the eight seconds taken by MySQL/Galera to perform the failover, six were due to the HAProxy settings (which had a 3000 ms interval and two loops in the settings before executing failover).

Execution latency

The purpose of these tests was to identify the latency existing between the moment that application sends the request and the moment MySQL/Aurora took the request in “charge”. The expectation is that the busier the database, the higher the latency.

For this test, I reported both results: the one coming from the old tests with the small box, and the new one with the large box.

Small boxes:

AWS Aurora Benchmarking

Large boxes:

AWS Aurora Benchmarking

It is clear from the graphs that the two tests report different scenarios. In the first, Galera was able to manage the load more efficiently and serve requests with lower latency. For the new tests, I had used a higher number of threads than the ones for the small box. Nevertheless, in the second test the CPU utilization and the number of running threads lead me to think that Aurora was finally able to utilize resources more efficiently and the lower latency.

The latency jumped up again when the number of connections rose above 12K, but that was expected given previous tests results.

Conclusions

High Availability

The two platforms were able to manage the failover operation in a limited time frame (below 1 minute). Nevertheless, MySQL/Galera was shown to be more efficient and consistent. This result is a direct consequence of synchronous replication, which by design prevents MySQL/Galera from allowing an active node to fall behind.

In my opinion, the replication method used in Aurora is efficient, and given that data is shared across the read replicas, fail-over should happen faster.

The tests suffered because of the connector, and I have the feeling that having another solution in place may bring some surprises (actually, I would like to test that as well).

Performance

In this run of tests, Aurora was able to invert the results I had in the first test with the small boxes. In almost all cases, Aurora performed as well or better then XtraDB Cluster. There are still cases where Aurora is penalized, and those are the ones where hotspots are present. The contention in Aurora is killing performance, and raise errors (188). But I hope we will see a significant evolution soon.

General Comments on Aurora

The product is evolving quickly, and benchmark results may become obsolete in very short time (this is why it is important to have repeatable and comparable tests). From my point of view, in this set of tests Aurora clearly shows where it’s a better fit: higher-end levels, where high availability and CPU power is the focus (not concerns about the cost).

There is no reason to use Aurora in small-mid boxes: the platform is not going to be as efficient as a standard solution like XtraDB Cluster. But if cost is not an issue, and the applications require a lot of parallelism, Aurora on db.r3.8xlarge is a good solution.

I still see space for improvements (like for cluster connectors, or the time taken to restart a cluster after a full stop, or contention reduction). But I am also confident that the work led by the development team will fix most of my concerns (and more) soon.

Final note: it would be nice to have the code open source, so that the community could contribute (but I understand the business reasons not to).

About Cost

I don’t think it is this the right place to mention the cost of each solution (especially because each need is different).

As such, I am not reporting any specific numbers. You can, however, follow the links below and do the necessary math:

Aurora cost calculator

AWS cost calculator

 

May
13
2016
--

Benchmark MongoDB with sysbench

Benchmark MongoDB with sysbench

Benchmark MongoDB with sysbenchIn this blog post, we’ll discuss how to benchmark MongoDB with sysbench.

In an earlier post, I mentioned our use of sysbench-mongodb (via this fork) to run benchmarks of MongoDB servers. I now want to share our work extending sysbench to make it work with MongoDB.

If you’re not familiar with sysbench, it’s a great project developed by Alexey Kopytov that lets you run different types of benchmarks (referred to as “tests” by the tool), including database benchmarks. The database tests are implemented in Lua scripts, which means you can customize them as needed (or even write new ones from scratch) – something useful for simulating specific workloads.

All of the database tests in sysbench assume an SQL-based database, so instead of trying to shoehorn MongoDB tests into this framework I modified the connect/disconnect functions to handle MongoDB, and then implemented new functions specific for this database.

You can find the work (which is still in progress but usable, and in fact currently used by us in benchmarks) on the dev-mongodb-support-1.0 branch of our sysbench fork.

To use it, you just need to specify the –mongo-url argument (others too, as needed, but this is the one that must be present for sysbench to detect a MongoDB test is requested), and then provide the path to the Lua script you want to run. The following is an example:

sysbench
--mongo-write-concern=1
--mongo-url="mongodb://localhost"
--mongo-database-name=sbtest
--test=sysbench/sysbench/tests/mongodb/oltp.lua
--oltp_table_size=60000000
--oltp_tables_count=16
--num-threads=512
--rand-type=pareto
--report-interval=10
--max-requests=0
--max-time=600
--oltp-point-selects=10
--oltp-simple-ranges=1
--oltp-sum-ranges=1
--oltp-order-ranges=1
--oltp-distinct-ranges=1
--oltp-index-updates=1
--oltp-non-index-updates=1
--oltp-inserts=1 run

To build this branch, you’ll first need to build and install (or otherwise obtain) the mongo-c-driver project, as that is what we use to connect to MongoDB. Once that’s done, building is just a matter of running the following commands from the repo’s root:

./autogen.sh
./configure
make
sudo make install #optionally

The changes should not affect the other database tests in sysbench, though I have only verified that the MySQL ones continue to work.

Right now, the workload from sysbench-mongodb is implemented in Lua scripts (oltp.lua), and work is in progress to allow freeform operations to be created with new Lua scripts (by providing functions that take JSON as the argument). As an alternative, you may want to check out this much-less-tested (and currently unstable) branch based on luamongo. It already supports the creation of arbitrary workloads in Lua. In this case, you also need to build luamongo, which is included.

With either branch, you can add new tests by implementing new Lua scripts (though the dev-mongodb-support-1.0 branch still needs a few functions implemented on the C side to support arbitrary operations from the Lua side).

We think there are still some types of operations needed to improve sysbench’s usefulness for MongoDB, such as queries involving arrays, union, the $in operator, geospatial operators, and in place updates.

We hope you find this useful, and we welcome suggestions and bug reports to improve it.

Happy benchmarking!

Feb
24
2015
--

Is MySQL’s innodb_file_per_table slowing you down?

MySQL’s innodb_file_per_table is a wonderful thing – most of the time. Having every table use its own .ibd file allows you to easily reclaim space when dropping or truncating tables. But in some use cases, it may cause significant performance issues.

Many of you in the audience are responsible for running automated tests on your codebase before deploying to production. If you are, then one of your goals is having tests run as quickly as possible so you can run them as frequently as possible. Often times you can change specific settings in your test environment that don’t affect the outcome of the test, but do improve throughput. This post discusses how innodb_file_per_table is one of those settings.

I recently spoke with a customer whose use case involved creating hundreds of tables on up to 16 schemas concurrently as part of a Jenkins testing environment. This was not in production, so performance was far more important than durability. They’d run their tests, and then drop the schemas. This process took close to 20 minutes. They asked “How can we make this faster?”

Due to the number of tables involved innodb_file_per_table seemed a likely culprit.

It’s been noted here on the MySQL Performance Blog that innodb_file_per_table can cause table creation and drops to slow down. But what exactly is the performance hit? We wanted to find out.

The innodb_file_per_table Test:

On a test server running CentOS release 6.5, xfs filesystem, and 5.6.22-71.0-log Percona Server, I ran the following homemade benchmark bash script:

[root@host ~]# time $(for db in {1..16};
do mysql -e "create database bench$db";
$(for tb in {1..500}; do $(mysql bench$db -e "create table tab${tb} (i int) engine=innodb"); done) & done)

If you open the mysql client in another screen or terminal, you should see something like this:

...
+-------+------+-----------+---------+---------+------+----------------+------------------------------------------+-----------+---------------+         
| Id    | User | Host      | db      | Command | Time | State          | Info                                     | Rows_sent | Rows_examined |         
+-------+------+-----------+---------+---------+------+----------------+------------------------------------------+-----------+---------------+         
| 80013 | root | localhost | NULL    | Query   |    0 | init           | show processlist                         |         0 |             0 |         
| 89462 | root | localhost | bench5  | Query   |    0 | creating table | create table tab95 (i int) engine=innodb |         0 |             0 |         
| 89466 | root | localhost | bench8  | Query   |    0 | creating table | create table tab81 (i int) engine=innodb |         0 |             0 |         
| 89467 | root | localhost | bench1  | Query   |    0 | creating table | create table tab91 (i int) engine=innodb |         0 |             0 |         
| 89468 | root | localhost | bench13 | Query   |    0 | creating table | create table tab90 (i int) engine=innodb |         0 |             0 |         
| 89469 | root | localhost | bench15 | Query   |    0 | creating table | create table tab90 (i int) engine=innodb |         0 |             0 |         
| 89472 | root | localhost | bench9  | Query   |    0 | creating table | create table tab86 (i int) engine=innodb |         0 |             0 |         
| 89473 | root | localhost | bench10 | Query   |    0 | creating table | create table tab94 (i int) engine=innodb |         0 |             0 |         
| 89474 | root | localhost | bench11 | Query   |    0 | creating table | create table tab80 (i int) engine=innodb |         0 |             0 |         
| 89475 | root | localhost | bench3  | Query   |    0 | creating table | create table tab80 (i int) engine=innodb |         0 |             0 |         
| 89476 | root | localhost | bench2  | Query   |    0 | creating table | create table tab82 (i int) engine=innodb |         0 |             0 |         
| 89478 | root | localhost | bench4  | Query   |    0 | creating table | create table tab91 (i int) engine=innodb |         0 |             0 |         
| 89479 | root | localhost | bench16 | Query   |    0 | creating table | create table tab88 (i int) engine=innodb |         0 |             0 |         
| 89481 | root | localhost | bench12 | Query   |    0 | creating table | create table tab90 (i int) engine=innodb |         0 |             0 |         
| 89483 | root | localhost | bench6  | Query   |    0 | creating table | create table tab96 (i int) engine=innodb |         0 |             0 |         
| 89484 | root | localhost | bench14 | Query   |    0 | creating table | create table tab95 (i int) engine=innodb |         0 |             0 |         
+-------+------+-----------+---------+---------+------+----------------+------------------------------------------+-----------+---------------+         
...        

After creating the tables, I dropped all schemas concurrently:

[root@host ~]# time $(for db in {1..16};
do mysql -e "drop database bench${db}" & done)

So what was the difference with innodb_file_per_table ON vs OFF?

  • With innodb_file_per_table=ON
    • Schema and table creation = 1m54.852s
    • Schema drops = 1m21.682s
  • With innodb_file_per_table=OFF
    • Schema and table creation = 0m59.968s
    • Schema drops = 0m54.870s

So creation time decreased by 48%, drop time decreased by 33%.

I think its worth noting that this benchmark creates and drops empty tables. Dropping InnoDB tables created with innodb_file_per_table=ON can take much longer if they have large amounts of data.

Please also be aware that there are always trade-offs when modifying your InnoDB settings. That is outside the scope of this post, so please research and test before making changes. The MySQL documentation discusses that here.  In 5.6.6 and up, innodb_file_per_table is ON by default. MySQL 5.6 will also create temp tables as InnoDB, as noted here.

So there you have it. If your primary goal is to improve create and drop table time, turning OFF innodb_file_per_table will save significant amounts of time.

The post Is MySQL’s innodb_file_per_table slowing you down? appeared first on MySQL Performance Blog.

Jan
15
2015
--

Hyper-threading – how does it double CPU throughput?

Computer CPUThe other day a customer asked me to do capacity planning for their web server farm. I was looking at the CPU graph for one of the web servers that had Hyper-threading switched ON and thought to myself: “This must be quite a misleading graph – it shows 30% CPU usage. It can’t really be that this server can handle 3 times more work?”

Or can it?

I decided to do what we usually do in such case – I decided to test it and find out the truth. Turns out – there’s more to it than meets the eye.

How Intel Hyper-Threading works

Before we get to my benchmark results, let’s talk a little bit about hyper-threading. According to Intel, Intel® Hyper-Threading Technology (Intel® HT Technology) uses processor resources more efficiently, enabling multiple threads to run on each core. As a performance feature, Intel HT Technology also increases processor throughput, improving overall performance on threaded software.

Sounds almost like magic, but in reality (and correct me if I’m wrong), what HT does essentially is – by presenting one CPU core as two CPUs (threads rather), it allows you to offload task scheduling from kernel to CPU.

So for example if you just had one physical CPU core and two tasks with the same priority running in parallel, the kernel would have to constantly switch the context so that both tasks get a fair amount of CPU time. If, however, you have the CPU presented to you as two CPUs, the kernel can give each task a CPU and take a vacation.

On the hardware level, it will still be one CPU doing the same amount of work, but there maybe some optimization to how that work is going to be executed.

My hypothesis

Here’s the problem that was driving me nuts: if HT does NOT actually give you twice more power and yet the system represents statistics for each CPU thread separately, then at 50% CPU utilization (as per mpstat on Linux), the CPU should be maxed out.

So if I tried to model the scalability of that web server – a 12-core system with HT enabled (represented as 24 CPUs on a system), assuming perfect linear scalability, here’s how it should look:

Throughput
(requests per second)
  |
9 |         ,+++++++++++++++
  |        +
  |       +
6 |      +
  |     +
  |    +
3 |   +
  |  +
  | +
0 '-----+----+----+----+----
    1   6   12   18   24

In the example above, single CPU thread could process the request in 1.2s, which is why you see it max out at 9-10 requests/sec (12/1.2).

From the user perspective, this limitation would hit VERY unexpectedly, as one would expect 50% utilization to be… well, exactly that – 50% utilization.

In fact, the CPU utilization graph would look even more frustrating. For example if I were increasing the number of parallel requests linearly from 1 to 24, here’s how that relationship should look:

CPU utilization:
100% |         ++++++++++++++
     |         .
     |         .
     |         .
     |         .
 50% |         .
     |       +
     |     +
     |   +
     | +
  0% '----+----+----+----+----
    0     6   12   18   24

Hence CPU utilization would skyrocket right at 12 cores from 50% to 100%, because in fact the system CPU would be 100% utilized at this point.

What happens in reality

Naturally, I decided to run a benchmark and see if my assumptions are correct. The benchmark was pretty basic – I wrote a CPU-intensive php script, that took 1.2s to execute on the CPU I was testing this, and bashed it over http (apache) with ab at increasing concurrency. Here’s the result:

Requests per secondRaw data can be found here.

If this does not blow your mind, please go over the facts again and then back at the graph.

Still not sure why do I find this interesting? Let me explain. If you look carefully, initially – at concurrency of 1 through 8 – it scales perfectly. So if you only had data for threads 1-8 (and you knew processes don’t incur coherency delays due to shared data structures), you’d probably predict that it will scale linearly until it reaches ~10 requests/sec at 12 cores, at which point adding more parallel requests would not have any benefits as the CPU would be saturated.

What happens in reality, though, is that past 8 parallel threads (hence, past 33% virtual CPU utilization), execution time starts to increase and maximum performance is only achieved at 24-32 concurrent requests. It looks like at the 33% mark there’s some kind of “throttling” happening.

In other words, to avoid a sharp performance hit past 50% CPU utilization, at 33% virtual thread utilization (i.e. 66% actual CPU utilization), the system gives the illusion of a performance limit – execution slows down so that the system only reaches the saturation point at 24 threads (visually, at 100% CPU utilization).

Naturally then the question is – does it still make sense to run hyper-threading on a multi-core system? I see at least two drawbacks:

1. You don’t see the real picture of how utilized your system really is – if the CPU graph shows 30% utilization, your system may well be 60% utilized already.
2. Past 60% physical utilization, execution speed of your requests will be throttled intentionally in order to provide higher system throughput.

So if you are optimizing for higher throughput – that may be fine. But if you are optimizing for response time, then you may consider running with HT turned off.

Did I miss something?

The post Hyper-threading – how does it double CPU throughput? appeared first on MySQL Performance Blog.

Jun
24
2014
--

Sysbench Benchmarking of Tesora’s Database Virtualization Engine

Tesora, previously called Parelastic, asked Percona to do a sysbench benchmark evaluation of its Database Virtualization Engine on specific architectures on Amazon EC2.tesora2

The focus of Tesora is to provide a scalable Database As A Service platform for OpenStack. The Database Virtualization Engine (DVE) plays a part in this as it aims at allowing databases to scale transparently across multiple MySQL shards.

DVE was open sourced last week. Downloads and source are already available on tesora.com

Some of the features include:

  • Transparent Sharding of data accross multiple storage nodes.
  • Applications can connect to DVE directly, using the MySQL Protocol, no code changes required.
  • Transactional and full ACID compliance with multiple storage nodes.
  • Storage Nodes can be added on an existing cluster.

Benchmarking Setup

Synthetic benchmarks were run using sysbench on different environments and different DVE architectures, as provided by Tesora. Architectures with 1 and 3 DVE nodes were benchmarked using up to 5 storage nodes.

The environments include a disk-bound dataset, the purpose of which is to create more insight into how large datasets might scale across multiple nodes. This type of use case is a common reason for companies to look into solutions like sharding.

A memory-bound dataset was also benchmarked to find out how DVE performs.

The memory-bound dataset was also used to compare a standalone MySQL Instance with a single DVE node using a single storage node. This provides more insight into the amount of overhead DVE creates at its most basic setup.

The remainder of this blog post gives a general overview on the findings of these benchmarks.
The full report, which includes more information on the configuration and goes deeper in it’s analysis of the results, can be downloaded here (PDF).

Scalability

OLTP – Disk Bound – Throughput:

analysis_medium_oltp_Throughput_avgOLTP – Disk Bound – Response Time:

In terms of scalability DVE lives up to its expectations as long as there is enough CPU available on the DVE nodes.

More complex transactions that query across multiple shards scale quite well. The explanation for that scalability, which is beyond linear, is that the available memory grows as storage nodes are added. The environment becomes less disk bound and performs better.

SELECT, INSERT, UPDATE – Disk Bound – Throughput:

SELECT, INSERT, UPDATE – Disk Bound – Response Time:

Single row reads and writes scale even better. This demonstrates that sharding has to be tailored towards both the data and the queries that will be executed across those tables.

Especially when using storage nodes with default EBS storage, disk performance is bad which makes the difference even larger.

The good thing about this solution is that storage and DVE nodes can be added, and the capacity of the whole system increases. No application code changes are necessary.

Running such an environment could come at a higher cost, but it could save a lot of resources and thus money on development. Of course, as a sharded architecture such as this are more complex compared to a non sharded architecture, the operational cost should not be ignored.

Overhead

Memory Bound Throughput:

Memory Bound Response Time:

The overhead of using DVE is noticeable, but much of it is related to the added network hop in the database layer and the CPU requirements of the DVE nodes.

DVE suffers as we go above 64 threads as you can see in the big increase in response time.

CPU Intensive

OLTP – Disk Bound:

DVE nodes are very CPU intensive. Throughout the analysis, the bottleneck is often caused by the CPU-constrained DVE nodes, this is very visible when looking at the benchmark results of p1_s3 and p1_s5, which use only one DVE node.

Even with simpler single-row, primary key-based SELECT statements, the overhead is noticeable.

Keep in mind that the hardware specifications of the DVE nodes (8 cores) were much higher than the database nodes (2 cores) itself. This makes the issue even more apparent.

Summary

Overall DVE looks very promising. When looking at the sysbench results, DVE seems to scale very well, provided that there are enough CPU resources available for the DVE nodes. You can download the full report here (PDF).

It will be interesting to see how much DVE scales on larger instance types. Other types of benchmarks should also be performed, such as linkbench. It is also important to understand the impact on the operational side, such as adding storage nodes, backups & recovery, high availability and how well it works transactions in various scenarios. Stay tuned.

The post Sysbench Benchmarking of Tesora’s Database Virtualization Engine appeared first on MySQL Performance Blog.

May
07
2013
--

Benchmarking Percona Server TokuDB vs InnoDB

After compiling Percona Server with TokuDB, of course I wanted to compare InnoDB performance vs TokuDB.
I have a particular workload I’m interested in testing – it is an insert-intensive workload (which is TokuDB’s strong suit) with some roll-up aggregation, which should produce updates in-place (I will use INSERT .. ON DUPLICATE KEY UPDATE statements for that), so it will produce all good amount of reads.

A few words about the hardware: I am going to use new the Dell PowerEdge R420 with two Intel(R) Xeon(R) CPU E5-2450 0 @ 2.10GHz, 48GB of RAM and SATA SSD: Kingston HyperX 3K 240 GB.

Workload: I will use two different schemas. The first schema is from sysbench, and the table looks like:

CREATE TABLE sbtest$I (
id BIGINT UNSIGNED NOT NULL,
k INTEGER UNSIGNED DEFAULT '0' NOT NULL,
c CHAR(120) DEFAULT '' NOT NULL,
pad CHAR(60) DEFAULT '' NOT NULL,
PRIMARY KEY (id)
)
CREATE INDEX k on sbtest$I(k)

and tables sbtest$I_r10, sbtest$I_r100, sbtest$I_r1000, with roll-up sum for 10, 100, 1000 records in the main table.

and transactions for this workload are:

$ID=monotonically increasing ID
$K=rand(0,10000) // distributed by pareto distribution
$C, $PAD = random_string()
BEGIN
INSERT INTO sbtest (id, k, c, pad) VALUES ($ID, $K, $C, $PAD);
INSERT INTO sbtest_r10 (id, k) VALUES ($ID/10, $K) ON DUPLICATE KEY UPDATE k=VALUES(k)+k;
INSERT INTO sbtest_r100 (id, k) VALUES ($ID/100, $K) ON DUPLICATE KEY UPDATE k=VALUES(k)+k;
INSERT INTO sbtest_r1000 (id, k) VALUES ($ID/1000, $K) ON DUPLICATE KEY UPDATE k=VALUES(k)+k;
COMMIT;

So this workload produces SEQUENTIAL inserts into Primary Key, that this is quite suitable for InnoDB, and in it we have random inserts into SECONDARY KEYS (k),
which is not so good for InnoDB.

So let’s see what results we have. The results are in TPS (more is better) and we start with empty tables.

Now, before looking at the graph, please do not jump to conclusions, as the graph is MISLEADING.

TokuDB-1

So we see that InnoDB performance steadily declines from 24000 tps to 18000 tps, but InnoDB can’t make 5h run. After 3h the disk is full, and InnoDB data size is about 210GB with 234.238.440 inserted records.
While TokuDB averages around 14000 tps mark with some periodical drops into 10000 tps area.
TokuDB datasize after 5h of run is about 50GB with 276.934.863 records.

So why do I say that the graph is misleading?
Obviously we can say that InnoDB is faster, but you should look into the steady decline of InnoDB throughput. Eventually it will drop to the level 14000 tps and below. I do not have enough space on this SSD to run this experiment that long. So there we see the strong side of TokuDB: it has more than 4x data compression on this dataset.
We can easily fill TokuDB tables with 1bln of rows on this SSD, and projected InnoDB performance on this size will be the same or worse, but will require 1TB in size.

Now to see this point of intersection, let’s review different workload (which actually is closer to what I need).

Tables looks like:

CREATE TABLE `sbtest1` (
  `hid` int(10) unsigned NOT NULL DEFAULT '0',
  `mid` int(10) unsigned NOT NULL DEFAULT '0',
  `id` bigint(20) unsigned NOT NULL,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`hid`,`mid`,`id`)
)

and transactions are:

$HID=rand(0,10000) // distributed by uniform distribution
$MID=rand(0,10000) // distributed by uniform distribution
$ID=monotonically non-decreasing ID
$K=rand(0,10000) // distributed by pareto distribution
BEGIN
INSERT INTO sbtest (hid, mid, id, k) VALUES ($HID, $MID, $ID, $K);
INSERT INTO sbtest_r10 (hid, mid, id, k) VALUES ($HID, $MID, $ID/10, $K) ON DUPLICATE KEY UPDATE k=VALUES(k)+k;
INSERT INTO sbtest_r100 (hid, mid, id, k) VALUES ($HID, $MID, $ID/100, $K) ON DUPLICATE KEY UPDATE k=VALUES(k)+k;
INSERT INTO sbtest_r1000 (hid, mid, id, k) VALUES ($HID, $MID, $ID/1000, $K) ON DUPLICATE KEY UPDATE k=VALUES(k)+k;
COMMIT;

That is, our PRIMARY KEY is not sequential anymore, which is bad for InnoDB, but this is what I need for my setup (I still can have synthetic auto_inc PK, but in that case I still will need SECONDARY KEY (hid,mid,id) ).

Also please note the transaction produces 4 INSERTs and workload is very write intensive.

So what are results in this case:

TokuDB-2

InnoDB gradually declines as data growth (which is expected) and by the end of 5 hours averages at 2700 tps.
With TokuDB we also see a drop, and by end of 5 hours the average throughput is 7800 tps.
Something to take into account: TokuDB results are not quite stable, that is why I also show 5-minute moving averages to TokuDB.

So TokuDB shows about 2.8x better throughput, and on data size:

  • InnoDB table: 58GB and 244.980.192 records
  • TokuDB table: 15GB and 232.927.460 records

So TokuDB looks better in this workload, however the sparse throughput is worrisome to me. Let’s zoom in to 10 min intervals and see throughput:
TokuDB-3
We can see periodical drops, which I believe are related to 60-sec checkpoint interval, as TokuDB does time-based checkpoints.
These drops are quite concerning, and it might be a problem for some users.

Now, I understand that my PRIMARY KEY (hid,mid,id) where id is sequential, and hid,mid is low selectivity is not good for fast inserts, but it is suitable for range selects by id. However it will interesting how both InnoDB and TokuDB performs if PK is (id,hid,mid). This also will affect select performance, so we will need to measure that also.

And, if you want to repeat this benchmark, the sysbench code is on Launchpad lp:~vadim-tk/sysbench/insert-roll-2,
command line to run:

sysbench --test=insert_roll.lua --oltp-table-size=10000  --mysql-user=root --oltp-tables-count=32 --mysql_table_engine=tokudb --oltp_auto_inc=on --max-time=18000 --report-interval=10 --max-requests=0 --num-threads=32 --rand-type=pareto run

and InnoDB options are:

[mysqld_safe]
numa-interleave
malloc-lib=/usr/local/Percona-Server-5.5.30-rel30.2-500.Linux.x86_64/lib/mysql/libjemalloc.so
[mysqld]
gdb
datadir=/mnt/data/mysql
#for SSD
innodb_flush_neighbor_pages = none
innodb_adaptive_flushing_method = keep_average
innodb_file_per_table = true
innodb_data_file_path = ibdata1:100M:autoextend
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_buffer_pool_size = 36G
innodb_log_file_size = 4G
innodb_log_files_in_group = 2
innodb_log_block_size=4096
#####plugin options
innodb_read_io_threads = 16
innodb_write_io_threads = 4
innodb_io_capacity = 4000
innodb_use_native_aio=0
#not innodb options (fixed)
port = 3306
back_log = 50
max_connections = 2000
max_prepared_stmt_count=500000
max_connect_errors = 10
table_open_cache = 2048
max_allowed_packet = 16M
binlog_cache_size = 16M
max_heap_table_size = 64M
sort_buffer_size = 4M
join_buffer_size = 4M
thread_cache_size = 1000
query_cache_size = 0
query_cache_type = 0
ft_min_word_len = 4
#default_table_type = InnoDB
thread_stack = 192K
tmp_table_size = 64M
server-id = 10
#*** MyISAM Specific options
key_buffer_size = 8M
read_buffer_size = 1M
read_rnd_buffer_size = 4M
bulk_insert_buffer_size = 8M
myisam_sort_buffer_size = 8M
myisam_max_sort_file_size = 10G
#myisam_max_extra_sort_file_size = 10G
myisam_repair_threads = 1
myisam_recover
socket=/var/lib/mysql/mysql.sock
user=root
skip-grant-tables

TokuDB-related options are all defaults, as I understand from documentation TokuDB comes with good settings out-of-box, but I am ready to tune something if there are suggestions.

The post Benchmarking Percona Server TokuDB vs InnoDB appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com