Oct
09
2020
--

Amazon Aurora Multi-Primary First Impression

For what reason should I use a real multi-primary setup?

To be clear, I am not talking about a multi-writer solution where any node can become the active writer when needed, as with Percona XtraDB Cluster (PXC) or Percona Server for MySQL using Group Replication. No, we are talking about a multi-primary setup where I can write at the same time on multiple nodes. I want to insist on this “why?”.

After having excluded the possible solutions mentioned above, both covering the famous 99.995% availability, which is 26.30 minutes of downtime in a year, what is left?

Disaster Recovery? Well, that is something I would love to have, but to be a real DR solution we need to put several kilometers (miles for imperial) in the middle.

And we know (see here and here) that aside from some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA; OK, that is a valid reason. Or I may need to scale the number of writes, which is also a valid reason. This means, in the end, that I am looking at a multi-primary setup because it should:

  • Scale writes (more nodes, more writes)
  • Provide consistent reads (what I write on A must be visible on B)
  • Give me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day; see the quick arithmetic below)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.
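
To put those downtime numbers in context, here is the quick arithmetic behind them (a tiny Python sketch, nothing more than the math spelled out):

# Downtime budgets for the availability levels mentioned above.
MINUTES_PER_YEAR = 365 * 24 * 60          # 525600
MS_PER_DAY = 24 * 60 * 60 * 1000          # 86,400,000

print("99.995%% -> %.2f minutes of downtime per year" % ((1 - 0.99995) * MINUTES_PER_YEAR))
print("99.999%% -> %.0f ms of downtime per day" % ((1 - 0.99999) * MS_PER_DAY))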

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB Cluster. But my (virtual) boss was at AWS re:Invent and someone mentioned to him that Aurora Multi-Primary does what I was looking for. This (long) article is my voyage in discovering if that is true or … not.

Given I am focused on the behavior first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use small Aurora instances, and I will perform the tests from two EC2 instances in the same VPC/region as the nodes.

You can find the details about the tests on GitHub here.

Finally, I will test:

  • Connection speed
  • Stale read
  • Write single node for baseline
  • Write on both nodes:
    • Scaling splitting the load by schema
    • Scaling same schema

Test Results

Let us start to have some real fun. The first test is …

Connection Speed

The purpose of this test is to evaluate the time taken to open a new connection and the time taken to close it. Opening and closing connections can be a very expensive operation, especially if applications do not use a connection pooling mechanism.
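
Below is a minimal sketch of how such a measurement can be done with Python and mysql-connector-python; the endpoint and credentials are placeholders, and in the actual tests the target was either the Aurora endpoints or ProxySQL:

import time
import mysql.connector  # pip install mysql-connector-python

HOST = "aurora-or-proxysql-endpoint"   # placeholder
USER, PASSWORD = "app", "secret"       # placeholders

open_times, close_times = [], []
for _ in range(100):
    t0 = time.perf_counter()
    cnx = mysql.connector.connect(host=HOST, user=USER, password=PASSWORD)
    t1 = time.perf_counter()
    cnx.close()
    t2 = time.perf_counter()
    open_times.append(t1 - t0)
    close_times.append(t2 - t1)

print("avg open : %.4f s" % (sum(open_times) / len(open_times)))
print("avg close: %.4f s" % (sum(close_times) / len(close_times)))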



As we can see, ProxySQL is the most efficient way to deal with opening connections, which was expected given that it is designed to reuse open connections towards the backend.


The close connection operation is different: there, ProxySQL seems to take a little bit longer.

As a global observation, we can say that by using ProxySQL we have more consistent behavior. Of course, this test is a simplistic one, and we are not checking the scalability (from 1 to N connections) but it is good enough to give us the initial feeling. Specific connection tests will be the focus of the next blog on Aurora MM.

Stale Reads

Aurora multi-primary uses the same mechanism as the default Aurora to update the buffer pool:


It uses the page cache update mechanism, just in both directions. This means that the buffer pool of Node2 is updated with the modifications performed on Node1, and vice versa.

To verify whether an application would really be able to get consistent reads, I ran this test. It is meant to measure if, and how many, stale reads we get when writing on one node and reading from the other.
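
The idea, in a simplified Python sketch (endpoints, credentials, and the throwaway table are placeholders): insert a marker row on one node and poll the other node until the row becomes visible, recording the lag.

import time
import mysql.connector  # pip install mysql-connector-python

writer = mysql.connector.connect(host="node1-endpoint", user="app", password="secret",
                                 database="test", autocommit=True)
reader = mysql.connector.connect(host="node2-endpoint", user="app", password="secret",
                                 database="test", autocommit=True)
wcur = writer.cursor()
rcur = reader.cursor(buffered=True)
wcur.execute("CREATE TABLE IF NOT EXISTS lag_check (id BIGINT PRIMARY KEY)")

marker = int(time.time() * 1e9)
t0 = time.perf_counter()
wcur.execute("INSERT INTO lag_check (id) VALUES (%s)", (marker,))

# Keep reading on the other node until the row shows up; until then the read is stale.
while True:
    rcur.execute("SELECT id FROM lag_check WHERE id = %s", (marker,))
    if rcur.fetchone():
        break
print("row visible on the reader after %.6f s" % (time.perf_counter() - t0))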

Amazon Aurora multi-primary has two consistency models:

Aurora consistency model
As an interesting fact, the result was that with the default consistency model (INSTANCE_RAW), we got 100% stale reads.
Given that, I focused on identifying the cost of using the other consistency model (REGIONAL_RAW), which allows an application to have consistent reads.

The results indicate an increase of 44% in total execution time, and of 95% (22 times slower) in write execution.

It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models.

The graph below shows in yellow how long the application must wait to see the correct data on the reader node. In blue is the amount of time the application waits to get back the same consistent read because it must wait for the commit on the writer.

lag time in nanoseconds

As you can see, the two are more or less aligned.

Given the performance cost imposed by using REGIONAL_RAW,  all the other tests are done with the default INSTANCE_RAW, unless explicitly stated.

Writing Tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

 sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single Node (Baseline)

Before starting the comparative analysis, I was looking to define what was the “limit” of traffic/load for this platform.

baseline reads/writes

From the graph above, we can see that this setup scales up to 128 threads and after that, the performance remains more or less steady.

Amazon claims that we can nearly double the performance when using both nodes in write mode and using a different schema to avoid conflicts.

aurora scalability

Once more, remember I am not interested in the absolute numbers here, but I am expecting the same behavior. Given that, our expectation is to see:

expected scalability

Write on Both Nodes, Different Schemas

So AWS recommends this as the scaling solution:


And I diligently followed the advice. I used two EC2 nodes in the same subnet as the Aurora nodes, each writing to a different schema (tpcc & tpcc2).

Overview

Let us make it short and go straight to the point. Did we get the expected scalability?

Well, no:


We just had a 26% increase, quite far from the expected 100%. Let us see what happened in detail (if not interested, just skip to the next test).

Node 1

Schema read writes Aurora

Node 2


As you can see, Node1 was (more or less) keeping up with expectations, staying close to the expected performance. But Node2 was just not keeping up, and performance there was terrible.

The graphs below show what happened.

While Node1 was (again more or less) scaling up to the baseline expectations (128 threads), Node2 collapsed on its knees at 16 threads. Node2 was never able to scale up.

Reads

Node 1


Node1 is scaling reads as expected, although here and there we can see performance deterioration.

Node 2


Node2 is not scaling Reads at all.

Writes

Node 1


Same as Read.

Node 2


Same as read.

Now, someone may think I was making a mistake and writing to the same schema. I assure you I was not. Check the next test to see what happens when using the same schema.

Write on Both Nodes,  Same Schema

Overview

Now, now, Marco, this is unfair. You know this will cause contention. Yes, I do! But nonetheless, I was curious to see what was going to happen and how the platform would deal with that level of contention.

My expectation was a lot of performance degradation and an increased number of locks. About conflicts, I was not wrong; after the test, Node2 reported:

+-------------+---------+-------------------------+
| table       | index   | PHYSICAL_CONFLICTS_HIST |
+-------------+---------+-------------------------+
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |
+-------------+---------+-------------------------+

This is obviously a strong indication that something was not working right. In terms of performance gain, if we compare ONLY the results with 128 threads:


Even with the high level of conflict, we still have a 12% performance gain.

The problem is that, in general, the two nodes behave quite badly. If you check the graph below, you can see that the level of conflict prevents the nodes not only from scaling but also from behaving consistently.

Node 1

Write on Both Nodes,  Same Schema

Node 2


Reads

In the following graphs, we can see how Node1 had issues and actually crashed three times, during the tests with 32/64/512 threads. Node2 was always up, but its performance was very low.

Node 1


Node 2


Writes

Node 1


Node 2


Recovery From Crashed Node

About recovery time, reading the AWS documentation and listening to presentations, I often heard that Aurora Multi-Primary is a 0 downtime solution. Or other statements like: “in applications where you can’t afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn’t use the failover mechanism, because it doesn’t need to promote another DB instance to have read/write capability”

To achieve this, the suggested solution I found was to have applications point directly to the node endpoints and not use the cluster endpoint.

In this context, the solution pointing to the nodes should be able to fail over within a second or so, while the cluster endpoint:

Recovery From Crashed Node

Personally, I think that designing an architecture where the application is responsible for the connection to the database and for failover is a leftover from 2001. But if you feel this is the way, well, go for it.
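
Just to illustrate what that choice pushes onto the application, here is a hypothetical sketch of client-side failover (endpoints, credentials, and schema are placeholders, and this is exactly the kind of code I would rather not maintain):

import mysql.connector
from mysql.connector import Error   # pip install mysql-connector-python

NODES = ["node1-endpoint", "node2-endpoint"]     # placeholder instance endpoints

def get_connection():
    """Try each node endpoint in order; the application owns the failover logic."""
    for host in NODES:
        try:
            return mysql.connector.connect(host=host, user="app", password="secret",
                                           database="tpcc", connection_timeout=2)
        except Error:
            continue                 # node unreachable, try the next one
    raise RuntimeError("no writable node available")

def run(query, params=None):
    """Reconnect and retry once if the connection dies mid-statement."""
    for _ in range(2):
        cnx = get_connection()
        try:
            cur = cnx.cursor()
            cur.execute(query, params)
            cnx.commit()
            return cur.fetchall() if cur.with_rows else None
        except Error:
            try:
                cnx.close()
            except Error:
                pass                 # connection already gone
    raise RuntimeError("query failed on all nodes")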

What I did for testing was to use ProxySQL, as plain as possible, with nothing else but its basic native monitor. I then compared the results with the tests using the cluster endpoint. In this way, I followed the advice of pointing directly at the nodes, but did it in a more modern way.

The results are below and they confirm (more or less) the data coming from Amazon.


A downtime of seven seconds is quite a long time nowadays, especially if I am targeting a 5-nines solution, which, remember, means 864 ms of downtime per day. Using ProxySQL gets closer to that, but it is still too long to be called zero downtime.
I also had fail-back issues when using the AWS cluster endpoint, given that it was not able to move the connections to the re-joining node seamlessly.

Last but not least, when using the consistency level INSTANCE_RAW, I had some data issues as well as PK conflicts:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry ‘18828082’ for key ‘PRIMARY’) 

Conclusions

As stated at the beginning of this long blog, the reasonable expectations for going with a multi-primary solution were:

  • Scale writes (more nodes, more writes)
  • Give me zero downtime, or close to it (5 nines is a maximum downtime of 864 milliseconds per day)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently.

Honestly, I feel we have completely failed on the scaling point. Probably, if I used the largest Aurora instances, I would get much better absolute numbers and it would take longer to encounter the same issues, but I would encounter them. In any case, if the multi-primary solution is designed to provide that scalability, it should do so with any instance size.

I did not have zero downtime, but I was able to failover pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is prone to possible negative effects like stale reads. Given that, I consider this requirement not fully satisfied.

Given all the above, I think this solution could eventually be valid only for high availability (close to 5 nines), but given that it comes with some limitations, I do not feel comfortable preferring it over others just for that; in the end, default Aurora is already good enough as a highly available solution.

References

AWS re:Invent 2019: Amazon Aurora Multi-Master: Scaling out database write performance

Working with Aurora multi-master clusters

Improving Enterprises HA and Disaster Recovery Solutions Reviewed

Robust HA Solutions with ProxySQL

Limitations of Multi-Master Clusters

Sep
16
2020
--

MongoDB 4.4 Performance Regression: Overwhelmed by Memory

There is a special collection of database bugs where the system starts to perform worse when given more resources. Here are some examples of such bugs for MySQL:

Bug #15815 – This is where InnoDB on an 8-CPU system performed worse than on a 4-CPU system with increased concurrency.

Bug #29847 – This is a bug similar to what I will describe today: when given more memory (innodb_buffer_pool_size), the MySQL crash recovery process would take longer than with less memory, as described in our blog Innodb Recovery – Is a large buffer pool always better?

It seems InnoDB Flushing was not optimal with a big amount of memory at that time, and this is what I think is happening with MongoDB 4.4 in the scenario I will describe.

MongoDB 4.4 Load Data Procedures

So when preparing data for my benchmark (Percona Server for MongoDB 4.2 vs 4.4 in Python TPCC Benchmark), I also measured how long it takes to load 1000 warehouses (about 165GB of data in MongoDB); to have repeatable numbers, I usually like to repeat the procedure multiple times.

What I noticed is that when MongoDB 4.4 starts with an unlimited cache (that is, on a server with 250GB of RAM, it will allocate about 125GB for the WiredTiger cache and the rest can be used for the OS cache), it shows interesting behavior.
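
For reference, MongoDB's documented default for the WiredTiger cache is 50% of (RAM - 1GB), with a 256MB floor, which is where the ~125GB figure comes from:

def default_wt_cache_gb(ram_gb):
    # WiredTiger default: the larger of 50% of (RAM - 1GB) or 256MB
    return max(0.5 * (ram_gb - 1), 0.25)

print(default_wt_cache_gb(250))   # 124.5, i.e. roughly the 125GB mentioned above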

Let me describe my load procedure, which is quite simple:

  1. Load data into database TPCC1
  2. Sleep 10 mins
  3. Load data into database TPCC3

That is, the second time we load into a different database, and in the background, pages for database TPCC1 should be flushed and evicted.

Before jumping to the problem I see, let's check the numbers for MongoDB 4.2, and by the numbers I mean how long it takes to accomplish step 1 and step 3.

MongoDB 4.2 with limited memory (WiredTiger cache 25GB):

Step 1: 20 min

Step 3: 21 min

MongoDB 4.4 with limited memory (WiredTiger cache 25GB):

Step 1: 24 min

Step 3: 26 min

MongoDB 4.2 with 125GB WiredTiger cache 

Step 1: 18 min

Step 3: 19 min

 

And now to the problem I see:

MongoDB 4.4 with 125GB WiredTiger cache:

Step 1: 19 min

Step 3: 497 min

Notice that Step 3 takes over eight hours, instead of the usual ~20 mins in all previous cases, and this is when MongoDB has 125GB of WiredTiger cache.

What’s interesting is that I do not see this issue when I limit the WiredTiger cache to 25GB, and also this problem does not exist in MongoDB 4.2.

That’s why I think MongoDB 4.4 starts to behave differently when it is given a lot of memory for the WiredTiger cache. Why exactly this happens I do not know yet, and I will continue to profile this case. A quick look indicates it may be related to the WiredTiger eviction process (similar to the InnoDB flushing problem in the crash recovery case) and to replication flow control, which was introduced in MongoDB 4.2 to keep replicas in sync (although I do not use replicas in this test).
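
One simple thing to watch while profiling this is the WiredTiger cache state reported by serverStatus; a small pymongo sketch (the connection string is a placeholder, and the exact statistic names can vary slightly between versions):

from pymongo import MongoClient   # pip install pymongo

client = MongoClient("mongodb://127.0.0.1:27017")          # placeholder connection string
cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]

gib = 1024 ** 3
print("cache used : %.1f GiB" % (cache["bytes currently in the cache"] / gib))
print("cache max  : %.1f GiB" % (cache["maximum bytes configured"] / gib))
print("dirty bytes: %.1f GiB" % (cache["tracked dirty bytes in the cache"] / gib))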



Sep
15
2020
--

Percona Server for MongoDB 4.2 vs 4.4 in Python TPCC Benchmark

Following my previous blogs on the py-tpcc benchmark for MongoDB, Evaluating the Python TPCC MongoDB Benchmark and Evaluating MongoDB Under Python TPCC 1000W Workload, and the recent release of Percona Server for MongoDB 4.4, I wanted to evaluate 4.2 vs 4.4 in similar scenarios.

Hardware Specs

For the client and server, I will use identical bare metal servers, connected via a 10Gb network.

The node specification:

# Percona Toolkit System Summary Report ######################
        Date | 2020-09-14 16:52:46 UTC (local TZ: EDT -0400)
    Hostname | node3
      System | Supermicro; SYS-2028TP-HC0TR; v0123456789 (Other)
    Platform | Linux
     Release | Ubuntu 20.04.1 LTS (focal)
      Kernel | 5.4.0-42-generic
Architecture | CPU = 64-bit, OS = 64-bit
# Processor ##################################################
  Processors | physical = 2, cores = 28, virtual = 56, hyperthreading = yes
      Models | 56xIntel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
      Caches | 56x35840 KB
# Memory #####################################################
       Total | 251.8G
  Swappiness | 0
 DirtyPolicy | 80, 5
 DirtyStatus | 0, 0

The drive I used for the storage in this benchmark is a Samsung SM863 SATA SSD.

MongoDB Topology

For MongoDB I used:

  • Single node instance without limiting cache size. As the bare metal server has 250GB of RAM, MongoDB should allocate 125GB of memory for WiredTiger cache and the rest will be used for OS cache. This should produce more CPU bound workload.
  • Single node instance with limited cache size. For the WiredTiger cache I will set a limit of 25GB, and to limit the OS cache I will limit the memory available to the mongodb instance to 50GB, as described in Using Cgroups to Limit MySQL and MongoDB memory usage. However, I did not use cgroups in this case; instead, I used Docker to run the different versions and set the limits.

The script to start Percona Server for MongoDB in docker with memory limits:

> bash startserver.sh 4.4 1
=== script startserver.sh ===
# $1 = PSMDB version (docker image tag), $2 = instance number
docker run -d --name db$2 -m 50g  \
          -v /mnt/data/psmdb$2-$1:/data/db \
          --net=host \
          percona/percona-server-mongodb:$1 --replSet "rs$2" --port $(( 27016 + $2 )) \
          --logpath /data/db/server1.log --slowms=10000 --wiredTigerCacheSizeGB=25 

sleep 10

# initialize a single-member replica set on the freshly started instance
mongo mongodb://127.0.0.1:$(( 27016 + $2 )) --eval "rs.initiate( { _id : 'rs$2',  members: [      { _id: 0, host: '172.16.0.3:$(( 27016 + $2 ))' }   ] })"

MongoDB Versions:

Benchmark Results

Unlimited Memory

The results are in New Order Transactions per Minute (NOTPM), and more is better:

Clients 4.2 4.4
10 541.31 691.89
30 999.89 1105.88
50 1048.50 1171.35
70 1095.72 1335.90
90 1184.38 1433.09
110 1210.18 1521.56
130 1231.38 1575.23
150 1245.31 1680.81
170 1224.13 1668.33
190 1300.11 1641.45
210 1240.86 1619.58
230 1220.89 1575.57
250 1237.86 1545.01
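
To quantify the gap, a quick calculation over the table above (the numbers are copied straight from the table):

clients   = [10, 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250]
notpm_42  = [541.31, 999.89, 1048.50, 1095.72, 1184.38, 1210.18, 1231.38,
             1245.31, 1224.13, 1300.11, 1240.86, 1220.89, 1237.86]
notpm_44  = [691.89, 1105.88, 1171.35, 1335.90, 1433.09, 1521.56, 1575.23,
             1680.81, 1668.33, 1641.45, 1619.58, 1575.57, 1545.01]

for c, a, b in zip(clients, notpm_42, notpm_44):
    print("%3d clients: 4.4 vs 4.2 = %+.1f%%" % (c, (b / a - 1) * 100))
# e.g. at 150 clients, 4.4 delivers about 35% more NOTPM than 4.2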

MongoDB 4.4 Unlimited Memory

Limited Memory, 50GB in Total and 25GB for Cache

The results are in New Order Transactions per Minute (NOTPM), and more is better:

Clients 4.2 4.4
10 351.45 377.29
30 483.88 447.22
50 535.34 522.59
70 576.30 574.14
90 604.49 582.10
110 618.59 542.11
130 593.31 386.33
150 386.67 301.75
170 265.91 298.80
190 259.56 301.38
210 254.57 301.88
230 249.47 299.15
250 251.03 300.00

MongoDB 4.2 Limited Memory

Observation

Actually, I wanted to perform more benchmarks of 4.4 vs 4.2, but some interesting behavior in 4.4 made me reconsider my plans; I got distracted trying to understand the issue, which I share in the post MongoDB 4.4 Performance Regression: Overwhelmed by Memory.

Besides that, in my tests 4.4 outperformed 4.2 in the unlimited-memory case, but I also want to consider throughput variation during the benchmark, so we are working on a py-tpcc version that reports data with 1-sec resolution. Also, I want to re-evaluate how 4.4 performs in a long-running benchmark, as the current length of the benchmark is 900 sec.

In the limited-memory case, 4.4 performed identically or worse than 4.2 with concurrency over 100 clients.

Neither version handled the increased number of clients well, showing worse results with 150 clients than with 10 clients.



Aug
27
2020
--

More on Checkpoints in InnoDB MySQL 8

Recently I posted about checkpointing in MySQL, where MySQL showed interesting “wave” behavior.

Soon after, Dimitri posted a solution on how to fix the “waves,” and I would like to dig a little more into the proposed suggestions, as there is some material to process.

This post will be very heavy on InnoDB configuration, so let’s start with the basic configuration for MySQL, but before that, a few words about the environment.

I use MySQL version 8.0.21 on the hardware described here.

As for the storage, I am not using some “old dusty SSD” but a production-grade Intel SATA SSD D3-S4510. This SSD can handle a throughput of 468MiB/sec of random writes, or 30000 IOPS of random 16KiB writes.

So initial configuration for my test was:

[mysqld]
datadir= /data/mysql8-8.0.21
user=mysql
bind_address = 0.0.0.0

socket=/tmp/mysql.sock
log-error=error.log

ssl=0
performance_schema=OFF

skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000

join_buffer_size=256K
sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=10G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size= 140G
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_purge_threads=4
innodb_lru_scan_depth=512
innodb_log_buffer_size=64M

default_storage_engine=InnoDB

innodb_flush_log_at_trx_commit  = 1
innodb_doublewrite= 1
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0

#innodb_monitor_enable=all
max_prepared_stmt_count=1000000

innodb_adaptive_hash_index=1
innodb_monitor_enable='%'

innodb-buffer-pool-load-at-startup=OFF
innodb_buffer_pool_dump_at_shutdown=OFF

There are a lot of parameters, so let's highlight the ones most relevant for this test:

innodb_buffer_pool_size= 140G

Buffer pool size is enough to fit all data, which is about 100GB in size

innodb_adaptive_hash_index=1

Adaptive hash index is enabled (as it comes in default InnoDB config)

innodb_buffer_pool_instances=8

This is what defaults provide, but I will increase it, following my previous post. 

innodb_log_file_size=10G

innodb_log_files_in_group=2

These parameters define a limit of 20GB for our redo logs, and this is important, as our workload will be redo-log bound, as we will see from the results.

innodb_io_capacity=2000

innodb_io_capacity_max=4000

You may ask: why do I use 2000 and 4000 while the storage can handle 30000 IOPS?

This is a valid point and, as we will see later, these values are not high enough for this workload; but that also does not mean we should push them all the way up to 30000, as the results will show.

MySQL Manual says the following about innodb_io_capacity:

“The innodb_io_capacity variable defines the overall I/O capacity available to InnoDB. It should be set to approximately the number of I/O operations that the system can perform per second (IOPS). When innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available for background tasks based on the set value.” 

From this, you may get the impression that if you set innodb_io_capacity to the I/O bandwidth of your storage, you should be fine. However, this part does not say what you should count as an I/O operation. For example, if your storage can perform 500MB/sec, then with 4KB block IO operations that is 125000 IO per second, and with 16KB IO it is about 31000 IO per second.

The MySQL manual leaves it up to your imagination, but as the typical InnoDB page size is 16KB, let's assume we do 16KB block IO.
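
The conversion is just bandwidth divided by block size; a tiny Python sketch of the arithmetic used above:

def iops(bandwidth_mb_per_sec, block_kb):
    # 1 MB = 1000 KB here, to match the round numbers in the example above
    return bandwidth_mb_per_sec * 1000 / block_kb

print(iops(500, 4))    # 125000.0 IO/s with 4KB blocks
print(iops(500, 16))   # 31250.0 IO/s with 16KB blocks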

However later on that page, we can read:

“Ideally, keep the setting as low as practical, but not so low that background activities fall behind. If the value is too high, data is removed from the buffer pool and change buffer too quickly for caching to provide a significant benefit. For busy systems capable of higher I/O rates, you can set a higher value to help the server handle the background maintenance work associated with a high rate of row changes”

and

“Consider write workload when tuning innodb_io_capacity. Systems with large write workloads are likely to benefit from a higher setting. A lower setting may be sufficient for systems with a small write workload.”

I do not see that the manual provides much guidance about what value I should use, so we will test it.

Initial results

So if we benchmark with initial parameters, we can see the “wave” pattern.

 

As for why this is happening, let’s check Percona Monitoring and Management “InnoDB Checkpoint Age” chart:

Actually, InnoDB Flushing by Type in PMM does not show sync flushing yet, so I had to modify the chart a little to show “sync flushing” as the orange line:

And we immediately see that Uncheckpointed Bytes exceeds the Max Checkpoint Age of 16.61GiB, which is defined by the 20GiB of InnoDB log files. 16.61GiB is less than 20GB because InnoDB reserves some cushion for cases exactly like this, so even if we exceed 16.61GiB, InnoDB still has an opportunity to flush data.

Also, we see that before Uncheckpointed Bytes exceeds Max Checkpoint Age, InnoDB flushes pages at a rate of 4000 IOPS, just as defined by innodb_io_capacity_max.

We should try to avoid the case when Uncheckpointed Bytes exceed Max Checkpoint Age, because when it happens, InnoDB gets into “emergency” flushing mode, and in fact, this is what causes the waves we see. I should have detected this in my previous post, mea culpa.
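
If you do not have PMM at hand, you can track the checkpoint age yourself from SHOW ENGINE INNODB STATUS; below is a rough Python sketch (host and credentials are placeholders):

import re
import mysql.connector  # pip install mysql-connector-python

cnx = mysql.connector.connect(host="127.0.0.1", user="root", password="secret")
cur = cnx.cursor()
cur.execute("SHOW ENGINE INNODB STATUS")
status_text = cur.fetchone()[2]   # (Type, Name, Status) -> Status holds the text

lsn = int(re.search(r"Log sequence number\s+(\d+)", status_text).group(1))
checkpoint = int(re.search(r"Last checkpoint at\s+(\d+)", status_text).group(1))

# Checkpoint age = how far the redo log is ahead of the last checkpoint.
print("Checkpoint age: %.2f GiB" % ((lsn - checkpoint) / 1024**3))

Keeping this value safely below the Max Checkpoint Age (about 16.61GiB here) is exactly what avoids the emergency flushing described above.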

So the first conclusion we can draw: if InnoDB does not flush fast enough, what if we increase innodb_io_capacity_max? Sure, let's see. For simplification, for the next experiments I will use

innodb_io_capacity = innodb_io_capacity_max, unless specified otherwise.

Next run, with innodb_io_capacity = innodb_io_capacity_max = 7000:

Not much improvement, and this is also confirmed by the InnoDB Checkpoint Age chart:

InnoDB tries to flush more pages per second up to 5600 pages/sec, but it is not enough to avoid exceeding Max Checkpoint Age.

Why is this the case? The answer is the doublewrite buffer.

Even though MySQL improved the doublewrite buffer in MySQL 8.0.20, it does not perform well enough with proposed defaults. 

Well, at least the problem is now acknowledged; previously, Oracle ran benchmarks with doublewrite disabled, just to hide and totally ignore the doublewrite issue. For an example, check this.

But let’s get back to our 8.0.21 and fixed doublewrite.

Dimitri mentions:

“the main config options for DBLWR in MySQL 8.0 are:

  • innodb_doublewrite_files = N
  • innodb_doublewrite_pages = M”

Let’s check the manual again:

“The innodb_doublewrite_files variable is intended for advanced performance tuning. The default setting should be suitable for most users.

innodb_doublewrite_pages

The innodb_doublewrite_pages variable (introduced in MySQL 8.0.20) controls the maximum number of doublewrite pages per thread. If no value is specified, innodb_doublewrite_pages is set to the innodb_write_io_threads value. This variable is intended for advanced performance tuning. The default value should be suitable for most users.”

Was it wrong to assume that the defaults for innodb_doublewrite_files and innodb_doublewrite_pages would be suitable for our use case?

But let's try the values Dimitri recommended looking into; I will use

innodb_doublewrite_files=2 and innodb_doublewrite_pages=128

Results with innodb_doublewrite_files=2 and innodb_doublewrite_pages=128

The problem with waves is fixed! 

And InnoDB Checkpoint Age chart:

Now we are able to keep Uncheckpointed Bytes under Max Checkpoint Age, and this is what fixed the “waves” pattern.

We can say that the parallel doublewrite is a welcome improvement, but the fact that one has to change innodb_doublewrite_pages in order to get improved performance is a design flaw in my opinion.

But there is still a lot of variation at 1-sec resolution, and small drops. Before we get to them, let's take a look at another suggestion: use --innodb_adaptive_hash_index=0 (that is, disable the Adaptive Hash Index). I will use AHI=0 on the charts to mark this setting.

Let's take a look at the results with the improved settings and with --innodb_adaptive_hash_index=0.

Results with --innodb_adaptive_hash_index=0

To see what the real improvement with --innodb_adaptive_hash_index=0 is, let's compare the bar charts:

Or in numeric form:

settings Avg tps, last 2000 sec
io_cap_max=7000,doublewrite=opt 7578.69
io_cap_max=7000,doublewrite=opt,AHI=0 7996.33

So --innodb_adaptive_hash_index=0 really brings some improvement, about 5.5%, so I will use --innodb_adaptive_hash_index=0 for further experiments.

Let's see if increasing innodb_buffer_pool_instances to 32 will help to smooth out the periodic variance.

Results with innodb_buffer_pool_instances=32

So indeed, using innodb_buffer_pool_instances=32 gives us less variation while keeping overall throughput about the same: 7936.28 tps in this case.

Now let's review the parameter innodb_change_buffering=none, which Dimitri also suggests.

Results with innodb_change_buffering=none

There is NO practical difference if we disable the change buffer (innodb_change_buffering=none).

And if we take a look at PMM change buffer chart:

We can see there is NO Change Buffer activity outside of the initial 20 mins. I am not sure why Dimitri suggested disabling it. In fact, Change Buffer can be quite useful, and I will show it in my benchmark for the different workloads.

Now let's take a look at the suggested settings with innodb_io_capacity = innodb_io_capacity_max = 8000. That INCREASES innodb_io_capacity_max, and we compare the results with innodb_io_capacity_max = 7000.

Or in tabular form:

settings Avg tps, last 2000 sec
io_cap_max=7000,doublewrite=opt,AHI=0,BPI=32 7936.28
io_cap_max=8000,doublewrite=opt,AHI=0,BPI=32 7693.08

Actually, with innodb_io_capacity_max=8000 the throughput is LESS than with innodb_io_capacity_max=7000.

Can you guess why? 

Let’s compare InnoDB Checkpoint Age.

This is for innodb_io_capacity_max=8000:

And this is for innodb_io_capacity_max=7000:

This is like a child’s game: Find the difference.

The difference is that with innodb_io_capacity_max=7000, Uncheckpointed Bytes is 13.66 GiB, while with innodb_io_capacity_max=8000 it is 12.51 GiB.

What does this mean? It means that with innodb_io_capacity_max=7000, InnoDB HAS to flush FEWER pages and still stays within Max Checkpoint Age.

In fact, if we try to push even further and use innodb_io_capacity_max=innodb_io_capacity=6500, we get the following InnoDB Checkpoint Age chart:

Where Uncheckpointed Bytes are 15.47 GiB. Does it improve throughput? Absolutely!

settings Avg tps, last 2000 sec
io_cap_max=6500,doublewrite=opt,AHI=0,BPI=32 8233.628
io_cap_max=7000,doublewrite=opt,AHI=0,BPI=32 7936.283
io_cap_max=8000,doublewrite=opt,AHI=0,BPI=32 7693.084

The difference between innodb_io_capacity_max=6500 and innodb_io_capacity_max=8000 is 7%.

Now it becomes clear what the Manual means in the part where it says:

“Ideally, keep the setting as low as practical, but not so low that background activities fall behind”

So we really need to increase innodb_io_capacity_max to a level where Uncheckpointed Bytes stays under Max Checkpoint Age, but not by much more, otherwise InnoDB will do more work than needed and it will affect the throughput.

In my opinion, it is a serious design flaw in InnoDB Adaptive Flushing that you actually need to wiggle innodb_io_capacity_max to achieve appropriate results.

Inverse relationship between innodb_io_capacity_max and innodb_log_file_size

To show an even more complicated relation between innodb_io_capacity_max and innodb_log_file_size, let's consider the following experiment.

We will increase innodb_log_file_size from 10GB to 20GB, effectively doubling our redo-log capacity.

And now let’s check InnoDB Checkpoint Age with innodb_io_capacity_max=7000:

We can see there is a lot of space in InnoDB logs which InnoDB does not use. There is only 22.58GiB of Uncheckpointed Bytes, while 33.24 GiB are available.

So what happens if we decrease innodb_io_capacity_max to 4500?

InnoDB Checkpoint Age with innodb_io_capacity_max=4500:

In this setup, we can push Uncheckpointed Bytes to 29.80 GiB, and this has a positive effect on the throughput.

Let’s compare throughput :

settings Avg tps, last 2000 sec
io_cap_max=4500,log_size=40GB,doublewrite=opt,AHI=0,BPI=32 9865.308
io_cap_max=7000,log_size=40GB,doublewrite=opt,AHI=0,BPI=32 9374.121

So by decreasing innodb_io_capacity_max from 7000 to 4500 we can gain 5.2% in the throughput.

Please note that we can't continue to decrease innodb_io_capacity_max, because in that case Uncheckpointed Bytes risks exceeding Max Checkpoint Age, which leads to the negative effects of emergency flushing.

So again, in order to improve throughput, we should be DECREASING innodb_io_capacity_max, but only to a certain threshold. We should not set innodb_io_capacity_max to 30000, which is what the SATA SSD can actually provide.

Again, for me, this is a major design flaw in the current InnoDB Adaptive Flushing. Please note this was a static workload. If your workload changes during the day, it is practically impossible to come up with an optimal value.

Conclusions:

Trying to summarize all of the above, I want to highlight:

  • To fix the “wave” pattern, we need to tune innodb_io_capacity_max and innodb_doublewrite_pages.
  • InnoDB parallel doublewrite in MySQL 8.0.20 is definitely a positive improvement, but the default values seem poorly chosen, in contradiction with the Manual. I wish Oracle/MySQL shipped features that work out of the box for most users.
  • The InnoDB Adaptive Hash Index is not helping here, and you get better performance by disabling it. I also observed in other workloads that the InnoDB Adaptive Hash Index might be another broken subsystem, which Oracle declines to fix and simply disables in its benchmarks.
  • The InnoDB Change Buffer has no effect on this workload, so you may or may not disable it; there is no difference. But I have seen a positive effect from the InnoDB Change Buffer in other workloads, so I do not recommend blindly disabling it.
  • Now about InnoDB Adaptive Flushing. In my opinion, InnoDB Adaptive Flushing relies too much on manual tuning of innodb_io_capacity_max, which in fact has nothing to do with the real storage IO capacity. In fact, you often need to lower innodb_io_capacity_max to get better performance, but not make it too low, because at some point it will hurt performance. The best way to monitor this is to check the InnoDB Checkpoint Age chart in PMM.
  • I would encourage Oracle to fix the broken design of InnoDB Adaptive Flushing, so that it detects the IO capacity automatically and does not flush aggressively, but instead keeps Uncheckpointed Bytes just under Max Checkpoint Age. Let's hope Oracle fixes this faster than it fixed the doublewrite buffer, because history shows that to force Oracle to make improvements in the InnoDB IO subsystem, we need to do it first in Percona Server for MySQL, as we did with the parallel doublewrite buffer. For reference, parallel doublewrite was first implemented in Percona Server for MySQL 5.7.11-4, released March 15th, 2016. Oracle implemented parallel doublewrite (with suboptimal default settings) in MySQL 8.0.20, released four years after Percona Server, on April 4th, 2020.

Aug
14
2020
--

Evaluating Performance Improvements in MariaDB 10.5.5

Recently, I published a series of posts on MySQL and MariaDB where MariaDB 10.5.4 did not necessarily perform well in some scenarios.

In comments left on How MySQL and MariaDB Perform on NVMe Storage, I was told that MariaDB 10.5.4 is not the best version, as there are already known performance bugs that will be fixed in MariaDB 10.5.5, and it would be better to test MariaDB 10.5.5. And now MariaDB 10.5.5 is out, and it comes with fixes.

I made a promise that I will test the new version, so this is me fulfilling my promise.

Instead of posting multiple posts, let's review all scenarios at once. Just to refresh the material: I tested sysbench-tpcc 1000W, which is about 100GB in size, in the following setups:

  • Data is located on SATA SSD, innodb_buffer_pool_size=140GB (CPU-bound scenario)
  • Data is located on NVMe, innodb_buffer_pool_size=140GB (CPU-bound scenario)
  • Data is located on SATA SSD, innodb_buffer_pool_size=25GB (IO-bound scenario)
  • Data is located on NVMe, innodb_buffer_pool_size=25GB (IO-bound scenario)

Data is located on SATA SSD, innodb_buffer_pool_size=140GB (CPU-bound scenario)

(The original results are here – Checkpointing in MySQL and MariaDB)

For this re-test round, I compare only MariaDB 10.5.4 vs MariaDB 10.5.5:

MariaDB 10.5.4 vs MariaDB 10.5.5

Aside from the difference in warm-up behavior, the result is about the same, and we can confirm this with boxplots for the last 2500 sec:

MariaDB 10.5.4 vs MariaDB 10.5.5 2500 sec

Data is located on SATA SSD, innodb_buffer_pool_size=25GB (IO-bound scenario)

This one is interesting, and actually this is where we can see improvements in MariaDB 10.5.5.

improvements in MariaDB 10.5.5

To see the individual variation in detail, let's separate the charts:

And then again compare the throughput for the last 2500 sec:

compare the throughput for the last 2500 sec

What’s interesting is not only has MariaDB 10.5.5 improved the results, but it also shows a better average throughput than MySQL 8.0.21, though with a higher variation.

Data is located on NVMe, innodb_buffer_pool_size=140GB (CPU-bound scenario)

Here we still see a weird initial drop followed by a recovery, and the throughput is similar to MariaDB 10.5.4, but a little worse.

Data is located on NVMe, innodb_buffer_pool_size=25GB (IO-bound scenario)

throughput is similar to MariaDB 10.5.4

This case is an absolute disaster for MariaDB 10.5.5; there is bug https://jira.mariadb.org/browse/MDEV-23399. Most of the time MariaDB 10.5.5 shows the throughput under 100 tps (while MySQL averages above 5000 tps).

I personally think this is a catastrophic bug and MariaDB 10.5.5 with this performance should never have been released to the public, but instead, stay in the internal testing stage.

Aug
14
2020
--

Part Two: How Many innodb_buffer_pool_instances Do You Need in MySQL 8 With a CPU-Bound Workload?

Part one of this series can be found here: How Many innodb_buffer_pool_instances Do You Need in MySQL 8?

Following up on my recent benchmark posts on MySQL and MariaDB, MySQL and MariaDB on Enterprise SSD Storage and How MySQL and MariaDB Perform on NVMe Storage, I wanted to dig a little deeper and understand how different MySQL parameters affect performance.

One of the obscure MySQL Parameters (in my opinion) is innodb_buffer_pool_instances. In particular, I do not have a feel for how many you need and how to understand if a chosen value is a good or a bad one.

In our articles MySQL 5.7 Performance Tuning After Installation and MySQL 101: Parameters to Tune for MySQL Performance we recommended using the value “8”, but I can't really say if this is a good enough value. This post is a follow-up to my previous article about the results we got with different innodb_buffer_pool_instances values in an IO-bound scenario.

So let's see what results we get with different innodb_buffer_pool_instances values, under the following scenario: I will use the sysbench oltp_read_write benchmark with pareto distribution and set innodb_buffer_pool_size=140GB for a database 100GB in size, so this will be a CPU-bound workload.

Benchmark

The hardware I use is:

System | Supermicro; SYS-F619P2-RTN; v0123456789 (Other)
   Platform | Linux
    Release | Ubuntu 18.04.4 LTS (bionic)
     Kernel | 5.3.0-42-generic
Architecture | CPU = 64-bit, OS = 64-bit
  Threading | NPTL 2.27
    SELinux | No SELinux detected
Virtualized | No virtualization detected
# Processor ##################################################
 Processors | physical = 2, cores = 40, virtual = 80, hyperthreading = yes
     Models | 80xIntel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
     Caches | 80x28160 KB
# Memory #####################################################
      Total | 187.6G

With the storage on  SATA SSD INTEL SSDSC2KB960G8 (Intel Enterprise-grade SSD D3-S4510).

A short settings overview:

  • Data fits into memory (The datasize is ~100GB, memory on the server is 188GB, and we allocate 140GB for MySQL innodb_buffer_pool_size).
  • The workload on storage will be very read-write-intensive (reads are done from the storage), with full ACID-compliant and data safe settings in MySQL.
  • For SATA SSD storage  innodb_io_capacity I will use 2000 and innodb_io_capacity_max = 4000 .
  • I will test the following values for innodb_buffer_pool_instances: 1, 2, 4, 8, 16, 32, 64.
  • Innodb_buffer_pool_instances = 64 is the maximum value allowed by MySQL.

The benchmark command line is:

sysbench oltp_read_write --threads=150 --time=10000 --tables=40 --table_size=10000000 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest --max-requests=0 --report-interval=1 --mysql-db=sbtest --mysql-ssl=off --create_table_options="DEFAULT CHARSET=utf8mb4" --report_csv=yes --rand-type=pareto run

This means that the benchmark will run for three hours with reporting throughput every 1 sec. I will use a three-hour time frame for the reason which will be apparent later from the results.

Results on SATA SSD

Let’s see what results I’ve got individually for each of innodb_buffer_pool_instances:

innodb_buffer_pool_instances=1

innodb_buffer_pool_instances=1

innodb_buffer_pool_instances=2

innodb_buffer_pool_instances=2

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=8

innodb_buffer_pool_instances=8

innodb_buffer_pool_instances=16

innodb_buffer_pool_instances=16

innodb_buffer_pool_instances=32

innodb_buffer_pool_instances=32

innodb_buffer_pool_instances=64

innodb_buffer_pool_instances=64

What seems apparent is that as we increase innodb_buffer_pool_instances, it has a positive effect on variation in the throughput. We can condense the results in a single chart to see it closely:

increase innodb_buffer_pool_instances

And if we want to compare the throughput and deviation, let’s compare the results for the last 2500 sec:

compare the throughput and deviation
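
For reference, this is roughly how the last-2500-second numbers can be aggregated from the per-second sysbench output, assuming the standard text report (with --report-interval=1) was captured to a log file (the file name is a placeholder):

import re
import statistics

tps = []
with open("sysbench_bpi64.log") as f:          # placeholder: captured sysbench output
    for line in f:
        m = re.search(r"tps:\s*([\d.]+)", line)
        if m:
            tps.append(float(m.group(1)))

last = tps[-2500:]                             # last 2500 one-second samples
print("avg tps: %.2f  stddev: %.2f" % (statistics.mean(last), statistics.stdev(last)))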

So actually, innodb_buffer_pool_instances=64 showed the best throughput and the least variability, as even with innodb_buffer_pool_instances=32 there are noticeable stalls.

With innodb_buffer_pool_instances from 1 to 8, there are regular stalls, including drops to 0 tps for a few seconds.

Increasing innodb_buffer_pool_instances over 64 seems to have no effect.

Final Thoughts

For this particular case, innodb_buffer_pool_instances=64 was the best choice, but I still can’t recommend a reliable way to find what the optimal value is. Keep in mind that this particular setting depends on the hardware (I tested on the hardware with 80 CPU cores) and on the workload.

Part one of this series can be found here: How Many innodb_buffer_pool_instances Do You Need in MySQL 8?

Aug
13
2020
--

How Many innodb_buffer_pool_instances Do You Need in MySQL 8?

Following up on my recent benchmark posts on MySQL and MariaDB, MySQL and MariaDB on Enterprise SSD Storage and How MySQL and MariaDB Perform on NVMe Storage, I wanted to dig a little deeper and understand how different MySQL parameters affect performance.

One of the obscure MySQL Parameters (in my opinion) is innodb_buffer_pool_instances. In particular, I do not have a feel for how many you need and how to understand if a chosen value is a good or a bad one.

In our articles MySQL 5.7 Performance Tuning After Installation and MySQL 101: Parameters to Tune for MySQL Performance we recommended using the value “8”, but I can't really say if this is a good enough value. So let's see what results we get with different innodb_buffer_pool_instances values, under the following scenario:

I will use the sysbench oltp_read_write benchmark with pareto distribution. I will set innodb_buffer_pool_size=25GB for a database 100GB in size, so there will be competition for buffer pool space, and this will be an IO-bound scenario.

Benchmark

The hardware I use is:

System | Supermicro; SYS-F619P2-RTN; v0123456789 (Other)
   Platform | Linux
    Release | Ubuntu 18.04.4 LTS (bionic)
     Kernel | 5.3.0-42-generic
Architecture | CPU = 64-bit, OS = 64-bit
  Threading | NPTL 2.27
    SELinux | No SELinux detected
Virtualized | No virtualization detected
# Processor ##################################################
 Processors | physical = 2, cores = 40, virtual = 80, hyperthreading = yes
     Models | 80xIntel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
     Caches | 80x28160 KB
# Memory #####################################################
      Total | 187.6G

With the storage on SATA SSD INTEL SSDSC2KB960G8 (Intel Enterprise-grade SSD D3-S4510).

The short settings overview:

  • Data does not fit into memory (the datasize is ~100GB, memory on the server is 188GB, and we allocate 25GB for MySQL innodb_buffer_pool_size using O_DIRECT, so even though there is a lot of memory on the server, it is not used beyond the specified 25GB).
  • The workload on storage will be very read-write-intensive (reads will be done from the storage), with full ACID-compliant and data safe settings in MySQL.
  • For SATA SSD storage  innodb_io_capacity, I will use 2000 and innodb_io_capacity_max = 4000.
  • I will test the following values for innodb_buffer_pool_instances: 1, 2, 4, 8, 16, 32, 64.
  • Innodb_buffer_pool_instances = 64 is the maximum value allowed by MySQL.

The benchmark command line is:

sysbench oltp_read_write --threads=150 --time=10000 --tables=40 --table_size=10000000 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest --max-requests=0 --report-interval=1 --mysql-db=sbtest --mysql-ssl=off --create_table_options="DEFAULT CHARSET=utf8mb4" --report_csv=yes --rand-type=pareto run

This means that the benchmark will run for three hours with reporting throughput every 1 sec. I will use a three-hour time frame for a reason which will be apparent later from the results.

Results on SATA SSD

Let’s see what results I’ve got individually for each of innodb_buffer_pool_instances:

innodb_buffer_pool_instances=1

innodb_buffer_pool_instances=1

innodb_buffer_pool_instances=2

innodb_buffer_pool_instances=2

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=8

innodb_buffer_pool_instances=8

innodb_buffer_pool_instances=16

innodb_buffer_pool_instances=16

innodb_buffer_pool_instances=32

innodb_buffer_pool_instances=32

innodb_buffer_pool_instances=64

innodb_buffer_pool_instances=64

What seems apparent is that as we increase innodb_buffer_pool_instances, it has a positive effect on variation in the throughput. We can condense the results in a single chart to see it closely:

innodb_buffer_pool_instances chart

And if we want to compare the throughput and deviation, let’s compare the results for the last 2500 sec:

So actually, innodb_buffer_pool_instances=64 showed the best throughput and the least variability. The recommended innodb_buffer_pool_instances=8 seems better than values 1-4 in terms of variability, but it does not produce the best throughput.

Final Thoughts

For this particular case, innodb_buffer_pool_instances=64 was the best choice, but I still can't recommend a reliable way to find what the optimal value is. Small values (1-4) seem to produce a lot of variability or even stalls, so starting from 8 is a good choice.

Part Two of this series can be found here: How Many innodb_buffer_pool_instances Do You Need in MySQL 8 With a CPU-Bound Workload?

Nov
26
2019
--

Comparing S3 Streaming Tools with Percona XtraBackup


Making backups over the network can be done in two ways: either save on disk and then transfer, or just transfer without saving. Both ways have their strong and weak points. The second way, in particular, is highly dependent on the upload speed, which can either reduce or increase the backup time. Other factors that influence it are the chunk size and the number of upload threads.

Percona XtraBackup 2.4.14 has gained S3 streaming, which is the capability to upload backups directly to s3-compatible storage without saving locally first. This feature was developed because we wanted to improve the upload speeds of backups in Percona Operator for XtraDB Cluster.

There are many implementations of S3 Compatible Storage: AWS S3, Google Cloud Storage, Digital Ocean Spaces, Alibaba Cloud OSS, MinIO, and Wasabi.

We've measured the speed of AWS CLI, gsutil, MinIO client, rclone, gof3r, and the xbcloud tool (part of Percona XtraBackup) on AWS (in single and multi-region setups) and on Google Cloud. XtraBackup was compared in two variants: a default configuration and one with a tuned chunk size and number of upload threads.
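
To make the "tuned chunk size and number of upload threads" idea concrete, here is a generic streaming-upload sketch with boto3 (this is just an illustration of the concept, not how xbcloud works internally; the bucket and key are placeholders):

import sys
import boto3
from boto3.s3.transfer import TransferConfig   # pip install boto3

# Stream stdin (e.g. a backup piped in) straight to S3 without a local copy.
config = TransferConfig(
    multipart_chunksize=10 * 1024 * 1024,   # 10MB parts, as in the tuned runs below
    max_concurrency=16,                     # 16 upload threads
)

s3 = boto3.client("s3")
s3.upload_fileobj(sys.stdin.buffer, "my-backup-bucket",
                  "backups/full.xbstream", Config=config)

A backup can then be piped into a script like this (say, stream_to_s3.py): xtrabackup --backup --stream=xbstream | python3 stream_to_s3.py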

Here are the results.

AWS (Same Region)

The backup data was streamed from the AWS EC2 instance to the AWS S3, both in the us-east-1 region.

 

 

tool settings CPU max mem speed speed comparison
AWS CLI default settings 66% 149Mb 130MiB/s baseline
AWS CLI 10Mb block, 16 threads 68% 169Mb 141MiB/s +8%
MinIO client not changeable 10% 679Mb 59MiB/s -55%
rclone rcat not changeable 102% 7138Mb 139MiB/s +7%
gof3r default settings 69% 252Mb 97MiB/s -25%
gof3r 10Mb block, 16 threads 77% 520Mb 108MiB/s -17%
xbcloud default settings 10% 96Mb 25MiB/s -81%
xbcloud 10Mb block, 16 threads 60% 185Mb 134MiB/s +3%

 

Tip: If you run MySQL on an EC2 instance to make backups inside one region, do snapshots instead.

AWS (From US to EU)

The backup data was streamed from AWS EC2 in us-east-1 to AWS S3 in eu-central-1.

 

 

tool settings CPU max mem speed speed comparison
AWS CLI default settings 31% 149Mb 61MiB/s baseline
AWS CLI 10Mb block, 16 threads 33% 169Mb 66MiB/s +8%
MinIO client not changeable 3% 679Mb 20MiB/s -67%
rclone rcat not changeable 55% 9307Mb 77MiB/s +26%
gof3r default settings 69% 252Mb 97MiB/s +59%
gof3r 10Mb block, 16 threads 77% 520Mb 108MiB/s +77%
xbcloud default settings 4% 96Mb 10MiB/s -84%
xbcloud 10Mb block, 16 threads 59% 417Mb 123MiB/s +101%

 

Tip: Think about disaster recovery and what you will do when the whole region is not available. It makes no sense to back up to the same region; always transfer backups to another region.

Google Cloud (From US to EU)

The backup data was streamed from a Compute Engine instance in us-east1 to Cloud Storage in europe-west3. Interestingly, Google Cloud Storage supports both its native protocol and an S3 (interoperability) API, so Percona XtraBackup can transfer data to Google Cloud Storage directly via the S3 (interoperability) API.

 

tool settings CPU max mem speed speed comparison
gsutil not changeable, native protocol 8% 246Mb 23MiB/s baseline
rclone rcat not changeable, native protocol 6% 61Mb 16MiB/s -30%
xbcloud default settings, s3 protocol 3% 97Mb 9MiB/s -61%
xbcloud 10Mb block, 16 threads, s3 protocol 50% 417Mb 133MiB/s +478%

 

Tip: A cloud provider can block your account for many reasons, such as human or robot mistakes, inappropriate content abuse after hacking, credit card expiration, sanctions, etc. Think about disaster recovery and what you will do if a cloud provider blocks your account; it may make sense to back up to another cloud provider or on-premises.

Conclusion

The xbcloud tool (part of Percona XtraBackup) with tuned settings is 2-5 times faster than native cloud vendor tools on long-distance transfers, and it is 14% faster and requires 20% less memory than analogs with the same settings. Also, xbcloud is the most reliable tool for transferring backups to S3-compatible storage for two reasons:

  • It calculates md5 sums during the upload, puts them into a .md5/filename.md5 file, and verifies the sums on download (gof3r does the same).
  • xbcloud sends data in 10MB chunks and resends a chunk if any network failure happens.

PS: Please find instructions on GitHub if you would like to reproduce this article’s results.

Nov
12
2019
--

Watch Out for Disk I/O Performance Issues when Running EXT4


Recently, at Percona Live Europe 2019, Dimitri Kravchuk from Oracle mentioned that he observed some unclear drop in performance for MySQL on an ext4 filesystem with the latest Linux kernels. I decided to check this case out on my side and found out that indeed, starting from linux kernel 4.9, there are some cases with notable (up to 2x) performance drops for ext4 filesystem in direct i/o mode.

So what's wrong with ext4? It started in 2016 with a patch pushed to kernel 4.9: “ext4: Allow parallel DIO reads”. The purpose of that patch was to improve read scalability in direct i/o mode. However, along with improvements in pure read workloads, it also introduced a regression in intense mixed random read/write scenarios. Strangely, this issue went unnoticed for 3 years; only this summer was the performance regression reported and discussed in LKML. As a result of this discussion, there is an attempt to fix it, but from my current understanding, that fix will be pushed only to the upcoming 5.4/5.5 kernels. Below I will describe what this regression looks like, how it affects MySQL workloads, and what workarounds we can apply to mitigate this issue.

ext4 Performance Regression

Let's start by defining the scope of this ext4 performance regression. It will only have an impact if the setup/workload meets the following conditions:
– fast ssd/nvme
– linux kernel >= 4.9
– files reside on an ext4 file system
– files opened with the O_DIRECT flag
– at least some I/O is synchronous

In the original report to LKML, the issue was observed/reproduced with a mixed random read/write scenario with sync I/O and O_DIRECT. But how do these factors relate to MySQL? The only files opened by InnoDB in O_DIRECT mode are tablespaces (*.ibd files), and the I/O pattern for tablespaces consists of the following operations:

– reads ibd data in synchronous mode
– writes ibd data in asynchronous mode
– posix_fallocate to extend the tablespace file, followed by a synchronous write
– fsync

There is also extra I/O from the WAL log files:

– writes data to log files in synchronous mode
– fsync

So in the case of InnoDB tablespaces that are opened with O_DIRECT, we have a mix of sync reads and async writes, and it turned out that such a combination, along with sync writes to the InnoDB log file, is enough to cause a notable performance regression as well. I have sketched a workload for the fio tool (see below) that simulates the I/O access pattern of InnoDB, and have run it on SSD and NVMe drives for linux kernels 4.4.0, 5.3.0, and 5.3.0 with the ext4 scalability fix.

[global]
filename=tablespace1.ibd:tablespace2.ibd:tablespace3.ibd:tablespace4.ibd:tablespace5.ibd
direct=1
bs=16k
iodepth=1

#read data from *.ibd tablespaces
[ibd_sync_read]
rw=randread
ioengine=psync

#write data to *.ibd tablespaces
[ibd_async_write]
rw=randwrite
ioengine=libaio

#write data to ib* log file
[ib_log_sync_write]
rw=write
bs=8k
direct=0
ioengine=psync
fsync=1
filename=log.ib
numjobs=1
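
A note on running it: the original scripts are not shown, so as an assumption, one way to sweep the 1-16 i/o thread range is to add numjobs=N to the [global] section (the log-writer job keeps its own numjobs=1 override), save the job file as, say, innodb_io.fio, and run:

fio innodb_io.fio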

fio results on the chart:

Observations:

– for the SATA/SSD drive there is almost no difference in throughput, and only at 16 threads do we see a drop in reads for ext4/kernel-5.3.0. For ext4/kernel-5.3.0 mounted with dioread_nolock (which enables the scalability fixes), we see that IOPS come back and even look better.
– for the NVMe drive the situation looks quite different: up to 8 i/o threads, IOPS for both reads and writes are more or less similar, but after increasing the i/o pressure we see a notable spike for writes and a similar drop for reads. And again, mounting ext4 with dioread_nolock helps to get the same throughput as for kernels < 4.9.

Similar performance data for the original issue reported to LKML (with more details and analysis) can be found in the patch itself.

How it Affects MySQL

O_DIRECT

Now let’s check the impact of this issue on an IO-bound sysbench/OLTP_RW workload in O_DIRECT mode (a sketch of the test invocation follows the setup list). I ran the test for the following setup:

– filesystem: xfs, ext4/default, ext4/dioread_nolock
– drives: SATA/SSD and NVMe
– kernels: 4.4.0, 5.3.0, 5.3.0+ilock_fix
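
The exact sysbench command line is not included in the post; a representative invocation of the oltp_read_write workload would look roughly like the following, with table count, table size, duration, thread count, and connection details as illustrative placeholders:

# prepare the dataset once
sysbench oltp_read_write --mysql-socket=/var/run/mysqld/mysqld.sock \
  --mysql-user=sbtest --mysql-password=sbtest \
  --tables=8 --table-size=10000000 prepare

# run the IO-bound test at a given concurrency
sysbench oltp_read_write --mysql-socket=/var/run/mysqld/mysqld.sock \
  --mysql-user=sbtest --mysql-password=sbtest \
  --tables=8 --table-size=10000000 \
  --threads=256 --time=300 --report-interval=10 run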

Observations

– in the case of SATA/SSD, the ext4 scalability issue has an impact on the tps rate after 256 threads, and the drop is 10-15%
– in the case of NVMe, regular ext4 with kernel 5.3.0 causes a performance drop of ~30-80%. If we apply the fix by mounting ext4 with dioread_nolock, or use xfs, throughput looks good.

O_DSYNC

As the ext4 regression affects O_DIRECT, let’s replace O_DIRECT with O_DSYNC and look at the results of the same sysbench/OLTP_RW workload on kernel 5.3.0:

Note: In order to make the results between O_DIRECT and O_DSYNC comparable, I limited the memory available to the MySQL instance with a cgroup.

Observations:

In the case of O_DSYNC and regular ext4, the performance is just 10% lower than for O_DIRECT/ext4/dioread_nolock and O_DIRECT/xfs, and ~35% better than for O_DIRECT/ext4. That means O_DSYNC can be used as a workaround for cases where you have fast storage and ext4 as the filesystem but can’t switch to xfs or upgrade the kernel.

Conclusions/workarounds

If your workload/setup is affected, there are the following options that you may consider as workarounds:

– downgrade the Linux kernel to 4.8
– install kernel 5.3.0 with the fix and mount ext4 with the dioread_nolock option
– if O_DIRECT is important, switch to the xfs filesystem
– if changing the filesystem is not an option, replace O_DIRECT with O_DSYNC+cgroup (a sketch of the last two options follows below)
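
As a rough illustration of the last two workarounds (the device path, mount point, and memory limit are placeholders; the cgroup commands assume cgroup v1 with the cgroup-tools/libcgroup utilities installed):

# mount the data volume with dioread_nolock (on a kernel carrying the ext4 fix)
mount -o dioread_nolock /dev/nvme0n1p1 /var/lib/mysql

# O_DSYNC alternative: set innodb_flush_method=O_DSYNC in my.cnf, then cap the
# memory (including page cache) available to the MySQL instance via a cgroup
cgcreate -g memory:mysql
echo 32G > /sys/fs/cgroup/memory/mysql/memory.limit_in_bytes
cgexec -g memory:mysql mysqld --defaults-file=/etc/my.cnf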

Aug
15
2019
--

A Faster, Lightweight Trigger Function in C for PostgreSQL


We have been writing blog posts about how to write simple extensions in the C language, and a little more complex one by Ibrar, which were well received by the PostgreSQL user community. We then observed that many PostgreSQL users create simple triggers for small auditing requirements and then feel the pain those triggers put on transactions. So we were discussing how simple, lightweight, and fast a trigger function can be when written in C. Generally, trigger functions are written in high-level languages like PL/pgSQL, but these have a higher overhead during execution, which can impact transactions and thereby application performance.

This blog post is an attempt to create a simple trigger function to address one of the common use-cases of triggers, which is to update auditing columns in a table.

In this post, we are going to introduce SPI (Server Programming Interface) functions for novice users. Towards the end of the blog, we share some quick benchmark results to illustrate the benefits.

Example of Audit timestamp

Let’s take a concrete case and assume that we have a table holding transaction details. The auditing requirements say that each tuple should carry a timestamp of when it was inserted and when it was last updated.

CREATE TABLE transdtls(
  transaction_id int,
  cust_id int,
  amount  int,
...
  insert_ts timestamp,
  update_ts timestamp
);

For demonstration purposes, let’s trim away the other columns and create a table with only the 3 essential columns.

CREATE TABLE transdtls(
  transaction_id int,
  insert_ts timestamp,
  update_ts timestamp
);

Developing Trigger Function

The trigger function can also be developed and packaged as an extension, which we discussed in a previous blog post here, so we are not going to repeat those steps here. The difference is that the files are named “trgr” instead of “addme” as in the previous blog, and the Makefile is modified accordingly to refer to the “trgr” files. This need not be the same as the function name “trig_test” in the C source detailed below.

In the end, the following files are available in the development folder:

$ ls
Makefile trgr--0.0.1.sql trgr.c trgr.control
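
The original post does not show the contents of trgr.control; a minimal control file for this extension would typically look something like the following (the comment text and paths are assumptions):

# trgr.control
comment = 'lightweight audit-timestamp trigger function in C'
default_version = '0.0.1'
module_pathname = '$libdir/trgr'
relocatable = true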

The trgr.c is the main source files with the following content:

#include <stdio.h>
#include <time.h>
#include "postgres.h"
#include "utils/rel.h"
#include "executor/spi.h"
#include "commands/trigger.h"
#include "utils/fmgrprotos.h"
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

extern Datum trig_test(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(trig_test);

Datum
trig_test(PG_FUNCTION_ARGS)
{
    TriggerData *trigdata = (TriggerData *) fcinfo->context;
    //TupleDesc   tupdesc;
    HeapTuple   tuple;
    HeapTuple   rettuple;
    int         attnum = 0;
    Datum       datumVal;

    //Get the structure of the tuple in the table.
    //tupdesc = trigdata->tg_relation->rd_att;

    //Make sure that the function is called from a trigger
    if (!CALLED_AS_TRIGGER(fcinfo))
        elog(ERROR, "are you sure you are calling from trigger manager?");

    //If the trigger is part of an UPDATE event
    if (TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event))
    {
        //attnum = SPI_fnumber(tupdesc,"update_ts");
        attnum = 3;
        tuple = trigdata->tg_newtuple;
    }
    else   //If the trigger is part of INSERT event
    {
        //attnum = SPI_fnumber(tupdesc,"insert_ts");
        attnum = 2;
        tuple = trigdata->tg_trigtuple;
    }
    //Get the current timestamp using "now"
    datumVal = DirectFunctionCall3(timestamp_in, CStringGetDatum("now"), ObjectIdGetDatum(InvalidOid), Int32GetDatum(-1));

    //Connect to Server and modify the tuple
    SPI_connect();
    rettuple = SPI_modifytuple(trigdata->tg_relation, tuple, 1, &attnum, &datumVal, NULL);
    if (rettuple == NULL)
    {
        if (SPI_result == SPI_ERROR_ARGUMENT || SPI_result == SPI_ERROR_NOATTRIBUTE)
                elog(ERROR, "SPI_result failed! SPI_ERROR_ARGUMENT or SPI_ERROR_NOATTRIBUTE");
         elog(ERROR, "SPI_modifytuple failed!");
    }
    SPI_finish();                           /* don't forget say Bye to SPI mgr */
    return PointerGetDatum(rettuple);
}

and trgr--0.0.1.sql with the following content:

CREATE OR REPLACE FUNCTION trig_test() RETURNS trigger
     AS 'MODULE_PATHNAME','trig_test'
LANGUAGE C STRICT;

Now it is a matter of building, installing, and creating the extension.

$ make
$ sudo make install
psql> create extension trgr;

In case you don’t want to develop it as an extension, you may compile it to generate a shared object (.so) file. Copy it to the library folder of the PostgreSQL binaries, which on my Ubuntu laptop is /usr/lib/postgresql/11/lib/, and then define the function. You can even specify the full path of the shared object file like this:

CREATE FUNCTION trig_test() RETURNS trigger     
  AS '/usr/lib/postgresql/11/lib/trgr.so'
LANGUAGE C;

Using Trigger Function

Using the trigger function is no different from using a regular PL/pgSQL trigger function: you just need to attach the function to the table for all INSERT and UPDATE events.

CREATE TRIGGER transtrgr
 BEFORE INSERT OR UPDATE ON public.transdtls 
FOR EACH ROW EXECUTE PROCEDURE public.trig_test();
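
A quick sanity check (not part of the original post) is to insert and then update a row and confirm that the trigger populates the audit columns:

INSERT INTO transdtls (transaction_id) VALUES (1);
UPDATE transdtls SET transaction_id = 2 WHERE transaction_id = 1;
-- insert_ts should be set by the INSERT, update_ts by the UPDATE
SELECT transaction_id, insert_ts, update_ts FROM transdtls;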

Benchmarking

For a fair comparison with a trigger function written in PL/pgSQL, a similar function is created as follows:

CREATE OR REPLACE FUNCTION transtrgr_pl()
  RETURNS TRIGGER AS $$
  BEGIN
     if  (TG_OP = 'UPDATE') then
        NEW.update_ts = now();
     else 
        NEW.insert_ts = now();
     end if;
    RETURN NEW;
  END;
  $$ language 'plpgsql';

The number of lines and the readability of the code are in favor of PL/pgSQL, and the development and debugging time required is much less.

Regarding the performance benchmarking, three cases are compared (a minimal reproduction sketch follows the list):

  1. PostgreSQL client/application providing the audit timestamp, so that trigger can be avoided.
  2. Trigger function in C language.
  3. Trigger function in PLpgSQL.
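
The exact benchmark driver is not shown in the post; a minimal way to run a comparable 1-million-row bulk insert from psql and time it would be something along these lines (the generate_series approach is an assumption, not necessarily what was used):

\timing on
INSERT INTO transdtls (transaction_id)
SELECT g FROM generate_series(1, 1000000) AS g;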

Here are the performance numbers in milliseconds for 1 million bulk inserts; obviously, a smaller number is better.


Caveats

  1. In the first case, where there is no trigger on the database side, it takes less time. But the application and the network need to take up the extra load, which is not considered in this test.
  2. The C function is a bit hardcoded, with an attribute number like
    attnum = 3;

    If we want a generic trigger function that looks up a specific column name, we can use the SPI_fnumber function like

    attnum = SPI_fnumber(tupdesc,"update_ts");

    Such a generic trigger function can be used on multiple tables (a sketch of the change is given after this list). Obviously, this involves more processing; those lines are commented out in the source code above. On repeated tests, the average execution time increases to 1826.722 ms. Still, it is considerably faster than the PL/pgSQL trigger function.
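
For reference, the generic variant described in the second caveat amounts to uncommenting the tupdesc/SPI_fnumber lines in trig_test() and guarding against a missing column, roughly like this (the error handling is mine, not from the original source):

    /* look up the attribute number by column name instead of hardcoding it */
    TupleDesc tupdesc = trigdata->tg_relation->rd_att;
    attnum = SPI_fnumber(tupdesc, TRIGGER_FIRED_BY_UPDATE(trigdata->tg_event)
                                      ? "update_ts" : "insert_ts");
    if (attnum <= 0)
        elog(ERROR, "audit timestamp column not found in \"%s\"",
             SPI_getrelname(trigdata->tg_relation));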

