Some time ago we at Percona were approached by ScaleFlux Inc to benchmark their latest hardware appliance, the CSD 2000 Drive, which is a next-generation SSD computational storage drive. It goes without saying that a truly relevant report requires us to be as honest and as forthright as possible. In other words, my mission was to, ahem, see what kind of mayhem I could cause.
Benchmarking is a bit like cooking; it requires key ingredients, strict adherence to a set of instructions, mixing the ingredients together, and a bit of heat to make it all happen. In this case, the ingredients include the Linux OS (Ubuntu 18.04) on both the database and benchmarking hosts, PostgreSQL version 12, SysBench (the modular, cross-platform, multi-threaded benchmark tool), and a comparable, competing appliance, i.e., the Intel DC P4610 series drive. The two appliances are mounted as separate partitions, both using the same type of file system.
Once the environment is ready, the next step involves declaring and implementing the benchmarking rules, which consist of various types of DML and DDL activity. Keep in mind that, apart from the classic OLAP vs. OLTP modes of database processing, executing a benchmark that closely follows real production activity can be problematic: every production system is to some extent unique, especially when pushed to full capacity. Therefore, for our purposes, we used the testing regime SysBench offers by default.
Once the system was ready, loading started out slow and gentle. The idea was to develop a baseline for the various types of activity and Postgres runtime conditions. Then, the benchmarking intensity was gradually increased to the point where we eventually started getting interesting results.
Needless to say, it took quite a bit of time running the various permutations, double-checking our numbers, graphing the data, and then, after all that, interpreting the output. I'm not going to go into great detail about the analysis itself. Instead, I encourage you to look at the whitepaper itself.
So after all this effort, what was the takeaway?
There are two key observations that I’d like to share:
At peak load, the ScaleFlux CSD 2000 Drive demonstrated less performance variance than the Intel DC P4610 (variance being the spread of IO read-write performance between its maximum and minimum values). The significance is server predictability. This becomes important when, for example, finely tuned application processes depend upon consistent performance from the RDBMS. Many a time I've seen applications get upset when the response time between inserting, updating, or deleting data and getting back the resulting queries suddenly changes.
Remarkable space savings were realized when the Postgres fillfactor was reduced. As you know, fillfactor can become a critical runtime parameter with regard to performance when high-frequency UPDATE and DELETE operations hit the same tuples over and over again.
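For readers less familiar with the parameter: fillfactor is set per table and reserves free space in each heap page for subsequent row versions. A minimal, hypothetical example (the table name is purely illustrative):
-- leave 30% of each heap page free so updated row versions can stay on the same page
ALTER TABLE orders SET (fillfactor = 70);
-- rewrite existing pages so they honour the new setting
VACUUM FULL orders;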
Finally, one last item… I didn’t mention it but we also benchmarked MySQL for ScaleFlux. The results were pretty remarkable. It’s worth your while to have a look at that one too.
Network volumes in Kubernetes provide great flexibility, but still, nothing beats local volumes on direct-attached storage in terms of database performance.
I want to explore ways to deploy both Percona Kubernetes Operators (Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB) using local volumes, both on bare-metal and in cloud deployments.
Simple Ways
There are two ways available out of the box to deploy using local storage, which you can use immediately.
In the cluster deployment YAML you can specify the volume specification using either hostPath:
volumeSpec:
  hostPath:
    path: /data
    type: Directory
Or emptyDir (which is somewhat equivalent to ephemeral storage in EC2):
volumeSpec:
  emptyDir: {}
While this will work, it is a very rigid way to force local storage, and it is not really the "Kubernetes way", as we lose the ability to manage data volumes. We want a more uniform approach with Persistent Volumes and Persistent Volume Claims.
Persistent Volumes
Recognizing limitations with hostPath and emptyDir, Kubernetes introduced Local Persistent Volumes.
Unfortunately, while this will work for the simple deployment of single pods, it does not work with dynamically created volumes which we need for Operators. We need the support of Dynamic Volume Provisioning.
There are several projects to combine Dynamic Volume Provisioning with Local Persistent Volumes, but I did not have much success with them, and the only project which worked for me is OpenEBS.
OpenEBS provides much more than just Local Persistent Volumes, but in this blog I want to touch only on OpenEBS Local PV Hostpath.
OpenEBS Local PV Hostpath
This is actually quite simple, and this is what I like about OpenEBS. First, we will define storage classes:
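The original storage class definition is not reproduced here, but a sketch based on the OpenEBS documentation looks roughly like this (the class name and BasePath are assumptions and may differ in your setup):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-hostpath
  annotations:
    openebs.io/cas-type: local
    cas.openebs.io/config: |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /var/openebs/local
provisioner: openebs.io/local
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
The cluster's volumeSpec can then reference this class through a regular persistentVolumeClaim with storageClassName: openebs-hostpath, so the Operator keeps managing volumes the usual way.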
Making backups over the network can be done in two ways: either save them on disk and then transfer them, or stream them without saving locally. Both ways have their strong and weak points. The second way, in particular, is highly dependent on the upload speed, which can either reduce or increase the total backup time. Other factors that influence it are chunk size and the number of upload threads.
Percona XtraBackup 2.4.14 gained S3 streaming: the capability to upload backups directly to S3-compatible storage without saving them locally first. This feature was developed because we wanted to improve the upload speed of backups in Percona Operator for XtraDB Cluster.
We've measured the speed of AWS CLI, gsutil, the MinIO client, rclone, gof3r, and the xbcloud tool (part of Percona XtraBackup) on AWS (in single- and multi-region setups) and on Google Cloud. XtraBackup was compared in two variants: the default configuration, and one with a tuned chunk size and number of upload threads.
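For orientation, streaming a backup with xbcloud looks roughly like this (the bucket, endpoint, and tuning values are placeholders, not the exact settings used in the tests):
xtrabackup --backup --stream=xbstream --target-dir=/tmp \
  | xbcloud put --storage=s3 \
      --s3-endpoint=s3.amazonaws.com \
      --s3-bucket=my-backup-bucket \
      --parallel=16 \
      full-backup-$(date +%F)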
Here are the results.
AWS (Same Region)
The backup data was streamed from an AWS EC2 instance to AWS S3, both in the us-east-1 region.
tool | settings | CPU | max mem | speed | speed comparison
AWS CLI | default settings | 66% | 149Mb | 130MiB/s | baseline
AWS CLI | 10Mb block, 16 threads | 68% | 169Mb | 141MiB/s | +8%
MinIO client | not changeable | 10% | 679Mb | 59MiB/s | -55%
rclone rcat | not changeable | 102% | 7138Mb | 139MiB/s | +7%
gof3r | default settings | 69% | 252Mb | 97MiB/s | -25%
gof3r | 10Mb block, 16 threads | 77% | 520Mb | 108MiB/s | -17%
xbcloud | default settings | 10% | 96Mb | 25MiB/s | -81%
xbcloud | 10Mb block, 16 threads | 60% | 185Mb | 134MiB/s | +3%
Tip: If you run MySQL on an EC2 instance and make backups within the same region, use snapshots instead.
AWS (From US to EU)
The backup data was streamed from AWS EC2 in us-east-1 to AWS S3 in eu-central-1.
tool | settings | CPU | max mem | speed | speed comparison
AWS CLI | default settings | 31% | 149Mb | 61MiB/s | baseline
AWS CLI | 10Mb block, 16 threads | 33% | 169Mb | 66MiB/s | +8%
MinIO client | not changeable | 3% | 679Mb | 20MiB/s | -67%
rclone rcat | not changeable | 55% | 9307Mb | 77MiB/s | +26%
gof3r | default settings | 69% | 252Mb | 97MiB/s | +59%
gof3r | 10Mb block, 16 threads | 77% | 520Mb | 108MiB/s | +77%
xbcloud | default settings | 4% | 96Mb | 10MiB/s | -84%
xbcloud | 10Mb block, 16 threads | 59% | 417Mb | 123MiB/s | +101%
Tip: Think about disaster recovery and what you will do when a whole region is unavailable. It makes little sense to back up to the same region; always transfer backups to another region.
Google Cloud (From US to EU)
The backup data was streamed from a Compute Engine instance in us-east1 to Cloud Storage in europe-west3. Interestingly, Google Cloud Storage supports both its native protocol and an S3 (interoperability) API, so Percona XtraBackup can transfer data to Google Cloud Storage directly via the S3 (interoperability) API.
tool | settings | CPU | max mem | speed | speed comparison
gsutil | not changeable, native protocol | 8% | 246Mb | 23MiB/s | baseline
rclone rcat | not changeable, native protocol | 6% | 61Mb | 16MiB/s | -30%
xbcloud | default settings, s3 protocol | 3% | 97Mb | 9MiB/s | -61%
xbcloud | 10Mb block, 16 threads, s3 protocol | 50% | 417Mb | 133MiB/s | +478%
Tip: A cloud provider can block your account for many reasons: human or robot mistakes, inappropriate content or abuse after a hack, an expired credit card, sanctions, etc. Think about disaster recovery and what you will do if a cloud provider blocks your account; it may make sense to back up to another cloud provider or on-premises.
Conclusion
The xbcloud tool (part of Percona XtraBackup), with tuned settings, is 2-5 times faster than the native cloud vendor tools over long distances, and is 14% faster while requiring 20% less memory than comparable tools with the same settings. xbcloud is also the most reliable tool for transferring backups to S3-compatible storage, for two reasons:
It calculates md5 sums during the upload, stores them in a .md5/filename.md5 file, and verifies the sums on download (gof3r does the same).
xbcloud sends data in 10MB chunks and resends a chunk if any network failure happens.
PS: Please find instructions on GitHub if you would like to reproduce this article’s results.
Last year, I wrote a post focused on the performance of the fsync call on various storage devices. The fsync call is extremely important for a database when durability, the "D" of the ACID acronym, is a hard requirement. The call ensures the data is permanently stored on disk. The durability requirement forces every transaction to return only when the InnoDB log file and the binary log file have been flushed to disk.
In this post, instead of focusing on the performance of various devices, we’ll see what can be done to improve fsync performance using an Intel Optane card.
Intel Optane
A few years ago, Intel introduced a new type of storage device based on 3D XPoint technology and sold under the Optane brand. Those devices outperform regular flash devices and have higher endurance. In the context of this post, I found they are also very good at handling the fsync call, something many flash devices are not great at.
I recently had access to an Intel Optane NVMe card, a DC P4800X card with a storage capacity of 375GB. Let’s see how it can be used to improve performance.
Optane used directly as a storage device
This is by far the simplest option if your dataset fits on the card. Just install the device, create a filesystem, mount it and go. Using the same python script as in the first post, the results are:
Options | Fsync rate | Latency
ext4, O_DIRECT | 21200/s | 0.047 ms
ext4 | 20000/s | 0.050 ms
ext4, data=journal | 9600/s | 0.100 ms
The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23000/s, and is much better than a regular NAND-based NVMe card like the Intel PC-3700, which delivers an fsync rate of about 7300/s. Even with the full ext4 journal enabled, the rate is still excellent, although, as expected, it is cut by about half.
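If you want to reproduce this kind of measurement without the python script mentioned above, sysbench fileio with an fsync after every write gives a comparable number; a sketch (file size, duration, and thread count are arbitrary):
sysbench fileio --file-num=1 --file-total-size=1G prepare
sysbench fileio --file-num=1 --file-total-size=1G --file-test-mode=rndwr \
  --file-block-size=4096 --file-fsync-freq=1 --threads=1 --time=60 run
sysbench fileio --file-num=1 --file-total-size=1G cleanup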
Optane used as the cache block device in a hybrid volume
If you have a large dataset, you can still use the Optane card as a read/write cache and improve fsync performance significantly. I did some tests with two easily available solutions, dm-cache and bcache. In both cases, the Optane card was put in front of an external USB SATA disk and the cache layer was set to writeback.
Options | Fsync rate | Latency
No cache | 13/s | 75 ms
dm-cache | 3100/s | 0.32 ms
bcache | 2500/s | 0.40 ms
Both solutions improve the fsync rate by two orders of magnitude. That’s still much slower than the straight device but a very decent trade-off.
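As a rough illustration, the bcache variant can be assembled like this (device names are placeholders; dm-cache would be configured through LVM or dmsetup instead):
# slow backing disk cached by the Optane card, writeback mode
make-bcache -B /dev/sdb -C /dev/nvme0n1
echo writeback > /sys/block/bcache0/bcache/cache_mode
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /var/lib/mysql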
Optane used as a ZFS SLOG
ZFS can also use a fast device for its write journal, the ZIL. Such a device in ZFS terminology is called a SLOG. With the ZFS logbias set to “latency”, here is the impact of using an Optane device as SLOG in front of the same slow USB SATA disk:
Options | Fsync rate | Latency
ZFS, SLOG | 7400/s | 0.135 ms
ZFS, no SLOG | 28/s | 36 ms
The addition of a SLOG device boosted the fsync rate by a factor of nearly 260. The rates are also more than twice as high as the ones reported using dm-cache and bcache, and about a third of the result obtained using the Optane device directly for storage. Considering all the added benefits of ZFS, like compression and snapshots, that's a really interesting result.
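Setting this up takes only a couple of commands; a sketch with placeholder pool and device names:
# pool on the slow USB SATA disk, with the Optane card as a separate log (SLOG) device
zpool create datapool /dev/sdb
zpool add datapool log /dev/nvme0n1
zfs set logbias=latency datapool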
Conclusion
If you are struggling with the commit latency of a large transactional database, 3D XPoint devices like the Intel Optane may offer you new options.
Please join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).
MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will check different workloads and when MyRocks is most suitable for you. Also, as for any new engine, it’s important to set it up and tune it properly. So, we will review the most important settings to pay attention to.
AWS Elastic Block Storage (EBS) is great, and it has gotten better through the years, adding different storage types and features like Provisioned IOPS. However, it still has one very basic, inconvenient requirement: I have to decide in advance how much space I need to allocate, and pay for all of that allocated space whether I use it or not.
It would be so much better if AWS would allow true consumption model pricing with EBS, where you pay for the storage used, not the storage allocated. This is already the case for S3, RDS, and even EC2 instances (with Unlimited Option on T2/T3 Instances), not to mention Serverless focused services.
For example, I would love to be able to create a 1TB EBS volume but only pay for 10GB of storage if I only use this amount of space.
Modern storage subsystems do a good job differentiating between the space available on the block device and what’s being used by user files and filesystem metadata. The space that’s not allocated any more can be TRIMmed. This is a basic requirement for working well on flash storage, and as modern EC2 instances already provision EBS storage as emulated NVMe devices, I would imagine Amazon could hook into such functionality to track space actually used.
For us at Percona this would make shipping applications on AWS Marketplace much more convenient. Right now for Percona Monitoring and Management (PMM) we have to choose how much space to allocate to the EBS volume by default, picking between making it expensive to run (paying for a large, mostly unused EBS volume) and shipping a very limited default capacity that requires user action to resize the EBS volume later. Consumption-based EBS pricing would solve this dilemma.
This problem seems to be well recognized and understood. For example Pure Storage Cloud Block Storage (currently in Beta) is expected to have such a feature.
I hope with its insane customer focus AWS will add this feature in the future, but currently we have to get by without it.
Starting with PMM 1.13, PMM uses Prometheus 2 for metrics storage, which tends to be the heaviest consumer of CPU and RAM. With Prometheus 2 Performance Improvements, PMM can scale to more than 1000 monitored nodes per instance in its default configuration. In this blog post we will look into PMM scaling and capacity planning: how to estimate the resources required, and what drives resource consumption.
We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.
What drives resource usage in PMM?
Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:
Number of samples (data points) injected into PMM per second
Number of distinct time series they belong to (cardinality)
Number of distinct query patterns your application uses
Number of queries you run against PMM, through the user interface or the API, and their complexity
These specifically can be impacted by:
Software version – modern database software versions expose more metrics
Software configuration – some metrics are only exposed in certain configurations
Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality.
Exporter configuration – disabling collectors can reduce the amount of data collected
Scrape frequency – controlled by METRICS_RESOLUTION
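As an example of that last item, with the Docker-based PMM 1.x server the scrape interval is usually relaxed through an environment variable at container start (the image tag and value here are illustrative):
docker run -d --name pmm-server \
  -e METRICS_RESOLUTION=5s \
  -p 80:80 \
  percona/pmm-server:1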
All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance and a starting point for your research.
On the system supporting 1000 instances we observed the following performance:
As you can see, more than 2,000 scrapes/sec are performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.
Capacity planning to scale PMM
Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important of the two, as Prometheus 2 does not have good options for limiting memory consumption. If you do not have enough memory to handle your workload, it will run out of memory and crash.
We recommend at least 2GB of memory for a production PMM installation. A test installation with 1GB of memory is possible, but it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problems.
With more powerful systems (8GB or more) you can plan for approximately eight monitored systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.
To calculate the CPU resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).
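For example, by these rules of thumb, a PMM server monitoring 200 typical nodes would need roughly 200 / 8 = 25GB of memory and 200 / 50 = 4 CPU cores.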
One problem you’re likely to encounter if you’re running PMM with 100+ instances is the “Home Dashboard”. This becomes way too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:
You can select a single host, for example "pmm-server", in your home dashboard and save it before adding a large number of hosts to the system.
Or you can make some other dashboard of your choice and set it as the home dashboard.
Summary
More than 1,000 monitored systems is possible per single PMM server
Your specific workload and configuration may significantly change the resources required
If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM
Is increasing gp2 volume size or increasing IOPS for io1 volumes a valid method for scaling IO-bound workloads? In this post I'll focus on one question: how much can we improve performance if we use faster cloud volumes? This post is a continuation of previous cloud research posts:
To recap, in Amazon EC2 we can use gp2 and io1 volumes. gp2 performance scales with size, i.e., for a gp2 volume of 500GB we get 1500 IOPS; for 1000GB, 3000 IOPS; and for 3334GB, 10000 IOPS (the maximum possible value). For io1 volumes we can "buy" throughput up to 30000 IOPS.
So I wanted to check how both InnoDB and RocksDB storage engines perform on these volumes with different throughput.
Benchmark Scenario
I will use the same data size that I used in Saving With MyRocks in The Cloud, that is, sysbench-tpcc with 50 tables and 100 warehouses each, about 500GB of data in InnoDB and 100GB in RocksDB (compressed with LZ4).
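For reference, preparing and running such a dataset with the sysbench-tpcc scripts looks roughly like this (connection settings, thread count, and duration are placeholders):
./tpcc.lua --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=sbt --tables=50 --scale=100 --threads=56 prepare
./tpcc.lua --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest \
  --mysql-db=sbt --tables=50 --scale=100 --threads=56 --time=3600 \
  --report-interval=1 run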
Volume settings: gp2 volumes from 500GB (1000GB for InnoDB) to 3400GB in 100GB increments (so each increment increases throughput by 300 IOPS); io1 volumes: 1TB in size, IOPS from 1000 to 30000 in increments of 1000.
Let's take a look at the results. I will use a slightly different format than usual, but hopefully it represents the results better. You will see density throughout the plots: a higher and narrower chart represents less variance in the throughput. The plot represents the distribution of the throughput.
Results on GP2 volumes:
It's quite interesting to see how the result scales with better IO throughput. InnoDB does not improve its throughput beyond a gp2 size of 2600GB, while MyRocks continues to scale linearly. The problem with MyRocks is that there is a lot of variance in throughput (I will show a one-second resolution chart below).
Results on IO1 volumes
Here MyRocks again shows impressive growth as we add more IO capacity, but it also shows a lot of variance on high-capacity volumes.
Let's compare how the engines perform with one-second resolution. gp2 volume, 3400GB:
io1 volume, 30000 IOPS:
So for MyRocks there seems to be periodic background activity, which does not allow it to achieve stable throughput.
If you are looking to improve throughput in IO-bound workloads, either increasing gp2 volume size or increasing IOPS for io1 volumes is a valid method, especially for the MyRocks engine.
If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.
To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.
I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.
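For context, the read-only workload itself is the stock sysbench point-select test; an invocation along these lines (credentials and runtime are placeholders):
sysbench oltp_point_select --tables=330 --table-size=10000000 \
  --rand-type=uniform --threads=8 --time=600 \
  --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbtest run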
In both cases, the load is IO bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset has been created is extremely favorable to XFS since there is absolutely no file fragmentation. Once all the files are open, a read IOP is just a single read at an offset, and XFS doesn't need to access any intermediate block. The above result is about as fair as saying MyISAM is faster than InnoDB based only on the table scan performance of unfragmented tables with the default configuration. ZFS is much less affected by file-level fragmentation, especially for point-access workloads.
More on ZFS metadata
ZFS stores files in B-trees, in a fashion very similar to how InnoDB stores data. To access a piece of data in a B-tree, you need to access the top-level page (often called the root node) and then one block per level down to the leaf node containing the data. With no cache, reading something from a three-level B-tree thus requires 3 IOPS.
A simple three-level B-tree
The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files' B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.
To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.
You'll often read or hear the ratio of 1GB of ARC per 1TB of data, which is about the same 0.1% ratio as for InnoDB. I have mentioned that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than one using a recordsize of 16KB, because it has 8x fewer leaf pages. Fewer leaf pages require fewer B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access as it maximizes compression and reduces the IOPS, but it is poor for small random access operations like the ones MySQL/InnoDB does.
To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:
You'll want the 'mm%' column, the metadata miss percentage, to reach 0. So when the 'arcsz' column is no longer growing and you still have high values for 'mm%', that means the ARC is too small. Increase the value of 'zfs_arc_max' and continue to monitor.
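An illustrative way to watch these columns and grow the ARC limit (field names may vary between arcstat versions; the 7GB value is just an example):
# print the relevant columns every 10 seconds
arcstat -f time,read,miss,miss%,mmis,mm%,arcsz,c 10
# raise the ARC limit (value in bytes) and keep watching mm%
echo 7516192768 > /sys/module/zfs/parameters/zfs_arc_max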
If the 1GB of ARC for 1TB of data ratio is good for large ZFS recordsize, it is likely too small for a recordsize of 16KB. Does 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Although likely, let’s verify.
The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file. The tool needs the inode of a file and the ZFS filesystem as inputs. Here’s an invocation for one of the tables of my dataset:
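The exact command depends on your pool and dataset names; roughly (the dataset name and object/inode number below are placeholders), the verbose output ends with the per-level list of indirect blocks discussed next:
# find the inode of the table file; on ZFS it is also the object number
stat -c %i /var/lib/mysql/data/sbtest/sbtest1.ibd
# dump the object, including its indirect (non-leaf) block tree, with maximum verbosity
zdb -ddddd mysqldata/data 653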
The last section of the above output is very interesting, as it shows the B-tree pages. The ZFS B-tree of the file sbtest1.ibd has four levels. L3 is the root page, L2 are the first-level (from the top) pages, L1 are the second-level pages, and L0 are the leaf pages. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.
The non-leaf page size is always 16KB (4000L), and these pages are always compressed on disk with lzop (if I read correctly). In the ARC, they are stored uncompressed, so they use 16KB of memory each. The fanout of a ZFS B-tree, the largest possible ratio of the number of pages between levels, is 128. With the above output, we can easily calculate the size of metadata we would need in order to cache all the non-leaf pages in the ARC.
So, each of the 330 tables of the dataset has 1160 non-leaf pages and 145447 leaf pages, a ratio very close to the predicted 0.8%. For the complete dataset of 749GB, we would need the ARC to be, at a minimum, 6GB to fully cache all the metadata pages. Of course, there is some overhead to add. In my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum for the ARC size I should have used is almost 7GB.
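(The arithmetic: 330 tables × 1160 non-leaf pages × 16KB per page is roughly 5.8GB, and adding about 15% of overhead brings it close to 7GB.)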
Of course, an ARC of 7GB on a server with 15GB of RAM is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular InnoDB page size of 16KB, if you use a page size of 32KB with a matching ZFS recordsize, you cut the ARC size requirement by half, to 0.4% of the uncompressed size.
Similarly, an InnoDB page size of 64KB with a similar ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I'll blog more about the use of larger InnoDB pages with ZFS in the near future. If the use of larger InnoDB page sizes is not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.
So, let's propose a new rule of thumb for the required ARC/L2ARC size for a given dataset:
Recordsize of 128KB => 0.1% of the uncompressed dataset size
Recordsize of 64KB => 0.2% of the uncompressed dataset size
Recordsize of 32KB => 0.4% of the uncompressed dataset size
Recordsize of 16KB => 0.8% of the uncompressed dataset size
The ZFS revenge
In order to improve ZFS performance, I had 3 options:
Increase the ARC size to 7GB
Use a larger InnoDB page size, like 64KB
Add an L2ARC
I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.
The choice of an i3.large instance type is not accidental. The instance has a 475GB ephemeral NVMe storage device. Let’s try to use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:
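The original module settings are not reproduced here, but the general shape, with placeholder values, was along these lines:
# keep the ARC at 1GB, but cap metadata slightly below it so some data blocks
# remain in the ARC and can be evicted into the L2ARC
echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
echo 805306368 > /sys/module/zfs/parameters/zfs_arc_meta_limit
# add the ephemeral NVMe device as the L2ARC (cache) device
zpool add mysqldata cache /dev/nvme0n1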
I then ran 'cat /var/lib/mysql/data/sbtest/* > /dev/null' to force filesystem reads and caching of all the tables. A key setting to allow the L2ARC to cache data is zfs_arc_meta_limit. It needs to be slightly smaller than zfs_arc_max in order to allow some data to be cached in the ARC. Remember that the L2ARC is fed by the LRU of the ARC; you need to cache data in the ARC in order to have data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ratio of only 1.28x. A more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:
Now, the comparison is very different. ZFS completely outperforms XFS, 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could use flashcache or bcache with XFS and improve the XFS results but these technologies are way more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in a compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement and cost for the gp2 disk.
ZFS is much more complex than XFS and EXT4, but that also means it has more tunables/options. I used a simplistic setup and an unfair benchmark, which initially led to poor ZFS results. With the same benchmark, very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results, now 66% above XFS.
Conclusion
We have seen in this post why the general perception is that ZFS under-performs compared to XFS or EXT4. The presence of B-trees for the files has a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When properly cached, the performance of ZFS is excellent. ZFS allows you to optimize the use of EBS volumes, both in terms of IOPS and size, when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.
The ability to restore MySQL logical backups is a significant part of disaster recovery procedures. It’s a last line of defense.
Even if you lost all data from a production server, physical backups (data-file snapshots created with an offline copy or with Percona XtraBackup) could carry the same internal database structure corruption as the production data. Backups in a simple plain-text format allow you to avoid such corruption, migrate between database formats (e.g., during a software upgrade or downgrade), or even help with migration to a completely different database solution.
Unfortunately, the restore speed for logical backups is usually bad, and for a big database it could take days or even weeks to get the data back. Thus it's important to tune backups and MySQL for the fastest data restore, and to change the settings back before production operation.
Disclaimer
All results are specific to my combination of hardware and dataset, but could be used as an illustration for MySQL database tuning procedures related to logical backup restore.
Benchmark
There is no general advice for tuning a MySQL database for a bulk logical backup load, and any parameter should be verified with a test on your hardware and database. In this article, we will explore some variables that help that process. To illustrate the tuning procedure, I’ve downloaded IMDB CSV files and created a MySQL database with pyimdb.
You may repeat the whole benchmark procedure, or just look at settings changed and resulting times.
The simplest restore procedure for logical backups created by the mysqldump tool:
mysql -e 'create database imdb;'
time mysql imdb < imdb.sql
# real 129m51.389s
This requires slightly more than two hours to restore the backup into the MySQL instance started with default settings.
I’m using the Docker image percona:latest – it contains Percona Server 5.7.20-19 running on a laptop with 16GB RAM, Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, two disks: SSD KINGSTON RBU-SNS and HDD HGST HTS721010A9.
Let's start with some "good" settings: a buffer pool bigger than the default, 2x1GB transaction log files, disabled sync (because we are using a slow HDD), and big values for IO capacity. The load should be faster with big batches, so we use 1GB for max_allowed_packet.
Values were chosen to be bigger than the default MySQL parameters because I’m trying to see the difference between the usually suggested values (like 80% of RAM should belong to InnoDB buffer pool).
docker run --publish-all --name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
--innodb_buffer_pool_size=4GB
--innodb_log_file_size=1G
--skip-log-bin
--innodb_flush_log_at_trx_commit=0
--innodb_flush_method=nosync
--innodb_io_capacity=2000
--innodb_io_capacity_max=3000
--max_allowed_packet=1G
time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
# real 59m34.252s
The load is IO bound, and there is no reaction to set global foreign_key_checks=0 and unique_checks=0, because these variables are already disabled in the dump file.
By default, mysqldump produces SQL data, but it could also save data to CSV format:
cd /var/lib/mysql-files
mkdir imdb
chown mysql:mysql imdb/
time mysqldump --max_allowed_packet=128M --tab /var/lib/mysql-files/imdb imdb1
# real 1m45.983s
sudo rm -rf mysql/*
docker rm p57
docker run -v /srv/ihanick/tmp/imdb:/var/lib/mysql-files/imdb -v /home/ihanick/Private/Src/tmp/data-movies/mysql:/var/lib/mysql
--name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
--innodb_buffer_pool_size=4GB
--innodb_log_file_size=1G
--skip-log-bin
--innodb_flush_log_at_trx_commit=0
--innodb_io_capacity=700
--innodb_io_capacity_max=1500
--max_allowed_packet=1G
--innodb_doublewrite=0
time (
mysql -e 'drop database imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1 ;
for i in $PWD/*.txt ; do mysqlimport imdb1 $i ; done
)
# real 21m56.049s
Almost 3x faster (21m56s vs. 59m34s), just because of changing the format from SQL to CSV!
We're still using only one CPU core; let's improve the load with the --use-threads=4 option:
time (
mysql -e 'drop database if exists imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1
mysqlimport --use-threads=4 imdb1 $PWD/*.txt
)
# real 15m38.147s
In the end, the load is still not fully parallel due to a big table: all other tables are loaded, but one thread is still active.
Let’s split CSV files into smaller ones. For example, 100k rows in each file and load with GNU/parallel:
# /var/lib/mysql-files/imdb/test-restore.sh
apt-get update ; apt-get install -y parallel
cd /var/lib/mysql-files/imdb
time (
mkdir -p split1 ; cd split1
for i in ../*.txt ; do echo $i ; split -a 6 -l 100000 -- $i `basename $i .txt`. ; done
for i in `ls *.*|sed 's/^[^.]*\.//'|sort -u` ; do
mkdir ../split-$i
for j in *.$i ; do mv $j ../split-$i/${j/$i/txt} ; done
done
)
# real 2m26.566s
time (
mysql -e 'drop database if exists imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1
parallel 'mysqlimport imdb1 /var/lib/mysql-files/imdb/{}/*.txt' ::: split-*
)
#real 16m50.314s
Split is not free, but you can split your dump files right after backup.
The load is parallel now, but the single big table strikes back with 'setting auto-inc lock' waits in SHOW ENGINE INNODB STATUS\G.
Using the --innodb_autoinc_lock_mode=2 option fixes this issue: 16m2.567s.
We got slightly better results with just mysqlimport --use-threads=4. Let's check whether hyperthreading helps and whether the problem is caused by the "parallel" tool:
Using four parallel jobs for load: 17m3.662s
Using four parallel jobs for load and two threads: 16m4.218s
There is no difference between GNU/Parallel and --use-threads option of mysqlimport.
Why 100k rows? With 500k rows: 15m33.258s
Now we have performance better than for mysqlimport --use-threads=4.
How about 1M rows at once? Just 16m52.357s.
I see periodic "flushing logs" messages, so let's use bigger transaction logs (2x4GB): 12m18.160s.
Let's compare the numbers with myloader 0.6.1, also running with four threads (myloader has only the -d parameter; the myloader execution time is listed under the corresponding mydumper command):
# oversized statement size to get 0.5M rows in one statement, single statement per chunk file
mydumper -B imdb1 --no-locks --rows 500000 --statement-size 536870912 -o 500kRows512MBstatement
17m59.866s
mydumper -B imdb1 --no-locks -o default_options
17m15.175s
mydumper -B imdb1 --no-locks --chunk-filesize 128 -o chunk128MB
16m36.878s
mydumper -B imdb1 --no-locks --chunk-filesize 64 -o chunk64MB
18m15.266s
Returning to the parallel CSV file load, with even bigger transaction logs (2x8GB): 11m15.132s.
What about a bigger buffer pool: --innodb_buffer_pool_size=12G? 9m41.519s
Let’s check six-year-old server-grade hardware: Intel(R) Xeon(R) CPU E5-2430 with SAS raid (used only for single SQL file restore test) and NVMe (Intel Corporation PCIe Data Center SSD, used for all other tests).
I’m using similar options as for previous tests, with 100k rows split for CSV files load:
A single SQL file created by mysqldump loaded in 117m29.062s, i.e., 2x slower.
24 parallel processes of mysqlimport: 11m51.718s
Again, hyperthreading makes a huge difference! With 12 parallel jobs: 18m3.699s.
Due to the higher concurrency, the adaptive hash index becomes a source of locking contention. After disabling it with --skip-innodb_adaptive_hash_index: 10m52.788s.
In many places, disabling unique checks is referred to as a performance booster: 10m52.489s.
You can spend more time reading advice about unique_checks, but it might help for some databases with many unique indexes (in addition to the primary one).
The buffer pool is smaller than the dataset; can you change the old/new pages split to make inserts faster? No: --innodb_old_blocks_pct=5: 10m59.517s.
O_DIRECT is also recommended: --innodb_flush_method=O_DIRECT: 11m1.742s.
O_DIRECT does not improve performance by itself, but it lets you use a bigger buffer pool: O_DIRECT + a 30% bigger buffer pool (--innodb_buffer_pool_size=11G): 10m46.716s.
Conclusions
There is no common solution to improve logical backup restore procedure.
If you have an IO-bound restore: disable the InnoDB doublewrite buffer. It's safe because even if the database crashes during the restore, you can restart the operation.
Do not use SQL dumps for databases > 5-10GB. CSV files are much faster for mysqldump+mysql. Implement mysqldump --tab + mysqlimport, or use mydumper/myloader with an appropriate chunk-filesize.
The number of rows per LOAD DATA INFILE batch is important. Usually 100K-1M; use binary search (2-3 iterations) to find a good value for your dataset.
InnoDB log file size and buffer pool size are really important options for backup restore performance.
O_DIRECT reduces insert speed, but it’s good if you can increase the buffer pool size.
If you have enough RAM or SSD, the restore procedure is limited by CPU. Use a faster CPU (higher frequency, turboboost).
Hyperthreading also counts.
A powerful server could be slower than your laptop (12×2.4GHz vs. 4×2.8+turboboost).
Even with modern hardware, it’s hard to expect backup restore faster than 50MBps (for the final size of InnoDB database).
You can find a lot of different advice on how to improve backup load speed. Unfortunately, it’s not possible to implement improvements blindly, and you should know the limits of your system with general Unix performance tools like vmstat, iostat and various MySQL commands like SHOW ENGINE INNODB STATUS (all can be collected together with pt-stalk).
Percona Monitoring and Management (PMM) also provides good graphs, but you should be careful with QAN: full slow query log during logical database dump restore can cause significant processing load.
Default MySQL settings could cost you a 10x backup restore slowdown.
This benchmark is aimed at speeding up the restore procedure while the application is not running and the server is not used in production. Make sure that you revert all configuration parameters back to production values after the load. For example, if you disable the InnoDB doublewrite buffer for the restore and forget to enable it again for production, you may end up with scary data corruption due to partial InnoDB page writes.
If the application is running during the restore, in most cases you will get an inconsistent database, because the restore methods discussed above provide no locking or transactional consistency.