Testing the Value of ScaleFlux Computational Storage Drive (CSD) for PostgreSQL

ScaleFlux Computational Storage Drive PostgreSQL

Some time ago we at Percona were approached by ScaleFlux Inc to benchmark their latest hardware appliance, the CSD 2000 Drive, which is a next-generation SSD computational storage drive. It goes without saying that a truly relevant report requires us to be as honest and as forthright as possible. In other words, my mission was to, ahem, see what kind of mayhem I could cause.

Benchmarking is a bit like cooking; it requires key ingredients, strict adherence to following a set of instructions, mixing the ingredients together, and a bit of heat to make it all happen. In this case, the ingredients include the Linux OS running Ubuntu 18.04 on both the database and the bench-marking hosts, PostgreSQL version 12, SysBench the modular, cross-platform, and multi-threaded benchmark tool, and a comparable, competing appliance i.e. the Intel DC P4610 series drive. The two appliances are mounted as partitions respectively both using the same type of file system.


Once the environment is ready, the next step involves declaring and implementing the bench-marking rules which consist of various types of DML and DDL activity. Keeping in mind that apart from the classic OLAP vs OLTP modes of database processing, executing a benchmark that closely follows real production activities can be problematic. Quite often, when pushing a system to its full capacity, one can say that all production systems are to some extent unique. Therefore, for our purposes, we used the testing regime SysBench offers by default.

Once the system was ready, loading started out slow and gentle. The idea was to develop a baseline for the various types of activity and Postgres runtime conditions. Then, the bench-marking intensity was gradually increased to the point where we eventually started getting interesting results.

Have open source expertise you want to share? Submit your talk for Percona Live ONLINE 2021!

Needless to say, it took quite a bit of time running the various permutations, double-checking our numbers, graphing the data, and then after all that, interpreting the output. I’m not going to go into any great detailing the analysis itself. Instead, I encourage you to look at the whitepaper itself.

So after all this effort, what was the takeaway?

There are two key observations that I’d like to share:

  1. At peak loading, the ScaleFlux CSD 2000 Drive demonstrated less performance variance than that of the Intel DC P4610. Variance being the statistical encapsulation of IO read-write spread between maximum and minimum values. The significance is server predictability. This becomes important when, for example, finely tuned application processes depend upon consistent performance with the RDBMS. Many a time I’ve seen applications get upset when response times between inserting, updating, or deleting data and getting the resultant queries would suddenly change.
  2. Remarkable space savings were realized when the Postgres fillfactor was reduced. As you know, the fillfactor can become a critical runtime parameter in regards to performance when high-frequency UPDATE and DELETE operations take place on the same tuple over and over again.

Finally, one last item… I didn’t mention it but we also benchmarked MySQL for ScaleFlux. The results were pretty remarkable. It’s worth your while to have a look at that one too.

ScaleFlux White Papers:


Deploying Percona Kubernetes Operators with OpenEBS Local Storage

Deploying Percona Kubernetes Operators with OpenEBS Local Storage

Deploying Percona Kubernetes Operators with OpenEBS Local StorageNetwork volumes in Kubernetes provide great flexibility, but still, nothing beats local volumes from direct-attached storage in the sense of database performance.

I want to explore ways to deploy both Percona Kubernetes Operators (Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB) using local volumes, both on the bare-metal deployments or in cloud deployments.

Simple Ways

There are two ways available out of the box to deploy using local storage, which you can use immediately.

You can specify in cluster deployment yaml volume specification, using either hostPath:

        path: /data
        type: Directory



  (which will be equal somewhat to ephemeral storage in EC2):

      emptyDir: {}

While this will work, it is a very rigid way to force local storage, and it is not really the “Kubernetes way”, as we will lose the capability to manage data volumes.  We want to see a more uniform way with Persistent Volumes and Persistent Volumes Claims.

Persistent Volumes

Recognizing limitations with hostPath and emptyDir, Kubernetes introduced Local Persistent Volumes.

Unfortunately, while this will work for the simple deployment of single pods, it does not work with dynamically created volumes which we need for Operators. We need the support of Dynamic Volume Provisioning.

There are several projects to combine Dynamic Volume Provisioning with Local Persistent Volumes, but I did not have much success with them, and the only project which worked for me is OpenEBS.

OpenEBS provides much more than just Local Persistent Volumes, but in this blog, I want to touch only OpenEBS Local PV Hostpath.

OpenEBS Local PV Hostpath

This is actually quite simple, and this is what I like about OpenEBS. First, we will define storage classes:

For the local nvme storage:

kind: StorageClass
  name: localnvme
  annotations: local |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /data/openebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

And for the local ssd:

kind: StorageClass
  name: localssd
  annotations: local |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /mnt/data/ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

And after that we can deploy Operators using StorageClass, for example  in cluster deployment yaml:

        storageClassName: localnvme
        accessModes: [ "ReadWriteOnce" ]
            storage: 200Gi

The cluster will be deployed using localnvme StorageClass, and using space on /data/openebs.

Now we can observe used volumes with kubectl get pv:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS       REASON   AGE
pvc-325e11ff-9082-432c-b484-c0c7a3d1c949   200Gi      RWO            Delete           Bound    pxc/datadir-cluster2-pxc-2   localssd                    2d18h
pvc-3f940d14-8363-450e-a238-a9ff09bee5bc   200Gi      RWO            Delete           Bound    pxc/datadir-cluster3-pxc-1   localnvme                   43h
pvc-421cbf46-73a6-45f1-811e-1fca279fe22d   200Gi      RWO            Delete           Bound    pxc/datadir-cluster2-pxc-1   localssd                    2d18h
pvc-53ee7345-bc53-4808-97d6-2acf728a57d7   200Gi      RWO            Delete           Bound    pxc/datadir-cluster3-pxc-0   localnvme                   43h
pvc-b98eca4a-03b0-4b5f-a152-9b45a35767c6   200Gi      RWO            Delete           Bound    pxc/datadir-cluster2-pxc-0   localssd                    2d18h
pvc-fbe499a9-ecae-4196-b29e-c4d35d69b381   200Gi      RWO            Delete           Bound    pxc/datadir-cluster3-pxc-2   localnvme                   43h

There we can see data volumes from two deployed clusters. OpenEBS Local PV Hostpath is now my go-to method to deploy clusters with the local storage.


Comparing S3 Streaming Tools with Percona XtraBackup

Comparing S3 Streaming Tools

Making backups over the network can be done in two ways: either save on disk and transfer or just transfer without saving. Both ways have their strong and weak points. The second way, particularly, is highly dependent on the upload speed, which would either reduce or increase the backup time. Other factors that influence it are chunk size and the number of upload threads.

Percona XtraBackup 2.4.14 has gained S3 streaming, which is the capability to upload backups directly to s3-compatible storage without saving locally first. This feature was developed because we wanted to improve the upload speeds of backups in Percona Operator for XtraDB Cluster.

There are many implementations of S3 Compatible Storage: AWS S3, Google Cloud Storage, Digital Ocean Spaces, Alibaba Cloud OSS, MinIO, and Wasabi.

We’ve measured the speed of AWS CLI, gsutil, MinIO client, rclone, gof3r and the xbcloud tool (part of Percona XtraBackup) on AWS (in single and multi-region setups) and on Google Cloud. XtraBackup was compared in two variants: a default configuration and one with tuned chunk size and amount of uploading threads.

Here are the results.

AWS (Same Region)

The backup data was streamed from the AWS EC2 instance to the AWS S3, both in the us-east-1 region.



tool settings CPU max mem speed speed comparison
AWS CLI default settings 66% 149Mb 130MiB/s baseline
AWS CLI 10Mb block, 16 threads 68% 169Mb 141MiB/s +8%
MinIO client not changeable 10% 679Mb 59MiB/s -55%
rclone rcat not changeable 102% 7138Mb 139MiB/s +7%
gof3r default settings 69% 252Mb 97MiB/s -25%
gof3r 10Mb block, 16 threads 77% 520Mb 108MiB/s -17%
xbcloud default settings 10% 96Mb 25MiB/s -81%
xbcloud 10Mb block, 16 threads 60% 185Mb 134MiB/s +3%


Tip: If you run MySQL on an EC2 instance to make backups inside one region, do snapshots instead.

AWS (From US to EU)

The backup data was streamed from AWS EC2 in us-east-1 to AWS S3 in eu-central-1.



tool settings CPU max mem speed speed comparison
AWS CLI default settings 31% 149Mb 61MiB/s baseline
AWS CLI 10Mb block, 16 threads 33% 169Mb 66MiB/s +8%
MinIO client not changeable 3% 679Mb 20MiB/s -67%
rclone rcat not changeable 55% 9307Mb 77MiB/s +26%
gof3r default settings 69% 252Mb 97MiB/s +59%
gof3r 10Mb block, 16 threads 77% 520Mb 108MiB/s +77%
xbcloud default settings 4% 96Mb 10MiB/s -84%
xbcloud 10Mb block, 16 threads 59% 417Mb 123MiB/s +101%


Tip: Think about disaster recovery, and what will you do when the whole region is not available. It makes no sense to back up to the same region; always transfer backups to another region.

Google Cloud (From US to EU)

The backup data were streamed from Compute Engine instance in us-east1 to Cloud Storage europe-west3. Interestingly, Google Cloud Storage supports both native protocol and S3(interoperability) API. So, Percona XtraBackup can transfer data to Google Cloud Storage directly via S3(interoperability) API.


tool settings CPU max mem speed speed comparison
gsutil not changeable, native protocol 8% 246Mb 23MiB/s etalon
rclone rcat not changeable, native protocol 6% 61Mb 16MiB/s -30%
xbcloud default settings, s3 protocol 3% 97Mb 9MiB/s -61%
xbcloud 10Mb block, 16 threads, s3 protocol 50% 417Mb 133MiB/s +478%


Tip: A cloud provider can block your account due to many reasons, such as human or robot mistakes, inappropriate content abuse after hacking, credit card expire, sanctions, etc. Think about disaster recovery and what will you do when a cloud provider blocks your account, it may make sense to back up to another cloud provider or on-premise.


xbcloud tool (part of Percona XtraBackup) is 2-5 times faster with tuned settings on long-distance with native cloud vendor tools, and 14% faster and requires 20% less memory than analogs with the same settings. Also, xbcloud is the most reliable tool for transferring backups to S3-compatible storage because of two reasons:

  • It calculates md5 sums during the uploading and puts them into a .md5/filename.md5 file and verifies sums on the download (gof3r does the same).
  • xbcloud sends data in 10mb chunks and resends them if any network failure happens.

PS: Please find instructions on GitHub if you would like to reproduce this article’s results.


Update on fsync Performance

fsync performance

fsync performanceLast year, I wrote a post focused on the performance of the fsync call on various storage devices. The fsync call is extremely important for a database when durability, the “D” of the ACID acronym is a hard requirement. The call ensures the data is permanently stored on disk. The durability requirement forces every transaction to return only when the InnoDB log file and they binary log file have been flushed to disk.

In this post, instead of focusing on the performance of various devices, we’ll see what can be done to improve fsync performance using an Intel Optane card.

Intel Optane

A few years ago, Intel introduced a new type of storage devices based on the 3D_XPoint technology and sold under the Optane brand. Those devices are outperforming regular flash devices and have higher endurance. In the context of this post, I found they are also very good at handling the fsync call, something many flash devices are not great at doing.

I recently had access to an Intel Optane NVMe card, a DC P4800X card with a storage capacity of 375GB. Let’s see how it can be used to improve performance.

Optane used directly as a storage device

This is by far the simplest option if your dataset fits on the card. Just install the device, create a filesystem, mount it and go. Using the same python script as in the first post, the results are:

Options Fsync rate Latency
ext4, O_DIRECT 21200/s 0.047 ms
ext4 20000/s 0.050 ms
ext4, data=journal 9600/s 0.100 ms


The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23000/s and is much better than a regular NAND based NVMe card like the Intel PC-3700, able to deliver a fsync rate of 7300/s. Even enabling the full ext4 journal, the rate is still excellent although, as expected, cut by about half.

Optane used as the cache block device in a hybrid volume

If you have a large dataset, you can still use the Optane card as a read/write cache and improve fsync performance significantly. I did some tests with two easily available solutions, dm-cache and bcache. In both cases, the Optane card was put in front of an external USB Sata disk and the cache layer set to writeback.

Options Fsync rate Latency
No cache 13/s 75 ms
dm-cache 3100/s 0.32 ms
bcache 2500/s 0.40 ms


Both solutions improve the fsync rate by two orders of magnitude. That’s still much slower than the straight device but a very decent trade-off.

Optane used as an ZFS SLOG

ZFS can also use a fast device for its write journal, the ZIL. Such a device in ZFS terminology is called a SLOG. With the ZFS logbias set to “latency”, here is the impact of using an Optane device as SLOG in front of the same slow USB SATA disk:

Options Fsync rate Latency
ZFS, SLOG 7400/s 0.135 ms
ZFS, no SLOG 28/s 36 ms


The addition of SLOG device boosted fsync rate by a factor of nearly 260. The rates are also twice as important as the ones reported using dm-cache and bcache and about a third of the result using the Optane device for storage.  Considering all the added benefits of ZFS like compression and snapshots, that’s a really interesting result.


If you are struggling with the commit latency of a large transactional database, 3D_XPoint devices like the Intel Optane may offer you new options.


Upcoming Webinar Thurs 1/17: How to Rock with MyRocks

How to Rock with MyRocks

How to Rock with MyRocksPlease join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will check different workloads and when MyRocks is most suitable for you. Also, as for any new engine, it’s important to set it up and tune it properly. So, we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.


AWS Elastic Block Storage (EBS) – Can We Get It Truly Elastic?

very old disk storage

very old disk storageAt AWS Re:Invent 2018 there were many great announcements of AWS New Services and New Features, but one basic feature that I’ve been waiting for years to be released is still nowhere to be  found.

AWS Elastic Block Storage (EBS) is great and it’s got better through the years, adding different storage types and features like Provisioned IOPS. However, it still has the most basic inconvenient requirement – I have to decide in advance how much space I need to allocate, and pay for all of that allocated space whether I use it or not.

It would be so much better if AWS would allow true consumption model pricing with EBS, where you pay for the storage used, not the storage allocated. This is already the case for S3,  RDS, and even EC2 instances (with Unlimited Option on T2/T3 Instances), not to mention Serverless focused services.

For example, I would love to be able to create a 1TB EBS volume but only pay for 10GB of storage if I only use this amount of space.

Modern storage subsystems do a good job differentiating between the space available on the block device and what’s being used by user files and filesystem metadata. The space that’s not allocated any more can be TRIMmed. This is a basic requirement for working well on flash storage, and as modern EC2 instances already provision EBS storage as emulated NVMe devices, I would imagine Amazon could hook into such functionality to track space actually used.

For us at Percona this would make shipping applications on AWS Marketplace much more convenient. Right now for Percona Monitoring and Management (PMM)  we have to choose how much space to allocate to the EBS volume by default, picking between it being expensive to run because of paying for the large unused EBS volume or setting a very limited by default capacity that needs user action to resize the EBS volume. Consumption-based EBS pricing would solve this dilemma.

This problem seems to be well recognized and understood. For example Pure Storage Cloud Block Storage (currently in Beta) is  expected to have such a feature.

I hope with its insane customer focus AWS will add this feature in the future, but currently we have to get by without it.

Image: Arnold Reinhold [CC BY-SA 2.5], via Wikimedia Commons


Scaling Percona Monitoring and Management (PMM)

PMM tested with 1000 nodes

Starting with PMM 1.13,  PMM uses Prometheus 2 for metrics storage, which tends to be heaviest resource consumer of CPU and RAM.  With Prometheus 2 Performance Improvements, PMM can scale to more than 1000 monitored nodes per instance in default configuration. In this blog post we will look into PMM scaling and capacity planning—how to estimate the resources required, and what drives resource consumption.

PMM tested with 1000 nodes

We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM ?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

  1. Number of samples (data points) injected into PMM per second
  2. Number of distinct time series they belong to (cardinality)
  3. Number of distinct query patterns your application uses
  4. Number of queries you have on PMM, through the user interface on the API, and their complexity

These specifically can be impacted by:

  • Software version – modern database software versions expose more metrics)
  • Software configuration – some metrics are only exposed in certain configuration
  • Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality.
  • Exporter configuration – disabling collectors can reduce amount of data collectors
  • Scrape frequency –  controlled by METRICS_RESOLUTION

All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance as a start point for your research.

On the system supporting 1000 instances we observed the following performance:

Performance PMM 1000 nodes load

As you can see, we have more than 2.000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important as Prometheus 2 does not have good options for limiting memory consumption. If you do not have enough memory to handle your workload, then it will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM Installation. A test installation with 1GB of memory is possible. However, it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problem.

With powerful systems (8GB of more) you can have approximately eight systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.

To calculate the CPU usage resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the “Home Dashboard”. This becomes way too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select the host, for example “pmm-server” in your home dashboard and save it, before adding a large amount of hosts to the system.

set home dashboard for PMM

Or you can make some other dashboard of your choice and set it as the home dashboard.


  • More than 1,000 monitored systems is possible per single PMM server
  • Your specific workload and configuration may significantly change the resources required
  • If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM

The post Scaling Percona Monitoring and Management (PMM) appeared first on Percona Database Performance Blog.


Scaling IO-Bound Workloads for MySQL in the Cloud

InnoDB / MyRocks throughput on IO1

Is increasing GP2 volumes size or increasing IOPS for IO1 volumes a valid method for scaling IO-Bound workloads? In this post I’ll focus on one question: how much can we improve performance if we use faster cloud volumes? This post is a continuance of previous cloud research posts:

To recap, in Amazon EC2 we can use gp2 and io1 volumes. gp2 performance can be scaled with size, i.e for gp2 volume size of 500GB we get 1500 iops; size 1000GB – 3000 iops; and for 3334GB – 10000 iops (maximal possible value). For io1 volumes we can “buy” throughput up to 30000 iops.

So I wanted to check how both InnoDB and RocksDB storage engines perform on these volumes with different throughput.

Benchmark Scenario

I will use the same datasize that I used in Saving With MyRocks in The Cloud, that is sysbench-tpcc, 50 tables, 100W each, about 500GB datasize in InnoDB and 100GB in RocksDB (compressed with LZ4).

Volumes settings: gp2 volumes from 500GB (1000GB for InnoDB) to 3400GB with 100GB increments (so each increment increases throughput by 300 iops); io1 volumes: 1TB in size, iops from 1000 to 30000 with 1000 increments.

Let’s take look at the results. I will use a slightly different format than usual, but hopefully it represents the results better. You will see density throughout the plots—a higher and narrower chart represents less variance in the throughput. The plot represents the distribution of the throughput.

Results on GP2 volumes:

InnoDB/MyRocks throughput on gp2

It’s quite interesting to see how the result scales with better IO throughput. InnoDB does not improve its throughput after gp2 size 2600GB, while MyRocks continues to scale linearly. The problem with MyRocks is that there is a lot of variance in throughput (I will show a one second resolution chart).

Results on IO1 volumes

InnoDB / MyRocks throughput on IO1

Here MyRocks again shows an impressive growth as as we add more IO capacity, but also shows a lot of variance on high capacity volumes.

Let’s compare how engines perform with one second resolution. GP2 volume, 3400GB:

InnoDB/MyRocks throughput on gp2 3400GB

IO1 volume, 30000 iops:

InnoDB/MyRocks throughput on IO1 30000 IOPS

So for MyRocks there seems to be periodical background activity, which does not allow it to achieve a stable throughput.

Raw results, if you’d like to review them, can be found here:


If you are looking to improve throughput in IO-bound workloads, either increasing GP2 volumes size or increasing IOPS for IO1 volumes is a valid method, especially for the MyRocks engine.

The post Scaling IO-Bound Workloads for MySQL in the Cloud appeared first on Percona Database Performance Blog.


About ZFS Performance


If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.

To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.

I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.

In both cases, the load is IO bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset has been created is extremely favorable to XFS since there is absolutely no file fragmentation. Once you have all the files opened, a read IOP is just a single fseek call to an offset and ZFS doesn’t need to access any intermediate inode. The above result is about as fair as saying MyISAM is faster than InnoDB based only on table scan performance results of unfragmented tables and default configuration. ZFS is much less affected by the file level fragmentation, especially for point access type.

More on ZFS metadata

ZFS stores the files in B-trees in a very similar fashion as InnoDB stores data. To access a piece of data in a B-tree, you need to access the top level page (often called root node) and then one block per level down to a leaf-node containing the data. With no cache, to read something from a three levels B-tree thus requires 3 IOPS.

Simple three levels B-tree

The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files’ B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 3GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.

To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.

You’ll read/hear often the ratio 1GB of ARC for 1TB of data, which is about the same 0.1% ratio as for InnoDB. I wrote about that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than another one using a recordsize of 16KB because it has 8x fewer leaf pages. Fewer leaf pages require less B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access as it maximizes compression and reduces the IOPS but it is poor for small random access operations like the ones MySQL/InnoDB does.

To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:

# echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
# arcstat -f time,arcsz,mm%,mhit,mread,dread,pread 10
    time  arcsz  mm%  mhit  mread  dread  pread
10:22:49   105M    0     0     0      0      0
10:22:59   113M  100     0    22     73      0
10:23:09   120M  100     0    20     68      0
10:23:19   127M  100     0    20     65      0
10:23:29   135M  100     0    22     74      0

You’ll want the ‘mm%’, the metadata missed percent, to reach 0. So when the ‘arcsz’ column is no longer growing and you still have high values for ‘mm%’, that means the ARC is too small. Increase the value of ‘zfs_arc_max’ and continue to monitor.

If the 1GB of ARC for 1TB of data ratio is good for large ZFS recordsize, it is likely too small for a recordsize of 16KB. Does 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Although likely, let’s verify.

The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file. The tool needs the inode of a file and the ZFS filesystem as inputs. Here’s an invocation for one of the tables of my dataset:

# cd /var/lib/mysql/data/sbtest
# ls -li | grep sbtest1.ibd
36493 -rw-r----- 1 mysql mysql 2441084928 avr 15 15:28 sbtest1.ibd
# zdb -ddddd mysqldata/data 36493 > zdb5d.out
# more zdb5d.out
Dataset mysqldata/data [ZPL], ID 90, cr_txg 168747, 4.45G, 26487 objects, rootbp DVA[0]=<0:1a50452800:200> DVA[1]=<0:5b289c1600:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=3004977L/3004977P fill=26487 cksum=13723d4400:5d1f47fb738:fbfb87e6e278:1f30c12b7fa1d1
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     36493    4    16K    16K  1.75G  2.27G   97.62  ZFS plain file
                                        168   bonus  System attributes
        dnode maxblkid: 148991
        path    /var/lib/mysql/data/sbtest/sbtest1.ibd
        uid     103
        gid     106
        atime   Sun Apr 15 15:04:13 2018
        mtime   Sun Apr 15 15:28:45 2018
        ctime   Sun Apr 15 15:28:45 2018
        crtime  Sun Apr 15 15:04:13 2018
        gen     3004484
        mode    100640
        size    2441084928
        parent  36480
        links   1
        pflags  40800000004
Indirect blocks:
               0 L3    0:1a4ea58800:400 4000L/400P F=145446 B=3004774/3004774
               0  L2   0:1c83454c00:1800 4000L/1800P F=16384 B=3004773/3004773
               0   L1  0:1eaa626400:1600 4000L/1600P F=128 B=3004773/3004773
               0    L0 0:1c6926ec00:c00 4000L/c00P F=1 B=3004773/3004773
            4000    L0 EMBEDDED et=0 4000L/6bP B=3004484
            8000    L0 0:1c69270c00:400 4000L/400P F=1 B=3004773/3004773
            c000    L0 0:1c7fbae400:800 4000L/800P F=1 B=3004736/3004736
           10000    L0 0:1ce3f53600:3200 4000L/3200P F=1 B=3004484/3004484
           14000    L0 0:1ce3f56800:3200 4000L/3200P F=1 B=3004484/3004484
           18000    L0 0:18176fa600:3200 4000L/3200P F=1 B=3004485/3004485
           1c000    L0 0:18176fd800:3200 4000L/3200P F=1 B=3004485/3004485
           [more than 140k lines truncated]

The last section of the above output is very interesting as it shows the B-tree pages. The ZFSB-tree of the file sbtest1.ibd has four levels. L3 is the root page, L2 is the first level (from the top) pages, L1 are the second level pages, and L0 are the leaf pages. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.

The non-leaf page size is always 16KB (4000L) and they are always compressed on disk with lzop (If I read correctly). In the ARC, these pages are stored uncompressed so they use 16KB of memory each. The fanout of a ZFS B-tree, the largest possible ratio of a number of pages between levels, is 128. With the above output, we can easily calculate the required size of metadata we would need to cache all the non-leaf pages in the ARC.

# grep -c L3 zdb5d.out
# grep -c L2 zdb5d.out
# grep -c L1 zdb5d.out
# grep -c L0 zdb5d.out

So, each of the 330 tables of the dataset has 1160 non-leaf pages and 145447 leaf pages; a ratio very close to the prediction of 0.8%. For the complete dataset of 749GB, we would need the ARC to be, at a minimum, 6GB to fully cache all the metadata pages. Of course, there is some overhead to add. In my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum for the ARC size I should have used is almost 7GB.

Of course, an ARC of 7GB on a server with 15GB of Ram is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular Innodb page size of 16KB, if you use a page size of 32KB with a matching ZFS recordsize, you will cut the ARC size requirement by half, to 0.4% of the uncompressed size.

Similarly, an Innodb page size of 64KB with similar ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I’ll blog more about the use of larger InnoDB pages with ZFS in a near future. If the use of larger InnoDB page sizes is not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.

So, let’s proposed a new rule of thumb for the required ARC/L2ARC size for a a given dataset:

  • Recordsize of 128KB => 0.1% of the uncompressed dataset size
  • Recordsize of 64KB => 0.2% of the uncompressed dataset size
  • Recordsize of 32KB => 0.4% of the uncompressed dataset size
  • Recordsize of 16KB => 0.8% of the uncompressed dataset size

The ZFS revenge

In order to improve ZFS performance, I had 3 options:

  1. Increase the ARC size to 7GB
  2. Use a larger Innodb page size like 64KB
  3. Add a L2ARC

I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.

The choice of an i3.large instance type is not accidental. The instance has a 475GB ephemeral NVMe storage device. Let’s try to use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:

echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
echo 838860800 > /sys/module/zfs/parameters/zfs_arc_meta_limit
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost
echo 4 > /sys/module/zfs/parameters/l2arc_headroom
echo 16 > /sys/module/zfs/parameters/l2arc_headroom_boost
echo 0 > /sys/module/zfs/parameters/l2arc_norw
echo 1 > /sys/module/zfs/parameters/l2arc_feed_again
echo 5 > /sys/module/zfs/parameters/l2arc_feed_min_ms
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch

I then ran ‘cat /var/lib/mysql/data/sbtest/* > /dev/null’ to force filesystem reads and caches on all of the tables. A key setting here to allow the L2ARC to cache data is the zfs_arc_meta_limit. It needs to be slightly smaller than the zfs_arc_max in order to allow some data to be cache in the ARC. Remember that the L2ARC is fed by the LRU of the ARC. You need to cache data in the ARC in order to have data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ration of only 1.28x. A more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:

Now, the comparison is very different. ZFS completely outperforms XFS, 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could use flashcache or bcache with XFS and improve the XFS results but these technologies are way more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in a compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement and cost for the gp2 disk.

ZFS is much more complex than XFS and EXT4 but, that also means it has more tunables/options. I used a simplistic setup and an unfair benchmark which initially led to poor ZFS results. With the same benchmark, very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results, now 66% above XFS.


We have seen in this post why the general perception is that ZFS under-performs compared to XFS or EXT4. The presence of B-trees for the files has a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When properly cached, the performance of ZFS is excellent. ZFS allows you to optimize the use of EBS volumes, both in term of IOPS and size when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.

The post About ZFS Performance appeared first on Percona Database Performance Blog.


How to Restore MySQL Logical Backup at Maximum Speed

Restore MySQL Logical Backup

Restore MySQL Logical BackupThe ability to restore MySQL logical backups is a significant part of disaster recovery procedures. It’s a last line of defense.

Even if you lost all data from a production server, physical backups (data files snapshot created with an offline copy or with Percona XtraBackup) could show the same internal database structure corruption as in production data. Backups in a simple plain text format allow you to avoid such corruptions and migrate between database formats (e.g., during a software upgrade and downgrade), or even help with migration from completely different database solution.

Unfortunately, the restore speed for logical backups is usually bad, and for a big database it could require days or even weeks to get data back. Thus it’s important to tune backups and MySQL for the fastest data restore and change settings back before production operations.


All results are specific to my combination of hardware and dataset, but could be used as an illustration for MySQL database tuning procedures related to logical backup restore.


There is no general advice for tuning a MySQL database for a bulk logical backup load, and any parameter should be verified with a test on your hardware and database. In this article, we will explore some variables that help that process. To illustrate the tuning procedure, I’ve downloaded IMDB CSV files and created a MySQL database with pyimdb.

You may repeat the whole benchmark procedure, or just look at settings changed and resulting times.


  • 16GB – InnoDB database size
  • 6.6GB – uncompressed mysqldump sql
  • 5.8GB – uncompressed CSV + create table statements.

The simplest restore procedure for logical backups created by the mysqldump tool:

mysql -e 'create database imdb;'
time mysql imdb < imdb.sql
# real 129m51.389s

This requires slightly more than two hours to restore the backup into the MySQL instance started with default settings.

I’m using the Docker image percona:latest – it contains Percona Server 5.7.20-19 running on a laptop with 16GB RAM, Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, two disks: SSD KINGSTON RBU-SNS and HDD HGST HTS721010A9.

Let’s start with some “good” settings: buffer pool bigger than default, 2x1GB transaction log files, disable sync (because we are using slow HDD), and set big values for IO capacity,
the load should be faster with big batches thus use 1GB for max_allowed_packet.

Values were chosen to be bigger than the default MySQL parameters because I’m trying to see the difference between the usually suggested values (like 80% of RAM should belong to InnoDB buffer pool).

docker run --publish-all --name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
  time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
  # real 59m34.252s

The load is IO bounded, and there is no reaction on set global foreign_key_checks=0 and unique_checks=0 because these variables are already disabled in the dump file.

How can we reduce IO?

Disable InnoDB double write: --innodb_doublewrite=0

time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
# real 44m49.963s

A huge improvement, but we still have an IO-bounded load.

We will not be able to improve load time significantly for IO bounded load. Let’s move to SSD:

time (mysql --max_allowed_packet=1G imdb1 < imdb.sql )
# real 33m36.975s

Is it vital to disable disk sync for the InnoDB transaction log?

sudo rm -rf mysql/*
docker rm p57
docker run -v /home/ihanick/Private/Src/tmp/data-movies/imdb.sql:/root/imdb.sql -v /home/ihanick/Private/Src/tmp/data-movies/mysql:/var/lib/mysql
--name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
# real 33m49.724s

There is no significant difference.

By default, mysqldump produces SQL data, but it could also save data to CSV format:

cd /var/lib/mysql-files
mkdir imdb
chown mysql:mysql imdb/
time mysqldump --max_allowed_packet=128M --tab /var/lib/mysql-files/imdb imdb1
# real 1m45.983s
sudo rm -rf mysql/*
docker rm p57
docker run -v /srv/ihanick/tmp/imdb:/var/lib/mysql-files/imdb -v /home/ihanick/Private/Src/tmp/data-movies/mysql:/var/lib/mysql
--name p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:5.7
time (
mysql -e 'drop database imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1 ;
for i in $PWD/*.txt ; do mysqlimport imdb1 $i ; done
# real 21m56.049s
1.5X faster, just because of changing the format from SQL to CSV!

We’re still using only one CPU core, let’s improve the load with the –use-threads=4 option:

time (
mysql -e 'drop database if exists imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1
mysqlimport --use-threads=4 imdb1 $PWD/*.txt
# real 15m38.147s

In the end, the load is still not fully parallel due to a big table: all other tables are loaded, but one thread is still active.

Let’s split CSV files into smaller ones. For example, 100k rows in each file and load with GNU/parallel:

# /var/lib/mysql-files/imdb/
apt-get update ; apt-get install -y parallel
cd /var/lib/mysql-files/imdb
time (
cd split1
for i in ../*.txt ; do echo $i ; split -a 6 -l 100000 -- $i `basename $i .txt`. ; done
for i in `ls *.*|sed 's/^[^.]+.//'|sort -u` ; do
mkdir ../split-$i
for j in *.$i ; do mv $j ../split-$i/${j/$i/txt} ; done
# real 2m26.566s
time (
mysql -e 'drop database if exists imdb1;create database imdb1;set global FOREIGN_KEY_CHECKS=0;'
(echo "SET FOREIGN_KEY_CHECKS=0;";cat *.sql) | mysql imdb1
parallel 'mysqlimport imdb1 /var/lib/mysql-files/imdb/{}/*.txt' ::: split-*
#real 16m50.314s

Split is not free, but you can split your dump files right after backup.

The load is parallel now, but the single big table strikes back with ‘setting auto-inc lock’ in SHOW ENGINE INNODB STATUSG

Using the --innodb_autoinc_lock_mode=2 option fixes this issue: 16m2.567s.

We got slightly better results with just mysqlimport --use-threads=4. Let’s check if hyperthreading helps and if the problem caused by “parallel” tool:

  • Using four parallel jobs for load: 17m3.662s
  • Using four parallel jobs for load and two threads: 16m4.218s

There is no difference between GNU/Parallel and --use-threads option of mysqlimport.

Why 100k rows? With 500k rows: 15m33.258s

Now we have performance better than for mysqlimport --use-threads=4.

How about 1M rows at once? Just 16m52.357s.

I see periodic flushing logs message with bigger transaction logs (2x4GB): 12m18.160s:

--innodb_buffer_pool_size=4GB --innodb_log_file_size=4G --skip-log-bin --innodb_flush_log_at_trx_commit=0 --innodb_io_capacity=700 --innodb_io_capacity_max=1500 --max_allowed_packet=1G --innodb_doublewrite=0 --innodb_autoinc_lock_mode=2 --performance-schema=0

Let’s compare the number with myloader 0.6.1 also running with four threads (myloader have only -d parameter, myloader execution time is under corresponding mydumper command):

# oversized statement size to get 0.5M rows in one statement, single statement per chunk file
mydumper -B imdb1 --no-locks --rows 500000 --statement-size 536870912 -o 500kRows512MBstatement
mydumper -B imdb1 --no-locks -o default_options
mydumper -B imdb1 --no-locks --chunk-filesize 128 -o chunk128MB
mydumper -B imdb1 --no-locks --chunk-filesize 64 -o chunk64MB

It will be great to test mydumper with CSV format, but unfortunately, it wasn’t implemented in the last 1.5 years:

Returning back to parallel CSV files load, even bigger transaction logs 2x8GB: 11m15.132s.

What about a bigger buffer pool: --innodb_buffer_pool_size=12G? 9m41.519s

Let’s check six-year-old server-grade hardware: Intel(R) Xeon(R) CPU E5-2430 with SAS raid (used only for single SQL file restore test) and NVMe (Intel Corporation PCIe Data Center SSD, used for all other tests).

I’m using similar options as for previous tests, with 100k rows split for CSV files load:

--innodb_buffer_pool_size=8GB --innodb_log_file_size=8G --skip-log-bin --innodb_flush_log_at_trx_commit=0 --innodb_io_capacity=700 --innodb_io_capacity_max=1500 --max_allowed_packet=1G --innodb_doublewrite=0 --innodb_autoinc_lock_mode=2

  • Single SQL file created by mysqldump loaded for 117m29.062s = 2x slower.
  • 24 parallel processes of mysqlimport: 11m51.718s
  • Again hyperthreading making a huge difference! 12 parallel jobs: 18m3.699s.
  • Due to higher concurrency, adaptive hash index is a reason for locking contention. After disabling it with --skip-innodb_adaptive_hash_index: 10m52.788s.
  • In many places, disable unique checks referred as a performance booster: 10m52.489s
    You can spend more time reading advice about unique_checks, but it might help for some databases with many unique indexes (in addition to primary one).
  • The buffer pool is smaller than the dataset, can you change old/new pages split to make insert faster? No: --innodb_old_blocks_pct=5 : 10m59.517s.
  • O_DIRECT is also recommended: --innodb_flush_method=O_DIRECT: 11m1.742s.
  • O_DIRECT is not able to improve performance by itself, but if you can use a bigger buffer pool: O_DIRECT + 30% bigger buffer pool: --innodb_buffeer_pool_size=11G: 10m46.716s.


  • There is no common solution to improve logical backup restore procedure.
  • If you have IO-bounded restore: disable InnoDB double write. It’s safe because even if the database crashes during restore, you can restart the operation.
  • Do not use SQL dumps for databases > 5-10GB. CSV files are much faster for mysqldump+mysql. Implement mysqldump --tabs+mysqlimport or use mydumper/myloader with appropriate chunk-filesize.
  • The number of rows per load data infile batch is important. Usually 100K-1M, use binary search (2-3 iterations) to find a good value for your dataset.
  • InnoDB log file size and buffer pool size are really important options for backup restore performance.
  • O_DIRECT reduces insert speed, but it’s good if you can increase the buffer pool size.
  • If you have enough RAM or SSD, the restore procedure is limited by CPU. Use a faster CPU (higher frequency, turboboost).
  • Hyperthreading also counts.
  • A powerful server could be slower than your laptop (12×2.4GHz vs. 4×2.8+turboboost).
  • Even with modern hardware, it’s hard to expect backup restore faster than 50MBps (for the final size of InnoDB database).
  • You can find a lot of different advice on how to improve backup load speed. Unfortunately, it’s not possible to implement improvements blindly, and you should know the limits of your system with general Unix performance tools like vmstat, iostat and various MySQL commands like SHOW ENGINE INNODB STATUS (all can be collected together with pt-stalk).
  • Percona Monitoring and Management (PMM) also provides good graphs, but you should be careful with QAN: full slow query log during logical database dump restore can cause significant processing load.
  • Default MySQL settings could cost you 10x backup restore slowdown
  • This benchmark is aimed at speeding up the restore procedure while the application is not running and the server is not used in production. Make sure that you have reverted all configuration parameters back to production values after load. For example, if you disable the InnoDB double write buffer during restore and left it enabled in production, you may have scary data corruption due to partial InnoDB pages writes.
  • If the application is running during restore, in most cases you will get an inconsistent database due to missing support for locking or correct transactions for restore methods (discussed above).

Powered by WordPress | Theme: Aeros 2.0 by