How to Reclaim Disk Space in Percona Server for MongoDB

Reclaim Disk Space in Percona Server for MongoDB

Reclaim Disk Space in Percona Server for MongoDBCompaction in Percona Server for MongoDB (PSMDB)

There are two ways to reclaim disk space in Percona Server for MongoDB (PSMDB): run compact on nodes or resync the node. In this blog, we will see the best practice for reclaiming the fragmented space on a disk in PSMDB using compact.

Disk storage is a critical resource for any scalable database system. As the database grows and multiple write operations run, this contiguous space gets fragmented into smaller blocks with chunks of free space in between. The typical and fastest solution is to increase the disk size. However, if the application team started to delete data regularly and optimize algorithms to reduce redundant data, there are ways that can help you regain the free space without having to scale your disk size. Simply deleting documents cannot reduce PSMDB disk usage, but if you drop the collection it will delete the data file and immediately release the space, in most cases dropping the collection is not possible in production environments.

By reducing the disk space, it will not only save storage space but also will save money. We have done this for one of our customers whose environment was running on the cloud. The customer was using 4TB of disk space on each node of eight shards, i.e. 96TB in total for all shards. We archived the data, freed up space, and ran Compact to reclaim the disk space. After that, we were able to reduce the disk footprint from 4TB to 2.5TB on each node of all the shards i.e. from 96TB to 60TB. It significantly reduced the customer cloud bill.

What compact does

The WiredTiger storage engine maintains lists of empty records in data files as it deletes documents. This space can be reused by WiredTiger, but will not be returned to the operating system.

However, if you delete a lot of documents and want the WiredTiger storage engine to release this empty space to the operating system, you can defragment your data file. This is possible using the compact command.

Let’s see how data is stored and what happens when we delete the data and follow that run the compact.

  1. When we are inserting the data, new blocks are allocated to the data file, and the size keeps increasing as soon as new data is inserted:

       2. Now some data has been deleted/removed from the collection:

In step two, you can see when documents are removed or deleted and how space is not released.

       3. When you will perform the compact operation:

You can see above how the space is released and defragmented after the compact operation.

How do you check the storage stats?

PSMDB has a command db.stats() that can provide insights into the storage statistics of databases.

$ use testing
$ db.stats()
        "db" : "testing",
        "collections" : 8,
        "views" : 0,
        "objects" : 170955,
        "avgObjSize" : 457.31783217805855,
        "dataSize" : 78180770,
        "storageSize" : 33837056,
        "numExtents" : 0,
        "indexes" : 21,
        "indexSize" : 25571328,
        "scaleFactor" : 1,
        "fsUsedSize" : 1317318361088,
        "fsTotalSize" : 3296899104768,
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1658005440, 4),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
        "operationTime" : Timestamp(1658005440, 4)


The total size in bytes of the uncompressed data held in this database.

The total amount of disk space allocated to all collections in the database.

The total size of all disk space in use on the filesystem where PSMDB has the data directory.

The total size of all disk capacity on the filesystem where PSMDB has the data directory.

You can see a difference between the storageSize and dataSize above because in WiredTiger, storageSize may be smaller than dataSize as compression is enabled by default. 

When a large chunk of data is deleted from a collection and the collection never uses the deleted space for new documents, this space needs to be returned to the operating system so that it can be used by your other databases or collections. 

Use the below command to check how much data you can reclaim in a collection:

$["block-manager"]["file bytes available for reuse"]

Note: The above command output will come in bytes.

Use the below command to check how much data you can reclaim per collection of the database:

$ use testing
$ var count = 0;
$ var bytesInGB = 1024 * 1024 * 1024;
$ var cols = db.getCollectionNames();
$ cols.forEach(function(col) {
   if(col != 'system.profile') {
   var size = db[col].stats().wiredTiger["block-manager"]["file bytes available for reuse"]
   count += size
$ print("Total Space can be claimed in GB is: " + " " + count/bytesInGB)


Once you have the details using the above command and decided which collection you want to compact and reclaim space, please keep in mind the below things before running the compact command:

  1. Always take the full backup of the database.
  2. The user should have the required privilege to run the compact command.
  3. Check if you have enough replication oplog window.
  4. Always run compact on the secondary nodes/hidden nodes or nodes having low priority following with the Primary node in last after stepping it down.
  5. In a replica set, compact command need to be run on each node.
  6. In a shard cluster, a compact command need to be run on each node of every shard. Compact cannot be run against the mongos. 
  7. When you try to compact a collection, a write lock will be added to the database where the collection is located, which may cause all read/write requests to the database to be blocked.
  8. If you are running PSMDB 4.4 or a newer version, compact will not block CRUD operations as it did in earlier versions.
  9. In PSMDB 4.4 or newer, compact will only block the metadata operations like dropping a collection, dropping the index, and creating a new index.

Use the below command to compact the single collection of the database:

$ use dbName 
$ db.runCommand({compact: collectionName })

Use the below command to compact all collections of the database:

$ use dbName
$ db.getCollectionNames().forEach(function (collectionName) {
print('Compacting: ' + collectionName);
db.runCommand({ compact: collectionName });

You can check the progress of compaction in mongo logs or by running db.currentOp() command in another shell instance.

Once the collections are compacted, please check the reclaimed space using the same command which we used to check how much space we can reclaim. Also, you can check the disk space on the OS level as well.


Sometimes when a large collection is compacted, the compact command immediately returns OK, but in reality, the physical space of the collection remains unchanged. This is because WiredTiger deems that the collection does not need to be compacted. In order to overcome this, you need to run the compact command again till it releases the space.

Before PSMDB 4.4, it was always advisable to run compact in a scheduled maintenance window due to the nature of the command which blocks all the read/write operations. Starting from the PSMDB 4.4, you can plan it to run at any time.

Percona Server for MongoDB is an open-source replacement for MongoDB Community Edition that combines all of the features and benefits of MongoDB Community Edition with enterprise-class features developed by Percona: LDAP Authentication and Authorization, Audit Logging, Kerberos Authentication, and hot backups

To learn more about the enterprise-grade features available in the vendor lock-in free Percona Server for MongoDB, we recommend going through our blog MongoDB: Why Pay for Enterprise When Open Source Has You Covered? 


Performance of Various EBS Storage Types in AWS

EBS Storage Types in AWS

EBS Storage Types in AWSEBS storage type choices in AWS can be impacted by a lot of factors. As a consultant, I get a lot of questions about choosing the best storage type for a workload. Let me share a few examples. Is io2 better than gp2/3 if the configured iops are the same? What can I expect when upgrading gp2 to gp3?

In order to be able to answer questions like this, in this blog post, we will take a deeper look. We will compare storage devices that are “supposed to be the same”, in order to reveal the differences between these storage types. We will examine the following storage devices:

    1 TB gp2 volume (has 3000 iops by definition)
    1 TB gp3 volume, with the iops set to 3000
    1 TB io1 volume, with the iops set to 3000
    1 TB io2 volume, with the iops set to 3000

So, all the volumes are 1TB with 3000 iops, so in theory, they are the same. Also, in theory, theory and practice are the same, but in practice, they are different. Storage performance is more complex than just capacity and the number of iops, as we will see soon. Note that this test is very limited to draw conclusions like io1 is better than gp2 or anything like that in general. These devices have very different scalability characteristics (the io devices are scaling to 64k iops, while the maximum for the gp devices is 16k). Measuring the scalability of these devices and testing them in the long run and in different availability zones are out of scope for these tests. The reason I chose devices that have the same “specs” is to gain an understanding of the difference in their behavior. The tests were only run in a single availability zone (eu-west-1a).

For the tests, I used sysbench fileio, with the following prepare command.

sysbench --test=fileio \
--file-total-size=700G \
--threads=16 \
--file-num=64 \
--file-block-size=16384 \

The instances I used were r5.xlarge instances, which have up to 4750 Mbps bandwidth to EBS.

I used the following command to run the tests:

sysbench fileio \
--file-total-size=700G \
--time=1800 \
--max-requests=0 \
--threads=${th} \
--file-num=64 \
--file-io-mode=sync \
--file-test-mode=${test_mode} \
--file-extra-flags=direct \
--file-fsync-freq=0 \
--file-block-size=16384 \
--report-interval=1 \

In this command, the test mode can be rndwr (random writes only), rndrd (random reads only), and rndwr (random reads and writes mixed). The number of threads used were 1, 2, 4, 8, 16, 32, 64, and 128. All tests are using 16k io operations with direct io enabled (bypassing the filesystem cache), based on this, the peak theoretical throughput of the tests is 16k*3000 = 48 MB/s.

Random Writes

sysbench random writes

The gp2 and io1 devices reached the peak throughput for this benchmark with 4 threads and the gp3 reached it with 2 threads (but with a larger variance). The io2 device has more consistent performance overall. The peak throughput in these tests is the expected peak throughput (16k*3000 iops = 46.8MB/sec).

sysbench random mixed read/write latency

At a low thread count, gp3 has the highest variation in latency, gp2’s performance is more consistent. The latencies of io1 and io2 are more consistent, especially io2 at a higher thread count.

This means if the workload is mostly writes:

– Prefer gp3 over gp2 (better performance, less price).
– Prefer io2 if the price is worth the consistency in performance at lower thread counts.
– If the workload is multithreaded, and there are always more than 4 threads, prefer gp3 (in this case, the performance is the same, gp3 is the cheapest option).

Random Reads

sysbench random reads

The random read throughput shows a much bigger difference than writes. First of all, the performance is more inconsistent in the case of gp2 and gp3, but gp2 seems to be slightly more consistent. The io2 device has the same consistent performance even with a single thread.

sysbench random read latency

Similarly, there is a much bigger variance in latency in the case of low thread counts between the gp2 and the gp3. Even at 64 threads, the io2 device has very consistent latency characteristics.

This means if the workload is mostly reads:

– The gp2 volumes can give slightly better performance, but they are also slightly more expensive.
– Above 16 parallel threads, the devices are fairly similar, prefer gp3 because of the price.
– Prefer io2 if performance and latency are important with a low thread count (even over io1).

Random Mixed Reads/Writes

random mixed reads/writes

The mixed workload behavior is similar to the random read one, so the variance in the read performance will also show as a variance in the write performance. The more reads are added to the mix, the inconsistent the performance will become with the gp2/gp3 volumes. The io1 volume reaches peak throughput even with two threads, but with a high variance.

In the case of the mixed workload, the gp3 has the least consistent performance. This can come as an unpleasant surprise when the volumes are upgraded to gp3, and the workload has a low concurrency. This can be an issue for not loaded, but latency-sensitive applications. Otherwise, for choosing storage, the same advice applies to random reads.


The difference between these seemingly similar devices is greatest when a low number of threads are used against the device. If the io workload is parallel enough, the devices behave very similarly.

The raw data for these measurements are available on GitHub:


MySQL/ZFS Performance Update

MySQL/ZFS Performance Update

MySQL/ZFS Performance UpdateAs some of you likely know, I have a favorable view of ZFS and especially of MySQL on ZFS. As I published a few years ago, the argument for ZFS was less about performance than its useful features like data compression and snapshots. At the time, ZFS was significantly slower than xfs and ext4 except when the L2ARC was used.

Since then, however, ZFS on Linux has progressed a lot and I also learned how to better tune it. Also, I found out the sysbench benchmark I used at the time was not a fair choice since the dataset it generates compresses much less than a realistic one. For all these reasons, I believe that it is time to revisit the performance aspect of MySQL on ZFS.

ZFS Evolution

In 2018, I reported ZFS performance results based on version, the default version available in Ubuntu Xenial. The present post is using version 0.8.6-1 of ZFS, the default one available on Debian Buster. Between the two versions, there are in excess of 3600 commits adding a number of new features like support for trim operations and the addition of the efficient zstd compression algorithm.

ZFS 0.8.6-1 is not bleeding edge, there have been more than 1700 commits since and after 0.8.6, the ZFS release number jumped to 2.0. The big addition included in the 2.0 release is native encryption.

Benchmark Tools

The classic sysbench MySQL database benchmarks have a dataset containing mostly random data. Such datasets don’t compress much, less than most real-world datasets I worked with. The compressibility of the dataset is important since ZFS caches, the ARC and L2ARC, store compressed data. A better compression ratio essentially means more data is cached and fewer IO operations will be needed.

A well-known tool to benchmark a transactional workload is TPCC. Furthermore, the dataset created by TPCC compresses rather well making it more realistic in the context of this post. The sysbench TPCC implementation was used.

Test Environment

Since I am already familiar with AWS and Google cloud, I decided to try Azure for this project. I launched these two virtual machines:


  • benchmark host
  • Standard D2ds_v4 instance
  • 2 vCpu, 8GB of Ram and 75 GB of temporary storage
  • Debian Buster


  • Database host
  • Standard E4-2ds-v4 instance
  • 2 vCpu, 32GB of Ram and 150GB of temporary storage
  • 256GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
  • Debian Buster
  • Percona server 8.0.22-13


By default and unless specified, the ZFS filesystems are created with:

zpool create bench /dev/sdc
zfs set compression=lz4 atime=off logbias=throughput bench
zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k \
           -o primarycache=metadata bench/data
zfs create -o mountpoint=/var/lib/mysql/log bench/log

There are two ZFS filesystems. bench/data is optimized for the InnoDB dataset while bench/log is tuned for the InnoDB log files. Both are compressed using lz4 and the logbias parameter is set to throughput which changes the way the ZIL is used. With ext4, the noatime option is used.

ZFS has also a number of kernel parameters, the ones set to non-default values are:


Essentially, the above settings limit the ARC size to 2GB and they throttle down the aggressiveness of ZFS for deletes. Finally, the database configuration is slightly different between ZFS and ext4. There is a common section:

pid-file = /var/run/mysqld/
socket = /var/run/mysqld/mysqld.sock
log-error = /var/log/mysql/error.log
datadir = /var/lib/mysql/data
innodb_buffer_pool_size = 26G
innodb_flush_log_at_trx_commit = 1 # TPCC reqs.
innodb_log_file_size = 1G
innodb_log_group_home_dir = /var/lib/mysql/log
innodb_flush_neighbors = 0
innodb_fast_shutdown = 2

and when ext4 is used:

innodb_flush_method = O_DIRECT

and when ZFS is used:

innodb_flush_method = fsync
innodb_doublewrite = 0 # ZFS is transactional
innodb_use_native_aio = 0
innodb_read_io_threads = 10
innodb_write_io_threads = 10

ZFS doesn’t support O_DIRECT but it is ignored with a message in the error log. I chose to explicitly set the flush method to fsync. The doublewrite buffer is not needed with ZFS and I was under the impression that the Linux native asynchronous IO implementation was not well supported by ZFS so I disabled it and increased the number of IO threads. We’ll revisit the asynchronous IO question in a future post.


I use the following command to create the dataset:

./tpcc.lua --mysql-host= --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=8 --tables=10 --scale=200 --db-driver=mysql prepare

The resulting dataset has a size of approximately 200GB. The dataset is much larger than the buffer pool so the database performance is essentially IO-bound.

Test Procedure

The execution of every benchmark was scripted and followed these steps:

  1. Stop MySQL
  2. Remove all datafiles
  3. Adjust the filesystem
  4. Copy the dataset
  5. Adjust the MySQL configuration
  6. Start MySQL
  7. Record the configuration
  8. Run the benchmark


For the benchmark, I used the following invocation:

./tpcc.lua --mysql-host= --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=16 --time=7200 --report-interval=10 --tables=10 --scale=200 --db-driver=mysql ru

The TPCC benchmark uses 16 threads for a duration of 2 hours. The duration is sufficiently long to allow for a steady state and to exhaust the storage burst capacity. Sysbench returns the total number of TPCC transactions per second every 10s. This number includes not only the New Order transactions but also the other transaction types like payment, order status, etc. Be aware of that if you want to compare these results with other TPCC benchmarks.

In those conditions, the figure below presents the rates of TPCC transactions over time for ext4 and ZFS.

TPCC transactions ZFS

MySQL TPCC results for ext4 and ZFS

During the initial 15 minutes, the buffer pool warms up but at some point, the workload shifts between an IO read bound to an IO write and CPU bound. Then, at around 3000s the SSD Premium burst capacity is exhausted and the workload is only IO-bound. I have been a bit surprised by the results, enough to rerun the benchmarks to make sure. The results for both ext4 and ZFS are qualitatively similar. Any difference is within the margin of error. That essentially means if you configure ZFS properly, it can be as IO efficient as ext4.

What is interesting is the amount of storage used. While the dataset on ext4 consumed 191GB, the lz4 compression of ZFS yielded a dataset of only 69GB. That’s a huge difference, a factor of 2.8, which could save a decent amount of money over time for large datasets.


It appears that it was indeed a good time to revisit the performance of MySQL with ZFS. In a fairly realistic use case, ZFS is on par with ext4 regarding performance while still providing the extra benefits of data compression, snapshots, etc. In a future post, I’ll examine the use of cloud ephemeral storage with ZFS and see how this can further improve performance.


MongoDB Tuning Anti-Patterns: How Tuning Memory Can Make Things Much Worse

MongoDB Memory Tuning

MongoDB Memory TuningIt’s your busiest day of the year and the website has crawled to a halt and finally crashed… and it was all because you did not understand how MongoDB uses memory and left your system open to cluster instability, poor performance, and unpredictable behavior. Understanding how MongoDB uses memory and planning for its use can save you a lot of headaches, tears, and grief. Over the last 5 years, I have too often been called in to fix what are easily avoided problems. Let me share with you how MongoDB uses Memory, and how to avoid potentially disastrous mistakes when using MongoDB.

In most databases, more data cached in RAM is better. Same in MongoDB. However, cache competes with other memory-intensive processes as well as the kernel ones.

To speed up performance many people simply allocate the resources to the most visible issue. In the case of MongoDB however, sometimes allocating more memory actually hurts performance. How is this possible? The short answer is MongoDB relies on both its internal memory caches as well as the operating system’s cache. The OS cache generally is seen as “Unallocated” by sysadmins, dba’s, and devs. This means they steal memory from the OS and allocate it internally to MongoDB. Why is this potentially a bad thing? Let me explain.

How MongoDB Uses the Memory for Caching Data

Anytime you run a query some pages are copied from the files into an internal memory cache of the mongod process for future reuse. A part of your data and indexes can be cached and retrieved really very fast when needed. This is what the WiredTiger Cache (WTC) does. The goal of the WTC is to store the most frequently and recently used pages in order to provide the fastest access to your data. That’s awesome for improving the performance of the database.

By default, a mongod process uses up to 50% of the available RAM for that cache. Eventually, you can change the size of the WTC using the  storage.wiredTiger.engineConfig.cacheSizeGB configuration variable.

Remember that the data is compressed on disk files while the cache stores instead uncompressed pages.

When the WTC gets close to full, more evictions can happen. Evictions happen when the requested pages are not in the cache and mongod has to drop out existing pages in order to make room and read the incoming pages from the file system. The eviction walk algorithm does a few other things (LRU page list sorting and WT page reconciliation) as well as marking the least recently used pages as available for reuse, and altogether this can cause at some point slowness because of a more intensive IO.

Based on how the WTC works, someone could think it’s a good idea to assign even 80%/90% of the memory to it (if you are familiar with MySQL, it’s the same you do when configuring the Buffer Pool for InnoDB). Most of the time this is a mistake and to understand why let’s see now another way mongod uses the memory.

How MongoDB Uses the Memory for File Buffering

Sudden topic change: we’re going to talk about OS instead for a bit. The OS also caches into the memory normal filesystem disk blocks in order to speed up their retrieval if they are requested multiple times. This feature is provided by the system regardless of which application is using it, and it’s really beneficial when an application needs frequent access to the disk. When the IO operation is triggered, the data can be returned by reading the blocks from the memory instead of accessing the disk for real. Then the request will be served faster. This kind of memory managed by the OS is called cached, as you see in /proc/meminfo. We can also call it “File Buffering”.

# cat /proc/meminfo 
MemTotal:        1882064 kB
MemFree:         1376380 kB
MemAvailable:    1535676 kB
Buffers:            2088 kB
Cached:           292324 kB
SwapCached:            0 kB
Active:           152944 kB
Inactive:         252628 kB
Active(anon):     111328 kB
Inactive(anon):    16508 kB
Active(file):      41616 kB
Inactive(file):   236120 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:                40 kB
Writeback:             0 kB
AnonPages:        111180 kB
Mapped:            56396 kB

Keep in mind that MongoDB relies entirely on the Operating System for file buffering.

On a dedicated server, where running a single mongod process, as long as you use the database more disk blocks will be stored into the memory. In the end, almost all the “cached” + “buffer” fields in the memory stat output shown above will be used exclusively for the disk blocks requested by mongod.

An important thing is that the cached memory saves the disk blocks exactly as they are. Since the disk blocks are compressed into the WT files, also the blocks into the memory are compressed. Because of the compression, you can store really a lot of your MongoDB data and indexes.

Let’s suppose you have a 4x compression ratio, in a 10GB memory file buffer (cached memory) you can store up to 40GB of real data. That’s a lot more, for free.

Putting Things Together

The following picture gives you a rough overview of memory usage.

MongoDB Memory Usage

Suppose we have a dedicated 64GB RAM machine and a 120GB dataset. Because of compression, the database uses around 30GB of storage, assuming a 4x compression ratio, which is quite common.

Without changing anything on the configuration, then around 32GB will be used by the WTC for saving 32GB of uncompressed data. The remaining memory will be used in part by the OS and other applications and let’s say it is 4GB. The remaining RAM is 28GB and it will be mainly used for file buffering. In that 28 GB, we can store almost the entire compressed database. The overall performance of MongoDB will be great because most of the time it won’t read from disk. Only 2GB of the compressed file data are not stored on File Buffering. Or 8GB of the uncompressed 120GB as another way to look at it. So, when there’s an access on a page not amongst the 32GB in the  WTC at that moment the IO will read a disk block most probably from the File Buffer instead of doing real disk access. At least 10x better latency, maybe 100x. That’s awesome.

Multiple mongod on the Same Machine is Bad

As I mentioned, people hate to see that (apparently) unallocated memory on their systems.  Not everyone with that misconception increases the WTC, sometimes they view this as an opportunity to add other mongods on the same box, to use that unused memory.

The multiple mongod processes would like all their disk file content to be cached in memory by the OS too. You can limit the size of the WTC, but you cannot affect the requests to the disk and the file buffering usage. This causes less memory used for the file buffering for any mongod process triggering more real disk IO. In addition, the processes will compete for accessing other resources, like the CPU.

Another problem is that multiple mongod processes make troubleshooting more complicated. It won’t be so simple to identify the root cause of any issue. Which mongod is using more memory for file buffering? Is the other mongod’s slowness affecting the performance of my mongod?

Troubleshooting can be addressed easier on a dedicated machine when running a single mongod.

If one of the mongods gets crazy and uses more CPU time and memory, then all the mongods on the machine will slow down because of fewer resources available in the system.

In the end, never deploy more than one mongod on the same machine. Eventually, you may consider Docker containers. Running mongod in a container you can limit the amount of memory it can use. In such a case do your calculations for how much memory you need in total for the server and how much memory reserve for any container to get the best possible performance for mongod.

It is Not Recommended to Have a Very Large WTC

Increasing the WTC significantly, more than the 50% default, is also a bad habit.

With a larger cache, you can store more uncompressed data but at the same time, you leave a little memory for file buffering. More queries can benefit from the larger WTC but when having evictions mongod could trigger a lot of real disk accesses slowing down the database.

For this reason, in most cases, it is not recommended to increase the WTC higher than the default 50%. The goal is to save enough space for buffering disk blocks into the memory. This can help you to get a very good and more stable performance.


When you think about mongod, you have to consider it as the only process running in the universe. It tries to use as much memory as it can. But there are two caches – the WT cache (uncompressed documents) and the file buffer (of WiredTiger’s compressed files), and performance will be hurt if you starve one for the other.

Never deploy multiple mongods into the same box or at least consider containers. For the WTC, also remember that most of the time the default size (up to 50% of the RAM) works well.


Testing the Value of ScaleFlux Computational Storage Drive (CSD) for PostgreSQL

ScaleFlux Computational Storage Drive PostgreSQL

Some time ago we at Percona were approached by ScaleFlux Inc to benchmark their latest hardware appliance, the CSD 2000 Drive, which is a next-generation SSD computational storage drive. It goes without saying that a truly relevant report requires us to be as honest and as forthright as possible. In other words, my mission was to, ahem, see what kind of mayhem I could cause.

Benchmarking is a bit like cooking; it requires key ingredients, strict adherence to following a set of instructions, mixing the ingredients together, and a bit of heat to make it all happen. In this case, the ingredients include the Linux OS running Ubuntu 18.04 on both the database and the bench-marking hosts, PostgreSQL version 12, SysBench the modular, cross-platform, and multi-threaded benchmark tool, and a comparable, competing appliance i.e. the Intel DC P4610 series drive. The two appliances are mounted as partitions respectively both using the same type of file system.


Once the environment is ready, the next step involves declaring and implementing the bench-marking rules which consist of various types of DML and DDL activity. Keeping in mind that apart from the classic OLAP vs OLTP modes of database processing, executing a benchmark that closely follows real production activities can be problematic. Quite often, when pushing a system to its full capacity, one can say that all production systems are to some extent unique. Therefore, for our purposes, we used the testing regime SysBench offers by default.

Once the system was ready, loading started out slow and gentle. The idea was to develop a baseline for the various types of activity and Postgres runtime conditions. Then, the bench-marking intensity was gradually increased to the point where we eventually started getting interesting results.

Have open source expertise you want to share? Submit your talk for Percona Live ONLINE 2021!

Needless to say, it took quite a bit of time running the various permutations, double-checking our numbers, graphing the data, and then after all that, interpreting the output. I’m not going to go into any great detailing the analysis itself. Instead, I encourage you to look at the whitepaper itself.

So after all this effort, what was the takeaway?

There are two key observations that I’d like to share:

  1. At peak loading, the ScaleFlux CSD 2000 Drive demonstrated less performance variance than that of the Intel DC P4610. Variance being the statistical encapsulation of IO read-write spread between maximum and minimum values. The significance is server predictability. This becomes important when, for example, finely tuned application processes depend upon consistent performance with the RDBMS. Many a time I’ve seen applications get upset when response times between inserting, updating, or deleting data and getting the resultant queries would suddenly change.
  2. Remarkable space savings were realized when the Postgres fillfactor was reduced. As you know, the fillfactor can become a critical runtime parameter in regards to performance when high-frequency UPDATE and DELETE operations take place on the same tuple over and over again.

Finally, one last item… I didn’t mention it but we also benchmarked MySQL for ScaleFlux. The results were pretty remarkable. It’s worth your while to have a look at that one too.

ScaleFlux White Papers:


Deploying Percona Kubernetes Operators with OpenEBS Local Storage

Deploying Percona Kubernetes Operators with OpenEBS Local Storage

Deploying Percona Kubernetes Operators with OpenEBS Local StorageNetwork volumes in Kubernetes provide great flexibility, but still, nothing beats local volumes from direct-attached storage in the sense of database performance.

I want to explore ways to deploy both Percona Kubernetes Operators (Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB) using local volumes, both on the bare-metal deployments or in cloud deployments.

Simple Ways

There are two ways available out of the box to deploy using local storage, which you can use immediately.

You can specify in cluster deployment yaml volume specification, using either hostPath:

        path: /data
        type: Directory



  (which will be equal somewhat to ephemeral storage in EC2):

      emptyDir: {}

While this will work, it is a very rigid way to force local storage, and it is not really the “Kubernetes way”, as we will lose the capability to manage data volumes.  We want to see a more uniform way with Persistent Volumes and Persistent Volumes Claims.

Persistent Volumes

Recognizing limitations with hostPath and emptyDir, Kubernetes introduced Local Persistent Volumes.

Unfortunately, while this will work for the simple deployment of single pods, it does not work with dynamically created volumes which we need for Operators. We need the support of Dynamic Volume Provisioning.

There are several projects to combine Dynamic Volume Provisioning with Local Persistent Volumes, but I did not have much success with them, and the only project which worked for me is OpenEBS.

OpenEBS provides much more than just Local Persistent Volumes, but in this blog, I want to touch only OpenEBS Local PV Hostpath.

OpenEBS Local PV Hostpath

This is actually quite simple, and this is what I like about OpenEBS. First, we will define storage classes:

For the local nvme storage:

kind: StorageClass
  name: localnvme
  annotations: local |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /data/openebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

And for the local ssd:

kind: StorageClass
  name: localssd
  annotations: local |
      - name: StorageType
        value: hostpath
      - name: BasePath
        value: /mnt/data/ebs
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

And after that we can deploy Operators using StorageClass, for example  in cluster deployment yaml:

        storageClassName: localnvme
        accessModes: [ "ReadWriteOnce" ]
            storage: 200Gi

The cluster will be deployed using localnvme StorageClass, and using space on /data/openebs.

Now we can observe used volumes with kubectl get pv:

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                        STORAGECLASS       REASON   AGE
pvc-325e11ff-9082-432c-b484-c0c7a3d1c949   200Gi      RWO            Delete           Bound    pxc/datadir-cluster2-pxc-2   localssd                    2d18h
pvc-3f940d14-8363-450e-a238-a9ff09bee5bc   200Gi      RWO            Delete           Bound    pxc/datadir-cluster3-pxc-1   localnvme                   43h
pvc-421cbf46-73a6-45f1-811e-1fca279fe22d   200Gi      RWO            Delete           Bound    pxc/datadir-cluster2-pxc-1   localssd                    2d18h
pvc-53ee7345-bc53-4808-97d6-2acf728a57d7   200Gi      RWO            Delete           Bound    pxc/datadir-cluster3-pxc-0   localnvme                   43h
pvc-b98eca4a-03b0-4b5f-a152-9b45a35767c6   200Gi      RWO            Delete           Bound    pxc/datadir-cluster2-pxc-0   localssd                    2d18h
pvc-fbe499a9-ecae-4196-b29e-c4d35d69b381   200Gi      RWO            Delete           Bound    pxc/datadir-cluster3-pxc-2   localnvme                   43h

There we can see data volumes from two deployed clusters. OpenEBS Local PV Hostpath is now my go-to method to deploy clusters with the local storage.


Comparing S3 Streaming Tools with Percona XtraBackup

Comparing S3 Streaming Tools

Making backups over the network can be done in two ways: either save on disk and transfer or just transfer without saving. Both ways have their strong and weak points. The second way, particularly, is highly dependent on the upload speed, which would either reduce or increase the backup time. Other factors that influence it are chunk size and the number of upload threads.

Percona XtraBackup 2.4.14 has gained S3 streaming, which is the capability to upload backups directly to s3-compatible storage without saving locally first. This feature was developed because we wanted to improve the upload speeds of backups in Percona Operator for XtraDB Cluster.

There are many implementations of S3 Compatible Storage: AWS S3, Google Cloud Storage, Digital Ocean Spaces, Alibaba Cloud OSS, MinIO, and Wasabi.

We’ve measured the speed of AWS CLI, gsutil, MinIO client, rclone, gof3r and the xbcloud tool (part of Percona XtraBackup) on AWS (in single and multi-region setups) and on Google Cloud. XtraBackup was compared in two variants: a default configuration and one with tuned chunk size and amount of uploading threads.

Here are the results.

AWS (Same Region)

The backup data was streamed from the AWS EC2 instance to the AWS S3, both in the us-east-1 region.



tool settings CPU max mem speed speed comparison
AWS CLI default settings 66% 149Mb 130MiB/s baseline
AWS CLI 10Mb block, 16 threads 68% 169Mb 141MiB/s +8%
MinIO client not changeable 10% 679Mb 59MiB/s -55%
rclone rcat not changeable 102% 7138Mb 139MiB/s +7%
gof3r default settings 69% 252Mb 97MiB/s -25%
gof3r 10Mb block, 16 threads 77% 520Mb 108MiB/s -17%
xbcloud default settings 10% 96Mb 25MiB/s -81%
xbcloud 10Mb block, 16 threads 60% 185Mb 134MiB/s +3%


Tip: If you run MySQL on an EC2 instance to make backups inside one region, do snapshots instead.

AWS (From US to EU)

The backup data was streamed from AWS EC2 in us-east-1 to AWS S3 in eu-central-1.



tool settings CPU max mem speed speed comparison
AWS CLI default settings 31% 149Mb 61MiB/s baseline
AWS CLI 10Mb block, 16 threads 33% 169Mb 66MiB/s +8%
MinIO client not changeable 3% 679Mb 20MiB/s -67%
rclone rcat not changeable 55% 9307Mb 77MiB/s +26%
gof3r default settings 69% 252Mb 97MiB/s +59%
gof3r 10Mb block, 16 threads 77% 520Mb 108MiB/s +77%
xbcloud default settings 4% 96Mb 10MiB/s -84%
xbcloud 10Mb block, 16 threads 59% 417Mb 123MiB/s +101%


Tip: Think about disaster recovery, and what will you do when the whole region is not available. It makes no sense to back up to the same region; always transfer backups to another region.

Google Cloud (From US to EU)

The backup data were streamed from Compute Engine instance in us-east1 to Cloud Storage europe-west3. Interestingly, Google Cloud Storage supports both native protocol and S3(interoperability) API. So, Percona XtraBackup can transfer data to Google Cloud Storage directly via S3(interoperability) API.


tool settings CPU max mem speed speed comparison
gsutil not changeable, native protocol 8% 246Mb 23MiB/s etalon
rclone rcat not changeable, native protocol 6% 61Mb 16MiB/s -30%
xbcloud default settings, s3 protocol 3% 97Mb 9MiB/s -61%
xbcloud 10Mb block, 16 threads, s3 protocol 50% 417Mb 133MiB/s +478%


Tip: A cloud provider can block your account due to many reasons, such as human or robot mistakes, inappropriate content abuse after hacking, credit card expire, sanctions, etc. Think about disaster recovery and what will you do when a cloud provider blocks your account, it may make sense to back up to another cloud provider or on-premise.


xbcloud tool (part of Percona XtraBackup) is 2-5 times faster with tuned settings on long-distance with native cloud vendor tools, and 14% faster and requires 20% less memory than analogs with the same settings. Also, xbcloud is the most reliable tool for transferring backups to S3-compatible storage because of two reasons:

  • It calculates md5 sums during the uploading and puts them into a .md5/filename.md5 file and verifies sums on the download (gof3r does the same).
  • xbcloud sends data in 10mb chunks and resends them if any network failure happens.

PS: Please find instructions on GitHub if you would like to reproduce this article’s results.


Update on fsync Performance

fsync performance

fsync performanceLast year, I wrote a post focused on the performance of the fsync call on various storage devices. The fsync call is extremely important for a database when durability, the “D” of the ACID acronym is a hard requirement. The call ensures the data is permanently stored on disk. The durability requirement forces every transaction to return only when the InnoDB log file and they binary log file have been flushed to disk.

In this post, instead of focusing on the performance of various devices, we’ll see what can be done to improve fsync performance using an Intel Optane card.

Intel Optane

A few years ago, Intel introduced a new type of storage devices based on the 3D_XPoint technology and sold under the Optane brand. Those devices are outperforming regular flash devices and have higher endurance. In the context of this post, I found they are also very good at handling the fsync call, something many flash devices are not great at doing.

I recently had access to an Intel Optane NVMe card, a DC P4800X card with a storage capacity of 375GB. Let’s see how it can be used to improve performance.

Optane used directly as a storage device

This is by far the simplest option if your dataset fits on the card. Just install the device, create a filesystem, mount it and go. Using the same python script as in the first post, the results are:

Options Fsync rate Latency
ext4, O_DIRECT 21200/s 0.047 ms
ext4 20000/s 0.050 ms
ext4, data=journal 9600/s 0.100 ms


The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23000/s and is much better than a regular NAND based NVMe card like the Intel PC-3700, able to deliver a fsync rate of 7300/s. Even enabling the full ext4 journal, the rate is still excellent although, as expected, cut by about half.

Optane used as the cache block device in a hybrid volume

If you have a large dataset, you can still use the Optane card as a read/write cache and improve fsync performance significantly. I did some tests with two easily available solutions, dm-cache and bcache. In both cases, the Optane card was put in front of an external USB Sata disk and the cache layer set to writeback.

Options Fsync rate Latency
No cache 13/s 75 ms
dm-cache 3100/s 0.32 ms
bcache 2500/s 0.40 ms


Both solutions improve the fsync rate by two orders of magnitude. That’s still much slower than the straight device but a very decent trade-off.

Optane used as an ZFS SLOG

ZFS can also use a fast device for its write journal, the ZIL. Such a device in ZFS terminology is called a SLOG. With the ZFS logbias set to “latency”, here is the impact of using an Optane device as SLOG in front of the same slow USB SATA disk:

Options Fsync rate Latency
ZFS, SLOG 7400/s 0.135 ms
ZFS, no SLOG 28/s 36 ms


The addition of SLOG device boosted fsync rate by a factor of nearly 260. The rates are also twice as important as the ones reported using dm-cache and bcache and about a third of the result using the Optane device for storage.  Considering all the added benefits of ZFS like compression and snapshots, that’s a really interesting result.


If you are struggling with the commit latency of a large transactional database, 3D_XPoint devices like the Intel Optane may offer you new options.


Upcoming Webinar Thurs 1/17: How to Rock with MyRocks

How to Rock with MyRocks

How to Rock with MyRocksPlease join Percona’s Chief Technology Officer, Vadim Tkachenko, as he presents How to Rock with MyRocks on Thursday, January 17th at 10:00 AM PDT (UTC-7) / 1:00 PM EDT (UTC-4).

Register Now

MyRocks is a new storage engine from Facebook and is available in Percona Server for MySQL. In what cases will you want to use it? We will check different workloads and when MyRocks is most suitable for you. Also, as for any new engine, it’s important to set it up and tune it properly. So, we will review the most important settings to pay attention to.

Register for this webinar to learn How to Rock with MyRocks.


AWS Elastic Block Storage (EBS) – Can We Get It Truly Elastic?

very old disk storage

very old disk storageAt AWS Re:Invent 2018 there were many great announcements of AWS New Services and New Features, but one basic feature that I’ve been waiting for years to be released is still nowhere to be  found.

AWS Elastic Block Storage (EBS) is great and it’s got better through the years, adding different storage types and features like Provisioned IOPS. However, it still has the most basic inconvenient requirement – I have to decide in advance how much space I need to allocate, and pay for all of that allocated space whether I use it or not.

It would be so much better if AWS would allow true consumption model pricing with EBS, where you pay for the storage used, not the storage allocated. This is already the case for S3,  RDS, and even EC2 instances (with Unlimited Option on T2/T3 Instances), not to mention Serverless focused services.

For example, I would love to be able to create a 1TB EBS volume but only pay for 10GB of storage if I only use this amount of space.

Modern storage subsystems do a good job differentiating between the space available on the block device and what’s being used by user files and filesystem metadata. The space that’s not allocated any more can be TRIMmed. This is a basic requirement for working well on flash storage, and as modern EC2 instances already provision EBS storage as emulated NVMe devices, I would imagine Amazon could hook into such functionality to track space actually used.

For us at Percona this would make shipping applications on AWS Marketplace much more convenient. Right now for Percona Monitoring and Management (PMM)  we have to choose how much space to allocate to the EBS volume by default, picking between it being expensive to run because of paying for the large unused EBS volume or setting a very limited by default capacity that needs user action to resize the EBS volume. Consumption-based EBS pricing would solve this dilemma.

This problem seems to be well recognized and understood. For example Pure Storage Cloud Block Storage (currently in Beta) is  expected to have such a feature.

I hope with its insane customer focus AWS will add this feature in the future, but currently we have to get by without it.

Image: Arnold Reinhold [CC BY-SA 2.5], via Wikimedia Commons

Powered by WordPress | Theme: Aeros 2.0 by