As some of you likely know, I have a favorable view of ZFS and especially of MySQL on ZFS. As I wrote a few years ago, the argument for ZFS was less about performance than about its useful features, like data compression and snapshots. At the time, ZFS was significantly slower than XFS and ext4, except when the L2ARC was used.
Since then, however, ZFS on Linux has progressed a lot and I also learned how to better tune it. Also, I found out the sysbench benchmark I used at the time was not a fair choice since the dataset it generates compresses much less than a realistic one. For all these reasons, I believe that it is time to revisit the performance aspect of MySQL on ZFS.
ZFS Evolution
In 2018, I reported ZFS performance results based on version 0.6.5.6, the default version available in Ubuntu Xenial. The present post uses version 0.8.6-1 of ZFS, the default one available on Debian Buster. Between the two versions, there are in excess of 3600 commits adding a number of new features, like support for TRIM operations and native encryption.
ZFS 0.8.6-1 is not bleeding edge; there have been more than 1700 commits since, and after 0.8.6 the ZFS release numbering jumped to 2.0. The big addition included in the 2.0 release is the efficient zstd compression algorithm.
Benchmark Tools
The classic sysbench MySQL database benchmarks have a dataset containing mostly random data. Such datasets don’t compress much, certainly less than most real-world datasets I have worked with. The compressibility of the dataset is important since the ZFS caches, the ARC and L2ARC, store compressed data. A better compression ratio essentially means more data is cached and fewer IO operations are needed.
A well-known benchmark for transactional workloads is TPCC. Furthermore, the dataset created by TPCC compresses rather well, making it more realistic in the context of this post. I used the sysbench TPCC implementation.
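For completeness, the TPCC scripts referenced here are, as far as I can tell, the Percona-Lab sysbench-tpcc Lua scripts; fetching them typically looks like this:

# sysbench itself is available from the distribution packages
apt install sysbench
# TPCC Lua scripts (assuming the Percona-Lab sysbench-tpcc implementation)
git clone https://github.com/Percona-Lab/sysbench-tpcc.git
cd sysbench-tpcc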
Test Environment
Since I am already familiar with AWS and Google Cloud, I decided to try Azure for this project. I launched these two virtual machines:
tpcc:
- benchmark host
- Standard D2ds_v4 instance
- 2 vCPUs, 8GB of RAM, and 75GB of temporary storage
- Debian Buster
db:
- Database host
- Standard E4-2ds-v4 instance
- 2 vCPUs, 32GB of RAM, and 150GB of temporary storage
- 256GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
- Debian Buster
- Percona Server for MySQL 8.0.22-13
Configuration
By default, and unless otherwise specified, the ZFS filesystems are created with:
zpool create bench /dev/sdc

zfs set compression=lz4 atime=off logbias=throughput bench

zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k \
    -o primarycache=metadata bench/data

zfs create -o mountpoint=/var/lib/mysql/log bench/log
There are two ZFS filesystems. bench/data is optimized for the InnoDB dataset while bench/log is tuned for the InnoDB log files. Both are compressed using lz4, and the logbias parameter is set to throughput, which changes the way the ZIL is used. For the ext4 runs, the noatime mount option is used.
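As a quick sanity check of the resulting datasets, and as a sketch of how the ext4 baseline might be prepared (the device name and mount point in the ext4 example are assumptions, not the exact commands used here):

# Verify the properties of the InnoDB data filesystem
zfs get compression,recordsize,primarycache,logbias bench/data

# ext4 baseline, mounted with noatime (device and mount point assumed)
mkfs.ext4 /dev/sdc
mount -o noatime /dev/sdc /var/lib/mysql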
ZFS also has a number of kernel parameters; the ones set to non-default values are:
zfs_arc_max=2147483648
zfs_async_block_max_blocks=5000
zfs_delete_blocks=1000
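For reference, one way to make these module parameters persistent is through a modprobe.d file; they can also be changed on a live system under /sys/module/zfs/parameters. This is a generic sketch, not necessarily how it was applied for these tests:

# /etc/modprobe.d/zfs.conf
options zfs zfs_arc_max=2147483648
options zfs zfs_async_block_max_blocks=5000
options zfs zfs_delete_blocks=1000

# Or, at runtime:
echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max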
Essentially, the above settings limit the ARC size to 2GB and throttle down the aggressiveness of ZFS for deletes. Finally, the database configuration is slightly different between ZFS and ext4. There is a common section:
[mysqld]
pid-file                       = /var/run/mysqld/mysqld.pid
socket                         = /var/run/mysqld/mysqld.sock
log-error                      = /var/log/mysql/error.log
skip-log-bin
datadir                        = /var/lib/mysql/data
innodb_buffer_pool_size        = 26G
innodb_flush_log_at_trx_commit = 1   # TPCC reqs.
innodb_log_file_size           = 1G
innodb_log_group_home_dir      = /var/lib/mysql/log
innodb_flush_neighbors         = 0
innodb_fast_shutdown           = 2
and when ext4 is used:
innodb_flush_method = O_DIRECT
and when ZFS is used:
innodb_flush_method     = fsync
innodb_doublewrite      = 0   # ZFS is transactional
innodb_use_native_aio   = 0
innodb_read_io_threads  = 10
innodb_write_io_threads = 10
ZFS doesn’t support O_DIRECT; the option is simply ignored, with a message written to the error log. I chose to explicitly set the flush method to fsync instead. The doublewrite buffer is not needed with ZFS, and I was under the impression that the Linux native asynchronous IO implementation was not well supported by ZFS, so I disabled it and increased the number of IO threads. We’ll revisit the asynchronous IO question in a future post.
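As a sanity check, and not part of the original procedure, the running instance can be queried to confirm it picked up the intended settings:

mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN ('innodb_flush_method', 'innodb_doublewrite', 'innodb_use_native_aio')"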
Dataset
I used the following command to create the dataset:
./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
    --threads=8 --tables=10 --scale=200 --db-driver=mysql prepare
The resulting dataset has a size of approximately 200GB. That is much larger than the buffer pool, so the database performance is essentially IO-bound.
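To see how much space the freshly created dataset actually occupies, the apparent file size can be compared with what ZFS reports as allocated; this is just a convenience check using the datasets created above:

# Logical size of the InnoDB files vs. space allocated by ZFS
du -sh --apparent-size /var/lib/mysql/data
zfs list -o name,used,refer bench/data bench/log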
Test Procedure
The execution of every benchmark was scripted and followed these steps (a simplified sketch of such a script is shown after the list):
- Stop MySQL
- Remove all datafiles
- Adjust the filesystem
- Copy the dataset
- Adjust the MySQL configuration
- Start MySQL
- Record the configuration
- Run the benchmark
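The script itself is not reproduced here, but a simplified bash sketch of the sequence could look like the following; the dataset copy location, the filesystem tuning helper, and the output file names are hypothetical placeholders:

#!/bin/bash
# Simplified sketch of the benchmark procedure; paths and helper scripts are hypothetical.
set -e
FS=$1    # "ext4" or "zfs"

systemctl stop mysql                                       # stop MySQL
rm -rf /var/lib/mysql/data/* /var/lib/mysql/log/*          # remove all datafiles
./setup_filesystem.sh "$FS"                                # adjust the filesystem (hypothetical helper)
cp -a /backups/tpcc-dataset/. /var/lib/mysql/data/         # copy the dataset (hypothetical location)
cp "mysqld-$FS.cnf" /etc/mysql/conf.d/tuning.cnf           # adjust the MySQL configuration
systemctl start mysql                                      # start MySQL
cp /etc/mysql/conf.d/tuning.cnf "results/config-$FS.cnf"   # record the configuration
./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
    --threads=16 --time=7200 --report-interval=10 --tables=10 --scale=200 \
    --db-driver=mysql run | tee "results/tpcc-$FS.log"     # run the benchmark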
Results
For the benchmark, I used the following invocation:
./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
    --threads=16 --time=7200 --report-interval=10 --tables=10 --scale=200 --db-driver=mysql run
The TPCC benchmark uses 16 threads for a duration of two hours. The duration is sufficiently long to reach a steady state and to exhaust the storage burst capacity. Sysbench reports the total number of TPCC transactions per second every 10s. This number includes not only the New Order transactions but also the other transaction types like Payment, Order Status, etc. Be aware of that if you want to compare these results with other TPCC benchmarks.
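Since sysbench prints one report line every 10 seconds, the time series shown in the figure below can be extracted with a small filter. This assumes the output was captured to a file (the name matches the hypothetical script above) and relies on the default sysbench report line format:

# Print elapsed time and tps for every 10s report line
grep '^\[' results/tpcc-zfs.log | \
    awk '{for (i = 1; i < NF; i++) if ($i == "tps:") print $2, $(i+1)}'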
In those conditions, the figure below presents the rates of TPCC transactions over time for ext4 and ZFS.

Figure: MySQL TPCC results for ext4 and ZFS
During the initial 15 minutes, the buffer pool warms up, but at some point the workload shifts from being IO read-bound to being IO write and CPU-bound. Then, at around 3000s, the SSD Premium burst capacity is exhausted and the workload becomes purely IO-bound. I was a bit surprised by the results, enough that I reran the benchmarks to make sure. The results for both ext4 and ZFS are qualitatively similar; any difference is within the margin of error. That essentially means that if you configure ZFS properly, it can be as IO-efficient as ext4.
What is interesting is the amount of storage used. While the dataset on ext4 consumed 191GB, the lz4 compression of ZFS yielded a dataset of only 69GB. That’s a huge difference, a factor of 2.8, which could save a decent amount of money over time for large datasets.
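The space savings can also be read directly from ZFS, for example using the pool and datasets created earlier:

# Compression ratio and logical vs. physical space usage
zfs get compressratio bench/data
zfs list -o name,used,logicalused bench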
Conclusion
It appears that it was indeed a good time to revisit the performance of MySQL with ZFS. In a fairly realistic use case, ZFS is on par with ext4 regarding performance while still providing the extra benefits of data compression, snapshots, etc. In a future post, I’ll examine the use of cloud ephemeral storage with ZFS and see how this can further improve performance.