Hands-On Look at ZFS with MySQL

This post is a hands-on look at ZFS with MySQL.

In my previous post, I highlighted the similarities between MySQL and ZFS. Before going any further, I’d like you to be able to play and experiment with ZFS. This post shows you how to configure ZFS with MySQL in a minimalistic way on either Ubuntu 16.04 or Centos 7.


In order to be able to use ZFS, you need some available storage space. For storage – since the goal here is just to have a hands-on experience – we’ll use a simple file as a storage device. Although simplistic, I have now been using a similar setup on my laptop for nearly three years (just can’t get rid of it, it is too useful). For simplicity, I suggest you use a small Centos7 or Ubuntu 16.04 VM with one core, 8GB of disk and 1GB of RAM.

First, you need to install ZFS as it is not installed by default. On Ubuntu 16.04, you simply need to run:

root@Ubuntu1604:~# apt-get install zfs-dkms zfsutils-linux

On RedHat or Centos 7.4, the procedure is a bit more complex. First, we need to install the EPEL ZFS repository:

[root@Centos7 ~]# yum install
[root@Centos7 ~]# gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-zfsonlinux
[root@Centos7 ~]# gpg --quiet --with-fingerprint /etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7

Apparently, there were issues with ZFS kmod kernel modules on RedHat/Centos. I never had any issues with Ubuntu (and who knows how often the kernel is updated). Anyway, it is recommended that you enable kABI-tracking kmods. Edit the file /etc/yum.repos.d/zfs.repo, disable the ZFS repo and enable the zfs-kmod repo. The beginning of the file should look like:

name=ZFS on Linux for EL7 - dkms
name=ZFS on Linux for EL7 - kmod
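
If you would rather script the change than edit the file by hand, here is a minimal sketch. It assumes the two sections of zfs.repo are named [zfs] and [zfs-kmod] and that each one contains an enabled= line, so check your copy of the file first. The set_repo_enabled helper is a made-up name for illustration, not a yum command.

```shell
# set_repo_enabled FILE REPO_ID VALUE
# Rewrites the enabled= line inside the [REPO_ID] section of a yum .repo file.
set_repo_enabled() {
    awk -v repo="[$2]" -v val="$3" '
        /^\[/ { in_repo = ($0 == repo) }    # a new section header starts here
        in_repo && /^enabled=/ { $0 = "enabled=" val }
        { print }
    ' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage on Centos 7 (section names are assumptions):
#   set_repo_enabled /etc/yum.repos.d/zfs.repo zfs 0
#   set_repo_enabled /etc/yum.repos.d/zfs.repo zfs-kmod 1
```

The file is rewritten through a temporary copy, so its ownership and permissions may need to be checked afterwards.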

Now, we can proceed and install ZFS:

[root@Centos7 ~]# yum install zfs

After the installation, I ended up with different ZFS versions on Ubuntu and on Centos7. The version difference doesn’t matter for what will follow.


So, we need a container for the data. You can use any of the following options for storage:

  • A free disk device
  • A free partition
  • An empty LVM logical volume
  • A file

The easiest solution is to use a file, and so that’s what I’ll use here. A file is not the fastest and most efficient storage, but it is fine for our hands-on. In production, please use real devices. A more realistic server configuration will be discussed in a future post. The following steps are identical on Ubuntu and Centos. The first step is to create the storage file. I’ll use a file of 1 GB in /mnt. Adjust the size and path to whatever suits the resources you have:

[root@Centos7 ~]# dd if=/dev/zero of=/mnt/zfs.img bs=1024 count=1048576

The result is a 1GB file in /mnt:

[root@Centos7 ~]# ls -lh /mnt
total 1,0G
-rw-r--r--.  1 root root 1,0G 16 nov 16:50 zfs.img
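
As a variation on the dd command above, a sparse file also works for a hands-on and avoids writing a gigabyte of zeros. This is only a sketch: make_backing_file is a hypothetical helper, and it relies on truncate from GNU coreutils. With a sparse file, ls still reports the full size, but blocks are only allocated as ZFS writes to them, so the space is not reserved up front.

```shell
# make_backing_file PATH SIZE_MIB
# Creates (or resizes) a sparse file of SIZE_MIB mebibytes.
make_backing_file() {
    truncate -s "$(( $2 * 1024 * 1024 ))" "$1"
}

# Usage, equivalent in size to bs=1024 count=1048576:
#   make_backing_file /mnt/zfs.img 1024
```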

Now, we will create our ZFS pool, mysqldata, using the file we just created:

[root@Centos7 ~]# modprobe zfs
[root@Centos7 ~]# zpool create mysqldata /mnt/zfs.img
[root@Centos7 ~]# zpool status
  pool: mysqldata
 state: ONLINE
  scan: none requested
        NAME            STATE     READ WRITE CKSUM
        mysqldata       ONLINE       0     0     0
          /mnt/zfs.img  ONLINE       0     0     0
errors: No known data errors
[root@Centos7 ~]# zfs list
mysqldata  79,5K   880M    24K  /mysqldata

If you have a result similar to the above, congratulations, you have a ZFS pool. If you put files in /mysqldata, they are in ZFS.

MySQL installation

Now, let’s install MySQL and play around a bit. We’ll begin by installing the Percona repository:

root@Ubuntu1604:~# cd /tmp
root@Ubuntu1604:/tmp# wget$(lsb_release -sc)_all.deb
root@Ubuntu1604:/tmp# dpkg -i percona-release_*.deb
root@Ubuntu1604:/tmp# apt-get update
[root@Centos7 ~]# yum install

Next, we install Percona Server for MySQL 5.7:

root@Ubuntu1604:~# apt-get install percona-server-server-5.7
root@Ubuntu1604:~# systemctl start mysql
[root@Centos7 ~]# yum install Percona-Server-server-57
[root@Centos7 ~]# systemctl start mysql

The installation command pulls all the dependencies and sets up the MySQL root password. On Ubuntu, the install script asks for the password, but on Centos7 a random password is set. To retrieve the random password:

[root@Centos7 ~]# grep password /var/log/mysqld.log
2017-11-21T18:37:52.435067Z 1 [Note] A temporary password is generated for root@localhost: XayhVloV+9g+

The following step is to reset the root password:

[root@Centos7 ~]# mysql -p -e "ALTER USER 'root'@'localhost' IDENTIFIED BY 'Mysql57OnZfs_';"
Enter password:

Since 5.7.15, the password validation plugin by default requires a length of at least 8 characters, mixed case, at least one digit and at least one special character. On either Linux distribution, I suggest you set the credentials in the /root/.my.cnf file like this:

# cat /root/.my.cnf
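
A minimal sketch of how such a file can be created, assuming the standard [client] option group read by the mysql command line tools and the root password chosen in the previous step. The write_my_cnf helper name is made up for illustration.

```shell
# write_my_cnf FILE PASSWORD
# Writes a client credentials file and restricts it to its owner.
write_my_cnf() {
    printf '[client]\nuser=root\npassword=%s\n' "$2" > "$1" && chmod 600 "$1"
}

# Usage (as root):
#   write_my_cnf /root/.my.cnf 'Mysql57OnZfs_'
```

With the file in place, the mysql client no longer needs the -p option for the root user.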

MySQL configuration for ZFS

Now that we have both ZFS and MySQL, we need some configuration to make them play together. From here, the steps are the same on Ubuntu and Centos. First, we stop MySQL:

# systemctl stop mysql

Then, we’ll configure ZFS. We will create three ZFS filesystems in our pool:

  • mysql will be the top level filesystem for the MySQL related data. This filesystem will not directly have data in it, but data will be stored in the other filesystems that we create. The utility of the mysql filesystem will become obvious when we talk about snapshots. Something to keep in mind for the next steps: by default, the properties of a filesystem are inherited from the upper level.
  • mysql/data will be the actual datadir. The files in the datadir are mostly accessed through random IO operations, so we’ll set the ZFS recordsize to match the InnoDB page size.
  • mysql/log will be where the log files will be stored. By log files, I primarily mean the InnoDB log files. But the binary log file, the slow query log and the error log will all be stored in that directory. The log files are accessed through sequential IO operations. We’ll thus use a bigger ZFS recordsize in order to maximize the compression efficiency.

Let’s begin with the top-level MySQL container. I could have used mysqldata directly, but that would somewhat limit us. The following steps create the filesystem and set some properties:

# zfs create mysqldata/mysql
# zfs set compression=gzip mysqldata/mysql
# zfs set recordsize=128k mysqldata/mysql
# zfs set atime=off mysqldata/mysql

I just set compression to ‘gzip’ (the equivalent of gzip level 6), recordsize to 128KB and atime (the file’s access time) to off. Once we are done with the mysql filesystem, we can proceed with the data and log filesystems:

# zfs create mysqldata/mysql/log
# zfs create mysqldata/mysql/data
# zfs set recordsize=16k mysqldata/mysql/data
# zfs set primarycache=metadata mysqldata/mysql/data
# zfs get compression,recordsize,atime mysqldata/mysql/data
NAME                  PROPERTY     VALUE     SOURCE
mysqldata/mysql/data  compression  gzip      inherited from mysqldata/mysql
mysqldata/mysql/data  recordsize   16K       local
mysqldata/mysql/data  atime        off       inherited from mysqldata/mysql

Of course, there are other properties that could be set, but let’s keep things simple. Now that the filesystems are ready, let’s move the files to ZFS (make sure you stopped MySQL):

# mv /var/lib/mysql/ib_logfile* /mysqldata/mysql/log/
# mv /var/lib/mysql/* /mysqldata/mysql/data/

and then set the real mount points:

# zfs set mountpoint=/var/lib/mysql mysqldata/mysql/data
# zfs set mountpoint=/var/lib/mysql-log mysqldata/mysql/log
# chown mysql:mysql /var/lib/mysql /var/lib/mysql-log

Now we have:

# zfs list
mysqldata             1,66M   878M  25,5K  /mysqldata
mysqldata/mysql       1,54M   878M    25K  /mysqldata/mysql
mysqldata/mysql/data   890K   878M   890K  /var/lib/mysql
mysqldata/mysql/log    662K   878M   662K  /var/lib/mysql-log

We must adjust the MySQL configuration accordingly. Here’s what I put in my /etc/my.cnf file (/etc/mysql/my.cnf on Ubuntu):

innodb_log_group_home_dir = /var/lib/mysql-log
innodb_doublewrite = 0
innodb_checksum_algorithm = none
slow_query_log_file = /var/lib/mysql-log/slow.log
log-error = /var/lib/mysql-log/error.log
server_id = 12345
log_bin = /var/lib/mysql-log/binlog
# Disabling symbolic-links is recommended to prevent assorted security risks

On Centos 7, SELinux prevented MySQL from accessing files in /var/lib/mysql-log. I had to perform the following steps:

[root@Centos7 ~]# yum install policycoreutils-python
[root@Centos7 ~]# semanage fcontext -a -t mysqld_db_t "/var/lib/mysql-log(/.*)?"
[root@Centos7 ~]# chcon -Rv --type=mysqld_db_t /var/lib/mysql-log/

I could have just disabled SELinux since it is a test server, but if I don’t get my hands dirty on SELinux once in a while with semanage and chcon I will not remember how to do it. SELinux is an important security tool on Linux (but that’s another story).

At this point, feel free to start using your test MySQL database on ZFS.

Monitoring ZFS

To monitor ZFS, you can use the zpool command like this:

[root@Centos7 ~]# zpool iostat 3
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mysqldata   19,6M   988M      0      0      0    290
mysqldata   19,3M   989M      0     44      0  1,66M
mysqldata   23,4M   985M      0     49      0  1,33M
mysqldata   23,4M   985M      0     40      0   694K
mysqldata   26,7M   981M      0     39      0   561K
mysqldata   26,7M   981M      0     37      0   776K
mysqldata   23,8M   984M      0     43      0   634K

This shows the ZFS activity while I was loading some data. Also, the following command gives you an estimate of the compression ratio:

[root@Centos7 ~]# zfs get compressratio,used,logicalused mysqldata/mysql
mysqldata/mysql  compressratio  4.10x  -
mysqldata/mysql  used           116M   -
mysqldata/mysql  logicalused    469M   -
[root@Centos7 ~]# zfs get compressratio,used,logicalused mysqldata/mysql/data
NAME                  PROPERTY       VALUE  SOURCE
mysqldata/mysql/data  compressratio  4.03x  -
mysqldata/mysql/data  used           67,9M  -
mysqldata/mysql/data  logicalused    268M   -
[root@Centos7 ~]# zfs get compressratio,used,logicalused mysqldata/mysql/log
NAME                 PROPERTY       VALUE  SOURCE
mysqldata/mysql/log  compressratio  4.21x  -
mysqldata/mysql/log  used           47,8M  -
mysqldata/mysql/log  logicalused    201M   -
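
The ratio can be cross-checked from the used and logicalused values. A minimal sketch follows, with compress_ratio as a hypothetical helper name. The result will not match compressratio exactly, since the values displayed above are rounded.

```shell
# compress_ratio LOGICAL USED
# Prints logical size divided by on-disk size, with two decimals.
compress_ratio() {
    awk -v l="$1" -v u="$2" 'BEGIN { printf "%.2f\n", l / u }'
}

compress_ratio 469 116    # values in MB reported for mysqldata/mysql

# With a pool available, exact byte counts can be read with:
#   zfs get -Hp -o value logicalused,used mysqldata/mysql
```

For mysqldata/mysql this gives about 4.04, close to the 4.10x that ZFS reports.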

In my case, the dataset compresses very well (4x). Another way to see how files are compressed is to use ls and du. ls returns the actual uncompressed size of the file, while du returns the compressed size. Here’s an example:

[root@Centos7 mysql]# ls -lah ibdata1
-rw-rw---- 1 mysql mysql 90M nov 24 16:09 ibdata1
[root@Centos7 mysql]# du -hs ibdata1
14M     ibdata1

I really invite you to further experiment and get a feeling of how ZFS and MySQL behave together.

Snapshots and backups

A great feature of ZFS that works really well with MySQL is snapshots. A snapshot is a consistent view of the filesystem at a given point in time. Normally, it is best to perform a snapshot while a flush tables with read lock is held. That allows you to record the master position, and also to flush MyISAM tables. It is quite easy to do that. Here’s how I create a snapshot with MySQL:

[root@Centos7 ~]# mysql -e 'flush tables with read lock;show master status;\! zfs snapshot -r mysqldata/mysql@my_first_snapshot'
| File          | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
| binlog.000002 | 110295083 |              |                  |                   |
[root@Centos7 ~]# zfs list -t snapshot
NAME                                     USED  AVAIL  REFER  MOUNTPOINT
mysqldata/mysql@my_first_snapshot          0B      -    24K  -
mysqldata/mysql/data@my_first_snapshot     0B      -  67,9M  -
mysqldata/mysql/log@my_first_snapshot      0B      -  47,8M  -
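
The read lock only lasts as long as the client session that took it, which is why the snapshot is launched from inside the same mysql invocation with the client’s \! system command. Here is a small sketch that builds that statement for an arbitrary snapshot name. The build_snapshot_sql helper and the timestamp naming scheme are assumptions for illustration.

```shell
# build_snapshot_sql FILESYSTEM SNAPSHOT_NAME
# Prints the statements to run in a single mysql session.
build_snapshot_sql() {
    printf 'FLUSH TABLES WITH READ LOCK; SHOW MASTER STATUS; \\! zfs snapshot -r %s@%s' "$1" "$2"
}

# Usage, with a timestamp as the snapshot name:
#   mysql -e "$(build_snapshot_sql mysqldata/mysql "$(date +%Y%m%d_%H%M%S)")"
```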

The command took about 1s. The only times such a command takes longer are when there are MyISAM tables with a lot of pending updates to the indices, or when there are long running transactions. You probably wonder why the “USED” column reports 0B. That’s simply because there were no changes to the filesystem since the snapshot was created. It is a measure of the amount of data that hasn’t been freed because the snapshot requires the data. Said otherwise, it is how far the snapshot has diverged from its parent. You can access the snapshot through a clone or through ZFS as a file system. To access the snapshot through ZFS, you have to set the snapdir parameter to “visible,” and then you can see the files. Here’s how:

[root@Centos7 ~]# zfs set snapdir=visible mysqldata/mysql/data
[root@Centos7 ~]# zfs set snapdir=visible mysqldata/mysql/log
[root@Centos7 ~]# ls /var/lib/mysql-log/.zfs/snapshot/my_first_snapshot/
binlog.000001  binlog.000002  binlog.index  error.log  ib_logfile0  ib_logfile1

The files in the snapshot directory are read-only. If you want to be able to write to the files, you first need to clone the snapshots:

[root@Centos7 ~]# zfs create mysqldata/mysqlslave
[root@Centos7 ~]# zfs clone mysqldata/mysql/data@my_first_snapshot mysqldata/mysqlslave/data
[root@Centos7 ~]# zfs clone mysqldata/mysql/log@my_first_snapshot mysqldata/mysqlslave/log
[root@Centos7 ~]# zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
mysqldata                   116M   764M    26K  /mysqldata
mysqldata/mysql             116M   764M    24K  /mysqldata/mysql
mysqldata/mysql/data       67,9M   764M  67,9M  /var/lib/mysql
mysqldata/mysql/log        47,8M   764M  47,8M  /var/lib/mysql-log
mysqldata/mysqlslave         28K   764M    26K  /mysqldata/mysqlslave
mysqldata/mysqlslave/data     1K   764M  67,9M  /mysqldata/mysqlslave/data
mysqldata/mysqlslave/log      1K   764M  47,8M  /mysqldata/mysqlslave/log
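
The clone commands above generalize to several copies. The sketch below only prints the commands for review instead of running them. print_clone_cmds and the mysqlslaveN naming are hypothetical, and it assumes the mysql/data and mysql/log layout used in this post.

```shell
# print_clone_cmds POOL SNAPSHOT_NAME COUNT
# Prints the zfs commands needed to clone data and log COUNT times.
print_clone_cmds() {
    pool=$1 snap=$2 count=$3
    i=1
    while [ "$i" -le "$count" ]; do
        echo "zfs create ${pool}/mysqlslave${i}"
        echo "zfs clone ${pool}/mysql/data@${snap} ${pool}/mysqlslave${i}/data"
        echo "zfs clone ${pool}/mysql/log@${snap} ${pool}/mysqlslave${i}/log"
        i=$((i + 1))
    done
}

# Usage: review the output, then pipe it to sh to execute.
#   print_clone_cmds mysqldata my_first_snapshot 5
```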

At this point, it is up to you to use the clones to spin up a local slave. Like for the snapshots, the clone only grows in size when actual data is written to it. ZFS records that haven’t changed since the snapshot was taken are shared. That’s a huge space savings. For a customer, I once wrote a script to automatically create five MySQL slaves for their developers. The developers would do tests, and often replication broke. Rerunning the script would recreate fresh slaves in a matter of a few minutes. My ZFS snapshot script and the script I wrote to create the clone based slaves are available here:

Optional features

In the previous post, I talked about a SLOG device for the ZIL and the L2ARC, a disk extension of the ARC cache. If you promise to never use the following trick in production, here’s how to speed up MySQL on ZFS drastically:

[root@Centos7 ~]# dd if=/dev/zero of=/dev/shm/zil_slog.img bs=1024 count=131072
131072+0 records in
131072+0 records out
134217728 bytes (134 MB) copied, 0,373809 s, 359 MB/s
[root@Centos7 ~]# zpool add mysqldata log /dev/shm/zil_slog.img
[root@Centos7 ~]# zpool status
  pool: mysqldata
 state: ONLINE
  scan: none requested
        NAME                     STATE     READ WRITE CKSUM
        mysqldata                ONLINE       0     0     0
          /mnt/zfs.img           ONLINE       0     0     0
          /dev/shm/zil_slog.img  ONLINE       0     0     0
errors: No known data errors

The data in the SLOG is critical for ZFS recovery. I performed some tests with virtual machines, and if you crash the server and lose the SLOG you may lose all the data stored in the ZFS pool. Normally, the SLOG is on a mirror in order to lower the risk of losing it. The SLOG can be added and removed online.

I know I asked you to promise to never use an shm file as SLOG in production. Actually, there are exceptions. I would not hesitate to temporarily use such a trick to speed up a lagging slave. Another situation where such a trick could be used is with Percona XtraDB Cluster. With a cluster, there are multiple copies of the dataset. Even if one node crashed and lost its ZFS filesystems, it could easily be reconfigured and reprovisioned from the surviving nodes.

The other optional feature I want to cover is a cache device. The cache device is what is used for the L2ARC. The content of the L2ARC is compressed as the original data is compressed. To add a cache device (again an shm file), do:

[root@Centos7 ~]# dd if=/dev/zero of=/dev/shm/l2arc.img bs=1024 count=131072
131072+0 records in
131072+0 records out
134217728 bytes (134 MB) copied, 0,272323 s, 493 MB/s
[root@Centos7 ~]# zpool add mysqldata cache /dev/shm/l2arc.img
[root@Centos7 ~]# zpool status
  pool: mysqldata
 state: ONLINE
  scan: none requested
    NAME                     STATE     READ WRITE CKSUM
    mysqldata                ONLINE       0     0     0
      /mnt/zfs.img           ONLINE       0     0     0
      /dev/shm/zil_slog.img  ONLINE       0     0     0
      /dev/shm/l2arc.img     ONLINE       0     0     0
errors: No known data errors

To monitor the L2ARC (and also the ARC), look at the file: /proc/spl/kstat/zfs/arcstats. As the ZFS filesystems are configured right now, very little will go to the L2ARC. This can be frustrating. The reason is that the L2ARC is filled by the elements evicted from the ARC. If you recall, we have set primarycache=metadata for the filesystem containing the actual data. Hence, in order to get some data to our L2ARC, I suggest the following steps:

[root@Centos7 ~]# zfs set primarycache=all mysqldata/mysql/data
[root@Centos7 ~]# echo 67108864 > /sys/module/zfs/parameters/zfs_arc_max
[root@Centos7 ~]# echo 3 > /proc/sys/vm/drop_caches
[root@Centos7 ~]# grep '^size' /proc/spl/kstat/zfs/arcstats
size                            4    65097584
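
The same arcstats file also exposes hit and miss counters. A minimal sketch to turn them into hit percentages is shown below, assuming the usual three-column "name type data" layout and the hits, misses, l2_hits and l2_misses counters. arc_hit_ratio is a hypothetical helper name.

```shell
# arc_hit_ratio [FILE]
# Prints ARC and L2ARC hit percentages since the module was loaded.
arc_hit_ratio() {
    awk '
        $1 == "hits"      { hits = $3 }
        $1 == "misses"    { misses = $3 }
        $1 == "l2_hits"   { l2h = $3 }
        $1 == "l2_misses" { l2m = $3 }
        END {
            if (hits + misses > 0) printf "arc_hit_pct=%.1f\n", 100 * hits / (hits + misses)
            if (l2h + l2m > 0)     printf "l2arc_hit_pct=%.1f\n", 100 * l2h / (l2h + l2m)
        }
    ' "${1:-/proc/spl/kstat/zfs/arcstats}"
}

# Usage on a host with the zfs module loaded:
#   arc_hit_ratio
```

The counters are cumulative, so compare two readings taken a few seconds apart to see the current behavior.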

It takes the echo command to drop_caches to force a re-initialization of the ARC. Now, InnoDB data starts to be cached in the L2ARC. The way data is sent to the L2ARC has many tunables, which I won’t discuss here. I chose 64MB for the ARC size mainly because I am using a low memory VM. A size of 64MB is aggressively small and will slow down ZFS if the metadata doesn’t fit in the ARC. Normally you should use a larger value. The actual good size depends on many parameters like the filesystem size, the number of files and the presence of a L2ARC. You can monitor the ARC and L2ARC using the arcstat tool that comes with ZFS on Linux (when you use Centos 7). With Ubuntu, download the tool from here.


So the ZFS party is over? We need to clean up the mess! Let’s begin:

[root@Centos7 ~]# systemctl stop mysql
[root@Centos7 ~]# zpool remove mysqldata /dev/shm/l2arc.img
[root@Centos7 ~]# zpool remove mysqldata /dev/shm/zil_slog.img
[root@Centos7 ~]# rm -f /dev/shm/*.img
[root@Centos7 ~]# zpool destroy mysqldata
[root@Centos7 ~]# rm -f /mnt/zfs.img
[root@Centos7 ~]# yum erase spl kmod-spl libzpool2 libzfs2 kmod-zfs zfs

The last step is different on Ubuntu:

root@Ubuntu1604:~# apt-get remove spl-dkms zfs-dkms libzpool2linux libzfs2linux spl zfsutils-linux zfs-zed


With this guide, I hope I provided a positive first experience in using ZFS with MySQL. The configuration is simple, and not optimized for performance. However, we’ll look at more realistic configurations in future posts.


ZFS from a MySQL perspective

Since the purpose of a database system is to store data, there is a close relationship with the filesystem. As MySQL consultants, we always look at the filesystems for performance tuning opportunities. The most common choices in terms of filesystems are XFS and EXT4; on Linux it is exceptional to encounter another filesystem. Both XFS and EXT4 have pros and cons but their behaviors are well known and they perform well. They perform well but they are not without shortcomings.

Over the years, we have developed a bunch of tools and techniques to overcome these shortcomings. For example, since they don’t allow a consistent view of the filesystem, we wrote tools like Xtrabackup to back up a live MySQL database. Another example is the InnoDB double write buffer. The InnoDB double write buffer is required only because neither XFS nor EXT4 is transactional. There is one filesystem which offers nearly all the features we need: ZFS. ZFS is arguably the most advanced filesystem available on Linux. Maybe it is time to reconsider the use of ZFS with MySQL.

ZFS on Linux, or ZoL (from the OpenZFS project), has been around for quite a long time now. I first started using ZoL back in 2012, before it was GA (general availability), in order to solve a nearly impossible challenge: backing up a large database (~400 GB) with a mix of InnoDB and MyISAM tables. Yes, ZFS allows that very easily, in just a few seconds. As of 2017, ZoL has been GA for more than 3 years and most of the issues that affected it in the early days have been fixed. ZFS is also GA in FreeBSD, illumos, OmniOS and many others.

This post will hopefully be the first of many posts, devoted to the use of ZFS with MySQL. The goal here is not to blindly push for ZFS but to see when ZFS can help solve real problems. We will first examine ZFS and try to draw parallels with the architecture of MySQL. This will help us to better understand how ZFS works and behaves. Future posts will be devoted to more specific topics like performance, PXC, backups, compression, database operations, bad and poor use cases and sample configurations.

Some context

ZFS is a filesystem that was developed by Sun Microsystems and introduced for the first time with OpenSolaris in 2005. ZFS is unique in many ways. Let’s first have a look at its code base using the sloccount tool, which provides an estimation of the development effort.

  • EXT4: 8.5 person-years
  • XFS: 17 person-years
  • ZFS: 77 person-years

graph of the estimated development efforts for ZFS versus other filesystems

In terms of code base complexity, it is approaching 10 times the complexity of EXT4; the above graphic shows the scale. To put things in perspective, the sloccount development effort for Percona-Server 5.7, which is based on MySQL community 5.7, is estimated at 680 person-years. The ZoL development is sponsored by the Lawrence Livermore National Laboratory and the project is very active.

ZFS features

Why does ZFS need such a large code base? Well, in Linux, it functionally replaces MD (software raid), LVM (volume manager) and the filesystem. ZFS is really a transactional database designed to support filesystem operations. Let’s review the ZFS main features.

128-bit filesystem

That’s huge! According to Jeff Bonwick, the rest energy of such a storage device would be enough to boil the oceans. It seems inconceivable that we’d ever need a larger filesystem.

Copy-on-write (COW)

When ZFS needs to update a record it does not overwrite it. Instead, it writes a new record, changes the pointers and then frees up the old one if it is no longer referenced. That design is at the core of ZFS. It allows for features like free snapshots and transactions.


ZFS supports snapshots, and because of its COW architecture taking a snapshot is merely a matter of recording a transaction number and telling ZFS to protect the referenced records from its garbage collector. This is very similar to the InnoDB MVCC. If a read view is kept open, InnoDB keeps a copy of each of the rows that changed in the undo log, and those rows are not purged until the transaction commits.


A ZFS snapshot can be cloned and then written to. At this point, the clone is like a fork of the original data. There is no equivalent feature in MySQL/InnoDB.


All the ZFS records have a checksum. This is exactly like the page checksums of InnoDB. If a record is found to have an invalid checksum, it is automatically replaced by a copy, provided one is available. In production, it is normal to define a ZFS pool with more than one copy of the data set. With ZFS, we can safely disable InnoDB checksums.


ZFS records can be compressed transparently. The most common algorithms are gzip and lz4. The data is compressed per record and the recordsize is an adjustable property. The principle is similar to transparent InnoDB page compression but without the need for punching holes. In nearly all the ZFS setups I have worked with, enabling compression helped performance.


ZoL doesn’t support transparent encryption of the records yet, but the encryption code is currently under review. If all goes well, the encryption should be available in a matter of a few months. Once there, it will offer another option for encryption at rest with MySQL. That feature compares very well with InnoDB tablespace encryption.


An fsync on ZFS is transactional. This comes mainly from the fact that ZFS uses COW. When a file is opened with O_SYNC or O_DSYNC, ZFS behaves like a database where the fsync calls represent commits. The writes are atomic. The fsync calls return as soon as ZFS has written the data to the ZIL (ZFS Intent Log). Later, a background process flushes the data accumulated in the ZIL to the actual data store. This flushing process is called at an interval of zfs_txg_timeout. By default, zfs_txg_timeout is set to 5s. The process is extremely similar to the way InnoDB flushes pages. A direct benefit for MySQL is the possibility of disabling the InnoDB doublewrite buffer. The InnoDB doublewrite buffer is often a source of contention in a heavy write environment, although the latest Percona Server releases have parallel doublewrite buffers that relieve most of the issue.


The transactional support in ZFS bears a huge price in terms of latency, since the synchronous writes and fsyncs involve many random write IO operations. Since ZFS is transactional, it needs a transactional journal, the ZIL. ZIL stands for ‘ZFS Intent Log’. There is always a ZIL. The ZIL serves a purpose very similar to the InnoDB log files. The ZIL is written to sequentially, is fsynced often, and read from only for recovery after a crash. The goal is to delay random write IO operations by writing pending changes sequentially to a device. By default the ZIL delays the actual writes by only 5s (zfs_txg_timeout) but that’s still very significant. To help synchronous write performance, ZFS has the possibility of locating the ZIL on a Separate Intent Log (SLOG).

The SLOG device doesn’t need to be very large, a few GB is often enough, but it must be fast for sequential writes and fast for fsyncs. A fast flash device with good write endurance or spinners behind a raid controller with a protected write cache are great SLOG devices. Normally, the SLOG is on a redundant device like a mirror since losing the ZIL can be dramatic. With MySQL, the presence of a fast SLOG is extremely important for performance.


The ARC is the ZFS file cache. It is logically split in two parts, the ARC and the L2ARC. The ARC is the in memory file cache, while the L2ARC is an optional on disk cache that stores items that are evicted from the ARC. The L2ARC is especially interesting with MySQL because it allows the use of a small flash storage device as a cache for a large slow storage device. Functionally, the ARC is like the InnoDB buffer pool while the L2ARC is similar to tools like flashcache/bcache/dm-cache.


ZFS has its own way of dealing with disks. At the lowest level, ZFS can use the bare disks individually with no redundancy, a bit like JBOD devices used with LVM. Redundancy can be added with a mirror which is essentially a software RAID-1 device. These mirrors can then be striped together to form the equivalent of a RAID-10 array. Going further, there are RAIDZ-1, RAIDZ-2 and RAIDZ-3 which are respectively the equivalent of RAID-5, RAID-6 and RAID… Well, an array with 3 parities has no standard name yet. When you build a RAID array with Linux MD, you could have the RAID-5+ write hole issue if you do not have a write journal. The write journal option is available only in recent kernels and with the latest mdadm packages. ZFS is not affected by the RAID-5 write hole.


I already touched on self-healing when I talked about the checksums. If more than one copy of a record is available and one of the copies is found to be corrupted, ZFS will return only a valid copy and will repair the damaged record. You can trigger a full check with a scrub.



ZVOL block devices

Not only can ZFS manage filesystems, it can also offer block devices. The block devices, called ZVOLs, can be snapshotted and cloned. That’s a very handy feature when I want to create a cluster of similar VMs. I create a base image and then snapshot and create clones for all the VMs. The whole image is stored only once, and each clone contains only the records that have been modified since the original clone was created.


ZFS allows you to send and receive snapshots. This feature is very useful to send data between servers. If there is already a copy of the data on the remote server, you can also send only the incremental changes.
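
To make the full versus incremental distinction concrete, here is a small sketch that prints the corresponding pipeline rather than running it. It assumes the snapshot is piped over ssh to a host that has a pool of the same name. zfs_send_cmd and the host name in the usage line are hypothetical.

```shell
# zfs_send_cmd DATASET NEW_SNAPSHOT REMOTE_HOST [PREVIOUS_SNAPSHOT]
# Prints a full send, or an incremental one when PREVIOUS_SNAPSHOT is given.
zfs_send_cmd() {
    if [ -n "${4:-}" ]; then
        echo "zfs send -i $1@$4 $1@$2 | ssh $3 zfs receive $1"
    else
        echo "zfs send $1@$2 | ssh $3 zfs receive $1"
    fi
}

# Usage: the first call prints a full copy, the second only the changes since snap1.
#   zfs_send_cmd mysqldata/mysql/data snap1 backuphost
#   zfs_send_cmd mysqldata/mysql/data snap2 backuphost snap1
```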


ZFS can automatically hardlink together files (or records) that have identical content. Although interesting, if you have a lot of redundant data, the dedup feature is very intensive. I don’t see a practical use case of dedup for databases except maybe for a backup server.

This concludes this first post about ZFS, stay tuned for more.



Webinar Wednesday, October 18, 2017: How to Scale with MongoDB

Join Percona’s Senior Technical Services Engineer Adamo Tonete as he presents How To Scale with MongoDB on Wednesday, October 18, 2017, at 11:00 am PDT / 2:00 pm EDT (UTC-7).

In this webinar, we will talk about how to scale with MongoDB, up to thousands of writes and reads per second. What are the common issues when you scale with MongoDB? Is it better to shard or to add further secondaries?

We will walk through many common scaling situations, and through the steps needed to deploy a sharded cluster: from a single instance to a sharded environment. We will also talk about common mistakes/pitfalls a company can make when scaling its database – and how to avoid such situations.

Register for the webinar.

Adamo Tonete, Senior Technical Services Engineer

Adamo joined Percona in 2015, after working as a MongoDB/MySQL database administrator for three years. As the main database member of a startup, he was responsible for suggesting the best architecture and data flows for a worldwide company in a 24×7 environment. Before that, he worked as a Microsoft SQL Server DBA for a large e-commerce company, mainly on performance tuning and automation. Adamo has almost eight years of experience working as a DBA, and in the past three has moved to NoSQL technologies without giving up relational databases. He likes to play video games and study everything that is related to engines. Adamo lives with his wife in São Paulo, Brazil.


Looking at Disk Utilization and Saturation

In this blog post, I will look at disk utilization and saturation.

In my previous blog post, I wrote about CPU utilization and saturation, the practical difference between them and how different CPU utilization and saturation impact response times. Now we will look at another critical component of database performance: the storage subsystem. In this post, I will refer to the storage subsystem as “disk” (as a casual catch-all). 

The most common tool for command line IO performance monitoring is iostat, which shows information like this:

root@ts140i:~# iostat -x nvme0n1 5
Linux 4.4.0-89-generic (ts140i)         08/05/2017      _x86_64_        (4 CPU)
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.51    0.00    2.00    9.45    0.00   88.04
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 3555.57 5887.81 52804.15 87440.73    29.70     0.53    0.06    0.13    0.01   0.05  50.71
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.60    0.00    1.06   20.77    0.00   77.57
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          0.50    0.00    1.26    6.08    0.00   92.16
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7653.20    0.00 113497.60     0.00    29.66     0.99    0.13    0.13    0.00   0.12  93.52

The first report shows the average performance since system start. In some cases, it is useful to compare the current load to the long-term average. In this case, as it is a test system, it can be safely ignored. Each following report shows the performance metrics over a five-second interval (as specified on the command line).



The iostat command reports utilization information in the %util column, and you can look at saturation by looking at either the average request queue size (the avgqu-sz column) or the r_await and w_await columns (which show the average wait for read and write operations). If these go well above “normal,” the device is over-saturated.
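To make the column references concrete, here is a minimal Python sketch that pairs the header with one device line from the output above and pulls out the utilization and saturation columns. The helper name parse_iostat_line is hypothetical, and the sketch assumes the older sysstat column layout shown in this post (newer versions name some columns differently).

```python
# Header and one sample line copied from the iostat -x output above.
HEADER = "Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util"
SAMPLE = "nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68"


def parse_iostat_line(header, line):
    """Map iostat -x column names to float values for one device line."""
    names = header.split()[1:]      # drop the "Device:" label
    fields = line.split()
    device, values = fields[0], fields[1:]
    stats = dict(zip(names, map(float, values)))
    return device, stats


device, stats = parse_iostat_line(HEADER, SAMPLE)
print(device, "utilization:", stats["%util"], "%")
print("queue size:", stats["avgqu-sz"], "read await (ms):", stats["r_await"])
```

Looking the values up by column name rather than by position makes it harder to mix up await, r_await and w_await when comparing runs.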

As in my previous blog post, we’ll perform some Sysbench runs and observe how the iostat command line tool and Percona Monitoring and Management graphs behave.

To focus specifically on the disk, we’re using the Sysbench fileio test. I’m using just one 100GB file, as I’m using DirectIO so all requests hit the disk directly. I’m also using “sync” request submission mode so I can get better control of request concurrency.

I’m using an Intel 750 NVME SSD in this test (though it does not really matter).

Sysbench FileIO 1 Thread

root@ts140i:/mnt/data# sysbench  --threads=1 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      7113.16
   writes/s:                     0.00
   fsyncs/s:                     0.00
   read, MiB/s:                  111.14
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0001s
   total number of events:              4267910
Latency (ms):
        min:                                  0.07
        avg:                                  0.14
        max:                                  6.18
        95th percentile:                      0.17

A single thread run is always great as a baseline, as with only one request in flight we should expect the best response time possible (though typically not the best throughput possible).

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 7612.80    0.00 113507.20     0.00    29.82     0.97    0.13    0.13    0.00   0.12  93.68

Disk Latency

The Disk Latency graph confirms the disk IO latency we saw in the iostat command, and it will be highly device-specific. We use it as a baseline against which to compare changes at higher concurrency.

Disk IO Utilization


Disk IO utilization is close to 100% even though we have just one outstanding IO request (queue depth). This is the problem with Linux disk utilization reporting: unlike with CPUs, Linux does not have direct visibility into how the IO device is designed. How many “execution units” does it really have? How are they utilized? Single spinning disks can be seen as a single execution unit, while RAID, SSDs and cloud storage (such as EBS) have more than one.

Disk Load


This graph shows the disk load (or request queue size), which roughly matches the number of threads that are hitting disk as hard as possible.

Saturation (IO Load)


The IO load on the Saturation Metrics graph shows pretty much the same numbers. The only difference is that unlike Disk IO statistics, it shows the summary for the whole system.

Sysbench FileIO 4 Threads

Now let’s increase IO to four concurrent threads and see how the disk responds:

sysbench  --threads=4 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      26248.44
   writes/s:                     0.00
   fsyncs/s:                     0.00
   read, MiB/s:                  410.13
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0002s
   total number of events:              15749205
Latency (ms):
        min:                                  0.06
        avg:                                  0.15
        max:                                  8.73
        95th percentile:                      0.21

We can see the number of requests scales almost linearly, while request latency changes very little: 0.14ms vs. 0.15ms. This shows the device has enough execution units internally to handle the load in parallel, and there are no other bottlenecks (such as the connection interface).

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 28808.60    0.00 427668.00     0.00    29.69     4.05    0.14    0.14    0.00   0.03  99.92

[Graphs: Disk Latency, Disk Utilization, Disk Load, Saturation Metrics (IO Load)]

These stats and graphs show an interesting picture: we barely see a response time increase for IO requests, while utilization inches closer to 100% (with four threads submitting requests all the time, it is hard to catch a moment when the disk does not have any requests in flight). The load is near four (showing the disk has to handle four requests at a time on average).

Sysbench FileIO 16 Threads

root@ts140i:/mnt/data# sysbench  --threads=16 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      76845.96
   writes/s:                     0.00
   fsyncs/s:                     0.00
   read, MiB/s:                  1200.72
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0003s
   total number of events:              46107727
Latency (ms):
        min:                                  0.07
        avg:                                  0.21
        max:                                  9.72
        95th percentile:                      0.36

Going from four to 16 threads, we again see a good throughput increase with a mild response time increase. If you look at the results closely, you will notice one more interesting thing: the average response time has increased from 0.15ms to 0.21ms (which is a 40% increase), while the 95% response time has increased from 0.21ms to 0.36ms (which is 71%). I also ran a separate test measuring 99% response time, and the difference is even larger: 0.26ms vs. 0.48ms (or 84%).

This is an important observation to make: once saturation starts to happen, the variance is likely to increase and some of the requests will be disproportionately affected (beyond what the average response time shows).
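As a quick check of this arithmetic, the sketch below recomputes the relative growth of the average and the 95th percentile between the 4-thread and 16-thread runs. The pct_increase helper is a name introduced here for illustration; the latency figures are the sysbench results quoted above.

```python
# Average and 95th percentile latency (ms) from the sysbench runs above,
# keyed by thread count.
latency = {
    4: {"avg": 0.15, "p95": 0.21},
    16: {"avg": 0.21, "p95": 0.36},
}


def pct_increase(old, new):
    """Relative increase from old to new, in percent."""
    return (new - old) / old * 100


avg_growth = pct_increase(latency[4]["avg"], latency[16]["avg"])
p95_growth = pct_increase(latency[4]["p95"], latency[16]["p95"])
print(f"avg: +{avg_growth:.0f}%  p95: +{p95_growth:.0f}%")
```

The tail grows faster than the mean, which is the signal to watch for when saturation begins.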

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 82862.20    0.00 1230567.20     0.00    29.70    16.33    0.20    0.20    0.00   0.01 100.00

[Graphs: Disk IO Latency, Disk IO Utilization, Disk Load, Saturation Metrics (IO Load)]

The graphs show an expected figure: the disk load and IO load from saturation are up to about 16, and utilization remains at 100%.

One thing to notice is increased jitter in the graphs. IO utilization jumps to over 100% and disk IO load spikes to 18, when there should not be that many requests in flight. This comes from how the information is gathered. An attempt is made to sample the data every second, but on a loaded system the sampling itself takes time: sometimes we try to get the data for a one-second interval but really get data for a 1.05- or 0.95-second interval. When the math is applied to the data, it creates spikes and dips in the graph where there should be none. You can just ignore them if you’re looking at the big picture.

Sysbench FileIO 64 Threads

Finally, let’s run sysbench with 64 concurrent threads hitting the disk:

root@ts140i:/mnt/data# sysbench  --threads=64 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      127840.59
   writes/s:                     0.00
   fsyncs/s:                     0.00
   read, MiB/s:                  1997.51
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0014s
   total number of events:              76704744
Latency (ms):
        min:                                  0.08
        avg:                                  0.50
        max:                                  9.34
        95th percentile:                      1.25

We can see the average has risen from 0.21ms to 0.50ms (more than two times), and the 95th percentile has more than tripled, from 0.36ms to 1.25ms. From a practical standpoint, we can see some saturation starting to happen, but we’re still not seeing a linear response time increase with increasing numbers of parallel operations as we saw with CPU saturation. I guess this points to the fact that this IO device has a lot of parallel capacity inside and can process requests more effectively (even going from 16 to 64 concurrent threads).

Over the series of tests, as we increased concurrency from one to 64, we saw the average response time increase from 0.14ms to 0.5ms (or roughly 3.6 times). The 95th percentile response time grew over the same range from 0.17ms to 1.25ms (or about seven times). For practical purposes, this is where we see the IO device saturation start to show.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 138090.20    0.00 2049791.20     0.00    29.69    65.99    0.48    0.48    0.00   0.01 100.24

We’ll skip the rest of the graphs as they basically look the same, just with higher latency and 64 requests in flight.

Sysbench FileIO 256 Threads

root@ts140i:/mnt/data# sysbench  --threads=256 --time=600 --max-requests=0  fileio --file-num=1 --file-total-size=100G --file-io-mode=sync --file-extra-flags=direct --file-test-mode=rndrd run
File operations:
   reads/s:                      131558.79
   writes/s:                     0.00
   fsyncs/s:                     0.00
   read, MiB/s:                  2055.61
   written, MiB/s:               0.00
General statistics:
   total time:                          600.0026s
   total number of events:              78935828
Latency (ms):
        min:                                  0.10
        avg:                                  1.95
        max:                                 17.08
        95th percentile:                      3.89

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
nvme0n1           0.00     0.00 142227.60    0.00 2112719.20     0.00    29.71   268.30    1.89    1.89    0.00   0.01 100.00

With 256 threads, finally we’re seeing the linear growth of the average response time that indicates overload and queueing to process requests. There is no easy way to tell if it is due to the IO bus saturation (we’re reading 2GB/sec here) or if it is the internal device processing ability.  

As we’ve seen a less than linear increase in response time going from 16 to 64 connections, and a linear increase going from 64 to 256, we can see the “optimal” concurrency for this device: somewhere between 16 and 64 connections. This allows for peak throughput without a lot of queuing.
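One way to cross-check the iostat numbers from these runs is Little’s Law. It is not part of the original measurements, but it holds for any stable queue: average requests in flight is roughly throughput multiplied by average time per request. A minimal sketch using the r/s, await and avgqu-sz values from the iostat lines above:

```python
# (threads, r/s, await in ms, reported avgqu-sz) from the iostat output above.
samples = [
    (1, 7612.80, 0.13, 0.97),
    (4, 28808.60, 0.14, 4.05),
    (16, 82862.20, 0.20, 16.33),
    (64, 138090.20, 0.48, 65.99),
    (256, 142227.60, 1.89, 268.30),
]

estimates = []
for threads, reads_per_sec, await_ms, reported_queue in samples:
    # Little's Law: requests in flight = throughput * time per request.
    estimated_queue = reads_per_sec * await_ms / 1000.0
    estimates.append(estimated_queue)
    print(f"{threads:>3} threads: estimated {estimated_queue:7.2f}, "
          f"iostat reported {reported_queue:7.2f}")
```

The estimates land within a few percent of avgqu-sz in every run. Between 64 and 256 threads, throughput barely moves while the queue and the wait time both grow about fourfold, which is what queueing behind a saturated device looks like.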

Before we get to the summary, I want to make an important note about this particular test. The test is a random reads test, which is a very important pattern for many database workloads, but it might not be the dominant load for your environment. You might be write-bound as well, or have mainly sequential IO access patterns (which could behave differently). For those other workloads, I hope this gives you some ideas on how to also analyze them.

Another Way to Think About Saturation

When I asked the Percona staff for feedback on this blog post, my colleague Yves Trudeau provided another way to think about saturation: measure saturation as the percent increase in the average response time compared to the single-user case. Like this:

Threads   Avg Response Time (ms)   Saturation
1         0.14                     baseline
4         0.15                     1.07x or 7%
16        0.21                     1.5x or 50%
64        0.50                     3.6x or 260%
256       1.95                     13.9x or 1290%
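The same ratios can be recomputed from the average response times with a few lines of Python. This is only a sketch of the arithmetic, with variable names chosen here for illustration:

```python
# Average response time (ms) per thread count, from the sysbench runs above.
avg_response_ms = {1: 0.14, 4: 0.15, 16: 0.21, 64: 0.50, 256: 1.95}

baseline = avg_response_ms[1]
saturation = {}
for threads, avg in avg_response_ms.items():
    factor = avg / baseline  # response time relative to the single-thread run
    saturation[threads] = factor
    print(f"{threads:>3} threads: {factor:5.2f}x ({(factor - 1) * 100:4.0f}% increase)")
```

The percentages in the table were derived from the rounded factors, so the unrounded values printed here can differ by a few points.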



Summary

We can see how understanding disk utilization and saturation is much more complicated than for the CPU:

  • The Utilization metric (as reported by iostat and by PMM) is not very helpful for showing true storage utilization, as it only measures the time when there is at least one request in flight. If you had the same metric for the CPU, it would correspond to something running on at least one of the cores (not very useful for highly parallel systems).

  • Unlike with a CPU, Linux tools do not provide us with information about the structure of the underlying storage and how much parallel load it should be able to handle without saturation. What is more, storage might well have different low-level resources that cause saturation. For example, it could be the network connection, the SATA bus or even the kernel IO stack (for older kernels and very fast storage).
  • Saturation as measured by the number of requests in flight is helpful for guessing whether there might be saturation, but since we do not know how many requests the device can efficiently process concurrently, just looking at the raw metric doesn’t let us determine that the device is overloaded.
  • Avg Response Time is a great metric for looking at saturation, but as with any response time, you can’t say in absolute terms what value is good or bad for this device. You need to look at it in context and compare it to the baseline. When you’re looking at the Avg Response Time, make sure you look at read and write request response times separately, and keep the average request size in mind to ensure you are comparing apples to apples.

Better Than Linear Scaling


In this blog, we’ll look at how to achieve better-than-linear scaling.

Scalability is the capability of a system, network or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth. For example, we consider a system scalable if it is capable of increasing its total output under an increased load when resources (typically hardware) are added.

It is often accepted as a fact that systems (in particular databases) can’t scale better than linearly. By this I mean when you double resources, the expected performance doubles, at best (and often is less than doubled).  

We can attribute this assumption to Amdahl’s law, and later to the Universal Scalability Law. Both laws imply that it is impossible to achieve better than linear scalability. To be precise, this is practically correct for single-server systems when the added resources are only CPU units.

Multi-node systems

However, I think database systems should no longer be seen as single-server systems. MongoDB and Cassandra have had multi-node auto-sharding capabilities for a long time. We are about to see the rise of strongly consistent SQL-based multi-node systems. And even MySQL is frequently deployed with manual sharding across multiple nodes.

Products like Vitess offer auto-sharding for MySQL, and with ProxySQL (which I will use in my experiment) you can set up a basic sharding scheme.

I describe multi-node setups because in this environment it is possible to achieve much better than linear scalability. I will show this below.

Why is this important?

Understanding scalability of multi-node systems is important for resource planning, and understanding how much of a potential performance gain we can expect when we add more nodes. This is especially interesting for cloud deployments.

How is it possible?

I’ve written about how the size of available memory (cache) affects performance. When we add nodes to the deployment, we effectively increase not only CPU cores but also the memory that comes with each node (and we add extra IO capacity). So, with increasing node counts, we also increase available memory (and cache). As the graphs in that post show, the effect of extra memory can be non-linear (and actually better than linear). Playing on this fact, we can achieve better-than-linear scaling in a sharded setup. I am going to show an experimental setup that achieves this.

Experimental setup

To show the sharded setup we will use ProxySQL in front of N MySQL servers (shards). We also will use sysbench with 60 tables (4 million rows each, uniform distribution).

  • For one shard, the shard contains all 60 tables
  • For two shards, each shard contains 30 tables
  • For three shards, each shard contains 20 tables
  • For six shards, each shard contains ten tables

So schematically, it looks like this:

[Diagrams: one shard; two shards; six shards]
We want to measure how the performance (for both throughput and latency) changes when we go from 1 to 2, to 3, to 4, to 5 and to 6 shards.

For the single shard, I used a Google Cloud instance with eight virtual CPUs and 16GB of RAM, where 10GB is allocated for the innodb_buffer_pool_size.

The database size (for all 60 tables) is about 51GB for the data, and 7GB for indexes.

For this we will use a sysbench read-only uniform workload, and ProxySQL helps to perform query routing. We will use ProxySQL query rules, and set sharding as:

# Assumptions: ProxySQL admin listens on 127.0.0.1, and $shards holds the shard count.
admin="mysql -u admin -padmin -h 127.0.0.1 -P6032"
$admin -e "DELETE FROM mysql_query_rules"
for i in {1..60}
do
  hg=$(( i % shards + 1 ))
  # The trailing \s (assumed) keeps sbtest1 from also matching sbtest10..sbtest19.
  $admin -e "INSERT INTO mysql_query_rules (rule_id,active,username,match_pattern,destination_hostgroup,apply) VALUES ($i,1,'root','sbtest${i}\s',$hg,1);"
done
$admin -e "LOAD MYSQL QUERY RULES TO RUNTIME;"
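To see how the modulo rule in the loop above spreads tables across hostgroups, here is a small Python sketch of the same arithmetic (hg = i % shards + 1). The function names are introduced here for illustration and are not part of ProxySQL.

```python
from collections import Counter


def hostgroup_for_table(table_number, shards):
    """Same arithmetic as the shell loop: hg = i % shards + 1."""
    return table_number % shards + 1


def tables_per_hostgroup(shards, tables=60):
    """Count how many sbtest tables land on each hostgroup."""
    return Counter(hostgroup_for_table(i, shards) for i in range(1, tables + 1))


for shards in (1, 2, 3, 6):
    print(shards, "shard(s):", dict(sorted(tables_per_hostgroup(shards).items())))
```

Because 60 is divisible by every shard count used in the test, each hostgroup receives the same number of tables, including the four-shard and five-shard cases (15 and 12 tables each).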

Command line for sysbench 1.0.4:
sysbench oltp_read_only.lua --mysql-socket=/tmp/proxysql.sock --mysql-user=root --mysql-password=test --tables=60 --table-size=4000000 --threads=60 --report-interval=10 --time=900 --rand-type=pareto run

The results

Nodes   Throughput   Speedup vs. 1 node   Latency, ms
1       245          1.00                 244.88
2       682          2.78                 87.95
3       1659         6.77                 36.16
4       2748         11.22                21.83
5       3384         13.81                17.72
6       3514         14.34                17.07

As we can see, the performance improves much better than linearly.

With five nodes, the improvement is 13.81 times compared to the single node.

The sixth node does not add much benefit, as at this point the data practically fits into memory (with five nodes, the total cache size is 50GB compared to the 51GB data size).
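The relationship between node count, speedup and cache coverage can be tabulated with a short Python sketch. It uses the throughput figures from the results above and makes one simplifying assumption that is not in the original test: the cached share of the data is estimated as the combined buffer pool size divided by the 51GB data size, ignoring indexes.

```python
# Throughput per node count, from the results table above.
throughput = {1: 245, 2: 682, 3: 1659, 4: 2748, 5: 3384, 6: 3514}

BUFFER_POOL_GB = 10   # innodb_buffer_pool_size per node
DATA_GB = 51          # data size for all 60 tables (indexes excluded)

rows = []
for nodes, tps in throughput.items():
    speedup = tps / throughput[1]
    per_node = speedup / nodes                       # > 1.0 means better than linear
    cached = min(1.0, nodes * BUFFER_POOL_GB / DATA_GB)
    rows.append((nodes, speedup, per_node, cached))
    print(f"{nodes} node(s): speedup {speedup:5.2f}, "
          f"per-node efficiency {per_node:4.2f}, "
          f"data that fits in cache {cached:4.0%}")
```

Per-node efficiency stays above 1.0 for every multi-node run and begins to fall once the estimated cached share approaches 100%, which matches the flattening between five and six nodes.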

Factors that affect multi-node scaling

How can we model/predict the performance gain? There are multiple factors to take into account: the size of the active working set, the available memory size and (also importantly) the distribution of access to the working set (with uniform distribution being the best-case scenario, and all access going to a single row being the opposite corner case, where speedup is impossible). We also need to keep network speed in mind: if we come close to using all available network bandwidth, it will be impossible to get a significant improvement.


In multi-node, auto-scaling, auto-sharding distributed systems, the traditional scalability models do not provide much help. We need to have a better framework to understand how multiple nodes affect performance.


The Puzzling Performance of the Samsung 960 Pro


In this blog post, I’ll take a look at the performance of the Samsung 960 Pro NVMe SSD.

First, I know the Samsung 960 Pro is a consumer NVMe SSD, not intended for sustained data center workloads. But the AnandTech review looked good enough that I decided to take it for a test spin to see if it would work well with MySQL benchmarks.

Before that, I decided to do a simple sysbench file IO test to see how the drive handled sustained workloads, and whether it would start acting up.

My expectation for a consumer SSD is that its write consistency will suffer. Many of those drives can sustain high bursts for short periods of time but have to slow down to keep up with wear leveling (and other internal activities SSDs must do). This is not what I saw, however.

I did a benchmark on E5-2630L V3, 64GB RAM Ubuntu 16.04 LTS, XFS Filesystem, Samsung 960 Pro 512GB (FW:1B6QCXP7):  

sysbench --num-threads=64 --max-time=86400 --max-requests=0 --test=fileio --file-num=1 --file-total-size=260G --file-io-mode=async --file-extra-flags=direct --file-test-mode=rndrd run

Note: I used asynchronous direct IO to keep it close to how MySQL (InnoDB) submits IO requests.

This is what the “Read Throughput” graph looks like in Percona Monitoring and Management (PMM):


As you can see, in addition to some reasonable ebbs and flows we have some major dips from about 1.5GB/sec of random reads to around 800MB/sec. This almost halves the performance. We can clearly see two of those dips, with the third one starting when the test ended.  

What is really interesting is that when I ran a read-write test, it performed much more uniformly:

sysbench --num-threads=64 --max-time=86400 --max-requests=0 --test=fileio --file-num=1 --file-total-size=260G --file-io-mode=async --file-extra-flags=direct --file-test-mode=rndrw run


Any ideas on what the cause of such strange periodic IO performance regression for reads could be?

This does not look like overheating throttling. It is much too regular for that (and I checked the temperature: it wasn’t any different during the performance regression).

One theory I have is “read disturb management”: could the SSD need to rewrite the data after so many reads? By my calculations, every cell is read some 166 times during the eight hours between those gaps. This doesn’t sound like a lot.
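The post does not show how the 166 figure was reached. One plausible reconstruction, assuming reads are spread evenly over the 260GB test file and taking roughly 1.5GB/sec as the sustained read rate, is sketched below:

```python
# Figures taken from the test above; decimal units are an assumption.
read_rate_gb_per_sec = 1.5      # approximate random-read throughput
hours_between_dips = 8          # spacing between the throughput dips
file_size_gb = 260              # --file-total-size

total_read_gb = read_rate_gb_per_sec * hours_between_dips * 3600
reads_per_location = total_read_gb / file_size_gb
print(f"{total_read_gb:.0f} GB read, about {reads_per_location:.0f} passes over the file")
```

The result depends heavily on the assumed read rate, so treat it as an order-of-magnitude estimate.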

What are your thoughts?


Using NVMe Command Line Tools to Check NVMe Flash Health


In this blog post, I’ll look at the types of NVMe flash health information you can get from using the NVMe command line tools.

Checking SATA-based drive health is easy. Whether it’s an SSD or older spinning drive, you can use the smartctl command to get a wealth of information about the device’s performance and health. As an example:

root@blinky:/var/lib/mysql# smartctl -A /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-62-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke,
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
 1 Raw_Read_Error_Rate     0x002f   100   100   000    Pre-fail  Always       -       0
 5 Reallocated_Sector_Ct   0x0032   100   100   010    Old_age   Always       -       0
 9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       41
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       2
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   065   059   000    Old_age   Always       -       35 (Min/Max 21/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
202 Unknown_SSD_Attribute   0x0030   100   100   001    Old_age   Offline      -       0
206 Unknown_SSD_Attribute   0x000e   100   100   000    Old_age   Always       -       0
246 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       145599393
247 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       4550280
248 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       582524
180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   000   000   000    Pre-fail  Always       -       1260
210 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0



While smartctl might not know all vendor-specific SMART values, you can typically Google the drive model along with “smart attributes” and find documents with more details.

If you move to newer generation NVMe-based flash storage, smartctl won’t work anymore (at least it doesn’t work with the packages available for Ubuntu 16.04, which is what I’m running). It looks like support for NVMe in Smartmontools is coming, and it would be great to get a single tool that supports both SATA and NVMe flash storage.

In the meantime, you can use the nvme tool available from the nvme-cli package. It provides some basic information for NVMe devices.

To get information about the NVMe devices installed:

root@alex:~# nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S3EVNCAHB01861F      Samsung SSD 960 PRO 1TB                  1.2      1         689.63  GB /   1.02  TB    512   B +  0 B   1B6QCXP7

To get SMART information:

root@alex:~# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 34 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 3,465,389
data_units_written                  : 9,014,689
host_read_commands                  : 89,719,366
host_write_commands                 : 134,671,295
controller_busy_time                : 310
power_cycles                        : 11
power_on_hours                      : 21
unsafe_shutdowns                    : 8
media_errors                        : 0
num_err_log_entries                 : 1
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 34 C
Temperature Sensor 2                : 47 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C

To get additional SMART information (not all devices support it):

root@ts140i:/home/pz/workloads/1m# nvme smart-log-add /dev/nvme0
Additional Smart Log for NVME device:nvme0 namespace-id:ffffffff
key                               normalized raw
program_fail_count              : 100%       0
erase_fail_count                : 100%       0
wear_leveling                   :  62%       min: 1114, max: 1161, avg: 1134
end_to_end_error_detection_count: 100%       0
crc_error_count                 : 100%       0
timed_workload_media_wear       : 100%       37.941%
timed_workload_host_reads       : 100%       51%
timed_workload_timer            : 100%       446008 min
thermal_throttle_status         : 100%       0%, cnt: 0
retry_buffer_overflow_count     : 100%       0
pll_lock_loss_count             : 100%       0
nand_bytes_written              : 100%       sectors: 16185227
host_bytes_written              : 100%       sectors: 6405605

Some of this information is self-explanatory, and some of it isn’t. After looking at the NVMe specification document, here is my read on some of the data:

Available Spare. Contains a normalized percentage (0 to 100%) of the remaining spare capacity that is available.

Available Spare Threshold. When the Available Spare capacity falls below the threshold indicated in this field, an asynchronous event completion can occur. The value is indicated as a normalized percentage (0 to 100%).

(Note: I’m not quite sure what the practical meaning of “asynchronous event completion” is, but it looks like something to avoid!)

Percentage Used. Contains a vendor specific estimate of the percentage of the NVM subsystem life used, based on actual usage and the manufacturer’s prediction of NVM life.

(Note: the number can be more than 100% if you’re using storage for longer than its planned life.)

Data Units Read/Data Units Written. This is the number of 512-byte data units that are read/written, but it is measured in an unusual way: a value of one corresponds to 1000 of the 512-byte units. So you can multiply this value by 512000 to get the value in bytes. It does not include metadata accesses.

Host Read/Write Commands. The number of commands of the appropriate type issued. Using this value together with the data units value above, you can compute the average IO size for “physical” reads and writes.

Controller Busy Time. Time in minutes that the controller was busy servicing commands. This can be used to gauge long-term storage load trends.

Unsafe Shutdowns. The number of times a power loss happened without a shutdown notification being sent. Depending on the NVMe device you’re using, an unsafe shutdown might corrupt user data.

Warning Temperature Time/Critical Temperature Time. The time in minutes a device operated above a warning or critical temperature. It should be zeroes.

Wear_Leveling. This shows how much of the rated cell life was used, as well as the min/max/avg write count for different cells. In this case, it looks like the cells are rated for 1800 writes and about 1100 on average were used.

Timed Workload Media Wear. The media wear by the current “workload.” This device allows you to measure some statistics from the time you reset them (called the “workload”) in addition to showing the device lifetime values.

Timed Workload Host Reads. The percentage of IO operations that were reads (since the workload timer was reset).

Thermal Throttle Status. This shows if the device is throttled due to overheating, and when there were throttling events in the past.

Nand Bytes Written. The bytes written to NAND cells. For this device, the measured unit seems to be in 32MB values. It might be different for other devices.

Host Bytes Written. The bytes written to the NVMe storage from the system. This unit is also in 32MB values. The scale of these values is not very important, as they are most helpful for finding the write amplification of your workload: the ratio of writes to NAND to writes from the host. For this example, the Write Amplification Factor (WAF) is 16185227 / 6405605 = 2.53.
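The conversions described in this list can be worked through in a few lines of Python. The sketch below uses the counters from the two sample outputs above; note that the smart-log and smart-log-add samples come from different machines, so the numbers illustrate the arithmetic and do not describe a single device.

```python
# Values copied from the nvme smart-log and smart-log-add output above.
data_units_read = 3_465_389
data_units_written = 9_014_689
host_read_commands = 89_719_366
host_write_commands = 134_671_295
nand_sectors_written = 16_185_227
host_sectors_written = 6_405_605

BYTES_PER_DATA_UNIT = 512 * 1000  # one data unit = 1000 blocks of 512 bytes

bytes_read = data_units_read * BYTES_PER_DATA_UNIT
bytes_written = data_units_written * BYTES_PER_DATA_UNIT

# Average "physical" IO size: bytes transferred divided by commands issued.
avg_read_size = bytes_read / host_read_commands
avg_write_size = bytes_written / host_write_commands

# Write amplification: NAND writes divided by host writes (same unit on both sides).
waf = nand_sectors_written / host_sectors_written

print(f"read {bytes_read / 1e12:.2f} TB, written {bytes_written / 1e12:.2f} TB")
print(f"avg read {avg_read_size / 1024:.1f} KiB, avg write {avg_write_size / 1024:.1f} KiB")
print(f"WAF {waf:.2f}")
```

Since the WAF is a ratio of two counters in the same unit, the uncertainty about the 32MB scale does not affect it.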

As you can see, the NVMe command line tools provide a lot of good information for understanding the health and performance of NVMe devices. You don’t need to use vendor specific tools (like isdct).


How to Replace MySQL with Percona Server on a CPanel, WHM VPS or Dedicated Server

In this blog post, we’ll look at how to replace MySQL with Percona Server for MySQL on a CPanel, WHM VPS or dedicated server.

In general, CPanel and WHM have been leaning towards support of MariaDB over other flavors. This is partly due to the upstream repos replacing the MySQL package with MariaDB (for example, on CentOS).

MySQL 5.6 is still supported though, which means they are keeping support for core MySQL products. But if you want to get some extra performance enhancements or enterprise features for free, without getting too many bells and whistles, you might want to install Percona Server.

I’ve done this work on a new dedicated server with the latest WHM and CPanel on CentOS 7, with MySQL 5.6 installed. Besides the backup, this is a fairly quick process.

It’s pretty simple. From the Percona Server for MySQL 5.7 installation doc, we can get the YUM repo. (Run the commands as root if you can, otherwise with sudo.)

yum install

Now that we have the repo, let’s install Percona XtraBackup in case we need to roll this back at any point:

yum install percona-xtrabackup

This server had a drive mounted at /backup, so I created the backup with the following commands:

xtrabackup --target-dir=/backup/xtrabackup --backup
xtrabackup --target-dir=/backup/xtrabackup --prepare

Now that we have a good backup, let’s remove MySQL:

service mysql stop
yum remove 'MySQL*' 'mysql*'

Depending on your dependency chain, this could remove Percona XtraBackup, but that can be fixed. Let’s accept this uninstall.

Let’s install Percona Server for MySQL 5.7 and Percona Toolkit:

yum install Percona-Server-server-57 percona-toolkit percona-xtrabackup

Now that it’s installed, ensure the mysql service is running. If it isn’t, start it. Now let’s upgrade:


NOTE. This works if you can log in as root without a password; if you can’t, you will need to specify the





Once you run the upgrade, restart the mysql service:

service mysql restart

And there you go, you are now running on Percona Server for MySQL 5.7. If your managed providers tell you it’s not supported, don’t worry! It works as long as CPanel supports MySQL 5.6.

If you have any issues, just restore the backup.

NOTE: One thing to keep in mind is that 5.7 breaks CPanel’s ability to create users in MySQL. I believe this is due to the changes to the mysql.user table. If this is an issue for you, you can always use Percona Server for MySQL 5.6.


Three Things to Consider When Thinking About Containers


Containers like Docker and Rocket are getting more popular every day. In my conversations with customers, they consistently ask what containers are and how they can use them in their environment. If you’re as curious as most people, read on. . .

How did this happen?

From what I understand, containers grew out of Google’s (and others’) need for massive horizontal scale. Now, this is hardly a unique problem. At the time there were several different solutions out there that could help deploy and orchestrate the applications and infrastructure necessary to scale, namely virtual machines (VMs) and their orchestration services (like VMware’s vCenter). At the uber-massive scale that companies like Google were pushing, however, server virtualization had some serious drawbacks. Enter containers. . .

What is a container?

Essentially, the main difference between a container and a virtual machine is the amount of overhead involved in each unit.

A virtual machine is literally the components of an entire server, minus the physical hardware. Simplified, it’s the operating system and all additional components required to run the application. That’s a great way to separate applications for mobility, but it also requires a large amount of redundancy (and overhead).

A container is your application and all of its dependencies that run on top of the operating system, but not the operating system (OS) itself. The OS kernel is shared among all the containers on the system. This approach reduces the CPU, memory and disk overhead that virtual machines introduce by running a separate OS instance on every VM. At single system scale, this is not much of a consideration. But when you are talking about scale in the millions, it can lead to an incredible amount of savings without much of a reduction in functionality.

When you get down to the key differences, it essentially amounts to “at what level the hypervisor runs,” or if that’s too technical, the level at which the two technologies provide abstraction from their underlying components (take a look at the image below). For VMs, it abstracts the hardware from the OS. For containers, it abstracts the OS from the applications that interface with it. This is where the benefit comes from: containers allow applications to have isolation without requiring every application to have an additional copy of the operating system. Operating systems take up a great deal of a computer’s resources, so the ability to share that overhead — while getting isolation — can be extremely efficient. Google reported that they spin up roughly two billion containers per week. At that type of scale, you can see the necessity for efficiency!


How do I decide between VMs and containers?

This is the million dollar question. Currently, the two are often used together. If you’re running containers on an AWS EC2 instance, then you are running containers inside a VM. However, there are important differences to note:


Security = VMs

Whenever you are consolidating resources, regardless of whether it’s with VMs or containers, security should be top of mind. This is because the process that controls access to resources presents a single, high-value target for attack. If you can compromise that process, you can potentially gain control of all resources that are using it. Scary, right?


Cost reduction = Containers

As I said before, you can use both. However, containers have great potential to reduce the number of physical servers you rely on or your spend in any cloud environment, especially now that there are orchestration solutions like Kubernetes and Swarm to help you manage your environment.

The only other thing that you should consider before deciding what to put in a container is the next point. . .


Performance = Depends

If you have services (applications) that have very spiky workloads or require a great deal of resources, you have to make very careful decisions about what services you pair together. This is true of both VMs and containers. The reason is that there is a finite amount of resources on the physical machine that supports the services. The closer that machine gets to 100% utilization, the slower it responds to the service, and the slower your application runs.

The wrap up . . .

The container space is still maturing. There are sure to be many exciting announcements to come. One thing is certain: this isn’t a fad! While you don’t have to go and start “containerizing” everything in your environment, I would suggest that you start conversations regarding where they’re a good fit in your environment. After all, someone much smarter than I said “you’re either planning for success or failure. The only difference is deliberation.”



Using Ceph with MySQL


Over the last year, the Ceph world drew me in. Partly because of my taste for distributed systems, but also because I think Ceph represents a great opportunity for MySQL specifically and databases in general. The shift from local storage to distributed storage is similar to the shift from bare-disk host configurations to LVM-managed disk configurations.

Most of the work I’ve done with Ceph was in collaboration with folks from RedHat (mainly Brent Compton and Kyle Bader). This work resulted in a number of talks presented at the Percona Live conference in April and the RedHat Summit San Francisco at the end of June. I could write a lot about using Ceph with databases, and I hope this post is the first in a long series on Ceph. Before I start with use cases, setup configurations and performance benchmarks, I think I should quickly review the architecture and principles behind Ceph.

Introduction to Ceph

Inktank created Ceph a few years ago as a spin-off of the hosting company DreamHost. RedHat acquired Inktank in 2014 and now offers it as a storage solution. OpenStack uses Ceph as its dominant storage backend. This blog, however, focuses on a more general review and isn’t restricted to a virtual environment.

A simplistic way of describing Ceph is to say it is an object store, just like S3 or Swift. This is a true statement but only up to a certain point. There are minimally two types of nodes with Ceph: monitors and object storage daemons (OSDs). The monitor nodes are responsible for maintaining a map of the cluster or, if you prefer, the Ceph cluster metadata. Without access to the information provided by the monitor nodes, the cluster is useless. Redundancy and quorum at the monitor level are important.

Any non-trivial Ceph setup has at least three monitors. The monitors are fairly lightweight processes and can be co-hosted on OSD nodes (the other node type needed in a minimal setup). The OSD nodes store the data on disk, and a single physical server can host many OSD nodes – though it would make little sense for it to host more than one monitor node. The OSD nodes are listed in the cluster metadata (the “crushmap”) in a hierarchy that can span data centers, racks, servers, etc. It is also possible to organize the OSDs by disk types to store some objects on SSD disks and other objects on rotating disks.
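To see why three monitors is the usual minimum, here is a minimal sketch of the quorum arithmetic. It assumes the monitors need a strict majority to form a quorum; the function names are hypothetical and only illustrate the calculation.

```python
def quorum_size(monitors: int) -> int:
    """Smallest number of monitors that forms a strict majority."""
    if monitors < 1:
        raise ValueError("at least one monitor is required")
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """How many monitors can be lost while a majority remains."""
    return monitors - quorum_size(monitors)


for n in (1, 2, 3, 4, 5):
    print(f"{n} monitors: quorum={quorum_size(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

Under this assumption, one or two monitors tolerate no failure at all, three tolerate one, and a fourth adds nothing over three. This is why odd monitor counts are the common choice.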

With the information provided by the monitors’ crushmap, any client can access data based on a predetermined hash algorithm. There’s no need for a relaying proxy. This becomes a big scalability factor since these proxies can be performance bottlenecks. Architecture-wise, it is somewhat similar to the NDB API, where – given a cluster map provided by the NDB management node – clients can directly access the data on data nodes.

Ceph stores data in a logical container called a pool. With the pool definition comes a number of placement groups. The placement groups are shards of data across the pool. For example, on a four-node Ceph cluster (one OSD per node), if a pool is defined with 256 placement groups (pg), then each OSD will have 64 pgs for that pool. You can view the pgs as a level of indirection to smooth out the data distribution across the nodes. At the pool level, you define the replication factor (“size” in Ceph terminology).

The recommended values are a replication factor of three for spinners and two for SSD/Flash. I often use a size of one for ephemeral test VM images. A replication factor greater than one associates each pg with one or more pgs on the other OSD nodes.  As the data is modified, it is replicated synchronously to the other associated pgs so that the data it contains is still available in case an OSD node crashes.
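The placement group arithmetic can be sketched as follows. This is a rough estimate under stated assumptions: the pgs are spread evenly, every OSD has the same weight, and the helper name is hypothetical. With a replication factor above one, each pg has several copies, so the number of pg copies per OSD grows accordingly.

```python
def pg_copies_per_osd(pg_count: int, osd_count: int, size: int = 1) -> float:
    """Average number of pg copies each OSD holds for one pool.

    pg_count:  placement groups defined for the pool
    osd_count: OSDs the pool is spread over (assumed equally weighted)
    size:      replication factor of the pool
    """
    if pg_count <= 0 or osd_count <= 0 or size <= 0:
        raise ValueError("all arguments must be positive")
    return pg_count * size / osd_count


print(pg_copies_per_osd(256, 4))          # 64.0, the example from the text
print(pg_copies_per_osd(256, 4, size=3))  # 192.0 with three copies of each pg
```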

So far, I have just discussed the basics of an object store. But the ability to update objects atomically in place makes Ceph different from, and better than (in my opinion), other object stores. The underlying object access protocol, rados, updates an arbitrary number of bytes in an object at an arbitrary offset, exactly as if it were a regular file. That update capability allows for much fancier usage of the object store, such as support for block devices (rbd devices) and even a network file system (cephfs).

When using MySQL on Ceph, the rbd disk block device feature is extremely interesting. A Ceph rbd disk is basically the concatenation of a series of objects (4MB objects by default) that are presented as a block device by the Linux kernel rbd module. Functionally it is pretty similar to an iSCSI device as it can be mounted on any host that has access to the storage network and it is dependent upon the performance of the network.
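To make the "concatenation of objects" idea concrete, here is a small sketch that maps a byte offset on the block device to the object that holds it. It assumes the default 4MB object size and a plain layout with no custom striping; the function names are hypothetical and this is not how the rbd module is implemented.

```python
DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # 4MB objects, the default mentioned above


def locate(offset: int, object_size: int = DEFAULT_OBJECT_SIZE) -> tuple[int, int]:
    """Return (object index, offset inside that object) for a device offset."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def objects_touched(offset: int, length: int,
                    object_size: int = DEFAULT_OBJECT_SIZE) -> list[int]:
    """Return the indexes of every object covered by an IO request."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be non-negative and length positive")
    first = offset // object_size
    last = (offset + length - 1) // object_size
    return list(range(first, last + 1))


# A 16KB write starting 16KB past the 10GB mark lands in a single object.
print(locate(10 * 1024**3 + 16 * 1024))           # (2560, 16384)
# A 16KB write straddling the first 4MB boundary touches two objects.
print(objects_touched(4 * 1024**2 - 8192, 16384))  # [0, 1]
```

Because consecutive 4MB ranges land in different objects, and objects are spread over the OSDs, IO on a large database file ends up distributed across the cluster.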

The benefits of using Ceph

In a world striving for virtualization and containers, Ceph makes it easy to move database resources between hosts.

IO scalability
On a single host, you have access only to the IO capabilities of that host. With Ceph, you basically put in parallel all the IO capabilities of all the hosts. If each host can do 1000 iops, a four-node cluster could reach up to 4000 iops.

High availability
Ceph replicates data at the storage level, and provides resiliency against storage node crashes. A kind of DRBD on steroids…

Ceph rbd block devices support snapshots, which are quick to make and have no performance impacts. Snapshots are an ideal way of performing MySQL backups.

Thin provisioning
You can clone and mount Ceph snapshots as block devices. This is a useful feature to provision new database servers for replication, either with asynchronous replication or with Galera replication.

The caveats of using Ceph

Of course, nothing is free. Ceph use comes with some caveats.

Ceph reaction to a missing OSD
If an OSD goes down, the Ceph cluster starts re-copying the data that now has fewer copies than specified. Although good for high availability, the copying process significantly impacts performance. This implies that you cannot run a Ceph cluster with nearly full storage; you must have enough disk space to handle the loss of one node.

The “no out” OSD attribute mitigates this, and prevents Ceph from reacting automatically to a failure (but you are then on your own). When using the “no out” attribute, you must monitor and detect that you are running in degraded mode and take action. This resembles a failed disk in a RAID set. You can choose this behavior as default with the mon_osd_auto_mark_auto_out_in setting.
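As a rough illustration of the free-space requirement, the sketch below estimates how full a cluster can be while still being able to re-copy the data of a lost node onto the survivors. It assumes nodes of equal capacity and evenly spread data; the function and its parameters are hypothetical, and a real cluster needs an additional safety margin on top of this bound.

```python
def max_safe_fill(node_count: int, failures_tolerated: int = 1) -> float:
    """Largest fraction of raw capacity that can be used so that the
    surviving nodes can still hold all the data after losing
    `failures_tolerated` nodes (equal-sized nodes assumed)."""
    if failures_tolerated < 0:
        raise ValueError("failures_tolerated must be non-negative")
    if node_count <= failures_tolerated:
        raise ValueError("node_count must exceed failures_tolerated")
    return (node_count - failures_tolerated) / node_count


print(max_safe_fill(4))   # 0.75: a four-node cluster should stay below 75% full
print(max_safe_fill(10))  # 0.9: larger clusters can run fuller
```

The bound gets less restrictive as the cluster grows, which is one reason small clusters need proportionally more spare capacity.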

Every day and every week (deep), Ceph runs scrub operations that, although they are throttled, can still impact performance. You can modify the interval and the hours that control the scrub action. Once per day and once per week are likely fine, but you need to set osd_scrub_begin_hour and osd_scrub_end_hour to restrict the scrubbing to off hours. Also, scrubbing throttles itself to not put too much load on the nodes. The osd_scrub_load_threshold variable sets the threshold.
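When picking values for osd_scrub_begin_hour and osd_scrub_end_hour, it helps to check which hours of the day a given window covers, especially when the window wraps around midnight. The sketch below is my understanding of how such a window is interpreted (begin inclusive, end exclusive, equal values meaning the whole day); treat that interpretation as an assumption and check it against the documentation of your Ceph version. The function name is hypothetical.

```python
def scrub_allowed(hour: int, begin_hour: int, end_hour: int) -> bool:
    """Return True if `hour` (0-23) falls inside the scrub window.

    Assumed semantics: begin is inclusive, end is exclusive, and a window
    whose begin is not smaller than its end wraps around midnight.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    if begin_hour < end_hour:
        return begin_hour <= hour < end_hour
    # Window wraps around midnight (or covers the whole day when equal).
    return hour >= begin_hour or hour < end_hour


# A window from 22:00 to 06:00 keeps scrubbing out of business hours.
allowed = [h for h in range(24) if scrub_allowed(h, 22, 6)]
print(allowed)  # [0, 1, 2, 3, 4, 5, 22, 23]
```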

Ceph has many parameters, so tuning it can be complex and confusing. Since distributed systems push hardware, properly tuning Ceph might require things like distributing interrupt load among cores, pinning threads to cores and handling NUMA zones, especially if you use high-speed NVMe devices.


Hopefully, this post provided a good introduction to Ceph. I’ve discussed the architecture, the benefits and the caveats of Ceph. In future posts, I’ll present use cases with MySQL. These cases include performing Percona XtraDB Cluster SST operations using Ceph snapshots, provisioning async slaves and building HA setups. I also hope to provide guidelines on how to build and configure an efficient Ceph cluster.

Finally, a note for those who think cost and complexity put building a Ceph cluster out of reach. The picture below shows my home cluster (which I use quite heavily). The cluster comprises four ARM-based nodes (Odroid-XU4), each with a 2TB portable USB-3 hard disk, a 16GB eMMC flash disk and a gigabit Ethernet port.

I won’t claim record-breaking performance (although it’s decent), but cost-wise it is pretty hard to beat (at around $600)!


