About ZFS Performance


If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs. MyISAM quite interesting, and I’ll use it in this post.

To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level.

I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times. I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads. The InnoDB buffer pool was set to 2.5GB.

In both cases, the load is IO bound. The disk is doing exactly the allowed 3000 IOPS. The above graph appears to be a clear demonstration that XFS is much faster than ZFS, right? But is that really the case? The way the dataset has been created is extremely favorable to XFS, since there is absolutely no file fragmentation. Once you have all the files opened, a read IOP is just a seek to an offset and a read; XFS doesn't need to access any intermediate metadata. The above result is about as fair as saying MyISAM is faster than InnoDB based only on the table scan performance of unfragmented tables with a default configuration. ZFS is much less affected by file-level fragmentation, especially for point access patterns.

More on ZFS metadata

ZFS stores files in B-trees, in a fashion very similar to the way InnoDB stores data. To access a piece of data in a B-tree, you need to access the top-level page (often called the root node) and then one block per level down to the leaf node containing the data. With no cache, reading something from a three-level B-tree thus requires 3 IOPS.

Simple three-level B-tree

The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files' B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages. The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 2.5GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement.

To correctly set the ARC size to cache the metadata, you have two choices. First, you can guess values for the ARC size and experiment. Second, you can try to evaluate it by looking at the ZFS internal data. Let’s review these two approaches.

You'll often read or hear the ratio of 1GB of ARC for 1TB of data, which is about the same 0.1% ratio as for InnoDB. I have quoted that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB uses much less metadata than one with a recordsize of 16KB, because it has 8x fewer leaf pages. Fewer leaf pages require fewer B-tree internal nodes, hence less metadata. A filesystem with a recordsize of 128KB is excellent for sequential access, as it maximizes compression and reduces IOPS, but it is poor for small random access operations like the ones MySQL/InnoDB performs.
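The recordsize is a per-dataset property, and it only applies to blocks written after it is set, so it should be in place before the data is loaded. A minimal sketch, assuming the pool and dataset names used in the zdb example later in this post (mysqldata/data):

zfs set recordsize=16k mysqldata/data       # match InnoDB's default 16KB page size
zfs set compression=lz4 mysqldata/data      # lz4, as used for this benchmark
zfs get recordsize,compression mysqldata/data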

To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache-misses with the arcstat tool. Here’s an example:

# echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max
# arcstat -f time,arcsz,mm%,mhit,mread,dread,pread 10
    time  arcsz  mm%  mhit  mread  dread  pread
10:22:49   105M    0     0     0      0      0
10:22:59   113M  100     0    22     73      0
10:23:09   120M  100     0    20     68      0
10:23:19   127M  100     0    20     65      0
10:23:29   135M  100     0    22     74      0

You'll want 'mm%', the metadata miss percentage, to reach 0. So if the 'arcsz' column is no longer growing and you still see high values for 'mm%', that means the ARC is too small. Increase the value of 'zfs_arc_max' and continue to monitor.

If the ratio of 1GB of ARC for 1TB of data is good for a large ZFS recordsize, it is likely too small for a recordsize of 16KB. Do 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages? Although likely, let's verify.

The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures, including the B-tree list of pages for a given file. The tool takes the inode number of a file and the ZFS dataset as inputs. Here's an invocation for one of the tables of my dataset:

# cd /var/lib/mysql/data/sbtest
# ls -li | grep sbtest1.ibd
36493 -rw-r----- 1 mysql mysql 2441084928 avr 15 15:28 sbtest1.ibd
# zdb -ddddd mysqldata/data 36493 > zdb5d.out
# more zdb5d.out
Dataset mysqldata/data [ZPL], ID 90, cr_txg 168747, 4.45G, 26487 objects, rootbp DVA[0]=<0:1a50452800:200> DVA[1]=<0:5b289c1600:200> [L0 DMU objset] fletcher4 lz4 LE contiguous unique double size=800L/200P birth=3004977L/3004977P fill=26487 cksum=13723d4400:5d1f47fb738:fbfb87e6e278:1f30c12b7fa1d1
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
     36493    4    16K    16K  1.75G  2.27G   97.62  ZFS plain file
                                        168   bonus  System attributes
        dnode maxblkid: 148991
        path    /var/lib/mysql/data/sbtest/sbtest1.ibd
        uid     103
        gid     106
        atime   Sun Apr 15 15:04:13 2018
        mtime   Sun Apr 15 15:28:45 2018
        ctime   Sun Apr 15 15:28:45 2018
        crtime  Sun Apr 15 15:04:13 2018
        gen     3004484
        mode    100640
        size    2441084928
        parent  36480
        links   1
        pflags  40800000004
Indirect blocks:
               0 L3    0:1a4ea58800:400 4000L/400P F=145446 B=3004774/3004774
               0  L2   0:1c83454c00:1800 4000L/1800P F=16384 B=3004773/3004773
               0   L1  0:1eaa626400:1600 4000L/1600P F=128 B=3004773/3004773
               0    L0 0:1c6926ec00:c00 4000L/c00P F=1 B=3004773/3004773
            4000    L0 EMBEDDED et=0 4000L/6bP B=3004484
            8000    L0 0:1c69270c00:400 4000L/400P F=1 B=3004773/3004773
            c000    L0 0:1c7fbae400:800 4000L/800P F=1 B=3004736/3004736
           10000    L0 0:1ce3f53600:3200 4000L/3200P F=1 B=3004484/3004484
           14000    L0 0:1ce3f56800:3200 4000L/3200P F=1 B=3004484/3004484
           18000    L0 0:18176fa600:3200 4000L/3200P F=1 B=3004485/3004485
           1c000    L0 0:18176fd800:3200 4000L/3200P F=1 B=3004485/3004485
           [more than 140k lines truncated]

The last section of the above output is very interesting, as it shows the B-tree pages. The ZFS B-tree of the file sbtest1.ibd has four levels: L3 is the root page, L2 pages are the first level from the top, L1 pages are the second level, and L0 pages are the leaves. The metadata is essentially L3 + L2 + L1. When you change the recordsize property of a ZFS filesystem, you affect only the size of the leaf pages.

The non-leaf pages are always 16KB (4000L in the output) and they are always compressed on disk with lz4 (if I read the output correctly; the rootbp line above also shows lz4). In the ARC, these pages are stored uncompressed, so they use 16KB of memory each. The fanout of a ZFS B-tree, the largest possible ratio of the number of pages between adjacent levels, is 128. With the above output, we can easily calculate the amount of metadata we would need to cache all the non-leaf pages in the ARC.

# grep -c L3 zdb5d.out
1
# grep -c L2 zdb5d.out
9
# grep -c L1 zdb5d.out
1150
# grep -c L0 zdb5d.out
145447

So each of the 330 tables of the dataset has 1160 non-leaf pages (1 + 9 + 1150) and 145447 leaf pages, a ratio very close to the predicted 0.8%. For the complete dataset of 749GB, we would need the ARC to be, at a minimum, 6GB to fully cache all the metadata pages. Of course, there is some overhead to add: in my experiments, I found I needed to add about 15% for ARC overhead in order to have no metadata reads at all. The real minimum ARC size I should have used is thus almost 7GB.
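The arithmetic is easy to verify in the shell; a quick sketch using the counts from above:

# 330 tables x 1160 non-leaf pages x 16KB per page, plus ~15% ARC overhead
awk 'BEGIN { gb = 330 * 1160 * 16 / 1024 / 1024;
             printf "%.1f GB of metadata, %.1f GB with overhead\n", gb, gb * 1.15 }'
# 5.8 GB of metadata, 6.7 GB with overhead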

Of course, an ARC of 7GB on a server with 15GB of RAM is not small. Is there a way to do otherwise? The first option we have is to use a larger InnoDB page size, as allowed by MySQL 5.7. Instead of the regular InnoDB page size of 16KB, a page size of 32KB with a matching ZFS recordsize cuts the ARC size requirement in half, to 0.4% of the uncompressed size.

Similarly, an InnoDB page size of 64KB with a matching ZFS recordsize would further reduce the ARC size requirement to 0.2%. That approach works best when the dataset is highly compressible. I'll blog more about the use of larger InnoDB pages with ZFS in the near future. If larger InnoDB page sizes are not a viable option for you, you still have the option of using the ZFS L2ARC feature to save on the required memory.

So, let's propose a new rule of thumb for the required ARC/L2ARC size for a given dataset:

  • Recordsize of 128KB => 0.1% of the uncompressed dataset size
  • Recordsize of 64KB => 0.2% of the uncompressed dataset size
  • Recordsize of 32KB => 0.4% of the uncompressed dataset size
  • Recordsize of 16KB => 0.8% of the uncompressed dataset size

The ZFS revenge

In order to improve ZFS performance, I had 3 options:

  1. Increase the ARC size to 7GB
  2. Use a larger InnoDB page size, like 64KB
  3. Add an L2ARC

I was reluctant to grow the ARC to 7GB, which was nearly half the overall system memory. At best, the ZFS performance would only match XFS. A larger InnoDB page size would increase the CPU load for decompression on an instance with only two vCPUs; not great either. The last option, the L2ARC, was the most promising.

The choice of an i3.large instance type was not accidental: the instance has a 475GB ephemeral NVMe storage device. Let's use this storage for the ZFS L2ARC. The warming of an L2ARC device is not exactly trivial. In my case, with a 1GB ARC, I used:

echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max        # 1GB ARC
echo 838860800 > /sys/module/zfs/parameters/zfs_arc_meta_limit  # cap metadata at 800MB of the ARC
echo 67108864 > /sys/module/zfs/parameters/l2arc_write_max      # 64MB max L2ARC feed per interval
echo 134217728 > /sys/module/zfs/parameters/l2arc_write_boost   # 128MB while the ARC is still warming up
echo 4 > /sys/module/zfs/parameters/l2arc_headroom              # scan 4x l2arc_write_max ahead in the ARC lists
echo 16 > /sys/module/zfs/parameters/l2arc_headroom_boost
echo 0 > /sys/module/zfs/parameters/l2arc_norw                  # allow reads from the L2ARC while it is being written
echo 1 > /sys/module/zfs/parameters/l2arc_feed_again            # keep feeding aggressively (turbo warmup)
echo 5 > /sys/module/zfs/parameters/l2arc_feed_min_ms           # minimum delay between feed cycles
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch            # also cache prefetched (sequential) buffers
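These tunables assume the NVMe device is already attached to the pool as a cache vdev. A minimal sketch, assuming the pool name from the zdb example above (mysqldata) and that the ephemeral disk shows up as /dev/nvme0n1:

zpool add mysqldata cache /dev/nvme0n1   # attach the ephemeral NVMe as the L2ARC
zpool iostat -v mysqldata 10             # watch the cache device fill up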

I then ran 'cat /var/lib/mysql/data/sbtest/* > /dev/null' to force filesystem reads and warm the caches for all of the tables. A key setting here that allows the L2ARC to cache data is zfs_arc_meta_limit: it needs to be slightly smaller than zfs_arc_max in order to leave room for data to be cached in the ARC. Remember that the L2ARC is fed from the LRU of the ARC: you need data cached in the ARC in order to get data cached in the L2ARC. Using lz4 in ZFS on the sysbench dataset results in a compression ratio of only 1.28x; a more realistic dataset would compress by more than 2x, if not 3x. Nevertheless, since the content of the L2ARC is compressed, the 475GB device caches nearly 600GB of the dataset. The figure below shows the sysbench results with the L2ARC enabled:

Now the comparison is very different. ZFS completely outperforms XFS: 5000 qps for ZFS versus 3000 for XFS. The ZFS results could have been even higher, but the two vCPUs of the instance were clearly the bottleneck. Properly configured, ZFS can be pretty fast. Of course, I could use flashcache or bcache with XFS and improve the XFS results, but these technologies are way more exotic than the ZFS L2ARC. Also, only the L2ARC stores data in a compressed form, maximizing the use of the NVMe device. Compression also lowers the size requirement, and cost, of the gp2 volume.

ZFS is much more complex than XFS and EXT4, but that also means it has more tunables and options. I used a simplistic setup and an unfair benchmark, which initially led to poor ZFS results. With the same benchmark, still very favorable to XFS, I added a ZFS L2ARC and that completely reversed the situation, more than tripling the ZFS results to 66% above XFS.


We have seen in this post why the general perception is that ZFS under-performs compared to XFS or EXT4. The B-trees backing the files have a big impact on the amount of metadata ZFS needs to handle, especially when the recordsize is small. The metadata consists mostly of the non-leaf pages (or internal nodes) of the B-trees. When the metadata is properly cached, the performance of ZFS is excellent. ZFS also allows you to optimize the use of EBS volumes, both in terms of IOPS and size, when the instance has fast ephemeral storage devices. Using the ephemeral device of an i3.large instance for the ZFS L2ARC, ZFS outperformed XFS by 66%.



Installing MySQL 8.0 on Ubuntu 16.04 LTS in Five Minutes


Do you want to install MySQL 8.0 on Ubuntu 16.04 LTS? In this quick tutorial, I show you exactly how to do it in five minutes or less.

This tutorial assumes you don’t have MySQL or MariaDB installed. If you do, it’s necessary to uninstall them or follow a slightly more complicated upgrade process (not covered here).

Step 1: Install MySQL APT Repository

Ubuntu 16.04 LTS, also known as Xenial, comes with a choice of MySQL 5.7 and MariaDB 10.0.

If you want to use MySQL 8.0, you need to install the MySQL/Oracle APT repository first. Download the repository package (from dev.mysql.com) and install it:

wget https://dev.mysql.com/get/mysql-apt-config_0.8.10-1_all.deb
dpkg -i mysql-apt-config_0.8.10-1_all.deb

The MySQL APT repository installation package allows you to pick which MySQL version you want to install, as well as whether you want access to preview versions. Let's leave them all at the defaults:

Installing MySQL 8.0 on Ubuntu

Step 2: Update repository configuration and install MySQL Server

apt-get update
apt-get install mysql-server

Note: do not forget to run "apt-get update" first, otherwise you can end up with an old version of MySQL from the Ubuntu repository installed.

The installation process asks you to set a password for the root user:

Installing MySQL 8.0 on Ubuntu 1

I recommend you set a root password for increased security. If you do not set a password for the root account, “auth_socket” authentication is enabled. This ensures only the operating system’s “root” user can connect to MySQL Server without a password.

Next, the installation script asks you whether to use Strong Password Encryption or Legacy Authentication:

Installing MySQL 8.0 on Ubuntu 2

While using strong password encryption is recommended for security purposes, not all applications and drivers support this new authentication method. Going with Legacy Authentication is the safer choice.

All Done

You should have MySQL 8.0 Server running. You can test it by connecting to it with a command line client:

Installing MySQL 8.0 on Ubuntu 3
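If you prefer the command line to the screenshot above, a quick check looks like this (a sketch; 8.0.11 was the GA release at the time of writing, so your version number may vary):

mysql -u root -p -e "SELECT VERSION();"
# +-----------+
# | VERSION() |
# +-----------+
# | 8.0.11    |
# +-----------+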

As you can see, it takes just a few simple steps to install MySQL 8.0 on Ubuntu 16.04 LTS. Go ahead and give it a try!



This Week in Data with Colin Charles 39: a valuable time spent at Rootconf


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community. Rootconf 2018 just ended, and it was very enjoyable to be in Bangalore for the conference. The audience was large, the conversations were great, and overall I think this is a rather important conference if you're into the "DevOps" movement (or are a sysadmin!). From the data store world, Oracle MySQL was a sponsor, as was MyDBOPS (blog), and Elastic. There were plenty more, including Digital Ocean/GoJek/Walmart Labs — many MySQL users.

I took a handful of pictures with people, and here are some of the MyDBOPS team and myself. They have over 20 employees, and serve the Indian market at rates more palatable than straight-up USD rates. Traveling through Asia, I find that many businesses look for local partners and offer local pricing; this becomes more complex in the SaaS space (everyone generally pays the same rate) and also in the services space.

Colin at Rootconf with Oracle
Some of the Oracle MySQL team who were exhibiting were very happy they got a good amount of traffic to the booth based on stuff discussed at the talk and BOF.

From a talk standpoint, I did a keynote for an hour and also a BoF session for another hour (great discussion, lots of blog post ideas from there), and we had a Q&A session for about 15 minutes. There were plenty of good conversations in the hallway track.

A quick observation that I notice happening everywhere: many people don't realize that features have existed in MySQL since 5.6/5.7, so they are truly surprised by what's in 8.0 as well. Clearly, there is a huge market that would thrive around education, not just around feature checklists but also around how to use the features. Sometimes this feels like the MySQL of the mid-2000s: getting apps to actually use new features would be a great thing.


This seems to have been a quiet week on the releases front.

Are you a user of Amazon Aurora MySQL? There is now the Amazon Aurora Backtrack feature, which allows you to go back in time. It is described to work as:

Aurora uses a distributed, log-structured storage system (read Design Considerations for High Throughput Cloud-Native Relational Databases to learn a lot more); each change to your database generates a new log record, identified by a Log Sequence Number (LSN). Enabling the backtrack feature provisions a FIFO buffer in the cluster for storage of LSNs. This allows for quick access and recovery times measured in seconds.



I look forward to feedback/tips via e-mail or on Twitter @bytebot.




Does the Version Number Matter?


Yes, it does! In this blog post, I am going to share my recent experiences with ProxySQL, and show how important the database software version number can be.


I was working on a migration to Percona XtraDB Cluster (PXC) with ProxySQL, fortunately on a staging environment first so we could catch any issues (like this one).

We installed Percona XtraDB Cluster and ProxySQL on the staging environment and repointed the staging application to ProxySQL. At first, everything looked great. We were able to do some application tests and everything looked good. I advised the customer to do more testing to make sure everything works well.

Something is wrong, but what?

A few days later the customer noticed that their application was not working properly.

We started investigating. Everything seemed well-configured, and the only thing we could see in the application log was the following:

2018-04-20 11:28:31,169 [ default-threads - 42] ERROR Error in lifecycle management : org.hibernate.StaleStateException : Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1 { 103} (method: error)
org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
at org.hibernate.jdbc.Expectations$BasicExpectation.checkBatched(
at org.hibernate.jdbc.Expectations$BasicExpectation.verifyOutcome(

Based on this error, I still did not know what was wrong. Were some of the queries failing because of PXC, ProxySQL, or some other setting?

We redirected the application to one of the PXC nodes, and everything worked fine. We tried HAProxy as well, and again everything worked. We knew something around ProxySQL was causing the problem, but we still could not find it: every query went through ProxySQL without any issue.

Debug log is our savior

The customer finally enabled the application debug logging so we could see which query was failing:

delete from TABLENAME where ID='11' and Timestamp ='2018-04-20 16:15:03';

I was confused at first: this is a simple query, what could be wrong? Let's investigate it on the cluster. When I tried to select the data on the cluster, it returned zero results. That's OK, maybe the row was already deleted?

For this investigation, slow query logging was enabled and long_query_time was set to 0 to log all the queries. I checked the slow query log, looking for queries like this one. What I found helped me realize what the problem was:

delete from TABLENAME where ID=10 and Timestamp ='2018-04-20 11:17:22.35';
delete from TABLENAME where ID=24 and Timestamp ='2018-04-20 11:17:31.602';
delete from TABLENAME where ID=43 and Timestamp ='2018-04-20 11:18:13.2';
delete from TABLENAME where ID=22 and Timestamp ='2018-04-20 11:11:02.854';
delete from TABLENAME where ID=11 and Timestamp ='2018-04-20 11:21:57';
delete from TABLENAME where ID=64 and Timestamp ='2018-04-20 11:18:34';
delete from TABLENAME where ID=47 and Timestamp ='2018-04-20 10:38:35';
delete from TABLENAME where ID=23 and Timestamp ='2018-04-20 11:30:03';

I hope you see the difference: the first four lines have fractional seconds! Those were logged while the application was pointed at the cluster directly. So did ProxySQL cut off the fractional seconds? That would be a nasty bug.

I checked the application log again with the debug information, and I could see the application does not even send fractional seconds in its queries when it points to ProxySQL. This is why the query was failing (it did not delete any rows): in the table, all the rows had fractional seconds, but the queries were not using them.

So why does the application not use fractional seconds with ProxySQL?

First of all, fractional seconds support was introduced in MySQL 5.6.4. The application is a Java-based application with JBoss and Hibernate. I knew ProxySQL reports itself as MySQL 5.5. Maybe the application/connector reads the version number and makes decisions based on it?
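You can check what version a connector actually sees: the version string comes from the initial protocol handshake, which the mysql client's status command displays. A sketch with hypothetical credentials, against ProxySQL's default application port (6033):

mysql -h 127.0.0.1 -P 6033 -u app_user -p -e 'status' | grep -i 'server version'
# Server version:  5.5.30  (ProxySQL's default mysql-server_version)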

It was quite easy to test this theory by just changing the version number in ProxySQL like this:

update global_variables set variable_value="5.7.21" where variable_name='mysql-server_version';
load mysql variables to runtime; save mysql variables to disk;

The application had to be restarted (it was probably caching the previous settings), but after that everything worked as expected.

But be careful: ProxySQL will now report 5.7.21 for all the hostgroups. What if you have multiple hostgroups with different MySQL versions? It would be nice if you could define this per hostgroup.


The solution was very easy, but finding the source of the problem took a long time. If you are planning to use ProxySQL, I would always recommend changing mysql-server_version to match the underlying MySQL server version number, because who knows which connector or application checks the version and makes a decision based on it.

There is another example here where Marco Tusa had a very similar problem with a Java connector.



Deploying PMM on Linode: Your $5-Per-Month Monitoring Solution


In this blog, I will show you how to install PMM on Linode as a low-cost database monitoring solution.

Many of my friends use Linode to run their personal sites, as well as small projects. While Linode is no match for Big Cloud providers in features, it is really wonderful when it comes to cost and simplicity: a Linode “nanode” instance offers 1GB of memory, 1 core, 20GB of storage and 1TB of traffic for just $5 a month.

A single Linode instance is powerful enough to use with Percona Monitoring and Management (PMM) to monitor several systems, so I use Linode a lot when I want to demonstrate PMM deployment through Docker, rather than Amazon Marketplace.

Here are step-by-step instructions to get you started with Percona Monitoring and Management (PMM) on Linode in five minutes (or less):

Step 1: Pick the Linode type and location, and launch it.

PMM on Linode

Step 2: Name your Linode

This step is optional and is not PMM-related, but you may want to give your Linode an easy-to-remember name instead of something like “linode7796908”. Click on Linode Name and then on “Settings” and enter a name in “Linode Label”.

PMM on Linode 2

Step 3:  Deploy the Image

Click on Linode Name and then on “Deploy an Image”.

PMM on Linode 3

I suggest choosing the latest Ubuntu LTS version and allocating 512MB for the swap file, especially on a Linode with a small amount of memory. Remember to set a strong root password, as Linode allows root password login by default from any IP.

Step 4: Boot Linode

Now prepare the image you need to boot your Linode. Click on the Boot button for that:

PMM on Linode 4

Step 5: Login to the system and install Docker

Use your favorite SSH client to log in to the Linode you created, using the "root" user and the password you set in Step 3, and install Docker:

apt install docker.io   # the Docker package in Ubuntu's own repositories

Step 6: Run PMM Server

Here are detailed instructions for installing PMM Server on Docker. Below are the commands for a basic installation:

docker pull percona/pmm-server:latest
docker create \
  -v /opt/prometheus/data \
  -v /opt/consul-data \
  -v /var/lib/mysql \
  -v /var/lib/grafana \
  --name pmm-data \
  percona/pmm-server:latest /bin/true
docker run -d \
  -p 80:80 \
  --volumes-from pmm-data \
  --name pmm-server \
  --restart always \
  percona/pmm-server:latest

Note: This deploys PMM Server without authentication. For anything but test usage, you should set a password by following instructions on this page.

You’re done!

You’ve now installed PMM Server and you can see it monitoring itself by going to the server IP with a browser.

PMM on Linode 5

Now you can go ahead and install the PMM Client on the nodes you want to monitor!
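Client setup is out of scope here, but as a rough sketch (assuming PMM 1.x and that the Percona APT repository is already configured on the monitored node; the placeholders are yours to fill in):

apt install pmm-client
pmm-admin config --server <pmm-server-ip>         # point the client at your Linode's IP
pmm-admin add mysql --user root --password <pwd>  # start collecting MySQL metrics and query data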



How Binary Logs Affect MySQL 8.0 Performance

As part of my benchmarks of binary logs, I've decided to check how the recently released MySQL 8.0 performs in similar scenarios, especially as binary logs are enabled there by default. It is also interesting to check MySQL 8.0 against the claimed performance improvements in the redo log subsystem.

I will use a setup similar to the one in my last blog post, with MySQL 8.0 using the utf8mb4 charset.

A few words about MySQL 8.0 tuning: Dimitri recommends in his blog posts using innodb_undo_log_truncate=off and innodb_doublewrite=0. However, in my opinion, using these settings is like participating in a car race without working brakes: you will drive very fast, but it will end badly. So, contrary to Dimitri's recommendations, I used innodb_undo_log_truncate=on and innodb_doublewrite=1.
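For reference, a sketch of the corresponding my.cnf fragment (the config file path varies by distribution):

cat >> /etc/my.cnf <<'EOF'
[mysqld]
innodb_undo_log_truncate = ON   # truncate undo tablespaces so they do not grow without bound
innodb_doublewrite       = 1    # keep torn-page protection enabled
EOF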

Servers Comparison

For the first run, let’s check the results without binary logs vs. with binary logs enabled, but with sync_binlog=1 for Percona Server for MySQL 5.7 vs. MySQL 8.0.

MySQL 8.0 Performance

In tabular form:

Binary log Buffer pool, GB MYSQL8 PS57 Ratio PS57/MySQL8
binlog 5 768.0375 771.5532 1.00
binlog 10 1224.535 1245.496 1.02
binlog 20 1597.48 1625.153 1.02
binlog 30 1859.603 1979.328 1.06
binlog 40 2164.329 2388.804 1.10
binlog 50 2572.827 2942.082 1.14
binlog 60 3158.408 3528.791 1.12
binlog 70 3883.275 4535.281 1.17
binlog 80 4390.69 5246.567 1.19
nobinlog 5 788.9388 783.155 0.99
nobinlog 10 1290.035 1294.098 1.00
nobinlog 20 1745.464 1743.759 1.00
nobinlog 30 2109.301 2158.267 1.02
nobinlog 40 2508.28 2649.695 1.06
nobinlog 50 3061.196 3334.766 1.09
nobinlog 60 3841.92 4168.089 1.08
nobinlog 70 4772.747 5140.316 1.08
nobinlog 80 5727.795 5947.848 1.04


Binary Log Effect

MySQL 8.0 Performance 2

In tabular form:

Buffer pool, GB server binlog nobinlog Ratio nobinlog / binlog
5 MYSQL8 768.0375 788.9388 1.03
5 PS57 771.5532 783.155 1.02
10 MYSQL8 1224.535 1290.0352 1.05
10 PS57 1245.496 1294.0983 1.04
20 MYSQL8 1597.48 1745.4637 1.09
20 PS57 1625.153 1743.7586 1.07
30 MYSQL8 1859.603 2109.3005 1.13
30 PS57 1979.328 2158.2668 1.09
40 MYSQL8 2164.329 2508.2799 1.16
40 PS57 2388.804 2649.6945 1.11
50 MYSQL8 2572.827 3061.1956 1.19
50 PS57 2942.082 3334.7656 1.13
60 MYSQL8 3158.408 3841.9203 1.22
60 PS57 3528.791 4168.0886 1.18
70 MYSQL8 3883.275 4772.7466 1.23
70 PS57 4535.281 5140.316 1.13
80 MYSQL8 4390.69 5727.795 1.30
80 PS57 5246.567 5947.8477 1.13



It seems that binary logs have quite an effect on MySQL 8.0: we see up to a 30% performance penalty, as opposed to 13% for Percona Server for MySQL 5.7.

In general, for in-memory workloads, Percona Server for MySQL 5.7 outperforms MySQL 8.0 by 10-20% with binary logs enabled, and 4-9% without binary logs enabled.

For io-bound workloads (buffer pool size <= 30GB), the performance numbers for Percona Server for MySQL and MySQL are practically identical.

Hardware spec

Supermicro server:

  • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
  • 2 sockets / 28 cores / 56 threads
  • Memory: 256GB of RAM
  • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
  • Filesystem: ext4/xfs
  • Percona-Server-5.7.21-20
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic

Extra Raw Results, Scripts and Config

My goal is to provide fully repeatable benchmarks. I have shared all scripts and settings I used in the following GitHub repo:




How Binary Logs (and Filesystems) Affect MySQL Performance

I want to take a closer look at MySQL performance with binary logs enabled on different filesystems, especially as MySQL 8.0 comes with binary logs enabled by default.

As part of my benchmarks of the MyRocks storage engine, I've noticed an unusual variance in throughput for the InnoDB storage engine, even though we spent a lot of time making it as stable as possible in Percona Server for MySQL. In the end, the culprit was the enabled binary log. There is also always the question, "If there is a problem with EXT4, does XFS perform differently?" To answer that, I will repeat the same benchmark on both the EXT4 and XFS filesystems.

You can find our previous experiments with binary logs here:

Benchmark Setup

A short overview of the benchmark setup:

  • Percona Server for MySQL 5.7.21
  • InnoDB storage engine
  • In contrast to the previous benchmark, I enabled foreign keys, used REPEATABLE-READ isolation level, and I used UTF8 character sets. Because of these changes, the results are not really comparable with the previous results.
  • The dataset is the same: sysbench-tpcc with ten tables and 100 warehouses, resulting in a total of 1000 warehouses, and about a 90GB dataset size.
  • I will use innodb_buffer_pool_size 80GB, 70GB, and 60GB to emulate different IO loads and evaluate how that affects binary logs writes.

Initial Results

For the first run, let’s check the results without binary logs vs. with binary log enabled, but with sync_binlog=0:

Binary Log Performance

We can see that the results without binary logs are generally better, but we can also see that with binary logs enabled and sync_binlog=0, there are regular drops to 0 for 1-2 seconds. This basically results in stalls in any connected application.

So, enabling binary logs may result in regular application stalls. The reason for this is that there is a limit on the size of the binary log file (max_binlog_size), which is 1GB. When the limit is reached, MySQL has to perform a binary log rotation. With sync_binlog=0, all previous writes to the binary log are cached in the OS cache, and during rotation, MySQL forces synchronous flushing of all changes to disk. This results in complete stalls every ~40 seconds (the amount of time it takes to fill 1GB of binary log in the above tests).

How can we deal with this? The obvious solution is to sync writes of the binary log more frequently. This can be achieved by setting sync_binlog > 0. The popular choice is the strictest setting, sync_binlog=1, which provides the most guarantees, but it also comes with a noted performance penalty. I will also test sync_binlog=1000 and sync_binlog=10000, which perform a synchronous write of the binary log every 1000 and 10000 transactions, respectively.
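Both settings are easy to inspect and, in the case of sync_binlog, to change at runtime. A quick sketch:

mysql -e "SELECT @@max_binlog_size, @@sync_binlog;"  # rotation threshold (1GB by default) and current sync setting
mysql -e "SET GLOBAL sync_binlog = 1000;"            # dynamic; remember to persist it in my.cnf as well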

The Results

Binary Log Performance 1

The same results in a tabular format with median throughput (tps, more is better)

Buffer pool sync_binlog=0 sync_binlog=1 sync_binlog=1000 sync_binlog=10000 nobinlog
60 GB 4174.945 3598.12 3950.19 4205.165 4277.955
70 GB 5053.11 4541.985 4714 4997.875 5328.96
80 GB 5701.985 5263.375 5303.145 5664.155 6087.925


Some conclusions we can make:

  • sync_binlog=1 comes with the biggest performance penalty, but with minimal variance, comparable to running without binary logs.
  • sync_binlog=0 provides the best performance (with binary logs enabled), but the variance is huge.
  • sync_binlog=1000 is a good compromise, providing better performance than sync_binlog=1 with minimal variance.
  • sync_binlog=10000 might not be good: it shows somewhat less variance than sync_binlog=0, but the variance is still big.

So what value should we use? It is probably a choice between sync_binlog=1 and some value like 1000, depending on your use case and your storage solution. In the case of slow storage, sync_binlog=1 may show a bigger penalty than what I see on my enterprise SATA SSD, a SAMSUNG SM863.


All of the above results were on an EXT4 filesystem. Let’s compare to XFS. Will it show different throughput and variance?

Binary Log Performance 2

The median throughput in tabular format:

sync_binlog Buffer pool (GB) EXT4 XFS
0 60 4174.945 3902.055
0 70 5053.11 4884.075
0 80 5701.985 5596.025
1 60 3598.12 3526.545
1 70 4541.985 4538.455
1 80 5263.375 5255.38
1000 60 3950.19 3620.05
1000 70 4714 4526.49
1000 80 5303.145 5150.11
10000 60 4205.165 3874.03
10000 70 4997.875 4845.85
10000 80 5664.155 5557.61
No binlog 60 4277.955 4169.215
No binlog 70 5328.96 5139.625
No binlog 80 6087.925 5957.015


We can observe the general trend that median throughput on XFS is a little worse than with EXT4, with practically identical variance.

The difference in throughput is minimal. You can use either XFS or EXT4.

Hardware Spec

Supermicro server:

  • Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz
  • 2 sockets / 28 cores / 56 threads
  • Memory: 256GB of RAM
  • Storage: SAMSUNG  SM863 1.9TB Enterprise SSD
  • Filesystem: ext4/xfs
  • Percona-Server-5.7.21-20
  • OS: Ubuntu 16.04.4, kernel 4.13.0-36-generic

Extra Raw Results, Scripts and Config

My goal is to provide fully repeatable benchmarks. To that effect, I’ve shared all the scripts and settings I used in the following GitHub repo:



This Week in Data with Colin Charles 38: Percona Live Europe 2018 and PostgreSQL


Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.

The week after Percona Live Santa Clara 2018 tends to be much quieter, aided by the fact that I took a few days away around Labor Day. The next thing to look out for is Percona Live Europe 2018, which at this stage is really a note to ask you to save the dates: November 5-7, 2018, at the Radisson Blu in Frankfurt. There is no call for papers yet, there is no committee, and it is not yet listed on the Percona Live Conferences page. Hang in there: we'll open the call for papers soon!

Now that Percona is in the PostgreSQL space, it seems prudent that there will be more PostgreSQL content here as well. A great resource, naturally, is Planet PostgreSQL. There is also The Internals of PostgreSQL, and as books go, Mastering PostgreSQL in Application Development looks very interesting. Do you have recommended resources?




I look forward to feedback/tips via e-mail or on Twitter @bytebot.



Percona Live 2018 Community Report

So, after a whirlwind few days, Percona Live 2018 has been and gone. There was a great energy about the conference, and it was fantastic to meet so many open source database enthusiasts and supporters. A few things that I experienced:

  • Your great willingness to share knowledge. It was a fantastic place to learn for those who have experience from a different field of technology. Almost everyone seemed to be very open and generous with their time.
  • The “superstars” from our industry are not so scary. They are as willing to be open and generous with their experience and views as any of the other attendees, and equally as interested in making new discoveries.
  • There aren't many times you can sit down to a (community) dinner to share food and anecdotes with people from the USA, UK, Germany and Armenia at the same time. I thoroughly enjoyed the company, and wish there were more opportunities for similar encounters. Thanks to Pythian for setting that up.
  • My Percona colleagues are wonderful, committed human beings with more than a passing interest in music – the Percona Sessions have got to happen…
  • That you can run a very long way in a day between the Santa Clara Convention Center and the Hyatt Regency Hotel.

I had very many positive conversations with delegates, and any criticisms you offered came along with a suggestion for how we should tweak things for the better. Our community is a creative, generous, problem-solving machine, though I shouldn't be surprised at that.

So, with only a few more duties to complete, I’d like to thank you for your company. For those that did not make it to this year’s event, I hope that you might be persuaded to join us in the future — either at Percona Live Europe 2018 or at Percona Live 2019.

Packt Prizes

Our media sponsor, Packt, generously provided us with three free ebooks and two free instruction videos as prizes for delegates:

  1. Mastering MongoDB 3.x
  2. MySQL 8 Cookbook
  3. MongoDB Administrator’s Guide
  4. Elastic Databases and Data Processing with AWS [Video]
  5. AWS Administration – Database, Networking, and Beyond [Video]

There are another 10 titles for which we can offer delegates a 50% discount: you should have received your emails. Thanks are due again to Packt.

Community Blog

While I have your attention, I’d like to let you know about the forthcoming Percona community blog. Having been some time in the planning, this is starting really soon, and is like a year-round, online, Percona Live. We already have some keen writers for this, but if you would be interested in creating content (whether written, podcast or webcast) for the community blog, then please get in touch. The brief is very wide — as long as your submission is relevant to the open source database community then it would be welcome.

Finally, I would like to invite feedback on how to make the event shine even brighter — please drop me an email if you have suggestions or ideas. Meanwhile, I hope you enjoy these photographs of the MySQL Community Awards Winners, presented at PL18. You can read more about this community initiative.

Perhaps you’ll be able to join us in Frankfurt in November? Time to start thinking about those submissions for the call for papers!

Or perhaps next year at Percona Live Open Source Database Conference in 2019 – wherever it may be!

Photographs: Randy Tunnell Photography



Q&A: “Percona XtraDB Cluster 5.7 and ProxySQL for Your High Availability Needs” Webinar


On March 22, 2018, we held a webinar on how Percona XtraDB Cluster 5.7 (PXC) and ProxySQL can help achieve your database clustering high availability needs. Firstly, thanks to all the attendees for taking the time to join us. We tried to answer some of your high availability questions during the call, but due to time restrictions we missed some of them; this blog post will help clarify those.

Q. You say the replication to servers is virtually synchronous, if there is any latency, does ProxySQL detect this and select a node accordingly?

A. PXC nodes are virtually synchronous, which effectively means that while the apply/commit of a transaction may still be in progress on one node, other nodes may have already completed applying it. There is no direct way for ProxySQL to know about this, but it can be traced by looking at wsrep_last_applied and wsrep_last_committed. Also, if a user expects to always fetch up-to-date data, the wsrep_sync_wait setting can be used.
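For example, to make a session's reads wait for causality checks before returning (a sketch; the table name is hypothetical):

mysql <<'EOF'
SET SESSION wsrep_sync_wait = 1;     -- enforce causality checks for READ statements
SELECT * FROM orders WHERE id = 42;  -- hypothetical read that must see the latest committed data
EOF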

Q. Hello, do you suggest geo-replication / WAN clustering for an e-commerce website? Let's say one site served by an Italian PXC cluster and another served by a US PXC cluster?

A. Geo-distributed PXC is already in use by a lot of customers, and is meant to serve exactly the use case you have pointed out. An important aspect of geo-distributed clustering (that often gets missed) is to configure timeout and window settings to accommodate the network latency, along with segment settings. There is also a separate webinar on this topic, and you can surely get in touch with us to find out more about how to configure it correctly.

Q. Can we add a read-only node with PXC?

A. You can simply mark the selected nodes as super_read_only (or read_only). Replication continues as normal, but direct write traffic is blocked.
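A sketch of marking a node read-only at runtime:

mysql -e "SET GLOBAL super_read_only = ON;"  # blocks client writes, even from users with SUPER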

Q. Does ProxySQL impact performance?

A. Using all of ProxySQL's features can give you a huge performance improvement. Here is a sample use case.

Q. I have not had "excellent" results with Drupal (e.g., clearing the cache sometimes causes corruption, although I ensure all tables do have a primary key). Any advice on its suitability? I am currently using ProxySQL with a single Percona Server (non-cluster) 5.7, but would like to try again with PXC if advisable.

A. Not sure what exact problem you faced, but you may want to check the wsrep_drupal_282555_workaround variable and the articles around it.

Q. Another question (to queue up as you are able to answer if possible): do you recommend SSL between the application and ProxySQL, and specifically, what are the performance impacts, especially if there's some latency between ProxySQL and the master Percona Server for MySQL for writes?

A. We recommend SSL for security reasons, but it depends on the individual setup. ProxySQL 1.x does not support SSL on the frontend; this feature is only available starting with ProxySQL 2.0.

Q1. Can i put two PXC clusters in master-slave replication mode with automatic failover?

Q2. How can i setup two PXC clusters in master-slave replication model with automatic failover?

A. You can have an async master-slave replication link between two PXC clusters, but automatic failover of the node (if the acting master from cluster-1 fails, then another active node of that cluster takes over as master) is currently not supported.

Q. Can the replication be done from ProxySQL level, so that if one node goes down in slave PXC, another node in PXC will take over the slave role?

A. This feature is not supported through ProxySQL. You can monitor replication lag through ProxySQL.

Q. Suppose if i have 5 node cluster in DC1 & DC2, how can we make transaction successful as soon as nodes in DC1 are committed rather than waiting for certification from nodes in DC2?

A. The transaction is executed on the local node, and during commit (as a pre-commit stage) it is replicated to the other nodes of the cluster (the replication action doesn't include certification and commit). Once the write-set is replicated, each node can certify, apply and commit the transaction in parallel. This effectively means a transaction doesn't need to be certified on all the nodes of the cluster before commit success is communicated to the end user: once the transaction is replicated, the originating node can complete the local commit and report success to the application.

Q. Hi, thank you for the webinar. Does ProxySQL support HA? Is it a single point of failure?

A. ProxySQL supports native clustering, thereby forming a ProxySQL cluster (vs. a single ProxySQL node), which in turn helps avoid a single point of failure.

Q. What setup would you recommend: ProxySQL on a separate server/VM, or on the same host as one of the PXC nodes?

A. We would recommend installing ProxySQL on an independent node (or sharing one with other applications). We don't recommend installing ProxySQL on a PXC node: if the node hosting both PXC and ProxySQL goes down (network or power failure), then even though the cluster keeps working, the application still loses connectivity, as the ProxySQL gateway goes down with it.

Q. Let's say we have three nodes and a good quorum. What happens to the quorum when one node goes down for maintenance and only two nodes remain?

A. Two nodes can still form the quorum and continue servicing the workload.

Q. If a transaction is not yet committed on all the nodes, does the cluster remain locked for reads too?

A. No. The transaction commit is independent of read actions: the commit can continue in the background, and users can continue to read from the cluster nodes. If a user has configured wsrep_sync_wait, which effectively means waiting for in-flight transactions to commit in order to fetch only up-to-date data, then a read may wait for the transaction commit to complete.

Q. Is there a way to do partitioning over data? To not have 100% replicate in each master?

A. PXC/Galera, being a multi-master solution, doesn't recommend having nodes with unsynchronized data. As an end user you can still achieve it by setting wsrep_on=off, executing a workload (which will not be replicated to the cluster), and then setting wsrep_on=on (all actions after this point follow replication again); a sketch follows below. This can lead to data inconsistency, though, and to a shutdown of the cluster if the workload or actions are not properly segregated, so it is not recommended.
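A minimal sketch of that wsrep_on dance; the table name is hypothetical:

mysql <<'EOF'
SET SESSION wsrep_on = OFF;  -- statements below are NOT replicated to the cluster
UPDATE node_local_stats SET refreshed_at = NOW();  -- hypothetical local-only write
SET SESSION wsrep_on = ON;   -- replication resumes for this session
EOF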

Q. Are changes done by triggers rollbackable?

A. Yes, they are.

Q. Does ProxySQL prevent “mysql server gone away” in mostly idle daemons?

A. ProxySQL Monitor Module regularly probes the backend nodes and marks the node as OFFLINE in the ProxySQL database if MySQL server is down.

Q. Can ProxySQL cache rules use regexes?

A. We can use regex with ProxySQL query rules. Go here for more info.

Q. Can PMM be used in Digitalocean droplets?

A. Yes.

Q. In regards to PXC, how much delay is introduced when data is written since it has to appear on all nodes?

A. When a user initiates a transaction on a given node (let's call it the originating node), it is first applied (not committed) on that node and a binary write-set is created. This write-set is then replicated to the other nodes of the cluster. Once the replication is successful, each node can independently certify, apply and commit the transaction. Since the originating node has already applied the transaction, it just needs to certify and commit it. It is interesting to note that the apply stage on the other nodes is fast too, given that the transaction arrives packed in a database-optimized apply format. In short, there is no delay, or only a marginal one. The delay could be higher if the transaction is huge, as the apply stage could take time; that is one of the reasons Galera doesn't recommend huge transactions.

Q. How does PXC (Percona XtraDB Cluster) allow DDL (schema changes) on one server with DML on the same table on another server? (This can break MySQL Master-Master replication)?

A. PXC executes DDL using the TOI (total order isolation) protocol. In short, while a DDL statement is executing, it takes complete control of the node (no other parallel DML or DDL is allowed), and the DDL executes at the same position in the replication stream on all the nodes.

Q. Can ProxySQL split read-write queries based on stored procedure names (patterns)? e.g. sp_write vs sp_read?

A. ProxySQL read/write split is based on mysql_query_rules and hostgroups. For more info.

Q. Can we use ProxySQL with a single node for the query caching feature? Especially since query cache will be discontinued in MySQL 8?

A. If you configure ProxySQL's query cache properly, you can cache queries for a single node; a sketch follows below.
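A minimal sketch of a caching query rule (hypothetical rule and digest pattern; 6032 is ProxySQL's default admin port, admin/admin the default credentials):

mysql -h 127.0.0.1 -P 6032 -u admin -padmin <<'EOF'
INSERT INTO mysql_query_rules (rule_id, active, match_digest, cache_ttl, apply)
VALUES (20, 1, '^SELECT c FROM sbtest1', 5000, 1);  -- cache matching SELECTs for 5 seconds
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;
EOF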

Q. Must binary logging be enabled for ProxySQL / PXC to work?

A. PXC replicates write-sets. While binary logging as such is not needed, PXC still needs the write-sets, which are generated by the binary logging module, so PXC enables emulation-based binary logs for the generation of these write-sets (persistence to disk is not needed). If disk space is not a constraint, we recommend you enable binary logging.

Q. Please define Galera and Percona, as well as the relationship between the two?

A. Galera is replication technology owned and developed by Codership and distributed under the GPL license. Percona has adopted this technology, combined it with Percona Server for MySQL, and built PXC. Percona continues to pull in updates made to Galera and the related wsrep plugin on a regular basis, and likewise continues to refresh from Percona Server for MySQL for related enhancements and bug fixes.

Q. Is the ProxySQL Admin tool the script/tool you mentioned that auto-detects your existing PXC, or is that a different script? I'm trying to work out whether you need to have PXC and ProxySQL installed at the same time.

With ProxySQL, do we need to wait for active threads on the PXC to drain before shutting down the PXC?

A. ProxySQL Admin (proxysql-admin) script helps you configure your PXC nodes to ProxySQL database. PXC and ProxySQL should be up and running to initiate proxysql-admin script. For more info.

If you trigger a PXC node shutdown, the proxysql_galera_checker script marks the node as offline in the ProxySQL DB, and new connections aren't redirected to the offline node.

Q. 1) HAProxy and ProxySQL: which one performs better when the number of clusters is large, say up to 30 clusters?
2) What's the better tool for monitoring a large number of clusters and nodes?

A. For PXC, we strongly recommend ProxySQL, as it is closely integrated with PXC. HAProxy works with PXC as well, and before ProxySQL we had customers using it. For a quick comparison, you can take a look at the following article.

Q. When PXC is configured with a maximum number of connections, how does ProxySQL allow for many more than those connections?

A. ProxySQL terminates the connection with a connection timeout error, for example:

FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_insert.lua:61: SQL error, errno = 9001, state = 'HY000': Max connect timeout reached while reaching hostgroup 10 after 10012ms


Once again, thanks for your questions and queries. If you still have more questions or need clarification, you can log them at the percona-xtradb-cluster forum. We would also like to know what else you expect from Percona XtraDB Cluster in upcoming releases.

