Sep
21
2021
--

Making Aurora Write Latency 15x Higher (or More!) by Choosing a Bad Primary Key

Aurora MySQL Write Latency

Primary Key design is an important factor in InnoDB performance, and a poor PK definition will have an impact on both performance and write propagation in your databases. When it comes to Aurora, this impact is even worse than you may expect.

In short, we consider a poor Primary Key definition in InnoDB to be “anything but quasi-sequential values”, since such keys cause very random access to data and thus increase IO dependency.

In this post, I’ll try to demonstrate the potential impact of the primary key design when running on Aurora, and how a bad design can lead to a 15x write latency penalty (or more).

The Analysis

Recently I worked on a case where a customer was having issues with scaling writes in Aurora MySQL. While this is a known limitation in Aurora considering how the distributed storage layer syncs out data among all nodes of the cluster, we observed additional latency occurring when more clients were writing to the database.

The first thing I noticed is that their main table had a poor definition of the Primary Key as they were using UUID-like columns based on VARCHAR data types. In this case, the nature of values for the Primary Key was very random, which is really bad for a b-tree based storage like InnoDB.

With this in mind, I referred to a great post from my colleague Yves Trudeau explaining why UUIDs are bad for performance, so based on this premise I decided to try to measure how big this impact can be in the Aurora world.

The set of tests I ran used a db.r5.2xlarge Aurora MySQL cluster (8 vCPUs and 64 GB of RAM), which is pretty similar to the cluster my customer was using.

First, I’ve started with two very basic tables to avoid any extra overhead but something close to a real case:

CREATE TABLE `test_sequential_PK` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`number` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_number` (`number`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

CREATE TABLE `test_random_PK` (
`id` varchar(26) NOT NULL,
`number` int(10) NOT NULL,
PRIMARY KEY (`id`),
KEY `idx_number` (`number`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

Then I created two simple lua scripts to execute with sysbench – super simple, I’d say – as they were just for doing inserts using either the auto_increment property in table test_sequential_PK or creating random values for test_random_PK table.

The final purpose was not to measure Aurora performance as a whole but the write latency when the Primary Key is not optimal.

I started the process by warming up the instance for a few days, running both sysbench scripts and pushing the instance really hard while filling up the InnoDB Buffer Pool. Results were pretty good for some time, until the traffic became IO-bound:

Amazon Aurora Latency

It took a few days, but eventually we started to see an increase in write latency. The initial set of data was created using 50 concurrent threads, so the graphic above is not very helpful for the analysis I’m going to make.

After I was sure the buffer pool was filled and the instance was warmed up, I verified that the dataset is bigger than the buffer pool:

SELECT concat(table_schema,'.',table_name) schema_table,
       concat(round(table_rows/1000000,2),'M') rows,
       concat(round(data_length/(1024*1024*1024),2),'G') Data,
       round(data_length/table_rows,0) DataRow,
       concat(round(index_length/(1024*1024*1024),2),'G') idx,
       round(index_length/table_rows,0) as IdxRow,
       concat(round((data_length+index_length)/(1024*1024*1024),2),'G') totSize,
       round(index_length/data_length,2) idxfrac
FROM information_schema.TABLES
WHERE table_schema in ('percona')
ORDER BY data_length+index_length DESC LIMIT 10;
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+
| schema_table                  | rows     | Data    | DataRow | idx    | IdxRow | totSize | idxfrac |
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+
| percona.test_random_PK        | 1586.25M | 104.53G |      71 | 73.04G |     49 | 177.57G |    0.70 |
| percona.test_sequential_PK    | 1840.49M | 54.74G  |      32 | 24.54G |     14 | 79.27G  |    0.45 |
+-------------------------------+----------+---------+---------+--------+--------+---------+---------+

I’ll explain the difference between table sizes later in this post.

After this, I started to run separate tests to check how the write latency is affected by our table design.

First I ran sysbench as follows:

sysbench /usr/share/sysbench/insert_sequence.lua    --mysql-user=admin --mysql-password=xxxxxx --mysql-host=francisco-sysbench.cluster --mysql-db=percona --report-interval=1   --threads=15  --events=0 --time=0 run

MySQL Client Thread Activity

We can see that performance remains steady at roughly 124 microseconds on average with all threads running, so there appears to be no impact over about 24 hours.

Then I tested the random insert by running:

sysbench /usr/share/sysbench/insert_random.lua    --mysql-user=admin --mysql-password=xxxxxx --mysql-host=francisco-sysbench.cluster --mysql-db=percona --report-interval=1   --threads=15  --events=0 --time=0 run

MySQL InnoDB

This is a huge impact when you consider that random access alone is hurting the instance: roughly a 15x increase in write latency.

These numbers were very impressive compared to my previous experience, so, being extra curious, I checked what CloudWatch reported for Aurora write latency over the previous three days.

Cloudwatch for Aurora Write latency

The three stages of the test are clearly visible:

  • Warming up the instance and buffer pool by pushing really hard with sysbench on both tables (peak of load at the beginning)
  • First round of inserts using the sequential Primary Key (sequential writes)
  • Last round of inserts using random access (random writes)

As I said, the code used in each script was very simple, on purpose, to avoid adding overhead anywhere other than the database. For reference, this is the interesting portion of code for random writes:

local inserts = {
    "INSERT INTO percona.test_random_PK (id, number) VALUES ('%s', %i)"
}

-- each '#' in the template is replaced with a random digit, producing a random, UUID-like id
local rnd_str_tmpl = "###########-###########"

function execute_inserts()
    local id = sysbench.rand.string(rnd_str_tmpl)
    con:query(string.format(inserts[1], id, sb_rand(1, sysbench.opt.table_size)))
end

And for sequential writes:

local inserts = {
    "INSERT INTO percona.test_sequential_PK (number) VALUES (%i)"
}

function execute_inserts()
    -- the id column is omitted so InnoDB assigns the next auto_increment value
    con:query(string.format(inserts[1], sb_rand(1, sysbench.opt.table_size)))
end

Conclusion

Primary key design for InnoDB tables has been discussed at length in several posts, and the impact of the UUID format in particular is well described in the post I mentioned above, so there are no surprises there. What I found interesting in this analysis is that in Aurora there seems to be an extra impact from this random access.

That said, what’s happening underneath is not entirely clear; I can only offer a theory:

In Aurora, there are no dirty pages. Basically, every commit is synchronized to disk immediately so that all replicas can be virtually in sync with the source server. This reduces the chance of hitting the same page in memory on every write operation, so the whole sequence of grabbing the page, placing it in memory, updating it, flushing, and committing synchronously to the block storage is particularly expensive in Aurora.

Additionally, since every secondary index entry also stores the Primary Key value, a large random PK inflates the data size (and thus the disk footprint) considerably, as we saw above, which may also cause extra disk utilization.
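
If you want to check this on your own tables, InnoDB’s persistent statistics expose the number of pages in each index. The query below is only a sketch (it assumes the default persistent statistics are enabled) and uses the schema and table names from this test:

SELECT table_name, index_name,
       round(stat_value * @@innodb_page_size / 1024 / 1024 / 1024, 2) AS size_gb
FROM mysql.innodb_index_stats
WHERE database_name = 'percona'
  AND stat_name = 'size';

Comparing the PRIMARY and idx_number entries for test_random_PK and test_sequential_PK shows how much of the extra footprint comes from the long random PK being repeated in the secondary index.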

So be warned: plan your PKs properly if you don’t want to see a huge performance degradation in AWS Aurora as the database grows and the workload increases.

Jul
09
2021
--

MySQL/ZFS Performance Update

MySQL/ZFS Performance Update

As some of you likely know, I have a favorable view of ZFS and especially of MySQL on ZFS. As I published a few years ago, the argument for ZFS was less about performance than its useful features like data compression and snapshots. At the time, ZFS was significantly slower than xfs and ext4 except when the L2ARC was used.

Since then, however, ZFS on Linux has progressed a lot and I also learned how to better tune it. Also, I found out the sysbench benchmark I used at the time was not a fair choice since the dataset it generates compresses much less than a realistic one. For all these reasons, I believe that it is time to revisit the performance aspect of MySQL on ZFS.

ZFS Evolution

In 2018, I reported ZFS performance results based on version 0.6.5.6, the default version available in Ubuntu Xenial. The present post is using version 0.8.6-1 of ZFS, the default one available on Debian Buster. Between the two versions, there are in excess of 3600 commits adding a number of new features like support for trim operations and the addition of the efficient zstd compression algorithm.

ZFS 0.8.6-1 is not bleeding edge; there have been more than 1700 commits since, and after 0.8.6 the ZFS release number jumped to 2.0. The big addition in the 2.0 release is native encryption.

Benchmark Tools

The classic sysbench MySQL database benchmarks have a dataset containing mostly random data. Such datasets don’t compress much, less than most real-world datasets I worked with. The compressibility of the dataset is important since ZFS caches, the ARC and L2ARC, store compressed data. A better compression ratio essentially means more data is cached and fewer IO operations will be needed.

A well-known tool to benchmark a transactional workload is TPCC. Furthermore, the dataset created by TPCC compresses rather well making it more realistic in the context of this post. The sysbench TPCC implementation was used.

Test Environment

Since I am already familiar with AWS and Google cloud, I decided to try Azure for this project. I launched these two virtual machines:

tpcc:

  • benchmark host
  • Standard D2ds_v4 instance
  • 2 vCPUs, 8 GB of RAM, and 75 GB of temporary storage
  • Debian Buster

db:

  • Database host
  • Standard E4-2ds-v4 instance
  • 2 vCPUs, 32 GB of RAM, and 150 GB of temporary storage
  • 256GB SSD Premium (SSD Premium LRS P15 – 1100 IOPS (3500 burst), 125 MB/s)
  • Debian Buster
  • Percona server 8.0.22-13

Configuration

By default and unless specified, the ZFS filesystems are created with:

zpool create bench /dev/sdc
zfs set compression=lz4 atime=off logbias=throughput bench
zfs create -o mountpoint=/var/lib/mysql/data -o recordsize=16k \
           -o primarycache=metadata bench/data
zfs create -o mountpoint=/var/lib/mysql/log bench/log

There are two ZFS filesystems. bench/data is optimized for the InnoDB dataset while bench/log is tuned for the InnoDB log files. Both are compressed using lz4 and the logbias parameter is set to throughput which changes the way the ZIL is used. With ext4, the noatime option is used.

ZFS also has a number of kernel module parameters; the ones set to non-default values are:

zfs_arc_max=2147483648
zfs_async_block_max_blocks=5000
zfs_delete_blocks=1000

Essentially, the above settings limit the ARC size to 2GB and they throttle down the aggressiveness of ZFS for deletes. Finally, the database configuration is slightly different between ZFS and ext4. There is a common section:

[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
log-error = /var/log/mysql/error.log
skip-log-bin
datadir = /var/lib/mysql/data
innodb_buffer_pool_size = 26G
innodb_flush_log_at_trx_commit = 1 # TPCC reqs.
innodb_log_file_size = 1G
innodb_log_group_home_dir = /var/lib/mysql/log
innodb_flush_neighbors = 0
innodb_fast_shutdown = 2

and when ext4 is used:

innodb_flush_method = O_DIRECT

and when ZFS is used:

innodb_flush_method = fsync
innodb_doublewrite = 0 # ZFS is transactional
innodb_use_native_aio = 0
innodb_read_io_threads = 10
innodb_write_io_threads = 10

ZFS doesn’t support O_DIRECT but it is ignored with a message in the error log. I chose to explicitly set the flush method to fsync. The doublewrite buffer is not needed with ZFS and I was under the impression that the Linux native asynchronous IO implementation was not well supported by ZFS so I disabled it and increased the number of IO threads. We’ll revisit the asynchronous IO question in a future post.

Dataset

I use the following command to create the dataset:

./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=8 --tables=10 --scale=200 --db-driver=mysql prepare

The resulting dataset has a size of approximately 200GB. The dataset is much larger than the buffer pool so the database performance is essentially IO-bound.

Test Procedure

The execution of every benchmark was scripted and followed these steps:

  1. Stop MySQL
  2. Remove all datafiles
  3. Adjust the filesystem
  4. Copy the dataset
  5. Adjust the MySQL configuration
  6. Start MySQL
  7. Record the configuration
  8. Run the benchmark

Results

For the benchmark, I used the following invocation:

./tpcc.lua --mysql-host=10.3.0.6 --mysql-user=tpcc --mysql-password=tpcc --mysql-db=tpcc \
--threads=16 --time=7200 --report-interval=10 --tables=10 --scale=200 --db-driver=mysql run

The TPCC benchmark uses 16 threads for a duration of 2 hours. The duration is sufficiently long to allow for a steady state and to exhaust the storage burst capacity. Sysbench returns the total number of TPCC transactions per second every 10s. This number includes not only the New Order transactions but also the other transaction types like payment, order status, etc. Be aware of that if you want to compare these results with other TPCC benchmarks.

In those conditions, the figure below presents the rates of TPCC transactions over time for ext4 and ZFS.

TPCC transactions ZFS

MySQL TPCC results for ext4 and ZFS

During the initial 15 minutes, the buffer pool warms up, but at some point the workload shifts from being IO-read bound to being IO-write and CPU bound. Then, at around 3000s, the SSD Premium burst capacity is exhausted and the workload is only IO-bound. I was surprised enough by the results to rerun the benchmarks to make sure. The results for both ext4 and ZFS are qualitatively similar; any difference is within the margin of error. That essentially means that if you configure ZFS properly, it can be as IO efficient as ext4.

What is interesting is the amount of storage used. While the dataset on ext4 consumed 191GB, the lz4 compression of ZFS yielded a dataset of only 69GB. That’s a huge difference, a factor of 2.8, which could save a decent amount of money over time for large datasets.

Conclusion

It appears that it was indeed a good time to revisit the performance of MySQL with ZFS. In a fairly realistic use case, ZFS is on par with ext4 regarding performance while still providing the extra benefits of data compression, snapshots, etc. In a future post, I’ll examine the use of cloud ephemeral storage with ZFS and see how this can further improve performance.

May
27
2021
--

How InnoDB Handles TEXT/BLOB Columns

InnoDB Handles TEXT BLOB Columns

Recently we had a debate in the consulting team about how InnoDB handles TEXT/BLOB columns. More specifically, the argument was around the Barracuda file format with dynamic rows.

In the InnoDB official documentation, you can find this extract:

When a table is created with ROW_FORMAT=DYNAMIC, InnoDB can store long variable-length column values (for VARCHAR, VARBINARY, and BLOB and TEXT types) fully off-page, with the clustered index record containing only a 20-byte pointer to the overflow page.

Whether columns are stored off-page depends on the page size and the total size of the row. When a row is too long, the longest columns are chosen for off-page storage until the clustered index record fits on the B-tree page. TEXT and BLOB columns that are less than or equal to 40 bytes are stored in line.

The first paragraph indicates InnoDB can store long columns fully off-page, in an overflow page. A little further down in the documentation, the behavior of short values, the ones with lengths less than 40 bytes, is described. These short values are stored in the row.

Our argument was about what happens for lengths above 40 bytes. Are these values stored in the row or in overflow pages? I don’t really need a reason to start digging into a topic, imagine when I have one. With simple, common tools, let’s experiment and find out.

Experimentation with TEXT/BLOB

For our experiment, we need a MySQL database server, a table with a TEXT column, and the hexdump utility. I spun up a simple LXC instance in my lab with Percona Server 8.0 and created the following table:

CREATE TABLE `testText` (
`id` int(11) unsigned NOT NULL,
`before` char(6) NOT NULL DEFAULT 'before',
`data` text NOT NULL,
`after` char(5) NOT NULL DEFAULT 'after',
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

The before and after columns are there to help locate the TEXT column in the datafile page. We can insert a few rows into the table; a TEXT value whose size is a multiple of 16 is convenient with hexdump, since successive identical lines are replaced by ‘*’.

mysql> insert into testText (id,data) values (1,repeat('0123456789abcdef',1));
Query OK, 1 row affected (0.01 sec)
mysql> insert into testText (id,data) values (2,repeat('0123456789abcdef',2));
Query OK, 1 row affected (0.00 sec)
mysql> insert into testText (id,data) values (3,repeat('0123456789abcdef',3));
Query OK, 1 row affected (0.01 sec)

Then we use the hexdump utility at the shell level, the first leaf page starts at offset 0xc000:

# hexdump -C /var/lib/mysql/test/testText.ibd
...
0000c060  02 00 1b 69 6e 66 69 6d  75 6d 00 04 00 0b 00 00  |...infimum......|
0000c070  73 75 70 72 65 6d 75 6d  10 00 00 10 00 32 00 00  |supremum.....2..|
0000c080  00 01 00 00 00 05 74 3f  d9 00 00 01 5d 01 10 62  |......t?....]..b|
0000c090  65 66 6f 72 65 30 31 32  33 34 35 36 37 38 39 61  |efore0123456789a|
0000c0a0  62 63 64 65 66 61 66 74  65 72 20 00 00 18 00 42  |bcdefafter ....B|
0000c0b0  00 00 00 02 00 00 00 05  74 40 da 00 00 01 5e 01  |........t@....^.|
0000c0c0  10 62 65 66 6f 72 65 30  31 32 33 34 35 36 37 38  |.before012345678|
0000c0d0  39 61 62 63 64 65 66 30  31 32 33 34 35 36 37 38  |9abcdef012345678|
0000c0e0  39 61 62 63 64 65 66 61  66 74 65 72 30 00 00 20  |9abcdefafter0.. |
0000c0f0  ff 7e 00 00 00 03 00 00  00 05 74 45 dd 00 00 01  |.~........tE....|
0000c100  62 01 10 62 65 66 6f 72  65 30 31 32 33 34 35 36  |b..before0123456|
0000c110  37 38 39 61 62 63 64 65  66 30 31 32 33 34 35 36  |789abcdef0123456|
*
0000c130  37 38 39 61 62 63 64 65  66 61 66 74 65 72 00 00  |789abcdefafter..|

Clearly, the values are stored inside the row, even the third one with a length of 48.

Cutoff to an Overflow Page

If we continue to increase the length, the behavior stays the same up to a length of 8080 bytes (505 repetitions). If we add 16 bytes to the length, the row becomes larger than half of the available space on the page. At this point, the TEXT value is moved to an overflow page and replaced by a 20-byte pointer in the row itself. The pointer in the row looks like this:

0000c060 02 00 1c 69 6e 66 69 6d 75 6d 00 02 00 0b 00 00 |...infimum......|
0000c070 73 75 70 72 65 6d 75 6d 14 c0 00 00 10 ff f1 00 |supremum........|
0000c080 00 00 01 00 00 00 05 74 80 a4 00 00 01 fe 01 10 |.......t........|
0000c090 62 65 66 6f 72 65 00 00 01 02 00 00 00 04 00 00 |before..........|
0000c0a0 00 26 00 00 00 00 00 00 1f a0 61 66 74 65 72 00 |.&........after.|
0000c0b0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

The 20-byte pointer is: 00 00 01 02 00 00 00 04 00 00 00 26 00 00 00 00 00 00 1f a0

  • Space ID: 00 00 01 02
  • First overflow page: 00 00 00 04 (4 x 0x4000 = 0x10000)
  • Offset in overflow page: 00 00 00 26
  • Version number: 00 00 00 00
  • Total length of the TEXT value: 00 00 1f a0 (0x1fa0 = 8096 = 16*506)

The overflow page:

00010000 c1 06 3d 24 00 00 00 04 00 00 00 00 00 00 00 00 |..=$............|
00010010 00 00 00 00 74 dd 8e 3f 00 0a 00 00 00 00 00 00 |....t..?........|
00010020 00 00 00 00 01 02 00 00 1f a0 ff ff ff ff 30 31 |..............01|
00010030 32 33 34 35 36 37 38 39 61 62 63 64 65 66 30 31 |23456789abcdef01|
*
00011fc0 32 33 34 35 36 37 38 39 61 62 63 64 65 66 00 00 |23456789abcdef..|
00011fd0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|

The value header starting @ 0x10026: 00 00 1f a0 ff ff ff ff

  • Length in that overflow page: 00 00 1f a0 (0x1fa0 = 8096 = 16*506)
  • Next overflow page: ff ff ff ff (this means it is the last one)

From the above, one can make a few interesting observations:

  • The 20 bytes pointer includes a space_id value so in theory, the overflow page could be in another tablespace (ibd file).
  • The total length is using 4 bytes even though this is a TEXT and not a LONGTEXT column. 2 bytes would have been sufficient.
  • The length of an overflow page chunk is also using 4 bytes for length even though the largest possible InnoDB page size is 64KB.

Performance Impacts

TEXT/BLOB columns are the source of a number of performance-related impacts. Let’s review the most obvious ones.

Storage

The first performance impact is related to storage inefficiency. TEXT/BLOB values are stored in chunks of 16KB if the default InnoDB page size is used. That means, on average, about 8KB per value is lost when overflow pages are used. This leads to larger data files and less efficient caching.

Reads

The presence of TEXT/BLOB columns can significantly increase the number of read IOPs required for queries. For example, let’s consider the following simple query:

SELECT * from myTable limit 10;

If we ignore caching, without TEXT/BLOB columns, the above query would require only one read IOP per level in the primary key btree of myTable. For a small table, this could be one or two read IOPs. Now, if each row has a 1MB TEXT/BLOB column, the same simple query would require in excess of 640 read IOPs since each TEXT/BLOB value is a chain of 64 pages of 16KB.

Writes

For this section, let’s assume a worst-case scenario with row-based replication enabled and a full row image. Now, if we insert a row with a 1MB value as a longText column in a Percona server 8.0 instance, we have:

+----------------------------------+---------------------------+
| FILE_NAME                        | SUM_NUMBER_OF_BYTES_WRITE |
+----------------------------------+---------------------------+
| /var/lib/mysql/#ib_16384_0.dblwr |                   1572864 |
| /var/lib/mysql/ib_logfile0       |                   1174528 |
| /var/lib/mysql/test/testText.ibd |                   1130496 |
| /var/lib/mysql/binlog.000003     |                   1048879 |
| /tmp/MLfd=41                     |                   1048783 |
| /var/lib/mysql/undo_001          |                    278528 |
| /var/lib/mysql/undo_002          |                    131072 |
| /var/lib/mysql/ibdata1           |                     16384 |
| /var/lib/mysql/mysql.ibd         |                     16384 |
+----------------------------------+---------------------------+

For a total of around 6.4 MB. This is not surprising: there are two log files, and the data is also written twice because of the doublewrite buffer. The temporary file is used for the disk binlog cache and, unless the transaction is very long, it won’t actually be written to storage.

Anyway, this is just to set the stage for what happens after an update. If I just change the last letter of the longText value, the amount of data written rises to approximately 10 MB.

+----------------------------------+---------------------------+
| FILE_NAME                        | SUM_NUMBER_OF_BYTES_WRITE |
+----------------------------------+---------------------------+
| /var/lib/mysql/#ib_16384_0.dblwr |                   2260992 |
| /var/lib/mysql/test/testText.ibd |                   2211840 |
| /var/lib/mysql/binlog.000003     |                   2097475 |
| /tmp/MLfd=41                     |                   2097379 |
| /var/lib/mysql/ib_logfile0       |                   1129472 |
| /var/lib/mysql/undo_001          |                     32768 |
| /var/lib/mysql/ibdata1           |                     16384 |
+----------------------------------+---------------------------+

The longText value is not modified in place, it is copied to a new set of overflow pages. The new and old overflow pages then need to be flushed to storage. Also, since we use the worst-case scenario of the full row image, the binary log entry has the old and new value stored but the InnoDB log files only have the new version.

I hope this illustrates why storing mutable data in a TEXT/BLOB column is a bad idea.

JSON

Although columns stored using the MySQL JSON data type are stored as a TEXT/BLOB column, MySQL 8.0 added some logic to allow in-place updates. The impact of an update to a large JSON column in 8.0 is not as severe as in 5.7.
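
As a sketch of what that means in practice (the table below is hypothetical): in 8.0 the partial, in-place update path is only taken when the column is modified through functions such as JSON_SET(), JSON_REPLACE(), or JSON_REMOVE(), and the replication footprint can be reduced further with binlog_row_value_options:

-- hypothetical table: docs(id INT PRIMARY KEY, doc JSON)
UPDATE docs SET doc = JSON_SET(doc, '$.status', 'done') WHERE id = 1;  -- eligible for a partial, in-place update
UPDATE docs SET doc = '{"status": "done"}' WHERE id = 1;               -- full rewrite of the JSON value
SET GLOBAL binlog_row_value_options = 'PARTIAL_JSON';                  -- log only the JSON diff in row-based events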

How to Best Deal with TEXT/BLOB Columns?

Data Compression

Data compression is a very compelling option for TEXT/BLOB values. By definition, those values are normally large and thus usually compress well. This is nothing new and has been discussed previously. PostgreSQL, for one, compresses its TEXT/BLOB columns (via TOAST) by default.

The best place to compress is always the application, when that is possible, of course. As we have just seen, this reduces the write load and spares the database from the CPU burden of compression.

Another option with MySQL is the InnoDB Barracuda table compression. When used, TEXT/BLOB values are compressed before being written to the overflow pages. This is much more efficient than compressing the pages one at a time.
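
As a sketch, and assuming file-per-table tablespaces, the test table used in this post could be created with InnoDB table compression like this (the KEY_BLOCK_SIZE value is only illustrative):

CREATE TABLE `testTextCompressed` (
  `id` int unsigned NOT NULL,
  `data` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;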

Finally, if you are using Percona Server or MariaDB, you have access to transparent column compression. This is the second-best option performance-wise, for when compression by the application is not possible.
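
For reference, a sketch of the Percona Server syntax is shown below (MariaDB uses a different COMPRESSED column attribute; check the documentation of your exact version before using it):

ALTER TABLE testText MODIFY `data` text NOT NULL COLUMN_FORMAT COMPRESSED;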

Avoid Returning TEXT/BLOB Columns

When there are large TEXT/BLOB columns in a table, the cost of accessing those columns is high. Because of this, it is important to select only the columns that are needed and avoid the default use of “select * from”. Unfortunately, many ORM frameworks instantiate objects by grabbing all the columns. When using such frameworks, you should consider storing the TEXT/BLOB in a separate table with a loose 1:1 relationship with the original table. That way, the ORM is able to instantiate the object without necessarily forcing a read of the TEXT/BLOB columns.
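
A minimal sketch of that split-table layout, with hypothetical names, could look like the following; the ORM maps only the article table, and the large value is fetched with an explicit query when it is really needed:

CREATE TABLE article (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  title VARCHAR(200) NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

CREATE TABLE article_body (
  article_id INT UNSIGNED NOT NULL,
  body LONGTEXT NOT NULL,
  PRIMARY KEY (article_id),
  CONSTRAINT fk_article_body FOREIGN KEY (article_id) REFERENCES article (id)
) ENGINE=InnoDB;

-- fetched only when the application really needs the large value
SELECT body FROM article_body WHERE article_id = 42;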

Conclusion

I hope this article improved your understanding of TEXT/BLOB values in InnoDB. When used correctly, TEXT/BLOB columns can be useful and have a limited impact on the performance of the database.

Apr
02
2021
--

Percona XtraBackup Point-In-Time Recovery for the Single Database

Percona XtraBackup Point-In-Time Recovery

Recovering to a particular time in the past is called Point-In-Time Recovery (PITR). With PITR you can roll back an unwanted DELETE without a WHERE clause, or any other harmful command.

PITR with Percona XtraBackup is pretty straightforward and perfectly described in the user manual. You need to restore the data from the backup, then apply all binary logs created or updated after the backup was taken, but skip harmful event(s).

However, if your data set is large, you may want to recover only the affected database or table. This is possible, but you need to be smart when filtering events from the binary log. In this post, I will show how to perform such a partial recovery using only Percona XtraBackup, the mysql command-line client, and the mysqlbinlog program. There is an alternative approach that involves creating a fake source server, which is described in MySQL Point in Time Recovery the Right Way. You may consider it, especially if you need to apply changes to a single table.

Percona XtraBackup Point-In-Time Recovery

For our example, we will create data first, then run DROP and DELETE commands on two different tables. Then we will roll back these commands.

First, let’s assume we have a server with two databases: test and sbtest. We are using GTIDs and the row-based binary log format. We also run the server with the option innodb_file_per_table=1, so all our InnoDB tables use individual tablespaces; otherwise, the individual restore method would not work.
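
As a quick sanity check before relying on this method, these prerequisites can be verified on the server; this is just a sketch using standard variables:

SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('log_bin', 'binlog_format', 'gtid_mode',
                        'enforce_gtid_consistency', 'innodb_file_per_table');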

mysql> show tables from sbtest;
+------------------+
| Tables_in_sbtest |
+------------------+
| sbtest1          |
| sbtest2          |
| sbtest3          |
| sbtest4          |
| sbtest5          |
| sbtest6          |
| sbtest7          |
| sbtest8          |
+------------------+
8 rows in set (0.00 sec)

mysql> show tables from test;
+----------------+
| Tables_in_test |
+----------------+
| bar            |
| baz            |
| foo            |
+----------------+
3 rows in set (0.00 sec)

We will experiment with tables foo and bar. We assume that at the time of our first backup, each of the tables contained five rows. Tables in the database sbtest also contain data, but it does not really matter for our experiment.

mysql> select count(*) from foo;
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from bar;
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from baz;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

Since we want to restore individual tables, we need to make a preparation before taking the backup: storing the database structure. We will do it with the help of the mysqldump command. In this example, I store the structure per database to make partial PITR easier, but you are free to use the option --all-databases.

mysqldump --no-data --set-gtid-purged=OFF --triggers --routines --events test > test_structure.sql
mysqldump --no-data --set-gtid-purged=OFF --triggers --routines --events sbtest > sbtest_structure.sql

Then we are ready to take the backup.

xtrabackup --parallel=8 --target-dir=./full_backup --backup

I am using the option --parallel to speed up the backup process.

Now let’s do some testing. First, let’s update rows in the table foo.

mysql> update foo set f1=f1*2;
Query OK, 5 rows affected (0.01 sec)
Rows matched: 5 Changed: 5 Warnings: 0

mysql> select * from foo;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
+----+------+
5 rows in set (0.00 sec)

And then drop it and delete all rows from the table bar.

mysql> drop table foo;
Query OK, 0 rows affected (0.02 sec)

mysql> delete from bar;
Query OK, 5 rows affected (0.01 sec)

Finally, let’s insert a few rows into the tables bar and baz.

mysql> insert into bar(f1) values(6),(7),(8),(9),(10);
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> insert into baz(f1) values(1),(2),(3),(4),(5);
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

Assume that the DROP TABLE and DELETE commands were an accident, and we want to restore the tables foo and bar to the state they were in before these unlucky statements.

First, we need to prepare the backup.

Since we are interested in restoring only the tables in the database test, we need to prepare the backup with the special option --export, which exports tablespaces in a way that allows them to be imported later:

xtrabackup --prepare --export --target-dir=./full_backup

Now the directory for the database test contains not only table definition files (.frm, only before 8.0) and tablespace files (.ibd) but also configuration files (.cfg).

Since we want all changes that happened after the backup and before the problematic DROP TABLE and DELETE statements, we need to identify which binary log and position were current at backup time. We can find them in the xtrabackup_binlog_info file:

$ cat full_backup/xtrabackup_binlog_info
master-bin.000004 1601 0ec00eed-87f3-11eb-acd9-98af65266957:1-56

Now we are ready to perform restore.

First, let’s restore the table foo from the backup. Restoring individual tablespaces requires the ALTER TABLE ... IMPORT TABLESPACE command. This command assumes that the table exists in the server. However, in our case, it was dropped and therefore we need to re-create it.

We will recreate the full database test from the file test_structure.sql.

Since we do not want these administrative tasks to be re-applied, I suggest disabling binary logging for the session which will recreate the database structure.

mysql> set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)

mysql> source test_structure.sql
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)
....

Once the tables are recreated, discard their tablespaces. I will show an example for the table foo; adjust the code for the rest of the tables.

mysql> alter table foo discard tablespace;
Query OK, 0 rows affected (0.01 sec)

Then, in another terminal, copy the tablespace and configuration files from the backup to the database directory:

cp full_backup/test/foo.{ibd,cfg} var/mysqld.1/data/test/

And, finally, import the tablespace:

mysql> alter table foo import tablespace;
Query OK, 0 rows affected (0.05 sec)

Repeat for the other tables in the database test.

Now you can enable binary logging back.

You can do the same task in a script. For example:

for table in `mysql test --skip-column-names --silent -e "show tables"`
> do
>   mysql test -e "set sql_log_bin=0; alter table $table discard tablespace"
>   cp full_backup/test/$table.{ibd,cfg} var/mysqld.1/data/test/
>   mysql test -e "set sql_log_bin=0; alter table $table import tablespace"
> done

Our tables are recovered but do not have the updates made after the backup.

mysql> select * from foo;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
+----+------+
5 rows in set (0.00 sec)

mysql> select * from bar;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
+----+------+
5 rows in set (0.00 sec)

mysql> select * from baz;
Empty set (0.00 sec)

Therefore, we need to restore data from the binary logs.

To do this, we first need to identify the GTIDs of the disaster events. This can be done by dumping all binary logs created or updated after the backup into a text file and then searching for the DROP TABLE and DELETE statements so we can skip them.

First, let’s check which binary logs we have.

mysql> show binary logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| master-bin.000001 |   1527476 |
| master-bin.000002 |      3035 |
| master-bin.000003 |      1987 |
| master-bin.000004 |      2466 |
| master-bin.000005 |       784 |
+-------------------+-----------+
5 rows in set (0.00 sec)

So we need to parse them, starting from the log master-bin.000004 and position 1601:

mysqlbinlog --start-position=1601 -vvv --base64-output=decode-rows --database=test master-bin.000004 master-bin.000005 > binlog_test.sql

I used the option -vvv, which prints the SQL representation of row events so we can find the ones we want to skip, and --base64-output=decode-rows to avoid printing the raw row events. We will not use this file for the restore, only for finding the DROP TABLE and DELETE events.

Here they are, at positions 2007 and 2123, with GTIDs 0ec00eed-87f3-11eb-acd9-98af65266957:58 and 0ec00eed-87f3-11eb-acd9-98af65266957:59:

# at 2007
#210321 13:29:58 server id 1 end_log_pos 2123 CRC32 0xd1eb9854 Query thread_id=138 exec_time=0 error_code=0
use `test`/*!*/;
SET TIMESTAMP=1616322598/*!*/;
DROP TABLE `foo` /* generated by server */
/*!*/;
# at 2123
#210321 13:30:08 server id 1 end_log_pos 2188 CRC32 0xfc9b2088 GTID last_committed=7 sequence_number=8 rbr_only=yes original_committed_timestamp=0 immediate_commit_timestamp=0 transaction_length=0
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=0 (1970-01-01 02:00:00.000000 EET)
# immediate_commit_timestamp=0 (1970-01-01 02:00:00.000000 EET)
/*!80001 SET @@session.original_commit_timestamp=0*//*!*/;
/*!80014 SET @@session.original_server_version=0*//*!*/;
/*!80014 SET @@session.immediate_server_version=0*//*!*/;
SET @@SESSION.GTID_NEXT= '0ec00eed-87f3-11eb-acd9-98af65266957:59'/*!*/;
# at 2188
#210321 13:30:08 server id 1 end_log_pos 2260 CRC32 0x1d525b11 Query thread_id=138 exec_time=0 error_code=0
SET TIMESTAMP=1616322608/*!*/;
BEGIN
/*!*/;
# at 2260
#210321 13:30:08 server id 1 end_log_pos 2307 CRC32 0xb57ecb73 Table_map: `test`.`bar` mapped to number 226
# at 2307
#210321 13:30:08 server id 1 end_log_pos 2387 CRC32 0x6770a7e2 Delete_rows: table id 226 flags: STMT_END_F
### DELETE FROM `test`.`bar`
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2=1 /* INT meta=0 nullable=1 is_null=0 */
### DELETE FROM `test`.`bar`
### WHERE
...

Note that the decoded row event contains a DELETE command for each affected row.

We can also find out which binary log the event belongs to by searching for the “Rotate to” event. In our case, “Rotate to master-bin.000005” happened after the found positions, so we only need the file master-bin.000004. In your case, you may need to skip events from previous log files too.

So to restore the data we need to run mysqlbinlog one more time, this time with parameters:

mysqlbinlog  --start-position=1601 --exclude-gtids=0ec00eed-87f3-11eb-acd9-98af65266957:58-59 --database=test --skip-gtids=true master-bin.000004 master-bin.000005 > binlog_restore.sql

I removed the option -vvv because we are not going to examine this restore file, and the option --base64-output=decode-rows because we need the row events to be present in the resulting file. I also used the option --exclude-gtids=0ec00eed-87f3-11eb-acd9-98af65266957:58-59 to exclude the GTIDs that we do not want to re-apply. We also need --skip-gtids=true because otherwise the updates would be skipped, since these GTIDs already exist on the server.

Now binlog_restore.sql contains all updates to the database test made after the backup and before the DROP statement. Let’s restore it.

mysql test < binlog_restore.sql

The restore went successfully, and our tables now have all the past updates.

mysql> select * from foo;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
+----+------+
5 rows in set (0.01 sec)

mysql> select count(*) from bar;
+----------+
| count(*) |
+----------+
|       10 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from baz;
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

Conclusion

You may save time on PITR if you use the per-database restore method. However, you need to take the following considerations into account:

  • mysqlbinlog does not support filtering per table, therefore you either need to restore the full database or use a fake server method, described in MySQL Point in Time Recovery the Right Way.
  • Per-database filters depend on the USE statement in the statement-based binary log format. Therefore option --database can only be considered safe with a row-based format.
  • If you do not use GTIDs, you can still use this method; you will need to combine the options --start-position and --stop-position to skip the event.

Feb
11
2021
--

Various Types of InnoDB Transaction Isolation Levels Explained Using Terminal

InnoDB Transaction Isolation Levels

The goal of this blog post is to explain the various types of transaction isolation levels available in MySQL. After reading the blog, you will be able to explain dirty reads, non-repeatable reads, and the concept of phantom rows as well.

What is the Isolation Level in MySQL?

Isolation (the “I” in ACID) defines how each transaction is isolated from other transactions and is a critical component of application design. InnoDB provides the four isolation levels defined by the SQL:1992 standard. I have listed them below in order; each isolation level provides better consistency than the previous one.

  • READ-UNCOMMITTED
  • READ-COMMITTED
  • REPEATABLE-READ – ( MySQL’s DEFAULT )
  • SERIALIZABLE

You can change the isolation level at runtime using the variable “transaction_isolation”. As transaction isolation changes can impact the result sets of your queries, you most certainly want to test this in a non-production environment in order to evaluate the impact on your application.

mysql> set global transaction_isolation='read-committed';
Query OK, 0 rows affected (0.00 sec)
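
Note that SET GLOBAL only affects connections opened after the change, which is why the examples below reconnect (\r) before starting a transaction. To verify the effective level, or to change it only for the current session, something like this sketch can be used:

SELECT @@global.transaction_isolation, @@session.transaction_isolation;
SET SESSION transaction_isolation = 'READ-COMMITTED';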

READ-UNCOMMITTED:

  • No locks
  • Dirty reads, non-repeatable reads, phantom reads are possible

The example below will help in understanding “read-uncommitted” and how dirty reads happen. I am using two sessions – S1 and S2.

For session S1:

mysql> set global transaction_isolation='read-uncommitted';
Query OK, 0 rows affected (0.00 sec)

mysql> \r
Connection id:    16
Current database: percona

mysql> select * from ReadUncommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update ReadUncommit set name='ram' where id=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

For session S2:

mysql> select * from percona.ReadUncommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | ram  |
+----+------+
3 rows in set (0.00 sec)

At S1, I globally modified the transaction_isolation to read-uncommitted and started a transaction. I executed the UPDATE statement ( name = ram ) at S1 but did not commit the transaction yet. Then I opened session S2 and executed a SELECT on the table, and I was able to see the uncommitted modified data. This is called a dirty read.

So, with “read-uncommitted”, a transaction can view modifications made by other transactions before they are committed.

READ-COMMITTED:

  • Dirty reads are not possible
  • Non-repeatable reads and phantom reads are possible

The example below will help in understanding “read-committed” and how non-repeatable reads happen. I am using two sessions – S1 and S2.

For session S1:

mysql> set global transaction_isolation='read-committed';
Query OK, 0 rows affected (0.00 sec)

mysql> \r
Connection id:    18
Current database: percona

mysql> select * from ReadCommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update ReadCommit set name='ram' where id=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

For session  S2:

mysql> select * from percona.ReadCommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

At S1, I globally modified the transaction_isolation to “read-committed” and started a transaction. I executed the UPDATE statement ( name = ram ) at S1 but did not commit the transaction. Then I opened S2 and executed a SELECT on the table, and I was not able to see the uncommitted modified data.

So, with “read-committed”, a transaction cannot view modifications from other transactions until they are committed. Only committed modifications can be viewed.

Then, what is the drawback with “read-committed”?

A non-repeatable read is possible with “read-committed”. Below, I show how a non-repeatable read occurs.

For session  S1:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from ReadCommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

For session  S2:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update percona.ReadCommit set name='ram' where id=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> commit;
Query OK, 0 rows affected (0.00 sec)

For session  S1:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from ReadCommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

mysql> select * from ReadCommit;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | ram  |
+----+------+
3 rows in set (0.00 sec)

At S1, I started a transaction and executed a SELECT to view the data. Then, at S2, I executed the UPDATE statement ( name = ram where id = 3 ) to modify the data and committed the transaction on S2. Again at S1, I executed the same SELECT to view the data inside the same transaction. But this time I got a different result. This is called a non-repeatable read.

Getting a different result for the same query inside a transaction is not desirable, and it may make your transaction inconsistent. “REPEATABLE-READ” helps to overcome this.

REPEATABLE-READ:

  • Dirty reads and non-repeatable reads are not possible
  • Phantom reads are possible

The example below will help in understanding “repeatable-read” and how phantom reads happen. I am using two sessions – S1 and S2.

For session S1:

mysql> set global transaction_isolation='repeatable-read';
Query OK, 0 rows affected (0.00 sec)

mysql> \r
Connection id:    20
Current database: percona

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from RepeatRead;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

For session S2:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update percona.RepeatRead set name='ram' where id=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> commit;
Query OK, 0 rows affected (0.00 sec)

For session S1:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from RepeatRead;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

mysql> select * from RepeatRead;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

At S1, I globally modified the transaction_isolation to “repeatable-read” and started a transaction. I executed a SELECT to view the data. Then I opened S2 and executed the UPDATE statement ( name = ram where id = 3 ) to modify the data and committed the transaction on S2. Again at S1, I executed the same SELECT to view the data inside the same transaction. There are no changes. So here we overcome the non-repeatable read problem we saw with read-committed.

With repeatable-read, the snapshot is taken during the first execution of the SELECT and is used until the transaction ends. Still, we may get phantom rows with repeatable-read.

How Do Phantom Rows Occur?

The example below will help in understanding how phantom rows occur inside a transaction.

For session S1:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from RepeatRead;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

For session  S2:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into percona.RepeatRead values (4,'ram');
Query OK, 1 row affected (0.00 sec)

mysql> commit;
Query OK, 0 rows affected (0.00 sec)

For session  S1:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from RepeatRead;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

mysql> select * from RepeatRead;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

mysql> update percona.RepeatRead set name='sriram' where id=4;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

mysql> select * from RepeatRead;
+----+--------+
| id | name   |
+----+--------+
|  1 | jc     |
|  2 | herc   |
|  3 | sri    |
|  4 | sriram |
+----+--------+
4 rows in set (0.00 sec)

In the example above, at S1 I executed a SELECT to read the data. Then I inserted a row ( id=4 ) on S2. I executed the SELECT again at S1, and there were no changes because we are using repeatable-read isolation. Then, still at S1, I executed an UPDATE that modified the row inserted by S2 ( id=4 ) and ran the same SELECT again. This time I got a different result: the new row appears. This is called a phantom read, and it can be avoided with “SERIALIZABLE”.

SERIALIZABLE:

  • No dirty reads
  • No non-repeatable reads
  • No phantom reads

The example below will help in understanding the “serializable” isolation level. I am using two sessions – S1 and S2.

For session S1:

mysql> set global transaction_isolation='serializable';
Query OK, 0 rows affected (0.00 sec)

mysql> \r
Connection id:    22
Current database: percona

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from Serialize;
+----+------+
| id | name |
+----+------+
|  1 | jc   |
|  2 | herc |
|  3 | sri  |
+----+------+
3 rows in set (0.00 sec)

For session S2:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into percona.Serialize values (4,'ram');
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

mysql> update percona.Serialize set name='aaa' where id=3;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

In the example above, I globally modified the isolation level to “serializable” and started a transaction at S1. Then I opened session S2 and tried to run an INSERT and an UPDATE… however, these statements time out because of the pre-existing locks being held. At the serializable isolation level, InnoDB implicitly converts all plain SELECT statements to SELECT ... FOR SHARE if autocommit is disabled. The SELECT statement from S1 sets shared locks (as with LOCK IN SHARE MODE) that permit other transactions to read the examined rows but not to update or delete them, so the INSERT/UPDATE fails.
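
If you want to see those shared locks while the S1 transaction is still open, and assuming MySQL or Percona Server 8.0, a third session can inspect them with a query like this sketch:

SELECT object_schema, object_name, index_name, lock_type, lock_mode, lock_status, lock_data
FROM performance_schema.data_locks;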

Serializable ensures more consistency for the transactions, but you will have to address increased locking conditions in MySQL.

Jan
19
2021
--

The MySQL Clone Wars: Plugin vs. Percona XtraBackup

MySQL Plugin vs. Percona XtraBackup

Large replication topologies are quite common nowadays, and this kind of architecture often requires a quick method to rebuild a replica from another server.

The Clone Plugin, available since MySQL 8.0.17, is a great feature that allows cloning databases out of the box. It is easy to rebuild a replica or to add new nodes to a cluster using the plugin. Before the release of the plugin, the best open-source alternative was Percona XtraBackup for MySQL Databases.

In this blog post, we compare both alternatives for cloning purposes. If you need to perform backups, Percona XtraBackup is a better tool as it supports compression and incremental backups, among other features not provided by the plugin. The plugin supports compression only for network transmission, not for storage.

But one of the plugin’s strong points is simplicity. Once installed and configured, cloning a database is straightforward. Just issuing a command from the destination database is enough.
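
For reference, the whole recipient-side procedure can be as short as the sketch below. The host name and credentials are illustrative, and it assumes a clone user already exists with BACKUP_ADMIN on the donor and CLONE_ADMIN on the recipient:

INSTALL PLUGIN clone SONAME 'mysql_clone.so';            -- on both donor and recipient
SET GLOBAL clone_valid_donor_list = 'source-host:3306';  -- on the recipient
CLONE INSTANCE FROM 'clone_user'@'source-host':3306 IDENTIFIED BY 'secret';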

Percona XtraBackup, on the other hand, is a more complex tool. The cloning process involves several stages: backup, stream, write, and prepare. These stages can take place in parallel: we can stream the backup to the new server using netcat and, at the same time, write it into the destination directory. The only stage that is sequential is the last one: prepare.

Test Characteristics

We used sysbench to create 200 tables of 124 MB each, for a total of roughly 24 GB. Both the source and replica virtual machines run 4 cores, 8 GB of RAM, and 60 GB of storage. We created the disks on the same datastore.

During the tests, we did not generate additional operations on the database. We measured only the clone process, reducing the benchmark complexity. Otherwise, we would have to take into consideration things like application response time, or the number of transactions executed. This is beyond the scope of this assessment.

We tested different combinations of clone and Percona XtraBackup operations. For XtraBackup, we tested 1 to 4 threads, with and without compression. In the case of compression, we allocated the same number of threads to compression and decompression. For the clone plugin, we tested auto (which lets the server decide how many threads will perform the clone) and 1 to 4 threads. We also tested with and without compression. Finally, we executed all the tests using three different network limits: 500 Mbps, 1000 Mbps, and 4000 Mbps. That makes a total of 54 tests, executed 12+1 times each.

All times are in seconds. In the graphs below, lower values are better.

Method

Clone

Out of the required parameters to operate the clone plugin, the following were set up accordingly in the recipient server:

  • clone_max_concurrency=<maximum number of threads> Defines the maximum number of threads used for a remote cloning operation when autotune is enabled. Otherwise, this is the exact number of threads that remote cloning uses.
  • clone_autotune_concurrency If enabled, the clone operation uses up to clone_max_concurrency threads (its default is 16).
  • clone_enable_compression If enabled, the remote clone operation uses compression.

Percona XtraBackup

To stream the backup we used the xbstream format and sent the data to the remote server using netcat. We applied the following parameters:

  • parallel=<number of threads> Xtrabackup and xbstream parameter that defines the number of threads used for backup and restore operations.
  • rebuild-threads The number of threads used for the rebuild (prepare) operation.
  • decompress_threads and compress_threads Xtrabackup and xbstream parameters that define the number of threads used for compression operations.

Some people use additional parameters like innodb-read-io-threads, innodb-write-io-threads, or innodb-io-capacity, but these parameters only affect the behavior of InnoDB background threads. They have no impact during backup and restore operations.

Results

Clone

No compression

For the lower bandwidth tests, the number of threads used does not make a difference. Once we increase the bandwidth, we see that the time is cut in half when we move from one thread to two. Going beyond that value improves things only slightly; we probably reach the disk I/O limit.

Clone Plugin performance without compression.

The auto option is consistently the fastest one.

Compression

Compression is supposed to improve performance for lower bandwidth connections, but we see that this is not the case. Bandwidth has no impact on execution time and compression makes the clone slower. Again auto gives the best results, equivalent to 4 threads.
clone plugin with compression

Percona XtraBackup

No Compression

Without compression, we see again that the number of threads does not make any difference in the lower bandwidth test. When we increase the bandwidth, the number of threads matters, but we quickly reach I/O limits.

Percona Xtrabackup stream without compression

Compression

When using compression, we see that the process requires less time to complete in almost every case compared with the uncompressed option, even when bandwidth is not the limiting factor.

Percona Xtrabackup stream with compression

Conclusion

We see that, when using compression, the clone plugin is the slower option while Percona XtraBackup gives great results for all bandwidths. Without compression, the clone plugin is faster when using more than 2 threads. XtraBackup is faster for fewer threads.

Xtrabackup vs. Clone plugin - results summary

Below, we have a chart comparing the worst and best results. As expected, the worst results correspond to one thread executions.

The Clone Plugin is a great option for simplicity. Percona XtraBackup is excellent to save bandwidth and provides better results with fewer threads. With enough threads and bandwidth available, both solutions provide comparable results.

Sep
30
2020
--

MySQL 101: Tuning MySQL After Upgrading Memory

Tuning MySQL After Upgrading Memory

In this post, we will discuss what to do when you add more memory to your instance. Adding memory to a server where MySQL is running is common practice when scaling resources.

First, Some Context

Scaling resources is just adding more resources to your environment, and this can be split in two main ways: vertical scaling and horizontal scaling.

Vertical scaling is increasing hardware capacity for a given instance, thus having a more powerful server, while horizontal scaling is adding more servers, a pretty standard approach for load balancing and sharding.

As traffic grows, working datasets are getting bigger, and thus we start to suffer because the data that doesn’t fit into memory has to be retrieved from disk. This is a costly operation, even with modern NVME drives, so at some point, we will need to deal with either of the scaling solutions we mentioned.

In this case, we will discuss adding more RAM, which is usually the fastest and easiest way to scale hardware vertically; more memory is also probably the resource MySQL benefits from the most.

How to Calculate Memory Utilization

First of all, we need to be clear about which variables allocate memory during MySQL operations, and we will cover only the common ones, as there are a bunch of them. We also need to know that some variables allocate memory globally, while others do a per-thread allocation.

For the sake of simplicity, we will cover this topic considering the usage of the standard storage engine: InnoDB.

We have globally allocated variables:

key_buffer_size: a MyISAM setting that should be set to 8-16M; anything above that is just wrong, because we shouldn’t use MyISAM tables unless there is a particular reason. A typical scenario is MyISAM being used only by system tables, which are small (this is valid for versions up to 5.7; in MySQL 8 the system tables were migrated to the InnoDB engine). So the impact of this variable is negligible.

query_cache_size: 0 is default and removed in 8.0, so we won’t consider it.

innodb_buffer_pool_size: which is the cache where InnoDB places pages to perform operations. The bigger, the better.

Of course, there are others, but their impact is minimal when running with defaults.

Also, there are other variables that are allocated per thread (or per open connection):
read_buffer_size, read_rnd_buffer_size, sort_buffer_size, join_buffer_size and tmp_table_size, and a few others. All of them, by default, work very well, as the allocation is small and efficient. Hence, the main potential issue arises when many connections hold these buffers for some time and add extra memory pressure. The ideal situation is to control how many connections are being opened (and used) and try to reduce that number to one that is sufficient and doesn’t hurt the application.
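
As a quick illustration (a sketch, assuming the sys schema is installed and the performance_schema memory instrumentation is enabled), you can also look at actual per-connection memory usage instead of relying only on theoretical maximums:

-- Top threads/connections by currently allocated memory, as instrumented.
-- The view is ordered by current allocation; results depend on which memory
-- instruments are enabled in performance_schema.
SELECT thread_id, user, current_count_used, current_allocated
FROM sys.memory_by_thread_by_current_bytes
LIMIT 10;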

But let’s not lose focus: we have more memory, and we need to know how to tune it properly to make the best use of it.

The most memory-impacting setting we need to focus on is innodb_buffer_pool_size, as this is where almost all the magic happens and it is usually the most significant memory consumer. There is an old rule of thumb that says the “size of this setting should be set to around 75% of available memory”, and some cloud vendors set this value to total_memory*0.75.

I said “old” because that rule was good when running instances with 8G or 16G of RAM was common, so allocating roughly 6G out of 8G or 13G out of 16G used to be logical.

But what if we run an instance with 100G or even 200G? It’s not uncommon to see this type of hardware nowadays, so should we use 80G out of 100G, or 160G out of 200G? In other words, should we avoid allocating somewhere between 20G and 40G of memory and leave it for filesystem cache operations? While these filesystem operations are not useless, I don’t see the OS needing more than 4G-8G for this purpose on a dedicated DB server. It is also recommended to use the O_DIRECT flushing method for InnoDB to bypass the filesystem cache.

Example

Now that we understand the primary variables allocating memory, let’s check a good use case I’m currently working on. Assume this system:

$ free -m
      total      used     free    shared    buff/cache    available
Mem: 385625    307295    40921         4         37408        74865

So roughly 380G of RAM, a nice amount of memory. Now let’s check what is the maximum potential allocation considering max used connections.

A little disclaimer here: while this query is not entirely accurate and can diverge from real results, it gives us a sense of what could potentially be allocated. We could also take advantage of the performance_schema database, but that may require enabling some instruments that are disabled by default:

mysql > show global status like 'max_used_connections';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| Max_used_connections |    67 |
+----------------------+-------+
1 row in set (0.00 sec)

So with a maximum of 67 connections used, we can get:

mysql > SELECT ( @@key_buffer_size
-> + @@innodb_buffer_pool_size
-> + 67 * (@@read_buffer_size
-> + @@read_rnd_buffer_size
-> + @@sort_buffer_size
-> + @@join_buffer_size
-> + @@tmp_table_size )) / (1024*1024*1024) AS MAX_MEMORY_GB;
+---------------+
| MAX_MEMORY_GB |
+---------------+
| 316.4434      |
+---------------+
1 row in set (0.00 sec)

So far, so good, we are within memory ranges, now let’s see how big the innodb_buffer_pool_size is and if it is well sized:

mysql > SELECT (@@innodb_buffer_pool_size) / (1024*1024*1024) AS BUFFER_POOL_SIZE;
+------------------+
| BUFFER_POOL_SIZE |
+------------------+
| 310.0000         |
+------------------+
1 row in set (0.01 sec)

So the buffer pool is 310G, roughly 82% of total memory, and total potential usage so far is around 84%, which leaves us around 60G of memory not being used. Well, it is being used by the filesystem cache, which, in the end, is not used by InnoDB.

OK, now let’s get to the point: how to properly configure memory so that it is used effectively by MySQL. From pt-mysql-summary we know that the buffer pool is fully filled:

Buffer Pool Size | 310.0G
Buffer Pool Fill | 100%

Does this mean we need more memory? Maybe. We know this instance has a working dataset that doesn’t fit in memory (the very reason why we increased the memory size), so let’s check how many disk operations we have, using this command:

mysqladmin -r -i 1 -c 60 extended-status | egrep "Innodb_buffer_pool_read_requests|Innodb_buffer_pool_reads"
| Innodb_buffer_pool_read_requests | 99857480858|
| Innodb_buffer_pool_reads         | 598600690  |
| Innodb_buffer_pool_read_requests | 274985     |
| Innodb_buffer_pool_reads         | 1602       |
| Innodb_buffer_pool_read_requests | 267139     |
| Innodb_buffer_pool_reads         | 1562       |
| Innodb_buffer_pool_read_requests | 270779     |
| Innodb_buffer_pool_reads         | 1731       |
| Innodb_buffer_pool_read_requests | 287594     |
| Innodb_buffer_pool_reads         | 1567       |
| Innodb_buffer_pool_read_requests | 282786     |
| Innodb_buffer_pool_reads         | 1754       |

Innodb_buffer_pool_read_requests: page reads satisfied from memory (good)
Innodb_buffer_pool_reads: page reads from disk (bad)
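
To turn those two counters into a single ratio (a sketch against performance_schema.global_status, available in MySQL 5.7 and later):

-- Buffer pool miss ratio since server start: values near zero mean reads are
-- served from memory; a growing ratio means the working set does not fit.
SELECT
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads')
  /
  (SELECT VARIABLE_VALUE FROM performance_schema.global_status
    WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests') AS miss_ratio;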

As you may notice, we still get some reads from the disk, and we want to avoid them, so let’s increase the buffer pool size to 340G (90% of total memory) and check again:

mysqladmin -r -i 1 -c 60 extended-status | egrep "Innodb_buffer_pool_read_requests|Innodb_buffer_pool_reads"
| Innodb_buffer_pool_read_requests | 99937722883 |
| Innodb_buffer_pool_reads         | 599056712   |
| Innodb_buffer_pool_read_requests | 293642      |
| Innodb_buffer_pool_reads         | 1           |
| Innodb_buffer_pool_read_requests | 296248      |
| Innodb_buffer_pool_reads         | 0           |
| Innodb_buffer_pool_read_requests | 294409      |
| Innodb_buffer_pool_reads         | 0           |
| Innodb_buffer_pool_read_requests | 296394      |
| Innodb_buffer_pool_reads         | 6           |
| Innodb_buffer_pool_read_requests | 303379      |
| Innodb_buffer_pool_reads         | 0           |

Now we are barely going to disk, and IO pressure was released; this makes us happy –  right?
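
For reference, the resize itself does not require a restart, as innodb_buffer_pool_size has been dynamic since MySQL 5.7. A minimal sketch, using the 340G figure from the example above (on 5.7 use SET GLOBAL plus a my.cnf edit; on 8.0 SET PERSIST covers both):

-- Resize the buffer pool online and persist the new value across restarts.
SET PERSIST innodb_buffer_pool_size = 340 * 1024 * 1024 * 1024;

-- The resize happens in chunks in the background; progress can be followed here.
SHOW STATUS LIKE 'Innodb_buffer_pool_resize_status';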

Summary

If you increase the memory size of a server, you mostly need to focus on innodb_buffer_pool_size, as this is the most critical variable to tune. Allocating 90% to 95% of total available memory on big systems is not bad at all, as the OS requires only a few GB to run correctly, and a few more GB for swap should be enough to run without problems.

Also, check the maximum number of connections required (and used), as this is a common mistake causing memory issues. If you need to run with 1000 open connections, then allocating 90% of the memory to the buffer pool may not be possible, and some additional actions may be required (i.e., adding a proxy layer or a connection pool).

From MySQL 8, we have a new variable called innodb_dedicated_server, which auto-calculates the memory allocation. While this variable is really useful as an initial approach, it may under-allocate memory in systems with more than 4G of RAM, as it sets the buffer pool size to (detected server memory * 0.75), so on a 200G server we get only 150G for the buffer pool.
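
As a quick sanity check (a sketch), you can confirm whether the buffer pool was sized by innodb_dedicated_server or set explicitly:

-- Compare the automatic sizing flag against the actual buffer pool size.
SELECT @@innodb_dedicated_server AS dedicated_server,
       ROUND(@@innodb_buffer_pool_size / (1024 * 1024 * 1024), 1) AS buffer_pool_gb,
       @@max_connections AS max_connections;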

Conclusion

Vertical scaling is the easiest and fastest way to improve performance, and it is also cheaper – but not magical. Tuning variables properly requires analysis and understanding of how memory is being used. This post focused on the essential variables to consider when tuning memory allocation, specifically innodb_buffer_pool_size and max_connections. Don’t over-tune when it’s not necessary and be cautious of how these two affect your systems.

Aug
27
2020
--

More on Checkpoints in InnoDB MySQL 8

Recently I posted about checkpointing in MySQL, where MySQL showed interesting “wave” behavior.

Soon after, Dimitri posted a solution for how to fix the “waves,” and I would like to dig a little more into the proposed suggestions, as there is some material to process.

This post will be very heavy on InnoDB configuration, so let’s start with the basic configuration for MySQL, but before that some initial environment.

I use MySQL version 8.0.21 on the hardware described here.

As for the storage, I am not using some “old dusty SSD” but a production-available, enterprise-grade Intel SATA SSD D3-S4510. This SSD is able to handle a throughput of 468MiB/sec of random writes, or 30000 IOPS of random 16KiB writes.

So initial configuration for my test was:

[mysqld]
datadir= /data/mysql8-8.0.21
user=mysql
bind_address = 0.0.0.0

socket=/tmp/mysql.sock
log-error=error.log

ssl=0
performance_schema=OFF

skip_log_bin
server_id = 7

# general
table_open_cache = 200000
table_open_cache_instances=64
back_log=3500
max_connections=4000

join_buffer_size=256K
sort_buffer_size=256K

# files
innodb_file_per_table
innodb_log_file_size=10G
innodb_log_files_in_group=2
innodb_open_files=4000

# buffers
innodb_buffer_pool_size= 140G
innodb_buffer_pool_instances=8
innodb_page_cleaners=8
innodb_purge_threads=4
innodb_lru_scan_depth=512
innodb_log_buffer_size=64M

default_storage_engine=InnoDB

innodb_flush_log_at_trx_commit  = 1
innodb_doublewrite= 1
innodb_flush_method             = O_DIRECT
innodb_file_per_table           = 1
innodb_io_capacity=2000
innodb_io_capacity_max=4000
innodb_flush_neighbors=0

#innodb_monitor_enable=all
max_prepared_stmt_count=1000000

innodb_adaptive_hash_index=1
innodb_monitor_enable='%'

innodb-buffer-pool-load-at-startup=OFF
innodb_buffer_pool_dump_at_shutdown=OFF

There are a lot of parameters, so let’s highlight the most relevant ones for this test:

innodb_buffer_pool_size= 140G

The buffer pool size is enough to fit all data, which is about 100GB in size.

innodb_adaptive_hash_index=1

Adaptive hash index is enabled (as it comes in default InnoDB config)

innodb_buffer_pool_instances=8

This is what defaults provide, but I will increase it, following my previous post. 

innodb_log_file_size=10G

innodb_log_files_in_group=2

These parameters define a limit of 20GB for our redo logs, and this is important, as our workload will be redo-log bound, as we will see from the results.

innodb_io_capacity=2000

innodb_io_capacity_max=4000

You may ask why I use 2000 and 4000 when the storage can handle 30000 IOPS.

This is a valid point, and as we can see later, these parameters are not high enough for this workload, but that also does not mean we should push them all the way up to 30000, as we will see from the results.

MySQL Manual says the following about innodb_io_capacity:

“The innodb_io_capacity variable defines the overall I/O capacity available to InnoDB. It should be set to approximately the number of I/O operations that the system can perform per second (IOPS). When innodb_io_capacity is set, InnoDB estimates the I/O bandwidth available for background tasks based on the set value.” 

From this, you may get the impression that if you set innodb_io_capacity to the I/O bandwidth of your storage, you should be fine. However, this part does not say what you should count as an I/O operation. For example, if your storage can perform 500MB/sec, then with 4KB block IO operations that would be 125000 IO per second, and with 16KB IO it would be roughly 31000 IO per second.

The MySQL manual leaves it up to your imagination, but as the typical InnoDB page size is 16KB, let’s assume we do 16KB block IO.

However later on that page, we can read:

“Ideally, keep the setting as low as practical, but not so low that background activities fall behind. If the value is too high, data is removed from the buffer pool and change buffer too quickly for caching to provide a significant benefit. For busy systems capable of higher I/O rates, you can set a higher value to help the server handle the background maintenance work associated with a high rate of row changes”

and

“Consider write workload when tuning innodb_io_capacity. Systems with large write workloads are likely to benefit from a higher setting. A lower setting may be sufficient for systems with a small write workload.”

I do not see that the manual provides much guidance about what value I should use, so we will test it.

Initial results

So if we benchmark with initial parameters, we can see the “wave” pattern.

 

As for why this is happening, let’s check Percona Monitoring and Management “InnoDB Checkpoint Age” chart:

Actually, InnoDB Flushing by Type in PMM does not show sync flushing yet, so I had to modify the chart a little to show “sync flushing” as the orange line:

And we immediately see that Uncheckpointed Bytes exceeds the Max Checkpoint Age of 16.61GiB, which is defined by the 20GiB of InnoDB log files. 16.61GiB is less than 20GB because InnoDB reserves some cushion for cases exactly like this, so even if we exceed 16.61GiB, InnoDB still has an opportunity to flush data.

Also, we see that before Uncheckpointed Bytes exceeds Max Checkpoint Age, InnoDB flushes pages at a rate of 4000 IOPS, just as defined by innodb_io_capacity_max.

We should try to avoid the case where Uncheckpointed Bytes exceeds Max Checkpoint Age, because when it happens, InnoDB gets into “emergency” flushing mode, and in fact, this is what causes the waves we see. I should have detected this in my previous post, mea culpa.
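
Outside of PMM, the same signal can be approximated directly on the server. This is a sketch built on information_schema.INNODB_METRICS, and it assumes the log metrics are enabled (innodb_monitor_enable='%' in the configuration above takes care of that):

-- Checkpoint age in bytes (distance between the current LSN and the last
-- checkpoint LSN), compared against the sync flush point that triggers
-- aggressive flushing.
SELECT
  MAX(CASE WHEN NAME = 'log_lsn_current'           THEN `COUNT` END) -
  MAX(CASE WHEN NAME = 'log_lsn_last_checkpoint'   THEN `COUNT` END)
    AS checkpoint_age_bytes,
  MAX(CASE WHEN NAME = 'log_max_modified_age_sync' THEN `COUNT` END)
    AS sync_flush_point_bytes
FROM information_schema.INNODB_METRICS
WHERE NAME IN ('log_lsn_current', 'log_lsn_last_checkpoint',
               'log_max_modified_age_sync');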

So the first conclusion we can make: if InnoDB does not flush fast enough, what if we increase innodb_io_capacity_max? Sure, let’s see. For simplicity, in the next experiments I will use

innodb_io_capacity = innodb_io_capacity_max, unless specified otherwise.

Next run with innodb_io_capacity = innodb_io_capacity_max = 7000

Not much improvement, and this is also confirmed by the InnoDB Checkpoint Age chart:

InnoDB tries to flush more pages per second, up to 5600 pages/sec, but it is not enough to avoid exceeding Max Checkpoint Age.

Why is this the case? The answer is the doublewrite buffer.

Even though MySQL improved the doublewrite buffer in MySQL 8.0.20, it does not perform well enough with the proposed defaults.

Well, at least the problem was finally addressed, because previously Oracle ran benchmarks with the doublewrite buffer disabled, effectively hiding and ignoring the issue. For an example, check this.

But let’s get back to our 8.0.21 and fixed doublewrite.

Dimitri mentions:

“the main config options for DBLWR in MySQL 8.0 are:

  • innodb_doublewrite_files = N
  • innodb_doublewrite_pages = M”

Let’s check the manual again:

“The innodb_doublewrite_files variable is intended for advanced performance tuning. The default setting should be suitable for most users.

innodb_doublewrite_pages

The innodb_doublewrite_pages variable (introduced in MySQL 8.0.20) controls the maximum number of doublewrite pages per thread. If no value is specified, innodb_doublewrite_pages is set to the innodb_write_io_threads value. This variable is intended for advanced performance tuning. The default value should be suitable for most users.”

Was it wrong to assume that innodb_doublewrite_files and innodb_doublewrite_pages provide values suitable for our use case?

But let’s try the values Dimitri recommended looking into; I will use

innodb_doublewrite_files=2 and innodb_doublewrite_pages=128
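
Neither variable is dynamic, so the change requires a restart. A minimal sketch using MySQL 8.0 persisted variables (the same can be achieved by editing my.cnf and restarting):

-- Both settings are read-only at runtime; persist them, then restart.
SET PERSIST_ONLY innodb_doublewrite_files = 2;
SET PERSIST_ONLY innodb_doublewrite_pages = 128;

-- RESTART needs the SHUTDOWN privilege and a supervisor (systemd, mysqld_safe)
-- able to bring mysqld back up.
RESTART;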

Results with innodb_doublewrite_files=2 and innodb_doublewrite_pages=128

The problem with waves is fixed! 

And InnoDB Checkpoint Age chart:

Now we are able to keep Uncheckpointed Bytes under Max Checkpoint Age, and this is what fixed the “waves” pattern.

We can say that the parallel doublewrite is a welcome new improvement, but the fact that one has to change innodb_doublewrite_pages in order to get improved performance is a design flaw, in my opinion.

But there is still a lot of variation at 1-sec resolution, plus small drops. Before we get to them, let’s take a look at another suggestion: use --innodb_adaptive_hash_index=0 (that is, disable the Adaptive Hash Index). I will use AHI=0 on the charts to mark this setting.

Let’s take a look at the results with the improved settings and with --innodb_adaptive_hash_index=0.

Results with --innodb_adaptive_hash_index=0

To see the real improvement with --innodb_adaptive_hash_index=0, let’s compare the bar charts:

Or in numeric form:

settings | Avg tps, last 2000 sec
io_cap_max=7000,doublewrite=opt | 7578.69
io_cap_max=7000,doublewrite=opt,AHI=0 | 7996.33

So --innodb_adaptive_hash_index=0 really brings some improvement, about 5.5%, so I will use --innodb_adaptive_hash_index=0 for further experiments.

Let’s see if increasing innodb_buffer_pool_instances to 32 will help smooth the periodic variance.

Results with innodb_buffer_pool_instances=32

So indeed, using innodb_buffer_pool_instances=32 gives us less variation while keeping overall throughput about the same: 7936.28 tps for this case.

Now let’s review the parameter innodb_change_buffering=none, which Dimitri also suggests.

Results with innodb_change_buffering=none

There is NO practical difference if we disable the InnoDB change buffer.

And if we take a look at PMM change buffer chart:

We can see there is NO change buffer activity outside of the initial 20 mins. I am not sure why Dimitri suggested disabling it. In fact, the change buffer can be quite useful, and I will show it in my benchmarks for different workloads.

Now let’s take a look at the suggested settings with innodb_io_capacity = innodb_io_capacity_max = 8000 (that is, INCREASING innodb_io_capacity_max) and compare with the results for innodb_io_capacity_max = 7000.

Or in tabular form:

settings | Avg tps, last 2000 sec
io_cap_max=7000,doublewrite=opt,AHI=0,BPI=32 | 7936.28
io_cap_max=8000,doublewrite=opt,AHI=0,BPI=32 | 7693.08

Actually, with innodb_io_capacity_max=8000 the throughput is LESS than with innodb_io_capacity_max=7000.

Can you guess why? 

Let’s compare InnoDB Checkpoint Age.

This is for innodb_io_capacity_max=8000 :

And this is for innodb_io_capacity_max=7000 

This is like a child’s game: Find the difference.

The difference is that with innodb_io_capacity_max=7000, Uncheckpointed Bytes is 13.66 GiB, while with innodb_io_capacity_max=8000 it is 12.51 GiB.

What does it mean? It means that with innodb_io_capacity_max=7000, InnoDB HAS to flush FEWER pages and still keeps within Max Checkpoint Age.

In fact, if we try to push even further and use innodb_io_capacity_max=innodb_io_capacity=6500, we will get an InnoDB Checkpoint Age chart like this:

Here Uncheckpointed Bytes is 15.47 GiB. Does it improve throughput? Absolutely!

settings | Avg tps, last 2000 sec
io_cap_max=6500,doublewrite=opt,AHI=0,BPI=32 | 8233.628
io_cap_max=7000,doublewrite=opt,AHI=0,BPI=32 | 7936.283
io_cap_max=8000,doublewrite=opt,AHI=0,BPI=32 | 7693.084

The difference between innodb_io_capacity_max=6500 and innodb_io_capacity_max=8000 is 7%.

Now it becomes clear what the manual means in the part where it says:

“Ideally, keep the setting as low as practical, but not so low that background activities fall behind”

So we really need to increase innodb_io_capacity_max only to the level where Uncheckpointed Bytes stays under Max Checkpoint Age, but not by much more, otherwise InnoDB will do more work than is needed and it will affect the throughput.
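
Both settings are dynamic, so this tuning can be done online while watching the InnoDB Checkpoint Age chart. A sketch with the value that worked best here:

-- Lower the flushing ceiling, then watch whether Uncheckpointed Bytes stays
-- safely below Max Checkpoint Age; raise it again if it starts to exceed it.
SET GLOBAL innodb_io_capacity     = 6500;
SET GLOBAL innodb_io_capacity_max = 6500;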

In my opinion, this is a serious design flaw in InnoDB Adaptive Flushing: you actually need to wiggle innodb_io_capacity_max to achieve appropriate results.

Inverse relationship between innodb_io_capacity_max and innodb_log_file_size

To show an even more complicated relationship between innodb_io_capacity_max and innodb_log_file_size, let’s consider the following experiment.

We will increase innodb_log_file_size from 10GB to 20GB, effectively doubling our redo-log capacity.

And now let’s check InnoDB Checkpoint Age with innodb_io_capacity_max=7000:

We can see there is a lot of space in the InnoDB logs which InnoDB does not use: there is only 22.58GiB of Uncheckpointed Bytes, while 33.24 GiB are available.

So what happens if we decrease innodb_io_capacity_max to 4500?

 InnoDB Checkpoint Age with innodb_io_capacity_max=4500:

In this setup, we can push Uncheckpointed Bytes to 29.80 GiB, and it has a positive effect on the throughput.

Let’s compare the throughput:

settings | Avg tps, last 2000 sec
io_cap_max=4500,log_size=40GB,doublewrite=opt,AHI=0,BPI=32 | 9865.308
io_cap_max=7000,log_size=40GB,doublewrite=opt,AHI=0,BPI=32 | 9374.121

So by decreasing innodb_io_capacity_max from 7000 to 4500 we can gain 5.2% in the throughput.

Please note that we can’t continue to decrease innodb_io_capacity_max, because in that case Uncheckpointed Bytes risks exceeding Max Checkpoint Age, and this will lead to the negative effect of emergency flushing.

So again, in order to improve throughput, we should be DECREASING innodb_io_capacity_max, but only down to a certain threshold. We should not be setting innodb_io_capacity_max to 30000, which is what the SATA SSD can really provide.

Again, for me, this is a major design flaw in the current InnoDB Adaptive Flushing. Please note this was a static workload; if your workload changes during the day, it is practically impossible to come up with an optimal value.

Conclusions:

Trying to summarize all of the above, I want to highlight:

  • To fix the “wave” pattern we need to tune innodb_io_capacity_max and innodb_doublewrite_pages.
  • The InnoDB parallel doublewrite in MySQL 8.0.20 is definitely a positive improvement, but the default values seem poorly chosen, in contradiction with the manual. I wish Oracle/MySQL shipped features that work out of the box for most users.
  • The InnoDB Adaptive Hash Index is not helping here, and you get better performance by disabling it. I also observed in other workloads that the InnoDB Adaptive Hash Index might be another broken subsystem, which Oracle declines to fix and simply disables in its benchmarks.
  • The InnoDB Change Buffer has no effect on this workload, so you may or may not disable it; there is no difference. But I saw a positive effect from the InnoDB Change Buffer in other workloads, so I do not recommend blindly disabling it.
  • Now about InnoDB Adaptive Flushing. In my opinion, InnoDB Adaptive Flushing relies too much on manual tuning of innodb_io_capacity_max, which in fact has nothing to do with the real storage IO capacity. In fact, often you need to lower innodb_io_capacity_max to get better performance, but not make it too low, because at some point it will hurt the performance. The best way to monitor it is to check the InnoDB Checkpoint Age chart in PMM.
  • I would encourage Oracle to fix the broken design of InnoDB Adaptive Flushing so that it detects IO capacity automatically and does not flush aggressively, but keeps Uncheckpointed Bytes just under Max Checkpoint Age. Let’s hope Oracle moves faster than it did with the doublewrite buffer, because history shows that to force Oracle to make improvements in the InnoDB IO subsystem, we need to do it first in Percona Server for MySQL, as we did with the parallel doublewrite buffer. For reference, parallel doublewrite was first implemented in Percona Server for MySQL 5.7.11-4, which was released March 15th, 2016. Oracle implemented parallel doublewrite (with non-optimal default settings) in MySQL 8.0.20, released four years later, in April 2020.

Aug
27
2020
--

MySQL 8.0.19 InnoDB ReplicaSet Configuration and Manual Switchover

Manual Switchover

InnoDB ReplicaSet was introduced in MySQL 8.0.19. It is based on MySQL asynchronous replication. Unlike InnoDB Cluster, InnoDB ReplicaSet generally does not provide high availability on its own, because failover has to be performed manually. AdminAPI includes support for InnoDB ReplicaSet, so we can operate the ReplicaSet using MySQL Shell.

  • InnoDB Cluster is the combination of MySQL Shell, Group Replication, and MySQL Router.
  • InnoDB ReplicaSet is the combination of MySQL Shell, traditional MySQL asynchronous replication, and MySQL Router.

Why InnoDB ReplicaSet?

  • You can manually perform switchover and failover with InnoDB ReplicaSet.
  • You can easily add new nodes to your replication environment; InnoDB ReplicaSet helps with data provisioning (using the MySQL clone plugin) and with setting up replication.

In this blog, I am going to explain the process involved in the following topics:

  • How to set up the InnoDB ReplicaSet in a fresh environment?
  • How to perform the manual switchover with ReplicaSet?

Before going into the topic, I am summarizing the points you should be aware of when working with InnoDB ReplicaSet.

  • ReplicaSet only supports GTID-based replication environments.
  • The MySQL version should be 8.0 or later.
  • It supports only row-based replication.
  • Replication filters are not supported with InnoDB ReplicaSet.
  • An InnoDB ReplicaSet should have one primary node (master) and one or multiple secondary nodes (slaves). All the secondary nodes are configured under the primary node.
  • There is no limit on the number of secondary nodes; you can configure many nodes under a ReplicaSet.
  • It supports only manual failover.
  • InnoDB ReplicaSet should be completely managed with MySQL Shell.

How to set up the InnoDB ReplicaSet in a fresh environment?

I have created two servers (replicaset1, replicaset2) for testing purposes. My goal is to create the InnoDB ReplicaSet with one primary node and one secondary node. I installed Percona Server for MySQL 8.0.20 for my testing.

Step 1 :

Allow hostname-based communication. Make sure you configure this on all the servers that participate in the ReplicaSet.

#vi /etc/hosts
172.28.128.20 replicaset1 replicaset1
172.28.128.21 replicaset2 replicaset2

Step 2 :

In this step, I am going to prepare the MySQL instances for InnoDB ReplicaSet. Below are the major tasks that need to be performed as part of this operation.

  • Create a dedicated user account to effectively manage the ReplicaSet. The account will be automatically created with sufficient privileges.
  • MySQL parameter changes that need to be applied for InnoDB ReplicaSet (persisted settings).
  • Restart the MySQL instance to apply the changes.

Command : dba.configureReplicaSetInstance()

Connecting the shell,

[root@replicaset1 ~]# mysqlsh --uri root@localhost
Please provide the password for 'root@localhost': *************
Save password for 'root@localhost'? [Y]es/[N]o/Ne[v]er (default No): y
MySQL Shell 8.0.20

Configuring the instance,

Once you trigger the command, it will start interacting with you, and you have to choose the needed options.

MySQL  localhost:33060+ ssl  JS > dba.configureReplicaSetInstance()
Configuring local MySQL instance listening at port 3306 for use in an InnoDB ReplicaSet...
This instance reports its own address as replicaset1:3306
Clients and other cluster members will communicate with it through this address by default. If this is not correct, the report_host MySQL system variable should be changed.
ERROR: User 'root' can only connect from 'localhost'. New account(s) with proper source address specification to allow remote connection from all instances must be created to manage the cluster.

1) Create remotely usable account for 'root' with same grants and password
2) Create a new admin account for InnoDB ReplicaSet with minimal required grants
3) Ignore and continue
4) Cancel

Please select an option [1]: 2
Please provide an account name (e.g: icroot@%) to have it created with the necessary
privileges or leave empty and press Enter to cancel.
Account Name: InnodbReplicaSet
Password for new account: ********
Confirm password: ********

NOTE: Some configuration options need to be fixed:
+--------------------------+---------------+----------------+--------------------------------------------------+
| Variable                 | Current Value | Required Value | Note                                             |
+--------------------------+---------------+----------------+--------------------------------------------------+
| enforce_gtid_consistency | OFF           | ON             | Update read-only variable and restart the server |
| gtid_mode                | OFF           | ON             | Update read-only variable and restart the server |
| server_id                | 1             | <unique ID>    | Update read-only variable and restart the server |
+--------------------------+---------------+----------------+--------------------------------------------------+
Some variables need to be changed, but cannot be done dynamically on the server.
Do you want to perform the required configuration changes? [y/n]: y
Do you want to restart the instance after configuring it? [y/n]: y
Cluster admin user 'InnodbReplicaSet'@'%' created.
Configuring instance...
The instance 'replicaset1:3306' was configured to be used in an InnoDB ReplicaSet.
Restarting MySQL...
NOTE: MySQL server at replicaset1:3306 was restarted.

You can find the updated parameters in the file “mysqld-auto.cnf”. The blog by Marco Tusa has more details about the PERSIST configuration.

[root@replicaset1 mysql]# cat mysqld-auto.cnf 

{ "Version" : 1 , "mysql_server" : { "server_id" : { "Value" : "3391287398" , "Metadata" : { "Timestamp" : 1598084590766958 , "User" : "root" , "Host" : "localhost" } } , "read_only" : { "Value" : "OFF" , "Metadata" : { "Timestamp" : 1598084718849667 , "User" : "InnodbReplicaSet" , "Host" : "localhost" } } , "super_read_only" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1598084898510380 , "User" : "InnodbReplicaSet" , "Host" : "localhost" } } , "mysql_server_static_options" : { "enforce_gtid_consistency" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1598084590757563 , "User" : "root" , "Host" : "localhost" } } , "gtid_mode" : { "Value" : "ON" , "Metadata" : { "Timestamp" : 1598084590766121 , "User" : "root" , "Host" : "localhost" } } } } }

Note :

  • Make sure that this step is executed on all the MySQL instances which are going to participate in the ReplicaSet group.
  • Make sure that the cluster account name and password are the same on all MySQL instances.

Step 3 :

In this step, I am going to switch my login to the ReplicaSet account which was created in Step 2. 

MySQL  localhost:33060+ ssl  JS > \connect InnodbReplicaSet@replicaset1
Creating a session to 'InnodbReplicaSet@replicaset1'
Please provide the password for 'InnodbReplicaSet@replicaset1': ********
Save password for 'InnodbReplicaSet@replicaset1'? [Y]es/[N]o/Ne[v]er (default No): y
Fetching schema names for autocompletion... Press ^C to stop.
Closing old connection...
Your MySQL connection id is 8 (X protocol)
Server version: 8.0.20-11 Percona Server (GPL), Release 11, Revision 5b5a5d2

Step 4:

Now, all are set to create the ReplicaSet.

Command : dba.createReplicaSet(‘<ReplicaSet Name>’)

MySQL  replicaset1:33060+ ssl  JS > dba.createReplicaSet('PerconaReplicaSet')
A new replicaset with instance 'replicaset1:3306' will be created.

* Checking MySQL instance at replicaset1:3306
This instance reports its own address as replicaset1:3306
replicaset1:3306: Instance configuration is suitable.

* Updating metadata...
ReplicaSet object successfully created for replicaset1:3306.
Use rs.addInstance() to add more asynchronously replicated instances to this replicaset and rs.status() to check its status.
<ReplicaSet:PerconaReplicaSet>

The ReplicaSet is created with the name “PerconaReplicaSet”.

Step 5:

In this step, I am going to assign the ReplicaSet to a variable and check the ReplicaSet status. Assigning it to a variable can also be done while creating the ReplicaSet (i.e., var replicaset = dba.createReplicaSet(‘<ReplicaSet Name>’)).

MySQL  replicaset1:33060+ ssl  JS > replicaset = dba.getReplicaSet()
You are connected to a member of replicaset 'PerconaReplicaSet'.
<ReplicaSet:PerconaReplicaSet>
 MySQL  replicaset1:33060+ ssl  JS > 
 MySQL  replicaset1:33060+ ssl  JS > replicaset.status()
{
    "replicaSet": {
        "name": "PerconaReplicaSet", 
        "primary": "replicaset1:3306", 
        "status": "AVAILABLE", 
        "statusText": "All instances available.", 
        "topology": {
            "replicaset1:3306": {
                "address": "replicaset1:3306", 
                "instanceRole": "PRIMARY", 
                "mode": "R/W", 
                "status": "ONLINE"
            }
        }, 
        "type": "ASYNC"
    }
}

The ReplicaSet status shows that the instance replicaset1 is operational and is the PRIMARY member.

Step 6:

Now, I need to add the secondary instance “replicaset2” to the ReplicaSet.

When adding a new instance, it should fulfill all the ReplicaSet requirements. We have two recovery methods when joining a new node:

Clone: It takes a snapshot from an ONLINE instance, builds the target node from that snapshot, and finally adds it to the ReplicaSet. This method is always recommended when adding fresh nodes.

Incremental: This method relies on MySQL replication and applies all the transactions that are missing on the new instance. This can be faster when the amount of missing transactions is small.

Command : replicaset.addInstance(‘<instance name>:<port>’)

MySQL  replicaset1:33060+ ssl  JS > replicaset.addInstance('replicaset2:3306')
Adding instance to the replicaset...
* Performing validation checks
This instance reports its own address as replicaset2:3306
replicaset2:3306: Instance configuration is suitable.
* Checking async replication topology...
* Checking transaction state of the instance...

NOTE: The target instance 'replicaset2:3306' has not been pre-provisioned (GTID set is empty). The Shell is unable to decide whether replication can completely recover its state.
The safest and most convenient way to provision a new instance is through automatic clone provisioning, which will completely overwrite the state of 'replicaset2:3306' with a physical snapshot from an existing replicaset member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

WARNING: It should be safe to rely on replication to incrementally recover the state of the new instance if you are sure all updates ever executed in the replicaset were done with GTIDs enabled, there are no purged transactions and the new instance contains the same GTID set as the replicaset or a subset of it. To use this method by default, set the 'recoveryMethod' option to 'incremental'.
Please select a recovery method [C]lone/[I]ncremental recovery/[A]bort (default Clone): C
* Updating topology
Waiting for clone process of the new member to complete. Press ^C to abort the operation.
* Waiting for clone to finish...
NOTE: replicaset2:3306 is being cloned from replicaset1:3306
** Stage DROP DATA: Completed
** Clone Transfer  
    FILE COPY  ############################################################  100%  Completed
    PAGE COPY  ############################################################  100%  Completed
    REDO COPY  ############################################################  100%  Completed
** Stage RECOVERY: |
NOTE: replicaset2:3306 is shutting down...

* Waiting for server restart... ready
* replicaset2:3306 has restarted, waiting for clone to finish...
* Clone process has finished: 60.68 MB transferred in about 1 second (~60.68 MB/s)
** Configuring replicaset2:3306 to replicate from replicaset1:3306
** Waiting for new instance to synchronize with PRIMARY...
The instance 'replicaset2:3306' was added to the replicaset and is replicating from replicaset1:3306.

Here I have chosen the clone method for recovery. 

MySQL  replicaset1:33060+ ssl  JS > replicaset.status()
{
    "replicaSet": {
        "name": "PerconaReplicaSet", 
        "primary": "replicaset1:3306", 
        "status": "AVAILABLE", 
        "statusText": "All instances available.", 
        "topology": {
            "replicaset1:3306": {
                "address": "replicaset1:3306", 
                "instanceRole": "PRIMARY", 
                "mode": "R/W", 
                "status": "ONLINE"
            }, 
            "replicaset2:3306": {
                "address": "replicaset2:3306", 
                "instanceRole": "SECONDARY", 
                "mode": "R/O", 
                "replication": {
                    "applierStatus": "APPLIED_ALL", 
                    "applierThreadState": "Slave has read all relay log; waiting for more updates", 
                    "receiverStatus": "ON", 
                    "receiverThreadState": "Waiting for master to send event", 
                    "replicationLag": null
                }, 
                "status": "ONLINE"
            }
        }, 
        "type": "ASYNC"
    }
}

The second instance has been added to the ReplicaSet. 

How to perform the manual switchover with ReplicaSet? 

As per the current topology,

  • replicaset1 is the PRIMARY
  • replicaset2 is the SECONDARY

Requirement: For a maintenance activity, I am planning to remove the server “replicaset1” from the ReplicaSet. This needs to be performed in a safe manner, and the secondary instance “replicaset2” should be available for application writes and reads.

  • First, I need to promote “replicaset2” as the PRIMARY.
  • Then, remove the “replicaset1” from the group.

Switching “replicaset2” to be the PRIMARY:

Command : replicaset.setPrimaryInstance(‘host:port’)

MySQL  replicaset1:33060+ ssl  JS > replicaset.setPrimaryInstance('replicaset2:3306')
replicaset2:3306 will be promoted to PRIMARY of 'PerconaReplicaSet'.
The current PRIMARY is replicaset1:3306.

* Connecting to replicaset instances
** Connecting to replicaset1:3306
** Connecting to replicaset2:3306
** Connecting to replicaset1:3306
** Connecting to replicaset2:3306
* Performing validation checks
** Checking async replication topology...
** Checking transaction state of the instance...
* Synchronizing transaction backlog at replicaset2:3306
* Updating metadata
* Acquiring locks in replicaset instances
** Pre-synchronizing SECONDARIES
** Acquiring global lock at PRIMARY
** Acquiring global lock at SECONDARIES
* Updating replication topology
** Configuring replicaset1:3306 to replicate from replicaset2:3306
replicaset2:3306 was promoted to PRIMARY.

You can see that “replicaset2” has been promoted to PRIMARY.

MySQL  replicaset1:33060+ ssl  JS > replicaset.status()
{
    "replicaSet": {
        "name": "PerconaReplicaSet", 
        "primary": "replicaset2:3306", 
        "status": "AVAILABLE", 
        "statusText": "All instances available.", 
        "topology": {
            "replicaset1:3306": {
                "address": "replicaset1:3306", 
                "instanceRole": "SECONDARY", 
                "mode": "R/O", 
                "replication": {
                    "applierStatus": "APPLIED_ALL", 
                    "applierThreadState": "Slave has read all relay log; waiting for more updates", 
                    "receiverStatus": "ON", 
                    "receiverThreadState": "Waiting for master to send event", 
                    "replicationLag": null
                }, 
                "status": "ONLINE"
            }, 
            "replicaset2:3306": {
                "address": "replicaset2:3306", 
                "instanceRole": "PRIMARY", 
                "mode": "R/W", 
                "status": "ONLINE"
            }
        }, 
        "type": "ASYNC"
    }
}

Removing “replicaset1” from the group:

Command : replicaset.removeInstance(‘host:port’)

MySQL  replicaset1:33060+ ssl  JS > replicaset.removeInstance('replicaset1:3306')
The instance 'replicaset1:3306' was removed from the replicaset.
MySQL  replicaset1:33060+ ssl  JS > 
MySQL  replicaset1:33060+ ssl  JS > replicaset.status()
{
    "replicaSet": {
        "name": "PerconaReplicaSet", 
        "primary": "replicaset2:3306", 
        "status": "AVAILABLE", 
        "statusText": "All instances available.", 
        "topology": {
            "replicaset2:3306": {
                "address": "replicaset2:3306", 
                "instanceRole": "PRIMARY", 
                "mode": "R/W", 
                "status": "ONLINE"
            }
        }, 
        "type": "ASYNC"
    }
}

We can perform a forced failover using “ReplicaSet.forcePrimaryInstance()”. This is dangerous and only recommended for disaster-type scenarios.

MySQL InnoDB ReplicaSet is a very good feature to manage a MySQL asynchronous replication environment. It supports the CLONE plugin, which greatly helps with data provisioning and setting up replication. However, it still has some limitations compared with MySQL InnoDB Cluster.

Aug
14
2020
--

Part Two: How Many innodb_buffer_pool_instances Do You Need in MySQL 8 With a CPU-Bound Workload?

how many innodb_buffer_pool_instances Do You Need in MySQL 8

Part one of this series can be found here: How Many innodb_buffer_pool_instances Do You Need in MySQL 8?

Following up on my recent benchmark posts on MySQL and MariaDB, MySQL and MariaDB on Enterprise SSD Storage and How MySQL and MariaDB Perform on NVMe Storage, I wanted to dig a little deeper and understand how different MySQL parameters affect performance.

One of the more obscure MySQL parameters (in my opinion) is innodb_buffer_pool_instances. In particular, I do not have a feel for how many you need, nor how to tell whether a chosen value is a good or a bad one.

In our articles MySQL 5.7 Performance Tuning After Installation and MySQL 101: Parameters to Tune for MySQL Performance we recommended using a value of “8”, but I can’t really say whether this is a good enough value. This post is a follow-up to my previous article about the results we got with different innodb_buffer_pool_instances values in an IO-bound scenario.

So let’s see what results we get with different innodb_buffer_pool_instances values under the following scenario: I will use the sysbench oltp_read_write benchmark with pareto distribution and set innodb_buffer_pool_size=140GB for a database of 100GB in size, so this will be a CPU-bound workload.

Benchmark

The hardware I use is:

System | Supermicro; SYS-F619P2-RTN; v0123456789 (Other)
   Platform | Linux
    Release | Ubuntu 18.04.4 LTS (bionic)
     Kernel | 5.3.0-42-generic
Architecture | CPU = 64-bit, OS = 64-bit
  Threading | NPTL 2.27
    SELinux | No SELinux detected
Virtualized | No virtualization detected
# Processor ##################################################
 Processors | physical = 2, cores = 40, virtual = 80, hyperthreading = yes
     Models | 80xIntel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
     Caches | 80x28160 KB
# Memory #####################################################
      Total | 187.6G

The storage is a SATA SSD INTEL SSDSC2KB960G8 (Intel enterprise-grade SSD D3-S4510).

A short settings overview:

  • Data fits into memory (the data size is ~100GB, memory on the server is 188GB, and we allocate 140GB for MySQL innodb_buffer_pool_size).
  • The workload on storage will be very read-write-intensive (reads are done from the storage), with fully ACID-compliant and data-safe settings in MySQL.
  • For the SATA SSD storage, I will use innodb_io_capacity = 2000 and innodb_io_capacity_max = 4000.
  • I will test the following values for innodb_buffer_pool_instances: 1, 2, 4, 8, 16, 32, and 64 (a quick way to inspect how pages spread across instances is sketched after this list).
  • innodb_buffer_pool_instances = 64 is the maximum value allowed by MySQL.
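
Whichever value ends up configured, the distribution of pages across the buffer pool instances can be inspected directly; a sketch (counts are in pages, 16KiB by default):

-- Pages and dirty pages per buffer pool instance.
SELECT POOL_ID,
       POOL_SIZE,                  -- total pages in this instance
       DATABASE_PAGES,             -- data pages currently cached
       MODIFIED_DATABASE_PAGES     -- dirty pages waiting to be flushed
FROM information_schema.INNODB_BUFFER_POOL_STATS
ORDER BY POOL_ID;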

The benchmark command line is:

sysbench oltp_read_write --threads=150 --time=10000 --tables=40 --table_size=10000000 --mysql-host=127.0.0.1 --mysql-user=sbtest --mysql-password=sbtest --max-requests=0 --report-interval=1 --mysql-db=sbtest --mysql-ssl=off --create_table_options="DEFAULT CHARSET=utf8mb4" --report_csv=yes --rand-type=pareto run

This means that the benchmark will run for about three hours, reporting throughput every 1 sec. I will use a three-hour time frame for a reason that will become apparent later from the results.

Results on SATA SSD

Let’s see what results I’ve got individually for each of innodb_buffer_pool_instances:

innodb_buffer_pool_instances=1

innodb_buffer_pool_instances=1

innodb_buffer_pool_instances=2

innodb_buffer_pool_instances=2

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=4

innodb_buffer_pool_instances=8

innodb_buffer_pool_instances=8

innodb_buffer_pool_instances=16

innodb_buffer_pool_instances=16

innodb_buffer_pool_instances=32

innodb_buffer_pool_instances=32

innodb_buffer_pool_instances=64

innodb_buffer_pool_instances=64

What seems apparent is that increasing innodb_buffer_pool_instances reduces the variation in throughput. We can condense the results into a single chart to see it more closely:

increase innodb_buffer_pool_instances

And if we want to compare the throughput and deviation, let’s compare the results for the last 2500 sec:

compare the throughput and deviation

So actually, innodb_buffer_pool_instances=64 showed the best throughput and the least variability, as even with innodb_buffer_pool_instances=32 there are noticeable stalls.

With innodb_buffer_pool_instances from 1 to 8, there are regular stalls, including drops to 0 tps for a few seconds.

Increasing innodb_buffer_pool_instances over 64 seems to have no effect.

Final Thoughts

For this particular case, innodb_buffer_pool_instances=64 was the best choice, but I still can’t recommend a reliable way to find the optimal value. Keep in mind that this particular setting depends on the hardware (I tested on hardware with 80 CPU cores) and on the workload.

Part one of this series can be found here: How Many innodb_buffer_pool_instances Do You Need in MySQL 8?
