Aug 30, 2024

When Warnings Deceive: The Curious Case of InnoDB’s Row Size Limitation

Mysterious warning: Recently, I was involved in an investigation whose goal was to find out the reason for a warning message. The message looks clear, doesn't it? Well, the problem was that this particular table had not been changed for years, and so no DDL (ALTER) query was involved here. Moreover, there […]

Jan 18, 2024

Syscalls Analysis in MySQL When Using innodb_flush_method and innodb_use_fdatasync

In this blog post, we will discuss how to validate, at the operating system level, the effects of changing innodb_flush_method to variations other than the default (particularly O_DIRECT, which is most commonly used) and the use of innodb_use_fdatasync. First, let's define what the innodb_flush_method parameter does. It dictates how InnoDB manages the flushing of data […]

Sep 19, 2023

Use Physical Backups With MySQL InnoDB Redo Log Archiving

InnoDB Redo Log Archiving

In the world of data backup and security, physical backups play an extremely important role. Physical backup methods are faster than logical ones because they involve only file copying, without conversion. This type of backup is suitable for large, important databases that need to be recovered quickly when problems occur.

Physical backups are backups that consist of raw copies of the directories and files that store database contents. In addition to databases, the backup can include any related files, such as log or configuration files. Since backup speed and compactness are important for busy, important databases, Percona's open source physical backup solution, Percona XtraBackup (PXB), takes all these aspects into account and benefits the MySQL world with its exceptional capabilities!

This blog post will walk you through how PXB uses MySQL's InnoDB redo log archiving to manage the redo log files on the busiest systems and looks at how a new feature released in PXB version 8.0.34-29 lets you set up the redo log archive directory while performing a backup.

InnoDB redo log archiving

Backup utilities that copy redo log records may sometimes fail to keep pace with redo log generation while a backup operation is in progress, resulting in lost redo log records due to overwritten records. This issue most often occurs when there is significant MySQL server activity during the backup operation, and the redo log file storage media operates faster than the backup storage media. The redo log archiving feature, introduced in MySQL 8.0.17, addresses this issue by sequentially writing redo log records to an archive file in addition to the redo log files. Backup utilities can then copy redo log records from the archive file as necessary, thereby avoiding the potential loss of data.

Enabling redo log archiving on the server requires setting a value for the innodb_redo_log_archive_dirs system variable. The value is specified as a semicolon-separated list of labeled redo log archive directories. The label:directory pair is separated by a colon (:). For example:

mysql> SET GLOBAL innodb_redo_log_archive_dirs='label1:directory_path1[;label2:directory_path2;…]';
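To make the value format concrete, the pairs can be pulled apart with standard shell tools. This is an illustration only; the labels and paths below are hypothetical examples, not values the server requires:

```shell
# Illustration: split an innodb_redo_log_archive_dirs value into its
# label:directory pairs. Labels and paths here are made-up examples.
DIRS='backup1:/var/lib/mysql-redo-archive/;backup2:/mnt/redo-archive2/'

# Pairs are separated by ';'; within a pair, label and directory by ':'.
printf '%s\n' "$DIRS" | tr ';' '\n' | while IFS=':' read -r label dir; do
  echo "label=$label directory=$dir"
done
# → label=backup1 directory=/var/lib/mysql-redo-archive/
#   label=backup2 directory=/mnt/redo-archive2/
```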

First, we are going to demonstrate how we tested the new InnoDB redo log archiving on Percona Server for MySQL 8.0.33-25 (GPL), using MySQL Shell (mysqlsh) to enable the InnoDB redo log archiving.

  1. Create new directories for storing redo log archives.
[root@ip-xxx-xx-xx-xx ~]# mkdir -p /var/lib/mysql-redo-archive/backup1
[root@ip-xxx-xx-xx-xx ~]# chown mysql. -R /var/lib/mysql-redo-archive
[root@ip-xxx-xx-xx-xx ~]# chmod -R 700 /var/lib/mysql-redo-archive/

  2. Log in to mysqlsh as the root user and check the variable innodb_redo_log_archive_dirs.
[root@ip-xxx-xx-xx-xx ~]# mysqlsh -uroot -p
MySQL Shell 8.0.33
Copyright (c) 2016, 2023, Oracle and/or its affiliates.
Oracle is a registered trademark of Oracle Corporation and/or its affiliates.
Other names may be trademarks of their respective owners.
Type 'help' or '?' for help; 'quit' to exit.
WARNING: Using a password on the command line interface can be insecure.
Creating a session to 'root@localhost'
Fetching schema names for auto-completion... Press ^C to stop.
Your MySQL connection id is 10 (X protocol)
Server version: 8.0.33-25 Percona Server (GPL), Release 25, Revision 60c9e2c5
No default schema selected; type use <schema> to set one.
MySQL localhost:33060+ ssl  SQL >
MySQL localhost:33060+ ssl JS > \sql
Switching to SQL mode... Commands end with ;
Fetching global names for auto-completion... Press ^C to stop.

MySQL localhost:33060+ ssl SQL > show global variables like 'innodb_redo_log_ar%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| innodb_redo_log_archive_dirs |       |
+------------------------------+-------+
1 row in set (0.0031 sec)

  3. Set the variable innodb_redo_log_archive_dirs to point to the directory we created for redo log archives, as below.
MySQL  localhost:33060+ ssl  SQL > set persist innodb_redo_log_archive_dirs='backup1:/var/lib/mysql-redo-archive/';
Query OK, 0 rows affected (0.0019 sec)
MySQL  localhost:33060+ ssl  SQL >  show global variables like 'innodb_redo_log_ar%';
+------------------------------+--------------------------------------+
| Variable_name                | Value                                |
+------------------------------+--------------------------------------+
| innodb_redo_log_archive_dirs | backup1:/var/lib/mysql-redo-archive/ |
+------------------------------+--------------------------------------+
1 row in set (0.0025 sec)

  4. Now it's ready to work, but it's not enabled. It will be enabled when the session that initializes the backup invokes innodb_redo_log_archive_start().
MySQL  localhost:33060+ ssl  SQL > select innodb_redo_log_archive_start('backup1', 'backup1');
+-----------------------------------------------------+
| innodb_redo_log_archive_start('backup1', 'backup1') |
+-----------------------------------------------------+
|                                                   0 |
+-----------------------------------------------------+
1 row in set (0.0229 sec)

How to check if the redo log archiving is active

MySQL  localhost:33060+ ssl  SQL > select * from performance_schema.file_instances where event_name like '%::redo_log_archive_file'\G
*************************** 1. row ***************************
FILE_NAME: /var/lib/mysql-redo-archive/backup1/archive.f255a32d-2fb4-11ee-889e-0242ac110005.000001.log
EVENT_NAME: wait/io/file/innodb/meb::redo_log_archive_file
OPEN_COUNT: 1
1 row in set (0.0015 sec)

However, this alone is not enough to ensure the redo log archiving is active. We can also check whether the archiver thread is active using this query:

MySQL  localhost:33060+ ssl  SQL > select thread_id, name, type from performance_schema.threads  where name like '%redo_log_archive%';
+-----------+-----------------------------------------------------+------------+
| thread_id | name                                                | type       |
+-----------+-----------------------------------------------------+------------+
|        48 | thread/innodb/meb::redo_log_archive_consumer_thread | BACKGROUND |
+-----------+-----------------------------------------------------+------------+
1 row in set (0.0021 sec)

If a row is returned, it means that the redo log archiving is enabled and active.

If, instead, you want the setting to persist across service restarts, add it to my.cnf under [mysqld]:

[mysqld]
innodb_redo_log_archive_dirs = redo-archived1:/mysql/archived-redo1

How to stop redo log archiving

MySQL  localhost:33060+ ssl  SQL > select innodb_redo_log_archive_stop();
+--------------------------------+
| innodb_redo_log_archive_stop() |
+--------------------------------+
|                              0 |
+--------------------------------+
1 row in set (0.0009 sec)

[root@ip-xxx-xx-xx-xx ~]# ls -ltrh /var/lib/mysql-redo-archive/backup1/
total 24K
-r--r-----. 1 mysql mysql 24K Aug 21 11:22 archive.94f4ab58-3d1c-11ee-9e4f-0af1fd7c44c9.000001.log

Reverting the changes

To revert and disable redo log archiving:

mysql> SET GLOBAL innodb_redo_log_archive_dirs='';

How does Percona XtraBackup deal with redo log files?

PXB notes the log sequence number and starts copying the data files when a backup is performed. The backup operation on InnoDB is non-blocking, so the ibd files are allowed to change while the backup process is running. This means that while PXB is copying the files, they reflect the state of the database at different points in time. PXB therefore needs the redo log records for every change made to the data files since it began execution.

PXB runs a background process that watches the transaction log files and copies changes from them. It needs to do this continually because InnoDB writes redo logs to disk circularly, and the files are reused after a while. Prior to PXB version 8.0.34-29, if InnoDB redo log archiving was not enabled and the redo logs were written faster than PXB could copy them, it was possible to receive the message below:

xtrabackup: error: log block numbers mismatch:
xtrabackup: error: expected log block no. X, but got no. Y from the log file.
xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or log files being too small.
xtrabackup: Error: xtrabackup_copy_logfile() failed.

Before the release of the redo log archiving feature in MySQL 8.0.17, the potential solutions to the above problem were:

  • Increasing the size of the redo logs so they are not wrapped while PXB is working.
  • Improving the read speed of the redo logs; slow reads are usually a sign of I/O congestion.
  • Improving the write speed to the backup storage; slow writes are usually a sign of I/O or network congestion.
  • Taking the backup during off-peak times, while the system saturation is lower.
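For the first of those mitigations, note that on MySQL and Percona Server 8.0.30 and later the total redo log size is governed by a single dynamic variable, innodb_redo_log_capacity. A minimal sketch that just computes the byte value and prints the statement to apply; 8 GiB is an assumed example, not a recommendation:

```shell
# Sketch: compute a redo log capacity in bytes and print the SET statement.
# The 8 GiB target is an assumed example value; size it to your workload.
CAPACITY=$((8 * 1024 * 1024 * 1024))
echo "SET GLOBAL innodb_redo_log_capacity = $CAPACITY;"
# → SET GLOBAL innodb_redo_log_capacity = 8589934592;
```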

However, with the redo log archiving feature introduced in MySQL 8.0.17, you can overcome the slowness of physical backups and efficiently perform the backup even during peak hours.

Also, PXB version 8.0.34-29 has introduced a new option, --redo-log-arch-dir, that can be used to set the redo log archiving directory if it is not already set in MySQL.

Using MySQL redo log archiving for Percona XtraBackup

So now we have a redo log file archived in the folder we created in the demonstration above.

[root@ip-xxx-xx-xx-xx ~]# ls -ltrh /var/lib/mysql-redo-archive/backup1/
total 24K
-r--r-----. 1 mysql mysql 24K Aug 21 11:22 archive.94f4ab58-3d1c-11ee-9e4f-0af1fd7c44c9.000001.log

Test with Percona XtraBackup

  1. Run PXB as the owner of the mysqld process, typically the MySQL OS user. We will be using version 8.0.34-29.
[root@ip-xxx-xx-xx-xx ~]# xtrabackup --version
2023-09-11T16:07:17.696628-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql
xtrabackup version 8.0.34-29 based on MySQL server 8.0.34 Linux (x86_64) (revision id: 5ba706ee)

This is required for PXB to use the InnoDB archiving feature. Otherwise, it will encounter a “Permission denied” error and will not use archiving as below:

2023-08-21T13:27:59.293520-00:00 0 [Note] [MY-011825] [Xtrabackup] xtrabackup redo_log_arch_dir is set to backup1:/var/lib/mysql-redo-archive/
2023-08-21T13:27:59.295379-00:00 0 [ERROR] [MY-011825] [Xtrabackup] failed to fetch query result select innodb_redo_log_archive_start('backup1', '1692710879293') : Cannot create redo log archive file '/var/lib/mysql-redo-archive/1692710879293/archive.94f4ab58-3d1c-11ee-9e4f-0af1fd7c44c9.000001.log' (OS errno: 13 - Permission denied)
2023-08-21T13:27:59.295429-00:00 0 [Note] [MY-011825] [Xtrabackup] Redo Log Archiving is not used.

PXB will create a temporary directory for the archive file. Since redo log archiving is performed by the MySQL server process, the directory needs to be owned by the MySQL OS user so the server can write to it.

1.1) Option 1: Log in as the MySQL OS user:

shell> sudo -H -u mysql bash

Execute:

shell> xtrabackup --no-lock=1 --compress --parallel=4 --host=localhost --user=root --password='password_string' --backup=1 --target-dir=/Backup/21Aug 2> /tmp/b0-with-redo-archiving-as-mysql-os-user.log

1.2) Option 2: Execute directly as the root OS user on behalf of the MySQL user.

You must add -c to the command, like:

shell> sudo -H -u mysql bash -c '<COMMAND>' 2> /tmp/backup.log

Execute:

shell> sudo -H -u mysql bash -c "xtrabackup --no-lock=1 --compress --parallel=4 --host=localhost --user=root --password='password_string' --backup=1 --target-dir=/Backup/21Aug" 2> /tmp/b0-with-redo-archiving-as-mysql-os-user.log

2) Verify archiving is used:

shell> less /tmp/b0-with-redo-archiving-as-mysql-os-user.log 
2023-08-22T13:36:02.773345-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql  
2023-08-22T13:36:02.773543-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized client arguments: --no-lock=1 --compress --parallel=4 --host=localhost --user=root --password=* --backup=1 --target-dir=/Backup/22Aug  xtrabackup version 8.0.34-29 based on MySQL server 8.0.34 Linux (x86_64) (revision id: 5ba706ee) 230822 13:36:02  version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup;host=localhost' as 'root'  (using password: YES). 230822 13:36:02  version_check Connected to MySQL server 230822 13:36:02  version_check Executing a version check against the server... 230822 13:36:02  version_check Done. 
2023-08-22T13:36:02.899397-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: root, password: set, port: not set, socket: not set 2023-08-22T13:36:02.906059-00:00 0 [Note] [MY-011825] [Xtrabackup] Using server version 8.0.34 
.... 
.... 
2023-08-22T13:36:02.980089-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: root, password: set, port: not set, socket: not set 
2023-08-22T13:36:02.986698-00:00 0 [Note] [MY-011825] [Xtrabackup] xtrabackup redo_log_arch_dir is set to backup1:/var/lib/mysql-redo-archive/ 
2023-08-22T13:36:03.015207-00:00 0 [Note] [MY-011825] [Xtrabackup] Waiting for archive file '/var/lib/mysql-redo-archive//1692711362986/archive.94f4ab58-3d1c-11ee-9e4f-0af1fd7c44c9.000001.log' 
2023-08-22T13:36:03.091572-00:00 1 [Note] [MY-011825] [Xtrabackup] >> log scanned up to (19630018) 
.... 
.... 
2023-08-22T13:36:04.569714-00:00 0 [Note] [MY-011825] [Xtrabackup] Done: Compressing file /Backup/22Aug/xtrabackup_info.zst 
2023-08-22T13:36:05.583075-00:00 0 [Note] [MY-011825] [Xtrabackup] Transaction log of lsn (19630018) to (19630028) was copied. 
2023-08-22T13:36:05.790293-00:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!

How to set --redo-log-arch-dir=name in the PXB command

PXB 8.0.34-29 introduces a new option, --redo-log-arch-dir=name. This option sets the redo log archive directory if that directory is not already set on the server. Here is a short example demonstrating the functioning of the --redo-log-arch-dir parameter.

Suppose innodb_redo_log_archive_dirs is NULL, which means that initially the redo log archive directory is not set in MySQL, as shown below.

mysql>  select @@innodb_redo_log_archive_dirs;
+--------------------------------+
| @@innodb_redo_log_archive_dirs |
+--------------------------------+
| NULL                           |
+--------------------------------+
1 row in set (0.01 sec)

Create a redo log file directory and give it the appropriate permissions:

[root@ip-xxx-xx-xx-xx ~]# cd /var/lib/mysql-redo-archive/
[root@ip-xxx-xx-xx-xx ~]# mkdir redo_arch_backup

### Check mysql-redo-archive folder Permissions
[root@ip-xxx-xx-xx-xx ~]# ls -ltrh /var/lib/ |grep archive 
drwx------.  4 mysql  mysql     51 Sep 11 17:16 mysql-redo-archive

As you can see, PXB is all set to use the new redo archive directory simply by passing the archive directory path to the --redo-log-arch-dir parameter:

--redo-log-arch-dir=redo_arch_backup:/var/lib/mysql-redo-archive/

[root@ip-xxx-xx-xx-xx ~]# sudo -H -u mysql bash
bash-5.1$ xtrabackup --no-lock=1 --compress --parallel=4 --host=localhost --user=root --password='password_string' --backup=1 --target-dir=/Backup/11Sept23_bkp --redo-log-arch-dir=redo_arch_backup:/var/lib/mysql-redo-archive/ 2> /tmp/a1-with-redo-archiving-as-mysql-os-user.log

Verifying the PXB logs:

[root@ip-xxx-xx-xx-xx ~]# cat /tmp/a1-with-redo-archiving-as-mysql-os-user.log
2023-09-11T17:16:47.836951-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized server arguments: --datadir=/var/lib/mysql
2023-09-11T17:16:47.837105-00:00 0 [Note] [MY-011825] [Xtrabackup] recognized client arguments: --no-lock=1 --compress --parallel=4 --host=localhost --user=root --password=* --backup=1 --target-dir=/Backup/11Sept23_bkp --redo_log_arch_dir=redo_arch_backup:/var/lib/mysql-redo-archive/
xtrabackup version 8.0.34-29 based on MySQL server 8.0.34 Linux (x86_64) (revision id: 5ba706ee)
230911 17:16:47  version_check Connecting to MySQL server with DSN 'dbi:mysql:;mysql_read_default_group=xtrabackup;host=localhost' as 'root'  (using password: YES).
230911 17:16:47  version_check Connected to MySQL server
230911 17:16:47  version_check Executing a version check against the server...
230911 17:16:47  version_check Done.
2023-09-11T17:16:47.956847-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: root, password: set, port: not set, socket: not set
....
....
2023-09-11T17:16:48.028506-00:00 0 [Note] [MY-011825] [Xtrabackup] Connecting to MySQL server host: localhost, user: root, password: set, port: not set, socket: not set
2023-09-11T17:16:48.034515-00:00 0 [Note] [MY-011825] [Xtrabackup] xtrabackup redo_log_arch_dir is set to redo_arch_backup:/var/lib/mysql-redo-archive/
2023-09-11T17:16:48.056504-00:00 0 [Note] [MY-011825] [Xtrabackup] Waiting for archive file '/var/lib/mysql-redo-archive//1694452608034/archive.94f4ab58-3d1c-11ee-9e4f-0af1fd7c44c9.000001.log'
2023-09-11T17:16:48.137941-00:00 1 [Note] [MY-011825] [Xtrabackup] >> log scanned up to (19816218)
2023-09-11T17:16:48.139886-00:00 0 [Note] [MY-012953] [InnoDB] Disabling background ibuf IO read threads.
2023-09-11T17:16:48.343117-00:00 0 [Note] [MY-011825] [Xtrabackup] Generating a list of tablespaces
....
....
2023-09-11T17:16:49.496740-00:00 0 [Note] [MY-011825] [Xtrabackup] Compressing /Backup/11Sept23_bkp/backup-my.cnf.zst
2023-09-11T17:16:49.496789-00:00 0 [Note] [MY-011825] [Xtrabackup] Done: Compressing file /Backup/11Sept23_bkp/backup-my.cnf.zst
2023-09-11T17:16:49.498483-00:00 0 [Note] [MY-011825] [Xtrabackup] Compressing /Backup/11Sept23_bkp/xtrabackup_info.zst
2023-09-11T17:16:49.498521-00:00 0 [Note] [MY-011825] [Xtrabackup] Done: Compressing file /Backup/11Sept23_bkp/xtrabackup_info.zst
2023-09-11T17:16:50.501818-00:00 0 [Note] [MY-011825] [Xtrabackup] Transaction log of lsn (19816218) to (19816228) was copied.
2023-09-11T17:16:50.708435-00:00 0 [Note] [MY-011825] [Xtrabackup] completed OK!

mysql>  select @@innodb_redo_log_archive_dirs;
+-----------------------------------------------+
| @@innodb_redo_log_archive_dirs                |
+-----------------------------------------------+
| redo_arch_backup:/var/lib/mysql-redo-archive/ |
+-----------------------------------------------+
1 row in set (0.00 sec)

This means that if innodb_redo_log_archive_dirs is not set in MySQL, PXB 8.0.34-29 will set it through the --redo-log-arch-dir option and start using the redo log archive when a backup is initiated. When the option is not set explicitly, PXB will, by default, first check innodb_redo_log_archive_dirs. If this variable is set in MySQL, as demonstrated previously, PXB will use the redo log archiving directory assigned in the innodb_redo_log_archive_dirs variable while performing the backup.

Conversely, if the MySQL variable innodb_redo_log_archive_dirs is NULL and you did not explicitly specify --redo-log-arch-dir while taking the backup, then PXB won't use the redo log archiving feature.
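The precedence described above can be summarized in a small sketch. This is a hypothetical helper written for illustration, not actual PXB code:

```shell
# Hypothetical helper mirroring the decision described above:
#   $1 = server's innodb_redo_log_archive_dirs value ('' when NULL/unset)
#   $2 = value passed to --redo-log-arch-dir ('' when not passed)
archive_dirs_in_effect() {
  if [ -n "$1" ]; then
    echo "$1"      # variable already set on the server: PXB uses it
  elif [ -n "$2" ]; then
    echo "$2"      # variable unset: PXB sets it from --redo-log-arch-dir
  else
    echo "none"    # neither set: redo log archiving is not used
  fi
}

archive_dirs_in_effect '' 'redo_arch_backup:/var/lib/mysql-redo-archive/'
# → redo_arch_backup:/var/lib/mysql-redo-archive/
```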

Conclusion

This feature is crucial for heavy workloads where the backup storage does not have the same capabilities as the production storage and cannot keep up with the speed of the writes.

When enabled by the DBA, Percona XtraBackup will detect the archived redo log files and use them automatically. To verify that the redo log archiving process was started and is still active, you can check the performance_schema.threads table as illustrated above. Alternatively, to use this feature with Percona XtraBackup, you can explicitly pass --redo-log-arch-dir=name.

Percona XtraBackup is a free, open source, complete online backup solution for all versions of Percona Server for MySQL and MySQL.

 

Try Percona XtraBackup today!

 

Related links:

https://docs.percona.com/percona-xtrabackup/8.0/release-notes/8.0/8.0.34-29.0.html

https://docs.percona.com/percona-xtrabackup/8.0/xtrabackup-option-reference.html#-redo-log-arch-dirname

https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html#innodb-redo-log-archiving

https://www.percona.com/blog/innodb-redo-log-archiving/

 

Aug 25, 2023

Failover and Recovery Scenarios in InnoDB Cluster and ClusterSet

Failover and Recovery Scenarios in InnoDB Cluster

This blog post will focus on failover and recovery scenarios inside the InnoDB Cluster and ClusterSet environment. To learn more about the deployment of these topologies, you can refer to the manuals: InnoDB Cluster and InnoDB ClusterSet setup.

In the snippet below, we have two clusters (cluster1 and cluster2), which are connected via an async channel and together form a ClusterSet topology. We are going to use this topology in all of our cases.

MySQL  127.0.0.1:3308 ssl  JS > myclusterset.status({extended: 1})
{
    "clusters": {
        "Cluster1": {
            "clusterRole": "PRIMARY", 
            "globalStatus": "OK", 
            "primary": "127.0.0.1:3308", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3308": {
                    "address": "127.0.0.1:3308", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/W", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3309": {
                    "address": "127.0.0.1:3309", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3310": {
                    "address": "127.0.0.1:3310", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }
            }, 
            "transactionSet": "39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,b3a79c40-3909-11ee-805d-5254004d77d3:1-124,b3a7a659-3909-11ee-805d-5254004d77d3:1-5"
        }, 
        "cluster2": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3308"
            }, 
            "clusterSetReplicationStatus": "OK", 
            "globalStatus": "OK", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3311": {
                    "address": "127.0.0.1:3311", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3312": {
                    "address": "127.0.0.1:3312", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3313": {
                    "address": "127.0.0.1:3313", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }
            }, 
            "transactionSet": "39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,3db995e4-390a-11ee-b678-5254004d77d3:1-5,b3a79c40-3909-11ee-805d-5254004d77d3:1-124,b3a7a659-3909-11ee-805d-5254004d77d3:1-5", 
            "transactionSetConsistencyStatus": "OK", 
            "transactionSetErrantGtidSet": "", 
            "transactionSetMissingGtidSet": ""
        }
    }, 
    "domainName": "firstclusterset", 
    "globalPrimaryInstance": "127.0.0.1:3308", 
    "metadataServer": "127.0.0.1:3308", 
    "primaryCluster": "Cluster1", 
    "status": "HEALTHY", 
    "statusText": "All Clusters available."
}

How failover happens inside a single InnoDB Cluster

  • Connect to any node of the first cluster (“cluster1”) via MySQLShell and fetch the details.
    MySQL  127.0.0.1:3308 ssl  JS > \c root@localhost:3308
    MySQL  localhost:3308 ssl  JS > cluster1=dba.getCluster()

    MySQL  localhost:3308 ssl  JS > cluster1.status()
    {
        "clusterName": "Cluster1", 
        "clusterRole": "PRIMARY", 
        "defaultReplicaSet": {
            "name": "default", 
            "primary": "127.0.0.1:3308", 
            "ssl": "REQUIRED", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3308": {
                    "address": "127.0.0.1:3308", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/W", 
                    "readReplicas": {}, 
                    "role": "HA", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3309": {
                    "address": "127.0.0.1:3309", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "role": "HA", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }, 
                "127.0.0.1:3310": {
                    "address": "127.0.0.1:3310", 
                    "memberRole": "SECONDARY", 
                    "mode": "R/O", 
                    "readReplicas": {}, 
                    "replicationLagFromImmediateSource": "", 
                    "replicationLagFromOriginalSource": "", 
                    "role": "HA", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                }
            }, 
            "topologyMode": "Single-Primary"
        }, 
        "domainName": "firstclusterset", 
        "groupInformationSourceMember": "127.0.0.1:3308"
    }

     

  • Now perform a primary switchover from the instance (“127.0.0.1:3308”) to (“127.0.0.1:3309”).
MySQL  localhost:3308 ssl  JS > cluster1.setPrimaryInstance("root@127.0.0.1:3309")

Output:

Setting instance '127.0.0.1:3309' as the primary instance of cluster 'Cluster1'...

Instance '127.0.0.1:3308' was switched from PRIMARY to SECONDARY.
Instance '127.0.0.1:3309' was switched from SECONDARY to PRIMARY.
Instance '127.0.0.1:3310' remains SECONDARY.

The instance '127.0.0.1:3309' was successfully elected as primary.

  • Finally, the instance (“127.0.0.1:3309”) will show the status as primary.
MySQL localhost:3308 ssl JS > cluster1.status()
...
"127.0.0.1:3309": {
"address": "127.0.0.1:3309",
"memberRole": "PRIMARY",
"mode": "R/W",
"readReplicas": {},
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
...

How to rejoin the lost instance again

If, for some reason, an instance leaves the cluster or loses connection and can't automatically rejoin the cluster, we might need to rejoin it by issuing the "Cluster.rejoinInstance(instance)" command. Here is a small example that demonstrates the usage of this command.

  • Create a blocker by stopping Group Replication on the instance ("127.0.0.1:3310").
MySQL localhost:3310 ssl SQL > stop group_replication;

  • Looking over the information below, we can see the instance (“127.0.0.1:3310”) is showing “MISSING” status.
MySQL  localhost:3310 ssl  JS > cluster1.status()
...
"127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }
...

  • Now, add the instance (“127.0.0.1:3310”) again with the rejoinInstance() command.
MySQL  localhost:3310 ssl  JS > cluster1.rejoinInstance('127.0.0.1:3310')

...
Validating instance configuration at 127.0.0.1:3310...
NOTE: Instance detected as a sandbox.
Please note that sandbox instances are only suitable for deploying test clusters for use within the same host.

This instance reports its own address as 127.0.0.1:3310

Instance configuration is suitable.
Rejoining instance '127.0.0.1:3310' to cluster 'Cluster1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_2558498413'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100% 

The instance '127.0.0.1:3310' was successfully rejoined to the cluster.
...

And after the above operation, the instance seems to be “ONLINE” now.

"127.0.0.1:3310": {
"address": "127.0.0.1:3310",
"memberRole": "SECONDARY",
"mode": "R/O",
"readReplicas": {},
"replicationLagFromImmediateSource": "",
"replicationLagFromOriginalSource": "",
"role": "HA",
"status": "ONLINE",
"version": "8.0.31"
}

How to recover a cluster from a quorum or vote loss

There are situations when running instances fail and a cluster loses its quorum (ability to vote) due to insufficient members. This can happen when a majority of the instances that make up the cluster fail and can no longer vote on Group Replication operations. When a cluster loses quorum, it can no longer process any write transactions or change the cluster's topology.
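The arithmetic behind quorum loss is the standard majority rule: a group of N members can tolerate floor((N-1)/2) simultaneous failures before it loses the ability to form a majority. A quick sketch:

```shell
# Majority-based quorum: a Group Replication group of N members can lose
# floor((N-1)/2) members and still form a majority.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 3 5 7; do
  echo "members=$n tolerated_failures=$(tolerated_failures "$n")"
done
# → members=3 tolerated_failures=1
#   members=5 tolerated_failures=2
#   members=7 tolerated_failures=3
```

This is why, in the three-member demonstration below, killing two instances is enough to take the quorum away.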

However, if any online instance contains the latest InnoDB Cluster metadata, it is possible to restore the cluster's quorum.

Let’s see how we can use the feature “forceQuorumUsingPartitionOf” to recover the cluster again with minimal member votes.

  • First, we will fail a majority of the nodes with a simple "KILL" operation.
root     30281     1  1 07:10 ?        00:02:03 /root/mysql-sandboxes/3308/bin/mysqld --defaults-file=/root/mysql-sandboxes/3308/my.cnf --user=root
root     30788     1  1 07:14 ?        00:01:53 /root/mysql-sandboxes/3309/bin/mysqld --defaults-file=/root/mysql-sandboxes/3309/my.cnf --user=root
root     30912     1  2 07:14 ?        00:02:48 /root/mysql-sandboxes/3310/bin/mysqld --defaults-file=/root/mysql-sandboxes/3310/my.cnf --user=root

[root@localhost ~]# kill -9 30281 30912

  • Now check the cluster1 status again. The cluster has lost the quorum, and no write activity is allowed.
MySQL  localhost:3309 ssl  JS > cluster1.status()
{
    "clusterName": "Cluster1", 
    "clusterRole": "PRIMARY", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:3309", 
        "ssl": "REQUIRED", 
        "status": "NO_QUORUM", 
        "statusText": "Cluster has no quorum as visible from '127.0.0.1:3309' and cannot process write transactions. 2 members are not active.", 
        "topology": {
            "127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "memberRole": "SECONDARY", 
                "memberState": "(MISSING)", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3308': Can't connect to MySQL server on '127.0.0.1:3308' (111)", 
                "status": "UNREACHABLE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "memberRole": "PRIMARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "memberRole": "SECONDARY", 
                "memberState": "(MISSING)", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3310': Can't connect to MySQL server on '127.0.0.1:3310' (111)", 
                "status": "UNREACHABLE", 
                "version": "8.0.31"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "domainName": "firstclusterset", 
    "groupInformationSourceMember": "127.0.0.1:3309"
}
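
A monitoring script can detect this condition by parsing the status() output. A minimal sketch, assuming the JSON shape shown above (only the relevant fields are reproduced here):

```python
import json

# Trimmed-down version of the cluster1.status() output shown above
status_json = """{
  "defaultReplicaSet": {
    "status": "NO_QUORUM",
    "topology": {
      "127.0.0.1:3308": {"status": "UNREACHABLE"},
      "127.0.0.1:3309": {"status": "ONLINE"},
      "127.0.0.1:3310": {"status": "UNREACHABLE"}
    }
  }
}"""

status = json.loads(status_json)
replica_set = status["defaultReplicaSet"]
# Collect the members still reachable; any of these can serve as the
# partition for forceQuorumUsingPartitionOf()
online = [addr for addr, m in replica_set["topology"].items()
          if m["status"] == "ONLINE"]
if replica_set["status"] == "NO_QUORUM":
    print(f"Quorum lost; usable partition members: {online}")
```
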

  • To fix this situation, we can connect to the available instance (“127.0.0.1:3309”) and reset the quorum again with the below command.
MySQL  localhost:3309 ssl  JS > cluster1.forceQuorumUsingPartitionOf("root@localhost:3309")

...

Restoring cluster 'Cluster1' from loss of quorum, by using the partition composed of [127.0.0.1:3309]

Restoring the InnoDB cluster ...

The InnoDB cluster was successfully restored using the partition from the instance 'root@localhost:3309'.

WARNING: To avoid a split-brain scenario, ensure that all other members of the cluster are removed or joined back to the group that was restored.

...

  • The instances on ports 3308 and 3310 are still down/failed. However, we recovered the cluster quorum with a single primary member ("127.0.0.1:3309"). Now, we can perform write activity on the cluster without any issues.
MySQL  localhost:3309 ssl  JS > cluster1.status()
{
    "clusterName": "Cluster1", 
    "clusterRole": "PRIMARY", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:3309", 
        "ssl": "REQUIRED", 
        "status": "OK_NO_TOLERANCE_PARTIAL", 
        "statusText": "Cluster is NOT tolerant to any failures. 2 members are not active.", 
        "topology": {
            "127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "memberRole": "SECONDARY", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3308': Can't connect to MySQL server on '127.0.0.1:3308' (111)", 
                "status": "(MISSING)"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "memberRole": "SECONDARY", 
                "mode": "n/a", 
                "readReplicas": {}, 
                "role": "HA", 
                "shellConnectError": "MySQL Error 2003: Could not open connection to '127.0.0.1:3310': Can't connect to MySQL server on '127.0.0.1:3310' (111)", 
                "status": "(MISSING)"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "domainName": "firstclusterset", 
    "groupInformationSourceMember": "127.0.0.1:3309"
}


MySQL  localhost:3309 ssl  SQL > create database sbtest2;
Query OK, 1 row affected (0.0055 sec)

Later on, we can fix the failed instances, and they will join cluster1 again. Sometimes, we might need to perform the “rejoinInstance()” operation in order to add the nodes again.

How to recover a complete cluster from a major outage

Sometimes, even if all the nodes are up, an internal issue, such as group replication being stuck or a networking problem, can cause a complete outage that leaves you unable to perform any writes or other activity on the cluster.

In such circumstances, you can use any one node and its metadata to recover the cluster. You need to connect to the most up-to-date instance; otherwise, you may lose data or end up with inconsistency among the cluster nodes.

Let’s see how we can introduce such a situation manually and then try to fix that.

  • First, stop the group replication on all three instances.
MySQL localhost:3308 ssl SQL > stop group_replication;
MySQL localhost:3310 ssl SQL > stop group_replication;
MySQL localhost:3309 ssl SQL > stop group_replication;

"127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "instanceErrors": [
                    "NOTE: group_replication is stopped."
                ], 
                "memberRole": "SECONDARY", 
                "memberState": "OFFLINE", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "(MISSING)", 
                "version": "8.0.31"
            }

Now we are completely stuck, and even if we try to perform any writes on the last primary member, we see the below error, since "--super-read-only" is enabled in order to protect the transactions on the database.

MySQL  localhost:3309 ssl  SQL > create database sbtest3;
ERROR: 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement

  • Now, we will fix the issue with the below set of steps.

1. Connect to the instance with the most executed GTIDs. We can confirm this by connecting to each instance and comparing the GTID sets with the below commands.

mysql> SELECT @@GLOBAL.GTID_EXECUTED;
mysql> SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";

2. In our case, the GTID sets are the same on all three instances, so we can choose any of the nodes. Let's choose the last primary ("127.0.0.1:3309").

MySQL  localhost:3309 ssl  SQL > SELECT @@GLOBAL.GTID_EXECUTED;
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| @@GLOBAL.GTID_EXECUTED                                                                                                                          |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,
b3a79c40-3909-11ee-805d-5254004d77d3:1-152:1000071-1000079,
b3a7a659-3909-11ee-805d-5254004d77d3:1-12 |
+-------------------------------------------------------------------------------------------------------------------------------------------------+

MySQL  localhost:3309 ssl  SQL > SELECT received_transaction_set FROM performance_schema.replication_connection_status WHERE channel_name="group_replication_applier";
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| received_transaction_set                                                                                                                        |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
| 39ec1328-3909-11ee-a0f8-5254004d77d3:1-4,
b3a79c40-3909-11ee-805d-5254004d77d3:1-152:1000071-1000079,
b3a7a659-3909-11ee-805d-5254004d77d3:1-12 |
+-------------------------------------------------------------------------------------------------------------------------------------------------+
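
To compare GTID sets programmatically, a naive approach is to expand each set into individual (uuid, transaction) pairs and check for superset relationships. The sketch below is for illustration only (it ignores full MySQL GTID semantics); on a live server, the GTID_SUBSET() SQL function is the authoritative way to compare two sets:

```python
def parse_gtid_set(gtid_executed: str) -> set:
    """Expand a GTID set string like 'uuid:1-4:10-12,uuid2:1-5' into (uuid, txn) pairs."""
    txns = set()
    for entry in gtid_executed.replace("\n", "").split(","):
        uuid, *ranges = entry.strip().split(":")
        for r in ranges:
            lo, _, hi = r.partition("-")  # a bare number means a single-txn range
            txns.update((uuid, n) for n in range(int(lo), int(hi or lo) + 1))
    return txns

# Hypothetical GTID_EXECUTED values from two candidate seed instances
node_a = parse_gtid_set("b3a79c40:1-152:1000071-1000079")
node_b = parse_gtid_set("b3a79c40:1-150")

# node_a contains everything node_b has (and more), so node_a is the safest seed:
print(node_a >= node_b)  # True
```
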

3. Finally, connect to the instance (“127.0.0.1:3309”) and execute the below command to recover from the outage/failure.

MySQL  localhost:3309 ssl  JS > \c root@localhost:3309
MySQL  localhost:3309 ssl  JS > mycluster1 = dba.rebootClusterFromCompleteOutage("cluster1",{force: true})

...

Restoring the Cluster 'Cluster1' from complete outage...

Cluster instances: '127.0.0.1:3308' (OFFLINE), '127.0.0.1:3309' (OFFLINE), '127.0.0.1:3310' (OFFLINE)
Validating instance configuration at localhost:3309...

This instance reports its own address as 127.0.0.1:3309

Instance configuration is suitable.
* Waiting for seed instance to become ONLINE...
127.0.0.1:3309 was restored.
Validating instance configuration at 127.0.0.1:3308...

This instance reports its own address as 127.0.0.1:3308

Instance configuration is suitable.
Rejoining instance '127.0.0.1:3308' to cluster 'Cluster1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_3676500949'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100% 

The instance '127.0.0.1:3308' was successfully rejoined to the cluster.

Validating instance configuration at 127.0.0.1:3310...

This instance reports its own address as 127.0.0.1:3310

Instance configuration is suitable.
Rejoining instance '127.0.0.1:3310' to cluster 'Cluster1'...

Re-creating recovery account...
NOTE: User 'mysql_innodb_cluster_2558498413'@'%' already existed at instance '127.0.0.1:3309'. It will be deleted and created again with a new password.

* Waiting for the Cluster to synchronize with the PRIMARY Cluster...
** Transactions replicated  ############################################################  100% 

The instance '127.0.0.1:3310' was successfully rejoined to the cluster.

The Cluster was successfully rebooted.

<Cluster:Cluster1>

...

Note: Be careful with the force option as it can bypass other important checks like GTID_SET or instance reachability.

Now, if we check the status again, we see all three nodes are up. The recovery command fixes all problems (group replication is started) and rejoins the instances.

MySQL  localhost:3309 ssl  JS > cluster1=dba.getCluster()
MySQL  localhost:3309 ssl  JS > cluster1.status()  
{
    "clusterName": "Cluster1", 
    "clusterRole": "PRIMARY", 
    "defaultReplicaSet": {
        "name": "default", 
        "primary": "127.0.0.1:3309", 
        "ssl": "REQUIRED", 
        "status": "OK", 
        "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
        "topology": {
            "127.0.0.1:3308": {
                "address": "127.0.0.1:3308", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLagFromImmediateSource": "", 
                "replicationLagFromOriginalSource": "", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3309": {
                "address": "127.0.0.1:3309", 
                "memberRole": "PRIMARY", 
                "mode": "R/W", 
                "readReplicas": {}, 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }, 
            "127.0.0.1:3310": {
                "address": "127.0.0.1:3310", 
                "memberRole": "SECONDARY", 
                "mode": "R/O", 
                "readReplicas": {}, 
                "replicationLagFromImmediateSource": "", 
                "replicationLagFromOriginalSource": "", 
                "role": "HA", 
                "status": "ONLINE", 
                "version": "8.0.31"
            }
        }, 
        "topologyMode": "Single-Primary"
    }, 
    "domainName": "firstclusterset", 
    "groupInformationSourceMember": "127.0.0.1:3309"
}

How to perform switchover/failover from one cluster to another in a ClusterSet

Sometimes, we need to perform maintenance and update activities on the database instances. To avoid long downtime or a major impact on production, we perform a controlled switchover so the concerned node can go offline without impacting any running workload.

Inside a ClusterSet, we can perform such an activity by changing the clusterRole from "REPLICA" to "PRIMARY" among the running clusters.

Let’s see the exact scenario below.

  • Fetch the ClusterSet information.
MySQL  127.0.0.1:3308 ssl  JS > myclusterset=dba.getClusterSet()
<ClusterSet:firstclusterset>

MySQL  127.0.0.1:3308 ssl  JS > myclusterset.status({extended: 1})

...

    "clusters": {
        "Cluster1": {
            "clusterRole": "PRIMARY", 
            "globalStatus": "OK", 
            "primary": "127.0.0.1:3309", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {

      "cluster2": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3309"
...

  • Performing a switchover from “cluster1” to “cluster2”.
MySQL  127.0.0.1:3308 ssl  JS > myclusterset.setPrimaryCluster('cluster2')

...

Switching the primary cluster of the clusterset to 'cluster2'
* Verifying clusterset status
** Checking cluster Cluster1
  Cluster 'Cluster1' is available
** Checking cluster cluster2
  Cluster 'cluster2' is available

* Reconciling 5 internally generated GTIDs

* Refreshing replication account of demoted cluster
* Synchronizing transaction backlog at 127.0.0.1:3311
** Transactions replicated  ############################################################  100% 


* Updating metadata

* Updating topology
** Changing replication source of 127.0.0.1:3308 to 127.0.0.1:3311
** Changing replication source of 127.0.0.1:3310 to 127.0.0.1:3311
** Changing replication source of 127.0.0.1:3309 to 127.0.0.1:3311
* Acquiring locks in replicaset instances
** Pre-synchronizing SECONDARIES
** Acquiring global lock at PRIMARY
** Acquiring global lock at SECONDARIES

* Synchronizing remaining transactions at promoted primary
** Transactions replicated  ############################################################  100% 


* Updating replica clusters
Cluster 'cluster2' was promoted to PRIMARY of the clusterset. The PRIMARY instance is '127.0.0.1:3311'
...

  • Now, if we see the status again, we can observe the change in the ClusterRole.
"clusters": {
        "Cluster1": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3309", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3311"
            }, 

"cluster2": {
            "clusterRole": "PRIMARY", 
            "globalStatus": "OK", 
            "primary": "127.0.0.1:3311", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", 
            "topology": {
                "127.0.0.1:3311": {
                    "address": "127.0.0.1:3311", 
                    "memberRole": "PRIMARY", 
                    "mode": "R/W", 
                    "status": "ONLINE", 
                    "version": "8.0.31"
                },

In some unlucky situations, when one of the clusters ("cluster1") is completely down and we don't have any other choice, we might need to perform an emergency failover. The below command can be handy in those cases.

MySQL  localhost:3311 ssl  JS > myclusterset.forcePrimaryCluster("cluster2")

If you are using the MySQL Router interface for routing traffic in the ClusterSet, you also need to change the router option with "setRoutingOption" so the traffic can be diverted to the new primary cluster ("cluster2"); otherwise, the application will fail and not be able to communicate.

In the next scenario, we will see the usage of “setRoutingOption” in more detail.

How to change traffic routes with MySQLRouter

In the case of Cluster:

Within a single cluster, such as "cluster1," writes go to the primary and reads are balanced across the secondaries. The router automatically detects failovers and follows the new primary.

In the case of ClusterSet:

We can switch the Primary stack for routing traffic from “cluster1” to “cluster2” with the help of the below steps.

  • Verifying the current configurations.
MySQL localhost:3309 ssl JS > myclusterset=dba.getClusterSet() 
MySQL  localhost:3309 ssl  JS > myclusterset.listRouters()
 
{
    "domainName": "firstclusterset", 
    "routers": {
        "localhost.localdomain::Router1": {
            "hostname": "localhost.localdomain", 
            "lastCheckIn": "2023-08-12 10:13:22", 
            "roPort": "6447", 
            "roXPort": "6449", 
            "rwPort": "6446", 
            "rwXPort": "6448", 
            "targetCluster": null, 
            "version": "8.0.31"
        }
    }
}

MySQL  localhost:3309 ssl  JS > myclusterset.routingOptions()

Output:

    
    {
    "domainName": "firstclusterset", 
    "global": {
        "invalidated_cluster_policy": "drop_all", 
        "stats_updates_frequency": 0, 
        "target_cluster": "primary"
    }, 
    "routers": {
        "localhost.localdomain::Router1": {}
    }
}

  • Now, switching the target cluster from “cluster1” to “cluster2”.
MySQL  localhost:3309 ssl  JS > myclusterset.setRoutingOption('localhost.localdomain::Router1', 'target_cluster', 'cluster2')
Routing option 'target_cluster' successfully updated in router 'localhost.localdomain::Router1'.

  • Checking the status again, we can see the target cluster is now changed to “cluster2”.
MySQL  localhost:3309 ssl  JS > myclusterset.routingOptions()
 
{
    "domainName": "firstclusterset", 
    "global": {
        "invalidated_cluster_policy": "drop_all", 
        "stats_updates_frequency": 0, 
        "target_cluster": "primary"
    }, 
    "routers": {
        "localhost.localdomain::Router1": {
            "target_cluster": "cluster2"
        }
    }
}

How to rejoin the lost cluster again in the ClusterSet

There are situations when a cluster has been marked as invalidated, or there might be some issue with the replication channel. To fix that, we might need to perform the "rejoinCluster()" operation in order to add it back to the ClusterSet.

We can simulate the same by using the below steps.

  • Stopping the asynchronous replication channel on the primary instance ("127.0.0.1:3311") of cluster2, which is syncing from the instance ("127.0.0.1:3309") of cluster1.
MySQL  localhost:3311 ssl  SQL > stop slave;

...
               Slave_IO_State: 
                  Master_Host: 127.0.0.1
                  Master_User: mysql_innodb_cs_63c324c0
                  Master_Port: 3309
                Connect_Retry: 3
              Master_Log_File: localhost-bin.000005
          Read_Master_Log_Pos: 2248
               Relay_Log_File: localhost-relay-bin-clusterset_replication.000002
                Relay_Log_Pos: 432
        Relay_Master_Log_File: localhost-bin.000005
             Slave_IO_Running: No
            Slave_SQL_Running: No
...

  • If we check the status again, we see the replication is stopped now on the cluster2 instance.
MySQL  localhost:3311 ssl  JS > myclusterset.status({extended:1})

"clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "OFF", 
                "applierThreadState": "", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "OFF", 
                "receiverThreadState": "", 
                "source": "127.0.0.1:3309"
            }, 
            "clusterSetReplicationStatus": "STOPPED", 
            "globalStatus": "OK_NOT_REPLICATING", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",

  • Now, we will fix this by running the below rejoin command.
MySQL  localhost:3311 ssl  JS > myclusterset.rejoinCluster('cluster2')

...

Rejoining cluster 'cluster2' to the clusterset
NOTE: Cluster 'cluster2' is not invalidated
* Refreshing replication settings
** Changing replication source of 127.0.0.1:3312 to 127.0.0.1:3309
** Changing replication source of 127.0.0.1:3313 to 127.0.0.1:3309
** Changing replication source of 127.0.0.1:3311 to 127.0.0.1:3309

Cluster 'cluster2' was rejoined to the clusterset

...

After the rejoin, the stopped replication started automatically and synced with the main cluster.

"cluster2": {
            "clusterRole": "REPLICA", 
            "clusterSetReplication": {
                "applierStatus": "APPLIED_ALL", 
                "applierThreadState": "Waiting for an event from Coordinator", 
                "applierWorkerThreads": 4, 
                "receiver": "127.0.0.1:3311", 
                "receiverStatus": "ON", 
                "receiverThreadState": "Waiting for source to send event", 
                "source": "127.0.0.1:3309"
            }, 
            "clusterSetReplicationStatus": "OK", 
            "globalStatus": "OK", 
            "status": "OK", 
            "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.",

Summary

Here, we have discussed a few scenarios and methodologies that can be very useful for recovering cluster nodes and performing manual failovers in the topology. The options discussed above can be used in both InnoDB Cluster and InnoDB ClusterSet environments.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Aug
07
2023
--

Understanding the Differences Between InnoDB Undo Log and Redo Log

Differences Between InnoDB Undo Log and Redo Log

In InnoDB, the undo log and the redo log are two indispensable components that play a vital role in maintaining data integrity and ensuring transactional consistency. Both logs are crucial in the ACID (Atomicity, Consistency, Isolation, Durability) properties of database systems. Additionally, they are essential for the Multi-Version Concurrency Control (MVCC) mechanism. In this blog post, we will delve into the differences between the InnoDB undo log and redo log, exploring their significance and providing code examples to illustrate their usage.

InnoDB Undo Log

The undo log, also known as the rollback segment, is a crucial part of the InnoDB storage engine. Its primary purpose is to support transactional consistency and provide the ability to roll back changes made during a transaction. Here’s how it works:

Maintaining before-images: Whenever a transaction modifies data in InnoDB, the undo log records the before-image of the modified data. This before-image contains the original values of the modified rows, allowing for undoing or rolling back changes if needed.

Transaction isolation: The undo log plays a vital role in providing transaction isolation levels like READ COMMITTED or REPEATABLE READ. It ensures that within a transaction, other concurrent transactions can still read consistent data by using the before-images stored in the undo log.

Crash recovery: In the event of a system crash or restart, the undo log helps restore the database to a consistent state by applying the necessary undo operations based on the recorded before-images.

Code example – InnoDB Undo Log:

-- Begin a transaction
START TRANSACTION;

-- Modify data
UPDATE employees SET salary = salary * 1.1 WHERE department = 'Sales';

-- Rollback the changes
ROLLBACK;

Terminal 1: 

In Terminal 1, we created a table called “employees” and inserted some records into it. Then, we started a transaction, updated employees’ salaries in the “Sales” department, and observed that ten rows were affected. Finally, we selected all records from the “employees” table to see the updated salaries.

Table:

mysql> CREATE TABLE employees (
    ->   employee_id INT PRIMARY KEY,
    ->   name VARCHAR(100),
    ->   department VARCHAR(100),
    ->   salary DECIMAL(10, 2)
    -> );
Query OK, 0 rows affected (0.03 sec)

Insert a few records into the employees table.

mysql> INSERT INTO employees (employee_id, name, department, salary)
    -> VALUES
    ->   (1, 'John Smith', 'Sales', 50000.00),
    ->   (2, 'Jane Doe', 'Sales', 55000.00),
    ->   (3, 'Michael Johnson', 'Sales', 60000.00),
    ->   (4, 'Emily Williams', 'Sales', 52000.00),
    ->   (5, 'David Anderson', 'Sales', 58000.00),
    ->   (6, 'Sarah Thompson', 'Sales', 51000.00),
    ->   (7, 'Christopher Lee', 'Sales', 54000.00),
    ->   (8, 'Jessica Brown', 'Sales', 57000.00),
    ->   (9, 'Matthew Davis', 'Sales', 53000.00),
    ->   (10, 'Olivia Taylor', 'Sales', 56000.00);
Query OK, 10 rows affected (0.01 sec)
Records: 10  Duplicates: 0  Warnings: 0

mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)

mysql> UPDATE employees SET salary = salary * 1.1 WHERE department = 'Sales';
Query OK, 10 rows affected (0.00 sec)
Rows matched: 10  Changed: 10  Warnings: 0

Using the same terminal, let's query the table and observe that the query returns the updated records.

mysql> select * from mytest.employees;
+-------------+-----------------+------------+----------+
| employee_id | name            | department | salary   |
+-------------+-----------------+------------+----------+
|           1 | John Smith      | Sales      | 55000.00 |
|           2 | Jane Doe        | Sales      | 60500.00 |
|           3 | Michael Johnson | Sales      | 66000.00 |
|           4 | Emily Williams  | Sales      | 57200.00 |
|           5 | David Anderson  | Sales      | 63800.00 |
|           6 | Sarah Thompson  | Sales      | 56100.00 |
|           7 | Christopher Lee | Sales      | 59400.00 |
|           8 | Jessica Brown   | Sales      | 62700.00 |
|           9 | Matthew Davis   | Sales      | 58300.00 |
|          10 | Olivia Taylor   | Sales      | 61600.00 |
+-------------+-----------------+------------+----------+
10 rows in set (0.00 sec)
mysql>

Terminal 2: 

In Terminal 2, we queried the “employees” table and noticed that it returned the old records, which are the before-images stored in the undo log. This demonstrates how the undo log maintains transactional consistency, allowing concurrent transactions to read consistent data.

mysql> select * from mytest.employees;
+-------------+-----------------+------------+----------+
| employee_id | name            | department | salary   |
+-------------+-----------------+------------+----------+
|           1 | John Smith      | Sales      | 50000.00 |
|           2 | Jane Doe        | Sales      | 55000.00 |
|           3 | Michael Johnson | Sales      | 60000.00 |
|           4 | Emily Williams  | Sales      | 52000.00 |
|           5 | David Anderson  | Sales      | 58000.00 |
|           6 | Sarah Thompson  | Sales      | 51000.00 |
|           7 | Christopher Lee | Sales      | 54000.00 |
|           8 | Jessica Brown   | Sales      | 57000.00 |
|           9 | Matthew Davis   | Sales      | 53000.00 |
|          10 | Olivia Taylor   | Sales      | 56000.00 |
+-------------+-----------------+------------+----------+
10 rows in set (0.00 sec)

Terminal 1: 

In Terminal 1, we performed a rollback, which utilized the before-images to undo or roll back the changes made during the transaction. After the rollback, we selected all records from the “employees” table again and confirmed that the salaries reverted to their original values.

These actions showcase the role of the undo log in maintaining transactional consistency and providing the ability to revert changes when needed.

mysql> ROLLBACK;
Query OK, 0 rows affected (0.04 sec)
mysql> select * from mytest.employees;
+-------------+-----------------+------------+----------+
| employee_id | name            | department | salary   |
+-------------+-----------------+------------+----------+
|           1 | John Smith      | Sales      | 50000.00 |
|           2 | Jane Doe        | Sales      | 55000.00 |
|           3 | Michael Johnson | Sales      | 60000.00 |
|           4 | Emily Williams  | Sales      | 52000.00 |
|           5 | David Anderson  | Sales      | 58000.00 |
|           6 | Sarah Thompson  | Sales      | 51000.00 |
|           7 | Christopher Lee | Sales      | 54000.00 |
|           8 | Jessica Brown   | Sales      | 57000.00 |
|           9 | Matthew Davis   | Sales      | 53000.00 |
|          10 | Olivia Taylor   | Sales      | 56000.00 |
+-------------+-----------------+------------+----------+
10 rows in set (0.00 sec)

In the above code snippet, the transaction modifies the salary column for employees in the Sales department. If the transaction is rolled back, the undo log is used to revert the changes and restore the original values.
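
The mechanism demonstrated above can be sketched as a toy in-memory undo log (a deliberately simplified illustration of the idea, not InnoDB's actual implementation):

```python
# Toy undo log: record before-images so a transaction can be rolled back.
class ToyUndoLog:
    def __init__(self, table):
        self.table = table
        self.undo = []  # stack of (row_id, column, old_value) before-images

    def update(self, row_id, column, new_value):
        # Save the before-image first, then apply the change
        self.undo.append((row_id, column, self.table[row_id][column]))
        self.table[row_id][column] = new_value

    def rollback(self):
        # Replay before-images in reverse order to undo the transaction
        while self.undo:
            row_id, column, old_value = self.undo.pop()
            self.table[row_id][column] = old_value

employees = {1: {"salary": 50000.0}, 2: {"salary": 55000.0}}
txn = ToyUndoLog(employees)
txn.update(1, "salary", 50000.0 * 1.1)
txn.update(2, "salary", 55000.0 * 1.1)
txn.rollback()
print(employees)  # salaries are back to their original values
```
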

InnoDB Redo Log

The redo log, also known as the transaction log or InnoDB log, serves a different purpose than the undo log. Its primary function is to ensure durability and aid in crash recovery. Let’s explore its characteristics:

Recording changes: Whenever a transaction modifies data in InnoDB, the redo log records the changes made to the database. It stores the actual modifications performed, such as inserting new rows, updating existing rows, or deleting rows.

The redo log not only records the changes made to the database but also the modifications written into the rollback segments, which temporarily store undo data during transactions. This ensures that during recovery the system can restore the database to a consistent state by applying both the database changes and the modifications stored in the rollback segments.

Write-ahead logging: The redo log follows a write-ahead logging approach, meaning that changes are written to the redo log before the corresponding data pages are updated. This ensures that in the event of a crash, the changes recorded in the redo log can be replayed to restore the database to a consistent state.

Crash recovery: During crash recovery, the redo log is crucial for replaying the logged changes to bring the database to a consistent state. By reapplying the changes recorded in the redo log, InnoDB can recover transactions that were committed but not yet written to disk. 

Redo logs are also used to make consistent physical backups. Tools like Percona XtraBackup leverage the redo log during the prepare stage to make the backup consistent. This distinguishes physical backups from logical ones: a logical backup represents the database as it was at the beginning of the backup, while a physical backup includes the changes made during the backup process and represents the database as it was at the end.

-- Begin a transaction
START TRANSACTION;

-- Modify data
UPDATE products SET stock = stock - 10 WHERE id = 123;

-- Commit the transaction
COMMIT;

In the above example, the transaction updates the stock column of a product identified by its ID. The changes are recorded in the redo log, ensuring durability and enabling recovery in case of a crash.

Conclusion

Understanding the differences between the InnoDB undo log and redo log is crucial for ensuring data integrity and durability in database systems. The undo log serves as a critical component for maintaining transactional consistency, allowing for the rollback of changes and supporting transaction isolation levels. On the other hand, the redo log plays a vital role in ensuring durability and aiding in crash recovery by recording database changes and following a write-ahead logging approach.

It is worth noting that the processes of flushing (related to checkpointing and the redo log) and purging (related to the undo logs) can have a significant impact on system performance and overhead. Flushing refers to the periodic writing of dirty pages from memory to disk, driven by checkpointing against the redo log. This process helps ensure that changes are safely persisted to disk, contributing to durability. Purging, on the other hand, involves removing old or unnecessary data from the system, such as undo log records that are no longer needed. Properly managing flushing and purging operations is crucial for system performance and tuning.

By understanding the overhead, tuning considerations, and the role of flushing and purging in relation to the undo log and redo log, database administrators can optimize their systems to achieve the desired balance between data integrity, performance, and storage requirements.

In summary, the InnoDB undo log and redo log are indispensable components that work together to provide transactional consistency, durability, and crash recovery capabilities. Alongside flushing and purging operations, these logs are essential for managing data integrity and system performance in database systems.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

 

Try Percona Distribution for MySQL today!

Jul
05
2023
--

The Impacts of Fragmentation in MySQL

impacts of Fragmentation in MySQL

Fragmentation is a common concern in some database systems. Highly fragmented tables can affect performance and resource allocation. But reducing fragmentation often involves rebuilding the table completely. This blog post will discuss fragmentation and its impact on InnoDB.

What is fragmentation?

We say that something is fragmented when it is formed by parts that are separate or placed in a different order than the natural one. In databases, we can encounter different types of fragmentation:

  • Segment Fragmentation: segments are stored not following the order of the data, or there are gaps of empty pages between the data pages.
  • Tablespace Fragmentation: the tablespace is stored in non-consecutive filesystem blocks.
  • Table Fragmentation: data is not stored following the primary key order (heap tables), or table pages have a significant amount of free space (heap tables or index tables).
  • Index Fragmentation: indexes usually store the data following the index order (b-tree) or in a random order (hash). In both cases, fragmentation means free space inside the index pages.
  • Row Fragmentation: single rows are spread across multiple pages.

Not all types of fragmentation can happen on all the database technologies, and sometimes you can’t avoid a certain degree of fragmentation. InnoDB tables are index tables, meaning data is organized following the primary key.

The principle of locality

While the principle of locality is usually related to processors and cache access patterns, it also applies to data access in general. This principle describes two data access patterns: spatial and temporal locality.

Temporal locality means that recently retrieved data is likely to be required again within a short period of time. Spatial locality tells us that related (nearby) data tends to be accessed together. If your data is organized following the principle of locality, data access will be more efficient.

How does fragmentation affect the locality of data?

Table and index fragmentation often lead to database pages containing significant free space. This reduces the probability of storing frequently accessed data together and goes against the temporal locality principle.

Table fragmentation also affects the spatial locality by storing related data on different database pages. Regarding tablespace and segment fragmentation, modern storage systems tend to reduce the impact of these types of fragmentation.

The situation with row fragmentation is a bit different. We usually have three possibilities: row fragmentation is not supported, it is supported through row chaining (the partial row has a pointer to another page where the row continues), or it is supported only for large datatypes. In the last case, blobs are usually stored out-of-band in a separate area. This type of fragmentation can improve performance and efficiency.

Fragmentation in InnoDB

In InnoDB, everything is an index. Primary keys are important because they define how data will be sorted in the table. One of the effects of this design is that there is no fragmentation due to unordered data. But we still can have fragmentation caused by free space within the table pages.

To split or not to split, that is the question.

InnoDB stores the rows in pages. A new row is placed on a specific page based on the primary key. But what happens when a page becomes full? InnoDB must allocate a new page where the new row will be stored. Here InnoDB is quite clever. Most RDBMSs perform a page split: a new page is created, and half the contents of the full page are moved to the newly allocated page, leaving two half-full pages. What InnoDB does instead is analyze the insertion pattern and, if it is sequential, create a new page and place the new row there. This is very efficient for sequential primary key inserts.

The inserts don’t need to be purely sequential; they just need to follow a direction: incremental or decremental. We will not cover the internals here; suffice it to say that each index leaf page has metadata to indicate the direction of recent inserts and how many rows were inserted following it.

Random vs. Sequential inserts, effect on fragmentation

As we explained before, InnoDB has a clever method to identify if a new row has to be inserted in an empty page or if it makes sense to perform a page split. This method is extremely efficient for sequential insertions because it generates the minimum number of additional pages and, traditionally, has been considered harmful for non-sequential insertions (when the primary key is random or unknown).

We will review the process with sequential and random insertions to understand how rows are inserted. But first, let’s see the contents of an empty page. Initially, we have some metadata and free space for new data.

InnoDB empty leaf page

Once we start inserting data into this page, it does not matter if the data is sequential or not; the page will start filling.

InnoDB Leaf with data

But once we go below 1/16th of free space, we must allocate new pages. How new pages are allocated and filled depends on whether the insertions are sequential or random. For sequential insertions, we have this pattern:

InnoDB sequential insertion pattern

The new data is inserted in the new leaf page. But for random insertions, we have a different behavior:

InnoDB random insertion pattern

We must reserve free space on both pages because we can’t assume where new rows will be inserted.

As new rows are inserted, the sequential inserts will continue with a low fragmentation:

InnoDB sequential primary key insertion impact on storage allocation

But what happens with random insertions? For simplicity, we are assuming that the primary keys are uniformly distributed.

Impact of random primary key insertion on storage allocation

Here we see an interesting phenomenon: when fragmentation is low, new inserts may trigger page splits that increase fragmentation. But once we reach a certain level of fragmentation, almost all pages have enough free space to accept new rows without splitting, until the split threshold is reached again and new splits happen.

This means that random insertions lead to temporary fragmentation.
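To make the split behavior above concrete, here is a toy simulation (a deliberate simplification: the page capacity, keys, and split heuristic are assumptions for illustration, not InnoDB's actual algorithm):

```python
import random

PAGE_CAPACITY = 16  # hypothetical rows per page; real InnoDB pages are 16 KB


def insert_all(keys):
    """Insert keys into ordered pages. A full rightmost page receiving an
    ever-increasing key gets a fresh page (InnoDB-style sequential
    optimization); otherwise a full page is split 50/50."""
    pages = [[]]
    splits = 0
    for k in keys:
        # Find the page whose key range should hold k.
        i = 0
        for j, p in enumerate(pages):
            if p and p[0] <= k:
                i = j
        page = pages[i]
        if i == len(pages) - 1 and len(page) >= PAGE_CAPACITY and k > page[-1]:
            pages.append([k])  # sequential pattern: new page, old one stays full
            continue
        page.append(k)
        page.sort()
        if len(page) > PAGE_CAPACITY:
            splits += 1  # random pattern: split into two half-full pages
            mid = len(page) // 2
            pages[i], rest = page[:mid], page[mid:]
            pages.insert(i + 1, rest)
    fill = sum(len(p) for p in pages) / (len(pages) * PAGE_CAPACITY)
    return splits, fill


keys = list(range(2000))
seq_splits, seq_fill = insert_all(keys)

random.seed(1)
random.shuffle(keys)
rnd_splits, rnd_fill = insert_all(keys)

print(f"sequential: {seq_splits} splits, {seq_fill:.0%} average fill")
print(f"random:     {rnd_splits} splits, {rnd_fill:.0%} average fill")
```

Sequential inserts leave every page full and never split, while random inserts trigger splits and settle at a noticeably lower average fill, matching the oscillating pattern described above.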

Random inserts and deletes

The previous case covered situations where we only insert data, but what happens when we also remove it? Often we have tables where old rows are periodically purged. If they are purged in primary key order, there is no problem: empty pages will be removed completely from the beginning of the index.

Impact of InnoDB sequential primary key insertion and deletion on storage allocation

What we see here is that old rows belong to the same pages, and once they are removed, it is possible to return that page to the tablespace. Later this page will be allocated again to be used for new rows.

But what happens when inserts and deletions are random? If the data is distributed randomly, it is reasonable to assume that deletes are random too.

Impact of InnoDB random primary key insertion and deletion on storage allocation.

As we can see, as long as the number of rows deleted and inserted is roughly equal, this pattern will not significantly increase (or decrease) fragmentation.

The ideal situation with random inserts and deletions is having enough space to insert new rows without reaching the split point.

Additional causes of fragmentation

Three factors define additional causes of fragmentation. We’ve been analyzing the first one previously: how data is inserted. The other two are data modification (updates) and data removal (deletes).

When a null field is filled with data or a varchar field content is replaced by a longer text, the page must make room for this extra data. What happens if there is not enough free space? InnoDB will split the page into two half-full pages. This increases fragmentation. To avoid this, InnoDB reserves 1/16th of each page for data modification. This reservation is made regardless of the insertion pattern.

If you increase the size of many rows, you will generate fragmentation.

And what happens with deletes? Deletes increase fragmentation by reducing the number of records stored on affected pages. To fix this, if a page goes below 50% utilization, InnoDB will look at adjacent pages (following the index order), and if one of those pages is below 50%, it will merge both pages into one. The freed space will not return to the file system but will be reused by new pages.

Detecting fragmentation

Currently, there is no easy way to measure page fragmentation. There is a column in the table information_schema.tables that is supposed to contain the average row length.

https://dev.mysql.com/doc/refman/8.0/en/information-schema-tables-table.html

But after some verification, we’ve found this is not properly calculated. We opened a bug to MySQL support and Percona Support to get this fixed.

https://bugs.mysql.com/bug.php?id=107739

To calculate the fragmentation, it is feasible to write an SQL script that returns the total size of the data in the table using the information from the MySQL documentation and the actual data length of variable-length columns. We can estimate the average fragmentation by comparing that data with the number of pages allocated for data. Unfortunately, this would not provide enough information to detect fragmentation hot spots.

https://dev.mysql.com/doc/refman/8.0/en/storage-requirements.html
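Such an estimate could be sketched as follows (every figure here is hypothetical; it simply compares the computed data volume against the allocated pages):

```python
# Hypothetical figures, as they might be read from information_schema.tables
# and from a manual row-size calculation per the storage-requirements docs.
table_rows = 2_000_000
computed_row_bytes = 68            # sum of column storage requirements (assumed)
data_length = 220 * 1024 ** 2      # bytes allocated to the clustered index (assumed)

page_size = 16 * 1024              # default InnoDB page size
pages_allocated = data_length / page_size
pages_needed = table_rows * computed_row_bytes / page_size

estimated_fill = pages_needed / pages_allocated
print(f"estimated average page fill: {estimated_fill:.0%}")
```

Note that this yields only a table-wide average; as mentioned, it cannot locate fragmentation hot spots.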

Measuring page splits

An indirect method to identify that fragmentation is happening is measuring the number of page splits. Unfortunately, it is impossible to measure the number of page splits created during inserts or updates on a specific table.

The global statistics about InnoDB page splits are stored in the information_schema table innodb_metrics.

These statistics must be enabled by using the innodb_monitor_enable global variable.

https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_monitor_enable

https://dev.mysql.com/doc/refman/8.0/en/innodb-information-schema-metrics-table.html

InnoDB Ruby

It is possible to analyze InnoDB structures using an external open source tool called innodb_ruby. The tool was developed by Jeremy Cole and is available here:

https://github.com/jeremycole/innodb_ruby

There is also a wiki page that documents application usage:

https://github.com/jeremycole/innodb_ruby/wiki

To get an overview of fragmentation for a specific table, you can use the following command:

innodb_space -f <tablespace_file>.ibd space-extents-illustrate

This command returns a graphical representation of the tablespace using different formats to display space allocation on each page.

InnoDB Ruby

Reducing fragmentation

Once a table is fragmented, the only method to reduce fragmentation is rebuilding the table. The problem is that, with random inserts, the table will fragment again quickly: the rebuild leaves no free space on the pages for new rows, so new inserts trigger page splits that can soon bring us back to a point where almost all the pages are half full.

Innodb_fill_factor

Ideally, if we perform random inserts, we must allocate enough space for new inserts after a full table rebuild. There is a global variable that exactly does this: innodb_fill_factor.

https://dev.mysql.com/doc/refman/8.0/en/innodb-parameters.html#sysvar_innodb_fill_factor

Innodb_fill_factor defines the percentage of space on each B-tree page that is filled during a sorted index build, with the remaining space reserved for future index growth. For example, setting innodb_fill_factor to 80 reserves 20 percent of the space on each B-tree page for future index growth. Actual percentages may vary. The innodb_fill_factor setting is interpreted as a hint rather than a hard limit.

This means that if we perform random inserts and deletes and we rebuild the tables using a fill factor big enough to hold all the inserts before the purge, the table will maintain a low level of fragmentation.
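As a back-of-the-envelope illustration of that sizing idea (all numbers here are hypothetical), you can estimate a fill factor from the number of rows expected to arrive between a rebuild and the next purge:

```python
# Hypothetical workload figures, for illustration only.
existing_rows = 2_000_000          # rows in the table at rebuild time
inserts_before_purge = 300_000     # new rows expected before old ones are deleted

# Fraction of page space to leave free so the new rows fit without splits.
headroom = inserts_before_purge / (existing_rows + inserts_before_purge)
fill_factor = int((1 - headroom) * 100)
print(f"suggested innodb_fill_factor: {fill_factor}")
```

Since innodb_fill_factor is only a hint, treat the result as a starting point and validate it with measurements like the tests that follow.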

Random insert and delete tests and recommended fill factor

We performed multiple tests with different fill factors. The test performed consisted of the following:

  1. Create a table.
  2. Insert 2,000,000 records using md5 as the function to generate the hash.
  3. Set the fill factor to the test value.
  4. Optimize the table.
  5. Repeat 400 times
    1. Insert 10,000 rows into the table.
    2. Remove 10,000 rows from the table.
    3. Measure the fragmentation.

We tested with these fill factors: 75, 80, 82, 83, 85, and 100.

Total space file size

This chart shows the initial and final space allocation.

Initial and final size of innodb file after multiple insertions and deletions.

As we can see, using a fill factor of 83 provides the best results for this test.

Page splits and merges

We also analyzed the number of page splits (the number of times a row does not fit in the corresponding page and the page needs to be split into two pages) and the number of page merges (the number of times that, after a delete operation, a page goes below 50% and InnoDB tries to merge it with adjacent pages).

Fill factor   Page splits   Merge attempts   Successful merges
 75                    63                4                   3
 80                   565              108                  34
 82                  1363              348                 106
 83                  2063              658                 203
 85                  4350             1324                 318
100                 44771            15527                2323

As we can see, some page splits occur at every fill factor. With a fill factor of 100, we had a page split for every 89 rows inserted, while with a fill factor of 83, we had a page split for every ~1,939 rows inserted.
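The split rates can be derived directly from the table above (the row count comes from our test of 400 iterations of 10,000 inserts each):

```python
rows_inserted = 400 * 10_000  # 400 iterations of 10,000 inserts each

# Page splits per fill factor, from the results table above.
page_splits = {75: 63, 80: 565, 82: 1363, 83: 2063, 85: 4350, 100: 44771}

rows_per_split = {ff: round(rows_inserted / splits)
                  for ff, splits in page_splits.items()}

for ff in sorted(rows_per_split):
    print(f"fill factor {ff:>3}: one page split per {rows_per_split[ff]:>6} rows")
```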

Fragmentation maps

We provide fragmentation maps for 75, 83, and 100 fill factors after 400 iterations.


Conclusions

Fragmentation usually is not a problem for InnoDB tables. InnoDB deals with fragmentation quite efficiently, and table rebuilds are seldom needed.

There is only an edge case when data is inserted following a random primary key. In this case, results will depend on the table’s structure, the keys’ distribution, and how frequently data is inserted or removed.

For our tests, a value of innodb_fill_factor of around 83% was optimal. It allowed for keeping the fragmentation under control. Smaller fill factors did not provide additional benefits. Your mileage may vary.

If you have a large table with random insertions, we recommend using a tool like innodb_ruby to monitor fragmentation and analyze if the table needs a rebuild with a different fill factor.


Mar
23
2023
--

InnoDB Performance Optimization Basics

InnoDB Performance Optimization Basics

This blog is a follow-up to our previous “InnoDB Performance Optimization Basics” posts from 2007 and 2013. Although there have been many blogs about adjusting MySQL variables for better performance since then, I think this topic deserves an update: the last one was a decade ago, and MySQL 5.7 and 8.0 have been released since then with some major changes.

These guidelines work well for a wide range of applications, though the optimal settings, of course, depend on the workload.

Hardware

Memory

The amount of RAM to be provisioned for database servers can vary greatly depending on the size of the database and the specific requirements of the company. Some servers may need a few GBs of RAM, while others may need hundreds of GBs or even terabytes of RAM. Factors that can affect the amount of RAM needed by a database server include the total size of the database, the number of concurrent users, and the complexity of the database queries. As datasets continue to grow in size, the amount of RAM required to store and process these datasets also increases. By caching hot datasets, indexes, and ongoing changes, InnoDB can provide faster response times and utilize disk IO in a much more optimal way.

CPU

From a CPU standpoint, faster processors with many cores provide better throughput. CPUs with 32/64 cores are still common, and we see some large clients with 96 cores, and the latest MySQL versions can utilize them much better than before. However, it is worth noting that simply adding more CPU cores does not always result in improved performance. CPU core usage will also depend on the specific workload of the application, such as the number of concurrent users or the complexity of the queries being run.

Storage

The type of storage and disk used for database servers can have a significant impact on performance and reliability. Nowadays, solid-state drives (SSDs) or non-volatile memory express (NVMe) drives are preferred over traditional hard disk drives (HDDs) for database servers due to their faster read and write speeds, lower latency, and improved reliability. While NVMe or SSDs are generally more expensive than HDDs, the increased performance and reliability that they offer make them a cost-effective choice for database servers that require fast access to data and minimal downtime. RAID 10 is still the recommended level for most workloads, but make sure your RAID controller can utilize the SSD drive’s performance and will not become the actual bottleneck.

Operating system

Linux is the most common operating system for high-performance MySQL servers. Make sure to use modern filesystems, like EXT4, XFS, or ZFS on Linux, combined with the most recent kernel. Each of them has its own limits and advantages: for example, XFS is fast in deleting large files, while EXT4 can provide better performance on fast SSD drives, and ZFS on Linux has progressed a lot. Benchmark before you decide.

For database servers, we usually recommend our clients have:

  1. Jemalloc installed and enabled for MySQL.
  2. Transparent huge pages (THP) disabled.
  3. Setting swappiness to one is generally recommended, lowering the tendency of swapping.
  4. Setting oom_score_adj to -800.

Cloud

Different cloud providers offer a range of instance types and sizes, each with varying amounts of CPU, memory, and storage. Some cloud providers also offer specialized instances for database workloads, which may provide additional features and optimizations for performance and scalability. One of the benefits of cloud-based database servers is the ability to scale resources up or down as needed. It’s important to consider the potential need for scaling and select an instance type and size to accommodate future growth. Some cloud providers also offer auto-scaling features that can automatically adjust the number of instances based on workload demand.

MySQL InnoDB settings

(Dynamic) – Does not require MySQL restart for change.

(Static) – Requires MySQL restart for change.

innodb_buffer_pool_size (Dynamic) – InnoDB relies heavily on the buffer pool and should be set correctly. Typically a good value is 70%-80% of available memory. Also, refer to innodb_buffer_pool_chunk_size mentioned below. 

innodb_buffer_pool_instances  (Static) – Using multiple buffer pool instances is useful in highly concurrent workloads as it may reduce contention on the global mutexes. The optimal value can be decided after testing multiple settings; starting from eight is a good choice.

innodb_buffer_pool_chunk_size  (Static) – Defines the chunk size by which the buffer pool is enlarged or reduced. This variable is not dynamic, and if it is incorrectly configured, it could lead to undesired situations.  Refer to InnoDB Buffer Pool Resizing: Chunk Change for more details on configuration.
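To illustrate how these three settings interact, here is a hedged sizing sketch (the 64 GB of RAM and the 75% target are assumptions; 128 MB is the default chunk size, and the pool size is rounded to a multiple of chunk size × instances, as the server does):

```python
GB = 1024 ** 3
ram = 64 * GB                     # hypothetical server memory
chunk_size = 128 * 1024 ** 2      # default innodb_buffer_pool_chunk_size
instances = 8                     # innodb_buffer_pool_instances

target = int(ram * 0.75)          # aim for ~75% of RAM

# InnoDB rounds the pool size up to a multiple of chunk_size * instances.
unit = chunk_size * instances
buffer_pool_size = -(-target // unit) * unit  # ceiling division

print(f"innodb_buffer_pool_size = {buffer_pool_size // GB}G")
```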

innodb_log_file_size (Static) – Large enough InnoDB transaction logs are crucial for good, stable write performance. But also larger log files mean that the recovery process will be slower in case of a crash. However, this variable has been deprecated since 8.0.30. Refer to innodb_redo_log_capacity below. 

innodb_redo_log_capacity (Dynamic) – Introduced in 8.0.30, this defines the amount of disk space occupied by redo log files. This variable supersedes the innodb_log_files_in_group and innodb_log_file_size variables. When this setting is defined, the innodb_log_files_in_group and innodb_log_file_size settings are ignored (those two variables are now deprecated since 8.0.30).

innodb_log_buffer_size (Dynamic) – InnoDB writes changed data records into its log buffer, which is kept in memory, and it saves disk I/O for large transactions as it does not need to write the log of changes to disk before transaction commit. If you have transactions that update, insert, or delete many rows, making the log buffer larger saves disk I/O. 

innodb_flush_log_at_trx_commit (Dynamic) – The default value of ‘1’ gives the most durability (ACID compliance) at a cost of increased filesystem writes/syncs. Setting the value to ‘0’ or ‘2’ will give more performance but less durability. At a minimum, transactions are flushed once per second.

innodb_thread_concurrency (Dynamic) – With improvements to the InnoDB engine, it is recommended to allow the engine to control the concurrency by keeping it to the default value (which is zero). If you see concurrency issues, you can tune this variable. A recommended value is two times the number of CPUs plus the number of disks. 

innodb_flush_method (Static) – Setting this to O_DIRECT will avoid a performance penalty from double buffering; this means InnoDB bypasses the operating system’s file cache and writes data directly to disk (reducing the number of I/O operations required).

innodb_online_alter_log_max_size (Dynamic) – The upper limit in bytes on the size of the temporary log files used during online DDL operations for InnoDB tables. If a temporary log file exceeds the upper size limit, the ALTER TABLE operation fails, and all uncommitted concurrent DML operations are rolled back. Thus, a large value for this option allows more DML to happen during an online DDL operation but also extends the period of time at the end of the DDL operation when the table is locked to apply the data from the log.

innodb_numa_interleave (Static) – For ‘NUMA enabled systems’ with large amounts of memory (i.e.,> 128GB), we recommend turning on NUMA interleaving. Enabling this parameter configures memory allocation to be ‘interleaved’ across the various CPU-Memory channels. This helps “even out” memory allocations so that one CPU does not become a memory bottleneck.

innodb_buffer_pool_dump_at_shutdown/innodb_buffer_pool_load_at_startup (Dynamic/Static respectively) – These variables allow you to dump the contents of the InnoDB buffer pool to disk at shutdown and load it back at startup, which will pre-warm the buffer pool so that you don’t have to start with a cold buffer pool after a restart.

innodb_buffer_pool_dump_pct (Dynamic) – This option defines the percentage of most recently used buffer pool pages to dump. By default, MySQL only saves 25% of the most actively accessed pages, which should be reasonable for most use cases; the dump can then be loaded faster than if you tried to load every page in the buffer pool (100%), many of which might not be necessary for a general workload. You can increase this percentage if needed for your use case.

Innodb_io_capacity (Dynamic) –  It defines the number of I/O operations per second (IOPS) available to InnoDB background tasks, such as flushing pages from the buffer pool and merging data from the change buffer. Ideally, keep the setting as low as practical but not so low that background activities fall behind. Refer to this for more information on configuration.

Innodb_io_capacity_max (Dynamic) –  If the flushing activity falls behind, InnoDB can flush more aggressively, at a higher rate than innodb_io_capacity. innodb_io_capacity_max defines the maximum number of IOPS performed by InnoDB background tasks in such situations. Refer to Give Love to Your SSDs – Reduce innodb_io_capacity_max! for more information on configuration.

innodb_autoinc_lock_mode (Static) – Setting the value to ‘2’ (interleaved mode) can remove the need for an auto-inc lock (at the table level) and can increase performance when using multi-row insert statements to insert values into a table with an auto-increment primary key. Note that this requires either ROW or MIXED binlog format. (The default setting is 2 as of MySQL 8.0)

innodb_temp_data_file_path (Static) – Defines the relative path, name, size, and attributes of InnoDB temporary tablespace data files. If you do not specify a value for innodb_temp_data_file_path, the default behavior is to create a single, auto-extending data file named ibtmp1 in the MySQL data directory. For 5.7, it is recommended to set a max value to avoid the risk of the datadir partition filling up due to a heavy or bad query. MySQL 8.0 introduced session temporary tablespaces; user-created temporary tables and internal optimizer tables no longer use ibtmp1.

innodb_stats_on_metadata (Dynamic) – The default setting of “OFF” avoids unnecessary updating of InnoDB statistics and can greatly improve read speeds.

innodb_page_cleaners (Static) – InnoDB supports multiple page cleaner threads for flushing dirty pages from buffer pool instances. If your workload is write-IO bound when flushing dirty pages from buffer pool instances to data files, and if your system hardware has available capacity, increasing the number of page cleaner threads may help improve write-IO throughput.

innodb_deadlock_detect (Dynamic) – This option can be used to disable deadlock detection. On high-concurrency systems, deadlock detection can cause a slowdown when numerous threads wait for the same lock.

Application tuning for InnoDB

Make sure your application is prepared to handle deadlocks that may happen. Review your table structure and see how you can take advantage of InnoDB properties – clustering by primary key, having a primary key in all indexes (so keep primary key short), and fast lookups by primary keys (try to use it in joins).

Conclusion

There are many other options you may want to tune, but here we’ve covered the important InnoDB parameters, OS-related tweaking, and hardware for optimal MySQL server performance. I hope this helps!

If all of this seems overwhelming or difficult to understand, our world-class MySQL Training is here to help. Our two-day Intensive Scaling and Optimizations for MySQL class walks you through much of what you need to know to optimize your MySQL performance.


References

How MySQL 8.0.21 and MariaDB 10.5.4 Perform in IO-Bound Scenarios on SATA SSD and NVMe Storage

Don’t Spin Your Data, Use SSDs!

Enabling and Disabling Jemalloc on Percona Server

Settling the Myth of Transparent HugePages for Databases

MySQL 101: Linux Tuning for MySQL

MySQL 101: Parameters to Tune for MySQL Performance

Adjusting MySQL 8.0 Memory Parameters

Mar
06
2023
--

How To Calculate a Good MySQL Redo Log Size in MySQL 8

Calculate a Good MySQL Redo Log Size in MySQL 8

MySQL uses Redo Logs internally during crash recovery to correct data written by incomplete transactions. But how do you know what the right Redo Log size is? We will walk through how to figure that out in this blog.

We already have a couple of posts related to this topic: “How to calculate a good InnoDB log file size” and “How to Choose the MySQL innodb_log_file_size.” The first one has the formula I’ll mention in this blog, and the second has more details regarding Redo Logs. Most of it is still valid; just replace innodb_log_file_size / innodb_log_files_in_group with innodb_redo_log_capacity.

So in this one, I want to focus on two things:

  1. Present the new way of working with Redo Logs (No longer innodb_log_file_size / innodb_log_files_in_group).
  2. Validate the formula output and compare it with the information you can get from Percona Monitoring and Management (PMM).

The TL;DR for the second point is: 

  • If you want to have a quick (not necessarily correct) answer, you can still use the formula.
  • If you want an answer based on detailed information about your database usage and patterns, it’s recommended to use Percona Monitoring and Management.

PMM graphs with related information (in MySQL / InnoDB Details / InnoDB Logging):

  • Redo Generation Rate.
  • InnoDB Log File Usage Hourly.

Formula:

mysql> pager grep sequence; show engine innodb status\G select sleep(60); show engine innodb status\G nopager;

mysql> select (<second_value> - <first_value>) / 1024 / 1024 as MB_per_min;

 

NOTE: This formula is from the first post mentioned above; it’s used to see how many MB have been written to the log in one minute.
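The same arithmetic can be expressed in code. The LSN is a byte offset into the redo stream, so the difference between two readings is the amount of redo written in that interval; a small sketch, using the first pair of readings from the “Testing the formula” section below:

```python
def redo_mb_per_minute(lsn_start, lsn_end, seconds=60):
    """Redo generated between two 'Log sequence number' readings, in MB/min.

    The LSN is a byte offset into the redo stream, so the difference between
    two readings taken `seconds` apart is the number of redo bytes written.
    """
    bytes_written = lsn_end - lsn_start
    return bytes_written / 1024 / 1024 * (60 / seconds)


# Two readings taken 60 seconds apart.
rate = redo_mb_per_minute(3182015832, 3252604221)
print(f"{rate:.2f} MB/min = {rate * 60 / 1024:.2f} GB/hour")
# -> 67.32 MB/min = 3.94 GB/hour
```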

Redo Logs

Starting from MySQL 8.0.30, the variable that should be tuned for optimizing the Redo Logs is innodb_redo_log_capacity, and we start with good news here: It’s dynamic! So you won’t have downtime if you need to tune this.

MySQL versions before 8.0.30:

innodb_log_file_size and innodb_log_files_in_group:

By default, InnoDB creates two redo log files (innodb_log_files_in_group) in the data directory, named ib_logfile0 and ib_logfile1, and writes to these files in a circular fashion; each with a size of innodb_log_file_size.

MySQL versions 8.0.30 and later:

innodb_redo_log_capacity 

InnoDB creates 32 files (each of them with a size of innodb_redo_log_capacity / 32) in the #innodb_redo directory in the data directory unless a different directory was specified by the innodb_log_group_home_dir variable.

So, with the new variable in place, how is the value calculated?

  1. If innodb_redo_log_capacity is configured, that’s the value used (ignoring innodb_log_file_size and innodb_log_files_in_group).
  2. If innodb_redo_log_capacity is not set, but innodb_log_file_size and innodb_log_files_in_group are configured, MySQL will compute the value (innodb_log_files_in_group * innodb_log_file_size = innodb_redo_log_capacity).
  3. If none of the values are configured, redo log capacity is set to the innodb_redo_log_capacity default value (100 MB).
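The precedence above can be sketched as a small function (a simplified model of the server’s behavior; it ignores details such as rounding the capacity to file boundaries):

```python
DEFAULT_REDO_LOG_CAPACITY = 100 * 1024 * 1024  # 100 MB


def effective_redo_log_capacity(redo_log_capacity=None,
                                log_file_size=None,
                                log_files_in_group=None):
    """How MySQL 8.0.30+ derives the redo log capacity from configuration."""
    # 1. innodb_redo_log_capacity wins when set; the legacy variables are ignored.
    if redo_log_capacity is not None:
        return redo_log_capacity
    # 2. Otherwise the legacy pair is multiplied together.
    if log_file_size is not None and log_files_in_group is not None:
        return log_file_size * log_files_in_group
    # 3. Otherwise the default applies.
    return DEFAULT_REDO_LOG_CAPACITY


# Legacy settings of 2 files x 1 GB yield a 2 GB capacity.
gb = 1024 ** 3
print(effective_redo_log_capacity(log_file_size=gb, log_files_in_group=2) // gb)  # -> 2
```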

Just as a reminder about why you want to tune this: If the Redo Log is full, MySQL must execute a checkpoint (flushing from InnoDB Buffer Pool to disk). If your Redo Logs are small, checkpoints will happen more often, adding more I/O and impacting the performance.

A rule of thumb here is to configure the Redo Logs to be able to hold one hour of traffic, letting checkpointing proceed in a smooth, less aggressive way. Otherwise, the flushing may happen at the worst possible time because the checkpoint becomes unavoidable.

Testing the formula

To have a general idea, I executed some basic mysqlslap commands on a loop to create load in the database. This is far from a real scenario test, but it should help us in this particular case.

$ while true; do mysqlslap --auto-generate-sql -vv --concurrency=100 --iterations=200 --number-char-cols=7 --number-int-cols=10; done

Testing the “rule of thumb formula” shared in the first blog post:

mysql> pager grep sequence; show engine innodb status\G select sleep(60); show engine innodb status\G nopager;

Log sequence number          3182015832

Log sequence number          3252604221

mysql> select (3252604221 - 3182015832) / 1024 / 1024 as MB_per_min;

+-------------+
| MB_per_min  |
+-------------+
| 67.31833363 |
+-------------+

I tested a couple of times again, with the following results:

mysql> select (3428540758 - 3366312894) / 1024 / 1024 as MB_per_min;
+-------------+
| MB_per_min  |
+-------------+
| 59.34511566 |
+-------------+

mysql> select (3517465187 - 3449605532) / 1024 / 1024 as MB_per_min;
+-------------+
| MB_per_min  |
+-------------+
| 64.71601009 |
+-------------+

As we can see, with a consistent load, the rule of thumb formula gives an average of 63.8 MB per minute (3.74 GB per hour).

This works, but it’s manual, and you need to be online and run this a few times during peak hours.

Using PMM

Inside PMM, you can find a lot of useful graphs, and for this blog, I will focus on MySQL / InnoDB Details / InnoDB Logging.

The first graph I want to point out is Redo Generation Rate. I’ve filtered to see the timeframe when mysqlslap was running on a loop, and the value is 1.1 MB/s (we can extrapolate to 66 MB per minute or 3.87 GB per hour, almost the same as the result from the formula above).


That being said, this graph alone gives you a better answer about how much redo is being created; you don’t need to run the commands manually while the load is happening, and you can check it later on, or even for a particular time frame. For example, if your biggest load happens over the weekends, you don’t want to be online only to calculate the proper settings.

There’s yet another graph in PMM you can use to check the hourly usage of Redo Logs in your database: InnoDB Log File Usage Hourly. In this one, you can see the pattern over the last 24 hours, and at first glance it’s apparent whether you need to increase or decrease your settings.


NOTE: I’ve truncated the last 24 hours’ graph to avoid the image being too small here.

See the spike around 20:00 in the graph above, which matches the hour when mysqlslap was running “aggressively.” And if you’ve been following the math, it matches the other two checks:

  • Rule of thumb formula: 63.8 MB per minute = 3.74 GB per hour.
  • Redo Generation Rate: 1.1 MB per second = 66 MB per minute = 3.87 GB per hour.
  • InnoDB Log File Usage Hourly: 4 GB per hour.

Conclusion

Even though the variables changed recently, you still need to monitor and configure your Redo Logs, especially for write-heavy environments. To do that, you can still use the formula, or you can take advantage of the PMM graphs.

The formula can work as a first step but may lead to overconfiguring (or underconfiguring) your Redo Logs.

As with everything in MySQL, the more information you have about your environment before performing changes, the better results you’ll have. That’s why I suggest taking this a step further and checking the details with PMM. If you’re not yet using PMM, it’s open source, and you can install it and start using it quickly. Check out our PMM Quickstart and talk to us in the forums if you have any questions or problems.

Percona Distribution for MySQL is the most complete, stable, scalable, and secure open-source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!

Try Percona Distribution for MySQL today!

Nov
14
2022
--

Enable innodb_print_all_deadlocks Parameter To Get All Deadlock Information in mysqld Error Log

At Percona Managed Services, clients’ applications sometimes face deadlock situations and need all historic deadlock information for application tuning.

We could get the LATEST DETECTED DEADLOCK from SHOW ENGINE INNODB STATUS\G:

….
------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78507 lock_mode X waiting
*** (2) TRANSACTION:
TRANSACTION 78508, ACTIVE 155 sec starting index read
mysql tables in use 1, locked 1
….

But how could we view all past deadlock information?

We could enable innodb_print_all_deadlocks, and all deadlocks in InnoDB user transactions will be recorded in the MySQL error log.

Let’s start the test.

Create the test database and insert some test data

mysql> create database deadlock_test;
Query OK, 1 row affected (0.01 sec)

mysql> use deadlock_test;
Database changed
mysql> CREATE TABLE t (i INT, c1 varchar(20)) ENGINE = InnoDB;
mysql> INSERT INTO t (i,c1) VALUES(1,'test1');
mysql> INSERT INTO t (i,c1) VALUES(2,'test2');

mysql> select * from t;
+------+-------+
| i    | c1    |
+------+-------+
|    1 | test1 |
|    2 | test2 |
+------+-------+

 

Deadlock simulation on record 1 of table t (i=1)

-- session 1 :
mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM t WHERE i = 1 LOCK IN SHARE MODE;

-- session 2 :
mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
mysql> DELETE FROM t WHERE i = 1;   # This statement will wait here

We go back to session 1 and run:

-- session 1 :

mysql> DELETE FROM t WHERE i = 1;
Query OK, 1 row affected (0.00 sec)

Then on session 2, we would get:
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
mysql>


And SHOW ENGINE INNODB STATUS would show:
------------------------
LATEST DETECTED DEADLOCK
------------------------

2022-10-26 09:45:37 0x7fb09c1ed700
*** (1) TRANSACTION:
TRANSACTION 78507, ACTIVE 84 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 23, OS thread handle 140396446037760, query id 1398 localhost root updating

DELETE FROM t WHERE i = 1

*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78507 lock_mode X waiting
*** (2) TRANSACTION:
TRANSACTION 78508, ACTIVE 155 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 4 row lock(s)
MySQL thread id 22, OS thread handle 140396510238464, query id 1399 localhost root updating
DELETE FROM t WHERE i = 1

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78508 lock mode S
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78508 lock_mode X waiting
*** WE ROLL BACK TRANSACTION (1)

This is expected.

Enable the innodb_print_all_deadlocks dynamic parameter

mysql> select @@innodb_print_all_deadlocks;
+------------------------------+
| @@innodb_print_all_deadlocks |
+------------------------------+
|                            0 |
+------------------------------+
1 row in set (0.00 sec)

mysql> set global innodb_print_all_deadlocks=1;

 

If we run the deadlock simulation on record 1 of table t (i=1), we’ll get the deadlock error both in SHOW ENGINE INNODB STATUS\G and in the MySQL error log.

The deadlock information in the MySQL error log is similar to the SHOW ENGINE INNODB STATUS\G output, like below:

tail -f /var/log/mysqld.log

2022-10-26T13:50:36.678728Z 22 [Note] InnoDB: Transactions deadlock detected, dumping detailed information.
2022-10-26T13:50:36.678760Z 22 [Note] InnoDB: 
*** (1) TRANSACTION:

TRANSACTION 78510, ACTIVE 18 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 23, OS thread handle 140396446037760, query id 1409 localhost root updating
DELETE FROM t WHERE i = 1
2022-10-26T13:50:36.678785Z 22 [Note] InnoDB: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78510 lock_mode X waiting
2022-10-26T13:50:36.678958Z 22 [Note] InnoDB: *** (2) TRANSACTION:

TRANSACTION 78511, ACTIVE 33 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 22, OS thread handle 140396510238464, query id 1410 localhost root updating
DELETE FROM t WHERE i = 1

2022-10-26T13:50:36.679113Z 22 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78511 lock mode S
2022-10-26T13:50:36.679159Z 22 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78511 lock_mode X waiting

2022-10-26T13:50:36.679195Z 22 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)

 

Let’s simulate another deadlock, this time on record 2 (i=2).

Deadlock simulation on record 2 of table t (i=2):

-- session 1 :

mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT * FROM t WHERE i = 2 LOCK IN SHARE MODE;

--session 2 :
mysql> use deadlock_test;
mysql> START TRANSACTION;
Query OK, 0 rows affected (0.00 sec)
mysql> DELETE FROM t WHERE i = 2;   # This statement will wait here


Let’s go back to session 1 

-- session 1 :
mysql> DELETE FROM t WHERE i = 2;
Query OK, 1 row affected (0.00 sec)

-- on session 2, you will find
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
mysql>

SHOW ENGINE INNODB STATUS would just keep the last detected deadlock information.

------------------------
LATEST DETECTED DEADLOCK
------------------------

2022-10-26 09:53:02 0x7fb09c1ed700
*** (1) TRANSACTION:
TRANSACTION 78512, ACTIVE 14 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 23, OS thread handle 140396446037760, query id 1414 localhost root updating
DELETE FROM t WHERE i = 2
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78512 lock_mode X waiting
*** (2) TRANSACTION:
TRANSACTION 78513, ACTIVE 26 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 22, OS thread handle 140396510238464, query id 1415 localhost root updating
DELETE FROM t WHERE i = 2

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78513 lock mode S
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78513 lock_mode X waiting
*** WE ROLL BACK TRANSACTION (1)

------------

 

We could get all historic deadlock information from the MySQL error log.


=======================================================
 1st deadlock error after we enabled innodb_print_all_deadlocks
=======================================================


2022-10-25T22:59:30.834488Z 6 [Note] Aborted connection 6 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error reading communication packets)
2022-10-26T13:50:36.678728Z 22 [Note] InnoDB: Transactions deadlock detected, dumping detailed information.
2022-10-26T13:50:36.678760Z 22 [Note] InnoDB: 

*** (1) TRANSACTION:

TRANSACTION 78510, ACTIVE 18 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 23, OS thread handle 140396446037760, query id 1409 localhost root updating
DELETE FROM t WHERE i = 1
2022-10-26T13:50:36.678785Z 22 [Note] InnoDB: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78510 lock_mode X waiting
2022-10-26T13:50:36.678958Z 22 [Note] InnoDB: *** (2) TRANSACTION:

TRANSACTION 78511, ACTIVE 33 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 22, OS thread handle 140396510238464, query id 1410 localhost root updating
DELETE FROM t WHERE i = 1
2022-10-26T13:50:36.679113Z 22 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):


RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78511 lock mode S
2022-10-26T13:50:36.679159Z 22 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78511 lock_mode X waiting
2022-10-26T13:50:36.679195Z 22 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)


=======================================================
 2nd  deadlock error after we enabled innodb_print_all_deadlocks
=======================================================


2022-10-26T13:53:02.858247Z 22 [Note] InnoDB: Transactions deadlock detected, dumping detailed information.
2022-10-26T13:53:02.858266Z 22 [Note] InnoDB: 
*** (1) TRANSACTION:

TRANSACTION 78512, ACTIVE 14 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 23, OS thread handle 140396446037760, query id 1414 localhost root updating
DELETE FROM t WHERE i = 2
2022-10-26T13:53:02.858296Z 22 [Note] InnoDB: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:

RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78512 lock_mode X waiting
2022-10-26T13:53:02.858316Z 22 [Note] InnoDB: *** (2) TRANSACTION:

TRANSACTION 78513, ACTIVE 26 sec starting index read
mysql tables in use 1, locked 1
4 lock struct(s), heap size 1136, 3 row lock(s)
MySQL thread id 22, OS thread handle 140396510238464, query id 1415 localhost root updating
DELETE FROM t WHERE i = 2
2022-10-26T13:53:02.858338Z 22 [Note] InnoDB: *** (2) HOLDS THE LOCK(S):


RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78513 lock mode S
2022-10-26T13:53:02.858351Z 22 [Note] InnoDB: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 163 page no 3 n bits 72 index GEN_CLUST_INDEX of table `deadlock_test`.`t` trx id 78513 lock_mode X waiting
2022-10-26T13:53:02.858366Z 22 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)

 

Note: On MySQL 8 (especially > 8.0.4), it’s better to also set log_error_verbosity = 3 (the default is log_error_verbosity = 2); otherwise, you will get partial deadlock output in the error log, which is not really usable.

Please see the details in this bug report, https://bugs.mysql.com/bug.php?id=108891, which Agustín G filed.
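Once the dumps are in the error log, it’s easy to tally them over time. A minimal sketch that pulls the timestamp out of every deadlock dump in the log format shown above (this is an illustrative script, not an official tool; adjust the marker to your log format):

```python
import re

# Each deadlock dump begins with this marker line in the error log.
MARKER = re.compile(
    r"^(?P<ts>\S+) \d+ \[Note\] InnoDB: Transactions deadlock detected")


def deadlock_timestamps(error_log_text):
    """Return the timestamp of every deadlock dump found in the log text."""
    return [m.group("ts")
            for line in error_log_text.splitlines()
            if (m := MARKER.match(line))]


# Demo on lines taken from the error log excerpts above.
sample = """\
2022-10-26T13:50:36.678728Z 22 [Note] InnoDB: Transactions deadlock detected, dumping detailed information.
2022-10-26T13:50:36.679195Z 22 [Note] InnoDB: *** WE ROLL BACK TRANSACTION (1)
2022-10-26T13:53:02.858247Z 22 [Note] InnoDB: Transactions deadlock detected, dumping detailed information.
"""
print(deadlock_timestamps(sample))
# -> ['2022-10-26T13:50:36.678728Z', '2022-10-26T13:53:02.858247Z']
```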

Conclusion

The innodb_print_all_deadlocks dynamic parameter has been available since MySQL 5.6, and we can enable it to get all the deadlock information in the MySQL error log file.

Hope this is helpful for your application tuning.

Sep
07
2022
--

Speed Up Your Large Table Drops in MySQL


A large table is a pain for many reasons as long as it is in a system. And as if that’s not enough, it is also a difficult task to get rid of it. In this post, we will understand why this operation is such a pain and what we can do about it. It will be like asking the table “Tell me what happened and I will ease up the eviction”.

So what happened? When a table is dropped (or truncated), InnoDB has to scan the pages throughout the buffer pool and remove all those belonging to that table. For a large buffer pool, this crawl through the buffer pool pages and the eviction process will be slower. When we say “scan the buffer pool,” it mainly means looking through the “LRU,” “FLUSH” (dirty pages), and “AHI” entries.

LRU: Buffer pool pages are stored in a linked list of pages in order of usage. As data reaches the end of the list, it is evicted to make space for new data. When room is needed to add a new page to the buffer pool, the least recently used page is evicted and a new page is added to the middle of the list.

FLUSH: The flush list holds the dirty pages, i.e., pages modified in the buffer pool but not yet written to the data files on disk.

AHI: This is a hash index containing index pages that are frequently accessed. InnoDB has a mechanism that monitors index searches. If InnoDB notices that queries could benefit from building a hash index, it does so automatically.

Disclaimer:

The issue we’re going to discuss has been resolved in MySQL 8.0.23 (onwards) though this is still relevant to many MySQL deployments on lower versions. From MySQL 8.0.23, this process of drop table has been improved and MySQL no longer waits for the eviction. InnoDB implemented a lazy eviction process for the respective pages from the buffer-pool.

Check a sample SHOW ENGINE INNODB STATUS output below:

BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 3576902451200
Dictionary memory allocated 5414868
Internal hash tables (constant factor + variable factor)
    Adaptive hash index 419777277888    (54392761408 + 365384516480)
    Page hash           53118808 (buffer pool 0 only)
    Dictionary cache    13603605220     (13598190352 + 5414868)
    File system         1374872         (812272 + 562600)
    Lock system         9470361768      (9461541368 + 8820400)
    Recovery system     0       (0 + 0)
Buffer pool size   209689600
Buffer pool size, bytes 3435554406400
Free buffers       16263
Database pages     187372028
Old database pages 69165341
Modified db pages  2323769
Pending reads      0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 1831432982, not young 28737547535
0.00 youngs/s, 0.00 non-youngs/s
Pages read 1655886985, created 633099435, written 15361171213
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
Buffer pool hit rate 1000 / 1000, young-making rate 0 / 1000 not 1 / 1000
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 187372028, unzip_LRU len: 0
I/O sum[7845824]:cur[13760], unzip sum[0]:cur[0]

Note the “Database Pages” and “LRU len”, representing the number of pages that need to be scanned.

Next, the ball is passed to the underlying OS to delete the file. Depending on how large the file is and how busy the system is, the IO operation is going to take time.

Broadly speaking, the cost of the drop table operation comes down to the following two tasks:

  1. Traverse through the buffer pool and evict the pages found
  2. Delete the file from the OS disk

Task one is MySQL-specific, while task two is OS (disk) dependent. We will look for improvements in these two aspects of the process.

1. Traverse through the buffer pool and evict the pages found

Let’s see if we can find out ways to improve performance for this task.

Idea 1: The buffer pool is large and so is the linked list; can we temporarily reduce the buffer pool and make the linked list smaller?

The steps are:

Save the current buffer pool state > reduce the buffer pool size > drop table (scanning a small list) > reset the buffer pool size > restore the buffer pool contents

SET GLOBAL innodb_buffer_pool_dump_now=ON;
SHOW STATUS LIKE 'Innodb_buffer_pool_dump_status';
SET GLOBAL innodb_buffer_pool_size=128M;
DROP TABLE LARGE_TABLE;
SET GLOBAL innodb_buffer_pool_size=1T;
SET GLOBAL innodb_buffer_pool_load_now=ON;
SHOW STATUS LIKE 'Innodb_buffer_pool_load_status';

Sounds really easy, but it doesn’t work; actually, it makes the situation worse, for the following reasons.

  • Buffer pool resize is a blocking operation, it will still have to scan the list, defragment and resize the buffer pool.
  • Running traffic on a reduced buffer pool will be a performance penalty.
  • The same blocking operation will repeat upon buffer-pool extension.

Refer: https://dev.mysql.com/doc/refman/5.7/en/innodb-buffer-pool-resize.html

Idea 2: Stop using the table (no selects, no writes on the table that need to be removed)

  • It is possible to avoid table operations by revoking grants from the user on the table or controlling read/writes on the application level.
  • This will cause the buffer pool to slowly evict the pages out depending on how busy MySQL is.
  • The eviction process will be eased out, as it traverses through the linked list, it won’t find any relevant pages for the table.
  • That said, you will still be scanning the large linked list.

In summary, at the MySQL level you cannot avoid traversing the buffer pool’s linked list, but you can save some time on the eviction process by waiting and letting the buffer pool evict the pages eventually.

2. Delete the file from OS disk

Now, this is the task that depends on how busy and fast your disks are. The slower the file deletion operation, the longer it will take for MySQL to report the delete as successful. So what can we do to improve this deletion operation? Let’s see.

Idea 1: The smaller the file is on disk, the faster it will be to remove.

We can make the data smaller on disk by purging the data from the table. That’s something we can do using pt-archiver. That said, the table will become fragmented and the size is still going to remain the same unless an OPTIMIZE TABLE or a dummy alter (ALTER TABLE … ENGINE=InnoDB) is performed.

Use pt-archiver to slowly DELETE the table records > bring down the table size > Give LRU algorithm some time to evict those pages from buffer pool > DROP

This approach involves multiple activities and needs more time, as the archive process followed by a cool-off period could take long depending on the table size. Note that for larger tables, pt-archiver is going to take a long time, and we will still have a fragmented table with a large size on disk until we run a dummy alter.

Idea 2: Don’t delete the underlying tablespace file (ibd)

What if we totally skip the file delete operation? MySQL will return from the command as soon as the buffer pool scan is done! But can we?

Yes, partially yes. We can use a hard link to “fool MySQL” into believing that the data file for the table was removed even though it is still there!

What’s a hard link?

(You can skip this section if you’re already aware of this concept)

A hard link is a directory entry that associates a name with a file. Every filename that you see is a hard link. Let’s explore a bit:

MacBook-Pro:~ kedarvaijanapurkar$ touch percona.rocks
MacBook-Pro:~ kedarvaijanapurkar$ echo "Test" > percona.rocks
MacBook-Pro:~ kedarvaijanapurkar$ ls -li percona.rocks
20594511 -rw-r--r--  1 kedarvaijanapurkar  staff  5 Aug 26 16:22 percona.rocks

Here “percona.rocks” is a hard link. Note that the first value “20594511” is a file inode represented by “percona.rocks” filename.  Now let’s create another hard link.

MacBook-Pro:~ kedarvaijanapurkar$ ln percona.rocks new_hardlink
MacBook-Pro:~ kedarvaijanapurkar$ ls -li percona.rocks new_hardlink
20594511 -rw-r--r--  2 kedarvaijanapurkar  staff  5 Aug 26 16:22 percona.rocks
20594511 -rw-r--r--  2 kedarvaijanapurkar  staff  5 Aug 26 16:22 new_hardlink
MacBook-Pro:~ kedarvaijanapurkar$

Note that we can see two different files but the inode is still the same “20594511”, pointing to the same physical location/file.

Without going into more OS details, understand that file names are hard links to the actual data on the disk, and thus each file must have at least one hard link. A file will not be deleted from the disk as long as it has at least one hard link associated with it.

Here we have two hard links, percona.rocks and new_hardlink, with the same content (as they point to the same physical data).

MacBook-Pro:~ kedarvaijanapurkar$ cat percona.rocks
Test
MacBook-Pro:~ kedarvaijanapurkar$ cat new_hardlink
Test
MacBook-Pro:~ kedarvaijanapurkar$

Using this knowledge, if we create a hard link to the data file (ibd), the DROP TABLE will not actually delete the file from the OS but only remove the hard link, which is very light on the disk IO.
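The same behavior can be verified from a short script; a sketch using Python’s os.link, operating on a scratch directory rather than real data files:

```python
import os
import tempfile

# Work in a scratch directory so nothing real is touched.
tmpdir = tempfile.mkdtemp()
original = os.path.join(tmpdir, "percona.rocks")
hardlink = os.path.join(tmpdir, "new_hardlink")

with open(original, "w") as f:
    f.write("Test")

os.link(original, hardlink)  # equivalent to: ln percona.rocks new_hardlink

# Both names point at the same inode, and the link count is now 2.
assert os.stat(original).st_ino == os.stat(hardlink).st_ino
assert os.stat(original).st_nlink == 2

# Removing one name (what DROP TABLE effectively does to the .ibd file)
# does not delete the data; the other name still reads it.
os.remove(original)
print(open(hardlink).read())  # -> Test
```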

We can use the following steps to speed up the drop table.

Steps for dropping a large table in MySQL

Let’s say we have to drop large_table.

  • Create a hard link on the server (elevated privileges required)
cd <datadir>/<database>
ln large_table.ibd large_table.ibd.hardlink

  • Issue drop table 
set SQL_LOG_BIN=0;
drop table if exists large_table;

Because of the hard link we just created, dropping the table will only remove the large_table.ibd link. The original data still exists, but MySQL is unaware of this leftover.

  • Truncate the file slowly until it is small enough to remove
cd <datadir>/<database>
truncate -s -1GB large_table.ibd.hardlink

# You can increase the truncate step size from 1GB to 10GB or 100GB, depending on how fast it goes, to bring down the file size.

  • Execute the rm command
rm large_table.ibd.hardlink
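The truncate-then-remove loop above can be sketched end to end (scaled down to a 1 MB scratch file standing in for large_table.ibd.hardlink; in production you would also pause between steps to spread out the IO):

```python
import os
import tempfile

CHUNK = 256 * 1024  # shrink step; in production this would be 1-100 GB


def shrink_slowly(path, chunk=CHUNK):
    """Shrink a file to zero in fixed-size steps, like `truncate -s -1GB` in a loop."""
    size = os.path.getsize(path)
    while size > 0:
        size = max(0, size - chunk)
        os.truncate(path, size)  # frees blocks gradually instead of all at once
    os.remove(path)              # the final rm is now cheap


# Demo on a small scratch file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"\0" * (1024 * 1024))  # 1 MB stand-in for the huge .ibd

shrink_slowly(path)
print(os.path.exists(path))  # -> False
```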

It is better to ensure that the table is not queried, in order to avoid the “eviction” work; MySQL then just traverses the linked list.

Considering you have a replicated environment, I’d recommend performing this on a (passive) replica server first, followed by production, to have an idea beforehand. You may also perform this task with a failover: perform on the replica > failover > perform on the old primary.

Risk involved: This approach to speeding up a drop table in older MySQL versions involves handling database files at the OS level, and a human error (as always) could be catastrophic.

Let me know how you go about dropping those massive tables and if this approach interests you.
