Jun 15, 2021

MySQL Static and Dynamic Privileges (Part 2)

When organizing things helps to simplify life.

In the previous article, we started to explore dynamic privileges and their interaction with static ones. We also saw how to remove the SUPER privilege from a DBA account.

What we did was work by subtraction. But in real life, we should act differently: we should ADD only what is really needed for the account to work correctly.

Adding privileges one by one, for each user, is problematic given the level of interaction they may have, and it is also prone to mistakes.

Instead, we can use ROLES to group, assign, and revoke the correct privileges in a much easier way.

This is becoming even more important in MySQL with the advent of dynamic privileges.

What should we do to correctly use ROLES? Well, first of all: design.

The first step is to identify the ROLES. In doing so, we need to keep to a basic principle: make it simple. As such, let us try to avoid having too many ROLES, or ROLES with too many cross-functional privileges.

My proposal: 

  • DBA (The lord of the databases who can do all)
  • MaintenanceAdmin (DBA minions; they can perform only some actions on the server, and on the server only)
  • UserAdmin (Can create users, assign grants, and so on)
  • MonitorUser (Can see all processes and read from performance_schema)
  • DBManager (Can add/drop/modify schemas/tables/triggers/views/routines, etc.)
  • DBDesigner (Can modify specific objects, mostly with clear identification by schema/table)
  • ReplicationAdmin (Can add/change/remove and start/stop replication, including Group Replication)
  • BackupAdmin (Can take backups, cannot restore)

We have eight administrative ROLES and they should cover ALL we need for administrative tasks.

Now let us create them:

CREATE ROLE 'DBA', 'MaintenanceAdmin', 'UserAdmin', 'MonitorUser', 'DBManager', 'DBDesigner', 'ReplicationAdmin', 'BackupAdmin';

DC2-1(root@localhost) [mysql]>Select user,host from mysql.user where account_locked ='Y' and password_expired='Y' order by 1;
+------------------+------+
| user             | host |
+------------------+------+
| BackupAdmin      | %    |
| DBA              | %    |
| DBDesigner       | %    |
| DBManager        | %    |
| MaintenanceAdmin | %    |
| MonitorUser      | %    |
| ReplicationAdmin | %    |
| UserAdmin        | %    |
+------------------+------+
8 rows in set (0.00 sec)

Let us check the roles one by one and see what privileges we need to assign.

Our test user does not have any grant:

DC2-2(secure_test@localhost) [(none)]>show grants for current_user()\G
*************************** 1. row ***************************
Grants for secure_test@localhost: GRANT USAGE ON *.* TO `secure_test`@`localhost`

DBA

Well, you may say: easy, GRANT ALL.

Wrong! As already indicated in the previous article, doing that would also assign SUPER, which is deprecated as of MySQL 8.0. Let us start on the right foot and add ONLY what we need:

GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, SHUTDOWN, PROCESS, FILE, REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER, CREATE TABLESPACE, CREATE ROLE, DROP ROLE ON *.* TO `DBA`@`%` WITH GRANT OPTION;
   
GRANT APPLICATION_PASSWORD_ADMIN,AUDIT_ADMIN,BACKUP_ADMIN,BINLOG_ADMIN,BINLOG_ENCRYPTION_ADMIN,CLONE_ADMIN,CONNECTION_ADMIN,ENCRYPTION_KEY_ADMIN,FLUSH_OPTIMIZER_COSTS,FLUSH_STATUS,FLUSH_TABLES,FLUSH_USER_RESOURCES,GROUP_REPLICATION_ADMIN,INNODB_REDO_LOG_ARCHIVE,INNODB_REDO_LOG_ENABLE,PERSIST_RO_VARIABLES_ADMIN,REPLICATION_APPLIER,REPLICATION_SLAVE_ADMIN,RESOURCE_GROUP_ADMIN,RESOURCE_GROUP_USER,ROLE_ADMIN,SERVICE_CONNECTION_ADMIN,SESSION_VARIABLES_ADMIN,SET_USER_ID,SHOW_ROUTINE,SYSTEM_USER,SYSTEM_VARIABLES_ADMIN,TABLE_ENCRYPTION_ADMIN,XA_RECOVER_ADMIN ON *.* TO `DBA`@`%` WITH GRANT OPTION;

That should be exactly the same as GRANT ALL but without SUPER. 

To assign the ROLE to our test user:

GRANT `DBA`@`%` TO `secure_test`@`localhost`;

Now our user has:

DC2-2(secure_test@localhost) [(none)]>show grants for current_user()\G
*************************** 1. row ***************************
Grants for secure_test@localhost: GRANT USAGE ON *.* TO `secure_test`@`localhost`
*************************** 2. row ***************************
Grants for secure_test@localhost: GRANT `DBA`@`%` TO `secure_test`@`localhost`

Correct, you now see DBA as a grant, but that grant is not active:

DC2-2(secure_test@localhost) [(none)]>show grants for DBA@'%'\G
ERROR 1142 (42000): SELECT command denied to user 'secure_test'@'localhost' for table 'user'

To ACTIVATE a role you need to do it explicitly:

SET DEFAULT ROLE DBA TO secure_test@'localhost';

And have the user reconnect!
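Alternatively, a user can activate a granted role just for the current session, without touching the defaults. A quick sketch using the standard MySQL 8.0 statements:

SET ROLE 'DBA';
-- or activate everything that has been granted:
SET ROLE ALL;
-- verify what is currently active:
SELECT CURRENT_ROLE();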

Once a role is activated we can also use:

show grants for current_user()\G

To check which privileges are now active for a specific user.

We can also check which role is set as the default for which user by querying the table mysql.default_roles.
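For instance, a quick look at the current assignments (the query assumes the standard mysql.default_roles layout in MySQL 8.0):

SELECT host, user, default_role_host, default_role_user
FROM mysql.default_roles
ORDER BY user;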

To remove the active role:

SET DEFAULT ROLE NONE TO secure_test@'localhost';

Anyhow, NOW we have our DBA role available for all DBAs, and if we need to change something we can do it in just one place, on the ROLE, and not on every single user.
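For example, if we later decide that our DBAs should not have the FILE privilege, one statement on the role is enough, and it is reflected for every account holding it (a sketch based on the grants above):

REVOKE FILE ON *.* FROM 'DBA'@'%';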

MaintenanceAdmin

GRANT EVENT, LOCK TABLES, RELOAD, SELECT, SHOW DATABASES, SHUTDOWN ON *.* TO `MaintenanceAdmin`@`%`;
GRANT BINLOG_ADMIN, CONNECTION_ADMIN, ENCRYPTION_KEY_ADMIN, GROUP_REPLICATION_ADMIN, REPLICATION_SLAVE_ADMIN, SESSION_VARIABLES_ADMIN, SET_USER_ID, SYSTEM_VARIABLES_ADMIN ON *.* TO `MaintenanceAdmin`@`%`;
GRANT `MaintenanceAdmin`@`%` TO `secure_test`@`localhost` ;

UserAdmin

GRANT CREATE USER, GRANT OPTION, RELOAD, SHOW DATABASES ON *.* TO `UserAdmin`@`%`;
GRANT ROLE_ADMIN  ON *.* TO `UserAdmin`@`%`;
GRANT `UserAdmin`@`%` TO `secure_test`@`localhost` ;

MonitorUser

GRANT PROCESS, REPLICATION CLIENT ON *.* TO `MonitorUser`@`%`;
GRANT SELECT ON performance_schema.* TO `MonitorUser`@`%`;
GRANT `MonitorUser`@`%` TO `secure_test`@`localhost` ;

DBManager

GRANT ALTER, ALTER ROUTINE, CREATE, CREATE ROUTINE, CREATE TABLESPACE, CREATE TEMPORARY TABLES, CREATE VIEW, DELETE, DROP, DROP ROLE, EVENT, INDEX, INSERT, LOCK TABLES, RELOAD, SELECT, SHOW DATABASES, SHOW VIEW, TRIGGER, UPDATE  ON *.* TO `DBManager`@`%`;
GRANT SET_USER_ID, SHOW_ROUTINE ON *.* TO `DBManager`@`%`;
GRANT `DBManager`@`%` TO `secure_test`@`localhost` ;

DBDesigner

GRANT ALTER, ALTER ROUTINE, CREATE, CREATE ROUTINE, CREATE VIEW, INDEX, SELECT, SHOW DATABASES, SHOW VIEW, TRIGGER ON *.* TO `DBDesigner`@`%`;
GRANT `DBDesigner`@`%` TO `secure_test`@`localhost` ;

ReplicationAdmin

GRANT REPLICATION CLIENT ON *.* TO `ReplicationAdmin`@`%`;
GRANT REPLICATION_APPLIER, REPLICATION_SLAVE_ADMIN, GROUP_REPLICATION_ADMIN, SERVICE_CONNECTION_ADMIN ON *.* TO `ReplicationAdmin`@`%`;
GRANT SELECT on performance_schema.* TO `ReplicationAdmin`@`%`;
GRANT SELECT on mysql.* TO `ReplicationAdmin`@`%`;
GRANT `ReplicationAdmin`@`%` TO `secure_test`@`localhost` ;

BackupAdmin

GRANT EVENT, LOCK TABLES, SELECT, SHOW DATABASES ON *.* TO `BackupAdmin`@`%`;
GRANT BACKUP_ADMIN ON *.* TO `BackupAdmin`@`%`;
GRANT `BackupAdmin`@`%` TO `secure_test`@`localhost` ;
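Before activating anything, you can also preview the privileges a user would get through a given role, with the USING clause of SHOW GRANTS (standard MySQL 8.0 syntax):

SHOW GRANTS FOR 'secure_test'@'localhost' USING 'ReplicationAdmin'@'%';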

Once all our ROLES are in, we can test them. For instance, we can check our ReplicationAdmin role by looking at the binary logs and stopping/starting our Group Replication (or standard replication):

DC2-2(secure_test@localhost) [(none)]>show binary logs;
ERROR 1227 (42000): Access denied; you need (at least one of) the SUPER, REPLICATION CLIENT privilege(s) for this operation

DC2-2(secure_test@localhost) [(none)]>stop group_replication;
ERROR 1227 (42000): Access denied; you need (at least one of) the SUPER or GROUP_REPLICATION_ADMIN privilege(s) for this operation

Even though it is created and assigned, the role is not active. Let us now enable the role for the user:

SET DEFAULT ROLE ReplicationAdmin TO secure_test@'localhost';

Remember to reconnect!

DC2-2(secure_test@localhost) [(none)]>show binary logs;
+---------------+-----------+-----------+
| Log_name      | File_size | Encrypted |
+---------------+-----------+-----------+
| binlog.000011 | 113802321 | No        |
| binlog.000012 |     19278 | No        |
+---------------+-----------+-----------+
2 rows in set (0.00 sec)

DC2-2(secure_test@localhost) [(none)]>stop group_replication;
Query OK, 0 rows affected (5.25 sec)

DC2-2(secure_test@localhost) [(none)]>select * from  performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 7fddf04f-9797-11eb-a193-08002734ed50 | gr5         |        3306 | OFFLINE      |             |                |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)

DC2-2(secure_test@localhost) [(none)]>start group_replication;
Query OK, 0 rows affected (3.70 sec)

DC2-2(secure_test@localhost) [(none)]>select * from  performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 79ede65d-9797-11eb-9963-08002734ed50 | gr4         |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
| group_replication_applier | 7e214802-9797-11eb-a0cf-08002734ed50 | gr6         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | 7fddf04f-9797-11eb-a193-08002734ed50 | gr5         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.01 sec)

And these are the privileges now active:

DC2-2(secure_test@localhost) [(none)]>show grants for current_user()\G
*************************** 1. row ***************************
Grants for secure_test@localhost: GRANT REPLICATION CLIENT ON *.* TO `secure_test`@`localhost`
*************************** 2. row ***************************
Grants for secure_test@localhost: GRANT GROUP_REPLICATION_ADMIN,REPLICATION_APPLIER,REPLICATION_SLAVE_ADMIN,SERVICE_CONNECTION_ADMIN ON *.* TO `secure_test`@`localhost`
*************************** 3. row ***************************
Grants for secure_test@localhost: GRANT SELECT ON `mysql`.* TO `secure_test`@`localhost`
*************************** 4. row ***************************
Grants for secure_test@localhost: GRANT SELECT ON `performance_schema`.* TO `secure_test`@`localhost`
*************************** 5. row ***************************
Grants for secure_test@localhost: GRANT `BackupAdmin`@`%`,`DBA`@`%`,`DBDesigner`@`%`,`DBManager`@`%`,`MaintenanceAdmin`@`%`,`MonitorUser`@`%`,`ReplicationAdmin`@`%`,`UserAdmin`@`%` TO `secure_test`@`localhost`
5 rows in set (0.00 sec)

Conclusion

Using ROLES allows us to modify the needed privileges in one place and, at the same time, keeps under control the possible proliferation of dynamic privileges due to the use of components or plugins, significantly reducing the complexity of having multiple privilege sources.

Roles are commonly used in most major databases, and MySQL implemented them quite late. But using roles is the natural evolution of how we should deal with user grants when moving from small platforms to medium/large ones.

The time when we assigned privileges user by user IS GONE, welcome to 2021 MySQLers!

For your convenience, I am distributing a simple SQL file with all the commands to create the Roles as described in this article.

References

https://dev.mysql.com/doc/refman/8.0/en/roles.html

Some queries related to MySQL Roles

MySQL 8.0: Listing Roles

 

MySQL 8.0 Roles and Graphml

Jun 15, 2021

MySQL Static and Dynamic Privileges (Part 1)

When trying to make things better, make it very complicated.

I was working on a Security Threat Tool script when I had to learn more about the interaction between static and dynamic privileges in MySQL 8.

Dynamic privileges are a “new” thing added in MySQL 8 to easily extend the privileges definition and, at the same time, to provide more granularity. For instance, the FLUSH operation now has dedicated privileges, split by scope (FLUSH_TABLES, FLUSH_STATUS, and so on).

Dynamic privileges are assigned at runtime. Most of them are active when the server starts, but they can also change depending on the components or plugins that are activated. (https://dev.mysql.com/doc/mysql-security-excerpt/8.0/en/privileges-provided.html#privileges-provided-dynamic)
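Dynamic privilege assignments are stored in the mysql.global_grants table, so you can always check what has been granted, and to whom, with a simple query:

SELECT user, host, priv, with_grant_option
FROM mysql.global_grants
ORDER BY user, priv;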

Static privileges are the classical privileges available in MySQL. (https://dev.mysql.com/doc/mysql-security-excerpt/8.0/en/privileges-provided.html#privileges-provided-static)

Those are built into the server and cannot be changed.

So far, all is good. If we can give more flexibility to the security mechanism existing in MySQL, well, I am all for it.

My first step was to deal with the abuse of SUPER. 

About that – the manual comes to help with a section called Migrating Accounts from SUPER to Dynamic Privileges.

Woo perfect!

Let us play a bit. First, let me create a user:

create user secure_test@'localhost' identified by 'secret';
DC2-2(secure_test@localhost) [(none)]>show grants for current_user();
+-------------------------------------------------+
| Grants for secure_test@localhost                |
+-------------------------------------------------+
| GRANT USAGE ON *.* TO `secure_test`@`localhost` |
+-------------------------------------------------+

As you can see I can connect, but have no permissions.

On another terminal with an administrative account, let us do the classical operation to create a DBA:

GRANT ALL on *.* to secure_test@'localhost' WITH GRANT OPTION;

And now I have:

DC2-2(secure_test@localhost) [(none)]>show grants for current_user()\G
*************************** 1. row ***************************
Grants for secure_test@localhost: GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, SHUTDOWN, PROCESS, FILE, REFERENCES, INDEX, ALTER, SHOW DATABASES, SUPER, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER, CREATE TABLESPACE, CREATE ROLE, DROP ROLE ON *.* TO `secure_test`@`localhost` WITH GRANT OPTION
*************************** 2. row ***************************
Grants for secure_test@localhost: GRANT APPLICATION_PASSWORD_ADMIN,AUDIT_ADMIN,BACKUP_ADMIN,BINLOG_ADMIN,BINLOG_ENCRYPTION_ADMIN,CLONE_ADMIN,CONNECTION_ADMIN,ENCRYPTION_KEY_ADMIN,FLUSH_OPTIMIZER_COSTS,FLUSH_STATUS,FLUSH_TABLES,FLUSH_USER_RESOURCES,GROUP_REPLICATION_ADMIN,INNODB_REDO_LOG_ARCHIVE,INNODB_REDO_LOG_ENABLE,PERSIST_RO_VARIABLES_ADMIN,REPLICATION_APPLIER,REPLICATION_SLAVE_ADMIN,RESOURCE_GROUP_ADMIN,RESOURCE_GROUP_USER,ROLE_ADMIN,SERVICE_CONNECTION_ADMIN,SESSION_VARIABLES_ADMIN,SET_USER_ID,SHOW_ROUTINE,SYSTEM_USER,SYSTEM_VARIABLES_ADMIN,TABLE_ENCRYPTION_ADMIN,XA_RECOVER_ADMIN ON *.* TO `secure_test`@`localhost` WITH GRANT OPTION

As you can see, I have a bunch of privileges assigned. 

To be honest, having to identify exactly what each privilege does and how it interacts with the others is challenging.

 Anyhow, the manual tells us:

For each account identified by the preceding query, determine the operations for which it needs SUPER. Then grant the dynamic privileges corresponding to those operations, and revoke SUPER.
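The “preceding query” mentioned there is the one the manual uses to identify accounts holding SUPER:

SELECT GRANTEE FROM INFORMATION_SCHEMA.USER_PRIVILEGES
WHERE PRIVILEGE_TYPE = 'SUPER';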

In our case:

revoke SUPER on *.* from secure_test@'localhost';

This will remove the SUPER privilege, but what else will remain active? Let us try one of the easiest things: modifying the variable super_read_only.

With SUPER I can change the value of the variable without problems, but if I remove the SUPER privilege, what will happen?

DC2-2(secure_test@localhost) [(none)]>show grants for current_user()\G
*************************** 1. row ***************************
Grants for secure_test@localhost: GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, SHUTDOWN, PROCESS, FILE, REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER, CREATE TABLESPACE, CREATE ROLE, DROP ROLE ON *.* TO `secure_test`@`localhost` WITH GRANT OPTION
*************************** 2. row ***************************
Grants for secure_test@localhost: GRANT APPLICATION_PASSWORD_ADMIN,AUDIT_ADMIN,BACKUP_ADMIN,BINLOG_ADMIN,BINLOG_ENCRYPTION_ADMIN,CLONE_ADMIN,CONNECTION_ADMIN,ENCRYPTION_KEY_ADMIN,FLUSH_OPTIMIZER_COSTS,FLUSH_STATUS,FLUSH_TABLES,FLUSH_USER_RESOURCES,GROUP_REPLICATION_ADMIN,INNODB_REDO_LOG_ARCHIVE,INNODB_REDO_LOG_ENABLE,PERSIST_RO_VARIABLES_ADMIN,REPLICATION_APPLIER,REPLICATION_SLAVE_ADMIN,RESOURCE_GROUP_ADMIN,RESOURCE_GROUP_USER,ROLE_ADMIN,SERVICE_CONNECTION_ADMIN,SESSION_VARIABLES_ADMIN,SET_USER_ID,SHOW_ROUTINE,SYSTEM_USER,SYSTEM_VARIABLES_ADMIN,TABLE_ENCRYPTION_ADMIN,XA_RECOVER_ADMIN ON *.* TO `secure_test`@`localhost` WITH GRANT OPTION

As you can see SUPER is gone. 

DC2-2(secure_test@localhost) [(none)]>set global super_read_only=0;
Query OK, 0 rows affected (0.00 sec)

And I can still modify the global variable. WHY?  

The manual says that the dynamic privilege SYSTEM_VARIABLES_ADMIN “Enables system variable changes at runtime”. Well, what if I revoke it?

revoke SYSTEM_VARIABLES_ADMIN on *.* from  secure_test@'localhost';

DC2-2(secure_test@localhost) [(none)]>set global super_read_only=0;
ERROR 1227 (42000): Access denied; you need (at least one of) the SUPER or SYSTEM_VARIABLES_ADMIN privilege(s) for this operation

Great! So in order to really remove/limit SUPER, I need to also remove SYSTEM_VARIABLES_ADMIN. But is that all?

Well, to make it short: no, it is not.

Checking the manual, you can see that SUPER covers all of these:

  • BINLOG_ADMIN
  • CONNECTION_ADMIN
  • ENCRYPTION_KEY_ADMIN
  • GROUP_REPLICATION_ADMIN
  • REPLICATION_SLAVE_ADMIN
  • SESSION_VARIABLES_ADMIN
  • SET_USER_ID
  • SYSTEM_VARIABLES_ADMIN

And these are just the default ones; we can also have others, depending on the plugins we have active.

So, in theory, to be sure we are removing all SUPER-related privileges, we should:

REVOKE SUPER, BINLOG_ADMIN, CONNECTION_ADMIN, ENCRYPTION_KEY_ADMIN, GROUP_REPLICATION_ADMIN, REPLICATION_SLAVE_ADMIN, SESSION_VARIABLES_ADMIN, SET_USER_ID, SYSTEM_VARIABLES_ADMIN on *.* from secure_test@'localhost';

This should leave us with the equivalent of a user without SUPER:

DC2-2(secure_test@localhost) [(none)]>show grants for current_user()\G
*************************** 1. row ***************************
Grants for secure_test@localhost: GRANT SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, RELOAD, SHUTDOWN, PROCESS, FILE, REFERENCES, INDEX, ALTER, SHOW DATABASES, CREATE TEMPORARY TABLES, LOCK TABLES, EXECUTE, REPLICATION SLAVE, REPLICATION CLIENT, CREATE VIEW, SHOW VIEW, CREATE ROUTINE, ALTER ROUTINE, CREATE USER, EVENT, TRIGGER, CREATE TABLESPACE, CREATE ROLE, DROP ROLE ON *.* TO `secure_test`@`localhost` WITH GRANT OPTION
*************************** 2. row ***************************
Grants for secure_test@localhost: GRANT APPLICATION_PASSWORD_ADMIN,AUDIT_ADMIN,BACKUP_ADMIN,BINLOG_ENCRYPTION_ADMIN,CLONE_ADMIN,FLUSH_OPTIMIZER_COSTS,FLUSH_STATUS,FLUSH_TABLES,FLUSH_USER_RESOURCES,INNODB_REDO_LOG_ARCHIVE,INNODB_REDO_LOG_ENABLE,PERSIST_RO_VARIABLES_ADMIN,REPLICATION_APPLIER,RESOURCE_GROUP_ADMIN,RESOURCE_GROUP_USER,ROLE_ADMIN,SERVICE_CONNECTION_ADMIN,SHOW_ROUTINE,SYSTEM_USER,TABLE_ENCRYPTION_ADMIN,XA_RECOVER_ADMIN ON *.* TO `secure_test`@`localhost` WITH GRANT OPTION

Conclusion

In this first blog, we have started to explore the usage of Dynamic privileges, and what we need to do to remove the SUPER privilege. 

Nevertheless, the list above is still a bit chaotic and unsafe. We still have SHUTDOWN, RELOAD, and FILE, all of which are insecure and should be assigned with great care. In the next article in this series, we will see how to deal with dynamic and static privileges via ROLES, and we will also try to get a clearer picture of how they affect one another.

Jun 14, 2021

Migrating Into Kubernetes Running the Percona Distribution for MySQL Operator


The practice of running databases in containers continues to grow in popularity.  As a Technical Account Manager at Percona, I get asked regularly about our Percona Distribution for MySQL Operator.  Additionally, I’m asked what I’m seeing in the industry in terms of adoption.  In most cases, the questions stem around new deployments.  Our DBaaS tool (currently in Technical Preview) makes launching a new cluster in a Kubernetes deployment trivial.  

Once the operator completes and verifies the setup, the UI displays the endpoint and credentials and you are on your way.  Voila!  You now have a cluster, behind a load balancer, that you can access from within your k8s cluster or externally.


This is all well and good, giving developers the opportunity to quickly build new applications.  However, a new question recently came up in a client call: how can I migrate an existing database into Kubernetes?  This got me thinking, so I decided to spin up some test servers and experiment with the backup and restore functionality of the operator.

General Migration Process

For my testing, I wanted to follow the standard process:

  1. Take a binary snapshot of the primary server
  2. Restore that snapshot into the new cluster
  3. Start replication from the primary server to the new cluster
  4. Cutover the application after replication is caught up

Looking through the operator documentation, I wanted to leverage the existing restore functionality in my testing.  Through some manual updates to the service created by the DBaaS tool, I defined my S3 credentials and details:

kubectl edit perconaxtradbcluster.pxc.percona.com/mbenshoof-pxc-cluster

Next, I verified that I could take and restore backups from the command line (using the kubectl tool).  So far, so good.  

Finally, it was time to test the main step of a generic migration.  I had a test server set up and took a streaming backup (using xbcloud with the --md5 switch passed) directly into the S3 bucket configured in the operator.  The final step was to define a custom restore resource and kick off the restore:

# cat restore-external.yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterRestore
metadata:
  name: restore-external-v1
spec:
  pxcCluster: migrate-from-external
  backupSource:
    destination: s3://<S3-BUCKET-FOR-BACKUPS>/external-80-full_backup
    s3:
      credentialsSecret: <MY-K8S-AWS-SECRET>
      region: us-east-1

After some time, I verified that the restoration succeeded and also verified that my DBaaS cluster was operational again:

$ kubectl logs job/restore-job-restore-external-v1-migrate-from-external

Completed: 2021-06-04T16:42:58Z
State: Succeeded
Events: <none>

If everything is fine, you can clean up the job:

$ kubectl delete pxc-restore/restore-external-v1

Success!  I was able to restore my external 8.0 instance into a new DBaaS-generated cluster.  The final step in a migration process is to set up replication into the new cluster.

Setting up Replication

With a replication user defined, it was trivial to get replication running into the new cluster.  In fact, there was no difference between this test and setting up replication normally.  Once I allowed access to the source server (via VPC Security Groups), I simply started replication and verified that I was picking up new transactions:

mysql> show replica status\G
*************************** 1. row ***************************
Replica_IO_State: Waiting for master to send event
Source_Host: 10.1.1.226
Source_User: repl
Source_Port: 3306
Connect_Retry: 60
Source_Log_File: binlog.000005
Read_Source_Log_Pos: 679
Relay_Log_File: migrate-from-external-pxc-0-relay-bin.000002
...
Seconds_Behind_Source: 0
...
Retrieved_Gtid_Set: c6a4c262-c4a7-11eb-bed8-0e0d8750f515:84-85
Executed_Gtid_Set: c6a4c262-c4a7-11eb-bed8-0e0d8750f515:1-85
Auto_Position: 1
...
1 row in set (0.00 sec)
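For reference, with GTID auto-positioning in use (note Auto_Position: 1 above), the replication setup itself boils down to something like this on MySQL 8.0.23+ (the password is a placeholder; the host and user match the status output above):

CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = '10.1.1.226',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '<replication-password>',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;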

Just to ensure things were working as expected, I ran a few queries and sent some write traffic to my primary test server.  Everything worked as expected when setting up replication from a stand-alone instance into the new cluster!

Primary Challenge – User Credentials

Working through this exercise, I did identify the main limitation in this process.  When I first attempted this exercise, the restore process hung and eventually failed.  The reason: conflicting user credentials in the freshly restored cluster.

The operator manages the cluster nodes via credentials defined in a Kubernetes secret.  When new clusters are created, random and unique credentials are generated and used to set up the cluster.  While this is great for a new cluster and keeps it secure upon launch, it isn’t ideal for the migration process.

While playing around with various migration test variations, I used one of the following two workaround methods:

  1. Modifying the source database with the newly generated cluster credentials
  2. Modifying the k8s secret file with credentials from the existing database

Modifying the Source Database

For this method, the workaround I used was capturing all of the operator-generated credentials on the empty cluster via pt-show-grants:

pt-show-grants --only=clustercheck,monitor,operator,proxyadmin,root,xtrabackup

I then applied these grants to the source server (in practice, I would never do this, but this was just an exercise with dummy data).  Once the credentials were updated and confirmed, I then took the S3 streaming backup and the restore went off without a hitch.

Modifying the Secrets File

For the second workaround, there are some assumptions:

  1. You know all the users created by the operator by default
  2. Those users exist (or can be created) on the source database

In this workaround, I grabbed the password(s) from the existing source cluster as the first step.  Once I had them base64 encoded, it was just a matter of editing the secrets created with the new cluster.  I only needed to modify the dbaas-* generated secrets as that was kept in sync with all the other secret files.

The main thing in the process – ensure the secrets file is updated BEFORE kicking off the restore job.  Secrets are only reloaded when pods are terminated and restarted, so doing that cleanly is important.  The restore job does a few things:

  1. Pull the backup from S3 into a new container
  2. Prepare and validate the backup
  3. Stop the old cluster
  4. Restart the cluster with the freshly prepared backup 

Assuming you have the secrets file in place before starting the job, the new credentials will be picked up nicely when the cluster restarts and all will go well.  Otherwise, you’ll be looking at lots of manual battles and debugging.

Other Limitations

The other primary limitation with the current release is that the source needs to also be running 8.0.  As the restore procedure uses xtrabackup-80 for the process, providing a 5.7 backup will result in the following error:

This version of Percona XtraBackup can only perform backups and restores against MySQL 8.0 and Percona Server 8.0, please use Percona Xtrabackup 2.4 for this database.

To handle a 5.7 -> 8.0 migration, you would need to do a logical dump/reload of the databases in question.  As with the binary restore, the user credentials will be problematic, so it would be preferable to omit the user tables.  This process comes with the standard 5.7 -> 8.0 challenges, but it is possible.

Summary

As this process is not supported (or even recommended) in the current release, I’ve omitted some details and configuration from this post.  Despite the challenges and the process being a little rough around the edges, it was great to validate xtrabackup restoration and replication into k8s from an external source.

I’m excited to see the progress in the coming months as our DBaaS offering approaches GA.  With features like UX management of the process and MySQL user migration on the roadmap, it should be much easier in the future to migrate existing databases into Kubernetes. 

With the investments we are seeing around large k8s deployments, it would be a shame to limit it to new applications.  As the  Percona Distribution for MySQL Operator continues to evolve along with the DBaaS interface, the options should be unlimited.  As always, if you need help evaluating an existing architecture or migrating into Kubernetes, don’t hesitate to reach out to our Professional Services team! 

Jun 10, 2021

Self-Healing Feature in Percona Distribution for MySQL Operator

In the previous release of our Percona Distribution for MySQL Operator, we implemented one interesting feature, which can be seen as “self-healing”: https://jira.percona.com/browse/K8SPXC-564.

I do not think it got enough attention, so I want to write more about this.

As is well known, a 3-node cluster can survive the crash of one node (or pod, in Kubernetes terminology), and this case is handled very well by the cluster itself. However, if there is a problem with 2 nodes at the same time, the scenario becomes problematic for Percona XtraDB Cluster. Let’s see why this is a problem.

First, let’s review what happens if a single node goes offline.

In this case, the cluster can continue to work, because Node 1 and Node 2 figure out that they can still form a majority, so they establish a 2-node cluster and continue operations.

Now, let’s take a look at what happens if Node 2 also becomes unresponsive.

When this happens, Node 1 loses communication with both Node 2 and Node 3, and it has no choice but to declare itself offline in order to prevent a split-brain situation. Node 1 has no way to know whether Node 2 crashed or whether there is a problem in the network link between Node 1 and Node 2; if it is only a network issue, then Node 2 could, in theory, still be accepting user updates through a different link.

So in this case Node 1 goes offline, and the only way to resolve it is with human intervention. This is the case with our standard deployments of Percona XtraDB Cluster, and it was the case in the Kubernetes Operator up until version 1.7.0.

Let’s see how the problem exposes itself in Percona Distribution for MySQL Operator 1.6.0. Assume we have a functioning cluster:

NAMESPACE     NAME                                                     READY   STATUS                       RESTARTS   AGE
pxc           cluster1-pxc-0                                           1/1     Running                      0          8d
pxc           cluster1-pxc-1                                           1/1     Running                      0          8d
pxc           cluster1-pxc-2                                           1/1     Running                      0          8d

And now to emulate the crash of two nodes, I will execute:

kubectl delete pod cluster1-pxc-2 --force

And 10 seconds later:

kubectl delete pod cluster1-pxc-1 --force

So two pods are killed, and in the logs from cluster1-pxc-0 we can see:

2021-06-09T15:12:47.659093Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 3
2021-06-09T15:12:47.659307Z 0 [Note] [MY-000000] [Galera] Flow-control interval: [173, 173]
2021-06-09T15:12:47.659343Z 0 [Note] [MY-000000] [Galera] Received NON-PRIMARY.

And it will continue to stay in NON-PRIMARY until we manually do something to resolve it.
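You can confirm the state from the MySQL side as well, using the standard Galera status counters:

SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';
-- expected to report 'non-Primary' in this situation
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';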

Auto-Recovery in Percona Distribution for MySQL Operator 1.7.0

In version 1.7.0, we decided to improve how the Operator handles this crash, making it less manual and more automatic. After all, let’s take a look at the Operator goal in general (a quote from https://kubernetes.io/docs/concepts/extend-kubernetes/operator/#motivation):

“The Operator pattern aims to capture the key aim of a human operator who is managing a service or set of services. Human operators who look after specific applications and services have deep knowledge of how the system ought to behave, how to deploy it, and how to react if there are problems.

People who run workloads on Kubernetes often like to use automation to take care of repeatable tasks. The Operator pattern captures how you can write code to automate a task beyond what Kubernetes itself provides“

So we decided to add more automation and make the full cluster crash less painful.

How did we achieve this? On one hand, it may look like a brute force approach, but on the other, we added sophistication to make sure there is no data loss.

The brute force approach: we forcefully reboot all cluster pods. The sophisticated part is that, after the restart, the Operator finds the pod with the most advanced GTID position (so it has all the most recent data); this pod will start first, working as the bootstrap node, and the other pods will re-join the cluster, using IST or SST as necessary.

Let’s see what happens in the new Operator (I use the 1.8.0 version).

The setup and the steps to get into a non-operational state are the same as before.

But let’s take a look at the logs in the cluster1-pxc-0 pod:

#####################################################FULL_PXC_CLUSTER_CRASH:cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local#####################################################
You have the situation of a full PXC cluster crash. In order to restore your PXC cluster, please check the log
from all pods/nodes to find the node with the most recent data (the one with the highest sequence number (seqno).
It is cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local node with sequence number (seqno): 17
Cluster will recover automatically from the crash now.
If you have set spec.pxc.autoRecovery to false, run the following command to recover manually from this node:
kubectl exec cluster1-pxc-0 -c pxc -- sh -c 'kill -s USR1 1'
#####################################################LAST_LINE:cluster1-pxc-0.cluster1-pxc.pxc.svc.cluster.local:17:#####################################################
++ echo 'Recovery is in progress, please wait....'
++ sed -i 's/wsrep_cluster_address=.*/wsrep_cluster_address=gcomm:\/\//g' /etc/mysql/node.cnf
Recovery is in progress, please wait....
++ rm -f /tmp/recovery-case
++ '[' -s /var/lib/mysql//grastate.dat ']'
++ sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql//grastate.dat
++ echo 'Recovery was finished.'
++ exec mysqld --wsrep_start_position=1db775e9-c938-11eb-82b4-2ad310807778:17
Recovery was finished.

Basically, you see what I described: the Operator handled the full cluster crash and performed a recovery from the pod with the most recent data. And we can see:

NAMESPACE     NAME                                                     READY   STATUS                       RESTARTS   AGE
pxc           cluster1-pxc-0                                  3/3     Running   1          25m
pxc           cluster1-pxc-1                                  3/3     Running   1          22m
pxc           cluster1-pxc-2                                  3/3     Running   1          22m

The cluster recovered from a non-operational state. Happy Clustering with our Percona Distribution for MySQL Operator!

Jun 08, 2021

Restore Strategies with MyDumper


In my previous post, Back From a Long Sleep, MyDumper Lives!, I told you that Fast Index Creation was in the queue and that (after fixing several bugs) it would be packaged in release 0.10.7 next month. But why am I so excited about it? Well, this feature opens new opportunities, for what I’m going to call Restore Strategies.

In the Past…

In the past, the only option was to first restore the table definition and then insert the data rows, in two simple steps. On tables with millions of rows, we already know why this takes more time: every row is inserted into the clustered index and into every secondary index, instead of building the secondary indexes after the data has been inserted, as myloader is now able to do.

Nowadays, we have the option to do it in three steps (sketched in plain SQL below):

  • Create the table with the primary/unique key.
  • Execute the insert statements in parallel.
  • Create the indexes and constraints.
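A minimal illustration of those three steps, with made-up table and index names:

-- Step 1: create the table with only the primary key
CREATE TABLE t1 (
  id BIGINT UNSIGNED NOT NULL,
  a VARCHAR(32),
  b INT,
  PRIMARY KEY (id)
) ENGINE=InnoDB;

-- Step 2: load the rows (myloader runs many of these in parallel)
INSERT INTO t1 VALUES (1, 'x', 10), (2, 'y', 20) /* , ... */;

-- Step 3: build the secondary indexes after the data is in,
-- taking advantage of InnoDB Fast Index Creation
ALTER TABLE t1 ADD INDEX idx_a (a), ADD INDEX idx_b (b);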

But we don’t have just one table; we have multiple tables with different numbers of indexes, different numbers of columns per index, different numbers of columns, different numbers of rows... and the restoration time will depend on the table order selected. So, the question is: what is the best table order?

Currently…

Currently, myloader has no rule or strategy that defines the order, as we read the files using g_dir_read_name and enqueue them. We know that larger tables should be imported first; if not, the whole process could take longer, as it might be possible that importing the data of the largest tables and then creating their indexes takes the same amount of time as importing the rest of the database. After working on several restore operations, I realized that the largest table is not always the one that is going to take the longest. So, what we need to determine is the table that is going to take the longest to import. During a mydumper execution, we know the number of rows each table has, and we can get that info and pass it to myloader. We can also estimate the index length based on the number of indexes and the type of each one.

Another thing to consider is that the three steps are applied to all the tables: we create all the tables, execute the inserts for all the tables, and finally create the indexes for all the tables. But I think that doing the inserts into one table while a parallel process creates the indexes on another would speed up the whole process.

Do We Know Enough?

From my point of view, there is a lot of information that we are missing during a table restore; we don’t know the inserts or MB per second, per table or per database. I think this would be valuable information if we want to develop the tools to use the best Restore Strategy on our database.

Conclusion

I shared some ideas and opened some discussions about how to import using myloader. I hope we can develop some of them in the future, to get a better understanding and reduce restore timings. Feel free to open an issue on the MyDumper repository or add a comment below!

Jun 04, 2021

Inspecting MySQL Servers Part 4: An Engine in Motion


The combination of the information obtained from the “pt-summaries” discussed in the previous posts of this series (Part 1: The Percona Support Way, Part 2: Knowing the Server, Part 3: What MySQL?) helps us come up with a first impression of a MySQL server. However, apart from the quick glance we get at two samples of a selective group of MySQL status variables, they provide what I call a “static” view of the server, akin to looking at a still picture of an engine. We get the chance to spot some major discrepancies in the MySQL configuration in view of the available resources on the host and adjust those following some best practices, but the real tuning starts by observing how the engine performs in motion.

pt-stalk is the final tool in our Percona Toolkit trifecta: we use it to collect data from a running server, from both the OS side as well as MySQL’s, so we can then analyze how it is operating. For better gains, you should run pt-stalk during a period of high activity (ideally, peak time). It can also be used to collect data while the server is undergoing a performance issue, for troubleshooting. 

If you have the Percona Toolkit installed on the server, you can run pt-stalk as follows:

sudo pt-stalk --no-stalk --iterations=2 --sleep=30 -- --user=root --password=<mysql-root-pass>

A few observations:

  • Options --no-stalk --iterations=2 --sleep=30 are used to collect two sets of diagnostic data with a 30-second interval between them; since --run-time is not defined, the default value of 30 seconds is employed, thus the command above provides one minute’s worth of data.
  • By default, pt-stalk will save the captured data under /var/lib/pt-stalk. You can choose a different destination directory with the option --dest.
  • If your system is configured in such a way that you don’t need to provide credentials to access MySQL, then you can remove the -- --user=root --password=<mysql-root-pass> section.
  • Many of the data collection tools used by pt-stalk are part of the sysstat package, so make sure it is installed on the server.

Let’s have a look at the data collected.

A Mix of OS and MySQL Diagnostics Data

Part of the data collected by pt-stalk is purely OS-related, which makes the tool partially useful for inspecting other kinds of Linux servers as well. The other part is purely MySQL. Here’s the list of files generated by the command above in one of my test servers – note there are two groups of files below, prefixed with timestamps set 30 seconds apart:

$ ls /var/lib/pt-stalk/
2021_05_01_12_55_30-df              2021_05_01_12_56_00-df
2021_05_01_12_55_30-disk-space      2021_05_01_12_56_00-disk-space
2021_05_01_12_55_30-diskstats       2021_05_01_12_56_00-diskstats
2021_05_01_12_55_30-hostname        2021_05_01_12_56_00-hostname
2021_05_01_12_55_30-innodbstatus1   2021_05_01_12_56_00-innodbstatus1
2021_05_01_12_55_30-innodbstatus2   2021_05_01_12_56_00-innodbstatus2
2021_05_01_12_55_30-interrupts      2021_05_01_12_56_00-interrupts
2021_05_01_12_55_30-iostat          2021_05_01_12_56_00-iostat
2021_05_01_12_55_30-iostat-overall  2021_05_01_12_56_00-iostat-overall
2021_05_01_12_55_30-log_error       2021_05_01_12_56_00-log_error
2021_05_01_12_55_30-lsof            2021_05_01_12_56_00-lsof
2021_05_01_12_55_30-meminfo         2021_05_01_12_56_00-meminfo
2021_05_01_12_55_30-mpstat          2021_05_01_12_56_00-mpstat
2021_05_01_12_55_30-mpstat-overall  2021_05_01_12_56_00-mpstat-overall
2021_05_01_12_55_30-mutex-status1   2021_05_01_12_56_00-mutex-status1
2021_05_01_12_55_30-mutex-status2   2021_05_01_12_56_00-mutex-status2
2021_05_01_12_55_30-mysqladmin      2021_05_01_12_56_00-mysqladmin
2021_05_01_12_55_30-netstat         2021_05_01_12_56_00-netstat
2021_05_01_12_55_30-netstat_s       2021_05_01_12_56_00-netstat_s
2021_05_01_12_55_30-opentables1     2021_05_01_12_56_00-opentables1
2021_05_01_12_55_30-opentables2     2021_05_01_12_56_00-opentables2
2021_05_01_12_55_30-output          2021_05_01_12_56_00-output
2021_05_01_12_55_30-pmap            2021_05_01_12_56_00-pmap
2021_05_01_12_55_30-processlist     2021_05_01_12_56_00-processlist
2021_05_01_12_55_30-procstat        2021_05_01_12_56_00-procstat
2021_05_01_12_55_30-procvmstat      2021_05_01_12_56_00-procvmstat
2021_05_01_12_55_30-ps              2021_05_01_12_56_00-ps
2021_05_01_12_55_30-slabinfo        2021_05_01_12_56_00-slabinfo
2021_05_01_12_55_30-slave-status    2021_05_01_12_56_00-slave-status
2021_05_01_12_55_30-sysctl          2021_05_01_12_56_00-sysctl
2021_05_01_12_55_30-top             2021_05_01_12_56_00-top
2021_05_01_12_55_30-trigger         2021_05_01_12_56_00-trigger
2021_05_01_12_55_30-variables       2021_05_01_12_56_00-variables
2021_05_01_12_55_30-vmstat          2021_05_01_12_56_00-vmstat
2021_05_01_12_55_30-vmstat-overall  2021_05_01_12_56_00-vmstat-overall

We won’t be examining all of these in detail here; I’ll focus on the most important ones (in general), following the order (and logic) I usually employ. As was the case in the previous posts, the excerpts of data used in the following sections do not all come from the same pt-stalk collection, not even from the same server; they are assorted from different sources to better illustrate the case at hand.

Was the Server Under Moderate or High Load When the Data was Captured?

My first stop tends to be the first top sample; there we can see reasonably clearly whether the server was under high load when pt-stalk was run, and for how long:

top - 11:23:16 up 14 days, 23:54,  4 users,  load average: 20.17, 17.82, 14.98
Tasks: 262 total,   1 running, 261 sleeping,   0 stopped,   0 zombie
%Cpu(s): 78.2 us,  6.9 sy,  0.0 ni, 14.2 id,  0.4 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 13186303+total,   658972 free, 11101868+used, 20185376 buff/cache
KiB Swap:  8388604 total,  8215548 free,   173056 used. 20041612 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3408 mysql     20   0  112.4g 104.9g 141832 S  1169 83.4   3761:10 mysqld
16379 root      20   0  113652   1340    608 S   6.2  0.0   0:00.01 bash
16530 root      20   0  162020   2180   1528 R   6.2  0.0   0:00.01 top
    1 root      20   0   52148   3448   1940 S   0.0  0.0   4:37.76 systemd

In the sample above, we can see the server was undergoing an increase of load, “averaging” close to 15 for the past 15 minutes, 18 for the past 5 minutes, and finally 20 in the last minute. Just remember that the load metric is dependent on the number of cores in a server. This other sample below may look scarier but it comes from a box with 192 cores; while this is an oversimplification of the metric, we could say the server was operating at (76.9/192=) 40% of its maximum processing capacity in the last minute:

top - 15:30:26 up 258 days, 21 min, 16 users,  load average: 76.90, 78.63, 80.69
Tasks: 1711 total,   1 running, 1710 sleeping,   0 stopped,   0 zombie
%Cpu(s): 23.2 us,  6.5 sy,  0.0 ni, 59.7 id, 10.0 wa,  0.0 hi,  0.7 si,  0.0 st
KiB Mem : 15847333+total,  8529432 free, 78651558+used, 78968825+buff/cache
KiB Swap: 23488102+total, 23441689+free,   464128 used. 79652627+avail Mem

   PID USER   PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 45930 mysql  20   0  652.5g 645.5g  10128 S  2163 42.7 160718:24 mysqld
 91413 apuser 20   0   34.8g   9.1g 164012 S 368.4  0.6   1037:45 java
 11450 root   20   0  173092   4332   1920 R  21.1  0.0   0:00.05 top

It is not a problem if the data was captured under low or moderate load; it just changes the way we should look at and analyze the rest of it.

We can also get a good idea of overall memory usage, and how much of it was allocated to MySQL. From there, I like to have a sneaky peek at meminfo as well; sometimes we find a server with more than a little of its memory converted into transparent huge pages (THP):

TS 1616111316.005848775 2021-03-18 23:48:36
MemTotal:       197910740 kB
MemFree:         3300968 kB
MemAvailable:   15070884 kB
Buffers:          137580 kB
Cached:         11954976 kB
SwapCached:            0 kB
Active:         178836988 kB
Inactive:        7812608 kB
Active(anon):   174796296 kB
Inactive(anon):   162712 kB
Active(file):    4040692 kB
Inactive(file):  7649896 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
(...)
AnonHugePages:  108390400 kB

No swap space and 55% of the total memory converted into THP doesn’t look that good.

Depicting CPU Usage and I/O

My next stop is the mpstat sample; here’s one from a 16-core server under high load:

11:23:16 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:23:17 AM  all   77.51    0.00    1.50    0.94    0.00    0.25    0.00    0.00    0.00   19.80
11:23:17 AM    0   91.92    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00    7.07
11:23:17 AM    1   61.86    0.00    2.06    3.09    0.00    0.00    0.00    0.00    0.00   32.99
11:23:17 AM    2   67.68    0.00    3.03    1.01    0.00    0.00    0.00    0.00    0.00   28.28
11:23:17 AM    3   81.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   17.00
11:23:17 AM    4  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
11:23:17 AM    5   55.56    0.00    2.02    3.03    0.00    0.00    0.00    0.00    0.00   39.39
11:23:17 AM    6   73.74    0.00    3.03    0.00    0.00    0.00    0.00    0.00    0.00   23.23
11:23:17 AM    7   62.63    0.00    1.01    0.00    0.00    0.00    0.00    0.00    0.00   36.36
11:23:17 AM    8   56.00    0.00    3.00    3.00    0.00    1.00    0.00    0.00    0.00   37.00
11:23:17 AM    9   82.00    0.00    1.00    2.00    0.00    0.00    0.00    0.00    0.00   15.00
11:23:17 AM   10   98.02    0.00    0.00    0.00    0.00    0.99    0.00    0.00    0.00    0.99
11:23:17 AM   11   67.68    0.00    1.01    1.01    0.00    0.00    0.00    0.00    0.00   30.30
11:23:17 AM   12   79.80    0.00    2.02    0.00    0.00    0.00    0.00    0.00    0.00   18.18
11:23:17 AM   13  100.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
11:23:17 AM   14   87.88    0.00    2.02    0.00    0.00    0.00    0.00    0.00    0.00   10.10
11:23:17 AM   15   76.24    0.00    0.99    0.99    0.00    0.99    0.00    0.00    0.00   20.79

MySQL should be run by a regular, non-superuser account (named mysql by default); thus, all database work should be accounted for in the usr column above (which, on a dedicated server, is all about MySQL). In the sample above we see 2 cores pegged at 100% utilization; unless this sample came from an exceptional period of high load, this is a fine example of a server we should probably not have running in powersave mode (for the CPU scaling governor, as seen in the 2nd post of this series), having all cores run at full speed instead.

In the sample above we find a very good balance of load: all cores are working, and working hard. But picture a similar sample where you only see one of several cores at 100% usr load: in a replica, this is a sign replication is operating in single-threaded mode. Back in the days of MySQL 5.5, and even 5.6, this was mostly the norm: we sometimes saw servers with one core running at 100% all the time and replication lag continuously increasing. This was a time when replicas with fewer, but faster, cores excelled. Multi-threaded replication appeared in MySQL 5.6, but it was limited to one database per (replication) thread; if your whole schema was organized around a single database, or if only one of the databases received most of the writes, it didn’t help much. Multi-threaded replication really blossomed in MySQL 5.7, with the introduction of the LOGICAL_CLOCK mode. The variables file contains all MySQL settings, and you can use it to check how replication is configured in a replica:

$ grep "slave_parallel_" *-variables
slave_parallel_type LOGICAL_CLOCK
slave_parallel_workers 8

Another important column to monitor in mpstat is the iowait column: if you see values repeatedly above 1.0, particularly under moderate load, it may be a sign that the storage subsystem is a limiting factor. Double-check in the iostat sample if the average queue size (avgqu-sz) depth is often bigger than 1:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     4.00    0.00 1187.00     0.00 58873.33    99.20     4.71    3.97    0.00    3.97   0.04   5.20
sda               0.00     0.00    0.00    0.33     0.00     1.33     8.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-0              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
dm-1              0.00     0.00    0.00 1187.33     0.00 58841.33    99.12     4.76    4.01    0.00    4.01   0.04   5.20

In the sample above, we can see that sdb is already saturated with around 1000 write IOPS. Sometimes the IO capacity of the system is overestimated and the settings are set too high, making InnoDB believe it has more IOPS at its disposal than there actually are:

$ grep "innodb_io_capacity" *-variables
innodb_io_capacity 1000
innodb_io_capacity_max 10000

It is a good practice to measure the practical limit of IOPS your disks can provide. Overestimating innodb_io_capacity may actually lead to worse performance and stalls in response times. For more information on this, see Tuning MySQL/InnoDB Flushing for a Write-Intensive Workload.
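If measurements show those values are too optimistic, both variables are dynamic and can be lowered at runtime; the numbers below are only placeholders, so measure your storage first:

SET GLOBAL innodb_io_capacity = 400;
SET GLOBAL innodb_io_capacity_max = 800;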

The last stop in my usual tour around the OS data collected by pt-stalk is the vmstat sample, which I mostly use to check if there was some swapping taking place when the data was collected:

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
28  2 172544 676900 160176 20025276    0    0   158   243    1    1 16  1 83  0  0
29  0 172544 672692 160312 20027832  260    0  1064  3588 22249 3959 95  2  3  0  0
17  0 172544 669032 160316 20028860   52    0   348  3436 20754 5070 78  2 20  1  0
(...)
13  0 172544 697792 160536 19998792    0    0   576  3600 19421 5221 73  2 24  1  0
21  0 173056 705116 160532 19991780    0  300  1744  3936 183103 4090 86  3 10  1  0
11  0 173056 696420 160540 19999820    0   88  7360  3548 48524 4451 82  2 16  0  0
21  0 173056 692664 160540 20003536    0    0  3280  3716 20936 4711 88  2  9  1  0

While swapping is detrimental to performance, as discussed in the 2nd post, I’m one to advocate in favor of having swap space configured on a server in most cases – just make sure to adjust swappiness to a minimum, as explained there. You can check that this is the case by looking at the sysctl sample:

$ grep swappiness *-sysctl
vm.swappiness = 1

MySQL in Motion

This is my favorite part: the mysqladmin files contain samples from SHOW GLOBAL STATUS, and while it is difficult to make sense of the raw data, pt-mext, another tool from the Percona Toolkit, can be used to process them and provide a clear view of the status counters’ variation over time, using the first sample in the series as the base reference. Here’s how you can use it:

pt-mext -r -- cat 2021_05_01_12_55_30-mysqladmin > ptmext.out

Let the fun begin!

Does My Hot Data Fit in Memory?

This is important information to have: is the InnoDB Buffer Pool big enough to store the data that is accessed most often? If it is, it means very little data is fetched from the tablespace files on disk (which is slow to load) and most of it is found in memory already. We get that ratio from two InnoDB status counters:

$ grep "Innodb_buffer_pool_read_requests\|Innodb_buffer_pool_reads" ptmext.out
Innodb_buffer_pool_read_requests      69496091591              4040274              3636485              3475398              3509297              3324067              2790171              3811242              3820111              4472993              4240323              3603441              4636822              4713943              4880787              4032916              4328607              3959668              4187900              3883735              3850137              4166323              3933372              3628560              4101771              2766796              3220822              4093628              3581942
Innodb_buffer_pool_reads                  9420027                   12                   17                    7                   11                   28                   21                   16                   53                   5                    4                   17                   10                    9                   14                    9                   14                    5                   30                   4                    6                    7                    4                   10                   1                  115                    8                    9                   56

  • Innodb_buffer_pool_read_requests indicates the number of (logical) read requests
  • Innodb_buffer_pool_reads indicates the number of (logical) read requests that could not be satisfied from data already present in the Buffer Pool; the data had to first be read from disk and then loaded into the Buffer Pool to be served.

The “magic” of pt-mext is shown above: the first value presented for each metric comes from the first sample collected, which provides a base reference, while the values for all consecutive samples, taken 1 second apart, are the difference between them and the previous sample. This way, we get a clear idea of how the metric is changing over time.

Here are the values for the first four samples of Innodb_buffer_pool_reads used in the excerpt above, to better illustrate this function:

Raw value    pt-mext
9420027      9420027
9420039      +12
9420056      +17
9420063      +7
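
To turn those two counters into a ratio, you can plug the base-reference values from the first sample into a quick one-liner; a minimal sketch using the numbers above:

$ awk 'BEGIN { reqs=69496091591; reads=9420027; printf "Buffer Pool hit ratio: %.4f%%\n", (1-reads/reqs)*100 }'
Buffer Pool hit ratio: 99.9864%

A ratio consistently close to 100% suggests the hot data does fit in the Buffer Pool.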

Looking at Temporary Tables

Another important aspect we discussed in the last post that is key to MySQL tuning is the use of temporary tables. If there are too many of them being created every second:

  1. Your queries may be needing some optimization;
  2. They may be taking too much of the memory available and the situation may become critical (as in running out of memory) during peaks of high activity/concurrency.

There’s another element to it: if they are growing bigger than the threshold for temporary tables to remain in-memory:

$ grep "tmp_table_size\|max_heap_table_size" *-variables
max_heap_table_size 209715200
tmp_table_size 209715200

they will be moved to the disk (becoming on-disk temporary tables) and your queries will take more time to complete, which may also create a vicious cycle during moments of high concurrency. Here’s a somewhat extreme example from a server with only 126G of RAM and a Buffer Pool of 60G:

Created_tmp_disk_tables              61327112            73           523           501           448           518           421           533           540           464           618           741           528           332           216           522           400           113             8             7           363           208           575           519           396           400           483           610           236
Created_tmp_tables                  134023235           115           546           525           474           542           452           550           557           493           641           763           548           356           250           558           419           129            23            19           406           234           595           542           420           419           511           637           271

  • Created_tmp_tables indicates the total number of temporary tables created by the server, be it in-memory or on-disk.
  • Created_tmp_disk_tables only accounts for temporary tables created on-disk.

In the excerpt above, we see a high number of temporary tables created every second, with most of them being created as, or converted to, on-disk temporary tables. If we consider the latter option, that they are created in-memory until reaching the threshold set by tmp_table_size, the amount of memory allocated for temporary tables during peak time may surpass what remains available in the server. That is what was happening with this server until they downsized the Buffer Pool to the current value of 60G as a short-term solution; despite having swappiness set to 1 and theoretically available memory in the OS cache, swap space was almost depleted due to memory pressure:

top - 15:54:29 up 126 days,  1:18,  3 users,  load average: 4.17, 4.34, 4.70
Tasks: 445 total,   4 running, 441 sleeping,   0 stopped,   0 zombie
%Cpu(s):  8.0 us,  6.3 sy,  0.0 ni, 71.8 id, 13.5 wa,  0.0 hi,  0.4 si,  0.0 st
KiB Mem : 13176490+total,  1973120 free, 10337484+used, 26416932 buff/cache
KiB Swap: 10485756 total,   562384 free,  9923372 used. 26613916 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
11830 mysql     20   0  111.4g  93.8g   8760 S 175.0 74.6  10332:00 mysqld

procs -----------memory----------     ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache       si   so    bi    bo    in    cs us sy id wa st
18  4 9923884 1988808     40 26417080    1    1    33   584     0     0  2  1 97  0  0
 5  3 9919788 1980508     40 26424644 5556    0 17444 30576 42479 53317  4  2 82 11  0
 2  3 9912876 1970328     40 26427844 7872    0 21856 35881 50203 67962  6  2 82 11  0
 2  3 9903404 1955664     40 26433916 10992   0 24208 41452 58789 76326  6  2 82 10  0
 3  4 9893932 1938188     40 26437384 11008   0 25712 39040 54518 65092  6  2 82 11  0
 5  3 9888044 1926884     40 26442912 7776    0 27152 37513 75462 82150 17  3 71  9  0
 2  4 9881644 1917948     40 26445612 7336    0 23288 32787 44001 53016  7  2 79 12  0
 5  4 9869612 1896296     40 26450052 15308   0 30748 34702 50177 58230  9  2 77 12

The long-term solution for them was to work on optimizing their queries to decrease the amount and size of temporary tables utilized by them.
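
You can quantify the problem with the same base-reference values from the excerpt above; a minimal sketch:

$ awk 'BEGIN { disk=61327112; total=134023235; printf "on-disk temporary tables: %.1f%% of the total\n", disk/total*100 }'
on-disk temporary tables: 45.8% of the total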

Contention at the Table Cache Level

When a connection runs a query that accesses a given table, the table must be opened, and the Opened_tables counter is increased when that happens. If there are ten connections running the same query concurrently, the table will be opened ten times; the reference for a given opened table is not shared by different client sessions because each may have a different state set for the target table. Note that this impacts the number of file descriptors used by MySQL.

For performance reasons, the reference for each opened table is kept in a cache. When that cache is full and someone tries to open a table that is not in it, MySQL needs to make space by either finding a table that is no longer in use by any thread and recycling its entry in the table cache or by temporarily expanding the table cache.

The general advice suggests that if the value for Opened_tables increases constantly, it’s a sign that the table cache might be too small:

Opened_tables                        17369503                  123                  179                  187                  228                  307                  296                  350                  209                  208                  192                  219                  308                  183                  316                  260                  312                  291                  123                  237                  152                  146                  208                  273                  284                  335                  110                  238

I'd rather look at the number of cache misses and overflows, which provide a more direct indication of contention at the table cache level:

Table_open_cache_hits                697915937                 5362                 7645                 6461                 4789                 4815                 5912                 6569                 5989                 7260                 4128                 5892                 7822                 4265                 7562                 5032                 6892                 6451                 2865                 5740                 5475                 3744                 5568                 6137                 4688                 5687                 1940                 3743                 
Table_open_cache_misses              17355555                  123                  179                  187                  228                  307                  296                  350                  209                  208                  192                  219                  308                  183                  316                  260                  312                  290                  123                  237                  152                  146                  208                  273                  284                  335                  110                  238                   
Table_open_cache_overflows           17344722                  123                  179                  187                  228                  307                  296                  350                  209                  208                  192                  219                  308                  183                  316                  260                  312                  290                  123                  237                  152                  146                  208                  273                  284                  335                  110                  238

As discussed in the previous post, pay attention to the relative size of each table cache instance:

$ grep table_open_cache *-variables
table_open_cache 10708
table_open_cache_instances 16

Since the table cache was partitioned to perform better in highly-concurrent environments, it is no longer one big cache with table_open_cache entries: that value is now divided by the number of table_open_cache_instances. If the number of entries per instance is low, the cache won't scale well; you may need to increase table_open_cache or decrease the number of instances.
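
To put numbers on the example above: 10708 entries split across 16 instances leaves around 669 entries each. table_open_cache is a dynamic variable, so you can experiment at runtime (the value below is purely illustrative; persist whatever works in my.cnf):

$ mysql -e "SET GLOBAL table_open_cache = 16384"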

There are a lot of other status variables we could cover here, but I hope the examples above give you the general idea of how pt-mext can be used to process mysqladmin samples. We shall keep going: there are two essential pieces of data we cannot leave uncovered.

Looking Inside the (Real) Engine

And that would be InnoDB. Yes, we get a lot of InnoDB-related metrics covered in the status variables, but there is precious information about the activity of the storage engine we can only find in the output of SHOW ENGINE INNODB STATUS ("SEIS"), which is captured in the innodbstatus files.

There is a lot of information in there, and I reckon it is common to feel a bit lost when trying to make sense of it all. The trick is to break it down into sections; you will usually focus on a subset of them, according to what you are looking for.

Transactions Running

InnoDB is a transactional storage engine and SEIS is one of the best places to look for information about transactions. Open one of the innodbstatus files (there are two collected for each of the iterations executed) and have a general look. Before I do that myself, I like to get a general idea of the number of transactions inside InnoDB; I use grep for this:

$ grep "\---TRANSACTION" innodbstatus |wc -l
2503

That’s a lot of transactions, but it doesn’t mean they are all running:

$ grep "\---TRANSACTION" innodbstatus | grep ACTIVE | wc -l
273

If you have a lot of active transactions inside InnoDB that are taking too much time to complete, each getting just a bit of CPU cycles at a time before going back to waiting and generating a lot of context switches, you may benefit from limiting the maximum number of transactions allowed inside InnoDB at the same time; the rest of them will have to wait at the door, a concept not much different from using a connection pooler. The variable that controls this is innodb_thread_concurrency: the default value of zero indicates there is no limit in place.

This is one of the best examples I have in mind when I think about “tuning the engine for the race track”: the workload dictates how MySQL should be tuned. In a high concurrency system, if there are queries that compete for the same resources blocking each other, it may be better to allow only a fraction of them to run at a time so they complete faster and leave. In cases like this, how many threads should be allowed to run concurrently is a matter of experimentation: try with values that are multiples of the number of cores in the server.
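
innodb_thread_concurrency is a dynamic variable, so this experimentation doesn't require a restart; a minimal sketch, with an illustrative value:

$ mysql -e "SET GLOBAL innodb_thread_concurrency = 32"
$ mysql -e "SET GLOBAL innodb_thread_concurrency = 0"   # revert to unlimited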

Long-Running Transactions

You will find the oldest, longest-running transactions at the bottom of the list of transactions: if you are troubleshooting a performance problem, there is a good chance that those transactions are involved, either as culprits or as victims.

Long-running transactions can be very detrimental to general server performance: they drive the InnoDB history list length high and can hold locks needed by other queries to complete. Here's an example:

---TRANSACTION 48793810, ACTIVE 293183 sec
106688 lock struct(s), heap size 10887376, 1642661 row lock(s), undo log entries 3698942
MySQL thread id 349, OS thread handle 140316247336704, query id 64765183 myserver.hostname.com 172.x.x.x myuser
Trx read view will not see trx with id >= 48793810, sees < 48793810

You can investigate those further by looking for them in the respective transactions file:

*************************** 2. row ***************************
                    trx_id: 48793810
                 trx_state: RUNNING
               trx_started: 2020-08-21 00:00:33
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 3805639
       trx_mysql_thread_id: 349
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 11
          trx_lock_structs: 106688
     trx_lock_memory_bytes: 10887376
           trx_rows_locked: 1642663
         trx_rows_modified: 3698951
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0

the processlist file:

*************************** 1. row ***************************
           Id: 349
         User: myuser
         Host: myserver.hostname.com:60260
           db: mydatabase
      Command: Sleep
         Time: 1
        State:
         Info: NULL
    Rows_sent: 0
Rows_examined: 0

and the lock-waits file:

*************************** 1. row ***************************
   who_blocks: thread 349 from myserver.hostname.com:60260
  idle_in_trx: 1
max_wait_time: 49
  num_waiters: 10

I reckon I’m stopping short of providing a complete procedure for investigating locking issues; this is another subject that warrants more attention.
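
Still, if you are looking at a live server rather than at pt-stalk samples, a starting point along the same lines is to query information_schema directly; for example, to list the five oldest transactions:

$ mysql -e "SELECT trx_id, trx_state, trx_started, trx_mysql_thread_id, trx_rows_modified FROM information_schema.INNODB_TRX ORDER BY trx_started LIMIT 5"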

Checkpointing and Redo Log Space

We briefly discussed the importance of tuning the redo log space (innodb_log_file_size x innodb_log_files_in_group) in the previous post, and how it should be big enough to optimize the checkpointing mechanism. There is an old formula that can still be used for estimation, and the data for it can be found in SEIS. If we take the first innodbstatus file from each of two iterations, captured 30 seconds apart:

$ grep "Log sequence number" 2020_08_24_15_*-innodbstatus1
2020_08_24_15_30_25-innodbstatus1:Log sequence number 66452632416280
2020_08_24_15_30_55-innodbstatus1:Log sequence number 66453484080767

We can obtain the number of bytes written in this 30-second window and make a rough projection for one hour:

(66453484080767 - 66452632416280) * 2 * 60 ~= 95G/h

Such a projection is very dependent on the moment the SEIS samples were captured; you may want to track the Log sequence number over an extended period of time, but do note it is important that the redo log be tuned for the most demanding write periods, which usually take place during peak time. You certainly don't need to have a full hour covered; 15-30 minutes should be enough for checkpointing to operate in an optimized way:

20min ~= ( innodb_log_file_size = 16G ) x ( innodb_log_files_in_group = 2 )
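
Scripting the arithmetic above makes it easy to repeat with your own LSN samples; a minimal sketch using the two values collected 30 seconds apart:

$ awk 'BEGIN { lsn1=66452632416280; lsn2=66453484080767; printf "redo write rate: %.1f GB/hour\n", (lsn2-lsn1)*120/2^30 }'
redo write rate: 95.2 GB/hour

At that rate, covering 20 minutes of writes requires roughly 32G of redo log space, matching the 2 x 16G suggestion above.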

Adaptive Hash Index

InnoDB can make use of a special feature to index values that are accessed frequently, called the Adaptive Hash Index, or AHI for short. As the name suggests, these are hash indexes built in memory, on top of the B-Tree table indexes we usually talk about, and they allow for very fast hashed lookups. But the feature won't be helpful in workloads where most of the searches are non-hash searches:

$ grep "hash searches" innodbstatus
19000.00 hash searches/s, 1955000.00 non-hash searches/s

In the excerpt above, we find that only 1% of the searches are “hashable”: in cases like this, disabling the AHI feature (innodb_adaptive_hash_index) can improve performance. AHI will have a bigger impact (be it positive or negative) on read-only workloads so it may make sense to have it enabled on your replicas but not on the master – or vice-versa. This highlights another interesting point when inspecting database servers: it is very likely you will find different variations of your workload if you split traffic into writes (master) and reads (replicas); you should analyze (and configure) your servers independently.
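
Disabling it doesn't require a restart, as the variable is dynamic; a minimal sketch (persist the setting in my.cnf if it proves beneficial):

$ mysql -e "SET GLOBAL innodb_adaptive_hash_index = OFF"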

In other cases, the rate of hashable searches can be predominant but the system may still suffer contention from this mechanism. This used to be the case in MySQL 5.6 and earlier versions, when “the adaptive hash index search system was protected by a single latch (btr_search_latch) which could become a point of contention”. This is apparent when we see many threads waiting on an RW-latch created in btr0sea.c in the upper part of the SEIS output, for example. This has been mitigated in MySQL 5.7 with the possibility of partitioning the AHI search mechanism (innodb_adaptive_hash_index_parts): “each index is bound to a specific partition, with each partition protected by a separate latch”.  

If you are in doubt about the effectiveness of the AHI for your workload and can test operating with it disabled for comparison, that would be a simple path to pursue.

A Quick Note on Processlist

We mentioned processlist samples briefly above: they contain basic information about ongoing connections to MySQL, such as what their source is and what they are currently doing, although not always in exact terms. There are moments when we need to inspect this file in search of a specific thread like we did earlier, but we can also extract more high-level stats from these samples. The following command will sort connections by Command:

$ cat processlist|grep Command:|sort|uniq -c|sort -rn
   2750 Command: Sleep
   1226 Command: Query
     36 Command: Connect
     29 Command: Binlog Dump
      2 Command: Execute

Please note you cannot simply look at the overall raw numbers above: each processlist file contains 29 samples of SHOW FULL PROCESSLIST; averaging these values does not provide an accurate view of the system at the time, but it does provide a pretty good general representation of it. Sorting by State is even better in this sense:

$ cat processlist|grep State:|sort|uniq -c|sort -rn
   2750   State: 
    548   State: Opening tables
    369   State: statistics
    103   State: Sending data
     68   State: starting
     31   State: query end
     31   State: Creating sort index
     29   State: Waiting for the slave SQL thread to free enough relay log space
     27   State: update
     18   State: executing
     17   State: Master has sent all binlog to slave; waiting for more updates
     14   State: Sending to client
      7   State: updating
      6   State: Receiving from client
      6   State: preparing for alter table
      5   State: checking permissions
      4   State: Waiting for table metadata lock
      4   State: closing tables
      3   State: System lock
      1   State: Sending binlog event to slave
      1   State: preparing
      1   State: creating table

By looking at the excerpt above, I would be inclined to further examine the efficiency of the table cache and that of the redo log space.

One last point that almost didn't make this document: the MyISAM storage engine has its days numbered. In MySQL 8.0, even the system tables are no longer MyISAM. In certain situations, you may still resort to MyISAM for on-disk temporary tables (internal_tmp_disk_storage_engine). Just keep the following in mind: for obvious reasons, MyISAM tables aren't featured in the output of SHOW ENGINE INNODB STATUS, so if you have a performance issue involving a query and you can't find anything about it in SEIS, chances are that the underlying tables are not InnoDB; look for clues about the query in SHOW PROCESSLIST instead.

Daemon Mode

You can have pt-stalk running in the background as a daemon, ready to trigger when a predefined condition is met. By default, a threshold for Threads_running is used, but you can also employ your own trigger conditions through a custom function!
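
For illustration, this is roughly what that could look like; the threshold below is arbitrary, and --daemonize, --variable, --threshold, and --cycles are the pt-stalk options involved:

$ sudo pt-stalk --daemonize --variable Threads_running --threshold 50 --cycles 5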

In the Next Post …

The best complement to our beloved Percona Toolkit trifecta is none other than Percona Monitoring and Management (PMM). It continually collects MySQL status variables for your servers and plots the metrics in Grafana panels that are easy to interpret. Plus, it includes a very rich Query Analytics dashboard that can help you identify your top slow queries. In the final post of this series, we will have a look at how we can obtain much of the same information we got from the Percona Toolkit tools (and sometimes more) from PMM.

Jun
03
2021
--

Inspecting MySQL Servers Part 3: What MySQL?

Inspecting MySQL Servers

In the previous post of this series, we looked at the hardware specifications and operating system settings of the host server through the lenses of a pt-summary report. Now that we know the foundation on which the database is running, we can turn our focus to MySQL itself. The second of our triad of tools from the Percona Toolkit will help us with that:

pt-mysql-summary conveniently summarizes the status and configuration of a MySQL database server so that you can learn about it at a glance. 

The goal for this part is to find out which MySQL distribution and version is being used, whether the server is a replica or maybe a PXC/Galera node, as well as its main configuration (such as the size of the Buffer Pool, redo log space, table cache, and temporary tables), and to check whether it makes sense for the platform it is running on. In order to gather the necessary information, the tool will connect to the MySQL server and run several queries, so you must have superuser access to the database.

If you have the Percona Toolkit installed on the server, you can run pt-mysql-summary as simply as:

sudo pt-mysql-summary -- --user=root --password=mysql-root-password

But it is usually better to redirect the report to a file you can save for later, which also allows for better searching:

sudo pt-mysql-summary -- --user=root --password=mysql-root-password > /tmp/pt-mysql-summary.out

You should keep one thing in mind though: any configuration adjustments made to MySQL based on the information provided by a pt-mysql-summary report must be seen as initial setup, not tuning properly speaking; we tune a database server for the workload it executes, and this cannot be done based on static data alone. The next post in this series will dig deep into that part. However, we need a starting point more reasonable than the standard "out of the box" configuration provided by the installation packages, otherwise the diagnostic data collected in the next step won't be as useful, and more iterations will be required.

Header Information

The first section of the report lists all mysqld running instances, identifies the binary being used, and indicates whether it contains symbols (useful for analyzing stack traces from crashes, for instance) or has been stripped of them:

# Percona Toolkit MySQL Summary Report #######################
              System time | 2020-10-08 15:30:53 UTC (local TZ: CDT -0500)
# Instances ##################################################
  Port  Data Directory             Nice OOM Socket
  ===== ========================== ==== === ======
   3306 /home/mysqladm/mysql/data  0    0   /mysqladm/tmp/mysql.sock
# MySQL Executable ###########################################
       Path to executable | /mysqladm/Percona-Server-5.7.25-28-Linux.x86_64.ssl101/bin/mysqld
              Has symbols | Yes

I don't see it happening very often nowadays, but in the past, it was more common to find multiple instances of MySQL running on the same server. The report will cater to the first instance in the list:

# Report On Port 3306 ########################################
                     User | mysqladm@%
                     Time | 2021-04-08 10:30:53 (CDT)
                 Hostname | myservername
                  Version | 5.7.25-28-log Percona Server (GPL), Release 28, Revision c335905
                 Built On | Linux x86_64
                  Started | 2021-04-07 08:53 (up 1+01:37:04)
                Databases | 6
                  Datadir | /home/mysqladm/mysql/data/
                Processes | 2000 connected, 1500 running
              Replication | Is not a slave, has 2 slaves connected
                  Pidfile | /mysqladm/tmp/mysqld.pid (exists)

From the section above we get a set of important information:

  • The exact MySQL version running: we can check the Release Notes page of the respective distribution to find out when it was released, which bugs it has fixed and, even more importantly, which ones it hasn't (based on bug reports and release notes from more recent releases). The one from the example above was released over two years earlier; it is simply not a best practice to allow a server to become this outdated, missing out both on security patches and overall improvements.
  • The server has been (re)started only a day ago, and this suggests the caches (mostly the InnoDB Buffer Pool) could be still warming up.
  • Here we get the instance's data directory (datadir) path; when it happens to match the root directory of one of the mount points listed on the pt-summary report, we may get the exact database size (on disk) from the partition's usage stats.
  • The number of running Processes is just an “eye-opener”, something to pay more attention to later.
  • The information indicating this server is a master is a crucial one: we probably shouldn’t be making any concessions in terms of relaxing data durability without a “safety net”.

Processes

pt-mysql-summary connects to the database and runs a SHOW FULL PROCESSLIST query in the background: the information appearing in the Processlist section is based on that single sample. The processes in the sample are aggregated according to Command, User, Host, database (db), and State. While it doesn’t provide a good representation of the workload, it may hint at common issues, such as contention at the table cache level or open files limit when there is a majority of connections stuck in the Opening tables state.

Status Variables

The next section processes data collected from two samples of SHOW GLOBAL STATUS taken 10 seconds apart, and "rounded"; the idea here is not to provide accurate metrics but an insight into how the counters increased in this short time span. Thus, the Per day value doesn't really represent a full 24-hour period: it is simply the counter's initial value (from the first sample) averaged over the server's uptime and normalized to a 24-hour period, with some rounding involved. I tend to pay attention to the 10 secs column only, as I find it more straightforward and useful: it shows how much a given counter increased from the first sample to the second one, taken 10 seconds later.

You should not spend too much time analyzing the Processes and Status Counters sections of a pt-mysql-summary report; the time for this will come later, with the richer and more diverse data collected by pt-stalk.

MySQL Settings

Below the Status Counters section, it’s mostly about MySQL settings, but it includes some statistics mixed in. It obviously doesn’t cover all the 500+ variables you can tune on the latest MySQL server, and the sections often don’t carry the formal variable name (e.g., “Buffer Pool Size” instead of innodb_buffer_pool_size). The aim is to highlight some of the most important ones. I will comment about a subset of them, starting with the Table cache:

# Table cache ################################################
                     Size | 512
                    Usage | 225%

Usage is obtained by dividing the status counter Open_tables by the variable table_open_cache (with some rounding involved). I reckon the excerpt above is not a common one, but it helps illustrate an interesting scenario: in some cases, a variable like table_open_cache was defined in the my.cnf configuration file years ago and left there, forgotten. The server has been upgraded, sometimes multiple times, and the default value for this variable has increased: in 5.5 it was 400, then 2000 in 5.6, and 4000 now in 8.0. In fact, the table cache has since been partitioned to scale better in the highly-concurrent environments now running on modern servers with multiple cores: table_open_cache_instances was introduced in 5.6 but had a default value of 1 then; in 5.7, the default changed to 16. For our example above, with table_open_cache still defined in the configuration file as 512 but table_open_cache_instances left undefined, the result, in practice, is a server operating with 16 table cache instances, each with only (512/16=) 32 entries!

We will be discussing how to evaluate the efficiency of the table cache size further in the next post. For now, make an exercise of reviewing your my.cnf configuration file and check how the variables defined there relate to default values for your server’s version.
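
One way to do that exercise, assuming you have shell access to the server, is to compare the running values with the defaults compiled into your mysqld binary (the grep pattern below matches both table cache variables):

$ mysql -e "SELECT @@table_open_cache, @@table_open_cache_instances"
$ mysqld --no-defaults --verbose --help 2>/dev/null | grep "^table-open-cache"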

The Query Cache

Speaking of the server’s version, a feature that has existed for a long time but which has been deprecated in MySQL 5.7 and effectively removed in 8.0 is the Query cache:

# Query cache ################################################
         query_cache_type | ON
                     Size | 100.0M
                    Usage | 6%
         HitToInsertRatio | 200%

It is reminiscent of a time when CPU was scarce and concurrency was low. That has not been the case for a long time, and the query cache became a real bottleneck: it is protected by a single mutex, so every SELECT needs to wait until it can acquire it to execute. Can you imagine the potential impact of this setting on highly concurrent workloads?

The query cache has been disabled by default since MySQL 5.6 (query_cache_type=0) but we still find that variable set in the configuration files of many MySQL 5.7 servers. (Hint: you should also remove the definition for query_cache_size if you have it in place, as that much memory will be allocated for the query cache even if it is disabled).

The Main Storage Engine

MySQL became a transactional database with the addition of the InnoDB storage engine many years ago, first as a plugin, then as the default engine. With MySQL 8.0, MyISAM remains available but even system tables are now InnoDB. There are alternatives, such as MyRocks, which can be easily installed for Percona Server for MySQL, but for most environments, much of the tuning done to MySQL is InnoDB-specific. There are about 170 innodb_-prefixed variables in MySQL 8.0 and pt-mysql-summary only covers some of them in the InnoDB section:

# InnoDB #####################################################
                  Version | 5.7.29
         Buffer Pool Size | 50.0G
         Buffer Pool Fill | 100%
        Buffer Pool Dirty | 1%
           File Per Table | ON
                Page Size | 16k
            Log File Size | 2 * 2.0G = 4.0G
          Log Buffer Size | 16M
             Flush Method | O_DIRECT
      Flush Log At Commit | 1
               XA Support | ON
                Checksums | ON
              Doublewrite | ON
          R/W I/O Threads | 4 4
             I/O Capacity | 200
       Thread Concurrency | 0
      Concurrency Tickets | 5000
       Commit Concurrency | 0
      Txn Isolation Level | READ-COMMITTED
        Adaptive Flushing | ON
      Adaptive Checkpoint |
           Checkpoint Age | 1G
             InnoDB Queue | 0 queries inside InnoDB, 0 queries in queue
       Oldest Transaction | 0 Seconds
         History List Len | 95
               Read Views | 1
         Undo Log Entries | 2 transactions, 17441 total undo, 15747 max undo
        Pending I/O Reads | 0 buf pool reads, 0 normal AIO, 0 ibuf AIO, 0 preads
       Pending I/O Writes | 0 buf pool (0 LRU, 0 flush list, 0 page); 0 AIO, 0 sync, 0 log IO (0 log, 0 chkp); 0 pwrites
      Pending I/O Flushes | 0 buf pool, 0 log
       Transaction States | 2xACTIVE, 65xnot started

The most important of them is undoubtedly the Buffer Pool Size (innodb_buffer_pool_size). The buffer pool is a data cache: the ideal situation for a database server would be to fit the whole dataset in memory, inside the buffer pool; this way it wouldn't be necessary to "ever" fetch data from disk, limiting I/O to basically writes. This is, of course, not totally accurate in practice, but neither is the requirement to "fit all your dataset in memory": in fact, if you can fit all your hot data, the data that is most frequently accessed, in memory, the effect is similar.

If you have been around MySQL for a long time, you probably heard the rule of thumb that says one should allocate 80% of the server's memory to the Buffer Pool. This may have been a reasonable starting point when servers didn't have as much memory, but nowadays it would often be considered too conservative: take a server with 512G of RAM, for example; reserve 80% for the Buffer Pool, and there would remain up to 100G of RAM left. You surely need to reserve some memory for the OS (Linux) and for MySQL to manage connections, run queries (particularly temporary tables, which we will discuss later), and maintain other structures, such as the Adaptive Hash Index (AHI), but most of the memory surplus will end up going to the OS cache, for lack of a better use. And since InnoDB usually accesses the storage directly (O_DIRECT), it bypasses the OS cache to read and write data to disk; only other parts of MySQL rely on the OS cache to interact with the storage, so watch out to make the most of the memory that is available.

Buffer Pool Fill is an estimation of how "full" the Buffer Pool is at the moment, obtained from the ratio between used pages (Innodb_buffer_pool_pages_total minus Innodb_buffer_pool_pages_free) and Innodb_buffer_pool_pages_total. Similarly, Buffer Pool Dirty, the percentage of pages in the Buffer Pool that have been modified (in memory) but not yet flushed (written) to the destination tablespace, is calculated by dividing Innodb_buffer_pool_pages_dirty by Innodb_buffer_pool_pages_total. These metrics shouldn't be considered for a server that has just been restarted, though, when the Buffer Pool is still "warming up", getting filled with the "hot data" mentioned above. Besides, there is a better way to evaluate whether the Buffer Pool (size) is good enough for the production workload, which we will cover in the next post.

Now, you may have read the words "dirty pages" and "not yet been written" and freaked out a bit, which is understandable. It is time to talk about data durability in MySQL (or, more precisely, in InnoDB). Flush Log At Commit represents the variable innodb_flush_log_at_trx_commit: it determines how often the redo log is flushed to disk. For full ACID compliance, that should happen at each transaction commit (a value of 1). Why would anyone run away from this, you may ask? Because of the impact on I/O: depending on the workload, this may become a severe bottleneck. Thus, it is common to have this setting "relaxed" on replicas, operating with a value of 0, meaning modified data is flushed to disk once per second. In the worst-case scenario, a server crash means the possible loss of (approximately) one second's worth of data: if there were 100 transactions committed in the last second preceding the crash, they might not be there when the server is restarted.

This compromise is acceptable in replicas because they can be rebuilt from scratch. On a master, operating with innodb_flush_log_at_trx_commit set to anything other than one would be taking the risk described in the previous paragraph. Note, however, that if a BBU is available and the write-back cache is enabled, the BBU will merge fsync calls and orchestrate the actual “sync to disk” process, optimizing I/O and, thus, reducing the cost of running InnoDB in fully durable mode with innodb_flush_log_at_trx_commit=1.

I should note that dirty pages aren't written from the Buffer Pool straight to their respective tablespace files: that would be a very costly operation, as the target blocks are spread across the disk. To optimize on this front, the changes are first written to a log, sequentially, and only later, over time, are the pages copied to the tablespace files. That log is in fact a circular data structure composed of two or more files and is known as the redo log. The size of the redo log is, then, the product of innodb_log_files_in_group and innodb_log_file_size:

            Log File Size | 2 * 2.0G = 4.0G

The process of copying data from the redo log to the tablespace files is called checkpointing. This process is constantly running in the background, moving dirty pages to their destination data files at a steady but “gentle” rate to optimize I/O. The exception is when the amount of dirty pages in the Buffer Pool exceeds the number of free pages in the redo log. When that happens, a sharp checkpoint process is required: off goes the gentleness and all I/O is put at the service of the checkpoint process, which often stalls the server. You can read more about this process in InnoDB Flushing in Action for Percona Server for MySQL.

The redo log also plays an important role in InnoDB's crash recovery process, when MySQL is restarting following a crash: all non-free blocks need to be inspected to correct any data written by incomplete transactions, which need to be rolled back. Thus, the redo log space should be big enough to avoid any sharp checkpointing being triggered. However, an overly large redo log that is full may delay the MySQL initialization process following a crash (though this has improved significantly in the latest versions of MySQL).

There are about 170 other InnoDB variables we could tune, but getting innodb_buffer_pool_size, innodb_flush_log_at_trx_commit, and innodb_log_file_size "right" is a great start.

Binary Logging

The main use of binary logs in MySQL is for replication, so you should retain, at the very least, enough binlogs to allow any replica that is temporarily stopped (for maintenance purposes, for instance) or otherwise lagging to recover the missing replication events. expire_logs_days was the variable that used to control "the number of days before automatic removal of binary log files"; in MySQL 8.0 it has been deprecated in favor of binlog_expire_logs_seconds.

# Binary Logging #############################################
                  Binlogs | 108
               Zero-Sized | 0
               Total Size | 104.4G
            binlog_format | ROW
         expire_logs_days | 5
              sync_binlog | 1
                server_id | 1
             binlog_do_db | 
         binlog_ignore_db |

From the above list, sync_binlog is one to pay attention to: similar to innodb_flush_log_at_trx_commit, it controls how often the binary log is synchronized to disk, with a value of 1 also providing the safest measure at the expense of higher I/O demand.

Binary logs are also used for Point-In-Time-Recovery (PITR): if you have retained all binlogs since your last backup was taken, be it a binary backup (taken with Percona XtraBackup, for instance) or a logical backup (mysqldump), you can restore the backup and then “play” (or load) all the binary logs that succeeded it, sequentially, down to the last event recorded. For this reason, binary logs play an important role in any backup strategy.
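
If you are on MySQL 8.0 and still carry expire_logs_days in your configuration, the conversion is a simple multiplication; a hedged sketch using the 5-day retention from the sample above:

$ mysql -e "SET GLOBAL binlog_expire_logs_seconds = 432000"   # 5 days x 86400 seconds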

Best of the Rest

There is a section called Noteworthy Variables in the pt-mysql-summary report that highlights a selection of other MySQL variables. Here are the ones in this group we pay the most attention to, and why:

Temporary Tables

There are two kinds of temporary tables in MySQL: explicit ones, which users create with CREATE TEMPORARY TABLE statements, and internal (implicit) ones, which the server creates on its own to process certain queries.

The following variables control, respectively, how big each kind of temporary table is allowed to grow in memory:

     max_heap_table_size | 64M
          tmp_table_size | 64M

Once an in-memory temporary table reaches this size threshold, it is converted to an on-disk temporary table. These tables are not allowed to grow in memory indefinitely for a very simple reason: a query that produces big temporary tables in a highly concurrent environment can have devastating effects on memory usage. Picture this: how much extra memory would be used in total by MySQL to process a query that makes use of 256M temporary tables across 500 concurrent connections? Operating with too-high values for these variables is the most common cause of the kernel's OOM killer terminating mysqld. The long-term solution is to optimize your queries to avoid the need for big temporary tables in the first place.
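
A quick back-of-the-envelope answer to the question above:

$ awk 'BEGIN { printf "potential temporary table memory: %.0f GB\n", 256*500/1024 }'
potential temporary table memory: 125 GB

That would come on top of the Buffer Pool and every other memory structure.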

Some internal temporary tables are created straight on disk though, independent of their size: it’s the case for operations on tables involving BLOB or TEXT columns, for example. Once in a while, we see attempts to work around this limitation by having the tmpdir point to a tmpfs partition that maintains files in memory: be careful with that.

MySQL Error Log

The MySQL error log is the #1 source of information about your database and the first place you should check when looking to understand what is going on with your server. It doesn’t have all the answers but it very often provides enough information to steer you in the right direction. Whenever MySQL crashes due to a bug, there is a stack trace left in the error log with more information. If you see MySQL restarting without first going through a clean shutdown process, you can be sure that either the mysqld process has been killed or the host suffered a major crash/power failure. 

                     log | ON
                log_error | /var/lib/mysql/mysql.err
             log_warnings | 2

log_warnings was the name of the variable that controlled the level of verbosity present in the error log. It has been deprecated in favor of log_error_verbosity in MySQL 5.7 and removed from MySQL 8.0. The higher the value set, the more information the log will contain.

Make a habit out of checking the error log from time to time and you may catch issues before they become real problems.

Slow Query Log

Remember the long-term solution to avoid the use of big temporary tables mentioned a few paragraphs up? The quest for query optimization starts with identifying the slow queries. Make sure that slow_query_log is enabled and tune long_query_time down accordingly; set it to zero to capture all queries. If you are running Percona Server for MySQL you can take advantage of a few extra options, such as log_slow_rate_limit, to mitigate the additional overhead from logging all queries while still obtaining a good representation of them. log_slow_verbosity allows you to customize what you want to have included in the slow log, such as query_plan to log information about the query’s execution plan.

Query optimization is a subject for another full blog series. My very simple but useful advice on this front is to capture your slow queries for at least a couple of days (ideally more), including peak time, and then process the slow query log with pt-query-digest (another tool from the Percona Toolkit): it will create a rank of the top slow queries, taking into consideration not only the queries' duration but also their frequency. You can use that list, and the details provided in the generated report, to focus on optimizing those that will bring you the most gains.
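
As a sketch of that workflow (the slow log path is illustrative, and long_query_time=0 logs every query, so watch the extra overhead and disk usage):

$ mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 0"
$ pt-query-digest /var/lib/mysql/myserver-slow.log > /tmp/slow-report.txt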

Configuration File

This section lists all settings defined in a MySQL configuration file if mysqld has been started with the --defaults-file option specifying one. This is less and less the case with modern Linux distributions that use systemd to manage services. Plus, it has become a common practice to group sections of similar settings in distinct configuration files, so this part of pt-mysql-summary requires some reworking.

Memory Allocator

The very last section in the pt-mysql-summary report indicates whether MySQL has been configured to start with the memory allocator jemalloc and if the respective library is in fact available in the system. jemalloc is an alternative to the standard glibc allocator that focuses on performance, reducing memory fragmentation to help improve scalability and increase throughput. It is not a change I would recommend for each and every setup but something to really consider for high-load servers where memory access is a bottleneck.

In the Next Post …

Now that we got to know the server and have potentially made a few adjustments to the MySQL configuration, it’s time to capture diagnostics data from production workload and start tuning the database for it.

Jun
02
2021
--

Inspecting MySQL Servers Part 2: Knowing the Server

MySQL pt-summary

In part 1 of this series, we introduced you to the approach used by the Percona Support team to inspect MySQL servers, which is based on a triad of tools from the Percona Toolkit used to gather diagnostic information. In this article, we will be focusing on the first of these tools, pt-summary.

pt-summary is, in fact, not database-related at all: it is all about the server hosting the service. The suffix "summary" derives from the fact that the tool generates a "single-page" report containing information about the hardware specifications and operating system settings. It aims at providing a starting point for the server assessment, as noted in the introduction post:

To start with, this box the service is running on, what kind of server is this? How much memory is available? How many cores? What about storage, is it fast, is there enough disk space? Is it running on CentOS, Ubuntu, or something else? Is it a VM?! Which kernel version? Is it swapping … is there any swap space configured at all?

To refresh your memory, if you have the Percona Toolkit installed on the server, you can run pt-summary as simply as:

$ sudo pt-summary

But it is usually better to redirect the report to a file you can save for later, which also allows for better searching:

$ sudo pt-summary > /tmp/pt-summary.out

The next sections break the generated report down into various parts to facilitate its assessment, mostly following the order in which they are presented. Note that the data presented in each section is used for illustration purposes and does not represent a single server; it is a mix-and-match taken from various reports, with sensitive data such as hostnames and IP addresses altered when necessary, for obvious reasons.

The Heart and Soul of the Server

The header of the report provides information about the system platform:

# Percona Toolkit System Summary Report ######################
        Date | 2021-02-19 16:11:19 UTC (local TZ: CET +0100)
    Hostname | my.server.net
      Uptime | 199 days,  2:35,  1 user,  load average: 5.37, 3.94, 2.74
      System | Dell Inc.; PowerEdge R730; vNot Specified (Rack Mount Chassis)
 Service Tag | 4XYSG62
    Platform | Linux
     Release | CentOS Linux release 7.8.2003 (Core)
      Kernel | 3.10.0-1127.18.2.el7.x86_64
Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.17
     SELinux | Disabled
 Virtualized | No virtualization detected

A trained eye will scan through these lines pulling information and starting the work of picturing the server as if rendering the image from a series of pixels collected here and there.

When I receive a pt-summary report from a customer, I will check the Date just to be sure it is recent enough and double-check the hostname – this is important in order to match it with the data provided by the other tools. Uptime will tell us if the server has been recently restarted, or maybe it is a new one, and that may impact other metrics and statistics in the report, such as memory caches. The load averages provide a glimpse of the recent load of the server when the report was taken, a metric that should be coupled with the processors’ information for better interpretation.

The Linux distribution and release is a piece of information I save for later, in case I have to download the specific database packages to check on something or to get the mysqld symbols, for instance. Likewise, knowing whether this is a bare-metal server or a virtualized one, particularly in the case of a virtual machine, influences my assessment; there are certain scenarios and situations known to affect the performance of a server in a virtualized environment, such as the famous noisy neighbors. The section above may also contain the cloud provider's details, if applicable.

Processing Power

The next section covers the processors in the server, which provides a good idea of its processing power and parallelization. There are a number of MySQL settings that should be tuned according to these, which we will be discussing in the next two posts. For now, take note of the number of available cores. Here is an example of how this section may look:

# Processor ##################################################
  Processors | physical = 2, cores = 16, virtual = 32, hyperthreading = yes
      Speeds | 1x2900.585, 1x3103.320, 1x3305.859, 4x3399.804, 13x3400.000, 2x3400.390, 1x3400.781, 1x3402.148, 1x3405.859, 1x3442.968, 1x3443.554, 1x3453.320, 1x3461.132, 1x3478.906, 1x3573.242, 1x3577.929
      Models | 32xIntel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
      Caches | 32x20480 KB

Note that not all cores in the example above are running at the same speed; why is that, and is this a problem? 

The default power schema in most Linux distributions operates on an “on-demand” basis, which scales down CPU frequencies while they are idle in order to save energy; the name of this scaling governor mode is powersave. Having such a setting in place in a dedicated database server may or may not make sense – it depends on your workload and what your priorities are.

In many cases, the expectation is to have the server running at full capacity; that can be accomplished with the performance governor. How to change the power mode varies between Linux distributions, and even among different versions of the same one, but in most cases, the adjustment can be done with a single command run as root:

# for i in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo "performance" > $i; done

Check the exact details for your Linux system as well as how to make the change persist a server restart.

Memory

Another big piece of the puzzle we are trying to assemble (or make sense of) is obviously memory: not only how much is available, but also how it is being used and managed. Here’s the memory section excerpt from the pt-summary report of yet a different server:

# Memory #####################################################
       Total | 94.3G
        Free | 6.7G
        Used | physical = 87.6G, swap allocated = 16.0G, swap used = 604.2M, virtual = 88.2G
     Buffers | 354.8M
      Caches | 81.0G
       Dirty | 332 kB
     UsedRSS | 4.5G
  Swappiness | 60
 DirtyPolicy | 20, 10
 DirtyStatus | 0, 0

There are basically three pieces of information we obtain from this section:

  • How much memory is there available on this server
  • How much memory remains available
  • If there is any swap space allocated and what is the server inclination to make use of it

The Total memory is only the starting point. Besides the amount of memory reported as being Free, note how much memory is being used for the OS cache: this is not typical for an InnoDB-based MySQL server. Normally, we would dedicate the majority of the available memory to the InnoDB Buffer Pool: more on this in the next post.

As for swap space: I feel this has become a controversial subject in the last few years, with the rise of the cloud and disposable servers. If you route your read traffic among a pool of servers, it might be that you would prefer to have one of them face an immediate termination by the kernel's Out Of Memory (OOM) killer rather than tap into swap space and perform slower. But that would be an edge case, and certainly restricted to replicas; the cost and risk of losing a master like this is rather high. In general, a better practice is to have swap space in place, as a safeguard, but instruct the OS to only make use of it when absolutely necessary. There is a setting to tune this: swappiness controls the kernel's tendency to swap memory pages. Higher values increase the aggressiveness of the kernel to swap, whereas lower values decrease the likelihood of it happening. For dedicated database servers, we used to advise setting swappiness to zero, but that proved to be too much for more recent kernels; nowadays, setting it to 1 is a safer approach. You can make the change very easily as root:

# echo "vm.swappiness = 1" >> /etc/sysctl.conf
# sysctl -p

THP and a Missing Point

The very last sections of a pt-summary report cover an important complementary point – while leaving another important one behind – whether Transparent Huge Pages (THP) is enabled on the server:

# Memory management ##########################################
Transparent huge pages are enabled.

The most succinct explanation I have read on why you should have it disabled comes from the MongoDB manual:

(…) database workloads often perform poorly with THP enabled, because they tend to have sparse rather than contiguous memory access patterns. When running MongoDB on Linux, THP should be disabled for best performance.

And the same is true for MySQL. I will provide a bit more information on this topic in the next post. For now, know you can disable THP in most systems with the following commands:

# echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
# echo "never" > /sys/kernel/mm/transparent_hugepage/defrag

Ideally, however, you should start the server with them already disabled. You can create a systemd service file for this, as explained in the link above, or alter the kernel boot options in grub.conf to include transparent_hugepage=never .

A notable absence from this section is NUMA-related information, which is essential to analyze memory imbalance on NUMA systems equipped with lots of memory. I kindly invite you to support this request.

Storage

This is the longest part of a pt-summary report and includes details about the disk partitioning and arrangement (LVM), I/O schedulers, RAID controllers, inodes, and mount points. The latter is the first of the sections presented, and from it we can usually infer where the MySQL data is stored; here is a simplified example:

# Mounted Filesystems ########################################
  Filesystem             Size Used Type  Opts                         Mountpoint
  10.1.1.240:/backup     3.0T  37% nfs   rw,bg,soft,addr=10.1.1.240   /backup
  /dev/mapper/data       4.3T  42% ext4  rw,noatime,nodiratime        /data
  /dev/mapper/system      20G  23% ext4  rw                           /
  /dev/sda2              504M   6% ext2  rw                           /boot
  tmpfs                   48G   0% tmpfs rw                           /dev/shm

The topic of which filesystem to use for a database such as MySQL is a popular one, but there is no need to lose sleep over it: between EXT4 and XFS, you will be fine. NFS should only be considered for storing backup files. You should pay attention to the mount options you employ, though, as the proper ones are determinant for improved performance. In particular, we must avoid having the kernel update timestamps on database files every time they are accessed (atime), hence we should mount the partition containing the data with noatime. The same can be done for directories with nodiratime, though the impact is smaller. If the drives are behind a RAID controller equipped with a battery-backed unit (BBU) for caching data, then you don't need to "enforce proper on-disk ordering of journal commits" to avoid risking "filesystem corruption in case of power failure" (from the mount manual page), and thus you can also mount the data partition with the nobarrier option for improved performance.
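
For illustration only, borrowing the device and mount point from the sample above (and keeping nobarrier strictly for setups with a BBU-protected write-back cache), the corresponding /etc/fstab entry could look like:

/dev/mapper/data  /data  ext4  rw,noatime,nodiratime,nobarrier  0 0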

Here’s the section of the pt-summary report showing RAID Controller details (make sure you have the controller management software installed on the server, or this section will be empty); note the presence of a BBU cache:

# RAID Controller ############################################
  Controller | LSI Logic MegaRAID SAS
       Model | ServeRAID M5210, PCIE interface, 8 ports
       Cache | 1024MB Memory, BBU 
         BBU | % Charged, Temperature 24C, isSOHGood=

  VirtualDev Size      RAID Level Disks SpnDpth Stripe Status  Cache
  ========== ========= ========== ===== ======= ====== ======= =========
  0(no name)  1.089 TB 0 (:-1-0)      2 Depth-1  256 KB Optimal WB, RA
  1(no name)  4.361 TB 0 (:-5-3)      5 Depth-1  256 KB Optimal WB, RA

  PhysiclDev Type State   Errors Vendor  Model        Size
  ========== ==== ======= ====== ======= ============ ===========
  Hard Disk  SAS  Online,  0/0/0 LENOVO-XST1200MM0018 XL592S4009N7HL592L592L592 1.090  
  Hard Disk  SAS  Online,  0/0/0 LENOVO-XST1200MM0018 XL592S400AD3GL592L592L592 1.090  
  Hard Disk  SAS  Online,  0/0/0 LENOVO-XST1200MM0018 XL592S400AD0BL592L592L592 1.090  
  Hard Disk  SAS  Online,  0/0/0 LENOVO-XST1200MM0018 XL592S4009NSML592L592L592 1.090  
  Hard Disk  SAS  Online,  0/2/0 LENOVO-XST1200MM0018 XL592S400AD49L592L592L592 1.090  
  Hard Disk  SAS  Online,  0/0/0 LENOVO-XST1200MM0018 XL592S400AD5JL592L592L592 1.090  
  Hard Disk  SAS  Online,  0/0/0 LENOVO-XST1200MM0018 XL592S4009NB7L592L592L592 1.090

Speaking of RAID, we used to advise against RAID-5 (as the parity calculations required to ensure data integrity imply an excessive write penalty) and in favor of RAID-10 volumes spanned across multiple disks and backed by a controller with a BBU, but things are changing with the advent of modern storage technology (SSD and beyond). This also affects the choice of I/O scheduler:

# Disk Schedulers And Queue Size #############################
        dm-0 | 128
        dm-1 | 128
        dm-2 | 128
        dm-3 | 128
        dm-4 | 128
        dm-5 | 128
         sda | [deadline] 128
         sdb | [deadline] 128

Back in the day, cfq used to be the default I/O scheduler in most Linux distributions, but it has been shown to be a bad choice for database workloads. deadline remains the best choice for RAID setups with spindle disks, while noop is better for SSDs. Changing the scheduler on the fly is easy:

# echo "noop" > /sys/block/sdb/queue/scheduler

Changing the default scheduler in a persistent way requires editing the GRUB_CMDLINE_LINUX_DEFAULT line in grub.conf to include the option elevator=noop, as shown below.
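As a sketch, on systems using GRUB 2 this means editing /etc/default/grub and regenerating the configuration; the exact file and regeneration command vary by distribution:

GRUB_CMDLINE_LINUX_DEFAULT="quiet elevator=noop"
# Then regenerate the grub configuration, e.g.:
# sudo update-grub                              # Debian/Ubuntu
# sudo grub2-mkconfig -o /boot/grub2/grub.cfg   # RHEL/CentOS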

Network

I reckon this is the section I pay the least attention to in a pt-summary report. It provides basic details about the controllers and devices, so we get to see how the server is configured, but to look beyond this we need more detailed and more numerous samples than what is provided here. We do get these from pt-stalk, though, which we will be covering in a future post in this blog series.

A Quick Look Into What is Running

The Top Processes list comes from a single top sample, and its intent is simply to provide an insight into what is running at the time:

# Top Processes ##############################################
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12313 mysqladm  20   0   65.8g  63.2g  11724 S 721.1 80.5   5698:16 mysqld
   10 root      20   0       0      0      0 S  10.5  0.0 213:58.07 rcu_sched
   25 root      20   0       0      0      0 S   5.3  0.0 488:46.43 ksoftirqd/3
   66 root      20   0       0      0      0 S   5.3  0.0 477:53.52 ksoftirqd/+
   76 root      20   0       0      0      0 S   5.3  0.0 510:34.38 ksoftirqd/+
   81 root      20   0       0      0      0 S   5.3  0.0 514:53.12 ksoftirqd/+
 6010 root      20   0  670928  79788   3368 S   5.3  0.1   5578:29 guard_stap
 6573 mysqladm  20   0  168900   2648   1704 R   5.3  0.0   0:00.04 top
    1 root      20   0  212568   7796   3116 S   0.0  0.0  53:18.47 systemd

In a dedicated database server, we expect mysqld to be the top process, and this sample does not disappoint in this regard. When looking at this section of the report, my eyes automatically focus on the VIRT(ual) and RES(ident) memory columns: I compare the values between the two of them, but also against the total memory in the server (which in this case was 78G; this excerpt comes from a different pt-summary report than the one I used to illustrate the Memory section). Here, for instance, mysqld’s 63.2G of resident memory accounts for roughly 81% of the 78G total, which matches the 80.5 shown in the %MEM column. Together, these three pieces of information (total memory plus RES and VIRT for mysqld) are useful to create a first impression of how well balanced memory usage is on this server. I can almost guess the size of the Buffer Pool, but then I’m often wrong; I tend to underestimate how much memory is being used by temporary tables …

In the Next Post …

Which MySQL is running on this server, and how is it configured in view of the available resources on its host? pt-mysql-summary is the perfect match for pt-summary and provides great insight into the MySQL configuration.

Jun
01
2021
--

Skipping Percona Server for MySQL Version 8.0.24 and Releasing 8.0.25 Next

Percona Server for MySQL Version 8.0.24

After Oracle released MySQL version 8.0.24 on April 20, 2021, our engineering team got started right away with merging our enhancements to prepare the corresponding 8.0.24 version of Percona Server for MySQL.

However, Oracle released MySQL version 8.0.25 shortly afterward on May 11, 2021, to fix a critical bug that we also observed during our initial testing and reported back to them.

Therefore, we have decided to skip releasing Percona Server for MySQL 8.0.24, both as a standalone product and as part of a distribution, as well as the matching Percona XtraDB Cluster release.

Our next public release after Percona Server for MySQL version 8.0.23 will be version 8.0.25, so don’t be surprised if you’re looking for version 8.0.24 and can’t find it. Thank you!

Jun
01
2021
--

Inspecting MySQL Servers Part 1: The Percona Support Way

When I joined the Percona Support team in early 2013, MySQL 5.6 was being launched. I remember all the buzz and euphoria surrounding its release: it was a big step from the 5.5 series, carrying important changes in the MySQL architecture and internals that brought highly anticipated features such as the (still incipient) multi-threaded replication and online DDLs. What was not necessarily evident to the general public at the time was the focus it had on highly concurrent environments: MySQL 5.6 was targeting the future. What that meant in practice was that, for many workloads still common at the time, 5.6 could perform worse than 5.5: support tickets started pouring in from unhappy customers who had done the upgrade and were looking for assistance on this front.

It was a steep learning curve for me. Luckily – or as expected – I was surrounded by experts: Percona was already at the forefront of MySQL development and innovation, and my colleagues from the Percona Support team are some of the best in the industry. I was taught how to approach the analysis and troubleshooting of a MySQL server using a tried-and-tested method supported by tools crafted and honed by these same experts to make this task more effective. Fast forward eight years and we continue to operate with an improved version of this same method, which I’m going to present to you in this series of blog posts.

Where Do I Start?

One could argue that “it depends”, but that is not the case when what you are aiming to do is analyze the performance of a MySQL server (or any server, for that matter). You start by getting to know the server on which the service is running and how it behaves under load.

We can, then, break this approach into two distinct parts:

  1. Getting to “know” the server
  2. Understanding how the server behaves under load

The secret to starting on this road well is obtaining the right data. The first part can be considered more of a “static” analysis: we are looking for information about the underlying server’s (a) hardware specifications, (b) general OS settings, as well as (c) MySQL-specific configuration. It doesn’t matter much when this data is gathered as it should remain unchanged in the short term.

For the second part, however, the timing matters: data must be gathered either at a precise moment (such as while experiencing a problem) or covering a target time frame. In the case of a general MySQL server assessment, the time window must include peak time, when the load in the server is at its highest level. Usually, we want MySQL tuned in such a way that it is able to handle traffic during the busiest period of the day/week. If it can do that then it should be able to handle the more ordinary workload without breaking a sweat. On the other hand, when we are troubleshooting a specific scenario or circumstance, we need to capture the data at the moment the issue is manifesting itself. We will discuss this further later in this article.

Knowing the Server

The scenario is a common one; I believe all of you have been there before. A colleague or maybe a customer approaches you looking for some help with a server that is not performing well, one that you have never seen before. It might be a common web server running Apache and basic PHP, a more complex application server running Ruby on Rails or a full Java stack, or perhaps a dedicated database server running MySQL, PostgreSQL, or MongoDB.

To start with, this box where the service is running on, what kind of server is this? How much memory is available? How many cores? What about storage, is it fast, is there enough disk space? Is it running on CentOS, Ubuntu, or something else? Is it a VM?! Which kernel version? Is it swapping … is there any swap space configured at all?

You can find the answers to those questions by connecting to the server and looking around, running various specific commands to build an initial picture of the server, akin to building a puzzle from pieces you collect here and there. Do this a few times, though, and you will get bored and probably forget to collect a piece of information (or two). This is the kind of procedure that can be scripted, and it has already been done by clever experts in this field through different tools. The one we use and maintain is pt-summary, which is part of the Percona Toolkit. If you have one of the Percona repositories configured, you can install the toolkit with a simple:

sudo apt-get install percona-toolkit    # Debian/Ubuntu
sudo yum install percona-toolkit        # RHEL/CentOS
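If you don’t have a Percona repository configured yet, the percona-release package sets one up. As a sketch for Debian/Ubuntu (please check the Percona documentation for the current URL and the equivalent RHEL/CentOS steps):

wget https://repo.percona.com/apt/percona-release_latest.generic_all.deb
sudo dpkg -i ./percona-release_latest.generic_all.deb
sudo apt-get update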

I will show you how to use it in a moment.

What MySQL?

What about the MySQL part? When it comes to inspecting a MySQL server, there is some key information that is helpful to obtain upfront: the distribution and version, whether it is a replica or maybe a PXC/Galera node, the size of the Buffer Pool, the redo log space, the table cache, temporary tables, …

There is another tool in the Percona Toolkit specific to this task, pt-mysql-summary (and if you are wondering: yes, there is also a pt-pg-summary and a pt-mongodb-summary, but those are subjects for another blog series).

First Impressions

With the information provided by pt-summary and pt-mysql-summary, we can now get a clearer picture of what that server is and start to get a sense of how MySQL is configured in view of the host server, of how the database settings fit within the available physical resources. It then becomes easier, for example, to contrast the Buffer Pool size with the available memory in the server and ponder how much memory the mysqld process is using, or to spot a clearly misconfigured/unbalanced setting. Spend a few years doing this and you may start relating a given server version and a couple of specific setting values to a “popular” bug whose ID you remember by heart… let’s just hope you haven’t gotten that far already.

The combination of the information obtained from the “pt-summaries” provides what I call a “static” view of the server: it helps us draw our first impressions but it is akin to looking at a still picture of an engine. How does it perform in motion?

An Engine in Motion

The performance of a server is dependent on the physical resources available as well as the workload in play, with the workload in the case of a database server, in a simplified view, being a combination of queries and concurrency. Some workloads are CPU-bound while others are more IO-intensive. Some workloads don’t change much over time, some follow a distinct pattern over the day/week/month/season. It’s very rare to come across a workload that is completely unpredictable, and when that happens it is usually a temporary scenario.

This discourse about workloads occupies a central point here for a reason: we tune a server for the workload it runs and not merely for the specs it has. But how do we do this? We start by observing the server operating.

This observation can be divided into two large categories:

  • at the operating system level, we can observe how the physical resources (CPU, memory, storage, network) are being used with tools such as mpstat, iostat, vmstat, and netstat;
  • at the database level, we can observe how MySQL is performing through a collection of queries such as the popular SHOW ENGINE INNODB STATUS and SHOW PROCESSLIST, but also others that will cover, for instance, status variables and transaction- and mutex-specific information; a rough manual sketch of both levels follows below.
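For illustration only, a manual capture at both levels might look something like this rough sketch; the sampling intervals and output file names are arbitrary choices of mine:

# OS level: sample CPU, disk, and memory metrics for 30 seconds, in the background
mpstat -P ALL 1 30 > mpstat.out &
iostat -dmx 1 30 > iostat.out &
vmstat 1 30 > vmstat.out &
# Database level: capture InnoDB status and the process list
mysql -e "SHOW ENGINE INNODB STATUS\G" > innodb-status.out
mysql -e "SHOW FULL PROCESSLIST" > processlist.out
wait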

Gathering all this data manually would be impractical and very difficult to coordinate; luckily, there is pt-stalk, another tool from the Percona Toolkit, which automates this whole process.

Percona Support Favorite Triad

There you have it: the combination of data collected from pt-summary, pt-mysql-summary, and pt-stalk provides good insight into a MySQL server. In the following articles, I will walk you through how we interpret the data provided by each of these tools as well as how to correlate them to understand how MySQL is performing. Once you understand how this works, the last post in this series will show you how you can obtain much of the same information (and sometimes more) from PMM. For now, I’ll leave you with a simple procedure to run them, which should be a handy reference for you to have:

PTDEST=${HOME}/$(hostname)
mkdir -p "${PTDEST}"
sudo pt-summary > "${PTDEST}/pt-summary.out"
sudo pt-mysql-summary -- --user=root --password=<mysql-root-password> > "${PTDEST}/pt-mysql-summary.out"
sudo pt-stalk --no-stalk --iterations=2 --sleep=30 --dest="${PTDEST}" -- --user=root --password=<mysql-root-pass>

NOTES:

1) By default, pt-stalk will save the captured data under /var/lib/pt-stalk. We prefer to collect everything in a directory named after the server’s hostname, so we employ the variable PTDEST above to define one; you may adjust it to your liking.

2) If your system is configured in such a way that you don’t need to provide credentials to access MySQL then you can remove the

-- --user=root --password=<mysql-root-pass>

section from the pt-mysql-summary and pt-stalk commands above.

3) You can compress the target directory PTDEST and keep a copy of the data archived somewhere, share it with a colleague, or send it to your favorite Support team:

tar czvf "${HOME}/$(hostname).tar.gz" "${PTDEST}"

Capturing Data on a Server Without the Percona Toolkit

We still need the triad of tools for this to work; the difference here is that if the Percona Toolkit is not installed on the target database server, we can still download the tools separately and execute a modified version of the above procedure, as follows:

PTDEST=${HOME}/$(hostname)
mkdir -p "${PTDEST}"
cd ~
wget percona.com/get/pt-{summary,mysql-summary,stalk}
chmod +x pt-summary pt-mysql-summary pt-stalk
sudo ./pt-summary > "${PTDEST}/pt-summary.out"
sudo ./pt-mysql-summary -- --user=root --password=<mysql-root-password> > "${PTDEST}/pt-mysql-summary.out"
sudo ./pt-stalk --no-stalk --iterations=2 --sleep=30 --dest="${PTDEST}" -- --user=root --password=<mysql-root-pass>

In the Next Post …

What is the most relevant information about your server that you can find in a pt-summary report, and how does it relate to database performance? Stay tuned!
