Apr
14
2021
--

Win Percona Swag for Technical Product Reviews!

Percona Swag Review

Percona Swag ReviewYou can earn Percona swag by writing a short technical review about Percona software by May 30, 2021!  The keyword here is technical. Percona is seeking reviews that describe Percona software use in particular environments or in solving particular issues, and why!

This means being specific about your Percona environment, architecture, and problem-solving. Usage of any size, from small to enterprise, is welcome. Relevant topics may include the operating system, CPU, memory, volumes of data, engines, programming languages, applications, versions, configurations, bugs, performance, obstacles, etc. Naturally, you should not disclose anything confidential or proprietary when you write.

App marketplaces collect hundreds of user reviews and help users make sound decisions. If Percona has been good for you, your testimonials and recommendations can also help others find similar success. Add your review to one of these well-known software marketplaces: Capterra, G2, and SourceForge under one of these seven Percona product areas:

Percona Product Capterra   G2           SourceForge
Percona Monitoring and Management Capterra G2 SourceForge
Percona Server For MySQL Capterra G2 SourceForge
Percona XtraDB Cluster Capterra G2 SourceForge
Percona XtraBackup Capterra G2 SourceForge
Percona Distribution for PostgreSQL G2 SourceForge
Percona Backup for MongoDB G2 SourceForge
Percona Server for MongoDB G2 SourceForge
Percona Kubernetes Operator for Percona XtraDB Cluster G2 SourceForge

 

Review authors will receive a Percona T-shirt and a mug shipped free almost anywhere in the world. To claim your swag, email community-team@percona.com after your review is published. Include the following:

  1. Link to your published review.
  2. Your postal address.
  3. Your phone number (for delivery use only, never for marketing).
  4. Your T-shirt size (Small, Medium, Large, or Extra Large).
  5. Your preferred shirt color (White, Black, or Blue).

All reviews are read by our team and gratefully received. Note however that generalized reviews without strong technical content are not eligible to receive swag. Percona reserves final judgment on which reviews qualify and which do not.  Thank you for your submission!

Apr
14
2021
--

What You Can Do With Auto-Failover and Percona Distribution for MySQL (8.0.x)

auto-failover Percona MySQL

Where x is >= 22 ?

The Problem

There are few things your data does not like. One is water and another is fire. Well, guess what:

OVH Fire

If you think that everything will be fine after all, take a look:



Given my ISP had part of its management infrastructure on OVH, they had been impacted by the incident.

As you can see from the highlight, the ticket number in three years changes very little (2k cases) and the date jumps from 2018 to 2021. On top of that, I have to mention I had opened several tickets the month before that disappeared. 

So either my ISP was very lucky and had very few cases in three years and sent all my tickets to /dev/null… or they have lost THREE YEARS of data.   

Let us go straight to the chase; they have lost their data, period. 

After the fire at the OVH, these guys did not have a good backup to use for data restoring and did not even have a decent Disaster Recovery solution. Their platform remained INACCESSIBLE for more than five days, during which they also lost visibility of their own network/access point/clients and so on.   

Restoring data has brought them back online, but it takes them more than a month to review and fix the internal management system and bring the service back to acceptable standards. Needless to say, complaints and more costly legal actions had been raised against them.

All this because they missed two basic Best Practices when designing a system:

  • Good backup/restore procedure
  • Always have a Disaster Recovery solution in place 

Yeah, I know… I should change ISP. 

Anyhow, a Disaster Recovery (DR) solution is a crucial element in any production system. It is weird we still have to cover this in 2021, but apparently, it still is something being underestimated that requires our attention. 

This is why in this (long) article, I will illustrate how to implement another improved DR solution utilizing Percona Server for MySQL and standard MySQL features as group replication and asynchronous replication automatic failover (AAF).

Asynchronous Replication Automatic Failover

I have already covered the new MySQL feature here (http://www.tusacentral.net/joomla/index.php/mysql-blogs/227-mysql-asynchronous-source-auto-failover) but let us recap.

From MySQL 8.0.22 and Percona Server for MySQL 8.0.22 you can take advantage of AAF when designing distributed solutions. What does this mean?

When using simple Async-replication you have this:

simple Async-replication

Whereas, a Highly Available (HA) solution in DC2 is pulling data out from another HA solution in DC1 with the relation 1:1, meaning the connection is one node against another node.

If you have this:

Your data replication is interrupted and the two DCs diverge. Also you need to manually (or by script) recover the interrupted link. With AAF you can count on a significant improvement:

 

The link now is NOT 1:1, but a node in DC2 can count on AAF to recover the link on the other remaining nodes:

Asynchronous Replication Automatic Failover

If a node in the DC2 (the replica side) fails, then the link is broken again and it requires manual intervention. This solves a quite large chunk of problems, but it does not fix all, as I mentioned in the article above.

 

If a node in the DC2 (the replica side) fails, then the link is broken again and it requires manual intervention.

GR Failover

I was hoping to have this fixed in MySQL 8.0.23, but unfortunately, it is not. So I decided to develop a Proof Of Concept and see if it would fix the problem, and more importantly what needs to be done to do it safely. 

The result is a very basic (and I need to refine the code) Stored Procedure called grfailover, which manages the shift between primaries inside a Group Replication cluster:

 

I borrowed the concept from Yves’ Replication Manager for Percona XtraDB Cluster (https://github.com/y-trudeau/Mysql-tools/tree/master/PXC), but as we will see for GR and this use we need much less.

Why Can This Be a Simplified Version?

Because in GR we already have a lot of information and we also have the autofailover for async replication. Given that, what we need to do is only manage the start/stop of the Replica. Auto-failover will take care of the shift from one source to the other, while GR will take care of which node should be the preferred Replica (Primary on replica site). 

In short, the check just needs to see if the node is a Primary, and if so, start the replication if it is not already active while eventually stopping it if the node IS NOT a primary.

We can also maintain a table of what is going on, to be sure that we do not have two nodes replicating at the same time.

The definition will be something like this:

+--------------+---------------+------+-----+---------+-------+
| Field        | Type          | Null | Key | Default | Extra |
+--------------+---------------+------+-----+---------+-------+
| server_uuid  | char(36)      | NO   | PRI | NULL    |       |
| HOST         | varchar(255)  | NO   |     | NULL    |       |
| PORT         | int           | NO   |     | 3306    |       |
| channel_name | varchar(100)  | NO   |     | NULL    |       |
| gr_role      | varchar(30)   | NO   |     | NULL    |       |
| STATUS       | varchar(50)   | YES  |     | NULL    |       |
| started      | timestamp(6)  | YES  |     | NULL    |       |
| lastupdate   | timestamp(6)  | YES  |     | NULL    |       |
| active       | tinyint       | YES  |     | 0       |       |
| COMMENT      | varchar(2000) | YES  |     | NULL    |       |
+--------------+---------------+------+-----+---------+-------+

The full code can be found in GitHub here: https://github.com/Tusamarco/blogs/tree/master/asyncAutoFailOver.

How-To

The first thing you need to do is deploy Percona Server Distribution for MySQL (8.0.22 or greater) using Group Replication as a HA solution. To do so, refer to the extensive guide here: Percona Distribution for MySQL: High Availability with Group Replication Solution.

Once you have it running on both DCs, you can configure AAF on both DCs Primary node following either MySQL 8.0.22: Asynchronous Replication Automatic Connection (IO Thread) Failover or this MySQL Asynchronous SOURCE auto failover.

Once you have the AAF replication up and running, it is time for you to create the procedure and the management table in your DC-Source Primary.

First of all, be sure you have a percona schema, and if not, create it:

Create schema percona;

Then create the table:

CREATE TABLE `group_replication_failover_manager` (
  `server_uuid` char(36) NOT NULL,
  `HOST` varchar(255) NOT NULL,
  `PORT` int NOT NULL DEFAULT '3306',
  `channel_name` varchar(100) NOT NULL,
  `gr_role` varchar(30) NOT NULL,
  `STATUS` varchar(50) DEFAULT NULL,
  `started` timestamp(6) NULL DEFAULT NULL,
  `lastupdate` timestamp(6) NULL DEFAULT NULL,
  `active` tinyint DEFAULT '0',
  `COMMENT` varchar(2000) DEFAULT NULL,
  PRIMARY KEY (`server_uuid`)
) ENGINE=InnoDB;

Last, create the procedure. Keep in mind you may need to change the DEFINER or simply remove it. The code will be replicated on all nodes. To be sure, run the command below on all nodes:

select ROUTINE_SCHEMA,ROUTINE_NAME,ROUTINE_TYPE from information_schema.ROUTINES where ROUTINE_SCHEMA ='percona' ;
+----------------+--------------+--------------+
| ROUTINE_SCHEMA | ROUTINE_NAME | ROUTINE_TYPE |
+----------------+--------------+--------------+
| percona        | grfailover   | PROCEDURE    |
+----------------+--------------+--------------+

You should get something as above. 

If not, then check your replication, something probably needs to be fixed. If instead, it all works out, this means you are ready to go.

To run the procedure you can use any kind of approach you like, the only important thing is that you MUST run it FIRST on the current PRIMARY node of each DCs

This is because the PRIMARY node must be the first one to register in the management table. Personally, I like to run it from cron when in “production” while manually when testing:

IE:/opt/mysql_templates/PS-8P/bin/mysql -h 127.0.0.1 -P 3306 -D percona -e "call  grfailover(5,\"dc2_to_dc1\");"

Where:

  • grfailover is the name of the procedure.
  • 5 is the timeout in minutes after which the procedure will activate the replication in the Node.
  • dc2_to_dc1 Is the name of the channel in the current node, the procedure needs to manage.

Given two clusters as:

DC1-1(root@localhost) [(none)]>SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | e891d1b4-9793-11eb-92ac-08002734ed50 | gr3         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | ebff1ab8-9793-11eb-ba5f-08002734ed50 | gr1         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | f47df54e-9793-11eb-a60b-08002734ed50 | gr2         |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

DC2-2(root@localhost) [percona]>SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 79ede65d-9797-11eb-9963-08002734ed50 | gr4         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | 7e214802-9797-11eb-a0cf-08002734ed50 | gr6         |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
| group_replication_applier | 7fddf04f-9797-11eb-a193-08002734ed50 | gr5         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

If you query the management table after you have run the procedure ONLY on the two Primaries:

>select * from percona.group_replication_failover_manager order by host\G
*************************** 1. row ***************************
server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
        HOST: gr2
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: PRIMARY
      STATUS: ONLINE
     started: 2021-04-08 10:22:40.000000
  lastupdate: 2021-04-08 10:22:53.000000
      active: 1
     COMMENT: Just inserted
*************************** 2. row ***************************
 server_uuid: 7e214802-9797-11eb-a0cf-08002734ed50
        HOST: gr6
        PORT: 3306
channel_name: dc1_to_dc2
     gr_role: PRIMARY
      STATUS: ONLINE
     started: 2021-04-08 09:17:50.000000
  lastupdate: 2021-04-08 09:17:50.000000
      active: 1
     COMMENT: Just inserted

Given the replication link was already active, the nodes will report only “Just Inserted” in the comment. 

While if one of the two channels was down and the node NOT deactivated (set the active flag in the management table to 0), the comment will change to “COMMENT: REPLICA restarted for the channel <channel name>”

At this point, you can run the procedure also on the other nodes and after that, if you query the table by channel:

DC1-1(root@localhost) [(none)]>select * from percona.group_replication_failover_manager where channel_name ='dc2_to_dc1' order by host\G
*************************** 1. row ***************************
 server_uuid: ebff1ab8-9793-11eb-ba5f-08002734ed50
        HOST: gr1
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: SECONDARY
      STATUS: ONLINE
     started: NULL
  lastupdate: NULL
      active: 1
     COMMENT: Just inserted
*************************** 2. row ***************************
 server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
        HOST: gr2
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: PRIMARY
      STATUS: ONLINE
     started: 2021-04-08 10:22:40.000000
  lastupdate: 2021-04-08 10:22:53.000000
      active: 1
     COMMENT: REPLICA restarted for the channel dc2_to_dc1
*************************** 3. row ***************************
 server_uuid: e891d1b4-9793-11eb-92ac-08002734ed50
        HOST: gr3
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: SECONDARY
      STATUS: ONLINE
     started: NULL
  lastupdate: NULL
      active: 1
     COMMENT: Just inserted
3 rows in set (0.00 sec)

What happens if I now change my Primary, or if the Primary goes down? Well let say we “just” shift our PRIMARY:

stop slave for channel 'dc2_to_dc1';SELECT group_replication_set_as_primary('ebff1ab8-9793-11eb-ba5f-08002734ed50');
Query OK, 0 rows affected, 1 warning (0.01 sec)

+--------------------------------------------------------------------------+
| group_replication_set_as_primary('ebff1ab8-9793-11eb-ba5f-08002734ed50') |
+--------------------------------------------------------------------------+
| Primary server switched to: ebff1ab8-9793-11eb-ba5f-08002734ed50         |
+--------------------------------------------------------------------------+

Please note that given I have an ACTIVE replication channel, to successfully shift the primary, I MUST stop the replication channel first.

C1-2(root@localhost) [percona]>DC1-2(root@localhost) [percona]>SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | e891d1b4-9793-11eb-92ac-08002734ed50 | gr3         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
| group_replication_applier | ebff1ab8-9793-11eb-ba5f-08002734ed50 | gr1         |        3306 | ONLINE       | PRIMARY     | 8.0.23         |
| group_replication_applier | f47df54e-9793-11eb-a60b-08002734ed50 | gr2         |        3306 | ONLINE       | SECONDARY   | 8.0.23         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+

Reading the management table we will see that grFailOver had started the shift:

DC1-1(root@localhost) [(none)]>select * from percona.group_replication_failover_manager where channel_name ='dc2_to_dc1' order by host\G
*************************** 1. row ***************************
 server_uuid: ebff1ab8-9793-11eb-ba5f-08002734ed50
        HOST: gr1
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: PRIMARY
      STATUS: ONLINE
     started: NULL
  lastupdate: NULL
      active: 1
     COMMENT: Need to wait 5 minutes, passed: 0
*************************** 2. row ***************************
 server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
        HOST: gr2
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: PRIMARY
      STATUS: ONLINE
     started: 2021-04-08 10:22:40.000000
  lastupdate: 2021-04-08 10:22:53.000000
      active: 1
     COMMENT: REPLICA restarted for the channel dc2_to_dc1
*************************** 3. row ***************************
 server_uuid: e891d1b4-9793-11eb-92ac-08002734ed50
        HOST: gr3
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: SECONDARY
      STATUS: ONLINE
     started: NULL
  lastupdate: NULL
      active: 1
     COMMENT: Just inserted

Checking the new PRIMARY node gr1, we can see that:

  • Gr_role is PRIMARY
  • COMMENT reports the countdown (in minutes) the node waits

After the 5 minutes: 

DC1-1(root@localhost) [(none)]>select * from percona.group_replication_failover_manager where channel_name ='dc2_to_dc1' order by host\G
*************************** 1. row ***************************
 server_uuid: ebff1ab8-9793-11eb-ba5f-08002734ed50
        HOST: gr1
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: PRIMARY
      STATUS: ONLINE
     started: 2021-04-08 10:27:54.000000
  lastupdate: 2021-04-08 10:30:12.000000
      active: 1
     COMMENT: REPLICA restarted for the channel dc2_to_dc1
*************************** 2. row ***************************
 server_uuid: f47df54e-9793-11eb-a60b-08002734ed50
        HOST: gr2
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: SECONDARY
      STATUS: ONLINE
     started: NULL
  lastupdate: NULL
      active: 1
     COMMENT: Resetted by primary node ebff1ab8-9793-11eb-ba5f-08002734ed50 at 2021-04-08 10:27:53
*************************** 3. row ***************************
 server_uuid: e891d1b4-9793-11eb-92ac-08002734ed50
        HOST: gr3
        PORT: 3306
channel_name: dc2_to_dc1
     gr_role: SECONDARY
      STATUS: ONLINE
     started: NULL
  lastupdate: NULL
      active: 1
     COMMENT: Just inserted

Now, what we can see is:

  • Node gr1 had become active in replicating
    • It reports the time it started the replication 
    • It reports the last time it checked for the replication to be active
  • Node gr2 is marked SECONDARY
    • In the comment is also reported the time and when the replication was restarted on the new REPLICA node

If for any reason the replication in the original node gr2 was restarted (like moving back the PRIMARY) while the countdown was still in place, grFailOver will stop any action and reset the gr1 status. 

In short, now my two DCs can rely on AAF for failing over on a different SOURCE and on grFailOver for shifting the Node following GR Primary, or to failover to another node when my Primary crashes.

Conclusion

I am sure Oracle is backing something about this and I am sure we will see it out soon, but in the meantime, I have to say that this simple solution works. It has improved the resiliency of my testing architecture A LOT. 

And while I am still testing it and I am totally confident that the procedure can be written in a more efficient way, I am also sure bugs and errors are around the corner. 

BUT, this was a POC and I am happy with the outcome. This proves it is not so difficult to make better what we have, and also proves that sometimes a small thing can have a HUGE impact. 

It also proves we should not always wait for others to do what is required and that ANYONE can help. 

Finally, as mentioned above, this is a POC solution, but no one prevents you to start from it and make it a production solution, as my colleague Yves did for his Percona XtraDB Cluster Replication Manager. 

Is just on you!  Great MySQL to all. 

References

https://www.datacenterdynamics.com/en/news/fire-destroys-ovhclouds-sbg2-data-center-strasbourg/

http://www.tusacentral.net/joomla/index.php/mysql-blogs/227-mysql-asynchronous-source-auto-failover

https://github.com/y-trudeau/Mysql-tools/tree/master/PXC

https://www.percona.com/blog/2020/10/26/mysql-8-0-22-asynchronous-replication-automatic-connection-io-thread-failover/

Apr
14
2021
--

Percona Distribution for MySQL: High Availability with Group Replication Solution

This blog provides high availability (HA) guidelines using group replication architecture and deployment recommendations in MySQL, based on our best practices.

Every architecture and deployment depends on the customer requirements and application demands for high availability and the estimated level of usage. For example, using high read or high write applications, or both, with a need for 99.999% availability.

Here, we give architecture and deployment recommendations along with a technical overview for a solution that provides a high level of high availability and assumes the usage of high read/write applications (20k or more queries per second).

Layout

MySQL High Availability with Group Replication

Components

This architecture is composed of two main layers:

  • Connection and distribution layer
  • RDBMS (Relational Database Management System) layer

Connection Layer

The connection layer is composed of:

  • Application to proxy redirection mechanism which can be anything from a Virtual IP managed by Keepalived local service to a DNS resolution service like Amazon Route 53. Its function is to redirect the traffic to the active Proxy node.
  • Proxy connection distribution is composed of two or more nodes. Its role is to redirect the traffic to the active nodes of the Group Replication cluster. In cases like ProxySQL where the proxy is a level 7 proxy and is able to perform Read/Write split, this layer is also in charge of redirecting writes to the Primary node and reads to the Replicas, and of HA to prevent a single point of failure

Data Layer

The data layer is composed of:

  • Primary node serving writes (or source) – this is the node that will accept writes and DDL modifications. Data will be processed following the ACID paradigm (atomicity, consistency, isolation, durability) and replicated to all other nodes.
  • Replica nodes are the elements serving read requests. Some replica nodes can be elected Primary in case of Primary node failure. A replica node should be able to leave and join back a healthy cluster without impacting the service.
  • Replication mechanism to distribute changes across nodes and in this solution is done with Group Replication. Group Replication is a tightly coupled solution which means that the database cluster is based on a Datacentric approach (single state of the data, distributed commit). In this case, the data is consistent in time across nodes and replication requires a high performant link. Given that, geographic distribution is strongly discouraged and Disaster Recovery (DR) is not implicitly supported by the main Group Replication mechanism. 

The node characteristics (CPU/RAM/Storage) are not relevant to the main solution design.  They instead must reflect the estimated workload the solution will have to cover, which is a case-by-case identification. 

What is important to keep in mind is that all nodes that are part of the cluster must have the same characteristics.  If they don’t, the cluster will be imbalanced and service will be affected.

As a generic indication, we recommend using solutions with at least 8 cores and 16GB RAM when production.  

High Availability

How do we measure availability and at what point does it become “high” availability?

Generally speaking, the measurement of availability is done by establishing a measurement time frame and dividing it by the time that it was available. This ratio will rarely be 1, which is equal to 100% availability. At Percona we don’t consider a solution to be highly available if it is not at least 99% or “two nines” available.  

Availability % Downtime per year Downtime per month Downtime per week Downtime per day
99% (“two nines”) 3.65 days 7.31 hours 1.68 hours 14.40 minutes
99.5% (“two nines five”) 1.83 days 3.65 hours 50.40 minutes 7.20 minutes
99.9% (“three nines”) 8.77 hours 43.83 minutes 10.08 minutes 1.44 minutes
99.95% (“three nines five”) 4.38 hours 21.92 minutes 5.04 minutes 43.20 seconds
99.99% (“four nines”) 52.60 minutes 4.38 minutes 1.01 minutes 8.64 seconds
99.995% (“four nines five”) 26.30 minutes 2.19 minutes 30.24 seconds 4.32 seconds
99.999% (“five nines”) 5.26 minutes 26.30 seconds 6.05 seconds 864.00 milliseconds

How is High Availability Achieved?

There are three key components to high availability:

  1. Infrastructure – This is the physical or virtual hardware that database systems rely on to run. Without enough infrastructure (VM’s, networking, etc) there cannot be high availability. The easiest example is: there is no way to make a single server highly available.
  2. Topology Management – This is the software management related specifically to the database and managing its ability to stay consistent in the event of a failure. Many clustering or synchronous replication solutions offer this capability out of the box. However, for asynchronous replication, this is handled by additional software. 
  3. Connection Management – This is the software management related specifically to the networking and connectivity aspect of the database. Clustering solutions typically bundle with a connection manager, however in asynchronous clusters deploying a connection manager is mandatory for high availability.

This Solution Provides:

The proposed solution, based on a tightly coupled database cluster, offers an HA level of 99.995% when coupled with the Group replication setting group_replication_consistency=AFTER.

group replication

Failovers

If properly planned and architected, a database failure or configuration change that requires a restart shouldn’t affect the stability of the database infrastructure. Failovers are an integral part of a stability strategy and aligning the business requirements for availability and uptime with failover methodologies is critical to achieving those goals. Below are the 3 main types of failovers that can occur in database environments.

  • Planned Failover: A planned failover is a failover that has been scheduled in advance or occurs at a regular interval. There can be many reasons for planned failovers including patching, large data operations, retiring existing infrastructure, or simply testing the failover strategy.
  • Unplanned Failover: An unplanned failover is what occurs when a database unexpectedly becomes unresponsive or experiences instability. This could also include emergency changes that do not fall under the planned failover cadence or scheduling parameters. Unplanned failovers are generally considered higher-risk operations due to the high stress and high potential for either data corruption or data fragmentation.
  • Regional or Disaster Recovery Failover: Unplanned failovers still work with the assumption that additional database infrastructure is immediately available and in a usable state. In a regional or DR failover, we would be making the assumption that there is a large-scale infrastructure outage that requires the business to move its operations away from its current availability zone.
  • This solution covers both planned and unplanned failovers.

Maintenance Windows

Major vs Minor Maintenance: Although it may not be obvious at first, not all maintenance activities are created equal and do not have the same dependencies. It is good to separate maintenance that demands downtime or failover from maintenance that can be done without impacting those important stability metrics. When defining these maintenance dependencies there can be a change in the actual maintenance process that allows for a different cadence.

Maintenance Without Service Interruption: With rolling restart and using proper version upgrade it is possible to cover both major and minor maintenance without service interruption.

Uptime

When referring to database stability, uptime is likely the largest indicator of stability and oftentimes is the most obvious symptom of an unstable database environment. Uptime is composed of 3 key components and, contrary to common perception, is based on what happens when the database software is not able to take incoming requests rather than maintaining the ability to take requests with errors.

Recovery Time Objective (RTO): This can be boiled down to a very simple question “How long can the business sustain a database outage?”. Once the business is aligned with a goal of a minimum viable recovery time objective, it is much more straightforward to plan and invest in the infrastructure required to meet that requirement. It is important to acknowledge that while everyone desires 100% uptime, there is a need for realistic expectations that align with the business needs and not a technical desire.

Recovery Point Objective (RPO): There is a big distinction between the Recovery Point and the Recovery Time for database infrastructure. The database can be available, but not to the exact state that it was when it became unavailable. That is where Recovery Point comes in. The question we ask here is “How much data can the business lose during a database outage?”. All businesses have their own requirements here and it is worthy to note that it is always the goal to never sustain any data loss. But this is framed in a worst-case scenario how much data could be lost and the business maintains the ability to continue.

Disaster Recovery: While RTO and RPO are great for unplanned outages or small-scale hiccups to the infrastructure, when we talk about Disaster Recovery this is a major large-scale outage not strictly for the database infrastructure. How capable is the business of restarting operations with the assumption that all resources are completely unavailable in the main availability zone? The assumption here is that there is no viable restoration point or time that aligns with the business requirements. While each DR scenario is unique based on available infrastructure, backup strategy, and technology stack, there are some common threads for any scenario. 

This solution helps improve uptime:

Using this solution will help you to significantly reduce both RPO and RTO. Given the tightly coupled cluster solution approach, the failure of a single node will not result in service interruption.
Increasing the number of nodes will also improve the cluster resilience by the formula:
F = (N -1) / 2

Where:

F – Number of admissible failures

N – number of nodes in the cluster

Examples:

In a cluster of 5 nodes, F = (5 – 1)/2 = 2. 

The cluster can support up to 2 failures. 

In a cluster of 4 nodes, F = (4 – 1)/2 = 1.  

The cluster can support up to 1 failure. 

This solution also allows for a more restrictive backup policy, dedicating a node to the backup cycle, which will help in keeping RPO low. As previously mentioned, DR is not covered by default by the solution which will require an additional replication setup and controller. 

Measurement and Monitoring

To ensure database infrastructure is performing as intended or at its best, it is necessary to measure specific metrics and alert when some of these metrics are not in line with expectations. Periodic review of these measurements is also encouraged to promote stability and understand potential risks associated with the database infrastructure. Below are the 3 aspects of Database performance measurement and monitoring

Measurement: To understand how a database infrastructure is performing there is a need to measure multiple aspects of the infrastructure. With measurement, it’s important to understand the impact of the sample sizes, sample timing, and sample types.

Metrics: Metrics refer to the actual parts of the database infrastructure being measured. When we discuss metrics, more isn’t always better as it could introduce unintentional noise or allow for troubleshooting to become overly burdensome.

Alerting: When one or many metrics of the database infrastructure is not within a normal or acceptable range, an alert should be generated so that the team responsible for the appropriate portion of the database infrastructure can investigate and remedy it

Monitoring for this solution is covered by:

Percona Monitoring and Management has a specific dashboard to monitor the Group Replication state and cluster status as a whole. (https://www.percona.com/doc/percona-monitoring-and-management/2.x/introduction.html) has a specific dashboard to monitor Group Replication state, and cluster status as a whole.   

How to Implement the Infrastructure

In this section, we are providing the step by step instructions on how to implement the above solution. 

The Elements

The following will be used:

  • 1 Virtual IP for ProxySQL failover – 192.168.4.194
  • 2 ProxySQL nodes
    • Proxy1 192.168.4.191
    • Proxy2 192.168.4.192
  • 4 MySQL nodes in Single Primary mode
    • Gr1 192.168.4.81 – Initial Primary
    • Gr2 192.168.4.82 – Replica / failover
    • Gr3 192.168.4.83 – Replica / failover
    • Gr4 192.168.4.84 – Replica / Backup
  • Ports. All ports must be open if a firewall is in place or any other restriction like AppArmor or SELinux. 
    • Proxysql
      • 6033
      • 6032
      • 3306
    • MySQL – GR
      • 3306
      • 33060
      • 33061

Software Installation

First, you need to install the Percona Distribution for MySQL, the Percona Server for MySQL-based variant, on each node. Follow the instructions at https://www.percona.com/doc/percona-distribution-mysql/8.0/installing.html to install Percona Server for MySQL v8.0.

Configure the Nodes

Before anything, make sure that all the nodes use the same time-zone and time:

[root@gr1 ps8]# date
    Tue Aug 18 08:22:12 EDT 2020

Check also for ntpd service to be present and enabled. Be sure that each node resolves the other nodes by name:

root@gr1 ps8]# for i in 1 2 3 4 ; do ping -c 1 gr$i > /dev/null;echo $?; done

If not able to resolve, add the entries in the /etc/hosts file.

Once instances are up and running check the Percona Server for MySQL version on each node:

(root@node1) [(none)]>\s
--------------
/opt/mysql_templates/PS-8P/bin/mysql  Ver 8.0.20-11 for Linux on x86_64 (Percona Server (GPL), Release 11, Revision 159f0eb)

Step 1

Create a proper user for administration:

CREATE user dba@localhost identified by 'dbapw';
CREATE user dba@'192.168.%' identified by 'dbapw'; 

GRANT ALL on *.* to dba@localhost with grant option;
GRANT ALL on *.* to dba@'192.168.%' with grant option;

Exit from the client as user root and login as user dba.

Be sure to have a good and unique SERVER_ID value:

(dba@node1) [(none)]>show global variables like 'server_id';
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| server_id     |     1 | <--- Not good given the same for all nodes
+---------------+-------+
1 row in set (0.01 sec)

It’s now time to add group replication settings to the instances.

Step 2

Stop all running nodes, then in the my.cnf add:

#####################
    #Replication + binlog settings
    #####################
    auto-increment-increment                                    =1
    auto-increment-offset                                       =1

    log-bin                                             =<path_to_logs>/binlog
    log-bin-index                                       =binlog.index
    binlog-checksum                                             =NONE
    binlog-format                                               =ROW
    binlog-row-image                                            =FULL
    log-slave-updates 						=1
    binlog-transaction-dependency-tracking                      =WRITESET


    enforce-gtid-consistency                                    =TRUE
    gtid-mode                                                   =ON

    master-info-file                                            =master.info
    master-info-repository                                      =TABLE
    relay_log_info_repository                                   =TABLE
    relay-log                                            =<path_to_logs>/relay

    sync-binlog                                                 =1

    ### SLAVE SECTION
    skip-slave-start
    slave-parallel-type                                        = LOGICAL_CLOCK
    slave-parallel-workers                                      = 4
    slave-preserve-commit-order                                 = 1

    ######################################
    #Group Replication
    ######################################
    plugin_load_add                                    ='group_replication.so'
    plugin-load-add                                    ='mysql_clone.so'
    group_replication_group_name       ="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa" <-- Not good use something that will help you 
                                                                            to identify the GR transactions and from where they 
                                                                             come from IE "dc1euz1-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
    group_replication_start_on_boot                     =off
    group_replication_local_address                     = "192.168.4.81/2/3/4:33061"  <---- CHANGE THIS TO MATCH EACH NODE LOCAL IP
    group_replication_group_seeds                       = "192.168.4.81:33061,192.168.4.82:33061,192.168.4.83:33061,192.168.4.84:33061"
    group_replication_bootstrap_group                   = off
    transaction-write-set-extraction                    = XXHASH64

Restart all nodes and connect to them.

Step 3

Create a user for replication (on all nodes):

SET SQL_LOG_BIN=0;
    CREATE USER replica@'192.168.4.%' IDENTIFIED BY 'replicapw';   #<--- Please note the filter by IP is more restrictive 
    GRANT REPLICATION SLAVE ON *.* TO replica@'192.168.4.%';
    FLUSH PRIVILEGES;
    SET SQL_LOG_BIN=1;

Link the nodes with replication channel (on all nodes):

CHANGE MASTER TO MASTER_USER='replica', MASTER_PASSWORD='replicapw' FOR CHANNEL 'group_replication_recovery';

Check the current status:

(dba@node1) [(none)]>\u performance_schema
    (dba@node1) [performance_schema]>show tables like '%repl%';
    +-------------------------------------------+
    | Tables_in_performance_schema (%repl%)     |
    +-------------------------------------------+
    | replication_applier_configuration         |
    | replication_applier_filters               |
    | replication_applier_global_filters        |
    | replication_applier_status                |
    | replication_applier_status_by_coordinator |
    | replication_applier_status_by_worker      |
    | replication_connection_configuration      |
    | replication_connection_status             |
    | replication_group_member_stats            |
    | replication_group_members                 | <------------------------
    +-------------------------------------------+

   (dba@node1) [performance_schema]>select * from replication_group_members\G
CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 
   MEMBER_HOST: 
   MEMBER_PORT: 
  MEMBER_STATE: 
   MEMBER_ROLE: OFFLINE
MEMBER_VERSION: 
1 row in set (0.00 sec)

At this stage, you should be able to start the first (Primary) cluster node.

Only on GR1:

(dba@node1)[none]> SET GLOBAL group_replication_bootstrap_group=ON;
(dba@node1)[none]> START GROUP_REPLICATION;
(dba@node1)[none]> SET GLOBAL group_replication_bootstrap_group=OFF;

And then check it:

(dba@node1) [none]>select * from performance_schema.replication_group_members\G
     CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 90a353b8-e6dc-11ea-98fa-08002734ed50
   MEMBER_HOST: gr1
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.20

Once the Primary is running, connect on the second node GR2 and start Group replication:

(dba@node2) [none]>START GROUP_REPLICATION;
Query OK, 0 rows affected (4.60 sec)

Check if it registered correctly:

(dba@node2) [performance_schema]>select * from replication_group_members\G
*************************** 1. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 58ffd118-e6dc-11ea-8af8-08002734ed50
   MEMBER_HOST: gr2
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: SECONDARY
MEMBER_VERSION: 8.0.20
*************************** 2. row ***************************
  CHANNEL_NAME: group_replication_applier
     MEMBER_ID: 90a353b8-e6dc-11ea-98fa-08002734ed50
   MEMBER_HOST: gr1
   MEMBER_PORT: 3306
  MEMBER_STATE: ONLINE
   MEMBER_ROLE: PRIMARY
MEMBER_VERSION: 8.0.20

Test if replication works:

On GR1

(dba@node1) [performance_schema]>create schema test;
Query OK, 1 row affected (0.76 sec)

(dba@node1) [performance_schema]>\u test
Database changed

(dba@node1) [test]>create table test1 (`id` int auto_increment primary key);
Query OK, 0 rows affected (0.32 sec)

(dba@node1) [test]>insert into test1 values(null);
Query OK, 1 row affected (0.34 sec)

On GR2

(dba@node2) [performance_schema]>use \test
 Database changed
 (dba@node2) [test]>select * from test1;
 +----+
 | id |
 +----+
 |  1 |
 +----+
 1 row in set (0.00 sec)

Start group replication of the other two nodes GR3 and GR4:

(dba@node3) [performance_schema]>START GROUP_REPLICATION;
(dba@node4) [performance_schema]>START GROUP_REPLICATION;

Proxy Setup

Step 1

In our solution we will use two ProxySQL nodes:

  • Proxy1 192.168.4.191
  • Proxy2 192.168.4.192

First, you need to install ProxySQL on the nodes you have selected, in our case the two above.

To install the software follow the instructions in How to Install ProxySQL From the Percona RepositoryOnce you have installed the software, we first need to grant access to the ProxySQL monitor user to our Percona Server for MySQL nodes.

Create monitor user in MySQL group replication nodes:

Create monitor user in MySQL group replication nodes:
create user monitor@'192.168.4.%' identified by 'monitor';
grant usage on *.* to 'monitor'@'192.168.4.%';
grant select on sys.* to 'monitor'@'192.168.4.%';

Then define some basic variables:

update global_variables set Variable_Value='admin:admin;cluster1:clusterpass'  where Variable_name='admin-admin_credentials';
update global_variables set variable_value='cluster1' where variable_name='admin-cluster_username';
update global_variables set variable_value='clusterpass' where variable_name='admin-cluster_password';
update global_variables set Variable_Value=0  where Variable_name='mysql-hostgroup_manager_verbose';
update global_variables set Variable_Value='true'  where Variable_name='mysql-query_digests_normalize_digest_text';
update global_variables set Variable_Value='8.0.20'  where Variable_name='mysql-server_version';
update global_variables set Variable_Value='utf8'  where Variable_name='mysql-default_charset';
update global_variables set Variable_Value=300  where Variable_name='mysql-tcp_keepalive_time';
update global_variables set Variable_Value='true'  where Variable_name='mysql-use_tcp_keepalive';
update global_variables set Variable_Value='true'  where Variable_name='mysql-verbose_query_error';
update global_variables set Variable_Value='true'  where Variable_name='mysql-show_processlist_extended';
update global_variables set Variable_Value=50000  where Variable_name='mysql-max_stmts_cache';
update global_variables set Variable_Value='false'  where Variable_name='admin-web_enabled';
update global_variables set Variable_Value='0'  where Variable_name='mysql-set_query_lock_on_hostgroup';

load admin variables to run;save admin variables to disk;
load mysql variables to run;save mysql variables to disk;

The user name and password need to reflect your standards. The ones used above are just an example. Then set up the nodes as a cluster:

INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.191',6032,100,'PRIMARY');
INSERT INTO proxysql_servers (hostname,port,weight,comment) VALUES('192.168.4.192',6032,100,'SECONDARY');
load proxysql servers to run;save proxysql servers to disk;

Step 2

Define user(s), servers, and query rules to perform read/write split. Create one or more valid user(s), for instance, if you have a user named app_gr with the password test, that has access to your group replication cluster:

insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('app_gr','test',1,400,'mysql',1,'application test user GR');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;

Define servers:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.81',400,3306,10000,2000,'GR1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.81',401,3306,100,2000,'GR1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.82',401,3306,10000,2000,'GR2');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.83',401,3306,10000,2000,'GR2');        
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.84',401,3306,1,2000,'GR2');        
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

Define query rules to get read-write split:

INSERT INTO mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(4040,6033,'app_gr',400,1,3,'^SELECT.*FOR UPDATE',1);
INSERT INTO mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,multiplex,apply) values(4042,6033,'app_gr',401,1,3,'^SELECT.*$',1,1);
LOAD MYSQL QUERY RULES TO RUN;SAVE MYSQL QUERY RULES TO DISK;

Step 3

Once we have all the configuration ready, we need to have a special view in the SYS schema in our Percona server nodes. The view working for the server version 8 and above can be found here (https://github.com/Percona-Lab/group_replication_tools/blob/master/GR_sys_view_forProxysql_v1.sql

Run that sql on the PRIMARY node of the Group Replication cluster.

Step 4

Now we are ready to activate the native support for Group Replication in ProxySQL. We will use the following group definition:

Writer HG-> 400
Reader HG-> 401
BackupW HG-> 402
Offline HG-> 9401 

INSERT INTO mysql_group_replication_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) 
values (400,402,401,9401,1,1,1,100);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;

Few comments here about the parameters (for full reference see here https://proxysql.com/documentation/main-runtime#mysql_group_replication_hostgroups ). We recommend setting the number of writers always to 1, and witer_is_also_reader to 1 as well to obtain the most reliable results.

max_writers: 1
writer_is_also_reader: 1

The max_transactions_behind is a subjective parameter that you should calculate on the basis of your needs. If for instance you cannot have a stale read, it will be safe to set this value to a low number (ie 50) and to set in all Group replication nodes:

set global group_replication_consistency=AFTER;

If instead, you have no issue or strict requirements about some stale read, you can relax the parameter and ignore the group_replication_consistency setting. Our recommended setting is group_replication_consistency=AFTER and max_transactions_behind: 100.

Proxy HA

The final step is to enable High Availability for the ProxySQL layer. In this approach, we will use the well-known keepalived service. First, install the keepalived software using yum or apt-get on each ProxySQL node:

Sudo yum install -y keepalived
Or
Sudo apt-get install -y keepalived

Then modify the /etc/keepalived/keepalived.conf file accordingly to your setup. In our case:

  • Proxy1 192.168.4.0/24 dev enp0s9 proto kernel scope link src 192.168.4.191
  • Proxy2 192.168.4.0/24 dev enp0s9 proto kernel scope link src 192.168.4.192
  • VIP       192.168.4.194

We want to have the primary to be Proxy1 and the failover node to be Proxy2. Given that the config will look like:

cat /etc/keepalived/keepalived.conf 
global_defs {
  # Keepalived process identifier
  router_id  proxy_HA
}
# Script used to check if Proxy is running
vrrp_script check_proxy {
  script "killall -0 proxysql"
  interval 2
  weight 2
}
# Virtual interface
# The priority specifies the order in which the assigned interface to take over in a failover
vrrp_instance VI_01 {
  state MASTER 
  interface enp0s9
  virtual_router_id 51
  priority 100  <----- This needs to be different for each ProxySQL node, like 100/99 
 
  # The virtual ip address shared between the two load balancers
  virtual_ipaddress {
   192.168.4.194  dev enp0s9
  }
  track_script {
    check_proxy
  }
}

Once done, start the keepalived service, and from now on the VIP will be associated with the Proxy1 unless service is down.

In the system log:

proxysql1 Keepalived_vrrp[17422]: VRRP sockpool: [ifindex(4), proto(112), unicast(0), fd(10,11)]
proxysql1 Keepalived_vrrp[17422]: VRRP_Script(check_proxy) succeeded
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Transition to MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Received advert with higher priority 101, ours 100
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Entering BACKUP STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Changing effective priority from 100 to 102
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) forcing a new MASTER election
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Transition to MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Entering MASTER STATE
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) setting protocol VIPs.
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
proxysql1 Keepalived_vrrp[17422]: VRRP_Instance(VI_01) Sending/queueing gratuitous RPs on enp0s9 for 192.168.4.194
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
..
proxysql1 Keepalived_vrrp[17422]: Sending gratuitous ARP on enp0s9 for 192.168.4.194
proxysql1 avahi-daemon[989]: Registering new address record for 192.168.4.194 on enp0s9.IPv4.

Disaster Recovery Implementation

The implementation of a DR (Disaster Recovery) site will follow the same direction provided for the main site. There are only some generic rules that should be followed:

  • A DR site should be located in a different geographic location than the main site (several hundred kilometers/miles away).
  • The connection link between the main site and the DR site can only be established using asynchronous replication (standard MySQL replication setup ).

Monitoring

There are few ways to monitor a Group Replication cluster. The easiest way is to have Percona Monitoring and Management (Version 2.10 or later) deployed to do it for you. For an easy installation of Percona Monitoring and Management check out this quickstart.

Percona Monitoring and Management

The only important thing to remember is that when registering the Percona Server for MySQL node or the MySQL node, you should specify the replication_set flag.

Ie:  pmm-admin add mysql --username=pmm --password=pmm --query-source=perfschema --replication-set=gr_test_lab  group_rep4 127.0.0.1:3306

Then you can use the Group replication Dashboard and monitor your cluster with a lot of details.

The sections are:

  • Overview(3 panels)

MySQL Group Replication

  • Replication Delay Details(3 panels)

  • Transactions(8 panels)

  • Conflicts

From Command Line

From the command line you need to manually query the tables in Performance schema:

+----------------------------------------------+
| replication_applier_configuration            |
| replication_applier_filters                  |
| replication_applier_global_filters           |
| replication_applier_status                   |
| replication_applier_status_by_coordinator    |
| replication_applier_status_by_worker         |
| replication_connection_configuration         |
| replication_connection_status                |
| replication_group_member_stats               |
| replication_group_members                    |
+----------------------------------------------+

For instance, to get the lag in the number of transactions on a node:

select @last_exec:=SUBSTRING_INDEX(SUBSTRING_INDEX( @@global.GTID_EXECUTED,':',-1),'-',-1) last_executed;select  @last_rec:=SUBSTRING_INDEX(SUBSTRING_INDEX(Received_transaction_set,':',-1),'-',-1) last_received FROM performance_schema.replication_connection_status WHERE Channel_name = 'group_replication_applier'; select (@last_rec - @last_exec) as real_lag;
+---------------+
| last_executed |
+---------------+
| 125624        |
+---------------+
1 row in set, 1 warning (0.03 sec)

+---------------+
| last_received |
+---------------+
| 125624        |
+---------------+
1 row in set, 1 warning (0.00 sec)

+----------+
| real_lag |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)

Or use a more composite query:

SELECT
  conn_status.channel_name as channel_name,
  conn_status.service_state as IO_thread,
  applier_status.service_state as SQL_thread,
  conn_status.LAST_QUEUED_TRANSACTION as last_queued_transaction,
  applier_status.LAST_APPLIED_TRANSACTION as last_applied_transaction,
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP - 
                            LAST_APPLIED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 'rep delay (sec)',
  LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP - 
                           LAST_QUEUED_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP 'transport time', 
  LAST_QUEUED_TRANSACTION_END_QUEUE_TIMESTAMP - 
                           LAST_QUEUED_TRANSACTION_START_QUEUE_TIMESTAMP 'time RL',
  LAST_APPLIED_TRANSACTION_END_APPLY_TIMESTAMP - 
                           LAST_APPLIED_TRANSACTION_START_APPLY_TIMESTAMP 'apply time',  
  if(GTID_SUBTRACT(LAST_QUEUED_TRANSACTION, LAST_APPLIED_TRANSACTION) = "","0" ,        abs(time_to_sec(if(time_to_sec(APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP)=0,0,timediff(APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP,now()))))) `lag_in_sec` 
FROM
  performance_schema.replication_connection_status AS conn_status
JOIN performance_schema.replication_applier_status_by_worker AS applier_status
  ON applier_status.channel_name = conn_status.channel_name
ORDER BY lag_in_sec, lag_in_sec desc\G

Which will provide information about each applier:

*************************** 1. row ***************************
channel_name: group_replication_applier
IO_thread: ON
SQL_thread: ON
last_queued_transaction: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:125624
last_applied_transaction: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:125621
rep delay (sec): 3.153038
transport time: 0.061327
time RL: 0.001005
apply time: 0.388680
lag_in_sec: 0

As you can see, Percona Monitoring and Management will give you a better view without compromising the details. 

Conclusions

Using these steps and recommendations, you can set up database infrastructure with high availability based on group replication and use Percona Monitoring and Managemen to monitor the infrastructure’s performance and health. 

Keep in mind that we are constantly working on making our recommendations better. As such, what is illustrated here is subject to changes and revision especially on the basis of the increasing adoption of Group Replication. This is because the more GR is used the more edge cases or deviation we will identify. Those are a significant help for us to refine our best practices.

Apr
12
2021
--

Replay the Execution of MySQL With RR (Record and Replay)

MySql Record and Replay

MySql Record and ReplayChasing bugs can be a tedious task, and multi-threaded software doesn’t make it any easier. Threads will be scheduled at different times, instructions will not have deterministic results, and in order for one to reproduce a particular issue, it might require the exact same threads, doing the exact same work, at the exact same time. As you can imagine, this is not straightforward.

Let’s say your database is crashing or even having a transient stall.  By the time you get to it, the crash has happened and you are stuck restoring service quickly and doing after-the-fact forensics.  Wouldn’t it be nice to replay the work from right before or during the crash and see exactly what was happening?

Record and Replay is a technique where we record the execution of a program allowing it to be replayed over and over producing the same result. Engineers at Mozilla have created RR, and basically, this open source tool allows you to record the execution of the software and replay it under the well-known GDB.

A Backup Problem

To demonstrate how powerful the tool is, we will be walking through how we used it to narrow down the issue from PXB-2180 (Special thanks to Satya Bodapati, who helped with all the InnoDB internals research for this bug). 

In summary, we were seeing Percona XtraBackup crashing at the prepare stage (remember, always test your backup!). The crash was happening randomly, sometimes after the second incremental, sometimes after the 10th incremental, with no visible pattern.

The stack trace was also not always the same. It was crashing on different parts of InnoDB, but here we had one commonality from all crashes – it always happened while trying to apply a redo log record to the same block page and space id:

#12 0x00000000015ad05f in recv_parse_or_apply_log_rec_body (type=MLOG_COMP_REC_INSERT, ptr=0x7f2849150556 "\003K4G", '\377' <repeats 13 times>, end_ptr=0x7f2849150573 "", space_id=<optimized out>, page_no=<optimized out>, block=0x7f2847d7da00, mtr=0x7f286857b4f0, parsed_bytes=18446744073709551615) at /home/marcelo.altmann/percona-xtrabackup/storage/innobase/log/log0recv.cc:2002
2002         ptr = page_cur_parse_insert_rec(FALSE, ptr, end_ptr, block, index, mtr);
(gdb) p block->page->id
+p block->page->id
$3 = {
  m_space = 4294967294,
  m_page_no = 5
}

Our suspicion was that the page layout on this block diverged between MySQL and XtraBackup. When working with these types of bugs, the crash is always the consequence of something that happened earlier, eg.: a crash on the sixth incremental backup could be the consequence of an issue that happened on the fourth incremental. 

The main goal at this step is to prove and identify where the page layout has diverted.

With this information, we ran MySQL under RR and reran the backup until we saw the same issue at prepare. We can now replay the MySQL execution and check how it compares. Our idea is to:

  1. Read the LSNs for this same page before/after each backup prepare.
  2. Identify all changes to  m_space = 4294967294 & m_page_no = 5 at mysqld.

Before we progress further, let’s explain a few things:

  1. m_space = 4294967294 correspond to the MySQL data dictionary (mysql.ibd) – dict0dict.h:1146
  2. On disk page, LSN is stored at the 16th byte of the page and has a size of 8 bytes – fil0types.h:66
  3. Pages are written sequentially to disk, as an example, for the default 16k page size, from bytes 1 to 16384 will have the data for page 0, from byte 16385 to 32768 data from page 1, and so on. 
  4. Frame is raw data of a page – buf0buf.h:1358

Replaying the Execution

To start, let’s read what LSN we have on mysql.ibd for page five before the backup. We will be using od (check man od for more information) and the information explained above:

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 10 21 85
0240030

And check if it matches an LSN stamp from mysqld. For that we will add a conditional breakpoint on the replay execution of MySQL at function buf_flush_note_modification:

$ rr replay .
. . .
(rr) b buf_flush_note_modification if block->page->id->m_space == 4294967294 && block->page->id->m_page_no == 5
+b buf_flush_note_modification if block->page->id->m_space == 4294967294 && block->page->id->m_page_no == 5
Breakpoint 1 at 0x495beb1: file /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic, line 69.
(rr) c
[Switching to Thread 18839.18868]

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=17892965, end_lsn=17893015, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$1 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x10,
  [0x6] = 0x21,
  [0x7] = 0x85}
(rr)

We can see the LSN stamp from before the preparation of full backup and the first stamp from the replay session match. Time to prepare the backup, advance the replay execution, and recheck:

xtrabackup --prepare --apply-log-only --target-dir=full/
. . .
Shutdown completed; log sequence number 17897577
Number of pools: 1
210402 17:46:29 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 11 07 06
0240030


(rr) c
+c
Continuing.
[Switching to Thread 18839.18868]

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=19077332, end_lsn=19077382, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$16 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x11,
  [0x6] = 0x7,
  [0x7] = 0x6}
(rr)

Same LSN stamp on both, server and backup. Time to move on and start to apply the incrementals:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc1/
. . .
Shutdown completed; log sequence number 19082430
. . .
210402 18:12:20 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 23 19 06
0240030


(rr) c
+c
Continuing.
Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=20262758, end_lsn=20262808, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$17 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x23,
  [0x6] = 0x19,
  [0x7] = 0x6}
(rr)

Once again, we have a matching LSN stamp on both sides. Moving to the next incremental:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc2/
. . .
Shutdown completed; log sequence number 20269669
. . .
210402 18:15:04 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 35 2f 98
0240030


(rr) c
+c
Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=21449997, end_lsn=21450047, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$18 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x35,
  [0x6] = 0x2f,
  [0x7] = 0x98}
(rr)

Incremental two applied and matching LSN stamp from mysqld. Let’s keep doing this until we find a mismatch:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc3/
. . .
Shutdown completed; log sequence number 21455916
. . .
210402 18:18:25 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 47 4d 3f
0240030


(rr) c
+c
Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=25529471, end_lsn=25529521, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$19 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x47,
  [0x6] = 0x4d,
  [0x7] = 0x3f}
(rr)


xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc4/
. . .
Shutdown completed; log sequence number 23044902
. . .
210402 18:24:00 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 5f a3 26
0240030


(rr) c
+c
Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=27218464, end_lsn=27218532, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$242 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x9f,
  [0x6] = 0x3f,
  [0x7] = 0xc9}
(rr)

Ok, here we have something. Backup files jumped from 0x01474d3f to 0x015fa326  when applying incremental four while the server moved from 0x01474d3f to 0x019f3fc9 . Perhaps we missed some other place where we can update the LSN stamp of a page? But now, we are at a point in the future with our replay execution of the MySQL server.

Replaying the Execution Backward

Here is (yet) another very cool feature from RR, it allows you to replay the execution backward. To eliminate the possibility of missing a place that is also updating the LSN of this block, let’s add a hardware watchpoint on the block->frame memory address and reverse the execution:

(rr) p block->frame
+p block->frame
$243 = (unsigned char *) 0x7fd2e0758000 "\327\064X["
(rr) watch *(unsigned char *) 0x7fd2e0758000
+watch *(unsigned char *) 0x7fd2e0758000
Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000
(rr) disa 1
+disa 1
(rr) reverse-cont
+reverse-cont
+continue
Continuing.
Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

Old value = 215 '\327'
New value = 80 'P'

0x0000000004c13903 in mach_write_to_4 (b=0x7fd2e0758000 "P\257\"\347", n=3610531931) at /home/marcelo.altmann/percona-server/storage/innobase/include/mach0data.ic:135
135   b[0] = static_cast<byte>(n >> 24);
++rr-set-suppress-run-hook 1
++rr-set-suppress-run-hook 1
(rr) p/x buf_flush_init_for_writing::block->frame[16]@8
+p/x buf_flush_init_for_writing::block->frame[16]@8
$11 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x9f,
  [0x6] = 0x3f,
  [0x7] = 0xc9}
(rr) reverse-cont
+reverse-cont
+continue
Continuing.
Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

Old value = 80 'P'
New value = 43 '+'
0x0000000004c13903 in mach_write_to_4 (b=0x7fd2e0758000 "+k*\304", n=1353655015) at /home/marcelo.altmann/percona-server/storage/innobase/include/mach0data.ic:135
135   b[0] = static_cast<byte>(n >> 24);
++rr-set-suppress-run-hook 1
++rr-set-suppress-run-hook 1
(rr) p/x buf_flush_init_for_writing::block->frame[16]@8
+p/x buf_flush_init_for_writing::block->frame[16]@8
$12 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x47,
  [0x6] = 0x4d,
  [0x7] = 0x3f}
(rr)

By replaying the execution backward we can see that indeed the server changed the LSN  from 0x01474d3f to 0x019f3fc9. This confirms the issue is at incremental backup four as the LSN 0x015fa326 we see at end of incremental four was never a valid LSN at the server execution.

Root Cause

Now that we have limited the scope from six backups to a single one, things will become easier.

If we look closely at the log messages from the –prepare of the backup we can see that LSN of mysql.ibd matches the LSN stamp at the end of the backup:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc4/
. . .
Shutdown completed; log sequence number 23044902
. . .
210402 18:24:00 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 5f a3 26
0240030


$ echo $(( 16#015fa326 ))
23044902

By checking the stack trace of the issue and examining further the block we have parsed we can see that this is innodb_dynamic_metadata index:

(gdb) f 13
+f 13
#13 0x00000000015af3dd in recv_recover_page_func (just_read_in=just_read_in@entry=true, block=block@entry=0x7f59efd7da00) at /home/marcelo.altmann/percona-xtrabackup/storage/innobase/log/log0recv.cc:2624
2624       recv_parse_or_apply_log_rec_body(recv->type, buf, buf + recv->len,
(gdb) p/x block->frame[66]@8
+p/x block->frame[66]@8
$4 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x0,
  [0x5] = 0x0,
  [0x6] = 0x0,
  [0x7] = 0x2}

You might be wondering where 66 came from; this is from examining position FIL_PAGE_DATA + PAGE_INDEX_ID. That gave us index ID 2. This is below 1024, which is reserved for Data Dictionary tables. By checking what is the second table on that list, we can see that it’s innodb_dynamic_metadata. With all this information summed up we can look at what the server does at shutdown, and it becomes clear what the issue is:
srv0start.cc:3965

/** Shut down the InnoDB database. */
void srv_shutdown() {
  . . .

  /* Write dynamic metadata to DD buffer table. */
  dict_persist_to_dd_table_buffer();
. . .
}

As part of the shutdown process, we are persisting dirty metadata back to the DD Buffer table (innodb_dynamic_metadata), which is wrong. Those changes will likely be persisted by the server and redo logged once the server performs a checkpoint. Also, more data can be merged together by the point of when the backup was taken and when the server itself persists this data to DD Tables. This is a result of the implementation of WL#7816 and WL#6204 which required Percona XtraBackup to change how it handles these types of redo records.

Summary

In this blog, we walked through the process of analyzing a real Percona XtraBackup bug. This bug exposes a challenge we face in various types of bugs, where the crash/malfunction is a consequence of something that happened way before, and by the time we have a stack trace/coredump, it is too late to perform a proper analysis. Record and Replay enabled us to consistently replay the execution of the source server, making it possible to narrow down the issue to where the root cause was. 


Percona XtraBackup is a free, open source, complete database backup solution for all versions of Percona Server for MySQL and MySQL

Apr
02
2021
--

Percona XtraBackup Point-In-Time Recovery for the Single Database

Percona XtraBackup Point-In-Time Recovery

Percona XtraBackup Point-In-Time RecoveryRecovering to a particular time in the past is called Point-In-Time Recovery (PITR). With PITR you can rollback unwanted DELETE without WHERE clause or any other harmful command.

PITR with Percona XtraBackup is pretty straightforward and perfectly described in the user manual. You need to restore the data from the backup, then apply all binary logs created or updated after the backup was taken, but skip harmful event(s).

However, if your data set is large you may want to recover only the affected database or table. This is possible but you need to be smart when filtering events from the binary log. In this post, I will show how to perform such a partial recovery using Percona XtraBackup, mysql command-line client, and mysqlbinlog programs only. There is an alternative approach that involves creating a fake source server, that is described in MySQL Point in Time Recovery the Right Way. You may consider it, especially if you need to apply changes to a single table.

Percona XtraBackup Point-In-Time Recovery

For our example we will create data first, then run DROP and DELETE commands on two different tables. Then we will rollback these commands.

First, let’s assume we have a server with two databases: test and sbtest. We are using GTIDs and row-based binary log format. We also run the server with the option innodb_file_per_table=1 and all our InnoDB tables use individual tablespaces. Otherwise, the individual restore method would not work.

mysql> show tables from sbtest;
+------------------+
| Tables_in_sbtest |
+------------------+
| sbtest1          |
| sbtest2          |
| sbtest3          |
| sbtest4          |
| sbtest5          |
| sbtest6          |
| sbtest7          |
| sbtest8          |
+------------------+
8 rows in set (0.00 sec)

mysql> show tables from test;
+----------------+
| Tables_in_test |
+----------------+
| bar            |
| baz            |
| foo            |
+----------------+
3 rows in set (0.00 sec)

We will experiment with tables foo and bar. We assume that at the time of our first backup, each of the tables contained five rows. Tables in the database sbtest also contain data, but it does not really matter for our experiment.

mysql> select count(*) from foo;
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from bar;
+----------+
| count(*) |
+----------+
| 5        |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from baz;
+----------+
| count(*) |
+----------+
| 0        |
+----------+
1 row in set (0.00 sec)

Since we want to restore individual tables, we need to make a preparation before taking a backup: store database structure. We will do it with help of the mysqldump command. In this example, I store structure per database to make partial PITR easier, but you are free to use the option --all-databases.

mysqldump --no-data --set-gtid-purged=OFF --triggers --routines --events test > test_structure.sql
mysqldump --no-data --set-gtid-purged=OFF --triggers --routines --events sbtest > sbtest_structure.sql

Then we are ready to take the backup.

xtrabackup --parallel=8 --target-dir=./full_backup --backup

I am using the option --parallel to speed up the backup process.

Now let’s do some testing. First, let’s update rows in the table foo.

mysql> update foo set f1=f1*2;
Query OK, 5 rows affected (0.01 sec)
Rows matched: 5 Changed: 5 Warnings: 0

mysql> select * from foo;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
+----+------+
5 rows in set (0.00 sec)

And then drop it and delete all rows from the table bar.

mysql> drop table foo;
Query OK, 0 rows affected (0.02 sec)

mysql> delete from bar;
Query OK, 5 rows affected (0.01 sec)

Finally, let’s insert a few rows into the tables bar and baz.

mysql> insert into bar(f1) values(6),(7),(8),(9),(10);
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> insert into baz(f1) values(1),(2),(3),(4),(5);
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

Assume that the DROP TABLE and DELETE command was an accident and we want to restore the state of the tables foo and bar as they were before these unlucky statements.

First, we need to prepare the backup.

Since we are interested in restoring only tables in the database test we need to prepare the backup with a special option --export that exports tablespaces in a way that they could be later imported:

xtrabackup --prepare --export --target-dir=./full_backup

Now the directory for the database test contains not only table definition files (.frm, only before 8.0) and tablespace files (.ibd) but also configuration files (.cfg).

Since we want all changes that happened after backup and before the problematic DROP TABLE and DELETE statements were applied, we need to identify which binary log and position were actual at the backup time. We can find it in the xtrabackup_binlog_info file:

$ cat full_backup/xtrabackup_binlog_info
master-bin.000004 1601 0ec00eed-87f3-11eb-acd9-98af65266957:1-56

Now we are ready to perform restore.

First, let’s restore the table foo from the backup. Restoring individual tablespaces requires the ALTER TABLE ... IMPORT TABLESPACE command. This command assumes that the table exists in the server. However, in our case, it was dropped and therefore we need to re-create it.

We will recreate the full database test from the file test_structure.sql

Since we do not want these administrative tasks to be re-applied, I suggest disabling binary logging for the session which will recreate the database structure.

mysql> set sql_log_bin=0;
Query OK, 0 rows affected (0.00 sec)

mysql> source test_structure.sql
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)
....

Once tables are recreated discard their tablespaces. I will show an example for the table foo. Adjust the code for the rest of the tables.

mysql> alter table foo discard tablespace;
Query OK, 0 rows affected (0.01 sec)

Then, in another terminal, copy the tablespace and configuration files from the backup to the database directory:

cp full_backup/test/foo.{ibd,cfg} var/mysqld.1/data/test/

And, finally, import the tablespace:

mysql> alter table foo import tablespace;
Query OK, 0 rows affected (0.05 sec)

Repeat for the other tables in the database test.

Now you can enable binary logging back.

You can do the same task in a script. For example:

for table in `mysql test --skip-column-names --silent -e "show tables"`
> do
>   mysql test -e "set sql_log_bin=0; alter table $table discard tablespace"
>   cp full_backup/test/$table.{ibd,cfg} var/mysqld.1/data/test/
>   mysql test -e "set sql_log_bin=0; alter table $table import tablespace"
> done

Our tables are recovered but do not have the updates made after the backup.

mysql> select * from foo;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
+----+------+
5 rows in set (0.00 sec)

mysql> select * from bar;
+----+------+
| id | f1   |
+----+------+
|  1 |    1 |
|  2 |    2 |
|  3 |    3 |
|  4 |    4 |
|  5 |    5 |
+----+------+
5 rows in set (0.00 sec)

mysql> select * from baz;
Empty set (0.00 sec)

Therefore, we need to restore data from the binary logs.

To do it we first need to identify the GTID of the disaster event. It can be done if we dump all binary logs updated and created after backup into a dump file and then search for the DROP TABLE and DELETE statements and skipping them.

First, let’s check which binary logs do we have.

mysql> show binary logs;
+-------------------+-----------+
| Log_name          | File_size |
+-------------------+-----------+
| master-bin.000001 |   1527476 |
| master-bin.000002 |      3035 |
| master-bin.000003 |      1987 |
| master-bin.000004 |      2466 |
| master-bin.000005 |       784 |
+-------------------+-----------+
5 rows in set (0.00 sec)

So we need to parse them, starting from the log master-bin.000004 and position 1601:

mysqlbinlog --start-position=1601 -vvv --base64-output=decode-rows --database=test master-bin.000004 master-bin.000005 > binlog_test.sql

I used options -vvv that prints SQL representation of row events, so we can find the one which we want to skip and --base64-output=decode-rows to not print row events at all. We will not use this file for the restore, only for searching the DROP TABLE and DELETE events.

Here they are, at the positions 2007 and 2123, with GTID 0ec00eed-87f3-11eb-acd9-98af65266957:58 and 0ec00eed-87f3-11eb-acd9-98af65266957:59

# at 2007
#210321 13:29:58 server id 1 end_log_pos 2123 CRC32 0xd1eb9854 Query thread_id=138 exec_time=0 error_code=0
use `test`/*!*/;
SET TIMESTAMP=1616322598/*!*/;
DROP TABLE `foo` /* generated by server */
/*!*/;
# at 2123
#210321 13:30:08 server id 1 end_log_pos 2188 CRC32 0xfc9b2088 GTID last_committed=7 sequence_number=8 rbr_only=yes original_committed_timestamp=0 immediate_commit_timestamp=0 transaction_length=0
/*!50718 SET TRANSACTION ISOLATION LEVEL READ COMMITTED*//*!*/;
# original_commit_timestamp=0 (1970-01-01 02:00:00.000000 EET)
# immediate_commit_timestamp=0 (1970-01-01 02:00:00.000000 EET)
/*!80001 SET @@session.original_commit_timestamp=0*//*!*/;
/*!80014 SET @@session.original_server_version=0*//*!*/;
/*!80014 SET @@session.immediate_server_version=0*//*!*/;
SET @@SESSION.GTID_NEXT= '0ec00eed-87f3-11eb-acd9-98af65266957:59'/*!*/;
# at 2188
#210321 13:30:08 server id 1 end_log_pos 2260 CRC32 0x1d525b11 Query thread_id=138 exec_time=0 error_code=0
SET TIMESTAMP=1616322608/*!*/;
BEGIN
/*!*/;
# at 2260
#210321 13:30:08 server id 1 end_log_pos 2307 CRC32 0xb57ecb73 Table_map: `test`.`bar` mapped to number 226
# at 2307
#210321 13:30:08 server id 1 end_log_pos 2387 CRC32 0x6770a7e2 Delete_rows: table id 226 flags: STMT_END_F
### DELETE FROM `test`.`bar`
### WHERE
### @1=1 /* INT meta=0 nullable=0 is_null=0 */
### @2=1 /* INT meta=0 nullable=1 is_null=0 */
### DELETE FROM `test`.`bar`
### WHERE
...

Note that decoded row event contains a DELETE command for each affected row.

We may also find to which binary log this event belongs if search for the "Rotate to" event. In our case “Rotate to master-bin.000005” happened after the found positions, so we only need file master-bin.000004 In your case, you may need to skip events from the previous log files too.

So to restore the data we need to run mysqlbinlog one more time, this time with parameters:

mysqlbinlog  --start-position=1601 --exclude-gtids=0ec00eed-87f3-11eb-acd9-98af65266957:58-59 --database=test --skip-gtids=true master-bin.000004 master-bin.000005 > binlog_restore.sql

I removed options -vvvbecause we are not going to examine this restore file and option --base64-output=decode-rows because we need row events to present in the resulting file. I also used option --exclude-gtids=0ec00eed-87f3-11eb-acd9-98af65266957:58-59 to exclude GTIDs that we do not want to re-apply. We also need to use --skip-gtids=true because otherwise updates will be skipped since such GTIDs already exist on the server.

Now binlog_restore.sql contains all updates to the database test made after the backup and before the DROP statement. Let’s restore it.

mysql test < binlog_restore.sql

Restore went successfully. Our tables have all past updates.

mysql> select * from foo;
+----+------+
| id | f1   |
+----+------+
|  1 |    2 |
|  2 |    4 |
|  3 |    6 |
|  4 |    8 |
|  5 |   10 |
+----+------+
5 rows in set (0.01 sec)

mysql> select count(*) from bar;
+----------+
| count(*) |
+----------+
|       10 |
+----------+
1 row in set (0.00 sec)

mysql> select count(*) from baz;
+----------+
| count(*) |
+----------+
|        5 |
+----------+
1 row in set (0.00 sec)

Conclusion

You may save the time required for PITR if use the per-database restore method. However, you need to take into account the following considerations:

  • mysqlbinlog does not support filtering per table, therefore you either need to restore the full database or use a fake server method, described in MySQL Point in Time Recovery the Right Way.
  • Per-database filters depend on the USE statement in the statement-based binary log format. Therefore option --database can only be considered safe with a row-based format.
  • If you do not use GTID you still can use this method. You will need to combine options --start-position and --stop-position to skip the event.

Percona XtraBackup is a free, open source database backup solution for Percona Server for MySQL and MySQL.

Apr
02
2021
--

Upgrading to MySQL 8: Embrace the Challenge

Upgrading to MySQL 8

Upgrading to MySQL 8Nobody likes change, especially when that change may be challenging.  When faced with a technical challenge, I try to remember this comment from Theodore Roosevelt: “Nothing in the world is worth having or worth doing unless it means effort, pain, difficulty.”  While this is a bit of an exaggeration, in this case, the main concept is still valid.  We shouldn’t shy away from an upgrade path because it may be difficult.

MySQL 8.0 is maturing and stabilizing.  There are new features (too many to list here) and performance improvements.  More and more organizations are upgrading to MySQL 8 and running it in production, which expedites the stabilization.  While there is still some significant runway on 5.7 and it is definitely stable (EOL slated for October 2023), organizations need to be preparing to make the jump if they haven’t already. 

What Changed?

So how is a major upgrade to 8.0 different than in years past?  It honestly really isn’t that different.  The same general process applies:

  1. Upgrade in a lower environment
  2. Test, test, and then test some more
  3. Upgrade a replica and start sending read traffic
  4. Promote a replica to primary
  5. Be prepared for a rollback as needed

The final bullet point is the biggest change, especially once you have finished upgrading to MySQL 8.  Historically, minor version upgrades were fairly trivial.  A simple instance stop, binary swap, and instance start were enough to revert to a previous version.  

In 8.0, this process is no longer supported (as noted in the official documentation): 

“Downgrade from MySQL 8.0 to MySQL 5.7, or from a MySQL 8.0 release to a previous MySQL 8.0 release, is not supported.”

That is a definite change in the release paradigm, and it has shown real issues across minor releases.  One good example of how this can impact a live system was captured well in the blog post MySQL 8 Minor Version Upgrades Are ONE-WAY Only from early 2020. 

How to Approach Upgrading to MySQL 8

With this new paradigm, it may seem scary to push through with the upgrade.  I think in some ways, it can be a positive change.  As mentioned above, proper preparation and testing should be the majority of the process.  The actual upgrade/cutover should essentially be a non-event.  There is nothing a DBA loves more than finalizing an upgrade with nobody noticing (aside from any potential improvements).

Unfortunately, in practice, proper testing and preparation is generally an afterthought.  With how easy upgrades (and particularly rollbacks) have been, it was generally easier to just “give it a shot and roll back if needed”.  As downgrades are no longer trivial, this should be viewed as a golden opportunity to enhance the preparation phase of an upgrade.  

Some extra focus should be given to:

  • Reviewing the release notes in detail for any potential changes (new features are also sometimes enabled by default in 8.0)
  • Testing the system with REAL application traffic (benchmarks are nice, but mean nothing if they are generic)
  • Solidifying the backup and restore process (this is already perfect, right?)
  • Reviewing automation (this means no more automated “silent” upgrades)
  • The actual upgrade process (starting at the bottom of the replication chain and maintaining replicas for rollbacks if needed)

With so much emphasis on the preparation, we should hopefully start to see the actual upgrade become less impactful.  It should also instill more confidence across the organization.

What About “At Scale”?

Having worked with a wide range of clients as a TAM, I’ve seen environments that range from a single primary/replica pair to 1000s of servers.  I will freely admit that completing a major version upgrade across 10,000 servers is non-trivial.  Nothing is more painful than needing to rollback 5000 servers when something crops up halfway through the upgrade process.  While no amount of testing can completely eliminate this possibility, we can strive to minimize the risk.

At this scale, testing actual traffic patterns is so much more critical.  When you are looking at complex environments and workloads, the likelihood of hitting an edge case definitely increases.  Identifying those edge cases in lower environments is critical for a smooth process in production.  Similarly, ensuring processes and playbooks exist for rolling back (in the event an issue does appear) is critical.  

Finally, deploying upgrades in phases is also critical.  Assuming you have monitoring such as Percona Monitoring and Management in place, A/B testing and comparison can be invaluable.  Seeing version X and Y on the same dashboard while serving the same traffic allows a proper comparison.  Comparing X in staging to Y in production is important, but can sometimes be misleading.  

Conclusion

Overall, upgrading to MySQL 8 isn’t that different from previous versions.  Extra care needs to be taken during the preparation phase, but that should be viewed as a positive overall.  We definitely shouldn’t shy away from the change, but rather embrace it as it needs to happen eventually.  The worst thing that can happen is to continue to kick the can down the road and then be pressed for time as 5.7 EOL approaches.

To solidify the preparation phase and testing, what tools do you think are missing?  What would make it easier to accurately replay traffic against test instances?  While there are tools available, is there anything that would help ensure these best practices are followed?

If your organization needs help in preparing for or completing an upgrade, our Professional Services team can be a great asset.  Likewise, our Support engineers can help your team as you hit edge cases in testing.  

And, finally, the most important strategy when it comes to upgrades: read-only Friday should be at the top of your playbook!

Apr
01
2021
--

Easily Validate Configuration Settings in MySQL 8

Validate Configuration Settings in MySQL

Validate Configuration Settings in MySQLIn past versions of MySQL, there was often an ‘upgrade dance’ that had to be performed in starting up a newly upgraded MySQL instance with the previous version configuration file. In some cases a few deprecated options might no longer be supported in the newer server version, triggering an error and a subsequent shutdown moments after starting. The same thing can happen even outside of upgrade scenarios if a configuration change was made with a mistake or typo in the variable name or value.

As of MySQL 8.0.16 and later, there is now a ‘validate-config’ option to quickly test and validate server configuration options without having to start the server. Once used, if no issues are found with the configuration file, the server will exit with an exit code of zero (0). If a problem is found, the server will exit with an error code of one (1) for the first occurrence of anything that is determined to be invalid.

Validating A Single Option

For example, consider the following server option (old_passwords) which was removed in MySQL 8:

./mysqld --old_passwords=1 --validate-config
2021-03-25T17:56:49.932782Z 0 [ERROR] [MY-000067] [Server] unknown variable 'old_passwords=1'.
2021-03-25T17:56:49.932862Z 0 [ERROR] [MY-010119] [Server] Aborting

Note the server exited with an error code of one (1) and is showing the error for the invalid option.

Validating A Configuration File

It is also possible to validate an entire my.cnf configuration file to check all options:

./mysqld --defaults-file=/home/sandbox/my.cnf --validate-config
2021-03-25T18:03:41.938734Z 0 [ERROR] [MY-000067] [Server] unknown variable 'old_passwords=1'.
2021-03-25T18:03:41.938865Z 0 [ERROR] [MY-010119] [Server] Aborting

Note that the server exited on the first occurrence of an invalid value. Any remaining errors in the configuration file will need to be found after correcting the first error and running the validate-config option again. So in this example, I’ve now removed the ‘old_passwords=1’ option in the configuration file, and I need to run validate-config again to see if there are any other errors remaining:

./mysqld --defaults-file=/home/sandbox/my.cnf --validate-config
2021-03-25T18:08:28.612912Z 0 [ERROR] [MY-000067] [Server] unknown variable 'query_cache_size=41984'.
2021-03-25T18:08:28.612980Z 0 [ERROR] [MY-010119] [Server] Aborting

Indeed, there is yet another option that was removed from MySQL 8 in the configuration file, so after running the validate again (after fixing the first issue), we’ve now identified a second problem. After removing the query_cache_size option from the my.cnf file and running validate-config again, we finally get a clean bill of health:

./mysqld --defaults-file=/home/sandbox/my.cnf --validate-config

Change Validation Verbosity

By default, the validate-config option will only report error messages. If you are also interested in seeing any warnings or informational messages, you can change the log_error_verbosity with a value that is greater than one (1):

./mysqld --defaults-file=/home/sandbox/my.cnf --log-error-verbosity=2 --validate-config
2021-03-25T18:20:01.380727Z 0 [Warning] [MY-000076] [Server] option 'read_only': boolean value 'y' was not recognized. Set to OFF.

Now we are seeing warning level messages, and the server is exiting with an error code of zero (0) as there are technically no errors, only warnings. Taking this further, I’ve added the query_cache_size option from above back in the my.cnf file, and if we run the validate again we see both errors and warnings this time, while the server exits with an error code of one (1) as there really was an error:

./mysqld --defaults-file=/home/sandbox/my.cnf --log-error-verbosity=2 --validate-config
2021-03-25T18:23:11.364288Z 0 [Warning] [MY-000076] [Server] option 'read_only': boolean value 'y' was not recognized. Set to OFF.
2021-03-25T18:23:11.372729Z 0 [ERROR] [MY-000067] [Server] unknown variable 'old_passwords=1'.
2021-03-25T18:23:11.372795Z 0 [ERROR] [MY-010119] [Server] Aborting

Older Version Alternative

During my testing of the validate_config feature, a colleague pointed out that there is a way to replicate this validation on older MySQL versions by using a combination of ‘help’ and ‘verbose’ options as below:

$ mysqld --defaults-file=/tmp/my.cnf --verbose --help 1>/dev/null; echo $?
0
 
$ echo default_table_encryption=OFF >> /tmp/my.cnf
 
$ mysqld --defaults-file=/tmp/my.cnf --verbose --help 1>/dev/null; echo $?
2021-03-26T08:43:04.987265Z 0 [ERROR] unknown variable 'default_table_encryption=OFF'
2021-03-26T08:43:04.990898Z 0 [ERROR] Aborting
1
 
$ mysqld --version
mysqld  Ver 5.7.33-36 for Linux on x86_64 (Percona Server (GPL), Release 36, Revision 7e403c5)

Closing Thoughts

While not perfect, the validate-config feature goes a long way towards making upgrades and configuration changes easier to manage. It is now possible to know with some certainty that your configuration files or options are valid prior to restarting the server and finding an issue that ends up keeping your node from starting normally leading to unexpected downtime.


Percona Distribution for MySQL: An Enterprise-Grade MySQL Solution, for Free

Mar
31
2021
--

Webinar April 14: Optimize and Troubleshoot MySQL Using Percona Monitoring and Management

Troubleshoot MySQL Using Percona Monitoring and Management

Troubleshoot MySQL Using Percona Monitoring and ManagementOptimizing MySQL performance and troubleshooting MySQL problems are two of the most critical and challenging tasks for MySQL DBAs. The databases powering applications need to be able to handle changing traffic workloads while remaining responsive and stable in order to deliver an excellent user experience. Further, DBAs are also expected to find cost-efficient means of solving these issues.

In this webinar, we will demonstrate the advanced options of Percona Monitoring and Management V.2 that enable you to solve these challenges, which are built on free and open-source software. We will look at specific, common MySQL problems and review them.

Please join Peter Zaitsev on Wednesday, April 14th, 2021, at 11 am EDT for his webinar Optimize and Troubleshoot MySQL using Percona Monitoring and Management (PMM).

Register for Webinar

If you can’t attend, sign up anyway, and we’ll send you the slides and recording afterward.

Mar
31
2021
--

MySQL 101: Using super_read_only

MySQL 101 super_read_only

MySQL 101 super_read_onlyAs many of you may remember, Percona added the super_read_only feature way back in Percona Server for MySQL 5.6.21, based on work done by WebScaleSQL. This feature eventually found its way into the Community branch of MySQL starting with 5.7.8, and it works the same in both cases. While this is now old news, over the last year I’ve had a couple of inquiries from clients around super_read_only usage in MySQL, and how it works in practice. While the usage of super_read_only is not complex, there is a small caveat that occasionally leads to some confusion around its use. As such, I thought it may be a good idea to write a quick blog post explaining this feature a bit more, and expanding on how it interacts with read_only.

What is super_read_only?

For those unfamiliar, what is super_read_only? Prior to its introduction, MySQL had the option to set a node to read_only, preventing everyone except those with the SUPER privilege from writing to the database. Most often used for replica nodes, it was a good step in preventing someone from inadvertently updating a replica manually without going through the primary node and letting replication threads handle the distribution. This, of course, could break replication due to duplicate keys, missing rows, or other issues as a result of the inconsistency between the datasets.

Using super_read_only takes this one step further, behaving identically to read_only while also blocking those with SUPER privileges from writing to the database as well. While at first glance this may seem like a stop-gap measure in lieu of better and more restrictive user permissions, it has proven very handy in production environments to add a further layer of protection to replica nodes, and helping to prevent human error from causing unexpected downtime.

One Hand Washes The Other

The inquiries I referenced earlier comes with its use, and not realizing (or forgetting) that read_only and super_read_only are linked. The key thing to keep in mind is that enabling super_read_only implies regular read_only as well.

  • Setting read_only = OFF also sets super_read_only = OFF.
  • Setting super_read_only = ON also sets read_only = ON.
  • All other changes to either of these variables have no effect on the other.

This behavior does seem logical, as when disabling read_only you probably also want to disable super_read_only, and vice-versa, when enabling super_read_only you probably also want to enable read_only. While this linked behavior is documented, there are no warnings or notes in MySQL itself alerting you to this when you make a change to one or the other. This can lead to some head-scratching, as a change to one variable changes the other in lockstep.

Final Considerations

There are a few other implications to keep in mind for read_only that apply to super_read_only as well – for instance, operations on temporary tables are allowed no matter how these variables are set. Any OPTIMIZE TABLE or ANALYZE TABLE operations are also allowed since the purpose of read_only / super_read_only is to prevent changes to the table structure, but not to table metadata such as index statistics. Finally, these settings will need to be manually disabled (or scripted via automation) if there is ever an instance where the replica needs to be promoted to primary status.


Percona Distribution for MySQL: An Enterprise-Grade MySQL Solution, for Free

Mar
30
2021
--

How To Automate Dashboard Importing in Percona Monitoring and Management

Automate Dashboard Importing in Percona Monitoring and Management

Automate Dashboard Importing in Percona Monitoring and ManagementIn this blog post, I’ll look at how to import custom dashboards into Percona Monitoring and Management (PMM) 2.x, and give some tips on how to automate it.

The number of dashboards in PMM2 is constantly growing. For example, we recently added a new HAProxy dashboard to the latest 2.15.0 release. Even though the PMM server has more than fifty dashboards, it’s not possible to cover all common server applications.

The greatest source of dashboards is the official Grafana site. Here, anyone can share their own dashboards with the community or find already uploaded ones. Percona has its own account and publishes as-yet-unreleased or unique (non-PMM) dashboards.

Each dashboard has its own number which can be used to refer to it. For example, 12630 is assigned to the dashboard “MySQL Query Performance Troubleshooting”.
Percona Monitoring and Management Dashboard

You can download dashboards as JSON files and import them into your PMM2 installation using the UI.

PMM2
This is easy, but we are forgetting that dashboards can be updated by publishers as new revisions. So it’s possible that the dashboard has a bunch of useful changes that were published after you downloaded it. But, you keep using an old version of the dashboard.

So the only way to use the latest dashboard version is to check the site from time to time. It can really be a pain in the neck, especially if you have to track more than one dashboard.

This is why it’s time to take a look at automation. Grafana has a very powerful API that I used to create this shell script. Let’s take a peek at it. It’s based on the api/dashboards/import API function. The function requires a POST request with a dashboard body.

The first step is to download a dashboard.

curl -s https://grafana.com/api/dashboards/12630/revisions/1/download --output 12630_rev1.json

Note how I used dashboard number 12630 and revision 1 in the command. By increasing the revision number I can find out the latest available dashboard version. This is exactly the approach used in the script.

In the next example, I’ll use a dashboard from our dashboard repository. (I will explain why later.)

curl -L -k https://github.com/percona/grafana-dashboards/raw/PMM-2.0/dashboards/Disk_Details.json --output Disk_Details.json

Now I have a file and can form a POST request to import the dashboard into a PMM installation.

$ curl -s -k -X POST -H "Content-Type: application/json" -d "{\"dashboard\":$(cat Disk_Details.json),\"overwrite\":true}" -u admin:admin https://18.218.63.13/graph/api/dashboards/import


The dashboard has been uploaded. If you take a look at the output you may notice the parameter folderId. With this, it’s possible to specify a Grafana folder for my dashboards.

Here is the command for fetching a list of existing folders.

curl -s -k -u admin:admin https://172.20.0.1/graph/api/folders

I now have folder IDs and can use them in the importing command. The Folder ID should be specified in a POST request as shown in the next example.


Now you are familiar with API import commands, I’ll give you a closer look at community dashboards.

Most of them have the parameter “Data Sources”.
It means that for dashboard importing, you have to specify the data source names assigned by your installation.


This point makes it impossible to import any downloaded dashboards with the API without modifying them. If I execute the import command used earlier (the 12630_rev1.json file downloaded from Grafana.com) I will get an error.


So, here’s another script (cleanup_dash.py) that replaces the datasource fields in dashboards and allows me to pass an importing command. The script takes a dashboard file name as a parameter.


The importing script calls cleanup-dash.py automatically if an initial importing attempt was unsuccessful.

Note the parameters of the importing script. Here you should set the details of your PMM installation. dashboards is an array of dashboards IDs that you want to import into PMM2.

#!/bin/bash
dashboards=(13266 12630 12470)
pmm_server="172.20.0.1"
user_pass="admin:admin"
folderName="General"

Now, you should download both scripts and try to import dashboards. Make sure that both scripts are executable and in the same folder. Here are the commands to do it.

curl -LJOs https://github.com/Percona-Lab/pmm-dashboards/raw/master/misc/import-dashboard-grafana-cloud.sh --output import-dashboard-grafana-cloud.sh
curl -LJOs https://github.com/Percona-Lab/pmm-dashboards/raw/master/misc/cleanup-dash.py --output cleanup-dash.py

chmod a+x import-dashboard-grafana-cloud.sh
chmod a+x cleanup-dash.py

./import-dashboard-grafana-cloud.sh

You can next find the imported dashboards in your PMM installation. They were put into the ‘Insight’ folder and can be found by the keyword ‘PMM2’.

imported PMM dashboards

By default, the script imports all designed for PMM2 dashboards from Percona account. Also, folder names and dashboard IDs can be specified as parameters for the script.

Here are some usage examples:

import-dashboard-grafana-cloud.sh Default list of dashboards will be uploaded into General folder
import-dashboard-grafana-cloud.sh Insight Default list of dashboards will be uploaded into Insight folder
import-dashboard-grafana-cloud.sh 13266 12630 12470 Dashboards 13266 12630 12470 will be uploaded into General folder
import-dashboard-grafana-cloud.sh Insight 13266 12630 12470 Dashboards 13266 12630 12470 will be uploaded into Insight folder

You can define any number of dashboards in the script parameters and run the script periodically to always have the most recent dashboard versions.


Percona Monitoring and Management is free to download and use. Try it today!

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com