Dec 09, 2021

A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade

Percona XtraDB Cluster Non-Blocking Operation

Percona XtraDB Cluster 8.0.25 (PXC) has introduced a new option to perform online schema modifications: NBO (Non-Blocking Operation).

When using PXC, the cluster relies on the wsrep_OSU_method parameter to define the Online Schema Upgrade (OSU) method the node uses to replicate DDL statements. 

Until now, we normally had three options:

  • Use Total Order Isolation (TOI, the default)
  • Use Rolling Schema Upgrade (RSU)
  • Use Percona’s online schema change tool (TOI + PTOSC)

Each method has some positive and negative aspects. TOI will lock the whole cluster from being able to accept data modifications for the entire time it takes to perform the DDL operation. RSU will misalign the schema definition between the nodes, and in any case, the node performing the DDL operation is still locked. Finally, TOI+PTOSC will rely on creating triggers and copying data, so in some cases, this can be very impactful. 

The new Non-Blocking Operation (NBO) method helps reduce the impact on the cluster and makes it easier to perform some DDL operations.

At the moment, NBO supports only a limited set of operations:

  • ALTER INDEX
  • CREATE INDEX
  • DROP INDEX

Any other command will result in an error message ER_NOT_SUPPORTED_YET.

But let us see how it works and what the impact is, comparing it with the default method, TOI.

What we will do is work with five connections:

1 – to perform the DDL
2 – to insert data into the table being altered
3 – to insert data into a different table
4-5 – to check the operations on the other two nodes

PXC must be at least Version 8.0.25-15.1.
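
A quick way to confirm the node is on a suitable release, and what the current default OSU method is, could be (the user is a placeholder):

mysql -uroot -e "SELECT @@version, @@version_comment; SHOW VARIABLES LIKE 'wsrep_OSU_method';"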

The table we will modify is:

DC1-1(root@localhost) [windmills_s]>show create table windmills_test\G
*************************** 1. row ***************************
       Table: windmills_test
Create Table: CREATE TABLE `windmills_test` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `uuid` char(36) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `millid` smallint NOT NULL,
  `kwatts_s` int NOT NULL,
  `date` date NOT NULL,
  `location` varchar(50) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  `active` tinyint NOT NULL DEFAULT '1',
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `strrecordtype` char(3) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `IDX_millid` (`millid`,`active`),
  KEY `IDX_active` (`id`,`active`),
  KEY `kuuid_x` (`uuid`),
  KEY `millid_x` (`millid`),
  KEY `active_x` (`active`)
) ENGINE=InnoDB AUTO_INCREMENT=8199260 DEFAULT CHARSET=utf8mb3 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
1 row in set (0.00 sec)

And contains ~five million rows.

DC1-1(root@localhost) [windmills_s]>select count(*) from windmills_test;
+----------+
| count(*) |
+----------+
|  5002909 |
+----------+
1 row in set (0.44 sec)

The Commands

Connection 1:

ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
ALTER TABLE windmills_test DROP INDEX idx_1, ALGORITHM=INPLACE;

 

Connection 2:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills_test  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connection 3:

while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "insert into windmills8  select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmills7 limit 1;" -e "select count(*) from windmills_s.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

 

Connections 4-5:

while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/PXC8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_s -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_s)"|grep -i -v localhost;sleep 1;done

Operations

  • Start inserts from connections
  • Start commands in connections 4 – 5 on the other nodes
  • Execute: 
    • For TOI 
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=TOI;
    • For NBO
      • DC1-1(root@localhost) [windmills_s]>SET SESSION wsrep_OSU_method=NBO;
    • For both
      • DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=shared;

Let’s Run It

Altering a Table with TOI

DC1-1(root@localhost) [windmills_s]>ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE;
Query OK, 0 rows affected (1 min 4.74 sec)
Records: 0  Duplicates: 0  Warnings: 0

 

Inserts in the altering table (connection 2):

.450
.492
64.993 <--- Alter blocks all inserts on the table we are altering
.788
.609

 

Inserts on the other table (connection 3):

.455
.461
64.161 <--- Alter blocks all inserts on all the other tables as well
.641
.483

 

On the other nodes, at the same time as the ALTER, we can see:

Id  User             db         Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
15	system user		windmills_s	Query	102	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	102238	0	       0    <--- time from start

In short, we have the whole cluster locked for ~64 seconds. During this period, all operations that modify data or schema were on hold.

Let’s Now Try With NBO

Inserts in the altering table:

.437
.487
120.758 <---- Execution time increase
.617
.510

 

Inserts on the other table:

.468
.485
25.061 <---- still a metalock, but not locking the other tables for the whole duration 
.494
.471

 

On the other nodes, at the same time as the ALTER, we can see:

Id      User         db             Command Time  State             Info                                                                            Time_ms Rows_sent Rows_examined 
110068	system user	 windmills_s	Connect	86	  altering table	ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE	 120420	 0	          0

 

In this case, what is also interesting to note is that:

  1. We have a moment of metalock:
    1. 110174 pmm 127.0.0.1:42728 NULL Query 2 Waiting for table metadata lock  SELECT x FROM information_schema.tables  WHERE TABLE_SCHEMA = 'windmills_s' 1486 10    0
    2. 110068  system user connecting host windmills_s Connect 111 closing tables ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE 111000 0 0
  2. The execution time is longer

Summarizing:

                                                       TOI           NBO
Time on hold for inserts on the table being altered    ~64 sec       ~120 sec
Time on hold for inserts on another table              ~64 sec       ~25 sec
Metadata lock                                          whole time    only at the end

What is Happening? What are the Differences and Why Does it Take Longer with NBO?

Let’s look, at a very high level, at how the two methods work:

  • TOI: when you issue a DDL like ADD INDEX a metadata lock is taken on the table and it will be released only at the end of the operation. During this time, you cannot:
    • Perform DMLs on any cluster node
    • Alter another table in the cluster
  • NBO: the metadata lock is taken at the start and at the end, each time for a very brief period. The ADD INDEX operation then works on each node independently. The lock taken at the end is there so that all the nodes agree on the operation and commit or roll back (using cluster error voting). This final phase costs a bit more time and is what adds a few seconds to the operation execution. But during the operation:
    • You can alter another table (using NBO; see the sketch after this list)
    • You can continue to insert data, except into the table(s) you are altering
    • On a node crash, the operation will continue on the other nodes, and if successful it will persist
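
A minimal sketch of that concurrency, assuming the same schema used above (both windmills_test and windmills8 have uuid and active columns; index names are just examples): two sessions, each set to wsrep_OSU_method=NBO, build an index on a different table at the same time.

# hedged sketch: two NBO ALTERs on different tables can run concurrently
mysql -uroot -D windmills_s -e "SET SESSION wsrep_OSU_method=NBO;
  ALTER TABLE windmills_test ADD INDEX idx_nbo_1 (uuid, active), ALGORITHM=INPLACE, LOCK=SHARED;" &

mysql -uroot -D windmills_s -e "SET SESSION wsrep_OSU_method=NBO;
  ALTER TABLE windmills8 ADD INDEX idx_nbo_2 (uuid, active), ALGORITHM=INPLACE, LOCK=SHARED;" &

wait   # both index builds proceed without either one blocking the whole cluster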

In short, the cluster's behavior changes significantly when using NBO, offering much more flexibility than TOI. The time cost should not increase linearly with the size of the table, but rather depends on how efficiently a single node can perform the ALTER operation.

Conclusion

NBO can be significantly helpful in reducing the impact of DDL on the cluster. For now, it is limited to the widely used creation/modification/drop of an index, but in the future we may expand it.

The feature is still a technology preview, so do not trust it in production, but test it and let us know what you think. 

Final comment: another distribution has introduced NBO, but only if you buy the enterprise version.

Percona, which is truly open source in facts and not just words, has implemented NBO in standard PXC, and the code is fully open source. This is not the first case, but just another of the many features Percona offers for free while others ask you to buy the enterprise version.

Enjoy the product and let us have your feedback! Great MySQL to all! 

Dec 01, 2021

PXC Scheduler Handler: The Missing Piece for Galera/Percona XtraDB Cluster Puzzle

Working on a real-case scenario in a five-node Percona XtraDB Cluster (PXC), we were forced to use wsrep_sync_wait = 1 because the application does reads after writes and we send reads to all the nodes. We had the idea of leaving some nodes in DESYNC mode to reduce the flow control messages during peak load, expecting a steadier write throughput while keeping read consistency.

We decided to test Percona's new PXC Scheduler Handler, an application that manages the integration between ProxySQL and Galera/PXC (its scope is to maintain the ProxySQL mysql_servers table when a negative scenario occurs, like failures, service degradation, or maintenance). However, we realized that when a node is in DESYNC mode, it is kicked out of the read hostgroup. That is why we asked Marco Tusa to implement this new feature, which removes a node from the read hostgroup if wsrep_local_recv_queue is higher than max_replication_lag.
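
To see the two values that check compares, you can look at the node's receive queue and at the lag threshold configured for the server in ProxySQL (hostnames and credentials below are placeholders; 6032 is the usual ProxySQL admin port):

# current apply/receive queue length on a PXC node
mysql -h <pxc_node> -u<user> -p<pass> -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';"

# max_replication_lag configured for the servers in ProxySQL (admin interface)
mysql -h 127.0.0.1 -P 6032 -uadmin -padmin -e "SELECT hostgroup_id, hostname, max_replication_lag FROM mysql_servers;"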

Environment

  • 5 PXC nodes
  • 1 ProxySQL server

In db01, we run sysbench to simulate write traffic:

sysbench /usr/share/sysbench/oltp_insert.lua --threads=50 --tables=100 --mysql-password=<pass> --mysql-user=<user> --report-interval=5 --time=600 --tx-rate=0  run

In ProxySQL we run sysbench to simulate read-only traffic:

sysbench /usr/share/sysbench/oltp_read_only.lua --tables=100 --mysql-password=<pass> --mysql-user=<user> --mysql-host=127.0.0.1 run --threads=32 --report-interval=5 --time=600 --db-ps-mode=disable

DESYNC Test

The goal of this test is to see the differences between wsrep_desync ON/OFF and wsrep_sync_wait = 1.
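
For reference, desync is toggled at runtime on the node you want to exclude from flow control (credentials are placeholders):

# put the node in DESYNC mode: it keeps replicating but no longer triggers flow control
mysql -u<user> -p<pass> -e "SET GLOBAL wsrep_desync=ON;"

# and put it back once the test is over
mysql -u<user> -p<pass> -e "SET GLOBAL wsrep_desync=OFF;"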

The next graph shows both scenarios side by side: the left part of each graph is with wsrep_desync ON and the right part with it OFF.

DESYNC Test 

As you can see, read traffic drops when DESYNC is OFF. The drops occur in the same periods when the flow control messages are sent:

This is expected and is not new. The number of queries executed was:

    DESYNC ON  queries:                             30561552 (50934.44 per sec.)
    DESYNC OFF queries:                             28324704 (47195.67 per sec.)

Basically, you can execute about 8% more queries (30,561,552 vs. 28,324,704) when flow control is not triggered.

Consistency Test

Now, we are going to simulate a scenario when the cluster receives CPU-intensive queries. We are going to execute in db03 (or any other node):

for j in $(seq 1 40) ; do 
  for i in $(seq 1 250); do 
    echo "SELECT pad FROM sbtest.sbtest1 GROUP BY pad LIMIT 10" | mysql --password=<pass> --user=<user> --host=<host> ; 
  done > /dev/null 2>&1 & 
done

It starts 40 threads, each executing the same GROUP BY query 250 times, which is enough for our testing.

And we are going to add a timestamp column on the sbtest1 to monitor the lag:

ALTER TABLE sbtest.sbtest1 ADD `tnow` timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6);

Now we have these four scenarios, which we are going to monitor with:

select timediff(tnow,current_timestamp(6))  from sbtest1 order by tnow desc limit 1

And the status variable: wsrep_local_recv_queue.

This will be the header of the tables:

| timediff | query time |  wsrep_local_recv_queue |

– With wsrep_desync=OFF and wsrep_sync_wait=1

| 00:00:02.553139 | 2.69 sec |  237 |
| 00:00:00.223150 | 0.26 sec |   72 |

– With wsrep_desync=OFF and wsrep_sync_wait=0

| 00:00:01.711082 | 1.74 sec |  183 |
| 00:00:01.175446 | 1.20 sec |  112 |

– With wsrep_desync=ON and wsrep_sync_wait=1

| -00:00:00.021460 | 1.30 sec | 13542 |
| -00:00:00.009098 | 1.28 sec | 13184 |

– With wsrep_desync=ON and wsrep_sync_wait=0

| -00:01:02.065854 | 0.46 sec | 635387 |
| -00:01:02.882633 | 0.54 sec | 643742 |

With wsrep_desync=OFF, the behavior is similar in both cases, which means that the node first needs to be in sync and only then checks wsrep_sync_wait.

With wsrep_desync=ON and wsrep_sync_wait=1, we can see that the query is delayed because it needs to apply the transactions in the apply queue. That is not the case when wsrep_sync_wait=0, where the data is far behind the writer node and the query is answered immediately.

The two cases that matter are when wsrep_sync_wait=1, and both cases are read-consistent even if they show different timediff values, as the query time is measuring the flow control lag when wsrep_desync=OFF and the apply queue time when wsrep_desync=ON. 

Cluster Behavior During Load Increases

It is time to merge both tests and simulate ProxySQL spreading the CPU-intensive queries over the cluster. We are going to execute the same script in ProxySQL:

for j in $(seq 1 40) ; do 
  for i in $(seq 1 250); do 
    echo "SELECT pad FROM sbtest.sbtest1 GROUP BY pad LIMIT 10" | mysql --password=<pass> --user=<user> --host=<host> ; 
  done > /dev/null 2>&1 & 
done

In the next graph we are going to see how the active connections on different nodes went up and down as the status of the node changed: 

MySQL Node Change

And when the script finished it went back to normal.

In ProxySQL, you will see how the status of the servers changes from ONLINE to OFFLINE_SOFT and back to ONLINE because of the PXC Scheduler Handler intervention, like this:

proxysql> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id ='101' ;
+--------------+---------------+--------------+
| hostgroup_id | hostname      | status       |
+--------------+---------------+--------------+
| 101          | 10.127.70.242 | ONLINE       |
| 101          | 10.127.70.243 | ONLINE       |
| 101          | 10.127.70.244 | OFFLINE_SOFT |
| 101          | 10.127.70.245 | ONLINE       |
| 101          | 10.127.71.3   | ONLINE       |
+--------------+---------------+--------------+
proxysql> select hostgroup_id,hostname,status from runtime_mysql_servers order by hostgroup_id ='101' ;
+--------------+---------------+--------------+
| hostgroup_id | hostname      | status       |
+--------------+---------------+--------------+
| 101          | 10.127.70.242 | ONLINE       |
| 101          | 10.127.70.243 | ONLINE       |
| 101          | 10.127.70.244 | ONLINE       |
| 101          | 10.127.70.245 | OFFLINE_SOFT |
| 101          | 10.127.71.3   | ONLINE       |
+--------------+---------------+--------------+

This can be reviewed also in PMM:

percona monitoring and management

This means that PXC Scheduler Handler is helping us to spread the load across DESYNC nodes, improving the response time for the read-only traffic.

Architecture

With traditional replication we had an abstract diagram like this:

traditional replication

We didn’t have any production-ready option to guarantee Read Consistency on the Read Replicas.

With Galera/PXC we don’t need the Replication Manager as it will be replaced with PXC Scheduler Handler: 

PXC Scheduler Handler

We have the same number of nodes with Read Consistency (thanks to wsrep_sync_wait) and a synced secondary writer node. What is missing, but not difficult to add, is a tool that monitors the WriterNodes, as in a failure scenario we might want to keep at least two synced nodes.

Conclusion

I think that Marco Tusa did a great job with the PXC Scheduler Handler which allowed us to implement this new Architecture that might help people that need to scale reads, need consistent write throughput, need consistent reads, and don’t want reads to affect the flow of the replication process.

If you’re interested in learning more about Percona XtraDB Cluster and ProxySQL, be sure to check out Percona’s Training Services. We offer an advanced two-day hands-on tutorial for XtraDB Cluster, in addition to our one-day ProxySQL intensive. Contact us today to learn more!

Jul 12, 2021

How To Recover Percona XtraDB Cluster 5.7 Node Without SST

Recover Percona XtraDB Cluster 5.7 Node Without SST

The Problem

State Snapshot Transfer (SST) can be a very long and expensive process, depending on the size of your Percona XtraDB Cluster (PXC)/Galera cluster, as well as on network and disk bandwidth. There are situations where it is needed, though, like after a long enough node separation, where the gcache on the other members was too small to keep all the needed transactions.

Let’s see how we can avoid SST, yet recover fast and without even the need for doing a full backup from another node.

Below, I will present a simple scenario where one of the cluster nodes had a broken network for long enough to make Incremental State Transfer (IST) no longer possible.

For this solution to work, I am assuming that the cluster has binary logs with GTID mode enabled and that the logs with the missing transactions have not been purged yet. It would still be possible without GTID, just slightly more complex.
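
For reference, a minimal sketch of the settings this method assumes on the cluster nodes (the file path and log base name are just examples):

# binary logging with GTID enabled on every node
cat >> /etc/my.cnf <<'EOF'
[mysqld]
log_bin                  = mysql-bin
log_slave_updates        = ON
gtid_mode                = ON
enforce_gtid_consistency = ON
EOF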

My example PXC member, node3, gets separated from the cluster due to a network outage. Its last applied transaction status is:

node3 > show global variables like 'gtid_executed';
+---------------+----------------------------------------------+
| Variable_name | Value                                        |
+---------------+----------------------------------------------+
| gtid_executed | 2cd15721-261a-ee14-4166-00c9b4945b0b:1-28578 |
+---------------+----------------------------------------------+
1 row in set (0.01 sec)

node3 > show status like 'wsrep_last_committed';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| wsrep_last_committed | 28610 |
+----------------------+-------+
1 row in set (0.00 sec)

However, other available active nodes in the cluster have already rotated the gcache further:

node1 > show status like 'wsrep_local_cached_downto';
+---------------------------+-------+
| Variable_name             | Value |
+---------------------------+-------+
| wsrep_local_cached_downto | 42629 |
+---------------------------+-------+
1 row in set (0.00 sec)

Hence, after the network is restored, it fails to re-join the cluster due to IST failure:

DONOR error log:

2021-06-30T21:52:02.199697Z 2 [Note] WSREP: IST request: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:28610-83551|tcp://127.0.0.1:27172
2021-06-30T21:52:02.199743Z 2 [Note] WSREP: IST first seqno 28611 not found from cache, falling back to SST

JOINER error log:

2021-06-30T21:52:02.139242Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 83551)
2021-06-30T21:52:02.139408Z 4 [Note] WSREP: State transfer required:
Group state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:83551
Local state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:28610
...
2021-06-30T21:52:02.200137Z 0 [Warning] WSREP: 1.0 (node1): State transfer to 0.0 (node3) failed: -61 (No data available)
2021-06-30T21:52:02.200171Z 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():805: State transfer request failed unrecoverably because the donor seqno had gone forward during IST, but SST request was not prepared from our side due to selected state transfer method (which do not supports SST during node operation). Restart required.
2021-06-30T21:52:02.200191Z 0 [Note] WSREP: gcomm: terminating thread

And node3 shuts down its service as a result.

The Solution

To avoid using full backup transfer from the donor, let’s try asynchronous replication here, to let the failed node catch up with the others so that IST should be possible later.

To achieve that, let’s modify the configuration file first on the separated node, and add these to avoid accidental writes during the operation:

super_read_only = 1
skip_networking

and to disable PXC mode for the time, comment out the provider:

#wsrep-provider=/usr/lib64/libgalera_smm.so

Now, after a restart, node3 becomes a standalone MySQL node, without Galera replication enabled. So, let's configure the async replication channel (the repl user was already created on all nodes):

node3 > CHANGE MASTER TO MASTER_HOST='localhost', MASTER_USER='repl', MASTER_PASSWORD='replpassword', MASTER_AUTO_POSITION=1, MASTER_PORT=27037;
Query OK, 0 rows affected, 2 warnings (0.03 sec)

node3 > start slave;
Query OK, 0 rows affected (0.00 sec)

And then wait for it to catch up with the source node. Once this replica is fully up to date, let’s stop it, remove async channel configuration, and note its new GTID position:

node3 > stop slave;
Query OK, 0 rows affected (0.00 sec)

node3 > reset slave all;
Query OK, 0 rows affected (0.01 sec)

node3 > show global variables like 'gtid_executed';
+---------------+----------------------------------------------+
| Variable_name | Value                                        |
+---------------+----------------------------------------------+
| gtid_executed | 2cd15721-261a-ee14-4166-00c9b4945b0b:1-83553 |
+---------------+----------------------------------------------+
1 row in set (0.00 sec)

Now, we have to find the corresponding cluster’s wsrep sequence, in the source binary log, like this:

$ mysqlbinlog mysql-bin.000005|grep -A1000 '2cd15721-261a-ee14-4166-00c9b4945b0b:83553'|grep Xid|head -1
#210701  0:19:06 server id 100  end_log_pos 1010 CRC32 0x212d2592  Xid = 83557

With this position, the grastate.dat file on the failed node has to be updated, as follows:

$ cat pxc_msb_pxc5_7_33/node3/data/grastate.dat
# GALERA saved state
version: 2.1
uuid:    d32ea8de-d9e5-11eb-be99-ff364b6ba4f4
seqno:   83557
safe_to_bootstrap: 0

The previous configuration file modifications must now be reverted and the service restarted again.

This time, IST was finally possible:

2021-06-30T22:26:10.563512Z 2 [Note] WSREP: State transfer required:
Group state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:85668
Local state: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:83557
...
2021-06-30T22:26:28.860555Z 2 [Note] WSREP: Receiving IST: 2111 writesets, seqnos 83557-85668
2021-06-30T22:26:28.860812Z 0 [Note] WSREP: Receiving IST...  0.0% (   0/2111 events) complete.
2021-06-30T22:26:29.247313Z 0 [Note] WSREP: Receiving IST...100.0% (2111/2111 events) complete.
2021-06-30T22:26:29.247713Z 2 [Note] WSREP: IST received: d32ea8de-d9e5-11eb-be99-ff364b6ba4f4:85668
2021-06-30T22:26:29.247902Z 0 [Note] WSREP: 0.0 (node3): State transfer from 1.0 (node1) complete.
...
2021-06-30T22:26:29.248074Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 85668)

And node3 rejoins the cluster properly:

node3 > show global variables like 'gtid_executed';
+---------------+----------------------------------------------+
| Variable_name | Value                                        |
+---------------+----------------------------------------------+
| gtid_executed | 2cd15721-261a-ee14-4166-00c9b4945b0b:1-85664 |
+---------------+----------------------------------------------+
1 row in set (0.00 sec)

node3 > show status like 'wsrep_last_committed';
+----------------------+-------+
| Variable_name        | Value |
+----------------------+-------+
| wsrep_last_committed | 85668 |
+----------------------+-------+
1 row in set (0.01 sec)

Summary

With the help of traditional asynchronous replication, we were able to restore the failed node back to the cluster faster and without all the overhead related to a full backup made by SST.

The only requirement for such a method to work is an enabled binary log, with a long enough rotation policy.

I have tested this on version:

node3 > select @@version,@@version_comment\G
*************************** 1. row ***************************
        @@version: 5.7.33-36-49-log
@@version_comment: Percona XtraDB Cluster binary (GPL) 5.7.33-rel36-49, Revision a1ed9c3, wsrep_31.49
1 row in set (0.00 sec)

Unfortunately, a similar solution does not work with Percona XtraDB Cluster 8.0.x, due to the modified way wsrep positions are kept in the storage engine, hence the trick with updating grastate.dat does not work as expected there.

I would also like to remind you that if a node is expected to stay separated from the cluster for too long, there is a way to preserve a longer Galera cache history for it. By doing this, the solution I presented may not even be needed – check the relevant article: Want IST Not SST for Node Rejoins? We Have a Solution!

Jun 01, 2021

Skipping Percona Server for MySQL Version 8.0.24 and Releasing 8.0.25 Next

Percona Server for MySQL Version 8.0.24

After Oracle released MySQL version 8.0.24 on April 20, 2021, our engineering team got started right away with merging our enhancements to prepare the corresponding 8.0.24 version of Percona Server for MySQL.

However, Oracle released MySQL version 8.0.25 shortly afterward on May 11, 2021, to fix a critical bug that we also observed during our initial testing and reported back to them.

Therefore, we have decided to skip releasing Percona Server for MySQL 8.0.24, both as a standalone product and as a distribution, as well as the matching Percona XtraDB Cluster release.

Our next public release after Percona Server for MySQL version 8.0.23 will be version 8.0.25, so don’t be surprised if you’re looking for version 8.0.24 and can’t find it. Thank you!

May 04, 2021

Inconsistent Voting in Percona XtraDB Cluster

Cluster Error Voting

AKA Cluster Error Voting…

What is Cluster Error Voting (CEV)?

“Cluster Error Voting is a new feature implemented by Alexey Yurchenko, and it is a protocol for nodes to decide how the cluster will react to problems in replication. When one or several nodes have an issue applying an incoming transaction(s) (e.g., suspected inconsistency), this new feature helps. In a 5-node cluster, if 2-nodes fail to apply the transaction, they get removed, and a DBA can go in to fix what went wrong so that the nodes can rejoin the cluster. (Seppo Jaakola)”

This feature was ported to Percona XtraDB Cluster (PXC) in version 8.0.21. As indicated above, it is about increasing the resilience of the cluster, especially when TWO nodes fail to operate and may drop from the cluster abruptly. The protocol is activated in a cluster with any number of nodes.

Before CEV, if a node had a problem/error during a transaction, the node having the issue would report the error in its own log and exit the cluster:

2021-04-23T15:18:38.568903Z 11 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Write_rows event on table test.test_voting; Duplicate entry '21' for key 'test_voting.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 0, Error_code: MY-001062
2021-04-23T15:18:38.568976Z 11 [Warning] [MY-000000] [WSREP] Event 3 Write_rows apply failed: 121, seqno 16
2021-04-23T15:18:38.569717Z 11 [Note] [MY-000000] [Galera] Failed to apply write set: gtid: 224fddf7-a43b-11eb-84d5-2ebf2df70610:16 server_id: d7ae67e4-a43c-11eb-861f-8fbcf4f1cbb8 client_id: 40 trx_id: 115 flags: 3
2021-04-23T15:18:38.575439Z 11 [Note] [MY-000000] [Galera] Closing send monitor...
2021-04-23T15:18:38.575578Z 11 [Note] [MY-000000] [Galera] Closed send monitor.
2021-04-23T15:18:38.575647Z 11 [Note] [MY-000000] [Galera] gcomm: terminating thread
2021-04-23T15:18:38.575737Z 11 [Note] [MY-000000] [Galera] gcomm: joining thread
2021-04-23T15:18:38.576132Z 11 [Note] [MY-000000] [Galera] gcomm: closing backend
2021-04-23T15:18:38.577954Z 11 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,3206d174,5)
memb {
	727c277a,1
	}
joined {
	}
left {
	}
partitioned {
	3206d174,1
	d7ae67e4,1
	}
)
2021-04-23T15:18:38.578109Z 11 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2021-04-23T15:18:38.578158Z 11 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2021-04-23T15:18:38.578640Z 11 [Note] [MY-000000] [Galera] gcomm: closed
2021-04-23T15:18:38.578747Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

While the other nodes will “just” report the node as out of the view:

2021-04-23T15:18:38.561402Z 0 [Note] [MY-000000] [Galera] forgetting 727c277a (tcp://10.0.0.23:4567)
2021-04-23T15:18:38.562751Z 0 [Note] [MY-000000] [Galera] Node 3206d174 state primary
2021-04-23T15:18:38.570411Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,3206d174,6)
memb {
	3206d174,1
	d7ae67e4,1
	}
joined {
	}
left {
	}
partitioned {
	727c277a,1
	}
)
2021-04-23T15:18:38.570679Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2021-04-23T15:18:38.574592Z 0 [Note] [MY-000000] [Galera] forgetting 727c277a (tcp://10.0.0.23:4567)
2021-04-23T15:18:38.574716Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2021-04-23

With CEV, we have a different process. Let us review it with images first.

Let us start with a cluster…

3 Nodes, where only one works as Primary.

Primary writes and, as expected, writesets are distributed on all nodes.

insert into test_voting values(null,REVERSE(UUID()), NOW()); <-- Few times

DC1-1(root@localhost) [test]>select * from test_voting;
+----+--------------------------------------+---------------------+
| id | what                                 | when                |
+----+--------------------------------------+---------------------+
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
+----+--------------------------------------+---------------------+
5 rows in set (0.00 sec)

Some inexperienced DBA does a manual operation on a secondary using the very unsafe feature wsrep_on…

And then, by mistake or because he did not understand what he is doing…

insert into test_voting values(17,REVERSE(UUID()), NOW()); <-- with few different ids
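
For clarity, here is a hedged reconstruction of how such a local-only write is typically made (do not do this on a production cluster):

mysql -uroot -D test -e "
  SET SESSION wsrep_on=OFF;   -- writes in this session are no longer replicated
  INSERT INTO test_voting VALUES (17, REVERSE(UUID()), NOW());
  SET SESSION wsrep_on=ON;
"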

At the end of the operation, the secondary node will have:

DC1-2(root@localhost) [test]>select * from test_voting;
+----+--------------------------------------+---------------------+
| id | what                                 | when                |
+----+--------------------------------------+---------------------+
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
| 16 | 05de43720080-a39a-be11-405a-82715600 | 2021-04-24 14:50:17 |
| 17 | 05de43720080-a39a-be11-405a-f9d62e22 | 2021-04-24 14:51:14 |
| 18 | 05de43720080-a39a-be11-405a-f5624662 | 2021-04-24 14:51:20 |
| 19 | 05de43720080-a39a-be11-405a-cd8cd640 | 2021-04-24 14:50:23 |
+----+--------------------------------------+---------------------+

This is not in line with the rest of the cluster, which still has the previous data. Then our guy puts the node back:

At this point, the Primary does another insert in that table and:

Houston, we have a problem! 

The secondary node already has the entry with that ID and cannot perform the insert:

2021-04-24T13:52:51.930184Z 12 [ERROR] [MY-010584] [Repl] Slave SQL: Could not execute Write_rows event on table test.test_voting; Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 0, Error_code: MY-001062
2021-04-24T13:52:51.930295Z 12 [Warning] [MY-000000] [WSREP] Event 3 Write_rows apply failed: 121, seqno 4928120

But instead of exiting the cluster, it raises a verification through voting:

2021-04-24T13:52:51.932774Z 0 [Note] [MY-000000] [Galera] Member 0(node2) initiates vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,878ded7898c83a72:  Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062;
2021-04-24T13:52:51.932888Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   878ded7898c83a72:   1/3
Waiting for more votes.
2021-04-24T13:52:51.936525Z 0 [Note] [MY-000000] [Galera] Member 1(node3) responds to vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,0000000000000000: Success
2021-04-24T13:52:51.936626Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   1/3
   878ded7898c83a72:   1/3
Waiting for more votes.
2021-04-24T13:52:52.003615Z 0 [Note] [MY-000000] [Galera] Member 2(node1) responds to vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,0000000000000000: Success
2021-04-24T13:52:52.003722Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   2/3
   878ded7898c83a72:   1/3
Winner: 0000000000000000

As you can see, each node informs the cluster about the success or failure of the operation, and the majority wins.

Once the majority has identified the operation as legitimate, the node that asked for the voting needs to leave the cluster:

2021-04-24T13:52:52.038510Z 12 [ERROR] [MY-000000] [Galera] Inconsistency detected: Inconsistent by consensus on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120
	 at galera/src/replicator_smm.cpp:process_apply_error():1433
2021-04-24T13:52:52.062666Z 12 [Note] [MY-000000] [Galera] Closing send monitor...
2021-04-24T13:52:52.062750Z 12 [Note] [MY-000000] [Galera] Closed send monitor.
2021-04-24T13:52:52.062796Z 12 [Note] [MY-000000] [Galera] gcomm: terminating thread
2021-04-24T13:52:52.062880Z 12 [Note] [MY-000000] [Galera] gcomm: joining thread
2021-04-24T13:52:52.063372Z 12 [Note] [MY-000000] [Galera] gcomm: closing backend
2021-04-24T13:52:52.085853Z 12 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,65a111c6-bb0f,23)
memb {
	65a111c6-bb0f,2
	}
joined {
	}
left {
	}
partitioned {
	aae38617-8dd5,2
	dc4eaa39-b39a,2
	}
)
2021-04-24T13:52:52.086241Z 12 [Note] [MY-000000] [Galera] PC protocol downgrade 1 -> 0
2021-04-24T13:52:52.086391Z 12 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view ((empty))
2021-04-24T13:52:52.150106Z 12 [Note] [MY-000000] [Galera] gcomm: closed
2021-04-24T13:52:52.150340Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

It is also nice to notice that we now have a decent level of information about what happened on the other nodes; the log below is from the Primary:

2021-04-24T13:52:51.932829Z 0 [Note] [MY-000000] [Galera] Member 0(node2) initiates vote on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120,878ded7898c83a72:  Duplicate entry '18' for key 'test_voting.PRIMARY', Error_code: 1062;
2021-04-24T13:52:51.978123Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
…<snip>
2021-04-24T13:52:51.981647Z 0 [Note] [MY-000000] [Galera] Votes over ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120:
   0000000000000000:   2/3
   878ded7898c83a72:   1/3
Winner: 0000000000000000
2021-04-24T13:52:51.981887Z 11 [Note] [MY-000000] [Galera] Vote 0 (success) on ab5deb8e-389d-11eb-b1c0-36eca47bacf0:4928120 is consistent with group. Continue.
2021-04-24T13:52:52.064685Z 0 [Note] [MY-000000] [Galera] declaring aae38617-8dd5 at tcp://10.0.0.31:4567 stable
2021-04-24T13:52:52.064885Z 0 [Note] [MY-000000] [Galera] forgetting 65a111c6-bb0f (tcp://10.0.0.21:4567)
2021-04-24T13:52:52.066916Z 0 [Note] [MY-000000] [Galera] Node aae38617-8dd5 state primary
2021-04-24T13:52:52.071577Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(PRIM,aae38617-8dd5,24)
memb {
	aae38617-8dd5,2
	dc4eaa39-b39a,2
	}
joined {
	}
left {
	}
partitioned {
	65a111c6-bb0f,2
	}
)
2021-04-24T13:52:52.071683Z 0 [Note] [MY-000000] [Galera] Save the discovered primary-component to disk
2021-04-24T13:52:52.075293Z 0 [Note] [MY-000000] [Galera] forgetting 65a111c6-bb0f (tcp://10.0.0.21:4567)
2021-04-24T13:52:52.075419Z 0 [Note] [MY-000000] [Galera] New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2

At this point, a DBA can start to investigate and manually fix the inconsistency and have the node rejoin the cluster. In the meantime, the rest of the cluster continues to operate:

+----+--------------------------------------+---------------------+
| id | what                                 | when                |
+----+--------------------------------------+---------------------+
|  3 | 05de43720080-938a-be11-305a-6d135601 | 2021-04-24 14:43:34 |
|  6 | 05de43720080-938a-be11-305a-7eb60711 | 2021-04-24 14:43:36 |
|  9 | 05de43720080-938a-be11-305a-6861c221 | 2021-04-24 14:43:37 |
| 12 | 05de43720080-938a-be11-305a-d43f0031 | 2021-04-24 14:43:38 |
| 15 | 05de43720080-938a-be11-305a-53891c31 | 2021-04-24 14:43:39 |
| 18 | 05de43720080-938a-be11-405a-d02c7bc5 | 2021-04-24 14:52:51 |
+----+--------------------------------------+---------------------+

Conclusion

Cluster Error Voting (CEV) is a nice feature to have. It helps us better understand what went wrong, increases the stability of the cluster, and, with voting, provides a better way to manage node expulsion.

Another aspect is visibility; never underestimate the fact that the information is also available on the other nodes. Having it available on multiple nodes may help investigations if the log on the failing node gets lost (for any reason).

We still do not have active tuple certification, but it is a good step, especially given the history we have seen of data drift in PXC/Galera in these 12 years of utilization.

My LAST comment: while I agree WSREP_ON can be a very powerful tool in the hands of experts, as indicated in my colleague’s blog How to Perform Compatible Schema Changes in Percona XtraDB Cluster (Advanced Alternative), that option remains DANGEROUS, and you should never use it UNLESS your name is Przemysław Malkowski and you really know what you are doing.

Great MySQL to everybody!

References

https://www.percona.com/doc/percona-xtradb-cluster/8.0/release-notes/Percona-XtraDB-Cluster-8.0.21-12.1.html

Galera Clustering in MariaDB 10.5 and beyond – Seppo Jaakola – MariaDB Server Fest 2020

 


Apr 23, 2021

Yearly Reminder: DDLs That Fail to Propagate May Cause Percona XtraDB Cluster Inconsistencies

DDLs Percona XtraDB Cluster Inconsistencies

Apologies for the silly title, but the issue is a real one, even though it is not a new thing. Schema upgrades are not an ordinary operation in Galera. For the subject at hand, the bottom line is: under the default Total Order Isolation (TOI) method, “the cluster replicates the schema change query as a statement before its execution.” What this means in practice is that a DDL issued in one node is replicated to the other nodes in the cluster before it is even executed in the source node, let alone completed successfully.

As a result of this, it may fail in one node and be successful in another, without raising loud alerts or stopping nodes to protect against data inconsistency. This is not a bug in itself but rather a design compromise. With the changes in MySQL and the recent support for atomic DDLs added in version 8.0, we may see improvements in this area in Galera as well over time. For now, the warning stands: DDLs are not atomic in Galera, even in 8.0.

How Big of an Issue Is This?

To be fair, it’s rare – we do not see this happening often.

How Does the Problem Manifest Itself in Practice? Give Us an Example!

We recently investigated a case where a DBA issued a TRUNCATE TABLE statement in one node, which was taking some time to complete. His reaction at the time was to hit Ctrl+C on the session where the DDL was issued. That session seemed to hang, so he ended up opening a new session and issued a full DELETE on that same table. It also took time to complete; however, this time he was more patient. The problem was: once the statement completed, he noticed that the two other nodes had disappeared from the cluster. Upon closer inspection of the logs, both nodes had reported the following error before leaving the cluster:

[ERROR] Slave SQL: Could not execute Delete_rows event on table mydb.target; Can't find record in 'target', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1073, Error_code: 1032

The DELETE was successful on the source node but failed to complete on the other two nodes. Why? Because the target table was already empty: the previous TRUNCATE TABLE was effectively executed in those nodes.

Wasn’t the TRUNCATE TABLE also successful on the source node? It turned out it wasn’t. In fact, the TRUNCATE was recorded in the slow query log of the source node like this:

# Time: (...)
# User (...)
# Schema: building Last_errno: 1317 Killed: 0
# Query_time: 145.753247 Lock_time: 0.000000 Rows_sent: 0 Rows_examined: 0 Rows_affected: 0
# Bytes_sent: 44 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0
# QC_Hit: No Full_scan: No Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0
# No InnoDB statistics available for this query
SET timestamp=(...)
truncate target;

Note the Last_errno indicated above:

$ perror 1317
MySQL error code 1317 (ER_QUERY_INTERRUPTED): Query execution was interrupted

Our colleague Przemek reproduced a similar problem, which he described in bug PXC-3536: Modify processing to not allow threads/queries to be killed if the thread is in TOI. The Percona Engineering team has since fixed this abort issue, leaving the following note on the bug tracker:

Fixed in 5.7[.33-31.49] and 8.0[.22-13.1].  Any attempt to kill the query or connection while in TOI will fail.  Note that due to timing, the “kill query” may sometimes return success, but the operation will still have succeeded.

In the case we investigated, the MySQL error log of the source node had the answer; there was some other higher-priority transaction already in place on that node that prevented it from executing the TRUNCATE:

[Note] WSREP: MDL conflict db=mydb table=target ticket=MDL_SHARED_WRITE solved by abort

To clarify, this cluster was receiving writes from multiple nodes, and the higher-priority transaction was applied by a replication thread, which usually has precedence over local events when a conflict is detected.

As you may have suspected, the TRUNCATE was not found in the binary log of the source node; since the statement did not complete, it was not logged. Only the subsequent DELETE statement was there.

And as for the binary logs of the other two nodes, they included both the TRUNCATE as well as the DELETEs covering all rows in the target table – rows those other nodes no longer had in place and thus were unable to remove, which caused the data inconsistency that followed.
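
A hedged way to confirm this kind of discrepancy is to search the binary logs of each node for the statement (the file name below is a placeholder):

# decode a binary log and look for the TRUNCATE statement
mysqlbinlog --base64-output=decode-rows -vv mysql-bin.000001 | grep -i "truncate"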

Feb 24, 2021

Point-In-Time Recovery in Kubernetes Operator for Percona XtraDB Cluster – Architecture Decisions

Point-In-Time Recovery in Kubernetes Operator

Point-In-Time Recovery (PITR) for MySQL databases is an essential feature that covers common use cases, like recovering to the latest possible transaction or rolling back the database to a specific point before some bad query was executed. Percona Kubernetes Operator for Percona XtraDB Cluster (PXC) added support for PITR in version 1.7, and in this blog post we are going to look into the technical details and the decisions we made to implement this feature.

Architecture Decisions

Store Binary Logs on Object Storage

MySQL uses binary logs to perform point-in-time recovery. Usually, they are stored locally along with the data, but that is not an option for us:

  • We run the cluster and we cannot rely on a single node’s local storage.
  • The cloud-native world lives in an ephemeral dimension, where nodes and pods can be terminated and S3-compatible storage is a de facto standard to store data.
  • We should be able to recover the data to another Kubernetes cluster in case of a disaster.

We have decided to add a new Binlog Uploader Pod, which connects to the available PXC member and uploads binary logs to S3. Under the hood, it relies on the mysqlbinlog utility.

Use Global Transaction ID

Binary logs on the clustered nodes are not synced and can have different names and contents. This becomes a problem for the Uploader, as it can connect to different PXC nodes for various reasons.

To solve this problem, we decided to rely on Global Transaction ID (GTID). It is a transaction identifier that is unique not only to the server on which it originated but across all servers in a given replication topology. With the GTIDs captured in binary logs, we can identify any transaction without depending on the filename or its contents. This allows us to continue streaming binlogs from any PXC member at any moment.

User-Defined Functions

We have a unique identifier for every transaction, but the mysqlbinlog utility still doesn't have the functionality to determine which binary log file contains which GTID. We decided to extend MySQL with a few User-Defined Functions and added them to Percona Server for MySQL and Percona XtraDB Cluster version 8.0.21.

get_gtid_set_by_binlog()

This function returns all GTIDs that are stored inside the given binlog file. We put the GTID set list into a new file next to the binary log on S3.

get_binlog_by_gtid_set()

This function takes a GTID set as input and returns the name of the locally stored binlog file that contains it. We use it to figure out which GTIDs have already been uploaded and which binlog to upload next.
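
A hedged example of calling these functions from a client (the binlog file name is illustrative, the GTID set mirrors the one visible in the error message further down, and exact argument formats may vary between versions):

# which GTIDs does a given local binlog file contain?
mysql -e "SELECT get_gtid_set_by_binlog('binlog.000005');"

# which local binlog file contains a given GTID set?
mysql -e "SELECT get_binlog_by_gtid_set('a8e657ab-5a47-11eb-bea2-f3554c9e5a8d:15');"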

binlog uploader pod


Find the node with the oldest binary log

Our quality assurance team caught a bug before the release that can only happen in a cluster:

  • Add a new node to the Percona XtraDB Cluster (for example scale up from 3 to 5 nodes).
  • Binlog Uploader Pod tries to execute get_binlog_by_gtid_set on the new node but gets the error.
2021/01/19 11:23:19 ERROR: collect binlog files: get last uploaded binlog name by gtid set: scan binlog: sql: Scan error on column index 0, name "get_binlog_by_gtid_set('a8e657ab-5a47-11eb-bea2-f3554c9e5a8d:15')": converting NULL to string is unsupported

The error is valid, as this node is new and there are no binary log files that have the GTID set that the Uploader got from S3. If you look into this pull request, the quick patch is to always pick the oldest node in the array, in other words, the node that most likely has the binary logs we need. In the next release of the Operator, we will add more sophisticated logic to reliably discover the node that has the oldest binary logs.

Storageless binlog uploader

The size of binlogs depends on the cluster usage patterns, so it is hard to predict the size of the storage or memory required for them. We decided to take this complexity away by making our Binary Log Uploader Pod completely storageless. mysqlbinlog can store a remote binlog only into files, but we need to put them on S3. To get there, we decided to use a named pipe (FIFO special file): the mysqlbinlog utility writes the binary log into the named pipe, and our Uploader reads from it and streams the data directly to S3.
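
Conceptually, the named-pipe trick looks like the following simplified shell sketch (assuming the aws CLI as the S3 client; the real Uploader is part of the Operator, not a shell script):

# create the FIFO the binlog will flow through; nothing is persisted on local disk
mkfifo /tmp/binlog.fifo

# writer side: push a binary log into the pipe (shown here as a simple copy for illustration)
cat binlog.000042 > /tmp/binlog.fifo &

# reader side: stream the pipe straight to S3 ("-" means read from stdin)
aws s3 cp - s3://my-backup-bucket/binlogs/binlog.000042 < /tmp/binlog.fifo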

Also, the storageless design means that we never store any state between Uploader restarts. Basically, no state is needed: we only need to know which GTIDs are already uploaded, and we have this data in the remote S3 bucket. Such a design enables a continuous upload flow of binlogs.

Binlog upload delay

The S3 protocol expects the file to be completely uploaded. If the upload is interrupted (let's say the Uploader Pod is evicted), the file will not be accessible/visible on S3. We could potentially lose many hours of binary logs because of such interruptions. That's why we need to split the binlog stream into files and upload them separately.

One of the options that users can configure when enabling point-in-time recovery in Percona XtraDB Cluster Operator is timeBetweenUploads. It sets the number of seconds between uploads for the Binlog Uploader Pod. By default, we set it to 60 seconds, but it can go down to one second. We do not recommend setting it too low, as every invocation of the Uploader leads to a FLUSH BINARY LOGS command being executed on the PXC node. We need to flush the logs to close the binary log file so it can be uploaded to external storage, but doing so frequently may negatively affect I/O and, as a result, database performance.

Recovery

It is all about recovery and it has two steps:

  1. Recover the cluster from a full backup
  2. Apply binary logs

We already have the functionality to restore from a full backup (see here), so let’s get to applying the binary logs.

First, we need to figure out from which GTID set we should start applying binary logs, in other words: where do we start? As we rely on the Percona XtraBackup utility to take full MySQL backups, what we need to do is read the xtrabackup_info file, which has lots of useful metadata. We already have this file on S3 next to the full backup.

Second, find the binlog that has the GTID set we need. As you remember, we already store a file with each binlog's GTID set on S3, so it boils down to reading these files.

Third, download the binary logs and apply them. Here we rely on mysqlbinlog as well, which has the flags we need, like --stop-datetime, which stops recovery when an event with a specific timestamp is reached in the log.
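
Put together, the replay step is conceptually similar to this (host, file names, and timestamp are placeholders):

# decode the downloaded binlogs, stop at the chosen point in time, and pipe the result into the restored cluster
mysqlbinlog --stop-datetime="2021-02-20 13:00:00" binlog.000041 binlog.000042 | mysql -h <restored_host> -uroot -p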

point in time recovery

Conclusion

MySQL is more than 25 years old and has a great tooling ecosystem established around it, but as we saw in this blog post, not all these tools are cloud-native ready. Percona engineering teams are committed to providing users the same features across various environments, whether it is a bare-metal installation in the data center or cutting edge Kubernetes in the cloud.

Nov 17, 2020

Wondering How to Run Percona XtraDB Cluster on Kubernetes? Try Our Operator!

Run Percona XtraDB Cluster on Kubernetes

Kubernetes has been a big trend for a while now, particularly well-suited for microservices. Running your main databases on Kubernetes is probably NOT what you are looking for. However, there’s a niche market for them. My colleague Stephen Thorn did a great job explaining this in The Criticality of a Kubernetes Operator for Databases. If you are considering running your database on Kubernetes, have a look at it first. And, if after reading it you start wondering how the Operator works, Stephen also wrote an Introduction to Percona Kubernetes Operator for Percona XtraDB Cluster (PXC), which presents the Kubernetes architecture and how the Percona Operator simplifies the deployment of a full HA PXC cluster in this environment, proxies included!

Now, if you are curious about how it actually works in practice but are afraid the entry barrier is too high, I can help you with that. In fact, this technology is widespread now, with most cloud providers offering a dedicated Kubernetes engine. In this blog post, I’ll walk you over the steps on how to deploy a Percona XtraDB Cluster (PXC) using the Percona Operator for Kubernetes on Google Cloud Platform (GCP).

Creating a Virtual Environment to Run Kubernetes on GCP

Google Cloud Platform includes among its products the Google Kubernetes Engine (GKE). We can take advantage of their trial offer to create our test cluster there: https://cloud.google.com.

After you sign up, you can access all the bells and whistles in their web interface. Note that the Kubernetes Engine API is not enabled by default; you need to enable it by visiting the Kubernetes Engine section in the left menu, under COMPUTE.

For the purpose of deploying our environment, we should install their SDK and work from the command line: see https://cloud.google.com/sdk/docs/install and follow the respective installation instructions for your OS (you will probably want to install the SDK on your personal computer).

With the SDK installed, we can initialize our environment, which requires authenticating to the Google Cloud account:

gcloud init

You will be prompted to choose a cloud project to use: there’s one created by default when the account is activated, named “My First Project”. It will receive a unique id, which you can verify in the Google Cloud interface, but usually, it is displayed as the first option presented in the prompt.

Alternatively, you can use gcloud config set to configure your default project and zone, among other settings.
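
For example (the project id here is a placeholder):

gcloud config set project my-first-project-123456
gcloud config set compute/zone us-central1-b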

For this exercise, we will be creating a 3-node cluster named k8-test-cluster with n1-standard-4 instances in the us-central1-b zone:

gcloud container clusters create --machine-type n1-standard-4 --num-nodes 3 --zone us-central1-b --cluster-version latest k8-test-cluster

If the command above was successful, you should see your newly created cluster in the list returned by:

gcloud container clusters list

Getting Ready to Work with Kubernetes

Besides the Google Cloud SDK that is used to manage the cloud instances, we also need the Kubernetes command-line tool, kubectl, to manage the Kubernetes cluster. One way to install it is through gcloud itself:

gcloud components install kubectl

This method won’t work for everyone though, as the Cloud SDK component manager is disabled for certain kinds of installation, such as through apt or yum in Linux. I find myself in this group, using Ubuntu, but the failed attempt to install kubectl through gcloud suggested another approach that worked for me:

sudo apt install kubectl

Deploying a PXC Cluster Using the Percona Kubernetes Operator

The Percona operators are available on GitHub. The most straightforward way to obtain a copy is by cloning the operator's repository. The latest version of the PXC operator is 1.6.0, and we can clone it with the following command:

git clone -b v1.6.0 https://github.com/percona/percona-xtradb-cluster-operator

Move inside the created directory:

cd percona-xtradb-cluster-operator

and run the following sequence of commands:

  1. Define the Custom Resource Definitions for PXC:

    kubectl apply -f deploy/crd.yaml
  2. Create a namespace on Kubernetes and associate it to your current context:

    kubectl create namespace pxc
    kubectl config set-context $(kubectl config current-context) --namespace=pxc
  3. Define Role-Based Access Control (RBAC) for PXC:

    kubectl apply -f deploy/rbac.yaml
  4. Start the operator within Kubernetes:

    kubectl apply -f deploy/operator.yaml
  5. Configure PXC users and their credentials:

    kubectl apply -f deploy/secrets.yaml
  6. Finally, deploy the cluster:

    kubectl apply -f deploy/cr.yaml

You can find a more detailed explanation of each of these steps, as well as how to customize your installation, in the Percona Kubernetes Operator for Percona XtraDB Cluster online documentation, which includes a quickstart guide for GKE.

Now, it is a matter of waiting for the deployment to complete, which you can monitor with:

kubectl get pods

A successful deployment will show output for the above command similar to:

NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   0          4m21s
cluster1-haproxy-1                                 2/2     Running   0          2m47s
cluster1-haproxy-2                                 2/2     Running   0          2m21s
cluster1-pxc-0                                     1/1     Running   0          4m22s
cluster1-pxc-1                                     1/1     Running   0          2m52s
cluster1-pxc-2                                     1/1     Running   0          111s
percona-xtradb-cluster-operator-79d786dcfb-9lthw   1/1     Running   0          4m37s

As you can see above, the operator deploys seven pods with the default settings, and those are distributed across the three GKE n1-standard-4 nodes we created earlier:

kubectl get nodes
NAME                                             STATUS   ROLES    AGE    VERSION
gke-k8-test-cluster-default-pool-02c370e1-gvfg   Ready    <none>   152m   v1.17.13-gke.1400
gke-k8-test-cluster-default-pool-02c370e1-lvh7   Ready    <none>   152m   v1.17.13-gke.1400
gke-k8-test-cluster-default-pool-02c370e1-qn3p   Ready    <none>   152m   v1.17.13-gke.1400
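
If you want to confirm how the pods are spread across those nodes, the wide output format adds a NODE column to the pod listing:

kubectl get pods -o wide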

Accessing the Cluster

One way to access the cluster is by running a disposable client pod with an interactive shell inside the Kubernetes cluster:

kubectl run -i --rm --tty percona-client --image=percona:8.0 --restart=Never -- bash -il

From there, we can access MySQL through the cluster’s HAProxy writer node:

mysql -h cluster1-haproxy -uroot -proot_password

Note that the hostname used above is an alias; the connection is routed to one of the HAProxy servers available in the cluster. It is also possible to connect to a specific node by modifying the host option -h with the node’s name:

mysql -h cluster1-pxc-0 -uroot -proot_password
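
The root_password value used above comes from deploy/secrets.yaml. If you changed it, or simply want to read back what was applied, the credentials live in a Kubernetes secret; a sketch assuming the default secret name my-cluster-secrets and the root key from that file:

kubectl get secret my-cluster-secrets -o jsonpath='{.data.root}' | base64 --decode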

This is where all the fun and experimentation starts: you can test and break things without worrying too much as you can easily and quickly start again from scratch.
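
For example, using the pod names from the earlier listing, you can simulate the loss of a database node and watch the operator and Kubernetes bring it back:

kubectl delete pod cluster1-pxc-2
kubectl get pods -w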

Destroying the Cluster and Deleting the Test Environment

Once you are done playing with your Kubernetes cluster, you can destroy it with:

gcloud container clusters delete --zone=us-central1-b k8-test-cluster

It’s important to note that the command above will not discard the persistent disk volumes that were created and used by the nodes, which you can check with the command:

gcloud compute disks list

A final purging command is required to remove those as well:

gcloud compute disks delete <disk_name_1> <disk_name_2> <disk_name_3> --zone=us-central1-b

If you are feeling overzealous, you can double-check that all has been deleted:

gcloud container clusters list

gcloud compute disks list

Learn More About Percona Kubernetes Operator for Percona XtraDB Cluster

Interested In Hands-On Learning?

Be sure to get in touch with Percona’s Training Department to schedule your PXC Kubernetes training engagement. Our expert trainers will guide your team first through the basics, cover all the configuration noted above (and then some), and then dive deeper into how the operator functions, along with high-availability exercises, disaster recovery scenarios, backups, restores, and much more.

Oct
30
2020
--

CVE-2020-15180 – Affects Percona XtraDB Cluster

CVE-2020-15180

Galera replication technology, a key component of Percona XtraDB Cluster, suffered from a remote code execution vulnerability. Percona has been working with the vendor since early September on this issue and has made releases available to address the problem.

Applicability

To exploit this vulnerability, a malicious party needs access to the WSREP service port (4567/TCP) as well as prior knowledge of the configured Galera cluster name; a successful exploit leads to remote code execution via the WSREP protocol.
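
Until the fixed release is applied, it is prudent to make sure the WSREP port is reachable only from the other cluster nodes. A minimal iptables sketch, assuming a hypothetical 10.0.0.0/24 cluster network:

iptables -A INPUT -p tcp --dport 4567 -s 10.0.0.0/24 -j ACCEPT   # 10.0.0.0/24 is a placeholder for your cluster network
iptables -A INPUT -p tcp --dport 4567 -j DROP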

Fixes are available in Percona XtraDB Cluster versions:

>= 8.0.20-11.2

>= 5.7.31-31.45.2

>= 5.6.49-28.42.2

Credits

Percona would like to thank all the Percona staff involved in the resolution of this issue.

More Information

Release notes

Oct
08
2020
--

The Criticality of a Kubernetes Operator for Databases

Importance of Kubernetes Operators for Databases

As a Solutions Engineer at Percona, one of my responsibilities is to support our customers as they investigate new and emerging technologies. This affords me the opportunity to speak to many current and new customers who partner with Percona. The topic of Kubernetes is becoming more popular as companies investigate and adopt this technology. The issue most companies encounter is architecting a stateful database that doesn’t fall victim to an environment tuned for ephemeral workloads. This introduces a level of complexity, as running a stateful database in an inherently stateless world is something databases were not natively designed for.

To make your life easier, as part of the Percona Cloud-Native Autonomous Database Initiative, our engineering teams have built two Kubernetes Operators: Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB, which allow Kubernetes Pods to be destroyed, moved, or created with no impact on the application. For an overview of Kubernetes, you can read my previous blog post, Introduction to Percona Kubernetes Operator for Percona XtraDB Cluster, which covers this topic. It’s common for companies new to Kubernetes to attempt to run their databases in Kubernetes the same way they would in a traditional environment, but this is not advised, as it introduces the possibility of data loss and is not recommended for production workloads. Why is this dangerous, and how has Percona solved it?

Appropriate Workloads for Kubernetes

Kubernetes is not the answer for everyone. It’s not even the answer for most people. Do not be misled into thinking that moving a database into Kubernetes is going to solve any of your problems. Before you consider moving your database into Kubernetes, ensure the rest of your application is cloud-native and can be used with Kubernetes. Moving your database to Kubernetes should happen only after you have embraced both elastic vertical and horizontal scaling and need to orchestrate it to control costs.

As more companies move to Kubernetes, something has to happen to the legacy workloads. Oftentimes we see a lift-and-shift mentality toward Kubernetes, which can be dangerous or cause more work than expected. We have seen two primary ideal use cases for moving database workloads to Kubernetes: Microservices and a Unified Abstraction Layer.

Monolithic, large datasets can undermine some of Kubernetes’ strong points: self-healing and availability. This can be an issue due to the time it takes to physically transmit data to a new Pod instance as it joins the database cluster. If your dataset is too large, this process is slow due to physical limitations, hurting the performance and availability of your database. Microservices are a great fit because of their relatively smaller datasets, which allows Kubernetes automation to work well with the dataset size.

Companies looking to take full advantage of cloud-native applications and databases can be a really good fit for Kubernetes as well. If you truly want the ability to deploy and run your databases anywhere utilizing the concept of a Unified Abstraction Layer, Kubernetes is a great option. You can move your databases to anywhere that is running Kubernetes and know it will work.

We talked about large unsharded datasets and the limitations Kubernetes presents when handling them, but we should mention a few more workloads better suited for traditional platforms. Applications with throughput sensitivity may not do well on Kubernetes, or may not be cost-effective to run there. Kubernetes is fundamentally designed for container orchestration and is not designed to handle highly performant databases that require low latency. This may be possible to achieve, but at what cost? The same applies to highly performant distributed applications. Lowest latency across all nodes is not a core tenet of Kubernetes, so ensure you have planned and tested against this before you move everything over to Kubernetes.

Pods Are Cattle, Not Pets

If you’re not familiar with Pets vs Cattle, it’s a DevOps concept that differentiates deployment methodologies: unique servers that require attention when issues arise (pets) versus servers that can simply be replaced with a copy when issues arise (cattle). Due to the nature of how Kubernetes operates, Pods can be destroyed, spun up, and moved at any time due to factors outside of the application’s control, much like how cattle are treated. Kubernetes uses a scheduler which, by design, can destroy and recreate Pods to meet the configuration needs of your Kubernetes cluster. This is great for stateless applications, as any failure in the application results in the Pod containing the application being destroyed and recreated, eliminating the need for human interaction and greatly speeding up the path to resolution. This isn’t ideal for databases, as you don’t want your database to suddenly stop working, halt the application, and introduce the potential for lost or corrupted data. One of the tools Kubernetes offers to combat this is the StatefulSet, which keeps a Pod’s identity assigned to it as it is destroyed and re-created. This helps facilitate stateful workloads, but how does this come into play with high availability and utilizing the automation aspects of Kubernetes?

Databases Are Pets, Not Cattle

Databases by design need to keep their identity, information, and most importantly their data safe and accessible at all times. They are the backbone of the application, as they are the source of truth an application relies on for normal processing. Any error in their operation will quickly stop an application from functioning. They are important, to say the least. How can we safely run databases in Kubernetes and still ensure we have highly available database deployments? By using StatefulSets and PersistentVolumes we can maintain data integrity, but we need an additional set of hands to take on database administrator tasks such as ensuring failover happens and failed database members are recovered and re-join the highly available architecture, along with other technology-specific functions. Fortunately, Kubernetes is extensible and has Operators, which aim to automate the key tasks of a human operator who is managing a service or set of services.
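
With the Percona XtraDB Cluster operator deployed with its defaults, as in the earlier walkthrough, you can see these building blocks directly: the database pods are managed by a StatefulSet and each member gets its own PersistentVolumeClaim, which a quick listing in the namespace where the cluster was deployed will confirm:

kubectl get statefulsets,pvc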

Automation, Automation, Automation

We know the complexities of running a database (safely) in Kubernetes and some of the concepts used to help bridge the gaps between automation and traditional human functions. With the help of Percona’s Kubernetes Operators, we can safely run databases the way they were intended to run. Percona’s Kubernetes Operators are able to automate tasks that are usually done by a database administrator such as:

  • Fully automated deployments with strict consistency and no single point of failure
  • Automated scaling, with the ability to change the size parameter to add or remove members of a Cluster or Replica-Set (see the sketch after this list)
  • Fully automated backups and restores
  • Fully automated self-healing, automatically recovering from the failure of a single Cluster or Replica-Set member
  • Automated management of system users, including password rotation
  • Simplified updates
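
As a concrete illustration of the scaling point above, growing or shrinking the cluster is a matter of changing the size field of the custom resource. A minimal sketch, assuming the default resource name cluster1 and the pxc short name registered by the operator’s CRD:

kubectl patch pxc cluster1 --type=merge --patch '{"spec": {"pxc": {"size": 5}}}'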

Always Use a Kubernetes Operator

With the complexities of running a highly available database environment and the inherent dangers introduced by using the dynamic Kubernetes environment, an Operator should always be used when deploying databases in Kubernetes. Fortunately, Percona has already solved this by providing Percona Kubernetes Operator for Percona XtraDB Cluster and Percona Kubernetes Operator for Percona Server for MongoDB. Percona provides full support for databases running in Kubernetes with the Percona Operators. If you are interested in learning more or obtaining support or professional services to maximize your database deployments, please reach out to us.
