Even though I used a dedicated Kubernetes cluster to host my test database, I had the belief that by not explicitly allocating (or requesting, in Kubernetes vocabulary) CPU resources to my Percona XtraDB Cluster (PXC) pods, or by making only a small request, Kubernetes could be delaying access to the free CPU cycles available on […]
Can We Set up a Replicate Filter Within the Percona XtraDB Cluster?
If you were to ask me the question, “Would it be possible to set up a replicate filter within the Percona XtraDB Cluster?” my immediate response would be to question you right back: why would you even want the cluster nodes to have different data when you’re using a synchronous replication cluster? That’s because, instinctively, […]
Preventing Stale Reads on Percona XtraDB Cluster by Enforcing Causality Checks on a Per-Query Basis
When we run a SELECT on a replica server and it returns a different result from the one we would have obtained by querying the source server instead, due to changes in the dataset that have not yet been replicated or synchronized to the replica, we get what is known as a stale read. My colleague Marcelo has blogged about Stale Reads and How to Fix Them before.
The need to avoid stale reads is a constraint that leads people to migrate to Percona XtraDB Cluster (PXC) / Galera Cluster or, more recently, Group Replication. I am focusing on PXC and Galera in this short blog post; like Group Replication, they implement not exactly a synchronous replication model but what is commonly referred to as “virtually synchronous” replication (not without some controversy). In short, it means a transaction committed on one node is guaranteed (there is a certification process behind this) to have “reached” the other nodes, but that does not mean the changes it carries have already been applied there. Thus, stale reads can happen in PXC too. For the long version, see the manual page about Certification in Percona XtraDB Cluster.
Why does PXC implement virtually synchronous replication instead of fully synchronous replication? There are a few challenges to accomplishing this, but it mostly comes down to performance constraints. Fully synchronous replication requires not only that transactions reach the other nodes but also that they be applied there, and then the nodes have to report back the success of the operation. For each transaction. This takes time and increases the overall commit latency, which in practice makes writing to the database slow and hurts overall concurrency.
Evolution of “synchronous” replication on PXC
To be clear, no matter how you configure PXC, it won’t become a fully synchronous system (neither will Group Replication, for that matter). BUT, it is possible to make it behave in a way that enforces data consistency. That’s what the variable wsrep_causal_reads has been used for. When this setting was introduced, it worked like a switch one could use to effectively enable “synchronous” replication (as in data consistency) across the cluster, either for all transactions (globally) or on a per-connection basis (per session). In practice, what this setting does is enforce causality checks; quoting the manual: “When this variable is set to ON, the replica will wait until that event is applied before doing any other queries”. By waiting, it means blocking new queries on that node until the expected consistency is reached.
Being able to enable “synchronous” replication for selected connections brought the potential to mix the best of both modes: operate with virtually synchronous replication by default (good performance) and use “synchronous” replication (once more, enforcing data consistency) when stale reads cannot be afforded.
The problem with this approach is that it requires some planning in advance to determine which connections should be configured with wsrep_causal_reads enabled. Without carefully planning ahead, what ends up happening in practice in many cases is having more connections configured with wsrep_causal_reads enabled than necessary. As a result, the cluster performance gets closer to that of operating with that setting applied globally.
The switch button that was wsrep_causal_reads evolved into the more granular wsrep_sync_wait, which allows the configuration of causality checks for different combinations of statements.
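As a rough illustration of the difference between the two scopes, and assuming a PXC node reachable through the regular mysql client (the hostname, user, and table below are placeholders, not values from this post), it could look like this:

# Per-session: only this connection pays the causality-check price
mysql -h pxc-node1 -u app -p -e "
  SET SESSION wsrep_sync_wait = 1;  -- causality checks before READ statements only
  SELECT balance FROM accounts WHERE id = 100;"

# Globally: every connection on this node waits, which is usually what you want to avoid
mysql -h pxc-node1 -u app -p -e "SET GLOBAL wsrep_sync_wait = 1;"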
The cherry on top
While wsrep_sync_wait expanded the possibilities for causality checks, it still requires the “planning ahead” discussed above. I recently worked on the troubleshooting of a slow cluster and found it was configured with wsrep_sync_wait set to three (“Perform checks for READ, UPDATE, and DELETE statements”) all around, which is an even more “aggressive” configuration than the legacy wsrep_causal_reads. The justification for operating this way was that it was not possible to identify upfront which connection would be issuing a particular query that could not afford stale reads.
The good news is that starting with Percona XtraDB Cluster 8.0.26-16, which was released earlier this year, we can set wsrep_sync_wait on a per-query basis through the use of the optimizer hint SET_VAR: “The SET_VAR hint sets the session value of a system variable temporarily (for the duration of a single statement)”.
Here’s a simple example of employing SET_VAR to enforce the highest level of causality checks for a simple SELECT:
SELECT /*+ SET_VAR(wsrep_sync_wait=7) */ balance FROM accounts WHERE id=100;
If you find it complicated to determine beforehand which connections should be configured to avoid stale reads, know that you now have the option to literally hard-code this constraint in your query. And if you end up deploying your database on a regular MySQL server, which ships without any wsrep_ settings, the optimizer hint is simply ignored and the query is processed normally.
Online DDL With Group Replication in MySQL 8.0.27
In April 2021, I wrote an article about Online DDL and Group Replication. At that time, we were dealing with MySQL 8.0.23, and I also opened a bug report that did not get the right answer for the case presented.
Anyhow, in that article I showed how an online DDL was de facto locking the whole cluster for a very long time, even with the consistency level set to EVENTUAL.
This article is here to do justice to the work done by the MySQL/Oracle engineers to correct that annoying inconvenience.
Before going ahead, let us recall how an online DDL was propagated in a Group Replication cluster, and identify the differences with what happens now, all with the consistency level set to EVENTUAL (see the documentation on group_replication_consistency).
In MySQL 8.0.23 we had:
While in MySQL 8.0.27 we have:
As you can see from the two diagrams, we have three different phases. Phase one is the same in version 8.0.23 and version 8.0.27.
Phases two and three, instead, are quite different. In MySQL 8.0.23, after the DDL was applied on the primary, it was propagated to the other nodes, but a metadata lock was also acquired and control was NOT returned. The result was that not only was the session executing the DDL kept on hold, but so were all the other sessions performing modifications.
Only when the operation was over on all secondaries was the DDL pushed to the binlog and disseminated for asynchronous replication, the lock released, and operations allowed to resume.
In MySQL 8.0.27, instead, once the operation is over on the primary, the DDL is pushed to the binlog, disseminated to the secondaries, and control is returned. The result is that write operations on the primary suffer no interruption whatsoever, and the DDL is distributed to the secondaries and to asynchronous replication at the same time.
This is a fantastic improvement, available only with consistency level EVENTUAL, but still, fantastic.
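For reference, the consistency level used throughout these tests is controlled by the group_replication_consistency variable, which can be set per session or globally; a minimal sketch, with the hostname and credentials being placeholders:

# Use EVENTUAL consistency for the current session only
mysql -h gr-primary -u root -p -e "SET SESSION group_replication_consistency = 'EVENTUAL';"

# Or make EVENTUAL the default for the whole instance
mysql -h gr-primary -u root -p -e "SET GLOBAL group_replication_consistency = 'EVENTUAL';"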
Let’s See Some Numbers
To test the operation, I used the same approach as in the previous tests from the article mentioned above.
Connection 1:
ALTER TABLE windmills_test ADD INDEX idx_1 (`uuid`,`active`), ALGORITHM=INPLACE, LOCK=NONE;
ALTER TABLE windmills_test drop INDEX idx_1, ALGORITHM=INPLACE;

Connection 2:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_large -e "insert into windmills_test select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmill7 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

Connection 3:
while [ 1 = 1 ];do da=$(date +'%s.%3N');/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_large -e "insert into windmill8 select null,uuid,millid,kwatts_s,date,location,active,time,strrecordtype from windmill7 limit 1;" -e "select count(*) from windmills_large.windmills_test;" > /dev/null;db=$(date +'%s.%3N'); echo "$(echo "($db - $da)"|bc)";sleep 1;done

Connections 4-5:
while [ 1 = 1 ];do echo "$(date +'%T.%3N')";/opt/mysql_templates/mysql-8P/bin/mysql --defaults-file=./my.cnf -uroot -D windmills_large -e "show full processlist;"|egrep -i -e "(windmills_test|windmills_large)"|grep -i -v localhost;sleep 1;done
Modifying a table with ~5 million rows:
node1-DC1 (root@localhost) [windmills_large]>select count(*) from windmills_test;
+----------+
| count(*) |
+----------+
|  5002909 |
+----------+
The numbers below represent the time, in seconds/milliseconds, taken by the operation to complete. While I was also capturing the state of the ALTER on the other node, I am not reporting it here given that it is not relevant.
EVENTUAL (on the primary only)
-------------------
Node 1 same table:
.184
.186 <--- no locking during alter on the same node
.184
<snip>
.184
.217 <--- moment of commit
.186
.186
.186
.185

Node 1 another table:
.189
.198 <--- no locking during alter on the same node
.188
<snip>
.191
.211 <--- moment of commit
.194
As you can see, there is just a very small delay at the moment of commit, but no other impact.
Now let's compare this with the recent tests I did for Percona XtraDB Cluster (PXC) Non-Blocking Operation (see A Look Into Percona XtraDB Cluster Non-Blocking Operation for Online Schema Upgrade), with the same number of rows and the same kind of table/data:
Action | Group Replication | PXC (NBO)
---|---|---
Time on hold for insert for altering table | ~ 0.217 sec | ~ 120 sec
Time on hold for insert for another table | ~ 0.211 sec | ~ 25 sec
However (yes, there is a however), PXC was maintaining consistency between the different nodes during the DDL execution, while MySQL 8.0.27 with Group Replication was postponing consistency on the secondaries; thus, the primary and secondaries were not in sync until the DDL fully finalized on the secondaries.
Conclusions
MySQL 8.0.27 comes with this nice fix that significantly reduces the impact of an online DDL operation on a busy server. But we can still observe a significant misalignment of the data between the nodes when a DDL is executing.
On the other hand, PXC with NBO is a bit more “expensive” in time, but nodes remain aligned all the time.
In the end, what matters more to you is what should drive the choice between one solution and the other: consistency vs. operational impact.
Great MySQL to all.
MySQL on Kubernetes with GitOps
The GitOps workflow was introduced by Weaveworks as a way to implement continuous deployment for cloud-native applications. This technique quickly found its way into DevOps and developers' hearts, as it greatly simplifies the application delivery pipeline: a change to the manifests in the git repository is reflected in Kubernetes right away. With GitOps there is no need to give developers access to the cluster, as all the actions are executed by the Operator.
This blog post is a guide on how to deploy Percona Distribution for MySQL on Kubernetes with Flux, a GitOps operator that keeps your cluster state in sync with the Git repository.
In a nutshell, the flow is the following:
- Developer triggers the change in the GitHub repository
- Flux Operator:
- detects the change
- deploys Percona Distribution for MySQL Operator
- creates the Custom Resource, which triggers the creation of Percona XtraDB Cluster and HAProxy pods
The result is a fully working MySQL service deployed without talking to Kubernetes API directly.
Preparation
Prerequisites:
- Kubernetes cluster
- GitHub account
- For this blog post, I used the manifests from this repository
It is a good practice to create a separate namespace for Flux:
$ kubectl create namespace gitops
Installing and managing Flux is easier with fluxctl. On Ubuntu, I use snap to install tools; for other operating systems, please refer to the manual here.
$ sudo snap install fluxctl --classic
Install Flux operator to your Kubernetes cluster:
$ fluxctl install --git-email=your@email.com --git-url=git@github.com:spron-in/blog-data.git --git-path=gitops-mysql --manifest-generation=true --git-branch=master --namespace=gitops | kubectl apply -f -
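As a quick sanity check that the install went through, you can list the Pods in the gitops namespace created above; with Flux v1 you should see the flux and memcached Pods in the Running state:

$ kubectl get pods -n gitops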
GitHub Sync
As configured, Flux will continuously monitor changes in the spron-in/blog-data repository and sync the state. Flux needs to be granted access to the repo.
Get the public key that was generated during the installation:
$ fluxctl identity --k8s-fwd-ns gitops
Copy the key and add it as a deploy key with write access in GitHub: go to Settings -> Deploy keys -> Add deploy key:
Action
All set. The Flux reconcile loop checks the state for changes every five minutes. To trigger synchronization right away, run:
$ fluxctl sync --k8s-fwd-ns gitops
In my case I have two YAMLs in the repo:
- bundle.yaml – installs the Operator, creates the Custom Resource Definitions (CRDs)
- cr.yaml – deploys PXC and HAProxy pods
Flux is going to deploy them both.
$ kubectl get pods
NAME                                               READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0                                 2/2     Running   0          26m
cluster1-haproxy-1                                 2/2     Running   0          25m
cluster1-pxc-0                                     1/1     Running   0          26m
cluster1-pxc-1                                     1/1     Running   0          25m
cluster1-pxc-2                                     1/1     Running   0          23m
percona-xtradb-cluster-operator-79966668bd-95plv   1/1     Running   0          26m
Now let’s add one more HAProxy Pod by changing spec.haproxy.size from 2 to 3 in cr.yaml. After that, commit and push the changes. In a production-grade scenario, the pull request would go through a thorough review; in my case, I push directly to the main branch.
$ git commit cr.yaml -m 'increase haproxy size from 2 to 3'
$ git push
Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 2 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 385 bytes | 385.00 KiB/s, done.
Total 4 (delta 2), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (2/2), completed with 2 local objects.
To https://github.com/spron-in/blog-data
   e1a27b8..d555c77  master -> master
Either trigger the sync with the fluxctl sync command or wait approximately five minutes for the Flux reconcile loop to detect the changes. In the logs of the Flux Operator, you will see the event:
ts=2021-06-15T12:59:08.267469963Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/spron-in/blog-data.git branch=master HEAD=d555c77c19ea9d1685392680186e1491905401cc
ts=2021-06-15T12:59:08.270678093Z caller=sync.go:61 component=daemon info="trying to sync git changes to the cluster" old=e1a27b8a81e640d3bee9bc2e2c31f9c4189e898a new=d555c77c19ea9d1685392680186e1491905401cc
ts=2021-06-15T12:59:08.844068322Z caller=sync.go:540 method=Sync cmd=apply args= count=9
ts=2021-06-15T12:59:09.097835721Z caller=sync.go:606 method=Sync cmd="kubectl apply -f -" took=253.684342ms err=null output="serviceaccount/percona-xtradb-cluster-operator unchanged\nrole.rbac.authorization.k8s.io/percona-xtradb-cluster-operator unchanged\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbbackups.pxc.percona.com configured\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbclusterbackups.pxc.percona.com unchanged\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbclusterrestores.pxc.percona.com unchanged\ncustomresourcedefinition.apiextensions.k8s.io/perconaxtradbclusters.pxc.percona.com unchanged\nrolebinding.rbac.authorization.k8s.io/service-account-percona-xtradb-cluster-operator unchanged\ndeployment.apps/percona-xtradb-cluster-operator unchanged\nperconaxtradbcluster.pxc.percona.com/cluster1 configured"
ts=2021-06-15T12:59:09.099258988Z caller=daemon.go:701 component=daemon event="Sync: d555c77, default:perconaxtradbcluster/cluster1" logupstream=false
ts=2021-06-15T12:59:11.387525662Z caller=loop.go:236 component=sync-loop state="tag flux" old=e1a27b8a81e640d3bee9bc2e2c31f9c4189e898a new=d555c77c19ea9d1685392680186e1491905401cc
ts=2021-06-15T12:59:12.122386802Z caller=loop.go:134 component=sync-loop event=refreshed url=ssh://git@github.com/spron-in/blog-data.git branch=master HEAD=d555c77c19ea9d1685392680186e1491905401cc
The log indicates that the main CR was configured:
perconaxtradbcluster.pxc.percona.com/cluster1 configured
Now we have three HAProxy Pods:
$ kubectl get pods
NAME                 READY   STATUS    RESTARTS   AGE
cluster1-haproxy-0   2/2     Running   1          50m
cluster1-haproxy-1   2/2     Running   0          48m
cluster1-haproxy-2   2/2     Running   0          4m45s
It is important to note that GitOps maintains the sync between Kubernetes and GitHub: if a user manually changes an object in Kubernetes, Flux (or any other GitOps operator) will revert the change and bring the state back in sync with GitHub.
GitOps also comes in handy when users want to take a backup or perform a restore. To do that, the user just creates the YAML manifests in the GitHub repo and Flux creates the corresponding Kubernetes objects. The Database Operator does the rest.
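As a rough sketch of what an on-demand backup request could look like under this workflow, assuming the same repository layout (--git-path=gitops-mysql) and the cluster1 cluster, with the storage name being a placeholder that must match one defined in your CR:

$ cat <<EOF > gitops-mysql/backup1.yaml
apiVersion: pxc.percona.com/v1
kind: PerconaXtraDBClusterBackup
metadata:
  name: backup1
spec:
  pxcCluster: cluster1
  storageName: s3-us-west
EOF
$ git add gitops-mysql/backup1.yaml
$ git commit -m 'request on-demand backup'
$ git push

On the next sync, Flux applies the manifest and the Operator takes the backup.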
Conclusion
GitOps is a simple approach to deploy and manage applications on Kubernetes:
- Change Management is provided by git version-control and code reviews
- Direct access to the Kubernetes API is limited, which increases security
- Infrastructure-as-Code is here; there is no need to integrate Terraform, Ansible, or any other tool
All Percona Operators can be deployed and managed with GitOps. As a result, you will get a production-grade MySQL, MongoDB, or PostgreSQL cluster that just works.
Resolving the MySQL Active-Active Replication Dilemma
Multi-writer replication had been a challenge in the MySQL ecosystem for years before truly dedicated solutions were introduced: first Galera (and thus Percona XtraDB Cluster (PXC)) replication (around 2011), and then Group Replication (first GA in 2016).
Now, with both multi-writer technologies available, do we still need traditional asynchronous replication set up in an active-active topology? Apparently yes, there are still valid use cases. And you may need it not only when, for some reason, Galera/PXC or GR are not suitable, but also when you actually use them. Of course, the most typical case is having a second cluster in a different geographic location for disaster recovery. If you still wonder why you would need it, just recall the news from a few weeks ago about the OVH incident, where a whole data center disappeared.
So, a DR site needs to replicate online from the primary cluster and be able to take over the workload very quickly if needed. It is also expected to allow switching back effortlessly, hence the async replication channels are very often set up in both directions.
A very good writeup about this can be found here: How Not to do MySQL High Availability: Geographic Node Distribution with Galera-Based Replication Misuse.
Now, after seeing active-active setups fail over and over for years, I think there are still too few warnings out there about the risks, so I decided to add one more little stone to the pile.
Before I continue, I have to mention this great webinar my colleague Sveta gave last year. You should definitely watch it if you are interested in the subject: How Safe is Asynchronous Master-Master Setup in MySQL?.
Failure Test
So, let me demonstrate a simple test case, which may be an eye-opener to some.
First, let’s use the great dbdeployer tool to launch two MySQL instances with active-active replication in just one command:
$ dbdeployer deploy replication --topology=all-masters --nodes=2 --concurrent 8.0.23
all-masters directory installed in $HOME/sandboxes/all_masters_msb_8_0_23
run 'dbdeployer usage multiple' for basic instructions'

$HOME/sandboxes/all_masters_msb_8_0_23/initialize_ms_nodes
# server: 1
# server: 2
Now, create a very simple table with one example data row:
node1 (test) > create table t1 (id int primary key auto_increment, a int);
Query OK, 0 rows affected (0.04 sec)

node1 (test) > insert into t1 set a=500;
Query OK, 1 row affected (0.01 sec)
So, at this point both nodes have the same data:
node1 (test) > select * from test.t1;
+----+------+
| id | a    |
+----+------+
|  1 |  500 |
+----+------+
1 row in set (0.00 sec)

node2 (test) > select * from test.t1;
+----+------+
| id | a    |
+----+------+
|  1 |  500 |
+----+------+
1 row in set (0.00 sec)
In the next step, let's simulate a little replication lag by introducing a delay of one second, which will allow us to reproduce the problem at will:
node1 (test) > stop replica sql_thread; CHANGE REPLICATION SOURCE TO SOURCE_DELAY=1 FOR CHANNEL "node2"; start replica sql_thread;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)

node2 (test) > stop replica sql_thread; CHANGE REPLICATION SOURCE TO SOURCE_DELAY=1 FOR CHANNEL "node1"; start replica sql_thread;
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
OK, so what if we send an update, very close in time, to both nodes, where the same row gets a different value:
$ all_masters_msb_8_0_23/use_all -e "update test.t1 set a=@@server_id where id=1"
# server: 1
# server: 2
As a result, both nodes have different column values!
node1 (test) > select * from test.t1;
+----+------+
| id | a    |
+----+------+
|  1 |  200 |
+----+------+
1 row in set (0.00 sec)

node2 (test) > select * from test.t1;
+----+------+
| id | a    |
+----+------+
|  1 |  100 |
+----+------+
1 row in set (0.00 sec)
Is Replication Broken?
You may think replication is now broken and that some error will alert you to the situation? Not at all!
node1 (test) > show replica status\G
*************************** 1. row ***************************
...
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
        Seconds_Behind_Source: 0
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
...
            Executed_Gtid_Set: 00023824-1111-1111-1111-111111111111:1-3,
                               00023825-2222-2222-2222-222222222222:1
...

node2 (test) > show replica status\G
*************************** 1. row ***************************
...
           Replica_IO_Running: Yes
          Replica_SQL_Running: Yes
...
        Seconds_Behind_Source: 0
Source_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
...
            Executed_Gtid_Set: 00023824-1111-1111-1111-111111111111:1-3,
                               00023825-2222-2222-2222-222222222222:1
Or, did you hope enabling the GTID feature would prevent inconsistency from happening? Well, see the Executed_Gtid_Set on both nodes – it is identical, yet the nodes have different data.
What happens next? Well, it depends: maybe replication will eventually fail someday with an error if the same row is modified again, but it is also possible that the inconsistency will spread further without you even noticing it!
Lesson learned?
I hope this simple example emphasized the need for extra care when dealing with multi-primary replication topologies. The remedy, though, is usually quite simple (both steps are sketched right after this list):
1 – Make sure to allow only one node at a time for writes (super_read_only is here to help you).
2 – Check the data consistency regularly with appropriate tools (MySQL replication primer with pt-table-checksum and pt-table-sync).
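A minimal sketch of both remedies, assuming the same two-node sandbox as above; the hostnames, credentials, and the percona.checksums table are placeholders rather than values from this post:

# 1 - Keep the passive node read-only so only one node accepts writes at a time
mysql -h node2 -u root -p -e "SET GLOBAL super_read_only = ON;"

# 2 - Check data consistency regularly from the active writer
pt-table-checksum h=node1,u=checksum_user,p=secret \
  --databases test \
  --replicate percona.checksums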
Yearly Reminder: DDLs That Fail to Propagate May Cause Percona XtraDB Cluster Inconsistencies
Apologies for the silly title, but the issue is a real one, even though it is not a new thing. Schema upgrades are not an ordinary operation in Galera. For the subject at hand, the bottom line is: under the default Total Order Isolation (TOI) method, “the cluster replicates the schema change query as a statement before its execution.” What this means in practice is that a DDL issued in one node is replicated to other nodes in the cluster before it is even executed in the source node, let alone completed successfully.
As a result, a DDL may fail on one node and succeed on another, without raising loud alerts or stopping nodes to protect against data inconsistency. This is not a bug in itself but rather a design compromise. With new changes in MySQL and the recent support for atomic DDL added in version 8.0, we may see improvements in this area in Galera as well over time. For now, the warning should be maintained: DDLs are not atomic in Galera, even in 8.0.
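For context, the schema-upgrade method is controlled by the wsrep_OSU_method variable, with TOI being the default; a quick way to confirm what a node is using (the connection details below are placeholders):

# TOI is the default; RSU (Rolling Schema Upgrade) keeps the DDL local to the node instead
mysql -h pxc-node1 -u root -p -e "SHOW GLOBAL VARIABLES LIKE 'wsrep_OSU_method';"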
How Big of an Issue Is This?
To be fair, it’s rare – we do not see this happening often.
How Does the Problem Manifest Itself in Practice? Give Us an Example!
We recently investigated a case where a DBA issued a TRUNCATE TABLE statement on one node, and it was taking some time to complete. His reaction at the time was to hit Ctrl+C on the session where the DDL was issued. That session seemed to hang, so he ended up opening a new session and issuing a full DELETE on that same table. It also took time to complete; however, this time, he was more patient. The problem was that once the statement completed, he noticed that the two other nodes had disappeared from the cluster. Upon closer inspection of the logs, both nodes reported the following error before being evicted from the cluster:
[ERROR] Slave SQL: Could not execute Delete_rows event on table mydb.target; Can't find record in 'target', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1073, Error_code: 1032
The DELETE was successful on the source node but failed to complete on the other two nodes. Why? Because the target table was already empty there: the previous TRUNCATE TABLE had effectively been executed on those nodes.
Wasn’t the TRUNCATE TABLE also successful on the source node? It turned out it wasn’t. In fact, the TRUNCATE was recorded in the slow query log of the source node like this:
# Time: (...)
# User (...)
# Schema: building  Last_errno: 1317  Killed: 0
# Query_time: 145.753247  Lock_time: 0.000000  Rows_sent: 0  Rows_examined: 0  Rows_affected: 0
# Bytes_sent: 44  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# QC_Hit: No  Full_scan: No  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
# No InnoDB statistics available for this query
SET timestamp=(...)
truncate target;
Note the Last_errno indicated above:
$ perror 1317
MySQL error code 1317 (ER_QUERY_INTERRUPTED): Query execution was interrupted
Our colleague Przemek reproduced a similar problem, which he described in bug PXC-3536: Modify processing to not allow threads/queries to be killed if the thread is in TOI. The Percona Engineering team has since fixed this abort issue, leaving the following note on the bug tracker:
Fixed in 5.7[.33-31.49] and 8.0[.22-13.1]. Any attempt to kill the query or connection while in TOI will fail. Note that due to timing, the “kill query” may sometimes return success, but the operation will still have succeeded.
In the case we investigated, the MySQL error log of the source node had the answer; there was some other higher-priority transaction already in place on that node that prevented it from executing the TRUNCATE:
[Note] WSREP: MDL conflict db=mydb table=target ticket=MDL_SHARED_WRITE solved by abort
To clarify, this cluster was receiving writes from multiple nodes, and the higher-priority transaction was applied by a replication thread, which usually has precedence over local events when a conflict is detected.
As you may have suspected, the TRUNCATE was not found in the binary log of the source node; since the statement did not complete, it was not logged. Only the subsequent DELETE statement was there.
And as for the binary logs of the other two nodes, they included both the TRUNCATE and the DELETEs covering all rows in the target table. It was those rows, which these nodes no longer had in place and thus were unable to remove, that caused the data inconsistency that followed.
Change Storage Class on Kubernetes on the Fly
Percona Kubernetes Operators support various options for storage: Persistent Volumes (PV), hostPath, ephemeral storage, etc. In most cases, PVs are used, and they are provisioned by the Operator through Storage Classes and Persistent Volume Claims.
Storage Classes define the underlying volume type that should be used (e.g., AWS EBS gp2 or io1), the file system (xfs, ext4), and permissions. In various cases, cluster administrators want to change the Storage Class for already existing volumes:
- The DB cluster is underutilized, and switching from io1 to gp2 is a good cost saving
- The other way around: the DB cluster is saturated on IO and the volumes need to be upsized
- Switch the file system for better performance (MongoDB is much better with xfs)
In this blog post, we will show the best way to change the Storage Class with Percona Operators without introducing downtime to the database. We will cover changing the Storage Class, but not migrating from PVCs to other storage types, like hostPath.
Changing Storage Class on Kubernetes
Prerequisites:
- GKE cluster
- Percona XtraDB Cluster (PXC) deployed with Percona Operator. See instructions here.
Goal:
- Change the storage from pd-standard to pd-ssd without downtime for PXC.
Planning
The steps we are going to take are the following:
- Create a new Storage Class for pd-ssd volumes
- Change the storageClassName in the Custom Resource (CR)
- Change the storageClassName in the StatefulSet
- Scale up the cluster (optional, to avoid performance degradation)
- Reprovision the Pods one by one to change the storage
- Scale down the cluster
Execution
Create the Storage Class
By default, the standard Storage Class is already present in GKE; we need to create the new StorageClass for SSD volumes:
$ cat pd-ssd.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ssd
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd
volumeBindingMode: Immediate
reclaimPolicy: Delete

$ kubectl apply -f pd-ssd.yaml
The new Storage Class will be called ssd and will provision the volumes of type: pd-ssd.
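A quick sanity check that the new class is registered, assuming the ssd name used above:

$ kubectl get storageclass ssd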
Change the storageClassName in Custom Resource
We need to change the configuration for Persistent Volume Claims in our Custom Resource. The variable we look for is storageClassName, and it is located under spec.pxc.volumeSpec.persistentVolumeClaim.
spec:
  pxc:
    volumeSpec:
      persistentVolumeClaim:
-       storageClassName: standard
+       storageClassName: ssd
Now apply the new cr.yaml:
$ kubectl apply -f deploy/cr.yaml
Change the storageClassName in the StatefulSet
StatefulSets are almost immutable and when you try to edit the object you get the warning:
# * spec: Forbidden: updates to statefulset spec for fields other than 'replicas', 'template', and 'updateStrategy' are forbidden
We will rely on the fact that the Operator controls the StatefulSet: if we delete it, the Operator is going to recreate it with the last applied configuration, which for us means with the new storage class. Deleting the StatefulSet would normally terminate its Pods, but our goal is zero downtime. To get there, we will delete the set but keep the Pods running. It can be done with the --cascade flag:
$ kubectl delete sts cluster1-pxc --cascade=orphan
As a result, the Pods are still up, and the Operator has recreated the StatefulSet with the new storageClass:
$ kubectl get sts | grep cluster1-pxc
cluster1-pxc   3/3   8s
Scale up the Cluster (Optional)
Changing the storage type requires us to terminate the Pods, which decreases the computational power of the cluster and might cause performance issues. To improve performance during the operation, we are going to change the size of the cluster from 3 to 5 nodes:
spec:
  pxc:
-   size: 3
+   size: 5

$ kubectl apply -f deploy/cr.yaml
Since we have already changed the StatefulSet, the new PXC Pods will be provisioned with volumes backed by the new StorageClass:
$ kubectl get pvc
datadir-cluster1-pxc-0   Bound   pvc-6476a94c-fa1b-45fe-b87e-c884f47bd328   6Gi   RWO   standard   78m
datadir-cluster1-pxc-1   Bound   pvc-fcfdeb71-2f75-4c36-9d86-8c68e508da75   6Gi   RWO   standard   76m
datadir-cluster1-pxc-2   Bound   pvc-08b12c30-a32d-46a8-abf1-59f2903c2a9e   6Gi   RWO   standard   64m
datadir-cluster1-pxc-3   Bound   pvc-b96f786e-35d6-46fb-8786-e07f3097da02   6Gi   RWO   ssd        69m
datadir-cluster1-pxc-4   Bound   pvc-84b55c3f-a038-4a38-98de-061868fd207d   6Gi   RWO   ssd        68m
Reprovision the Pods One by One to Change the Storage
This is the step where underlying storage is going to be changed for the database Pods.
Delete the PVC of the Pod that you are going to reprovision. For example, for Pod cluster1-pxc-2 the PVC is called datadir-cluster1-pxc-2:
$ kubectl delete pvc datadir-cluster1-pxc-2
The PVC will not be deleted right away as there is a Pod using it. To proceed, delete the Pod:
$ kubectl delete pod cluster1-pxc-2
The Pod will be deleted along with its PVC. The StatefulSet controller will notice that the Pod is gone and will recreate it along with a new PVC of the new Storage Class:
$ kubectl get pods
...
cluster1-pxc-2   0/3   Init:0/1   0   7s
$ kubectl get pvc datadir-cluster1-pxc-2
NAME                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
…
datadir-cluster1-pxc-2   Bound    pvc-08b12c30-a32d-46a8-abf1-59f2903c2a9e   6Gi        RWO            ssd            10s
The STORAGECLASS column indicates that this PVC is of type ssd.
You might face a situation where the Pod is stuck in the Pending state with the following error:
Warning FailedScheduling 63s (x3 over 71s) default-scheduler persistentvolumeclaim "datadir-cluster1-pxc-2" not found
This can happen due to a race condition: the Pod was created while the old PVC was terminating, and by the time the Pod was ready to start, the PVC was already gone. Just delete the Pod again so that the PVC is recreated.
Once the Pod is up, the State Snapshot Transfer kicks in and the data is synced from the other nodes. It might take a while if your cluster holds lots of data and is heavily utilized. Please wait until the node is fully up and running and the sync has finished, and only then proceed to the next Pod.
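One way to confirm a node has finished syncing before moving on is to check its Galera state from inside the Pod; the container name and the interactive root prompt below are assumptions based on a typical Operator deployment, so adjust them to your setup:

# The node is safe to leave when wsrep_local_state_comment reports "Synced"
$ kubectl exec -it cluster1-pxc-2 -c pxc -- \
    mysql -uroot -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"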
Scale Down the Cluster
Once all the Pods are running on the new storage it is time to scale down the cluster (if it was scaled up):
spec:
pxc:
- size: 5
+ size: 3
$ kubectl apply -f deploy/cr.yaml
Do not forget to clean up the PVCs for nodes 4 and 5. And done!
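For completeness, the cleanup could look like the command below, assuming nodes 4 and 5 correspond to the Pods cluster1-pxc-3 and cluster1-pxc-4 created during the scale-up (StatefulSet ordinals are zero-based):

$ kubectl delete pvc datadir-cluster1-pxc-3 datadir-cluster1-pxc-4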
Conclusion
Changing the storage type is a simple task on the public clouds, but its simplicity is not yet matched by Kubernetes capabilities. The Kubernetes and container-native landscape is evolving, and we hope to see this functionality soon. The way of changing the Storage Class described in this blog post can be applied to both the Percona Operator for PXC and the Percona Operator for MongoDB. If you have ideas on how to automate this and are willing to collaborate with us on the implementation, please submit an issue to our public roadmap on GitHub.
Infinitely Scalable Storage with High Compression Feature
It is no secret that compute and storage costs are the main drivers of cloud bills. Migration of data from the legacy data center to the cloud looks appealing at first, as it significantly reduces capital expenses (CapEx) and keeps operational expenses (OpEx) under control. But once you see the bill, the lift-and-shift project does not look that promising anymore. See Percona's recent open source survey, which shows that many organizations saw unexpected growth around cloud and data.
Storage growth is an organic process for the expanding business: more customers store more data, and more data needs more backups and disaster recovery storage for low RTO.
Today, the Percona Innovation Team, which is part of the Engineering organization, is proud to announce a new feature – High Compression. With this feature enabled, your MySQL databases will have infinite storage at zero cost.
The Problem
Our research team was digging into the problem of storage growth. They found that the storage growth of a successful business inevitably leads to an increase in the cloud bill. After two years of research, we got the data we needed; the problem is now clear, and you can see it in the chart below:
The correlation is clearly visible – the more data you have, the more you pay.
The Solution
Once our Innovation Team received the data, we started working day and night on the solution. The goal was to change the trend and break the correlation. That is how, after two years, we are proud to share the High Compression feature with the community. You can see the comparison of the storage costs with and without this new feature below:
Option | 100 TB AWS EBS | 100 TB AWS S3 for backups | 100 TB AWS EBS + High Compression | 100 TB AWS S3 for backups + High Compression
---|---|---|---|---
Annual run rate | $120,000 | $25,200 | $1.2 | < $1
As you see it is a 100,000x difference! What is more interesting, the cost of the storage with the High Compression feature enabled always stays flat and the chart now looks like this:
Theory
Not many people know, but data on disks is stored as bits, which are 0s and 1s. They form the binary sequences which are translated into normal data.
After thorough research, we came to the conclusion that we can replace the 1s with 0s easily. The formula is simple:
f(1) = 0
So instead of storing all these 1s, our High Compression feature stores zeroes only:
Implementation
The component which does the conversion is called the Nullifier, and every bit of data goes through it. We are first implementing this feature in Percona Operator for Percona XtraDB Cluster and below is the technical view of how it is implemented in Kubernetes:
As you see, all the data written by the user (all INSERT or UPDATE statements) goes through the Nullifier first, and only then is it stored on the Persistent Volume Claim (PVC). With the High Compression feature enabled, the size of the PVC can always stay at 1 GB.
Percona is an open source company and we are thrilled to share our code with everyone. You can see the Pull Request for the High Compression feature here. As you see in the PR, our feature provides the Nullifier through the underestimated and very powerful Blackhole engine.
if [ "$HIGH_COMPRESSION" == 'yes' ]; then
    sed -r "s|^[#]?default_storage_engine=.*$|default_storage_engine=BLACKHOLE|" ${CFG} 1<>${CFG}
    grep -E -q "^[#]?pxc_strict_mode" "$CFG" || sed '/^\[mysqld\]/a pxc_strict_mode=PERMISSIVE\n' ${CFG} 1<>${CFG}
fi
The High Compression feature will be enabled by default starting from PXC Operator version 1.8.0, but we have added a flag to cr.yaml to disable this feature if needed: spec.pxc.highCompression: true.
Backups with the High Compression feature are blazing fast and take seconds with any amount of data. The challenge our Engineering team is working on now is recovery. The Nullifier does its job, but recovering the data is hard. We are confident that the De-Nullifier will be released in 1.8.0 as well.
Conclusion
Percona is spearheading innovation in the database technology field. The High Compression feature solves the storage growth problem and as a result, reduces the cloud bill significantly. The release of the Percona Kubernetes Operator for Percona XtraDB Cluster 1.8.0 is planned for mid-April, but this feature is already available in Tech Preview.
As a quick peek at our roadmap, we are glad to share that the Innovation Team has already started working on the High-Density feature, which will drastically reduce the compute footprint required to run MySQL databases.
Smart Update Strategy in Percona Kubernetes Operator for Percona XtraDB Cluster
In Percona Kubernetes Operator for Percona XtraDB Cluster (PXC) versions prior to 1.5.0, there were two methods for upgrading PXC clusters, and both of them use built-in StatefulSet update strategies. The first one is manual (the OnDelete update strategy) and the second one is semi-automatic (the RollingUpdate strategy). Since the Kubernetes Operator is about automating database management, and there are use cases for always keeping the database up to date, a new smart update strategy was implemented.
Smart Update Strategy
The smart update strategy can be used to enable automatic, context-aware upgrades of PXC clusters between minor versions. One of the use cases for automatic upgrades is getting security updates as soon as they are released.
This strategy upgrades the reader PXC Pods first, with the writer Pod upgraded last, and it waits for each upgraded Pod to show up as online in ProxySQL before the next Pod is upgraded. This is needed to minimize the number of failovers during the upgrade and to make the upgrade as smooth as possible.
To make this work, we implemented a version-unaware entrypoint and a Version Service that is queried for up-to-date version information.
The non-version-specific entrypoint script is included in the Operator docker image and is used to start different PXC versions. This means the Operator version is no longer tightly coupled with a specific PXC docker image, as it was before, so one version of the Operator is able to run multiple versions of the PXC cluster.
Version Service, which runs at https://check.percona.com/ by default, provides database version and alert information for various open source products. Version Service is open source and it can be run inside your own infrastructure, but that will be covered in some other blog posts.
How Does it Work?
When smart update is enabled and a new cluster has started, the values for docker images in the cr.yaml file will be ignored since the intention is to get them from the Version Service.
If smart update is enabled for an existing cluster, then at the scheduled time the version of the currently running PXC cluster, the Kubernetes Operator version, and the desired upgrade path are provided to the Version Service. The Version Service returns a JSON object with the set of docker images that should be used in the current environment. After that, the Operator updates the CR with the new image paths and continues with deleting and redeploying the Pods in the optimal order to minimize failovers.
The upgrade will not be done if a backup operation is in progress during the check for updates, since the backup has a higher priority; instead, it will be done the next time the Version Service is checked.
With the smart update functionality, you can also lock your database to a specific version, basically disabling automatic upgrades, and when needed use smart update to trigger context-aware upgrades or just changes to resources.
This is how the upgrade might look in the operator logs (some parts stripped for brevity):
{"level":"info","ts":..,"logger":"..","msg":"update PXC version to 5.7.29-32-57 (fetched from db)"}
{"level":"info","ts":..,"logger":"..","msg":"add new job: * * * * *"}
{"level":"info","ts":..,"logger":"..","msg":"update PXC version from 5.7.29-32-57 to 5.7.30-31.43"}
{"level":"info","ts":..,"logger":"..","msg":"statefullSet was changed, run smart update"}
{"level":"info","ts":..,"logger":"..","msg":"primary pod is cluster1-pxc-0.cluster1-pxc.pxc-test.svc.cluster.local"}
{"level":"info","ts":..,"logger":"..","msg":"apply changes to secondary pod cluster1-pxc-2"}
{"level":"info","ts":..,"logger":"..","msg":"pod cluster1-pxc-2 is running"}
{"level":"info","ts":..,"logger":"..","msg":"pod cluster1-pxc-2 is online"}
{"level":"info","ts":..,"logger":"..","msg":"apply changes to secondary pod cluster1-pxc-1"}
{"level":"info","ts":..,"logger":"..","msg":"pod cluster1-pxc-1 is running"}
{"level":"info","ts":..,"logger":"..","msg":"pod cluster1-pxc-1 is online"}
{"level":"info","ts":..,"logger":"..","msg":"apply changes to primary pod cluster1-pxc-0"}
{"level":"info","ts":..,"logger":"..","msg":"pod cluster1-pxc-0 is running"}
{"level":"info","ts":..,"logger":"..","msg":"pod cluster1-pxc-0 is online"}
{"level":"info","ts":..,"logger":"..","msg":"smart update finished"}
As you can see, the initial PXC version deployed is 5.7.29, after which smart update was enabled with a schedule that checks for updates every minute (this is done for the test only). After that, smart update contacted the Version Service and started the upgrade process to version 5.7.30. The primary Pod identified was PXC Pod 0, so Pods 2 and 1 (readers) were upgraded first, then Pod 0 (the writer), and at the end a message was logged that the upgrade process had finished.
Configuration Options Inside cr.yaml File
spec:
  updateStrategy: SmartUpdate
  upgradeOptions:
    versionServiceEndpoint: https://check.percona.com/versions
    apply: recommended
    schedule: "0 4 * * *"
As already mentioned, updateStrategy can be OnDelete or RollingUpdate in previous versions, but to use automatic upgrades it should be set to SmartUpdate.
Value of the upgradeOptions.versionServiceEndpoint option can be changed from the default if you have your own Version Service running (e.g. if your cluster doesn’t have a connection to the Internet and you have your own custom docker image repositories).
The most important setting is the upgradeOptions.apply option, which can have several values:
- Never or Disabled – automatic upgrades are disabled and smart update is only utilized for other types of full cluster changes such as resource alterations or ConfigMap updates.
- Recommended – automatic upgrades will choose the most recent version of software flagged as Recommended.
- Latest – automatic upgrades will choose the most recent version of the software available.
If you are starting a cluster from scratch and you have selected Recommended or Latest as the desired version, the current 8.0 major version will be selected. If you are already running a cluster, the Version Service should always return an upgrade path inside your major version. Basically, if you want to start with a 5.7 major version from scratch, you should explicitly specify some 5.7 version (see below). Then, after the cluster is deployed, if you wish to enable automatic upgrades, you need to change the value of upgradeOptions.apply to “Recommended” or “Latest”.
- Version Number – when a version number is supplied, an upgrade will be started if the running version doesn’t match the explicit version, and then all future upgrades are no-ops. This essentially locks your database cluster to a specific database version. Example values are “5.7.30-31.43” or “8.0.19-10.1”.
upgradeOptions.schedule is a classic cron schedule option and by default, it is set to check for new versions every day at 4 AM.
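As an illustration of the version-pinning option described above, the value can also be set without editing cr.yaml by hand; the cluster name cluster1 and the pinned version below are examples, so substitute your own:

# Pin the cluster to an explicit PXC version; future automatic upgrades become no-ops
$ kubectl patch pxc cluster1 --type=merge \
    -p '{"spec":{"upgradeOptions":{"apply":"8.0.19-10.1"}}}'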
Limitations
The smart update strategy can only be used to upgrade PXC clusters, not the Operator itself. It will only do minor version upgrades automatically, and it cannot be used for downgrades (since, as you may know, downgrades in MySQL 8 can be problematic even between minor versions).
Conclusion
If there is a need to always keep the PXC cluster upgraded to the latest/recommended version, or if you just want to benefit from the new smart update strategy even without the automatic upgrades functionality, you can use version 1.5.0 of the Percona Kubernetes Operator for Percona XtraDB Cluster for this.
- Documentation related to the upgrade of a PXC cluster with this strategy is located here.
- The PXC Kubernetes operator in version 1.5.0 had many improvements so you might want to check out Updates to Percona Kubernetes Operator for Percona XtraDB Cluster or the release notes for further insights.