Oct 12, 2021

ProxySQL 2.3.0: Enhanced Support for MySQL Group Replication

ProxySQL 2.3.0 was recently released and when I was reading the release notes, I was really impressed with the Group Replication enhancements and features. I thought of experimenting with them and was interested in writing a blog post about them. Here, I have focused on the following two topics:

  • When the replication lag threshold is reached, ProxySQL will move the server to the SHUNNED state instead of moving it to the OFFLINE hostgroup. Shunning a server is performed gracefully, without immediately dropping all backend connections.
  • Servers can be put into maintenance through ProxySQL using “OFFLINE_SOFT”.

Test Environment

To test this, I have configured a three-node GR cluster (gr1, gr2, gr3) in my local environment, running as a single-primary cluster (1 writer, 2 readers).

mysql> select member_host,member_state,member_role,member_version from performance_schema.replication_group_members;
+-------------+--------------+-------------+----------------+
| member_host | member_state | member_role | member_version |
+-------------+--------------+-------------+----------------+
| gr1         | ONLINE       | PRIMARY     | 8.0.26         |
| gr2         | ONLINE       | SECONDARY   | 8.0.26         |
| gr3         | ONLINE       | SECONDARY   | 8.0.26         |
+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

Currently, there is no transaction delay in the GR cluster.

mysql> select * from sys.gr_member_routing_candidate_status;
+------------------+-----------+---------------------+----------------------+
| viable_candidate | read_only | transactions_behind | transactions_to_cert |
+------------------+-----------+---------------------+----------------------+
| YES              | YES       |                   0 |                    0 |
+------------------+-----------+---------------------+----------------------+
1 row in set (0.00 sec)

To compare the results with older ProxySQL versions, I have configured two versions of ProxySQL: the latest release (2.3.0) and an older one (2.2.2).

ProxySQL 1: (2.3.0)

mysql>  show variables like '%admin-version%';
+---------------+------------------+
| Variable_name | Value            |
+---------------+------------------+
| admin-version | 2.3.1-8-g794b621 |
+---------------+------------------+
1 row in set (0.00 sec)

ProxySQL 2: ( < 2.3.0 )

mysql> show variables like '%admin-version%';
+---------------+-------------------+
| Variable_name | Value             |
+---------------+-------------------+
| admin-version | 2.2.2-11-g0e7630d |
+---------------+-------------------+
1 row in set (0.01 sec)

GR nodes are configured on both the ProxySQLs:

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------+
| hostgroup_id | hostname | status |
+--------------+----------+--------+
| 2            | gr1      | ONLINE |
| 3            | gr3      | ONLINE |
| 3            | gr2      | ONLINE |
+--------------+----------+--------+
3 rows in set (0.00 sec)
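
For reference, a runtime layout like this usually comes from mysql_servers entries along the lines of the sketch below (the port value is an assumption; ProxySQL's Group Replication monitoring will then move the servers between hostgroups based on their role):

INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (2,'gr1',3306);
INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (3,'gr2',3306);
INSERT INTO mysql_servers (hostgroup_id,hostname,port) VALUES (3,'gr3',3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;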

Host group settings are:

mysql> select writer_hostgroup,reader_hostgroup,offline_hostgroup from runtime_mysql_group_replication_hostgroups\G
*************************** 1. row ***************************
 writer_hostgroup: 2
 reader_hostgroup: 3
offline_hostgroup: 4
1 row in set (0.00 sec)
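
If you are reproducing this setup, the mapping above lives in the mysql_group_replication_hostgroups table. A minimal sketch is below; the backup_writer_hostgroup value (1) and the remaining option values are illustrative assumptions, not taken from my test configuration:

INSERT INTO mysql_group_replication_hostgroups
    (writer_hostgroup, backup_writer_hostgroup, reader_hostgroup, offline_hostgroup,
     active, max_writers, writer_is_also_reader, max_transactions_behind)
VALUES (2, 1, 3, 4, 1, 1, 0, 0);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;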

Scenario 1: When the replication lag threshold is reached, ProxySQL moves the server to the SHUNNED state instead of moving it to the OFFLINE hostgroup.

Here, the replication lag threshold in ProxySQL is configured as “20”.

mysql> select @@mysql-monitor_groupreplication_max_transactions_behind_count;
+----------------------------------------------------------------+
| @@mysql-monitor_groupreplication_max_transactions_behind_count |
+----------------------------------------------------------------+
| 20                                                             |
+----------------------------------------------------------------+
1 row in set (0.00 sec)
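
If you need a different threshold, the variable can be changed from the ProxySQL admin interface like any other; a quick sketch:

UPDATE global_variables SET variable_value='20'
 WHERE variable_name='mysql-monitor_groupreplication_max_transactions_behind_count';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;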

As per my current settings:

  • At ProxySQL 2.3.0, if transactions_behind reaches 20, the node will be put into the “SHUNNED” state.
  • At “< ProxySQL 2.3.0”, if transactions_behind reaches 20, the node will be moved to the offline hostgroup.

To manually create replication lag, I am going to start a read/write load on the GR cluster using sysbench.

sysbench oltp_read_write --tables=10 --table_size=1000000  --mysql-host=172.28.0.96 --mysql-port=6033 --mysql-user=monitor --mysql-password="Monitor@321" --mysql-db=jc --time=30000 --threads=50 --report-interval=1 run

As expected, now I can see the transaction delay in the cluster.

mysql> select * from sys.gr_member_routing_candidate_status;
+------------------+-----------+---------------------+----------------------+
| viable_candidate | read_only | transactions_behind | transactions_to_cert |
+------------------+-----------+---------------------+----------------------+
| YES              | YES       |                 457 |                    0 |
+------------------+-----------+---------------------+----------------------+
1 row in set (0.00 sec)

Let’s see how the different ProxySQL versions are behaving now.

At ProxySQL 2.3.0:

mysql> select hostgroup_id,hostname, status from runtime_mysql_servers;
+--------------+----------+---------+
| hostgroup_id | hostname | status  |
+--------------+----------+---------+
| 2            | gr1      | ONLINE  |
| 3            | gr2      | SHUNNED |
| 3            | gr3      | SHUNNED |
+--------------+----------+---------+
3 rows in set (0.00 sec)

As expected, both reader nodes (gr2, gr3) are moved to the “SHUNNED” state, and the servers are still present in the reader hostgroup.

At “< ProxySQL 2.3.0”:

mysql> select hostgroup_id,hostname, status from runtime_mysql_servers;
+--------------+----------+---------+
| hostgroup_id | hostname | status  |
+--------------+----------+---------+
| 2            | gr1      | ONLINE  |
| 4            | gr2      | ONLINE  |
| 4            | gr3      | ONLINE  |
+--------------+----------+---------+
3 rows in set (0.00 sec)

The server status is still ONLINE, but the hostgroup_id has changed from 3 to 4, which is the offline hostgroup.

So, when comparing both results, the latest release (2.3.0) seems to have the correct implementation. Shunning the node just temporarily takes the server out of use until the replication lag issue is fixed, and the shunning is performed gracefully, without immediately dropping all backend connections. You can see the servers are still available in the reader hostgroup. With the previous implementation, the servers are moved to the offline_hostgroup immediately.
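
If you want to see what the monitor observed while the nodes were lagging, ProxySQL also keeps a Group Replication check log in the monitor schema; a quick sketch (the exact column list may vary between versions):

SELECT hostname, viable_candidate, read_only, transactions_behind, error
FROM monitor.mysql_server_group_replication_log
ORDER BY time_start_us DESC LIMIT 6;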

Again, from ProxySQL 2.3.0, whether the nodes are shunned during the lag depends on the parameter “mysql-monitor_groupreplication_max_transactions_behind_for_read_only”. The parameter accepts three values (0, 1, 2); a configuration sketch follows the list.

  • “0” means only servers with “read_only=0” are placed as SHUNNED.
  • “1” means Only servers with “read_only=1” are placed as SHUNNED. This is the default one.
  • “2” means Both servers with “read_only=1” and “read_only=0” are placed as SHUNNED.
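
Here is how that setting can be inspected and changed from the admin interface (the value 2 below is just an example, not a recommendation):

SELECT @@mysql-monitor_groupreplication_max_transactions_behind_for_read_only;
UPDATE global_variables SET variable_value='2'
 WHERE variable_name='mysql-monitor_groupreplication_max_transactions_behind_for_read_only';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;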

Scenario 2: The servers can be taken to maintenance through ProxySQL using “OFFLINE_SOFT”.

Personally, I would say this is one of the nicer implementations. From ProxySQL 2.3.0, ProxySQL itself can put servers into maintenance mode using “OFFLINE_SOFT”. You could set it in older versions as well, but the behavior was not stable. Let me explain the behavior of the latest and the older versions.

Both ProxySQLs have the following configuration:

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------+
| hostgroup_id | hostname | status |
+--------------+----------+--------+
| 2            | gr1      | ONLINE |
| 3            | gr3      | ONLINE |
| 3            | gr2      | ONLINE |
+--------------+----------+--------+
3 rows in set (0.00 sec)

— Now, I am going to put the server “gr3” into maintenance mode on both ProxySQL instances.

After putting it into maintenance mode, both ProxySQL instances show the following output.

mysql> update mysql_servers set status='offline_soft' where hostname='gr3'; load mysql servers to runtime; save mysql servers to disk;
Query OK, 1 row affected (0.00 sec)
Query OK, 0 rows affected (0.01 sec)
Query OK, 0 rows affected (0.04 sec)

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------------+
| hostgroup_id | hostname | status       |
+--------------+----------+--------------+
| 2            | gr1      | ONLINE       |
| 3            | gr3      | OFFLINE_SOFT |
| 3            | gr2      | ONLINE       |
+--------------+----------+--------------+
3 rows in set (0.00 sec)

— Now, I am going to stop the group replication service on the “gr3”.

mysql> stop group_replication;
Query OK, 0 rows affected (4.59 sec)

Let’s check the ProxySQL status now.

At ProxySQL 2.3.0:

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------------+
| hostgroup_id | hostname | status       |
+--------------+----------+--------------+
| 2            | gr1      | ONLINE       |
| 3            | gr3      | OFFLINE_SOFT |
| 3            | gr2      | ONLINE       |
+--------------+----------+--------------+
3 rows in set (0.00 sec)

The latest release still maintains the same status. “gr3” is still in maintenance mode.

At “< ProxySQL 2.3.0”:

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------+
| hostgroup_id | hostname | status |
+--------------+----------+--------+
| 2            | gr1      | ONLINE |
| 4            | gr3      | ONLINE |
| 3            | gr2      | ONLINE |
+--------------+----------+--------+
3 rows in set (0.00 sec)

The older ProxySQL release removed the “OFFLINE_SOFT” flag from “gr3” and put it on the offline hostgroup (hg 4).

— Now, I am again going to start the group_replication service on gr3.

mysql> start group_replication;
Query OK, 0 rows affected (2.58 sec)

At ProxySQL 2.3.0:

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------------+
| hostgroup_id | hostname | status       |
+--------------+----------+--------------+
| 2            | gr1      | ONLINE       |
| 3            | gr3      | OFFLINE_SOFT |
| 3            | gr2      | ONLINE       |
+--------------+----------+--------------+
3 rows in set (0.00 sec)

The latest release still maintains the same state.

At “< ProxySQL 2.3.0”:

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------+
| hostgroup_id | hostname | status |
+--------------+----------+--------+
| 2            | gr1      | ONLINE |
| 3            | gr3      | ONLINE |
| 3            | gr2      | ONLINE |
+--------------+----------+--------+
3 rows in set (0.00 sec)

With the older release, the server “gr3” came back ONLINE automatically. This is not the expected behavior because we had manually put that node into maintenance mode.

As you can see from the comparison of the latest and older releases, the latest release has the right implementation. To remove the maintenance mode, we have to manually update the status to “ONLINE” as shown below.

mysql> update mysql_servers set status='online' where hostname='gr3'; load mysql servers to runtime; save mysql servers to disk;
Query OK, 1 row affected (0.00 sec)
Query OK, 0 rows affected (0.00 sec)
Query OK, 0 rows affected (0.04 sec)

mysql> select hostgroup_id,hostname,status from runtime_mySQL_servers;
+--------------+----------+--------+
| hostgroup_id | hostname | status |
+--------------+----------+--------+
| 2            | gr1      | ONLINE |
| 3            | gr3      | ONLINE |
| 3            | gr2      | ONLINE |
+--------------+----------+--------+
3 rows in set (0.00 sec)

I believe these two new implementations are very helpful to those who are running a GR + ProxySQL setup. Apart from GR, the recent major releases have other important features as well. I will try to write a blog post about them in the future; be on the lookout for that.


Oct 07, 2021

Getting Started with ProxySQL in Kubernetes

There are plenty of ways to run ProxySQL in Kubernetes (K8S). For example, we can deploy sidecar containers on the application pods, or run a dedicated ProxySQL service with its own pods.

We are going to discuss the latter approach, which is more likely to be used when dealing with a large number of application pods. Remember each ProxySQL instance runs a number of checks against the database backends. These checks monitor things like server-status and replication lag. Having too many proxies can cause significant overhead.

Creating a Cluster

For the purpose of this example, I am going to deploy a test cluster in GKE. We need to follow these steps:

1. Create a cluster

gcloud container clusters create ivan-cluster --preemptible --project my-project --zone us-central1-c --machine-type n2-standard-4 --num-nodes=3

2. Configure command-line access

gcloud container clusters get-credentials ivan-cluster --zone us-central1-c --project my-project

3. Create a Namespace

kubectl create namespace ivantest-ns

4. Set the context to use our new Namespace

kubectl config set-context $(kubectl config current-context) --namespace=ivantest-ns

Dedicated Service Using a StatefulSet

One way to implement this approach is to have ProxySQL pods use persistent volumes to store the configuration. We can rely on ProxySQL Cluster mode to make sure the configuration is kept in sync.

For simplicity, we are going to use a ConfigMap with the initial config for bootstrapping the ProxySQL service for the first time.

Exposing the passwords in the ConfigMap is far from ideal, and so far the K8S community hasn’t made up its mind about how to implement Reference Secrets from ConfigMap.

1. Prepare a file for the ConfigMap

tee proxysql.cnf <<EOF
datadir="/var/lib/proxysql"
 
admin_variables=
{
    admin_credentials="admin:admin;cluster:secret"
    mysql_ifaces="0.0.0.0:6032"
    refresh_interval=2000
    cluster_username="cluster"
    cluster_password="secret"  
}
 
mysql_variables=
{
    threads=4
    max_connections=2048
    default_query_delay=0
    default_query_timeout=36000000
    have_compress=true
    poll_timeout=2000
    interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
    default_schema="information_schema"
    stacksize=1048576
    server_version="8.0.23"
    connect_timeout_server=3000
    monitor_username="monitor"
    monitor_password="monitor"
    monitor_history=600000
    monitor_connect_interval=60000
    monitor_ping_interval=10000
    monitor_read_only_interval=1500
    monitor_read_only_timeout=500
    ping_interval_server_msec=120000
    ping_timeout_server=500
    commands_stats=true
    sessions_sort=true
    connect_retries_on_failure=10
}
 
mysql_servers =
(
    { address="mysql1" , port=3306 , hostgroup=10, max_connections=100 },
    { address="mysql2" , port=3306 , hostgroup=20, max_connections=100 }
)
 
mysql_users =
(
    { username = "myuser", password = "password", default_hostgroup = 10, active = 1 }
)
 
proxysql_servers =
(
    { hostname = "proxysql-0.proxysqlcluster", port = 6032, weight = 1 },
    { hostname = "proxysql-1.proxysqlcluster", port = 6032, weight = 1 },
    { hostname = "proxysql-2.proxysqlcluster", port = 6032, weight = 1 }
)
EOF

2. Create the ConfigMap

kubectl create configmap proxysql-configmap --from-file=proxysql.cnf

3. Prepare a file with the StatefulSet

tee proxysql-ss-svc.yml <<EOF
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: proxysql
  labels:
    app: proxysql
spec:
  replicas: 3
  serviceName: proxysqlcluster
  selector:
    matchLabels:
      app: proxysql
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: proxysql
    spec:
      restartPolicy: Always
      containers:
      - image: proxysql/proxysql:2.3.1
        name: proxysql
        volumeMounts:
        - name: proxysql-config
          mountPath: /etc/proxysql.cnf
          subPath: proxysql.cnf
        - name: proxysql-data
          mountPath: /var/lib/proxysql
          subPath: data
        ports:
        - containerPort: 6033
          name: proxysql-mysql
        - containerPort: 6032
          name: proxysql-admin
      volumes:
      - name: proxysql-config
        configMap:
          name: proxysql-configmap
  volumeClaimTemplates:
  - metadata:
      name: proxysql-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  annotations:
  labels:
    app: proxysql
  name: proxysql
spec:
  ports:
  - name: proxysql-mysql
    nodePort: 30033
    port: 6033
    protocol: TCP
    targetPort: 6033
  - name: proxysql-admin
    nodePort: 30032
    port: 6032
    protocol: TCP
    targetPort: 6032
  selector:
    app: proxysql
  type: NodePort
EOF

4. Create the StatefulSet

kubectl create -f proxysql-ss-svc.yml

5. Prepare the definition of the headless Service (more on this later)

tee proxysql-headless-svc.yml <<EOF 
apiVersion: v1
kind: Service
metadata:
  name: proxysqlcluster
  labels:
    app: proxysql
spec:
  clusterIP: None
  ports:
  - port: 6032
    name: proxysql-admin
  selector:
    app: proxysql
EOF

6. Create the headless Service

kubectl create -f proxysql-headless-svc.yml

7. Verify the Services

kubectl get svc

NAME              TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                         AGE
proxysql          NodePort    10.3.249.158  <none>        6033:30033/TCP,6032:30032/TCP   12m
proxysqlcluster   ClusterIP   None          <none>        6032/TCP                        8m53s

Pod Name Resolution

By default, each pod has a DNS name associated in the form pod-ip-address.my-namespace.pod.cluster-domain.example.

The headless Service causes K8S to auto-create a DNS record with each pod’s FQDN as well. The result is we will have the following entries available:

proxysql-0.proxysqlcluster
proxysql-1.proxysqlcluster
proxysql-2.proxysqlcluster

We can then use these to set up the ProxySQL cluster (the proxysql_servers part of the configuration file).

Connecting to the Service

To test the service, we can run a container that includes a MySQL client and connect its console output to our terminal. For example, use the following command (which also removes the container/pod after we exit the shell):

kubectl run -i --rm --tty percona-client --image=percona/percona-server:latest --restart=Never -- bash -il

The connections from other pods should be sent to the Cluster-IP and port 6033 and will be load balanced. We can also use the DNS name proxysql.ivantest-ns.svc.cluster.local that got auto-created.

mysql -umyuser -ppassword -h10.3.249.158 -P6033

Use port 30033, together with the address of one of the Kubernetes nodes, if the client is connecting from an external network (the node address below is a placeholder):

mysql -umyuser -ppassword -h<node-external-ip> -P30033

Cleanup Steps

In order to remove all the resources we created, run the following steps:

kubectl delete statefulsets proxysql
kubectl delete service proxysql
kubectl delete service proxysqlcluster

Final Words

We have seen one of the possible ways to deploy ProxySQL in Kubernetes. The approach presented here has a few shortcomings but is good enough for illustrative purposes. For a production setup, consider looking at the Percona Kubernetes Operators instead.


Jun 23, 2021

Boosting Percona Distribution for MySQL Operator Efficiency

Percona is well known for its offer of several outstanding fully open source, free-to-download software packages. And while Percona started as a MySQL-focused company, nowadays it covers different technologies such as MySQL, PostgreSQL, and MongoDB.

In its constant effort to make life easier for our users, Percona has moved from providing single software packages to Percona Distributions for MySQL, MongoDB, and PostgreSQL. Percona Distributions are sets of software packages that Percona has tested and certified to work together. They make it easier to deploy architectural solutions that require multiple components, such as a proxy, a topology manager, backup software, and more.

But we are going even further, and with the release of the Percona Distribution for MySQL/MongoDB Operators, we are providing a high level of automation to roll out and manage solutions based on Percona Distributions.

One of my tasks, as MySQL technical leader, is to identify optimal architectures to serve several common cases, such as Percona Distribution for MySQL: High Availability with Group Replication Solution. Or, in the case of the Percona Distribution for MySQL Operator, to identify the different dimensions (low/mid/high utilization) and suggest a vanilla setup aimed at getting the most out of the deployed solution.

This is a long exercise, which started with a lot of internal discussions to identify what can make sense as traffic, then testing, identifying the saturation points, testing again, and so on. 

It is during this process that I found a small problem (Feature Request). This small issue prevents us from easily and dynamically modifying some parameters in the checks the Operator uses. Given that, we had to put the testing on hold until the above FR is implemented. As you can see it is a small thing, but it will give us better control over the Operator's behavior and will help you to have a well-tuned platform.

This article is to show the level of improvement you can have with small but targeted tuning. To do so, I used the smallest solution we have identified. The solution is dimensioned to serve a small website or a simple application with a low level of traffic. 

The Environment

To help identify a balanced setup we used sysbench and sysbench-tpcc. The whole stack on GCP was composed of application nodes running sysbench, two ProxySQL nodes for R/W split only, and three VMs (8 CPU, 32 GB RAM) with the Percona Operator managing the MySQL service.

Tests

For this specific test we were running 68, 96, 128, and 256 threads:

  • Sysbench read-only
  • Sysbench read/write
  • Sysbench Tpc-c like 

The tests were run multiple times and the data considered is the consolidation of the multiple runs. 

We always run first on a basic environment for a baseline, meaning no tuning for MySQL or the Operator, just correctly dimensioned disk space and Buffer Pool (and related settings).

Then we apply some tuning and run the tests multiple times eventually refining when/where needed. 

The code can be found here and here

And now the results…

Sysbench r/w tests

I am not going to describe the images in detail, as I think they are clear enough. Just keep in mind that on the left we have the results from our baseline, and on the right, the same tests on the same platform with the optimizations applied.

Operations


It’s worth mentioning that without tuning, the platform was not able to consistently scale up to 256 threads. While with a bit of adjustment not only it was able to serve 256 threads but we could have gone a bit further.

Reads

Writes

Comments

As you can see, the sysbench tests clearly indicate that the platform with minor adjustments behaved better and was able to serve more, with constant scaling. Let me add that almost all the tests run on the “basic” platform had incidents, meaning that as soon as the traffic increased, sysbench reported connection interruptions or errors.

TPC-C

Operations

Reads

Writes

Comments

Also for the TPC-C-like tests, we have exactly the same trend. Our “optimized” solution was able to serve up to 1516 qps while the “basic” one was able to reach only 322. In this case, the “optimized” solution was not able to scale up to 256 threads, but that makes sense, given the more write-intensive workload in this test and the small dimension of the platform.

Woah, What Have You Changed?

You may think we have done crazy things to get this difference, but we did not.

Let us jump back. As indicated at the beginning, I had opened an FR (https://jira.percona.com/browse/K8SPXC-749) to be able to tune some/most of the timeouts existing in the operator.

Why? Think about it: when you install a cluster on bare metal, you do not set it up to work only when the load is low and all the components of the server are able to answer in nanoseconds. What you do instead is tune the whole system to accommodate the increasing load, and you give some elements more room for “flexibility”, eventually expecting to have delays in answers. When doing so you also need to correctly align all the parameters that will be affected in cascade. For instance, if you know your data nodes will be very busy serving queries, they may also slow down in answering internal health checks; but if you relax the cluster health checks and not the checks used by the Operator to test the cluster, the platform will be unbalanced and will not work correctly.

At the same time, if you do not tune the solution at all, you may end up with a platform that is theoretically able to serve the load, but that is crashing because of artificial limitations.

The last one is exactly what was happening with our “basic” solution. As it is, the Operator comes with parameters that allow it to work well, but that are not designed to scale. It is like having a server where your CPUs are always at 20%, and if the applications ask for more, a controller throttles them for fear of having too much load. But the fact is that you want to have the CPUs at 80%, or the server will be underutilized.

Anyhow, what we changed were some InnoDB parameters, to allow internal operations to work better. Then we forced consistent reads in PXC, which actually SLOWS down the operations, and finally, we tuned the PXC cluster to be more flexible in its internal checks, avoiding having it expel nodes unless really needed.

All the above was done using the Operator configuration, but then we had to manually change all the timeout parameters used by the Operator checks so they were aligned with what we had defined in the cluster.

In particular, what we have changed was:

script                      line    value
/usr/bin/clustercheckcron   33      TIMEOUT=10
liveness-check.sh           23      TIMEOUT=5
readiness-check.sh          21      TIMEOUT=10
/usr/local/bin/check_pxc.sh 15      TIMEOUT=${CUSTOM_TIMEOUT:-10}

Wait… Why ProxySQL?

Ok, this is another long discussion and I will cover it better in another article. For now, just consider that HAProxy does not allow R/W splitting or other nice functionalities like firewalling, etc. So the idea is simple: let us use the Operator with what fits it better, and then decouple the special needs, eventually adding ProxySQL in a separate deployment.

If you are scared of the cost of adding an additional block to the architecture:

Where:
ProxySQL means: Application → ProxySQL → HAProxy.
HAProxy means: Application → HAProxy.

Hope this puts your worries at rest; of course, this is using the “optimized” solution.

Conclusions

The Percona Distribution for MySQL Operator is a constantly growing and improving solution. It also has a lot of interesting features, like being able to manage your backup/restore, point-in-time recovery, and more. But its adoption is still limited, and it is normal to have some drawbacks like this one. It is on us, who play with real production environments or, as in this case, work to define certified solutions, to give feedback on how the Operator works, in order to make it a stronger product able to serve you better day by day.

Now we are going to wait for the FR to be implemented, and then we will resume our dimensioning work.

Jun 18, 2021

ProxySQL-Admin 2.x: Encryption of Credential Information

Starting with the release of proxysql-admin 2.0.15, the proxysql-admin 2.x series can now encrypt the credentials needed to access ProxySQL and the cluster nodes. This only applies to the proxysql-admin configuration; it does not change the ProxySQL config, so those credentials are still unencrypted.

The credentials file is the unencrypted file containing the usernames, passwords, hostnames, and ports needed to connect to ProxySQL and PXC (Percona XtraDB Cluster).

The proxysql-login-file tool is used to encrypt the credentials file. This encrypted file is known as a login-file. This login-file can then be used by the proxysql-admin and proxysql-status scripts.

Note: This feature requires OpenSSL v1.1.1 and above (with the exception of Ubuntu 16.04). Please see the supported platforms topic below.

Configuration Precedence

  1. command-line options
  2. the encrypted login-file options (if the login-file is used)
  3. the unencrypted proxysql-admin configuration file values

Example Usage

# create the credentials file
$ echo "monitor.user=monitor" > credentials.cnf
$ echo "monitor.password=password" >> credentials.cnf

# Choose a password
$ passwd="secret"

# Method (1) : Encrypt this data with --password
$ proxysql-login-file --in credentials.cnf --out login-file.cnf --password=${passwd}

# Method (2a) : Encrypt the data with --password-file
# Sending the password via the command-line is insecure,
# it's better to use --password-file so that the
# password doesn't show up in the command-line
$ proxysql-login-file --in credentials.cnf --out login-file.cnf \
--password-file=<(echo "${passwd}")

# Method (2b) : Running the command using sudo will not work with
# bash's process substition. In this case, sending the
# password via stdin is another option.
$ sudo echo "${passwd}" | proxysql-login-file --in credentials.cnf --out login-file.cnf \
--password-file=/dev/stdin

# Method (3) : The script will prompt for the password
# if no password is provided via the command-line options.
$ proxysql-login-file --in credentials.cnf --out login-file.cnf
Enter the password:

# Remove the unencrypted credentials file
$ rm credentials.cnf

# Call the proxysql-admin script with the login-file
$ proxysql-admin --enable --login-file=login-file.cnf \
--login-password-file=<(echo "${passwd}")

This script will assist with configuring ProxySQL for use with
Percona XtraDB Cluster (currently only PXC in combination
with ProxySQL is supported)

...

# Call proxysql-status with the login-file
$ proxysql-status --login-file=login-file.cnf \
--login-password-file=<(echo "${passwd}")

............ DUMPING MAIN DATABASE ............
***** DUMPING global_variables *****
+--------------------------------------------------------------+-----------------------------+
| variable_name                                                | variable_value              |
+--------------------------------------------------------------+-----------------------------+
| mysql-default_charset                                        | utf8                        |
|
...

Credentials File Format

# --------------------------------
# This file is constructed as a set of "name=value" pairs.
# Notes:
# (1) Comment lines start with '#' and must be on separate lines
# (2) the name part
# - The only acceptable names are shown below in this example.
# Other values will be ignored.
# (3) The value part:
# - This does NOT use quotes, so any quote character will be part of the value
# - The entire line will be used (be careful with spaces)
#
# If a value is not specified here, then the default value from the
# configuration file will be used.
# --------------------------------

# --------------------------------
# proxysql admin interface credentials.
# --------------------------------
proxysql.user=admin
proxysql.password=admin
proxysql.host=localhost
proxysql.port=6032

# --------------------------------
# PXC admin credentials for connecting to a PXC node.
# --------------------------------
cluster.user=admin
cluster.password=admin
cluster.host=localhost
cluster.port=4110

# --------------------------------
# proxysql monitoring user. proxysql admin script will create
# this user in PXC to monitor a PXC node.
# --------------------------------
monitor.user=monitor
monitor.password=monitor

# --------------------------------
# Application user to connect to a PXC node through proxysql
# --------------------------------
cluster-app.user=cluster_one
cluster-app.password=passw0rd

 

Requirements and Supported Platforms

OpenSSL 1.1.1 (and higher) is an installation requirement (with the exception of Ubuntu 16.04 (xenial), see the comment below).

  • CentOS 7

The OpenSSL 1.1.1+ package must be installed. This can be installed with

yum install openssl11

This command will install OpenSSL 1.1 alongside the system installation and the script will use the openssl11 binary.

  • CentOS 8

The default version of OpenSSL is v1.1.1

  • Ubuntu 16.04 (xenial)

For Ubuntu xenial (16.04), installation of OpenSSL v1.1.1+ is not required; a purpose-built binary used for the encryption/decryption (proxysql-admin-openssl) is installed alongside the proxysql-admin scripts.

  • Ubuntu 18.04 (bionic)

The default version of OpenSSL is v1.1.1

Jan 28, 2021

Load Balancing ProxySQL in AWS

There are several ways to deploy ProxySQL between your applications and the database servers. A common approach is to have a floating virtual IP (VIP) managed by keepalived as the application endpoint. The proxies have to be strategically provisioned to improve the resiliency of the solution (different hardware, network segments, etc.).

When we consider cloud environments, spreading instances across many availability zones (AZ) is considered a best practice, but that presents a problem regarding VIP handling.

By definition, VPC subnets have to be created in a specific AZ, and subnet IP ranges can't overlap with one another. An IP address cannot simply be moved to an instance in a different AZ, as it would end up in a subnet that doesn't include it.

So in order to use the VIP method, we would need to keep all our proxies in a single AZ. This clearly is not the best idea. In addition to this, the regular VIP method doesn’t work, due to the fact that broadcast is not allowed in AWS.

Let’s see how to overcome this by putting ProxySQL instances behind a Network Load Balancer (NLB) instead.

Creating a Load Balancer

1. Create an NLB, specifying the subnets where you launched the ProxySQL instances:

aws elbv2 create-load-balancer \ 
--name proxysql-lb \
--type network \
--scheme internal \
--subnets subnet-03fd9799aedda2a1d subnet-0c9c99a5902d8760f

With the above command, the LB internal endpoints will automatically pick an available IP address on each subnet. Alternatively, if you want to specify the IP addresses yourself, you can run the following:

aws elbv2 create-load-balancer \
--name proxysql-lb \
--type network \
--scheme internal \
--subnet-mappings Subnet-Id=subnet-03fd9799aedda2a1d,PrivateIPv4Address=10.1.1.2 Subnet-Id=subnet-0c9c99a5902d8760f,PrivateIPv4Address=10.1.2.2

The output of the above includes the Amazon Resource Name (ARN) of the load balancer, with the following format:

arn:aws:elasticloadbalancing:us-east-1:686800432451:loadbalancer/net/ivan-proxysql-lb/980f7598e7c43506

Let’s save the value on a variable for later use:

export LB_ARN=<paste the value from above>

Adding the ProxySQL Targets

2. Create a target group, specifying the same VPC that you used for your ProxySQL instances:

aws elbv2 create-target-group \ 
--name proxysql-targets \ 
--protocol TCP \
--port 6033 \
--target-type instance \ 
--health-check-port 6032 \ 
--health-check-interval-seconds 10 \ 
--vpc-id vpc-018cc1c34d4d709d5

The output should include the ARN of the target group with this format:

arn:aws:elasticloadbalancing:us-east-1:686800432451:targetgroup/proxysql-targets/d997e5efc62db322

We can store the value for later use:

export TG_ARN=<paste the value from above>

3. Register your ProxySQL instances with the target group:

aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets Id=i-02d9e450af1b00524

aws elbv2 register-targets \
--target-group-arn $TG_ARN \
--targets Id=i-05d9f450af1b00521

Creating the LB Listener

4. Create a listener for your load balancer with a default rule to forward requests to your target group:

aws elbv2 create-listener \ 
--load-balancer-arn $LB_ARN \ 
--protocol TCP \
--port 3306 \
--default-actions Type=forward,TargetGroupArn=$TG_ARN

The output contains the ARN of the listener, with the following format:

arn:aws:elasticloadbalancing:us-east-1:686800432451:listener/net/ivan-proxysql-lb/980f7598e7c43506/0d0c68ddde71b83f

5. You can verify the health of the registered targets using the following command:

aws elbv2 describe-target-health --target-group-arn $TG_ARN

Be aware it takes a few minutes for the health to go green.

Testing Access

6. Now let’s get the DNS name of the load balancer:

LB_DNS=$(aws elbv2 describe-load-balancers --load-balancer-arns $LB_ARN --query 'LoadBalancers[0].DNSName' --output text)

7. Test access to the load balancer itself:

curl -v $LB_DNS:3306

8. Finally, test the connection to the database through the load balancer:

mysql -u percona -p -hinternal-proxysql-1232905176.us-east-1.elb.amazonaws.com

Final Considerations

For this example, I am using a simple TCP connection to ProxySQL’s admin port as the health check. Another option would be to expose a separate HTTP service that queries ProxySQL to handle more complex health check logic.

It is also important to mention the difference between target-type:instance and target-type:ip for the target group. In the latter, if you check the client connections on the Proxy side (stats_mysql_processlist table) you will see they all come from the load balancer address instead of the actual client. Hence it is more desirable to use instance, to see the real client IP.

Jan 11, 2021

Full Read Consistency Within Percona Kubernetes Operator for Percona XtraDB Cluster

The aim of Percona Kubernetes Operator for Percona XtraDB Cluster is to be a special type of controller introduced to simplify complex deployments. The Operator extends the Kubernetes API with custom resources. The Operator solution uses Percona XtraDB Cluster (PXC) under the hood to provide a highly available, resilient, and scalable MySQL service in the Kubernetes space.

This solution comes with all the advantages/disadvantages provided by Kubernetes, plus some advantages of its own like the capacity to scale reads on the nodes that are not Primary.

Of course, there are some limitations, like the way PXC handles DDLs, which may impact the service. But there is always a cost to pay to get something; expecting to have it all for free is unreasonable.

In this context, we need to cover what full read consistency means in this solution and why it is important to understand the role it plays.

Stale Reads

When using Kubernetes we should talk about the service and not about the technology/product used to deliver such service. 

In our case, the Percona Operator is there to deliver a MySQL service. We should then see it as a whole, as a single object. To be clear, what we must consider is NOT the fact that we have a cluster behind the service, but that we have a service that, to be resilient and highly available, uses a cluster.

We should not care if a node/pod goes down unless the service is discontinued.

What we have as a plus in the Percona Operator solution is a certain level of READ scalability. This is achieved by optimizing the use of the non-PRIMARY nodes: instead of having them sit there applying only replicated data, the Percona Operator provides access to them to scale the reads.

But… there is always a BUT.

Let us start with an image:

 

By design, the apply and commit finalization in Galera (PXC) may have (and does have) a delay between nodes. This means that, with defaults, applications may get inconsistent reads when accessing the data from nodes other than the Primary.

The Operator provides access using two different solutions:

  • Using HAProxy (default)
  • Using ProxySQL

 

 

When using HAProxy you will have 2 entry points:

  • cluster1-haproxy, which will point to the Primary ONLY, for reads and writes. This is the default entry point for the applications to the MySQL database.
  • cluster1-haproxy-replicas, which will point to all three nodes and is supposed to be used for READS only. This is the PLUS you can use if your application has READ/WRITE separation.

Please note that at the moment there is nothing preventing an application from using cluster1-haproxy-replicas for writes as well, but that is dangerous and wrong, because it will generate a lot of certification conflicts and BF aborts, given that it will distribute writes all over the cluster, impacting performance as well (and not giving you any write scaling):

 

[marcotusa@instance-1 ~]$ for i in `seq 1 100`; do mysql -h cluster1-haproxy-replicas -e "insert into test.iamwritingto values(null,@@hostname)";done
+----------------+-------------+
| host           | count(host) |
+----------------+-------------+
| cluster1-pxc-1 |          34 |
| cluster1-pxc-2 |          33 |
| cluster1-pxc-0 |          33 |
+----------------+-------------+

When using ProxySQL there is a single entry point, but you may define query rules to automatically split the R/W requests coming from the application. This is the preferred method when an application has no way to separate the READS from the WRITES.
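
As an illustration of that preferred method, R/W splitting in ProxySQL is usually achieved with a couple of query rules along these lines (a sketch only; the hostgroup numbers are hypothetical and must match your writer and reader hostgroups):

INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (1, 1, '^SELECT.*FOR UPDATE', 10, 1);
INSERT INTO mysql_query_rules (rule_id, active, match_digest, destination_hostgroup, apply)
VALUES (2, 1, '^SELECT', 20, 1);
LOAD MYSQL QUERY RULES TO RUNTIME;
SAVE MYSQL QUERY RULES TO DISK;

Anything that does not match a rule is routed to the default hostgroup of the connecting user.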

Here I have done a comparison of the two methods, HAProxy and ProxySQL.

Now, as mentioned above, by default PXC (or any Galera-based solution) comes with some relaxed settings for performance purposes. This is normally fine in many standard cases, but if you use the Percona Operator and take advantage of the PLUS of scaling reads, using the second access point with HAProxy or query rules with ProxySQL, you should NOT have stale reads, given that the service must provide consistent data, as if you were acting on a single node.

To achieve that you can change the defaults and adjust the PXC parameter wsrep_sync_wait.

When changing the parameter wsrep_sync_wait as explained in the documentation, the node initiates a causality check, blocking incoming queries while it catches up with the cluster. 

Once all data on the node receiving the READ request is commit_finalized, the node performs the read.

But this has a performance impact, as said before.
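
For reference, this is a minimal sketch of the change at the MySQL level; with the Operator you would typically put it in the PXC configuration section of the custom resource rather than set it by hand:

SHOW GLOBAL VARIABLES LIKE 'wsrep_sync_wait';
SET GLOBAL wsrep_sync_wait = 3;

Here 3 enables the causality check for reads plus UPDATE and DELETE statements (see the wsrep documentation for the full bitmask).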

What Is The Impact?

To test the performance impact I had used a cluster deployed in GKE, with these characteristics:

  • 3 Main nodes n2-standard-8 (8 vCPUs, 32 GB memory)
  • 1 App node n2-standard-8 (8 vCPUs, 32 GB memory)
  • PXC pods using:
    •  25GB of the 32 available 
    • 6 CPU of the 8 available
  • HAProxy:
    • 600m CPU
    • 1GB RAM
  • PMM agent
    • 500m CPU
    • 500 MB Ram

In the application node, I used sysbench running two instances, one in r/w mode the other only reads. Finally, to test the stale read, I used the stale read test from my test suite.

Given I was looking for results with a moderate load, I just used 68/96/128 threads per sysbench instance. 

Results

Marco, did we have or not have stale reads? Yes, we did:

I had from 0 (with a very light load) up to 37% stale reads with a MODERATE load, where moderate was the 128-thread sysbench run.

Setting wsrep_sync_wait=3, of course, gave me full consistency. But I had a performance loss:

As you can see, I had an average loss of 11% in case of READS:

While for writes the average loss was 16%. 

Conclusions

At this point, we need to stop and think about what is worth doing. If my application is READ heavy and needs READ scaling, it is probably worth enabling the full synchronicity, given that scaling on the additional nodes allows me to have 2x or more READs.

If instead my application is write critical, probably losing also ~16% performance is not good.

Finally if my application is stale reads tolerant, I will just go with the defaults and get all the benefits without penalties.

Also keep in mind that Percona Kubernetes Operator for Percona XtraDB Cluster is designed to offer a MySQL service, so the state of a single node is not as critical as it is in a default PXC installation; PODs are by nature ephemeral objects, while the service is resilient.

References

Percona Kubernetes Operator for Percona XtraDB Cluster

https://github.com/Tusamarco/testsuite

https://en.wikipedia.org/wiki/Isolation_(database_systems)#Dirty_reads

https://galeracluster.com/library/documentation/mysql-wsrep-options.html#wsrep-sync-wait

https://www.slideshare.net/lefred.descamps/galera-replication-demystified-how-does-it-work

Jan 11, 2021

Percona Kubernetes Operator for Percona XtraDB Cluster: HAProxy or ProxySQL?

Percona Kubernetes Operator for Percona XtraDB Cluster comes with two different proxies, HAProxy and ProxySQL. While the initial version was based on ProxySQL, in time, Percona opted to set HAProxy as the default proxy for the Operator, without removing ProxySQL.

While one of the main points was to guarantee users 1:1 compatibility with vanilla MySQL in the way the Operator allows connections, there are also other factors involved in the decision to have two proxies. In this article, I will scratch the surface of the why.

Operator Assumptions

When working with the Percona Operator, there are few things to keep in mind:

  • Each deployment has to be seen as a single MySQL service, as if it were a single MySQL instance
  • The technology used to provide the service may change in time
  • Pod resiliency is not guaranteed, service resiliency is
  • Resources to be allocated are not automatically calculated and must be identified at the moment of the deployment
  • In production, you cannot set more than 5 or less than 3 nodes when using PXC

There are two very important points in the list above.

The first one is that what you get IS NOT a Percona XtraDB Cluster (PXC), but a MySQL service. The fact that Percona at the moment uses PXC to cover the service is purely accidental and we may decide to change it anytime.

The other point is that the service is resilient while the pod is not. In short, you should expect to see pods stop working and be re-created. What should NOT happen is that the service goes down. Trying to debug each minor issue per node/pod is not what is expected when you use Kubernetes.

Given the above, review your expectations… and let us go ahead. 

The Plus in the Game (Read Scaling)

As said, what is offered with Percona Operator is a MySQL service. Percona has added a proxy on top of the nodes/pods that help the service to respect the resiliency service expectations. There are two possible deployments:

  • HAProxy
  • ProxySQL

Both allow optimizing one aspect of the Operator, which is read scaling. In fact what we were thinking was, given we must use a (virtually synchronous) cluster, why not take advantage of that and allow reads to scale on the other nodes when available? 

This approach will help all the ones using POM to have the standard MySQL service but with a plus. 

But with it also come some possible issues, like READ/WRITE splitting and stale reads. See this article about stale reads on how to deal with them.

For R/W splitting we instead have a totally different approach depending on which kind of proxy we implement.

If using HAProxy, we offer a second entry point that can be used for READ operations. That entry point will balance the load on all the nodes available.

Please note that at the moment there is nothing preventing an application from using cluster1-haproxy-replicas for writes as well, but that is dangerous and wrong, because it will generate a lot of certification conflicts and BF aborts, given that it will distribute writes all over the cluster, impacting performance as well (and not giving you any write scaling). It is your responsibility to guarantee that only READS will go through that entry point.

If instead ProxySQL is in use, it is possible to implement automatic R/W splitting. 

Global Difference and Comparison

At this point, it is useful to have a better understanding of the functional difference between the two proxies and what is the performance difference if any. 

As we know, HAProxy acts as a level 4 proxy when operating in TCP mode. It is also a forward proxy, which means each TCP connection from the client is established with the final target and there is no interpretation of the data flow.

ProxySQL, on the other hand, is a level 7 proxy and a reverse proxy; this means the client establishes a connection to the proxy, which presents itself as the final backend. Data can be altered on the fly while it is in transit.

To be honest, it is more complicated than that, but allow me the simplification.

On top of that, there are additional functionalities that are present in one (ProxySQL) and not in the other. The point is whether they are relevant for use in this context or not. For a short list, see below (the source is the ProxySQL blog, but the data was removed):

As you may have noticed, HAProxy lacks some of those functionalities, like R/W split, firewalling, and caching, which belong to the level 7 implementation of ProxySQL.

The Test Environment

To test the performance impact I had used a cluster deployed in GKE, with these characteristics:

  • 3 Main nodes n2-standard-8 (8 vCPUs, 32 GB memory)
  • 1 App node n2-standard-8 (8 vCPUs, 32 GB memory)
  • PXC pods using:
    •  25GB of the 32 available 
    • 6 CPU of the 8 available
  • HAProxy:
    • 600m CPU
    • 1GB RAM
  • PMM agent
    • 500m CPU
    • 500 MB Ram
  • Tests using sysbench as for (https://github.com/Tusamarco/sysbench), see in GitHub for command details.

What I have done is to run several tests running two Sysbench instances. One only executing reads, while the other reads and writes. 

In the case of ProxySQL, I had R/W splitting thanks to the Query rules, so both sysbench instances were pointing to the same address. While testing HAProxy I was using two entry points:

  • Cluster1-haproxy – for read and write
  • Cluster1-haproxy-replicas – for read only

Then I also compared what happens if all requests hit one node only. For that, I executed one sysbench in R/W mode against one entry point, with NO R/W split for ProxySQL.

Finally, sysbench tests were executed with the --reconnect option to force the tests to establish new connections.

As usual, tests were executed multiple times, on different days of the week and moments of the day. Data reported is a consolidation of that, and images from Percona Monitoring and Management (PMM) are samples coming from the execution that was closest to the average values. 

Comparing Performance When Scaling Reads

These tests imply that one node is mainly serving writes while the others are serving reads. To not affect performance, and given I was not interested in maintaining full read consistency, the parameter wsrep_sync_wait was kept as default (0). 

HAProxy

HAProxy ProxySQL

A first observation shows how ProxySQL seems to keep a more stable level of requests served. The increasing load penalizes HAProxy, reducing the number of operations at 1024 threads.

HAProxy vs. ProxySQL read comparison

Digging a bit more we can see that HAProxy is performing much better than ProxySQL for the WRITE operation. The number of writes remains almost steady with minimal fluctuations. ProxySQL on the other hand is performing great when the load in write is low, then performance drops by 50%.

For reads, we have the opposite. ProxySQL is able to scale in a very efficient way, distributing the load across the nodes and able to maintain the level of service despite the load increase. 

If we start to take a look at the latency distribution statistics (sysbench histogram information), we can see that:

HAProxy latency vs. ProxySQL latency

In the case of low load and writes, both proxies stay on the left side of the graph with a low value in ms. HAProxy is a bit more consistent and grouped around 55ms value, while ProxySQL is a bit more sparse and spans between 190-293ms.

About reads we have similar behavior, both for the large majority between 28-70ms. We have a different picture when the load increases:  

ProxySQL is having some occurrences where it performs better, but it spans in a very large range, from ~2k ms to ~29k ms. While HAProxy is substantially grouped around 10-11K ms. As a result, in this context, HAProxy is able to better serve writes under heavy load than ProxySQL. 

Again, a different picture in case of reads.

Here ProxySQL is still spanning a wide range, ~76ms to 1500ms, while HAProxy is more consistent but less efficient, grouping the majority of the service around 1200ms. This is consistent with the performance loss we have seen in READ when using high load and HAProxy.

Comparing When Using Only One Node

But let us now discover what happens when using only one node. So using the service as it should be, without the possible Plus of read scaling. 

Percona Kubernetes Operator for Percona XtraDB Cluster

The first thing I want to mention is a strange behavior that was consistently happening (no matter which proxy was used) at 128 threads. I am investigating it, but I do not have a good answer yet on why the Operator had that significant drop in performance ONLY with 128 threads.

Aside from that, the results were consistently showing HAProxy performing better in serving read/writes. Keep in mind that HAProxy just establishes the connection point-to-point and is not doing anything else. While ProxySQL is designed to eventually act on the incoming stream of data. 

This becomes even more evident when reviewing the latency distribution. In this case, no matter what load we have, HAProxy performs better:

As you can notice, HAProxy is less grouped than when we have two entry points, but it is still able to serve more efficiently than ProxySQL.

Conclusions

As usual, my advice is to use the right tool for the job, and do not force yourself into something stupid. And as clearly stated at the beginning, Percona Kubernetes Operator for Percona XtraDB Cluster is designed to provide a MySQL SERVICE, not a PXC cluster, and all the configuration and utilization should converge on that.

ProxySQL can help you IF you want to scale a bit more on READS using the possible plus. But this is not guaranteed to work as it does when using standard PXC. Not only do you need a very good understanding of Kubernetes and ProxySQL if you want to avoid issues; with HAProxy you can scale reads as well, but you need to be sure you have R/W separation at the application level.

In any case, utilizing HAProxy for the service is the easier way to go. This is one of the reasons why Percona decided to shift to HAProxy. It is the solution that offers the proxy service more in line with the aim of the Kubernetes service concept. It is also the solution that remains closer to how a simple MySQL service should behave.

You need to set your expectations correctly to avoid being in trouble later.

References

Percona Kubernetes Operator for Percona XtraDB Cluster

 

Wondering How to Run Percona XtraDB Cluster on Kubernetes? Try Our Operator!

The Criticality of a Kubernetes Operator for Databases

 

Nov 30, 2020

Support for Percona XtraDB Cluster in ProxySQL (Part Two)


How scheduler and script stand in supporting failover (Percona and Marco example) 

In part one of this series, I illustrated how simple scenarios may fail or have problems when using Galera native support inside ProxySQL. In this post, I will repeat the same tests, but using the scheduler option and an external script.

The Scheduler

First a brief explanation about the scheduler.

The scheduler inside ProxySQL was created to allow administrators to extend ProxySQL's capabilities. The scheduler gives the option to add any kind of script or application and run it at a specified interval. The scheduler was also the first way we had to deal with Galera/Percona XtraDB Cluster (PXC) node management in case of issues.

The scheduler table is composed as follows:

CREATE TABLE scheduler (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    active INT CHECK (active IN (0,1)) NOT NULL DEFAULT 1,
    interval_ms INTEGER CHECK (interval_ms>=100 AND interval_ms<=100000000) NOT NULL,
    filename VARCHAR NOT NULL,
    arg1 VARCHAR,
    arg2 VARCHAR,
    arg3 VARCHAR,
    arg4 VARCHAR,
    arg5 VARCHAR,
    comment VARCHAR NOT NULL DEFAULT '')

The relevant elements are:

  • active: defines whether or not the scheduler should execute the external script
  • interval_ms: frequency of execution. There is NO check on whether the previous execution has terminated, so the script must include a check to prevent launching multiple instances, which would probably create conflicts and resource issues.
  • filename: the FULL path of the script/app you want to be executed.
  • arg1-arg5: whatever you want to pass as arguments. When you have a complex script, either use a configuration file or collapse multiple arguments into a single string.

The Scripts

In this blog, I will present two different scripts (as examples). Both will cover the scenarios as in the previous article and can do more, but I will focus only on that part for now.

One script is written in Bash and is a port of the proxysql_galera_checker that Percona was using with ProxySQL-admin in ProxySQL version 1.4. The script is available here from Percona-lab (git clone ).

The other, which I wrote in Perl, is probably the first script of this kind, dating back to 2016. I have done some enhancements and bug fixing on it over the years. It is available here (git clone).

Both are offered here as examples, and I am not suggesting you use them as-is in critical production environments.

The Setup

To use the two scripts some custom setup must be done. First of all, check that the files are executable by the user running ProxySQL.

Let’s start with mine in Perl

To make it work we need to define a set of host groups that will work as Reader/Writer/Backup-writer/backup-reader (optional but recommended). The difference from the native support is that instead of having them indicated in a specialized table, we will use the mysql_servers table.

  • Writer: 100
  • Readers: 101
  • Backup Writers: 8100
  • Backup Readers: 8101

Given the above, on top of the servers already defined in the previous article, we just need to add the 8xxx backup HGs.

For example:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',8100,3306,1000,2000,'Failover server preferred');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',8100,3306,999,2000,'Second preferred');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',8100,3306,998,2000,'Third and last in the list');      
    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',8101,3306,100,2000,'');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',8101,3306,1000,2000,'');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',8101,3306,1000,2000,'');

After that we need to insert the instructions for the scheduler:

INSERT  INTO scheduler (id,active,interval_ms,filename,arg1) values (10,0,2000,"/opt/tools/proxy_sql_tools/galera_check.pl","-u=cluster1 -p=clusterpass -h=192.168.4.191 -H=100:W,101:R -P=6032 --retry_down=2 --retry_up=1 --main_segment=2 --debug=0  --log=/var/lib/proxysql/galeraLog --active_failover=1");

The result will be:

id: 10
     active: 0
interval_ms: 2000
   filename: /opt/tools/proxy_sql_tools/galera_check.pl
       arg1: -u=cluster1 -p=clusterpass -h=192.168.4.191 -H=100:W,101:R -P=6032 --retry_down=2 --retry_up=1 --main_segment=2 --debug=0  --log=/var/lib/proxysql/galeraLog --active_failover=1
       arg2: NULL
       arg3: NULL
       arg4: NULL
       arg5: NULL
    comment:

Please refer to the instructions on GitHub for the details of the parameters. What is worth highlighting here is:

  • -H=100:W,101:R are the host groups dealing with our PXC cluster
  • --active_failover=1 is the failover method to apply
  • --retry_down=2 --retry_up=1 define whether action must be taken immediately or only after a retry. This is to avoid a possible yo-yo effect due to any delay from the node or the network.

Note the active flag in the scheduler entry: always insert it as 0 and activate it only when everything is in place and you are ready to go. Once the above is done, the galera_check script is ready to be used by ProxySQL.
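
As a reference, a minimal activation sketch (assuming the scheduler entry with id=10 inserted above) run from the ProxySQL admin interface would be:

-- Enable the entry only when you are ready to go
UPDATE scheduler SET active=1 WHERE id=10;
-- Apply the change and persist it
LOAD SCHEDULER TO RUNTIME;
SAVE SCHEDULER TO DISK;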

Percona proxysql_galera_checker

One limitation this script has is that you cannot use different IPs for the PXC internal communication and for the ProxySQL node. Given that, we need to modify the setup we had in the previous blog to match the script requirements. Here too we need to define which HG will be the writer and which the reader, but we will specify the internal IPs, and, of course, ProxySQL must have access to that network as well.

  • Writer HG: 200
  • Reader HG: 201
  • Network IPs: 10.0.0.22, 10.0.0.23, 10.0.0.33

Given that, our ProxySQL setup will be:

delete from mysql_users where username='app_test';
insert into mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent,comment) values ('app_test','test',1,200,'mysql',1,'application test user DC1');
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;

delete from mysql_query_rules where rule_id in(1040,1042);
insert into mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(1040,6033,'app_test',200,1,3,'^SELECT.*FOR UPDATE',1);
insert into mysql_query_rules (rule_id,proxy_port,username,destination_hostgroup,active,retries,match_digest,apply) values(1042,6033,'app_test',201,1,3,'^SELECT.*$',1);
load mysql query rules to run;save mysql query rules to disk;

delete from mysql_servers where hostgroup_id in (200,201);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',200,3306,10000,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.22',201,3306,100,2000,'DC1');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.23',201,3306,10000,2000,'DC1');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('10.0.0.33',201,3306,10000,2000,'DC1');        

load mysql servers to run;save mysql servers to disk;

As you can see, here we also need to redefine the user and the query rules to match the different HGs; if you use the same ones (100-101) there is no need to do that. Now it's time to add the line for the scheduler:

delete from scheduler where id=60;
INSERT  INTO scheduler (id,active,interval_ms,filename,arg1) values (60,0,3000,"/opt/tools/proxysql-scheduler/proxysql_galera_checker","--config-file=/opt/tools/proxysql-scheduler/proxysql-admin-sample.cnf --writer-is-reader=always --write-hg=200 --read-hg=201 --writer-count=1 --priority=10.0.0.22:3306,10.0.0.23:3306,10.0.0.33:3306 --mode=singlewrite --debug --log=/tmp/pxc_test_proxysql_galera_check.log");
LOAD SCHEDULER TO RUNTIME;SAVE SCHEDULER TO DISK;

Also in this case, please refer to the specifications of the parameters, but it's worth mentioning:

  • --write-hg=200 --read-hg=201 define the host groups
  • --writer-is-reader=always Keep this as ALWAYS, please; we will see why you do not need anything different.
  • --mode=singlewrite Possible modes are load balancer and single writer. The load-balancer mode is a leftover from the past. Never, ever use Galera/PXC in multi-primary mode, period.
  • --priority=10.0.0.22:3306,10.0.0.23:3306,10.0.0.33:3306 This is where we define the priority for the writers.

Also in this case, when loading the scheduler entry, keep it deactivated and enable it only when you are ready.

The Tests

Read Test

The first test is the simple read test, so while we have sysbench running in read_only mode we remove one reader after the other.

Marco script:

+---------+-----------+---------------+----------+--------------+----------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree |
+---------+-----------+---------------+----------+--------------+----------+----------+
| 10000   | 100       | 192.168.4.22  | 3306     | ONLINE       | 0        | 0        |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 38       | 8        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 15       | 49       |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        | 64       |

As we can see, by just setting the weight we will be able to prevent sending reads to the Writer, and while some will still arrive there, it is negligible. Once we put all the readers down…

Marco script: 

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 100       | 192.168.4.22  | 3306     | ONLINE       | 0        |
| 10000   | 101       | 192.168.4.233 | 3306     | SHUNNED      | 0        |
| 10000   | 101       | 192.168.4.23  | 3306     | SHUNNED      | 0        |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 58       |

With only the last node left, even with its low weight, it will serve all the reads.

Percona Script:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 0        |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 22       |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 21       |
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 1        |

Remove the readers:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 0        |
| 10000   | 201       | 10.0.0.33     | 3306     | OFFLINE_SOFT | 0        |
| 10000   | 201       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        |
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 62       |

In both cases, no issue at all; the writer takes the load of the reads only when left alone. 

Maintenance Test

In this test, I will simply put the node into maintenance mode using pxc_maint_mode=MAINTENANCE, as done in the other article. As a reminder, this was also working fine with the native Galera support.
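
For reference, a minimal sketch of what is run on the PXC node itself (pxc_maint_mode is a Percona XtraDB Cluster variable):

-- Put the node in maintenance; the scheduler script should then drain it on the ProxySQL side
SET GLOBAL pxc_maint_mode='MAINTENANCE';
-- Once maintenance is over, put it back in rotation
SET GLOBAL pxc_maint_mode='DISABLED';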


Marco script:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 100       | 192.168.4.22  | 3306     | ONLINE       | 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 8        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 3        |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        |
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        |

After:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 999     | 100       | 192.168.4.23  | 3306     | ONLINE       | 50       |
| 10000   | 100       | 192.168.4.22  | 3306     | OFFLINE_SOFT | 0        |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 5        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 6        |
| 100     | 101       | 192.168.4.22  | 3306     | OFFLINE_SOFT | 0        |

A new node was elected as writer, and thanks to OFFLINE_SOFT the connections on the old writer were also able to finish gracefully. Putting the node back, removing it from maintenance:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 999     | 100       | 192.168.4.23  | 3306     | ONLINE       | 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 5        |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 5        |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        |

The node WILL NOT fail back by default (this is by design); this allows you to warm caches, or do anything else that may be meaningful, before moving the node to the Primary role again.

The Percona script will behave a bit differently:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        |
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 50       |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 4        |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 10       |
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 0        |

Then I put the node under maintenance:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | ONLINE       | 26       |
| 10000   | 200       | 10.0.0.22     | 3306     | OFFLINE_SOFT | 22       |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 8        |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 12       |
| 100     | 201       | 10.0.0.22     | 3306     | OFFLINE_SOFT | 0        |

Connections will be moved to the new Writer slowly based on the application approach. 

But when I put the node back from maintenance:

+---------+-----------+---------------+----------+--------------+----------
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed 
+---------+-----------+---------------+----------+--------------+----------
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 0        
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 49       
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 5        
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 14       
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 0

The old Writer will be put back as Primary. As indicated above, I consider this wrong, given the risk of putting back a node with cold caches, which can affect production performance. It is true that putting it back from maintenance is a controlled action, but the more checks the better.

Testing Node Crash

Marco script:

To emulate a crash I will kill the mysqld process with kill -9 <pid>.

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000    | 100       | 192.168.4.22  | 3306     | ONLINE       | 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 12       |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 4        |
| 100     | 101       | 192.168.4.22  | 3306     | ONLINE       | 0        |

Kill the process:

59,50,53.99,6603.16,6205.21,218.97,178.98,1561.52,0.00,2.00
60,50,54.11,5674.25,5295.50,215.43,163.32,1648.20,0.00,1.00 
61,50,3.99,3382.12,3327.22,30.95,23.96,2159.29,0.00,48.91   <--- start
62,50,0.00,820.35,820.35,0.00,0.00,0.00,0.00,0.00         
63,50,0.00,2848.86,2550.67,195.13,103.07,0.00,0.00,0.00
64,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
65,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
66,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
67,50,50.00,4268.99,4066.99,52.00,150.00,7615.89,0.00,1.00  <--- failover end 
68,50,72.00,6522.40,6096.37,268.02,158.01,1109.09,0.00,1.00

Failover consistently takes about five seconds, two of which are due to the scheduler running every two seconds, plus the configured retry. The new Primary is serving while the failed node is removed:

+---------+-----------+---------------+----------+--------------+----------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed | ConnFree |
+---------+-----------+---------------+----------+--------------+----------+----------+
| 999     | 100       | 192.168.4.23  | 3306     | ONLINE       | 0        | 50       |
| 10000   | 101       | 192.168.4.233 | 3306     | ONLINE       | 0        | 34       |
| 10000   | 101       | 192.168.4.23  | 3306     | ONLINE       | 0        | 35       |
| 100     | 101       | 192.168.4.22  | 3306     | SHUNNED      | 0        | 0        |

Percona script:

Also, in this case, the Percona script behaves a bit differently.

Before the crash:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 49       |
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 5        |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 14       |
| 100     | 201       | 10.0.0.22     | 3306     | ONLINE       | 0        |

Then kill the process:

29,50,41.05,4099.74,3838.44,155.18,106.12,2009.23,0.00,0.00
30,50,8.01,1617.92,1547.79,37.07,33.06,1803.47,0.00,50.09
31,50,0.00,2696.60,2696.60,0.00,0.00,0.00,0.00,0.00       <--- start
32,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
33,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
34,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
35,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
36,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
37,50,12.96,2385.82,2172.46,91.72,121.63,8795.93,0.00,0.00  <--- failback ends 6"
38,50,39.95,4360.00,4083.38,148.80,127.82,9284.15,0.00,0.00

The time to recover is variable, but around 6-12 seconds.

+---------+-----------+---------------+----------+---------+----------+
| weight  | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+---------+-----------+---------------+----------+---------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | ONLINE  | 50       | <-- new
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE  | 11       |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE  | 5        |

New Primary is elected. But on node recovery:

+---------+-----------+---------------+----------+--------------+----------+
| weight  | hostgroup | srv_host      | srv_port | status       | ConnUsed |
+---------+-----------+---------------+----------+--------------+----------+
| 1000000 | 200       | 10.0.0.23     | 3306     | OFFLINE_SOFT | 50       | 
| 10000   | 200       | 10.0.0.22     | 3306     | ONLINE       | 0        |<--old is back
| 10000   | 201       | 10.0.0.33     | 3306     | ONLINE       | 10       |
| 10000   | 201       | 10.0.0.23     | 3306     | ONLINE       | 6        |
| 1000    | 201       | 10.0.0.22     | 3306     | ONLINE       | 0        |

As with maintenance, when the node comes back it is moved to the Primary role by default. As already explained, I consider this wrong and dangerous, but it is a matter of what you think a script should do.

Conclusions

PXC is a complex product, it can be deployed in many ways, and it is not easy (or even possible) to identify all the possible variants.

Having the opportunity to use the native support could seem the easier way to go, but as illustrated in part one of this series, misbehavior is just around the corner and it may seriously impact your production environment.

The use of the scheduler, with a properly developed script/application that handles the Galera support, can guarantee better consistency and proper behavior with respect to your custom expectations.

There are solutions out there that may fit your needs, but if not, you can develop your own and be sure that you keep consistency when changing versions of ProxySQL and/or PXC/Galera. In the end, once the main work is done, maintaining a script will be much easier than having to patch a product or wait for a feature request to be implemented.

I know it may look like a step back, moving away from native support and using a scheduler again. But it is not; it's just the acknowledgment that sometimes it is better to keep it simple and do some specific tuning/work, rather than trying to put the universe in a bottle, which overcomplicates the problem.

Nov
30
2020
--

Support for Percona XtraDB Cluster in ProxySQL (Part One)

Support for Percona XtraDB Cluster in ProxySQL


How native ProxySQL stands in failover support (both v2.0.15 and v2.1.0)

In recent times I have been designing several solutions focused on High Availability and Disaster Recovery. Some of them used Percona Server for MySQL with Group Replication, some used Percona XtraDB Cluster (PXC). What many of them had in common was the use of ProxySQL for the connection layer. This is because I consider the use of a layer 7 proxy preferable, given the possible advantages provided in read/write split and SQL filtering.

The other positive aspect provided by ProxySQL, at least for Group Replication, is the native support which allows us to have a very quick resolution of possible node failures.

ProxySQL has Galera support as well, but in the past that had proven to be pretty unstable, and the old method of using the scheduler was still the best way to go.

After Percona Live Online 2020 I decided to try it again and to see if at least the basics were now working fine. 

What I Have Tested

I was not looking for complicated tests that would have included different levels of transaction isolation. I was instead interested in the simpler and more basic ones. My scenario was:

1 ProxySQL node v2.0.15  (192.168.4.191)
1 ProxySQL node v2.1.0  (192.168.4.108)
3 PXC 8.20 nodes (192.168.4.22/23/233) with internal network (10.0.0.22/23/33) 

ProxySQL was freshly installed. 

All the commands used to modify the configuration are here. Tests were done first using ProxySQL v2.0.15 and then v2.1.0. Only if the results diverge will I report the version and results separately.

PXC Failover Scenario

As mentioned above I am going to focus on the fail-over needs, period. I will have two different scenarios:

  • Maintenance
  • Node crash 

From the ProxySQL point of view I will have three scenarios always with a single Primary:

  • Writer is NOT a reader (option 0 and 2)
  • Writer is also a reader

The configuration of the native support will be:

INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',100,3306,10000,2000,'Preferred writer');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',100,3306,1000,2000,'Second preferred ');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',100,3306,100,2000,'Last chance');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.22',101,3306,100,2000,'last reader');
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.23',101,3306,10000,2000,'reader1');    
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_connections,comment) VALUES ('192.168.4.233',101,3306,10000,2000,'reader2');

Galera host groups:

  • Writer: 100
  • Reader: 101
  • Backup_writer: 102
  • Offline_hostgroup: 9101

Before going ahead, let us analyze the mysql_servers settings. As you can notice, I am using the weight attribute to indicate to ProxySQL which is my preferred writer. But I also use weight on the READ host group to indicate which servers should be used and how.

Given that, we have:

  • Write
    • 192.168.4.22  is the preferred Primary
    • 192.168.4.23  is the first failover 
    • 192.168.4.233 is the last chance 
  • Read
    • 192.168.4.233/23 have the same weight and load should be balanced between the two of them
    • 192.168.4.22, given it is the preferred writer, should NOT receive the same read load, and therefore has a lower weight value.
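
A quick way to double-check how these weights land once loaded is to query the runtime table from the ProxySQL admin interface, for example:

-- List servers per host group, highest weight first
SELECT hostgroup_id, hostname, port, weight, status
FROM runtime_mysql_servers
ORDER BY hostgroup_id, weight DESC;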

The Tests

First Test

The first test is to see how the cluster will behave in the case of 1 Writer and 2 readers, with the option writer_is_also_reader = 0.
To achieve this, the settings for ProxySQL will be:

insert into mysql_galera_hostgroups (writer_hostgroup,backup_writer_hostgroup,reader_hostgroup, offline_hostgroup,active,max_writers,writer_is_also_reader,max_transactions_behind) values (100,102,101,9101,1,1,0,10);
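
To apply it, the standard load/save pair for the server configuration is used (a minimal sketch; the galera host group definition is picked up together with the server list):

LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;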

As soon as I load this to runtime, ProxySQL should move the nodes to the relevant host groups. But this is not what happens; instead, it keeps the readers in the writer HG and SHUNs them.

+---------+--------------+---------------+--------------+
| weight  | hostgroup_id | srv_host      | status       |
+---------+--------------+---------------+--------------+
| 100     | 100          | 192.168.4.233 | SHUNNED      |
| 1000    | 100          | 192.168.4.23  | SHUNNED      |
| 10000   | 100          | 192.168.4.22  | ONLINE       |
| 100     | 102          | 192.168.4.233 | ONLINE       |
| 1000    | 102          | 192.168.4.23  | ONLINE       |
+---------+--------------+---------------+--------------+

This is, of course, wrong. But why does it happen?

The reason is simple. ProxySQL is expecting to see all nodes in the reader group with READ_ONLY flag set to 1. 

In ProxySQL documentation we can read:

writer_is_also_reader=0: nodes with read_only=0 will be placed either in the writer_hostgroup and in the backup_writer_hostgroup after a topology change, these will be excluded from the reader_hostgroup.

This is conceptually wrong. 

A PXC cluster is a tightly coupled replication cluster, with virtually synchronous replication. One of its benefits is to have the node “virtually” aligned with respect to the data state. 

In this kind of model, the cluster is data-centric, and each node shares the same data view.

What it also means is that if correctly set the nodes will be fully consistent in data READ.

The other characteristic of the cluster is that ANY node can become a writer anytime. 
While best practices indicate that it is better to use one Writer at a time as Primary to prevent certification conflicts, this does not mean that the nodes not currently elected as Primary should be prevented from becoming a writer.

Which is exactly what the READ_ONLY flag does if activated.

Not only that, the need to have READ_ONLY set means that we must change it BEFORE the node can become a writer in case of failover.

This, in short, means we need either a topology manager or a script that will do that, with all the relative checks and logic needed to be safe. At failover time, this adds time and complexity when it is not really needed, and it goes against the concept of the tightly-coupled cluster itself.
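
For illustration, the toggle such an external manager would have to perform at failover time is just the standard MySQL flag; a minimal sketch:

-- On the node being promoted to writer
SET GLOBAL read_only = 0;
-- On the node being demoted back to reader
SET GLOBAL read_only = 1;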

Given the above, we can say that this ProxySQL behavior related to writer_is_also_reader=0, as it is implemented today for Galera, is, at best, useless.

Why does it work for Group Replication? That is easy: because Group Replication internally uses a mechanism to lock/unlock the non-primary nodes when using the cluster in single-Primary mode. That internal mechanism was implemented as a safety guard to prevent random writes on multiple nodes, and it also manages the READ_ONLY flag.

Second Test

Let us move on and test with writer_is_also_reader = 2. Again from the documentation:

writer_is_also_reader=2 : Only the nodes with read_only=0 which are placed in the backup_writer_hostgroup are also placed in the reader_hostgroup after a topology change i.e. the nodes with read_only=0 exceeding the defined max_writers.

Given the settings as indicated above, my layout before using Galera support is:

+---------+--------------+---------------+--------------+
| weight  | hostgroup_id | srv_host      | status       |
+---------+--------------+---------------+--------------+
| 100     | 100          | 192.168.4.233 | ONLINE       |
| 1000    | 100          | 192.168.4.23  | ONLINE       |
| 10000   | 100          | 192.168.4.22  | ONLINE       |
| 10000   | 101          | 192.168.4.233 | ONLINE       |
| 10000   | 101          | 192.168.4.23  | ONLINE       |
| 100     | 101          | 192.168.4.22  | ONLINE       |
+---------+--------------+---------------+--------------+

After enabling Galera support:

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host      | srv_port | status  |
+--------+-----------+---------------+----------+---------+
| 100    | 100       | 192.168.4.233 | 3306     | SHUNNED |
| 1000   | 100       | 192.168.4.23  | 3306     | SHUNNED |
| 10000  | 100       | 192.168.4.22  | 3306     | ONLINE  |
| 100    | 101       | 192.168.4.233 | 3306     | ONLINE  |
| 1000   | 101       | 192.168.4.23  | 3306     | ONLINE  |
| 100    | 102       | 192.168.4.233 | 3306     | ONLINE  |
| 1000   | 102       | 192.168.4.23  | 3306     | ONLINE  |
+--------+-----------+---------------+----------+---------+

So the node ending with .22 (the elected Primary) is not in the reader pool. Which can be OK, I assume.

But what is not OK at all is that the READERS now have a completely different weight. Nodes x.23 and x.233 are NOT balancing the load any longer, because the weight is no longer the same, nor the one I defined; it is instead copied over from the WRITER settings.

Well of course this is wrong and not what I want. Anyhow, let’s test the READ failover.

I will use sysbench read-only:

sysbench ./src/lua/windmills/oltp_read.lua  --mysql-host=192.168.4.191 --mysql-port=6033 --mysql-user=app_test --mysql-password=test --mysql-db=windmills_s --db-driver=mysql --tables=10 --table_size=10000  --rand-type=zipfian --rand-zipfian-exp=0.5 --skip_trx=true  --report-interval=1  --mysql_storage_engine=innodb --auto_inc=off --histogram --table_name=windmills  --stats_format=csv --db-ps-mode=disable --point-selects=50 --range-selects=true --threads=50 --time=2000   run

mysql> select * from  runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                 active: 1
            max_writers: 1
  writer_is_also_reader: 2
max_transactions_behind: 10
                comment: NULL

Test Running

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 100    | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 1000   | 100       | 192.168.4.23  | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.22  | 3306     | ONLINE  | 0        |
| 100    | 101       | 192.168.4.233 | 3306     | ONLINE  | 1        |
| 1000   | 101       | 192.168.4.23  | 3306     | ONLINE  | 51       |
| 100    | 102       | 192.168.4.233 | 3306     | ONLINE  | 0        |
| 1000   | 102       | 192.168.4.23  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

As indicated above the reads are not balanced.  Removing node x.23 using wsrep_reject_queries=all:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.233 | ONLINE       | 48       |
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+
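
For reference, removing a node from rotation this way is done on the PXC node itself; a minimal sketch:

-- Stop serving application queries while staying in the cluster (run on 192.168.4.23)
SET GLOBAL wsrep_reject_queries=ALL;
-- Revert when the node should serve traffic again
SET GLOBAL wsrep_reject_queries=NONE;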

The remaining node x.233 is taking all the reads, good. If I set wsrep_reject_queries=all also on x.233:

+---------+--------------+---------------+--------------+
| weight  | hostgroup_id | srv_host      | status       |   
+---------+--------------+---------------+--------------+
| 10000   | 100          | 192.168.4.22  | ONLINE       |
| 100     | 9101         | 192.168.4.233 | SHUNNED      |
| 10000   | 9101         | 192.168.4.23  | ONLINE       |
+---------+--------------+---------------+--------------+

And the application failed:

FATAL: mysql_drv_query() returned error 9001 (Max connect timeout reached while reaching hostgroup 101 after 10000ms) for query ‘SELECT id, millid, date,active,kwatts_s FROM windmills2 WHERE id=9364’

Now, this may be by design, but I have serious difficulties understanding what the reasoning is here, given that we allow the platform to stop serving while we still have a healthy server.

Last but not least, I am not allowed to decide WHICH nodes are the backup_writers; ProxySQL will choose them from my list of writer servers. So why not also include the one I have declared as Primary, at least in case of need? ¯\_(ツ)_/¯

Third Test

Ok last try with writer_is_also_reader = 1.

mysql> select * from  runtime_mysql_galera_hostgroups \G
*************************** 1. row ***************************
       writer_hostgroup: 100
backup_writer_hostgroup: 102
       reader_hostgroup: 101
      offline_hostgroup: 9101
                 active: 1
            max_writers: 1
  writer_is_also_reader: 1
max_transactions_behind: 10
                comment: NULL
1 row in set (0.01 sec)

And now I have:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 1000    | 100          | 192.168.4.23  | SHUNNED      | 0        |
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.233 | ONLINE       | 0        |
| 1000    | 101          | 192.168.4.23  | ONLINE       | 0        |
| 10000   | 101          | 192.168.4.22  | ONLINE       | 35       | <-- :(
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 1000    | 102          | 192.168.4.23  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

Then remove one Reader at a time, as before:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 101          | 192.168.4.22  | ONLINE       | 52       | <-- :(
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 9101         | 192.168.4.23  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 10000   | 100          | 192.168.4.22  | ONLINE       | 0        |
| 100     | 101          | 192.168.4.22  | ONLINE       | 39       | <-- :(
| 100     | 9101         | 192.168.4.233 | SHUNNED      | 0        |
| 10000   | 9101         | 192.168.4.23  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

Now, as you may have already realized, the point here is that YES, I have my node x.22 (Primary) able to take READs as well, but that node was taking the whole read load from the beginning. This is because of the shift ProxySQL did to the weights.

This happens because, while ProxySQL initially populates the internal table mysql_servers_incoming with the data from mysql_servers, after several steps that information is overwritten for the readers as well, using the data coming from the writer.

This messes up the desired results.

Fourth Test

Failover due to maintenance. In this case, I will set the writer pxc_maint_mode = MAINTENANCE to failover to another writer.
The sysbench command used:

sysbench ./src/lua/windmills/oltp_read_write.lua  --mysql-host=192.168.4.191 --mysql-port=6033 --mysql-user=app_test --mysql-password=test --mysql-db=windmills_s --db-driver=mysql --tables=10 --table_size=10000  --rand-type=zipfian --rand-zipfian-exp=0.5 --skip_trx=false  --report-interval=1  --mysql_storage_engine=innodb --auto_inc=off --histogram --table_name=windmills  --stats_format=csv --db-ps-mode=disable --point-selects=50 --range-selects=true --threads=50 --time=2000   run

After starting sysbench, I set the writer in maintenance mode:

+-----------------------------+-------------+
| Variable_name               | Value       |
+-----------------------------+-------------+
| pxc_encrypt_cluster_traffic | OFF         |
| pxc_maint_mode              | MAINTENANCE |
| pxc_maint_transition_period | 10          |
| pxc_strict_mode             | ENFORCING   |
+-----------------------------+-------------+

ProxySQL sets the node as SHUNNED, but it is not able to move the connections over, given that sysbench uses sticky connections.

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 1000    | 100          | 192.168.4.23  | ONLINE       | 0        |
| 10000   | 100          | 192.168.4.22  | SHUNNED      | 50       |
| 100     | 101          | 192.168.4.233 | ONLINE       | 2        |
| 1000    | 101          | 192.168.4.23  | ONLINE       | 13       |
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 9101         | 192.168.4.22  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

THIS IS EXPECTED!
If your application uses sticky connections and never refreshes them, you must restart the application. Adding --reconnect=50 to the sysbench command, I can see that the connections shift to the new master as expected:

+---------+--------------+---------------+--------------+----------+
| weight  | hostgroup_id | srv_host      | status       | ConnUsed |
+---------+--------------+---------------+--------------+----------+
| 100     | 100          | 192.168.4.233 | SHUNNED      | 0        |
| 1000    | 100          | 192.168.4.23  | ONLINE       | 26       | <-- New Primary
| 10000   | 100          | 192.168.4.22  | SHUNNED      | 19       | <-- shift
| 100     | 101          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 101          | 192.168.4.23  | ONLINE       | 21       |
| 100     | 102          | 192.168.4.233 | ONLINE       | 0        |
| 10000   | 9101         | 192.168.4.23  | ONLINE       | 0        | <-- ??
| 10000   | 9101         | 192.168.4.22  | ONLINE       | 0        |
+---------+--------------+---------------+--------------+----------+

As we can see, ProxySQL does the failover to node x.23 as expected. But it also adds the node to HG 9101, which is supposed to host the offline servers.

So why move the Primary there? 

Once maintenance is over, disabling pxc_maint_mode will restore the master. In short, ProxySQL will fail back.

The whole process will not be impactful if the application is NOT using sticky connections; otherwise, the application will have to deal with:

  • Errors with the connection
  • A retry cycle to re-run the dropped DML

Failover Because of a Crash

To check the next case, I will add --mysql-ignore-errors=all to sysbench, to be able to see how many errors I get and for how long when failover is needed. To simulate a crash, I will KILL -9 the mysqld process on the writer.

After Kill:

98,50,53.00,6472.71,6070.73,221.99,179.99,1327.91,0.00,1.00 <--
99,50,0.00,2719.17,2719.17,0.00,0.00,0.00,0.00,50.00        <--start
100,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
101,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
102,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
103,50,0.00,2849.89,2549.90,193.99,106.00,0.00,0.00,0.00
104,50,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00
105,50,49.85,2663.99,2556.31,23.93,83.75,7615.89,0.00,6.98  <-- done

In this case, it takes 6 seconds for a failover.

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed | 
+--------+-----------+---------------+----------+---------+----------+
| 100    | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        | 
| 1000   | 100       | 192.168.4.23  | 3306     | ONLINE  | 48       | 
| 100    | 101       | 192.168.4.233 | 3306     | ONLINE  | 1        | 
| 1000   | 101       | 192.168.4.23  | 3306     | ONLINE  | 18       | 
| 100    | 102       | 192.168.4.233 | 3306     | ONLINE  | 0        | 
| 10000  | 9101      | 192.168.4.22  | 3306     | SHUNNED | 0        | 
+--------+-----------+---------------+----------+---------+----------+

So all good here. But during one of my tests, ONLY on v2.0.15 and when using the same weight, I had the following weird behavior: once the failover was done, I found that ProxySQL was sending connections to BOTH remaining nodes.

Check below the data, taken one snapshot after the other as the nodes start to take over; keep in mind that here the PRIMARY was node 192.168.4.233:

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 10       |<--
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 40       |<--
| 10000  | 101       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 3        |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 12       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+
...
+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 37       |<--
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 13       |<--
| 10000  | 101       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 7        |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 12       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+
...
+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 49       |<--
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 0        |<--
| 10000  | 101       | 192.168.4.233 | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 10       |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 10       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

In the end, only one node remained as Primary, but for some amount of time both were serving, even though only ONE node was declared ONLINE.

A Problem Along the Road… (only with v2.0.15)

While I was trying to “fix” the issue with the weight for READERS…

Let’s say we have this:

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  | 686      |
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED | 0        |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  | 62       |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  | 43       |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  | 19       |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

And I want to release some of the READ load from WRITER (currently 192.168.4.23).

If I simply do:

update mysql_servers set weight=100 where hostgroup_id=101 and hostname='192.168.4.23';

+--------------+---------------+------+-----------+--------+--------+
| hostgroup_id | hostname      | port | gtid_port | status | weight | 
+--------------+---------------+------+-----------+--------+--------+
| 100          | 192.168.4.23  | 3306 | 0         | ONLINE | 10000  | 
| 101          | 192.168.4.22  | 3306 | 0         | ONLINE | 10000  | 
| 101          | 192.168.4.23  | 3306 | 0         | ONLINE | 100    | 
| 101          | 192.168.4.233 | 3306 | 0         | ONLINE | 10000  | 
+--------------+---------------+------+-----------+--------+--------+

Now I load it into runtime, and… if I am lucky:

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host      | srv_port | status  |
+--------+-----------+---------------+----------+---------+
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  |
| 100    | 101       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  |
+--------+-----------+---------------+----------+---------+

And then it is changed to:

+--------+-----------+---------------+----------+---------+
| weight | hostgroup | srv_host      | srv_port | status  |
+--------+-----------+---------------+----------+---------+
| 10000  | 100       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 100       | 192.168.4.22  | 3306     | SHUNNED |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  |
| 10000  | 101       | 192.168.4.23  | 3306     | ONLINE  |
| 10000  | 101       | 192.168.4.22  | 3306     | ONLINE  |
| 10000  | 102       | 192.168.4.22  | 3306     | ONLINE  |
+--------+-----------+---------------+----------+---------+

As you can notice, ProxySQL initially set it to the value I chose. Afterward, it changed it back to what was set in HG 100. But worse is that, if I am not lucky:

+--------+-----------+---------------+----------+---------+----------+
| weight | hostgroup | srv_host      | srv_port | status  | ConnUsed |
+--------+-----------+---------------+----------+---------+----------+
| 100    | 100       | 192.168.4.23  | 3306     | SHUNNED | 0        |
| 10000  | 100       | 192.168.4.22  | 3306     | ONLINE  | 0        |
| 10000  | 101       | 192.168.4.233 | 3306     | ONLINE  | 718      |
| 100    | 101       | 192.168.4.23  | 3306     | ONLINE  | 0        |
| 10000  | 101       | 192.168.4.22  | 3306     | SHUNNED | 0        |
| 100    | 102       | 192.168.4.23  | 3306     | ONLINE  | 0        |
+--------+-----------+---------------+----------+---------+----------+

it changes the value (randomly) also for HG 102, which also impacts the WRITER HG, causing a failover. At this point, I stopped testing. Too many things were making the failover scenario too unpredictable.

Conclusions

ProxySQL has a great concept behind it and is for sure covering a real gap in the MySQL environment, optimizing and powering up the connection layer between the application layer and the data layer.

But, in regard to the Galera support, we are not there. The support provided is not only limited, it is fallacious, and could lead to serious and/or unexpected problems. Even using the option writer_is_also_reader=1, which is the only one worthy of use, we still see too many issues in how the nodes are managed in case of serious events such as failover.

ProxySQL v2.1.0 seems to have fixed some instabilities, but we still have too many open issues to trust the Galera native support. My advice is to stay away from it and use the scheduler to deal with the Galera cluster. Write a robust script that will cover your specific needs if you must customize the actions. A scheduler will serve you well.

If you are too lazy to do so, there is a sample in Percona-Lab. This is the old script used in ProxySQL 1.4.x, modified to work with ProxySQL 2.x. I also wrote one a long time ago that can help as well, available here. Both come without any guarantee, and I advise you to use them as examples for your own; see Part Two of this post for details.

Finally, let me say that ProxySQL is a great tool, but no tool can cover everything. People like me who have been around long enough have seen this happen many times, and it is no surprise.

Great MySQL to all.

References

https://www.percona.com/doc/percona-xtradb-cluster/LATEST/install/index.html

https://galeracluster.com/

https://proxysql.com/blog/proxysql-native-galera-support/

https://www.percona.com/blog/2019/02/20/proxysql-native-support-for-percona-xtradb-cluster-pxc/

https://proxysql.com/documentation/galera-configuration/

Oct
09
2020
--

Amazon Aurora Multi-Primary First Impression


For what reason should I use a real multi-primary setup?

To be clear, not a multi-writer solution where any node can become the active writer in case of need, as with Percona XtraDB Cluster (PXC) or Percona Server for MySQL using Group Replication. No, we are talking about a multi-primary setup where I can write at the same time on multiple nodes. I want to insist on this "why?".

After having excluded the possible solutions mentioned above, both covering the famous 99.995% availability, which is 26.30 minutes of downtime in a year, what is left?

Disaster Recovery? Well, that is something I would love to have, but to be a real DR solution we need to put several kilometers (miles for imperial) in the middle.

And we know (see here and here) that aside from some misleading advertising, we cannot have a tightly coupled cluster solution across geographical regions.

So, what is left? I may need more HA, OK, that is a valid reason. Or I may need to scale the number of writes, which is a valid reason as well. This means, in the end, that I am looking at a multi-primary setup because it:

  • Scales writes (more nodes, more writes)
    • With consistent reads (what I write on A must be visible on B)
  • Gives me 0 (zero) downtime, or close to that (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allows me to shift the writer pointer at any time from A to B and vice versa, consistently.

Now, keeping myself bound to the MySQL ecosystem, my natural choice would be MySQL NDB Cluster. But my (virtual) boss was at AWS re:Invent and someone mentioned to him that Aurora Multi-Primary does what I was looking for. This (long) article is my voyage in discovering whether that is true or… not.

Given I am focused on the behavior first, and NOT interested in absolute numbers to shock the audience with millions of QPS, I will use low-level Aurora instances, and will perform the tests from two EC2 instances in the same VPC/region as the nodes.

You can find the details about the tests on GitHub here.

Finally, I will test:

  • Connection speed
  • Stale read
  • Write single node for baseline
  • Write on both nodes:
    • Scaling splitting the load by schema
    • Scaling same schema

Test Results

Let us start to have some real fun. The first test is …

Connection Speed

The purpose of this test is to evaluate the time taken to open a new connection and the time taken to close it. Opening/closing a connection can be a very expensive operation, especially if applications do not use a connection pool mechanism.

Amazon Aurora Multi-Primary


As we can see, ProxySQL turns out to be the most efficient way to deal with opening connections, which was expected given the way it is designed to reuse open connections towards the backend.


The close connection operation is different; there, ProxySQL seems to take a little bit longer.

As a global observation, we can say that by using ProxySQL we have more consistent behavior. Of course, this test is a simplistic one, and we are not checking the scalability (from 1 to N connections) but it is good enough to give us the initial feeling. Specific connection tests will be the focus of the next blog on Aurora MM.
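
As a side note, when ProxySQL is in the path, the connection reuse it provides can be observed directly from its stats schema; a quick check from the ProxySQL admin interface looks like this:

-- Connections in use vs. free per backend, plus connect successes/errors
SELECT hostgroup, srv_host, srv_port, status, ConnUsed, ConnFree, ConnOK, ConnERR
FROM stats_mysql_connection_pool
ORDER BY hostgroup, srv_host;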

Stale Reads

Aurora multi-primary uses the same mechanism as default Aurora to update the buffer pool:


It uses the page cache update, just doing it both ways. This means that the buffer pool of Node2 is updated with the modifications performed on Node1, and vice versa.

To verify if an application would be really able to have consistent reads, I have run this test. This test is meant to measure if, and how many, stale reads we will have when writing on a node and reading from the other.
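
Conceptually, the probe is as simple as writing a marker on one node and immediately reading it back from the other; a minimal sketch (the test.stale_probe table, with a primary key on id, is hypothetical; the real test harness is in the GitHub repository linked above):

-- On Node1 (writer side of the probe)
INSERT INTO test.stale_probe (id, ts) VALUES (1, NOW(6))
  ON DUPLICATE KEY UPDATE ts = NOW(6);

-- Immediately after, on Node2 (reader side): if ts is missing or older
-- than the value just written, the read is stale
SELECT id, ts, NOW(6) AS read_at FROM test.stale_probe WHERE id = 1;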

Amazon Aurora multi-primary has two consistency models:

Aurora consistency model
As an interesting fact, the result was that with the default consistency model (INSTANCE_RAW), we got 100% stale reads.
Given that, I focused on identifying the cost of using the other consistency model (REGIONAL_RAW), which allows an application to have consistent reads.

The results indicate an increase of 44% in total execution time, and of 95% (22 times slower) in write execution.

It is interesting to note that the time taken is in some way predictable and consistent between the two consistency models.

The graph below shows in yellow how long the application must wait to see the correct data on the reader node. In blue is the amount of time the application waits to get back the same consistent read because it must wait for the commit on the writer.

lag time in nanoseconds

As you can see, the two are more or less aligned.

Given the performance cost imposed by using REGIONAL_RAW,  all the other tests are done with the default INSTANCE_RAW, unless explicitly stated.

Writing Tests

All tests run in this section were done using sysbench-tpcc with the following settings:

sysbench ./tpcc.lua --mysql-host=<> --mysql-port=3306 --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --time=300 --threads=32 --report-interval=1 --tables=10 --scale=15  --mysql_table_options=" CHARSET=utf8 COLLATE=utf8_bin"  --db-driver=mysql prepare

 sysbench /opt/tools/sysbench-tpcc/tpcc.lua --mysql-host=$mysqlhost --mysql-port=$port --mysql-user=<> --mysql-password=<> --mysql-db=tpcc --db-driver=mysql --tables=10 --scale=15 --time=$time  --rand-type=zipfian --rand-zipfian-exp=0 --report-interval=1 --mysql-ignore-errors=all --histogram  --report_csv=yes --stats_format=csv --db-ps-mode=disable --threads=$threads run

Write Single Node (Baseline)

Before starting the comparative analysis, I was looking to define what was the “limit” of traffic/load for this platform.

baseline reads/writes

From the graph above, we can see that this setup scales up to 128 threads and after that, the performance remains more or less steady.

Amazon claims that we can basically double the performance when using both nodes in write mode and using a different schema to avoid conflicts.

aurora scalability

Once more, remember that I am not interested in the absolute numbers here, but in the relative behavior. Given that, the expectation is to see:

expected scalability

Write on Both Nodes, Different Schemas

So AWS recommends this as the scaling solution:


And I diligently followed the advice. I used two EC2 nodes in the same subnet as the Aurora nodes, each writing to a different schema (tpcc and tpcc2).

Overview

Let us make it short and go straight to the point. Did we get the expected scalability?

Well, no:


We got just a 26% increase, quite far from the expected 100%. Let us see what happened in detail (if not interested, just skip to the next test).

Node 1

Schema read writes Aurora

Node 2


As you can see, Node1 was (more or less) keeping up with expectations and was close to the expected performance. But Node2 was just not keeping up, and its performance was terrible.

The graphs below show what happened.

While Node1 was (again, more or less) scaling up to the baseline expectations (128 threads), Node2 collapsed at 16 threads and was never able to scale up.

Reads

Node 1


Node1 is scaling reads as expected, but here and there we can see performance deterioration.

Node 2


Node2 is not scaling Reads at all.

Writes

Node 1


Same as Read.

Node 2


Same as read.

Now, someone may think I made a mistake and was writing on the same schema. I assure you I was not. Check the next test to see what happened when using the same schema.

Write on Both Nodes, Same Schema

Overview

Now, now, Marco, this is unfair. You know this will cause contention. Yes, I do! But nonetheless, I was curious to see what was going to happen and how the platform would deal with that level of contention.

My expectation was to see a lot of performance degradation and an increased number of locks. About conflicts, I was not wrong; after the test, node2 reported:

+-------------+---------+-------------------------+
| table       | index   | PHYSICAL_CONFLICTS_HIST |
+-------------+---------+-------------------------+
| district9   | PRIMARY |                    3450 |
| district6   | PRIMARY |                    3361 |
| district2   | PRIMARY |                    3356 |
| district8   | PRIMARY |                    3271 |
| district4   | PRIMARY |                    3237 |
| district10  | PRIMARY |                    3237 |
| district7   | PRIMARY |                    3237 |
| district3   | PRIMARY |                    3217 |
| district5   | PRIMARY |                    3156 |
| district1   | PRIMARY |                    3072 |
| warehouse2  | PRIMARY |                    1867 |
| warehouse10 | PRIMARY |                    1850 |
| warehouse6  | PRIMARY |                    1808 |
| warehouse5  | PRIMARY |                    1781 |
| warehouse3  | PRIMARY |                    1773 |
| warehouse9  | PRIMARY |                    1769 |
| warehouse4  | PRIMARY |                    1745 |
| warehouse7  | PRIMARY |                    1736 |
| warehouse1  | PRIMARY |                    1735 |
| warehouse8  | PRIMARY |                    1635 |
+-------------+---------+-------------------------+

This is obviously a strong indication that something was not working right. In terms of performance gain, if we compare only the results at 128 threads:


Even with the high level of conflict, we still have a 12% performance gain.

The problem is that, in general, the two nodes behave quite badly. If you check the graph below, you can see that the level of conflict is such that it prevents the nodes not only from scaling but also from behaving consistently.

Node 1

Write on Both Nodes,  Same Schema

Node 2


Reads

In the following graphs, we can see how Node1 had issues and actually crashed three times during the tests with 32/64/512 threads. Node2 was always up, but its performance was very low.

Node 1


Node 2


Writes

Node 1


Node 2


Recovery From Crashed Node

Regarding recovery time, reading the AWS documentation and listening to presentations, I often heard that Aurora Multi-Primary is a zero-downtime solution, along with statements like: “in applications where you can’t afford even brief downtime for database write operations, a multi-master cluster can help to avoid an outage when a writer instance becomes unavailable. The multi-master cluster doesn’t use the failover mechanism, because it doesn’t need to promote another DB instance to have read/write capability.”

To achieve this, the suggested solution I found was to have applications point directly to the node endpoints instead of using the cluster endpoint.

In this context, the solution pointing to the nodes should be able to fail over within a second or so, while the cluster endpoint:

Recovery From Crashed Node

Personally, I think that designing an architecture where the application is responsible for the database connection and for failover is a throwback to 2001. But if you feel this is the way, well, go for it.

For testing, I used ProxySQL, kept as plain as possible with nothing else beyond its basic native monitoring, and then compared the results with the tests using the cluster endpoint. In this way, I adopted the advice of pointing directly at the nodes, but did it the way we would do it today (a minimal configuration sketch is shown below).
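The ProxySQL side of such a setup is tiny. Here is a minimal sketch, run on the ProxySQL admin interface, where the hostnames and hostgroup are placeholders and the monitor user is assumed to be already configured:

-- Register both Aurora writer node endpoints in a single writer hostgroup
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10,'node1.xxxx.us-east-1.rds.amazonaws.com',3306);
INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES (10,'node2.xxxx.us-east-1.rds.amazonaws.com',3306);
LOAD MYSQL SERVERS TO RUNTIME;
SAVE MYSQL SERVERS TO DISK;

When the native monitor flags one node as unreachable, traffic is routed to the surviving node in the same hostgroup, without the application having to know about the individual endpoints.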

The results are below and they confirm (more or less) the data coming from Amazon.


A downtime of seven seconds is quite a long time nowadays, especially if I am targeting a five-nines solution, which, remember, allows only 864 ms of downtime per day (99.999% availability leaves 0.001% of the 86,400 seconds in a day). Using ProxySQL gets closer to that, but it is still too long to be called zero downtime.
I also had fail-back issues when using the AWS cluster endpoint, given that it was not able to move the connections to the re-joining node seamlessly.

Last but not least, when using the consistency level INSTANCE_RAW, I also had data issues such as PK conflicts:
FATAL: mysql_drv_query() returned error 1062 (Duplicate entry ‘18828082’ for key ‘PRIMARY’) 

Conclusions

As stated at the beginning of this long blog, the reasonable expectations for going with a multi-primary solution were:

  • Scale writes (more nodes, more writes)
  • Give me zero downtime, or close to it (5 nines is a maximum downtime of 864 milliseconds per day!!)
  • Allow me to shift the writer pointer at any time from A to B and vice versa, consistently

Honestly, I feel the scaling expectation has completely failed. Probably, if I used the largest Aurora instance, I would get much better absolute numbers, and it would take me longer to hit the same issues, but I would hit them. In any case, if the multi-primary solution is designed to provide that scalability, it should do so at any size.

I did not have zero downtime, but I was able to failover pretty quickly with ProxySQL.

Finally, unless the consistency model is REGIONAL_RAW, shifting from one node to the other is prone to negative effects like stale reads. Given that, I consider this requirement not fully satisfied.

Given all the above, I think this solution could be valid only for high availability (close to five nines), but given that it comes with some limitations, I do not feel comfortable preferring it over others just for that; in the end, default Aurora is already good enough as a highly available solution.

References

AWS re:Invent 2019: Amazon Aurora Multi-Master: Scaling out database write performance

Working with Aurora multi-master clusters

Improving Enterprises HA and Disaster Recovery Solutions Reviewed

Robust HA Solutions with ProxySQL

Limitations of Multi-Master Clusters
