This blog post discusses ProxySQL and MHA integration, and how they work together.
MHA (Master High Availability Manager and tools for MySQL) is almost fully integrated with the ProxySQL process. This means you can count on MHA's standard features to manage the failover, and on ProxySQL to manage the traffic and shift it from one server to another.
This is one of the main differences between MHA and VIP, and MHA and ProxySQL: with MHA/ProxySQL, there is no need to move IPs or re-define DNS.
The following is an example of an MHA configuration file for use with ProxySQL:
[server default]
user=mha
password=mha
ssh_user=root
repl_password=replica
manager_log=/tmp/mha.log
manager_workdir=/tmp
remote_workdir=/tmp
master_binlog_dir=/opt/mysql_instances/mha1/logs
client_bindir=/opt/mysql_templates/mysql-57/bin
client_libdir=/opt/mysql_templates/mysql-57/lib
master_ip_failover_script=/opt/tools/mha/mha4mysql-manager/samples/scripts/master_ip_failover
master_ip_online_change_script=/opt/tools/mha/mha4mysql-manager/samples/scripts/master_ip_online_change
log_level=debug
[server1]
hostname=mha1r
ip=192.168.1.104
candidate_master=1
[server2]
hostname=mha2r
ip=192.168.1.107
candidate_master=1
[server3]
hostname=mha3r
ip=192.168.1.111
candidate_master=1
[server4]
hostname=mha4r
ip=192.168.1.109
no_master=1
NOTE: Be sure to comment out the "FIX ME" lines in the sample scripts.
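If you want to do that programmatically, something like the following works. This is a hypothetical one-liner; in the mha4mysql-manager sources those guard lines are spelled FIXME_xxx, so verify the pattern against your copy first:
# Comment out the FIXME_xxx guard calls that make the sample scripts abort
# on purpose; the .bak copies let you diff and verify the result.
sed -i.bak 's/^\(\s*FIXME_xxx\)/#\1/' \
    /opt/tools/mha/mha4mysql-manager/samples/scripts/master_ip_failover \
    /opt/tools/mha/mha4mysql-manager/samples/scripts/master_ip_online_change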
After that, just install MHA as you normally would.
In ProxySQL, be sure to have all the MHA users and the servers defined.
When using ProxySQL with standard replication, it's important to grant additional privileges to the ProxySQL monitor user: it must also have REPLICATION CLIENT, or it will fail to check the slave lag. The servers MUST also have a value defined for the max_replication_lag attribute, or the check will be ignored.
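For reference, a minimal set of grants for the monitor user could look like the following. This is only a sketch, assuming the default monitor/monitor credentials and a 192.168.1.% network mask; adapt both to your setup:
-- Run on each MySQL backend: without REPLICATION CLIENT the monitor user
-- cannot run SHOW SLAVE STATUS, so the replication lag check fails.
CREATE USER 'monitor'@'192.168.1.%' IDENTIFIED BY 'monitor';
GRANT USAGE, REPLICATION CLIENT ON *.* TO 'monitor'@'192.168.1.%';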
As a reminder:
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.104',600,3306,1000,0);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.104',601,3306,1000,10);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.107',601,3306,1000,10);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.111',601,3306,1000,10);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight,max_replication_lag) VALUES ('192.168.1.109',601,3306,1000,10);
INSERT INTO mysql_replication_hostgroups (writer_hostgroup,reader_hostgroup) VALUES (600,601);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
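Once loaded, ProxySQL's monitor uses the read_only flag of each backend to keep every server in the correct hostgroup of the 600/601 pair. You can check what is actually active at runtime with a query along these lines:
SELECT hostgroup_id, hostname, port, status, max_replication_lag
FROM runtime_mysql_servers
ORDER BY hostgroup_id, hostname;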
INSERT INTO mysql_query_rules (username,destination_hostgroup,active) VALUES ('mha_W',600,1);
INSERT INTO mysql_query_rules (username,destination_hostgroup,active) VALUES ('mha_R',601,1);
INSERT INTO mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) VALUES ('mha_RW',600,1,3,'^SELECT.*FOR UPDATE');
INSERT INTO mysql_query_rules (username,destination_hostgroup,active,retries,match_digest) VALUES ('mha_RW',601,1,3,'^SELECT');
LOAD MYSQL QUERY RULES TO RUNTIME; SAVE MYSQL QUERY RULES TO DISK;
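One detail worth noting: ProxySQL evaluates query rules in rule_id order, so the '^SELECT.*FOR UPDATE' rule must come before the generic '^SELECT' rule, as it does here because rule_id auto-increments in insertion order. You can double-check the effective order with:
SELECT rule_id, username, match_digest, destination_hostgroup
FROM runtime_mysql_query_rules
ORDER BY rule_id;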
INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema) VALUES ('mha_W','test',1,600,'test_mha');
INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema) VALUES ('mha_R','test',1,601,'test_mha');
INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema) VALUES ('mha_RW','test',1,600,'test_mha');
LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;
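At this point you can sanity-check the whole chain through ProxySQL itself. A quick sketch, assuming ProxySQL listens on 192.168.1.50:3311 as in the sysbench commands below:
# mha_W is pinned to the writer hostgroup (600), so read_only should be 0 here.
mysql -h 192.168.1.50 -P 3311 -u mha_W -ptest -e 'SELECT @@hostname, @@read_only;'
# mha_R is pinned to the readers (601), so this should normally land on a slave.
mysql -h 192.168.1.50 -P 3311 -u mha_R -ptest -e 'SELECT @@hostname, @@read_only;'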
OK, now that all is ready, let’s rock’n’roll!
Controlled fail-over
First of all, masterha_manager must not be running, or you will get an error.
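If in doubt, the standard MHA tools can check (and stop) it:
# Reports whether a manager is monitoring this cluster; before a manual
# switch we want it to report NOT_RUNNING.
masterha_check_status --conf=/etc/mha.cnf
# Stop a running manager:
masterha_stop --conf=/etc/mha.cnf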
Now let’s start some traffic:
Write
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-host=192.168.1.50 --mysql-port=3311 --mysql-user=mha_RW --mysql-password=test --mysql-db=mha_test --db-driver=mysql --oltp-tables-count=50 --oltp-table-size=5000 --max-requests=0 --max-time=900 --oltp-point-selects=5 --oltp-read-only=off --oltp-dist-type=uniform --oltp-reconnect-mode=transaction --oltp-skip-trx=off --num-threads=10 --report-interval=10 --mysql-ignore-errors=all run
Read only
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --mysql-host=192.168.1.50 --mysql-port=3311 --mysql-user=mha_RW --mysql-password=test --mysql-db=mha_test --db-driver=mysql --oltp-tables-count=50 --oltp-table-size=5000 --max-requests=0 --max-time=900 --oltp-point-selects=5 --oltp-read-only=on --num-threads=10 --oltp-reconnect-mode=query --oltp-skip-trx=on --report-interval=10 --mysql-ignore-errors=all run
Let it run for a bit, then check:
mysql> select * from stats_mysql_connection_pool where hostgroup between 600 and 601 order by hostgroup,srv_host desc;
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 600 | 192.168.1.104 | 3306 | ONLINE | 10 | 0 | 20 | 0 | 551256 | 44307633 | 0 | 285 | <--- current Master
| 601 | 192.168.1.111 | 3306 | ONLINE | 5 | 3 | 11 | 0 | 1053685 | 52798199 | 4245883580 | 1133 |
| 601 | 192.168.1.109 | 3306 | ONLINE | 3 | 5 | 10 | 0 | 1006880 | 50473746 | 4052079567 | 369 |
| 601 | 192.168.1.107 | 3306 | ONLINE | 3 | 5 | 13 | 0 | 1040524 | 52102581 | 4178965796 | 604 |
| 601 | 192.168.1.104 | 3306 | ONLINE | 7 | 1 | 16 | 0 | 987548 | 49458526 | 3954722258 | 285 |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
Now perform the failover. To do this, instruct MHA to do a switch, and to set the OLD master as a new slave:
masterha_master_switch --master_state=alive --conf=/etc/mha.cnf --orig_master_is_new_slave --interactive=0 --running_updates_limit=0
Check what happened:
[ 160s] threads: 10, tps: 354.50, reads: 3191.10, writes: 1418.50, response time: 48.96ms (95%), errors: 0.00, reconnects: 0.00
[ 170s] threads: 10, tps: 322.50, reads: 2901.98, writes: 1289.89, response time: 55.45ms (95%), errors: 0.00, reconnects: 0.00
[ 180s] threads: 10, tps: 304.60, reads: 2743.12, writes: 1219.91, response time: 58.09ms (95%), errors: 0.10, reconnects: 0.00 <--- moment of the switch
[ 190s] threads: 10, tps: 330.40, reads: 2973.40, writes: 1321.00, response time: 50.52ms (95%), errors: 0.00, reconnects: 0.00
[ 200s] threads: 10, tps: 304.20, reads: 2745.60, writes: 1217.60, response time: 58.40ms (95%), errors: 0.00, reconnects: 1.00
[ 210s] threads: 10, tps: 353.80, reads: 3183.80, writes: 1414.40, response time: 48.15ms (95%), errors: 0.00, reconnects: 0.00
Check ProxySQL:
mysql> select * from stats_mysql_connection_pool where hostgroup between 600 and 601 order by hostgroup,srv_host desc;
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 600 | 192.168.1.107 | 3306 | ONLINE | 10 | 0 | 10 | 0 | 123457 | 9922280 | 0 | 658 | <--- new master
| 601 | 192.168.1.111 | 3306 | ONLINE | 2 | 6 | 14 | 0 | 1848302 | 91513537 | 7590137770 | 1044 |
| 601 | 192.168.1.109 | 3306 | ONLINE | 5 | 3 | 12 | 0 | 1688789 | 83717258 | 6927354689 | 220 |
| 601 | 192.168.1.107 | 3306 | ONLINE | 3 | 5 | 13 | 0 | 1834415 | 90789405 | 7524861792 | 658 |
| 601 | 192.168.1.104 | 3306 | ONLINE | 6 | 2 | 24 | 0 | 1667252 | 82509124 | 6789724589 | 265 |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
In this case, the slaves weren't lagging behind the master, and the switch happened quite fast.
We can see that the WRITE operations, which are normally an issue given the need to move a VIP or change name resolution, suffered only a limited hiccup.
Read operations were not affected at all. Nice, eh?
Do you know how long it takes to do a switch under these conditions?
real 0m2.710s
Yes, 2.7 seconds.
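For the record, that number is just the wall-clock time of the switch command, presumably captured with something like the shell's time builtin:
time masterha_master_switch --master_state=alive --conf=/etc/mha.cnf \
    --orig_master_is_new_slave --interactive=0 --running_updates_limit=0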
This is more evidence that, most of the time, the cost of an MHA-based switch comes from the need to redirect traffic from A to B over the network.
Crash fail-over
What happens if, instead of an easy switch, we have to cover a real failover?
First of all, let’s start masterha_manager:
nohup masterha_manager --conf=/etc/mha.cnf --wait_on_monitor_error=60 --wait_on_failover_error=60 >> /tmp/mha.log 2>&1 &
Then let's start the load again. Finally, go to the MySQL node that hosts the current master (xxx.xxx.xxx.107):
ps aux|grep mysql
mysql 18755 0.0 0.0 113248 1608 pts/0 S Aug28 0:00 /bin/sh /opt/mysql_templates/mysql-57/bin/mysqld_safe --defaults-file=/opt/mysql_instances/mha1/my.cnf
mysql 21975 3.2 30.4 4398248 941748 pts/0 Sl Aug28 93:21 /opt/mysql_templates/mysql-57/bin/mysqld --defaults-file=/opt/mysql_instances/mha1/my.cnf --basedir=/opt/mysql_templates/mysql-57/ --datadir=/opt/mysql_instances/mha1/data --plugin-dir=/opt/mysql_templates/mysql-57//lib/plugin --log-error=/opt/mysql_instances/mha1/mysql-3306.err --open-files-limit=65536 --pid-file=/opt/mysql_instances/mha1/mysql.pid --socket=/opt/mysql_instances/mha1/mysql.sock --port=3306
Then kill both the mysqld process and its mysqld_safe parent, so the server is not automatically restarted:
kill -9 21975 18755
As before, check what happened on the application side:
[ 80s] threads: 4, tps: 213.20, reads: 1919.10, writes: 853.20, response time: 28.74ms (95%), errors: 0.00, reconnects: 0.00
[ 90s] threads: 4, tps: 211.30, reads: 1901.80, writes: 844.70, response time: 28.63ms (95%), errors: 0.00, reconnects: 0.00
[ 100s] threads: 4, tps: 211.90, reads: 1906.40, writes: 847.90, response time: 28.60ms (95%), errors: 0.00, reconnects: 0.00
[ 110s] threads: 4, tps: 211.10, reads: 1903.10, writes: 845.30, response time: 29.27ms (95%), errors: 0.30, reconnects: 0.00 <-- issue starts
[ 120s] threads: 4, tps: 198.30, reads: 1785.10, writes: 792.40, response time: 28.43ms (95%), errors: 0.00, reconnects: 0.00
[ 130s] threads: 4, tps: 0.00, reads: 0.60, writes: 0.00, response time: 0.00ms (95%), errors: 0.00, reconnects: 0.40 <-- total stop in write
[ 140s] threads: 4, tps: 173.80, reads: 1567.80, writes: 696.30, response time: 34.89ms (95%), errors: 0.40, reconnects: 0.00 <-- writes restart
[ 150s] threads: 4, tps: 195.20, reads: 1755.10, writes: 780.50, response time: 33.98ms (95%), errors: 0.00, reconnects: 0.00
[ 160s] threads: 4, tps: 196.90, reads: 1771.30, writes: 786.80, response time: 33.49ms (95%), errors: 0.00, reconnects: 0.00
[ 170s] threads: 4, tps: 193.70, reads: 1745.40, writes: 775.40, response time: 34.39ms (95%), errors: 0.00, reconnects: 0.00
[ 180s] threads: 4, tps: 191.60, reads: 1723.70, writes: 766.20, response time: 35.82ms (95%), errors: 0.00, reconnects: 0.00
So, from the application's point of view, it takes ~10 seconds to perform the failover.
To understand this better, let's see what happened in MHA-land:
Tue Aug 30 09:33:33 2016 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Aug 30 09:33:33 2016 - [info] Reading application default configuration from /etc/mha.cnf..
... Read conf and start
Tue Aug 30 09:33:47 2016 - [debug] Trying to get advisory lock..
Tue Aug 30 09:33:47 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..
... Wait for errors
Tue Aug 30 09:34:47 2016 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) <--- Error time
Tue Aug 30 09:34:56 2016 - [warning] Connection failed 4 time(s).. <--- Finally MHA decide to do something
Tue Aug 30 09:34:56 2016 - [warning] Master is not reachable from health checker!
Tue Aug 30 09:34:56 2016 - [warning] Master mha2r(192.168.1.107:3306) is not reachable!
Tue Aug 30 09:34:56 2016 - [warning] SSH is reachable.
Tue Aug 30 09:34:58 2016 - [info] Master failover to mha1r(192.168.1.104:3306) completed successfully. <--- end of the failover
MHA sees the server failing at xx:47 but, because of the retries and check validations, it only fully acknowledges the downtime at xx:56 (~9 seconds later).
Performing the whole failover then takes only ~2 seconds (again). Because no movable IPs or DNS changes were involved, the operations were fast. This is true when the servers already have the binary logs there; it's a different story if MHA also has to manage and push data from the binary logs to MySQL.
As you can see, ProxySQL also helps reduce the time needed in this scenario by totally skipping the network-related operations, which are the ones causing the most trouble in these cases.