Jan
19
2017
--

Setup ProxySQL for High Availability (not a Single Point of Failure)


In this blog post, we’ll look at how to set up ProxySQL for high availability.

During the last few months, we’ve had a lot of opportunities to present and discuss a very powerful tool that will become more and more used in architectures supporting MySQL: ProxySQL.

ProxySQL is becoming more flexible, solid, performant and widely used every day (see http://www.proxysql.com/ and the recent http://www.proxysql.com/compare).

The tool is a winner when compared to similar ones, and we should all have a clear(er) idea of how to integrate it in our architectures in order to achieve the best results.

The first thing to keep in mind is that ProxySQL doesn’t natively support any high availability solution. We can set up a cluster of MySQL nodes and achieve four or even five nines of HA. But if we include ProxySQL as-is, as a single block, our HA has a single point of failure (SPOF) that can drag us down in the case of a crash.

To solve this, the most common solution is setting up ProxySQL as part of a tile architecture, where the application and ProxySQL are deployed together.

This is a good solution for some cases, and it certainly reduces network hops, but it might be less than practical when our architecture has a very high number of tiles (hundreds of application servers, which is not so unusual nowadays).
In that case, managing ProxySQL is challenging. But more problematically, ProxySQL must perform several checks on the destination servers (MySQL). If we have 400 instances of ProxySQL, we end up keeping our databases busy just performing the checks.

In short, this is not a smart move.

Another possible approach is to have two layers of ProxySQL, one close to the application and another in the middle to connect to the database.

I personally don’t like this approach for many reasons. The most relevant reasons are that this approach creates additional complexity in the management of the platform, and it adds network hops.

So what can be done?

I love the KISS principle because I am lazy and don’t want to reinvent a wheel someone else has already invented. I also like it when my customers don’t need to depend on me or any other colleagues after I am done and gone. They must be able to manage, understand and fix their environment by themselves.

To keep things simple, here is my checklist:

  • Exclude the cases where a tile (application/ProxySQL) makes sense
  • Exclude the “in the cloud” cases where tools like ELB (Elastic Load Balancer) exist
  • Exclude architectures that already include a balancer

What can I use for the remaining cases?

The answer comes from combining existing solutions and existing blocks: Keepalived + ProxySQL + MySQL.

For an explanation of KeepAlived, visit http://www.keepalived.org/.

Short description
“Keepalived is a routing software written in C. The main goal of this project is to provide simple and robust facilities for load balancing and high-availability to Linux system and Linux-based infrastructures. The load balancing framework relies on well-known and widely used Linux Virtual Server (IPVS) kernel module providing Layer4 load balancing. Keepalived implements a set of checkers to dynamically and adaptively maintain and manage load-balanced server pool according to their health. On the other hand, high-availability is achieved by VRRP protocol. VRRP is a fundamental brick for router failover. Also, Keepalived implements a set of hooks to the VRRP finite state machine providing low-level and high-speed protocol interactions. Keepalived frameworks can be used independently or all together to provide resilient infrastructures.”

Bingo! This is exactly what we need for our ProxySQL setup.

Below, I will explain how to set up:

  • A simple solution based on a single VIP
  • A more complex solution using multiple VIPs
  • An even more complex solution using VIPs and virtual servers

All we want to do is to prevent ProxySQL from becoming a SPOF. And while doing that, we need to reduce network hops as much as possible (keeping the solution SIMPLE).

Another important concept to keep in mind is that ProxySQL (re)start takes place in less than a second. This means that if it crashes, assuming it can be restarted by the angel process, doing that is much more efficient than any kind of failover mechanism. As such, your solution plan should keep in mind the ~1 second time of ProxySQL restart as a baseline.
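If you want to verify that ~1 second baseline on your own hardware, timing a full stop/start cycle with the init script used later in this post is a quick sanity check (purely illustrative, not part of the setup):

time (/etc/init.d/proxysql stop && /etc/init.d/proxysql start)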

Ready? Let’s go.

Setup

Choose three machines to host the combination of Keepalived and ProxySQL.

In the following example, I will use three machines for ProxySQL and Keepalived, and three hosting Percona XtraDB Cluster. You can place the Keepalived+ProxySQL pairs wherever you like (even on the same Percona XtraDB Cluster boxes).

For the following examples, we will have:

PXC
node1 192.168.0.5 galera1h1n5
node2 192.168.0.21 galera2h2n21
node3 192.168.0.231 galera1h3n31
 
ProxySQL-Keepalived
test1 192.168.0.11
test2 192.168.0.12
test3 192.168.0.235
 
VIPs 192.168.0.88 / 192.168.0.89 / 192.168.0.90
 

To check, I will use this table (please create it in your MySQL server):

DROP TABLE IF EXISTS test.`testtable2`;
CREATE TABLE test.`testtable2` (
  `autoInc` bigint(11) NOT NULL AUTO_INCREMENT,
  `a` varchar(100) COLLATE utf8_bin NOT NULL,
  `b` varchar(100) COLLATE utf8_bin NOT NULL,
  `host` varchar(100) COLLATE utf8_bin NOT NULL,
  `userhost` varchar(100) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`autoInc`)
) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;

And this bash TEST command to use later:

while true; do
  mydate=$(date +'%Y-%m-%d %H:%M:%S.%6N')
  mysql --defaults-extra-file=./my.cnf -h 192.168.0.88 -P 3311 --skip-column-names -b -e \
    "BEGIN;
     set @userHost='a';
     select concat(user,'_',host) into @userHost from information_schema.processlist where user = 'load_RW' limit 1;
     insert into test.testtable2 values(NULL,'$mydate',SYSDATE(6),@@hostname,@userHost);
     commit;
     select * from test.testtable2 order by 1 DESC limit 1"
  sleep 1
done
  1. Install ProxySQL (refer to https://github.com/sysown/proxysql/wiki#installation)
  2. Install Keepalived (yum install keepalived; apt-get install keepalived)
  3. Setup ProxySQL users and servers
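For reference, the Admin interface is reached with a plain MySQL client. A stock installation listens on port 6032 with admin/admin as the default credentials; the environment used in this post evidently listens on port 3310, as the watch command further down shows, so adjust accordingly:

mysql -uadmin -padmin -h 127.0.0.1 -P 3310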

Once you have your ProxySQL up (run the same configuration on all ProxySQL nodes; it is much simpler), connect to the Admin interface and execute the following:

DELETE FROM mysql_replication_hostgroups WHERE writer_hostgroup=500 ;
DELETE FROM mysql_servers WHERE hostgroup_id IN (500,501);
 
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.0.5',500,3306,1000000000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.0.5',501,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.0.21',500,3306,1000000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.0.21',501,3306,1000000000);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.0.231',500,3306,100);
INSERT INTO mysql_servers (hostname,hostgroup_id,port,weight) VALUES ('192.168.0.231',501,3306,1000000000);
LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
 
DELETE FROM mysql_users WHERE username='load_RW';
INSERT INTO mysql_users (username,password,active,default_hostgroup,default_schema,transaction_persistent) VALUES ('load_RW','test',1,500,'test',1);
LOAD MYSQL USERS TO RUNTIME;SAVE MYSQL USERS TO DISK;
 
DELETE FROM mysql_query_rules WHERE rule_id IN (200,201);
INSERT INTO mysql_query_rules (rule_id,username,destination_hostgroup,active,retries,match_digest,apply) VALUES(200,'load_RW',501,1,3,'^SELECT.*FOR UPDATE',1);
INSERT INTO mysql_query_rules (rule_id,username,destination_hostgroup,active,retries,match_digest,apply) VALUES(201,'load_RW',501,1,3,'^SELECT ',1);
 
LOAD MYSQL QUERY RULES TO RUNTIME;SAVE MYSQL QUERY RULES TO DISK;
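To double-check that the servers really made it to runtime, you can query the runtime tables from the same Admin interface; a quick sketch (adjust credentials and port to your setup):

mysql -uadmin -padmin -h 127.0.0.1 -P 3310 -e "SELECT hostgroup_id, hostname, port, status, weight FROM runtime_mysql_servers ORDER BY hostgroup_id, hostname;"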

Create a my.cnf file in your default dir with:

[mysql]
user=load_RW
password=test

Simple Setup using a single VIP, 3 ProxySQL and 3 Galera nodes


First, set up the Keepalived configuration file (/etc/keepalived/keepalived.conf):

global_defs {
  # Keepalived process identifier
  lvs_id proxy_HA
}
# Script used to check if Proxy is running
vrrp_script check_proxy {
  script "killall -0 proxysql"
  interval 2
  weight 2
}
# Virtual interface
# The priority determines the order in which nodes take over the VIP in a failover
vrrp_instance VI_01 {
  state MASTER
  interface em1
  virtual_router_id 51
  priority <calculate on the WEIGHT for each node>
 
  # The virtual IP address shared between the load balancers
  virtual_ipaddress {
    192.168.0.88 dev em1
  }
  track_script {
    check_proxy
  }
}

Given the above, and given that I want TEST1 to have the highest priority, we set:

test1 = 101
test2 = 100
test3 = 99

Modify the config on each node with the above values and (re)start Keepalived.
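A quick way to confirm which box currently holds the VIP, assuming the em1 interface from the config above:

ip addr show em1 | grep 192.168.0.88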

If all is set correctly, you will see the following in the system log of the TEST1 machine:

Jan 10 17:56:56 mysqlt1 systemd: Started LVS and VRRP High Availability Monitor.
Jan 10 17:56:56 mysqlt1 Keepalived_healthcheckers[6183]: Configuration is using : 6436 Bytes
Jan 10 17:56:56 mysqlt1 Keepalived_healthcheckers[6183]: Using LinkWatch kernel netlink reflector...
Jan 10 17:56:56 mysqlt1 Keepalived_vrrp[6184]: Configuration is using : 63090 Bytes
Jan 10 17:56:56 mysqlt1 Keepalived_vrrp[6184]: Using LinkWatch kernel netlink reflector...
Jan 10 17:56:56 mysqlt1 Keepalived_vrrp[6184]: VRRP sockpool: [ifindex(2), proto(112), unicast(0), fd(10,11)]
Jan 10 17:56:56 mysqlt1 Keepalived_vrrp[6184]: VRRP_Script(check_proxy) succeeded
Jan 10 17:56:57 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Transition to MASTER STATE
Jan 10 17:56:57 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Received lower prio advert, forcing new election
Jan 10 17:56:57 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Received higher prio advert
Jan 10 17:56:57 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Entering BACKUP STATE
Jan 10 17:56:58 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) forcing a new MASTER election
...
Jan 10 17:57:00 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Transition to MASTER STATE
Jan 10 17:57:01 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Entering MASTER STATE <-- MASTER
Jan 10 17:57:01 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) setting protocol VIPs.
Jan 10 17:57:01 mysqlt1 Keepalived_healthcheckers[6183]: Netlink reflector reports IP 192.168.0.88 added
Jan 10 17:57:01 mysqlt1 avahi-daemon[937]: Registering new address record for 192.168.0.88 on em1.IPv4.
Jan 10 17:57:01 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Sending gratuitous ARPs on em1 for 192.168.0.88

On the other two machines:

Jan 10 17:56:59 mysqlt2 Keepalived_vrrp[13107]: VRRP_Instance(VI_01) Entering BACKUP STATE <--- 

This means the node is standing by as a backup.

Now it’s time to test our connection to our ProxySQL pool. From an application node, or just from your laptop, open three terminals. In each one:

watch -n 1 'mysql -h <IP OF THE REAL PROXY (test1|test2|test3)> -P 3310 -uadmin -padmin -t -e "select * from stats_mysql_connection_pool where hostgroup in (500,501,9500,9501) order by hostgroup,srv_host ;" -e " select srv_host,command,avg(time_ms), count(ThreadID) from stats_mysql_processlist group by srv_host,command;" -e "select * from stats_mysql_commands_counters where  Total_Time_us > 0;"'

Unless you are already sending queries to proxies, the proxies are doing nothing. Time to start the test bash as I indicated above. If everything is working correctly, you will see the bash command reporting this:

+----+----------------------------+----------------------------+-------------+----------------------------+
| 49 | 2017-01-10 18:12:07.739152 | 2017-01-10 18:12:07.733282 | galera1h1n5 | load_RW_192.168.0.11:33273 |
+----+----------------------------+----------------------------+-------------+----------------------------+
  (columns: ID | execution time in the bash | execution time inside MySQL | node hostname | user and connection source)

The three watch terminals will show that ONLY the ProxySQL on TEST1 is currently getting/serving requests, because it is the one holding the VIP:

+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| hostgroup | srv_host      | srv_port | status | ConnUsed | ConnFree | ConnOK | ConnERR | Queries | Bytes_data_sent | Bytes_data_recv | Latency_ms |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
| 500       | 192.168.0.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 629        |
| 500       | 192.168.0.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 510        |
| 500       | 192.168.0.5   | 3306     | ONLINE | 0        | 0        | 3      | 0       | 18      | 882             | 303             | 502        |
| 501       | 192.168.0.21  | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 629        |
| 501       | 192.168.0.231 | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 510        |
| 501       | 192.168.0.5   | 3306     | ONLINE | 0        | 0        | 0      | 0       | 0       | 0               | 0               | 502        |
+-----------+---------------+----------+--------+----------+----------+--------+---------+---------+-----------------+-----------------+------------+
+---------+---------------+-----------+-----------+-----------+---------+---------+----------+----------+-----------+-----------+--------+--------+---------+----------+
| Command | Total_Time_us | Total_cnt | cnt_100us | cnt_500us | cnt_1ms | cnt_5ms | cnt_10ms | cnt_50ms | cnt_100ms | cnt_500ms | cnt_1s | cnt_5s | cnt_10s | cnt_INFs |
+---------+---------------+-----------+-----------+-----------+---------+---------+----------+----------+-----------+-----------+--------+--------+---------+----------+
| BEGIN   | 9051          | 3         | 0         | 0         | 0       | 3       | 0        | 0        | 0         | 0         | 0      | 0      | 0       | 0        |
| COMMIT  | 47853         | 3         | 0         | 0         | 0       | 0       | 0        | 3        | 0         | 0         | 0      | 0      | 0       | 0        |
| INSERT  | 3032          | 3         | 0         | 0         | 1       | 2       | 0        | 0        | 0         | 0         | 0      | 0      | 0       | 0        |
| SELECT  | 8216          | 9         | 3         | 0         | 3       | 3       | 0        | 0        | 0         | 0         | 0      | 0      | 0       | 0        |
| SET     | 2154          | 3         | 0         | 0         | 3       | 0       | 0        | 0        | 0         | 0         | 0      | 0      | 0       | 0        |
+---------+---------------+-----------+-----------+-----------+---------+---------+----------+----------+-----------+-----------+--------+--------+---------+----------+

This is all as expected. Time to see if the failover-failback works along the chain.

Let us kill the ProxySQL on TEST1 while the test bash command is running.

killall -9 proxysql

Here is what you will get:

+----+----------------------------+----------------------------+-------------+----------------------------+
| 91 | 2017-01-10 18:19:06.188233 | 2017-01-10 18:19:06.183327 | galera1h1n5 | load_RW_192.168.0.11:33964 |
+----+----------------------------+----------------------------+-------------+----------------------------+
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.0.88' (111)
+----+----------------------------+----------------------------+-------------+----------------------------+
| 94 | 2017-01-10 18:19:08.250093 | 2017-01-10 18:19:11.250927 | galera1h1n5 | load_RW_192.168.0.12:39635 | <-- note
+----+----------------------------+----------------------------+-------------+----------------------------+

The source changed, but not the Percona XtraDB Cluster node. If you check the system log for TEST1:

Jan 10 18:19:06 mysqlt1 Keepalived_vrrp[6184]: VRRP_Script(check_proxy) failed
Jan 10 18:19:07 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Received higher prio advert
Jan 10 18:19:07 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) Entering BACKUP STATE
Jan 10 18:19:07 mysqlt1 Keepalived_vrrp[6184]: VRRP_Instance(VI_01) removing protocol VIPs.
Jan 10 18:19:07 mysqlt1 Keepalived_healthcheckers[6183]: Netlink reflector reports IP 192.168.0.88 removed

While on TEST2:

Jan 10 18:19:08 mysqlt2 Keepalived_vrrp[13107]: VRRP_Instance(VI_01) Transition to MASTER STATE
Jan 10 18:19:09 mysqlt2 Keepalived_vrrp[13107]: VRRP_Instance(VI_01) Entering MASTER STATE
Jan 10 18:19:09 mysqlt2 Keepalived_vrrp[13107]: VRRP_Instance(VI_01) setting protocol VIPs.
Jan 10 18:19:09 mysqlt2 Keepalived_healthcheckers[13106]: Netlink reflector reports IP 192.168.0.88 added
Jan 10 18:19:09 mysqlt2 Keepalived_vrrp[13107]: VRRP_Instance(VI_01) Sending gratuitous ARPs on em1 for 192.168.0.88

Simple, and elegant. No need to re-invent the wheel to get a smoothly working process.

The total recovery time for the ProxySQL crash is about 5.06 seconds, considering the widest window (from the last successful application write at 2017-01-10 18:19:06.188233 to full recovery at 2017-01-10 18:19:11.250927).

This is the worst-case scenario, keeping in mind that we check ProxySQL every two seconds (so the real recovery window is at most 5 - 2 = 3 seconds).

OK, what about fail-back?

Let us restart the proxysql service:

/etc/init.d/proxysql start (or systemctl start proxysql)

Here’s the output:

+-----+----------------------------+----------------------------+-------------+----------------------------+
| 403 | 2017-01-10 18:29:34.550304 | 2017-01-10 18:29:34.545970 | galera1h1n5 | load_RW_192.168.0.12:40330 |
+-----+----------------------------+----------------------------+-------------+----------------------------+
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 406 | 2017-01-10 18:29:35.597984 | 2017-01-10 18:29:38.599496 | galera1h1n5 | load_RW_192.168.0.11:34640 |
+-----+----------------------------+----------------------------+-------------+----------------------------+

The worst-case recovery time is 4.04 seconds, of which two seconds are the delay due to the check interval.

Of course, the test runs only a single operation every second. As such, the impact is minimal (no errors during fail-back), even though the measured recovery looks longer.

Let’s check another thing: is the failover working as expected? Test1 -> 2 -> 3? Let’s kill 1 and 2 and see:

Kill Test1 :
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 448 | 2017-01-10 18:35:43.092878 | 2017-01-10 18:35:43.086484 | galera1h1n5 | load_RW_192.168.0.11:35240 |
+-----+----------------------------+----------------------------+-------------+----------------------------+
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 451 | 2017-01-10 18:35:47.188307 | 2017-01-10 18:35:50.191465 | galera1h1n5 | load_RW_192.168.0.12:40935 |
+-----+----------------------------+----------------------------+-------------+----------------------------+
...
Kill Test2
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 463 | 2017-01-10 18:35:54.379280 | 2017-01-10 18:35:54.373331 | galera1h1n5 | load_RW_192.168.0.12:40948 |
+-----+----------------------------+----------------------------+-------------+----------------------------+
+-----+----------------------------+----------------------------+-------------+-----------------------------+
| 466 | 2017-01-10 18:36:08.603754 | 2017-01-10 18:36:09.602075 | galera1h1n5 | load_RW_192.168.0.235:33268 |
+-----+----------------------------+----------------------------+-------------+-----------------------------+

This image is where you should be at the end:

[Diagram: the VIP has moved along the chain and is now served by Test3]

Where the last server standing is Test3. In this case, I killed one server immediately after the other. Keepalived took a bit longer to fail over, but it still did so correctly, following the planned chain.

Fail-back is smooth as usual:

+-----+----------------------------+----------------------------+-------------+-----------------------------+
| 502 | 2017-01-10 18:39:18.749687 | 2017-01-10 18:39:18.749688 | galera1h1n5 | load_RW_192.168.0.235:33738 |
+-----+----------------------------+----------------------------+-------------+-----------------------------+
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 505 | 2017-01-10 18:39:19.794888 | 2017-01-10 18:39:22.800800 | galera1h1n5 | load_RW_192.168.0.11:35476 |
+-----+----------------------------+----------------------------+-------------+----------------------------+

Let’s see another case

The case above is simple, but it has a caveat: I can only access one ProxySQL at a time. This may or may not be what you want. In any case, it might be nice to be able to choose which ProxySQL to use. With Keepalived, you can: we can set up a number of VIPs and associate one with each test box.

The result is that each server hosting ProxySQL will also host a VIP, and will eventually be able to fail over to either of the other two servers.

Failing over and back is fully managed by Keepalived, which checks, as before, whether ProxySQL is running. An example of the configuration of one node is below:

global_defs {
  # Keepalived process identifier
  lvs_id proxy_HA
}
# Script used to check if Proxy is running
vrrp_script check_proxy {
  script "killall -0 proxysql"
  interval 2
  weight 3
}
 
# Virtual interface 1
# The priority determines the order in which nodes take over the VIP in a failover
vrrp_instance VI_01 {
  state MASTER
  interface em1
  virtual_router_id 51
  priority 102
 
  # The virtual IP address shared between the load balancers
  virtual_ipaddress {
    192.168.0.88 dev em1
  }
  track_script {
    check_proxy
  }
}
 
# Virtual interface 2
# The priority determines the order in which nodes take over the VIP in a failover
vrrp_instance VI_02 {
  state MASTER
  interface em1
  virtual_router_id 52
  priority 100
 
  # The virtual IP address shared between the load balancers
  virtual_ipaddress {
    192.168.0.89 dev em1
  }
  track_script {
    check_proxy
  }
}
 
# Virtual interface 3
# The priority determines the order in which nodes take over the VIP in a failover
vrrp_instance VI_03 {
  state MASTER
  interface em1
  virtual_router_id 53
  priority 99
 
  # The virtual IP address shared between the load balancers
  virtual_ipaddress {
    192.168.0.90 dev em1
  }
  track_script {
    check_proxy
  }
}

The tricky part, in this case, is to juggle the PRIORITY of each VIP on each server so that you never assign the same VIP to two masters. The whole set of configs can be found on GitHub here: https://github.com/Tusamarco/proxysql_ha.
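To make the priority juggling concrete, here is one possible matrix (my own illustrative numbers, not values mandated by Keepalived). Each VIP has its highest priority on a different node, so under normal conditions every node owns exactly one VIP, and a single failure redistributes only that node’s VIP:

         VI_01 (.88)   VI_02 (.89)   VI_03 (.90)
test1        102           100            99
test2         99           102           100
test3        100            99           102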

Performing a check with the test bash, we have:

Test 1 crash
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 422 | 2017-01-11 18:30:14.411668 | 2017-01-11 18:30:14.344009 | galera1h1n5 | load_RW_192.168.0.11:55962 |
+-----+----------------------------+----------------------------+-------------+----------------------------+
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.0.88' (111)
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 426 | 2017-01-11 18:30:18.531279 | 2017-01-11 18:30:21.473536 | galera1h1n5 | load_RW_192.168.0.12:49728 | <-- new server
+-----+----------------------------+----------------------------+-------------+----------------------------+
....
Test 2 crash
+-----+----------------------------+----------------------------+-------------+----------------------------+
| 450 | 2017-01-11 18:30:27.885213 | 2017-01-11 18:30:27.819432 | galera1h1n5 | load_RW_192.168.0.12:49745 |
+-----+----------------------------+----------------------------+-------------+----------------------------+
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.0.88' (111)
+-----+----------------------------+----------------------------+-------------+-----------------------------+
| 454 | 2017-01-11 18:30:30.971708 | 2017-01-11 18:30:37.916263 | galera1h1n5 | load_RW_192.168.0.235:33336 | <-- new server
+-----+----------------------------+----------------------------+-------------+-----------------------------+

The final state of IPs on Test3:

enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 08:00:27:c2:16:3f brd ff:ff:ff:ff:ff:ff
    inet 192.168.0.235/24 brd 192.168.0.255 scope global enp0s8   <-- Real IP
       valid_lft forever preferred_lft forever
    inet 192.168.0.90/32 scope global enp0s8    <--- VIP 3
       valid_lft forever preferred_lft forever
    inet 192.168.0.89/32 scope global enp0s8    <--- VIP 2
       valid_lft forever preferred_lft forever
    inet 192.168.0.88/32 scope global enp0s8    <--- VIP 1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fec2:163f/64 scope link
       valid_lft forever preferred_lft forever

And this is the image:

[Diagram: all three VIPs now active on Test3]

Recovery times:

test 1 crash = 7.06 sec (worst-case scenario)
test 2 crash = 10.03 sec (worst-case scenario)

Conclusions

In this example, I used a check that verifies the ProxySQL process is alive. But a check can be anything that reports 0|1; the limit is defined only by what you need (see the sketch below).
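For instance, a vrrp_script can point at any executable that exits 0 on success. A hypothetical deeper check could verify that ProxySQL actually answers on its admin port rather than merely existing as a process (the script path, credentials and port here are my illustrative assumptions):

vrrp_script check_proxy {
  script "/usr/local/bin/check_proxysql.sh"
  interval 2
  weight 2
}

And /usr/local/bin/check_proxysql.sh:

#!/bin/bash
# Exit 0 only if ProxySQL answers a MySQL-protocol ping on its admin port.
mysqladmin -uadmin -padmin -h 127.0.0.1 -P 3310 ping > /dev/null 2>&1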

The failover times can be made significantly shorter by reducing the check interval, counting only the time taken to move the VIP. I preferred to show the worst-case scenario, using an application with a one-second interval; this is a pessimistic view of what normally happens with real traffic.

I was looking for a simple, simple, simple way to add HA to ProxySQL – something that can be easily integrated with automation, and that is actually also well-established and maintained. In my opinion, using Keepalived is a good solution because it matches all the above expectations.

Implementing a set of ProxySQL nodes, and having Keepalived manage the failover between them, is pretty easy. But you can expand the usage (and the complexity) if you need to, counting on tools that are already part of the Linux stack. There is no need to re-invent the wheel with a crazy mechanism.

If you want to have fun doing crazy things, at least start from something that helps you go beyond the basics. For instance, I was also playing a bit with Keepalived and a virtual server, creating a set of redundant ProxySQLs with load balancers and . . . but that is another story (blog).


Great MySQL and ProxySQL to all!

Jan
19
2017
--

Open Source Databases on Big Machines: Does READ COMMITTED Scale on 144 Cores?


In the second post in my series on open source databases on big machines, we’ll look at whether READ COMMITTED scales with multiple cores.

The default transaction isolation level for InnoDB is REPEATABLE READ. A more permissive level is READ COMMITTED, which is known to work well. While REPEATABLE READ maintains the transaction history up to the start of the transaction, READ COMMITTED maintains it only up to the start of the current statement. Peter Zaitsev described the differences between how these modes are handled in this blog post. Both can theoretically cause performance slowdowns, but READ COMMITTED is usually seen as fast-working – at least on a typical MySQL machine (owned by Percona):

[Chart: READ COMMITTED results on the Percona machine]

The default transaction isolation mode for PostgreSQL is also READ COMMITTED, so originally I wanted to use this mode for the MySQL tests as well. But when I tested on a machine with 144 cores, I found that after 36 threads REPEATABLE READ continued to scale while READ COMMITTED slowed down, getting stuck at results about 3x slower for standard OLTP RW tests.

[Chart: READ COMMITTED vs REPEATABLE READ on the 144-core machine]

Dimitri Kravtchuk wrote about performance issues with READ COMMITTED in 2015, but at that time he tested with 40 cores. My tests show that there is a huge difference beyond 40 cores.

I tested this originally with Percona Server 5.7.15 and recently re-tested with Oracle’s MySQL versions 5.6.35 and 5.7.17. I confirmed that the bug exists in these versions as well, and reported it. I hope the Oracle MySQL Team fixes it. The good news is that while 5.6 stopped scaling after 16 threads, 5.7 improves this to 36 threads.

Results for 5.6.35:

[Chart: results for 5.6.35]

Results for 5.7.17:

[Chart: results for 5.7.17]

Machine details:

PostgreSQL Professional and Freematiq machine (tests for MySQL 5.6.35, Percona Server 5.7.15 and MySQL 5.7.17 servers):

Processors: physical = 4, cores = 72, virtual = 144, hyperthreading = yes
Memory: 3.0T
Disk speed: about 3K IOPS
OS: CentOS 7.1.1503
File system: XFS

Percona machine (test for Percona 5.7.15 server):

Processors: physical = 2, cores = 12, virtual = 24, hyperthreading = yes
Memory: 251.9G
Disk speed: about 33K IOPS
OS: Ubuntu 14.04.5 LTS
File system: EXT4

Test: SysBench OLTP RW test, converted to use prepared statements, as described in this post.

MySQL Options: described in this post.
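For completeness, switching a server between the two isolation levels under test is a one-option change; a minimal my.cnf sketch (option name as documented in the MySQL manual):

[mysqld]
# Default is REPEATABLE-READ; this makes READ COMMITTED the session default instead.
transaction-isolation = READ-COMMITTED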


Jan
19
2017
--

With a $1.5M seed round, Eloquent Labs mixes AI and Mechanical Turk to fix customer service

Keenon Werling would be the first to agree that conversational AI is regularly overhyped. So instead of taking the traditional approach and gloating about a glitzy new deeper learning algorithm to pitch his new venture Eloquent Labs, Werling instead opted to differentiate by optimizing something far more low-tech: people. The startup’s special sauce is embracing a mix of AI,… Read More

Jan
19
2017
--

Oracle boosts cloud offering with Apiary API management tool acquisition

Oracle acquired API management startup Apiary today for an undisclosed amount. Apiary, which launched in 2011 and has raised a modest $8.5 million, helps companies manage APIs, which is an increasingly important job. As every company becomes a software company, they are building platforms and providing ways for customers and third-parties to build applications using their technologies.… Read More

Jan
19
2017
--

Yext hires new president as it looks to IPO

Yext, the company that helps businesses make sure that their location data appears correctly in search results, is announcing that Jim Steele will be joining the company full-time as president and chief revenue officer. Steele was previously on the board at Yext and was most recently at InsideSales. “I found a new younger Marc Benioff and I’m really excited,” Steele says… Read More

Jan
18
2017
--

Salesforce opens engineering office in Microsoft’s and Amazon’s backyards

Salesforce announced today that it plans to open an engineering and innovation hub in the Seattle area, greatly expanding its presence there. Company co-founder Parker Harris says the office, which is located in the Nine Two Nine Office Tower in Bellevue, will be dedicated mostly to supporting the company’s artificial intelligence initiative, introduced last Fall, which… Read More

Jan
18
2017
--

Elasticsearch Ransomware: Open Source Database Security Part 2


In this blog post, we’ll look at a new Elasticsearch ransomware outbreak and what you can do to prevent it from happening to you.

Mere weeks after reports of MongoDB servers getting hacked and infected with ransomware, Elasticsearch clusters are experiencing the same difficulties. David Murphy’s blog discussed the situation and the solution for MongoDB servers. In this blog post, we look at how you can prevent ransomware attacks on your Elasticsearch clusters.

First off, what is Elasticsearch? Elasticsearch is an open source distributed index based on Apache Lucene. It provides a full-text search with an HTTP API, using schemaless JSON documents. By its nature, it is also distributed and redundant. Companies use Elasticsearch with logging via the ELK stack and data-gathering software, to assist with data analytics and visualizations. It is also used to back search functionality in a number of popular apps and web services.

In this new scenario, the ransomware completely wiped away the cluster data, and replaced it with the following warning index:

“SEND 0.2 BTC TO THIS WALLET: 1DAsGY4Kt1a4LCTPMH5vm5PqX32eZmot4r IF YOU WANT RECOVER YOUR DATABASE! SEND TO THIS EMAIL YOUR SERVER IP AFTER SENDING THE BITCOINS.”

As with the MongoDB situation, this isn’t a flaw in the Elasticsearch software. This vulnerability stems from not correctly using the security settings provided by Elasticsearch. As the PCWorld article sums up:

According to experts, there is no reason to expose Elasticsearch clusters to the internet. In response to these recent attacks, search technologies and distributed systems architect Itamar Syn-Hershko has published a blog post with recommendations for securing Elasticsearch deployments.

The blog post they reference has excellent advice and examples of how to protect your Elasticsearch installations from exploitation. To summarize its advice (from the post itself):

Whatever you do, never expose your cluster nodes to the web directly.

So how do you prevent hackers from getting into your Elasticsearch cluster? Using the advice from Syn-Hershko’s blog, here are some bullet points for shoring up your Elasticsearch security:

  • HTTP-enabled nodes should only listen to private IPs. You can configure which IPs Elasticsearch listens to: localhost, private IPs, public IPs, or several of these options. network.bind_host and network.host control the IP types (manual). Never set Elasticsearch to listen to a public IP or a publicly accessible DNS name. (A minimal elasticsearch.yml sketch follows this list.)
  • Use proxies to communicate with clients. You should pass any application queries to Elasticsearch through some sort of software that can filter requests, perform audit-logging and password-protect the data. Your client-side JavaScript shouldn’t talk to Elasticsearch directly, and should only communicate with your server-side software. That software can translate all client-side requests to Elasticsearch DSL, execute the query, and then send the response in a format the clients expect.
  • Don’t use default ports. Once again for clarity: DON’T USE DEFAULT PORTS. You can easily change Elasticsearch’s default ports by modifying the .yml file. The relevant parameters are http.port and transport.tcp.port (manual).
  • Disable HTTP if you don’t need it. Only Elasticsearch client nodes should enable HTTP, and your private network applications should be the only ones with access to them. You can completely disable the HTTP module by setting http.enabled to false (manual).
  • Secure publicly available client nodes. You should protect your Elasticsearch client and any UI it communicates with (such as Kibana and Kopf) behind a VPN. If you choose to allow some nodes access to the public network, use HTTPS and don’t transmit data and credentials as plain text. You can use plugins like Elastic’s Shield or SearchGuard to secure your cluster.
  • Disable scripting (pre-5.x). Malicious scripts can hack clusters via the Search API. Earlier versions of Elasticsearch allowed unsecured scripts to access the software. If you are using an older version (pre-5.x), upgrade to a newer version or disable dynamic scripting completely.
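Pulling those settings together, here is a minimal elasticsearch.yml sketch. The port numbers are illustrative assumptions, and the scripting flag varies by version (script.disable_dynamic applies to the 1.x line); none of these values are taken from the article itself:

network.host: 127.0.0.1          # bind only to loopback/private IPs, never a public address
http.port: 9280                  # non-default (the default is 9200)
transport.tcp.port: 9380         # non-default (the default is 9300)
# http.enabled: false            # uncomment on nodes that must not serve REST clients
# script.disable_dynamic: true   # pre-5.x (1.x) only: disable dynamic scripting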

Go to Syn-Hershko’s blog for more details.

This should get you started on correctly protecting yourself against Elasticsearch ransomware (and other security threats). If you want to have someone review your security, please contact us.

Jan
18
2017
--

A new CEO for DocuSign

DocuSign, the electronic signature company, has ended its search for a permanent CEO. Effective this week, Daniel Springer is taking the reins and former CEO Keith Krach is moving to a chairman role. With a reported valuation of more than $3 billion, 14-year-old DocuSign is said to be scaling, with speculation that it’s nearing its long-awaited IPO. Though he couldn’t comment… Read More

Jan
18
2017
--

Cloud security startup ProtectWise raises another $25 million

ProtectWise, a security startup that records all your network traffic DVR-style, and saves it in the cloud, announced a $25 million round today. The company’s security strategy is surprisingly simple. It records all of the networking activity, then indexes the data and provides customers with forensic tools to search and review it. Once a breach has been detected, the victimized… Read More

Jan
18
2017
--

Collibra nabs $50M at reported $650M valuation led by ICONIQ

Data governance and management startup Collibra — originally founded in Belgium but now based out of New York to help businesses in sectors like finance and healthcare to manage and comply with data retention policies — has raised $50 million in its latest round of funding. The company is not disclosing the valuation, but we heard that it is in the region of $650 million (which is… Read More
