Jan
10
2014
--

The use of Iptables ClusterIP target as a load balancer for PXC, PRM, MHA and NDB

Most technologies achieving high-availability for MySQL need a load-balancer to spread the client connections to a valid database host, even the Tungsten special connector can be seen as a sophisticated load-balancer. People often use hardware load balancer or software solution like haproxy. In both cases, in order to avoid having a single point of failure, multiple load balancers must be used. Load balancers have two drawbacks: they increase network latency and/or they add a validation check load on the database servers. The increased network latency is obvious in the case of standalone load balancers where you must first connect to the load balancer which then completes the request by connecting to one of the database servers. Some workloads like reporting/adhoc queries are not affected by a small increase of latency but other workloads like oltp processing and real-time logging are. Each load balancers must also check regularly if the database servers are in a sane state, so adding more load balancers increases the idle chatting over the network. In order to reduce these impacts, a very different type of load balancer is needed, let me introduce the Iptables ClusterIP target.

Normally, as stated by the RFC 1812 Requirements for IP Version 4 Routers an IP address must be unique on a network and each host must respond only for IPs it own. In order to achieve a load balancing behavior, the Iptables ClusterIP target doesn’t strictly respect the RFC. The principle is simple, each computer in the cluster share an IP address and MAC address with the other members but it answers requests only for a given subset, based on the modulo of a network value which is sourceIP-sourcePort by default. The behavior is controlled by an iptables rule and by the content of the kernel file /proc/net/ipt_CLUSTERIP/VIP_ADDRESS. The kernel /proc file just informs the kernel to which portion of the traffic it should answer. I don’t want to go too deep in the details here since all those things are handled by the Pacemaker resource agent, IPaddr2.

The IPaddr2 Pacemaker resource agent is commonly used for VIP but what is less know is its behavior when defined as part of a clone set. When part of clone set, the resource agent defines a VIP which uses the Iptables ClusterIP target, the iptables rules and the handling of the proc file are all done automatically. That seems very nice in theory but until recently, I never succeeded in having a suitable distribution behavior. When starting the clone set on, let’s say, three nodes, it distributes correctly, one instance on each but if 2 nodes fail and then recover, the clone instances all go to the 3rd node and stay there even after the first two nodes recover. That bugged me for quite a while but I finally modified the resource agent and found a way to have it work correctly. It also now set correctly the MAC address if none is provided to the MAC multicast address domain which starts by “01:00:5E”. The new agent, IPaddr3, is available here. Now, let’s show what we can achieve with it.

We’ll start from the setup described in my previous post and we’ll modify it. First, download and install the IPaddr3 agent.

root@pacemaker-1:~# wget -O /usr/lib/ocf/resource.d/percona/IPaddr3 https://github.com/percona/percona-pacemaker-agents/raw/master/agents/IPaddr3
root@pacemaker-1:~# chmod u+x /usr/lib/ocf/resource.d/percona/IPaddr3

Repeat these steps on all 3 nodes. Then, we’ll modify the pacemaker configuration like this (I’ll explain below):

node pacemaker-1 \
        attributes standby="off"
node pacemaker-2 \
        attributes standby="off"
node pacemaker-3 \
        attributes standby="off"
primitive p_cluster_vip ocf:percona:IPaddr3 \
        params ip="172.30.212.100" nic="eth1" \
        meta resource-stickiness="0" \
        op monitor interval="10s"
primitive p_mysql_monit ocf:percona:mysql_monitor \
        params reader_attribute="readable_monit" writer_attribute="writable_monit" user="repl_user" password="WhatAPassword" pid="/var/lib/mysql/mysqld.pid" socket="/var/run/mysqld/mysqld.sock" max_slave_lag="5" cluster_type="pxc" \
        op monitor interval="1s" timeout="30s" OCF_CHECK_LEVEL="1"
clone cl_cluster_vip p_cluster_vip \
        meta clone-max="3" clone-node-max="3" globally-unique="true"
clone cl_mysql_monitor p_mysql_monit \
        meta clone-max="3" clone-node-max="1"
location loc-distrib-cluster-vip cl_cluster_vip \
        rule $id="loc-distrib-cluster-vip-rule" -1: p_cluster_vip_clone_count gt 1
location loc-enable-cluster-vip cl_cluster_vip \
        rule $id="loc-enable-cluster-vip-rule" 2: writable_monit eq 1
location loc-no-cluster-vip cl_cluster_vip \
        rule $id="loc-no-cluster-vip-rule" -inf: writable_monit eq 0
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="3" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1384275025" \
        maintenance-mode="off"

First, the VIP primitive is modified to use the new agent, IPaddr3, and we set resource-stickiness=”0″. Next, we define the cl_cluster_vip clone set using: clone-max=”3″ to have three instances, clone-node-max=”3″ to allow up to three instances on the same node and globally-unique=”true” to tell Pacemaker it has to allocate an instance on a node even if there’s already one. Finally, there’re three location rules needed to get the behavior we want, one using the p_cluster_vip_clone_count attribute and the other two around the writable_monit attribute. Enabling all that gives:

root@pacemaker-1:~# crm_mon -A1
============
Last updated: Tue Jan  7 10:51:38 2014
Last change: Tue Jan  7 10:50:38 2014 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-2 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
============
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_cluster_vip [p_cluster_vip] (unique)
     p_cluster_vip:0    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:1    (ocf::percona:IPaddr3): Started pacemaker-1
     p_cluster_vip:2    (ocf::percona:IPaddr3): Started pacemaker-2
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
Node Attributes:
* Node pacemaker-1:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-2:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-3:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1

and the network configuration is:

root@pacemaker-1:~# iptables -L INPUT -n
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
CLUSTERIP  all  --  0.0.0.0/0            172.30.212.100       CLUSTERIP hashmode=sourceip-sourceport clustermac=01:00:5E:91:18:86 total_nodes=3 local_node=1 hash_init=0
root@pacemaker-1:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
2
root@pacemaker-2:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
3
root@pacemaker-3:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
1

In order to test the access, you need to query the VIP from a fourth node:

root@pacemaker-4:~# while [ 1 ]; do mysql -h 172.30.212.100 -u repl_user -pWhatAPassword -BN -e "select variable_value from information_schema.global_variables where variable_name like 'hostname';"; sleep 1; done
pacemaker-1
pacemaker-1
pacemaker-2
pacemaker-2
pacemaker-2
pacemaker-3
pacemaker-2
^C

So, all good… Let’s now desync the pacemaker-1 and pacemaker-2.

root@pacemaker-1:~# mysql -e 'set global wsrep_desync=1;'
root@pacemaker-1:~#
root@pacemaker-2:~# mysql -e 'set global wsrep_desync=1;'
root@pacemaker-2:~#
root@pacemaker-3:~# crm_mon -A1
============
Last updated: Tue Jan  7 10:53:51 2014
Last change: Tue Jan  7 10:50:38 2014 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-2 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
============
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_cluster_vip [p_cluster_vip] (unique)
     p_cluster_vip:0    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:1    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:2    (ocf::percona:IPaddr3): Started pacemaker-3
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
Node Attributes:
* Node pacemaker-1:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 0
    + writable_monit                    : 0
* Node pacemaker-2:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 0
    + writable_monit                    : 0
* Node pacemaker-3:
    + p_cluster_vip_clone_count         : 3
    + readable_monit                    : 1
    + writable_monit                    : 1
root@pacemaker-3:~# cat /proc/net/ipt_CLUSTERIP/172.30.212.100
1,2,3
root@pacemaker-4:~# while [ 1 ]; do mysql -h 172.30.212.100 -u repl_user -pWhatAPassword -BN -e "select variable_value from information_schema.global_variables where variable_name like 'hostname';"; sleep 1; done
pacemaker-3
pacemaker-3
pacemaker-3
pacemaker-3
pacemaker-3
pacemaker-3

Now, if pacemaker-1 and pacemaker-2 are back in sync, we have the desired distribution:

root@pacemaker-1:~# mysql -e 'set global wsrep_desync=0;'
root@pacemaker-1:~#
root@pacemaker-2:~# mysql -e 'set global wsrep_desync=0;'
root@pacemaker-2:~#
root@pacemaker-3:~# crm_mon -A1
============
Last updated: Tue Jan  7 10:58:40 2014
Last change: Tue Jan  7 10:50:38 2014 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-2 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
============
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_cluster_vip [p_cluster_vip] (unique)
     p_cluster_vip:0    (ocf::percona:IPaddr3): Started pacemaker-3
     p_cluster_vip:1    (ocf::percona:IPaddr3): Started pacemaker-1
     p_cluster_vip:2    (ocf::percona:IPaddr3): Started pacemaker-2
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
Node Attributes:
* Node pacemaker-1:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-2:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1
* Node pacemaker-3:
    + p_cluster_vip_clone_count         : 1
    + readable_monit                    : 1
    + writable_monit                    : 1

All the clone instances redistributed on all nodes as we wanted.

As a conclusion, Pacemaker with a clone set of IPaddr3 is a very interesting kind of load balancer, especially if you already have pacemaker deployed. It introduces almost no latency, it doesn’t need any other hardware, doesn’t increase the database validation load and is as highly-available as your database is. The only drawback I can see is in a case where the inbound traffic is very important. In that case, all nodes are receiving all the traffic and are equally saturated. With databases and web type traffics, the inbound traffic is usually small. This solution also doesn’t redistribute the connections based on the server load like a load balancer can do but that would be fairly easy to implement with something like a server_load attribute and an agent similar to mysql_monitor but that will check the server load instead of the database status. In such a case, I suggest using much more than 1 VIP clone instance per node to have a better granularity in load distribution. Finally, the ClusterIP target, although still fully supported, has been deprecated in favor of the Cluster-match target. It is basically the same principle and I plan to adapt the IPaddr3 agent to Cluster-match in a near future.

The post The use of Iptables ClusterIP target as a load balancer for PXC, PRM, MHA and NDB appeared first on MySQL Performance Blog.

Oct
23
2013
--

High-availability options for MySQL, October 2013 update

The technologies allowing to build highly-available (HA) MySQL solutions are in constant evolution and they cover very different needs and use cases. In order to help people choose the best HA solution for their needs, we decided, Jay Janssen and I, to publish, on a regular basis (hopefully, this is the first), an update on the most common technologies and their state, with a focus on what type of workloads suite them best. We restricted ourselves to the open source solutions that provide automatic failover. Of course, don’t simply look at the number of Positives/Negatives items, they don’t have the same values. Should you pick any of these technologies, heavy testing is mandatory, HA is never beyond scenario that have been tested.

Percona XtraDB Cluster (PXC)

Percona XtraDB Cluster (PXC) is a version of Percona Server implementing the Galera replication protocol from Codeship.

Positive points Negative points
  • Almost synchronous replication, very small lag if any
  • Automatic failover
  • At best with small transactions
  • All nodes are writable
  • Very small read after write lag, usually no need to care about
  • Scale reads very well and to some extent, writes
  • New nodes are provisioned automatically through State Snapshot Transfer (SST)
  • Multi-threaded apply, greater write capacity than regular replication
  • Can do geographical disaster recovery (Geo DR)
  • More resilient to unresponsive nodes (swapping)
  • Can resolve split-brain situations by itself
  • Still under development, some rough edges
  • Large transactions like multi-statement transactions or large write operations cause issues and are usually not a good fit
  • For quorum reasons, 3 nodes are needed but one can be a lightweight arbitrator
  • SST can be heavy over a Wan
  • Commit are affected by the network latency, this impacts especially Geo DR
  • To achieve HA, a load balancer, like haproxy, is needed
  • Failover time is determined by the load balancer check frequency
  • Performance is affected by the weakest/busiest node
  • Foreign Keys are potential issues
  • MyISAM should be avoided
  • Can be mixed with regular async replication as master or slave but, slaves are not easy to reconfigure after a SST on their master
  • Require careful setup of the host, swapping can lead to node expulsion from the cluster
  • No manual failover mode
  • Debugging some Galera protocol issues isn’t trivial

 

Percona replication manager (PRM)

Percona replication manager (PRM) uses the Linux HA Pacemaker resource manager to manage MySQL and replication and provide high-availability. Information about PRM can be found here, the official page on the Percona web site is in the making.

Positive points Negative points
  • Nothing specific regarding the workload
  • Unlimited number of slaves
  • Slaves can have different roles
  • Typically VIP based access, typically 1 writer VIP and many reader VIPs
  • Also works without VIP (see the fake_mysql_novip agent)
  • Detects if slave lags too much and remove reader VIPs
  • All nodes are monitored
  • The best slaves is picked for master after failover
  • Geographical Disaster recovery possilbe with the lightweight booth protocol
  • Can be operated in manual failover mode
  • Graceful failover is quick, under 2s in normal conditions
  • Ungraceful failover under 30s
  • Distributed operation with Pacemaker, no single point of failure
  • Builtin pacemaker logic, stonith, etc. Very rich and flexible.
  • Still under development, some rough edges
  • Transaction maybe lost is master crashes (async replication)
  • For quorum reasons, 3 nodes are needed but one can be a lightweight arbitrator
  • Only one node is writable
  • Read after write may not be consistent (replication lag)
  • Only scales reads
  • Careful setup for the host, swapping can lead to node expulsion from the cluster
  • Data inconsistency can happen if the master crashes (fix coming)
  • Pacemaker is complex, logs are difficult to read and understand

 

MySQL master HA (MHA)

Like with PRM above, MySQL master HA (MHA), provides high-availability through replication. The approach is different, instead of relying on an HA framework like Pacemaker, it uses Perl scripts. Information about MHA can be found here.

Positive points Negative points
  • Mature
  • Nothing specific regarding the workload
  • No latency effects on writes
  • Can have many slaves and slaves can have different roles
  • Very good binlog/relaylog handling
  • Work pretty hard to minimise data loss
  • Can be operated in manual failover mode
  • Graceful failover is quick, under 5s in normal conditions
  • If the master crashes, slaves will be consistent
  • The logic is fairly easy to understand
  • Transaction maybe lost is master crashes (async replication)
  • Only one node is writable
  • Read after write may not be consistent (replication lag)
  • Only scales reads
  • Monitoring and logic are centralized, single-point of failure, a network partition can cause a split-brain
  • Custom fencing devices, custom VIP scripts, no reuse of other projects tools
  • Most of the deployments are using manual failover (at least at Percona)
  • Requires priviledged ssh access to read relay-logs, can be a security concern
  • No monitoring of the slave to invalidate it if it lags too much or if replication is broken, need to be done by external tool like HAProxy
  • Careful setup for the host, swapping can lead to node expulsion from the cluster

 

NDB Cluster

NDB cluster is the most high-end form of high-availability configuration for MySQL. It is a complete shared nothing architecture where the storage engine is distributed over multiple servers (data nodes). Probably the best starting point with NDB is the official document, here.

Positive points Negative points
  • Mature
  • Synchronous replication
  • Very good at small transactions
  • Very good at high concurrency (many client threads)
  • Huge transaction capacity, more than 1M trx/s are not uncommon
  • Failover can be ~1s
  • No single point of failure
  • Geographical disaster recovery capacity built-in
  • Strong at async replication, applying by batches gives multithreaded apply at the data node level
  • Can scale reads and writes, the framework implements sharding by hashes
  • Not a drop-in replacement for Innodb, you need to tune the schema and the queries
  • Not a general purpose database, some loads like reporting are just bad
  • Only the Read-commited isolation level is available
  • Hardware heavy, need 4 servers mininum for full HA
  • Memory (RAM) hungry, even with disk-based tables
  • Complex to operate, lots of parameters to adjust
  • Need a load balancer for failover
  • Very new foreign key support, field reports scarce on it

 

Shared storage/DRBD

Achieving high-availability use a shared storage medium is an old and well known method. It is used by nearly all the major databases. The share storage can be a DAS connected to two servers, a LUN on SAN accessible from 2 servers or a DRBD partition replicated synchronously over the network. DRBD is by bar the most common shared storage device used in the MySQL world.

Positive points Negative points
  • Mature
  • Synchronous replication (DRBD)
  • Automatic failover is easy to implement
  • VIP based access
  • Write capacity is impacted by network latency for DRBD
  • SANs are expensive
  • Only for InnoDB
  • Standby node, a big server doing nothing
  • Need a warmup period after failover to be fully operational
  • Disk corruption can spread

 

The post High-availability options for MySQL, October 2013 update appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com