When designing a new cluster, it is always advised to have at least 3 nodes (this is the case with PXC, but it is also the same with PRM). But why, and what are the risks?
The goal of having more than 2 nodes (in fact, an odd number is recommended for this kind of cluster) is to avoid a split-brain situation. This can occur when quorum (which can be simplified as a "majority vote") is not honoured. A split-brain is a state in which the nodes lose contact with one another and then both try to take control of shared resources or provide the cluster service simultaneously.
On PRM the problem is obvious: both nodes will try to run the master and slave IPs and will accept writes. But what could happen with Galera replication on PXC?
OK, first let's have a look at a standard PXC setup (no special Galera options) with 2 nodes:
two running nodes (percona1 and percona2), communication between nodes is ok
[root@percona1 ~]# clustercheck
HTTP/1.1 200 OK
Content-Type: Content-Type: text/plain
Node is running.
Same output on percona2
Now let’s check the status variables:
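The truncated output below can be retrieved on each node with something like:
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep%';"
on percona1: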
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 2 |
| wsrep_cluster_conf_id | 4 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 8dccca9f-d4b8-11e1-0800-344f6b618448 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_ready | ON |
on percona2:
| wsrep_local_state_comment | Synced (6) |
| wsrep_cert_index_size | 2 |
| wsrep_cluster_conf_id | 4 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 8dccca9f-d4b8-11e1-0800-344f6b618448 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 1 |
| wsrep_ready | ON |
Only wsrep_local_index differs, as expected.
Now let’s stop the communication between both nodes (using firewall rules):
iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT
This rule simulates a network outage that makes the connections between the two nodes impossible (switch/router failure)
[root@percona1 ~]# clustercheck
HTTP/1.1 503 Service Unavailable
Content-Type: Content-Type: text/plain
Node is *down*.
We can see that the node appears down, but we can still run some statements on it:
on node1:
| wsrep_local_state_comment | Initialized (0) |
| wsrep_cert_index_size | 2 |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | 8dccca9f-d4b8-11e1-0800-344f6b618448 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy info@codership.com |
| wsrep_provider_version | 2.1(r113) |
| wsrep_ready | OFF |
on node2:
| wsrep_local_state_comment | Initialized (0) |
| wsrep_cert_index_size | 2 |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | 8dccca9f-d4b8-11e1-0800-344f6b618448 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy info@codership.com |
| wsrep_provider_version | 2.1(r113) |
| wsrep_ready | OFF |
And if you try to use the MySQL server:
[root@percona1 ~]# mysql
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 868
Server version: 5.5.24
Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
percona1 mysql> use test
ERROR 1047 (08S01): Unknown command
If you try to insert data just when the communication problem occurs, here is what you get:
percona1 mysql> insert into percona values (0,'percona1','peter');
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
percona1 mysql> insert into percona values (0,'percona1','peter');
ERROR 1047 (08S01): Unknown command
Is it possible anyway to have a two-node cluster?
So, by default, Percona XtraDB Cluster does the right thing; this is how it needs to work, and you don't suffer any critical problem when you have enough nodes.
But how can we deal with that and avoid having the service stop? If we check the list of parameters on Galera's wiki, we can see that there are two options related to that:
– pc.ignore_quorum: Completely ignore quorum calculations. E.g. in case master splits from several slaves it still remains operational. Use with extreme caution even in master-slave setups, because slaves won’t automatically reconnect to master in this case.
– pc.ignore_sb: Should we allow nodes to process updates even in the case of split brain? This is a dangerous setting in multi-master setup, but should simplify things in master-slave cluster (especially if only 2 nodes are used).
Let's first try ignoring the quorum:
By ignoring the quorum, we ask the cluster not to perform the majority calculation used to define the Primary Component (PC). A component is a set of nodes which are connected to each other, and when everything is OK, the whole cluster is one component. For example, if you have 3 nodes and 1 node gets isolated (2 nodes can see each other and 1 node can see only itself), we then have 2 components: the quorum calculation gives 2/3 (66%) on the 2 nodes communicating with each other and 1/3 (33%) on the isolated one. In this case the service is stopped on the nodes where the majority is not reached. The quorum algorithm helps to select a PC and guarantees that there is never more than one Primary Component in the cluster.
In our 2-node setup, when the communication between the 2 nodes is broken, the quorum will be 1/2 (50%) on both nodes, which is not a majority... therefore the service is stopped on both nodes. In this case, service means accepting queries.
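To make the arithmetic explicit, here is a minimal sketch of that majority check, under the simplifying assumption that all nodes have equal weight (Galera's real calculation also takes node weights and states into account):
# nodes this side of the partition can still see, vs. members of the last Primary Component
comp=1; last_pc=2
if [ $(( 2 * comp )) -gt $last_pc ]; then
  echo "strict majority: this component stays Primary"
else
  echo "no majority: this component becomes non-Primary and stops serving queries"
fi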
Back to our test, we check that the data is the same on both nodes:
percona1 mysql> select * from percona;
+----+---------------+--------+
| id | inserted_from | name |
+----+---------------+--------+
| 2 | percona1 | lefred |
| 3 | percona2 | kenny |
+----+---------------+--------+
2 rows in set (0.00 sec)
percona2 mysql> select * from percona;
+----+---------------+--------+
| id | inserted_from | name |
+----+---------------+--------+
| 2 | percona1 | lefred |
| 3 | percona2 | kenny |
+----+---------------+--------+
2 rows in set (0.00 sec)
Let's add ignore quorum and restart both nodes:
set global wsrep_provider_options="pc.ignore_quorum=true"; (this currently does not seem to work properly; I needed to change it in my.cnf and restart the nodes)
wsrep_provider_options = "pc.ignore_quorum = true"
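If you go through my.cnf, the line belongs in the [mysqld] section on both nodes, followed by a restart of mysqld (a sketch; the init script name is an assumption):
# /etc/my.cnf, [mysqld] section, on both nodes:
#   wsrep_provider_options = "pc.ignore_quorum = true"
service mysql restart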
Break the connection between both nodes again:
iptables -A INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT
and perform an insert; this first insert will take longer:
percona1 mysql> insert into percona values (0,'percona1','vadim');
Query OK, 1 row affected (7.77 sec)
percona1 mysql> insert into percona values (0,'percona1','miguel');
Query OK, 1 row affected (0.01 sec)
and on node2 you can also add records:
percona2 mysql> insert into percona values (0,'percona2','jay');
Query OK, 1 row affected (0.02 sec)
The wsrep status variables now look like this on each node:
| wsrep_local_state_uuid | e20f9da7-d509-11e1-0800-013f68429ec1 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | e20f9da7-d509-11e1-0800-013f68429ec1 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_ready | ON |
| wsrep_local_state_uuid | e20f9da7-d509-11e1-0800-013f68429ec1 |
| wsrep_local_state_comment | Synced (6) |
| wsrep_cluster_size | 1 |
| wsrep_cluster_state_uuid | e20f9da7-d509-11e1-0800-013f68429ec1 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_ready | ON |
Then we fix the connection problem:
iptables -D INPUT -d 192.168.70.3 -s 192.168.70.2 -j REJECT
Nothing changes; the data stays different:
percona1 mysql> select * from percona;
+----+---------------+--------+
| id | inserted_from | name |
+----+---------------+--------+
| 2 | percona1 | lefred |
| 3 | percona2 | kenny |
| 5 | percona1 | liz |
| 9 | percona1 | ewen |
| 11 | percona1 | vadim |
| 13 | percona1 | miguel |
+----+---------------+--------+
6 rows in set (0.00 sec)
percona2 mysql> select * from percona;
+----+---------------+--------+
| id | inserted_from | name |
+----+---------------+--------+
| 2 | percona1 | lefred |
| 3 | percona2 | kenny |
| 5 | percona1 | liz |
| 9 | percona1 | ewen |
| 10 | percona2 | jay |
+----+---------------+--------+
5 rows in set (0.01 sec)
You can keep inserting data; it won't be replicated and you will have 2 different versions of your inconsistent data!
Also, when we restart a node, it will request an SST, or in certain cases fail to start like this:
120723 23:45:30 [ERROR] WSREP: Local state seqno (6) is greater than group seqno (5): states diverged. Aborting to avoid potential data loss. Remove '/var/lib/mysql//grastate.dat' file and restart if you wish to continue. (FATAL)
at galera/src/replicator_str.cpp:state_transfer_required():34
120723 23:45:30 [Note] WSREP: applier thread exiting (code:7)
120723 23:45:30 [ERROR] Aborting
Of course all the data that was written on the node we just restarted is lost after the SST.
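If you hit the error above, the recovery is the one the message itself suggests (a sketch, assuming the default datadir and init script name; keep in mind this discards the local state and forces a full SST):
rm /var/lib/mysql/grastate.dat
service mysql start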
Now let’s try with pc.ignore_sb=true:
When the quorum algorithm fails to select a Primary Component, we then have a split-brain condition. In our 2-node setup, when a node loses the connection to its only peer, the default is to stop accepting queries to avoid database inconsistency. We can bypass this behaviour and ignore the split-brain by adding
wsrep_provider_options = "pc.ignore_sb = true" in my.cnf.
Then we can insert on both nodes without any problem while the connection between the nodes is gone:
percona1 mysql> insert into percona values (0,'percona1','jaime');
Query OK, 1 row affected (0.02 sec)
percona1 mysql> select * from percona;
+----+---------------+--------+
| id | inserted_from | name |
+----+---------------+--------+
| 2 | percona1 | lefred |
| 3 | percona2 | kenny |
| 5 | percona1 | liz |
| 9 | percona1 | ewen |
| 11 | percona1 | vadim |
| 13 | percona1 | miguel |
| 14 | percona1 | marcos |
| 15 | percona1 | baron |
| 16 | percona1 | brian |
| 17 | percona1 | jaime |
+----+---------------+--------+
10 rows in set (0.00 sec)
percona2 mysql> insert into percona values (0,'percona2','daniel');
Query OK, 1 row affected (0.02 sec)
percona2 mysql> select * from percona;
+----+---------------+--------+
| id | inserted_from | name |
+----+---------------+--------+
| 2 | percona1 | lefred |
| 3 | percona2 | kenny |
| 5 | percona1 | liz |
| 9 | percona1 | ewen |
| 11 | percona1 | vadim |
| 13 | percona1 | miguel |
| 14 | percona1 | marcos |
| 15 | percona1 | baron |
| 16 | percona1 | brian |
| 17 | percona2 | daniel |
+----+---------------+--------+
10 rows in set (0.00 sec)
When the connection is back, the two servers act as if they were independent: they are now two single-node clusters.
We can see it in the log file:
120724 10:58:09 [Note] WSREP: evs::proto(86928728-d56d-11e1-0800-f7c4916d8330, GATHER, view_id(REG,7e6d285b-d56d-11e1-0800-2491595e99bb,2)) detected inactive node: 7e6d285b-d56d-11e1-0800-2491595e99bb
120724 10:58:09 [Warning] WSREP: Ignoring possible split-brain (allowed by configuration) from view:
view(view_id(REG,7e6d285b-d56d-11e1-0800-2491595e99bb,2) memb {
7e6d285b-d56d-11e1-0800-2491595e99bb,
86928728-d56d-11e1-0800-f7c4916d8330,
} joined {
7e6d285b-d56d-11e1-0800-2491595e99bb,
} left {
} partitioned {
})
to view:
view(view_id(TRANS,7e6d285b-d56d-11e1-0800-2491595e99bb,2) memb {
86928728-d56d-11e1-0800-f7c4916d8330,
} joined {
} left {
} partitioned {
7e6d285b-d56d-11e1-0800-2491595e99bb,
})
120724 10:58:09 [Note] WSREP: view(view_id(PRIM,86928728-d56d-11e1-0800-f7c4916d8330,3) memb {
86928728-d56d-11e1-0800-f7c4916d8330,
} joined {
} left {
} partitioned {
7e6d285b-d56d-11e1-0800-2491595e99bb,
})
120724 10:58:09 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
120724 10:58:09 [Note] WSREP: forgetting 7e6d285b-d56d-11e1-0800-2491595e99bb (tcp://192.168.70.2:4567)
120724 10:58:09 [Note] WSREP: deleting entry tcp://192.168.70.2:4567
120724 10:58:09 [Note] WSREP: (86928728-d56d-11e1-0800-f7c4916d8330, 'tcp://0.0.0.0:4567') turning message relay requesting off
120724 10:58:09 [Note] WSREP: STATE_EXCHANGE: sent state UUID: b0dddbb3-d56d-11e1-0800-b62dc3759660
120724 10:58:09 [Note] WSREP: STATE EXCHANGE: sent state msg: b0dddbb3-d56d-11e1-0800-b62dc3759660
120724 10:58:09 [Note] WSREP: STATE EXCHANGE: got state msg: b0dddbb3-d56d-11e1-0800-b62dc3759660 from 0 (percona2)
120724 10:58:09 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 2,
members = 1/1 (joined/total),
act_id = 16,
last_appl. = 0,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = e20f9da7-d509-11e1-0800-013f68429ec1
120724 10:58:09 [Note] WSREP: Flow-control interval: [8, 16]
120724 10:58:09 [Note] WSREP: New cluster view: global state: e20f9da7-d509-11e1-0800-013f68429ec1:16, view# 3: Primary, number of nodes: 1, my index: 0, protocol version 2
Now, when we set both options to true with a two-node cluster, it acts exactly as when ignore_sb is enabled.
And if the local state seqno is greater than the group seqno, it fails to restart. You again need to delete the grastate.dat file to request a full SST, and you again lose some data.
This is why a two-node cluster is not recommended at all. Now, if you only have storage for 2 nodes, using the Galera arbitrator is a very good alternative.
On a third node, instead of running Percona XtraDB Cluster (mysqld), just run garbd:
Currently there is no init script for garbd, but this is easy to write, as garbd can run in daemon mode using -d (see the daemonized invocation after the log below).
[root@percona3 ~]# garbd -a gcomm://192.168.70.2:4567 -g trimethylxanthine
2012-07-24 11:42:50.237 INFO: Read config:
daemon: 0
address: gcomm://192.168.70.2:4567
group: trimethylxanthine
sst: trivial
donor:
options: gcs.fc_limit=9999999; gcs.fc_factor=1.0; gcs.fc_master_slave=yes
cfg:
log:
2012-07-24 11:42:50.248 INFO: protonet asio version 0
2012-07-24 11:42:50.252 INFO: backend: asio
2012-07-24 11:42:50.254 INFO: GMCast version 0
2012-07-24 11:42:50.255 INFO: (eed227a2-d573-11e1-0800-b8b68845d409, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2012-07-24 11:42:50.255 INFO: (eed227a2-d573-11e1-0800-b8b68845d409, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2012-07-24 11:42:50.257 INFO: EVS version 0
2012-07-24 11:42:50.258 INFO: PC version 0
2012-07-24 11:42:50.258 INFO: gcomm: connecting to group 'trimethylxanthine', peer '192.168.70.2:4567'
2012-07-24 11:42:50.270 INFO: (eed227a2-d573-11e1-0800-b8b68845d409, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.70.3:4567
2012-07-24 11:42:50.533 INFO: (eed227a2-d573-11e1-0800-b8b68845d409, 'tcp://0.0.0.0:4567') turning message relay requesting off
2012-07-24 11:42:51.080 INFO: declaring 577c0ad0-d570-11e1-0800-198f7634d8ec stable
2012-07-24 11:42:51.080 INFO: declaring 593e68c8-d570-11e1-0800-74c74791749b stable
2012-07-24 11:42:51.088 INFO: view(view_id(PRIM,577c0ad0-d570-11e1-0800-198f7634d8ec,8) memb {
577c0ad0-d570-11e1-0800-198f7634d8ec,
593e68c8-d570-11e1-0800-74c74791749b,
eed227a2-d573-11e1-0800-b8b68845d409,
} joined {
} left {
} partitioned {
})
2012-07-24 11:42:51.268 INFO: gcomm: connected
2012-07-24 11:42:51.268 INFO: Changing maximum packet size to 64500, resulting msg size: 32636
2012-07-24 11:42:51.269 INFO: Shifting CLOSED -> OPEN (TO: 0)
2012-07-24 11:42:51.271 INFO: Opened channel 'trimethylxanthine'
2012-07-24 11:42:51.272 INFO: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
2012-07-24 11:42:51.273 INFO: STATE EXCHANGE: Waiting for state UUID.
2012-07-24 11:42:51.276 INFO: STATE EXCHANGE: sent state msg: f00ff3f4-d573-11e1-0800-db814c3ecb8f
2012-07-24 11:42:51.276 INFO: STATE EXCHANGE: got state msg: f00ff3f4-d573-11e1-0800-db814c3ecb8f from 0 (percona1)
2012-07-24 11:42:51.276 INFO: STATE EXCHANGE: got state msg: f00ff3f4-d573-11e1-0800-db814c3ecb8f from 1 (percona2)
2012-07-24 11:42:51.292 INFO: STATE EXCHANGE: got state msg: f00ff3f4-d573-11e1-0800-db814c3ecb8f from 2 (garb)
2012-07-24 11:42:51.292 INFO: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 6,
members = 2/3 (joined/total),
act_id = 19,
last_appl. = -1,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = e20f9da7-d509-11e1-0800-013f68429ec1
2012-07-24 11:42:51.292 INFO: Flow-control interval: [9999999, 9999999]
2012-07-24 11:42:51.292 INFO: Shifting OPEN -> PRIMARY (TO: 19)
2012-07-24 11:42:51.292 INFO: Sending state transfer request: 'trivial', size: 7
2012-07-24 11:42:51.297 INFO: Node 2 (garb) requested state transfer from '*any*'. Selected 0 (percona1)(SYNCED) as donor.
2012-07-24 11:42:51.297 INFO: Shifting PRIMARY -> JOINER (TO: 19)
2012-07-24 11:42:51.303 INFO: 2 (garb): State transfer from 0 (percona1) complete.
2012-07-24 11:42:51.308 INFO: Shifting JOINER -> JOINED (TO: 19)
2012-07-24 11:42:51.311 INFO: Member 2 (garb) synced with group.
2012-07-24 11:42:51.311 INFO: Shifting JOINED -> SYNCED (TO: 19)
2012-07-24 11:42:51.325 WARN: 0 (percona1): State transfer to 2 (garb) failed: -1 (Operation not permitted)
2012-07-24 11:42:51.328 INFO: Member 0 (percona1) synced with group.
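To daemonize garbd with the -d flag mentioned above, the same invocation could look like this (a sketch; the log file option and its path are assumptions):
garbd -d -a gcomm://192.168.70.2:4567 -g trimethylxanthine -l /var/log/garbd.log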
If the communication fails between node1 and node2, they will still communicate and eventually relay the changes through the node running garbd (node3). If one node dies, the other one keeps working without any problem, and when the dead node comes back it will perform an IST or SST.
In conclusion: a 2-node cluster is possible with Percona XtraDB Cluster, but it is not advised at all because it will generate a lot of problems in case of an issue on one of the nodes. It is much safer to use a 3rd node, even a fake one running garbd.
If you plan anyway to have a cluster with only 2 nodes, don't forget that:
– by default, if one peer dies or if the communication between both nodes is unstable, both nodes will stop accepting queries
– if you choose to ignore split-brain or quorum, you can very easily end up with inconsistent data