Jun
14
2018
--

What is the Top Cause of Application Downtime Today?

Application outages lurking monster

Application outages lurking monsterI frequently talk to our customer base about what keeps them up at night. While there is a large variance of answers, they tend to fall into one of two categories. The first is the conditioned fear of some monster lurking behind the scenes that could pounce at any time. The second, of course, is the actual monster of downtime on a critical system. Ask most tech folks and they will tell you outages seem to only happen late at night or early in the morning. And that they do keep them up.

Entire companies and product lines have been built around providing those in the IT world with some ability to sleep at night. Modern enterprises have spent millions to mitigate the risk and prevent their businesses from having a really bad day because of an outage. Cloud providers are attuned to the downtime dilemma and spend lots of time, money, and effort to build in redundancy and make “High Availability” (HA) as easy as possible. The frequency of “hardware” or server issues continues to dwindle.

Where does the downtime issue start?

In my discussions, most companies I have talked to say their number one cause of outages and customer interruptions is ultimately related to the deployment of new or upgraded code. Often I hear the operations team has little or no involvement with an application until it’s put into production. It is a bit ironic that this is also the area where companies tend to drastically under-invest. They opt instead to invest in ways to “Scale Out or Up”. Or perhaps how to survive asteroids hitting two out three of their data centers.

Failing over broken or slow code from one server to another does not fix it. Adding more servers to distribute the load can mitigate a problem, but can also escalate the cost dramatically. In most cases, the solutions they apply don’t address the primary cause of the problems.

While there are some fantastic tools out there that can help with getting better visibility into code level issues — such as New Relic, AppDynamics and others — the real problem is that these often end up being used to diagnose issues after they have appeared in production. Most companies carry out some amount of testing before releasing code, but typically it is a fraction of what they should be doing. Working for a company that specializes in open source databases, we get a lot of calls on issues that have prevented companies’ end users from using critical applications. Many of these problems are fixable before they cost a loss of revenue and reputation.

I think it’s time technology companies start to rethink our QA, Testing, and Pre-Deployment requirements. How much time, effort, and money can we save if we catch these “monsters” before they make it into production?

Not to mention how much better our operations team will sleep . . .

The post What is the Top Cause of Application Downtime Today? appeared first on Percona Database Performance Blog.

May
30
2018
--

MySQL Test Framework for Percona XtraDB Cluster

MySQL Test Framework

At my latest webinar “MySQL Test Framework (MTR) for Troubleshooting”, I received an interesting question about MTR test cases for Percona XtraDB Cluster (PXC). Particularly about testing SST and IST.

This post is intended to answer this question. It assumes you are familiar with MTR and can write tests for MySQL servers. If you are not, please watch the webinar recording first.

You can find example tests in any PXC tarball package. They are located in directories

mysql-test/suite/galera

 ,

mysql-test/suite/galera_3nodes

  and

mysql-test/suite/wsrep

 , though that last directory only contains a configuration file.

If you simply try to run tests in galera suite you will find they all are disabled, because the environment variable

WSREP_PROVIDER

  was not set:

sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test$ ./mtr --suite=galera
Logging: ./mtr --suite=galera
MySQL Version 5.7.19
Too long tmpdir path '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/tmp' creating a shorter one...
- using tmpdir: '/tmp/xYgQqOa5b7'
Checking supported features...
- SSL connections supported
- binaries built with wsrep patch
Using suites: galera
Collecting tests...
Checking leftover processes...
- found old pid 30624 in 'mysqld.3.pid', killing it...
process did not exist!
Removing old var directory...
Creating var directory '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var'...
Installing system database...
Using parallel: 1
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
galera.GAL-419 [ skipped ] Test needs 'big-test' option
...
galera.galera_binlog_checksum [ skipped ] Test requires wsrep provider library (libgalera_smm.so). Did you set $WSREP_PROVIDER?
galera.galera_binlog_event_max_size_min [ skipped ] Test requires wsrep provider library (libgalera_smm.so). Did you set $WSREP_PROVIDER?
galera.galera_flush_gtid [ skipped ] Test requires wsrep provider library (libgalera_smm.so). Did you set $WSREP_PROVIDER?
galera.galera_gtid [ skipped ] Test requires wsrep provider library (libgalera_smm.so). Did you set $WSREP_PROVIDER?
galera.lp1435482 [ skipped ] Test requires wsrep provider library (libgalera_smm.so). Did you set $WSREP_PROVIDER?
^Cmysql-test-run: *** ERROR: Got ^C signal

In order to run these tests you need to set this variable first.

I use the quite outdated 5.7.19 PXC package (the version does not matter for the purpose of this post) and run tests as:

WSREP_PROVIDER=/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/lib/libgalera_smm.so ./mtr --suite=galera

After the variable

WSREP_PROVIDER

  is set, 

mtr

  can successfully run:

sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test

WSREP_PROVIDER=/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/lib/libgalera_smm.so ./mtr --suite=galera
Logging: ./mtr --suite=galera
MySQL Version 5.7.19
Too long tmpdir path '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/tmp' creating a shorter one...
- using tmpdir: '/tmp/I6HfuqkwR1'
Checking supported features...
- SSL connections supported
- binaries built with wsrep patch
Using suites: galera
Collecting tests...
Checking leftover processes...
- found old pid 14271 in 'mysqld.1.pid', killing it...
process did not exist!
- found old pid 14273 in 'mysqld.2.pid', killing it...
process did not exist!
Removing old var directory...
Creating var directory '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var'...
Installing system database...
Using parallel: 1
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
galera.GAL-419 [ skipped ] Test needs 'big-test' option
...
worker[1] mysql-test-run: WARNING: Waited 60 seconds for /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/run/mysqld.2.pid to be created, still waiting for 120 seconds...
galera.galera_binlog_checksum [ pass ] 2787
worker[1] mysql-test-run: WARNING: Waited 60 seconds for /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/run/mysqld.2.pid to be created, still waiting for 120 seconds...
galera.galera_binlog_event_max_size_min [ pass ] 2200
...

Now we are ready to write our first PXC test. The easiest way to get started is to open any existing test and check how it is written. Then modify it so that it replays our own scenario.

Since the question was about testing

IST

  and

SST

, I will use the test

galera_ist_progress

  as an example. First let’s check that it runs successfully and that it does not have any requirements that could prevent it from running inside regular production binaries:

sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test$ WSREP_PROVIDER=/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/lib/libgalera_smm.so ./mtr --suite=galera galera_ist_progress
Logging: ./mtr --suite=galera galera_ist_progress
MySQL Version 5.7.19
Too long tmpdir path '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/tmp' creating a shorter one...
- using tmpdir: '/tmp/EodvOyCJwo'
Checking supported features...
- SSL connections supported
- binaries built with wsrep patch
Collecting tests...
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var'...
Installing system database...
Using parallel: 1
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
worker[1] mysql-test-run: WARNING: Waited 60 seconds for /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/run/mysqld.2.pid to be created, still waiting for 120 seconds...
galera.galera_ist_progress [ pass ] 17970
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 17.970 of 218 seconds executing testcases
Completed: All 1 tests were successful.

Everything is fine. Now let’s look into the test itself.

First, this test has its own configuration file. Let’s check what’s in there:

$ cat suite/galera/t/galera_ist_progress.cnf
!include ../galera_2nodes.cnf
[mysqld.1]
wsrep_provider_options='base_port=@mysqld.1.#galera_port;pc.ignore_sb=true'

galera_2nodes.cnf

  is one of the standard configuration files in galera suite. If we look into it we may notice that 

wsrep_provider_options

  is defined and overriding this option is not required for all tests.

We’ll continue our review. The test script includes the 

galera_cluster.inc

  file:

--source include/galera_cluster.inc

This file is located outside of galera suite and contains 2 lines:

--let $galera_cluster_size = 2
--source include/galera_init.inc

galera_init.inc

 , in its turn, creates as many nodes as defined by the 

galera_cluster_size

  variable and additionally creates a default connection for each of them.

Now let’s step out from

galera_ist_progress

  and check if this knowledge is enough to create our first PXC test.

I created a simple test based on a two node setup which checks a few status and system variables, creates a table, inserts data into it, and ensures that content is accessible on both nodes:

$ cat ~/src/tests/t/pxc.test
--source include/galera_cluster.inc
--connection node_1
--echo We are on node 1
select @@hostname, @@port;
show status like 'wsrep_cluster_size';
show status like 'wsrep_cluster_status';
show status like 'wsrep_connected';
create table t1(id int not null auto_increment primary key, f1 int) engine=innodb;
insert into t1(f1) values(1),(2),(3);
select * from t1;
--connection node_2
--echo We are on node 2
select @@hostname, @@port;
show status like 'wsrep_cluster_size';
show status like 'wsrep_cluster_status';
show status like 'wsrep_connected';
select * from t1;
insert into t1(f1) values(1),(2),(3);
select * from t1;
--connection node_1
--echo We are on node 1
select * from t1;
drop table t1;

However, if I run this test in the main suite, it will fail:

sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test$ export WSREP_PROVIDER=/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/lib/libgalera_smm.so
sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test$ do_test.sh -s ~/mysql_packages -b Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100
Logging: ./mysql-test-run.pl --record --force pxc
MySQL Version 5.7.19
Too long tmpdir path '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/tmp' creating a shorter one...
- using tmpdir: '/tmp/uUmBztSWUA'
Checking supported features...
- SSL connections supported
- binaries built with wsrep patch
Collecting tests...
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var'...
Installing system database...
Using parallel: 1
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
main.pxc [ skipped ] Test requires wsrep provider library (libgalera_smm.so). Did you set $WSREP_PROVIDER?
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 0.000 of 108 seconds executing testcases
Completed: All 0 tests were successful.
1 tests were skipped, 1 by the test itself.
=====Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100=====
=====pxc=====
sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test$ echo $WSREP_PROVIDER
/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/lib/libgalera_smm.so

The reason for this failure is that galera suite has default option files that set the necessary variables. Let’s skip those option files for a while and simply run our test in galera suite:

sveta@Thinkie:~/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test$ do_test.sh -s ~/mysql_packages -b Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100 -t galera
Logging: ./mysql-test-run.pl --record --force --suite=galera pxc
MySQL Version 5.7.19
Too long tmpdir path '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/tmp' creating a shorter one...
- using tmpdir: '/tmp/ytqEjnfM7i'
Checking supported features...
- SSL connections supported
- binaries built with wsrep patch
Collecting tests...
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var'...
Installing system database...
Using parallel: 1
==============================================================================
TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
worker[1] mysql-test-run: WARNING: Waited 60 seconds for /home/sveta/mysql_packages/Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100/mysql-test/var/run/mysqld.2.pid to be created, still waiting for 120 seconds...
galera.pxc [ pass ] 2420
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 2.420 of 208 seconds executing testcases
Completed: All 1 tests were successful.
pxc.result
=====Percona-XtraDB-Cluster-5.7.19-rel17-29.22.3.Linux.x86_64.ssl100=====
=====pxc=====
We are on node 1
select @@hostname, @@port;
@@hostname @@port
Thinkie 13000
show status like 'wsrep_cluster_size';
Variable_name Value
wsrep_cluster_size 2
show status like 'wsrep_cluster_status';
Variable_name Value
wsrep_cluster_status Primary
show status like 'wsrep_connected';
Variable_name Value
wsrep_connected ON
create table t1(id int not null auto_increment primary key, f1 int) engine=innodb;
insert into t1(f1) values(1),(2),(3);
select * from t1;
id f1
2 1
4 2
6 3
We are on node 2
select @@hostname, @@port;
@@hostname @@port
Thinkie 13004
show status like 'wsrep_cluster_size';
Variable_name Value
wsrep_cluster_size 2
show status like 'wsrep_cluster_status';
Variable_name Value
wsrep_cluster_status Primary
show status like 'wsrep_connected';
Variable_name Value
wsrep_connected ON
select * from t1;
id f1
2 1
4 2
6 3
insert into t1(f1) values(1),(2),(3);
select * from t1;
id f1
2 1
4 2
6 3
7 1
9 2
11 3
We are on node 1
select * from t1;
id f1
2 1
4 2
6 3
7 1
9 2
11 3
drop table t1;

You will see that the test reports that the two nodes run on different ports:

We are on node 1
select @@hostname, @@port;
@@hostname @@port
Thinkie 13000
...
We are on node 2
select @@hostname, @@port;
@@hostname @@port
Thinkie 13004

… and that PXC started:

show status like 'wsrep_cluster_size';
Variable_name Value
wsrep_cluster_size 2
show status like 'wsrep_cluster_status';
Variable_name Value
wsrep_cluster_status Primary
show status like 'wsrep_connected';
Variable_name Value
wsrep_connected ON

And we can also clearly see that each node sees the changes to our test table that were made by the other node.

Now let’s get back to

IST

  test, defined in

galera_ist_progress.test

 .

In order to test

IST

  it first stops writes to the cluster:

# Isolate node #2
--connection node_2
SET GLOBAL wsrep_provider_options = 'gmcast.isolate = 1';

Then it connects to node 1 and waits until 

wsrep_cluster_size

  becomes 1:

--connection node_1
--let $wait_condition = SELECT VARIABLE_VALUE = 1 FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_size';
--source include/wait_condition.inc

Then it turns

wsrep_on OFF

  on node 2:

--connection node_2
SET SESSION wsrep_on = OFF;
--let $wait_condition = SELECT VARIABLE_VALUE = 'non-Primary' FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_status';
--source include/wait_condition.inc
SET SESSION wsrep_on = ON;

Now node 2 is completely isolated and node 1 can be updated, so we can test

IST

  when we bring node 2 back online.

--connection node_1
CREATE TABLE t1 (f1 INTEGER) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
INSERT INTO t1 VALUES (2);
INSERT INTO t1 VALUES (3);
INSERT INTO t1 VALUES (4);
INSERT INTO t1 VALUES (5);
INSERT INTO t1 VALUES (6);
INSERT INTO t1 VALUES (7);
INSERT INTO t1 VALUES (8);
INSERT INTO t1 VALUES (9);
INSERT INTO t1 VALUES (10);

After the update is done, node 2 is brought online:

--connection node_2
SET GLOBAL wsrep_provider_options = 'gmcast.isolate = 0';
--connection node_1
--let $wait_condition = SELECT VARIABLE_VALUE = 2 FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_size';
--source include/wait_condition.inc
--connection node_2
--let $wait_condition = SELECT VARIABLE_VALUE = 'Primary' FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_status';
--source include/wait_condition.inc

Once node 2 is online, checks for IST progress are performed. To check for IST progress, the test greps the error log file from node 2 where any messages about IST progress are printed:

#
# Grep for expected IST output in joiner log
#
--connection node_1
--let $assert_count = 1
--let $assert_file = $MYSQLTEST_VARDIR/log/mysqld.2.err
--let $assert_only_after = Need state transfer
--let $assert_text = Receiving IST: 11 writesets, seqnos
--let $assert_select = Receiving IST: 11 writesets, seqnos
--source include/assert_grep.inc
--let $assert_text = Receiving IST... 0.0% ( 0/11 events) complete
--let $assert_select = Receiving IST... 0.0% ( 0/11 events) complete
--source include/assert_grep.inc
--let $assert_text = Receiving IST...100.0% (11/11 events) complete
--let $assert_select = Receiving IST...100.0% (11/11 events) complete
--source include/assert_grep.inc

Here is the error log snipped from node 2 when it re-joined the cluster and initiated state transfer.

2018-05-25T17:00:46.908569Z 0 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 13)
2018-05-25T17:00:46.908637Z 2 [Note] WSREP: State transfer required:
	Group state: f364a69b-603c-11e8-a632-ce5a4a7d5964:13
	Local state: f364a69b-603c-11e8-a632-ce5a4a7d5964:2
2018-05-25T17:00:46.908673Z 2 [Note] WSREP: New cluster view: global state: f364a69b-603c-11e8-a632-ce5a4a7d5964:13, view# 4: Primary, number of nodes: 2, my index: 1, protocol version 3
2018-05-25T17:00:46.908694Z 2 [Note] WSREP: Setting wsrep_ready to true
2018-05-25T17:00:46.908717Z 2 [Warning] WSREP: Gap in state sequence. Need state transfer.
2018-05-25T17:00:46.908737Z 2 [Note] WSREP: Setting wsrep_ready to false
2018-05-25T17:00:46.908757Z 2 [Note] WSREP: You have configured 'xtrabackup-v2' state snapshot transfer method which cannot be performed on a running server. Wsrep provider won't be able to fall back to it if other means of state transfer are unavailable. In that case you will need to restart the server.
2018-05-25T17:00:46.908777Z 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-05-25T17:00:46.908799Z 2 [Note] WSREP: REPL Protocols: 7 (3, 2)
2018-05-25T17:00:46.908831Z 2 [Note] WSREP: Assign initial position for certification: 13, protocol version: 3
2018-05-25T17:00:46.908886Z 0 [Note] WSREP: Service thread queue flushed.
2018-05-25T17:00:46.908934Z 2 [Note] WSREP: Check if state gap can be serviced using IST
2018-05-25T17:00:46.909062Z 2 [Note] WSREP: IST receiver addr using tcp://127.0.0.1:13006
2018-05-25T17:00:46.909232Z 2 [Note] WSREP: Prepared IST receiver, listening at: tcp://127.0.0.1:13006
2018-05-25T17:00:46.909267Z 2 [Note] WSREP: State gap can be likely serviced using IST. SST request though present would be void.
2018-05-25T17:00:46.909489Z 0 [Note] WSREP: Member 1.0 (Thinkie) requested state transfer from '*any*'. Selected 0.0 (Thinkie)(SYNCED) as donor.
2018-05-25T17:00:46.909513Z 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 13)
2018-05-25T17:00:46.909557Z 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-05-25T17:00:46.909602Z 2 [Note] WSREP: GCache history reset: f364a69b-603c-11e8-a632-ce5a4a7d5964:2 -> f364a69b-603c-11e8-a632-ce5a4a7d5964:13
2018-05-25T17:00:46.910221Z 0 [Note] WSREP: 0.0 (Thinkie): State transfer to 1.0 (Thinkie) complete.
2018-05-25T17:00:46.910422Z 0 [Note] WSREP: Member 0.0 (Thinkie) synced with group.
2018-05-25T17:00:47.006802Z 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): full reset
2018-05-25T17:00:47.106423Z 2 [Note] WSREP: Receiving IST: 11 writesets, seqnos 2-13
2018-05-25T17:00:47.106764Z 0 [Note] WSREP: Receiving IST...  0.0% ( 0/11 events) complete.
2018-05-25T17:00:47.109740Z 0 [Note] WSREP: Receiving IST...100.0% (11/11 events) complete.
2018-05-25T17:00:47.110029Z 2 [Note] WSREP: IST received: f364a69b-603c-11e8-a632-ce5a4a7d5964:13
2018-05-25T17:00:47.110433Z 0 [Note] WSREP: 1.0 (Thinkie): State transfer from 0.0 (Thinkie) complete.
2018-05-25T17:00:47.110480Z 0 [Note] WSREP: SST leaving flow control
2018-05-25T17:00:47.110509Z 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 13)
2018-05-25T17:00:47.110778Z 0 [Note] WSREP: Member 1.0 (Thinkie) synced with group.
2018-05-25T17:00:47.110830Z 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 13)
2018-05-25T17:00:47.110890Z 2 [Note] WSREP: Synchronized with group, ready for connections

If you want to write your own tests for IST and SST operations you can use existing test cases as a baseline. You are not required to use grep, and can explore your own scenarios. The important parts of the code are:

  • The variable
    WSREP_PROVIDER

     must be set before the test run

  • The test should be either in galera suite or if you choose to use your own suite you must copy the definitions from the galera suite default configuration file
  • The test should include the file
    include/galera_cluster.inc
  • To isolate the node from the cluster run the following code:
# Isolate node #2
--connection node_2
SET GLOBAL wsrep_provider_options = 'gmcast.isolate = 1';
--connection node_1
--let $wait_condition = SELECT VARIABLE_VALUE = 1 FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_size';
--source include/wait_condition.inc
--connection node_2
SET SESSION wsrep_on = OFF;
--let $wait_condition = SELECT VARIABLE_VALUE = 'non-Primary' FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_status';
--source include/wait_condition.inc
SET SESSION wsrep_on = ON;

Replace the node numbers if needed.

To bring the node back to the cluster run the following code:

# Restore node #2, IST is performed
--connection node_2
SET GLOBAL wsrep_provider_options = 'gmcast.isolate = 0';
--connection node_1
--let $wait_condition = SELECT VARIABLE_VALUE = 2 FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_size';
--source include/wait_condition.inc
--connection node_2
--let $wait_condition = SELECT VARIABLE_VALUE = 'Primary' FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'wsrep_cluster_status';
--source include/wait_condition.inc

Depending on the size of the updates and

gcache

 you can test either IST or SST in this way.

The post MySQL Test Framework for Percona XtraDB Cluster appeared first on Percona Database Performance Blog.

Apr
06
2015
--

The cost of not properly managing your databases

Every day hundreds of millions of dollars are wasted by allowing improperly tuned or misconfigured systems, misunderstood infrastructure, and inefficient IT operations to live and thrive in data centers around the globe. There are both direct and indirect costs associated with allowing these unhealthy systems to continue to exist. Let’s look at some.

The setup:

Let us start by using a small example. We will start by looking at a small database setup. This setup will have a single master-slave, with a database size of lets say 500GB. Traffic is steady and let’s say this translates into 500 IOPS on the master. You have chosen to host this on Amazon’s AWS. A common way of ensuring backups occur in AWS is to setup ebs snapshots of the slave. In terms of usage, let us assume your CPU is about 50% used and you have about 20GB of hot data that needs to stay in the memory for the database.

If we look at what this would take to support in EC2 you are looking roughly at this:

  • 2 c3.4xlarge servers (16 vcpu, 30GB of memory )
  • Master-Slave Set
  • with 1TB of Provisioned IOPS SSD, over 2 volumes
  • with 500 IOPS on the master, 125 iops on the slave
  • estimated 7TB of storage for snapshots

This calculator gives us an estimated cost of $3,144.28 per month, or roughly $38,000 a year in hosting fees.  Note that you can choose other tiers of service, or reserved or spot servers to get different pricing.

Regular, steady growth:

Now let’s assume your database is growing along with its traffic at about 5% per month (these are rough numbers I know). After a year your database server would be out of steam using 86% CPU, 34GB of hot data (so relying more heavily on disk), and be consuming just about 850GB of storage space. Moving up to the next tier of servers and with additional iops you will see your spend per month jump to around $4,771.32 per month ($57,000 per year).

When tuning and auditing an environment like the above we been able to give some customers up to 50% or more improvement in performance, and often see 20-25% reduction in space. Let’s be conservative and say we can get a 25% boost in your performance, reduce your 5% monthly growth to 4%, and shrink your database by 10%. Based on that you can stave off upgrading your servers an additional 9 months, saving you almost $15,000 in that first year alone. Over 4 years this customer would end up saving an estimated $75,000 in total spend in AWS costs just based on smaller data and performance enhancements.

DB.Cost.Savings.pt1.png 

In this case performance enhancements are not the only place to save costs. Moving from EBS Snapshots to regular MySQL backups using Percona XtraBackup, keeping one copy on disk and sending those backups to s3, the cost of the environment drops to $2,043.87 per month ( from $3144.28).  This means a simple switch of backup methodology can net you about $1,100 a month or $13,200 a year off your hosting bill.

image02

These numbers are are based on only two servers, the saving over dozens or even hundreds of servers can be huge. Take a look at this 10 server environment:
image04

Often we are not only reducing the resources needed, but we can also reduce the number of servers needed to run your application through tuning. We had a recent client who was able to see a 90% reduction in their read heavy workload and actually turn off servers that used to be used to serve their application. Here is what their savings over the next couple of years would look like:

DB Cost Savings - 90% reduction

Here we helped cut this customers direct costs by two thirds.

Handling spikes:

 The one thing to keep in mind is this assumes a linear growth in terms of application and database usage.  This would mean you can predict when you will need hardware.  If your user base is growing and feature adds are controlled it is possible, however in most environments you will not see that linear growth. You will probably see something like this:

 

DB Workload - Cost savings

Understanding this pattern and the spikes are vital to keeping your costs down. See that giant spike up to 2,500? The first reaction for many is upgrade their hardware, then tune. Inevitably any tuning benefit is offset by the already sunk cost of the hardware upgrade which after the tuning they may not have needed. Getting in front of that spike and preventing it could have saved tens or even hundreds of thousands of dollars.

 Spikes kill performance and cost real dollars. Those spikes may not be easy to find. A few years ago I was working with a Fortune 500 client who had one of these spikes. They had been running perfectly fine with steady but controllable growth for 7 or 8 months, then the 9th month things went very wrong very quick.

A critical component of their company was to certify professionals through a testing application. During their peak time of season these certifications had stopped completely – delaying certifications for thousands of employees and clients for 2 weeks. I was flown out to help control the bleeding and hopefully fix the problem.

The number of users using the application was the same, the number of page views on the web was steady, but the number of queries to the database skyrocketed. None of the queries had hit their threshold of 1 second to be flagged as problem queries. It turned out to be one query that took 250ms to execute that was causing this company to grind to a halt. That one query ended up being executed 25,000 times per page when certain conditions were met, and those conditions were not met until the 9th month after this application was re-released.

This query lay like a trojan horse waiting to destroy this company’s ability to deliver to its customers.

 Two lessons can be learned from this. The first is even a seemingly well-tuned system may not be. Second, small things matter. In this case fixing the code is the correct solution, however, proper indexing of the tables dropped the query time from 250ms down to 50ms.  This was enough of a relief to allow the certification process to start up again until the code could be fixed. A seemingly small impacting query still should be optimized.

 Another source for these performance spikes is a company’s application release cycle. Applications are very a living entity in today’s world. They grow and expand and change on a regular basis. In order to stay ahead of any problems you need to have a process and resources in place that can proactively monitor and tune. Every release of new code should be going through a rigorous performance review to prevent trojan problems that may cause problems and extra costs down the road.

Indirect and hard to calculate costs:

All of this discussion so far has been around direct hosting costs. There is also a cost to your reputation and your ability to deliver services that meet a customer’s expectations. Customers who come to your site or are using your application can leave in droves due to poor performance. We have seen several customers who lost 50% or more of their user base due to performance problems with the application.

Lost revenue and profits are often much more difficult to quantify, and vary greatly from company to company. This cost, however, is very real. Silicon Valley is littered with the remnants of companies that did not plan to address scale or simply missed important problems in their IT infrastructure. Unfortunately I have worked directly with numerous companies that learned this lesson the hard way. These hidden costs can kill a customer quicker than any competitor or market shift.

 One of the biggest hidden costs companies needlessly pay is the cost of downtime.

The cost of downtime:

I was reading a gartner study where they estimated that the cost per minute of downtime was $5,600 dollars; other studies, like this one, have pegged the cost per minute of downtime at $7,900.

Anyway you slice it being down for even a minute costs you money. If we are conservative in our estimates, the cost of an hour of downtime can easily top $100,000. It’s amazing the number of well-established companies that don’t have a solid plan for dealing with downtime.

Let’s look at some common disaster recovery policies:

Restore from backup:

How quickly can your DBAs get alerted to an outage, then login to look at the outage, and finally make a call whether or not to restore? I submit that most people are going to take a few minutes to get an alert (let’s say 2). They will then take a few minutes to get to the computer and into the system (let’s say 5 minutes). Then they will take at least 10 minutes to try and figure out what’s going on. Fast-forward 17 minutes later…. minimum has gone by with nothing to show for it.

Restoring the backup itself could take a few minutes or several hours. Let’s just say 40 minutes total. If we use that $7,900 number, you could have just lost $316,000. That’s a huge amount that could have easily been avoided. Maybe you know that you’re not losing $7,900 a minute, maybe it’s only $1,000. That’s still $40,000!

Manual failover to a slave:

The time for getting, reacting and taking action does not change in this equation. The original 17 minutes of time (minimum) to react and start fixing just potentially cost you $134,300.

Automated failover:

Not all automated failovers are created equal. Some solutions can take several minutes to even hours to restore proper service (passive cold slaves warm up time). Just because you think you are protected does not mean you are. Having the right automated solution can mean you minimize your downtime risks to $10K or less, having the wrong one can be worse than having none at all.

It’s important to understand the cost of downtime and pick the proper solution to mitigate it.

Cost of being wrong is high:

These are just a few of the costs that companies can incur by having the improper database and infrastructure setup. Mitigating these costs requires a solid process, a high-level of expertise, and the right resources in place.

The post The cost of not properly managing your databases appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com