Sep
11
2014
--

OpenStack users shed light on Percona XtraDB Cluster deadlock issues

I was fortunate to attend an Ops discussion about databases at the OpenStack Summit in Atlanta this past May as one of the panelists. The discussion was about deadlock issues OpenStack operators see with Percona XtraDB Cluster (and this is applicable to any Galera-based solution). I asked them to describe what they were seeing, and as it turned out, Nova and Neutron use the SELECT … FOR UPDATE SQL construct quite heavily. This is a topic I thought was worth writing about.

Write set replication in a nutshell (with oversimplification)

Any node is writable, and replication happens in write sets. A write set is practically a row-based binary log event (or events) plus “some additional stuff.” The “some additional stuff” is good for two things.

  • Two write sets can be compared to determine whether they conflict.
  • A write set can be checked against a database to determine whether it’s applicable.

Before committing on the originating node, the write set is transferred to all other nodes in the cluster. The originating node checks that the transaction does not conflict with any of the transactions in the receive queue and checks whether it’s applicable to the database. This process is called certification. After the write set is certified, the transaction is committed. The remote nodes perform certification asynchronously compared to the local node; since certification is deterministic, they will get the same result, and for the same reason the write set can be applied later on the remote nodes. This kind of replication is called virtually synchronous: the data transfer is synchronous, but the actual apply is not.

We have a nice flowchart about this.

Since the write set is only transferred before commit, InnoDB row-level locks, which are held locally, are not held on remote nodes (if they were, each row lock would take a network round trip to acquire). This also means that by default, if multiple nodes are used, the ability to read your own writes is not guaranteed: a certified transaction that is already committed on the originating node can still sit in the receive queue of the node the application is reading from, waiting to be applied.
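
If an application does need read-your-own-writes across nodes, Galera offers a session-level causal-read option. A minimal sketch, assuming the PXC 5.5/5.6-era variable name (later releases renamed it wsrep_sync_wait):

-- Make this session wait until all write sets the node has already
-- received are applied before the read executes (extra read latency).
SET SESSION wsrep_causal_reads = ON;
SELECT * FROM t WHERE id = 1;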

SELECT … FOR UPDATE

The SELECT … FOR UPDATE construct reads the given records in InnoDB and locks the rows that are read from the index the query used, not only the rows that it returns. Given how write set replication works, the row locks of SELECT … FOR UPDATE are not replicated.
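
To illustrate the locking scope with the test table t created in the next section (which has no index on its ts column), a query like the following has to scan the whole table, so it locks every row it examines, not just the matching one. This is a hedged illustration of standard InnoDB behavior at the default isolation level, not something specific to Galera:

-- Full table scan under FOR UPDATE: every scanned row gets locked.
SELECT * FROM t WHERE ts = '2014-06-26 21:37:01' FOR UPDATE;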

Putting it together

Let’s create a test table.

CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

And some records we can lock.

pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)
pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)
pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)
pxc1> insert into t values();
Query OK, 1 row affected (0.00 sec)
pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)

pxc1> select * from t;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:01 |
|  4 | 2014-06-26 21:37:02 |
|  7 | 2014-06-26 21:37:02 |
| 10 | 2014-06-26 21:37:03 |
| 13 | 2014-06-26 21:37:03 |
+----+---------------------+
5 rows in set (0.00 sec)

On the first node, lock the record.

pxc1> start transaction;
Query OK, 0 rows affected (0.00 sec)
pxc1> select * from t where id=1 for update;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:01 |
+----+---------------------+
1 row in set (0.00 sec)

On the second, update it with an autocommit transaction.

pxc2> update t set ts=now() where id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0
pxc1> select * from t;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Let’s examine what happened here. The local record lock held by the open transaction on pxc1 didn’t play any part in replication or certification (replication happens at commit time, and there was no commit yet). Once the node received the write set from pxc2, that write set conflicted with a transaction still in flight locally. In this case, our transaction on pxc1 has to be rolled back. This is a type of conflict as well, but here the conflict is not caught at certification time. It is called a brute force abort: it happens when a transaction applied by a slave thread conflicts with a transaction that’s in flight on the node. In this case the first commit wins (the already replicated one) and the original transaction is aborted. Jay Janssen discusses multi-node writing conflicts in detail in this post.

The same thing happens when two of the nodes hold record locks via SELECT … FOR UPDATE. Whichever node commits first wins; the other transaction hits the deadlock error and is rolled back. This behavior is correct.

Here is the same SELECT … FOR UPDATE transaction overlapping on the 2 nodes.

pxc1> start transaction;
Query OK, 0 rows affected (0.00 sec)
pxc2> start transaction;
Query OK, 0 rows affected (0.00 sec)

pxc1> select * from t where id=1 for update;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:48 |
+----+---------------------+
1 row in set (0.00 sec)
pxc2> select * from t where id=1 for update;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:48 |
+----+---------------------+
1 row in set (0.00 sec)

pxc1> update t set ts=now() where id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0
pxc2> update t set ts=now() where id=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

pxc1> commit;
Query OK, 0 rows affected (0.00 sec)
pxc2> commit;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Where does this happen in OpenStack?

For example, in OpenStack Nova (the compute project in OpenStack), tracking quota usage uses the SELECT … FOR UPDATE construct.

# User@Host: nova[nova] @  [10.10.10.11]  Id:   147
# Schema: nova  Last_errno: 0  Killed: 0
# Query_time: 0.001712  Lock_time: 0.000000  Rows_sent: 4  Rows_examined: 4  Rows_affected: 0
# Bytes_sent: 1461  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# InnoDB_trx_id: C698
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
#   InnoDB_IO_r_ops: 0  InnoDB_IO_r_bytes: 0  InnoDB_IO_r_wait: 0.000000
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 2
SET timestamp=1409074305;
SELECT quota_usages.created_at AS quota_usages_created_at, quota_usages.updated_at AS quota_usages_updated_at, quota_usages.deleted_at AS quota_usages_deleted_at, quota_usages.deleted AS quota_usages_deleted, quota_usages.id AS quota_usages_id, quota_usages.project_id AS quota_usages_project_id, quota_usages.user_id AS quota_usages_user_id, quota_usages.resource AS quota_usages_resource, quota_usages.in_use AS quota_usages_in_use, quota_usages.reserved AS quota_usages_reserved, quota_usages.until_refresh AS quota_usages_until_refresh
FROM quota_usages
WHERE quota_usages.deleted = 0 AND quota_usages.project_id = '12ce401aa7e14446a9f0c996240fd8cb' FOR UPDATE;

So where does it come from?

These constructs are generated by SQLAlchemy using with_lockmode('update'). Even in Nova's pydoc, it's recommended to avoid with_lockmode('update') whenever possible. Galera replication is not mentioned among the reasons to avoid this construct, but knowing how many OpenStack deployments use Galera for high availability (either Percona XtraDB Cluster, MariaDB Galera Cluster, or Codership's own mysql-wsrep), it can be a very good reason to avoid it. The solution proposed in the linked pydoc above is also a good one: an INSERT INTO … ON DUPLICATE KEY UPDATE is a single atomic write, which will be replicated as expected and will also keep correct track of quota usage.
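
As a rough sketch of that approach (the real quota_usages table has more columns; the simplified column set and the unique key on (project_id, resource) are assumptions for illustration):

-- One atomic, replication-friendly statement instead of
-- SELECT ... FOR UPDATE followed by a separate UPDATE:
INSERT INTO quota_usages (project_id, resource, in_use)
VALUES ('12ce401aa7e14446a9f0c996240fd8cb', 'instances', 1)
ON DUPLICATE KEY UPDATE in_use = in_use + 1;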

The simplest way to overcome this issue from the operator’s point of view is to use only one writer node for these types of transactions. This usually involves a configuration change at the load-balancer level. See this post for possible load-balancer configurations.

The post OpenStack users shed light on Percona XtraDB Cluster deadlock issues appeared first on MySQL Performance Blog.

Sep
05
2014
--

Using MySQL 5.6 Global Transaction IDs (GTIDs) in production: Q&A

Thank you to all of you who attended my webinar last week about Global Transaction IDs (GTIDs), which were introduced in MySQL 5.6 to make the reconfiguration of replication straightforward. If you missed my webinar, you can still listen to the recording and download the slides (free). We had a lot of questions during the webinar, so let me try to answer them here. Please let me know in the comments if additional clarification is needed.

Q: Does GTID provide any benefit to master-master replication? If yes, how?
Q: Is ACTIVE ACTIVE MASTER MASTER successful in MySQL with GTID?

A: GTIDs don’t change the nature of MySQL replication: for instance it is still asynchronous and if you write on both masters in a master-master setup, there is still no write conflict detection mechanism. The main benefit of GTIDs is that any change of the replication topology is easy because you no longer need to run complex calculations to find the right binlog positions when connecting a slave to a new master.
So while master-master replication can be configured with GTIDs, it does not provide many benefits compared to position-based replication, as you will never change the replication topology.
And having a setup where both masters receive writes is still not recommended with GTIDs.

Q: Will GTIDs work well with master:standby master? How quick would the failover be?
A: Yes, GTIDs work well with this kind of setup (which is one master and one slave). However, GTIDs do not provide failover; you will have to use an external tool, and the speed of failover then depends on the tool you use.

Q: For already set up MASTER-MASTER/MASTER-SLAVE Replication, after getting GTID set up, we need to rebuild replication again using AUTO POS=1, correct?
A: Yes, using MASTER_AUTO_POSITION=1 is necessary to indicate that you will use GTID replication. So you will have to run: STOP SLAVE; CHANGE MASTER TO … MASTER_AUTO_POSITION = 1; START SLAVE;
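
Spelled out in full, a sketch where the host and credentials are placeholders:

STOP SLAVE;
CHANGE MASTER TO
    MASTER_HOST = 'master.example.com',  -- placeholder
    MASTER_USER = 'repl',                -- placeholder
    MASTER_PASSWORD = 'repl_password',   -- placeholder
    MASTER_AUTO_POSITION = 1;            -- no binlog file/position needed
START SLAVE;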

Q: If an application has tables in different engines (InnoDB and MyISAM), how will that be handled with GTIDs?
A: Transactions using both MyISAM and InnoDB tables are not allowed; please refer to the documentation.

Q: In a master-slave replication topology (with GTID enabled), how does slave get data from the master if the master’s binary logs are purged given that AUTO_POSITION=1 is used as part of the change master command?
A: This will break replication with error 1236.

Q: What’s the value in SHOW SLAVE STATUS that determines if there is a lag on the slave?
A: This is Seconds_Behind_Master. It’s not always reliable, though. For instance, if you have a replication setup like A -> B -> C, Seconds_Behind_Master on C will show the lag relative to B, not A.

Q: What is the value of saving the history of previous master’s GTIDs executed in the show slave status -> Executed_Gtid_Set?
A: The new replication protocol makes sure that when the slave connects to its master, it sends the range of GTIDs it has already executed. Then the master sends back all other transactions. That’s why Executed_Gtid_Set contains the history of all executed transactions.

Q: We use DB master and slave VIPs on our servers. Can the mysqlfailover tool also switch the VIP to the new master? Is it scriptable in the event of a failover?
A: Yes, you can use extension points to add your own custom scripts with mysqlfailover. See the documentation for --exec-before and --exec-after.

Q: How does mysqlfailover handle brief network instability between the Master and Slaves?
A: mysqlfailover only triggers failover when it suspects the master is no longer alive. So network instability between the master and its slaves won’t affect it for master crash detection. However it can prevent the tool from reconfiguring replication correctly during failover/switchover if one or several slaves are not reachable.

Q: Does Facebook use MySQL with GTID? if yes, which module or all together for everything?
A: I can’t speak for Facebook, but this talk at the MySQL Conference this year suggests that they’re using GTIDs in production. They have added custom code to make GTIDs easier to use.

Q: Is the GTID_SUBSET function part of MySQL Utilities, or should we set up a script to detect it regularly? Is GTID_SUBSET beneficial in the case of an ACTIVE-ACTIVE MASTER-MASTER setup?
Q: Can you please confirm how to get gtid set in order to use gtid functions?

A: GTID_SUBSET() is a built-in function in MySQL 5.6, you don’t need to install MySQL Utilities to use it. It can be used to easily know whether Executed_Gtid_Set on a given server is a subset of Executed_Gtid_Set on another server, so it can be beneficial to use it in any replication topology.
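
For example (the GTID sets below are made up for illustration):

-- Returns 1 because every GTID in the first set (server A)
-- is contained in the second set (server B).
SELECT GTID_SUBSET(
    '4ab24f83-0f4c-11e4-b8d1-0800272864ba:1-100',
    '4ab24f83-0f4c-11e4-b8d1-0800272864ba:1-120'
) AS a_is_subset_of_b;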

Q: What is the difference between HOLES and BUGS?
A: Holes are not allowed in MySQL 5.6 implementation of GTIDs. So if you see a hole in a GTID sequence, you’re hitting a bug!

Q: Using MySQL Utilities, we can set up replication with just one command via a Python script. Does it automatically take a dump from master to slave and start replication? If yes, for a 300 GB data directory, will it run in the background if executed using a shell script? Or does it just start replication from the current position without taking the dump?
A: I think you are talking about mysqlreplicate. This tool only runs CHANGE MASTER TO for you, so it doesn’t take a backup of any kind.

Q: Is it possible to use the mysqlfailover script on any node (like a slave) at any time to know which is its master and which are the other slaves? If not, is this available by some other means?
A: You should probably use mysqlrplshow instead.

Q: Given the limitations of MySQL Utilities around automatic failover, how can I achieve automatic failover if I want this as my primary option?
A: The node running mysqlfailover is not highly available, so if it is down you lose the ability to do automatic failover. There are several options if you want to achieve automatic failover: carefully monitor the monitoring node, or use solutions like Percona Replication Manager (which relies on Pacemaker) or Percona XtraDB Cluster (which relies on Galera replication).

Q: mysqlrpladmin failover/switchover: can we execute this command on a slave or any other monitoring node?
A: Yes, as long as mysqlrpladmin is installed on a server and if it can connect to the database servers, the command can be executed from anywhere.

Q: If we set slaves to read_only, is that recommended to eliminate errant transactions?
A: Yes, it is recommended. However, it doesn’t prevent users with the SUPER privilege from accidentally writing on a slave.
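
For reference, the setting in question (some later server releases also add a super_read_only option to close the SUPER loophole):

SET GLOBAL read_only = ON;  -- rejects writes from users without SUPER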

Q: Can this errant transaction issue be prevented with (active-passive) master-master replication?
A: Master-master replication will make sure that any transaction written on one server automatically ends up on the other server as well, which does help avoid errant transactions. However, writing on both masters is not recommended, as you can have write conflicts.

Q: Is there any practical advantage of using GTIDs with MHA?
A: GTIDs provide no way to perform failover; they only simplify how you reconfigure replication. So using MHA together with GTIDs makes sense.

Q: What is the difference between a GUID and a GTID? When do we use a GUID vs. a GTID?
A: I’m not sure I correctly understand the question. A GTID is made of a source id and a transaction id. The source id is the master’s server_uuid, which is a GUID that is automatically generated when MySQL is started for the first time.
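
You can see the relationship on a running server (the UUID shown is illustrative):

SELECT @@GLOBAL.server_uuid;
-- e.g. 4ab24f83-0f4c-11e4-b8d1-0800272864ba  (a GUID)
-- A GTID is then <server_uuid>:<transaction id>, for example:
-- 4ab24f83-0f4c-11e4-b8d1-0800272864ba:42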

Q: How do we check the slave database tables and records? Do we have to sync data from master to slave manually, or will it happen automatically?
A: You can use pt-table-checksum and pt-table-sync from Percona Toolkit.

Thanks again for attending the webinar! You can replay it, download the slides – and also access Percona’s vast library of other MySQL webinar recordings here.

The post Using MySQL 5.6 Global Transaction IDs (GTIDs) in production: Q&A appeared first on MySQL Performance Blog.

Sep
03
2014
--

Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar

Join me online next week (September 10 at 10 a.m. PDT) for my live webinar on Migrating to Percona XtraDB Cluster. This was a popular webinar that I gave a few years ago, so I’m doing it again with updates for Percona XtraDB Cluster 5.6 (PXC) and all the latest in the Galera world.

This webinar will be really good for people interested in getting an overview of what PXC/Galera is, what it would take to adopt it for your application, and some of the differences and challenges it brings compared with a conventional MySQL master/slave setup. I’d highly suggest attending if you are considering Galera in your environment and want to get a better understanding of its uses and antipatterns.

Additionally, I’ll cover such questions as:

  • What are the requirements for running Percona XtraDB Cluster?
  • Will I have to reload all my tables?
  • How does configuration for the cluster differ from configuring a stand-alone InnoDB server?
  • How should my application interact with the Cluster?
  • Can I use Percona XtraDB Cluster if I only have two MySQL servers currently?
  • How can I move to the Cluster and keep downtime to a minimum?
  • How can I migrate to Percona XtraDB Cluster gradually?

I hope to see you next Wednesday. And please feel free to ask questions in advance in the comments section below. Next week’s live event, like all of our MySQL webinars, is free. Register here!

The post Migrating to Percona XtraDB Cluster 2014 edition: Sept. 10 MySQL webinar appeared first on MySQL Performance Blog.

Sep
02
2014
--

Using sysbench 0.5 for performing MySQL benchmarks

Given the recent excitement and interest around OpenStack, I wanted to make sure I was ready to conduct appropriate evaluations of system performance. I generally turn to sysbench since it comes with a variety of different tests (accessed via the --test= option interface), including:

  • fileio – File I/O test
  • cpu – CPU performance test
  • memory – Memory functions speed test
  • threads – Threads subsystem performance test
  • mutex – Mutex performance test

As you can see, sysbench lets you stress many of the fundamental components of your hardware and infrastructure, such as your disk subsystem, along with your CPUs and memory. An additional option exists that is designed to perform synthetic stress testing of MySQL, and I was surprised when I didn’t see it in the above list in version 0.5, as it used to show up as “oltp – OLTP test”. What happened to --test=oltp?

This list is from the latest release of sysbench, which is 0.5 — you’re only going to be on this version if you build it yourself or if you use the package provided courtesy of Frederic Descamps (thanks lefred!).  If you’re using the version from EPEL, Ubuntu 14.04, or Debian 7, you’re still using version 0.4.12 (check with sysbench --version).  One thing you’ll notice is that the OLTP test type doesn’t show up anymore.  What gives?  I was scratching my head until I asked on Percona IRC and found out that in 0.5 the standard OLTP test type was replaced with a different syntax: instead of passing parameters to sysbench, you reference scripts written in Lua.  The advantage here is that you now have an interface for writing your own specific load tests (provided you know Lua, but it isn’t hard).  For those of you looking to run the pre-canned load tests, they still exist, but you have to have them as part of the RPM install or otherwise copied to your system.

Fortunately, if you use the package provided by lefred, you’ll find these Lua scripts here (this is using an Amazon AMI as of August 4, 2014):

[root@pxc-control ~]# ls -l /usr/share/doc/sysbench/tests/db/
total 44
-rw-r--r-- 1 root root 3585 Sep 7 2012 common.lua
-rw-r--r-- 1 root root 340 Sep 7 2012 delete.lua
-rw-r--r-- 1 root root 830 Sep 7 2012 insert.lua
-rw-r--r-- 1 root root 2925 Sep 7 2012 oltp.lua
-rw-r--r-- 1 root root 342 Sep 7 2012 oltp_simple.lua
-rw-r--r-- 1 root root 425 Sep 7 2012 parallel_prepare.lua
-rw-r--r-- 1 root root 343 Sep 7 2012 select.lua
-rw-r--r-- 1 root root 3964 Sep 7 2012 select_random_points.lua
-rw-r--r-- 1 root root 4066 Sep 7 2012 select_random_ranges.lua
-rw-r--r-- 1 root root 343 Sep 7 2012 update_index.lua
-rw-r--r-- 1 root root 552 Sep 7 2012 update_non_index.lua

So the trick (if you want to call it that) is that instead of passing a single word to the --test directive, you pass the full path to the Lua script.

This is the old way (sysbench 0.4.12 from EPEL repo):

--test=oltp --oltp-test-mode=complex

This is the new way (sysbench 0.5):

--test=/usr/share/doc/sysbench/tests/db/insert.lua

Here is an example of a test I’m running through haproxy against a 3-node PXC cluster using the INSERT-only test type, so you can see the full syntax I pass to sysbench:

[root@pxc-control ~]# cat sys_haproxy.sh
#!/bin/bash
sysbench \
--test=/usr/share/doc/sysbench/tests/db/insert.lua \
--mysql-host=pxc-control \
--mysql-port=9999 \
--mysql-user=sysbench-haproxy \
--mysql-password=sysbench-haproxy \
--mysql-db=sbtest \
--mysql-table-type=innodb \
--oltp-test-mode=complex \
--oltp-read-only=off \
--oltp-reconnect=on \
--oltp-table-size=1000000 \
--max-requests=100000000 \
--num-threads=3 \
--report-interval=1 \
--report-checkpoints=10 \
--tx-rate=24 \
$1

And here’s what the insert.lua script looks like:

[root@pxc-control ~]# cat /usr/share/doc/sysbench/tests/db/insert.lua
pathtest = string.match(test, "(.*/)") or ""
dofile(pathtest .. "common.lua")
function thread_init(thread_id)
   set_vars()
end
function event(thread_id)
   local table_name
   local i
   local c_val
   local k_val
   local pad_val
   table_name = "sbtest".. sb_rand_uniform(1, oltp_tables_count)
   if (oltp_auto_inc) then
      i = 0
   else
      i = sb_rand_uniq(1, oltp_table_size)
   end
   k_val = sb_rand(1, oltp_table_size)
   c_val = sb_rand_str([[
###########-###########-###########-###########-###########-###########-###########-###########-###########-###########]])
   pad_val = sb_rand_str([[
###########-###########-###########-###########-###########]])
   rs = db_query("INSERT INTO " .. table_name ..  " (id, k, c, pad) VALUES " .. string.format("(%d, %d, '%s', '%s')",i, k_val, c_val, pad_val))
end
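
For reference, these scripts write to the standard sbtest tables created by the prepare step; the layout is approximately the following (a sketch based on sysbench defaults):

CREATE TABLE sbtest1 (
  id  INT UNSIGNED NOT NULL AUTO_INCREMENT,
  k   INT UNSIGNED NOT NULL DEFAULT 0,
  c   CHAR(120) NOT NULL DEFAULT '',
  pad CHAR(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k_1 (k)
) ENGINE=InnoDB;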

The thing that I like most about sysbench 0.5 (beyond the Lua interface, of course!) is that it now comes with a --report-interval option (which I generally set to 1) so that you get output while the script is running. No more waiting until the end of the test to get feedback! Here’s a sample of sysbench 0.5 in action running the INSERT test through a local haproxy instance and writing to three nodes in a PXC cluster, much as OpenStack Trove might do:

[root@pxc-control ~]# ./sys_haproxy.sh run
sysbench 0.5:  multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 3
Report intermediate results every 1 second(s)
Random number generator seed is 0 and will be ignored
Threads started!
[   1s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 1099.28, response time: 9.86ms (95%)
[   2s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 973.02, response time: 10.77ms (95%)
[   3s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 1181.01, response time: 6.23ms (95%)
[   4s] threads: 3, tps: 0.00, reads/s: 0.00, writes/s: 1103.00, response time: 6.77ms (95%)

I would also like to call your attention to a blog post by Nilnandan Joshi from Percona’s Support team where he describes a method to build sysbench 0.5 on Debian 7.  Thanks, Nil, for pointing this out!

I hope that helps others out there who upgrade to sysbench 0.5 and then have questions about where --test=oltp went. I’d love to hear about your own sysbench use cases, and whether anyone else is publishing Lua scripts for their own load testing!

The post Using sysbench 0.5 for performing MySQL benchmarks appeared first on MySQL Performance Blog.

Aug
20
2014
--

How to use MySQL Global Transaction IDs (GTIDs) in production

Reconfiguring replication has always been a challenge with MySQL. Each time the replication topology has to be changed, the process is tedious and error-prone because finding the correct binlog position is not straightforward at all. Global Transaction IDs (GTIDs) introduced in MySQL 5.6 aim at solving this annoying issue.

The idea is quite simple: each transaction is associated with a unique identifier shared by all servers in a given replication topology. Now reconfiguring replication is easy as the correct binlog position can be automatically calculated by the server.

Awesome? Yes it is! However GTIDs are also changing a lot of things in how we can perform operations on replication. For instance, skipping transactions is a bit more difficult. Or you can get bitten by errant transactions, a concept that did not exist before.

This is why I will be presenting a webinar on Aug. 27 at 10 a.m. PDT: Using MySQL Global Transaction IDs in Production.

You will learn what you need to operate a replication cluster using GTIDs: how to monitor replication status, how to recover from replication errors, which tools can help you (and which you should avoid), and the main issues that can occur with GTIDs.

This webinar is free but you can register today to reserve your seat. And a recording will be available afterwards. See you next week!

The post How to use MySQL Global Transaction IDs (GTIDs) in production appeared first on MySQL Performance Blog.

Aug
19
2014
--

Measuring failover time for ScaleArc load balancer

ScaleArc hired Percona to benchmark failover times for the ScaleArc database traffic management software in different scenarios. We tested failover times for various clustered setups, where ScaleArc itself was the load balancer for the cluster. These tests complement other performance tests on the ScaleArc software – sysbench testing for latency and testing for WordPress acceleration.

We tested failover times for Percona XtraDB Cluster (PXC) and MHA (any traditional MySQL replication-based solution works pretty much the same way).

In each case, we tested failover with a rate limited sysbench benchmark. In this mode, sysbench generates roughly 10 transactions each second, even if not all 10 were completed in the previous second. In that case, the newly generated transactions are queued.

The sysbench command we used for testing is the following.

# while true ; do sysbench --test=sysbench/tests/db/oltp.lua \
                           --mysql-host=172.31.10.231 \
                           --mysql-user=root \
                           --mysql-password=admin \
                           --mysql-db=sbtest \
                           --oltp-table-size=10000 \
                           --oltp-tables-count=4 \
                           --num-threads=4 \
                           --max-time=0 \
                           --max-requests=0 \
                           --report-interval=1 \
                           --tx-rate=10 \
                           run ; sleep 1; done

The command is run in a loop because, typically at the time of failover, the application receives some kind of error while the virtual IP is moved or while the current node is declared dead. Well-behaved applications reconnect and retry in this case. Sysbench is not a well-behaved application from this perspective – after failover it has to be restarted.

This is good for testing the duration of errors during a failover procedure – but not the number of errors. In a simple failover scenario (ScaleArc is just used as a regular load balancer), the number of errors won’t be any higher than usual with ScaleArc’s queueing mechanism.

ScaleArc+MHA

In this test, we used MHA as a replication manager. The test result would be similar regardless of how its asynchronous replication is managed – only the ScaleArc level checks would be different. In the case of MHA, we tested graceful and non-graceful failover. In the graceful case, we stopped the manager and performed a manual master switchover, after which we informed the ScaleArc software via an API call for failover.

We ran two tests:

  • A manual switchover with the manager stopped, switching over manually, and informing the load balancer about the change.
  • An automatic failover where the master was killed, and we let MHA and the ScaleArc software discover it.

ScaleArc+MHA manual switchover

The manual switchover was performed with the following command.

# time (
          masterha_master_switch --conf=/etc/cluster_5.cnf \
          --master_state=alive \
          --new_master_host=10.11.0.107 \
          --orig_master_is_new_slave \
          --interactive=0 \
          && \
          curl -k -X PUT https://172.31.10.30/api/cluster/5/manual_failover \
          -d '{"apikey": "0c8aa9384ddf2a5756299a9e7650742a87bbb550"}' )
{"success":true,"message":"4214   Failover status updated successfully.","timestamp":1404465709,"data":{"apikey":"0c8aa9384ddf2a5756299a9e7650742a87bbb550"}}
real    0m8.698s
user    0m0.302s
sys     0m0.220s

The curl command calls ScaleArc’s API to inform the ScaleArc software about the master switchover.

During this time, sysbench output was the following.

[  21s] threads: 4, tps: 13.00, reads/s: 139.00, writes/s: 36.00, response time: 304.07ms (95%)
[  21s] queue length: 0, concurrency: 1
[  22s] threads: 4, tps: 1.00, reads/s: 57.00, writes/s: 20.00, response time: 570.13ms (95%)
[  22s] queue length: 8, concurrency: 4
[  23s] threads: 4, tps: 19.00, reads/s: 237.99, writes/s: 68.00, response time: 976.61ms (95%)
[  23s] queue length: 0, concurrency: 2
[  24s] threads: 4, tps: 9.00, reads/s: 140.00, writes/s: 40.00, response time: 477.55ms (95%)
[  24s] queue length: 0, concurrency: 3
[  25s] threads: 4, tps: 10.00, reads/s: 105.01, writes/s: 28.00, response time: 586.23ms (95%)
[  25s] queue length: 0, concurrency: 1

Only a slight hiccup is visible at 22 seconds. In that second, only 1 transaction completed and 8 others were queued. These results show a sub-second failover time. The reason no errors were received is that ScaleArc itself queued the transactions during the failover process. If the transaction in question had been issued by an interactive client, the queuing itself would be visible as increased response time – for example, a START TRANSACTION or an INSERT command taking longer than usual – but no errors would result. This is as good as it gets for graceful failover: ScaleArc knows about the failover (and in the case of a switchover initiated by a DBA, notifying the ScaleArc software can be part of the failover process). The queueing mechanism is quite configurable: administrators can set the timeout for the queue. We set it to 60 seconds, so if the failover doesn’t complete in that timeframe, transactions start to fail.

ScaleArc+MHA non-graceful failover

In the case of the non-graceful failover MHA and the ScaleArc software have to figure out that the node died.

[  14s] threads: 4, tps: 11.00, reads/s: 154.00, writes/s: 44.00, response time: 1210.04ms (95%)
[  14s] queue length: 4, concurrency: 4
( sysbench restarted )
[   1s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   1s] queue length: 13, concurrency: 4
[   2s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   2s] queue length: 23, concurrency: 4
[   3s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   3s] queue length: 38, concurrency: 4
[   4s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   4s] queue length: 46, concurrency: 4
[   5s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   5s] queue length: 59, concurrency: 4
[   6s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   6s] queue length: 69, concurrency: 4
[   7s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   7s] queue length: 82, concurrency: 4
[   8s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   8s] queue length: 92, concurrency: 4
[   9s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[   9s] queue length: 99, concurrency: 4
[  10s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  10s] queue length: 108, concurrency: 4
[  11s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  11s] queue length: 116, concurrency: 4
[  12s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  12s] queue length: 126, concurrency: 4
[  13s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  13s] queue length: 134, concurrency: 4
[  14s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  14s] queue length: 144, concurrency: 4
[  15s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  15s] queue length: 153, concurrency: 4
[  16s] threads: 4, tps: 0.00, reads/s: 0.00, writes/s: 0.00, response time: 0.00ms (95%)
[  16s] queue length: 167, concurrency: 4
[  17s] threads: 4, tps: 5.00, reads/s: 123.00, writes/s: 32.00, response time: 16807.85ms (95%)
[  17s] queue length: 170, concurrency: 4
[  18s] threads: 4, tps: 18.00, reads/s: 255.00, writes/s: 76.00, response time: 16888.55ms (95%)
[  18s] queue length: 161, concurrency: 4

The failover time in this case was 18 seconds. We looked at the results and found that a check the ScaleArc software does (which involves opening an SSH connection to all nodes in MHA) takes 5 seconds, and ScaleArc declares the node dead after 3 consecutive failed checks (this parameter is configurable). Hence the high failover time. A much lower one can be achieved with more frequent checks and a different check method – for example, checking the read-only flag, or making MHA store its state in the database.

ScaleArc+Percona XtraDB Cluster tests

Percona XtraDB Cluster is special when it comes to high availability testing because of the many ways one can use it. Many people write only to one node to avoid rollback on conflicts, but we have also seen lots of folks using all the nodes for writes.

Also, graceful failover can be interpreted differently.

  • Failover can be graceful from both Percona XtraDB Cluster’s and from the ScaleArc software’s perspective. First the traffic is switched to another node, or removed from the node in question, and then MySQL is stopped.
  • Failover can be graceful from Percona XtraDB Cluster’s perspective, but not from the ScaleArc software’s perspective. In this case the database is simply stopped with service mysql stop, and the load balancer figures out what happened. I will refer to this approach as semi-graceful from now on.
  • Failover can be completely non-graceful if a node is killed, where neither Percona XtraDB Cluster, nor ScaleArc knows about its departure.

We did all the tests using one node at a time and using all three nodes. What makes these test sets even more complex is that when only one node is used at a time, some tests (the semi-graceful and the non-graceful ones) don’t have the same result depending on whether the removed node is the one in use or an unused one. This involves a lot of tests, so for the sake of brevity I omit the actual sysbench output here – it looks like either the graceful or the non-graceful MHA case – and only present the results in tabular format. In the case of active/active setups, to remove a node gracefully we first have to lower the maximum number of connections on that node to 0.

Failover type   1 node (active)          1 node (passive)        all nodes
Graceful        sub-second (no errors)   no effect at all        sub-second (no errors)
Semi-graceful   4 seconds (errors)       no effect at all        3 seconds
Non-graceful    4 seconds (errors)       6 seconds (no errors)   7 seconds (errors)

In the previous table, active means that the failed node did receive sysbench transactions and passive means that it didn’t.

All the graceful cases are similar to MHA’s graceful case.

If only one node is used and a passive node is removed from the cluster, by stopping the database itself gracefully with the

# service mysql stop

command, it doesn’t have an effect on the cluster. For the subsequent graceful-failover cases, switching over in ScaleArc will enable queuing, similar to MHA’s case. In the semi-graceful case, if the passive node (which has no traffic) departs, it has no effect. Otherwise, the application will get errors (because of the unexpected mysql stop), and the failover time is around 3-4 seconds both when only one node is active and when all three are active. This makes sense, because ScaleArc was configured to run checks (using clustercheck) every second and to declare a node dead after three consecutive failed checks. Once the ScaleArc software determined that it should fail over, and did so, the situation is practically the same as the passive node’s removal (because ScaleArc removes traffic, in practice making that node passive).

The non-graceful case is tied to the suspect timeout. Here, XtraDB Cluster doesn’t know that the node has left the cluster, and the originating nodes keep trying to send write sets to it. Those write sets will never be certified because the node is gone, so the writes stall. The exception is the case when the active node failed: after ScaleArc figures out that the node died (three consecutive checks at one-second intervals), a new node is chosen, and because only the failed node was doing transactions, no write sets are waiting for certification in the remaining two nodes’ queues, so there is no need to wait for the suspect timeout.

Conclusion

ScaleArc does have reasonably good failover time, especially in the case when it doesn’t have to interrupt transactions. Its promise of zero-downtime maintenance is fulfilled both in the case of MHA and XtraDB Cluster.

The post Measuring failover time for ScaleArc load balancer appeared first on MySQL Performance Blog.

Jul
31
2014
--

Paris OpenStack Summit Voting – Percona Submits 16 MySQL Talks

MySQL plays a critical role in OpenStack. It serves as the host database supporting most components, such as Nova, Glance, and Keystone, and is the most mature guest database in Trove. Many OpenStack operators use Percona open source software, including the MySQL drop-in compatible Percona Server and Galera-based Percona XtraDB Cluster, as well as tools such as Percona XtraBackup and Percona Toolkit. We see a need in the community to understand how to improve MySQL performance in OpenStack. As a result, Percona submitted 16 presentations for the Paris OpenStack Summit.

Paris OpenStack Summit presentations are chosen by OpenStack member voting. Please vote for our talks by clicking the titles below that interest you. You must be an OpenStack Foundation member to vote. If you aren’t a member, sign up here – it’s free and only takes a minute. The deadline to vote is Wednesday, August 6, 2014!

Paris OpenStack Summit MySQL Talks Submitted by Percona

OpenStack Operations

MySQL Database Operations in the OpenStack World
Speaker: Stéphane Combaudon

MySQL High Availability Options for Openstack
Speakers: Stéphane Combaudon

Host and Guest Database Backup and Recovery for OpenStack Ops
Speakers: George Lorch, David Busby

Benchmarking the Different Cinder Storage Backends
Speaker: Peter Boros

MySQL and OpenStack Deep Dive
Speakers: Peter Boros, Jay Pipes (Mirantis)

Trove Performance Tuning for MySQL
Speaker: Alexander Rubin

Schema Management: Versioning and Automation with Puppet and MySQL Utilities
Speaker: Frederic Descamps

Deploying Databases for OpenStack
Speakers: Matt Griffin, Jay Pipes (Mirantis), Amrith Kumar (Tesora), Vinay Joosery (Severalnines)

Related Open Source Software Projects

Introduction to Percona XtraDB Cluster
Speaker: Kenny Gryp

Percona Server Features for OpenStack and Trove Ops
Speakers: George Lorch, Vipul Sabhaya (HP Cloud)

Products, Tools & Services

ClusterControl: Efficient and reliable MySQL Management, Monitoring, and Troubleshooting for OpenStack HA
Speakers: Peter Boros, Vinay Joosery (Severalnines)

Advanced MySQL Performance Monitoring for OpenStack Ops
Speaker: Daniel Nichter

Targeting Apps for OpenStack Clouds

Oars in the Cloud: Virtualization-aware Galera instances
Speaker: Raghavendra Prabhu

ACIDic Clusters: Review of contemporary ACID-compliant databases with synchronous replication
Speaker: Raghavendra Prabhu

Cloud Security

Security: It’s more than just your database you should worry about
Speaker: David Busby

Planning Your OpenStack Project

Infrastructure at Scale
Speaker: Michael Coburn

The Paris OpenStack Summit will offer developers, operators, and service providers valuable insights into OpenStack. The Design Summit sessions will be filled with lively discussions driving OpenStack development, including sessions defining the future of Trove, the DBaaS (database as a service) component near and dear to Percona’s heart. There will also be many valuable presentations in the main Paris OpenStack Summit conference about operating OpenStack, utilizing the latest features, complementary software and services, and real-world case studies.

Thank you for your support. We’re looking forward to seeing many Percona software users at the Paris OpenStack Summit in November.

The post Paris OpenStack Summit Voting – Percona Submits 16 MySQL Talks appeared first on MySQL Performance Blog.

Jul
28
2014
--

What I learned while migrating a customer MySQL installation to Amazon RDS

Hi, I recently had the experience of assisting with a migration of a customer MySQL installation to Amazon RDS (Relational Database Service). Amazon RDS is a great platform for hosting your MySQL installation and offers the following list of pros and cons:

  • You can scale your CPU, IOPS, and storage space separately by using Amazon RDS. Otherwise you need to take downtime and upgrade physical components of a rack-mounted server.
  • Backups, software version patching, failure detection, and (some) recovery is automated with Amazon RDS.
  • You lose shell access to your DB instance
  • You lose SUPER privilege for regular users. Many SUPER-type statements and commands are provided for as a Stored Procedure.
  • It is easy to set up multiple read replicas (slaves in READ_ONLY=1 mode).
  • You can set up a secondary synchronous instance for failover in the event your primary instance fails.

While this article is written to be Amazon RDS-specific it also has implications for any sort of migration.

  1. The only way to interface with RDS is through the mysql client, which means loading data must be done using SQL. This means you need to use mysqldump or mydumper, which can be a large endeavour should your dataset be north of 500GB — this is a lot of single-threaded activity!  Think about not only how long dumping and loading will take, but also factor in how much time it will take for replication to catch up on the hours/days/weeks your dumping and loading procedure took.  You might need to allocate more disk space and Provisioned IOPS to your RDS node in order to improve disk throughput, along with a change to innodb_flush_log_at_trx_commit and sync_binlog.
  2. RDS is set to UTC (system_time_zone=UTC) and this cannot be changed: in Parameter Groups you will see that default_time_zone is set as Modifiable=false. This can bite you if you are planning to use RDS as a slave for a short while before failing the application over to Amazon RDS.  If you have configured binlog_format=STATEMENT on your master and you have TIMESTAMP columns, this will lead to differences in the RDS data set between absolute values ('2014-07-24 10:15:00') and NOW(). It is also a concern for developers who may not be explicitly declaring an appropriate time zone on their MySQL connections. Often the best piece of advice is to leave all database data in UTC no matter where the server is physically located, and deal with localization at the presentation layer.
  3. Amazon RDS by default has max_allowed_packet=1MB. This is pretty low, as most other configurations use 16MB, so if you’re using extended-insert (by default, you are), the size of each INSERT statement can approach 16MB and thus lead to “packet too big” errors on the Amazon RDS side, failing the import.
  4. Amazon RDS does not support the SUPER privilege for regular users. For example, this becomes quite a challenge as many tools (Percona Toolkit) are authored to assume you have SUPER-level access on all nodes — simple tasks become vastly more complicated as you need to think of clever workarounds (I’m looking at you pt-table-sync!).
  5. Triggers and views thus cannot be applied using the default mysqldump syntax which includes SQL DEFINER entries — these lines are there so that a user with SUPER can “grant” another user ability to execute the trigger/view. Your load will fail if you forget this.
  6. Consider running your load with --force on the mysql client, and log stderr/stdout to disk so you can review errors later. It is painful to spend 4 days loading a 500GB database only to have it fail partway through because you forgot about the SQL DEFINER issue.
  7. Consider splitting the mysqldump into two phases: --no-data so you dump the schema only, and then a data-only pass (--no-create-info) so you get just the rows. This way you can isolate faults and solve them along the way.
  8. Skipping replication events is SLOW. You don’t have the ability to run sql_slave_skip_counter (since this requires SUPER); instead you need to use the Amazon RDS function mysql.rds_skip_repl_error (see the sketch after this list). Sadly, this stored procedure takes no argument and thus only skips one event at a time. It takes about 2-3 seconds per execution, so if you have a lot of events to skip, that’s a problem. Having to skip ANYTHING is an indication that something went wrong in the process, so if you find yourself in the unenviable position of skipping events, know that pt-table-checksum should be able to give you an idea of how widespread the data divergence issue is.
  9. pt-table-sync doesn’t work against Amazon RDS as it is written to expect SUPER because it wants to do binlog_format=STATEMENT in session, but that’s not allowed. Kenny Gryp hacked me a version to just skip this check, and Kenny also reported it for inclusion in a future release of Percona Toolkit, but in the meantime you need to work around the lack of SUPER privilege.
  10. pt-table-sync is SLOW against RDS. As pt-table-sync doesn’t log a lot of detail about where time is spent, I haven’t completely isolated the source of the latency, but I suspect this is more about network round trip than anything else.
  11. innodb_log_file_size is hardcoded to 128MB in Amazon RDS, you can’t change this.  innodb_log_files_in_group is not even showing up in Parameter Groups view but SHOW GLOBAL VARIABLES reports as 2. So you’re cookin’ on 256MB, if your writes are heavy this may become a bottleneck with little workaround available in MySQL.
  12. CHANGE MASTER isn’t available in RDS. You define RDS as a slave by calling a stored procedure where you pass the appropriate options, such as CALL mysql.rds_set_external_master (see the sketch after this list).
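
As a hedged sketch of what those stored-procedure replacements look like (the argument lists and values here are illustrative; check the Amazon RDS documentation for the exact signatures):

-- Point the RDS instance at an external master (replaces CHANGE MASTER TO):
CALL mysql.rds_set_external_master(
    'master.example.com', 3306,   -- placeholder host and port
    'repl', 'repl_password',      -- placeholder replication credentials
    'mysql-bin.000042', 107,      -- binlog file and position to start from
    0);                           -- 0 = no SSL
CALL mysql.rds_start_replication;

-- Skip a single replication error (replaces sql_slave_skip_counter);
-- it takes no argument and skips exactly one event per call:
CALL mysql.rds_skip_repl_error;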

For those of you wondering about the SUPER privilege: I was fortunate that Bill Karwin from Percona’s Support team took the time to review my post and suggested I dig into this deeper. It turns out that Amazon didn’t hack MySQL to remove the SUPER privilege; instead, the stored procedures run with a security_type of DEFINER:

mysql> select db,name,type,language,security_type,definer from proc where name = 'rds_external_master'\G
*************************** 1. row ***************************
 db: mysql
 name: rds_external_master
 type: PROCEDURE
 language: SQL
security_type: DEFINER
 definer: rdsadmin@localhost
1 row in set (0.08 sec)

mysql> show grants for 'rdsadmin'@'localhost';
+------------------------------------------------------------------------------------------------------+
| Grants for rdsadmin@localhost                                                                        |
+------------------------------------------------------------------------------------------------------+
| GRANT ALL PRIVILEGES ON *.* TO 'rdsadmin'@'localhost' IDENTIFIED BY PASSWORD 'XXX' WITH GRANT OPTION |
+------------------------------------------------------------------------------------------------------+
1 row in set (0.07 sec)

So for those of you working with Amazon RDS, I hope this list saves you some time and helps with your migration!  If you get stuck you can always contact Percona Consulting for assistance.

The post What I learned while migrating a customer MySQL installation to Amazon RDS appeared first on MySQL Performance Blog.

Apr
23
2014
--

Encrypted and incremental MySQL backups with Percona XtraBackup

We’ve recently received a number of questions on how to implement incremental MySQL backups alongside encryption with Percona XtraBackup. Some users thought it was not possible at first, because with the default --encrypt options, XtraBackup encrypts all files. But that is not the case, and this is where the --extra-lsn-dir option becomes useful: it allows you to save the LSN (Log Sequence Number) information to another directory and exclude it from encryption, so the same information needed by incremental backups remains readable. Enough talk, let me show you.

Because you will usually want to script your backup and restore procedure, I’ll use variables here as well to make things more familiar. First, I create three folders where my backups will be stored: ‘full’ for full backups, ‘incr’ for incremental backups, and ‘lsns’ to store an extra copy of my xtrabackup_checkpoints file via --extra-lsn-dir.

mkdir -p /ssd/msb/msb_5_5_360/bkp/full
mkdir -p /ssd/msb/msb_5_5_360/bkp/incr
mkdir -p /ssd/msb/msb_5_5_360/bkp/lsns

Second, to have better control over where my backups go, I prefer assigning timestamped folders instead and use the --no-timestamp option to innobackupex.

CURDATE=$(date +%Y-%m-%d_%H_%M_%S)

Then manually create the specific directory where the backup’s xtrabackup_checkpoints file would be saved:

mkdir -p /ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE

Of course, I need an encryption key for my encrypted backups. In this case, taking the example from the manual, I used openssl to generate a random key. You can use your own key string as long as its size conforms to the size required by the --encrypt algorithm you chose.

echo -n $(
   openssl enc -aes-256-cbc -pass pass:Password -P -md sha1 \
   | grep iv | cut -d'=' -f2
) > /ssd/msb/msb_5_5_360/bkp/backups.key

Next, I would run my full backup:

innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --extra-lsndir=/ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE \
   --encrypt=AES256 --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   --no-timestamp /ssd/msb/msb_5_5_360/bkp/full/$CURDATE

The output says my full backup is saved to:

innobackupex: Backup created in directory '/ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46'
140423 01:20:55  innobackupex: Connection to database server closed
140423 01:20:55  innobackupex: completed OK!

Now here’s the trick: because the full backup is encrypted, we will use the xtrabackup_checkpoints file that xtrabackup saved separately to the --extra-lsn-dir path we specified above to get the LSN and use it for our next incremental backup.

LAST_LSN=$(
   cat /ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE/xtrabackup_checkpoints \
   | grep to_lsn | cut -d'=' -f2
)
CURDATE=$(date +%Y-%m-%d_%H_%M_%S)
mkdir /ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE

Above, we get the LSN value and assign it to a variable. Similarly, we created a new CURDATE string for our incremental backup to use and created a new directory for the xtrabackup_checkpoints file. If you plan to create another incremental backup based off of what we are about to take now, you will use this next xtrabackup_checkpoints file to get LAST_LSN.

With the upcoming Percona XtraBackup 2.2.1, you will no longer need --extra-lsn-dir nor have to parse the xtrabackup_checkpoints file for this purpose: a new feature will allow the user to save backup metadata to an InnoDB table.

So, now that we have our $LAST_LSN value, we execute our incremental backup with the command:

innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --extra-lsndir=/ssd/msb/msb_5_5_360/bkp/lsns/$CURDATE \
   --encrypt=AES256 --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   --no-timestamp --incremental --incremental-lsn $LAST_LSN \
   /ssd/msb/msb_5_5_360/bkp/incr/$CURDATE

Again, based on the output, my backup was created at:

innobackupex: Backup created in directory '/ssd/msb/msb_5_5_360/bkp/incr/2014-04-23_01_21_00'
140423 01:21:47  innobackupex: Connection to database server closed
140423 01:21:47  innobackupex: completed OK!

Now we have a full backup and an incremental backup. Of course, to make sure our backups are usable, we’d like to validate them. To do that, our first step is to decrypt both the full and the incremental backups. innobackupex has another handy --decrypt option for that; you can even use --parallel to make it faster.

innobackupex --decrypt=AES256 \
   --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46
innobackupex --decrypt=AES256 \
   --encrypt-key-file=/ssd/msb/msb_5_5_360/bkp/backups.key \
   /ssd/msb/msb_5_5_360/bkp/incr/2014-04-23_01_21_00

Once the backups are decrypted, we can go through the usual process of preparing a full backup and incremental backups as described in the manual.

innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --apply-log --redo-only /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46
innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --apply-log --redo-only /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46 \
   --incremental-dir=/ssd/msb/msb_5_5_360/bkp/incr/2014-04-23_01_21_00
innobackupex --defaults-file=/ssd/msb/msb_5_5_360/my.sandbox.cnf \
   --apply-log /ssd/msb/msb_5_5_360/bkp/full/2014-04-23_01_20_46

The post Encrypted and incremental MySQL backups with Percona XtraBackup appeared first on MySQL Performance Blog.

Mar
31
2014
--

ScaleArc: Benchmarking with sysbench

ScaleArc recently hired Percona to perform various tests on its database traffic management product. This post is the outcome of the benchmarks carried out by Uday Sawant (ScaleArc) and myself. You can also download the report directly as a PDF here.

The goal of these benchmarks is to identify the potential overhead of the ScaleArc software itself and the potential benefits of caching. The benchmarks were carried out with the trunk version of sysbench. For this reason, we used a very small set of data, so the measurements would be fast, and it’s known that caching has huge benefits when the queries themselves are rather expensive. We decided that we would rather show that benefit with a real-world application, which is coming later in this series. And if you’re in the Silicon Valley area, be sure to join us this evening at the first-ever Open Source Appreciation Day – I’d be happy to discuss the findings presented here in this post. Admission is free, but due to limited space you should register now. I’ll also be available throughout the Percona Live MySQL Conference and Expo all this week.

[Summary graph: read-only throughput]

In this summary graph it’s visible that in terms of throughput (read-only benchmark, which is relevant for read mostly applications), ScaleArc doesn’t have any significant overhead, while caching can have potentially huge benefits.

[Summary graph: response times]

The situation is pretty similar with response times. ScaleArc doesn’t add any significant overhead, and caching can mean huge benefit in terms of response time as well.

In case of this particular workload (read-only sysbench), using caching means a roughly 3x increase in throughput and a roughly 80% drop in response time.

Overall, ScaleArc is a good product in terms of performance and features as well. I would definitely recommend it.

About ScaleArc for MySQL
ScaleArc for MySQL is a software appliance that drops in transparently between applications and databases to improve application availability and performance. It requires no changes to applications or databases and delivers:

  • Instant scale up – transparent connection pooling and multiplexing, TTL-based transparent caching, surge protection
  • Transparent scale out – read/write split, load balancing, query routing, sharding
  • Automatic high availability – automatic failover
  • Real-time actionable analytics

Benchmarking setup
The client machines run the benchmarking software, sysbench in the case of these benchmarks.

CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

We used 2 clients. The results of the 2 clients are graphed separately, so it’s visible that they put the same amount of workload on the database or ScaleArc software.

Database machines
CPU: 2 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

Running MySQL Community Edition 5.6.15

MySQL configuration

[mysqld]
   max_allowed_packet = 64M
   thread_cache = 256
   query_cache_size = 0
   query_cache_type = 0
   max_connections = 20020
   max_user_connections = 20000
   max_connect_errors = 99999999
   wait_timeout = 28800
   interactive_timeout = 28800
   log-error=/var/lib/mysql/mysql.err
   back_log=60000
   innodb_buffer_pool_size = 3G
   innodb_additional_mem_pool_size = 16M
   innodb_log_buffer_size = 8M
   innodb_flush_log_at_trx_commit = 0
   innodb_flush_method = O_DIRECT
   innodb_open_files = 2000
   innodb_file_per_table
   innodb_log_file_size=2G
   innodb_log_files_in_group=2
   innodb_purge_threads=1
   innodb_max_purge_lag=0
   innodb_support_xa=0
   innodb_locks_unsafe_for_binlog = 1
   innodb_buffer_pool_instances=8
   sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

The buffer pool of the database is intentionally small, so it’s easy to generate a disk-bound workload.

Please note that the following settings are not recommended in production.

innodb_support_xa=0
   innodb_locks_unsafe_for_binlog = 1


We used these settings to drive the node to its peak performance, avoiding any possible overhead which might be required on a production system. In typical production settings, these are not set, and binary logging is enabled, which makes ScaleArc’s relative overhead even smaller.

ScaleArc software appliances
CPU: 1 x Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz (6 cores, chip multithreading off)
Memory: 64G

The machines were running ScaleArc for MySQL 3.0.

Network
The machines were connected using 10G connections.

Measurements
All of the measurements were done with a very small database that completely fits in memory.

--oltp-table-size=10000
  --oltp-tables-count=64
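For reference, a read-only run with trunk sysbench (0.5, which drives the workload from Lua scripts) would look roughly like the following sketch; the host, credentials, and script path are illustrative, and the cached runs additionally used --oltp-skip-trx=on:

sysbench --test=tests/db/oltp.lua \
   --mysql-host=target-host --mysql-user=sbtest --mysql-password=sbtest \
   --oltp-table-size=10000 --oltp-tables-count=64 --oltp-read-only=on \
   --num-threads=64 --max-time=300 --max-requests=0 run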


In these benchmarks, we expected both the database and ScaleArc to be CPU bound. In case of a disk-bound workload, ScaleArc would shine even more than in this benchmark: if the queries are more expensive (they have to hit storage), the relative overhead is smaller, and in case of caching the query-by-query benefit is bigger.

We measured 3 different setups, in both read-only and read-write cases:

  • Direct connection to the database.
  • Connection to the database through ScaleArc, where ScaleArc only acts as a pass-through filter (since it’s a load balancer that speaks the MySQL wire protocol, all the mechanics for that are still in place). Please note that this setup doesn’t make sense in real life; its purpose is to show the potential overhead of using ScaleArc and to uncover potential limitations of the ScaleArc software itself.
  • Connection to the database through ScaleArc, where ScaleArc is allowed to cache. Caching in ScaleArc is TTL (Time To Live)-based: a read query’s results are cached in ScaleArc, and if that read query is seen again before the entry expires, it is not run again on the database server but served from the cache. Once the timer for the cached query expires, the query will be issued on the database again. Caching of course only works for reads which are not in an explicit transaction (autocommit is on and no START TRANSACTION is issued). Because of that, we used --oltp-skip-trx during the cached benchmarks (read-only case). In these benchmarks, the TTL was 1 hour, because we wanted to saturate the ScaleArc software while serving cached queries. A 1-hour TTL might be unrealistic for some applications, while for others even a 1-day TTL is something they can live with for some queries. In this case, we wanted to measure the cache’s performance, so we wanted the queries to be cached during the entire benchmark run to show the potential gain even in case of very small queries.

TTL-based caching
It’s important to note that the cache’s expiration is controlled by a TTL value; there is no other automatic invalidation, so it’s possible to read stale data when the query’s results have changed but the cache entry has not yet expired. Reading stale data alone is acceptable for most applications; it can happen with a regular, asynchronous slave too if it’s lagging behind the master (and it is always lagging behind somewhat). Otherwise, the cache is pretty similar to MySQL’s query cache, which doesn’t suffer from the stale-read problem but has coarse invalidation (if a table is written, all cache entries belonging to that table are flushed). While the cache is flushed, the query cache mutex is held, which blocks even reads. Because of this mutex, the built-in query cache is a very common performance bottleneck. ScaleArc’s cache doesn’t suffer from this.

It’s important to note that ScaleArc caches nothing by default. Also, there are other ways to invalidate cache entries apart from waiting for the TTL to expire (see the examples right after this list):

  • API Call based invalidation (you can clear the cache for an entire query pattern rule with one API call)
  • Query comment based invalidation (you can put a comment /*wipe*/ before a query and wipe and refresh the cache)
  • Cache Bypass (you can send a comment /*nocache*/ and bypass the cache for that specific query)
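As a minimal sketch of the comment-based controls (the scalearc-vip host name is hypothetical; sbtest1 and its c column are the standard sysbench schema), note that the mysql command-line client strips comments unless -c (--comments) is passed:

# Wipe and refresh the cached result for this query
mysql -c -h scalearc-vip -u sbtest -p -e "/*wipe*/ SELECT c FROM sbtest1 WHERE id = 42"
# Bypass the cache for this query only
mysql -c -h scalearc-vip -u sbtest -p -e "/*nocache*/ SELECT c FROM sbtest1 WHERE id = 42"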

Read-only
Sysbench throughput

[Figure: sysbench_image6]

In the lower region of threads (up to 32), we see that the TPS value drops significantly when going through ScaleArc. That’s nothing to be surprised about: the reason is network roundtrips. Because ScaleArc is a software appliance, it adds a hop between the database and the application, which introduces latency. If the number of threads is higher (32 and up), this starts to matter less and less, and performance is almost identical, which is very impressive. It means that around the optimal degree of parallelism for these machines, ScaleArc introduces very little (barely measurable) overhead.

Sysbench response time
[Figure: sysbench_image7]
This graph contains the response times belonging to the previous benchmarks. It is hard to read because at 4096 threads the system is overloaded, and the response time is much higher than in the maximum-throughput region. Because it’s multiple orders of magnitude higher, the interesting response times are not readable from this graph.

[Figure: sysbench_image8]

This graph is the same as the previous one, except that the y axis is limited to 250 ms, so the region that was not readable above is visible here. What we see regarding the overhead is pretty much the same as in the throughput graph: ScaleArc by itself introduces very little latency (the difference at low parallelism comes from the extra network roundtrip). Applications utilizing a database server usually use significantly more than one thread (in MySQL a single query always uses a single thread; in other words, there is no intra-query parallelism). From 32 threads and up, the latency is actually somewhat lower when going through ScaleArc (the exact tipping point can differ based on the number of CPUs). The reason is that ScaleArc itself uses an event loop to connect to MySQL, so at high concurrency it can schedule sending the traffic to MySQL differently. This only matters when the MySQL server is otherwise saturated CPU-wise.

CPU utilization

[Figure: sysbench_image9]

Last but not least, this graph contains the CPU utilization of the different setups. The left-hand side shows the CPU utilization when connecting directly to the database, and the right-hand side shows connecting through ScaleArc. In both cases, the database server’s CPU is the bottleneck. It’s visible that the client node’s CPU is more than 75% idle (only client1 is graphed to improve readability; client2 is practically the same). From 32 threads and up, the blue bar (CPU user%) is relatively high on the database servers, as is the green one (CPU sys%). From 64 threads, the idle time is practically 0 until the systems are overloaded. On the right-hand side, we can see that ScaleArc at this load still had 50% idle CPU, which means that we could practically run the same benchmark from another set of boxes through the very same ScaleArc instance, and only then would it be fully utilized. We are talking about 3000 sysbench tps here. One more interesting thing to note is the relatively high system time of ibd. This is also because of the way ScaleArc connects to the database (see the previous paragraph).

[  17s] threads: 64, tps: 3001.98, reads/s: 41962.70, writes/s: 0.00, response time: 35.22ms (95%)


This output is from a single client, which means that with two clients ScaleArc could keep up with parsing roughly 84,000 statements per second while utilizing half of its CPU, which is impressive. Please note that the ScaleArc software in this case was tuned towards this type of workload, which means we had more query-processing threads. In case of caching, we will have more cache-handler threads.

Effects of caching on read-only workload
Sysbench throughput

The next set of graphs will compare the cases when cache is used and not used.

[Figure: sysbench_image11]

The preceding TPS graph contains reads per second (because we measured with --oltp-skip-trx), so roughly 42,000 reads correspond to roughly 3,000 transactions in the earlier setup (14 reads in a transaction). On the left-hand side of the graph, the cached throughput is shown in green; on the right-hand side, the non-cached throughput is shown in red (direct access) and blue (access through ScaleArc as a pass-through filter). It’s visible that caching improves throughput drastically, but when ScaleArc becomes overloaded (8192 client threads, 4096 from each client), the performance becomes somewhat inconsistent, which is understandable considering how few cores ScaleArc was running on. On the graph, the dots are translucent, which means the colors are brighter in the areas that have more samples. Even in the overloaded case, the majority of the samples are in the region of 100k+ reads per second across the two clients, which means that performance degrades very gracefully even under heavy load.

Sysbench response time

[Figure: sysbench_image12]

As in the non-cached workload, the response times are hard to read because of the very high values when the systems are overloaded. But from the response times that are visible, it seems that using caching doesn’t make response times worse.

[Figure: sysbench_Image13]

Like in the non-cached workload, this graph is a zoomed version of the previous one; here the maximum of the y axis is 100 ms. From this graph, it’s visible that at lower concurrency and at the optimal throughput, caching actually helps response time. This is understandable: in case of a cache hit, ScaleArc can serve the results itself, and the client (in our case sysbench) doesn’t have to go to the database, so a roundtrip and the database processing time are spared. It’s also worth mentioning that the data “comes from memory” either way; it doesn’t matter whether we hit the ScaleArc cache or the database’s buffer pool. When the ScaleArc cache is used, the response time is lower because the additional roundtrip to the database and the potential database work (like parsing SQL) are avoided. This means that caching can have benefits even if the database fits in the buffer pool. The improvement is always subject to the workload: caching helps the most when it can cache relatively expensive queries like aggregations and queries hitting storage.

CPU utilization

[Figure: sysbench_image14]

Similarly to the previous case, the preceding graph shows the CPU utilization of the various components. In case of the cached workload, the client itself is much more utilized (since it gets responses sooner, it has to generate the traffic faster). With this kind of workload, when using only one client, we would hit the client’s CPU as the performance bottleneck. The database is interesting too: with caching, its CPU is barely used, because a query served from the cache never reaches the database, so the database’s CPU utilization drops. In other words, using the cache helps to offload the database. The offloading is also visible on ScaleArc’s graphs: when caching is used, the CPU on the server hosting ScaleArc is much more utilized. For this benchmark, the ScaleArc software was tuned to handle a cached workload, which means more cache-handler threads.

Read-write
For the read-write benchmarks, we had to create oltp_nontran.lua, which is the same sysbench benchmark as oltp.lua except that it does the reads outside of the transaction and only the writes inside the transaction, so caching can have an effect on the reads. The rest of the benchmarking setup is the same as in the read-only case.
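A read-write run with the custom script would then be invoked roughly like this sketch (the script path, host, and credentials are again illustrative):

sysbench --test=./oltp_nontran.lua \
   --mysql-host=target-host --mysql-user=sbtest --mysql-password=sbtest \
   --oltp-table-size=10000 --oltp-tables-count=64 \
   --num-threads=64 --max-time=300 --max-requests=0 run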

Sysbench throughput

[Figure: sysbench_image15]

Similarly to the read-only case, at low concurrency the overhead of ScaleArc comes from the additional network roundtrip. At the optimal concurrency, the overhead is barely measurable (the dots are plotted practically on top of each other).

Sysbench response time

[Figure: sysbench_image16]

[Figure: sysbench_image17]

The situation with the response times is pretty similar to the read-only case. As before, the second graph is a zoomed version of the first one, with a 250 ms maximum on the y axis.

CPU utilization

[Figure: sysbench_image18]

The CPU utilization graph shows that in this case, the database server’s CPU is the bottleneck. What is interesting is that ScaleArc is using less CPU than in the read-only case. This is understandable, since a transaction now contains writes as well, which are expensive on the database side, but they are still just statements to route on the ScaleArc side.

Effects of caching on read-write workload
Measuring caching here is interesting because the workload is no longer read-only or mostly reads: we have a very significant amount of writes.

[Figure: sysbench_image19]

For 30k reads, we get 8.5k writes. It’s expected that caching won’t help as much as in the previous case, because writes can’t be cached, and while they are in progress, the benchmarking threads can’t proceed with reads. Please note that this means roughly 25% of the traffic is writes; a typical application scaling out with additional slaves for reads doesn’t have this kind of read-to-write ratio.

Sysbench throughput

[Figure: sysbench_image20]

The first graph shows that in terms of total throughput, caching still helps.

Sysbench response time

[Figure: sysbench_image21]

[Figure: sysbench_image22]

Similarly to the read-only case, caching also helps response time, because it reduces the time needed for the read part of the workload.

CPU utilization

[Figure: sysbench_image24]

This test really stresses the database server’s CPU when not caching. With caching on, similarly to the read-only case, the client’s workload increases somewhat (but not as much), and the database server’s CPU usage decreases significantly. In the last row, the CPU utilization of ScaleArc shows that although it’s somewhat higher with caching, it’s still not that much higher.

From these tests it’s visible that caching can still be beneficial even if the write ratio is as high as in this test.

Conclusion
Engineering is always about making the right tradeoffs. If one wants features that require a protocol-level load balancer like ScaleArc, the price is paid in the overhead of Layer 7 parsing and decision making. ScaleArc’s engineering team did a great job minimizing this overhead. ScaleArc itself is very well tunable for different workload types (if caching is important, ScaleArc can be tuned for caching; if query rewriting is, it can be tuned for that).

The post ScaleArc: Benchmarking with sysbench appeared first on MySQL Performance Blog.
