Sep
01
2015
--

Booking.com’s Jean-François Gagné on Percona Live Amsterdam

Booking.com, one of the world’s leading e-commerce companies, helps travelers book nearly 1 million rooms per night. Established in 1996, Booking.com B.V. guarantees the best prices for any type of property, from small, family-run bed and breakfasts to executive apartments and five-star luxury suites.

The travel website is also a dedicated contributor to the MySQL and Perl communities. Other open source technologies in its stack include CentOS Linux, Nginx, Python, Puppet, Git and more.

Booking.com is a Diamond sponsor of Percona Live Amsterdam (Sept. 21-23); you can meet the people who power Booking.com at booth 205. Enter promo code “BlogInterview” at registration to save €20!

In the meantime, meet Jean-François Gagné, a system engineer at Booking.com. He’ll be presenting a couple of talks: “Riding the Binlog: an in Deep Dissection of the Replication Stream” and “Binlog Servers at Booking.com.”


Tom: Hi Jean-François, in your session, “Riding the Binlog: an in Deep Dissection of the Replication Stream“, you talk about how we can think of the binary logs as a transport for a “Stream of Transactions”. What will be the top 3 things attendees will come away with following this 50-minute talk?

Jean-François: Hi Tom, thanks for this opportunity to give a sneak peek of my talk.  The most important subject that will be discussed is that the binary log stream evolves: with “log-slave-updates”, the stream can grow, shrink or morph.  Put another way: the binary logs of a slave can be very different from the binary logs of the master, and this should be taken into account when relying on them (including when replicating via an intermediate master and when promoting a slave to a new master using GTIDs).  We will also explore how the binary logs can be decomposed into sub-streams, or viewed as the multiplexing of many streams.  Finally, we will look at de-multiplexing functions and the new possibilities they open up.

Tom: Percona Live, starting with this conference, has a new venue and a broader theme – now encompassing, in addition to MySQL, MongoDB, NoSQL and data in the cloud. Your thoughts? And what do you think is missing – what would you change (if anything)?

Jean-François: I think you forgot the best change of all: going from a two-day conference last year in London to a three-day conference this year.  This will allow better knowledge exchange and I am very happy about that.  I think this event will be a success, with a good balance of sessions focused on technologies and presentations about specific use-cases of those technologies.  If I had one wish, I would like to see more sessions about specific use-cases of NoSQL technologies, with in-depth discussion of why they are a better choice than more traditional solutions: maybe more of those sessions will be submitted next year.

Tom: Which other session(s) are you most looking forward to besides your own?

Jean-François: I will definitely attend the Facebook session about Semi-Synchronous Replication: it is very close to my interests, especially as Booking.com is thinking about using lossless semi-sync replication in the future, and I look forward to hearing war stories about this feature.  All sessions dissecting the internals of a technology (InnoDB, TokuDB, RocksDB, …) will also have my attention.  Finally, it is always interesting to hear how large companies are using databases, so I plan to attend the MySQL@Wikimedia session.

Tom: As a resident of Amsterdam, what are some of the must-do activities/sightseeing for those visiting for Percona Live from out of town?

Jean-François: Seeing the city from a high point is impressive, and you will have the opportunity to enjoy that view from the Booking.com office at the Community Dinner.  I also recommend finding a bike and discovering the city by pedaling (there are many rental shops, just ask Google).  From the conference venue, you can do a 70-minute ride crossing three nice parks: the Westerpark, the Rembrandtpark and the Vondelpark – https://goo.gl/P13Mc7 – and you can visit the first or third park in a shorter ride (45 minutes).  If you feel a little more adventurous, I recommend a 90-minute ride south following the Amstel: once out of Amsterdam, you will have the water on one side at the level of the road, and the fields (polder) 3 meters below on the other side (https://goo.gl/OPDv5z).  This will let you see for yourself why this place is called the “Low Countries”.

The post Booking.com’s Jean-François Gagné on Percona Live Amsterdam appeared first on Percona Data Performance Blog.

Aug
20
2015
--

Optimizing PXC Xtrabackup State Snapshot Transfer

State Snapshot Transfer (SST) at a glance

PXC uses a protocol called State Snapshot Transfer to provision a node joining an existing cluster with all the data it needs to synchronize.  This is analogous to cloning a slave in asynchronous replication:  you take a full backup of one node and copy it to the new one, while tracking the replication position of the backup.

PXC automates this process using scriptable SST methods.  The most common of these is the xtrabackup-v2 method, which is the default in PXC 5.6.  Xtrabackup is generally favored over other SST methods because it is non-blocking on the Donor node (the node contributing the backup).
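For reference, the SST method is selected in my.cnf; a minimal sketch that explicitly sets what is already the 5.6 default:

[mysqld]
wsrep_sst_method = xtrabackup-v2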

The basic flow of this method is:

  • The Joiner:
    • joins the cluster
    • learns it needs a full SST and clobbers its local datadir (the SST will replace it)
    • prepares for a state transfer by opening a socat listener on port 4444 (by default)
    • pipes the incoming files from socat into the datadir/.sst directory
  • The Donor:
    • is picked by the cluster (could be configured, or based on WAN segments)
    • starts a streaming Xtrabackup and pipes its output via socat to the Joiner on port 4444
    • upon finishing the backup, sends an indication of this along with the final Galera GTID of the backup to the Joiner
  • The Joiner:
    • records all changes from the Donor’s backup’s GTID forward in its gcache (and overflow pages; this is limited by available disk space)
    • runs the --apply-log phase of Xtrabackup on the received backup
    • moves the datadir/.sst directory contents into the datadir
    • starts mysqld
    • applies all the transactions it needs (through the Joining and Joined states, just as IST does)
    • moves to the ‘Synced’ state and is done

There are a lot of moving pieces here, and nothing is really tuned by default.  On larger clusters, SST can be quite scary because it may take hours or even days.  Any failure can mean starting over from the beginning.

This blog will concentrate on some ways to make a good dent in the time SST can take.  Many of these methods are trade-offs and may not apply to your situation.  Further, there may be other ways I haven’t thought of to speed things up; please share what you’ve found that works!

The Environment

I am testing SST on a PXC 5.6.24 cluster in AWS.  The nodes are c3.4xlarge and the datadirs are RAID-0 over the two ephemeral SSD drives in that instance type.  These instances are all in the same region.

My simulated application is using only node1 in the cluster and is sysbench OLTP with 200 tables with 1M rows each.  This comes out to just under 50G of data.  The test application runs on a separate server with 32 threads.

The PXC cluster itself is tuned to best practices for InnoDB and Galera performance.

Baseline

In my first test the cluster has a single member (receiving the workload) and I am joining node2.  This configuration is untuned for SST.  I measured the time from when mysqld started on node2 until it entered the Synced state (i.e., fully caught up).  In the log, it looked like this:

150724 15:59:24 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
... lots of other output ...
2015-07-24 16:48:39 31084 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 4647341)

Doing some math on the above, we find that the SST took about 49 minutes to complete.

--use-memory

One of the first things I noticed was that the --apply-log step on the Joiner was very slow.  Anyone who uses Xtrabackup a lot will know that --apply-log will be a lot faster if you give it some extra RAM to use while making the backup consistent via the --use-memory option.  We can set this in our my.cnf like this:

[sst]
inno-apply-opts="--use-memory=20G"

The [sst] section is a special one understood only by the xtrabackup-v2 script.  inno-apply-opts allows me to specify arguments to innobackupex when it runs.

Note that this change only applies to the Joiner (i.e., you don’t have to put it on all your nodes and restart them to take advantage of it).

This change immediately makes a huge improvement to our above scenario (node2 joining node1 under load) and the SST now takes just over 30 minutes.

wsrep_slave_threads

Another slow part of getting to Synced is how long it takes to apply transactions up to realtime after the backup is restored and in place on the Joiner.  We can improve this throughput by increasing the number of apply threads on the Joiner to make better use of the CPU.  Prior to this, wsrep_slave_threads was set to 1, but if I increase it to 32 (there are 16 cores on this instance type), my SST now takes 25m 32s.
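For reference, a minimal my.cnf sketch of this change (32 matches the test above; in general the value should be sized against your core count):

[mysqld]
# more apply threads let the Joiner catch up faster after the backup is restored
wsrep_slave_threads = 32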

Compression

xtrabackup-v2 supports adding a compression process into the datastream.  On the Donor it compresses and on the Joiner it decompresses.  This allows you to trade CPU for transfer speed.  If your bottleneck turns out to be network transport and you have spare CPU, this can help a lot.

Further, I can use pigz instead of gzip to get parallel compression, but theoretically any compression utility can work as long as it can compress and decompress standard input to standard output.  I install the ‘pigz’ package on all my nodes and change my my.cnf like this:

[sst]
inno-apply-opts="--use-memory=20G"
compressor="pigz"
decompressor="pigz -d"

Both the Joiner and the Donor must have the respective decompressor and compressor settings or the SST will fail with a vague error message (not actually having pigz installed will do the same thing).

By adding compression, my SST is down to 21 minutes, but there’s a catch: my application performance starts to take a serious nose-dive during this test.  Pigz is consuming most of the CPU on my Donor, which is also my primary application node.  This may or may not hurt your application workload in the same way, but it emphasizes the importance of understanding (and measuring) the performance impact SST has on your Donor nodes.

Dedicated donor

To alleviate the problem with the application, I now leave node2 up and spin up node3.  Since I’m expecting node2 to normally not be receiving application traffic directly, I can configure node3 to prefer node2 as its donor like this:

[mysqld]
...
wsrep_sst_donor = node2,

When node3 starts, this setting instructs the cluster that node2 is the preferred donor; if node2 is not available, pick something else (that’s what the trailing comma means).

Donor nodes are permitted to fall behind in replication apply as needed without sending flow control.  Sending application traffic to such a node may result in an increase in the amount of stale data, as well as certification failures for writes (not to mention the performance issues we saw above with node1).  Since node2 is not getting application traffic, moving into the Donor state and doing an expensive SST with pigz compression should be relatively safe for the rest of the cluster (in this case, node1).

Even if you don’t have a dedicated donor, if you use a load balancer of some kind in front of your cluster, you may elect to consider Donor nodes as failing their health checks so application traffic is diverted during any state transfer.
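For example, if you use the clustercheck script that ships with PXC behind your load balancer, its available_when_donor argument controls exactly this. A sketch, assuming the common packaging where that flag is the third argument:

# return 503 (unhealthy) to the load balancer while in Donor/Desynced state
/usr/bin/clustercheck sstuser sstpassword 0

With this in place, the load balancer diverts application traffic away from the Donor for the duration of the state transfer.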

When I brought up node3, with node2 as the donor, the SST time dropped to 18m 33s.

Conclusion

Each of these tunings helped the SST speed, though the later adjustments maybe had less of a direct impact.  Depending on your workload, database size, network and CPU available, your mileage may of course vary.  Your tunings should vary accordingly, but also realize you may actually want to limit (and not increase) the speed of state transfers in some cases to avoid other problems. For example, I’ve seen several clusters get unstable during SST and the only explanation for this is the amount of network bandwidth consumed by the state transfer preventing the actual Galera communication between the nodes. Be sure to consider the overall state of production when tuning your SSTs.

The post Optimizing PXC Xtrabackup State Snapshot Transfer appeared first on MySQL Performance Blog.

Aug
19
2015
--

How much could you benefit from MySQL 5.6 parallel replication?

I have heard this question quite often: “At busy times, our replicas start lagging quite frequently. We are using N schemas, so what performance boost could we expect from MySQL 5.6 parallel replication?” Here is a quick way to get a rough estimate of the potential benefit.

General idea

In MySQL 5.6, parallelism is added at the schema level. So in theory, if you have N schemas and if you use N parallel threads, replication could be up to N times faster. This assumes at least 2 things:

  • Replication throughput scales linearly with the number of parallel threads.
  • Writes are evenly distributed across schemas.

Both assumptions are of course not realistic. But it is easy to know the distribution of writes, and that can already give you an idea about how much you could benefit from parallel replication.
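For reference, here is a minimal sketch of enabling schema-based parallel apply on a 5.6 slave (the value 4 is illustrative; the SQL thread must be restarted for the change to take effect):

STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 4; -- 0 disables parallel apply
START SLAVE SQL_THREAD;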

Writes are stored in binary logs but it is much easier to work with the slow query log, so we can enable full slow query logging for some time with long_query_time = 0 and then use pt-query-digest to analyze the resulting log file.
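A minimal sketch of capturing such a log window (remember to restore long_query_time afterwards; 10 below is simply the server default):

SET GLOBAL slow_query_log = ON;
SET GLOBAL long_query_time = 0;
-- ... let the workload run for a representative period ...
SET GLOBAL long_query_time = 10;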

An example

I have a test server with 3 schemas, and I’ve run some sysbench load on it to get a decent slow query log file. Once done, I can run this command:

pt-query-digest --filter '$event->{arg} !~ m/^select|^set|^commit|^show|^admin|^rollback|^begin/i' --group-by db --report-format profile slow_query.log > digest.out

and here is the result I get:

# Profile
# Rank Query ID Response time  Calls  R/Call V/M   Item
# ==== ======== ============== ====== ====== ===== ====
#    1 0x       791.6195 52.1% 100028 0.0079  0.70 db3
#    2 0x       525.1231 34.5% 100022 0.0053  0.68 db1
#    3 0x       203.4649 13.4% 100000 0.0020  0.64 db2

In a perfect world, with 3 parallel threads and with each schema handling 33% of the total write workload, I could expect a 3x performance improvement.

However here we can see in the report that the 3 replication threads will only work simultaneously 25% of the time in the best case (13.4/52.1 = 0.25). We can also expect 2 replication threads to work simultaneously for some part of the workload, but let’s ignore that for clarity.

It means that instead of the theoretical 200% performance improvement (3 parallel threads 100% of the time), we can hardly expect more than a 50% performance improvement (3 parallel threads 25% of the time). And the reality is that the benefit will be much lower than that.

Conclusion

Parallel replication in MySQL 5.6 is a great step forward; however, don’t expect too much if your writes are not evenly distributed across all your schemas. The pt-query-digest trick I shared can give you a rough idea of whether your workload is a good fit for multi-threaded slaves in 5.6.

I’m expecting much better results for 5.7, partly because parallelism is handled differently, but also because you can tune how efficient parallel replication will be by adjusting the binlog group commit settings.
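For reference, here is a sketch of the 5.7 knobs involved (variable names as in 5.7; the values are illustrative, not recommendations):

[mysqld]
# on the slave: schedule transactions by commit group instead of by schema
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 8
# on the master: delay commits slightly (in microseconds) so that more
# transactions share a binlog group commit
binlog_group_commit_sync_delay = 10000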

The post How much could you benefit from MySQL 5.6 parallel replication? appeared first on MySQL Performance Blog.

Aug
07
2015
--

The MySQL query cache: Worst enemy or best friend?

During the last couple of months I have been involved in an unusually high number of performance audits for e-commerce applications running Magento. And although the systems were quite different, they all had one thing in common: the MySQL query cache was very useful. That was counter-intuitive for me, as I had always expected the query cache to be such a bottleneck that response time would be better with the query cache turned off, no matter what. That led me to run a few experiments to better understand when the query cache can be helpful.

Some context

The query cache is well known for its contentions: a global mutex has to be acquired for any read or write operation, which means that any access is serialized. This was not an issue 15 years ago, but with today’s multi-core servers, such serialization is the best way to kill performance.

However, from a performance point of view, any query cache hit is served in a few tens of microseconds, while the fastest access with InnoDB (a primary key lookup) still requires several hundred microseconds. Yes, the query cache is at least an order of magnitude faster than any query that goes to InnoDB.

A simple test

To better understand how good or bad the query cache can be, I set up a very simple benchmark:

  • 1M records were inserted in 16 tables.
  • A moderate write load (65 updates/s) was run with a modified version of the update_index.lua sysbench script (see the end of the post for the code).
  • The select.lua sysbench script was run, with several values for the --num-threads option.

Note that the test is designed to be unfavorable to the query cache as the whole dataset fits in the buffer pool and the SELECT statements are very simple. Also note that I configured the query cache to be large enough so that no entry was evicted from the cache due to low memory.
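For reference, here is a sketch of the query cache settings used for such a test (the size is an assumption, simply chosen large enough to avoid low-memory prunes):

[mysqld]
query_cache_type = 1
query_cache_size = 256M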

Results – MySQL query cache ON

First here are the results when the query cache is enabled:

[Chart: throughput vs. concurrent threads with the query cache ON]

This configuration scales well up to 4 concurrent threads, but then the throughput degrades very quickly. With 10 concurrent threads, SHOW PROCESSLIST is enough to show you that all threads spend all their time waiting for the query cache mutex. Okay, this is not a surprise.

Results – MySQL query cache OFF

When the query cache is disabled, this is another story:

[Chart: throughput vs. concurrent threads with the query cache OFF]

Throughput scales well up to somewhere between 10 and 20 threads (for the record the server I was using had 16 cores). But more importantly, even at the higher concurrencies, the overall throughput continued to increase: at 20 concurrent threads, MySQL was able to serve nearly 3x more queries without the query cache.

Conclusion

With Magento, you can expect a light write workload, very low concurrency and quite complex SELECT statements. Given the results of our simple benchmark, it is not that surprising after all that the MySQL query cache is a good fit in this case.

It is also worth noting that many applications run a database workload where writes are light and concurrency is low: the query cache should then not be discarded immediately. And maybe it is time for Oracle to make plans to improve the query cache as suggested by Peter a few years ago?

Annex: sysbench commands

# Modified update_index.lua
function event(thread_id)
   -- pick a random table and increment one row: a light write load
   local table_name = "sbtest" .. sb_rand_uniform(1, oltp_tables_count)
   rs = db_query("UPDATE " .. table_name .. " SET k=k+1 WHERE id=" .. sb_rand(1, oltp_table_size))
   -- throttle the write rate (~65 updates/s with a single thread)
   db_query("SELECT SLEEP(0.015)")
end

# Populate the tables
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-table-size=1000000 --oltp-tables-count=16 --num-threads=16 --test=/usr/share/doc/sysbench/tests/db/insert.lua prepare
# Write workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/update_index.lua --max-requests=1000000 run
# Read workload
sysbench --mysql-socket=/data/mysql/mysql.sock --mysql-user=root --mysql-db=db1 --oltp-tables-count=16 --num-threads=1 --test=/usr/share/doc/sysbench/tests/db/select.lua --max-requests=10000000 run

The post The MySQL query cache: Worst enemy or best friend? appeared first on MySQL Performance Blog.

Aug
05
2015
--

PXC – Incremental State transfers in detail

IST Basics

State transfers in Galera remain a mystery to most people.  Incremental State Transfers (as opposed to full State Snapshot Transfers) are used under the following conditions:

  • The Joiner node reports a valid Galera GTID to the cluster
  • The selected Donor node contains, in its Gcache, all the transactions the Joiner needs to catch up to the rest of the cluster
  • The Donor node can establish a TCP connection to the Joiner on port 4568 (by default)

IST states

Galera has many internal node states related to Joiner nodes.  They currently are:

  1. Joining
  2. Joining: preparing for State Transfer
  3. Joining: requested State Transfer
  4. Joining: receiving State Transfer
  5. Joining: State Transfer request failed
  6. Joining: State Transfer failed
  7. Joined

I don’t claim any special knowledge of most of these states apart from what their titles indicate.  Many of these states occur very briefly and it is unlikely you’ll ever actually see them in a node’s wsrep_local_state_comment.

During IST, however, I have observed that the following states have the potential to take a long while:

Joining: receiving State Transfer

During this state, transactions are streamed into the Joiner’s wsrep_local_recv_queue.  You can connect to the node at this time and poll its state.  If you do, you’ll easily see the inbound queue increasing (usually quickly) but no writesets being ‘received’ (read: applied).  It’s not clear to me if there is a reason why transaction apply couldn’t start during this stream, but it does not do so currently.

The further behind the Joiner is, the longer this can take.  Here’s some output from the latest release of myq-tools showing wsrep stats:

[root@node2 ~]# myq_status wsrep
mycluster / node2 (idx: 1) / Galera 3.11(ra0189ab)
         Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:40 P   4  2 J:Rc 0.4ms    0   0b   0    1 197b  4k   0ns   0   0   0  367k    0  94%
14:04:41 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  5k   0ns   0   0   0  368k    0  93%
14:04:42 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  6k   0ns   0   0   0  371k    0  92%
14:04:43 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  7k   0ns   0   0   0  373k    0  92%
14:04:44 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b  8k   0ns   0   0   0  376k    0  92%
14:04:45 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 10k   0ns   0   0   0  379k    0  92%
14:04:46 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 11k   0ns   0   0   0  382k    0  92%
14:04:47 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 12k   0ns   0   0   0  386k    0  91%
14:04:48 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 13k   0ns   0   0   0  390k    0  91%
14:04:49 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 14k   0ns   0   0   0  394k    0  91%
14:04:50 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 15k   0ns   0   0   0  397k    0  91%
14:04:51 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 16k   0ns   0   0   0  401k    0  91%
14:04:52 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 18k   0ns   0   0   0  404k    0  91%
14:04:53 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 19k   0ns   0   0   0  407k    0  91%
14:04:54 P   4  2 J:Rc 0.4ms    0   0b   0    0   0b 20k   0ns   0   0   0  411k    0  91%

The node is in ‘J:Rc’ (Joining: Receiving) state and we can see the Inbound queue growing (wsrep_local_recv_queue). Otherwise this node is not sending or receiving transactions.
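If you prefer polling the node directly instead of using myq-tools, the same counters are exposed as standard wsrep status variables:

SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
SHOW GLOBAL STATUS LIKE 'wsrep_received';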

Joining

Once all the requested transactions are copied over, the Joiner flips to the ‘Joining’ state, during which it starts applying the transactions as quickly as the wsrep_slave_threads can go.  For example:

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:04:55 P   4  2 Jing 0.6ms    0   0b   0 2243 3.7M 19k   0ns   0   0   0  2236  288  91%
14:04:56 P   4  2 Jing 0.5ms    0   0b   0 4317 7.0M 16k   0ns   0   0   0  6520  199  92%
14:04:57 P   4  2 Jing 0.5ms    0   0b   0 4641 7.5M 12k   0ns   0   0   0 11126  393  92%
14:04:58 P   4  2 Jing 0.4ms    0   0b   0 4485 7.2M  9k   0ns   0   0   0 15575  200  93%
14:04:59 P   4  2 Jing 0.5ms    0   0b   0 4564 7.4M  5k   0ns   0   0   0 20102  112  93%

Notice the Inbound msgs (wsrep_received) starts increasing rapidly and the queue decreases accordingly.

Joined

14:05:00 P   4  2 Jned 0.5ms    0   0b   0 4631 7.5M  2k   0ns   0   0   0 24692   96  94%

Towards the end the node briefly switches to the ‘Joined’ state, though it passes through it quickly in this case.  ‘Joining’ and ‘Joined’ are similar states; the difference (I believe) is that:

  • ‘Joining’ is applying transactions acquired via the IST
  • ‘Joined’ is applying transactions that have queued up via standard Galera replication since the IST began (i.e., everything that has happened on the cluster since the IST)

Flow control during Joining/Joined states

The Codership documentation says something interesting about ‘Joined’ (from experimentation, I believe the ‘Joining’ state behaves the same way here):

Nodes in this state can apply write-sets. Flow Control here ensures that the node can eventually catch up with the cluster. It specifically ensures that its write-set cache never grows. Because of this, the cluster wide replication rate remains limited by the rate at which a node in this state can apply write-sets. Since applying write-sets is usually several times faster than processing a transaction, nodes in this state hardly ever effect cluster performance.

What this essentially means is that a Joiner’s wsrep_local_recv_queue is allowed to shrink but NEVER GROW during an IST catchup.  Growth will trigger flow control, but why would it grow?  Writes on other cluster nodes must still be replicated to our Joiner and added to the queue.

If the Joiner’s apply rate is less than the rate of writes coming from cluster replication, flow control will be applied to slow down cluster replication (read: your application writes).  As far as I can tell, there is no way to tune this or turn it off.  The Codership manual continues:

The one occasion when nodes in the JOINED state do effect cluster performance is at the very beginning, when the buffer pool on the node in question is empty.

Essentially a Joiner node with a cold cache can really hurt performance on your cluster.  This can possibly be improved by:

  • Better IO and other resources available to the Joiner for a quicker cache warmup.  A prime example of this would be flash over conventional storage.
  • Buffer pool preloading
  • More Galera apply threads
  • etc.

Synced

From what I can tell, the ‘Joined’ state ends when the wsrep_local_recv_queue drops below the node’s configured flow control limit.  At that point it changes to ‘Synced’ and the node behaves more normally (with respect to flow control).

Cluster  Node       Outbound      Inbound       FlowC     Conflct Gcache     Appl
    time P cnf  # stat laten msgs data que msgs data que pause snt lcf bfa   ist  idx  %ef
14:05:01 P   4  2 Sync 0.5ms    0   0b   0 3092 5.0M   0   0ns   0   0   0 27748  150  94%
14:05:02 P   4  2 Sync 0.5ms    0   0b   0 1067 1.7M   0   0ns   0   0   0 28804  450  93%
14:05:03 P   4  2 Sync 0.5ms    0   0b   0 1164 1.9M   0   0ns   0   0   0 29954   67  92%
14:05:04 P   4  2 Sync 0.5ms    0   0b   0 1166 1.9M   0   0ns   0   0   0 31107  280  92%
14:05:05 P   4  2 Sync 0.5ms    0   0b   0 1160 1.9M   0   0ns   0   0   0 32258  606  91%
14:05:06 P   4  2 Sync 0.5ms    0   0b   0 1154 1.9M   0   0ns   0   0   0 33401  389  90%
14:05:07 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   1   0ns   0   0   0 34534  297  90%
14:05:08 P   4  2 Sync 0.5ms    0   0b   0 1147 1.8M   0   0ns   0   0   0 35667  122  89%
14:05:09 P   4  2 Sync 0.5ms    0   0b   0 1121 1.8M   0   0ns   0   0   0 36778  617  88%
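For reference, that flow control limit is the gcs.fc_limit Galera provider option; a minimal sketch of setting it explicitly (16 writesets is the documented default):

[mysqld]
wsrep_provider_options = "gcs.fc_limit=16"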

Conclusion

You may not notice these states during IST unless you are watching the Joiner closely, but if your IST is taking a long while, the information above should make it easy to understand what is happening.

The post PXC – Incremental State transfers in detail appeared first on MySQL Performance Blog.

Jul
29
2015
--

Why base64-output=DECODE-ROWS does not print row events in MySQL binary logs

Lately I have seen many cases where users specified the option --base64-output=DECODE-ROWS to print a statement representation of row events in MySQL binary logs, just to get nothing. The reason for this is simple: --base64-output=DECODE-ROWS does not convert row events into their string representation; that is the job of the option --verbose. But why do users mix up these two options so often? This blog post is the result of my investigation.

There are already two great blog posts about printing row events on the Percona blog: “Debugging problems with row based replication” by Justin Swanhart and “Identifying useful info from MySQL row-based binary logs” by Alok Pathak.

Both authors run mysqlbinlog with the options --base64-output=decode-rows -vv and demonstrate how this combination can produce human-readable output of row events. However, one thing that is still not clear is the difference between these options. I want to underline the differences in this post.

Let’s check the user manual first.

--base64-output=value

This option determines when events should be displayed encoded as base-64 strings using BINLOG statements. The option has these permissible values (not case sensitive):

    AUTO (“automatic”) or UNSPEC (“unspecified”) displays BINLOG statements automatically when necessary (that is, for format description events and row events). If no --base64-output option is given, the effect is the same as --base64-output=AUTO.
    Note

    Automatic BINLOG display is the only safe behavior if you intend to use the output of mysqlbinlog to re-execute binary log file contents. The other option values are intended only for debugging or testing purposes because they may produce output that does not include all events in executable form.

    NEVER causes BINLOG statements not to be displayed. mysqlbinlog exits with an error if a row event is found that must be displayed using BINLOG.

    DECODE-ROWS specifies to mysqlbinlog that you intend for row events to be decoded and displayed as commented SQL statements by also specifying the --verbose option. Like NEVER, DECODE-ROWS suppresses display of BINLOG statements, but unlike NEVER, it does not exit with an error if a row event is found.

For examples that show the effect of --base64-output and --verbose on row event output, see Section 4.6.8.2, “mysqlbinlog Row Event Display”.

Literally, --base64-output=DECODE-ROWS just suppresses the BINLOG statements and does not print anything in their place.

To test its effect I ran the command

insert into t values (2, 'bar');

on an InnoDB table while the binary log was using the ROW format. As expected, if I specify no options, I receive unreadable output:

$mysqlbinlog var/mysqld.1/data/master-bin.000002
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
BINLOG '
Q+esVQ8BAAAAdAAAAHgAAAAAAAQANS42LjI1LTczLjEtZGVidWctbG9nAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAeKu
Uj0=
'/*!*/;
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
BINLOG '
SeesVRMBAAAALwAAAO8AAAAAAEYAAAAAAAEABHRlc3QAAXQAAgMPAv8AA4uDQ+E=
SeesVR4BAAAALAAAABsBAAAAAEYAAAAAAAEAAgAC//wCAAAAA2Jhci06UnU=
'/*!*/;
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The INSERT is here:

BINLOG '
SeesVRMBAAAALwAAAO8AAAAAAEYAAAAAAAEABHRlc3QAAXQAAgMPAv8AA4uDQ+E=
SeesVR4BAAAALAAAABsBAAAAAEYAAAAAAAEAAgAC//wCAAAAA2Jhci06UnU=
'/*!*/;

But this string is not meant for humans.

What will happen if I add the option --base64-output=DECODE-ROWS?

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The row event was simply suppressed!

Let’s now check the option --verbose:

--verbose, -v

Reconstruct row events and display them as commented SQL statements. If this option is given twice, the output includes comments to indicate column data types and some metadata.

For examples that show the effect of --base64-output and --verbose on row event output, see Section 4.6.8.2, “mysqlbinlog Row Event Display”.

Surprisingly, --base64-output=DECODE-ROWS is not needed:

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --verbose
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
BINLOG '
Q+esVQ8BAAAAdAAAAHgAAAAAAAQANS42LjI1LTczLjEtZGVidWctbG9nAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAeKu
Uj0=
'/*!*/;
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
BINLOG '
SeesVRMBAAAALwAAAO8AAAAAAEYAAAAAAAEABHRlc3QAAXQAAgMPAv8AA4uDQ+E=
SeesVR4BAAAALAAAABsBAAAAAEYAAAAAAAEAAgAC//wCAAAAA2Jhci06UnU=
'/*!*/;
### INSERT INTO `test`.`t`
### SET
###   @1=2
###   @2='bar'
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

The INSERT statement was successfully reconstructed as:

### INSERT INTO `test`.`t`
### SET
###   @1=2
###   @2='bar'
# at 283

So why do the bloggers mentioned above suggest using --base64-output=DECODE-ROWS? Let’s try both options together:

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS --verbose
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
### INSERT INTO `test`.`t`
### SET
###   @1=2
###   @2='bar'
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

In this case the row event was suppressed and the statement is printed as a comment. The resulting file also cannot be used to re-apply events, because the statements are commented out. This is very useful when the binary log is big and you just need to investigate what it contains, not re-apply the events.

This is not the main purpose of this post, but you can also get information about column metadata if you specify the option --verbose twice:

$mysqlbinlog var/mysqld.1/data/master-bin.000002 --base64-output=DECODE-ROWS --verbose --verbose
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#150720 15:19:15 server id 1  end_log_pos 120 CRC32 0x3d52aee2  Start: binlog v 4, server v 5.6.25-73.1-debug-log created 150720 15:19:15
# at 120
#150720 15:19:21 server id 1  end_log_pos 192 CRC32 0xbebac59d  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1437394761/*!*/;
SET @@session.pseudo_thread_id=2/*!*/;
SET @@session.foreign_key_checks=1, @@session.sql_auto_is_null=0, @@session.unique_checks=1, @@session.autocommit=1/*!*/;
SET @@session.sql_mode=1073741824/*!*/;
SET @@session.auto_increment_increment=1, @@session.auto_increment_offset=1/*!*/;
/*!C utf8 *//*!*/;
SET @@session.character_set_client=33,@@session.collation_connection=33,@@session.collation_server=8/*!*/;
SET @@session.lc_time_names=0/*!*/;
SET @@session.collation_database=DEFAULT/*!*/;
BEGIN
/*!*/;
# at 192
#150720 15:19:21 server id 1  end_log_pos 239 CRC32 0xe143838b  Table_map: `test`.`t` mapped to number 70
# at 239
#150720 15:19:21 server id 1  end_log_pos 283 CRC32 0x75523a2d  Write_rows: table id 70 flags: STMT_END_F
### INSERT INTO `test`.`t`
### SET
###   @1=2 /* INT meta=0 nullable=1 is_null=0 */
###   @2='bar' /* VARSTRING(255) meta=255 nullable=1 is_null=0 */
# at 283
#150720 15:19:21 server id 1  end_log_pos 314 CRC32 0xd183c769  Xid = 14
COMMIT/*!*/;
# at 314
#150720 15:19:22 server id 1  end_log_pos 362 CRC32 0x892fe43b  Rotate to master-bin.000003  pos: 4
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;

Note that this is, again, the job of --verbose, not of --base64-output=DECODE-ROWS.

To conclude:

  • If you want to see a statement representation of row events, use the option --verbose (-v).
  • If you want to see column metadata as well, specify --verbose twice: --verbose --verbose or -vv.
  • If you want to suppress the output of row events, specify the option --base64-output=DECODE-ROWS.

The post Why base64-output=DECODE-ROWS does not print row events in MySQL binary logs appeared first on MySQL Performance Blog.

Jul
28
2015
--

Multi-source replication in MySQL 5.7 vs Tungsten Replicator

MySQL 5.7 comes with a new set of features, and multi-source replication is one of them. In a few words, this means that one slave can replicate from several masters simultaneously.

During the last couple of months I’ve been playing with this a lot, trying to analyze its potential for a real case I’ve been facing while working with a customer.

This was motivated by the fact that my customer is already using multi-sourced slaves with Tungsten Replicator, and I wanted to do a side-by-side comparison between Tungsten Replicator and multi-source replication in MySQL 5.7.

Consider the following scenario:

[Diagram: current mixed replication topology]
DB1 is our main master, handling mostly writes from several applications; it also needs to serve read traffic, which is pushing its capacity close to the limit. It has 6 replication slaves attached using regular replication.
A1, A2, A3, B1, B2 and DB7 are reporting slaves used to offload some reads from the master and to work on some offline ETL processes.

Since they had some idle capacity, the customer decided to go further and set up a different architecture:
A1 and B1 also became masters of other slaves using Tungsten Replicator. Group A is a set of servers for a statistics application and group B serves a finance application, so A2, A3 and B2 became multi-sourced slaves.
New applications write directly to A1 and B1 without impacting the write capacity of the main master.

Pros and Cons of this approach

Pros

  • It just works. We’ve been running this way for a long time now and we haven’t suffered major issues.
  • Tungsten Replicator has some built-in tools and scripts to make slave provisioning easy.

Cons

  • Tungsten Replicator is a great product, but bigger than needed for this architecture. In some cases we had to configure the Java Virtual Machine with 4GB of RAM to make it work properly.
  • Tungsten is a complex tool that needs some extra expertise to deploy, operate and troubleshoot when errors happen (e.g., handling duplicate key errors).

With all this in mind, we moved a step forward and started to test whether we could move this architecture to legacy replication only.

New architecture design:
[Diagram: new consolidated replication topology]

We added some storage capacity to DB7 for our testing purposes, and the goal here is to replace all the Tungsten-replicated slaves with a single server where all the databases are consolidated.

Because of some data dependencies we weren’t able to completely separate A1 and B1 into master-only servers, so they are currently acting as masters of DB7 and slaves of DB1. By data dependency I mean that DB1 replicates its schemas to all of its direct slaves, including DB7. DB7 also receives replication of the finance DB running locally on B1 and the stats DB running locally on A1.

Some details about how this was done and how multi-source replication is implemented:

  • The main difference from regular replication, as known up to version 5.6, is that you now have replication channels; each channel represents a different source, in other words each master gets its own replication channel.
  • Replication needs to be set up as crash safe, meaning that both the master_info_repository and relay_log_info_repository variables need to be set to TABLE (see the sketch after this list).
  • We didn’t consider GTID because the servers acting as masters run different versions than our test multi-sourced slave.
  • log_slave_updates needs to be disabled on A1 and B1 to avoid duplicate data arriving at DB7 through both replication paths.
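For reference, a minimal my.cnf sketch of the settings above as they would appear on the multi-sourced slave (DB7); note that log_slave_updates is disabled on the intermediate masters (A1 and B1), not here:

[mysqld]
# crash-safe repositories, required when running multiple channels
master_info_repository    = TABLE
relay_log_info_repository = TABLE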

Pros and Cons of this approach

Pros

  • MySQL 5.7 can replicate from masters of different versions; we tested multi-source replication working with 5.5 and 5.6 masters simultaneously and didn’t suffer problems besides the known changes to timestamp-based fields.
  • Administration becomes easier. Any DBA already familiar with legacy replication can adapt to handling multiple channels without much learning: some new variables, a couple of new tables, and you’re ready to go.

Cons

  • 5.7 is not production ready yet. At this point we don’t have a GA release date, which means we can expect bugs to appear in the short/mid term.
  • Multi-source is still tricky for some special cases: database and table filtering works globally (you can’t set per-channel filters), and administration commands like sql_slave_skip_counter are still global, which means you can’t easily skip a statement in a particular channel.

Now the funny part: The How

It was easier than you might think. First of all we needed to start from a backup of the data coming from our masters. Due to the versions used in production (the main master is 5.5; A1 and B1 are 5.6) we started from a logical dump, so we avoided dealing with mysql_upgrade issues.

Disclaimer: this does not pretend to be a guide on how to setup multi-source replication

For our case we did the backup/restore using mydumper/myloader as follows:

[root@db1]$ mydumper -l 600 -v 3 -t 8 --outputdir /mnt/backup_db1/20150708 --less-locking --regex="^(database1.|database2.|database3.)"
[root@a1]$ mydumper -l 600 -v 3 -t 8 --outputdir /mnt/backup_a1/20150708 --less-locking --regex="^(tungsten_stats.|stats.)"
[root@b1]$ mydumper -l 600 -v 3 -t 8 --outputdir /mnt/backup_b1/20150708 --less-locking --regex="^(tungsten_finance.|finance.)"

Notice that each command was run on its respective master server. Now the restore part:

[root@db7]$ myloader -d /mnt/backup_db1/20150708  -o -t 8 -q 10000 -h localhost
[root@db7]$ myloader -d /mnt/backup_a1/20150708 -o -t 8 -q 10000 -h localhost
[root@db7]$ myloader -d /mnt/backup_b1/20150708 -o -t 8 -q 10000 -h localhost

So at this point we have a new slave with a copy of the databases from 3 different masters. For context, we also needed to dump/restore the tungsten* databases because they are constantly updated by the Replicator (which at this point is still in use). Pretty easy, right?

Now the most important part of this whole process: setting up replication. The procedure is very similar to regular replication, but now we need to consider which binlog position is needed for each replication channel. This is very easy to get from each backup, in this case by reading the metadata file created by mydumper. Any common backup method (logical or physical) gives you a way to get the binlog coordinates, for example --master-data=2 in mysqldump or the xtrabackup_binlog_info file in xtrabackup.

Once we have the replication info (and have created a replication user on each master), we only need to run the familiar CHANGE MASTER TO and START SLAVE commands, but now with a new twist:

db7:information_schema> change master to master_host='db1', master_user='rep', master_password='rep', master_log_file='db1-bin.091487', master_log_pos=74910596 FOR CHANNEL 'main_master';
       Query OK, 0 rows affected (0.02 sec)
db7:information_schema> change master to master_host='a1', master_user='rep', master_password='rep', master_log_file='a1-bin.394460', master_log_pos=56004 FOR CHANNEL 'a1_slave';
       Query OK, 0 rows affected (0.02 sec)
db7:information_schema> change master to master_host='b1', master_user='rep', master_password='rep', master_log_file='b1-bin.1653245', master_log_pos=2563356 FOR CHANNEL 'b1_slave';
       Query OK, 0 rows affected (0.02 sec)

Replication is set and now we are good to go:

db7:information_schema> START SLAVE FOR CHANNEL 'main_master';
       Query OK, 0 rows affected (0.00 sec)
db7:information_schema> START SLAVE FOR CHANNEL 'a1_slave';
       Query OK, 0 rows affected (0.00 sec)
db7:information_schema> START SLAVE FOR CHANNEL 'b1_slave';
       Query OK, 0 rows affected (0.00 sec)

The new commands include the FOR CHANNEL 'channel_name' option to handle replication channels independently.

At this point we have a slave running 3 replication channels from different sources, and we can check the status of replication with the familiar SHOW SLAVE STATUS command (TL;DR):

db7:information_schema> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: db1
                  Master_User: rep
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: db1-bin.077011
          Read_Master_Log_Pos: 15688468
               Relay_Log_File: db7-relay-main_master.000500
                Relay_Log_Pos: 18896705
        Relay_Master_Log_File: db1-bin.076977
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%,temp.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 18896506
              Relay_Log_Space: 2260203264
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 31047
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1004
                  Master_UUID: 65107c0c-7ab5-11e4-a85a-bc305bf01f00
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: System lock
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name: main_master
*************************** 2. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: a1
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: a1-bin.072336
          Read_Master_Log_Pos: 10329256
               Relay_Log_File: db7-relay-db3_slave.000025
                Relay_Log_Pos: 10329447
        Relay_Master_Log_File: a1-bin.072336
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%,temp.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 10329256
              Relay_Log_Space: 10329697
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 4000
                  Master_UUID: 0f061ec4-6fad-11e4-a069-a0d3c10545b0
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name: a1_slave
*************************** 3. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: b1.las1.fanops.net
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: b1-bin.093214
          Read_Master_Log_Pos: 176544432
               Relay_Log_File: db7-relay-db8_slave.000991
                Relay_Log_Pos: 176544623
        Relay_Master_Log_File: b1-bin.093214
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table: mysql.%,temp.%
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 176544432
              Relay_Log_Space: 176544870
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1001
                  Master_UUID:
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
         Replicate_Rewrite_DB:
                 Channel_Name: b1_slave
3 rows in set (0.00 sec)

Yeah, I know, the output is too large. The Oracle guys noticed it too, so they have created a set of new tables in the performance_schema database to help us retrieve this information in a friendlier manner; check this link for more information. We could also run SHOW SLAVE STATUS FOR CHANNEL 'b1_slave', for instance.
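For instance, here is a compact per-channel health check against one of those tables (column names as found in the 5.7 performance_schema):

db7:information_schema> SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE FROM performance_schema.replication_connection_status;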

Some limitations I found during testing:

  • As mentioned, some configurations are still global and can't be set per replication channel. For instance, replication filters can be changed without restarting MySQL, but they will affect all replication channels, as you can see here.
  • Replication events are serialized on the slave side through something like a global counter that is not well documented yet. In practice this means you need to be very careful when troubleshooting, because you may run into unexpected behavior. For instance, if you have two replication channels failing with a duplicate-key error, it is not easy to predict which event you will skip when running SET GLOBAL sql_slave_skip_counter=1 (see the sketch after this list).
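A minimal sketch of that ambiguity, assuming both channels are stopped on a duplicate-key error:

-- the counter is global, not per channel
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE FOR CHANNEL 'a1_slave';
START SLAVE FOR CHANNEL 'b1_slave';
-- which channel's offending event was skipped? Hard to predict.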

Conclusions
So far this new feature looks very nice and gives slaves some extra flexibility, which helps reduce architecture complexity when we want to consolidate databases from different sources into a single server. After some time testing it, I'd say I prefer this type of replication over Tungsten Replicator in this kind of scenario due to its simplicity of administration, i.e. pt-table-checksum and pt-table-sync work without the limitations they face under Tungsten.

With the exception of some limitations that need to be addressed, I believe this new feature is a game changer and will definitely make DBAs' lives easier. I still have a lot to test, but that is material for a future post.

The post Multi-source replication in MySQL 5.7 vs Tungsten Replicator appeared first on Percona Data Performance Blog.

Jul
24
2015
--

InnoDB vs TokuDB in LinkBench benchmark

Previously I tested Tokutek’s Fractal Trees (TokuMX & TokuMXse) as MongoDB storage engines – today let’s look into the MySQL area.

I am going to use a modified LinkBench under a heavy I/O load.

I compared InnoDB without compression, InnoDB with 8k compression, and TokuDB with QuickLZ compression.
The uncompressed data size is 115GiB; the cache size is 12GiB for InnoDB, and 8GiB plus 4GiB of OS cache for TokuDB.

It is important to note that I used tokudb_fanout=128, which is only available in our latest Percona Server release.
I will write more later on Fractal Tree internals and what tokudb_fanout means. For now let's just say it changes the shape of the fractal tree (compared to the default tokudb_fanout=16). A sketch of the assumed settings follows.
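As a sketch, the relevant server settings might look like this in my.cnf (values assumed from the description above, not the exact benchmark configuration, which is linked below):

[mysqld]
# assumed values, matching the cache sizes described above
tokudb_fanout           = 128   # needs a recent Percona Server that ships this variable
tokudb_cache_size       = 8G
innodb_buffer_pool_size = 12G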

I am using two storage options:

  • Intel P3600 PCIe SSD 1.6TB (marked as “i3600” on charts) – as a high-end performance option
  • Crucial M500 SATA SSD 900GB (marked as “M500” on charts) – as a low-end SATA SSD

The full results and engine options are available here.

Results on Crucial M500 (throughput, more is better)

Crucial M500

Engine Throughput [ADD_LINK/10sec]

  • InnoDB: 6029
  • InnoDB 8K: 6911
  • TokuDB: 14633

Here TokuDB outperforms InnoDB by almost a factor of two, but it also shows a great variance in results, which I attribute to checkpoint activity.

Results on Intel P3600 (throughput, more is better)

Intel P3600

Engine Throughput [ADD_LINK/10sec]

  • InnoDB: 27739
  • InnoDB 8K: 9853
  • TokuDB: 20594

To understand why InnoDB shines on fast storage, let's review the I/O usage of all engines.
The following chart shows the reads in KiB that each engine performs, on average, per client request.

IO Reads

The following chart shows the writes in KiB that each engine performs, on average, per client request.

IO Writes

Here we can make an interesting observation: TokuDB on average performs half as many writes as InnoDB, and this is what allows TokuDB to do better on slow storage. On fast storage, where many writes carry little performance penalty, InnoDB is able to get ahead, as InnoDB is still better at using CPUs.

Though it is worth remembering that:

  • On fast, expensive storage, TokuDB provides better compression, which allows you to store more data in a limited capacity
  • TokuDB still writes two times less than InnoDB, which means twice the lifetime for an SSD (still expensive).

Also, looking at the results, I conclude that InnoDB compression is inefficient in its implementation, as it fails to capture the available benefits: first, from doing fewer reads (well, it helps it beat uncompressed InnoDB, but not by much); and, second, from fast storage.

The post InnoDB vs TokuDB in LinkBench benchmark appeared first on Percona Data Performance Blog.

Jul
23
2015
--

The Q&A: Creating best-in-class backup solutions for your MySQL environment

Thank you for attending my July 15 webinar, “Creating Best in Class Backup solutions for your MySQL environment.” Due to the amount of content we discussed and some minor technical difficulties near the end of the webinar, we decided to cover the final two slides of the presentation, along with the questions attendees asked, in this blog post.

The slides are available for download, and you can watch the webinar in its entirety here.

The final two slides were about our tips for having a good backup and recovery strategy. Let's go through the bullet points along with what would have been their explanation during the webinar:

  • Use the three types of backups
    • Binary for full restores, new slaves
      • Binary backups are easy to restore and take the least amount of time to restore. The mean time to recover is mostly bound by the time needed to transfer the backup to the appropriate target server.
    • Logical for partial restores
      • Logical backups, especially when done table-wise, come in handy when you want to restore one or a few smaller tables.
    • Binlog for point in time recovery
      • Very often the need is for point-in-time recovery. A full backup of any type (logical or binary) is only half the story; we still need the DML statements processed on the server to bring it to the latest state, and that's where binary log (binlog) backups come into the picture (see the first sketch after this list).
  • Store on more than one server and off-site
    • Store your backups in more than one location; what if the backup server goes down? Offsite storage like Amazon S3 and Glacier, with weekly or monthly backup retention, can be a cheaper option.
  • Test your backups!!!!
    • Testing your backups is very important; it's always great to know backups are recoverable and not corrupted. Spin up an EC2 instance if you want, copy and restore the backup there, and roll forward a day's worth of binlogs just to be sure.
  • Document restore procedures, script them and test them!!!
    • Also, when you test your backups, make sure to document the steps to restore them, to avoid a last-minute hassle over which commands to use.
  • If taking from a slave run pt-table-checksum
    • Backups are mostly taken from slaves, so make sure to checksum them regularly; you don't want to back up inconsistent data.
  • Configuration files, scripts
    • Data is not the only thing you should be backing up: back up your config files, scripts and user access in a secure location.
  • Do you need to back up everything every day?
    • For very large instances, doing a logical backup is a tough one. In such cases, evaluate your backup needs: do you want to back up all the tables? Most of the time the smaller tables are the more important ones and need partial restores, so back up only those.
  • Hardlinking backups can save a lot of disk space in some circumstances
    • Some schemas contain only a few high-activity tables; the rest are probably updated once a week, or by an archiver job that runs monthly. Make sure to hardlink those files with the previous backup; it can save a good amount of space in such scenarios (see the second sketch after this list).
  • Monitor your Backups
    • Lastly, monitor your backups. You do not want to find out that your backup had been failing the whole time. Even a simple email notification from your backup scripts can help reduce the chance of failure.
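As a first sketch, here is what the binlog roll-forward for point-in-time recovery can look like once the full backup has been restored (file names, dates and credentials are hypothetical):

# replay binlogs up to just before the incident
mysqlbinlog --start-datetime="2015-07-15 00:00:00" \
            --stop-datetime="2015-07-15 11:59:59" \
            mysql-bin.000421 mysql-bin.000422 | mysql -u root -p

The --stop-datetime boundary is what lets you stop replaying just before a bad statement, such as an accidental DROP TABLE.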
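And as a second sketch, the hardlinking idea using rsync's --link-dest (directory names are hypothetical): unchanged files become hard links into the previous backup, so only files that actually changed consume new disk space.

# unchanged files are hard-linked to yesterday's backup instead of copied
rsync -a --link-dest=/backups/2015-07-14 /data/backup-staging/ /backups/2015-07-15/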

Now let's try to answer some of the questions asked during the webinar:

Q: --use-memory=2G, is that pretty standard? If we have more memory, should we use a higher value?
Usually we would evaluate the value based on the size of xtrabackup_logfile (the amount of transactions to apply). If you have more free memory, feel free to give it to --use-memory; you don't want memory to be the bottleneck in the restore process. A sketch follows.
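For example, a sketch of the prepare step with a larger apply buffer (the backup path is hypothetical):

# apply the transaction log to the backup with a 4G buffer
innobackupex --apply-log --use-memory=4G /backups/2015-07-15_full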

Q: Which is the best backup option for an 8TB DB?
Usually it would depend on the type of data you have and the business requirements for the backups. For example, a full xtrabackup plus incrementals on the weekdays would be a good idea (a sketch follows). The time required for backups plays an important role here: backing up to a slow NAS share can be time consuming, and it will make xtrabackup record a lot of transactions, which will further increase your restore time. Also look into backing up very important small and medium-sized tables via logical backups.
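A sketch of that weekly-full-plus-incrementals strategy (directory layout and timestamps are hypothetical):

# weekly full backup
innobackupex /backups/full
# daily incremental, based on the most recent full backup
innobackupex --incremental /backups/incr \
    --incremental-basedir=/backups/full/2015-07-19_23-00-00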

Q: I'm not sure if this will be covered, but if you have a three-node master-master-master cluster using HAProxy, is it recommended to run the backup from the HAProxy server or directly on a specific server? Would it be wise to have a fourth server which would be part of the cluster, but not read from, to perform the backups?
I am assuming this is a Galera cluster setup, in which case you can take backups locally on any of the nodes using tools like Percona XtraBackup. However, the best solution would be spinning off a slave from one of the nodes and running backups there.

Q: With Mydumper, can we stream the data over SSH or netcat to another server, or would one have to use something like NFS? I've used mysqldump and piped it over netcat before; curious if we can do that with Mydumper?
Mydumper is similar in nature to other MySQL client tools: it can be run remotely (the --host option), which means you can run mydumper from another server to back up the master or a slave. Mydumper output can certainly be piped, too. A sketch follows.
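For instance, a sketch of a remote run (host and credentials are hypothetical):

# run mydumper from the backup host against a remote master or slave
mydumper --host=db1.example.com --user=backup --password=secret \
         --outputdir=/backups/dump --compress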

Q: Is Mydumper still maintained? It hasn't had a release since March of last year.
Indeed, Max Bubenick from Percona is currently maintaining the project. He has added new features that make the tool more comprehensive and feature-rich, and he is planning the next release soon; stay tuned for the blog post.

Q: Is MyDumper open source? Are prepare and restore the same?
Absolutely, it is open source. Right now you need to download the source and compile it, but very soon we will have packages built for it too. Prepare and restore are common terms in backup lingo: in the webinar, restore means copying the backup files from their storage location back to the destination location, whereas prepare means applying the transactions to the backup and making it ready to restore.

Q: Is binlog mirroring needed on Galera (PXC)?
It is a good idea to keep binlog mirroring. Even though IST and SST will do their job to join a node, the binlogs can still play a role if you want to roll forward a particular schema on a slave or QA instance.

Q: As we know, Percona XtraBackup takes full and incremental backups. Does MyDumper help in taking incremental backups?
At this moment we do not have the ability to take incremental backups with mydumper or with any other logical backup tool. However, weekly full (logical) backups plus daily binlog backups (a sketch follows) can serve the same purpose as other incremental backup solutions, and they are easy to restore. :)
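As a sketch, binlogs can be mirrored continuously to the backup host with mysqlbinlog (available from MySQL 5.6 on; host and credentials are hypothetical):

# stream the master's binlogs as they are written
mysqlbinlog --read-from-remote-server --host=master.example.com \
    --user=backup --password=secret --raw --stop-never mysql-bin.000421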

Q: Is it possible to encrypt the output file? What would be the best methodology to back up data with a database size of 7 to 8GB that increases 25% each day? What is the difference between innobackupex and mydumper?
Indeed it's possible to encrypt the backup files; as a matter of fact, we encrypt backup files with GPG keys before uploading them to offsite storage (a sketch follows). The best method to back up a 7 to 8GB instance would be implementing all three types of backup we discussed in the webinar; your scenario requires planning for the future, so it's always best to have different solutions available as the data grows. Innobackupex is part of the Percona XtraBackup toolkit and is a script which takes binary backups of databases, whereas MyDumper is a logical backup tool which creates backups as text files.
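For illustration, a GPG sketch (the recipient key and file name are hypothetical):

# encrypt the backup archive with the team's public key before uploading
gpg --encrypt --recipient backups@example.com backup-2015-07-15.tar.gz
# this produces backup-2015-07-15.tar.gz.gpg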

Q: How can I optimize a mysqldump of a large database? The main bottleneck while taking a mysqldump backup of a large database is that if any table is found to be corrupted, the dump cannot skip it and move on. Can we take a backup of a large database without locking, i.e. does someone know how to make the backup without locking the tables? Are there any tools that would be faster for backup and restore, or how can we use mysqldump to handle this kind of issue during future crash recovery?
Mysqldump is a logical backup tool, and as such it executes full table scans to back up the tables and write them to the output file; hence it's very difficult to improve the performance of mysqldump (query-wise). Assuming the corruption refers to MyISAM tables, it is highly recommended you repair them before backing up; also, to make sure mysqldump doesn't fail due to an error on such a corrupt table, try the --force option. If you're using MyISAM tables, the first recommendation would be to switch to InnoDB; with most of the tables on InnoDB, locking can be greatly reduced, to the point where it is negligible: look into --single-transaction (a sketch follows). Faster backup and recovery can be achieved with binary backups: look into the Percona XtraBackup tool, for which we have comprehensive documentation to get you started.
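As a sketch of the non-blocking dump for InnoDB tables:

# consistent InnoDB dump without locking writers out;
# --force keeps going past errors on corrupt tables instead of aborting
mysqldump --single-transaction --force --all-databases > dump.sql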

I hope this was a good webinar and that we answered most of your questions. Stay tuned for more webinars from Percona.

The post The Q&A: Creating best-in-class backup solutions for your MySQL environment appeared first on Percona Data Performance Blog.

Jul
22
2015
--

SELinux and the MySQL init script

I recently worked with a customer who had a weird issue: with their MySQL server (Percona Server 5.5) already started, running service mysql start a second time would not detect that an instance was already running. As a result, the init script tried to start a second instance with the same settings as the first one. Of course this fails, and it creates a mess. What was the issue? A missing rule in SELinux. At least that's what it looked like.

Summary

If SELinux is set to enforcing and you are using Percona Server on CentOS/RHEL 6 (other versions could be affected), service mysql start doesn't work properly and the fix is simple:

# grep mysqld_safe /var/log/audit/audit.log | audit2allow -M mysqld_safe
# semodule -i mysqld_safe.pp
# service mysql restart

Other options are:

  • Set SELinux to permissive
  • Use the CentOS/RHEL standard MySQL init script (note I didn’t extensively check if that could trigger other errors)

How did we see the issue?

That was pretty easy: if an instance is already running and if you run service mysql start again, you should see something like this in the MySQL error log:

150717 08:47:58 mysqld_safe A mysqld process already exists

But if you instead see tons of error messages like these:

2015-07-17 08:47:05 27065 [ERROR] InnoDB: Unable to lock ./ibdata1, error: 11
2015-07-17 08:47:05 27065 [Note] InnoDB: Check that you do not already have another mysqld process using the same InnoDB data or log files.

it means that the init script is broken somewhere.

Investigation

When the issue was brought to my attention, I tried to reproduce it on my local box, but with no luck. What was so special in the configuration used by the customer?

The only thing that was slightly out of the ordinary was SELinux, which was set to enforcing. We set SELinux to permissive, and guess what? service mysql start now worked properly and didn't allow two concurrent instances to run! The sketch below shows how to make that switch for testing.
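For reference, a minimal sketch of switching modes for a test (setenforce changes the mode only until reboot; edit /etc/selinux/config to persist):

getenforce       # prints Enforcing, Permissive or Disabled
setenforce 0     # switch to permissive at runtime
service mysql start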

Next step was to look at the SELinux logs to find any error related to MySQL, and we discovered messages like:

type=SYSCALL msg=audit(1437121845.464:739): arch=c000003e syscall=62 success=no exit=-13
a0=475 a1=0 a2=0 a3=7fff0e954130 items=0 ppid=1 pid=5732 auid=500 uid=0 gid=0 euid=0 suid=0
fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts0 ses=5 comm="mysqld_safe" exe="/bin/bash"
subj=unconfined_u:system_r:mysqld_safe_t:s0 key=(null)

At this point, we knew that a rule was missing for mysqld_safe and that we needed to add a new one.

Deeper investigation

Actually, what happens is that SELinux prevents this check in mysqld_safe from being executed:

if kill -0 $PID > /dev/null 2> /dev/null

Syscall 62 in the audit log above is kill, so SELinux denies the kill -0 probe; the test fails, and the script assumes that no mysqld process is running. That's why a second mysqld is started.

However, users of Oracle MySQL will probably never experience this issue, simply because the init script is a bit different: before calling mysqld_safe, the init script tries to ping a potentially running mysqld instance, and if it gets a positive reply or an Access denied error, it concludes that mysqld is already running and doesn't invoke mysqld_safe.
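For illustration, a simplified sketch of that style of guard (not the actual Oracle init script, which is more involved):

# ping-style guard against starting a second instance
if mysqladmin ping > /dev/null 2>&1; then
    # mysqladmin ping exits 0 even on "Access denied", since that
    # still proves a server is listening
    echo "mysqld already running, not starting a second instance"
else
    /usr/bin/mysqld_safe &
fi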

The fix

Fortunately, this is quite simple. You can generate the corresponding rule with audit2allow:

grep mysqld_safe /var/log/audit/audit.log | audit2allow -M mysqld_safe

And after checking the corresponding .te file, we were able to load that new module:

semodule -i mysqld_safe.pp

After stopping MySQL, you can now use service mysql start normally.

Conclusion

This issue was quite interesting to work on because finding the culprit was not easy. Also, it only triggers when SELinux is set to enforcing and Percona Server is used. Now, should the init script of Percona Server be fixed? I'm not sure of the potential problems that could occur if we did so, but of course feel free to leave your feedback in the comments.

The post SELinux and the MySQL init script appeared first on MySQL Performance Blog.
