Q&A: High availability when using MySQL in the cloud

Percona MySQL webinar followup: Q&ALast week I hosted a webinar on using MySQL in the cloud for High Availability (HA) alongside 451 Research analyst Jason Stamper. You can watch the recording and also download the slides (free) here. Just click the “Register” button at the end of that page.

We had several excellent questions and we didn’t have time to get to several of them in the allotted time. I’m posting them here along with the answers. Feel free to ask follow-up questions in the comments below.

Q: Can the TokuDB engine be used in a PXC environment?

A: No, TokuDB cannot currently be used in a PXC environment, the only supported engine in Percona XtraDB Cluster 5.6 is InnoDB.

Q: With Galera replication (PXC), is balancing the load on each node?

A: No, you need to implement your own load balancing and HA layer between your clients and the Percona XtraDB Cluster server.  Examples mentioned in the webinar include HAProxy and F5 BigIP.

Q: What’s the best version of Percona XtraDB Cluster regarding InnoDB performance?

A: In general for best performance you should be using the latest release of Percona XtraDB Cluster 5.6, which is currently 5.6.24, released on June 3rd, 2015.

Q: Can I redirect my writes in Percona XtraDB Cluster to multiple nodes using the HAProxy? While trying with SysBench I can see write-only goes to first nodes in PXC while reads does goes to multiple nodes.

A: Yes you can configure HAProxy to distribute both reads and writes across all of your nodes in a Percona XtraDB Cluster environment. Perhaps SysBench created only one database connection for all writes, and so haproxy kept those confined to only one host. You may want to experiment with parallel_prepare.lua.

Q: What’s the optimal HA for small datasets (db is less than 10gb)?

A: The optimal HA deployment for small datasets would be dependent on your level of recovery required (tolerance for loss of transactions) and time that you can be in an unavailable state (seconds, minutes, hours?).  Unfortunately there isn’t a single answer to your question, however, if you are interested in further discussion on this point Percona would be happy to coordinate a time to speak.  Please feel free to contact me directly and we can continue the conversation at michael.coburn@percona.com.

 Q: Is there a concept of local master vs. remote master with PXC?

A: No there is no concept of local vs remote master.  All nodes in a Percona XtraDB Cluster can now be classified as Master, regardless of their proximity to the clients.

Q: Are there any concerns when considering AWS RDS or AURORA DB for MySQL HA in the Cloud?

A: Regarding AWS RDS, yes this a good option for MySQL HA in the Cloud.  I unfortunately haven’t worked with Aurora DB that much yet so I don’t have an opinion on it’s suitability for HA in the Cloud.

Q: We tried out PXC awhile back and it used to lock everything whenever any ddl was done. Has that changed?

A: We would have to look at the specifics of your environment, however, there have been numerous improvements in the 1½ years of development since Percona XtraDB Cluster went Generally Available (GA) on January 30th, 2014 in version 5.6.15.

Q: Is using the arbitrator a must?

A: No the arbitrator role via the garbd daemon is generally only used when operating in a minimal environment of two nodes that contain the data and you need a third node for quorum but don’t want to store the data a third time.

Q: Can we do a cluster across different zones?

A: Yes you can. However be aware that the latency incurred for all cluster certification operations will be impacted by the round trip time between nodes.

Q: Does PXC also support the MyISAM database?

A: No, Percona XtraDB Cluster does not support any storage engine other than InnoDB as of PXC 5.6.

Q: How do load balancers affect the throughput in a Galera-based setup given that the write would be limited by the slowest node?

A: Load balancers will introduce some measure of additional latency in the form of CPU time in the load balancer layer as it evaluates its own ruleset, and also in network time due to additional hop via load balancer.  Otherwise there should be no perceptible difference in the write throughput of a Percona XtraDB Cluster with and without a load balancer as it relates to the “slowest node” factor.

Q: Have you used MaxScale yet? If so, what are your thoughts?

A: Unfortunately I haven’t used MaxScale however Yves Trudeau, Percona Principal Architect, has recently written about MaxScale in this blog post.

Q: How do you configure timeout and maintain persistent connection to HAProxy?

A: I would encourage you to refer to the HAProxy Documentation.

The post Q&A: High availability when using MySQL in the cloud appeared first on MySQL Performance Blog.


Update on the InnoDB double-write buffer and EXT4 transactions

In a post, written a few months ago, I found that using EXT4 transactions with the “data=journal” mount option, improves the write performance significantly, by 55%, without putting data at risk. Many people commented on the post mentioning they were not able to reproduce the results and thus, I decided to further investigate in order to find out why my results were different.

So, I ran sysbench benchmarks on a few servers and found when the InnoDB double-write buffer limitations occur and when they don’t. I also made sure some of my colleagues were able to reproduce the results. Basically, in order to reproduce the results you need the following conditions:

  • Spinning disk (no SSD)
  • Enough CPU power
  • A dataset that fits in the InnoDB buffer pool
  • A continuous high write load with many ops waiting for disk

Using the InnoDB double write buffer on an SSD disk somewhat prevents us from seeing the issue, something good performance wise. That comes from the fact that the latency of each write operation is much lower. That makes sense, the double-writer buffer is an area of 128 pages on disk that is used by the write threads. When a write thread needs to write a bunch of dirty pages to disk, it first writes them sequentially to free slots in the double write buffer in a single iop and then, it spends time writing the pages to their actual locations on disk using typically one iop per page. Once done writing, it releases the double-write buffer slots it was holding and another thread can do its work. The presence of a raid controller with a write cache certainly helps, at least until the write cache is full. Thus, since I didn’t tested with a raid controller, I suspect a raid controller write cache will delay the apparition of the symptoms but if the write load is sustained over a long period of time, the issue with the InnoDB double write buffer will appear.

So, to recapitulate, on a spinning disk, a write thread needs to hold a lock on some of the double-write buffer slots for at least a few milliseconds per page it needs to write while on a SSD disk, the slots are released very quickly because of the low latency of the SSD storage. To actually stress the InnoDB double-write buffer on a SSD disk, one must push much more writes.

That leads us to the second point, the amount of CPU resources available. At first, one of my colleague tried to reproduce the results on a small EC2 instance and failed. It appeared that by default, the sysbench oltp.lua script is doing quite a lot of reads and those reads saturate the CPU, throttling the writes. By lowering the amount of reads in the script, he was then able to reproduce the results.

For my benchmarks, I used the following command:

sysbench --num-threads=16 --mysql-socket=/var/lib/mysql/mysql.sock
--mysql-database=sbtest --mysql-user=root
--test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=50000000
--oltp-test-mode=complex --mysql-engine=innodb --db-driver=mysql
--report-interval=60 --max-requests=0 --max-time=3600 run

Both servers used were metal boxes with 12 physical cores (24 HT). With less CPU resources, I suggest adding the following parameters:


So that the CPU is not wasted on reads and enough writes are generated. Remember we are not doing a generic benchmarks, we are just stressing the InnoDB double-write buffer.

In order to make sure something else isn’t involved, I verified the following:

  • Server independence, tried on 2 physical servers and one EC2 instance, Centos 6 and Ubuntu 14.04
  • MySQL provided, tried on MySQL community and Percona Server
  • MySQL version, tried on 5.5.37 and 5.6.23 (Percona Server)
  • Varied the InnoDB log file size from 32MB to 512MB
  • The impacts of the number of InnoDB write threads (1,2,4,8,16,32)
  • The use of Linux native asynchronous iop
  • Spinning and SSD storage

So, with all those verifications done, I can maintain that if you are using a server with spinning disks and a high write load, using EXT4 transactions instead of the InnoDB double write buffer yields to an increase in throughput of more than 50%. In an upcoming post, I’ll show how the performance stability is affected by the InnoDB double-write buffer under a high write load.

Appendix: the relevant part of the my.cnf

innodb_buffer_pool_size = 12G
innodb_write_io_threads = 8 # or else in {1,2,4,8,16,32}
innodb_read_io_threads = 8
innodb_flush_log_at_trx_commit = 0 # must be 0 or 2 to really stress the double write buffer
innodb_log_file_size = 512M # or 32M, 64M
innodb_log_files_in_group = 2
innodb_flush_method=O_DIRECT # or O_DSYNC
innodb_buffer_pool_restore_at_startup=300 # On 5.5.x, important to warm up the buffer pool
#innodb_buffer_pool_load_at_startup=ON # on 5.6, important to warm up the buffer pool
#innodb_buffer_pool_dump_at_shutdown=ON # on 5.6, important to warm up the buffer pool,
skip-innodb_doublewrite # or commented out
innodb_flush_neighbor_pages=none # or area for spinning

The post Update on the InnoDB double-write buffer and EXT4 transactions appeared first on MySQL Performance Blog.


MaxScale: A new tool to solve your MySQL scalability problems

Ever since MySQL replication has existed, people have dreamed of a good solution to automatically split read from write operations, sending the writes to the MySQL master and load balancing the reads over a set of MySQL slaves. While if at first it seems easy to solve, the reality is far more complex.

First, the tool needs to make sure it parses and analyses correctly all the forms of SQL MySQL supports in order to sort writes from reads, something that is not as easy as it seems. Second, it needs to take into account if a session is in a transaction or not.

While in a transaction, the default transaction isolation level in InnoDB, Repeatable-read, and the MVCC framework insure that you’ll get a consistent view for the duration of the transaction. That means all statements executed inside a transaction must run on the master but, when the transaction commits or rollbacks, the following select statements on the session can be again load balanced to the slaves, if the session is in autocommit mode of course.

Then, what do you do with sessions that set variables? Do you restrict those sessions to the master or you replay them to the slave? If you replay the set variable commands, you need to associate the client connection to a set of MySQL backend connections, made of at least a master and a slave. What about temporary objects like with “create temporary table…”? How do you deal when a slave lags behind or what if worse, replication is broken? Those are just a few of the challenges you face when you want to build a tool to perform read/write splitting.

Over the last few years, a few products have tried to tackle the read/write split challenge. The MySQL_proxy was the first attempt I am aware of at solving this problem but it ended up with many limitations. ScaleARC does a much better job and is very usable but it stills has some limitations. The latest contender is MaxScale from MariaDB and this post is a road story of my first implementation of MaxScale for a customer.

Let me first introduce what is MaxScale exactly. MaxScale is an open source project, developed by MariaDB, that aims to be a modular proxy for MySQL. Most of the functionality in MaxScale is implemented as modules, which includes for example, modules for the MySQL protocol, client side and server side.

Other families of available modules are routers, monitors and filters. Routers are used to determine where to send a query, Read/Write splitting is accomplished by the readwritesplit router. The readwritesplit router uses an embedded MySQL server to parse the queries… quite clever and hard to beat in term of query parsing.

There are other routers available, the readconnrouter is basically a round-robin load balancer with optional weights, the schemarouter is a way to shard your data by schema and the binlog router is useful to manage a large number of slaves (have a look at Booking.com’s Jean-François Gagné’s talk at PLMCE15 to see how it can be used).

Monitors are modules that maintain information about the backend MySQL servers. There are monitors for a replicating setup, for Galera and for NDB cluster. Finally, the filters are modules that can be inserted in the software stack to manipulate the queries and the resultsets. All those modules have well defined APIs and thus, writing a custom module is rather easy, even for a non-developer like me, basic C skills are needed though. All event handling in MaxScale uses epoll and it supports multiple threads.

Over the last few months I worked with a customer having a challenging problem. On a PXC cluster, they have more than 30k queries/s and because of their write pattern and to avoid certification issues, they want to have the possibility to write to a single node and to load balance the reads. The application is not able to do the Read/Write splitting so, without a tool to do the splitting, only one node can be used for all the traffic. Of course, to make things easy, they use a lot of Java code that set tons of sessions variables. Furthermore, for ISO 27001 compliance, they want to be able to log all the queries for security analysis (and also for performance analysis, why not?). So, high query rate, Read/Write splitting and full query logging, like I said a challenging problem.

We experimented with a few solutions. One was a hardware load balancer that failed miserably – the implementation was just too simple, using only regular expressions. Another solution we tried was ScaleArc but it needed many rules to whitelist the set session variables and to repeat them to multiple servers. ScaleArc could have done the job but all the rules increases the CPU load and the cost is per CPU. The queries could have been sent to rsyslog and aggregated for analysis.

Finally, the HA implementation is rather minimalist and we had some issues with it. Then, we tried MaxScale. At the time, it was not GA and was (is still) young. Nevertheless, I wrote a query logging filter module to send all the queries to a Kafka cluster and we gave it a try. Kafka is extremely well suited to record a large flow of queries like that. In fact, at 30k qps, the 3 Kafka nodes are barely moving with cpu under 5% of one core. Although we encountered some issues, remember MaxScale is very young, it appeared to be the solution with the best potential and so we moved forward.

The folks at MariaDB behind MaxScale have been very responsive to the problems we encountered and we finally got to a very usable point and the test in the pilot environment was successful. The solution is now been deployed in the staging environment and if all goes well, it will be in production soon. The following figure is simplified view of the internals of MaxScale as configured for the customer:

MaxScale internals

The blocks in the figure are nearly all defined in the configuration file. We define a TCP listener using the MySQL protocol (client side) which is linked with a router, either the readwritesplit router or the readconn router.

The first step when routing a query is to assign the backends. This is where the read/write splitting decision is made. Also, as part of the steps required to route a query, 2 filters are called, regexp (optional) and Genlog. The regexp filter may be used to hot patch a query and the Genlog filter is the logging filter I wrote for them. The Genlog filter will send a json string containing about what can be found in the MySQL general query log plus the execution time.

Authentication attempts are also logged but the process is not illustrated in the figure. A key point to note, the authentication information is cached by MaxScale and is refreshed upon authentication failure, the refresh process is throttled to avoid overloading the backend servers. The servers are continuously monitored, the interval is adjustable, and the server status are used when the decision to assign a backend for a query is done.

In term of HA, I wrote a simple Pacemaker resource agent for MaxScale that does a few fancy things like load balancing with IPTables (I’ll talk about that in future post). With Pacemaker, we have a full fledge HA solution with quorum and fencing on which we can rely.

Performance wise, it is very good – a single core in a virtual environment was able to read/write split and log to Kafka about 10k queries per second. Although MaxScale supports multiple threads, we are still using a single thread per process, simply because it yields a slightly higher throughput and the custom Pacemaker agent deals with the use of a clone set of MaxScale instances. Remember we started early using MaxScale and the beta versions were not dealing gracefully with threads so we built around multiple single threaded instances.

So, since a conclusion is needed, MaxScale has proven to be a very useful and flexible tool that allows to elaborate solutions to problems that were very hard to tackle before. In particular, if you need to perform read/write splitting, then, try MaxScale, it is best solution for that purpose I have found so far. Keep in touch, I’ll surely write other posts about MaxScale in the near future.

The post MaxScale: A new tool to solve your MySQL scalability problems appeared first on MySQL Performance Blog.


Percona submits 7 talks for Vancouver OpenStack Summit (voting ends Feb. 23)

Percona has submitted seven talks for the next OpenStack Summit in Vancouver this May. And as with all OpenStack Summit events, the community decides the content. Voting ends February 23, and if you aren’t already an OpenStack Foundation member (required to vote), you can join now for free here.

Percona’s Vancouver OpenStack Summit proposals

Vancouver OpenStack Summit, photo from https://www.flickr.com/photos/73230975@N03Percona’s proposals include collaborations with top contributors across a variety of OpenStack services including Trove and Swift. You can vote for our talks by clicking the titles below that interest you.

MySQL and OpenStack Deep Dive
Speakers: Peter Boros, Jay Pipes (Mirantis)

Deep Dive into MySQL Replication with OpenStack Trove, and Kilo
Speakers: George Lorch, Amrith Kumar (Tesora)

MySQL on Ceph Storage: Exploring Design, Challenges and Benefits
Speakers: Yves Trudeau, Kyle Bader (Red Hat)

Core Services MySQL Database Backup & Recovery to Swift
Speakers: Kenny Gryp, Chris Nelson (SwiftStack)

Smart MySQL Log Management with Swift
Speakers: Matt Griffin, Chris Nelson (SwiftStack)

Discovering Better Database Architectures For Core Services In OpenStack
Speakers: Kenny Gryp, Matt Griffin

Upgrading your OpenStack Core Services On The Database Side: Nova, Neutron, Cinder…
Speakers: Kenny Gryp, Matt Griffin

Other interesting proposals

Here are a few proposals from other organizations that look particularly interesting. Please consider them as well.

Exploration into OpenStack Trove, customer use cases, and the future of Trove for the community
Speakers: Amrith Kumar (Tesora), Brad Topol (IBM), Mariam John (IBM)

The Entrepreneur’s Challenge: The Realities of Starting an OpenStack Company
Speakers: Simon Anderson (Dreamhost, Inktank), Ken Rugg (Tesora), Josh McKenty (Pivotal, Piston Cloud), Jesse Proudman (BlueBox), Joe Arnold (SwiftStack)

Making a Case for Your OpenStack Deployment: How Vendor’s Can Help
Speaker: Ryan Floyd (Storm Ventures)

Real World Experiences with Upgrading OpenStack at Time Warner Cable
Speakers: Clayton ONeill (Time Warner Cable), Matt Fischer (Time Warner Cable)

Percona & OpenStack

According to the most recent OpenStack User Survey in November 2014, Percona’s database software is a popular choice for OpenStack operators needing high availability.

OpenStack November 2014 User Survey  - Database Options

OpenStack User Survey results from November 2014 shows Percona XtraDB Cluster as the top Galera-based choice for production clouds.

Percona XtraDB Cluster, the top Galera-based MySQL cluster solution for production OpenStack deployments, incorporates the latest version of MySQL 5.6, Percona Server 5.6, Percona XtraBackup, and Galera. This combination delivers top performance, high availability, and critical security coverage with the latest features and fixes. Additionally, Percona Server is a popular guest database option with unique features designed for cloud operators offering DBaaS.

In addition to sharing our open source software with the OpenStack community, Percona is sharing our expertise in services like Trove, projects like the HA Guide update, extensive benchmark testing activities, and upcoming events like OpenStack Live 2015. The inaugural OpenStack Live Conference, April 13-14 in Santa Clara, California, will be a user-focused event. The program will cover database-related on topics like Trove as well as other OpenStack services and feature multiple 3-hour hands-on tutorials.

Percona is a proud supporter of OpenStack and we hope to see you in both Santa Clara in April and Vancouver in May. And in the meantime, don’t forget to vote!

The post Percona submits 7 talks for Vancouver OpenStack Summit (voting ends Feb. 23) appeared first on MySQL Performance Blog.


OpenStack: A MySQL DBA Perspective – Sept. 17 webinar

OpenStack: A MySQL DBA Perspective - Sept. 17 webinar

OpenStack: A MySQL DBA Perspective

I’ll have the pleasure to present, next Wednesday, September 17 at 10 a.m. PDT (1 p.m. EDT) a webinar titled “OpenStack: A MySQL DBA Perspective.” Everyone is invited.

The webinar will be divided into two parts. The first part will cover how MySQL can be used by the OpenStack infrastructure including the expected load, high-availability solutions and geo-DR.

The second part will focus on the use of MySQL within an OpenStack cloud. We’ll look into the various options that are available, the traditional ones and Trove. We’ll also discuss the block device options in regards with MySQL performance and, finally, we’ll review the high-availability implications of running MySQL in an OpenStack cloud.

Register here. I look forward to your questions, and if you have any related to OpenStack that I can help with in advance of the webinar please feel free to post those in the comments section below. I’ll write a followup post after the webinar to recap all related questions and answers. I’ll also provide the slides.

See you next Wednesday!

The post OpenStack: A MySQL DBA Perspective – Sept. 17 webinar appeared first on MySQL Performance Blog.


New in Percona Replication Manager: Slave resync, Async stop

Percona Replication Manager (PRM) continues to evolve and improve, I want to introduce two new features: Slave resync and Async stop.

Slave resync

This behavior is for regular non-gtid replication.  When a master crashes and is no longer available, how do we make sure all the slaves are in a consistent state. It is easy to find the most up to date slave and promote it to be the new master based on the master execution position, the PRM agent already does that but how do we apply the missing transactions to the other slaves.

In order to solve that problem, I modified a tool originally written by Yelp, that outputs the MD5 sums of the payload (XID boundaries) and the commit positions of a binlog file. It produces an output like:

root@yves-desktop:/home/yves/src/PRM/percona-pacemaker-agents/tools/prm_binlog_parser# ./prm_binlog_parser.static.x86_64 /opt/prm/binlog.000382 | tail -n 5

Where the first field is the commit position and the second field, the md5 sum of the payload. The only type of transaction that is not supported is “LOAD DATA INTO”. Of course, since it relies on XID values, this only works with InnoDB. It also requires the “log_slave_updates” to be enabled. The sources and some static binaries can be found in the tools directory on the PRM github, link is at the bottom.

So, if the agent detects a master crash and the prm_binlog_parser tool is available (the prm_binlog_parser_path primitive parameter), upon promotion, the new master will look at its binary logs and publish to the pacemaker CIB, the last 3000 transactions. The 3000 transactions limit is from bash, the command line must be below 64k. With some work this limit could be raised but I believe, maybe wrongly, that it covers most of the cases. The published data in the CIB attribute also contains the corresponding binlog file.

The next step happens on the slaves, when they get the post-promotion notification. They look for the MD5 sum of their last transaction in the relay log, again using the prm_binlog_parser tool, find the matching file and position in the last 3000 transactions the new master published to the CIB and reconfigure replication using the corrected file and position.

The result is a much more resilient solution that helps having the slave in a consistent state. My next goal regarding this feature will be to look at the crashed master and attempt to salvage any transactions from the old master binlog in the case MySQL crashed but Pacemaker is still up and running.


The async_stop feature, enabled by the “async_stop” primitive parameter, allows for faster failover. Without this feature, when stopping mysqld, PRM will wait until is confirmed stopped before completing a failover. When there’re many InnoDB dirty pages, we all know that stopping mysql can take many minutes. Jervin Real, a colleague at Percona, suggested that we should fake to Pacemaker that MySQL is stopping in order to proceed faster with failover. After adding some safe guards, it proved to be a very good idea. I’ll spare you of the implementation details but now, if the setting is enabled, as soon as mysqld is effectively stopping, the failover completes. If it happens that Pacemaker wants to start a stopping MySQL instance, a very usual situation, an error will be generated.

PRM agents, tools and documentation can be found here: https://github.com/percona/percona-pacemaker-agents

The post New in Percona Replication Manager: Slave resync, Async stop appeared first on MySQL Performance Blog.


Exploring message brokers

Message brokers are not regularly covered here but are, nonetheless, important web-related technologies. Some time ago, I was asked by one of our customer to review a selection of OSS message brokers and propose a couple of good candidates. The requirements were fairly simple: behave well when there’s a large backlog of messages, be able to create a cluster and in case of the failure of a node in a cluster, try to protect the data but never blocks the publishers even though that might imply data lost. Nothing fancy regarding queues and topics management. I decided to write my findings here, before I forget…

I don’t consider myself a message broker specialist and I spent only about a day or two on each so, I may have done some big mistakes configuration wise. I’ll take the blame if something is misconfigured or not used correctly.


RabbitMQ is well known and popular message broker and it has many powerful features. The documentation on the RabbitMQ web site is excellent and there are many books available. RabbitMQ is written in Erlang, not a widely used programming language but well adapted to such tasks. The company Pivotal develops and maintains RabbitMQ. I reviewed version 3.2.2 on CentOS 6 servers.

The installation was easy, I installed Erlang version R14B from epel and the RabbitMQ rpm. The only small issue I had was that the server is expecting “″ to be resolved in /etc/hosts and the openstack VMs I used were missing that. Easy to fix. I also installed and enabled the management plugin.

The RabbitMQ configuration is set in the rabbitmq.config file and it has tons of adjustable parameters. I used the defaults. In term of client API, RabbitMQ support a long list of languages and some standards protocols, like STOMP are available with a plugin. Queues and topics can be created either by the web interface or through the client API directly. If you have more than one node, they can be clustered and then, queues and topics, can be replicated to other servers.

I created 4 queues, wrote a ruby client and started inserting messages. I got a publishing rate of about 20k/s using multiple threads but I got a few stalls caused by the vm_memory_high_watermark, from my understanding during those stalls it writing to disk. Not exactly awesome given my requirements. Also, some part is always kept in memory even if a queue is durable so, even though I had plenty of disk space, the memory usage grew and eventually hit the vm_memory_high_watermark setting. The cpu load was pretty high during the load, between 40% and 50% on an 8 cores VM.

Even though my requirements were not met, I setup a replicated queue on 2 nodes and inserted a few millions objects. I killed one of the two nodes and insert were even faster but then… I did a mistake. I restarted the node and asked for a resync. Either I didn’t set it correctly or the resync is poorly implemented but it took forever to resync and it was slowing down as it progressed. At 58% done, it has been running for 17h, one thread at 100%. My patience was exhausted.

So, lots of feature, decent performance but behavior not compatible with the requirements.


Kafka has been designed originally by LinkedIn, it is written in Java and it is now under the Apache project umbrella. Sometimes you look at a technology and you just say: wow, this is really done the way it should be. At least I could say that for the purpose I had. What is so special about Kafka is the architecture, it stores the messages in flat files and consumers ask messages based on an offset. Think of it like a MySQL server (producer) saving messages (updates SQL) to its binlogs and slaves (consumers) ask messages based on an offset. The server is pretty simple and just don’t care about the consumers much. That simplicity makes it super fast and low on resource. Old messages can be retained on a time base (like expire_logs_days) and/or on a storage usage base.

So, if the server doesn’t keep track of what has been consumed on each topics, how do can you have multiple consumer. The missing element here is Zookeeper. The Kafka server uses Zookeeper for cluster membership and routing while the consumers can also use Zookeeper or something else for synchronization. The sample consumer provided with the server uses Zookeeper so you can launch many instances and they’ll synchronize automatically. For the ones that doesn’t know Zookeeper, it is a highly-available synchronous distributed storage system. If you know Corosync, it provides somewhat the same functionality.

Feature wise Kafka, isn’t that great. There’s no web frontend builtin although a few are available in the ecosystem. Routing and rules are inexistent and stats are just with JMX. But, the performance… I reached a publishing speed of 165k messages/s over a single thread, I didn’t bother tuning for more. Consuming was essentially disk bound on the server, 3M messages/s… amazing. That was without Zookeeker coordination. Memory and cpu usage were modest.

To test clustering, I created a replicated queue, inserted a few messages, stopped a replica, inserted a few millions more messages and restarted the replica. I took only a few seconds to resync.

So, Kafka is very good fit for the requirements, stellar performance, low resource usage and nice fit with the requirements.


ActiveMQ is another big player in the field with an impressive feature set. ActiveMQ is more in the RabbitMQ league than Kafka and like Kafka, it is written in Java. HA can be provided by the storage backend, levelDB supports replication but I got some issues with it. My requirements are not for full HA, just to make sure the publishers are never blocked so I dropped the storage backend replication in favor of a mesh of brokers.

My understanding of the mesh of brokers is that you connect to one of the members and you publish or consume a message. You don’t know on which node(s) the queue is located, the broker you connect to knows and routes your request. To further help, you can specify all the brokers on the connection string and the client library will just reconnect to another if the one you are connected to goes down. That looks pretty good for the requirements.

With the mesh of brokers setup, I got an insert rate of about 5000 msg/s over 15 threads and a single consumer was able to read 2000 msg/s. I let it run for a while and got 150M messages. At this point though, I lost the web interface and the publishing rate was much slower.

So, a big beast, lot of features, decent performance, on the edge with the requirements.


Kestrel is another interesting broker, this time, more like Kafka. Written in scala, the Kestrel broker speaks the memcached protocol. Basically, the key becomes the queue name and the object is the message. Kestrel is very simple, queues are defined in a configuration file but you can specify, per queue, storage limits, expiration and behavior when limits are reached. With a setting like “discardOldWhenFull = true”, my requirement of never blocking the publishers is easily met.

In term of clustering Kestrel is a bit limited but each can publish its availability to Zookeeper so that publishers and consumers can be informed of a missing server and adjust. Of course, if you have many Kestrel servers with the same queue defined, the consumers will need to query all of the broker to get the message back and strict ordering can be a bit hard.

In term of performance, a few simple bash scripts using nc to publish messages easily reached 10k messages/s which is very good. The rate is static over time and likely limited by the reconnection for each message. The presence of consumers slightly reduces the publishing rate but nothing drastic. The only issue I had was when a large number of messages expired, the server froze for some time but that was because I forgot to set maxExpireSweep to something like 100 and all the messages were removed in one pass.

So, fairly good impression on Kestrel, simple but works well.


For the requirements given by the customer, Kafka was like a natural fit. It offers a high guarantee that the service will be available and non-blocking under any circumstances. In addition, messages can easily be replicated for higher data availability. Kafka performance is just great and resource usage modest.

The post Exploring message brokers appeared first on MySQL Performance Blog.


Percona Replication Manager (PRM) now supporting 5.6 GTID

Over the last few days, I integrated the MySQL 5.6 GTID version of the Percona Replication Manager (PRM) work of Frédéric Descamps, a colleague at Percona. The agent supports the GTID replication mode of MySQL 5.6 and if the master suffers a hard crash, it picks the slave having applied the highest transaction ID from the dead master. Given the nature of GTID-based replication, that causes all the other slaves to resync appropriately to their new master which is pretty cool and must yet be matched by the regular PRM agent.

For now, it is part of a separate agent, mysql_prm56, which may be integrated with the regular agent in the future. To use it, download the agent with the link above, the pacemaker configuration is similar to the one of the regular PRM agent. If you start from scratch, have a look here and of course, replace “mysql_prm” with “mysql_prm56″. Keep in mind that although it successfully ran many tests, it is the first release and there’s no field experience. I invite you to send any issue or successful usage to PRM-discuss.

As a side note, dealing with GTID based replication is slightly different than regular replication. I invite to consult these posts for more details:

Replication in MySQL 5.6: GTIDs benefits and limitations – Part 1

How to create/restore a slave using GTID replication in MySQL 5.6
How to create a new (or repair a broken) GTID based slave with Percona XtraBackup
Repair MySQL 5.6 GTID replication by injecting empty transactions

The post Percona Replication Manager (PRM) now supporting 5.6 GTID appeared first on MySQL Performance Blog.


How to add VIPs to Percona XtraDB Cluster or MHA with Pacemaker

It is a rather frequent problem to have to manage Virtual IP addresses (VIPs) with a Percona XtraDB Cluster (PXC) or with MySQL master HA (MHA). In order to help solving these problems, I wrote a Pacemaker agent, mysql_monitor that is a simplified version of the mysql_prm agent. The mysql_monitor agent only monitors MySQL and set attributes according to the state of MySQL, the read-only variable, the slave status and/or the output of the clustercheck script for PXC. The agent can operate in 3 modes or cluster types: replication (default), pxc and read-only.

The simplest mode is read-only, only the state of the read_only variable is looked at. If the node has the read_only variable set to OFF, then the writer is set to 1 and reader attributes is set to 1 while if the node has read_only set to ON, the writer attributes will be set to 0 and the reader attribute set to 1.

In replication mode, the writer and reader attributes are set to 1 on the nodes where the read_only variable is set to OFF and on the nodes where the read_only variable is set to ON, the reader attribute is set according to the replication state.

Finally, in the PXC mode, both attributes are set according to the return code of the clustercheck script.

In all cases, if MySQL is not running, the reader and writer attributes are set to 0. The following table recaps the behavior in a more visual way, using readable for the reader attribute and writable for the write attribute:

Cluster type: Read-only Replication PXC
MySQL not OK readable = 0
writable = 0
Read-only = OFF readable = 1
writable = 1
readable = 1
writable = 1
Read-only = ON
Replication OK
readable = 1
writable = 0
readable = 1writable = 0 N/A
Read-only = ON
Replication Not OK
readable = 1
writable = 0
readable = 0writable = 0 N/A
Clustercheck OK N/A N/A readable = 1
writable = 1
Clustercheck not OK N/A N/A readable = 0
writable = 0


The agent can be found in the percona-pacemaker-agents github repository, more specifically here and the accompanying documentation here.

To get a feel how use it, here’s a sample pacemaker configuration for a PXC cluster:

primitive p_mysql_monit ocf:percona:mysql_monitor \
    params user="repl_user" password="WhatAPassword" pid="/var/lib/mysql/mysqld.pid" \
      socket="/var/run/mysqld/mysqld.sock" cluster_type="pxc" \
    op monitor interval="1s" timeout="30s" OCF_CHECK_LEVEL="1"
clone cl_mysql_monitor p_mysql_monit \
    meta clone-max="3" clone-node-max="1"
primitive writer_vip ocf:heartbeat:IPaddr2 \
    params ip="" nic="eth1" \
    op monitor interval="10s"
primitive reader_vip_1 ocf:heartbeat:IPaddr2 \
    params ip="" nic="eth1" \
    op monitor interval="10s"
primitive reader_vip_2 ocf:heartbeat:IPaddr2 \
    params ip="" nic="eth1" \
    op monitor interval="10s"
location No-reader-vip-1-loc reader_vip_1 \
    rule $id="No-reader-vip-1-rule" -inf: readable eq 0
location No-reader-vip-2-loc reader_vip_2 \
    rule $id="No-reader-vip-2-rule" -inf: readable eq 0
location No-writer-vip-loc writer_vip \
    rule $id="No-writer-vip-rule" -inf: writable eq 0
colocation col_vip_dislike_each_other -200: reader_vip_1 reader_vip_2 writer_vip

The resulting cluster status with the attributes set looks like:

root@pacemaker-1:~# crm_mon -A1
Last updated: Tue Nov 19 17:06:18 2013
Last change: Tue Nov 19 16:40:51 2013 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 reader_vip_1   (ocf::heartbeat:IPaddr2):       Started pacemaker-1
 reader_vip_2   (ocf::heartbeat:IPaddr2):       Started pacemaker-2
 writer_vip     (ocf::heartbeat:IPaddr2):       Started pacemaker-3
Node Attributes:
* Node pacemaker-1:
    + readable                          : 1
    + writable                          : 1
* Node pacemaker-2:
    + readable                          : 1
    + writable                          : 1
* Node pacemaker-3:
    + readable                          : 1
    + writable                          : 1

Nothing too exciting so far, let’s desync one of the nodes (pacemaker-1), setting the variable wsrep_desync=1. This gives us the following status:

root@pacemaker-1:~# crm_mon -A1
Last updated: Tue Nov 19 17:10:21 2013
Last change: Tue Nov 19 16:40:51 2013 via cibadmin on pacemaker-1
Stack: openais
Current DC: pacemaker-3 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
3 Nodes configured, 3 expected votes
6 Resources configured.
Online: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 Clone Set: cl_mysql_monitor [p_mysql_monit]
     Started: [ pacemaker-1 pacemaker-2 pacemaker-3 ]
 reader_vip_1   (ocf::heartbeat:IPaddr2):       Started pacemaker-2
 reader_vip_2   (ocf::heartbeat:IPaddr2):       Started pacemaker-3
 writer_vip     (ocf::heartbeat:IPaddr2):       Started pacemaker-3
Node Attributes:
* Node pacemaker-1:
    + readable                          : 0
    + writable                          : 0
* Node pacemaker-2:
    + readable                          : 1
    + writable                          : 1
* Node pacemaker-3:
    + readable                          : 1
    + writable                          : 1

… where, as expected, no VIPs are now on the desynced node, pacemaker-1. Using different VIPs for PXC is not ideal, the pacemaker IPAddr2 agent allows to create a clone set of an IP using the CLUSTERIP target of iptables. I’ll write a quick follow up to this post devoted to the use of CLUSTERIP.

The post How to add VIPs to Percona XtraDB Cluster or MHA with Pacemaker appeared first on MySQL Performance Blog.


LVM read performance during snapshots

For the same customer I am exploring ZFS for backups, the twin server is using regular LVM and XFS. On this twin, I have setup mylvmbackup for a more conservative backup approach. I quickly found some odd behaviors, the backup was taking much longer than what I was expecting. It is not the first time I saw that, but here it was obvious. So I recorded some metrics, bi from vmstat and percent of cow space used from lvs during a backup. Cow space is the Copy On Write buffer used by LVM to record the modified pages like they were at the beginning of the snapshot. Upon reads, LVM must scan the list to verify that there’s no newer version. Here’s the other details about the backup:

  • Filesystem: 2TB, xfs
  • Snapsize: 60GB
  • Amount to backup: ~600GB
  • Backup tool: mylvmbackup
  • Compressor: pbzip2


As you can see, the processing of the COW space has a huge impact on the read performance. For this database the backup time was 11h but if I stop the slave and let it calm down for 10 min. so that the insert buffer is cleared, the backup time is a bit less than 3h and could probably be less if I use a faster compressor since the bottleneck is now the CPU overhead of pbzip2, all cores at 100%.

So, for large filesystems, if you plan to use LVM snapshots, have in mind that read performance will degrade with COW space used and it might be a good idea to reduce the number of writes during the backup. You could also compress the backup in a second stage if you have the storage capacity.

The post LVM read performance during snapshots appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com