Introducing email reports for Percona Cloud Tools!

For those PCT users who forget to regularly check query performance, you can now get server query reports delivered to your email daily or weekly. The report contains a digest of the most critical performance information related to a particular MySQL instance, enabling you to act on any recent issues. There are convenient links that open the corresponding view of the PCT Query Analytics module.


Server summary snapshot

The report also provides a list of ten slowest queries and ten queries with the highest load during the period. It is important to compare these lists, because some of the slowest queries may occur only several times during the period (so they probably do not overload the server), while seemingly quick queries may run millions of times and put a lot of unnecessary overhead, which you could investigate and fix.


List of slowest queries


List of queries with highest load

Finally, the report also provides information about new queries that you add to MySQL. This enables you to estimate the efficiency of a new query and see how it compares to the worst performing ones.


List of new queries

Reports are enabled by default for some of our customers. We will gradually enable them for more PCT users every week to make sure that the new feature rolls out smoothly. You can manage reports in the PCT web UI under Configure > Reports at

By default, you will receive a separate weekly report for each MySQL instance in your infrastructure. If you add a new MySQL instance, weekly reports will be automatically enabled for it.

To disable all reports, clear the Enable server query reports check box. If you do not want to enable reports for new MySQL instances that you add, clear the Automatically receive reports for new MySQL instances check box.

You can enable reports for specific MySQL instances, as well as select between daily and weekly period. Disable reports for servers that do not require regular attention and enable weekly reports for more critical servers. Use daily reports for the most active servers, where you expect to have the most amount of change and load.

If you do not have a Percona Cloud Tools account yet, register at and get regular query reports by email!

The post Introducing email reports for Percona Cloud Tools! appeared first on MySQL Performance Blog.


Easy query metrics with MySQL Performance Schema

The MySQL Performance Schema exposes so much data that it’s not trivial to learn, configure, and use. With recently released Percona Agent 1.0.11 you can get query metrics – like min, max, and average query execution time – with a few clicks:

Configure PCT Query Analytics for Performance Schema

Click “Apply” and about two minutes later you’ll have query metrics from Performance Schema, collected and sent every minute.

Percona Cloud Tools (PCT) and Percona Agent handle all the details. You’ll need MySQL (or Percona Server) 5.6 and Percona Agent 1.0.11 or newer. One caveat at the moment: it only works for local MySQL instances (so not Amazon RDS). This limitation should be fixed soon; we’re already working on it.

Why use Performance Schema? We prefer Query Analytics with MySQL slow logs (especially Percona Server slow logs which expose more metrics) because slow logs expose the most metrics compared to other sources, but sometimes the slow log just isn’t an option, so Performance Schema is the next best choice, but the choice means tradeoffs. For example, Performance Schema does not expose actual query examples (just fingerprints), so EXPLAIN does not work.

For those who’ve been following PCT development, you know that Percona Agent 1.0.5 first introduced support for Performance Schema. What’s new in 1.0.11 is everything – we completely rewrote this part of the agent. It’s so much better that it’s now the required minimum version for using Query Analytics with Performance Schema. Upgrading is really easy: just run the single command line you used to install the agent and it will auto-update.

MySQL Performance Schema exposes a lot of data and insights into the server, but query metrics are perhaps the most important because the primary job of your MySQL database is to execute queries. That’s why Percona Cloud Tools makes Query Analytics with Performance Schema (and slow logs) so easy: to help you focus on the essential and leave the details to the tools.

Percona Cloud Tools is in beta, so it’s still free to sign up and free to use all the tools and features.

The post Easy query metrics with MySQL Performance Schema appeared first on MySQL Performance Blog.


Faster fingerprints and Go packages for MySQL

I’m happy to announce Go packages for MySQL. Particularly exciting is a new query fingerprint function which is very fast and efficient, but I’ll talk about that later. First, go-mysql is currently three simple Go packages for parsing and aggregating MySQL slow logs. If you’ve been following Percona development, you’ve no doubt heard of Percona Cloud Tools (PCT), a somewhat new performance management web service for MySQL.

One tool in PCT is “Query Analytics” which continuously analyzes query metrics from the slow log. The slow log provides the most metrics and therefore the most performance insight into MySQL. percona-agent, the open-source agent for PCT, uses go-mysql to parse and analyze the slow log, so the code has both heavy formal testing and heavy real-world testing. If you’re working with Go, MySQL, and MySQL slow logs, we invite you to try go-mysql.

Last October we implemented a completely new query fingerprint function. (See “Fingerprints” in the pt-query-digest doc for a background on query fingerprints.) Since mydumpslow, the very first slow log parser circa 2000, fingerprints have been accomplished with regular expressions. This approach is normally fine, but percona-agent needs to be faster and more efficient than normal to reduce the cost of observation. Regex patterns are like little state machines. One regex can be very fast, but several are required to produce a good fingerprint. Therefore, the regex approach requires processing the same query several times to produce a fingerprint. Even worse: a regex can backtrack which means a single logical pass through the query can result in several physical passes. In short: regular expressions are a quick and easy solution, but they are very inefficient.

Several years ago, a former colleague suggested a different approach: a single pass, purpose-built, character-level state machine. The resulting code is rather complicated, but the resulting performance is a tremendous improvement: 3-5x faster in informal benchmarks on my machine, and it handles more edge cases. In simplest terms: the new fingerprint function does more with less, which makes percona-agent and Query Analytics better.

Check out, browse the percona-agent source code if you wish, and try Percona Cloud Tools for free.

The post Faster fingerprints and Go packages for MySQL appeared first on MySQL Performance Blog.


Using Percona Cloud Tools to solve real-world MySQL problems

For months when speaking with customers I have been positioning Percona Cloud Tools (PCT) as a valuable tool for the DBA/Developer/SysAdmin but only recently have I truly been able to harness the data and make a technical recommendation to a customer that I feel would have been very difficult to accomplish otherwise.

Using Percona Cloud Tools to solve real-world MySQL problemsLet me provide some background: I was tasked with performing a Performance Audit for one of our customers (Performance Audits are extremely popular as they allow you to have a MySQL Expert confirm or reveal challenges within your MySQL environment and make your database run faster!) and as part of our conversation we discussed and agreed to install Percona Cloud Tools. We let the site run for a few days, and then I started my audit. What I noticed was that at regular intervals there was often a CPU spike, along with a corresponding drop in Queries Per Second (QPS), but that lasted only for a few seconds. We decided that further investigation was warranted as the customer was concerned the spikes impacted their users’ experience with the application.

Here are the tasks that Percona Cloud Tools made easy while I worked to identify the source of the CPU spike and QPS drop:

  1. Per-second granularity data capture of PCT allowed me to identify how significant the spike and QPS actually were – if I was looking at the 1 minute or higher average values (such as Cacti would provide) I probably wouldn’t have been able to detect the spike or stall as clearly in the first place, it would have been lost in the average. In the case of PCT the current graphs group at the 1 minute range but you have the ability to view the min and max values during this 1 minute range since they are the true highest and lowest observed 1s intervals during the 1 minute group.
  2. Ability for all graphs to maintain the same resolution time allowed me to zero-in on the problematic time period and then quickly look across all graphs for corresponding deflections. This analysis led me to discover a significant spike in InnoDB disk reads.
  3. Ability to use the Query Analytics functionality to zoom-in again on the problematic query. By adjusting Query Analytics to an appropriate time period narrowed down the range of unique queries that could be considered the cause. This task in my opinion is the best part of using PCT.
  4. Query Analytics allowed me to view the Rows Examined in Total for each query based on just this shortened interval. I then tagged those that had higher than 10k Rows Examined (arbitrary but most queries for this customer seemed to fall below this) so that I could then review in real-time with the customer before making a decision on what to do next. We can only view this sort of information by leveraging the slow query log – this data is not available via Performance_Schema or via network sniffing.

Once we were able to identify the problematic queries then the rest was routine query optimization – 10 minutes work using Percona Cloud Tools for what might have been an hour using traditional methods!

For those of you wondering how else this can be done, assuming you detected the CPU spike / QPS drop (perhaps you are using Graphite or other tool that can deliver per-second resolution) then you’d also need to be capturing the slow query log at a good enough resolution level (I prefer long_query_time=0 to just get it all), and then be adept at leveraging pt-query-digest with –since and –until options to narrow down your range of queries.  The significant drawback to this approach is that each time you want to tweak your time range you probably need to stream through a fairly large slow log file multiple times which can be both CPU and disk intensive operations, which means it can take some time (probably minutes, maybe hours) depending on the size of your log file.  Certainly a workable approach but nowhere near as quick as reloading a page in your browser :)

So what are you waiting for? Start using Percona Cloud Tools today, it’s free! Register for the free beta here.

The post Using Percona Cloud Tools to solve real-world MySQL problems appeared first on MySQL Performance Blog.


Managing data using open source technologies? Learn what’s hot in 2015!

Whether you’re looking at the overall MySQL ecosystem or the all-data management landscape, the choice of technologies has never been larger than it is in 2015.

Having so many options is great but it also can be very hard to make a selection. I’m going to help narrow the list next week during a Webinar titled, “Open Source Technologies you should evaluate in 2015,” January 14 at 10 a.m. PST.

During the hour I’ll share which technologies I think worthy of consideration in 2015 – open source and proprietary technologies that allow you to manage your data easier, increase development pace, scale better and improve availability and security. I’ll also discuss recent developments in MySQL, NoSQL and NewSQL, Cloud and general advances in hardware.

Open source technologies you should evaluate in 2015Specifically, some of the areas I’ll address will include:

  • Cloud-based Database as a Service (DBaaS) such as Amazon RDS for MySQL, Amazon RDS for Aurora, Google Cloud, and OpenStack Trove
  • MySQL 5.7
  • Hybrid database environments with MySQL plus MongoDB or other NoSQL solutions
  • Advanced Monitoring capabilities such as Percona Cloud Tools
  • Other performance enhancements such as solid state devices (SSD) and the TokuDB storage engine

I hope to see you next week! (Register now to reserve your spot!)

The post Managing data using open source technologies? Learn what’s hot in 2015! appeared first on MySQL Performance Blog.


How small changes impact complex systems – MySQL example

If you’ve been studying complex systems you know what minor changes might cause consequences of much greater proportions, sometimes causing some effects that are not easily explained at first. I recently ran across a great illustration of such behavior while doing MySQL benchmarks which I thought would be interesting to share.

I’m using a very simple benchmark – Sysbench 0.5 on Percona Server 5.6.21-70.1 just running update queries:

sysbench  --num-threads=64 --report-interval=10 --max-time=0 --max-requests=0 --rand-type=pareto --oltp-table-size=1000000000 --mysql-user=root --mysql-password= --mysql-db=sbinnodb  --test=/usr/share/doc/sysbench/tests/db/update_index.lua run

Some people frown upon such benchmarks due to their triviality and being irrelevant to workloads. I like them because they often allow you to study already complex system behavior in a much more controlled environment than “real” workloads – and so you can understand internal dynamics better.

For this test I set innodb_purge_threads=8, innodb_max_purge_lag=1000000, innodb_max_purge_lag_delay=5000 as otherwise chances are that the history will just be growing forever not allowing the system to reach the steady state. I wrote about InnoDB history problems in detail recently in a post titled “InnoDB transaction history often hides dangerous ‘debt’.

The only change I was looking for is changing innodb_flush_log_at_trx_commit from 1 to 2. I then went back to 1 a few hours later, after the system had reached the steady state. What I would normally expect in such a case is a performance increase because there is less work to do on commit and that is pretty much it. Looking at the graphs however I can see much more changes in the system. Let me go through them and explain the ones which I can:

The throughput graph is probably the most interesting here. We can see that indeed the throughput has substantially improved when set innodb_flush_log_at_trx_commit=2 and then went back to the old one after it was set back to 1. We see however in 2 some interesting effects here – a spike of a few minutes and later a dip below the original performance level for a few hours, until finally recovery back to the baseline. I’ll explain this with different graphs.

CPU usage is interesting, too. We can see there is user CPU increase correlating to the throughput increase but also a drastic IOWait increase. Why would it increase when we’re actually making workload less IO bound by not flushing log as much? The reason is because there are more threads now that can perform IO concurrently, hence increasing what is reporting IO wait and deceasing idle time.

InnoDB writing much more data is expected and it very well matches throughput. Not much of anything unexpected.

As we see setting innodb_flush_log_at_trx_commit=2 causes the amount of dirty pages in the buffer pool to go down significantly. It recovers when it is put back. The reason I suspect in this case is the undo space – as we will see in later graph history length spikes quickly – obviously requiring some space in the buffer pool.

This is perhaps one of the most interesting graphs here. You can see for some reason when log flush is enabled that purge threads are able to keep up. But if you disable it they cannot, which makes history grow very quickly until the max allowed amount. This is by the way the time we see the spike in the throughput as there is no throttling being done to keep history in check. This also explains why after I set innodb_flush_log_at_trx_commit back to 1 that performance first drops well below original levels – InnoDB needs to go through and purge the history, which takes resources. Another interesting thing you can notice is how uneven history cleanup is. Specifically you can note some 75% percent of the history has been cleaned up almost instantly while initial 25% took a lot of time to work out. I do not have an exact explanation here yet I believe it has to do with the fact initially history purging requires a significant amount of IO while at a certain point in time a purge happens completely in the buffer pool and hence being much faster.

Why would purge threads have a hard time keeping up when log flush is disabled? I suspect there are 2 reasons. First, the system can provide much more throughput in such configuration (not keeping purging into account) while purge subsystem is limited and can’t keep up with such a high load. The other reason is – eliminating serialization on the log flush exposes other parts of InnoDB to contention which further suppresses purge system progress.

This is another relevant graph to show the throttling done because purging can’t keep up. As you see we got about 2.5ms delay injected for updates to give purging some time to keep up. It is good to watch it to understand if the max delay you specified is enough to prevent purge lag while still avoid having the system grind to a halt. Had I set innodb_max_purge_lag_delay to 1000 it would not be enough to prevent history length growing to infinity, having I set it to 100000 the system could grind to a halt in case there is some long-running transaction causing history length to grow to about 1 million.

Lets look at Physical IO – it is also quite interesting. As you can see in the first and last parts of the graph there are more pages flushed than pages read, while when we disable log flushing we get much more reads than writes which only stops after history purging catch up is complete. I find this very interesting – this means that working with even modest history – 1M in this case causes a lot of IO – I assume because due to access pattern to the undo space it is constantly being washed out from the buffer pool. This graph is a very good illustration why large history might be a death spiral, as once it is there the system needs to do much more work.

This IO utilization graph might be a bit hard to understand but what it generally shows is that during the time the hard drive has been 100% utilized, there was still at least one outstanding IO to serve. However the concurrency was different with more average IO requests outstanding when we had log flushing disabled. This makes sense. We can also see it was highest when we enabled log flushing back but there was history to go through and catch up.

Changes to contention are very curious – we can see there is a lot more spin rounds and specifically shared lock spin rounds when we have log flush disabled. This is what I believe comes from less serialization on the log flushing causes a lot more internal contention.

It might not be very easy to see but this graph shows that innodb_row_lock_times spikes 1000x when log flushing is disabled. Why would that be? I suspect the reason is injected delay for purging. I also suspect the design of this throttling so it only is done when an actual change is attempted to be made, this is where update statement already holds some locks, which causes more row lock contention for other statements.

InnoDB has something called Server Activity count. As you can see from this graph it is not really a good indicator – for some reason it shows a lower value when the server was doing much more work. I do not understand other than this variable is incremented enough to explain it but find it curious.

LRU flushing also is drastically different between enabled and disabled log flushing. We get a lot more calls to search free pages and a lot more scans. I’m not sure why exactly this is happening but I think this has something to do with undo space caching which is constantly washed out from buffer pool which increased IO seems to indicate.

Finally lets look at perhaps the most counter-intuitive result – the amount of pending pwrites to the log file – we allowed to cache more log files in the OS cache but the amount of waits on writes to the log files actually increased? I’m not very sure why this is happening – perhaps there is a lot of IO pressure on OS so it does not delay writes to the log files for long time?

This is the same graph showing log fsyncs and log pwrites together. We can see when we set innodb_flush_log_at_trx_commit=2 that we see a lot more waits on pwrites, while less waits on fsync (which now should only be happening once per second or so). In general these two graphs tell me in some cases assuming log writes with innodb_flush_log_at_trx_commit=2 will be “free” because the operating system will instantly acknowledge them and put in the cache is not always correct.

Summary: The main point I’m looking to illustrate here is what changes you’re doing to the MySQL configuration (same can be said about your application OS configuration, hardware, etc.) can cause much larger changes to system behavior than you would guess. If you have the time it is good to validate whenever there are any other changes to system behavior beyond what you would think. If not at least keep an open mind and assume what any change, even a small one, can cause dramatic system behavior changes.

If you want to explore similar graphs check out Percona Cloud Tools – it’s easy to get going and allows you to build graphs like these absolutely free and perform a lot of in depth analyses. Now in free beta, Percona Cloud Tools is a hosted service providing access to query performance insights for all MySQL uses.

The post How small changes impact complex systems – MySQL example appeared first on MySQL Performance Blog.


Streamlined Percona XtraDB Cluster (or anything) testing with Consul and Vagrant

Introducing Consul

I’m always interested in what Mitchell Hashimoto and Hashicorp are up to, I typically find their projects valuable.  If you’ve heard of Vagrant, you know their work.

I recently became interested in a newer project they have called ‘Consul‘.  Consul is a bit hard to describe.  It is (in part):

  • Highly consistent metadata store (a bit like Zookeeeper)
  • A monitoring system (lightweight Nagios)
  • A service discovery system, both DNS and HTTP-based. (think of something like haproxy, but instead of tcp load balancing, it provides dns lookups with healthy services)

What this has to do with Percona XtraDB Cluster

I’ve had some more complex testing for Percona XtraDB Cluster (PXC) to do on my plate for quite a while, and I started to explore Consul as a tool to help with this.  I already have Vagrant setups for PXC, but ensuring all the nodes are healthy, kicking off tests, gathering results, etc. were still difficult.

So, my loose goals for Consul are:

  • A single dashboard to ensure my testing environment is healthy
  • Ability to adapt to any size environment — 3 node clusters up to 20+
  • Coordinate starting and stopping load tests running on any number of test clients
  • Have the ability to collect distributed test results

I’ve succeeded on some of these fronts with a Vagrant environment I’ve been working on. This spins up:

  • A Consul cluster (default is a single node)
  • Test server(s)
  • A PXC cluster

Additionally, it integrates the Test servers and PXC nodes with Consul such that:

  • The servers setup a Consul agent in client mode to the  Consul cluster
  • Additionally, they setup a local DNS forwarder that sends all DNS requests to the ‘.consul’ domain to the local agent to be serviced by the Consul cluster.
  • The servers register services with Consul that run local health checks
  • The test server(s) setup a ‘watch’ in consul to wait for starting sysbench on a consul ‘event’.

Seeing it in action

Once I run my ‘vagrant up’, I get a consul UI I can connect to on my localhost at port 8501:

Consul's Node Overview

Consul’s Node Overview


I can see all 5 of my nodes.  I can check the services and see that test1 is failing one health check because sysbench isn’t running yet:

Consul reporting sysbench is not running.

Consul reporting sysbench is not running.

This is expected, because I haven’t started testing yet.  I can see that my PXC cluster is healthy:


Health checks are using clustercheck from the PXC package

Health checks are using clustercheck from the PXC package


Involving Percona Cloud Tools in the system

So far, so good.  This Vagrant configuration (if I provide a PERCONA_AGENT_API_KEY in my environment) also registers my test servers with Percona Cloud Tools, so I can see data being reported there for my nodes:

Percona Cloud Tool's Dashboard for a single node

Percona Cloud Tool’s Dashboard for a single node

So now I am ready to begin my test.  To do so, I simply need to issue a consul event from any of the nodes:

jayj@~/Src/pxc_consul [507]$ vagrant ssh consul1
Last login: Wed Nov 26 14:32:38 2014 from
[root@consul1 ~]# consul event -name='sysbench_update_index'
Event ID: 7c8aab42-fd2e-de6c-cb0c-1de31c02ce95

My pre-configured watchers on my test node knows what to do with that event and launches sysbench.  Consul shows that sysbench is indeed running:

Screen Shot 2014-11-26 at 9.43.29 AM


And I can indeed see traffic start to come in on Percona Cloud Tools:

Screen Shot 2014-11-26 at 9.53.11 AM

I have testing traffic limited for my example, but that’s easily tunable via the Vagrantfile.  To show something a little more impressive, here’s a 5 node cluster running hitting around 2500 tps total throughput:

Screen Shot 2014-11-26 at 1.08.48 PM

So to summarize thus far:

  • I can spin up any size cluster I want and verify it is healthy with Consul’s UI
  • I can spin up any number of test servers and kick off sysbench on all of them simultaneously

Another big trick of Consul’s

That so far so good, but let me point out a few things that may not be obvious.  If you check the Vagrantfile, I use a consul hostname in a few places.  First, on the test servers:

# sysbench setup
            'tables' => 1,
            'rows' => 1000000,
            'threads' => 4 * pxc_nodes,
            'tx_rate' => 10,
            'mysql_host' => 'pxc.service.consul'

then again on the PXC server configuration:

# PXC setup
          "percona_server_version"  => pxc_version,
          'innodb_buffer_pool_size' => '1G',
          'innodb_log_file_size' => '1G',
          'innodb_flush_log_at_trx_commit' => '0',
          'pxc_bootstrap_node' => (i == 1 ? true : false ),
          'wsrep_cluster_address' => 'gcomm://pxc.service.consul',
          'wsrep_provider_options' => 'gcache.size=2G; gcs.fc_limit=1024',

Notice ‘pxc.service.consul’.  This hostname is provided by Consul and resolves to all the IPs of the current servers both having and passing the ‘pxc’ service health check:

[root@test1 ~]# host pxc.service.consul
pxc.service.consul has address
pxc.service.consul has address
pxc.service.consul has address

So I am using this to my advantage in two ways:

  1. My PXC cluster bootstraps the first node automatically, but all the other nodes use this hostname for their wsrep_cluster_address.  This means: no specific hostnames or ips in the my.cnf file, and this hostname will always be up to date with what nodes are active in the cluster; which is the precise list that should be in the wsrep_cluster_address at any given moment.
  2. My test servers connect to this hostname, therefore they always know where to connect and they will round-robin (if I have enough sysbench threads and PXC nodes) to different nodes based on the response of the dns lookup, which returns 3 of the active nodes in a different order each time.

(Some of) The Issues

This is still a work in progress and there are many improvements that could be made:

  • I’m relying on PCT to collect my data, but it’d be nice to utilize Consul’s central key/value store to store results of the independent sysbench runs.
  • Consul’s leader election could be used to help the cluster determine which node should bootstrap on first startup. I am assuming node1 should bootstrap.
  • A variety of bugs in various software still makes this a bit clunky sometimes to manage.  Here is a sample:
    • Consul events sometimes don’t fire in the current release (though it looks to be fixed soon)
    • PXC joining nodes sometimes get stuck putting speed bumps into the automated deploy.
    • Automated installs of percona-agent (which sends data to Percona Cloud Tools) is straight-forward, except when different cluster nodes clobber each other’s credentials.

So, in summary, I am happy with how easily Consul integrates and I’m already finding it useful for a product in its 0.4.1 release.

The post Streamlined Percona XtraDB Cluster (or anything) testing with Consul and Vagrant appeared first on MySQL Performance Blog.


Tips from the trenches for over-extended MySQL DBAs

This post is a follow-up to my November 19 webinar, “Tips from the Trenches: A Guide to Preventing Downtime for the Over-Extended DBA,” during which I described some of the most common reasons DBAs experience avoidable downtime. The session was aimed at the “over-stretched DBA,” identified as the MySQL DBA short of time or an engineer of another discipline without the depth of the MySQL system. The over-stretched DBA may be prone to making fundamental mistakes that cause downtime through poor response time, operations that cause blocking on important data or administrative mishaps through the lack of best practice monitoring and alerting. (You can download my slides and view the recorded webinar here.)

Monitor the things
One of the aides to keeping the system up and running is ensuring that your finger is on the pulse of the environment. Here on the Percona Managed Services team, we leverage Percona Monitoring Plugins (open source plugins for Nagios, Cacti and Zabbix) to ensure we have visibility of our client’s operations. Having a handle on basics such as disk space, memory usage and MySQL operational metrics ensures that we avoid trivial downtime that would affect the client’s uptime or worse, their bottom line.

Road Blocks
One of the most common reasons that an application is unable to serve data to its end user is that access to a table is being blocked due to another ongoing operation. This can be blamed on a variety of sources: backups, schema changes, poor configuration and long running transactions can all lend themselves to costly blocking. Understanding the impact of actions on a MySQL server can be the difference between a happy end user and a frustrated one.

During the webinar I made reference to some resources and techniques that can assist the over extended DBA avoid downtime and here are some highlights….

Monitoring and Alerting
It’s important that you have some indications that something is reaching its capacity. It might be the disk, connections to MySQL or auto_increment limit on a highly used table. There is quite the landscape to cover but here are a handful of helpful tools:
* Percona Monitoring Plugins
* Monyog
* New Relic

Query Tuning
Poorly performing SQL can be indicative that the configuration is incorrect, that there’s a missing index or that your development team needs a quick lesson on MySQL anti-patterns. Arm yourself with proof that the SQL statements are substandard using these resources and work with the source to make things more efficient:
* Percona Cloud Tools
* pt-query-digest, explain, indexes

High Availability
If you need to ensure that your application survives hiccups such as hardware failure or network impairment, a well deployed HA solution will give you the peace of mind that you can quickly mitigate bumps in the road.
Percona XtraDB Cluster, Galera
* Percona Replication Manager
* LinuxHA/Corosync/DRBD

A wise man once quoted “A backup today saves you tomorrow.” Covering all bases can be the difference between recovering from a catastrophic failure and job hunting. Mixing logical, physical and incremental backups while adding in some offsite copies can provide you with the safety net in the event that a small mistake like a dropped table is met or worse, all working copies of data and backups are lost in a SAN failure. It happens so be prepared.
* Percona XtraBackup
* mydumper
* mysqldump
* mysqlbinlog (5.6)
* mylvmbackup

We had some great questions from the attendees and regrettably were unable to answer them all, so here are some of them with my response.

Tips from the trenches for over-extended MySQL DBAsQ: I use MySQL on Amazon RDS. Isn’t much of the operations automated or do these tips still apply?
A: It’s not completely automated. There are still challenges to address and configuration opportunities, but understanding the limitations of RDS is key. For example, the location and size of the tmpdir is something you are unable to customise on RDS. You would typically review this config in a production environment if your workload required it. Any costly queries that perform operations requiring tmp area to sort (think OLAP) might not be a good fit on RDS due to this limitation. Getting to know the limitations around hosted or DBaaS services is time well spent to avoid explaining what keeps taking the application down in peak hours.

Q: What other parts of Percona Toolkit do you recommend for MySQL operations?
A: Percona Toolkit is a well-evolved suite of tools that all MySQL DBAs should familiarize themselves with. In particular I will fit many tools into my weekly workflow:


  • pt-online-schema-change
  • pt-table-checksum
  • pt-table-sync


  • pt-stalk
  • pt-pmp
  • pt-config-diff

Knowledge Gathering

  • pt-summary
  • pt-mysql-summary
  • pt-duplicate -key-checker

The key with Percona Toolkit is that many common tasks or problems that could cause you to reinvent the wheel are covered, mature and production ready. As with any tool, you should always read the label or in this case the documentation so you’re well aware what the tools can do, the risks and the features that you can make use of.

Q: HA – are there any solutions that you would stay away from?
A: Using any particular HA solution is going to be another R&D exercise. You will need to understand the tradeoffs, configuration options and compare between products. Some might have a higher TCO or lack functionality. Once the chosen solution is implemented it’s pertinent that the engineers understand the technology to be able to troubleshoot or utilize the functionality in the situation where failover needs to be instigated. I like HA solutions to be fast to failover to and some entail starting MySQL from cold.

Q: You mentioned having tested backups. How do you perform this?
A: Percona’s method is using a dedicated host with access to the backup files. Then with a combination of mysqlsandbox and pt-table-checksum we can discover if we trust the files we capture for disaster recovery. Many people underestimate the importance of this task.

Q: Percona Cloud Tools – how much does it cost?
A: Right now it’s a free service. Visit for more information, but in a nutshell Percona Cloud Tools is a hosted service providing access to query performance insights for all MySQL uses.

Q: Is there API access to Percona Cloud Tools for application integration?
A: There is currently not a public API available. It is on the roadmap, though. We’d be interested to hear more about your use case so please sign up for the service and try it out. After signing in, all pages include a Feedback link to share your thoughts and ideas such as how you’d like to use a public API.

Q: Can you use MHA with Percona XtraDB Cluster?
A: MHA is not something that can be used with Percona XtraDB Cluster (PXC). It’s common to partner PXC with HAProxy for making sure your writes are going to the appropriate node.

Q: Can MHA make automatic failover? If MHA has automatic failover, what do you recommend? Configure it for automatic failover?
A: MHA can make an automatic failover. Personally I prefer managed failover. When working with automated failover it’s important that failback is manual to avoid “flapping.” “Splitbrain” is an ailment that you don’t want to suffer from as well and auto failover removes the human judgment from the decision to relocate operations from a failed node onto a standby node. If you are going to vote for an automatic failover it is advised to test all potential failure scenarios and to employ a STONITH method to really ensure that the unresponsive node is not serving read/write traffic.

Q: What is the best way to detect database blocking from DML statements? Is there a tool that will show blocking after the fact so you don’t have to catch it real-time?
A: Once again, Percona has a tool called pt-deadlock-logger that can detect and log deadlocks. Detecting locking can be achieved using “SHOW ENGINE INNODB STATUS” or utilizing the information_schema.innodb_locks table. Some engineering might be required for this to be logged but those resources exist for use.

Q: Since you mentioned tinkering with ELK I was wondering if you had any tips on good Kibana dashboards to build to monitor MySQL databases/clusters?
A: ELK is something that I’m looking to publish some information on soon so watch this space!

Thanks again everyone for the great questions! And as a reminder, you can download my slides and view the recorded webinar here.

The post Tips from the trenches for over-extended MySQL DBAs appeared first on MySQL Performance Blog.


Innodb transaction history often hides dangerous ‘debt’

In many write-intensive workloads Innodb/XtraDB storage engines you may see hidden and dangerous “debt” being accumulated – unpurged transaction “history” which if not kept in check over time will cause serve performance regression or will take all free space and cause an outage. Let’s talk about where it comes from and what can you do to avoid running into the trouble.

Technical Background: InnoDB is an MVCC engine which means it keeps multiple versions of the rows in the database, and when rows are deleted or updated they are not immediately removed from the database but kept for some time – until they can be removed. For a majority of OLTP workloads they can be removed seconds after the change actually took place. In some cases though they might need to be kept for a long period of time – if there are some old transactions running in the system that might still need to look at an old database state. As of MySQL 5.6 Innodb has one or several “purge threads” which remove the old data that can be removed, though they might not be doing it fast enough for workloads with very intensive writes.

Does it really happen? I started looking into this problem based on some customer concerns and to my surprise I could very easily get the history to grow rapidly using basic sysbench “update” workload. It is especially easy with default innodb_purge_threads=1 setting but even with innodb_purge_threads=8 it grows rather rapidly.

If we take a look at the purging speed (which comes from innodb-metrics table) we can see what purge is being very much starved by the active concurrent sysbench process and it speeds up greatly when it is finished:

Now to be frank this is not an easy situation to get in the majority of workloads with short transactions when the undo space is kept in memory purge and is able to keep up. If Undo space however happens to be gone from buffer pool the purge speed can slow down drastically and the system might not be able to keep up anymore. How it could happen? There are 2 common variants….

Long Running Transaction: If you’re having some long running transaction, for example mysqldump, on the larger table the purging has to pause while that transaction is running and a lot of history will be accumulated. If there is enough IO pressure a portion of undo space will be removed from the buffer pool.

MySQL Restart: Even with modest history length restarting MySQL will wash away from memory and will cause purge to be IO bound. This is of course if you’re not using InnoDB Buffer Pool save and reload.

How do you check if your UNDO space is well cached? In Percona Server I can use those commands:

mysql> select sum(curr_size)*16/1024 undo_space_MB from XTRADB_RSEG;
| undo_space_MB |
|     1688.4531 |
1 row in set (0.00 sec)
mysql> select count(*) cnt, count(*)*16/1024 size_MB, page_type from INNODB_BUFFER_PAGE group by page_type;
| cnt    | size_MB   | page_type         |
|     55 |    0.8594 | EXTENT_DESCRIPTOR |
|      2 |    0.0313 | FILE_SPACE_HEADER |
|    108 |    1.6875 | IBUF_BITMAP       |
|  17186 |  268.5313 | IBUF_INDEX        |
| 352671 | 5510.4844 | INDEX             |
|     69 |    1.0781 | INODE             |
|    128 |    2.0000 | SYSTEM            |
|      1 |    0.0156 | TRX_SYSTEM        |
|   6029 |   94.2031 | UNDO_LOG          |
|  16959 |  264.9844 | UNKNOWN           |
10 rows in set (1.65 sec)

This shows what the total undo space size is now, 1.7GB, with less than 100MB cached in the buffer pool size….

Here are a few graphs from Running Heavy concurrent query during lighter workload where purging could keep up. In this case I used the “injection” benchmark in sysbench setting –trx-rate to 50% of what the system shown as peak.

mysql> select count(distinct k+ length(pad)) from sbtest1;
| count(distinct k+ length(pad)) |
|                       30916851 |
1 row in set (28 min 32.38 sec)

What we can see from those graphs is that InnoDB purging initially is progressing at a speed fast enough to keep up with inflow of transactions,
however as we kick up the complicated query, purging is stopped and when the query is done the purge speed settles on the new much lower level where it is not able to keep up with the workload anymore.

Now, there is recognition of this problem and there are options with innodb_max_purge_lag and innodb_max_purge_lag_delay to set the maximum length of the history after reaching which delay will be injected for DML statements up to a specified amount of microseconds.

Unfortunately it is not designed very well to use with real applications. The problems I see with its design are two fold….

Looking at Total History: If you think about it there are 2 kinds of records within the history – there are records that can be purged and there are ones which can’t be purged because they are needed by some active transaction. It is perfectly fine to have a lot of records in history if some long transaction is running – it is not the cause of the problem or overload, while we expect what “purgable history” should be low most of the time.

Looking at the Size rather than Rate of Change: Even worse, the history blowout prevention is looking at the current value to inject a delay and not at whenever it is that’s growing or already shrinking.

These together means that cases of long running transactions concurrently with OLTP workloads is handled very poorly – as long as history reaches the specified maximum amount the system will kick into overdrive, delaying all statements to the maximum extent possible, until the history falls back below the threshold. Here is how it looks on graphs:

As you see on the last graph, we got the purge_dml_delay_usec spiking to 10000us (the max I set) even as no purging can be done (see the blue line is at zero). It only actually starts to work on the history when the heavy query completes and really releases the breaks when the purge is complete. In this case the throughput of the system reduced more than 5 times when the delay was active – which would not work for most real-world systems.

Design Thoughts: So what would I change in the purging design of the configuration? I would like to see a better default configuration that should include multiple purge threads and purge delay (improved). I would find some way to measure not only history size but purgable history size and base purge delay on it.  Also make it based on the change rather than threshold – do just enough delay so the history is gradually shrinking. Also basing it on the undo space size instead of the number of transactions (which can vary in size) might be more practical and easier to auto-tune. We also can probably do better in terms of undo space caching – similar to Insert buffer, I’d like to keep it in memory say until 10% of the buffer pool size as removing from the cache something you know you will need very soon is bad business, as well as consider whether there is some form of read-ahead which can work to pre-read undo space which is needed. Right now I’ve tested and neither linear nor random read-ahead seems to help picking it up from disk with less random IO.

Practical Thoughts: Whatever improvements we’ll get from purging we have MySQL and Percona Server 5.6 systems to run for some years to come. So what are the practical steps we can do to manage purge history better?

Monitor: Make sure you are monitoring and graphing innodb_history_list_length. If you use large transactions, set alerts pretty high but do not leave it unchecked.

Configure Set innodb_purge_threads=8 or some other value if you have write intensive workload. Consider playing with innodb_max_purge_lag and innodb_max_purge_lag_delay but be careful – as currently designed it can really bring the server to its knees. You may consider using it interactively instead, changing them as run-time options if you spot history list growths unchecked, balancing current workload demands with resources allocated to purging.

Let it purge before shutdown: In many cases I find purge performance much worse after I restart MySQL Server because of caching. So the good approach might be just to remove the workload from MySQL server before shutting it down to let the purge of outstanding history complete – and only after that shut it down. If the server has crashed you might consider letting it complete purging before getting traffic routed back to it.

Use Innodb Buffer Pool Preload Use innodb_buffer_pool_dump_at_shutdown=on and innodb_buffer_pool_load_at_startup=on to ensure undo space is preloaded back to the buffer pool on startup.

P.S If you wonder where the graphs I have used came from – it is our Percona Cloud Tools – a very convenient way for analyses like these allowing access to all MySQL status variables, InnoDB metrics, tons of OS metrics and more.

The post Innodb transaction history often hides dangerous ‘debt’ appeared first on MySQL Performance Blog.


Galera data on Percona Cloud Tools (and other MySQL monitoring tools)

I was talking with a Percona Support customer earlier this week who was looking for Galera data on Percona Cloud Tools. (Percona Cloud Tools, now in free beta, is a hosted service providing access to query performance insights for all MySQL uses.)

The customer mentioned they were already keeping track of some Galera stats on Cacti, and given they were inclined to use Percona Cloud Tools more and more, they wanted to know if it was already supporting Percona XtraDB Cluster. My answer was: “No, not yet: you can install agents in each node (the regular way in the first node, then manually on the other nodes… and when prompted say “No” to create MySQL user and provide the one you’re using already) and monitor them as autonomous MySQL servers – but the concept of cluster and specially the “Galera bits” has yet to be implemented there.

Except I was wrong.

By “concept of cluster” I mean supporting the notion of group instances, which should allow a single cluster-wide view for metrics and query reports, such as the slow queries (which are recorded locally on the node where the query was run and thus observed in a detached way). This still needs to be implemented indeed, but it’s on the roadmap.

The “Galera bits” I mentioned above are the various “wsrep_” status variables. In fact, those refer to the Write Set REPlication patches (based in the wsrep API), a set of hooks applied to the InnoDB/XtraDB storage engine and other components of MySQL that modifies the way replication works (to put it in a very simplified way), which in turn are used by the Galera library to provide a “generic Synchronous Multi-Master replication plugin for transactional applications.” A patched version of Percona Server together with the Galera libray compose the base of Percona XtraDB Cluster.

As I found out only now, Percona Cloud Tools does collect data from the various wsrep_ variables and it is available for use – it’s just not shown by default. A user only needs to add a dashboard/chart manually on PCT to see these metrics:


Click on the picture to enlarge it

Now, I need to call that customer …

Monitoring the cluster

Since I’m touching this topic I thought it would be useful to include some additional information on monitoring a Galera (Percona XtraDB Cluster in particular) cluster, even though most of what I mention below has already been published in different posts here on the MySQL Performance Blog. There’s a succint documentation page bearing the same title of this section that indicates the main wsrep_ variables you should monitor to check the health status of the cluster and how well it’s coping with load along the time (performance). Remember you can get a grasp of the full set of variables at any time by issuing the following command from one (or each one) of the nodes:

mysql> SHOW GLOBAL STATUS LIKE "wsrep_%";

And for a broader and real time view of the wsrep_ status variables you can use Jay Janssen’s myq_gadgets toolkit, which he modified a couple of years ago to account for Galera.

There’s also a specific Galera-template available in our Percona Monitoring Plugins (PMP) package that you can use in your Cacti server. That would cover the “how well the cluster copes with load along the time,” providing historical graphing. And while there isn’t a Galera specific plugin for Nagios in PMP, Jay explains in another post how you can customize pmp-check-mysql-status to “check any status variable you like,” describing his approach to keep a cluster’s “health status” in check by setting alerts on specific stats, on a per node basis.

VividCortex recently added a set of templates for Galera in their product and you can also rely on Severalnines’ ClusterControl monitoring features to get that “global view” of your cluster that Percona Cloud Tools doesn’t yet provide. Even though ClusterControl is a complete solution for cluster deployment and management, focusing on the automation of the whole process, the monitoring part alone is particularly interesting as it encompasses cluster-wide information in a customized way, including the “Galera bits”. You may want to give it a try as the monitoring features are available in the Community version of the product (and if you’re a Percona customer with a Support contract covering Percona XtraDB Cluster, then you’re entitled to get support for it from us).

One thing I should note that differentiate the monitoring solutions from above is that while you can install Cacti, Nagios and ClusterControl as servers in your own infrastructure both Percona Cloud Tools and VividCortex are hosted, cloud-based services. Having said that, neither one nor the other upload sensitive data to the cloud and both provide options for query obfuscation.


Contrary to what I believed, Percona Cloud Tools does provide support for “Galera bits” (the wsrep_ status variables), even though it has yet to implement support for the notion of group instances, which will allow for cluster-wide view for metrics and query reports. You can also rely on the Galera template for Cacti provided by Percona Monitoring Plugins for historical graphing and some clever use of Nagios’ pmp-check-mysql-status for customized cluster alerts. VividCortex and ClusterControl also provide monitoring for Galera.

Percona Cloud Tools, now in free beta, is a hosted service providing access to query performance insights for all MySQL uses. After a brief setup, unlock new information about your database and how to improve your applications. Sign up to request access to the beta today.  

The post Galera data on Percona Cloud Tools (and other MySQL monitoring tools) appeared first on MySQL Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by