Jul
14
2017
--

Percona Monitoring and Management 1.2.0 is Now Available

Percona Monitoring and Management (PMM)

Percona announces the release of Percona Monitoring and Management 1.2.0 on July 14, 2017.

For installation instructions, see the Deployment Guide.


Changes in PMM Server

PMM Server 1.2.0 introduced the following changes:

New Features

  • PMM-737: New graphs in System Overview dashboard:
      • Memory Advanced Details
      • Saturation Metrics

  • PMM-1090: Added ESXi support for PMM Server virtual appliance.

UI Fixes

  • PMM-707: Fixed QPS metric in MySQL Overview dashboard to always show queries per second regardless of the selected interval.
  • PMM-708: Fixed tooltips for graphs that displayed incorrectly.
  • PMM-739, PMM-797: Fixed the PMM Server update feature on the landing page.
  • PMM-823: Fixed arrow padding for collapsible blocks in QAN.
  • PMM-887: Disabled the Add button when no table is specified for showing query info in QAN.
  • PMM-888: Disabled the Apply button in QAN settings when nothing is changed.
  • PMM-889: Fixed the switch between UTC and local time zone in the QAN time range selector.
  • PMM-909: Added message No query example when no example for a query is available in QAN.
  • PMM-933: Fixed empty tooltips for Per Query Stats column in the query details section of QAN.
  • PMM-937: Removed the percentage of total query time in query details for the TOTAL entry in QAN (because it is 100% by definition).
  • PMM-951: Fixed the InnoDB Page Splits graph formula in the MySQL InnoDB Metrics Advanced dashboard.
  • PMM-953: Enabled stacking for graphs in MySQL Performance Schema dashboard.
  • PMM-954: Renamed Top Users by Connections graph in MySQL User Statistics dashboard to Top Users by Connections Created and added the Connections/sec label to the Y-axis.
  • PMM-957: Refined titles for Client Connections and Client Questions graphs in ProxySQL Overview dashboard to mention that they show metrics for all host groups (not only the selected one).
  • PMM-961: Fixed the formula for Client Connections graph in ProxySQL Overview dashboard.
  • PMM-964: Fixed the gaps for high zoom levels in MySQL Connections graph on the MySQL Overview dashboard.
  • PMM-976: Fixed Orchestrator handling by supervisorctl.
  • PMM-1129: Updated the MySQL Replication dashboard to support new connection_name label introduced in mysqld_exporter for multi-source replication monitoring.
  • PMM-1054: Fixed typo in the tooltip for the Settings button in QAN.
  • PMM-1055: Fixed link to Query Analytics from Metrics Monitor when running PMM Server as a virtual appliance.
  • PMM-1086: Removed HTML code that showed up in the QAN time range selector.

Bug Fixes

  • PMM-547: Added warning page to Query Analytics app when there are no PMM Clients running the QAN service.
  • PMM-799: Fixed Orchestrator to show correct version.
  • PMM-1031: Fixed initialization of Query Profile section in QAN that broke after upgrading Angular.
  • PMM-1087: Fixed QAN package building.

Other Improvements

  • PMM-348: Added daily log rotation for nginx.
  • PMM-968: Added Prometheus build information.
  • PMM-969: Updated the Prometheus memory usage settings to leverage a new flag. For more information about setting memory consumption for PMM, see the FAQ.

Changes in PMM Client

PMM Client 1.2.0 introduced the following changes:

New Features

  • PMM-1114: Added PMM Client packages for Debian 9 (“stretch”).

Bug Fixes

  • PMM-481, PMM-1132: Fixed fingerprinting for queries with multi-line comments.
  • PMM-623: Fixed mongodb_exporter to display correct version.
  • PMM-927: Fixed bug with empty metrics for MongoDB query analytics.
  • PMM-1126: Fixed promu build for node_exporter.
  • PMM-1201: Fixed node_exporter version.

Other Improvements

  • PMM-783: Directed mongodb_exporter log messages to stderr and excluded many generic messages from the default INFO logging level.
  • PMM-756: Merged upstream node_exporter version 0.14.0.
    PMM deprecated several collectors in this release:

    • gmond – Out of scope.
    • megacli – Requires forking, to be moved to textfile collection.
    • ntp – Out of scope.

    It also introduced the following breaking change:

    • Collector errors are now a separate metric: node_scrape_collector_success, not a label on node_exporter_scrape_duration_seconds
  • PMM-1011: Merged upstream mysqld_exporter version 0.10.0.
    This release introduced the following breaking change:

    • mysql_slave_... metrics now include an additional connection_name label to support MariaDB multi-source replication.

About Percona Monitoring and Management

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

Percona Monitoring and Management is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

A live demo of PMM is available at pmmdemo.percona.com.

Please provide your feedback and questions on the PMM forum.

If you would like to report a bug or submit a feature request, use the PMM project in JIRA.

Jun
21
2017
--

Percona Monitoring and Management 1.1.5 is Now Available

Percona Monitoring and Management (PMM)

Percona announces the release of Percona Monitoring and Management 1.1.5 on June 21, 2017.

For installation instructions, see the Deployment Guide.


Changes in PMM Server

  • PMM-667: Fixed the Latency graph in the ProxySQL Overview dashboard to plot microsecond values instead of milliseconds.

  • PMM-800: Fixed the InnoDB Page Splits graph in the MySQL InnoDB Metrics Advanced dashboard to show correct page merge success ratio.

  • PMM-1007: Added links to Query Analytics from MySQL Overview and MongoDB Overview dashboards. The links also pass selected host and time period values.

    NOTE: These links currently open QAN2, which is still considered experimental.

Changes in PMM Client

  • PMM-931: Fixed pmm-admin script when adding MongoDB metrics monitoring for secondary in a replica set.

About Percona Monitoring and Management

Percona Monitoring and Management (PMM) is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

A live demo of PMM is available at pmmdemo.percona.com.

Please provide your feedback and questions on the PMM forum.

If you would like to report a bug or submit a feature request, use the PMM project in JIRA.

May
29
2017
--

Percona Monitoring and Management 1.1.4 is Now Available

Percona Monitoring and Management

Percona announces the release of Percona Monitoring and Management 1.1.4 on May 29, 2017.

For installation instructions, see the Deployment Guide.

This release includes experimental support for MongoDB in Query Analytics, including an updated QAN interface.

Query Analytics for MongoDB

To enable MongoDB query analytics, use the mongodb:queries alias when adding the service. As an experimental feature, it also requires the --dev-enable option:

sudo pmm-admin add --dev-enable mongodb:queries

NOTE: Currently, it monitors only collections that are present when you enable MongoDB query analytics. Query data for collections that you add later is not gathered. This is a known issue and it will be fixed in the future.
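To confirm that the new service was registered on the client, a quick sanity check (hedged sketch; the exact output layout varies by version) is to list everything pmm-admin currently manages:

sudo pmm-admin list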

Query Analytics Redesign

The QAN web interface was updated for better usability and functionality (including the new MongoDB query analytics data). The new UI is experimental and available by specifying /qan2 after the URL of PMM Server.

New Query Analytics web interface

NOTE: The button on the main landing page still points to the old QAN interface.

You can check out the new QAN web UI at https://pmmdemo.percona.com/qan2

New in PMM Server

  • PMM-724: Added the Index Condition Pushdown (ICP) graph to the MySQL InnoDB Metrics dashboard.
  • PMM-734: Fixed the MySQL Active Threads graph in the MySQL Overview dashboard.
  • PMM-807: Fixed the MySQL Connections graph in the MySQL Overview dashboard.
  • PMM-850: Updated the MongoDB RocksDB and MongoDB WiredTiger dashboards.
  • Removed the InnoDB Deadlocks and Index Collection Pushdown graphs from the MariaDB dashboard.
  • Added tooltips with descriptions for graphs in the MySQL Query Response Time dashboard. Similar tooltips will be gradually added to all graphs.

New in PMM Client

  • PMM-801: Improved PMM Client upgrade process to preserve credentials that are used by services.
  • Added options for pmm-admin to enable MongoDB cluster connections.

About Percona Monitoring and Management

Percona Monitoring and Management is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

A live demo of PMM is available at pmmdemo.percona.com.

Please provide your feedback and questions on the PMM forum.

If you would like to report a bug or submit a feature request, use the PMM project in JIRA.

Apr
21
2017
--

Percona Monitoring and Management 1.1.3 is Now Available

Percona Monitoring and Management

Percona announces the release of Percona Monitoring and Management 1.1.3 on April 21, 2017.

For installation instructions, see the Deployment Guide.

This release includes several new graphs in dashboards related to InnoDB and MongoDB operation, as well as smaller fixes and improvements.

New in PMM Server

  • PMM-649: Added the InnoDB Page Splits and InnoDB Page Reorgs graphs to the MySQL InnoDB Metrics Advanced dashboard.
  • Added the following graphs to the MongoDB ReplSet dashboard:
    • Oplog Getmore Time
    • Oplog Operations
    • Oplog Processing Time
    • Oplog Buffered Operations
    • Oplog Buffer Capacity
  • Added descriptions for graphs in the following dashboards:
    • MongoDB Overview
    • MongoDB ReplSet
    • PMM Demo

Changes in PMM Client

  • PMM-491: Improved pmm-admin error messages.
  • PMM-523: Added the --verbose option for pmm-admin add.
  • PMM-592: Added the --force option for pmm-admin stop.
  • PMM-702: Added the db.serverStatus().metrics.repl.executor stats to mongodb_exporter. These new stats will be used for graphs in future releases.
  • PMM-731: Added real-time checks to pmm-admin check-network output.
  • The following commands no longer require connection to PMM Server:
    • pmm-admin start --all
    • pmm-admin stop --all
    • pmm-admin restart --all
    • pmm-admin show-passwords

    NOTE: If you want to start, stop, or restart a specific service, connection to PMM Server is still required.

About Percona Monitoring and Management

Percona Monitoring and Management is an open-source platform for managing and monitoring MySQL and MongoDB performance. Percona developed it in collaboration with experts in the field of managed database services, support and consulting.

PMM is a free and open-source solution that you can run in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.

A live demo of PMM is available at pmmdemo.percona.com.

Please provide your feedback and questions on the PMM forum.

If you would like to report a bug or submit a feature request, use the PMM project in JIRA.

Sep
14
2016
--

Webinar Thursday Sept. 15: Identifying and Solving Database Performance Issues with PMM

PMM

Please join Roman Vynar, Lead Platform Engineer, on Thursday, September 15, 2016, at 10 am PDT (UTC-7) for a webinar on Identifying and Solving Database Performance Issues with PMM.

Database performance is the key to high-performance applications. Gaining visibility into the database is the key to improving database performance. Percona’s Monitoring and Management (PMM) provides the insight you need into your database environment.

In this webinar, we will demonstrate how using PMM for query analytics, in combination with database and host performance metrics, can more efficiently drive tuning, issue management and application development. Using PMM can result in faster resolution times, more focused development and a more efficient IT team.

Register for the webinar here.


Roman Vynar, Lead Platform Engineer
Roman is a Lead Platform Engineer at Percona. He joined the company to establish and develop the Remote DBA service from scratch. Over time, the growing service successfully expanded to Managed Services. Roman develops the monitoring tools, automated scripts, backup solutions, and the notification and incident tracking web system, and is currently leading the Percona Monitoring and Management project.
Jul
28
2016
--

Percona Monitoring and Management 1.0.2 Beta

Percona Monitoring and Management 1.0.2 Beta

Percona announces the release of Percona Monitoring and Management 1.0.2 Beta on 28 July 2016.

Like prior versions, PMM is distributed through Docker Hub and is free to download. Full instructions for download and installation of the server and client are available in the documentation.
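For orientation, here is a minimal sketch of pulling and starting the server container, assuming the percona/pmm-server image and tag on Docker Hub and the default single-port setup described below; see the documentation for the full, supported procedure (including persistent data volumes):

docker pull percona/pmm-server:1.0.2
docker run -d -p 80:80 --name pmm-server --restart always percona/pmm-server:1.0.2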

Notable changes to the tool include:

  • Upgraded to Grafana 3.1.0.
  • Upgraded to Prometheus 1.0.1.
  • Set default metrics retention to 30 days.
  • Eliminated port 9001. Now the container uses only one configurable port, 80 by default.
  • Eliminated the need to specify ADDRESS variable when creating Docker container.
  • Completely re-wrote pmm-admin with more functions.
  • Added ability to stop all services using the new pmm-admin.
  • Added support to name instances using the new pmm-admin.
  • Query Analytics Application updates:
    • Redesigned queries profile table
    • Redesigned metrics table
    • Redesigned instance settings page
    • Added sparkline charts
    • Added ability to show more than ten queries
  • Various updates for MongoDB dashboards.

The full release notes are available in the documentation. The documentation also includes details on installation and architecture.

A demonstration of the tool has been set up at pmmdemo.percona.com.

We have also implemented forums for the discussion of PMM.

Help us improve our software quality by reporting any bugs you encounter using our bug tracking system. As always, thanks for your continued support of Percona!

Some screen shots of the updates:

Note the new sparkline that shows the current load in context (so you know if the number is higher/normal/lower than normal), and the option to “Load next 10 queries” at the bottom of the listing.

Sparklines in QAN

Our admin tool was completely re-written with new functions:

pmm-admin --help output

 

pmm-admin list command output

 

pmm-admin check-network output, which provides information on the status of the client’s network connection to the server.
May
26
2016
--

Monitoring made easy with Percona App for Grafana

Percona-App-for-Grafana

Percona has released a new Percona App for Grafana!

Are you using Grafana 3.x with Prometheus’ time-series database? Now there is a “Percona App” available on Grafana.net! The app provides a set of dashboards for MySQL performance and system monitoring with the Prometheus datasource, and makes it easy for users to install them. The dashboards rely on the alias label in the Prometheus config and depend on a small patch applied to Grafana.

The dashboards in the app are:

  • Cross Server Graphs
  • Disk Performance
  • Disk Space
  • Galera Graphs
  • MySQL InnoDB Metrics
  • MySQL MyISAM Metrics
  • MySQL Overview
  • MySQL Performance Schema
  • MySQL Query Response Time
  • MySQL Replication
  • MySQL Table Statistics
  • MySQL User Statistics
  • Prometheus
  • Summary Dashboard
  • System Overview
  • TokuDB Graphs
  • Trends Dashboard

Percona App for Grafana

The Grafana and Prometheus teams are doing a fantastic job of bringing monitoring and time-series to the next level. They are making collecting and graphing metrics simple and more usable.

See my previous blog post for step-by-step instructions on how to install Grafana and Prometheus. Get the Percona App for Grafana today!

Feb
28
2016
--

Graphing MySQL performance with Prometheus and Grafana

prometheus

This post explains how you can quickly start using trending tools such as Prometheus and Grafana for monitoring and graphing MySQL and system performance.

I will try to keep this blog as short as possible, so you can quickly set things up before getting bored. I plan to cover the details in the next few posts. I am going to go through the installation process here in order to get some really useful and good-looking graphs in the end.

Overview

Prometheus is an open-source service monitoring system and time-series database. In short, this quite efficient daemon scrapes metrics from remote machines over HTTP and stores the data in its local time-series database. Prometheus provides a simple web interface, a very powerful query language, an HTTP API, and more. However, the storage is not designed to be durable for the time being.

The remote machines need to run exporters to expose metrics to Prometheus. We will be using the following two:

  • node_exporter – exposes system metrics (CPU, memory, disk, network, and so on).
  • mysqld_exporter – exposes MySQL server metrics.

Grafana is an open-source, feature-rich metrics dashboard and graph editor for Graphite, Elasticsearch, OpenTSDB, Prometheus, and InfluxDB. It is a powerful tool for visualizing large-scale measurement data, designed to work with time series. Grafana supports different types of graphs, allows for custom representation of individual metrics on a graph, and offers various methods of authentication, including LDAP.

Diagram

Here is a diagram of the setup we are going to use:
Prometheus + Grafana diagram

Prometheus setup

Install on the monitor host.

Get the latest tarball from GitHub:

wget https://github.com/prometheus/prometheus/releases/download/0.17.0rc2/prometheus-0.17.0rc2.linux-amd64.tar.gz
mkdir /opt/prometheus
tar zxf prometheus-0.17.0rc2.linux-amd64.tar.gz -C /opt/prometheus --strip-components=1

Create a simple config:

cat << EOF > /opt/prometheus/prometheus.yml
global:
  scrape_interval:     5s
  evaluation_interval: 5s
scrape_configs:
  - job_name: linux
    target_groups:
      - targets: ['192.168.56.107:9100']
        labels:
          alias: db1
  - job_name: mysql
    target_groups:
      - targets: ['192.168.56.107:9104']
        labels:
          alias: db1
EOF

where 192.168.56.107 is the IP address of the db host we are going to monitor and db1 is its short name. Note that the “alias” label is important here, because the predefined dashboards below rely on it to produce per-host graphs.
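Later, if you add more hosts to prometheus.yml (each with its own alias), you will not need to restart the daemon: Prometheus re-reads its configuration on SIGHUP. A minimal sketch, for once the server below is running:

# after appending another target (for example a db2 host) to /opt/prometheus/prometheus.yml,
# tell the running daemon to reload its configuration
kill -HUP $(pidof prometheus)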

Start Prometheus in foreground:

[root@centos7 ~]# cd /opt/prometheus
[root@centos7 prometheus]# ./prometheus
prometheus, version 0.17.0rc2 (branch: release-0.17, revision: 667c221)
  build user:       fabianreinartz@macpro
  build date:       20160205-13:35:53
  go version:       1.5.3
INFO[0000] Loading configuration file prometheus.yml     source=main.go:201
INFO[0000] Loading series map and head chunks...         source=storage.go:297
INFO[0000] 0 series loaded.                              source=storage.go:302
WARN[0000] No AlertManager configured, not dispatching any alerts  source=notification.go:165
INFO[0000] Starting target manager...                    source=targetmanager.go:114
INFO[0000] Target manager started.                       source=targetmanager.go:168
INFO[0000] Listening on :9090                            source=web.go:239

Now we can access Prometheus’ built-in web interface at http://monitor_host:9090

Prometheus web interface
If you look at the Status page from the top menu, you will see that our monitoring targets are down so far. Now let’s set them up – the Prometheus exporters.

Prometheus exporters setup

Install on the db host. Of course, you can use the same monitor host for the experiment. Obviously, this node must run MySQL.

Download the exporters (node_exporter and mysqld_exporter):

wget https://github.com/prometheus/node_exporter/releases/download/0.12.0rc3/node_exporter-0.12.0rc3.linux-amd64.tar.gz
wget https://github.com/prometheus/mysqld_exporter/releases/download/0.7.1/mysqld_exporter-0.7.1.linux-amd64.tar.gz
mkdir /opt/prometheus_exporters
tar zxf node_exporter-0.12.0rc3.linux-amd64.tar.gz -C /opt/prometheus_exporters
tar zxf mysqld_exporter-0.7.1.linux-amd64.tar.gz -C /opt/prometheus_exporters

Start node_exporter in foreground:

[root@centos7 ~]# cd /opt/prometheus_exporters
[root@centos7 prometheus_exporters]# ./node_exporter
INFO[0000] No directory specified, see --collector.textfile.directory  source=textfile.go:57
INFO[0000] Enabled collectors:                           source=node_exporter.go:146
INFO[0000]  - filesystem                                 source=node_exporter.go:148
INFO[0000]  - loadavg                                    source=node_exporter.go:148
INFO[0000]  - time                                       source=node_exporter.go:148
INFO[0000]  - vmstat                                     source=node_exporter.go:148
INFO[0000]  - diskstats                                  source=node_exporter.go:148
INFO[0000]  - filefd                                     source=node_exporter.go:148
INFO[0000]  - mdadm                                      source=node_exporter.go:148
INFO[0000]  - meminfo                                    source=node_exporter.go:148
INFO[0000]  - netdev                                     source=node_exporter.go:148
INFO[0000]  - textfile                                   source=node_exporter.go:148
INFO[0000]  - entropy                                    source=node_exporter.go:148
INFO[0000]  - stat                                       source=node_exporter.go:148
INFO[0000]  - uname                                      source=node_exporter.go:148
INFO[0000]  - conntrack                                  source=node_exporter.go:148
INFO[0000]  - netstat                                    source=node_exporter.go:148
INFO[0000]  - sockstat                                   source=node_exporter.go:148
INFO[0000]  - version                                    source=node_exporter.go:148
INFO[0000] Starting node_exporter v0.12.0rc3 at :9100    source=node_exporter.go:167

Unlike node_exporter, mysqld_exporter needs MySQL credentials. These privileges should be sufficient:

mysql> GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'prom'@'localhost' identified by 'abc123';
mysql> GRANT SELECT ON performance_schema.* TO 'prom'@'localhost';

Create .my.cnf and start mysqld_exporter in foreground:

[root@centos7 ~]# cd /opt/prometheus_exporters
[root@centos7 prometheus_exporters]# cat << EOF > .my.cnf
[client]
user=prom
password=abc123
EOF
[root@centos7 prometheus_exporters]#
[root@centos7 prometheus_exporters]# ./mysqld_exporter -config.my-cnf=".my.cnf"
INFO[0000] Starting Server: :9104                        file=mysqld_exporter.go line=1997
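Before moving on, you can quickly verify that both exporters respond on their default ports (a simple sanity check, assuming they were started as shown above):

curl -s http://localhost:9100/metrics | head
curl -s http://localhost:9104/metrics | head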

At this point we should see our endpoints are up and running on the Prometheus Status page:
Prometheus status page

Grafana setup

Install on the monitor host.

Grafana has RPM and DEB packages. The installation is as simple as installing one package.
RPM-based system:

yum install https://grafanarel.s3.amazonaws.com/builds/grafana-2.6.0-1.x86_64.rpm

or APT-based one:

wget https://grafanarel.s3.amazonaws.com/builds/grafana_2.6.0_amd64.deb
apt-get install -y adduser libfontconfig
dpkg -i grafana_2.6.0_amd64.deb

Open /etc/grafana/grafana.ini and edit the last section so that it ends with the following:

[dashboards.json]
enabled = true
path = /var/lib/grafana/dashboards

Percona has built the predefined dashboards for Grafana with Prometheus for you.

Let’s get them deployed:

git clone https://github.com/percona/grafana-dashboards.git
cp -r grafana-dashboards/dashboards /var/lib/grafana

Finally, start Grafana:

service grafana-server start

At this point, we are one step away from being done. Log into the Grafana web interface at http://monitor_host:3000 (admin/admin).

Go to Data Sources and add one for Prometheus:
Grafana datasource

Now check out the dashboards and graphs. For example, choose “System Overview” and the “Last 5 minutes” period in the top-right corner. You should see something similar:
Grafana screen
If your graphs are not populating, ensure the system time is correct on the monitor host.
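One quick way to check and correct clock drift, assuming the ntpdate utility is installed on the host:

# show the current system time
date
# one-off synchronization against a public NTP pool
ntpdate -u pool.ntp.org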

Samples

Here are some real-world samples (images are clickable and scrollable):



Enjoy!

Conclusion

Prometheus and Grafana are a great tandem for enabling monitoring and graphing capabilities for MySQL. The tools are pretty easy to deploy, and they are designed for time series with high efficiency in mind. In the next blog posts, I will talk more about technical aspects, problems, and related topics.

Feb
26
2016
--

Monitoring MongoDB Response Time

In this blog post, we’ll discuss how using Prometheus can help with monitoring MongoDB response time. I am currently comparing the performance of different storage engines on Percona Server for MongoDB, using a slightly customized version of Tim Callaghan’s sysbench-mongodb. Since I’m interested in measuring response time for database operations, I created a very simple exporter of response time data for Prometheus.

My first approach to measuring MongoDB response time was inspired by Ignacio Nin’s work on tcprstat, some years ago – and by the way the VividCortex query agent works (which is not surprising, since, to the best of my knowledge, Baron inspired tcprstat in the first place).

With this in mind, I created mongo-response-time, whose only function is to print to stdout the response time of every MongoDB query seen on the wire, along with a timestamp with one-second precision. My thanks go to the programmers of Facebook’s Go projects, as their code helped me hit the ground running.

As a first approach this was useful enough for my needs, and here is an example of a basic graph created from data generated by it:

I had to use a log scale as otherwise the graph was just a thick bar near the bottom, and a few outliers above. This is already useful, but it does not scale well. As an example, a sysbench-mongodb run of about an hour produced a csv file with a little over eight million data points. Larger rounds (like 24 hours) are just too difficult to handle with R (in one case, even though I had enough memory in my box to hold more than three copies of the file, read.csv aborted after running out of memory – if this happens to you, I suggest the use of RMySQL instead, which seems more memory-efficient than read.csv for ingesting large amounts of data).

For a second approach, I decided to live with less fidelity and settled for some quantiles and a max. For this, I created a simple Prometheus exporter that exposes 0.5, 0.9 and 0.99 quantiles, and also the max response time for every five second period.

With it, I was able to visualize the MongoDB response time data in Grafana in a way that is affordable and good enough for my needs, as can be seen in the following graphs:

The quantiles are calculated on the client side, using the Summary type from Prometheus’ Go client. The exporter also provides the same quantiles, but through a Histogram, which has the advantage of being more lightweight on clients. I decided to use the Summary as a source for this graph as the impact on the machine seems negligible for now, and I do find its data a bit more reliable (if I compare to calculating quantiles per periods analyzing the full data set in R). You can see how the max (a Gauge, glad you asked!) is useful to have, as it lets you find out about outliers that even the .99 quantile misses (which is expected, by the way).

If you want to try this out, you can find darwin and linux binaries here, and if you hit any problems, please reply here or email me directly at fernando (dot) ipar at Percona’s domain.

Happy monitoring!

Jan
14
2016
--

Prometheus as an Engine for MySQL Monitoring

When I first discovered Graphite years ago, I was very impressed with its monitoring capabilities. Compared to many RRD-based tools that were popular at the time (like Cacti), Graphite separated the captured data and graphs, allowing you to do all kinds of math and transformations while visualizing data. For example, I could plot the relationship between system queries and disk IO, and capture how the system was getting more IO bound over time. It also had reasonably high performance, allowing me to capture high-resolution data for medium-sized systems.

Just last year I discovered Prometheus, and it also impressed me. I think it has the potential to take Graphite’s flexibility to the next level. Though I am in no way a Prometheus expert, I want to share my understanding and thoughts on it so far.

Data Model

The data model is perhaps what attracted me to Prometheus the most. While it’s not obvious at first, when you do figure it out it has fantastic flexibility.

In the data model used by Whisper and Carbon in Graphite, you will use something like this to store MySQL data:

myapp.store.mysql.db01.status.questions = 5000

You can set up any hierarchy you like, but it has to have a hierarchy.

What Prometheus does instead is allow you to use a set of key-value pairs. The same data shown above could be presented like this:

questions_total{app="myapp", subsystem="store", engine="mysql", host="db01", source="status"} = 5000

(You most likely wouldn’t use this exact structure, but it’s good for illustration.)

The difference between these approaches is that Prometheus provides you with multiple dimensions on which you can filter and aggregate, plus you can add those dimensions later as you need them (without needing to redesign your tree hierarchy).

These labels are very dynamic, and I can change them in a second. For example, a MySQL server reporting as a “Master” might start reporting as a “Slave” in the very next second, and its data will be aggregated differently.

This is especially important in the modern, often cloud-based and virtualized world. For example, using Prometheus it is very easy to tag servers by their region or availability zones. I can also do things like compute MySQL space usage by both the database and storage engine. The possibilities are endless.
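To illustrate with the metric sketched above, the query language can then slice that same set of series by any label; here is a small sketch using standard Prometheus functions (the metric and label names are just the illustrative ones from the example):

sum(rate(questions_total[5m])) by (engine)
sum(rate(questions_total[5m])) by (app, host)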

Data Capture

Unlike Graphite – where the main model is push, and the hosts themselves choose what kind of information they want to push to the monitoring system and at which intervals – with Prometheus you set up “Exporters” that have the ability to export the data. It is up to the Prometheus server configuration to choose what data to sample and how frequently.

The clear benefit of Prometheus’ approach is that you can have as many servers as you like pulling the data, so it is very easy to create a development version of your system and play around with it – without affecting production. It also provides a simple pathway to high availability.

(Both the push and pull approaches have their benefits and drawbacks. Brian Brazil wrote an excellent article advertising the pull model of monitoring.)

Prometheus does create a few challenges for me. Unless I want to set up Service Discovery, it is a hassle to monitor any development/test VMs I might spin up (that would otherwise not be open to external access at all). While this isn’t the main use case for Prometheus, it is helpful for me to test the dashboard’s behavior with different operating systems, workloads, etc.

A more significant issue I discovered is dealing with some data that can’t be captured to multiple locations, because the data capture causes the data to change.

Here is a specific example: if I look at the events_statements_summary_by_digest table in PERFORMANCE_SCHEMA, there is a MAX_TIMER_WAIT field that shows me what the maximum query execution time is for the query pattern. If I want to get the maximum query execution time for every minute, for example, I would need to “truncate” the table to reset the statistics and let the maximum value be computed again. If I don’t perform that operation, the data becomes meaningless. If I make the exporter reset the statistics during the poll, however, I can’t pull it from two Prometheus servers.

This is one instance where the PERFORMANCE_SCHEMA design could be better. I could set up a cron job or MySQL Event to clear out the statistics regularly and get a proper maximum value for every five minutes, but that isn’t an overly convenient solution.
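As a rough illustration of that workaround (not something the exporter does for you), a cron entry could reset the digest statistics on a fixed schedule, assuming the mysql client can log in non-interactively:

# hypothetical crontab entry: reset statement digest statistics every 5 minutes
*/5 * * * * mysql -e "TRUNCATE TABLE performance_schema.events_statements_summary_by_digest"

Truncating a performance_schema summary table only resets its aggregated statistics; it does not touch your data.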

Another issue I discovered is that Prometheus doesn’t have any protection from bad (long) samples, or a very good method of detecting them. Let’s imagine that I have a MySQL server and I’m sampling status data every second. For some reason the call to SHOW GLOBAL STATUS took five seconds to execute. The truth is we don’t really know where in those five seconds the SHOW GLOBAL STATUS output corresponds – it might be at the very start, it might be at the very end. As such, you don’t really know how to process the counters. Whatever you do, you’re likely to be wrong. My preference in this case is to simply discard such samples, because even missing one percent of the samples is unlikely to change the whole picture. Constantly questioning whether you really had a couple of seconds where the QPS spiked to ten times the normal rate, or whether it was an invalid sample, is not something on which I want to waste a lot of time!

My preferred approach is to configure the SHOW GLOBAL STATUS capture so that if it takes more than ten percent of the capture interval, it will be discarded. For example, with a one-second capture I would allow 100ms for the capture. If the system is not keeping up with this scale, it would be better not to fool myself and instead reduce the capture resolution to around five seconds.

The only protection Prometheus offers is the scrape_timeout setting, but unfortunately it is limited to one-second resolution at this point. This is too coarse for any high-resolution capture.
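For reference, this is roughly where that setting lives in the scrape configuration used earlier; the values here are illustrative only:

scrape_configs:
  - job_name: mysql
    scrape_interval: 1s
    scrape_timeout: 1s
    target_groups:
      - targets: ['192.168.56.107:9104']
        labels:
          alias: db1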

Finally, it is also inconvenient to specify different resolutions for different data. In MySQL there is often a lot of data that I want to capture, but the resolution needed for each capture is different. For example, SHOW GLOBAL STATUS with one-second resolution is a must. At the same time, capturing the table size information from INFORMATION_SCHEMA with a one-second resolution would put too much load on MySQL, especially if there are a lot of tables. That level of resolution in this case isn’t really needed.

An attractive thing about Prometheus is that the Prometheus development team uses it a lot for MySQL monitoring, so the MySQL Exporter is really good. Most MySQL monitoring plugins I find resort to reporting just a few basic statistics, which is not nearly enough for advanced diagnostics. The Prometheus MySQL exporter gets tons of stuff and has been adding more in every version.

I also very much like that the Prometheus Exporters are designed around the HTTP protocol. This means it is very easy to debug or see what kind of data they capture. They present it simply in a web browser:

HTTP browser

Computational Model

I think the basic operations in Prometheus are pretty intuitive, but if you look at some of the advanced behaviors you’re going to find some inconveniences and some things that are likely to surprise you.

One inconvenience is that Prometheus is mainly designed for working with high-resolution data. If there are holes of more than five minutes (by default) in a time series, it could disappear from the graphs. As I mentioned, for MySQL there is quite a lot of information that it makes sense to capture less frequently than every five minutes.

Prometheus functions look into the “past,” and are designed in a way that the value of the function at any time (T) at which it could be computed is not going to change. It all looks clean and logical, but it causes issues with holes in the data capture.

As an example, let’s imagine the following five seconds, where the total number of questions since the start was successfully scraped on some seconds but not on others (due to a network issue, overload, etc.):

1  –  10
2  –  20
3  –  X
4  –  X
5  –  200

When we capture data of the “counter” type, the most important value is not the actual counter value at a given time but the rate of change of the counter over different time intervals. In this case, for example, the query rate was ten QPS for the one-to-two-second interval, and this can be clearly computed. But what was the query rate in the three-to-four-second interval? We don’t really have exact data, but that is fine: we know there have been 180 queries during the two-to-five-second interval, giving us 60 QPS (which we can use for the three-to-four-second interval).

This is NOT, however, how Prometheus will compute it if you use the irate() function (which is supposed to give you the highest resolution possible). When you evaluate irate() at T=4, it doesn’t have access to the T=5 value, even if it is in the database. Instead, it will look back and find the matching previous interval (one through two) and use the corresponding value of ten QPS.

I find this pretty puzzling and inconvenient.

There is also the rate() function, which can be used to get the average rate for a period. Unfortunately, it can’t estimate the rate for a smaller period based on the available data for a longer period. So, for example, if I ask the rate() function to compute the query rate at T=4, looking one second back, it will return no data. This isn’t a big deal when you’re working with data manually, but if you’re building zoomable dashboards, it means you can zoom in to the point where the data disappears (rather than stopping at the best possible value available).

Storage

Prometheus has its own high-performance storage system, which is based in part on LevelDB. It is highly optimized for time series and can achieve a very high level of compression. Be ready, though: each of your label combinations will create a different time series at the low level, and will require a lot of files. This isn’t really a problem with SSD drives and modern file systems, but it is something to look out for.

The capture engine and storage systems are rather efficient. Even though Prometheus does not have built-in clustering for “scaling out,” you can reportedly get more than 300K metrics per second captured on a single node. You can also use multiple Prometheus servers as needed.

The problem I found with Prometheus’ storage is that it is very self-contained: you can only use it from Prometheus or access it from the HTTP API. There are no tools at this point to export it for advanced analysis with R, or to dump the whole database into something like JSON format so it can be loaded into a different database engine. Some of these features might already be on the roadmap.

Purging and Aggregation

Retention configuration in Prometheus is pretty spartan. You can set storage.local.retention to the length of time you want to store the data, but that’s it. You can’t configure it to purge different data at different times. You can run multiple Prometheus instances to achieve this, but it’s quite a hassle. It’s also not possible to instruct Prometheus to automatically build summaries in order to execute low-resolution queries faster.

For example, if I have MySQL’s query rate captured every second but want to view the data over a long time period (e.g., how it changed over the last three months to estimate growth trends), data aggregated at one-hour intervals would be enough for that purpose.

There is support for recording rules to help achieve some of this, but it is not explicit or convenient in my opinion.
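For completeness, a hedged sketch of what such a rule might look like in the rule-file format of the Prometheus version discussed here (the metric name reuses the illustrative example from above):

# precompute an hourly question rate so low-resolution queries stay cheap
instance:questions:rate1h = rate(questions_total[1h])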

Looking at the Prometheus roadmap, some of these issues might not be fixed in Prometheus but achieved through integrating other systems such as InfluxDB (where experimental support already exists).

Purpose

A lot of these limitations make sense if you look at the purpose for which Prometheus was created: getting high-resolution data and being able to provide as much troubleshooting information as possible to its Alerting engine. It is not really designed for storing extensive history. Too bad! I would very much like to get both of those properties in a single system!

Visualization

When you install Prometheus, it comes with a built-in Expression Browser, which is great for debugging and interactive analyses. It also allows you to see what data you actually have in the database. It will disappoint you, however, if you’re looking for beautiful graphs!

HTTP installer

This shows that I have information about the MySQL query rate from two servers, as well as the available and configured labels.

If I want to pick one server and look at the average rate of queries per five minutes, I can do this:

HTTP graphs

There are some tools available in the graph to choose the time range and resolution.

You should be aware that visualizing data with the rate() function often shows you things that do not exist. In this case, it looks like the number of queries was gradually creeping up. In reality, I just started the benchmark so the number of queries jumped almost immediately. This is what the real situation looks like (using irate()):

HTTP graphs 2

As I explained before, irate() does not handle missing data points very well, plus it behaves somewhat bizarrely when you “zoom out” – providing instant rate information at sparse intervals (e.g., the instant rate computed every one second over 60 seconds) rather than smoothing things to averages.

There is also the PromDash tool available for Prometheus, which gives you nicer-looking dashboards and supports a lot of Prometheus’ features. Now that Grafana has official support for Prometheus, it is my preferred tool to build dashboards – especially since it supports multiple data sources besides Prometheus.

Summary

I’m very excited about Prometheus. It allows me to get a lot of information easily and use it for performance analyses in benchmarking or troubleshooting. It would be great if it also had a simple integrated solution for long-term storage and trending. I am also looking forward to better integration with Grafana and better documentation on how to create Prometheus-based dashboards – especially with some Prometheus-based examples!

Note: All of the above was written about Prometheus 0.16.1. Prometheus is rapidly evolving and may change with newer versions.
