Jan
15
2019
--

Customizing Per-Process Metrics in PMM

Process Memory Usage - a filtered graph in PMM

If you have set up per-process metrics in Percona Monitoring and Management, you may have found yourself in need of tuning it further to not only group processes together, but to display some of them in isolation. In this blogpost we will explore how to modify the rules for grouping processes, so that you can make the most out of this awesome PMM integration.

Let’s say you have followed the link above on how to set up the per-process metrics integration on PMM, and you have imported the dashboard to show these metrics. You will see something like the following:

PMM database and system monitoring and management software

This is an internal testing server we use, on which you can see a high number of VBoxHeadless (29) and mysqld (99) processes running. All the metrics in the dashboard are grouped by the name of the command used. But what if we want to see metrics for only one of these processes in isolation? As things stand, we cannot. It may not matter much in a testing environment, but if you are running multiple mysqld processes (or mongos, postgres, etc.) bound to different ports, you may want to see metrics for each of them separately.

Modifying the configuration file

Enter all.yaml!

In the process-exporter documentation on using a configuration file, we can see the following:

The general format of the -config.path YAML file is a top-level process_names section, containing a list of name matchers. […] A process may only belong to one group: even if multiple items would match, the first one listed in the file wins.

This means that even if we have two rules that would match a process, only the first one will be taken into account. This will allow us to both list processes by themselves, and not miss any non-grouped process. How? Let’s imagine we have the following processes running:

mysqld --port=1
mysqld --port=2
mysqld --port=3
mysqld --port=4

If we wanted to be able to tell apart the instances running on ports 1 and 2 from the rest, we could use the following rules:

- name: "mysqld_port_1"
 cmdline:
 - '.*mysqld.*port=1.*'
- name: "mysqld_port_2"
 cmdline:
 - '.*mysqld.*port=2.*'
- name: "{{.Comm}}"
 cmdline:
 - '.+'

The cmdline field takes the regular expression that is matched against the running process command line. In this case we made use of the fact that the instances were using different ports, but any difference in the command strings can be used. The last rule is the one that will default to “anything else”, with a regular expression that matches anything.

The default rule at the end makes sure you don’t miss any other process, so unless you want only some processes’ metrics collected, you should always have a rule like it.
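Before restarting process-exporter, you can roughly dry-run the regular expressions against the current process list. This is only a sanity-check sketch: grep -E syntax is close to, but not identical to, the Go regexp engine process-exporter uses.

shell> ps -eo args | grep -E '.*mysqld.*port=1.*' | grep -v grep   # should list only the port=1 instance
shell> ps -eo args | grep -E '.*mysqld.*port=2.*' | grep -v grep   # should list only the port=2 instance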

A real life working example of configuring per-process metrics

In case all this generic information didn’t make much sense, we will present a concrete example, hoping that it will make everything fit together nicely.

In this example we want the mysqld instance using the mysql_sandbox16679.sock socket to be isolated from all the others, and the VM with the ID ending in 97eafa2795da to be listed on its own. All other processes are to be grouped together by the basename of the executable.

You can check the output from ps aux to see the full command used. For instance:

shell> ps aux | grep 97eafa2795da
agustin+ 27785  0.7 0.2 5619280 542536 ?      Sl Nov28 228:24 /usr/lib/virtualbox/VBoxHeadless --comment centos_node1_1543443575974_22181 --startvm a0151e29-35dd-4c14-8e37-97eafa2795da --vrde config

So, we can use the following regular expression for it (we use .* to match any string):

.*VBoxHeadless.*97eafa2795da.*

The same applies to the regular expression for the mysqld process.

The configuration file will end up looking like:

shell>  cat /etc/process-exporter/all.yaml
process_names:
 - name: "Custom VBox"
   cmdline:
   - '.*VBoxHeadless.*97eafa2795da.*'
 - name: "Custom MySQL"
   cmdline:
   - '.*mysqld.*mysql_sandbox16679.sock.*'
 - name: "{{.Comm}}"
   cmdline:
   - '.+'

Let’s restart the service so that the changes apply, and check the graphs after five minutes. Note that you may have to reload the page for the changes to show up.

shell> systemctl restart process-exporter
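Before going back to the dashboard, you can also confirm that the exporter picked up the new groups. A minimal sketch, assuming process-exporter is listening on its default port (9256) and exposing the usual namedprocess_namegroup_* series:

shell> curl -s http://localhost:9256/metrics | grep 'groupname="Custom'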

After refreshing, we will see the new list of processes in the drop-down list:

A new list of processes in PMM after filtering

And after we select them, we will be able to see data for those processes in particular:

Thanks to the default configuration at the end, we are still capturing data from all the other mysqld processes. However, they will have their own group, as mentioned before:

System Processes Metrics graph in PMM

 

Nov
20
2018
--

How CVE-2018-19039 Affects Percona Monitoring and Management

CVE-2018-19039

Grafana Labs has released an important security update, and as you’re aware PMM uses Grafana internally. You’re probably curious whether this issue affects you. CVE-2018-19039 “File Exfiltration vulnerability Security fix” covers a recently discovered security flaw that allows any Grafana user with Editor or Admin permissions to have read access to the filesystem, performed with the same privileges as the Grafana process has.

We have good news: if you’re running PMM 1.10.0 or later (released April 2018), you’re not affected by this security issue.

The reason you’re not affected is an interesting one. CVE-2018-19039 relates to Grafana component PhantomJS, which Percona omitted when we changed how we build the version of Grafana embedded in Percona Monitoring and Management. We became aware of this via bug PMM-2837 when we discovered images do not render.

We fixed this image rendering issue and applied the required security update in 1.17. This ensures PMM is not vulnerable to CVE-2018-19039.

Users of PMM who are running release 1.1.0 (February 2017) through 1.9.1 (April 2018) are advised to upgrade ASAP. If you cannot immediately upgrade, we advise that you take two steps (a scripted sketch for the first step follows the list):

  1. Convert all Grafana users to Viewer role
  2. Remove all Dashboards that contain text panels
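If you have many users, the first step can be scripted against Grafana’s HTTP API. This is only a sketch under several assumptions: that PMM serves Grafana under the /graph/ path, that you have admin credentials (admin:admin is just a placeholder), and that all users live in the default organization.

shell> # list users in the current organization (shows id, login and role)
shell> curl -sk -u admin:admin https://pmm-server/graph/api/org/users
shell> # demote a given user id to the Viewer role
shell> curl -sk -u admin:admin -X PATCH -H 'Content-Type: application/json' -d '{"role":"Viewer"}' https://pmm-server/graph/api/org/users/<user_id>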

How to Get PMM Server

PMM is available for installation using three methods:

Nov
20
2018
--

Percona Monitoring and Management (PMM) 1.17.0 Is Now Available

Percona Monitoring and Management 1.17.0

Percona Monitoring and Management 1.17.0 (PMM) is a free and open-source platform for managing and monitoring MySQL, MongoDB, and PostgreSQL performance. You can run Percona Monitoring and Management 1.17.0 in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL®, MongoDB®, and PostgreSQL® servers to ensure that your data works as efficiently as possible.

Although we patched a bug in this release related to a Grafana CVE announcement, previous releases since 1.10 (April 2018) were not vulnerable. We encourage you to read our Grafana CVE blog post for further details.

In this release, we made six improvements and fixed 11 bugs.

Dashboard Improvements

We updated five Dashboards with improved Tooltips – if you haven’t seen this before, hover your mouse over the icon in the top left of most graph elements, and you’ll see a new box appear. This box provides a brief description of what the graph displays, along with links to related documentation resources so you can learn more. We hope you find the content useful!

Percona Monitoring and Management 1.17.0

The Dashboards we’re updating are:

  1. MySQL Amazon Aurora Metrics
  2. MySQL MyISAM/Aria Metrics
  3. MySQL Replication
  4. Prometheus Exporters Overview
  5. Trends

We hope you enjoy this release, and we welcome your feedback via the Percona PMM Forums!

New Features and Improvements

Fixed Bugs

  • PMM-3257: Grafana Security patch for CVE-2018-19039
  • PMM-3252: Update button in 1.16 is not visible when a newer version exists
  • PMM-3209: Special symbols in username or password prevent the addition of Remote Instances
  • PMM-2837: Image Rendering Does not work due to absent Phantom.JS binary
  • PMM-2428: Remove Host=All on dashboards where this variable does not apply
  • PMM-2294: No changes in zoomed out Cluster Size Graph if the node was absent for a short time
  • PMM-2289: SST Time Graph based on the wrong formula
  • PMM-2192: Memory leak in ProxySQL_Exporter when ProxySQL is down
  • PMM-2158: MongoDB “Query Efficiency – Document” arithmetic appears to be incorrectly calculated
  • PMM-1837: System Info shows duplicate hosts
  • PMM-1805: Available Downtime before SST Required doesn’t seem to be accurate – thanks to Yves Trudeau for the help

How to get PMM Server

PMM is available for installation using three methods:

Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

Nov
01
2018
--

Percona Monitoring and Management (PMM) 1.16.0 Is Now Available

Percona Monitoring and Management

PMM (Percona Monitoring and Management) is a free and open-source platform for managing and monitoring MySQL, MongoDB, and PostgreSQL performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL® and MongoDB® servers to ensure that your data works as efficiently as possible.

Percona Monitoring and Management

While much of the team is working on longer-term projects, we were able to provide the following features:

  • MySQL and PostgreSQL support for all cloud DBaaS providers – Use PMM Server to gather Metrics and Queries from remote instances!
  • Query Analytics + Metric Series – See Database activity alongside queries
  • Collect local metrics using node_exporter + textfile collector

We addressed 11 new features and improvements, and fixed 21 bugs.

MySQL and PostgreSQL support for all cloud DBaaS providers

You’re now able to connect PMM Server to your MySQL and PostgreSQL instances, whether they run in a cloud DBaaS environment or you simply want database metrics without the OS metrics. This can help you get up and running with PMM using minimal configuration and zero client installation. However, be aware that there are limitations: there won’t be any host-level dashboards populated for these nodes, since we don’t attempt to connect to the provider’s API, nor are we granted access to the instance in order to deploy an exporter.

How to use

Using the PMM Add Instance screen, you can now add instances from any cloud provider (AWS RDS and Aurora, Google Cloud SQL for MySQL, Azure Database for MySQL) and benefit from the same dashboards that you are already accustomed to. You’ll be able to collect Metrics and Queries from MySQL, and Metrics from PostgreSQL.  You can add remote instances by selecting the PMM Add Instance item in a PMM group of the system menu:

https://github.com/percona/pmm/blob/679471210d476a5e98d26a632318f1680cfd98a2/doc/source/.res/graphics/png/metrics-monitor.menu.pmm1.png?raw=true

where you will then have the opportunity to add a Remote MySQL or Remote PostgreSQL instance:

You’ll add the instance by supplying just the Hostname, database Username and Password (and optional Port and Name):

metrics-monitor.add-remote-mysql-instance.png

Also new as part of this release is the ability to display the nodes you’ve added on the RDS and Remote Instances screen:

metrics-monitor.add-rds-or-remote-instance1.png

Server activity metrics in the PMM Query Analytics dashboard

The Query Analytics dashboard now shows a summary of the selected host and database activity metrics in addition to the top ten queries listed in a summary table.  This brings a view of System Activity (CPU, Disk, and Network) and Database Server Activity (Connections, Queries per Second, and Threads Running) to help you better pinpoint query pileups and other bottlenecks:

https://raw.githubusercontent.com/percona/pmm/86e4215a58e788a8ec7cb1ebe679e1593c484078/doc/source/.res/graphics/png/query-analytics.png

Extending metrics with node_exporter textfile collector

While PMM provides an excellent solution for system monitoring, sometimes you may need a metric that’s not present in the list of node_exporter metrics out of the box. There is a simple method to extend the list of available metrics without modifying the node_exporter code: the textfile collector. We’ve enabled this collector by default, and it is deployed as part of linux:metrics in PMM Client.

The default directory for reading text files with the metrics is /usr/local/percona/pmm-client/textfile-collector, and the exporter reads files from it with the .prom extension. By default it contains an example file example.prom which has commented contents and can be used as a template.

You are responsible for running a cronjob or other regular process to generate the metric series data and write it to this directory.
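The files themselves use the Prometheus text exposition format. As a minimal sketch (the metric name here is made up purely for illustration), a script run from cron could write something like the following, using a temporary file and a rename so the collector never reads a half-written file:

shell> cat > /usr/local/percona/pmm-client/textfile-collector/batch_jobs.prom.$$ <<'EOF'
# HELP node_batch_jobs_running Number of in-house batch jobs currently running (example metric)
# TYPE node_batch_jobs_running gauge
node_batch_jobs_running 3
EOF
shell> mv /usr/local/percona/pmm-client/textfile-collector/batch_jobs.prom.$$ /usr/local/percona/pmm-client/textfile-collector/batch_jobs.prom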

Example – collecting docker container information

This example will show you how to collect the number of running and stopped docker containers on a host. It uses a crontab task, set with the following lines in the cron configuration file (e.g. in /etc/crontab):

*/1 * * * *     root   echo -n "" > /usr/local/percona/pmm-client/docker_all.prom; docker ps -a -q | wc -l | xargs echo node_docker_containers_total >> /usr/local/percona/pmm-client/docker_all.prom;
*/1 * * * *     root   echo -n "" > /usr/local/percona/pmm-client/docker_running.prom; docker ps -q | wc -l | xargs echo node_docker_containers_running_total >> /usr/local/percona/pmm-client/docker_running.prom;

The result of the commands is placed into the docker_all.prom and docker_running.prom files, which are read by the exporter. This creates two new metric series named node_docker_containers_total and node_docker_containers_running_total, which we’ll then plot on a graph:

pmm 1.16

New Features and Improvements

  • PMM-3195 Remove the light bulb
  • PMM-3194 Change link for “Where do I get the security credentials for my Amazon RDS DB instance?”
  • PMM-3189 Include Remote MySQL & PostgreSQL instance logs into PMM Server logs.zip system
  • PMM-3166 Convert status integers to strings on ProxySQL Overview Dashboard – Thanks, Iwo Panowicz for https://github.com/percona/grafana-dashboards/pull/239
  • PMM-3133 Include Metric Series on Query Analytics Dashboard
  • PMM-3078 Generate warning “how to troubleshoot postgresql:metrics” after failed pmm-admin add postgresql execution
  • PMM-3061 Provide Ability to Monitor Remote MySQL and PostgreSQL Instances
  • PMM-2888 Enable Textfile Collector by Default in node_exporter
  • PMM-2880 Use consistent favicon (Percona logo) across all distribution methods
  • PMM-2306 Configure EBS disk resize utility to run from crontab in PMM Server
  • PMM-1358 Improve Tooltips on Disk Space Dashboard – thanks, Corrado Pandiani for texts

Fixed Bugs

  • PMM-3202 Cannot add remote PostgreSQL to monitoring without specified dbname
  • PMM-3186 Strange “Quick ranges” tag appears when you hover over documentation links on PMM Add Instance screen
  • PMM-3182 Some sections for MongoDB are collapsed by default
  • PMM-3171 Remote RDS instance cannot be deleted
  • PMM-3159 Problem with enabling RDS instance
  • PMM-3127 “Expand all” button affects JSON in all queries instead of the selected one
  • PMM-3126 Last check displays locale format of the date
  • PMM-3097 Update home dashboard to support PostgreSQL nodes in Environment Overview
  • PMM-3091 postgres_exporter typo
  • PMM-3090 TLS handshake error in PostgreSQL metric
  • PMM-3088 It’s possible to downgrade PMM from Home dashboard
  • PMM-3072 Copy to clipboard is not visible for JSON in case of long queries
  • PMM-3038 Error adding MySQL queries when options for mysqld_exporters are used
  • PMM-3028 Mark points are hidden if an annotation isn’t added in advance
  • PMM-3027 Number of vCPUs for RDS is displayed incorrectly – report and proposal from Janos Ruszo
  • PMM-2762 Page refresh makes Search condition lost and shows all queries
  • PMM-2483 LVM in the PMM Server AMI is poorly configured/documented – reported by Olivier Mignault and a lot of people involved. Special thanks to Chris Schneider for checking the fix options
  • PMM-2003 Delete all info related to external exporters on pmm-admin list output

How to get PMM Server

PMM is available for installation using three methods:

Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

Oct
23
2018
--

Reclaiming space on your Docker PMM server deployment

reclaiming space Docker PMM

Recently we had a customer that had issues with a filled disk on the server hosting their Docker pmm-server environment. They were not able to access the web UI, or even stop the pmm-server container, because they had filled the /var/ mount point.

Setting correct expectations

The best way to avoid these kinds of issues in the first place is to plan ahead, and to know exactly what you are dealing with in terms of disk space requirements. Michael Coburn has written a great blog post on this matter:

https://www.percona.com/blog/2017/05/04/how-much-disk-space-should-i-allocate-for-percona-monitoring-and-management/

That post predates the move to Prometheus version 2 inside PMM Server, so you should take its numbers with a pinch of salt. On the other hand, it shows how you should plan ahead and think about the “steady state” disk usage, so it’s a good read.

That’s the first step to make sure you won’t get into trouble down the line. But, what happens if you are already in trouble? We’ll see two quick ways that may help reclaiming space.

Before anything else, you should stop any and all PMM clients running, so that you don’t have a race condition after recovering some space, in which metrics coming from the running clients will fill up whatever disk you had freed.

If pmm-admin stop --all won’t work, you can stop the services manually, or even manually kill the running processes as a last resort:

shell> systemctl list-unit-files | grep enabled | grep pmm | awk '{print $1}' | xargs -n 1 systemctl stop
shell> ps ax | egrep "exporter|qan-agent|pmm" | grep -v "ssh" | awk '{print $1}' | xargs kill

Removing unused containers

In order for the next steps to be as effective as possible, make sure there are no unused containers running, or stopped:

shell> docker ps -a

If you see any container that you know you don’t need anymore:

shell> docker stop <container_name>
shell> docker rm -v <container_name>

WARNING! Do not remove the pmm-data container!

Reclaiming space from unused Docker images

After you are done cleaning unused containers, we can move forward with removing unused images. Unless you are manually building your own Docker images, it’s really easy to get them again if needed, so you shouldn’t be afraid of deleting the ones that are not being used. In fact, you don’t need to explicitly download the images: by simply running docker run … image_name, Docker will automatically do it for you if the image is not found locally.

shell> docker image prune -a
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
...
Total reclaimed space: 3.97GB

Not too bad, we just reclaimed 4GB of disk space. This alone should be enough to restart the Docker service and have the pmm-server container back up. But we want more, just because we can!
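Before moving on to volumes, it’s worth checking where the remaining space is actually going. docker system df (available in reasonably recent Docker versions) breaks disk usage down by images, containers, and local volumes:

shell> docker system df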

Reclaiming space from orphaned Docker volumes

By default, when removing a container (with docker rm) Docker will not delete the associated volumes, unless you use the -v switch as we did above. This means that, unless you were aware of this fact, you probably have some other gigabytes’ worth of data occupying disk space. We can easily clean these up with the volume prune command:

shell> docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
...
Total reclaimed space: 115GB

Yeah… that’s a significant amount of disk space we just reclaimed! Again, make sure you don’t need any of the volumes from your past containers before doing this, since there is no turning back from it, obviously.

For earlier versions of Docker where this command is not available, you can check this link.

Planning ahead

As mentioned before, you should now revisit Michael’s blog post, and set the metrics retention and queries retention variables to whatever makes sense for your environment. Even if you plan ahead, you may not be counting on the additional variable overhead of images and orphaned volumes, so you may want to (warning: shameless plug for my own blog post ahead) use different mount points for your PMM deployment, and avoid using the shared /var/lib/docker/ mount point for it.
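For reference, with a Docker-based deployment the retention settings are typically passed as environment variables when (re)creating the pmm-server container. This is only a sketch under a few assumptions—the PMM 1.x METRICS_RETENTION (a duration) and QUERIES_RETENTION (days) variables, the usual pmm-data data container, and an image tag matching your current version—so check the documentation for your release before applying it:

shell> docker stop pmm-server && docker rm pmm-server
shell> docker run -d -p 80:80 --volumes-from pmm-data --name pmm-server --restart always \
         -e METRICS_RETENTION=360h -e QUERIES_RETENTION=8 percona/pmm-server:<version>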

PMM also includes a Disk Space usage dashboard, that you can use to monitor this.

Don’t forget to start back up your PMM clients, and continue to monitor them 24×7!

Photo by Andrew Wulf on Unsplash

Oct
10
2018
--

Percona Monitoring and Management (PMM) 1.15.0 Is Now Available

Percona Monitoring and Management

Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL® and MongoDB® performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL® and MongoDB® servers to ensure that your data works as efficiently as possible.

Percona Monitoring and Management

This release offers two new features for both the MySQL Community and Percona Customers:

  • MySQL Custom Queries – Turn a SELECT into a dashboard!
  • Server and Client logs – Collect troubleshooting logs for Percona Support

We addressed 17 new features and improvements, and fixed 17 bugs.

MySQL Custom Queries

In 1.15 we are introducing the ability to take a SQL SELECT statement and turn the result set into metric series in PMM. The queries are executed at the LOW RESOLUTION level, which by default is every 60 seconds. A key advantage is that you can extend PMM to profile metrics unique to your environment (see the users table example), or to introduce support for a table that isn’t part of PMM yet. This feature is on by default and only requires that you edit the configuration file and use valid YAML syntax. The configuration file is /usr/local/percona/pmm-client/queries-mysqld.yml.
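Because the exporter expects valid YAML, it can be worth checking that the file still parses after you edit it. A minimal sketch, assuming python3 with the PyYAML module is available on the client host:

shell> python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1])); print("YAML OK")' /usr/local/percona/pmm-client/queries-mysqld.yml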

Example – Application users table

We’re going to take a fictional MySQL users table that also tracks the number of upvotes and downvotes, and we’ll convert this into two metric series, with a set of seven labels, where each label can also store a value.

Browsing metrics series using Advanced Data Exploration Dashboard

Let’s look at the output so we understand the goal – take data from a MySQL table, store it in PMM, then display it as a metric series. Using the Advanced Data Exploration Dashboard you can review your metric series. Exploring the metric series app1_users_metrics_downvotes, we see the following:

PMM Advanced Data Exploration Dashboard

MySQL table

Let’s assume you have the following users table, which includes true/false, string, and integer types.

SELECT * FROM `users`
+----+------+--------------+-----------+------------+-----------+---------------------+--------+---------+-----------+
| id | app  | user_type    | last_name | first_name | logged_in | active_subscription | banned | upvotes | downvotes |
+----+------+--------------+-----------+------------+-----------+---------------------+--------+---------+-----------+
|  1 | app2 | unprivileged | Marley    | Bob        |         1 |                   1 |      0 |     100 |        25 |
|  2 | app3 | moderator    | Young     | Neil       |         1 |                   1 |      1 |     150 |        10 |
|  3 | app4 | unprivileged | OConnor   | Sinead     |         1 |                   1 |      0 |      25 |        50 |
|  4 | app1 | unprivileged | Yorke     | Thom       |         0 |                   1 |      0 |     100 |       100 |
|  5 | app5 | admin        | Buckley   | Jeff       |         1 |                   1 |      0 |     175 |         0 |
+----+------+--------------+-----------+------------+-----------+---------------------+--------+---------+-----------+

Explaining the YAML syntax

We’ll go through a simple example and mention what’s required for each line. The metric series name is constructed from the first line, with the column name appended. Therefore the number of metric series per table will be the count of columns that are of type GAUGE or COUNTER. This metric series will be called app1_users_metrics_downvotes:

app1_users_metrics:                                 ## leading section of your metric series.
  query: "SELECT * FROM app1.users"                 ## Your query. Don't forget the schema name.
  metrics:                                          ## Required line to start the list of metric items
    - downvotes:                                    ## Name of the column returned by the query. Will be appended to the metric series.
        usage: "COUNTER"                            ## Column value type.  COUNTER will make this a metric series.
        description: "Number of upvotes"            ## Helpful description of the column.

Full queries-mysqld.yml example

Each column in the SELECT is named in this example, but that isn’t required, you can use a SELECT * as well.  Notice the format of schema.table for the query is included.

---
app1_users_metrics:
  query: "SELECT app,first_name,last_name,logged_in,active_subscription,banned,upvotes,downvotes FROM app1.users"
  metrics:
    - app:
        usage: "LABEL"
        description: "Name of the Application"
    - user_type:
        usage: "LABEL"
        description: "User's privilege level within the Application"
    - first_name:
        usage: "LABEL"
        description: "User's First Name"
    - last_name:
        usage: "LABEL"
        description: "User's Last Name"
    - logged_in:
        usage: "LABEL"
        description: "User's logged in or out status"
    - active_subscription:
        usage: "LABEL"
        description: "Whether User has an active subscription or not"
    - banned:
        usage: "LABEL"
        description: "Whether user is banned or not"
    - upvotes:
        usage: "COUNTER"
        description: "Count of upvotes the User has earned.  Upvotes once granted cannot be revoked, so the number can only increase."
    - downvotes:
        usage: "GAUGE"
        description: "Count of downvotes the User has earned.  Downvotes can be revoked so the number can increase as well as decrease."
...
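Once collected, each row of the result set becomes a sample, with the LABEL columns attached as labels. Roughly—the exact formatting and any additional labels PMM adds may differ—the exposed series for the app1 row in the table above would look like:

app1_users_metrics_upvotes{app="app1",user_type="unprivileged",first_name="Thom",last_name="Yorke",logged_in="0",active_subscription="1",banned="0"} 100
app1_users_metrics_downvotes{app="app1",user_type="unprivileged",first_name="Thom",last_name="Yorke",logged_in="0",active_subscription="1",banned="0"} 100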

We hope you enjoy this feature, and we welcome your feedback via the Percona forums!

Server and Client logs

We’ve enhanced the volume of data collected from both the Server and Client perspectives.  Each service provides a set of files designed to be shared with Percona Support while you work on an issue.

Server

From the Server, we’ve improved the logs.zip service to include:

  • Prometheus targets
  • Consul nodes, QAN API instances
  • Amazon RDS and Aurora instances
  • Version
  • Server configuration
  • Percona Toolkit commands

You retrieve the link from your PMM server using this format: https://pmmdemo.percona.com/managed/logs.zip
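From the command line this is just an HTTP download. A minimal sketch (substitute your own server address; add -u user:password if your PMM server has HTTP authentication enabled, and -k if it uses a self-signed certificate):

shell> curl -k -o pmm-server-logs.zip https://<your-pmm-server>/managed/logs.zip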

Client

On the Client side we’ve added a new action called summary which fetches logs, network, and Percona Toolkit output in order to share with Percona Support. To initiate a Client side collection, execute:

pmm-admin summary

The output will be a file you can use to attach to your Support ticket.  The single file will look something like this:

summary__2018_10_10_16_20_00.tar.gz

New Features and Improvements

  • PMM-2913 – Provide ability to execute Custom Queries against MySQL – Credit to wrouesnel for the framework of this feature in wrouesnel/postgres_exporter!
  • PMM-2904 – Improve PMM Server Diagnostics for Support
  • PMM-2860 – Improve pmm-client Diagnostics for Support
  • PMM-1754 – Provide functionality to easily select query and copy it to clipboard in QAN
  • PMM-1855 – Add swap to AMI
  • PMM-3013 – Rename PXC Overview graph Sequence numbers of transactions to IST Progress
  • PMM-2726 – Abort data collection in Exporters based on Prometheus Timeout – MySQLd Exporter
  • PMM-3003 – PostgreSQL Overview Dashboard Tooltip fixes
  • PMM-2936 – Some improvements for Query Analytics Settings screen
  • PMM-3029 – PostgreSQL Dashboard Improvements

Fixed Bugs

  • PMM-2976 – Upgrading to PMM 1.14.x fails if dashboards from Grafana 4.x are present on an installation
  • PMM-2969 – rds_exporter becomes throttled by CloudWatch API
  • PMM-1443 – The credentials for a secured server are exposed without explicit request
  • PMM-3006 – Monitoring over 1000 instances is displayed imperfectly on the label
  • PMM-3011 – PMM’s default MongoDB DSN is localhost, which is not resolved to IPv4 on modern systems
  • PMM-2211 – Bad display when using old range in QAN
  • PMM-1664 – Infinite loading with wrong queryID
  • PMM-2715 – Since pmm-client-1.9.0, pmm-admin detects CentOS/RHEL 6 installations using linux-upstart as service manager and ignores SysV scripts
  • PMM-2839 – Tablestats safety precaution does not work for RDS/Aurora instances
  • PMM-2845 – pmm-admin purge causes client to panic
  • PMM-2968 – pmm-admin list shows empty data source column for mysql:metrics
  • PMM-3043 – Total Time percentage is incorrectly shown as a decimal fraction
  • PMM-3082 – Prometheus Scrape Interval Variance chart doesn’t display data

How to get PMM Server

PMM is available for installation using three methods:

Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

Oct
09
2018
--

PostgreSQL Monitoring: Set Up an Enterprise-Grade Server (and Sign Up for Webinar Weds 10/10…)

PostgreSQL Monitoring

This is the last post in our series on building an enterprise-grade PostgreSQL setup using open source tools, and we’ll be covering monitoring.

The previous posts in this series discussed aspects such as security, backup strategy, high availability, connection pooling and load balancing, extensions, and detailed logging in PostgreSQL. Tomorrow, Wednesday, October 10 at 10AM EST, we will be reviewing these topics together and showcasing them in practice in a webinar format: we hope you can join us!

 

Monitoring databases

The importance of monitoring the activity and health of production systems is unquestionable. When it comes to the database, with its high number of customizable settings, the ability to track its various metrics (status counters and gauges) allows for the maintenance of a historical record of its performance over time. This can be used for capacity planning, troubleshooting, and validation.

When it comes to capacity planning, a monitoring solution is a helpful tool to help you assess how the current setup is faring. At the same time, it can help predict future needs based on trends, such as the increase of active connections, queries, and CPU usage. For example, an increase in CPU usage might be due to a genuine increase in workload, but it could also be a sign of unoptimized queries growing in popularity. In which case, comparing CPU with disk access might provide a more complete view of what is going on.

Being able to easily correlate data like this helps you to catch minor issues and to plan accordingly, sometimes allowing you to avoid the easier but more costly option of scaling up to mitigate such problems. But having the right monitoring solution is truly invaluable when it comes to investigative work and root cause analysis. Trying to understand a problem that has already taken place is a rather complicated, and often unenviable, task unless you have kept a continuous, watchful eye on the setup the whole time.

Finally, a monitoring solution can help you validate changes made in the business logic in general or in the database configuration in specific. By comparing prior and post results for a given metric or for overall performance, you can observe the impact of such changes in practice.

Monitoring PostgreSQL with open source solutions

There is a number of monitoring solutions for PostgreSQL and postgresql.org’s Wiki provides an extensive list, albeit a little outdated. It categorizes the main monitoring solutions into two distinct categories: those that can be identified as generic solutions—and can be extended to cover different technologies through custom plugins—and those labeled as Postgres-centric, which are specific to PostgreSQL.

In the first group, we find venerated open source monitoring tools such as Munin, Zabbix, and Cacti. Nagios could have also been added to this group, but it was instead indirectly included in the “Checkers” group. That category includes monitoring scripts that can be used both in stand-alone mode or as feeders (plugins) for “Nagios-like software”. Examples of these are check_pgactivity and check_postgres.

One omission from this list is Grafana, a modern time series analytics platform conceived to display metrics from a number of different data sources. Grafana includes a solution packaged as a PostgreSQL native plugin. Percona has built its Percona Monitoring and Management (PMM) platform around Grafana, using Prometheus as its data source. Since version 1.14.0, PMM supports PostgreSQL. Query Analytics (QAN) integration is coming soon.

An important factor that all these generic solutions have in common is that they are widely used for the monitoring of a diverse collection of services, like you’d normally find in enterprise-like environments. It’s common for a given company to adopt one, or sometimes two, such solutions with the aim of monitoring their entire infrastructure. This infrastructure often includes a heterogeneous combination of databases and application servers.

Nevertheless, there is a place for complementary Postgres-centric monitoring solutions in such enterprise environments too. These solutions are usually implemented with a specific goal in mind. Two examples we can mention in this context are PGObserver, which has a focus on monitoring stored procedures, and pgCluu, with its focus on auditing.

Monitoring PostgreSQL with PMM

We built an enterprise-grade PostgreSQL setup for the webinar, and use PMM for monitoring. We will be showcasing some of PMM’s main features, and highlighting some of the most important metrics to watch, during our demo. You may want to have a look at this demo setup to get a feel for how our PostgreSQL Overview dashboard looks:

You can find instructions on how to set up PMM for monitoring your PostgreSQL server in our documentation space. And if there’s still time, sign up for tomorrow’s webinar!

 

Sep
28
2018
--

Scaling Percona Monitoring and Management (PMM)

PMM tested with 1000 nodes

Starting with PMM 1.13, PMM uses Prometheus 2 for metrics storage, which tends to be the heaviest resource consumer of CPU and RAM. With Prometheus 2 Performance Improvements, PMM can scale to more than 1000 monitored nodes per instance in the default configuration. In this blog post we will look into PMM scaling and capacity planning—how to estimate the resources required, and what drives resource consumption.

PMM tested with 1000 nodes

We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM ?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

  1. Number of samples (data points) injected into PMM per second
  2. Number of distinct time series they belong to (cardinality)
  3. Number of distinct query patterns your application uses
  4. Number of queries you run against PMM, through the user interface or the API, and their complexity

These specifically can be impacted by:

  • Software version – modern database software versions expose more metrics
  • Software configuration – some metrics are only exposed in certain configurations
  • Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality
  • Exporter configuration – disabling collectors can reduce the amount of data collected
  • Scrape frequency – controlled by METRICS_RESOLUTION

All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance and a starting point for your research.

On the system supporting 1000 instances we observed the following performance:

Performance PMM 1000 nodes load

As you can see, we have more than 2,000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important as Prometheus 2 does not have good options for limiting memory consumption. If you do not have enough memory to handle your workload, then it will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM Installation. A test installation with 1GB of memory is possible. However, it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problem.

With powerful systems (8GB or more) you can plan for approximately eight monitored systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.

To calculate the CPU usage resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).
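As a back-of-the-envelope example using those ratios (a rough sketch only; as noted above, your specific workload can change the requirements by a factor of ten):

shell> nodes=200
shell> echo "approx. RAM:  $(( (nodes + 7) / 8 )) GB"      # ~8 monitored systems per 1GB of RAM
shell> echo "approx. CPU:  $(( (nodes + 49) / 50 )) cores" # ~50 monitored systems per core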

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the “Home Dashboard”. This becomes way too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select a single host, for example “pmm-server”, in your home dashboard and save it before adding a large number of hosts to the system.

set home dashboard for PMM

Or you can make some other dashboard of your choice and set it as the home dashboard.

Summary

  • More than 1,000 monitored systems is possible per single PMM server
  • Your specific workload and configuration may significantly change the resources required
  • If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM


Sep
20
2018
--

Prometheus 2 Time Series Storage Performance Analyses

cpu saturation and max core usage

Prometheus 2 time series database (TSDB) is an amazing piece of engineering, offering a dramatic improvement compared to “v2” storage in Prometheus 1 in terms of ingest performance, query performance and resource use efficiency. As we’ve been adopting Prometheus 2 in Percona Monitoring and Management (PMM), I had a chance to look into the performance of Prometheus 2 TSDB. This blog post details my observations.

Understanding the typical Prometheus workload

For someone who has spent their career working with general purpose databases, the typical workload of Prometheus is quite interesting. The ingest rate tends to remain very stable: typically, devices you monitor will send approximately the same amount of metrics all the time, and infrastructure tends to change relatively slowly.

Queries to the data can come from multiple sources. Some of them, such as alerting, tend to be very stable and predictable too. Others, such as users exploring data, can be spiky, though it is not common for this to be the largest part of the load.

The Benchmark

In my assessment, I focused on handling an ingest workload. I deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14) on Linode using this StackScript. For maximally realistic load generation, I spun up multiple MySQL nodes running some real workloads (the Sysbench TPC-C test), with each emulating 10 nodes running MySQL and Linux using this StackScript.

The observations below are based on a Linode instance with eight virtual cores and 32GB of memory, running 20 load drivers simulating the monitoring of 200 MySQL instances. Or, in Prometheus terms: some 800 targets, 440 scrapes/sec, 380K samples ingested per second, and 1.7M active time series.

Design Observations

The conventional approach of traditional databases, and the approach that Prometheus 1.x used, is to limit the amount of memory. If this amount of memory is not enough to handle the load, you will have high latency and some queries (or scrapes) will fail. Prometheus 2 memory usage is instead configured by storage.tsdb.min-block-duration, which determines how long samples will be stored in memory before they are flushed (the default being 2h). How much memory it requires will depend on the number of time series, the number of labels you have, and your scrape frequency, in addition to the raw ingest rate. On disk, Prometheus tends to use about three bytes per sample. Memory requirements, though, will be significantly higher.
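To put the three-bytes-per-sample figure in context with the benchmark’s ingest rate, here is a rough back-of-the-envelope calculation (ignoring compaction and retention effects):

shell> samples_per_sec=380000; bytes_per_sample=3
shell> echo "$(( samples_per_sec * bytes_per_sample * 86400 / 1024**3 )) GB of TSDB data per day (approx.)"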

While the configuration knob exists to change the head block size, tuning this by users is discouraged. So you’re limited to providing Prometheus 2 with as much memory as it needs for your workload.

If there is not enough memory for Prometheus to handle your ingest rate, then it will crash with out of memory error message or will be killed by OOM killer.

Adding more swap space as a “backup” in case Prometheus runs out of RAM does not seem to work as using swap space causes a dramatic memory usage explosion. I suspect swapping does not play well with Go garbage collection.

Another interesting design choice is aligning block flushes to specific times, rather than to time since start:

head block Prometheus 2

As you can see from this graph, flushes happen every two hours, on the clock. If you change min-block-duration  to 1h, these flushes will happen every hour at 30 minutes past the hour.

(If you want to see this and other graphs for your Prometheus Installation you can use this Dashboard. It has been designed for PMM but can work for any Prometheus installation with little adjustments.)

While the active block—called the head block—is kept in memory, blocks containing older data are accessed through mmap(). This eliminates the need to configure a cache separately, but it also means you need to allocate plenty of memory for the OS cache if you want to query data older than what fits in the head block.

It also means the virtual memory you will see Prometheus 2 using will get very high: do not let it worry you.

Prometheus process memory usage

Another interesting design choice is WAL configuration. As you can see in the storage documentation, Prometheus protects from data loss during a crash by having WAL log. The exact durability guarantees, though, are not clearly described. As of Prometheus 2.3.2, Prometheus flushes the WAL log every 10 seconds, and this value is not user configurable.

Compactions

Prometheus TSDB is designed somewhat similarly to LSM storage engines – the head block is flushed to disk periodically, while at the same time compactions that merge a few blocks together are performed, to avoid the need to scan too many blocks for queries.

Here is the number of data blocks I observed on my system after a 24h workload:

active data blocks

If you want more details about storage, you can check out the meta.json file which has additional information about the blocks you have, and how they came about.

{
       "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
       "minTime": 1536472800000,
       "maxTime": 1536494400000,
       "stats": {
               "numSamples": 8292128378,
               "numSeries": 1673622,
               "numChunks": 69528220
       },
       "compaction": {
               "level": 2,
               "sources": [
                       "01CPYRY9MS465Y5ETM3SXFBV7X",
                       "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                       "01CPZ6NR4Q3PDP3E57HEH760XS"
               ],
               "parents": [
                       {
                               "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                               "minTime": 1536472800000,
                               "maxTime": 1536480000000
                       },
                       {
                               "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                               "minTime": 1536480000000,
                               "maxTime": 1536487200000
                       },
                       {
                               "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                               "minTime": 1536487200000,
                               "maxTime": 1536494400000
                       }
               ]
       },
       "version": 1
}

Compactions in Prometheus are triggered at the time the head block is flushed, and several compactions may be performed at these intervals:

Prometheus 2 compactions

Compactions do not seem to be throttled in any way, causing huge spikes of disk IO usage when they run:

spike in io activity for compactions

And a spike in CPU usage:

spike in CPU usage during compactions

This, of course, can cause negative impact to the system performance. This is also why it is one of the greatest questions in LSM engines: how to run compactions to maintain great query performance, but not cause too much overhead.

Memory utilization as it relates to the compaction process is also interesting:

Memory utilization during compaction process

We can see that after compaction a lot of memory changes from “Cached” to “Free”, meaning potentially valuable data is washed out from memory. I wonder if fadvise() or other techniques to minimize data washout from cache are in use, or if this is caused by the fact that the blocks which were cached are destroyed by the compaction process.

Crash Recovery

Crash recovery from the log file takes time, though it is reasonable. For an ingest rate of about 1 mil samples/sec, I observed some 25 minutes recovery time on SSD storage:

level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13OURCE)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."

The problem I observed with recovery is that it is very memory intensive. While the server may be capable of handling the normal load with memory to spare, if it crashes it may not be able to ever recover due to running out of memory. The only solution I found for this is to disable scraping, let it perform crash recovery, and then restart the server with scraping enabled.

Warmup

Another behavior to keep in mind is the need for warmup – lower performance and higher resource usage immediately after start. In some—but not all—starts I can observe significantly higher initial CPU and memory usage.

cpu usage during warmup

memory usage during warmup

The gaps in the memory utilization graph show that Prometheus is not initially able to perform all the scrapes configured, and as such some data is lost.

I have not profiled what exactly causes this extensive CPU and memory consumption. I suspect it might happen when new time series entries are created in the head block at a high rate.

CPU Usage Spikes

Besides compaction—which is quite heavy on disk IO—I can also observe significant CPU spikes about every 2 minutes. These are longer with a higher ingest rate. They seem to be caused by Go garbage collection: during these spikes, at least some CPU cores are completely saturated.

cpu usage spikes maybe during Go Garbage collection

cpu saturation and max core usage

These spikes are not just cosmetic. It looks like when these spikes happen, the Prometheus internal /metrics endpoint becomes unresponsive, thus producing data gaps during the exact time that the spikes occur:

Prometheus 2 process memory usage

We can also see the Prometheus Exporter hitting a one second timeout:

scrape time by job

We can observe this correlates with garbage collection:

garbage collection in Prometheus processing

Conclusion

Prometheus 2 TSDB offers impressive performance, being able to handle a cardinality of millions of time series, and also to handle hundreds of thousands of samples ingested per second on rather modest hardware. CPU and disk IO usage are both very impressive. I got up to 200K metrics/sec per used CPU core!

For capacity planning purposes you need to ensure that you have plenty of memory available, and it needs to be real RAM. The actual amount of memory I observed was about 5GB per 100K samples/sec of ingest rate, which, with additional space for the OS cache, makes it 8GB or so.

There is work that remains to be done to avoid CPU and IO usage spikes, though this is not unexpected considering how young Prometheus 2 TSDB is – if we look at InnoDB, TokuDB, RocksDB, and WiredTiger, all of them had similar problems in their initial releases.


Sep
18
2018
--

Tutorial Schedule for Percona Live Europe 2018 Is Live

Percona Live Europe tutorials and a sneak peek

Percona has revealed the line-up of in-depth tutorials for the Percona Live Europe 2018 Open Source Database Conference, taking place November 5–7, 2018 at the Radisson Blu Hotel in Frankfurt, Germany. Secure your spot now with Advanced Registration prices. Be sure to buy your tickets soon, as ticket prices will only head up, not down! Sponsorship opportunities for the conference are still available.

Percona Live Europe 2018 Open Source Database Conference is the premier open source database event. Our theme this year is “Connect. Accelerate. Innovate.”  Percona Live is the place to learn about how open source database technology can power your applications, improve your websites and solve your critical database issues.

Monday, November 5: Tutorial Day

Tutorials take place throughout the day on Monday, November 5. Tutorials are three hours long and provide practical, in-depth knowledge exchange on critical open source database issues. The line-up includes:

  • Query Optimization with MySQL 8.0 and MariaDB 10.3: The Basics with Jaime Crespo – Wikimedia Foundation
  • Hands on ProxySQL with René Cannaò – ProxySQL
  • ElasticSearch 101 with Antonios Giannopoulos – ObjectRocket
  • MySQL Performance Schema in Action: the Complete Tutorial with Sveta Smirnova and Alexander Rubin – Percona
  • MySQL InnoDB Cluster in a Nutshell : The Saga Continues with 8.0 with Frédéric Descamps – Oracle
  • Introduction to PostgreSQL for MySQL and Oracle DBAs with Avinash Vallarapu – Percona
  • InnoDB Architecture and Optimization with Peter Zaitsev – Percona
  • MongoDB: Replica Sets and Sharded Clusters with Adamo Tonete – Percona
  • High Availability PostgreSQL and Kubernetes with Google Cloud presented by Alexis Guajardo – Google
  • Open Source Database Performance Optimization and Monitoring with PMM with Michael Coburn, Vinicius Grippa, and Avinash Vallarapu – Percona
  • Percona XtraDB Cluster Tutorial presented by Tibor Köröcz – Percona
  • Mastering PostgreSQL Administration with Bruce Momjian – EnterpriseDB

And a sneak peek at some of the sessions

Of course, we have a stellar line-up of talks, too! Here’s a tantalising glimpse of just some of the talks you could MISS if you don’t head to Frankfurt in November.

Tuesday 6th November

  • Percona Server 8.0 – Laurynas Biveinis – Percona
  • MySQL 8.0 Performance: Scalability & Benchmarks – Dimitri Kravtchuk – Oracle
  • MySQL Group Replication : the magic explained – Frédéric Descamps – Oracle
  • Explaining the Postgres Query Optimizer – Bruce Momjian – EnterpriseDB
  • TLS for MySQL at large scale – Jaime Crespo – Wikimedia Foundation
  • BlaBlaCar – 100% Containers Powered Carpooling – Maxime Fouilleul – BlaBlaCar
  • A Year in Google Cloud – Carmen Mason, Alan Mason – VitalSource Technologies
  • Demystifying MySQL Replication Crash Safety – Jean-François Gagné
  • MongoDB Shard 101 – Adamo Tonete, Vinodh Krishnaswamy – Percona
  • Highway to Hell or Stairway to Cloud? – Alexander Kukushkin – Zalando
  • MongoDB administration cool tips – Gabriel Ciciliani – Pythian

Wednesday 7th November

  • PostgreSQL Enterprise Features – Michael Banck – credativ GmbH
  • Open Source Databases and Non-Volatile Memory – Frank Ober – Intel Memory Group
  • MariaDB 10.3 Optimizer and beyond – Vicentiu Ciorbaru – MariaDB Foundation
  • MariaDB system-versioned tables – Federico Razzoli – PayProp

A shout out to our fantastic Conference Committee who have been working hard to review the tutorial and talk submissions: we had over 200! Thank you!

The Radisson Blu Hotel, Frankfurt

Percona Live Europe 2018 Open Source Database Conference will be held at the Radisson Blu Hotel, Frankfurt, Franklinstraße 65, 60486 Frankfurt am Main, Germany.

The Radisson Blu enjoys an enviable location in the Bockenheim District, just off the A66 motorway – only one kilometer from Messe Frankfurt, one of the world’s largest exhibition complexes. They’re also just 10 minutes from the city center, and Frankfurt International Airport (FRA) is a quick 15-minute drive away.

Book your hotel using Percona’s special room block rate, available only until September 20.

Sponsorships

Sponsorship opportunities for Percona Live Europe 2018 Open Source Database Conference are available and offer the opportunity to interact with the DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend the event. Contact events@percona.com for sponsorship details.

