PMM Now Supports Monitoring of PostgreSQL Instances Connecting With Any Database (Name)

Monitoring of PostgreSQL Instances

The recent release of Percona Monitoring and Management 2.25.0 (PMM) includes a fix for bug PMM-6937: before it, PMM expected all monitoring connections to PostgreSQL servers to be made using the default postgres database. This worked well for most deployments; however, some DBaaS providers, like Heroku and DigitalOcean, do not provide direct access to the postgres database (at least not by default) and instead create a custom database for each of their instances. Starting with this new release of PMM, it is possible to specify the name of the database the Postgres exporter should connect to, both through the pmm-admin command-line tool and when configuring the monitoring of a remote PostgreSQL instance directly in the PMM web interface.

Secure Connections

Most DBaaS providers enforce, or even restrict, access to their instances through SSL (well, in fact, TLS) client connections – and that’s a good thing! Just pay attention to this detail when you configure the monitoring of these instances in PMM. If, when configuring your instance, the system returns a red-box alert saying:

Connection check failed: pq: no pg_hba.conf entry for host “”, user “pmm”, database “mydatabase”, SSL off.

and your firewall allows external connections to your database, chances are the problem lies in the last part: “SSL off”.

Web Interface

When configuring the monitoring of a remote PostgreSQL instance in the PMM web interface, make sure you tick the box for using TLS connections:

PMM TLS connections

But note that checking this box is not enough; you have to complement this setting with one of two options:

a) Provide the TLS certificates and key, using the forms that appear once you click on the “Use TLS” check-box:

  • TLS CA
  • TLS certificate key
  • TLS certificate

The PMM documentation explains this further, and more details about server and client certificates for PostgreSQL connections can be found in the SSL Support chapter of its online manual.

b) Opt to not provide TLS certificates, leaving the forms empty and checking the “Skip TLS” check-box:

What should drive this choice is the SSL mode the PostgreSQL server requires for client connections. When DBaaS providers enforce secure connections, this usually means sslmode=require, and for this, option b above is enough. Only when the PostgreSQL server employs the more restrictive verify-ca or verify-full modes will you need to go with option a and provide the TLS certificates and key.
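As a rough way to summarize this rule, here is a sketch (the pmm_tls_option helper is hypothetical, just for illustration; it is not part of PMM):

```shell
# Maps the server-side sslmode to the PMM configuration choice described above
pmm_tls_option() {
  case "$1" in
    require)               echo "option b: check Use TLS and Skip TLS" ;;
    verify-ca|verify-full) echo "option a: check Use TLS and provide CA/certificate/key" ;;
    *)                     echo "no TLS required" ;;
  esac
}

pmm_tls_option require       # option b: check Use TLS and Skip TLS
pmm_tls_option verify-full   # option a: check Use TLS and provide CA/certificate/key
```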

Command-Line Tool

When configuring the monitoring of a PostgreSQL server using the pmm-admin command-line tool, the same options are available and the PMM documentation covers them as well. Here’s a quick example to illustrate configuring the monitoring for a remote PostgreSQL server when sslmode=require and providing a custom database name:

$ pmm-admin add postgresql --server-url=https://admin:admin@localhost:443 --server-insecure-tls --port=12345 --username=pmmuser --password=mYpassword --database=customdb --tls --tls-skip-verify

Note that the server-insecure-tls option applies to the connection with the PMM server itself; the options prefixed with tls are the ones that apply to the connection to the PostgreSQL database.

Tracking Stats

You may have also noticed in the release notes that PMM 2.25.0 provides support for the release candidate of our own pg_stat_monitor. For now, however, it is unlikely this extension will be made available by DBaaS providers, so you will need to stick with pg_stat_statements.


Make sure this extension is loaded for your instance’s custom database; otherwise, PMM won’t be able to track many important stats. That is already the case for all non-hobby Heroku Postgres plans, but on other providers, like DigitalOcean, the extension needs to be created beforehand:

defaultdb=> select count(*) from pg_stat_statements;
ERROR:  relation "pg_stat_statements" does not exist
LINE 1: select count(*) from pg_stat_statements;
defaultdb=> CREATE EXTENSION pg_stat_statements;

defaultdb=> select count(*) from pg_stat_statements;
(1 row)

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

Download Percona Monitoring and Management Today


Custom Percona Monitoring and Management Metrics in MySQL and PostgreSQL

mysql postgresql custom metrics

A few weeks ago we did a live stream talking about Percona Monitoring and Management (PMM) and showcased some of the fun things we were doing at the OSS Summit. During the live stream, we tried to enable some custom queries to track the number of comments being added to our movie database example. We ran into a bit of a problem live and did not get it to work. As a follow-up, I want to show you all how to add your own custom metrics to PMM, along with some gotchas to avoid when building them.

Custom metrics are defined in a file deployed on each server you are monitoring (not on the PMM server itself). You can add custom metrics by navigating to one of the following:

  • For MySQL:  /usr/local/percona/pmm2/collectors/custom-queries/mysql
  • For PostgreSQL:  /usr/local/percona/pmm2/collectors/custom-queries/postgresql
  • For MongoDB:  This feature is not yet available – stay tuned!

You will notice the following directories under each directory:

  • high-resolution/  – every 5 seconds
  • medium-resolution/ – every 10 seconds
  • low-resolution/ – every 60 seconds

Note that you can change the frequency of the default metric collections up or down in the PMM settings. It would be ideal if, in the future, we added a resolution config in the YML file directly, but for now, it is a universal setting:

Percona Monitoring and Management metric collections

In each directory you will find an example .yml file with a format like the following:

  mysql_oss_demo:
    query: "select count(1) as comment_cnt from movie_json_test.movies_normalized_user_comments;"
    metrics:
      - comment_cnt:
          usage: "GAUGE"
          description: "count of the number of comments coming in"

Our error during the live stream was that we forgot to qualify the table with the database name in our query (i.e. database_name.table_name), and a bug prevented us from seeing the error in the log files. There is no setting for the database in the YML, so take note.

This will create a metric named mysql_oss_demo_comment_cnt in whatever resolution you specify. Each YML file executes separately with its own connection. This is important to understand: if you deploy lots of custom queries, you will see a steady number of extra connections (something to consider if you are doing custom collections). Alternatively, you can add several queries and metrics to the same file, but they are executed sequentially. If the entire YML file cannot complete in less time than the defined resolution (i.e. finish within five seconds for high resolution), the data will not be stored, but the query will continue to run. This can lead to a query pile-up if you are not careful. For instance, the above query generally takes 1-2 seconds to return the count, so I placed it in the medium bucket. As I added load to the system, the query time backed up.

You can see the slowdown.  You need to be careful here and choose the appropriate resolution.  Moving this over to the low resolution solved the issue for me.
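A back-of-the-envelope way to reason about this (a sketch only, not anything PMM computes): since each collection tick starts the query again whether or not the previous run finished, roughly ceil(query_time / resolution) copies end up in flight at once.

```shell
# How many copies of a custom query pile up when its runtime exceeds
# the collection resolution: roughly ceil(query_time / resolution)
in_flight() {  # $1 = query time (s), $2 = resolution (s)
  echo $(( ($1 + $2 - 1) / $2 ))
}

in_flight 12 10   # a 12s query in the 10s (medium) bucket: 2 copies running
in_flight 12 60   # the same query in the 60s (low) bucket: back to 1
```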

That said, query response time is dynamic based on the conditions of your server.  Because these queries will run to completion (and in parallel if the run time is longer than the resolution time), you should consider limiting the query time in MySQL and PostgreSQL to prevent too many queries from piling up.

In MySQL you can use the MAX_EXECUTION_TIME optimizer hint (note that its argument is in milliseconds):

mysql>  select /*+ MAX_EXECUTION_TIME(4) */  count(1) as comment_cnt from movie_json_test.movies_normalized_user_comments ;
ERROR 1317 (70100): Query execution was interrupted

And on PostgreSQL you can use:

SET statement_timeout = '4s'; 
select count(1) as comment_cnt from movies_normalized_user_comments ;
ERROR:  canceling statement due to statement timeout

By forcing a timeout you can protect yourself. That said, these are “errors”, so you may see them reported in the error log.

You can check the system logs (syslog or messages) for errors with your custom queries (note that as of PMM 2.0.21, errors were not making it into these logs because of a potential bug). If the data is being collected and everything is set up correctly, head over to the default Grafana Explore view or the “Advanced Data Exploration” dashboard in PMM. Look for your metric and you should be able to see the data graphed out:

Advanced Data Exploration PMM

In the above screenshot, you will notice some pretty big gaps in the data (in green).  These gaps were caused by our query taking longer than the resolution bucket.  You can see when we moved to 60-second resolution (in orange), the graphs filled in.



Inspecting MySQL Servers Part 5: Percona Monitoring and Management

Inspecting MySQL Servers PMM

In the previous posts of this series, I presented how the Percona Support team approaches the analysis and troubleshooting of a MySQL server using a tried-and-tested method supported by specific tools found in the Percona Toolkit:

Inspecting MySQL Servers Part 1: The Percona Support Way

Inspecting MySQL Servers Part 2: Knowing the Server

Inspecting MySQL Servers Part 3: What MySQL?

Inspecting MySQL Servers Part 4: An Engine in Motion

A drawback of such an approach is that data collection is done in a “reactive” way, and (part of) it needs to be processed before we can interpret it. Enter Percona Monitoring and Management (PMM): PMM continually collects MySQL status variables and plots the metrics in easy-to-interpret Grafana graphs and panels. Plus, it includes a rich Query Analytics dashboard that helps identify the top slow queries and shows how they are executing. It makes for an excellent complement to the approach we presented. In fact, many times it takes the central role: we analyze the data available in PMM and, if necessary, look at complementing it with pt-stalk samples. In this post, I will show you how we can obtain much of the same information we got from the Percona Toolkit tools (and sometimes more) from PMM.

* As was the case in the previous posts in this series, the data and graphs used to illustrate this post do not come from a single server and have been captured using different versions of PMM.

Know the Server

Once you are connected to PMM, you can select the target server under the Node Name field in the menu located on the top-left side of the interface, then select PMM dashboards on the left menu, System (Node), and, finally, Node Summary, as shown in the screenshot below:

PMM Dashboard

The header section of the Node Summary page shows the basic hardware specs of the server as well as a few metrics and projections. On the right side of this section, you will find the full output of pt-summary, which we scrutinized extensively in the second post of this series, waiting for you:

MySQL Node Summary
Below the header section, there are four panels dedicated to CPU, Memory, Disk, and Network, each containing graphs with specific metrics for that area. This makes it easy, for example, to look at overall CPU utilization:

CPU utilization

Recent spikes in I/O activity:

And memory usage:

Note the graphs cover the last 12 hours of activity by default but you can select a different time range in the top-right menu:

What MySQL?

Taking a slightly different route by selecting MySQL instead of System (Node) and then MySQL Summary, we get to access a dashboard that displays MySQL-specific metrics for the selected instance:

MySQL Summary

Under the Service Summary panel, you will find the full output of pt-mysql-summary, which we reviewed in detail in the third post of this series:

The main goal of pt-mysql-summary is to provide a sneak peek into how MySQL is configured at a single point in time. With PMM you get instant access to most of the MySQL trends and status variables we only glance at in the report. We can look straight under the hood at the engine characteristics while it is under load, over anywhere from the last 5 minutes to the last 30 days or more!

An Engine in Motion

There is so much we can look at at this point. If we more or less follow the sequence observed in the previous posts, we can start by checking whether the table cache is big enough. The example below shows it to be just right, if we base ourselves on the limited time frame this particular sample covers, with an average hit ratio close to 100%:

MySQL Table Open Cache Status
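The ratio that graph plots can be reproduced by hand from the server’s status counters (a sketch; the counter values below are made up):

```shell
# Table open cache hit ratio = hits / (hits + misses), computed from the
# Table_open_cache_hits and Table_open_cache_misses status variables
hit_ratio() {  # $1 = hits, $2 = misses
  awk -v h="$1" -v m="$2" 'BEGIN { printf "%.4f\n", h / (h + m) }'
}

hit_ratio 99900 100   # made-up sample: 0.9990, i.e. ~99.9% of opens hit the cache
```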

Or we can look for a disruption in the pattern, such as a peak in threads connected:

And then investigate the effects it caused on the server (or was it already a consequence of something else that occurred?), for example, a change in the rate of temporary tables created at that time for both in-memory and on-disk tables:

The MySQL Instance Summary is just one of many dashboards available for MySQL:

Under the MySQL InnoDB Details dashboard we find many InnoDB-specific metrics plotted as a multitude of different graphs, providing a visual insight into things such as the number of requests that can be satisfied from data that is already loaded in the Buffer Pool versus those that must be first read from disk (does my hot data fit in memory?):

InnoDB Buffer Pool Requests

Besides MySQL status variables metrics, there is also data filtered directly from SHOW ENGINE INNODB STATUS. For instance, we can find long-running transactions based on increasing values of InnoDB’s history length list:

Another perk of PMM is the ability to easily evaluate whether redo log space is big enough based on the rate of writes versus the size of the log files:

And thus observe checkpoint age, a concept that is explained in detail for PMM in How to Choose the MySQL innodb_log_file_size:
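A common rule of thumb discussed around redo log sizing is to make the total log space large enough to hold roughly an hour of writes. As a sketch (the 2 MB/s redo write rate below is a made-up example):

```shell
# Redo log space (GB) needed to hold ~1 hour of writes,
# given the redo write rate in MB/s
redo_log_gb() {  # $1 = redo write rate in MB/s
  echo $(( $1 * 3600 / 1024 ))
}

redo_log_gb 2   # ~2 MB/s of redo writes: roughly 7 GB of total log space
```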

Another evaluation made easy with PMM is whether a server’s workload is benefiting from having InnoDB’s Adaptive Hash Index (AHI) enabled. The example below shows an AHI hit ratio close to 100% up to a certain point, after which the number of searches increased and the situation inverted:

The evaluation of settings like the size of the redo log space and the efficiency of AHI should be done at a macro level, spanning days: we should be looking for the best general configuration for these. However, when investigating a particular event, it is important to zoom in on the time frame where it occurred to better analyze the data captured at the time. Once you do this, change the data resolution from the default of auto to a 1s or 5s interval/granularity so you can better see spikes and overall variation:

QAN: Query Analytics

Query analysis is something I only hinted at but didn’t explore in the first articles in this series. The “manual” way requires processing the slow query log with a tool such as pt-query-digest and then going for details about a particular query by connecting to the server to obtain the execution plan and schema details. A really strong feature of PMM is the Query Analytics dashboard, which provides a general overview of query execution and captures all information about it for you. 

The example below comes from a simple sysbench read-write workload on my test server:

PMM Query Analytics

We can select an individual query on the list and check the details of its execution:

The query’s  EXPLAIN plan is also available, both in classic and JSON formats:

You can read more about QAN on our website as well as in other posts on our blog platform, such as How to Find Query Slowdowns Using Percona Monitoring and Management.

What PMM Does Not Include

There remains information we cannot obtain from PMM, such as the full output of SHOW ENGINE INNODB STATUS. For situations when obtaining this information is important, we fall back to pt-stalk. It is not one or the other; we see them as complementary tools in our job of inspecting MySQL servers.

If you are curious about PMM and would like to see how it works in practice, check our demo website. To get up and running with PMM quickly, refer to our quickstart guide.

Tuning the Engine for the Race Track

There you have it! It certainly isn’t all there is, but we’ve packed a lot into this series, enough to get you moving in the right direction when it comes to inspecting and troubleshooting MySQL servers. I hope you have enjoyed the journey and learned a few new tricks along the way!


MongoDB Integrated Alerting in Percona Monitoring and Management

MongoDB Integrated Alerting

Percona Monitoring and Management (PMM) recently introduced the Integrated Alerting feature as a technical preview. This was a very eagerly awaited feature, as PMM doesn’t need to integrate with an external alerting system anymore. Recently we blogged about the release of this feature.

PMM includes some built-in templates, and in this post, I am going to show you how to add your own alerts.

Enable Integrated Alerting

The first thing to do is navigate to the PMM Settings by clicking the wheel on the left menu, and choose Settings:

Next, go to Advanced Settings, and click on the slider to enable Integrated Alerting down in the “Technical Preview” section.

While you’re here, if you want to enable SMTP or Slack notifications, you can set them up right now by clicking the new Communications tab (which shows up after you hit “Apply Changes” to turn on the feature).

The example below shows how to configure email notifications through Gmail:

You should now see the Integrated Alerting option in the left menu under Alerting, so let’s go there next:

Configuring Alert Destinations

After clicking on the Integrated Alerting option, go to the Notification Channels tab to configure the destination for your alerts. At the time of this writing, email (via your SMTP server), Slack, and PagerDuty are supported.

Creating a Custom Alert Template

Alerts are defined using MetricsQL, which is backward compatible with PromQL. As an example, let’s configure an alert to let us know if MongoDB is down.

First, let’s go to the Explore option from the left menu. This is the place to play with the different metrics available and create the expressions for our alerts:

To identify MongoDB being down, one option is using the up metric. The following expression (the same one used in the template further down) would give us the alert we need:

up{service_type="mongodb"} == 0

To validate this, I shut down a member of a 3-node replica set and verified that the expression returns 0 when the node is down:

The next step is creating a template for this alert. I won’t go into a lot of detail here, but you can check Integrated Alerting Design in Percona Monitoring and Management for more information about how templates are defined.

Navigate to the Integrated Alerting page again, and click on the Add button, then add the following template:

templates:
  - name: MongoDBDown
    version: 1
    summary: MongoDB is down
    expr: |-
      up{service_type="mongodb"} == 0
    severity: critical
    annotations:
      summary: MongoDB is down ({{ $labels.service_name }})
      description: |-
        MongoDB {{ $labels.service_name }} on {{ $labels.node_name }} is down

This is what it looks like:

Next, go to the Alert Rules and create a new rule. We can use the Filters section to add comma-separated “key=value” pairs to filter alerts per node, per service, per agent, etc.

For example: node_id=/node_id/123456, service_name=mongo1, agent_id=/agent_id/123456

After you are done, hit the Save button and go to the Alerts dashboard to see if the alert is firing:

From this page, you can also silence any firing alerts.

If you configured email as a destination, you should have also received a message like this one:

For now, a single notification is sent. In the future, it will be possible to customize the behavior.

Creating MongoDB Alerts

In addition to the obvious “MongoDB is down” alert, there are a couple more things we should monitor. For starters, I’d suggest creating alerts for the following conditions:

  • Replica set member in an unusual state
mongodb_replset_member_state != 1 and mongodb_replset_member_state != 2

  • Connections higher than expected
avg by (service_name) (mongodb_connections{state="current"}) > 5000

  • Cache evictions higher than expected
avg by(service_name, type) (rate(mongodb_mongod_wiredtiger_cache_evicted_total[5m])) > 5000

  • Low WiredTiger tickets
avg by(service_name, type) (max_over_time(mongodb_mongod_wiredtiger_concurrent_transactions_available_tickets[1m])) < 50

The values listed above are just for illustrative purposes; you need to decide the proper thresholds for your specific environment(s).
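To make the first expression concrete, the member-state check boils down to the following (a sketch; the member_state_check helper is hypothetical, and the state codes come from MongoDB’s replica set member states):

```shell
# mongodb_replset_member_state: 1=PRIMARY and 2=SECONDARY are the healthy
# steady states; anything else (e.g. 0=STARTUP, 3=RECOVERING, 8=DOWN)
# matches the alert expression above
member_state_check() {
  case "$1" in
    1|2) echo "ok" ;;
    *)   echo "alert" ;;
  esac
}

member_state_check 2   # SECONDARY
member_state_check 3   # RECOVERING
```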

As another example, let’s add the alert template for the low WiredTiger tickets:

templates:
  - name: MongoDB Wiredtiger Tickets
    version: 1
    summary: MongoDB WiredTiger tickets low
    expr: avg by(service_name, type) (max_over_time(mongodb_mongod_wiredtiger_concurrent_transactions_available_tickets[1m])) < 50
    severity: warning
    annotations:
      description: "WiredTiger available tickets on (instance {{ $labels.node_name }}) are less than 50"


Integrated Alerting is a really nice feature to have. While it is still in technical preview, there are already a few built-in alerts you can test, and you can also define your own. Make sure to check the Integrated Alerting official documentation for more information on this topic.

Do you have any specific MongoDB alerts you’d like to see? Given the feature is still in technical preview, any contributions and/or feedback about the functionality are welcome as we’re looking to release this as GA very soon!


Improving Percona Monitoring and Management EC2 Instance Resilience Using CloudWatch Alarm Actions

Percona Monitoring and Management EC2 Instance Resilience Using CloudWatch

Nothing lasts forever, including the hardware running your EC2 instances. You will usually receive an advance warning of hardware degradation and subsequent instance retirement, but sometimes hardware fails unexpectedly. Percona Monitoring and Management (PMM) currently doesn’t have an HA setup, and such failures can leave wide gaps in monitoring if not resolved quickly.

In this post, we’ll see how to set up automatic recovery for PMM instances deployed in AWS through Marketplace. The automation will take care of the instance following an underlying systems failure. We’ll also set up an automatic restart procedure in case the PMM instance itself is experiencing issues. These simple automatic actions go a long way in improving the resilience of your monitoring setup and minimizing downtime.

Some Background

Each EC2 instance has two associated status checks: System and Instance. You can read about them in more detail on the “Types of status checks” AWS documentation page. The gist of it is: the System check fails when there are infrastructure issues, while the Instance check fails when there’s something wrong on the instance side, like its OS having issues. You can normally see the results of these checks as the “2/2 checks passed” markings on your instances in the EC2 console.

ec2 instance overview showing 2/2 status checks

CloudWatch, an AWS monitoring system, can react to the status check state changes. Specifically, it is possible to set up a “recover” action for an EC2 instance where a system check is failing. The recovered instance is identical to the original, but will not retain the same public IP unless it’s assigned an Elastic IP. I recommend that you use Elastic IP for your PMM instances (see also this note in PMM documentation). For the full list of caveats related to instance recovery check out the “Recover your instance” page in AWS documentation.

According to CloudWatch pricing, having two alarms set up will cost $0.20/month, an acceptable cost for higher availability.

Automatically Recovering PMM on System Failure

Let’s get to actually setting up the automation. There are at least two ways to set up the alarm through GUI: from the EC2 console, and from the CloudWatch interface. The latter option is a bit involved, but it’s very easy to set up alarms from the EC2 console. Just right-click on your instance, pick the “Monitor and troubleshoot” section in the drop-down menu, and then choose the “Manage CloudWatch alarms” item.

EC2 console dropdown menu navigation to CloudWatch Alarms

Once there, choose “Status check failed” as the “Type of data to sample”, and specify the “Recover” alarm action. You should see something like this:

Adding CloudWatch alarm through EC2 console interface

You may have noticed that the GUI offers to set up a notification channel for the alarm. If you want to get a message when your PMM instance is recovered automatically, feel free to set that up.

The alarm fires when the maximum value of the StatusCheckFailed_System metric is >= 0.99 for two consecutive 300-second periods. Once the alarm fires, it recovers the PMM instance. We can check out our new alarm in the CloudWatch GUI.

EC2 instance restart alarm

The EC2 console will also show that the alarm is set up.

Single alarm set in the instance overview

This example uses a pretty conservative alarm check duration of 10 minutes, spread over two 5-minute periods. If you want to recover a PMM instance sooner, at the risk of triggering on false positives, you can reduce the alarm period and the number of evaluation periods. We also use the “maximum” of the metric over each 5-minute period, meaning a check could be in a failed state for only one minute out of five and still count towards alarm activation. The assumption here is that checks don’t flap for ten minutes without a reason.
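The evaluation logic can be sketched like this (nothing CloudWatch-specific; just the “N consecutive failing periods” rule the alarm encodes):

```shell
# The alarm fires only if every one of the last N period-maxima breaches
# the threshold (here: >= 1, over two 300-second periods)
alarm_fires() {  # args: per-period maxima of StatusCheckFailed_System
  for max in "$@"; do
    [ "$max" -ge 1 ] || { echo "OK"; return; }
  done
  echo "ALARM"
}

alarm_fires 1 1   # failing for both 5-minute periods: fires
alarm_fires 0 1   # one healthy period resets the count: no alarm
```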

Automatically Restarting PMM on Instance Failure

While we’re at it, we can also set up an automatic action to execute when the “Instance status check” is failing. As mentioned, that usually happens when there’s something wrong within the instance: a really high load, a configuration issue, etc. Whenever the system check fails, the instance check will be failing too, so we’ll have this alarm check over a longer period before firing. That also helps minimize the rate of false-positive restarts, for example, due to a spike in load. We use the same period of 300 seconds here but only alarm after three periods show instance failure; the restart will thus happen after ~15 minutes. Again, this is pretty conservative, so adjust as you think works best for you.

In the GUI, repeat the same actions as last time, but pick the “Status check failed: instance” data type and the “Reboot” alarm action.

Adding CloudWatch alarm for instance failure through EC2 console interface

Once done, you can see both alarms reflected in the EC2 instance list.

EC2 instance overview showing 2 alarms set

Setting up Alarms Using AWS CLI

Setting up our alarms is also very easy using the AWS CLI. You’ll need an AWS account with permissions to create CloudWatch alarms and some EC2 permissions to list instances and change their state. A fairly minimal set of permissions can be found in the policy document below; it is enough for a user to invoke the CLI commands in this post.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [ ... ],
      "Resource": [ ... ],
      "Effect": "Allow"
    },
    {
      "Action": [ ... ],
      "Resource": [ ... ],
      "Effect": "Allow"
    }
  ]
}

And here are the specific AWS CLI commands. Do not forget to change the instance id (i-xxx) and region in the commands. First, set up recovery after a system status check failure:

$ aws cloudwatch put-metric-alarm \
    --alarm-name restart_PMM_on_system_failure_i-xxx \
    --alarm-actions arn:aws:automate:REGION:ec2:recover \
    --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistic Maximum --period 300 --evaluation-periods 2 \
    --datapoints-to-alarm 2 --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --tags Key=System,Value=PMM

Second, set up a restart after an instance status check failure:

$ aws cloudwatch put-metric-alarm \
    --alarm-name restart_PMM_on_instance_failure_i-xxx \
    --alarm-actions arn:aws:automate:REGION:ec2:reboot \
    --metric-name StatusCheckFailed_Instance --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistic Maximum --period 300 --evaluation-periods 3 \
    --datapoints-to-alarm 3 --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --tags Key=System,Value=PMM

Testing New Alarms

Unfortunately, it doesn’t seem possible to simulate a system status check failure (if there’s a way, let us know in the comments), so we’ll test our alarms using an instance failure instead. The simplest way to fail the instance status check is to mess up networking on the instance, so let’s just bring down a network interface. On this particular instance, it’s ens5.

# ip link
2: ens5:...
# date
Sat Apr 17 20:06:26 UTC 2021
# ip link set ens5 down

The stuff of nightmares: a command with no response. Within 15 minutes, we can check that the PMM instance is available again. We can see that the alarm triggered and executed the restart action at 20:21:37 UTC.

CloudWatch showing Alarm fired actions

And we can access the server now.

$ date
Sat Apr 17 20:22:59 UTC 2021

The alarm itself takes a little more time to return to OK state. Finally, let’s take a per-minute look at the instance check metric.

CloudWatch metric overview

Instance failure was detected at 20:10 and resolved at 20:24, according to the AWS data. In reality, the server rebooted and was accessible even earlier. PMM instances deployed from the Marketplace image have all the required services in autostart, so PMM is fully available after a restart without an operator needing to take any action.


Even though Percona Monitoring and Management itself doesn’t offer high availability at this point, you can use built-in AWS features to make your PMM installation more resilient to failure. PMM in the Marketplace is set up in a way that should recover from instance retirement events without problems. After your PMM instance is recovered or rebooted, you’ll be able to access monitoring without any manual intervention.

Note: While this procedure is universal and can be applied to any EC2 instance, there are some caveats explained in the “What do I need to know when my Amazon EC2 instance is scheduled for retirement?” article on the AWS site. Note specifically the Warning section. 



Running Percona Monitoring and Management v2 on Windows with Docker

PMM Windows Docker

One way to run Percona Monitoring and Management (PMM) v2 on Windows is the Virtual Appliance installation method. It works well but requires extra software to run virtual appliances, such as Oracle VirtualBox or VMware Workstation Player. Or, you can use Docker.

Docker is not shipped with Windows 10, though you can install it for free with Docker Desktop. Modern Docker on Windows can run using either the Hyper-V backend or the Windows Subsystem for Linux v2 (WSL2). Because a WSL2-based Docker installation is more involved, I’m sticking to what is now referred to as the “legacy” Hyper-V backend.

On my Windows 10 Pro (19042) system, I had to change some BIOS settings:


PMM on Windows Docker


And install the Hyper-V component for Docker Desktop to install correctly:


Percona Monitoring and Management Windows


With this, we’re good to go to install Percona Monitoring and Management on Windows.


Register for Percona Live ONLINE
A Virtual Event about Open Source Databases


The official instructions for installing PMM with Docker assume you’re running a Linux/Unix-based system, so they need to be slightly modified for Windows. First, there is no sudo on Windows, and second, “Command Prompt” does not interpret “\” the same way a shell does.

With those minor fixes though, basically, the same command to start PMM works:

docker pull percona/pmm-server:2
docker create --volume /srv --name pmm-data percona/pmm-server:2 /bin/true
docker run --detach --restart always --publish 80:80 --publish 443:443 --volumes-from pmm-data --name pmm-server percona/pmm-server:2

That’s it! You should have your Percona Monitoring and Management (PMM) v2 running!

Looking at the PMM Server instance through PMM itself, you may notice it does not get all of the host’s resources by default, as Docker on Linux would.




Note there are only 2 CPU cores, 2 GB of memory, and some 60 GB of disk space available. This should be enough for a test, but if you want to put some real load on this instance, you may want to adjust those settings. It is done through the “Resources” setting in Docker Desktop:




Adjust these to suit your load needs and you’re done!  Enjoy your Percona Monitoring and Management!


Percona Monitoring and Management Meetup on May 6th: Join Us for Live Tuning and Roadmap Talk

We’re leveling up – would you like to join us on Discord? There we chat about databases, open source, and even play together.

On May 6th at 11 am EST, the Percona Community is talking about Percona Monitoring and Management (PMM). Our experts are also there to live tune and optimize your database – come check it out!

Are there any features you would like to see in PMM? The meetup is a good way to get your voice heard! But you can already start chatting with all of us on Discord right now.

It’s just a few more nights of sleep until Percona Live ONLINE with many interesting events for you to check out too. Join the community and let’s shape the future of databases together!


Integrated Alerting Design in Percona Monitoring and Management

Integrated Alerting Design Percona Monitoring and Management

Percona Monitoring and Management 2.13 (PMM) introduced the Integrated Alerting feature as a technical preview. It adds a user-friendly way to set up and manage alerts for your databases. You can read more about using this feature in our announcement blog post and in our documentation; in this article, we focus on design and implementation details.


There are four basic entities used for IA: Alert Rule Template, Alert Rule, Alert, and Notification Channel.

Everything starts from the alert rule template. You can see its YAML representation below:

 - name: pmm_mongodb_high_memory_usage
   version: 1
   summary: Memory used by MongoDB
   expr: |-
     sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
     / on (node_name) (node_memory_MemTotal_bytes)
     * 100
     > [[ .threshold ]]
   params:
     - name: threshold
       summary: A percentage from configured maximum
       unit: "%"
       type: float
       range: [0, 100]
       value: 80
   for: 5m
   severity: warning
   labels:
     custom_label: demo
   annotations:
     summary: MongoDB high memory usage ({{ $labels.service_name }})
     description: |-
       {{ $value }}% of memory (more than [[ .threshold ]]%) is used
       by {{ $labels.service_name }} on {{ $labels.node_name }}.

A template serves as the base for alert rules. It defines several fields; let’s look at them:

  • name: uniquely identifies template (required)
  • version: defines template format version (required)
  • summary: a template description (required)
  • expr: a MetricsQL query string with parameter placeholders. MetricsQL is backward compatible with PromQL and provides some additional features. (required)
  • params: contains parameter definitions required for the query. Each parameter has a name, type, and summary. It also may have a unit, available range, and default value.
  • for: specifies how long the expression must evaluate to true; the alert fires once the query has returned a positive result for this period of time (required)
  • severity: specifies default alert severity level (required)
  • labels: are additional labels to be added to generated alerts (optional)
  • annotations: are additional annotations to be added to generated alerts. (optional)
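Concretely, the template’s expr boils down to simple arithmetic. Here is a minimal Python sketch of the same check; the function name and inputs are illustrative, not a PMM API:

```python
def mongodb_memory_alert(resident_mb: float, mem_total_bytes: float,
                         threshold_pct: float = 80.0) -> bool:
    """Mirrors the template expression:
    sum(mongodb_ss_mem_resident * 1024 * 1024)
      / node_memory_MemTotal_bytes * 100 > [[ .threshold ]]"""
    used_pct = (resident_mb * 1024 * 1024) / mem_total_bytes * 100
    return used_pct > threshold_pct

# 13 GB resident on a 16 GB node -> 81.25% -> above the 80% default
print(mongodb_memory_alert(13 * 1024, 16 * 1024**3))  # True
```

Each alert rule built from the template simply substitutes its own value for the threshold parameter.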

A template is designed to be re-used as the basis for multiple alert rules so from a single pmm_node_high_cpu_load template you can have alerts for production vs non-production, warning vs critical, etc.


Users can create alert rules from templates. An alert rule is what’s actually executed against metrics and what produces an alert. The rule can override default values specified in the template, add filters to apply the rule only to the required services/nodes/etc., and specify target notification channels, such as email, Slack, PagerDuty, or webhooks. If the rule has no associated notification channels, its alerts will be available only via the PMM UI. It’s useful to note that after creation the rule keeps its relation to the template, and any change in the template will affect all related rules.

Here is an alert rule example:

 - name: PMM Integrated Alerting
   rules:
     - alert: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
       expr: |-
         sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
         / on (node_name) (node_memory_MemTotal_bytes)
         * 100
         > 40
       for: 5s
       labels:
         ia: "1"
         rule: test
         rule_id: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
         severity: error
         template_name: pmm_mongodb_high_memory_usage
         custom_label: demo
       annotations:
         description: |-
           {{ $value }}% of memory (more than 40%) is used
           by {{ $labels.service_name }} on {{ $labels.node_name }}.
         summary: MongoDB high memory usage ({{ $labels.service_name }})

It uses the Prometheus alert rule format.

How it Works

The Integrated Alerting feature is built on top of Prometheus Alertmanager, the VictoriaMetrics time series database (TSDB), and VMAlert.

The VictoriaMetrics TSDB is the main metrics storage in PMM, VMAlert is responsible for alert rule execution, and Prometheus Alertmanager is responsible for alert delivery. VMAlert runs queries on the VictoriaMetrics TSDB, checks whether they are positive for the specified amount of time (for example: MySQL is down for 5 minutes), and triggers alerts. All alerts are forwarded to the PMM internal Alertmanager, but they can also be duplicated to an external Alertmanager (which can be set up on the PMM Settings page).
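The “positive for the specified amount of time” behavior can be sketched as a tiny state machine. This is only an illustration of the for-clause semantics, not VMAlert’s actual implementation; it assumes one evaluation per minute:

```python
def fires(samples, for_seconds, step=60):
    """Return True once the condition has been continuously true
    for at least for_seconds (the 'for' clause of an alert rule).
    samples: list of booleans, one query evaluation per step seconds."""
    streak = 0
    for positive in samples:
        streak = streak + step if positive else 0  # any gap resets the streak
        if streak >= for_seconds:
            return True
    return False

# 5 consecutive positive minutes satisfy for: 5m ...
print(fires([True] * 5, 300))                         # True
# ... but a single negative evaluation in between resets the timer
print(fires([True] * 4 + [False] + [True] * 4, 300))  # False
```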

There are four available templates sources:

  1. Built-in templates, shipped with the PMM distribution. They are embedded into the managed binary (a core component of PMM).
  2. Percona servers. This isn’t available yet, but it will be similar to the STT checks delivery mechanism (HTTPS + file signatures).
  3. Templates created by the user via PMM UI. We persist them in PMM’s database.
  4. Templates created by the user as files in the /srv/ia/templates directory.

During PMM startup, managed loads templates from all sources into memory.

Alert rules can be created via the PMM UI or simply by putting rule files in the /srv/prometheus/rules directory. Alert rules created via the UI are persisted in PMM’s internal PostgreSQL database. For each alert rule from the DB, the managed binary creates a YAML file in /etc/ia/rules/ and asks VMAlert to reload the configuration and reread the rule files. VMAlert executes the query from each loaded alert rule every minute; once the rule condition is met (the query is positive for the specified amount of time), VMAlert produces an alert and passes it to Alertmanager. Please note that /etc/ia/rules/ is controlled by managed, and any manual changes in that directory will be lost.

Managed generates configuration for Alertmanager and updates it once any related entity changes.

Managed goes through the list of existing rules and collects unique notification channel combinations. For example, if we have two rules and each of them has channels a, b, and c assigned, that is one unique channel combination. For each rule, managed generates a route, and for each unique channel combination it generates a receiver in the Alertmanager configuration file. Each route has a target receiver and a filter by rule ID, and it can also contain user-defined filters. If a rule has no assigned notification channels, a special empty receiver is used. Users can redefine the empty receiver with Alertmanager’s base configuration file /srv/alertmanager/alertmanager.base.yml. When a notification channel is disabled, managed recollects the unique channel combinations, excluding disabled channels, and regenerates the receivers and routing rules. If a rule had only one channel and it was disabled, a special disabled receiver is used instead. Unlike the empty receiver, the disabled receiver can’t be redefined by the user and always means “do nothing”; this prevents unexpected behavior after disabling channels. After each Alertmanager configuration update, managed asks Alertmanager to reload it.
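The receiver/route generation described above can be sketched as follows. The names and data shapes here are hypothetical illustrations, not the actual managed source code:

```python
def build_receivers(rules):
    """Sketch: one receiver per distinct channel set, one route per rule.
    rules: dict mapping rule_id -> list of channel names."""
    receivers = {}  # frozenset(channels) -> generated receiver name
    routes = []
    for rule_id, channels in rules.items():
        key = frozenset(channels)
        # Rules without channels get the special 'empty' receiver
        name = "empty" if not key else receivers.setdefault(
            key, "receiver_" + "_".join(sorted(key)))
        routes.append({"match": {"rule_id": rule_id}, "receiver": name})
    return receivers, routes

recvs, routes = build_receivers({
    "rule1": ["a", "b", "c"],
    "rule2": ["c", "b", "a"],  # same combination -> same receiver as rule1
    "rule3": [],               # no channels -> 'empty' receiver
})
print(len(recvs))  # 1 unique channel combination
```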

When Alertmanager receives an alert from VMAlert, it uses the routes to find the appropriate receiver and forwards the alert to the destination channels. Users can also observe alerts via the PMM UI; in that case, managed gets all available alerts from the Alertmanager API and applies the required filters before showing them.


The Integrated Alerting feature has many moving parts; functionally, it’s mostly about managing configuration for the different components and making them work together. It provides a really nice way to stay aware of important events in your system. While it’s still in a tech preview state, it’s already helpful. With built-in templates, it’s easy to try without diving into documentation about Prometheus queries and other internals. So please try it and tell us about your experience. What parameters of your system would you like to have covered with templates? What use cases do you have for alerting? We will be happy to hear any feedback.


Achieving a More Secure MongoDB with Percona Monitoring and Management’s Security Tool

More Secure MongoDB

Percona Monitoring and Management (PMM) is a great way to monitor your MongoDB deployment for things like memory, CPU, internal database metrics like WiredTiger cache utilization and read/write ticket utilization, Query Analytics, and many, many more.  Did you know that, in addition to all that, it also has a security threat tool?  In this blog, we’ll go over the Security Threat Tool and its checks for MongoDB, and discuss how they can bring you value by helping you prevent MongoDB data leaks.

What is the Security Threat Tool?

PMM security threat tool

The Security Threat Tool is a widget in your main PMM dashboard that runs regular checks against connected databases, alerting you if any servers pose a potential security threat.  By default, these checks are run every 24 hours, so the checks will not be in real-time.  You can run them ad-hoc by going to PMM/PMM Database Checks in your PMM Grafana Dashboards and clicking on the “Run DB Checks” button.  When you first enable the Security Threat Tool it can take up to 24 hours for the results to populate, assuming you do not use the ad hoc button.  In addition to MongoDB, the Security Threat Tool, like PMM itself, can also check MySQL, PostgreSQL, and MariaDB for various security threats.  Additionally, if you find a check is too noisy, you do have the ability to silence it in PMM so that it no longer shows as a failed check.

What Does the Security Threat Tool Check for MongoDB?

At present, the Security Threat Tool checks MongoDB for two security-related items: mongodb_auth and mongodb_version.  First, mongodb_auth checks that MongoDB authentication is enabled on your deployment.  Second, mongodb_version checks that you’re running the latest release of the MongoDB version you’re on.


Checking MongoDB to ensure that authentication and authorization are enabled is of paramount importance to the security of your database.   If authentication and authorization are disabled for your MongoDB deployment, then anyone who can reach the port MongoDB is running on has access to your data.  Disabled authentication and authorization, combined with binding to all of the interfaces on your machine, have been the cause of many MongoDB data leaks over the years.  Having this check can help keep you and your MongoDB deployment from leaking data.
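For reference, authentication/authorization and the bind address are controlled by these mongod.conf settings (a minimal sketch; adjust bindIp to match your topology rather than binding to all interfaces):

```yaml
security:
  authorization: enabled   # what the mongodb_auth check looks for
net:
  bindIp: 127.0.0.1        # avoid 0.0.0.0 unless the port is firewalled
```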

This check can also be helpful if you’re performing maintenance and need to temporarily disable authentication on a secondary or hidden node and forget to re-enable authentication and authorization when you’re done.



Many organizations are requiring more frequent patching or upgrading of software, especially databases, which often house the most critical data for your applications, in order to stay on top of changing security requirements.  With MongoDB’s replica set and sharded cluster architecture, it’s fairly easy to upgrade in a rolling manner and keep up to date on your MongoDB versions (after appropriate testing of the new versions!).  This check makes sure you’re running the latest release of the minor version your MongoDB database is on.  For example, a MongoDB 3.6 deployment won’t alert you that MongoDB 4.4 exists, but it will tell you when a new release of MongoDB 3.6 comes out and can be upgraded to. Staying as close as possible to the latest release ensures that you get as many bug fixes as are available and, just as importantly, any security fixes for that version of MongoDB.  You want to minimize your security exposure by being up to date and having as many security vulnerabilities patched as possible.
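The logic of this check can be illustrated with a small sketch. The function and the version table below are hypothetical, not PMM’s actual implementation:

```python
def version_check(current, latest_releases):
    """Compare only within the minor release line the server is on,
    mirroring the mongodb_version check's behavior described above."""
    major_minor = ".".join(current.split(".")[:2])  # "3.6.21" -> "3.6"
    latest = latest_releases.get(major_minor)
    if latest and latest != current:
        return "newer release available: " + latest
    return None  # already on the latest release of this line

# Hypothetical table of the newest release per minor line
latest = {"3.6": "3.6.23", "4.4": "4.4.6"}
print(version_check("3.6.21", latest))  # suggests 3.6.23, never 4.4.x
print(version_check("3.6.23", latest))  # None
```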



We hope that this blog post has helped you to see how enabling the Security Threat Tool in Percona Monitoring and Management can help you keep your MongoDB deployment secure.  Thanks for reading!


Percona Monitoring and Management is free to download and use. Try it today!


Observations on Better Resource Usage with Percona Monitoring and Management v2.12.0

Better Resource Usage with Percona Monitoring and Management

Percona Monitoring and Management (PMM) v2.12.0 comes with a lot of improvements, and one of the most talked-about is the adoption of VictoriaMetrics. The reason we are doing this comparison is that PMM 2.12.0 is the release in which we integrated VictoriaMetrics, replacing Prometheus as the default method of metrics ingestion.

This change was also driven by the motivation to improve PMM Server performance, and here we will look at why users who have been looking for a less resource-intensive PMM should definitely consider the 2.12.0 version. This post will try to address some of those concerns.

Benchmark Setup Details

The benchmark was performed using a virtualized system, with PMM Server running on an EC2 instance with 8 cores, 32 GB of memory, and SSD storage. The observation lasted 24 hours, and for clients we set up 25 virtualized client instances on Linode, each emulating 10 nodes running MySQL with real workloads using the Sysbench TPC-C test.

Percona Monitoring and Management benchmark

Both PMM 2.11.1 and PMM 2.12.0 were set up in the same way, with client instances running the exact same load, and to monitor the difference in performance, we used the default metrics mode for 2.12.0 for this observation.

Sample Ingestion rate for the load was around 96.1k samples/sec, with around 8.5 billion samples received in 24 hours. 
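A quick sanity check of those numbers (the small gap from the quoted ~8.5 billion is expected, since 96.1k samples/sec is itself an average over the run):

```python
samples_per_sec = 96_100                 # average ingestion rate from above
total = samples_per_sec * 24 * 3600      # samples over the 24-hour window
print(f"{total / 1e9:.1f} billion samples")  # 8.3 billion samples
```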

A more detailed benchmark of Prometheus vs. VictoriaMetrics was done by the VictoriaMetrics team, and it clearly shows how efficient VictoriaMetrics is and what performance can be achieved with it.

Disk Space Usage

VictoriaMetrics is very efficient when it comes to disk usage on the host system. We found that 2.11.1 generates a lot of disk usage spikes, with maximum storage touching around 23.11 GB; for PMM 2.12.0, the spikes are not as high and the maximum disk usage is around 8.44 GB.

It is clear that PMM 2.12.0 needs 2.7 times less disk space for monitoring the same number of services for the same duration as compared to PMM 2.11.1.
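The 2.7× figure follows directly from the two maxima measured above:

```python
max_disk_2_11_1 = 23.11  # GB, peak disk usage on PMM 2.11.1
max_disk_2_12_0 = 8.44   # GB, peak disk usage on PMM 2.12.0
print(round(max_disk_2_11_1 / max_disk_2_12_0, 1))  # 2.7
```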

Disk Usage PMM 2.11.1

Disk Usage 1: PMM 2.11.1 

Disk Usage PMM 2.12.0

Disk Usage 2: PMM 2.12.0

Memory Utilization

Another parameter on which PMM 2.12.0 performs better is Memory Utilization. During our testing, we found that PMM 2.11.1 was using two times more memory for monitoring the same number of services. This is indeed a significant improvement in terms of performance.

The memory usage clearly shows several spikes for PMM 2.11.1, which is not the case with PMM 2.12.0.

Memory Utilization PMM 2.11.1

Memory Utilization: PMM 2.11.1

Free Memory PMM 2.11.1
Free Memory PMM 2.11.1 

Memory Utilization 2.12.0
Memory Utilization PMM 2.12.0

The Memory Utilization for 2.12.0 clearly shows more than 55% of memory is available across the 24 hours of our observation, which is a significant improvement over 2.11.1.

Free Memory 2.12.0

Free Memory PMM 2.12.0

CPU Usage

During the observation, we noticed a slight increase in CPU usage for PMM 2.12.0: average CPU usage was about 2.6% higher than for PMM 2.11.1, but the maximum CPU usage of the two versions did not differ significantly.

CPU Usage 2.11.1

CPU Usage: PMM 2.11.1

CPU Usage 2.12.0

CPU Usage: PMM 2.12.0



The overall performance improvements are around memory utilization and disk usage, and we also observed significantly lower disk I/O bandwidth, with far fewer spikes in write operations for PMM 2.12.0. This behavior is also observed and articulated in the VictoriaMetrics benchmarking. CPU and memory are two important resource factors when planning a PMM Server setup, and with PMM 2.12.0 we can safely say that it will cost about half as much in terms of memory and disk resources compared to previously released PMM versions. This should also encourage current users to add more instances for monitoring without worrying about the cost of extra infrastructure.
