Apr 29, 2021

Improving Percona Monitoring and Management EC2 Instance Resilience Using CloudWatch Alarm Actions

Percona Monitoring and Management EC2 Instance Resilience Using CloudWatch

Nothing lasts forever, including the hardware running your EC2 instances. You will usually receive advance warning of hardware degradation and subsequent instance retirement, but sometimes hardware fails unexpectedly. Percona Monitoring and Management (PMM) currently doesn’t have an HA setup, and such failures can leave wide gaps in monitoring if not resolved quickly.

In this post, we’ll see how to set up automatic recovery for PMM instances deployed in AWS through Marketplace. The automation will take care of the instance following an underlying systems failure. We’ll also set up an automatic restart procedure in case the PMM instance itself is experiencing issues. These simple automatic actions go a long way in improving the resilience of your monitoring setup and minimizing downtime.

Some Background

Each EC2 instance has two associated status checks: System and Instance. You can read about them in more detail on the “Types of status checks” AWS documentation page. The gist of it is that the System check fails when there are underlying infrastructure issues, while the Instance check fails if something is wrong on the instance side, such as OS-level problems. You can normally see the results of these checks as “2/2 checks passed” markings on your instances in the EC2 console.

ec2 instance overview showing 2/2 status checks

CloudWatch, an AWS monitoring system, can react to the status check state changes. Specifically, it is possible to set up a “recover” action for an EC2 instance where a system check is failing. The recovered instance is identical to the original, but will not retain the same public IP unless it’s assigned an Elastic IP. I recommend that you use Elastic IP for your PMM instances (see also this note in PMM documentation). For the full list of caveats related to instance recovery check out the “Recover your instance” page in AWS documentation.
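
If your PMM instance doesn’t have an Elastic IP yet, here is a minimal sketch of allocating and attaching one with the AWS CLI (the allocation ID comes from the output of the first command; the instance ID and region are placeholders):

$ aws ec2 allocate-address --domain vpc --region REGION
$ aws ec2 associate-address --instance-id i-xxx \
    --allocation-id eipalloc-xxxxxxxx --region REGION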

According to CloudWatch pricing, having two alarms set up will cost $0.20/month, an acceptable price for higher availability.

Automatically Recovering PMM on System Failure

Let’s get to actually setting up the automation. There are at least two ways to set up the alarm through GUI: from the EC2 console, and from the CloudWatch interface. The latter option is a bit involved, but it’s very easy to set up alarms from the EC2 console. Just right-click on your instance, pick the “Monitor and troubleshoot” section in the drop-down menu, and then choose the “Manage CloudWatch alarms” item.

EC2 console dropdown menu navigation to CloudWatch Alarms

Once there, choose the “Status check failed” as the “Type of data to sample”, and specify the “Recover” Alarm action. You should see something like this.

Adding CloudWatch alarm through EC2 console interface

You may notice that the GUI offers to set up a notification channel for the alarm. If you want to get a message when your PMM instance is recovered automatically, feel free to set that up.

The alarm will fire when the maximum value of the StatusCheckFailed_System metric is >= 0.99 during two consecutive 300-second periods. Once the alarm fires, it will recover the PMM instance. We can check out our new alarm in the CloudWatch GUI.

EC2 instance restart alarm

EC2 console will also show that the alarm is set up.

Single alarm set in the instance overview

This example uses a pretty conservative alarm check duration of 10 minutes, spread over two 5-minute intervals. If you want to recover a PMM instance sooner, at the risk of triggering on false positives, you can reduce the alarm period and the number of evaluation periods. We also use the “maximum” of the metric over a 5-minute interval, meaning a check could be in a failed state for only one minute out of five and still count towards alarm activation. The assumption here is that checks don’t flap for ten minutes without a reason.

Automatically Restarting PMM on Instance Failure

While we’re at it, we can also set up an automatic action to execute when “Instance status check” is failing. As mentioned, usually that happens when there’s something wrong within the instance: really high load, configuration issue, etc. Whenever a system check fails, an instance check is going to be failing, too, so we’ll set this alarm to check for a longer period of time before firing. That’ll also help us to minimize the rate of false-positive restarts, for example, due to a spike in load. We use the same period of 300 seconds here but will only alarm after three periods show instance failure. The restart will thus happen after ~15 minutes. Again, this is pretty conservative, so adjust as you think works best for you.

In the GUI, repeat the same actions that we did last time, but pick the “Status check failed: instance” data type, and “Reboot” alarm action.

Adding CloudWatch alarm for instance failure through EC2 console interface

Once done, you can see both alarms reflected in the EC2 instance list.

EC2 instance overview showing 2 alarms set

Setting up Alarms Using AWS CLI

Setting up our alarms is also very easy using the AWS CLI. You’ll need an AWS account with permissions to create CloudWatch alarms, and some EC2 permissions to list instances and change their state. A pretty minimal set of permissions can be found in the policy document below; it is enough for a user to invoke the CLI commands in this post.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "cloudwatch:DescribeAlarms"
      ],
      "Resource": [
        "*"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:StopInstances",
        "ec2:StartInstances"
      ],
      "Resource": [
        "*"
      ],
      "Effect": "Allow"
    }
  ]
}

And here are the specific AWS CLI commands. Do not forget to change the instance ID (i-xxx) and region in the commands. First, setting up recovery after a system status check failure.

$ aws cloudwatch put-metric-alarm \
    --alarm-name restart_PMM_on_system_failure_i-xxx \
    --alarm-actions arn:aws:automate:REGION:ec2:recover \
    --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistic Maximum --period 300 --evaluation-periods 2 \
    --datapoints-to-alarm 2 --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --tags Key=System,Value=PMM

Second, setting up restart after instance status check failure.

$ aws cloudwatch put-metric-alarm \
    --alarm-name restart_PMM_on_instance_failure_i-xxx \
    --alarm-actions arn:aws:automate:REGION:ec2:reboot \
    --metric-name StatusCheckFailed_Instance --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistic Maximum --period 300 --evaluation-periods 3 \
    --datapoints-to-alarm 3 --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --tags Key=System,Value=PMM
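
To verify that both alarms were created and see their current state, you can describe them by name (same placeholders as above):

$ aws cloudwatch describe-alarms \
    --alarm-names restart_PMM_on_system_failure_i-xxx restart_PMM_on_instance_failure_i-xxx \
    --region REGION \
    --query 'MetricAlarms[].[AlarmName,StateValue]' --output table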

Testing New Alarms

Unfortunately, it doesn’t seem to be possible to simulate system status check failure. If there’s a way, let us know in the comments. Because of that, we’ll be testing our alarms using instance failure instead. The simplest way to fail the Instance status check is to mess up networking on an instance. Let’s just bring down a network interface. In this particular instance, it’s ens5.

# ip link
...
2: ens5:...
...
# date
Sat Apr 17 20:06:26 UTC 2021
# ip link set ens5 down

The stuff of nightmares: no response to a command. After about 15 minutes, we can check that the PMM instance is available again. We can see that the alarm triggered and executed the restart action at 20:21:37 UTC.

CloudWatch showing Alarm fired actions

And we can access the server now.

$ date
Sat Apr 17 20:22:59 UTC 2021

The alarm itself takes a little more time to return to OK state. Finally, let’s take a per-minute look at the instance check metric.

CloudWatch metric overview

Instance failure was detected at 20:10 and resolved at 20:24, according to the AWS data. In reality, the server rebooted and was accessible even earlier. PMM instances deployed from the Marketplace image have all the required services set to start automatically, so PMM is fully available after a restart without an operator having to take any action.
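
If you prefer the CLI, a similar per-minute view of the instance check can be pulled with get-metric-statistics (a sketch; the time window matches this particular test):

$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name StatusCheckFailed_Instance \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistics Maximum --period 60 \
    --start-time 2021-04-17T20:00:00Z --end-time 2021-04-17T20:30:00Z \
    --region REGION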

Summary

Even though Percona Monitoring and Management itself doesn’t offer high availability at this point, you can utilize built-in AWS features to make your PMM installation more resilient to failure. PMM in the Marketplace is set up in a way that should not have problems recovering from instance retirement events. After your PMM instance is recovered or rebooted, you’ll be able to access monitoring without any manual intervention.

Note: While this procedure is universal and can be applied to any EC2 instance, there are some caveats explained in the “What do I need to know when my Amazon EC2 instance is scheduled for retirement?” article on the AWS site. Note specifically the Warning section. 


Apr 28, 2021

Running Percona Monitoring and Management v2 on Windows with Docker

PMM Windows Docker

One way to run Percona Monitoring and Management (PMM) v2 on Windows is using the Virtual Appliance installation method. It works well but requires extra software to run the virtual appliance, such as Oracle VirtualBox or VMware Workstation Player. Or, you can use Docker.

Docker is not shipped with Windows 10, though you can get it installed for free with Docker Desktop. Modern Docker on Windows can run using either the Hyper-V backend or the Windows Subsystem for Linux v2 (WSL2). Because the WSL2-based Docker installation is more involved, I’m sticking to what is now referred to as the “legacy” Hyper-V backend.

On my Windows 10 Pro (19042) system, I had to change some BIOS settings:

PMM on Windows Docker

And install the Hyper-V component for Docker Desktop to install correctly:

Percona Monitoring and Management Windows

With this, we’re good to go to install Percona Monitoring and Management on Windows.

 

The official instructions for installing PMM with Docker assume you’re running a Linux/Unix-based system, so they need to be slightly modified for Windows. First, there is no sudo on Windows, and second, Command Prompt does not interpret “\” the same way a shell does.

With those minor fixes, though, basically the same commands to start PMM work:

docker pull percona/pmm-server:2
docker create --volume /srv --name pmm-data percona/pmm-server:2 /bin/true
docker run --detach --restart always --publish 80:80 --publish 443:443 --volumes-from pmm-data --name pmm-server percona/pmm-server:2
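
If you want to split the long docker run command across several lines, keep in mind that Command Prompt uses ^ and PowerShell uses a backtick for line continuation instead of the shell’s \. For example, in PowerShell:

docker run --detach --restart always `
    --publish 80:80 --publish 443:443 `
    --volumes-from pmm-data --name pmm-server percona/pmm-server:2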

That’s it! You should have your Percona Monitoring and Management (PMM) v2 running!

Looking at the PMM Server instance through PMM itself, you may notice it does not get all of the host’s resources by default, as Docker on Linux would.

PMM

Note there are only 2 CPU cores, 2GB of memory, and some 60GB of disk space available. This should be enough for a test, but if you want to put some real load on this instance, you may want to adjust those settings. This is done through the “Resources” setting in Docker Desktop:

PMM

Adjust these to suit your load needs and you’re done! Enjoy your Percona Monitoring and Management!

Apr 27, 2021

Percona Monitoring and Management Meetup on May 6th: Join Us for Live Tuning and Roadmap Talk

We’re leveling up – would you like to join us on Discord? There we chat about databases, open source, and even play together.

On May 6th at 11 am EST, the Percona Community is talking about Percona Monitoring and Management (PMM). Our experts will also be there to live-tune and optimize your database – come check it out!

Are there any features you would like to see in PMM? The meetup is a good way to get your voice heard! But you can already start chatting with all of us on Discord right now.

It’s just a few more nights of sleep until Percona Live ONLINE with many interesting events for you to check out too. Join the community and let’s shape the future of databases together!

Apr 16, 2021

Integrated Alerting Design in Percona Monitoring and Management

Integrated Alerting Design Percona Monitoring and Management

Percona Monitoring and Management 2.13 (PMM) introduced the Integrated Alerting feature as a technical preview. It adds a user-friendly way to set up and manage alerts for your databases. You can read more about using this feature in our announcement blog post and in our documentation, while in this article we will focus on design and implementation details.

Entities

There are four basic entities used for IA: Alert Rule Template, Alert Rule, Alert, and Notification Channel.

Everything starts from the alert rule template. You can see its YAML representation below:

---
templates:
 - name: pmm_mongodb_high_memory_usage
   version: 1
   summary: Memory used by MongoDB
   expr: |-
     sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
     / on (node_name) (node_memory_MemTotal_bytes)
     * 100
     > [[ .threshold ]]
   params:
     - name: threshold
       summary: A percentage from configured maximum
       unit: "%"
       type: float
       range: [0, 100]
       value: 80
   for: 5m
   severity: warning
   labels: 
     custom_label: demo
   annotations:
     summary: MongoDB high memory usage ({{ $labels.service_name }})
     description: |-
       {{ $value }}% of memory (more than [[ .threshold ]]%) is used
       by {{ $labels.service_name }} on {{ $labels.node_name }}.

A template serves as the base for alert rules. It defines several fields, let’s look at them:

  • name: uniquely identifies template (required)
  • version: defines template format version (required)
  • summary: a template description (required)
  • expr: a MetricsQL query string with parameter placeholders. MetricsQL is backward compatible with PromQL and provides some additional features. (required)
  • params: contains parameter definitions required for the query. Each parameter has a name, type, and summary. It also may have a unit, available range, and default value.
  • for: specifies the duration for which the expression must be true; the alert query should return a positive result for this entire period before the alert fires (required)
  • severity: specifies default alert severity level (required)
  • labels: are additional labels to be added to generated alerts (optional)
  • annotations: are additional annotations to be added to generated alerts. (optional)

A template is designed to be re-used as the basis for multiple alert rules so from a single pmm_node_high_cpu_load template you can have alerts for production vs non-production, warning vs critical, etc.

Users can create alert rules from templates. An alert rule is what’s actually executed against metrics and what produces an alert. The rule can override default values specified in the template, add filters to apply the rule only to the required services/nodes/etc., and specify target notification channels, such as email, Slack, PagerDuty, or webhooks. If the rule has no associated notification channels, its alerts will be available only via the PMM UI. It’s useful to note that after creation, the rule keeps its relation to the template, and any change in the template will affect all related rules.

Here is an alert rule example:

---
groups:
 - name: PMM Integrated Alerting
   rules:
     - alert: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
       rule: test
       expr: |-
         sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
         / on (node_name) (node_memory_MemTotal_bytes)
         * 100
         > 40
       for: 5s
       labels:
         ia: "1"
         rule_id: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
         severity: error
         template_name: pmm_mongodb_high_memory_usage
          custom_label: demo
       annotations:
         description: |-
            {{ $value }}% of memory (more than 40%) is used
            by {{ $labels.service_name }} on {{ $labels.node_name }}.
         summary: MongoDB high memory usage ({{ $labels.service_name }})

It has a Prometheus alert rule format.

How it Works

The Integrated Alerting feature is built on top of Prometheus Alertmanager, the VictoriaMetrics time series database (TSDB), and VMAlert.

VictoriaMetrics TSDB is the main metrics storage in PMM, VMAlert is responsible for alert rule execution, and Prometheus Alertmanager is responsible for alert delivery. VMAlert runs queries on the VM TSDB, checks whether they are positive for the specified amount of time (example: MySQL is down for 5 minutes), and triggers alerts. All alerts are forwarded to the PMM internal Alertmanager, but they can also be duplicated to an external Alertmanager (this can be set up on the PMM Settings page).

There are four available templates sources:

  1. Built-in templates, shipped with the PMM distribution. They are embedded into the managed binary (a core component of PMM).
  2. Percona servers. It’s not available yet, but it will be similar to the STT checks delivery mechanism (HTTPS + file signatures).
  3. Templates created by the user via PMM UI. We persist them in PMM’s database.
  4. Templates created by the user as files in the /srv/ia/templates directory.

During PMM startup, managed loads templates from all sources into memory.
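
For example, with a Docker-based PMM Server, a file-based template (source 4 above) could be added with something like the following sketch; my_templates.yml is a hypothetical file using the template format shown earlier, and since templates are loaded at startup, a restart of PMM Server may be needed for it to be picked up:

$ docker cp my_templates.yml pmm-server:/srv/ia/templates/
$ docker restart pmm-server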

Alert rules can be created via the PMM UI or just by putting rule files in the /srv/prometheus/rules directory. Alert rules created via the UI are persisted in PMM’s internal PostgreSQL database. For each alert rule from the DB, the managed binary creates a YAML file in /etc/ia/rules/ and asks VMAlert to reload the configuration and reread the rule files. VMAlert executes the query from each loaded alert rule every minute; once the rule condition is met (the query is positive for the specified amount of time), VMAlert produces an alert and passes it to the Alertmanager. Please note that /etc/ia/rules/ is controlled by managed, and any manual changes in that directory will be lost.

Managed generates configuration for Alertmanager and updates it once any related entity changes.

Managed goes through the list of existing rules and collects unique notification channel combinations. For example, if we have two rules and each of them has channels a, b, and c assigned, that is one unique channel combination. For each rule, managed generates a route, and for each unique channel combination it generates a receiver in the Alertmanager configuration file. Each route has a target receiver and a filter by rule ID; it can also contain user-defined filters. If a rule has no assigned notification channels, then a special empty receiver is used. Users can redefine the empty receiver with Alertmanager’s base configuration file /srv/alertmanager/alertmanager.base.yml. When a notification channel is disabled, managed recollects the unique channel combinations excluding disabled channels and regenerates the receivers and routing rules. If a rule has only one specified channel and it was disabled, then a special disabled receiver is used instead. Unlike the empty receiver, the disabled receiver can’t be redefined by the user and always means “do nothing”. This prevents unexpected behavior after disabling channels. After each Alertmanager configuration update, managed asks Alertmanager to reload it.

When Alertmanager receives an alert from VMAlert, it uses the routes to find an appropriate receiver and forwards the alert to the destination channels. The user can also observe alerts via the PMM UI. In that case, managed gets all available alerts from the Alertmanager API and applies the required filters before showing them.

Conclusion

The Integrated Alerting feature has many moving parts, and functionally it’s mostly about managing configuration for different components and making them work together. It provides a really nice way to be aware of important events in your system. While it’s still in the tech preview state, it’s already helpful. With the built-in templates, it’s easy to try without diving into documentation about Prometheus queries and other details. So please try it and tell us about your experience. What parameters of a system would you like to have covered with templates? What use cases do you have for alerting? We will be happy to receive any feedback.

Jan 26, 2021

Achieving a More Secure MongoDB with Percona Monitoring and Management’s Security Tool

More Secure MongoDB

Percona Monitoring and Management (PMM) is a great way to monitor your MongoDB deployment for things like memory, CPU, internal database metrics like WiredTiger cache utilization and read/write ticket utilization, Query Analytics, and much more. Did you know that, in addition to all that, it also has a security threat tool? In this blog, we’ll go over the Security Threat Tool and its checks for MongoDB, and discuss how they can bring you value by helping you prevent MongoDB data leaks.

What is the Security Threat Tool?

PMM security threat tool

The Security Threat Tool is a widget in your main PMM dashboard that runs regular checks against connected databases, alerting you if any servers pose a potential security threat.  By default, these checks are run every 24 hours, so the checks will not be in real-time.  You can run them ad-hoc by going to PMM/PMM Database Checks in your PMM Grafana Dashboards and clicking on the “Run DB Checks” button.  When you first enable the Security Threat Tool it can take up to 24 hours for the results to populate, assuming you do not use the ad hoc button.  In addition to MongoDB, the Security Threat Tool, like PMM itself, can also check MySQL, PostgreSQL, and MariaDB for various security threats.  Additionally, if you find a check is too noisy, you do have the ability to silence it in PMM so that it no longer shows as a failed check.

What Does the Security Threat Tool Check for MongoDB?

At present, the Security Threat Tool checks MongoDB for two security-related items: mongodb_auth and mongodb_version. mongodb_auth checks to be sure that MongoDB authentication is enabled on your deployment, while mongodb_version checks that you’re running the latest version of the MongoDB release series you’re on.

mongodb_auth

Checking MongoDB to ensure that authentication and authorization are enabled is of paramount importance to the security of your database. If authentication and authorization are disabled for your MongoDB deployment, then anyone who can reach the port MongoDB is running on has access to your data. Disabled authentication and authorization, along with binding to all of the interfaces on your machine, have been the cause of many MongoDB data leaks over the years. Having this check can help keep you and your MongoDB deployment from leaking data.

This check can also be helpful if you’re performing maintenance and need to temporarily disable authentication on a secondary or hidden node and forget to re-enable authentication and authorization when you’re done.
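
For reference, enabling authorization is a one-line change in mongod.conf (a minimal sketch; the mongod process needs to be restarted afterwards):

security:
  authorization: enabled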

mongodb_auth

mongodb_version

Many organizations now require more frequent patching or upgrading of software, especially databases, which often house the most critical data for your applications, in order to stay on top of changing security requirements. With MongoDB’s replica set and sharded cluster architecture, it’s fairly easy to upgrade in a rolling manner and keep up to date on your MongoDB versions (after appropriate testing of the new versions!). This check makes sure you’re running the latest version of the minor release series your MongoDB database is on. For example, a MongoDB 3.6 deployment won’t alert you that a newer MongoDB 4.4 exists, but it will tell you when a new minor version of MongoDB 3.6 is released that you can upgrade to. Staying as close as possible to the latest minor release ensures you get as many bug fixes as are available and, just as importantly, any security fixes for that version of MongoDB. You want to minimize your security exposure by staying up to date and having as many security vulnerabilities patched as possible.

mongodb_version

Summary

We hope that this blog post has helped you to see how enabling the Security Threat Tool in Percona Monitoring and Management can help you keep your MongoDB deployment secure.  Thanks for reading!

 

Percona Monitoring and Management is free to download and use. Try it today!

Dec 23, 2020

Observations on Better Resource Usage with Percona Monitoring and Management v2.12.0

Better Resource Usage with Percona Monitoring and Management

Percona Monitoring and Management (PMM) v2.12.0 comes with a lot of improvements, and one of the most talked-about is the use of VictoriaMetrics. The reason we are doing this comparison is that PMM 2.12.0 is the release in which we integrate VictoriaMetrics and replace Prometheus as the default method of data ingestion.

This change was also driven by the motivation to improve performance for PMM Server, and here we will look at why users should consider the 2.12.0 version if they have been looking for a less resource-intensive PMM. This post will try to address some of those concerns.

Benchmark Setup Details

The benchmark was performed using a virtualized system, with PMM Server running on an EC2 instance with 8 cores, 32 GB of memory, and SSD storage. The duration of the observation was 24 hours, and for clients we set up 25 virtualized client instances on Linode, each emulating 10 nodes running MySQL with real workloads using the sysbench TPC-C test.
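
As a rough sketch of the kind of client-side load used (the exact parameters aren’t listed in this post, so the values below are purely illustrative), each emulated node can run the sysbench-tpcc script against its MySQL instance:

$ ./tpcc.lua --db-driver=mysql --mysql-host=127.0.0.1 \
    --mysql-user=sbtest --mysql-password=sbtest --mysql-db=sbt \
    --tables=10 --scale=100 --threads=16 --time=86400 run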

Percona Monitoring and Management benchmark

Both PMM 2.11.1 and PMM 2.12.0 were set up in the same way, with client instances running the exact same load, and to monitor the difference in performance, we used the default metrics mode for 2.12.0 for this observation.

The sample ingestion rate for the load was around 96.1k samples/sec, with around 8.5 billion samples received in 24 hours.

A more detailed Prometheus vs. VictoriaMetrics benchmark was done by the VictoriaMetrics team, and it clearly shows how efficient VictoriaMetrics is and how much better performance can be achieved with it.

Disk Space Usage

VictoriaMetrics has really good efficiency when it comes to disk usage of the host system, and we found that 2.11.1 generates a lot of disk usage spikes with the maximum storage touching around 23.11 GB of storage space. If we compare the same for PMM 2.12.0, the disk usage spikes are not as high as 2.11.1 while the maximum disk usage is around 8.44 GB.

It is clear that PMM 2.12.0 needs 2.7 times less disk space for monitoring the same number of services for the same duration as compared to PMM 2.11.1.

Disk Usage PMM 2.11.1

Disk Usage 1: PMM 2.11.1 

Disk Usage PMM 2.12.0

Disk Usage 2: PMM 2.12.0


Memory Utilization

Another parameter on which PMM 2.12.0 performs better is Memory Utilization. During our testing, we found that PMM 2.11.1 was using two times more memory for monitoring the same number of services. This is indeed a significant improvement in terms of performance.

The memory usage clearly shows several spikes for PMM 2.11.1, which is not the case with PMM 2.12.0.

Memory Utilization PMM 2.11.1

Memory Utilization: PMM 2.11.1

Free Memory: PMM 2.11.1


Memory Utilization: PMM 2.12.0

The Memory Utilization for 2.12.0 clearly shows more than 55% of memory is available across the 24 hours of our observation, which is a significant improvement over 2.11.1.

Free Memory 2.12.0

Free Memory PMM 2.12.0

CPU Usage

During the observation, we noticed a slight increase in CPU usage for PMM 2.12.0; the average CPU usage was about 2.6% higher than PMM 2.11.1, but the maximum CPU usage for both versions did not show any significant difference.

CPU Usage 2.11.1

CPU Usage: PMM 2.11.1

CPU Usage 2.12.0

CPU Usage: PMM 2.12.0

 

Observations

The overall performance improvements are around memory utilization and disk usage, and we also observed significantly lower disk I/O bandwidth with far fewer spikes in write operations for PMM 2.12.0. This behavior is also observed and articulated in the VictoriaMetrics benchmarking. CPU usage and memory are two important resource factors when planning to set up PMM Server, and with PMM 2.12.0 we can safely say that it will cost about half in terms of memory and disk resources when compared to previously released PMM versions. This should also encourage current users to add more instances for monitoring without having to think about the cost of extra infrastructure.

Dec 09, 2020

Enabling HTTPS Connections to Percona Monitoring and Management Using Custom Certificates

HTTPS Connections to Percona Monitoring and Management

Whichever way you installed Percona Monitoring and Management 2 (PMM2), using the Docker image or an OVF image for your supported virtualized environment, PMM2 enables by default two ports for web connections: 80 for HTTP and 443 for HTTPS. For HTTPS, certificates are required to encrypt the connection for better security.

All the installation images contain self-signed certificates already configured, so every PMM2 deployment should work properly when using HTTPS.

This is cool, but sometimes self-signed certificates are not permitted by the security policy adopted by your company. If your company uses a Certificate Authority to sign certificates and keys for encryption, most probably you are required to use the files provided by the CA for all your services, even for PMM2 monitoring.

In this article, we’ll show how to use your custom certificates to enable HTTPS connections to PMM2, according to your security policy.

PMM2 Deployed as a Docker Image

If PMM Server is running as a Docker image, use docker cp to copy certificates. This example copies certificate files from the current working directory to a running PMM Server docker container.

docker cp certificate.crt pmm-server:/srv/nginx/certificate.crt
docker cp certificate.key pmm-server:/srv/nginx/certificate.key
docker cp ca-certs.pem pmm-server:/srv/nginx/ca-certs.pem
docker cp dhparam.pem pmm-server:/srv/nginx/dhparam.pem

If you’re going to deploy a new container, you can use the following to use your own certificates instead of the built-in ones. Let’s suppose your certificates are in /etc/pmm-certs:

docker run -d -p 443:443 --volumes-from pmm-data \
  --name pmm-server -v /etc/pmm-certs:/srv/nginx \
  --restart always percona/pmm-server:2

  • The certificates must be owned by root.
  • The mounted certificate directory must contain the files certificate.crt, certificate.key, ca-certs.pem and dhparam.pem.
  • For SSL encryption, the container must publish on port 443 instead of 80.
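
If you just want to test this procedure before your CA-issued files are available, here is a sketch of generating a matching set of self-signed files with openssl (the names match the ones nginx expects; for a self-signed certificate, the certificate itself can double as the CA file):

$ openssl req -x509 -nodes -newkey rsa:2048 -days 365 \
    -keyout certificate.key -out certificate.crt -subj "/CN=pmm2.mydomain.com"
$ cp certificate.crt ca-certs.pem
$ openssl dhparam -out dhparam.pem 2048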

PMM2 Deployed Using a Virtual Appliance Image

In such cases, you need to connect to the virtual machine and replace the certificate files in /srv/nginx:

  • connect to the virtual machine
    $> ssh root@pmm2.mydomain.com
  • place the CA, certificate, and key files into the /srv/nginx directory. The files must be named certificate.crt, certificate.key, ca-certs.pem, and dhparam.pem
  • if you would like to use different file names you can modify the nginx configuration file /etc/nginx/conf.d/pmm.conf. The following variables must be set:
    ssl_certificate /srv/nginx/my_custom_certificate.crt;
    ssl_certificate_key /srv/nginx/my_custom_certificate.key;
    ssl_trusted_certificate /srv/nginx/my_custom_ca_certs.pem;
    ssl_dhparam /srv/nginx/my_dhparam.pem;
  • restart nginx
    [root@pmm2]> supervisorctl restart nginx

Conclusion

Percona Monitoring and Management is widely used for monitoring MySQL, ProxySQL, MongoDB, PostgreSQL, and operating systems. Setting up custom certificates for connection encryption, according to the security policy adopted by your company, is quite simple. You can rely on PMM2 for troubleshooting your environments in a secure way.

Take a look at the demo site: https://pmmdemo.percona.com

Oct 07, 2020

How to Find Query Slowdowns Using Percona Monitoring and Management

Query Slowdowns Using Percona Monitoring and Management

Visibility is a blessing, and with databases, visibility is a must. That’s true not only for metrics but for the queries themselves. Having info on all the stats around query execution is priceless, and Percona Monitoring and Management (PMM) offers that in the form of the Query Analytics dashboard (QAN).

But where to start? QAN helps you with that by calculating the query profile. What is the profile? It’s a ranking of queries, ordered by Load, so it is easy to spot the heaviest queries hitting your database. The Load is defined as the “Average Active Queries” but can also be thought of as a mix of query execution time and query count. In other words, all the time the query was alive and kicking.

The Profile in PMM 2.10.0 looks like this:

percona monitoring and management

The purpose of this profile is to facilitate the task of finding the queries that are worth improving, or at least the ones that will have a bigger impact on the performance when optimized.

However, how do you know whether a slow query has always been slow or has gone from good performance to painfully slow over time? That’s where the graph in the “Load” column comes in handy.

There’s a method for doing this. The first step is to have a wide view. That means: check a time range long enough so you can see patterns. Personally, I like to check the last 7 days.

The second step is to find irregularities like spikes or increasing patterns. For example, in the above profile, we can see that the “SHOW BINARY LOGS” command is #4 among the queries adding the most load to the database. In this case, it’s because the binlogs are not being purged, so every day there are more and more binlog files to read, and that adds to the execution time. The number of times the “SHOW BINARY LOGS” query is executed, however, remains the same.
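
As an aside, the binlog growth itself can be addressed with a retention setting or a manual purge on the affected server; a hedged sketch with illustrative retention values:

mysql> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;
mysql> SET GLOBAL binlog_expire_logs_seconds = 604800; -- MySQL 8.0; use expire_logs_days on 5.7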

Another query with an “anomaly” in the load graph is the top #3 one. Let’s isolate it and see what happened:

Query Analytics dashboard percona

The third step will be to reduce the time to a range involving the event so we can isolate it even more:

Query Analytics dashboard percona monitoring and management

The event happened between 8 AM and 9 AM. To confirm or rule out that this is an isolated event related only to this query, let’s look again at all the queries running at that same moment.

So this is a generic situation, common to several queries. Most likely it was an event on the server that made queries stall.

By looking at the threads graph, we can confirm that hypothesis:

MySQL Active Client Threads

After some digging, the root cause was found to be a disk problem:

Query Analytics dashboard disk latency

It’s confirmed that it is not an issue with the query itself, so no need to “optimize” due to this spike.

In conclusion, with the new QAN dashboard available since PMM 2.10.0, finding query slowdowns is easier thanks to the Load graph that can give us context pretty fast.

Try Percona Monitoring and Management today, for free!

Nov 27, 2019

Running PMM1 and PMM2 Clients on the Same Host

Running PMM1 and PMM2 Clients

Want to try out Percona Monitoring and Management 2 (PMM 2) but you’re not ready to turn off your PMM 1 environment? This blog is for you! Keep in mind that the methods described are not intended to be a long-term migration strategy, but rather, simply a way to deploy a few clients in order to sample PMM 2 before you commit to the upgrade.

Here are step-by-step instructions for deploying PMM 1 & 2 client functionality i.e. pmm-client and pmm2-client, on the same host.

  1. Deploy PMM 1 on Server1 (you’ve probably already done this)
  2. Install and setup pmm-client for connectivity to Server1
  3. Deploy PMM 2 on Server2
  4. Install and setup pmm2-client for connectivity to Server2
  5. Remove pmm-client and switch completely to pmm2-client

The first few steps are already described in our PMM1 documentation so we are simply providing links to those documents.  Here we’ll focus on steps 4 and 5.

Install and Setup pmm2-client Connectivity to Server2

It’s not possible to install both clients from a repository at the same time. So you’ll need to download a tarball of pmm2-client. Here’s a link to the latest version directly from our site.

Download pmm2-client Tarball

* Note that depending on when you’re seeing this, the commands below may not be for the latest version, so the commands may need to be updated for the version you downloaded.

$ wget https://www.percona.com/downloads/pmm2/2.1.0/binary/tarball/pmm2-client-2.1.0.tar.gz

Extract Files From pmm2-client Tarball

$ tar -zxvf pmm2-client-2.1.0.tar.gz 
$ cd pmm2-client-2.1.0

Register and Generate Configuration File

Now it’s time to set up a PMM 2 client. In our example, the PMM2 server IP is 172.17.0.2 and the monitored host IP is 172.17.0.1.

$ ./bin/pmm-agent setup --config-file=config/pmm-agent.yaml \
--paths-node_exporter="$PWD/pmm2-client-2.1.0/bin/node_exporter" \
--paths-mysqld_exporter="$PWD/pmm2-client-2.1.0/bin/mysqld_exporter" \
--paths-mongodb_exporter="$PWD/pmm2-client-2.1.0/bin/mongodb_exporter" \
--paths-postgres_exporter="$PWD/pmm2-client-2.1.0/bin/postgres_exporter" \
--paths-proxysql_exporter="$PWD/pmm2-client-2.1.0/bin/proxysql_exporter" \
--server-insecure-tls --server-address=172.17.0.2:443 \
--server-username=admin  --server-password="admin" 172.17.0.1 generic node8.ca

Start pmm-agent

Let’s run pmm-agent in a screen session. There’s no service manager integration when deploying alongside pmm-client, so if your server restarts, pmm-agent won’t automatically resume.

# screen -S pmm-agent

$ ./bin/pmm-agent --config-file="$PWD/config/pmm-agent.yaml"

Check the Current State of the Agent

$ ./bin/pmm-admin list
Service type  Service name         Address and port  Service ID

Agent type                  Status     Agent ID                                        Service ID
pmm-agent                   connected  /agent_id/805db700-3607-40a9-a1fa-be61c76fe755  
node_exporter               running    /agent_id/805eb8f6-3514-4c9b-a05e-c5705755a4be

Add MySQL Service

Detach the screen session, then add the MySQL service:

$ ./bin/pmm-admin add mysql --use-perfschema --username=root mysqltest
MySQL Service added.
Service ID  : /service_id/28c4a4cd-7f4a-4abd-a999-86528e38992b
Service name: mysqltest

Here is the state of pmm-agent:

$ ./bin/pmm-admin list
Service type  Service name         Address and port  Service ID
MySQL         mysqltest            127.0.0.1:3306    /service_id/28c4a4cd-7f4a-4abd-a999-86528e38992b

Agent type                  Status     Agent ID                                        Service ID
pmm-agent                   connected  /agent_id/805db700-3607-40a9-a1fa-be61c76fe755   
node_exporter               running    /agent_id/805eb8f6-3514-4c9b-a05e-c5705755a4be   
mysqld_exporter             running    /agent_id/efb01d86-58a3-401e-ae65-fa8417f9feb2  /service_id/28c4a4cd-7f4a-4abd-a999-86528e38992b
qan-mysql-perfschema-agent  running    /agent_id/26836ca9-0fc7-4991-af23-730e6d282d8d  /service_id/28c4a4cd-7f4a-4abd-a999-86528e38992b

Confirm you can see activity in each of the two PMM Servers:

PMM 1 PMM 2

Remove pmm-client and Switch Completely to pmm2-client

Once you’ve decided to move over completely to PMM 2, it’s better to switch from the tarball version to an installation from the repository. This will let you perform client updates much more easily and register the new agent as a service so it starts automatically with the server. Also, we will show you how to make the switch without re-adding monitored instances.

Configure Percona Repositories

$ sudo yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm 
$ sudo percona-release disable all 
$ sudo percona-release enable original release 
$ yum list | grep pmm 
pmm-client.x86_64                    1.17.2-1.el6                  percona-release-x86_64
pmm2-client.x86_64                   2.1.0-1.el6                   percona-release-x86_64

Here is a link to the apt variant.

Remove pmm-client

yum remove pmm-client

Install pmm2-client

$ yum install pmm2-client
Loaded plugins: priorities, update-motd, upgrade-helper
4 packages excluded due to repository priority protections
Resolving Dependencies
--> Running transaction check
---> Package pmm2-client.x86_64 0:2.1.0-5.el6 will be installed
...
Installed:
  pmm2-client.x86_64 0:2.1.0-5.el6                                                                                                                                                           

Complete!

Configure pmm2-client

Let’s copy the currently used pmm2-client configuration file in order to avoid re-adding the monitored instances.

$ cp pmm2-client-2.1.0/config/pmm-agent.yaml /tmp

It’s required to set the new location of exporters (/usr/local/percona/pmm2/exporters/) in the file.

$ sed -i 's|node_exporter:.*|node_exporter: /usr/local/percona/pmm2/exporters/node_exporter|g' /tmp/pmm-agent.yaml
$ sed -i 's|mysqld_exporter:.*|mysqld_exporter: /usr/local/percona/pmm2/exporters/mysqld_exporter|g' /tmp/pmm-agent.yaml
$ sed -i 's|mongodb_exporter:.*|mongodb_exporter: /usr/local/percona/pmm2/exporters/mongodb_exporter|g' /tmp/pmm-agent.yaml 
$ sed -i 's|postgres_exporter:.*|postgres_exporter: /usr/local/percona/pmm2/exporters/postgres_exporter|g' /tmp/pmm-agent.yaml
$ sed -i 's|proxysql_exporter:.*|proxysql_exporter: /usr/local/percona/pmm2/exporters/proxysql_exporter|g' /tmp/pmm-agent.yaml

The default configuration file has to be replaced by our file and the service pmm-agent has to be restarted.

$ cp /tmp/pmm-agent.yaml /usr/local/percona/pmm2/config/
$ systemctl restart pmm-agent

Check Monitored Services

So now we can verify the current state of monitored instances.

$ pmm-admin list

It can also be checked on the PMM server side.

Nov 22, 2019

Tips for Designing Grafana Dashboards

Designing Grafana Dashboards

As Grafana powers our star product – Percona Monitoring and Management (PMM) – we have developed a lot of experience creating Grafana dashboards over the last few years. In this article, I will share some of the considerations for designing Grafana dashboards. As usual, questions of design are quite subjective, and I do not expect you to apply all of them to your dashboards, but I hope they will help you think through your dashboard design better.

Design Practical Dashboards

Grafana features many panel types, and even more are available as plugins. It may be very attractive to use many of them in your dashboards, with many different visualization options. Do not! Stick to a few data visualization patterns and only add additional visualizations when they provide practical value, not because they are cool. The Graph and Singlestat panel types probably cover 80% of use cases.

Do Not Place Too Many Graphs Side by Side

This will probably depend a lot on how your dashboards are used. If your dashboard is designed for large screens placed on the wall, you may be able to fit more graphs side by side; if your dashboard needs to scale down to a lower-resolution laptop screen, I would suggest sticking to 2-3 graphs in a row.

Use Proper Units

Grafana allows you to specify a unit for the data type displayed. Use it! Without a unit set, values will not be properly shortened and will be very hard to read:

Grafana Dashboards

Compare this to

Grafana Dashboards2

Mind Decimals

You can specify the number of digits displayed after the decimal point or leave it at the default. I found the default does not always work very well, for example here:

Grafana Dashboards3

For some reason, on the panel axis, we have way too many digits displayed after the decimal point. Grafana also often picks three digits after the decimal point, as in the table below, which I find inconvenient: at a glance, it is hard to tell whether we’re dealing with a decimal point or with “,” as a thousands separator, so I may be looking at 2462 GiB there. While that is not feasible in this case, there are cases such as data rate where a 1000x value difference is quite possible. Instead, I prefer setting it explicitly to two decimals (or one if that is enough), which makes it clear that we’re not looking at thousands.

Label your Axis

You can label your axis, which especially makes sense if the presentation is not as obvious as in this example, where we’re using a negative value to plot writes to a swap file.

Grafana Dashboards4

Use Shared Crosshair or Tooltip

In Dashboard Settings, you will find the “Graph Tooltip” option, which can be set to “Default”, “Shared Crosshair”, or “Shared Tooltip”. This is how these will look:

Grafana Dashboards5

Grafana Dashboards 6


 

Shared Crosshair shows a line matching the same time on all panels, while Shared Tooltip shows the tooltip value on all panels at the same time. You can pick what makes sense for you; my favorite is the tooltip setting because it allows me to visually compare the same time without making the dashboard too slow and busy.

Note there is a handy shortcut, CTRL-O, which allows you to cycle between these settings for any dashboard.

Pick Colors

If you’re displaying truly dynamic information you will likely have to rely on Grafana’s automatic color assignment, but if not, you can pick specific colors for all values being plotted.  This will prevent colors from potentially being re-assigned to different values without you planning to do so.

Grafana Dashboards 7

When picking colors, you also want to make sure they make logical sense. For example, I think “green” is a better color for free memory than “red”. As you pick colors, use the same colors for the same type of information across different panels where possible, because it makes the dashboards easier to understand.

I would even suggest sticking to the same (or similar) color for the Same Kind of Data – if you have many panels which show disk Input and Output using similar colors, this can be a good idea.

Fill Stacking Graphs

Grafana does not require it, but I would suggest you use filling when you display stacking data and don’t use filling when you’re plotting multiple independent values.  Take a look at these graphs:

In the first graph, I need to look at the actual value of the plotted line to understand what I’m looking at. In the second graph, that value is meaningless; what is valuable is the filled amount. I can see on the second graph by what amount the Cache (the blue value) has shrunk.

I prefer using a fill factor of 6+ so it is easier to match the fill colors with colors in the table.   For the same reason, I prefer not to use the fill gradient on such graphs as it makes it much harder to see the color and the filled volume.

Do Not Abuse Double Axis

Graphs that use a double axis are much harder to understand. I used to use them very often, but now I avoid them when possible, only using one when I absolutely need to limit the number of panels.

Note in this case I think gradient fits OK because there is only one value displayed as the line, so you can’t get confused if you need to look at total value or “filled volume”.

Separate Data of Different Scales on Different Graphs

I used to plot InnoDB Rows Read and Written on the same graph. It is quite common for reads to be 100x higher in volume than writes, crowding the writes out and making even significant changes in them very hard to see. Splitting them into separate graphs solved this issue.

Consider Staircase Graphs

In monitoring applications, we often display average rates computed over a period of time. If this is the case, we do not know how the rate was changing within that period, and it would be misleading to show it as if we did. This especially matters if you’re displaying only a few data points.

Let’s look at this graph which is being viewed with one-hour resolution:

This visually suggests that the amount of rows read was falling from 16:00 to 18:00. If we compare it to the staircase graph:

It simply shows us that the value at 18:00 was higher than at 17:00, but does not make any claim about how the value changed within the period.

This display, however, has another issue. Let’s look at the same data set with 5min resolution:

We can see the average value from 16:00 to 17:00 was lower than from 17:00 to 18:00, but this is NOT what the lower-resolution staircase graph shows – there, the value for 17:00 to 18:00 is actually lower!

The reason is that if we compute rate() over 1 hour on the Prometheus side at 17:00, it is returned as a data point for 17:00, even though this average rate really covers 16:00 to 17:00, while the staircase graph will plot it from 17:00 to 18:00 until a new value is available. It is off by one hour.

To fix it, you need to shift the data appropriately. In Prometheus, which we use in PMM, I can use an offset operator to shift the data to be displayed correctly:

Provide Multiple Resolutions

I’m a big fan of being able to see the data on the same dashboard with different resolutions, which can be done through a special dashboard variable of type “Interval”.  High-resolution data can provide a great level of detail but can be very volatile.

While lower resolution can hide this level of detail, it does show trends better.
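
For example, with a dashboard variable named interval of type “Interval”, a panel query can reference it directly so the same panel can be viewed at 1m, 5m, or 1h resolution (a sketch using a hypothetical MySQL metric exported by mysqld_exporter):

rate(mysql_global_status_questions[$interval])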

Multiple Aggregates for the Same Metrics

To get even more insights, you can consider plotting the same metrics with different aggregates applied to it:

In this case, we are looking at the same variable – threads_running – but at its average value over a period of time versus max (peak) value. Both of them are meaningful in a different way.

You can also notice here that points are used for the Max value instead of a line. This is in general good practice for highly volatile data, as plotting a line for something which changes wildly is messy and does not provide much value.

Use Help and Panel Links

If you fill out a description for the panel, it will be visible when you place your mouse over the tiny “i” sign. This is very helpful to explain what the panel shows and how to use this data. You can use Markdown for formatting. You can also provide one or more panel links that you can use for additional help or drill-down.

With newer Grafana versions, you can even define a more advanced drill-down, which can contain different URLs based on the series you are looking at, as well as other templating variables:

Summary

This list of considerations for designing Grafana Dashboards and best practices is by no means complete, but I hope you pick up an idea or two which will allow you to create better dashboards!
