Oct 14, 2021

Custom Percona Monitoring and Management Metrics in MySQL and PostgreSQL


A few weeks ago we did a live stream talking about Percona Monitoring and Management (PMM) and showcased some of the fun things we were doing at the OSS Summit.  During the live stream, we tried to enable some custom queries to track the number of comments being added to our movie database example.  We ran into a bit of a problem live and did not get it to work. As a result, I wanted to follow up, show you how to add your own custom metrics to PMM, and point out some gotchas to avoid when building them.

Custom metrics are defined in a file deployed on each server you are monitoring (not on the PMM Server itself).  You can add custom metrics by navigating to one of the following directories:

  • For MySQL:  /usr/local/percona/pmm2/collectors/custom-queries/mysql
  • For PostgreSQL:  /usr/local/percona/pmm2/collectors/custom-queries/postgresql
  • For MongoDB:  This feature is not yet available – stay tuned!

You will notice the following directories under each directory:

  • high-resolution/  – every 5 seconds
  • medium-resolution/ – every 10 seconds
  • low-resolution/ – every 60 seconds

Note that you can change the frequency of the default metric collections up or down by going to the settings and changing them there.  It would be ideal if, in the future, we added a resolution config in the YML file directly, but for now it is a universal setting:

Percona Monitoring and Management metric collections

In each directory you will find an example .yml file with a format like the following:

mysql_oss_demo: 
  query: "select count(1) as comment_cnt from movie_json_test.movies_normalized_user_comments;"
  metrics: 
    - comment_cnt: 
        usage: "GAUGE" 
        description: "count of the number of comments coming in"

Our error during the live stream was that we forgot to qualify the table with its database in our query (i.e. database_name.table_name), and there was a bug that prevented us from seeing the error in the log files.  There is no setting for the database in the YML, so take note.
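For comparison, here is a minimal sketch of what an equivalent PostgreSQL custom query file might look like (pg_oss_demo is a made-up metric prefix; the layout mirrors the MySQL format, and since the query runs against the database the agent connects to, the table is referenced without a database prefix):

pg_oss_demo:
  query: "SELECT count(*) AS comment_cnt FROM movies_normalized_user_comments"
  metrics:
    - comment_cnt:
        usage: "GAUGE"
        description: "count of the number of comments coming in"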

The example above will create a metric named mysql_oss_demo_comment_cnt in whatever resolution you specify.  Each YML file is executed separately with its own connection, which is important to understand: if you deploy lots of custom queries you will see a steady number of extra connections (something to consider when planning custom collections).  Alternatively, you can add several queries and metrics to the same file, but they are executed sequentially.  If the entire YML file cannot complete in less time than the defined resolution (i.e. finish within five seconds for high resolution), the data will not be stored, but the query will continue to run.  This can lead to a query pile-up if you are not careful.   For instance, the above query generally takes 1-2 seconds to return the count, so I placed it in the medium bucket.  As I added load to the system, the query time backed up.

You can see the slowdown.  You need to be careful here and choose the appropriate resolution.  Moving this over to the low resolution solved the issue for me.

That said, query response time is dynamic based on the conditions of your server.  Because these queries will run to completion (and in parallel if the run time is longer than the resolution time), you should consider limiting the query time in MySQL and PostgreSQL to prevent too many queries from piling up.

In MySQL you can use the MAX_EXECUTION_TIME optimizer hint (note that its value is in milliseconds, so 4000 would give a four-second limit comparable to the PostgreSQL example below):

mysql>  select /*+ MAX_EXECUTION_TIME(4) */  count(1) as comment_cnt from movie_json_test.movies_normalized_user_comments ;
ERROR 1317 (70100): Query execution was interrupted

And on PostgreSQL you can use:

SET statement_timeout = '4s'; 
select count(1) as comment_cnt from movies_normalized_user_comments ;
ERROR:  canceling statement due to statement timeout

By forcing a timeout you can protect yourself.  That said, these interrupted queries are reported as errors, so you may see them show up in the error log.

You can check the system logs (syslog or messages) for errors with your custom queries (note that as of PMM 2.0.21, errors were not making it into these logs because of a potential bug).  If the data is being collected and everything is set up correctly, head over to the default Grafana Explore view or the “Advanced Data Exploration” dashboard in PMM.  Look for your metric and you should be able to see the data graphed out:

Advanced Data Exploration PMM

In the above screenshot, you will notice some pretty big gaps in the data (in green).  These gaps were caused by our query taking longer than the resolution bucket.  You can see when we moved to 60-second resolution (in orange), the graphs filled in.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

Download Percona Monitoring and Management Today

Jul 07, 2021

Inspecting MySQL Servers Part 5: Percona Monitoring and Management


In the previous posts of this series, I presented how the Percona Support team approaches the analysis and troubleshooting of a MySQL server using a tried-and-tested method supported by specific tools found in the Percona Toolkit:

Inspecting MySQL Servers Part 1: The Percona Support Way

Inspecting MySQL Servers Part 2: Knowing the Server

Inspecting MySQL Servers Part 3: What MySQL?

Inspecting MySQL Servers Part 4: An Engine in Motion

A drawback of such an approach is that data collection is done in a “reactive” way, and (part of) it needs to be processed before we can interpret it. Enter Percona Monitoring and Management (PMM): PMM continually collects MySQL status variables and plots the metrics in easy-to-interpret Grafana graphs and panels. Plus, it includes a rich Query Analytics dashboard that helps identify the top slow queries and shows how they are executing. It makes for an excellent complement to the approach we presented. In fact, many times it takes the central role: we analyze the data available in PMM and, if necessary, look at complementing it with pt-stalk samples. In this post, I will show you how we can obtain much of the same information we got from the Percona Toolkit tools (and sometimes more) from PMM.

* As was the case in the previous posts in this series, the data and graphs used to illustrate this post do not come from a single server and have been captured using different versions of PMM.

Know the Server

Once you are connected to PMM, you can select the target server under the Node Name field in the menu located on the top-left side of the interface, then select PMM dashboards on the left menu, System (Node), and, finally, Node Summary, as shown in the screenshot below:

PMM Dashboard

The header section of the Node Summary page shows the basic hardware specs of the server as well as a few metrics and projections. On the right side of this section you will find the full output of pt-summary, which we scrutinized extensively in the second post of this series, there waiting for you:

MySQL Node Summary
Below the header section, there are four panels dedicated to CPU, Memory, Disk, and Network, each containing graphs with metrics specific to those areas. This makes it easy, for example, to look at overall CPU utilization:

CPU utilization

Recent spikes in I/O activity:

And memory usage:

Note the graphs cover the last 12 hours of activity by default but you can select a different time range in the top-right menu:

What MySQL?

Taking a slightly different route by selecting MySQL instead of System (Node) and then MySQL Summary, we get to access a dashboard that displays MySQL-specific metrics for the selected instance:

MySQL Summary

Under the Service Summary panel, you will find the full output of pt-mysql-summary, which we reviewed in detail in the third post of this series:

The main goal of pt-mysql-summary is to provide a sneak peek into how MySQL is configured at a single point in time.  With PMM you get instant access to most of the MySQL trends and status variables we only catch a glimpse of in that report. We can look straight under the hood at the engine characteristics while it is under load, over anything from the last 5 minutes to the last 30 days or more!

An Engine in Motion

There is so much we can look at at this point. If we more or less follow the sequence observed in the previous posts, we can start by checking whether the table cache is big enough. The example below shows it to be just right, based on the limited time frame this particular sample covers, with an average hit ratio close to 100%:

MySQL Table Open Cache Status
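For reference, the hit ratio plotted there can be approximated in Grafana's Explore view with an expression along these lines (a sketch using the standard mysqld_exporter metric names; the exact query behind the dashboard panel may differ):

rate(mysql_global_status_table_open_cache_hits[5m])
  / (rate(mysql_global_status_table_open_cache_hits[5m])
     + rate(mysql_global_status_table_open_cache_misses[5m]))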

Or we can look for a disruption in the pattern, such as a peak in threads connected:

And then investigate the effects it caused on the server (or was it already a consequence of something else that occurred?), for example, a change in the rate of temporary tables created at that time for both in-memory and on-disk tables:

The MySQL Instance Summary is just one of many dashboards available for MySQL:

Under the MySQL InnoDB Details dashboard we find many InnoDB-specific metrics plotted as a multitude of different graphs, providing a visual insight into things such as the number of requests that can be satisfied from data that is already loaded in the Buffer Pool versus those that must be first read from disk (does my hot data fit in memory?):

InnoDB Buffer Pool Requests

Besides metrics derived from MySQL status variables, there is also data parsed directly from SHOW ENGINE INNODB STATUS. For instance, we can find long-running transactions based on increasing values of InnoDB’s history list length:

Another perk of PMM is the ability to easily evaluate whether redo log space is big enough based on the rate of writes versus the size of the log files:

And thus observe checkpoint age, a concept that is explained in detail for PMM in How to Choose the MySQL innodb_log_file_size:
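If you want to reproduce that estimate by hand, the classic approach from that post is to measure how much redo MySQL writes per hour and compare it with the total redo log size (a quick sketch, run directly on the server):

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_os_log_written';
-- wait 60 seconds, run the same statement again, and subtract the two values;
-- multiply the difference by 60 to get bytes of redo written per hour, then
-- compare that with innodb_log_file_size * innodb_log_files_in_group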

Another evaluation made easy with PMM is whether a server’s workload is benefitting from having InnoDB’s Adaptive Hash Index (AHI) enabled. The example below shows an AHI hit-ratio close to 100% up to a certain point, from which the number of searches increased and the situation inverted:

The evaluation of settings like the size of the redo log space and the efficiency of AHI should be done at a macro level, spanning days: we should be looking for what is the best general configuration for these. However, when we are investigating a particular event, it is important to zoom in on the time frame where it occurred to better analyze the data captured at the time. Once you do this, change the data resolution from the default of auto to 1s or 5s interval/granularity so you can better see spikes and overall variation: 

QAN: Query Analytics

Query analysis is something I only hinted at but didn’t explore in the first articles of this series. The “manual” way requires processing the slow query log with a tool such as pt-query-digest and then going after the details of a particular query by connecting to the server to obtain its execution plan and schema details. A really strong feature of PMM is the Query Analytics dashboard, which provides a general overview of query execution and captures all of that information for you.

The example below comes from a simple sysbench read-write workload on my test server:

PMM Query Analytics

We can select an individual query from the list and check the details of its execution:

The query’s  EXPLAIN plan is also available, both in classic and JSON formats:

You can read more about QAN on our website as well as in other posts on our blog platform, such as How to Find Query Slowdowns Using Percona Monitoring and Management.

What PMM Does Not Include

There remains information we cannot obtain from PMM, such as the full output of SHOW ENGINE INNODB STATUS. For situations where obtaining this information is important, we resort to pt-stalk. It is not one or the other; we see them as complementary tools in our job of inspecting MySQL servers.

If you are curious about PMM and would like to see how it works in practice, check our demo website at https://pmmdemo.percona.com/. To get up and running with PMM quickly, refer to our quickstart guide.

Tuning the Engine for the Race Track

There you have it! It certainly isn’t all there is, but we’ve packed a lot into this series, enough to get you moving in the right direction when it comes to inspecting and troubleshooting MySQL servers. I hope you have enjoyed the journey and learned a few new tricks along the way!

Jun 14, 2021

MongoDB Integrated Alerting in Percona Monitoring and Management


Percona Monitoring and Management (PMM) recently introduced the Integrated Alerting feature as a technical preview. This was a very eagerly awaited feature, as PMM no longer needs to integrate with an external alerting system. Recently we blogged about the release of this feature.

PMM includes some built-in templates, and in this post, I am going to show you how to add your own alerts.

Enable Integrated Alerting

The first thing to do is navigate to the PMM Settings by clicking the wheel on the left menu, and choose Settings:

Next, go to Advanced Settings, and click on the slider to enable Integrated Alerting down in the “Technical Preview” section.

While you’re here, if you want to enable SMTP or Slack notifications you can set them up right now by clicking the new Communications tab (which shows up after you hit “Apply Changes” to turn on the feature).

The example below shows how to configure email notifications through Gmail:

You should now see the Integrated Alerting option in the left menu under Alerting, so let’s go there next:

Configuring Alert Destinations

After clicking on the Integrated Alerting option, go to the Notification Channels to configure the destination for your alerts. At the time of this writing, email via your SMTP server, Slack and PagerDuty are supported.

Creating a Custom Alert Template

Alerts are defined using MetricsQL, which is backward compatible with PromQL. As an example, let’s configure an alert to let us know if MongoDB is down.

First, let’s go to the Explore option from the left menu. This is the place to play with the different metrics available and create the expressions for our alerts:

To identify MongoDB being down, one option is to use the up metric. The following expression gives us the series we need:

up{service_type="mongodb"}

To validate this, I shut down a member of a 3-node replica set and verified that the expression returns 0 when the node is down:

The next step is creating a template for this alert. I won’t go into a lot of detail here, but you can check Integrated Alerting Design in Percona Monitoring and Management for more information about how templates are defined.

Navigate to the Integrated Alerting page again, and click on the Add button, then add the following template:

---
templates:
  - name: MongoDBDown
    version: 1
    summary: MongoDB is down
    expr: |-
      up{service_type="mongodb"} == 0
    severity: critical
    annotations:
      summary: MongoDB is down ({{ $labels.service_name }})
      description: |-
        MongoDB {{ $labels.service_name }} on {{ $labels.node_name }} is down

This is how it looks:

Next, go to the Alert Rules and create a new rule. We can use the Filters section to add comma-separated “key=value” pairs to filter alerts per node, per service, per agent, etc.

For example: node_id=/node_id/123456, service_name=mongo1, agent_id=/agent_id/123456

After you are done, hit the Save button and go to the Alerts dashboard to see if the alert is firing:

From this page, you can also silence any firing alerts.

If you configured email as a destination, you should have also received a message like this one:

For now, a single notification is sent. In the future, it will be possible to customize the behavior.

Creating MongoDB Alerts

In addition to the obvious “MongoDB is down” alert, there are a few more things we should monitor. For starters, I’d suggest creating alerts for the following conditions:

  • Replica set member in an unusual state
mongodb_replset_member_state != 1 and mongodb_replset_member_state != 2

  • Connections higher than expected
avg by (service_name) (mongodb_connections{state="current"}) > 5000

  • Cache evictions higher than expected
avg by(service_name, type) (rate(mongodb_mongod_wiredtiger_cache_evicted_total[5m])) > 5000

  • Low WiredTiger tickets
avg by(service_name, type) (max_over_time(mongodb_mongod_wiredtiger_concurrent_transactions_available_tickets[1m])) < 50

The values listed above are just for illustrative purposes, you need to decide the proper thresholds for your specific environment(s).

As another example, let’s add the alert template for the low WiredTiger tickets:

---
templates:
  - name: MongoDB Wiredtiger Tickets
    version: 1
    summary: MongoDB Wiredtiger Tickets low
    expr: avg by(service_name, type) (max_over_time(mongodb_mongod_wiredtiger_concurrent_transactions_available_tickets[1m])) < 50
    severity: warning
    annotations:
      description: "WiredTiger available tickets on (instance {{ $labels.node_name }}) are less than 50"

Conclusion

Integrated alerting is a really nice feature to have. While it is still in tech preview state, there are already a few built-in alerts you can test, and you can also define your own. Make sure to check the Integrated Alerting official documentation for more information about this topic.

Do you have any specific MongoDB alerts you’d like to see? Given the feature is still in technical preview, any contributions and/or feedback about the functionality are welcome as we’re looking to release this as GA very soon!

Apr 29, 2021

Improving Percona Monitoring and Management EC2 Instance Resilience Using CloudWatch Alarm Actions


Nothing lasts forever, including the hardware running your EC2 instances. You will usually receive an advance warning on hardware degradation and subsequent instance retirement, but sometimes hardware fails unexpectedly. Percona Monitoring and Management (PMM) currently doesn’t have an HA setup, and such failures can leave wide gaps in monitoring if not resolved quickly.

In this post, we’ll see how to set up automatic recovery for PMM instances deployed in AWS through Marketplace. The automation will take care of the instance following an underlying systems failure. We’ll also set up an automatic restart procedure in case the PMM instance itself is experiencing issues. These simple automatic actions go a long way in improving the resilience of your monitoring setup and minimizing downtime.

Some Background

Each EC2 instance has two associated status checks: System and Instance. You can read about them in more detail on the “Types of status checks” AWS documentation page. The gist of it is: the System check fails when there are infrastructure issues, while the Instance check fails if there’s anything wrong on the instance side, like its OS having issues. You can normally see the results of these checks as “2/2 checks passed” markings on your instances in the EC2 console.

ec2 instance overview showing 2/2 status checks

CloudWatch, an AWS monitoring system, can react to the status check state changes. Specifically, it is possible to set up a “recover” action for an EC2 instance where a system check is failing. The recovered instance is identical to the original, but will not retain the same public IP unless it’s assigned an Elastic IP. I recommend that you use Elastic IP for your PMM instances (see also this note in PMM documentation). For the full list of caveats related to instance recovery check out the “Recover your instance” page in AWS documentation.

According to CloudWatch pricing, having two alarms set up will cost $0.20/month, an acceptable cost for higher availability.

Automatically Recovering PMM on System Failure

Let’s get to actually setting up the automation. There are at least two ways to set up the alarm through GUI: from the EC2 console, and from the CloudWatch interface. The latter option is a bit involved, but it’s very easy to set up alarms from the EC2 console. Just right-click on your instance, pick the “Monitor and troubleshoot” section in the drop-down menu, and then choose the “Manage CloudWatch alarms” item.

EC2 console dropdown menu navigation to CloudWatch Alarms

Once there, choose the “Status check failed” as the “Type of data to sample”, and specify the “Recover” Alarm action. You should see something like this.

Adding CloudWatch alarm through EC2 console interface

You may notice that the GUI offers to set up a notification channel for the alarm. If you want to get a message when your PMM instance is recovered automatically, feel free to set that up.

The alarm will fire when the maximum value of the StatusCheckFailed_System metric is >= 0.99 during two consecutive 300-second periods. Once the alarm fires, it will recover the PMM instance. We can check out our new alarm in the CloudWatch GUI.

EC2 instance restart alarm

EC2 console will also show that the alarm is set up.

Single alarm set in the instance overview

This example uses a pretty conservative alarm check duration of 10 minutes, spread over two 5-minute intervals. If you want to recover a PMM instance sooner, at the risk of triggering on false positives, you can bring down the alarm period and the number of evaluation periods. We also use the “maximum” of the metric over a 5-minute interval. That means a check could be in a failed state for only one minute out of five and still count towards alarm activation. The assumption here is that checks don’t flap for ten minutes without a reason.

Automatically Restarting PMM on Instance Failure

While we’re at it, we can also set up an automatic action to execute when “Instance status check” is failing. As mentioned, usually that happens when there’s something wrong within the instance: really high load, configuration issue, etc. Whenever a system check fails, an instance check is going to be failing, too, so we’ll set this alarm to check for a longer period of time before firing. That’ll also help us to minimize the rate of false-positive restarts, for example, due to a spike in load. We use the same period of 300 seconds here but will only alarm after three periods show instance failure. The restart will thus happen after ~15 minutes. Again, this is pretty conservative, so adjust as you think works best for you.

In the GUI, repeat the same actions that we did last time, but pick the “Status check failed: instance” data type, and “Reboot” alarm action.

Adding CloudWatch alarm for instance failure through EC2 console interface

Once done, you can see both alarms reflected in the EC2 instance list.

EC2 instance overview showing 2 alarms set

Setting up Alarms Using AWS CLI

Setting up our alarms is also very easy using the AWS CLI.  You’ll need an AWS account with permissions to create CloudWatch alarms and some EC2 permissions to list instances and change their state. A fairly minimal set of permissions can be found in the policy document below; it is enough for a user to invoke the CLI commands in this post.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics",
        "cloudwatch:DescribeAlarms"
      ],
      "Resource": [
        "*"
      ],
      "Effect": "Allow"
    },
    {
      "Action": [
        "ec2:DescribeInstanceStatus",
        "ec2:DescribeInstances",
        "ec2:StopInstances",
        "ec2:StartInstances"
      ],
      "Resource": [
        "*"
      ],
      "Effect": "Allow"
    }
  ]
}

And here are the specific AWS CLI commands. Do not forget to change the instance ID (i-xxx) and region in the commands. First, setting up recovery after a system status check failure.

$ aws cloudwatch put-metric-alarm \
    --alarm-name restart_PMM_on_system_failure_i-xxx \
    --alarm-actions arn:aws:automate:REGION:ec2:recover \
    --metric-name StatusCheckFailed_System --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistic Maximum --period 300 --evaluation-periods 2 \
    --datapoints-to-alarm 2 --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --tags Key=System,Value=PMM

Second, setting up restart after instance status check failure.

$ aws cloudwatch put-metric-alarm \
    --alarm-name restart_PMM_on_instance_failure_i-xxx \
    --alarm-actions arn:aws:automate:REGION:ec2:reboot \
    --metric-name StatusCheckFailed_Instance --namespace AWS/EC2 \
    --dimensions Name=InstanceId,Value=i-xxx \
    --statistic Maximum --period 300 --evaluation-periods 3 \
    --datapoints-to-alarm 3 --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --tags Key=System,Value=PMM
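You can confirm that both alarms were created, and check their current state, with:

$ aws cloudwatch describe-alarms --alarm-name-prefix restart_PMM_on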

Testing New Alarms

Unfortunately, it doesn’t seem to be possible to simulate a system status check failure. If there’s a way, let us know in the comments. Because of that, we’ll test our alarms using an instance failure instead. The simplest way to fail the Instance status check is to mess up networking on the instance, so let’s just bring down a network interface. On this particular instance, it’s ens5.

# ip link
...
2: ens5:...
...
# date
Sat Apr 17 20:06:26 UTC 2021
# ip link set ens5 down

The stuff of nightmares: a command that never returns. About 15 minutes later, we can check that the PMM instance is available again. We can see that the alarm triggered and executed the restart action at 20:21:37 UTC.

CloudWatch showing Alarm fired actions

And we can access the server now.

$ date
Sat Apr 17 20:22:59 UTC 2021

The alarm itself takes a little more time to return to OK state. Finally, let’s take a per-minute look at the instance check metric.

CloudWatch metric overview

Instance failure was detected at 20:10 and resolved at 20:24, according to the AWS data. In reality, the server rebooted and was accessible even earlier. PMM instances deployed from the Marketplace image have all the required services set to autostart, so PMM is fully available after a restart without an operator needing to take any action.

Summary

Even though Percona Monitoring and Management itself doesn’t offer high availability at this point, you can utilize built-in AWS features to make your PMM installation more resilient to failure. PMM in the Marketplace is set up in a way that should not have problems recovering from instance retirement events. After your PMM instance is recovered or rebooted, you’ll be able to access monitoring without any manual intervention.

Note: While this procedure is universal and can be applied to any EC2 instance, there are some caveats explained in the “What do I need to know when my Amazon EC2 instance is scheduled for retirement?” article on the AWS site. Note specifically the Warning section. 


Apr 28, 2021

Running Percona Monitoring and Management v2 on Windows with Docker


One way to run Percona Monitoring and Management (PMM) v2 on Windows is the Virtual Appliance installation method. It works well but requires you to run extra software for Virtual Appliances, such as Oracle VirtualBox, VMware Workstation Player, etc. Or, you can use Docker.

Docker is not shipped with Windows 10, though you can get it installed for free with Docker Desktop. Modern Docker on Windows can run using either the Hyper-V backend or the Windows Subsystem for Linux v2 (WSL2). Because the WSL2-based Docker installation is more involved, I’m sticking to what is now referred to as the “legacy” Hyper-V backend.

On my Windows 10 Pro (19042) system, I had to change some BIOS settings:

PMM on Windows Docker

And install the Hyper-V component for Docker Desktop to install correctly:

Percona Monitoring and Management Windows

With this, we’re good to go to install Percona Monitoring and Management on Windows.


The official instructions for installing PMM with Docker assume you’re running a Linux/Unix-based system, so they need to be slightly modified for Windows. First, there is no sudo on Windows, and second, Command Prompt does not interpret “\” the same way a shell does.

With those minor fixes though, basically, the same command to start PMM works:

docker pull percona/pmm-server:2
docker create --volume /srv --name pmm-data percona/pmm-server:2 /bin/true
docker run --detach --restart always --publish 80:80 --publish 443:443 --volumes-from pmm-data --name pmm-server percona/pmm-server:2
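If you prefer to keep the multi-line layout from the Linux instructions, note that Command Prompt uses ^ rather than \ as its line-continuation character, for example:

docker run --detach --restart always ^
    --publish 80:80 --publish 443:443 ^
    --volumes-from pmm-data --name pmm-server percona/pmm-server:2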

That’s it! You should have your Percona Monitoring and Management (PMM) v2 running!

Looking at the PMM Server instance through PMM itself, you may notice it does not get all of the host’s resources by default, as Docker on Linux would.


Note there are only 2 CPU cores, 2GB of memory, and some 60GB of disk space available.  This should be enough for a test, but if you want to put some real load on this instance you may want to adjust those settings. This is done through the “Resources” setting in Docker Desktop:


Adjust these to suit your load needs and you’re done!  Enjoy your Percona Monitoring and Management!

Apr 27, 2021

Percona Monitoring and Management Meetup on May 6th: Join Us for Live Tuning and Roadmap Talk

We’re leveling up – would you like to join us on Discord? There we chat about databases, open source, and even play together.

On May 6th at 11 am EST, the Percona Community is talking about Percona Monitoring and Management (PMM). Our experts are also there to live tune and optimize your database – come check it out!

Are there any features you would like to see in PMM? The meetup is a good way to get your voice heard! But you can already start chatting with all of us on Discord right now.

It’s just a few more nights of sleep until Percona Live ONLINE with many interesting events for you to check out too. Join the community and let’s shape the future of databases together!

Apr 16, 2021

Integrated Alerting Design in Percona Monitoring and Management


Percona Monitoring and Management 2.13 (PMM) introduced the Integrated Alerting feature as a technical preview. It adds a user-friendly way to set up and manage alerts for your databases. You can read more about using this feature in our announcement blog post and in our documentation, while in this article we will focus on design and implementation details.

Entities

There are four basic entities used for IA: Alert Rule Template, Alert Rule, Alert, and Notification Channel.

Everything starts with the alert rule template. You can see its YAML representation below:

---
templates:
 - name: pmm_mongodb_high_memory_usage
   version: 1
   summary: Memory used by MongoDB
   expr: |-
     sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
     / on (node_name) (node_memory_MemTotal_bytes)
     * 100
     > [[ .threshold ]]
   params:
     - name: threshold
       summary: A percentage from configured maximum
       unit: "%"
       type: float
       range: [0, 100]
       value: 80
   for: 5m
   severity: warning
   labels:
     custom_label: demo
   annotations:
     summary: MongoDB high memory usage ({{ $labels.service_name }})
     description: |-
       {{ $value }}% of memory (more than [[ .threshold ]]%) is used
       by {{ $labels.service_name }} on {{ $labels.node_name }}.

A template serves as the base for alert rules. It defines several fields, let’s look at them:

  • name: uniquely identifies template (required)
  • version: defines template format version (required)
  • summary: a template description (required)
  • expr: a MetricsQL query string with parameter placeholders. MetricsQL is backward compatible with PromQL and provides some additional features. (required)
  • params: contains parameter definitions required for the query. Each parameter has a name, type, and summary. It also may have a unit, available range, and default value.
  • for: specifies how long the expression must hold before the alert fires; the alert query should return a positive result for this entire period of time, at which point the alert is fired (required)
  • severity: specifies default alert severity level (required)
  • labels: are additional labels to be added to generated alerts (optional)
  • annotations: are additional annotations to be added to generated alerts. (optional)

A template is designed to be re-used as the basis for multiple alert rules, so from a single pmm_node_high_cpu_load template you can have alerts for production vs. non-production, warning vs. critical, etc.


Users can create alert rules from templates. An alert rule is what’s actually executed against metrics and what produces an alert. The rule can override default values specified in the template, add filters to apply the rule to only the required services/nodes/etc., and specify target notification channels, such as email, Slack, PagerDuty, or webhooks. If the rule doesn’t have any associated notification channels, its alerts will be available only via the PMM UI. It’s useful to note that after creation the rule keeps its relation to the template, and any change in the template will affect all related rules.

Here is an alert rule example:

---
groups:
 - name: PMM Integrated Alerting
   rules:
     - alert: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
       rule: test
       expr: |-
         sum by (node_name) (mongodb_ss_mem_resident * 1024 * 1024)
         / on (node_name) (node_memory_MemTotal_bytes)
         * 100
         > 40
       for: 5s
       labels:
         ia: "1"
         rule_id: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
         severity: error
         template_name: pmm_mongodb_high_memory_usage
         custom_label: demo
       annotations:
         description: |-
           {{ $value }}% of memory (more than 40%) is used
           by {{ $labels.service_name }} on {{ $labels.node_name }}.
         summary: MongoDB high memory usage ({{ $labels.service_name }})

It uses the Prometheus alert rule format.

How it Works

The Integrated Alerting feature is built on top of Prometheus Alertmanager, the VictoriaMetrics time series database (TSDB), and VMAlert.

The VictoriaMetrics TSDB is the main metrics storage in PMM, VMAlert is responsible for alert rule execution, and Prometheus Alertmanager is responsible for alert delivery. VMAlert runs queries against the VM TSDB, checks whether they are positive for the specified amount of time (example: MySQL is down for 5 minutes), and triggers alerts. All alerts are forwarded to the PMM internal Alertmanager, but they can also be duplicated to an external Alertmanager (this can be set up on the PMM Settings page).

There are four available templates sources:

  1. Built-in templates, shipped with the PMM distribution. They are embedded into the managed binary (a core component of PMM).
  2. Percona servers. This is not available yet, but it will be similar to the STT checks delivery mechanism (HTTPS + file signatures).
  3. Templates created by the user via PMM UI. We persist them in PMM’s database.
  4. Templates created by the user as files in the /srv/ia/templates directory.

During PMM startup, managed loads templates from all sources into memory.

Alert rules can be created via the PMM UI or simply by putting rule files in the /srv/prometheus/rules directory. Alert rules created via the UI are persisted in PMM’s internal PostgreSQL database. For each alert rule from the DB, the managed binary creates a YAML file in /etc/ia/rules/ and asks VMAlert to reload the configuration and reread the rule files. VMAlert executes the query from each loaded alert rule every minute; once the rule condition is met (the query is positive for the specified amount of time), VMAlert produces an alert and passes it to Alertmanager. Please note that /etc/ia/rules/ is controlled by managed and any manual changes in that directory will be lost.

Managed generates configuration for Alertmanager and updates it once any related entity changes.

Managed goes through the list of existing rules and collects the unique notification channel combinations. For example, if we have two rules and each of them has channels a, b, and c assigned, that is one unique channel combination. For each rule, managed generates a route, and for each unique channel combination it generates a receiver in the Alertmanager configuration file. Each route has a target receiver and a filter by rule ID, and it can also contain user-defined filters. If a rule has no assigned notification channels, a special empty receiver is used. Users can redefine the empty receiver with Alertmanager’s base configuration file /srv/alertmanager/alertmanager.base.yml. When a Notification Channel is disabled, managed recollects the unique channel combinations excluding the disabled channels and regenerates the receivers and routing rules. If the rule has only one channel specified and it was disabled, a special disabled receiver is used instead. Unlike the empty receiver, the disabled receiver can’t be redefined by the user and always means “do nothing”; this prevents unexpected behavior after channels are disabled. After each Alertmanager configuration update, managed asks Alertmanager to reload it.
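Conceptually, the generated routing looks like standard Alertmanager configuration. The simplified sketch below is illustrative only (the receiver name and notification targets are made up); it is not the exact file managed writes:

route:
  receiver: empty            # default receiver when no route matches
  routes:
    - match:
        rule_id: /rule_id/c8e5c559-ffba-43ed-847b-921f69c031a9
      receiver: slack-and-email
receivers:
  - name: empty
  - name: slack-and-email
    slack_configs:
      - channel: "#alerts"
    email_configs:
      - to: dba-team@example.com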

When Alertmanager receives an alert from VMAlert, it uses the routes to find the appropriate receiver and forwards the alert to the destination channels. The user can also observe alerts via the PMM UI; in that case, managed gets all available alerts from the Alertmanager API and applies the required filters before showing them.

Conclusion

The Integrated Alerting feature has many moving parts, and functionally it’s mostly about managing configuration for different components and making them work together. It provides a really nice way to be aware of important events in your system. While it’s still in tech preview state, it’s already helpful. With the built-in templates, it’s easy to try without diving into documentation about Prometheus queries and other details. So please try it and tell us about your experience. What parameters of a system would you like to have covered with templates? What use cases do you have for alerting? We will be happy to receive any feedback.

Jan 26, 2021

Achieving a More Secure MongoDB with Percona Monitoring and Management’s Security Tool


Percona Monitoring and Management (PMM) is a great way to monitor your MongoDB deployment for things like memory, CPU, internal database metrics like WiredTiger cache utilization, read/write ticket utilization, Query Analytics, and many, many more.  Did you know that in addition to all that, it also has a security threat tool?  In this blog, we’ll go over the Security Threat Tool and its checks for MongoDB, and discuss how they can bring you value by helping you prevent MongoDB data leaks.

What is the Security Threat Tool?

PMM security threat tool

The Security Threat Tool is a widget in your main PMM dashboard that runs regular checks against connected databases, alerting you if any servers pose a potential security threat.  By default, these checks run every 24 hours, so they are not real-time.  You can run them ad hoc by going to PMM/PMM Database Checks in your PMM Grafana dashboards and clicking the “Run DB Checks” button.  When you first enable the Security Threat Tool, it can take up to 24 hours for the results to populate, assuming you do not use the ad hoc button.  In addition to MongoDB, the Security Threat Tool, like PMM itself, can also check MySQL, PostgreSQL, and MariaDB for various security threats.  Additionally, if you find a check is too noisy, you can silence it in PMM so that it no longer shows as a failed check.

What Does the Security Threat Tool Check for MongoDB?

At present, the Security Threat Tool checks MongoDB for two security-related items: mongodb_auth and mongodb_version.  First, mongodb_auth checks that MongoDB authentication is enabled on your deployment.  Second, mongodb_version checks that you’re running the latest release of the MongoDB version you’re on.

mongodb_auth

Checking MongoDB to ensure that authentication and authorization are enabled is of paramount importance to the security of your database.   If authentication and authorization are disabled for your MongoDB deployment, then anyone who can reach the port MongoDB is running on has access to your data.  Disabled authentication and authorization, combined with binding to all of the interfaces on your machine, have been the cause of many MongoDB data leaks over the years.  Having this check can help keep your MongoDB deployment from leaking data.

This check can also be helpful if you’re performing maintenance and need to temporarily disable authentication on a secondary or hidden node and forget to re-enable authentication and authorization when you’re done.
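For reference, enabling authorization (and limiting the interfaces MongoDB binds to) is done in mongod.conf; a minimal sketch, with an example internal address, looks like this:

# mongod.conf
security:
  authorization: enabled        # require authentication/authorization
net:
  bindIp: 127.0.0.1,10.0.0.5    # bind only to the interfaces you actually need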

mongodb_auth

mongodb_version

Many organizations require more frequent patching or upgrading of software, especially databases, which often house the most critical data for your applications, in order to stay on top of changing security requirements.  With MongoDB’s replica set and sharded cluster architecture, it’s fairly easy to upgrade in a rolling manner and keep up to date on your MongoDB versions (after appropriate testing of the new versions!).  This check makes sure you’re running the latest release of the minor version your MongoDB database is on.  For example, a MongoDB 3.6 deployment won’t alert you that there’s a newer MongoDB 4.4, but it will tell you when a new minor release of MongoDB 3.6 is available to upgrade to. Staying as close as possible to the latest minor release assures that you get as many bug fixes as are available and, just as importantly, any security fixes for that version of MongoDB.  You want to minimize your security exposure by being up to date and having as many security vulnerabilities patched as possible.

mongodb_version

Summary

We hope that this blog post has helped you to see how enabling the Security Threat Tool in Percona Monitoring and Management can help you keep your MongoDB deployment secure.  Thanks for reading!

 

Percona Monitoring and Management is free to download and use. Try it today!

Dec 23, 2020

Observations on Better Resource Usage with Percona Monitoring and Management v2.12.0


Percona Monitoring and Management (PMM) v2.12.0 comes with a lot of improvements, and one of the most talked-about is the use of VictoriaMetrics. The reason we are doing this comparison is that PMM 2.12.0 is the release in which we integrated VictoriaMetrics and replaced Prometheus as the default method of data ingestion.

This change was also driven by the motivation to improve PMM Server performance, and here we will give an overview of why users should definitely consider the 2.12.0 version if they have been looking for a less resource-intensive PMM. This post will try to address some of those concerns.

Benchmark Setup Details

The benchmark was performed using a virtualized system with PMM Server running on an EC2 instance with 8 cores, 32 GB of memory, and SSD storage. The duration of the observation was 24 hours, and for clients we set up 25 virtualized client instances on Linode, each emulating 10 nodes running MySQL with real workloads using the Sysbench TPC-C test.

Percona Monitoring and Management benchmark

Both PMM 2.11.1 and PMM 2.12.0 were set up in the same way, with client instances running the exact same load, and to monitor the difference in performance, we used the default metrics mode for 2.12.0 for this observation.
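The exact load commands were not published, but for context, a sysbench TPC-C run of the kind described here looks roughly like the following hypothetical sketch (it assumes the percona-lab/sysbench-tpcc scripts; the host, credentials, and sizing parameters are made up):

./tpcc.lua --mysql-host=10.0.0.10 --mysql-user=sbtest --mysql-password=secret \
    --mysql-db=tpcc --tables=10 --scale=10 --threads=8 prepare
./tpcc.lua --mysql-host=10.0.0.10 --mysql-user=sbtest --mysql-password=secret \
    --mysql-db=tpcc --tables=10 --scale=10 --threads=8 --time=86400 run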

The sample ingestion rate for the load was around 96.1k samples/sec, with around 8.5 billion samples received in 24 hours.

A more detailed benchmark of Prometheus vs. VictoriaMetrics was done by the VictoriaMetrics team, and it clearly shows how efficient VictoriaMetrics is and how much better performance can be achieved with it.

Disk Space Usage

VictoriaMetrics is really efficient when it comes to disk usage on the host system. We found that 2.11.1 generates a lot of disk usage spikes, with maximum storage touching around 23.11 GB, while for PMM 2.12.0 the disk usage spikes are not as high and the maximum disk usage is around 8.44 GB.

It is clear that PMM 2.12.0 needs 2.7 times less disk space for monitoring the same number of services for the same duration as compared to PMM 2.11.1.

Disk Usage PMM 2.11.1

Disk Usage 1: PMM 2.11.1 

Disk Usage PMM 2.12.0

Disk Usage 2: PMM 2.12.0


Memory Utilization

Another parameter where PMM 2.12.0 performs better is memory utilization. During our testing, we found that PMM 2.11.1 was using twice as much memory for monitoring the same number of services. This is a significant improvement in terms of performance.

The memory usage clearly shows several spikes for PMM 2.11.1, which is not the case with PMM 2.12.0.

Memory Utilization PMM 2.11.1

Memory Utilization: PMM 2.11.1

Free Memory PMM 2.11.1
Free Memory PMM 2.11.1 


Memory Utilization: PMM 2.12.0

The memory utilization for 2.12.0 clearly shows more than 55% of memory available across the 24 hours of our observation, which is a significant improvement over 2.11.1.

Free Memory 2.12.0

Free Memory PMM 2.12.0

CPU Usage

During the observation we noticed a slight increase in CPU usage for PMM 2.12.0: the average CPU usage was about 2.6% higher than PMM 2.11.1, but the maximum CPU usage for both versions did not show any significant difference.

CPU Usage 2.11.1

CPU Usage: PMM 2.11.1

CPU Usage 2.12.0

CPU Usage: PMM 2.12.0

 

Observations

The overall performance improvements are around memory utilization and disk usage, and we also observed significantly lower disk I/O bandwidth, with far fewer spikes in write operations for PMM 2.12.0. This behavior is observed and articulated in the VictoriaMetrics benchmarking. CPU usage and memory are two important resource factors when planning a PMM Server setup, and with PMM 2.12.0 we can safely say it will cost about half as much in terms of memory and disk resources compared to previously released PMM versions. This should also encourage current users to add more instances for monitoring without having to think about the cost of extra infrastructure.

Dec 09, 2020

Enabling HTTPS Connections to Percona Monitoring and Management Using Custom Certificates


Whichever way you installed Percona Monitoring and Management 2 (PMM2), using the Docker image or an OVF image for your supported virtualized environment, PMM2 enables by default two ports for web connections: 80 for HTTP and 443 for HTTPS. When using HTTPS, certificates are required to encrypt the connection for better security.

All the installation images contain self-signed certificates already configured, so every PMM2 deployment should work properly when using HTTPS.

This is cool, but sometimes self-signed certificates are not permitted by the security policy adopted by your company. If your company uses a Certification Authority to sign certificates and keys for encryption, most probably you are required to use the files provided by the CA for all your services, even for PMM2 monitoring.

In this article, we’ll show how to use your custom certificates to enable HTTPS connections to PMM2, according to your security policy.

PMM2 Deployed as a Docker Image

If PMM Server is running as a Docker image, use docker cp to copy certificates. This example copies certificate files from the current working directory to a running PMM Server docker container.

docker cp certificate.crt pmm-server:/srv/nginx/certificate.crt
docker cp certificate.key pmm-server:/srv/nginx/certificate.key
docker cp ca-certs.pem pmm-server:/srv/nginx/ca-certs.pem
docker cp dhparam.pem pmm-server:/srv/nginx/dhparam.pem

If you’re going to deploy the container from scratch, you can use the following to mount your own certificates instead of the built-in ones. Let’s suppose your certificates are in /etc/pmm-certs:

docker run -d -p 443:443 --volumes-from pmm-data \
  --name pmm-server -v /etc/pmm-certs:/srv/nginx \
  --restart always percona/pmm-server:2

  • The certificates must be owned by root.
  • The mounted certificate directory must contain the files certificate.crt, certificate.key, ca-certs.pem and dhparam.pem.
  • For SSL encryption, the container must publish on port 443 instead of 80.

PMM2 Deployed Using a Virtual Appliance Image

In such cases, you need to connect to the virtual machine and replace the certificate files in /srv/nginx:

  • connect to the virtual machine
    $> ssh root@pmm2.mydomain.com
  • place CA, certificate, and key files into the /srv/nginx directory. The file must be named certificate.crt, certificate.key, ca-certs.pem and dhparam.pem
  • if you would like to use different file names you can modify the nginx configuration file /etc/nginx/conf.d/pmm.conf. The following variables must be set:
    ssl_certificate /srv/nginx/my_custom_certificate.crt;
    ssl_certificate_key /srv/nginx/my_custom_certificate.key;
    ssl_trusted_certificate /srv/nginx/my_custom_ca_certs.pem;
    ssl_dhparam /srv/nginx/my_dhparam.pem;
  • restart nginx
    [root@pmm2]> supervisorctl restart nginx
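After restarting nginx, you can quickly confirm which certificate PMM2 is now serving from any client machine (substitute your own hostname):

$ echo | openssl s_client -connect pmm2.mydomain.com:443 -servername pmm2.mydomain.com 2>/dev/null | openssl x509 -noout -subject -issuer -dates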

Conclusion

Percona Monitoring and Management is widely used for monitoring MySQL, ProxySQL, MongoDB, PostgreSQL, and operating systems. Setting up custom certificates for connection encryption, according to the security policy adopted by your company, is quite simple. You can rely on PMM2 for troubleshooting your environments in a secure way.

Take a look at the demo site: https://pmmdemo.percona.com
