Jun 04, 2024

Downsampling Metrics in Percona Monitoring and Management: Saving Space and Improving Performance

Downsampling is the process by which we can selectively prune (discard, summarize, or recalculate) data from a series of samples in order to decrease how much storage is consumed. This has the downside of reducing the accuracy of the data, but has the great benefit of allowing us to store data from a wider sampling […]

Jun 03, 2024

Keeping an Eye on the Eye: Self-Monitoring for Percona Monitoring and Management

Percona Monitoring and Management (PMM) is a powerful tool for keeping your databases healthy, but what about PMM itself? While it comes pre-packaged as an appliance, PMM’s internal workings can be complex. Many users, including our internal teams, frequently ask: “How can I tell if everything inside PMM is functioning properly?” To address this need, we’ve […]

Feb 01, 2024

Simplify the Use of ENV Variables in Percona Monitoring and Management AMI

The Percona Monitoring and Management (PMM) Amazon Machine Image (AMI) currently lacks native support for ENV variables. In this guide, we’ll walk through a straightforward workaround that simplifies the process of using ENV variables in PMM AMI and reapplying them after an upgrade. Step one: Adding ENV variables to /srv/.env. Begin by consolidating your ENV variables in […]

Jan 03, 2024

PMM Dump GUI in Percona Monitoring and Management 2.41.0

PMM Dump GUI

A couple of weeks ago, we announced the first GA release of PMM Dump: a new support tool that dumps Percona Monitoring and Management (PMM) metrics and Query Analytics (QAN) data so it can be transferred to an expert engineer for review and performance suggestions. That blog introduced the command-line interface. A week later, PMM 2.41.0 was released with a GUI for PMM Dump.

If you are a database administrator or developer, you may encounter some issues that require external assistance. Whether you seek help from Percona Support or the Community, you must provide sufficient information about your database performance and configuration.

Having all the data at hand is crucial for finding the root cause of the issue and providing the best solution. Without the data, Percona experts may ask you multiple questions and request additional information, which can delay the resolution process and increase your frustration. However, gathering such information can be challenging and time-consuming. Providing direct access to PMM, even through a VPN, could be impossible.

That’s why we have introduced a new feature for PMM that allows you to collect the necessary data about your database performance with just one click. This feature will save you time and effort and enable Percona experts to diagnose and resolve your problem faster.

By using PMM Dump in PMM, you can avoid back-and-forth communication and get your problem solved as quickly as possible.

You can use this feature when you report a Support case as a Percona customer or when you report a bug in our Jira as a Community user. This blog post will show you how to use this feature and what kind of data it collects.

PMM Dump is included in PMM server distribution, and you can try it straight away.

The PMM Dump menu is located in the bottom left corner, under the “Help” group:

PMM Dump

After you click “PMM Dump,” a new dashboard will be opened with a “Create dataset” button.

PMM Dump Dashboard

Click on the “Create dataset” button to create your first dump.

You can choose the service names you want to export in the opened window or leave the default (“All Services”). By default, PMM Dump exports data collected in the last 12 hours (the PMM default). You can change this range by adjusting “Start Time” and “End Time.” To export QAN data, select “Export QAN.”

The “Ignore load” checkbox is there in case PMM Dump cannot finish due to the load protections set in the code. If you want to keep the protections but increase their limits, use the command-line tool with the custom options maxload and criticalload, as described here.

The same applies if you need advanced filtering or other custom options that PMM Dump provides. I hope that in future versions of PMM, we will have full support for PMM Dump options.

After you click the “Create dataset” button, a dump will be created and available on the PMM Dump dashboard.

Once the dump is complete, the status changes from “Pending” to “Success.” Here you can see details about your dump:

This will be handy after you create a few of them.

If you click on the dots, you will see the options:

“Download” allows you to download the exported data locally. “View logs” will open a modal window with the PMM Dump log file. “Delete” will remove the dump file.

If you are a Percona Support customer, you can safely upload the dump to the Percona SFTP server by clicking “Send to Support.”

In this case, you need to open a Support case and then create individual credentials for the Percona SFTP server.  Enter them into the “Name” and “Password” fields. Put your ticket number into the Directory field. You will find more details in the Knowledge Base inside the Support portal.

We require individual credentials for each ticket for extra protection of customer’s data. Once the issue is resolved and the corresponding ticket closed, we remove all data provided by customers. You can read more about Percona security and data privacy practices at https://www.percona.com/legal/percona-security-and-data-privacy-practices.

After you create multiple dumps, you may want to perform bulk actions on them.

Choose a few dumps or click on the top checkbox to select all of them, then choose any of the available operations: “Send to Support,” “Download,” or “Delete.”

Dump files in PMM are stored in the pmmdata Docker volume, so keep an eye on how much space they take up and delete old dump files when they are no longer needed.
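For a quick check, since the pmmdata volume is mounted at /srv inside the PMM Server container, something like the following can show how full it is. This is a minimal sketch: the dump directory path below is an assumption, so adjust it to wherever your dump files actually land.

# Free space on the pmmdata volume (mounted at /srv inside pmm-server)
docker exec -t pmm-server df -h /srv

# Hypothetical dump location; adjust the path to where your dump files are stored
docker exec -t pmm-server sh -c 'du -sh /srv/dump 2>/dev/null || echo "adjust the dump path"'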

You will find more information in the PMM Dump topic of the PMM documentation.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Dec 07, 2023

How to Filter or Customize Alert Notifications in Percona Monitoring and Management (Subject and Body)

Customize Alert Notifications in Percona Monitoring and Management

In many scenarios, the standard alert notification template in Percona Monitoring and Management (PMM), while comprehensive, may not align perfectly with specific operational needs. This often leads to an excess of details in the notification’s “Subject” and “Body”, cluttering your inbox with information that may not be immediately relevant.

The focus today is on tailoring these notifications to fit your unique requirements. We’ll guide you through the process of editing the “Subject” and “Body” in the PMM UI, ensuring that the alerts you receive are filtered and relevant to your specific business context.

Please note: This post assumes a foundational understanding of basic alerting and configuration in PMM. For those new to these concepts, we recommend consulting the documentation on “SMTP” and “PMM Integrated/Grafana alert” for a primer.

Customizing the “Subject” section of alert notification

 1) The default “Subject” will look something like below.

“Subject” section of a PMM alert notification

 2) Now, let’s proceed to edit the “subject” content.

I) First, we need to create a new message template called “email.subject” in Alerting -> Contact points with the following content.

Template_name: email.subject

{{ define "email.subject" }}
{{ range .Alerts }} Percona Alert | {{ .Labels.alertname }} | {{ .Labels.node_name }} | {{ .Labels.DB }} {{ end }}
{{ end }}

Customizing the "Subject" section of alert notification in PMM

Here, we simply use range to iterate over the alerts and pull the alert name, node name, and database from their labels.

The provided template is written in Go’s templating language. For a more detailed understanding of the syntax and usage of templates, please refer to the official manual.

Reference:- https://grafana.com/docs/grafana/latest/alerting/manage-notifications/template-notifications/create-notification-templates/#template-the-subject-of-an-email

II) Then we need to edit the default contact point name inside  “Alerting->Contact points”

PMM Alerting->Contact points

And define the below “Subject” under “Optional Email Settings”.

{{ template "email.subject" . }}

III) After successfully testing, we can save the changes.

That’s it. Now, if the alert triggers, we will observe a customized subject in the email. 

Example:

Customizing the “Body” section of alert notification

1) Let’s first see how the notifications appear with the native alerting. This is a basic notification alert that triggers when the database/MySQL is down. As we can see, it includes additional information, such as various labels and a summary.

Customizing the “Body” section of alert notification

2) Now, suppose we want to get rid of some content and keep only a few relevant details. This can be achieved by following the steps outlined below.

I) Go to Alerting -> Contact points and add new “Message templates”.

II) Next, create a notification template named “email” with two templates in the content: “email.message_alert” and “email.message”.

The “email.message_alert” template is used to display the labels and values for each firing and resolved alert, while the “email.message” template contains the email’s structure.

Template name:  email.message

### These are the key-value pairs that we want to display in our alerts.###

{{- define "email.message_alert" -}}
AlertName = {{ index .Labels "alertname" }}{{ "\n" }}
Database = {{ index .Labels "DB" }}{{ "\n" }}
Node_name = {{ index .Labels "node_name" }}{{ "\n" }}
Service_name = {{ index .Labels "service_name" }}{{ "\n" }}
Service Type = MySQL {{ "\n" }}
Severity = {{ index .Labels "severity" }}{{ "\n" }}
TemplateName = {{ index .Labels "template_name" }}{{ "\n" }}

{{- end -}}

### Next, we have defined the main section that governs the alerting and firing rules. ###

{{ define "email.message" }}
There are {{ len .Alerts.Firing }} firing alert(s), and {{ len .Alerts.Resolved }} resolved alert(s){{ "\n" }}
 
### Finally, the firing and resolved alerts are rendered based on the generated alerts or resolutions. ###

{{ if .Alerts.Firing -}}
Firing alerts:{{ "\n" }}
{{- range .Alerts.Firing }}
- {{ template "email.message_alert" . }}
{{- end }}
{{- end }}

{{ if .Alerts.Resolved -}}
Resolved alerts:{{ "\n" }}
{{- range .Alerts.Resolved }}
- {{ template "email.message_alert" . }}
{{- end }}
{{- end }}

{{ end }}

The above template is written in Go’s templating language. To learn more about the syntax and template usage, you can refer to the manual.

Reference:-https://grafana.com/docs/grafana/latest/alerting/manage-notifications/template-notifications/using-go-templating-language/

III) Lastly, simply save the template

3) Next, we will edit the default “Contact points” and define the below content under “Update contact point -> Optional Email settings->Message” for email. Similarly, you can add other channels as well, like Telegram, Slack, etc.

Execute the template from the “message” field in your contact point integration.

{{ template "email.message" . }}

Percona Alerting comes with a pre-configured default notification policy. This policy utilizes the grafana-default-email contact point and is automatically applied to all alerts that do not have a custom notification policy assigned to them.

Reference:- https://docs.percona.com/percona-monitoring-and-management/use/alerting.html#notification-policies

After verifying a successful test message, we can save the updated contact point.

4) Finally, once the alert is triggered, you will be able to see the customized notification reflecting only the defined key/values.

Moreover, we can also use label loops instead of defining the separate key/value pairs as we did in the above steps. This way, all the default labels are included in the iteration without explicitly defining each of them.

Here, we use a range to iterate over the alerts such that dot refers to the current alert in the list of alerts, and then use a range on the sorted labels so dot is updated to refer to the current label. Inside the range, use “.Name” and “.Value” to print the name and value of each label.

### applying label loop option ###

{{- define "email.message_alert" -}}
Label Loop:
{{ range .Labels.SortedPairs }}
{{ .Name }} => {{ .Value }}
{{ end }}

{{- end -}}


{{ define "email.message" }}
There are {{ len .Alerts.Firing }} firing alert(s), and {{ len .Alerts.Resolved }} resolved alert(s){{ "\n" }}


{{ if .Alerts.Firing -}}
Firing alerts:{{ "\n" }}
{{- range .Alerts.Firing }}
- {{ template "email.message_alert" . }}
{{- end }}
{{- end }}


{{ if .Alerts.Resolved -}}
Resolved alerts:{{ "\n" }}
{{- range .Alerts.Resolved }}
- {{ template "email.message_alert" . }}
{{- end }}
{{- end }}

{{ end }}

To add more options, such as summary and description, to the customized alerts, the below template changes can be performed.

I) First, you can add/update the “Summary and annotations” section inside the alert rule based on your preference.

II) Then, edit the below message template (“email.message”) in Alerting -> Contact points with the updated changes.

Template name:  email.message

 

{{- define "email.message_alert" -}}
    AlertName    = {{ index .Labels "alertname" }}{{ "\n" }}
    Database     = {{ index .Labels "DB" }}{{ "\n" }}
    Node_name    = {{ index .Labels "node_name" }}{{ "\n" }}
    Service_name = {{ index .Labels "service_name" }}{{ "\n" }}
    Service Type = {{ index .Labels "service_type" }}{{ "\n" }}
    Severity     = {{ index .Labels "severity" }}{{ "\n" }}
    TemplateName = {{ index .Labels "template_name" }}{{ "\n" }}
    {{- end -}}

    {{ define "email.message" }}
    There are {{ len .Alerts.Firing }} firing alert(s), and {{ len .Alerts.Resolved }} resolved alert(s){{ "\n" }}

    {{ if .Alerts.Firing -}}
    Firing alerts:{{ "\n" }}
    {{- range .Alerts.Firing }}
    - {{ template "email.message_alert" . }}
    - {{ template "alerts.summarize" . }}
    {{- end }}
    {{- end }}

    {{ if .Alerts.Resolved -}}
    Resolved alerts:{{ "\n" }}
    {{- range .Alerts.Resolved }}
    - {{ template "email.message_alert" . }}
    - {{ template "alerts.summarize" . }}
    {{- end }}
    {{- end }}

    {{ end }}

    {{ define "alerts.summarize" -}}
    {{ range .Annotations.SortedPairs}}
    {{ .Name }} = {{ .Value }}
    {{ end }}
    {{ end }}

Reference:- https://grafana.com/blog/2023/04/05/grafana-alerting-a-beginners-guide-to-templating-alert-notifications/

Sometimes, the alert notifications might appear on a single line instead of separate lines for all the keys. Although this is not regular behavior, it can be fixed with the below changes.

I) Access the PMM Server:

sudo docker exec -it pmm-server bash

II) Thereafter, you can edit the file “/usr/share/grafana/public/emails/ng_alert_notification.html” and replace the text between lines 288 and 290 as shown below.

Replace:

{{ if gt (len .Message) 0 }}
      <div style="white-space: pre-line;" align="left">{{ .Message }}
    {{ else }}

With:

{{ if gt (len .Message) 0 }} 
   <span style="white-space: pre-line;">{{ .Message }}</span> 
 {{ else }}

Note: Please ensure you take a backup before making any changes to PMM Server files. Moreover, these changes could be lost during a PMM upgrade, especially when Grafana is upgraded as part of PMM, so a backup of the edited version will also be needed for later restoration.
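For example, one way to keep copies of both the stock and edited versions on the persistent /srv volume could look like this (a minimal sketch; the destination file names are arbitrary):

docker exec pmm-server cp /usr/share/grafana/public/emails/ng_alert_notification.html /srv/ng_alert_notification.html.orig

# ...and after editing the file, keep the modified copy for restoration after upgrades
docker exec pmm-server cp /usr/share/grafana/public/emails/ng_alert_notification.html /srv/ng_alert_notification.html.edited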

III) Finally, you can restart the Grafana service.

supervisorctl restart grafana

Summary

Filtering alert notifications is useful for hiding extraneous information from the relevant users. Only the specified elements are displayed in the notification email, thereby preventing unnecessary clutter in the alert content.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Aug 08, 2023

MySQL Capacity Planning

MySQL capacity planning

As businesses grow and develop, the requirements that they have for their data platform grow along with it. As such, one of the more common questions I get from my clients is whether or not their system will be able to endure an anticipated load increase. Or worse yet, sometimes I get questions about regaining normal operations after a traffic increase caused performance destabilization.

As the subject of this blog post suggests, this all comes down to proper capacity planning. Unfortunately, this topic is more of an art than a science, given that there is really no foolproof algorithm or approach that can tell you exactly where you might hit a bottleneck with server performance. But we can discuss common bottlenecks, how to assess them, and have a better understanding as to why proactive monitoring is so important when it comes to responding to traffic growth.

Hardware considerations

The first thing we have to consider here is the resources that the underlying host provides to the database. Let’s take a look at each common resource. In each case, I’ll explain why a 2x increase in traffic doesn’t necessarily mean you’ll have a 2x increase in resource consumption.

Memory

Memory is one of the easier resources to predict and forecast and one of the few places where an algorithm might help you, but for this, we need to know a bit about how MySQL uses memory.

MySQL has two main memory consumers: global caches, like the InnoDB buffer pool and MyISAM key cache, and session-level caches, like the sort buffer, join buffer, random read buffer, etc.

Global memory caches are static in size as they are defined solely by the configuration of the database itself. What this means is that if you have a buffer pool set to 64Gb, having an increase in traffic isn’t going to make this any bigger or smaller. What changes is how session-level caches are allocated, which may result in larger memory consumption.

A tool that was popular at one time for calculating memory consumption was mysqlcalculator.com. Using this tool, you could enter in your values for your global and session variables and the number of max connections, and it would return the amount of memory that MySQL would consume. In practice, this calculation doesn’t really work, and that’s due to the fact that caches like the sort buffer and join buffer aren’t allocated when a new connection is made; they are only allocated when a query is run and only if MySQL determines that one or more of the session caches will be needed for that query. So idle connections don’t use much memory at all, and active connections may not use much more if they don’t require any of the session-level caches to complete their query.

The way I get around this is to estimate the amount of memory consumed on average by sessions as such…

({Total memory consumed by MySQL} – {sum of all global caches}) / {average number of active sessions}

Keep in mind that even this isn’t going to be super accurate, but at least it gives you an idea of what common session-level memory usage looks like. If you can figure out what the average memory consumption is per active session, then you can forecast what 2x the number of active sessions will consume.
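As a rough illustration, here is a minimal shell sketch of that estimate; all of the numbers are hypothetical placeholders, so plug in the values you actually observe on your host and in your MySQL configuration:

# All values in GB; replace with your own observations
TOTAL_MYSQL_MEM=48      # resident memory of the mysqld process
GLOBAL_CACHES=40        # e.g., innodb_buffer_pool_size + key_buffer_size + ...
ACTIVE_SESSIONS=100     # average number of active (not idle) sessions

PER_SESSION=$(echo "($TOTAL_MYSQL_MEM - $GLOBAL_CACHES) / $ACTIVE_SESSIONS" | bc -l)
echo "Estimated memory per active session: $PER_SESSION GB"
echo "Forecast with 2x active sessions: $(echo "$GLOBAL_CACHES + 2 * $ACTIVE_SESSIONS * $PER_SESSION" | bc -l) GB"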

This sounds simple enough, but in reality, there could be more to consider. Does your traffic increase come with updated code changes that change the queries? Do these queries use more caches? Will your increase in traffic mean more data, and if so, will you need to grow your global cache to ensure more data fits into it?

With the points above under consideration, we know that we can generally predict what MySQL will do with memory under a traffic increase, but there may be changes that could be unforeseen that could change the amount of memory that sessions use.

The solution is proactive monitoring using time-lapse metrics monitoring like what you would get with Percona Monitoring and Management (PMM). Keep an eye on your active session graph and your memory consumption graph and see how they relate to one another. Checking this frequently can help you get a better understanding of how session memory allocation changes over time and will give you a better understanding of what you might need as traffic increases.

CPU

When it comes to CPU, there’s obviously a large number of factors that contribute to usage. The most common is the queries that you run against MySQL itself. However, having a 2x increase in traffic may not lead to a 2x increase in CPU as, like memory, it really depends on the queries that are run against the database. In fact, the most common cause of massive CPU increase that I’ve seen isn’t traffic increase; it’s code changes that introduced inefficient revisions to existing queries or new queries. As such, a 0% increase in traffic can result in full CPU saturation.

This is where proactive monitoring comes into play again. Keep an eye on CPU graphs as traffic increases. In addition, you can collect full query profiles on a regular basis and run them through tools like pt-query-digest or look at the Query Analyzer (QAN) in PMM to keep track of query performance, noting where queries may be less performant than they once were, or when new queries have unexpected high load.
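If you collect those profiles from the slow query log, a typical pt-query-digest run might look like the following; the slow log path is an assumption, so use whatever slow_query_log_file points to on your system:

# Summarize the slow query log into a dated report you can compare against future runs
pt-query-digest /var/lib/mysql/slow.log > /tmp/query_digest_$(date +%F).txt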

Disk space

A 2x increase in traffic doesn’t mean a 2x increase in disk space consumption. It may increase the rate at which disk space is accumulated, but that also depends on how much of the traffic increase is write-focused. If you have a 4x increase in reads and a 1.05X increase in writes, then you don’t need to be overly concerned about disk space consumption rates.

Once again, we look at proactive monitoring to help us. Using time-lapse metrics monitoring, we can monitor overall disk consumption and the rate at which consumption occurs and then predict how much time we have left before we run out of space.
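As a back-of-the-envelope example, two usage samples taken a week apart are enough to estimate your runway (the numbers below are made up; read the real values from your metrics):

USED_LAST_WEEK=400   # GB used 7 days ago
USED_NOW=420         # GB used today
DISK_SIZE=1000       # GB total

GROWTH_PER_DAY=$(echo "($USED_NOW - $USED_LAST_WEEK) / 7" | bc -l)
echo "Approximate days of runway left: $(echo "($DISK_SIZE - $USED_NOW) / $GROWTH_PER_DAY" | bc -l)"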

Disk IOPS

The amount of disk IOPS your system uses will be somewhat related to how much of your data can fit into memory. Keep in mind that the disk will still need to be used for background operations as well, including writing to the InnoDB redo log, persisting/checkpointing data changes to table spaces from the redo log, etc. But, for example, if you have a large traffic increase that’s read-dependent and all of the data being read in the buffer pool, you may not see much of an IOPS increase at all.

Guess what we should do in this case? If you said “proactive monitoring,” you get a gold star. Keep an eye out for metrics related to IOPS and disk utilization as traffic increases.
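Outside of PMM's dashboards, a quick way to eyeball IOPS and disk utilization on the host itself is iostat from the sysstat package:

# Extended per-device statistics (reads/s, writes/s, utilization), refreshed every 5 seconds
iostat -dxm 5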

Before we move on to the next section, consider how differently things fail for disk space versus disk IOPS. When you saturate disk IOPS, your system is going to run slow. If you fill up your disk, your database will start throwing errors and may stop working completely. It’s important to understand the difference so you know how to act based on the situation at hand.

Database engine considerations

While resource utilization/saturation are very common bottlenecks for database performance, there are limitations within the engine itself. Row-locking contention is a good example, and you can keep an eye on row-lock wait time metrics in tools like PMM. But, much like any other software that allows for concurrent session usage, there are mutexes/semaphores in the code that are used to limit the number of sessions that can access shared resources. Information about this can be found in the semaphores section in the output of the “SHOW ENGINE INNODB STATUS” command.
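A quick way to pull just that section from the shell (the line count is arbitrary, since the section length varies):

# Print the SEMAPHORES section of the InnoDB status output
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 15 "SEMAPHORES"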

Unfortunately, this is the single hardest bottleneck to predict and is based solely on the use case. I’ve seen systems running 25,000+ queries per second with no issue, and I’ve also seen systems running ~5,000 queries per second that ran into issues with mutex contention.

Keeping an eye on metrics for OS context switching will help with this a little bit, but unfortunately this is a situation where you normally don’t know where the wall is until you run right into it. Adjusting variables like innodb_thread_concurrency can help with this in a pinch, but when you get to this point, you really need to look at query efficiency and horizontal scaling strategies.

Another thing to consider is configurable hard limits like max_connections, where you can limit the upper bound of the number of connections that can connect to MySQL at any given time. Keep in mind that increasing this value can impact memory consumption as more connections will use more memory, so use caution when adjusting upward.
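A quick way to compare the configured limit against the high-water mark your server has actually reached, using standard MySQL variables:

mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Max_used_connections';"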

Conclusion

Capacity planning is not something you do once a year or more as part of a general exercise. It’s not something you do when management calls you to let you know a big sale is coming up that will increase the load on the hosts. It’s part of a regular day-to-day activity for anyone that’s operating in a database administrator role.

Proactive monitoring plays a big part in capacity planning. I’m not talking about alert-based monitoring that hits your pager when it’s already too late, but evaluating metrics usage on a regular basis to see what the data platform is doing, how it’s handling its current traffic, etc. In most cases, you don’t see massive increases in traffic all at once; typically, it’s gradual enough that you can monitor as it increases and adjust your system or processes to avoid saturation.

Tools like PMM and the Percona Toolkit play a big role in proactive monitoring, and they are open source and free to use. So if you don’t have tools like this in place, the price point makes them easy to consider integrating.

Also, if you still feel concerned about your current capacity planning, you can always reach out to Percona Managed Services for a performance review or query review that will give you a detailed analysis of the current state of your database along with recommendations to keep it as performant as possible.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Jun 05, 2023

PMM Authentication Bypass Vulnerability fixed in 2.37.1


On May 30, Percona was notified of a possible vulnerability in Percona Monitoring and Management (PMM). After researching the report, we agreed with the reporter and began working on a fix to address the issue. Today we’re releasing PMM 2.37.1 with a fix for CVE-2023-34409 that addresses the PMM authentication bypass vulnerability. This release contains no other features or fixes. We advise users to upgrade PMM at the earliest opportunity, particularly if the PMM instance is accessible directly from the Internet.

All versions of PMM starting with 2.0.0 are assumed to be vulnerable.

In prior versions of PMM, the authenticate function would strip parts of a URL separated by a dot or slash until it found a matching pattern in its ruleset. This could allow an attacker to feed a malformed URL to PMM to bypass authentication and access PMM logs. In turn, this could allow information disclosure and privilege escalation.

If you are able to update, follow the standard instructions to upgrade PMM.

If you are unable to perform an update, it is possible to mitigate this issue by making a change to the NGINX configuration on the running PMM instance. To do so, create a Bash script with the code from this script on GitHub.

Then, you can apply the code using this docker command on a server running the PMM Docker container (as root or using sudo):

docker exec -it pmm-server bash -c 'curl -fsSL https://raw.githubusercontent.com/percona/pmm/main/scripts/authfix.sh  | /bin/bash'

If you are running PMM via a virtual appliance (OVF or AMI), use SSH to shell into the PMM server and run this command, as root or using sudo:

curl -fsSL https://raw.githubusercontent.com/percona/pmm/main/scripts/authfix.sh  | /bin/bash

We’d like to thank Adam Kues, security researcher at Assetnote, for the vulnerability report. We deeply appreciate all community security and bug reports that help us identify and fix issues in Percona software. If you believe you’ve identified a security issue, see the Percona Security page for reporting procedures, our security policies, and the Responsible Disclosure program.

Apr 17, 2023

Percona Monitoring and Management and Datadog Integration

Datadog with percona

So your company uses Datadog at the global level, but as a DBA, you want to use Percona Monitoring and Management (PMM) for better database insights while maintaining the common policy of Datadog as the standard alerting or high-level observability tool.

Percona Monitoring and Management is not intended to replace Datadog; rather, they work better together. The first can provide you with in-depth database monitoring without vendor lock-in or licensing costs, while the second provides you with end-to-end observability beyond database infrastructure.

In this blog post, I’ll show you a simple example of collecting all rich data about databases using PMM and a way to send some of it to Datadog for general dashboards with main metrics to monitor. We’ll use here “up” metrics for different databases to reflect database status.

Setting up environments

Let’s prepare our environment. I’ll use a free Datadog account for this experiment.

Note: while the main site is https://www.datadoghq.com/, you may end up on https://app.datadoghq.eu/monitors/create; this is the EU domain variant (the com -> eu change cost me several hours of investigating why things were not working as expected).

Next, I need an installed and working PMM with several databases added to it. To install PMM, I’ll follow the instructions from the official documentation on installing PMM and add several databases to it.

installing PMM

Integrate the tools

Now it’s time to start connecting both tools.

Preparing the Datadog side: you can find a complete list of available Datadog agents for different operating systems inside your user account.

Because I am using an EC2 instance for my PMM installation, I’ll use an Amazon Linux agent.

I also need to select an API key for this agent; once it’s set, I get the command I need to execute inside my EC2 instance with PMM installed on it.

I have a command like:

DD_API_KEY=2........6 DD_SITE="datadoghq.eu" bash -c "$(curl -L https://s3.amazonaws.com/dd-agent/scripts/install_script_agent7.sh)"

After execution, I can see the logs of Datadog agent installation and configuration and at the end:

...
Installed:
  datadog-agent.x86_64 1:7.43.1-1                                               
Complete!
* Adding your API key to the Datadog Agent configuration: /etc/datadog-agent/datadog.yaml
* Setting SITE in the Datadog Agent configuration: /etc/datadog-agent/datadog.yaml
/usr/bin/systemctl
* Starting the Datadog Agent...
  Your Datadog Agent is running and functioning properly.
  It will continue to run in the background and submit metrics to Datadog.
  If you ever want to stop the Datadog Agent, run:
  
      sudo systemctl stop datadog-agent

  And to run it again run:
  
      sudo systemctl start datadog-agent

To confirm that the agent has been added and is working, I’ll go to the Infrastructure dashboard in the Datadog UI and check that the recently added host appears in the Hostname list with the status “Active.”

DataDog Percona

The next step is to extract data from PMM. We will use the Federation mechanism for this.

I will experiment with *_up metrics that usually reflect service status in Prometheus exporters.

Info: While PMM uses VictoriaMetrics internally, we expose the usual Prometheus endpoints to simplify integrations.

So the federated URL for exposing “*_up” metrics will be https://PMM_SERVER/prometheus/federate?match[]={__name__=~%22.*_up%22}

I can test it on my PMM server while being logged into it.
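For example, a quick command-line test of the federation endpoint (assuming basic authentication with your PMM credentials and a self-signed certificate; adjust the user, password, and server address):

# -k skips TLS verification for a self-signed certificate; -g disables URL globbing so the braces are sent literally
curl -skg -u admin:admin 'https://PMM_SERVER/prometheus/federate?match[]={__name__=~".*_up"}' | head -n 20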

The next step I need to do is to feed the Datadog agent with data from PMM.
I’ll use the OpenMetrics integration available in Datadog.

To do this, I must create the file /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml from the conf.yaml.example located in the same folder and modify several lines in the file:

    openmetrics_endpoint: 'http://PMM_USER:PMM_PASSWORD@PMM_SERVER/prometheus/federate?match[]=%7B__name__%3D~%22..%2A_up%22%7D'

This forces the Datadog agent on this node to read data from the PMM URL.

    namespace: pmm

This puts all PMM metrics in the same, conveniently separated namespace.

The most “hard to find and understand” option is:

metrics: 
         - .+:
             type: gauge

This forces the Datadog agent to collect and store the data (the agent behaves strangely if the type is not defined).
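Putting those pieces together, the relevant part of the file might look like the snippet below. This is a sketch: the init_config/instances layout follows the stock conf.yaml.example, and you still need to substitute your own PMM credentials and server address. One way to write it in place:

sudo tee /etc/datadog-agent/conf.d/openmetrics.d/conf.yaml > /dev/null <<'EOF'
init_config:

instances:
  - openmetrics_endpoint: 'http://PMM_USER:PMM_PASSWORD@PMM_SERVER/prometheus/federate?match[]=%7B__name__%3D~%22..%2A_up%22%7D'
    namespace: pmm
    metrics:
      - .+:
          type: gauge
EOF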

Time to verify the integration.

First, I need to restart the Datadog agent to get a new collection applied.

# sudo service datadog-agent restart

And if it’s ok – recheck the status by command: 

# sudo datadog-agent status

In the output, I’m searching for this section:

...
    openmetrics (2.3.0)
    -------------------
      Instance ID: openmetrics:pmm:543447ee3db29bba [OK]
      Configuration Source: file:/etc/datadog-agent/conf.d/openmetrics.d/conf.yaml
      Total Runs: 4
      Metric Samples: Last Run: 17, Total: 68
      Events: Last Run: 0, Total: 0
      Service Checks: Last Run: 1, Total: 4
      Average Execution Time : 56ms
      Last Execution Date : 2023-04-08 13:40:56 UTC (1680961256000)
      Last Successful Execution Date : 2023-04-08 13:40:56 UTC (1680961256000)

 A non-zero number for samples confirms the data was properly collected. 

Now it’s time to go into Datadog UI and verify data there. 

Metrics Summary is the best place to check this and search for PMM-related metrics.

PMM metrics

By clicking on the pmm.mysql_up metric, I can confirm that data is collected for all three MySQL servers added to PMM.

Dashboarding data in Datadog

The main idea of having data in Datadog is to create dashboards based on it. Having confirmed the presence of the metrics, I can go straight to the next step: creating a dashboard by clicking the “Save to Dashboard” button. After answering the modal window prompt, I can create a new dashboard from this metric. By editing the formula to add another metric (pg_up), changing the style to area, and coloring the different metrics with different palettes (so I can easily see which type of service is absent), I get a better dashboard.

DataDog

Now I can see all five databases I have in PMM presented in the Datadog dashboard. 

DataDog Dashboard

Conclusion

You now know how to pass data from PMM to Datadog and build a simple dashboard. I hope this post helps you play with PMM and integrate it with other tools. If there are better ways to integrate these tools than this approach, please let us know how you do it, or share your challenges. I will be happy to hear from you on our forum; we are looking forward to your feedback.

Percona Monitoring and Management is a best-of-breed open source database monitoring solution. It helps you reduce complexity, optimize performance, and improve the security of your business-critical database environments, no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today

Mar 17, 2023

Percona Monitoring and Management 2 Scaling and Capacity Planning


2022 was an exciting year for Percona Monitoring and Management (PMM). We’ve added and improved many features, including Alerting and Backup Management. These updates are designed to keep databases running at peak performance and simplify database operations. But as companies grow and see more demand for their databases, we need to ensure that PMM also remains scalable so you don’t need to worry about its performance while tending to the rest of your environment.

PMM2 uses VictoriaMetrics (VM) as its metrics storage engine. Percona’s co-founder Peter Zaitsev wrote a detailed post about the migration from Prometheus to VictoriaMetrics. One of the most significant performance differences in PMM2 comes from the use of VM, which can also be seen in the performance comparison of node_exporter metrics between Prometheus and VictoriaMetrics.

Planning resources for a PMM Server host instance can be tricky because the numbers can change depending on the DB instances being monitored by PMM. For example, a higher number of data samples ingested per second, or monitoring a database with a huge number of tables (1,000+), can affect performance; similarly, the exporter configuration or a custom metric resolution can also have an impact on the performance of the PMM Server host. The point is that scaling up PMM isn’t linear, and this post is only meant to give you a general idea and serve as a good starting point when planning to set up PMM2.

The VictoriaMetrics team has also published some best practices, which can also be referred to while planning for resources for setting up PMM2.

Home Dashboard for PMM2
We have tested PMM version 2.33.0 with its default configuration, and it can monitor more than 1,000 MySQL services, with the databases running the default sysbench read-write workload. We observed that the overall performance of PMM monitoring 1,000 database services was good, and no significant resource usage spikes were observed; this is a HUGE increase in performance and capacity over previous versions! Please note that the focus of these tests was standard metrics gathering and display; we’ll use a future blog post to benchmark some of the more intensive Query Analytics (QAN) performance numbers.

Capacity planning and setup details

We used a dedicated 32-core CPU and 64GB of RAM for our testing.

CPU Usage for PMM Server Host System

The CPU usage averaged 24% utilization, as you can see in the above picture.

 

Memory Utilization for PMM Server Host System

Virtual Memory utilization was averaging 48 GB of RAM.

VictoriaMetrics maintains an in-memory cache for mapping active time series into internal series IDs. The cache size depends on the available memory for VictoriaMetrics in the host system; hence planning for enough RAM on the host system is important for better performance to avoid having a high percentage of slow inserts.

As for overall disk usage for the instance monitoring 1,000 database services, the average disk usage per datapoint comes out to roughly 0.25 bytes, so you should plan for roughly 500 GB to 1 TB of storage for the default 30-day retention period.
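As a rough illustration, the storage estimate is just ingestion rate times bytes per datapoint times retention; here is a minimal sketch (the ingestion rate below is a made-up placeholder, so read the real value from the VictoriaMetrics stats in PMM):

# Rough storage estimate in GB: samples/sec * bytes per datapoint * retention in seconds
SAMPLES_PER_SEC=1000000     # hypothetical ingestion rate; check your own VictoriaMetrics stats
BYTES_PER_DATAPOINT=0.25
RETENTION_DAYS=30
echo "$SAMPLES_PER_SEC * $BYTES_PER_DATAPOINT * $RETENTION_DAYS * 86400 / (1024^3)" | bc -l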

Average datapoints size

Stats on Victoria Metrics

We recommend at least 2 GB of RAM and a two-core system for the PMM Server as a bare minimum requirement to set up monitoring of your database services. With this minimum recommended setup, you can monitor up to three databases comfortably, possibly more, depending on the already-mentioned factors in your environment.

Based on our observations and the various setups we have done with PMM, with a reasonably powerful pmm-server host system (8+ GB RAM and 8+ cores), the optimum target is to monitor about 32 databases per core, or 16 databases per GB of RAM; keeping this in mind is really helpful while planning resources for your monitoring setups (a quick sizing sketch follows the table below).

Number of Database Services Monitored | Minimum Recommended Requirement
0-250 services                        | 8 cores, 16 GB RAM
250-500 services                      | 16 cores, 32 GB RAM
500-1,000 services                    | 32 cores, 64 GB RAM
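As a quick illustration of that rule of thumb, a minimal sketch (the service count is a placeholder):

# Rough sizing from the ~32 databases per core and ~16 databases per GB of RAM rule of thumb
DB_SERVICES=300
echo "Cores needed: $(( (DB_SERVICES + 31) / 32 ))"
echo "RAM (GB) needed: $(( (DB_SERVICES + 15) / 16 ))"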

PMM scalability improved dramatically through UX and performance research

In earlier versions of PMM2, the Home Dashboard could not load with more than 400 DB services, resulting in a frustrating experience for users. Interacting with UI elements such as filters and date pickers was previously impossible. We conducted thorough research to improve scalability and the user experience of our Home Dashboard for 1,000 database services. Our findings revealed that the design of the Home Dashboard heavily impacted scalability and caused poor UX, resulting in unresponsive pages.

We redesigned the Home Dashboard as a solution, and the results were significant. The new dashboard provides a much better user experience with more critical information being displayed and scalability for environments up to 1000 DB services. The overall load time has improved dramatically, going from 50+ seconds to roughly 20 seconds, and there are no longer any unresponsive errors on the UI. Users can now interact with filters on other dashboards seamlessly as well!

There are still some limitations we’re working on addressing

  • Instance Overview Dashboards, which are shipped with PMM, do not work well with such a large number of instances, so it is recommended not to rely on them when such a high number of databases are being monitored. They would work well only with a maximum of 400 database services.
There is a known issue where a request “URI too Large” pop-up message appears because of some large query requests; this also causes an issue when setting a big time range for observing metrics from the monitored database. Our team is planning to implement a fix for this soon.
QAN takes 50+ seconds to load when 400+ database services are monitored. Also, the overall interaction with QAN feels laggy when searching and applying filters across a big list of services/nodes. Our team is working on improving the overall user experience of the QAN app, and the fixes will land in future releases of PMM.

Not a formula but a rule of thumb

Overall resource usage in PMM depends on the configuration and workload, and it may vary for different setups, so it’s difficult to say, “for monitoring this number of DB services, you need a machine of that size.” This post is meant to show how the PMM server scales up and performs with the default setup and all database host nodes configured in default metrics mode (push).

We also plan a follow-up post on performance and scalability, highlighting results for different dashboards and QAN and showcasing the improvements we have made over the last few PMM releases.

Tell us what you think about PMM!

We’re excited about all of the improvements we’ve made, but we’re not done yet! Have some thoughts about how we can improve PMM, or want to ask questions? Come talk to us in the Percona Forums, and let us know what you’re thinking!

PMM Forum

Dec 19, 2022

PMM V2.33: Offline Metric Collection, Guided Alerting Tour, Security Fixes, and More!

latest release of Percona Monitoring and Management

We are excited to announce the latest release of Percona Monitoring and Management (PMM) – V2.33. This release includes several new features and improvements that make PMM even more effective and user-friendly. Some of the key highlights of PMM V2.33 include:

  • Offline metric collection during PMM server outages or loss of PMM client-server network connectivity
  • A guided tour of Alerting, which helps new users get up to speed quickly and start using the alerting features of PMM
  • Easily restore your MongoDB databases to a previous state
  • Updated Grafana to version 9.2.5 to fix critical security vulnerabilities
  • Tab completion for the pmm-admin CLI command, which makes it easier to use the command line interface to manage PMM

You can get started with PMM in minutes using our PMM Quickstart guide and check out the latest version, PMM V2.33.

Client-side caching minimizes potential for metrics loss

This new feature ensures that the PMM Client saves the monitoring data locally when a connection to the PMM server is lost, preventing gaps in the data. When the connection is restored, the data is sent to the PMM server, allowing the monitoring of your systems to continue without any data loss.

Note:

The client node is currently limited to storing only 1 GB of offline data. So, if your instance is down for three days and generates more than 1 GB of data during that time, not all of the data will be retrieved.

One of the core principles of our open-source philosophy is transparency, and we are committed to sharing our roadmap openly and transparently with our users. We are happy to share the roadmap for the implementation of PMM high availability (HA) in three stages, which has been a highly requested feature by our users. 

PMM HA will be rolled out in three stages. Stage one, which is included in PMM 2.33.0, involves the implementation of a data loss prevention solution using VictoriaMetrics integration for short outages. This feature is now available in the latest release of PMM. Stages two and three of PMM HA will be rolled out, including additional features and enhancements to provide a complete high availability solution for PMM. We are excited to bring this much-anticipated feature to our users, and we look forward to sharing more details in the coming months.

 

Stages of PMM HA                       | Solutions provided
Stage one (included in PMM 2.33.0)     | As an initial step toward preventing data loss, we have developed offline metric collection for short outages.
Stage two (will be rolled out in 2023) | HA data sources: we will let users use external data sources, thereby decreasing the dependency on the file system.
Stage three (will be rolled out in 2023) | HA clustered PMM Servers: clustered PMM will be the focus of stage three. Detailed information will be included in the upcoming release notes.

 

Please feel free to book a 1:1 meeting with us to share your thoughts, needs, and feedback about PMM HA.

Tip: To improve the availability of the PMM Server until the general availability of PMM HA, PMM administrators can deploy it on Kubernetes via the Helm chart. The Kubernetes cluster can help ensure that PMM is available and able to handle different types of failures, such as the failure of a node or the loss of network connectivity.
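For reference, a minimal Helm-based install could look like the commands below (a sketch using the Percona Helm charts repository; the release name, namespace, and chart values are up to you):

helm repo add percona https://percona.github.io/percona-helm-charts/
helm repo update
helm install pmm percona/pmm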

Critical security vulnerabilities fixed

In PMM 2.33.0, we have updated Grafana to version 9.2.5, which includes important security fixes. This upgrade addresses several critical and moderate vulnerabilities, including CVE-2022-39328, CVE-2022-39307, and CVE-2022-39306. For more details, please see the Grafana 9.2.5 release notes. We strongly recommend that all users upgrade to the latest version of PMM to ensure the security of their systems.

Guided tour on Alerting

In the 2.31.0 release of PMM, we added a new feature called Percona Alerting, which provides a streamlined alerting system. To help users get started with this new feature, we have added a short in-app tutorial that automatically pops up when you first open the Alerting page. This tutorial will guide you through the fundamentals of Percona Alerting, and help you explore the various features and options available. We hope this tutorial will make it easier for users to get started with Percona Alerting and take full advantage of its capabilities.

Restore MongoDB backups more easily

Building upon the significant improvements for MongoDB backup management introduced in the previous release, we are now simplifying the process for restoring physical MongoDB backups. Starting with this release, you can restore physical backups straight from the UI, and PMM will handle the process end-to-end. Prior to this, you would need to perform additional manual steps to restart your MongoDB database service so that your applications could make use of the restored data.

Improvements on the pmm-admin CLI command

pmm-admin is a command-line tool that is used to manage and configure PMM. It is part of the PMM Client toolset and can be used to perform various administrative tasks, such as managing inventory. We have added tab completion for the pmm-admin CLI command. This means that you no longer have to know the entire command when using pmm-admin. Instead, you can simply type the initial part of the command and press Tab, and the rest of the command will be automatically completed for you.  This new feature makes it easier to use the command line interface and ensures that you can quickly and easily access all of the powerful features of PMM. 

What’s next?

  • A Health dashboard for MySQL is on the way. Please share your suggestions in the comments or forum if you’d like to be part of the group shaping PMM. 
  • We have started to work on two new and significant projects: High Availability in PMM and advanced Role-Based Access Control (RBAC). We’d love to hear your needs, use cases, and suggestions. You can quickly book a short call with the product team to collaborate with us. 

Install PMM 2.33 now or upgrade your installation to V2.33 by checking our documentation for more information about upgrading.

Thanks to Community and Perconians

At Percona, we are grateful for our supportive community and dedicated team, who work together to shape the future of PMM. If you would like to be a part of this community, you can join us on our forums to request new features, share your feedback, and ask for support. We value the input of our community and welcome all members to participate in the ongoing development of PMM.

See PMM in action now!
