Oct 22, 2020

Annotations Provide Context to Timelines in Percona Monitoring and Management


About the Author: This blog was written as a collaboration with my colleague Jiri Ctvrtka. Jiri is a senior software developer from Brno, Czech Republic, and has been partnering with Percona for almost a year, working on various components of our Percona Platform. He has been programming in Go since 2015, has a passion for simplicity, speed, and precision of data, and has focused that passion on understanding the impact of changes while reimagining the Percona Monitoring and Management (PMM) Annotations functionality.

“Does anyone remember what caused this large spike last Thursday at 2:24 am?”

What are Annotations?

Annotations are a way to provide context to timelines in Percona Monitoring and Management (PMM). In bigger teams, for example, they are a good way to inform others about an event or an important change that may have occurred. An annotation can contain any kind of information, but we most commonly see it used to indicate an outage, the start or end of a maintenance window, a deployment of new code, a security event, and so on. Annotations in PMM help explain peaks and valleys in graphs, mark when something took place on the timeline, and make it easier to draw correlations.

 

An example of annotations automatically added at the beginning and end of a benchmark test.

Every annotation can be shown or hidden with a simple toggle button in the filter options, so you don't need to worry about charts getting crowded with annotations. You can toggle them on or off and zoom in or out around events to get better detail.

The filter contains a toggle button to turn the visibility of annotations on or off.

How Can I Add Annotations?

Annotations can be added with the pmm-admin annotate command. So, let's try it:

pmm-admin annotate "Deployed web version 2.4.6"

That command will place an annotation on every chart in every node and service…maybe more than you need. What if you only needed to add an annotation for a specific node or service to indicate a broad network outage? Or add specific annotations for specific nodes or services to indicate a software update was applied? This is all possible as pmm-admin provides four flags, which can be used for just this purpose.

--node = annotate the node the pmm-admin command was run on
--node-name = annotate node with specified name
--service = annotate all services running on the node the pmm-admin command was run on
--service-name = annotate service with specified name

All these flags can be combined to annotate multiple nodes or services with a single command. The order of flags doesn't matter. Just imagine how many combinations you have! Even better, imagine how easily this can be integrated into your CI/CD or deploy pipelines!
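For illustration, here is a minimal sketch of what a deploy-pipeline step could look like; the node and service names (lb-01, orders-mysql) are placeholders, not anything defined above:

# Hypothetical deploy step: annotate the current node, a named load-balancer node,
# and the orders database service in one command (names are examples only)
pmm-admin annotate "$(date) - Deployed web version 2.4.6" \
  --node \
  --node-name=lb-01 \
  --service-name=orders-mysql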

You can also add annotations via the API with curl commands:

curl 'http://admin:admin@localhost/graph/api/annotations' -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{"tags": ARRAY OF TAGS,"text":"TEXT"}'

ARRAY OF TAGS = the names of the nodes or services; for example, for node pmm-server and service pmm-server-mysql it would be "tags": ["pmm-server", "pmm-server-mysql"]

Some Examples for Better Understanding

Case 1: We have a DB node we need to take offline for maintenance and want to capture this on all its graphs to explain the gap in reporting data.

pmm-admin annotate "`date` - System is going down for Maintenance - RFC-2232-2020" --node-name="db1002"

via API:

curl 'http://admin:admin@localhost/graph/api/annotations' -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{"tags": ["db1002"],"text":"`date` - System is going down for Maintenance - RFC-2232-2020"}'

Or, if pmm-admin is running on this node, you don't even need to know the node's name and can set the annotation as part of the shutdown routine.

pmm-admin annotate "`date` - System is being shut down/rebooted"

Case 2: We have node pmm-server and three services running on the current node (mysql, postgres, mongo). So, yeah, it’s simple right?

pmm-admin annotate "`date` - Apply configuration change - RFC-1009-2020" --service-name=mysql
pmm-admin annotate "Restarting Postgres to apply security patch" --service-name=postgres
pmm-admin annotate "`date` - Service Disruption Reported via Support - SUP-239" --service-name=mongo

via API:

curl 'http://admin:admin@localhost/graph/api/annotations' -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{"tags": ["mysql", "postgres", "mongo"],"text":"`date` - Apply configuration change - RFC-1009-2020"}'

Or you can do it in one command:

pmm-admin annotate "Services Recycled to pick up config changes" --service

And that’s it!  All services found running on that node will be annotated.

Case 3: We have node “registrationDb” and many services running on the current node. What if we want to annotate that node and also every service running on this node by just one command? Again, no problem:

pmm-admin annotate "`date` - Security alerted to possible event" --node-name=registrationDb --service

via API:

curl 'http://admin:admin@localhost/graph/api/annotations' -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{"tags": ["registrationDb", "service1", "service2",...*],"text":"`date` - Security alerted to possible event"}'

* While the pmm-admin --service flag will annotate all services automatically, with the API you need to list all the service names of the node yourself to get the same result; you can, however, make an API call to get all services on a given node.

That's it: no matter how many services you are running on node registrationDb, the annotations will be presented on all of them and on the node graphs as well.
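For illustration only, here is a hedged sketch of that API-driven approach. The inventory endpoints (/v1/inventory/Nodes/List and /v1/inventory/Services/List) and the field names used in the jq filters are assumptions to verify against your PMM Server's API reference; jq is assumed to be installed, and the node name is the example from this case:

# Assumes the PMM inventory API exposes Nodes/List and Services/List as POST endpoints;
# adjust credentials, host, and field names for your setup.
PMM="http://admin:admin@localhost"
NODE_NAME="registrationDb"

# Resolve the node name to its node_id (field names are assumptions)
NODE_ID=$(curl -s -X POST "$PMM/v1/inventory/Nodes/List" \
  -H 'Content-Type: application/json' --data-binary '{}' \
  | jq -r --arg n "$NODE_NAME" '.. | objects | select(.node_name? == $n) | .node_id' | head -n 1)

# Build the tag list: the node itself plus every service registered on it
TAGS=$(curl -s -X POST "$PMM/v1/inventory/Services/List" \
  -H 'Content-Type: application/json' --data-binary '{}' \
  | jq -c --arg id "$NODE_ID" --arg n "$NODE_NAME" \
      '[$n] + [.. | objects | select(.node_id? == $id) | .service_name]')

# Annotate the node and all of its services in one call
curl -s "$PMM/graph/api/annotations" \
  -H 'Content-Type: application/json;charset=UTF-8' \
  --data-binary "{\"tags\": $TAGS, \"text\": \"$(date) - Security alerted to possible event\"}"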

Case 4: We have 100 services running on the current node and also another node named pmm-server2 and we need to annotate all 100 services on the current node (but not the current node) along with node pmm-server2.  Simple:

pmm-admin annotate "`date` - Load Test Start" --node-name=pmm-server2 --service

via API:

curl 'http://admin:admin@localhost/graph/api/annotations' -H 'Content-Type: application/json;charset=UTF-8' --data-binary '{"tags": ["pmm-server2", "service1", "service2",...*],"text":"`date` - Load Test - Increasing to 100 threads"}'

* While the pmm-admin --service flag will annotate all services automatically, with the API you need to list all the service names of the node yourself to get the same result; you can, however, make an API call to get all services on a given node.

The current node will not be annotated, but every service on this node will be along with node pmm-server2.

Here’s a little guide to see many of the possible combinations of flags and what will result:

--node = current node
--node-name = node with name
--node --node-name = current node and node with name
--node --service-name = current node and service with name
--node --node-name --service-name = current node, node with name, and service with name
--node --service = current node and all services of current node
--node-name --service = all services of current node, node with name
--node --node-name --service = all services of current node, node with name, and current node
--service = all services of current node
--service-name = service with name
--service --service-name = all services of current node, service with name
--service --node-name = all services of current node and node with name
--service-name --node-name = service with name and node with name
--service --service-name --node-name = service with name, all services on current node, and node with name

So, thanks to annotations, correlating events on your servers is now easier than ever. We'd love to hear or even see how you're using annotations to make your life easier; hopefully, we've given you some ideas to get started right away!

Download Percona Monitoring and Management Today

Oct 19, 2020

PMM 101: Troubleshooting MongoDB with Percona Monitoring and Management


Percona Monitoring and Management (PMM) is an open-source tool developed by Percona that allows you to monitor and manage your MongoDB, MySQL, and PostgreSQL databases. This blog will give you an overview of troubleshooting your MongoDB deployments with PMM.

Let’s start with a basic understanding of the architecture of PMM. PMM has two main architectural components:

  1. PMM Client – Client that lives on each database host in your environment.  Collects server metrics, system metrics, and database metrics used for Query Analytics
  2. PMM Server – Central part of PMM that the clients report all of their metric data to.   Also presents dashboards, graphs, and tables of that data in its web interface for visualization of your metric data.

For more details on the architecture of PMM, check out our docs.

Query Analytics

PMM Query Analytics ("QAN") allows you to analyze MongoDB query performance over periods of time. In the screenshot below, you can see that the longest-running query was against the testData collection.

Percona Monitoring and Management query analytics

If we drill deeper by clicking on the query in PMM we can see exactly what it was running. In this case, the query was searching in the testData collection of the mpg database looking for records where the value of x is 987544.


This is very helpful in determining what each query is doing, how much it is running, and which queries make up the bulk of your load.

The output is from db.currentOp(), and I agree it may not be clear at a glance what the application-side (or mongo shell) command was. This is a limitation of the MongoDB API in general: the drivers send the request with perfect functional accuracy, but it does not necessarily resemble what the user typed (or programmed). With that understanding, and by focusing first on what the "command" field contains, it is not too hard to picture a likely original format. For example, the output above could have been produced by running "use mpg; db.testData.find({"x": { "$lte": …, "$gt": … }}).skip(0)" in the shell. The last ".skip(0)" is optional as it is 0 by default.

Additionally, you can see the full explain plan for your query, just as you would by adding .explain() to it. In the example below, we can see that the query did a full collection scan on the mpg.testData collection, and we should think about adding an index to the 'x' field to improve the performance of this query.
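If you want to reproduce that explain output from the command line, here is a minimal sketch; it assumes a local, unauthenticated mongod and the mpg.testData example from above (add --host and credentials as needed):

# Show the execution plan, including the scan stage and the number of documents examined
mongo mpg --quiet --eval 'printjson(db.testData.find({x: 987544}).explain("executionStats"))'

# If the plan shows a full collection scan (COLLSCAN), an index on x is the usual fix
mongo mpg --quiet --eval 'printjson(db.testData.createIndex({x: 1}))'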

Metrics Monitor

Metrics Monitor allows you to monitor, alert, and visualize different metrics related to your database overall, its internal metrics, and the systems they are running on.

Overall System Performance View

The first view that is helpful is your overall system performance view. Here you can see, at a high level, how much CPU and memory are being used, the amount of writes and reads from disk, network bandwidth, the number of database connections, database queries per second, RAM, and the uptime for both the host and the database. This view can often lead you to the problematic node(s) if you're experiencing any issues and can also give you a high-level picture of the overall health of your monitored environment.

Percona Monitoring and Management system overview

WiredTiger Metrics

Next, we'll start digging into some of the database internal metrics that are helpful for troubleshooting MongoDB. These metrics mostly come from the WiredTiger storage engine, which has been the default storage engine for MongoDB since MongoDB 3.2. In addition to the metrics I cover, there are more documented here.

The WiredTiger storage engine uses tickets as a way to handle concurrency. By default, WiredTiger has 128 read and 128 write tickets. PMM allows you to alert when your available tickets are getting low. You can also correlate with other metrics to understand why so many tickets are being used. The graph sample below shows a low-load situation: only about one ticket out of 128 was checked out at any time.

Percona Monitoring and Management wiredtiger
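To see the raw numbers behind that graph, you can query serverStatus directly; a minimal sketch, assuming a local mongod without authentication:

# "out" is the number of tickets currently checked out; "available" is what is left of the 128
mongo --quiet --eval 'printjson(db.serverStatus().wiredTiger.concurrentTransactions)'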

One of the things that could be causing you to use a large number of tickets is a high checkpoint time. WiredTiger, by default, does a full checkpoint at least every 60 seconds; this is controlled by the WiredTiger parameter checkpoint=(wait=60). Checkpointing flushes all the dirty pages to disk. (By the way, 'dirty' is not as bad as it sounds; it's just a storage engine term meaning 'not committed to disk yet'.) High checkpointing times can lead to more tickets being in use.

Finally, we have WiredTiger Cache Activity metrics. WiredTiger Cache activity indicates the level of data that is being read into or written from the cache.  These metrics can help you baseline your normal cache activity, so you can notice if you have a large amount of data being read into the cache, perhaps from a poorly tuned query, or a lot of data being written from the cache.

WiredTiger Cache Activity

Database Metrics

PMM also has database metrics that are not WiredTiger specific.   Here we can see the uptime for the node, queries per second, latency, connections, and number of cursors.   These are higher-level metrics which can be indicative of a larger problem such as connection storms, storage latency, and excessive queries per second.  These can help you hone in on potential issues for your database.

Percona Monitoring and Management database metrics

Node Overview Metrics

System metrics can point you towards an issue at the OS level that may or may not correlate with your database. CPU, CPU saturation, core usage, disk I/O, swap activity, and network traffic are some of the metrics that can help you find issues that start at the OS level or below. Metrics beyond those shown below can be found in our documentation.

Node Overview Metrics

Takeaways

In this blog, we've discussed how PMM can help you troubleshoot your MongoDB deployment. Whether you're looking at WiredTiger-specific metrics, system-level metrics, or database-level metrics, PMM has you covered. Thanks for reading!

Additional Resources:

Download Percona Monitoring and Management

PMM for MongoDB Quick Start Guide

PMM Blog Topics

MongoDB Blog Topics

Oct 12, 2020

Setup Teams/Users With Limited Access in Percona Monitoring and Management


From time to time we are asked how to limit users to viewing only some dashboards or servers in Percona Monitoring and Management (PMM). Here are some hints on how to do this.

Let’s imagine you want the following:

  • Users user1 and user2 are only allowed to see the “CPU Utilization Details” dashboard for server1, server2 and pmm-server;
  • User user3 is only allowed to see the “CPU Utilization Details” dashboard for server3;
  • All users are allowed to see MySQL dashboards for any services.

1. First, let’s create users user1, user2, and user3. Their roles should be set to “Viewer”.


2. Now let’s create two folders, Team1 and Team2


3. We limit access to folder "Team1" to user1/user2, and access to folder "Team2" to user3



4. The Viewer role has to be removed from all the original folders except MySQL. In our situation, all users are allowed access to the MySQL dashboards.

5. Now we make copies of the “CPU Utilization Details” dashboard in folders “Team1” and “Team2”


So now users can view only the dashboards in the "MySQL", "Team1", and "Team2" folders. In the next step, we will apply limits by server.

6. We are going to limit which servers the dashboards in the new folders can show. To do this, we must modify the node_name variable.
Navigate to "Settings" … "Variables" … "node_name".
Allowed servers are added into the "Regex" field:
For Team1: /server1|server2|pmm-server/
For Team2: /server3/


That’s it. Let’s login as user1 and check what we’ve got.

(Please note that the Home dashboard is located in the "Internal" folder, so it's not accessible to our users either. The list of allowed dashboards can therefore be reached through the left menu only.)


As we can see, MySQL dashboards and “CPU Utilization Details” dashboards are accessible. But let’s also check the servers in the dropdown list of the last dashboard.


So it’s correct; User1 can see data for pmm-server.

You can read more about this in the official Grafana documentation, in the "Manage users" section. Also, please keep in mind that users with the "Editor" role have access to dashboard settings and can remove or modify the regex filtering for servers/services, so it's better to avoid assigning the "Editor" role to users in this setup.

Oct 08, 2020

MySQL 101: Troubleshooting MySQL with Percona Monitoring and Management


Percona Monitoring and Management (PMM) is a free and open source platform for managing and monitoring MySQL, MongoDB, and PostgreSQL databases. In this blog, we will look at troubleshooting MySQL issues and performance bottlenecks with the help of PMM.

We can start troubleshooting MySQL after selecting the appropriate server or cluster from the PMM dashboard. There are many aspects to check which should get priority depending on the issue you are facing.

Things to look at:

OS:

CPU Usage:

Check to see if there is a spike or gradual increase in the CPU usage on the database server other than the normal pattern. If so, you can check the timeframe of the spike or starting point of the increasing load and review the database connections/thread details from the MySQL dashboard for that time interval.

MySQL CPU usage

CPU Saturation Metrics and Max Core Usage:

This is an important metric as it shows the saturation level of the CPU with normalized CPU load. Normalized CPU Load is very helpful for understanding when the CPU is overloaded. An overloaded CPU causes response times to increase and performance to degrade.

CPU Saturation Metrics and Max Core Usage

Disk latency/ Disk IO utilization:

Check to see if there is any latency observed on the disk. If you see Disk IO utilization reach 100%, this will cause latency in queries, as the disk would not be able to keep up with the reads/writes, causing a gradual pile-up of queries and hence the spike. The issue might be with the underlying disk or hardware.

Disk latency/ Disk IO utilization

Memory Utilization:

Any sudden change in memory consumption could indicate a process hogging memory, for example the MySQL process when many concurrent queries or long-running queries are in progress. You may also see an increase when a backup job or a scheduled batch job is running on the server.

Network Details:

Check the Inbound and Outbound Network traffic for the duration of the issue for any sudden dip which would point to some network problems.

MySQL:

MySQL Client Thread Activity / MySQL Connections:

If the issue at hand is for a large number of running threads or a connection spike, you can check the graphs of MySQL Connections and MySQL thread activity and get the timeframe when these connections start increasing. More details about the threads (or queries running) can be seen from Innodb dashboards. As mentioned previously, a spike in Disk IO utilization reaching 100% can also cause connections to pile up. Hence, it is important to check all aspects before coming to any conclusion.

MySQL Client Thread Activity / MySQL Connections

MySQL Slow Queries:

If there were queries performing slowly, this would be reported in the MySQL slow queries graph. This could be due to existing queries slowing down because of multiple concurrent queries or underlying disk load, or to newly introduced queries that need analysis and optimization. Look at the timeframe involved, then check the slow logs and QAN to get the exact queries.

MySQL Slow Queries
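To correlate this graph with the server's own settings and counters, you can check the slow query log configuration and the status counter the graph is typically built from; a minimal sketch, assuming a local MySQL client session with sufficient privileges:

# Is the slow query log enabled, where does it go, and what is the time threshold?
mysql -e "SHOW GLOBAL VARIABLES LIKE 'slow_query_log%'; SHOW GLOBAL VARIABLES LIKE 'long_query_time';"

# Cumulative count of statements that exceeded long_query_time
mysql -e "SHOW GLOBAL STATUS LIKE 'Slow_queries';"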

MySQL Aborted Connections:

If there were a large number of users or threads unable to establish a connection to the server this would be reported by a spike in aborted connections.

InnoDB Metrics:

InnoDB Transactions:

This metric shows the graph of the History List Length (HLL) on the server, which is basically the 'undo log' records kept to provide a consistent view of data for any particular connection. An increase in HLL over time is caused by long-running transactions on the server. If you see a gradual increase in HLL, check SHOW ENGINE INNODB STATUS\G on the server, look for the culprit query/transaction, and kill it if it's not truly needed. While not an immediate issue, a growing HLL can hamper the performance of a server if the value is in the millions and still increasing.

InnoDB Transactions
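A minimal sketch of that check from the command line, assuming a local MySQL client with sufficient privileges:

# Current history list length, as reported by InnoDB
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep "History list length"

# Oldest open transactions that may be keeping the history list from being purged
mysql -e "SELECT trx_id, trx_started, trx_mysql_thread_id, trx_query FROM information_schema.innodb_trx ORDER BY trx_started LIMIT 5;"

# If a transaction is truly not needed, kill it by its MySQL thread id (value from the query above)
# mysql -e "KILL <trx_mysql_thread_id>;"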

InnoDB Row Operations:

When you see a spike in thread activity, you should check here next to get more details about the threads running.  Spike in reads/inserts/deletes? You will get details about each of the row operations and their count, which will help you understand what kind of queries were running on the server and find the particular job that is responsible for this.

InnoDB Row Operations

InnoDb Locking > InnoDB Row Lock Time:

Row Lock Waits indicates how many times a transaction waited on a row lock per second. Row Lock Wait Load is a rolling 5 minute average of Row Lock Waits. Average Row Lock Wait Time is the row lock wait time divided by the number of row locks.

InnoDb Locking > InnoDB Row Lock Time

InnoDB Logging > InnoDB Log file usage Hourly:

If there is an increase in writes on the server, we can see an increase in hourly log file usage; it shows, in GB, how much data was written to the ib_logfiles (the redo logs) before being flushed to the data files on disk.

Performance Schema Details:

PS File IO (Events):

This dashboard will provide details on the wait IO for various events as shown in the image:

PS File IO (Events)

PS File IO (Load):

Similar to events, this will display the load corresponding to the event.

QAN:

PMM Query Analytics:

The Query Analytics dashboard shows how queries are executed and where they spend their time. To get more information out of QAN you should have QAN prerequisites enabled. Select the timeframe and check for the slow queries that caused most of the load. Most probably these are the queries that you need to optimize to avoid any further issues.

PMM Query Analytics

Daniel’s blog on How to find query slowdowns using PMM will help you troubleshoot better with QAN.

For more details on MySQL Troubleshooting and Performance Optimizations, you can check our CEO Peter Zaitsev’s webinar on MySQL Troubleshooting and Performance Optimization with PMM.

Oct 07, 2020

How to Find Query Slowdowns Using Percona Monitoring and Management


Visibility is a blessing, and with databases, visibility is a must. That’s true not only for metrics but for the queries themselves. Having info on all the stats around query execution is priceless, and Percona Monitoring and Management (PMM) offers that in the form of the Query Analytics dashboard (QAN).

But where to start? QAN helps you with that by calculating the query profile. What is the profile? It's a ranking of queries, ordered by Load, so it is easy to spot the heaviest queries hitting your database. The Load is defined as the "Average Active Queries" but can also be thought of as a mix of query execution time and query count; in other words, all the time the query was alive and kicking.

The Profile in PMM 2.10.0 looks like this:


The purpose of this profile is to facilitate the task of finding the queries that are worth improving, or at least the ones that will have a bigger impact on the performance when optimized.

However, how do you know whether a slow query has always been slow, or whether it has slid from good performance to painfully slow? That's where the graph in the "Load" column comes in handy.

There’s a method for doing this. The first step is to have a wide view. That means: check a time range long enough so you can see patterns. Personally, I like to check the last 7 days.

The second step is to find irregularities like spikes or increasing patterns. For example, in the above profile, we can see that the "SHOW BINARY LOGS" command is the #4 query adding the most load to the database. In this case, it's because the binlogs are not being purged, so every day there are more and more binlog files to read, and that adds to the execution time. The number of times the "SHOW BINARY LOGS" query is executed, however, remains the same.
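If you hit the same pattern, a quick way to confirm it from the command line is to look at the binlog list and the retention settings; a minimal sketch (the variable name differs between MySQL 5.7 and 8.0):

# How many binlog files exist and how large they are
mysql -e "SHOW BINARY LOGS;"

# Retention settings: expire_logs_days (5.7) or binlog_expire_logs_seconds (8.0)
mysql -e "SHOW GLOBAL VARIABLES LIKE 'expire_logs_days'; SHOW GLOBAL VARIABLES LIKE 'binlog_expire_logs_seconds';"

# One-off cleanup; make sure all replicas and backups are past this point first
# mysql -e "PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;"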

Another query with an “anomaly” in the load graph is the top #3 one. Let’s isolate it and see what happened:

Query Analytics dashboard percona

The third step is to reduce the time range to one involving the event, so we can isolate it even more:

Query Analytics dashboard percona monitoring and management

The event happened between 8 AM and 9 AM. To confirm or rule out that this is an isolated event related only to this query, let's look again at all the queries running at that same moment.

So this is a generic situation, common to several queries. Most likely it was an event on the server that made queries stall.

By looking at the threads graph, we can confirm that hypothesis:

MySQL Active Client Threads

After some digging, the root cause was found to be a disk problem:

Query Analytics dashboard disk latency

It’s confirmed that it is not an issue with the query itself, so no need to “optimize” due to this spike.

In conclusion, with the new QAN dashboard available since PMM 2.10.0, finding query slowdowns is easier thanks to the Load graph that can give us context pretty fast.

Try Percona Monitoring and Management today, for free!

Oct 07, 2020

Our Approach to Percona Monitoring and Management Upgrade Testing


Hey Community! This is my first blog post, so let me introduce myself. My name is Vasyl Yurkovych. I am a QA Automation Engineer at Percona, and I have solid experience in software testing.

Software quality is one of the main focuses of Percona so we put a lot of energy into it (hopefully so you don’t have to!). Our biggest challenge in testing isn’t necessarily running the routines against the current version, but the fact that users can decide to upgrade at any time, from pretty much any supported version.  So I’d like to share with you our upgrade testing approach for Percona Monitoring and Management (PMM) in hopes that this might be useful to others who create software that users install!

PMM is released every month and our users do not reinstall each new version – they just perform the upgrade operation when there’s a compelling feature or fix. Also, these PMM instances are not “empty”; they are full of various settings, monitored DBs, etc.,  all of which need to be preserved.

Taking into account all of that, we want to make sure that: 

  • the user will not suffer after the upgrade and will enjoy the new version of PMM
  • user’s instances will still be under monitoring without missing vital data during the upgrade
  • settings will be preserved
  • new features will work as expected

We decided to automate this process and select critical automation scenarios to execute. They were split into 3 stages: Pre Upgrade, Upgrade, and Post Upgrade.

Pre Upgrade (UI and API Scenarios)

  1. “Fill” PMM with monitored instances. We add each supported DB type for monitoring. Along with this, we ensure that the monitoring status of each added instance is ”RUNNING”. For these scenarios, we use corresponding PMM API endpoints.
  2. Apply custom Settings.
  3. Check that the Upgrade widget on the Home Dashboard is present and contains an Update button, that the available PMM version shown is correct, and that the "What's new" link is there.

Upgrade

  1. To fully simulate a user’s behavior we use the UI upgrade option.

Post Upgrade (UI Scenarios Only)

  1. After the upgrade completes successfully, we check that the Upgrade widget on the Home Dashboard indicates that PMM is at the latest version.
  2. Also, we check the PMM Inventory to confirm that all the previously monitored DB instances exist and have “RUNNING” statuses.
  3. We check that Query Analytics Dashboard has corresponding DB filters (filters only exist when the DB specific queries exist).
  4. We confirm that all our custom Settings were properly upgraded and remain intact.

Also, we test upgrades from the older PMM versions, so we created a CI job that runs on the weekends and during the release testing phase. This gives us the ability to check the upgrade from any available version to the latest one. All we have to do is to specify the version we want to upgrade from. Check out this screenshot as an example of a failed upgrade.

Currently, we use this approach for testing Docker-based PMM Server upgrade, because PMM docker images are most commonly used. But we plan to implement the same upgrade job for OVF and AMI based PMM-Server soon. 

This is the final piece of our upgrade testing approach; it alerts us immediately if some version has a problem upgrading to the latest one and allows us to react right away.

While this may seem an obvious tactic to ensure software quality, it’s amazing how often we’ll discover something that only impacts a single version.  

You are now safe to upgrade your Percona Monitoring and Management version, as PMM CI is watching your back!

Sep 29, 2020

Using Security Threat Tool and Alertmanager in Percona Monitoring and Management


With version 2.9.1 of Percona Monitoring and Management (PMM) we delivered some new improvements to its Security Threat Tool (STT).

Aside from an updated user interface, you now have the ability to run STT checks manually at any time, instead of waiting for the normal 24-hour check cycle. This can be useful if, for example, you want to confirm an alert is gone after you fix the underlying issue. Moreover, you can now also temporarily mute (for 24 hours) alerts you may want to work on later.

But how do these actions work?

Alertmanager

In a previous article, we briefly explained how the STT back end publishes alerts to Alertmanager so they appear in the STT section of PMM.

Now, before we uncover the details of that, please bear in mind that PMM’s built-in Alertmanager is still under development. We do not recommend you use it directly for your own needs, at least not for now.

With that out of the way, let’s see the details of the interaction with Alertmanager.

To retrieve the current alerts, the interface calls an Alertmanager’s API, filtering for non-silenced alerts:

GET /alertmanager/api/v2/alerts?silenced=false[...]

This call returns a list of active alerts, which looks like this:

[
  {
    "annotations": {
      "description": "MongoDB admin password does not meet the complexity requirement",
      "summary": "MongoDB password is weak"
    },
    "endsAt": "2020-09-30T14:39:03.575Z",
    "startsAt": "2020-04-20T12:08:48.946Z",
    "labels": {
      "service_name": "mongodb-inst-rpl-1",
      "severity": "warning",
      ...
    },
    ...
  },
  ...
]

Active alerts have a startsAt timestamp at the current time or in the past, while the endsAt timestamp is in the future. The other properties contain descriptions and the severity of the issue the alert is about.

labels, in particular, uniquely identify a specific alert and are used by Alertmanager to deduplicate alerts. (There are also other "meta" properties, but they are out of the scope of this article.)

Force Check

Clicking on “Run DB checks” will trigger an API call to the PMM server, which will execute the checks workflow on the PMM back end (you can read more about it here). At the end of that workflow, alerts are sent to Alertmanager through a POST call to the same endpoint used to retrieve active alerts. The call payload has the same structure as shown above.

Note that while you could create alerts manually this way, that’s highly discouraged, since it could negatively impact STT alerts. If you want to define your own rules for Alertmanager, PMM can integrate with an external Alertmanager, independent of STT. You can read more in Percona Monitoring and Management, Meet Prometheus Alertmanager.

Silences

Alertmanager has the concept of Silences. To temporarily mute an alert, the front end generates a “silence” payload starting from the metadata of the alert the user wants to mute and calls the silence API on Alertmanager:

POST /alertmanager/api/v2/silences

An example of a silence payload:

{
  "matchers": [
    { "name": "service_name", "value": "mongodb-inst-rpl-1", "isRegex": false },
    { "name": "severity", "value": "warning", "isRegex": false },
    ...
  ],
  "startsAt": "2020-09-14T20:24:15Z",
  "endsAt": "2020-09-15T20:24:15Z",
  "createdBy": "someuser",
  "comment": "reason for this silence",
  "id": "a-silence-id"
}

As a confirmation of success, this API call will return a silenceID:

{ "silenceID": "1fcaae42-ec92-4272-ab6b-410d98534dfc" }

 

Conclusion

From this quick overview, you can hopefully understand how simple it is for us to deliver security checks. Alertmanager helps us a lot in simplifying the final stage of delivering security checks to you in a reliable way. It allows us to focus more on the checks we deliver and the way you can interact with them.

We’re constantly improving our Security Threat Tool, adding more checks and features to help you protect your organization’s valuable data. While we’ll try to make our checks as comprehensive as possible, we know that you might have very specific needs. That’s why for the future we plan to make STT even more flexible, adding scheduling of checks (since some need to run more/less frequently than others), disabling of checks, and even the ability to let you add your own checks! Keep following the latest releases as we continue to iterate on STT.

For now, let us know in the comments: what other checks or features would you like to see in STT? We love to hear your feedback!

Check out our Percona Monitoring and Management Demo site or download Percona Monitoring and Management today and give it a try!

Sep 28, 2020

3 Features I Love in Percona Monitoring and Management 2.10


Percona Monitoring and Management 2.10 shipped a few days ago, and while it is a minor release, it introduces some features which I absolutely love. In this blog post, I will talk about those features – what they are and why they are worth loving.

1. Query Search

This is a great feature if you have many queries and you want to find a particular query or set of queries quickly.  Want to find which DELETE queries cause the most load on the database?  Or maybe queries which access a particular table or column? You got it!

2. Crosshair in Query Analytics

PMM Crosshair in Query Analytics

While Grafana has had the Crosshair feature for ages, it was not available in Query Analytics until now.  When looking at multiple time series, this is a super helpful feature to visually align multiple data points corresponding to the same point in time.

Crosshair does not only work in the query view, but also in the metrics view, where it is especially helpful. Some of the metrics can be very volatile, so having the Crosshair view to see how events align is super helpful.

Crosshair in Query Analytics PMM

3. System Information

PMM System Information

Percona Monitoring and Management captures thousands of metrics about system configuration and operation parameters.  In some cases, though, it is more helpful to see the system configuration information on the same page, for example as pt-summary from Percona Toolkit. This is exactly what this new feature does – integrating pt-summary output into the Node Summary dashboard in Percona Monitoring and Management.

Which one of those features do you like best? 

Whatever feature it is, you can check it out in action on our Percona Monitoring and Management Demo site or learn more information about Percona Monitoring and Management including how to easily install or upgrade it.

Sep 23, 2020

5 Cool Features in Percona Monitoring and Management You Should Try!


Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL, MongoDB, and PostgreSQL databases, along with load balancing tools like ProxySQL. It is also 'cloud ready', meaning it has support for monitoring DBaaS offerings (Amazon RDS, Aurora, and more). You can run PMM in your own environment for maximum security and reliability. The biggest strength of PMM is that it is highly customizable, which we will see later in this blog.

The PMM 2 release introduces a number of enhancements and additional features. Here are some improvements and features which I think are cool!

1. New Security Threat Tool

The new Security Threat Tool aims to advise PMM users about security-related problems they might have on their databases.

  • As PMM already reaches into databases for performance monitoring, it makes sense for it to monitor database security as well. This tool will notify you about users without a password on the servers, MongoDB authentication being disabled, new MySQL/MongoDB versions being available, and more.
  • To enable this feature, go to the Failed security checks panel on the PMM Homepage, click on Security Threat Tool, enable it, and 'Apply changes'.
  • Once enabled, the box shows a count of the number of failed checks, divided into Critical, Major, and Trivial.
  • For example, in the image, the Security threat tool is enabled so we can see the result of the checks:

  • After clicking over these checks, we can see more details, as below for the one Critical and one Major failed check. Resolve these as per preference to secure your database.

2. Labeling

Labeling helps to easily group instances and review these groups together.  You do this by tagging the servers with Standard or Custom labels.

  • Some of the standard labels available are Environments, Clusters, Replication Sets, Region.
  • This can be configured while adding the server for monitoring with pmm-admin, using flags such as --environment='', --cluster='', --replication-set='', and so on.
  • You can also add custom labels by specifying a key-value pair with the flag --custom-labels='Key=Value'. For example, custom labels can be set as 'DC=Asia', 'Role=Reporting', 'Role=OLTP', or anything that fits your topology (see the sketch after this list).
  • These labels can be used in filtering in QAN as well.
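For illustration, here is a hedged sketch of registering a MySQL service with standard and custom labels; the service name, address, and credentials are placeholders, so check pmm-admin add mysql --help for the exact flags your version supports:

# Hypothetical example: a production MySQL instance tagged by environment, cluster, and custom labels
pmm-admin add mysql \
  --username=pmm --password=secret \
  --environment=production \
  --cluster=orders-cluster \
  --replication-set=rs1 \
  --custom-labels='DC=Asia,Role=OLTP' \
  mysql-prod-1 10.0.0.12:3306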

  • Important events related to the application can now be marked with Annotations. Some events in the application like upgrades, patches, may impact the database. Annotations visualize these events on each dashboard of the PMM Server so that you can correlate any performance changes on the database with these events.
  • Annotations can be added with pmm-admin annotate <--tags> command on PMM Client, and passing it text which explains what event the annotation should represent. Below is an example where we can see an event, ‘Upgrade to v1.2’ represented as a vertical dotted line on the graph.

3. Query Analytics (QAN)

QAN helps to ensure database queries are executed as expected, in the shortest time possible. You can identify queries causing problems and review detailed metrics related to those queries here.

  • With the new and improved QAN dashboard, you can now add multiple columns to the Query Analytics table with the Add column button, which will also show you the list of columns available. These are tagged with Service names – MySQL, MongoDB, PostgreSQL. Columns can be sorted in ascending or descending order.
  • Some of the examples for additional columns are – Query Count with errors, InnoDB IO read ops, No Index Used(MySQL), Docs Scanned, Returned (MongoDB), Shared Blocks Read, Written, Dirtied (PostgreSQL).
  • The query activity can now be visualized from multiple dimensions, not just the query pattern. As shown in the image, query count and load can also be viewed for Database, User, Client Hosts, etc. This will help to identify an increase in traffic from a particular user or a host, for example.

  • With the new release of PMM 2.10, you now have a Search by bar beside the dimensions as seen in the above image. It gives you the flexibility to limit the view of queries containing only the specified keywords entered in the search bar. The Search by can be used for other dimensions as well, like database, username.
  • Hovering over the sparkline now shows the load and timestamp for that particular time under the cursor.

  • The improvements will help to look for users causing the most activity, busiest schema, most active incoming client hosts, and of course problematic queries.

4. New Filter Panel for QAN

The new Filter Panel for the QAN dashboard allows you to see all your labels and gives the capability to select multiple items for filtering.

  • The Filter Panel helps filter servers by standard or custom labels like Environment, Cluster, or individual nodes for faster troubleshooting.
  • Filters are listed as per category in the Filter panel – Service type, Cluster, Replication Set, database, and even users.
  • Selecting one reduces the overview list to those items matching the filter.
  • Together with the QAN improvements mentioned before and this filter panel, you can identify users, servers that are experiencing unexpected traffic or load across a logical grouping of servers.

 

 

5. Easily Remove Services or Nodes From the PMM Inventory Dashboard

The PMM Inventory dashboard lists all Nodes, Agents, and Services that are registered against the PMM Server.

  • Now, you can easily remove a service, node, or agent from this dashboard directly, unlike older versions. To remove a node/service, go to the PMM Inventory under the PMM tab in the top right corner.
  • On the page, there are separate tabs for Service, Agents, and Nodes. Select the tab you wish to remove from, then select the name of the service or node as required.  For example, here, I wish to delete the db2node-mysql service monitoring. So, I just checked the relevant box and clicked Delete in the right corner and got the confirmation message, clicked on Force mode, and done!

  • This will stop the monitoring of the service and its name will no longer appear under the list of monitored services. This can be verified from the command line after logging in to the server and executing pmm-admin list. The mysqld_exporter process would no longer be running.
  • We can also add and modify instances with PMM API. It eases deployments of large fleets of servers through scripting against the Administrative API.

Along with this, you can also customize PMM and extend the list of available metrics. Check out the Percona blogs below by Daniel Guzmán Burgos, Carlos Salguero, and Vadim Yalovets on these cool customizations!

Extend Metrics for PMM with textfile collectors
PMM’s Custom Queries in Action
Running Custom MySQL Queries in PMM2
Grafana Plugins and PMM

Takeaways

PMM is a best-of-breed open source database monitoring and management solution which helps you focus on optimizing database performance with better observability in a secure way.

For PMM install instructions, see Installing PMM Server and Installing PMM client.

For a full list of new features and bug fixes included in PMMv2, see our release notes and for more information on Percona Monitoring and Management, visit our PMM webpage.

Sep 22, 2020

New MongoDB Exporter Released with Percona Monitoring and Management 2.10.0


With Percona Monitoring and Management (PMM) 2.10.0, Percona is releasing a new MongoDB exporter for Prometheus. It is a complete rewrite from scratch with a totally new approach to collect and expose metrics from MongoDB diagnostic commands.

The MongoDB exporter in the 0.11.x branch exposes only a static list of handpicked metrics with custom names and labels. The new exporter uses a totally different approach: it exposes ALL available metrics returned by MongoDB internal diagnostic commands and the metric naming (or renaming) follows concrete rules that apply the same for all metrics.

For example, if we run

db.getSiblingDB('admin').runCommand({"getDiagnosticData": 1});

 the command returns a structure that looks like this:

{
     "data" : {
         "start" : ISODate("2020-08-23T22:25:26Z"),
         "serverStatus" : {
             "start" : ISODate("2020-08-23T22:25:26Z"),
             "host" : "f9cd25606ada",
             "version" : "4.2.8",
             "process" : "mongod",
             "pid" : NumberLong(1),
             "uptime" : 186327,
             "uptimeMillis" : NumberLong(186327655),
             "uptimeEstimate" : NumberLong(186327),
             "localTime" : ISODate("2020-08-23T22:25:26Z"),
             "asserts" : {
                 "regular" : 0,
                 "warning" : 0,
                 "msg" : 0,
                 "user" : 62,
                 "rollovers" : 0
             },
             "connections" : {
                 "current" : 25,
                 "available" : 838835,
                 "totalCreated" : 231,
                 "active" : 1
             },
             "electionMetrics" : {
                 "stepUpCmd" : {
                     "called" : NumberLong(0),
                     "successful" : NumberLong(0)
                 },
                 "priorityTakeover" : {
                     "called" : NumberLong(0),
                     "successful" : NumberLong(0)
                 },
                 "catchUpTakeover" : {
                     "called" : NumberLong(0),
                     "successful" : NumberLong(0)
                 },

In the new exporter, the approach to exposing all metrics is to traverse the result of diagnostic commands like getDiagnosticData, looking for values to expose. In this case, we have serverStatus; inside it we find asserts, and inside asserts there are metrics to expose (because they are numbers): regular, warning, msg, user, and rollovers. In this method of metric gathering, the metric name is the composition of the metric keys we had to follow; for example, it will produce a metric like this: serverStatus_asserts_user: 62.

If we open the web interface for the exporter at http://localhost:9216, we won’t find that metric I just mentioned. Why? Because to make the metric names shorter and to be able to group some metrics under the same name, the new exporter implements a metric name prefix rename and we are converting some metric suffixes to labels.

Prefix Renaming Table

The string mongodb_ is prepended to all metrics as the Prometheus job name. Unlike the < v2.0 mongodb_exporter, there won't be mongod vs. mongos in the job name.

Metric Prefix → New Prefix
serverStatus.wiredTiger.transaction → ss_wt_txn
serverStatus.wiredTiger → ss_wt
serverStatus → ss
replSetGetStatus → rs
systemMetrics → sys
local.oplog.rs.stats.wiredTiger → oplog_stats_wt
local.oplog.rs.stats → oplog_stats
collstats_storage.wiredTiger → collstats_storage_wt
collstats_storage.indexDetails → collstats_storage_idx
collStats.storageStats → collstats_storage
collStats.latencyStats → collstats_latency

Prefix Labeling Table:

Metric Prefix → New Label
collStats.storageStats.indexDetails. → index_name
globalLock.activeQueue. → count_type
globalLock.locks. → lock_type
serverStatus.asserts. → assert_type
serverStatus.connections. → conn_type
serverStatus.globalLock.currentQueue. → count_type
serverStatus.metrics.commands. → cmd_name
serverStatus.metrics.cursor.open. → csr_type
serverStatus.metrics.document. → doc_op_type
serverStatus.opLatencies. → op_type
serverStatus.opReadConcernCounters. → concern_type
serverStatus.opcounters. → legacy_op_type
serverStatus.opcountersRepl. → legacy_op_type
serverStatus.transactions.commitTypes. → commit_type
serverStatus.wiredTiger.concurrentTransactions. → txn_rw_type
serverStatus.wiredTiger.perf. → perf_bucket
systemMetrics.disks. → device_name

Because of the metric renaming and labeling, the metric serverStatus.asserts.<metric> will become ss_asserts, and the original metric name will be used as a label:

# HELP mongodb_ss_asserts serverStatus.asserts.
# TYPE mongodb_ss_asserts untyped
mongodb_ss_asserts{assert_type="msg"} 0
mongodb_ss_asserts{assert_type="regular"} 0
mongodb_ss_asserts{assert_type="rollovers"} 0
mongodb_ss_asserts{assert_type="user"} 62
mongodb_ss_asserts{assert_type="warning"} 0

Advantages Of The New Exporter

Since the new exporter will automatically collect all available metrics, it is now possible to collect new metrics in the PMM dashboards and as new MongoDB versions expose new metrics, they will automatically become available without the need to manually add metrics and upgrade the exporter. Also, since there are clear rules for metric renaming and how labels are created, metric names are more consistent even when new metrics are added.

How It Works

As mentioned previously, this new exporter exposes all metrics by traversing the JSON output of each MongoDB diagnostic command.
Those commands are:

{"getDiagnosticData": 1}

which includes:
  • serverStatus
  • replSetGetStatus (will be fetched separately if MongoDB <= v3.6)
  • Oplog collection stats
  • OS system metrics: memory, CPU, disk usage, netstat, vmstat

{"replSetGetStatus": 1}

{"serverStatus": 1}

It is also possible to specify a list of database.collection pairs to get stats for collection usage and indexes by running these commands for each collection:

{"$collStats": {"latencyStats": {"histograms": true}}}

{"indexStats"}

Enabling Compatibility Mode

The new exporter has a parameter, --compatible-mode, which enables a special compatibility mode. In this mode, the old exporter metrics are also exposed along with the new metrics. This way, existing dashboards should work without requiring any change, and it is the default mode in PMM 2.10.0.

Example: in compatibility mode, all metrics having the mongodb_ss_wt_txn_transaction_checkpoint prefix and the min_time_msecs or max_time_msecs suffix, like

# HELP mongodb_ss_wt_txn_transaction_checkpoint_min_time_msecs serverStatus.wiredTiger.transaction.
# TYPE mongodb_ss_wt_txn_transaction_checkpoint_min_time_msecs untyped
mongodb_ss_wt_txn_transaction_checkpoint_min_time_msecs 14

will also be exposed using the old naming convention as

# HELP mongodb_mongod_wiredtiger_transactions_checkpoint_milliseconds mongodb_mongod_wiredtiger_transactions_checkpoint_milliseconds
# TYPE mongodb_mongod_wiredtiger_transactions_checkpoint_milliseconds untyped
mongodb_mongod_wiredtiger_transactions_checkpoint_milliseconds{type="max"} 71
mongodb_mongod_wiredtiger_transactions_checkpoint_milliseconds{type="min"} 14

and the suffix is used as a label.

Debugging

When starting the exporter with --debug, it will output the result of each diagnostic command to standard error. This makes it easier to check the values returned by each command and to verify the metric renaming and values.

Releases

This exporter is going to be released as part of PMM starting with version 2.10.0 and will also be released as an independent exporter in the repo’s release page.

Currently, the exporter resides in the v0.20.0 branch and the old exporter is in the master branch, but exporter v0.11 will be moved to the main branch and master will be used for the new exporter code.

How to Contribute

Using the Makefile

In the main directory, there is a Makefile to help you with development and testing tasks. Use make without parameters to get help. These are the available options:

Command Description
init Install linters
build Build the binaries
format Format source code
check Run checks/linters
check-license Check license in headers.
help Display this help message.
test Run all tests (need to start the sandbox first)
test-cluster Starts MongoDB test cluster. Use env var TEST_MONGODB_IMAGE to set flavor and version.
Example: TEST_MONGODB_IMAGE=mongo:3.6 make test-cluster
test-cluster-clean Stops MongoDB test cluster

Initializing the Development Environment

First, you need to have Go and Docker installed on your system, and then in order to install tools to format, test, and build the exporter, you need to run this command:

make init

It will install goimports, goreleaser, golangci-lint, and reviewdog.

Testing

Starting the Sandbox

The testing sandbox starts MongoDB instances as follows:

  • 3 instances for shard 1 at ports 17001, 17002, 17003
  • 3 instances for shard 2 at ports 17004, 17005, 17006
  • 3 config servers at ports 17007, 17008, 17009
  • 1 mongos server at port 17000
  • 1 stand-alone instance at port 27017

All instances are currently running without user and password so, for example, to connect to the mongos, you can just use:

mongo mongodb://127.0.0.1:17000/admin

The sandbox can be started with the provided Makefile using make test-cluster, and it can be stopped using make test-cluster-clean.

Running Tests

To run the unit tests, just run make test.

Formatting Code

Before submitting code, please run make format to format the code according to the standards.

Known Issues

  • Replicaset lag sometimes shows strange values.
  • Elements that use the following metrics have been removed from dashboards:
    mongodb_mongod_rocksdb_*
    mongodb_mongod_locks_time_locked_global_microseconds_total
    mongodb_mongod_durability_time_milliseconds_sum
    mongodb_mongod_durability_time_milliseconds_count

So, these dashboards have been updated:

  • dashboard "MongoDB RocksDB Details" -> removed the dashboard completely
  • dashboard "MongoDB MMAPv1 Details", element "MMAPv1 Journaling Time" -> removed the element from the dashboard
  • dashboard "MongoDB MMAPv1 Details", element "MMAPv1 Lock Ratios", parameter "Lock (pre-3.2 only)" -> removed the chart from the element on the dashboard

Final Thoughts

This new exporter shouldn’t affect any existing dashboard since the compatibility mode exposes all old-style metrics along with the new ones. We deprecated only a few metrics that were already meaningless because they are only valid and exposed for old MongoDB versions like mongodb_mongod_global_lock_ratio and mongodb_version_info.

At Percona, we built this new MongoDB exporter with the idea in mind of having an exporter capable of exposing all available metrics, with no hard-coded metrics and not tied to any particular MongoDB version. We would like to encourage users to help us by using this version and providing feedback. We also accept (and encourage) code fixes and improvements.

Also, learn more about the new Percona Customer Portal rolling out starting with the 2.10.0 release of Percona Monitoring and Management.
