Nov
01
2018
--

Percona Monitoring and Management (PMM) 1.16.0 Is Now Available

Percona Monitoring and Management

PMM (Percona Monitoring and Management) is a free and open-source platform for managing and monitoring MySQL, MongoDB, and PostgreSQL performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL® and MongoDB® servers to ensure that your data works as efficiently as possible.

Percona Monitoring and Management

While much of the team is working on longer-term projects, we were able to provide the following feature:

  • MySQL and PostgreSQL support for all cloud DBaaS providers – Use PMM Server to gather Metrics and Queries from remote instances!
  • Query Analytics + Metric Series – See Database activity alongside queries
  • Collect local metrics using node_exporter + textfile collector

We addressed 11 new features and improvements, and fixed 21 bugs.

MySQL and PostgreSQL support for all cloud DBaaS providers

You’re now able to connect PMM Server to your MySQL and PostgreSQL instances, whether they run in a cloud DBaaS environment, or you simply want Database metrics without the OS metrics.  This can help you get up and running with PMM using minimal configuration and zero client installation, however be aware there are limitations – there won’t be any host-level dashboards populated for these nodes since we don’t attempt to connect to the provider’s API nor are we granted access to the instance in order to deploy an exporter.

How to use

Using the PMM Add Instance screen, you can now add instances from any cloud provider (AWS RDS and Aurora, Google Cloud SQL for MySQL, Azure Database for MySQL) and benefit from the same dashboards that you are already accustomed to. You’ll be able to collect Metrics and Queries from MySQL, and Metrics from PostgreSQL.  You can add remote instances by selecting the PMM Add Instance item in a PMM group of the system menu:

https://github.com/percona/pmm/blob/679471210d476a5e98d26a632318f1680cfd98a2/doc/source/.res/graphics/png/metrics-monitor.menu.pmm1.png?raw=true

where you will then have the opportunity to add a Remote MySQL or Remote PostgreSQL instance:

You’ll add the instance by supplying just the Hostname, database Username and Password (and optional Port and Name):

metrics-monitor.add-remote-mysql-instance.png

Also new as part of this release is the ability to display nodes you’ve added, on screen RDS and Remote Instances:

metrics-monitor.add-rds-or-remote-instance1.png

Server activity metrics in the PMM Query Analytics dashboard

The Query Analytics dashboard now shows a summary of the selected host and database activity metrics in addition to the top ten queries listed in a summary table.  This brings a view of System Activity (CPU, Disk, and Network) and Database Server Activity (Connections, Queries per Second, and Threads Running) to help you better pinpoint query pileups and other bottlenecks:

https://raw.githubusercontent.com/percona/pmm/86e4215a58e788a8ec7cb1ebe679e1593c484078/doc/source/.res/graphics/png/query-analytics.png

Extending metrics with node_exporter textfile collector

While PMM provides an excellent solution for system monitoring, sometimes you may have the need for a metric that’s not present in the list of node_exporter metrics out of the box. There is a simple method to extend the list of available metrics without modifying the node_exporter code. It is based on the textfile collector.  We’ve enabled this collector as on by default, and is deployed as part of linux:metrics in PMM Client.

The default directory for reading text files with the metrics is /usr/local/percona/pmm-client/textfile-collector, and the exporter reads files from it with the .prom extension. By default it contains an example file example.prom which has commented contents and can be used as a template.

You are responsible for running a cronjob or other regular process to generate the metric series data and write it to this directory.

Example – collecting docker container information

This example will show you how to collect the number of running and stopped docker containers on a host. It uses a crontab task, set with the following lines in the cron configuration file (e.g. in /etc/crontab):

*/1* * * *     root   echo -n "" > /tmp/docker_all.prom; docker ps -a -q | wc -l | xargs echo node_docker_containers_total >> /usr/local/percona/pmm-client/docker_all.prom;
*/1* * * *     root   echo -n "" > /tmp/docker_running.prom; docker ps | wc -l | xargs echo node_docker_containers_running_total >> /usr/local/percona/pmm-client/docker_running.prom;

The result of the commands is placed into the docker_all.prom and docker_running.prom files and read by exporter and will create two new metric series named node_docker_containers_total and node_docker_containers_running_total, which we’ll then plot on a graph:

pmm 1.16

New Features and Improvements

  • PMM-3195 Remove the light bulb
  • PMM-3194 Change link for “Where do I get the security credentials for my Amazon RDS DB instance?”
  • PMM-3189 Include Remote MySQL & PostgreSQL instance logs into PMM Server logs.zip system
  • PMM-3166 Convert status integers to strings on ProxySQL Overview Dashboard – Thanks,  Iwo Panowicz for  https://github.com/percona/grafana-dashboards/pull/239
  • PMM-3133 Include Metric Series on Query Analytics Dashboard
  • PMM-3078 Generate warning “how to troubleshoot postgresql:metrics” after failed pmm-admin add postgresql execution
  • PMM-3061 Provide Ability to Monitor Remote MySQL and PostgreSQL Instances
  • PMM-2888 Enable Textfile Collector by Default in node_exporter
  • PMM-2880 Use consistent favicon (Percona logo) across all distribution methods
  • PMM-2306 Configure EBS disk resize utility to run from crontab in PMM Server
  • PMM-1358 Improve Tooltips on Disk Space Dashboard – thanks, Corrado Pandiani for texts

Fixed Bugs

  • PMM-3202 Cannot add remote PostgreSQL to monitoring without specified dbname
  • PMM-3186 Strange “Quick ranges” tag appears when you hover over documentation links on PMM Add Instance screen
  • PMM-3182 Some sections for MongoDB are collapsed by default
  • PMM-3171 Remote RDS instance cannot be deleted
  • PMM-3159 Problem with enabling RDS instance
  • PMM-3127 “Expand all” button affects JSON in all queries instead of the selected one
  • PMM-3126 Last check displays locale format of the date
  • PMM-3097 Update home dashboard to support PostgreSQL nodes in Environment Overview
  • PMM-3091 postgres_exporter typo
  • PMM-3090 TLS handshake error in PostgreSQL metric
  • PMM-3088 It’s possible to downgrade PMM from Home dashboard
  • PMM-3072 Copy to clipboard is not visible for JSON in case of long queries
  • PMM-3038 Error adding MySQL queries when options for mysqld_exporters are used
  • PMM-3028 Mark points are hidden if an annotation isn’t added in advance
  • PMM-3027 Number of vCPUs for RDS is displayed incorrectly – report and proposal from Janos Ruszo
  • PMM-2762 Page refresh makes Search condition lost and shows all queries
  • PMM-2483 LVM in the PMM Server AMI is poorly configured/documented – reported by Olivier Mignault  and lot of people involved.  Special thanks to  Chris Schneider for checking with fix options
  • PMM-2003 Delete all info related to external exporters on pmm-admin list output

How to get PMM Server

PMM is available for installation using three methods:

Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

Oct
23
2018
--

Reclaiming space on your Docker PMM server deployment

reclaiming space Docker PMM

reclaiming space Docker PMMRecently we had a customer that had issues with a filled disk on the server hosting their Docker pmm-server environment. They were not able to access the web UI, or even stop the pmm-server container because they had filled the /var/ mount point.

Setting correct expectations

The best way to avoid these kinds of issues in the first place is to plan ahead, and to know exactly with what you are dealing with in terms of disk space requirements. Michael Coburn has written a great blogpost on this matter:

https://www.percona.com/blog/2017/05/04/how-much-disk-space-should-i-allocate-for-percona-monitoring-and-management/

We are now using Prometheus version 2 inside PMM server, so you should take it with a pinch of salt. On the other hand, it will show how you should plan ahead, and think about the “steady state” disk usage, so it’s a good read.

That’s the first step to make sure you won’t get into trouble down the line. But, what happens if you are already in trouble? We’ll see two quick ways that may help reclaiming space.

Before anything else, you should stop any and all PMM clients running, so that you don’t have a race condition after recovering some space, in which metrics coming from the running clients will fill up whatever disk you had freed.

If

pmm-admin stop --all

  won’t work, you can stop the services manually, or even manually kill the running processes as a last resort:

shell> systemctl list-unit-files | grep enabled | grep pmm | awk '{print $1}' | xargs -n 1 systemctl stop
shell> ps ax | egrep "exporter|qan-agent|pmm" | grep -v "ssh" | awk '{print $1}' | xargs kill

Removing unused containers

In order for the next steps to be as effective as possible, make sure there are no unused containers running, or stopped:

shell> docker ps -a

If you see any container that you know you don’t need anymore:

shell> docker stop <container_name>
shell> docker rm -v <container_name>

WARNING! Do not remove the pmm-data container!

Reclaiming space from unused Docker images

After you are done cleaning unused containers, we can move forward with removing unused images. Unless you are manually building your own Docker images, it’s really easy to get them again if needed, so you shouldn’t be afraid of deleting the ones that are not being used. In fact, you don’t need to explicitly download the images. By simply running

docker run … image_name

  Docker will automatically do it for you if it’s not found locally.

shell> docker image prune -a
WARNING! This will remove all images without at least one container associated to them.
Are you sure you want to continue? [y/N] y
Deleted Images:
...
Total reclaimed space: 3.97GB

Not too bad, we just reclaimed 4Gb of disk space. This alone should be enough to restart the Docker service and have the pmm-server container back up. But we want more, just because we can ?

Reclaiming space from orphaned Docker volumes

By default, when removing a container (with

docker rm

 ) Docker will not delete the associated volumes, unless you use the -v switch as we did above. This will mean that, unless you were aware of this fact, you will probably have some other gigabytes worth of data occupying disk space. We can easily do this with the volume prune command:

shell> docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Deleted Volumes:
...
Total reclaimed space: 115GB

Yeah… that’s some significant amount of disk space we just reclaimed back! Again, make sure you don’t care about any of the volumes from your past containers to be able to do this safely, since there is no turning back from this, obviously.

For earlier versions of Docker where this command is not available, you can check this link.

Planning ahead

As mentioned before, you should now revisit Michael’s blogpost, and set the metrics retention and queries retention variables to whatever makes sense for your environment. Even if you plan ahead, you may not be counting on the additional variable overhead of images and orphaned volumes, so you may want to (warning: shameless plug for my own blogpost ahead) use different mount points for your PMM deployment, and avoid using the shared /var/lib/docker/ mount point for it.

PMM also includes a Disk Space usage dashboard, that you can use to monitor this.

Don’t forget to start back up your PMM clients, and continue to monitor them 24×7!

Photo by Andrew Wulf on Unsplash

Oct
15
2018
--

Identifying High Load Spots in MySQL Using Slow Query Log and pt-query-digest

pt-query-digest MySQL slow queries

pt-query-digest MySQL slow queriespt-query-digest is one of the most commonly used tool when it comes to query auditing in MySQL®. By default, pt-query-digest reports the top ten queries consuming the most amount of time inside MySQL. A query that takes more time than the set threshold for completion is considered slow but it’s not always true that tuning such queries makes them faster. Sometimes, when resources on server are busy, it will impact every other operation on the server, and so will impact queries too. In such cases, you will see the proportion of slow queries goes up. That can also include queries that work fine in general.

This article explains a small trick to identify such spots using pt-query-digest and the slow query log. pt-query-digest is a component of Percona Toolkit, open source software that is free to download and use.

Some sample data

Let’s have a look at sample data in Percona Server 5.7. Slow query log is configured to capture queries longer than ten seconds with no limit on rate of logging, which is generally considered to throttle the IO that comes while writing slow queries to the log file.

mysql> show variables like 'log_slow_rate%' ;
+---------------------+---------+
| Variable_name       | Value    |
+---------------------+---------+
| log_slow_rate_limit | 1       |  --> Log all queries
| log_slow_rate_type  | session |
+---------------------+---------+
2 rows in set (0.00 sec)
mysql> show variables like 'long_query_time' ;
+-----------------+-----------+
| Variable_name   | Value     |
+-----------------+-----------+
| long_query_time | 10.000000 |  --> 10 seconds
+-----------------+-----------+
1 row in set (0.01 sec)

When I run pt-query-digest, I see in the summary report that 80% of the queries have come from just three query patterns.

# Profile
# Rank Query ID                      Response time    Calls R/Call   V/M
# ==== ============================= ================ ===== ======== =====
#    1 0x7B92A64478A4499516F46891... 13446.3083 56.1%   102 131.8266  3.83 SELECT performance_schema.events_statements_history
#    2 0x752E6264A9E73B741D3DC04F...  4185.0857 17.5%    30 139.5029  0.00 SELECT table1
#    3 0xAFB5110D2C576F3700EE3F7B...  1688.7549  7.0%    13 129.9042  8.20 SELECT table2
#    4 0x6CE1C4E763245AF56911E983...  1401.7309  5.8%    12 116.8109 13.45 SELECT table4
#    5 0x85325FDF75CD6F1C91DFBB85...   989.5446  4.1%    15  65.9696 55.42 SELECT tbl1 tbl2 tbl3 tbl4
#    6 0xB30E9CB844F2F14648B182D0...   420.2127  1.8%     4 105.0532 12.91 SELECT tbl5
#    7 0x7F7C6EE1D23493B5D6234382...   382.1407  1.6%    12  31.8451 70.36 INSERT UPDATE tbl6
#    8 0xBC1EE70ABAE1D17CD8F177D7...   320.5010  1.3%     6  53.4168 67.01 REPLACE tbl7
#   10 0xA2A385D3A76D492144DD219B...   183.9891  0.8%    18  10.2216  0.00 UPDATE tbl8
#      MISC 0xMISC                     948.6902  4.0%    14  67.7636   0.0 <10 ITEMS>

Query #1 is generated by the qan-agent from PMM and runs approximately once a minute. These results will be handed over to PMM Server. Similarly queries #2 & #3 are pretty simple. I mean, they scan just one row and will return either zero or one rows. They also use indexing, which makes me think that this is not because of something just with in MySQL. I wanted to know if I could find any common aspect of all these occurrences.

Let’s take a closer look at the queries recorded in slow query log.

# grep -B3 DIGEST mysql-slow_Oct2nd_4th.log
....
....
# User@Host: ztrend[ztrend] @ localhost []  Id: 6431601021
# Query_time: 139.279651  Lock_time: 64.502959 Rows_sent: 0  Rows_examined: 0
SET timestamp=1538524947;
SELECT DIGEST, CURRENT_SCHEMA, SQL_TEXT FROM performance_schema.events_statements_history;
# User@Host: ztrend[ztrend] @ localhost []  Id: 6431601029
# Query_time: 139.282594  Lock_time: 83.140413 Rows_sent: 0  Rows_examined: 0
SET timestamp=1538524947;
SELECT DIGEST, CURRENT_SCHEMA, SQL_TEXT FROM performance_schema.events_statements_history;
# User@Host: ztrend[ztrend] @ localhost []  Id: 6431601031
# Query_time: 139.314228  Lock_time: 96.679563 Rows_sent: 0  Rows_examined: 0
SET timestamp=1538524947;
SELECT DIGEST, CURRENT_SCHEMA, SQL_TEXT FROM performance_schema.events_statements_history;
....
....

Now you can see two things.

  • All of them have same Unix timestamp
  • All of them were spending more than 70% of their execution time waiting for some lock.

Analyzing the data from pt-query-digest

Now I want to check if I can group the count of queries based on their time of execution. If there are multiple queries at a given time captured into the slow query log, time will be printed for the first query but not all. Fortunately, in this case I can rely on the Unix timestamp to compute the counts. The timestamp is gets captured for every query. Luckily, without a long struggle, a combination of grep and awk utilities have displayed what I wanted to display.

# grep -A1 Query_time mysql-slow_Oct2nd_4th.log | grep SET | awk -F "=" '{ print $2 }' | uniq -c
2   1538450797;
1   1538524822;
3   1538524846;
7   1538524857;
167 1538524947;   ---> 72% of queries have happened at this timestamp.
1   1538551813;
3   1538551815;
6   1538602215;
1   1538617599;
33  1538631015;
1   1538631016;
1   1538631017;

You can use the command below to check the regular date time format of a given timestamp. So, Oct 3, 05:32 is when there was something wrong on the server:

# date -d @1538524947
Wed Oct 3 05:32:27 IST 2018

Query tuning can be carried out alongside this, but identifying such spots helps avoiding spending time on query tuning where badly written queries are not the problem. Having said that, from this point, further troubleshooting may take different sub paths such as checking log files at that particular time, looking at CPU reports, reviewing past pt-stalk reports if set up to run in the background, and dmesg etc. This approach is useful for identifying at what time (or time range) MySQL was more stressed just using slow query log when no robust monitoring tools, like Percona Monitoring and Management (PMM), are deployed.

Using PMM to monitor queries

If you have PMM, you can review Query Analytics to see the topmost slow queries, along with details like execution counts, load etc. Below is a sample screen copy for your reference:

Slow query log from PMM dashboard

NOTE: If you use Percona Server for MySQL, slow query log can report time in micro seconds. It also supports extended logging of  other statistics about query execution. These provide extra power to see the insights of query processing. You can see more information about these options here.

Oct
10
2018
--

Percona Monitoring and Management (PMM) 1.15.0 Is Now Available

Percona Monitoring and Management

Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL® and MongoDB® performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL® and MongoDB® servers to ensure that your data works as efficiently as possible.

Percona Monitoring and Management

This release offers two new features for both the MySQL Community and Percona Customers:

  • MySQL Custom Queries – Turn a SELECT into a dashboard!
  • Server and Client logs – Collect troubleshooting logs for Percona Support

We addressed 17 new features and improvements, and fixed 17 bugs.

MySQL Custom Queries

In 1.15 we are introducing the ability to take a SQL SELECT statement and turn the result set into metric series in PMM.  The queries are executed at the LOW RESOLUTION level, which by default is every 60 seconds.  A key advantage is that you can extend PMM to profile metrics unique to your environment (see users table example), or to introduce support for a table that isn’t part of PMM yet. This feature is on by default and only requires that you edit the configuration file and use vaild YAML syntax.  The configuration file is in /usr/local/percona/pmm-client/queries-mysqld.yml.

Example – Application users table

We’re going to take a fictional MySQL users table that also tracks the number of upvotes and downvotes, and we’ll convert this into two metric series, with a set of seven labels, where each label can also store a value.

Browsing metrics series using Advanced Data Exploration Dashboard

Lets look at the output so we understand the goal – take data from a MySQL table and store in PMM, then display as a metric series.  Using the Advanced Data Exploration Dashboard you can review your metric series. Exploring the metric series  app1_users_metrics_downvotes we see the following:

PMM Advanced Data Exploration Dashboard

MySQL table

Lets assume you have the following users table that includes true/false, string, and integer types.

SELECT * FROM `users`
+----+------+--------------+-----------+------------+-----------+---------------------+--------+---------+-----------+
| id | app  | user_type    | last_name | first_name | logged_in | active_subscription | banned | upvotes | downvotes |
+----+------+--------------+-----------+------------+-----------+---------------------+--------+---------+-----------+
|  1 | app2 | unprivileged | Marley    | Bob        |         1 |                   1 |      0 |     100 |        25 |
|  2 | app3 | moderator    | Young     | Neil       |         1 |                   1 |      1 |     150 |        10 |
|  3 | app4 | unprivileged | OConnor   | Sinead     |         1 |                   1 |      0 |      25 |        50 |
|  4 | app1 | unprivileged | Yorke     | Thom       |         0 |                   1 |      0 |     100 |       100 |
|  5 | app5 | admin        | Buckley   | Jeff       |         1 |                   1 |      0 |     175 |         0 |
+----+------+--------------+-----------+------------+-----------+---------------------+--------+---------+-----------+

Explaining the YAML syntax

We’ll go through a simple example and mention what’s required for each line.  The metric series is constructed based on the first line and appends the column name to form metric series.  Therefore the number of metric series per table will be the count of columns that are of type GAUGE or COUNTER.  This metric series will be called app1_users_metrics_downvotes:

app1_users_metrics:                                 ## leading section of your metric series.
  query: "SELECT * FROM app1.users"                 ## Your query. Don't forget the schema name.
  metrics:                                          ## Required line to start the list of metric items
    - downvotes:                                    ## Name of the column returned by the query. Will be appended to the metric series.
        usage: "COUNTER"                            ## Column value type.  COUNTER will make this a metric series.
        description: "Number of upvotes"            ## Helpful description of the column.

Full queries-mysqld.yml example

Each column in the SELECT is named in this example, but that isn’t required, you can use a SELECT * as well.  Notice the format of schema.table for the query is included.

---
app1_users_metrics:
  query: "SELECT app,first_name,last_name,logged_in,active_subscription,banned,upvotes,downvotes FROM app1.users"
  metrics:
    - app:
        usage: "LABEL"
        description: "Name of the Application"
    - user_type:
        usage: "LABEL"
        description: "User's privilege level within the Application"
    - first_name:
        usage: "LABEL"
        description: "User's First Name"
    - last_name:
        usage: "LABEL"
        description: "User's Last Name"
    - logged_in:
        usage: "LABEL"
        description: "User's logged in or out status"
    - active_subscription:
        usage: "LABEL"
        description: "Whether User has an active subscription or not"
    - banned:
        usage: "LABEL"
        description: "Whether user is banned or not"
    - upvotes:
        usage: "COUNTER"
        description: "Count of upvotes the User has earned.  Upvotes once granted cannot be revoked, so the number can only increase."
    - downvotes:
        usage: "GAUGE"
        description: "Count of downvotes the User has earned.  Downvotes can be revoked so the number can increase as well as decrease."
...

We hope you enjoy this feature, and we welcome your feedback via the Percona forums!

Server and Client logs

We’ve enhanced the volume of data collected from both the Server and Client perspectives.  Each service provides a set of files designed to be shared with Percona Support while you work on an issue.

Server

From the Server, we’ve improved the logs.zip service to include:

  • Prometheus targets
  • Consul nodes, QAN API instances
  • Amazon RDS and Aurora instances
  • Version
  • Server configuration
  • Percona Toolkit commands

You retrieve the link from your PMM server using this format:   https://pmmdemo.percona.com/managed/logs.zip

Client

On the Client side we’ve added a new action called summary which fetches logs, network, and Percona Toolkit output in order to share with Percona Support. To initiate a Client side collection, execute:

pmm-admin summary

The output will be a file you can use to attach to your Support ticket.  The single file will look something like this:

summary__2018_10_10_16_20_00.tar.gz

New Features and Improvements

  • PMM-2913 – Provide ability to execute Custom Queries against MySQL – Credit to wrouesnel for the framework of this feature in wrouesnel/postgres_exporter!
  • PMM-2904 – Improve PMM Server Diagnostics for Support
  • PMM-2860 – Improve pmm-client Diagnostics for Support
  • PMM-1754Provide functionality to easily select query and copy it to clipboard in QAN
  • PMM-1855Add swap to AMI
  • PMM-3013Rename PXC Overview graph Sequence numbers of transactions to IST Progress
  • PMM-2726 – Abort data collection in Exporters based on Prometheus Timeout – MySQLd Exporter
  • PMM-3003 – PostgreSQL Overview Dashboard Tooltip fixes
  • PMM-2936Some improvements for Query Analytics Settings screen
  • PMM-3029PostgreSQL Dashboard Improvements

Fixed Bugs

  • PMM-2976Upgrading to PMM 1.14.x fails if dashboards from Grafana 4.x are present on an installation
  • PMM-2969rds_exporter becomes throttled by CloudWatch API
  • PMM-1443The credentials for a secured server are exposed without explicit request
  • PMM-3006Monitoring over 1000 instances is displayed imperfectly on the label
  • PMM-3011PMM’s default MongoDB DSN is localhost, which is not resolved to IPv4 on modern systems
  • PMM-2211Bad display when using old range in QAN
  • PMM-1664Infinite loading with wrong queryID
  • PMM-2715Since pmm-client-1.9.0, pmm-admin detects CentOS/RHEL 6 installations using linux-upstart as service manager and ignores SysV scripts
  • PMM-2839Tablestats safety precaution does not work for RDS/Aurora instances
  • PMM-2845pmm-admin purge causes client to panic
  • PMM-2968pmm-admin list shows empty data source column for mysql:metrics
  • PMM-3043 Total Time percentage is incorrectly shown as a decimal fraction
  • PMM-3082Prometheus Scrape Interval Variance chart doesn’t display data

How to get PMM Server

PMM is available for installation using three methods:

Help us improve our software quality by reporting any Percona Monitoring and Management bugs you encounter using our bug tracking system.

Oct
09
2018
--

PostgreSQL Monitoring: Set Up an Enterprise-Grade Server (and Sign Up for Webinar Weds 10/10…)

PostgreSQL Monitoring

PostgreSQL logoThis is the last post in our series on building an enterprise-grade PostgreSQL set up using open source tools, and we’ll be covering monitoring.

The previous posts in this series discussed aspects such as security, backup strategy, high availability, connection pooling and load balancing, extensions, and detailed logging in PostgreSQL. Tomorrow, Wednesday, October 10 at 10AM EST, we will be reviewing these topics together, and showcasing then in practice in a webinar format: we hope you can join us!

 

Monitoring databases

The importance of monitoring the activity and health of production systems is unquestionable. When it comes to the database, with its high number of customizable settings, the ability to track its various metrics (status counters and gauges) allows for the maintenance of a historical record of its performance over time. This can be used for capacity planningtroubleshooting and validation.

When it comes to capacity planning, a monitoring solution is a helpful tool to help you assess how the current setup is faring. At the same time, it can help predict future needs based on trends, such as the increase of active connections, queries, and CPU usage. For example, an increase in CPU usage might be due to a genuine increase in workload, but it could also be a sign of unoptimized queries growing in popularity. In which case, comparing CPU with disk access might provide a more complete view of what is going on.

Being able to easily correlate data like this helps you to catch minor issues and to plan accordingly, sometimes allowing you to avoid an easier but more costly solution of scaling up to mitigate problems like this. But having the right monitoring solution is really invaluable when it comes to investigative work and root cause analysis. Trying to understand a problem that has already taken place is a rather complicated, and often unenviable, task unless you established a continuous, watchful eye on the set up for the whole time.

Finally, a monitoring solution can help you validate changes made in the business logic in general or in the database configuration in specific. By comparing prior and post results for a given metric or for overall performance, you can observe the impact of such changes in practice.

Monitoring PostgreSQL with open source solutions

There is a number of monitoring solutions for PostgreSQL and postgresql.org’s Wiki provides an extensive list, albeit a little outdated. It categorizes the main monitoring solutions into two distinct categories: those that can be identified as generic solutions—and can be extended to cover different technologies through custom plugins—and those labeled as Postgres-centric, which are specific to PostgreSQL.

In the first group, we find venerated open source monitoring tools such as Munin, Zabbix, and CactiNagios could have also been added to this group but it was instead indirectly included in the “Checkers” group. That category includes monitoring scripts that can be used both in stand-alone mode or as feeders (plugins) for “Nagios like software“. Examples of these are check_pgactivity and check_postgres.

One omission from this list is Grafana, a modern time series analytics platform conceived to display metrics from a number of different data sources. Grafana includes a solution packaged as a PostgreSQL native plugin. Percona has built its Percona Monitoring and Management (PMM) platform around Grafana, using Prometheus as its data source. Since version 1.14.0, PMM supports PostgreSQL. Query Analytics (QAN) integration is coming soon.

An important factor that all these generic solutions have in common is that they are widely used for the monitoring of a diverse collection of services, like you’d normally find in enterprise-like environments. It’s common for a given company to adopt one, or sometimes two, such solutions with the aim of monitoring their entire infrastructure. This infrastructure often includes a heterogeneous combination of databases and application servers.

Nevertheless, there is a place for complementary Postgres-centric monitoring solutions in such enterprise environments too. These solutions are usually implemented with a specific goal in mind. Two examples we can mention in this context are PGObserver, which has a focus on monitoring stored procedures, and pgCluu, with its focus on auditing.

Monitoring PostgreSQL with PMM

We built an enterprise-grade PostgreSQL set up for the webinar, and use PMM for monitoring. We will be showcasing some of PMM’s main features, and highlighting some of the most important metrics to watch, during our demo.You may want to have a look at this demo setup to get a feel of how our PostgreSQL Overview dashboard looks:

You can find instructions on how to setup PMM for monitoring your PostgreSQL server in our documentation space. And if there’s still time, sign up for tomorrow’s webinar!

 

Sep
28
2018
--

Scaling Percona Monitoring and Management (PMM)

PMM tested with 1000 nodes

Starting with PMM 1.13,  PMM uses Prometheus 2 for metrics storage, which tends to be heaviest resource consumer of CPU and RAM.  With Prometheus 2 Performance Improvements, PMM can scale to more than 1000 monitored nodes per instance in default configuration. In this blog post we will look into PMM scaling and capacity planning—how to estimate the resources required, and what drives resource consumption.

PMM tested with 1000 nodes

We have now tested PMM with up to 1000 nodes, using a virtualized system with 128GB of memory, 24 virtual cores, and SSD storage. We found PMM scales pretty linearly with the available memory and CPU cores, and we believe that a higher number of nodes could be supported with more powerful hardware.

What drives resource usage in PMM ?

Depending on your system configuration and workload, a single node can generate very different loads on the PMM server. The main factors that impact the performance of PMM are:

  1. Number of samples (data points) injected into PMM per second
  2. Number of distinct time series they belong to (cardinality)
  3. Number of distinct query patterns your application uses
  4. Number of queries you have on PMM, through the user interface on the API, and their complexity

These specifically can be impacted by:

  • Software version – modern database software versions expose more metrics)
  • Software configuration – some metrics are only exposed in certain configuration
  • Workload – a large number of database objects and high concurrency will increase both the number of samples ingested and their cardinality.
  • Exporter configuration – disabling collectors can reduce amount of data collectors
  • Scrape frequency –  controlled by METRICS_RESOLUTION

All these factors together may impact resource requirements by a factor of ten or more, so do your own testing to be sure. However, the numbers in this article should serve as good general guidance as a start point for your research.

On the system supporting 1000 instances we observed the following performance:

Performance PMM 1000 nodes load

As you can see, we have more than 2.000 scrapes/sec performed, providing almost two million samples/sec, and more than eight million active time series. These are the main numbers that define the load placed on Prometheus.

Capacity planning to scale PMM

Both CPU and memory are very important resources for PMM capacity planning. Memory is the more important as Prometheus 2 does not have good options for limiting memory consumption. If you do not have enough memory to handle your workload, then it will run out of memory and crash.

We recommend at least 2GB of memory for a production PMM Installation. A test installation with 1GB of memory is possible. However, it may not be able to monitor more than one or two nodes without running out of memory. With 2GB of memory you should be able to monitor at least five nodes without problem.

With powerful systems (8GB of more) you can have approximately eight systems per 1GB of memory, or about 15,000 samples ingested/sec per 1GB of memory.

To calculate the CPU usage resources required, allow for about 50 monitored systems per core (or 100K metrics/sec per CPU core).

One problem you’re likely to encounter if you’re running PMM with 100+ instances is the “Home Dashboard”. This becomes way too heavy with such a large number of servers. We plan to fix this issue in future releases of PMM, but for now you can work around it in two simple ways:

You can select the host, for example “pmm-server” in your home dashboard and save it, before adding a large amount of hosts to the system.

set home dashboard for PMM

Or you can make some other dashboard of your choice and set it as the home dashboard.

Summary

  • More than 1,000 monitored systems is possible per single PMM server
  • Your specific workload and configuration may significantly change the resources required
  • If deploying with 8GB or more, plan 50 systems per core, and eight systems per 1GB of RAM

The post Scaling Percona Monitoring and Management (PMM) appeared first on Percona Database Performance Blog.

Sep
20
2018
--

Prometheus 2 Times Series Storage Performance Analyses

cpu saturation and max core usage

Prometheus 2 time series database (TSDB) is an amazing piece of engineering, offering a dramatic improvement compared to “v2” storage in Prometheus 1 in terms of ingest performance, query performance and resource use efficiency. As we’ve been adopting Prometheus 2 in Percona Monitoring and Management (PMM), I had a chance to look into the performance of Prometheus 2 TSDB. This blog post details my observations.

Understanding the typical Prometheus workload

For someone who has spent their career working with general purpose databases, the typical workload of Prometheus is quite interesting. The ingest rate tends to remain very stable: typically, devices you monitor will send approximately the same amount of metrics all the time, and infrastructure tends to change relatively slowly.

Queries to the data can come from multiple sources. Some of them, such as alerting, tend to be very stable and predictable too. Others, such as users exploring data, can be spiky, though it is not common for this to be largest part of the load.

The Benchmark

In my assessment, I focused on handling an ingest workload. I had deployed Prometheus 2.3.2 compiled with Go 1.10.1 (as part of PMM 1.14)  on Linode using this StackScript.  For a maximally realistic load generation, I spin up multiple MySQL nodes running some real workloads (Sysbench TPC-C Test) , with each emulating 10 Nodes running MySQL and Linux using this StackScript

The observations below are based on a Linode instance with eight virtual cores and 32GB of memory, running  20 load driving simulating the monitoring of 200 MySQL instances. Or, in Prometheus Terms, some 800 targets; 440 scrapes/sec 380K samples ingested per second and 1.7M of active time series.

Design Observations

The conventional approach of traditional databases, and the approach that Prometheus 1.x used, is to limit amount of memory. If this amount of memory is not enough to handle the load, you will have high latency and some queries (or scrapes) will fail. Prometheus 2 memory usage instead is configured by

storage.tsdb.min-block-duration

   which determines how long samples will be stored in memory before they are flushed (the default being 2h). How much memory it requires will depend on the number of time series, the number of labels you have, and your scrape frequency in addition to the raw ingest rate. On disk, Prometheus tends to use about three bytes per sample. Memory requirements, though, will be significantly higher.

While the configuration knob exists to change the head block size, tuning this by users is discouraged. So you’re limited to providing Prometheus 2 with as much memory as it needs for your workload.

If there is not enough memory for Prometheus to handle your ingest rate, then it will crash with out of memory error message or will be killed by OOM killer.

Adding more swap space as a “backup” in case Prometheus runs out of RAM does not seem to work as using swap space causes a dramatic memory usage explosion. I suspect swapping does not play well with Go garbage collection.

Another interesting design choice is aligning block flushes to specific times, rather than to time since start:

head block Prometheus 2

As you can see from this graph, flushes happen every two hours, on the clock. If you change min-block-duration  to 1h, these flushes will happen every hour at 30 minutes past the hour.

(If you want to see this and other graphs for your Prometheus Installation you can use this Dashboard. It has been designed for PMM but can work for any Prometheus installation with little adjustments.)

While the active block—called head block— is kept in memory, blocks containing older blocks are accessed through

mmap()

  This eliminates the need to configure cache separately, but also means you need to allocate plenty of memory for OS Cache if you want to query data older than fits in the head block.

It also means the virtual memory you will see Prometheus 2 using will get very high: do not let it worry you.

Prometheus process memory usage

Another interesting design choice is WAL configuration. As you can see in the storage documentation, Prometheus protects from data loss during a crash by having WAL log. The exact durability guarantees, though, are not clearly described. As of Prometheus 2.3.2, Prometheus flushes the WAL log every 10 seconds, and this value is not user configurable.

Compactions

Prometheus TSDB is designed somewhat similar to the LSM storage engines – the head block is flushed to disk periodically, while at the same time, compactions to merge a few blocks together are performed to avoid need to scan too many blocks for queries

Here is the number of data blocks I observed on my system after a 24h workload:

active data blocks

If you want more details about storage, you can check out the meta.json file which has additional information about the blocks you have, and how they came about.

{
       "ulid": "01CPZDPD1D9R019JS87TPV5MPE",
       "minTime": 1536472800000,
       "maxTime": 1536494400000,
       "stats": {
               "numSamples": 8292128378,
               "numSeries": 1673622,
               "numChunks": 69528220
       },
       "compaction": {
               "level": 2,
               "sources": [
                       "01CPYRY9MS465Y5ETM3SXFBV7X",
                       "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                       "01CPZ6NR4Q3PDP3E57HEH760XS"
               ],
               "parents": [
                       {
                               "ulid": "01CPYRY9MS465Y5ETM3SXFBV7X",
                               "minTime": 1536472800000,
                               "maxTime": 1536480000000
                       },
                       {
                               "ulid": "01CPYZT0WRJ1JB1P0DP80VY5KJ",
                               "minTime": 1536480000000,
                               "maxTime": 1536487200000
                       },
                       {
                               "ulid": "01CPZ6NR4Q3PDP3E57HEH760XS",
                               "minTime": 1536487200000,
                               "maxTime": 1536494400000
                       }
               ]
       },
       "version": 1
}

Compactions in Prometheus are triggered at the time the head block is flushed, and several compactions may be performed at these intervals:Prometheus 2 compactions

Compactions do not seem to be throttled in any way, causing huge spikes of disk IO usage when they run:

spike in io activity for compactions

And a spike in CPU usage:

spike in CPU usage during compactions

This, of course, can cause negative impact to the system performance. This is also why it is one of the greatest questions in LSM engines: how to run compactions to maintain great query performance, but not cause too much overhead.

Memory utilization as it relates to the compaction process is also interesting:

Memory utilization during compaction process

We can see after compaction a lot of memory changes from “Cached”  to “Free”, meaning potentially valuable data is washed out from memory. I wonder if

fadvice()

 or other techniques to minimize data washout from cache are in use, or if this is caused by the fact that the blocks which were cached are destroyed by the compaction process

Crash Recovery

Crash recovery from the log file takes time, though it is reasonable. For an ingest rate of about 1 mil samples/sec, I observed some 25 minutes recovery time on SSD storage:

level=info ts=2018-09-13T13:38:14.09650965Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=v2.3.2, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-09-13T13:38:14.096599879Z caller=main.go:223 build_context="(go=go1.10.1, user=Jenkins, date=20180725-08:58:13OURCE)"
level=info ts=2018-09-13T13:38:14.096624109Z caller=main.go:224 host_details="(Linux 4.15.0-32-generic #35-Ubuntu SMP Fri Aug 10 17:58:07 UTC 2018 x86_64 1bee9e9b78cf (none))"
level=info ts=2018-09-13T13:38:14.096641396Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-13T13:38:14.097715256Z caller=web.go:415 component=web msg="Start listening for connections" address=:9090
level=info ts=2018-09-13T13:38:14.097400393Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-09-13T13:38:14.098718401Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536530400000 maxt=1536537600000 ulid=01CQ0FW3ME8Q5W2AN5F9CB7R0R
level=info ts=2018-09-13T13:38:14.100315658Z caller=web.go:467 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-09-13T13:38:14.101793727Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ78486TNX5QZTBF049PQHSM
level=info ts=2018-09-13T13:38:14.102267346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536537600000 maxt=1536732000000 ulid=01CQ78DE7HSQK0C0F5AZ46YGF0
level=info ts=2018-09-13T13:38:14.102660295Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4RM21Y0PT5GNSS146Q
level=info ts=2018-09-13T13:38:14.103075885Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SV8WJ3C2W5S3RTAHC2GHB
level=error ts=2018-09-13T14:05:18.208469169Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum d0465484, want 0" file=/opt/prometheus/data/.prom2-data/wal/007357 pos=15504363
level=info ts=2018-09-13T14:05:19.471459777Z caller=main.go:543 msg="TSDB started"
level=info ts=2018-09-13T14:05:19.471604598Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499156711Z caller=main.go:629 msg="Completed loading of configuration file" filename=/etc/prometheus.yml
level=info ts=2018-09-13T14:05:19.499228186Z caller=main.go:502 msg="Server is ready to receive web requests."

The problem I observed with recovery is that it is very memory intensive. While the server may be capable of handling the normal load with memory to spare if it crashes, it may not be able to ever recover due to running out of memory.  The only solution I found for this is to disable scraping, let it perform crash recovery, and then restarting the server with scraping enabled

Warmup

Another behavior to keep in mind is the need for warmup – a lower performance/higher resource usage ratio immediately after start. In some—but not all—starts I can observe significantly higher initial CPU and memory usage

cpu usage during warmup

memory usage during warmup

The gaps in the memory utilization graph show that Prometheus is not initially able to perform all the scrapes configured, and as such some data is lost.

I have not profiled what exactly causes this extensive CPU and memory consumption. I suspect these might be happening when new time series entries are created, at head block, and at high rate.

CPU Usage Spikes

Besides compaction—which is quite heavy on the Disk IO—I also can observe significant CPU spikes about every 2 minutes. These are longer with a higher ingest ratio. These seem to be caused by Go Garbage Collection during these spikes: at least some CPU cores are completely saturated

cpu usage spikes maybe during Go Garbage collection

cpu saturation and max core usage

These spikes are not just cosmetic. It looks like when these spikes happen, the Prometheus internal /metrics endpoint becomes unresponsive, thus producing data gaps during the exact time that the spikes occur:

Prometheus 2 process memory usage

We can also see the Prometheus Exporter hitting a one second timeout:

scrape time by job

We can observe this correlates with garbage collection:

garbage collection in Prometheus processing

Conclusion

Prometheus 2 TSDB offers impressive performance, being able to handle a cardinality of millions of time series, and also to handle hundreds of thousands of samples ingested per second on rather modest hardware. CPU and disk IO usage are both very impressive. I got up to 200K/metrics/sec per used CPU core!

For capacity planning purposes you need to ensure that you have plenty of memory available, and it needs to be real RAM. The actual amount of memory I observed was about 5GB per 100K/samples/sec ingest rate, which with additional space for OS cache, makes it 8GB or so.

There is work that remains to be done to avoid CPU and IO usage spikes, though this is not unexpected considering how young Prometheus 2 TSDB is – if we look at InnoDB, TokuDB, RocksDB, WiredTiger all of them had similar problem in their initial releases.

The post Prometheus 2 Times Series Storage Performance Analyses appeared first on Percona Database Performance Blog.

Sep
18
2018
--

Tutorial Schedule for Percona Live Europe 2018 Is Live

Percona Live Europe tutorials and sneak peak

Percona Live Europe Open Source Database Conference PLE 2018Percona has revealed the line-up of in-depth tutorials for the Percona Live Europe 2018 Open Source Database Conference, taking place November 5–7, 2018 at the Radisson Blu Hotel in Frankfurt, Germany. Secure your spot now with Advanced Registration prices. Be sure to buy your tickets soon as tickets prices will only head up, not down! Sponsorship opportunities for the conference are still available.

Percona Live Europe 2018 Open Source Database Conference is the premier open source database event. Our theme this year is “Connect. Accelerate. Innovate.”  Percona Live is the place to learn about how open source database technology can power your applications, improve your websites and solve your critical database issues.

Monday, November 5: Tutorial Day

Tutorials take place throughout the day on Monday, November 5. Tutorials are three hours and provide practical, in-depth knowledge exchange on critical open source database issues. The line up includes:

  • Query Optimization with MySQL 8.0 and MariaDB 10.3: The Basics with Jaime Crespo – Wikimedia Foundation
  • Hands on ProxySQL with René Cannaò – ProxySQL
  • ElasticSearch 101 with Antonios Giannopoulos – ObjectRocket
  • MySQL Performance Schema in Action: the Complete Tutorial with Sveta Smirnova and Alexander Rubin – Percona
  • MySQL InnoDB Cluster in a Nutshell : The Saga Continues with 8.0 with Frédéric Descamps – Oracle
  • Introduction to PostgreSQL for MySQL and Oracle DBAs with Avinash Vallarapu – Percona
  • InnoDB Architecture and Optimization with Peter Zaitsev – Percona
  • MongoDB: Replica Sets and Sharded Clusters with Adamo Tonete – Percona
  • High Availability PostgreSQL and Kubernetes with Google Cloud presented by Alexis Guajardo – Google
  • Open Source Database Performance Optimization and Monitoring with PMM with Michael Coburn, Vinicius Grippa, and Avinash Vallarapu – Percona
  • Percona XtraDB Cluster Tutorial presented by Tibor Köröcz – Percona
  • Mastering PostgreSQL Administration with Bruce Momjian – EnterpriseDB

and a sneak peak at some of the sessions

Of course, we have a stellar line up of talks, too! Here’s a tantalising glimpse of just some of the talks you could MISS if you don’t head to Frankfurt in November

Tuesday 6th November

  • Percona Server 8.0 – Laurynas Biveinis – Percona
  • MySQL 8.0 Performance: Scalability & Benchmarks – Dimitri Kravtchuk – Oracle
  • MySQL Group Replication : the magic explained – Frédéric Descamps – Oracle
  • Explaining the Postgres Query Optimizer – Bruce Momjian – EnterpriseDB
  • TLS for MySQL at large scale – Jaime Crespo – Wikimedia Foundation
  • BlaBlaCar – 100% Containers Powered Carpooling – Maxime Fouilleul – BlaBlaCar
  • A Year in Google Cloud – Carmen Mason, Alan Mason – VitalSource Technologies
  • Demystifying MySQL Replication Crash Safety – Jean-François Gagné
  • MongoDB Shard 101 – Adamo Tonete, Vinodh Krishnaswamy – Percona
  • Highway to Hell or Stairway to Cloud? – Alexander Kukushkin – Zalando
  • MongoDB administration cool tips – Gabriel Ciciliani – Pythian

Wednesday 7th November

  • PostgreSQL Enterprise Features – Michael Banck – credativ GmbH
  • Open Source Databases and Non-Volatile Memory – Frank Ober – Intel Memory Group
  • MariaDB 10.3 Optimizer and beyond – Vicentiu Ciorbaru – MariaDB Foundation
  • MariaDB system-versioned tables – Federico Razzoli – PayProp

A shout out to our fantastic Conference Committee who have been working hard to review the tutorial and talk submissions: we had over 200! Thank you!

The Radisson Blu Hotel, Frankfurt

Percona Live 2018 Open Source Database Conference will be held at the Radisson Blu Hotel, Frankfurt Franklinstraße 65, 60486 Frankfurt am Main, Germany

The Radisson Blu enjoys an enviable location in the Bockenheim District, just off the A66 motorway – only one kilometer from Messe Frankfurt, one of the world’s largest exhibition complexes. They’re also just 10 minutes from the city center, and Frankfurt International Airport (FRA) is a quick 15-minute drive away.

Book your hotel using Percona’s special room block rate, available only until September 20.

Sponsorships

Sponsorship opportunities for Percona Live Europe 2018 Open Source Database Conference are available and offer the opportunity to interact with the DBAs, sysadmins, developers, CTOs, CEOs, business managers, technology evangelists, solution vendors, and entrepreneurs who typically attend the event. Contact events@percona.com for sponsorship details.

The post Tutorial Schedule for Percona Live Europe 2018 Is Live appeared first on Percona Database Performance Blog.

Sep
18
2018
--

Monitoring Processes with Percona Monitoring and Management

Memory utilization during compaction process

A few months ago I wrote a blog post on How to Capture Per Process Metrics in PMM. Since that time, Nick Cabatoff has made a lot of improvements to Process Exporter and I’ve improved the Grafana Dashboard to match.

I will not go through installation instructions, they are well covered in original blog post.  This post covers features available in release 0.4.0 Here are a few new features you might find of interest:

Used Memory

Memory usage in Linux is complicated.  You can look at resident memory, which shows how much space is used in RAM. However, if you have a substantial part of process swapped out because of memory pressure, you would not see it. You can also look at virtual memory–but it will include a lot of address space which was allocated and never mapped either to RAM or to swap space.   Especially for processes written in Go, the difference can be extreme. Let’s look at the process exporter itself: it uses 20MB of resident memory but over 2GB of virtual memory.

top processes by resident memory

top processes by virtual memory

Meet the Used Memory dashboard, which shows the sum of resident memory used by the process and swap space used:

used memory dashboard

There is dashboard to see process by swap space used as well, so you can see if some processes that you expect to be resident are swapped out.

Processes by Disk IO

processes by disk io

Processes by Disk IO is another graph which I often find very helpful. It is the most useful for catching the unusual suspects, when the process causing the IO is not totally obvious.

Context Switches

Context switches, as shown by VMSTAT, are often seen as an indication of contention. With contention stats per process you can see which of the process are having those context switches.

top processes by voluntary context switches

Note: while large number of context switches can be a cause of high contention, some applications and workloads are just designed in such a way. You are better off looking at the change in the number of context switches, rather than at the raw number.

CPU and Disk IO Saturation

As Brendan Gregg tells us, utilization and saturation are not the same. While CPU usage and Disk IO usage graphs show us resource utilization by different processes, they do not show saturation.

top running processes graph

For example, if you have four CPU cores then you can’t get more than four CPU cores used by any process, whether there are four or four hundred concurrent threads trying to run.

While being rather volatile as gauge metrics, top running processes and top processes waiting on IO are good metrics to understand which processes are prone to saturation.

These graphs roughly provide a breakdown of “r” and “b”  VMSTAT columns per process

Kernel Waits

Finally, you can see which kernel function (WCHAN) the process is sleeping on, which can be very helpful to access processes which are not using a lot of CPU, but are not making much progress either.

I find this graph most useful if you pick the single process in the dashboard picker:

kernel waits for sysbench

In this graph we can see sysbench has most threads sleeping in

unix_stream_read_generic

  which corresponds to reading the response from MySQL from UNIX socket – exactly what you would expect!

Summary

If you ever need to understand what different processes are doing in your system, then Nick’s Process Exporter is a fantastic tool to have. It just takes few minutes to get it added into your PMM installation.

If you enjoyed this post…

You might also like my pre-recorded webinar MySQL troubleshooting and performance optimization with PMM.

The post Monitoring Processes with Percona Monitoring and Management appeared first on Percona Database Performance Blog.

Sep
11
2018
--

How To Deploy PMM on Linode With StackScripts

rebuiild from a StackScript

In my previous blog post, I showed how to deploy Percona Monitoring and Management (PMM) on Linode manually. It is pretty simple, but with a little coding it can be done even more easily using StackScripts

Here’s how:

1. Click on the “Add a Linode” and pick a Linode type you want to deploy.

2. Click on the deployed Linode and then click on the “Rebuild” Link

Rebuild the Linode

3. Click on Deploy Using StackScripts

Deploy using StackScripts

4. On the resulting page search for “PMM” and pick PMMServer from PerconaLab.

Choose PMMServer from PerconaLab

5. Provide the host name for new Linode, pick the root password and click on “Rebuild”

6. Boot the server.

boot the server

7.  You’re done. Wait for about 5 minutes for the installation to complete, then you can see PMM interface by going to this Linode IP

view PMM on Linode IP

If you think that a manual deployment with StackScripts is not much less hassle than doing it manually, you’re right. The real benefit comes with using Linode API for deployment.

There are multiple way to access this API, though for basic scripting I prefer the linode-cli tool for using the Linode API from the command line.

With linode-cli  you can deploy your PMM Server on Linode using this one liner:

linode-cli linodes create --label pmm-test  --root_pass MyRootPassword123 --stackscript_id 338458  --stackscript_data '{"hostname": "pmm-test"}'

Summary

As you can see, with Linode StackScripts you can get going with Percona Monitoring and Management on Linode in no time, especially if you chose to use the Linode API.

You might also like:

Here’s an overview from the Percona Monitoring and Management manual on deploying PMM. If you are new to PMM and would like to know more, you will find lots of resources on this site including my webinar MySQL Troubleshooting and Performance Optimization with PMM.

The post How To Deploy PMM on Linode With StackScripts appeared first on Percona Database Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com