Sep
03
2018
--

Monitoring S.M.A.R.T. Metrics with Prometheus and PMM

visualized using Grafana

In his excellent blog post, Pavel Trukhanov showed the value of S.M.A.R.T. metric collections, so I wondered how hard would it be to enable their collection in Percona Monitoring and Management (PMM)

A quick search led me to the  text_collector plugin SmartMon, which can be easily integrated with any Prometheus Installation

For PMM, Vadim Yalovets recently showed how to do custom integrations based on text_collector

Let’s put those together:

  1. Ensure you have the smartctl tool installed. It is available in repositories for most Linux distributions
  2. Get  smartmon.sh and place it in /usr/local/bin or other location
  3. Install the cron job
    echo  "*/5 * * * * root bash  /usr/local/bin/smartmon.sh > /tmp/smart_metrics.prom  " > /etc/cron.d/smartmon
  4. Enable textfile_collector as described in this blog post

That’s it! You should get your data flowing. Now you can use Prometheus to query device information:

use prometheus to query device

Or if you want to get a specific S.M.A.R.T value, such as media_wearout indicator:

specific smart value wearout indicator

If you would like to see a nicer visualization in Grafana, you can install the appropriate dashboard from the Grafana web site.

visualized using Grafana

The number and kind of metrics you’re going to get depends on the storage device vendor and model. Here is an example list from one of my test systems:

# HELP smartmon_smartctl_version SMART metric smartctl_version
# TYPE smartmon_smartctl_version gauge
smartmon_smartctl_version{version="6.5"} 1
# HELP smartmon_current_pending_sector_raw_value SMART metric current_pending_sector_raw_value
# TYPE smartmon_current_pending_sector_raw_value gauge
smartmon_current_pending_sector_raw_value{disk="/dev/sda",type="sat",smart_id="197"} 0.000000e+00
# HELP smartmon_current_pending_sector_threshold SMART metric current_pending_sector_threshold
# TYPE smartmon_current_pending_sector_threshold gauge
smartmon_current_pending_sector_threshold{disk="/dev/sda",type="sat",smart_id="197"} 0
# HELP smartmon_current_pending_sector_value SMART metric current_pending_sector_value
# TYPE smartmon_current_pending_sector_value gauge
smartmon_current_pending_sector_value{disk="/dev/sda",type="sat",smart_id="197"} 100
# HELP smartmon_current_pending_sector_worst SMART metric current_pending_sector_worst
# TYPE smartmon_current_pending_sector_worst gauge
smartmon_current_pending_sector_worst{disk="/dev/sda",type="sat",smart_id="197"} 100
# HELP smartmon_device_info SMART metric device_info
# TYPE smartmon_device_info gauge
smartmon_device_info{disk="/dev/sda",type="sat",vendor="",product="",revision="",lun_id="",model_family="",device_model="Crucial_CT275MX300SSD1",serial_number="16431465B53F",firmware_version="M0CR031"} 1
# HELP smartmon_device_smart_available SMART metric device_smart_available
# TYPE smartmon_device_smart_available gauge
smartmon_device_smart_available{disk="/dev/sda",type="sat"} 1
# HELP smartmon_device_smart_enabled SMART metric device_smart_enabled
# TYPE smartmon_device_smart_enabled gauge
smartmon_device_smart_enabled{disk="/dev/sda",type="sat"} 1
# HELP smartmon_device_smart_healthy SMART metric device_smart_healthy
# TYPE smartmon_device_smart_healthy gauge
smartmon_device_smart_healthy{disk="/dev/sda",type="sat"} 1
# HELP smartmon_end_to_end_error_raw_value SMART metric end_to_end_error_raw_value
# TYPE smartmon_end_to_end_error_raw_value gauge
smartmon_end_to_end_error_raw_value{disk="/dev/sda",type="sat",smart_id="184"} 0.000000e+00
# HELP smartmon_end_to_end_error_threshold SMART metric end_to_end_error_threshold
# TYPE smartmon_end_to_end_error_threshold gauge
smartmon_end_to_end_error_threshold{disk="/dev/sda",type="sat",smart_id="184"} 0
# HELP smartmon_end_to_end_error_value SMART metric end_to_end_error_value
# TYPE smartmon_end_to_end_error_value gauge
smartmon_end_to_end_error_value{disk="/dev/sda",type="sat",smart_id="184"} 100
# HELP smartmon_end_to_end_error_worst SMART metric end_to_end_error_worst
# TYPE smartmon_end_to_end_error_worst gauge
smartmon_end_to_end_error_worst{disk="/dev/sda",type="sat",smart_id="184"} 100
# HELP smartmon_offline_uncorrectable_raw_value SMART metric offline_uncorrectable_raw_value
# TYPE smartmon_offline_uncorrectable_raw_value gauge
smartmon_offline_uncorrectable_raw_value{disk="/dev/sda",type="sat",smart_id="198"} 0.000000e+00
# HELP smartmon_offline_uncorrectable_threshold SMART metric offline_uncorrectable_threshold
# TYPE smartmon_offline_uncorrectable_threshold gauge
smartmon_offline_uncorrectable_threshold{disk="/dev/sda",type="sat",smart_id="198"} 0
# HELP smartmon_offline_uncorrectable_value SMART metric offline_uncorrectable_value
# TYPE smartmon_offline_uncorrectable_value gauge
smartmon_offline_uncorrectable_value{disk="/dev/sda",type="sat",smart_id="198"} 100
# HELP smartmon_offline_uncorrectable_worst SMART metric offline_uncorrectable_worst
# TYPE smartmon_offline_uncorrectable_worst gauge
smartmon_offline_uncorrectable_worst{disk="/dev/sda",type="sat",smart_id="198"} 100
# HELP smartmon_power_cycle_count_raw_value SMART metric power_cycle_count_raw_value
# TYPE smartmon_power_cycle_count_raw_value gauge
smartmon_power_cycle_count_raw_value{disk="/dev/sda",type="sat",smart_id="12"} 2.000000e+01
# HELP smartmon_power_cycle_count_threshold SMART metric power_cycle_count_threshold
# TYPE smartmon_power_cycle_count_threshold gauge
smartmon_power_cycle_count_threshold{disk="/dev/sda",type="sat",smart_id="12"} 0
# HELP smartmon_power_cycle_count_value SMART metric power_cycle_count_value
# TYPE smartmon_power_cycle_count_value gauge
smartmon_power_cycle_count_value{disk="/dev/sda",type="sat",smart_id="12"} 100
# HELP smartmon_power_cycle_count_worst SMART metric power_cycle_count_worst
# TYPE smartmon_power_cycle_count_worst gauge
smartmon_power_cycle_count_worst{disk="/dev/sda",type="sat",smart_id="12"} 100
# HELP smartmon_power_on_hours_raw_value SMART metric power_on_hours_raw_value
# TYPE smartmon_power_on_hours_raw_value gauge
smartmon_power_on_hours_raw_value{disk="/dev/sda",type="sat",smart_id="9"} 1.313300e+04
# HELP smartmon_power_on_hours_threshold SMART metric power_on_hours_threshold
# TYPE smartmon_power_on_hours_threshold gauge
smartmon_power_on_hours_threshold{disk="/dev/sda",type="sat",smart_id="9"} 0
# HELP smartmon_power_on_hours_value SMART metric power_on_hours_value
# TYPE smartmon_power_on_hours_value gauge
smartmon_power_on_hours_value{disk="/dev/sda",type="sat",smart_id="9"} 100
# HELP smartmon_power_on_hours_worst SMART metric power_on_hours_worst
# TYPE smartmon_power_on_hours_worst gauge
smartmon_power_on_hours_worst{disk="/dev/sda",type="sat",smart_id="9"} 100
# HELP smartmon_raw_read_error_rate_raw_value SMART metric raw_read_error_rate_raw_value
# TYPE smartmon_raw_read_error_rate_raw_value gauge
smartmon_raw_read_error_rate_raw_value{disk="/dev/sda",type="sat",smart_id="1"} 0.000000e+00
# HELP smartmon_raw_read_error_rate_threshold SMART metric raw_read_error_rate_threshold
# TYPE smartmon_raw_read_error_rate_threshold gauge
smartmon_raw_read_error_rate_threshold{disk="/dev/sda",type="sat",smart_id="1"} 0
# HELP smartmon_raw_read_error_rate_value SMART metric raw_read_error_rate_value
# TYPE smartmon_raw_read_error_rate_value gauge
smartmon_raw_read_error_rate_value{disk="/dev/sda",type="sat",smart_id="1"} 100
# HELP smartmon_raw_read_error_rate_worst SMART metric raw_read_error_rate_worst
# TYPE smartmon_raw_read_error_rate_worst gauge
smartmon_raw_read_error_rate_worst{disk="/dev/sda",type="sat",smart_id="1"} 100
# HELP smartmon_reallocated_sector_ct_raw_value SMART metric reallocated_sector_ct_raw_value
# TYPE smartmon_reallocated_sector_ct_raw_value gauge
smartmon_reallocated_sector_ct_raw_value{disk="/dev/sda",type="sat",smart_id="5"} 0.000000e+00
# HELP smartmon_reallocated_sector_ct_threshold SMART metric reallocated_sector_ct_threshold
# TYPE smartmon_reallocated_sector_ct_threshold gauge
smartmon_reallocated_sector_ct_threshold{disk="/dev/sda",type="sat",smart_id="5"} 10
# HELP smartmon_reallocated_sector_ct_value SMART metric reallocated_sector_ct_value
# TYPE smartmon_reallocated_sector_ct_value gauge
smartmon_reallocated_sector_ct_value{disk="/dev/sda",type="sat",smart_id="5"} 100
# HELP smartmon_reallocated_sector_ct_worst SMART metric reallocated_sector_ct_worst
# TYPE smartmon_reallocated_sector_ct_worst gauge
smartmon_reallocated_sector_ct_worst{disk="/dev/sda",type="sat",smart_id="5"} 100
# HELP smartmon_reported_uncorrect_raw_value SMART metric reported_uncorrect_raw_value
# TYPE smartmon_reported_uncorrect_raw_value gauge
smartmon_reported_uncorrect_raw_value{disk="/dev/sda",type="sat",smart_id="187"} 0.000000e+00
# HELP smartmon_reported_uncorrect_threshold SMART metric reported_uncorrect_threshold
# TYPE smartmon_reported_uncorrect_threshold gauge
smartmon_reported_uncorrect_threshold{disk="/dev/sda",type="sat",smart_id="187"} 0
# HELP smartmon_reported_uncorrect_value SMART metric reported_uncorrect_value
# TYPE smartmon_reported_uncorrect_value gauge
smartmon_reported_uncorrect_value{disk="/dev/sda",type="sat",smart_id="187"} 100
# HELP smartmon_reported_uncorrect_worst SMART metric reported_uncorrect_worst
# TYPE smartmon_reported_uncorrect_worst gauge
smartmon_reported_uncorrect_worst{disk="/dev/sda",type="sat",smart_id="187"} 100
# HELP smartmon_smartctl_run SMART metric smartctl_run
# TYPE smartmon_smartctl_run gauge
smartmon_smartctl_run{disk="/dev/sda",type="sat"} 1535666337
# HELP smartmon_temperature_celsius_raw_value SMART metric temperature_celsius_raw_value
# TYPE smartmon_temperature_celsius_raw_value gauge
smartmon_temperature_celsius_raw_value{disk="/dev/sda",type="sat",smart_id="194"} 3.100000e+01
# HELP smartmon_temperature_celsius_threshold SMART metric temperature_celsius_threshold
# TYPE smartmon_temperature_celsius_threshold gauge
smartmon_temperature_celsius_threshold{disk="/dev/sda",type="sat",smart_id="194"} 0
# HELP smartmon_temperature_celsius_value SMART metric temperature_celsius_value
# TYPE smartmon_temperature_celsius_value gauge
smartmon_temperature_celsius_value{disk="/dev/sda",type="sat",smart_id="194"} 69
# HELP smartmon_temperature_celsius_worst SMART metric temperature_celsius_worst
# TYPE smartmon_temperature_celsius_worst gauge
smartmon_temperature_celsius_worst{disk="/dev/sda",type="sat",smart_id="194"} 59
# HELP smartmon_udma_crc_error_count_raw_value SMART metric udma_crc_error_count_raw_value
# TYPE smartmon_udma_crc_error_count_raw_value gauge
smartmon_udma_crc_error_count_raw_value{disk="/dev/sda",type="sat",smart_id="199"} 0.000000e+00
# HELP smartmon_udma_crc_error_count_threshold SMART metric udma_crc_error_count_threshold
# TYPE smartmon_udma_crc_error_count_threshold gauge
smartmon_udma_crc_error_count_threshold{disk="/dev/sda",type="sat",smart_id="199"} 0
# HELP smartmon_udma_crc_error_count_value SMART metric udma_crc_error_count_value
# TYPE smartmon_udma_crc_error_count_value gauge
smartmon_udma_crc_error_count_value{disk="/dev/sda",type="sat",smart_id="199"} 100
# HELP smartmon_udma_crc_error_count_worst SMART metric udma_crc_error_count_worst
# TYPE smartmon_udma_crc_error_count_worst gauge
smartmon_udma_crc_error_count_worst{disk="/dev/sda",type="sat",smart_id="199"} 100

The post Monitoring S.M.A.R.T. Metrics with Prometheus and PMM appeared first on Percona Database Performance Blog.

Aug
28
2018
--

Extend Metrics for Percona Monitoring and Management Without Modifying Code

PMM Extended Metrics

Percona Monitoring and Management (PMM) provides an excellent solution for system monitoring. Sometimes, though, you’ll have the need for a metric that’s not present in the list of node_exporter metrics out of the box. In this post, we introduce a simple method and show how to extend the list of available metrics without modifying the node_exporter code. It’s based on the textfile collector.

Enable the textfile collector in pmm-client

This collector is not enabled by default in the latest version of pmm-client. So, first let’s enable the textfile collector.

# pmm-admin rm linux:metrics
OK, removed system pmm-client-hostname from monitoring.
# pmm-admin add linux:metrics -- --collectors.enabled=diskstats,filefd,filesystem,loadavg,meminfo,netdev,netstat,stat,time,uname,vmstat,textfile --collector.textfile.directory="/tmp"
OK, now monitoring this system.
# pmm-admin ls
pmm-admin 1.13.0
PMM Server      | 10.178.1.252  
Client Name     | pmm-client-hostname
Client Address  | 10.178.1.252  
Service Manager | linux-upstart
-------------- -------------------- ----------- -------- ------------ --------
SERVICE TYPE   NAME                 LOCAL PORT  RUNNING  DATA SOURCE  OPTIONS  
-------------- -------------------- ----------- -------- ------------ --------
linux:metrics  pmm-client-hostname  42000       YES      -

Notice that the whole list of default collectors has to be re-enabled. Also, don’t forget to specify the directory for reading files with the metrics (–collector.textfile.directory=”/tmp”). The exporter reads files with the extension .prom

Add a crontab task

The second step is to add a crontab task to collect metrics and place them into a file.

Here are the cron commands for collecting the number of running and stopping docker containers.

*/1 * * * *     root   echo -n "" > /tmp/docker_all.prom; /usr/bin/docker ps -a | sed -n '1!p'| /usr/bin/wc -l | sed -ne 's/^/node_docker_containers_total /p' >> /tmp/doc
ker_all.prom;
*/1 * * * *     root   echo -n "" > /tmp/docker_running.prom; /usr/bin/docker ps | sed -n '1!p'| /usr/bin/wc -l | sed -ne 's/^/node_docker_containers_running_total /p' >>
/tmp/docker_running.prom;

The result of the commands is placed into the files 

/tmp/docker_running.prom

and

/tmp/docker_running.prom

and read by exporter.

Look - we got a new metric!

Adding the crontab tasks by using a script

Also, we have a few bash scripts that make it much easier to add crontab tasks.

The first one allows you to collect the logged-in users and the size of Innodb data files.

Modifying the cron job - a script

You may use the suggested names of files and metrics or set new ones.

The second script is more universal. It allows us to get the size of any directories or files. This script can be placed directly into a crontab task. You should just specify the list of monitored instances (e.g. /var/log /var/cache/apt /var/lib/mysql/ibdata1)

echo  "*/5 * * * * root bash  /root/object_sizes.sh /var/log /var/cache/apt /var/lib/mysql/ibdata1"  > /etc/cron.d/object_size

So, I hope this has provided useful insight into how to set up the collection of new PMM metrics without the need to write code. Please feel free to use the scripts or configure commands similar to the ones provided above.

More resources you might enjoy

If you are new to PMM, there is a great demo site of the latest version, showing you those out of the box metrics. Or how about our free webinar on monitoring Amazon RDS with PMM?

The post Extend Metrics for Percona Monitoring and Management Without Modifying Code appeared first on Percona Database Performance Blog.

Powered by WordPress | Theme: Aeros 2.0 by TheBuckmaker.com