Check smartmon - NRPE S.M.A.R.T harddisk check

S.M.A.R.T. is a technology built into hard disks that lets them report on their own health. You can have Nagios monitor the health status and temperature of the hard disks in your servers using this port:

$ cat /usr/ports/net-mgmt/nagios-check_smartmon/pkg-descr
check_smartmon is a Nagios plug-in written in python that uses
smartmontools to check disk health status and temperature.

Configuring Nagios

First I define a few new services on the Nagios server, in /usr/local/etc/nagios/objects/services.cfg. I define one service per disk name I want to check. In the example below, three servers (host1, host2 and host3) have an ad0 drive, and one of them (host2) also has an ad1 drive, so I add service definitions for checking both ad0 and ad1:

# SMART ad0
define service {
        use                             generic-service
        host_name                       host1,host2,host3
        service_description             nrpe_check_smart_ad0
        check_command                   check_nrpe2!check_smart_ad0
}

# SMART ad1
define service {
        use                             generic-service
        host_name                       host2
        service_description             nrpe_check_smart_ad1
        check_command                   check_nrpe2!check_smart_ad1
}

Instead of adding the server hostnames to the service definitions directly, I could also have added the servers I want to check to groups called something like smart-ad0-servers, smart-ad1-servers etc., and then added the groups to the services, but for now I did it like this.
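
The group-based alternative could look roughly like the sketch below. The group name smart-ad0-servers comes from the paragraph above; the alias text is made up for illustration, and I have not run this exact fragment, so treat it as a starting point rather than a tested configuration. A matching smart-ad1-servers group and service would follow the same pattern.

```
# Hypothetical host group collecting every server that has an ad0 drive
define hostgroup {
        hostgroup_name                  smart-ad0-servers
        alias                           Servers with an ad0 drive
        members                         host1,host2,host3
}

# SMART ad0, attached to the group instead of to individual hosts
define service {
        use                             generic-service
        hostgroup_name                  smart-ad0-servers
        service_description             nrpe_check_smart_ad0
        check_command                   check_nrpe2!check_smart_ad0
}
```

With this layout, adding a new server means adding it to the group's members line only; the service definition stays untouched.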

Install the plugin

Install the port on each server being monitored:

sudo portmaster /usr/ports/net-mgmt/nagios-check_smartmon/

Fix sudo permissions

The plugin runs smartctl, which needs root permissions to query the disks, so the nagios user must be allowed to run the plugin as root. I recommend using sudo for this purpose. I add the following to /usr/local/etc/sudoers on the servers being monitored:

nagios          ALL=(ALL) NOPASSWD: /usr/local/libexec/nagios/check_smartmon -d /dev/ad*
nagios          ALL=(ALL) NOPASSWD: /usr/local/libexec/nagios/check_smartmon -d /dev/da*

The first line is needed if you are checking IDE adX devices, the second line is needed if you are checking SCSI or USB daX devices. I normally just leave both of them in.

To test this, run the following command as a user who has sudo access, replacing ad10 with the name of the device you want to monitor:

$ sudo su -m nagios -c "sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad10"
OK: device is functional and stable (temperature: 36)

If you get a reply like the one above, everything works as intended.
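
Nagios decides the state of a check from the plugin's exit status rather than from the text it prints, so it can be useful to look at the exit status during this test as well. The helper below is a minimal sketch that maps an exit status to the conventional Nagios state name (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name nagios_state is my own and not part of the port.

```shell
#!/bin/sh
# Map a Nagios plugin exit status to its conventional state name.
# nagios_state is a hypothetical helper, not something the port installs.
nagios_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "OUT_OF_RANGE" ;;   # not a valid plugin exit status
    esac
}

# Example, using the same test command as above:
#   sudo su -m nagios -c "sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad10"
#   nagios_state $?
```

An OUT_OF_RANGE result usually means the plugin did not run at all, for example because sudo refused the command.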

Configuring NRPE

On the server being monitored, add the following lines to /usr/local/etc/nrpe.cfg (this example is for a server with both an ad0 and an ad1 drive):

command[check_smart_ad0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad0
command[check_smart_ad1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad1
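
On a server with many disks, typing these lines by hand gets tedious. The sketch below prints one line per device name in the same form as the two lines above. The function name nrpe_smart_lines is hypothetical, and the output still has to be pasted or appended into nrpe.cfg by hand.

```shell
#!/bin/sh
# Print one nrpe.cfg command line per device name given as an argument.
# nrpe_smart_lines is a hypothetical helper; the paths match the lines above.
nrpe_smart_lines() {
    for dev in "$@"; do
        printf 'command[check_smart_%s]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/%s\n' "$dev" "$dev"
    done
}

# Reproduces the two lines shown above
nrpe_smart_lines ad0 ad1
```

Each device name passed here also needs a matching service definition on the Nagios server, and must be covered by one of the sudoers lines.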

Remember to restart NRPE after changing the config:

sudo /usr/local/etc/rc.d/nrpe2 restart
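
Before waiting for Nagios to schedule the new services, the whole chain can be tried by hand from the Nagios server. The sketch below wraps the NRPE client plugin in a small function. The path to check_nrpe2 is an assumption about where the plugin is installed on the Nagios server (set CHECK_NRPE to override it), and smart_check_via_nrpe is a made-up name.

```shell
#!/bin/sh
# Assumption: check_nrpe2 is installed here on the Nagios server.
# Override by setting CHECK_NRPE in the environment.
CHECK_NRPE=${CHECK_NRPE:-/usr/local/libexec/nagios/check_nrpe2}

# Run one NRPE command on one host, then print and return the exit status.
smart_check_via_nrpe() {
    "$CHECK_NRPE" -H "$1" -c "$2"
    status=$?
    echo "exit status: $status"
    return "$status"
}

# Example: smart_check_via_nrpe host2 check_smart_ad1
```

If this prints the same kind of OK line as the local sudo test, the NRPE and sudo parts are both working and only the Nagios service definitions remain to be checked.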