Check smartmon - NRPE S.M.A.R.T harddisk check

From TykWiki

S.M.A.R.T. is a technology built into hard disks that lets them report on their own health. You can have Nagios monitor the health status and temperature of the hard disks in your servers using this port:

$ cat /usr/ports/net-mgmt/nagios-check_smartmon/pkg-descr
check_smartmon is a Nagios plug-in written in python that uses
smartmontools to check disk health status and temperature.

Configuring Nagios

First I define a few new services on the Nagios server, in /usr/local/etc/nagios/objects/services.cfg, one service per disk name I want to check. In the example below, three servers (host1, host2 and host3) have an ad0 drive, and one of them (host2) also has an ad1 drive, so I add one service definition for ad0 and one for ad1:

# SMART ad0
define service {
        use                             generic-service
        host_name                       host1,host2,host3
        service_description             nrpe_check_smart_ad0
        check_command                   check_nrpe2!check_smart_ad0
}

# SMART ad1
define service {
        use                             generic-service
        host_name                       host2
        service_description             nrpe_check_smart_ad1
        check_command                   check_nrpe2!check_smart_ad1
}

Instead of listing the server hostnames in the service definitions directly, I could have put the servers I want to check into host groups called something like smart-ad0-servers, smart-ad1-servers etc., and assigned the services to those groups, but for now I did it like this.
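
With more hosts or disks, the service definitions above become repetitive to write by hand. As a rough sketch (not part of the original setup), a small Python script could generate them from a mapping of disk names to hosts. The DISK_HOSTS mapping and the function names are made up for illustration; the field values follow the definitions shown above.

```python
# Hypothetical inventory: which hosts have which disk.
# Host and disk names are taken from the example definitions above.
DISK_HOSTS = {
    "ad0": ["host1", "host2", "host3"],
    "ad1": ["host2"],
}


def service_definition(disk, hosts):
    """Return one Nagios service definition for the given disk name."""
    fields = [
        ("use", "generic-service"),
        ("host_name", ",".join(hosts)),
        ("service_description", f"nrpe_check_smart_{disk}"),
        ("check_command", f"check_nrpe2!check_smart_{disk}"),
    ]
    # Pad each key to 32 characters so the values line up in one column.
    body = "\n".join(f"        {key:<32}{value}" for key, value in fields)
    return f"# SMART {disk}\ndefine service {{\n{body}\n}}\n"


def render_services(disk_hosts):
    """Render all service definitions, sorted by disk name."""
    return "\n".join(
        service_definition(disk, hosts)
        for disk, hosts in sorted(disk_hosts.items())
    )


if __name__ == "__main__":
    print(render_services(DISK_HOSTS))
```

The output can be pasted into services.cfg, or written to a separate file that the Nagios configuration includes.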

Install the plugin

Install the port:

sudo portmaster /usr/ports/net-mgmt/nagios-check_smartmon/

Fix sudo permissions

The Nagios user needs permission to run the plugin, and through it the smartctl binary, with root privileges. I recommend using sudo for this purpose. I add the following to /usr/local/etc/sudoers on the servers being monitored:

nagios          ALL=(ALL) NOPASSWD: /usr/local/libexec/nagios/check_smartmon -d /dev/ad*
nagios          ALL=(ALL) NOPASSWD: /usr/local/libexec/nagios/check_smartmon -d /dev/da*

The first line is needed if you are checking IDE adX devices; the second line is needed if you are checking SCSI or USB daX devices. I normally just leave both of them in.
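
To get a feel for which device names the two wildcard rules cover, the matching can be approximated with Python's fnmatch module. This is only a sketch: it assumes sudo compares the joined argument string against the pattern in a glob-like way, as the sudoers manual describes, and sudo's own matcher may differ in detail. The function name and the --extra flag are hypothetical.

```python
from fnmatch import fnmatchcase

# Argument patterns copied from the two sudoers lines above.
SUDOERS_ARG_PATTERNS = ["-d /dev/ad*", "-d /dev/da*"]


def args_allowed(args):
    """Approximate check: does the joined argument string match a pattern?"""
    joined = " ".join(args)
    return any(fnmatchcase(joined, pattern) for pattern in SUDOERS_ARG_PATTERNS)


if __name__ == "__main__":
    print(args_allowed(["-d", "/dev/ad0"]))   # IDE disk
    print(args_allowed(["-d", "/dev/da1"]))   # SCSI or USB disk
    print(args_allowed(["-d", "/dev/sda"]))   # not covered by either rule
    # Under this approximation the trailing * also swallows extra arguments
    # ("--extra" is a made-up flag, used only to illustrate the point).
    print(args_allowed(["-d", "/dev/ad0", "--extra"]))
```

If the last case is a concern, listing each device explicitly in sudoers instead of using a wildcard avoids it.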

To test this, run the following command as a user who has sudo access, replacing ad10 with the name of the device you want to monitor:

$ sudo su -m nagios -c "sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad10"
OK: device is functional and stable (temperature: 36)

If you get a reply like the one above, everything works as intended.
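
If you want to check the reply in a script rather than by eye, the line can be split into its parts. The sketch below is based only on the single sample line shown above; other versions of the plugin, or other states, may format their output differently, so treat the pattern as an assumption. The function name is made up for illustration.

```python
import re

# Pattern derived from the sample line above:
#   "OK: device is functional and stable (temperature: 36)"
# The temperature part is treated as optional.
OUTPUT_RE = re.compile(
    r"^(?P<status>[A-Z]+): (?P<message>.*?)"
    r"(?: \(temperature: (?P<temp>-?\d+)\))?$"
)


def parse_check_output(line):
    """Return (status, message, temperature or None) for one output line."""
    match = OUTPUT_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised plugin output: {line!r}")
    temp = match.group("temp")
    return (
        match.group("status"),
        match.group("message"),
        int(temp) if temp is not None else None,
    )


if __name__ == "__main__":
    sample = "OK: device is functional and stable (temperature: 36)"
    print(parse_check_output(sample))
```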

Configuring NRPE

On the server being monitored, add the following lines to /usr/local/etc/nrpe.cfg (this example is for a server with both an ad0 and an ad1 drive):

command[check_smart_ad0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad0
command[check_smart_ad1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad1
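
The name in square brackets has to match the argument given after check_nrpe2! in the service definitions on the Nagios server. As a minimal sketch of that relationship, the following Python snippet reports any requested command names that are missing from an nrpe.cfg. It assumes that check_nrpe2 passes the text after the "!" to NRPE as the command name, which depends on how check_nrpe2 is defined on your Nagios server; the function names are hypothetical.

```python
import re

# The two NRPE lines from above, as they would appear in nrpe.cfg.
NRPE_CFG = """\
command[check_smart_ad0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad0
command[check_smart_ad1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad1
"""

# check_command values from the service definitions on the Nagios server.
SERVICE_CHECK_COMMANDS = [
    "check_nrpe2!check_smart_ad0",
    "check_nrpe2!check_smart_ad1",
]


def nrpe_command_names(cfg_text):
    """Collect the names defined by command[...] lines in nrpe.cfg text."""
    return set(re.findall(r"^command\[([^\]]+)\]=", cfg_text, flags=re.MULTILINE))


def missing_commands(check_commands, cfg_text):
    """Return requested NRPE command names that nrpe.cfg does not define."""
    defined = nrpe_command_names(cfg_text)
    # Assumption: everything after the first "!" is the NRPE command name.
    requested = [c.split("!", 1)[1] for c in check_commands if "!" in c]
    return [name for name in requested if name not in defined]


if __name__ == "__main__":
    print(missing_commands(SERVICE_CHECK_COMMANDS, NRPE_CFG))
```

An empty list means every service has a matching command on the monitored server.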

Remember to restart NRPE after changing the config:

sudo /usr/local/etc/rc.d/nrpe2 restart