Check smartmon - NRPE S.M.A.R.T harddisk check
S.M.A.R.T. is a technology built into hard disks that lets them report on their own health. You can have Nagios monitor the health status and temperature of the hard disks in your servers using this port:
$ cat /usr/ports/net-mgmt/nagios-check_smartmon/pkg-descr
check_smartmon is a Nagios plug-in written in python that uses smartmontools to check disk health status and temperature.
Configuring Nagios
First I define a few new services on the Nagios server, in /usr/local/etc/nagios/objects/services.cfg. I define one service per disk name I want to check. If I have three servers with an ad0 drive, and one of them (host2 in the example below) also has an ad1 drive, I add service definitions for checking both ad0 and ad1:
# SMART ad0
define service {
    use                 generic-service
    host_name           host1,host2,host3
    service_description nrpe_check_smart_ad0
    check_command       check_nrpe2!check_smart_ad0
}

# SMART ad1
define service {
    use                 generic-service
    host_name           host2
    service_description nrpe_check_smart_ad1
    check_command       check_nrpe2!check_smart_ad1
}
Instead of adding the server hostnames to the service definitions directly, I could also have added the servers I want to check to groups called something like smart-ad0-servers, smart-ad1-servers etc., and then added the groups to the services, but for now I did it like this.
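The one-service-per-disk layout described above can be sketched in a few lines of Python. This is only an illustration, not part of the port: the HOST_DISKS inventory and both function names are hypothetical, and the output simply mirrors the service blocks shown earlier.

```python
# Hypothetical inventory: which disks each monitored host has.
HOST_DISKS = {
    "host1": ["ad0"],
    "host2": ["ad0", "ad1"],
    "host3": ["ad0"],
}


def hosts_per_disk(host_disks):
    """Invert a host -> disks mapping into disk -> sorted list of hosts."""
    result = {}
    for host, disks in host_disks.items():
        for disk in disks:
            result.setdefault(disk, []).append(host)
    return {disk: sorted(hosts) for disk, hosts in sorted(result.items())}


def service_definition(disk, hosts):
    """Render one Nagios service block for a single disk name."""
    return "\n".join([
        f"# SMART {disk}",
        "define service {",
        "    use                 generic-service",
        f"    host_name           {','.join(hosts)}",
        f"    service_description nrpe_check_smart_{disk}",
        f"    check_command       check_nrpe2!check_smart_{disk}",
        "}",
    ])


if __name__ == "__main__":
    # Print one service block per distinct disk name.
    for disk, hosts in hosts_per_disk(HOST_DISKS).items():
        print(service_definition(disk, hosts))
        print()
```

Grouping by disk name rather than by host keeps the number of service definitions equal to the number of distinct disk names, which is the same trade-off the hand-written config makes.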
Install the plugin
Install the port on each server you want to monitor:
sudo portmaster /usr/ports/net-mgmt/nagios-check_smartmon/
Fix sudo permissions
smartctl needs root permissions to query a disk, so the nagios user must be allowed to run the check_smartmon plugin (which calls smartctl) as root. I recommend using sudo for this purpose. I add the following to /usr/local/etc/sudoers on the servers being monitored:
nagios ALL=(ALL) NOPASSWD: /usr/local/libexec/nagios/check_smartmon -d /dev/ad*
nagios ALL=(ALL) NOPASSWD: /usr/local/libexec/nagios/check_smartmon -d /dev/da*
The first line is needed if you are checking IDE adX devices; the second line is needed if you are checking SCSI or USB daX devices. I normally just leave both of them in.
To test this, run the following command as a user who has sudo access, replacing ad10 with the name of the device you want to monitor:
$ sudo su -m nagios -c "sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad10"
OK: device is functional and stable (temperature: 36)
If you get a reply like the one above, everything works as intended.
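If you want to process the plugin's reply in a script, for example to log temperatures, the line can be taken apart as in the sketch below. The exit codes are the standard Nagios plugin ones (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The output format is an assumption based only on the sample line above, and parse_check_output is a hypothetical helper, not something the port provides.

```python
import re

# Standard Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Assumed output shape, inferred from the single sample line in this post:
#   "<STATUS>: <message> (temperature: <integer>)"
# The temperature part is treated as optional.
OUTPUT_RE = re.compile(
    r"^(?P<status>[A-Z]+): (?P<message>.*?)(?: \(temperature: (?P<temp>\d+)\))?$"
)


def parse_check_output(line):
    """Split one line of plugin output into status, exit code, message, temperature."""
    text = line.strip()
    match = OUTPUT_RE.match(text)
    if match is None or match.group("status") not in EXIT_CODES:
        # Anything that does not look like plugin output is reported as UNKNOWN.
        return {"status": "UNKNOWN", "exit_code": 3, "message": text, "temperature": None}
    temp = match.group("temp")
    return {
        "status": match.group("status"),
        "exit_code": EXIT_CODES[match.group("status")],
        "message": match.group("message"),
        "temperature": int(temp) if temp is not None else None,
    }


if __name__ == "__main__":
    print(parse_check_output("OK: device is functional and stable (temperature: 36)"))
```

Nagios itself relies on the plugin's exit code rather than the text, so a parser like this is only useful for side tasks such as graphing.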
Configuring NRPE
On the server being monitored, add the following lines to /usr/local/etc/nrpe.cfg (this example has both an ad0 and an ad1 drive):
command[check_smart_ad0]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad0
command[check_smart_ad1]=/usr/local/bin/sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad1
Remember to restart NRPE after changing the config:
sudo /usr/local/etc/rc.d/nrpe2 restart
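The NRPE command names, the sudoers patterns and the Nagios check_command arguments all have to agree on the device name. A small sketch of that relationship follows; nrpe_command and covered_by_sudoers are hypothetical helpers, and Python's fnmatchcase is used only as an approximation of how sudo matches the wildcards in the two sudoers lines above.

```python
from fnmatch import fnmatchcase

SUDO = "/usr/local/bin/sudo"
PLUGIN = "/usr/local/libexec/nagios/check_smartmon"

# Device patterns taken from the two sudoers lines in this post.
SUDOERS_PATTERNS = ["/dev/ad*", "/dev/da*"]


def nrpe_command(device):
    """Return the nrpe.cfg line for one device name such as 'ad0'."""
    return f"command[check_smart_{device}]={SUDO} {PLUGIN} -d /dev/{device}"


def covered_by_sudoers(device):
    """Approximate check that a sudoers pattern allows this device."""
    path = f"/dev/{device}"
    return any(fnmatchcase(path, pattern) for pattern in SUDOERS_PATTERNS)


if __name__ == "__main__":
    for device in ["ad0", "ad1"]:
        if covered_by_sudoers(device):
            print(nrpe_command(device))
        else:
            print(f"# {device}: no matching sudoers line, NRPE check would fail")
```

The name inside command[...] is what the Nagios server passes after check_nrpe2!, so check_smart_ad0 here has to match check_nrpe2!check_smart_ad0 in services.cfg.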