Nagios: Delay WARNING Notifications
I have a Nagios check that monitors the power drawn through our PDUs. While setting it up, I had to pick alert thresholds. The usual convention for disk capacity is to warn at 80% and go critical at 90%. I also had this gem to work with:
National Electric Code requires that the continuous current drawn from a branch circuit not exceed 80% of the circuit’s maximum rating. “Continuous current” is any load sustained continuously for at least 3 hours.
(Thanks to Mike Pennington, via http://serverfault.com/a/413307/72839)
So, I went with 80% warning, 90% critical.
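As a rough illustration of what those percentages mean in amps, here is a minimal Python sketch. The helper names and the 30 A rating are made up for the example; they are not part of my actual check.

```python
# Hypothetical helper: turn a circuit rating into warning/critical thresholds.
WARN_PERCENT = 80   # NEC continuous-load limit quoted above
CRIT_PERCENT = 90   # chosen by analogy with the disk-capacity convention


def load_thresholds(rating_amps):
    """Return (warning, critical) thresholds in amps for a circuit rating."""
    warning = rating_amps * WARN_PERCENT / 100
    critical = rating_amps * CRIT_PERCENT / 100
    return warning, critical


def classify(load_amps, rating_amps):
    """Map a measured load to the state a check would report."""
    warning, critical = load_thresholds(rating_amps)
    if load_amps >= critical:
        return "CRITICAL"
    if load_amps >= warning:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # 30 A is an assumed example rating, not a value from my setup.
    print(load_thresholds(30))   # (24.0, 27.0)
    print(classify(25, 30))      # WARNING
```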
Lately I have been getting a lot of warning notifications about circuits exceeding 80%. Ah, but the NEC says that is only a problem if the load stays at 80% for three hours or more. So I dug through the Nagios documentation and split my check into two services:
define service{
    # PDU load at 90% of circuit rating
    use                         generic-service
    hostgroup_name              pdus
    service_description         Power Load Critical
    notification_options        c,u,r
    check_command               check_sentry
    contact_groups              admins
}

define service{
    # PDU sustained load at 80% of circuit rating for 3 hours
    use                         generic-service
    hostgroup_name              pdus
    service_description         Power Load High
    notification_options        w,r
    first_notification_delay    180
    check_command               check_sentry
    contact_groups              admins
}
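The first service only sends notifications for critical and unknown states, plus recoveries. In the second, first_notification_delay should cover the “don’t bug me unless it has been happening for three hours” caveat: the value is in Nagios time units, which are minutes with the default interval_length of 60, so 180 is three hours. I also set that service to notify only on warnings and recoveries.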
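To sanity-check the delay, here is a small Python sketch of the arithmetic. It is a simplified model of how I understand first_notification_delay to behave, not Nagios's actual logic, and the function names are hypothetical.

```python
# Simplified model of first_notification_delay; not Nagios's actual code.
INTERVAL_LENGTH = 60            # seconds per time unit (Nagios default)
FIRST_NOTIFICATION_DELAY = 180  # time units, as in the "Power Load High" service


def delay_seconds(delay_units=FIRST_NOTIFICATION_DELAY,
                  interval_length=INTERVAL_LENGTH):
    """Convert a delay in Nagios time units to seconds."""
    return delay_units * interval_length


def should_notify(seconds_in_problem_state,
                  delay_units=FIRST_NOTIFICATION_DELAY,
                  interval_length=INTERVAL_LENGTH):
    """True once the problem has lasted at least as long as the delay."""
    return seconds_in_problem_state >= delay_seconds(delay_units, interval_length)


if __name__ == "__main__":
    print(delay_seconds() / 3600)    # 3.0 (hours)
    print(should_notify(2 * 3600))   # False: only two hours in
    print(should_notify(3 * 3600))   # True: three hours sustained
```

If interval_length in nagios.cfg has been changed from its default, the 180 would need adjusting to keep the delay at three hours.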