Nagios: Delay WARNING Notifications

I have a Nagios check that monitors the power drawn through our PDUs. While setting it up, I had to choose alert thresholds. The convention for disk capacity is to warn at 80% and go critical at 90%. I also had this gem to work with:

National Electric Code requires that the continuous current drawn from a branch circuit not exceed 80% of the circuit’s maximum rating. “Continuous current” is any load sustained continuously for at least 3 hours.

(Thanks to Mike Pennington, via http://serverfault.com/a/413307/72839)

So, I went with 80% warning, 90% critical.
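As a rough illustration of those thresholds, here is a minimal Python sketch that maps a measured current to a Nagios plugin state. It is not the actual check_sentry plugin: the function name classify_load, its parameters, and the 30 A circuit are hypothetical. It also assumes a load must strictly exceed a threshold to trip it, so a reading of exactly 80% still counts as OK.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_load(amps, rating_amps, warn_pct=80.0, crit_pct=90.0):
    """Map a measured current to a Nagios state using percent-of-rating thresholds.

    Hypothetical helper for illustration; not part of check_sentry.
    """
    if rating_amps <= 0 or amps < 0:
        return UNKNOWN  # nonsensical input, let Nagios flag it as UNKNOWN
    pct = 100.0 * amps / rating_amps
    if pct > crit_pct:
        return CRITICAL
    if pct > warn_pct:
        return WARNING
    return OK


if __name__ == "__main__":
    # Assumed 30 A circuit: 80% is 24 A, 90% is 27 A.
    for amps in (20.0, 25.0, 27.5):
        print(amps, classify_load(amps, 30.0))
```

On the assumed 30 A circuit, anything above 24 A warns and anything above 27 A is critical.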

Lately I have been getting a lot of warning notifications about circuits exceeding 80%. Ah, but the NEC says that is only a problem if they stay above 80% for three hours or more. So I dug through the Nagios documentation and split my check into two services:

# PDU load at 90% of circuit rating
define service{
    use                     generic-service
    hostgroup_name          pdus
    service_description     Power Load Critical
    notification_options    c,u,r
    check_command           check_sentry
    contact_groups          admins
    }

# PDU sustained load at 80% of circuit rating for 3 hours
define service{
    use                       generic-service
    hostgroup_name            pdus
    service_description       Power Load High
    notification_options      w,r
    first_notification_delay  180
    check_command             check_sentry
    contact_groups            admins
    }

The first service limits its notifications to critical, unknown, and recovery events. For the second, first_notification_delay covers the “don’t bug me unless it has been happening for three hours” caveat: the value is counted in time units of interval_length (60 seconds by default), so 180 means 180 minutes. I also set that service to notify only on warnings and recoveries.
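To make the delay concrete, here is a small Python sketch of the behavior I am counting on. It is a simplified model, not Nagios's actual notification logic: it ignores soft/hard states, retries, and re-notification. It assumes the delay clock starts when the service first leaves OK and resets on recovery. The function name first_notification_minute and the sample data are made up for illustration.

```python
INTERVAL_LENGTH_S = 60          # Nagios default interval_length, in seconds
FIRST_NOTIFICATION_DELAY = 180  # value from the "Power Load High" service


def first_notification_minute(samples,
                              delay_units=FIRST_NOTIFICATION_DELAY,
                              interval_length_s=INTERVAL_LENGTH_S):
    """Return the minute of the first WARNING notification, or None.

    samples: (minute, state) pairs in time order, where state is
    "OK", "WARNING" or "CRITICAL". Simplified model (assumption).
    """
    delay_min = delay_units * interval_length_s / 60  # 180 units -> 180 minutes
    problem_start = None
    for minute, state in samples:
        if state == "OK":
            problem_start = None  # recovery resets the clock
            continue
        if problem_start is None:
            problem_start = minute  # first non-OK result starts the clock
        # notification_options w,r: only WARNING results trigger a problem alert
        if state == "WARNING" and minute - problem_start >= delay_min:
            return minute
    return None


if __name__ == "__main__":
    # Hypothetical results, one check every 10 minutes.
    brief = [(m, "WARNING" if 30 <= m < 90 else "OK") for m in range(0, 300, 10)]
    sustained = [(m, "WARNING" if m >= 30 else "OK") for m in range(0, 300, 10)]
    print(first_notification_minute(brief))
    print(first_notification_minute(sustained))
```

In this model, a one-hour spike never notifies, while a load that stays above the warning threshold triggers its first alert three hours after it started.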

Categories: Technical