Checks behaving like volatile checks though not configured as such

Hi all,

I just stumbled upon a “bunch of problems” in a customer's system.

The system is “constantly” sending notifications for checks that have already sent a notification for their current problem.
And it looks like checks that are in a non-OK state are behaving like volatile checks:

As you can see, the switch from CRITICAL to UNKNOWN immediately becomes a hard state (and vice versa), AND it triggers a re-notification.
Is this normal behavior?

volatile was set to false in the service template:
template Service "generic-service" {
    max_check_attempts = "3"
    check_interval = 5m
    retry_interval = 1m
    enable_notifications = true
    enable_active_checks = true
    enable_passive_checks = false
    enable_event_handler = true
    enable_flapping = true
    enable_perfdata = true
    volatile = false
}

The notification interval is set to 0, so there should be no re-notifications.

The notification rule is built to send 24/7 to three different contacts. Two of those contacts are limited to a non-office-hours time period.

Checked via the API: the attribute no_more_notifications is set to true, so normally there shouldn’t be any more notifications for this problem.
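For reference, a minimal sketch of that kind of check via the Icinga 2 REST API; the credentials are placeholders and the object name is assembled from the host, service and notification names used in this thread, so adjust both to the real environment:

# sketch: query the Notification object's runtime attributes via the REST API
curl -k -s -u apiuser:apipassword -H 'Accept: application/json' \
  'https://localhost:5665/v1/objects/notifications/HOSTNAME!disk-c!alle-Servicechecks?attrs=no_more_notifications&attrs=notification_number&attrs=last_notification' \
  | python3 -m json.tool   # pretty-print the JSON response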

apply Notification "alle-Servicechecks" to Service { 	 	 
	   import "mail-service-notification2" 	 	 
	 	 	 
	   period = "24x7" 	 	 
	   assign where match("*", service.name) && ! match("*cpu*", service.name) && service.vars.nrpe_command != "check_backupexec_backups" && !(service.vars.snmp_storage_name == "Physical Memory" || service.vars.snmp_storage_name == "Virtual Memory" || service.name == "Linux-Updates" || service.name == "LDAPS-Events") 	 	 
	   users = [ "User24/7", "User1_non_work", "User2_non_work" ] 	 	 
	} 

I just split the notification rules into separate ones for the 24/7 user and the non-work-hours users, so that notifications for the latter can be delayed (see the sketch below).
Maybe this will also fix this strange behavior; I will check over the next couple of days.
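The split looks roughly like this. It is only a minimal sketch assuming the template and user names from above: the rule names are placeholders (the per-user objects visible in the debug log further down come from the real rules), the assign-where expression is abbreviated, and the 15-minute begin delay corresponds to the “before specified begin time (15 minutes)” line in that log:

apply Notification "alle-Servicechecks_24x7" to Service {
    import "mail-service-notification2"

    period = "24x7"
    users = [ "User24/7" ]
    // the real rule uses the full assign-where expression shown above
    assign where match("*", service.name) && ! match("*cpu*", service.name)
}

apply Notification "alle-Servicechecks_non_work" to Service {
    import "mail-service-notification2"

    period = "24x7"
    users = [ "User1_non_work", "User2_non_work" ]
    times = {
        begin = 15m   // suppress notifications raised within the first 15 minutes of a problem
    }
    // the real rule uses the full assign-where expression shown above
    assign where match("*", service.name) && ! match("*cpu*", service.name)
}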

Any other ideas where to look are appreciated :slight_smile:
Cheers

Another example:

template Notification "mail-host-notification" {
    command = "mail-host-notification"
    interval = 0s
    period = "24x7"
    states = [ Down, Up ]
    types = [ Custom, Problem, Recovery ]
}

template Notification "mail-service-notification2" {
    command = "mail-service-notification"
    interval = 0s
    period = "24x7"
    states = [ Critical, OK, Warning ]
    types = [ Custom, Problem, Recovery ]
}

It did not.

Will update from 2.11.3 to 2.12 today, but I don’t expect it to change anything.

UPDATE:
Did the Icinga update to v2.12 (at around 09:30), but it didn’t help.
Also did a rm -fr /var/lib/icinga2/api/ to “kind of reset” the notifications.
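(A rough sketch of that kind of reset, done with the daemon stopped; I'm assuming here that the notification runtime counters such as notification_number and no_more_notifications are persisted in /var/lib/icinga2/icinga2.state rather than under api/, so clearing the state file would be the more targeted reset.)

# sketch only: stop the daemon before touching its runtime files
systemctl stop icinga2
mv /var/lib/icinga2/icinga2.state /var/lib/icinga2/icinga2.state.bak   # keep a backup instead of deleting
systemctl start icinga2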

So the notification at 09:51 was not a surprise, but after that the same behavior continues.

Debug log from 10:27:
[2020-09-16 10:27:25 +0200] notice/Process: PID 10336 ('/usr/lib/nagios/plugins/check_snmp_storage.pl' '-C' 'public' '-H' 'ip' '-c' '90' '-f' '-m' 'C:' '-t' '50' '-w' '80') terminated with exit code 2
[2020-09-16 10:27:25 +0200] debug/Checkable: Update checkable 'HOSTNAME!disk-c' with check interval '300' from last check time at 2020-09-16 10:27:25 +0200 (1.60024e+09) to next check time at 2020-09-16 10:32:19 +0200 (1.60025e+09).
[2020-09-16 10:27:25 +0200] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2020-09-16 10:27:25 +0200] notice/ApiListener: Relaying 'event::CheckResult' message
[2020-09-16 10:27:25 +0200] notice/Checkable: State Change: Checkable 'HOSTNAME!disk-c' hard state change from UNKNOWN to CRITICAL detected.
[2020-09-16 10:27:25 +0200] information/Checkable: Checkable 'HOSTNAME!disk-c' has 2 notification(s). Checking filters for type 'Problem', sends will be logged.
[2020-09-16 10:27:25 +0200] notice/Notification: Attempting to send notifications of type 'Problem' for notification object 'HOSTNAME!disk-c!alle-Servicechecks_User2'.
[2020-09-16 10:27:25 +0200] notice/Notification: Not sending notifications for notification object 'HOSTNAME!disk-c!alle-Servicechecks_User2': before specified begin time (15 minutes)
[2020-09-16 10:27:25 +0200] notice/Notification: Attempting to send notifications of type 'Problem' for notification object 'HOSTNAME!disk-c!alle-Servicechecks_User1'.
[2020-09-16 10:27:25 +0200] debug/Notification: Type 'Problem', TypeFilter: Custom, Problem and Recovery (FType=32, TypeFilter=104)
[2020-09-16 10:27:25 +0200] debug/Notification: State 'Critical', StateFilter: Critical, OK and Warning (FState=4, StateFilter=7)
[2020-09-16 10:27:25 +0200] debug/Notification: User 'ITPostfach' notification 'HOSTNAME!disk-c!alle-Servicechecks_User1', Type 'Problem', TypeFilter: Problem (FType=32, TypeFilter=104)
[2020-09-16 10:27:25 +0200] debug/Notification: User 'ITPostfach' notification 'HOSTNAME!disk-c!alle-Servicechecks_User1', State 'Critical', StateFilter: Critical, Down and Warning (FState=4, StateFilter=38)
[2020-09-16 10:27:25 +0200] information/Notification: Sending 'Problem' notification 'HOSTNAME!disk-c!alle-Servicechecks_User1' for user 'ITPostfach'
[2020-09-16 10:27:25 +0200] notice/Process: Running command '/etc/icinga2/scripts/mail-service-notification.sh' '-4' 'ip' '-6' '' '-b' '' '-c' '' '-d' '2020-09-16 10:27:25 +0200' '-e' 'disk-c' '-l' 'HOSTNAME' '-n' 'HOSTNAME' '-o' 'C:\ Label:  Serial Number a405ab64: 91%used(74301MB/81417MB) (>90%) : CRITICAL ' '-r' 'MAILADDRESS' '-s' 'CRITICAL' '-t' 'PROBLEM' '-u' 'disk-c': PID 10339
[2020-09-16 10:27:25 +0200] information/Notification: Completed sending 'Problem' notification 'HOSTNAME!disk-c!alle-Servicechecks_User1' for checkable 'HOSTNAME!disk-c' and user 'ITPostfach' using command 'mail-service-notification'.
[2020-09-16 10:27:25 +0200] notice/ApiListener: Relaying 'event::SendNotifications' message
[2020-09-16 10:27:25 +0200] notice/ApiListener: Relaying 'event::SetForceNextNotification' message
[2020-09-16 10:27:25 +0200] notice/ApiListener: Relaying 'event::NotificationSentToAllUsers' message
[2020-09-16 10:27:25 +0200] notice/ApiListener: Relaying 'event::NotificationSentUser' message
[2020-09-16 10:27:25 +0200] notice/Process: PID 10339 ('/etc/icinga2/scripts/mail-service-notification.sh' '-4' 'ip' '-6' '' '-b' '' '-c' '' '-d' '2020-09-16 10:27:25 +0200' '-e' 'disk-c' '-l' 'HOSTNAME' '-n' 'HOSTNAME' '-o' 'C:\ Label:  Serial Number a405ab64: 91%used(74301MB/81417MB) (>90%) : CRITICAL ' '-r' 'MAILADDRESS' '-s' 'CRITICAL' '-t' 'PROBLEM' '-u' 'disk-c') terminated with exit code 0

I’m still not sure whether this is correct behavior.
From my point of view there should be no re-notification after a change from
CRITICAL -> UNKNOWN -> CRITICAL, as there is no OK state in between, which is what would reset no_more_notifications to false.

Hm, according to this older post from @dnsmichi, the behavior is normal:

I will talk to the customer and propose switching to checks via an agent or satellite instead of SNMP over the VPN connection.
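Roughly what I have in mind, as a minimal sketch only; the check command, thresholds, and the host.vars.agent flag below are assumptions/placeholders, not the customer’s actual configuration:

// run the disk check on the Icinga agent itself instead of via SNMP over the VPN
apply Service "disk-c-agent" {
    import "generic-service"

    check_command = "disk-windows"       // CheckCommand shipped with the Windows agent plugins
    command_endpoint = host.name         // execute on the agent endpoint, not on the master
    vars.disk_win_path = "C:"
    vars.disk_win_warn = "20%"
    vars.disk_win_crit = "10%"

    assign where host.vars.agent == true   // hypothetical custom var marking agent-monitored hosts
}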

If anyone has any other ideas on how to reduce these kinds of messages in the meantime, feel free to comment :slight_smile: