I’ve been building out an Icinga2/IcingaWeb2 system and am impressed by the capabilities and configuration. However (the fly in the ointment) I’m seeing a lot of both service critical and host down notifications that (I think) should be blocked by dependencies.
My setup is fairly straightforward. A few servers depend on their
upstream router, which lies between the servers and a single Icinga2
instance. Each runs a “Nagios NRPE” (tcp-nrpe) Service that has an
implicit dependency on its host. In turn, each host has many NRPE
services that have an explicit dependency on “Nagios NRPE”, as well as
their implicit host dependency.
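In outline, the wiring looks something like this (a sketch only; the object names and the custom variable on the services are illustrative, not copied from my live config):

```
# Hypothetical sketch; names and vars are illustrative, not my real config.
apply Dependency "server-to-router" to Host {
  parent_host_name = "abcd-router"
  disable_notifications = true
  assign where host.vars.facility_code == "abcd"
  ignore where host.name == "abcd-router"
}

# Each NRPE-based service depends on the "Nagios NRPE" (tcp-nrpe) service
# on the same host. For "to Service" rules, parent_host_name defaults to
# the child's own host, so only the parent service needs naming.
apply Dependency "nrpe-services-to-tcp-nrpe" to Service {
  parent_service_name = "Nagios NRPE"
  disable_notifications = true
  assign where service.vars.uses_nrpe == true   // hypothetical flag
  ignore where service.name == "Nagios NRPE"
}
```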
The router host has only ping monitoring. When the router no longer
pings, I see two things quickly afterwards. “Nagios NRPE” notifies,
then the router notifies again, within a minute. I may or may not see
a few random services on the servers notify in that same timeframe.
Trying to follow best practices, I have the router on a shorter check and
retry time than the downstream servers.
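To illustrate the spacing (the intervals below are made up to show the shape, not my exact values):

```
# Illustrative only: the router is checked and retried faster than
# the servers that sit behind it.
object Host "abcd-router" {
  check_command = "hostalive"
  check_interval = 1m     // router rechecked most aggressively
  retry_interval = 30s
}

template Host "generic-server" {
  check_command = "hostalive"
  check_interval = 5m     // servers recheck more slowly than the router
  retry_interval = 2m
}
```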
I tried to force dependencies by making the implicit host dependencies explicit,
only to get 100% duplicate dependency error messages when checking config.
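For illustration, the rule I tried was shaped roughly like this (names invented here, not my exact config):

```
# Roughly the shape of what I tried (illustrative): spelling out each
# service's implicit dependency on its own host. Note parent_host_name
# already defaults to the child's host for "to Service" rules, so this
# restates what Icinga sets up implicitly.
apply Dependency "service-to-own-host" to Service {
  parent_host_name = host.name
  disable_notifications = true
  assign where host.vars.facility_code == "abcd"
}
```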
As a result of this set of behaviors, Icinga notifies at roughly 3
times the rate of the older Nagios system.
I’ll summarize my configs in a moment. Thank you for taking a look.
assign where host.vars…facility_code == “abcd”
ignore where host.name == “abcd-router”
}
Maybe a minor quibble, but if the “…” in that stanza above is a direct copy-
n-paste from your live config, it could be a contributor to your problems…
to the Dependency to Host object for the router. No change; I still merrily get notified by all kinds of services on hosts that have a dependency.
This issue has gotten too bad to ignore. We are still getting 10:1 notifications for hosts and services behind down routers versus our older monitoring solution. There’s no way we can work with that level of noise.
Hi @ken I can’t think of any reason why your question would be ignored. Dependencies are sometimes hard to deal with, so maybe no one had an idea how to tackle it. Please keep in mind that this is completely run by the community / volunteers. If you need professional support, please contact one of our partners:
Did you check your log files for any entries about dependencies? If dependencies are working, you should see them in the log, at least in the debug.log, but AFAIR in the regular icinga2.log as well.
Did you make sure that the parent objects are positively in critical state to fire the dependency?
Sometimes there’s a problem with the parent object getting online and the dependent object sending right away before it got rechecked. I think this was fixed in 2.11 but I’m not completely sure.
Debug mode logs over 5MB a minute, so it’s hard to catch things before we have to turn it off. It would be helpful to be able to turn on verbose for just a host or service and all objects it interacts with.
Parent objects are definitely in the critical state, but the ignore_soft_states settings I listed above should kick in while the parent is still soft. Not that it matters; it was notifying despite a critical parent before I made that change.
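For context, the knob I mean is ignore_soft_states on the Dependency object. Roughly (object names illustrative):

```
# With ignore_soft_states = false the dependency should already apply
# while the parent is in a SOFT critical state, instead of waiting for
# the state to become HARD (the default is ignore_soft_states = true).
apply Dependency "server-to-router" to Host {
  parent_host_name = "abcd-router"
  disable_notifications = true
  ignore_soft_states = false
  assign where host.vars.facility_code == "abcd"
}
```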
I understand the race condition at restart or a new host versus its services, but I’m seeing this with hosts and services that have been OK for quite some time.
Over the past 24 hours I’ve made another change, I added times.begin of 90s to host notifications and 2m to services notifications. I also moved the times.begin on routers from 6m to 60s.
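In sketch form, the spacing now looks like this (the notification templates and user names are placeholders, not my real ones):

```
# Placeholder templates/users; the times.begin values match what I set.
apply Notification "router-mail" to Host {
  import "mail-host-notification"   // hypothetical template
  users = [ "oncall" ]              // hypothetical user
  times.begin = 60s                 // routers: notify soonest
  assign where host.name == "abcd-router"
}

apply Notification "host-mail" to Host {
  import "mail-host-notification"
  users = [ "oncall" ]
  times.begin = 90s                 // hosts: hold notifications 90s
  assign where host.vars.facility_code == "abcd"
  ignore where host.name == "abcd-router"
}

apply Notification "service-mail" to Service {
  import "mail-service-notification"
  users = [ "oncall" ]
  times.begin = 2m                  // services: hold notifications 2m
  assign where host.vars.facility_code == "abcd"
}
```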
So you are saying nothing else is notifying anymore, regardless of whether dependencies are in place or not?
I don’t know of an option times.delay, just times.begin and times.end: https://icinga.com/docs/icinga2/latest/doc/09-object-types/#notification Maybe you have an error in your configuration? Why would you add times in the first place? To remedy the problem with notifications firing despite the object having a failed dependency?
I regularly dump icinga2 object list --type notification and icinga2 object list --type dependency to make sure dependencies exist and are linked to the right parents.
I added times.begin to give icinga2 more time to notice that these dependencies exist, since it is clearly completely ignoring them. Silly, I know, it already knows they are there…
OK I edited my post above, host notifications were broken overnight, so times.begin probably didn’t fix anything.
I was reading this section of the Docs. First it creates host to Master dependencies, but then it implies that a host dependency on the “Master” does not stop notifications from the services of that host, and then creates service dependencies that rely directly on the Master.
Is this implication true? Dependencies don’t chain? If so, disappointing, but then all I have to do is create some from the services to the router.
You can extend this example, and make your services depend on the master.example.com host too. Their local scope allows you to use host.vars.vm_parent similar to the example above.
apply Dependency "vm-service-to-parent-master" to Service {
  parent_host_name = host.vars.vm_parent
  assign where "generic-vm" in host.templates
}
That way you don’t need to wait for your guest hosts becoming unreachable when the master host goes down. Instead the services will detect their reachability immediately when executing checks.
I created some Dependencies that link a host’s services directly to the router, skipping over the host they run on. That successfully quieted notifications for those services when the router flapped. Now I need to do that across all such services; fortunately, most of them already lump under NRPE, and that service-to-service dependency already works fine.
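In sketch form, the workaround looks like this (names are illustrative):

```
# Each service depends directly on the upstream router, skipping over
# the host it runs on, since host-to-host dependencies alone did not
# quiet the services in my setup.
apply Dependency "service-to-router" to Service {
  parent_host_name = "abcd-router"
  disable_notifications = true
  assign where host.vars.facility_code == "abcd"
  ignore where host.name == "abcd-router"
}
```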
While I’ve found a workaround, it is disappointing that a service doesn’t automatically mute notifications when the host it runs on is muted by a host-to-host (router) dependency.
As mentioned above, I added the below text, which, if I read the Docs right, should take care of disabling notifications while the host is still in a soft state. Or is there something else I missed? I usually get the router DOWN notification followed immediately, and over the next few minutes, by several service timeouts. Sometimes one service timeout comes before the router DOWN; that can be expected just due to random timing.
I also have the router with the shortest retry times, the host slightly longer retry times, and the service longer still. And the same kind of spacing with the times.begin on the notification, router short, host longer, services longest.