I’ve been building out an Icinga2/IcingaWeb2 system and am impressed by the capabilities and configuration. However (the fly in the ointment) I’m seeing a lot of both service critical and host down notifications that (I think) should be blocked by dependencies.
My setup is fairly straightforward. A few servers depend on their
upstream router, which lies between the servers and a single Icinga2
instance. Each runs a “Nagios NRPE” (tcp-nrpe) Service that has an
implicit dependency on its host. In turn, each host has many NRPE
services that have an explicit dependency on “Nagios NRPE”, as well as
their implicit host dependency.
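In outline, the wiring looks something like this (a sketch only; the object names and the custom variable on the services are illustrative, not copied from my live config):

```
# Hypothetical sketch; names and vars are illustrative, not my real config.
apply Dependency "server-to-router" to Host {
  parent_host_name = "abcd-router"
  disable_notifications = true
  assign where host.vars.facility_code == "abcd"
  ignore where host.name == "abcd-router"
}

# Each NRPE-based service depends on the "Nagios NRPE" (tcp-nrpe) service
# on the same host. For "to Service" rules, parent_host_name defaults to
# the child's own host, so only the parent service needs naming.
apply Dependency "nrpe-services-to-tcp-nrpe" to Service {
  parent_service_name = "Nagios NRPE"
  disable_notifications = true
  assign where service.vars.uses_nrpe == true   // hypothetical flag
  ignore where service.name == "Nagios NRPE"
}
```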
The router host has only ping monitoring. When the router no longer
pings, I see two things quickly afterwards. “Nagios NRPE” notifies,
then the router notifies again, within a minute. I may or may not see
a few random services on the servers notify in that same timeframe.
Trying to follow best practices, I have the router on a shorter check and
retry time than the downstream servers.
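To illustrate the spacing (the intervals below are made up to show the shape, not my exact values):

```
# Illustrative only: the router is checked and retried faster than
# the servers that sit behind it.
object Host "abcd-router" {
  check_command = "hostalive"
  check_interval = 1m     // router rechecked most aggressively
  retry_interval = 30s
}

template Host "generic-server" {
  check_command = "hostalive"
  check_interval = 5m     // servers recheck more slowly than the router
  retry_interval = 2m
}
```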
I tried to force dependencies by making the implicit host dependencies explicit,
only to get 100% duplicate dependency error messages when checking config.
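For illustration, the rule I tried was shaped roughly like this (names invented here, not my exact config):

```
# Roughly the shape of what I tried (illustrative): spelling out each
# service's implicit dependency on its own host. Note parent_host_name
# already defaults to the child's host for "to Service" rules, so this
# restates what Icinga sets up implicitly.
apply Dependency "service-to-own-host" to Service {
  parent_host_name = host.name
  disable_notifications = true
  assign where host.vars.facility_code == "abcd"
}
```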
As a result of this set of behaviors, Icinga notifies at roughly 3
times the rate of the older Nagios system.
I’ll summarize my configs in a moment. Thank you for taking a look.
assign where host.vars…facility_code == “abcd”
ignore where host.name == “abcd-router”
}
Maybe a minor quibble, but if the “…” in that stanza above is a direct copy-
n-paste from your live config, it could be a contributor to your problems…
to the Dependency to Host object for the router. No change; I still merrily get notified by all kinds of services on hosts that have a dependency.
This issue has gotten too bad to ignore. We are still getting 10:1 notifications for hosts and services behind down routers versus our older monitoring solution. There’s no way we can work with that level of noise.
Hi @ken I can’t think of any reason why your question would be ignored. Dependencies are sometimes hard to deal with, so maybe no one had an idea how to tackle it. Please keep in mind that this is completely run by the community / volunteers. If you need professional support, please contact one of our partners:
Did you check your log files for any entries about dependencies? If dependencies are working, you should see them in the log, at least in the debug.log, but AFAIR in the regular icinga2.log as well.
Did you make sure that the parent objects are positively in critical state to fire the dependency?
Sometimes there’s a problem with the parent object getting online and the dependent object sending right away before it got rechecked. I think this was fixed in 2.11 but I’m not completely sure.
Debug mode logs over 5MB a minute, so it’s hard to catch things before we have to turn it off. It would be helpful to be able to turn on verbose for just a host or service and all objects it interacts with.
Parent objects are definitely in the critical state, but the ignore_soft_states settings I listed above should kick in while the parent is still soft. Not that it matters; it was notifying despite a critical parent before I made that change.
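For context, the knob I mean is ignore_soft_states on the Dependency object. Roughly (object names illustrative):

```
# With ignore_soft_states = false the dependency should already apply
# while the parent is in a SOFT critical state, instead of waiting for
# the state to become HARD (the default is ignore_soft_states = true).
apply Dependency "server-to-router" to Host {
  parent_host_name = "abcd-router"
  disable_notifications = true
  ignore_soft_states = false
  assign where host.vars.facility_code == "abcd"
}
```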
I understand the race condition at restart or a new host versus its services, but I’m seeing this with hosts and services that have been OK for quite some time.
Over the past 24 hours I’ve made another change, I added times.begin of 90s to host notifications and 2m to services notifications. I also moved the times.begin on routers from 6m to 60s.
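In sketch form, the spacing now looks like this (the notification templates and user names are placeholders, not my real ones):

```
# Placeholder templates/users; the times.begin values match what I set.
apply Notification "router-mail" to Host {
  import "mail-host-notification"   // hypothetical template
  users = [ "oncall" ]              // hypothetical user
  times.begin = 60s                 // routers: notify soonest
  assign where host.name == "abcd-router"
}

apply Notification "host-mail" to Host {
  import "mail-host-notification"
  users = [ "oncall" ]
  times.begin = 90s                 // hosts: hold notifications 90s
  assign where host.vars.facility_code == "abcd"
  ignore where host.name == "abcd-router"
}

apply Notification "service-mail" to Service {
  import "mail-service-notification"
  users = [ "oncall" ]
  times.begin = 2m                  // services: hold notifications 2m
  assign where host.vars.facility_code == "abcd"
}
```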
So you are saying nothing else is notifying anymore, regardless of whether dependencies are in place or not?
I don’t know of an option times.delay, just times.begin and times.end: https://icinga.com/docs/icinga2/latest/doc/09-object-types/#notification Maybe you have an error in your configuration? Why would you add times in the first place? To remedy the problem with notifications firing despite the object having a failed dependency?
I regularly dump icinga2 object list --type notification and icinga2 object list --type dependency to make sure dependencies exist and are linked to the right parents.
I added times.begin to give icinga2 more time to notice that these dependencies exist, since it is clearly completely ignoring them. Silly, I know, it already knows they are there…
OK I edited my post above, host notifications were broken overnight, so times.begin probably didn’t fix anything.
I was reading this section of the Docs. First it creates host to Master dependencies, but then it implies that a host dependency on the “Master” does not stop notifications from the services of that host, and then creates service dependencies that rely directly on the Master.
Is this implication true? Dependencies don’t chain? If so, disappointing, but then all I have to do is create some from the services to the router.
You can extend this example, and make your services depend on the master.example.com host too. Their local scope allows you to use host.vars.vm_parent similar to the example above.
apply Dependency "vm-service-to-parent-master" to Service {
  parent_host_name = host.vars.vm_parent
  assign where "generic-vm" in host.templates
}
That way you don’t need to wait for your guest hosts becoming unreachable when the master host goes down. Instead the services will detect their reachability immediately when executing checks.
I created some Dependencies that link a host’s services directly to the router, skipping over the host they run on. That successfully quieted notifications for those services when the router flapped. Now I need to do that across all such services; fortunately, most of them already lump under NRPE, and that service-to-service dependency already works fine.
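In sketch form, the workaround looks like this (names are illustrative):

```
# Each service depends directly on the upstream router, skipping over
# the host it runs on, since host-to-host dependencies alone did not
# quiet the services in my setup.
apply Dependency "service-to-router" to Service {
  parent_host_name = "abcd-router"
  disable_notifications = true
  assign where host.vars.facility_code == "abcd"
  ignore where host.name == "abcd-router"
}
```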
While I’ve found a workaround, it is disappointing that a service doesn’t automatically mute notifications when the host it runs on is muted by a host-to-host (router) dependency.
As mentioned above, I added the below text, which, if I read the Docs right, should take care of disabling notifications while the host is still in a soft state. Or is there something else I missed? I usually get the router DOWN notification followed immediately, and over the next few minutes, by several service timeouts. Sometimes one service timeout comes before the router DOWN; that can be expected just due to random timing.
I also have the router with the shortest retry times, the host slightly longer retry times, and the service longer still. And the same kind of spacing with the times.begin on the notification, router short, host longer, services longest.