Service dependency, soft/hard states and timers


I want to ask the community how they solve some possible flapping/timing problems while checking hosts and services, especially when hosts change to a soft DOWN state.

My usual host/service check intervals are 300s, with a retry interval of 100s and 3 attempts. There is an implicit dependency between a host and its services. I didn’t find any information about what happens while a host is in a soft DOWN state - it seems that services are still checked when the host is already down. This leads to unnecessary critical service soft states, which also produce logs and performance data, and the chance that a service switches to a flapping state is getting higher. IMHO the freshness timer will not help in this case.
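For reference, the timings described above would correspond to templates roughly like this (the template names are illustrative, not from my actual config):

```icinga2
template Host "generic-host-5m" {
  check_interval = 300s   // regular check every 5 minutes
  retry_interval = 100s   // faster rechecks while in a soft state
  max_check_attempts = 3  // 3 failed attempts before the state becomes hard
}

template Service "generic-service-5m" {
  check_interval = 300s
  retry_interval = 100s
  max_check_attempts = 3
}
```

With these values, a host needs roughly 200s after the first failed check (two more retries) to reach a hard DOWN state, and its services keep being scheduled during that window.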

What’s the best way to avoid these scenarios?

Just increasing the host check interval could lead to false alerts and flapping states. Flapping states are very bad for umbrella monitoring systems because there is no clear state.
I’m thinking about a setting to suppress service checks while a host is in a soft DOWN state, and also about a hold-down timer when a host changes back to UP, to avoid service alerts (i.e. when services need some time to come up after a host restart).



Do you apply the services to hosts?

Yes, of course the services are applied to hosts.

I’m monitoring thousands of hosts and services for several customers. With a higher degree of automation it is necessary to avoid false alerts caused by instabilities in customer networks and power supplies.

Did you already check your service templates/dependencies in relation to check_interval or retry_interval?

We try to optimize the timings, e.g. by using longer retry intervals or changing the number of retries needed for hard states. Icinga2 uses internal mechanisms to plan check scheduling, which is not really controllable by users.

An example for a problematic scenario:
A power outage on a hypervisor: the host and its running VMs become unreachable immediately.
Depending on the schedule of the host checks, it takes some time until the monitoring realizes that the hosts are gone (even with correct parent-child host dependencies). Before the hypervisor host reaches a hard DOWN state, there will be several events with hosts down and services critical.
Of course you can use shorter timing values for parent hosts or decrease the number of retries, but this does not really scale in larger environments with multiple network hops/parents on the way to the target hosts.
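One thing that helps in this scenario is an explicit Dependency object on top of the parent-child relation, so the VM hosts stop being checked as soon as the hypervisor fails its first check. A sketch (assuming a custom variable `vars.hypervisor` on each VM host holds the parent host name - adjust to your own naming):

```icinga2
// Sketch: make VM host checks depend on their hypervisor.
// "vars.hypervisor" is an assumed custom variable, not a built-in attribute.
apply Dependency "vm-depends-on-hypervisor" to Host {
  parent_host_name = host.vars.hypervisor
  disable_checks = true          // stop checking VM hosts while the hypervisor is down
  disable_notifications = true   // and suppress their notifications
  ignore_soft_states = false     // react already on the hypervisor's soft DOWN state
  assign where host.vars.hypervisor
}
```

With `ignore_soft_states = false` the dependency already takes effect during the hypervisor’s soft DOWN phase, instead of waiting for the hard state.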

When power is back, the hypervisor will be pingable soon after server startup, but the hypervisor services and virtual machines need some minutes to start (VMs are usually started sequentially…).
At the moment the hypervisor becomes reachable, all VMs (and their services) are down from Icinga’s view.
During this period you will also get events or even alerts until all services are up.
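For the startup window, one pattern that might help is to hang the VM services not on the hypervisor host check (ping), but on a hypervisor *service* that only turns OK once the platform is fully up. A sketch, where "hypervisor-api" is a hypothetical service name and `vars.hypervisor` an assumed custom variable:

```icinga2
// Sketch: suppress VM service notifications until the hypervisor platform
// (not just its ping) is healthy again. Names are illustrative.
apply Dependency "wait-for-hypervisor-api" to Service {
  parent_host_name = host.vars.hypervisor
  parent_service_name = "hypervisor-api"
  disable_notifications = true   // no alerts while the platform is still booting
  states = [ OK ]                // dependency is only fulfilled when the parent is OK
  assign where host.vars.hypervisor
}
```

This is not a real hold-down timer, but it moves the "platform is up" decision from a bare ping to a check that better reflects when the VMs can realistically be expected to answer.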

I am seeing the same issue with my configuration.

Besides the retry interval and number of retries, I tried tuning the check interval to be longer for dependent/child objects (hosts and services), and/or shorter for the parent (hypervisor or network device). This reduced the number of child alerts on parent recovery, but did not fully fix the problem.

As you noted, there is no real control over the actual check scheduling beyond “check now”.


I think something like this could help:

apply Dependency "disable-host-service-checks" to Service {
  disable_checks = true
  ignore_soft_states = false
  assign where true
}
This would prevent the service checks from executing as soon as the first check fails and the host goes into a soft DOWN state.

Based on those: