I'd like to ask the community how you handle possible flapping/timing problems when checking hosts and services, especially when a host changes to the soft down state.
My usual host/service check interval is 300s, with a retry interval of 100s and 3 attempts. There is an implicit dependency between a host and its services. I couldn’t find any information on what happens while a host is in the soft down state. It seems that services are still checked even though the host is already down. This leads to unnecessary critical soft states for the services, which also produce log entries and performance data, and it raises the chance that a service switches to the flapping state. IMHO the freshness timer will not help in this case.
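To put rough numbers on this, here is a small Python sketch of the timing. It assumes that the first failing host check counts as attempt 1 and that service checks are spread evenly over their interval; both are assumptions on my side, and the function names are made up for illustration.

```python
def soft_window_seconds(retry_interval: float, max_attempts: int) -> float:
    """Time between the first failed check and the check that makes the state hard.

    Assumption: the first failed check is attempt 1, so only
    (max_attempts - 1) retries are needed to reach the hard state.
    """
    return retry_interval * (max_attempts - 1)


def chance_of_service_check(window: float, service_check_interval: float) -> float:
    """Rough chance that one regular service check falls inside the window.

    Assumption: the service's next check time is uniformly distributed
    over its check interval.
    """
    return min(1.0, window / service_check_interval)


window = soft_window_seconds(retry_interval=100, max_attempts=3)
chance = chance_of_service_check(window, service_check_interval=300)
print(f"soft down window: {window:.0f}s, chance per service: {chance:.0%}")
```

So with my settings the host stays in soft down for about 200s, and under these assumptions roughly two out of three services get a regular check in that window.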
What’s the best way to avoid these scenarios?
Increasing just the host check interval could lead to false alerts and flapping states. Flapping states are very bad for umbrella monitoring systems because there is no clear state.
I’m thinking about a setting to suppress service checks while the host is in soft down, and also about a hold-down timer when the host changes back to up, to avoid service alerts (e.g. when a service needs some time to come up after a host restart).
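To make the idea more concrete, this is the kind of decision logic I have in mind, as a minimal Python sketch. It is not an existing setting of any monitoring tool that I know of; `HostStatus`, `service_check_allowed` and `hold_down_seconds` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostStatus:
    is_up: bool
    # Seconds since the host last changed back to up; None if it never went down.
    seconds_since_recovery: Optional[float] = None


def service_check_allowed(host: HostStatus, hold_down_seconds: float) -> bool:
    """Decide whether a service check should run right now (hypothetical rule)."""
    # Suppress checks as soon as the host is down, soft or hard.
    if not host.is_up:
        return False
    # Hold-down: give services time to start after the host came back.
    if (host.seconds_since_recovery is not None
            and host.seconds_since_recovery < hold_down_seconds):
        return False
    return True


print(service_check_allowed(HostStatus(is_up=False), hold_down_seconds=120))
print(service_check_allowed(HostStatus(is_up=True, seconds_since_recovery=30), 120))
print(service_check_allowed(HostStatus(is_up=True, seconds_since_recovery=300), 120))
```

The first two calls would suppress the check (host down, then host up but still inside the hold-down time), and the third would allow it again.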