We’ve been seeing regular, consistent failures across multiple services in our portal. They last for approximately one minute and then all recover. We have 200 hosts / 1900 services, and only a portion of them fail, specifically the “ping4” checks. Most of the time this isn’t much of a problem, because the quick recovery means no notifications are triggered. Over the last few days, however, one or two of these failures in the early hours have called out to PagerDuty, with downtime lasting approximately 10 minutes. So we are now keen to find the root cause and get this resolved.
I suspected the issue might be related to this topic from two years ago:
So I replaced the “hostalive” host check in template.conf with a “dummy” check to reduce the execution time for hosts. Execution time is now 0.365, down from 4.7, but the problem persists. The icinga.log doesn’t seem to provide any clues.
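For reference, the change was roughly along these lines. This is a minimal sketch rather than a copy of our actual file: the template name "generic-host" and the attempt/interval values are assumptions for illustration, while "hostalive", "dummy", `dummy_state` and `dummy_text` are the standard names from the Icinga Template Library.

```
template Host "generic-host" {
  // Illustrative values only, not our real settings
  max_check_attempts = 3
  check_interval = 1m
  retry_interval = 30s

  // Previously: check_command = "hostalive"
  check_command = "dummy"

  // The dummy check always returns this state (0 = OK/UP) and text
  vars.dummy_state = 0
  vars.dummy_text = "Host check replaced with dummy"
}
```

With this in place the host check no longer sends any pings, so the host always reports UP; the “ping4” service checks are unaffected and still run as before.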
Apologies for the lack of detail. Please let me know what additional information would help troubleshoot this one.