I’ve posted about this previously, but I’m hoping to get a fresh opinion on it.
I’ve got a number of remote satellites, and checks often stay “overdue” for a long time without transitioning to an Unknown state.
How should I be dealing with these? Ideally, we would have something that changes the status to Unknown once a check becomes overdue, or significantly overdue.
Hello @0xliam, are you looking for a way to get notified about this type of check?
If yes, aggregate checks might be useful for you. Example: https://github.com/danieldreier/icinga2-aggregated-check
In my experience, overdue checks usually signal an anomaly or an overload on the server side (the scheduler can’t handle the number of events it has to process).
P.S. The script above may need some modification to track the specific attributes that indicate overdue checks.
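As a rough illustration of which attributes such a modification could look at, here is a minimal Python sketch. It assumes the service list has already been fetched from the Icinga 2 REST API (`/v1/objects/services`), where each result carries a `name` and an `attrs.next_check` Unix timestamp. The function names, the 300-second grace period and the sample data are hypothetical. The payload is shaped for the `process-check-result` API action with `exit_status` 3 (Unknown); whether a passive result submitted this way behaves as desired for objects in a satellite zone would need to be verified in your own setup.

```python
import time


def find_overdue(services, now, grace_seconds=300.0):
    """Return (name, seconds_late) for services whose next_check is
    more than grace_seconds in the past.

    `services` is assumed to look like the "results" list returned by
    GET /v1/objects/services: dicts with "name" and "attrs".
    The grace period is an arbitrary illustrative threshold.
    """
    overdue = []
    for svc in services:
        seconds_late = now - svc["attrs"]["next_check"]
        if seconds_late > grace_seconds:
            overdue.append((svc["name"], seconds_late))
    return overdue


def unknown_result_payload(service_name, seconds_late):
    """Build a request body for POST /v1/actions/process-check-result
    that would mark the service as Unknown (exit status 3).

    This only builds the dict; sending it is left to the caller.
    """
    return {
        "type": "Service",
        "filter": f'service.__name=="{service_name}"',
        "exit_status": 3,
        "plugin_output": f"UNKNOWN - check is {seconds_late:.0f}s overdue",
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for a real API response.
    now = time.time()
    sample = [
        {"name": "sat1!disk", "attrs": {"next_check": now - 3600}},
        {"name": "sat1!load", "attrs": {"next_check": now + 60}},
    ]
    for name, late in find_overdue(sample, now):
        print(unknown_result_payload(name, late))
```

Detection and payload building are kept separate from any HTTP call so the logic can be tried against saved API output before anything is pointed at a live master.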
Hello @0xliam,
I have experienced a problem with overdue checks before. It was caused by the local Icinga agent not communicating correctly with the Icinga master server. Do you see any errors in the logs when this happens?
Please review the Health Checks section in the online documentation. Use the cluster-zone check command to confirm that the local Icinga agent did not lose its connection to the Icinga master server during the period when the checks were overdue.