How do you manage overdue checks?

0xliam · July 8, 2020, 5:06am

Hi there,

I’ve posted about this previously, but hoping to get a fresh opinion on it.

I’ve got a number of remote satellites, and checks often stay “overdue” for a long time without ever transitioning to an Unknown state.

How should I be dealing with these? Ideally, we would have something that changes the status to Unknown once a check becomes overdue, or significantly overdue.

Interested to hear everyone’s thoughts.

Solkren · July 8, 2020, 12:54pm

Hello @0xliam, are you looking for a way to get notified about these kinds of checks?
If so, aggregate checks might be useful for you, for example: https://github.com/danieldreier/icinga2-aggregated-check
In my experience, overdue checks usually signal an anomaly or an overload on the server side (the scheduler can’t handle the number of events it has to process).

P.S. The script above may need some modification to track the specific attributes that matter for overdue checks.
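
To illustrate the kind of attribute tracking meant here, this is a minimal Python sketch, not the linked script. It assumes each check is available as a plain dict with a `name` and a `next_check` Unix timestamp; the function names, the `grace_seconds` threshold, and the sample data are all hypothetical. The payload fields are modeled on the Icinga 2 REST API `process-check-result` action, and nothing is actually sent over the network.

```python
import time

UNKNOWN = 3  # plugin exit status conventionally used for UNKNOWN


def find_overdue(checks, now=None, grace_seconds=300):
    """Return (name, overdue_by) for checks whose next_check is too far in the past.

    `checks` is assumed to be a list of dicts with "name" and "next_check"
    (Unix timestamp) keys; this shape is an assumption for the sketch.
    """
    if now is None:
        now = time.time()
    overdue = []
    for check in checks:
        overdue_by = now - check["next_check"]
        if overdue_by > grace_seconds:
            overdue.append((check["name"], overdue_by))
    return overdue


def unknown_result_payload(service_name, overdue_by):
    """Build a request body that would mark a service as UNKNOWN.

    The fields are modeled on the process-check-result API action;
    verify them against your Icinga 2 version before using this.
    """
    return {
        "type": "Service",
        "service": service_name,
        "exit_status": UNKNOWN,
        "plugin_output": f"UNKNOWN - check overdue by {int(overdue_by)}s",
    }


if __name__ == "__main__":
    # Hypothetical sample data with a fixed "now" so the result is reproducible.
    sample = [
        {"name": "satellite1!ping4", "next_check": 600.0},
        {"name": "satellite2!ping4", "next_check": 900.0},
    ]
    for name, overdue_by in find_overdue(sample, now=1000.0, grace_seconds=300):
        print(unknown_result_payload(name, overdue_by))
```

The grace period is there so that a check running only slightly late is not flagged; how large it should be depends on your check intervals.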

aclark6996 · July 8, 2020, 3:19pm

Hello @0xliam,
I have experienced an overdue checks problem before. It was caused by the local Icinga agent not communicating correctly with the Icinga master server. Do you see any errors in the logs when this happens?
Please review the Health Checks section in the online documentation. Use the check command cluster-zone to confirm that the local Icinga agent did not lose its connection to the Icinga master server during the period when the checks were overdue.

Regards
Alex