Understanding Late checks/check freshness

Hi there,

I’ve had a read through the documentation but I am seeking further clarification around check ‘freshness’.

We’ve got a three-tiered configuration: hosts connect to the satellite, and satellites connect to the masters (multi master <- satellite <- hosts).

We’ve identified a few problem satellites and devices that either do not seem to schedule their checks, or miss the schedule and never report back. For example, the output below is taken from a host whose host check command is ‘hostalive’:

Check execution

Command hostalive
Check Source satellite
Reachable yes
Last check on Oct 8 11:01
Next check on Oct 8 11:06 [Reschedule]
Check attempts 1/3 (hard state)
Check execution time 4.011s
Check latency 0.284538s

So if I understand correctly, host/service freshness determines whether to schedule a check. But in the above example the last check was at 11:10 AEST, and it’s currently 10:55 AEST on October 9.

Another host with the same template:

Last check 1d 0h ago
Next check in -1d 0h Reschedule
Check attempts 1/3 (hard state)
Check execution time 4.011s
Check latency 0.284538s

I’m sure there is a problem with the satellite, but I am unsure how we can be alerted when these host checks do not execute within a reasonable amount of time. For example, this check has not executed for almost 24 hours.

Is it possible to have these transition to an ‘UNKNOWN’ state?
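
To make concrete what I mean by "not executed within a reasonable amount of time", here is a minimal Python sketch of the kind of lateness test I have in mind. It is not how Icinga itself decides anything; the function name `is_check_late`, the 10 minute grace period, and the year in the timestamps are all placeholders I made up for illustration. Only the "next check" time and the current time come from the first host above.

```python
from datetime import datetime, timedelta


def is_check_late(next_check: datetime, now: datetime,
                  grace: timedelta = timedelta(minutes=10)) -> bool:
    """Return True if the scheduled check is overdue by more than `grace`.

    Hypothetical helper: `grace` is an arbitrary tolerance, not an Icinga setting.
    """
    return now - next_check > grace


# "Next check on Oct 8 11:06" and "currently 10:55 on October 9" from the post.
# The year is a placeholder; only the difference between the two times matters.
next_check = datetime(2019, 10, 8, 11, 6)
now = datetime(2019, 10, 9, 10, 55)

print(is_check_late(next_check, now))  # True: overdue by 23h 49m
```

In other words, I would like something on the master side to flag a host as soon as a test like this becomes true, rather than leaving it in its last known state.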

Strangely enough, it’s affecting about 15 hosts that use the ‘hostalive’ command as the host check, but it is not affecting any service checks, and it seems to be isolated to one particular satellite.

It was affecting a satellite we have running in Azure. I realised that satellite was on the latest client (we’re still running 2.10.5 on our masters), so I downgraded it to see if that resolves the issue. However, we saw the same or similar behaviour: the host check, which uses the ‘icinga’ command, stopped reporting after 20 minutes of running, but the satellite did not disconnect (we also monitor connectivity from the masters).