A host that was connected to Icinga froze. It couldn’t perform any checks
and its local services weren’t working, but it was still pingable, so Icinga
didn’t report it as down.
The weird thing is that all the services assigned to that host
weren’t getting any replies from the agent for half an hour ...
Where are the checks being performed?
Are they running on the frozen host, or are they running on another machine
connecting over the network to get a response? I like to use a mixture.
... but their states weren’t changing and the checks didn’t time out (the
check timeout is 3 minutes). I only noticed because of some complaints; I had
to go to the dashboard to see that the last report was 30 minutes old.
Shouldn’t the checks time out in that case?
I suspect the problem is that the agent which would have been capable of
reporting the timeout was frozen, just like everything else (except the
network stack) on the machine.
One thing I always try to do is have a remote check (i.e. one running on something
other than the host being checked), such as SSH or HTTP, to use as an indicator
that the host is responding to external requests.
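
As a rough sketch of that idea in Icinga 2 configuration: the service name is made up, and this assumes the `ssh` CheckCommand from the Icinga Template Library plus host objects that live in the master (or a satellite) zone, so that the check is executed from there rather than by the agent on the monitored machine.

```
apply Service "ssh-remote" {
  // "ssh" comes from the Icinga Template Library and connects to host.address
  check_command = "ssh"

  // No command_endpoint is set, so the check is run by the zone that owns
  // the host object (master or satellite), not by the agent on the host
  check_interval = 1m

  assign where host.address
}
```

Because the SSH plugin waits for the server's banner, it should fail when sshd itself is no longer responding, even if the kernel still answers pings.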
You can even use this in place of the “hostalive” ping check if you want to,
which can be a much more reliable means of identifying whether a machine is
actually up.
You might also want to investigate the cluster-zone service check:
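
A minimal sketch of such a check, modeled on the agent health check pattern in the Icinga 2 distributed monitoring documentation. It assumes that each agent's zone is named after its host object and that agent hosts carry a custom variable `agent_endpoint`; both are conventions, not requirements, so adjust them to your setup.

```
apply Service "agent-health" {
  check_command = "cluster-zone"
  display_name = "agent-health-" + host.name

  // Assumption: the agent's zone name is the same as the host object name
  vars.cluster_zone = host.name

  // Assumption: agent hosts set vars.agent_endpoint (naming convention only)
  assign where host.vars.agent_endpoint
}
```

This check is evaluated by the parent zone, so it does not depend on the agent being able to report its own state.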