Icinga and "frozen" host

MikeKall · October 13, 2021, 2:54pm

Hello,

I just faced a very weird problem/situation and I would like to have an expert’s opinion if that’s possible.

A host that was connected to Icinga froze: it couldn’t perform any checks and its local services weren’t working, but it was still pingable, so Icinga didn’t report it as down. The weird thing is that none of the services assigned to that host got a reply from the agent for half an hour, yet they didn’t change state and the checks didn’t time out (the check timeout is 3 minutes).
I only noticed because of some complaints; when I went to the dashboard, I saw that the last check result was 30 minutes old.

Shouldn’t the checks time out in that case?

Pooh · October 13, 2021, 3:27pm

A host which was connected to icinga froze, it couldn’t perform any checks
and its local services weren’t working but it was pingable so Icinga
didn’t report it as down.

The weird thing is that all the services which were assigned to that host
weren’t getting any replies from the agent for half an hour

Where are the checks being performed?

Are they running on the frozen host, or are they running on another machine
connecting over the network to get a response? I like to use a mixture.

but they weren’t changing their states and the checks didn’t time out (the
check timeout is 3 minutes). I noticed that due to some complains and I had
to go to the dashboard and see that the last report was 30 minutes ago.

Shouldn’t the checks timeout in that case?

I suspect the problem is that the agent which would have been capable of
reporting the timeout was frozen, just like everything else (except the
network stack) on the machine.

One thing I always try to do is have a remote check (i.e. one running on
something other than the host being checked) such as SSH or HTTP to use as an
indicator that the host is responding to external requests.

You can even use this in place of the “hostalive” ping check if you want to,
which can be a much more reliable means of identifying whether a machine is
“healthy”.
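
For illustration, here is a minimal Python sketch of what such a remote check boils down to. It is not Icinga's own plugin: the function name `ssh_banner_alive` and its defaults are made up for this example, and it assumes the SSH server sends its version banner as the first data on the connection. The point is that it waits for data from the daemon rather than for a completed TCP handshake alone, since the kernel can still accept connections (and answer pings) while user space is frozen.

```python
import socket


def ssh_banner_alive(host: str, port: int = 22, timeout: float = 10.0) -> bool:
    """Return True only if an SSH daemon sends its version banner in time.

    A completed TCP handshake alone is not enough: the kernel can accept
    connections while user space is frozen, so we wait for actual data.
    (Hypothetical helper for illustration, not part of Icinga.)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            banner = sock.recv(255)
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
    # Assumption: the version string is the first thing the server sends.
    return banner.startswith(b"SSH-")
```

In practice you would not need to write this yourself; a ready-made remote check such as the `ssh` or `http` check command does the same job, and the sketch only shows why it catches a frozen host that ping misses.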

You might also want to investigate the cluster-zone service check:

https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#cluster-zone-with-masters-and-agents

Antony.

MikeKall · October 13, 2021, 3:33pm

Yes, the checks were performed on the agent.

That would be better indeed.

Thanks for your input!

steaksauce · October 13, 2021, 8:17pm

Just a side note: we use monit to set up a watchdog for all of our endpoints. If any agent or master times out on the API, monit kicks off a restart of the service. Could be useful for automagically remediating Icinga 2 locking up on an agent/master.
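
To make the watchdog idea concrete, here is a rough Python sketch. It is not monit and not monit's configuration syntax; the `Watchdog` class, the `max_failures` threshold and the injected `probe` and `restart` callables are all hypothetical names for this example. The probe stands in for "does the API answer in time?" and the restart action for "restart the service".

```python
from typing import Callable


class Watchdog:
    """Restart a service after several consecutive failed probes.

    Illustrative only: `probe` returns True when the service answered in
    time, and `restart` performs whatever restart action you choose.
    """

    def __init__(self, probe: Callable[[], bool],
                 restart: Callable[[], None], max_failures: int = 3) -> None:
        self.probe = probe
        self.restart = restart
        self.max_failures = max_failures
        self.failures = 0

    def tick(self) -> bool:
        """Run one probe cycle; return True if a restart was triggered."""
        if self.probe():
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.restart()
            self.failures = 0  # start counting again after the restart
            return True
        return False
```

Requiring a few consecutive failures before restarting avoids bouncing the service on a single slow response. Note that a watchdog running on the same machine can only help if that machine itself is still running; it would not have rescued a fully frozen host.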

stevie-sy · October 14, 2021, 5:57am

Normally, if you are using the “cluster” or “cluster-zone” check, you should see when the agent is frozen.

rsx · October 14, 2021, 8:00am

I guess you mean the checks became overdue? You can see that here: /icingaweb2/dashboard?pane=Overdue
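
As a rough illustration of what "overdue" means here, the following Python sketch flags a check whose last result is older than its interval plus some slack. This is an assumption-laden simplification and not the logic Icinga Web actually uses; the function name `is_overdue` and the 30-second `grace` default are invented for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_overdue(last_check: datetime, check_interval: timedelta,
               now: Optional[datetime] = None,
               grace: timedelta = timedelta(seconds=30)) -> bool:
    """Return True if no fresh result arrived within interval + grace.

    Hypothetical helper: `grace` is extra slack for check latency.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_check > check_interval + grace
```

With a 5-minute interval, a last result from 30 minutes ago (as in the original post) counts as overdue, while one from exactly 5 minutes ago does not.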

MikeKall · October 14, 2021, 9:29am

Sorry for my stupid question, but what do you mean by monit?
Can you break it down a bit?

Pooh · October 14, 2021, 10:16am

Antony.