For some time now, a server that Icinga monitors has been reporting UNKNOWN on most of its checks.
The guy responsible for that server told me that he found 200+ hanging Icinga processes on it.
My questions are,
No, the only thing the log says after a process is killed is that the API client disconnected. It then creates a new connection to the master.
The OS is Windows Server 2012 R2.
The Icinga agent version is 2.12.3.
64 GB of RAM (40-something GB is free, so I don’t think that’s the problem).
It’s an agent.
In total there are 14 checks.
It’s a storage server. It runs some applications for managing the RAIDs and SMB, but I still don’t think that’s the problem because of answer number 9.
Most of the service checks are custom, and most of them were killed.
After the kill warning there is just an info message: information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
Nothing else. The complete kill message, which appears for every service, is this one:
warning/Process: Killing process group 8276 ('<command>') after timeout of 180 seconds
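In case it helps to compare machines, here is a rough sketch (plain Python, not something Icinga ships) that counts these kill warnings per command in an agent log. The function name `count_timeouts` is made up for illustration, and the regex is based only on the single message quoted above, so it may need adjusting for other log layouts.

```python
import re
from collections import Counter

# Matches the warning quoted above. Anything before "warning/Process"
# (such as a timestamp) is ignored because search() is used, not match().
KILL_RE = re.compile(
    r"warning/Process: Killing process group (\d+) "
    r"\('(.*)'\) after timeout of (\d+) seconds"
)


def count_timeouts(lines):
    """Return a Counter mapping each command string to its number of timeout kills."""
    counts = Counter()
    for line in lines:
        found = KILL_RE.search(line)
        if found:
            counts[found.group(2)] += 1
    return counts


if __name__ == "__main__":
    # Sample lines copied from the messages in this post.
    sample = [
        "warning/Process: Killing process group 8276 ('<command>') after timeout of 180 seconds",
        "information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);",
    ]
    for command, kills in count_timeouts(sample).most_common():
        print(kills, command)
```

Running it against the log of the affected agent and of one healthy agent would show whether the same commands time out elsewhere, just less often.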
There are 88 more machines exactly like this one, but none of the others has had the same problem.
Unfortunately I can’t think of anything that could be helpful…
The problem no longer persists because the process tree was killed manually, but the question of why this could happen, and whether it can be solved from the master side, remains.