Icinga dead processes

Hello all,

For some time now, a server that Icinga was monitoring has been reporting UNKNOWN on most of its checks.
The person responsible for that server told me he found 200+ Icinga processes hanging.
My questions are:

  • How can this be fixed from the side of icinga?
  • Why were there so many dead processes?

Thanks in advance,
Mike

Edit: typo

What do the Icinga logs on that machine show?

Antony.

There is a huge number of warning/Process: Killing process group log entries.
Nothing else helpful.

Do those lines say that the process group was being killed due to a timeout?

Also, are those lines followed by ones saying “Couldn’t kill the process
group” for any reason?

Please tell us something about the machine this is happening to, such as:

  1. What O/S and version is it running?

  2. Which version of Icinga is it running?

  3. How much RAM does it have?

  4. Is it a Satellite or an Agent?

  5. How many service checks is it running (either for itself, or for Agents as
    well if it is a Satellite)?

  6. If it’s a Satellite, how many Agents is it talking to?

  7. What applications is it running other than Icinga (Apache, MySQL, Asterisk,
    that sort of thing)?

  8. What is the service check shown in the log file which it tells you is being
    killed?

  9. Do you have any other machines in your network which are doing something
    similar but not having this problem?

  10. Anything else you can think of which might help us to know more about your
    setup?

Antony.

Yes, exactly that.

No, the only thing it says after killing the process is that the API client disconnected, and then it creates a new connection to the master.

  1. The OS is Windows Server 2012 R2.
  2. The Icinga agent version is 2.12.3.
  3. 64 GB of RAM (40-something GB of it free, so I don’t think that’s the problem).
  4. It’s an Agent.
  5. In total there are 14 checks.
  6. Not applicable, since it’s not a Satellite.
  7. It’s a storage server. It runs some applications for managing the RAIDs and SMB, but I still don’t think that’s the problem, because of my answer to question 9 :grin:
  8. Most of the service checks are custom, and most of them were being killed.
    After the killing warning message there is just an info message
    information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
    Nothing else. The complete killing message is the same for every service:
warning/Process: Killing process group 8276 ('<command>') after timeout of 180 seconds
  9. There are 88 more machines exactly like this one, but none of the others had the same problem.
  10. Unfortunately I can’t think of anything else that could be helpful…
    The problem does not persist any more; the process tree was killed manually, but the questions of why this could happen and whether it can be solved from the master side remain (a minimal config sketch for adjusting the timeout from the master follows below).
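
For reference, if the custom checks legitimately need more than the 180 seconds, the timeout can be raised from the master side rather than on the agent, using the CheckCommand `timeout` attribute or the `check_timeout` attribute on the Service. A minimal sketch, assuming hypothetical command, plugin and host-variable names (none of these are from this setup):

```
// Placeholder names throughout; adjust to your own commands and hosts.
object CheckCommand "my_custom_check" {
  command = [ "C:\\Program Files\\ICINGA2\\sbin\\check_custom.exe" ]
  timeout = 3m          // command-level timeout (the 180s seen in the log)
}

apply Service "custom-storage-check" {
  check_command = "my_custom_check"
  command_endpoint = host.name      // run the check on the agent itself

  // check_timeout on the Service overrides the CheckCommand timeout,
  // giving slow checks more time before Icinga kills the process group.
  check_timeout = 5m

  assign where host.vars.role == "storage"
}
```

Whether raising the timeout is the right fix, as opposed to finding out why the checks hang on that one box in the first place, is of course a separate question.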