include "constants.conf"
include "zones.conf"
include <itl>
include <plugins>
include <plugins-contrib>
include <manubulon>
include <windows-plugins>
include <nscp>
include "features-enabled/*.conf"
// include_recursive "conf.d"
The only thing I have done so far to see if it helps is extend the check_interval to 75s (it was previously 60s)… I was unsure whether the checks are stacking up because the master doesn't have enough time to process them.
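For context, that interval lives on the service object. Roughly (the service name, check command and assign rule below are placeholders, not my real config):

apply Service "alarm-check" {
  check_command = "dummy"                // placeholder for the real check
  check_interval = 75s                   // raised from 60s
  assign where host.vars.os == "Linux"   // placeholder assign rule
}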
Your service definition results in a check of the host against itself (as long as $address$ is not set in service-check-alarm-settings). And perhaps it stalls (sometimes) if you have any kind of security hardening in place, e.g. AppArmor, SELinux or a packet filter.
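To illustrate (the command and variable names below are made up, not from your config): if the command's host argument resolves via $address$, it falls back to the host object's own address unless the service overrides it:

object CheckCommand "alarm-check" {            // hypothetical command
  command = [ PluginDir + "/check_alarm" ]     // hypothetical plugin
  arguments = {
    "-H" = "$address$"   // resolves service vars first, then the host's address attribute
  }
}

apply Service "service-check-alarm" {
  check_command = "alarm-check"
  vars.address = "alarm.example.com"     // set explicitly so the check targets the device, not the host itself
  assign where host.vars.alarm == true   // placeholder assign rule
}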
If it is stalling, that is what I am trying to troubleshoot and find the cause of. I'm unsure whether this is a bug in Icinga or whether my Icinga master servers are falling behind. SELinux is enabled; there are no firewalls and no AppArmor.
Perhaps I can take this thread in a different direction…
I'll look into making the client perform the check itself and report the result to the master (until more eyes can look at this thread for the original issue).
It almost seems like the masters are overwhelmed (which they should not be).
Just found something… MaxConcurrentChecks = 1024 (default = 512)
I added this to the constants.conf (both master nodes).
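For reference, the one-liner:

/* constants.conf on both masters */
const MaxConcurrentChecks = 1024   // default is 512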
Will see if this helps resolve the issue.
UPDATE: It looks like this did not resolve the issue… however, I noticed my masters were still on version r2.10.5-1, so I applied the recent patches/updates (now on version 2.11.1-1) and will monitor for a day.
I believe the update on the master nodes can be considered the resolution.
Should this change, I will post an update here. I'll mark this as the resolution in a few days (I just want to monitor a little longer).
Until the BUG is corrected, I created a workaround based on the event handler documentation.
I'm sure this may not be to everyone's liking, but it's working for what I need at the moment (feel free to tear it apart and use it as you see fit).
Deployed the script in the plugins dir (on each client):
[user@client01 ]$ sudo cat /usr/lib64/nagios/plugins/restart_service
#!/bin/bash
# Event handler: restart the icinga2 service once a check reaches a HARD CRITICAL state.
#   -s  service state (OK/WARNING/CRITICAL/UNKNOWN)
#   -t  service state type (SOFT/HARD)
#   -a  current check attempt
#   -S  service name (parsed but currently unused; icinga2 is hardcoded below)
while getopts "s:t:a:S:" opt; do
  case $opt in
    s) servicestate=$OPTARG ;;
    t) servicestatetype=$OPTARG ;;
    a) serviceattempt=$OPTARG ;;
    S) service=$OPTARG ;;
  esac
done

# Only act on a confirmed (HARD) CRITICAL state
if [ "$servicestate" = "CRITICAL" ] && [ "$servicestatetype" = "HARD" ]; then
  sudo /sbin/service icinga2 restart > /dev/null
fi
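For completeness, here is roughly how the script gets attached as an event handler (adapted from the event handler docs; the object names, check command and assign rule are illustrative, not my exact config):

object EventCommand "restart-icinga2" {
  command = [ PluginDir + "/restart_service" ]
  arguments = {
    "-s" = "$service.state$"
    "-t" = "$service.state_type$"
    "-a" = "$service.check_attempt$"
    "-S" = "icinga2"
  }
}

apply Service "zombie-procs" {
  check_command = "procs"              // illustrative: whatever check flags the zombies
  command_endpoint = host.name         // run the check and the handler on the client itself
  enable_event_handler = true
  event_command = "restart-icinga2"
  assign where host.vars.agent == true // placeholder assign rule
}

Note that the script shells out to sudo, so the icinga user on each client also needs a matching NOPASSWD sudoers entry for the restart command.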
…I still see the warnings when there is one zombie running, but once the count increments to “2”, they are zapped back to “0”. From what I am seeing, the BUG is triggered when I perform configuration reloads on the master nodes. If I do a restart (instead of a reload), I am not seeing the zombies being created.
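In other words, on the masters (assuming systemd; adjust for your init system):

systemctl reload icinga2    # config reload: this is what triggers the zombie build-up for me
systemctl restart icinga2   # full restart: no zombies observed afterwards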
At the risk of necro’ing a thread about zombies… I have seen similar issues when using sudo in the CheckCommand's command with Icinga2 2.13, and created the issue “Zombie CheckCommand processes” (Icinga/icinga2#8981 on GitHub) for future internet travelers. It doesn’t seem to be related to reloads, but it seems similar enough to mention (general execution, reaping handling).