Assistance Needed -- Zombies: service check processes becoming defunct on client

Hello,

Hoping someone has been here/done this before? I couldn’t find anything in the search.

I’m starting to add clients to my Icinga 2 environment and quickly running into an issue with clients accumulating several defunct/zombie check processes.

Example: (RHEL6)

[user@nagiostst ~ ]$ ps aux|grep defun
icinga    6299  0.0  0.0      0     0 ?        ZNs  Oct28   0:00 [check_ntp_time] <defunct>
icinga   11247  0.0  0.0      0     0 ?        ZNs  14:45   0:00 [check_ntp_time] <defunct>
icinga   15921  0.0  0.0      0     0 ?        ZNs  14:27   0:00 [check_ntp_time] <defunct>
icinga   18195  0.0  0.0      0     0 ?        ZNs  Oct24   0:00 [check_ntp_time] <defunct>
icinga   20598  0.0  0.0      0     0 ?        ZNs  14:30   0:00 [check_ntp_time] <defunct>
icinga   25223  0.0  0.0      0     0 ?        ZNs  07:30   0:00 [check_ntp_time] <defunct>
icinga   28020  0.0  0.0      0     0 ?        ZNs  Oct28   0:00 [check_ntp_time] <defunct>

Version:

[user@nagiostst ~ ]$ sudo icinga2 -V | head -1
icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.1-1)

Client definition (on master): --> client uses client_endpoint

[user@icinga01 hosts ]$ cat nagiostst.conf

// Endpoints & Zones
object Endpoint "nagiostst" {
	host = "1.2.3.4"
}

object Zone "nagiostst" {
	endpoints = [ "nagiostst" ]
	parent = "master"
}
object Host "nagiostst" {
	import "icon_rhel_vhost"
	import "unix"
	address = "1.2.3.4"
	vars.os = "linux"
	vars.disks["disk"] = { /* No parameters. */ }
	vars.filesystem["/"] = {}
	vars.filesystem["/export"] = {}
	vars.filesystem["/var"] = {}
	vars.filesystem["/var/log"] = {}
	vars.client_endpoint = name
	vars.program = "unixadm"
	vars.location = "DC1"
	vars.type = "virtual"
}

Client api.conf: (client accepts from master)

[user@nagiostst features-enabled ]$ cat api.conf
/* Icinga 2 API Config */

/*
 * The API listener is used for distributed monitoring setups.
 */

object ApiListener "api" {
    accept_commands = true
    accept_config = true
}

Client icinga2.conf: (Made sure the conf.d is not used)

[user@nagiostst icinga2 ]$ cat icinga2.conf
/* Icinga 2 Client Config */

include "constants.conf"
include "zones.conf"
include <itl>
include <plugins>
include <plugins-contrib>
include <manubulon>
include <windows-plugins>
include <nscp>
include "features-enabled/*.conf"
// include_recursive "conf.d"

The only thing I have done so far to see if it helps is extend the check_interval to 75s (it was previously 60s)… I was unsure whether the checks are stacking up because there isn’t enough time to process them back to the master?

Service Definition for NTP:

apply Service "NTP" {
    import "service-check-alarm-settings"
    check_command = "ntp_time"
    command_endpoint = host.vars.client_endpoint
    vars.ntp_warning = 120s
    vars.ntp_critical = 300s
    check_interval = 75s
    retry_interval = 30s
    //enable_notifications = false
    assign where host.address && host.vars.os == "linux"
}

Your service definition results in a check of a host against itself (as long as $address$ is not set in service-check-alarm-settings). And perhaps it stalls (sometimes) if you have any kind of security hardening in place, e.g. AppArmor, SELinux or a packet filter.
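
If you actually want the client to compare its clock against a dedicated time source instead of itself, one option (rough sketch; the server name below is only a placeholder for your real NTP server) is to set the ITL's ntp_address parameter explicitly:

apply Service "NTP" {
    import "service-check-alarm-settings"
    check_command = "ntp_time"
    command_endpoint = host.vars.client_endpoint
    // override the default of $check_address$ (the host itself); placeholder server name
    vars.ntp_address = "0.pool.ntp.org"
    vars.ntp_warning = 120s
    vars.ntp_critical = 300s
    assign where host.address && host.vars.os == "linux"
}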

Here is the service-check-alarm-settings:
It just factors out a small set of settings that are common across the environment.

template Service "service-check-alarm-settings" {
    max_check_attempts = 3
    check_interval = 1m
    retry_interval = 30s
    check_period = "24x7"
}

If it is stalling, that is what I am trying to troubleshoot so I can figure out the cause. I’m unsure whether this is a bug in Icinga or whether my Icinga master servers are falling behind. SELinux is in use, but there are no firewalls and no AppArmor.
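
To rule SELinux out on the client, I can at least check whether it is enforcing and whether anything has been denied around the plugin, roughly along these lines (assuming the audit tools are installed; output will vary):

[user@nagiostst ~ ]$ getenforce
[user@nagiostst ~ ]$ sudo ausearch -m avc -ts recent | grep -i icinga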

@mfriedrich

Have you ever seen this behavior on service checks?

Do you experience any issue when you run the check_ntp_time as user icinga manually?

Seems to work snappily:
Hoping I am demonstrating this correctly (sometimes my Nagios-isms kick in).

From Client:

[user@nagiostst ~ ]$ sudo -u icinga /usr/lib64/nagios/plugins/check_ntp_time -H localhost -w '120s' -c '300s'
NTP OK: Offset -1.168251038e-05 secs, stratum best:5 worst:5|offset=-0.000012s;60.000000;120.000000; stratum_best=5 stratum_worst=5 num_warn_stratum=0 num_crit_stratum=0

From Master:

[user@icinga01 ~ ]$ sudo -u icinga /usr/lib64/nagios/plugins/check_ntp_time -H nagiostst -w '120s' -c '300s'
NTP OK: Offset -0.001338005066 secs|offset=-0.001338s;60.000000;120.000000;

Perhaps I can take this thread in a different direction…
I’ll look into making the client schedule the checks itself and report the results back to the master (until more eyes can take a peek at this thread for the original issue).
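
As a rough sketch (file path and values are just examples), that would mean switching from command_endpoint to config sync: moving the host and service objects under the client's zone directory on the master so the client's own scheduler runs the checks and sends the results up. The client already has accept_config = true, so something like this on the master should do it (templates such as "unix" would need to live in a global zone to be importable there, so I've left them out):

// /etc/icinga2/zones.d/nagiostst/nagiostst.conf (example path, replaces the current definition)
object Host "nagiostst" {
    check_command = "hostalive"
    address = "1.2.3.4"
    vars.os = "linux"
}

object Service "NTP" {
    host_name = "nagiostst"
    check_command = "ntp_time"
    check_interval = 75s
    retry_interval = 30s
    vars.ntp_warning = 120s
    vars.ntp_critical = 300s
}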

It almost seems like the masters are overwhelmed (which they should not be).
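
One way to see whether they are actually falling behind would be to pull the check statistics from the REST API and look at avg_latency / avg_execution_time (the credentials below are placeholders for a real ApiUser):

[user@icinga01 ~ ]$ curl -k -s -u apiuser:secret 'https://localhost:5665/v1/status/CIB' | python -m json.tool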

Just found something… MaxConcurrentChecks = 1024 (default = 512)
I added this to the constants.conf (both master nodes).
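
For reference, this is roughly what that looks like:

/* /etc/icinga2/constants.conf (both master nodes) */
const MaxConcurrentChecks = 1024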

Will see if this helps resolve the issue.

UPDATE: It looks like this did not resolve the issue… however, I noticed my masters were still on an older version (r2.10.5-1), so I applied the recent patches/updates (now 2.11.1-1) and will monitor for a day.

So far… so good.

I believe the update on the master nodes can be considered the resolution.
Should this change, I will post an update here. I’ll mark this as the resolution in a few days (I just want to monitor a little more).

Spoke too soon… the issue is back, so the updates did nothing for me.

Bummed… I’m out of options at this point.
Anyone out there gone through this issue before?

Guessing a bug report is the next step. --> Submitted: https://github.com/Icinga/icinga2/issues/7614