Assistance Needed -- Zombies - service checks are becoming defunct on client

Hello,

Hoping someone has been here/done this before? I couldn’t find anything in the search.

I’m starting to add clients to my icinga2 environment and I’m quickly running into an issue where clients accumulate several defunct/zombie processes.

Example: (RHEL6)

[user@nagiostst ~ ]$ ps aux|grep defun
icinga    6299  0.0  0.0      0     0 ?        ZNs  Oct28   0:00 [check_ntp_time] <defunct>
icinga   11247  0.0  0.0      0     0 ?        ZNs  14:45   0:00 [check_ntp_time] <defunct>
icinga   15921  0.0  0.0      0     0 ?        ZNs  14:27   0:00 [check_ntp_time] <defunct>
icinga   18195  0.0  0.0      0     0 ?        ZNs  Oct24   0:00 [check_ntp_time] <defunct>
icinga   20598  0.0  0.0      0     0 ?        ZNs  14:30   0:00 [check_ntp_time] <defunct>
icinga   25223  0.0  0.0      0     0 ?        ZNs  07:30   0:00 [check_ntp_time] <defunct>
icinga   28020  0.0  0.0      0     0 ?        ZNs  Oct28   0:00 [check_ntp_time] <defunct>

Version:

[user@nagiostst ~ ]$ sudo icinga2 -V | head -1
icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.1-1)

Client definition (on the master): the client uses client_endpoint

[user@icinga01 hosts ]$ cat nagiostst.conf

// Endpoints & Zones
object Endpoint "nagiostst" {
	host = "1.2.3.4"
}

object Zone "nagiostst" {
	endpoints = [ "nagiostst" ]
	parent = "master"
}
object Host "nagiostst" {
	import "icon_rhel_vhost"
	import "unix"
	address = "1.2.3.4"
	vars.os = "linux"
	vars.disks["disk"] = { /* No parameters. */ }
	vars.filesystem["/"] = {}
	vars.filesystem["/export"] = {}
	vars.filesystem["/var"] = {}
	vars.filesystem["/var/log"] = {}
	vars.client_endpoint = name
	vars.program = "unixadm"
	vars.location = "DC1"
	vars.type = "virtual"
}
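
A quick way to confirm the master actually loaded these objects (standard icinga2 CLI; the object names are just the ones from my config above):

# On the master: verify the client endpoint and zone objects exist after a reload
sudo icinga2 object list --type Endpoint --name nagiostst
sudo icinga2 object list --type Zone --name nagiostst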

Client api.conf (the client accepts commands and config from the master):

[user@nagiostst features-enabled ]$ cat api.conf
/* Icinga 2 API Config */

/*
 * The API listener is used for distributed monitoring setups.
 */

object ApiListener "api" {
    accept_commands = true
    accept_config = true
}
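
Basic sanity checks on the client side (nothing fancy, just the standard CLI):

# On the client: make sure the api feature is enabled and the local config validates
sudo icinga2 feature list
sudo icinga2 daemon -C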

Client icinga2.conf (I made sure conf.d is not included):

[user@nagiostst icinga2 ]$ cat icinga2.conf
/* Icinga 2 Client Config */

include "constants.conf"
include "zones.conf"
include <itl>
include <plugins>
include <plugins-contrib>
include <manubulon>
include <windows-plugins>
include <nscp>
include "features-enabled/*.conf"
// include_recursive "conf.d"

The only thing I have done so far to see if it helps is extend the check_interval to 75s (it was previously 60s)… I wasn’t sure whether the checks are stacking up because there isn’t enough time for the results to be processed back to the master.
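
To see whether the defunct processes keep piling up between check runs, a simple loop on the client is enough (generic ps/awk; the 75s just matches the new check_interval):

# Print a timestamped count of zombie processes owned by the icinga user every 75 seconds
while true; do
    count=$(ps -eo user=,stat= | awk '$1 == "icinga" && $2 ~ /^Z/' | wc -l)
    printf '%s  zombies: %d\n' "$(date '+%F %T')" "$count"
    sleep 75
done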

Service Definition for NTP:

apply Service "NTP" {
    import "service-check-alarm-settings"
    check_command = "ntp_time"
    command_endpoint = host.vars.client_endpoint
    vars.ntp_warning = 120s
    vars.ntp_critical = 300s
    check_interval = 75s
    retry_interval = 30s
    //enable_notifications = false
    assign where host.address && host.vars.os == "linux"
}

Your service definition results in a check of a host against itself (as long as $address$ is not set in service-check-alarm-settings). And perhaps it stalls (sometimes) if you have any kind of security hardening in place, e.g. AppArmor, SELinux or a packet filter.
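
If SELinux is a candidate, the audit log is the first place to look, for example (assuming the audit tools are installed and logging to the default location):

# Check the SELinux mode and look for recent denials involving the plugin
sudo getenforce
sudo ausearch -m avc -c check_ntp_time --start recent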

Here is the service-check-alarm-settings template:
It just takes a small set of settings that are common across the environment out of the way.

template Service "service-check-alarm-settings" {
    max_check_attempts = 3
    check_interval = 1m
    retry_interval = 30s
    check_period = "24x7"
}

If it is stalling, that is what I am trying to troubleshoot so I can figure out the cause. I’m unsure whether this is a bug in Icinga or whether my Icinga master servers are falling behind. SELinux is in place, but there are no firewalls and no AppArmor.
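
To see whether the masters themselves are falling behind, the check scheduler statistics can be pulled from the REST API on a master (the ApiUser credentials below are placeholders; use whatever is configured in api-users.conf):

# Query scheduler statistics on a master; look at active_host_checks,
# active_service_checks, avg_latency and avg_execution_time in the output
curl -k -s -u root:icinga 'https://localhost:5665/v1/status/CIB' | python -m json.tool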

@dnsmichi

Have you ever seen this behavior on service checks?

Do you experience any issue when you run check_ntp_time manually as the icinga user?

Seems to work snappily:
Hoping I am demonstrating this correctly (sometimes my Nagios-isms kick in).

From Client:

[user@nagiostst ~ ]$ sudo -u icinga /usr/lib64/nagios/plugins/check_ntp_time -H localhost -w '120s' -c '300s'
NTP OK: Offset -1.168251038e-05 secs, stratum best:5 worst:5|offset=-0.000012s;60.000000;120.000000; stratum_best=5 stratum_worst=5 num_warn_stratum=0 num_crit_stratum=0

From Master:

[user@icinga01 ~ ]$ sudo -u icinga /usr/lib64/nagios/plugins/check_ntp_time -H nagiostst -w '120s' -c '300s'
NTP OK: Offset -0.001338005066 secs|offset=-0.001338s;60.000000;120.000000;

Perhaps I can take this thread in a different direction…
I’ll see what I can do about having the client perform the check and report back to the master (until more eyes can take a look at this thread for the original issue).

It almost seems like the masters are overwhelmed (which they should not be).

Just found something… MaxConcurrentChecks = 1024 (default = 512)
I added this to the constants.conf (both master nodes).

Will see if this helps resolve the issue.
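
For reference, this is one way to apply it on a master (default package paths assumed; editing constants.conf by hand works just the same):

# Append the constant, validate the config, then reload the master
echo 'const MaxConcurrentChecks = 1024' | sudo tee -a /etc/icinga2/constants.conf
sudo icinga2 daemon -C && sudo service icinga2 reload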

UPDATE: Looks like this did not resolve the issue… however, I noticed my masters were still on an older version (r2.10.5-1), so I applied the recent patches/updates (now 2.11.1-1) and will monitor for a day.

So far… so good.

I believe the update on the master nodes can be considered the resolution.
Should this change, I will post an update here. I’ll mark it as the resolution in a few days (I just want to monitor a little longer).

Spoke too soon… the issue is back, so the updates did nothing for me.

Bummed… I’m out of options at this point.
Anyone out there gone through this issue before?

Guessing a bug report is the next step. --> submitted: https://github.com/Icinga/icinga2/issues/7614

UPDATE: Unsure why, but the bug report was merged into a similar one (without explanation).
The new report is https://github.com/Icinga/icinga2/pull/7606

UPDATE:

Until the bug is corrected, I have created a workaround based on the event handler documentation.
I’m sure it may not be to everyone’s liking, but it’s working for what I need at the moment (feel free to tear it apart and use it as you see fit).

Deployed this script in the plugins dir on each client:

[user@client01 ]$ sudo cat /usr/lib64/nagios/plugins/restart_service

#!/bin/bash

# Arguments passed in from the Icinga 2 EventCommand:
#   -s service state, -t state type, -a check attempt, -S service to restart
while getopts "s:t:a:S:" opt; do
  case $opt in
    s) servicestate=$OPTARG ;;
    t) servicestatetype=$OPTARG ;;
    a) serviceattempt=$OPTARG ;;
    S) service=$OPTARG ;;
  esac
done

# Only act once the check has reached a HARD CRITICAL state
if [ "$servicestate" == "CRITICAL" ] && [ "$servicestatetype" == "HARD" ]; then
    sudo /sbin/service icinga2 restart > /dev/null
fi
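
For completeness: since the script runs as the icinga user, it needs passwordless sudo for that restart. A minimal sketch of the sudoers side (the drop-in file name is just an example; the requiretty handling matters on RHEL6):

# Allow the icinga user to restart icinga2 without a password (and without a tty)
cat <<'EOF' | sudo tee /etc/sudoers.d/icinga2-restart
Defaults:icinga !requiretty
icinga ALL=(root) NOPASSWD: /sbin/service icinga2 restart
EOF
sudo chmod 0440 /etc/sudoers.d/icinga2-restart
sudo visudo -cf /etc/sudoers.d/icinga2-restart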

Service definition to check zombies:

apply Service "Zombie Procs" {
    import "service-check-alarm-settings"
    check_command = "procs"
    event_command = "restart_service"
    command_endpoint = host.vars.client_endpoint
    vars.grafana_graph_disable = true
    vars.procs_warning = 0
    vars.procs_critical = 1
    vars.procs_state = "Z"
    enable_notifications = false
    assign where host.address && host.vars.os == "linux"
}
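
The zombie counting itself can be sanity-checked by hand on a client, using the same thresholds and state as the service above:

# Manual run of the procs check as the icinga user: warn above 0, critical above 1 zombie ("Z") process
sudo -u icinga /usr/lib64/nagios/plugins/check_procs -w 0 -c 1 -s Z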

eventhandler.conf definition:

object EventCommand "restart_service" {
    command = [ PluginDir + "/restart_service" ]
    arguments = {
        "-s" = "$service.state$"
        "-t" = "$service.state_type$"
        "-a" = "$service.check_attempt$"
        "-S" = "$restart_service$"
    }
    vars.restart_service = "$procs_commands$"
}

…I still see the warnings when there is 1 zombie running, but once the count increments to 2, they are zapped back to 0. From what I am seeing, the bug is triggered when I perform configuration reloads on the master nodes. If I do a restart (instead of a reload), I do not see the zombies being created.
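
In other words, the difference is between these two operations on the masters (roughly, using the standard init scripts):

# Config reload on a master -- this is what seems to trigger the client-side zombies here
sudo icinga2 daemon -C && sudo service icinga2 reload

# Full restart -- no zombies observed on the clients after this
sudo service icinga2 restart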

Checking in on this to see if anyone has some new info?

Seems like the new version (2.12) solved this issue.

At the risk of necro’ing a thread about zombies… I have seen similar issues when using sudo in the CheckCommand’s command with Icinga 2 2.13 and created “Zombie CheckCommand processes” (Icinga/icinga2 issue #8981 on GitHub) for future internet travelers. It doesn’t seem to be related to reloads, but it seems similar enough to mention (general execution and reaping handling).
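
For future travelers, a quick way to confirm the zombies really are un-reaped children of the icinga2 daemon (plain ps/awk, nothing Icinga-specific):

# List the parent process of every zombie ("Z") process on the box
ps -eo ppid=,stat=,comm= | awk '$2 ~ /^Z/ {print $1}' | sort -u \
  | xargs -r -I{} ps -o pid=,comm= -p {}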