High CPU load and error on WS-C3560E-24TD Cisco switch interface

Hi
I experience high CPU load when I poll about 600 devices using ping4 and check_nwc_health in health mode; only four devices also have their interface status checked. Some devices time out, and there is never a point at which no device shows "timeout exceeded" on either the health check or ping4. For the devices where I check interface status, I see the error below:

"print() on closed filehandle GEN1 at /usr/lib/x86_64-linux-gnu/perl/5.26/IO/Handle.pm line 159.

UNKNOWN - cannot write status dir /var/tmp/check_nwc_health! check your filesystem (permissions/usage/integrity) and disk devices, TenGigabitEthernet0/2 (alias {core} device-name Ten0/2) is up/up"

Occasionally, icinga monitoring health shows the process is not running…

I am running:

root@vboss-ensmon01-mtb:~# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.5-1)

Copyright © 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
Platform: Ubuntu
Platform version: 18.04.2 LTS (Bionic Beaver)
Kernel: Linux
Kernel version: 4.15.0-50-generic
Architecture: x86_64

Build information:
Compiler: GNU 7.4.0
Build host: 8b8700ecb474

Application information:

General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2

Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var

Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
root@vboss-ensmon01-mtb:~#

Hi,

Which check/retry intervals are you using for these services? Please share the full configuration objects. It may be that your device responds slowly whenever multiple parallel requests are fired.

The temp file write error sounds weird. Is there enough space available on /tmp? It could be a tmpfs backed by RAM that is already used up.

Cheers,
Michael

template Host "generic-host" {
  max_check_attempts = 3
  check_interval = 5m
  retry_interval = 180s
  vars.ping_wrta = "6000, 90%"
  vars.ping_crta = "8000, 100%"
  check_command = "hostalive"
}

/**
 * Provides default settings for services. By convention
 * all services should import this template.
 */
template Service "generic-service" {
  max_check_attempts = 5
  check_interval = 5m
  retry_interval = 300s
}

I also see these errors:

Plugin Output

<Terminated by signal 9 (Killed).>

This is what my filesystem usage looks like:
Filesystem      Size  Used Avail Use% Mounted on
udev             16G     0   16G   0% /dev
tmpfs           3.2G  1.2M  3.2G   1% /run
/dev/sda2       295G   81G  199G  29% /
tmpfs            16G     0   16G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/loop0       89M   89M     0 100% /snap/core/6964
/dev/loop1       87M   87M     0 100% /snap/core/4917
tmpfs           3.2G     0  3.2G   0% /run/user/1000

The "Terminated by signal 9 (Killed)" output is caused by Icinga's default command timeout.

Can I change Icinga's default command timeout so that it doesn't kill any process?

Just add

check_timeout = 180

or even higher to your service/apply rule
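
For illustration, here is a minimal sketch of where that attribute could go in an apply rule. The service name, check command and `vars.nwc_mode` are taken from the configuration posted in this thread. The `assign where` condition and the custom variable `host.vars.nwc_health` are assumptions made up for the example, so adapt them to however your hosts are actually selected.

```
apply Service "check_nwc_health" {
  import "generic-service"

  check_command = "check_nwc_health"
  vars.nwc_mode = "hardware-health"

  // Overrides the timeout of the CheckCommand for this service only.
  // Plain numbers are seconds; duration literals such as 3m also work.
  check_timeout = 180

  // Hypothetical selector: replace with your own condition.
  assign where host.vars.nwc_health == true
}
```

Note that a longer timeout only stops Icinga from killing slow plugin runs. The runs themselves still take just as long and still consume CPU, so this does not reduce the load.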

This is what the service list looks like for each of my hosts. It causes a constant CPU load of more than 50. Is there a way I can optimise my config to reduce the CPU load?

Object ‘device01 !check_nwc_health’ of type ‘Service’:
% declared in ‘/etc/icinga2/conf.d/services.conf’, lines 56:1-56:32

  • __name = “mau-plo-ltk-ppe01 !check_nwc_health”
  • action_url = “”
  • check_command = “check_nwc_health”
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 58:2-58:35
  • check_interval = 300
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 29:3-29:21
  • check_period = “”
  • check_timeout = 200
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 60:2-60:20
  • command_endpoint = “”
  • display_name = “check_nwc_health”
  • enable_active_checks = true
  • enable_event_handler = true
  • enable_flapping = false
  • enable_notifications = true
  • enable_passive_checks = true
  • enable_perfdata = true
  • event_command = “”
  • flapping_threshold = 0
  • flapping_threshold_high = 30
  • flapping_threshold_low = 25
  • groups = [ ]
  • host_name = "device01 "
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 56:1-56:32
  • icon_image = “”
  • icon_image_alt = “”
  • max_check_attempts = 5
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 28:3-28:24
  • name = “check_nwc_health”
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 56:1-56:32
  • notes = “”
  • notes_url = “”
  • package = “_etc”
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 56:1-56:32
  • retry_interval = 300
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 30:3-30:23
  • source_location
    • first_column = 1
    • first_line = 56
    • last_column = 32
    • last_line = 56
    • path = “/etc/icinga2/conf.d/services.conf”
  • templates = [ “check_nwc_health”, “generic-service” ]
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 56:1-56:32
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 27:1-27:34
  • type = “Service”
  • vars
    • nwc_mode = “hardware-health”
      % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 59:2-59:34
  • volatile = false
  • zone = “”

Object ‘device01 !ping4’ of type ‘Service’:
% declared in ‘/etc/icinga2/conf.d/services.conf’, lines 26:1-26:21

  • __name = “device01 !ping4”
  • action_url = “”
  • check_command = “ping4”
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 29:3-29:25
  • check_interval = 300
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 29:3-29:21
  • check_period = “”
  • check_timeout = null
  • command_endpoint = “”
  • display_name = “ping4”
  • enable_active_checks = true
  • enable_event_handler = true
  • enable_flapping = false
  • enable_notifications = true
  • enable_passive_checks = true
  • enable_perfdata = true
  • event_command = “”
  • flapping_threshold = 0
  • flapping_threshold_high = 30
  • flapping_threshold_low = 25
  • groups = [ ]
  • host_name = "device01 "
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 26:1-26:21
  • icon_image = “”
  • icon_image_alt = “”
  • max_check_attempts = 5
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 28:3-28:24
  • name = “ping4”
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 26:1-26:21
  • notes = “”
  • notes_url = “”
  • package = “_etc”
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 26:1-26:21
  • retry_interval = 300
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 30:3-30:23
  • source_location
    • first_column = 1
    • first_line = 26
    • last_column = 21
    • last_line = 26
    • path = “/etc/icinga2/conf.d/services.conf”
  • templates = [ “ping4”, “generic-service” ]
    % = modified in ‘/etc/icinga2/conf.d/services.conf’, lines 26:1-26:21
    % = modified in ‘/etc/icinga2/conf.d/templates.conf’, lines 27:1-27:34
  • type = “Service”
  • vars = null
  • volatile = false
  • zone = “”

check_nwc_health uses a lot of CPU. If your Icinga host is a VM, you can add more CPUs, or you can use another Icinga 2 agent host and run the checks from there.
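
A rough sketch of the second option, assuming a second Icinga 2 node is already connected to this instance and defined as an Endpoint (with a matching Zone). The endpoint name `icinga-worker01` and the `assign where` condition are hypothetical. The plugin and the `check_nwc_health` CheckCommand definition would also have to be available on that node.

```
apply Service "check_nwc_health" {
  import "generic-service"

  check_command = "check_nwc_health"
  vars.nwc_mode = "hardware-health"

  // Run the plugin on the other node instead of locally.
  // "icinga-worker01" is a placeholder for an existing Endpoint object.
  command_endpoint = "icinga-worker01"

  // Hypothetical selector: replace with your own condition.
  assign where host.vars.nwc_health == true
}
```

The check results still show up on the host where the service is configured. Only the plugin execution, and with it the CPU and memory usage, moves to the other node.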

We also had memory problems when many check_nwc_health checks ran at the same time.
Since then, we have been using check_interfaces to check the interfaces of our network equipment, and we are happy with it.