Hi
I see high CPU load when I poll about 600 devices using ping4 and check_nwc_health in health_mode; only four of the devices also have interface status checks. Some devices time out, and there is never a point where I don't see a "timeout exceeded" device on either health_mode or ping4. For the devices where I check interface status, I see the error below:
"print() on closed filehandle GEN1 at /usr/lib/x86_64-linux-gnu/perl/5.26/IO/Handle.pm line 159.
UNKNOWN - cannot write status dir /var/tmp/check_nwc_health! check your filesystem (permissions/usage/integrity) and disk devices, TenGigabitEthernet0/2 (alias {core} device-name Ten0/2) is up/up"
Occasionally, the Icinga monitoring health view shows that the process is not running…
Which check/retry intervals are you using for these services? Please share the full configuration objects. It may be that your devices respond slowly whenever multiple parallel requests are fired.
The temp file write error sounds weird. Is there enough space available on /tmp? It could be a tmpfs backed by RAM, which may already be used up.
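To make that check concrete, here is a minimal sketch that reports free space and write access for the status directory named in the UNKNOWN message. The helper name `inspect_status_dir` is made up for illustration; it only uses Python's standard library, and it should be run as the same user that executes the checks, otherwise the permission result is meaningless.

```python
import os
import shutil


def inspect_status_dir(path):
    """Report free space and write access for path or its nearest existing parent."""
    target = os.path.abspath(path)
    probe = target
    # The status dir may not exist yet, so walk up to the first directory that does.
    while not os.path.isdir(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break
        probe = parent
    usage = shutil.disk_usage(probe)
    percent_used = round(100 * usage.used / usage.total, 1) if usage.total else 0.0
    return {
        "checked_path": probe,
        "exists": os.path.isdir(target),
        "free_mib": usage.free // (1024 * 1024),
        "percent_used": percent_used,
        # Creating files in a directory needs both write and search permission.
        "writable": os.access(probe, os.W_OK | os.X_OK),
    }


if __name__ == "__main__":
    # Path copied from the error message in the first post.
    for key, value in inspect_status_dir("/var/tmp/check_nwc_health").items():
        print(f"{key}: {value}")
```

If `writable` comes back False or `free_mib` is close to zero, that would match the "cannot write status dir" message; if both look fine, the cause is more likely somewhere else.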
This is what the service list looks like for each of my hosts. It causes a CPU load of more than 50 all the time. Is there a way I can optimise my config to reduce the CPU load?
Object 'device01 !check_nwc_health' of type 'Service':
% declared in '/etc/icinga2/conf.d/services.conf', lines 56:1-56:32
__name = "mau-plo-ltk-ppe01 !check_nwc_health"
action_url = ""
check_command = "check_nwc_health"
% = modified in '/etc/icinga2/conf.d/services.conf', lines 58:2-58:35
check_interval = 300
% = modified in '/etc/icinga2/conf.d/templates.conf', lines 29:3-29:21
check_period = ""
check_timeout = 200
% = modified in '/etc/icinga2/conf.d/services.conf', lines 60:2-60:20
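As a rough sanity check on these numbers, the sketch below estimates how many check processes run at once. It takes the 600 devices, check_interval = 300 and check_timeout = 200 from this thread. It assumes that ping4 uses the same 300-second interval (not shown above) and that a healthy check takes about 2 seconds, which is a made-up figure. The function name `average_check_load` is hypothetical.

```python
def average_check_load(hosts, checks_per_host, interval_s, avg_runtime_s):
    """Return (check starts per second, average number of checks running at once)."""
    starts_per_second = hosts * checks_per_host / interval_s
    # Little's law: mean concurrency = arrival rate * mean time spent per check.
    concurrent = starts_per_second * avg_runtime_s
    return starts_per_second, concurrent


# 600 hosts, 2 checks each (ping4 + check_nwc_health), one run every 300 s.
rate, typical = average_check_load(600, 2, 300, 2)    # assumed 2 s per healthy check
_, worst = average_check_load(600, 2, 300, 200)       # every check hangs until the 200 s timeout
print(f"starts/s: {rate}, typical concurrent: {typical}, worst-case concurrent: {worst}")
```

Under these assumptions the steady state is only a handful of concurrent plugin processes, but every check that hangs until the 200-second timeout stays alive 100 times longer than a 2-second one. The worst-case figure is an upper bound for illustration, not a measurement; it suggests that slow or unreachable devices combined with a long check_timeout can account for much of the load.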
We also had memory problems when many check_nwc_health checks ran at the same time.
Since then, we have been using check_interfaces to check the interfaces on our network equipment, and we are happy with it.