Check_nwc_health performance

Hi,
I’m struggling to scale out check_nwc_health for pulling back switch interface usage, errors, and duplex on 48-port switches.
I’ve deployed a satellite node to offload checks from my master; however, I’m running into issues with CPU load.

My checks appear to work without issue and are as follows:

'/usr/lib/nagios/plugins/check_nwc_health' '--community' 'xxxx' '--hostname' '192.x.x.x' '--mode' 'interface-usage' '--name' '1' '--statefilesdir' '/var/tmp'
'/usr/lib/nagios/plugins/check_nwc_health' '--community' 'xxxx' '--hostname' '192.x.x.x' '--mode' 'interface-duplex' '--name' '1' '--statefilesdir' '/var/tmp'
'/usr/lib/nagios/plugins/check_nwc_health' '--community' 'xxxx' '--hostname' '192.x.x.x' '--mode' 'interface-errors' '--name' '1' '--statefilesdir' '/var/tmp'

The check interval for each is below; note I have already made these values longer while trying to improve performance:

Usage: 120s
Duplex: 1200s
Errors: 24000s

The timeout on all three is 20s.

My host is configured with an array for each interface, approximately 48 per switch.

When I start Icinga or redeploy, I see situations where hundreds of processes are created and the whole thing grinds to a halt. I suffer timeouts and things seem to back up, yet when I test the command line against each switch they respond instantly.

My satellite is a VM running 2 cores with 4GB of RAM.

Is there anything I can do to further optimise this?

I’ve now increased the resources on the satellite VM to 4 cores and 12GB, it’s certainly helped, but seems a little excessive. I’d be interested to know if there is a better way of achieving the same result.

I’ve also got a switch with over a hundred interfaces; when using check_nwc_health to query interface usage and duplex, my box grinds to a halt. Throwing excessive CPU and RAM at the problem isn’t great, and we can’t really spare it either. Is there a better option?

Hi,

What you describe is essentially “working as designed” and comes down to how Icinga, the check_nwc_health script, interpreted languages, etc. work.
First of all, if you look into the script you will see it’s a Perl script with 77391 lines. That’s a lot!
Depending on how many switches you want to check, you fire this huge script against every switch - three times, once per mode. So it’s not very efficient in large environments.

If Icinga triggers the checks in parallel - for example after a (re)start of the daemon - there will be a lot of processes in the task manager.

Simplified calculation to make this clear:
In our setup the script needs 0.2% of memory and 3.6% CPU, and the execution time is about 0.3 to 1 second (depending on the switch and what is being checked) - not much on paper. But let’s say you have 1000 switches to check and you run the script with 3 different modes, as in your example: 0.2% mem x 1000 switches x 3 modes = 600% of memory. I know this calculation is not 100% correct, because there are more influences than this simple maths (such as the delay between starting the processes), but it should show you how many resources the checks need. In a thread on stackexchange.com I found this command:

pmap $(ps -ef | grep check_nwc_health.pl | grep -v grep | awk '{print $2}')

With this command I can see that when Perl executes the script and has to load all of its modules, it needs about 212092K just to check the CPU load on our switches!
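If you want the total across all check_nwc_health processes running at a given moment, rather than per PID, a rough one-liner that sums the RSS column from ps should do it. This is only a sketch - the figures are per your own system, and the pattern may need a .pl suffix depending on how the plugin is installed:

ps -eo rss,args | awk '/[c]heck_nwc_health/ {sum += $1} END {print sum " KB total"}'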

You can observe this if you stop the Icinga daemon, start a second shell with a task manager like (h)top, wait, and then start the daemon again. You will see in the task manager how many processes Icinga has to start, and the CPU load and memory usage of your server rising. You get similar behaviour after a deployment or whenever you have to restart your Icinga server. The devs did a good job here, so Icinga has some mechanisms built in to decide what needs to be checked when (see https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#technical-concepts-check-scheduler or https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#core-reload-handling).
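If you want to put a number on it, you can count the running check processes once per second from a second shell while restarting the daemon (the [c] bracket trick just keeps grep from counting itself):

watch -n1 'ps -ef | grep -c "[c]heck_nwc_health"'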

What can you do?

  • Create a wrapper script to add another artificial delay, like

#!/bin/bash
# sleep a random 1-3 seconds so the checks don't all hit the switches at once
sleep $(( RANDOM % 3 + 1 ))
/usr/lib64/nagios/plugins/check_nwc_health.pl "$@"
exit $?
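If you go down this route, remember that the wrapper has to be executable and your CheckCommand has to point at the wrapper instead of the plugin itself (the wrapper path here is only an example name):

chmod +x /usr/lib64/nagios/plugins/check_nwc_health_wrapper.sh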

I hope this explanation makes things clearer


You could also try limiting the maximum concurrent checks of Icinga 2.
You can set a specific attribute in constants.conf to change the default number:
https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/#configuration
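For reference, the constant in question is MaxConcurrentChecks (if I remember correctly the shipped default is 512). As a minimal sketch, add something like the following to /etc/icinga2/constants.conf, then validate with icinga2 daemon -C and reload the daemon:

const MaxConcurrentChecks = 256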


Thanks for the detail. Your input is certainly aligned with my experience. I’d prefer to go with the simpler solution, as it should be easier for others to maintain long term. I’m therefore going to look at the check_interfaces option.


Wow, check_interfaces is so much faster than check_nwc_health. That certainly solves the performance issues.

Only thing is, I can’t seem to get bytes/s in and out from the metrics.
The docs mention -p

-p|--perfdata      last check perfdata
                        Performance data from previous check (used to calculate traffic)

Adding that to a command-line test doesn’t generate what I need. Does it produce the sort of metric I’m after?
Thanks

We don’t have to use this parameter. We get the performance data by default. E.g. this is for one interface:
[screenshot: performance data for one interface]

Grafana does the rest for us.

Thanks, could I ask what your command looks like?

'/usr/lib64/nagios/plugins/check_interfaces' '--aliases' '--auth-phrase' '???' '--auth-proto' '???' '--hostname' '< IP/DNS Name >' '--if-names' '--priv-phrase' '??' '--priv-proto' '??' '--regex' '< searchregex as regex >' '--timeout' '60000' '--user' '< username >'

Thanks, that’s basically what I’m using, but I’m still having a few issues. I think I’ll start another thread specifically on that check.
