Check_nwc_health performance

I’m struggling to scale out check_nwc_health for pulling back switch interface usage, errors, and duplex on 48-port switches.
I’ve deployed a satellite node to offload checks from my master; however, I’m running into issues with CPU load.

My checks appear to work without issue and are as follows:

'/usr/lib/nagios/plugins/check_nwc_health' '--community' 'xxxx' '--hostname' '192.x.x.x' '--mode' 'interface-usage' '--name' '1' '--statefilesdir' '/var/tmp'
'/usr/lib/nagios/plugins/check_nwc_health' '--community' 'xxxx' '--hostname' '192.x.x.x' '--mode' 'interface-duplex' '--name' '1' '--statefilesdir' '/var/tmp'
'/usr/lib/nagios/plugins/check_nwc_health' '--community' 'xxxx' '--hostname' '192.x.x.x' '--mode' 'interface-errors' '--name' '1' '--statefilesdir' '/var/tmp'

The check interval for each is as follows (note I have lengthened these values in an attempt to improve performance):

Errors: 24000s

The timeout on all three is 20s.

My host is configured with an array for each interface, approximately 48 per switch.

When I start Icinga or redeploy, I see situations where hundreds of processes are created and the whole thing grinds to a halt. I suffer timeouts and things seem to back up. I’ve tested the command line against each switch and they respond instantly.

My satellite is a VM running 2 cores with 4GB.

Is there anything I can do to further optimise this?

I’ve now increased the resources on the satellite VM to 4 cores and 12GB, it’s certainly helped, but seems a little excessive. I’d be interested to know if there is a better way of achieving the same result.

I’ve also got a switch with over a hundred interfaces; when using check_nwc_health to query interface usage and duplex, my box grinds to a halt. Throwing excessive CPU and RAM at the problem works, but we can’t really spare it either. Is there a better option?


What you describe is essentially “working as designed” - it’s how Icinga, the check_nwc_health script, interpreted languages, etc. work.
First of all, if you look into the script you will see it’s a Perl script with 77,391 lines. A lot!
Depending on how many switches you want to check, you fire this many-thousand-line script against each switch - 3 times (once for every mode). So it’s not very efficient in large environments.

If Icinga triggers the checks in parallel - for example after a (re)start of the daemon - there will be a lot of processes in the task manager.

Simplified calculation to make this clear:
In our setup the script needs 0.2% of memory and 3.6% CPU, and the execution time is about 0.3 to 1 second (depending on the switch and what is being checked) - not much at first glance. But let’s say you have 1000 switches to check and you run the script with 3 different modes like in your example: 0.2% mem x 1000 switches x 3 modes = 600% memory. OK, I know this calculation is not 100% correct, because there are more influences than this simple calculation captures, like the delay between starting the processes, but it should show you how many resources the checks need. In this thread I found this command:

pmap $(ps -ef | grep check_nwc_health | grep -v grep | awk '{print $2}')

With this command I can see that when Perl executes the script and has to load all its modules, it needs about 212092K just to check the CPU load on our switches!
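The back-of-the-envelope calculation above can be scripted; this is a purely illustrative sketch using the example figures (0.2% memory per process, 1000 switches, 3 modes):

```shell
# Worst-case memory if every check process ran simultaneously
# (numbers taken from the example above; purely illustrative).
awk 'BEGIN {
  switches = 1000; modes = 3; mem_per_check = 0.2   # percent of RAM per process
  printf "worst-case memory: %.0f%%\n", switches * modes * mem_per_check
}'
```

In practice the scheduler spreads the checks out, so you never hit the full worst case, but it shows why bursts after a restart hurt.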

You can observe this if you stop the Icinga daemon, start a second shell with a task manager like (h)top, wait, and start the daemon again. You will see in the task manager how many processes Icinga has to start, and the CPU load and memory usage of your server rising. You’ll see similar behavior after a deployment or if you have to restart your Icinga server. Here the devs did a good job: Icinga has mechanisms inside so that it knows what needs to be checked at any moment (see the Technical Concepts chapter of the Icinga 2 documentation).
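Besides (h)top, a quick way to put a number on the burst is to count the check processes directly (an illustrative one-liner; the process name to match is an assumption):

```shell
# Count running check_nwc_health processes by matching the full command line.
# pgrep -fc prints the count; `|| true` keeps the pipeline happy when it's 0.
count=$(pgrep -fc check_nwc_health || true)
echo "check_nwc_health processes: ${count:-0}"
```

Run it in a loop (e.g. under `watch -n1`) while restarting the daemon to watch the spike.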

What can you do?

  • Create a wrapper script that adds an artificial random delay, for example:

#!/bin/bash
sleep $(echo "$RANDOM % 3 + 1" | bc)
/usr/lib64/nagios/plugins/check_nwc_health "$@"
exit $?

  • Cluster your satellites so that the load is shared a little. In addition, you gain redundancy.

  • Use another check which needs fewer resources. Instead of check_nwc_health for interfaces we, for example, are using check_interfaces (listed in the Icinga Template Library docs; you can get the check from the project’s website). It’s a C program, and it’s much faster: in our setup it needs 0.7% CPU and 0.0% mem, and the execution time is about 3 seconds.

I hope this explanation makes things clearer.


You could also try limiting the maximum number of concurrent checks in Icinga 2.
You can set the MaxConcurrentChecks constant in constants.conf to change the default.
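For example (a sketch of constants.conf; 128 is an arbitrary value you would tune to your satellite’s hardware, and the default is 512 at the time of writing):

```
/* constants.conf on the satellite */
/* Lower MaxConcurrentChecks so a (re)start doesn't fork hundreds of checks at once. */
const MaxConcurrentChecks = 128
```

After changing it, validate the config and restart the icinga2 service for it to take effect.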


Thanks for the detail. Your input is certainly aligned with my experience. I’d prefer to go with the simpler solution, as it should be easier for others to maintain long term. I’m therefore going to look at the check_interfaces option.


Wow, check_interfaces is so much faster than check_nwc_health. That certainly solves the performance issues.

Only thing is, I can’t seem to get bytes/s in and out from the metrics.
The docs mention -p

-p|--perfdata      last check perfdata
                        Performance data from previous check (used to calculate traffic)

Adding that to a command line test doesn’t generate what I need, does it produce the sort of metric I’m after?

We don’t have to use this parameter. We get the performance data by default. E.g. this is for one interface:

Grafana does the rest for us.
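For reference (this is not this plugin’s exact output - the labels here are made up), Nagios-style performance data follows the 'label'=value[UOM];warn;crit;min;max convention after the pipe character, so a per-interface line could look something like:

```
OK: eth0 is up | 'eth0_in_octet'=1543000c;;;; 'eth0_out_octet'=872000c;;;;
```

Graphing tools like Grafana read everything after the `|` and derive rates from the counter values.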

Thanks, could I ask what your command looks like?

'/usr/lib64/nagios/plugins/check_interfaces' '--aliases' '--auth-phrase' '???' '--auth-proto' '???' '--hostname' '<IP/DNS name>' '--if-names' '--priv-phrase' '??' '--priv-proto' '??' '--regex' '<search regex>' '--timeout' '60000' '--user' '<username>'

Thanks, that’s basically what I’m using, but I’m still having a few issues; I think I’ll start another thread specifically on the check.


hi @stevie-sy

Do I understand correctly that you calculate the utilization of interfaces in Grafana?
And are event notifications also sent through it?
Is Icinga not used in that process?

Hi @orsa!

Yes and no. We use check_nwc_health and/or check_interfaces to check the interfaces on our switches.

This is how Icinga works:
Icinga triggers the checks (scripts). In the case of the scripts mentioned, the return value includes performance data in addition to up/down. That data is then shipped into Grafana,
just like with every other check (script).

Since the check_nwc_health plugin is written in Perl, an interpreted language, the Perl interpreter is started every time the plugin runs, which causes a lot of CPU and I/O load.

We had the same problem in our setup with many network devices.

There is a newer project called “Thola” which is written in Go, is very fast, and requires few system resources.

Please have a look here and try it:


I’m very excited to try this 🙂 We monitor ~3,000 network devices across ~80 sites, and we’ve had to significantly increase the resources allocated to our satellites since using check_nwc_health for interface monitoring.

The same here; we will try the new check next time. We also have a lot of network devices to check, and the mix of check_interfaces and check_nwc_health causes some trouble at certain locations and with certain hardware types.

I am interested in your feedback. If something is still missing or not working, it’s best to open an issue on GitHub.

It may well be that vendors/models are missing. But these can be added independently via a YAML config file: Writing a device class | Thola Documentation
Please submit the YAML config files via pull request on GitHub.
Alternatively, you can also provide an SNMP record file and we will add the device.


I’ve had a chance to test this evening and so far it looks great, and we will definitely make use of the features.
One of the reasons we make use of the check_nwc_health plugin is that it has the capability to calculate interface saturation as a percentage.

Although we are writing our perfdata to InfluxDB and OpenTSDB (and we can graph utilisation from there), it is still useful for us to have the check exit with Warning or Critical when an interface becomes fully saturated.

I believe check_nwc_health does this by storing the previous values in a temporary file, calculating the delta, and dividing by the time between the checks.
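That state-file technique can be sketched in a few lines of shell (this is not check_nwc_health’s actual code - the file path and counter values are made up, and a real check would read the counter via SNMP):

```shell
#!/bin/sh
# Sketch of rate calculation from a persisted counter:
# store (timestamp, counter) each run; compute (delta bytes / delta seconds).
STATE=/tmp/iface_demo.state

rate_from_counter() {
  now=$1; bytes=$2
  if [ -f "$STATE" ]; then
    read -r prev_time prev_bytes < "$STATE"
    dt=$(( now - prev_time ))
    [ "$dt" -gt 0 ] && echo "rate: $(( (bytes - prev_bytes) / dt )) bytes/s"
  fi
  printf '%s %s\n' "$now" "$bytes" > "$STATE"   # persist for the next run
}

rm -f "$STATE"
rate_from_counter 1000 500000     # first run: only seeds the state file
rate_from_counter 1060 6500000    # 60s "later": prints the computed rate
```

A real implementation also has to handle counter wrap/reset (when the delta goes negative), which this sketch ignores.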

Is this something you would be likely to implement? I understand that it may be compute-intensive to do all of those calculations, which is presumably part of why check_nwc_health is so heavy on resources.

We are also writing our perfdata to an InfluxDB cluster.

That’s why the capability to calculate interface utilisation as a percentage is not so important for us.
But we plan to add that capability later this year (Q3 2021).

Thola already has a database backend (using SQLite or Redis) to store cached/historical values.
So it’s easy to add such a feature.

Currently, support for the telnet/SSH protocols is more important for our own purposes.

If you like, you can write us an email to discuss in more detail.
