Disable Notification

vishalsevani · July 11, 2019, 10:39am

I am using Icinga to monitor device input power via SNMP. The issue is that when an SNMP timeout occurs, the service state changes to CRITICAL and I get an email notification. I want to suppress the email notification when there is an SNMP timeout. The message I get in that case is ‘Plugin timed out while executing system call’.

I tried adding ‘vars.nrpe_timeout_unknown = true’ to the service definition, but it did not work. My service definition is:

apply Service "mikrotik-power" {
  import "generic-service"
  check_command = "snmp"
  vars.snmp_oid = ".1.3.6.1.4.1.14988.1.1.3.8.0"
  vars.snmp_community = "UxkzrAunFoGSrBhLMA"
  vars.snmp_units = "dV"
  vars.snmp_version = "2c"
  vars.snmp_crit = host.vars.critical_range
  vars.snmp_warn = host.vars.warning_range
  vars.nrpe_timeout_unknown = true
  assign where (match("RouterBOARD 750*", host.vars.hardware) || match("960PGS", host.vars.hardware) || match("RB750Gr3", host.vars.hardware) || match("CCR1009-7G-1C-1S+", host.vars.hardware)) && host.vars.critical_range && host.vars.warning_range
}

Is it possible to achieve what I want?

Any help would be appreciated.

P.S. I am using Icinga 2 version r2.10.4-1.

Thanks

blakehartshorn · July 11, 2019, 11:14am

Try something like this:

check_timeout = 3m
vars.snmp_timeout = 180

That’ll give it 3 full minutes before giving up on you; adjust to your liking. Notifications are triggered by state, and if the default snmp module decides a timeout is critical, Icinga is going to handle it that way.

The nrpe timeout setting won’t make a difference here as you’re not using nrpe, but the aforementioned snmp_timeout variable is specific to that check. You can find more things you can potentially adjust to improve the reliability of the check here:
https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#snmp

vishalsevani · July 11, 2019, 11:37am

Doesn’t seem to work. Are these options supposed to suppress the timeout error? My requirement is to prevent the service state from changing to CRITICAL on a plugin timeout, so that no email notification is sent.

Thanks

blakehartshorn · July 11, 2019, 11:50am

I was suggesting to give it more time as Icinga can’t necessarily override that for you, but I’m seeing something on check_snmp that I don’t see in the template library:

 -t, --timeout=INTEGER:<timeout state>
    Seconds before connection times out (default: 10)
    Optional ":<timeout state>" can be a state integer (0,1,2,3) or a state STRING

Try setting vars.snmp_timeout = "180:1" and see if you get a warning instead of a critical.

vishalsevani · July 11, 2019, 12:01pm

With your original suggestion (without using :<timeout state>), the service did go into the UNKNOWN state with the error <Terminated by signal 9 (Killed).>. It took more than 3 minutes for the service to go to UNKNOWN, so what I reported in my previous reply was incorrect.

I hope there is no issue with the error <Terminated by signal 9 (Killed).>?

Thanks

winem · July 11, 2019, 12:04pm

@vishalsevani how often does that happen? If it’s just a single failing check that triggers the notifications, adjusting max_check_attempts and retry_interval might be what you want.

With max_check_attempts = 5, for example, 5 successive checks have to fail.
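
As a sketch of that idea (the template name and both values are illustrative assumptions, not taken from this thread's configuration): Icinga 2 sends problem notifications only once a service reaches a hard state, so a single timed-out check stays in a soft state and does not trigger mail.

```
// Hypothetical template; import it in the service definition after "generic-service".
template Service "tolerate-single-failures" {
  max_check_attempts = 5   // consecutive non-OK results before the state turns HARD
  retry_interval = 30s     // interval between re-checks while the state is still SOFT
}
```

With these values a problem has to persist for roughly two minutes before anyone is notified, which also delays alerts for real threshold breaches.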

I don’t use check_snmp myself, but the ereg|eregi options might be helpful, too. That depends on the code and implementation in the plugin, though: the timeout message would have to be parsed before the plugin applies the regex, and I actually doubt that, but it might be worth a try.

Do you want to receive critical alarms if any thresholds are breached or do you just pull the metrics and ship them to icinga2, grafana, influx or whatever TSDB you use in the backend? You could also use check_negate to return OK instead of CRITICAL if you do not have thresholds applied in icinga2. Unfortunately this does not support any parsing of the output and just changes the state from 2 to 0 (CRITICAL -> OK).

blakehartshorn · July 11, 2019, 12:15pm

That means my bad advice probably has Icinga terminating the check about 1 second before the check terminates itself. Lower vars.snmp_timeout to 120:1 to test here. It should exit with a warning if it times out.
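
A sketch of how the two timeouts relate, assuming a check_snmp build that accepts the :<timeout state> suffix (the template name is hypothetical): the plugin's own timeout stays below Icinga's, so the plugin exits by itself with the chosen state instead of being killed.

```
// Hypothetical template; import it in the service definition.
template Service "snmp-timeout-as-warning" {
  check_timeout = 3m            // Icinga kills the plugin once this is exceeded
  vars.snmp_timeout = "120:1"   // passed to check_snmp -t: give up after 120 s with state 1 (WARNING)
}
```

Using "120:3" instead would report a timeout as UNKNOWN.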

vishalsevani · July 11, 2019, 12:16pm

Thanks Marcel. I actually just want to suppress the error “Plugin timed out while executing system call” and not generate an email notification for it. Currently this error changes the service state to CRITICAL, so I get an email notification.

I use the SNMP check to retrieve the input power value and compare it with a threshold. If the threshold is breached, I want an email notification. So I guess I can’t use check_negate.

But I will try to use ereg/eregi and see if it serves the purpose.

Thanks

winem · July 11, 2019, 12:34pm

Looks like @blakehartshorn had a look at the check_snmp code or just knows the module better than me. Try what he suggests as well.

And I think it could be useful for others as well to have the status configurable if the plugin times out. This might be a warning for some people, while it’s critical or simply unknown for others.
Another idea is to add an option like ereg|eregi to the negate plugin. This should be fairly easy as the negate plugin is written in bash.
And don’t forget to share the code updates with the community if you modify them.

vishalsevani · July 11, 2019, 12:53pm

If I use 120:1 as the timeout, then in the Icinga GUI I get an error saying that 120:1 is an incorrect option and that the value should be an integer. I am on version r2.10.4-1, which is quite recent, so it should not be a version issue?

Thanks

dnsmichi · July 11, 2019, 12:58pm

Which version of check_snmp are you using? (Hint: check_snmp --version)

vishalsevani · July 11, 2019, 2:45pm

OK, I have the following setup: one master and two satellites. On one satellite I am using nagios-plugins with check_snmp v2.2.1.git (nagios-plugins 2.2.1), which has the timeout:state option. On the other satellite I am using check_snmp v2.1.1 (monitoring-plugins 2.1.1), which doesn’t have the timeout:state option.

I will update the other satellite to use nagios-plugins 2.2.1 as well. Switching from monitoring-plugins 2.1.1 to nagios-plugins 2.2.1 should not be an issue?

Thanks

dnsmichi · July 11, 2019, 2:51pm

Not really. Unfortunately, both projects differ a bit in options. Still, neither is really actively developed anymore, so the differences should be small. I always thought that check_snmp would time out into UNKNOWN by default, but maybe the patches were never applied.

vishalsevani · July 11, 2019, 2:58pm

Yeah, if I add vars.snmp_timeout = 20, the state does change to UNKNOWN as you say, and the “Plugin Output” is “<Timeout exceeded.><Terminated by signal 9 (Killed).>”. Is this plugin output acceptable, or is it indicative of something wrong?

But if I don’t add the variable vars.snmp_timeout, the state changes to CRITICAL, with the “Plugin Output” being “Plugin timed out while executing system call”.
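
If timeouts end up as UNKNOWN like this, one possible way to stop the emails is a notification rule whose state filter leaves out Unknown. This is only a sketch: the rule name, the user and the notification command are assumptions based on the Icinga 2 sample configuration, and any existing rule that already notifies for this service would need the same filter.

```
apply Notification "mail-power-no-unknown" to Service {
  command = "mail-service-notification"   // assumed: command from the sample commands.conf
  users = [ "icingaadmin" ]               // assumed: user from the sample users.conf

  states = [ OK, Warning, Critical ]      // Unknown is left out, so timeouts do not send mail
  types = [ Problem, Recovery ]

  assign where service.name == "mikrotik-power"
}
```

The trade-off is that every other UNKNOWN result of this service is silenced as well.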

Thanks

dnsmichi · July 11, 2019, 3:03pm

That output originates from Icinga sending the KILL signal to the process. The default CheckCommand timeout setting is 1m.

You can either create your own CheckCommand with a longer timeout, or use check_timeout on the service object.

object CheckCommand "snmp-ex-timeout" {
  import "snmp"

  timeout = 5m
}

Cheers,
Michael

vishalsevani · July 11, 2019, 3:39pm

I added the option check_timeout = 5m to my service definition, which now reads:

apply Service "mikrotik-power" {
  import "generic-service"
  check_command = "snmp"
  check_timeout = 5m
  vars.snmp_oid = ".1.3.6.1.4.1.14988.1.1.3.8.0"
  vars.snmp_community = "UxkzrAunFoGSrBhLMA"
  vars.snmp_units = "dV"
  vars.snmp_version = "2c"
  vars.snmp_crit = host.vars.critical_range
  vars.snmp_warn = host.vars.warning_range
  vars.snmp_timeout = 20
  assign where (match("RouterBOARD 750*", host.vars.hardware) || match("960PGS", host.vars.hardware) || match("RB750Gr3", host.vars.hardware) || match("CCR1009-7G-1C-1S+", host.vars.hardware)) && host.vars.critical_range && host.vars.warning_range
}

But I still get the same plugin output “<Timeout exceeded.><Terminated by signal 9 (Killed).>”.

Is there an issue with this plugin output? Could it mean that check_snmp received no response within 20 seconds, so Icinga killed the check_snmp process, which I guess is expected behaviour?

Thanks

winem · July 13, 2019, 6:59pm

It just means that the plugin itself did not return any results within the plugin timeout for any reason. The plugin not being able to connect to the host in time can be one of the reasons.

The snmp_timeout = 20 tells the plugin to wait 20 seconds to successfully establish the connection to the target host. So it will fail if the connection fails or can not be established in that time.
But the error can also occur if the plugin has to process a lot of data, the connection is successful but slow or any other reason that slows down the execution of the check.

I am not 100% sure about the check_timeout variable. I think this is the variable to tell Icinga2 to wait longer for a response from the plugin (to be honest, I’m just a bit confused by the documentation, so give it a try).

In general plugins should not run too long unless there is a very good reason for it. So do you actually know why it times out? Is it “expected” to fail from time to time? I would start some debugging on that topic in parallel and try to understand why it fails. For example, a simple cronjob that polls some data via SNMP from the same host and logs the result together with the total execution time could give you more insight.

vishalsevani · July 15, 2019, 11:28am

I think the error “<Timeout exceeded.><Terminated by signal 9 (Killed).>” was occurring because I was setting snmp_timeout = 120s, while in my templates.conf file the retry value for the snmp service was 30 seconds. So a new snmp check was being attempted after 30 seconds while the previous snmp check was still in progress. The previous check had to be terminated to perform the new one, hence the “Terminated by signal 9 (Killed)” message. I have now changed snmp_timeout to 20 seconds, and the error has gone.

Thanks