Flapping between UNKNOWN and CRITICAL with by_ssh

  • Version used (icinga2 --version)
  • Operating System and version

icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.5-1)

System information:
Platform: Rocky Linux
Platform version: 8.7 (Green Obsidian)
Kernel: Linux
Kernel version: 4.18.0-425.3.1.el8.x86_64
Architecture: x86_64

I’m setting up a deployment that will monitor host-local checks using ssh, for example:

apply Service "memory" {
        import "generic-service"
        check_command = "memory"
        import "ssh-service"

        vars.memory_vmstats = true
        vars.memory_warning = "80%"
        vars.memory_critical = "90%"

        assign where host.vars.classes.os_cosrhl_8
}

where the ssh-service template is:

template Service "ssh-service" {
        import "generic-service"
        // "save" original command name, and replace it
        vars.original_check_command = check_command
        check_command = "by_ssh"
        // these get evaluated at runtime
        vars.by_ssh_command = {{get_check_command(service.vars.original_check_command).command }}
        vars.by_ssh_arguments = {{ get_check_command(service.vars.original_check_command).arguments }}
        vars.by_ssh_quiet = true
}

My problem is that a check will frequently flap between UNKNOWN:

Output: UNKNOWN - check_by_ssh: Remote command ''/usr/lib64/nagios/plugins/check_memory' '-c' '90%' '-s' '-w' '80%'' returned status 255

and CRITICAL:

Output: CRITICAL - Plugin timed out

This is a pain, as it generates a ton of alerts. The CRITICAL state appears when the ssh connection simply doesn’t respond before the check timeout is exceeded. I think the UNKNOWN state arises when the ssh command returns a failure code within the timeout, for example if the connection is refused or some other ssh-level failure occurs.

From my perspective, whether no valid status is returned because of a timeout or because of an ssh transport issue, I’d like the result to be the same state rather than two different states. Technically, UNKNOWN seems more valid for a timeout: you don’t know whether the CRITICAL threshold has been reached, you just don’t know what the current value is, exactly as when ssh fails. But I’d be happy if they were both CRITICAL, as long as the check doesn’t flap between two basically identical situations with different states.

Is there any easy way to make this happen with just configuration, as opposed to modifying the by_ssh check command implementation?

Thanks…

Hey Paul,

The status code is based on the plugin’s exit code after it has run. There is not really a smart way to change that, unless you edit the plugin itself.
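
To make that concrete, here is a minimal sketch of the exit-code-to-state mapping from the plugin API (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The function name state_name is made up for illustration, and the fallback to UNKNOWN for any other exit code is my understanding of how Icinga behaves, not something I have checked against your version:

```shell
#!/bin/sh
# Map a plugin exit code to the state name derived from it.
# state_name is a hypothetical helper, not part of Icinga or the plugins.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;  # 3 is the documented code; others assumed UNKNOWN too
    esac
}

# Demo with a stand-in command that exits 2, as a timed-out check_by_ssh does.
rc=0
sh -c 'exit 2' || rc=$?
echo "exit code $rc maps to $(state_name "$rc")"
```

In your case check_by_ssh exits with 3 when ssh itself fails (the "returned status 255" message) and with 2 when it hits its own timeout, which is why the same outage shows up as two different states.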

Do you get SSH connection issues often? Maybe there are infrastructure issues that should be resolved.

Alternatively, you could consider running the checks through an agent rather than over an SSH connection. This would avoid the problem entirely.

Thanks for the response. When this flapping occurs, there is generally something broken that should result in a notification. The problem is that the same breakage results in two different statuses, depending on whether the attempted ssh connection exceeds the check timeout or the attempted connection fails. I’m assuming an agent check in the same scenario would also fail, but perhaps always with the same status. My preference is the ssh check, as all of the systems have ssh server capability and it’s easier to install a couple of checks on them than a whole icinga deployment.

Reviewing the status page you reference, it says “Critical” is “The check exceeded the critical threshold, or something really is broken and will harm the production environment” whereas “Unknown” is “Invalid parameters, low level resource errors (IO device busy, no fork resources, TCP sockets, etc.) preventing the actual check […] TCP connection timeouts should be treated as Critical […] Whenever the plugin reaches its timeout […] it should also terminate with Unknown”.

I believe the UNKNOWN is coming from the ssh plugin failing with an error, whereas the CRITICAL is coming from the check timing out. For example:

$ /usr/lib64/nagios/plugins/check_by_ssh -C "'/usr/lib64/nagios/plugins/check_paging' '-c' '10000' '-w' '5000'" -H XXX -q -t 5
CRITICAL - Plugin timed out

Per this definition, shouldn’t the check_by_ssh plugin be returning UNKNOWN rather than CRITICAL on a timeout, which would prevent my flapping issue?

Thanks again…

Have you considered activating the flapping option for those checks?
https://icinga.com/docs/icinga-2/latest/doc/08-advanced-topics/#check-flapping

Flapping occurs when a service or host changes state too frequently, which would result in a storm of problem and recovery notifications. With flapping detection enabled a flapping notification will be sent while other notifications are suppressed until it calms down after receiving the same status from checks a few times.

Another possibility could be the negate plugin.

This allows you to change the return code of the executed check.
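
A rough sketch of the idea, in case it helps. The function unknown_to_critical below is a hand-rolled stand-in that I made up so the behaviour can be tried without the plugins installed; it is not how negate is implemented. The commented negate call uses its -u/--unknown option, and the plugin path for negate is an assumption based on the paths in this thread:

```shell
#!/bin/sh
# Stand-in for `negate -u CRITICAL <plugin> ...` (hypothetical helper):
# run the wrapped command, pass its output through, and turn
# exit code 3 (UNKNOWN) into 2 (CRITICAL). Other exit codes are unchanged.
unknown_to_critical() {
    "$@"
    rc=$?
    if [ "$rc" -eq 3 ]; then
        return 2
    fi
    return "$rc"
}

# With the real plugin, the call would look roughly like this
# (negate path assumed, host XXX as in the post above, not tested):
#   /usr/lib64/nagios/plugins/negate -u CRITICAL \
#       /usr/lib64/nagios/plugins/check_by_ssh -H XXX -q -t 5 \
#       -C "'/usr/lib64/nagios/plugins/check_paging' '-c' '10000' '-w' '5000'"
```

Two things to keep in mind: negate only changes the exit code unless you also pass -s, so the output text may still start with "UNKNOWN", and negate has its own timeout (-t), which should probably be set higher than the one given to check_by_ssh.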

Thanks for the suggestions. Tweaking the flapping detection might work, but will still result in extra notifications that I don’t think are warranted for the failure case. I might be able to use the negate plugin to turn UNKNOWN into CRITICAL. But after reviewing the definitions of the states, I think the by_ssh module is incorrect in using CRITICAL for a plugin timeout; it should be using UNKNOWN.

Fixed upstream:
