- Version used (icinga2 --version) and operating system and version:
icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.5-1)
System information:
Platform: Rocky Linux
Platform version: 8.7 (Green Obsidian)
Kernel: Linux
Kernel version: 4.18.0-425.3.1.el8.x86_64
Architecture: x86_64
I’m setting up a deployment that runs host-local checks over ssh, for example:
apply Service "memory" {
  import "generic-service"
  check_command = "memory"
  import "ssh-service"
  vars.memory_vmstats = true
  vars.memory_warning = "80%"
  vars.memory_critical = "90%"
  assign where host.vars.classes.os_cosrhl_8
}
where the ssh-service template is:
template Service "ssh-service" {
  import "generic-service"
  // "save" original command name, and replace it
  vars.original_check_command = check_command
  check_command = "by_ssh"
  // these get evaluated at runtime
  vars.by_ssh_command = {{ get_check_command(service.vars.original_check_command).command }}
  vars.by_ssh_arguments = {{ get_check_command(service.vars.original_check_command).arguments }}
  vars.by_ssh_quiet = true
}
My problem is that a check will frequently flap between UNKNOWN:
Output: UNKNOWN - check_by_ssh: Remote command ''/usr/lib64/nagios/plugins/check_memory' '-c' '90%' '-s' '-w' '80%'' returned status 255
and CRITICAL:
Output: CRITICAL - Plugin timed out
This is a pain, as it generates a ton of alerts. The CRITICAL state occurs when the ssh connection simply doesn’t respond before the check timeout is exceeded. I think the UNKNOWN state arises when the ssh command returns a failure code within the timeout, for example if the connection is refused or some other ssh-level failure occurs.
From my perspective, whether no valid status is returned because of a timeout or because of an ssh transport issue, I’d like the result to be the same state rather than two different ones. Technically, UNKNOWN seems more valid for a timeout: you don’t know whether the CRITICAL threshold has been reached, you just don’t know what the current value is, exactly as when ssh fails. I’d also be happy if both were CRITICAL, as long as the check doesn’t flap between two different states for what are basically identical situations.
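To make the desired behaviour concrete, here is a minimal sketch of the state mapping described above. It is written in Python as a hypothetical wrapper that calls ssh directly. It is not part of the by_ssh implementation and not an Icinga configuration option. The names `normalize_state` and `run_check` are made up for illustration. The sketch assumes that ssh itself exits with 255 on a transport error, and that any other exit code in the range 0 to 3 comes from the remote plugin.

```python
import subprocess

# Standard monitoring-plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: ssh exits with 255 when ssh itself fails (refused, auth error, ...).
SSH_TRANSPORT_ERROR = 255


def normalize_state(exit_code, timed_out, fallback=UNKNOWN):
    """Map both 'no valid result' cases (timeout, ssh failure) onto one state."""
    if timed_out or exit_code == SSH_TRANSPORT_ERROR:
        return fallback
    if exit_code in (OK, WARNING, CRITICAL, UNKNOWN):
        return exit_code  # a real plugin result, passed through unchanged
    return UNKNOWN  # anything else is not a valid plugin state


def run_check(argv, timeout_seconds, fallback=UNKNOWN):
    """Hypothetical wrapper: run argv (e.g. an ssh command line) and normalize."""
    try:
        proc = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return normalize_state(None, True, fallback), "no result: timed out"
    return normalize_state(proc.returncode, False, fallback), proc.stdout.strip()
```

With `fallback=UNKNOWN`, a timeout and an ssh exit status of 255 both end up as UNKNOWN. Passing `fallback=CRITICAL` would make both CRITICAL instead. Either way the two cases no longer produce different states.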
Is there an easy way to make this happen with just configuration, rather than by modifying the by_ssh check command implementation?
Thanks…