Check command timeout state - Critical instead of Unknown

Hello,

I’m looking for a way to show a service state as “Critical” instead of “Unknown” when the check command times out after the defined “check_timeout” is reached. Currently the “Unknown” state result is “Timeout exceeded <Terminated by signal 15 (Terminated)>”. I want it to show as “Critical” instead, similar to “check_nrpe” when it times out.

Does anyone know a way? Thanks

If you control or can wrap the check plugin, you could try waiting for a SIGALRM signal and then returning your custom timeout error with a CRITICAL service state. An example is documented at Service Monitoring, Timeout.
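A minimal sketch of such a wrapper, assuming a Unix host and Python 3 on the checker. The function name `run_with_timeout` and the output text are made up for illustration; only the SIGALRM idea comes from the suggestion above. The wrapper's own timeout has to stay below the service's check_timeout, otherwise Icinga 2 kills the wrapper first.

```python
import signal
import subprocess
import sys

CRITICAL = 2  # plugin exit code for the CRITICAL state


class PluginTimeout(Exception):
    """Raised by the SIGALRM handler when the wrapped plugin runs too long."""


def _on_alarm(signum, frame):
    raise PluginTimeout()


def run_with_timeout(command, timeout_seconds):
    """Run `command` and return (exit_code, output).

    If the plugin has not finished after `timeout_seconds` (must be >= 1),
    it is killed and a CRITICAL result is returned instead.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    proc = subprocess.Popen(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    try:
        signal.alarm(timeout_seconds)      # arm the alarm
        output, _ = proc.communicate()     # interrupted if the alarm fires
        return proc.returncode, output.strip()
    except PluginTimeout:
        proc.kill()
        proc.wait()
        return CRITICAL, f"CRITICAL - plugin timed out after {timeout_seconds}s"
    finally:
        signal.alarm(0)                    # disarm and restore the old handler
        signal.signal(signal.SIGALRM, previous)
```

A wrapper script would call `run_with_timeout` with the real plugin command line, print the returned text and exit with the returned code.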

Another way would be to “recover” a service or host after it went UNKNOWN by creating an Event Command. Such an Event Command could then overwrite the state, e.g., using the Icinga 2 API’s process_check_result.
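As a rough sketch of the API part, the request for the process-check-result action could be built like this. The helper name `build_override_request`, the URL and the credentials are placeholders, not anything from this thread; the function only builds the request and does not send it.

```python
import base64
import json
import urllib.request


def build_override_request(api_url, user, password, host, service,
                           exit_status, plugin_output):
    """Build (but do not send) a process-check-result request for one service."""
    payload = {
        "type": "Service",
        "service": f"{host}!{service}",   # Icinga 2 full service name
        "exit_status": exit_status,       # 2 = CRITICAL
        "plugin_output": plugin_output,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{api_url}/v1/actions/process-check-result",
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
```

The Event Command script would then send it with `urllib.request.urlopen(request, context=...)`, passing an SSL context that trusts the Icinga 2 CA. The API user needs permission for the process-check-result action.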

I am somewhat certain it should also be possible to alter the state within the Icinga 2 DSL by creating some wrapper Icinga 2 Service, but this is something one needs to try out.


I am somewhat certain it should also be possible to alter the state within the Icinga 2 DSL by creating some wrapper Icinga 2 Service, but this is something one needs to try out.

Not while using the Director, as the places where DSL is supported there are severely limited.

Wrapping with negate allows you to change states.

The Event Command approach would also allow you to override the state only if it really was a timeout.
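The "only if it really was a timeout" decision could look roughly like this inside the Event Command script. It assumes the script receives the state and plugin output as arguments (for example via the `$service.state_id$` and `$service.output$` runtime macros) and that the timeout output contains the text “Timeout exceeded”, as in the message quoted in the question. The function name `override_state` is made up for this sketch.

```python
UNKNOWN = 3
CRITICAL = 2


def override_state(state, output, marker="Timeout exceeded"):
    """Return CRITICAL for an UNKNOWN result that looks like a timeout.

    Any other result keeps its original state, so genuine UNKNOWN
    results (wrong arguments, missing plugin, ...) are left alone.
    """
    if state == UNKNOWN and marker in output:
        return CRITICAL
    return state
```

Only when the returned state differs from the incoming one does the script need to submit a new check result.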

Unsure if negate works in this case, since Icinga 2 kills the check plugin after it has exceeded the timeout. Thus, the negate command itself would be killed, still resulting in an UNKNOWN state with the “Terminated by signal 15” message.


Hi,

My first thought when reading the question went to the “-t” option that some plugins offer.

Here is the help on “-t” for check_tcp:

 -t, --timeout=INTEGER:<timeout state>
    Seconds before connection times out (default: 10)
    Optional ":<timeout state>" can be a state integer (0,1,2,3) or a state STRING

We use this “-t” option, and we make sure that the Service definition in Icinga allows a higher value for “check_timeout” than the value defined for “-t”. As a result, we can fine-tune how timeouts are being rendered in service checks.
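To illustrate the rule that the plugin's own timeout must stay below check_timeout, here is a small sketch that builds a check_tcp command line. The helper `tcp_check_args` is hypothetical; only `-H`, `-p` and the `-t SECONDS:STATE` form from the help text above are taken from the plugin.

```python
def tcp_check_args(host, port, plugin_timeout, check_timeout, timeout_state=2):
    """Build a check_tcp command line whose own timeout reports `timeout_state`.

    Raises ValueError if the plugin timeout would not fire before
    Icinga's check_timeout, because then Icinga would kill the plugin
    and report UNKNOWN anyway.
    """
    if plugin_timeout >= check_timeout:
        raise ValueError("plugin timeout must be lower than check_timeout")
    return ["check_tcp", "-H", host, "-p", str(port),
            "-t", f"{plugin_timeout}:{timeout_state}"]
```

For example, with a check_timeout of 30 seconds, `tcp_check_args("webfrontend", 443, 20, 30)` yields a command ending in `-t 20:2`, so a connection timeout is reported as CRITICAL by the plugin itself.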

Hope this helps,

Jean


The idea of altering the state within Icinga2 DSL sounds interesting. I’m wondering about how to try it out. Any suggestions @apenning ?
Thanks

I just put together a proof of concept of how something like this could look, at least as far as I understood the issue.
Basically, it’s a service that changes its state and plugin output depending on the state of another service. Yes, in my PoC this means two services instead of one.


So, as you can see, both services are in the same state if http alive is in state 0 (or 1 or 2). However, if http alive is in state 3, the statechange service becomes state 2.

Here is my icinga DSL config for this:

apply Service "http alive" {
  //import "http"
  check_command = "dummy"
  // more details
  volatile = true
  assign where host.name == "webfrontend"
}

object Host "webfrontend" {
  check_command = "dummy"
  //...
}

object Service "statechange" {
  check_command = "statechange_cmd"
  host_name = "webfrontend"
  vars.service = "http alive"
  vars.host = host_name
}

object CheckCommand "statechange_cmd" {
  command = [ PluginDir + "/check_dummy" ]
  arguments += {
    state = {
      order = 2
      skip_key = true
      value = {{
        var service = macro("$service$")
        var host = macro("$host$")
        var s = get_service(host, service)
        if (s) {
          if (s.state == 3) {
            return 2
          } else {
            return s.state // pass every other state through unchanged
          }
        }
      }}
    }
    output = {
      order = 3
      skip_key = true
      value = {{
        var service = macro("$service$")
        var host = macro("$host$")
        var s = get_service(host, service)
        if (s) {
          if (s.state == 3) {
            return "UNKNOWN"
          }
        }
      }}
    }
  }
}

Thank you for sharing a nice implementation of the check using the DSL. For a single service, I guess the timeout would have to be detected at runtime in order to apply similar logic and change the state from 3 to 2.