Service alert comparing current values to past to track state change

keithf4 · February 10, 2022, 2:39pm

I’m trying to find out whether Icinga can do an alert similar to one I have in Prometheus. The alert compares the current value against the value from 5 minutes ago to see whether there has been a state change, and alerts based on that. Example of the Prometheus expression:

expr: ccp_is_in_recovery_status != ccp_is_in_recovery_status offset 5m

Just FYI, it’s used to check for a failover event in PostgreSQL. In automated failover scenarios, however, we just need the alert to fire as a general notification that a failover happened, and it can resolve itself after a while (5 minutes in this case). Whether the failover was successful is covered by other alerts that give more detail, and in most cases automated failover completes without any issues. So a self-resolving alert works perfectly.

Is something like this possible in Icinga? I find it useful in other situations as well, not just this one example. I do have a TSDB storage backend so past data is available, but I don’t see anything that allows querying it from a service alert standpoint.

Version used: 2.11
Operating System and version: CentOS7

rsx · February 10, 2022, 2:51pm

Yes, you can create a check plugin that compares its result with the previous output, and configure a service that passes the previous output to that plugin. Example:

vars.icinga_last_output = {{get_service(host.name, "anyservice").last_check_result.output}}

keithf4 · February 10, 2022, 3:42pm

Sorry if I’m not following along, are you saying to make a check plugin that queries the TSDB for previous data? If so, any guidance on something to do that (we are using the Graphite backend)?

Or is there some other way to query a previous result in icinga?

keithf4 · February 10, 2022, 7:33pm

Following along a little better now after looking up where “last_check_result” comes from and how the get_service() function works. Playing around with it in the API.

Still not sure about creating a check plugin to compare against previous results, though.

rsx · February 11, 2022, 8:07am

You need to create a service that internally queries its own current (that is, last) check result and passes it as an argument to your check plugin. The check plugin determines the new result and compares it with the content that was handed over. You need something like this:

apply Service "ccp_status" {
   check_command = "ccp-status"
   vars.icinga_last_output = {{get_service(host.name, "ccp_status").last_check_result.output}}
...
   assign where ...
}

object CheckCommand "ccp-status" {
   command = [ PluginDir + "/check_ccp_status" ]

   arguments = {
      "-l" = {
         value = "$icinga_last_output$"
         description = "Last check result"
      }
...
   }
}
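
The plugin side could be sketched roughly as follows, in Python. This is only a minimal illustration, not an existing plugin: the `--current` option, the `in_recovery=` token, and the `evaluate` helper are all assumptions made for the example. A real check_ccp_status would determine the current value itself instead of taking it as an argument. The idea is that every output line embeds the current value, so the next run can recover it from the text passed via `-l`.

```python
import argparse
import re

# Standard plugin exit codes used in this sketch.
OK, CRITICAL = 0, 2

# Hypothetical token embedded in every output line so that the next run
# can recover the previous value from the text handed over with -l.
TOKEN = re.compile(r"in_recovery=(\w+)")


def evaluate(current, last_output):
    """Return (exit_code, message) for the current value and the previous output."""
    match = TOKEN.search(last_output or "")
    if match is None:
        # First run, or the previous output did not come from this plugin.
        return OK, f"OK: no previous value to compare, in_recovery={current}"
    previous = match.group(1)
    if previous != current:
        return CRITICAL, f"CRITICAL: changed from {previous} to {current}, in_recovery={current}"
    return OK, f"OK: unchanged, in_recovery={current}"


def main(argv=None):
    """Parse the arguments, print the plugin output, and return the exit code."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-l", "--last-output", default="",
                        help="previous plugin output, as passed by the service")
    parser.add_argument("-c", "--current", required=True,
                        help="current value (hypothetical, for illustration only)")
    args = parser.parse_args(argv)
    code, message = evaluate(args.current, args.last_output)
    print(message)
    return code
```

Calling `sys.exit(main())` at the end of the script would turn this into an executable plugin. Because the CRITICAL message already carries the new value, the following check run compares like with like and returns to OK, so the alert lasts for one check interval.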

keithf4 · February 11, 2022, 8:01pm

So it turns out I didn’t need to know the actual value in the past; I just needed to know about the state change. Your reply about using the “get_service()” function at least got me going down the right path for checking service state.

I also found this blog post about building dummy services to check the state of other existing services as well

I have a simple file count check, so I tested making a new service against it that watches for its state change, shown below. I will then see about adapting it to the failover state change. Thanks for the guidance!

apply Service "chk_filecount_state" {
  import "generic-service"

  check_command = "dummy"
  check_interval = 1m

  assign where host.name == NodeName

  // Exit state for the dummy check: CRITICAL (2) while chk_filecount has
  // changed state within the last ten minutes, OK (0) otherwise.
  vars.dummy_state = {{
    var this_last_state_change = get_service(host.name, "chk_filecount").last_state_change
    var ten_minutes_ago = get_time() - 10 * 60

    if (this_last_state_change > ten_minutes_ago) {
      return 2
    } else {
      return 0
    }
  }}

  // Plugin output text, using the same ten-minute window as dummy_state.
  vars.dummy_text = {{
    var this_last_state_change = get_service(host.name, "chk_filecount").last_state_change
    var ten_minutes_ago = get_time() - 10 * 60

    if (this_last_state_change > ten_minutes_ago) {
      return "CRITICAL: File count state change detected"
    } else {
      return "OK: File count state is normal"
    }
  }}
}
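
For anyone wondering why this resolves itself: the window logic of the service above can be restated as a small Python sketch. The function name `dummy_state` and its parameters are made up for the illustration and are not part of Icinga; timestamps are assumed to be Unix epoch seconds, as returned by get_time().

```python
import time

# Exit codes matching the values returned by the service above.
OK, CRITICAL = 0, 2


def dummy_state(last_state_change, now=None, window_seconds=10 * 60):
    """Return CRITICAL while the last state change lies inside the window, else OK."""
    if now is None:
        now = time.time()
    window_start = now - window_seconds
    return CRITICAL if last_state_change > window_start else OK


# A state change at t=1000 is reported for the following ten minutes only.
print(dummy_state(1000, now=1060))  # one minute later: still inside the window
print(dummy_state(1000, now=1601))  # just past ten minutes: window has expired
```

Once the watched service has been stable for longer than the window, the dummy service goes back to OK without any manual action, which matches the self-resolving behaviour of the Prometheus rule.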