Icinga2 - overdue states

fatslimjoe · August 26, 2020, 1:05pm

Hi,

A couple of months ago we had a problem with overdue states. It was related to a bug in Icinga2 version 2.10.5; after updating to 2.11.4 the problem was gone. We also had the same issue with some endpoints: when, for example, an Icinga2 agent (which is in a running state) stops responding and stops sending check results, its checks go into the overdue state. We didn't receive any alerts from Icinga2 itself, but we want to be notified about that.

How do you manage monitoring of overdue states? It would be perfect if we could create a notification rule like “apply rule for servers which are in the overdue state”.

Did you query the DB in that case, or is there maybe an easier solution? I am one step away from developing a Python script which would run some queries against the DB and return the results.

Also, Livestatus is deprecated in Icinga2 and is gone from version 2.12 onwards.

What do you think is the best way to create alerts when a machine is stuck in the overdue state (when the clock symbol appears inside the notification)?

Unfortunately we are using Icinga Director, which means we are not able to use any fancy TCLs.

P.S. I did check the topic “How do you manage overdue checks?” and didn't like it much.

THX

0xliam · August 28, 2020, 2:24am

Hi @fatslimjoe,

This is still something I’m looking to solve too.

I imagine it would be easy to query overdue checks via the database, but I would love if this could be built into the Icinga codebase.

As far as I understand, Icinga2 has no concept of overdue checks - this is something that was built into the Icinga2 Web framework and is calculated by working

I’d love if there was a setting to mark checks that are overdue to UNKNOWN but I don’t know if this fits into the spirit of Icinga.

Alternatively, it would be possible to extend IcingaWeb or develop a module to set the state of overdue checks to UNKNOWN manually (after hitting a certain threshold), but something would need to execute this check - possibly something for the Icinga Director Background Daemon, but this again might not fit into the spirit of the tool.
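A minimal sketch of what "setting an overdue check to UNKNOWN" could look like from the outside, assuming the Icinga2 REST API action `/v1/actions/process-check-result` is used and the service accepts passive check results. The helper name `build_unknown_result` and the example service name are hypothetical; the sketch only builds the JSON body and does not send anything.

```python
import json


def build_unknown_result(service_full_name, seconds_overdue):
    """Build the JSON body for POST /v1/actions/process-check-result.

    service_full_name is the composite "host!service" name.
    Exit status 3 means UNKNOWN for services.
    """
    return {
        "type": "Service",
        # filter_vars keeps the object name out of the filter expression itself
        "filter": "service.__name == name",
        "filter_vars": {"name": service_full_name},
        "exit_status": 3,
        "plugin_output": "UNKNOWN: no check result for %d seconds" % int(seconds_overdue),
    }


# Hypothetical example service; replace with a real "host!service" name.
payload = build_unknown_result("web01!disk", 900)
print(json.dumps(payload, indent=2))
```

Something still has to decide which services are overdue and run this periodically, which is exactly the open question above. A later active check result would replace the submitted state again.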

I’d be curious to hear the thoughts of the core team on this though - this is something we will need to solve for our use case and I’m sure my employers would be happy for me to contribute this back to the Icinga2 codebase if the idea is supported.

rsx · August 28, 2020, 8:27am

As I’m also interested in this topic, I’ve been looking around and found this idea. When I started to integrate it, I realized it is already there: Dashboard/Overdue (which provides this URL: http://icinga2.example.com/icingaweb2/dashboard?pane=Overdue).

(Unfortunately, I haven't been able to figure out when it was added.)

fatslimjoe · August 28, 2020, 6:46pm

@0xliam I am not sure, but I think it is possible using Livestatus. If you are sure that you will stay on Icinga2 version 2.11.5 or lower for a loooong time, you can check what you get from Livestatus. Maybe this could be solved over the REST API? I haven't found the right request yet. Maybe you are right and this can't be queried from the icinga_ido DB. I will check that too.
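One possible shape for such a REST API request, as a hedged sketch: it asks `/v1/objects/services` for only the attributes needed to judge lateness (`__name`, `last_check`, `check_interval`) and leaves the comparison to the client. The host name, API user and password are placeholders, and the function `build_services_request` is a name made up for this example. The request is built but not sent.

```python
import base64
import json
import urllib.request


def build_services_request(base_url, user, password):
    """Build (but do not send) a request listing all services with check timing."""
    body = json.dumps({"attrs": ["__name", "last_check", "check_interval"]})
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/objects/services",
        data=body.encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/json",
            "Authorization": "Basic " + token.decode("ascii"),
            # The API reads the JSON body of a POST as a query when this is set
            "X-HTTP-Method-Override": "GET",
        },
    )


# Placeholder endpoint and credentials, not taken from this thread.
req = build_services_request("https://icinga2.example.com:5665", "apiuser", "secret")
print(req.get_method(), req.full_url)
```

Sending it would be `urllib.request.urlopen(req, context=...)` with an SSL context that trusts the Icinga CA certificate; the response is JSON with a `results` list, one entry per service.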

@rsx This is interesting. It could also be helpful, but it would be better to have notifications. Maybe it could be solved using another approach, like analysing the root cause and creating some extra service checks? Like some indirect checks? It gets more complicated the more I think about solutions …

bkai · August 31, 2020, 2:41pm

I once wrote some bash code to ferret out late service checks, i.e. those where Last Check is older than a certain no. of minutes. So basically I used the HTTPS port 5665 API…

This might have to be adapted for host checks.

But, using it, you could essentially build a plugin that notices late checks, and then goes red to trigger a notification rule.

Let me know if this might be helpful & I would post some code snippets.
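For readers who want to try the idea, here is a minimal Python sketch (not the bash code mentioned above) of the plugin logic: it takes an already parsed response from `/v1/objects/services`, treats a service as late when `last_check` is older than a multiple of its `check_interval`, and maps the count to a plugin state. The function names, the factor of 2 and the sample data are assumptions made for illustration.

```python
def find_late_services(api_response, now, factor=2.0):
    """Return the names of services whose last check is older than factor * interval."""
    late = []
    for result in api_response.get("results", []):
        attrs = result["attrs"]
        if attrs["last_check"] < now - factor * attrs["check_interval"]:
            late.append(attrs["__name"])
    return sorted(late)


def plugin_result(late, warn, crit):
    """Map the number of late services to a plugin exit code and output line."""
    count = len(late)
    if count >= crit:
        state, label = 2, "CRITICAL"
    elif count >= warn:
        state, label = 1, "WARNING"
    else:
        state, label = 0, "OK"
    output = "%s: found %d late checks | overdue=%d;%s;%s;;" % (label, count, count, warn, crit)
    return state, output


# Made-up sample in the shape of an API response; timestamps are Unix seconds.
sample = {
    "results": [
        {"attrs": {"__name": "web01!disk", "last_check": 999950.0, "check_interval": 60.0}},
        {"attrs": {"__name": "db01!load", "last_check": 999000.0, "check_interval": 300.0}},
    ]
}
late = find_late_services(sample, now=1000000.0)
state, output = plugin_result(late, warn=1, crit=10)
print(output)
print(", ".join(late))
```

A real plugin would pass `time.time()` as `now` and finish with `sys.exit(state)`, so that the non-zero state can trigger a normal notification rule.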

fatslimjoe · August 31, 2020, 3:56pm

@bkai I think this could be helpful … I was thinking of going in that direction using the REST API. I am certain that even if this is not the solution, it is definitely the right direction. I need to see the output and play with it a little bit. I will keep this topic updated and post my solution too.

tkarey · September 1, 2020, 3:21pm

I solved this with Icinga DSL and an “in memory” check on the master and satellites, to see if checks are overdue or if the master and satellites are in a config loop.

globals.overdue_number = function() {
    var res = []

    for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) {
        res.add([s.__name, DateTime(s.last_check).to_string()])
    }

    return res.len()
}


apply Service "Overdue Checks" {

    check_command = "dummy"
    check_interval = 1m
    retry_interval = 1m
    max_check_attempts = 20

    command_endpoint = host.name

    vars.dummy_warn = 5
    vars.dummy_crit = 10

    vars.overdue_number = {{ return overdue_number()}}
    vars.dummy_state = {{ return my_state(macro("$dummy_warn$"),macro("$dummy_crit$"),macro("$overdue_number$")) }}
    vars.dummy_text = {{
        var overdue_state = macro("$dummy_state$")
        var warn = macro("$dummy_warn$")
        var crit = macro("$dummy_crit$")
        var overdue_number = macro("$overdue_number$")

        if (overdue_state == 0) {
            var output = "OK: found " + overdue_number + " late checks | overdue=" + overdue_number + ";" + warn + ";" + crit + ";;"
        }
        if (overdue_state == 1) {
            var output = "WARNING: found " + overdue_number + " late checks | overdue=" + overdue_number + ";" + warn + ";" + crit + ";;"
        }
        if (overdue_state == 2) {
            var output = "CRITICAL: found " + overdue_number + " late checks | overdue=" + overdue_number + ";" + warn + ";" + crit + ";;"
        }
        return output
    }}

    assign where host.name in [ "master01", "master02", "satellite01" ]
}

fatslimjoe · September 1, 2020, 4:11pm

@tkarey thank you very much for sharing your solution. I made a mistake in my initial post … I wrote TCLs instead of DSLs … in our case we are using Icinga Director … so we are not able to use DSLs … once again, thank you for sharing; some will find it useful, I am sure.

prupert · April 13, 2021, 9:35am

Thanks for sharing your “in memory” check. The “my_state()” function does not exist as far as I know. I also noticed that it is redundant to also perform this check on satellites.

I made some small adjustments to also get the service names inside the output. Let me know if you see any errors or room for improvement.

globals.getLateServices = function() {
  var res = []
  for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) {
    res.add(s.__name)
  }
  return res
}

apply Service "icinga-overdue-checks" {
  # (...) import some generic service template ...
  check_command = "dummy"
  vars.dummy_warn = 1
  vars.dummy_crit = 10
  vars.overdue_services = {{ return getLateServices() }}
  vars.overdue_number = {{ return macro("$overdue_services$").len() }}
  vars.dummy_state = {{
    var warn = macro("$dummy_warn$")
    var crit = macro("$dummy_crit$")
    var overdue_number = macro("$overdue_number$")
    if (overdue_number >= crit) {
      return 2
    } else if (overdue_number >= warn) {
      return 1
    } else {
      return 0
    }
  }}
  vars.dummy_text = {{
    var overdue_state = macro("$dummy_state$")
    var warn = macro("$dummy_warn$")
    var crit = macro("$dummy_crit$")
    var overdue_number = macro("$overdue_number$")
    if (overdue_state == 0) {
      var output = "OK: found " + overdue_number + " late checks | overdue=" + overdue_number + ";" + warn + ";" + crit + ";;"
    }
    if (overdue_state == 1) {
      var overdue_services = macro("$overdue_services$").join(", ").replace("|","_")
      var output = "WARNING: found " + overdue_number + " late checks | overdue=" + overdue_number + ";" + warn + ";" + crit + ";;\n " + overdue_services
    }
    if (overdue_state == 2) {
      var overdue_services = macro("$overdue_services$").join(", ").replace("|","_")
      var output = "CRITICAL: found " + overdue_number + " late checks | overdue=" + overdue_number + ";" + warn + ";" + crit + ";;\n " + overdue_services
    }
    return output
  }}
  # (...) assign where rule ....
}

fatslimjoe · April 15, 2021, 9:10am

Hi,

this looks great! Will try to check in our test env.

KR,
Josip