Checking service clusters in Icinga 2

Author: @mfrosch
Last Edit: 2019-03-28

Today we step into the depths of the Icinga 2 DSL and try to find a nice way to combine the state of several services into one.

This approach takes a list of hosts, each with one service, and checks whether a minimum number of those services is OK before the cluster turns WARNING or CRITICAL.

In Icinga 1.x this would be achieved by check_cluster using macros like $SERVICESTATEID:host:service$.

Features

  • Nice markup with links and HTML for Icinga 2
  • Check working internally in Icinga 2
  • Simple definition apart from the CheckCommand
  • Verbose state information
  • Fallback to unknown as appropriate

Services

Let’s define services:

apply Service "http alive" {
  import "http"
  // more details
  assign where host.vars.cluster == "webfrontend"
}

object Host "webfrontend" {
  //...
}
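
For the apply rule to match, each cluster member needs the custom variable the rule filters on. A minimal sketch of one member host follows; the generic-host template and the address are assumptions for illustration, not part of the original setup:

```
object Host "web1.example.com" {
  import "generic-host"      // assumed template, use your own
  address = "192.0.2.11"     // placeholder address from the documentation range

  // Matched by the "http alive" apply rule
  vars.cluster = "webfrontend"
}
```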

object Service "cluster http alive" {
  import "default service"
  host_name = "webfrontend"
  check_command = "cluster_services"

  vars.cluster_label = "HTTP Webfrontend"
  vars.cluster_hosts = [
    "web1.example.com",
    "web2.example.com",
    "web3.example.com",
    "web4.example.com",
  ]
  vars.cluster_service = "http alive"
  vars.cluster_min_warn = 2
  vars.cluster_min_crit = 1
}
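
The two thresholds are lower bounds on the number of OK services. As a rough illustration of the intended semantics (not the command's actual code), the decision can be written as a small DSL function. The name cluster_state is hypothetical, and the snippet should be usable in a config file or in icinga2 console:

```
// Hypothetical helper, for illustration only
function cluster_state(count_ok, min_warn, min_crit) {
  if (count_ok < min_crit) {
    return 2    // CRITICAL: fewer OK services than cluster_min_crit
  } else if (count_ok < min_warn) {
    return 1    // WARNING: fewer OK services than cluster_min_warn
  }
  return 0      // OK: enough services are OK
}

cluster_state(0, 2, 1)   // 2
cluster_state(1, 2, 1)   // 1
cluster_state(2, 2, 1)   // 0
```

With cluster_min_warn = 2 and cluster_min_crit = 1, the cluster is CRITICAL when no service is OK, WARNING when exactly one is OK, and OK from two upwards.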

Internals

Fields / Vars:

  • cluster_label: Label to be shown in the output
  • cluster_hosts: An array of hosts involved in the cluster
  • cluster_service: Name of the service to check on each cluster host
  • cluster_min_warn: Minimum number of services that must be OK; below this, the state is WARNING
  • cluster_min_crit: Minimum number of services that must be OK; below this, the state is CRITICAL
  • icingaweb_baseurl: Set this if you need a more specific base URL than /icingaweb2

object CheckCommand "cluster_services" {
    import "plugin-check-command"

    command = [ PluginDir + "/check_dummy" ]

    arguments += {
        output = {
            order = 2
            skip_key = true
            value = {{
                var count_ok = 0
                var count_warning = 0
                var count_critical = 0
                var count_unknown = 0
                var outputs = []
            
                var hosts = macro("$cluster_hosts$")
                var service = macro("$cluster_service$")
                var label = macro("$cluster_label$")
                var icingaweb = macro("$icingaweb_baseurl$")
                if (!icingaweb) {
                  icingaweb = "/icingaweb2"
                }
            
                for (var host in hosts) {
                  var s = get_service(host, service)
                  var link = "<a class=\"action-link\" href=\"" + icingaweb + "/monitoring/service/show?host=" + host + "&service=" + service +"\">" + host + "</a>"
                  var line = "[" + link + "] "
                  if (s) {
                    if (s.state == 0) {
                      count_ok += 1
                      line += "[OK] "
                    } else if (s.state == 1) {
                      count_warning += 1
                      line += "[WARNING] "
                    } else if (s.state == 2) {
                      count_critical += 1
                      line += "[CRITICAL] "
                    } else {
                      count_unknown += 1
                      line += "[UNKNOWN] "
                    }

                    if (s.last_check_result) {
                      line += s.last_check_result.output.split("\n")[0]
                    } else {
                      line += "<no check result>"
                    }
                  } else {
                    line += "<missing>"
                    count_unknown += 1
                  }

                  outputs.add(line)
                }

                return "Cluster " + label + ": " + count_ok + " ok, " + count_warning + " warning, " + count_critical + " critical, " + count_unknown + " unknown" + "\n<div class=\"preformatted\">" + outputs.join("\n") + "</div>"
            }}
        }
        state = {
            order = 1
            skip_key = true
            value = {{
                var count = 0      // services that exist at all
                var count_ok = 0   // services currently in state OK

                var hosts = macro("$cluster_hosts$")
                var service = macro("$cluster_service$")
                var min_warn = macro("$cluster_min_warn$")
                var min_crit = macro("$cluster_min_crit$")
            
                for (var host in hosts) {
                  var s = get_service(host, service)
                  if (s) {
                    count += 1
                    if (s.state == 0) {
                      count_ok += 1
                    }
                  }
                }
            
                if (count_ok < min_crit) {
                  return 2
                } else if (count_ok < min_warn) {
                  return 1
                } else {
                  return 0
                }
            }}
        }
    }
    vars.cluster_min_crit = 1
    vars.cluster_min_warn = 1
}

Summary

  • This could have been done with dummy instead of the external check_dummy plugin, but this way it stays compatible with Director
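
As a rough sketch of the dummy variant mentioned above: the ITL dummy command takes its result from vars.dummy_state and vars.dummy_text, and both may be functions. The example below is an assumption about how that could look, not the author's setup. The service name is illustrative, the output text is reduced to a placeholder, and only the state logic is shown:

```
object Service "cluster http alive dummy" {
  host_name = "webfrontend"
  check_command = "dummy"

  vars.cluster_hosts = [ "web1.example.com", "web2.example.com" ]
  vars.cluster_service = "http alive"
  vars.cluster_min_warn = 2
  vars.cluster_min_crit = 1

  // Count the member services in state OK and compare against the thresholds
  vars.dummy_state = {{
    var count_ok = 0
    var service = macro("$cluster_service$")
    for (node in macro("$cluster_hosts$")) {
      var s = get_service(node, service)
      if (s && s.state == 0) {
        count_ok += 1
      }
    }
    if (count_ok < macro("$cluster_min_crit$")) {
      return 2
    } else if (count_ok < macro("$cluster_min_warn$")) {
      return 1
    }
    return 0
  }}

  // Placeholder text; the verbose output would be built here
  vars.dummy_text = {{
    return "Cluster check via dummy"
  }}
}
```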

Although the example here is not with HTTP states, the output in Icinga Web 2 looks like this:

[Screenshot from 2019-03-28 11:34:52]

Any questions or comments?

There will be a way to import this via Director Baskets, but there is a pending issue that will be fixed in 1.7.0:

A JSON dump to be imported with Icinga Director Baskets can be found here:

Nice, I like the way you combined and formatted the output! You might eventually want to change the DataType of the Hosts field so that one can choose from the available hosts, as shown here.

Cheers,
Thomas

Hi,

Thank you for this! It’s really useful! It’s been a little while since your post, so before I start implementing it I wanted to check: would you still recommend this approach for replacing check_cluster? And have you made any changes since, or are you planning any for production?

Thanks!

Since check_cluster cannot be used with Icinga 2 in a useful way, this should be the best approach.

Feel free to use it. I should move it to a GitHub repository at some point…

We’ve got a couple of service cluster checks to check processes running on hosts. Is there a way to use this with generic service definitions like the following?

object CheckCommand "check_procs_named" {
  import "plugin-check-command"

  command = [ PluginDir + "/check_procs" ]
  arguments = {
    "-C" = "$proc_name$",
    "-w" = "1:5",
    "-c" = "1:8"
  }
}

apply Service "ha-procs " for ( proc_name => config in host.vars.check_procs ) {
  import "15m-service"

  check_command = "check_procs_named"
  vars += config

  command_endpoint = host.vars.remote_client
}

  vars.check_procs["automount"] = {
    proc_name = "automount"
  }

e.g. a cluster check for automount

Thank you!

I guess one could write a function to fill vars.cluster_hosts automatically, but it would be recalculated every time the check runs, and I’m not sure what a fast way would be to look up all services created by the ha-procs rule across all hosts.
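
A minimal sketch of such a function, assuming that a custom variable holding a function is evaluated when the macro is resolved and that the resulting array is passed through by macro(); both points are worth verifying on your version. It walks all hosts with get_objects(Host) and keeps those that have the service. As noted, this runs on every check execution, so it may be slow on large setups. The service name "ha-procs automount" is an assumption based on the apply rule above:

```
  // Inside the cluster Service object
  vars.cluster_service = "ha-procs automount"

  vars.cluster_hosts = {{
    var service = macro("$cluster_service$")
    var result = []
    for (h in get_objects(Host)) {
      // get_service() returns null if the host has no such service
      if (get_service(h.name, service)) {
        result.add(h.name)
      }
    }
    return result
  }}
```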

Well, having a look at your code, I played around with the API console, and it looks like it should just work!
e.g.
get_service("hostname", "ha-procs myprocessname")

So I just need to set vars.cluster_service to "ha-procs processname"
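
Put together, a cluster check for automount could look roughly like the sketch below. The host names are placeholders. The service name is the apply rule's prefix "ha-procs " followed by the dictionary key, so it has to match the key used in vars.check_procs exactly:

```
object Service "cluster automount" {
  host_name = "ha-cluster.example.com"   // placeholder host carrying the cluster check
  check_command = "cluster_services"

  vars.cluster_label = "automount"
  vars.cluster_hosts = [
    "node1.example.com",                 // placeholder member hosts
    "node2.example.com",
  ]
  vars.cluster_service = "ha-procs automount"
  vars.cluster_min_warn = 2
  vars.cluster_min_crit = 1
}
```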