Monitoring same hosts from multiple locations

thoomas · February 2, 2019, 9:28am

I’m looking for ideas about how to implement the following scenario: We have a distributed monitoring setup as it is described as “Top Down Config Sync” in the current Icinga2 documentation. We’ll get a new node placed out in the internet (outside of our data center). The goal is to do a small subset of checks we have twice:

To let this outside node send out notifications in case the master is down
To check the internal and external view (ie, we need to have ping latency results as they are from inside our network, as well as how they are from the internet, visible in Icingaweb)

We checked some approaches, which are all not really satisfying. Hope someone of you has ideas.

Let the outside node run standalone: Doesn’t really fit our requirements, as then we don’t have the check results of the outside node visible in Icingaweb (and we don’t want to have a dedicated Icingaweb instance setup for the outside node). Also, notifications get sent out twice if master and outside node are not able to coordinate.

High availability master: Not applicable, as this outside node should only do a small subset of checks and not all of them. As the node is placed outside our datacenter, it only has limited access to the internal network anyway.

Using check_command: Not applicable, as the outside node should start sending notifications as soon as the master is down (thus the outside node needs its own scheduler).

Connect this outside node via top down config sync: This would be possible, as we can allow the outside node and the master to do bidirectional communication and thus could attach the outside node in the same way as every inside node.

I was trying some approaches with top down config sync, which were all not really satisfying:

Distribute the Host and Service objects that should be checked twice from the master to the outside node: Doesn’t work, as the master then complains about the Host being redefined in zones.d. But I actually need it twice, as otherwise I can’t monitor twice.
Manually enter Host and Service objects in conf.d on the outside node: Would work, but has two disadvantages:
** Notifications are sent twice (outside node should not send out notifications if the master is available to do it)
** The outside node also submits the status of the Host object to the master. Seems like there is no way to have this filtered out.

Why do I want to have the status of the Host object not submitted? Probably the best solution would be if I could define (as an example) a service ping4_internal and ping4_external on a Host. The outside node should only submit the ping4_external service check to the master, nothing else. This would allow us to clearly distinguish between what is the inside and what is the outside result. It would also allow us to define distinct notification rules (ie, notify inside and outside with different “times”). However, when the outside node and the master don’t agree upon whether the host is up or not, then the host starts flapping on the master. Thus it would be much easier if the outside node could submit only Service results.

Let’s start this discussion.

dnsmichi · February 3, 2019, 12:32pm

Hi,

I have no exact solution for this, but I would throw in some ideas.

Multi location views

First off, specify object names which are unique, and have the external checks being synced to such an external satellite checker.

Think of a specific location schema, e.g. host1-de, host1-us, etc. These hosts are providers for actual data, but do not send notifications or anything else. Their result alone can be seen in Icinga Web 2, but it is not part of the “action document” when something goes wrong.
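As a rough illustration of the naming scheme above, the per-location objects could look like the following sketch. The zone name "external", the file paths and the address are placeholders and not taken from this thread's setup; enable_notifications = false is one way to keep these objects as pure data providers.

```
// zones.d/master/hosts.conf (hypothetical path): checked from inside the network
object Host "host1-de" {
  check_command = "hostalive"
  address = "192.0.2.10"           // placeholder address from the documentation range
  enable_notifications = false     // data provider only, no alerts from this object
}

// zones.d/external/hosts.conf (hypothetical path, "external" is an assumed zone name)
object Host "host1-us" {
  check_command = "hostalive"
  address = "192.0.2.10"           // same target, seen from the outside node
  enable_notifications = false
}
```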

Instead, you’ll define a business process logic on top of these 2…n hosts and services: if all of them are down, the overall status is critical; if only some are, it is warning, etc. This can be achieved with the business process module and an overall check defined there, which you then use for the real host1 object.

Another possibility would be to use the object accessor functions inside the DSL and write a function which fetches the host states from a defined list, and calculates the state and output too. There is an example in the docs too.

Having this in place also allows you to add more satellites in different regions and to combine more than just 2 location based checks.

These “real” host and service objects also get notifications attached to them, and additionally the “action plan” with URLs and notes.
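The DSL variant mentioned above could be sketched roughly as follows, modelled on the cluster example in the Icinga 2 docs. It assumes that a Host object "host1" exists and that host1-de and host1-us are the per-location hosts; the service name "multi-location" and the custom variable location_hosts are made up for illustration. A Service is used here instead of a Host so that a WARNING state is available.

```
object Service "multi-location" {
  host_name = "host1"              // assumption: the "real" host object
  check_command = "dummy"
  check_interval = 1m

  vars.location_hosts = [ "host1-de", "host1-us" ]

  vars.dummy_state = {{
    var names = macro("$location_hosts$")
    var down = 0

    for (name in names) {
      var h = get_host(name)
      // A missing object is counted as down; Host state 0 means UP.
      if (!h || h.state > 0) {
        down += 1
      }
    }

    if (down == 0) {
      return 0                     // OK: up from every location
    }
    if (down < len(names)) {
      return 1                     // WARNING: down from some locations
    }
    return 2                       // CRITICAL: down from all locations
  }}

  vars.dummy_text = "Aggregated view over all monitoring locations"
}
```

Notifications and notes would then be attached to this object only, while the per-location hosts stay silent.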

Outside notification sender

In terms of the “outside notification when the master is down” question, I would split this away from the above. There’s no guarantee that the outside node would actually be able to have the correct state, and in case your network is not reachable, under what conditions and circumstances should your outside customers be notified?

I’d rather “solve” this in the way that not single services are notified outside, but a global notification about general outage/reachability is pushed (external mail provider, status website).

Coming back from the above - if you want your customers being notified about host1-us when the cluster-zone check is critical to the master, sync notification apply rules to the satellite, but turn notifications off. Using an event handler script, you can enable notifications via the REST API.
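The event handler idea could be wired up along these lines. This is only a sketch under several assumptions: the script path, its arguments, the host name "satellite-ext", the zone name "master" and the custom variable toggle_target are all hypothetical. The script itself is not shown; it would send a POST request to /v1/objects/hosts/<name> with {"attrs": {"enable_notifications": true}} (or false on recovery).

```
object EventCommand "toggle-notifications" {
  command = [ "/usr/local/bin/toggle-notifications" ]   // hypothetical script
  arguments = {
    "--state" = "$service.state$"            // state of the master connection check
    "--state-type" = "$service.state_type$"  // lets the script react to HARD states only
    "--target" = "$toggle_target$"           // object whose notifications get toggled
  }
}

object Service "master-connection" {
  host_name = "satellite-ext"      // assumed host object of the outside node
  check_command = "cluster-zone"
  vars.cluster_zone = "master"     // assumed name of the master zone
  event_command = "toggle-notifications"
  vars.toggle_target = "host1-us"
}
```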

Test-drive the above in a lab though. It may cause false positive alarms and might not be the real wanted solution.

Cheers,
Michael

thoomas · February 4, 2019, 8:01pm

Instead, you’ll define a business process logic on top of these 2…n hosts and services

You mean the business process module add-on for Icingaweb? In such a case, would alerting only happen if the webserver hosting Icingaweb is ready to process check results?

Another possibility would be to use the object accessor functions inside the DSL and write a function which fetches the host states from a defined list, and calculates the state and output too

That is actually a cool feature. Somehow I never looked at it, but seems that could be a perfect match for my requirements.

Using an event handler script, you can enable notifications via the REST API

Makes sense. Can the DSL and functions be used in the Notification objects too (e.g. dynamically specifying the notification command)?

Addendum: Writing a function looks indeed really promising. Approach I was trying out: Use the HostGroup membership to determine which hosts should be included in the calculation. So far I found out, this only seems possible by looping over get_objects(Host) and verifying the groups variable. Is there a simpler possibility? I’ve seen that get_host_group() or get_objects(HostGroup) does not contain the list of members.

dnsmichi · February 5, 2019, 7:58am

Hi,

It is one of the possibilities, yes. The module sources its data from the Icinga Web 2 backend, e.g. the IDO database. The corresponding CLI command which runs a check and combines the output, needs access to this as well. Either executed on the web server, or it runs on the Icinga master and has the database resource configured and can access the IDO db.

If you’re looking into the DSL magic, a short tip: Start simple with inline lists, e.g. hardcode the object names into that function, and loop and calculate and return. You can always refactor and refine that code later on, also with help from us.

command is a static object attribute and as such, you cannot change it from inside the DSL. Something like

command = {{ 
  if (bla) {
    return "cmdobjname1"
  } else {
    return "cmdobjname2"
  }
}}

doesn’t work. Instead, you can modify the command attribute for a given notification object via the REST API by sending a POST request. This of course needs an existing NotificationCommand beforehand.

I tend to think that this gets complicated though, so I would rather move the logic for the different behaviour into the script itself, and only modify boolean values (enable_notifications or a custom attribute override).

Regarding a simpler way to get the host group members: no, unfortunately not. Can you share the snippets you’ve done already? We can optimize it together and maybe we’ll find something applicable for a patch/feature request in the DSL.

Cheers,
Michael

thoomas · February 5, 2019, 8:34pm

Can you share the snippets you’ve done already?

Didn’t do anything sophisticated (yet). But there were two approaches I was playing with. Both would rely on host groups.

First one, as I can’t easily get a list of host group members, fill in a custom variable on the host group. As I’m using Puppet, it is possible to easily propagate (query the PuppetDB, output each found host), as such:

object HostGroup "DEMO" {
    vars.members = [ "xyz.example.com" ]
    vars.members += [ "abc.example.com" ]
    // ... and so forth ...
}

Then I tried:

object Host "DEMO" {
  display_name = "Environment DEMO"
  check_command = "dummy"
  vars.dummy_state = {{
    var cluster_nodes = get_host_group("DEMO").vars.members
    for (node in cluster_nodes) {
      // ... count and apply thresholds here
    }
  }}
}

Works so far, but I think this would be difficult to handle for anyone who is not running a config management (such as Puppet, Ansible, etc.), as maintaining the host group is too complicated.

Second approach:

object Host "DEMO" {
  display_name = "Environment DEMO"
  check_command = "dummy"
  vars.dummy_state = {{
    var all_nodes = get_objects(Host)
    for (node in all_nodes) {
      if ("DEMO" in node.groups) {
        // ... count and apply thresholds here
      }
    }
  }}
}

One advantage here is that I can define the host group with an assign where rule. This would make it easier to handle for everyone who is maintaining their host group definitions manually. But it probably has a lot of overhead, as it is looping over hosts that are not relevant for this check. I also have to keep in mind that I need the same for Services. Looping over services and comparing the service name seems like a lot of overhead too, even more than just looping over the hosts.
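For illustration, the elided "count and apply thresholds" step of the second approach might look like the sketch below. The 25 percent threshold is an arbitrary example value, and the sketch assumes that the aggregating host "DEMO" is not itself a member of the host group "DEMO".

```
object Host "DEMO" {
  display_name = "Environment DEMO"
  check_command = "dummy"

  vars.dummy_state = {{
    var total = 0
    var down = 0

    for (node in get_objects(Host)) {
      if ("DEMO" !in node.groups) {
        continue                   // skip hosts outside the group
      }
      total += 1
      if (node.state > 0) {        // Host state 0 means UP
        down += 1
      }
    }

    // Example threshold: DOWN once 25 percent or more of the members are down.
    if (total > 0 && down * 100 / total >= 25) {
      return 2
    }
    return 0
  }}
}
```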

Regarding feature requests, some suggestions:

Allow a filter argument for get_objects, e.g. get_objects(Host, 'where match("*.example.com", host.name)')
Or: get_objects(Host, 'where "DEMO" in host.groups')
Or: get_objects(Service, 'where match("*.example.com", host.name) && service.name == "ping4"')
Allow to call the REST API, e.g. call_rest_api(filter, attrs) returns a dictionary in the same way as sending this over the API would

dnsmichi · February 5, 2019, 9:38pm

The first approach looks good to me, if that works in your environment.

The second approach is also very well thought through.

In terms of looping and object lookup - even if you call get_host("somename"), the code does a lookup in the object list in memory. Best case is that it is the first element, worst case is the last element in that list. Sure, there are optimizations in place; still, it could use further optimization with filters.

I would do it in the same way as with Array#filter, allowing the user to bind an anonymous lambda function for filtering the scope, or similar. I think it is already doable with get_objects() returning an array of objects.

Check the troubleshooting docs where we’ve used this to analyse objects and check results in HA clusters at a customer.

Your examples probably look like this (I haven’t hacked the DSL for a while, I’m not sure if they will work copy-paste):

get_objects(Host).filter(x => match("*.example.com", x))
get_objects(Host).filter(x => "DEMO" in x.groups)
get_objects(Service).filter(s => match("*.example.com", s.host_name) && s.name == "ping4")

Unfortunately we cannot provide an HTTP client in our DSL, as there are known bugs in the current http client library. Maybe such http-like elements can be added later on, though this makes the configuration too dynamic imho, similar as with feature requests for exec or system. Anyhow, it has to wait up until a new http implementation happens.

Cheers,
Michael

thoomas · February 6, 2019, 10:03pm

Your examples probably look like this (I haven’t hacked the DSL for a while, I’m not sure if they will work copy-paste):
get_objects(Host).filter(x => match("*.example.com", x))

That’s great. I could make that work and it makes it way easier to handle.

When looking at the documentation, it wasn’t obvious that a returned host object can be handled similarly like an array data type. Or at least I didn’t see it.

Now, the next thing I’m working on is making it re-usable - trying to define a template Host object. Looks promising:

template Host "fictitious-host-as-hostgroup-aggregation" {
  vars.dummy_text = {{
    [... left out ...]
    var mygroup = host.vars.aggregate_group
    log(mygroup)
    var nodes = get_objects(Host).filter(node => mygroup in node.groups)
    [... left out ...]
  }}
}

object Host "DEMO" {
  import "fictitious-host-as-hostgroup-aggregation"
  vars.aggregate_group = "DEMO"
  vars.aggregate_percentage_warn = 10
  vars.aggregate_percentage_crit = 25
}

I left only one function snippet in there, because that’s where I get a problem:

Error: Error while evaluating expression: Tried to access undefined script variable ‘mygroup’

The log() call outputs the proper value for mygroup to the logfile. I tried out various variants, but with no luck. I guess this is some type of variable scoping problem?

Unfortunately we cannot provide an HTTP client in our DSL, as there are known bugs in the current http client library.

It doesn’t necessarily have to use “real” HTTP. It could directly call the internal backend REST functions, without doing a real HTTP request. Probably there’s an authentication issue anyway, as it would become necessary to have the API credentials ready in the function.

dnsmichi · February 7, 2019, 8:27am

Hi,

thoomas:

Your examples probably look like this (I haven’t hacked the DSL for a while, I’m not sure if they will work copy-paste):
get_objects(Host).filter(x => match("*.example.com", x))
That’s great. I could make that work and it makes it way easier to handle.

When looking at the documentation, it wasn’t obvious that a returned host object can be handled similarly like an array data type. Or at least I didn’t see it.

I have to correct myself, the code should read as

get_objects(Host).filter(x => match("*.example.com", x.name))

only matching a specific string attribute for that object. The docs lack some good examples on how to combine the different bits. I’m a friend of helping users and evaluating the best solutions, with later maybe writing a howto or docs entry.

Introduction into Array#filter()

Typically, when you call get_objects(), you’ll get an array as result type. This array contains elements, and an object is such an element. Objects behave nearly the same as dictionaries, at least you can access their attributes with the dot indexer.

That’s basically what happens in the above code:

calling filter() on an array
the function walks each element and executes the function callback on it
the function callback takes x as argument which is our current object element
must return true or false
thus using a condition which magically returns the boolean expression’s value
match the object name from x against a pattern. This can be any other string as well, e.g. x.check_command or you’ll go for x.check_interval > 60.
When true is returned from the callback, this element is copied into the resulting array
In the end, you’ll get a reduced array that contains only the matching objects.
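The steps in the list above can be tried without any monitored objects, e.g. in the debug console, by filtering a plain array. The sketch below uses made-up interval values and shows the lambda form next to the equivalent long form with an explicit return.

```
var intervals = [ 30, 60, 300 ]

// Lambda form: the value of the expression is the callback's return value.
var long_lambda = intervals.filter(x => x > 60)

// Equivalent long form with an explicit function and return statement.
var long_function = intervals.filter(function(x) { return x > 60 })

// Both resulting arrays contain a single element, 300.
```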

You can go further, and use map() to reduce the entire object list to just an array of names for instance.

get_objects(Host).filter(x => match("*.example.com", x.name)).map(x => x.name)

Tip: Test such functions with the debug console first

$ icinga2 console --connect 'https://root:icinga@localhost:5665/'

<7> => get_objects(Service).filter(s => match("*mbp*", s.host_name)).map(s => s.name)
[ "ssh", "ping6", "ping4", "procs", "swap", "users", "load", "disk", "icinga", "disk /", "http" ]

Access object scopes in anonymous functions at runtime

The host object is not available in this scope for anonymous lambda functions, there’s nothing which binds this into the scope. (If you want to learn more about the differences between anonymous functions and those which bind variables and scopes, check the docs.)

Instead, you need to do a lookup for a given name available in this scope, by using runtime macros. That way you can circumvent the problem of missing scoped objects.

  var myhost = get_host(macro("$host.name$")) // macro() only returns the name, so look up the object
  var mygroup = myhost.vars.aggregate_group

REST API vs DSL

I’m not sure what’s missing here. If you want to access specific objects and attributes, the DSL has more possibilities than the API following a more strict URL format. Is there a specific example on local expressions you’d need here? I was thinking that such API calls should be fired against the secondary master, or anything else.

Cheers,
Michael

thoomas · February 8, 2019, 8:29am

Is there a specific example on local expressions you’d need here?

Enabling / disabling notifications.

The host object is not available in this scope for anonymous lambda functions, there’s nothing which binds this into the scope.

Ok. As you can see in my example, I put in log() for debugging purposes. It does output the proper value for mygroup to the icinga log. Is log() handled in some special way?

I cannot make it work with your suggestion either. That’s what I tried:

var myhost = macro("$host.name$")
log(myhost)
var myobj = get_host(myhost)
var mygroup = myobj.vars.aggregate_group
log(mygroup)
var nodes = get_objects(Host).filter(node => mygroup in node.groups)

log() both times output the correct value for myhost and mygroup. But still … undefined script variable …

Next try:

globals.get_hostgroup_status_array = function(mygroup) {
    log(mygroup)
    var nodes = get_objects(Host).filter(node => mygroup in node.groups)
}

Then in the debugger console: <1> => get_hostgroup_status_array("DEMO")

Results in Tried to access undefined script variable ‘mygroup’. Even if log() outputs the proper value to the icinga log. I guess I shouldn’t have a scoping issue here, as the variable has been explicitly passed to the function?

As of now, I’m continuing with:

globals.get_hostgroup_status_array = function(mygroup) {
    var nodes = get_objects(Host)
    for (node in nodes) {
      if (mygroup !in node.groups) {
        continue
      }
      [...]

dnsmichi · February 8, 2019, 8:43am

There’s an experimental modify_attribute function but I don’t know whether this fully works in this scope. I assume you’d want sort of “trigger an action or modify something”, also like “create a check result and process it”.

log() calls print the given value with a timestamp up front. I would use the long version as we do in the code, like log(LogWarning, "config", Json.encode(myhost)) to debug further.
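A small sketch of the long form of log(), with severity and facility as the first two arguments. The host name is a placeholder, and Json.encode is the name under which the JSON encoder is exposed in the DSL; encoding only the vars dictionary keeps the output short.

```
var h = get_host("host1-de")       // placeholder host name

// Short form: only the value.
log(h.name)

// Long form: severity, facility, value.
log(LogWarning, "config", Json.encode(h.vars))
```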

Hmmm, maybe the parser doesn’t like the variable instead of the string in there for the anonymous lambda function. Let’s see about this, I need to reproduce it. Coming back soon.

Cheers,
Michael

dnsmichi · February 8, 2019, 10:52am

So, I’ve found the problem and decided to write this as a short howto: DSL: Get host objects in hostgroup with get_objects() and Array#filter (deep-dive into lambda expressions, functions and closures)

TL;DR - we need to bind mygroup into the function’s local scope, as otherwise it doesn’t know about it.

It doesn’t hurt to keep your own function, since closures and function callbacks might not be your thing at all. They mirror what developers know from several programming languages, and are still a cool feature.
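As a minimal sketch of what binding mygroup into the callback’s scope can look like, the closure syntax from the language reference (the use keyword) can be applied to the filter callback. This is not copied from the linked howto, and the function name get_hostgroup_members is made up for this example.

```
globals.get_hostgroup_members = function(mygroup) {
  // use(mygroup) binds the variable into the callback's own scope;
  // without it the callback cannot see the outer function's locals.
  return get_objects(Host).filter(function(node) use(mygroup) {
    return mygroup in node.groups
  })
}
```

In the debug console, get_hostgroup_members("DEMO").map(h => h.name) should then return the names of the group members.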

Cheers,
Michael

jwal · July 7, 2020, 11:18pm

Sorry to re-open this thread. @dnsmichi Your reply to the OP was almost exactly what I was looking for. In reference to the portion of the Multi location views section of your comment: Is one able to use object accessor functions when using director? If so, how? Maybe I’m misunderstanding, but I don’t see anything of the like in director, and as far as I know director manages all of the configs. Also, can you enlighten me at all about what DSL means? I’ve attempted to search for answers but came up empty handed.