Get host objects in hostgroup - failing for hosts added via the API

Thank you for the excellent documentation here:

These methods seem to work flawlessly for all of our on-premises (statically defined in the config) hosts; the trouble is that they return no results for hosts that were added via the API, and I was hoping someone could show me what I’m overlooking.

Summary

  • Running on: RHEL 6.7
  • Version: icinga2 2.10.3-1 (via yum)
  • All methods described here work for statically defined hosts.
  • Our AWS hosts are added via the Icinga 2 API (from an AWS Lambda); their templates and host groups are still statically defined in the Icinga config.
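
For context, the Lambda’s host creation boils down to an API call along these lines (a sketch; the endpoint, credentials, host name, and address are illustrative, and only the template and group names match the examples below):

curl -k -s -S -u apiuser:password -H 'Accept: application/json' \
    -X PUT 'https://icinga-master:5665/v1/objects/hosts/aws-node-01' \
    -d '{ "templates": [ "aws_test01_cluster" ],
          "attrs": { "address": "10.0.0.1", "groups": [ "aws_test01" ] } }'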

Console test

If I connect to the icinga2 console and run this code, it works perfectly, even for the AWS hosts:

host_group = "aws_test01"
filter_function = function(node) use(host_group) { host_group in node.groups }
get_objects(Host).filter(filter_function)
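
Note that a standalone icinga2 console has no objects loaded, so the snippet above was run in a console connected to the live daemon, along these lines (credentials illustrative):

icinga2 console --connect 'https://root:password@localhost:5665/'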

But when the same filter is used inside the custom function and the result is viewed in Icingaweb2, I get a zero-length array back.

Method 1

Cluster host definition:

object Host "Cluster: App: Healthcheck URL: aws_test01" {
    import "aws_test01_cluster"
    check_command = "dummy"
    vars += {
        dummy_state = get_dummy("state", "aws_test01", "App: Healthcheck URL")
        dummy_text  = get_dummy("text", "aws_test01", "App: Healthcheck URL")
    }
}
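
For reference, the template and host group this relies on are statically defined. A minimal sketch of their shape (the contents are illustrative; only the names come from our config, and vars.threshold = 2 is inferred from the "Failure threshold: 2" output further down):

object HostGroup "aws_test01" {
}

template Host "aws_test01_cluster" {
    vars.threshold = 2
}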

get_dummy function (work in progress):

globals.get_dummy = function(t, host_group, service_name) {
    return function() use (t, host_group, service_name) {
        filter_function = function(node) use(host_group) { host_group in node.groups }
        cluster_nodes   = get_objects(Host).filter(filter_function)
        threshold       = macro("$host.vars.threshold$")
        if (t == "state") {
            down_count  = 0
            for (node in cluster_nodes) {
                health_state = get_service(node, service_name).last_check_result.state
                if (health_state > 0) {
                    down_count += 1
                }

            }
            # TODO: if no nodes were detected, exit UNKNOWN instead.
            if (down_count >= threshold) {
                return 2
            } else {
                return 0
            }
        } else if (t == "text") {
            down_count  = 0
            host_state  = { "CRITICAL" = {}, "OK" = {} }
            for (node in cluster_nodes) {
                health_state  = get_service(node, service_name).last_check_result.state
                health_output = get_service(node, service_name).last_check_result.output
                if (health_state == 0) {
                    host_state["OK"][node.name] = health_output
                } else {
                    host_state["CRITICAL"][node.name] = health_output
                    down_count += 1
                }
            }

            up_count = len(cluster_nodes) - down_count
            if (len(cluster_nodes) == 0) {
                output = "[UNKNOWN] "
            } else if (down_count >= threshold) {
                output = "[CRITICAL] "
            } else {
                output = "[OK] "
            }
            output += "Cluster:  " + up_count + "/" + len(cluster_nodes)
            output += " nodes up.  (" + down_count + " down.)  "
            output += "Failure threshold:  " + threshold + "\n"
            for (stat => href in host_state) {
                for (hname => check_output in href) {
                    output += "[" + stat + "] "  + hname + ":  " + check_output + "\n"
                }
            }
            output += "\nDebug:\n"
            output += "t:                   '" + t + "'\n"
            output += "host_group:          '" + host_group + "'\n"
            output += "service_name:        '" + service_name + "'\n"
            output += "len(cluster_nodes):  '" + len(cluster_nodes) + "'\n\n"
            for (node in cluster_nodes) {
                output += "node:  '" + node.name + "'\n"
            }
            return output
        }
    }
}

And the output in Icingaweb2 for an AWS cluster:

[UNKNOWN] Cluster:  0/0 nodes up.  (0 down.)  Failure threshold:  2
Debug:
host_group:          'aws_test01'
service_name:        'App: Healthcheck URL'
len(cluster_nodes):  '0'

Method 2 - define in the Lambda

dummy_text = {{
    host_group      = "aws_test01"
    service_name    = "App: Healthcheck URL"
    filter_function = function(node) use(host_group) { host_group in node.groups }
    cluster_nodes   = get_objects(Host).filter(filter_function)
    threshold       = macro("$host.vars.threshold$")
    down_count  = 0
    host_state  = { "CRITICAL" = {}, "OK" = {} }
    for (node in cluster_nodes) {
        health_state  = get_service(node, service_name).last_check_result.state
        health_output = get_service(node, service_name).last_check_result.output
        if (health_state == 0) {
            host_state["OK"][node.name] = health_output
        } else {
            host_state["CRITICAL"][node.name] = health_output
            down_count += 1
        }
    }

    up_count = len(cluster_nodes) - down_count
    if (len(cluster_nodes) == 0) {
        output = "[UNKNOWN] "
    } else if (down_count >= threshold) {
        output = "[CRITICAL] "
    } else {
        output = "[OK] "
    }
    output += "Cluster:  " + up_count + "/" + len(cluster_nodes)
    output += " nodes up.  (" + down_count + " down.)  "
    output += "Failure threshold:  " + threshold + "\n"
    for (stat => href in host_state) {
        for (hname => check_output in href) {
            output += "[" + stat + "] "  + hname + ":  " + check_output + "\n"
        }
    }
    output += "\nDebug:\n"
    output += "host_group:          '" + host_group + "'\n"
    output += "service_name:        '" + service_name + "'\n"
    output += "len(cluster_nodes):  '" + len(cluster_nodes) + "'\n\n"
    for (node in cluster_nodes) {
        output += "node:  '" + node.name + "'\n"
    }
    return output
}}

The output in Icingaweb2 is the same.

Method 3 - 1-line filter w/ lambda

dummy_text = {{
    host_group      = "xhaws_test01_cpe"
    service_name    = "App: Healthcheck URL"
    cluster_nodes = get_objects(Host).filter(h => "xhaws_test01_cpe" in h.groups)
    threshold       = macro("$host.vars.threshold$")
    down_count  = 0
    host_state  = { "CRITICAL" = {}, "OK" = {} }
    for (node in cluster_nodes) {
        health_state = get_service(node, service_name).last_check_result.state
        health_output = get_service(node, service_name).last_check_result.output
        if (health_state == 0) {
            host_state["OK"][node.name] = health_output
        } else {
            host_state["CRITICAL"][node.name] = health_output
            down_count += 1
        }
    }

    up_count = len(cluster_nodes) - down_count
    if (len(cluster_nodes) == 0) {
        output = "[UNKNOWN] "
    } else if (down_count >= threshold) {
        output = "[CRITICAL] "
    } else {
        output = "[OK] "
    }
    output += "Cluster:  " + up_count + "/" + len(cluster_nodes)
    output += " nodes up.  (" + down_count + " down.)  "
    output += "Failure threshold:  " + threshold + "\n"
    for (stat => href in host_state) {
        for (hname => check_output in href) {
            output += "[" + stat + "] "  + hname + ":  " + check_output + "\n"
        }
    }
    output += "\nDebug:\n"
    output += "host_group:          '" + host_group + "'\n"
    output += "service_name:        '" + service_name + "'\n"
    output += "len(cluster_nodes):  '" + len(cluster_nodes) + "'\n\n"
    for (node in cluster_nodes) {
        output += "node:  '" + node.name + "'\n"
    }
    return output
}}

Again, no change.

Final thoughts

I also tried changing the host group for the AWS hosts to the same one used by one of our on-premises clusters. As expected, when looking at the HostGroup itself in Icingaweb2, they all show up together, and in the console test these hosts all show up as well. But in the dummy check output viewed in Icingaweb2, only the statically defined hosts are returned. I’d love to hear any feedback you might have.

Thanks!

Hello @jkoppel!

At least your first example seems to be broken. Please upgrade to v2.11 and watch your log; you should get warnings like these.

Best,
AK

Thank you for your reply. Do you know at what stage I should be getting those warnings? I upgraded my local Vagrant instance to 2.11.3 and re-ran the first example from the console, and I’m not seeing any warnings in the logs. I’ve been attempting to build upon the example given here: DSL: Get host objects in hostgroup with get_objects() and Array#filter (deep-dive into lambda expressions, functions and closures)

You should get them at least while the host is being checked. Even if you don’t get them, you should define variables not as filter_function = ..., but as var filter_function = ....
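
For example (a sketch; as I understand it, var makes these names function-local, while a bare assignment can bind outside the function’s own scope and trigger those warnings):

dummy_text = {{
    # Declare every function-local explicitly with "var".
    var host_group = "aws_test01"
    var filter_function = function(node) use (host_group) { host_group in node.groups }
    var cluster_nodes = get_objects(Host).filter(filter_function)
    return "Matched " + len(cluster_nodes) + " hosts in group '" + host_group + "'"
}}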

Ahh, thank you; I have updated the function and included that change.

globals.get_dummy = function(t, host_group, service_name) {
    return function() use (t, host_group, service_name) {
        var filter_function = function(node) use(host_group) { host_group in node.groups }
        var cluster_nodes   = get_objects(Host).filter(filter_function)
        var threshold       = macro("$host.vars.threshold$")
        var down_count  = 0
        if (t == "state") {
            for (node in cluster_nodes) {
                var health_state = get_service(node, service_name).last_check_result.state
                # Check if the host itself is down first, then the service.
                if (node.state > 0) {
                    down_count += 1
                } else if (health_state > 0) {
                    down_count += 1
                }
            }
            if (len(cluster_nodes) == 0) {
                # No nodes detected: UNKNOWN (matches the "[UNKNOWN]" text below).
                return 3
            } else if (down_count >= threshold) {
                return 2
            } else {
                return 0
            }
        } else if (t == "text") {
            var host_state  = { "DOWN" = {}, "UP" = {} }
            for (node in cluster_nodes) {
                var service       = get_service(node, service_name)
                var health_state  = service.last_check_result.state
                var health_output = service.last_check_result.output
                # Check if the host itself is down first, then the service.
                if (node.state > 0) {
                    host_state["DOWN"][node.name] = health_output
                    down_count += 1
                } else if (health_state > 0) {
                    host_state["DOWN"][node.name] = health_output
                    down_count += 1
                } else {
                    host_state["UP"][node.name] = health_output
                }
            }

            var up_count = len(cluster_nodes) - down_count
            if (len(cluster_nodes) == 0) {
                var output = "[UNKNOWN] "
            } else if (down_count >= threshold) {
                var output = "[DOWN] "
            } else {
                var output = "[UP] "
            }
            output += "Cluster:  " + len(cluster_nodes) + " nodes."
            if (up_count > 0) {
                output += "  [UP]:  " + up_count + "."
            }
            if (down_count > 0) {
                output += "  [DOWN]:  " + down_count + "."
            }
            output += "  Failure threshold:  " + threshold + "\n"
            output += "Node status:\n"
            for (stat => href in host_state) {
                for (hname => check_output in href) {
                    output += "[" + stat + "] "  + hname + ":  " + check_output + "\n"
                }
            }
            if (len(cluster_nodes) == 0) {
                # If there are no nodes for this hostgroup, something went
                # wrong.  Include debug info.
                output += "\nDebug:\n"
                output += "t:                   '" + t + "'\n"
                output += "host_group:          '" + host_group + "'\n"
                output += "service_name:        '" + service_name + "'\n"
                output += "len(cluster_nodes):  '" + len(cluster_nodes) + "'\n\n"
                for (node in cluster_nodes) {
                    output += "node:  '" + node.name + "'\n"
                }
            }
            return output
        }
    }
}

The output is cleaner, but the result is the same:

In icingaweb2:

[UNKNOWN] Cluster:  0 nodes.  Failure threshold:  2
Node status:

Debug:
t:                   'text'
host_group:          'aws_test01'
service_name:        'App: Healthcheck URL'
len(cluster_nodes):  '0'

It hinges on the lines:

    var filter_function = function(node) use(host_group) { host_group in node.groups }
    var cluster_nodes   = get_objects(Host).filter(filter_function)

Results:

  • In the icinga2 console, these work for every cluster.
  • In icingaweb2:
    • All statically-defined hosts return the correct HostGroup members (array of Host objects).
    • All hosts added via the API return an empty array.
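
One way to confirm that the API-created hosts really are group members at runtime is to query the REST API directly (a sketch; credentials illustrative):

curl -k -s -u root:icinga -H 'Accept: application/json' -G \
    'https://localhost:5665/v1/objects/hosts' \
    --data-urlencode 'filter="aws_test01" in host.groups'

If the hosts show up there but not inside the custom function, the filter expression itself can be ruled out.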

Unfortunately, I have no idea what the problem is here. I’ve tested getting Hosts inside vars.dummy_* and it definitely works.

You don’t seem to store get_objects(Host) anywhere, nor dump it in the debug output. I’d recommend dumping/logging every single step; maybe then you’ll figure out what’s wrong.
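
For example, something like this inside the get_dummy closure (a sketch; log() with a severity, facility, and message is the DSL’s logging call, and the facility string is arbitrary):

var all_hosts = get_objects(Host)
log(LogInformation, "get_dummy", "total hosts: " + len(all_hosts))
var cluster_nodes = all_hosts.filter(filter_function)
log(LogInformation, "get_dummy", "hosts in '" + host_group + "': " + len(cluster_nodes))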

Btw., my config:

template Host "dc" {
        check_command = "dummy"
        check_interval = 1s
        vars.dummy_state = 0

        vars.dummy_text = function() {
                var groups = macro("$host.groups$")
                return len(get_objects(Host).filter(function(node) use(groups) {
                        return node.groups.any(function(group) use(groups) {
                                return group in groups
                        })
                }))
        }
}

object HostGroup "c1" {
}

object Host "c1a" {
        import "dc"
        groups = [ "c1" ]
}

object Host "c1b" {
        import "dc"
        groups = [ "c1" ]
}

object HostGroup "c2" {
}

object Host "c2a" {
        import "dc"
        groups = [ "c2" ]
}

object Host "c2b" {
        import "dc"
        groups = [ "c2" ]
}

Update: The filtering logic does not seem to be an issue at all; the root of the problem is that:

get_objects(Host)

is not returning hosts that have been added via the API. Does anyone know why that is happening, or if there is another way to return all hosts?

Works for me:

➜  icinga2 git:(master) prefix/sbin/icinga2 daemon -d
[2020-09-02 11:32:08 +0200] information/cli: Icinga application loader (version: v2.12.0)
[2020-09-02 11:32:08 +0200] information/cli: Closing console log.
➜  icinga2 git:(master) curl -fksSLu root:icinga -H 'Accept: application/json' -X PUT 'https://localhost:5665/v1/objects/hosts/lolcat' -d '{ "attrs": { "check_command": "hostalive" }, "pretty": true }'; echo

{
    "results": [
        {
            "code": 200.0,
            "status": "Object was created"
        }
    ]
}
➜  icinga2 git:(master) prefix/sbin/icinga2 console --connect https://root:icinga@localhost:5665/
Icinga 2 (version: v2.12.0)
Type $help to view available commands.
<1> => get_objects(Host).map(h => h.name)
[ "Alexanders-MacBook-Pro.local", "lolcat" ]
<2> =>

Please provide instructions to reproduce this in a fresh Icinga 2 environment.

Yes, that works for me as well from the Icinga console. Where it fails is when it is used inside an attribute’s custom function. This appears to be a bug, so I have logged one here, with additional tests: https://github.com/Icinga/icinga2/issues/8209
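
A minimal way to demonstrate the mismatch (a sketch, not the exact tests from the issue): define a host whose dummy text counts every Host object, create another host via the API as in the curl example above, and compare the check output in Icingaweb2 with get_objects(Host) in a connected console.

object Host "probe" {
    check_command = "dummy"
    check_interval = 10s
    vars.dummy_state = 0
    vars.dummy_text = {{
        # Counts all hosts visible to this custom function at check time.
        return "get_objects(Host) sees " + len(get_objects(Host)) + " hosts"
    }}
}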