Icinga 2 can't execute host check on Windows: execvpe(/usr/lib/nagios/plugins/check_disk.exe) failed: No such file or directory

Hi,

I recently installed the Director in a small environment consisting of one master, one satellite, and one Windows host with the agent installed. I imported the agent endpoint via the kickstart wizard and created a host. I can run service checks for disk, load, and memory with no problems. The command endpoint for these checks is the satellite.

The problem I run into is that when I configure a host check, it always fails with a message like this:

execvpe(/usr/lib/nagios/plugins/check_disk.exe) failed: No such file or directory

I am not able to run any checks other than the dummy one. Something seems to be missing, but I can’t determine where. The referenced files like check_disk.exe exist on every Icinga server in /usr/lib/nagios/plugins/, except on the Windows system, where they are in C:\Program Files\ICINGA2\sbin, which is the default directory there.

I can only speculate that Icinga tries to run the checks from /usr/lib/nagios/plugins/ on the Windows machine and fails because that path doesn’t exist there.

I hope somebody can help me or point me in the right direction.

I think the problem is “the command endpoint is the satellite”.

Surely you want your command endpoint to be the Windows host.

You only want to run checks on the satellite if you’re checking the satellite.
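For checks that should run on the agent, the command endpoint has to be the agent’s own endpoint. As a minimal sketch of that pattern (the service name here is a placeholder; `nscp-local-disk` and the `os` variable are taken from later in this thread):

```
apply Service "disk" {
    check_command = "nscp-local-disk"

    // execute the plugin on the agent itself, not on the satellite
    command_endpoint = host.name

    assign where host.vars.os == "windows"
}
```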

Antony.


Thank you for your help.

I changed the command endpoint to the Windows host (dc01), but I can’t deploy the configuration because of the following error:

[2019-12-10 14:24:38 +0100] information/cli: Icinga application loader (version: r2.11.2-1)
[2019-12-10 14:24:38 +0100] information/cli: Loading configuration file(s).
[2019-12-10 14:24:38 +0100] information/ConfigItem: Committing config item(s).
[2019-12-10 14:24:38 +0100] information/ApiListener: My API identity: master-01
[2019-12-10 14:24:38 +0100] critical/config: Error: Validation failed for object 'dc01' of type 'Host'; Attribute 'command_endpoint': Command endpoint must be in zone 'dc01' or in a direct child zone thereof.
Location: in [stage]/zones.d/dc01/hosts.conf: 1:0-1:44
[stage]/zones.d/dc01/hosts.conf(1): object Host "dc01" {
                                    ^^^^^^^^^^^^^^^^^^
[stage]/zones.d/dc01/hosts.conf(2):     import "Windows Agent Satllit01"
[stage]/zones.d/dc01/hosts.conf(3): 

[2019-12-10 14:24:38 +0100] critical/config: 1 error
[2019-12-10 14:24:38 +0100] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

The strange thing is that the endpoint dc01 is in the zone dc01. I configured the zone hierarchy manually before I ran the kickstart wizard (master - satellite - dc01), with each endpoint in its respective zone. I know there is a difference between a host and an endpoint. But the satellite and dc01 were integrated into the environment in the same way, so I don’t really understand why everything except the host check works with the satellite as command endpoint, but I can’t deploy if I change it to dc01.
Maybe it is worth noting that the satellite doesn’t know the IP of dc01, but dc01 knows the IP of the satellite.

Agents don’t need manual configuration in any zones.conf; this is done by the Director.

Setting the option “run on agent” in the service template to yes should solve the problem. The check is then always run on the host itself (be it the master, a satellite or an agent).
(screenshot: “run on agent” option in the service template)

For this to work, the host object has to be named exactly like the zone and endpoint object of the satellite/agent host.
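To illustrate the naming requirement: for an agent host, the Host, Endpoint and Zone objects all carry the same name. A hypothetical example (the name `agent01` is made up; the parent is the satellit01 zone from this thread):

```
object Endpoint "agent01" {
    // no 'host' attribute: the agent connects to its parent zone
}

object Zone "agent01" {
    endpoints = [ "agent01" ]
    parent = "satellit01"
}

object Host "agent01" {
    check_command = "dummy"
}
```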

Can you show the zones.conf of the master, satellite and agent?

The option “run on agent” is enabled for all services I use. The problem is the check command of the host object of the Windows host.

The Windows endpoint and zone have the same name. The master and the satellite zones are named differently than their endpoints. Do you think I should change them?

Below are the zones.conf files of each server. I mainly used the zones.d directory on the master for importing endpoints and zones into Icinga with the kickstart wizard. The configuration in those files doesn’t differ from the zones.conf.

Master zones.conf

object Endpoint "mon-master-01" {
}

object Zone "master" {
        endpoints = [ "mon-master-01" ]
}

object Endpoint "mon-satellit01" {
        host = "10.10.22.4"
}

object Zone "satellit01" {
        endpoints = ["mon-satellit01"]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Satellite zones.conf

object Endpoint "mon-master-01" {
        host = "10.10.20.4"
        port = "5665"
}

object Zone "master" {
        endpoints = [ "mon-master-01" ]
}

object Endpoint "mon-satellit01" {
}

object Zone "satellit01" {
        endpoints = [ "mon-satellit01" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Agent zones.conf

object Endpoint "mon-satellit01" {
	host = "1.2.3.4" //public satellite IP
	port = "5665"
}

object Zone "satellit01" {
	endpoints = [ "mon-satellit01" ]
}

object Endpoint "dc01" {
}

object Zone "dc01" {
	endpoints = [ "dc01" ]
	parent = "satellit01"
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}

What does Icinga say about the check_source? That’s the system which tries to execute the check. For a disk check, it should be the actual client/agent, not the satellite server.
The respective service object should have the directive command_endpoint == host.name.

If all of these are correct, have a look into the constants.conf of your windoze machine; the PluginDir should point to c:\program files\and\so\on\check_disk.exe
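For reference, on a default Windows installation the relevant constant usually looks roughly like this (a sketch; the exact path depends on your install directory):

```
/* constants.conf on the Windows agent */
const PluginDir = "C:\\Program Files\\ICINGA2\\sbin"
```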

The zones.conf files look okay, although I hope the endpoint names are redacted and the real names match the FQDN of the respective system.

Check your host object inside the Director and look at its preview.
Check whether it contains the objects for the agent endpoint and zone dc01.

Example from my dev environment:

If they are missing, Icinga 2 doesn’t know about the agent and you will have to redeploy it to the agent host, best via the script provided by the Director (it shows up when you configure the Agent tab inside the host config).
A second (more advanced) option could be creating a host template, configuring the Agent tab there to create a self-service API key, and having the host register itself.
See https://icinga.com/docs/director/latest/doc/74-Self-Service-API/ for more info.

I only saw hosts.conf in the preview, so I tried both of your recommended options. I still don’t have the endpoint and zones.conf in the preview, but most of the services are executed on the agent and return useful results. What I want to mention regarding the agent is that I don’t know its IP address, but it can connect to the satellite on a public IP. I had to add this IP manually to the satellite endpoint in the icinga2.conf on the Windows host.

However, I’m getting more and more confused because nothing seems to behave in a consistent way.

The host is DOWN despite using the dummy check, which seems to be executed on schedule. But there is still the plugin output of the failed check_disk. The check source is the satellite. The host is reachable; I guess this refers to the satellite?

Host


The disk service check did work yesterday but doesn’t work anymore. The check source is the agent. After redeploying the service, it stays in “outstanding” like the next one. “Check now” doesn’t seem to have any effect. The host is not reachable; I guess this also refers to the check source.


The Windows uptime check never worked. It has been “outstanding” since I deployed it. There is no check source, but it is reachable??


Finally, the load check. It works as intended (like the memory check “nscp-local-memory”) and returns a correct status every minute. The check source is the agent, but it is not reachable.


All services were configured with the Director in the exact same way, except for the check commands of course. No service has a pre-defined check source, because if I set it to the agent, I can’t deploy the config due to the error message in my first reply. In some cases it is the agent, and then again there is nothing.

Where does the check source come from in two of the four services?
What does the reachable parameter refer to?

Uff, that is some strange behavior…

The “reachable” information is calculated by the monitoring system: whether this host or service can be reached via its parent system. But without configured dependencies between hosts, this is no real indicator. At least that is my understanding. (hope this is correct :smiley:)
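To make “reachable” meaningful, a host dependency would be needed. A hypothetical sketch (this assumes a host object named "mon-satellit01" exists for the satellite, which is not shown in this thread):

```
// mark dc01 as unreachable when the satellite host is DOWN
apply Dependency "reachable-via-satellite" to Host {
    parent_host_name = "mon-satellit01"
    assign where host.name == "dc01"
}
```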

Please implement a check for the dc01 zone and show its output. Let the check run on both the master and the satellite system.
Use the check command cluster-zone from the ITL, add a variable called cluster_zone, and set it to the dc01 zone name.
E.g.

apply Service "icinga-cluster-zone-dc01" {
    check_command = "cluster-zone"
    max_check_attempts = "5"
    check_interval = 1m
    retry_interval = 10s
    enable_notifications = true
    enable_active_checks = true
    enable_passive_checks = false
    enable_event_handler = true
    enable_flapping = true
    enable_perfdata = true
    assign where match("master", host.name)
    vars.cluster_zone = "dc01"

    import DirectorOverrideTemplate
}

I implemented the checks as you recommended, with the following output:

master cluster-zone check:
Zone 'dc01' is not connected. Log lag: less than 1 millisecond

satellite cluster-zone check:
Zone 'dc01' is connected. Log lag: less than 1 millisecond

The check source is the system where the check is actually executed; the “reachable” indicator refers to that system. Your load check looks like it should: the source is dc01. But the disk check is executed by your satellite (if it were a Linux check, it would show the disk space of your satellite server instead of dc01).
Please show us the config preview of the disk check (hit “Modifizieren”/“Modify” on the check’s page and then “Preview” on the right side).
Maybe the same for the correctly configured load check, for reference.

Edit regarding the cluster-zone check: looks good. dc01 should only connect to the satellite, since that is its parent zone.

I added the memory check because it’s also an NSCP check.

Preview disk-check:
check source: dc01 / Reachable / doesn’t work

//zones.d/director-global/service_templates.conf
template Service "windows disk" {
    check_command = "nscp-local-disk"
    max_check_attempts = "5"
    check_interval = 1m
    retry_interval = 30s
    command_endpoint = host_name
}

Preview load-check
check source: dc01 / not Reachable / works

//zones.d/director-global/service_templates.conf
template Service "windows load" {
    check_command = "load-windows"
    max_check_attempts = "5"
    check_interval = 1m
    retry_interval = 30s
    command_endpoint = host_name
}

Preview memory-check
check source: dc01 / not Reachable / works

//zones.d/director-global/service_templates.conf
template Service "windows memory" {
    check_command = "nscp-local-memory"
    max_check_attempts = "5"
    check_interval = 1m
    retry_interval = 30s
    command_endpoint = host_name
}

The templates look okay. What do the actual services look like?

I deployed the services via a service set in combination with the custom variable os:

  /**
     * Service Set: Windows Performance Checks
     * 
     * Services for resource monitoring of Windows machines
     * 
     * assign where host.vars.os == "windows"
     */

    apply Service "windows load" {
        import "windows load"

        assign where host.vars.os == "windows"

        import DirectorOverrideTemplate
    }

    apply Service "windows memory" {
        import "windows memory"

        assign where host.vars.os == "windows"

        import DirectorOverrideTemplate
    }

    apply Service "windows disk" {
        import "windows disk"

        assign where host.vars.os == "windows"

        import DirectorOverrideTemplate
    }

    apply Service "windows uptime" {
        import "windows uptime"

        assign where host.vars.os == "windows"

        import DirectorOverrideTemplate
    }

And the host check, which initially was the reason for this thread :sweat_smile:

## zones.d/satellit01/host_templates.conf

template Host "Windows Agent Satllit01" {
    check_command = "dummy"
    command_endpoint = "mon-satellit01"
    vars.os = "windows"
}

and the host object

## zones.d/satellit01/hosts.conf

object Host "dc01" {
    import "Windows Agent Satllit01"
    display_name = "Testmandant Domain Controller"
}

Please test whether the behavior changes after you remove the command_endpoint from the host template.
Normally you don’t need to set this: as the host is “inside” the satellit01 zone, it will be checked by the satellite by default. If you want to move the check execution to the agent itself, just set the Icinga2 Agent option in the Director to yes in the host/service template.
(screenshot: Icinga2 Agent option in the Director)
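Applied to the host template posted earlier, this boils down to dropping the command_endpoint line (a sketch of the resulting template; the Director generates the equivalent once the setting is removed):

```
template Host "Windows Agent Satllit01" {
    check_command = "dummy"
    // no command_endpoint: the check is executed inside the satellit01 zone
    vars.os = "windows"
}
```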

The services are already configured as you recommended.

Icinga Agent and Zone configuration on the host template up until now

(screenshot: agent and zone configuration on the host template)

I removed the cluster zone. This removes the command_endpoint, as seen in the activity log:

But if I try to deploy, I get the following error:

[2019-12-13 13:39:59 +0100] information/cli: Icinga application loader (version: r2.11.2-1)
[2019-12-13 13:39:59 +0100] information/cli: Loading configuration file(s).
[2019-12-13 13:39:59 +0100] information/ConfigItem: Committing config item(s).
[2019-12-13 13:39:59 +0100] information/ApiListener: My API identity: mon-master-01
[2019-12-13 13:39:59 +0100] critical/config: Error: Validation failed for object 'dc01!windows uptime' of type 'Service'; Attribute 'command_endpoint': Command endpoint must be in zone 'master' or in a direct child zone thereof.
Location: in [stage]/zones.d/director-global/servicesets.conf: 33:1-33:30
[stage]/zones.d/director-global/servicesets.conf(31): }
[stage]/zones.d/director-global/servicesets.conf(32): 
[stage]/zones.d/director-global/servicesets.conf(33): apply Service "windows uptime" {
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[stage]/zones.d/director-global/servicesets.conf(34):     import "windows uptime"
[stage]/zones.d/director-global/servicesets.conf(35): 

[2019-12-13 13:39:59 +0100] critical/config: Error: Validation failed for object 'dc01!windows memory' of type 'Service'; Attribute 'command_endpoint': Command endpoint must be in zone 'master' or in a direct child zone thereof.
Location: in [stage]/zones.d/director-global/servicesets.conf: 17:1-17:30
[stage]/zones.d/director-global/servicesets.conf(15): }
[stage]/zones.d/director-global/servicesets.conf(16): 
[stage]/zones.d/director-global/servicesets.conf(17): apply Service "windows memory" {
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[stage]/zones.d/director-global/servicesets.conf(18):     import "windows memory"
[stage]/zones.d/director-global/servicesets.conf(19): 

[2019-12-13 13:39:59 +0100] critical/config: Error: Validation failed for object 'dc01!windows disk' of type 'Service'; Attribute 'command_endpoint': Command endpoint must be in zone 'master' or in a direct child zone thereof.
Location: in [stage]/zones.d/director-global/servicesets.conf: 25:1-25:28
[stage]/zones.d/director-global/servicesets.conf(23): }
[stage]/zones.d/director-global/servicesets.conf(24): 
[stage]/zones.d/director-global/servicesets.conf(25): apply Service "windows disk" {
                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[stage]/zones.d/director-global/servicesets.conf(26):     import "windows disk"
[stage]/zones.d/director-global/servicesets.conf(27): 

[2019-12-13 13:39:59 +0100] critical/config: Error: Validation failed for object 'dc01!windows load' of type 'Service'; Attribute 'command_endpoint': Command endpoint must be in zone 'master' or in a direct child zone thereof.
Location: in [stage]/zones.d/director-global/servicesets.conf: 9:1-9:28
[stage]/zones.d/director-global/servicesets.conf(7):  */
[stage]/zones.d/director-global/servicesets.conf(8): 
[stage]/zones.d/director-global/servicesets.conf(9): apply Service "windows load" {
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[stage]/zones.d/director-global/servicesets.conf(10):     import "windows load"
[stage]/zones.d/director-global/servicesets.conf(11): 

[2019-12-13 13:39:59 +0100] critical/config: 4 errors
[2019-12-13 13:39:59 +0100] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

Addition: I was able to bypass this error message by changing the cluster zone directly on the host object to the host’s own zone and then changing the cluster zone on the template. This has absolutely no effect at all, but it deploys successfully.
I also don’t really think a host check should be executed on the host itself, because it is mainly used to determine the state of the host. Or am I wrong about this?
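One common pattern for agent hosts is not to run a plugin on the agent at all, but to use the connectivity to its zone as the host check. A sketch (this relies on the ITL cluster-zone command, whose cluster_zone variable defaults to the host name; it is an alternative, not what was configured in this thread):

```
template Host "Windows Agent via cluster-zone" {
    // executed in the parent zone; UP means the agent's zone is connected
    check_command = "cluster-zone"
}
```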

Why is there so much white space in front of the endpoint name?
Also, please show the endpoints and zones inside the Director. There still seems to be a problem with them, or with the Director not knowing them, hence the errors on deployment.

The space is because I redact all server names. None of my object names have a space in them. Sorry for the confusion.

The endpoints and zones are imported via the kickstart wizard. In the preview they are defined like this:

object Endpoint "mon-master-01" {
    host = "10.10.20.4"
    port = "5665"
    log_duration = 1d
}

object Endpoint "mon-satellit01" {
    host = "10.10.22.4"
    port = "5665"
    log_duration = 1d
}

object Endpoint "dc01" {
    port = "5665"
    log_duration = 1d
}

object Zone "master" {
    endpoints = [ "mon-master-01" ]
}

object Zone "satellit01" {
    parent = "master"
    endpoints = [ "mon-satellit01" ]
}

object Zone "dc01" {
    parent = "satellit01"
    endpoints = [ "dc01" ]
}

Also, something new developed over the weekend: the two previously successful checks (load and memory) are now also not working correctly:

(screenshot: load and memory checks shown as delayed)

When I first saw this, the “nächster Check”/next check timer was the same as the “letzter Check”/last check timer. The services are still marked ‘OK’ but shown as delayed in the web interface. At least it is consistent again and nothing works :sweat_smile:

:neutral_face:
I’m out of ideas…
I’m sure we are missing just one single/simple thing, but can’t figure out what exactly :sweat_smile:

Would it be possible for you to install the agent on the Windows host from scratch with the script provided by the Director, when you configure the Agent tab on the host?

@dnsmichi do you have any further ideas where to look for the problem?