Host check is running on the host being checked

Hi,

We have Icinga2 version 2.6.2 deployed to monitor our infrastructure.

We noticed the other day that for some hosts the “check source” of the host check is the machine itself, meaning those machines always appear to be up even when they’re down, as no check results get back to the master.

We use a cluster of two nodes for high availability in a “datacenter” zone, which is our master zone (I don’t know why it isn’t simply called “master”, unfortunately; that decision was before my time :-().

We have several other machines in the datacentre, each in its own zone as a child of the datacenter zone, located alongside the machines hosting icinga2.

We have several remote sites. Each site has a pair of machines: a “master” machine in its own zone, acting as a satellite of the datacenter zone, alongside a “slave” machine which again is in its own zone, as a client of the zone of its corresponding “master” machine.

The two main icinga servers correctly check one another. Likewise they share the task of checking the other data center servers between themselves.

The satellite nodes from the datacenter correctly check their notional subordinate node.

However, the satellite node for some reason also checks itself, rather than being checked by one of the icinga nodes in the datacenter zone as I would have expected.

I’ve checked that the zone hierarchy is defined correctly and that there’s definitely a network route from the icinga masters in the datacentre zone to the satellite node, but other than that I’m a bit lost.

Graphically, the arrangement is as follows (zone hierarchy diagram not reproduced here).

An extract of our zones.conf file follows, with IP addresses and FQDNs redacted where required.

# GLOBAL ZONES
object Zone "global-templates"  { global = true }
object Zone "global-commands"   { global = true }

################################################################################################################

# Icinga master zone endpoints
object Endpoint "management1.<REDACTED>" { host = "x.x.x.x" }
object Endpoint "management2.<REDACTED>" { host = "x.x.x.x" }

# Icinga master zone
object Zone "datacenter" { endpoints = [
    "management1.<REDACTED>",
    "management2.<REDACTED>",
] }

################################################################################################################

# Datacentre zone endpoints
object Endpoint "dbsrv1.<REDACTED>" { host = "x.x.x.x" }
object Endpoint "dbsvr2.<REDACTED>" { host = "x.x.x.x" }
object Endpoint "dbsvr3.<REDACTED>" { host = "x.x.x.x" }
object Endpoint "dbsvr4.<REDACTED>" { host = "x.x.x.x" }

# Datacentre zones
object Zone "dbsrv1.<REDACTED>" { endpoints = [ "dbsrv1.<REDACTED>" ] ; parent = "datacenter" }
object Zone "dbsvr2.<REDACTED>" { endpoints = [ "dbsvr2.<REDACTED>" ] ; parent = "datacenter" }
object Zone "dbsvr3.<REDACTED>" { endpoints = [ "dbsvr3.<REDACTED>" ] ; parent = "datacenter" }
object Zone "dbsvr4.<REDACTED>" { endpoints = [ "dbsvr4.<REDACTED>" ] ; parent = "datacenter" }

################################################################################################################

# Remote satellite zone endpoints
object Endpoint "appsvr_master.sitea.<REDACTED>" { host = "x.x.x.x" }

# Remote satellite zones
object Zone "appsvr_master.sitea.<REDACTED>" { endpoints = [ "appsvr_master.sitea.<REDACTED>" ] ; parent = "datacenter" }

################################################################################################################

# Remote client zone endpoints
object Endpoint "appsvr_slave.sitea<REDACTED>" {  }

# Remote client zones
object Zone "appsvr_slave.sitea.<REDACTED>" { endpoints = [ "appsvr_slave.sitea.<REDACTED>" ] ; parent = "appsvr_master.sitea.<REDACTED>" }

Could anyone help me shed some light on what might be the problem here?

Thanks,

Phil

For satellites I use the cluster-zone check to see if they are up, so they don’t check themselves.
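
Something like this as a rough sketch (the service name and the host.vars.satellite custom variable are just examples, use whatever flag fits your setup):

apply Service "cluster-health" {
    check_command = "cluster-zone"

    # cluster_zone defaults to the host's name; set it explicitly for clarity
    # (works here because the host names match the zone names)
    vars.cluster_zone = host.name

    # example assign rule, adjust to your own host custom variables
    assign where host.vars.satellite == true
}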

Hey Carsten,

That’s what we’ve been considering as a workaround.

However, we quite liked the idea of relying on a simple ping from another machine to tell whether the machine is up or not, rather than relying on messages from the client, which may have crashed while the machine itself is still up.

I’ve been trying to understand why the checks are performed as I would expect on the data centre machines but not on the others, and whether it’s an unavoidable side effect of having the satellite zone or something that can be corrected.

Thanks,

Phil

Hello Phil,

You can configure the check to run on the master by setting the master as command_endpoint if the host is a satellite :grinning:

Regards,
Carsten

Hey Carsten,

I gave that a try early on, setting the command_endpoint on the host object of the satellite node.
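
Roughly what I had, simplified (hostnames redacted as elsewhere in this thread):

object Host "appsvr_master.sitea.<REDACTED>" {
    check_command = "hostalive"
    address = "x.x.x.x"

    # attempt to have one of the datacenter masters perform the ping
    command_endpoint = "management1.<REDACTED>"
}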

When validating the config file I receive the following error message:

[2019-05-31 13:13:47 +0100] critical/config: Error: Validation failed for object ‘appsvr_master.sitea.’ of type ‘Host’; Attribute ‘command_endpoint’: Command endpoint must be in zone ‘appsvr_master.sitea.’ or in a direct child zone thereof.

That seems to indicate that you can’t request that a server in a zone higher up in the hierarchy perform the check for you?

Or did I place that config option in the wrong location?

Appreciate the help!

Thanks,

Phil

Ping is just another layer; it never tells you whether a service or application is actually running. As @anon66228339 said, either use the cluster-zone check, or at least ensure that tcp/5665 is up and running, or use both methods. Running command_endpoint bottom-up is not supported, purely as a matter of trust: only top-down works in this regard.
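
For the tcp/5665 variant, a minimal sketch (the assign condition is only an example, adapt it to your environment):

apply Service "icinga2-api" {
    check_command = "tcp"
    vars.tcp_port = 5665

    # example assign rule for remote satellite hosts
    assign where host.vars.remote == true
}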

Cheers,
Michael

Hey Michael,

Thanks for the reply.

We are indeed using the cluster zone check, which we’ve implemented as a service.

We were hoping to use the host ping/hostalive check alongside it as a basic sanity check (we’ve had icinga crash on a client before, so knowing whether the host is up via another mechanism is useful).

I understand why specifying the command_endpoint on the satellite node doesn’t work based on the security model. It should have been obvious before I saw the warning :stuck_out_tongue:.

I’ve been trying to understand why I’m seeing the behaviour I am (satellite pinging itself) and how I might get the icinga masters to perform the ping instead.

The datacentre machines, for example, which are in child zones of the datacentre zone (in which the masters live), are pinged by the masters as expected (with the checks being shared between the pair of masters).

The satellites are similarly located in their own zones, which are children of the datacentre zone, however they ping themselves.

What’s the logic behind where the host ping/hostalive check is performed and how might I get the master to perform this check?

Really appreciate the help,

Thanks,

Phil

Hi,

if you want the master to ping the satellite, you’ll need to create a Host object for that satellite which is located in the master zone, e.g. in /etc/icinga2/zones.d/master. If you want to keep this object inside the satellite zone directory, a workaround is required: set the zone attribute to `master` while moving the service apply rules with the zone attribute to `satellite`, likely resulting in multiple apply rules for each specific zone.
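
In your setup the master zone is called “datacenter”, so a minimal sketch would be a file such as /etc/icinga2/zones.d/datacenter/sitea.conf on the config master containing roughly:

object Host "appsvr_master.sitea.<REDACTED>" {
    check_command = "hostalive"
    address = "x.x.x.x"
}

Since this object then belongs to the datacenter zone, the two masters schedule the hostalive check themselves.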

Cheers,
Michael

Hi Michael,

Thanks for that information!

Checks are now being performed exactly as I expected :slight_smile:.

Appreciate the help,

Regards,

Phil

Hi,

can you share the configuration bits from your solution for others please too? :slight_smile:

Cheers,
Michael