Monitoring won't work if an agent is only connected to one parent

I am currently hesitant to deploy the Icinga agent because I have observed the following problem. A few agents were rolled out on a test basis and intentionally misconfigured: for example, only one parent (satellite) was configured where there are two. The agent successfully connects to its configured parent; however, I get the following error in Icinga:

Remote Icinga instance 'Child' is not connected to 'Sat1'.

The client, on the other hand, has successfully connected to Sat2:

[2021-09-07 08:48:42 +0200] information/ApiListener: Reconnecting to endpoint 'Sat2' via host 'Sat2' and port '5665'
[2021-09-07 08:48:42 +0200] information/ConfigItem: Activated all objects.
[2021-09-07 08:48:42 +0200] information/ApiListener: New client connection for identity 'Sat2' to [1.2.3.4]:5665
[2021-09-07 08:48:42 +0200] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'Sat2'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Sending config updates for endpoint 'Sat2' in zone 'ZoneA'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Finished sending config file updates for endpoint 'Sat2' in zone 'ZoneA'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Syncing runtime objects to endpoint 'Sat2'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Finished syncing runtime objects to endpoint 'Sat2'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Finished sending runtime config updates for endpoint 'Sat2' in zone 'ZoneA'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Sending replay log for endpoint 'Sat2' in zone 'ZoneA'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Finished sending replay log for endpoint 'Sat2' in zone 'ZoneA'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Finished syncing endpoint 'Sat2' in zone 'ZoneA'.
[2021-09-07 08:48:42 +0200] information/ApiListener: Finished reconnecting to endpoint 'Sat2' via host 'Sat2' and port '5665'
[2021-09-07 08:48:42 +0200] information/ApiListener: Applying config update from endpoint 'Sat2' of zone 'ZoneA'.

I would expect Icinga to route the check results to the master via Sat2 in this case, which unfortunately does not happen. I realize that normally both parents should be reachable, but that is something I cannot guarantee. I built the redundancy precisely so that monitoring keeps working when the connection to a single satellite is disrupted.
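For context, this is roughly what the agent's zones.conf looks like in the intended (fully redundant) setup, using the names from this thread. This is a sketch, not my actual config; in the misconfigured test case, the "Sat1" endpoint is simply missing and the parent zone only lists "Sat2":

```
// zones.conf on the agent (names taken from this thread, paths/ports are defaults)
object Endpoint "Sat1" {
  host = "Sat1"
  port = "5665"
}

object Endpoint "Sat2" {
  host = "Sat2"
  port = "5665"
}

// Parent zone containing both satellites
object Zone "ZoneA" {
  endpoints = [ "Sat1", "Sat2" ]
}

// The agent itself, in its own zone below ZoneA
object Endpoint "Child" { }

object Zone "Child" {
  endpoints = [ "Child" ]
  parent = "ZoneA"
}
```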

Give as much information as you can, e.g.

  • Version used (icinga2 --version)
    2.13.1

  • Operating System and version
CentOS 7 (master), CentOS 8 (satellite), Windows 2016 (agent)

  • Enabled features (icinga2 feature list):
    Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 opentsdb perfdata statusdata syslog
    Enabled features: api checker ido-mysql influxdb livestatus mainlog notification

  • Icinga Web 2 version and modules (System - About)
    2.9.3

  • Config validation (icinga2 daemon -C)

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes

Are both satellites online in this scenario?

The checks will only be migrated to one satellite in a zone if it is the last one remaining, i.e. the second satellite is down.

Normally Icinga calculates (with some formula) how checks are distributed between the nodes in a zone. Once this calculation is done, each check will always be executed by its assigned node unless that node is offline, in which case check execution migrates to the remaining node.

Yes, both satellites are online and connected to the master. Only the connection from the agent to one of the two satellites is affected. I know that Icinga splits the load across multiple satellites, but in this case there is only one instance actually executing the checks: the agent.

In this scenario, I would expect the check results to be routed through Sat2 and the Icinga cluster check to notify me that the connection could only be made through one satellite.

Ah, ok.
I thought this message was displayed on some of the checks.

Where is it displayed in your case?

Tbh, I would expect the same, but this is just guessing; I have never tried this scenario myself.

That’s the check result of a check that is running on the agent:

Is this the only check that shows this message? Or are there others as well (like about half of the checks)?
Is the check source for the check definitely the agent?

What result does running a check using the “cluster-zone” check command yield when run on the agent to check the connection to the parent zone?
I would either expect OK

or some form of connected to one parent, not connected to the other.
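For reference, such a check could look something like this; a hypothetical service definition using the built-in "cluster-zone" CheckCommand, with the zone name taken from this thread:

```
// Hypothetical service: let the agent verify its connection to the parent zone.
// "cluster-zone" is a built-in CheckCommand; vars.cluster_zone names the zone to test.
apply Service "parent-zone-connection" {
  check_command = "cluster-zone"
  vars.cluster_zone = "ZoneA"   // the satellite zone the agent should be connected to

  assign where host.vars.agent == true   // placeholder assign rule
}
```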

On my test host I only had one check that uses the agent. I cloned that service, but all the clones are stuck at pending:


The same goes for all other agent-based checks I add, e.g. cluster-zone.

Only checks that are configured to run on the agent are affected; all other active checks on this host are evenly spread across the two satellites.

Maybe enable and check the debug log on the agent?
Other than that I’m out of ideas, sorry.
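For completeness: the debug log is enabled with `icinga2 feature enable debuglog` followed by a service restart. Under the hood that feature just activates a FileLogger object along these lines (path may differ by version and platform, this is a sketch):

```
// features-available/debuglog.conf (sketch; exact default path varies by version)
object FileLogger "debug-file" {
  severity = "debug"
  path = LogDir + "/debug.log"
}
```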