Intermittent ‘remote instance is not connected’ error

In my case this error occurs from time to time on different hosts, and after a while (hours) they may recover on their own. It can appear after a host restart or out of nowhere. Some hosts are never affected. Sometimes some service checks on a host work while others report ‘not connected’.
All other operations on the hosts run without problems.

The threads I found on this topic deal with hosts that fundamentally cannot connect, and they usually do not explain under which circumstances this error occurs.

My Icinga 2 instance (r2.14.2), called ‘Monitor’, with Icinga Web 2.12.2 and the Director, runs in an Ubuntu 22.04 LTS VM hosted on Proxmox 8 (Debian 12).
I monitor the VMs, a small compute node group (3 nodes) and a small Proxmox Ceph cluster (5 nodes). All hardware is connected via 2.5 Gbit/s links. All participants can ping each other and be reached over SSH. All hosts are up to date.
All hosts are in the same subnet 10.10.8.xx/24.
All commands, services, hosts, etc. are set up using the Director. All clients were added by installing icinga2 via apt on the client and then running the agent script generated by the Director on the client, which completed without errors.
There are no open certificate requests. Where possible I used the defaults. I did not set up additional zones and left everything at the default or undefined. The zone concept is a bit unclear to me. However, the zones.conf is the same on all clients (with the proper names for the respective clients).

The log on a client that is currently failing says:

[2024-05-24 21:13:16 +0200] information/ApiListener: New client connection for identity ‘monitor’ from [::ffff:]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: Timeout while processing incoming connection from [::ffff:]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: No data received on new API connection from [::ffff:]:41324 for identity ‘monitor’. Ensure that the remote endpoints are properly configured in a cluster setup.

My questions:
When does Icinga decide that an agent is not connected? Is there a timeout?
How can one service check return OK while another, executed at the same time, reports ‘not connected’?
Under what circumstances can a disconnect happen when there has been no configuration change?
Where can I find a more detailed log of what happens during a particular connection?

Can you post your zones.conf files? Without them, it’s hard to find out what’s going on.

The zones concept isn’t too complicated.

  1. The master zone is the root/trunk of the tree.
  2. Every zone can only contain one or two nodes (endpoints). More than one implies HA features.
  3. I know of no limit on how many satellite zones can exist below a master or satellite zone. Sub-satellites are possible, AFAIK.
  4. Icinga 2 services running under Windows can only be endpoints as a single node in their own zone, never in a master or satellite zone. One could call those agent zones.
  5. Every zone needs to have at least one endpoint (node), and a parent if it isn’t the master zone.
  6. Every endpoint needs to be in exactly one zone. This can be a master, satellite or agent zone.
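
To make these rules concrete, here is a rough sketch of what a zones.conf on a single agent could look like for a simple master-plus-agents setup. It is not taken from your installation: the endpoint name "monitor" is guessed from the identity in your log, while the agent name "client1.example" and the zone name "master" are hypothetical placeholders. Your real file may differ, which is why seeing it would help.

```
// Sketch only: all object names below are assumptions, not your actual config.

// The master endpoint. Without a "host" attribute, this agent will not
// actively connect to it and instead waits for the master to connect.
object Endpoint "monitor" {
}

// The master zone: the root of the tree, so it has no parent (rules 1 and 5).
object Zone "master" {
  endpoints = [ "monitor" ]
}

// The agent itself (hypothetical name).
object Endpoint "client1.example" {
}

// The agent zone: exactly one endpoint, with the master zone as parent
// (rules 5 and 6).
object Zone "client1.example" {
  endpoints = [ "client1.example" ]
  parent = "master"
}

// Global zones, which carry shared config and have no endpoints of their own.
object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}
```

When comparing this with your own files, the things to check are that the endpoint and zone names match exactly (including case) on both the master and the agent, and that each agent zone names the correct parent.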
