In my case this error occurs from time to time on different hosts and after a while (hours) they may recover. It may occur after the restart of a host or out of nowhere. Some hosts are never affected. Sometimes some service checks work while others report ‘not connected’.
All other operations on the hosts run without problems.
The threads I found on this topic mostly deal with setups that cannot connect at all, and they usually do not explain under which circumstances this error occurs.
My Icinga 2 instance (r2.14.2), called ‘Monitor’, runs with Icinga Web 2.12.2 and the Director in an Ubuntu 22.04 LTS VM hosted on Proxmox 8 (Debian 12).
I monitor the VMs, a small compute node group (3 nodes) and a small Proxmox Ceph cluster (5 nodes). All hardware is connected via 2.5 Gbit/s links. All participants can ping each other and connect via ssh. All hosts are up to date.
All hosts are in the same subnet 10.10.8.xx/24.
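Ping and ssh only show basic reachability, so a check against the Icinga 2 API port itself may be a useful addition. The following is a minimal sketch, assuming the agents listen on the default port 5665; the function name `port_open` is made up for illustration. It only tells whether a TCP connection can be opened, not whether the TLS handshake or the cluster protocol succeeds.

```python
import socket

ICINGA_API_PORT = 5665  # Icinga 2 default API/cluster port (assumed unchanged here)


def port_open(host, port=ICINGA_API_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

Run from the Monitor VM, a call such as `port_open("10.10.8.31")` (a hypothetical agent address in the subnet above) would show whether the agent's listener is reachable at that moment.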
All commands, services, hosts, etc. are set up using the Director. All clients were added by installing icinga2 via apt on the client and then running the agent script generated by the Director on the client, which completed without errors.
There are no open certificate requests. Where possible I used the defaults. I did not set up additional zones and left everything at the standard values or undefined. The zone concept is a bit unclear to me. However, the zones.conf is the same on all clients (with the proper names for the respective clients).
The log at a client that currently fails says:
[2024-05-24 21:13:16 +0200] information/ApiListener: New client connection for identity ‘monitor’ from [::ffff:10.10.8.20]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: Timeout while processing incoming connection from [::ffff:10.10.8.20]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: No data received on new API connection from [::ffff:10.10.8.20]:41324 for identity ‘monitor’. Ensure that the remote endpoints are properly configured in a cluster setup.
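In the excerpt above, the timeout warning follows the new connection by 15 seconds. To see whether that gap is the same on every affected client, the log can be scanned with a small script. This is only a sketch under the assumption that all relevant lines have the format shown above; the names `parse_line` and `connection_timeouts` are made up for illustration and are not part of Icinga.

```python
import re
from datetime import datetime

# Matches lines such as:
# [2024-05-24 21:13:16 +0200] information/ApiListener: New client connection ...
LOG_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\] "
    r"(?P<severity>\w+)/(?P<component>\w+): (?P<message>.*)$"
)
# Extracts the peer address and port, e.g. "[::ffff:10.10.8.20]:41324".
PEER = re.compile(r"from (\[[^\]]+\]:\d+)")

SAMPLE = """\
[2024-05-24 21:13:16 +0200] information/ApiListener: New client connection for identity ‘monitor’ from [::ffff:10.10.8.20]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: Timeout while processing incoming connection from [::ffff:10.10.8.20]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: No data received on new API connection from [::ffff:10.10.8.20]:41324 for identity ‘monitor’. Ensure that the remote endpoints are properly configured in a cluster setup.
"""


def parse_line(line):
    """Split one log line into (timestamp, severity, component, message)."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S %z")
    return ts, match.group("severity"), match.group("component"), match.group("message")


def connection_timeouts(lines):
    """Return (peer, seconds) for each connection that was opened and later timed out."""
    opened = {}
    results = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _severity, component, message = parsed
        if component != "ApiListener":
            continue
        peer = PEER.search(message)
        if peer is None:
            continue
        key = peer.group(1)
        if message.startswith("New client connection"):
            opened[key] = ts
        elif message.startswith("Timeout while processing") and key in opened:
            results.append((key, (ts - opened.pop(key)).total_seconds()))
    return results


for peer, seconds in connection_timeouts(SAMPLE.splitlines()):
    print(f"{peer}: timed out after {seconds:.0f} s")
# prints "[::ffff:10.10.8.20]:41324: timed out after 15 s"
```

To run it against a real log, the lines of the client's log file can be passed to `connection_timeouts` instead of `SAMPLE`.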
My questions:
When does Icinga decide that an agent is not connected? Is there a timeout?
How can one service check return OK while another reports ‘not connected’ when both are executed at the same time?
Under what circumstances can a disconnect happen without any configuration change?
Where can I find a more detailed log of what happens during a particular connection?