Intermittent remote instance is not connected error

FiveAces · May 24, 2024, 7:38pm

In my case this error occurs from time to time on different hosts and after a while (hours) they may recover. It may occur after the restart of a host or out of nowhere. Some hosts are never affected. Sometimes some service checks work while others report ‘not connected’.
All other operations on the hosts run without problems.

The threads I found on this topic deal with hosts that fundamentally cannot connect at all, and they usually do not explain under which circumstances this error occurs.

My Icinga 2 r2.14.2 instance, called ‘Monitor’, with Icinga Web 2.12.2 and the Director, runs in an Ubuntu 22.04 LTS VM hosted on Proxmox 8 (Debian 12).
I monitor the VMs, a small compute node group (3 nodes) and a small Proxmox Ceph cluster (5 nodes). All hardware is connected by 2.5 Gbit/s links. All participants can ping and ssh. All hosts are updated.
All hosts are in the same subnet 10.10.8.xx/24.
All commands, services, hosts, etc. are set up using the Director. All clients were added by installing icinga2 via apt on the client and then running the agent script generated by the Director on the client, which completed without failure.
There are no cert requests open. Where possible I used the defaults. I did not set up additional zones and left everything at the defaults or undefined. The zone concept is a bit unclear to me. However, the zones.conf is the same on all clients (with the proper names for the respective clients).

The log at a client that currently fails says:

[2024-05-24 21:13:16 +0200] information/ApiListener: New client connection for identity 'monitor' from [::ffff:10.10.8.20]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: Timeout while processing incoming connection from [::ffff:10.10.8.20]:41324
[2024-05-24 21:13:31 +0200] warning/ApiListener: No data received on new API connection from [::ffff:10.10.8.20]:41324 for identity 'monitor'. Ensure that the remote endpoints are properly configured in a cluster setup.
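To see how long such a connection stays open before the timeout is logged, the lines can be paired up by peer address and port. The following is only a rough sketch in Python: the regular expressions are guesses based on the three lines above, not an official log format, and the function names are made up for illustration. For this excerpt it yields 15 seconds, which is just the observed gap and not a confirmed Icinga setting.

```python
import re
from datetime import datetime

# Assumed line shape, derived from the excerpt: [timestamp] severity/Component: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<severity>\w+)/(?P<component>\w+):\s+(?P<message>.*)$"
)
# Matches the peer in "... from [::ffff:10.10.8.20]:41324"
PEER = re.compile(r"from \[(?P<addr>[^\]]+)\]:(?P<port>\d+)")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["ts"], "%Y-%m-%d %H:%M:%S %z")
    peer = PEER.search(entry["message"])
    entry["peer"] = (peer["addr"], int(peer["port"])) if peer else None
    return entry


def seconds_until_timeout(lines):
    """Map each peer to the seconds between 'New client connection' and 'Timeout'."""
    opened, result = {}, {}
    for line in lines:
        entry = parse_line(line)
        if entry is None or entry["peer"] is None:
            continue
        if entry["message"].startswith("New client connection"):
            opened[entry["peer"]] = entry["time"]
        elif entry["message"].startswith("Timeout while processing") and entry["peer"] in opened:
            result[entry["peer"]] = (entry["time"] - opened[entry["peer"]]).total_seconds()
    return result


SAMPLE = [
    "[2024-05-24 21:13:16 +0200] information/ApiListener: New client connection "
    "for identity 'monitor' from [::ffff:10.10.8.20]:41324",
    "[2024-05-24 21:13:31 +0200] warning/ApiListener: Timeout while processing "
    "incoming connection from [::ffff:10.10.8.20]:41324",
]

if __name__ == "__main__":
    print(seconds_until_timeout(SAMPLE))
```

Run over a longer log, this would show whether every failed attempt ends after the same interval or whether the gap varies.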

My questions:
When does Icinga decide that an agent is not connected? Is there a timeout?
How can one service check return OK while another, executed at the same time, reports ‘not connected’?
Under what circumstances can a disconnect happen while there is no configuration change?
Where can I find a more detailed log of what happens during a particular connection?

rivad · May 27, 2024, 8:20am

Can you post your zones.conf files? Without them, it’s hard to find out what’s going on.

The zones concept isn’t too complicated.

the master zone is the root/trunk of the tree
every zone can only contain one or two nodes (endpoints). More than one implies HA features.
I know of no limit to how many satellite zones can exist per master or satellite zone. Sub-satellites are possible AFAIK.
icinga2 services running under Windows can only be endpoints as a single node in their own zone, never in a master or satellite zone. One could call those agent zones.
every zone needs to have at least one endpoint (node), and a parent if it isn’t the master zone
every endpoint needs to be in exactly one zone. This can be a master, satellite or agent zone.
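The rules above can be turned into a small consistency check. This is a sketch in Python, not an Icinga tool: the dictionary layout, the function name `check_zones`, and the host names `master1` and `agent1` are invented for illustration. It also assumes that global zones (those with `global = true`) hold no endpoints and can be skipped.

```python
def check_zones(zones, endpoints, master="master"):
    """Return a list of rule violations for a zone layout.

    zones:     dict of zone name -> {"parent": str, "endpoints": list, "global": bool}
    endpoints: iterable of all declared endpoint names
    """
    problems = []
    membership = {name: [] for name in endpoints}

    for zone_name, zone in zones.items():
        if zone.get("global"):
            continue  # assumption: global zones only distribute config, no endpoints

        members = zone.get("endpoints", [])
        # Rule: one or two endpoints per zone (two implies HA)
        if not 1 <= len(members) <= 2:
            problems.append(
                f"zone {zone_name}: expected 1 or 2 endpoints, found {len(members)}"
            )

        # Rule: every zone except the master zone needs a parent
        parent = zone.get("parent")
        if zone_name != master and not parent:
            problems.append(f"zone {zone_name}: no parent set")
        elif parent and parent not in zones:
            problems.append(f"zone {zone_name}: parent {parent} is not defined")

        for endpoint in members:
            membership.setdefault(endpoint, []).append(zone_name)

    # Rule: every endpoint belongs to exactly one zone
    for endpoint, zone_names in membership.items():
        if len(zone_names) != 1:
            problems.append(
                f"endpoint {endpoint}: member of {len(zone_names)} zones, expected exactly 1"
            )
    return problems


# Hypothetical layout: one master and one agent in its own zone
GOOD = {
    "master": {"endpoints": ["master1"]},
    "agent1": {"parent": "master", "endpoints": ["agent1"]},
    "director-global": {"global": True},
}

if __name__ == "__main__":
    print(check_zones(GOOD, ["master1", "agent1"]))  # no violations expected
```

A layout that puts the same endpoint into two zones, or leaves out the parent of an agent zone, would come back with one message per broken rule.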

https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#roles-master-satellites-and-agents
https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#zones
https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#endpoints

FiveAces · June 26, 2024, 12:27pm

Sorry for the late reply but here it is to my best knowledge:
I do not know where the real zones.conf resides.
This is the one from /etc/icinga2/zones.conf:

/** Icinga 2 Config - proposed by Icinga Director */

object Endpoint "Ceph3" {}

object Zone "Ceph3" {
  parent = "master"
  endpoints = [ "Ceph3" ]
}

object Zone "master" {
  endpoints = [ "monitor" ]
}
object Endpoint "monitor" {
  //host = "monitor"
}
object Zone "director-global" {
  global = true
}

It looks the same on the other four working nodes.
This is what the zones.conf of the icinga server “monitor” looks like:

/*
 * Generated by Icinga 2 node setup commands
 * on 2024-05-03 17:11:26 +0000
 */

object Endpoint "monitor" {
}

object Zone "master" {
  endpoints = [ "monitor" ]
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}
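To compare the two files at a glance, the declared Endpoint and Zone objects can be pulled out and diffed. This is a minimal sketch in Python with a simple regular expression, not a real parser for the Icinga 2 configuration language: it would also pick up declarations inside comments, and the sample strings are condensed from the two files posted here. A difference by itself does not prove a misconfiguration, since objects may also be defined outside zones.conf.

```python
import re

# Accepts straight or typographic double quotes around the object name
OBJECT_DECL = re.compile(r'object\s+(Endpoint|Zone)\s+["“”]([^"“”]+)["“”]')


def declared_objects(config_text):
    """Return the set of (type, name) pairs declared in the given config text."""
    return set(OBJECT_DECL.findall(config_text))


# Condensed from the agent's zones.conf posted above
AGENT_CONF = '''
object Endpoint "Ceph3" {}
object Zone "Ceph3" { parent = "master" endpoints = [ "Ceph3" ] }
object Zone "master" { endpoints = [ "monitor" ] }
object Endpoint "monitor" { }
object Zone "director-global" { global = true }
'''

# Condensed from the zones.conf of the server "monitor"
MASTER_CONF = '''
object Endpoint "monitor" { }
object Zone "master" { endpoints = [ "monitor" ] }
object Zone "global-templates" { global = true }
object Zone "director-global" { global = true }
'''

if __name__ == "__main__":
    agent = declared_objects(AGENT_CONF)
    master = declared_objects(MASTER_CONF)
    print("only in agent file: ", sorted(agent - master))
    print("only in master file:", sorted(master - agent))
```

For the two files in this post, the agent file additionally declares the Endpoint and Zone named Ceph3, and the master file additionally declares the global zone global-templates.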

I know that the Director works with directories under /var/lib/icinga2/, but I cannot find any zones.conf on the nodes under /var/lib/icinga2.

Meanwhile, after some googling, I suspect it could have been a cert problem all along, because no connection is ever established successfully. I redid the Director agent setup using the script, but to no avail.