Agents in remote locations "laggy"

Hi Community,

I’m not 100% sure how to describe the problem, hence the “laggy” in the title.
In a customer setup with a single master (v2.10.4 atm), there is a problem with hosts located in China that are checked by agents.
The agent checks basically work, but they often report “not connected to … instance” after a few checks, or even switch states on every check. (Hope this is understandable :S)

Is there some kind of timeout at work, where the parent instance is waiting for the response of the agent and is not getting any? Is this configurable?

I will try collecting more data from the icinga check running on the agent and post it here.

Best regards :slight_smile:

Connections are closed when there has been no traffic for a while. Depending on the connection direction, either the master or the agent will try to reconnect after a static interval. AFAIK that’s 30s with 2.10.x and 10s with 2.11.x. It may be that the connection setup and TLS handshake stall and take too long. Whenever a command endpoint check happens during this period, it will fail with the “not connected” message.

I would investigate the connection itself: whether there is packet loss, or whether TLS handshakes fail (check the logs). Since you’ve said China, I could imagine that something else intercepts the traffic as well.
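
To get a rough idea of whether connection setup or the TLS handshake is the slow part, something like the following Python sketch could be run from the master against an agent. It is only an illustration, not an Icinga tool: `probe` and `summarize` are made-up helper names, the host name in the usage comment is a placeholder, and 5665 is assumed because it is Icinga 2's default cluster port. Certificate verification is switched off because only timing is of interest here.

```python
import socket
import ssl
import time
from typing import Dict, List, Optional, Tuple

# (tcp_seconds, tls_seconds, error) for one connection attempt
Result = Tuple[Optional[float], Optional[float], Optional[str]]


def probe(host: str, port: int = 5665, timeout: float = 10.0) -> Result:
    """Time the TCP connect and the TLS handshake separately."""
    # Verification is disabled on purpose: this only measures timing.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    tcp_s = None
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            tcp_s = time.monotonic() - start
            tls_start = time.monotonic()
            # wrap_socket performs the handshake on an already connected socket
            with ctx.wrap_socket(raw, server_hostname=host):
                return tcp_s, time.monotonic() - tls_start, None
    except OSError as exc:  # covers timeouts, resets and ssl.SSLError
        return tcp_s, None, f"{type(exc).__name__}: {exc}"


def summarize(results: List[Result]) -> Dict[str, object]:
    """Count failed attempts and report the slowest successful one."""
    ok = [r for r in results if r[2] is None]
    slowest = max((r[0] + r[1] for r in ok), default=None)
    return {
        "attempts": len(results),
        "failed": len(results) - len(ok),
        "slowest_total_s": slowest,
    }


# Usage (placeholder host name):
#   results = [probe("agent.example.com") for _ in range(20)]
#   print(summarize(results))
```

Frequent failures, or handshake times close to the reconnect interval, would point towards the network path rather than the Icinga configuration.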

Cheers,
Michael

Problem solved.
The customer usually clones their VMs to create new ones.
This time they forgot to change the Icinga config afterwards, so more than one machine was connecting with the same configuration…

After fixing the config all is working well.
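
For anyone running into the same thing, here is a small Python sketch for spotting cloned machines that still share an identity. It assumes the duplicated setting is the `NodeName` constant in `constants.conf` (the thread does not say exactly which setting was duplicated), and that the file contents have already been collected from each machine; the function names and the machine labels are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional

# Matches an active line such as: const NodeName = "agent01.example.com"
# Lines commented out with // or # do not match.
NODE_NAME_RE = re.compile(r'^\s*const\s+NodeName\s*=\s*"([^"]*)"', re.MULTILINE)


def node_name(constants_conf: str) -> Optional[str]:
    """Extract the NodeName constant from the text of a constants.conf."""
    match = NODE_NAME_RE.search(constants_conf)
    return match.group(1) if match else None


def duplicate_node_names(configs: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each NodeName used by more than one machine to those machines."""
    by_name = defaultdict(list)
    for machine, text in configs.items():
        name = node_name(text)
        if name is not None:
            by_name[name].append(machine)
    return {n: sorted(ms) for n, ms in by_name.items() if len(ms) > 1}


# Usage with made-up data: vm-b is a clone of vm-a that was never reconfigured.
#   configs = {
#       "vm-a": 'const NodeName = "agent01.example.com"\n',
#       "vm-b": 'const NodeName = "agent01.example.com"\n',
#       "vm-c": 'const NodeName = "agent02.example.com"\n',
#   }
#   print(duplicate_node_names(configs))
```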
