Windows Agent (2.10.5) intermittently disconnects from Master

Hello All,
I hope you all are well. Some of my Windows agents intermittently disconnect from the Icinga Master server. The agent always reconnects, but what is causing the disconnect in the first place? Notifications are sent out when this happens. Below is the icinga2.log file. You can see the “TLS stream was disconnected” and “TLS handshake failed” messages. After some time the agent syncs back up and starts working again, which triggers another notification that the server owners are asking about. Has anyone experienced this?

I have an HA Icinga2 environment. Both Master servers are running version 2.10.5. Most Windows clients (600+) are running version 2.10.5.

[2020-05-28 23:21:57 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:22:47 -0700] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2020-05-28 23:23:17 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:23:50 -0700] warning/TlsStream: TLS stream was disconnected.
[2020-05-28 23:23:50 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master2'
[2020-05-28 23:23:50 -0700] warning/ApiListener: Removing API client for endpoint 'Master2'. 0 API clients left.
[2020-05-28 23:24:24 -0700] information/WorkQueue: #9 (JsonRpcConnection, #0) items: 0, rate: 0.3/s (18/min 116/5min 364/15min);
[2020-05-28 23:24:47 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:25:07 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:25:17 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (12/min 60/5min 180/15min);
[2020-05-28 23:26:12 -0700] information/ApiListener: New client connection for identity 'Master2' from [::ffff:10.157.1.2]:48788
[2020-05-28 23:26:31 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2020-05-28 23:26:35 -0700] information/ConfigObject: Dumping program state to file 'C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state'
[2020-05-28 23:28:02 -0700] warning/TlsStream: TLS stream was disconnected.
[2020-05-28 23:28:02 -0700] warning/TlsStream: TLS stream was disconnected.
[2020-05-28 23:28:02 -0700] critical/ApiListener: Client TLS handshake failed (from [::ffff:10.157.1.2]:49938): Error: Socket was closed during TLS handshake.


Context:
	(0) Handling new API client connection

[2020-05-28 23:28:02 -0700] critical/ApiListener: Client TLS handshake failed (from [::ffff:10.157.1.2]:49958): Error: Socket was closed during TLS handshake.


Context:
	(0) Handling new API client connection

[2020-05-28 23:27:44 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master1'
[2020-05-28 23:28:02 -0700] information/WorkQueue: #9 (JsonRpcConnection, #0) items: 5, rate: 0.266667/s (16/min 85/5min 332/15min); empty in 1 minute and 27 seconds
[2020-05-28 23:28:02 -0700] information/ApiListener: New client connection for identity 'Master1' from [::ffff:10.156.1.2]:48758
[2020-05-28 23:28:02 -0700] warning/ApiListener: Removing API client for endpoint 'Master1'. 0 API clients left.
[2020-05-28 23:28:03 -0700] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'Master2'.
[2020-05-28 23:28:03 -0700] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'Master1'.
[2020-05-28 23:28:03 -0700] warning/ApiListener: Ignoring config update. 'api' does not accept config.
[2020-05-28 23:28:03 -0700] information/ApiListener: Sending config updates for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished sending config file updates for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Syncing runtime objects to endpoint 'Master2'.
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished syncing runtime objects to endpoint 'Master2'.
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished sending runtime config updates for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Sending replay log for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Sending config updates for endpoint 'Master1' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished sending config file updates for endpoint 'Master1' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Syncing runtime objects to endpoint 'Master1'.
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished syncing runtime objects to endpoint 'Master1'.
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished sending runtime config updates for endpoint 'Master1' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Sending replay log for endpoint 'Master1' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished sending replay log for endpoint 'Master1' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished syncing endpoint 'Master1' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished sending replay log for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/ApiListener: Finished syncing endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:03 -0700] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0.0333333/s (2/min 2/5min 2/15min);
[2020-05-28 23:28:03 -0700] warning/ApiListener: Ignoring config update. 'api' does not accept config.
[2020-05-28 23:28:03 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master2'
[2020-05-28 23:28:03 -0700] warning/ApiListener: Removing API client for endpoint 'Master2'. 0 API clients left.
[2020-05-28 23:28:11 -0700] information/ApiListener: New client connection for identity 'Master2' from [::ffff:10.157.1.2]:49990
[2020-05-28 23:28:11 -0700] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'Master2'.
[2020-05-28 23:28:11 -0700] information/ApiListener: Sending config updates for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:11 -0700] information/ApiListener: Finished sending config file updates for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:11 -0700] information/ApiListener: Syncing runtime objects to endpoint 'Master2'.
[2020-05-28 23:28:11 -0700] information/ApiListener: Finished syncing runtime objects to endpoint 'Master2'.
[2020-05-28 23:28:11 -0700] information/ApiListener: Finished sending runtime config updates for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:11 -0700] information/ApiListener: Sending replay log for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:11 -0700] information/ApiListener: Finished sending replay log for endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:11 -0700] information/ApiListener: Finished syncing endpoint 'Master2' in zone 'Master'
[2020-05-28 23:28:11 -0700] warning/ApiListener: Ignoring config update. 'api' does not accept config.
[2020-05-28 23:28:12 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:28:22 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:28:32 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:29:02 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:29:22 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:29:33 -0700] information/WorkQueue: #9 (JsonRpcConnection, #0) items: 0, rate: 0.4/s (24/min 195/5min 434/15min);
[2020-05-28 23:29:43 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:31:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2020-05-28 23:32:33 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:32:53 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:33:02 -0700] information/ConfigObject: Dumping program state to file 'C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state'
[2020-05-28 23:33:03 -0700] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 1/5min 3/15min);
[2020-05-28 23:33:44 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2020-05-28 23:34:35 -0700] information/WorkQueue: #9 (JsonRpcConnection, #0) items: 0, rate: 0.4/s (24/min 124/5min 436/15min);
[2020-05-28 23:35:14 -0700] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);

Thanks in advance for your help.
Alex

When you get the “[client] is not connected to [master]” UNKNOWN messages in Icinga, is it always the same master? If so, you’ll want to check c:\ProgramData\Icinga2\etc\Icinga2\zones.conf and make sure both master entries are correct. If one entry is wrong, one master can connect fine while the other fails, and you get these intermittent TLS errors.
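One rough way to answer the "is it always the same master?" question from the agent side is to count the disconnect lines per endpoint in the agent's icinga2.log. This is only a sketch: the function name `count_disconnects` is made up for illustration, and it assumes the log uses the same "API client disconnected for identity '...'" wording shown in the excerpt above.

```python
import re
from collections import Counter

# Matches agent log lines such as:
# [...] warning/JsonRpcConnection: API client disconnected for identity 'Master2'
DISCONNECT = re.compile(r"API client disconnected for identity '([^']+)'")


def count_disconnects(lines):
    """Return a Counter mapping endpoint identity -> number of disconnects."""
    counts = Counter()
    for line in lines:
        match = DISCONNECT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines copied from the log excerpt in this thread.
sample = [
    "[2020-05-28 23:23:50 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master2'",
    "[2020-05-28 23:27:44 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master1'",
    "[2020-05-28 23:28:03 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master2'",
]
print(count_disconnects(sample))
```

To run it against a real file, pass an open file handle instead of `sample`. If one identity dominates the counts over a longer period, that master's entry or network path is probably the place to look first.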

Hello Blake,
Thanks for the reply. I am investigating this problem on more servers now and will let you know. It is weird because I never noticed this before. Maybe it has been happening all along and this is just the first time a server owner has reported this behavior.

I recently changed how the service-to-service dependencies are configured on all Windows servers. I have a service using the “cluster-zone” check command that checks whether the Icinga agent is connected to the Master server. All services except Icinga2 depend on this cluster-zone service check. Since Icinga2 lost its connection for a short period of time, a notification was sent out. This is a new configuration, so maybe this was happening all along, but the dependency was configured differently before and a notification was never sent out the old way. Mmmhhhh :thinking:

As for my local zones.conf file on the agents: it is a standard file for all agents, and it’s working well. If I stop the Icinga2 service on one Master server for maintenance, all checks and notifications fail over to the other Master without a problem.

C:\ProgramData\icinga2\etc\icinga2\zones.conf

object Endpoint "Master1" {
}

object Endpoint "Master2" {
}

object Zone "SunChemical" {
	endpoints = [ "Master1", "Master2" ]
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}

object Endpoint NodeName {
}

object Zone ZoneName {
	endpoints = [ NodeName ]
	parent = "Master"
}

NodeName and ZoneName are set in the constants.conf file on the local host.

Thanks in advance for your help.
Alex

Hello @blakehartshorn,
I hope you are well. Sorry for taking so long to reply to this post; I was investigating the problem. The disconnections are happening from both Masters. The first message in the icinga2.log (client) is “warning/TlsStream: TLS stream was disconnected”. I can match it to a disconnect message with the same timestamp in the logs on the master. After a certain amount of time the agent just reconnects and resumes checks. The time between disconnect and reconnect is different each time on each server.
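To put numbers on how long each outage lasts, the disconnect and reconnect lines in the agent log can be paired per endpoint. The following is a minimal sketch, not a finished tool: the name `outages` is hypothetical, and it assumes the timestamp layout and the "API client disconnected" / "New client connection" wording from the excerpt in my first post.

```python
import re
from datetime import datetime

# "[2020-05-28 23:23:50 -0700] warning/JsonRpcConnection: message..."
LINE = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})\] \S+: (.*)$")
DOWN = re.compile(r"API client disconnected for identity '([^']+)'")
UP = re.compile(r"New client connection for identity '([^']+)'")


def outages(lines):
    """Yield (identity, seconds) for each disconnect followed by a reconnect."""
    pending = {}  # identity -> time of the first unanswered disconnect
    for line in lines:
        parsed = LINE.match(line)
        if not parsed:
            continue  # skips blank lines and the multi-line "Context:" blocks
        stamp = datetime.strptime(parsed.group(1), "%Y-%m-%d %H:%M:%S %z")
        message = parsed.group(2)
        down = DOWN.search(message)
        up = UP.search(message)
        if down:
            pending.setdefault(down.group(1), stamp)
        elif up and up.group(1) in pending:
            started = pending.pop(up.group(1))
            yield up.group(1), (stamp - started).total_seconds()


# Two lines copied from the log excerpt in this thread.
sample = [
    "[2020-05-28 23:23:50 -0700] warning/JsonRpcConnection: API client disconnected for identity 'Master2'",
    "[2020-05-28 23:26:12 -0700] information/ApiListener: New client connection for identity 'Master2' from [::ffff:10.157.1.2]:48788",
]
for identity, seconds in outages(sample):
    print(identity, seconds)  # Master2 142.0
```

Note that the agent does not always write lines in strict time order (the 23:27:44 'Master1' disconnect appears after 23:28:02 entries in the excerpt), so sorting the lines by timestamp first may be worth it on a full log.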

Since I started watching this, I see a few different servers disconnect and reconnect daily. I have 600+ Windows servers that are monitored by Icinga, so it is not a big problem now, BUT we are adding more servers each week. I don’t want the trend to continue and become a larger problem.

I have attached the logs if you want to review them. Any feedback on this problem would be great!

The timestamps in the logs differ because the agent is in a different time zone. The agent is 3 hours behind the master.
agent_icinga2.log (38.6 KB) master1_icinga2.log (32.7 KB)
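Because of the 3-hour offset, it may be easier to compare the two logs after converting every timestamp to UTC. A small sketch, assuming both logs use the "YYYY-MM-DD HH:MM:SS ±HHMM" layout seen in the agent excerpt; `to_utc` is a made-up helper name, and the master-side timestamp below is a hypothetical example (the -0400 offset is my assumption, not taken from the attached log).

```python
from datetime import datetime, timezone


def to_utc(stamp):
    """Parse an icinga2.log timestamp such as '2020-05-28 23:23:50 -0700' into UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S %z")
    return local.astimezone(timezone.utc)


agent = to_utc("2020-05-28 23:23:50 -0700")   # from the agent log excerpt
master = to_utc("2020-05-29 02:23:50 -0400")  # hypothetical master-side line, 3 hours ahead

print(agent.isoformat())
print(agent == master)
```

Once both sides are in UTC, a disconnect on the agent and the matching line on the master should carry the same timestamp or differ by only a second or two.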

Thanks in advance for your help. :slight_smile:
Alex
