Bug? Agents cannot communicate after master reboot

My icinga2 master node (version r2.11.0-1) had a problem, so I rebooted the server, but now none of the agents seem to recover communication. (My setup has the agents connect to the master.)

“Zone xxx is not connected” and all other services have status Unknown.

I say “communication” and not “connection” because I still see 2 established connections (port 5665)?!
After restarting the icinga2 service this drops back to 1 established connection (port 5665) and the critical errors disappear.

On one node I had the debuglog feature enabled, showing this output:

debug/ApiListener: Not checking connection to Zone 'global-templates' because it's not in the same zone, a parent or a child zone.
[2019-10-02 21:30:02 +0200] debug/ApiListener: Not checking connection to Zone 'director-global' because it's not in the same zone, a parent or a child zone.
[2019-10-02 21:30:02 +0200] debug/ApiListener: Not checking connection to Zone 'global-templates' because it's not in the same zone, a parent or a child zone.
[2019-10-02 21:30:02 +0200] debug/ApiListener: Not checking connection to Zone 'director-global' because it's not in the same zone, a parent or a child zone.

Personally I think that when the master restarted, the agent established a new connection and this created some kind of conflict?

Shouldn't the agents just reconnect to the master without problems?

Looks like you didn't configure the zones director-global and global-templates as global zones on the master/satellite/agent.

Every agent/satellite has:

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}

After restarting the service, all services show green again in icingaweb2.

If you can reproduce it, I would open an issue.

I will as soon as my master server goes down again. Would a debuglog on two agents and the master be sufficient? (All other info is static and can be delivered after my servers are up again.)

Same problem again. If the master server hangs and we reboot it, the clients/agents need a service restart to reconnect to the master. I would expect this to happen automatically.

btw, the agents connect to the master in my setup

Yes, please report this as an issue.

The template for new issues includes some suggestions on what to include. A debug log of the master and of one of the agents will be of some help, I think. Prepare it so it covers the time from the restart until you were able to reconnect. If that's just a few lines, you can post it directly; if it's longer, you can offer it to the developers and they can provide you with an upload link.

The debug log lines don't tell much in this regard: global zones are logged, but they play no role in connections to endpoints. Global zones are only used for syncing configuration throughout the cluster.

The real error message hides somewhere else. When you say that the agent is not connected, can you verify that via netstat -an | grep 5665 on both the master and the agent?
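Something like this on both sides (just a sketch; ss from iproute2 works as well if netstat is not installed):

# on both the master and the agent
netstat -an | grep 5665
# alternative with iproute2
ss -tn state established '( sport = :5665 or dport = :5665 )'

An ESTABLISHED line on the agent without a matching line on the master would point to a half-open connection.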

What does the zones.conf on the agent look like?

Cheers,
Michael

Hi Michael,

That's what I thought when I tried to analyse the connections, and that's why I wrote this:

It seemed to me that the connection was “still there” when looking at the established connections on the agent.
At that time it even seemed that metrics were still coming in, but that's something I need to verify when it happens again.
Maybe I can simulate it by doing a hard reset, but I haven't tried that yet, as I need help to troubleshoot this further.
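One idea for simulating it without a real hard reset (a sketch, assuming iptables is available on the master; DROP discards packets silently, so the agent's established connection should go half-open just like with a hung host):

# on the master: silently drop all incoming cluster traffic to simulate a hang
iptables -A INPUT -p tcp --dport 5665 -j DROP
# wait until the agent logs "No messages ... received in the last 60 seconds",
# then remove the rule again
iptables -D INPUT -p tcp --dport 5665 -j DROP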

Here is the zones.conf of one of the agents:

/*
 * Generated by Icinga 2 node setup commands
 * on 2019-08-31 12:25:50 +0200
 */

object Endpoint "icinga" {
  host = "xx.xx.xx.xx"
  port = "5665"
}

object Zone "master" {
  endpoints = [ "icinga" ]
}

object Endpoint "vmi12345.contaboserver.net" {
}

object Zone "vmi12345.contaboserver.net" {
  endpoints = [ "vmi12345.contaboserver.net" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}

What I could think of: are these agents storing quite a lot of replay log for the master? E.g. is %ProgramData%\var\lib\icinga2\api\log (on Linux: /var/lib/icinga2/api/log) filled up with lots of data? Try purging that, and also set log_duration = 0 on the Endpoint object in your agent's zones.conf.
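As a sketch, with the endpoint name from your zones.conf (0 disables the replay log entirely):

object Endpoint "icinga" {
  host = "xx.xx.xx.xx"
  port = "5665"
  log_duration = 0
}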

Also, the agent must log a connection and sync attempt in icinga2.log; what does that look like?

Cheers,
Michael

Hi Michael,

I run my agents on Debian, so I checked /var/lib/icinga2/api/log/, but there is nothing there (screenshot: empty api log directory).

Checking /var/log/icinga2/ (screenshot: log file sizes).

A piece of the icinga2.log shows a lot of zeros :wink: (screenshot)

Hi @dnsmichi,

Had a freeze on my master again, so I could troubleshoot some more :wink:

First, the agent shows this in the debug.log:

[2019-10-24 16:07:20 +0200] debug/ApiListener: Not connecting to Endpoint 'icinga' because we're already connected to it.

When the master server hangs, the agent shows this:

[2019-10-24 16:31:23 +0200] information/JsonRpcConnection: No messages for identity 'icinga' have been received in the last 60 seconds.
[2019-10-24 16:31:23 +0200] warning/JsonRpcConnection: API client disconnected for identity 'icinga'
[2019-10-24 16:31:23 +0200] warning/ApiListener: Removing API client for endpoint 'icinga'. 0 API clients left.
[2019-10-24 16:31:23 +0200] debug/EndpointDbObject: update is_connected=0 for endpoint 'icinga'

Then it will show:

[2019-10-24 16:31:39 +0200] debug/ApiListener: Not connecting to Endpoint 'icinga' because we're already trying to connect to it.

Now my expectation is that after a reboot of the master the clients will reconnect automatically, but they don't seem to be able to connect, at least not instantly…

After the master reboot the agent shows:

[2019-10-24 18:34:46 +0200] debug/ApiListener: Not connecting to Endpoint 'icinga' because we're already trying to connect to it.

After the master reboot I restarted the service on the agents that were ‘critical’ and left some agents untouched.

After almost 3 hours(!) the agent reconnected to the master; around the reconnect time I saw the lines below in the agent log.

To me it seems that the agent ‘keeps’ some kind of connection that it continues using until a timeout?! Looking at the log there still seems to be activity, but only once “Client TLS handshake failed” shows up does icingaweb show results again and the cluster health check recover to OK.

[2019-10-24 19:33:14 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-10-24 19:37:33 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 19:37:33 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 19:38:14 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-10-24 19:41:32 +0200] critical/ApiListener: Client TLS handshake failed (to [12.13.14.15]:5665): Connection timed out
[2019-10-24 19:41:32 +0200] information/ApiListener: Finished reconnecting to endpoint 'icinga' via host '12.13.14.15' and port '5665'
[2019-10-24 19:41:38 +0200] information/ApiListener: Reconnecting to endpoint 'icinga' via host '12.13.14.15' and port '5665'
[2019-10-24 19:41:38 +0200] information/ApiListener: New client connection for identity 'icinga' to [12.13.14.15]:5665
[2019-10-24 19:41:38 +0200] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'icinga'.
[2019-10-24 19:41:38 +0200] information/ApiListener: Sending config updates for endpoint 'icinga' in zone 'master'.
[2019-10-24 19:41:38 +0200] information/ApiListener: Finished sending config file updates for endpoint 'icinga' in zone 'master'.
[2019-10-24 19:41:38 +0200] information/ApiListener: Syncing runtime objects to endpoint 'icinga'.

Is there anything I can check to confirm my suspicion?

What does the master do in the meantime? It should at least log the incoming connection from the agent before actually performing a TLS handshake. Try a tcpdump to analyse the incoming packets.
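For example (a sketch; adjust the interface to your setup):

# on the master: watch for incoming connection attempts from the agent
tcpdump -ni any 'tcp port 5665'

If SYN packets from the agent arrive but are never answered, the problem is on the master side; if nothing arrives at all, look at the network in between.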

Besides, which distribution version is the master on? Please share the full output of icinga2 --version.

Cheers,
Michael

icinga2 --version output:

icinga2 - The Icinga 2 network monitoring daemon (version: r2.11.0-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later http://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
Platform: Raspbian GNU/Linux
Platform version: 10 (buster)
Kernel: Linux
Kernel version: 4.19.66-v7+
Architecture: armv7l

Build information:
Compiler: GNU 8.3.0
Build host: runner-LTrJQZ9N-project-297-concurrent-0

Application information:

General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2

Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var

Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

This is what the master shows at the time the agent reconnects; before that, there are no traces of the agent trying to connect:

[2019-10-24 18:41:29 +0100] information/ApiListener: New client connection from [x.x.x.x]:45492 (no client certificate)
[2019-10-24 18:41:29 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result (from [x.x.x.x]:45492), user: root, agent: curl/7.64.0).
[2019-10-24 18:41:29 +0100] information/HttpServerConnection: HTTP client disconnected (from [x.x.x.x]:45492)
[2019-10-24 18:41:35 +0100] information/ApiListener: New client connection from [x.x.x.x]:45508 (no client certificate)
[2019-10-24 18:41:35 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result (from [x.x.x.x]:45508), user: root, agent: curl/7.64.0).
[2019-10-24 18:41:35 +0100] information/HttpServerConnection: HTTP client disconnected (from [x.x.x.x]:45508)
[2019-10-24 18:41:38 +0100] information/ApiListener: New client connection for identity 'agent02' from [x.x.x.x]:43924

[2019-10-24 18:41:39 +0100] information/ApiListener: Sending config updates for endpoint 'agent02' in zone 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Syncing configuration files for global zone 'global-templates' to endpoint 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Finished sending config file updates for endpoint 'agent02' in zone 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Syncing runtime objects to endpoint 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'agent02' in zone 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Sending replay log for endpoint 'agent02' in zone 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Finished sending replay log for endpoint 'agent02' in zone 'agent02'.
[2019-10-24 18:41:39 +0100] information/ApiListener: Finished syncing endpoint 'agent02' in zone 'agent02'.
[2019-10-24 18:41:39 +0100] information/JsonRpcConnection: Received certificate request for CN 'agent02' signed by our CA.

In short:
The agent thinks it's connected. There are no traces of the agent on the master. Then the agent suddenly shows a disconnect (when the handshake fails?), tries to reconnect to the master, the master receives the request, and cluster health is restored.

Looking at the agent's log history again, it seems to have repeated this while it was not ‘really’ connected:

[2019-10-24 17:22:52 +0200] information/ApiListener: New client connection for identity 'icinga' to [x.x.x.x]:5665
[2019-10-24 17:22:52 +0200] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'icinga'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Sending config updates for endpoint 'icinga' in zone 'master'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Finished sending config file updates for endpoint 'icinga' in zone 'master'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Syncing runtime objects to endpoint 'icinga'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Finished syncing runtime objects to endpoint 'icinga'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Finished sending runtime config updates for endpoint 'icinga' in zone 'master'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Sending replay log for endpoint 'icinga' in zone 'master'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Finished sending replay log for endpoint 'icinga' in zone 'master'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Finished syncing endpoint 'icinga' in zone 'master'.
[2019-10-24 17:22:52 +0200] information/ApiListener: Finished reconnecting to endpoint 'icinga' via host 'x.x.x.x' and port '5665'
[2019-10-24 17:23:14 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-10-24 17:23:15 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 17:23:15 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 17:23:52 +0200] information/JsonRpcConnection: No messages for identity 'icinga' have been received in the last 60 seconds.
[2019-10-24 17:23:52 +0200] warning/JsonRpcConnection: API client disconnected for identity 'icinga'
[2019-10-24 17:23:52 +0200] warning/ApiListener: Removing API client for endpoint 'icinga'. 0 API clients left.
[2019-10-24 17:24:01 +0200] information/ApiListener: Reconnecting to endpoint 'icinga' via host 'x.x.x.x' and port '5665'
[2019-10-24 17:26:12 +0200] critical/ApiListener: Cannot connect to host 'x.x.x.x' on port '5665': Connection timed out
[2019-10-24 17:26:21 +0200] information/ApiListener: Reconnecting to endpoint 'icinga' via host 'x.x.x.x' and port '5665'
[2019-10-24 17:28:14 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-10-24 17:28:25 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 17:28:25 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 17:28:31 +0200] critical/ApiListener: Cannot connect to host 'x.x.x.x' on port '5665': Connection timed out
[2019-10-24 17:28:41 +0200] information/ApiListener: Reconnecting to endpoint 'icinga' via host 'x.x.x.x' and port '5665'
[2019-10-24 17:33:14 +0200] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2019-10-24 17:33:35 +0200] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2019-10-24 17:33:35 +0200] information/WorkQueue: #6 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);

Is it possible that there are ‘ghost’ connections on a router causing this problem?
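One thing I plan to check next time it happens (a sketch, assuming ss from iproute2 on the agent; -o shows the TCP timer state):

ss -tno state established '( dport = :5665 )'

If that still lists the connection as established while the master is already gone, that would confirm the half-open ‘ghost’ connection theory.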