Icinga2 - API client disconnected

Hi all,

I’m trying to find out why Icinga2 endpoints in my installation keep randomly being disconnected. I found some older posts about similar issues, but no real clue how to handle this.

[2019-04-23 09:11:36 +0200] warning/TlsStream: TLS stream was disconnected.
[2019-04-23 09:11:36 +0200] warning/JsonRpcConnection: API client disconnected for identity 'masterhost.example'
[2019-04-23 09:11:36 +0200] warning/ApiListener: Removing API client for endpoint 'masterhost.example'. 0 API clients left.

Icinga version on the master and the endpoints is: r2.10.4-1

Endpoint:

Red Hat Enterprise Linux Server release 7.6 (Maipo)
3.10.0-957.1.3.el7.x86_64

Master:

Debian GNU/Linux 9
3.10.0-957.1.3.el7.x86_64
Docker Image: jordan/icinga2:latest

The endpoint’s log does not show any attempts to re-initiate the connection, although it is executing multiple checks every minute. The “cluster” check (from the Icinga Template Library) shows the endpoints as disconnected, and I see the services’ “Last check” timestamps getting old.

The log of the master does not seem to show any related error. When searching /var/log/icinga2.log, the last entry containing the name of the disconnected host is:

[2019-04-23 08:09:48 +0000] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones/endpointhost.example' (0 Bytes). Received timestamp '2019-04-23 08:09:48 +0000' (1556006988.769867), Current timestamp '2019-04-23 08:00:38 +0000' (1556006438.549674).

The Icinga setup consists of about 30 endpoints, and I see 1-2 of them disconnecting every week. I have to restart the Icinga service on the endpoint before it reconnects.

Any help would be greatly appreciated.

Regards,
Sven

Curious about a few things here. Is just your master running in a Docker container, or are some of the clients as well? Have you cranked the log level up to debug on both the master and the endpoints? Are any other Docker containers having intermittent network issues? Any packet loss in the hostalive checks?

Those are the next things I can think of to look at right now (I’m here at 5:50am EDT for some weird reason).
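
If you haven’t used it before, debug logging is its own feature; on a stock install it’s roughly:

icinga2 feature enable debuglog

then a restart/reload (systemctl on the agents, restarting the container on your master), and the output lands in /var/log/icinga2/debug.log by default.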

Hi Blake,

Only the master is running in a Docker container; the endpoints are not.

I’ll have to increase the log levels; I have not done that yet.

The monitoring does not show any network issues or packet loss, but I’ll keep watching. What I don’t understand is why the endpoint does not even try to reconnect. Networks can have issues and connections can be interrupted; that is no reason not to attempt a reconnect. The endpoints are configured to connect to the master, so I would assume they retry. Maybe this just isn’t logged; I’ll see once I’ve increased the log levels.

Regards,
Sven

I think the endpoints connecting to the master is actually your problem. I don’t have my config in front of me right now, but go to zones.conf on one of your clients, remove host and port from the endpoint object for your master, and reload Icinga. If that works fine, do it everywhere else. You do need to leave that in there while you’re running the node wizard in order to set up the certs, but after that you want the connections going top down. If you only have 30 clients, that’s easy. If you have to scale out huge at some point, check out the Puppet or Ansible modules.
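
Roughly what I mean on the client side, from memory (hostnames pulled from your logs, the zone names are just my guess at your layout):

object Endpoint "masterhost.example" {
  // no host/port here, so the agent never dials out itself;
  // the master initiates the connection instead
}

object Zone "master" {
  endpoints = [ "masterhost.example" ]
}

object Endpoint "endpointhost.example" {
}

object Zone "endpointhost.example" {
  endpoints = [ "endpointhost.example" ]
  parent = "master"
}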


I’ll give that a try and reconfigure it. To make the master connect to the endpoints I’ll have to, as far as I remember from the docs, remove the master’s host and port from the endpoints’ zones configuration and add each endpoint’s host and port to the zones configuration on the master.
I just don’t see what the advantage of that is, or why I would want the connection initiation to go top down. I had a free choice and did it this way around for no particular reason. It should not really make any difference who initiates the connection.
The node wizard lets you choose whether you want the endpoint to connect to the master, and adds the master’s host and port to the zones configuration if you choose to connect actively. But I don’t think this is required.
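
If I understand the docs correctly, the zones configuration on the master would then contain something like the following for each endpoint (hostname taken from my logs, the zone names simply reflect how I set things up, 5665 is the default port):

object Endpoint "endpointhost.example" {
  host = "endpointhost.example"   // the master now dials out to the agent
  port = 5665
}

object Zone "endpointhost.example" {
  endpoints = [ "endpointhost.example" ]
  parent = "master"
}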

It’s been recommended practice for a while. I’ve got an HA setup with ~2500 nodes and it works great. If your master goes down, it’ll start collecting again the second it comes back up. I’d recommend spinning up a second master if you catch the container dropping, but it likely shouldn’t.

Ok, thanks for the info. I will reconfigure and see what happens. It might take some time to find out; it does not happen very often and I cannot reproduce it.

How do you handle such a number of nodes? I made one zone per server because I wanted the nodes to execute most of their checks themselves. The zones configuration and the directory structure for the config on the master get annoyingly big and confusing with all those zones.

Regards,
Sven

I’ve scripted out my config for that, which helps greatly. Basically, when you get too big for zones.conf and have satellites, I do this:

Every parent zone has a folder named after itself in zones.d (not custom, just how it works). You can toss a series of .conf files in there and Icinga will pick them all up as if they were one file. I use separate files for hosts vs. endpoint/zone objects to keep things readable.

I have common server types across 6 zones, so I get as in-depth and conditional with the global-templates as I can. That leads to a bunch of simple, boring config files that just import a template unless they’re snowflakes. Every endpoint having its own zone seems odd at first, but it’s just there to establish hierarchy, so don’t overthink it.

Keeping all of that on your primary master means you only have to mess with it in one place and it’s set-and-forget everywhere else. I also keep my conf folder under git in case one of my coworkers breaks it.
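
A rough sketch of the split I mean (zone, file, and host names made up, the structure is the point):

# /etc/icinga2/zones.d/dc1/agents.conf -- endpoint/zone objects for that parent zone
object Endpoint "agent1.example" {
  host = "agent1.example"   // the parent connects down to the agent
}

object Zone "agent1.example" {
  endpoints = [ "agent1.example" ]
  parent = "dc1"
}

# /etc/icinga2/zones.d/dc1/hosts.conf -- host objects that mostly just import a template
object Host "agent1.example" {
  import "generic-agent"
  address = "agent1.example"
}

# /etc/icinga2/zones.d/global-templates/templates.conf -- shared templates synced everywhere
template Host "generic-agent" {
  check_command = "hostalive"
}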

Also, noticing something I missed in your last message: the endpoints do execute their checks on their own in this scenario. It ain’t NRPE. Their parent pushes their config down to them, they do their job, and then the parent asks for a report and it finds its way down. In a huge environment the HA approach also helps with load balancing. I gave my masters 8 cores each and my satellites 6. As long as they know about each other in the config, they’ll split responsibility. Ultimately the CPU footprint of Icinga on the clients is next to nothing.
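
For what it’s worth, the HA part is really just both masters sitting in the same zone definition, roughly like this (hostnames made up):

object Endpoint "master1.example" { }
object Endpoint "master2.example" { }

object Zone "master" {
  endpoints = [ "master1.example", "master2.example" ]
}

With both of them in one zone and aware of each other, they split the check load between themselves.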