Hosts flip-flop between connected and not connected status

Hi everyone,

My first post here.

We switched from Nagios to Icinga 2 six months ago. This week I needed to reinstall the server and change its FQDN (everything is installed through Puppet). Because it’s installed with Puppet, the server got the same certificate as the old one (installed from scratch, not from backup).

After the reinstallation I noticed I needed to regenerate the cert on the satellite, so I cleaned up /var/lib/icinga2/certs and let the Puppet agent recreate the certs (using a ticket/salt).

I have ~700 hosts and 7000 services. Let’s say 60% of the hosts work fine.

For the other 40% it’s very strange: the node flip-flops from «normal status» to «Remote Icinga instance ‘CLIENT_FQDN’ is not connected to ‘MASTER_FQDN’».

It works for 2 or 3 hours, then switches to «not connected». Restarting icinga2 on the client solves 50% of the cases (for some time). Of the rest, some just won’t connect, and some connect after a lot of restarts (I noticed it works a little better if I do a systemctl stop icinga2 ; sleep 100; systemctl start icinga2). Sometimes deleting /var/lib/icinga2/certs and regenerating the certificate solves the problem.

On the server, for every host (working or not), I can see this when I restart icinga2 on the client:

[2021-01-06 10:41:41 +0100] information/ApiListener: New client connection for identity ‘CLIENT_FQDN’ from [x.y.z.t]:36320
[2021-01-06 10:41:41 +0100] information/ApiListener: Sending config updates for endpoint ‘CLIENT_FQDN’ in zone ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/ApiListener: Finished sending config file updates for endpoint ‘CLIENT_FQDN’ in zone ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/ApiListener: Syncing runtime objects to endpoint ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/JsonRpcConnection: Received certificate request for CN ‘CLIENT_FQDN’ signed by our CA.
[2021-01-06 10:41:41 +0100] information/JsonRpcConnection: The certificate for CN ‘CLIENT_FQDN’ is valid and uptodate. Skipping automated renewal.
[2021-01-06 10:41:41 +0100] information/ApiListener: Finished syncing runtime objects to endpoint ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/ApiListener: Finished sending runtime config updates for endpoint ‘CLIENT_FQDN’ in zone ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/ApiListener: Sending replay log for endpoint ‘CLIENT_FQDN’ in zone ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/ApiListener: Finished sending replay log for endpoint ‘CLIENT_FQDN’ in zone ‘CLIENT_FQDN’.
[2021-01-06 10:41:41 +0100] information/ApiListener: Finished syncing endpoint ‘CLIENT_FQDN’ in zone ‘CLIENT_FQDN’.
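
To see which clients reconnect most often, the «New client connection» lines in the master log can be counted per identity. This is only a minimal sketch, assuming the log format shown above; the function name `count_connections` is my own and not part of Icinga 2.

```python
import re
from collections import Counter

# Matches the "New client connection" lines shown above. Straight and
# typographic quotes are both accepted, because pasted logs may contain either.
CONNECT_RE = re.compile(
    r"information/ApiListener: New client connection for identity "
    r"['‘’]([^'‘’]+)['‘’]"
)


def count_connections(lines):
    """Return a Counter mapping client identity -> number of new connections."""
    counts = Counter()
    for line in lines:
        match = CONNECT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

It can be fed an open log file, for example `count_connections(open(path, encoding="utf-8", errors="replace")).most_common(20)`, where `path` is wherever the master writes its log. Clients that reconnect far more often than the rest are the ones to look at first.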

Regards

OK, so I solved the problem.

I’m not sure what the actual problem was, but after rebooting the server (master) everything works fine. The server was swapping. Maybe because it was swapping it was slow, so icinga2 dropped things.
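
For anyone who wants to check for the same condition, swap usage on a Linux master can be read from /proc/meminfo. This is only a minimal sketch; the function name `swap_used_percent` is my own, and I have not confirmed that swapping was really the cause here.

```python
def swap_used_percent(meminfo_text):
    """Return swap usage in percent from /proc/meminfo content.

    Returns None when the machine has no swap configured.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # /proc/meminfo reports kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return None
    free = values.get("SwapFree", 0)
    return 100.0 * (total - free) / total
```

On the master it could be called as `swap_used_percent(open("/proc/meminfo").read())`. A value that stays high while clients drop to «not connected» would at least be consistent with what I saw.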

Thanks!!!