Remote Icinga instance is not connected to master - for a few minutes after reload

rafi01010 · July 21, 2022, 11:04am

Dear Community,

I’m taking care of an Icinga 2 HA cluster that is basically running well and reliably, but after every config change and the subsequent reload of the config master I get the following message for a few minutes: Remote Icinga instance 'client' is not connected to 'master'.

Until everything is up and running again (which can take 5-10 minutes) it stops running checks for some (many) services. In Icingaweb they all end up in Overdue: Late Check Results.

With several reloads a day this can be quite annoying, especially with checks that have max_check_attempts set to 1 and then send false-positive emails.

My setup: Icinga 2.13.2 on Debian 11, with 2 masters (virtual machines, each with 8 cores, 8 GB RAM, SSD) and 3 sub-zones with one or two satellites each.

I use file-based config with:

1,700 Hosts
27,000 Services
57,000 Notifications
24,000 Dependencies
312 Zones and Endpoints (each endpoint is a new zone)

reload time:
time systemctl reload icinga2.service

real    0m30.558s
user    0m0.009s
sys     0m0.001s

I measured the times with:
00:00: reload started
00:31: reload done
00:50: first services with not connected…
01:40: a lot of services with not connected
02:25: a lot of services/hosts are overdue
09:05: Everything back healthy
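
To make the gaps in this timeline easier to read, here is a small sketch that turns the marks above into per-phase durations. The marks are copied from the list; the helper names (`to_seconds`, `phase_durations`) are just illustrative and not part of any Icinga tooling.

```python
# Timeline marks copied from the post (mm:ss after the reload was triggered).
TIMELINE = [
    ("00:00", "reload started"),
    ("00:31", "reload done"),
    ("00:50", "first services with not connected"),
    ("01:40", "a lot of services with not connected"),
    ("02:25", "a lot of services/hosts are overdue"),
    ("09:05", "everything back healthy"),
]


def to_seconds(mark: str) -> int:
    """Convert an 'mm:ss' mark into seconds."""
    minutes, seconds = mark.split(":")
    return int(minutes) * 60 + int(seconds)


def phase_durations(timeline):
    """Return (label, seconds until the next mark) for every mark but the last."""
    return [
        (label, to_seconds(next_mark) - to_seconds(mark))
        for (mark, label), (next_mark, _) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for label, seconds in phase_durations(TIMELINE):
        print(f"{label:40s} lasts {seconds:4d}s")
    # Time between "reload done" and "everything back healthy".
    degraded = to_seconds("09:05") - to_seconds("00:31")
    print(f"degraded after the reload finished: {degraded}s")
```

So the reload itself accounts for only about half a minute, while the cluster stays degraded for roughly 8.5 minutes afterwards.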

My questions:
What’s the reason for the described problem?
Am I the only one with such a problem in larger HA clusters?
Do I just need more power on the master servers?

Thanks in advance for your help

PS: If you also want a nice language pack like mine, NETWAYS has some great ones: GitHub - NETWAYS/icingaweb2-theme-oesterreichisch: Austria Theme for Icinga Web 2

Al2Klimov · July 26, 2022, 5:12pm

Hello @rafi01010!

Consider increasing max_check_attempts as well as the retry_interval.
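
As a rough illustration of why max_check_attempts and retry_interval matter here, the sketch below estimates how long a service stays in a SOFT state before it goes HARD and notifies. It assumes a simplified model: the first non-OK result counts as attempt 1, and each further attempt follows one retry_interval later. The function names are hypothetical, and the 514-second window is simply the gap between "reload done" (00:31) and "everything back healthy" (09:05) in the timeline from the first post; whether a single service is affected for that long is not known from the thread.

```python
import math


def soft_state_window(max_check_attempts: int, retry_interval_s: float) -> float:
    """Seconds between the first non-OK result and the HARD state.

    Simplified model (an assumption): the first non-OK result is attempt 1,
    and every further attempt happens one retry_interval after the previous one.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval_s


def attempts_to_cover(window_s: float, retry_interval_s: float) -> int:
    """Smallest max_check_attempts whose soft-state window is >= window_s."""
    if retry_interval_s <= 0:
        raise ValueError("retry_interval_s must be positive")
    return math.ceil(window_s / retry_interval_s) + 1


if __name__ == "__main__":
    # With a single attempt there is no soft-state buffer at all.
    print(soft_state_window(1, 60))
    # Attempts needed to ride out 514 s with a 60 s retry_interval.
    print(attempts_to_cover(514, 60))
```

The trade-off is that real problems are reported later by the same amount, which is presumably why some checks are kept at a single attempt.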

Best,
A/K

log1c · July 27, 2022, 6:21am

Can’t help you specifically, but I remembered this thread:

Some tests were run there by one or more users, but I think they mostly focused on the IDO database.
But maybe there is something helpful for you in there as well.

rafi01010 · August 31, 2022, 10:29am

Hello @Al2Klimov,
That is already done, but some checks need max_check_attempts set to 1 in my environment.
But I don’t think this is the cause of such long reload times.

@log1c Yes, I have already read it, but unfortunately did not find much that was useful.

rafi01010 · March 8, 2023, 9:49am

@Al2Klimov I’m sorry if I’m annoying you with this topic. But unfortunately this is still relevant in our environment…
Do you have any new ideas?

moreamazingnick · March 8, 2023, 9:58am

Are these Windows hosts?

rafi01010 · March 8, 2023, 10:31am

95% of the 312 endpoints are Linux hosts. About 5% are Windows.