I’m taking care of an Icinga 2 HA cluster that generally runs well and reliably, but after every config change and the subsequent reload of the config master, I get the following message for a few minutes: Remote Icinga instance 'client' is not connected to 'master'.

Until everything is up and running again (which can take 5-10 minutes), checks stop running for some (in fact many) services. In Icinga Web they all end up under "Overdue: Late Check Results".

With several reloads a day this is quite annoying, especially for checks with max_check_attempts = 1, which then send false-positive notification emails.

My setup: Icinga 2.13.2 on Debian 11, two masters (each a virtual machine with 8 cores, 8 GB RAM and SSD storage) and three sub-zones with one or two satellites each.

I use a file-based configuration with:

  • 1,700 hosts
  • 27,000 services
  • 57,000 notifications
  • 24,000 dependencies
  • 312 zones and endpoints (each endpoint has its own zone)

Reload time:

    time systemctl reload icinga2.service

    real    0m30.558s
    user    0m0.009s
    sys     0m0.001s

I measured the following timeline (mm:ss after the reload was started):

  • 00:00: reload started
  • 00:31: reload done
  • 00:50: first services report "not connected"
  • 01:40: many services report "not connected"
  • 02:25: many services/hosts are overdue
  • 09:05: everything is healthy again
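To put numbers on that timeline, here is a small sketch that turns the timestamps into intervals. It assumes the timestamps are mm:ss offsets from the start of the reload (which matches the ~30.5 s measured for `systemctl reload`). It shows that the disruption lasts roughly 8.5 minutes after the reload itself has already finished.

```python
"""Compute elapsed intervals from the reload timeline (mm:ss offsets)."""

# Assumption: each timestamp is minutes:seconds since `systemctl reload` started.
TIMELINE = [
    ("00:00", "reload started"),
    ("00:31", "reload done"),
    ("00:50", "first services report 'not connected'"),
    ("01:40", "many services report 'not connected'"),
    ("02:25", "many services/hosts overdue"),
    ("09:05", "everything healthy again"),
]


def to_seconds(mmss: str) -> int:
    """Convert an 'mm:ss' string to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def intervals(timeline):
    """Return (event, seconds since the previous event) pairs."""
    result = []
    previous = to_seconds(timeline[0][0])
    for stamp, event in timeline[1:]:
        current = to_seconds(stamp)
        result.append((event, current - previous))
        previous = current
    return result


if __name__ == "__main__":
    for event, delta in intervals(TIMELINE):
        print(f"+{delta:4d}s  {event}")
    reload_done = to_seconds(TIMELINE[1][0])
    healthy = to_seconds(TIMELINE[-1][0])
    print(f"disruption after reload finished: {healthy - reload_done}s")
```

The reload accounts for 31 s; the remaining 514 s are spent waiting for the cluster to become healthy again, so the reload duration alone does not explain the outage.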

My questions:
What is causing the described problem?
Am I the only one seeing this in larger HA clusters?
Do I simply need more powerful master servers?

Consider increasing max_check_attempts as well as retry_interval.
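A rough sketch of why max_check_attempts and retry_interval matter here. It assumes the usual soft/hard state logic: after the first non-OK result, the check is retried every retry_interval seconds and only becomes HARD (and notifies) once check_attempt reaches max_check_attempts. It also assumes the "not connected" results count as non-OK results. Scheduling jitter and check execution time are ignored, and the helper names are made up for illustration.

```python
"""Rough estimate of how long a failing check stays in a SOFT state."""


def seconds_until_hard(max_check_attempts: int, retry_interval: int) -> int:
    """Approximate delay between the first non-OK result and the HARD state.

    Assumption: one retry every retry_interval seconds until check_attempt
    reaches max_check_attempts; jitter and execution time are ignored.
    """
    if max_check_attempts < 1 or retry_interval <= 0:
        raise ValueError("max_check_attempts must be >= 1 and retry_interval > 0")
    return (max_check_attempts - 1) * retry_interval


def min_attempts_to_absorb(outage_seconds: int, retry_interval: int) -> int:
    """Smallest max_check_attempts whose SOFT phase outlasts the outage."""
    if retry_interval <= 0:
        raise ValueError("retry_interval must be > 0")
    # We need (m - 1) * retry_interval > outage_seconds, i.e. strictly longer.
    return outage_seconds // retry_interval + 2


if __name__ == "__main__":
    # Rough upper bound taken from the timeline in the first post:
    # reload done (00:31) until everything is healthy again (09:05).
    outage = 514
    for retry in (30, 60, 120):
        attempts = min_attempts_to_absorb(outage, retry)
        soft = seconds_until_hard(attempts, retry)
        print(f"retry_interval={retry}s -> max_check_attempts>={attempts} "
              f"(soft phase {soft}s)")
```

With max_check_attempts = 1 the soft phase is zero, so a single non-OK result during the reconnect window is enough to trigger a notification.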


I can’t help you specifically, but I remember this thread:

Some tests were run there by one or more users, but I think they mostly focused on the IDO database.
Still, there may be something helpful for you in there as well.

That is already done, but some checks in my environment must have max_check_attempts = 1.
Anyway, I don't think this is the cause of such long reload times.

@log1c Yes, I have already read it, but unfortunately I didn't find much that was useful.

@Al2Klimov Sorry if I’m bothering you with this topic, but unfortunately it is still an issue in our environment.
Do you have any new ideas?

Are these Windows hosts?

95% of the 312 endpoints are Linux hosts; about 5% are Windows hosts.