I have a distributed Icinga2 setup (2.11.2-1) with 2 HA masters, 4 satellites, 150 hosts and about 3k services to check. There are 4 regions, largest one has 2 satellites.
Configuration for all Icinga2 instances is generated by Puppet. I’ve tried to switch the configuration to the top-down approach; however, the problem is that almost 1/3 of all checks end up either late or UNKNOWN, and the load seems to be unevenly distributed across the satellites. Are there any recommendations on how to scale an Icinga2 setup? The DB backend runs on a dedicated PostgreSQL instance.
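For reference, a minimal sketch of the zone hierarchy I’m describing (the zone and endpoint names are illustrative, not my actual hostnames):

```
// zones.conf sketch: 2 HA masters, one region zone with 2 satellites
object Zone "master" {
  endpoints = [ "master1.localdomain", "master2.localdomain" ]
}

object Zone "region1" {
  endpoints = [ "region1-satellite1.localdomain", "region1-satellite2.localdomain" ]
  parent = "master"
}
```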
I’m curious: if you check the logs on both satellites and on the masters above them, do you see any errors about connections to either one? Do the zones.conf files on the satellites have their sibling satellite’s IP address filled in? (They won’t coordinate with each other otherwise.)
What is the CPU load roughly on each satellite as well as on the postgres server?
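If it helps, a quick way to triage this on each satellite (paths assume a default Linux install of Icinga2; the guards let the commands no-op where a file or binary is missing):

```shell
# Load averages on this box
uptime
# Recent errors in the Icinga2 main log, if present
[ -f /var/log/icinga2/icinga2.log ] && grep -iE 'error|critical' /var/log/icinga2/icinga2.log | tail -n 20
# Validate the synced configuration, if the binary is available
command -v icinga2 >/dev/null 2>&1 && icinga2 daemon -C
```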
Both satellite Endpoint objects have host, port and log_duration = 1d set. There are no errors about connections between the satellites. I’ve noticed many errors about a host that was removed from the cluster (and from the master), so that change should have been propagated to the satellites (but wasn’t, or it’s taking too long). I’ve also noticed that one satellite was using 12 GB of RAM (RES); after a restart it went down to 2 GB.
Since I’ve removed the host parameter on the agents, i.e.:
object Endpoint "icinga2-satellite1.localdomain" {
  // Do not actively connect to the satellite by leaving out the 'host' attribute
}
the load on the satellites has at least tripled: one has a load average of 14, the other 19. That’s quite bad, so I’ll try to move those instances to beefier servers.
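To answer the sibling-IP question: both satellites’ zones.conf look roughly like this, so they can connect to each other (the IPs and the zone name are placeholders, anonymized from my real config):

```
// Present on both satellites of the same zone
object Endpoint "icinga2-satellite1.localdomain" {
  host = "192.0.2.11"   // sibling uses this to establish the connection
}

object Endpoint "icinga2-satellite2.localdomain" {
  host = "192.0.2.12"
}

object Zone "satellite-zone" {
  endpoints = [ "icinga2-satellite1.localdomain", "icinga2-satellite2.localdomain" ]
  parent = "master"
}
```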