Icinga checks being late or unknown

I have a distributed Icinga2 setup (2.11.2-1) with 2 HA masters, 4 satellites, 150 hosts and about 3k services to check. There are 4 regions; the largest one has 2 satellites.

According to the documentation I'm basically using "Three Levels with Masters, Satellites and Agents".

Configuration for all Icinga2 instances is generated by Puppet. I've tried to switch the configuration to the top-down approach; however, the problem is that almost 1/3 of all checks are either late or unknown.
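For context, by top-down I mean the config sync variant from the docs, where all host and service objects live in zones.d on the config master and get synced down to the satellite zones. The layout is roughly like this (zone and file names here are only illustrative):

# on the config master, top-down config sync
/etc/icinga2/zones.d/global-templates/templates.conf
/etc/icinga2/zones.d/region1-satellite/hosts.conf
/etc/icinga2/zones.d/region2-satellite/hosts.conf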

The satellites seem to be overloaded:

[2020-01-14 22:36:00 +0000] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 11658, rate: 2.13333/s (128/min 5308/5min 43519/15min); empty in 10 minutes
[2020-01-14 22:36:10 +0000] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 14456, rate: 2.13333/s (128/min 5144/5min 43519/15min); empty in 51 seconds
[2020-01-14 22:36:20 +0000] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 15197, rate: 2.13333/s (128/min 5144/5min 43519/15min); empty in 3 minutes and 24 seconds
[2020-01-14 22:36:30 +0000] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 13964, rate: 26.0167/s (1561/min 6629/5min 44987/15min); empty in less than 1 millisecond
[2020-01-14 22:38:00 +0000] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 104, rate: 76.05/s (4563/min 21037/5min 54250/15min); empty in 9 seconds
[2020-01-14 22:38:10 +0000] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 373, rate: 71.5167/s (4291/min 21037/5min 54250/15min); empty in 13 seconds

but the load seems to be unevenly distributed. Are there any recommendations on how to scale an Icinga2 setup? The DB backend runs on a dedicated PostgreSQL instance.
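For reference, the RelayQueue numbers from the log above can also be pulled from the REST API instead of tailing the log; something like this against a satellite should return the ApiListener status including the work queue counters (credentials and host are placeholders):

# query the ApiListener status object on a satellite (placeholder credentials)
curl -k -s -u root:apipassword \
  'https://icinga2-satellite1.localdomain:5665/v1/status/ApiListener'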

I'm curious: if you check the logs for both satellites and the masters under them, do you see any errors about connections to either one? Do the zones.conf files for the satellites have their sibling satellite's IP address filled in? (They won't coordinate with each other otherwise.)
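In other words, on each satellite the shared zone should list both endpoints, with the sibling's address filled in, roughly like this (names, zone and addresses are just examples):

object Endpoint "icinga2-satellite1.localdomain" {
  host = "192.168.56.105"
}

object Endpoint "icinga2-satellite2.localdomain" {
  host = "192.168.56.106"
}

object Zone "region1-satellite" {
  endpoints = [ "icinga2-satellite1.localdomain", "icinga2-satellite2.localdomain" ]
  parent = "master"
}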

What is the CPU load, roughly, on each satellite as well as on the Postgres server?

Both satellites have host, port and log_duration = 1d set. There are no errors about connections between the satellites. I've noticed many errors about a host that was removed from the cluster (and from the master), so the change should have been propagated to the satellites (but wasn't, or it's taking too long). I've also noticed that one satellite was using 12 GB of RAM (RES); after a restart it went down to 2 GB.
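For what it's worth, one way to check whether the deleted host is still known to a satellite is to look at the loaded objects and the synced config on the satellite itself (the host name below is made up):

# on a satellite: is the removed host still present in the loaded config?
icinga2 object list --type Host --name 'removed-host.localdomain'

# the config the satellite received from the master is stored here
ls /var/lib/icinga2/api/zones/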

Since I've removed the host parameter on the agents, i.e.:

object Endpoint "icinga2-satellite1.localdomain" {
  // Do not actively connect to the satellite by leaving out the 'host' attribute
}

the load on the satellites has at least tripled: one is at 14, the other at 19. That's quite bad, so I'll try to move those instances to better servers.
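With the agents no longer dialling in, the satellite side has to carry the connection details for every agent so that it can open the connections itself, roughly like this (agent name, address and zone names are examples):

object Endpoint "agent1.localdomain" {
  // the satellite now has to open the connection, so it needs the agent's address
  host = "192.168.56.111"
  port = "5665"
}

object Zone "agent1.localdomain" {
  endpoints = [ "agent1.localdomain" ]
  parent = "region1-satellite"
}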

That’s odd. I have 60,000 services and my satellites aren’t taking that hard of a beating.

I get the direction you’re now trying to go in here. Can I see a host/endpoint/zones example for a master, a satellite and a client?

Also, if you had automated your config going in the other direction, /etc/icinga2/features-enabled/api.conf might need an update:

# cat api.conf 
/**
 * The API listener is used for distributed monitoring setups.
 */
object ApiListener "api" {
  accept_config = true
  accept_commands = true
}

If accept_config isn’t set, this could be why your config changes aren’t propagating. You’ll want that on everything but the primary master.
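For contrast, on the primary (config) master the same feature file would typically leave accept_config off, since that node is the source of the zones.d tree being synced out; whether accept_commands is needed there depends on your setup. A sketch:

object ApiListener "api" {
  // config master: keep the default so it never overwrites its own zones.d
  // with config pushed from another node
  accept_config = false
  accept_commands = true
}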
