Load balancing on satellites not working as expected

Hello Community,

I am currently running a high-availability cluster, and this is the setup for this question:
(master01/master02) → satellite01 (internal) → (satellite01/satellite02) (external).
Inside the external satellite zone, about 1k hosts (14k services) are checked via NRPE from the external satellites.
The setup with two satellites in the external zone (previously there was only one) has been running for a few days now, and I expected both satellites to load-balance the checks. But judging by the process counts, satellite01 has <200 procs while satellite02 has >400. Comparing systemctl status icinga2.service also shows that satellite02 is clearly running more checks than satellite01.
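
For reference, the masters' zones.conf mirrors this hierarchy roughly as follows (only a sketch; the zone name "master" is assumed, the endpoint names are taken from the description above, and the real IPs are omitted):

object Endpoint "master01" { }

object Endpoint "master02" {
        host = "ip"
        port = "5665"
}

object Zone "master" {
        endpoints = [ "master01", "master02" ]
}

object Endpoint "icingaproxy01.internal.de" {
        host = "ip"
        port = "5665"
}

object Zone "icingaproxy01.internal.de" {
        endpoints = [ "icingaproxy01.internal.de" ]
        parent = "master"
}

// plus the external zone and global-templates, as in the satellite configs below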

Each master/satellite is checked by the icinga check. I tried to read the active_service_checks metric, but the value is the same for both satellites.

  1. Shouldn't the value be different if one satellite does more checks than the other?

  2. Any idea why the load balancing is not working properly?

satellite01 (external) zones.conf (I replaced the real domains with .internal and .external, as in the description above):

object Endpoint "icingaproxy01.internal.de" {
        host = "ip"
        port = "5665"
}

object Zone "icingaproxy01.internal.de" {
        endpoints = [ "icingaproxy01.internal.de" ]
}

object Endpoint "icingaproxy01.external.de" {
}

object Endpoint "icingaproxy02.external.de" {
        host = "ip"
        port = "5665"
}

object Zone "icingaproxy01.external.de" {
        endpoints = [ "icingaproxy01.external.de", "icingaproxy02.external.de" ]
        parent = "icingaproxy01.internal.de"
}

object Zone "global-templates" {
        global = true
}

satellite02 (external) zones.conf:

object Endpoint "icingaproxy01.internal.de" {
        host = "ip"
        port = "5665"
}

object Zone "icingaproxy01.internal.de" {
        endpoints = [ "icingaproxy01.internal.de" ]
}

object Endpoint "icingaproxy01.external.de" {
        host = "ip"
        port = "5665"
}

object Endpoint "icingaproxy02.external.de" {
}

object Zone "icingaproxy01.external.de" {
        endpoints = [ "icingaproxy01.external.de", "icingaproxy02.external.de" ]
        parent = "icingaproxy01.internal.de"
}

object Zone "global-templates" {
        global = true
}
  • Version used: r2.13.2-1
  • Operating system and version: Debian 11
  • Enabled features: api checker mainlog (satellites); api checker ido-mysql influxdb2 mainlog notification (masters)

Hello @rafi01010!

Unfortunately, Icinga doesn't distribute checkables perfectly equally; it assigns each one to an endpoint based on isEven(hash(name)).

If you shut down one of the two sats and the other one ends up with all 600 procs, and the other way around as well, you know everything's configured fine; you can't do much about the split itself.
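
In other words, each checkable is pinned to one endpoint purely by its name, never by current load. A minimal Python sketch of the idea (not Icinga's actual code; md5 merely stands in for the internal hash, and the checkable names are made up):

import hashlib

def is_even(n):
    return n % 2 == 0

def assign(name, endpoints):
    # the mapping depends only on the checkable's name, never on load
    h = int(hashlib.md5(name.encode()).hexdigest(), 16)
    return endpoints[0] if is_even(h) else endpoints[-1]

endpoints = ["icingaproxy01.external.de", "icingaproxy02.external.de"]
checkables = [f"host{i:04d}!nrpe-{j}" for i in range(1000) for j in range(14)]

counts = {e: 0 for e in endpoints}
for c in checkables:
    counts[assign(c, endpoints)] += 1
print(counts)  # deterministic, but not guaranteed to be a 50/50 split

# failover: with only one endpoint alive, it receives everything
assert all(assign(c, [endpoints[0]]) == endpoints[0] for c in checkables)

Because the mapping depends only on the names, the split is deterministic but not necessarily even for a real set of host and service names.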

Best,
A/K

Hello @Al2Klimov,
sorry for my late reply.
Then I think everything is working fine :+1:

Can you tell me something about my first question with the metrics?

Maybe the icinga checks aren't pinned via command_endpoint to the nodes they are supposed to check?

But that's the point: the two satellites split the checks, and I don't want to pin the checks to one specific endpoint. Both satellites show the sum of both satellites in the metric, not each the number of its own checks.

Please share how this is configured.

apply Service "Icinga" {
    import "generic-service"

    check_command = "icinga"
    command_endpoint = host.vars.client_endpoint

    assign where "icinga" in host.groups && host.vars.client_endpoint
}

Each Icinga master/satellite host is in the hostgroup "icinga", and a client_endpoint is defined (the FQDN of the master/satellite).
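
For illustration, one of those host objects looks roughly like this (the template name and address are placeholders):

object Host "icingaproxy02.external.de" {
    import "generic-host"

    address = "ip"
    groups = [ "icinga" ]
    vars.client_endpoint = name    // the object name is the FQDN, reused as the endpoint name
}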

And this one yields equal metrics despite different process counts?

The values are slightly different because I didn't take the second screenshot immediately.
The check source is also always the respective server.

icingaproxy01: [screenshot]

icingaproxy02: [screenshot]