Load-balancing strategy in high-availability mode

I have four Icinga instances, and all endpoints in the master zone work as a high-availability setup. They load-balance the check execution. What is the load-balancing strategy in high-availability mode? Is it based on the number of objects, or something else?

You may not want to have more than two nodes per zone:
https://icinga.com/docs/icinga2/latest/doc/19-technical-concepts/#high-availability

Currently you can’t have more than two nodes within one HA-zone - your nodes would sync each other to death!

The strategy is to calculate which node runs which check by running the object name (?) through a modulo function. The nodes don't talk to each other about which of them has to run which check; instead, each node calculates for itself which checks to run. This also fits upcoming changes in the cluster protocol, because you can just change the modulo for more hosts. All nodes are constantly connected, and as long as this connection is active they keep calculating and spreading the load. Once the connection breaks, the modulo is reduced by one (or the calculation stops completely when only one node is left).
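
To make that concrete, here is a minimal sketch in the Icinga 2 config DSL - not the daemon's actual implementation. It assumes two connected endpoints with placeholder names, uses "myhost!ping4" purely as an example checkable name, and substitutes the string length as a toy stand-in for whatever deterministic hash Icinga really uses:

```
/* Illustrative sketch only - not Icinga's real internals.
 * The string length below is just a toy stand-in for whatever
 * deterministic hash the daemon actually derives from the name. */
var connected = [ "master1.example.com", "master2.example.com" ]  // endpoints currently reachable (placeholders)
var my_index = 0                                                  // this node's position in that list

var bucket = "myhost!ping4".len() % connected.len()               // every node computes the same bucket independently

if (bucket == my_index) {
  log("this node would execute the check")                        // no negotiation between nodes is needed
}
```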

Sorry for being a bit imprecise in my explanations; I hope it helps, though.

This setup has been running for more than two years without problems, except that frequent restarts bring down the linked core: after one node is restarted, the other node goes down immediately.

My zones.conf looks like this:

```
object Endpoint "10.187.179.18" {
        host = "10.187.179.18"
        port = "5665"
        log_duration = 20m
}

object Endpoint "11.36.68.171" {
  host = "11.36.68.171"
}

object Endpoint "11.36.10.239" {
        host = "11.36.10.239"
}

object Endpoint "11.36.97.183" {
        host = "11.36.97.183"
}

object Zone "master" {
  endpoints = [ "10.187.179.18", "11.36.68.171", "11.36.10.239", "11.36.97.183"]
}

object Zone "global-templates" {
  global = true
}
```
According to you, should the master zone include at most two nodes?

This will kill your system sooner or later! The cluster sync protocol is flawed and cannot survive more than two endpoints within one zone definition - you have four.

The reason why this is not getting reworked is that even the biggest setups seem to have sufficient redundancy and load balancing by just using two masters (or two satellites) per zone.

Thanks for your patience! The reason why there are four nodes is that there are 2291 services in total; when checking the services, the CPU utilization is very high, so we need four machines to balance the load.
The version of icinga2 is v2.5.4.
What should I do? Perhaps I should set up two zones with two masters per zone, like the configuration below. But in this mode, do we need to enable any other features? And are there relevant configuration documents and precautions?

        host = "10.187.179.18"
        port = "5665"
        log_duration = 20m
}

object Endpoint "11.36.68.171" {
  host = "11.36.68.171"
}

object Endpoint "11.36.10.239" {
        host = "11.36.10.239"
}

object Endpoint "11.36.97.183" {
        host = "11.36.97.183"
}

object Zone "master1" {
  endpoints = [ "10.187.179.18", "11.36.68.171"]
}
object Zone "master2" {
  endpoints = [ "11.36.10.239", "11.36.97.183"] 
}

object Zone "global-templates" {
  global = true
}
```

Some of our customers have several hundred thousand Services in their setup with only two masters, so Icinga should be able to handle the load. There are several tuning tricks you can use:

  • Scale up your Icinga nodes. Icinga can make use of quite a lot of resources. With “only” 2300 Services, this might be the only tuning you will really have to do
  • Let the masters only deal with check results; don’t let them run any plugins. You can achieve this by running all checks on satellites. You can spread the checks over several pairs of satellites, but be aware that every second-level satellite and agent can only be connected to one satellite zone. Running plugins takes quite a lot of resources, which you can avoid by not running them on the masters (see the zones.conf sketch after this list)
  • Split supporting services off to extra nodes (database, grapher, Icinga Web) - please note that this is mostly useful for very big setups
  • Update your Icinga. Please be aware that the current Icinga release is 2.11.3, with 2.12 already in release candidate state, so you’re missing out on several important improvements and features
  • Use names for your Endpoints. Please do yourself the favor and use the FQDN as the name for each Endpoint, so you can use several “configuration tricks” you can find in the documentation or in Director (if you use it) - the sketch below does this
  • There’s more, but these are the first that come to mind
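
If you offload the checks to satellites, a zones.conf along these lines is one way to do it - a sketch only, with placeholder FQDNs, where the satellites execute the plugins and the masters mainly process the results:

```
object Endpoint "master1.example.com" {
  host = "master1.example.com"
}

object Endpoint "master2.example.com" {
  host = "master2.example.com"
}

object Endpoint "satellite1.example.com" {
  host = "satellite1.example.com"
}

object Endpoint "satellite2.example.com" {
  host = "satellite2.example.com"
}

object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com" ]
}

// hosts assigned to this zone have their plugins executed on the satellites
object Zone "satellite" {
  endpoints = [ "satellite1.example.com", "satellite2.example.com" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}
```

Host and Service objects that should be checked by the satellites then need to be assigned to that zone, for example by placing their configuration in zones.d/satellite/ on the config master.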

Thanks a lot! The high CPU utilization is indeed caused by running plugins. I’ll think over your suggestions.
