We have a top-down clustered setup with 2 masters and 6 satellite zones (each with 2 servers). Recently the Icinga Web dashboard was showing some services as 'late', with old/incorrect check results coming from one of the satellites. Hitting 'check now' did not reset the state (we have had some success with that previously; yes, I'm moving to using the API as the command source soon). Running the check itself from the command line on the satellite produced correct output.

Then I noticed that the icinga2.service run time was very different between the two satellites in that zone: 36 days versus 104 days (the 'bad' one). Looking at all the other satellites, most have the same service run time within their zone, but a few are 5-10 days apart.

I 'fixed' the issue by stopping icinga2.service on the 'bad' satellite until all the checks were coming from the good/current one; restarting the service then redistributed the check sources and everything was current/correct again. But now I fear the time difference may lead to future issues.

So my questions are simple:
- Should the satellites in the same zone both have the same icinga2.service run time?
- Is there a breaking point where too much difference causes problems?
- Any ideas what could cause this (other than the obvious: someone only restarted one of them)?
- What is the best approach to avoid this in the future (other than the obvious: reload them both)?
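For reference, this is roughly how I'm comparing service start times across a zone. It's a sketch, not part of my production tooling: the hostnames are passed as arguments, and it assumes ssh access to both satellites plus GNU date and a systemd new enough for `systemctl show --value` (Debian 9's is).

```shell
#!/bin/sh
# Sketch: compare icinga2.service start times on the two satellites of
# a zone. Hostnames are placeholders supplied as arguments; assumes ssh
# access, GNU date, and systemctl's --value option.

# Whole days between two date strings (computed in UTC to avoid DST skew).
days_between() {
    d1=$(date -u -d "$1" +%s)
    d2=$(date -u -d "$2" +%s)
    diff=$(( d1 - d2 ))
    [ "$diff" -lt 0 ] && diff=$(( -diff ))
    echo $(( diff / 86400 ))
}

# When the icinga2 unit last entered the "active" state on a remote host.
start_time() {
    ssh "$1" systemctl show icinga2 --property=ActiveEnterTimestamp --value
}

# Only hit the network when two hostnames are actually given.
if [ "$#" -eq 2 ]; then
    t1=$(start_time "$1")
    t2=$(start_time "$2")
    echo "$1 started: $t1"
    echo "$2 started: $t2"
    echo "difference: $(days_between "$t1" "$t2") days"
fi
```

Running it against both members of each zone is how I spotted the 36-day vs 104-day gap; a cron job doing the same comparison and alerting above some threshold would at least catch the drift early.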
icinga2 version: r2.8.1-1
System information:
Platform: Debian GNU/Linux
Platform version: 9 (stretch)
Kernel: Linux
Kernel version: 4.4.0-141-generic
Architecture: x86_64
Build information:
Compiler: GNU 6.3.0
Build host: d3c3d2a588bd