Top-down satellites icinga2.service run time delta

We have a top-down cluster set up with 2 masters and 6 satellite zones (each with 2 servers). Recently we had issues where the Icinga Web dashboard showed some services as ‘late’, with old/incorrect check results from one of the satellites. Hitting ‘Check now’ did not reset the state (we have had some success with that previously; yes, I’m moving the command source to the API soon). Running the check itself from the command line on the satellite produced correct output.

Then I noticed that the icinga2.service running time on the two satellites was way different: 36 days vs. 104 days (the “bad” one). Looking at all the other satellites, most have the same service run times within their zone, but a few are 5-10 days apart. I ‘fixed’ the issue by stopping icinga2.service on the “bad” satellite until all the checks were coming from the good/current one; restarting the service then redistributed the check sources, and all results were current/correct. But now I fear the difference in run time may lead to future issues.

So my question is simple: should the satellites in the same zone both have the same icinga2.service run time? Is there a ‘breaking point’ where too much difference causes problems? Any ideas what could cause this (other than the obvious: someone only started one)? What is the best approach to avoid this in the future (other than the obvious: reload them both)?

icinga2 version: r2.8.1-1

System information:
Platform: Debian GNU/Linux
Platform version: 9 (stretch)
Kernel: Linux
Kernel version: 4.4.0-141-generic
Architecture: x86_64

Build information:
Compiler: GNU 6.3.0
Build host: d3c3d2a588bd
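
To narrow down whether one satellite is responsible for the ‘late’ services, it can help to group overdue checks by their check source. The following is only a minimal sketch under assumptions: the field names (`last_check`, `check_interval`, `check_source`), the `grace` value, and the sample data are hypothetical stand-ins for whatever your monitoring export actually provides.

```python
import time
from collections import Counter


def late_services_by_source(services, now=None, grace=60.0):
    """Count overdue services per check source.

    A service counts as late when its last check is older than
    check_interval + grace seconds (grace is an assumed tolerance).
    """
    now = time.time() if now is None else now
    late = Counter()
    for svc in services:
        deadline = svc["last_check"] + svc["check_interval"] + grace
        if now > deadline:
            late[svc["check_source"]] += 1
    return dict(late)


# Made-up sample data; timestamps are plain seconds for readability.
sample = [
    {"name": "disk", "check_source": "sat-a1", "last_check": 1000.0, "check_interval": 300.0},
    {"name": "load", "check_source": "sat-a1", "last_check": 1900.0, "check_interval": 300.0},
    {"name": "ping", "check_source": "sat-a2", "last_check": 1950.0, "check_interval": 60.0},
]

print(late_services_by_source(sample, now=2000.0))
```

If nearly all late services point at the same check source, that endpoint is the one to look at first.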

Runtime should not matter in any way. BUT you should really do an upgrade, 2.8.1 is quite old.

And use an NTP check in Icinga to verify that all nodes are on the same time (not timezone). Differences in time can lead to a lot of bad things when it comes to clusters.

@twidhalm - yes, we have an NTP check and they’re all synced; all are using the UTC timezone too. The difference is in the output of supervisorctl status icinga2.service, where one satellite shows it has been running for 104 days and the other for 36 days. The other satellite zones all have the same running time.

Ok, that’s good. That should not be a problem. Whenever an instance restarts, it resyncs the current status.

I could imagine a problem with check results that were stored with a future timestamp and now do not reset. That is something which has been fixed with 2.10.3.
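
One way to test the future-timestamp theory is to look for services whose last check time lies ahead of the current clock. This is a rough sketch, not how Icinga handles it internally: the `name` and `last_check` fields, the `tolerance` value, and the sample records are all assumptions for illustration.

```python
import time


def future_dated_results(services, now=None, tolerance=5.0):
    """Return names of services whose last check lies in the future.

    tolerance (seconds) is an assumed allowance for small clock jitter.
    """
    now = time.time() if now is None else now
    return [s["name"] for s in services if s["last_check"] > now + tolerance]


# Made-up sample data: "load" claims a check 10 minutes in the future.
records = [
    {"name": "disk", "last_check": 1990.0},
    {"name": "load", "last_check": 2600.0},
    {"name": "ping", "last_check": 2003.0},
]

print(future_dated_results(records, now=2000.0))
```

Any service that shows up here would be a candidate for the stuck state described above.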