I’m currently running a setup with two HA masters at home. Both connect to two HA satellites (each in its own data centre, about 5 ms apart), which in turn connect to all the systems on the internet that are supposed to be monitored (each of these has its own agent installed).
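For reference, the zone layout looks roughly like this (endpoint and zone names are simplified placeholders, hosts/ports omitted):

object Endpoint "master1" { }
object Endpoint "master2" { }

object Zone "master" {
  endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite1" { }
object Endpoint "satellite2" { }

object Zone "satellite" {
  endpoints = [ "satellite1", "satellite2" ]
  parent = "master"
}

// one agent endpoint/zone per monitored system, e.g.:
object Endpoint "node-12" { }

object Zone "node-12" {
  endpoints = [ "node-12" ]
  parent = "satellite"
}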
Every time the internet connection at home fails (thus interrupting communication between the masters and the satellites), the monitored systems start executing a large number of checks once the connection is up again, often causing significant load (sometimes crashing smaller systems). The number of checks started appears to depend on the duration of the disconnect: the longer the masters have been separated from the satellites, the more checks are started on the monitored systems (it looks as if the missed checks are all being run at once).
I’d have assumed that, since the satellites keep running unaffected and all monitored systems have their agent’s parent set to the satellites’ zone(s), checks would continue as normal on the monitored systems. Since this apparently is not the case: is it possible that checks are scheduled on the master(s), queued while the connection is down, and then executed all at once when it comes back? Is there any way to work around this?
Checks are scheduled on the endpoint(s) of the zone the host objects belong to. In your case I’d assume the hosts belong to the satellite zone(s), so checking continues as usual while the masters are not reachable. The satellites record the check results and replay them to the masters once these are back online (depending on log_duration, which is 1d by default). This replay is done as fast as possible and causes higher load on the masters and on the network.
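If that replay is what hurts you, you could lower log_duration on the relevant Endpoint objects. A minimal sketch (endpoint name and value are just examples, the default is 1d):

object Endpoint "master1" {
  // keep at most one hour of replay log for this endpoint;
  // log_duration = 0 would disable the replay log entirely
  log_duration = 1h
}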
I tried to find out more about the (re)scheduling and enabled the debug log on the masters, the satellites and the monitored host. I found that once the masters come back online, the monitored host receives a very large number of event::ExecuteCommand messages from the satellites (so a corresponding number of check processes is spawned, causing high load).
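For completeness: debug logging was enabled via Icinga 2’s debuglog feature, which, if I read the shipped config correctly, boils down to a FileLogger object roughly like this (the path may differ per installation):

object FileLogger "debug-file" {
  severity = "debug"
  path = LogDir + "/debug.log"
}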
Looking at a satellite, I found:
[2025-03-12 12:38:19 +0000] debug/CheckerComponent: Scheduling info for checkable 'node-12!linux-nic-wg_dialin' (2025-03-12 12:37:51 +0000): Object 'node-12!linux-nic-wg_dialin', Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09)
This looks like a check being scheduled at 12:38:19 that is supposed to run at 12:37:51 (which presumably makes the host execute it immediately, since 12:37:51 is already in the past at the time of scheduling). There are many other log entries from 12:38 scheduling checks for 12:37 and 12:36.
On the masters I haven’t found anything which looks like it might be related to the issue.
Following a single check for the host mentioned above on one of the satellites, the picture looks like this:
[2025-03-12 12:36:01] … Next Check: 2025-03-12 12:36:01 +0000(1.74178e+09).
[2025-03-12 12:36:11] … Next Check: 2025-03-12 12:36:11 +0000(1.74178e+09).
[2025-03-12 12:36:21] … Next Check: 2025-03-12 12:36:21 +0000(1.74178e+09).
[2025-03-12 12:36:31] … Next Check: 2025-03-12 12:36:31 +0000(1.74178e+09).
[2025-03-12 12:36:41] … Next Check: 2025-03-12 12:36:41 +0000(1.74178e+09).
[2025-03-12 12:36:51] … Next Check: 2025-03-12 12:36:51 +0000(1.74178e+09).
[2025-03-12 12:37:01] … Next Check: 2025-03-12 12:37:01 +0000(1.74178e+09).
[2025-03-12 12:37:11] … Next Check: 2025-03-12 12:37:11 +0000(1.74178e+09).
[2025-03-12 12:37:21] … Next Check: 2025-03-12 12:37:21 +0000(1.74178e+09).
[2025-03-12 12:37:31] … Next Check: 2025-03-12 12:37:31 +0000(1.74178e+09).
[2025-03-12 12:37:41] … Next Check: 2025-03-12 12:37:41 +0000(1.74178e+09).
[2025-03-12 12:37:51] … Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09).
[2025-03-12 12:38:01] … Next Check: 2025-03-12 12:38:01 +0000(1.74178e+09).
[2025-03-12 12:38:04] … Next Check: 2025-03-12 12:36:51 +0000(1.74178e+09).
[2025-03-12 12:38:05] … Next Check: 2025-03-12 12:37:01 +0000(1.74178e+09).
[2025-03-12 12:38:05] … Next Check: 2025-03-12 12:37:11 +0000(1.74178e+09).
[2025-03-12 12:38:06] … Next Check: 2025-03-12 12:37:21 +0000(1.74178e+09).
[2025-03-12 12:38:07] … Next Check: 2025-03-12 12:37:31 +0000(1.74178e+09).
[2025-03-12 12:38:07] … Next Check: 2025-03-12 12:37:41 +0000(1.74178e+09).
[2025-03-12 12:38:08] … Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09).
[2025-03-12 12:38:09] … Next Check: 2025-03-12 12:37:11 +0000(1.74178e+09).
[2025-03-12 12:38:09] … Next Check: 2025-03-12 12:37:11 +0000(1.74178e+09).
[2025-03-12 12:38:10] … Next Check: 2025-03-12 12:37:21 +0000(1.74178e+09).
[2025-03-12 12:38:11] … Next Check: 2025-03-12 12:37:31 +0000(1.74178e+09).
[2025-03-12 12:38:12] … Next Check: 2025-03-12 12:37:41 +0000(1.74178e+09).
[2025-03-12 12:38:14] … Next Check: 2025-03-12 12:36:51 +0000(1.74178e+09).
[2025-03-12 12:38:14] … Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09).
[2025-03-12 12:38:15] … Next Check: 2025-03-12 12:37:01 +0000(1.74178e+09).
[2025-03-12 12:38:15] … Next Check: 2025-03-12 12:37:01 +0000(1.74178e+09).
[2025-03-12 12:38:16] … Next Check: 2025-03-12 12:37:11 +0000(1.74178e+09).
[2025-03-12 12:38:17] … Next Check: 2025-03-12 12:37:21 +0000(1.74178e+09).
[2025-03-12 12:38:17] … Next Check: 2025-03-12 12:38:01 +0000(1.74178e+09).
[2025-03-12 12:38:17] … Next Check: 2025-03-12 12:37:31 +0000(1.74178e+09).
[2025-03-12 12:38:17] … Next Check: 2025-03-12 12:38:01 +0000(1.74178e+09).
[2025-03-12 12:38:18] … Next Check: 2025-03-12 12:37:41 +0000(1.74178e+09).
[2025-03-12 12:38:19] … Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09).
[2025-03-12 12:38:19] … Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09).
[2025-03-12 12:38:20] … Next Check: 2025-03-12 12:37:11 +0000(1.74178e+09).
[2025-03-12 12:38:20] … Next Check: 2025-03-12 12:38:11 +0000(1.74178e+09).
[2025-03-12 12:38:21] … Next Check: 2025-03-12 12:37:21 +0000(1.74178e+09).
[2025-03-12 12:38:21] … Next Check: 2025-03-12 12:37:31 +0000(1.74178e+09).
[2025-03-12 12:38:22] … Next Check: 2025-03-12 12:37:41 +0000(1.74178e+09).
[2025-03-12 12:38:22] … Next Check: 2025-03-12 12:37:51 +0000(1.74178e+09).
[2025-03-12 12:38:22] … Next Check: 2025-03-12 12:38:01 +0000(1.74178e+09).
[2025-03-12 12:38:23] … Next Check: 2025-03-12 12:38:11 +0000(1.74178e+09).
[2025-03-12 12:39:53] … Next Check: 2025-03-12 12:39:53 +0000(1.74178e+09).
[2025-03-12 12:40:02] … Next Check: 2025-03-12 12:40:02 +0000(1.74178e+09).
[2025-03-12 12:40:12] … Next Check: 2025-03-12 12:40:12 +0000(1.74178e+09).
[2025-03-12 12:40:21] … Next Check: 2025-03-12 12:40:21 +0000(1.74178e+09).
[2025-03-12 12:40:31] … Next Check: 2025-03-12 12:40:31 +0000(1.74178e+09).
[2025-03-12 12:40:41] … Next Check: 2025-03-12 12:40:41 +0000(1.74178e+09).
Until 12:38:01 everything is fine: the check in question is supposed to run every ten seconds, and that is exactly what happens. In this example I cut the masters off at 12:35:30 and re-enabled the connection at 12:37:45. According to the logs, the checks keep running as planned even while both masters are offline. After 12:38:01, once the masters are reconnected to the satellites, apparently all checks that already ran from 12:36:51 onwards are scheduled again (for some points in time there are even multiple entries) and are then all executed at once on the monitored host (which ends up starting about 30 check processes in parallel for each of these services). At 12:39:53 everything is in sync again and the checks run every ten seconds as before.
Is there any way to prevent re-running the already executed checks?
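If the re-scheduling itself can’t be avoided, would limiting parallel check execution on the agents at least soften the impact? Something along these lines in the agents’ constants.conf is what I have in mind (the value is just a guess, the default is 512 as far as I know):

/* limit the number of check processes the agent may run in parallel */
const MaxConcurrentChecks = 16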