High load on Satellites after config deploy

Hi all,

just out of interest:

Why is it that a deployment of a new config generates an extremely high load on my satellites?

This lasts about 5 minutes before the load goes back to “normal”.

(Though the load is still quite high.)

Setup is:
Master-HA (icinga2 v2.10.3)
two Satellites in one zone (icinga2 v2.10.3)

  • concurrent_checks is set to 128

Hosts: about 2030 are checked by the satellites
Services: about 2900 are checked by the satellites

Check intervals are mostly 5 minutes; the check plugins are mostly check_icmp and check_nwc_health.
Is it just due to the “big” check_nwc_health?

Thanks and best regards
:slight_smile:

If check_nwc_health is the process being executed there, then yes. I would guess that after a config redeploy all checks on the satellite are queued to run, and it will run as many of them in parallel as possible.
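
If it helps, a rough sketch of where that parallel limit is capped, assuming concurrent_checks is set via the checker feature on the satellites (the file path is an assumption, the value is the 128 you mentioned; lowering it would flatten the spike after a deploy at the cost of the queued checks taking longer to drain):

```
// /etc/icinga2/features-enabled/checker.conf on each satellite (assumed location)
object CheckerComponent "checker" {
  // upper bound for checks executed in parallel on this node;
  // after a config deploy the queued checks are worked off up to this limit
  concurrent_checks = 128
}
```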

Hi,

typically the cluster config sync, validation and reload take some resources, but not that many. Depending on the configured check interval, Icinga tries to adjust the schedule and run certain checks within these 5 minutes to prevent a “blindfolded” restart. How long do these checks typically run (execution_time), and which check interval is defined for them?

Cheers,
Michael

Hi,

thanks for your answers.

I “suspected” something like this. Maybe a dumb question, but why?

avg_execution_time for the satellites (taken from the icinga check) is around 1.2s.

It is quite spread out, though, with checks running 0.1s, checks running 4.2s, and everything in between.
Generally speaking, checks in an OK state take less than 1s.
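
(For context, the avg_execution_time above comes from a service using the built-in icinga CheckCommand; a minimal sketch of how such a service might look, with “satellite1” as a placeholder host name:)

```
// hypothetical service definition; "satellite1" stands in for the real satellite host
object Service "icinga" {
  host_name     = "satellite1"
  check_command = "icinga"   // built-in check from the ITL, reports avg_execution_time,
                             // avg_latency etc. as performance data
}
```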

I hadn’t considered the check execution time. The checks the satellites run go to remote locations (sometimes over a not-so-fast connection). The check_interval is 5 minutes for all checks run by the satellites.

Since you’ve said that the check interval is 5 minutes, it is likely that all of the checks run within this 5-minute interval, with adjusted offsets to avoid many of them landing in the same second.

I would investigate further and analyse the schedule_start and execution_start times from within the last_check_result key in your service objects via the REST API. There may be overlaps and latency involved, plus slow plugin responses causing a delay for the other pending checks.
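
A minimal sketch of such a query, assuming the API is reachable locally and the satellites’ zone is called “satellite-zone” (credentials, host, port and zone name are placeholders for your environment):

```
# Fetch check_interval plus the full last_check_result (which contains
# schedule_start, execution_start, execution_end, check_source, ...) for all
# services in the satellite zone
curl -k -s -u apiuser:apipassword \
  -H 'Accept: application/json' \
  -H 'X-HTTP-Method-Override: GET' \
  -X POST 'https://localhost:5665/v1/objects/services?pretty=1' \
  -d '{
        "attrs": [ "__name", "check_interval", "last_check_result" ],
        "filter": "service.zone == \"satellite-zone\""
      }'
```

Comparing schedule_start with execution_start per service should show whether checks pile up right after the deploy.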

Cheers,
Michael

Thanks!
I will see if I can make anything out of the suggested variables’ contents :slight_smile: