High load on Satellites after config deploy

Hi all,

just out of interest:

Why is it that a deployment of a new config generates an extremely high load on my satellites?

This lasts about 5 minutes before the load goes back to “normal”.

(Though the load is still quite high.)

Setup is:
Master-HA (icinga2 v2.10.3)
two Satellites in one zone (icinga2 v2.10.3)

  • concurrent_checks is set to 128

Hosts: about 2030 are checked by the satellites
Services: about 2900 are checked by the satellites

Check intervals are mostly 5 minutes; the check plugins are mostly check_icmp and check_nwc_health.
Is it just due to the “big” check_nwc_health?

Thanks and best regards
:slight_smile:

If check_nwc_health is the process being executed there, then yes. I would guess that after a config redeploy all checks on the satellite are queued to run, and it will run as many of them in parallel as possible.
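
If it helps, a rough sketch of where that parallel limit is capped, assuming concurrent_checks is set via the checker feature on the satellites (the file path is an assumption, the value is the 128 you mentioned; lowering it would flatten the spike after a deploy at the cost of the queued checks taking longer to drain):

```
// /etc/icinga2/features-enabled/checker.conf on each satellite (assumed location)
object CheckerComponent "checker" {
  // upper bound for checks executed in parallel on this node;
  // after a config deploy the queued checks are worked off up to this limit
  concurrent_checks = 128
}
```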

Hi,

typically the cluster config sync, validation and reload take some resources, but not that many. Depending on the configured check interval, Icinga tries to adjust the schedule and run certain checks within these 5 minutes to prevent a “blindfolded” restart. How long do these checks typically run (execution_time), and which check interval is defined for them?

Cheers,
Michael

Hi,

thanks for your answers.

I “suspected” something like this. Maybe a dumb question, but why?

avg_execution_time for the satellites (taken from the icinga check) is around 1.2s.

It is quite spread out, though, with checks running 0.1s, checks running 4.2s, and everything in between.
Generally speaking, checks in an OK state take less than 1s.
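
(For context, the avg_execution_time above comes from a service using the built-in icinga CheckCommand; a minimal sketch of how such a service might look, with “satellite1” as a placeholder host name:)

```
// hypothetical service definition; "satellite1" stands in for the real satellite host
object Service "icinga" {
  host_name     = "satellite1"
  check_command = "icinga"   // built-in check from the ITL, reports avg_execution_time,
                             // avg_latency etc. as performance data
}
```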

I hadn’t considered the check execution time. The checks the satellites run go to remote locations (sometimes over a not-so-fast connection). The check_interval is 5 minutes for all checks run by the satellites.

Since you’ve said that the check interval is 5 minutes, it is likely that all of the checks run within this 5-minute interval, with adjusted offsets to avoid many of them landing in the same second.

I would investigate further and analyse the schedule_start and execution_start times from within the last_check_result key in your service objects via the REST API. There may be overlaps and latency involved, plus slow plugin responses causing a delay for the other pending checks.
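
A minimal sketch of such a query, assuming the API is reachable locally and the satellites’ zone is called “satellite-zone” (credentials, host, port and zone name are placeholders for your environment):

```
# Fetch check_interval plus the full last_check_result (which contains
# schedule_start, execution_start, execution_end, check_source, ...) for all
# services in the satellite zone
curl -k -s -u apiuser:apipassword \
  -H 'Accept: application/json' \
  -H 'X-HTTP-Method-Override: GET' \
  -X POST 'https://localhost:5665/v1/objects/services?pretty=1' \
  -d '{
        "attrs": [ "__name", "check_interval", "last_check_result" ],
        "filter": "service.zone == \"satellite-zone\""
      }'
```

Comparing schedule_start with execution_start per service should show whether checks pile up right after the deploy.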

Cheers,
Michael

Thanks!
I will see if I can make anything out of the suggested variables’ contents :slight_smile: