Dear community,
I’m taking care of a somewhat larger Icinga2 HA cluster that is basically running well and reliably, but has serious performance issues in some areas.
The basic setup: Icinga 2.12.6 on Debian 10, 2 masters, 6 zones with 2 satellites in each. The masters are Dell machines with Intel Xeon Silver (16 cores + HT), 64GB RAM, HW-RAID and enterprise SSDs.
I use a traditional file-based config with:
- 10,000 hosts
- 100,000 services
- 140,000 notifications
- 30,000 dependencies
The most pressing issues I have are:
- reload time: reloading a changed config takes about 15 minutes(!), during which the config master is 100% CPU-busy on all cores and threads
- API performance: some users like to hit the API in suboptimal ways to set downtimes, reschedule checks and fetch results, killing performance within seconds, with all cores and threads at about 60% user and 40% system CPU
While the setup is not optimal, with MySQL and Icingaweb2 running on the masters rather than on dedicated machines, there are no obvious bottlenecks: there’s no network congestion, no I/O wait, and almost all CPU is consumed by Icinga2 itself.
What I’d like to know:
What’s the reason for the very long reload time? Is it just the huge number of objects, or something else? Some users apply sophisticated loops over hosts’ custom vars to create service objects; could this be a problem? Also, is “one notification” equal to “one notification object”, or is it possible to have something like one notification object shared by a number n of services?
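For context, the notification rules look roughly like this (all names hypothetical); as far as I understand the apply mechanism, one such rule is still expanded into a separate Notification object per matched service, which is where my object count comes from:

```
// Sketch, names hypothetical. A single apply rule, but Icinga 2
// generates one Notification object for every matched service.
apply Notification "mail-admins" to Service {
  command = "mail-service-notification"
  user_groups = [ "admins" ]

  assign where service.vars.notify == true
}
```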
The API replies to single, complex requests without any problems (e.g. “give me the state of all services”), but hit it with several simple requests in parallel (e.g. “give me the state of ‘host!service’”) and things go down the drain quickly. It looks like some sort of lock contention, but how do I find out, and how can I solve it?
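To illustrate the pattern I’d like to push users towards: instead of n single-object requests, send one filtered query. A minimal sketch of building such a filter string (host and service names hypothetical, the helper is mine; the filter syntax itself follows the Icinga 2 API filter expressions):

```python
# Sketch: collapse many single-object queries into one filter expression
# that can be sent as a single request to /v1/objects/services.
# The helper function is hypothetical; filter syntax per the Icinga 2 API.

def batch_filter(pairs):
    """Build one filter expression matching several (host, service) pairs."""
    clauses = [
        f'(host.name=="{h}" && service.name=="{s}")'
        for h, s in pairs
    ]
    return " || ".join(clauses)

pairs = [("web01", "ping"), ("web01", "http"), ("db01", "mysql")]
print(batch_filter(pairs))
```

The resulting expression would then go into the JSON body of a single POST to `/v1/objects/services` (with the `X-HTTP-Method-Override: GET` header), rather than n separate GETs. But I’d still like to understand why the parallel-request case hurts so badly in the first place.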
Thanks in advance for any hints.