Dear community,
I’m taking care of a somewhat larger Icinga2 HA cluster that is basically running well and reliably, but has serious performance issues in some areas.
The basic setup: Icinga 2.12.6 on Debian 10, 2 masters, 6 zones with 2 satellites in each. The masters are Dell machines with Intel Xeon Silver (16 cores + HT), 64GB RAM, HW-RAID and enterprise SSDs.
I use a traditional file-based config with:
- 10,000 hosts
- 100,000 services
- 140,000 notifications
- 30,000 dependencies
The most pressing issues I have are:
- reload time: reloading a changed config takes about 15 minutes(!), during which the config master is 100% CPU-busy on all cores and threads
- API performance: some users like to hit the API in suboptimal ways to set downtimes, reschedule checks and fetch results, killing performance within seconds, with all cores and threads at about 60% user and 40% system CPU
While the setup is not optimal, with MySQL and Icingaweb2 running on the masters rather than on dedicated machines, there are no obvious bottlenecks: there’s no network congestion, no I/O wait, and almost all CPU is consumed by Icinga2 itself.
What I’d like to know:
What’s the reason for the very long reload time? Is it just the huge number of objects, or something else? Some users apply sophisticated loops over hosts’ custom vars to create service objects; could this be a problem? Also, is “one notification” equal to “one notification object”, or is it possible to have something like one notification object shared by a number n of services?
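For context, the notification rules look roughly like this (all names hypothetical); as far as I understand the apply mechanism, one such rule is still expanded into a separate Notification object per matched service, which is where my object count comes from:

```
// Sketch, names hypothetical. A single apply rule, but Icinga 2
// generates one Notification object for every matched service.
apply Notification "mail-admins" to Service {
  command = "mail-service-notification"
  user_groups = [ "admins" ]

  assign where service.vars.notify == true
}
```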
The API replies to single, complex requests without any problems (e.g. “give me the state of all services”), but hit it with several simple requests in parallel (e.g. “give me the state of ‘host!service’”) and things go down the drain quickly. It looks like some sort of lock contention, but how do I find out, and how can I solve it?
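To illustrate the pattern I’d like to push users towards: instead of n single-object requests, send one filtered query. A minimal sketch of building such a filter string (host and service names hypothetical, the helper is mine; the filter syntax itself follows the Icinga 2 API filter expressions):

```python
# Sketch: collapse many single-object queries into one filter expression
# that can be sent as a single request to /v1/objects/services.
# The helper function is hypothetical; filter syntax per the Icinga 2 API.

def batch_filter(pairs):
    """Build one filter expression matching several (host, service) pairs."""
    clauses = [
        f'(host.name=="{h}" && service.name=="{s}")'
        for h, s in pairs
    ]
    return " || ".join(clauses)

pairs = [("web01", "ping"), ("web01", "http"), ("db01", "mysql")]
print(batch_filter(pairs))
```

The resulting expression would then go into the JSON body of a single POST to `/v1/objects/services` (with the `X-HTTP-Method-Override: GET` header), rather than n separate GETs. But I’d still like to understand why the parallel-request case hurts so badly in the first place.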
Thanks in advance for any hints.