I would like to better understand what happens inside the system when checks run late in relation to MaxConcurrentChecks.
I have a situation where some checks run very often (every 2-5 seconds), and on the same machine other checks run less frequently (every 5 minutes).
To make sure every check gets to run at its scheduled time, I set MaxConcurrentChecks to the total number of checkable objects (hosts + services).
The problem is that checks sometimes run a bit late, because their execution time varies unpredictably while they wait for the check target (network device, server, …) to answer. The effect is almost invisible when a check is scheduled every minute, but when data is collected at one-second resolution, a delay is easy to spot.
The part I am not sure I understand well is the following. I am assuming two things:
- Because some checks run late and take longer to finish, they hold an execution slot for longer before releasing it, and so can disturb the schedule Icinga originally planned for the check.
- The next execution of a check cannot be scheduled while the previous execution of that same check is still running, to avoid stacking, even if the current number of concurrent checks is below MaxConcurrentChecks.
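To make these two assumptions concrete, here is a toy model in Python. It is only a sketch of what I assume, not Icinga's actual scheduler: the `Check` class, the `simulate` function, the one-second tick, and all intervals and durations are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Check:
    name: str
    interval: int        # seconds between scheduled starts
    durations: list      # run times in seconds, cycled run after run
    next_due: int = 0    # next scheduled start time
    busy_until: int = 0  # time at which the current run ends
    runs: int = 0        # number of runs started so far


def make_checks():
    """Hypothetical workload: two fast checks and one slower one."""
    return [
        Check("fast-a", interval=2, durations=[1, 1, 5, 1]),  # every 3rd run is slow
        Check("fast-b", interval=2, durations=[1]),
        Check("slow", interval=10, durations=[3]),
    ]


def simulate(checks, max_slots, horizon):
    """Return {check name: list of start delays in seconds}."""
    delays = {c.name: [] for c in checks}
    for now in range(horizon):
        # Slots currently held by runs that have not finished yet.
        running = sum(1 for c in checks if c.busy_until > now)
        # Serve the most overdue checks first.
        for c in sorted(checks, key=lambda c: c.next_due):
            if now < c.next_due:
                continue  # not due yet
            if c.busy_until > now:
                continue  # assumption 2: previous run still going, no stacking
            if running >= max_slots:
                continue  # assumption 1: due, but every slot is taken
            delays[c.name].append(now - c.next_due)
            c.busy_until = now + c.durations[c.runs % len(c.durations)]
            c.runs += 1
            c.next_due += c.interval  # schedule stays on the ideal grid
            running += 1
    return delays


for slots in (1, 3, 6):
    print(slots, simulate(make_checks(), max_slots=slots, horizon=20))
```

In this toy model, with as many slots as checks, a slow run of `fast-a` only delays the following runs of `fast-a` itself (by 3, 2, then 1 seconds), and adding slots beyond the number of checks changes nothing. With a single slot, the other checks are delayed as well. Whether Icinga really behaves like this is exactly what I am unsure about.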
To work around this problem, I gave Icinga more room to work and almost doubled the MaxConcurrentChecks value. This worked, but I am still not 100% sure why, and I would appreciate more insight into it and into Icinga's internal workings if possible.
The Icinga version used is 2.11.4-1 on Red Hat 7.5.