[feedback & ideas] icinga scheduling and high frequency checks

Someone · February 2, 2021, 11:54am

Hello there,
I found myself in an interesting situation recently regarding scheduling and server load when mixing high and low frequency checks.

The situation was the following:

it was not possible to add a node to the cluster to offload the Icinga server
low frequency checks (every 5 minutes) and high frequency checks (every 2 seconds) were running together on the same server
the maxconcurrent var was set to more than the total number of hosts and services the server had to run (around 4k)
the server had 32 cores
checks had a decent execution time (around a second or less)

Data regarding low frequency checks was afterwards exported out of Icinga to another tool to visualize it better. What I saw was that, within a constantly repeating cycle of 5 minutes, checks running every 2 seconds had an average delay of 5 seconds during the first two minutes, which means a pretty big loss of information.

As you’ve probably already understood, the server is undersized for the number of checks it must run. However, I think the situation can be improved.

As far as I’ve understood thanks to this:
https://icinga.com/docs/icinga-2/latest/doc/19-technical-concepts/#technical-concepts-check-scheduler

The current scheduling doesn’t prioritize checks relative to each other. I think the higher the frequency of a check is, the more it should be prioritized by the scheduler. Why? Regardless of whether the data is exported to a third-party visualization tool, shifting a check that repeats every few minutes by a few seconds is almost invisible, while shifting a check that repeats every few seconds by a few seconds is very visible (late checks).
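To put rough numbers on that argument, here is a minimal sketch that expresses the same 5 second delay as a fraction of the check interval. The function name is made up for illustration and is not part of Icinga.

```python
def relative_delay(delay_s: float, interval_s: float) -> float:
    """Return the delay as a fraction of the check interval."""
    return delay_s / interval_s


# Same 5 second delay applied to the two intervals from this thread.
for label, interval_s in [("2 s check", 2.0), ("5 min check", 300.0)]:
    ratio = relative_delay(5.0, interval_s)
    print(f"{label}: delayed by {ratio:.1%} of its interval")
```

For the 2 second check the delay is larger than the interval itself, while for the 5 minute check it is under 2% of the interval.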

What do you guys think of this idea? Good? Bad? Could it be better? Why?

I’m pinging @theFeu, the Icinga team could be interested in this.

theFeu · February 2, 2021, 12:08pm

Hey there!
I recall our @htriem doing a project on check scheduling for his finals, so I’m forwarding the ping to him

dgoetz · February 2, 2021, 12:44pm

There are actually two problems here, I think.

The first is the priority you propose, or rather when a check is first executed and how checks are distributed afterwards to get an even load. And as @theFeu mentioned, there is already some ongoing work on this at an early stage.

But when you say the 2 second check is only delayed for the first 2 minutes, there is still the problem that a start/reload is very resource hungry and causes additional problems like this. So not only better scheduling but also a non-disruptive reload would be needed, and perhaps also some changes to the start of the service. A lot of work has already been done on this, always balancing performance against consistency of the configuration load.

htriem · February 11, 2021, 4:23pm

Well, I can confirm this topic is something we’re taking a look at. We’ve had quite a few discussions concerning the topic, so let me just share a couple of thoughts.

First of all, even the most intricate, smartest, perfectly working check scheduling system has its limitations. While it is true that we can still improve the distribution of various loads on a system, if you’re trying to do too much with an undersized system, you will run into bottlenecks.
If that’s the case, I suggest adjusting the configuration any day over expanding system capacity - really think about which checks are being executed, how often that’s the case, and how fresh a check result really needs to be for efficient monitoring.

When it comes to prioritizing checks and their execution, a couple of ideas are floating around - you’ve mentioned one of them, prioritizing the check execution depending on the configured check interval.
I personally don’t think that this is the way to go, simply because an interval is detached from whatever the system is being required to handle in terms of load. While it’s obviously true that you need to consider the interval in scheduling - like you’ve correctly assessed, a check being executed once a day can deviate from that hard interval in a much bigger way than a check being executed every second - at the end of the day, the load a single check execution causes can vary wildly, and that’s the area where I think we should evaluate and improve the efficiency of distribution.
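To illustrate what distributing by load rather than by interval could look like, here is a minimal sketch of a greedy placement: the most expensive check goes first, always into the currently least loaded slot. This is only an illustration and not how Icinga schedules checks; the function name, the check names and the per-check costs (in milliseconds) are all assumptions.

```python
import heapq


def spread_by_cost(costs: dict[str, int], slots: int) -> list[list[str]]:
    """Greedy placement: most expensive check first, into the least loaded slot."""
    heap = [(0, i) for i in range(slots)]  # (accumulated cost, slot index)
    heapq.heapify(heap)
    plan: list[list[str]] = [[] for _ in range(slots)]
    for name, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, i = heapq.heappop(heap)  # slot with the lowest load so far
        plan[i].append(name)
        heapq.heappush(heap, (load + cost, i))
    return plan


# Hypothetical execution costs in milliseconds.
costs = {"disk": 900, "http": 200, "ping": 100, "snmp": 600, "ssh": 200}
print(spread_by_cost(costs, slots=2))
```

Note that the interval does not appear anywhere in this sketch; a real scheduler would still have to respect it, as the paragraph above says.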

In any case, watch this space. We’re in a very early stage, but this topic is being discussed among the Icinga developers.

Someone · February 11, 2021, 7:15pm

Thank you both for your reply and your time.

So as far as I understand, and from what I can imagine from my experience or read about the monitoring world across the internet, there could be different best case and worst case scenarios depending on how Icinga is used (graphing, alerting, load of checks, etc.).

So basically the questions we are trying to answer are

1. How late can a check be, and under what circumstances, to fit into the better scheduling scenario?
  or also
2. Should we have different separate scheduling modes, or an adaptive one able to switch and scale between different scenarios so that users don’t have to do too much manual configuration (like prioritizing checks on specific templates/hosts/services)?

I’ll try to answer them:

This depends mostly on the user’s needs (assuming they are aware of them). The user may or may not care much about some checks being late in a normal situation. However, under heavy load, low IO, low CPU resources, or other heavy technical constraints, the scheduler could have the possibility to go into a safe mode, choose by itself which checks to prioritize to lessen the impact on the host system, and bypass any configuration the user has made about it. That means it could still launch, for example, checks that are CPU bound every second even if memory is almost full.
Also, if you give users the possibility to manually prioritize checkables, then I think giving more information and/or visibility on the internal state and planned load of the scheduler is required to help users configure their checks.
A scheduler could work in various ways, to name a few: higher frequency checks are prioritized, or best effort mode for every check, or safe mode to preserve the host system if possible while getting the job done, or default mode + handmade prioritization by the user. There are probably other possibilities, but those are the ones I thought of first. Given the variety of possible scenarios and the coding complexity this seems to require at first glance, as a conclusion to those reflections I think there are two possibilities, which are completely opposed:

A) Keep scheduling stupidly simple, but leave the user the possibility to manually choose which hosts/services to prioritize. This could be done with a new attribute on those objects, for example.
B) Train an AI (TensorFlow for example) on a lot of very different execution situations and let it choose the best scenario for you. I’m not too much into machine learning, but I think its inputs, used each time a new check is scheduled, should be:

previous exec time of the check
state type (soft/hard)
flapping value
retry_interval
check_interval
the timestamp the normal scheduler would have output
the host name, the service name (if applicable), the check command names used in both (if applicable)
avg exec times for the check commands used by host and service (if applicable)
global system load
remaining memory
global system iops
global system io latency

And as output, the timestamp at which the check is scheduled.

There are probably other interesting pieces of information to feed to the AI to help it compute the best timestamp given the check stats and history and the system status, but you get the general idea.
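As a rough sketch of what one such input record might look like, the listed inputs can be collected in a small data class. All class and field names are illustrative assumptions; nothing here is an Icinga or TensorFlow API, and no model is trained.

```python
from dataclasses import dataclass, astuple
from typing import Optional


@dataclass
class SchedulingInput:
    """One hypothetical input record, built from the inputs listed above."""
    last_exec_time_s: float
    hard_state: bool                # state type: True = hard, False = soft
    flapping_pct: float
    retry_interval_s: float
    check_interval_s: float
    default_next_ts: float          # timestamp the normal scheduler would output
    host_name: str
    service_name: Optional[str]     # None for host checks
    check_command: str
    avg_command_exec_time_s: float
    system_load: float
    free_memory_bytes: int
    system_iops: float
    io_latency_ms: float

    def numeric_features(self) -> list[float]:
        """Keep only numeric fields; names would need a separate encoding."""
        return [float(v) for v in astuple(self) if isinstance(v, (bool, int, float))]
```

The three name fields are left out of the numeric vector on purpose, since turning host, service and command names into model inputs is a separate problem.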

htriem · February 12, 2021, 4:13pm

Hey

First of all, that’s some very appreciated input you’ve written. I’d like to thank you as well for that, for the time invested and for being so interested in the Icinga project in general. Always great to see a person from the community being this engaged.

Again, my post is going to be a reply to your thoughts based on what I’ve gathered in terms of experience and information in various discussions directly concerning this topic. I just want to say that because we in the Icinga development team still haven’t even agreed on a specific solution we’re going to implement, there will be little in the way of concrete approaches to solving certain issues - there are quite a few ideas floating around and I don’t want to prematurely announce properties that might behave wildly differently once they are implemented in a full scale Icinga release. This might be why my post will bring up more issues than present solutions.

When it comes to your point 1., yes, our experience with users concerning the requirement for punctual checks varies wildly. Sometimes it is an absolute requirement, sometimes actual punctuality isn’t as important as bringing new results in from differing sources, sometimes this whole topic is being overlooked as long as the system is working as expected.
When I started working directly with the Icinga team itself, I myself was blown away by the range of use cases and usages Icinga has. This is something which gives a couple of caveats when it comes to implementing a manual prioritizing function in Icinga. In the discussions, two major problems came up with a solution like this:

Imagine having systems with different checks on different hosts, numbering in the hundreds of thousands, usually implemented anyway with automation tools, in big, massive enterprise environments. Here the selection of checks needed has usually already been reduced to the lowest possible amount, so you have two priorities: “checks we really need” and “checks we definitely need”. Administering such a monitoring system is time consuming enough as is, you don’t want the additional burden of ranking all of those elements in a meaningful way, especially if it’s just in shades of high importance.
In general - even if we provide an option to configure scheduling settings by hand, nothing changes about the need for some kind of scheduling system working nicely without the need for configuration. Icinga needs to be smart enough to schedule checks in an efficient way on its own, be it either for processing an intricate user scheduling configuration (what if such a configuration contains conflicts? etc.) or for processing the complete lack thereof. If, and that is a big if, we implement an option to configure scheduling details by hand, it’s going to be on top of a smart scheduling system, and therefore this feature is rather low priority. And most of our users agree: a solution working great out-of-the-box is much preferred to more configuration options.

A “safe mode” is something we’d view rather critically as well. The issue there lies in the various reasons a check might be overdue or not executed. Sometimes there is an actual issue, sometimes you’re just asking too much of too few resources. In any case, obscuring that issue is not the way to go. Either it’s something we want the admins to fix or it’s a sign that more resources or a smaller configuration is needed - both are cases where action is advisable. We want to avoid giving a false sense of security by letting Icinga make the decision to drop checks just to make Icinga run more smoothly. Even a clear indication by Icinga that it has switched to “safe mode” behaviour will most likely be ignored rather than lead to the required action.
And - in the world of checks, even a bad reply is better than no inquiry. For the end user, a check that crashes your Icinga is more valuable information for you than a running Icinga that didn’t even check.

When it comes to what the scheduler should prioritize, the discussions are still ongoing. It is almost a certainty though that looking for resource bottlenecks and working around them will be the way to go. Again, this stems from the massive difference in applications Icinga has. From basic monitoring of computer systems to highly specialized, usually custom made modules and loads of varying hardware sensors, there are no generalized statements one could make about what every user needs.
There is only one thing that everything Icinga does has in common - Icinga runs on a computer system and its actions have varying costs in this environment. The check scheduler will have the duty of managing these costs in such a way that the usage of resources stays as low as possible while still respecting the users’ requirement for a certain freshness of the results.
And even then - some users will have more than enough processing capacity, but limited memory, and vice versa. The vast majority of issues when it comes to late checks have something to do with the amount of a certain resource being depleted on a system. At least all Icinga systems have these costs, so that’s where we will try to distribute the resulting loads in a much smarter way.
Again, the details of how, what and in which way are still being discussed.

The topic of machine learning/AI is something - seeing how we’re a young and hungry development team at Icinga - we’ve discussed as well. To me personally as well, a solution like that is very attractive, but again, there are big issues with it.
To repeat myself once more, the usages of Icinga are just too different. With all the custom implementations and different hardware behaving in different ways, we can’t gather reliable data concerning varying implementations of Icinga in our own testing environments. Our possibilities are quite limited - sure, we can train a machine learning system to distribute checks efficiently in a Linux administration scenario, but that covers a comparatively small number of use cases. And since we don’t record data from our customers (except in rare, special, mutually agreed cases), we don’t have the information to train a machine learning system in a general sense.
Of course, the thought of implementing a machine learning system for every single environment on its own crossed our minds as well, but at the moment it would be much too resource hungry - especially since the loads would increase exponentially with the amount of checks - to be a reasonable solution to this issue. It’s something we have on our list, though, an idea for the (hopefully not too far) future.

So, while the system which will be implemented will obviously be adaptive - trying small adjustments in scheduling to get the load distributed more evenly - it most likely won’t be a full blown machine learning system. But that’s one of the rather nice things about this problem - most monitoring systems don’t change that much from day to day (yet), so small adjustments over time make a big difference - usually, there is no need for completely starting from scratch after every single reload of Icinga.
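To make the idea of small adjustments over time a bit more concrete, here is a minimal sketch of one possible adjustment step: move a single check from the busiest slot to the idlest one, but only if that lowers the peak. This is purely an illustration under assumed names and costs, not the design the Icinga team has settled on.

```python
def rebalance_step(slots: list[dict[str, float]]) -> bool:
    """Move one check from the busiest slot to the idlest slot.

    Each slot maps a check name to its (assumed, positive) cost.
    Returns True if a check was moved, False if no move lowers the peak.
    """
    loads = [sum(s.values()) for s in slots]
    hi = loads.index(max(loads))
    lo = loads.index(min(loads))
    gap = loads[hi] - loads[lo]
    # Only a check cheaper than the gap lowers the peak of the two slots.
    candidates = {n: c for n, c in slots[hi].items() if 0 < c < gap}
    if not candidates:
        return False
    # Prefer the check whose cost is closest to half the gap.
    name = min(candidates, key=lambda n: abs(candidates[n] - gap / 2))
    slots[lo][name] = slots[hi].pop(name)
    return True


slots = [{"a": 5.0, "b": 3.0, "c": 2.0}, {"d": 1.0}]
while rebalance_step(slots):
    pass
print([sum(s.values()) for s in slots])
```

Because each step only touches one check, the plan from the previous cycle can be kept across reloads and nudged rather than rebuilt from scratch.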

This whole topic opens up a lot of interesting questions concerning self monitoring and the evaluation of Icinga’s own data. I’d agree with you that a lot of the data we need to provide a good solution should be made visible to the end user as well. Maybe it will even lead to some more resource friendly rewrites of some of the most classic checks you can run with Icinga, who knows? Ideas, concepts and code are always welcome.

At the end of the day, we want something which doesn’t need a lot of input and does as little as possible in terms of its own calculations to find the most viable way of avoiding load peaks, so our users have more resources on their systems to simply do more with their Icinga. We really want to make powerful monitoring easy for everybody, and hopefully we come to a solution that is future proof and workable in any case in which a user could have to run their own Icinga environment.

Hopefully this gave some food for thought!

Someone · February 12, 2021, 5:47pm

Thank you !
Now I better understand the constraints you are dealing with.

You’ve written down some very valid points and requirements I had barely realized before, and this made me come up with another idea, but first some feedback.

I agree on that, it’s better to have the user fix wrong checks and prod problems than to scale Icinga down.

Those sum up your post and needs pretty well, so I wanted to highlight them.

About an AI / machine learning scheduler:

Given that a monitoring system usually doesn’t change much, I think an AI may not need to be used for every check scheduling, but only once per complete monitoring check period cycle, to adjust the offsets for all checks after it has learnt the load of the cycle, without a trained model supplied from outside (it’s basically self-training). The unpredictable part would be big variations in check loads: the more there are, the less reliable this solution will be. If it’s really applicable, it could lessen the required load and make it viable.
I think the resources used can also be lessened in another way in this execution scenario, by grouping checks into timeframes by check period and then choosing offsets not for the checks themselves but for the timeframe they belong to. For example:

Let’s imagine the scheduler splits its planning into timeframes of, for example, 5 sec.
It needs to schedule 12 checks: 6 checks with a check period of 1 sec, 3 checks of 1 min and 3 checks of 10 min. Each timeframe can contain at most 3 checks.

Group checks of the same check period into the same timeframe for execution.
So we’ll have 4 timeframes:
2 timeframes which each contain 3 checks of 1s check_period
1 timeframe for 1 min checks
1 timeframe for 10 min checks
Time passes and Icinga learns that one of the timeframes for 1 sec checks gets a high load which causes the Icinga host to overload. The AI now knows which timeframe to deprioritize for offset computations.
Also, as I imagine it now, to better spread the load, timeframes for execution can overlap as long as the resulting load doesn’t damage the host.

This method has a drawback though: grouping checks to manage offsets requires fewer resources, but groups that are too big will lead to a loss of precision in load spreading. There may be a smart algorithm to find the right balance here.
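The grouping step from the example above can be sketched in a few lines. This only covers the grouping into timeframes, not the offset computation or any learning; the function name and check names are made up for illustration.

```python
from collections import defaultdict


def group_into_timeframes(periods: dict[str, int], max_per_frame: int) -> list[tuple[int, list[str]]]:
    """Group checks with the same period (in seconds) into timeframes
    holding at most max_per_frame checks each."""
    by_period: dict[int, list[str]] = defaultdict(list)
    for name, period in periods.items():
        by_period[period].append(name)
    frames = []
    for period in sorted(by_period):
        names = by_period[period]
        # Cut each period group into chunks of at most max_per_frame checks.
        for i in range(0, len(names), max_per_frame):
            frames.append((period, names[i:i + max_per_frame]))
    return frames


# The 12 checks from the example: 6 x 1 sec, 3 x 1 min, 3 x 10 min.
periods = {f"fast{i}": 1 for i in range(6)}
periods.update({f"minute{i}": 60 for i in range(3)})
periods.update({f"slow{i}": 600 for i in range(3)})

for period, names in group_into_timeframes(periods, max_per_frame=3):
    print(period, names)
```

With a maximum of 3 checks per timeframe this yields the 4 timeframes described above, so offsets would be chosen for 4 groups instead of 12 individual checks.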