Hey
First of all, that’s some very appreciated input you’ve written. I’d like to thank you as well for that, the time invested and being so interested in the Icinga project in general. Always great to see a person from the community being this enganged.
Again, my post is going to be a reply to your thoughts based on what I’ve gathered in terms of experience and information in various discussions directly concerning this topic. I just want to say that because we in the Icinga development team still haven’t even agreed on a specific solution we’re going to implement, there will be little in the way of concrete approaches to solving certain issues - there are quite a few ideas floating around and I don’t want to prematurely announce properties that might behave wildly different when they are going to be implemented in a full scale Icinga release. This might be why my post will bring up more issues than present solutions.
When it comes to your point 1., yes, our experience with users concerning the requirement for punctual checks varies wildly. Sometimes it is an absolute requirement, sometimes actual punctuality isn’t as important as bringing new results in from differing sources, sometimes this whole topic is being overlooked as long as the system is working as expected.
When I started working directly with the Icinga team itself, I myself was blown away by the range of use cases and usages Icinga has. This is something which gives a couple of caveats when it comes to implementing a manual prioritizing function in Icinga. In the discussions, two major problems came up with a solution like this:
-
Imagine having systems with different checks on different hosts, numbering in the hundreds of thousands, usually implemented anyways with automation tools, big, massive enterprise environments. Here usually the selection of checks needed has already been reduced to the lowest possible amount, so you have two priorities: “checks we really need” and “checks we definitely need”. Adminstering such a monitoring system is time comsuming enough as is, you don’t want the additional burden of ranking all of those elements in a meaningful way, especially if it’s just in shades of high importance.
-
In general - even if we provide an option to configure scheduling settings by hand, nothing changes about the need for some kind of scheduling system working nicely without the need for configuration. Icinga needs to be smart enough to schedule checks in an efficient way on it’s own, be it either for processing an intricate user scheduling configuration (what if such a configuration contains conflicts? etc.) or for processing the complete lack thereof. If, and that is a big if, we implement an option to configure scheduling details by hand, it’s going to be on top of a smart scheduling system, and therefore this feature is rather low priority. And, most of our users agree: A solution working great out-of-the-box is much prefered to more configuration options.
A “safe mode” is something we’d see rather critical as well. The issue there is based in the various reasons a check might be overdue or not executed. Sometimes there is an actual issue, sometimes you’re just asking too much of to little resources. In any case, obstruction of that issue is not the way to go. Either it’s something we want the admins to fix or it’s a sign that more resources or a smaller configuration is needed - both cases were action is advisable. We want to avoid giving a false sense of security by letting Icinga make the decision to drop checks just to make Icinga run more smoothly. Even a clear indication by Icinga that it would switch to “safe mode” behaviour will most likely be ignored rather than lead to the required action.
And - in the world of checks, even a bad reply is better than no inquiry. For the end user, a check that crashes your Icinga is more valuable information for you than a running Icinga that didn’t even check.
When it comes to what the scheduler should prioritize, the discussions are still ongoing. It is almost a certainty though that looking for resource bottlenecks and working around them will be the way to go. Again, this stems from the massive difference in applications Icinga has. From basic monitoring of computer systems to highly specialized, usually custom made modules and loads of varying hardware sensors, there are no generalized statements one could make about what every user needs.
There is only one thing all things Icinga does have in common - Icinga runs on a computer system and it’s actions have varying costs in this environment. The check scheduler will have the duty of managing these costs in a way the usage of resources stays as low as possible while still respecting the users requirement for a certain freshness of the results.
And even then - some users will have more than enough processing capacity, but limited memory, and vice versa. The vast majority of issues when it comes to late checks have something to do with the amount of a certain resource being depleted on a system. At least all Icinga systems have these costs, so that’s where we will try to distribute the resulting loads in a much smarter way.
Again, the details of how, what and in which way are still being discussed.
The topic machine learning/AI is something - seeing how we’re a young and hungry development team at Icinga - we’ve discussed as well. Myself personally included, a solution like that is very attractive, but again, there are big issues with it.
To repeat myself once more, the usages of Icinga are just too different. With all the custom implementations and different hardware behaving in different ways, we can’t gather reliable data concerning varying implementations of Icinga in our own testing environments. Our possiblities are quite limited - sure, we can train a machine learning system to distribute checks efficiently in a Linux adminstration scenario, but that covers a comparitively small amount of use cases. And since we don’t record data from our customers (except in rare special mutually agreed cases), we don’t have the information to train a machine learning system in a general sense.
Of course, the thought of implementing a machine learning system for every single environment on it’s own crossed our minds as well, but at the moment it would be much too resource hungry - especially since the loads would increase exponentially with the amount of checks - to be a reasonable solution to this issue. It’s something we have on our list, though, an idea for the (hopefullly not too far) future.
So, while the system which will be implemented will obviously be adaptive - trying small adjustments in scheduling to get the load distributed more evenly - it most likely won’t be a full blown machine learning system. But that’s one of the rather nice things about this problem - most monitoring systems don’t change that much from day to day (yet), so small adjustments over time make a big difference - usually, there is no need for completely starting from scratch after every single reload of Icinga.
This whole topic opens up a lot of interesting questions concerning self monitoring and the evaluation of Icingas own data. I’d agree with you that a lot of the data we need to provide a good solution should be made visible to the end user as well. Maybe it will even lead to some more resource friendly rewrites of some of the most classic checks you can run with Icinga, who knows? Ideas, concepts and code are always welcome
At the end of the day, we want something which doesn’t need a lot of input, does as little as possible in terms of it’s own calculations to find the most viable way of avoiding load peaks, so our users have more resources on their systems to simply do more with their Icinga. We really want to make powerful monitoring easy for everybody, and hopefully we come to solution that is future proof and workable in any case a user could have to run their own Icinga environment.
Hopefully this gave some food for thought!