[RFC] A new Monitoring Plugins Interface

Hello people,
I wanted to vocalize a thought which has been bugging me for quite some time.
In short, I see several drawbacks in the current way Monitoring Plugins work and I would like to know whether I am alone in this or not.

The problems of the Monitoring Plugin Interface

Communicating the state with the exit code is vulnerable to faulty programming and environmental failures

Using the exit code of a program to communicate anything other than “Execution succeeded” and “Something broke” often leads to problems.
I have seen some Monitoring Plugins written in Python rather optimistically, where an uncaught exception caused an exit code of 1. That was not only unintended, but also wrong: “CRITICAL” or “WARNING” mean “I know what is happening and it is not good”, not “I don’t know what is happening”.

This is a systemic problem. Even if a programmer works defensively and tests for all errors, the operating system might do something unexpected and cause an error during execution.
The likely solution is that Monitoring Plugins should ALWAYS exit with 0 if the execution succeeded (as every other sensible program does) and the result of the specific test should be communicated by other means.
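To make that concrete, here is a minimal sketch of the “always exit 0” idea. The function name check_free_memory and all field names are invented for illustration, not taken from any real plugin:

```python
import json

# Minimal sketch of the "always exit 0" idea. check_free_memory and the
# field names are invented for illustration, not from any real plugin.
def check_free_memory() -> dict:
    free_percent = 42.0  # pretend this was actually measured
    state = "ok" if free_percent >= 10.0 else "critical"
    return {"state": state, "message": f"Available memory: {free_percent}%"}

def main() -> int:
    try:
        result = check_free_memory()
    except Exception as exc:
        # The execution itself failed: report "unknown" in the payload,
        # but still exit 0 -- the exit code only says "the plugin ran".
        result = {"state": "unknown", "message": f"plugin error: {exc}"}
    print(json.dumps(result))
    return 0  # non-zero would now unambiguously mean "the plugin is broken"

exit_code = main()  # a real plugin would do sys.exit(main())
```

With this pattern an uncaught exception inside the check logic still produces a well-defined “unknown” result instead of a misleading exit code 1.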

Information display is rather rudimentary and limited

Currently the only thing a Monitoring Plugin can rely on regarding the display of the message printed on stdout is that the first line is shown to the user, likely as plain text.
At least that is my guess, judging from different Monitoring Systems like Icinga. IcingaWeb2 will additionally render HTML in the plugin output, which IMHO is a horrible recipe for really weird display bugs and a bad separation of concerns.

But, let me elaborate a little bit:

Some (or even most) of the time, one line of output is enough to display everything necessary about a problem (“Available Memory is less than 10%”, “Certificate is only valid for three more days”, “This machine is running Windows Server 2008”, etc.).
But, pretty soon, a single Monitoring Plugin execution will perform multiple tests/checks (“is there more than 10% free space on these five different filesystems”, “are all of those sensors within their respective ok range”).
The separate tests/checks (I like to call them “subchecks” or “partial checks”) can be independent of each other (different filesystems), dependent on a previous one (“Can I open a TCP connection to xy?” → “Can I open a TLS session?” → “Can I speak HTTP over that?” → “Do I get a 200 for a GET /?”) or may even be mere “meta subchecks” which consist of several other ones (“For every filesystem, is there more than 10% free space AND more than 50% free Inodes?”).

Therefore I imagine a tree-like logic here, as it is implemented in the IfW plugins or in check_system_basics.
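Such a tree could look like the following sketch, with “meta subchecks” deriving their state from their children (all names and the worst-of aggregation rule are my own illustration, not taken from the IfW plugins or check_system_basics):

```python
from dataclasses import dataclass, field

# Illustrative "subcheck tree" sketch; names and aggregation are invented.
STATES = {"ok": 0, "warning": 1, "critical": 2, "unknown": 3}

@dataclass
class Subcheck:
    name: str
    state: str = "ok"          # leaf result; ignored if children exist
    children: list["Subcheck"] = field(default_factory=list)

    def worst_state(self) -> str:
        """A 'meta subcheck' derives its state from its children."""
        if not self.children:
            return self.state
        return max((c.worst_state() for c in self.children),
                   key=lambda s: STATES[s])

root = Subcheck("filesystems", children=[
    Subcheck("/", children=[
        Subcheck("space", "ok"), Subcheck("inodes", "warning")]),
    Subcheck("/var", children=[
        Subcheck("space", "critical"), Subcheck("inodes", "ok")]),
])
```

Here root.worst_state() would be “critical”, because the worst leaf under /var is critical, while the subtree for / on its own is only “warning”.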

Currently some plugins provide an errors-only option to limit output to failed things, but then there is no way to see the “OK” things anymore. So, why is this not a “display option”, meaning “in the frontend”? Why is there no button in the GUI saying “Show me only the failed checks in there, not the ones which are OK”? Why are my critical “subchecks” not sorted above the other ones?
The answer is easy: there is simply no sane way to decide what is an “important” part of the output and what is not. There is no common structure, especially no machine-readable one.

State would be great, but currently everyone rolls a ramshackle mechanism of their own

Most Monitoring Plugins run stateless, which is great for tests, but insufficient for some purposes. Some examples off the top of my head: rate calculation (for interfaces), CPU usage measurements on Linux (if someone points me to the part of the kernel which tells me that for the last x seconds and not in total since boot, I would be happy) or “did this file change between the last run and now”.

Obviously some people just went ahead and implemented something: a bunch of plugins generate state files in different formats somewhere.
While this works “most of the time”, the Monitoring System doesn’t know anything at all about this, which is, IMHO, a problem in itself, but it also leads to practical problems: What if the same check is executed by another machine every time (true enough for Icinga HA setups)?
This is arguably not a big problem, but it would be nice to have state AND in a way the Monitoring System knows about.

An implementation which does this in an OK way is check_interfaces, which accepts performance data from the previous run as input to do rate calculation.

Arguably the whole “rate calculation” thing should not be done in the plugin, but could be done by some other component later in the stack, but I think this topic still stands for other examples.

Some thoughts on solutions

Regarding the first two topics, I would propose using some machine-readable structured data format as output for Monitoring Plugins. Probably JSON or XML, where one field could contain the actual test result, another one some free-form info text, and so on.
The Monitoring Plugin would then exit with 0 every time it ran through correctly, everything else would be a bug or something like that, and the Monitoring System would know it.
The centreon-plugins actually do implement something like this with the --output-json and --output-xml options.
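Just to make the idea tangible, one possible output shape could look like the structure below. This is purely illustrative, NOT the centreon-plugins format or any existing standard; all field names are made up:

```python
import json

# One possible structured output shape -- purely illustrative,
# NOT the centreon-plugins format or any existing standard.
output = {
    "exit": "ok",                       # overall result of the check logic
    "message": "all filesystems fine",  # free-form one-liner for humans
    "checks": [                         # the "subcheck" tree, flattened here
        {"name": "/", "state": "ok", "perfdata": {"free_pct": 34.2}},
        {"name": "/var", "state": "ok", "perfdata": {"free_pct": 61.0}},
    ],
}
print(json.dumps(output, indent=2))
```

With something like this, a frontend could finally offer “show only failed subchecks” as a display option, because the structure is machine-readable.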

To get the most out of it, this should be a product- and company-independent standard, something like OpenTelemetry; if everybody does something different, there will be even more balkanisation of the post-Nagios monitoring world.

What now?

So, I would be interested in your perspective on this topic and, ideally, whether people would be willing to invest time and work for something in this direction. Happy to answer any questions if I did not manage to form clear and understandable sentences.

Sorry for the long read :slight_smile:


Good ideas, but the big problem here is compatibility with the existing checks and Icinga.

That is indeed a problem, but, in my mind, not as huge as one might fear.
If we assume JSON as the serialisation format, every message would begin with { and end with } (if it is correct — correct me if I am wrong), so autodetection would be possible by that alone, since I have never seen plugin output doing that until now.
Or, more bluntly, by throwing is_this_valid_json(text) at it and falling back to plain text if not.
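A naive version of that autodetection could look like this sketch (the function name and the tuple return shape are my own, just to show the fallback logic):

```python
import json

# Naive autodetection sketch: treat the output as structured only if it
# looks like and actually parses as a JSON object; otherwise fall back
# to the classic plain-text interpretation.
def parse_plugin_output(text: str):
    stripped = text.strip()
    if stripped.startswith("{") and stripped.endswith("}"):
        try:
            return ("structured", json.loads(stripped))
        except ValueError:
            pass  # looked like JSON but was not valid: treat as plain text
    return ("plain", text)
```

Legacy plugins would keep working unchanged, since their output does not start with {, while new-style plugins would be picked up automatically.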
And/or manual switches for CheckCommands could be added, since there is space for new attributes there, so

object CheckCommand "foo" {
  ...
  MPI_v2 = true
}

would be an option.