[RFC] A new Monitoring Plugins Interface

Hello people,
I wanted to vocalize a thought which has been bugging me for quite some time.
In short, I see several drawbacks in the current way Monitoring Plugins work and I would like to know whether I am alone in this or not.

The problems of the Monitoring Plugin Interface

Communicating the state with the exit code is vulnerable to faulty programming and environmental failures

Using the exit code of a program to communicate anything other than “Execution succeeded” and “Something broke” often leads to problems.
I have seen some Monitoring Plugins written in Python rather optimistically, where an uncaught exception caused an exit code of 1. That was not only unintended, but also wrong: “CRITICAL” or “WARNING” mean “I know what is happening and it is not good”, not “I don’t know what is happening”.

This is a systemic problem. Even if a programmer works defensively and tests for all errors, the operating system might do something unexpected and cause an error during execution.
The likely solution is that Monitoring Plugins should ALWAYS exit with 0 if the execution succeeded (as every other sensible program does) and the result of the specific test should be communicated by other means.
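To make that concrete, here is a minimal sketch of the “always exit 0” idea. The function name check_free_memory and all field names are invented for illustration, not taken from any real plugin:

```python
import json

# Minimal sketch of the "always exit 0" idea. check_free_memory and the
# field names are invented for illustration, not from any real plugin.
def check_free_memory() -> dict:
    free_percent = 42.0  # pretend this was actually measured
    state = "ok" if free_percent >= 10.0 else "critical"
    return {"state": state, "message": f"Available memory: {free_percent}%"}

def main() -> int:
    try:
        result = check_free_memory()
    except Exception as exc:
        # The execution itself failed: report "unknown" in the payload,
        # but still exit 0 -- the exit code only says "the plugin ran".
        result = {"state": "unknown", "message": f"plugin error: {exc}"}
    print(json.dumps(result))
    return 0  # non-zero would now unambiguously mean "the plugin is broken"

exit_code = main()  # a real plugin would do sys.exit(main())
```

With this pattern an uncaught exception inside the check logic still produces a well-defined “unknown” result instead of a misleading exit code 1.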

Information display is rather rudimentary and limited

Currently the only thing a Monitoring Plugin can rely on regarding the display of the message printed on stdout is that the first line is shown to the user, likely as plain text.
At least that is my guess, judging from different Monitoring Systems like Icinga. IcingaWeb2 will additionally render HTML in the plugin output, which IMHO is a horrible recipe for really weird display bugs and a bad separation of concerns.

But, let me elaborate a little bit:

Some (or even most) of the time, one line of output is enough to display everything necessary about a problem (“Available Memory is less than 10%”, “Certificate is only valid for three more days”, “This machine is running Windows Server 2008”, etc.).
But, pretty soon, a single Monitoring Plugin execution will perform multiple tests/checks (“is there more than 10% free space on these five different filesystems”, “are all of those sensors within their respective ok range”).
The separate tests/checks (I like to call them “subchecks” or “partial checks”) can be independent of each other (different filesystems), dependent on a previous one (“Can I open a TCP connection to xy?” → “Can I open a TLS session?” → “Can I speak HTTP over that?” → “Do I get a 200 for a GET /?”) or may even be mere “meta subchecks” which consist of several other ones (“For every filesystem, is there more than 10% free space AND more than 50% free Inodes?”).

Therefore I imagine a tree-like logic here, as it is implemented in the IfW plugins or in check_system_basics.
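Such a tree could look like the following sketch, with “meta subchecks” deriving their state from their children (all names and the worst-of aggregation rule are my own illustration, not taken from the IfW plugins or check_system_basics):

```python
from dataclasses import dataclass, field

# Illustrative "subcheck tree" sketch; names and aggregation are invented.
STATES = {"ok": 0, "warning": 1, "critical": 2, "unknown": 3}

@dataclass
class Subcheck:
    name: str
    state: str = "ok"          # leaf result; ignored if children exist
    children: list["Subcheck"] = field(default_factory=list)

    def worst_state(self) -> str:
        """A 'meta subcheck' derives its state from its children."""
        if not self.children:
            return self.state
        return max((c.worst_state() for c in self.children),
                   key=lambda s: STATES[s])

root = Subcheck("filesystems", children=[
    Subcheck("/", children=[
        Subcheck("space", "ok"), Subcheck("inodes", "warning")]),
    Subcheck("/var", children=[
        Subcheck("space", "critical"), Subcheck("inodes", "ok")]),
])
```

Here root.worst_state() would be “critical”, because the worst leaf under /var is critical, while the subtree for / on its own is only “warning”.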

Currently some plugins provide an errors-only option to limit output to failed things, but then there is no way to see the “OK” things anymore. So, why is this not a “display option”, meaning “in the frontend”? Why is there no button in the GUI saying “Show me only the failed checks in there, not the ones which are OK”? Why are my critical “subchecks” not sorted above the other ones?
The answer is easy: there is simply no sane way to decide what is an “important” part of the output and what is not. There is no common structure, especially no machine-readable one.

State would be great, but currently everyone rolls a ramshackle mechanism of their own

Most Monitoring Plugins run stateless, which is great for tests, but insufficient for some purposes. Some examples off the top of my head: rate calculation (for interfaces), CPU usage measurements on Linux (if someone points me to the part of the kernel which tells me that for the last x seconds and not in total since boot, I would be happy) or “did this file change between the last run and now”.

Obviously some people just went ahead and implemented something: a bunch of plugins generate state files in different formats somewhere.
While this works “most of the time”, the Monitoring System doesn’t know anything at all about this, which is, IMHO, a problem in itself, but it also leads to practical problems: What if the same check is executed by another machine every time (true enough for Icinga HA setups)?
This is arguably not a big problem, but it would be nice to have state AND in a way the Monitoring System knows about.

An implementation which does this in an OK way is check_interfaces, which accepts performance data from the previous run as input to do rate calculation.

Arguably the whole “rate calculation” thing should not be done in the plugin, but could be done by some other component later in the stack, but I think this topic still stands for other examples.

Some thoughts on solutions

Regarding the first two topics, I would propose using some machine-readable structured data format as output for Monitoring Plugins. Probably JSON or XML, where one field could contain the actual test result, another one some free-form info text, and so on.
The Monitoring Plugin would then exit with 0 every time it ran through correctly, everything else would be a bug or something like that, and the Monitoring System would know it.
The centreon-plugins actually do implement something like this with the --output-json and --output-xml options.
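Just to make the idea tangible, one possible output shape could look like the structure below. This is purely illustrative, NOT the centreon-plugins format or any existing standard; all field names are made up:

```python
import json

# One possible structured output shape -- purely illustrative,
# NOT the centreon-plugins format or any existing standard.
output = {
    "exit": "ok",                       # overall result of the check logic
    "message": "all filesystems fine",  # free-form one-liner for humans
    "checks": [                         # the "subcheck" tree, flattened here
        {"name": "/", "state": "ok", "perfdata": {"free_pct": 34.2}},
        {"name": "/var", "state": "ok", "perfdata": {"free_pct": 61.0}},
    ],
}
print(json.dumps(output, indent=2))
```

With something like this, a frontend could finally offer “show only failed subchecks” as a display option, because the structure is machine-readable.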

To get the most out of it, this should be a product- and company-independent standard, something like OpenTelemetry; if everybody does something different, there will be even more balkanisation of the post-Nagios monitoring world.

What now?

So, I would be interested in your perspective on this topic and, ideally, whether people would be willing to invest time and work for something in this direction. Happy to answer any questions if I did not manage to form clear and understandable sentences.

Sorry for the long read :slight_smile:


Good ideas, but the big problem here is compatibility with the existing checks and Icinga.

That is indeed a problem, but, in my mind, not as huge as one might fear.
If we assume JSON as the serialisation format, every message would begin with { and end with } (if it is correct — correct me if I am wrong), so autodetection would be possible by that alone, since I have never seen plugin output doing that until now.
Or, more bluntly, by throwing is_this_valid_json(text) at it and falling back to plain text if not.
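A naive version of that autodetection could look like this sketch (the function name and the tuple return shape are my own, just to show the fallback logic):

```python
import json

# Naive autodetection sketch: treat the output as structured only if it
# looks like and actually parses as a JSON object; otherwise fall back
# to the classic plain-text interpretation.
def parse_plugin_output(text: str):
    stripped = text.strip()
    if stripped.startswith("{") and stripped.endswith("}"):
        try:
            return ("structured", json.loads(stripped))
        except ValueError:
            pass  # looked like JSON but was not valid: treat as plain text
    return ("plain", text)
```

Legacy plugins would keep working unchanged, since their output does not start with {, while new-style plugins would be picked up automatically.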
And/or manual switches for CheckCommands could be added, since there is space for new attributes there, so

object CheckCommand "foo" {
  ...
  MPI_v2 = true
}

would be an option.