State retention routines

I’ve got some custom checks that I need to write, involving SNMPv3 data retrievals from custom hardware devices.

One of the core requirements is that I need to compare current results to previous results.

What I’m wondering is if there are any established “best practices” around this - or if there are some known caveats that I should probably be aware of.

Most of the check code already exists. What’s happening at this time is that I’m trying to integrate this into our Icinga2 environment.

(Note, these checks are written in Python - the amount of time required to retrieve the necessary data so greatly exceeds the difference in execution time of the check between Python and C that it’s not worth it for me to spend the extra time to convert my code.)

So far, the only information I’ve found includes:

https://www.monitoring-plugins.org/doc/index.html, which leads me to
https://www.monitoring-plugins.org/doc/state-retention.html,
https://www.monitoring-plugins.org/doc/guidelines.html#AEN254 and
https://www.monitoring-plugins.org/doc/faq/private-c-api.html#state-information - all of which appear to really only apply to the monitoring-plugins package.

I’m not currently finding much anywhere else - are there other resources I’m missing? (I think I took a pretty good shot at trying to find answers about this here, but without any success. Searching for “state retention” only brings up some Graphite-related questions, and I couldn’t think of anything else to search for that looked like it put me on the right track.)

It seems to me (from what I’m not seeing) that there really aren’t any commonly endorsed methods. I should be able to define, allocate, and configure a particular directory, create a directory structure within it like <test_name>/<host>/data-<x> and go on my merry way.
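A minimal sketch of that layout, assuming JSON as the on-disk format. The helper names (`state_path`, `save_state`, `load_state`), the base directory, and the meaning of the `data-<x>` suffix are all hypothetical choices, not an established convention:

```python
import json
import os
import tempfile
from pathlib import Path


def state_path(base_dir, test_name, host, slot):
    """Build <base_dir>/<test_name>/<host>/data-<slot>."""
    return Path(base_dir) / test_name / host / f"data-{slot}"


def save_state(path, data):
    """Write atomically: dump to a temp file in the same directory, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(data, handle)
        os.replace(tmp_name, path)  # atomic when both paths are on one filesystem
    except BaseException:
        os.unlink(tmp_name)  # do not leave a half-written temp file behind
        raise


def load_state(path):
    """Return the stored dict, or None when no previous result exists."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

Writing to a temporary file and renaming it means a check that dies mid-write should leave the previous file intact rather than a truncated one.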

Also note:

  • I know I need to lock the files to prevent concurrent execution, no problem. (I also know I should be able to time out on the lock request after 10 seconds. If I can’t get to the file in that period of time, there’s something else really wrong.)
  • I know I need to ensure the tests fail gracefully. If for any reason the file can’t be opened, or the current data in the file “doesn’t make sense”, or anything else along those lines, I need to return a status of Unknown.

All advice is welcome, thanks.
Ken

In case you only need the last check results you could send them to your plugins using e.g.:

vars.icinga_last_output = {{get_service(host.name, "your_service_name").last_check_result.output}}

That only helps if there was a last check result. Think about timeouts or an unreachable host.
If I need to compare a number from an SNMP device to the number from the last check, I use a file to store it, just as plugins like mysql_health, nwc_health, etc. do. Just be aware that if you have dual satellites/masters per zone, you need to make that filesystem available on all satellites (I prefer GlusterFS).
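As a rough illustration of comparing a number with the one stored by the last check, here is a sketch that also covers the "no last result" case. The function name, the `max_delta` threshold, and the decision to report a decreasing counter as Unknown are assumptions made for the example, not how mysql_health or nwc_health behave:

```python
def compare_to_previous(current, previous, max_delta):
    """Return (exit_code, message) for a counter compared with its stored value."""
    if previous is None:
        # First run, or the last check timed out and stored nothing.
        return 0, f"OK - no previous value stored, recording {current}"
    delta = current - previous
    if delta < 0:
        # Assumed policy: a counter going backwards (reset, reboot) is Unknown.
        return 3, f"UNKNOWN - counter went backwards ({previous} -> {current})"
    if delta > max_delta:
        return 2, f"CRITICAL - increased by {delta} since last check"
    return 0, f"OK - increased by {delta} since last check"
```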

Regards,
Carsten

Thank you both very much for the ideas. In particular, the references to mysql_health and nwc_health look like they might be good samples for me to review.

Fortunately, this is all being done on a single-server system (one zone, no satellites), so the shared-filesystem issue doesn’t apply (yet) - but that’s a really good point to keep in mind for the future.
(Hmmm… stray thought… I wonder if I should just stash these values in another set of tables in a database.)
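For what it's worth, a database version of the same idea could look something like this sketch, which uses Python's built-in `sqlite3` module. The table name `check_state`, its columns, and the helpers `open_store` and `swap_value` are hypothetical:

```python
import sqlite3


def open_store(db_path):
    """Open the database and create the (hypothetical) state table if needed."""
    conn = sqlite3.connect(db_path, timeout=10)  # wait up to 10 s on a locked DB
    conn.execute(
        "CREATE TABLE IF NOT EXISTS check_state ("
        " test_name TEXT NOT NULL,"
        " host TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " PRIMARY KEY (test_name, host))"
    )
    return conn


def swap_value(conn, test_name, host, new_value):
    """Store new_value and return the value it replaced (None on first run)."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT value FROM check_state WHERE test_name = ? AND host = ?",
            (test_name, host),
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO check_state (test_name, host, value)"
            " VALUES (?, ?, ?)",
            (test_name, host, str(new_value)),
        )
    return row[0] if row else None
```

Values are stored as text here, so the caller converts them back (for example with `int()`) before comparing.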
