I am thinking about moving from Nagios to Icinga. The main issue I have with most monitoring/alerting tools:
We have a render farm (CGI). So when there are “jobs”, lots of hosts are supposed to be “up” with their services available. If there are no jobs, the same hosts should be down, and only that counts as an “OK” state.
In Nagios I always had a dashboard with e.g. 163 “RED” hosts in state “Down”. I only managed to suppress alert emails depending on the job list (by setting the maintenance flag). Two hosts did not shut down due to some problems (mostly ACPI) and were shown green, and one should have been “up” and was not.
So I really need to set dynamically what is “OK” and what is not. Can Icinga do that?
Some plugins have the option to flip the resulting status. For the rest you can use something like the negate plugin, which is a wrapper that allows you to turn a CRITICAL into an OK state.
If it were me, I’d consider replacing the default ping check for a host with one that pings and also queries whether a job is running. Probably wouldn’t take much code. Something you should also look into is the Dependency object and how you can use it to control what is actively being checked.
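A rough sketch of such a combined check, as a shell plugin. The decision logic is kept in a small function; everything around it (the `JOBLIST` file, paths) is a placeholder you’d wire up to your own job manager, not something Icinga ships:

```shell
#!/bin/sh
# Sketch of a job-aware host check: a render node is OK when its ping
# state matches what the job queue expects; any mismatch is CRITICAL.

# Pure decision logic, kept separate so it is easy to test.
# $1: "yes"/"no" host answered ping; $2: "yes"/"no" a job expects it up
render_state() {
    if [ "$1" = "$2" ]; then
        echo "OK"
        return 0
    fi
    echo "CRITICAL"
    return 2
}

# Wiring it up might look like this (JOBLIST is a hypothetical file
# your job manager writes, one expected-up hostname per line):
#   host=$1
#   ping -c 2 -W 2 "$host" >/dev/null 2>&1 && pingable=yes || pingable=no
#   grep -qx "$host" "${JOBLIST:-/var/cache/renderfarm/active_hosts}" \
#       && expected=yes || expected=no
#   echo "$(render_state "$pingable" "$expected") - $host" \
#        "pingable=$pingable job_expected=$expected"
#   render_state "$pingable" "$expected" >/dev/null
```

This covers all four cases: up-and-expected and down-and-idle are OK; up-without-job (the failed shutdown) and down-with-job are CRITICAL.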
There is a negate plugin that allows you to transform any check result into another one.
For example OK -> WARN, WARN -> CRIT, CRIT -> OK, or whatever you would like to have. So it should work fine if you define a new check command that wraps the ping plugin with negate.
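As a sketch of that in Icinga 2 syntax (the command name and ping thresholds here are my own placeholders, not something from the thread), a CheckCommand for hosts that are *supposed* to be down could look like:

```
object CheckCommand "ping-expected-down" {
  command = [
    PluginDir + "/negate",
    "-c", "OK",        // host unreachable (CRITICAL) becomes OK
    "-o", "CRITICAL",  // host still answering (OK) becomes CRITICAL
    PluginDir + "/check_ping",
    "-H", "$address$",
    "-w", "3000,80%",
    "-c", "5000,100%"
  ]
}
```

You’d then assign this command to hosts dynamically (e.g. via apply rules on a custom variable your job-list sync sets), so the same host flips between the normal ping check and the negated one.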
Edit: Oh, just saw that @mjbrooks mentioned it already. Didn’t see this while scrolling through the thread. Sorry.
Similar to what Blake suggested, I’d match the number of hosts in a certain state against the number of running jobs. Depending on the job you might expect a certain number of hosts to be online, and check against that.
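The count-based idea could be a tiny plugin like the sketch below. Both inputs are placeholders: where the expected and actual counts come from (job manager API, fping sweep, …) is up to your setup:

```shell
#!/bin/sh
# Compare how many nodes the job manager expects online against how
# many actually are. Too few is CRITICAL (job will stall); too many
# is WARNING (probably a failed shutdown).
check_farm_count() {
    expected=$1
    online=$2
    if [ "$online" -lt "$expected" ]; then
        echo "CRITICAL - only $online of $expected expected nodes online"
        return 2
    elif [ "$online" -gt "$expected" ]; then
        echo "WARNING - $online nodes online, job list expects $expected (failed shutdown?)"
        return 1
    fi
    echo "OK - $online/$expected nodes online"
    return 0
}

# Hypothetical wiring, e.g. job manager API plus an fping sweep:
#   check_farm_count "$(curl -s http://jobmanager/api/expected_nodes)" \
#                    "$(fping -a -q < all_nodes.txt | wc -l)"
```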
Thank you all for the feedback. The negate plugin looks interesting.
btw … the main problem is actually checking the power-down (which needs to be verified via IPMI). The job management system itself turns on render nodes and unused workstations and alerts on unresponsive nodes (and one job normally uses ALL available nodes), but it doesn’t really care if a node fails to shut down. Apart from that they work “flawlessly”. Our worst case was a change in CentOS that took away our render user’s permission to shut down the node, leaving the whole render farm running and draining power…
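For the IPMI side, `ipmitool … chassis power status` prints a line like “Chassis Power is off”, which is easy to check against the state the job list expects. A sketch, with the parsing split out so it’s testable (BMC host and credentials below are placeholders):

```shell
#!/bin/sh
# Power-state check built on ipmitool's "chassis power status" output.
# $1: the ipmitool output line; $2: expected state ("on" or "off")
power_check() {
    state=$(echo "$1" | awk '{print $NF}')
    if [ "$state" = "$2" ]; then
        echo "OK - chassis power is $state as expected"
        return 0
    fi
    echo "CRITICAL - chassis power is $state, expected $2"
    return 2
}

# Real invocation might look like (names are assumptions):
#   out=$(ipmitool -I lanplus -H "$bmc_host" -U "$user" -P "$pass" \
#         chassis power status)
#   power_check "$out" off   # expect "off" when the job list says idle
```

That would have caught the stuck-CentOS case: the node is idle per the job list, but the chassis reports power on, so the check goes CRITICAL.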