I think I’ve had some success working through this myself, so I thought I would share it here. I’d be very open to suggestions if there are better ways to handle this, but here goes:
First, I defined a service for the main Load check:
apply Service "Load" {
  import "critical-service"
  check_command = "Load"
  command_endpoint = host.vars.agent_endpoint
  assign where host.vars.agent_endpoint
  // More config here...
}
Then, I defined a summary service that runs the more expensive check to gather information about the load. In principle, think of this as ps aux for the application server:
apply Service "HiddenSummary" {
  import "critical-service"
  check_command = "Running"
  command_endpoint = host.vars.agent_endpoint
  assign where host.vars.agent_endpoint
  // More config here...
}
The final service is a dummy service that uses HiddenSummary. Assuming I’ve not misread the docs, this will only go warning/critical if Load is warning/critical and HiddenSummary has run its own check since Load changed state, so that it holds the most recent set of summary information.
apply Service "LoadSummary" {
  import "critical-service"
  check_command = "dummy"
  vars.dummy_state = {{
    var load_service = get_service(macro("$host.name$"), "Load")
    // Look the summary service up by its object name, "HiddenSummary";
    // "Running" is only the name of its check command.
    var running_service = get_service(macro("$host.name$"), "HiddenSummary")
    var summary_result = running_service.last_check_result
    // summary_result is null until HiddenSummary has been checked at least once
    if (load_service.state != 0 && (summary_result == null || summary_result.execution_start < load_service.last_state_change)) {
      // If the HiddenSummary service isn't yet updated, just keep the status as 0
      return 0
    }
    return load_service.state
  }}
  vars.dummy_text = {{
    var load_service = get_service(macro("$host.name$"), "Load")
    var running_service = get_service(macro("$host.name$"), "HiddenSummary")
    var summary_result = running_service.last_check_result
    if (load_service.state == 0) {
      return "OK"
    }
    if (summary_result == null || summary_result.execution_start < load_service.last_state_change) {
      // If the HiddenSummary service hasn't yet updated since Load went crit/warning then
      // set a basic message
      return "Gathering process data"
    }
    return summary_result.output
  }}
  assign where host.vars.agent_endpoint && host.vars.running_mysql_defaults_file
}
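The "has HiddenSummary caught up yet?" test appears in both lambdas, so it could be pulled into a shared function. The following is only a sketch: summary_is_stale is a name I made up, and it assumes the definition lives in a config file that is loaded on the node evaluating the LoadSummary check.

```
// Hypothetical helper, registered in the global namespace so that the
// runtime lambdas in vars.dummy_state / vars.dummy_text can call it.
globals.summary_is_stale = function(load_service, summary_service) {
  var result = summary_service.last_check_result
  // No check result at all means the summary has never been gathered.
  if (result == null) {
    return true
  }
  // Stale if the last summary run started before Load last changed state.
  return result.execution_start < load_service.last_state_change
}

// Example use inside the LoadSummary service:
// vars.dummy_state = {{
//   var load_service = get_service(macro("$host.name$"), "Load")
//   var summary_service = get_service(macro("$host.name$"), "HiddenSummary")
//   if (load_service.state != 0 && summary_is_stale(load_service, summary_service)) {
//     return 0
//   }
//   return load_service.state
// }}
```

This keeps the state and text lambdas in agreement about when the summary counts as fresh, which is the part that is easiest to get out of sync when editing one and not the other.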
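Finally, I created a dependency that disables checks on the HiddenSummary service while Load is OK. The states attribute lists the parent states in which the dependency is satisfied, so HiddenSummary only starts being checked once Load switches to a critical/warning state.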
apply Dependency "disable-running-checks" to Service {
  parent_service_name = "Load"
  disable_checks = true
  states = [ Critical, Warning ]
  assign where host.vars.agent_endpoint && service.name == "HiddenSummary"
}
As a result of the above, the summary information isn’t gathered until load becomes a problem. Because of the dummy service, this information is then emailed out to the relevant parties without them needing any specific access to the server internals or a login to Icingaweb2.
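For completeness, the email side could look something like the sketch below. This is not from my actual config: it assumes the "mail-service-notification" template and the "icingaadmin" user from the sample configuration shipped with Icinga 2, and the notification name "load-summary-mail" is made up for illustration.

```
// Sketch only: send mail for the LoadSummary dummy service.
apply Notification "load-summary-mail" to Service {
  // Assumed: template from the sample conf.d/templates.conf
  import "mail-service-notification"
  // Assumed: user from the sample configuration; replace with your own
  users = [ "icingaadmin" ]
  types = [ Problem, Recovery ]
  states = [ Warning, Critical, OK ]
  assign where service.name == "LoadSummary"
}
```

Since LoadSummary stays at 0 until HiddenSummary has caught up, the problem notification should only go out once the dummy text holds the gathered output, provided the notification script includes the service output in the mail body.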