Check every x minutes, until the problem is resolved

liverna · September 12, 2022, 2:41pm

Hello,

template Service "disk-test" {
  max_check_attempts = 2
  check_interval = 60m
  retry_interval = 5m
}

apply Service for (tmp_disks => config in host.vars.tmp.disks) {
  import "disk-test"
  vars.customer_services = "on"
  check_command = "disk_snmp"
  vars.disk = "tmp"
  vars += config
}

With this configuration, the service is checked every 60 minutes.
If the disk is full, it is checked again after 5 minutes.
The service is then marked WARN/CRITICAL and a notification is sent.
After that, it goes back to checking every 60 minutes.

What I would like instead:

Check every 60 minutes.
If there is a problem, check again after 5 minutes.
Mark the service WARN/CRITICAL and send a notification.

Then keep checking every x minutes until the problem is resolved.

liverna · September 13, 2022, 9:03am

For example:

template Service "disk-test" {
  max_check_attempts = 2
  check_interval = 60m
  retry_interval = 5m
  // placeholder for the setting I am looking for
  xxxxxxxxxxxx = 2m
}

apply Service for (tmp_disks => config in host.vars.tmp.disks) {
  import "disk-test"
  vars.customer_services = "on"
  check_command = "disk_snmp"
  vars.disk = "tmp"
  vars += config
}

01:00:00 check → Result OK
02:00:00 check → Result OK
03:00:00 check → Result OK
04:00:00 check → Result OK
05:00:00 check → Result WARN/CRITICAL
05:05:00 check → Result WARN/CRITICAL (retry_interval)
05:07:00 check → Result WARN/CRITICAL (xxxxxxxxxxxx = 2m) (check again = problem not resolved)
05:09:00 check → Result WARN/CRITICAL (xxxxxxxxxxxx = 2m) (check again = problem not resolved)
05:11:00 check → Result WARN/CRITICAL (xxxxxxxxxxxx = 2m) (check again = problem not resolved)
05:13:00 check → Result WARN/CRITICAL (xxxxxxxxxxxx = 2m) (check again = problem not resolved)
05:15:00 check → Result WARN/CRITICAL (xxxxxxxxxxxx = 2m) (check again = problem not resolved)
05:17:00 check → Result WARN/CRITICAL (xxxxxxxxxxxx = 2m) (check again = problem not resolved)
05:19:00 check → Result OK (check again = problem resolved)
06:00:00 check → Result OK
07:00:00 check → Result OK
08:00:00 check → Result OK
…

moreamazingnick · September 13, 2022, 1:20pm

The retry interval only exists to differentiate between soft and hard state changes, so no, you can’t have that.
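
To illustrate why the schedule falls back to 60 minutes, here is a small Python sketch of how the two intervals interact. It is a simplified model, not Icinga's actual scheduler: the function name `simulate` and its parameters are made up for this example, times are in minutes, and every OK result is treated as a hard state.

```python
def simulate(results, check_interval=60, retry_interval=5, max_check_attempts=2):
    """Return a list of (minute, result, state_type) tuples, one per check.

    results: sequence of booleans, True meaning the check returned OK.
    Simplified assumption: retry_interval is only used while the problem
    is still in a SOFT state (attempt < max_check_attempts).
    """
    minute = 0
    attempt = 0  # consecutive non-OK results, capped at max_check_attempts
    timeline = []
    for ok in results:
        if ok:
            attempt = 0
            state_type = "HARD"
            delay = check_interval
        else:
            attempt = min(attempt + 1, max_check_attempts)
            if attempt < max_check_attempts:
                # Problem not confirmed yet: re-check quickly.
                state_type = "SOFT"
                delay = retry_interval
            else:
                # Problem confirmed (hard state, notification goes out):
                # scheduling returns to the normal check interval.
                state_type = "HARD"
                delay = check_interval
        timeline.append((minute, "OK" if ok else "CRITICAL", state_type))
        minute += delay
    return timeline


if __name__ == "__main__":
    for entry in simulate([True, False, False, False, True]):
        print(entry)
```

With the settings from the question, a problem first seen at minute 60 is re-checked at minute 65 and, once it is hard, only again at minute 125, which matches the behaviour described in the first post.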

I don’t know what kind of appliance you are trying to check via SNMP, so it’s difficult to judge the check interval, but I would suggest checking every 10m and fixing the disk issue when you get an alert.

If you get a notification that something is wrong, you should check it or fix it. If you need to change the check interval, it looks like you are hoping that the error resolves itself.
If that is the case, you should review the thresholds or upgrade the disks if possible.

Best Regards

rivad · September 14, 2022, 7:45am

Create a copy of the service with the different check interval but with active checks disabled, and use an event command on the first one to enable the second service via the API?
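
A rough Python sketch of what such an event script could look like, in case it helps. It assumes the Icinga 2 REST API is enabled on the default port 5665 and that an API user with permission to modify services exists. The service name `disk-fast`, the host name `web01` and both function names are hypothetical. The script only toggles the `enable_active_checks` attribute of the second service, so treat it as a starting point rather than a tested setup.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request

API_BASE = "https://localhost:5665/v1"  # assumption: API listener on the local node


def build_toggle_request(host, service, state, state_type, api_base=API_BASE):
    """Return (url, json_body) for the fast copy of the service, or None.

    Soft states are ignored so the fast copy is only enabled once the
    problem is confirmed, and disabled again when the state is OK.
    """
    if state_type != "HARD":
        return None
    # Icinga 2 addresses a service object as "<host>!<service>".
    name = urllib.parse.quote(f"{host}!{service}", safe="!")
    url = f"{api_base}/objects/services/{name}"
    body = {"attrs": {"enable_active_checks": state != "OK"}}
    return url, json.dumps(body)


def send(url, body, user, password, ca_file=None):
    """POST the attribute change to the API using HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        data=body.encode(),
        method="POST",
        headers={
            "Accept": "application/json",
            "Authorization": f"Basic {token}",
        },
    )
    # ca_file should point at the CA certificate that signed the API
    # certificate; None falls back to the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

The idea would be to call this from an EventCommand attached to the 60-minute service, passing `$host.name$`, `$service.state$` and `$service.state_type$` as arguments, for example `send(*build_toggle_request("web01", "disk-fast", "CRITICAL", "HARD"), user, password)`. The fast copy would then run with its own short `check_interval` only while the problem is confirmed.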