Service check_timeout and command timeout not working

I have a command, check_mountpoints, which can take a long time because of some slow NFS mounts, so I’m trying to raise its timeout above the 1-minute default.

I tried setting check_timeout = 2m in the apply Service definition and timeout = 2m in the object CheckCommand definition, but neither seems to be applied: sometimes the command is still killed after 60 seconds (<Timeout exceeded.><Terminated by signal 9 (Killed).>).

Please note that this is a remote check executed on a client through the Icinga Agent (command_endpoint = host.vars.client_endpoint in the apply Service definition).

Could you help me understand what I’m doing wrong, please?

Thanks!

Can you please show the command, service template and service apply definitions?

Yes of course!

Command:

object CheckCommand "_check_mountpoints" {
        command = [ CustomPluginsDir + "/check_mountpoints.sh" ]

        arguments = {
                "--mountpoint" = {
                        value = "$check_mountpoints_mountpoint$"
                        description = "list of mountpoints to check. Ignored when -a is given"
                        skip_key = true
                        order = -1
                }
                "-m" = {
                        value = "$check_mountpoints_mtab$"
                        description = "Use this mtab instead (default: /proc/mounts)"
                }
                "-f" = {
                        value = "$check_mountpoints_fstab$"
                        description = "Use this fstab instead (default: /etc/fstab)"
                }
                "-N" = {
                        value = "$check_mountpoints_fs_field$"
                        description = "FS Field number in fstab (default: 3)"
                }
                "-M" = {
                        value = "$check_mountpoints_mount_field$"
                        description = "Mount Field number in fstab (default: 2)"
                }
                "-O" = {
                        value = "$check_mountpoints_option_field$"
                        description = "Option Field number in fstab (default: 4)"
                }
                "-T" = {
                        value = "$check_mountpoints_nfs_timeout$"
                        description = "Responsetime at which an NFS is declared as staled (default: 3)"
                }
                "-L" = {
                        set_if = "$check_mountpoints_softlinks$"
                        description = "Allow softlinks to be accepted instead of mount points"
                }
                "-i" = {
                        set_if = "$check_mountpoints_ignore_fstab$"
                        description = "Ignore fstab. Do not fail just because mount is not in fstab. (default: unset)"
                }
                "-a" = {
                        set_if = "$check_mountpoints_autoselect_mounts$"
                        description = "Autoselect mounts from fstab (default: unset)"
                }
                "-A" = {
                        set_if = "$check_mountpoints_fstab_autoselect$"
                        description = "Autoselect from fstab. Return OK if no mounts found. (default: unset)"
                }
                "-E" = {
                        value = "$check_mountpoints_exclude$"
                        description = "Use with -a or -A to exclude a path from fstab. Use backslash+pipe between paths fo
                }
                "-o" = {
                        set_if = "$check_mountpoints_ignore_noauto$"
                        description = "When autoselecting mounts from fstab, ignore mounts having noauto flag. (default: u
                }
                "-w" = {
                        set_if = "$check_mountpoints_writetest$"
                        description = "Writetest. Touch file $mountpoint/.mount_test_from_$(hostname) (default: unset)"
                }
                "-e" = {
                        value = "$check_mountpoints_extra$"
                        description = "Extra arguments for df (default: unset)"
                }
        }

        timeout = 2m

        vars.check_mountpoints_fstab_autoselect = true
        vars.check_mountpoints_ignore_noauto = true
}

Service template:

template Service "generic-service" {
  max_check_attempts = 5
  check_interval = 1m
  retry_interval = 30s
}

Apply definition:

apply Service "Mount Points" {
  import "generic-service"

  check_command = "_check_mountpoints"

  assign where host.vars.os == "Linux" && host.vars.check_mountpoints != false

  if (host.vars.client_endpoint) {
    command_endpoint = host.vars.client_endpoint
  }

  vars.check_mountpoints_nfs_timeout = 120

  check_timeout = 2m

  max_check_attempts = 10
  check_interval = 10m
  retry_interval = 2m
}

Host definition:

object Zone "client.host.name" {
  endpoints = [ "client.host.name" ]
  parent = "master.host.name"
}

object Endpoint "client.host.name" {
  host = "123.123.123.123"
}

object Host "client.host.name" {
  import "generic-host"
  address = "123.123.123.123"

  vars.os = "Linux"
  vars.distro = "Debian"

  vars.disks["disk"] = {
    disk_all = true
    disk_local = true
  }

  vars.notification["mail"] = {
    groups = [ "icingaadmins" ]
  }

  enable_notifications = false

  vars.client_endpoint = name

  # Host Variables
  vars.check_swap = false
  vars.procs_warning = "500"
  vars.procs_critical = "600"
  vars.zfs = true
  vars.mem_warning = "5"
  vars.mem_critical = 1
  # End Host Variables
}

Hm, that looks good.
Since check_timeout at the service level overrides the timeout of the check command, you don’t really need to set both, I’d say.

Are you really sure that the check is still being killed after 60 seconds? Do you see that time in the web interface?
Otherwise I would guess that the script is running even longer than two minutes. As the script timeout and the Icinga (command/service) timeout are the same length, you can’t really tell which one is responsible.
I would set the timeout of the script call (-T) a bit shorter than the command timeout and see if the output changes.
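
As a rough sketch of that suggestion, the apply rule could keep Icinga's limit at two minutes while passing the plugin a shorter -T value. The value 100 is only an illustrative assumption, not a recommendation; everything else is taken from the definitions posted above.

```
apply Service "Mount Points" {
  import "generic-service"

  check_command = "_check_mountpoints"

  if (host.vars.client_endpoint) {
    command_endpoint = host.vars.client_endpoint
  }

  // Plugin's own NFS timeout (-T), in seconds.
  // 100 is an assumed example value, chosen only to be below check_timeout.
  vars.check_mountpoints_nfs_timeout = 100

  // Icinga's timeout for this service, kept longer than -T
  check_timeout = 2m

  assign where host.vars.os == "Linux" && host.vars.check_mountpoints != false
}
```

If the plugin gives up first, its own message should show up in the check output instead of <Timeout exceeded.>, which makes it easier to see which limit was hit.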

What is the output of

icinga2 object list -n _check_mountpoints | grep timeout

Yes: when it times out, I see “Check execution time 60.002s” in the web interface.

It is correct:

SHARED root@cop ~# icinga2 object list -n _check_mountpoints | grep timeout
      * value = "$check_mountpoints_nfs_timeout$"
  * timeout = 120

Could the problem be that this is a remote check, executed through the Icinga client, so I must change the timeout value in the client configuration too?

Yes (I forgot to mention that you should run this command on your agent).

Yes, it’s 60 seconds:

root@stg1:~# icinga2 object list -n _check_mountpoints | grep timeout
      * value = "$check_mountpoints_nfs_timeout$"
  * timeout = 60

So I have to change this value in the client configuration too.

Now I have a question (maybe OT): is there a way to avoid syncing this configuration from master to client by hand?

Syncing is one of the nice features of Icinga 2. Hence, I’d recommend defining your check command on your master within a global zone that is synced to every agent. Within that definition you can set the default timeout (and override it for a particular service or host if needed).
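
A minimal sketch of that setup, under a few assumptions: the zone name "global-templates" and the file path in the comment are just common conventions, the agent's ApiListener must have accept_config = true for the sync to be accepted, and the CustomPluginsDir constant must also be defined on the agent because constants are not part of the synced zone.

```
// zones.conf on the master AND on every agent: declare the global zone.
// "global-templates" is an assumed name; any name works if it matches on both sides.
object Zone "global-templates" {
  global = true
}

// On the master only, for example in
// /etc/icinga2/zones.d/global-templates/commands.conf (assumed path)
object CheckCommand "_check_mountpoints" {
  command = [ CustomPluginsDir + "/check_mountpoints.sh" ]

  // arguments = { ... } as in the definition posted earlier in this thread

  // Default timeout for every service that uses this command
  timeout = 2m

  vars.check_mountpoints_fstab_autoselect = true
  vars.check_mountpoints_ignore_noauto = true
}
```

After a config reload on the master, running icinga2 object list -n _check_mountpoints | grep timeout on the agent again is a quick way to confirm that the synced value has arrived there.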