Manual check vs auto check issue

Hi, I am not sure what I am doing wrong. I have a distributed setup where configuration is done on the master server and sent to the clients. As I mentioned before, I didn't do the setup myself, but I have been managing it for a while and I am starting to understand how things work in Icinga thanks to the documentation and you guys. It is a bit complex, but very powerful.

So I am monitoring load average. The default values are:

load_wload1 = 5
load_wload5 = 4
load_wload15 = 3
load_cload1 = 10
load_cload5 = 6
load_cload15 = 4

I have a big server with more CPUs, so I want to raise these values because I am getting notifications for load that is not really a problem. So I overrode the values with:

load_wload1 = 10
load_wload5 = 8
load_wload15 = 6
load_cload1 = 20
load_cload5 = 12
load_cload15 = 8

However, Icinga is using the default values, as you can see in the following image:

[image: load1]

When I click on "Check now", the values are what I want them to be, as you can see here:

[image: load2]

The configuration

load.conf

apply Service "load" {
  import "generic-service"

  check_command = "load"

  vars.load_wload1 = host.vars.load_wload1
  vars.load_wload5 = host.vars.load_wload5
  vars.load_wload15 = host.vars.load_wload15
  vars.load_cload1 = host.vars.load_cload1
  vars.load_cload5 = host.vars.load_cload5
  vars.load_cload15 = host.vars.load_cload15

  retry_interval = 2m

  assign where host.address
}
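
As a side note on the rule above: Icinga 2 resolves runtime macros such as $load_wload1$ by looking at the service's custom variables first, then the host's, then the CheckCommand's defaults. Under that lookup order, copying host.vars.load_* into the service should not be needed. A minimal sketch of the same rule without the copies follows; it is only an illustration and does not by itself explain the mismatch described in this thread:

```icinga2
apply Service "load" {
  import "generic-service"

  check_command = "load"
  retry_interval = 2m

  // No vars.load_* assignments here: $load_wload1$ and friends are
  // looked up on the service, then on the host, then on the
  // CheckCommand "load" defaults.

  assign where host.address
}
```

With this variant, a host that sets vars.load_wload1 = 10 supplies the threshold, and a host that sets nothing falls back to the command's default.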

The generic service:

template Service "generic-service" {
  max_check_attempts = 5
  check_interval = 5m
  retry_interval = 30s
  command_endpoint = host.name
}

The host definition:

object Endpoint "host.foo.bar" {
    host = "abc.abc.abc.abc"
}

object Host "host.foo.bar"  {
    import "generic-host"

    address = "abc.abc.abc.abc"
    vars.ssh_port = 3784

    vars.os = "FreeBSD"

    # Load average values:
    vars.load_wload1 = 10
    vars.load_wload5 = 8
    vars.load_wload15 = 6
    vars.load_cload1 = 20
    vars.load_cload5 = 12
    vars.load_cload15 = 8

}

object Zone "host.foo.bar" {
    endpoints = [ "host.foo.bar", ]
    parent = "master"
}

Any idea what I might be doing wrong? Thanks a lot.

I still don't understand what is happening. However, I found a few things:

  • The default values are defined in: /usr/local/share/icinga2/include/command-plugins.conf
  • The problem only happens on some servers, not all. The host definitions are the same; the only things that change with respect to load monitoring are the variables like wload1, etc.
  • I only have issues with FreeBSD servers (the majority), and not with Linux.
  • I have the hosts defined in zones.d/master/staging-hosts.conf and zones.d/master/production-hosts.conf. The configuration is the same, but I have no issues with the staging servers. Could it be related to the order in which the files are included?
  • The variables on the hosts keep the default values, both on hosts that work fine and on those that don't.

Here is what I have done:

I inspected the load CheckCommand object on the master server. It has the default values.

Object 'load' of type 'CheckCommand':
  % declared in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1665:1-1665:26
  * __name = "load"
  * arguments
    % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1668:2-1681:2
    * -c
      * description = "Exit with CRITICAL status if load average exceed CLOADn; the load average format is the same used by 'uptime' and 'w'"
      * value = "$load_cload1$,$load_cload5$,$load_cload15$"
    * -r
      * description = "Divide the load averages by the number of CPUs (when possible)"
      * set_if = "$load_percpu$"
    * -w
      * description = "Exit with WARNING status if load average exceeds WLOADn"
      * value = "$load_wload1$,$load_wload5$,$load_wload15$"
  * command = [ "/usr/local/libexec/nagios/check_load" ]
    % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1666:2-1666:40
  * env = null
  * execute
    % = modified in 'methods-itl.conf', lines 19:3-19:23
    * arguments = [ "checkable", "cr", "resolvedMacros", "useResolvedMacros" ]
    * deprecated = false
    * name = "Internal#PluginCheck"
    * side_effect_free = false
    * type = "Function"
  * name = "load"
  * package = "_etc"
  * source_location
    * first_column = 1
    * first_line = 1665
    * last_column = 26
    * last_line = 1665
    * path = "/usr/local/share/icinga2/include/command-plugins.conf"
  * templates = [ "load", "plugin-check-command" ]
    % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1665:1-1665:26
    % = modified in 'methods-itl.conf', lines 18:2-18:94
  * timeout = 60
  * type = "CheckCommand"
  * vars
    * load_cload1 = 10
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1687:2-1687:24
    * load_cload15 = 4
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1689:2-1689:24
    * load_cload5 = 6
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1688:2-1688:23
    * load_percpu = false
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1691:2-1691:25
    * load_wload1 = 5
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1683:2-1683:23
    * load_wload15 = 3
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1685:2-1685:24
    * load_wload5 = 4
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1684:2-1684:23
  * zone = ""

From this output I learned about /usr/local/share/icinga2/include/command-plugins.conf, where I found the default values defined:

<...>
object CheckCommand "load" {
        command = [ PluginDir + "/check_load" ]

        arguments = {
                "-w" = {
                        value = "$load_wload1$,$load_wload5$,$load_wload15$"
                        description = "Exit with WARNING status if load average exceeds WLOADn"
                }
                "-c" = {
                        value = "$load_cload1$,$load_cload5$,$load_cload15$"
                        description = "Exit with CRITICAL status if load average exceed CLOADn; the load average format is the same used by 'uptime' and 'w'"
                }
                "-r" = {
                        set_if = "$load_percpu$"
                        description = "Divide the load averages by the number of CPUs (when possible)"
                }
        }

        vars.load_wload1 = 5.0
        vars.load_wload5 = 4.0
        vars.load_wload15 = 3.0

        vars.load_cload1 = 10.0
        vars.load_cload5 = 6.0
        vars.load_cload15 = 4.0

        vars.load_percpu = false
}
<...>

However, when I inspect the host object on the master, I can see that the values have been set as expected.

Object 'hostx.foo.bar' of type 'Host':
  % declared in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 198:1-198:36
  * __name = "hostx.foo.bar"
  * action_url = ""
  * address = "xyz.xyz.xyz.xyz"
    % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 201:5-201:29
  * address6 = ""
  * check_command = "hostalive"
    % = modified in '/usr/local/etc/icinga2/conf.d/templates/generic-host.conf', lines 6:3-6:29
  * check_interval = 60
    % = modified in '/usr/local/etc/icinga2/conf.d/templates/generic-host.conf', lines 3:3-3:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "hostx.foo.bar"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 2
    % = modified in '/usr/local/etc/icinga2/conf.d/templates/generic-host.conf', lines 2:3-2:24
  * name = "hostx.foo.bar"
  * notes = ""
  * notes_url = ""
  * package = "_etc"
  * retry_interval = 30
    % = modified in '/usr/local/etc/icinga2/conf.d/templates/generic-host.conf', lines 4:3-4:22
  * source_location
    * first_column = 1
    * first_line = 198
    * last_column = 36
    * last_line = 198
    * path = "/usr/local/etc/icinga2/zones.d/master/production-host.conf"
  * templates = [ "hostx.foo.bar", "generic-host" ]
    % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 198:1-198:36
    % = modified in '/usr/local/etc/icinga2/conf.d/templates/generic-host.conf', lines 1:0-1:27
  * type = "Host"
  * vars
    * disks
      * disk /
        % = modified in '/usr/local/etc/icinga2/conf.d/templates/generic-host.conf', lines 8:3-10:3
        * disk_partitions = "/"
    * load_cload1 = 20
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 211:5-211:25
    * load_cload15 = 8
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 213:5-213:25
    * load_cload5 = 12
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 212:5-212:25
    * load_wload1 = 10
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 208:1-208:21
    * load_wload15 = 6
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 210:5-210:25
    * load_wload5 = 8
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 209:5-209:24
    * os = "FreeBSD"
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 204:5-204:23
    * ssh_port = 2389
      % = modified in '/usr/local/etc/icinga2/zones.d/master/production-host.conf', lines 202:5-202:24
  * volatile = false
  * zone = "master"

However, when I retrieve the load object from the client, I get the default values.

Object 'load' of type 'CheckCommand':
  % declared in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1665:1-1665:26
  * __name = "load"
  * arguments
    % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1668:2-1681:2
    * -c
      * description = "Exit with CRITICAL status if load average exceed CLOADn; the load average format is the same used by 'uptime' and 'w'"
      * value = "$load_cload1$,$load_cload5$,$load_cload15$"
    * -r
      * description = "Divide the load averages by the number of CPUs (when possible)"
      * set_if = "$load_percpu$"
    * -w
      * description = "Exit with WARNING status if load average exceeds WLOADn"
      * value = "$load_wload1$,$load_wload5$,$load_wload15$"
  * command = [ "/usr/local/libexec/nagios/check_load" ]
    % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1666:2-1666:40
  * env = null
  * execute
    % = modified in 'methods-itl.conf', lines 19:3-19:23
    * arguments = [ "checkable", "cr", "resolvedMacros", "useResolvedMacros" ]
    * deprecated = false
    * name = "Internal#PluginCheck"
    * side_effect_free = false
    * type = "Function"
  * name = "load"
  * package = "_etc"
  * source_location
    * first_column = 1
    * first_line = 1665
    * last_column = 26
    * last_line = 1665
    * path = "/usr/local/share/icinga2/include/command-plugins.conf"
  * templates = [ "load", "plugin-check-command" ]
    % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1665:1-1665:26
    % = modified in 'methods-itl.conf', lines 18:2-18:94
  * timeout = 60
  * type = "CheckCommand"
  * vars
    * load_cload1 = 10
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1687:2-1687:24
    * load_cload15 = 4
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1689:2-1689:24
    * load_cload5 = 6
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1688:2-1688:23
    * load_percpu = false
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1691:2-1691:25
    * load_wload1 = 5
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1683:2-1683:23
    * load_wload15 = 3
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1685:2-1685:24
    * load_wload5 = 4
      % = modified in '/usr/local/share/icinga2/include/command-plugins.conf', lines 1684:2-1684:23
  * zone = ""

The default values inside the CheckCommand object are ok.
You have to take a look at the client host and service object and what values are set there.

Do you have

    vars.load_wload1 = x
    vars.load_wload5 = x
    vars.load_wload15 = x
    vars.load_cload1 = x
    vars.load_cload5 = x
    vars.load_cload15 = x

set for (or in) every host object that shows this strange problem?

It still does not make sense (to me) that the values differ if executed by “check now”…

Doing this, I found that the problematic hosts have more configuration than the non-problematic ones. I will come back to this later in this post.

yes


So I started querying for objects locally on problematic and non-problematic hosts, and I found this: the problematic hosts have local service definitions, while the non-problematic hosts have none.

Objects on a problematic host:

Object 'host1.foo.bar!ping6' of type 'Service':
Object 'host1.foo.bar!procs' of type 'Service':
Object 'host1.foo.bar!icinga' of type 'Service':
Object 'host1.foo.bar!ssh' of type 'Service':
Object 'host1.foo.bar!users' of type 'Service':
Object 'host1.foo.bar!disk /' of type 'Service':
Object 'host1.foo.bar!http' of type 'Service':
Object 'host1.foo.bar!swap' of type 'Service':
Object 'host1.foo.bar!load' of type 'Service':
Object 'host1.foo.bar!disk' of type 'Service':
Object 'host1.foo.bar!ping4' of type 'Service':

So we have host1.foo.bar!load. My guess is that Icinga is taking the local values instead of the master ones. (I guess this should be an issue for other services as well.)

A similar thing happens when I query for hosts. There is a local host definition on the problematic servers and none on the non-problematic servers.

Host object on a problematic host:

Object 'host1.foo.bar' of type 'Host':
  % declared in '/usr/local/etc/icinga2/conf.d/hosts.conf', lines 18:1-18:20
  * __name = "host1.foo.bar"
  * action_url = ""
  * address = "127.0.0.1"
...

Now, when I query for endpoints, I find that each problematic server only has an endpoint definition for itself and not for the master host:

Object 'host1.foo.bar' of type 'Endpoint':
  % declared in '/usr/local/etc/icinga2/zones.conf', lines 12:1-12:24
  * __name = "host1.foo.bar"
  * host = ""
  * log_duration = 86400
  * name = "host1.foo.bar"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 12
    * last_column = 24
    * last_line = 12
    * path = "/usr/local/etc/icinga2/zones.conf"
  * templates = [ "host1.foo.bar" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 12:1-12:24
  * type = "Endpoint"
  * zone = ""

On a non-problematic host, I have endpoint definitions for both the host itself and the master host:

Object 'host2.foo.bar' of type 'Endpoint':
  % declared in '/usr/local/etc/icinga2/zones.conf', lines 12:1-12:24
  * __name = "host2.foo.bar"
  * host = ""
  * log_duration = 86400
  * name = "host2.foo.bar"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 12
    * last_column = 24
    * last_line = 12
    * path = "/usr/local/etc/icinga2/zones.conf"
  * templates = [ "host2.foo.bar" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 12:1-12:24
  * type = "Endpoint"
  * zone = ""

Object 'master.foo.bar' of type 'Endpoint':
  % declared in '/usr/local/etc/icinga2/zones.conf', lines 1:0-1:38
  * __name = "master.foo.bar"
  * host = ""
  * log_duration = 86400
  * name = "master.foo.bar"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 38
    * last_line = 1
    * path = "/usr/local/etc/icinga2/zones.conf"
  * templates = [ "master.foo.bar" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 1:0-1:38
  * type = "Endpoint"
  * zone = ""

Just in case: querying for zones gives the same result on problematic and non-problematic hosts. All of them have master, global-templates and the host itself.

Object 'master' of type 'Zone':
  % declared in '/usr/local/etc/icinga2/zones.conf', lines 4:1-4:20
  * __name = "master"
  * endpoints = [ "master.foo.bar" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 5:2-5:40
  * global = false
  * name = "master"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 4
    * last_column = 20
    * last_line = 4
    * path = "/usr/local/etc/icinga2/zones.conf"
  * templates = [ "master" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 4:1-4:20
  * type = "Zone"
  * zone = ""

Object 'global-templates' of type 'Zone':
  % declared in '/usr/local/etc/icinga2/zones.conf', lines 8:1-8:30
  * __name = "global-templates"
  * endpoints = null
  * global = true
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 9:2-9:14
  * name = "global-templates"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 8
    * last_column = 30
    * last_line = 8
    * path = "/usr/local/etc/icinga2/zones.conf"
  * templates = [ "global-templates" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 8:1-8:30
  * type = "Zone"
  * zone = ""

Object 'host.foo.bar' of type 'Zone':
  % declared in '/usr/local/etc/icinga2/zones.conf', lines 15:1-15:20
  * __name = "host.foo.bar"
  * endpoints = [ "host.foo.bar" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 16:2-16:25
  * global = false
  * name = "host.foo.bar"
  * package = "_etc"
  * parent = "master"
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 17:2-17:18
  * source_location
    * first_column = 1
    * first_line = 15
    * last_column = 20
    * last_line = 15
    * path = "/usr/local/etc/icinga2/zones.conf"
  * templates = [ "host.foo.bar" ]
    % = modified in '/usr/local/etc/icinga2/zones.conf', lines 15:1-15:20
  * type = "Zone"
  * zone = ""
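
Read together, the endpoint and zone listings from the non-problematic host are consistent with an agent zones.conf along the following lines. This is a reconstruction from the object list output, not the actual file. host2.foo.bar stands in for the agent's own name, and connection attributes such as host are left out because the listings show them empty:

```icinga2
// Sketch of zones.conf on a working agent (reconstructed, not the real file)

object Endpoint "master.foo.bar" { }

object Zone "master" {
  endpoints = [ "master.foo.bar" ]
}

object Zone "global-templates" {
  global = true
}

// The agent's own endpoint and zone; the name is a placeholder.
object Endpoint "host2.foo.bar" { }

object Zone "host2.foo.bar" {
  endpoints = [ "host2.foo.bar" ]
  parent = "master"
}
```

On the problematic hosts, the listings above show the agent's own Endpoint but no Endpoint object for master.foo.bar, even though the master Zone refers to it.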

So for some reason the client hosts are keeping local configuration, and this is causing the problem. I have to look into how to get rid of it.
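
One common way to stop an agent from loading its own Host and Service objects is to disable the conf.d include in its main configuration file. A minimal sketch, assuming the agent's main file is /usr/local/etc/icinga2/icinga2.conf (inferred from the paths in the listings above, not stated in this thread) and that it still contains the stock include line:

```icinga2
// icinga2.conf on the agent only (file location is an assumption)

/* Stock line that loads local objects such as conf.d/hosts.conf.
 * Commented out so that only configuration received from the
 * master is used on this node. */
// include_recursive "conf.d"
```

Anything in conf.d that the agent still needs would have to be moved elsewhere first. Running icinga2 daemon -C validates the configuration before the service is restarted. This is not necessarily how the issue in this thread was fixed.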

I tried removing a problematic host from monitoring and adding it again to see if this fixes the issue, but had no success. I think I might not have removed the host completely.

Thanks a lot.

So I resolved this issue by removing the hosts and adding them again.