One client simply stopped checking all "local" services

One Windows client simply stopped checking all “local” services. I cannot find any hint in the log files; the master just reports

notifications are disabled for service ‘…’

for every service assigned to this client.

All checks running on the master for that client are still working: ping and cluster-zone and even the check icinga is still running:

Icinga 2 has been running for 11 days, 22 hours, 39 minutes and 13 seconds. Version: v2.10.2

The output of icinga2.exe feature list on that client is identical to the others (I compared it against only one, but all are installed by the icinga2-powershell-module):

Disabled features: api command compatlog debuglog elasticsearch gelf graphite ido-mysql ido-pgsql influxdb livestatus opentsdb perfdata statusdata
Enabled features: checker mainlog notification

To me it looks like the scheduler stopped working. Maybe a bug?

Hi,

  • Is the client able to resolve its hostname on startup, or does it log about something not being able to, using ‘localhost’ instead?
  • Is the NodeName constant set in constants.conf?
  • Disabled notifications would mean that someone disabled them on the master, either via manual config edit, or at runtime.
  • How are these services checked, can you share some configuration details from icinga2 object list?
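
For reference, an explicit NodeName in the agent's constants.conf might look roughly like the sketch below. The FQDN is just the example name used in this thread, and the ZoneName line mirrors the stock default; both are assumptions to adjust to the real installation.

```
/* constants.conf on the agent (sketch, values are assumptions) */

/* Set the node name explicitly instead of relying on hostname resolution. */
const NodeName = "client.example.com"

/* Stock default: the agent's zone is named after the node. */
const ZoneName = NodeName
```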

Cheers,
Michael

BTW: Haven’t been doing anything except analyzing (i.e. no restart yet).

Last restart on master was 2019-02-13 10:35:54 and the clients reported properly i.e.:

[2019-02-13 10:36:03 +0100] information/ApiListener: Finished syncing endpoint ‘master.example.com’ in zone ‘example.com’.

Name resolving is still working.

On the master: yes. On the client: no (which means the default applies: const ZoneName = NodeName).

Maybe my dependency rule? However, there was no reason for it to do so (no relevant event in the history). The notifications were disabled at [2019-02-13 17:43:56 +0100].

Object 'client.example.com!win_network' of type 'Service':
  % declared in '/etc/icinga2/conf.d/win/network.conf', lines 1:0-1:26
  * __name = "client.example.com!win_network"
  * action_url = ""
  * check_command = "network-windows"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 3:4-3:36
  * check_interval = 300
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "client.example.com"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 4:4-4:31
  * display_name = "Windows Network"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 2:4-2:35
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "client.example.com"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 1:0-1:26
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
  * name = "win_network"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 1:0-1:26
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 1:0-1:26
  * retry_interval = 60
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 26
    * last_line = 1
    * path = "/etc/icinga2/conf.d/win/network.conf"
  * templates = [ "win_network" ]
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 1:0-1:26
  * type = "Service"
  * vars = null
  * volatile = false
  * zone = "example.com"
    % = modified in '/etc/icinga2/conf.d/win/network.conf', lines 1:0-1:26

Forgot to mention that all “local” services got the state “Unknown”. For win_network the first event was at 2019-02-13 17:41:56.

That configuration sounds odd. The service is part of the example.com zone, and command_endpoint = "client.example.com".

In order to get a better understanding, how’s that zone hierarchy built in zones.conf?

The hierarchy is just one master and several clients. The master’s zones.conf:

/*
 * Generated by Icinga 2 node setup commands
 * on 2019-01-15 14:13:02 +0100
 */

object Endpoint "master.example.com" {
}

object Zone "example.com" {
   endpoints = [ "master.example.com" ]
}

object Zone "global-templates" {
   global = true
}

object Zone "director-global" {
   global = true
}

object Zone "windows-commands" {
   global = true
}

All client related stuff is configured by the director (1.6.0).

My service definitions look like:

apply Service "win_network" {
   display_name = "Windows Network"
   check_command = "network-windows"
   command_endpoint = host_name

   assign where "Windows" in host.templates
}

That doesn’t match, I would expect the service apply rule path from inside the Director package.

That won’t evaluate well, I’d say this is written as host.name instead, isn’t it?

Cheers,
Michael

That sounds strange, I know, and I had a long, hard discussion with Lennart about this topic (and finally I could convince him). Background: I’ve been managing a default setup which is deployed one by one for every customer. Before Director 1.6.0 there was no option to create services on a “template machine” and distribute them to all customer sites. Therefore, we decided to go with a mix (which is not recommended, of course): hosts, host templates and some other objects are handled via the Director, but the service definitions live in conf files (distributed with deb packages). The “connection” between these two worlds is made by the assign rules.

And this setup works for all other clients, and it worked for the mentioned client until 5 days ago.

Unfortunately, it does, but I’ll follow Lennart’s advice to use host.name instead (I have not had enough time yet to replace it everywhere).

Hi,

still a setup hard to debug and troubleshoot. Since you’re saying that you tried to convince Lennart, you already know that this isn’t a long term solution and needs a proper configuration at some point in the future.

I’m curious what the host object for this service looks like. I suspect that there’s a mix in place between local check execution in a zone and the command endpoint triggered by the parent node.

icinga2 object list --type Host --name client.example.com

What happens in the debug logs on the master and the agent host when you force a re-check of this service?

Cheers,
Michael

Hi,

Yes, I totally agree, and once the basket functionality is stable enough and fits our needs, I’ll convert everything to the Director.

Object 'client.example.com' of type 'Host':
  % declared in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/hosts.conf', lines 257:1-257:38
  * __name = "client.example.com"
  * action_url = ""
  * address = "192.168.33.41"
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/hosts.conf', lines 262:5-262:26
  * address6 = ""
  * check_command = "hostalive4"
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/host_templates.conf', lines 6:5-6:32
  * check_interval = 300
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "client"
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/hosts.conf', lines 261:5-261:34
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
  * name = "client.example.com"
  * notes = ""
  * notes_url = ""
  * package = "director"
  * retry_interval = 60
  * source_location
    * first_column = 1
    * first_line = 257
    * last_column = 38
    * last_line = 257
    * path = "/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/hosts.conf"
  * templates = [ "client.example.com", "swtype1", "Windows", "Site" ]
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/hosts.conf', lines 257:1-257:38
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/host_templates.conf', lines 15:1-15:23
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/host_templates.conf', lines 5:1-5:23
    % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/host_templates.conf', lines 1:0-1:19
  * type = "Host"
  * vars
    * servertype = "kom"
      % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/hosts.conf', lines 263:5-263:35
    * ntp_server = "ntp1.example.com"
      % = modified in '/var/lib/icinga2/api/packages/director/cab35194-013c-4d22-98ea-ced9753892d7/zones.d/example.com/host_templates.conf', lines 2:5-2:39
  * volatile = false
  * zone = "example.com"

Unfortunately, I restarted the Icinga2 service this morning (and everything looks fine as expected).

Cheers,
Roland

Hi,

ok then think about changing the following:

  • Either the check should be executed by the agent itself, setting its zone to example.com
  • Or use the command endpoint execution bridge and leave the host’s zone to master, thus setting the service command_endpoint to the agent’s endpoint (with host.name in the static config, later via Director agent settings).
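
In static config, the second option could look roughly like the following. This is only a sketch based on the apply rule posted earlier in this thread, with host.name as the only change. It assumes an Endpoint object named exactly like the host exists, and that the object is owned by the agent's parent zone.

```
/* Sketch: command endpoint check scheduled by the parent zone,
 * executed on the agent. Names are taken from this thread. */
apply Service "win_network" {
   display_name = "Windows Network"
   check_command = "network-windows"

   /* Assumption: the agent's Endpoint object has the same name as the host. */
   command_endpoint = host.name

   assign where "Windows" in host.templates
}
```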

Cheers,
Michael

Hi,

I’m sorry, I’m getting confused. If the agent shall execute the check itself, I’d assume the zone should be client.example.com. However, with Icinga Agent Discussion in mind, this is not fully supported, thus, not recommended.

This is how it’s configured now.

Cheers,
Roland

Hi,

the responsible zone (“authoritative for this object”) for initiating a remote command endpoint check should always be the parent zone of an agent, e.g. master or satellite. In your example output from object list, the host explicitly sets the zone to example.com, not master.

This is a common error with checks not being executed, thus I am asking it. If you’ve changed that to master already, everything should be fine.

Cheers,
Michael

Hi,

the parent zone of that agent is example.com (and master is the endpoint of that zone). Thus, I’ve not changed anything and it’s working as expected.
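
For illustration, this relationship would typically be expressed with objects along the following lines. It is only a sketch: the agent objects are deployed by the Director in this setup and were not posted here, so the client entries are assumptions using the names from this thread.

```
/* Parent zone with the master as its only endpoint (as in the zones.conf above). */
object Endpoint "master.example.com" {
}

object Zone "example.com" {
   endpoints = [ "master.example.com" ]
}

/* Assumed agent objects: the agent's own zone points to its parent zone. */
object Endpoint "client.example.com" {
}

object Zone "client.example.com" {
   endpoints = [ "client.example.com" ]
   parent = "example.com"
}
```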

Cheers,
Roland