Cluster-zone check never executed

Hi everyone!

I have an issue with the agent health check: it never gets executed. I enabled debug logging to confirm that the check really never runs.

I have simply copy-pasted the snippet from the documentation (cluster-zone with Masters and Agents). The check also worked (or at least I assume it did) when running 2.10, but after the upgrade to 2.11 and the necessary config changes on my side it simply stopped working. I have gone through all the troubleshooting steps mentioned in the upgrade notes, but that didn’t help either.

I assume something is wrong, but I can’t seem to find what it might be.

Here’s my config:

icinga2 object list -n agent-health

Object 'my.agent.host!agent-health' of type 'Service':
  % declared in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 11:1-11:28
  * __name = "my.agent.host!agent-health"
  * action_url = ""
  * check_command = "cluster-zone"
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 12:3-12:32
  * check_interval = 300
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "cluster-health-my.agent.host"
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 14:3-14:46
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "my.agent.host"
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 11:1-11:28
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
  * name = "agent-health"
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 11:1-11:28
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 11:1-11:28
  * retry_interval = 60
  * source_location
    * first_column = 1
    * first_line = 11
    * last_column = 28
    * last_line = 11
    * path = "/etc/icinga2/zones.d/master/cluster-health.conf"
  * templates = [ "agent-health" ]
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 11:1-11:28
  * type = "Service"
  * vars
    * cluster_zone = "my.agent.host"
      % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 17:3-17:31
  * volatile = false
  * zone = "my.agent.host"
    % = modified in '/etc/icinga2/zones.d/master/cluster-health.conf', lines 11:1-11:28

zones.conf

object Zone "master" {
  endpoints = [ "my.master.host" ]
}

object Endpoint "my.master.host" {
  // That's us
}

object Endpoint "my.agent.host" {
  host = "127.0.0.1" // Localhost, because SSH tunnel
  port = 5666 // SSH tunnel port
  log_duration = 0 // Disable the replay log for command endpoint agents
}

object Zone "my.agent.host" {
  endpoints = [ "my.agent.host" ]

  parent = "master"
}

/* sync global commands */
object Zone "global-templates" {
  global = true
}
object Zone "director-global" {
  global = true
}

zones.d/master/cluster-health.conf

apply Service "agent-health" {
  check_command = "cluster-zone"

  display_name = "cluster-health-" + host.name

  /* This follows the convention that the agent zone name is the FQDN which is the same as the host object name. */
  vars.cluster_zone = host.name

  assign where host.vars.agent_endpoint
}

curl -k -s -u 'root:passwort' 'https://127.0.0.1:5665/v1/objects/services?service=my.agent.host!agent-health' | jq .

{
  "results": [
    {
      "attrs": {
        "__name": "my.agent.host!agent-health",
        "acknowledgement": 0,
        "acknowledgement_expiry": 0,
        "action_url": "",
        "active": true,
        "check_attempt": 1,
        "check_command": "cluster-zone",
        "check_interval": 300,
        "check_period": "",
        "check_timeout": null,
        "command_endpoint": "",
        "display_name": "cluster-health-my.agent.host",
        "downtime_depth": 0,
        "enable_active_checks": true,
        "enable_event_handler": true,
        "enable_flapping": false,
        "enable_notifications": true,
        "enable_passive_checks": true,
        "enable_perfdata": true,
        "event_command": "",
        "flapping": false,
        "flapping_current": 0,
        "flapping_last_change": 0,
        "flapping_threshold": 0,
        "flapping_threshold_high": 30,
        "flapping_threshold_low": 25,
        "force_next_check": false,
        "force_next_notification": false,
        "groups": [],
        "ha_mode": 0,
        "handled": false,
        "host_name": "my.agent.host",
        "icon_image": "",
        "icon_image_alt": "",
        "last_check": -1,
        "last_check_result": null,
        "last_hard_state": 3,
        "last_hard_state_change": 0,
        "last_reachable": true,
        "last_state": 3,
        "last_state_change": 0,
        "last_state_critical": 0,
        "last_state_ok": 0,
        "last_state_type": 0,
        "last_state_unknown": 0,
        "last_state_unreachable": 0,
        "last_state_warning": 0,
        "max_check_attempts": 3,
        "name": "agent-health",
        "next_check": 1592728914.1601052,
        "notes": "",
        "notes_url": "",
        "original_attributes": null,
        "package": "_etc",
        "paused": false,
        "previous_state_change": 0,
        "problem": true,
        "retry_interval": 60,
        "severity": 24,
        "source_location": {
          "first_column": 1,
          "first_line": 11,
          "last_column": 28,
          "last_line": 11,
          "path": "/etc/icinga2/zones.d/master/cluster-health.conf"
        },
        "state": 3,
        "state_type": 0,
        "templates": [
          "agent-health"
        ],
        "type": "Service",
        "vars": {
          "cluster_zone": "my.agent.host"
        },
        "version": 0,
        "volatile": false,
        "zone": "my.agent.host"
      },
      "joins": {},
      "meta": {},
      "name": "my.agent.host!agent-health",
      "type": "Service"
    }
  ]
}

Even forcing a reschedule does nothing; no check is ever performed.

Does anyone see the issue? Because I don’t.

Cheers
Steffen

Putting cluster-health.conf in /etc/icinga2/zones.d/master/ means this service is only active on your master, as /etc/icinga2/zones.d/master/ is not synced to your agents. Hence, they don’t know about this service. You would need to put cluster-health.conf into a global zone, e.g. /etc/icinga2/zones.d/global-templates/.

Hm, this is weird, because it’s kind of the opposite of what the official documentation says.

However, for testing I put the configuration into global-templates anyway, as you suggested. This caused all the checks to turn CRITICAL and report that the zones are supposedly not connected. Querying the master’s API, however, I can see that all zones are connected properly.

I don’t think putting the checks into global-templates is the right solution.

Do you have all your zones and endpoint objects in zones.conf only? 2.11 ignores all objects defined elsewhere.

Sorry that I didn’t make that clearer in my original post. Yes, all Zone and Endpoint declarations are done exclusively in /etc/icinga2/zones.conf.

The service object is in the wrong zone:

  • zone = "my.agent.host"

Based on this, I’d assume the host object is in the wrong zone as well, i.e. that the host object for my.agent.host is not in /etc/icinga2/zones.d/master/.

No, it isn’t. Every agent is also a zone, so the my.agent.host host object is in the my.agent.host zone, not in the master zone. The master zone, however, is the parent zone of all agents. According to the documentation, nesting zones within zones is not possible, so I have configured the agent zones “in parallel” to the master zone.

Isn’t that the correct configuration? What am I missing here?

The cluster-zone check has to be executed on/by the master itself, because only the master knows about the connection state.
So you have to put the check into the master directory and set vars.cluster_zone to the respective agents’ host/zone names.
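
For illustration only, a hedged sketch of what such a per-agent service in /etc/icinga2/zones.d/master/ could look like (it assumes a Host object named my.master.host exists in the master zone; all names are placeholders taken from this thread):

object Service "agent-health my.agent.host" {
  host_name = "my.master.host"   // attach the check to a host that lives in the master zone
  check_command = "cluster-zone"

  check_interval = 5m
  retry_interval = 1m

  /* The zone whose connection state the master should report on. */
  vars.cluster_zone = "my.agent.host"
}
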
edit:

Good to know, didn’t know that :ok_hand:

The host object for an agent belongs to its parent zone and needs to be placed in the corresponding directory, e.g. in your case /etc/icinga2/zones.d/master/. The Endpoint and Zone objects for an agent have to be placed in zones.conf only. Until 2.10 this was also allowed in the zones.d directories (I’m not sure whether this will return with 2.12).
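
For example, a hedged sketch of such a host object on the master (the file name, address and check command are assumptions; vars.agent_endpoint matches the assign rule in cluster-health.conf):

/etc/icinga2/zones.d/master/my.agent.host.conf

object Host "my.agent.host" {
  check_command = "hostalive"   // placeholder; use whatever host check fits your setup
  address = "127.0.0.1"         // assumption: reachable via the SSH tunnel, as in zones.conf

  /* Marks this host as monitored via an agent; used by the apply rules. */
  vars.agent_endpoint = name
}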

This will execute all checks from the parent, e.g. check_http, ping, hostalive or cluster-zone. To run a check locally on an agent, it is very common to add command_endpoint = host.name to the service definition, as in the sketch below.
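
A hedged sketch of such a locally executed check (disk is a CheckCommand from the ITL; command_endpoint = host.name assumes the Endpoint object has the same name as the Host object, as in this thread):

apply Service "disk" {
  check_command = "disk"

  /* Execute the check on the agent itself instead of on the master. */
  command_endpoint = host.name

  assign where host.vars.agent_endpoint
}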

You can even run cluster-zone on an agent to check the connection to its parent. To do so you need to set vars.cluster_zone = <parent_zone_name> in the service definition. I use this to monitor the satellite-to-parent connection from the satellites.
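
A hedged sketch of that variant, assuming the parent zone is called "master" and the check is again executed on the agent via command_endpoint:

apply Service "master-connection" {
  check_command = "cluster-zone"

  /* Run the check on the agent itself. */
  command_endpoint = host.name

  /* Seen from the agent, check the connection to its parent zone. */
  vars.cluster_zone = "master"

  assign where host.vars.agent_endpoint
}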

This is exactly what I did, but that doesn’t work somehow.

This still gives me headaches.
According to the distributed monitoring documentation, agent nodes have their own zone.

I think I misunderstood the documentation to mean that each zone (which an agent is) must have its own configuration directory in zones.d. The part where the troubleshooting section says that “nested zones” won’t work misled me into assuming that I may not put hosts that belong to agent zones (which in my mind I equated with the zones themselves) inside the zones.d/master directory.

However, I think it (kind of) makes sense now. Let me check the config and report here if that worked.

Yes, that was the fix. I moved the host config into zones.d/master and the cluster checks came back OK almost instantly, without breaking anything else.

Thanks @rsx and @log1c for your help!

Nope, that wasn’t it. Moving the agents’ host config into the master zone broke the local services: I didn’t have any service configuration on the agents any more. That in turn broke my passive checks (submitted against the agent API) because of the missing services.

Moving the config back, however, broke the cluster checks again. It’s a little strange that you can EITHER have the services locally (in the agent’s API) and be able to use passive checks, OR have the cluster checks working but no local services (which breaks the passive checks).

I wonder if this actually is a bug :thinking:

OK, so all I can do is execute the check for the “master” zone on my agents, but not check the agent zones from the master, because apparently the agents are not treated as child zones of the master.

I don’t know if the documentation is wrong, but it says “This example adds a health check for the ha master with agents scenario”, so maybe the “non-ha master with agents scenario” would be different?

Anyway, I’m using the “cluster” check for the master -> agent connection and “cluster-health” for the agent -> master connection.
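
For reference, a hedged sketch of the master -> agent side of that split (it assumes a Host object named my.master.host; the cluster check reports on all endpoint connections of the local node, while cluster-zone targets a single zone as in the sketches above):

object Service "cluster" {
  host_name = "my.master.host"
  check_command = "cluster"   // state of all endpoint connections seen from the master
}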

I think I’ll file a bug now about this to have the documentation updated. I still find it somewhat confusing.