Cannot execute checks using command_endpoint after 2.11 upgrade

I have a single master setup with multiple hosts in multiple zones. File and log dumps will be at the bottom of the question.

I’ve read through the troubleshooting documents available for upgrading from 2.10 to 2.11, and I’m just too dumb to figure it out I guess. I have each zone configured in zones.conf with the relevant endpoints and parent = "master". I have the master zone also configured, as well as a global “global-templates” zone.

In zones.d, I have a folder for each zone (which contains host configuration for each host in that zone).

In my “master” zone folder, I have everything from conf.d. In my “global-templates” zone folder, I have a few template files that describe host types. Looking at the debug log, I’m seeing the files propagate to the clients just fine when I restart icinga2 on the master. When I run commands, however, it appears (from the debug log) that nothing’s actually happening.

What am I missing? It’s gotta be something dumb that I’m missing, but I’m not sure what it is. I’m going to post some file info tracing a single check in a single node. I want to be able to do a disk check on one of my mongo hosts. From what I’m seeing, the command_endpoint is defined via a series of imports (defined in generic-service, imported to mail-service) and the host.vars.agent_endpoint is defined the same way (defined in satellite.conf, imported in host object definition). I’ve also tried replacing all instances of “agent_endpoint” with “client_endpoint” and that hasn’t worked either.

zones.conf on master:

object Endpoint NodeName {
  host = NodeName
}

object Zone "master" {
  endpoints = [ NodeName ]
}
object Zone "global-templates" {
  global = true
}
object Endpoint "mdb-03.example.com" {
        host = "<snip>"
}
object Zone "mongozone2" {
        endpoints = [ "mdb-03.example.com" ]
        parent = "master"
}

zones.d/mongozone2/mdb-03.example.com.conf (not the full file, just the relevant bits):

object Host "mdb-03.example.com" {
    import "satellite-host"
    address = "<snip>"
    vars.disks["backup"] = {
      disk_partition = "/backup"
    }
}

zones.d/global-templates/satellite.conf:

template Host "satellite-host" {
    (unimportant definitions)
    vars.agent_endpoint = name # Have also tried with client_endpoint
    vars.disks["disk"] = {
    }
}

zones.d/global-templates/services2.conf:

apply Service for (disk => config in host.vars.disks) {
    import "mail-service"
    check_command = "disk"
    vars += config
    assign where host.vars.disks && host.vars.agent_endpoint
}

zones.d/global-templates/templates.conf:

template Service "mail-service" { 
    import "generic-service" 
    (unimportant definitions)
}
template Service "generic-service" {
    (unimportant definitions)
    command_endpoint = host.vars.agent_endpoint
}

zones.conf ON CLIENT:

object Endpoint "monitor.example.com" {
        host = "monitor.example.com"
        port = "5665"
}
object Zone "mongozone2" {
        endpoints = [ "monitor.example.com" ]
}
object Endpoint NodeName {
        host = "<snip>"
}
object Zone "global-templates" { # Add global templates zone
        global = true
}
object Zone ZoneName {
        endpoints = [ NodeName ]
        parent = "mongozone2"
}

Log entries when a check is forced:

/var/log/icinga2/icinga2.log ON HOST:

[2019-10-03 17:11:58 -0400] information/ExternalCommandListener: Executing external command: [1570137118] SCHEDULE_FORCED_SVC_CHECK;mdb-03.example.com;backup;1570137118

debuglog ON HOST:

[2019-10-03 17:16:15 -0400] information/ExternalCommandListener: Executing external command: [1570137375] SCHEDULE_FORCED_SVC_CHECK;mdb-03.example.com;backup;1570137375
[2019-10-03 17:16:15 -0400] notice/ExternalCommandProcessor: Rescheduling next check for service 'backup'

debuglog ON CLIENT:

[2019-10-03 17:17:11 -0400] notice/JsonRpcConnection: Received 'event::SetForceNextCheck' message from identity 'monitor.example.com'.
[2019-10-03 17:17:11 -0400] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'monitor.example.com'.
[2019-10-03 17:17:11 -0400] notice/ApiListener: Relaying 'event::SetForceNextCheck' message
[2019-10-03 17:17:11 -0400] notice/ApiListener: Relaying 'event::SetNextCheck' message

Host Object Definition:

Object 'mdb-03.example.com' of type 'Host':
  % declared in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 1:0-1:46
  * __name = "mdb-03.example.com"
  * action_url = ""
  * address = "<snip>
    % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 3:3-3:24
  * address6 = ""
  * check_command = "hostalive"
    % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 19:3-19:29
  * check_interval = 60
    % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 13:3-13:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = ""
  * display_name = "mdb-03.example.com"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
    % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 12:3-12:24
  * name = "mdb-03.example.com"
  * notes = ""
  * notes_url = ""
  * package = "_etc"
  * retry_interval = 60
    % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 14:3-14:21
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 46
    * last_line = 1
    * path = "/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf"
  * templates = [ "mdb-03.example.com", "satellite-host" ]
    % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 1:0-1:46
    % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 11:1-11:30
  * type = "Host"
  * vars
    * agent_endpoint = "mdb-03.example.com"
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 28:3-28:28
    * disks
      * backup
        % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 15:3-17:3
        * disk_partition = "/backup"
      * disk
        % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 30:3-31:3
        % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 18:3-20:3
        * disk_partitions = [ "/", "/dev", "/dev/shm", "/sys/fs/cgroup", "/boot", "/boot/efi", "/usr/local", "/home", "/tmp", "/var", "/var/log", "/var/log/audit" ]
      * mongo
        % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 12:3-14:3
        * disk_partition = "/var/lib/mongo"
    * hosttype = "mdb"
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 11:3-11:23
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 21:3-21:23
    * load_cload1 = 40
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 8:3-8:23
    * load_cload15 = 40
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 10:3-10:24
    * load_cload5 = 40
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 9:3-9:23
    * load_wload1 = 20
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 5:3-5:23
    * load_wload15 = 30
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 7:3-7:24
    * load_wload5 = 30
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 6:3-6:23
    * mdb_address = "<snip>
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 4:3-4:33
    * notification
      * mail
        % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 32:3-34:3
        * groups = [ "<snip>" ]
    * ntp_address = "<snip>
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 20:3-20:33
    * ntp_critical = 10
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 22:3-22:24
    * ntp_timeout = 30
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 23:3-23:23
    * ntp_warning = 5
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 21:3-21:22
    * os = "Linux"
      % = modified in '/etc/icinga2/zones.d/mongozone2/mdb-03.example.com.conf', lines 22:3-22:19
    * ping_cpl = 60
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 27:3-27:20
    * ping_crta = 500
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 26:3-26:24
    * ping_wpl = 20
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 25:3-25:20
    * ping_wrta = 100
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 24:3-24:24
    * testing = "mdb-03.example.com"
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 29:3-29:21
    * users_cgreater = 5
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 17:3-17:25
    * users_wgreater = 3
      % = modified in '/etc/icinga2/zones.d/master/satellite.conf', lines 16:3-16:25
  * volatile = false
  * zone = "mongozone2"

Service object definition:

Object 'mdb-03.example.com!backup' of type 'Service':
  % declared in '/etc/icinga2/zones.d/master/services_additional.conf', lines 1:0-1:52
  * __name = "mdb-03.example.com!backup"
  * action_url = ""
  * check_command = "disk"
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 4:3-4:24
  * check_interval = 60
    % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 72:3-72:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "mdb-03.example.com"
    % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 75:3-75:45
  * display_name = "backup"
  * enable_active_checks = true
  * enable_event_handler = true
  * enable_flapping = false
  * enable_notifications = true
  * enable_passive_checks = true
  * enable_perfdata = true
  * event_command = ""
  * flapping_threshold = 0
  * flapping_threshold_high = 30
  * flapping_threshold_low = 25
  * groups = [ ]
  * host_name = "mdb-03.example.com"
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 1:0-1:52
  * icon_image = ""
  * icon_image_alt = ""
  * max_check_attempts = 3
    % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 71:3-71:24
  * name = "backup"
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 1:0-1:52
  * notes = ""
  * notes_url = ""
  * package = "_etc"
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 1:0-1:52
  * retry_interval = 30
    % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 73:3-73:22
  * source_location
    * first_column = 0
    * first_line = 1
    * last_column = 52
    * last_line = 1
    * path = "/etc/icinga2/zones.d/master/services_additional.conf"
  * templates = [ "backup", "mail-service", "generic-service" ]
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 1:0-1:52
    % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 100:1-100:31
    % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 70:1-70:34
  * type = "Service"
  * vars
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 6:3-6:16
    * disk_partition = "/backup"
    * interval = 1800
      % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 74:3-74:21
      % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 104:3-104:21
    * notification
      * mail
        % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 105:3-107:3
        * groups = [ "<snip>" ]
    * notification_type = "mail"
      % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 102:3-102:33
    * period = "24x7"
      % = modified in '/etc/icinga2/zones.d/master/templates.conf', lines 103:3-103:22
  * volatile = false
  * zone = "mongozone2"
    % = modified in '/etc/icinga2/zones.d/master/services_additional.conf', lines 1:0-1:52

A little more info: Using the API I can see that the check source is indeed wrong on the agent. How do I change this? I thought command_endpoint would change it, but apparently not.

curl -k -s -u root:icinga -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST 'https://localhost:5665/v1/objects/services' -d '{ "filter": "regex(pattern, service.name)", "filter_vars": { "pattern": "^backup" }, "attrs": [ "__name", "last_check_result" ], "pretty": true }'

{
    "attrs": {
        "__name": "mdb-03.example.com!backup",
        "last_check_result": {
            "active": true,
            "check_source": "monitor.example.com",

icinga2 object list --name *backup* --type Service

Object 'mdb-03.example.com!backup' of type 'Service':
  % declared in '/etc/icinga2/zones.d/global-templates/services_additional.conf', lines 1:0-1:52
  * __name = "mdb-03.example.com!backup"
  * action_url = ""
  * check_command = "disk"
    % = modified in '/etc/icinga2/zones.d/global-templates/services_additional.conf', lines 4:3-4:24
  * check_interval = 60
    % = modified in '/etc/icinga2/zones.d/global-templates/templates.conf', lines 72:3-72:21
  * check_period = ""
  * check_timeout = null
  * command_endpoint = "mdb-03.example.com"

Instead of

command_endpoint = host.vars.agent_endpoint

I’d try this

command_endpoint = host.name

That’s defined in the template host

template Host "satellite-host" {
    ...
    vars.agent_endpoint = name
    ...

}

To update the question, I’ve actually had a bit of success.

For each host, I’ve modified the zones.conf to match the current config, with “master” being the master zone and corrected the zone name. I’ve also added the endpoint for the other host(s) in the zone (since as far as I recall zones cannot have more than 2 hosts each), and that appears to have fixed it. I’m getting connection errors between the hosts for some reason, but at least I’m getting check responses. If I can figure out the cross-connectivity errors I’ll report back and mark it solved.
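For anyone following along, here is a minimal sketch of what a corrected agent-side zones.conf could look like for mdb-03. It simply mirrors the zone names from the master's zones.conf in the question ("master" and "mongozone2"); it is an illustration under that assumption, not the exact file that ended up on the host.

```icinga2
// zones.conf on the agent mdb-03.example.com (sketch, not the deployed file)

// The master this agent reports to.
object Endpoint "monitor.example.com" {
  host = "monitor.example.com"
  port = "5665"
}

// The master endpoint belongs to the "master" zone, same name as on the master.
object Zone "master" {
  endpoints = [ "monitor.example.com" ]
}

// The agent itself.
object Endpoint "mdb-03.example.com" {
}

// Same zone name and membership as on the master, with "master" as parent.
object Zone "mongozone2" {
  endpoints = [ "mdb-03.example.com" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}
```

Compared with the client file in the question, the main change is that monitor.example.com now sits in a zone called "master" instead of in "mongozone2", so both sides agree on zone names and memberships.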

EDIT: Looks like the agents are scheduling checks for each other, which is obviously not correct. Probably because of the way I’ve got all this stuff configured. Just in case anyone else is following my progress on this post.

Had to take a break, but I’m still struggling with this. At this point, it appears that if I add a zone with more than one endpoint in it, both endpoints start checking each other. I’m about to clear out icinga entirely and reinstall from scratch; I’m having a devil of a time understanding the documentation and figuring out how to port what we used to do to what we need to do. If anyone has any input, or can think of anything to check or any information I can provide, please help. If there’s an IRC or Discord or somewhere I can talk in realtime as well, that would be great.

A question in this regard: is mongozone2 a satellite which checks agents itself, or is it really the agent on which the disk checks are run?

If the latter is true, the host object is located in the wrong zone. It should be put into zones.d/master, which allows the checks to be run via command_endpoint.
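As a rough sketch of that suggestion, using the host from the question: the host object moves out of zones.d/mongozone2 and into the master zone, while the service keeps command_endpoint pointing at the agent. The file name hosts.conf is only an example, and the template contents are assumed to be as posted in the question.

```icinga2
// zones.d/master/hosts.conf on the master (file name is an example)

object Host "mdb-03.example.com" {
  import "satellite-host"        // template from the question; sets vars.agent_endpoint = name
  address = "<snip>"

  vars.disks["backup"] = {
    disk_partition = "/backup"
  }
}

// The apply rule from the question is unchanged. Through "generic-service" it ends up with
//   command_endpoint = host.vars.agent_endpoint
// which resolves to "mdb-03.example.com", so an Endpoint object with that name
// has to exist in zones.conf, in a zone whose parent is "master".
```

With the host in the master zone, the master schedules the check and hands its execution to the agent, rather than expecting the agent's own zone to schedule it.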

Cheers,
Michael

Just some generic things:

Is the “checker” feature enabled on the satellite and on the command endpoint?
Does icinga2 daemon -C report that it is ignoring zones on any of the hosts?

If you define more than one endpoint in a zone, you need to add command_endpoint for the local checks (like check_disk) of each endpoint. If you have only one endpoint in a zone, you don’t need the command_endpoint.
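A small sketch of the two-endpoint case, modelled on mongozone1 from this thread. The full endpoint names are assumed by analogy with mdb-03.example.com, and the apply rule is a simplified stand-in for the one in the question.

```icinga2
// zones.conf (sketch): one zone with two endpoints
object Endpoint "mdb-01.example.com" {
}

object Endpoint "mdb-02.example.com" {
}

object Zone "mongozone1" {
  endpoints = [ "mdb-01.example.com", "mdb-02.example.com" ]
  parent = "master"
}

// Simplified service rule. Without command_endpoint, a check that belongs to
// this zone may be executed by either endpoint, so a local check such as
// "disk" could report on the wrong machine. Pinning it to the host's own
// endpoint avoids that.
apply Service "disk" {
  check_command = "disk"
  command_endpoint = host.vars.agent_endpoint   // set to the host name by the "satellite-host" template
  assign where host.vars.agent_endpoint
}
```

As far as I understand it, endpoints that share a zone also expect to be able to connect to each other, which would fit the connection errors mentioned earlier in the thread.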

@dnsmichi sorry, I’m having a hard time explaining my setup fully.

We have a 3-node mongo cluster. When I set up icinga2, zones were only supposed to have 2 nodes each (I can’t recall why), so I split the 3 nodes as follows:

mongozone1: Contains mdb-01 and mdb-02
mongozone2: Contains mdb-03

We only have a master and agents. Our master (monitor) is supposed to tell the agents to run local checks, and the agents return the results of the check back to the monitor node. However, what appears to be happening is the agents are trying to check each other. In mongozone1 I’m seeing errors that mdb-02 is not connected to mdb-01 and vice versa.

Your suggestion appears to have fixed the issue. After moving all hosts to the master/hosts.conf file and removing all zone folders, I’m not seeing the connection errors. In fact, once I copied my commands file over to the global-templates zone, all the hosts appear to have resolved themselves. I can’t believe it was that simple. I’m going to do some additional checks, but I’ll mark this as “solved” with your response.

For those coming after me (and for me when I inevitably forget what I did later on), here are the steps I took:

  1. Create master zone (renamed from ZoneName) in zones.conf on monitor
  2. Create global-templates global zone if it doesn’t already exist
  3. For each file in the (previously used) repository.d/zones folder, change old zone name from monitor.example.com to master
  4. Add all endpoints from (previously used) repository.d/endpoints folder to zones.conf
  5. For each endpoint in (previously used) repository.d/hosts folder, append the info to conf.d/hosts.conf
  6. Remove repository.d/*
  7. Create zones.d/master folder
  8. Move everything from conf.d into zones.d/master
  9. Move customized commands, templates, users, groups, time periods, etc. (i.e. all things needed by ALL hosts) into zones.d/global-templates folder
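To make the steps above a bit more concrete, here is a sketch of how the master's zones.conf and zones.d layout might look afterwards. It reuses the names from the question (monitor.example.com, mdb-03.example.com, mongozone2); whether the per-agent zones were kept exactly like this is an assumption, and the <snip> value is left elided as in the original.

```icinga2
// zones.conf on the master (sketch of the result of steps 1, 2 and 4)

object Endpoint "monitor.example.com" {
}

// Step 1: the master zone, renamed from ZoneName to "master".
object Zone "master" {
  endpoints = [ "monitor.example.com" ]
}

// Step 2: global zone synced to every node.
object Zone "global-templates" {
  global = true
}

// Step 4: one Endpoint per agent, each in a zone below "master".
// Repeat this pair for every agent.
object Endpoint "mdb-03.example.com" {
  host = "<snip>"
}

object Zone "mongozone2" {
  endpoints = [ "mdb-03.example.com" ]
  parent = "master"
}

// Directory layout after steps 7 to 9:
//   zones.d/master/            hosts.conf and everything else moved out of conf.d
//   zones.d/global-templates/  commands, templates, users, groups, time periods
```

The host objects then live only in zones.d/master, and the services reach each agent through command_endpoint.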