Checks functioning on agents, but not on satellites

Hello,

Current setup is this:
1 Master
3 Satellites
1 Agent
3 Zones (master, dev, qa)

Right now, all 5 hosts show up in Icinga2.

The master is running checks locally on itself without any issues.
The master should be running checks on the satellites to verify their health.
The satellites should be running checks on the agents.

My issue is that all of the checks the master should be performing on the satellites are stuck pending and are showing as Late. When I click “Check Now” in the web UI, it says it's scheduling the check, but it never gets a result.

[root@dev-icinga icinga2]# icinga2 daemon -C
[2024-07-15 14:59:08 -0400] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-07-15 14:59:08 -0400] information/cli: Loading configuration file(s).
[2024-07-15 14:59:08 -0400] information/ConfigItem: Committing config item(s).
[2024-07-15 14:59:08 -0400] information/ApiListener: My API identity: dev-icinga.qa.***.net
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 2 NotificationCommands.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 21 Notifications.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 6 HostGroups.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 5 Hosts.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 5 Downtimes.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 3 Comments.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 2 FileLoggers.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 6 Zones.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 User.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 5 Endpoints.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 2 ApiUsers.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 ApiListener.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 253 CheckCommands.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 1 UserGroup.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 3 ServiceGroups.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 3 TimePeriods.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 5 ScheduledDowntimes.
[2024-07-15 14:59:08 -0400] information/ConfigItem: Instantiated 92 Services.
[2024-07-15 14:59:08 -0400] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2024-07-15 14:59:08 -0400] information/cli: Finished validating the configuration file(s).

Looking in the logs, I can see that the satellites and the master are connected, but still no checks are happening.

I did just see this in the satellite log after restarting icinga2:

[root@dev-satellite ~]# cat /var/lib/icinga2/api//zones-stage-startup-last-failed.log
[2024-07-15 15:09:49 -0400] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-07-15 15:09:49 -0400] information/cli: Loading configuration file(s).
[2024-07-15 15:09:49 -0400] critical/config: Error: Object 'dev-agent.qa.***.net' of type 'Endpoint' re-defined: in /var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf: 12:1-12:42; previous definition: in /etc/icinga2/zones.conf: 22:1-22:42
Location: in /var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf: 12:1-12:42
/var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf(10): }
/var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf(11):
/var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf(12): object Endpoint "dev-agent.qa.***.net" {
                                                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf(13):   host = "dev-agent.qa.***.net" // The satellite actively tries to connect to the agent
/var/lib/icinga2/api/zones-stage//dev-satellites/_etc/dev-agent.qa.***.net.conf(14):   log_duration = 0 // Disable the replay log for command endpoint agents
[2024-07-15 15:09:49 -0400] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

Maybe I’m not fully understanding how zones should be set up. On the master, I have the following in zones.conf:

object Endpoint "dev-icinga.qa.***.net" {
}

object Zone "master" {
  endpoints = [ "dev-icinga.qa.***.net" ]
}

object Endpoint "dev-satellite.qa.***.net" {
  host = "dev-satellite.qa.***.net"
  port = "5665"
}

object Zone "dev-satellites" {
  endpoints = [ "dev-satellite.qa.***.net" ]
  parent = "master"
}

object Endpoint "ua-satellite1.qa.***.net" {
  host = "qa-satellite1.qa.***.net"
  port = "5665"
}

object Endpoint "ua-satellite2.qa.***.net" {
  host = "qa-satellite2.qa.***.net"
  port = "5665"
}

object Zone "qa" {
  endpoints = [ "ua-satellite1.qa.***.net", "ua-satellite2.qa.***.net" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}

Then inside zones.d I have a folder for master, dev-satellites, and qa. Inside each of those is a conf file for each host.

dev-satellite.qa.***.net

object Host "dev-satellite.qa.***.net" {
  import "generic-host"
  vars.os = "Linux"
  vars.disks["disk /"] = {
    disk_partitions = "/"
  }
  vars.zone = "dev"
  address = "dev-satellite.qa.***.net"
  vars.agent_endpoint = name
}

dev-agent.qa.***.net

object Host "dev-agent.qa.***.net" {
  import "generic-host"
  vars.os = "Linux"
  vars.disks["disk /"] = {
    disk_partitions = "/"
  }
  vars.zone = "dev"
  address = "dev-agent.qa.***.net"
  vars.agent_endpoint = name
}

object Endpoint "dev-agent.qa.***.net" {
  host = "dev-agent.qa.***.net" 
  log_duration = 0 // Disable the replay log for command endpoint agents
}

object Zone "dev-agent.qa.***.net" {
  endpoints = [ "dev-agent.qa.***.net" ]

  parent = "dev-satellites"
}
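
The host objects above set vars.agent_endpoint, which only has an effect if a service rule actually uses it. A minimal sketch of such a rule, following the pattern from the Icinga 2 distributed monitoring documentation. The "generic-service" template and the choice of the "disk" check are assumptions, not something taken from this setup.

```icinga2
// Sketch of an apply rule that executes a check on the agent itself.
// Assumes a "generic-service" template is available where this rule is loaded.
apply Service "disk" {
  import "generic-service"

  check_command = "disk"

  // Run the check on the endpoint named in the host's custom variable.
  command_endpoint = host.vars.agent_endpoint

  // Only apply to hosts that define an agent endpoint.
  assign where host.vars.agent_endpoint
}
```

A rule like this would typically live in a global zone such as global-templates, so that the satellites receive it along with the host objects.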

Well, I believe I got this figured out, but it raises another question.

I had to manually create the zone directories in the zones.d folder on the satellites and copy the config files I have on the master into those directories.

Am I wrong in thinking that if I have something like

/etc/icinga2/zones.d/dev-satellites/dev-agent.conf

Shouldn’t that file, as well as everything else in the dev-satellites folder, get synced to the satellite in that zone?

Zone and endpoint objects for master and satellites shall be configured in zones.conf only.

That is not needed when the file is placed in the correct directory. It would fail anyway, since you don’t have a zone named dev defined.

I have those configured in zones.conf on master. Then inside of zones.d I have a folder named for each zone configured. Inside those zone folders, I have the configuration for the agents within that zone.

But it doesn’t seem that those are syncing to the satellites. I have dev-satellite with its endpoint and zone configured in the master’s zones.conf. Then on the master I have zones.d/dev-satellites/dev-agent.conf. Am I wrong in thinking that the dev-agent.conf file should sync to the dev-satellite host’s zones.d folder, inside a folder called dev-satellites?
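
For reference, a minimal sketch of the receiving side of the config sync, assuming the stock ApiListener feature file on the satellite. The file path is the usual default and is an assumption about this setup. As far as I know, synced files are not written to the satellite's /etc/icinga2/zones.d at all. They are staged under /var/lib/icinga2/api/zones-stage/ (the path that appears in the failed-startup log above) and, once validated, kept under /var/lib/icinga2/api/zones/.

```icinga2
// /etc/icinga2/features-enabled/api.conf on the satellite (assumed default path)
object ApiListener "api" {
  // Accept zones.d content pushed down from the parent zone.
  accept_config = true

  // Accept check commands scheduled by the parent zone.
  accept_commands = true
}
```

If the staged config fails validation on the satellite, the sync is discarded and the satellite keeps running its previous configuration.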

The vars.zone is just used for a hostgroup template matcher.

The log extract tells me that the Endpoint object for dev-agent.qa.***.net was defined twice (once in your main config on the master, and once more, probably directly on the satellite).
Icinga 2 on the satellite therefore rejects the synced configuration and never applies it. That is why you never get any results: the satellite doesn’t know about these objects at all, since it is running with a different configuration.

In general I put the Endpoint object for Agents (not Satellites!!) in the main configuration and NOT locally in Satellites (or generally in the zones.conf file).
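
A rough sketch of what the satellite's local zones.conf could look like once the duplicate is removed. The object names are taken from the configs posted above; leaving out the host attribute on the master endpoint is an assumption (it fits a setup where the master opens the connection to the satellite, as the master's zones.conf suggests).

```icinga2
// /etc/icinga2/zones.conf on dev-satellite (sketch, not a verified config)

object Endpoint "dev-icinga.qa.***.net" {
  // No host attribute: assumes the master connects to the satellite.
}

object Zone "master" {
  endpoints = [ "dev-icinga.qa.***.net" ]
}

object Endpoint "dev-satellite.qa.***.net" {
}

object Zone "dev-satellites" {
  endpoints = [ "dev-satellite.qa.***.net" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}

// No Endpoint or Zone for dev-agent.qa.***.net here. Those objects arrive
// through the config sync from zones.d/dev-satellites/ on the master, so
// defining them locally as well triggers the "re-defined" error.
```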

Did you place the satellites’ host objects in /etc/icinga2/zones.d/master?

No, maybe that is where I screwed up. I put the host objects for the satellites into the zones they are in.

So, am I correct here, that all satellite zones themselves should be defined in zones.conf on the master, and then the host object entry should be in zones.d/master/?

You need to put each host object in the zone whose endpoint(s) should schedule its checks.
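
Applied to the hosts in this thread, that rule could look roughly like the sketch below. The first file name is hypothetical, and the host bodies are trimmed versions of the objects posted earlier.

```icinga2
// /etc/icinga2/zones.d/master/dev-satellite.conf (hypothetical file name)
// The master zone schedules the checks for the satellite host.
object Host "dev-satellite.qa.***.net" {
  import "generic-host"
  address = "dev-satellite.qa.***.net"
  vars.agent_endpoint = name
}

// /etc/icinga2/zones.d/dev-satellites/dev-agent.conf
// The dev-satellites zone schedules the checks for the agent host.
object Host "dev-agent.qa.***.net" {
  import "generic-host"
  address = "dev-agent.qa.***.net"
  vars.agent_endpoint = name
}
```

Both files live on the master. Only the second one is synced to the satellite, because it sits in the directory named after the satellite's zone.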

Ok, that makes sense.

I’m going to attempt to add another agent into the setup now, this one will be on the qa satellites.

I’m using the node setup CLI command to do this. First thing I did was create the cert on the new agent, then save the master cert.

icinga2 pki new-cert --cn qa-host1.qa.***.net --key /var/lib/icinga2/certs/qa-host1.qa.***.net.key --cert /var/lib/icinga2/certs/qa-host1.qa.***.net.crt

icinga2 pki save-cert --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --host master-icinga.qa.***.net

I then generate the ticket on the master. Then, for the node setup command, would I put the satellite or the master as the parent?

icinga2 node setup --ticket ********* --cn qa-host1.qa.***.net --endpoint qa-satellite1.qa.***.net --zone qa-host1.qa.***.net --parent_zone qa --parent_host qa-satellite1.qa.***.net --trustedcert /var/lib/icinga2/certs/trusted-parent.crt --accept-commands --accept-config --disable-confd

The above command would be telling it that the satellite is the master, but will the satellite pass everything to the master correctly? Or do I need to replace every mention of the satellite with the master host? Ideally, we would not want to open up the master to every single host that is going to be an agent, and would want the agents to speak only to the satellites. I did see mention of ca-proxy, but I’m not sure if there is anything I need to change on the satellite or in the above command for that to work.
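
For comparison, a hand-written sketch of the zone layout the agent would need if it only ever talks to the satellite. This is not captured node setup output. The names come from the command above, and the endpoint name is assumed to match the CN on the satellite's certificate (note that the master's zones.conf earlier names these endpoints ua-satellite1/ua-satellite2, so one of the two spellings would have to change).

```icinga2
// /etc/icinga2/zones.conf on qa-host1 (sketch under the assumptions above)

object Endpoint "qa-satellite1.qa.***.net" {
  // Set host only if the agent should open the connection itself.
  host = "qa-satellite1.qa.***.net"
  port = "5665"
}

object Zone "qa" {
  // The agent's parent is the satellite zone, not the master zone.
  endpoints = [ "qa-satellite1.qa.***.net" ]
}

object Endpoint "qa-host1.qa.***.net" {
}

object Zone "qa-host1.qa.***.net" {
  endpoints = [ "qa-host1.qa.***.net" ]
  parent = "qa"
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}
```

With this layout the agent has no Endpoint for the master at all. As far as I understand the documentation, satellites running 2.8 or later forward certificate signing requests up to the master (the CA proxy behaviour), so the master should not need to be reachable from the agent.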

Now this makes less sense. I managed to get the qa agent host added, and most of the checks are working, but there are a handful of them that are just stuck pending and not running.

They’re all working now, no idea what I did to fix it.