How to Set Up an Agent without a Connection from the Parent/Master to the Agent

In my implementation I am using Director and a distributed configuration. The satellite I’m attempting to connect to is downstream from an internal aggregation zone, which then connects up to the master node.

Master > Aggregation Zone (two endpoints) > Roaming Device Zone

I am currently configuring a workflow for monitoring roaming devices such as notebooks that are often away from the office, where a P2P VPN exists for monitoring. Our goal is to have 150-300 devices connect to a satellite via something like “icinga.example.com.” This presents a certificate problem that I think can be resolved by manually deploying the CA.crt file to the hosts.

Because the agents sit behind random internet IPs, we have no way of routing traffic from the master to these agents. Since the agents can still connect upstream via the icinga.example.com address, I believe that, once configured, the passive data from the agents will allow us to monitor these devices. I understand I will be unable to deploy new config or schedule immediate checks (something is better than nothing).

I’ve never manually built Icinga configuration; all of my experience has been with Director and the node wizard. So, before I head down this path, does this seem viable, or do I need to approach this differently?

TL;DR: What is the proper method of implementing an Icinga agent which can connect to its parent, while the parent cannot connect back, in a distributed monitoring implementation?

I guess there are some misunderstandings. Unfortunately, I did not fully understand your approach, so I’ll write down some general information.

First, it does not matter which of the connected nodes initiates the connection. Once connected, data is transferred in both directions, so there is no limitation of any function. Because of this, it’s not necessary for the parent to reach out to the agents, and you don’t need to open a port on every agent.
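To make that concrete, here is a minimal zones.conf sketch (“notebook01.example.com” and “dmz-satellite.example.com” are placeholder names, “icinga.example.com” is the address from your post): the only thing that decides who dials out is which side carries the host attribute.

zones.conf on the satellite (sketch)
object Endpoint "notebook01.example.com" {
    // no "host" attribute here: the satellite never tries to connect out,
    // it only accepts the connection the agent opens to it
}

zones.conf on the agent (sketch)
object Endpoint "dmz-satellite.example.com" {
    host = "icinga.example.com"   // only the agent dials out, so no inbound port is needed on the agent
    port = 5665
}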

Second, every agent requires its own zone. By convention you should use the FQDN for the zone name as well as for the certificate’s CN.
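A sketch of that naming convention, with “notebook01.example.com” again standing in for the agent’s FQDN and “dmz-satellite” for the parent zone:

zones.conf fragment (sketch, placeholder names)
object Endpoint "notebook01.example.com" { }   // endpoint name = FQDN = certificate CN

object Zone "notebook01.example.com" {         // one zone per agent, named after the same FQDN
    endpoints = [ "notebook01.example.com" ]
    parent = "dmz-satellite"                   // the zone this agent reports to
}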

Third, the certificate chain and DNS do not have to be identical. Something like “icinga.example.com” should not be a blocker (if I understood your approach well enough).

Can you explain this further?

First, it does not matter which of the connected nodes initiates the connection. Once connected, data is transferred in both directions, so there is no limitation of any function. Because of this, it’s not necessary for the parent to reach out to the agents, and you don’t need to open a port on every agent.

There is no route for this traffic to go back across the internet. Does the agent create some sort of VPN to allow this traffic to reach the endpoint? Are you saying that the agent does submit passive results once configured? Is the only way to configure an agent this way manually? Presently, because the satellite cannot connect to the agent, it never sends configuration, which means the remote host ends up having 0 services.

I’ve no idea what you mean here.

Any Icinga node talks to another Icinga node over an encrypted, certificate-based connection.

Once a connection is established, everything runs through that connection, e.g. check results, “check now” requests, config files, etc., no matter which of the Icinga nodes initiated it.

I don’t understand this question.

Services are assigned by apply rules at your master. This has nothing to do with sending configuration to agents. You would not get any results if there is a connection issue, but the service still exists.
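For illustration, a sketch of such an apply rule, assuming the hosts carry a custom variable named agent_endpoint (the variable name and the disk check are placeholders, not something from your setup). The Service object lives on the master whether or not the agent is currently connected; only the results depend on the connection.

zones.d/master/agent_services.conf (sketch)
apply Service "disk" {
    check_command = "disk"
    // the check is executed on the agent once it is connected;
    // the Service object itself exists on the master either way
    command_endpoint = host.vars.agent_endpoint
    assign where host.vars.agent_endpoint
}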

Sorry for the gap in replies; I wound up quite busy at the end of last week.

There is no route for this traffic to go back across the internet

When the Windows agent connects to our satellite at “roaming.domain.tld,” which is in public DNS, the satellite has no way of knowing where to send traffic to the agent. It would be like having Google open a connection directly to your computer. I had hoped the agent would open a stateful connection and perhaps UPnP would solve the rest, but there seems to be no such mechanism. Should this be the case?

Even in our healthy nodes, I regularly see endpoints disconnect from their parent satellites and reconnect later to replay their log.

Effectively what needs to happen is:

A Windows agent with IP address 192.168.1.12 behind a Spectrum router needs to connect to our satellite via “roaming.domain.tld.” When doing this, I need the satellite to receive agent configuration via Director. Then, because “roaming.domain.tld” is publicly accessible and in our DMZ, the satellite in our DMZ would passively receive agent data submitted to “roaming.domain.tld.”

This appears not to be working because the satellite in our DMZ has no way of routing traffic across the internet to the Windows agent behind the Spectrum router. This appears to be normal and expected, albeit frustrating. When connecting an agent device to the DMZ host, the agent receives no configuration whatsoever. Additionally, the agent wizard does not note this or complain; the debug log simply states it has 0 checkables.

I need infrastructure configured in a way that would allow a customer device to connect to our cluster regardless of their internet connection.

It doesn’t matter if the satellite or the agent on the notebook initiates the connection.
Once it is established, the satellite can send configuration to the agent on the notebook!

So yes, after the notebook has connected to the satellite, there is a route for this traffic to go back across the internet. It’s the same way that Google’s web server sends back the results of your search to your browser: via the connection your browser made to Google’s web server.

Nothing passive here. Icinga 2 passive checks are checks that are not executed by an Icinga 2 agent (which would report the result via standard out) but are executed independently of Icinga 2 and only report their results to preconfigured objects in Icinga via the API or legacy methods.

Wrong: as soon as the agent has opened the connection to the satellite, the satellite can send whatever it wants to the agent.

Whether this works depends on the agent’s configuration, i.e. whether it allows receiving configuration from the satellite after connecting to it. Again, once the agent has opened the connection, nothing hinders the satellite from sending configuration to the agent. The satellite also needs to feel responsible for the agent, or else it will not accept the connection and/or not send the configuration for the agent. This is done in zones.conf.

You need to tell the cluster which endpoints there are in zones.conf and do the same on the agent.
I would just tell them both to try to connect to each other, as it simplifies the config and speeds up the connection for the server when the notebooks are, once in a blue moon, in house.
Also, most importantly, you need to allow the agents to receive config. I think this is done in the API feature file.
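As far as I know that is indeed the api feature; a sketch of the two relevant flags on the agent side:

/etc/icinga2/features-enabled/api.conf on the agent (sketch)
object ApiListener "api" {
    accept_config = true     // allow the parent to sync zone configuration down to this agent
    accept_commands = true   // allow the parent to send commands (e.g. execute checks) to this agent
}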

Thank you so much for the info and explanations!

I was able to get my notebook connected back to our Icinga cluster without doing anything crazy. You were right about the connection staying open. My understanding was that, because the log shows hosts reconnecting and then replaying their log to parents, endpoints would only periodically reconnect to their upstream parent to transmit the log. It looks like the connection mostly stays open, which explains why it works.

For the agent configuration, I had to use a different name for the “instance name” vs. the “host name” of said endpoint, so my “instance name” needed to be changed. What I ultimately ended up with was:

instance name = the name of the icinga “endpoint” (satellite.internaldomain.tld)
host name = (roaming-device.publicdomain.tld)

This resolved my certificate issue.
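In zones.conf terms that boils down to something like this sketch (using the names from this thread, if I read them right): the Endpoint object name has to match the satellite certificate’s CN, while the host attribute carries the public DNS name the agent actually dials.

zones.conf on the roaming agent (sketch)
object Endpoint "satellite.internaldomain.tld" {
    // object name must match the satellite certificate's CN (the "instance name")
    host = "roaming-device.publicdomain.tld"   // the publicly resolvable address (the "host name")
    port = 5665
}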

As far as I can tell, once I resolved this, I was able to connect normally. One thing I did change was the host check: I switched from the typical hostalive4 to a version of the cluster-zone command. This seemed much more reliable than trying to ping the device.
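Roughly what that host check looks like, as a sketch (placeholder host name; the ITL’s cluster-zone command defaults its cluster_zone variable to the host’s name, so naming the host object after the agent zone is enough):

zones.d/master/roaming_hosts.conf (sketch)
object Host "notebook01.example.com" {
    check_command = "cluster-zone"
    // cluster_zone defaults to the host name, which here equals the agent's zone name,
    // so the host is UP whenever the agent zone is connected
    vars.agent_endpoint = name   // lets apply rules target agent-based services
}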


I also configured a “Service: Agent Health” that uses the cluster-zone command.
Then I use it as the parent for all other services on the host to minimize alarm noise.

zones.d/master/service_templates.conf
template Service "tpl-service-agent-health" {
    import "tpl-service-generic"

    check_command = "cluster-zone"
    icon_image = "icinga.png"
    // run the check on the master itself, not on the agent
    command_endpoint = null
    vars.teams = [ "Entwicklung_Monitoring" ]
}
zones.d/master/service_apply.conf
apply Service "Agent Health" {
    import "tpl-service-agent-health"

    assign where host.vars.agent_endpoint
    zone = "master"

    import DirectorOverrideTemplate
}
zones.d/director-global/dependency_templates.conf
template Dependency "tpl-dependency-agent-health-check" {
    disable_notifications = true
    states = [ OK ]
}
zones.d/master/dependency_apply.conf
apply Dependency "agent-health-check" to Service {
    import "tpl-dependency-agent-health-check"

    // suppress notifications for all other services on agent hosts
    // while "Agent Health" is not OK
    assign where host.vars.agent_endpoint && service.name != "Agent Health"
    parent_service_name = "Agent Health"
}

Maybe this is a good idea for your roaming devices also.