for traffic between an icinga2 agent and the icinga2 master on port 5665, with a firewall in between, is the traffic bi-directional or just from the agent to the master? Sorry, dumb question I know, but I need to be certain.
And does the same apply for an icinga2 satellite talking to an icinga2 master across a firewall?
Not really. Depends on your config.
Read about the one direction connection model here:
Basically, if in zones.conf on the client nodes you do not specify a host address or port for their parent, the parent only connects to them. I just double checked this by firewalling off 5665 on my test master to be sure.
So in this type of setup the master needs to connect to 5665 on the satellite, and the satellite needs to connect to 5665 on the client.
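A minimal sketch of what that looks like in zones.conf on the agent side (hostnames and zone names here are made up for illustration):

```
// zones.conf on the agent. Note: the master's Endpoint has NO "host"
// or "port" attribute, so the agent never dials out - the master
// (or satellite) opens the connection down to the agent instead.
object Endpoint "icinga2-master1.localdomain" {
}

object Endpoint "icinga2-agent1.localdomain" {
}

object Zone "master" {
  endpoints = [ "icinga2-master1.localdomain" ]
}

object Zone "icinga2-agent1.localdomain" {
  endpoints = [ "icinga2-agent1.localdomain" ]
  parent = "master"
}
```

On the parent side you'd do the opposite: give the agent's Endpoint object a `host` (and optionally `port`) attribute so the parent knows where to connect. Whichever side has the `host` attribute set is the side that initiates.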
Oh, almost forgot: if you're doing an HA setup, make sure the masters can connect to each other on 5665, and that the satellites can talk to each other on 5665, or holy split-brain duplicate checks and event handling, Batman.
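For the HA case, that means both endpoints in the zone carry a `host` attribute so each instance can reach its peer (hostnames and addresses below are assumptions):

```
// zones.conf fragment shared by both HA masters
object Endpoint "icinga2-master1.localdomain" {
  host = "192.0.2.10"   // reachable from master2 on 5665
}

object Endpoint "icinga2-master2.localdomain" {
  host = "192.0.2.11"   // reachable from master1 on 5665
}

object Zone "master" {
  endpoints = [ "icinga2-master1.localdomain", "icinga2-master2.localdomain" ]
}
```

The same pattern applies to a satellite zone with two endpoints.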
To add: once there is such a connection (verify that with netstat/ss or tcpdump), each host may initiate sending TCP packets over the wire.
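A quick way to verify this on either side of the link (ss is the modern replacement for netstat and ships with iproute2; the interface name in the tcpdump line is an assumption):

```shell
# List established TCP connections involving the cluster port 5665.
ss -tn state established '( sport = :5665 or dport = :5665 )'

# Or watch the traffic itself (adjust the interface name to your host):
# tcpdump -ni eth0 tcp port 5665
```

If the connection model works as described, you'll see the same established connection from both ends, with only the parent's side showing an ephemeral source port.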
The underlying cluster protocol is JSON-RPC with notifications, so the sender doesn't wait for the receiver to acknowledge receipt of the message. For example, a check execution is sent from the master (triggered via a web action, say) to the satellite.
The satellite decides on its own how to deal with it - local execution, or a remote command endpoint on the client. Once the plugin returns data, the instance parses that and decides which zones are responsible for the object. This triggers a cluster event, e.g. sending the check result from the satellite back to the master.
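To illustrate the "notification" part: in JSON-RPC, a message without an `id` field is a notification, so no response is expected. A cluster message looks roughly like this (the `params` payload here is shortened and illustrative, not the exact wire format):

```json
{
  "jsonrpc": "2.0",
  "method": "event::CheckResult",
  "params": {
    "host": "icinga2-agent1.localdomain",
    "service": "disk",
    "cr": { "state": 0, "output": "DISK OK" }
  }
}
```

Note the missing `id`: the receiver processes the event but never sends a reply, which is why delivery guarantees are handled separately via the replay log described below.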
If the connection between the master and satellite is cut off, such messages are stored in the replay logs. The side which does the initial connection retries at a regular interval. Once the connection is re-established, past cluster events are replayed and your master's history backend has everything again.
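You can tune how long those replay logs are kept per endpoint via the `log_duration` attribute on the Endpoint object (the hostname and address below are assumptions):

```
// On the master, pointing at the satellite. log_duration defaults to
// one day; setting it to 0 disables the replay log for this endpoint.
object Endpoint "icinga2-satellite1.localdomain" {
  host = "192.0.2.20"
  log_duration = 2d
}
```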
This was designed and implemented mainly for SLA reporting reasons. A cluster connection being cut off must not influence the actual service check being run in a different location/zone.
You can read more about JSON-RPC messages in the docs released with 2.10.5 yesterday.