One-Way Agent-to-Satellite Configuration Issues | Lots of Notifications

I’d like to preface this with the information that most of my understanding of Icinga was gleaned from documentation or inferred from observed behavior. I may be misunderstanding pieces of the issue.

The Goal

My company’s implementation of Icinga uses a “roaming” satellite that sits in our DMZ. Agents connect to the satellite’s public IP and submit their monitoring results. This gives us a one-way connection from the agent to its parent (the return path is not routable), and it works for provisioning the agent and signing the certificate. The only real trade-off is that the “check-now” button does not work. In exchange, a device like a notebook can travel between hotels, warehouses, home networks, and corporate offices while still being able to send results back to us. We’re using “cluster” as a host check to determine the host’s availability status instead of a typical ping. This might be our problem…
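For anyone unfamiliar with this kind of setup, here is a rough sketch of how a one-way connection is usually expressed in zones.conf. This is not our actual config: the endpoint names and the IP address are placeholders, and I'm assuming the agent's zone is named after its FQDN. As far as I understand it, whichever side has the `host` attribute set on the other side's Endpoint object is the side that opens the connection.

```
// zones.conf on the agent (all names and addresses are hypothetical)

// The satellite endpoint has a host attribute, so the agent
// actively connects to it.
object Endpoint "roaming-satellite.example.com" {
  host = "203.0.113.10"   // placeholder for the satellite's public IP
  port = "5665"
}

object Zone "Roaming" {
  endpoints = [ "roaming-satellite.example.com" ]
}

// The agent's own endpoint and zone, with the satellite zone as parent.
object Endpoint "notebook01.example.com" {
}

object Zone "notebook01.example.com" {
  endpoints = [ "notebook01.example.com" ]
  parent = "Roaming"
}

// On the satellite, the matching Endpoint object for the agent would
// have no host attribute, so the satellite never tries to connect out:
//
// object Endpoint "notebook01.example.com" {
// }
```

With the `host` attribute left out on the satellite side, the satellite can only wait for the agent to connect.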

The Problem

Notifications.

When a device comes back online, it’s very common to receive a swathe of notifications like this:

Remote Icinga instance ‘{agent device}’ is not connected to ‘{Roaming satellite}’

This occurs for every service on the host. This is especially painful in the morning when users are logging in. It’s also rough when devices sleep, as some Windows devices will wake the NIC every now and then, which allows Icinga to submit data and occasionally produces more notifications.

I’d like to know what exactly causes the “Remote Icinga instance is not connected” message. I know this happens during agent setup if your cert isn’t signed, as well as in a few other situations. However, this is happening on an established agent with a signed cert. I’d also like to know which side is producing this error. It’s not clear if the agent is saying “I’m not connected to roaming” or if the satellite is saying “this host isn’t connected to me”.

I created the following dependency in Director to try to deal with this, but it doesn’t seem to work the way I wanted. We still frequently receive notifications for numerous services when this happens.

zones.d/director-global/dependency_apply.conf
apply Dependency "Agent Checks" to Service {
    disable_checks = true
    disable_notifications = true
    ignore_soft_states = true
    period = "24x7"
    assign where service.name != "Agent Health" && host.vars.isAgentHost && host.zone == "Roaming"
    parent_service_name = "Agent Health"
    states = [ OK ]
}

Bumping this for exposure

This happens when the satellite can’t connect to the agent for some reason (in our environment it’s usually a firewall, the icinga2 service not running, or, as you say, certificates not set up correctly). This sounds like the correct error message, since it looks like those devices are offline at that time?