I’d like to preface this by saying that most of my understanding of Icinga was gleaned from documentation or inferred from observed behavior, so I may be misunderstanding parts of the issue.
The Goal
My company’s implementation of Icinga uses a “roaming” satellite that sits in our DMZ. Agents connect to the satellite’s public IP and submit their monitoring results. This gives us a one-way connection from agent to parent (the return path is not routable). It works for provisioning the agent and signing the certificate. The only real trade-off is that the “check-now” button does not work. In exchange, a device such as a notebook can travel between hotels, warehouses, home networks, and corporate offices while still being able to send results back to us. We’re using “cluster” as a host check to determine the host’s availability status instead of a typical ping. This might be our problem…
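For anyone unfamiliar with this kind of host check, a definition along these lines would look roughly like the sketch below. This is not our actual config: the host name is hypothetical, and I’m assuming the built-in `cluster-zone` check command from the Icinga 2 template library with its `cluster_zone` variable, and that the agent’s zone is named after the host.

```
object Host "notebook-01.example.com" {
  // Hypothetical roaming agent; assumed to have a zone of the same name.
  // Availability comes from the cluster connection state instead of a ping.
  check_command = "cluster-zone"

  // Zone whose connection is checked (assumed to match the agent's zone name).
  vars.cluster_zone = "notebook-01.example.com"
}
```

The idea is that the host is considered up whenever the agent’s zone is connected to its parent, which is the only signal available when the parent cannot reach the agent directly.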
The Problem
Notifications.
When a device comes back online, it’s very common to receive a swathe of notifications like this:
Remote Icinga instance ‘{agent device}’ is not connected to ‘{Roaming satellite}’
This occurs for every service on the host. It is especially painful in the morning when users are logging in. It’s also rough when devices sleep: some Windows devices wake the NIC every now and then, which lets Icinga submit data and occasionally produces more notifications.
I’d like to know what exactly causes the “Remote Icinga instance is not connected” message. I know it can appear during agent setup if the cert isn’t signed, as well as in a few other situations. However, this is happening on an established agent with a signed cert. I’d also like to know which side is producing this error. It’s not clear whether the agent is saying “I’m not connected to roaming” or the satellite is saying “this host isn’t connected to me”.