Services to Unknown when agent loses connectivity?

I’m confused about the state of monitored services when an agent times out. I’d like those services to transition to Unknown, but instead they keep their last reported state.

I’ve configured cluster and cluster-zone checks as described here, and those checks turn critical when communication with the agent is lost, but the services the agent monitors remain in their last state rather than becoming Unknown.
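
For reference, the cluster health check on the master looks roughly like this (names sanitized, intervals are just what I picked):

apply Service "Cluster Health" {
  check_command = "cluster"
  check_interval = 30s
  retry_interval = 10s

  /* the cluster check runs on the master and reports on all connected endpoints */
  assign where match("master.fqdn", host.name)
}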

How can I make all services become Unknown when the agent times out?

Both master and agent are running r2.14.2-1.

I used the agent health check as the parent of all other checks on the host to minimize notifications.

The result looks like this:


zones.d/director-global/dependency_templates.conf

template Dependency "tpl-dependency-agent-health-check" {
    disable_notifications = true
    states = [ OK ]
}

zones.d/master/dependency_apply.conf

apply Dependency "agent-health-check" to Service {
    import "tpl-dependency-agent-health-check"

    assign where host.vars.agent_endpoint && service.name != "Agent Health"
    parent_service_name = "Agent Health"
}
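
With states = [ OK ], the dependency trips as soon as "Agent Health" leaves OK, which suppresses the child notifications. If you also want the child checks to stop executing while the agent is down, the Dependency object additionally supports disable_checks; a minimal variant of the template above:

template Dependency "tpl-dependency-agent-health-check" {
    disable_notifications = true
    /* also pause active child checks while the parent is not OK */
    disable_checks = true
    states = [ OK ]
}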

Thanks for the help! I’ve tried to duplicate your config, but I haven’t gotten it working yet.

Can you show me your Agent Health service config? That may be where I’m going wrong.

I’ve tried this from the docs, but the checks are unhandled and stuck in a pending state.
zones.d/master/services.conf

apply Service "Agent Health" {
  check_command = "cluster-zone"

  display_name = "cluster-health-" + host.name

  /* This follows the convention that the agent zone name is the FQDN which is the same as the host object name. */
  vars.cluster_zone = host.name

  assign where host.vars.agent_endpoint
}
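
In case it helps with diagnosing the pending state, the generated objects can be inspected on the master to see whether the apply rule matched and which attributes the services ended up with:

icinga2 object list --type Service --name '*!Agent Health'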

I’ve also tried a different version, which works on its own but fails validation when I apply the service dependency, with an error that it references a parent host/service which doesn’t exist. (Presumably because in this variant "Agent Health" only exists on the master host, so the dependency’s children on the agent hosts have no matching parent.)
zones.d/master/services.conf (names sanitized)

apply Service "Agent Health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s

  vars.cluster_zone = "agent.fqdn"

  assign where match("master.fqdn", host.name)
}

This is how I configured it:

zones.d/master/service_templates.conf
template Service "tpl-service-agent-health" {
    import "tpl-service-generic"

    check_command = "cluster-zone"
    notes_url = "https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#cluster-zone-with-masters-and-agents"
    icon_image = "icinga.png"
    command_endpoint = null
}
zones.d/master/service_apply.conf
apply Service "Agent Health" {
    import "tpl-service-agent-health"

    assign where host.vars.agent_endpoint
    zone = "master"

    import DirectorOverrideTemplate
}
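
Setting command_endpoint = null and zone = "master" makes the cluster-zone check execute on the master instead of being scheduled on the (possibly unreachable) agent. For completeness, the apply rules hinge on the agent_endpoint custom var being set on the agent hosts; a minimal host object, assuming the var simply carries the endpoint name:

object Host "agent.fqdn" {
    check_command = "hostalive"
    address = "agent.fqdn"

    /* matched by the "Agent Health" service and the dependency apply rules */
    vars.agent_endpoint = name
}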

That helped, thank you!

My services now show as unreachable when the agent loses connectivity with the master. The state still shows OK rather than Unknown, but I suspect that’s a separate configuration problem.

I marked your response as the solution. Thanks again for your help, I really appreciate it!