Monitor Icinga agent's availability

Hi there,

I use ICMP checks and Icinga 2 agents to monitor some servers. This works fine: I get a critical alarm if a server is down or there is a performance issue.
I would like to improve the monitoring and generate a critical problem if the Icinga agent on a server is unreachable. Note that I use passive checks (i.e. the Icinga agents execute the checks and send the results to the satellites). The “negate” plugin looked like a good fit (e.g. to generate a problem instead of unknown for an agent-based check), but it doesn’t work.

An example:

template Service "check_negate_command" {
    check_command = "check_negate_command"
    command_endpoint = host_name
    vars.critical = "OK"
    vars.negate_command = "/usr/lib64/nagios/plugins/check_load -w 1,5,15 -c 1,10,15"
    vars.ok = "OK"
    vars.unknown = "CRITICAL"
    vars.warning = "OK"
}
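
The “check_negate_command” CheckCommand itself is not shown here; it wraps the monitoring-plugins “negate” binary. A simplified sketch of such a wrapper (the plugin path and the variable-to-option mapping are assumptions, not my exact definition):

object CheckCommand "check_negate_command" {
    // Assumed plugin path; adjust to the local plugin directory.
    command = [ "/usr/lib64/nagios/plugins/negate" ]

    arguments = {
        "-o" = "$ok$"            // remap OK results to this state
        "-w" = "$warning$"       // remap WARNING results to this state
        "-c" = "$critical$"      // remap CRITICAL results to this state
        "-u" = "$unknown$"       // remap UNKNOWN results to this state
        "wrapped_command" = {
            value = "$negate_command$"   // the check that negate wraps and executes
            skip_key = true              // render only the value, not the key
            order = 1                    // placed after the state-mapping options
        }
    }
}

The remapping itself works while the agent is reachable; the problem only shows up when the agent is down.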

The expected result: critical status if the agent is down. Instead, I get “unknown” when the agent is stopped.

I have a couple of questions:

  1. What is the best way to monitor agent status for this setup (master <–> satellite ← agent)? Note that checking port 5665 is not the best option because we use passive checks: “port is down” != “agent is down”. It would be great to generate a problem if there is no response from the agent.
  2. The issue with negate is that it doesn’t trigger a problem if I use “Run on agent: yes”; I get an unknown status instead. Is it possible to use “negate” to convert unknown to critical for this specific use case?

Hi, maybe this blog post helps: Monitoring the Monitor: How to keep a watch on Icinga 2

Thank you for the URL; I have seen it. The port 5665 check is not the best option for this situation, because it may cause a false positive alert if a server has no agent. The “icinga” check returns an unknown status with the output “Remote Icinga instance ‘client’ is not connected to ‘satellite’”.

How can I configure Icinga to generate a critical problem when “Remote Icinga instance ‘client’ is not connected to ‘satellite’”?

Try the cluster-zone service check?

https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#health-checks
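
Something along these lines (a minimal sketch based on the example in those docs; it assumes the agent zone name equals the host object name and that agent hosts carry a vars.agent_endpoint custom variable):

apply Service "agent-health" {
    check_command = "cluster-zone"

    // Run this on the satellite/master, i.e. do NOT set command_endpoint here.
    // Assumes the agent zone is named after the host object (FQDN convention).
    vars.cluster_zone = host.name

    // Hypothetical marker variable for hosts that have an agent installed.
    assign where host.vars.agent_endpoint
}

The check is executed in the parent zone and turns CRITICAL as soon as the agent zone is no longer connected.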

Antony.

It looks like “cluster-zone” is a possible solution. At least it works if I have “Run on agent: No” configured for this check (i.e. I get a critical status when the agent is down).

The check goes to “Unknown” if I use “Run on agent: Yes”. Is this expected? Just to confirm: will all checks with “Run on agent: Yes” change their status to “Unknown” if the agent is down?

Is there any way to change “Unknown” to “Critical” with the “Run on agent: Yes” parameter? I am asking because the “negate” plugin should do the trick, but it doesn’t if I use “Run on agent: Yes”.
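
For reference, as far as I understand the difference between the two variants boils down to roughly this (a sketch with a hypothetical host/zone name; Director’s “Run on agent” setting maps to command_endpoint):

// Works (“Run on agent: No”): the satellite executes the check itself.
object Service "agent-health" {
    host_name = "client.example.org"          // hypothetical agent host
    check_command = "cluster-zone"
    vars.cluster_zone = "client.example.org"  // agent zone, assumed to match the host name
}

// Goes UNKNOWN when the agent is down (“Run on agent: Yes”):
// the check would have to be executed on the very endpoint that is unreachable.
object Service "agent-health-on-agent" {
    host_name = "client.example.org"
    check_command = "cluster-zone"
    command_endpoint = "client.example.org"
    vars.cluster_zone = "client.example.org"
}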

The whole point about “run on agent” is that the service check runs on the
agent and then reports back to the satellite / master.

If the agent is down, then (a) it cannot run the check, and (b) it cannot
report the result. That also applies to “negate”: it is just another plugin that
would have to run on the agent, so the UNKNOWN you see is produced by the
satellite’s scheduler, not by the plugin.

If you want to know whether a(ny) machine is down, you have to test that from
some other machine - otherwise it’s like asking a group of people to “put your
hand up if you aren’t here”.

Antony.

Thanks for the provided information.

Points (a) and (b) are logical and clear. However, testing from some other machine (e.g. the satellite) isn’t always possible and is less accurate.

I am asking these questions because some other monitoring systems can generate a problem if there is no data from an agent for the last N minutes. Also, Icinga changes the statuses of agent-based checks to unknown (i.e. it still tracks their statuses even if the agent is down). I can’t understand why I can’t change the unknown status to critical by using the “negate” plugin. Is this expected behaviour or a misconfiguration on my side?