Get notified when Icinga2 service is NOT running on remote host

I discovered today that the Icinga2 service was not running on a few of my hosts. I was not notified about it.

If I stop the Icinga2 service manually on any host, I would like a notification to be triggered. How can I do this?

If you are using Icinga as the agent on the remote nodes, you can try the cluster check:
https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#cluster

If you are using NRPE or check_by_ssh, you can use the check_procs plugin to detect whether Icinga is running.
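As a minimal sketch of the check_by_ssh route, using the ITL's by_ssh CheckCommand (the service name and the assign rule are assumptions; adapt to your environment):

```
// Sketch: run check_procs over SSH from the master and go CRITICAL
// if no icinga2 process is found on the remote host.
apply Service "icinga2-procs" {
  check_command = "by_ssh"

  vars.by_ssh_command = [ PluginDir + "/check_procs" ]
  vars.by_ssh_arguments = {
    "-C" = "icinga2"   // match processes whose command name is icinga2
    "-c" = "1:"        // CRITICAL if fewer than 1 process
  }

  assign where host.vars.client_endpoint   // assumed custom variable
}
```

Because this runs over SSH rather than through the Icinga agent, it still works when the remote icinga2 daemon is down, which is exactly the failure case you want to catch.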

If you mean checking your masters/satellites, again see the cluster check. If you only have one master, that is a bit of a problem, but you can use the host's cron capability to run a test that checks whether the service is running and sends a notification if it is not.


I tried doing this, but the status of the host never changes; it always says "UP" until I restart the service on the remote. Then when I stop the service again it still says UP, but the last check is red.

I run the cluster check from a dummy host against the master zone and get CRITICALs when something is down. There is also the zone check, which is the next item down on the page linked above.

Do you have the icinga check running on every server? It’ll go unknown with “[parent] is not connected to [your server].” If you have it set to page you for unknowns, that’ll get your attention. Also, any checks running on that node will light up purple too, so it tends to stick out.

Inside zones.d/<server>/hosts.conf I have this defined:

/etc/icinga2/zones.d/xxx-xxx-app-1/hosts.conf

object Host "xxx-xxx-app-1" {
  check_command = "hostalive"
  #check_command = "cluster" (I set it back to hostalive for now, but tried cluster, tcp, etc.)
  address = "xxx"
  vars.client_endpoint = name
  vars.notification["mail"] = {
    groups = [ "aow" ]
  }

  vars.cloud = "xxx-CLOUD-2"
  max_check_attempts = 3
  check_interval = 5s
  retry_interval = 1m
  vars.sms_notify = true
  vars.mail_notify = true
  vars.voice_notify = true
}

Primary hosts.conf inside /etc/icinga2:

object Host "monitor.xxxx.net" {
  import "generic-host"
  check_command = "hostalive"
  address = "xxxxx"
  vars.http_vhosts["http"] = {
    http_uri = "/"
  }
  vars.notification["mail"] = {
    groups = [ "aow" ]
  }
  enable_notifications = true
}

Do you have an example of your config?

Host object:

object Host "zone-master" {
  check_command = "cluster"
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 5
  display_name = "Icinga Zone: Master"
  // my notification variables were here
}

And that's all she wrote. That's a placeholder host, not a real one, on which I run some hostless checks relevant to that topic (e.g. a check to make sure the PagerDuty API can be queried).

Also, I have this in place to track Icinga’s health across all servers running the daemon.

apply Service "icinga" {
  import "generic-service"

  check_command = "icinga"
  command_endpoint = host.vars.client_endpoint

  assign where host.vars.client_endpoint
}

Do I use this configuration on the MASTER host or on each satellite?

The host object is in master. The service check is in global-templates in my case.

Doesn't seem to work :confused: Not sure what else to do.

I've added this:

object Host "monitor.asdasdadasd.net" {
  import "generic-host"
  check_command = "cluster"
  address = "xxxxx"
  retry_interval = 1m
  max_check_attempts = 5

  vars.http_vhosts["http"] = {
    http_uri = "/"
  }
  vars.notification["mail"] = {
    groups = [ "aow" ]
  }
}

Then I stopped the service on one of the clients, but the host still shows as UP, although the last check is red.

Reachable yes
Last check 4m 42s ago
Next check in -4m 0s Reschedule
Check attempts 1/3 (hard state)
Check execution time 4.009s
Check latency 0.000787s

By host, do you mean the endpoint that you stopped Icinga on? That isn't how that check works. The cluster check in your master zone will go critical if endpoints aren't connected to it. There's also a bit of a delay on it. If you apply the Icinga service check to your host (the second example), you'll get an unknown (exit 3) on that service check if it's not connecting relatively quickly. cluster isn't a command meant to be executed on individual clients.
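For per-endpoint visibility from the master, the cluster-zone check on the same ITL page can be applied as one service per agent host. A sketch, assuming agent zones are named after their hosts and reusing the vars.client_endpoint custom variable from earlier in this thread:

```
// Sketch: runs on the master and goes CRITICAL when the agent's zone
// is not connected. vars.cluster_zone defaults to the host's name,
// so setting it explicitly is only needed if zone and host names differ.
apply Service "cluster-health" {
  check_command = "cluster-zone"

  vars.cluster_zone = host.name   // assumption: zone name == host name

  assign where host.vars.client_endpoint
}
```

Since this executes on the master rather than on the agent, it keeps working (and alerting) when the remote icinga2 daemon is stopped.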

If you set the icinga check as the host check, you might get a DOWN when that fails.

So check_command = "icinga" on the host instead of hostalive. I've never tried that, but go ahead and give it a shot if you're trying to get a host-down when Icinga isn't running.

Oh, while I'm at it though, I don't necessarily recommend that approach. You will see a similar negative countdown on the icinga check as it's making an effort at getting a response from that daemon. This will take longer to fail than the hostalive check, which is just trying to ping. I'd recommend just doing the service check like I applied and setting your notifications to cover "unknown" instead of just "critical", as unknown usually means bad things are happening as well.

I will try your suggestions again once time permits; I tried them but nothing seemed to work. I may be missing something, though.

What I'm doing for the time being: I created a custom script (or command) assigned to the master host that does a port check (the port the icinga2 service listens on) against each child host and returns an error if one of them fails.

check_tcp is built in if you want to just monitor 5665.

https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#tcp
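A sketch of that tcp check applied from the master (the service name and assign rule are assumptions; adapt to your hosts):

```
// Sketch: probe the Icinga API port from the master. Because there is
// no command_endpoint, the check executes on the master and still
// reports CRITICAL when the agent's icinga2 daemon is down.
apply Service "icinga2-port" {
  check_command = "tcp"

  vars.tcp_port = 5665   // default Icinga2 API port

  assign where host.vars.client_endpoint
}
```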

I tried this already, but I assigned it to each child host, and it never updated once the service was stopped.

When you’re at your desk, can I take a look at some of your service objects?

Yes, give me a few and I'll paste as much info as possible.