Get notified when Icinga2 service is NOT running on remote host

I discovered today that the Icinga2 service was not running on a few of my hosts. I was not notified about it.

If I stop the Icinga2 service manually on any host, I would like a notification to be triggered. How can I do this?

If you are using Icinga as the agent on the remote nodes, you can try the cluster check:
https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#cluster

If you are using NRPE or check_by_ssh, you can use the check_procs plugin to detect whether Icinga is running.
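As a minimal sketch of the check_by_ssh route, using the ITL's by_ssh CheckCommand (the service name and the assign rule are assumptions; adapt to your environment):

```
// Sketch: run check_procs over SSH from the master and go CRITICAL
// if no icinga2 process is found on the remote host.
apply Service "icinga2-procs" {
  check_command = "by_ssh"

  vars.by_ssh_command = [ PluginDir + "/check_procs" ]
  vars.by_ssh_arguments = {
    "-C" = "icinga2"   // match processes whose command name is icinga2
    "-c" = "1:"        // CRITICAL if fewer than 1 process
  }

  assign where host.vars.client_endpoint   // assumed custom variable
}
```

Because this runs over SSH rather than through the Icinga agent, it still works when the remote icinga2 daemon is down, which is exactly the failure case you want to catch.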

If you mean checking your masters/satellites, again see the cluster check. If you only have one master, that is a bit of a problem, but you can use the host's cron capability to run a test that checks whether the service is running and sends a notification if it is not.


I tried doing this, but the status of the host never changes; it always says "UP" until I restart the service on the remote. Then when I stop the service again it still says UP, but the last check is red.

I run the cluster check from a dummy host against the master zone and get CRITICALs when something is down. There is also the zone check, which is the next item down on the page linked above.

Do you have the icinga check running on every server? It’ll go unknown with “[parent] is not connected to [your server].” If you have it set to page you for unknowns, that’ll get your attention. Also, any checks running on that node will light up purple too, so it tends to stick out.

Inside zones.d/<server>/hosts.conf I have this defined:

/etc/icinga2/zones.d/xxx-xxx-app-1/hosts.conf

object Host "xxx-xxx-app-1" {
  check_command = "hostalive"
  #check_command = "cluster" (I set it back to hostalive for now, but tried cluster, tcp, etc.)
  address = "xxx"
  vars.client_endpoint = name
  vars.notification["mail"] = {
    groups = [ "aow" ]
  }

  vars.cloud = "xxx-CLOUD-2"
  max_check_attempts = 3
  check_interval = 5s
  retry_interval = 1m
  vars.sms_notify = true
  vars.mail_notify = true
  vars.voice_notify = true
}

Primary hosts.conf inside /etc/icinga2:

object Host "monitor.xxxx.net" {
  import "generic-host"
  check_command = "hostalive"
  address = "xxxxx"
  vars.http_vhosts["http"] = {
    http_uri = "/"
  }
  vars.notification["mail"] = {
    groups = [ "aow" ]
  }
  enable_notifications = true
}

Do you have an example of your config?

Host object:

object Host "zone-master" {
  check_command = "cluster"
  check_interval = 5m
  retry_interval = 1m
  max_check_attempts = 5
  display_name = "Icinga Zone: Master"
  // my notification variables were here
}

And that's all she wrote. That's a placeholder host, not a real one, on which I run some hostless checks relevant to that topic (e.g. a check to make sure the PagerDuty API can be queried).

Also, I have this in place to track Icinga’s health across all servers running the daemon.

apply Service "icinga" {
  import "generic-service"

  check_command = "icinga"
  command_endpoint = host.vars.client_endpoint

  assign where host.vars.client_endpoint
}

Do I use this configuration on the MASTER host or on each satellite?

The host object is in master. The service check is in global-templates in my case.

Doesn't seem to work :confused: Not sure what else to do.

I've added this:

object Host "monitor.asdasdadasd.net" {
  import "generic-host"
  check_command = "cluster"
  address = "xxxxx"
  retry_interval = 1m
  max_check_attempts = 5

  vars.http_vhosts["http"] = {
    http_uri = "/"
  }
  vars.notification["mail"] = {
    groups = [ "aow" ]
  }
}

Then I stopped the service on one of the clients, but the host still shows as UP, although the last check is red.

Reachable yes
Last check 4m 42s ago
Next check in -4m 0s Reschedule
Check attempts 1/3 (hard state)
Check execution time 4.009s
Check latency 0.000787s

By host, do you mean the endpoint that you stopped Icinga on? That isn't how that check works. The cluster check in your master zone will go critical if endpoints aren't connected to it. There's also a bit of a delay on it. If you apply the Icinga service check to your host (the second example), you'll get an unknown (exit 3) on that service check if it's not connecting relatively quickly. cluster isn't a command meant to be executed on individual clients.
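For per-endpoint visibility from the master, the cluster-zone check on the same ITL page can be applied as one service per agent host. A sketch, assuming agent zones are named after their hosts and reusing the vars.client_endpoint custom variable from earlier in this thread:

```
// Sketch: runs on the master and goes CRITICAL when the agent's zone
// is not connected. vars.cluster_zone defaults to the host's name,
// so setting it explicitly is only needed if zone and host names differ.
apply Service "cluster-health" {
  check_command = "cluster-zone"

  vars.cluster_zone = host.name   // assumption: zone name == host name

  assign where host.vars.client_endpoint
}
```

Since this executes on the master rather than on the agent, it keeps working (and alerting) when the remote icinga2 daemon is stopped.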

If you set the icinga check as the host check, you might get a DOWN when that fails.

So check_command = "icinga" on the host instead of hostalive. I've never tried that, but go ahead and give it a shot if you're trying to get a host-down when Icinga isn't running.

Oh, while I'm at it though, I don't necessarily recommend that approach. You will see a similar negative countdown on the icinga check as it's making an effort at getting a response from that daemon. This will take longer to fail than the hostalive check, which is just trying to ping. I'd recommend just doing the service check like I applied and setting your notifications to cover "unknown" instead of just "critical", as unknown usually means bad things are happening as well.

I will try your suggestions again once time permits; I tried them but nothing seemed to work. I may be missing something, though.

What I'm doing for the time being: I created a custom script (or command) assigned to the master host that does a port check (the port the icinga2 service listens on) against each child host and returns an error if one of them fails.

check_tcp is built in if you want to just monitor 5665.

https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#tcp
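A sketch of that tcp check applied from the master (the service name and assign rule are assumptions; adapt to your hosts):

```
// Sketch: probe the Icinga API port from the master. Because there is
// no command_endpoint, the check executes on the master and still
// reports CRITICAL when the agent's icinga2 daemon is down.
apply Service "icinga2-port" {
  check_command = "tcp"

  vars.tcp_port = 5665   // default Icinga2 API port

  assign where host.vars.client_endpoint
}
```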

I tried this already, but I assigned it to each child host, and it never updated once the service was stopped.

When you’re at your desk, can I take a look at some of your service objects?

Yes, give me a few and I'll paste as much info as possible.