Icinga not sending notifications when machine is down or Icinga service is not running

Hello,
I am running Icinga version 2.10.3-1 on the master.

I ran into an issue recently where one of my servers locked up, but Icinga did not send any notifications about it. Once I rebooted the machine, I stopped the Icinga service on it to see what would happen: Icinga basically just keeps trying to execute the checks but never sends any warnings or goes to a CRITICAL state. I checked the icinga2 logs and the only message relating to the server is "information/ApiListener: Finished reconnecting to endpoint ‘server-test’ via host ‘123.45.6.78’ and port ‘5665’". But how is it connecting if the Icinga service is down on the client? I checked whether the master node was connected to the client with the lsof command, grepping for the IP address, but the connection does not show up. I had this issue once before, and what solved it then was changing the hostalive command to cluster-zone, but that did not help this time.

If I manually stop the Icinga service on a client machine, all the checks become late, but the machine does not change its state; it remains “OK”. Also, when I check the log, it at first says it is not able to connect to the client machine, but the next message says it “finished connecting to the client”. Any ideas?

Basically, my machines are not going to the DOWN state and the checks just remain late. Does anyone have any suggestions? I tried changing the hostalive check to a cluster-zone check, and now the check is in a CRITICAL state saying “Zone is not connected. Log lag: less than 1 millisecond”, even though both machines are connected according to the lsof command.

The issue you are having isn’t very clear, but it sounds like you are not getting notifications when other Icinga 2 instances are not running.

The question to ask is: are you doing health checks of the Icinga 2 services themselves? You need to configure monitoring of Masters, Satellites, and Agents; it’s not automatic.

If a Master, Satellite, or Agent loses connectivity, the services monitored through it will not send notifications, because those nodes are responsible for reporting which checks have failed. So you have to monitor the cluster and the Icinga service itself, for example:

apply Service "icinga_cluster" {
  display_name = "Icinga - Cluster"
  import "gold-service"
  check_command = "cluster"
  command_endpoint = host.command_endpoint
  assign where host.vars.icinga
}

apply Service "icinga_service" {
  display_name = "Icinga - Service"
  import "gold-service"
  check_command = "icinga"
  vars.icinga_min_version = "2.9.1"
  command_endpoint = host.command_endpoint
  assign where host.vars.icinga
}
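
Both rules match on host.vars.icinga and run on the host’s own endpoint, so each Agent’s host object needs that flag (and a command_endpoint) set. A minimal sketch, assuming a hypothetical agent host whose endpoint is named after it:

object Host "agent01.example.com" {
  import "generic-host"       // stock template from the sample configuration
  address = "10.0.0.21"
  vars.icinga = true          // picked up by the apply rules above
  command_endpoint = name     // assumes endpoint name == host name
}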

Hello. Thank you so much for getting back to me!

So, for your suggestion of monitoring the cluster and Icinga, would I do this on the master node? Would I add this to each agent node’s services.conf file, or just the master’s services.conf, so it would apply to each satellite? Also, as for the import statement, I am not familiar with “gold-service”. Could you elaborate, or point me to any documentation about it?

Also, I am currently monitoring Icinga on all the agent hosts. When I stop the Icinga service, the checks for that machine all just become late, but it does not change the state of the machine in any way. So unless I go to Icinga Web and happen to see that those checks are late, I will never know that the Icinga service stopped.

The configuration really depends on your environment.

You should understand that the Master nodes send the notifications. With only one (1) Master, notifications would stop if it were down.
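
For reference, a high-availability master zone is just two endpoints sharing one zone in zones.conf; either node can then send notifications if the other is down. A minimal sketch with hypothetical hostnames:

object Endpoint "master1.example.com" { }
object Endpoint "master2.example.com" { }

object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com" ]
}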

My Hierarchy

It contains two (2) Master nodes and there are two (2) Satellites for each zone.

  • The Satellites monitor Agents in the zone and ping the Agent hosts.
  • The Master nodes monitor the Satellites and each other.

If a Master is down, the other Master sends the notification. If a Satellite is down, the Master nodes send the notification. If an Agent is down, the Satellites relay the status to the Master nodes, and a Master sends the notification.
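
That relaying is driven by the parent attribute of each zone in zones.conf. Continuing the sketch above with hypothetical names, a Satellite zone points at the master zone, and each Agent zone points at its Satellite zone:

object Endpoint "sat1.dc1.example.com" { }
object Endpoint "sat2.dc1.example.com" { }

object Zone "dc1-satellites" {
  endpoints = [ "sat1.dc1.example.com", "sat2.dc1.example.com" ]
  parent = "master"           // results are relayed up to the Master zone
}

object Endpoint "agent01.dc1.example.com" { }

object Zone "agent01.dc1.example.com" {
  endpoints = [ "agent01.dc1.example.com" ]
  parent = "dc1-satellites"   // the Satellites schedule this Agent's checks
}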

gold-service is just a template.

template Service "gold-service" {
  max_check_attempts = 3
  check_interval = 5m
  retry_interval = 1m

  // consumed by a matching "apply Notification" rule (sketched below)
  vars.notification["mail"] = {
    groups = [ "icingaadmins" ]
    users = [ ]
  }
}
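
Note that vars.notification["mail"] only takes effect if a matching notification apply rule exists. A minimal sketch, assuming the stock mail-service-notification template from the sample configuration:

apply Notification "mail" to Service {
  import "mail-service-notification"
  user_groups = service.vars.notification["mail"].groups
  users = service.vars.notification["mail"].users
  assign where service.vars.notification["mail"]
}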

Agents are just responsible for the localhost and relay check results upward in the hierarchy, so you need to monitor Agents from a higher point in the environment. In your case, the checks are not happening, nothing is being relayed upward, and you’re not monitoring Icinga from the right place.

With an Agent down, its checks will just get stale (late). You need to configure the Master/Satellite nodes to monitor the Agents, to make sure each Agent is actually running.
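
One way to do that (a sketch, reusing the gold-service template and the host.vars.icinga convention from above) is a cluster-zone check with no command_endpoint, so it executes in the parent zone and goes CRITICAL as soon as the Agent’s zone disconnects, instead of just going stale:

apply Service "agent_connectivity" {
  display_name = "Icinga - Agent connection"
  import "gold-service"
  check_command = "cluster-zone"
  vars.cluster_zone = host.name   // assumes zone name == host object name
  // no command_endpoint: the check runs on the parent (Master/Satellite)
  assign where host.vars.icinga
}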

I used your suggestion of doing the cluster health check and had the check run on the master. It solved my issue. Thank you so much for your help and the thorough explanations; it really cleared up a lot for me. Thanks again!