HA Zone: inconsistent status reporting

Hey there,

I’m currently testing Icinga HA setups with one master and two agents (see zones.conf below).
I’ve configured two clients to be monitored by the agents in the agent zone with check_ping (a rough sketch of those host definitions follows the zones.conf below). This is working fine so far.

Then I blocked requests from agent1 to the two clients using the local firewall and discovered that client1 gets reported as DOWN while client2 gets reported as UP. After allowing requests from both agents to both clients again, both clients are reported as UP. When I then block all requests from agent2 to the clients, client1 gets reported as UP and client2 gets reported as DOWN.

So the problem is: whether a client is reported as UP or DOWN depends on which agent can reach it.
I understand that it’s a tough decision whether to report a client as DOWN or UP when only one of two agents can reach it, but making that depend on which agent happens to be the one that can reach it doesn’t seem like the right choice to me.
Is there a way to force a check that failed on one agent to be re-run by the other agent in the same zone before it is reported as failed globally?

Thanks,
Jo-Jo

Additional Information:

  • Version used: r2.12.3-1
  • Operating System and version: Debian 10
  • Enabled features: api checker command compatlog debuglog ido-pgsql mainlog notification statusdata

agent1.localdomain zones.conf:

object Endpoint "master.localdomain" {
  host = "192.168.1.3"
  port = "5665"
}

object Zone "master" {
  endpoints = [ "master.localdomain" ]
}

object Endpoint "agent1.localdomain" {
}

object Endpoint "agent2.localdomain" {
  host = "192.168.1.13"
  port = "5665"
}

object Zone "agents" {
  endpoints = [ "agent1.localdomain", "agent2.localdomain" ]
  parent = "master"
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}
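
For reference, the two clients are defined as plain Host objects in the agents zone on the master, roughly along these lines (the file name, host names and addresses here are only placeholders):

zones.d/agents/clients.conf on the master:

object Host "client1" {
  address = "192.168.1.21"
  check_command = "hostalive"   // hostalive wraps check_ping
}

object Host "client2" {
  address = "192.168.1.22"
  check_command = "hostalive"
}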

Hello Jo-Jo,
I hope you are doing well. High Availability (HA) is set up on the Icinga master servers only. If one master server went offline, the agents would still continue to be monitored by the second master server. From the zones.conf file you shared, you only have one master server, so this is not a correct HA setup. Please review the online documentation for more details about setting up HA.
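
Roughly, an HA master zone needs both master endpoints in the same ‘master’ zone, something like this (the hostnames and addresses below are only placeholders, not your actual values):

object Endpoint "master1.localdomain" {
  host = "192.168.1.3"
}

object Endpoint "master2.localdomain" {
  host = "192.168.1.4"
}

object Zone "master" {
  endpoints = [ "master1.localdomain", "master2.localdomain" ]
}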

Regards
Alex

Hello Alex,
Thanks for the quick response :slight_smile:

I did these tests with two masters (and no further Icinga instances) as well and got the same results.
When one Icinga instance goes offline (gets stopped), everything is handled fine and the remaining instance takes over all the checks.
My problem is when an instance is running but isn’t able to reach the client due to network issues between the two, while another instance in the same zone (no matter whether master or satellite/agent) is able to reach the client because it can use a different path in my network.

Regards
Jo-Jo

@Jo-Jo,
From the zones.conf file you shared earlier, you did not have two master servers configured in the same ‘master’ zone. You cannot have an HA zone without two endpoints. What is the ‘agents’ zone? Is this a satellite zone? Can you share your master’s zones.conf file? More details are needed to help resolve your problem.

Also, can you please describe your problem using the standard Icinga terms (master, satellites, agents, etc.)? You are using the term ‘clients’. Is this an Icinga agent (not a master or satellite)? It is confusing and will make it hard for the Icinga community to help you.

Regards
Alex

@aclark6996,
Sorry, this is my first post here and I know my problem is a bit tricky.

The machine I named ‘client’ is reachable via ICMP ping from the Icinga masters, but does not run any software from the Icinga stack. It is just actively monitored using ICMP pings, please see the config attached. So what would be the correct term for such a machine in Icinga terminology?

Let me just simplify the setup here:

To understand and reproduce my problem, please follow these steps:

  1. Set up two machines as Icinga HA masters, as described here: Icinga » Blog » How to set up High-Availability Masters
  2. Set up icingaweb2 on one of the masters.
  3. Take a third machine that is reachable via ICMP ping from the two masters. This is the machine I named ‘client’.
  4. Place the attached configs on the masters and reload icinga2 on them.
  5. See client1 show up in icingaweb and get reported as UP.
  6. Log in to client1 and run ‘iptables -I INPUT 1 -s 192.168.2.3 -j DROP’ to drop connections from master1 to client1.
  7. Watch client1 in icingaweb.
  8. Run ‘iptables -F && iptables -I INPUT 1 -s 192.168.2.13 -j DROP’ on client1.
  9. Watch client1 in icingaweb.
  10. You’ll see: blocking connections from one master reports client1 as DOWN, blocking them from the other master reports client1 as UP.

As you can see, whether client1 is reported as UP or DOWN depends on which master is able to ping it. I’d like this to be consistent: as long as at least one of the two masters is able to ping the client, it should be reported as UP, no matter which one it is.

Thanks
Jo-Jo

master1.localdomain:
zones.conf:

object Endpoint "master1.localdomain" {
}

object Endpoint "master2.localdomain" {
}

object Zone "master" {
	endpoints = [ "master1.localdomain", "master2.localdomain" ]
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}

master2.localdomain:
zones.conf:

object Endpoint "master1.localdomain" {
	host = "192.168.2.3"
	port = "5665"
}

object Zone "master" {
	endpoints = [ "master1.localdomain", "master2.localdomain" ]
}

object Endpoint "master2.localdomain" {
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}

configs on both masters:
zones.d/global-templates/all.conf:

template Host "generic-host" {
    max_check_attempts = 2
    check_interval = 1m
    retry_interval = 30s
    enable_flapping = true
    enable_perfdata = false
    enable_notifications = true
    flapping_threshold_low = 5.0
    flapping_threshold_high = 25.0

    check_command = "check-alive"
}

object CheckCommand "check-alive" {
    import "plugin-check-command"
    command = [PluginDir + "/check_ping"]

    arguments = {
        "-H" = "$address$"
        "-w" = "100,30%"
        "-c" = "500,60%"
        "-p" = "3"
    }
}

zones.d/master/client1.localdomain.conf:

object Host "client1.localdomain" {
  import "generic-host"
  address = "192.168.2.14"
  check_command = "hostalive"
}

Hello @Jo-Jo,
In your Icingaweb2 frontend, what is the ‘check source’ for this host (client)? Does the check source match the master that shows the ‘Down’ state when you drop the connection with iptables? Both master servers are not checking the host (client); only the master listed in the check source runs the check. When you enable HA, checks are load-balanced between both masters. So by running the iptables command you are effectively unplugging the network between the host and that master, which produces a ‘Down’ state.

I’m not sure what the correct term for a machine without the Icinga agent is. I guess it could be called a host without the Icinga agent installed, or a network device. The Icinga documentation does not really reference these, which caused confusion for me as well. Only machines with the Icinga agent installed require zone and endpoint objects in your configuration.

Regards
Alex

Hello @aclark6996,
Yes, the check source matches the master that shows the ‘Down’ state when dropping the connections.

Based on this, I’d assume that what I’m seeing here is the intended behavior… at least for active checks.

Could there be a way to reconfigure this behavior and at least make it consistent? I think it could cause some really bad false positives and headaches when debugging my monitoring.

Regards
Jo-Jo

You can pin the host check to a specific master by setting the ‘command_endpoint’ attribute on the Host object. But if this master goes offline, the host checks will not automatically fail over to the other master.
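
For example (a sketch based on the host config you posted; pinning it to master1.localdomain is just an assumption, you can pick either master):

object Host "client1.localdomain" {
  import "generic-host"
  address = "192.168.2.14"
  check_command = "hostalive"
  // Pin the check to this master; it will not fail over automatically if the endpoint goes offline.
  command_endpoint = "master1.localdomain"
}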

I have read other users posting here about running the ping check from both masters before a problem notification is sent out. You may need to dig into older posts for a solution to this setup.
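
I can’t point you to the exact posts, but one possible pattern (just a sketch, not a tested solution) would be one ping service per master, pinned with command_endpoint, so both network paths are visible side by side and notifications can be based on those services instead of the host check:

apply Service "ping-from-master1" {
  check_command = "ping4"
  command_endpoint = "master1.localdomain"
  assign where host.name == "client1.localdomain"
}

apply Service "ping-from-master2" {
  check_command = "ping4"
  command_endpoint = "master2.localdomain"
  assign where host.name == "client1.localdomain"
}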

Regards
Alex