Icinga2 HA - no hosts after failover

Hello,

I am setting up a master-master (HA) Icinga 2 cluster and am testing it in a Vagrant/VM environment. I believe I have a mostly working setup, but one thing isn’t working: when I shut down the icinga2 service on the node currently listed as ‘Active Endpoint’ (in this case nmstest01), the secondary node takes over, but all my hosts disappear.

nmstest01 is the only node with configuration in /etc/icinga2/zones.d; that directory is empty on nmstest02, as per the documentation (Distributed Monitoring - Icinga 2).

The /var/log/icinga2/icinga2.log files on both nodes seem to show that the state sync is working fine. So it is unclear to me why stopping nmstest01 (simulating a node reboot or node failure) makes all the hosts disappear.

  • Version used (version: r2.12.4-1)
  • Operating System and version (Ubuntu 20.04.1 LTS, Focal Fossa)
  • Enabled features (Enabled features: api checker command ido-mysql influxdb mainlog)
  • Icinga Web 2 version and modules (2.8.2)
  • Config validation (icinga2 daemon -C)
root@nmstest01:/etc/icinga2# cat zones.conf
/*
 * Generated by Icinga 2 node setup commands
 * on 2021-06-07 07:48:07 -0400
 */

object Endpoint "nmstest01.agilitypr.internal" {
}

object Endpoint "nmstest02.agilitypr.internal" {
    host = "172.30.0.137" // actively connect to secondary
}

object Zone "master" {
        endpoints = [ "nmstest01.agilitypr.internal","nmstest02.agilitypr.internal" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}
root@nmstest02:/etc/icinga2# cat zones.conf
object Endpoint "nmstest01.agilitypr.internal" {
}

object Endpoint "nmstest02.agilitypr.internal" {
}

object Zone "master" {
        endpoints = [ "nmstest01.agilitypr.internal","nmstest02.agilitypr.internal" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Did you change the ‘accept_config’ attribute on the ApiListener object to true on your master servers? It sounds like your config files are not syncing over to the second master server. See the Top Down Config Sync documentation for more details.

Regards
Alex

Both of my hosts now have the following in their api.conf config file:

object ApiListener "api" {

  accept_config = true
  accept_commands = true

  ticket_salt = TicketSalt
}

But the issue still exists. Restarting the icinga2 process on nmstest01 does cause the hosts to reappear.

I still have not figured out what the issue is: the logs are not showing any errors, and nothing obvious stands out.

Should icinga2 daemon --validate produce the same output on both of the HA hosts? In my environment, only the config master shows output indicating hosts.

nmstest01 - config master with content in zones.d

[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 5 Hosts.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 63 Downtimes.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 5 NotificationCommands.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 FileLogger.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 189 Notifications.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 22 HostGroups.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 236 Zones.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 235 Endpoints.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 3 ApiUsers.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 ApiListener.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 116 CheckCommands.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 7 TimePeriods.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 9 UserGroups.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 9 Users.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 58 Services.
[2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 9 ServiceGroups.
[2021-06-09 09:06:32 -0400] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-06-09 09:06:32 -0400] information/cli: Finished validating the configuration file(s).

nmstest02 - no zones.d directory

[2021-06-09 09:06:49 -0400] information/cli: Icinga application loader (version: r2.12.4-1)
[2021-06-09 09:06:49 -0400] information/cli: Loading configuration file(s).
[2021-06-09 09:06:49 -0400] information/ConfigItem: Committing config item(s).
[2021-06-09 09:06:49 -0400] information/ApiListener: My API identity: nmstest02.agilitypr.internal
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 FileLogger.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 3 Zones.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 2 Endpoints.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 ApiListener.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 86 CheckCommands.
[2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2021-06-09 09:06:49 -0400] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-06-09 09:06:49 -0400] information/cli: Finished validating the configuration file(s).
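
One way to make the difference between the two validation runs explicit is to compare the per-type object counts. The following is only a minimal sketch and not part of Icinga 2: it assumes the output of the validation command on each node has been saved as text, and the function names `object_counts` and `count_diff` are made up for illustration. It parses nothing but the `Instantiated N Type.` lines shown above.

```python
import re
from typing import Dict, Tuple

# Matches lines such as:
# [2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 5 Hosts.
_INSTANTIATED = re.compile(r"Instantiated (\d+) (\w+)\.")


def object_counts(validation_output: str) -> Dict[str, int]:
    """Return {object type: count} parsed from saved validation output."""
    counts = {}
    for match in _INSTANTIATED.finditer(validation_output):
        counts[match.group(2)] = int(match.group(1))
    return counts


def count_diff(first: Dict[str, int], second: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return the types whose counts differ as {type: (first, second)}.

    A type that is missing from one side is treated as a count of 0.
    """
    diff = {}
    for obj_type in sorted(set(first) | set(second)):
        a, b = first.get(obj_type, 0), second.get(obj_type, 0)
        if a != b:
            diff[obj_type] = (a, b)
    return diff


if __name__ == "__main__":
    # Abbreviated excerpts of the two outputs above.
    nmstest01 = """
    [2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 5 Hosts.
    [2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 236 Zones.
    [2021-06-09 09:06:32 -0400] information/ConfigItem: Instantiated 1 ApiListener.
    """
    nmstest02 = """
    [2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 3 Zones.
    [2021-06-09 09:06:49 -0400] information/ConfigItem: Instantiated 1 ApiListener.
    """
    diff = count_diff(object_counts(nmstest01), object_counts(nmstest02))
    for obj_type, (a, b) in diff.items():
        print(f"{obj_type}: nmstest01={a} nmstest02={b}")
```

Types that only one node reports (Hosts, Services and so on in the output above) show up with a count of 0 on the other side, which makes it easy to see which objects the second master has not loaded.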

I managed to resolve my issue by clearing the state on nmstest02:

sudo service icinga2 stop
rm -rf /var/lib/icinga2/icinga2.state
rm -rf /var/lib/icinga2/api/*
sudo service icinga2 start

and then letting it resync.
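
For anyone who wants to preview what that cleanup would touch before deleting anything, here is a rough sketch of the two rm commands as a script with a dry-run default. It is not an official tool: the function name `clear_cluster_state` and its parameters are hypothetical, and the default path is simply the one used in the commands above. The icinga2 service still has to be stopped first and started again afterwards.

```python
import shutil
from pathlib import Path


def clear_cluster_state(lib_dir="/var/lib/icinga2", dry_run=True):
    """Remove icinga2.state and everything inside api/ under lib_dir.

    Returns the list of paths that were removed (or, with dry_run=True,
    that would be removed). Stop the icinga2 service before calling this
    with dry_run=False.
    """
    lib = Path(lib_dir)
    targets = []

    state_file = lib / "icinga2.state"
    if state_file.exists():
        targets.append(state_file)

    api_dir = lib / "api"
    if api_dir.is_dir():
        # Contents only; the api/ directory itself is kept.
        # Unlike the shell glob api/*, iterdir() also lists hidden entries.
        targets.extend(sorted(api_dir.iterdir()))

    if not dry_run:
        for path in targets:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path)
            else:
                path.unlink()
    return targets


if __name__ == "__main__":
    # Dry run by default: only print what would be deleted.
    for path in clear_cluster_state():
        print(f"would remove {path}")
```

Calling `clear_cluster_state(dry_run=False)` performs the actual deletion, so it is worth checking the dry-run listing first.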