Struggling to add an HA master after fully building out a single node with Director

I originally built out a single-node instance, assuming adding HA would be easy, but I am not having much luck with that.

I have gone through the steps detailed here:

What I found afterwards was that neither side could talk to each other, and all monitoring visibility was lost, as in, neither master showed any nodes being checked. For what it’s worth, this is an agent-less, SNMP-driven environment.

monit03 shows monit03 as an available zone when checking here: infrastructure#!/icingaweb2/director/zones

monit03 (original):
Ubuntu 20.04
icinga2 version: r2.13.1-1

  • Enabled features: api checker ido-mysql mainlog notification
  • icingaweb2 version: 2.9.3
  • businessprocess version: 2.3.1
  • cube version: 1.1.1
  • director version: 1.2.0
  • doc version: 2.9.3
  • idoreports version: 0.9.1
  • incubator version: 0.6.0
  • jira version: 1.1.0
  • migrate version: 2.9.3
  • monitoring version: 2.9.3
  • pdfexport version: 0.9.1
  • reporting version: 0.10.0
  • setup version: 2.9.3
  • toplevelview version: 0.3.3
  • treeview version: 0.1.0
  • x509 version: 1.0.0

config check:

[2021-10-14 12:12:38 -0500] information/cli: Icinga application loader (version: r2.13.1-1)
[2021-10-14 12:12:38 -0500] information/cli: Loading configuration file(s).
[2021-10-14 12:12:38 -0500] information/ConfigItem: Committing config item(s).
[2021-10-14 12:12:38 -0500] information/ApiListener: My API identity: monit03
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 Host.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 3 Zones.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 2 Endpoints.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 2 ApiUsers.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 244 CheckCommands.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-10-14 12:12:38 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-10-14 12:12:38 -0500] information/cli: Finished validating the configuration file(s).

zones.conf:

object Endpoint "monit03" {
}

object Endpoint "monit04" {
        host = "10.3.210.71"
}

object Zone "master" {
        endpoints = [ "monit03", "monit04" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

monit04 lacks a monit03 zone, but has a master zone instead, when checking here: infrastructure#!/icingaweb2/director/zones

monit04:
Ubuntu 20.04
icinga2 version: r2.13.1-1

  • Enabled features: api checker ido-mysql mainlog
  • icingaweb2 version: 2.9.3
  • director version: 1.2.0
  • doc version: 2.9.3
  • incubator version: 0.6.0
  • monitoring version: 2.9.3

config check:

[2021-10-14 12:17:07 -0500] information/cli: Icinga application loader (version: r2.13.1-1)
[2021-10-14 12:17:07 -0500] information/cli: Loading configuration file(s).
[2021-10-14 12:17:07 -0500] information/ConfigItem: Committing config item(s).
[2021-10-14 12:17:07 -0500] information/ApiListener: My API identity: monit04
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 3 Zones.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 2 Endpoints.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 2 ApiUsers.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 244 CheckCommands.
[2021-10-14 12:17:07 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-10-14 12:17:07 -0500] information/cli: Finished validating the configuration file(s).

zones.conf:

object Endpoint "monit03" {
}

object Endpoint "monit04" {
}

object Zone "master" {
	endpoints = [ "monit03", "monit04" ]
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}
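A side note on how this pair of files behaves: Icinga 2 only initiates an outbound cluster connection from an Endpoint that has `host` set. With the files above, monit03 dials out to monit04, while monit04 only listens. If monit04 should also be able to initiate the connection (for example, if a firewall only permits one direction), its zones.conf could additionally set `host` on monit03's Endpoint — sketched below with a placeholder address, not one taken from this thread:

```
object Endpoint "monit03" {
	// placeholder – replace with monit03's real IP or FQDN
	host = "monit03.example.com"
}
```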

In the config check, your nodes show monit03.int.sitelock.com/monit04.int.sitelock.com as their identity, but the content of zones.conf shows monit03/monit04 as the names of the Endpoint objects. These should be the same and match the common name in the certificates used by the nodes.

Have you either renamed any of the Endpoint or Zone objects in zones.conf or regenerated certificates yourself?

I think you may also have Endpoints named monit03.int.sitelock.com/monit04.int.sitelock.com deployed from Director; then these would be the ones actually used. If those are in distinct zones that have no relation to each other, this would explain the nodes not connecting to each other.

In the config check, your nodes show monit03.int.sitelock.com/monit04.int.sitelock.com as their identity, but the content of zones.conf shows monit03/monit04 as the names of the Endpoint objects. These should be the same and match the common name in the certificates used by the nodes.

That was me forgetting to remove the domain names before submitting; the domain name is not important for the investigation anyway.

Currently FQDN is used everywhere in the configs.

I think you may also have Endpoints named monit03.int.sitelock.com/monit04.int.sitelock.com deployed from Director; then these would be the ones actually used. If those are in distinct zones that have no relation to each other, this would explain the nodes not connecting to each other.

I’d love to un-deploy them if that’s the root cause and it’s possible, though I do not see how. Is there a way to confirm that this is what is happening?

If that mismatch was only there because you truncated the node names in some places, that’s probably not the issue. Given that the config check shows 2 endpoints and 3 zones, this suggests that the config from zones.conf is all it uses, and that seems fine at first glance.
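To answer the un-deploy question: you can inspect which Endpoint and Zone objects the running config actually defines, regardless of whether they came from zones.conf or from a Director deployment. Something along these lines (run on either master) should show whether extra FQDN-named objects exist:

```
# Dump all Endpoint and Zone objects from the last validated config,
# including any deployed by Director:
icinga2 object list --type Endpoint
icinga2 object list --type Zone
```

If Director-deployed duplicates show up there, removing or renaming them in Director and redeploying should resolve the conflict.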

Have you looked at the log files? monit03 should log something about connection attempts to monit04 with that configuration.
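With `host` set on monit04's Endpoint, monit03 should periodically try to connect and the cluster component logs each attempt. As a sketch, filtering the main log for ApiListener is usually enough — the sample line below is only illustrative of the kind of message to look for, not taken from this system:

```shell
# On a real node you would grep the actual log instead:
#   grep 'ApiListener' /var/log/icinga2/icinga2.log | tail -n 20
# Illustrative sample of a connection-attempt line (wording approximate):
cat > /tmp/icinga2-sample.log <<'EOF'
[2021-10-14 12:20:01 -0500] information/ApiListener: Reconnecting to endpoint 'monit04' via host '10.3.210.71' and port '5665'
EOF
grep 'ApiListener' /tmp/icinga2-sample.log
```

Repeated reconnect attempts with no "connected" follow-up usually point at a firewall or TLS problem; no attempts at all point at the zone/endpoint config not being loaded.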