Struggling to add an HA master after fully building out a single node with Director

I originally built out a single-node instance, assuming adding HA would be easy, but I am not having much luck with that.

I have gone through the steps detailed here:

What I found afterwards was that neither side could talk to each other, and all monitoring visibility was lost, as in, neither master showed any nodes being checked. For what it’s worth, this is an agent-less, SNMP-driven environment.

monit03 shows monit03 as an available zone when checking here: infrastructure#!/icingaweb2/director/zones

monit03 (original):
Ubuntu 20.04
icinga2 version: r2.13.1-1

  • Enabled features: api checker ido-mysql mainlog notification
  • icingaweb2 version: 2.9.3
  • businessprocess version: 2.3.1
  • cube version: 1.1.1
  • director version: 1.2.0
  • doc version: 2.9.3
  • idoreports version: 0.9.1
  • incubator version: 0.6.0
  • jira version: 1.1.0
  • migrate version: 2.9.3
  • monitoring version: 2.9.3
  • pdfexport version: 0.9.1
  • reporting version: 0.10.0
  • setup version: 2.9.3
  • toplevelview version: 0.3.3
  • treeview version: 0.1.0
  • x509 version: 1.0.0

config check:

[2021-10-14 12:12:38 -0500] information/cli: Icinga application loader (version: r2.13.1-1)
[2021-10-14 12:12:38 -0500] information/cli: Loading configuration file(s).
[2021-10-14 12:12:38 -0500] information/ConfigItem: Committing config item(s).
[2021-10-14 12:12:38 -0500] information/ApiListener: My API identity: monit03
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 Host.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 3 Zones.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 2 Endpoints.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 2 ApiUsers.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 244 CheckCommands.
[2021-10-14 12:12:38 -0500] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-10-14 12:12:38 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-10-14 12:12:38 -0500] information/cli: Finished validating the configuration file(s).

zones.conf:

object Endpoint "monit03" {
}

object Endpoint "monit04" {
        host = "10.3.210.71"
}

object Zone "master" {
        endpoints = [ "monit03", "monit04" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

monit04 lacks a monit03 zone, but has a master zone instead, when checking here: infrastructure#!/icingaweb2/director/zones

monit04:
Ubuntu 20.04
icinga2 version: r2.13.1-1

  • Enabled features: api checker ido-mysql mainlog
  • icingaweb2 version: 2.9.3
  • director version: 1.2.0
  • doc version: 2.9.3
  • incubator version: 0.6.0
  • monitoring version: 2.9.3

config check:

[2021-10-14 12:17:07 -0500] information/cli: Icinga application loader (version: r2.13.1-1)
[2021-10-14 12:17:07 -0500] information/cli: Loading configuration file(s).
[2021-10-14 12:17:07 -0500] information/ConfigItem: Committing config item(s).
[2021-10-14 12:17:07 -0500] information/ApiListener: My API identity: monit04
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 3 Zones.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 2 Endpoints.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 2 ApiUsers.
[2021-10-14 12:17:07 -0500] information/ConfigItem: Instantiated 244 CheckCommands.
[2021-10-14 12:17:07 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-10-14 12:17:07 -0500] information/cli: Finished validating the configuration file(s).

zones.conf:

object Endpoint "monit03" {
}

object Endpoint "monit04" {
}

object Zone "master" {
	endpoints = [ "monit03", "monit04" ]
}

object Zone "global-templates" {
	global = true
}

object Zone "director-global" {
	global = true
}
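A side note on how this pair of files behaves: Icinga 2 only initiates an outbound cluster connection from an Endpoint that has `host` set. With the files above, monit03 dials out to monit04, while monit04 only listens. If monit04 should also be able to initiate the connection (for example, if a firewall only permits one direction), its zones.conf could additionally set `host` on monit03's Endpoint — sketched below with a placeholder address, not one taken from this thread:

```
object Endpoint "monit03" {
	// placeholder – replace with monit03's real IP or FQDN
	host = "monit03.example.com"
}
```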

In the config check, your nodes show monit03.int.sitelock.com/monit04.int.sitelock.com as their identity, but the content of zones.conf shows monit03/monit04 as the names of the Endpoint objects. These should be the same and match the common name in the certificates used by the nodes.

Have you either renamed any of the Endpoint or Zone objects in zones.conf or regenerated certificates yourself?

I think you may also have Endpoints named monit03.int.sitelock.com/monit04.int.sitelock.com deployed from Director; then these would be the ones actually used. If those are in distinct zones that have no relation to each other, this would explain the nodes not connecting to each other.

In the config check, your nodes show monit03.int.sitelock.com/monit04.int.sitelock.com as their identity, but the content of zones.conf shows monit03/monit04 as the names of the Endpoint objects. These should be the same and match the common name in the certificates used by the nodes.

That was me forgetting to remove the domain names before submitting; the domain name is not important for the investigation anyway.

Currently FQDN is used everywhere in the configs.

I think you may also have Endpoints named monit03.int.sitelock.com/monit04.int.sitelock.com deployed from Director; then these would be the ones actually used. If those are in distinct zones that have no relation to each other, this would explain the nodes not connecting to each other.

I’d love to un-deploy them if that’s the root cause and it’s possible, though I do not see how. Is there a way to confirm that this is what is happening?

If that mismatch was only there because you truncated the node names in some places, that’s probably not the issue. Given that the config check shows 2 endpoints and 3 zones, this suggests that the config from zones.conf is all it uses, and that seems fine at first glance.
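To answer the un-deploy question: you can inspect which Endpoint and Zone objects the running config actually defines, regardless of whether they came from zones.conf or from a Director deployment. Something along these lines (run on either master) should show whether extra FQDN-named objects exist:

```
# Dump all Endpoint and Zone objects from the last validated config,
# including any deployed by Director:
icinga2 object list --type Endpoint
icinga2 object list --type Zone
```

If Director-deployed duplicates show up there, removing or renaming them in Director and redeploying should resolve the conflict.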

Have you looked at the log files? monit03 should log something about connection attempts to monit04 with that configuration.
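With `host` set on monit04's Endpoint, monit03 should periodically try to connect and the cluster component logs each attempt. As a sketch, filtering the main log for ApiListener is usually enough — the sample line below is only illustrative of the kind of message to look for, not taken from this system:

```shell
# On a real node you would grep the actual log instead:
#   grep 'ApiListener' /var/log/icinga2/icinga2.log | tail -n 20
# Illustrative sample of a connection-attempt line (wording approximate):
cat > /tmp/icinga2-sample.log <<'EOF'
[2021-10-14 12:20:01 -0500] information/ApiListener: Reconnecting to endpoint 'monit04' via host '10.3.210.71' and port '5665'
EOF
grep 'ApiListener' /tmp/icinga2-sample.log
```

Repeated reconnect attempts with no "connected" follow-up usually point at a firewall or TLS problem; no attempts at all point at the zone/endpoint config not being loaded.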