Top down configuration not being synced

I have a distributed Icinga 2.10.4-1 setup with 2 master nodes, 3 satellites and about 100 client nodes. For some reason the configuration from the primary master node is not being synced to the secondary master nor to the satellite nodes. Is there a command to force config synchronization?

I’m having long-term issues with the secondary master instance. From what I’m seeing, it appears that the Icinga 2 cluster got into a split-brain situation where each master is propagating a different zone configuration. According to the docs this setup should be supported, but to me it seems to bring more issues than benefits.

Here’s primary master config:

object Endpoint "icinga-bhs1.example.com"  {
  host = "10.0.0.3"
  port = 5667
}

object Endpoint "icinga-dc08.example.com"  {
  host = "10.0.0.8"
  port = 5667
}

object Endpoint "icinga-dc10.example.com"  {
  host = "10.0.0.10"
  port = 5667
}

object Endpoint "icinga01.example.com"  {
}

object Endpoint "icinga02.example.com"  {
  host = "10.0.0.2"
  port = 5667
}

object Zone "eu-west"  {
  endpoints = [ "icinga-dc10.example.com", "icinga-dc08.example.com", ]
  parent = "master"
}

object Zone "global-templates"  {
  global = true
}

object Zone "master"  {
  endpoints = [ "icinga01.example.com", "icinga02.example.com", ]
}

object Zone "us-east"  {
  endpoints = [ "icinga-bhs1.example.com", ]
  parent = "master"
}

Hello,

Did you try to run icinga2 daemon -C on the second master? If there is any error, it will not reload or sync any new configuration and will keep running on the old one the whole time.
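
A quick way to check this on the secondary master (paths and service name assume a standard systemd package install):

# validate the configuration; a non-zero exit code means Icinga 2 keeps the old config
icinga2 daemon -C

# look at the recent daemon log for reload or sync errors
journalctl -u icinga2 -n 200 --no-pager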

Sure, all Icinga instances appear to be healthy. I haven’t noticed any errors, except the outdated zone config in /var/lib/icinga2/api/zones/

Please also add the IP for the endpoint "icinga01.example.com" to the zone configuration on all masters/satellites and test again, for example as sketched below.
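
A minimal sketch of what that endpoint entry could look like (the address is a placeholder, adjust it to your network):

object Endpoint "icinga01.example.com"  {
  host = "10.0.0.1"   // placeholder address of the primary master
  port = 5667
}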


I’ve added the zone config. The configuration is generated by Puppet. The secondary master uses pretty much the same configuration, except for this part:

object Endpoint "icinga01.example.com"  {
  host = "10.0.0.1"
  port = 5667
}

object Endpoint "icinga02.example.com"  {
}

Also check whether the API on all masters/satellites accepts commands and configuration.

Why is icinga01.example.com missing on the secondary master? This makes that instance not trust the first master on connect, and it will neither sync nor accept anything. Typically, this is a self-made split-brain scenario.

To me this looks like a general misunderstanding of how zones/endpoints are configured.

Please extract the exact configuration from 1) master A and 2) master B using icinga2 object list --type Zone and icinga2 object list --type Endpoint, as sketched below.
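
Something along these lines, run on each master (the file names are just examples):

icinga2 object list --type Zone > /tmp/zones-$(hostname -f).txt
icinga2 object list --type Endpoint > /tmp/endpoints-$(hostname -f).txt

# copy the dumps to one machine and compare them, e.g.
diff /tmp/zones-icinga01.example.com.txt /tmp/zones-icinga02.example.com.txt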

Cheers,
Michael

It’s not missing there. The primary master is called icinga01.example.com; it doesn’t have an IP address and port because there’s no need for it to connect to itself. Or is there such a need? In the debug log Icinga writes:

Not connecting to Endpoint 'icinga01.example.com' because that's us.

Based on this I assumed that the host/port for the local endpoint are not needed.

icinga02 (the secondary master) and all satellites have the api feature configured with:

  accept_commands = true
  accept_config = true
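
For completeness, the full api feature object is roughly this minimal sketch (certificate settings omitted, since they are handled automatically on recent versions):

object ApiListener "api" {
  accept_commands = true
  accept_config = true
}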

The Endpoint object is needed to establish the zone membership and, as such, the trust relationship between endpoints. The connection parameters (host/port) are secondary here.

Your current setup cannot work with the shown configuration, since the masters do not trust each other. Both masters need the same master zone definition, as sketched below.
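
A minimal sketch of the master zone membership that should be identical on both masters (addresses taken from your snippets; adjust as needed):

object Endpoint "icinga01.example.com"  {
  host = "10.0.0.1"
  port = 5667
}

object Endpoint "icinga02.example.com"  {
  host = "10.0.0.2"
  port = 5667
}

object Zone "master"  {
  endpoints = [ "icinga01.example.com", "icinga02.example.com" ]
}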

Cheers,
Michael

How do I know that the endpoint is not trusted? Can I check it from CLI?

When I was setting up the cluster, I wrote a note:

determines connection direction, when no host address is defined, master should be connecting to the satellite

I guess I found this in the documentation. Anyway, thanks for the help!

I’d suggest reading the documentation thoroughly, especially the first chapters explaining the main concepts and the ideas behind it. Specifically this part:

  • All nodes in a zone trust each other.

Your problem description is that the masters deploy the wrong configuration and that their sync state is wrong. The first thing I would check is why the secondary master doesn’t sync the configuration received from the first master.

I’m having long-term issues with the secondary master instance. From what I’m seeing, it appears that the Icinga 2 cluster got into a split-brain situation where each master is propagating a different zone configuration. According to the docs this setup should be supported, but to me it seems to bring more issues than benefits.

There you already mention the suspicion of a split-brain scenario and different zone syncs. I wonder why you kept it like that for so long. Typical setups just put the masters into one zone, define the endpoints, and configure that on each master endpoint. That’s just about it.

Cheers,
Michael

Thanks, I’ve read the documentation several times. It covers many sections that are irrelevant to my scenario.

I think the reason is that a distributed setup is much harder to debug than a single-host setup, and Icinga lacks tooling for the distributed case. Also, the logs are too verbose, which makes it hard to find the important information.

I’d really appreciate commands for checking the cluster state from the CLI. After I applied the changes to the endpoints, one satellite stopped sending updates to the master zone. After a daemon restart it magically started working, but I have no idea why.

Did you fix your configuration in the meantime and verify that the secondary master has the same objects synced as the first one?

(That’s IMHO the most important task, rather than nitpicking about docs and CLI commands.)


I’m sorry, I meant no offense. I think it might be fixed, but I can’t be sure. The previous configuration was eventually consistent, which is a state that is rather hard to debug. By tooling I meant printing a cluster matrix, showing connected endpoints, synchronization lag, etc. It shouldn’t be hard to implement, as most of the needed data seems to be available in the API.

Mh, actually it is not that easy, since endpoints don’t know much about indirectly connected endpoints. For config objects this may be known, since the masters may have them all for deployment reasons, but runtime metrics are not synced. There are ideas and efforts in that direction, but it is unlikely to happen anytime soon.

That being said, a CLI tool wouldn’t be very handy, except for rendering a tree-based view of the configured zones from e.g. object list, but without any actual stats.

Having two masters which only partially trust each other is a common problem, yet the easiest way to verify it is a comparison of objects and their counts via the REST API. Missing zones and endpoints on either one are a good indication that something is wrong; after that, typically hosts, services and runtime details such as downtimes. Both masters truly need to share exactly the same information. A comparison could look like the sketch below.
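
For example, counting objects per type on both masters (credentials and host names are placeholders; jq is only used for readability):

curl -k -s -u root:icinga 'https://icinga01.example.com:5665/v1/objects/endpoints' | jq '.results | length'
curl -k -s -u root:icinga 'https://icinga02.example.com:5665/v1/objects/endpoints' | jq '.results | length'

# repeat for zones, hosts, services, downtimes, ...
curl -k -s -u root:icinga 'https://icinga01.example.com:5665/v1/objects/hosts' | jq '.results | length'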

Base metrics on connected endpoints and replay log lag can be fetched via /v1/status, e.g. with a small script (see the sketch below). If you are more into the console, the debug console may provide additional insights. We use that for debugging late check results in HA environments, as can be seen in the troubleshooting docs.
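
A sketch for both approaches (credentials are placeholders):

# API status of the local instance, including zone connectivity and log lag
curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener' | jq .

# interactive debug console connected to the running instance
icinga2 console --connect 'https://root:icinga@localhost:5665/'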

Nevertheless, I had a stressful day yesterday, sorry for my harsh tone.

Cheers,
Michael

Ok, thanks for explanation.

I’m still having issues with satellites not relaying check results. When I enable the debuglog feature everything works; then I disable debuglog, restart the service, and nothing works again.

The REST API claims the satellite is connected to the master zone:

            "master": {
              "client_log_lag": 0,
              "connected": true,
              "endpoints": [
                "icinga02.example.com",
                "icinga01.example.com"
              ],
              "parent_zone": ""
            },

but no data is sent to the master zone. I have no idea how I’m supposed to debug such a state; there are no errors in the logs and no crash report. The only change I made was removing:

/etc/icinga2/features-enabled/debuglog.conf

and restarting the service. When I restart a master instance, do I have to restart all satellites as well?

Specifically, you can use the cluster-zone health checks as well (see the sketch below). The /v1/status endpoint also provides a list of connected_endpoints; both masters should be connected.
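
A cluster-zone check roughly follows this pattern (the assign rule and the convention that the zone name equals the host object name are assumptions, adjust them to your setup):

apply Service "cluster-health" {
  check_command = "cluster-zone"

  // assumes the satellite zone carries the same name as its host object
  vars.cluster_zone = host.name

  assign where host.vars.client_endpoint
}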

Did you fix your master zone problem, and if so, how?

Something like this or this.

It also helps to pick a specific host/service and troubleshoot the entire chain: scheduling the check, executing it, checking its result on the satellite, verifying that the parent object receives it, and so on. A sketch for inspecting one object via the API follows below.
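
For example, inspecting the last check result of one service on the master (object name and credentials are placeholders):

curl -k -s -u root:icinga \
  'https://localhost:5665/v1/objects/services/sat-host.example.com!ping4?attrs=last_check&attrs=last_check_result' | jq .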

Common errors are endpoints whose time is not in sync, such as a missing NTP service.

Cheers,
Michael

Yes, both masters are connected from all satellites. From the masters’ side, all satellites are connected as well.

I did specify the IP and port for all satellites as Carsten suggested. But still, a random satellite is not reporting results, which makes me unsure whether the configuration is correct.

All nodes have their time synchronized via an NTP service. I’m going through the troubleshooting guide, but I haven’t found any errors so far.