Scheduled downtime with HA masters

sethlyons · February 24, 2020, 6:44pm

I am running Icinga 2.11.2 on CentOS 7 and recently noticed that notifications are still sent for some of my hosts, even though they have scheduled downtime that appears in Icinga Web 2. The problem is very much like the one described in "PROBLEM notification sent during recurring downtime".

While going through some troubleshooting steps, I noticed that all of the downtimes exist on the secondary master, but only some of them are in the _api package on the config master.
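One way to make that comparison repeatable is to diff the downtime names reported by each master. The following is only a rough sketch: it assumes you have saved the output of `icinga2 object list --type Downtime` from both nodes, and that each object is introduced by a header line of the form `Object '<name>' of type 'Downtime':`. The function names and the sample object names in the usage note are made up for illustration.

```python
import re

# Header line format assumed from typical `icinga2 object list` output,
# e.g.  Object 'web01!some-downtime' of type 'Downtime':
OBJECT_HEADER = re.compile(r"^Object '(?P<name>.+)' of type '(?P<type>[^']+)':\s*$")


def downtime_names(object_list_output):
    """Collect the names of all Downtime objects found in the CLI output."""
    names = set()
    for line in object_list_output.splitlines():
        match = OBJECT_HEADER.match(line)
        if match and match.group("type") == "Downtime":
            names.add(match.group("name"))
    return names


def missing_on_config_master(config_master_output, secondary_output):
    """Downtimes present on the secondary master but absent on the config master."""
    secondary = downtime_names(secondary_output)
    config_master = downtime_names(config_master_output)
    return sorted(secondary - config_master)
```

Feeding it the two saved outputs returns a sorted list of downtime names that only the secondary master knows about, which is the set that did not make it into the config master's _api package.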

What is the recommended way of addressing this?

a1mw · February 25, 2020, 10:09am

Hi,

I also have the same problem. My downtimes were created with Director and are shown
correctly within Icinga 2 (icinga2 object list -t downtime).

Meanwhile I’ve deleted the downtime definitions in Director, but they still exist in Icinga 2.
I don’t know if it is an Icinga 2 or a Director problem.

Cheers,
Manfred

sethlyons · February 25, 2020, 12:49pm

Just as a data point in case it helps, I don’t use Director.

sethlyons · March 1, 2020, 6:56am

Poking through my logs some more, I see that the downtime is created on the secondary master, but it fails to send it to the config master because the config master’s API does not accept config.

[2020-02-29 05:00:56 -0500] warning/ApiListener: Ignoring config update from '<secondary master>' (endpoint: '<secondary master>', zone: 'master') for object '<hostname>!01acb970-1ae7-48c8-93b7-f4ab976c9137' of type 'Downtime'. 'api' does not accept config.
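To see how often this happens and for which object types, the warning can be pulled apart with a regular expression. This is a minimal sketch whose pattern is modeled only on the single log line quoted above; other Icinga 2 versions may word the message differently, and the function names are made up for illustration.

```python
import re

# Pattern modeled on the one warning line quoted in this thread (assumption:
# the wording is stable across the log file being scanned).
IGNORED_UPDATE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] warning/ApiListener: "
    r"Ignoring config update from '(?P<sender>[^']*)' "
    r"\(endpoint: '(?P<endpoint>[^']*)', zone: '(?P<zone>[^']*)'\) "
    r"for object '(?P<object>[^']*)' of type '(?P<type>[^']*)'\. "
    r"'(?P<listener>[^']*)' does not accept config\.$"
)


def parse_ignored_update(line):
    """Return the fields of an 'Ignoring config update' warning, or None."""
    match = IGNORED_UPDATE.match(line.strip())
    return match.groupdict() if match else None


def count_ignored_by_type(lines):
    """Count rejected config updates per object type (e.g. 'Downtime')."""
    counts = {}
    for line in lines:
        fields = parse_ignored_update(line)
        if fields:
            counts[fields["type"]] = counts.get(fields["type"], 0) + 1
    return counts
```

Running `count_ignored_by_type` over the lines of the config master's log gives a per-type tally of rejected updates, which makes it easy to confirm that downtimes (and possibly comments) are the objects being dropped.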

I thought that maybe it was related to connection direction (I have HA master with two zones, each with two satellites). Prior to tonight my connection direction wasn’t 100% correct, but even after updating it based on the docs, I still see the same behavior.

I attempted to remove /var/lib/icinga2/api/zones on the secondary master, but there was no change. In order to get the recurring downtimes on the config master, I stopped the icinga2 service on the secondary master. The config master then generated the downtimes (yay!), but once I restarted the icinga2 service on the secondary master, it still tried to update the config master’s config.

sethlyons · March 1, 2020, 7:18am

I also tried copying /var/lib/icinga2/api/packages/_api and /var/lib/icinga2/icinga2.state from the config master to the secondary master, but the secondary is still trying to update the config on the master.

Icingaweb2 also shows the active endpoint as the secondary master instead of the config master. I think that is related, but I haven’t yet figured out how to address it.

dnsmichi · March 1, 2020, 10:43am

The first master needs to accept config, because Downtimes and Comments are synced as config objects. Copying only the state file won’t work.
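For anyone wanting to verify this on their own masters, here is a small sketch that checks whether an ApiListener definition turns on accept_config. The sample config text, the function name, and the simplified comment handling are assumptions for illustration; on a typical install the real definition lives in the api feature file (api.conf), and an unset attribute is treated as false here because the setting is off by default.

```python
import re

# Illustrative ApiListener definition with config acceptance enabled.
SAMPLE_API_CONF = '''
object ApiListener "api" {
  accept_config = true
  accept_commands = true
}
'''


def accepts_config(conf_text):
    """Return True if the first accept_config assignment found is 'true'.

    Only // and # line comments are stripped; /* ... */ block comments
    are not handled in this sketch.
    """
    for raw_line in conf_text.splitlines():
        line = re.split(r"//|#", raw_line, maxsplit=1)[0]
        match = re.search(r"\baccept_config\s*=\s*(true|false)\b", line)
        if match:
            return match.group(1) == "true"
    # Attribute not set: treated as disabled.
    return False
```

Reading the api feature file on each master and passing its text to `accepts_config` shows quickly whether the first master would reject the downtime objects its peer tries to sync.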

sethlyons · March 1, 2020, 1:55pm

Oh wow…I could have sworn that I read somewhere that the config master should have accept_config = false, but of course I can’t find that now :]

A few follow ups:
What is an example of a scenario where one would set accept_config = false?

Since I copied /var/lib/icinga2/api/packages/_api from my config master to secondary master, the stages have the same name. Is that the expected behavior and there was a problem initially, or should I restore the old directory on the secondary master so that the stages have different names again?

I’m also happy to post in the Icingaweb2 section if it’s not appropriate for here, but is the reason Icingaweb2 shows the secondary master as the active member because of this, specifically:

During startup Icinga 2 calculates whether the feature configuration object is authoritative on this node or not. The order is an alpha-numeric comparison, e.g. if you have master1 and master2, Icinga 2 will enable the DB IDO feature on master2 by default.

dnsmichi · March 1, 2020, 4:35pm

There was a long-standing loop bug for which this was suggested as a workaround. The bug was fixed in 2.11, and as such the docs have been updated and the workaround removed.

The rationale is security, same as for accept_commands. By default, no one should be trusted, even when the endpoint/zone relationship is established. A more concrete scenario would be an agent that shouldn’t receive synced check commands or the like. Users don’t rely on that much, but enabling these flags by default would cause trouble, since having them disabled has been the default behavior since 2.0 and everyone knows about it.

If the entire package was copied, this should be fine. Stage names are auto-generated UUIDs; older versions had to use a custom schema because el5 did not support UUIDs. Since 2.7 or so, these names are random strings that are persisted on disk.

Yes, exactly. One side decides which feature config object keeps running; the other one is paused.
There’s no harm in e.g. sending commands to the first master, since all actions are replicated. That is why the first master needs accept_config at least. accept_commands is only needed on the target of command_endpoint checks.

Cheers,
Michael

sethlyons · March 1, 2020, 6:03pm

Awesome…thank you so much! This was incredibly helpful! I really appreciate it.