Debugging downtimes, acks, comments, etc... not propagated to "empty" nodes in master zone

We are setting up a three-node Icinga2 HA cluster (all nodes in the master zone, no other satellites). We are using the official packages on CentOS7, but we are currently staying on Icinga version 2.7.2 for a number of reasons outside of my control (we do plan to upgrade in the future, once we have gotten rid of ClassicUI and of the internal scripts that read status.dat and the like).
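For reference, the zone layout is the standard "all endpoints in one zone" setup, roughly like this (a sketch; the hostnames are placeholders):

// /etc/icinga2/zones.conf (identical on all three nodes)
object Endpoint "master1.example.com" { host = "master1.example.com" }
object Endpoint "master2.example.com" { host = "master2.example.com" }
object Endpoint "master3.example.com" { host = "master3.example.com" }

object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com", "master3.example.com" ]
}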

We write the configuration (mostly hosts, services, commands, notifications) on the “main” master (the first one) as static files under /etc/icinga2/zones.d/master (for a total of ~16000 objects) and issue a systemctl reload icinga2 on that node whenever we update the config files. All three nodes have accept_commands and accept_config set to true.
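A typical object definition in that tree looks like this (a trimmed sketch; the host name and address are made up, and the imported template comes from the stock sample configuration):

// /etc/icinga2/zones.d/master/hosts/somehost.conf
object Host "somehost.example.com" {
  import "generic-host"       // template from the sample configuration
  address = "192.0.2.10"
}

After editing, a systemctl reload icinga2 on the first master is enough for the cluster config sync to push the change to the other two nodes.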

Users enter “runtime” configuration like downtimes, acknowledgements and comments through IcingaWeb2 and/or ClassicUI, and we noticed that those objects sometimes get out of sync between one node and the others (as seen with icinga2 object list and/or the ClassicUI on the three nodes). I have not verified it personally, but I fear this actually impacts runtime behaviour (e.g. a node checking a host would notify about it being down while not considering it in downtime, even though the host is in downtime as seen from another node).
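However they are entered, such runtime objects can also be created directly against the REST API; scheduling a downtime looks roughly like this (a sketch: host/service names, credentials and timestamps are placeholders):

curl -s -k -u root:icinga -H 'Accept: application/json' \
  -X POST 'https://127.0.0.1:5665/v1/actions/schedule-downtime' \
  -d '{ "type": "Service", "filter": "host.name==\"MYHOSTNAME\" && service.name==\"My service\"", "author": "someuser", "comment": "maintenance", "start_time": 1563958800, "end_time": 1563962400, "fixed": true }'

Whatever the entry point, the resulting Downtime/Comment objects end up as files under /var/lib/icinga2/api/packages/_api/ and are supposed to be replicated to the other endpoints in the zone.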

Also, while trying to debug those “out of sync” nodes, we cleaned one of them up like this:

systemctl stop icinga2
rm -f /var/lib/icinga2/icinga2.state /var/lib/icinga2/modified-attributes.conf
find /var/cache/icinga2 -type f -delete
find /var/lib/icinga2/api -type f -delete
systemctl start icinga2

After that clean up we ended up in the following situation:

# on the "main master", which we never touched / restarted / cleaned:
icinga2 object list --type downtime | grep ^Object | wc -l
304
# on one of the other masters, cleaned up like above and left running for one hour
icinga2 object list --type downtime | grep ^Object | wc -l
0

Also using Icinga2 API:

# main master, never stopped / cleaned / restarted:
curl -s -H "Accept: application/json" -k -u root:icinga 'https://127.0.0.1:5665/v1/objects/downtimes' | jq '.results | length'
304
# secondary master, an hour after cleanup:
curl -s -H "Accept: application/json" -k -u root:icinga 'https://127.0.0.1:5665/v1/objects/downtimes' | jq '.results | length'
0

I have not actually tried it, but I strongly suspect the same would apply if we added a brand new node to the cluster (e.g. a fourth node in the master zone). In these tests the node we “cleaned up” ended up with zero downtimes after restarting; during normal operation, when we noticed discrepancies, the master nodes would simply report different (non-zero) counts.
What I call the “static configuration”, i.e. the hosts, services and commands defined under /etc/icinga2/zones.d/master on the first master, does get restored correctly on all nodes even after the “clean ups”.
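To spot such discrepancies quickly, something like this against each node’s API works (a sketch; hostnames and credentials are placeholders):

for node in master1.example.com master2.example.com master3.example.com; do
  echo -n "$node: "
  curl -s -k -u root:icinga -H 'Accept: application/json' \
    "https://$node:5665/v1/objects/downtimes" | jq '.results | length'
done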

Is this some kind of known behaviour? Could it be a configuration error on our side, or could it be a bug? (I searched through the GitHub issues and found some things related to the API, but none that exactly match what we are seeing.)

Any help in troubleshooting this would be really appreciated.

I’ll add that when cleaning up and restarting a node, the log shows the usual configuration sync, a reload, and then entries like these for all comments, downtimes, etc.:

[2019-07-24 13:01:45 +0200] critical/ApiListener: Could not create object 'MYHOSTNAME!My service!blabla-1551373977-0':
[2019-07-24 13:01:45 +0200] critical/ApiListener: Configuration file '/var/lib/icinga2/api/packages/_api//conf.d/comments/MYHOSTNAME!My service!blabla-1551373977-0.conf' already exists.

Which seems rather strange, since I had just wiped all files from /var/lib/icinga2/api before restarting the node.

The files do exist there (under /var/lib/icinga2/api/packages/_api/conf.d/[comments|downtimes]) after restarting the node, but querying that node via the API, the command line or ClassicUI does not show those comments and downtimes:

# find /var/lib/icinga2/api/packages/_api/conf.d/downtimes -type f -name '*.conf' | wc -l
304
# icinga2 object list --type downtime | wc -l
0

Hi,

  1. Three nodes in a zone are known to be buggy; I cannot recommend this scenario atm.
  2. 2.7.x is far too old to say whether you’ve hit other problems. Test that setup with 2.10.5 or today’s released 2.11 RC1 in a staging environment.
  3. object list relies on a cache that is only refreshed on config validation / startup; it is never as accurate as runtime queries against the REST API (see the sketch after this list).
  4. You’re manipulating /var/lib/icinga2 manually; deleting everything with find /var/lib/icinga2/api -type f -delete in particular is brave and insane at the same time. I’m not sure whether 2.7 is able to detect this and re-create its internal storage tree; 2.10.x does.
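To illustrate point 3 (a sketch, default credentials assumed): the cache that object list reads is only rebuilt by a validation run or a restart, whereas the REST API answers from the live runtime state.

# rebuild the object cache that 'icinga2 object list' reads
icinga2 daemon -C
icinga2 object list --type downtime | grep ^Object | wc -l

# runtime state, straight from the running daemon
curl -s -k -u root:icinga 'https://127.0.0.1:5665/v1/objects/downtimes' | jq '.results | length'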

Cheers,
Michael

Hi Michael, thanks for your input.

  1. we read about the three-node issue and we think we’ll go down to two for now. (Actually the problem applies to >= 3 nodes, correct? 4 or 5 nodes wouldn’t be any better than 3?)
  2. for that we are trying to reproduce the problem in a more testable scenario. We’ll start with docker compose, and if that doesn’t exhibit what we are seeing we’ll move to full virtual machines. Once we can reproduce this behaviour on our current version, we’ll also be able to test whether anything changes in newer releases.
  3. oh, I didn’t know that; I always thought object list would be as accurate as IcingaWeb2 or something like that. Good to know.
  4. I was basically wiping everything except the logs and the SSL certs; in my mind that is the same as adding a brand new node. It’s a last-resort testing measure right now, but I’d think it shouldn’t be a problem for Icinga - no more than adding a new “empty” node to an existing zone. Isn’t it more or less the same thing? Do we need to take into account any potential issues should a node fail, be lost, and be rebuilt from scratch from a fresh install and added to an existing zone?
  1. Yep
  2. docker compose is something I don’t use for Icinga clusters; I prefer VMs with systemd enabled. Easier to troubleshoot, especially with daemons.
  3. As said, I don’t know how 2.7 behaves here; it is out of support. I’ve fixed many bugs in this area over the past years, the latest ones coming with 2.11. I’ve seen customer setups where such an invasive rm -rf caused the cluster to loop; it took us days of debugging to find out that it was a cronjob someone had installed as a workaround. I can only recommend wiping things only when you understand what it may cause. Since /var/lib/<applicationname> is fully owned by the application, users shouldn’t manipulate it at all. Doing that with MySQL, for instance, isn’t a good idea either.

Just to clarify, do you expect any possibility of problems when adding a brand new node to a pre-existing zone? Because that’s what I’m trying to simulate when I rm -rf things around: my goal is to reproduce the exact state of a system where Icinga2 has been freshly installed and not yet started even once.

Thanks for your feedback :slight_smile:

In case you want to add a secondary master later on, there’s some manual copying required in addition to the state file. That’s described here.
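Roughly, and only as a sketch (verify against the docs for your version; hostnames and the service user are placeholders), it boils down to copying the relevant parts of /var/lib/icinga2 from an existing master while the new node’s daemon is stopped:

# on the new / rebuilt secondary master
systemctl stop icinga2
# state file, plus the runtime-created _api package where comments/downtimes live
scp master1.example.com:/var/lib/icinga2/icinga2.state /var/lib/icinga2/
rsync -a master1.example.com:/var/lib/icinga2/api/packages/_api/ /var/lib/icinga2/api/packages/_api/
chown -R icinga:icinga /var/lib/icinga2   # service user of the CentOS packages
systemctl start icinga2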