We are setting up a three-node Icinga2 HA cluster (all nodes in the master zone, no satellites). We are using the official packages on CentOS 7, but for now we are staying on Icinga version 2.7.2 for a number of reasons outside of my control (we do plan to upgrade once we have got rid of ClassicUI and other internal scripts that read status.dat and the like).
We write configuration (mostly hosts, services, commands, notifications) on the “main” master (the first one) as static files under /etc/icinga2/zones.d/master (about 16000 objects in total) and run systemctl reload icinga2 on that node whenever we update the config files. All three nodes have accept_commands and accept_config set to true.
Users enter “runtime” configuration such as downtimes, acknowledgements and comments through IcingaWeb2 and/or ClassicUI, and we noticed that these objects sometimes get out of sync between one node and the others (as seen using icinga2 object list and/or the ClassicUI on the three nodes). I have not verified it personally, but I fear this will affect runtime behaviour (e.g. a node checking a host would send a notification that it is down and not in downtime, even though the host is in downtime as seen from another node).
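One way to test that fear would be to ask every node what it believes about a single host. The following is only a minimal sketch: the node names are hypothetical placeholders, the credentials are the same ones used in the curl calls in this post, and it assumes that the downtime_depth host attribute (greater than 0 while the host is in a downtime) is a reasonable indicator. Host names containing special characters would need URL-encoding first.

```shell
#!/bin/sh
# Hypothetical node names; replace with the real master endpoints.
NODES="master1.example.com master2.example.com master3.example.com"

# Build the API URL for one host on one node, requesting only downtime_depth.
host_downtime_url() {
    node=$1
    host=$2
    printf 'https://%s:5665/v1/objects/hosts/%s?attrs=downtime_depth\n' "$node" "$host"
}

# Print the downtime_depth of one host as reported by each node,
# one "node<TAB>depth" line per node.
show_downtime_depth() {
    host=$1
    for node in $NODES; do
        depth=$(curl -s -k -u root:icinga -H 'Accept: application/json' \
            "$(host_downtime_url "$node" "$host")" |
            jq -r '.results[0].attrs.downtime_depth')
        printf '%s\t%s\n' "$node" "$depth"
    done
}

# Usage (hypothetical host name):
#   show_downtime_depth some-host.example.com
```

If the nodes report different values for the same host, the discrepancy is not just cosmetic.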
Also, while trying to debug those “out of sync” nodes, we cleaned up one of them like this:
systemctl stop icinga2
rm -f /var/lib/icinga2/icinga2.state /var/lib/icinga2/modified-attributes.conf
find /var/cache/icinga2 -type f -delete
find /var/lib/icinga2/api -type f -delete
systemctl start icinga2
After that clean up we ended up in the following situation:
# on the "main master", which we never touched / restarted / cleaned:
icinga2 object list --type downtime | grep ^Object | wc -l
304
# on one of the other masters, cleaned up like above and left running for one hour
icinga2 object list --type downtime | grep ^Object | wc -l
0
Also using Icinga2 API:
# main master, never stopped / cleaned / restarted:
curl -s -H "Accept: application/json" -k -u root:icinga 'https://127.0.0.1:5665/v1/objects/downtimes' | jq '.results | length'
304
# secondary master, an hour after cleanup:
curl -s -H "Accept: application/json" -k -u root:icinga 'https://127.0.0.1:5665/v1/objects/downtimes' | jq '.results | length'
0
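Counts alone do not show which objects are missing when two nodes both report a non-zero number. A rough sketch for comparing the actual downtime names between two nodes follows; the function names and host names are made up for illustration, and it assumes each entry in the API's results array carries a name field.

```shell
#!/bin/sh
# Fetch the names of all downtime objects from one node, one per line.
# $1: node address (hypothetical), credentials as in the curl calls above.
fetch_downtime_names() {
    curl -s -k -u root:icinga -H 'Accept: application/json' \
        "https://$1:5665/v1/objects/downtimes" | jq -r '.results[].name'
}

# Compare two files that each hold one object name per line and
# print the names that appear in only one of them.
diff_names() {
    a=$(mktemp)
    b=$(mktemp)
    sort -u "$1" > "$a"
    sort -u "$2" > "$b"
    comm -23 "$a" "$b" | sed 's/^/only in first: /'
    comm -13 "$a" "$b" | sed 's/^/only in second: /'
    rm -f "$a" "$b"
}

# Usage (hypothetical host names):
#   fetch_downtime_names master1.example.com > m1.txt
#   fetch_downtime_names master2.example.com > m2.txt
#   diff_names m1.txt m2.txt
```

No output from diff_names means both nodes hold the same set of downtime names.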
I have not actually tried it, but I strongly suspect the same would apply if we added a brand new node to the cluster (e.g. a fourth node in the master zone). In these tests we “cleaned up” a node and after restarting it we got the results above (zero downtimes), but when we noticed discrepancies during normal operations, the master nodes reported different values that were all greater than zero.
What I call “static configuration”, the hosts and services and commands defined under /etc/icinga2/zones.d/master on the first master, do get restored correctly on all the nodes even after the “clean ups”.
Is this some kind of known behaviour? Could it be a configuration error on our side, or could it be a bug? (I searched through the GitHub issues and found some related to the API, but none that exactly matches what we are seeing.)
Any help in troubleshooting this would be really appreciated.
