For some unknown reason, some of the services (not all) on multiple hosts end up with notifications disabled at some point. I would like to track down why this is happening.
I couldn’t find anything odd in the configuration (no Director here). The Icinga 2 API is open to only a few hosts. How do I debug or track this down when it happens again?
When looking at problems like this, I start with where the original config lives; it sounds like it is on disk for you. Then I compare that with the output of icinga2 object list on the masters and, if needed, on any endpoints downstream.
The config on disk is what Icinga reads before loading; the object list is what Icinga holds after loading, so the two should match.
I find this is a good way to ensure that the config is deployed correctly to all endpoints in a cluster.
If you find that is all correct and that your on-disk config is changing, you could try turning it into a git repo with a cron-based auto-commit. Then at least you’ll know what changed and when.
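A minimal sketch of such an auto-commit, assuming the config directory has already been initialised with git init and that git is on the PATH. The function name, the run and now parameters (there only to make it testable) and the commit message format are my own choices, not anything Icinga provides.

```python
import subprocess
from datetime import datetime, timezone


def autocommit(repo_dir, run=subprocess.run, now=None):
    """Commit every change in repo_dir; return True if a commit was made."""
    # `git status --porcelain` prints nothing when the working tree is clean.
    status = run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        capture_output=True,
        text=True,
        check=True,
    )
    if not status.stdout.strip():
        return False

    stamp = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    run(["git", "-C", repo_dir, "add", "--all"], check=True)
    run(
        ["git", "-C", repo_dir, "commit", "--quiet", "-m", f"auto-commit {stamp}"],
        check=True,
    )
    return True
```

Calling autocommit("/etc/icinga2") from a small wrapper script that cron runs every few minutes would be enough. git log -p then shows what changed and roughly when.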
Running icinga2 object list for the corresponding services is a good idea. Be sure to run icinga2 daemon -C --dump-objects beforehand so that up-to-date versions of each object are available, as this command does not query live data (only the API does).
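Since only the API reflects live state, a sketch like the following could be run when the problem shows up again. It asks the API for services with notifications disabled and keeps those where enable_notifications also appears in original_attributes, which, as far as I know, is where Icinga 2 records the original value of an attribute changed at runtime. The function names are hypothetical, and the base URL and API user are placeholders for your own setup.

```python
import base64
import json
import ssl
import urllib.request
from urllib.parse import urlencode


def build_query_url(base_url):
    """URL listing services whose notifications are currently disabled."""
    params = {
        "filter": "service.enable_notifications==false",
        "attrs": ["__name", "enable_notifications", "original_attributes"],
    }
    return base_url.rstrip("/") + "/v1/objects/services?" + urlencode(params, doseq=True)


def runtime_disabled(response):
    """Names of services where enable_notifications was changed at runtime.

    Assumption: a runtime change leaves the old value in original_attributes.
    """
    names = []
    for result in response.get("results", []):
        attrs = result.get("attrs", {})
        original = attrs.get("original_attributes") or {}
        if attrs.get("enable_notifications") is False and "enable_notifications" in original:
            names.append(result.get("name"))
    return sorted(names)


def fetch(base_url, user, password, cafile=None):
    """Query the API with basic auth; cafile should point at the Icinga CA cert."""
    request = urllib.request.Request(
        build_query_url(base_url), headers={"Accept": "application/json"}
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", "Basic " + token)
    context = ssl.create_default_context(cafile=cafile)
    with urllib.request.urlopen(request, context=context) as reply:
        return json.load(reply)
```

Something like runtime_disabled(fetch("https://localhost:5665", "root", "secret", cafile="/var/lib/icinga2/certs/ca.crt")) would then list the services that were switched off after the config was loaded, which points at an API client or the web interface and not at the config files.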
I would also enable the debug log and take a look in there.
If you can’t force this problem and it “just happens”, the debug log can fill up your disk pretty quickly, depending on the size of your setup and the messages logged.
We have done the following for this case (to “always” be able to have a look at a debug log):
edited /etc/icinga2/features-available/debuglog.conf