We’re running a multi-master setup of Icinga with IcingaDB. We noticed that when the active endpoint switches, for example during a configuration reload or an icinga2 service restart, some hosts with an active downtime trigger notifications even though they should not. What makes this more concerning is that it does not happen consistently for all hosts with downtime, but only for a few.
Does this sound familiar to anyone? And if so, is there a known workaround or fix?
Hi @Al2Klimov, what an honor to have you reply here!
I just witnessed it again today with a host that had a downtime scheduled until the third of July:
Downtime object for this host looks like this:
Triggered On:    2023-06-19 22:40:35
Scheduled Start: 2023-06-19 22:40:12
Actual Start:    2023-06-19 22:40:12
Scheduled End:   2023-07-03 17:40:12
Actual End:      2023-07-03 17:40:12
However, we still get a host down event:
Host ran into a problem

Plugin Output:
CRITICAL - Socket timeout

Event Info:
Sent On: 2023-06-28 09:15:57
Type:    Problem
State:   Down
Host:    redacted
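To rule out that only the notification side is confused, I also want to check whether the core itself still considers the host to be in downtime at that moment, roughly like this (ApiUser credentials and the host name are placeholders):

# query downtime_depth and state of the affected host via the REST API
curl -k -s -u root:password -H 'Accept: application/json' \
  'https://localhost:5665/v1/objects/hosts/redacted?attrs=downtime_depth&attrs=state'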
This happens right after Puppet triggers a reload of the Icinga2 service.
Jun 28 09:14:04 systemd[1]: Reloaded Icinga host/service/network monitoring system.
Jun 28 09:14:04 puppet-agent[8418]: (/Stage[main]/Icinga2::Service/Service[icinga2]) Triggered 'refresh' from 3 events
Jun 28 09:14:04 puppet-agent[8418]: Applied catalog in 91.83 seconds
I’ll try enabling the debug log and hopefully get some more interesting information out of it…
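Concretely, I plan to enable it on both masters roughly like this (default log path assumed):

# enable the debug log feature and reload the daemon
icinga2 feature enable debuglog
systemctl reload icinga2

# after the next Puppet-triggered reload, search for downtime/notification entries
grep -Ei 'downtime|notification' /var/log/icinga2/debug.log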
If you have any advice on how to troubleshoot this further, it would be greatly appreciated.