Notifications of hosts with downtime during master failover

Erik · June 6, 2023, 11:23am

Hi,

We’re running a multi-master setup of Icinga with IcingaDB. We noticed that when the active endpoint switches, for example during a configuration reload or an icinga2 service restart, some hosts with an active downtime trigger notifications even though they should not. Somewhat concerning is that this does not happen consistently for all hosts in downtime, only for a few.

Does this sound familiar to anyone? And if so, is there a workaround or fix?

Version used: r2.13.6-1
Operating System and version: CentOS 7.9
Enabled features (icinga2 feature list)
$ sudo icinga2 feature list
Disabled features: compatlog debuglog elasticsearch gelf graphite ido-mysql influxdb influxdb2 opentsdb perfdata statusdata syslog
Enabled features: api checker command icingadb livestatus mainlog notification
Icinga Web 2 version and modules (System - About): 2.11.4 + IcingaDB 1.0.2

Al2Klimov · June 26, 2023, 2:09pm

Hello Erik!

How far away is the downtime end when such a notification is sent?

Best,
A/K

Erik · June 28, 2023, 8:18am

Hi @Al2Klimov, what an honor to have you reply here.

I just witnessed it again today with a host that has a downtime until the third of July.

The downtime object for this host looks like this:

Triggered On: 2023-06-19 22:40:35
Scheduled Start: 2023-06-19 22:40:12
Actual Start: 2023-06-19 22:40:12
Scheduled End: 2023-07-03 17:40:12
Actual End: 2023-07-03 17:40:12

However, we still get a host down event:

Host ran into a problem

Plugin Output

CRITICAL - Socket timeout

Event Info

Sent On: 2023-06-28 09:15:57
Type: Problem
State: Down
Host: redacted
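
To put a number on how far away the downtime end was, here is a minimal Python sketch that checks whether the "Sent On" time falls inside the downtime window and how much of the downtime was left. It only uses the timestamps quoted above and assumes they are all in the same time zone; the helper name `in_downtime` is made up for this illustration and is not part of Icinga.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def in_downtime(sent_on, start, end):
    """Return True if sent_on lies within the window [start, end]."""
    return start <= sent_on <= end


# Values copied from the downtime object and the notification event.
# Assumption: all three timestamps share one time zone.
actual_start = datetime.strptime("2023-06-19 22:40:12", FMT)
actual_end = datetime.strptime("2023-07-03 17:40:12", FMT)
sent_on = datetime.strptime("2023-06-28 09:15:57", FMT)

inside = in_downtime(sent_on, actual_start, actual_end)
remaining = actual_end - sent_on  # time left until the downtime ends

print("sent during downtime:", inside)
print("time until downtime end:", remaining)
```

So the notification was sent well inside the window, a bit more than five days before the downtime end, and not near either edge of it.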

This happens right after Puppet triggers a reload of the Icinga2 service.

Jun 28 09:14:04 systemd[1]: Reloaded Icinga host/service/network monitoring system.
Jun 28 09:14:04 puppet-agent[8418]: (/Stage[main]/Icinga2::Service/Service[icinga2]) Triggered 'refresh' from 3 events
Jun 28 09:14:04 puppet-agent[8418]: Applied catalog in 91.83 seconds
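
For reference, a small Python sketch of the gap between the reload and the notification, which may help when lining this up with the debug log later. The syslog lines carry no year, so 2023 is assumed from the date of this post, and the parsing assumes an English/C locale for the month abbreviation; `parse_syslog` is a hypothetical helper written for this example.

```python
from datetime import datetime

ASSUMED_YEAR = 2023  # assumption: syslog timestamps omit the year


def parse_syslog(ts, year=ASSUMED_YEAR):
    """Parse a syslog-style timestamp such as 'Jun 28 09:14:04'."""
    return datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")


# Reload time from the systemd line, notification time from "Sent On".
reload_at = parse_syslog("Jun 28 09:14:04")
notified_at = datetime.strptime("2023-06-28 09:15:57", "%Y-%m-%d %H:%M:%S")

gap = notified_at - reload_at
print("seconds between reload and notification:", int(gap.total_seconds()))
```

That puts the notification a little under two minutes after the reload, assuming both clocks agree.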

I’ll try enabling the debug log and hopefully get some more interesting information out of there…

If you have any advice on how to troubleshoot this further, it would be greatly appreciated.

Thanks.