Hello,
I’m creating this topic for an issue that is already open on GitHub. I know that the community helps in its free time, but perhaps not every user of this forum is also active on GitHub, so I’m hoping to find some helpful debugging tips here.
Just to recap some important info:
Problem: as you can see from the icingaweb screenshot, now and then some checks are executed twice. The notifications are sent twice as well (with multiple entries in the history, even though it says that the notification was not sent). The OK states, by contrast, are never sent at all.
we have 2 HA masters (features notifications & metrics)
3 satellite zones (features checker)
we use a notification command script to send the alerts to Alerta. Our script’s logs reflect what we also see in the icingaweb history
we’ve upgraded from SLES to RHEL and from icinga 2.10.3 to 2.14.0, so a lot has changed since the error appeared.
We haven’t seen anyone else reporting this issue with double notifications. As we can’t reproduce the error and it doesn’t happen constantly, we can’t leave the debug log enabled, as it simply produces too much output.
We have no clue how to debug this better or where to find additional information (a DB query? the API?)
Update:
We have realised that it happens after an icinga reload.
Apparently the secondary master takes the lead while the primary is “unresponsive” during the reload, but then the primary comes back thinking it is still in charge.
The 2 masters will receive the same info from the satellite, write it to the IDO DB and send the notifications in parallel.
Eventually a DB deadlock stops this after a while and the situation goes back to normal.
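To check this theory against the notification script’s logs, something like the following rough sketch might help. It assumes the log lines from both masters have already been parsed into records with a timestamp, the master that sent the alert, and the host, service and state. The field names, the sample data and the 60-second window are all made up for illustration and are not anything Icinga or Alerta provides.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicate_notifications(records, window_seconds=60):
    """Return (earlier, later) pairs of notifications for the same
    host/service/state that were sent by different masters within
    window_seconds of each other.

    Each record is a dict with the hypothetical keys:
    "time" (datetime), "master", "host", "service", "state".
    """
    window = timedelta(seconds=window_seconds)

    # Group the notifications by the object and state they refer to.
    by_key = defaultdict(list)
    for rec in records:
        by_key[(rec["host"], rec["service"], rec["state"])].append(rec)

    duplicates = []
    for recs in by_key.values():
        recs.sort(key=lambda r: r["time"])
        # Compare each notification with the next one for the same key.
        for earlier, later in zip(recs, recs[1:]):
            different_master = earlier["master"] != later["master"]
            close_in_time = later["time"] - earlier["time"] <= window
            if different_master and close_in_time:
                duplicates.append((earlier, later))
    return duplicates


if __name__ == "__main__":
    # Made-up sample data, only to show the expected record shape.
    sample = [
        {"time": datetime(2023, 9, 1, 10, 0, 0), "master": "master1",
         "host": "web01", "service": "http", "state": "CRITICAL"},
        {"time": datetime(2023, 9, 1, 10, 0, 2), "master": "master2",
         "host": "web01", "service": "http", "state": "CRITICAL"},
        {"time": datetime(2023, 9, 1, 10, 30, 0), "master": "master1",
         "host": "db01", "service": "disk", "state": "WARNING"},
    ]
    for earlier, later in find_duplicate_notifications(sample):
        print(earlier["host"], earlier["service"], earlier["state"],
              earlier["master"], later["master"])
```

If the pairs it reports cluster right after the reload times, that would support the idea that both masters consider themselves responsible for notifications for a short period.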