Delay in CRITICAL alert — triggered 1 hour later from peer master (Icinga 2.24.3)

We’re running an Icinga 2.24.3 HA cluster (two masters, same zone).
Recently a host’s ping check went CRITICAL, but the alert fired 1 hour later — and from the other master.

Cluster connectivity looks fine and both masters share the same config/database.

We’d like to understand:

  • Why a master detecting a CRITICAL state wouldn’t send the alert immediately
  • Why the peer triggered it later
  • What conditions (e.g. cluster lag, event queue delay, HA ownership) could cause this
  • Which metrics (like slave_lag) we should monitor to prevent such delays

Any hints on how to trace notification ownership or cluster sync timing would be appreciated.

Thanks!

Welcome to the Icinga community and thanks for posting.

First, I assume the 2.24.3 in your post is a typo and you are actually running Icinga 2.14.3, right? If so, I would recommend upgrading, since 2.14.4 already addressed certain HA cluster issues, and 2.14.5 and 2.14.6 fixed further stability and security issues. There is also 2.15.0, with 2.15.1 around the corner.
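
If your masters run Debian/Ubuntu with the official Icinga package repository, the upgrade itself is just a package update; a minimal sketch (adjust the package manager for your distribution, and upgrade the masters one at a time so the zone stays available):

```
# Pull the latest icinga2 package from the configured repository and restart
# the daemon. Assumes a Debian/Ubuntu master using the official Icinga APT
# repo; use the yum/dnf/zypper equivalents on other distributions.
apt update
apt install --only-upgrade icinga2
systemctl restart icinga2
```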

To understand the actual issue, please supply the icinga2.log of both Icinga 2 master nodes around the time of the incident, starting from the first state change of the ping check up to the notification. If enabled, the Icinga 2 debug log would also prove helpful; see the sketch below for switching it on.
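
In case the debug log is not enabled yet, it can be turned on per node. The paths below assume a standard package install, and "example-host" is a placeholder for your actual host object:

```
# Enable the debug log (writes to /var/log/icinga2/debug.log on package
# installs) and restart to apply; run this on both masters.
icinga2 feature enable debuglog
systemctl restart icinga2

# Afterwards, pull the relevant lines around the incident from both nodes.
# "example-host" is a placeholder for the host of the ping check.
grep -i 'example-host' /var/log/icinga2/icinga2.log
grep -iE 'notification|example-host' /var/log/icinga2/debug.log
```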

Do you face other cluster issues as well? Please refer to the cluster troubleshooting section, which also mentions the slave_lag performance metric. In addition, comparing last_messages_sent and last_messages_received (both Unix timestamps) can reveal communication issues between the nodes.
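
To trace notification ownership directly: in an HA zone, each checkable carries a paused runtime attribute, and the master where it is true has handed the object (checks and notifications) over to its peer. A minimal sketch via the REST API, assuming an API user root/icinga and a service example-host!ping4 (both placeholders for your setup):

```
# Run against each master: "paused": true means the peer endpoint currently
# owns this object in the HA zone, so its notifications originate there.
# "root:icinga" and "example-host!ping4" are placeholders for your setup.
curl -k -s -u root:icinga \
  'https://localhost:5665/v1/objects/services/example-host!ping4?attrs=paused' \
  | python3 -m json.tool
```

The slave_lag, last_messages_sent and last_messages_received values are exposed as performance data of the built-in cluster-zone check, so scheduling that check against your master zone gives you a continuous history to alert on.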