Checks executed twice and no recovery notifications are sent

d33niel · August 5, 2024, 10:47am

Hello,
I’m creating this topic for an already open issue. I know that the community helps in their free time, but perhaps not every user of this forum is also active on GitHub, so I hope to find some helpful debugging tips here.

github.com/Icinga/icinga2

Checks executed twice and no recovery notifications are sent

opened 12:29PM - 08 Feb 24 UTC

danpicpic

bug area/notifications

## Describe the bug

We have observed a couple of times in the last 3 weeks a weird behaviour where the checks are performed twice, the notifications sent twice (at least the Problem one), but at the same time we also saw that no Recovery notifications were ever sent.
Every time it happened in a small time frame (e.g. between 8am and 9am), on a different number of servers/services with no common pattern between them.

The `checker` and `notifications` features are enabled in HA on both masters. On both of them, from the icinga2.log (is it normal that they log the same? are they doing the same action in parallel?) I see the following lines, where a Problem notification is sent but not the Recovery one:

```
[2024-02-03 08:20:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-02-03 08:20:18 +0100] information/Notification: Sending 'Problem' notification 'hostxxx!servicexxxx!state-notification-to-service' for user 'dummy_user'
[2024-02-03 08:20:18 +0100] information/Notification: Completed sending 'Problem' notification 'hostxxx!servicexxxx!state-notification-to-service' for checkable 'hostxxx!servicexxxx' and user 'dummy_user' using command 'state-notification'.
[2024-02-03 08:20:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-02-03 08:43:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
[2024-02-03 08:43:18 +0100] information/Checkable: Checkable 'hostxxx!servicexxxx' has 1 notification(s). Checking filters for type 'Recovery', sends will be logged.
```

1. The first screenshot below is the one linked to the above log. All the messages regarding the _notification not sent_ are weird, as the Problem notification was sent anyway, but not the Recovery.
2. From the two screenshots we can see how every check/action is done twice or multiple times (soft state, hard state, ok, notifications)

## Screenshots

![icinga_ss1](https://github.com/Icinga/icinga2/assets/159417321/02a4bcb8-ed70-4ca9-b298-a26091a00c75)
![icinga_ss2](https://github.com/Icinga/icinga2/assets/159417321/cfe8930a-9f43-4230-8e68-2cf32b9fe2ab)

## Your Environment

Include as many relevant details about the environment you experienced the problem in

* Version used (`icinga2 --version`): r2.14.1-1
* Operating System and version: RHEL 9.2
* Enabled features (`icinga2 feature list`): api-users api checker command graphite ido-mysql mainlog notification
* Icinga Web 2 version and modules (System - About): 2.11.4
* Config validation (`icinga2 daemon -C`): OK
* If you run multiple Icinga 2 instances, the `zones.conf` file:

```
object Endpoint "master1" {
}

object Endpoint "master2" {
    host = "master2"
}

object Endpoint "satellite1" {
    host = "satellite1"
}

object Endpoint "satellite2" {
    host = "satellite2"
}

object Zone "director-global" {
    global = true
}

object Zone "global-templates" {
    global = true
}

object Zone "master" {
    endpoints = [ "master1", "master2", ]
}

object Zone "satellite" {
    endpoints = [ "satellite1", "satellite2", ]
    parent = "master"
}
```

## Additional context

- We have migrated our infrastructure from `SLES12.5` (Icinga 2.10.3) to `RHEL9` (Icinga 2.14.0) around 2 months ago
- We have also installed `jemalloc-5.2.1-2.el9.x86_64`
- At the beginning we only had test servers (of which ~1000 with active notifications) to validate the new Icinga2
- 3 weeks ago we started to monitor the remaining ~2000 Production servers and upgraded Icinga2 to v2.14.1

We have started to see the error in the last 3 weeks, but we don't know if it was introduced by the last minor update to 2.14.1, or if it was already present since the first migration, but as we had fewer and less important servers, it might have been ignored.

Just to recap some important info:

- Problem: as you can see from the Icinga Web screenshot, now and then some checks are executed twice. The notifications are also sent twice (with multiple entries in the history, even though it says that the notification was not sent). In contrast, the OK (recovery) notifications are never sent at all.
- We have 2 HA masters (features notifications & metrics).
- 3 satellite zones (feature checker).
- We use a notification command script to send the alerts to Alerta. Our script’s logs match what we see in the Icinga Web history.
- We’ve upgraded from SLES to RHEL and from Icinga 2.10.3 to 2.14.0, so a lot changed around the time the error appeared.
- We haven’t seen anyone else reporting this issue of double notifications. As we can’t reproduce the error and it is intermittent, we can’t leave debug logging enabled, as it produces too much output.
- We have no clue how to debug this better or where to get additional info (query? API?).
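One possible starting point for the "query? API?" question, offered as a rough sketch rather than a known fix: the Icinga 2 REST API can return a service object with its `last_check_result`, so the same service can be queried on both masters and the answers compared. The base URL, credentials and CA file below are placeholders, and the helper names are made up for illustration.

```python
import base64
import json
import ssl
import urllib.parse
import urllib.request


def build_service_request(base_url, host, service, user, password):
    """Build a GET request for one service, limited to a few attributes."""
    name = urllib.parse.quote(f"{host}!{service}", safe="!")
    query = urllib.parse.urlencode(
        [("attrs", "state"), ("attrs", "state_type"), ("attrs", "last_check_result")]
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{base_url}/v1/objects/services/{name}?{query}",
        headers={"Accept": "application/json", "Authorization": f"Basic {token}"},
    )


def fetch(request, ca_file):
    """Send the request, verifying TLS against the given CA certificate."""
    context = ssl.create_default_context(cafile=ca_file)
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        return json.load(response)


def summarize(payload):
    """Reduce an API response to the fields useful for comparing two masters."""
    rows = []
    for result in payload.get("results", []):
        attrs = result.get("attrs", {})
        check_result = attrs.get("last_check_result") or {}
        rows.append({
            "name": result.get("name"),
            "state": attrs.get("state"),
            "state_type": attrs.get("state_type"),
            "check_source": check_result.get("check_source"),
            "execution_end": check_result.get("execution_end"),
        })
    return rows
```

Running `summarize(fetch(...))` against `master1` and `master2` for an affected service, right after a duplicate shows up, would show whether both masters hold the same check result and which endpoint executed it. Whether that is enough to explain the missing Recovery notifications is an open question.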

d33niel · August 15, 2024, 6:39am

Update:
We have realised that it happens after an Icinga reload.
Apparently the secondary master takes the lead while the primary is “unresponsive” during the reload, but then the primary comes back thinking it is still in charge.
Both masters then receive the same info from the satellite, write it to the IDO DB and send the notifications in parallel.
Eventually a DB deadlock stops this, and the situation goes back to normal.
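If this explanation is right, the `icinga2.log` files of both masters should each contain a `Sending` line for the same notification at almost the same time. The following is a minimal sketch for checking that, assuming the log format quoted in the GitHub issue and that both masters log in the same timezone. The function names and the 5-second window are arbitrary choices for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches the "Sending ... notification" lines quoted in the GitHub issue.
SENDING_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) [+-]\d{4}\] "
    r"information/Notification: Sending '(?P<type>[^']+)' notification "
    r"'(?P<notification>[^']+)' for user '(?P<user>[^']+)'"
)


def parse_sends(lines):
    """Return (timestamp, type, notification, user) for every 'Sending' line."""
    events = []
    for line in lines:
        match = SENDING_RE.match(line.strip())
        if match:
            # The UTC offset is ignored: both logs are assumed to share it.
            ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
            events.append((ts, match["type"], match["notification"], match["user"]))
    return events


def sent_by_both(master1_lines, master2_lines, window=timedelta(seconds=5)):
    """List master1 sends that master2 also sent within `window`."""
    by_key = defaultdict(list)
    for ts, ntype, notification, user in parse_sends(master2_lines):
        by_key[(ntype, notification, user)].append(ts)

    duplicates = []
    for ts, ntype, notification, user in parse_sends(master1_lines):
        others = by_key.get((ntype, notification, user), [])
        if any(abs(ts - other) <= window for other in others):
            duplicates.append((ts, ntype, notification, user))
    return duplicates
```

Feeding it the lines of both log files (for example `open(path).readlines()` on each master's log) and comparing the resulting timestamps with the reload times would confirm or rule out the reload as the trigger.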