Icinga2 fails to send notification reliably

Hi, fellow Icinga users. I have an issue where my Icinga doesn’t send notifications reliably. The issue is that after 2 days or roughly 48 hours icinga stops sending notification.

As can be seen from the sql query below last notification has been sent on 2022-12-09. And the interval is set to 1d so we have clearly missed a few notifications here.

mysql> select hoststatus_id,last_notification,next_notification from icinga_hoststatus WHERE hoststatus_id = 95; ±--------------±--------------------±--------------------+
| hoststatus_id | last_notification | next_notification |
±--------------±--------------------±--------------------+
| 95 | 2022-12-09 16:36:48 | 2022-12-13 16:36:54 |
±--------------±--------------------±--------------------+
1 row in set (0.00 sec)

Service restart fixes the issue for few days again and some of the missed notifications are being sent afterwards.

######################configs#################
/* NOTIFICATION TEMPLATES */
template Notification “generic-notification-template” {
interval = 1d
states = [ Warning, Critical, Unknown ]
types = [ Problem, Acknowledgement, Recovery, Custom, FlappingStart,
FlappingEnd, DowntimeStart, DowntimeEnd, DowntimeRemoved ]
period = “24x7”
interval = 24h
}

template Notification “host-notification-template” {
import “generic-notification-template”

states = [ Up, Down ]
types = [ Problem, Recovery, Custom ]

}

template Notification “service-notification-template” {
import “generic-notification-template”

states = [ OK, Warning, Critical ]
types = [ Problem, Recovery ]

}

#########################################
apply Notification “sms-icingaadmin” to Service {
import “service-notification-template”
command = “sms-service-notification”
user_groups = service.vars.notification.sms.groups
assign where service.vars.notification.sms && service.vars.non_default_notification != true && typeof(service.vars.notification.sms) == Dictionary
}
apply Notification “sms-icingaadmin” to Host {
import “host-notification-template”
command = “sms-host-notification”
user_groups = host.vars.notification.sms.groups
assign where host.vars.notification.sms && typeof(host.vars.notification.sms) == Dictionary
}
########################################
/* SMS NOTIFICATION FOR HOST */
object NotificationCommand “sms-host-notification” {
import “plugin-notification-command”

command = [ SysconfDir + “/icinga2/scripts/sms-notification.sh” ]

arguments = {
“-P” = {
value = “$user.pager$”
skip_key = true
order = 0
}
“-M” = {
value = “$host.display_name$ is $host.state$”
skip_key = true
order = 1
}
}

}
/* SMS NOTIFICATION FOR SERVICE */
object NotificationCommand “sms-service-notification” {
import “plugin-notification-command”

command = [ SysconfDir + “/icinga2/scripts/sms-notification.sh” ]

arguments = {
“-P” = {
value = “$user.pager$”
skip_key = true
order = 0
}
“-M” = {
value = “$host.display_name$ \ $service.name$ is $service.state$”
skip_key = true
order = 1
}
}
}

This is what it looks like around the time when icinga should send a notification.

mysql> select host_id,check_source,last_notification,next_notification from icinga_hoststatus WHERE host_id = 14;
ERROR 1054 (42S22): Unknown column ‘host_id’ in ‘field list’
mysql> select hoststatus_id,last_notification,next_notification from icinga_hoststatus WHERE hoststatus_id = 14;
±--------------±--------------------±--------------------+
| hoststatus_id | last_notification | next_notification |
±--------------±--------------------±--------------------+
| 14 | 2022-12-09 14:43:33 | 2022-12-13 14:43:44 |
±--------------±--------------------±--------------------+
1 row in set (0.00 sec)

mysql> select now();
±--------------------+
| now() |
±--------------------+
| 2022-12-13 14:48:18 |
±--------------------+
1 row in set (0.00 sec)

mysql> select hoststatus_id,last_notification,next_notification from icinga_hoststatus WHERE hoststatus_id = 14;select now();
±--------------±--------------------±--------------------+
| hoststatus_id | last_notification | next_notification |
±--------------±--------------------±--------------------+
| 14 | 2022-12-09 14:43:33 | 2022-12-14 14:43:45 |
±--------------±--------------------±--------------------+
1 row in set (0.00 sec)

±--------------------+
| now() |
±--------------------+
| 2022-12-13 14:48:29 |
±--------------------+
1 row in set (0.00 sec)

My guess since I managed to check the db when it was supposed to send a notification is that something is running for 5 mins then fails and updates the next notification time.
So some specific process doing that? There was some clock skewing (+1 year and the back to present day) done by a colleague which might have messed up something on the system.
I’m restarting the service and setting debuglogs on so that I can observe the issue more.

I proceeded to investigate the issue further. But as mentioned after restart the problem dissapears for a few days. Below are some snippets of the debuglog. 14th of december notifications worked just fine and after that once again nothing.

[2022-12-14 16:37:03 +0200] notice/ApiListener: Relaying ‘event::NotificationSentUser’ message
[2022-12-14 16:37:03 +0200] notice/Process: Running command ‘/etc/icinga2/scripts/sms-notification.sh’ ‘+123123123123’ ‘HOST_NOT_RESPONDING is DOWN’: PID 2457875
[2022-12-14 16:37:03 +0200] information/Notification: Completed sending ‘Problem’ notification ‘HOST_NOT_RESPONDING!sms-icingaadmin’ for checkable ‘HOST_NOT_RESPONDING’ and user ‘jacky’ using command ‘sms-host-notification’.
[2022-12-14 16:37:03 +0200] notice/ApiListener: Relaying ‘event::NotificationSentUser’ message

[2022-12-15 16:37:06 +0200] notice/NotificationComponent: Attempting to send reminder notification ‘HOST_NOT_RESPONDING!sms-icingaadmin’.
[2022-12-15 16:37:06 +0200] notice/Notification: Attempting to send reminder notifications of type ‘Problem’ for notification object ‘HOST_NOT_RESPONDING!sms-icingaadmin’.
[2022-12-15 16:37:06 +0200] notice/Notification: Not sending reminder notifications for notification object ‘HOST_NOT_RESPONDING!sms-icingaadmin’: not in timeperiod ‘24x7’
[2022-12-15 16:37:06 +0200] notice/NotificationComponent: Attempting to send reminder notification ‘HOST_NOT_RESPONDING!sms-json-host’.

Now I believe I’ve spotted the issue here. This “debug/TimePeriod: Adding segment” happens on the service restart and it includes today and tomorrow.
Checking on the previous days debuglogs I noticed that this doesn’t happen after the reboot.
And if nothing is adding new time segments to the timeperiod 24x7 this makes sense that the notification stop coming and the error is that the time is “not in timeperiod ‘24x7’”

/var/log/icinga2/debug.log:[2022-12-16 15:23:33 +0200] debug/LegacyTimePeriod: Legacy timeperiod update returned 2 segments.
/var/log/icinga2/debug.log:[2022-12-16 15:23:33 +0200] debug/TimePeriod: Removing segment ‘Fri Dec 16 15:23:33 2022’ <-> ‘Sat Dec 17 15:23:33 2022’ from TimePeriod ‘24x7’
/var/log/icinga2/debug.log:[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment ‘Fri Dec 16 00:00:00 2022’ <-> ‘Sat Dec 17 00:00:00 2022’ to TimePeriod ‘24x7’
/var/log/icinga2/debug.log:[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment ‘Sat Dec 17 00:00:00 2022’ <-> ‘Sun Dec 18 00:00:00 2022’ to TimePeriod ‘24x7’

I will check the debuglogs again on monday or tuesday and post if I can find those “debug/TimePeriod: Adding segment” from the weekend. But since I didn’t find those messages from the last few days I don’t expect to find them from the weekend logs.

If someone has seen this kind of behaviour before help would be much appreciated. I would like to know if the clock skewing messed up something, since other installations in which no clock skewing was done have been just fine or whether this is a bug.

Icinga 2 version used: 2.13.2-1
OS: AlmaLinux 8.6

So yeah I was correct. No new time segments are being added to 24x7 timeperiod after service restart.

debug.log-20221214.gz:[2022-12-13 14:56:03 +0200] debug/TimePeriod: Adding segment 'Tue Dec 13 00:00:00 2022' <-> 'Wed Dec 14 00:00:00 2022' to TimePeriod '24x7'
debug.log-20221214.gz:[2022-12-13 14:56:03 +0200] debug/TimePeriod: Adding segment 'Wed Dec 14 00:00:00 2022' <-> 'Thu Dec 15 00:00:00 2022' to TimePeriod '24x7'
debug.log-20221214.gz:[2022-12-13 14:56:03 +0200] debug/TimePeriod: Adding segment 'Tue Dec 13 09:00:00 2022' <-> 'Tue Dec 13 17:00:00 2022' to TimePeriod '9to5'
debug.log-20221214.gz:[2022-12-13 14:56:03 +0200] debug/TimePeriod: Adding segment 'Wed Dec 14 09:00:00 2022' <-> 'Wed Dec 14 17:00:00 2022' to TimePeriod '9to5'
debug.log-20221217.gz:[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment 'Fri Dec 16 00:00:00 2022' <-> 'Sat Dec 17 00:00:00 2022' to TimePeriod '24x7'
debug.log-20221217.gz:[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment 'Sat Dec 17 00:00:00 2022' <-> 'Sun Dec 18 00:00:00 2022' to TimePeriod '24x7'
debug.log-20221217.gz:[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment 'Fri Dec 16 09:00:00 2022' <-> 'Fri Dec 16 17:00:00 2022' to TimePeriod '9to5'

Seems to be this part of the code.icinga2/timeperiod.cpp at ee705bb110e802f8cafd21bab2d8697b0a538b0a · Icinga/icinga2 · GitHub

We can mitigate the issue by restarting the service via cron every 2 days. But I really don’t see that as a proper fix. Perhaps @log1c or @moreamazingnick can help here as you seem to have a bit more knowledge about icinga.

Can these TimePeriod tasks be run from cli? Clearly this task should be run periodically by the service but for reason x it is not being run. What should activate this task that adds new time segments?

will get you one host status and the corresponding notification

try something like that to check if there are any new notifications:

SELECT * FROM `icinga_notifications` ORDER BY `icinga_notifications`.`start_time` DESC

there you set interval twice, and it means that you want a renotification after 24h?

Reloading configs also prompts the task to create new time segments. Therefore it can also mitigate the issue.

systemctl reload icinga2.service
grep -i "adding" /var/log/icinga2/debug.log
debug/TimePeriod: Adding segment 'Tue Dec 20 00:00:00 2022' <-> 'Wed Dec 21 00:00:00 2022' to TimePeriod '24x7'
debug/TimePeriod: Adding segment 'Wed Dec 21 00:00:00 2022' <-> 'Thu Dec 22 00:00:00 2022' to TimePeriod '24x7'
debug/TimePeriod: Adding segment 'Tue Dec 20 09:00:00 2022' <-> 'Tue Dec 20 17:00:00 2022' to TimePeriod '9to5'
debug/TimePeriod: Adding segment 'Wed Dec 21 09:00:00 2022' <-> 'Wed Dec 21 17:00:00 2022' to TimePeriod '9to5'
debug/TimePeriod: Adding segment 'Tue Dec 20 00:00:00 2022' <-> 'Wed Dec 21 00:00:00 2022' to TimePeriod '24x7'
debug/TimePeriod: Adding segment 'Wed Dec 21 00:00:00 2022' <-> 'Thu Dec 22 00:00:00 2022' to TimePeriod '24x7'
debug/TimePeriod: Adding segment 'Tue Dec 20 09:00:00 2022' <-> 'Tue Dec 20 17:00:00 2022' to TimePeriod '9to5'
debug/TimePeriod: Adding segment 'Wed Dec 21 09:00:00 2022' <-> 'Wed Dec 21 17:00:00 2022' to TimePeriod '9to5'

@moreamazingnick There are notifications. The issue is that there are no time segments which belong to the 24x7 timeperiod. And when there are no suitable time segments it just postpones the notications.

notice/Notification: Not sending reminder notifications for notification object ‘HOST_NOT_RESPONDING!sms-icingaadmin’: not in timeperiod ‘24x7’

Yes we want a renotification after 24 hours from the first. Even though interval has been set twice it doesn’t really break anything there. But thanks for noticing I’ll remove that extra interval setting.

So the correct problem would be:

Icinga fails to send renotification for a 24x7 check after 24h because the timeperoid might not exist anymore?

Yeah something like that.

I did not check the behaviour, if you think it’s an issue open a ticket on github.

but nevertheless I cant imagine a case where a renotification after 24h is any good. It must be very unimportant if there is no statechange within this interval, and the statechange can cause the next notification.

It’s like: something is broken, we fix or ignore, if we ignore we decide again after 24h?

So to make this issue clear since I think I was not clear enough.

Icinga fails to send any notification after 2 days from service restart or reload. Due to no new time segments are being added to 24x7 timeperiod.

These “TimePeriod: Adding segment” messages are absent from the debug logs after restart. These are captured at the restarts.
Restart at 2022-12-13 14:56:03

[2022-12-13 14:56:03 +0200] debug/TimePeriod: Adding segment 'Tue Dec 13 00:00:00 2022' <-> 'Wed Dec 14 00:00:00 2022' to TimePeriod '24x7'
[2022-12-13 14:56:03 +0200] debug/TimePeriod: Adding segment 'Wed Dec 14 00:00:00 2022' <-> 'Thu Dec 15 00:00:00 2022' to TimePeriod '24x7'

And at 2022-12-16 15:23:33.

[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment 'Fri Dec 16 00:00:00 2022' <-> 'Sat Dec 17 00:00:00 2022' to TimePeriod '24x7'
[2022-12-16 15:23:33 +0200] debug/TimePeriod: Adding segment 'Sat Dec 17 00:00:00 2022' <-> 'Sun Dec 18 00:00:00 2022' to TimePeriod '24x7'

No new time segments where added between these restarts resulting in no notification sent.

notice/Notification: Not sending reminder notifications for notification object ‘HOST_NOT_RESPONDING!sms-icingaadmin’: not in timeperiod ‘24x7

This means that any re-notification send after 2 days will not be sent.
Nor will any notification be sent from a new issue.

This 2 day notification period includes the ongoing day.
e.g.
Service is restarted at 20-12-2022 15:00. Notification will be sent until 22-12-2022 00:00. After that nothing will be sent.
e.g. 2
Service is restarted at 20-12-2022 23:52. Notification will be sent until 22-12-2022 00:00. After that nothing will be sent.
e.g. 3
Service is restarted at 21-12-2022 00:01. Notification will be sent until 23-12-2022 00:00. After that nothing will be sent.

Long story short. Icinga gets timeperiods via the OS and since there was clock skewing done on the OS it somehow messed up this process. Resulting in Icinga only adding timesegments after restart even though Icinga should periodically add new timesegments.

Reinstalling software did not help.
Creating a new vm fixed the issue.

Lesson to be learned don’t let colleagues do clock skewing.

1 Like