Notification rate limiting

10RUPTiV · August 28, 2021, 5:11pm

Hey guys…

Is there a way to rate limit notifications?

For example, let’s say we reboot a firewall for maintenance and forget to put all devices in downtime…

For around 200 services, with 3 “contacts” that each have 2 different notifications (email and SMS), that’s about 200 x 3 x 2 = 1200 outgoing notifications at the same time…

How do you handle this kind of stuff?

steaksauce · August 28, 2021, 8:10pm

Scheduled downtimes would be the ideal way, but there are other things that might help.

We use notifications to automatically create cases in Dynamics CRM (or update existing cases) and acknowledge the host/service with the case # in the comment field. As part of that and to reduce “noise” we set a notification delay of 5 minutes – if the service is still down after those 5 minutes, a notification is sent. If not, the notification is never sent.
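
A notification delay like the one described above could be sketched roughly as follows. This is only an illustration, not the exact rule used here: the template name "mail-service-notification" (from the Icinga 2 sample configuration), the user "oncall" and the custom variable `notify_delayed` are assumptions.

```
// Sketch only: template, user and custom variable names are assumptions.
apply Notification "delayed-mail" to Service {
  import "mail-service-notification"

  users = [ "oncall" ]

  // No re-notifications, one message per problem.
  interval = 0

  // Do not notify during the first 5 minutes of a problem.
  // If the service recovers within that window, nothing is sent.
  times = {
    begin = 5m
  }

  assign where service.vars.notify_delayed
}
```

The delay trades a few minutes of reaction time for less noise from short outages, so it may not suit checks that need an immediate page.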

Not sure that Icinga would have a way around this other than going to System > Monitoring Health > Turning off Notifications.

If you’re concerned about mail notifications, you might be able to set up something on the mail side to stop notifications if X amount are sent in an hour, but it’s been a long time since I’ve touched mail, and it will vary by platform.

leeclemens · August 28, 2021, 10:36pm

Perhaps the question is: how do you think Icinga2 should handle that? It seems that at least some devices should have been placed in Downtime (a people/procedural problem).

If you can see this alert-storm coming, you may have time to shut Icinga2’s notifications off. Otherwise, maybe you could pipe the emails through another server with rate limiting (postfix, etc) to buy you some more time based on quantity? But ultimately, it seems you should put the device(s) in Downtime if you do not want to receive notifications for them.

nexo1960 · August 29, 2021, 9:08am

@10RUPTiV
If one system is down, only that one should trigger a notification, so you should configure dependencies to avoid so many notifications.

@steaksauce
Relying on the user to set all downtimes correctly (including dependent ones) doesn’t work - at least from my experience.
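
As a rough sketch of what such dependencies could look like for the firewall case: the custom variable `parent_firewall` (holding the name of the firewall host a machine sits behind) is a made-up convention for this example, not something from the thread.

```
// Sketch only: host.vars.parent_firewall is a hypothetical custom variable
// containing the host name of the firewall in front of this host.

// Suppress host notifications while the firewall is down.
apply Dependency "behind-firewall" to Host {
  parent_host_name = host.vars.parent_firewall
  disable_notifications = true

  assign where host.vars.parent_firewall
}

// Suppress service notifications on those hosts as well.
apply Dependency "behind-firewall" to Service {
  parent_host_name = host.vars.parent_firewall
  disable_notifications = true

  assign where host.vars.parent_firewall
}
```

With something like this in place, a firewall reboot should produce notifications for the firewall itself rather than for everything behind it, provided the firewall check fails before the child checks reach a hard state.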

steaksauce · August 30, 2021, 2:53pm

Sounds like a workplace issue – gotta reflect pain onto users who forget to schedule downtimes;

no one is gonna like getting blasted during an overnight maintenance because someone forgot to follow process

nexo1960 · August 30, 2021, 2:57pm

I would be in favor of publicly tarring and feathering anyone who does not follow the rules - monitoring administrators are excluded of course.

leeclemens · August 30, 2021, 3:59pm

I fully support these responses

@nexo1960 brought up a good idea I had missed regarding using Dependencies to at least limit the number of alerts. (Assuming they fit the use case, here it seems like they would.)

steaksauce · August 30, 2021, 4:31pm

Hmmm, I should bring up tar the next time this happens (we are using notifications to automatically create or update tickets now).

2 things to note for anyone who joins the party:

1. Parenting is amazing if you have it set up correctly. Unfortunately, in my environment (working for an ISP), parenting doesn’t help since OSPF or BGP can pick a new route at any moment and destroy the parenting.
2. As of the time of this post, there is a known bug with scheduled downtimes in which they are not persistent across reloads/restarts of Icinga2 (we use Director to dynamically import hosts/services, with ~20 reloads a day):
   Downtimes not reapplied after a reload/restart of Icinga2 · Issue #8968 · Icinga/icinga2 · GitHub

Downtime bug aside, downtimes are the “built-in” way that Icinga would handle whether or not to send out notifications. You can mass disable notifications using the method in the above posts, or by disabling a notification rule (you might do these things in a large maintenance that may impact monitoring’s ability to monitor, or for something like a POP router that may affect multiple customers).

You can rate limit emails using Postfix or other mail servers, but it varies by service.
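
For maintenance windows that recur on a known schedule, a downtime can also be defined in the configuration instead of being set by hand. A minimal sketch, assuming a hypothetical custom variable `maintenance_window` and an example time range (this does not help with unplanned emergency work):

```
// Sketch only: the custom variable and the time range are assumptions.
apply ScheduledDowntime "sunday-night-maintenance" to Host {
  author = "icingaadmin"
  comment = "Recurring maintenance window"

  // Notifications are suppressed for matching hosts during this range.
  ranges = {
    sunday = "02:00-04:00"
  }

  assign where host.vars.maintenance_window == "sunday-night"
}
```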

Other notification channels you might use (Slack, Pager Duty, etc…) may require a different solution, or have no solution built in.

There are some things that you can do to limit the number of notifications: first-notification delays, the interval between notifications, and disabling re-notifications.

Tar and feathers seems like a good idea though.

10RUPTiV · August 30, 2021, 5:57pm

Our problem came up, I think, when we cut the internet for a moment or rebooted all switches, so our remote agent lost its connection to everything and started marking every status as “critical” locally. When the internet came back, it sent those results to the master, which sent out all the notifications!

So we are receiving all hosts/services notifications of down/critical and then “up/OK” at the same time.

I agree with “downtime”, but when you are doing an emergency maintenance, you forget or don’t have the time to apply a downtime for a bunch of hosts/services…

Pooh · August 30, 2021, 6:21pm

I totally support the point already made that dependencies between machines
(commonly used for routers and the networks behind them, but just as
applicable to switches and the machines plugged in to them) would solve the
majority of these problems.

Antony.

huky · September 2, 2021, 2:01am

Here’s an example of how I limit notifications. The first notification stops after 5m:


apply Notification "host-mail-sms" to Host {
  import "host-notification"

  if (host.vars.notification_interval) {
    interval = host.vars.notification_interval
  }

  if (host.vars.notification.pager.users) {
    users = host.vars.notification.pager.users
  } else if (host.vars.notification.pager.groups) {
    user_groups = host.vars.notification.pager.groups
  } else if (host.vars.type == "switch") {
    user_groups = [ "network" ]
  } else {
    #user_groups = [ "icingaadmins" ]
    users = ["none"]
  }

  #command = "sms-host-notification"
  command = "sms-notification"
  
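  // interval = 0 disables re-notifications; it also overrides any
  // interval taken from host.vars.notification_interval above.
  interval = 0

  // Only send this notification within the first 5 minutes of a problem.
  times = {
    end = 5m
  }

  assign where host.address
}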

apply Notification "sms-2st-level-host" to Host {
  import "host-notification"

  if (host.vars.notification.pager.users) {
    users = host.vars.notification.pager.users
  } else if (host.vars.notification.pager.groups) {
    user_groups = host.vars.notification.pager.groups
  } else if (host.vars.type == "switch") {
    user_groups = [ "network" ]
  } else {
    user_groups = [ "icingaadmins" ]
  }

  command = "sms-notification"
  #command = "sms-host-notification"

  interval = 0
  times = {
    begin = 5m
  }
  assign where host.address
}