Help: Best Practices for Notifications and Downtimes?

Hello everyone,

I have concerns about two aspects of my current setup that I think could be improved.

  1. User Notifications: Right now, I set a Custom Variable for every host that accepts entries from a list of names. For each name in that list, I’ve created a user and apply rules for both hosts and services corresponding to these users. While this method works, it’s cumbersome to add new users. Does anyone have suggestions for a better configuration?
  2. Scheduled Operations: We run replications and backups from 10 pm to 6 am. During this period, there’s an increase in critical errors like high CPU usage. I’ve set notifications to only send between 6 am and 9 pm to avoid false alerts, but this means we won’t be notified of genuine issues occurring between 10 pm and 6 am. Any solutions for this?

I’d appreciate any advice. Thank you!

  1. Usergroups. I even use a Director import to manage the users' group memberships (see Active Directory/LDAP User & Group Imports via Director - #7 by rivad). This allows me to sync the recipients and the icingaweb2 roles by triggering the script and the Director import.
  2. You could duplicate the affected services and give the duplicates different thresholds that model normal operation during the replication and backup window, then run the duplicates exclusively between 10 pm and 6 am by using check periods. Don't use includes and excludes in your time periods, though, as this is buggy right now.
  1. Not sure we mean the same thing. I am talking about notifications, not about user accounts.
  2. This would mean I get twice the number of services, which is quite a lot.

Thanks for your suggestion. I hope you can explain the first one in more detail.

What @rivad means is that you put the users in usergroups. I would even go further and put the hosts and services in hostgroups and servicegroups.

Then you can apply the notification based on usergroups:

apply Notification "TEST" to Host {
    import "host-notification"

    assign where "Network Router" in host.groups
    user_groups = [ "one usergroup", "another usergroup" ]
    users = [ "one user", "another user" ]
}
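For completeness, here is a minimal sketch of how the groups referenced in that rule might be defined. The user name, e-mail address, and the custom variable host.vars.device_type are hypothetical, and "generic-user" is assumed to be the template shipped with the Icinga 2 sample configuration; adapt all of them to your setup.

```icinga2
// Usergroup that the notification rule refers to
object UserGroup "one usergroup" {
  display_name = "One usergroup"
}

// Hypothetical user; membership is declared on the user itself
object User "one user" {
  import "generic-user"            // assumed: template from the sample config
  email = "one.user@example.com"   // hypothetical address
  groups = [ "one usergroup" ]
}

// Hostgroup filled by an assign rule instead of listing hosts by hand
object HostGroup "Network Router" {
  display_name = "Network Router"
  assign where host.vars.device_type == "router"   // hypothetical custom variable
}
```

With this layout, adding a new recipient only means creating one User object with the right groups; the apply rule itself stays untouched.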

For your second problem:
I wouldn't restrict the notifications, but create a downtime for the CPU/memory services instead.
A downtime prevents Icinga from sending notifications. In the example I match on the check command, but you can also use host names, service names, or other vars to include or exclude the hosts/services that should be put in a downtime.

See the docs for more information:

Downtimes can be scheduled for planned server maintenance or any other targeted service outage you are aware of in advance.
https://icinga.com/docs/icinga-2/latest/doc/08-advanced-topics/#downtimes

apply ScheduledDowntime "TEST" to Service {
    author = "nicolas"
    comment = "service dt"
    fixed = true
    // Use || here: a service has exactly one check_command, so && could never match
    assign where service.check_command == "checkmem" || service.check_command == "checkcpu"
    ranges = {
        "monday"  = "22:00-24:00"
        "tuesday" = "00:00-06:00"
        # and so on...
    }
}

Duplicated services with adapted thresholds (that's what @rivad suggested) are only meant for the mem/CPU checks that are really important to you.
Example:
MEM service1 standard / Critical 80% / Warning 70%
You know that the mem usage hits 90% during sync processes.
MEM service2 info / Critical 98% / Warning 95%
But that is only for the situation where you really need to distinguish between these two monitoring events.
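A rough sketch of how the two MEM services above could be tied to check periods. The time period names, the service names, host.vars.runs_backups, and the threshold variables vars.mem_warning / vars.mem_critical are all hypothetical; which variables your check command actually reads depends on its CheckCommand definition. "generic-service" is assumed to be the template from the sample configuration, and older Icinga 2 versions may additionally need import "legacy-timeperiod" inside the TimePeriod objects.

```icinga2
// Window in which replication and backups run (22:00 to 06:00)
object TimePeriod "backup-window" {
  display_name = "Replication and backup window"
  ranges = {
    "monday"  = "00:00-06:00,22:00-24:00"
    "tuesday" = "00:00-06:00,22:00-24:00"
    // ... remaining weekdays follow the same pattern
  }
}

// Everything outside that window
object TimePeriod "outside-backup-window" {
  display_name = "Outside the backup window"
  ranges = {
    "monday"  = "06:00-22:00"
    "tuesday" = "06:00-22:00"
    // ... remaining weekdays follow the same pattern
  }
}

// Standard thresholds, only checked outside the backup window
apply Service "mem" {
  import "generic-service"
  check_command = "checkmem"
  check_period = "outside-backup-window"
  vars.mem_warning = 70     // hypothetical threshold variables
  vars.mem_critical = 80
  assign where host.vars.runs_backups
}

// Relaxed thresholds, only checked while backups are running
apply Service "mem-backup-window" {
  import "generic-service"
  check_command = "checkmem"
  check_period = "backup-window"
  vars.mem_warning = 95
  vars.mem_critical = 98
  assign where host.vars.runs_backups
}
```

This avoids includes and excludes in the time periods entirely, at the cost of maintaining two explicit period definitions.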


1.) We don't really have user groups. Our notifications are pretty fine-grained, so I assign the contacts to every host using a variable. So usergroups won't work, I think.

2.) That sounds good, and I will give feedback on it.

  1. I also use vars.teams to manage the user groups, but I very rarely add single users to a host or service.
  2. Yes, sparingly used: mostly CPU, RAM and disk I/O, and also some information transports where not much moves outside business hours and we don't have a heartbeat.
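The vars.teams approach could look roughly like this. It is only a sketch: the host name, address, and team names are made up, "generic-host" is assumed to be the sample-config template, and "host-notification" is the template from the earlier example. Each entry in vars.teams has to match the name of an existing UserGroup object.

```icinga2
// Hypothetical host that lists the responsible teams in a custom variable
object Host "db01" {
  import "generic-host"          // assumed: template from the sample config
  address = "192.0.2.10"
  vars.teams = [ "dba", "ops" ]  // must match existing UserGroup names
}

// One rule for all hosts: recipients come from the host's own variable
apply Notification "team-mail" to Host {
  import "host-notification"
  user_groups = host.vars.teams
  assign where host.vars.teams
}
```

This keeps the per-host granularity of a custom variable while the recipients themselves are managed as group memberships, so a new user never requires a new apply rule.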