We have fiber over which we send multiple channels of light. At each end of the fiber we have DWDM devices which have EDFAs (pumps that boost the light). Each channel of light is a host with multiple services (transmit and receive for each end), and each DWDM is a host with services such as the EDFA.
We have notifications when a channel goes down. However, if the EDFA fails, it takes down all the channels with it, and we are flooded with notifications.
We also have multiple types of notifications for different groups in our org: the master/pager and the individual channel maintainers.
When the EDFA fails, we want the master/pager to receive a notification for the EDFA being down, and the per-channel notifications to the master/pager to be suppressed. However, we still want each individual channel maintainer to get a notification about their own channel being down.
Additionally, if one or more channels fail independently of the EDFA, we want the master/pager and the maintainers of the affected channels to be notified.
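To make the two cases above concrete, here is a minimal Python sketch of the routing we are after. It is only a restatement of the requirement, not Icinga configuration; the function name `notification_targets` and the channel names are made up for illustration.

```python
from __future__ import annotations


def notification_targets(edfa_down: bool, channels_down: set[str]) -> dict[str, list[str]]:
    """Return which objects each group should be notified about.

    Hypothetical helper: it only encodes the policy described in the text.
    """
    down = sorted(channels_down)
    if edfa_down:
        # EDFA failure: the pager hears about the EDFA only,
        # the per-channel flood to the pager is suppressed.
        pager = ["EDFA"]
    else:
        # Independent channel failures: the pager hears about each channel.
        pager = list(down)
    # Channel maintainers always hear about their own channel.
    return {"pager": pager, "maintainers": down}
```

For example, with the EDFA down and channels `ch1` and `ch2` down, the pager list contains only `EDFA`, while the maintainers of `ch1` and `ch2` are both notified.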
I have been looking at cluster zones with masters and agents, dependencies, and health checks, but I am not certain which is the best tool to solve this problem.
We did accomplish something somewhat similar for the transmit and receive services on each channel, using a dummy check service and some || logic to return 0 or 2, so that if a device is unplugged (as opposed to a fiber cut), the master/pager isn’t notified, with some assign where and ignore where rules to work that out.
You can use dependencies to suppress some of the messages to the master/pager. Channel checks that report the failure faster than the EDFA check does will still be notified.
To avoid this race condition, you could attach event handlers to the channels that trigger an immediate check of the EDFA. This works if you allow at least one soft state for the channel checks, or if you set the timing of the soft and hard states so that the EDFA down state is guaranteed to be registered before the channels notify.
Also, in the event handler on the channel you could send a message to the individual channel maintainers when the channel changes to a hard problem state while the EDFA is down. To make it look like a “normal” notification, you can call the same notification command a regular notification would, but you need to populate the environment and/or arguments for the notification script from your event handler script.
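A rough sketch of the decision such an event handler would make, in Python. Everything concrete here is an assumption for illustration: the script path, the `--host`/`--state`/`--type`/`--to` flags and the function name are hypothetical, and you would pass the states in from Icinga runtime macros via the event command's arguments.

```python
from typing import List, Optional

# Hypothetical path and flags: substitute the script and arguments
# that your regular Notification objects already use.
NOTIFY_SCRIPT = "/usr/local/bin/notify-maintainer.sh"


def maintainer_notification_argv(
    channel_host: str,
    channel_state: str,       # "OK", "WARNING", "CRITICAL" or "UNKNOWN"
    channel_state_type: str,  # "SOFT" or "HARD"
    edfa_state: str,          # state of the EDFA service, same vocabulary
    maintainer_email: str,
) -> Optional[List[str]]:
    """Return the command line to run, or None if nothing should be sent."""
    # Only act once the channel is in a hard problem state.
    if channel_state == "OK" or channel_state_type != "HARD":
        return None
    # With the EDFA up, the dependency does not suppress anything,
    # so the regular notification already reaches the maintainer.
    if edfa_state == "OK":
        return None
    return [
        NOTIFY_SCRIPT,
        "--host", channel_host,
        "--state", channel_state,
        "--type", "PROBLEM",
        "--to", maintainer_email,
    ]
```

The event handler script would then hand the returned list to `subprocess.run` when it is not `None`. Keeping the decision in a pure function makes it easy to test without touching the mail or pager system.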
The only other way I see is to combine a dependency on the EDFA, to stop the flood from reaching the master/pager, with the business process module: model the channels there and set checks on the per-channel business process top nodes to notify the channel maintainers that their channel is down.
I would try to avoid the event handler and first try to craft the check_interval, retry_interval and max_check_attempts values carefully to avoid the race condition mentioned above, and use the business process module, as I already use it.
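To sanity-check the timing, a back-of-the-envelope calculation helps. The Python sketch below assumes a simplified model: a failure is first seen at most one check_interval after it happens, and the hard state follows after (max_check_attempts - 1) further checks spaced retry_interval apart. It ignores check execution time and scheduler jitter, so leave yourself a margin. The class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckTiming:
    check_interval: float    # seconds between regular checks
    retry_interval: float    # seconds between rechecks in a soft state
    max_check_attempts: int  # attempts before the state becomes hard

    def best_case_hard(self) -> float:
        """Failure happens right before a scheduled check."""
        return (self.max_check_attempts - 1) * self.retry_interval

    def worst_case_hard(self) -> float:
        """Failure happens right after a scheduled check."""
        return self.check_interval + self.best_case_hard()


def edfa_wins_race(edfa: CheckTiming, channel: CheckTiming, margin: float = 0.0) -> bool:
    """True if the EDFA is hard-down before any channel can go hard."""
    return edfa.worst_case_hard() + margin < channel.best_case_hard()


# Example values, purely illustrative.
edfa = CheckTiming(check_interval=30, retry_interval=10, max_check_attempts=2)
channel = CheckTiming(check_interval=60, retry_interval=30, max_check_attempts=3)
print(edfa_wins_race(edfa, channel))
```

With these example values the EDFA reaches a hard state after at most 40 seconds, while a channel needs at least 60 seconds, so the dependency is in place before the channels notify. Dropping the channel to two attempts would cut its best case to 30 seconds and reopen the race.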
I regard the event handler like goto in coding: very powerful and handy but only use it if you absolutely have to!