Duration in recovery notifications

MDKamijo · March 19, 2019, 5:51pm

I want to include the duration from the non-OK state to the OK state in my notification emails. I’ve tried a bunch of “last_state…” variables, but none of them are very useful. Do any of you have ideas on how to solve that?

What I want to do is to have a row in the emails saying:
Problem duration: 1h 2m 32s

…or something like that. I thought I could use service.duration_sec, but that doesn’t work: it holds the duration since the last state change, and when a recovery notification is sent out, that last state change is the one from non-OK to OK.
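For the formatting part of the question, here is a minimal Python sketch that turns a number of seconds into the "1h 2m 32s" style shown above. The function name `format_duration` is made up for illustration and is not part of Icinga; it assumes the notification script already has the problem duration in seconds from somewhere.

```python
def format_duration(seconds: int) -> str:
    """Render a duration in seconds as e.g. '1h 2m 32s'."""
    # Guard against negative values caused by clock skew or bad input.
    seconds = max(0, int(seconds))
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)

    parts = []
    if hours:
        parts.append(f"{hours}h")
    # Show minutes whenever hours are shown, so '1h 0m 5s' stays readable.
    if hours or minutes:
        parts.append(f"{minutes}m")
    parts.append(f"{secs}s")
    return " ".join(parts)


if __name__ == "__main__":
    print("Problem duration:", format_duration(3752))
```

With 3752 seconds as input this yields the row from the example, "Problem duration: 1h 2m 32s".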

aflatto · March 20, 2019, 7:19am

Hello and welcome.

AFAIK there is no direct way to do this, but maybe you can get that data by querying the database for the time of the issue’s first notification and then calculating the duration up to the recovery state change?
You will have to change the notification scripts, and it will slow them down (not to mention add load on the DB), but this is the one solution I can think of that will give you what you want.

dnsmichi · March 20, 2019, 7:27am

Hi,

Should it be the first not-OK timestamp for SOFT or HARD states? The latter is when the first problem notification is sent. That’s not implemented in the core, but I’d like to figure out whether this qualifies for a feature request (cc @bsheqa).

Cheers,
Michael

MDKamijo · March 20, 2019, 7:36am

Thanks for the answer! I have also thought about that. But for now I think I will handle this in a slightly different way. When the first notification is fired off for a service, the notification script will save the start time to a table in the DB. Then, when the recovery for the same service shows up, the start time will be fetched and the duration will be calculated. That’s the short version. It might not be the best of solutions, but it’s a pretty easy one at least.
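The approach described here could be sketched roughly as follows in Python. Everything in it is an assumption for illustration: the table name `problem_start`, the functions `open_store` and `handle_notification`, and the use of SQLite are not taken from the poster's actual script, and the notification type strings are assumed to arrive as "PROBLEM" and "RECOVERY".

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the hypothetical table holding problem start times."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS problem_start ("
        " host TEXT NOT NULL,"
        " service TEXT NOT NULL,"
        " started_at INTEGER NOT NULL,"
        " PRIMARY KEY (host, service))"
    )
    return conn


def handle_notification(conn, host, service, notification_type, now=None):
    """Return the problem duration in seconds on RECOVERY, otherwise None."""
    now = int(time.time()) if now is None else int(now)

    if notification_type == "PROBLEM":
        # INSERT OR IGNORE keeps the earliest start time when the same
        # problem is re-notified later.
        conn.execute(
            "INSERT OR IGNORE INTO problem_start VALUES (?, ?, ?)",
            (host, service, now),
        )
        conn.commit()
        return None

    if notification_type == "RECOVERY":
        row = conn.execute(
            "SELECT started_at FROM problem_start WHERE host = ? AND service = ?",
            (host, service),
        ).fetchone()
        # Remove the entry so the next problem starts a fresh measurement.
        conn.execute(
            "DELETE FROM problem_start WHERE host = ? AND service = ?",
            (host, service),
        )
        conn.commit()
        return None if row is None else now - row[0]

    # Other notification types (e.g. acknowledgements) are ignored here.
    return None
```

Note that this measures from the first notification, i.e. from the HARD state, not from the first SOFT not-OK check result.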

MDKamijo · March 20, 2019, 7:38am

Sounds like something that could be really good. If we could have that kind of timestamp things would be really nice and easy to handle.

dnsmichi · March 20, 2019, 7:46am

Please clarify first, which scenario exactly is needed, e.g. with a timeline picture.

MDKamijo · April 3, 2019, 11:06am

Sorry for the late reply…
I’m not really sure what you mean by “timeline picture”, but I’ll try to explain here.

1. Let’s say we have a bunch of services for which we send notifications to a time series database like InfluxDB. On every notification we send a measurement point to InfluxDB with a few tags like ‘problem_type’, ‘owner’, ‘environment’, etc.
2. A service becomes critical and we send something like the following to InfluxDB:
notifications,service=the_check,owner=myteam,zone=eu,site=london,host=the-host,notification_type=PROBLEM,state=CRITICAL duration=0 1554288881
3. A while later someone acknowledges the problem. Then we send a new measurement point to InfluxDB, almost the same as in 2., but this time we add the duration since the last state change. So the line might look like this:
notifications,service=the_check,owner=myteam,zone=eu,site=london,host=the-host,notification_type=ACKNOWLEDGEMENT,state=CRITICAL duration=637 1554289548
4. Later on, when the problem has recovered and a recovery notification is sent out, we send the total duration:
notifications,service=the_check,owner=myteam,zone=eu,site=london,host=the-host,notification_type=RECOVERY,state=OK duration=2943 1554292491

This is not only for this kind of notification; it could also be really great to include in notification emails, Slack messages or whatever you are using. Why not in the GUI as well, so that when you check the service/host details you can see how long the service/host has been in the current state?

The issue is that we need to calculate the “duration” on our own. For the moment we are using a database table that holds every notification. When we send a notification, we add a row with the receiver and the start time. Then, when the problem that the notification belonged to recovers, we use the start time saved in the DB table and calculate the duration. That’s the short and lazy description of how we do it.

If there was a variable containing the duration of the last problem it would make life a lot easier.