Continued from here.
I’ve been working with Icinga2 for a while now and was curious as to what everyone’s workflows are like.
I use the Dashing dashboard for a high-level view of everything going on. It shows the number of current host/service problems and a list of individual problems, as well as rotating NagVis maps of high-level views of my environment, embedded in an iFrame on the dashboard. If a problem pops up, the NagVis map alerts me visibly and audibly (I’ll also get an email or SMS), and I can click on the node to get to a more detailed NagVis map of the network device with its port mappings, health, and resource statuses (if applicable). Then, if I want an even more granular view, I can click on one of the health or resource nodes in NagVis to get to a Grafana graph of the performance history of that host/service. Through this process, I’m able to find where, when, and sometimes how the problem occurred.
I’m now wondering how you guys implement Icinga in your environments and whether there is any room for improvement in my setup. Perhaps I’ve also sparked some ideas for your Icinga setups!
My “integration” is rather simple. If Icinga2 sends me a mail, I go and check it. Outside of that, I usually check the web interface every few hours for APT check messages that do not get sent via mail.
This is because my Ansible AWX server uses the check information to decide which servers get updated; for this to happen, the checks have to be acknowledged.
My workflow for the services I run is as follows:
- Mail notification provides as little detail as possible and as much as necessary, so that I know why a service is critical
- URL to Icinga Web 2 and the detail view
- Detail view is enriched with additional metrics and metadata (logs from Elasticsearch if available)
- URL to Foreman where the machine’s state can be viewed if needed
- Navigate into Graphite/Grafana from the detail view’s graph if needed
- Acknowledge a problem, or schedule a post mortem downtime if I know that I’ll be doing immediate fixing and maintenance
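For the notification steps in the list above, here is a minimal sketch of how a notification script might build the link to the Icinga Web 2 detail view. The base URL, the function names, and the mail layout are assumptions for illustration, not the script actually used in this workflow; the `monitoring/host/show` and `monitoring/service/show` paths are those of the Icinga Web 2 monitoring module.

```python
from urllib.parse import urlencode

# Assumption: base URL of the Icinga Web 2 installation (hypothetical host name).
ICINGAWEB2_URL = "https://monitoring.example.com/icingaweb2"


def detail_url(host, service=None, base=ICINGAWEB2_URL):
    """Return the Icinga Web 2 detail-view URL for a host or one of its services."""
    if service is None:
        query = urlencode({"host": host})
        return f"{base}/monitoring/host/show?{query}"
    query = urlencode({"host": host, "service": service})
    return f"{base}/monitoring/service/show?{query}"


def mail_body(host, service, state, output):
    """Compose a short notification text: what broke, why, and where to look."""
    return "\n".join([
        f"{service} on {host} is {state}",
        f"Check output: {output}",
        f"Details: {detail_url(host, service)}",
    ])


print(mail_body("web1", "apt", "CRITICAL", "12 packages available for upgrade"))
```

Keeping the URL construction in one small function makes it easy to add further links (Foreman, Grafana) to the same mail later.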
In addition to that, I’d like to have that correlation thing with Grafana, Graphite, Icinga, Elasticsearch which Blerim explains in his Icinga Camp talk:
Long term ideas (personal and at work)
- Office dashboard for development services (our NMS and support team has that), which includes RT, Discourse, GitLab, Jenkins, etc.
- More widgets for Icinga Web 2 and drop Dashing
- Deeper ticket system integration
my Ansible AWX Server uses the Check information to decide which servers are getting updated, for this to happen the checks have to be acknowledged.
This is really cool! So you grab the check output from your Ansible server (from Icinga API?) and decide what to do based on whether or not the check is acknowledged? This sparks a lot of ideas for me.
Mail notification provides as little detail as possible and as much as necessary, so that I know why a service is critical
Cool, so you have different types of mail notifications based on what type of notification you’re getting? I just use the generic template. What are some more detailed things that you might include in specific mail notifications? I’ve thought of using the same premise that mikesch’s Grafana x Icingaweb2 plugin has in that each email notification will come with a link to a corresponding graph for the problem.
URL to Foreman where the machine’s state can be viewed if needed
I really should look into Foreman. It seems very powerful.
I like that “correlation thing” that you linked as well! I might do something like that too.
More widgets for Icinga Web 2 and drop Dashing
I like this, combining Dashing and Icinga Web 2 would be very nice.
Have you guys looked at NagVis before and what do you think of it? I find it pretty useful, especially in combination with iFrames in Dashing.
@KevinHonka dives into Ansible
I do exactly that. It works rather well with the Ansible URI module, and when you run the playbook with
strategy: free
all hosts can update independently.
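To make the idea concrete, here is a rough Python sketch of the decision logic; it is not the actual AWX playbook, which uses the Ansible URI module. It assumes the APT check is a service named "apt" and that the query body is sent as a POST to /v1/objects/services on the Icinga 2 API with the header X-HTTP-Method-Override: GET. The function names and the sample response are made up for illustration.

```python
# Assumption: the APT check is a service object named "apt"; adjust to your setup.
APT_SERVICE_NAME = "apt"


def build_query(service_name=APT_SERVICE_NAME):
    """Build the JSON body for a filtered query against /v1/objects/services."""
    return {
        # Only non-OK services of the given name that have been acknowledged.
        "filter": "service.name == svc_name && service.state != 0 "
                  "&& service.acknowledgement != 0",
        "filter_vars": {"svc_name": service_name},
        "attrs": ["host_name", "state", "acknowledgement"],
    }


def hosts_to_update(api_response):
    """Return the sorted host names whose check is non-OK and acknowledged."""
    hosts = set()
    for result in api_response.get("results", []):
        attrs = result["attrs"]
        # Re-check locally so the function is also safe on unfiltered responses.
        if attrs["state"] != 0 and attrs["acknowledgement"] != 0:
            hosts.add(attrs["host_name"])
    return sorted(hosts)


# Hypothetical response, shaped like the "results" list the API returns.
sample = {"results": [
    {"attrs": {"host_name": "web1", "state": 1.0, "acknowledgement": 1.0}},
    {"attrs": {"host_name": "db1", "state": 2.0, "acknowledgement": 2.0}},
    {"attrs": {"host_name": "mail1", "state": 1.0, "acknowledgement": 0.0}},
    {"attrs": {"host_name": "cache1", "state": 0.0, "acknowledgement": 0.0}},
]}

print(hosts_to_update(sample))
```

In this sketch the acknowledgement acts as a manual approval step: hosts with pending updates that nobody has acknowledged yet (mail1 in the sample) are left alone.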
@mfriedrich jumps in
There’s no generic solution for mail notifications (maybe the updated script in 2.7 solves that partially). I tend to look at the available object attributes and consider which information could be useful in the given context. In my previous job, for example, we included a link to graphs too. Right now, I don’t need that since it is embedded into the detail view in Icinga Web 2.
I’ve been using NagVis for a while, but personally I think it is really hard to create the “correct” maps to get the most out of it. I do know that there’s no alternative to it, and I see that many others keep using it. One thing which keeps me away from it is the lack of rpm/deb packages; I’ve found it rather hard to automate its installation with Puppet, for example. Still, that’s a personal experience and does not apply everywhere.
In terms of maps which work with location-based tags, I really like what @nicolaiB has been developing. Still, a more generic “maps” approach is greatly needed. One thing you’d need to keep in mind though: the current backends don’t provide all the details you would need for proper visualization of distributed monitoring environments (zones, endpoints, check source, dependencies, etc.). This is a long-term goal for once a new backend is designed, and hopefully the aforementioned widgets for Icinga Web 2 as well.
@unic shares his workflow
In addition to mail notifications, we use a special dashboard which is displayed on a separate screen. It only shows new entries. If someone acknowledges a problem, the host/service disappears from the dashboard. So if there are no open problems, the screen is plain white with no information at all, which makes it easy to see if something new is coming in. On the host object we have a direct link to a DokuWiki with host/service documentation, where we can find hints on how to solve a problem and more detailed information about the host.
The simple dashboard:
[Overdue.Late Service Check Results]
url = "monitoring/list/services?service_next_update<now"
title = "Late Service Check Results"
disabled = "1"

[openProblems]
title = "Open Problems"

[openProblems.Open Service Problems]
url = "monitoring/list/services?service_acknowledgement_type=0&service_problem=1&service_in_downtime=0&sort=service_state&limit=10"
title = "Open Service Problems"

[openProblems.Open Host Problems]
url = "monitoring/list/hosts?host_acknowledgement_type=0&host_in_downtime=0&host_problem=1&sort=host_severity&limit=10"
title = "Open Host Problems"

[openProblems.Open Host Problems]
title = "Open Host Problems"

[openProblems.Late Service Check Results]
url = "monitoring/list/services?service_next_update<now"
title = "Late Service Check Results"
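The “open problems” filters in this dashboard combine three conditions: the object is in a problem state, it is not acknowledged, and it is not in a downtime. The following Python sketch applies the same selection to plain dictionaries. The dictionary keys are hypothetical stand-ins for the filter columns in the URLs, and sorting the most severe state first is an assumption, since the URL only names the sort column.

```python
def is_open_problem(service):
    """True if the service is a problem nobody has handled yet."""
    return (
        service["problem"]
        and service["acknowledgement_type"] == 0
        and not service["in_downtime"]
    )


def open_problems(services, limit=10):
    """Return up to `limit` unhandled problems, highest state value first."""
    candidates = [s for s in services if is_open_problem(s)]
    candidates.sort(key=lambda s: s["state"], reverse=True)
    return candidates[:limit]


# Hypothetical sample data.
services = [
    {"name": "http", "state": 2, "problem": True, "acknowledgement_type": 0, "in_downtime": False},
    {"name": "apt", "state": 1, "problem": True, "acknowledgement_type": 1, "in_downtime": False},
    {"name": "disk", "state": 1, "problem": True, "acknowledgement_type": 0, "in_downtime": False},
    {"name": "load", "state": 2, "problem": True, "acknowledgement_type": 0, "in_downtime": True},
    {"name": "ping", "state": 0, "problem": False, "acknowledgement_type": 0, "in_downtime": False},
]

print([s["name"] for s in open_problems(services)])
```

An empty result corresponds to the plain white screen described above: everything is either fine, acknowledged, or in a scheduled downtime.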
That’s really cool! What do you use to include the link to your wiki?
Your dashboard could fit nicely as an iFrame in a Dashing dashboard perhaps.
@unic to the rescue:
I’m using the Notes URL for these links:
You can place multiple URLs in the field: