What's your Icinga workflow?

Continued from here.

@watermelon asks:

Hey all,

I’ve been working with Icinga2 for a while now and was curious as to what everyone’s workflows are like.

For me, I use the Dashing dashboard for a high-level view of everything going on, which includes the number of current host/service problems and a list of individual problems, as well as rotating NagVis maps of high-level views of my environment integrated in an iFrame on the dashboard. If a problem pops up, the NagVis map will visibly and audibly alert me (I’ll also get an email or SMS), and I can then click on the node to bring me to a more detailed NagVis map of the network device with its port mappings, health, and resource statuses (if applicable). Then, if I want an even more granular view, I can click on one of the health or resource nodes in NagVis to bring me to a Grafana graph of the performance history of that host/service. Through this process, I’m able to find where, when, and sometimes how the problem occurred.

I’m now wondering how you guys implement Icinga in your environments and whether there is any room for improvement in my setup. Plus, perhaps I’ve sparked some ideas for your Icinga setups as well!

Thanks!

@KevinHonka replies:

My “integration” is rather simple. If Icinga2 sends me a mail, I will go and check it. Outside of that, I usually check the web interface every few hours for APT check messages that do not get sent via mail.
This is because my Ansible AWX server uses the check information to decide which servers get updated; for this to happen, the checks have to be acknowledged.
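
In case it helps, here is a rough sketch of that lookup against the Icinga2 REST API (the endpoint, credentials, and the “apt” check filter are placeholders, not my actual values):

#!/usr/bin/env python3
"""Sketch: list hosts whose APT check is acknowledged, so an update job
may run against them. Host, credentials, and filter are placeholders."""
import requests

ICINGA_API = "https://icinga2-master.example.com:5665"   # placeholder endpoint
AUTH = ("api-user", "api-password")                      # placeholder API user

def hosts_with_acknowledged_apt_check():
    # Query all services whose check command is "apt" via the objects endpoint.
    resp = requests.get(
        f"{ICINGA_API}/v1/objects/services",
        auth=AUTH,
        verify=False,  # point this at your Icinga2 CA bundle in production
        headers={"Accept": "application/json"},
        params={"filter": 'service.check_command == "apt"'},
    )
    resp.raise_for_status()
    hosts = []
    for result in resp.json()["results"]:
        attrs = result["attrs"]
        # Only update hosts where the pending-updates problem was acknowledged.
        if attrs["acknowledgement"] != 0:
            hosts.append(attrs["host_name"])
    return hosts

if __name__ == "__main__":
    print("\n".join(hosts_with_acknowledged_apt_check()))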

@mfriedrich adds:

My workflow for the services I run is as follows:

  • Mail notification provides as little and as much detail as needed so that I know why a service is critical
  • URL to Icinga Web 2 and the detail view
  • Detail view is enriched with additional metrics and metadata (logs from Elasticsearch if available)
  • URL to Foreman where the machine’s state can be viewed if needed
  • Navigate into Graphite/Grafana from the detail view’s graph if needed
  • Acknowledge a problem, or schedule a post-mortem downtime if I know that I’ll be doing immediate fixing and maintenance (a quick API sketch follows this list)
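
For completeness, acknowledging can also be scripted against the Icinga2 API instead of clicking through Icinga Web 2; here is a minimal sketch (the endpoint, credentials, and the example host/service are placeholders, not my actual setup):

#!/usr/bin/env python3
"""Sketch: acknowledge a service problem via the Icinga2 API.
URL, credentials, and object names are examples only."""
import requests

ICINGA_API = "https://icinga2-master.example.com:5665"  # example endpoint
AUTH = ("api-user", "api-password")                     # example API user

def acknowledge(host, service, comment):
    resp = requests.post(
        f"{ICINGA_API}/v1/actions/acknowledge-problem",
        auth=AUTH,
        verify=False,  # use your Icinga CA bundle instead
        headers={"Accept": "application/json"},
        json={
            "type": "Service",
            "filter": f'host.name == "{host}" && service.name == "{service}"',
            "author": "mfriedrich",
            "comment": comment,
            "notify": True,
        },
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    acknowledge("web01", "http", "Known issue, fix in progress.")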

In addition to that, I’d like to have that correlation thing with Grafana, Graphite, Icinga, and Elasticsearch which Blerim explains in his Icinga Camp talk:

https://www.youtube.com/watch?v=sZLYxerqyqQ

Long term ideas (personal and at work)

  • Office dashboard for development services (our NMS and support team has that), which includes RT, Discourse, GitLab, Jenkins, etc.
  • More widgets for Icinga Web 2 and drop Dashing
  • Deeper ticket system integration

@watermelon concludes:

@KevinHonka

my Ansible AWX server uses the check information to decide which servers get updated; for this to happen, the checks have to be acknowledged.

This is really cool! So you grab the check output from your Ansible server (from the Icinga API?) and decide what to do based on whether or not the check is acknowledged? This sparks a lot of ideas for me.

@dnsmichi

Mail notification provides as little and as much detail as needed so that I know why a service is critical

Cool, so you have different types of mail notifications depending on what kind of notification you’re getting? I just use the generic template. What are some more detailed things that you might include in specific mail notifications? I’ve thought of using the same premise as mikesch’s Grafana x Icingaweb2 plugin, where each email notification comes with a link to a corresponding graph for the problem.

URL to Foreman where the machine’s state can be viewed if needed

I really should look into Foreman. It seems very powerful.

I like that “correlation thing” you linked, too! I might do something like that as well.

More widgets for Icinga Web 2 and drop Dashing

I like this; combining Dashing and Icinga Web 2 would be very nice.

Have you guys looked at NagVis before and what do you think of it? I find it pretty useful, especially in combination with iFrames in Dashing.

@KevinHonka dives into Ansible:

I do exactly that. It works rather well with the Ansible URI module, and when you run the playbook with strategy: free, all hosts can update independently.

@mfriedrich jumps in:

There’s no generic solution for mail notifications (maybe the updated script in 2.7 solves that partially). I tend to look at the available object attributes and decide, within the context, which information could be useful. At my previous job, for example, we included a link to graphs too. Right now I don’t need that, since it is embedded into the detail view in Icinga Web 2.
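
To illustrate the idea only - the stock notification commands are shell scripts, but a minimal Python sketch of such a script could look like this, assuming your NotificationCommand exports these (made-up) environment variables and the URLs are adjusted to your setup:

#!/usr/bin/env python3
"""Sketch of a service notification script: object attributes arrive via
environment variables exported by the NotificationCommand (the names below
are examples, not a fixed Icinga interface), and the mail body links back
to the Icinga Web 2 detail view and a Grafana dashboard."""
import os
import smtplib
from email.message import EmailMessage

ICINGAWEB = "https://icinga.example.com/icingaweb2"  # example base URL
GRAFANA = "https://grafana.example.com/dashboard/db/icinga2-default"  # example dashboard

host = os.environ["NOTIFICATION_HOSTNAME"]        # example variable names
service = os.environ["NOTIFICATION_SERVICENAME"]
state = os.environ["NOTIFICATION_SERVICESTATE"]
output = os.environ["NOTIFICATION_SERVICEOUTPUT"]

msg = EmailMessage()
msg["Subject"] = f"[{state}] {service} on {host}"
msg["From"] = "icinga@example.com"
msg["To"] = os.environ["NOTIFICATION_USEREMAIL"]
msg.set_content(
    f"{output}\n\n"
    f"Details: {ICINGAWEB}/monitoring/service/show?host={host}&service={service}\n"
    f"Graphs:  {GRAFANA}?var-hostname={host}&var-service={service}\n"
)

with smtplib.SMTP("localhost") as smtp:
    smtp.send_message(msg)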

I’ve been using NagVis for a while, but personally I think it is really hard to create the “correct” maps to get the most out of it. I do know that there’s no alternative to it, and I see that many others keep using it. One thing which keeps me away from it is the lack of rpm/deb packages; I’ve found it rather hard to automate its installation with Puppet, for example. Still, that’s a personal experience and does not apply everywhere.

In terms of maps which work with location-based tags, I really like what @nicolaiB has been developing. Still, a more generic “maps” approach is greatly needed. One thing you’d need to keep in mind though - the current backends don’t provide all the details you would need for proper visualization of distributed monitoring environments (zones, endpoints, check source, dependencies, etc.). This is a long-term shot, once a new backend is designed and, hopefully, the aforementioned widgets for Icinga Web 2 arrive.

@unic shares his workflow:

In addition to mail notifications, we are using a special dashboard which is displayed on a separate screen. It only shows new entries. If someone acknowledges a problem, the host/service disappears from the dashboard. So if there are no open problems, the screen is plain white with no information at all, which makes it easy to see when something new comes in. On the host object we have a direct link to a DokuWiki with host/service documentation, where we can find hints on how to solve a problem and more detailed information about the host.

The simple dashboard:

[Overdue.Late Service Check Results]
url = "monitoring/list/services?service_next_update<now"
title = "Late Service Check Results"
disabled = "1"

[openProblems]
title = "Open Problems"

[openProblems.Open Service Problems]
url = "monitoring/list/services?service_acknowledgement_type=0&service_problem=1&service_in_downtime=0&sort=service_state&limit=10"
title = "Open Service Problems"

[openProblems.Open Host Problems]
url = "monitoring/list/hosts?host_acknowledgement_type=0&host_in_downtime=0&host_problem=1&sort=host_severity&limit=10"
title = "Open Host Problems"

[openProblems.Late Service Check Results]
url = "monitoring/list/services?service_next_update<now"
title = "Late Service Check Results"

@watermelon asks:

That’s really cool! What do you use to include the link to your wiki?

Your dashboard could fit nicely as an iFrame in a Dashing dashboard perhaps.

@unic to the rescue:

I’m using the Notes URL for these links. You can place multiple URLs in the field:

'https://$address$' 'https://$address$:12433' 'https://wiki/inventory/$devicetype$/$host_name$'


Another community member shares a Salt-based workflow:

Saltstack takes care of deploying and configuring the Icinga2 master, agents, and the custom monitoring plugins.

Icinga2 and Graphite for monitoring and metrics, with Grafana for graphs and some alerting.

Icinga2 notifications are sent to Slack, which has links to Icingaweb2, and buttons for acknowledging the issue.
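
For context, the plumbing behind the Slack part is roughly this (the webhook URL and names are placeholders; the actual acknowledge button needs a Slack app with an interactive endpoint that calls back into the Icinga2 API, which is left out here):

#!/usr/bin/env python3
"""Sketch: push an Icinga2 notification to Slack with a link back to
Icinga Web 2. Webhook URL and names are placeholders."""
import sys
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder webhook
ICINGAWEB = "https://icinga.example.com/icingaweb2"                # placeholder base URL

def notify(host, service, state, output):
    detail = f"{ICINGAWEB}/monitoring/service/show?host={host}&service={service}"
    payload = {
        "attachments": [{
            "color": "danger" if state == "CRITICAL" else "warning",
            "title": f"{service} on {host} is {state}",
            "title_link": detail,  # jump straight to the detail view
            "text": output,
        }]
    }
    requests.post(SLACK_WEBHOOK, json=payload, timeout=10).raise_for_status()

if __name__ == "__main__":
    # A NotificationCommand would pass these values as arguments.
    notify(*sys.argv[1:5])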

Icinga2 agents can also use EventCommands to send events to Salt which can then react to the issue.
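
A rough sketch of such an event handler, assuming salt-api is running with the rest_cherrypy module (the endpoint, eauth credentials, and the remediation state are made up for illustration):

#!/usr/bin/env python3
"""Sketch of an EventCommand handler: when a service turns CRITICAL,
ask salt-api to apply a (hypothetical) remediation state on that minion.
Endpoint, eauth credentials, and state name are assumptions."""
import sys
import requests

SALT_API = "https://salt-master.example.com:8000"  # assumed rest_cherrypy endpoint

def trigger_remediation(minion_id, service_state):
    if service_state != "CRITICAL":
        return
    requests.post(
        f"{SALT_API}/run",
        json=[{
            "client": "local",
            "tgt": minion_id,
            "fun": "state.apply",
            "arg": ["remediation.restart_service"],  # hypothetical state
            "username": "icinga",                    # assumed eauth user
            "password": "secret",
            "eauth": "pam",
        }],
        verify=False,  # use your CA bundle in production
        timeout=30,
    ).raise_for_status()

if __name__ == "__main__":
    # The EventCommand would pass host name and service state as arguments.
    trigger_remediation(sys.argv[1], sys.argv[2])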

Each environment also has its own ELK server for collecting and watching logs.


Cool, thanks for sharing :slight_smile:

Which Salt recipe are you using, or can you share certain snippets from it?

Since you’re referring to “each environment”, how many of them do you manage? :slight_smile:

Cheers,
Michael

I rewrote an Icinga2 Salt formula to work like the official Puppet module. I still need to work on the documentation a bit more :slight_smile:

I’m looking after two main environments; one testing, one production.


Oh, cool, thanks. Ping @bsheqa :slight_smile:

One thing for the docs - I don’t know whether Salt requires you to use RST, but I would go for Markdown. That also allows for easier copy-pasting and writing (e.g. with Atom). Oh, and maybe more contributors on GitHub too :wink: @nicolaiB maybe you can help out here, I remember that you know Salt a little.

Is your testing environment the same as production, or how’s that built with the agent roles? I know that there’s a certain problem with building the zone tree with 2 agents on the same host.

One thing for the docs - I don’t know whether Salt requires you to use RST, but I would go for Markdown.

Are you talking about the README.rst in the repo? That’s just what the original repo used. Salt doesn’t have a preference there. The pillar examples are the biggest thing really.

The test environment is mostly the same as the production environment, each configured using branching Salt State and Pillar environments (with a “parent” environment with shared pillars and states).

Every server has an Icinga2 agent, and they all talk directly to the same Icinga2 master so it’s a flat “distributed monitoring” structure. The agents just get a minimal configuration.


Nice, I’ll definitely have a look at the formula! If I’ve seen it correctly, you’re using a custom CA managed by Salt? We’re using the salt-mine feature to generate the token on the master and share it with the satellites/clients. If I find time, I’ll share some insights.

There is also a formula at the saltstack repo, on which your fork-fork is based. Maybe you could send a PR?


If I’ve seen it correctly, you’re using a custom CA managed by Salt?

Yeah, it uses the Salt peer system to let agents request a cert directly from the Icinga2 master minion. From memory, I think I found that easier than using the Icinga2 token/cert system.

There is also a formula at the saltstack repo, on which your fork-fork is based. Maybe you could send a PR?

I did, but I rewrote the entire thing (to try and match the Puppet module), so it’s no longer compatible with the original formula. I also haven’t done any testing outside of Debian Jessie.