Troubleshooting-Guide

Hi folks,

As an Icinga partner providing support, we thought for quite some time that we would have no use for a sophisticated troubleshooting guide for internal use. The tickets that came in were too diverse, and most cases had to be dealt with by hardened and seasoned support engineers who dove into the problem at hand like Conan the Barbarian into the ranks of his foes, by the pure power of their experience and knowledge.

With more and more young folks joining the ranks of our Support-team, sometimes just for a short period of time, we saw the need for a guide arising. Of course our younglings were never left alone so we had in fact two people working on an issue that didn’t take much more than asking for specific data in the first place. And sometimes, I gotta admit it, even the experienced ones had to be reminded that they forgot to ask about something specific - most of us are human and we all make mistakes.

So we came up with the idea to build a guide which contains some standard cases and how to deal with them. With a focus on what data to gather if one needs to escalate the issue to the next level.

Since it’s no good to have several guides covering overlapping topics, and to utilize the power of the community, we came up with the idea to build this guide together, part by part. “Together” meaning community members, Icinga team members and employees at partners (sometimes all together in one person).

We all (should) know, there’s a great part about troubleshooting and another about debugging in the Icinga docs, but this is not exactly what we need.

So what is this all about? I don’t want to start right away with writing some new guide, but rather find a way to build something that can satisfy all our needs, or a collection of information that serves the different purposes. To be honest, I have a rough idea, but finding the final form is something I’m hoping to achieve through this thread.

The idea is to have something we (meaning everyone) can use for at least these four purposes:

  • Find information to help ourselves when we are in trouble. If you’re experienced enough you should be able to find solutions to common problems so you can fix them yourself.
  • Give a guide on how to provide information here in our great community if you need help. (E.g. there are times when someone asks a question with sparse information, and rude people could reply with an icon of a crystal ball. Instead, we could send the thread starter to a specific part of the guide and ask them to follow the steps to provide the information needed.)
  • Give a guide for support engineers (no matter if they are junior, senior or lead) at partners on what to ask for from their customers. If a solution can be found within the guide - great. But the focus of the partner part would be more about which information to ask for to make sure the customer didn’t hit a bug. Or if they did, what information the next level needs to deal with the bug. Or what information is needed to replicate a certain situation.
  • Give ideas for further development for tools like Icinga Diagnostics about which information to collect automatically

This is not an easy task even if all the questions and answers were already collected. The different groups of people have very different needs when it comes to guidelines. Some need more thorough information, others need a very quick overview because they know exactly what they’re looking for.

So what I’m hoping for is ideas for the details on how to achieve a collection of information that can cater to all these needs. Maybe even ideas for more use cases. And I want to find the right tool to do this. A markdown file in a git repository? A thread/wiki within this community board? A new or reworked section within docs.icinga.com? A new tool (please, no)?

Since this post is collecting likes and views (which both make me happy) but no replies, I’ll start with replies myself:

About the tool:

  • I’d say we should use one of the existing places and not search for a new tool
  • Having it here might make it easier for more people to contribute
  • Having it in the official docs would make it easier to find and maybe get more people into contributing (or spook them away from helping with the guide after all)

What do you say?

About the format:

  • My first idea is to have a simple one-liner describing the problem, like “Icinga is crashing”, in the style someone would use to describe their problem
  • Then a very concise list of things to check and ask for (coredumps, diagnostics output, icinga2.log, debug.log, openssl version,…)
  • Then a slightly longer text about what this could actually mean. Background info so to say
  • Then another longer text about what to check for (loglines, open ports,…) and, if it’s a known problem, how you could fix it yourself if you know what you’re doing. (upgrade to 2.15.7, delete api log, resync hosts)
  • If there is any, a link to an issue that’s related to the problem. If there are more issues, then a list of issues with the version where each was fixed
  • If the problem might be relevant for a diagnostics (or whatever tool) check, we could refer to an issue (and version) there as well

At least all descriptions (first item in the list) should be listed in a table of contents for easy search and jumping. A list of synonyms or tags would be great, but I can’t think of a way to achieve that with the tools we already have.
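To make the format a bit more tangible, here is a minimal sketch of how one entry could be modelled, assuming we kept the entries as structured data somewhere. All class, field and function names are made up for illustration, and the tags are just example values; only the symptom and the checklist items come from the list above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GuideEntry:
    """One troubleshooting entry; all field names are hypothetical."""
    symptom: str                 # one-liner in the user's own words
    checklist: List[str]         # concise list of things to check and ask for
    background: str = ""         # what the symptom could actually mean
    details: str = ""            # what to check for and how to fix it
    issues: List[str] = field(default_factory=list)  # links to related issues
    tags: List[str] = field(default_factory=list)    # synonyms for searching


def table_of_contents(entries):
    """Return all symptom one-liners, sorted for easy scanning."""
    return sorted(entry.symptom for entry in entries)


def find_by_tag(entries, term):
    """Return entries whose symptom or tags contain the search term."""
    needle = term.lower()
    return [
        entry for entry in entries
        if needle in entry.symptom.lower()
        or any(needle in tag.lower() for tag in entry.tags)
    ]


entries = [
    GuideEntry(
        symptom="Icinga is crashing",
        checklist=["coredumps", "diagnostics output", "icinga2.log",
                   "debug.log", "openssl version"],
        tags=["crash", "segfault"],  # example synonyms, not an agreed list
    ),
]
```

With something like this, `table_of_contents(entries)` would give the list of one-liners for the top of the guide, and `find_by_tag(entries, "crash")` would cover the synonym search. Whether we store the entries like this at all depends on the tool we pick.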

I’m happy that this post got several likes but I’m feeling a bit alone here in the thread. :stuck_out_tongue:

So, since I got “too many words, got headache” feedback several times over other channels, here’s an even shorter version:

Tool

Pros and Cons are hidden in the last posts.

  • Post on community
  • Official docs
  • Don’t care
  • Something else (details in thread)

Format

Just tick anything you find useful

  • Oneliner description of what went wrong (Symptom)
  • Oneliner description of potential problem in setups (Cause)
  • List of what data to collect (List)
  • List of how to collect data (Example code)
  • Background info on what might have gone wrong
  • Symptoms to look out for (descriptive text)
  • Symptoms to look out for (list)
  • Info on how to fix this (if possible)
  • Links to issues/bugs that could have been hit
  • Info on which version might have closed this issue
  • Potential workarounds
  • List of what type of nodes can be affected
  • List of OSes which might be affected
  • Integrate data collection into automated tool
  • Don’t care

If I should add more options, let me know. I’m aware that a knowledgebase that covers all these points would be hard to build, so let’s focus on the ones you’d like to see. Or let me hear what other ideas you have.

Hi Thomas!

IMHO the Official Docs would be the best place for a formal troubleshooting guide.

The community here is more for asking questions, either because something isn’t covered by the docs or because people were too lazy to look before asking. Also, its threaded conversation view isn’t exactly ideal for a guide that gets updated.

If a question within the community brings up an important thing that should be added, or just comes up quite frequently, it can get added to the docs.

Thanks for your input @mjbrooks, that’s very much appreciated.

Discourse offers a “Wiki” style for posting, so everyone can contribute to the first post of a thread. This is what I have been thinking of. It lacks the power of Wiki-links, though, and to be honest, I think the official docs would be better, too. The only downside of posting it there I can think of is that it’s harder for many people to contribute. In here it’s just editing a post; in the official docs you’d need to get familiar with git and pull requests. Online editing in GitHub might mitigate that, though.

Editing in Discourse seems like it might be too much of a free-for-all and would probably result in something that lacks a consistent flow. Also, when I need to point to a reference I prefer to link to the docs and not have to hunt down a comment thread. Discourse just doesn’t quite seem like the right place.

That said, it could possibly be used as a scratch pad and then people who have a better handle on git can pool together updates from it, clean them up, and then put them into a polished version for the official docs.

Yes. Discussing topics here and then having some people transfer them to the official docs sounds like a good idea.

What I think I would like to see is some sort of tree chart, that helps you pinpoint what went wrong - and points you to where you need to look.

You start at the top with the first thing you usually check, “logs in xyz”, and then follow arrows labelled with possible issues that point to the next step, where the problem could be.

Is that understandable? I could try to draw it too if the description is too confusing.

Anyway, I assume it’s a lot of work, but I could imagine that this kind of graph would help a lot :slight_smile:

So some sort of help with creating a diagnosis? If you just see “there’s something going sideways” but you can’t pinpoint it?

That’s a great idea! We could combine that with the troubleshooting knowledgebase in a way that the “leaves” of the tree point you to articles in the KB.

I could imagine there are tools like “Mermaid” that help with building a tree like that. We could collect some possible entries and try different layouts like tree or flowchart.
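As a rough sketch of how such a tree could look in practice: each node is one thing to check, each arrow is a finding, and the leaves name a KB article. The `Node` class, the `walk` and `to_mermaid` helpers and the example findings are all made up for illustration; only “logs in xyz” and “Icinga is crashing” are taken from this thread. The last helper writes the tree out as Mermaid flowchart text so we could try different layouts.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """One step in the diagnosis tree; all names here are hypothetical."""
    text: str                                  # what to check at this step
    branches: Dict[str, "Node"] = field(default_factory=dict)  # finding -> next step
    article: Optional[str] = None              # KB entry a leaf points to


def walk(node, findings):
    """Follow the findings down the tree and return the node reached."""
    for finding in findings:
        if finding not in node.branches:
            break  # unknown finding: stay at the last step we could reach
        node = node.branches[finding]
    return node


def to_mermaid(root):
    """Render the tree as Mermaid flowchart text (top-down layout)."""
    lines = ["flowchart TD"]
    ids = {}

    def node_id(node):
        # Assign ids n0, n1, ... in order of first visit.
        if id(node) not in ids:
            ids[id(node)] = f"n{len(ids)}"
        return ids[id(node)]

    def visit(node):
        label = node.article or node.text
        lines.append(f'    {node_id(node)}["{label}"]')
        for finding, child in node.branches.items():
            visit(child)
            lines.append(f"    {node_id(node)} -->|{finding}| {node_id(child)}")

    visit(root)
    return "\n".join(lines)


# Placeholder tree: the findings and the second step are invented examples.
tree = Node(
    text="Check logs in xyz",
    branches={
        "crash reported": Node(text="KB article", article="Icinga is crashing"),
        "nothing unusual": Node(text="Next place to look, to be defined"),
    },
)
```

Calling `walk(tree, ["crash reported"])` ends at the leaf that points to the “Icinga is crashing” entry, and `print(to_mermaid(tree))` gives text that can be pasted into any Mermaid renderer to see how the layout reads.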
