How do you do SLA and Reporting?

Your manager comes to you: “I need a monthly or weekly report showing which servers have had issues in the last week or month. I need to see which hosts have had critical disk alerts and for how long. I need CPU stats and which ones were critical in the last month. Oh, and those custom HTTP checks: I need to know how long they were critical and for which hosts. On my desk once a week or month, please?”

So you open the reporting module and realise you don’t have what you need.

How do you tackle SLAs and reporting with Icinga?

@Radius540R What exactly are you missing from the reporting module?
I know it is sparse, but what specifically is missing?
I mostly hear random complaints about it being insufficient, but few where people can tell me “I need this kind of data, structured in way x and presented in way y”.

lorenz, which other products on the market would you compare Icinga to?

You could say that nothing is missing or you could say that everything is missing depending on what is on offer from other products. Examples: data trends, KPIs, advanced graphs, machine learning.

Could you please keep it on topic[1]? Lorenz kindly asked what kind of SLA or reporting features you would like to see and you started to ramble about trend predictions, up to ML.

I am quite certain that the current reporting module may not satisfy all of the needs outlined in your anecdotal first post. For example, a report about the last month’s outages would be useful in lots of compliance scenarios.

But without hearing specific demands, it’s hard to help satisfy them. So please try to keep your communication direct.


  1. Which you have set by creating this thread, btw.

Yes, correct, Alvar. Apologies, it must be the jetlag. Thank you, lorenz, for answering and asking the questions.

Here are some examples of things I would like to have in onscreen reports/dashboards/automatically sent reports:

  • top 10 alerts per day/week/month
  • top 10 hosts with most critical alerts per day/week/month - in a table
  • top 10 servicegroups with most critical alerts per day/week/month
  • top 10 hostgroups with most critical alerts per day/week/month - in a table
  • when cpu/mem/disk/http critical alerts happen the most frequently during the week/weekend
  • how many cpu/mem/disk/http service alerts occur per day/week/month
  • top 10 notification recipients per week/month
  • which services have an increased number of warning/critical alerts compared to the previous week/month - a negative trend (or positive)
  • which hosts have an increase in critical/warning alerts over the previous week/month
  • reporting list of hosts contacting Icinga but are not IN Icingaweb
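
To make the "when do alerts happen" and "increase over the previous week" items a bit more concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the `ALERTS` tuples are a made-up record layout, not an Icinga or Icinga DB export format, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alert records: (timestamp, host, service, state).
# Illustrative sample data only; not an actual Icinga export format.
ALERTS = [
    (datetime(2024, 1, 1, 9, 0), "web01", "http", "CRITICAL"),
    (datetime(2024, 1, 3, 14, 0), "db01", "disk", "CRITICAL"),
    (datetime(2024, 1, 6, 2, 0), "web01", "http", "CRITICAL"),
    (datetime(2024, 1, 8, 9, 0), "web01", "http", "CRITICAL"),
    (datetime(2024, 1, 9, 9, 0), "web01", "cpu", "CRITICAL"),
    (datetime(2024, 1, 10, 9, 0), "web01", "http", "CRITICAL"),
    (datetime(2024, 1, 13, 9, 0), "db01", "disk", "WARNING"),
    (datetime(2024, 1, 14, 9, 0), "app01", "mem", "CRITICAL"),
]


def count_by_host(alerts, start, end, state="CRITICAL"):
    """Count alerts in the given state per host within [start, end)."""
    return Counter(
        host for ts, host, _svc, st in alerts
        if st == state and start <= ts < end
    )


def week_over_week(alerts, week_start, state="CRITICAL"):
    """Per-host change in alert count versus the preceding seven days."""
    week = timedelta(days=7)
    prev = count_by_host(alerts, week_start - week, week_start, state)
    curr = count_by_host(alerts, week_start, week_start + week, state)
    # A Counter returns 0 for missing hosts, so the subtraction is safe.
    return {host: curr[host] - prev[host] for host in set(prev) | set(curr)}


def weekday_vs_weekend(alerts, state="CRITICAL"):
    """Split alert counts into weekday and weekend buckets."""
    buckets = Counter()
    for ts, _host, _svc, st in alerts:
        if st == state:
            # datetime.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
            buckets["weekend" if ts.weekday() >= 5 else "weekday"] += 1
    return buckets


if __name__ == "__main__":
    print(week_over_week(ALERTS, datetime(2024, 1, 8)))
    print(weekday_vs_weekend(ALERTS))
```

A positive number from `week_over_week` means more critical alerts than in the week before, a negative number means fewer. Swapping the grouping key from host to service or group would cover the other list items in the same way.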

No worries, but thanks for the further clarification.

Both a “usual suspects” list and an outliers list might indeed come in handy. Going further, a check’s performance data may even allow trend prediction using simple statistics, e.g., a linear fit for a steadily filling disk.
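
As a minimal sketch of what such a linear prediction could look like, assuming the disk usage samples have already been pulled out of the performance data by some other means: the function names and sample values below are made up for illustration, and an ordinary least-squares line is only one of several ways to do this.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_full(days, used_pct, full_at=100.0):
    """Estimate days from the last sample until usage reaches full_at.

    Returns None when usage is flat or shrinking, because a linear
    extrapolation then never reaches the threshold.
    """
    slope, intercept = fit_line(days, used_pct)
    if slope <= 0:
        return None
    day_full = (full_at - intercept) / slope
    return max(0.0, day_full - days[-1])


if __name__ == "__main__":
    # Hypothetical samples: day number and disk usage in percent.
    sample_days = [0, 1, 2, 3, 4]
    sample_used = [50.0, 52.0, 54.0, 56.0, 58.0]
    print(days_until_full(sample_days, sample_used))
```

This only makes sense for disks that fill at a roughly constant rate. Log rotation, cleanup jobs, or bursty writes would break the linear assumption.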

At least the top ten troublemakers should be quite easy to implement with the data already available from Icinga DB. The same goes for newly problematic hosts, i.e., those with few or no recorded state changes in the past.
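
To illustrate the idea (not how Icinga DB actually stores or exposes it), here is a small Python sketch over a hypothetical list of state-change records. The `HISTORY` layout and both function names are assumptions. Real data would have to be fetched from the state history first.

```python
from collections import Counter

# Hypothetical state-change records: (day_number, host, new_state).
# Illustrative only; this is not the Icinga DB schema.
HISTORY = [
    (1, "web01", "CRITICAL"),
    (2, "web01", "OK"),
    (3, "db01", "CRITICAL"),
    (10, "web01", "CRITICAL"),
    (11, "web01", "OK"),
    (12, "web01", "CRITICAL"),
    (12, "app01", "CRITICAL"),
    (13, "db01", "CRITICAL"),
]


def top_troublemakers(history, since_day, n=10):
    """Hosts with the most CRITICAL state changes since since_day."""
    counts = Counter(
        host for day, host, state in history
        if day >= since_day and state == "CRITICAL"
    )
    return counts.most_common(n)


def new_problem_hosts(history, since_day, max_prior=0):
    """Hosts that went CRITICAL recently but have at most max_prior
    state changes of any kind recorded before since_day."""
    prior = Counter(host for day, host, _state in history if day < since_day)
    recent = {
        host for day, host, state in history
        if day >= since_day and state == "CRITICAL"
    }
    return sorted(host for host in recent if prior[host] <= max_prior)


if __name__ == "__main__":
    print(top_troublemakers(HISTORY, since_day=8))
    print(new_problem_hosts(HISTORY, since_day=8))
```

Raising `max_prior` loosens the definition of "new" from "never seen before" to "rarely seen before".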

I will try to pitch this somehow to the web team, as on the core or daemon side, everything should already be there.

For the prediction part, something supporting the perf data would be required.

However, there was one entry in your example list I don’t quite get: “reporting list of hosts contacting Icinga but are not IN Icingaweb”. Are you referring to pending certificate requests on the Icinga 2 master node, or to signed clients that are missing a corresponding Host object?

I would also like the option to query the Perfdata writer (InfluxDB) for tactical views. The donuts are nice, but I need to switch to my Grafana Icinga dashboard to know whether the number of unknowns is sinking or rising.

I did the disk full prediction myself in the past :wink: