How do you do SLA and Reporting?

Radius540R · November 19, 2024, 9:43am

Your manager comes to you, “I need a monthly or weekly report to show which servers have had issues in the last week or month. I need to see which hosts have had critical disk alerts and for how long, I need cpu stats, which ones were critical in the last month - Oh and those custom http checks, need to know how long they were critical for and for which hosts. On my desk once a week or month please?”

So you open the reporting module and realise you don’t have what you need.

How do you tackle SLAs and reporting with Icinga?

lorenz · November 19, 2024, 11:16am

@Radius540R What exactly are you missing from the reporting module?
I know it is sparse, but what is missing specifically.
I do mostly hear random complaints about it being insufficient, but not many where people could tell me “I need this kind of data, structured in way x and presented in way y”.

Radius540R · November 20, 2024, 3:08pm

lorenz, which other products on the market would you compare Icinga to?

You could say that nothing is missing or you could say that everything is missing depending on what is on offer from other products. Examples: data trends, KPIs, advanced graphs, machine learning.

apenning · November 20, 2024, 3:32pm

Could you please keep it on topic [1]? Lorenz kindly asked what kind of SLA or reporting features you would like to see and you started to ramble about trend predictions, up to ML.

I am quite certain that the current reporting module may not satisfy every need, as outlined in your anecdotal first post. For example, having a report about the last month’s outages would be useful in lots of compliance scenarios.

But without hearing specific demands, it’s hard to help satisfy them. Thus, please try to keep your communication direct.

[1] Which you have set by creating this thread, btw.

Radius540R · November 20, 2024, 4:32pm

Yes, correct Alvar, apologies, must be the jetlag. And thank you, lorenz, for answering and asking the questions.

Here are some examples of things I would like to have in onscreen reports/dashboards/automatically sent reports:

top 10 alerts per day/week/month
top 10 hosts with most critical alerts per day/week/month - in a table
top 10 servicegroups with most critical alerts per day/week/month
top 10 hostgroups with most critical alerts per day/week/month - in a table
when cpu/mem/disk/http critical alerts happen the most frequently during the week/weekend
how many cpu/mem/disk/http service alerts occur per day/week/month
top 10 notification recipients per week/month
which services have an increased number of warning/critical alerts compared to previous week/month - a negative trend (or positive)
which hosts have an increase in critical/warning alerts over previous week/month
reporting list of hosts contacting Icinga but are not IN Icingaweb

apenning · November 21, 2024, 8:08am

No worries, but thanks for the further clarification.

Having some kind of “usual suspects” or outliers list might indeed come in handy. Going further, using a check’s performance data may even allow a trend prediction using simple statistics, e.g., linear growth for a steadily filling disk.

At least the top ten of trouble makers should be realizable quite easily with the already available data from Icinga DB. Same goes for new problematic hosts, having no or few records of state changes in the past.
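As a rough illustration of the "top ten troublemakers" idea, here is a minimal Python sketch that counts transitions into CRITICAL per host. The `(host, service, new_state)` tuples and the `top_critical_hosts` helper are hypothetical and only stand in for whatever state history is exported from Icinga DB; the actual schema is not shown here.

```python
from collections import Counter

# Icinga service state codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
CRITICAL = 2


def top_critical_hosts(state_changes, n=10):
    """Return the n hosts with the most transitions into CRITICAL.

    state_changes is an iterable of hypothetical (host, service, new_state)
    tuples; it is an illustrative stand-in, not the Icinga DB schema.
    """
    counts = Counter(
        host for host, _service, state in state_changes if state == CRITICAL
    )
    return counts.most_common(n)


# Made-up sample data for one reporting period.
events = [
    ("web01", "disk", 2),
    ("web01", "http", 2),
    ("db01", "cpu", 2),
    ("web01", "disk", 0),
    ("db01", "cpu", 1),
    ("app01", "http", 2),
    ("web01", "cpu", 2),
]

for host, count in top_critical_hosts(events, n=10):
    print(f"{host}: {count} critical alerts")
```

Running the same function over two consecutive periods and comparing the counts would give a simple week-over-week trend per host.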

I will try to pitch this somehow to the web team, as on the core or daemon side, everything should already be there.

For the prediction part, something supporting the perf data would be required.
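To make the "linear growth for a steadily filling disk" idea concrete, here is a minimal sketch using an ordinary least-squares line fit in plain Python. The `predict_full_time` function and its input format (pairs of timestamp and used space taken from perf data) are assumptions for illustration, not an existing Icinga feature.

```python
def predict_full_time(samples, capacity):
    """Estimate when disk usage reaches capacity via a least-squares line.

    samples: list of (timestamp, used) pairs, e.g. taken from perf data.
    capacity: total size in the same unit as `used`.
    Returns the estimated timestamp, or None if usage is not growing
    or there is too little data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # flat or shrinking usage never fills the disk
    intercept = mean_u - slope * mean_t
    return (capacity - intercept) / slope


# Made-up example: time in days, usage in GB, 100 GB disk.
samples = [(0, 50), (1, 60), (2, 70), (3, 80)]
print(predict_full_time(samples, capacity=100))
```

A straight line is only a reasonable model for steady growth; bursty or periodic usage would need something more careful.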

However, there was one entry in your example list I don’t quite get: “reporting list of hosts contacting Icinga but are not IN Icingaweb”. Are you referring to pending certificate requests on the Icinga 2 master node, or to signed clients that are missing a corresponding Host object?

rivad · November 21, 2024, 2:26pm

I would also like the option to query the Perfdata writer (InfluxDB) for tactical views. The donuts are nice, but I need to switch to my Grafana Icinga dashboard to know if the number of unknowns is sinking or rising.

I did the disk full prediction myself in the past