Dynamic thresholds for check disks

blindzero · December 14, 2020, 7:07pm

Hi there

Maybe you know the issue: fixed warning and critical thresholds in check_disk are problematic, because the right value depends on the disk’s size. 10% free or 10GB free means something very different on a 100GB disk than on a 10TB disk.

Maintaining the thresholds manually for each service is impractical as well.

Apply rules do not help either: you would have to maintain the disk size in custom vars, and at that point I could just as well maintain the thresholds directly.

How are you dealing with this?

Is there anything nice like dynamic thresholds? I stumbled over such an approach in another monitoring solution. There the threshold is adjusted by a magic factor (e.g. 0.8) relative to a base size (e.g. 100GB). At exactly 100GB disk size the scaling is 1.0 (no change). For smaller disks the threshold is decreased, for larger ones it is increased.
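A minimal sketch of how such a magic factor could work. The exact formula used by that other monitoring solution is not given here, so the scaling below is an assumption, as are the function and parameter names. It treats the threshold as a percent-used level and shrinks or grows the remaining free-space allowance relative to the base size.

```python
def scaled_threshold(used_pct_level, disk_size_gb, magic=0.8, base_size_gb=100.0):
    """Scale a percent-used threshold by disk size (assumed formula)."""
    # Disk size expressed in multiples of the base size (1.0 at 100GB).
    relative_size = disk_size_gb / base_size_gb
    # With magic < 1 the scale is below 1 for large disks and above 1
    # for small ones; at the base size it is exactly 1.0.
    scale = relative_size ** magic / relative_size
    # Apply the scale to the free-space allowance, then convert back
    # to a percent-used level. Clamp so tiny disks never go below 0.
    return max(0.0, 100.0 - (100.0 - used_pct_level) * scale)


if __name__ == "__main__":
    for size_gb in (10, 100, 1000, 10000):
        print(size_gb, round(scaled_threshold(90.0, size_gb), 2))
```

With a 90% level, a 100GB disk keeps 90%, a 10GB disk alerts earlier, and a 10TB disk alerts later.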

Best
M.

Pooh · December 14, 2020, 7:24pm

More important than this, I think, is that most people are not in the least
bit interested in how full their disks are, so long as they stay that way.

What’s important is how fast they are filling up - in other words, how the free
space is decreasing over a period of time.

I do not care if a 100GB partition is 98% full for 3 months.

I do care if a 100GB partition is 90% full, then 91% full 5 minutes later, and
92% full 5 minutes after that, and so on…

After all, a 98% full partition can stay like that as long as it likes without
causing any problems.

A 90% full partition which looks like it’s going to 100% full in 50 minutes’
time is a problem.

That is what I want a monitoring system to check on my disks and alert me
about.
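A minimal sketch of that idea, assuming the usage samples are already available as (minute, percent used) pairs. The function name and the simple linear extrapolation between the oldest and newest sample are illustrative choices, not something an existing plugin is known to do.

```python
def minutes_until_full(samples):
    """Estimate minutes until 100% used from (minute, used_pct) samples.

    Samples must be ordered oldest first. Returns None when the disk is
    not filling up or there is too little data to tell.
    """
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    # Fill rate in percentage points per minute.
    rate = (u1 - u0) / (t1 - t0)
    if rate <= 0:
        return None  # steady or shrinking usage is not a problem
    return (100.0 - u1) / rate


if __name__ == "__main__":
    # 90% -> 91% -> 92% in 5-minute steps.
    print(minutes_until_full([(0, 90.0), (5, 91.0), (10, 92.0)]))
    # 98% full and staying there.
    print(minutes_until_full([(0, 98.0), (5, 98.0), (10, 98.0)]))
```

A check built on this would alert on the estimated time remaining rather than on the fill level itself.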

Regards,

Antony

dgoetz · December 15, 2020, 7:58am

I have seen two approaches.

The simpler one was to use an if-clause, which works when the disk size is available as a custom variable from some inventory data.
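The logic of such an if-clause could look roughly like this. It is written in Python only for illustration; the size tiers and threshold values are made-up examples, and in practice the same decision would live in the monitoring configuration.

```python
def free_space_thresholds(disk_size_gb):
    """Return (warn_free_pct, crit_free_pct) for a disk size.

    The tiers and values are hypothetical examples.
    """
    if disk_size_gb < 100:
        return (20, 10)   # small disks: alert early
    if disk_size_gb < 1000:
        return (10, 5)
    return (5, 2)         # large disks: 5% is still a lot of space


if __name__ == "__main__":
    for size_gb in (50, 500, 10000):
        print(size_gb, free_space_thresholds(size_gb))
```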

The more complex one collected the metric with the normal check_disk or another tool such as collectd, and then monitored the data with a check that supports functions like standard deviation.
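A minimal sketch of a standard-deviation based check, assuming the collected metric has already been turned into a list of per-interval growth values. The function name, the factor k and the sample data are illustrative assumptions.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, k=3.0):
    """Flag `latest` if it deviates more than k standard deviations
    from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough samples to estimate a deviation
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any change at all stands out.
        return latest != mu
    return abs(latest - mu) > k * sigma


if __name__ == "__main__":
    # Growth in percentage points per 5-minute interval.
    growth = [0.1, 0.1, 0.2, 0.1, 0.2]
    print(is_anomalous(growth, 0.15))  # normal growth
    print(is_anomalous(growth, 1.0))   # sudden jump
```

How many samples go into `history` and which window they cover matters a lot for the quality of the result.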

I think the latter would be the future, if only someone could do a good implementation. Most attempts failed here because they used too few samples, the wrong samples, the wrong function or something like this, and the prediction then failed so badly that it either produced too many false positives or never alerted at all.