Graphite installation for Ubuntu

What do you have your retention set to?

I switched mine to the example mentioned a few posts back of:

retentions = 5m:30d,15m:90d,1h:1y

But I’m wondering what others are doing. Seeing as we seem to have similar check intervals, I’m curious what you use.

I’ve got a 1m, but that’s because I do have a lot of checks that run every 1-2 minutes, as well as the retries in critical scenarios to see if it recovers quickly or not.

[icinga-hosts]
pattern = ^icinga2\..*\.host\.
retentions = 5m:12w

[icinga-services]
pattern = ^icinga2\..*\.services\.
retentions = 1m:4w, 5m:26w, 15m:1y, 30m:2y
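As a rough sanity check on what a schema like this costs per metric: whisper preallocates every slot up front, so file size follows directly from the point counts. A small sketch, assuming the standard whisper on-disk layout (16-byte file header, 12 bytes per archive entry, 12 bytes per datapoint); treat the result as an estimate:

```python
# Estimate the size of one whisper (.wsp) file from a retention string.
# Whisper preallocates every slot: 16-byte file header, 12 bytes per
# archive entry, 12 bytes per datapoint (timestamp + double value).
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def to_seconds(token):
    """Parse a duration like '5m' or '26w' (unit suffix required)."""
    return int(token[:-1]) * UNITS[token[-1]]

def whisper_size_bytes(retentions):
    points_per_archive = []
    for archive in retentions.split(","):
        step, keep = archive.strip().split(":")
        points_per_archive.append(to_seconds(keep) // to_seconds(step))
    return 16 + 12 * len(points_per_archive) + 12 * sum(points_per_archive)

size = whisper_size_bytes("1m:4w,5m:26w,15m:1y,30m:2y")
print(f"{size} bytes (~{size / 1024 / 1024:.1f} MiB) per metric")
```

So each metric under this schema costs just under 2 MiB on disk, dominated by the 1m archive (40320 points, ~470 KiB of it), which is why dropping the 1m tier is the easy saving if you have no 1-minute checks.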

Does having the 1m negatively impact the checks that are set to 5 minutes?

Actually, let’s talk about storage-aggregation.conf for a second.

xFilesFactor is a confusing setting. From the docs:

“xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5.”

I just tell that thing I don’t care how empty it looks, just average my stuff.

[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average

Can I change this on a (long-)running system without any further steps?

Afaik you have to do some “magic” when changing the retention intervals once whisper files have already been created.

It’s going to be wherever you have this little guy hiding. I’ve seen a lot of people use find with -exec for this, which makes me nervous, so you should definitely back up first. It’s also good for cases where you have a check that runs once an hour and don’t need it wasting disk space.

# whisper-resize.py 
Usage: whisper-resize.py path timePerPoint:timeToStore [timePerPoint:timeToStore]*

timePerPoint and timeToStore specify lengths of time, for example:

60:1440      60 seconds per datapoint, 1440 datapoints = 1 day of retention
15m:8        15 minutes per datapoint, 8 datapoints = 2 hours of retention
1h:7d        1 hour per datapoint, 7 days of retention
12h:2y       12 hours per datapoint, 2 years of retention


Options:
  -h, --help            show this help message and exit
  --xFilesFactor=XFILESFACTOR
                        Change the xFilesFactor
  --aggregationMethod=AGGREGATIONMETHOD
                        Change the aggregation function (average, sum, last,
                        max, min, avg_zero, absmax, absmin)
  --force               Perform a destructive change
  --newfile=NEWFILE     Create a new database file without removing the
                        existing one
  --nobackup            Delete the .bak file after successful execution
  --aggregate           Try to aggregate the values to fit the new archive
                        better. Note that this will make things slower and use
                        more memory.
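If you do go the find-over-the-whole-tree route, it helps to generate and review the command list before executing anything. A small sketch along those lines (the whisper tree path and retention values are examples, adjust them to your setup, and take that backup first):

```python
# Sketch: list every .wsp file under a whisper tree and print the
# whisper-resize.py command that would be run for it. Review the
# output (and back up!) before actually executing anything.
from pathlib import Path

def resize_commands(wsp_root, retentions):
    """Yield one whisper-resize.py command per .wsp file found."""
    for wsp in sorted(Path(wsp_root).rglob("*.wsp")):
        yield f"whisper-resize.py {wsp} {' '.join(retentions)}"

root = "/var/lib/graphite/whisper/icinga2"   # example path, adjust
if Path(root).is_dir():
    for cmd in resize_commands(root, ["1m:4w", "5m:26w", "15m:1y", "30m:2y"]):
        print(cmd)   # pipe to `sh` only once the list looks right
```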

I came across this blog post that seems to address this as well.


Thank you both, will check on it 🙂

This setting tells Graphite how complete we expect the data in our time series database to be in order to make a clean aggregation between archives.

Small Example!

The first retention tells Graphite when it should expect a metric, in your case every minute:

1m:4w

(This is our raw data)

00:01:00 icinga2.value 2.0
00:02:00 icinga2.value 10.0
00:03:00 icinga2.value 1.0
00:04:00 icinga2.value 49.0
00:05:00 icinga2.value 38.0
00:06:00 icinga2.value 4.0

The second retention says: okay, we want to store 5-minute points for 23 weeks. That means Graphite will summarize five 1-minute points into one 5-minute point.

5m:23w

(This is our aggregated data in a perfect world; whisper actually aligns windows on 5-minute boundaries, but sliding windows make the math easier to follow)

00:05:00 icinga2.value 20.0 = ((2+10+1+49+38)/5) <- Because we want the average
00:06:00 icinga2.value 20.4 = ((10+1+49+38+4)/5) <- Because we want the average

Why perfect world? We learned already that sometimes a point (or several) can get lost because the check was killed or the point arrived too early / late. Graphite then stores a null (None) for that slot, not a 0.

Then in the real world you can get situations like this:

1m:4w

00:01:00 icinga2.value null <- dead check
00:02:00 icinga2.value null <- dead check
00:03:00 icinga2.value null <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4

5m:23w

With xFilesFactor = 0, Graphite averages whatever non-null values are left in the window:

00:05:00 icinga2.value 43.5 = ((49+38)/2) <- before 20.0, a jump of +23.5
00:06:00 icinga2.value 30.33 = ((49+38+4)/3) <- before 20.4, a jump of +9.93

To prevent such a rollercoaster, an aggregation schema is applied. The default one says:

[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average

(see the storage-aggregation.conf docs: https://graphite.readthedocs.io/en/latest/config-carbon.html?highlight=aggregationMethod#storage-aggregation-conf)

Pattern says: “We want to match everything”.

xFilesFactor says: “for a clean aggregation between our archives, at least 0.5 = 50% of the slots must contain non-null values”.

aggregationMethod says: “Use the average value for aggregation”

Now that we require 50% of the slots to be non-null, our real-world scenario looks like this:

1m:4w

00:01:00 icinga2.value null <- dead check
00:02:00 icinga2.value null <- dead check
00:03:00 icinga2.value null <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4

5m:23w

00:05:00 icinga2.value null <- only 2 of 5 slots are non-null | 40% < 50%, so the point is dropped
00:06:00 icinga2.value 30.33 = ((49+38+4)/3) <- 3 of 5 slots are non-null | 60% >= 50%, so we aggregate

WHAT!? yeah.

Since xFilesFactor is set to 0.5, we require at least 50% of our data to be non-null. That means if more than 50% of the slots in a window are null, Graphite will aggregate a null into the next archive.
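The whole rule fits in a few lines of Python. This is a simplified model of whisper’s rollup, not the real implementation: missing points are None, nulls are excluded from the average, and a window only produces a value when the known fraction reaches xFilesFactor:

```python
def aggregate(window, x_files_factor=0.5):
    """Roll one window of lower-archive points up into a single value,
    whisper-style: nulls (None) are excluded from the average, and if
    too few slots are known the aggregated point itself becomes null."""
    known = [p for p in window if p is not None]
    if not known or len(known) / len(window) < x_files_factor:
        return None   # not enough data: propagate a null upward
    return sum(known) / len(known)

# Perfect world: all five 1m slots present.
print(aggregate([2, 10, 1, 49, 38]))            # -> 20.0

# Real world: three dead-check slots. 2/5 = 40% known, below 0.5.
print(aggregate([None, None, None, 49, 38]))    # -> None

# Same window with xFilesFactor = 0: average whatever is there.
print(aggregate([None, None, None, 49, 38], 0)) # -> 43.5
```

With xFilesFactor = 0 the rollup never goes null, which keeps graphs continuous but lets sparse windows be skewed by the few surviving values.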

Here is some further reading about this: http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9


whisper-resize is normally used to convert existing whisper files to a new retention schema without losing the whole history.

This helped me a lot.
Also thx @m4k5ym for your explanation!

Some problems/questions:
I changed the host check interval from 5min to 1min and then changed the Graphite config.

After changing the storage schema and adding a 1m retention (retentions = 1m:3d,5m:10d,30m:90d,360m:4y), the look of my graphs changed.

Before (5min check interval, 5min retention):

After (1min check interval, 1min retention and whisper-resize):

Template is the same for both

Graphite Template
[icmp-rt.graph]
check_command = "icmp-host"

[icmp-rt.metrics_filters]
rtmin.value = "$host_name_template$.perfdata.rtmin.value"
rta.value = "$host_name_template$.perfdata.rta.value"
rtmax.value = "$host_name_template$.perfdata.rtmax.value"

[icmp-rt.urlparams]
areaAlpha = "0.5"
areaMode = "all"
lineWidth = "2"
min = "0"
yUnitSystem = "none"

[icmp-rt.functions]
rtmin.value = "alias(color(scale($metric$, 1000), '#44bb77'), 'Min. round trip time (ms)')"
rta.value = "alias(color(scale($metric$, 1000), '#ffaa44'), 'Avg. round trip time (ms)')"
rtmax.value = "alias(color(scale($metric$, 1000), '#ff5566'), 'Max. round trip time (ms)')"


[icmp-pl.graph]
check_command = "icmp-host"

[icmp-pl.metrics_filters]
pl.value = "$host_name_template$.perfdata.pl.value"

[icmp-pl.urlparams]
areaAlpha = "0.5"
areaMode = "all"
lineWidth = "2"
min = "0"
yUnitSystem = "none"

[icmp-pl.functions]
pl.value = "alias(color($metric$, '#1a7dd7'), 'Packet loss (%)')"

Why is the “new” graph so segmented?
If you zoom in it gets even worse.

Anything I missed or did wrong?

Deleting the resized wsp files and thus creating fresh ones shows the normal graph again.

I’m going to try using your retention settings minus the 1m average as I have no checks that are 1m intervals.

This is what we were discussing earlier on. That’s because you have null values, and xFilesFactor = 0 will average those out when they get to the 5-minute aggregation and so on. If you edit the INI file for a specific check, you can have it link those together. I personally like it because I can see it ramp up on the retries when something goes wrong. It does waste more disk space to have a 1m retention, but at 3 days it hardly matters.


So if I understand you correctly these gaps should vanish when it comes to the next aggregation level for the 5min interval.

edit: I deleted some of the old graphs yesterday. The newly created ones don’t have those gaps in them.

What do you mean by that?

Take a look at /etc/icingaweb2/modules/graphite/templates/load.ini for example, and you’ll see how it produces this:

You can create custom templates for any checks you want. The default is a single graph, blue, with lines marking time series values. In the case of this one, everything is linked together and 3 different performance metrics are mapped onto the same graph.

There’s a default.ini that will encompass this general one you’re seeing. You can modify that to your liking.


Ah, I thought you had something else in mind.

I already created some templates for other checks, myself 🙂

I was merely confused why the graph is different before and after the config change/resizing.
But I guess this is due to the values missing from the previous 5min check interval in the now 1min interval graph, correct?
Because deleting the old wsp files leads to normal-looking, correct graphs.

Is your output at 1m intervals? Because I’d imagine you’re missing data for your 1-minute period. Graphite’s web interface will also show plotted dots if you zoom in in this scenario.

Finally did that venv version I promised (didn’t use pipenv, found it to be needlessly awkward with the way graphite lays itself out). Also used the traditional carbon-cache/relay daemons rather than the Go alternative @dokon used. So, options!

https://community.icinga.com/t/a-distro-agnostic-guide-to-graphite-with-venv-examples-rhel8-debian10-ubuntu18/1424/2


I don’t think so, because these are the old 5min interval graphs changed with whisper-resize.
But I’d have thought that the graphs would “switch” to the normal view once there is enough 1min interval data.

Basically, this is what I expect and have been more or less describing. Here’s a 5m interval check on Graphite which supports 1m intervals. You’ll notice I suddenly have an immediate timepoint when my check went into soft state and increased its frequency of checking. When this aggregates to 5m later, I won’t see that, but from a what-just-happened-oh-noez standpoint, I like this implicit detail for the last few days.
