Graphite installation for Ubuntu

Well done. Thanks for making this. I never heard of go-carbon until now.

2 Likes

I already have the current CentOS version on my to-do list, and then, as mentioned before, the newer releases of the operating systems.

I will post them here as well :stuck_out_tongue:

Until then, have a nice weekend.

Regards

David

Man you beat me to it. I just loaded up Ubuntu 18 and RHEL8 environments wanting to do a pipenv one tomorrow D:

Actually that’s a completely different method so still gonna. Maybe drag ass longer though since I’m no longer competing.

1 Like

If you have a different working method, please share it with us; you're welcome to take as much time as you need … the goal here is to present a working solution :smiley:

Best

David

I’m like 2/3 of the way through the doc but whiskey break happened and you know…

1 Like

Hi David,

thanks a lot for sharing - I have moved this into the #howto section :slight_smile:

Cheers,
Michael

1 Like

These gaps happen mostly because of a retention mismatch in the Graphite storage schema. Graphite expects a metric point within each period of time that is defined as a retention in storage-schemas.conf. If your metric points do not match this defined retention, Graphite will by default store null as the value for that “too late / too early” point.

Small Example:
If you tell Graphite to expect an Icinga 2 metric point every minute, your schema will look similar to this:

[icinga2]
pattern = ^icinga2.*
retentions = 1m:7d,5m:30d,15m:90d,1h:1y

But if your check is scheduled every five minutes, each “five minute slot” ends up with four null values and one perfdata value. Those null values are your gaps. So you have to change the retention policy to this:

[icinga2]
pattern = ^icinga2.*
retentions = 5m:30d,15m:90d,1h:1y

Now Graphite expects a point every five minutes and Icinga delivers a point every five minutes, so both are happy.

\(^.^)/

Conclusion: You should define retentions matching the specified check_interval to avoid this situation.
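If you want to verify which retentions an existing metric actually uses, whisper-info.py (which ships with Graphite) can show you. A minimal sketch; the .wsp path below is just an example, adjust it to your whisper data directory:

# print the settings and archive layout (seconds per point, retention) of one metric
whisper-info.py /var/lib/graphite/whisper/icinga2/myhost/services/ping4/ping4/perfdata/rta/value.wsp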

3 Likes

Very helpful. Thank you. This makes sense. I will review my settings.

I’m trying to understand. So my check intervals for a service are 300 seconds (5m).

I used the example from the Icinga2 doc.

[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:2d,5m:10d,30m:90d,360m:4y

If I change 1m to 5m, does that solve the gap issue because it would be getting a metric every 5 minutes instead of 1 minute?

How does the retention break down?

Does the above mean: get a metric every 1 minute and keep it for 2 days, get a metric every 5 minutes and keep it for 10 days, get a metric every 30 minutes and keep it for 90 days, and get a metric every 360 minutes (6 hours) and keep it for 4 years?

Most of my service checks run at 300-second intervals. I’m trying to understand this and determine the best default settings for reviewing both short-term and long-term data for trend analysis.

Don’t go tweaking metric retention for this; it’s totally unnecessary. The default graph view in Icinga’s graphite module leaves a gap wherever there is a null value, which is more or less visible depending on the timeframe you have it showing. It’s helpful when you need it to be, and Graphite averages things over time anyway. I’m not at a computer atm, but I know there are other settings for how null values are handled. Your output is also a moving target; retry intervals default to 30s. I use retentions similar to yours.
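For instance, graphite-web’s render API can draw nulls as zero instead of leaving gaps; a hedged example, where the host and metric path are made up:

# render a graph with nulls drawn as zero instead of gaps
https://graphite.example.com/render?target=icinga2.myhost.services.ping4.ping4.perfdata.rta.value&drawNullAsZero=true&from=-24h

The transformNull() render function can do the same per target, if you only want it for some series.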

For aesthetics, look at some of the inis in the templates folder for the graphite module in icingaweb.

1 Like

What do you have your retention set to?

I switched mine to the example mentioned a few posts back of:

retentions = 5m:30d,15m:90d,1h:1y

But I’m wondering what others are doing. Seeing as we seem to have similar check intervals etc…I’m curious what you use.

I’ve got a 1m, but that’s because I do have a lot of checks that run every 1-2 minutes, as well as the retries in critical scenarios to see if it recovers quickly or not.

[icinga-hosts]
pattern = ^icinga2\..*\.host\.
retentions = 5m:12w

[icinga-services]
pattern = ^icinga2\..*\.services\.
retentions = 1m:4w, 5m:26w, 15m:1y, 30m:2y

Does having the 1m negatively impact the checks that are set to 5 minutes?

Actually, let’s talk about storage-aggregation.conf for a second.

xFilesFactor is a confusing setting. From the docs:

“xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5.”

I just tell that thing I don’t care how empty it looks, just average my stuff.

[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average
1 Like

Can I change this on a (long-)running system without any further steps?

Afaik you have to do some “magic” when changing the retention intervals if files have already been created.

It’s going to be whisper-resize.py, wherever your install has this little guy hiding. I’ve seen a lot of people use find with -exec for this, which makes me nervous, so you should definitely back up first (see the sketch after the usage output below). It’s also good for cases where you have a check that runs once an hour and you don’t need it wasting disk space.

# whisper-resize.py 
Usage: whisper-resize.py path timePerPoint:timeToStore [timePerPoint:timeToStore]*

timePerPoint and timeToStore specify lengths of time, for example:

60:1440      60 seconds per datapoint, 1440 datapoints = 1 day of retention
15m:8        15 minutes per datapoint, 8 datapoints = 2 hours of retention
1h:7d        1 hour per datapoint, 7 days of retention
12h:2y       12 hours per datapoint, 2 years of retention


Options:
  -h, --help            show this help message and exit
  --xFilesFactor=XFILESFACTOR
                        Change the xFilesFactor
  --aggregationMethod=AGGREGATIONMETHOD
                        Change the aggregation function (average, sum, last,
                        max, min, avg_zero, absmax, absmin)
  --force               Perform a destructive change
  --newfile=NEWFILE     Create a new database file without removing the
                        existing one
  --nobackup            Delete the .bak file after successful execution
  --aggregate           Try to aggregate the values to fit the new archive
                        better. Note that this will make things slower and use
                        more memory.
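The find/-exec pattern mentioned above tends to look something like this. A hedged sketch, assuming your whisper files live under /var/lib/graphite/whisper/icinga2 (paths and retentions vary per setup); back up the tree first, and consider stopping carbon-cache while resizing:

# back up the whisper tree, then resize every icinga2 metric in place
cp -a /var/lib/graphite/whisper/icinga2 /var/lib/graphite/whisper/icinga2.bak
find /var/lib/graphite/whisper/icinga2 -name '*.wsp' -exec whisper-resize.py {} 5m:30d 15m:90d 1h:1y \;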
2 Likes

I came across this blog post that seems to address this as well.

2 Likes

Thank you both, will check on it :slight_smile:

This one is used to tell Graphite how much data quality we expect in our time series database in order to aggregate cleanly between archives.

Small Example!

The first retention tells Graphite when it should expect a metric, in your case every minute.

1m:4w

(This is our raw data)

00:01:00 icinga2.value 2.0
00:02:00 icinga2.value 10.0
00:03:00 icinga2.value 1.0
00:04:00 icinga2.value 49.0
00:05:00 icinga2.value 38.0
00:06:00 icinga2.value 4.0

The second retention says: okay, we want to store 5-minute points for 23 weeks. That means Graphite will summarize five 1-minute points into one 5-minute point.

5m:23w

(This is our already aggregated data in a perfect world)

00:05:00 icinga2.value 20.0 = ((2+10+1+49+38)/5) <- because we want the average
00:06:00 icinga2.value 20.4 = ((10+1+49+38+4)/5) <- because we want the average

Why a perfect world? We already learned that sometimes a point (or several) can get lost because the check was killed or the point arrived too early / too late. Graphite will then set a default value like 0 for this point.

Then in the real world you can get situations like these:

1m:4w

00:01:00 icinga2.value 0 <- dead check
00:02:00 icinga2.value 0 <- dead check
00:03:00 icinga2.value 0 <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4

5m:23w

00:05:00 icinga2.value 17.4 = ((0+0+0+49+38)/5) <- before: 20.0, difference -2.6
00:06:00 icinga2.value 18.2 = ((0+0+49+38+4)/5) <- before: 20.4, difference -2.2

To prevent such a rollercoaster, an aggregation schema is applied on top. The default one says:

[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average

(See the storage-aggregation.conf documentation: https://graphite.readthedocs.io/en/latest/config-carbon.html?highlight=aggregationMethod#storage-aggregation-conf)

Pattern says: “We want to match everything”.

xFilesFactor says: “To aggregate cleanly between our archives, we want at least 0.5 = 50% of the data to be valid, not 0.”

aggregationMethod says: “Use the average value for aggregation”

Now that we expect at least 50% of the data to be valid, our real-world scenario looks like this:

1m:4w

00:01:00 icinga2.value 0 <- dead check
00:02:00 icinga2.value 0 <- dead check
00:03:00 icinga2.value 0 <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4

5m:23w

00:05:00 icinga2.value 0 = ((0+0+0+49+38)/5) <- 3 of 5 points are dead: only 40% valid, below 50%, so the aggregate becomes 0
00:06:00 icinga2.value 18.2 = ((0+0+49+38+4)/5) <- 2 of 5 points are dead: 60% valid, above 50%, so the aggregate is 18.2

WHAT!? yeah.

Since xFilesFactor is set to 0.5, we require at least 50% of our data to be valid. That means if 50% or more of our points are 0, Graphite will aggregate a 0 into the next archive.

Here is some further reading about this: http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9
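If you want existing whisper files to pick up a new xFilesFactor (storage-aggregation.conf only applies when a file is created), whisper-resize.py from the usage output earlier can change it in place. A sketch with an example path; keep the retentions the file already has:

# set xFilesFactor=0 on an existing file while keeping its retentions
whisper-resize.py /var/lib/graphite/whisper/icinga2/myhost/services/ping4/ping4/perfdata/rta/value.wsp --xFilesFactor=0 5m:30d 15m:90d 1h:1y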

1 Like

whisper-resize is normally used to convert existing whisper files to a new retention schema without losing the whole history.
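For example, converting a file from the 1m-based schema to the 5m-based one discussed above might look like this; a sketch with a made-up path, using the --aggregate flag from the usage output so old points are averaged into the new, coarser archives:

# resize to the new schema, aggregating existing points into the new archives
whisper-resize.py /var/lib/graphite/whisper/icinga2/myhost/services/ping4/ping4/perfdata/rta/value.wsp --aggregate 5m:30d 15m:90d 1h:1y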