Graphite installation for Ubuntu

Well done. Thanks for making this. I never heard of go-carbon until now.

2 Likes

I already have the current CentOS version on my to-do list, and then, as mentioned before, the newer releases of the operating systems.

I will post them here as well :stuck_out_tongue:

until then have a nice weekend.

Regards

David

Man you beat me to it. I just loaded up Ubuntu 18 and RHEL8 environments wanting to do a pipenv one tomorrow D:

Actually that's a completely different method so still gonna. Maybe drag ass longer though since I'm no longer competing.

1 Like

If you have a different working method please share it with us, you're welcome to take as long as you need … the goal here is to present a working solution :smiley:

Best

David

I'm like 2/3 of the way through the doc but whiskey break happened and you know…

1 Like

Hi David,

thanks a lot for sharing - I have moved this into the howto section :slight_smile:

Cheers,
Michael

1 Like

These gaps happen mostly because of a retention mismatch in Graphite's storage-schemas.conf. Graphite expects one metric point per time period, as defined by a retention in the storage schema. If your metric points do not match this defined retention, Graphite will by default store null as the value for that "too late/early" point.

Small Example:
If you tell Graphite to expect an Icinga 2 metric point every minute, your schema will look similar to this:

pattern = ^icinga2.*
retentions = 1m:7d,5m:30d,15m:90d,1h:1y

But if your check is scheduled every five minutes, the result is that in every "five minute slot" you have four null values and one perfdata value. These null values are, for example, your gaps. So you have to change the retention policy to this:

pattern = ^icinga2.*
retentions = 5m:30d,15m:90d,1h:1y

Now Graphite expects a point every five minutes and Icinga delivers a point every five minutes, so both are happy.

\(^.^)/

Conclusion: You should define retentions matching the specified check_interval to avoid this situation.
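That conclusion can be sketched in a few lines of plain Python (no Graphite needed; the 1-minute slot grid and the 300-second check are just the numbers from the example above):

```python
# A "1m" retention means whisper keeps one slot per 60 seconds.
# A check that only reports every 300 seconds fills 1 slot in 5;
# the other 4 slots stay None -- those are the gaps in the graph.
SLOT = 60      # seconds per point (retention precision "1m")
CHECK = 300    # check_interval: one perfdata value per 5 minutes

slots = []
for t in range(0, 1800, SLOT):  # half an hour of 1-minute slots
    slots.append(42.0 if t % CHECK == 0 else None)

print(f"{slots.count(None)} of {len(slots)} slots are null")
# -> 24 of 30 slots are null
```

With a 5m precision instead (SLOT = 300), every slot gets a value and the gaps disappear.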

3 Likes

Very helpful. Thank you. This makes sense. I will review my settings.

I'm trying to understand. So my check intervals for a service are 300 seconds (5m).

I used the example from the Icinga2 doc.

[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:2d,5m:10d,30m:90d,360m:4y

If I change 1m to 5m, does that solve the gap issue because it would be getting a metric every 5 minutes instead of 1 minute?

How does the retention break down?

Does the above mean: get a metric every 1 minute and keep it for 2 days, get a metric every 5 minutes and keep it for 10 days, get a metric every 30 minutes and keep that for 90 days, and get a metric every 360 minutes and keep it for 4 years?

Most of my service checks are 300 second intervals. I'm trying to understand and determine the best default setting to use in order to review short term data and long term data for trending analysis etc…
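Yes, that reading is right, except that "360m" means 360 minutes (6 hours), not months. Here is a quick sketch that expands a retentions line the way carbon interprets it (the s/m/h/d/w/y unit suffixes come from Graphite's storage-schemas documentation):

```python
# Seconds per unit suffix, as used in storage-schemas.conf.
UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def to_seconds(token: str) -> int:
    """'5m' -> 300, '10d' -> 864000; bare numbers are already seconds."""
    if token[-1].isdigit():
        return int(token)
    return int(token[:-1]) * UNITS[token[-1]]

def explain(retentions: str) -> None:
    """Print each archive as 'one point every X, kept Y = N points'."""
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        p, d = to_seconds(precision), to_seconds(duration)
        print(f"one point every {precision}, kept {duration} = {d // p} points")

explain("1m:2d,5m:10d,30m:90d,360m:4y")
```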

Don't go tweaking metric retention for this; it's totally unnecessary. The default graph view in Icinga's graphite module shows gaps for nulls, which are more or less visible depending on what timeframe you have it showing you. It's helpful when you need it to be, and graphite averages things over time anyway. I'm not at a computer atm, but I know there are other settings for how it handles null values. Your output is also a moving target; retry intervals default to 30s. I use similar retentions to yours.

For aesthetics, look at some of the inis in the templates folder for the graphite module in icingaweb.

1 Like

What do you have your retention set to?

I switched mine to the example mentioned a few posts back of:

retentions = 5m:30d,15m:90d,1h:1y

But I'm wondering what others are doing. Seeing as we seem to have similar check intervals etc… I'm curious what you use.

I've got a 1m, but that's because I do have a lot of checks that run every 1-2 minutes, as well as the retries in critical scenarios to see if it recovers quickly or not.

[icinga-hosts]
pattern = ^icinga2\..*\.host\.
retentions = 5m:12w

[icinga-services]
pattern = ^icinga2\..*\.services\.
retentions = 1m:4w, 5m:26w, 15m:1y, 30m:2y

Does having the 1m negatively impact the checks that are set to 5 minutes?

Actually, let's talk about storage-aggregation.conf for a second.

xFilesFactor is a confusing setting, from the docs:

xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level's slots must have non-null values in order to aggregate to a non-null value. The default is 0.5.

I just tell that thing I don't care how empty it looks, just average my stuff.

[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average

1 Like
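Here's what that xFilesFactor = 0 buys you, as a small sketch in plain Python that mimics whisper's roll-up rule (the numbers are made up):

```python
def aggregate(points, x_files_factor):
    """Mimic whisper's roll-up of one bucket of lower-archive slots.

    Nulls (None) are excluded from the average; if the fraction of
    known points is below x_files_factor, the result itself is None."""
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < x_files_factor:
        return None
    return sum(known) / len(known)

bucket = [None, None, None, 49.0, 38.0]  # only 2 of 5 slots have data

print(aggregate(bucket, 0.5))  # None -> a gap in the next archive
print(aggregate(bucket, 0))    # 43.5 -> averaged from what's there
```

So with xFilesFactor = 0 the higher archive keeps an average of whatever did arrive, instead of a gap.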

Can I change this on a (long) running system without any further doings?

Afaik you have to do some "magic" when changing the retention intervals, when there are already files created.

It's going to be wherever you have this little guy hiding. I've seen a lot of people use find -exec with this thing, which makes me nervous, so you should definitely back up first. Also good for cases where you have a check that runs once an hour and don't need it wasting disk space.

# whisper-resize.py 
Usage: whisper-resize.py path timePerPoint:timeToStore [timePerPoint:timeToStore]*

timePerPoint and timeToStore specify lengths of time, for example:

60:1440      60 seconds per datapoint, 1440 datapoints = 1 day of retention
15m:8        15 minutes per datapoint, 8 datapoints = 2 hours of retention
1h:7d        1 hour per datapoint, 7 days of retention
12h:2y       12 hours per datapoint, 2 years of retention


Options:
  -h, --help            show this help message and exit
  --xFilesFactor=XFILESFACTOR
                        Change the xFilesFactor
  --aggregationMethod=AGGREGATIONMETHOD
                        Change the aggregation function (average, sum, last,
                        max, min, avg_zero, absmax, absmin)
  --force               Perform a destructive change
  --newfile=NEWFILE     Create a new database file without removing the
                        existing one
  --nobackup            Delete the .bak file after successful execution
  --aggregate           Try to aggregate the values to fit the new archive
                        better. Note that this will make things slower and use
                        more memory.
2 Likes

I came across this blog post that seems to address this as well.

2 Likes

Thank you both, will check on it :slight_smile:

This setting tells Graphite how much data quality we expect in our time series database in order to aggregate cleanly between archives.

Small Example!

The first retention tells Graphite when it should expect a metric, in your case every minute.

1m:4w

(This is our raw data)

00:01:00 icinga2.value 2.0
00:02:00 icinga2.value 10.0
00:03:00 icinga2.value 1.0
00:04:00 icinga2.value 49.0
00:05:00 icinga2.value 38.0
00:06:00 icinga2.value 4.0

The second retention says: okay, we want to store 5-minute points for 23 weeks. That means Graphite will summarize five 1-minute points into one 5-minute point.

5m:23w

(This is our already aggregated data in a perfect world)

00:05:00 icinga2.value 20.0 = ((2+10+1+49+38)/5) <- Because we want the average
00:06:00 icinga2.value 20.4 = ((10+1+49+38+4)/5) <- Because we want the average

Why perfect world? We learned already that sometimes a point (or several) can get lost because the check was killed or the point arrived too early / late. Graphite then stores a default value, null (written as 0 in the example below), for this point.

So in the real world you can get situations like this:

1m:4w

00:01:00 icinga2.value 0 <- dead check
00:02:00 icinga2.value 0 <- dead check
00:03:00 icinga2.value 0 <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4

5m:23w

00:05:00 icinga2.value 17.4 = ((0+0+0+49+38)/5) <- Before 20.0, difference -2.6
00:06:00 icinga2.value 18.2 = ((0+0+49+38+4)/5) <- Before 20.4, difference -2.2

To prevent such a rollercoaster, an aggregation schema is applied. The default one says:

[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average

(see https://graphite.readthedocs.io/en/latest/config-carbon.html?highlight=aggregationMethod#storage-aggregation-conf)

pattern says: "We want to match everything".

xFilesFactor says: "Okay, to aggregate cleanly between our archives we want 0.5 = 50% of the data to be non-null".

aggregationMethod says: "Use the average value for aggregation".

Now we expect 50% of the data to be non-null, so our real world scenario looks like this:

1m:4w

00:01:00 icinga2.value 0 <- dead check
00:02:00 icinga2.value 0 <- dead check
00:03:00 icinga2.value 0 <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4

5m:23w

00:05:00 icinga2.value 0 <- 3 of 5 points are 0 (60%), so the aggregate is forced to 0
00:06:00 icinga2.value 18.2 = ((0+0+49+38+4)/5) <- only 2 of 5 points are 0 (40%), so it aggregates normally

WHAT!? yeah.

Since xFilesFactor is set to 0.5, we require at least 50% of our data to be non-null. That means if 50% or more of our data are null (0 here), Graphite will aggregate a null into the next archive.
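That behaviour is easy to reproduce in a few lines (a sketch in plain Python; note that real whisper stores the dead-check slots as null and leaves them out of the average, which is exactly what lets xFilesFactor veto the roll-up):

```python
def rollup(window, x_files_factor=0.5):
    """Average five 1-minute slots into one 5-minute slot, whisper-style:
    nulls are ignored, and if fewer than x_files_factor of the slots
    are known, the rolled-up point is null as well."""
    known = [v for v in window if v is not None]
    if len(known) / len(window) < x_files_factor:
        return None
    return sum(known) / len(known)

healthy = [2.0, 10.0, 1.0, 49.0, 38.0]    # all five checks reported
broken = [None, None, None, 49.0, 38.0]   # three dead checks

print(rollup(healthy))  # 20.0 -- the "perfect world" 5m point
print(rollup(broken))   # None -- only 40% known, below 50%, so no skewed average
```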

Here is some further reading about this: http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9

1 Like

whisper-resize is normally used to convert existing whisper files to a new retention schema without losing the whole history.
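In practice that usually looks something like the sketch below; the storage path and the target schema are assumptions, so adjust both to your setup, and back up first since whisper-resize.py rewrites the .wsp files in place:

```shell
# Sketch only: adjust WHISPER_DIR and the target schema to your setup.
WHISPER_DIR="${WHISPER_DIR:-/var/lib/graphite/whisper/icinga2}"

if [ -d "$WHISPER_DIR" ]; then
    # Back up first -- resizing rewrites each archive in place.
    tar czf whisper-backup.tar.gz "$WHISPER_DIR"

    # Resize every archive to the new schema; without --nobackup,
    # whisper-resize also leaves a .bak file next to each archive.
    find "$WHISPER_DIR" -name '*.wsp' \
        -exec whisper-resize.py {} 5m:30d 15m:90d 1h:1y \;
else
    echo "no whisper dir at $WHISPER_DIR, nothing to do"
fi
```

Per the help output above, data is not re-aggregated into the new archives unless you also pass --aggregate.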