Well done. Thanks for making this. I never heard of go-carbon until now.
I already have the current CentOS version on my to-do list, and then, as mentioned before, the new releases of the operating systems.
I will post them here as well.
Until then, have a nice weekend.
Regards
David
Man you beat me to it. I just loaded up Ubuntu 18 and RHEL8 environments wanting to do a pipenv one tomorrow D:
Actually that's a completely different method so still gonna. Maybe drag ass longer though since I'm no longer competing.
If you have a different working method please share it with us. You're welcome to take as long as you need … the goal here is to present a working solution.
Best
David
I'm like 2/3 of the way through the doc but whiskey break happened and you know…
These gaps happen mostly because of a retention mismatch in Graphite's storage-schemas.conf. Graphite expects one metric point per period of time, which is defined as a retention in the storage schema. If your metric points do not match this defined retention, Graphite will by default store null as the value for that "too late/early" point.
Small Example:
If you tell Graphite to expect an Icinga 2 metric point every minute, your schema will look similar to this:
pattern = ^icinga2.*
retentions = 1m:7d,5m:30d,15m:90d,1h:1y
But if your check is scheduled every five minutes, the result is that in every "five minute slot" you have four null values and one perfdata value. These null values are your gaps. So you have to change the retention policy to this:
pattern = ^icinga2.*
retentions = 5m:30d,15m:90d,1h:1y
Now Graphite expects a point every five minutes and Icinga will deliver a point every five minutes, so both are happy.
\(^.^)/
Conclusion: you should define retentions matching the configured check_interval to avoid this situation.
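As a back-of-the-envelope sketch of the mismatch (hypothetical numbers: a 1m retention step against a 5m check interval):

```python
# Sketch: how many slots stay null when the retention step in
# storage-schemas.conf is finer than the Icinga check_interval.
# The numbers below are illustrative assumptions, not from any real config.

retention_step = 60    # seconds per slot in storage-schemas.conf (1m)
check_interval = 300   # seconds between check results (5m)

slots_per_check = check_interval // retention_step  # 5 slots per result
null_slots = slots_per_check - 1                    # only 1 slot gets a value

print(f"{null_slots} of {slots_per_check} slots are null")  # 4 of 5 slots are null
```

Those four null slots per check are exactly the gaps described above.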
Very helpful. Thank you. This makes sense. I will review my settings.
I'm trying to understand. So my check intervals for a service are 300 seconds (5m).
I used the example from the Icinga2 doc.
[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:2d,5m:10d,30m:90d,360m:4y
If I change 1m to 5m, does that solve the gap issue because it would be getting a metric every 5 minutes instead of 1 minute?
How does the retention break down?
Does the above mean: get a metric every 1 minute and keep it for 2 days; get a metric every 5 minutes and keep it for 10 days; get a metric every 30 minutes and keep it for 90 days; and get a metric every 360 minutes (6 hours) and keep it for 4 years?
Most of my service checks are 300 second intervals. I'm trying to understand and determine the best default setting to use in order to review short term data and long term data for trending analysis etc…
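To make the breakdown concrete, here is a small Python sketch (not part of Graphite itself) that expands a retentions string the same way carbon reads it, using whisper's unit suffixes (s, m = minutes, h, d, w, y):

```python
# Expand a storage-schemas.conf retentions string into plain English.
# Handles only the unit-suffixed form used in this thread (e.g. "5m:10d").

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse_duration(token):
    """'5m' -> 300 seconds."""
    return int(token[:-1]) * UNITS[token[-1]]

def explain(retentions):
    for archive in retentions.split(","):
        step, keep = archive.strip().split(":")
        points = parse_duration(keep) // parse_duration(step)
        print(f"one point every {step}, kept for {keep} ({points} points)")

explain("1m:2d,5m:10d,30m:90d,360m:4y")
# one point every 1m, kept for 2d (2880 points)
# one point every 5m, kept for 10d (2880 points)
# one point every 30m, kept for 90d (4320 points)
# one point every 360m, kept for 4y (5840 points)
```

So yes: each archive is "one point per step, kept for that long" — and 360m is 360 minutes (6 hours), not months.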
Don't go tweaking metric retention for this; it's totally unnecessary. The default graph view in Icinga's graphite module shows gaps where values are null, which is more or less visible depending on what timeframe you have it showing you. It's helpful when you need it to be, and graphite averages things over time anyway. I'm not at a computer atm, but I know there are other settings for how it handles null values. Your output is also a moving target; retry intervals default to 30s. I use similar retentions to yours.
For aesthetics, look at some of the inis in the templates folder for the graphite module in icingaweb.
What do you have your retention set to?
I switched mine to the example mentioned a few posts back of:
retentions = 5m:30d,15m:90d,1h:1y
But I'm wondering what others are doing. Seeing as we seem to have similar check intervals etc… I'm curious what you use.
I've got a 1m, but that's because I do have a lot of checks that run every 1-2 minutes, as well as the retries in critical scenarios to see if it recovers quickly or not.
[icinga-hosts]
pattern = ^icinga2\..*\.host\.
retentions = 5m:12w
[icinga-services]
pattern = ^icinga2\..*\.services\.
retentions = 1m:4w, 5m:26w, 15m:1y, 30m:2y
Does having the 1m negatively impact the checks that are set to 5 minutes?
Actually, let's talk about storage-aggregation.conf for a second.
xFilesFactor is a confusing setting. From the docs:
xFilesFactor
"should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level's slots must have non-null values in order to aggregate to a non-null value. The default is 0.5."
I just tell that thing I don't care how empty it looks, just average my stuff.
[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average
Can I change this on a (long-)running system without any further steps?
Afaik you have to do some "magic" when changing the retention intervals once files have already been created.
It's going to be wherever you have this little guy hiding. I've seen a lot of people use find with -exec for this, which makes me nervous, and you should definitely back up first. It's also good for cases where you have a check that runs once an hour and don't need it wasting disk space.
# whisper-resize.py
Usage: whisper-resize.py path timePerPoint:timeToStore [timePerPoint:timeToStore]*
timePerPoint and timeToStore specify lengths of time, for example:
60:1440 60 seconds per datapoint, 1440 datapoints = 1 day of retention
15m:8 15 minutes per datapoint, 8 datapoints = 2 hours of retention
1h:7d 1 hour per datapoint, 7 days of retention
12h:2y 12 hours per datapoint, 2 years of retention
Options:
-h, --help show this help message and exit
--xFilesFactor=XFILESFACTOR
Change the xFilesFactor
--aggregationMethod=AGGREGATIONMETHOD
Change the aggregation function (average, sum, last,
max, min, avg_zero, absmax, absmin)
--force Perform a destructive change
--newfile=NEWFILE Create a new database file without removing the
existing one
--nobackup Delete the .bak file after successful execution
--aggregate Try to aggregate the values to fit the new archive
better. Note that this will make things slower and use
more memory.
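The find/-exec pattern mentioned above can be sketched like this. The path is an assumption (whisper locations vary by package; adjust WSP_DIR to your install) and, as written, it is a dry run against a throwaway directory that only prints the commands — back up before doing this for real:

```shell
# Dry run of resizing every Icinga whisper file to a new schema.
# WSP_DIR here defaults to a throwaway temp dir for demonstration;
# point it at your real tree, e.g. /var/lib/graphite/whisper/icinga2.
WSP_DIR="${WSP_DIR:-$(mktemp -d)}"
touch "$WSP_DIR/demo.wsp"   # stand-in file so the dry run prints something

# Print (not run) the resize command for every whisper file found.
find "$WSP_DIR" -name '*.wsp' \
  -exec echo whisper-resize.py {} 5m:30d 15m:90d 1h:1y --aggregate \;
# Drop the 'echo' only after backing up and checking the printed list.
```

Keep --aggregate if you want the old points re-averaged into the new archives; without it the values are migrated as-is.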
I came across this blog post that seems to address this as well.
Thank you both, will check on it
This one (xFilesFactor) is used to tell Graphite how much data quality we expect in our time series database to make a clean aggregation between archives.
Small Example!
The first retention tells Graphite when it should expect a metric, in your case every minute:
1m:4w
(This is our raw data)
00:01:00 icinga2.value 2.0
00:02:00 icinga2.value 10.0
00:03:00 icinga2.value 1.0
00:04:00 icinga2.value 49.0
00:05:00 icinga2.value 38.0
00:06:00 icinga2.value 4.0
The second retention says: okay, we want one point every 5 minutes, stored for 23 weeks. That means Graphite will summarize five 1-minute points into one 5-minute point.
5m:23w
(This is our already aggregated data in a perfect world)
00:05:00 icinga2.value 20.0 = ((2+10+1+49+38)/5) <- because we want the average
00:06:00 icinga2.value 20.4 = ((10+1+49+38+4)/5) <- because we want the average
Why perfect world? We learned already that sometimes a point (or several) can get lost because the check was killed or the point arrived too early/late. Graphite will then store null for that point.
So in the real world you can get situations like this:
1m:4w
00:01:00 icinga2.value null <- dead check
00:02:00 icinga2.value null <- dead check
00:03:00 icinga2.value null <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4
Whisper ignores nulls when averaging, so without any quality threshold (xFilesFactor = 0) the aggregates are built from whatever points are left:
5m:23w
00:05:00 icinga2.value 43.5 = ((49+38)/2) <- before 20.0, built from only 2 of 5 points
00:06:00 icinga2.value 30.33 = ((49+38+4)/3) <- before 20.4, built from only 3 of 5 points
To prevent such a rollercoaster, an aggregation schema is applied on top. The default one says:
[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = [average](https://graphite.readthedocs.io/en/latest/config-carbon.html?highlight=aggregationMethod#storage-aggregation-conf)
pattern says: "match everything".
xFilesFactor says: "to aggregate to a non-null value, at least 0.5 = 50% of the slots in the lower archive must be non-null".
aggregationMethod says: "use the average value for aggregation".
Now that we require 50% of the points to be non-null, our real world scenario looks like this:
1m:4w
00:01:00 icinga2.value null <- dead check
00:02:00 icinga2.value null <- dead check
00:03:00 icinga2.value null <- dead check
00:04:00 icinga2.value 49 <- admin fixed the check
00:05:00 icinga2.value 38
00:06:00 icinga2.value 4
5m:23w
00:05:00 icinga2.value null <- only 2 of 5 points (40%) are non-null, below the 50% threshold
00:06:00 icinga2.value 30.33 = ((49+38+4)/3) <- 3 of 5 points (60%) are non-null, so it aggregates
WHAT!? Yeah. Since xFilesFactor is set to 0.5, any slot where less than 50% of the points are non-null is aggregated to null in the next archive instead of a misleading number.
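The rule can be sketched in a few lines of Python — a toy model of the aggregation behavior described above, not actual Graphite code, using the same numbers as the example:

```python
# Toy model of whisper's archive aggregation: average the non-null points
# in one lower-archive window, but only emit a value if the fraction of
# non-null slots reaches xFilesFactor.

def aggregate(points, x_files_factor=0.5):
    """points: one window of the lower archive; None represents a null slot."""
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < x_files_factor:
        return None                    # too many gaps: propagate null upward
    return sum(known) / len(known)     # the average ignores null slots

print(aggregate([None, None, None, 49, 38]))                    # None (40% non-null)
print(aggregate([None, None, 49, 38, 4]))                       # 30.33... (60% non-null)
print(aggregate([None, None, None, 49, 38], x_files_factor=0))  # 43.5
```

With xFilesFactor = 0 (as in the [default_average] example earlier in the thread) even a mostly-empty window still produces an average.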
Here is some further reading about this: http://obfuscurity.com/2012/04/Unhelpful-Graphite-Tip-9
whisper-resize.py is normally used to convert existing whisper files to a new retention schema without losing the whole history.