Graphite resets graphs after each server reboot

Every time I reboot, my monitoring server seems to reset all graphs.
This is happening on two separate servers.
I even reinstalled one of the servers, strictly following the installation instructions.
Any idea what could be happening here?
Or how I can best start debugging this issue?

Versions:
Debian GNU/Linux 10 (with all updates)
icinga2 2.12.4
icingaweb2 2.8.2
mariadb 10.3.29
apache 2.4.38
graphite 1.1.4

Before:

After the reboot:

Thank you,
Andy

Hi.

As a starting point, it would be interesting to know if the whisper files are really no longer present.

Maybe you could look for the whisper files (e.g. at /opt/graphite/storage/whisper/icinga2/…) and check if older performance-data is present.
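
For example, something like this lists the whisper files together with their size and last modification time (adjust the base path to your installation):

find /opt/graphite/storage/whisper/icinga2 -name '*.wsp' -ls | head
# "-ls" prints an ls-style line (size, mtime, path) for each whisper file found
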
Look for the tool whisper-fetch.py, which accepts a Unix epoch timestamp for its "--from" argument.

Syntax:

/path/to/whisper-fetch.py <path-to-whisper-file> --from=<timestamp_from_when> [--pretty]

Example:

/opt/graphite/bin/whisper-fetch.py max.wsp --from=1624177772 --pretty
# or : /opt/graphite/bin/whisper-fetch.py max.wsp --from=1624177772 --pretty | head
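
If you need an epoch timestamp for "--from", GNU date can generate one, for example:

date -d '7 days ago' +%s
# prints the Unix epoch timestamp of one week ago, ready to be passed to --from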

If there is no data present older than 1 day, you should have a look at the
storage-schemas.conf. This excellent post might be helpful.

Hope it helps.


Greetings.

Thank you very much for this info.
whisper-fetch shows nicely that after a reboot values simply disappear.
But not for all days or all times. E.g. I just did a reboot and only these are left:
Mon Jun 14 13:00:00 2021 -0.000177
Mon Jun 14 13:30:00 2021 -0.000270
Mon Jun 14 14:00:00 2021 -0.000026
Mon Jun 21 09:30:00 2021 -0.000058
[…]
Tue Jun 22 01:00:00 2021 -0.000449

Before and after this time range, values are set to “None”.

I don’t think it has to do with the schema definitions.
They are in my case:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[icinga2_internals]
pattern = ^icinga2\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:5d,5m:14d,30m:90d,120m:4y

[default]
pattern = .*
retentions = 1m:5d,5m:14d,30m:90d,120m:4y

Do you see an error here?

Thanks so much!

Additional remarks:

My usual check_interval is set to 60s.

Since I was surprised that there were only values every 30 minutes, I ran whisper-fetch without "--from", and then it showed a value for every minute. Is this a known side effect? The documentation doesn’t mention it:

OPTIONS
       --from Unix epoch time of the beginning of your requested interval (default: 24 hours ago).

$ whisper-fetch value.wsp --from=1618993772 --pretty

...
Wed Jun 23 16:00:00 2021    0.000013
Wed Jun 23 16:30:00 2021    -0.000121
Wed Jun 23 17:00:00 2021    0.000457
Wed Jun 23 17:30:00 2021    -0.000375
...

$ whisper-fetch value.wsp --pretty

...
Wed Jun 23 15:59:00 2021    0.000066
Wed Jun 23 16:00:00 2021    -0.000097
Wed Jun 23 16:01:00 2021    -0.000178
Wed Jun 23 16:02:00 2021    -0.000570
Wed Jun 23 16:03:00 2021    0.000293
...

$ whisper-info value.wsp

aggregationMethod: average
maxRetention: 126144000
xFilesFactor: 0.5
fileSize: 396928

Archive 0
offset: 64
secondsPerPoint: 60
points: 7200
retention: 432000
size: 86400

Archive 1
offset: 86464
secondsPerPoint: 300
points: 4032
retention: 1209600
size: 48384

Archive 2
offset: 134848
secondsPerPoint: 1800
points: 4320
retention: 7776000
size: 51840

Archive 3
offset: 186688
secondsPerPoint: 7200
points: 17520
retention: 126144000
size: 210240

I’d say this is to be expected. The timestamp 1618993772 dates to Apr 21st, which is inside the 90d retention period defined in the schema definition. This retention aggregates the values of the previous retention period (5m:14d), so it basically makes one “30m” value out of six “5m” values.
The post linked by homerjay explains this in more detail :slight_smile:
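
As far as I understand the archive selection, whisper answers a query from the finest archive whose retention still covers the whole requested range, so a quick look at the timestamp already explains the 30-minute steps:

date -d @1618993772
# → Wed Apr 21 2021, roughly 9 weeks before your fetch on Jun 23rd
# 9 weeks is outside the 14d retention of the 5m archive but inside the 90d retention
# of the 30m archive, so the query is answered from the 30m archive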

As to why the values/graphs are disappearing after a reboot, I have no idea atm. I never had that issue before.


Hi again.

Sorry, I also have no idea; I have never had this issue either.
Maybe @dgoetz or @blakehartshorn can give you a hint.

It would be nice to know which process exactly causes this problem.
Does it also happen if you only restart the graphite-service?
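
For a test, something like this would restart only the graphite side (the unit names depend on your packaging, so better check them first):

systemctl list-units 'carbon*' 'graphite*'
systemctl restart carbon-cache
# restart only the carbon cache daemon, then check whether graph data disappears afterwards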

Greetings.

I was already watching this thread because I was curious whether someone had a solution, as I have never experienced something like this.

The only idea I have would be the cache service getting killed before it can write its data, since the web interface can also query the cache in addition to the files.

But if values that already existed in the file get deleted, I have no idea.


I continued testing this like so:

Changed storage-schemas.conf to:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[icinga2_internals]
pattern = ^icinga2\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

[icinga2_2hourchecks]
pattern = ^icinga2\..*\.(apt|checkrestart)
retentions = 120m:4y

[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:14d,5m:90d,30m:1y,120m:4y

[default]
pattern = .*
retentions = 1m:14d,5m:90d,30m:1y,120m:4y

(because there are two checks with a different check_interval)

Instead of using whisper-resize, this time I deleted the entire whisper directory:
rm -r /var/lib/graphite/whisper/icinga2 /var/lib/graphite/whisper/carbon

I waited for 1.5 days and then did a reboot.
Result: no data loss this time.

Could my problem be related to whisper-resize? That doesn’t make sense either, however.


Hi.

Weird problem. I could imagine that whisper-resize wasn’t executed successfully after a change of the storage-schemas.conf in the past; that can happen pretty quickly.
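
For reference, a resize to the new schema would look roughly like this (same placeholder style as above; the retentions must match your storage-schemas.conf, and a backup copy of the old file is kept unless --nobackup is given):

/path/to/whisper-resize.py <path-to-whisper-file> 1m:14d 5m:90d 30m:1y 120m:4y
# rewrites the archives of the existing file to the given retentions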

Please let us know if this solved your problem (once you have watched it for a while).
Thanks.


Greetings.

Please let us know if this solved your problem (once you have watched it for a while).

I have bad news. Just did a reboot of the monitoring servers (two independent servers + networks) and parts of the graph data disappeared again.


And today again. :frowning:


Could it have to do with the strangely increasing memory consumption on the icinga server?


The cache will keep metrics in memory as long as it cannot write them to disk. So yes, if the cache is the process causing the memory consumption, it could be related. And it could be that it gets killed during reboot instead of writing everything to disk.
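
One thing that could be checked is what the cache logged during the previous shutdown, for example (the unit name may differ depending on how graphite is packaged):

journalctl -u carbon-cache -b -1 --no-pager | tail -n 50
# "-b -1" selects the previous boot, so you can see whether carbon-cache was stopped cleanly or killed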

There are some settings influencing, and especially limiting, how much it writes in a given time frame, especially MAX_UPDATES_PER_SECOND, MAX_UPDATES_PER_SECOND_ON_SHUTDOWN and MAX_CREATES_PER_MINUTE. But logging (if enabled) and tagging can also influence write performance and, by this, caching.

If it is not reaching the limits, it could also be necessary to enable additional cache instances and add a relay in between to distribute the load.
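
For reference, these settings live in the [cache] section of carbon.conf; the values below are only placeholders showing the syntax, not recommendations:

[cache]
MAX_UPDATES_PER_SECOND = 500
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = 2000
MAX_CREATES_PER_MINUTE = 50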


Thanks for your suggestions. These are my current settings (still the default):
MAX_UPDATES_PER_SECOND = 500
not set: MAX_UPDATES_PER_SECOND_ON_SHUTDOWN
MAX_CREATES_PER_MINUTE = 50

Does that sound too low to you?
I’m only monitoring 12 hosts with 324 services on this node (2 CPUs of an Intel Xeon, 4 GB RAM).
The avg load is 0.2.

I think MAX_CREATES_PER_MINUTE is much too low by default, but with this environment size, and since it is not only an initial problem, it should not cause any problems.

MAX_UPDATES_PER_SECOND should also not limit such a small environment, even with a 1 minute check interval and even if every check updated 50 or more metrics.

You lost 10 days of data when the service was killed, so a big amount of data had to be kept in memory. If the shown graph is from the Graphite system, this would explain the consumption, but not why all of it is held in memory instead of being written to the disk.

Are there any other settings changed?
Are there settings of the Linux system changed, like the kernel parameters vm.dirty_ratio, vm.dirty_background_ratio, vm.swappiness or something else affecting memory handling?
How is the Graphite writer feature configured, i.e. are metadata and thresholds enabled? (This should not increase the metrics to an unmanageable amount, though.)
Can you enable LOG_UPDATES and LOG_CREATES so you can see if the cache is writing to the disk? (If everything runs fine, the updates log can grow fast and big, so it should not stay enabled in production afterwards.)
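
To make those checks concrete, a rough sketch (the carbon.conf location depends on your packaging, on Debian it is often /etc/carbon/carbon.conf):

# show the kernel settings that control how aggressively dirty pages are flushed
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.swappiness

# then temporarily set, in the [cache] section of carbon.conf:
#   LOG_UPDATES = True
#   LOG_CREATES = True
# restart the cache and watch its updates/creates logs to confirm that points really reach the disk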

I am really just guessing, as I think killing the cache daemon before it has written its data is the only way this could happen, but I have no idea why it would behave in such an extreme way in such a small environment.