Graphite resets graphs after each server reboot

Every time I reboot, my monitoring server seems to reset all graphs.
This is happening on two separate servers.
I even reinstalled one of the servers, strictly following the installation instructions.
Any idea what could be happening here?
Or how I can best start debugging this issue?

Versions:
Debian GNU/Linux 10 (with all updates)
icinga2 2.12.4
icingaweb2 2.8.2
mariadb 10.3.29
apache 2.4.38
graphite 1.1.4

Before:

After the reboot:

Thank you,
Andy

Hi.

As a starting point, it would be interesting to know if the whisper files are really no longer present.

Maybe you could look for the whisper files (e.g. at /opt/graphite/storage/whisper/icinga2/…) and check if older performance-data is present.
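
For example, something like this lists the whisper files together with their size and last modification time (adjust the base path to your installation):

find /opt/graphite/storage/whisper/icinga2 -name '*.wsp' -ls | head
# "-ls" prints an ls-style line (size, mtime, path) for each whisper file found
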
Look for the tool whisper-fetch.py, which accepts a Unix epoch timestamp for its "--from" argument.

Syntax:

/path/to/whisper-fetch.py <path-to-whisper-file> --from=<timestamp_from_when> [--pretty]

Example:

/opt/graphite/bin/whisper-fetch.py max.wsp --from=1624177772 --pretty
# or : /opt/graphite/bin/whisper-fetch.py max.wsp --from=1624177772 --pretty | head
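
If you need an epoch timestamp for "--from", GNU date can generate one, for example:

date -d '7 days ago' +%s
# prints the Unix epoch timestamp of one week ago, ready to be passed to --from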

If there is no data present older than 1 day, you should have a look at the
storage-schemas.conf. This excellent post might be helpful.

Hope it helps.


Greetings.

Thank you very much for this info.
whisper-fetch shows nicely that after a reboot values simply disappear.
But not for all days or all times. E.g. I just did a reboot and only these are left:
Mon Jun 14 13:00:00 2021 -0.000177
Mon Jun 14 13:30:00 2021 -0.000270
Mon Jun 14 14:00:00 2021 -0.000026
Mon Jun 21 09:30:00 2021 -0.000058
[…]
Tue Jun 22 01:00:00 2021 -0.000449

Before and after this time range, values are set to “None”.

I don’t think it has to do with the schema definitions.
They are in my case:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[icinga2_internals]
pattern = ^icinga2\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:5d,5m:14d,30m:90d,120m:4y

[default]
pattern = .*
retentions = 1m:5d,5m:14d,30m:90d,120m:4y

Do you see an error here?

Thanks so much!

Additional remarks:

My usual check_interval is set to 60s.

Since I was surprised that there were only values every 30 minutes, I ran whisper-fetch without "--from", and then it showed a value for every minute. Is this a known side effect? The documentation doesn’t mention it:

OPTIONS
       --from Unix epoch time of the beginning of your requested interval (default: 24 hours ago).

$ whisper-fetch value.wsp --from=1618993772 --pretty

...
Wed Jun 23 16:00:00 2021    0.000013
Wed Jun 23 16:30:00 2021    -0.000121
Wed Jun 23 17:00:00 2021    0.000457
Wed Jun 23 17:30:00 2021    -0.000375
...

$ whisper-fetch value.wsp --pretty

...
Wed Jun 23 15:59:00 2021    0.000066
Wed Jun 23 16:00:00 2021    -0.000097
Wed Jun 23 16:01:00 2021    -0.000178
Wed Jun 23 16:02:00 2021    -0.000570
Wed Jun 23 16:03:00 2021    0.000293
...

$ whisper-info value.wsp

aggregationMethod: average
maxRetention: 126144000
xFilesFactor: 0.5
fileSize: 396928

Archive 0
offset: 64
secondsPerPoint: 60
points: 7200
retention: 432000
size: 86400

Archive 1
offset: 86464
secondsPerPoint: 300
points: 4032
retention: 1209600
size: 48384

Archive 2
offset: 134848
secondsPerPoint: 1800
points: 4320
retention: 7776000
size: 51840

Archive 3
offset: 186688
secondsPerPoint: 7200
points: 17520
retention: 126144000
size: 210240

I’d say this is to be expected. The timestamp 1618993772 dates to Apr 21st, which is inside the 90d retention period defined in the schema definition. This retention aggregates the values of the previous retention period (5m:14d), so it basically makes one “30m” value out of six “5m” values.
The post linked by homerjay explains this in more detail :slight_smile:
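
As far as I understand the archive selection, whisper answers a query from the finest archive whose retention still covers the whole requested range, so a quick look at the timestamp already explains the 30-minute steps:

date -d @1618993772
# → Wed Apr 21 2021, roughly 9 weeks before your fetch on Jun 23rd
# 9 weeks is outside the 14d retention of the 5m archive but inside the 90d retention
# of the 30m archive, so the query is answered from the 30m archive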

As to why the values/graphs are disappearing after a reboot, I have no idea atm. I never had that issue before.


Hi again.

Sorry, I also have no idea; I have never had this issue either.
Maybe @dgoetz or @blakehartshorn can give you a hint.

It would be nice to know which process exactly causes this problem.
Does it also happen if you only restart the graphite-service?
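
For a test, something like this would restart only the graphite side (the unit names depend on your packaging, so better check them first):

systemctl list-units 'carbon*' 'graphite*'
systemctl restart carbon-cache
# restart only the carbon cache daemon, then check whether graph data disappears afterwards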

Greetings.

I was already watching this thread because I was curious whether someone had a solution, as I have never experienced something like this.

The only idea I have would be the cache service getting killed before it can write its data, since the web interface can also query the cache in addition to the files.

But if values that already existed in the file get deleted, I have no idea.


I continued testing this like so:

Changed storage-schemas.conf to:

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[icinga2_internals]
pattern = ^icinga2\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

[icinga2_2hourchecks]
pattern = ^icinga2\..*\.(apt|checkrestart)
retentions = 120m:4y

[icinga2_default]
pattern = ^icinga2\.
retentions = 1m:14d,5m:90d,30m:1y,120m:4y

[default]
pattern = .*
retentions = 1m:14d,5m:90d,30m:1y,120m:4y

(because there are two checks with a different check_interval)

Instead of using whisper-resize, this time I deleted the entire whisper directory:
rm -r /var/lib/graphite/whisper/icinga2 /var/lib/graphite/whisper/carbon

I waited for 1.5 days and then did a reboot.
Result: no data loss this time.

Could my problem be related to whisper-resize? That doesn’t make sense either, however.


Hi.

Weird problem. I could imagine that whisper-resize wasn’t executed successfully after a change of the storage-schemas.conf in the past; that can happen pretty quickly.
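
For reference, a resize to the new schema would look roughly like this (same placeholder style as above; the retentions must match your storage-schemas.conf, and a backup copy of the old file is kept unless --nobackup is given):

/path/to/whisper-resize.py <path-to-whisper-file> 1m:14d 5m:90d 30m:1y 120m:4y
# rewrites the archives of the existing file to the given retentions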

Please let us know if this solved your problem (once you have watched it for a while).
Thanks.


Greetings.

Please let us know if this solved your problem (once you have watched it for a while).

I have bad news. Just did a reboot of the monitoring servers (two independent servers + networks) and parts of the graph data disappeared again.


And today again. :frowning:


Could it have to do with the strangely increasing memory consumption on the icinga server?


The cache will keep metrics in memory as long as it cannot write them to disk. So yes, if the cache is the process causing the memory consumption, it could be related. And it could be that it gets killed during reboot instead of writing everything to disk.
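
One thing that could be checked is what the cache logged during the previous shutdown, for example (the unit name may differ depending on how graphite is packaged):

journalctl -u carbon-cache -b -1 --no-pager | tail -n 50
# "-b -1" selects the previous boot, so you can see whether carbon-cache was stopped cleanly or killed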

There are some settings influencing, and especially limiting, how much it writes in a given time frame, especially MAX_UPDATES_PER_SECOND, MAX_UPDATES_PER_SECOND_ON_SHUTDOWN and MAX_CREATES_PER_MINUTE. But logging (if enabled) and tagging can also influence write performance and, by this, caching.

If it is not reaching the limits, it could also be necessary to enable additional cache instances and add a relay in between to distribute the load.
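
For reference, these settings live in the [cache] section of carbon.conf; the values below are only placeholders showing the syntax, not recommendations:

[cache]
MAX_UPDATES_PER_SECOND = 500
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = 2000
MAX_CREATES_PER_MINUTE = 50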


Thanks for your suggestions. These are my current settings (still the default):
MAX_UPDATES_PER_SECOND = 500
not set: MAX_UPDATES_PER_SECOND_ON_SHUTDOWN
MAX_CREATES_PER_MINUTE = 50

Does that sound too low to you?
I’m only monitoring 12 hosts with 324 services on this node (2 CPUs of an Intel Xeon, 4 GB RAM).
The avg load is 0.2.

I think MAX_CREATES_PER_MINUTE is much too low by default, but with this environment size, and since it is not only an initial problem, it should not cause any problems.

MAX_UPDATES_PER_SECOND should also not limit such a small environment, even with a 1 minute check interval and even if every check updated 50 or more metrics.

You lost 10 days of data when the service was killed, so a big amount of data had to be kept in memory. If the shown graph is from the Graphite system, this would explain the consumption, but not why all of it is held in memory instead of being written to the disk.

Are there any other settings changed?
Are there settings of the Linux system changed, like the kernel parameters vm.dirty_ratio, vm.dirty_background_ratio, vm.swappiness or something else affecting memory handling?
How is the Graphite writer feature configured, i.e. are metadata and thresholds enabled? (This should not increase the metrics to an unmanageable amount, though.)
Can you enable LOG_UPDATES and LOG_CREATES so you can see if the cache is writing to the disk? (If everything runs fine, the updates log can grow fast and big, so it should not stay enabled in production afterwards.)
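
To make those checks concrete, a rough sketch (the carbon.conf location depends on your packaging, on Debian it is often /etc/carbon/carbon.conf):

# show the kernel settings that control how aggressively dirty pages are flushed
sysctl vm.dirty_ratio vm.dirty_background_ratio vm.swappiness

# then temporarily set, in the [cache] section of carbon.conf:
#   LOG_UPDATES = True
#   LOG_CREATES = True
# restart the cache and watch its updates/creates logs to confirm that points really reach the disk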

I am really just guessing, as I think killing the cache daemon before it has written its data is the only way this could happen, but I have no idea why it would behave in such an extreme way in such a small environment.