Graphite resets graphs after each server reboot

After each reboot, my monitoring server seems to reset all graphs.
This is happening on two separate servers.
I even reinstalled one of the servers, strictly following the installation instructions.
Any idea what could be happening here?
Or how can I best start debugging this issue?

Debian GNU/Linux 10 (with all updates)
icinga2 2.12.4
icingaweb2 2.8.2
mariadb 10.3.29
apache 2.4.38
graphite 1.1.4


After the reboot: (screenshot omitted)

Thank you,


As a starting point, it would be interesting to know whether the whisper files really no longer contain the data.

Maybe you could look for the whisper files (e.g. at /opt/graphite/storage/whisper/icinga2/…) and check whether older performance data is still present.
Use whisper-fetch, which accepts a Unix epoch timestamp for its --from argument:



/path/to/whisper-fetch.py <path-to-whisper-file> --from=<timestamp_from_when> [--pretty]


/opt/graphite/bin/whisper-fetch.py max.wsp --from=1624177772 --pretty
# or: /opt/graphite/bin/whisper-fetch.py max.wsp --from=1624177772 --pretty | head

If there is no data present older than 1 day, you should have a look at the
storage-schemas.conf. This excellent post might be helpful.
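If you need an epoch timestamp for --from, "date +%s" on the shell works, or a couple of lines of Python (the concrete date below is only an example):

```python
from datetime import datetime, timedelta, timezone

# Epoch timestamp for "7 days ago" -- usable as --from=<timestamp>
week_ago = int((datetime.now(timezone.utc) - timedelta(days=7)).timestamp())

# Or convert a specific date (UTC assumed here) into an epoch timestamp
specific = int(datetime(2021, 6, 20, tzinfo=timezone.utc).timestamp())
print(week_ago, specific)  # specific -> 1624147200
```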

Hope it helps.


Thank you very much for this info.
whisper-fetch shows nicely that after a reboot values simply disappear.
But not for all days or all times. E.g. I just did a reboot and only these are left:
Mon Jun 14 13:00:00 2021 -0.000177
Mon Jun 14 13:30:00 2021 -0.000270
Mon Jun 14 14:00:00 2021 -0.000026
Mon Jun 21 09:30:00 2021 -0.000058
Tue Jun 22 01:00:00 2021 -0.000449

Before and after this time range, values are set to “None”.

I don’t think it has to do with the schemas definitions.
They are in my case:

pattern = ^carbon\.
retentions = 60:90d

pattern = ^icinga2\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

pattern = ^icinga2\.
retentions = 1m:5d,5m:14d,30m:90d,120m:4y

pattern = .*
retentions = 1m:5d,5m:14d,30m:90d,120m:4y

Do you see an error here?

Thanks so much!

Additional remarks:

My usual check_interval is set to 60s.

Since I was surprised that there were only values every 30 minutes, I ran whisper-fetch without “--from”, and then it showed a value for every minute. Is this a known side effect? The documentation doesn’t mention it:

       --from Unix epoch time of the beginning of your requested interval (default: 24 hours ago).

$ whisper-fetch value.wsp --from=1618993772 --pretty

Wed Jun 23 16:00:00 2021    0.000013
Wed Jun 23 16:30:00 2021    -0.000121
Wed Jun 23 17:00:00 2021    0.000457
Wed Jun 23 17:30:00 2021    -0.000375

$ whisper-fetch value.wsp --pretty

Wed Jun 23 15:59:00 2021    0.000066
Wed Jun 23 16:00:00 2021    -0.000097
Wed Jun 23 16:01:00 2021    -0.000178
Wed Jun 23 16:02:00 2021    -0.000570
Wed Jun 23 16:03:00 2021    0.000293

$ whisper-info value.wsp

aggregationMethod: average
maxRetention: 126144000
xFilesFactor: 0.5
fileSize: 396928

Archive 0
offset: 64
secondsPerPoint: 60
points: 7200
retention: 432000
size: 86400

Archive 1
offset: 86464
secondsPerPoint: 300
points: 4032
retention: 1209600
size: 48384

Archive 2
offset: 134848
secondsPerPoint: 1800
points: 4320
retention: 7776000
size: 51840

Archive 3
offset: 186688
secondsPerPoint: 7200
points: 17520
retention: 126144000
size: 210240
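For reference, the numbers above can be re-derived from the retention string; a small Python sketch assuming the standard whisper file layout (16-byte header, 12 bytes per archive header, 12 bytes per datapoint):

```python
# Recompute the whisper-info numbers from the configured retentions
# (1m:5d,5m:14d,30m:90d,120m:4y), assuming the standard whisper layout.

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse(token):
    """'30m:90d' -> (secondsPerPoint, points)"""
    step, length = token.split(":")
    def seconds(v):
        return int(v[:-1]) * UNITS[v[-1]] if v[-1] in UNITS else int(v)
    sp, ret = seconds(step), seconds(length)
    return sp, ret // sp

archives = [parse(t) for t in "1m:5d,5m:14d,30m:90d,120m:4y".split(",")]
# -> [(60, 7200), (300, 4032), (1800, 4320), (7200, 17520)]

header = 16 + 12 * len(archives)         # 64, the offset of Archive 0
file_size = header + 12 * sum(p for _, p in archives)
print(archives, header, file_size)       # file_size -> 396928, as above
```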

I’d say this is to be expected. The timestamp 1618993772 dates to April 21st, which falls inside the 90d retention period defined in the schema (30m:90d). This retention aggregates the values of the previous retention period (5m:14d), so it basically makes one “30m” value out of six “5m” values.
The post linked by homerjay explains this in more detail 🙂
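To illustrate the mechanics: whisper answers a request from the first (finest) archive whose retention still covers the requested start time, and roll-ups average the finer points, subject to xFilesFactor. A simplified sketch, not whisper's actual code:

```python
def pick_archive(archives, age_seconds):
    """Pick the first archive whose retention reaches back far enough.
    archives: list of (secondsPerPoint, retention_in_seconds), finest first."""
    for step, retention in archives:
        if age_seconds <= retention:
            return step
    return archives[-1][0]

archives = [(60, 432000), (300, 1209600), (1800, 7776000), (7200, 126144000)]

# --from roughly 63 days ago: only the 30-minute archive reaches that far back
assert pick_archive(archives, 63 * 86400) == 1800
# --from 24 hours ago (the default): served from the 1-minute archive
assert pick_archive(archives, 86400) == 60

def roll_up(points, xff=0.5):
    """Average e.g. six 5m points into one 30m point, honoring xFilesFactor."""
    known = [p for p in points if p is not None]
    return sum(known) / len(known) if len(known) / len(points) >= xff else None

print(roll_up([1, 2, 3, None, None, None]))        # 2.0 (3/6 known >= 0.5)
print(roll_up([1, None, None, None, None, None]))  # None (only 1/6 known)
```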

As to why the values/graphs are disappearing after a reboot, I have no idea atm. I never had that issue before.


Hi again.

Sorry, I also have no idea. Also never had this issue.
Maybe @dgoetz or @blakehartshorn can give you a hint.

It would be nice to know which process exactly causes this problem.
Does it also happen if you only restart the graphite-service?


I was already watching this thread because I was curious whether someone had a solution, as I have never experienced anything like this.

The only idea I have would be the cache service getting killed before it can write its data, since the web interface can query the cache in addition to the files.

But if values that already existed in the file get deleted, I have no idea.


I continued testing this like so:

Changed storage-schemas.conf to:

pattern = ^carbon\.
retentions = 60:90d

pattern = ^icinga2\..*\.(max_check_attempts|reachable|current_attempt|execution_time|latency|state|state_type)
retentions = 5m:7d

pattern = ^icinga2\..*\.(apt|checkrestart)
retentions = 120m:4y

pattern = ^icinga2\.
retentions = 1m:14d,5m:90d,30m:1y,120m:4y

pattern = .*
retentions = 1m:14d,5m:90d,30m:1y,120m:4y

(because there are two checks with a different check_interval)

Instead of using whisper-resize this time I deleted the entire whisper directory:
rm -r /var/lib/graphite/whisper/icinga2 /var/lib/graphite/whisper/carbon

I waited 1.5 days and then did a reboot.
Result: no data loss this time.

Could my problem be related to whisper-resize? That doesn’t make sense either, however.



Weird problem. I could imagine that whisper-resize wasn’t executed successfully after a change of the storage-schemas.conf at some point in the past; that can happen pretty quickly.

Please let us know if this solved your problem (when you have continued to watch this).


> Please let us know if this solved your problem (when you have continued to watch this).

I have bad news. I just did a reboot of the monitoring servers (two independent servers + networks), and parts of the graph data disappeared again.


And today again. 🙁

Could it have to do with the strangely increasing memory consumption on the icinga server?


The cache keeps metrics in memory as long as it cannot write them to disk. So yes, if the cache is the process causing the memory consumption, that could be it. And it could be that it gets killed during reboot instead of writing everything to disk.

There are some settings that influence, and especially limit, how much it writes in a given time frame: notably MAX_UPDATES_PER_SECOND, MAX_UPDATES_PER_SECOND_ON_SHUTDOWN and MAX_CREATES_PER_MINUTE. Logging, if enabled, and tagging can also influence write performance and thereby caching.

If it is not hitting those limits, it could also be necessary to enable additional cache instances and add a relay in between to distribute the load.


Thanks for your suggestions. These are my current settings (still the defaults):

Does that sound too low to you?
I’m only monitoring 12 hosts with 324 services on this node (2 CPUs of an Intel Xeon, 4 GB RAM).
The avg load is 0.2.

I think MAX_CREATES_PER_MINUTE is much too low by default, but with an environment of this size, and given that the issue is not just an initial one, it should not cause any problems.

MAX_UPDATES_PER_SECOND should also not be a limit in such a small environment, even with a 1-minute check interval and even if every check updated 50 or more metrics.
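A quick back-of-envelope check (the per-check metric count is a guess for illustration; the stock default for MAX_UPDATES_PER_SECOND is 500 if I remember correctly, so verify against your own carbon.conf):

```python
# Rough estimate of the update rate for this environment.
# All numbers are assumptions for illustration, not measured values.
services = 324
metrics_per_service = 10     # generous guess: value plus min/max/warn/crit etc.
check_interval_s = 60

updates_per_second = services * metrics_per_service / check_interval_s
print(updates_per_second)    # 54.0 -- far below a limit like 500

MAX_UPDATES_PER_SECOND = 500  # common stock default; check your carbon.conf
assert updates_per_second < MAX_UPDATES_PER_SECOND
```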

You lost 10 days of data when the service was killed, so a large amount of data had to be kept in memory. If the shown graph is from the Graphite system, this would explain the consumption, but not why all of this is held in memory instead of being written to disk. Some things to check:

Are there any other settings changed?
Are any Linux system settings changed, like the kernel parameters vm.dirty_ratio, vm.dirty_background_ratio, vm.swappiness, or anything else affecting memory handling?
How is the graphite writer feature configured, i.e. are metadata and thresholds enabled (though this should not increase the metrics to an unmanageable amount)?
Can you enable LOG_UPDATES and LOG_CREATES so you can see whether the cache is writing to disk? (If everything runs fine, the updates log can grow fast and big, so it should not stay enabled in production afterwards.)

I am really just guessing, as I think killing the cache daemon before it has written its data is the only way this could happen, but I have no idea why it would behave so extremely in such a small environment.

I kept watching it for a while, and I now believe that it really must have to do with the carbon cache.
After I restart the service (service carbon-cache restart), the graphs are empty. 🙁

whisper-fetch shows that no new data is written to the wsp files anymore.
And the modification date of the wsp files is really old, although the graphs are showing new values.

Does that imply that the data is only kept in the cache? That would explain why it gets lost after a restart. But why doesn’t it get written to the files at all?
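One way to watch this from the outside is to list whisper files whose modification time is older than a few write intervals; a small sketch (the path and the threshold are assumptions to adapt):

```python
import os
import time

def stale_files(root, max_age_s=600):
    """Return .wsp files under root not modified within max_age_s seconds."""
    now = time.time()
    stale = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            if name.endswith(".wsp"):
                path = os.path.join(dirpath, name)
                if now - os.stat(path).st_mtime > max_age_s:
                    stale.append(path)
    return sorted(stale)

# e.g. stale_files("/var/lib/graphite/whisper/icinga2")
```

If everything there shows up as stale while the graphs still update, the data really only lives in the cache.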

I am still clueless. 😭


Hi again.

I still do not have a solution, but maybe you could analyze what happens when you restart the carbon-cache, e.g. with BPF tools (like execsnoop) or something else.


Yes, the question of why it is not writing data is the important one.

Carbon-cache has two jobs: one is to cache the data so it can be read without the immediate need to write it to disk; the other is to order the data to optimize writes and then write it to disk.

We already checked the configuration values that would limit writing, and they looked fine. So did you enable the creates (LOG_CREATES) and updates (LOG_UPDATES) logs? Perhaps they can give you a clue, though they could also simply be empty if nothing happens. Also, a look at CACHE_WRITE_STRATEGY, and perhaps deactivating the cache entirely with WHISPER_AUTOFLUSH, could be worth a try.

Yes, and I still have no clue.

I enabled LOG_CREATES and LOG_UPDATES: yes, both are being written to:

01/10/2021 22:16:23 :: creating database metric (archive=[(60, 20160), (300, 25920), (1800, 17520), 
01/10/2021 22:16:23 :: new metric matched schema icinga2_default
01/10/2021 22:16:23 :: new metric matched aggregation schema default
01/10/2021 22:16:23 :: creating database metric (archive=[(60, 20160), (300, 25920), (1800, 17520), (7200, 17520)] xff=None agg=None)
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00008 seconds
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00009 seconds
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00008 seconds
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00007 seconds
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00007 seconds
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00008 seconds
01/10/2021 22:21:51 :: wrote 1 datapoints for in 0.00008 seconds
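As a sanity check, the archive list in the creates log should correspond exactly to the new retention string from storage-schemas.conf; a sketch assuming the standard Graphite unit suffixes:

```python
# Map a retentions string to the (secondsPerPoint, points) archive list
# that carbon-cache logs when creating a new whisper file.

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def to_seconds(v):
    return int(v[:-1]) * UNITS[v[-1]] if v[-1] in UNITS else int(v)

def archives(retentions):
    out = []
    for token in retentions.split(","):
        step, length = token.split(":")
        sp = to_seconds(step)
        out.append((sp, to_seconds(length) // sp))
    return out

print(archives("1m:14d,5m:90d,30m:1y,120m:4y"))
# [(60, 20160), (300, 25920), (1800, 17520), (7200, 17520)]
```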

CACHE_WRITE_STRATEGY is set to “sorted”. Is that ok?

On Friday I deactivated the cache by setting WHISPER_AUTOFLUSH to True,
but when I restarted the carbon cache this morning, all values since Saturday noon were gone again:

This is so strange. Any other ideas?
Otherwise, I’m tempted to reinstall the whole icinga server…

Yes, this is the default, and with a uniform resolution of the metrics it should produce the best results.

With the cache effectively disabled by WHISPER_AUTOFLUSH, I would say Graphite is not the problem; it is more likely a problem with the system’s filesystem cache or the filesystem itself.

The first one is simple to test by running sync, which executes the system call that flushes data from memory to disk; this should happen automatically during shutdown anyway.

From what I know, ext4 is still Debian 10’s default filesystem, which should be pretty stable. I’m not aware of any other filesystem issues, but just to make sure: which filesystem do you use, and are there any specific mount options (not only in /etc/fstab but also the default mount options shown by tune2fs -l)?

Any other caching involved here like a hardware raid-controller?

Good points.

I’m using standard ext4 on Debian 10 (in a VM), and no extra caching is involved. Just a plain Debian installation with Icinga + Graphite on top.