For a few weeks now we have been facing an issue where icingadb-redis seems to run out of memory and crashes. This also causes icingadb to fail. Restarting both services fixes the issue. From the error logs it seems obvious that the problem is that Redis is out of memory.
Some steps have already been taken to diagnose and fix the problem (a rough sketch of the resulting settings is shown below the list).
Steps already taken
Checked memory usage of the machine: it never exceeds 90%
Redis: enabled vm.overcommit_memory
Redis: limited maxmemory to 8 GB
Redis: changed maxmemory-policy to allkeys-lru
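For reference, these changes boil down to roughly the following snippets; the exact file paths depend on the distribution and package layout, and vm.overcommit_memory = 1 is the value usually recommended for Redis:

# e.g. /etc/sysctl.d/99-redis.conf
vm.overcommit_memory = 1

# in the icingadb-redis configuration file
maxmemory 8gb
maxmemory-policy allkeys-lru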
Unfortunately, after all that the problem still occurs. Here are the logs from the last time it died.
Welcome to the Icinga community and thanks for posting.
Based on the icingadb-redis-journalctl.log log file, I would assume that your system simply runs out of memory and Redis is just the first casualty.
Sep 12 23:00:08 monitoring-master-01.company.tld icingadb-redis-server[1268522]: 1268522:M 12 Sep 2025 23:00:08.815 # Out Of Memory allocating 86032 bytes!
Never exceeding 90% of memory usage is one thing, but how does it normally look? Can you please provide a report/graph of the memory consumption over the hour leading up to the next Redis OOM crash?
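If there is no monitoring graph at hand, even a crude sampling loop would do, e.g. something like this minimal sketch (interval and log file are arbitrary):

# append one line of memory usage per minute
$ while true; do echo "$(date -Is) $(free -m | grep ^Mem:)"; sleep 60; done >> /tmp/mem-sample.log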
And is it really only Redis that goes OOM? Based on your statement about 90% memory consumption, I would expect other processes to get killed as well.
On a related note: do you have the OS OOM killer configured?
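Kernel OOM kills would show up in the kernel log around the crash time, so a quick check could be:

$ journalctl -k --since "2025-09-12 22:00" --until "2025-09-13 00:00" | grep -iE "out of memory|oom-killer"

(the time window here is just taken from the log excerpt above)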
The host itself never has a memory issue; there is plenty of free memory (avg. 83%) on the host. Redis never reaches its configured 8 GB memory limit, and it does not use more even when no memory limit is configured at all.
I don't think it is a system OOM event but rather a Redis-internal out-of-memory event. Besides icingadb-redis, nothing else on this machine has issues.
Thanks for the reply. I have to admit, I am a bit out of ideas, so let’s dig into some configuration options.
How have you configured the 8 GB memory limit? Could you please share your Redis config with us? Does Redis run in its own cgroup or have another OS-level limitation, e.g., via some systemd options?
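A quick way to check for systemd-imposed limits would be something along these lines, assuming the unit is called icingadb-redis:

$ systemctl cat icingadb-redis | grep -i memory
$ systemctl show icingadb-redis -p MemoryAccounting -p MemoryHigh -p MemoryMax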
Please execute INFO in the Redis CLI and post the output. Maybe something sticks out?
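If the full INFO output is too long to post, the memory section alone would already help, roughly like this (6380 is the usual icingadb-redis port, adjust if yours differs):

$ icingadb-redis-cli -p 6380 INFO memory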
Going back to the provided icingadb-redis-journalctl.log file, it notably says “Guru Meditation: Redis aborting for OUT OF MEMORY.” How do the maxmemory and maxheap configuration options compare? In my understanding, maxheap should be several times larger than maxmemory.
Thanks. We had a small outage last week, but luckily we have this monitoring software you may have heard of.
I’ve seen you have set maxmemory_policy:allkeys-lru. Does the OOM still happen? In my understanding, Redis should prefer evicting keys over going OOM.
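One way to see whether eviction ever kicks in would be to compare the effective policy with the eviction counter, for example:

$ icingadb-redis-cli -p 6380 CONFIG GET maxmemory-policy
$ icingadb-redis-cli -p 6380 INFO stats | grep evicted_keys

If evicted_keys stays at 0, Redis never reached the point where it starts evicting, which could hint at the allocation failing for a reason other than maxmemory.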
Btw, does Icinga DB consume the Redis streams? Do you have the icingadb check configured and is it happy?
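If the streams are not being consumed, they tend to grow; a quick look at the lengths of the runtime streams (key names as used by Icinga DB) might already tell something:

$ icingadb-redis-cli -p 6380 XLEN icinga:runtime
$ icingadb-redis-cli -p 6380 XLEN icinga:runtime:state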
You could also try to find potential culprits in Redis using icingadb-redis-cli with either the --bigkeys or --memkeys option, as described in the Redis docs. Please feel free to share the output. Maybe something is stuck somewhere?
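For reference, the calls would look roughly like this; the -i option just adds a small sleep between SCAN batches to be gentle on a production instance:

$ icingadb-redis-cli -p 6380 --bigkeys -i 0.1
$ icingadb-redis-cli -p 6380 --memkeys -i 0.1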
Out of curiosity, how big is the Icinga setup connected to the Redis? How many hosts and services are there?
Please excuse my random questions, but so far I haven’t seen anything that gives me a concrete idea.
I mostly skimmed through this thread, but my first thought was that the save setting for Redis (which configures if and when to dump the Redis content to disk) might play a role here.
In some scenarios we had problems with that, since the dump requires significant resources. Especially in big setups the default threshold is too low; in one case it caused the dump process to run continuously (with a lot of IO and memory usage). That's why this part of the docs exists.
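For illustration, the directive in question looks like this in the Redis config; the numbers below are only an example of raising the threshold, not a recommendation for this particular setup:

# only snapshot if at least 10000 keys changed within 15 minutes
save 900 10000

# or disable RDB snapshots entirely
save ""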
Thanks for mentioning the save option, @lorenz. Earlier in this thread I had linked to this section of the docs, but maybe it got lost with the switch from @CertifiedRedisHater to @MadPat, which is why I did not mention it again.
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest hash found so far '"icinga:endpoint"' with 349 fields
[00.00%] Biggest stream found so far '"icinga:dump"' with 54 entries
[00.00%] Biggest hash found so far '"icinga:hostgroup:member"' with 1131 fields
[00.00%] Biggest hash found so far '"icinga:checksum:service:state"' with 10878 fields
[29.41%] Biggest hash found so far '"icinga:service:customvar"' with 43118 fields
[29.41%] Biggest hash found so far '"icinga:notification:recipient"' with 94595 fields
[29.41%] Biggest stream found so far '"icinga:runtime"' with 1000009 entries
[61.76%] Biggest zset found so far '"icinga:nextupdate:host"' with 493 members
[92.65%] Biggest hash found so far '"icinga:notification:customvar"' with 103952 fields
[92.65%] Biggest zset found so far '"icinga:nextupdate:service"' with 10109 members
-------- summary -------
Sampled 68 keys in the keyspace!
Total key length in bytes is 1594 (avg len 23.44)
Biggest hash found '"icinga:notification:customvar"' has 103952 fields
Biggest stream found '"icinga:runtime"' has 1000009 entries
Biggest zset found '"icinga:nextupdate:service"' has 10109 members
0 lists with 0 items (00.00% of keys, avg size 0.00)
54 hashs with 454818 fields (79.41% of keys, avg size 8422.56)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
12 streams with 2001065 entries (17.65% of keys, avg size 166755.42)
0 sets with 0 members (00.00% of keys, avg size 0.00)
2 zsets with 10602 members (02.94% of keys, avg size 5301.00)
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest hash found so far '"icinga:endpoint"' with 145491 bytes
[00.00%] Biggest stream found so far '"icinga:dump"' with 9824 bytes
[00.00%] Biggest hash found so far '"icinga:hostgroup:member"' with 323288 bytes
[00.00%] Biggest hash found so far '"icinga:checksum:service:state"' with 1610608 bytes
[00.00%] Biggest hash found so far '"icinga:customvar"' with 6813745 bytes
[14.71%] Biggest hash found so far '"icinga:service:state"' with 14281294 bytes
[29.41%] Biggest hash found so far '"icinga:notification:recipient"' with 28181272 bytes
[29.41%] Biggest stream found so far '"icinga:runtime"' with 293504356 bytes
[45.59%] Biggest hash found so far '"icinga:notification"' with 39843968 bytes
[61.76%] Biggest zset found so far '"icinga:nextupdate:host"' with 61697 bytes
[61.76%] Biggest stream found so far '"icinga:runtime:state"' with 804043208 bytes
[92.65%] Biggest zset found so far '"icinga:nextupdate:service"' with 1409665 bytes
-------- summary -------
Sampled 68 keys in the keyspace!
Total key length in bytes is 1594 (avg len 23.44)
Biggest hash found '"icinga:notification"' has 39843968 bytes
Biggest stream found '"icinga:runtime:state"' has 804043208 bytes
Biggest zset found '"icinga:nextupdate:service"' has 1409665 bytes
0 lists with 0 bytes (00.00% of keys, avg size 0.00)
54 hashs with 173606930 bytes (79.41% of keys, avg size 3214943.15)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
12 streams with 1097595680 bytes (17.65% of keys, avg size 91466306.67)
0 sets with 0 bytes (00.00% of keys, avg size 0.00)
2 zsets with 1471362 bytes (02.94% of keys, avg size 735681.00)
~500 Hosts, ~2300 Services.
I’ve been hesitant to change that, because if this is not the issue we could end up with data loss. We have two masters, and the Redis instances on them have never failed concurrently. I am not deeply knowledgeable about how Icinga works, but is the data collected by one master shared with the other? If so, we could try this.
A week ago we had a crash and a coworker found a solution. Here is what he reported:
$ sysctl vm.max_map_count
vm.max_map_count = 65530 # maximum number of memory mappings per process; if exceeded, malloc() will fail
# 20 minutes after
$ cat /proc/$(pgrep icingadb-redis)/maps | wc -l
1087
# after 25 minutes
$ cat /proc/$(pgrep icingadb-redis)/maps | wc -l
1327
# around 240 new memory mappings that are not cleaned up within 5 minutes -> 48 per minute
# 65530/48 -> ~1365 min ≈ 22.7 h until Redis crashes if the mappings are never freed
$ sysctl -w vm.max_map_count=262144
# Redis is now allowed to create 4x as many mappings and will probably crash less often, if at all
# after 20 hours
$ cat /proc/$(pgrep icingadb-redis)/maps | wc -l
30620
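As a follow-up, to make the change survive a reboot and to keep watching whether the mappings still grow without bound, something along these lines should do (file name and interval are arbitrary):

# persist the raised limit
$ echo "vm.max_map_count = 262144" > /etc/sysctl.d/99-redis-maps.conf
$ sysctl --system

# sample the mapping count every 5 minutes
$ while true; do echo "$(date -Is) $(cat /proc/$(pgrep icingadb-redis)/maps | wc -l)"; sleep 300; done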