For a few weeks now we have been facing an issue where icingadb-redis seems to run out of memory and crashes. This also causes icingadb to fail. Restarting both services fixes the issue. From the error logs it seems obvious that the problem is that Redis is out of memory.
Some steps have already been taken to diagnose and fix the problem (a rough sketch of the resulting settings is shown below the list).
Steps already taken
Checked memory usage of the machine: it never exceeds 90%
Redis: enabled vm.overcommit_memory
Redis: limited maxmemory to 8 GB
Redis: changed maxmemory-policy to allkeys-lru
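For reference, these changes boil down to roughly the following snippets; the exact file paths depend on the distribution and package layout, and vm.overcommit_memory = 1 is the value usually recommended for Redis:

# e.g. /etc/sysctl.d/99-redis.conf
vm.overcommit_memory = 1

# in the icingadb-redis configuration file
maxmemory 8gb
maxmemory-policy allkeys-lru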
Unfortunately, after all that the problem still occurs. Here are the logs from the last time it died.
Welcome to the Icinga community and thanks for posting.
Based on the icingadb-redis-journalctl.log log file, I would assume that your system simply runs out of memory and Redis is just the first casualty.
Sep 12 23:00:08 monitoring-master-01.company.tld icingadb-redis-server[1268522]: 1268522:M 12 Sep 2025 23:00:08.815 # Out Of Memory allocating 86032 bytes!
Never exceeding 90% of memory usage is one thing, but how does it normally look? Can you please provide a report/graph of the memory consumption over the hour leading up to the next Redis OOM crash?
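If there is no monitoring graph at hand, even a crude sampling loop would do, e.g. something like this minimal sketch (interval and log file are arbitrary):

# append one line of memory usage per minute
$ while true; do echo "$(date -Is) $(free -m | grep ^Mem:)"; sleep 60; done >> /tmp/mem-sample.log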
And is it really only Redis that goes OOM? Based on your statement about 90% memory consumption, I would expect other processes to get killed as well.
On a related note: do you have the OS OOM killer configured?
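Kernel OOM kills would show up in the kernel log around the crash time, so a quick check could be:

$ journalctl -k --since "2025-09-12 22:00" --until "2025-09-13 00:00" | grep -iE "out of memory|oom-killer"

(the time window here is just taken from the log excerpt above)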
The host itself never has a memory issue; there is plenty of free memory (avg. 83%) on the host. Redis never reaches its configured 8 GB memory limit, and it does not use more even when no memory limit is configured at all.
I don't think it is a system OOM event but rather a Redis-internal out-of-memory event. Besides icingadb-redis, nothing else on this machine has issues.
Thanks for the reply. I have to admit, I am a bit out of ideas, so let’s dig into some configuration options.
How have you configured the 8 GB memory limit? Could you please share your Redis config with us? Does Redis run in its own cgroup or have another OS-level limitation, e.g., via some systemd options?
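A quick way to check for systemd-imposed limits would be something along these lines, assuming the unit is called icingadb-redis:

$ systemctl cat icingadb-redis | grep -i memory
$ systemctl show icingadb-redis -p MemoryAccounting -p MemoryHigh -p MemoryMax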
Please execute INFO in the Redis CLI and post the output. Maybe something sticks out?
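If the full INFO output is too long to post, the memory section alone would already help, roughly like this (6380 is the usual icingadb-redis port, adjust if yours differs):

$ icingadb-redis-cli -p 6380 INFO memory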
Going back to the provided icingadb-redis-journalctl.log file, it notably says “Guru Meditation: Redis aborting for OUT OF MEMORY.” How do the maxmemory and maxheap configuration options compare? In my understanding, maxheap should be several times larger than maxmemory.
Thanks. We had a small outage last week, but luckily we have this monitoring software you may have heard of.
I’ve seen you have set maxmemory_policy:allkeys-lru. Does the OOM still happen? In my understanding, Redis should prefer evicting keys over going OOM.
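One way to see whether eviction ever kicks in would be to compare the effective policy with the eviction counter, for example:

$ icingadb-redis-cli -p 6380 CONFIG GET maxmemory-policy
$ icingadb-redis-cli -p 6380 INFO stats | grep evicted_keys

If evicted_keys stays at 0, Redis never reached the point where it starts evicting, which could hint at the allocation failing for a reason other than maxmemory.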
Btw, does Icinga DB consume the Redis streams? Do you have the icingadb check configured and is it happy?
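If the streams are not being consumed, they tend to grow; a quick look at the lengths of the runtime streams (key names as used by Icinga DB) might already tell something:

$ icingadb-redis-cli -p 6380 XLEN icinga:runtime
$ icingadb-redis-cli -p 6380 XLEN icinga:runtime:state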
You could also try to find potential culprits in Redis using icingadb-redis-cli with either the --bigkeys or --memkeys option, as described in the Redis docs. Please feel free to share the output. Maybe something is stuck somewhere?
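For reference, the calls would look roughly like this; the -i option just adds a small sleep between SCAN batches to be gentle on a production instance:

$ icingadb-redis-cli -p 6380 --bigkeys -i 0.1
$ icingadb-redis-cli -p 6380 --memkeys -i 0.1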
Out of curiosity, how big is the Icinga setup connected to the Redis? How many hosts and services are there?
Please excuse my random questions, but so far I haven’t seen anything that gives me a concrete idea.
I mostly skimmed through this thread, but my first thought was that the save setting for Redis (which configures if and when to dump the Redis content to disk) might play a role here.
In some scenarios we had problems with that, since the dump requires significant resources. Especially in big setups the default threshold is too low; in one case it caused the dump process to run continuously (with a lot of IO and memory usage). That's why this part of the docs exists.
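For illustration, the directive in question looks like this in the Redis config; the numbers below are only an example of raising the threshold, not a recommendation for this particular setup:

# only snapshot if at least 10000 keys changed within 15 minutes
save 900 10000

# or disable RDB snapshots entirely
save ""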
Thanks for mentioning the save option, @lorenz. Earlier in this thread I had linked to this section of the docs, but maybe it got lost with the switch from @CertifiedRedisHater to @MadPat, which is why I did not mention it again.
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest hash found so far '"icinga:endpoint"' with 349 fields
[00.00%] Biggest stream found so far '"icinga:dump"' with 54 entries
[00.00%] Biggest hash found so far '"icinga:hostgroup:member"' with 1131 fields
[00.00%] Biggest hash found so far '"icinga:checksum:service:state"' with 10878 fields
[29.41%] Biggest hash found so far '"icinga:service:customvar"' with 43118 fields
[29.41%] Biggest hash found so far '"icinga:notification:recipient"' with 94595 fields
[29.41%] Biggest stream found so far '"icinga:runtime"' with 1000009 entries
[61.76%] Biggest zset found so far '"icinga:nextupdate:host"' with 493 members
[92.65%] Biggest hash found so far '"icinga:notification:customvar"' with 103952 fields
[92.65%] Biggest zset found so far '"icinga:nextupdate:service"' with 10109 members
-------- summary -------
Sampled 68 keys in the keyspace!
Total key length in bytes is 1594 (avg len 23.44)
Biggest hash found '"icinga:notification:customvar"' has 103952 fields
Biggest stream found '"icinga:runtime"' has 1000009 entries
Biggest zset found '"icinga:nextupdate:service"' has 10109 members
0 lists with 0 items (00.00% of keys, avg size 0.00)
54 hashs with 454818 fields (79.41% of keys, avg size 8422.56)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
12 streams with 2001065 entries (17.65% of keys, avg size 166755.42)
0 sets with 0 members (00.00% of keys, avg size 0.00)
2 zsets with 10602 members (02.94% of keys, avg size 5301.00)
Warning: Using a password with '-a' or '-u' option on the command line interface may not be safe.
# Scanning the entire keyspace to find biggest keys as well as
# average sizes per key type. You can use -i 0.1 to sleep 0.1 sec
# per 100 SCAN commands (not usually needed).
[00.00%] Biggest hash found so far '"icinga:endpoint"' with 145491 bytes
[00.00%] Biggest stream found so far '"icinga:dump"' with 9824 bytes
[00.00%] Biggest hash found so far '"icinga:hostgroup:member"' with 323288 bytes
[00.00%] Biggest hash found so far '"icinga:checksum:service:state"' with 1610608 bytes
[00.00%] Biggest hash found so far '"icinga:customvar"' with 6813745 bytes
[14.71%] Biggest hash found so far '"icinga:service:state"' with 14281294 bytes
[29.41%] Biggest hash found so far '"icinga:notification:recipient"' with 28181272 bytes
[29.41%] Biggest stream found so far '"icinga:runtime"' with 293504356 bytes
[45.59%] Biggest hash found so far '"icinga:notification"' with 39843968 bytes
[61.76%] Biggest zset found so far '"icinga:nextupdate:host"' with 61697 bytes
[61.76%] Biggest stream found so far '"icinga:runtime:state"' with 804043208 bytes
[92.65%] Biggest zset found so far '"icinga:nextupdate:service"' with 1409665 bytes
-------- summary -------
Sampled 68 keys in the keyspace!
Total key length in bytes is 1594 (avg len 23.44)
Biggest hash found '"icinga:notification"' has 39843968 bytes
Biggest stream found '"icinga:runtime:state"' has 804043208 bytes
Biggest zset found '"icinga:nextupdate:service"' has 1409665 bytes
0 lists with 0 bytes (00.00% of keys, avg size 0.00)
54 hashs with 173606930 bytes (79.41% of keys, avg size 3214943.15)
0 strings with 0 bytes (00.00% of keys, avg size 0.00)
12 streams with 1097595680 bytes (17.65% of keys, avg size 91466306.67)
0 sets with 0 bytes (00.00% of keys, avg size 0.00)
2 zsets with 1471362 bytes (02.94% of keys, avg size 735681.00)
~500 Hosts, ~2300 Services.
I’ve been hesitant to change that, because if this is not the issue we could end up with data loss. We have two masters, and the Redis instances on them have never failed concurrently. I am not deeply knowledgeable about how Icinga works, but is the data collected by one master shared with the other? If so, we could try this.
A week ago we had a crash and a coworker found a solution. Here is what he reported:
$ sysctl vm.max_map_count
vm.max_map_count = 65530 # maximum number of memory mappings per process; if exceeded, malloc() will fail
# 20 minutes after
$ cat /proc/$(pgrep icingadb-redis)/maps | wc -l
1087
# after 25 minutes
$ cat /proc/$(pgrep icingadb-redis)/maps | wc -l
1327
# around 240 new memory mappings that are not cleaned up within 5 minutes -> 48 per minute
# 65530/48 -> ~1365 min ≈ 22.7 h until Redis crashes if the mappings are never freed
$ sysctl -w vm.max_map_count=262144
# Redis is now allowed to create 4x as many mappings and will probably crash less often, if at all
# after 20 hours
$ cat /proc/$(pgrep icingadb-redis)/maps | wc -l
30620
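As a follow-up, to make the change survive a reboot and to keep watching whether the mappings still grow without bound, something along these lines should do (file name and interval are arbitrary):

# persist the raised limit
$ echo "vm.max_map_count = 262144" > /etc/sysctl.d/99-redis-maps.conf
$ sysctl --system

# sample the mapping count every 5 minutes
$ while true; do echo "$(date -Is) $(cat /proc/$(pgrep icingadb-redis)/maps | wc -l)"; sleep 300; done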