Investigate memory leak

Hi everyone,

We recently upgraded our infrastructure to switch to IcingaDB because we were facing performance issues with ido-pgsql. Since then, memory usage on the master has been slowly increasing, to the point of triggering the OOM killer. I’m looking for any hints that could help us stabilize our stack.

  • icinga2 --version:
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)

Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Debian GNU/Linux
  Platform version: 10 (buster)
  Kernel: Linux
  Kernel version: 5.15.26-grsec-zfs-classid
  Architecture: x86_64

Build information:
  Compiler: GNU 8.3.0
  Build host: runner-hh8q3bz2-project-575-concurrent-0
  OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
  • features:
Disabled features: command compatlog debuglog elasticsearch gelf graphite ido-pgsql influxdb influxdb2 journald livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker icingadb mainlog notification
  • icingaweb2 info:
Icinga Web 2 Version 	2.12.1
Git commit 	cd2daeb2cb8537c633d343a29eb76c54cd2ebbf2
PHP Version 	7.3.31-1~deb10u5
Git commit date 	2023-11-15

Loaded Libraries
icinga/icinga-php-library 	0.13.1
icinga/icinga-php-thirdparty 	0.12.1
Loaded Modules
icingadb 		1.1.1 
  • Config validation:
[2024-04-12 17:46:52 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-04-12 17:46:52 +0200] information/cli: Loading configuration file(s).
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/XXX' for unknown zone 'XXX'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/16' for unknown zone '16'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/22' for unknown zone '22'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/51' for unknown zone '51'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/60' for unknown zone '60'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/61' for unknown zone '61'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/dev' for unknown zone 'dev'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/na' for unknown zone 'na'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/202' for unknown zone '202'.
[2024-04-12 17:46:54 +0200] information/ConfigItem: Committing config item(s).
[2024-04-12 17:46:54 +0200] information/ApiListener: My API identity: master01.icinga
[2024-04-12 17:47:04 +0200] information/WorkQueue: #7 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2024-04-12 17:47:04 +0200] information/WorkQueue: #8 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2024-04-12 17:47:04 +0200] information/WorkQueue: #5 (DaemonUtility::LoadConfigFiles) items: 24, rate: 8.6/s (516/min 516/5min 516/15min); empty in 19825 days, 15 hours, 47 minutes and 4 seconds

[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 18439 Notifications.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 679 HostGroups.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 16726 Hosts.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 10 Downtimes.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 EventCommand.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 156 Comments.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 IcingaDB.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 30 Zones.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 3 Users.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 58 Endpoints.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 17 ApiUsers.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 232 CheckCommands.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 55 UserGroups.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 3 ServiceGroups.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 410982 Services.
[2024-04-12 17:47:35 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2024-04-12 17:47:35 +0200] information/cli: Finished validating the configuration file(s).
  • sysinfo:

VMWare host 12 cores E5-2689, 45GB mem, SSD datastore

  • Icinga Service tweaks:
# /etc/systemd/system/icinga2.service.d/limits.conf
# Icinga 2 sets some default values to extend OS defaults
#
# Please refer to our troubleshooting documentations for details
# and reasons on these values.
[Service]
TasksMax=infinity

# May also cause problems, uncomment if you have any
# LimitNPROC=62883

# /etc/systemd/system/icinga2.service.d/override.conf
[Service]
Environment=LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2
ExecStart=
ExecStart=/usr/sbin/icinga2 daemon --close-stdio -e ${ICINGA2_ERROR_LOG} -DConfiguration.Concurrency=8

# /etc/systemd/system/icinga2.service.d/systemd_override.conf
[Service]
Restart=on-failure
RestartSec=5
StartLimitInterval=60
StartLimitBurst=5
  • Packages:
  • icinga2-bin/icinga-buster,now 2.14.2-1+debian10 amd64 [installed]
  • icinga2-common/icinga-buster,now 2.14.2-1+debian10 all [installed]
  • icingadb-redis-server/icinga-buster,now 7.0.12-1+debian10 amd64 [installed,automatic]
  • icingadb/icinga-buster,now 1.1.1-1+debian10 amd64 [installed,upgradable to: 1.2.0-1+debian10]
  • icingaweb2/icinga-buster,now 2.12.1-1+debian10 all [installed]
  • libjemalloc2/oldoldstable,now 5.1.0-3 amd64 [installed,automatic]
  • Sidenote

We’re in the middle of changing how we manage checks, and the number of Services is currently half of the end target.


Moving to jemalloc helped a lot to decrease the memory usage growth rate, but we’re still seeing it grow over time, currently at roughly 1GB/hour.
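In case it helps anyone reproduce the growth-rate figure, here is a minimal Python sketch of one way to measure it, assuming a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line. The function names (`parse_vmrss_kb`, `growth_kb_per_hour`, `sample`) are made up for illustration and are not part of Icinga or any library.

```python
import re
import time
from pathlib import Path


def parse_vmrss_kb(status_text: str) -> int:
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found")
    return int(match.group(1))


def growth_kb_per_hour(samples):
    """Estimate growth from (timestamp_seconds, rss_kb) pairs, first vs last."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples at increasing times")
    return (r1 - r0) * 3600.0 / (t1 - t0)


def sample(pid: int, count: int, interval_s: float):
    """Collect `count` resident-set-size samples for `pid`, `interval_s` apart."""
    status = Path(f"/proc/{pid}/status")
    samples = []
    for i in range(count):
        samples.append((time.time(), parse_vmrss_kb(status.read_text())))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples
```

Something like `growth_kb_per_hour(sample(pid, 12, 300))` would then give an hourly estimate from one hour of samples. This tracks resident memory only, so it says how fast the process grows, not why.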

Any hint, or any pointer on where to start looking, would be appreciated.

Cheers


Edit:
Looking at the memory used, I can’t totally make sense of what I see, but then I’m no troubleshooting expert:

# pmap -p 1394  | head -n 20
1394:   /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log -DConfiguration.Concurrency=8
0000561a050a8000   2340K r---- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a052f1000   6844K r-x-- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a059a0000   3604K r---- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d26000    248K r---- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d64000     24K rw--- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d6a000     36K rw---   [ anon ]
00007f0069a00000 1726464K rw---   [ anon ]
00007f00d3100000 50238464K rw---   [ anon ]
00007f0ccd700000 19084288K rw---   [ anon ]
00007f115a500000 3755008K rw---   [ anon ]
00007f123f900000 1477632K rw---   [ anon ]
00007f1299d00000 179200K rw---   [ anon ]
00007f12a4d4a000 587264K rw---   [ anon ]
00007f12c8aca000      4K -----   [ anon ]
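To put numbers on the listing above, here is a minimal Python sketch that totals anonymous versus file-backed mappings. It assumes the plain `pmap -p` line layout shown here (address, size in K, permissions, name); `sum_pmap` and `LINE_RE` are hypothetical names used only for this example.

```python
import re

# Matches plain "pmap -p" mapping lines: address, size in KiB, permissions, name.
LINE_RE = re.compile(r"^([0-9a-f]{8,16})\s+(\d+)K\s+(\S+)\s+(.*)$")


def sum_pmap(text: str):
    """Return (anon_kib, file_kib) totals for the mapping lines in `text`."""
    anon_kib = 0
    file_kib = 0
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # header line ("1394:   /usr/...") or trailing total line
        size_kib = int(m.group(2))
        if m.group(4).strip() == "[ anon ]":
            anon_kib += size_kib
        else:
            file_kib += size_kib
    return anon_kib, file_kib
```

Fed the excerpt above, this adds up to roughly 73 GiB of anonymous mappings, which is more than the 45GB the VM has. These sizes are therefore mapping sizes (reserved address space), not resident memory. `pmap -x` adds an RSS column, but its layout differs from what this regex expects.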

So our memory usage is mainly from anon pages, most likely from mallocs. If I peek at what’s in there, at 00007f0069a00000 or 00007f123f900000 for example, I see data I recognize, but at 00007f00d3100000 I’m not sure what I’m looking at:

00000cc0  60 3b 5c a6 11 7f 00 00  d6 00 00 00 00 00 00 00  |`;\.............|
00000cd0  d6 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000ce0  20 3d 5c a6 11 7f 00 00  db 00 00 00 00 00 00 00  | =\.............|
00000cf0  db 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000d00  00 86 da a1 11 7f 00 00  70 03 00 00 00 00 00 00  |........p.......|
00000d10  70 03 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |p...............|
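To make the dump above a bit easier to read, here is a minimal Python sketch that regroups the bytes into little-endian 64-bit words, assuming the lines are in `hexdump -C` layout and the process is x86_64 (as the version output says). `decode_hexdump_words` is a hypothetical helper name.

```python
import struct


def decode_hexdump_words(dump: str):
    """Turn `hexdump -C` style lines into a list of little-endian 64-bit ints."""
    data = bytearray()
    for line in dump.splitlines():
        # Drop the ASCII column, then split into offset + hex byte fields.
        fields = line.split("|", 1)[0].split()
        if len(fields) < 2:
            continue
        # fields[0] is the offset; the rest are up to 16 hex byte values.
        data.extend(int(b, 16) for b in fields[1:17])
    usable = len(data) - len(data) % 8  # ignore a trailing partial word
    return [w for (w,) in struct.iter_unpack("<Q", bytes(data[:usable]))]
```

Read this way, the six lines form three 32-byte records: 0x7f11a65c3b60, 0xd6, 0xd6, 0; then 0x7f11a65c3d20, 0xdb, 0xdb, 0; then 0x7f11a1da8600, 0x370, 0x370, 0. The first word of each record falls inside the 00007f115a500000 mapping from the pmap listing, so it looks like a pointer followed by two equal small integers. That would be consistent with pointer/size/capacity records, but this is only a guess and not something I have confirmed.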

So we rolled back what we were doing and ended up with the same kind of deployment we had before any attempt. However, we are now running r2.14.2-1 with icingadb 1.1.1 (whereas we were relying on the monitoring feature before).

We had no problem before upgrading and switching to icingadb, but we are still seeing a memory leak over time, just more slowly with ~32k services than when we created ~1M services.