Hi everyone,
We recently upgraded our infrastructure and switched to IcingaDB because we were hitting performance issues with ido-pgsql. Since then, the memory usage of the master has been increasing slowly but steadily, to the point of triggering the OOM killer. I’m looking for any hints that could help us stabilize our stack.
- icinga2 --version:
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)
Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: Debian GNU/Linux
Platform version: 10 (buster)
Kernel: Linux
Kernel version: 5.15.26-grsec-zfs-classid
Architecture: x86_64
Build information:
Compiler: GNU 8.3.0
Build host: runner-hh8q3bz2-project-575-concurrent-0
OpenSSL version: OpenSSL 1.1.1d 10 Sep 2019
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var
Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid
- features:
Disabled features: command compatlog debuglog elasticsearch gelf graphite ido-pgsql influxdb influxdb2 journald livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker icingadb mainlog notification
- icingaweb2 info:
Icinga Web 2 Version 2.12.1
Git commit cd2daeb2cb8537c633d343a29eb76c54cd2ebbf2
PHP Version 7.3.31-1~deb10u5
Git commit date 2023-11-15
Loaded Libraries
icinga/icinga-php-library 0.13.1
icinga/icinga-php-thirdparty 0.12.1
Loaded Modules
icingadb 1.1.1
- Config validation:
[2024-04-12 17:46:52 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-04-12 17:46:52 +0200] information/cli: Loading configuration file(s).
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/XXX' for unknown zone 'XXX'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/16' for unknown zone '16'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/22' for unknown zone '22'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/51' for unknown zone '51'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/60' for unknown zone '60'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/61' for unknown zone '61'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/dev' for unknown zone 'dev'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/na' for unknown zone 'na'.
[2024-04-12 17:46:53 +0200] warning/config: Ignoring directory '/etc/icinga2/zones.d/202' for unknown zone '202'.
[2024-04-12 17:46:54 +0200] information/ConfigItem: Committing config item(s).
[2024-04-12 17:46:54 +0200] information/ApiListener: My API identity: master01.icinga
[2024-04-12 17:47:04 +0200] information/WorkQueue: #7 (ApiListener, RelayQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2024-04-12 17:47:04 +0200] information/WorkQueue: #8 (ApiListener, SyncQueue) items: 0, rate: 0/s (0/min 0/5min 0/15min);
[2024-04-12 17:47:04 +0200] information/WorkQueue: #5 (DaemonUtility::LoadConfigFiles) items: 24, rate: 8.6/s (516/min 516/5min 516/15min); empty in 19825 days, 15 hours, 47 minutes and 4 seconds
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 18439 Notifications.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 679 HostGroups.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 16726 Hosts.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 10 Downtimes.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 EventCommand.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 156 Comments.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 IcingaDB.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 30 Zones.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 3 Users.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 58 Endpoints.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 17 ApiUsers.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 232 CheckCommands.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 55 UserGroups.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 3 ServiceGroups.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2024-04-12 17:47:35 +0200] information/ConfigItem: Instantiated 410982 Services.
[2024-04-12 17:47:35 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2024-04-12 17:47:35 +0200] information/cli: Finished validating the configuration file(s).
- sysinfo:
VMware host, 12 cores (E5-2689), 45GB memory, SSD datastore
- Icinga Service tweaks:
# /etc/systemd/system/icinga2.service.d/limits.conf
# Icinga 2 sets some default values to extend OS defaults
#
# Please refer to our troubleshooting documentations for details
# and reasons on these values.
[Service]
TasksMax=infinity
# May also cause problems, uncomment if you have any
# LimitNPROC=62883
# /etc/systemd/system/icinga2.service.d/override.conf
[Service]
Environment=LD_PRELOAD=/lib/x86_64-linux-gnu/libjemalloc.so.2
ExecStart=
ExecStart=/usr/sbin/icinga2 daemon --close-stdio -e ${ICINGA2_ERROR_LOG} -DConfiguration.Concurrency=8
# /etc/systemd/system/icinga2.service.d/systemd_override.conf
[Service]
Restart=on-failure
RestartSec=5
StartLimitInterval=60
StartLimitBurst=5
- Packages:
- icinga2-bin/icinga-buster,now 2.14.2-1+debian10 amd64 [installed]
- icinga2-common/icinga-buster,now 2.14.2-1+debian10 all [installed]
- icingadb-redis-server/icinga-buster,now 7.0.12-1+debian10 amd64 [installed,automatic]
- icingadb/icinga-buster,now 1.1.1-1+debian10 amd64 [installed,upgradable to: 1.2.0-1+debian10]
- icingaweb2/icinga-buster,now 2.12.1-1+debian10 all [installed]
- libjemalloc2/oldoldstable,now 5.1.0-3 amd64 [installed,automatic]
- Sidenote
We’re in the middle of changing how we manage checks, and the number of Services is currently half of our end target.
Moving to jemalloc significantly reduced the rate at which memory usage grows, but it still grows over time, currently at roughly 1GB/hour.
Any hint, or any pointer on where to start looking, would be appreciated.
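For reference, a growth rate like the "roughly 1GB/hour" figure can be measured rather than eyeballed. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/status` exposes `VmRSS`; the helper names (`read_rss_kib`, `collect_samples`, `growth_rate_gib_per_hour`) are hypothetical and not part of Icinga 2 or any library.

```python
import time
from pathlib import Path


def read_rss_kib(pid):
    """Return the resident set size (VmRSS) of a process in KiB, read from /proc."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # the kernel reports this value in kB
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def collect_samples(pid, interval_s=60.0, count=10):
    """Sample VmRSS `count` times, `interval_s` seconds apart."""
    samples = []
    start = time.monotonic()
    for i in range(count):
        samples.append((time.monotonic() - start, read_rss_kib(pid)))
        if i < count - 1:
            time.sleep(interval_s)
    return samples


def growth_rate_gib_per_hour(samples):
    """Least-squares slope of (seconds, KiB) samples, converted to GiB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span a non-zero time range")
    kib_per_second = num / den
    return kib_per_second * 3600 / (1024 * 1024)


# Example (the PID is whatever the icinga2 daemon currently runs as):
# print(growth_rate_gib_per_hour(collect_samples(1394, interval_s=60, count=30)))
```

Fitting a slope over many samples, instead of comparing two readings, makes the number less sensitive to short spikes such as a config reload.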
Cheers
Edit:
Looking at the memory in use, I can’t fully make sense of what I see, but then again I’m no troubleshooting expert:
# pmap -p 1394 | head -n 20
1394: /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log -DConfiguration.Concurrency=8
0000561a050a8000 2340K r---- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a052f1000 6844K r-x-- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a059a0000 3604K r---- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d26000 248K r---- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d64000 24K rw--- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d6a000 36K rw--- [ anon ]
00007f0069a00000 1726464K rw--- [ anon ]
00007f00d3100000 50238464K rw--- [ anon ]
00007f0ccd700000 19084288K rw--- [ anon ]
00007f115a500000 3755008K rw--- [ anon ]
00007f123f900000 1477632K rw--- [ anon ]
00007f1299d00000 179200K rw--- [ anon ]
00007f12a4d4a000 587264K rw--- [ anon ]
00007f12c8aca000 4K ----- [ anon ]
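The listing above can be totalled per mapping name with a short script. This is only a sketch: the regular expression assumes the plain `pmap` line layout shown here (address, size in K, permissions, name), and `summarize_pmap` is a hypothetical helper. Note that the size column of plain `pmap` is the size of the mapping, not resident memory; `pmap -x` adds an RSS column. The anon mappings in this partial listing add up to about 73 GiB, which is more than the 45GB in the machine, so not all of it can be resident.

```python
import re

# Matches lines such as "00007f00d3100000 50238464K rw--- [ anon ]".
# The header line ("1394: /usr/lib/...") does not match and is skipped.
MAPPING_RE = re.compile(r"^([0-9a-f]+)\s+(\d+)K\s+(\S+)\s+(.*)$")

# Mapping lines copied from the listing above (first 20 lines of output only).
SAMPLE = """
0000561a05d64000 24K rw--- /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
0000561a05d6a000 36K rw--- [ anon ]
00007f0069a00000 1726464K rw--- [ anon ]
00007f00d3100000 50238464K rw--- [ anon ]
00007f0ccd700000 19084288K rw--- [ anon ]
00007f115a500000 3755008K rw--- [ anon ]
00007f123f900000 1477632K rw--- [ anon ]
00007f1299d00000 179200K rw--- [ anon ]
00007f12a4d4a000 587264K rw--- [ anon ]
00007f12c8aca000 4K ----- [ anon ]
"""


def summarize_pmap(text):
    """Return {mapping name: total mapped size in KiB} for pmap-style lines."""
    totals = {}
    for line in text.splitlines():
        match = MAPPING_RE.match(line.strip())
        if not match:
            continue
        _address, size_kib, _perms, name = match.groups()
        name = name.strip()
        totals[name] = totals.get(name, 0) + int(size_kib)
    return totals


totals = summarize_pmap(SAMPLE)
for name, kib in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{kib / (1024 * 1024):8.2f} GiB  {name}")
```

Feeding it the full `pmap` output instead of the first 20 lines would give the complete picture; comparing the totals with the RSS column of `pmap -x` shows how much of each mapping is actually resident.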
So our memory usage comes mainly from anon pages, most likely from mallocs. If I peek at what’s in there, e.g. at 00007f0069a00000 or 00007f123f900000, I see data I recognize, but at 00007f00d3100000 I’m not sure what I’m looking at:
00000cc0 60 3b 5c a6 11 7f 00 00 d6 00 00 00 00 00 00 00 |`;\.............|
00000cd0 d6 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000ce0 20 3d 5c a6 11 7f 00 00 db 00 00 00 00 00 00 00 | =\.............|
00000cf0 db 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000d00 00 86 da a1 11 7f 00 00 70 03 00 00 00 00 00 00 |........p.......|
00000d10 70 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |p...............|
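One way to read a dump like this is to decode it as little-endian 64-bit words, which is the natural word size on x86_64. The sketch below does that for the six rows above; `parse_rows` and `decode_records` are hypothetical helpers, and the 32-byte record size is an assumption based on the visible repetition. Each record then decodes to an address-like value, the same small integer twice, and a zero word. The address-like values fall inside the range of the mapping that starts at 00007f115a500000 in the pmap listing. That shape would be consistent with a pointer/length/capacity style record, but that is only a guess from six rows.

```python
import struct

# The six rows from the dump above, without the ASCII column.
DUMP = """
00000cc0 60 3b 5c a6 11 7f 00 00 d6 00 00 00 00 00 00 00
00000cd0 d6 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000ce0 20 3d 5c a6 11 7f 00 00 db 00 00 00 00 00 00 00
00000cf0 db 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00000d00 00 86 da a1 11 7f 00 00 70 03 00 00 00 00 00 00
00000d10 70 03 00 00 00 00 00 00 00 00 00 00 00 00 00 00
"""


def parse_rows(dump):
    """Turn rows of 'offset + 16 hex bytes' into a single bytes object."""
    data = bytearray()
    for row in dump.splitlines():
        # Drop an optional trailing "|ascii|" column before splitting.
        fields = row.split("|", 1)[0].split()
        if len(fields) < 17:
            continue  # blank or truncated row
        data += bytes(int(byte, 16) for byte in fields[1:17])
    return bytes(data)


def decode_records(data, record_size=32):
    """Decode consecutive records as tuples of little-endian unsigned 64-bit words."""
    fmt = f"<{record_size // 8}Q"
    usable = len(data) - len(data) % record_size
    return [struct.unpack_from(fmt, data, offset)
            for offset in range(0, usable, record_size)]


records = decode_records(parse_rows(DUMP))
for words in records:
    print("  ".join(f"{word:#018x}" for word in words))
```

Running the same decoding over a larger slice of the region would show whether the pattern holds throughout or only in this window.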