Enable opentelemetry observability/traces on icinga2 core

Please describe your problem as detailed as possible and don’t forget to use a meaningful title

Hi, I have been running into ‘random’ out-of-memory kills of the icinga2 core, e.g.:


[Sun Nov  6 23:50:01 2022] Killed process 1491 (icinga2), UID 997, total-vm:202536kB, anon-rss:23584kB, file-rss:0kB, shmem-rss:0kB

[Wed Nov 16 03:02:44 2022] Killed process 27648 (icinga2), UID 996, total-vm:14297888kB, anon-rss:13853948kB, file-rss:0kB, shmem-rss:0kB

I cannot find anything relevant or unusual in the logs that would help isolate the root cause of the failure (or point to a resolution). The main log shows no meaningful errors, and the ‘random’ nature of the failures (they are not related to configuration updates/reloads) makes it impractical to keep the debug log enabled in the production environment.

Given that, I was wondering whether there is a way to enable OpenTelemetry observability in the icinga2 core, or anything similar, to capture traces that could help understand/troubleshoot the root cause of the failure whenever it happens again?

Give as much information as you can, e.g.

  • Version used (icinga2 --version)

icinga2 --version

icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.2-1)

Copyright (c) 2012-2022 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later https://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 3.10.0-1160.76.1.el7.x86_64
Architecture: x86_64

Build information:
Compiler: GNU 4.8.5
Build host: runner-hh8q3bz2-project-322-concurrent-0
OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017

Application information:

General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2

Old paths (deprecated):
Installation root: /usr
Sysconf directory: /etc
Run directory (base): /run
Local state directory: /var

Internal paths:
Package data directory: /usr/share/icinga2
State path: /var/lib/icinga2/icinga2.state
Modified attributes path: /var/lib/icinga2/modified-attributes.conf
Objects path: /var/cache/icinga2/icinga2.debug
Vars path: /var/cache/icinga2/icinga2.vars
PID path: /run/icinga2/icinga2.pid

  • Operating System and version
# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
  • Enabled features (icinga2 feature list)
# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-mysql influxdb mainlog notification
  • Icinga Web 2 version and modules (System - About) - NA
  • Config validation (icinga2 daemon -C)
icinga2 daemon -C
[2022-11-21 07:49:25 -0500] information/cli: Icinga application loader (version: 2.13.2-1)
[2022-11-21 07:49:25 -0500] information/cli: Loading configuration file(s).
[2022-11-21 07:49:26 -0500] information/ConfigItem: Committing config item(s).
[2022-11-21 07:49:26 -0500] information/ApiListener: My API identity: icinga-1.corp-apps.com
[2022-11-21 07:49:36 -0500] information/WorkQueue: #5 (InfluxdbWriter, influxdb) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2022-11-21 07:49:36 -0500] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 8, rate: 12.9333/s (776/min 776/5min 776/15min); empty in 19317 days, 12 hours, 49 minutes and 36 seconds
[2022-11-21 07:49:36 -0500] information/WorkQueue: #7 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2022-11-21 07:49:36 -0500] information/WorkQueue: #8 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2022-11-21 07:49:46 -0500] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 8, rate: 12.9333/s (776/min 776/5min 776/15min); empty in infinite time, your task handler isn't able to keep up

Several Suppressed ApplyRule warnings of type
[2022-11-21 07:51:38 -0500] warning/ApplyRule:... for type 'Service' does not match anywhere!

[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 NotificationComponent.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 2 Users.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 51 ServiceGroups.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 6 TimePeriods.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 103055 Services.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 53 Zones.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 192 ScheduledDowntimes.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 3 NotificationCommands.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 369 HostGroups.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 144616 Notifications.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1427 Downtimes.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 138679 Dependencies.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 7115 Hosts.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 EventCommand.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 67 Endpoints.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 7788 Comments.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 21 ApiUsers.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 464 CheckCommands.
[2022-11-21 07:51:38 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2022-11-21 07:51:39 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2022-11-21 07:51:39 -0500] information/cli: Finished validating the configuration file(s).
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes

This is an HA master cluster. Details can be provided on demand. The counts of endpoints, zones, etc. are visible in the config validation output.

*Additional info: jemalloc is enabled

# rpm -qa | egrep "ici|jem"
icinga2-2.13.2-1.el7.icinga.x86_64
icinga2-common-2.13.2-1.el7.icinga.x86_64
icinga2-bin-2.13.2-1.el7.icinga.x86_64
icinga2-ido-mysql-2.13.2-1.el7.icinga.x86_64
jemalloc-3.6.0-1.el7.x86_64
# cat /etc/sysconfig/icinga2
#Mananged by puppet
#This is the default environment Icinga 2 runs with.
#Make your changes here.
#DAEMON=/usr/sbin/icinga2
#ICINGA2_CONFIG_FILE=/etc/icinga2/icinga2.conf
#ICINGA2_INIT_RUN_DIR=/run/icinga2
#ICINGA2_PID_FILE=/run/icinga2/icinga2.pid
#ICINGA2_LOG_DIR=/var/log/icinga2
#ICINGA2_ERROR_LOG=/var/log/icinga2/error.log
#ICINGA2_STARTUP_LOG=/var/log/icinga2/startup.log
#ICINGA2_LOG=/var/log/icinga2/icinga2.log
#ICINGA2_CACHE_DIR=/var/cache/icinga2
#ICINGA2_USER=icinga
#ICINGA2_GROUP=icinga
#ICINGA2_COMMAND_GROUP=icingacmd
LD_PRELOAD=/usr/lib64/libjemalloc.so.1

Hardware specs

 free -h
              total        used        free      shared  buff/cache   available
Mem:            15G        2.6G        8.4G        784M        4.6G         11G
Swap:            0B          0B          0B

grep proc /proc/cpuinfo | tail
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7

Hello Pedro!

While you work on actually resolving the issue, you could try

[Service]
Restart=always

in Icinga’s systemd unit file as a workaround. See

on how to edit the file. Almost as dirty as

but should save you some headache for now.

To actually find the cause you should graph the memory usage and locate the “tipping point” on the timeline. Then you can compare Icinga 2 logs before/after that point, maybe you find something.
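
A minimal sketch of that sampling, assuming a Linux host with /proc and that every relevant process reports `icinga2` as its command name. The output path `/var/tmp/icinga2-rss.csv` and the function names are made up for illustration. Summing VmRSS over several processes counts shared pages more than once, so read the numbers as a trend rather than an exact figure.

```python
import csv
import re
import time
from datetime import datetime
from pathlib import Path

OUT_CSV = Path("/var/tmp/icinga2-rss.csv")  # hypothetical output location


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def icinga2_rss_kb(proc_root=Path("/proc")):
    """Sum VmRSS (kB) over all processes whose command name is 'icinga2'."""
    total = 0
    for comm in proc_root.glob("[0-9]*/comm"):
        try:
            if comm.read_text().strip() != "icinga2":
                continue
            total += parse_vmrss_kb((comm.parent / "status").read_text()) or 0
        except OSError:
            continue  # process exited between listing and reading
    return total


def sample_forever(interval_s=60):
    """Append one 'timestamp,rss_kb' row per interval; runs until interrupted."""
    while True:
        now = datetime.now().isoformat(timespec="seconds")
        with OUT_CSV.open("a", newline="") as fh:
            csv.writer(fh).writerow([now, icinga2_rss_kb()])
        time.sleep(interval_s)
```

Calling `sample_forever()` from a small wrapper script (or a cron job that writes a single row) gives a CSV you can plot and line up against the timestamps in /var/log/icinga2/icinga2.log.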

Best,
A/K

Thanks for the workaround. It should certainly help keep the application ‘running’ and prevent replication logs from building up under /var/lib/icinga2/api/logs on the peer master node (with the resulting disk space pressure). I will dig into the logs and exact timestamps while we discuss the systemd file update (a Puppet customization) internally.

No clues on instrumenting the application, right?

Best,

No clues at all tbh. The more advanced version would be (given that the memory consumption is increasing slowly enough and there’s enough RAM) to monitor the memory consumption and to conditionally let Icinga 2 reload.
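
A rough sketch of such a conditional reload, under several assumptions: a Linux host with /proc, a systemd-managed `icinga2` service that supports `systemctl reload`, and a threshold and cooldown (10 GiB, 30 minutes) that are placeholders to be tuned, not recommendations. The function names are hypothetical.

```python
import subprocess
import time
from pathlib import Path

LIMIT_KB = 10 * 1024 * 1024  # hypothetical threshold: 10 GiB on the 15G host
COOLDOWN_S = 1800            # hypothetical: at most one reload per 30 minutes


def icinga2_rss_kb(proc_root=Path("/proc")):
    """Sum VmRSS (kB) over all processes whose command name is 'icinga2'."""
    total = 0
    for comm in proc_root.glob("[0-9]*/comm"):
        try:
            if comm.read_text().strip() != "icinga2":
                continue
            for line in (comm.parent / "status").read_text().splitlines():
                if line.startswith("VmRSS:"):
                    total += int(line.split()[1])
        except (OSError, ValueError, IndexError):
            continue  # process vanished or the line was not in the expected form
    return total


def should_reload(rss_kb, limit_kb, last_reload, now, cooldown_s):
    """True if memory is at or over the limit and the cooldown has elapsed."""
    if rss_kb < limit_kb:
        return False
    return last_reload is None or now - last_reload >= cooldown_s


def watch(interval_s=60):
    """Poll memory usage and trigger a reload when the limit is exceeded."""
    last_reload = None
    while True:
        now = time.monotonic()
        if should_reload(icinga2_rss_kb(), LIMIT_KB, last_reload, now, COOLDOWN_S):
            # Assumes the service is managed by systemd and supports reload.
            subprocess.run(["systemctl", "reload", "icinga2"], check=False)
            last_reload = now
        time.sleep(interval_s)
```

The cooldown is there so that a limit set too low cannot put the daemon into a reload loop. Whether a reload actually releases the memory depends on where it is being held, so this is a stopgap like `Restart=always`, not a fix.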
