Icinga crashes. No error message in logs except PID xxxxx was terminated by signal 9

Hello Icinga Community,
I am experiencing Icinga crashes on my HA system. I have two master nodes checking 2625 hosts and 16517 services. Everything was working without any problems, then all of a sudden the application crashed. The icinga2.log files do not show any error messages. The only thing I notice in the logs is a lot of ‘PID xxxxx was terminated by signal 9’ messages. Soon after these, I see a lot of ‘client disconnect’ messages. What could be causing my system to crash? I have not made any major changes; I have only added more hosts and services to check.

Has anyone experienced this problem? Thanks in advance for your help.
Alex

icinga2 -V

[2021-12-13 13:57:12 -0500] warning/Application: Failed to adjust resource limit for open file handles (RLIMIT_NOFILE) with error "Operation not permitted"
[2021-12-13 13:57:13 -0500] warning/Application: Failed to adjust resource limit for open file handles (RLIMIT_NOFILE) with error "Operation not permitted"
icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.11-1)

Copyright (c) 2012-2021 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux Server
  Platform version: 7.7 (Maipo)
  Kernel: Linux
  Kernel version: 3.10.0-1160.49.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-507-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

icinga2 daemon -C

[2021-12-13 13:59:02 -0500] warning/Application: Failed to adjust resource limit for open file handles (RLIMIT_NOFILE) with error "Operation not permitted"
[2021-12-13 13:59:02 -0500] warning/Application: Failed to adjust resource limit for open file handles (RLIMIT_NOFILE) with error "Operation not permitted"
[2021-12-13 13:59:02 -0500] information/cli: Icinga application loader (version: 2.11.11-1)
[2021-12-13 13:59:02 -0500] information/cli: Loading configuration file(s).
[2021-12-13 13:59:35 -0500] information/ConfigItem: Committing config item(s).
[2021-12-13 13:59:36 -0500] information/ApiListener: My API identity: USDCPVDAS120.prod.sunchemical.com
[2021-12-13 13:59:37 -0500] warning/globals.getHostGeoLocation: Cannot find 'USBC' in GeoLocationShort
[2021-12-13 13:59:37 -0500] warning/globals.getHostGeoLocation: Cannot find 'USEP' in GeoLocationShort
[2021-12-13 13:59:37 -0500] warning/globals.getHostGeoLocation: Cannot find 'USBC' in GeoLocationShort
[2021-12-13 13:59:37 -0500] warning/globals.getHostGeoLocation: Cannot find 'USBC' in GeoLocationShort
[2021-12-13 13:59:37 -0500] warning/globals.getHostGeoLocation: Cannot find 'USBC' in GeoLocationShort
[2021-12-13 13:59:37 -0500] warning/globals.getHostGeoLocation: Cannot find 'USBC' in GeoLocationShort
[2021-12-13 13:59:38 -0500] warning/globals.getHostGeoLocation: Cannot find 'USEP' in GeoLocationShort
[2021-12-13 13:59:45 -0500] information/WorkQueue: #4 (DaemonUtility::LoadConfigFiles) items: 0, rate: 4.4/s (264/min 264/5min 264/15min);
[2021-12-13 13:59:45 -0500] information/WorkQueue: #5 (GraphiteWriter, graphite) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2021-12-13 13:59:46 -0500] information/WorkQueue: #7 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2021-12-13 13:59:46 -0500] information/WorkQueue: #8 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2021-12-13 14:01:45 -0500] warning/ApplyRule: Apply rule 'scheduled_task_last_ran_month' (in /etc/icinga2/zones.d/global-templates/services-windows-others.conf: 35:1-35:124) for type 'Service' does not match anywhere!
[2021-12-13 14:01:45 -0500] warning/ApplyRule: Apply rule 'AD_User_Eventlog_Check_2019' (in /etc/icinga2/zones.d/global-templates/Application_Groups/ADDC/services.conf: 35:1-35:43) for type 'Service' does not match anywhere!
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 LivestatusListener.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 208 Users.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 13 ServiceGroups.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 4 TimePeriods.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 783 Zones.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 16484 Services.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 14756 ScheduledDowntimes.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 28619 Notifications.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 50393 Dependencies.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 2 NotificationCommands.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 17 Comments.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 782 Endpoints.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 241 HostGroups.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 2609 Hosts.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 EventCommand.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 7533 Downtimes.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 247 CheckCommands.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 FileLogger.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 5 ApiUsers.
[2021-12-13 14:01:45 -0500] information/ConfigItem: Instantiated 1 ApiListener.
[2021-12-13 14:01:45 -0500] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-12-13 14:01:45 -0500] information/cli: Finished validating the configuration file(s).

We frequently run out of memory due to the number of API calls we make, and we have a watchdog daemon that restarts Icinga2 when that happens. It’s been a while since it last happened (we upped the specs on the VM after upgrading our KVM host), but I think I recall this “silently getting killed” behavior from when it was happening.

I found this error on master1 in the icinga2.log file.

When this error happened, master2 was the ‘Active Endpoint’. Soon after this error, master2 stopped working and master1 took over as the ‘Active Endpoint’.

[2021-12-13 16:01:49 -0500] warning/JsonRpcConnection: Error while sending JSON-RPC message for identity 'master2'
Error: Connection reset by peer

Stacktrace:
 0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2
 1# icinga::NetString::WriteStringToStream(std::shared_ptr<icinga::AsioTlsStream> const&, icinga::String const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2
 2# icinga::JsonRpc::SendRawMessage(std::shared_ptr<icinga::AsioTlsStream> const&, icinga::String const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2
 3# icinga::JsonRpcConnection::WriteOutgoingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2
 4# 0x0000000000A8D717 in /usr/lib64/icinga2/sbin/icinga2
 5# 0x0000000000A95B3A in /usr/lib64/icinga2/sbin/icinga2
 6# make_fcontext in /lib64/libboost_context.so.1.69.0

[2021-12-13 16:01:49 -0500] warning/JsonRpcConnection: API client disconnected for identity 'master2'
[2021-12-13 16:01:49 -0500] warning/ApiListener: Removing API client for endpoint 'master2'. 0 API clients left.

Are there corresponding OOM killer messages in your system log? Or do you have sar enabled so you can view memory usage (likely sar -r)?
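One way to check for OOM killer activity is to scan the system log for the kernel's kill messages. The following is a minimal sketch, not a definitive tool. It assumes the kernel writes lines containing "Killed process <pid> (<name>)", which is the wording I would expect from RHEL 7 kernels but which may differ on other versions, and that the system log is at /var/log/messages. The function names are made up for illustration.

```python
import re
from collections import Counter

# Assumed wording of the kernel's OOM-killer message; adjust if your kernel differs.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(log_lines):
    """Return a list of (pid, process_name) for each OOM kill found in log_lines."""
    kills = []
    for line in log_lines:
        match = KILLED_RE.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


def summarize(kills):
    """Count how often each process name was killed."""
    return Counter(name for _pid, name in kills)


# Example use (reading the log usually requires root):
#
#   with open("/var/log/messages", errors="replace") as handle:
#       for name, count in summarize(find_oom_kills(handle)).most_common():
#           print(f"{name}: killed {count} time(s)")
```

If icinga2 or its check plugins show up in the summary around the time of the crash, memory pressure is a likely cause. An empty result does not rule out other sources of signal 9.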

Hi,

The signal 9 issue: yes, I recall we had that too. We never completely figured out what was causing it. We upgraded to the newest version of icinga2 and increased the hardware resources (memory, CPU). Another team also upgraded the VMware host, and we enabled a VMware option, something like “use memory directly from the host”. After that we never had the signal 9 issue again. Do you have another monitoring system for your master environment? I would recommend checking the complete environment.

I am following up on this post. I believe the problem is solved now. Icinga has not crashed in the last 7-10 days.
I added 2 more CPUs to master 2 (host in VMware). This did not solve the problem. I experienced another crash.
On master 2, I deleted all the files under ‘/var/lib/icinga2/api/zones’ and ‘/var/lib/icinga2/api/zones-stage’. I started master 2 up and it has not crashed yet.

Hopefully this will help someone else out.
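For anyone repeating the cleanup described above, here is a minimal sketch of it, under some assumptions. The two directory paths are the ones named in the post. The helper name and the dry-run default are my own additions for illustration. I am also assuming Icinga 2 is stopped on that node before anything is deleted, and that the contents are rebuilt on the next start, which I have not verified for every setup.

```python
import shutil
from pathlib import Path

# Directories named in the post; adjust if your data directory differs.
API_DIRS = [
    Path("/var/lib/icinga2/api/zones"),
    Path("/var/lib/icinga2/api/zones-stage"),
]


def clear_directory_contents(directory, dry_run=True):
    """Remove everything inside `directory` but keep the directory itself.

    Returns the paths that were removed (or would be removed when dry_run
    is True). A missing directory yields an empty list.
    """
    directory = Path(directory)
    removed = []
    if not directory.is_dir():
        return removed
    for entry in sorted(directory.iterdir()):
        removed.append(entry)
        if dry_run:
            continue
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return removed


# Example use, with Icinga 2 stopped on this node:
#
#   for api_dir in API_DIRS:
#       for path in clear_directory_contents(api_dir, dry_run=True):
#           print("would remove", path)
```

Running it with dry_run=True first shows what would go away. Switching to dry_run=False performs the deletion, so a backup of /var/lib/icinga2/api beforehand seems prudent.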