Icinga dies with SEGV when the load on the REST API is high

j4nd3r53n · February 27, 2019, 11:41am

I use the REST API extensively to update passive checks, and we have a very large number of them: about 10,000 at the moment (4 checks from each of 2,500 systems). This appears to kill the Icinga server:

bash-4.2# systemctl status icinga2
● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Wed 2019-02-27 11:16:22 GMT; 3min 51s ago
  Process: 5896 ExecStart=/usr/sbin/icinga2 daemon -d -e ${ICINGA2_ERROR_LOG} (code=exited, status=0/SUCCESS)
  Process: 4007 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2 (code=exited, status=0/SUCCESS)
 Main PID: 5955 (code=killed, signal=SEGV)

Feb 27 09:16:31 cx1-admin icinga2[5896]: [2019-02-27 09:16:31 +0000] information/ConfigItem: Instantiated 1 User.
Feb 27 09:16:31 cx1-admin icinga2[5896]: [2019-02-27 09:16:31 +0000] information/ConfigItem: Instantiated 3 ServiceGroups.
Feb 27 09:16:31 cx1-admin icinga2[5896]: [2019-02-27 09:16:31 +0000] information/ConfigItem: Instantiated 5184 Services.
Feb 27 09:16:31 cx1-admin icinga2[5896]: [2019-02-27 09:16:31 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
Feb 27 09:16:31 cx1-admin icinga2[5896]: [2019-02-27 09:16:31 +0000] information/ScriptGlobal: Dumping variables to file '...2.vars'
Feb 27 09:16:32 cx1-admin systemd[1]: Started Icinga host/service/network monitoring system.
Feb 27 11:16:22 cx1-admin icinga2[5896]: [2019-02-27 09:16:31 +0000] inf
Feb 27 11:16:22 cx1-admin systemd[1]: icinga2.service: main process exited, code=killed, status=11/SEGV
Feb 27 11:16:22 cx1-admin systemd[1]: Unit icinga2.service entered failed state.
Feb 27 11:16:22 cx1-admin systemd[1]: icinga2.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

I have tried to spread the updates out a bit, but it doesn’t make a difference. The logs don’t show anything, which isn’t surprising, since the process was killed by a signal. SEGV is a buffer overflow, I assume?

What is the best way to troubleshoot this?
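For anyone wondering what "spreading the updates out" could look like in practice, here is a minimal Python sketch. It is an illustration and not the poster's actual script, and pacing alone is not expected to fix a segfault. The endpoint `/v1/actions/process-check-result` is the Icinga 2 API action for passive results; the function names, the batch size, the pause length and the caller-supplied `send` callable (which would perform the authenticated HTTPS POST) are all assumptions made for the example.

```python
import json
import time
from typing import Callable, Iterator, Sequence, Tuple

# Icinga 2 API action used for submitting passive check results.
ACTION_PATH = "/v1/actions/process-check-result"

# One result: (host name, service name, exit status, plugin output).
Result = Tuple[str, str, int, str]


def build_payload(host: str, service: str, exit_status: int, output: str) -> dict:
    """Build the JSON body for one passive service check result."""
    return {
        "type": "Service",
        # filter_vars keeps host and service names out of the filter expression.
        "filter": "host.name == host_name && service.name == service_name",
        "filter_vars": {"host_name": host, "service_name": service},
        "exit_status": exit_status,
        "plugin_output": output,
    }


def chunked(items: Sequence[Result], size: int) -> Iterator[Sequence[Result]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def submit_paced(
    results: Sequence[Result],
    send: Callable[[str, str], None],  # hypothetical: performs the HTTP POST
    batch_size: int = 100,             # assumed value, tune for your setup
    pause_seconds: float = 1.0,        # assumed value, tune for your setup
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send results in batches, pausing between batches. Returns the count sent."""
    sent = 0
    for index, batch in enumerate(chunked(results, batch_size)):
        if index > 0:
            sleep(pause_seconds)  # pause only between batches, not before the first
        for host, service, exit_status, output in batch:
            send(ACTION_PATH, json.dumps(build_payload(host, service, exit_status, output)))
            sent += 1
    return sent
```

Passing `send` and `sleep` in as parameters keeps the pacing logic testable without a running Icinga instance; a real `send` would POST the body to the API port with an `Accept: application/json` header and API user credentials.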

aflatto · February 27, 2019, 1:26pm

Have you enabled the debug log?
Anything else in the journal logs?
Can you share the specification of the server (CPU, RAM, NIC stats, etc.)?

j4nd3r53n · February 27, 2019, 3:11pm

Hi Assaf,

Thanks for getting back to me. As a matter of fact, I’ve just restored it to some semblance of sanity, it seems: it turns out the cache was full of odd rubbish. I found that out when I shut everything down and ran icinga2 daemon -C. Things are looking a lot better, for now at least.
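The validation step mentioned above can be wrapped in a small script, for example to run it before every restart. This is only a sketch: it assumes `icinga2` is on the PATH and that a non-zero exit status means validation failed, and the function name `validate_config` is made up for the example.

```python
import subprocess
from typing import Sequence, Tuple


def validate_config(
    command: Sequence[str] = ("icinga2", "daemon", "-C"),
) -> Tuple[bool, str]:
    """Run the config validation command and return (succeeded, combined output)."""
    completed = subprocess.run(
        list(command),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so warnings and errors stay in order
        text=True,
        check=False,               # inspect the return code instead of raising
    )
    return completed.returncode == 0, completed.stdout
```

Calling `validate_config()` with no arguments runs `icinga2 daemon -C`; the output is returned rather than printed, so it can be logged or attached to a bug report.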

dnsmichi · March 11, 2019, 2:44pm

Icinga will attempt to write a crash log to /var/log/icinga2/crash if it is able to. If you’ve got gdb installed, it will also attempt to log a stack trace. A crash like this clearly should not happen; if you find such a crash report, open an issue on GitHub including all the verbose output plus everything required to reproduce the issue.

Cheers,
Michael
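To check quickly whether any crash reports were written, something like the following Python sketch could be used. The directory is the one named in the post above; the function name and the `limit` parameter are assumptions for illustration, and no particular file naming scheme is assumed.

```python
from pathlib import Path
from typing import List

# Crash log directory as named in the thread.
CRASH_DIR = Path("/var/log/icinga2/crash")


def newest_crash_reports(crash_dir: Path = CRASH_DIR, limit: int = 5) -> List[Path]:
    """Return up to `limit` regular files from crash_dir, newest first by mtime."""
    if not crash_dir.is_dir():
        return []  # nothing written yet, or a different install layout
    files = [entry for entry in crash_dir.iterdir() if entry.is_file()]
    files.sort(key=lambda entry: entry.stat().st_mtime, reverse=True)
    return files[:limit]
```

An empty list means either that no crash report was written or that the directory does not exist on this system; reading the directory may require root or the icinga user.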