Passive checks still grinding to a halt

This follows on from "Passive checks not updating for 2 days", where I found that icinga had stopped updating the web pages. The same has happened again - I can see from the cron log that the updates start getting <Response [404]> at 15:15, and never recover. I thought I had fixed this yesterday by deleting the files in /var/lib/icinga2/api/packages/_api/cx1-admin-1521551477-1/conf.d/*/, but the fix was only temporary.

I suppose I could simply clear it all out and reinstall from scratch, but I’d really like to understand what the problem is - I must have done something to mess up the configuration, and I’d like to avoid doing it again in the future. It would be great if one of the kind developers who have already helped me would guide me a bit with the troubleshooting.

I will now go away and follow the steps in the official troubleshooting guide, and then update here in an edit.

===EDIT===

The setup: 1 master zone, 2 satellite zones. At the moment I work with only the master and one of the satellite zones. I have a small number of passive check scripts (currently 8), but they run on more than 1500 client systems, so the total number of service objects amounts to more than 12000. Each time a passive check runs, it first communicates with the master directly: it asks ‘does this Host object exist?’ and creates it if not, then asks ‘does this Service object exist?’ and creates it if not. It then reports the result of the check to the satellite zone. This seemed to work reliably, at least for a few weeks.
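To make the request pattern concrete, here is a minimal sketch of one passive check run as described above. It is not my actual script: `send_master` and `send_satellite` are hypothetical callables that perform the HTTP request and return the status code, and the "dummy" check command is only an assumed placeholder for the created objects.

```python
# Sketch of one passive check run: three round trips in the normal case.
# send_master / send_satellite are hypothetical callables:
#   send(method, path, body) -> HTTP status code (int)

def check_result_body(host, service, exit_status, output):
    # Body for POST /v1/actions/process-check-result
    return {
        "type": "Service",
        "filter": 'host.name=="%s" && service.name=="%s"' % (host, service),
        "exit_status": exit_status,
        "plugin_output": output,
    }


def object_attrs():
    # "dummy" is an assumed placeholder check command for passive objects
    return {"attrs": {"check_command": "dummy"}}


def run_passive_check(send_master, send_satellite,
                      host, service, exit_status, output):
    host_path = "/v1/objects/hosts/" + host
    service_path = "/v1/objects/services/" + host + "!" + service

    # 1. Does the Host object exist on the master? If not, create it.
    if send_master("GET", host_path, None) == 404:
        send_master("PUT", host_path, object_attrs())

    # 2. Does the Service object exist on the master? If not, create it.
    if send_master("GET", service_path, None) == 404:
        send_master("PUT", service_path, object_attrs())

    # 3. Report the check result to the satellite.
    return send_satellite(
        "POST", "/v1/actions/process-check-result",
        check_result_body(host, service, exit_status, output))
```

With 8 checks on more than 1500 clients, this pattern means every run sends two lookups to the master before the result is ever reported.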

So, when I start afresh, I expect the web pages to be empty at first and then to populate with service and host data.

For a little under a week now, I have been seeing this fail - the updates stop coming in, and in the cron log I see ‘<Response [404]>’. When this happens, the debug log contains:

[2019-03-11 10:18:23 +0000] critical/ApiListener: Could not create object 'cx1-141-15-3.cx1.hpc.ic.ac.uk':
[2019-03-11 10:18:23 +0000] critical/ApiListener: Configuration file '/var/lib/icinga2/api/packages/_api/cx1-admin-1521551477-1/conf.d/hosts/cx1-141-15-3.cx1.hpc.ic.ac.uk.conf' already exists.

I then stop icinga on the satellite, delete the files in /var/lib/icinga2/api/packages/_api/cx1-admin-1521551477-1/conf.d/*/, start icinga again, and things appear to work for a while. On occasion icinga dies, unfortunately without leaving a core dump:

bash-4.2# systemctl status icinga2
● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; disabled; vendor preset: disabled)
   Active: failed (Result: signal) since Tue 2019-03-12 09:17:20 GMT; 2min 13s ago
  Process: 29294 ExecStart=/usr/sbin/icinga2 daemon -d -e ${ICINGA2_ERROR_LOG} (code=exited, status=0/SUCCESS)
  Process: 29215 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/sysconfig/icinga2 (code=exited, status=0/SUCCESS)
 Main PID: 29353 (code=killed, signal=SEGV)

Mar 11 15:05:01 cx1-admin icinga2[29294]: [2019-03-11 15:05:01 +0000] information/ConfigItem: Instantiated 1 User.
Mar 11 15:05:01 cx1-admin icinga2[29294]: [2019-03-11 15:05:01 +0000] information/ConfigItem: Instantiated 12 ServiceGroups.
Mar 11 15:05:01 cx1-admin icinga2[29294]: [2019-03-11 15:05:01 +0000] information/ConfigItem: Instantiated 4 Services.
Mar 11 15:05:01 cx1-admin icinga2[29294]: [2019-03-11 15:05:01 +0000] information/ConfigItem: Instantiated 1 CheckerComponent.
Mar 11 15:05:01 cx1-admin icinga2[29294]: [2019-03-11 15:05:01 +0000] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
Mar 11 15:05:02 cx1-admin systemd[1]: Started Icinga host/service/network monitoring system.
Mar 12 09:17:20 cx1-admin icinga2[29294]: [2019-03-11 15:05:01 +0000] informati
Mar 12 09:17:20 cx1-admin systemd[1]: icinga2.service: main process exited, code=killed, status=11/SEGV
Mar 12 09:17:20 cx1-admin systemd[1]: Unit icinga2.service entered failed state.
Mar 12 09:17:20 cx1-admin systemd[1]: icinga2.service failed.

Icinga was installed with yum from http://packages.icinga.com

bash-4.2# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.8.1-1)

Copyright (c) 2012-2017 Icinga Development Team (https://www.icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Application information:
  Installation root: /usr
  Sysconf directory: /etc
  Run directory: /run
  Local state directory: /var
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-693.11.6.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: unknown

bash-4.2# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite influxdb livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker mainlog
bash-4.2# icinga2 daemon -C
information/cli: Icinga application loader (version: r2.8.1-1)
information/cli: Loading configuration file(s).
information/ConfigItem: Committing config item(s).
information/ApiListener: My API identity: admin.cx1.hpc.imperial.ac.uk
warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 11:1-11:45) for type 'Notification' does not match anywhere!
warning/ApplyRule: Apply rule 'mail-icingaadmin' (in /etc/icinga2/conf.d/notifications.conf: 23:1-23:48) for type 'Notification' does not match anywhere!
warning/ApplyRule: Apply rule 'backup-downtime' (in /etc/icinga2/conf.d/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
warning/ApplyRule: Apply rule 'backup-downtime' (in /var/lib/icinga2/api/zones/global-templates/_etc/downtimes.conf: 5:1-5:52) for type 'ScheduledDowntime' does not match anywhere!
warning/ApplyRule: Apply rule 'ping6' (in /etc/icinga2/conf.d/services.conf: 34:1-34:21) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule '' (in /etc/icinga2/conf.d/services.conf: 51:1-51:65) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule '' (in /etc/icinga2/conf.d/services.conf: 59:1-59:53) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'icinga' (in /etc/icinga2/conf.d/services.conf: 67:1-67:22) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'load' (in /etc/icinga2/conf.d/services.conf: 75:1-75:20) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'procs' (in /etc/icinga2/conf.d/services.conf: 86:1-86:21) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'swap' (in /etc/icinga2/conf.d/services.conf: 94:1-94:20) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'users' (in /etc/icinga2/conf.d/services.conf: 102:1-102:21) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'cx1-disk-var' (in /var/lib/icinga2/api/zones/cx1-zone/_etc/services.conf: 3:1-3:28) for type 'Service' does not match anywhere!
warning/ApplyRule: Apply rule 'procs-ssh' (in /var/lib/icinga2/api/zones/global-templates/_etc/services.conf: 48:1-48:25) for type 'Service' does not match anywhere!
information/ConfigItem: Instantiated 1 ApiListener.
information/ConfigItem: Instantiated 4 Zones.
information/ConfigItem: Instantiated 2 Endpoints.
information/ConfigItem: Instantiated 3 ApiUsers.
information/ConfigItem: Instantiated 1 FileLogger.
information/ConfigItem: Instantiated 1 UserGroup.
information/ConfigItem: Instantiated 213 CheckCommands.
information/ConfigItem: Instantiated 1 IcingaApplication.
information/ConfigItem: Instantiated 2 Hosts.
information/ConfigItem: Instantiated 3 HostGroups.
information/ConfigItem: Instantiated 3 TimePeriods.
information/ConfigItem: Instantiated 1 User.
information/ConfigItem: Instantiated 12 ServiceGroups.
information/ConfigItem: Instantiated 4 Services.
information/ConfigItem: Instantiated 1 CheckerComponent.
information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
information/cli: Finished validating the configuration file(s).

I attach a screenshot of the system -> version:

Have I understood correctly that each passive check makes three requests every time? One for the host, the service, and finally the check result?

That is correct, yes.

2.8.1 isn’t supported anymore; try the latest release.

Can I also suggest reversing the logic of the scripts? Try and push the check result, and if that fails, try and create the Service and Host objects, then retry.
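A minimal sketch of that reversed logic, under the same assumptions as the rest of this thread: `send_master` and `send_satellite` are hypothetical callables that perform the request and return the HTTP status code, and "dummy" is only an assumed placeholder check command.

```python
# Sketch: push the check result first, create objects only on a 404.
# send_master / send_satellite are hypothetical callables:
#   send(method, path, body) -> HTTP status code (int)

CHECK_RESULT_PATH = "/v1/actions/process-check-result"


def push_then_create(send_master, send_satellite,
                     host, service, exit_status, output):
    result = {
        "type": "Service",
        "filter": 'host.name=="%s" && service.name=="%s"' % (host, service),
        "exit_status": exit_status,
        "plugin_output": output,
    }

    # Common case: the objects already exist, so this is the only request.
    status = send_satellite("POST", CHECK_RESULT_PATH, result)
    if status != 404:
        return status

    # 404: no matching Service. Create Host and Service on the master.
    # The host may already exist; a failed PUT for it is ignored here.
    attrs = {"attrs": {"check_command": "dummy"}}  # assumed placeholder
    send_master("PUT", "/v1/objects/hosts/" + host, attrs)
    send_master("PUT", "/v1/objects/services/" + host + "!" + service, attrs)

    # Retry once and hand the final status back to the caller.
    return send_satellite("POST", CHECK_RESULT_PATH, result)
```

Once the objects exist, each run costs one request instead of three. The retry may still fail if the newly created objects have not reached the satellite yet, so the caller should treat a second 404 as "try again on the next run".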

mfriedrich, hemebond, those are both good suggestions, which I will follow. However, I think the fundamental problem was overloading of the icinga daemon, so I have worked on spreading the load of updates out, basically letting each update sleep a random number of minutes first, and this seems to have helped a lot.
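The idea is roughly the following. This is a sketch rather than the exact code in my scripts; the function names and the `max_minutes` parameter are made up for illustration.

```python
import random
import time


def random_delay_seconds(max_minutes, rng=random):
    # Pick a whole number of minutes in [0, max_minutes], return it in seconds.
    return rng.randint(0, max_minutes) * 60


def sleep_before_update(max_minutes):
    # Called at the top of each cron-driven check, before any API request,
    # so that 1500 clients do not all hit the daemon in the same second.
    time.sleep(random_delay_seconds(max_minutes))
```

The upper bound has to stay below the cron interval, otherwise one run can overlap with the next.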

I have been trying to get icinga to dump a core when it dies, but even though I can see in /proc/[pid]/limits that the core file size is set to unlimited, I don’t find a core when it dies. Now, of course, this is less urgent, but it would still be nice to pin down what it dies of if it happens again.
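One thing I still need to rule out is that the core is being written somewhere I am not looking, or suppressed by the kernel. A small sketch for checking this, assuming a Linux host: a kernel.core_pattern that starts with "|" hands the dump to a helper program instead of writing a file, a relative pattern writes into the working directory of the crashing process, and fs.suid_dumpable set to 0 stops dumps for processes that have changed credentials (which may apply if the daemon changes user after startup). The function names are made up for illustration.

```python
def describe_core_pattern(pattern):
    # Interpret the contents of /proc/sys/kernel/core_pattern.
    pattern = pattern.strip()
    if pattern.startswith("|"):
        parts = pattern[1:].split()
        handler = parts[0] if parts else "(unknown)"
        return "piped to " + handler
    if pattern.startswith("/"):
        return "written to absolute path " + pattern
    return "written relative to the working directory as " + pattern


def read_setting(path):
    # Read a single-line kernel setting from /proc/sys.
    with open(path) as handle:
        return handle.read().strip()


def report():
    # Print where the kernel sends core dumps on this machine.
    print(describe_core_pattern(read_setting("/proc/sys/kernel/core_pattern")))
    print("fs.suid_dumpable =", read_setting("/proc/sys/fs/suid_dumpable"))
```

Calling `report()` on the satellite should show whether a helper program is intercepting the dump, in which case the core (if kept at all) would be in that program's spool directory rather than next to the daemon.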