Passive checks not updating for 2 days

j4nd3r53n · March 11, 2019, 10:13am

Edit: This issue is very similar to the following - but that one was mainly about the fact that I hadn’t configured the zone correctly in the service objects; this is not the issue now:

However, the symptoms seem to be the same:

I have a master zone, a satellite, and a number of clients
All service checks are passive, and the first time they are run, they create their own host and service objects, if necessary

This appears to work fine, except that it seems to be unreliable. After I create a new ServiceGroup and restart the master/satellite, all host and service objects disappear, but are then recreated. I did this before the weekend; everything seemed OK and sensible data came in, but now it turns out that no updates have been received since.

When I try to run the check scripts manually, they get a return code 404, which suggests the service object doesn’t exist. However, when I query the servers in different ways, the objects do exist - I have tried through the API and using the console (icinga2 console --connect 'https://some:one@master.server:5665/'), and I find the service objects. It is a bit of a disaster TBH if I can’t get it to work reliably, or at least find a reliable way to troubleshoot this sort of issue. Any ideas, please?
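Since the 404 comes from whichever endpoint the check script talks to, one way to narrow this down is to ask each endpoint separately whether it knows the service. Below is a minimal sketch using only the Python standard library. It assumes the API user is allowed to query objects and that a missing object is answered with 404; the function names, the host name satellite.example and the credentials are made up for illustration.

```python
import base64
import ssl
import urllib.error
import urllib.request
from urllib.parse import quote


def service_object_url(base_url, host, service):
    """Build the object-query URL for one service, named "host!service"."""
    # The "!" separator must be percent-encoded (it shows up as %21 in the log).
    name = quote(f"{host}!{service}", safe="")
    return f"{base_url.rstrip('/')}/v1/objects/services/{name}"


def service_status(base_url, host, service, user, password, cafile=None):
    """Return the HTTP status one endpoint gives for the service object.

    Assumption: 200 means the endpoint has the object, 404 means it does not.
    """
    request = urllib.request.Request(
        service_object_url(base_url, host, service),
        headers={"Accept": "application/json"},
    )
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    # cafile should point at the CA certificate the Icinga nodes were signed with.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with urllib.request.urlopen(request, context=context) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
```

Calling service_status() once against the master and once against the satellite for the same host and service shows whether the two disagree about the object.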

I have had a look in the debug log and the icinga2.log on the satellite - the only thing that stands out is that I see this sort of thing an awful lot in the icinga2.log - in fact, it seems to happen for all updates:

[2019-03-11 10:31:23 +0000] information/ApiListener: New client connection from [192.168.103.61]:52792 (no client certificate)
[2019-03-11 10:31:23 +0000] information/HttpServerConnection: Request: POST /v1/actions/process-check-result?service=cx1-138-16-1.cx1.hpc.ic.ac.uk%21sdr_list (from [192.168.103.61]:52792, user: client-pki-ticket-cx1-admin)
[2019-03-11 10:31:23 +0000] warning/TlsStream: TLS stream was disconnected.

On the master, there are no log entries relating to the updates. These problems seem to arise every time I add a ServiceGroup at the moment - I do this by changing zones.d/global-templates/groups.conf, running icinga2 daemon -C and restarting the Icinga server on master and satellite. Is there anything I need to do differently?

j4nd3r53n · March 11, 2019, 12:54pm

The answer to this problem was in the debug log, after all:

[2019-03-11 10:18:23 +0000] critical/ApiListener: Could not create object 'cx1-141-15-3.cx1.hpc.ic.ac.uk':
[2019-03-11 10:18:23 +0000] critical/ApiListener: Configuration file '/var/lib/icinga2/api/packages/_api/cx1-admin-1521551477-1/conf.d/hosts/cx1-141-15-3.cx1.hpc.ic.ac.uk.conf' already exists.
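Entries like these are easy to miss in a busy debug log, so a small filter can help. The sketch below assumes every log line has the shape "[timestamp] severity/Facility: message", as in the excerpts in this thread; the names LOG_LINE, LogEntry and critical_entries are made up for illustration.

```python
import re
from typing import NamedTuple

# Assumed line shape, taken from the excerpts above:
# [2019-03-11 10:18:23 +0000] critical/ApiListener: message text
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"(?P<severity>\w+)/(?P<facility>\w+): "
    r"(?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    severity: str
    facility: str
    message: str


def critical_entries(lines, facility="ApiListener"):
    """Yield the critical entries logged by one facility."""
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match is None:
            continue  # continuation lines or anything in another format
        entry = LogEntry(**match.groupdict())
        if entry.severity == "critical" and entry.facility == facility:
            yield entry
```

An open log file can be passed in directly, since iterating over a file yields its lines. The location of the debug log depends on the installation.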

The satellite didn’t know it had these host and service objects. The master tried to send them, but the satellite refused to create them because the configuration files already existed on disk. However, even though these files presumably contained the definitions for the host and service objects, querying the satellite for them failed. Deleting everything in the hosts and services directories fixed it:

bash-4.2# ll /var/lib/icinga2/api/packages/_api/cx1-admin-1521551477-1/conf.d
total 956
drwx------. 2 icinga icinga  77824 Mar  8 17:15 hosts/
drwx------. 2 icinga icinga 729088 Mar  8 17:15 services/
bash-4.2# rm -r /var/lib/icinga2/api/packages/_api/cx1-admin-1521551477-1/conf.d/*/*
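A more cautious variant of the same cleanup moves the files aside instead of deleting them, so they can still be inspected or restored. This is a swapped-in alternative, not what was actually run above. The function name and backup location are made up, and the sketch assumes it is run with Icinga stopped and with enough privileges to modify the package directory.

```python
import shutil
from pathlib import Path


def quarantine_conf_files(conf_d, backup_dir, subdirs=("hosts", "services")):
    """Move *.conf files from conf_d/<subdir> to backup_dir/<subdir>.

    Returns the list of destination paths that were created.
    """
    conf_d, backup_dir = Path(conf_d), Path(backup_dir)
    moved = []
    for subdir in subdirs:
        source = conf_d / subdir
        if not source.is_dir():
            continue  # nothing to do if the package lacks this directory
        target = backup_dir / subdir
        target.mkdir(parents=True, exist_ok=True)
        for conf_file in sorted(source.glob("*.conf")):
            destination = target / conf_file.name
            shutil.move(str(conf_file), str(destination))
            moved.append(destination)
    return moved
```

The first argument would be the conf.d directory shown in the listing above, the second any directory outside /var/lib/icinga2.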

dnsmichi · March 11, 2019, 1:24pm

Hi,

it seems that the _api package used for syncing runtime config objects got scrambled somehow. As a result the core doesn’t load it on startup, and it then processes the config update anyway, which leads to that attempt to dump the object to disk. Future versions of Icinga will handle this in a smarter way, without that uninformative error message.

Cheers,
Michael