Too many late service checks

bkai · April 17, 2019, 6:08pm

Hi - this is my first post here in the “Community” portal; I already wrote a few in the older portals (same moniker). I work for a small department of a Bavarian firm specialising in monitoring, partly as a service, for mid-sized enterprises (usually fewer than 1000 hosts). BTW, we try to use only Director for our configuration work, as much as possible.

We have a serious problem with a simple Icinga 2.10 instance on a customer monitoring server, checking about 250 hosts with a little more than 2100 services.

The problem: For several weeks now, about a third of these service checks have been very late - they are simply no longer re-executed. The “next check” time value counts down to zero and then goes more & more negative. These services are neither in an acknowledged state, nor has a downtime been set for them. They just sit in the Late Service Checks list (found in Dashboard -> Overdue), and become older and older.

System load is low (hardly ever going above 1.0); file systems are half empty. We mainly have active checks via SNMP, initiated by the monitoring server itself against remote clients.

This problem started under a 2.9 version, so we upgraded last Monday, in the hope that the new version would solve it. It didn’t. Updating to Director 1.6.2 didn’t help, either.

Initial research has shown that all affected services are ones configured in the “single services” part of the Director services configuration section. Services defined only within service sets, for example, do not seem to be affected.

I have included an “icinga2 troubleshoot” output of the system, redacted to not include direct references to us or the client: troubleshooting-2019-04-15_eve_editedBySupport.log (31.5 KB)

Do you have any ideas what’s causing this? Is it a known problem? Any pointers to a solution would be much appreciated!

P.S.: For the moment I have built a clumsy “cron job” to at least make sure that late checks are forced to be run every 30 minutes on average, by applying the following 2 API calls within a bash script (making use also of sed & perl). It works, but this is not proper monitoring, and can only be a short-term workaround (customer is informed, of course)…

Common command line prefix to what follows: curl -k -s -u root:somepasswd -H 'Accept: application/json' -X
GET 'https://localhost:5665/v1/objects/services' -d '{ "attrs": [ "last_check" ], "pretty": false }'
POST 'https://localhost:5665/v1/actions/reschedule-check' \
-d '{ "type": "Service",
"filter": "host.name==\"HHH\"&&service.name==\"SSS\"",
"force": true,
"pretty": true }'
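The workaround described above can be sketched roughly as follows. This is a minimal Python sketch, not the bash/sed/perl script actually used: the function names, the 60-second grace period and the use of `next_check` (rather than `last_check`) to detect late checks are assumptions made for illustration. The GET query is sent as a POST with the `X-HTTP-Method-Override: GET` header, which the Icinga 2 API accepts for requests that carry a body.

```python
import base64
import json
import ssl
import time
import urllib.request

API = "https://localhost:5665/v1"  # base URL taken from the curl calls above


def overdue_services(results, now, grace=60):
    """Return (host, service) pairs whose next_check is more than `grace` seconds in the past."""
    late = []
    for obj in results:
        attrs = obj["attrs"]
        if attrs["next_check"] < now - grace:
            late.append((attrs["host_name"], attrs["name"]))
    return late


def reschedule_payload(host, service):
    """Build the body for /actions/reschedule-check with the same filter as the curl call."""
    # json.dumps puts the double quotes (and any escapes) around each name.
    flt = "host.name==%s&&service.name==%s" % (json.dumps(host), json.dumps(service))
    return {"type": "Service", "filter": flt, "force": True}


def api_call(path, body, override=None, user="root", password="somepasswd"):
    """POST a JSON body; certificate checks are disabled, like curl -k."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {"Accept": "application/json", "Authorization": "Basic " + token}
    if override:
        headers["X-HTTP-Method-Override"] = override
    req = urllib.request.Request(
        API + path, data=json.dumps(body).encode(), method="POST", headers=headers
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)


def run_once():
    """Force a recheck of every service that is currently overdue."""
    wanted = {"attrs": ["host_name", "name", "next_check"]}
    listing = api_call("/objects/services", wanted, override="GET")
    for host, service in overdue_services(listing["results"], time.time()):
        api_call("/actions/reschedule-check", reschedule_payload(host, service))
```

Calling `run_once()` from a cron job would give roughly the behaviour described; nothing runs on import, so the helper functions can be tried out without a live API.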

dnsmichi · April 18, 2019, 6:58am

Hi & welcome,

The Director provides the Deployment tab where you can render the configuration for such a deployment stage. Can you extract from there the configuration bits which include 1) the working apply sets and 2) the non-working objects?

Further, please use the object names from there, and extract them via icinga2 object list --type Service --name ... to see the entire attributes after compilation.

I’m primarily interested in specific zone attributes, and that being said, please also attach the zones.conf, or likewise, the output of icinga2 object list --type Zone as well as icinga2 object list --type Endpoint.

What exactly does that mean? next_check cannot be a negative value. Can you collect that data over time, e.g. with a cyclic check against the REST API on /v1/objects/services with attrs=next_check&attrs=__name?
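Such a cyclic collection could look roughly like the sketch below. It is only an illustration: the function names and the CSV layout are assumptions, and the HTTP request itself is left to whatever client is already in use (e.g. curl), with its response text passed in.

```python
import csv
import json
import time
import urllib.parse


def services_url(base):
    """URL for /v1/objects/services restricted to the two attributes of interest."""
    query = urllib.parse.urlencode([("attrs", "next_check"), ("attrs", "__name")])
    return f"{base}/v1/objects/services?{query}"


def sample_rows(results, now):
    """One row per service: sample time, full name, next_check, seconds overdue.

    A positive last column means next_check already lies in the past.
    """
    rows = []
    for obj in results:
        attrs = obj["attrs"]
        rows.append((now, attrs["__name"], attrs["next_check"], now - attrs["next_check"]))
    return rows


def append_sample(csv_path, response_text, now=None):
    """Append one sample (the raw JSON text of an API response) to a CSV file."""
    now = time.time() if now is None else now
    rows = sample_rows(json.loads(response_text)["results"], now)
    with open(csv_path, "a", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return len(rows)
```

Run once a minute, this yields a per-service time series in which a stuck check shows up as a constant `next_check` with a steadily growing overdue value.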

Cheers,
Michael

bkai · April 18, 2019, 3:40pm

Thanks, Michi, for the prompt reply! Your last question first: I’m sorry I didn’t state the location of that negative “next check” value more clearly - I meant in the GUI display for a particular service, e.g. something like:
Last check |12m 5s ago|Check now
Next check |in -7m 20s|Reschedule
I.e. the next check time lies fixed in the past, and does not change (unless another check is forced, either manually or by my “cron job”). In the DB the value of the attribute is of course a positive number of seconds since the epoch. Do you still need the “data over time” output?
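To illustrate how a positive epoch value ends up displayed as a negative time: the GUI shows the difference between the stored next check time and the current time. The sketch below only mimics that display; it is not Icinga Web's actual formatting code, and the function name is made up.

```python
def relative_next_check(next_check, now):
    """Format next_check relative to now, e.g. 'in -7m 20s' when it lies in the past."""
    delta = int(next_check - now)
    sign = "-" if delta < 0 else ""
    minutes, seconds = divmod(abs(delta), 60)
    return f"in {sign}{minutes}m {seconds}s"
```

With a next check time that stays fixed while the clock moves on, the difference keeps growing in the negative direction.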

Concerning your data requests earlier in your reply, the current <10KB attachment - 4outputsNconfs.zip - contains 4 files:

1. From the rendered configuration I tried to include all “conf” files that contain APPLY rules (except “notification_apply.conf”, which I think has nothing to do with check repetition). Is that what you meant by “apply sets”?
2. The non-working objects are quite a lot, so I made an example list of some services in the “QXlatecases” file. Do you want to see the “services(ets).conf” entries for those services from the rendered configuration?
3. Complete service infos for the mentioned late cases in that file can be found in the larger output file “QXcompleteServiceInfos”. I used a shell loop with additional echo’s and your “list --type Service” command, above, to generate this file.
4. Zone & endpoint infos are in the script session excerpt file.

In all files I have redacted the texts to protect our customer’s specific company/host name(s).

One thing I noticed is that in the troubleshoot-output file in my first post up above, the host name of the monitoring server mentioned in some API paths is not the current one for our customer (the latter appears in the script excerpt, in item 4, as “QXCUST-monitor”). It seems our DB still contains the old name of our reference VM, which we cloned to set up the customer VM; this is how we have been setting up customer monitoring servers over the last few years for Icinga1, and we have tried to continue this practice with Icinga2 (of which we still have only a minority out in the field).

Thanks for your time & a good Easter! /KB

dnsmichi · April 25, 2019, 7:49am

Hi,

sorry, I had your tab open, next to many 2.11 GH tabs. I completely forgot to answer after Easter and some days off in Austria.

Still this doesn’t sound like something I may answer in 15 minutes. Since you’ve already contacted my colleagues, I hope this can be resolved in a remote session with direct access.

Going back to my GH page :->

Cheers,
Michael

bkai · May 3, 2019, 5:44pm

No problem, Michael!

We have since solved the problem, it looks like! The crazy effect with the many late checks seems to have been caused by the dependencies we had added over the last few months, none of which ignored soft states!

All our dependencies were importing a template where the “Ignore Soft States” field was set to “no” - once we set it to “yes” AND restarted the Icinga2 service, the number of late checks rapidly dropped to zero.

A more awake colleague noticed that for e.g. SNMP connections, the corresponding dependency was sometimes “flapping” once a minute or so, i.e. switching all SNMP-dependent checks off and then later on again. If the check interval happened to “land” on one of the off times, the check became a late check, I guess. If it kept happening at later times, it remained in that state!!

For our purposes, we don’t need dependencies triggering on anything but hard-state critical & unknown errors, so we have decided to make sure that that Soft States setting is now always on “ignore” for all our monitoring instances.
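A quick way to audit an instance for this setting is sketched below. It assumes the Director field corresponds to the `ignore_soft_states` attribute of Icinga 2 Dependency objects and that the input is the `results` list of a query against /v1/objects/dependencies (with `__name` and `ignore_soft_states` among the returned attributes); the function name is made up for illustration.

```python
def soft_state_dependencies(results):
    """Return the names of Dependency objects that do NOT ignore soft states."""
    names = []
    for obj in results:
        attrs = obj["attrs"]
        # Only an explicit false counts; a missing attribute is left alone.
        if attrs.get("ignore_soft_states") is False:
            names.append(attrs["__name"])
    return sorted(names)
```

An empty list would mean that every dependency on the instance ignores soft states, which is the state aimed for here.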

We’re very relieved! And we’ve learned to spend more time analysing the debug logs in future.

bkai · October 15, 2019, 3:29pm

A late addition to this thread: We just discovered another reason why all host and/or all service checks can be late. Even with the “checker” feature enabled on the command line (with icinga2 feature enable checker, if I remember correctly), there is a separate way to switch checks off and on in the GUI itself!

This can be found under “System -> Monitoring Health” (German for the last one: “Monitoring Status”). There on the right all host checks and/or all service checks can also be switched off manually, which again will have the effect of corresponding masses of checks becoming late…
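These global switches can also be checked from outside the GUI. The sketch below assumes that the parsed response of /v1/status/IcingaApplication exposes them as `enable_host_checks` and `enable_service_checks` under `status.icingaapplication.app`; that layout and the function name are assumptions and should be verified against the installed version.

```python
GLOBAL_CHECK_FLAGS = ("enable_host_checks", "enable_service_checks")


def disabled_global_checks(status_response):
    """Return the global check flags that are switched off in a parsed status response."""
    app = status_response["results"][0]["status"]["icingaapplication"]["app"]
    # A flag missing from the response is treated as enabled.
    return [flag for flag in GLOBAL_CHECK_FLAGS if not app.get(flag, True)]
```

A non-empty result would explain a sudden mass of late checks without any change in the deployed configuration.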

dnsmichi · October 17, 2019, 10:57am

Those global options should be used with care: they are runtime modifications that you do not immediately see in the config or on the CLI. IIRC you can control access to them via monitoring restrictions in Icinga Web. Or, if you are already using the REST API, just drop the permission to modify the icingaapplication type for the ApiUser object. This may result in errors in the web interface, but the backend is safe too.

Cheers,
Michael