Hi - this is my first posting here in the “Community” portal; I already did a few in the older portals (same moniker). I work for a small department of a Bavarian firm specialising in monitoring, partly as a service, for mid-sized enterprises (usually fewer than 1000 hosts). BTW, we try to use only Director for the configuration work we do, as much as possible.
We have a serious problem with a simple Icinga 2.10 instance on a customer monitoring server, checking about 250 hosts with a little more than 2100 services running.
The problem: For several weeks now, about a third of these service checks have been permanently late - they are simply no longer rechecked. The “next check” time value decreases to zero and then goes more and more negative. These services are neither in an acknowledged state, nor has a downtime been set for them. They just sit in the Late Service Checks list (found in Dashboard -> Overdue) and become older and older.
System load is low (hardly ever going above 1.0); file systems are half empty. We mainly have active checks via SNMP, initiated by the monitoring server itself against remote clients.
This problem started under a 2.9 version, so we upgraded last Monday, in the hope that the new version would solve it. It didn’t. Updating to Director 1.6.2 didn’t help, either.
Initial research has shown that all affected services are ones configured in the “single services” part of the Director services configuration section. Services defined only within service sets, for example, do not seem to be affected.
I have included an “icinga2 troubleshoot” output of the system, redacted to not include direct references to us or the client: troubleshooting-2019-04-15_eve_editedBySupport.log (31.5 KB)
Do you have any ideas what’s causing this? Is it a known problem? Any pointers to a solution would be much appreciated!
P.S.: For the moment I have built a clumsy “cron job” to at least make sure that late checks are forced to be run every 30 minutes on average, by applying the following 2 API calls within a bash script (making use also of sed & perl). It works, but this is not proper monitoring, and can only be a short-term workaround (customer is informed, of course)…
The two calls, written out in full as they would be typed on a command line:
-
GET (list all services with their last check time):
curl -k -s -u root:somepasswd -H 'Accept: application/json' -X GET \
  'https://localhost:5665/v1/objects/services' \
  -d '{ "attrs": [ "last_check" ], "pretty": false }'
-
POST (force an immediate recheck of service SSS on host HHH):
curl -k -s -u root:somepasswd -H 'Accept: application/json' -X POST \
  'https://localhost:5665/v1/actions/reschedule-check' \
  -d '{ "type": "Service",
        "filter": "host.name==\"HHH\"&&service.name==\"SSS\"",
        "force": true,
        "pretty": true }'
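For anyone who wants to reproduce the workaround, here is a minimal sketch of how it could be structured in bash. It is not my actual script. The function names, the `API_URL` and `API_AUTH` variables, and the input format of `overdue_services` (one line per service: `next_check` epoch, a space, then `host!service`) are all assumptions made for illustration, and it compares against `next_check` rather than the `last_check` attribute my own script fetches.

```shell
#!/usr/bin/env bash
# Sketch only: API_URL, API_AUTH and all function names are hypothetical.
API_URL="${API_URL:-https://localhost:5665/v1}"
API_AUTH="${API_AUTH:-root:somepasswd}"

# Print the JSON body for a forced reschedule of one service.
# Assumes host and service names contain no double quotes or backslashes.
reschedule_body() {
    local host=$1 service=$2
    printf '{ "type": "Service", "filter": "host.name==\\"%s\\"&&service.name==\\"%s\\"", "force": true }\n' "$host" "$service"
}

# Read lines of "<next_check epoch> <host>!<service>" on stdin and print the
# "<host>!<service>" part of every entry whose next_check lies more than
# <grace> seconds before <now>.
overdue_services() {
    local now=$1 grace=$2
    awk -v now="$now" -v grace="$grace" \
        '$1 + grace < now { sub(/^[^ ]+ /, ""); print }'
}

# POST a forced reschedule for one service (needs curl and a reachable API).
force_check() {
    curl -k -s -u "$API_AUTH" -H 'Accept: application/json' \
        -X POST "$API_URL/actions/reschedule-check" \
        -d "$(reschedule_body "$1" "$2")"
}
```

A cron job could then feed the service listing from the GET call through `overdue_services "$(date +%s)" 300`, split each resulting `host!service` name at the `!`, and call `force_check` for it. How the JSON listing is turned into input lines is left open here; any JSON-capable tool would do.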