I’ve been upgrading my cluster to 2.12. One node was updated yesterday, another today.
Now I have a host whose checks are permanently in the pending state, with no check source assigned. What can I do about this? There is nothing special about its definition, so it should be working.
My assumption is that this host was defined while the cluster was only partially upgraded, and that this is the cause of the problem.
I can see checks being scheduled from icingaweb2, but they don’t seem to run: no check result is ever registered and nothing indicates that the check executed.
Newly added commands also never execute.
I downgraded to 2.11.5 and I still have this problem.
Check the logs for any errors, warnings, or criticals. That should give you a pretty good idea of what is going wrong, as the log mostly says so in plain text.
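For anyone following along, here is a minimal sketch of filtering the log for warnings and criticals. It assumes log lines look like `[timestamp] severity/Component: message` and that the log lives at the usual /var/log/icinga2/icinga2.log; both may differ on your install. `find_problems` and `scan_file` are hypothetical helper names, not part of Icinga 2.

```python
import re

# Assumed line layout: "[2020-08-10 09:15:02 +0200] warning/ApiListener: message"
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<severity>\w+)/(?P<component>[^:]+):\s*(?P<msg>.*)$"
)


def find_problems(lines, severities=("warning", "critical")):
    """Return (timestamp, severity, component, message) for matching lines."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.rstrip("\n"))
        if m and m.group("severity") in severities:
            hits.append(
                (m.group("ts"), m.group("severity"),
                 m.group("component"), m.group("msg"))
            )
    return hits


def scan_file(path="/var/log/icinga2/icinga2.log"):
    """Scan a log file; the default path is an assumption, adjust as needed."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_problems(fh)
```

Calling `scan_file()` on each node and comparing the results around the time of the upgrade is one way to spot which node started misbehaving first.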
There was nothing of value in the logs, but things got worse over time. Problem notifications would re-send for valid (OK) services and there were even more pending checks.
I stopped the cluster, deleted the icinga2.state files on each node, and restarted the cluster. Things seem OK now.
I’ve got the old state files around, so I’ll look into them.
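In case it helps someone else, a rough sketch of setting the state file aside rather than deleting it, so the old copy stays available for inspection. It assumes the state file is at /var/lib/icinga2/icinga2.state (the usual location on Debian, but check your own install) and that icinga2 is already stopped on the node. `set_aside_state_file` is a hypothetical helper name.

```python
import shutil
import time
from pathlib import Path


def set_aside_state_file(state_path, suffix=None):
    """Rename the state file to a timestamped backup and return the new path.

    Returns None if the file does not exist.
    Only run this while icinga2 is stopped on the node.
    """
    src = Path(state_path)
    if not src.exists():
        return None
    suffix = suffix or time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.name}.{suffix}.bak")
    shutil.move(str(src), str(dst))
    return dst
```

Usage would be something like `set_aside_state_file("/var/lib/icinga2/icinga2.state")` on every node before starting the cluster again.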
My guess is that a rolling upgrade of the cluster is a bad idea with icinga2.
Update the masters one after the other, then do the same with the satellites and then the agents.
As you have a cluster you don’t really have a downtime, because the checks are shifted to the other node in the zone that is currently not being updated.
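The order described above can be sketched as a simple sort. The node names and the `(name, role)` layout are made up for illustration; only the ordering of masters, then satellites, then agents comes from the advice itself.

```python
# Lower number means "upgrade earlier".
ROLE_ORDER = {"master": 0, "satellite": 1, "agent": 2}


def upgrade_order(nodes):
    """Sort (name, role) pairs: masters first, then satellites, then agents.

    sorted() is stable, so nodes with the same role keep their input order.
    """
    return sorted(nodes, key=lambda node: ROLE_ORDER[node[1]])


# Hypothetical inventory, one node at a time within each role.
inventory = [
    ("agent-web01", "agent"),
    ("master2", "master"),
    ("sat-dmz", "satellite"),
    ("master1", "master"),
]

for name, role in upgrade_order(inventory):
    print(f"upgrade {name} ({role})")
```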
There is an interesting issue I observed. On the secondary master node (the one that receives its configuration via the API), icinga2’s child processes go defunct as soon as the primary master sends config updates.
Then all sorts of odd things start happening until I kill icinga2 on that machine. The secondary master is on Debian 9, the primary is on Debian 8. I will have to run on a single master for a while until I figure out what is happening.
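To watch for this, a small sketch that lists defunct (zombie) processes whose parent is an icinga2 process. It assumes a Linux `ps` that accepts `-eo pid,ppid,stat,comm` and marks zombies with a state starting with `Z`; `parse_ps`, `defunct_children`, and `snapshot` are hypothetical helper names.

```python
import subprocess


def parse_ps(ps_output):
    """Parse `ps -eo pid,ppid,stat,comm` output into a list of dicts."""
    procs = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        procs.append(
            {"pid": int(pid), "ppid": int(ppid), "stat": stat, "comm": comm.strip()}
        )
    return procs


def defunct_children(ps_output, parent_name="icinga2"):
    """Return zombie processes whose parent command is `parent_name`."""
    procs = parse_ps(ps_output)
    parents = {p["pid"] for p in procs if p["comm"].split()[0] == parent_name}
    return [p for p in procs if p["stat"].startswith("Z") and p["ppid"] in parents]


def snapshot():
    """Capture the current process table (requires a Linux-style ps)."""
    return subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
```

Running `defunct_children(snapshot())` on the secondary master right after a config deployment should show whether the zombies appear at that moment.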