Upgrade to 2.12 - some checks are permanently in pending state with blank check source

kowalskimn · August 6, 2020, 7:32am

I’ve been upgrading my cluster to 2.12. One node was updated yesterday, another today.

Now I have a host where checks are permanently in pending state and no check source is assigned. What can I do about this? There is nothing special about its definition, so it should be working.

It’s likely that this host was defined while my cluster was only partially upgraded; at least, that is my assumption about the cause of the problem.

log1c · August 6, 2020, 7:56am

I just updated my testing master to 2.12 and have no issues. Satellites are still on 2.11.4.

What do your logs say?
Does the host in question have an agent installed? Config? Logs?

kowalskimn · August 6, 2020, 1:20pm

I can see checks being scheduled to run in icingaweb2, but they don’t seem to run. The check result is never registered and the check never appears to have executed.

Newly added commands also never execute.

I downgraded to 2.11.5 and I still have this problem.
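
For anyone trying to list the affected objects, here is a minimal Python sketch. It assumes you have already fetched the JSON from the Icinga 2 REST API object query for hosts or services and that it has the usual `results` / `name` / `attrs` layout with a `last_check_result` attribute. The host names and the sample data are made up for illustration, and the function name is hypothetical.

```python
def pending_without_source(api_response):
    """Return names of objects that have no check result or no check source."""
    pending = []
    for obj in api_response.get("results", []):
        attrs = obj.get("attrs", {})
        result = attrs.get("last_check_result")
        # No result at all means the check is still pending; a result
        # without a check_source is treated as suspicious as well.
        if result is None or not result.get("check_source"):
            pending.append(obj.get("name"))
    return pending


# Invented sample data, shaped like an object query response.
sample = {
    "results": [
        {"name": "web01", "attrs": {"last_check_result": {"check_source": "master1"}}},
        {"name": "db01", "attrs": {"last_check_result": None}},
    ]
}

print(pending_without_source(sample))  # prints ['db01']
```

Comparing that list before and after a restart should show whether the same objects stay stuck or whether the set keeps growing.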

log1c · August 6, 2020, 1:34pm

Without any information about

- the contents of the logs
- if the host in question has the agent installed
- the host/service config

we can’t help you.

So, check the logs for any errors/warnings/criticals. This should give you a pretty good idea what is going wrong, as the log mostly tells this in plain text.
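
A rough Python sketch of that log check, assuming the main log uses the `[timestamp] severity/Component: message` line layout. The sample lines below are invented for illustration and are not taken from a real log; the function name is hypothetical.

```python
import re

# Assumed line layout: "[timestamp] severity/Component: message"
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<sev>\w+)/(?P<comp>[^:\s]+):\s*(?P<msg>.*)$"
)


def problem_lines(lines, severities=("warning", "critical")):
    """Return (timestamp, severity, component, message) for matching lines."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("sev") in severities:
            hits.append(
                (match.group("ts"), match.group("sev"),
                 match.group("comp"), match.group("msg"))
            )
    return hits


# Invented sample lines, only to show the expected layout.
sample_log = [
    "[2020-08-06 07:30:00 +0200] information/ApiListener: New client connection",
    "[2020-08-06 07:31:00 +0200] warning/ApiListener: Example warning text",
    "[2020-08-06 07:32:00 +0200] critical/Application: Example critical text",
]

for hit in problem_lines(sample_log):
    print(hit)
```

To use it on a real node, pass the lines of the log file instead of `sample_log`, for example via `open(path).read().splitlines()`.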

kowalskimn · August 7, 2020, 6:25am

There was nothing of value in the logs, but things got worse over time. Problem notifications would re-send for valid (OK) services and there were even more pending checks.

I stopped the cluster, deleted the icinga2.state files on each node, and restarted the cluster. Things seem OK now.

I’ve got the old state files around, so I’ll look into them.

My guess is that a rolling upgrade of the cluster is a bad idea with icinga2.
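
For reference, the state file reset described above can be done so that the old file is kept for later inspection. This is only a sketch: it assumes icinga2 is already stopped on the node and that the state directory is passed in by the caller (on Debian packages it is typically `/var/lib/icinga2`, but verify that on your system). The function name and the backup naming are my own, and the demo runs against a temporary directory rather than a real node.

```python
import tempfile
from pathlib import Path


def set_aside_state(state_dir, suffix):
    """Rename icinga2.state to a backup name; return the new path or None."""
    src = Path(state_dir) / "icinga2.state"
    if not src.exists():
        return None
    dst = src.with_name(f"icinga2.state.{suffix}.bak")
    src.rename(dst)  # the daemon recreates its state file on the next start
    return dst


# Demo against a throwaway directory, not a real Icinga 2 installation.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "icinga2.state").write_text("dummy")
    backup = set_aside_state(tmp, "20200807")
    print(backup.name)  # prints icinga2.state.20200807.bak
```

Note that removing the state file discards runtime state such as the last check results, so it is a last resort rather than a routine step.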

log1c · August 7, 2020, 6:53am

Normally this shouldn’t be a problem.

Update the masters one after the other, then do the same with the satellites and then the agents.
As you have a cluster you don’t really have a downtime, because the checks are shifted to the other node in the zone that is currently not being updated.
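
The order above can be sketched in a few lines of Python. The node names and the role labels are hypothetical placeholders; the only thing taken from this thread is the sequence masters, then satellites, then agents, one node at a time.

```python
# Lower number means "upgrade earlier".
ROLE_ORDER = {"master": 0, "satellite": 1, "agent": 2}


def upgrade_plan(nodes):
    """Take (name, role) pairs and return the names in upgrade order."""
    # sorted() is stable, so nodes with the same role keep their input order.
    ordered = sorted(nodes, key=lambda node: ROLE_ORDER[node[1]])
    return [name for name, _role in ordered]


# Hypothetical inventory.
nodes = [
    ("agent1", "agent"),
    ("master1", "master"),
    ("sat1", "satellite"),
    ("master2", "master"),
]

print(upgrade_plan(nodes))  # prints ['master1', 'master2', 'sat1', 'agent1']
```

Upgrading one node per zone at a time is what lets the other node in the zone take over the checks in the meantime.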

kowalskimn · August 7, 2020, 10:20am

I am actually still seeing the problem. Some hosts started staying in pending state again. I’ll look into what could be causing this.

It seems I had a misbehaving old icinga2 process running on one of the nodes in my zone. I’m looking into whether that helps.

kowalskimn · August 7, 2020, 12:15pm

There is an interesting issue I observed. On the secondary master node (the one that receives configuration via the API), icinga2’s child processes go defunct as soon as the master sends config updates.

Then all sorts of odd things start happening until I kill icinga2 on that machine. The secondary master is on Debian 9, the primary is on Debian 8. I will have to run on a single master for a while until I figure out what is happening.
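
To catch the defunct children as they appear, something like the following Python sketch could be run against the output of `ps -eo pid,ppid,stat,comm`. The sample output below is invented for illustration (PIDs and process list are not from the affected machine), and the function name is hypothetical.

```python
def defunct_children(ps_output, name="icinga2"):
    """Return (pid, ppid) pairs for zombie processes whose command matches."""
    zombies = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, stat, comm = parts
        # A state starting with "Z" marks a zombie (defunct) process.
        if stat.startswith("Z") and name in comm:
            zombies.append((int(pid), int(ppid)))
    return zombies


# Invented sample in the shape of `ps -eo pid,ppid,stat,comm` output.
sample_ps = """
  PID  PPID STAT COMMAND
 1200     1 Ssl  icinga2
 1215  1200 Z    icinga2 <defunct>
 1300     1 Ss   sshd
"""

print(defunct_children(sample_ps))  # prints [(1215, 1200)]
```

On a live system the text could come from `subprocess.run(["ps", "-eo", "pid,ppid,stat,comm"], capture_output=True, text=True).stdout`. The parent PID in each pair shows which icinga2 process is failing to reap its children.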