Icinga 2.11 "check command does not exist" (it does under 2.10.5)

Hi.

I’m using Icinga 2.10.5 with a Master and multiple Clients. Those Clients fall into three groups: Development, Test and Production. All run Debian 9 Stretch.

Last Thursday two of my Development machines got updated to Icinga 2.11, and some (but not all) of my service checks on those machines then failed with “check command does not exist”. No configuration changes were made at the same time.

I downgraded Icinga back to 2.10.5 but this did not fix the problem.

Fortunately:

a) these were “only” development machines
b) we take nightly backups of every entire machine
c) no significant data is stored on any single Client

I ended up fixing the problem by restoring the previous nightly backup of these two machines.

I have pinned Icinga at version 2.10.x for the time being on all machines so that Debian package upgrades don’t take it up to 2.11 again.
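
One way to hold the packages at 2.10.x on Debian is an APT preferences entry. The sketch below is an assumption about how such a pin could look, not necessarily how it was done on these machines; the helper name `write_icinga_pin` and the file name `icinga2-pin` are made up for illustration.

```shell
#!/bin/sh
# Sketch: write an APT preferences entry that keeps icinga2 packages on 2.10.x.
# write_icinga_pin and the file name "icinga2-pin" are illustrative names.
write_icinga_pin() {
    pref_dir="${1:-/etc/apt/preferences.d}"
    # Priority 990 prefers 2.10.x over a newer candidate at the default
    # priority of 500, but stays below 1000 so it never forces a downgrade.
    cat > "$pref_dir/icinga2-pin" <<'EOF'
Package: icinga2*
Pin: version 2.10.*
Pin-Priority: 990
EOF
}
```

Run as root with no argument, this writes into /etc/apt/preferences.d; `apt-cache policy icinga2` should then show which version APT would pick.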

I have read https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/ and seen quite how much has changed between 2.10 and 2.11. I rather agree with the comment in there that it could well have been called Icinga 3.0.

I have several questions:

  1. Is this problem caused by upgrading the Clients to 2.11 while the Master is still on 2.10.5?

  2. Is there any way to run 2.10.5 on my Master machine and update Clients to 2.11 and keep service checks running?

  3. If I upgrade my Master to 2.11, will this cause problems for the Clients still running 2.10.5?

  4. Once a machine has been upgraded from 2.10.5 to 2.11, if I run into problems, is there any reasonable way to downgrade back to 2.10.5 (short of restoring the backup of the whole machine)?

  5. How / where can I find more information about what “check command does not exist” is actually trying to tell me (in other words, what does 2.11 want me to change so that the checks which work perfectly well under 2.10.5 will continue working)?

Because my Client machines are in Development, Test and Production clusters, I cannot upgrade all at the same time (especially if it turns out that the upgrade causes problems, such as I had with the Development machines last week). We need to observe stable behaviour on Development and Test before we can proceed to upgrade the Live environment. Therefore we have no choice but to run with mixed versions of Icinga for short periods of time (and this has worked well over the past 2 years).

Thanks for any guidance on this. I am especially concerned about upgrading the Icinga Master machine, since obviously our entire monitoring infrastructure depends on that one working, and being compatible with the Clients.

Thanks,

Antony.

I have the exact same issue. I updated one of my endpoints and suddenly some (!) commands were no longer found. I thought it was a coincidence and updated another endpoint without altering anything else… same result: some (!) services give the error “command does not exist”.

As a last resort I upgraded all my endpoints to 2.11, but not every endpoint shows the problem, even though the config is the same on all of them?!

Permissions on the plugins are also the same on working and non-working endpoints.
The problem seems to affect only custom plugins, but those are the same on all the endpoints.

“/var/lib/icinga2/api/zones/global-templates/_etc/commands.conf” is exactly the same on all the endpoints:

root@endpoint02:/var/lib/icinga2/api/zones/global-templates/_etc# sha256sum commands.conf
2e6a4811e9785930ce1e44957106a957a803373d12a26bdf163e78d72971fab4 commands.conf

root@endpoint01:/var/lib/icinga2/api/zones/global-templates/_etc# sha256sum commands.conf
2e6a4811e9785930ce1e44957106a957a803373d12a26bdf163e78d72971fab4 commands.conf
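
Checking one file at a time gets tedious, so the same comparison can be extended to the whole synced zone tree. This is only a sketch: `zone_manifest` is an illustrative helper name, and metadata files such as .timestamp may legitimately differ between endpoints.

```shell
#!/bin/sh
# Sketch: print a sorted "checksum  path" list for every file below a zone
# directory, so the output from two endpoints can be compared with diff.
# zone_manifest is an illustrative name, not an Icinga tool.
zone_manifest() {
    dir="${1:-/var/lib/icinga2/api/zones}"
    # Run in a subshell so the caller's working directory is untouched,
    # and use relative paths so manifests from different hosts line up.
    ( cd "$dir" && find . -type f -exec sha256sum {} + | sort -k2 )
}
```

Saving the output on each endpoint (for example `zone_manifest > /tmp/endpoint01.txt`) and running `diff` on the two files shows any file that differs or is missing.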

The debug log shows the command being executed on the working agents but not on the two upgraded agents, so somehow it seems to be executed on the wrong endpoint, but that’s just an assumption.

Checking the API on the Icinga 2 master shows a difference.
Part of the output for a working endpoint:

"command": [
    "/usr/lib/nagios/plugins/check_mem.pl",
    "-C",
    "-c",
    "90",
    "-u",
    "-w",
    "70"
],

Part of the output for a non-working endpoint:

"command": null,
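
To spot this symptom without reading the JSON by hand, a small filter can flag responses where the command is null. This is a rough sketch that relies on plain text matching rather than a JSON parser; `has_null_command` is an illustrative name.

```shell
#!/bin/sh
# Sketch: succeed when the JSON read from stdin contains a "command" key
# whose value is null, which is the symptom seen on the broken endpoints.
# has_null_command is an illustrative name; it matches text, not parsed JSON.
has_null_command() {
    grep -Eq '"command"[[:space:]]*:[[:space:]]*null'
}
```

The output of a query such as `curl -k -s -u USER:PASSWORD 'https://MASTER:5665/v1/objects/services?attrs=last_check_result'` could be piped into it; USER, PASSWORD and MASTER are placeholders for your own API credentials and master host.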

FOUND IT!

Reset the non-working endpoint:

rm -rf /var/lib/icinga2/api/zones/*
rm -rf /var/lib/icinga2/api/zones-stage/*

systemctl restart icinga2
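
For anyone nervous about the rm -rf, a more cautious variant archives the two directories before clearing them. This is a sketch under assumptions: `reset_zone_cache` is an illustrative name, and /var/backups as the archive location is just a suggestion.

```shell
#!/bin/sh
# Sketch: archive the synced zone directories, then clear them.
# reset_zone_cache and the default backup location are assumptions.
reset_zone_cache() {
    base="${1:-/var/lib/icinga2/api}"
    backup_dir="${2:-/var/backups}"
    for d in zones zones-stage; do
        # zones-stage only exists on 2.11, so skip what is not there.
        [ -d "$base/$d" ] || continue
        # Archive first; stop without deleting if the archive fails.
        tar -czf "$backup_dir/icinga2-$d.tar.gz" -C "$base" "$d" || return 1
        # ${base:?} aborts instead of expanding to "/" if base is empty.
        rm -rf "${base:?}/$d"/*
    done
}
```

After running it as root, restart the service with `systemctl restart icinga2` as above so the agent fetches the configuration again.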

The advice by @dnsmichi is explained here:

We truly recommend upgrading masters first, then satellites, then agents. The other way around may work, but given the new cluster config stages, such cases may occur.

Excellent - thanks for the information.

I’ll try upgrading the same two Development machines which went wrong last
week, check they still have the problem, and try deleting the directories you
mention to see if they start behaving.

If they do, that’ll give me confidence to do the Test and Live machines as well
(but I’m still going to have backups, just in case…)

I do wish there were some more enforced notification that this is such a major
upgrade, and that plenty of things can go wrong (as you can find at
https://icinga.com/docs/icinga2/latest/doc/16-upgrading-icinga-2/),
when you innocently do a Debian package update.

Anyone deliberately upgrading Icinga can be expected to read the docs and find
out about the potential pitfalls, but how many people are using the DEB, RPM
and other package repositories, and therefore don’t know an update is even
going to happen until it’s too late, and things like this happen to their
systems?

Ho hum,

Antony.

Hi,

It may have an influence, yes. The preferred way is to upgrade the master first, then satellite, then agent/client.

By default, it should. With 2.11 and the config sync stages, it might not exactly be possible. Still trying to understand what’s happening.

The log on the satellite/client should tell you:

Received configuration update without checksums from parent endpoint master1.localdomain. This behaviour is deprecated. Please upgrade the parent endpoint to 2.11+
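
A quick way to look for that message is to search the agent's log for it. This is a minimal sketch; `parent_lacks_checksums` is an illustrative name, and the default path assumes the main log is at /var/log/icinga2/icinga2.log.

```shell
#!/bin/sh
# Sketch: print every log line showing that the parent endpoint sent a
# config update without checksums (parent still older than 2.11).
# parent_lacks_checksums is an illustrative name; the log path is a default
# and may differ on your system.
parent_lacks_checksums() {
    log="${1:-/var/log/icinga2/icinga2.log}"
    # -F treats the pattern as a fixed string; exit status is 0 on a match.
    grep -F 'Received configuration update without checksums from parent endpoint' "$log"
}
```

If it prints anything, the agent is newer than its parent and is falling back to the old comparison.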

If you don’t use any new features in the configuration (getenv() for example), that’s possible.

Since this is a new problem, no exact troubleshooting entry exists for this complex problem. Though, whenever this message occurs, the first steps should be to verify whether the commands are really synced to the agent/client node and are loaded into production memory.

https://icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#cluster-troubleshooting-command-endpoint-errors

If the reload is not triggered, here’s a separate entry for 2.11: https://icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#new-configuration-does-not-trigger-a-reload

That’s a good thing, having a test environment before going live.

Which isn’t a problem, if you follow the requirement to upgrade masters and satellites first, then agents/clients. That’s what we test before the release and what’s also documented.

2.11 introduces a zones-stage directory where the configuration is put first. This ensures that broken configuration is not put into production immediately, where everything would then fail after a manual restart.

Check things in there first, this is also used for the initial checksum comparison.

Whenever the parent node doesn’t send the checksums, it falls back to the .timestamp file and does the old comparison.

That’s also noted in the documentation, since many users asked in the past years.
https://icinga.com/docs/icinga2/latest/doc/06-distributed-monitoring/#versions-and-upgrade

In terms of an upgrade, ensure that the master is upgraded first, then involved satellites, and last the Icinga agents. If you are on v2.10 currently, first upgrade the master instance(s) to 2.11, and then proceed with the satellites. Things are getting easier with any sort of automation tool (Puppet, Ansible, etc.).
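
The ordering rule can be turned into a small guard for an upgrade script: only upgrade an agent when its parent is already at the target version or newer. This is a sketch, not something shipped with Icinga; `version_ge` is an illustrative name and it relies on `sort -V` from GNU coreutils.

```shell
#!/bin/sh
# Sketch: succeed when version $1 is greater than or equal to version $2.
# version_ge is an illustrative name; requires GNU sort for -V.
version_ge() {
    # Sort both versions in version order; if $1 ends up last (or both
    # are equal), then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}
```

For example, `version_ge "$parent_version" "$target_version" || exit 1` would stop an agent upgrade while the master is still on 2.10.5; the two variables are placeholders you would fill in yourself, for instance from the output of `icinga2 --version` on each node.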

I know that this isn’t always doable, but it is still the safest way.

I’m not happy with the cluster config sync changes which were required to solve a long-standing issue. On the other hand, hearing that the bug is 4 years old and being asked why no-one fixes it is also annoying, which is why I started to implement the change in September last year.

The checksum logic is not optimal, but no-one actually objected to it. The technical issue, including the design, was open for many months.

Agreed, somewhere inside there is a possible bug with the config change detection when different versions talk to each other. Any debug log extraction with a reproducible configuration will help mitigate and possibly fix or enhance the behaviour.

That’s been discussed for many years now. The package systems don’t allow for notifications during install/upgrade and each way differs in the way you could inject a log. And then you need to keep your fingers crossed that users actually read this.

Our expectation is, and that still stands, that you do follow the release announcements prior to hitting “yum update”. There is a reason why we hold back the release up until the announcement is published - in order to allow people to follow along.

With 2.11, I had people telling me that they did not hear about the 2.11 Release Candidate. It had been sitting there for 8 weeks, for the very reason that the changes to the network stack and cluster config sync are huge and people should become aware of any possible problems, which developers can then tackle before the final release.

Turns out, many just wait for the “yum update” and then things fail. Don’t get me wrong, but getting all that “this shit doesn’t work” feedback is really something I’d like to avoid after a stressful release including the RC feedback weeks. I also heard “we’ll just wait for 2.11.1” as a joke - well, if you always depend on others, development won’t be a success and likely not all problems will get fixed.

What else is needed to tell people to not just pull in upgrades but follow the release announcement on http://icinga.com/blog with the changelog and upgrading documentation? You can even follow along on twitter.com/icinga or FB, if you are more of a social media fellow.

Anyhow, as always, problems may occur with software changes. While I don’t have time to debug things myself, rest assured that we are listening and will try to help with workarounds whenever possible.

Cheers,
Michael