I’m using Icinga 2.10.5 with a Master and multiple Clients. Those Clients fall into three groups: Development, Test and Production. All run Debian 9 Stretch.
Last Thursday two of my Development machines got updated to Icinga 2.11, and some (but not all) of my service checks on those machines then failed with “check command does not exist”. No configuration changes were made at the same time.
I downgraded Icinga back to 2.10.5, but this did not fix the problem. Fortunately:
a) these were “only” development machines
b) we take nightly full backups of every machine
c) no significant data is stored on any single Client
I ended up fixing the problem by restoring the previous nightly backup of these two machines.
I have pinned Icinga at version 2.10.x for the time being on all machines so that Debian package upgrades don’t take it up to 2.11 again.
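For reference, the pin can be done with an apt preferences file. This is just a sketch; the file path and the version pattern are assumptions to adapt to whatever `apt-cache policy icinga2` reports on your systems:

```
# /etc/apt/preferences.d/icinga2  (path is an assumption; any file in
# preferences.d works). Hold all icinga2* packages at 2.10.x so a routine
# "apt-get upgrade" does not pull in 2.11.
Package: icinga2*
Pin: version 2.10.*
Pin-Priority: 1001
```

A priority above 1000 also allows apt to downgrade an already-upgraded machine back to the pinned version.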
Is this problem caused by upgrading the Clients to 2.11 while the Master is still on 2.10.5?
Is there any way to run 2.10.5 on my Master machine and update Clients to 2.11 and keep service checks running?
If I upgrade my Master to 2.11, will this cause problems for the Clients still running 2.10.5?
Once a machine has been upgraded from 2.10.5 to 2.11, if I run into problems, is there any reasonable way to downgrade back to 2.10.5 (short of restoring the backup of the whole machine)?
How / where can I find more information about what “check command does not exist” is actually trying to tell me (in other words, what does 2.11 want me to change so that the checks which work perfectly well under 2.10.5 will continue working)?
Because my Client machines are in Development, Test and Production clusters, I cannot upgrade all at the same time (especially if it turns out that the upgrade causes problems, such as I had with the Development machines last week). We need to observe stable behaviour on Development and Test before we can proceed to upgrade the Live environment. Therefore we have no choice but to run with mixed versions of Icinga for short periods of time (and this has worked well over the past 2 years).
Thanks for any guidance on this. I am especially concerned about upgrading the Icinga Master machine, since obviously our entire monitoring infrastructure depends on that one working, and being compatible with the Clients.
I have the exact same issue. I updated one of my endpoints and suddenly some (!) commands are not found. I thought it was a coincidence and updated another endpoint without altering anything else… same result, and some (!) services give the error “check command does not exist”.
As a last resort I upgraded all my endpoints to 2.11, but it turned out that not every endpoint has the same problem, even though the config is the same overall?!
Permissions on the plugins are also identical between working and non-working endpoints.
The problem seems to be only with custom plugins, but those are the same on all the endpoints.
“/var/lib/icinga2/api/zones/global-templates/_etc/commands.conf” is exactly the same on all the endpoints.
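Checksums make that comparison quicker than eyeballing file contents. A minimal sketch; the helper name is mine, and the hostnames in the usage comment are placeholders:

```shell
#!/bin/sh
# Succeed (exit 0) if two files have identical content, by SHA-256 checksum.
same_checksum() {
    a=$(sha256sum "$1" | cut -d' ' -f1)
    b=$(sha256sum "$2" | cut -d' ' -f1)
    [ "$a" = "$b" ]
}

# Typical use: fetch the synced file from an endpoint and compare it against
# the master's copy ("dev1" is a placeholder hostname):
#   scp dev1:/var/lib/icinga2/api/zones/global-templates/_etc/commands.conf /tmp/dev1-commands.conf
#   same_checksum /tmp/dev1-commands.conf \
#       /var/lib/icinga2/api/zones/global-templates/_etc/commands.conf \
#       && echo "in sync" || echo "differs"
```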
Anyone deliberately upgrading Icinga can be expected to read the docs and find out about the potential pitfalls, but how many people are using the DEB, RPM and other package repositories, and therefore don’t know an update is even going to happen until it’s too late, and things like this happen to their systems?
It may have an influence, yes. The preferred way is to upgrade the master first, then the satellites, then the agents/clients.
By default, it should work. With 2.11 and the config sync stages, it might not be possible in every case. We are still trying to understand what’s happening.
The log on the satellite/client should tell you:
Received configuration update without checksums from parent endpoint 'master1.localdomain'. This behaviour is deprecated. Please upgrade the parent endpoint to 2.11+
If you don’t use any new features in the configuration (getenv() for example), that’s possible.
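A package-manager downgrade might then look like this sketch. The exact Debian version strings are assumptions, so check what your repository actually offers first:

```
# List the versions your configured repositories offer:
apt-cache policy icinga2

# Downgrade the core packages ("2.10.5-1" is an assumption; use the
# version string reported above):
apt-get install --allow-downgrades \
    icinga2=2.10.5-1 icinga2-bin=2.10.5-1 icinga2-common=2.10.5-1
```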
Since this is a new problem, no exact troubleshooting entry exists for it yet. Whenever this message occurs, though, the first steps should be to verify whether the commands are really synced to the agent/client node and are loaded into the running configuration.
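On the affected agent/client, a sketch of both checks might look like this; the object name is a placeholder, and the paths are the defaults shipped by the Debian packages as far as I know:

```
# Validate the configuration the daemon would load:
icinga2 daemon -C

# Check whether the command object actually made it into the running
# config ("mycheck" is a placeholder for the failing check command):
icinga2 object list --type CheckCommand --name 'mycheck'

# Confirm the synced file is present on disk:
ls -l /var/lib/icinga2/api/zones/global-templates/_etc/
```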
That’s a good thing, having a test environment before going live.
Which isn’t a problem, if you follow the requirement to upgrade masters and satellites first, then agents/clients. That’s what we test before the release and what’s also documented.
2.11 introduces a zones-stage directory where the configuration is put first. This ensures that broken configuration is not immediately put into production, where everything would only fail after a manual restart.
Check things in there first; this directory is also used for the initial checksum comparison.
Whenever the parent node doesn’t send the checksums, it falls back to the .timestamp file and does the old comparison.
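To check things in there, something along these lines; the paths are the packaged defaults to the best of my knowledge, and `<zone>` is a placeholder for your zone name:

```
# Staged configuration received from the parent, before production:
ls -lR /var/lib/icinga2/api/zones-stage/

# If staging validation failed, the reason is logged here:
cat /var/lib/icinga2/api/zones-stage/startup.log

# Production configuration and the legacy comparison file:
ls -l /var/lib/icinga2/api/zones/global-templates/_etc/
cat /var/lib/icinga2/api/zones/<zone>/.timestamp
```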
In terms of an upgrade, ensure that the master is upgraded first, then involved satellites, and last the Icinga agents. If you are on v2.10 currently, first upgrade the master instance(s) to 2.11, and then proceed with the satellites. Things are getting easier with any sort of automation tool (Puppet, Ansible, etc.).
I know that this isn’t always doable, still this is the way which is the safest one.
I’m not happy with the cluster config sync changes either, but they were required to solve a long-standing issue. On the other hand, hearing that the bug was four years old and being asked why no-one fixed it was also annoying, which is why I started to implement the change last September.
The checksum logic is not optimal, but no-one actually spoke up against it. The technical issue, including the design, was open for many months.
Agreed, somewhere inside there is a possible bug with the config change detection when different versions talk to each other. Any debug log extraction with a reproducible configuration will help mitigate and possibly fix or enhance the behaviour.
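To produce such an extraction, a sketch like this should do; the feature name and log path are the packaged defaults:

```
# Enable the debug log and restart the daemon:
icinga2 feature enable debuglog
systemctl restart icinga2

# After reproducing the failing check, pull out the config-sync
# related lines for the bug report:
grep -iE 'ApiListener|config' /var/log/icinga2/debug.log
```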
That’s been discussed for many years now. The package systems don’t allow for notifications during install/upgrade, and each one differs in how you could inject such a message. And then you need to keep your fingers crossed that users actually read it.
Our expectation is, and that still stands, that you do follow the release announcements prior to hitting “yum update”. There is a reason why we hold back the release up until the announcement is published - in order to allow people to follow along.
With 2.11, I had people telling me that they did not hear about the 2.11 Release Candidate. It had been sitting there for 8 weeks, for the very reason that the changes to the network stack and cluster config sync are huge, and people should actually become aware of possible problems which developers can tackle before the final release.
Turns out, many just wait for the “yum update” and then things fail. Don’t get me wrong, but getting all that “this shit doesn’t work” feedback is really something I’d like to avoid after a stressful release, including the RC feedback weeks. I also heard “we’ll just wait for 2.11.1” as a joke. Well, if you always depend on others, development won’t be a success, and likely not all problems will get fixed.
What else is needed to tell people not to just pull in upgrades, but to follow the release announcement on http://icinga.com/blog with the changelog and upgrading documentation? You can even follow along on twitter.com/icinga or FB, if you are more of a social media person.
Anyhow, as always, problems may occur with software changes. While I don’t have time to debug everything myself, rest assured that we are listening and will try to help with workarounds whenever possible.