Configuration Management - Stage Creation

Hi

I’ve been struggling with what I assume to be a bug in the icinga2 configuration management. The issue is hard to isolate, so for now I can only describe the circumstances under which it pops up:

I’m using the config management to deploy packages on my icinga2 master; these are either bigger global zones or smaller “per-host” definitions (mostly Host type objects). The API requests are made from an Ansible playbook with a simple logic:

  1. Try to create the package (usually fails with HTTP 500 since the package already exists)
  2. Try to create a new stage in the package, which results in:
    {
      "status": "Created stage. Reload triggered.",
      "code": 200,
      "stage": "40ca7a81-9a39-469a-976e-55ada46bc03c",
      "package": "azure-global"
    }
  3. Try to determine the result of the reload (in this case by querying /v1/config/files/azure-global/40ca7a81-9a39-469a-976e-55ada46bc03c/status).

Step 3 is repeated several times with a preset delay whenever we get an HTTP 404 (message: “path not found”). This loop usually ends (after max. 3 retries) in an HTTP 200 once the stage has been loaded successfully. A rough sketch of these API calls is shown below.
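For reference, the three calls boil down to roughly this (untested curl sketch; host, credentials and the example file content are placeholders I made up, the package name is the one from the output above):

ICINGA="https://icinga-master.example.com:5665"
AUTH="root:icinga"

# 1. create the package (usually answers HTTP 500 here because it already exists)
curl -k -s -u "$AUTH" -H 'Accept: application/json' \
  -X POST "$ICINGA/v1/config/packages/azure-global"

# 2. upload a new stage; the response body contains the generated stage name
curl -k -s -u "$AUTH" -H 'Accept: application/json' \
  -X POST "$ICINGA/v1/config/stages/azure-global" \
  -d '{ "files": { "zones.d/azure/hostgroups.conf": "object HostGroup \"azure\" { }" } }'

# 3. poll the stage status with a delay until it stops returning 404
curl -k -s -u "$AUTH" -H 'Accept: application/json' \
  "$ICINGA/v1/config/files/azure-global/40ca7a81-9a39-469a-976e-55ada46bc03c/status"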

There are cases, however, where the query for the stage simply won’t return a 200 and stays at 404 for a full minute of retries. Rerunning the playbook can then make the previously failed stage update succeed without issues. The size of the stage (the number of icinga2 objects it contains) does not seem to be a factor: even stages of minimal size (e.g. definitions for 4 HostGroups) can trigger the behavior and simply stay nonexistent for status queries, even after 5 minutes or more.

Looking at the debug log, a stage reload for package “azure-global” and stage “40ca7a81-9a39-469a-976e-55ada46bc03c” (as described in the output above) triggers the following command:

/usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio --validate --define ActiveStageOverride=azure-global:40ca7a81-9a39-469a-976e-55ada46bc03c

The output is not shown in the debug log and no error seems to be reported.
Running the command on the master myself doesn’t return any errors either; the output looks like any other successful config validation.
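In case someone wants to reproduce this: rerunning the validation by hand and then checking its exit code plus the stage directory (which is where status and startup.log normally end up) looks roughly like this, using the paths from above:

/usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio --validate \
  --define ActiveStageOverride=azure-global:40ca7a81-9a39-469a-976e-55ada46bc03c
echo "exit code: $?"
# status and startup.log are what distinguish a good stage from a bad one (see below)
ls -l /var/lib/icinga2/api/packages/azure-global/40ca7a81-9a39-469a-976e-55ada46bc03c/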

Looking at the stage itself in the filesystem, I noticed a difference; the only contents of 40ca7a81-9a39-469a-976e-55ada46bc03c were:

conf.d/
include.conf
zones.d/

An active stage usually looks like this:

conf.d/
include.conf
startup.log
status
zones.d/

At this point I had plenty of bad stages for all my packages, so I decided to purge all stages except the active one:

for package_path in /var/lib/icinga2/api/packages/*; do
  package=${package_path##*/};
  active_stage=$(cat "${package_path}/active-stage");
  for stage_path in /var/lib/icinga2/api/packages/${package}/*; do
    stage=${stage_path##*/};
    if [ "${#stage}" -eq 36 ] && [ "${stage}" != "${active_stage}" ]; then
      rm -rf "${stage_path}";
    fi;
  done;
done

→ Not related to this issue, but please tell me if there’s a better way to do this. The API allows deletion of single stages, which seems a bit cumbersome if there are a lot of packages/stages.
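For what it’s worth, the same cleanup could probably be driven through the API instead of the filesystem, along these lines (untested sketch using curl and jq; host and credentials are placeholders, and it assumes the /v1/config/packages response exposes the "name", "stages" and "active-stage" fields):

ICINGA="https://localhost:5665"
AUTH="root:icinga"

# list every package with its stages and delete everything except the active stage
curl -k -s -u "$AUTH" "$ICINGA/v1/config/packages" \
  | jq -r '.results[] | .name as $p | ."active-stage" as $a | .stages[] | select(. != $a) | "\($p) \(.)"' \
  | while read -r package stage; do
      curl -k -s -u "$AUTH" -H 'Accept: application/json' \
        -X DELETE "$ICINGA/v1/config/stages/${package}/${stage}"
    done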

Removing the unused stages for all packages made the issue vanish instantly; the next Ansible run went through without issues. I’ve decided to do these purges at regular intervals until I’ve managed to get to the core of this issue.

Does this weird behavior ring a bell with you guys? I’ve seen this in icinga2 2.10 and now also in 2.11.4.

Hello, regarding this point: I have observed that this happens while icinga is still validating the config, so the status file has not been written yet. As soon as validation succeeds (or fails), you should get a result for the query on status. The same logic should apply to the startup.log URL.
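So once validation is through, both of these queries should stop returning 404 (host and credentials are placeholders, the stage id is the one from your output):

curl -k -s -u root:icinga \
  'https://localhost:5665/v1/config/files/azure-global/40ca7a81-9a39-469a-976e-55ada46bc03c/status'
curl -k -s -u root:icinga \
  'https://localhost:5665/v1/config/files/azure-global/40ca7a81-9a39-469a-976e-55ada46bc03c/startup.log'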

If you need to purge everything and start from something fresh, you can also directly delete your package with the related API command and then recreate it with the configuration required for your previous active stage:
https://icinga.com/docs/icinga2/latest/doc/12-icinga2-api/#deleting-configuration-package
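Something along these lines (host and credentials are placeholders, the package name is the one from your post):

# delete the whole package, then recreate it and upload a fresh stage
curl -k -s -u root:icinga -H 'Accept: application/json' \
  -X DELETE 'https://localhost:5665/v1/config/packages/azure-global'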

I don’t think I can help you much more. I guess your best chance to get more information about what’s going on is either to strace the icinga process while it’s validating, to confirm the system didn’t deny it some resources, or to attach a debugger to the icinga process.
https://icinga.com/docs/icinga2/latest/doc/21-development/#debug-icinga-2
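For the strace option, one way would be to wrap the validation call itself and grep for denied or failed syscalls, e.g. (rough sketch, using the stage id from your earlier output):

strace -f -e trace=file -o /tmp/icinga-validate.strace \
  /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio --validate \
  --define ActiveStageOverride=azure-global:40ca7a81-9a39-469a-976e-55ada46bc03c
grep -E 'EACCES|EPERM|ENOSPC' /tmp/icinga-validate.strace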

It could also be a problem already known to the icinga team; I’d advise you to wait for them to answer here.