Flapping thresholds for load and procs checks even though higher values are explicitly set in the config files

Dear Icinga Community,

for a couple of nodes the "load" and "procs" checks are flapping.
The higher warning/critical thresholds defined in the configuration files are periodically not taken into account and are reverted to the lower Icinga defaults, which makes the service checks flap.
A couple of seconds later the higher values from the config files take effect again, so the critical state clears, but then the low Icinga defaults suddenly take over once more, and the whole cycle repeats several times a minute.

The Icinga version is r2.13.6-1, running on CentOS 7.9 Linux with kernel 3.10.0-1160.76.1.el7.x86_64.

In the file /etc/icinga2/zones.d/master/services.conf non-default high threshold values are explicitly set.
For the load check:

vars.load_wload1 = "70"
vars.load_wload5 = "70"
vars.load_wload15 = "70"
vars.load_cload1 = "90"
vars.load_cload5 = "90"
vars.load_cload15 = "90"

For the procs check:

vars.procs_warning = 3000
vars.procs_critical = 6000

The same values are also defined in /usr/share/icinga2/include/command-plugins.conf, since that is where the low default values were still present; I hoped that changing both config files might fix the issue, but it did not.

After the config changes, the icinga2 service was restarted several times; that did not fix the flapping either.

In the web interface, critical errors pop up periodically for a few hosts (multiple times per minute); at those moments the WebGUI shows the low default values for the warning/critical thresholds.
For example:
Performance data
Label Value Warning Critical
procs 999.00 250.00 400.00

A couple of seconds later, the higher threshold values defined in the config files take effect. At this stage the warning/critical thresholds displayed in the WebGUI are the higher values set in the config files, the red critical states are gone, and everything is green:
Performance data
Label Value Warning Critical
procs 1,005.00 3,000.00 6,000.00

A couple of seconds later, the higher thresholds from the config files are reset back to the low defaults and critical checks appear in red again.
This repeats several times a minute, which makes the whole check very annoying and pointless.

I just cannot figure out where those pesky default values are still stored, or why the thresholds are periodically switched back and forth between the low Icinga defaults and the higher custom values.

Could you please give me some hints on what to change or check?

Best regards

Hello,

can you please share the service definitions for the two checks as well as the host definitions where the problem occurs? Service and host templates could also be helpful.

Changing /usr/share/icinga2/include/command-plugins.conf is not recommended, as your changes will be undone by the next package update.
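If you want to keep such overrides out of the individual service definitions, a template in your own zones.d config does the same job without touching the ITL file. A minimal sketch (the template name is just an example, not something from your setup), which the affected services could then import:

template Service "high-load-thresholds" {
  // overrides for the custom vars used by the "load" CheckCommand
  vars.load_wload1 = "70"
  vars.load_wload5 = "70"
  vars.load_wload15 = "70"
  vars.load_cload1 = "90"
  vars.load_cload5 = "90"
  vars.load_cload15 = "90"
}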

Hi,

first of all thanks for any helpful hint.

Regarding /usr/share/icinga2/include/command-plugins.conf: I only changed that file out of desperation. It was the only other place where I could find the low default limits defined, so I was hoping that modifying it would work around the issue.

As for the service definitions, the load service from services.conf is copy-pasted below (the variables like NativeAgent, OStype, PuppetRole etc. are working fine, so they are not what is causing the issue):

apply Service "load" {
  import "generic-service"
  vars.load_wload1 = "70"
  vars.load_wload5 = "70"
  vars.load_wload15 = "70"
  vars.load_cload1 = "90"
  vars.load_cload5 = "90"
  vars.load_cload15 = "90"

  command_endpoint = host.vars.agent_endpoint
  check_command = "load"

  /* Used by the ScheduledDowntime apply rule in `downtimes.conf`. */
  vars.backup_downtime = "02:00-03:00"

  // Assign a command endpoint check on the host if the host:
  // - is an endpoint
  // - has a native Icinga agent installed
  // - is running Linux
  // - is not a compute node in the HPC cluster
  assign where host.vars.agent_endpoint && host.vars.NativeAgent == "True" && host.vars.OStype == "Linux" && host.vars.PuppetRole != "compute"
}

Below is the service definition for the procs check, copy-pasted from services.conf:

apply Service "procs" {
  import "generic-service"
  
  vars.procs_warning = "3000"
  vars.procs_critical = "6000"
  
  command_endpoint = host.vars.agent_endpoint
  check_command = "procs"
  
  // Assign a command endpoint check on the host if the host:
  // - is an endpoint
  // - has a native Icinga agent installed
  // - is running Linux
  // - is not a virtualisation node
  // - is not a compute node in the HPC cluster
  assign where host.vars.agent_endpoint && host.vars.NativeAgent == "True" && host.vars.OStype == "Linux" && host.vars.NodeFunction != "virtualisation" && host.vars.PuppetRole != "compute"
}

Now, when it comes to the actual host definitions, looking at them more closely, that could be where the problem arises. The hosts which work fine are explicitly defined via HostName-host.conf files in /etc/icinga2/zones.d/master, generated from Puppet collected resources. The two misbehaving nodes, however, were added by my colleague via an API command. Using the API also randomly nuked scheduled downtimes (I think we were hitting some Icinga bug, because I saw similar issues reported on the forum by other members without a solution, and asked about it here myself without a solution), so we no longer use the API.
Based on that, I will try to remove these two misbehaving nodes via the API and see what that does.
Could it be that nodes which were added via the API are somehow treated differently from nodes which were added via the configuration files?
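To double-check whether these two nodes still exist as runtime (API-created) objects on the master, I will look along these lines (a rough sketch; the API credentials are placeholders, and as far as I understand file-based objects should report package "_etc" while API-created ones report "_api"):

find /var/lib/icinga2/api/packages/_api -name '*.conf'
curl -k -s -u apiuser:password 'https://localhost:5665/v1/objects/hosts/loginXYZ.fqdn'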

Best regards

Maybe enabling the debug log sheds some light on this.
Take a look at the debug log around a reload of the icinga2 service and during check execution.
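If the debug log feature is not enabled yet, roughly (assuming the default paths):

icinga2 feature enable debuglog
systemctl restart icinga2
tail -f /var/log/icinga2/debug.log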

I have enabled the debug log on the monitored client where the checks are flapping.
All nodes are added the same way in the config: every host has a corresponding entry with its IP address, short hostname and FQDN in nodes.conf, like:

object Host "loginXYZ.fqdn" {
  import "generic-host"
  address = "10.109.66.157"
  check_command = "hostalive"
  vars.agent_endpoint = name
  vars.name = "loginXYZ.fqdn"

  vars += {
    fqdn            = "loginXYZ.fqdn"
    OStype          = "Linux"
    PuppetRole      = "hpc_login"
    os              = "Linux"
    NativeAgent     = "True"
    name            = "login"
    redfish         = true
  }

  groups = [
  ]

}

Looking at the checks in the log, it is clear that the thresholds are indeed flapping between the higher values I have defined in the config and the lower defaults (see below).
The thresholds are periodically, and apparently randomly, overridden from
'-c' '90,90,90' '-w' '70,70,70'
to
'-c' '10,6,4' '-w' '5,4,3'

How can I figure out why the thresholds are overridden for this particular host while other hosts work fine, considering that all hosts are added the same way?

grep -i '/usr/lib64/nagios/plugins/check_load' /var/log/icinga2/debug.log
[2023-07-18 15:20:50 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70': PID 72088
[2023-07-18 15:20:50 +0300] notice/Process: PID 72088 ('/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70') terminated with exit code 0
[2023-07-18 15:21:02 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3': PID 72200
[2023-07-18 15:21:02 +0300] notice/Process: PID 72200 ('/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3') terminated with exit code 2
[2023-07-18 15:21:31 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70': PID 72394
[2023-07-18 15:21:31 +0300] notice/Process: PID 72394 ('/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70') terminated with exit code 0
[2023-07-18 15:22:01 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3': PID 72608
[2023-07-18 15:22:01 +0300] notice/Process: PID 72608 ('/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3') terminated with exit code 2
[2023-07-18 15:22:30 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70': PID 72812
[2023-07-18 15:22:30 +0300] notice/Process: PID 72812 ('/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70') terminated with exit code 0
[2023-07-18 15:23:00 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3': PID 73709
[2023-07-18 15:23:00 +0300] notice/Process: PID 73709 ('/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3') terminated with exit code 2
[2023-07-18 15:23:28 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70': PID 73919
[2023-07-18 15:23:29 +0300] notice/Process: PID 73919 ('/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70') terminated with exit code 0
[2023-07-18 15:23:59 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3': PID 74116
[2023-07-18 15:23:59 +0300] notice/Process: PID 74116 ('/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3') terminated with exit code 2

Even after I have explicitly set the following for this particular host in the config file:

  vars.load_wload1 = "70"
  vars.load_wload5 = "70"
  vars.load_wload15 = "70"
  vars.load_cload1 = "90"
  vars.load_cload5 = "90"
  vars.load_cload15 = "90"
  vars.procs_warning = "3000"
  vars.procs_critical = "6000"

and restarted the icinga2 service both on the master monitoring server (from where the checks are pushed) and on this particular monitored client: nope, same thing.
According to the debug log on the client side, the custom higher thresholds are still periodically and apparently randomly replaced by the defaults.

grep -i '/usr/lib64/nagios/plugins/check_load' /var/log/icinga2/debug.log
[2023-07-18 16:00:33 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70': PID 108365
[2023-07-18 16:00:33 +0300] notice/Process: PID 108365 ('/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70') terminated with exit code 0
[2023-07-18 16:00:43 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3': PID 108501
[2023-07-18 16:00:43 +0300] notice/Process: PID 108501 ('/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3') terminated with exit code 2
[2023-07-18 16:01:12 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70': PID 109132
[2023-07-18 16:01:12 +0300] notice/Process: PID 109132 ('/usr/lib64/nagios/plugins/check_load' '-c' '90,90,90' '-w' '70,70,70') terminated with exit code 0
[2023-07-18 16:01:41 +0300] notice/Process: Running command '/usr/lib64/nagios/plugins/check_load' '-c' '10,6,4' '-w' '5,4,3': PID 109725

Have you tried clearing the C:\ProgramData\icinga2\var\lib\icinga2\api\ directory on the affected agents and restarting the service?
That way the agent should start fresh and get the config synced from the master.

It is a Linux host.
So I stopped the icinga2 service and cleaned up the /var/lib/icinga2/api/ directory.
After restarting the icinga2 service, the contents of the subdirectories were recreated automatically, and boom, the flapping check issue is back.
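For reference, this is roughly what I ran (paths as on this CentOS 7 agent):

systemctl stop icinga2
rm -rf /var/lib/icinga2/api/*
systemctl start icinga2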

However, in none of the files under the newly created /var/lib/icinga2/api/zones/global-templates/_etc can I find the higher threshold values I set by hand, and the same applies to the other monitored nodes that work fine. In fact, the services.conf file is not present under /var/lib/icinga2/api/zones/global-templates/_etc on any of the monitored hosts.
I just don't get what makes this particular monitored host act differently and cause this issue.

Is there any local config in /etc/icinga2/conf.d that could interfere?
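A quick sketch of how to check that on the agent (assuming a default install layout): see whether conf.d is still included and whether it defines its own load/procs checks with the default thresholds.

grep -n 'include_recursive "conf.d"' /etc/icinga2/icinga2.conf
grep -rn 'check_command' /etc/icinga2/conf.d/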

Maybe icinga2 object list --type host --name hostname gives you more insight into where the config for that host originates from (run it on the master).
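For example (host and service names below are placeholders for your affected node); the output lists, for each attribute, the file and line it was set in, which should show whether the values come from zones.d or from somewhere else:

icinga2 object list --type Host --name 'loginXYZ.fqdn'
icinga2 object list --type Service --name 'loginXYZ.fqdn!load'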