Check_procs random critical

Hi,

in my master-satellite Icinga2 configuration I'm monitoring the crond process with the check_procs plugin.

On some CentOS 6 servers these checks flap between OK and CRITICAL roughly every 30 seconds.

I manually checked the crond process with watch: it is running, stable, and its PID does not change.

I enabled the debug log on the satellite, and I see many lines like these:

[root@satellite ~]# grep cron /var/log/icinga2/debug.log 
[2019-07-15 15:16:12 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250': PID 26610
[2019-07-15 15:16:12 +0200] notice/Process: PID 26610 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250') terminated with exit code 2
[2019-07-15 15:16:40 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250': PID 26805
[2019-07-15 15:16:40 +0200] notice/Process: PID 26805 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250') terminated with exit code 0
[2019-07-15 15:17:09 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250': PID 26917
[2019-07-15 15:17:09 +0200] notice/Process: PID 26917 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250') terminated with exit code 2
[2019-07-15 15:17:37 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250': PID 27110
[2019-07-15 15:17:37 +0200] notice/Process: PID 27110 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250') terminated with exit code 0
[2019-07-15 15:18:09 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250': PID 27244
[2019-07-15 15:18:09 +0200] notice/Process: PID 27244 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250') terminated with exit code 2
[2019-07-15 15:18:37 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250': PID 27433
[2019-07-15 15:18:37 +0200] notice/Process: PID 27433 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250') terminated with exit code 0
[2019-07-15 15:19:09 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250': PID 27569
[2019-07-15 15:19:09 +0200] notice/Process: PID 27569 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'cron' '-c' '1:' '-w' '250') terminated with exit code 2
[2019-07-15 15:19:37 +0200] notice/Process: Running command '/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250': PID 27820
[2019-07-15 15:19:37 +0200] notice/Process: PID 27820 ('/usr/lib64/nagios/plugins/check_procs' '-C' 'crond' '-c' '1:' '-w' '250') terminated with exit code 0
[...]

So the check_procs command seems to return exit code 2 (CRITICAL) roughly every 30 seconds.

I tried to reproduce the problem as the icinga user, but I got exit code 0 (OK) every single time when I ran the command /usr/lib64/nagios/plugins/check_procs -C crond -c 1: -w 250 on the satellite 1000 times in a row:

bash-4.1$ for i in $(seq 1 1000); do /usr/lib64/nagios/plugins/check_procs -C crond -c 1: -w 250; done
PROCS OK: 1 process with command name 'crond' | procs=1;250;1:;0;
PROCS OK: 1 process with command name 'crond' | procs=1;250;1:;0;
PROCS OK: 1 process with command name 'crond' | procs=1;250;1:;0;
PROCS OK: 1 process with command name 'crond' | procs=1;250;1:;0;
PROCS OK: 1 process with command name 'crond' | procs=1;250;1:;0;
[…]

Could you help me to understand what’s going on, please?
Thank you very much!

Hi,

welcome to the community!

The shared log snippet shows that the failing checks look for cron with a critical value of 1:, while the successful checks look for crond (note the trailing d). So it looks like one check can't find a process named cron, while the other finds the process called crond.

Do you have a redundant service configuration somewhere and a typo in one of them?
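Just to illustrate the pattern (the object and host names here are made up), a leftover local service definition with the typo would look roughly like this, running alongside the correct one applied from the master:

// Hypothetical example only: a stray service definition, e.g. in a local conf.d/ file,
// using the ITL "procs" CheckCommand but looking for "cron" instead of "crond".
object Service "cron-procs" {
  host_name = "my-centos6-host"          // illustrative host name
  check_command = "procs"
  vars.procs_command = "cron"            // typo: the daemon is called "crond" on CentOS 6
  vars.procs_critical = "1:"
  vars.procs_warning = "250"
}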


Hi,

maybe there is a local configuration on the satellite. Did you disable the conf.d directory inclusion during the node wizard on your satellite? Check the icinga2.conf file on your satellite and make sure the inclusion is commented out.
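For reference, this is roughly what the relevant part of /etc/icinga2/icinga2.conf looks like when the local configuration is disabled (the exact surrounding comments vary by version):

// /etc/icinga2/icinga2.conf on the satellite
// The satellite should only run checks it receives from the master via the
// cluster config sync, so the local conf.d inclusion stays commented out:
// include_recursive "conf.d"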

You can run icinga2 object list --type Service --name "http" on your satellite to have a look where the service is defined. (Put your service name where http is :slight_smile: )

Best regards
Michael

Your test finishes too fast; put in a sleep command to simulate a 1 s interval, for instance:

for i in $(seq 1 1000); do /usr/lib64/nagios/plugins/check_procs -C crond -c 1: -w 250; sleep 1; done

Maybe there is a script or a systemd unit that restarts or reloads crond quite often.

Cheers,
Michael


maybe there is a local configuration on the satellite

Thanks Michael, this was the problem!
I am managing configurations with a self-written Ansible role, and I mistakenly did not disable conf.d and the other custom include directories on the satellites!


Michael, so is it correct that the satellite still needs to have the command definitions (and, of course, the command files) even if they are applied from the master node?

If the command is not defined in the Icinga Template Library, the satellite needs to know about the command. But you don't need to do this on each satellite manually; you can use the cluster config sync and global zones for this. :slight_smile:
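Sketched roughly (zone and file names are just the common defaults, adjust to your setup): declare a global zone on the master and every satellite, and put the CheckCommand into the matching directory under zones.d on the master, so the config sync distributes it:

// zones.conf on the master and on each satellite:
object Zone "global-templates" {
  global = true
}

// On the master, e.g. in /etc/icinga2/zones.d/global-templates/commands.conf:
object CheckCommand "my_custom_check" {
  command = [ PluginDir + "/check_my_custom" ]   // hypothetical plugin, for illustration
}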

I am not sure what you mean by command files. If you mean apply rules and/or service definitions, it depends on your configuration mode, but in both modes the configuration is done on the master.

In a distributed environment you (normally) don't need the conf.d directory; it can lead to unexpected behavior, for example when a service is defined twice (on the master and locally on the satellite).
