Icinga2 produces a lot of sudo zombie processes

icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.1-1)
Kernel 3.10.0-1160.25.1.el7.x86_64
CentOS Linux release 7.9.2009
Sudoers I/O plugin version 1.8.23

On the Icinga client I see many similar errors, and they leave behind a lot of sudo zombie processes:

[2021-11-08 14:58:28 +0100] warning/Process: Killing process group 18816 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 66 seconds
[2021-11-08 14:58:28 +0100] warning/Process: Couldn’t kill the process group 18816 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 14:58:28 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 18816, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output:
[2021-11-08 15:00:26 +0100] warning/Process: Terminating process 18938 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 60 seconds
[2021-11-08 15:00:26 +0100] warning/Process: Couldn’t terminate the process 18938 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:00:29 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 18938, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error
[2021-11-08 15:05:01 +0100] warning/Process: Terminating process 19213 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 60 seconds
[2021-11-08 15:05:01 +0100] warning/Process: Couldn’t terminate the process 19213 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:05:07 +0100] warning/Process: Killing process group 19213 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 66 seconds
[2021-11-08 15:05:07 +0100] warning/Process: Couldn’t kill the process group 19213 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:05:07 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 19213, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output:
[2021-11-08 15:07:06 +0100] warning/Process: Terminating process 19395 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 60 seconds
[2021-11-08 15:07:06 +0100] warning/Process: Couldn’t terminate the process 19395 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:07:12 +0100] warning/Process: Killing process group 19395 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 66 seconds
[2021-11-08 15:07:12 +0100] warning/Process: Couldn’t kill the process group 19395 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:07:12 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 19395, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output:
[2021-11-08 15:09:11 +0100] warning/Process: Terminating process 19463 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 60 seconds
[2021-11-08 15:09:11 +0100] warning/Process: Couldn’t terminate the process 19463 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:09:18 +0100] warning/Process: Killing process group 19463 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) after timeout of 66 seconds
[2021-11-08 15:09:18 +0100] warning/Process: Couldn’t kill the process group 19463 (’/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’): [errno 1] Operation not permitted
[2021-11-08 15:09:18 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 19463, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output:

[root@my tmp]# ps ax | grep defunc
19213 ? ZNs 0:00 [sudo]
19395 ? ZNs 0:00 [sudo]
19463 ? ZNs 0:00 [sudo]
19771 ? ZNs 0:00 [sudo]
19890 ? ZNs 0:00 [sudo]
[root@my ~]# ps ax | grep defunc | wc -l
53
The script that is executed via sudo has all the privileges it needs and works like a charm if I run it from bash:

[root@my tmp]# sudo -u icinga /usr/bin/sudo /usr/lib64/nagios/plugins/site/privileged/check_bind.sh
Bind9 is running. 206 successfull requests, 0 referrals, 25 nxdomains since last check. | ‘success’=206 ‘referral’=0 ‘nxrrset’=69 ‘nxdomain’=25 ‘recursion’=0 ‘failure’=0 ‘duplicate’=0 ‘dropped’=0

From bash it works even when I run it every second in a while true loop, but when it runs within Icinga it constantly produces the same error, almost every minute. I have spent all day on this problem; has anyone experienced the same?

I found reports of sudo bugs such as "sudo hangs and leaves the executed program as “zombie” | /contrib/famzah", but that does not seem to be my case: I get no issues and no sudo zombie processes when I run sudo from bash directly. When invoked by Icinga, however, sudo always leaves a lot of zombies.

The script looks fine as well. It is a fairly old one (check_bind.sh - Nagios Exchange), but it works fine on other machines.

Can someone point out what else I can do to investigate whether this is a script bug or a sudo bug?
Sometimes I see errors from the tac command like these:
[2021-11-08 04:13:53 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 32657, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error: Broken pipe
[2021-11-08 04:26:21 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 1507, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error: Broken pipe
[2021-11-08 04:32:33 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 1897, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error
[2021-11-08 05:44:30 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 7810, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error: Broken pipe
[2021-11-08 05:47:39 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 8043, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error
[2021-11-08 07:29:11 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 16830, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error
[2021-11-08 08:33:58 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 22126, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error
[2021-11-08 09:10:05 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 25123, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error: Broken pipe
[2021-11-08 11:43:26 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 32083, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error
[2021-11-08 15:00:29 +0100] warning/PluginCheckTask: Check command for object ‘my.hidden.host!bind’ (PID: 18938, arguments: ‘/usr/bin/sudo’ ‘/usr/lib64/nagios/plugins/site/privileged/check_bind.sh’) terminated with exit code 128, output: tac: write error

But again, when I run the script manually via sudo as the icinga user, about 5 times a second for about 10 minutes, I do not see any slowness, bugs, zombies, or these tac errors.

Can you check your icinga2 log? It looks like it may be a similar issue:

Thanks! Yes, the logs look exactly the same.
Uploaded here: [2021-11-08 19:10:42 +0100] warning/Process: Terminating process 7958 ('/usr/bin - Pastebin.com

It seems there is still no solution, right? I think it is a bad idea to give Icinga sudo rights to kill processes. I would rather find out why this script constantly times out under Icinga, while it consistently works fine when run manually.

4 Z root     12316 21864  0  85   5 -     0 do_exi 20:00 ?        00:00:00         [sudo] <defunct>
4 Z root     12507 21864  0  85   5 -     0 do_exi 20:02 ?        00:00:00         [sudo] <defunct>
4 S root     12771 21864  0  85   5 - 59826 poll_s 20:05 ?        00:00:00         /usr/bin/sudo /usr/lib64/nagios/plugins/site/privileged/check_bind.sh
4 S root     12773 12771  0  85   5 - 28319 do_wai 20:05 ?        00:00:00           /bin/sh /usr/lib64/nagios/plugins/site/privileged/check_bind.sh
4 D root     12780 12773  1  85   5 - 27016 lock_p 20:05 ?        00:00:02             tac /var/named/data/named_stats.txt
4 S root     12946 21864  0  85   5 - 59826 poll_s 20:06 ?        00:00:00         /usr/bin/sudo /usr/lib64/nagios/plugins/site/privileged/check_bind.sh
4 S root     12948 12946  0  85   5 - 28319 do_wai 20:06 ?        00:00:00           /bin/sh /usr/lib64/nagios/plugins/site/privileged/check_bind.sh
4 R root     12955 12948  5  85   5 - 27016 -      20:06 ?        00:00:01             tac /var/named/data/named_stats.txt
4 S root     12889     1  0  80   0 - 87629 poll_s 20:06 ?        00:00:00   /usr/sbin/abrt-dbus -t133

I just checked ps -efH and noticed that Icinga runs two copies of the script at the same time, which leads to a conflict when both read and write the same file in /tmp via the tac pipeline. I still have no idea why Icinga starts two copies of the same script.
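Whatever the reason for the overlapping runs, the collision on a fixed file name in /tmp can be avoided by giving each run its own scratch file. A minimal sketch, assuming the script currently writes to a fixed path such as /tmp/named.stats.tmp; the function name new_scratch_file and the exit code choice are illustrative and not taken from check_bind.sh:

```shell
#!/bin/sh
# Hypothetical helper (not part of the original check_bind.sh):
# create a unique scratch file so that overlapping runs never share one.
new_scratch_file() {
    mktemp "${TMPDIR:-/tmp}/named.stats.XXXXXX"
}

# Create the file for this run; exit 3 (plugin state UNKNOWN) if that fails.
scratch=$(new_scratch_file) || exit 3

# Remove the scratch file when the script exits, whatever the outcome.
trap 'rm -f "$scratch"' EXIT
```

The rest of the script would then write to "$scratch" instead of the fixed name.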

There was no bug in sudo after all. The problem is that the tac command in the script takes too long (more than a minute) to read the named_stats.txt file. As a result, Icinga tries to kill the process after 60 seconds but cannot, because the icinga user is not permitted to signal a process that sudo runs as root.
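One way to keep the check from outliving Icinga's 60-second limit is to bound the slow command inside the script itself, where it already runs as root and can be signalled. A minimal sketch, assuming GNU coreutils timeout is installed; the function name run_bounded and the 50-second budget are illustrative and not from the original script:

```shell
#!/bin/sh
# Hypothetical helper: run a command under a time budget (in seconds).
# Returns 3 (plugin state UNKNOWN) if the budget expires, otherwise the
# command's own exit status.
run_bounded() {
    budget="$1"
    shift
    timeout "$budget" "$@"
    rc=$?
    # GNU timeout exits with status 124 when the time budget expires.
    if [ "$rc" -eq 124 ]; then
        return 3
    fi
    return "$rc"
}

# Example (budget chosen to stay below Icinga's 60 s check timeout):
# run_bounded 50 some_slow_command
```

With this, the check fails on its own with a defined state before Icinga has to send a signal it is not allowed to deliver.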

[root@dns2 data]# ls -alh named_stats.txt
-rw-r--r--. 1 named named 551M Nov  8 20:42 named_stats.txt

[root@dns2 data]# time tac /var/named/data/named_stats.txt &> /dev/null
real	2m27.493s
user	0m1.129s
sys	0m11.012s

In the script there is this kind of command:

tac /var/named/data/named_stats.txt | awk '/--- \([0-9]*\)/{p=1} p{print} /\+\+\+ \([0-9]*\)/{p=0;if (count++==1) exit}' > /tmp/named.stats.tmp

It looks like when the script is run by Icinga, it first reads the whole 500 MB file and loads it into memory, and only then applies the awk expression. That really surprised me, because when I run the script from the shell by hand, the file is read almost immediately. Any ideas why Icinga tries to load the whole file into memory first? Just curious.
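Independent of why the read is slow under Icinga, the pipeline only needs the most recent dumps, so it could be handed just the end of the file. A minimal sketch, assuming the last two dumps fit into the final N bytes; the function name last_two_dumps and the byte count in the example are illustrative, and the awk program is the one quoted above:

```shell
#!/bin/sh
# Hypothetical helper: print the last two statistics dumps, newest line first,
# reading only the final $2 bytes of file $1 instead of the whole file.
last_two_dumps() {
    tail -c "$2" "$1" | tac | awk '
        /--- \([0-9]*\)/ { p = 1 }    # end marker of a dump: start printing
        p { print }
        /\+\+\+ \([0-9]*\)/ { p = 0; if (count++ == 1) exit }  # stop after 2 dumps
    '
}

# Example: look only at the last 1 MiB of the statistics file.
# last_two_dumps /var/named/data/named_stats.txt 1048576 > "$scratch"
```

If the byte count is too small to hold two complete dumps, the output will be incomplete, so the value has to be chosen with the real dump size in mind. Rotating or truncating named_stats.txt would address the same problem at its source.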