Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'

Good morning everybody,

our Icinga2 stopped working completely this morning, with the error described above for all servers.

[2019-12-05 06:55:50 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:55:50 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:55:55 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:55:55 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:56:01 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:56:01 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:56:01 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:56:03 +0100] critical/checker: Exception occurred while checking 'some.server!disk': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:56:07 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'
[2019-12-05 06:56:15 +0100] critical/checker: Exception occurred while checking 'some.server': Error: Function call 'pipe2' failed with error code 24, 'Too many open files'

I can see the error for every monitored server check in the log on both masters. Does anybody have an idea what happened and how to prevent this error?

best regards,

Alicia

Hi,

this error originates from the point where Icinga forks a process to actually execute a check. The output of that plugin is read from a pipe into the main process. If the open file limit is reached, this error occurs. "Files" are not only file handles but also sockets. So if you hit the case where you have 4000 agents with open sockets, plus 1000 plugin checks running in parallel, and your ulimit is set to 4000, this will likely fail.
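To get a feeling for how close the daemon is to that limit, something like this (just a rough sketch reading /proc, nothing Icinga-specific) compares the open file descriptors of each icinga2 process with its soft limit:

# Sketch: count open fds per icinga2 process and compare with the soft limit.
for p in $(pidof icinga2); do
    used=$(ls /proc/$p/fd | wc -l)
    limit=$(awk '/Max open files/ {print $4}' /proc/$p/limits)
    echo "PID $p: $used open fds (soft limit: $limit)"
done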

You may raise the number of open files as described in this issue to see whether it helps mitigate the problem.
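The authoritative steps are in that issue; purely as an illustration of one common approach on systemd-based systems (path and value below are examples, not recommendations, and depending on the distribution Icinga2 may also ship its own rlimit settings), a drop-in override could look like this:

# Example only: raise the open-file limit via a systemd drop-in for icinga2.
mkdir -p /etc/systemd/system/icinga2.service.d
cat > /etc/systemd/system/icinga2.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF
systemctl daemon-reload
systemctl restart icinga2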

Cheers,
Michael

I forgot to check the number of open files before restarting the icinga2 service this morning. But I'm confused, because I only have 50 hosts (agents) and about 350 services in the monitoring at the moment, with two masters.

In my "test stage" I have about 200 hosts and 800 services and the error never occurred, with the same limits on both systems (prod and test).

So do you mean this can really be a problem with only 400 checks to run from the Icinga2 prod instance?

I wouldn't look at the numbers and compare them to other systems, but rather do some more analysis in this regard:

  • Add a script which watches the open files, and regularly dump that count into a separate log file with timestamps (as a cronjob outside of Icinga, because Icinga itself might not be able to execute it); a minimal sketch of such a script follows after this list.
  • Run a watch session in a screen terminal if you can reproduce this easily.
  • Extract the timestamps where the file limits are exceeded, and correlate this with other events at that time. Network problems, load, IO, etc. - anything which looks suspicious.
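For the watcher script mentioned above, a minimal sketch (script path, log file and cron interval are just examples) could be as simple as appending a timestamped fd count per process to a log file:

#!/bin/bash
# Example watcher (hypothetical path: /usr/local/bin/icinga2-fd-watch.sh),
# run from cron, e.g. every minute:
#   * * * * * /usr/local/bin/icinga2-fd-watch.sh
LOGFILE=/var/log/icinga2-open-files.log
for p in $(pidof icinga2); do
    echo "$(date '+%Y-%m-%d %H:%M:%S') PID $p: $(ls /proc/$p/fd | wc -l) open fds" >> "$LOGFILE"
done

If you can reproduce the problem easily, the same count can also be watched interactively with watch inside a screen session.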

Well, and the most obvious part: extract the current file limits applied to the daemon.

Cheers,
Michael

Thank you :slight_smile:

I will do this and think about the other points if I find a way to reproduce it, and I can say that there were no other issues this morning.

Can you give me a hint on how to extract this?

Thanks and best regards,
Alicia

That’s described in the linked issue in the Verify section.

for p in $(pidof icinga2); do echo -e "$p\n" && ps -ef | grep $p && echo && cat /proc/$p/limits | grep 'open files' && echo; done

for p in $(pidof icinga2); do echo -e "$p\n" && ps -ef | grep $p && echo && lsof -p $p && echo; done

Cheers,
Michael

Thank you so much!

I have to admit that it is a little bit complicated to understand…

the first command shows the limits, and the second one lists all open files for the icinga2 processes?

pidof icinga2 returns the PIDs of all running Icinga2 processes. With 2.11, that counts three. To automate this, the for loop is wrapped around it and uses $p as the loop variable.

ps -ef | grep $p then prints the matching line from the process listing, so you can see which command is being run.

After that, either the current limits are printed (first command), or the open files/sockets of the given process id are listed with lsof (second command).
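If the full lsof output is too noisy, counting its lines is often enough for a first impression (just a shorthand variant of the second one-liner):

for p in $(pidof icinga2); do echo "$p: $(lsof -p $p | wc -l) open files/sockets"; done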

These commands come from a long debugging analysis and are therefore just one-liners to save time in each iteration.

Cheers,
Michael