Icinga2.log filling up disk space with "Too many open files" on Icinga2 Version r2.13.7-1 cluster

During the past few months our Icinga2 master has had issues with the /var/log/icinga2/icinga2.log file filling up with
“Critical/ApiListener: Cannot accept new connection: Too many open files”

The open file descriptor limit was increased, but we hit the same issue again a few days later.
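
(For anyone hitting the same thing: on a systemd host such as CentOS 7 the usual way to raise that limit is a service drop-in along these lines; the drop-in file name and the 200000 value below are examples only, not our exact settings.)

# raise the per-process open file limit for the icinga2 service
mkdir -p /etc/systemd/system/icinga2.service.d
cat > /etc/systemd/system/icinga2.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=200000
EOF
systemctl daemon-reload
systemctl restart icinga2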

Our current workaround is to move the large log file (41 GB) to another disk area, zero out the icinga2 log file, and then restart icinga2.
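
For reference, the workaround boils down to commands along these lines (the archive path and the icinga:icinga ownership are assumptions for illustration, not our exact setup):

# move the oversized log onto a filesystem with free space
mv /var/log/icinga2/icinga2.log /archive/icinga2.log.$(date +%F)
# recreate an empty log file with the expected ownership
install -o icinga -g icinga -m 0644 /dev/null /var/log/icinga2/icinga2.log
systemctl restart icinga2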

The issue is still recurring; the most recent occurrence was yesterday, 19th May 2024.

Give as much information as you can, e.g.

  • Version used (icinga2 --version) … Version r2.13.7-1

  • Operating System and version … CentOS Linux release 7.9.2009 (Core)

  • Enabled features (icinga2 feature list)
    … icinga2 feature list
    Disabled features: command compatlog debuglog elasticsearch gelf graphite ido-mysql influxdb influxdb2 livestatus opentsdb opsgenie perfdata statusdata syslog
    Enabled features: api checker icingadb mainlog notification

  • Icinga Web 2 version and modules (System - About)
    … Icinga Web 2 runs on a separate server.
    The Icinga 2 version there is the same as above.

  • Config validation (icinga2 daemon -C)
    When run, this validates cleanly without any errors:
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 2 UserGroups.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 33 TimePeriods.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 45 Users.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1859 Services.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1287 ServiceGroups.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 7 ScheduledDowntimes.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 153 Zones.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 6 NotificationCommands.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 2476 Notifications.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 96 Hosts.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 33 HostGroups.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 24 Comments.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 5 EventCommands.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 7 Downtimes.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 111 Endpoints.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1 FileLogger.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 2 ApiUsers.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 335 CheckCommands.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1 ApiListener.
    [2024-05-20 10:34:51 +0100] information/ConfigItem: Instantiated 1 IcingaDB.
    [2024-05-20 10:34:51 +0100] information/ScriptGlobal: Dumping variables to file ‘/var/cache/icinga2/icinga2.vars’
    [2024-05-20 10:34:51 +0100] information/cli: Finished validating the configuration file(s).

  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes
    Only the master server has this issue at the moment.
    object Endpoint "monitor04.dion.aws" {
    }

    object Endpoint "monitor05.dion.aws" {
      host = "monitor05.dion.aws"
    }

    object Zone "master" {
      endpoints = [ "monitor04.dion.aws", "monitor05.dion.aws" ]
    }

    include "/opt/primary-ic2-01/ic2/zones.conf"

The clustered server monitor05 does not have the same issue.

Did you check with lsof | grep icinga2 what is using up the file descriptors?
Maybe something is opening connections and not closing them properly.
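
For example, either of these one-liners (assuming the default API port 5665) gives a quick count of the CLOSE_WAIT sockets held by icinga2:

# count CLOSE_WAIT TCP sockets belonging to the icinga2 process
lsof -nP -a -c icinga2 -i TCP | grep -c CLOSE_WAIT
# or the same via ss, filtered on the API port
ss -tnp state close-wait '( sport = :5665 )' | grep -c icinga2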

Hi Dominik,
I have had a look at the lsof output and can see lots of icinga2 sockets in "CLOSE_WAIT" state.

E.g.
icinga2 8293 9231 icinga *261u IPv6 162620594 0t0 TCP monitor08.dio.aws:5665->ip-11-46-21-37.eu-west-2.compute.internal:42097 (CLOSE_WAIT)

I grepped for CLOSE_WAIT and piped the output to wc -l, which gave a count of 215232.

Is there a parameter setting that tells icinga2 to reduce the close wait time?

Strange, I don't have a single (CLOSE_WAIT)!

I'm afraid this only handles the problem after the fact. Instead, I'd monitor the number of CLOSE_WAIT connections and alert before the problem even occurs.
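
A rough sketch of such a check (not an official plugin; the port filter and the thresholds below are placeholders) could look like this and be wired up as a CheckCommand:

#!/bin/bash
# check_close_wait.sh - count CLOSE_WAIT sockets on the Icinga API port
# and exit with standard plugin return codes (0/1/2).
WARN=${1:-10000}
CRIT=${2:-50000}
COUNT=$(ss -tn state close-wait '( sport = :5665 )' | tail -n +2 | wc -l)
if [ "$COUNT" -ge "$CRIT" ]; then
  echo "CRITICAL - $COUNT CLOSE_WAIT sockets on :5665 | close_wait=$COUNT"
  exit 2
elif [ "$COUNT" -ge "$WARN" ]; then
  echo "WARNING - $COUNT CLOSE_WAIT sockets on :5665 | close_wait=$COUNT"
  exit 1
fi
echo "OK - $COUNT CLOSE_WAIT sockets on :5665 | close_wait=$COUNT"
exit 0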

Hi Dominik, Grandmaster,
I am monitoring the number of CLOSE_WAITs and also discussing this internally, as it may be related to another issue we are seeing where IPv6 is in use; two IPs in particular keep showing up with the CLOSE_WAITs. I saw over 215 thousand yesterday morning and then just 27 thousand in the afternoon, so they are closing eventually. This morning it was high again at 153 thousand. I will check this afternoon to see whether the same reduction happens, and I will also chase up what is happening with the IPv6 issue.
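
In case it helps anyone following along, a one-liner along these lines groups the CLOSE_WAIT sockets by remote peer, which is how the two IPs stand out (default API port 5665 assumed):

# group CLOSE_WAIT sockets on the API port by remote peer address
ss -tn state close-wait '( sport = :5665 )' | awk 'NR>1 { sub(/:[0-9]+$/, "", $4); print $4 }' | sort | uniq -c | sort -rn | head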
