Hi,
To start off, let me set the scene a bit.
We run a number of site-local Icinga2 stacks across our enterprise. Each consists of two Icinga2 masters operating together and a mixture of Linux / Windows endpoint clients. The majority of our endpoint / zone configuration is maintained automatically by a custom script I wrote, which runs every 30 minutes against the Icinga2 masters' configuration. This provides auto-discovery / auto-removal of monitored objects, so that monitoring configuration upkeep is far less manual. As part of the script's operation, the Icinga2 service on every master in the associated stack is reloaded.
OK, so there is nothing wrong with the script's functions per se (at least that I know of). The issue is that very infrequently (typically once every 2-3 months) one of these routine Icinga2 service reloads (which happen every 30 minutes on the script's schedule) causes a problem on the second master in the stack.
When this happens, as soon as the affected master attempts to reconnect to its endpoints they refuse the connection, and the following error appears in the Icinga2 log at that time:
[2021-09-03 08:38:15 +0200] critical/ApiListener: Client TLS handshake failed (from [10.121.4.7]:56022): excessive message size
The same error is seen when the affected master attempts to reconnect to its partner master node, so it is refused API connectivity there as well. When this happens, all endpoint checks on the affected master start entering an unknown state, reporting that the master is disconnected from the endpoints. Obviously this causes a slew of notifications from the affected master over time.
When we see this, we restart the Icinga2 service on the affected master. At that point it reconnects to all endpoints successfully, the excessive message size error disappears from the logs, and all affected checks start returning to an OK state (causing yet more notification spam).
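To make these rare occurrences easier to spot and correlate with reload times, the log can be scanned for this specific error. The following is only a minimal sketch in Python, not part of our existing script: the function names are hypothetical, and the pattern is derived solely from the single log line quoted above, so other variants of the message would not match.

```python
import re
from collections import Counter

# Pattern built from the one log line quoted in this post, e.g.:
# [2021-09-03 08:38:15 +0200] critical/ApiListener: Client TLS handshake
# failed (from [10.121.4.7]:56022): excessive message size
HANDSHAKE_ERROR = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] critical/ApiListener: "
    r"Client TLS handshake failed "
    r"\(from \[(?P<peer>[^\]]+)\]:(?P<port>\d+)\): "
    r"excessive message size"
)


def parse_handshake_error(line):
    """Return timestamp, peer and port for a matching line, else None."""
    match = HANDSHAKE_ERROR.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": match.group("timestamp"),
        "peer": match.group("peer"),
        "port": int(match.group("port")),
    }


def count_errors_by_peer(lines):
    """Count 'excessive message size' handshake failures per peer address."""
    counts = Counter()
    for line in lines:
        hit = parse_handshake_error(line)
        if hit is not None:
            counts[hit["peer"]] += 1
    return counts
```

Feeding it the lines of the main log (assumed here to be icinga2.log in the log directory listed further down) gives a per-peer count, which should show whether every refused connection originates from the affected master.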
This is quite confusing, as there have been no significant configuration changes on the master systems for a while now and the issue still presents infrequently. Additionally, given that 99.9% of Icinga2 reloads (which happen every 30 minutes) produce no issue, I am further confused as to why every now and then TLS handshakes from an affected master are suddenly reported as oversize.
I have been unable to trace the root cause, and upgrades to the Icinga2 components have not rid us of the issue.
Generally the stacks operate well with our configuration, aside from these infrequent master blips.
Therefore I was wondering if anyone has experienced this before and knows where it could stem from.
I was considering switching the Icinga2-generated certificate key size down from 4096 bits to 2048 bits to see if that alleviates this, but frankly I cannot see any option to specify the desired key size on the “icinga2 pki new-cert” CLI command. I am not sure that would affect the issue anyway.
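For reference, the key size of an existing node certificate can be confirmed with openssl before and after any change. This is a rough Python sketch that wraps the standard "openssl x509 -noout -text" call. The function names are hypothetical, and the certificate location (a certs directory under the data directory listed below) is an assumption about a default install, so adjust the path as needed.

```python
import re
import subprocess

# "openssl x509 -text" prints a line such as "Public-Key: (4096 bit)".
# Some older releases print "RSA Public Key: (2048 bit)", so both are accepted.
KEY_SIZE = re.compile(r"Public[- ]Key:\s*\((\d+) bit\)")


def key_bits_from_text(openssl_text):
    """Extract the public key size in bits from 'openssl x509 -text' output."""
    match = KEY_SIZE.search(openssl_text)
    return int(match.group(1)) if match else None


def certificate_key_bits(cert_path):
    """Run openssl against a certificate file and return its key size."""
    result = subprocess.run(
        ["openssl", "x509", "-in", cert_path, "-noout", "-text"],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return key_bits_from_text(result.stdout)


# Example (path is an assumption, replace <node-name> with the node's name):
# certificate_key_bits("/var/lib/icinga2/certs/<node-name>.crt")
```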
Here is some information on a standard dual-master stack setup that we run:
<==== Icinga2 Versioning ====>
icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.4-1)
System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 3.10.0-1160.21.1.el7.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 4.8.5
Build host: runner-hh8q3bz2-project-322-concurrent-0
OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017
Application information:
General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2
Enabled features: api checker graphite ido-mysql mainlog notification
<==== Icingaweb2 Versioning ====>
Icingaweb2: v2.8.2
Modules:
graphite: v1.1.0
ipl: v0.5.0
monitoring: v2.8.2
reactbundle: v0.7.0
<==== Icinga2 Configuration Validation ====>
[2021-09-03 13:57:45 +0200] information/cli: Icinga application loader (version: 2.12.4-1)
[2021-09-03 13:57:45 +0200] information/cli: Loading configuration file(s).
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 9 Users.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 9 UserGroups.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 231 Zones.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 23 ServiceGroups.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 4878 Services.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 231 Hosts.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 2 NotificationCommands.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 5109 Notifications.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 10 HostGroups.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 179 Endpoints.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 22 Downtimes.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 4878 Dependencies.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 Comment.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 267 CheckCommands.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2021-09-03 13:57:47 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-09-03 13:57:47 +0200] information/cli: Finished validating the configuration file(s).
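Since the script regenerates configuration before each reload, one way to check whether a failing reload coincided with a change in object counts is to compare the validation output (from "icinga2 daemon -C") captured before and after. The following is a small Python sketch under that assumption. The function names are hypothetical, and it only understands the "Instantiated N Type." lines shown above.

```python
import re

# Matches lines like:
# [..] information/ConfigItem: Instantiated 231 Zones.
INSTANTIATED = re.compile(
    r"information/ConfigItem: Instantiated (\d+) (\w+)\."
)


def object_counts(validation_output):
    """Map object type name to instantiated count from validation output."""
    counts = {}
    for line in validation_output.splitlines():
        match = INSTANTIATED.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def diff_counts(before, after):
    """Return {type: (old, new)} for every type whose count changed."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name, 0), after.get(name, 0)
        if old != new:
            changed[name] = (old, new)
    return changed
```

Logging the result of diff_counts at each 30-minute run would show whether the problem reloads follow an endpoint or zone being added or removed, or happen with an unchanged configuration.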
Thanks,
Christopher Murchison