TLS Excessive Message Size Seen Infrequently On Icinga2 Masters After Configuration Reload

Hi,

To start off let me set the scene a bit.

Basically we run a number of site-local Icinga2 stacks across our enterprise. Each consists of 2 Icinga2 masters operating together and a mixture of Linux/Windows endpoint clients. The majority of our endpoint/zone configuration is automatically maintained by a custom script I wrote, which runs every 30 minutes against the Icinga2 masters' configuration. This is done to facilitate auto discovery/auto removal of monitored objects and make monitoring configuration upkeep a lot less manual. As part of this script's operations, the Icinga2 services on all masters in the associated stack have their configuration reloaded.
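For clarity, the reload step at the end of each script run is nothing special; it boils down to the standard validate-then-reload sequence, roughly like the sketch below (this assumes a stock systemd-based install and is not a copy of the actual script):

# validate the freshly written endpoint/zone configuration first
icinga2 daemon -C || exit 1
# only reload the running daemon if validation passed, so the new objects become active
systemctl reload icinga2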

OK, so there is nothing wrong with the script's functions per se (at least that I know of). The issue is that very infrequently (typically once every 2-3 months) one of these routine Icinga2 service reloads (which happen every 30 minutes on the script's schedule) causes an issue on the second master in the stack.

It seems that when this manifests, as soon as the affected master attempts to reconnect to its endpoints they refuse the connection, and the following error is seen in the Icinga2 log at the time:

[2021-09-03 08:38:15 +0200] critical/ApiListener: Client TLS handshake failed (from [10.121.4.7]:56022): excessive message size

The same is seen when the affected master attempts to reconnect to its partner master node, so it is refused API connectivity there as well. When this happens, all endpoint checks on the affected master start entering an unknown state reporting that the master is disconnected from the endpoints. Obviously this causes a slew of notifications from the affected master over time.

When this is seen, the Icinga2 service is restarted on the affected master. At that point it reconnects to all endpoints successfully, the excessive message size error goes away in the logs, and all affected checks return to an OK state (causing yet more notification spam).

This is quite confusing, as there have been no significant configuration changes on the master systems in a while and the issue still presents infrequently. Additionally, given that 99.9% of Icinga2 reloads (which happen every 30 minutes) produce no issue, I am further confused as to why every now and then TLS handshakes on an affected master suddenly get reported as oversized.

I have been unable to trace the root cause of this entirely and upgrades to the Icinga2 components have not rid us of the issue.

Generally the stacks operate well with our configuration, aside from these infrequent master blips.

Therefore I was wondering if anyone has experienced this before and knows where it could potentially stem from.

I was wondering about switching the Icinga2-generated certificate key strength down from 4096 bits to 2048 bits to see if that alleviates this, but frankly I cannot see any option to nominate the desired bit strength on the “icinga2 pki new-cert” CLI command. I am not sure that would impact the issue anyway.
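In case it is useful, the bit strength of the certificates currently deployed can be checked directly with OpenSSL; this is a sketch assuming the default /var/lib/icinga2/certs layout with the certificate named after the node's FQDN:

# print the key strength of the local node certificate (look for "Public-Key: (4096 bit)")
openssl x509 -in /var/lib/icinga2/certs/$(hostname -f).crt -noout -text | grep 'Public-Key'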

Here is some information on a standard dual master stack setup we run:

<==== Icinga2 Versioning ====>

icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.4-1)

System information:
Platform: CentOS Linux
Platform version: 7 (Core)
Kernel: Linux
Kernel version: 3.10.0-1160.21.1.el7.x86_64
Architecture: x86_64

Build information:
Compiler: GNU 4.8.5
Build host: runner-hh8q3bz2-project-322-concurrent-0
OpenSSL version: OpenSSL 1.0.2k-fips 26 Jan 2017

Application information:

General paths:
Config directory: /etc/icinga2
Data directory: /var/lib/icinga2
Log directory: /var/log/icinga2
Cache directory: /var/cache/icinga2
Spool directory: /var/spool/icinga2
Run directory: /run/icinga2

Enabled features: api checker graphite ido-mysql mainlog notification

<==== Icingaweb2 Versioning ====>

Icingaweb2: v2.8.2

Modules:

graphite: v1.1.0
ipl: v0.5.0
monitoring: v2.8.2
reactbundle: v0.7.0

<==== Icinga2 Configuration Validation ====>

[2021-09-03 13:57:45 +0200] information/cli: Icinga application loader (version: 2.12.4-1)
[2021-09-03 13:57:45 +0200] information/cli: Loading configuration file(s).
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 9 Users.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 9 UserGroups.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 3 TimePeriods.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 231 Zones.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 23 ServiceGroups.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 GraphiteWriter.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 4878 Services.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 231 Hosts.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 2 NotificationCommands.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 5109 Notifications.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 10 HostGroups.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 179 Endpoints.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 22 Downtimes.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 4878 Dependencies.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 Comment.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 267 CheckCommands.
[2021-09-03 13:57:47 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2021-09-03 13:57:47 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2021-09-03 13:57:47 +0200] information/cli: Finished validating the configuration file(s).

Thanks,

Christopher Murchison

Hello @murchisonc!

Have you tried to reload less often?

Best,
AK

Not sure that this would help; we reload every 20 minutes with a Director job (and this potentially affects thousands of services).

That said, a config reload for us takes 2 or 3 minutes sometimes – could it be something from a long config reload?

We also have some automation with monit to check the state of Icinga2 and ensure that it’s running and the API is responding though.

To be honest, no. The reason it is 30 minutes specifically is that we have a number of systems running ActiveMQ, and we need to rescan for new queues quite frequently on some of the busier brokers.

Frankly, though, this issue only pops up in about 1 out of every few thousand reloads, which is a bit confusing.

Really, I am trying to get to the root of the issue, as it does not make sense to me at the moment given the TLS handshake max message size error (which goes away when the icinga2 service is restarted). I thought this might be a leak occurring over time, so a few months back I put some logic in place to restart the master-level icinga2 services once a week in case that was the issue, but one stack's master produced this issue today, so that clearly did not work.

I could put some logic in to detect max message size errors in icinga2.log after service reloads and issue a restart on that condition, but once again that is a workaround, and I would really like to understand how this manifests so I can stamp it out for the future.
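If I do end up going that route, it would be something along these lines, run shortly after each scheduled reload (a rough sketch only; the log path matches our setup, but the sleep and tail window are guesses that would need tuning):

# give the reload a moment to settle, then look for the handshake error in the recent log
sleep 60
if tail -n 200 /var/log/icinga2/icinga2.log | grep -q 'excessive message size'; then
    # a full restart has always cleared the condition for us, so fall back to that
    systemctl restart icinga2
fi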

Also, I want to point out that our reload times on the masters are really quite short; they only take a couple of seconds, since we do not have that many objects in our monitoring configuration per stack. This is why I saw no issue with performing this function every 30 minutes. Frankly, these stacks have been in operation for a year and a half now (the issue has been present since they were built, but since it only occurs once every few months it has not been chased too hard in the past), and these reloads have been in place on the same schedule throughout with no general issues witnessed, barring this random one every now and then.

To me this sounds unimaginably hard to reproduce, so I'd try to reload only if something changed, to reduce the risk of this issue. I.e. reload at 5 am and (just for the sake of this issue) have a scheduled downtime around 5 am for… everything?


To piggyback off this, our Director runs every 20 minutes, but only if there's a change.

You could get your scripts to leverage the same Icinga2 API that Director uses, keep them on the same schedule, and only make a new package when there are changes.

https://icinga.com/docs/icinga-2/latest/doc/12-icinga2-api/#configuration-management
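Roughly, the flow with the config packages API is: create a package once, then upload a new stage only when your generated config actually differs from what is deployed. A minimal sketch (the API user, package name and file content here are placeholders, not anything from your setup):

# create the package once up front
curl -k -s -u apiuser:apipass -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/config/packages/autodiscovery'

# upload a new stage only when the generated config differs from the last deployed one
curl -k -s -u apiuser:apipass -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/config/stages/autodiscovery' \
    -d '{ "files": { "zones.d/my-satellite/hosts.conf": "object Host \"example-host\" { check_command = \"hostalive\" }" } }'

If I remember the docs correctly, uploading a stage triggers a validation run and, on success, activates the new stage with a reload, so the daemon only reloads when something was actually pushed.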


I could add some logic to only trigger master reloads upon configuration changes, yes. I do fear that if I do that, this issue will still manifest, just with longer gaps in between. As I say, most of the time there is no issue with performing frequent configuration reloads with our current configuration. Ideally I am trying to get to the root cause of the issue (admittedly with not much luck at the moment) instead of working around the problem further.

At this point I want to add that, even though I was only seeing TLS handshakes of about 3500 bytes on average across the masters in our dual-master stacks (which is quite far off anything I would consider excessive, to be honest), I have decided (since I really do not understand this issue) to bypass the Icinga2 pki CLI operations and generate endpoint certificates against the master CA via OpenSSL directly. In doing this I then have the option to reduce the generated certificate keys from 4096-bit down to 2048-bit strength. Our automation has already deployed all of the newly replaced certificates to the necessary endpoints once they became available.
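For completeness, the replacement certificates are generated roughly along these lines, signing against the existing Icinga2 CA on the master (the CA path below is the default location; the CN is a placeholder and obviously has to match each endpoint's NodeName):

# 2048-bit key and CSR for the endpoint; the CN must match the endpoint's NodeName
openssl genrsa -out endpoint.key 2048
openssl req -new -key endpoint.key -out endpoint.csr -subj "/CN=endpoint.example.com"
# sign the CSR with the Icinga2 CA that already lives on the master
openssl x509 -req -in endpoint.csr \
    -CA /var/lib/icinga2/ca/ca.crt -CAkey /var/lib/icinga2/ca/ca.key \
    -CAcreateserial -out endpoint.crt -days 3650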

It looks like master handshakes are now about 2800 bytes on average. Not much of an improvement for a single handshake, but it likely reduces resource utilization for handshakes overall quite a bit. Having said that, I am still doubtful this will correct the issue we are seeing very infrequently, but I figured it was worth a try as frankly I was a bit out of ideas on this one.
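(For reference, handshake sizes like these are easy to sanity-check with openssl s_client against the API port; the host name below is a placeholder.)

# the "SSL handshake has read ... bytes" summary line shows the size of the server's handshake
openssl s_client -connect master1.example.com:5665 < /dev/null 2>/dev/null | grep 'handshake has read'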

I will update here if the issue returns or if we see it has gone away in the coming months. I suspect this will probably not alleviate it, though.

The same issue was just reported on GitHub: Communication was suddenly broken: Client TLS handshake failed [..] excessive message size · Issue #9141 · Icinga/icinga2 · GitHub

In case you’re seeing the issue again, can you please capture the network traffic (for example using tcpdump) so that we can have a look at what’s in these handshakes causing them to become too large?
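Something along the lines of the following, started on the affected master while the errors are occurring, would already help (interface and output path are only examples):

# capture full packets on the Icinga2 API port so the TLS handshake records are visible
tcpdump -i any -s 0 -w /tmp/icinga2-api.pcap tcp port 5665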

Yes, we are still experiencing this issue. Nothing I have attempted seems to mitigate it, sadly, so I am still looking at it, but the way it presents makes it quite hard to track down the root cause.

The last failure fitting this description actually occurred against one of our site master stacks last Thursday, but I did not gather tcpdump output from the affected node at the time. The next time one of our master nodes is affected by this I will gather tcpdump information on the affected node and present that as part of the analysis. Given that this issue tends to happen once every 2-3 months it might be a while before we see it again, but I will bear in mind taking a tcpdump when it does occur.