I monitor some websites with the check_http plugin. This works quite well so far. However, I get a critical alarm at irregular intervals: CRITICAL - Socket timeout after 10 seconds.
I have looked at the access logs on the web server and cannot find any entries for the time when the alarm was triggered.
Therefore, I assume the problem is already on the monitoring server, so the request is never actually sent to the website.
How could I debug this?
Icinga v2.12.5-1 is running on an Ubuntu 20.04 server. The configuration was done with the Icinga Director.
I have 1 Icinga master which performs the monitoring checks for systems where it’s not possible to install the Icinga Agent, like websites, printers, switches etc… On the servers (16) the Icinga Agent is running.
The http service check is running on the Icinga master.
I have a similar issue: we get random socket timeouts when using the check_http tool. I have removed check_http completely from our Nagios installation and am running it in a bash loop, checking once every 15 seconds. It usually hits a socket timeout once or twice a day. I cannot find a pattern to this yet.
I have noticed that this issue occurs both when I monitor our own infrastructure and when I monitor google.com, which suggests it’s a networking issue. I have five checks running in rapid succession, like so: curl, check_http, curl, check_http, curl. When the socket timeout occurs, the curl checks all succeed, but the check_http commands fail.
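For anyone who wants to reproduce this, here is a rough sketch of that kind of loop. It is not my exact script. The plugin path (the usual location for the Ubuntu monitoring-plugins package, as far as I know), the target host, the log file, and the function names `probe` and `probe_round` are all assumptions for illustration. It logs a timestamp, the tool name, the exit code, and the elapsed seconds for each call, so a check_http timeout can be lined up against the curl calls on either side of it.

```shell
#!/usr/bin/env bash
# Sketch only: paths, host and log location are assumptions, adjust as needed.
CHECK_HTTP="${CHECK_HTTP:-/usr/lib/nagios/plugins/check_http}"  # assumed plugin path
TARGET="${TARGET:-example.com}"                                 # placeholder host
LOG="${LOG:-/tmp/http_probe.log}"                               # placeholder log file

# Run one command and append "timestamp label rc=<exit code> secs=<elapsed>" to $LOG.
probe() {
    local label="$1"; shift
    local start end rc
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    printf '%s %s rc=%d secs=%d\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$label" "$rc" "$((end - start))" >> "$LOG"
    return "$rc"
}

# One round in the order described above: curl, check_http, curl, check_http, curl.
probe_round() {
    probe curl       curl -sS -o /dev/null --max-time 10 "https://$TARGET/"
    probe check_http "$CHECK_HTTP" -H "$TARGET" -S -t 10
    probe curl       curl -sS -o /dev/null --max-time 10 "https://$TARGET/"
    probe check_http "$CHECK_HTTP" -H "$TARGET" -S -t 10
    probe curl       curl -sS -o /dev/null --max-time 10 "https://$TARGET/"
}

# To run it every 15 seconds:
#   while true; do probe_round; sleep 15; done
```

Afterwards, something like `grep -v 'rc=0' /tmp/http_probe.log` shows only the failed calls together with their neighbours' timestamps.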
This situation resolves itself pretty quickly. It doesn’t seem to coincide with any other events on the machine, and another machine in our office also exhibits the same issue, but not with the exact same regularity. We have recently changed out our gateway router to the internet, which I thought was the cause of this, but Nagios logs showed this has been happening for some time, just under the threshold for alerting so we didn’t notice it.
Running the exact same check at another site (from my home) works just fine.
I’m slowly getting to the point where I will reach out to our network provider, but at this time I also strongly suspect that the check_http tool itself contributes to this situation, as ridiculous as that sounds.
Did you come to any conclusion with your investigation?