Check_http random socket timeout

HExSM · August 4, 2021, 10:14am

I monitor some websites with the check_http plugin. This works quite well so far. However, I get a critical alarm at irregular intervals: CRITICAL - Socket timeout after 10 seconds.

I have looked at the access logs on the web server and cannot find any entries for the time when the alarm was triggered.

Therefore, I assume that there is already a problem on the monitoring server, which means that no request can be sent to the website.

How could I debug this?

Icinga is running on an Ubuntu 20.04 server in version v2.12.5-1. The configuration was done with the Icinga Director.

The service configuration looks like:

template Service "_default" {
    max_check_attempts = "3"
    check_interval = 1m
    retry_interval = 30s
    check_timeout = 30s
    enable_notifications = true
    enable_active_checks = true
    enable_passive_checks = true
    enable_event_handler = false
    enable_flapping = false
    enable_perfdata = true
}

template Service "http" {
    import "_default"

    check_command = "http"
    vars.http_critical_time = "5"
    vars.http_ignore_body = true
    vars.http_onredirect = "follow"
    vars.http_ssl = true
    vars.http_vhost = "$host.address$"
    vars.http_warn_time = "2"
}

Thank you in advance!

rsx · August 4, 2021, 10:29am

You can enable the debug log to get more details. You can also run the plugin manually and add --verbose.
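A minimal sketch of how that could look on a systemd-based host such as Ubuntu 20.04, assuming root privileges and the default debug log location /var/log/icinga2/debug.log. The function names are made up for illustration. The filter relies on the "terminated with exit code N" lines that Icinga writes for each plugin run.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, paths assume a default install.

# Turn on the debuglog feature; Icinga 2 needs a restart to pick it up.
enable_debuglog() {
    icinga2 feature enable debuglog
    systemctl restart icinga2
}

# Print the debug.log lines where the given plugin ended with a non-zero
# exit code (1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
failed_plugin_runs() {
    local plugin="$1"
    local logfile="${2:-/var/log/icinga2/debug.log}"
    grep -F -- "$plugin" "$logfile" \
        | grep -F 'terminated with exit code' \
        | grep -v 'exit code 0$'
}

# Example (run as root):
#   enable_debuglog
#   failed_plugin_runs check_http
```

The debug log grows quickly, so it is probably worth running `icinga2 feature disable debuglog` and restarting again once the timeout has been captured.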

Pooh · August 4, 2021, 10:38am

It would be useful to know a bit more about your Icinga network.

Do you have:

a) just a single Icinga server performing all monitoring checks?

b) a Master with one or more Agents?

c) a Master with one or more Satellites, and one or more Agents?

Finally, out of the above descriptions, which Icinga server is performing the
http service check on your web server?

Thanks,

Antony.

HExSM · August 4, 2021, 11:15am

I have one Icinga master, which performs the monitoring checks for systems where it’s not possible to install the Icinga Agent, such as websites, printers and switches. The Icinga Agent is running on the servers (16).

The http service check is running on the Icinga master.

Best regards
Stefan

HExSM · August 4, 2021, 11:16am

Thank you! I will try the debug log functionality.

Best regards
Stefan

HExSM · August 4, 2021, 1:17pm

Here is an excerpt from the debug.log:

[2021-08-04 14:41:28 +0200] notice/Process: PID 584506 ('/usr/lib/nagios/plugins/check_http' '--no-body' '-H' 'www.mydomain.tld' '-I' 'www.mydomain.tld' '-S' '-c' '5' '-f' 'follow' '-w' '2') terminated with exit code 2
[2021-08-04 14:41:28 +0200] notice/Dependency: Dependency 'mydomain.tld!internet-connection' passed: Parent host '_internet-connection' matches state filter.
[2021-08-04 14:41:28 +0200] notice/Dependency: Dependency 'mydomain.tld!http!host' passed: Parent host 'mydomain.tld' matches state filter.
[2021-08-04 14:41:28 +0200] notice/Dependency: Dependency 'mydomain.tld!internet-connection' passed: Parent host '_internet-connection' matches state filter.
[2021-08-04 14:41:28 +0200] notice/Dependency: Dependency 'mydomain.tld!http!host' passed: Parent host 'mydomain.tld' matches state filter.
[2021-08-04 14:41:28 +0200] debug/Checkable: Update checkable 'mydomain.tld!http' with check interval '60' from last check time at 2021-08-04 14:41:28 +0200 (1.62808e+09) to next check time at 2021-08-04 14:41:56 +0200 (1.62808e+09).
[2021-08-04 14:41:28 +0200] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2021-08-04 14:41:28 +0200] notice/Checkable: State Change: Checkable 'mydomain.tld!http' soft state change from OK to CRITICAL detected.
[2021-08-04 14:41:28 +0200] notice/ApiListener: Relaying 'event::CheckResult' message

Unfortunately, that doesn’t help me either.

rsx · August 4, 2021, 1:55pm

What is the result of this command?

/usr/lib/nagios/plugins/check_http --no-body -H www.mydomain.tld -I www.mydomain.tld -S -c 5 -f follow -w 2 --verbose

I’m not sure whether -H and -I work together.

gurubobnz · September 23, 2023, 9:58am

I have a similar issue: we get random socket timeouts when using the check_http tool. I have taken the check_http tool out of our Nagios installation completely and am running it in a bash loop, checking once every 15 seconds. It usually hits a socket timeout once or twice a day. I cannot find a pattern to this yet.

I have noticed that this issue occurs both when I monitor our own infrastructure and when I monitor google.com, which suggests it’s a networking issue. I have five checks running in rapid succession, like so: curl, check_http, curl, check_http, curl. When the socket timeout occurs, the curl checks all work fine, but the check_http commands do not.
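For anyone who wants to reproduce this kind of comparison, a loop along these lines might do. It is a sketch, not the exact script used here: the function names, the plugin path, the 10-second limits and the log format are all assumptions.

```shell
#!/usr/bin/env bash
# Sketch only: names, plugin path and timeouts are assumptions.

# Run one command quietly, then print a timestamped line with a label,
# the command's exit code and its duration in whole seconds.
run_probe() {
    local label="$1"; shift
    local start end rc
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    printf '%s label=%s rc=%d secs=%d\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$label" "$rc" "$((end - start))"
    return "$rc"
}

# Alternate curl and check_http in rapid succession, sleep, and repeat forever.
probe_loop() {
    local host="$1"
    local pause="${2:-15}"
    local tool
    while true; do
        for tool in curl check_http curl check_http curl; do
            if [ "$tool" = curl ]; then
                run_probe curl curl --silent --max-time 10 "https://$host/"
            else
                run_probe check_http \
                    /usr/lib/nagios/plugins/check_http -H "$host" -S -t 10
            fi
        done
        sleep "$pause"
    done
}

# Example:
#   probe_loop www.mydomain.tld 15 | tee -a probe.log
```

Lines with `label=check_http` and a non-zero `rc` next to `label=curl rc=0` lines from the same second would show the pattern described above.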

This situation resolves itself pretty quickly. It doesn’t seem to coincide with any other events on the machine, and another machine in our office also exhibits the same issue, but not with the exact same regularity. We have recently changed out our gateway router to the internet, which I thought was the cause of this, but Nagios logs showed this has been happening for some time, just under the threshold for alerting so we didn’t notice it.

Running the exact same check at another site (from my home) works just fine.

I’m slowly getting to the point where I will reach out to our network provider, but at this time I am highly suspicious of the check_http tool itself as being a contributor to this situation also, as ridiculous as that sounds.

Did you come to any conclusion with your investigation?