Notification for NRPE timeout

Basically, I have all my NRPE checks set up with the following:

template Service "nrpe_wfm" {
        check_command = "nrpe_v3"
        max_check_attempts = "3"
        check_interval = 10m
        retry_interval = 1m
        check_timeout = 1m
        vars.nrpe_timeout = 60
        vars.nrpe_timeout_unknown = true
        vars.ssl_version = "TLSv1.2"
}
apply Service "WFM docker stats" {
        import "nrpe_wfm"
        vars.docker_cpu_war = host.vars.docker_cpu_war
        vars.docker_cpu_crit = host.vars.docker_cpu_crit
        vars.docker_mem_war = host.vars.docker_mem_war
        vars.docker_mem_crit = host.vars.docker_mem_crit
        vars.nrpe_command = "verify_docker_stats"
        vars.nrpe_arguments = [vars.docker_cpu_war, vars.docker_cpu_crit, vars.docker_mem_war, vars.docker_mem_crit]
        vars.pushover = true
        assign where "tds_wfm_health" in host.groups
        groups = ["tds_service", "tds_wfm_health", "tds_docker_stats"]
}

My mail notifications (the default ones from the installation) take this into account and work fine.

However, my new notification scripts (such as Pushover) do not respect ‘vars.nrpe_timeout_unknown = true’ and report CRITICAL.

Here are the configuration and the script for the Pushover notification that is not working:

template Notification "pushover-service-notification" {
  command = "pushover-service-notification"

  users = ["pushover"]

  states = [ Critical, Warning ]
  types = [ Problem, Custom ]

  period = "24x7"
}
apply Notification "pushover-service" to Service {
  import "pushover-service-notification"
  interval = 0
  assign where service.vars.pushover && host.vars.pushover
}
object NotificationCommand "pushover-service-notification" {
  command = [ ConfigDir + "/scripts/pushover_service.sh" ]

  arguments += {
    "servicename" = {
      skip_key = true
      order = 0
      value = "$notification_servicename$"
    }
    "hostname" = {
      skip_key = true
      order = 1
      value = "$notification_hostname$"
    }
    "servicestate" = {
      skip_key = true
      order = 2
      value = "$notification_servicestate$"
    }
    "serviceoutput" = {
      skip_key = true
      order = 3
      value = "$notification_serviceoutput$"
    }
  }

  vars += {
    notification_servicename = "$service.name$"
    notification_hostname = "$host.name$"
    notification_servicestate = "$service.state$"
    notification_serviceoutput = "$service.output$"
  }
}
/etc/icinga2/conf.d# cat ../scripts/pushover_service.sh 
#!/bin/bash

# test by hand
# ./pushover_service.sh "ping" host CRITICAL "big fail"

echo "$@" >> /tmp/pushover_service_alert.log
# echo "$@" |& tee -a /tmp/pushover_service_alert.log

set -e

SERVICENAME=$1
HOSTALIAS=$2
SERVICESTATE=$3
SERVICEOUTPUT=$4

. /etc/icinga2/scripts/pushover.sh.env || . ./pushover.sh.env

curl -s \
--form-string "token=$PUSHOVER_TOKEN" \
--form-string "user=$PUSHOVER_KEY" \
--form-string "title=$SERVICENAME on $HOSTALIAS" \
--form-string "message=is in $SERVICESTATE state with $SERVICEOUTPUT" \
--form-string "priority=0" \
$PUSHOVER_URL &> /dev/stdout | tee -a /tmp/pushover_service_response.log

Example alert:

CRITICAL - CHECK_NRPE: Error - Could not connect to x.x.x.x: Connection reset by peer

To be honest, I don’t get what the problem is that you want to be solved. Could you clarify a bit what you want to achieve?

Please use code highlighting (three backticks (`)), as it makes your posts a lot easier to read. I have already changed that in your post, so please use it as a reference.

For pushover-service-notification, I want the following to be respected:

vars.nrpe_timeout_unknown = true

This is not the case at the moment.

Thanks for using the highlighting! :slight_smile:

The problem is that the option is used by check_nrpe, not by your notification command. Notification commands are “stupid”: they don’t know the difference between Critical and Unknown. They just receive the state that the check plugin (in your case check_nrpe) returned and send it to the user.
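To illustrate the point, here is a minimal sketch of what the message-building part of a script like pushover_service.sh does. The function name build_message is hypothetical (it is not in the original script); the message wording mirrors the script posted above. Whatever state string Icinga passes in is forwarded unchanged.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script: builds the message
# text in the same way pushover_service.sh does.
# The state arrives as a plain string argument and is passed through as-is;
# nothing here knows about NRPE, timeouts, or the -u flag.
build_message() {
  local servicestate="$1" serviceoutput="$2"
  printf 'is in %s state with %s' "$servicestate" "$serviceoutput"
}

# Example call with the state exactly as the check reported it
build_message "CRITICAL" "CHECK_NRPE: Error"
echo
```

So if the message says CRITICAL, the check itself returned CRITICAL, and the place to investigate is the check, not the notification.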

I can see you’re using a check command called nrpe_v3. I’d like to have a look into that because that seems to be the culprit here.

This is my definition of the command, but I don’t think that’s the problem here. I use the same one for mail notifications and it’s fine …

object CheckCommand "nrpe_v3" {
        import "ipv4-or-ipv6"

        command = [ PluginDir + "/check_nrpe_v3" ]

        arguments = {
                "-H" = {
                        value = "$nrpe_address$"
                        description = "The address of the host running the NRPE daemon"
                }
                "-p" = {
                        value = "$nrpe_port$"
                }
                "-c" = {
                        value = "$nrpe_command$"
                }
                "-n" = {
                        set_if = "$nrpe_no_ssl$"
                        description = "Do not use SSL"
                }
                "-u" = {
                        set_if = "$nrpe_timeout_unknown$"
                        description = "Make socket timeouts return an UNKNOWN state instead of CRITICAL"
                }
                "-t" = {
                        value = "$nrpe_timeout$"
                        description = "<interval>:<state> = <Number of seconds before connection times out>:<Check state to exit with in the event of a timeout (default=CRITICAL)>"
                }
                "-S" = {
                        value = "$ssl_version$"
                        description = "SSL version"
                }
                "-C" = {
                        value = "$client_cert$"
                        description = "Client cert path"
                }
                "-K" = {
                        value = "$client_key$"
                        description = "Client cert key path"
                }
                "-A" = {
                        value = "$ca_cert$"
                        description = "CA cert key path"
                }
                "-a" = {
                        value = "$nrpe_arguments$"
                        repeat_key = false
                        order = 1
                }
                "-4" = {
                        set_if = "$nrpe_ipv4$"
                        description = "Use IPv4 connection"
                }
                "-6" = {
                        set_if = "$nrpe_ipv6$"
                        description = "Use IPv6 connection"
                }
                "-2" = {
                        set_if = "$nrpe_version_2$"
                        description = "Use this if you want to connect to NRPE v2"
                }
        }

        vars.nrpe_address = "$check_address$"
        vars.nrpe_no_ssl = false
        vars.nrpe_timeout_unknown = true
        vars.check_ipv4 = "$nrpe_ipv4$"
        vars.check_ipv6 = "$nrpe_ipv6$"
        vars.nrpe_version_2 = false
        timeout = 5m
}

Ok, I don’t know about the check_nrpe_v3 plugin. Maybe it doesn’t honor the -u in every case?

The problem still seems to be there. Your notification script just gets a string and sends it via e-mail. It doesn’t know about NRPE or its timeouts.

You could try running the different checks from the command line:

sudo -u icinga /usr/lib64/nagios/plugins/check_nrpe_v3 -u ... and see if it’s returning UNKNOWN or CRITICAL.

check_nrpe_v3 is basically the check_nrpe plugin in version 3. I just use it this way to distinguish between v2 and v3.

/usr/lib/nagios/plugins/check_nrpe_v3

Incorrect command line arguments supplied

NRPE Plugin for Nagios
Version: 3.2.1

Copyright (c) 2009-2017 Nagios Enterprises
              1999-2008 Ethan Galstad (nagios@nagios.org)

Last Modified: 2017-09-01

License: GPL v2 with exemptions (-l for more info)

SSL/TLS Available: OpenSSL 0.9.6 or higher required

Usage: check_nrpe -H <host> [-2] [-4] [-6] [-n] [-u] [-V] [-l] [-d <dhopt>]
       [-P <size>] [-S <ssl version>]  [-L <cipherlist>] [-C <clientcert>]
       [-K <key>] [-A <ca-certificate>] [-s <logopts>] [-b <bindaddr>]
       [-f <cfg-file>] [-p <port>] [-t <interval>:<state>] [-g <log-file>]
       [-c <command>] [-E] [-a <arglist...>]

Options:
 -H, --host=HOST              The address of the host running the NRPE daemon
 -2, --v2-packets-only        Only use version 2 packets, not version 3
 -4, --ipv4                   Bind to ipv4 only
 -6, --ipv6                   Bind to ipv6 only
 -n, --no-ssl                 Do no use SSL
 -u, --unknown-timeout        Make connection problems return UNKNOWN instead of CRITICAL
 -V, --version                Print version info and quit
 -l, --license                Show license
 -E, --stderr-to-stdout       Redirect stderr to stdout
 -d, --use-dh=DHOPT           Anonymous Diffie Hellman use:
                              0         Don't use Anonymous Diffie Hellman
                                        (This will be the default in a future release.)
                              1         Allow Anonymous Diffie Hellman (default)
                              2         Force Anonymous Diffie Hellman
 -P, --payload-size=SIZE      Specify non-default payload size for NSClient++
 -S, --ssl-version=VERSION    The SSL/TLS version to use. Can be any one of:
                              SSLv3     SSL v3 only
                              SSLv3+    SSL v3 or above 
                              TLSv1     TLS v1 only
                              TLSv1+    TLS v1 or above (DEFAULT)
                              TLSv1.1   TLS v1.1 only
                              TLSv1.1+  TLS v1.1 or above
                              TLSv1.2   TLS v1.2 only
                              TLSv1.2+  TLS v1.2 or above
 -L, --cipher-list=LIST       The list of SSL ciphers to use (currently defaults
                              to "ALL:!MD5:@STRENGTH:@SECLEVEL=0". THIS WILL change in a future release.)
 -C, --client-cert=FILE       The client certificate to use for PKI
 -K, --key-file=FILE          The private key to use with the client certificate
 -A, --ca-cert-file=FILE      The CA certificate to use for PKI
 -s, --ssl-logging=OPTIONS    SSL Logging Options
 -b, --bind=IPADDR            Local address to bind to
 -f, --config-file=FILE       Configuration file to use
 -g, --log-file=FILE          Log file to write to
 -p, --port=PORT              The port on which the daemon is running (default=5666)
 -c, --command=COMMAND        The name of the command that the remote daemon should run
 -a, --args=LIST              Optional arguments that should be passed to the command,
                              separated by a space. If provided, this must be the last
                              option supplied on the command line.

 NEW TIMEOUT SYNTAX
 -t, --timeout=INTERVAL:STATE
                              INTERVAL  Number of seconds before connection times out (default=10)
                              STATE     Check state to exit with in the event of a timeout (default=CRITICAL)
                              Timeout STATE must be a valid state name (case-insensitive) or integer:
                              (OK, WARNING, CRITICAL, UNKNOWN) or integer (0-3)

Note:
This plugin requires that you have the NRPE daemon running on the remote host.
You must also have configured the daemon to associate a specific plugin command
with the [command] option you are specifying here. Upon receipt of the
[command] argument, the NRPE daemon will run the appropriate plugin command and
send the plugin output and return code back to *this* plugin. This allows you
to execute plugins on remote hosts and 'fake' the results to make Nagios think
the plugin is being run locally.

And I think ‘-u’ works correctly.

Ok. Then try running your checks “manually” with sudo. This way you should see if the check is returning the correct status so your notification can send it.
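A minimal sketch of such a manual run, assuming the usual plugin exit code convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The helper names state_name and run_check are made up for this example:

```shell
#!/bin/bash
# Map a plugin exit code to the conventional state name.
state_name() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "CRITICAL" ;;
    3) echo "UNKNOWN" ;;
    *) echo "INVALID($1)" ;;
  esac
}

# Run any check command, print its output, then print the exit code
# together with the state name it corresponds to.
run_check() {
  local output rc
  output=$("$@" 2>&1)
  rc=$?
  printf '%s\n' "$output"
  printf 'exit code %d => %s\n' "$rc" "$(state_name "$rc")"
  return "$rc"
}
```

You would call it as, for example, run_check sudo -u icinga /usr/lib64/nagios/plugins/check_nrpe_v3 -H <host> -u -c <command>, replacing the placeholders and the plugin path with your own values. The last line of the output then shows which state Icinga (and therefore the notification) will see.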

If the example alert above is one that you want reported as “UNKNOWN”, that will not work.
As the description states, the -u flag sets the check state to UNKNOWN in case the connection times out (e.g. host not reachable, script running too long).

Your message shows that a connection is established but reset/not correctly answered. Not sure how to describe it best.
Check the host you are querying via NRPE. Is the IP of the monitoring server allowed?
Are there any firewalls or proxies between the host and the NRPE client?
What output do you get by simply running check_nrpe_v3 -H hostaddress?
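For the first question, a rough sketch of how you could look at the allowed_hosts setting on the monitored host. The function names are made up for this example, and the check only does an exact string match, so it will not recognise an address covered by a network range or a hostname entry:

```shell
#!/bin/bash
# Print every allowed_hosts entry from the given nrpe.cfg, one per line.
# Commented-out lines are ignored.
allowed_hosts_entries() {
  local cfg="$1"
  grep -E '^[[:space:]]*allowed_hosts[[:space:]]*=' "$cfg" \
    | cut -d= -f2- \
    | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//'
}

# Succeed if the address is listed literally (exact match only).
is_listed() {
  local cfg="$1" addr="$2"
  allowed_hosts_entries "$cfg" | grep -Fxq -- "$addr"
}
```

Usage would be something like is_listed /etc/nagios/nrpe.cfg <monitoring-server-ip> && echo listed, where the config path is an assumption and depends on your distribution.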


I think we can close this item.

It looks like somebody was messing with nrpe.cfg and the allowed_hosts directive on that host without my knowledge.

Thank you all, it is all clear to me now.

@seeb Only you can close this thread. :slight_smile: Just click on the “Solution” checkbox in the post that solved the issue. (In this case it might be your own last post.) This way it gets closed and everyone sees that you don’t need any more help.