CHECK NRPE : back and forth from OK to timeout

Hi everyone,

A year ago, we switched from Shinken/Thruk to Icinga2/IcingaWeb2 for our monitoring, which covers more than 550 hosts and 12,000 services.
Most of those services are NRPE checks: a script runs on the host and its result is sent directly to the monitoring server via NRPE.

Example:

'/usr/lib/nagios/plugins/check_nrpe' '-4' '-H' '[ANONYMIZED_IP]' '-c' 'check_name' '-t' '10'

We use a distributed monitoring setup with two masters and one satellite, and no agents on the hosts.

Since the switch, we have been encountering a problem that we never saw on our old system.

From time to time, NRPE checks (and no other checks) come back critical with this error:

CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

One to three minutes later (at the next check), the status goes back to OK.

The error occurs from both the masters and the satellite, on different hosts and services, but not on all of them. We have found no link between the affected hosts or services.

The NRPE configuration files on the hosts are correct, and as far as we can see there is nothing wrong in the network.

Any ideas on where to look to understand this kind of behavior?

  • Version used (icinga2 --version)

    icinga2 - The Icinga 2 network monitoring daemon (version: r2.12.3-1)

          Copyright (c) 2012-2021 Icinga GmbH (https://icinga.com/)
          License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
          This is free software: you are free to change and redistribute it.
          There is NO WARRANTY, to the extent permitted by law.
    
          System information:
            Platform: Debian GNU/Linux
            Platform version: 10 (buster)
            Kernel: Linux
            Kernel version: 4.19.0-0.bpo.6-amd64
            Architecture: x86_64
    
          Build information:
            Compiler: GNU 8.3.0
            Build host: runner-hh8q3bz2-project-298-concurrent-0
            OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019
    
          Application information:
    
          General paths:
            Config directory: /etc/icinga2
            Data directory: /var/lib/icinga2
            Log directory: /var/log/icinga2
            Cache directory: /var/cache/icinga2
            Spool directory: /var/spool/icinga2
            Run directory: /run/icinga2
    
          Old paths (deprecated):
            Installation root: /usr
            Sysconf directory: /etc
            Run directory (base): /run
            Local state directory: /var
    
          Internal paths:
            Package data directory: /usr/share/icinga2
            State path: /var/lib/icinga2/icinga2.state
            Modified attributes path: /var/lib/icinga2/modified-attributes.conf
            Objects path: /var/cache/icinga2/icinga2.debug
            Vars path: /var/cache/icinga2/icinga2.vars
            PID path: /run/icinga2/icinga2.pid
    
  • Operating System and version

      Distributor ID:	Debian
      Description:	Debian GNU/Linux 10 (buster)
      Release:	10
      Codename:	buster
    
  • Enabled features (icinga2 feature list)

      Disabled features: compatlog debuglog elasticsearch gelf graphite icingadb influxdb opentsdb perfdata statusdata syslog
      Enabled features: api checker command ido-mysql livestatus mainlog notification
    
  • Config validation (icinga2 daemon -C)

      [2021-04-07 15:59:04 +0200] information/cli: Icinga application loader (version: r2.12.3-1)
      [2021-04-07 15:59:04 +0200] information/cli: Loading configuration file(s).
      [2021-04-07 15:59:05 +0200] information/ConfigItem: Committing config item(s).
      [2021-04-07 15:59:05 +0200] information/ApiListener: My API identity: [ANONYMISED].net
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 527 Hosts.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 231 Downtimes.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 6 NotificationCommands.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 FileLogger.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 10 Comments.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 12218 Notifications.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 55 HostGroups.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 5 Zones.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 3 Endpoints.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 ExternalCommandListener.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 4 ApiUsers.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 ApiListener.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 292 CheckCommands.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 1 LivestatusListener.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 10 TimePeriods.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 8 UserGroups.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 10 Users.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 11476 Services.
      [2021-04-07 15:59:11 +0200] information/ConfigItem: Instantiated 24 ServiceGroups.
      [2021-04-07 15:59:11 +0200] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
      [2021-04-07 15:59:11 +0200] information/cli: Finished validating the configuration file(s).
    
  • If you run multiple Icinga 2 instances, the zones.conf file (or icinga2 object list --type Endpoint and icinga2 object list --type Zone) from all affected nodes

      Object '[ANONYMIZED_SATELLITE]' of type 'Endpoint':
        % declared in '/etc/icinga2/zones.conf', lines 22:1-22:46
        * __name = "[ANONYMIZED_SATELLITE]"
        * host = "[ANONYMIZED_IP_SATELLITE]"
          % = modified in '/etc/icinga2/zones.conf', lines 23:2-23:24
        * log_duration = 86400
        * name = "[ANONYMIZED_SATELLITE]"
        * package = "_etc"
        * port = "5665"
          % = modified in '/etc/icinga2/zones.conf', lines 24:2-24:14
        * source_location
          * first_column = 1
          * first_line = 22
          * last_column = 46
          * last_line = 22
          * path = "/etc/icinga2/zones.conf"
        * templates = [ "[ANONYMIZED_SATELLITE]" ]
          % = modified in '/etc/icinga2/zones.conf', lines 22:1-22:46
        * type = "Endpoint"
        * zone = ""
    
      Object '[ANONYMIZED_MASTER2]' of type 'Endpoint':
        % declared in '/etc/icinga2/zones.conf', lines 15:1-15:39
        * __name = "[ANONYMIZED_MASTER2]"
        * host = ""
        * log_duration = 86400
        * name = "[ANONYMIZED_MASTER2]"
        * package = "_etc"
        * port = "5665"
        * source_location
          * first_column = 1
          * first_line = 15
          * last_column = 39
          * last_line = 15
          * path = "/etc/icinga2/zones.conf"
        * templates = [ "[ANONYMIZED_MASTER2]" ]
          % = modified in '/etc/icinga2/zones.conf', lines 15:1-15:39
        * type = "Endpoint"
        * zone = ""
    
      Object '[ANONYMIZED_MASTER1]' of type 'Endpoint':
        % declared in '/etc/icinga2/zones.conf', lines 6:1-6:36
        * __name = "[ANONYMIZED_MASTER1]"
        * host = "[ANONYMIZED_MASTER1]"
          % = modified in '/etc/icinga2/zones.conf', lines 7:4-7:18
        * log_duration = 86400
        * name = "[ANONYMIZED_MASTER1]"
        * package = "_etc"
        * port = "5665"
        * source_location
          * first_column = 1
          * first_line = 6
          * last_column = 36
          * last_line = 6
          * path = "/etc/icinga2/zones.conf"
        * templates = [ "[ANONYMIZED_MASTER1]" ]
          % = modified in '/etc/icinga2/zones.conf', lines 6:1-6:36
        * type = "Endpoint"
        * zone = ""
    
    
    Object 'global-commands' of type 'Zone':
      % declared in '/etc/icinga2/zones.conf', lines 41:1-41:29
      * __name = "global-commands"
      * endpoints = null
      * global = true
        % = modified in '/etc/icinga2/zones.conf', lines 42:3-42:15
      * name = "global-commands"
      * package = "_etc"
      * parent = ""
      * source_location
        * first_column = 1
        * first_line = 41
        * last_column = 29
        * last_line = 41
        * path = "/etc/icinga2/zones.conf"
      * templates = [ "global-commands" ]
        % = modified in '/etc/icinga2/zones.conf', lines 41:1-41:29
      * type = "Zone"
      * zone = ""
    
    Object 'interne' of type 'Zone':
      % declared in '/etc/icinga2/zones.conf', lines 27:1-27:21
      * __name = "interne"
      * endpoints = [ "[ANONYMIZED_SATELLITE]" ]
        % = modified in '/etc/icinga2/zones.conf', lines 28:2-28:47
      * global = false
      * name = "interne"
      * package = "_etc"
      * parent = "master"
        % = modified in '/etc/icinga2/zones.conf', lines 29:2-29:18
      * source_location
        * first_column = 1
        * first_line = 27
        * last_column = 21
        * last_line = 27
        * path = "/etc/icinga2/zones.conf"
      * templates = [ "interne" ]
        % = modified in '/etc/icinga2/zones.conf', lines 27:1-27:21
      * type = "Zone"
      * zone = ""
    
    Object 'global-templates' of type 'Zone':
      % declared in '/etc/icinga2/zones.conf', lines 33:1-33:30
      * __name = "global-templates"
      * endpoints = null
      * global = true
        % = modified in '/etc/icinga2/zones.conf', lines 34:2-34:14
      * name = "global-templates"
      * package = "_etc"
      * parent = ""
      * source_location
        * first_column = 1
        * first_line = 33
        * last_column = 30
        * last_line = 33
        * path = "/etc/icinga2/zones.conf"
      * templates = [ "global-templates" ]
        % = modified in '/etc/icinga2/zones.conf', lines 33:1-33:30
      * type = "Zone"
      * zone = ""
    
    Object 'director-global' of type 'Zone':
      % declared in '/etc/icinga2/zones.conf', lines 37:1-37:29
      * __name = "director-global"
      * endpoints = null
      * global = true
        % = modified in '/etc/icinga2/zones.conf', lines 38:2-38:14
      * name = "director-global"
      * package = "_etc"
      * parent = ""
      * source_location
        * first_column = 1
        * first_line = 37
        * last_column = 29
        * last_line = 37
        * path = "/etc/icinga2/zones.conf"
      * templates = [ "director-global" ]
        % = modified in '/etc/icinga2/zones.conf', lines 37:1-37:29
      * type = "Zone"
      * zone = ""
    
    Object 'master' of type 'Zone':
      % declared in '/etc/icinga2/zones.conf', lines 18:1-18:20
      * __name = "master"
      * endpoints = [ "[ANONYMIZED_MASTER1]", "[ANONYMIZED_MASTER2]" ]
        % = modified in '/etc/icinga2/zones.conf', lines 19:2-19:62
      * global = false
      * name = "master"
      * package = "_etc"
      * parent = ""
      * source_location
        * first_column = 1
        * first_line = 18
        * last_column = 20
        * last_line = 18
        * path = "/etc/icinga2/zones.conf"
      * templates = [ "master" ]
        % = modified in '/etc/icinga2/zones.conf', lines 18:1-18:20
      * type = "Zone"
      * zone = ""
    
  • NRPE plugin

NRPE Plugin for Nagios
Version: 4.0.3

Hello Thomas-Michel,
10 seconds is the default timeout for the check_nrpe command. If the network is slow or the endpoint receiving the command is overloaded, the command may not complete within those 10 seconds. You can change the timeout by setting the nrpe_timeout variable for the nrpe check command.

By the way, the nrpe command is insecure; you should consider using a different check command for security reasons.

Thank you for your reply.
I’m aware of the timeout parameter. We have already tried changing it, and nothing changed.

Our network doesn’t show any signs of degradation or slowdown.

Only NRPE checks are failing; shouldn’t other services fail too?
How can I tell whether the endpoint is overloaded, and what can I do if it is?

We have considered switching to NCPA, but it doesn’t seem possible to run custom checks the way we do.

Any suggestions on how to identify what is causing this?

Hello @Thomas-Michel,
What happens when you run the nrpe command manually from the command line? Does the command time out then?

/usr/lib64/nagios/plugins/check_nrpe -H [ANONYMIZED_IP] -c check_name -t 30
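
Because the timeouts are intermittent, a single manual run may well succeed. Below is a minimal sketch for repeating the manual test and recording the exit code and duration of each run. The helper names `run_timed` and `probe` are hypothetical (they are not part of NRPE or Icinga 2), and the run count and pause in the example are arbitrary.

```shell
#!/bin/sh
# Run a command once, discard its output, and print
# "<timestamp> rc=<exit code> elapsed=<seconds>s".
run_timed() {
    local start end rc
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    printf '%s rc=%s elapsed=%ss\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$rc" "$((end - start))"
    return "$rc"
}

# Usage: probe COUNT PAUSE COMMAND [ARGS...]
# Runs COMMAND COUNT times, sleeping PAUSE seconds after each run,
# then prints how many runs returned a non-zero exit code.
probe() {
    local count pause failures i
    count=$1
    pause=$2
    shift 2
    failures=0
    i=0
    while [ "$i" -lt "$count" ]; do
        run_timed "$@" || failures=$((failures + 1))
        sleep "$pause"
        i=$((i + 1))
    done
    echo "failures=$failures/$count"
}

# Example, commented out so the file can be sourced safely.
# HOST is a placeholder for the monitored host's address:
# probe 60 5 /usr/lib/nagios/plugins/check_nrpe -4 -H "$HOST" -c check_name -t 10
```

Any non-OK plugin result counts as a failure here, not only timeouts, so the `rc` and `elapsed` columns are what to look at: a run that lasts about as long as the `-t` value points to a timeout rather than a failing check script.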

If, for example, you’re sending the nrpe command from a master server in New York over the network to an agent in Paris, the command results may not make the round trip in less than 10 seconds. Also, if the agent is busy running a different job that uses most of the CPU, then the command may not complete before the timeout.
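
One rough way to check the "busy agent" case on a Linux host is to compare the 1-minute load average with the number of CPUs. This is only a sketch: `load_status` is a hypothetical helper, and treating "load greater than CPU count" as overloaded is a rule of thumb, not a threshold defined by NRPE or Icinga 2.

```shell
#!/bin/sh
# Usage: load_status LOAD NCPU
# Prints "overloaded" if LOAD is greater than NCPU, otherwise "ok".
# The comparison is done in awk because LOAD is a decimal number.
load_status() {
    awk -v load="$1" -v ncpu="$2" 'BEGIN {
        if (load + 0 > ncpu + 0) print "overloaded"; else print "ok"
    }'
}

# Example, to be run on the monitored host (Linux only, commented out
# so the file can be sourced safely):
# load_status "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

Since the load average is smoothed over a minute, a short spike at the moment of the check can be missed, so it may help to log this value around the times the timeouts occur.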

Regards
Alex