Migration from local NRPE to Icinga agent monitoring

Hi,

I would like to migrate from the local nrpe.cfg based monitoring to the Icinga agent based monitoring. I understand that I will have to install the Icinga 2 agent on the host, but I am not sure how to move from nrpe.cfg to the agent.

Has anyone here done this before?

Thanks

Hi,

You can run both in parallel, just make sure to follow the agent setup described in the docs. Do you have a concrete service/host config from NRPE you can post here? Then we can help you with the config for the agent bits.

Cheers,
Michael

Here are some service/host definitions –

apply Service "dns_health_check" {

  import "generic-service"

  check_command = "nrpe"
  vars.nrpe_command = "dns_health_check"
  vars.nrpe_timeout = 60
  check_interval = 5m
  vars.notification["mail"] = {
    groups = [ "dns-admins" ]
  }

  vars.notification["pager"] = {
    groups = [ "dns-pagers" ]
  }
  assign where host.name == "dns"
}

apply Service "" for (instance in host.vars.nrpe_services) {
  import "generic-service"

  check_command = "nrpe"

  if ( check_ifpassivethost(host) ) {
   enable_active_checks  = 0
   enable_passive_checks = 1
   enable_flapping       = 0
   volatile              = 1
  }
  vars.nrpe_timeout = 30
  vars.nrpe_command = instance
  vars.check_db = instance

  //vars.alerttype = ["SYS"]
  vars.emailonly = true

  assign where host.vars.nrpe_services
  ignore where host.vars.location in DOWNSITES || host.vars.standby
}

Ok, and the corresponding command definitions from nrpe.cfg for instance and dns_health_check?

command[dns_health_check]=/usr/local/nagios/libexec/dns_health_check.sh node IP-address

command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/mapper/os-root

Hi,

ok, that’s a good basis for both CheckCommand requirements 🙂

Prepare the Master

Communication happens via TLS, so you’ll need to set up a CA key pair and a signed certificate for the master node. Everything is wrapped into the node wizard CLI command as shown in the docs.
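For reference, a minimal sketch of the master setup with the node wizard, assuming a systemd based distribution:

icinga2 node wizard
# answer 'n' when asked for an agent/satellite setup, so the wizard
# creates the CA and configures this node as the master
systemctl restart icinga2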

Setup the Icinga Agent

That’s pretty straightforward when following the docs: install the package and run the setup wizard. Here you’ll decide whether to go with a pre-generated ticket (CSR auto-signing), or to leave it empty and approve the signing request on the master (CSR on-demand signing).
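Roughly, assuming the agent’s FQDN is icinga-agent1.example.com as used below:

# on the master, optional: pre-generate a ticket for CSR auto-signing
icinga2 pki ticket --cn icinga-agent1.example.com

# on the agent: run the wizard, paste the ticket or leave it empty
icinga2 node wizard

# on the master, only needed for CSR on-demand signing
icinga2 ca list
icinga2 ca sign <fingerprint>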

The master then needs the agent zone and endpoint defined. In this case I’d assume you want the master to actively connect to the agent, so add the host attribute to the agent’s endpoint.

Host Object Preparations

If not already done, move the Host object into the master zone on the Master node. Ensure that its object name is the FQDN of the agent host.

object Host "icinga-agent1.example.com" {
  //..add your existing configuration. 
}

Create a new Endpoint object with the same name. This tells Icinga which target endpoint to use for the checks.

object Endpoint "icinga-agent1.example.com" {
  host = "192.168.56.110" //add the real IP address where port 5665 is listening on the agent
}

Build the trust relationship by assigning the endpoint to the agent’s zone, which becomes a child of the master zone. If you miss that, the master will not execute checks on the agent.

object Zone "icinga-agent1.example.com" {
  parent = "master"
  endpoints = [ "icinga-agent1.example.com" ]
}

Global Zone for Command Sync

Pick global-templates; this zone is created by default by the setup CLI wizards on both master and agent.
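For reference, the zone definition the setup wizards put into zones.conf on both nodes:

object Zone "global-templates" {
  global = true
}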

Pre-defined commands

The Icinga Template Library already provides a set of CheckCommand objects. The disk check from nrpe.cfg doesn’t need a new CheckCommand object.

command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20 -c 10 -p /dev/mapper/os-root

This requires a little review of the used plugins against the ITL, but it is worth the effort to not re-create all CheckCommands by hand.

For the disk CheckCommand, we translate the three arguments for later use:

  • disk_wfree = 20 from -w 20
  • disk_cfree = 10 from -c 10
  • disk_partitions = [ "/dev/mapper/os-root" ] from -p /dev/mapper/os-root

Custom CheckCommands

The Icinga agent needs these commands defined locally. By using the global-templates zone, those commands can be synced from the config master. You can also define them locally on each agent, but this is known to become a maintenance problem with many managed agents.

The dns_health_check requires a new CheckCommand; see the docs for how the syntax and attributes work.

command[dns_health_check]=/usr/local/nagios/libexec/dns_health_check.sh node IP-address

This can be translated into

object CheckCommand "dns_health_check" {
  command = [ PluginDir + "/dns_health_check.sh" ]

  arguments = {
    "-H" = {
      value = "$dns_health_check_host$"
      skip_key = true //the shell script doesn't use getopts; if you decide to use `-H <host>` instead, set this to false
      order = 1 //keep the positional order of the original command line: node first ...
      description = "DNS host"
    }
    "-A" = {
      value = "$dns_health_check_address$"
      skip_key = true
      order = 2 //... then the expected IP address
      description = "Expected DNS IP address"
    }
  }
}

At this stage, put this into /etc/icinga2/zones.d/global-templates/commands.conf and run icinga2 daemon -C to verify that the configuration is valid.
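Roughly, on the config master:

mkdir -p /etc/icinga2/zones.d/global-templates
vim /etc/icinga2/zones.d/global-templates/commands.conf   # paste the CheckCommand from above
icinga2 daemon -C                                         # validate before restarting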

PluginDir Constant

PluginDir needs to be set to /usr/local/nagios/libexec on your agent in constants.conf. It is better to use a global constant here than to hardcode the path in every CheckCommand. If you later decide to use a different plugin prefix path, e.g. by switching to the packaged plugins, you only need to edit constants.conf.
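On the agent, constants.conf would then contain:

/* /etc/icinga2/constants.conf */
const PluginDir = "/usr/local/nagios/libexec"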

Restart and Sync

Restart Icinga 2 on the master and verify that the CheckCommand dns_health_check is synced to the agent.

Master:

systemctl restart icinga2

Agent:

icinga2 object list --type CheckCommand --name dns_health_check

Setup Agent Checks via Command Endpoint

Disk

Here I wouldn’t re-use the previous disk check with its active/passive distinction. The Icinga agent can actively connect to the master, or the master can connect to the agent. Either way, the checks scheduled on the master can always run; only one side needs to initiate the connection first.

In addition, the Host object should specify the disks as a dictionary rather than mixing all services into a single array. This follows the example config from conf.d.

Start simple with a single apply rule:

apply Service "disk root" {

  check_command = "disk" //provided by the ITL
  command_endpoint = host.name

  vars.disk_wfree = 20 //extracted from nrpe.cfg, see above
  vars.disk_cfree = 10
  vars.disk_partitions = [ "/dev/mapper/os-root" ]

  assign where host.vars.os == "Linux"
}

Validate the config with icinga2 daemon -C and restart Icinga 2 with systemctl restart icinga2.

Force a re-check and retrieve the executed command line via the REST API. You can also enable the debug log on the agent and tail/grep for the check plugin’s name.
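A rough sketch with curl, assuming an API user root with password icinga and the default port 5665 (adjust to your API setup):

# force a re-check of the new service
curl -k -s -S -u root:icinga -H 'Accept: application/json' \
 -X POST 'https://localhost:5665/v1/actions/reschedule-check' \
 -d '{ "type": "Service", "filter": "match(\"disk*\", service.name)", "force": true, "pretty": true }'

# fetch the last check result, including the executed command line
curl -k -s -S -u root:icinga -H 'Accept: application/json' \
 'https://localhost:5665/v1/objects/services/icinga-agent1.example.com!disk%20root?attrs=last_check_result&pretty=1'

# on the agent: enable the debug log and grep for the plugin name
icinga2 feature enable debuglog && systemctl restart icinga2
tail -f /var/log/icinga2/debug.log | grep check_disk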

Configure Service via Host using Apply For

Then prepare the host object again, this time with a disks dictionary providing the thresholds and partitions.

  vars.disks["disk /"] = {
    disk_wfree = 20
    disk_cfree = 10
    disk_partitions = [ "/dev/mapper/os-root" ]
  }

And finally create the service apply for rule similar to the example config, except for the command_endpoint attribute.

apply Service for (disk => config in host.vars.disks) {
  import "generic-service"

  check_command = "disk"

  command_endpoint = host.name

  vars += config
}

Validate the config with icinga2 daemon -C and restart Icinga 2 with systemctl restart icinga2.

Force a re-check and retrieve the executed command line via the REST API, as sketched above. You can also enable the debug log on the agent and tail/grep for the check plugin’s name.

DNS

The existing service apply rule stripped down to the important parts …

apply Service "dns_health_check" {

  check_command = "nrpe"
  vars.nrpe_command = "dns_health_check"
  vars.nrpe_timeout = 60

  assign where host.name == "dns"
}

… needs to be changed into the real CheckCommand object reference in check_command. It also needs the command_endpoint attribute pointing to the host’s endpoint we’ve defined above. Since the host name is equal to the endpoint name, we can use the trick with command_endpoint = host.name here.

Further, the dns_health_check had two arguments in nrpe.cfg; we need to define them here too.

This sums up into the following service apply rule:

apply Service "dns_health_check" {

  check_command = "dns_health_check"

  command_endpoint = host.name

  vars.dns_health_check_host = "<NODE>" //maybe you can use `host.vars...` to provide the check details
  vars.dns_health_check_address = "<IP-ADDRESS>"

  assign where host.name == "dns"
}

If you accidentally leave out the command_endpoint attribute, the check will be executed on the master, not on the agent endpoint. This is a common source of errors.

Validate the config with icinga2 daemon -C and restart Icinga 2 with systemctl restart icinga2.

Force a re-check and retrieve the executed command line via the REST API, as sketched above. You can also enable the debug log on the agent and tail/grep for the check plugin’s name.

Conclusion

This looks longer than it is. I’ve taken the time to step through every little detail you may encounter; once you’ve done it a couple of times, it gets easier and faster.

The most important bit: the command arguments for the agent should be managed on the master, not hardcoded in the CheckCommand or apply rule, with the Host object becoming the source of truth. That Host object can also be generated from a CMDB, as you likely already do given the advanced DSL code in your service apply rules.

Cheers,
Michael


Thanks a lot for the detailed instructions and your time, greatly appreciated. I will give it a try and report back.

I was also thinking of going with “passive” checks instead of active checks. Would any other changes need to be made?

We also have checkers in various zones running the commands on the clients. I will get a snippet of one such check to see how those could be handled.

Thanks

I wouldn’t go completely passive. You then have to rely on freshness checks whenever you don’t hear anything from your monitored objects, you can’t reschedule checks or simply trigger a recheck, the check_interval / retry_interval logic doesn’t apply, and so on.

Go with active checks like they were intended and use passive checks only where you don’t have another option.
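For the few spots where passive really is the only option, a freshness fallback could look roughly like this (the service name and host variable are made up):

apply Service "backup_report" {
  check_command = "passive" //ITL dummy check, only executed when the passive result goes stale
  check_interval = 1d       //treat passive results older than this as stale
                            //enable_active_checks stays at its default (true) so the freshness check can fire

  vars.dummy_state = 3      //report UNKNOWN when no result arrived in time
  vars.dummy_text = "No passive check result received within one day"

  assign where host.vars.backup_reporting
}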
