Can someone point me to a simple working check using a custom DSL function?

Janos.Kiss · May 1, 2023, 8:57am

Dear Icinga community,

I am trying to perform a check, where a single check should combine multiple checks into one, and I want to do this without a business process. I thought that using a “function” for this via DSL should be simple, but so far I did not succeed, and I am a bit lost and confused after going through several blog posts presenting very complicated solutions.
I found this example
https://icinga.com/blog/2021/04/29/calculating-a-state-over-mutliple-services/
which is over-complicated for me.

Could someone please share a simple working example? In practice, I need to know:

how to save a custom function
how to make a saved custom function usable as a check (or how to call that function)
how to associate that check with a host, or how to call that function to execute a check

After going through the example I posted above, I don’t see why I would need to set up dummy hosts and helper functions just to run a simple function on the Icinga master node, so it would be nice to have a simpler working example that I can use as a blueprint to learn the basics.

Specifically, I have created a custom function that runs multiple checks at once, and it works if I copy-paste it into the Icinga 2 console. After pasting the text file that contains the new function into the console, I can call it from the console with:
MyFunction()
and it returns the expected result just fine:
"OK: minimal stack check passed"

When I create a conf file under /etc/icinga2/zones.d/master/ containing the function I pasted into the console, and restart the icinga2 service, things are still fine.
However, if I create a separate conf file, where now I try to call this function like:

apply Service "DSL_Test" {
  display_name = "Minimal StackCheck from DSL"
  vars.dummy_state = 3
  vars.dummy_text = MyFunction()
  assign where host.name == "icinga"
  check_command = "dummy"
  check_interval = 5m
  retry_interval = 3m
}

and I try to restart icinga2.service, this fails with:
critical/config: Error: Invalid field access (for value of type 'Service'): 'MyFunction'

Probably I am overlooking something basic of how to use a custom function, and how to perform a check with a custom function. By searching online, I did not find a good explanation and basic examples, so I would appreciate it if someone could help me out here with a simple working example.

Best regards

Janos.Kiss · May 1, 2023, 1:38pm

I guess one of the issues was that the function I created was not available globally, but only existed in the local scope. As a brute-force first try, I made the function global:

globals.MyFunction = function () {

Now at least the icinga2 service starts up fine, and the check executes:

apply Service "DSL_Test" {
  display_name = "Minimal StackCheck from DSL"
  vars.dummy_state = 3
  vars.dummy_text = MyFunction()
  assign where host.name == "icinga"
  check_command = "dummy"
  check_interval = 5m
  retry_interval = 3m
}

However, the output is not the expected one, so I guess that although the function itself is global, the internal scope of the
get_service("Hostname", "check").last_check_result.exit_status
which is used inside the global function is wrong, and it does not return 0.000000 as it does when I run the function inside the icinga console. I am not sure what the right scope should be, but I guess it should be the “this” special scope type, however, I don’t find a way to set that for a check like above.

Any hint on what I should do or how to debug this (or even better: is there a simple working example available somewhere)?

Best regards
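
A possible explanation for the difference between the console and the check: a value assigned as `vars.dummy_text = MyFunction()` is evaluated once, when the configuration is compiled, and at that point the referenced service may not have a check result yet. Wrapping the call in the abbreviated lambda syntax `{{ ... }}` defers it until the check actually runs. The following is a minimal sketch under assumptions, not a verified fix: it assumes `MyFunction` is registered in `globals` and returns the output text, and `MyStateFunction` is a hypothetical companion function that returns the exit code.

```
apply Service "DSL_Test" {
  display_name = "Minimal StackCheck from DSL"
  check_command = "dummy"
  check_interval = 5m
  retry_interval = 3m

  // {{ ... }} defines a function that is called every time the check
  // is executed, instead of once while the configuration is compiled.
  vars.dummy_text = {{ return MyFunction() }}

  // Hypothetical: MyStateFunction() is assumed to return 0, 1, 2 or 3.
  vars.dummy_state = {{ return MyStateFunction() }}

  assign where host.name == "icinga"
}
```

With this shape, the fixed `vars.dummy_state = 3` is no longer needed, because the state is computed on each run as well.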

rivad · May 2, 2023, 7:15am

I use this to check whether two things are in sync or whether the deviation is too big:

object CheckCommand "116-cmd-equal-or-not" {
    import "plugin-check-command"
    command = [ "/usr/lib64/nagios/plugins/dummy" ]
    timeout = 10s
    arguments += {
        "--message" = {
            description = "Message"
            required = true
            value = {{
                // Return the raw value of the first perfdata entry,
                // e.g. "label=42c;..." yields "42".
                function get_perfdata(service){
                  var perf_value = service.last_check_result.performance_data[0].split("=")[1]
                  var perf_value = perf_value.split("c;")[0]
                  return perf_value
                }
            	var output = ""
            	var hosts_to_compare = macro("$116_comparison_hosts$")
            	var service_pattern = macro("$116_comparison_service_pattern$")
            	var tolerance = macro("$116_comparison_tolerance$")
            	var values = []
            	var servcies = []
            
            	for (host in hosts_to_compare) {
            	  var service_names = get_services(host).map(s => s.name)
            	  for (service_name in service_names) {
            		if (match(service_pattern, service_name)) {
            		  if (match("Compare Services*", service_name)){
            		   continue
            		  }
            		  servcies.add(host + "!" + service_name)
            		  var service = get_service(host, service_name)
            	      if (len(service.last_check_result.performance_data) < 1){
                       return "[UNKNOWN] '" + host + "!" + service_name + "' has no performance data value in last_check_result!"
            	      }
            		  values.add(get_perfdata(service))
            		}
            	  }
            	}
                if (len(values) < 2) {
                  return "[UNKNOWN] less than 2 values collected!"
                }
            	
            	output = "[OK] all values in allowed tolerance |"
            	
            	for (value in values) {
            	  if (number(value) + number(tolerance) < number(values[0]) || number(value) - number(tolerance) > number(values[0])) {
            		output = "[CRITICAL] value " + value + " not in allowed tolerance " + tolerance + " from first value " + values[0] + " |"
            	  }
            	}
            
            	for (service in servcies) {
            	  var host = service.split("!")[0]
            	  var service_name = service.split("!")[1]
            	  var service = get_service(host, service_name)
            	  var value = get_perfdata(service)
            	  var value_min = number(values[0]) - number(tolerance)
            	  var value_max = number(values[0]) + number(tolerance)
            	  output += " '" + host + "!" + service_name + "'=" + value + ";;" + value_min + ":" + value_max + ";"
            	}
            
            	return output
            }}
        }
        "--state" = {
            description = "State"
            value = {{
                function get_perfdata(service){
            	  var perf_value = service.last_check_result.performance_data[0].split("=")[1]
            	  var perf_value = perf_value.split("c;")[0]
            	  return perf_value
            	}
            	var hosts_to_compare = macro("$116_comparison_hosts$")
            	var service_pattern = macro("$116_comparison_service_pattern$")
            	var tolerance = macro("$116_comparison_tolerance$")
            	var values = []
            
            	for (host in hosts_to_compare) {
            	  var service_names = get_services(host).map(s => s.name)
            	  for (service_name in service_names) {
            		if (match(service_pattern, service_name)) {
            		  if (match("Compare Services*", service_name)){
            		   continue
            		  }
            		  var service = get_service(host,service_name)
            		  if (len(service.last_check_result.performance_data) < 1){
                        return "unk"
            	      }
            		  values.add(get_perfdata(service))
            		}
            	  }
            	}
                if (len(values) < 2) {
            	  return "unk"  
                }
            
            	for (value in values) {
            	  if (number(value) + number(tolerance) < number(values[0]) || number(value) - number(tolerance) > number(values[0])) {
            		return "crit" //at least one service is not OK
            	  }
            	}
            
            	return "ok" //all is well
            }}
        }
    }
}

Hope this helps even if I don’t define functions in a global scope.

Janos.Kiss · May 2, 2023, 8:23am

Thanks for the example. I will have a look at it and see how I can apply it to what I want to do.

In my view, it would be nice to have a description and a documented working example somewhere on the Icinga main website about monitoring full software/application stacks, similar to what one would do with a business process, but without having to set up a business process via the web GUI for that stack. For example, I am sure many people want a single check for a LAMP stack: instead of several individual checks per constituent service, what matters in the end is a single combined check representing the state of the LAMP stack as a whole. Of course, a well-documented working example for monitoring an orchestrated cloud-based service where one can set custom redundancy levels would be even nicer. It would help people a lot who develop their own custom monitoring functions in DSL and apply them in fully dynamic OpenStack cloud environments and such.
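
As a rough illustration of such a combined check, the sketch below derives one state from several existing services on the same host by taking the numerically highest state. The service names in `vars.stack_services` are hypothetical placeholders, and treating a missing service or a missing check result as UNKNOWN is an assumption of this sketch, not something taken from the Icinga documentation.

```
apply Service "lamp-stack" {
  check_command = "dummy"
  check_interval = 5m

  // Hypothetical service names; adjust to the services defined on the host.
  vars.stack_services = [ "apache", "mysql", "php-fpm" ]

  vars.dummy_state = {{
    var worst = 0
    for (name in macro("$stack_services$")) {
      var svc = get_service(macro("$host.name$"), name)
      // Assumption: a missing service or check result counts as UNKNOWN (3).
      if (!svc || !svc.last_check_result) {
        return 3
      }
      if (svc.state > worst) {
        worst = svc.state
      }
    }
    return worst
  }}

  vars.dummy_text = "Combined state of the stack services"

  assign where host.name == "icinga"
}
```

Because states are compared numerically, UNKNOWN (3) outranks CRITICAL (2) here; a different ordering would need an explicit mapping.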

I guess the issue is that I don’t yet know enough about doing slightly more advanced configuration. I just need to learn more about how to achieve this without business processes, and only use business processes for a higher-level overview to handle redundancy and decide whether alerting should start or whether the service has enough redundancy to continue operating.

Best regards

rivad · May 2, 2023, 8:39am

I would advise against doing too much in the Icinga 2 DSL.

IMHO, it would be better to have all this logic in a dedicated script and just send the results back to Icinga 2 via the REST API. This is called a passive check because Icinga isn’t scheduling and triggering the check. You can also use Icinga2 to do the scheduling and trigger the script (active check) and use the exit code and stdout to tell Icinga what your script found out.
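
The active variant described above could be wired up roughly as follows. This is only a sketch: `check_stack` is a hypothetical script assumed to live in the plugin directory, to print one line of output to stdout, and to exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```
object CheckCommand "stack-check" {
  // Hypothetical script: its exit code becomes the service state
  // and its stdout becomes the check output.
  command = [ PluginDir + "/check_stack" ]
}

apply Service "stack" {
  check_command = "stack-check"
  check_interval = 5m
  retry_interval = 3m
  assign where host.name == "icinga"
}
```

Icinga then only schedules and runs the script, while all combination logic stays in the script itself.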

A coworker of mine used Pester to write such an active check to monitor his stack.

BTW, the external script can also read from the Icinga 2 API if you want to use the states of services that are already in Icinga.

Janos.Kiss · May 2, 2023, 9:05am

I try to steer away from the API as much as possible.

Namely, we had issues when using the API. In some cases it totally broke Icinga, causing it not to be able to schedule downtimes after API commands, or scheduled downtimes randomly disappeared, and we could only randomly reproduce that issue after a rollback.
Have seen similar problems reported on the forum, but the solutions offered there were inconclusive, and there was also a bugfix, but that was related to a slightly different problem.
Not sure whether the issue is related to a bug which was fixed in a newer release, but in our case for example if one added new nodes to be monitored via API commands, the directory structure or the content of the files in the subdirectories under
/var/lib/icinga2/api/packages/
was randomly garbled, and we had to roll back.

This is why I try not to use the API, since I am not sure what it will break.
Yes, I know that the Director is practically just issuing API commands and in theory everything should work fine. However, in practice the issue shows up randomly, and we are not using the Director anyway.
Most likely, doing only polling operations via the API would not break anything, but I would rather not use the API at all if possible.

rivad · May 2, 2023, 9:49am

I use the Director, and I use the API for creating hosts and services and for reading and sending check results.
This works without strange behavior for us.

I also use Ansible to manage the check plugins, and on Windows we set and remove downtimes via the API because we need to stop the service; otherwise the check files are locked while the checks are running.