The thought of going from DSL to Director (and why I discarded it)

MarcusCaepio · July 23, 2021, 9:19am

Hi all,
I don’t want to repeat the discussion from Data Type "Dictionary" to allow for nested variables · Issue #337 · Icinga/icingaweb2-module-director · GitHub, but it is relevant to my post.
We are planning a new Icinga cluster and wanted to give the Director another try as a POC. The main idea was that the “Icinga admins” configure all the commands, services, etc. (either in DSL or in the Director). Hosts would be added and deleted automatically via the Director, and the devs could adjust thresholds, add new checks to their hosts, etc. without in-depth knowledge of how Icinga works. They should only have to edit their host objects via the Director. I played around a bit with the master branch of the Director, but I did not manage to “translate” this DSL snippet:

apply Service "tcp: " for ( tcp => config in host.vars.tcp ) {
  import "generic-service"
  check_command = "tcp"
  vars += config
  assign where host.vars.tcp
}
object Host "abc" {
  [...]
  vars.tcp["123"] = {
    tcp_address = "xxx.xxx.xxx.xxx"
    tcp_port = "123"
  }
  vars.tcp["567"] = {
    tcp_address = "xxx.xxx.xxx.xxx"
    tcp_port = "567"
  }
}

On the one hand, @tgelf described his opinion about dictionaries in this comment: Data Type "Dictionary" to allow for nested variables · Issue #337 · Icinga/icingaweb2-module-director · GitHub.
On the other hand, this kind of snippet is officially documented here: Monitoring Basics - Icinga 2

Based on my DSL snippet, it would be nice if somebody could tell me how to do it “right” in the Director.

Another point I was struggling with is the checks based on SSH. The lambda function of the -C parameter in the “by_ssh” command is not present in the Director; the field is just empty.

I tried to rebuild the command and inserted the lambda function. That worked, but the Director does not accept “{ }” in service variables, so things like passing parameters in the SSH arguments seem to be impossible. E.g.:

apply Service "load" {
  import "by_ssh"
  vars.by_ssh_command = "/usr/lib/nagios/plugins/check_load"
  vars.by_ssh_arguments = {
    "-w" = {
      value = "$load_wload1$,$load_wload5$,$load_wload15$"
      description = "Exit with WARNING status if load average exceeds WLOADn"
    }
    "-c" = {
      value = "$load_cload1$,$load_cload5$,$load_cload15$"
      description = "Exit with CRITICAL status if load average exceed CLOADn; the load average format is the same used by 'uptime' and 'w'"
    }
    "-r" = {
      set_if = "$load_percpu$"
      description = "Divide the load averages by the number of CPUs (when possible)"
    }
  }
  assign where host.vars.os == "linux"
}

cannot be represented in the Director.
I have tested the Director every now and then in recent years, because I am highly interested in “automate as much as possible and delegate everything else”. I have used the DSL for over 5 years now, and every time I take a look at the Director, I see a huge gap between it and the DSL, and I am not able to use what I have learned over time.

I don’t want to start a new fundamental discussion about DSL versus Director. But I think the differences between the two are too big. Shouldn’t I be able to use the Director without any problems when I have configured Icinga via DSL for so many years and know how Icinga works? Shouldn’t there be some kind of documentation that describes in detail where the differences between the DSL and the Director are? Shouldn’t I be able to identify which kind of configuration fits my needs best without spending hours investigating it myself?
After I was not able to “translate” my two examples into the Director, and after searching for solutions on GitHub, in the docs and in this forum, I stopped testing it and discarded the thought of using it. And to be honest, I am very sad about it. But if it seems to be so complicated or even impossible to get these two examples running, how will it end with more complicated stuff like

  states = get_object(User, user).vars.mail_service_states  || [ OK, Warning, Critical, Unknown ]
  types = get_object(User, user).vars.mail_service_types || [ DowntimeStart, DowntimeEnd, DowntimeRemoved, Custom, Acknowledgement, Problem, Recovery, FlappingStart, FlappingEnd ]

?
If anyone has already made the transition from DSL to the Director, I would really like to exchange experiences with them.

Cheers,
Marcus

dgoetz · July 23, 2021, 11:05am

I have done this type of migration in multiple environments, and I can say that I personally have no preference, as both options are quite valid for me. The DSL offers more flexibility because there is no additional layer of abstraction; the Director lets you give more people access to the configuration and has built-in import capabilities, which are great as long as there is a good source.

So, to address your questions and concerns.

Let’s describe the first one as “Different best practices of DSL and Director”:
Yes, it totally makes sense for the Director to limit options and to be opinionated, because the DSL is not only a configuration format but also provides programming capabilities. The DSL tries to allow as much flexibility as possible so you can do everything needed for your rule-based configuration. The Director, in contrast, wants to provide a consistent web interface that a non-monitoring admin can also use, and to allow automation. I see the conflict between these goals and understand it. So for me, such a migration is always also about rethinking the configuration, as not every trick in the DSL helps to make configuration in the UI easier.

The second one I would call “Limited capabilities of the API”:
Some things the Director cannot query via the API, which is a problem of Icinga 2, but as the configuration lands on Icinga 2 in the end, it will always work. This is also annoying for the internal checks of Icinga 2. But here the Director is the wrong one to blame.

“Shouldn’t I be able to use the director without any problems, when I configured Icinga via DSL for so many years and know, how Icinga is working?”
Yes, but as I said, there is some abstraction and there are different approaches involved, so the Director is in fact a new tool to learn. But you should be able to fully understand what the Director is creating.

“Shouldn’t there be some kind of documentation, which describe in detail, where the differences between the DSL and the Director are? Shouldn’t I be able to identify, what kind of configuration fit my needs best without spending hours of time to investigate it by myself?”
No, there should be no need for such documentation, but there should be more documentation beyond the purely technical one. Icinga tried to have a technical writer in the past, doing use-case-driven documentation in the form of blog posts, but this unfortunately did not work out. I also used to recommend the blog posts at Monitoring Archive – UN*XE, but they are more or less outdated now. So I would say user- or use-case-oriented documentation would be the way to go here. Conceptual documentation that explains why something is designed in a specific way, how it should work and so on would also help, because then you would know whether something is just meant to be done differently or whether it is a bug.

And for your last example, I do not see any need for it: I would build it with defaults on the notification and on the user template and then simply make user-specific changes on the user, so there is no need for any “or” condition, as all of this is built in.
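
In plain DSL terms, that approach could look roughly like the sketch below. It relies on the built-in states and types attributes of User objects instead of custom variables read via get_object(). The names "generic-user" and "alice" and the e-mail address are made up for illustration, and the default lists are the ones from the example in the first post.

```icinga2
// Hypothetical user template that carries the defaults.
template User "generic-user" {
  states = [ OK, Warning, Critical, Unknown ]
  types = [ DowntimeStart, DowntimeEnd, DowntimeRemoved, Custom,
            Acknowledgement, Problem, Recovery, FlappingStart, FlappingEnd ]
}

// Hypothetical user: inherits the defaults and narrows only the states.
object User "alice" {
  import "generic-user"
  email = "alice@example.com"
  states = [ Critical ]
}
```

Because the filter lives on the user object itself, the notification apply rule no longer needs a fallback expression per user.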

So to summarize my opinion:

Icinga could do much better with documentation if it focused on the right kind
Director and DSL are different concepts, and there is no need for a 1:1 match
Both solutions are fine and can be used successfully

MarcusCaepio · July 23, 2021, 2:03pm

Hi Dirk,
thanks for your opinion. I agree with you that DSL and Director are different concepts, and I am willing to learn both. But as I said, getting information other than by “trial and error” is very hard. The Icinga 2 documentation is very good IMHO, and as you said, the Director documentation should be better. It would be nice to have documentation that covers all the content of the Icinga 2 docs, done with the Director.

parad1se · September 23, 2022, 10:19am

I’m glad that I found this topic. Maybe somebody can help me understand the DSL and the Director better. At the moment I’m using the Icinga 2 DSL only, and so far all works well. Now there are new requirements, for example monitoring our new Azure cloud services, and I found a solution from Icinga: the Microsoft Azure director importer module for Icinga Web 2, which seems to make it easy to integrate Azure cloud-based monitoring. Now my question: can I keep using the DSL for my hosts and services that run on premises, and use the Director with the Azure importer for the Azure cloud-based monitoring, on the same Icinga 2 instance? Is it possible to run both hand in hand? Or do I have to choose between DSL and Director?

Many thanks in advance!

Best regards
David

rivad · September 23, 2022, 6:14pm

I use the Director with some DSL sprinkled in.
Mostly for notifications, and then there are my checks that use the results of other services.
The first kind I simply manage outside of the Director; for the others I use the only place where the Director allows DSL: command arguments. To get the result back into Icinga, Linuxfabrik wrote monitoring-plugins/check-plugins/dummy at main · Linuxfabrik/monitoring-plugins · GitHub for me.

rivad · September 23, 2022, 6:15pm

Icinga is like Lego for monitoring, so spin up a test environment and start building!

parad1se · September 24, 2022, 1:25am

Hi Dominik,
thank you very much for sharing your experience. I will now definitely try out DSL together with the Director. I do not quite understand yet how exactly it works to pass DSL to the Director with the dummy check from Linuxfabrik. But I guess it will become clearer once I start working with the Director.

Do you maybe have some snippets or examples for me?

Many thanks in advance!

rivad · September 27, 2022, 12:33pm

This is, for example, a check that I use to compare 2 values with an allowed deviation, one from a DB check and the other from a file count check:

Sorry, at the bottom the code gets garbled, but in the input field you can copy and paste with proper whitespace, as you can see at Wert/Value.

Here are the two code blocks with whitespace:

// Value for the --message argument: builds the plugin output plus perfdata.

// Return the raw value of the first perfdata entry of a service.
// Keeps everything after "=" and before the "c;" unit suffix.
function get_perfdata(service) {
  var perf_value = service.last_check_result.performance_data[0].split("=")[1]
  return perf_value.split("c;")[0]
}

var output = ""
var hosts_to_compare = macro("$116_comparison_hosts$")
var service_pattern = macro("$116_comparison_service_pattern$")
var tolerance = macro("$116_comparison_tolerance$")
var values = []
var services = []

// Collect the perfdata value of every matching service on every host.
for (host in hosts_to_compare) {
  var service_names = get_services(host).map(s => s.name)
  for (service_name in service_names) {
    if (match(service_pattern, service_name)) {
      // Skip the comparison services themselves.
      if (match("Compare Services*", service_name)) {
        continue
      }
      services.add(host + "!" + service_name)
      var service = get_service(host, service_name)
      if (len(service.last_check_result.performance_data) < 1) {
        return "[UNKNOWN] '" + host + "!" + service_name + "' has no performance data value in last_check_result!"
      }
      values.add(get_perfdata(service))
    }
  }
}
if (len(values) < 2) {
  return "[UNKNOWN] less than 2 values collected!"
}

output = "[OK] all values in allowed tolerance |"

// Every value must lie within +/- tolerance of the first value.
for (value in values) {
  if (number(value) + number(tolerance) < number(values[0]) || number(value) - number(tolerance) > number(values[0])) {
    output = "[CRITICAL] value " + value + " not in allowed tolerance " + tolerance + " from first value " + values[0] + " |"
  }
}

// Append one perfdata entry per compared service.
for (entry in services) {
  var host = entry.split("!")[0]
  var service_name = entry.split("!")[1]
  var service = get_service(host, service_name)
  var value = get_perfdata(service)
  var value_min = number(values[0]) - number(tolerance)
  var value_max = number(values[0]) + number(tolerance)
  output += " '" + host + "!" + service_name + "'=" + value + ";;" + value_min + ":" + value_max + ";"
}

return output

// Value for the --state argument: same collection logic, returns only the state.

// Return the raw value of the first perfdata entry of a service.
function get_perfdata(service) {
  var perf_value = service.last_check_result.performance_data[0].split("=")[1]
  return perf_value.split("c;")[0]
}

var hosts_to_compare = macro("$116_comparison_hosts$")
var service_pattern = macro("$116_comparison_service_pattern$")
var tolerance = macro("$116_comparison_tolerance$")
var values = []

for (host in hosts_to_compare) {
  var service_names = get_services(host).map(s => s.name)
  for (service_name in service_names) {
    if (match(service_pattern, service_name)) {
      // Skip the comparison services themselves.
      if (match("Compare Services*", service_name)) {
        continue
      }
      var service = get_service(host, service_name)
      if (len(service.last_check_result.performance_data) < 1) {
        return "unk"
      }
      values.add(get_perfdata(service))
    }
  }
}
if (len(values) < 2) {
  return "unk"
}

for (value in values) {
  if (number(value) + number(tolerance) < number(values[0]) || number(value) - number(tolerance) > number(values[0])) {
    return "crit" // at least one service is not OK
  }
}

return "ok" // all is well

The dummy plugin just reflects message and state back into Icinga 2, but all the DSL computations are already done by the time dummy gets called, so message and state look like this:

'/usr/lib64/nagios/plugins/dummy' '--message' '[OK] all values in allowed tolerance | '\''host1!service1'\''=26;;24:28; '\''host2!service2'\''=27;;24:28;' '--state' 'ok'
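
To make the wiring clearer, here is a rough sketch of how such a command could look when written as plain DSL. The object name "compare-services" is made up; the plugin path and the --message and --state flags are taken from the command line above. The function bodies are shortened to placeholders, so this is an outline of the structure rather than the exact configuration the Director renders.

```icinga2
// Hypothetical command name; path and flags as in the command line above.
object CheckCommand "compare-services" {
  command = [ "/usr/lib64/nagios/plugins/dummy" ]

  arguments = {
    "--message" = {
      // A function used as an argument value is evaluated when the check
      // runs; its return value is passed to the plugin.
      value = {{
        // ... the first code block goes here ...
        return "[OK] all values in allowed tolerance |"
      }}
    }
    "--state" = {
      value = {{
        // ... the second code block goes here ...
        return "ok"
      }}
    }
  }
}
```

In the sample output, the first value is 26 and the tolerance is 2, which gives the 24:28 range shown for both services; 27 lies inside that range, so the state is ok.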