Distributed monitoring in compute cluster

This is not really about a problem; it's more that I'm seeking some advice before I get going, so I don't create a mess.

I have a working Icinga 2 setup with a number of nodes reporting in via what used to be called passive checks; this works fine. I am now building a compute cluster with Slurm as the workload manager; each node boots via PXE into Debian 12.

When I set up a new node, I use icinga2 node wizard, and it asks for the common name, which will be the FQDN of the node. However, the compute nodes are diskless servers, so the agent has to be configured in the shared boot image, and the individual node's name isn't known at the time the image is built. The question then is: is this possible at all? Can I just use the same common name for all the nodes, or will that cause a conflict somewhere?
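For reference, this is roughly what the wizard ends up writing on one of my existing agents (hostnames are just examples); the certificates under /var/lib/icinga2/certs/ are also named after that CN, which is why I suspect a shared CN would make all compute nodes look like the same endpoint to the master:

```
# /etc/icinga2/constants.conf (excerpt, example hostname)
const NodeName = "node01.example.com"
const ZoneName = "node01.example.com"

# /etc/icinga2/zones.conf (excerpt)
object Endpoint "master.example.com" {
    host = "master.example.com"
    port = "5665"
}

object Zone "master" {
    endpoints = [ "master.example.com" ]
}

object Endpoint "node01.example.com" {
}

object Zone "node01.example.com" {
    endpoints = [ "node01.example.com" ]
    parent = "master"
}
```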

Agent nodes also have their own unique zone. By convention you must use the FQDN for the zone name. Details can be found here.
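On the master side that means one Endpoint/Zone pair per agent in zones.conf, named after the agent's FQDN, along these lines (placeholder hostname):

```
object Endpoint "node01.example.com" {
    // no "host" attribute needed if the agent connects to the master
}

object Zone "node01.example.com" {
    endpoints = [ "node01.example.com" ]
    parent = "master"
}
```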

But it doesn’t answer the question of how to monitor diskless nodes in a netboot cluster with a read-only image. We’re struggling with this too, so is there any update on this case?

  • Use the agent, but auto-configure it on every boot via the Director self-service API (a rough non-Director sketch of the boot-time setup follows after this list)
  • Use the Director to generate the hosts in Icinga 2 and then use check_by_ssh or SNMP
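For the first option, the same idea also works without the Director: pre-create a ticket for each node on the CA master (icinga2 pki ticket --cn <fqdn>) or use on-demand CSR signing, and run a non-interactive node setup from a boot script. A rough sketch only, assuming the hostnames, paths and ticket handling below, and that /etc/icinga2 and /var/lib/icinga2 are writable on the diskless node (e.g. tmpfs/overlay):

```sh
#!/bin/sh
# Boot-time Icinga 2 agent setup on a diskless node.
# Hostnames, paths and the ticket source are examples, not a fixed recipe.

FQDN="$(hostname -f)"                  # node name assigned via DHCP/DNS at boot
MASTER="icinga-master.example.com"     # config/CA master
TICKET="$(cat /run/icinga2.ticket)"    # ticket delivered to the node beforehand

mkdir -p /var/lib/icinga2/certs

# Fetch the master's certificate so the agent can trust it
icinga2 pki save-cert \
    --host "$MASTER" --port 5665 \
    --trustedcert /var/lib/icinga2/certs/trusted-parent.crt

# Non-interactive equivalent of "icinga2 node wizard"
icinga2 node setup \
    --cn "$FQDN" \
    --zone "$FQDN" \
    --endpoint "$MASTER" \
    --parent_host "$MASTER" \
    --parent_zone master \
    --ticket "$TICKET" \
    --trustedcert /var/lib/icinga2/certs/trusted-parent.crt \
    --accept-commands --accept-config \
    --disable-confd

systemctl restart icinga2
```

The master still needs a matching Endpoint/Zone object (and a Host object) per node, so the node names have to be known on the master in any case.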

I don’t want to use the Director, so could I use wildcard certificates instead?
Another option for me would be to pre-generate all the certificates and mount the proper one on each node via NFS. What do you think?
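To be concrete about the NFS idea, this is roughly what I had in mind for pre-generating the certificates on the CA master (node names, count and export path are just examples); each diskless node would then mount its own directory at boot and copy the three files into /var/lib/icinga2/certs/:

```sh
#!/bin/sh
# On the CA master: pre-generate one signed certificate per compute node.
# Node names and the NFS-exported path are examples.

for i in $(seq -w 1 64); do
    fqdn="node${i}.cluster.example.com"
    dir="/srv/icinga2-certs/${fqdn}"   # exported via NFS, one directory per node
    mkdir -p "$dir"

    # Create the node's key and CSR, then sign the CSR with the Icinga CA
    icinga2 pki new-cert --cn "$fqdn" \
        --key "$dir/$fqdn.key" --csr "$dir/$fqdn.csr"
    icinga2 pki sign-csr --csr "$dir/$fqdn.csr" --cert "$dir/$fqdn.crt"

    # The node also needs the CA certificate to verify the master
    cp /var/lib/icinga2/ca/ca.crt "$dir/ca.crt"
done
```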