Monitoring of a grid engine

Hi,

This is my first post, and I hope I have included all the information needed to explain what I am trying to achieve.

The Situation:
I want to monitor valid and invalid simulation processes in a distributed environment consisting of a master server, which manages the simulation jobs, and several calculation servers, which do all the work. So far I have written a bash script. On the one hand, it asks the master for a list of running jobs. On the other hand, it asks each calculation server for the output of “ps -p 1 -p $$ -ewo user,pid,lstart,stat,command”. The script is located on the master; the ps output is fetched via ssh.

After some awk statements that compare both lists, I get a list of jobs (pid, user, etc.) that are not known to the master server and are therefore not valid for the system as a whole. These invalid jobs consume CPU time on the calculation servers, which our customer does not want.
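To make the comparison step concrete, here is a minimal sketch of how it could look. It is not the actual script: the function name find_invalid_jobs and the format of the master's list (a file with one known PID per line) are assumptions made for illustration. The ps file is assumed to have the column layout of the command above, with the PID in the second column and a header line at the top.

```shell
#!/usr/bin/env bash
# Print the ps lines whose PID is not in the master's list of known jobs.
# $1: file with one known PID per line (hypothetical format of the master list)
# $2: file with ps output (user,pid,lstart,stat,command), header line included
find_invalid_jobs() {
    awk '
        # First file: remember every PID the master knows about.
        FILENAME == ARGV[1] { known[$1] = 1; next }
        # Second file: skip the ps header line.
        FNR == 1 { next }
        # Print every remaining line whose PID (column 2) is unknown.
        !($2 in known)
    ' "$1" "$2"
}
```

Called as find_invalid_jobs known_pids.txt ps_output.txt, this prints one line per process the master does not know. In practice the ps output would first have to be narrowed down to the simulation processes, since it lists every process on the server.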

My Questions:

  1. How can I do this with icingaweb and the director? I wrote several custom checks within the director to count the simulation processes, but I haven't yet found a check_procs option to extract the start time of a process. In my opinion it would be more elegant to rebuild the bash script using icingaweb2 and the director.

  2. How can I run my script as a custom check? It's not yet clear to me whether Icinga can invoke my bash script, or whether I have to start it myself with “while true; do…” so that Icinga only monitors its output.

Thank you in advance for your answers. If my questions are not precise enough, please give me a hint.

Martin

check_procs cannot filter processes by start time or run time, and correlating the jobs would be much harder in Icinga than in your own script. This is what makes Icinga so powerful: you can implement your own very specific checks quite simply.

Depending on how your script is written at the moment, you may only have to adjust the return codes and the output a little to make it a valid check plugin. Have a look at https://www.monitoring-plugins.org/doc/guidelines.html for guidance.
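As a rough illustration of what "adjusting return codes and output" means, here is a minimal sketch. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the "status text | performance data" output line follow the guidelines linked above. The function name report_invalid_jobs, the label invalid_jobs and the thresholds are made up for the example and would need to be adapted to your script.

```shell
#!/usr/bin/env bash
# Exit codes as defined by the Monitoring Plugins guidelines.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Print one status line with performance data and return the plugin state.
# $1: number of invalid jobs, $2: warning threshold, $3: critical threshold
# (function name, label and thresholds are hypothetical)
report_invalid_jobs() {
    local count=$1 warn=$2 crit=$3
    # Anything that is not a plain non-negative integer is reported as UNKNOWN.
    case $count in
        ''|*[!0-9]*)
            echo "UNKNOWN - could not determine the number of invalid jobs"
            return "$STATE_UNKNOWN"
            ;;
    esac
    local perf="invalid_jobs=${count};${warn};${crit};0"
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - ${count} invalid simulation jobs | ${perf}"
        return "$STATE_CRITICAL"
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - ${count} invalid simulation jobs | ${perf}"
        return "$STATE_WARNING"
    fi
    echo "OK - ${count} invalid simulation jobs | ${perf}"
    return "$STATE_OK"
}
```

At the end of your script you would then call something like report_invalid_jobs "$count" 1 5 followed by exit $?, where $count is the number of invalid jobs your comparison found. Icinga runs the script once per check interval and evaluates the exit code and output, so no “while true” loop is needed.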

Afterwards you have to create a check command definition, which can be done using the file-based configuration (see https://icinga.com/docs/icinga2/latest/doc/09-object-types/#checkcommand) or the director (“Icinga director > Commands > Commands > Add command”, if I remember correctly).

This command can then be used in a service like any other.