This is my first post, and I hope I have included all the information needed to explain what I am trying to achieve.
I want to monitor valid and invalid simulation processes in a distributed environment consisting of a master server, which manages the simulation jobs, and several calculating servers, which do all the work. So far I have written a bash script. It asks the master for a list of running jobs, and it asks each calculating server for the output of "ps -p 1 -p $$ -ewo user,pid,lstart,stat,command". The script is located on the master; the ps output is fetched via ssh.
After a few awk statements that compare the two lists, I get a list of jobs (pid, user, etc.) that are not known to the master server and are therefore not valid for the system as a whole. These invalid jobs consume CPU time on the calculating servers, which our customer does not want.
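For illustration, the comparison step can be sketched roughly as follows. This is not the actual script: the function name find_unknown_jobs, the two input files, and the command pattern are assumptions made for the example. It assumes the master's list has already been reduced to one PID per line for the host in question, and that the ps output has the column layout of "ps -ewo user,pid,lstart,stat,command" (lstart occupies five fields, so the command starts at field 9).

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and arguments are illustrative, not from the real script).
# $1: file with one PID per line that the master knows about (assumed format)
# $2: file with the output of: ps -ewo user,pid,lstart,stat,command
# $3: regex matched against the executable in the command column (assumed filter)
find_unknown_jobs() {
    awk -v pat="$3" '
        FILENAME == ARGV[1] { known[$1] = 1; next }  # first file: remember known PIDs
        FNR == 1            { next }                 # ps output: skip the header line
        $9 ~ pat && !($2 in known) { print }         # simulation process unknown to the master
    ' "$1" "$2"
}
```

A call such as `find_unknown_jobs master_pids.txt ps_node1.txt 'simulate'` would then print the full ps line (user, pid, start time, state, command) of every matching process that the master does not list.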
How can I do this with Icinga Web 2 and the Director? I have written several custom checks in the Director to count the simulation processes, but I have not yet found a check_procs option that extracts the start time of a process. I think it would be more elegant to rebuild the bash script using Icinga Web 2 and the Director.
How can I run my script as a custom check? It is not yet clear to me whether Icinga can invoke my bash script itself, or whether I have to start it with "while true; do…" and Icinga only monitors its output.
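To make the question concrete, this is roughly what I imagine the reporting part of such a check would look like if Icinga runs the script itself on every check interval. It is only a sketch: the function name report_unknown_jobs is made up, and it assumes the usual Monitoring Plugins convention of a single status line with optional performance data after "|", plus exit code 0 for OK and 2 for CRITICAL.

```shell
#!/usr/bin/env bash
# Hypothetical reporting step for a check plugin (name is illustrative).
# Reads the list of unknown jobs from stdin, one job per line.
report_unknown_jobs() {
    local count=0 line
    # Count non-empty lines; the "|| [ -n ... ]" also catches a last line without newline.
    while IFS= read -r line || [ -n "$line" ]; do
        if [ -n "$line" ]; then
            count=$((count + 1))
        fi
    done
    if [ "$count" -eq 0 ]; then
        echo "OK - no unknown simulation jobs | unknown_jobs=0"
        return 0   # assumed convention: 0 = OK
    fi
    echo "CRITICAL - $count unknown simulation job(s) | unknown_jobs=$count"
    return 2       # assumed convention: 2 = CRITICAL
}
```

The script would then end by piping the list of invalid jobs into this function and exiting with its return code, without any "while true" loop of its own.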
Thank you in advance for your answers. If my questions are not precise enough, please let me know.