How can I monitor whether a server is in a hung state? The best I can think of is to detect when a particular service deployed on that server goes into a delayed execution state. But I do not know how to create a service that monitors another service being delayed.
Any Director-based example, or any other thoughts on this topic, would be nice.
Our approach is to check whether a server is accepting data or answering requests. In addition, we measure the time those checks take, so even if the server accepts data, for example, but the response time exceeds the warning level, the result is a warning. And by the way, these measurements over a long period of time are a good indicator of increasing load.
How do you do that on a Linux-based server, for example? What plugins do you use? The server might be a standalone workload server with no logins for a long period of time.
Can you please help with an example of how to measure the execution time of a particular check and alert if it breaches a particular threshold?
It depends. I write many plugins myself, so I can decide which strategy is most suitable for which plugin. This is required anyway, since our products are not very common and there are no plugins available for them.
A very simple example structure to measure execution time:
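command="timeout $CRIT sh -c '<any command>'"
start_time=$(date +%s%3N)  # milliseconds, GNU date
output=$(sh -c "$command" 2>&1); RC=$?
end_time=$(date +%s%3N)
responseTime=$(( (end_time - start_time) / 1000 ))  # seconds
if [ "$RC" -ne 0 ]; then  # also happens when the timeout occurred
    echo "CRITICAL: $output"; exit 2
elif [ "$responseTime" -gt "$WARN" ]; then
    echo "WARNING: response time ${responseTime}s"; exit 1
fi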
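For reference, the same structure can be wrapped in a function that also prints the elapsed time as performance data, so the long-term trend can be graphed. This is only a sketch: the name `timed_check`, its argument order, and the use of whole seconds are assumptions for illustration, not the plugin described above. It relies on GNU `timeout` and the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# Hypothetical helper: timed_check WARN CRIT COMMAND
# WARN and CRIT are whole seconds; COMMAND is a string run via sh -c.
timed_check() {
    warn=$1
    crit=$2
    cmd=$3

    start=$(date +%s)
    # timeout stops the command after CRIT seconds and returns non-zero
    output=$(timeout "$crit" sh -c "$cmd" 2>&1)
    rc=$?
    end=$(date +%s)
    elapsed=$((end - start))

    # Performance data after the pipe: label=value[unit];warn;crit
    perf="time=${elapsed}s;${warn};${crit}"

    if [ "$rc" -ne 0 ]; then
        echo "CRITICAL - command failed or timed out (rc=$rc) | $perf"
        return 2
    elif [ "$elapsed" -gt "$warn" ]; then
        echo "WARNING - took ${elapsed}s | $perf"
        return 1
    fi
    echo "OK - took ${elapsed}s | $perf"
    return 0
}
```

As the last line of a plugin it could be called like `timed_check 5 10 '<any command>'; exit $?`.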
Hi @rsx thanks.
When a server is in a hung state, the script will obviously not execute and will stay hung. The service monitoring the script (Service_Hung_Critical) is going to be in a delayed state. Is there a way to add a notification in Icinga when the status is delayed? I think Critical, Warning, Unknown and OK are the four states that can trigger a notification. I am not sure about the delayed state of a service.
The script is executed on an Icinga node and does its check remotely against the faulty server. That's why I have timeout in the command and, as commented in the example, the return code is not 0 in such cases. Therefore, the check result is critical.
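To make the timeout case visible in the check output, the return code can be inspected: GNU `timeout` exits with 124 when it had to stop the command. A small sketch, where `probe` is a made-up helper name and the messages are placeholders:

```shell
#!/bin/sh
# Hypothetical helper: probe SECONDS COMMAND [ARGS...]
# Runs COMMAND with a deadline and reports why it failed, if it did.
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case "$rc" in
        0)
            echo "OK - target answered"
            return 0 ;;
        124)
            # 124 is the exit status GNU timeout uses when the deadline passed
            echo "CRITICAL - no answer within ${limit}s (target may be hung)"
            return 2 ;;
        *)
            echo "CRITICAL - command failed (rc=$rc)"
            return 2 ;;
    esac
}
```

A remote check could then look like `probe 10 ssh -o BatchMode=yes <host> true`, with `<host>` standing in for the server to test.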
Amazing stuff!! Now the only thing I need to do is open SSH port 22 from my Icinga server to all the other servers in the world to be able to execute the script.
The delayed tab in Icinga Web somehow shows the checks which are delayed. I guess there is no way to capture that status in a service for alerting purposes.
But at least I now know how to proceed with the solution.
You could create checks that calculate the delay based on the last check result and throw a warning or critical.
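As a rough sketch of that idea: a check compares the timestamp of the last check result with the current time and goes warning or critical when the result is too old. How the timestamp is obtained is left open here (for example from the `last_check` attribute via the Icinga 2 API, which is an assumption about your setup). The function name `check_staleness` and its arguments are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: check_staleness LAST_CHECK NOW WARN CRIT
# LAST_CHECK and NOW are UNIX timestamps in whole seconds;
# WARN and CRIT are the maximum acceptable ages in seconds.
check_staleness() {
    last=$1
    now=$2
    warn=$3
    crit=$4
    age=$((now - last))
    perf="age=${age}s;${warn};${crit}"

    if [ "$age" -ge "$crit" ]; then
        echo "CRITICAL - last check result is ${age}s old | $perf"
        return 2
    elif [ "$age" -ge "$warn" ]; then
        echo "WARNING - last check result is ${age}s old | $perf"
        return 1
    fi
    echo "OK - last check result is ${age}s old | $perf"
    return 0
}
```

It could be called as `check_staleness "$last" "$(date +%s)" 600 900`, where `$last` holds the timestamp of the monitored service's last result. If that timestamp has a fractional part, strip it first, e.g. with `${last%.*}`. Since this check runs on the Icinga side and never touches the hung server, it still returns a normal state that can trigger a notification.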