API process-check-result randomly returns 404

Summary

The API process-check-result randomly returns 404.
Icinga2 r2.10.2-1, Debian Stretch.

Background

We’re currently monitoring about a hundred hosts representing about 400 services.
The hosts are passive, meaning Icinga2 doesn’t do any active checks on them.
Everything comes from the hosts themselves. A bash script executes some checks and submit the result to Icinga2 with a curl command.
The checks are performed every minutes or so, and the result are sent consecutively.
There’s no delay between the submissions, apart from the duration of the check itself.

Most of the time the checkresults are properly processed by Icinga2 (API answers 200, sevice status is updated correctly).
But rather frequently Icinga returns 404 {“error”:404.0,“status”:“No objects found.”}.

There’s nothing relevant in Icinga2’s logs, or debuglog.
Incidentaly, the debuglog is a pain to decipher as everything is intertwined, there’s no way to tell what line belongs to what process/api call.

My only guess is that the API has a hard time dealing with simultaneous connections as all the check_results arrive at about the same moment.

Possibly related : https://github.com/Icinga/icinga2/issues/6682

icinga2 --version

expand
icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.2-1)

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-6-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: 486e413fb159

Config

expand
template Service "passive-service" {
  enable_passive_checks = true
  enable_active_checks = true
  max_check_attempts = 3
  check_interval = 5m
	retry_interval = 1m
  vars.dummy_state = 2
  vars.dummy_text = {{
    var service = get_service(macro("$host.name$"), macro("$service.name$"))
    var lastCheck = DateTime(service.last_check).to_string()
    return "No check results received. Last result time: " + lastCheck
  }}
	enable_flapping = true
}

apply Service "passive_check_disk" {
  import "passive-service"
  display_name = "Disk"
  check_command = "passive"
  assign where host.vars.passive_check_disk
}

template Host "passive-host" {
  check_command = "dummy"
  max_check_attempts = 3
  check_interval = 5m
  retry_interval = 1m
  vars.dummy_state = 2
  vars.dummy_text = {{
    var lastCheck = DateTime(host.last_check).to_string()
    return "Host is down since: " + lastCheck
  }}
}

object Host "test-host" {
	import "passive-host"
	groups = [ "test-group" ]
	display_name = "TEST host"
	host.vars.passive_check_disk = true
}

Check result

Example of data that’s sent to Icinga2 :

{ “exit_status”:0, “check_command”: ["/usr/lib/nagios/plugins/check_disk","-w",“20%”,"-c",“10%”,"-p","/"], “plugin_output”: “DISK OK - free space: / 1375 MB (26% inode=62%);”, “performance_data”: ["/=3825MB;4402;4952;0;5503"], “execution_start”: 1549555681.439195, “execution_end”: 1549555681.443452 }

The data is sent like this :

curl -s -u “${ICINGA2API_USER}:${ICINGA2API_PASSWORD}” -H ‘Accept: application/json’ -X POST “${url}” -d “${data}”

Debuglog extract

expand
[2019-02-07 16:08:01 +0100] notice/ApiListener: New HTTP client
[2019-02-07 16:08:01 +0100] debug/HttpRequest: line: POST /v1/actions/process-check-result?service=test-host%21passive_check_disk HTTP/1.1, tokens: 3
[2019-02-07 16:08:01 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result?service=test-host%21passive_check_disk (from [123.123.123.123]:43257), user: test_remote)
[2019-02-07 16:08:01 +0100] notice/WorkQueue: Spawning WorkQueue threads for 'HttpServerConnection'
...
about 50 lines further
...
[2019-02-07 16:08:01 +0100] notice/ApiActionHandler: Running action process-check-result
[2019-02-07 16:08:01 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-02-07 16:08:01 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2019-02-07 16:08:01 +0100] debug/ApiEvents: Processing event type 'CheckResult'.
[2019-02-07 16:08:01 +0100] debug/HttpUtility: Request body: '{ "exit_status":0, "check_command": ["/usr/lib/nagios/plugins/check_disk","-w","20%","-c","10%","-p","/"], "plugin_output": "DISK OK - free space: / 1375 MB (26% inode=62%);", "performance_data": ["/=3825MB;4402;4952;0;5503"], "execution_start": 1549552081.585619, "execution_end": 1549552081.591264 }'
...
about 50 lines further
...
[2019-02-07 16:08:01 +0100] notice/ApiActionHandler: Running action process-check-result
[2019-02-07 16:08:01 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'passive',  check_source = E'icinga2.domain.com',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'0',  event_handler_enabled = E'1',  execution_time = E'0.005645',  flap_detection_enabled = E'1',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC',  last_hard_state = E'0',  last_hard_state_change = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC',  last_notification = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC',  last_time_critical = TO_TIMESTAMP(1549534670) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC',  latency = E'0',  long_output = E'',  max_check_attempts = E'5',  next_check = TO_TIMESTAMP(1549552381) AT TIME ZONE 'UTC',  next_notification = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC',  normal_check_interval = E'5',  notifications_enabled = E'1',  original_attributes = E'null',  output = E'DISK OK - free space: / 1375 MB (26% inode=62%);',  passive_checks_enabled = E'1',  percent_state_change = E'0',  perfdata = E'/=3825MB;4402;4952;0;5503',  problem_has_been_acknowledged = E'0',  process_performance_data = E'1',  retry_check_interval = E'2',  scheduled_downtime_depth = E'0',  service_object_id = 1741,  should_be_scheduled = E'1',  state_type = E'1',  status_update_time = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC' WHERE service_object_id = 1741

Questions

  • What could cause the 404 ?
  • How can we fix it ?
  • How to make the debug log actually useful ?

Hello,

mybe a quick workaround is to loop the checkresult send part until you get a 200 return code or the loop tried it for X times.

For the debuglog, use Elastic Search or Graylog :slight_smile:
Regards,
Carsten

Thanks for the suggestion.

That is indeed a quick workaround. But that won’t do for two reasons : we’re about to monitor a few hundreds more machines, and hitting the API with more requests will most likely makes the situation worse than it already is.

As for the debuglog, feeding it to ES or Graylog won’t solve the fact that there’s nothing in the log to follow the life cycle of an API request. And the timestamps are not precise enough imo, but that’s another matter.

Hi,

thanks for the nice formatting and description, that helps a lot! :slight_smile:

Question aside, this isn’t reproducible when just a few connects and actions happen, right?

That’s unfortunately a known problem and not an easy one to fix, i.e. to have structured logging and so on. We need to replace our core logging functionality, and that’s something for the coming months or years.

Can you create a full gdb backtrace from the running process please? The commands are described in the development docs.

That way we can inspect whether there are deadlocks in the threads.

And if you can, I’d appreciate you testing the snapshot packages which include fixes for this problem.

Cheers,
Michael

I don’t know. I’ll have to set up a test environment to, well, do some testing.

Yeah, it’s always difficult when there’s so many things going on at the same time. We have the exact same gripe with postfix’s logs which are sometimes uterly pointless when the information is spread all over the place with (sometimes) no reliable identifier. (I’m looking at you, TLS connection status!)
Good luck for this !

I’ll see what I can do. There’s hoping I can provide you with something next week or so, but no promise.

Ok, so, I set up a test machine and tried flooding it with a thousand simultaneous checkresults. Couldn’t produce a 404. That said, that test had obvious flaws : the config contained only up to 4 hosts and the checkresults were sent and received locally.

I’ve upgraded our production Icinga2 install to the latest 2.10.4-1.stretch version, just in case, but the 404 still occur.

Here’s a full gdb backtrace :


When running gdb, I got two errors :
28 …/sysdeps/unix/sysv/linux/recvmsg.c: No such file or directory.
84 …/sysdeps/unix/syscall-template.S: No such file or directory.
Is that relevant in any way ?

Hi,

thanks for your effort :slight_smile:

No, that occurs here too. gdb attempts to load symbols from all used locations, and some might not be available. Those symbols help to get the function names and parameters in stacktraces.

In terms of the underlaying problem, I am convinced that this is directly related to the API not being able to handle the requests. I’ve written a full analysis here: https://github.com/Icinga/icinga2/issues/7041

Your backtrace shows that you’re using API event streams too.

  • Thread 26 (Thread 0x2b7f21a0c700 (LWP 28588)) waits for more events
  • epoll_wait is fine, the socket IO threads are all idle
  • the global application and apilistener thread pool is idle, waiting for new tasks
  • Accept on the TCP socket hangs (or we did catch this in this specific moment) in thread 9
  • Thread 1 is the main event loop, just fine
  • Thread 27 til 30 closes a http connection, but there’s still data in the workqueue which is why Join() waits. This would mean that 4 more HTTP connections are established right now.

How many connections are established to Icinga on port 5665? netstat -tulpen | grep 5665. Does your client application close the request itself, or are you sending the Connection: close header? What is the API event stream for, how is that used?

Cheers,
Michael

Thanks for the follow up.

you’re using API event streams too.

That’s right, I should have mentionned it. Although it’s fairly recent and the 404 already happened prior to that.
Since icinga2 doesn’t allow to send the checkresults on a syslog facility natively (a feature request maybe :wink: ), we use the stream API to get them. The stream API is called via a dead simple service written in bash. It’s basically just a curl call that feeds the checkresults to rsyslog.

Does your client application close the request itself, or are you sending the Connection: close header?

All the requests are made using curl, with no particular headers sent.

How many connections are established to Icinga on port 5665?

Most of the time, there’s just the one permanent connection from our service. Then every minutes there’s a bunch of connections coming from the hosts we monitor.
I’ve run a netstat -an |grep 5665 -c every milliseconds to see how many connections the server was receiving.
It never went really high, the peak was 65 (wich is surprising considering a hundred machines make 4+ requests each to post their checkresults at the same time).

What more info can I give you to unravel this ?

The API event streams are fine, also the wait loop in the stacktrace. This normally doesn’t influence anything except waiting for new events. A syslog streamer isn’t something we’d implement this time without any structured logging being available.

The script you’re using to pass the checkresults, can you share it? :slight_smile: Particularly I’m interested in the way requests are fired.

Cheers,
Michael

There’s nothing much in there. I tried to make it as simple as possible. (and I should put it on github…)

script
#!/bin/bash

# version 11
# A script to perform monitoring checks and send the results to Icinga2.
# Loosely based on a script by Mark Janssen.

CONF_FILEPATH="$(dirname "$0")/monitor_conf.rc"
CHECKS_CONF_FILEPATH="$(dirname "$0")/monitor_checks.rc"
ONLY_HOST_ALIVE=0
FORCE_ENGLISH=0
CHECKS=()
LOG_FILE='/var/log/montiring.log'
LOCK_DIR='/var/run/monitor'
LOCK_FILE='pid'
INSECURE=''
VERBOSE=0
TEST=0

# $1=string
# $2=remove trailing slash
addOrRemoveTrailingSlash(){
	local STR="$1"
	length=${#STR}
	last_char=${STR:length-1:1}
	if [ -n "$2" ]; then
		[[ $last_char == "/" ]] && STR=${STR:0:length-1}
	else
		[[ $last_char != "/" ]] && STR="$STR/"
	fi
	echo "$STR"
}

# $1=string
trim(){
	shopt -s extglob
	# trim leading
	text=${1##+([[:space:]])}
	# trim trailing
	echo "${text%%+([[:space:]])}"
}

log(){
	if [ $VERBOSE -eq 1 ]; then
		printf "[%s]\t%s %s\n" "$(date -u +'%F %T')" "$LOCK_FILE" "$1" > /dev/tty
	fi
	printf "[%s]\t%s %s\n" "$(date -u +'%F %T')" "$LOCK_FILE" "$1" >> "$LOG_FILE"
}

cleanup() {
	while read -r lockfile; do
		if other_pid=$(cat "$lockfile" &>/dev/null) && ! kill -0 "$other_pid" &>/dev/null; then
			# log "Lock ${lockfile} is stale, remove it"
			rm -f "$lockfile"
		fi
	done < <(find "$LOCK_DIR" -type f)
}

# $1=string check type (service or host)
# $2=string checkname
# $3=string data
sendCheckResult(){
	if [ "$1" = "host" ]; then
		url="${ICINGA2API_URL}:${ICINGA2API_PORT}/v1/actions/process-check-result?host=${HOSTNAME}"
		if [ $TEST -eq 1 ]; then
			echo -e "\n"
			echo "URL: ${url}"
		else
			curl "$INSECURE" -s -u "${ICINGA2API_USER}:${ICINGA2API_PASSWORD}" -H 'Accept: application/json' -X POST "${url}" -d '{ "exit_status":0, "plugin_output": "Host is up" }' -w '\t%{url_effective}\t%{http_code}\t'
		fi
	elif [ "$1" = "service" ]; then
		# urlencode spaces in checkname, the rest shouldn't contain funky characters
		service_name="${2// /%20}"
		url="${ICINGA2API_URL}:${ICINGA2API_PORT}/v1/actions/process-check-result?service=${HOSTNAME}%21${service_name}"
		if [ $TEST -eq 1 ]; then
			echo -e "\n"
			echo "URL: ${url}"
			echo "DATA: ${3}"
		else
			curl "$INSECURE" -s -u "${ICINGA2API_USER}:${ICINGA2API_PASSWORD}" -H 'Accept: application/json' -X POST "${url}" -d "${3}" -w '\t%{url_effective}\t%{http_code}\t'
		fi
	else
		echo "Undefined check type"
		exit 1
	fi
}

usage(){
	echo  """
	Usage: bash $0 [-a path_to_main_conf -A -c path_to_check_conf -i -l -p lock_file_name -t -v]
		Options:
		-a  optional path to the main conf file (default: $CONF_FILEPATH)
		-A  only send host alive result
		-c  optional path to the checks conf file (default: $CHECKS_CONF_FILEPATH)
		-i  'insecure' SSL connections and transfers. Useful when icinga2 uses a self-signed cert
		-l  force language to English during check
		-p  lock file name (default: $LOCK_FILE)
		-t  test only, don't send to icinga2
		-v  verbose
	"""
}

while getopts "a:Ac:hilp:tv" OPTION; do
	case $OPTION in
		a)
			CONF_FILEPATH="$OPTARG"
			;;
		A)
			ONLY_HOST_ALIVE=1
			;;
		c)
			CHECKS_CONF_FILEPATH="$OPTARG"
			;;
		i)
			INSECURE='--insecure'
			;;
		l)
			FORCE_ENGLISH=1
			;;
		p)
			LOCK_FILE="$OPTARG"
			;;
		t)
			TEST=1
			;;
		v)
			VERBOSE=1
			;;
		h|*)
			usage
			exit 3
			;;
	esac
done

if [ ! -e "$CONF_FILEPATH" ]; then
	echo "File $CONF_FILEPATH path does not exist"
	exit 1
fi

if [ ! -e "$CHECKS_CONF_FILEPATH" ]; then
	echo "File $CHECKS_CONF_FILEPATH path does not exist"
	exit 1
fi

if [ -z "$LOCK_FILE" ]; then
	echo "Lock file parameter is empty"
	exit 1
fi

if [[ "$LOCK_FILE" =~ \.|/|\\ ]]; then
	echo "Lock file parameter must be a single word, without dot nor slash"
	exit 1
fi

# Load settings
. "$CONF_FILEPATH"
. "$CHECKS_CONF_FILEPATH"

LOCK_DIR="$(addOrRemoveTrailingSlash "$LOCK_DIR" 'REMOVE')"
LOCK_FULL_PATH="${LOCK_DIR}/${LOCK_FILE}"

if [ -n "$TIMEOUT" ]; then
	PREPEND="/usr/bin/timeout ${TIMEOUT} "
else
	PREPEND=""
fi

CHECKPATH="$(addOrRemoveTrailingSlash "$CHECKPATH")"
if [ ! -d "$CHECKPATH" ]; then
	echo "Checkpath $CHECKPATH does not exist"
	exit 1
fi

CUSTOM_CHECKPATH="$(addOrRemoveTrailingSlash "$CUSTOM_CHECKPATH")"
if [ ! -d "$CUSTOM_CHECKPATH" ]; then
	echo "Checkpath $CUSTOM_CHECKPATH does not exist"
	exit 1
fi

# prepare lock dir
if [ ! -d "$LOCK_DIR" ]; then
	mkdir "$LOCK_DIR"
fi

# Acquire lock...
if [ ! -e "$LOCK_FULL_PATH" ]; then
	# cleanup on exit
	trap 'ECODE=$?;
		cleanup;
		rm -f "$LOCK_FULL_PATH"' 0

	echo "$$" >"${LOCK_FULL_PATH}"

	# the following handler will exit the script on receiving these signals
	# the trap on "0" (EXIT) from above will be triggered by this trap's "exit" command!
	trap 'echo "[$0] Killed by a signal." >&2
		exit 3' 1 2 3 15

	if [ $ONLY_HOST_ALIVE -eq 1 ]; then
		log "hostalive $(sendCheckResult 'host')"
	else
		log "hostalive $(sendCheckResult 'host')"
		# Run actual checks
		for CHECK in "${CHECKS[@]}"; do
			CHECKNAME=$(echo "${CHECK}" | awk -F\| '{print $1}')
			CHECKCMD=$(echo "${CHECK}" | awk -F\| '{print $2}')
			BINNAME=$(echo "$CHECKCMD" |cut -d ' ' -f1)
			CHECKARGS=$(echo "$CHECKCMD" |cut -d ' ' -f2-)
			BINPATH="${CHECKPATH}$BINNAME"
			if [ ! -e "$BINPATH" ]; then
				BINPATH="${CUSTOM_CHECKPATH}$BINNAME"
				if [ ! -e "$BINPATH" ]; then
					log "File not found for ${CHECKNAME} in plugin folders: ${CHECKPATH} or ${CUSTOM_CHECKPATH}"
					continue
				fi
			fi

			# enforce check in english
			if [ "$FORCE_ENGLISH" -eq 1 ]; then
				OLDLANG=$LANG
				LANG=en_US
			fi
			execution_start=$(echo "scale=6; $(date +%s%N) /1000000000" |bc)
			check_output=$("${PREPEND}${BINPATH}" ${CHECKARGS})
			RETVAL=$?
			execution_end=$(echo "scale=6; $(date +%s%N) /1000000000" |bc)
			if [ "$FORCE_ENGLISH" -eq 1 ]; then
				LANG=$OLDLANG
			fi

			# split arguments
			check_args=$(echo "$CHECKARGS"|cut -d' ' -s -f1- --output-delimiter '","')
			check_command="\"${BINPATH}\""
			if [ -n "$check_args" ]; then
				check_command="${check_command},\"${check_args}\""
			fi

			plugin_output=$(echo "${check_output}"|cut -d'|' -f1)
			plugin_output=$(trim "${plugin_output}")
			performance_data=$(echo "${check_output}"|cut -d'|' -s -f2-)
			performance_data=$(trim "${performance_data}")
			# split performance_data
			if [ -n "${performance_data}" ]; then
				performance_data="\"$(echo "${performance_data}"|cut -d' ' -f1- --output-delimiter '","')\""
			fi

			data="{ \"exit_status\":${RETVAL}, \"check_command\": [${check_command}], \"plugin_output\": \"${plugin_output}\", \"performance_data\": [${performance_data}], \"execution_start\": ${execution_start}, \"execution_end\": ${execution_end} }"

			log "${CHECKNAME} ${RETVAL} ${data} $(sendCheckResult 'service' "${CHECKNAME}" "${data}")"
		done
	fi
else # lock failed, now check if the other PID is alive
	if [ -e "$LOCK_FULL_PATH" ]; then
		if ! OTHERPID=$(cat "$LOCK_FULL_PATH" &>/dev/null); then
			log "Can't read lock ${LOCK_FULL_PATH}."
			exit 1
		fi

		if ! kill -0 "$OTHERPID" &>/dev/null; then
			log "Lock ${LOCK_FULL_PATH} is stale, remove it and restart"
			rm -f "$LOCK_FULL_PATH"
			exec "$0" "$@"
		else
			log "Lock ${LOCK_FULL_PATH} is valid and OTHERPID is active - exit, we're locked!"
			exit 2
		fi
	else
		log "Lock ${LOCK_FULL_PATH} doesn't exist. Restart"
		exec "$0" "$@"
	fi
fi

There’s supposed to be two conf files with it. The main config is like this:

conf
# HOSTNAME corresponds to the host from Icinga2 config
HOSTNAME="$(hostname -s)"
CHECKPATH='/usr/lib/nagios/plugins/'
CUSTOM_CHECKPATH='/root/scripts/monitor/plugins'
# TIMEOUT=10
LOG_FILE='/var/log/monitoring.log'
ICINGA2API_URL='https://icinga2.domain.com'
ICINGA2API_PORT='5665'
ICINGA2API_USER='icinga2_api_user'
ICINGA2API_PASSWORD='icinga2_api_password'

And the other one that contains an array of checks to perform. Eg :

checks
CHECKS=(
'passive_check_disk|check_disk -w 20% -c 10% -p /'
'passive_check_load|check_load -w 10,5,2 -c 15,20,10'
'passive_check_procs|check_procs -w 300 -c 350'
'passive_check_memory|check_mem -w 85 -c 95'
)

Do you think there’s something wrong with my script ?

Sorry, really busy this time working on 2.11. The script looks sane, I’m still wondering how the 404 gets constructed. This may be a side effect of problems in the current network stack.

On Monday, we’ve merged the first rewrite steps towards a new network I/O implementation into git master, and pushed snapshot packages, also for Debian Stretch. If you happen to have a test environment, please install these packages and try to reproduce the error you’re seeing.

Cheers,
Michael

Iam not quite sure if you have to encode the !, if you escape the URL right for the bash it is not needed.

url="${ICINGA2API_URL}:${ICINGA2API_PORT}/v1/actions/process-check-result?service=${HOSTNAME}!${service_name}"

Anyways i use filters, because some customers has hostnames and/or services with whitespaces etc. With filters you dont need to escape anything.

curl -k -s -u root:icinga -H 'Accept: application/json' -X POST 'https://localhost:5665/v1/actions/process-check-result' \
-d '{ "filter": "host.name==\"example.localdomain\"", "type": "Host", "exit_status": 1, "plugin_output": "Host is not available." }'

Thought about this as well, and was about to suggest to use filter instead. Still, ! may be a special character on the shell.

Let’s stick to English btw :kissing_heart:

Ups sorry, was not thinking on it. change inc asap :slight_smile:

1 Like

There is one thing to keep in mind : the API returns specifically {“error”:404.0,“status”:“No objects found.”}. This particular status string is found in just a few source code files, when there’s an exception during FilterUtility::GetFilterTargets, so it would seem that it’s really during the retrieval of the target object that the problem lies.

Hey, I hadn’t thought of that, thanks!
I’ll modify the script on a few hosts to use filters instead of query params to see if that changes anything. We never know.