API process-check-result randomly returns 404

api

#1

Summary

The API process-check-result randomly returns 404.
Icinga2 r2.10.2-1, Debian Stretch.

Background

We’re currently monitoring about a hundred hosts representing about 400 services.
The hosts are passive, meaning Icinga2 doesn’t do any active checks on them.
Everything comes from the hosts themselves. A bash script executes some checks and submit the result to Icinga2 with a curl command.
The checks are performed every minutes or so, and the result are sent consecutively.
There’s no delay between the submissions, apart from the duration of the check itself.

Most of the time the checkresults are properly processed by Icinga2 (API answers 200, sevice status is updated correctly).
But rather frequently Icinga returns 404 {“error”:404.0,“status”:“No objects found.”}.

There’s nothing relevant in Icinga2’s logs, or debuglog.
Incidentaly, the debuglog is a pain to decipher as everything is intertwined, there’s no way to tell what line belongs to what process/api call.

My only guess is that the API has a hard time dealing with simultaneous connections as all the check_results arrive at about the same moment.

Possibly related : https://github.com/Icinga/icinga2/issues/6682

icinga2 --version

expand
icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.2-1)

System information:
  Platform: Debian GNU/Linux
  Platform version: 9 (stretch)
  Kernel: Linux
  Kernel version: 4.9.0-6-amd64
  Architecture: x86_64

Build information:
  Compiler: GNU 6.3.0
  Build host: 486e413fb159

Config

expand
template Service "passive-service" {
  enable_passive_checks = true
  enable_active_checks = true
  max_check_attempts = 3
  check_interval = 5m
	retry_interval = 1m
  vars.dummy_state = 2
  vars.dummy_text = {{
    var service = get_service(macro("$host.name$"), macro("$service.name$"))
    var lastCheck = DateTime(service.last_check).to_string()
    return "No check results received. Last result time: " + lastCheck
  }}
	enable_flapping = true
}

apply Service "passive_check_disk" {
  import "passive-service"
  display_name = "Disk"
  check_command = "passive"
  assign where host.vars.passive_check_disk
}

template Host "passive-host" {
  check_command = "dummy"
  max_check_attempts = 3
  check_interval = 5m
  retry_interval = 1m
  vars.dummy_state = 2
  vars.dummy_text = {{
    var lastCheck = DateTime(host.last_check).to_string()
    return "Host is down since: " + lastCheck
  }}
}

object Host "test-host" {
	import "passive-host"
	groups = [ "test-group" ]
	display_name = "TEST host"
	host.vars.passive_check_disk = true
}

Check result

Example of data that’s sent to Icinga2 :

{ “exit_status”:0, “check_command”: ["/usr/lib/nagios/plugins/check_disk","-w",“20%”,"-c",“10%”,"-p","/"], “plugin_output”: “DISK OK - free space: / 1375 MB (26% inode=62%);”, “performance_data”: ["/=3825MB;4402;4952;0;5503"], “execution_start”: 1549555681.439195, “execution_end”: 1549555681.443452 }

The data is sent like this :

curl -s -u “${ICINGA2API_USER}:${ICINGA2API_PASSWORD}” -H ‘Accept: application/json’ -X POST “${url}” -d “${data}”

Debuglog extract

expand
[2019-02-07 16:08:01 +0100] notice/ApiListener: New HTTP client
[2019-02-07 16:08:01 +0100] debug/HttpRequest: line: POST /v1/actions/process-check-result?service=test-host%21passive_check_disk HTTP/1.1, tokens: 3
[2019-02-07 16:08:01 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result?service=test-host%21passive_check_disk (from [123.123.123.123]:43257), user: test_remote)
[2019-02-07 16:08:01 +0100] notice/WorkQueue: Spawning WorkQueue threads for 'HttpServerConnection'
...
about 50 lines further
...
[2019-02-07 16:08:01 +0100] notice/ApiActionHandler: Running action process-check-result
[2019-02-07 16:08:01 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-02-07 16:08:01 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2019-02-07 16:08:01 +0100] debug/ApiEvents: Processing event type 'CheckResult'.
[2019-02-07 16:08:01 +0100] debug/HttpUtility: Request body: '{ "exit_status":0, "check_command": ["/usr/lib/nagios/plugins/check_disk","-w","20%","-c","10%","-p","/"], "plugin_output": "DISK OK - free space: / 1375 MB (26% inode=62%);", "performance_data": ["/=3825MB;4402;4952;0;5503"], "execution_start": 1549552081.585619, "execution_end": 1549552081.591264 }'
...
about 50 lines further
...
[2019-02-07 16:08:01 +0100] notice/ApiActionHandler: Running action process-check-result
[2019-02-07 16:08:01 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'passive',  check_source = E'icinga2.domain.com',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'0',  event_handler_enabled = E'1',  execution_time = E'0.005645',  flap_detection_enabled = E'1',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC',  last_hard_state = E'0',  last_hard_state_change = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC',  last_notification = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC',  last_time_critical = TO_TIMESTAMP(1549534670) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC',  latency = E'0',  long_output = E'',  max_check_attempts = E'5',  next_check = TO_TIMESTAMP(1549552381) AT TIME ZONE 'UTC',  next_notification = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC',  normal_check_interval = E'5',  notifications_enabled = E'1',  original_attributes = E'null',  output = E'DISK OK - free space: / 1375 MB (26% inode=62%);',  passive_checks_enabled = E'1',  percent_state_change = E'0',  perfdata = E'/=3825MB;4402;4952;0;5503',  problem_has_been_acknowledged = E'0',  process_performance_data = E'1',  retry_check_interval = E'2',  scheduled_downtime_depth = E'0',  service_object_id = 1741,  should_be_scheduled = E'1',  state_type = E'1',  status_update_time = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC' WHERE service_object_id = 1741

Questions

  • What could cause the 404 ?
  • How can we fix it ?
  • How to make the debug log actually useful ?

(Carsten Köbke) #2

Hello,

mybe a quick workaround is to loop the checkresult send part until you get a 200 return code or the loop tried it for X times.

For the debuglog, use Elastic Search or Graylog :slight_smile:
Regards,
Carsten


#3

Thanks for the suggestion.

That is indeed a quick workaround. But that won’t do for two reasons : we’re about to monitor a few hundreds more machines, and hitting the API with more requests will most likely makes the situation worse than it already is.

As for the debuglog, feeding it to ES or Graylog won’t solve the fact that there’s nothing in the log to follow the life cycle of an API request. And the timestamps are not precise enough imo, but that’s another matter.


(Michael Friedrich) #4

Hi,

thanks for the nice formatting and description, that helps a lot! :slight_smile:

Question aside, this isn’t reproducible when just a few connects and actions happen, right?

That’s unfortunately a known problem and not an easy one to fix, i.e. to have structured logging and so on. We need to replace our core logging functionality, and that’s something for the coming months or years.

Can you create a full gdb backtrace from the running process please? The commands are described in the development docs.

That way we can inspect whether there are deadlocks in the threads.

And if you can, I’d appreciate you testing the snapshot packages which include fixes for this problem.

Cheers,
Michael


#5

I don’t know. I’ll have to set up a test environment to, well, do some testing.

Yeah, it’s always difficult when there’s so many things going on at the same time. We have the exact same gripe with postfix’s logs which are sometimes uterly pointless when the information is spread all over the place with (sometimes) no reliable identifier. (I’m looking at you, TLS connection status!)
Good luck for this !

I’ll see what I can do. There’s hoping I can provide you with something next week or so, but no promise.