Summary
The API process-check-result randomly returns 404.
Icinga2 r2.10.2-1, Debian Stretch.
Background
We’re currently monitoring about a hundred hosts representing about 400 services.
The hosts are passive, meaning Icinga2 doesn’t do any active checks on them.
Everything comes from the hosts themselves. A bash script executes some checks and submit the result to Icinga2 with a curl command.
The checks are performed every minutes or so, and the result are sent consecutively.
There’s no delay between the submissions, apart from the duration of the check itself.
Most of the time the checkresults are properly processed by Icinga2 (API answers 200, sevice status is updated correctly).
But rather frequently Icinga returns 404 {“error”:404.0,“status”:“No objects found.”}.
There’s nothing relevant in Icinga2’s logs, or debuglog.
Incidentaly, the debuglog is a pain to decipher as everything is intertwined, there’s no way to tell what line belongs to what process/api call.
My only guess is that the API has a hard time dealing with simultaneous connections as all the check_results arrive at about the same moment.
Possibly related : https://github.com/Icinga/icinga2/issues/6682
icinga2 --version
expand
icinga2 - The Icinga 2 network monitoring daemon (version: r2.10.2-1) System information: Platform: Debian GNU/Linux Platform version: 9 (stretch) Kernel: Linux Kernel version: 4.9.0-6-amd64 Architecture: x86_64 Build information: Compiler: GNU 6.3.0 Build host: 486e413fb159
Config
expand
template Service "passive-service" { enable_passive_checks = true enable_active_checks = true max_check_attempts = 3 check_interval = 5m retry_interval = 1m vars.dummy_state = 2 vars.dummy_text = {{ var service = get_service(macro("$host.name$"), macro("$service.name$")) var lastCheck = DateTime(service.last_check).to_string() return "No check results received. Last result time: " + lastCheck }} enable_flapping = true } apply Service "passive_check_disk" { import "passive-service" display_name = "Disk" check_command = "passive" assign where host.vars.passive_check_disk } template Host "passive-host" { check_command = "dummy" max_check_attempts = 3 check_interval = 5m retry_interval = 1m vars.dummy_state = 2 vars.dummy_text = {{ var lastCheck = DateTime(host.last_check).to_string() return "Host is down since: " + lastCheck }} } object Host "test-host" { import "passive-host" groups = [ "test-group" ] display_name = "TEST host" host.vars.passive_check_disk = true }
Check result
Example of data that’s sent to Icinga2 :
{ “exit_status”:0, “check_command”: ["/usr/lib/nagios/plugins/check_disk","-w",“20%”,"-c",“10%”,"-p","/"], “plugin_output”: “DISK OK - free space: / 1375 MB (26% inode=62%);”, “performance_data”: ["/=3825MB;4402;4952;0;5503"], “execution_start”: 1549555681.439195, “execution_end”: 1549555681.443452 }
The data is sent like this :
curl -s -u “${ICINGA2API_USER}:${ICINGA2API_PASSWORD}” -H ‘Accept: application/json’ -X POST “${url}” -d “${data}”
Debuglog extract
expand
[2019-02-07 16:08:01 +0100] notice/ApiListener: New HTTP client [2019-02-07 16:08:01 +0100] debug/HttpRequest: line: POST /v1/actions/process-check-result?service=test-host%21passive_check_disk HTTP/1.1, tokens: 3 [2019-02-07 16:08:01 +0100] information/HttpServerConnection: Request: POST /v1/actions/process-check-result?service=test-host%21passive_check_disk (from [123.123.123.123]:43257), user: test_remote) [2019-02-07 16:08:01 +0100] notice/WorkQueue: Spawning WorkQueue threads for 'HttpServerConnection' ... about 50 lines further ... [2019-02-07 16:08:01 +0100] notice/ApiActionHandler: Running action process-check-result [2019-02-07 16:08:01 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message [2019-02-07 16:08:01 +0100] notice/ApiListener: Relaying 'event::CheckResult' message [2019-02-07 16:08:01 +0100] debug/ApiEvents: Processing event type 'CheckResult'. [2019-02-07 16:08:01 +0100] debug/HttpUtility: Request body: '{ "exit_status":0, "check_command": ["/usr/lib/nagios/plugins/check_disk","-w","20%","-c","10%","-p","/"], "plugin_output": "DISK OK - free space: / 1375 MB (26% inode=62%);", "performance_data": ["/=3825MB;4402;4952;0;5503"], "execution_start": 1549552081.585619, "execution_end": 1549552081.591264 }' ... about 50 lines further ... [2019-02-07 16:08:01 +0100] notice/ApiActionHandler: Running action process-check-result [2019-02-07 16:08:01 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = E'0', active_checks_enabled = E'1', check_command = E'passive', check_source = E'icinga2.domain.com', check_type = E'0', current_check_attempt = E'1', current_notification_number = E'0', current_state = E'0', event_handler_enabled = E'1', execution_time = E'0.005645', flap_detection_enabled = E'1', has_been_checked = E'1', instance_id = 1, is_flapping = E'0', is_reachable = E'1', last_check = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC', last_hard_state = E'0', last_hard_state_change = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC', last_notification = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC', last_state_change = TO_TIMESTAMP(1549534772) AT TIME ZONE 'UTC', last_time_critical = TO_TIMESTAMP(1549534670) AT TIME ZONE 'UTC', last_time_ok = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC', latency = E'0', long_output = E'', max_check_attempts = E'5', next_check = TO_TIMESTAMP(1549552381) AT TIME ZONE 'UTC', next_notification = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC', normal_check_interval = E'5', notifications_enabled = E'1', original_attributes = E'null', output = E'DISK OK - free space: / 1375 MB (26% inode=62%);', passive_checks_enabled = E'1', percent_state_change = E'0', perfdata = E'/=3825MB;4402;4952;0;5503', problem_has_been_acknowledged = E'0', process_performance_data = E'1', retry_check_interval = E'2', scheduled_downtime_depth = E'0', service_object_id = 1741, should_be_scheduled = E'1', state_type = E'1', status_update_time = TO_TIMESTAMP(1549552081) AT TIME ZONE 'UTC' WHERE service_object_id = 1741
Questions
- What could cause the 404 ?
- How can we fix it ?
- How to make the debug log actually useful ?