After the update to 2.11.2, icinga2 restart takes forever

It seems icinga2 waits until the CheckerComponent has finished all running checks before it restarts the daemon, which sometimes takes 100+ seconds. Is this normal? How can I set a timeout here, or force a restart? It doesn't make sense to wait for those checks to complete if I'm restarting anyway.
This seems to be new behavior; it didn't occur in v2.10.
I should mention this is a fresh install from scratch, since the update needed a newer Debian because of the required libraries and I didn't want to do a dist-upgrade. The config was then cloned from our git.

Excerpt from the debug log:

... debug/CheckerComponent: Waiting for running checks (3) to finish. Waited for x seconds now.
... debug/CheckerComponent: Waiting for running checks (2) to finish. Waited for x seconds now.
[...]
[2019-12-03 09:10:02 +0100] debug/CheckerComponent: Waiting for running checks (1) to finish. Waited for 175.1 seconds now.
[2019-12-03 09:10:02 +0100] debug/CheckerComponent: Waiting for running checks (1) to finish. Waited for 175.2 seconds now.
[2019-12-03 09:10:02 +0100] notice/Process: PID 67332 ('/usr/lib/nagios/plugins/check_by_ssh' '-C' ''/usr/local/nagios/bin/plugins/check_timemachine.sh' '-c' '57600' '-w' '43200'' '-H' '10.159.0.55' '-l' 'foo' '-o' 'PasswordAuthentication=no' '-p' '22' '-t' '120') terminated with exit code 0
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!SSH!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!by_ssh-on-ssh-failures' passed: Parent service 'roley!SSH' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!SSH!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!by_ssh-on-ssh-failures' passed: Parent service 'roley!SSH' matches state filter.
[2019-12-03 09:10:02 +0100] debug/Checkable: Update checkable 'roley!Timemachine backup daily' with check interval '600' from last check time at 2019-12-03 09:10:02 +0100 (1.57536e+09) to next check time at 2019-12-03 09:19:49 +0100 (1.57536e+09).
[2019-12-03 09:10:02 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-12-03 09:10:02 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2019-12-03 09:10:02 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'by_ssh',  check_source = E'de-nue-monitoring-staging',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'0',  endpoint_object_id = 21220,  event_handler_enabled = E'1',  execution_time = E'799.652998',  flap_detection_enabled = E'1',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1575360602) AT TIME ZONE 'UTC',  last_hard_state = E'0',  last_hard_state_change = TO_TIMESTAMP(1575343159) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1575343159) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1575360602) AT TIME ZONE 'UTC',  last_time_unknown = TO_TIMESTAMP(1575343035) AT TIME ZONE 'UTC',  latency = E'0.000874',  long_output = E'',  max_check_attempts = E'3',  next_check = TO_TIMESTAMP(1575361189) AT TIME ZONE 'UTC',  next_notification = TO_TIMESTAMP(1575358988) AT TIME ZONE 'UTC',  normal_check_interval = E'10',  notifications_enabled = E'1',  original_attributes = E'null',  output = E'OK: last backup ran on Tue Dec  3 08:10:14 CET 2019, 59 minutes ago.',  passive_checks_enabled = E'1',  percent_state_change = E'0',  perfdata = E'',  problem_has_been_acknowledged = E'0',  process_performance_data = E'1',  retry_check_interval = E'2',  scheduled_downtime_depth = E'0',  service_object_id = 18511,  should_be_scheduled = E'1',  state_type = E'1',  status_update_time = TO_TIMESTAMP(1575360602) AT TIME ZONE 'UTC' WHERE service_object_id = 18511
[2019-12-03 09:10:02 +0100] information/CheckerComponent: 'checker' stopped.
[2019-12-03 09:10:02 +0100] information/LivestatusListener: 'livestatus' stopped.

OS: Debian GNU/Linux 10 (buster)
Package: icinga2, Version: 2.11.2-1.buster
Just one (staging) zone. The HA master cluster with two nodes will also be upgraded after more testing.

How long is the average execution time of that plugin?

The DB update says execution_time = E'799.652998' for this run.

Since 2.11, Icinga doesn't immediately kill running checks on restart but waits a certain amount of time before doing so. 800 seconds is quite a long time for a check plugin run via the scheduler. If I were you, I would move this into an external (cron) job and let it execute there; a sketch of the Icinga side of that pattern follows below.
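A minimal sketch of what that could look like on the Icinga side, assuming the cron job pushes its results via the REST API action /v1/actions/process-check-result (the service and host names are taken from the log above; the interval is a placeholder):

// Sketch: turn the slow check into a passive one. The external cron job runs
// the plugin on its own schedule and submits the result over the REST API;
// Icinga itself only tracks whether results keep arriving.
object Service "Timemachine backup daily" {
  host_name = "roley"
  check_command = "passive"      // ITL dummy command; yields UNKNOWN when no result arrives in time
  enable_active_checks = false   // the scheduler no longer runs the plugin itself
  check_interval = 1h            // freshness threshold for the pushed results
}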

Cheers,
Michael


What’s ‘a certain amount of time’ and how can I influence that?
I know this is a shitty plugin (we have a few more of those legacy things lying around) and that they should be tweaked, but for now I would gladly just force this system to restart, since it's only a staging env.
(Restart is triggered by a git post-update hook, same as on the two master nodes.)

Thanks,
Bruno.-

You cannot influence the actual wait value; it is calculated from the global ReloadTimeout plus a grace period. See the Stop() implementation in the source code.

So when a plugin is still running, it may take until the reload timeout (5m) for it to actually be killed - not what users expect, since that SIGKILL surfaces as an UNKNOWN alert. I strongly recommend using CheckCommand objects with the timeout attribute, or even implementing an alarm timeout in your plugin and controlling the timeout there. For example:
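A hedged sketch (the plugin path is taken from the log above; the object names and the two-minute values are placeholders, and the check_by_ssh wrapper from your setup is omitted for brevity):

// Cap the plugin's runtime so a reload never has to wait long for it.
object CheckCommand "check_timemachine" {
  command = [ "/usr/local/nagios/bin/plugins/check_timemachine.sh" ]
  timeout = 2m   // Icinga terminates the plugin process after two minutes
}

object Service "Timemachine backup daily" {
  host_name = "roley"
  check_command = "check_timemachine"
  check_timeout = 2m   // per-service override; takes precedence over the CheckCommand timeout
}

An alarm timeout inside the plugin has the added benefit that the plugin can still exit with a proper status and message instead of being killed mid-run.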

Cheers,
Michael

OK, thanks for that. I guess I have to get rid of those plugins or, in some cases, find a new way for them to get the data - like here, where the command 'tmutil listbackups' takes a really long time to complete.

Greetings,
Bruno.-