After the update to 2.11.2, icinga2 restart takes forever

It seems icinga2 waits until the CheckerComponent has finished all running checks before it restarts the daemon, which sometimes takes 100+ seconds. Is this normal? How can I set a timeout here, or force a restart? It doesn't make sense to wait for those checks to complete if I'm restarting anyway.
This seems to be new behavior; it didn't occur in v2.10.
I should mention this is a fresh install from scratch, since the update needed a newer Debian because of the required libraries and I didn't want to do a dist-upgrade. The config was then cloned from our git.

Excerpt from the debug log:

... debug/CheckerComponent: Waiting for running checks (3) to finish. Waited for x seconds now.
... debug/CheckerComponent: Waiting for running checks (2) to finish. Waited for x seconds now.
[...]
[2019-12-03 09:10:02 +0100] debug/CheckerComponent: Waiting for running checks (1) to finish. Waited for 175.1 seconds now.
[2019-12-03 09:10:02 +0100] debug/CheckerComponent: Waiting for running checks (1) to finish. Waited for 175.2 seconds now.
[2019-12-03 09:10:02 +0100] notice/Process: PID 67332 ('/usr/lib/nagios/plugins/check_by_ssh' '-C' ''/usr/local/nagios/bin/plugins/check_timemachine.sh' '-c' '57600' '-w' '43200'' '-H' '10.159.0.55' '-l' 'foo' '-o' 'PasswordAuthentication=no' '-p' '22' '-t' '120') terminated with exit code 0
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!SSH!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!by_ssh-on-ssh-failures' passed: Parent service 'roley!SSH' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!host-roley-de-nue-sw-01-dev' passed: Parent host 'de-nue-sw-01-dev' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!SSH!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!disable-host-service-checks' passed: Parent host 'roley' matches state filter.
[2019-12-03 09:10:02 +0100] notice/Dependency: Dependency 'roley!Timemachine backup daily!by_ssh-on-ssh-failures' passed: Parent service 'roley!SSH' matches state filter.
[2019-12-03 09:10:02 +0100] debug/Checkable: Update checkable 'roley!Timemachine backup daily' with check interval '600' from last check time at 2019-12-03 09:10:02 +0100 (1.57536e+09) to next check time at 2019-12-03 09:19:49 +0100 (1.57536e+09).
[2019-12-03 09:10:02 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-12-03 09:10:02 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2019-12-03 09:10:02 +0100] debug/IdoPgsqlConnection: Query: UPDATE icinga_servicestatus SET acknowledgement_type = E'0',  active_checks_enabled = E'1',  check_command = E'by_ssh',  check_source = E'de-nue-monitoring-staging',  check_type = E'0',  current_check_attempt = E'1',  current_notification_number = E'0',  current_state = E'0',  endpoint_object_id = 21220,  event_handler_enabled = E'1',  execution_time = E'799.652998',  flap_detection_enabled = E'1',  has_been_checked = E'1',  instance_id = 1,  is_flapping = E'0',  is_reachable = E'1',  last_check = TO_TIMESTAMP(1575360602) AT TIME ZONE 'UTC',  last_hard_state = E'0',  last_hard_state_change = TO_TIMESTAMP(1575343159) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1575343159) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1575360602) AT TIME ZONE 'UTC',  last_time_unknown = TO_TIMESTAMP(1575343035) AT TIME ZONE 'UTC',  latency = E'0.000874',  long_output = E'',  max_check_attempts = E'3',  next_check = TO_TIMESTAMP(1575361189) AT TIME ZONE 'UTC',  next_notification = TO_TIMESTAMP(1575358988) AT TIME ZONE 'UTC',  normal_check_interval = E'10',  notifications_enabled = E'1',  original_attributes = E'null',  output = E'OK: last backup ran on Tue Dec  3 08:10:14 CET 2019, 59 minutes ago.',  passive_checks_enabled = E'1',  percent_state_change = E'0',  perfdata = E'',  problem_has_been_acknowledged = E'0',  process_performance_data = E'1',  retry_check_interval = E'2',  scheduled_downtime_depth = E'0',  service_object_id = 18511,  should_be_scheduled = E'1',  state_type = E'1',  status_update_time = TO_TIMESTAMP(1575360602) AT TIME ZONE 'UTC' WHERE service_object_id = 18511
[2019-12-03 09:10:02 +0100] information/CheckerComponent: 'checker' stopped.
[2019-12-03 09:10:02 +0100] information/LivestatusListener: 'livestatus' stopped.

OS: Debian GNU/Linux 10 (buster)
Package: icinga2, Version: 2.11.2-1.buster
Just one (staging) zone. The HA master cluster with two nodes will also be upgraded after more testing.

How long is the average execution time of that plugin?

The DB update says execution_time = E'799.652998' for this run.

Since 2.11, Icinga doesn't immediately kill running checks on restart but waits a certain amount of time before doing so. 800 seconds is quite a long time for a check plugin run via the scheduler. If I were you, I would move this into an external (cron) job and let it execute there; a sketch of the Icinga side of that pattern follows below.
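A minimal sketch of what that could look like on the Icinga side, assuming the cron job pushes its results via the REST API action /v1/actions/process-check-result (the service and host names are taken from the log above; the interval is a placeholder):

// Sketch: turn the slow check into a passive one. The external cron job runs
// the plugin on its own schedule and submits the result over the REST API;
// Icinga itself only tracks whether results keep arriving.
object Service "Timemachine backup daily" {
  host_name = "roley"
  check_command = "passive"      // ITL dummy command; yields UNKNOWN when no result arrives in time
  enable_active_checks = false   // the scheduler no longer runs the plugin itself
  check_interval = 1h            // freshness threshold for the pushed results
}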

Cheers,
Michael


What’s ‘a certain amount of time’ and how can I influence that?
I know this is a shitty plugin (we have a few more of those legacy things lying around) and that they should be tweaked, but for now I would gladly just force this system to restart, since it's only a staging env.
(Restart is triggered by a git post-update hook, same as on the two master nodes.)

Thanks,
Bruno.-

You cannot influence the actual wait value; it is calculated from the global ReloadTimeout plus a grace period. See the Stop() implementation in the source code.

So when a plugin is still running, it may take until the reload timeout (5m) for it to actually be killed - not what users expect, since that SIGKILL surfaces as an UNKNOWN alert. I strongly recommend using CheckCommand objects with the timeout attribute, or even implementing an alarm timeout in your plugin and controlling the timeout there. For example:
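A hedged sketch (the plugin path is taken from the log above; the object names and the two-minute values are placeholders, and the check_by_ssh wrapper from your setup is omitted for brevity):

// Cap the plugin's runtime so a reload never has to wait long for it.
object CheckCommand "check_timemachine" {
  command = [ "/usr/local/nagios/bin/plugins/check_timemachine.sh" ]
  timeout = 2m   // Icinga terminates the plugin process after two minutes
}

object Service "Timemachine backup daily" {
  host_name = "roley"
  check_command = "check_timemachine"
  check_timeout = 2m   // per-service override; takes precedence over the CheckCommand timeout
}

An alarm timeout inside the plugin has the added benefit that the plugin can still exit with a proper status and message instead of being killed mid-run.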

Cheers,
Michael

OK, thanks for that. I guess I have to get rid of those plugins or, in some cases, find a new way for them to get the data - like here, where the command 'tmutil listbackups' takes a really long time to complete.

Greetings,
Bruno.-