Hi,
I’m facing a strange issue. I’m writing new monitors quite often to match specific monitoring requirements. The current one runs quite long (several minutes; there is no way to shorten that).
By accident I found that Icinga starts a new instance of that check every minute, regardless of whether it is still running or whether I killed the existing run at OS level. Even restarting Icinga on the client side (endpoint) didn’t stop it; new instances of that check were still spawned. Finally I had to deactivate “active checks” for that service to prevent new instances from starting.
All new instances of that check were spawned by Icinga (according to the parent PID). Just to be sure, I replaced the check content with:
sleep 30000
echo "OK: test"
exit 0
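For anyone reproducing this, the parent of each spawned instance can be confirmed with ps; a quick sketch, assuming Linux (the bracket in the grep pattern keeps it from matching itself):

# list PID, parent PID, start time and command line of every running instance
ps -eo pid,ppid,lstart,args | grep '[s]leep 30000'

The PPID column should point at the icinga2 process.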
So can someone please give me a hint on how to investigate this? On the target node the debug log feature is not enabled. The Icinga log there only shows when I killed the processes, but no entries for spawning new ones.
I enabled the debug log on the server side for a while, but it doesn’t show much either. I can only see some distribution messages and the moment I switched off active checking for that service.
Does anyone have an explanation for these processes being triggered every minute, even though the same check is still running and both check interval and retry interval are set much higher?
Hi @Dirk,
I hope this does not give the wrong impression, but are you sure that you are looking at the right Service object? Just to be sure, you might want to check whether the settings are actually applied correctly.
Does the web interface predict the next check run to be in one minute, or not?
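For example, the effective service object can be dumped with the standard CLI on the node holding the configuration (names are placeholders):

icinga2 object list --type Service --name '<my_host>!<my_check>'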
Hi Lorenz,
thanks for your time. Yes, it is a new script written by me, and I configured everything myself. The service is named similarly to the check, and it is currently assigned to only one node on a test instance. Since I set “active checks” to “no”, no new instances are spawned.
If I trigger the check from the web interface, it just increases the (negative) value for “next check”, which is normal as long as a check is still running.
thank you too. The requested output follows. Because I’m working in a sensitive environment, I replaced the hostname and service name, but hopefully that doesn’t make a difference:
Object '<my_host>!<my_check>' of type 'Service':
  % declared in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/dev-GCP-01/services.conf', lines 27:1-27:38
  __name = "<my_host>!<my_check>"
  action_url = ""
  check_command = "<my_check_command>"
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 935:5-935:43
  check_interval = 43200
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 937:5-937:24
  check_period = ""
  check_timeout = 14400
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 939:5-939:22
  command_endpoint = "<my_host>"
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 943:5-943:32
  display_name = "<my_check>"
  enable_active_checks = false
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 941:5-941:32
  enable_event_handler = false
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 942:5-942:32
  enable_flapping = false
  enable_notifications = false
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 940:5-940:32
  enable_passive_checks = true
  enable_perfdata = true
  event_command = ""
  flapping_ignore_states = null
  flapping_threshold = 0
  flapping_threshold_high = 30
  flapping_threshold_low = 25
  groups =
  host_name = "<my_host>"
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/dev-GCP-01/services.conf', lines 28:5-28:41
  icon_image = ""
  icon_image_alt = ""
  max_check_attempts = 1
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 936:5-936:28
  name = "<my_check>"
  notes = ""
  notes_url = ""
  package = "director"
  retry_interval = 14400
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 938:5-938:23
Thus, I would second @lorenz’s idea to verify that this service really results in all these processes. Could you please enable the debug log on the Icinga 2 agent?
Within the debug.log, please grep for your check command and post both the starting and finishing log events; the latter also contains the next check time.
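For example, assuming the default log location on the agent:

grep -F '<my_check_command>' /var/log/icinga2/debug.log

This should yield events like the following: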
[2025-02-26 14:01:30 +0100] notice/Process: PID 83565 ('/usr/local/libexec/nagios/check_foo') terminated with exit code 0
[2025-02-26 14:01:30 +0100] debug/Checkable: Update checkable 'host!foo' with check interval '300' from last check time at 2025-02-26 14:01:30 +0100 (1.74057e+09) to next check time at 2025-02-26 14:06:24 +0100 (1.74058e+09).
Please feel free to redact the output again to exclude sensitive information.
Could you also query all of these parallel running check commands on the server and look their PIDs up in the debug.log? Since you have not explicitly stated the operating system: on Linux you could get the processes via ps aux | grep __check_command__.
Btw, could you please tell us the Icinga 2 version you are using and the operating systems? Maybe also some other details about your setup?
I set up a new check to exclude all possible side effects; it writes its start and end times and its PID to its output. Currently I have an issue with enabling the debug log, which tells me:
critical/cli: Cannot parse available features. Path '/etc/icinga2/features-available' does not exist.
But I can see that directory via "ls":
drwxr-xr-x. 2 <our_user> <our_group> 4096 Aug 22 2023 /etc/icinga2/features-available
I’ll trace the command to find out what is going on there. Once this is fixed, I’ll set up some scripts to extract the check results from the Icinga DB every minute, so we’ll get a picture.
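For reference, such a trace could be as simple as this sketch (assuming strace is installed; the syscall filter is just a guess at what is relevant here):

strace -f -e trace=openat,setuid /usr/sbin/icinga2 feature enable debuglog 2>&1 | grep -E 'setuid|features-available'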
For the time being the data you were asking for:
/usr/sbin/icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.1-1)
Copyright (c) 2012-2025 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later https://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: Red Hat Enterprise Linux
Platform version: 8.10 (Ootpa)
Kernel: Linux
Kernel version: 4.18.0-553.36.1.el8_10.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 8.4.1
Build host: runner-hh8q3bz2-project-507-concurrent-0
OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
This was caused by not using the default user “icinga”. Most likely we have overlooked something here, because that’s a test environment. On the server side I was able to enable the feature. UID 992 points to the default user “icinga”, so “setuid(992)” tries to access the directory as the wrong user, which results in missing permissions.
On my test machine I can fix that easily; I’ll go on and prepare the check now.
Icinga 2 expects the existence of an ICINGA_USER and ICINGA_GROUP, usually both defaulting to icinga. During startup of most icinga2 commands, the daemon checks whether it runs as this user and group, and switches users otherwise.
Some insights are available in the recently created Icinga 2 issue #10307.
Thus, please ensure that everything under /etc/icinga2/ is accessible to the icinga user.
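A quick sketch to verify that (assuming sudo is available; namei is part of util-linux):

# show permissions along the whole path, then try the access as user icinga
namei -l /etc/icinga2/features-available
sudo -u icinga ls -l /etc/icinga2/features-available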
Hi @apenning,
thanks again. Sorry, I didn’t find the time to work on this yesterday.
Our Icinga 2 service runs with a non-default user/group, because that’s what is defined in /etc/sysconfig/icinga2. The content of /etc/icinga2 belongs to our user/group (as it should), but as the trace snippet shows, Icinga switches to user “icinga” when trying to activate tracing. This might be a bug, but I don’t want to follow up on it.
My test details:
new service created which calls a new script called "spawn_test"
That check was started via the Icinga GUI, and a small script checked the process list every minute for "spawn_test", printing the date plus all matching PIDs and their PPIDs (sketches of both scripts below). Result:
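In essence, the two helpers look like this (a simplified sketch; only the name "spawn_test" is real, everything else is assumed):

#!/bin/bash
# spawn_test: run for 10 minutes, then report PID plus start/end time
start=$(date)
sleep 600
echo "OK: pid: $$ start: $start end: $(date)"
exit 0

#!/bin/bash
# watcher: once per minute, print the time plus PID/PPID of all running
# spawn_test instances
while sleep 60; do
    date +%H:%M:%S
    ps -eo pid,ppid,lstart,args | grep '[s]pawn_test'
done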
As you can see, new processes are spawned by Icinga continuously until the first job finishes after 10 minutes. This causes a status change, and no new jobs are triggered anymore. Another script queried the service state from the Icinga database every minute; a sketch of that query and its output follow:
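The query looked roughly like this (a simplified sketch, assuming the IDO MySQL backend and its standard schema):

# print the current plugin output of the test service
mysql -N icinga -e "SELECT ss.output FROM icinga_servicestatus ss \
  JOIN icinga_objects o ON o.object_id = ss.service_object_id \
  WHERE o.name1 = '<my_host>' AND o.name2 = 'spawn_test';"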
08:25:02
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:26:02
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:27:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:28:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:29:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:30:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:31:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:32:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:33:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:34:04
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:35:04
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:36:04
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:37:04
DB output
pid: 1736213 start: Fri 28 Feb 08:26:45 CET 2025 end: Fri 28 Feb 08:36:45 CET 2025
08:38:04
DB output
pid: 1736213 start: Fri 28 Feb 08:26:45 CET 2025 end: Fri 28 Feb 08:36:45 CET 2025
08:39:04
DB output
pid: 1750851 start: Fri 28 Feb 08:28:15 CET 2025 end: Fri 28 Feb 08:38:15 CET 2025
08:40:04
DB output
pid: 1759602 start: Fri 28 Feb 08:29:45 CET 2025 end: Fri 28 Feb 08:39:45 CET 2025
08:41:04
DB output
pid: 1759602 start: Fri 28 Feb 08:29:45 CET 2025 end: Fri 28 Feb 08:39:45 CET 2025
08:42:05
DB output
pid: 1770269 start: Fri 28 Feb 08:31:15 CET 2025 end: Fri 28 Feb 08:41:15 CET 2025
08:43:05
DB output
pid: 1770269 start: Fri 28 Feb 08:31:15 CET 2025 end: Fri 28 Feb 08:41:15 CET 2025
08:44:05
DB output
pid: 1770269 start: Fri 28 Feb 08:31:15 CET 2025 end: Fri 28 Feb 08:41:15 CET 2025
I killed the previous run, so we start with “terminated”. This state persists until the first 10-minute run finishes (which is to be expected), but then we can see that the results of those additionally spawned processes update the status when they finish.
I’m quite sure that’s a bug related to the retry function, because:
my retry interval setting is ignored
it should not trigger a new instance of the job before the previous one has ended (with whatever status)
We are currently migrating the prod instance to new nodes with a newer release. I’ll have to check whether this issue exists there as well. If so, I’ll have to file a bug report, I assume.
Thanks a lot for your input and have a nice weekend.
Hi,
yes, it would, but then you won’t have retries anymore and you’ll get a notification after the first failure.
Another workaround I got from development is to set the command timeout to the same value as the service timeout. That will prevent duplicate starts, because Icinga checks whether the command is already running.
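If I understand that correctly, it means the CheckCommand’s timeout should match the service’s check_timeout; a minimal sketch in the plain DSL (the command path is made up):

object CheckCommand "<my_check_command>" {
  command = [ PluginDir + "/my_check.sh" ]
  timeout = 14400 // same value as the service's check_timeout above
}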