Hi,
I’m facing a strange issue. I’m writing new monitors quite often to match specific monitoring requirements. The current one runs quite long (several minutes; there is no way to shorten that).
By accident I found that Icinga starts a new instance of that check every minute, regardless of whether it is still running or whether I killed the existing run at OS level. Even restarting Icinga on the client side (endpoint) didn’t stop it; new instances of that check were still spawned. Finally I had to deactivate “active checks” for that service to prevent new instances from starting.
All new instances of that check were spawned by Icinga (according to the parent PID). Just to be sure, I replaced the check content with:
sleep 30000
echo "OK: test"
exit 0
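For anyone reproducing this, the parent of each spawned instance can be confirmed with ps; a quick sketch, assuming Linux (the bracket in the grep pattern keeps it from matching itself):

# list PID, parent PID, start time and command line of every running instance
ps -eo pid,ppid,lstart,args | grep '[s]leep 30000'

The PPID column should point at the icinga2 process.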
So can someone please give me a hint on how to investigate this? On the target node the debug log feature is not enabled. The Icinga log there only shows when I killed the processes, but no entries for spawning new ones.
I enabled the debug log on the server side for a while, but it doesn’t show much either. I can only see some distribution messages and the moment I switched off active checking for that service.
Does anyone have an explanation for these processes being triggered every minute, even though the same check is still running and both check interval and retry interval are set much higher?
Hi @Dirk,
I hope this does not give the wrong impression, but are you sure that you are looking at the right Service object? Just to be sure, you might want to check whether the settings are actually applied correctly.
Does the web interface predict the next check run to be in one minute, or not?
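For example, the effective service object can be dumped with the standard CLI on the node holding the configuration (names are placeholders):

icinga2 object list --type Service --name '<my_host>!<my_check>'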
Hi Lorenz,
thanks for your time. Yes, it is a new script written by me, and I configured everything myself. The service is named similarly to the check, and it is currently assigned to only one node on a test instance. Since I set “active checks” to “no”, no new instances are spawned.
If I trigger the check from the web interface, it just increases the (negative) value for “next check”, which is normal as long as a check is still running.
thank you too. The requested output follows. Because I’m working in a sensitive environment, I replaced the hostname and service name, but hopefully that doesn’t make a difference:
Object '<my_host>!<my_check>' of type 'Service':
  % declared in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/dev-GCP-01/services.conf', lines 27:1-27:38
  __name = "<my_host>!<my_check>"
  action_url = ""
  check_command = "<my_check_command>"
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 935:5-935:43
  check_interval = 43200
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 937:5-937:24
  check_period = ""
  check_timeout = 14400
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 939:5-939:22
  command_endpoint = "<my_host>"
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 943:5-943:32
  display_name = "<my_check>"
  enable_active_checks = false
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 941:5-941:32
  enable_event_handler = false
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 942:5-942:32
  enable_flapping = false
  enable_notifications = false
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 940:5-940:32
  enable_passive_checks = true
  enable_perfdata = true
  event_command = ""
  flapping_ignore_states = null
  flapping_threshold = 0
  flapping_threshold_high = 30
  flapping_threshold_low = 25
  groups =
  host_name = "<my_host>"
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/dev-GCP-01/services.conf', lines 28:5-28:41
  icon_image = ""
  icon_image_alt = ""
  max_check_attempts = 1
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 936:5-936:28
  name = "<my_check>"
  notes = ""
  notes_url = ""
  package = "director"
  retry_interval = 14400
    % = modified in '/var/lib/icinga2/api/packages/director/6d63c7d1-70af-4065-af6d-0528336d6d7a/zones.d/director-global/service_templates.conf', lines 938:5-938:23
Thus, I would second @lorenz’s idea to verify that this service really results in all these processes. Could you please enable the debug log on the Icinga 2 agent?
Within the debug.log, please grep for your check command and post both the starting and finishing log events; the latter also contains the next check time.
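For example, assuming the default log location on the agent:

grep -F '<my_check_command>' /var/log/icinga2/debug.log

This should yield events like the following: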
[2025-02-26 14:01:30 +0100] notice/Process: PID 83565 ('/usr/local/libexec/nagios/check_foo') terminated with exit code 0
[2025-02-26 14:01:30 +0100] debug/Checkable: Update checkable 'host!foo' with check interval '300' from last check time at 2025-02-26 14:01:30 +0100 (1.74057e+09) to next check time at 2025-02-26 14:06:24 +0100 (1.74058e+09).
Please feel free to redact the output again to exclude sensitive information.
Could you also query all of these parallel running check commands on the server and look their PIDs up in the debug.log? Since you have not explicitly stated the operating system: on Linux you could get the processes via ps aux | grep __check_command__.
Btw, could you please tell us the Icinga 2 version you are using and the operating systems? Maybe also some other details about your setup?
I set up a new check to exclude all possible side effects; it writes its start and end times and its PID to its output. Currently I have an issue with enabling the debug log, which tells me:
critical/cli: Cannot parse available features. Path '/etc/icinga2/features-available' does not exist.
But I can see that directory via "ls":
drwxr-xr-x. 2 <our_user> <our_group> 4096 Aug 22 2023 /etc/icinga2/features-available
I’ll trace the command to find out what is going on there. Once this is fixed, I’ll set up some scripts to extract the check results from the Icinga DB every minute, so we’ll get a picture.
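For reference, such a trace could be as simple as this sketch (assuming strace is installed; the syscall filter is just a guess at what is relevant here):

strace -f -e trace=openat,setuid /usr/sbin/icinga2 feature enable debuglog 2>&1 | grep -E 'setuid|features-available'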
For the time being the data you were asking for:
/usr/sbin/icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.1-1)
Copyright (c) 2012-2025 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later https://gnu.org/licenses/gpl2.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
System information:
Platform: Red Hat Enterprise Linux
Platform version: 8.10 (Ootpa)
Kernel: Linux
Kernel version: 4.18.0-553.36.1.el8_10.x86_64
Architecture: x86_64
Build information:
Compiler: GNU 8.4.1
Build host: runner-hh8q3bz2-project-507-concurrent-0
OpenSSL version: OpenSSL 1.1.1k FIPS 25 Mar 2021
This was caused by not using the default user “icinga”. Most likely we have overlooked something here, because that’s a test environment. On the server side I was able to enable the feature. UID 992 points to the default user “icinga”, so “setuid(992)” tries to access the directory as the wrong user, which results in missing permissions.
On my test machine I can fix that easily; I’ll go on and prepare the check now.
Icinga 2 expects the existence of an ICINGA_USER and ICINGA_GROUP, usually both defaulting to icinga. During startup of most icinga2 commands, the daemon checks whether it runs as this user and group, and switches users otherwise.
Some insights are available in the recently created Icinga 2 issue #10307.
Thus, please ensure that everything under /etc/icinga2/ is accessible to the icinga user.
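A quick sketch to verify that (assuming sudo is available; namei is part of util-linux):

# show permissions along the whole path, then try the access as user icinga
namei -l /etc/icinga2/features-available
sudo -u icinga ls -l /etc/icinga2/features-available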
Hi @apenning,
thanks again. Sorry, I didn’t find the time to work on this yesterday.
Our Icinga 2 service runs with a non-default user/group, because that’s what is defined in /etc/sysconfig/icinga2. The content of /etc/icinga2 belongs to our user/group (as it should), but as the trace snippet shows, Icinga switches to user “icinga” when trying to activate tracing. This might be a bug, but I don’t want to follow up on it.
My test details:
new service created which calls a new script called "spawn_test"
That check was started via the Icinga GUI, and a small script checked the process list every minute for "spawn_test", printing the date plus all matching PIDs and their PPIDs (sketches of both scripts below). Result:
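In essence, the two helpers look like this (a simplified sketch; only the name "spawn_test" is real, everything else is assumed):

#!/bin/bash
# spawn_test: run for 10 minutes, then report PID plus start/end time
start=$(date)
sleep 600
echo "OK: pid: $$ start: $start end: $(date)"
exit 0

#!/bin/bash
# watcher: once per minute, print the time plus PID/PPID of all running
# spawn_test instances
while sleep 60; do
    date +%H:%M:%S
    ps -eo pid,ppid,lstart,args | grep '[s]pawn_test'
done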
As you can see, new processes are spawned by Icinga continuously until the first job finishes after 10 minutes. This causes a status change, and no new jobs are triggered anymore. Another script queried the service state from the Icinga database every minute; a sketch of that query and its output follow:
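The query looked roughly like this (a simplified sketch, assuming the IDO MySQL backend and its standard schema):

# print the current plugin output of the test service
mysql -N icinga -e "SELECT ss.output FROM icinga_servicestatus ss \
  JOIN icinga_objects o ON o.object_id = ss.service_object_id \
  WHERE o.name1 = '<my_host>' AND o.name2 = 'spawn_test';"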
08:25:02
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:26:02
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:27:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:28:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:29:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:30:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:31:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:32:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:33:03
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:34:04
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:35:04
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:36:04
DB output
pid: 1658311 start: Fri 28 Feb 08:14:30 CET 2025 <Terminated by signal 15 (Terminated).>
08:37:04
DB output
pid: 1736213 start: Fri 28 Feb 08:26:45 CET 2025 end: Fri 28 Feb 08:36:45 CET 2025
08:38:04
DB output
pid: 1736213 start: Fri 28 Feb 08:26:45 CET 2025 end: Fri 28 Feb 08:36:45 CET 2025
08:39:04
DB output
pid: 1750851 start: Fri 28 Feb 08:28:15 CET 2025 end: Fri 28 Feb 08:38:15 CET 2025
08:40:04
DB output
pid: 1759602 start: Fri 28 Feb 08:29:45 CET 2025 end: Fri 28 Feb 08:39:45 CET 2025
08:41:04
DB output
pid: 1759602 start: Fri 28 Feb 08:29:45 CET 2025 end: Fri 28 Feb 08:39:45 CET 2025
08:42:05
DB output
pid: 1770269 start: Fri 28 Feb 08:31:15 CET 2025 end: Fri 28 Feb 08:41:15 CET 2025
08:43:05
DB output
pid: 1770269 start: Fri 28 Feb 08:31:15 CET 2025 end: Fri 28 Feb 08:41:15 CET 2025
08:44:05
DB output
pid: 1770269 start: Fri 28 Feb 08:31:15 CET 2025 end: Fri 28 Feb 08:41:15 CET 2025
I killed the previous run, so we start with “terminated”. This state persists until the first 10-minute run finishes (which is to be expected), but then we can see that the results of those additionally spawned processes update the status when they finish.
I’m quite sure that’s a bug related to the retry function, because:
my retry interval setting is ignored
it should not trigger a new instance of the job before the previous one has ended (with whatever status)
We are currently migrating the prod instance to new nodes with a newer release. I’ll have to check whether this issue exists there as well. If so, I’ll have to file a bug report, I assume.
Thanks a lot for your input and have a nice weekend.
Hi,
yes, it would, but then you won’t have retries anymore and you’ll get a notification after the first failure.
Another workaround I got from development is to set the command timeout to the same value as the service timeout. That will prevent duplicate starts, because Icinga checks whether the command is already running.
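If I understand that correctly, it means the CheckCommand’s timeout should match the service’s check_timeout; a minimal sketch in the plain DSL (the command path is made up):

object CheckCommand "<my_check_command>" {
  command = [ PluginDir + "/my_check.sh" ]
  timeout = 14400 // same value as the service's check_timeout above
}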