I have had the Windows agent deployed to many servers for a couple of weeks now with no issues. Starting last night, the service stopped on some random servers… and it fails to start when I try to start it manually.
If I reinstall the agent, it runs… but then dies again and ends up in the same state.
What can I look at to try to determine the cause? I see the log file in programdata/var/logs, but nothing jumps out at me.
After another reinstall it appears stable for the moment, but I would like to know how to investigate the service failure, if anything is logged somewhere.
It continues to happen, on multiple versions of Windows Server. Everything is fine for a while after a fresh install of the agent, and then the service checks show UNKNOWN in the web interface.
When I investigate, the Icinga2 service is stopped, and if I try to start it, it fails. The only workaround I have found so far is to remove the Icinga2 agent and reinstall it. Then it seems fine again for a while.
I will see if I can sanitize the log file I have and share it. I don’t know what is happening here.
I’m starting to think it doesn’t survive a reboot of the Windows server… like it is fine until the server reboots and then it bombs… This concerns me more, and I need to verify it.
Maybe there’s a problem with DNS resolution. Try setting the NodeName constant in constants.conf to the FQDN, which avoids the hostname lookup on startup. If that setting already exists, the problem is elsewhere, but it should be visible in the logs on startup as well.
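A minimal Python sketch for checking this on one agent. It assumes the default Windows location of constants.conf (C:\ProgramData\icinga2\etc\icinga2\constants.conf) and a line of the form const NodeName = "..."; both are assumptions, so adjust them for your install. It only reports what is configured and what the OS returns as its FQDN; it does not change anything.

```python
import re
import socket
from typing import Optional

# ASSUMPTION: default constants.conf location for the Windows agent.
CONSTANTS_CONF = r"C:\ProgramData\icinga2\etc\icinga2\constants.conf"

# Matches an uncommented line such as: const NodeName = "server01.example.com"
# Lines starting with // are skipped because they do not begin with "const".
NODENAME_RE = re.compile(r'^\s*const\s+NodeName\s*=\s*"([^"]*)"', re.MULTILINE)


def find_node_name(conf_text: str) -> Optional[str]:
    """Return the NodeName value set in constants.conf text, or None if it is unset."""
    match = NODENAME_RE.search(conf_text)
    return match.group(1) if match else None


def check_node_name(path: str = CONSTANTS_CONF) -> None:
    """Print whether NodeName is set and whether it matches the OS-reported FQDN."""
    # utf-8-sig strips a BOM if the file was saved with one.
    with open(path, encoding="utf-8-sig") as fh:
        node_name = find_node_name(fh.read())
    fqdn = socket.getfqdn()  # what the OS currently resolves as this host's FQDN
    if node_name is None:
        print(f"NodeName not set; it will be derived at startup (OS reports {fqdn!r})")
    elif node_name.lower() != fqdn.lower():
        print(f"NodeName {node_name!r} differs from OS FQDN {fqdn!r}")
    else:
        print(f"NodeName {node_name!r} matches OS FQDN")
```

On the agent itself, calling check_node_name() prints the result; find_node_name() is split out so the parsing can be tried against a copy of the file.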
I stopped the Icinga2 service on the Windows servers… waited a few minutes, and started it again. It failed to start.
I set NodeName to the server's FQDN, and that did not resolve the issue. If I reinstall, it is fine until the service stops and tries to start again.
I have to sanitize the logs before I can upload them but I noticed a couple of interesting things.
Starting the service and letting it stop/crash does not seem to add anything to the icinga2 log in C:\ProgramData\icinga2\var\log\icinga2. There are plenty of log entries from today in there (probably from the reinstall I did), but starting the service doesn’t appear to touch this log file.
There is nothing in C:\ProgramData\icinga2\var\log\icinga2\crash either.
I see this in the log; I’m not sure what it means or whether it is relevant.
[2019-05-20 09:51:25 -0400] information/ApiListener: Applying configuration file update for path ‘C:\ProgramData\icinga2\var\lib\icinga2/api/zones/director-global’ (28934 Bytes). Received timestamp ‘2019-05-20 09:51:23 -0400’ (1558360283.743386), Current timestamp ‘2019-05-20 09:50:45 -0400’ (1558360245.799614).
[2019-05-20 09:51:25 -0400] information/ApiListener: Restarting after configuration change.
[2019-05-20 09:51:25 -0400] information/Application: Got reload command: Starting new instance.
[2019-05-20 09:51:26 -0400] critical/Application: Found error in config: reloading aborted
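The telling line above is the critical/Application entry. A minimal Python sketch for pulling such entries out of a long log, assuming the log format shown in these lines and the default log file path C:\ProgramData\icinga2\var\log\icinga2\icinga2.log (the file name is an assumption). It can only find what Icinga 2 actually managed to write to that file.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple, Tuple

# ASSUMPTION: default log file of the Windows agent.
ICINGA_LOG = r"C:\ProgramData\icinga2\var\log\icinga2\icinga2.log"

# Matches lines like:
# [2019-05-20 09:51:26 -0400] critical/Application: Found error in config: reloading aborted
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<severity>\w+)/(?P<facility>\w+): (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    timestamp: str
    severity: str
    facility: str
    message: str


def parse_entries(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield structured entries, skipping lines that do not match the format."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if match:
            yield LogEntry(match["ts"], match["severity"], match["facility"], match["message"])


def problems(lines: Iterable[str],
             severities: Tuple[str, ...] = ("critical", "warning")) -> List[LogEntry]:
    """Return only the entries whose severity is in `severities`."""
    return [e for e in parse_entries(lines) if e.severity in severities]


def print_problems(path: str = ICINGA_LOG) -> None:
    """Print critical/warning entries from the log file at `path`."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        for e in problems(fh):
            print(e.timestamp, f"{e.severity}/{e.facility}:", e.message)
```

Filtering on severity rather than on one specific message keeps the sketch useful for other failures too; pass severities=("critical",) to narrow it further.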
Your problem is an error in your deployed configuration. Right after the agent is installed, no configuration has been deployed from the master/satellite yet, so the service starts. After that the config is synced, but the reload fails because there is an error in your configuration. When you restart, Icinga tries to load the broken configuration and fails.
Run “icinga2.exe daemon -C” from the command line to see the error.
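To run that check across many agents from a script, a minimal Python sketch. It assumes the default install path C:\Program Files\ICINGA2\sbin\icinga2.exe and that the check exits non-zero when validation fails; verify both on your systems. The command can be overridden, which also makes the helper testable away from Windows.

```python
import subprocess
from typing import List, Optional, Tuple

# ASSUMPTION: default install path of the Windows agent; adjust if yours differs.
ICINGA2_EXE = r"C:\Program Files\ICINGA2\sbin\icinga2.exe"


def validate_config(command: Optional[List[str]] = None,
                    timeout: float = 300) -> Tuple[bool, str]:
    """Run the config check and return (passed, combined stdout and stderr).

    ASSUMPTION: a non-zero exit code means validation failed.
    """
    cmd = command if command is not None else [ICINGA2_EXE, "daemon", "-C"]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0, result.stdout + result.stderr


def critical_lines(output: str) -> List[str]:
    """Pick out the critical/ lines, which usually name the failing file and object."""
    return [line for line in output.splitlines() if "critical/" in line]
```

A typical call is ok, out = validate_config(), followed by printing critical_lines(out) when ok is False.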
I uninstalled the agent and redeployed it after removing the service template from Director, and now I do not get that error… perhaps I have to do this on all of my Windows servers now.
I’m not sure where master came from… or whether there is a better way to resolve this on a ton of servers than uninstalling and reinstalling the agent.
Is there a way to make the agent pull a fresh config?
I appreciate the help, I didn’t know about the config check on the command line so this whole exercise has been educational for me.
You can try to stop the service and delete everything under C:\ProgramData\icinga2\var\lib\icinga2\api\zones\*. After that start the service and the configuration should be downloaded from the master again.
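The three steps above can be scripted. A minimal Python sketch, assuming the Windows service is named icinga2 and the synced zones live in the default path mentioned here; treat both as assumptions. It uses "net stop"/"net start" because they wait for the service to change state, and it ignores the stop result since the service is usually already stopped in this situation.

```python
import shutil
import subprocess
from pathlib import Path

# ASSUMPTIONS: default zones directory and service name for the Windows agent.
ZONES_DIR = Path(r"C:\ProgramData\icinga2\var\lib\icinga2\api\zones")
SERVICE_NAME = "icinga2"


def clear_directory(directory: Path) -> int:
    """Delete everything inside `directory` (but not the directory itself).

    Returns the number of top-level items removed.
    """
    removed = 0
    for child in directory.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
        removed += 1
    return removed


def resync_agent_config(zones_dir: Path = ZONES_DIR, service: str = SERVICE_NAME) -> None:
    """Stop the service, wipe the synced zone config, then start the service again."""
    # No check=True here: "net stop" fails if the service has already died.
    subprocess.run(["net", "stop", service])
    cleared = clear_directory(zones_dir)
    print(f"Removed {cleared} item(s) from {zones_dir}")
    # Raises CalledProcessError if the service still refuses to start.
    subprocess.run(["net", "start", service], check=True)
```

If the start still fails after the resync, the broken configuration is presumably coming from the master again, and the config check on the command line is the next step.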