Windows Agent Service Crashing Randomly and Won't Restart

csmall · May 17, 2019, 12:50pm

I have had the Windows agent deployed to many servers for a couple of weeks now with no issues. Starting last night, the service stopped on some random servers… it fails to start when I try manually.

If I reinstall the agent, it runs… but dies again and ends up in the same state.

What can I look at to try and determine the cause? I see the log file in programdata/var/logs but I don’t see anything jump out at me.

Agent version 2.10.4

csmall · May 17, 2019, 2:06pm

Reinstalling again appears to be stable at the moment but I would like to know how to investigate the issue with the service if there is anything logged somewhere.

dnsmichi · May 20, 2019, 7:29am

Maybe you can share the logs, and we'll spot something that isn't so obvious.

Cheers,
Michael

csmall · May 20, 2019, 1:34pm

It continues to happen, on multiple versions of Windows Server. Everything will be fine for a while after a fresh install of the agent, and then I will see UNKNOWN for service checks in the web interface.

When I investigate, the Icinga2 service is stopped, and if I try to start it, it fails. The only way around it that I can find so far is to remove the Icinga2 agent and reinstall it. Then it seems fine again for a while.

I will see if I can sanitize the log file I have and share it. I don’t know what is happening here.

csmall · May 20, 2019, 2:04pm

I’m starting to think it doesn’t survive a reboot of the windows server… like it is fine until the server reboots and then it bombs… this concerns me more and I need to verify.

dnsmichi · May 20, 2019, 2:53pm

Maybe there’s a problem with DNS resolution. Try setting the NodeName constant in constants.conf to the FQDN, which prevents this lookup on startup. If that setting already exists, the problem is elsewhere but should be visible in the logs on startup as well.
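As a rough illustration of the check described above, here is a minimal sketch that tells whether a constants.conf already sets NodeName on an active (uncommented) line. The helper name `find_node_name` and the sample host name are hypothetical, and the sketch only looks at line-based `//` comments, not `/* ... */` blocks.

```python
import re
from typing import Optional

# Matches an active line such as: const NodeName = "agent01.example.com"
# A line starting with // (commented out) does not match.
NODE_NAME_RE = re.compile(r'^\s*const\s+NodeName\s*=\s*"([^"]*)"', re.MULTILINE)


def find_node_name(constants_text: str) -> Optional[str]:
    """Return the NodeName value set in the given constants.conf text, or None."""
    match = NODE_NAME_RE.search(constants_text)
    return match.group(1) if match else None
```

If this returns None for the agent's constants.conf, NodeName is not pinned and the name is looked up at startup, which is the case the suggestion above is meant to rule out.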

Cheers,
Michael

csmall · May 20, 2019, 3:03pm

I’ll try and report back

csmall · May 20, 2019, 3:18pm

I stopped the Icinga2 service on the Windows servers… waited a few minutes and started it again. It failed to start.

I set NodeName = “fqdn of server” and that did not resolve the issue. If I reinstall it, it will be fine until the service stops and tries to start again.

dnsmichi · May 20, 2019, 3:21pm

Without logs and more details, we won’t be able to dig this up unfortunately. The error is too vague.

csmall · May 20, 2019, 3:59pm

I have to sanitize the logs before I can upload them but I noticed a couple of interesting things.

Starting the service and letting it stop/crash does not seem to add anything to the icinga2 log in C:\ProgramData\icinga2\var\log\icinga2. There are plenty of log entries from today in there (probably from the reinstall I did), but starting the service doesn’t appear to touch this log file.

There is nothing in C:\ProgramData\icinga2\var\log\icinga2\crash either.

I see this in the log, not sure what it means or if it is relevant or not.

[2019-05-20 09:51:25 -0400] information/ApiListener: Applying configuration file update for path ‘C:\ProgramData\icinga2\var\lib\icinga2/api/zones/director-global’ (28934 Bytes). Received timestamp ‘2019-05-20 09:51:23 -0400’ (1558360283.743386), Current timestamp ‘2019-05-20 09:50:45 -0400’ (1558360245.799614).
[2019-05-20 09:51:25 -0400] information/ApiListener: Restarting after configuration change.
[2019-05-20 09:51:25 -0400] information/Application: Got reload command: Starting new instance.
[2019-05-20 09:51:26 -0400] critical/Application: Found error in config: reloading aborted

csmall · May 20, 2019, 5:47pm

This appears to affect all servers I have deployed the agent to. It is fine until a service restart/reboot. This makes me sad

I might try the snapshot agent instead of 2.10.4 … maybe something will be fixed.

csmall · May 20, 2019, 6:37pm

I am deploying the 2.10.4 agent via the Self-Service API and Director with the agent version/hash defined etc…

Is there an easy way to deploy the snapshot version to a test server instead of mass deploying it to all existing servers?

unic · May 20, 2019, 7:56pm

Your problem is an error in your deployed configuration. Right after installing the agent, no config has been deployed from the master/satellite yet, so the service starts. After that the config is synced, but the reload fails because there is an error in your configuration. The already running instance keeps going, but if you restart, Icinga will try to load the broken configuration and fail.

Check from commandline with “icinga2.exe daemon -C” to see the error.
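For checking many agents, the same validation could be wrapped in a small script. This is only a sketch: the install path in `ICINGA2_EXE` is an assumption (adjust it to where the agent is actually installed), `check_config` is a hypothetical helper, and it assumes the config check exits with a non-zero code when validation fails.

```python
import subprocess
from typing import Callable, Tuple

# Assumption: typical install location of the Windows agent; adjust as needed.
ICINGA2_EXE = r"C:\Program Files\ICINGA2\sbin\icinga2.exe"


def check_config(exe: str = ICINGA2_EXE, run: Callable = subprocess.run) -> Tuple[bool, str]:
    """Run `icinga2.exe daemon -C` and return (config_ok, combined_output)."""
    result = run([exe, "daemon", "-C"], capture_output=True, text=True)
    output = (result.stdout or "") + (result.stderr or "")
    # Assumption: a failed validation is reported through a non-zero exit code.
    return result.returncode == 0, output


# Example use on an agent:
#   ok, output = check_config()
#   print(output if not ok else "configuration is valid")
```

The `run` parameter exists only so the helper can be exercised without an Icinga installation; in normal use the default `subprocess.run` is what executes the command.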

csmall · May 20, 2019, 8:03pm

Thank you. I will run it at the command line and see what it says. Everything works…service checks etc… until the service restarts.

I hope it is something easy to resolve

csmall · May 20, 2019, 8:11pm

Interesting… it is failing because of a service check I created to monitor a business process totally unrelated to the servers in question.

Maybe if I get rid of this template/service it will start working again.

Here is the error:

[2019-05-20 16:07:05 -0400] information/cli: Icinga application loader (version: v2.10.4)
[2019-05-20 16:07:05 -0400] information/cli: Loading configuration file(s).
[2019-05-20 16:07:05 -0400] critical/config: Error: Object 'Business Process' of type 'Service' re-defined: in C:\ProgramData\icinga2\var\lib\icinga2\api\zones/master/director/service_templates.conf: 1:0-1:55; previous definition: in C:\ProgramData\icinga2\var\lib\icinga2\api\zones/director-global/director/service_templates.conf: 206:1-206:56
Location: in C:\ProgramData\icinga2\var\lib\icinga2\api\zones/master/director/service_templates.conf: 1:0-1:55
C:\ProgramData\icinga2\var\lib\icinga2\api\zones/master/director/service_templates.conf(1): template Service "Business Process" {
                                                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
C:\ProgramData\icinga2\var\lib\icinga2\api\zones/master/director/service_templates.conf(2):     check_command = "icingacli-businessprocess"
C:\ProgramData\icinga2\var\lib\icinga2\api\zones/master/director/service_templates.conf(3): }
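To make the conflict in that message easier to see, here is a small sketch that pulls the object name and the two zone directories out of a "re-defined" error line. The function names `zone_of` and `parse_redefined` are hypothetical, and the pattern is written against the message format shown above, nothing more.

```python
import re
from typing import Dict, Optional

# Pattern for lines like:
#   Object 'X' of type 'Y' re-defined: in <path>: 1:0-1:55; previous definition: in <path>: 206:1-206:56
REDEFINED_RE = re.compile(
    r"Object '(?P<name>[^']+)' of type '(?P<type>[^']+)' re-defined: "
    r"in (?P<new>.+?): \d+:\d+-\d+:\d+; "
    r"previous definition: in (?P<old>.+?): \d+:\d+-\d+:\d+"
)


def zone_of(path: str) -> Optional[str]:
    """Return the directory name that follows 'zones' in a synced config path."""
    parts = re.split(r"[\\/]", path)
    if "zones" not in parts:
        return None
    idx = parts.index("zones")
    return parts[idx + 1] if idx + 1 < len(parts) else None


def parse_redefined(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Extract object name, type, and both zones from a 're-defined' error line."""
    match = REDEFINED_RE.search(line)
    if not match:
        return None
    return {
        "name": match.group("name"),
        "type": match.group("type"),
        "new_zone": zone_of(match.group("new")),
        "old_zone": zone_of(match.group("old")),
    }
```

Applied to the error above, this reports that the 'Business Process' service template is defined once under the master zone and once under director-global, which is why the configuration fails to load.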

unic · May 20, 2019, 8:19pm

Why is the master zone deployed to the agent? That should not be the case.

That's the error. Get rid of these duplicate entries. Maybe your NodeName is wrong.

csmall · May 20, 2019, 8:21pm

The self-service api says director-global under global-zones… or is master coming from somewhere else?

I deleted the template from Director but it still has the same error… not sure where to look for the duplicate to remove it.

Do you mean manually remove the dupe from the windows agent configs?

csmall · May 20, 2019, 8:54pm

I uninstalled the agent and redeployed after removing the service template from Director, and now I do not get that error… perhaps I have to do this on all of my Windows servers now.

I’m not sure where master came from… or if there is a better way to resolve this on a ton of servers other than uninstall/reinstall of the agent.

Is there a way to make the agent pull a fresh config?

I appreciate the help, I didn’t know about the config check on the command line so this whole exercise has been educational for me.

fluxX · May 21, 2019, 5:52am

Hi,

You can try to stop the service and delete everything under C:\ProgramData\icinga2\var\lib\icinga2\api\zones\*. After that start the service and the configuration should be downloaded from the master again.
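For a larger number of servers, the delete step could be scripted. The following is a minimal sketch, not a tested rollout tool: `clear_synced_zones` is a hypothetical helper, it removes everything below the zones directory while keeping the directory itself, and stopping the service beforehand and starting it afterwards is left to whatever tooling you already use.

```python
import shutil
from pathlib import Path
from typing import List

# Path taken from the post above.
ZONES_DIR = Path(r"C:\ProgramData\icinga2\var\lib\icinga2\api\zones")


def clear_synced_zones(zones_dir: Path = ZONES_DIR) -> List[str]:
    """Delete every entry below zones_dir and return the removed names."""
    removed = []
    for entry in sorted(zones_dir.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # a synced zone directory and its contents
        else:
            entry.unlink()  # a stray file or link directly under zones
        removed.append(entry.name)
    return removed
```

Run it only while the Icinga 2 service is stopped; once the service is started again, the agent should receive the configuration from its parent as described above.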

Greetz