All service checks stopped working

Hello

I am almost at my wits' end. I just finished installing an Icinga 2 setup with 2 masters in HA and 2 satellites in HA, and I hope my configuration is correct. Details of the setup are in the link: Community Question

For a couple of days I have been building up the services on my 2 Linux boxes, deploying them with Director. What I noticed was a random “pending” state of services the first time I deployed them on a host template. If I selected a single server, the deployment worked fine. After that, deleting the service and redeploying it on the host template also worked fine.

Yesterday I was playing around with the check_logfile plugin. Because the icinga user could not read /var/log/messages, I set the permissions on the messages file to 640. It is probably not related, but that is the only change I made, apart from deleting services and deploying them again several times, first on the host, then on the host group, and so on.

After some time I saw that all the services were running late. Soon after, I realized that no services were being reported in the web2 portal any more.

I enabled the debug log and found that the agent on the end server is still running the service checks.
The Graphite browser also stopped reporting data trends. Once I restarted the database (Postgres), it seemed to work again for some time. But as soon as I restarted the satellite, it stopped again.

I have no clue what caused this or how to fix it. Any guidance on what to look at would be very helpful.

EDIT: I disabled ido-pgsql and stopped the secondary master, and it looks good. We are now getting the service checks done. Something is not correct in my secondary master configuration. I also stopped one satellite, waited, and then stopped the other. Everything looks OK on the satellite side. The secondary master seems to be the problem, but I don’t know why.
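
For context, here is a sketch of how the DB IDO feature is typically configured in an HA master zone, assuming both masters are meant to write to one shared PostgreSQL database. The file path is the usual default, and the host name, user, password, and database name are placeholders, not values from this setup:

```
// /etc/icinga2/features-enabled/ido-pgsql.conf, same content on both masters (assumed path)
object IdoPgsqlConnection "ido-pgsql" {
        host = "db.example.net"   // placeholder: the same database server on both masters
        user = "icinga"           // placeholder
        password = "CHANGEME"     // placeholder
        database = "icinga"       // placeholder
        enable_ha = true          // the default; only one endpoint in the zone writes at a time
}
```

With enable_ha = true, only one of the two masters is connected to the database at any time, and the other takes over if it goes away. If the node that takes over has no synced configuration, it may have nothing to report to the database.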

Hello

I have further observed that, since I am using Director to configure all the services as well as to add nodes, my .conf files are being created in

/var/lib/icinga2/api/zones/master/director
/var/lib/icinga2/api/zones/US_Satellite/director

etc.

But on the secondary master these folders are not present. Also,
/etc/icinga2/zones.d is empty on both servers.

I tried stopping the primary master and found that Icinga Web 2 went blank, with no nodes or services at all. I am guessing the primary master is not syncing the services to the secondary.
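
One thing that may be worth checking here, as a guess rather than a confirmed diagnosis: a secondary master only receives synced zone configuration if its API listener is set to accept it. A minimal sketch of that setting, assuming the default feature file location on the secondary master (ncvdl10.us.corp.net):

```
// /etc/icinga2/features-enabled/api.conf on the secondary master (assumed path)
object ApiListener "api" {
        accept_config = true    // accept zone config synced from the primary master
        accept_commands = true  // accept command execution messages from the other endpoint
}
```

accept_config defaults to false, and in that case the directories under /var/lib/icinga2/api/zones would not be populated on that node. Icinga 2 needs a restart after the setting is changed.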

Hi,

for me personally it is really hard to follow a long text with many details. Try to cut it down to all the steps you did, and also illustrate it with configuration snippets and log outputs.

Also, if you have used a docs URL or source, link it here so that everyone can learn what you’ve tried already.

First things first, please share the zones.conf Zone hierarchy to get a better picture :slight_smile:
Second to that, please add the output of icinga2 --version for all involved nodes.

Cheers,
Michael

Hi

Sorry, it was progressive: I was trying things myself in the background while asking for help, hence so much text.

Current problem: when the primary master goes down, Icinga Web 2 goes blank, as if the secondary master is not able to continue the monitoring, though I can see in the agent log that the checks are still happening.

I have deployed all services using Director.

zones.conf -> Primary Master

object Endpoint "ncvdl09.us.corp.net" {
        // Local Server
}

object Endpoint "ncvdl10.us.corp.net" {
        host = "192.168.1.154"
}

object Zone "master" {
        endpoints = [ "ncvdl09.us.corp.net", "ncvdl10.us.corp.net" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

zones.conf -> Secondary Master

object Endpoint "ncvdl10.us.corp.net" {
        // Local Server
}

object Endpoint "ncvdl09.us.corp.net" {
        // Remote Server Primary Master
        host = "53.242.35.151"
}

object Zone "master" {
        endpoints = [ "ncvdl10.us.corp.net", "ncvdl09.us.corp.net" ]
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

icinga2 --version

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.0-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Red Hat Enterprise Linux Server
  Platform version: 7.6 (Maipo)
  Kernel: Linux
  Kernel version: 3.10.0-957.5.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Since you are using the Director: are you managing the satellite zone inside the Infrastructure tab? We generally recommend building the zone trust relationship for the cluster config sync outside of it, in zones.conf, to prevent chicken-and-egg problems like zone-in-zone inception.

That being said, move the satellite zone definitions out of the Director and into zones.conf. That’s also described here.

Cheers,
Michael

I actually created zones.conf first.

Then I went to Director -> Activity Log -> Infrastructure -> Kickstart Wizard, added the Icinga primary host / port / API username and password, and imported it. Is that something I was not supposed to do?

Sure, that’s the default way of getting things done: fetching the external master zone Icinga knows about. Still, your zones.conf does not include any reference to the child zone called US_Satellite … where’s that defined?

All right, I missed it. In fact, this is my first Icinga configuration, so I am completely new to this.

Primary Master

object Endpoint "ncvdl09.us.corp.net" {
        //Local Server
}

object Endpoint "ncvdl10.us.corp.net" {
        host = "192.168.1.154"
}

object Zone "master" {
        endpoints = [ "ncvdl09.us.corp.net", "ncvdl10.us.corp.net" ]
}

object Endpoint "ncvdl12.us.corp.net" {
        host = "192.168.1.193"
}

object Endpoint "ncvdl11.us.corp.net" {
        host = "192.168.1.156"
}

object Zone "US_Satellite" {
        endpoints = [ "ncvdl12.us.corp.net", "ncvdl11.us.corp.net" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

Secondary Master

object Endpoint "ncvdl10.us.corp.net" {
       // Local Server
}

object Endpoint "ncvdl09.us.corp.net" {
       host = "192.168.1.151"
}

object Zone "master" {
        endpoints = [ "ncvdl10.us.corp.net", "ncvdl09.us.corp.net" ]
}

object Endpoint "ncvdl12.us.corp.net" {
        host = "192.168.1.193"
}

object Endpoint "ncvdl11.us.corp.net" {
        host = "192.168.1.156"
}

object Zone "US_Satellite" {
        endpoints = [ "ncvdl12.us.corp.net", "ncvdl11.us.corp.net" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

I stopped the primary master,
and :frowning: web2 went blank with no service checks visible. I can see in the agent debug log that it is still running the service checks.