How do I monitor the primary, and send alerts if the primary goes down, from a Satellite?

I have a single Primary server with several Satellite servers. I read in the documentation that the best practice for an HA master configuration is to use a single database and connect both masters to the same IP (a VIP, if the database is a cluster behind a load balancer). Unless I’m missing something, that’s not actually HA, because the database/VIP is still a single point of failure.

For this reason, I’d like to stick with my current model of a single Primary and multiple Satellites. That said, I’d like to monitor a few things from one of my secondary nodes and have that secondary send alerts if those things go down. Specifically, I want a Satellite server to monitor the Primary node, as well as the underlying physical hypervisors that the Primary resides on.

The Primary is in one data center, and I’m putting a secondary into each of our other data centers across the globe.

I’m still fairly new to Icinga, but I believe my Master/Satellite setup is configured in Top Down Config Sync mode: when I make changes to the config on the Master, I don’t have to make any changes or restart anything on the Secondary for them to take effect.

I realize that the documentation encourages using a services.conf file in each zone and assigning services with something like assign where host.vars.agent_endpoint. That said, I’ve decided to define the service checks for each host (endpoint) in its own conf file, so that we can manage each host much more easily with Ansible.

I have an Ansible playbook that sets up monitoring for a specific endpoint. It can be run like so, and it places the configuration for the new host into the correct zone directory on the Master server:

#  ansible-playbook add_manual_host.yml -e "hostname=fqdn ip_address=172.16.x.x zone=europe http=false vnc=false linux=true" -K
#	Possible Zones:
#		- americas
#		- europe
#		- asia
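For illustration, here is a hypothetical per-host file the playbook might generate; the file path, host name, and custom variables are assumptions based on the playbook parameters above, not my actual output:

// /etc/icinga2/zones.d/europe/fqdn.conf (hypothetical playbook output)
object Host "fqdn" {
  check_command = "hostalive"
  address = "172.16.x.x"
  vars.agent_endpoint = name   // lets zone-wide apply rules match this host
  vars.linux = true            // mirrors the playbook's linux=true flag
}

object Service "ssh" {
  host_name = "fqdn"
  check_command = "ssh"
}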

How can I accomplish monitoring the Primary (and sending notifications) from a secondary server? On my Primary, I have multiple zones (one per data center) defined in /etc/icinga2/zones.conf, and the corresponding zone directories set up in /etc/icinga2/zones.d/.

Below are a few other details about my environment (stdout from the Primary). Thanks in advance!

root@icinga:/# icinga2 --version
icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.2-1)

Copyright (c) 2012-2022 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: Ubuntu
  Platform version: 20.04.3 LTS (Focal Fossa)
  Kernel: Linux
  Kernel version: 5.4.0-99-generic
  Architecture: x86_64

Build information:
  Compiler: GNU 9.3.0
  Build host: runner-hh8q3bz2-project-298-concurrent-0
  OpenSSL version: OpenSSL 1.1.1f  31 Mar 2020

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid
root@icinga:/# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-pgsql mainlog notification

We do this by simply using the cluster-zone check, defined at the satellites, for their parent zone. The notification objects are assigned on the satellites only, with a rule like assign where host.zone == ZoneName.
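As a sketch (the satellite host and zone names are placeholders, and mail-service-notification / icingaadmin are the stock objects shipped in conf.d):

// Defined at the satellite: checks the connection to the parent (master) zone.
apply Service "master-zone-health" {
  check_command = "cluster-zone"
  vars.cluster_zone = "master"                   // name of the parent zone
  assign where host.name == "my-satellite-fqdn"  // the satellite's own host object
}

// Notifications for everything in the satellite zone are sent by the satellite.
apply Notification "notify-from-satellite" to Service {
  import "mail-service-notification"
  users = [ "icingaadmin" ]
  assign where host.zone == "my-satellite-zone"
}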

For the underlying hypervisors, it would be easy in the case of vSphere, since you only need to define the service checks at your satellites accordingly.
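A minimal sketch, assuming the satellite should at least ping the hypervisors’ management interfaces (the host name and address are placeholders; a dedicated vSphere plugin would replace hostalive for deeper API checks):

// Placed in the satellite's zone directory on the master,
// so the satellite executes the check.
object Host "hypervisor-01" {
  check_command = "hostalive"
  address = "172.16.x.x"   // hypervisor management address (placeholder)
}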

Thank you for your response. Unfortunately, I must still be missing something.

Reading through the documentation about how to use cluster-zone, it appears that’s talking about a master-master setup, whereas I have a single master / multiple satellites.

Nevertheless, I tried doing the following:

cd /etc/icinga2/zones.d/custom/
root@icinga:/etc/icinga2/zones.d/custom# cat health.conf 
apply Service "agent-health" {
  check_command = "cluster-zone"

  display_name = "agent-health-" + host.name

  // This follows the convention that the agent zone name is the FQDN which is the same as the host object name.
  vars.cluster_zone = host.name

  // Create this health check for agent hosts in the satellite zone
  assign where host.zone == "custom" && host.vars.agent_endpoint
}

apply Dependency "agent-health-check" to Service {
  parent_service_name = "agent-health"

  states = [ OK ] // Fail if the parent service state switches to NOT-OK
  disable_notifications = true

  assign where host.zone == "custom" && host.vars.agent_endpoint // Automatically assigns all agent endpoint checks as child services on the matched host
  ignore where service.name == "agent-health" // Avoid a self reference from child to parent
}

… and then I went into the master zone and created a health.conf file:

root@icinga:/etc/icinga2/zones.d/master# cat health.conf 
apply Service "satellite-zone-health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s

  vars.cluster_zone = "custom"

  assign where match("icinga-master.fqdn.com", host.name)
}

… and I then added this Service object to the master host config file:

object Service "cluster" {
  host_name = "icinga-master"
  check_command = "cluster"
  check_interval = 5s
  retry_interval = 1s
}

This doesn’t appear to be working!
When I run icinga2 daemon -C, I see two warnings:

[2022-02-09 18:16:01 +0000] warning/ApplyRule: Apply rule 'agent-health-check' (in /etc/icinga2/zones.d/custom/health.conf: 13:1-13:48) for type 'Dependency' does not match anywhere!
[2022-02-09 18:16:01 +0000] warning/ApplyRule: Apply rule 'agent-health' (in /etc/icinga2/zones.d/custom/health.conf: 1:0-1:27) for type 'Service' does not match anywhere!

So then I switched gears and did what I thought was the correct thing for my situation. With the configuration below in place, icinga2 daemon -C didn’t produce any errors, but when I log into the icingaweb2 interface, I see a message that Zone master-host-BACKUP_MONITOR doesn’t exist.

On the master, I navigated to /etc/icinga2/zones.d/my-satellite, and I created the following file:

master-host.conf:

object Host "master-host-BACKUP_MONITOR" {
        check_command = "hostalive"
        address = "172.16.x.x"
}

object Service "Master Health" {
	check_command = "cluster-zone"
	host_name  = "master-host-BACKUP_MONITOR"
}

I should note that my zones have custom names (except for master), not the FQDN of the Icinga satellite server “controlling” each zone.

What am I still missing in either of these attempts?

Try this simple example:

apply Service "icinga_parent" {
   display_name = "Icinga Parent Zone"
   check_command = "cluster-zone"

   vars.cluster_zone = host.vars.icinga_parent

   assign where host.vars.icinga_parent
}
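For this rule to match, the host object you create for the master needs that custom variable set, e.g. (a hypothetical host definition, placed in the satellite’s zone so the satellite runs the check):

object Host "master-host-BACKUP_MONITOR" {
  check_command = "hostalive"
  address = "172.16.x.x"
  vars.icinga_parent = "master"   // the zone whose connectivity this check verifies
}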

It appears that I have monitoring of the Master working properly on the Satellite now. Thank you for your help. Unfortunately, the Satellite is not sending notifications, even when I “force” a test notification from the icingaweb2 interface.

Here’s what I see in the icinga2.log file on the Satellite when I tried to send a test notification from the Secondary:

[2022-02-10 11:26:12 +0000] information/HttpServerConnection: Request: POST /v1/actions/send-custom-notification (from [::ffff:127.0.0.1]:47096), user: icingaweb2, agent: , status: OK).

When I run this same test, on the same object, from the Primary, the notification gets sent correctly.

Notification is an enabled feature on the Satellite:

root@icinga:/var/log/icinga2# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-pgsql mainlog notification

I have reviewed Delegate notifications master/satellite - #5 by rsx, and I set up the following config file in my Satellite zone:

root@icinga:/etc/icinga2/zones.d/custom# cat satellite-notify.conf 
apply Notification "mail-from-satellite" to Service {
  import "mail-service-notification"
  user_groups = host.vars.notification.mail.groups
  users = host.vars.notification.mail.users
  assign where host.name == "master-host-BACKUP_MONITOR" && host.zone == "custom"
}

But that’s not working either.

So I’m not sure what I’m missing here.

Have you checked whether the notification object exists at your satellite, e.g. with icinga2 object list -n mail-from-satellite?

Oh, that’s interesting. Good catch!

From the Master, I see two objects when I run that command.
From the Satellite, I don’t see any objects.

So how do I get the Master to push these to the Satellite?

Best practice is to have global zones for these kinds of objects. In this case, a global zone, e.g. notifications, that exists on the master and all satellites would be enough.
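A minimal sketch (the zone name notifications is just an example):

// zones.conf, identical on the master and on every satellite
object Zone "notifications" {
  global = true
}

On the master, the notification rules and anything they import (templates, users) then go into /etc/icinga2/zones.d/notifications/, and the cluster config sync replicates them to every node that declares the zone.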

So it sounds like having this in /etc/icinga2/features-enabled/api.conf:

object ApiListener "api" {
  accept_config = true
  accept_commands = true
}

… and having this in zones.conf:

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

… isn’t enough.

I’m just now realizing that, with the way things are structured, I should probably have a folder named “global-templates” inside zones.d on the Master server, and put my config in there.

Thank you!

This is resolved (and I have now edited this post 2-3 different times).

Briefly, AFTER doing everything below (see “more details”), I had the brilliant idea to actually check the error logs. I found this snippet:

[2022-02-10 19:18:39 +0000] critical/ApiListener: Config validation failed for staged cluster config sync in '/var/lib/icinga2/api/zones-stage/'. Aborting. Logs: '/var/lib/icinga2/api/zones-stage//startup.log'

So then, I checked /var/lib/icinga2/api/zones-stage//startup.log, which led me to discover that the objects I was trying to create were trying to import things that didn’t exist.

My solution: In short, I had to move all of the relevant config files from /etc/icinga2/conf.d/ into /etc/icinga2/zones.d/global-templates/.
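For example, along these lines (a sketch assuming the default conf.d layout, where mail-service-notification lives in templates.conf and the notified users in users.conf):

# On the master: move the templates/users that the notification objects import
# into the global zone, so the config sync distributes them to the satellites.
mv /etc/icinga2/conf.d/templates.conf /etc/icinga2/zones.d/global-templates/
mv /etc/icinga2/conf.d/users.conf /etc/icinga2/zones.d/global-templates/
icinga2 daemon -C && systemctl restart icinga2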

More details, and how I went about troubleshooting:
So now I have these two files in global-templates:

root@icinga:/etc/icinga2/zones.d/global-templates# ls
health.conf  satellite-notify.conf

Contents of each file are below. After making these changes, I’m getting the same results.
The command icinga2 object list -n mail-from-satellite shows 2 objects on the master, and none on the Satellite, so I’m still missing something. I feel like I’m close though.

root@icinga:/etc/icinga2/zones.d/global-templates# cat health.conf 
apply Service "agent-health" {
  check_command = "cluster-zone"
  display_name = "agent-health-" + host.name
  vars.cluster_zone = host.name
  assign where host.zone == "my-satellite-zone" && host.vars.agent_endpoint
}

apply Dependency "agent-health-check" to Service {
  parent_service_name = "agent-health"
  states = [ OK ] 
  disable_notifications = true
  assign where host.zone == "my-satellite-zone" && host.vars.agent_endpoint // Automatically assigns all agent endpoint checks as child services on the matched host
  ignore where service.name == "agent-health" // Avoid a self reference from child to parent
}

apply Service "satellite-zone-health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s

  vars.cluster_zone = "master"

  assign where match("master-fqdn", host.name)
}

apply Service "satellite-master-health" {
  check_command = "cluster-zone"
  check_interval = 30s
  retry_interval = 10s
  vars.cluster_zone = "my-satellite-zone"
  assign where match("master-host-BACKUP_MONITOR", host.name)
}
root@icinga:/etc/icinga2/zones.d/global-templates# cat satellite-notify.conf 
apply Notification "mail-from-satellite" to Service {
  import "mail-service-notification"
  user_groups = host.vars.notification.mail.groups
  users = host.vars.notification.mail.users
  assign where host.name == "master-host-BACKUP_MONITOR" && host.zone == "my-satellite-zone"
}

Digging into this further: after I did the above, I renamed the “global-templates” folder on the Master to notifications (still in /etc/icinga2/zones.d/) and edited zones.conf accordingly to reference the new notifications zone. I then edited the zones.conf on the Satellite server to match:

object Zone "notifications" {
global = true
}

The same two files that I mentioned above exist in this new notifications zone.
I restarted icinga2 on both the master & the Satellite, and here is what I observed:

On the Master, I see the files in /var/lib/icinga2/api/zones/notifications/_etc/
On the Satellite, I see the files in /var/lib/icinga2/api/zones-stage/notifications/_etc/

But icinga2 object list -t Notification on the Satellite still produces no results, so I’m clearly still missing something.
