Icinga master HA does not work as expected

Hello,

I set up two Icinga 2 masters in HA according to this article: https://icinga.com/blog/2020/10/01/how-to-set-up-high-availability-masters/

Ma01 is the primary master, ma02 the secondary master that has been set up as a satellite.

Test cases:

  1. Ma01 online, ma02 online: OK, on both icingaweb2 interfaces checks are shown as executed
  2. Ma01 online, ma02 offline: OK, on ma01 icingaweb2 interface checks are shown as executed
  3. Ma01 offline, ma02 online: NOK, the icingaweb2 interface on ma02 shows the message “Monitoring backend ‘m_monitoring_backend’ is not running.” even though ma02 is configured with the same PostgreSQL VIP as ma01.

There is also no icinga2 process running on ma02. I assume that is expected because it is set up as a satellite?

Could you help me understand the situation?

Shouldn’t an HA setup let the other node take over if the primary is offline?

Enabled features:

[root@otn-ac-monp-ma01 icinga2]# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-pgsql influxdb mainlog notification
[root@otn-ac-monp-ma02 icinga2]# icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb influxdb2 livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api checker ido-pgsql mainlog

ido-pgsql.conf on ma01:

object IdoPgsqlConnection "ido-pgsql" {
  user = "icinga"
  password = "xxx"
  host = "otn-ac-monp-dvip.local"
  database = "icinga"
  enable_ha=true
}

ido-pgsql.conf on ma02:

object IdoPgsqlConnection "ido-pgsql" {
  user = "icinga"
  password = "xxx"
  host = "otn-ac-monp-dvip.local"
  database = "icinga"
  enable_ha=true
}

Hello @lobr!

What do the Icinga 2 logs on ma02 say about the IDO?

Best,
A/K

Hi there,

Nope - the icinga2 process should be running on ma02 as well.

If it isn’t, that’s why no checks are executed and nothing is written to the database.

Enable the service so it starts at boot, start it, and run your tests again.

If it still doesn’t work after this, please show your zone configuration.
ma02 does need to be set up as a “satellite”, but only so that it doesn’t create its own CA - everything else should be the same as on ma01 in your master cluster.

BR,
Chris

What do the Icinga 2 logs on ma02 say about the IDO?

This is in the log of ma02 when ma01 is running. There are no check commands in the log :(
[2022-05-05 14:25:24 +0200] information/ApiListener: Our production configuration is more recent than the received configuration update. Ignoring configuration file update for path '/var/lib/icinga2/api/zones-stage/otn-ac-monp-sa05.localdomain'. Current timestamp '2022-05-05 14:25:16 +0200' (1651753516.959498) >= received timestamp '2022-05-05 14:25:16 +0200' (1651753516.959498).
[2022-05-05 14:25:24 +0200] information/ApiListener: Received configuration for zone 'otn-ac-monp-sa05.localdomain' from endpoint 'otn-ac-monp-ma01.localdomain'. Comparing the timestamp and checksums.
[2022-05-05 14:25:24 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/otn-ac-monp-sa05.localdomain//_etc/cloud/vSphere67/otn-div-ac/vm/DataCenter-Services/DataCenter-Services.conf' for zone 'otn-ac-monp-sa05.localdomain'.
[2022-05-05 14:25:24 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/otn-ac-monp-sa05.localdomain' (48386 Bytes).
[2022-05-05 14:25:24 +0200] information/ApiListener: Received configuration updates (7) from endpoint 'otn-ac-monp-ma01.localdomain' are equal to production, skipping validation and reload.
[2022-05-05 14:25:31 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 155.2/s (9312/min 9312/5min 9312/15min);
[2022-05-05 14:25:31 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2022-05-05 14:25:34 +0200] information/IdoPgsqlConnection: Pending queries: 1403 (Input: 2992/s; Output: 2923/s)
[2022-05-05 14:25:36 +0200] information/IdoPgsqlConnection: Finished reconnecting to 'ido-pgsql' database 'icinga' in 12.4162 second(s).
...
...
...
[2022-05-05 14:35:30 +0200] information/ApiListener: Our production configuration is more recent than the received configuration update. Ignoring configuration file update for path '/var/lib/icinga2/api/zones-stage/otn-ac-monp-sa05.localdomain'. Current timestamp '2022-05-05 14:35:16 +0200' (1651754116.809427) >= received timestamp '2022-05-05 14:35:16 +0200' (1651754116.809427).
[2022-05-05 14:35:30 +0200] information/ApiListener: Received configuration for zone 'otn-ac-monp-sa05.localdomain' from endpoint 'otn-ac-monp-ma01.localdomain'. Comparing the timestamp and checksums.
[2022-05-05 14:35:30 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/otn-ac-monp-sa05.localdomain//_etc/ac/vSphere67/otn-div-ac/vm/DataCenter-Services/DataCenter-Services.conf' for zone 'otn-ac-monp-sa05.localdomain'.
[2022-05-05 14:35:30 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/otn-ac-monp-sa05.localdomain' (48386 Bytes).
[2022-05-05 14:35:30 +0200] information/ApiListener: Received configuration updates (7) from endpoint 'otn-ac-monp-ma01.localdomain' are equal to production, skipping validation and reload.
[2022-05-05 14:35:34 +0200] information/IdoPgsqlConnection: Finished reconnecting to 'ido-pgsql' database 'icinga' in 4.0823 second(s).
[2022-05-05 14:35:36 +0200] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate: 102.433/s (6146/min 6146/5min 6146/15min);
[2022-05-05 14:35:36 +0200] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2022-05-05 14:35:39 +0200] information/IdoPgsqlConnection: Pending queries: 0 (Input: 997/s; Output: 997/s)


If it still doesn’t work after this, please show your zone configuration.

## MASTER 

object Endpoint "otn-ac-monp-ma01.localdomain" {
  host = "otn-ac-monp-ma01.localdomain"
}
object Zone "otn-ac-monp-ma01.localdomain" {
  endpoints = [ "otn-ac-monp-ma01.localdomain", "otn-ac-monp-ma02.localdomain"]
}

object Endpoint "otn-ac-monp-ma02.localdomain" {
}

## SAT
object Endpoint "otn-ac-monp-sa01.localdomain" {
  host = "otn-ac-monp-sa01.localdomain" // Actively connect to the secondary master
}
object Zone "otn-ac-monp-sa01.localdomain" {
  endpoints = [ "otn-ac-monp-sa01.localdomain" ]
  parent = "otn-ac-monp-ma01.localdomain"
}

object Endpoint "otn-ac-monp-sa02.localdomain" {
  host = "otn-ac-monp-sa02.localdomain" // Actively connect to the secondary master
}
object Zone "otn-ac-monp-sa02.localdomain" {
  endpoints = [ "otn-ac-monp-sa02.localdomain" ]
  parent = "otn-ac-monp-ma01.localdomain"
}

object Endpoint "otn-ac-monp-sa03.localdomain" {
  host = "otn-ac-monp-sa03.localdomain" // Actively connect to the secondary master
}
object Zone "otn-ac-monp-sa03.localdomain" {
  endpoints = [ "otn-ac-monp-sa03.localdomain" ]
  parent = "otn-ac-monp-ma01.localdomain"
}

object Endpoint "otn-ac-monp-sa04.localdomain" {
  host = "otn-ac-monp-sa04.localdomain" // Actively connect to the secondary master
}
object Zone "otn-ac-monp-sa04.localdomain" {
  endpoints = [ "otn-ac-monp-sa04.localdomain" ]
  parent = "otn-ac-monp-ma01.localdomain"
}

object Endpoint "otn-ac-monp-sa05.localdomain" {
  host = "otn-ac-monp-sa05.localdomain" // Actively connect to the secondary master
}
object Zone "otn-ac-monp-sa05.localdomain" {
  endpoints = [ "otn-ac-monp-sa05.localdomain" ]
  parent = "otn-ac-monp-ma01.localdomain"
}

object Endpoint "otn-ac-monp-sa06.localdomain" {
  host = "otn-ac-monp-sa06.localdomain" // Actively connect to the secondary master
}
object Zone "otn-ac-monp-sa06.localdomain" {
  endpoints = [ "otn-ac-monp-sa06.localdomain" ]
  parent = "otn-ac-monp-ma01.localdomain"
}


ma02 does need to be set up as a “satellite”, but only so that it doesn’t create its own CA - everything else should be the same as on ma01 in your master cluster.

I am not 100% sure whether I set it up as a master or as a satellite. icingaweb2 is available on ma02, which is what I wanted.
Is there a way to check now whether it is set up as a satellite or a master?

Hi there,

first of all:
If I get it right, you gave your master zone the same name as your CA master’s FQDN.
I’m not sure (I’ve never tried this), but it might cause problems.

You could try to rename it, but then you also need to think about the parent directive of your other zones and maybe your host/service/whatever templates that are placed in that zone.

Check whether the CA directory is present and contains ca.crt and ca.key.
It should only exist on your first master (your “CA master”).
It is only created by the master setup, not by the satellite node setup.
You could also look at your TicketSalt - same story here, it only comes with the master setup.

Please check a few things and share the results.
Run the following steps on both of your master nodes:

  1. icinga2 variable list (mask your TicketSalt with xxx)

  2. icinga2 object list --type zone --name <TYPE_ZONE_HERE>
    You get the ZoneName from step (1) - yes, it’s in your config, but just to be sure, take it from there

  3. icinga2 object list --type endpoint --name <TYPE_ENDPOINT_HERE>
    You get your endpoints from step (2) - here again, they’re in your posted config, but just to be sure

Please share the output of all these steps - cut sensitive information if you need to.

BR,
Chris


@ChrissK

Isn’t this best practice? See: https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#endpoints

/var/lib/icinga2/certs on ma01:

26200253 4 -rw-r--r--. 1  998 icinga 1720 Mar 21 17:17 ca.crt
26200252 4 -rw-r--r--. 1  998 icinga 1846 Mar 21 17:17 otn-ac-monp-ma01.localdomain.crt
26200251 4 -rw-r--r--. 1  998 icinga 1736 Mar 21 17:17 otn-ac-monp-ma01.localdomain.csr
26200250 4 -rw-------. 1  998 icinga 3247 Mar 21 17:17 otn-ac-monp-ma01.localdomain.key
26311172 4 -rw-r--r--  1  998 icinga 1895 Mar 21 17:51 otn-ac-monp-sa01.localdomain.crt
26311169 4 -rw-------  1  998 icinga 3243 Mar 21 17:51 otn-ac-monp-sa01.localdomain.key
26311174 4 -rw-r--r--  1  998 icinga 1895 Mar 21 17:51 otn-ac-monp-sa02.localdomain.crt
26311173 4 -rw-------  1  998 icinga 3247 Mar 21 17:51 otn-ac-monp-sa02.localdomain.key
26308181 4 -rw-r--r--  1  998 icinga 1895 Mar 21 17:51 otn-ac-monp-sa03.localdomain.crt
26308178 4 -rw-------  1  998 icinga 3243 Mar 21 17:51 otn-ac-monp-sa03.localdomain.key
26311176 4 -rw-r--r--  1  998 icinga 1895 Mar 21 17:51 otn-ac-monp-sa04.localdomain.crt
26311171 4 -rw-------  1  998 icinga 3243 Mar 21 17:51 otn-ac-monp-sa04.localdomain.key
26311178 4 -rw-r--r--  1  998 icinga 1895 Mar 21 17:51 otn-ac-monp-sa05.localdomain.crt
26311177 4 -rw-------  1  998 icinga 3243 Mar 21 17:51 otn-ac-monp-sa05.localdomain.key
26311179 4 -rw-r--r--  1  998 icinga 1899 Mar 21 17:51 otn-ac-monp-sa06.localdomain.crt
26308184 4 -rw-------  1  998 icinga 3247 Mar 21 17:51 otn-ac-monp-sa06.localdomain.key

/var/lib/icinga2/certs on ma02:

26200253 4 -rw-r--r--. 1 icinga icinga 1720 Apr 13 17:54 ca.crt
26191552 4 -rw-r--r--  1 icinga icinga 1720 Apr 13 17:43 ca.crt.orig
26200252 4 -rw-r--r--. 1 icinga icinga 1846 Apr 13 17:54 otn-ac-monp-ma02.localdomain.crt
26191551 4 -rw-r--r--  1 icinga icinga 1846 Apr 13 17:30 otn-ac-monp-ma02.localdomain.crt.orig
26200251 4 -rw-r--r--. 1 icinga icinga 1736 Mar 21 17:17 otn-ac-monp-ma02.localdomain.csr
26200250 4 -rw-------. 1 icinga icinga 3243 Apr 13 17:53 otn-ac-monp-ma02.localdomain.key
26191531 4 -rw-------  1 icinga icinga 3243 Apr 13 17:30 otn-ac-monp-ma02.localdomain.key.orig
26191557 4 -rw-------  1 icinga icinga   40 Apr 13 17:55 ticket

icinga2 variable list on ma01

ActiveStages = {
        _api = "5f0ba41f-68cd-4055-aafc-0349fb0628ce"
}
Icinga = Object of type 'Namespace'
Internal = Object of type 'Namespace'
ManubulonPluginDir = /usr/lib64/nagios/plugins
MaxConcurrentChecks = 512
NodeName = otn-ac-monp-ma01.localdomain
NscpPath = 
PluginContribDir = /usr/lib64/nagios/plugins
PluginDir = /usr/lib64/nagios/plugins
ReloadTimeout = 300
StatsFunctions = Object of type 'Namespace'
System = Object of type 'Namespace'
TicketSalt = xxx
Types = Object of type 'Namespace'
ZoneName = otn-ac-monp-ma01.localdomain
api_token = xxx
vars = {
        ilo_pwd = "f33cadcf776bbe2ab6a97c7bc00476af"
        ilo_user = "ac-ilo-operator"
}

icinga2 variable list on ma02

ActiveStages = {
        _api = "4b401d99-0ddb-4f22-b048-c7fd84c4192e"
}
Icinga = Object of type 'Namespace'
Internal = Object of type 'Namespace'
ManubulonPluginDir = /usr/lib64/nagios/plugins
MaxConcurrentChecks = 512
NodeName = otn-ac-monp-ma02.localdomain
NscpPath = 
PluginContribDir = /usr/lib64/nagios/plugins
PluginDir = /usr/lib64/nagios/plugins
ReloadTimeout = 300
StatsFunctions = Object of type 'Namespace'
System = Object of type 'Namespace'
TicketSalt = 
Types = Object of type 'Namespace'
ZoneName = otn-ac-monp-ma02.localdomain
api_token = xxx

(there is no ticket salt)

icinga2 object list --type zone --name "otn-ac-monp-ma01.localdomain" on ma01

Object 'otn-ac-monp-ma01.localdomain' of type 'Zone':
  % declared in '/etc/icinga2/zones.conf', lines 43:1-43:60
  * __name = "otn-ac-monp-ma01.localdomain"
  * endpoints = [ "otn-ac-monp-ma01.localdomain", "otn-ac-monp-ma02.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 44:3-44:115
  * global = false
  * name = "otn-ac-monp-ma01.localdomain"
  * package = "_etc"
  * parent = ""
  * source_location
    * first_column = 1
    * first_line = 43
    * last_column = 60
    * last_line = 43
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "otn-ac-monp-ma01.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 43:1-43:60
  * type = "Zone"
  * zone = ""

icinga2 object list --type zone --name "otn-ac-monp-ma02.localdomain" on ma02

Object 'otn-ac-monp-ma02.localdomain' of type 'Zone':
  % declared in '/etc/icinga2/zones.conf', lines 18:1-18:60
  * __name = "otn-ac-monp-ma02.localdomain"
  * endpoints = [ "otn-ac-monp-ma02.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 19:2-19:65
  * global = false
  * name = "otn-ac-monp-ma02.localdomain"
  * package = "_etc"
  * parent = "otn-ac-monp-ma01.localdomain"
    % = modified in '/etc/icinga2/zones.conf', lines 20:2-20:58
  * source_location
    * first_column = 1
    * first_line = 18
    * last_column = 60
    * last_line = 18
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "otn-ac-monp-ma02.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 18:1-18:60
  * type = "Zone"
  * zone = ""

icinga2 object list --type endpoint --name "otn-ac-monp-ma01.localdomain" on ma01

Object 'otn-ac-monp-ma01.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 40:1-40:64
  * __name = "otn-ac-monp-ma01.localdomain"
  * host = "otn-ac-monp-ma01.localdomain"
    % = modified in '/etc/icinga2/zones.conf', lines 41:3-41:57
  * log_duration = 86400
  * name = "otn-ac-monp-ma01.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 40
    * last_column = 64
    * last_line = 40
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "otn-ac-monp-ma01.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 40:1-40:64
  * type = "Endpoint"
  * zone = ""

icinga2 object list --type endpoint --name "otn-ac-monp-ma02.localdomain" on ma01

Object 'otn-ac-monp-ma02.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 47:1-47:64
  * __name = "otn-ac-monp-ma02.localdomain"
  * host = ""
  * log_duration = 86400
  * name = "otn-ac-monp-ma02.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 47
    * last_column = 64
    * last_line = 47
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "otn-ac-monp-ma02.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 47:1-47:64
  * type = "Endpoint"
  * zone = ""

icinga2 object list --type endpoint --name "otn-ac-monp-ma02.localdomain" on ma02

Object 'otn-ac-monp-ma02.localdomain' of type 'Endpoint':
  % declared in '/etc/icinga2/zones.conf', lines 15:1-15:64
  * __name = "otn-ac-monp-ma02.localdomain"
  * host = ""
  * log_duration = 86400
  * name = "otn-ac-monp-ma02.localdomain"
  * package = "_etc"
  * port = "5665"
  * source_location
    * first_column = 1
    * first_line = 15
    * last_column = 64
    * last_line = 15
    * path = "/etc/icinga2/zones.conf"
  * templates = [ "otn-ac-monp-ma02.localdomain" ]
    % = modified in '/etc/icinga2/zones.conf', lines 15:1-15:64
  * type = "Endpoint"
  * zone = ""

Hi there,

Naming endpoints after the host’s FQDN - yes, that is best practice, but not the zone.
On agents you would name the zone the same as the endpoint.
But for real clusters (masters / satellites) you would give the zone a different name (e.g. “master”).
This way you will never be confused about whether you are looking at a zone name or an endpoint name :wink:
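
If you ever decide to go that way, here is a rough sketch of how zones.conf could look with a dedicated master zone. The zone name "master" is only an example and not something from your current config. I am also assuming your satellite zones keep their current names; only sa01 is shown, the other satellites would change in the same way.

```
// Sketch only: "master" is an example zone name (assumption)

object Endpoint "otn-ac-monp-ma01.localdomain" {
  host = "otn-ac-monp-ma01.localdomain"
}

object Endpoint "otn-ac-monp-ma02.localdomain" {
  host = "otn-ac-monp-ma02.localdomain"
}

// One zone that contains both master endpoints
object Zone "master" {
  endpoints = [ "otn-ac-monp-ma01.localdomain", "otn-ac-monp-ma02.localdomain" ]
}

object Endpoint "otn-ac-monp-sa01.localdomain" {
  host = "otn-ac-monp-sa01.localdomain"
}

// Satellite zones have to point to the renamed parent zone
object Zone "otn-ac-monp-sa01.localdomain" {
  endpoints = [ "otn-ac-monp-sa01.localdomain" ]
  parent = "master"
}
```

The same zone name would then have to be used on ma02 and on the satellites as well. If you keep config under /etc/icinga2/zones.d/<zone name>/, the directory for the master zone would have to be renamed to match.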

As far as I can see, you set up master ma01 with a zone whose endpoints are ma01 and ma02.
But on ma02 you set up a different zone with only its own host as endpoint - and that’s your problem here.

Both nodes should have the same zone definition, since you want them to run in one cluster (one zone) with two endpoints.

Your zone / endpoint config should look like the following:

ma01:

object Endpoint "otn-ac-monp-ma01.localdomain" {
}

object Zone "otn-ac-monp-ma01.localdomain" {
  endpoints = [ "otn-ac-monp-ma01.localdomain", "otn-ac-monp-ma02.localdomain"]
}

object Endpoint "otn-ac-monp-ma02.localdomain" {
  host = "otn-ac-monp-ma02.localdomain"
}

ma02:

object Endpoint "otn-ac-monp-ma01.localdomain" {
  host = "otn-ac-monp-ma01.localdomain"
}

object Zone "otn-ac-monp-ma01.localdomain" {
  endpoints = [ "otn-ac-monp-ma01.localdomain", "otn-ac-monp-ma02.localdomain"]
}

object Endpoint "otn-ac-monp-ma02.localdomain" {
}

This way ma01 and ma02 are both set up to run in zone “otn-ac-monp-ma01.localdomain”.
You could rename the zone to “master”, but I guess you don’t need to (in my opinion the current name is ugly, but after seeing your configs I’m pretty sure it doesn’t cause your problems).

Also, you don’t have to configure “host” in the endpoint config for ma01 on ma02.
The way shown above means that ma01 tries to connect to ma02 and vice versa.
Whichever node is “faster” will establish the connection - this shouldn’t cause any problems at all (I do it this way in my setups as well).
But if you want to stick closer to the docs - there (if I remember correctly) it says to configure only one direction, which means leaving “host” blank on ma02:

ma02:

object Endpoint "otn-ac-monp-ma01.localdomain" {
}

object Zone "otn-ac-monp-ma01.localdomain" {
  endpoints = [ "otn-ac-monp-ma01.localdomain", "otn-ac-monp-ma02.localdomain"]
}

object Endpoint "otn-ac-monp-ma02.localdomain" {
}

So basically you need to reconfigure ma02 (just change the config manually) so that it is in the same zone as ma01, as shown above.
If you have zones that are configured with “otn-ac-monp-ma02.localdomain” as their parent zone, you need to change them as well.
After this there won’t be any zone named “otn-ac-monp-ma02.localdomain”.
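
As a sketch of what such a parent change could look like on ma02: I don’t know how the satellite zones are actually defined in ma02’s zones.conf, so the old parent value in the comment is an assumption. The new value mirrors what you posted from ma01.

```
// zones.conf on ma02 (sketch, only sa01 shown)

object Endpoint "otn-ac-monp-sa01.localdomain" {
  host = "otn-ac-monp-sa01.localdomain"
}

object Zone "otn-ac-monp-sa01.localdomain" {
  endpoints = [ "otn-ac-monp-sa01.localdomain" ]
  // before (assumed): parent = "otn-ac-monp-ma02.localdomain"
  parent = "otn-ac-monp-ma01.localdomain"
}
```

After the change, the zone and endpoint definitions for the satellites should read the same on ma01 and ma02.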

One more thing - I can’t tell from your output whether ma02 is using a different CA.
You posted the directory /var/lib/icinga2/certs/, but the CA (with its key) is found here:

ls -laF /var/lib/icinga2/ca/

Since there is no TicketSalt on ma02, there shouldn’t be a separate CA of its own either.
But please check this before editing the zone.
If /var/lib/icinga2/ca/ doesn’t exist on ma02, you’re good to go with the changes described above.
If it exists on ma02 (and contains ca.crt and ca.key), you should stop and check whether ca.crt and ca.key on ma01 and ma02 are the same (copied over) or different.
A different CA could cause trouble, since all certs on ma02 and its children would be signed by ma02’s CA and those nodes won’t communicate with ma01’s zone …

Try this, restart and check if it’s working now.

Maybe it will work right away - or maybe it will report something like the following message in icinga2.log:

Our production configuration is more recent than the received configuration update.

If it does, you will need to clean up the local configuration, as it will prevent configs from syncing from ma01 to ma02 - but that would be a whole different story (just telling you now so you can check the logs for this).

Keep me posted, BR
Chris

Thank you so much @ChrissK!!
You made my day! :star_struck:
