Implementing HA Master-Master setup

@log1c
Thank you, that documentation was very helpful! :slight_smile:

I managed to get master2 to take over the checks when icinga2.service on master1 goes down (I can see that with tcpdump).

[2022-02-10 14:15:27 +0100] warning/JsonRpcConnection: API client disconnected for identity 'otn-ac-monq-ma01.aircloud.common.airbusds.corp'
[2022-02-10 14:15:27 +0100] warning/ApiListener: Removing API client for endpoint 'otn-ac-monq-ma01.aircloud.common.airbusds.corp'. 0 API clients left.

But sadly master2 is not writing to the IDO.
Does a specific feature from master1 need to be enabled on master2?
Currently only these are enabled on master2:

  • api.conf
  • checker.conf
  • mainlog.conf

The ido-pgsql.conf on master1 contains:

object IdoPgsqlConnection "ido-pgsql" {
  user = "icinga"
  password = "xxx"
  host = "otn-ac-monq-db01.localdomain"
  database = "icinga"
  enable_ha = true
}

debug.log of master2 (= otn-ac-monq-sa01) while master1 is down on purpose:

notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-ma01.localdomain' because the host/port attributes are missing.
debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-sa01.localdomain' because that's us.
notice/ApiListener: Current zone master: otn-ac-monq-sa01.localdomain
notice/ApiListener: Updating object authority for objects at endpoint 'otn-ac-monq-sa01.localdomain'.
notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
debug/CheckerComponent: Scheduling info for checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' (2022-02-10 14:33:08 +0100): Object 'otn-ac-monq-sa02.localdomain!ping4onZone', Next Check: 2022-02-10 14:33:08 +0100(1.6445e+09).
debug/CheckerComponent: Executing check for 'otn-ac-monq-sa02.localdomain!ping4onZone'
notice/ApiListener: Connected endpoints: 
notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
debug/Checkable: Update checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' with check interval '20' from last check time at 2022-02-10 14:32:43 +0100 (1.6445e+09) to next check time at 2022-02-10 14:33:49 +0100 (1.6445e+09).
notice/ApiListener: Relaying 'event::SetNextCheck' message
notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-4' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '200,15%' '-w' '100,5%': PID 4031
debug/CheckerComponent: Check finished for object 'otn-ac-monq-sa02.localdomain!ping4onZone'

Is the ido-pgsql feature enabled on master2 and the config correct as well?

Both masters need to have the same features enabled, or they won't be able to take over all tasks from one another (e.g. the IDO connection or notifications).

From the docs:

Note: All nodes in the same zone require that you enable the same features for high-availability (HA).
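
For the IDO specifically, here is a rough sketch of what I would expect to be identical on both masters. The connection values are copied from the post above. `failover_timeout` is my addition for illustration only; as far as I know 30s is already the default, so leaving it out should behave the same. With `enable_ha = true` only one node in the zone writes to the database at a time, and the other one should take over once the timeout has passed.

```icinga2
// /etc/icinga2/features-enabled/ido-pgsql.conf
// Same content on master1 and master2.
object IdoPgsqlConnection "ido-pgsql" {
  user = "icinga"
  password = "xxx"
  host = "otn-ac-monq-db01.localdomain"
  database = "icinga"
  enable_ha = true        // only the active node in the zone writes to the IDO
  failover_timeout = 30s  // assumption: shown for illustration, believed to be the default
}
```

Both masters also need network access to the database host and working credentials, otherwise the second node cannot take over the connection.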

Good point!
I did copy /etc/icinga2/features-enabled/ido-pgsql.conf (you can see the content in my post above) from master1 to master2, enabled the ido-pgsql feature on master2 and restarted both nodes.

Sadly master2 is still not writing to the IDO when master1 is down.

All features from master1 are enabled on master2 as well:

> Disabled features: command compatlog elasticsearch gelf graphite icingadb influxdb influxdb2 livestatus opentsdb perfdata statusdata syslog
> Enabled features: api checker debuglog ido-pgsql mainlog notification

Another topic related to HA:
Hosts whose checks are forced to be executed by a specific endpoint (e.g. a satellite) do not seem to be checked by the master if the satellite goes down. Is that behavior intended, or should the master take over the check?

The HA functionality works for nodes inside the same zone. So a host object which is in another zone (e.g. satellite-zone1) will not be checked by a node (be it a master or a satellite) from a different zone (e.g. master or satellite-zone2).
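
If you want checks in a satellite zone to survive the loss of one satellite, the usual approach is to put a second endpoint into that same zone. A minimal sketch, assuming two satellites; the endpoint names and the zone name are hypothetical placeholders, not taken from your setup:

```icinga2
// zones.conf excerpt on the masters (all names below are placeholders)
object Endpoint "satellite1.example.org" {
  host = "satellite1.example.org"
}

object Endpoint "satellite2.example.org" {
  host = "satellite2.example.org"
}

// Two endpoints in one zone share the checks of that zone
// and take over from each other if one of them goes down.
object Zone "satellite-zone1" {
  endpoints = [ "satellite1.example.org", "satellite2.example.org" ]
  parent = "master"
}
```

Both satellites would then need the same features enabled, just like the two masters.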

Check the icinga2.log around the time the master1 goes down.
Maybe even enable the debug log and check again.
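
For reference, the debuglog feature is, as far as I remember, just a FileLogger object along these lines. The object name and path are how I recall the stock feature file, so please compare with your own features-available/debuglog.conf rather than taking this as authoritative:

```icinga2
// features-available/debuglog.conf (as I recall the packaged default)
object FileLogger "debug-file" {
  severity = "debug"
  path = LogDir + "/debug.log"
}
```

Remember to disable it again afterwards, since the debug log grows quickly.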

@log1c

zones.conf on ma01 (MASTER)

object Endpoint "otn-ac-monq-ma01.localdomain" {
  // That's us
}

object Endpoint "otn-ac-monq-sa01.localdomain" {
  host = "otn-ac-monq-sa01.localdomain" // Actively connect to the secondary master
}

object Zone "master" {
  endpoints = [ "otn-ac-monq-ma01.localdomain", "otn-ac-monq-sa01.localdomain" ]
}

zones.conf on sa01 (SATELLITE) → should take over for ma01 in case of outage

object Endpoint "otn-ac-monq-ma01.localdomain" {
  // The first master already connects to us
}

object Zone "master" {
  endpoints = [ "otn-ac-monq-ma01.localdomain", "otn-ac-monq-sa01.localdomain" ]
}

object Endpoint "otn-ac-monq-sa01.localdomain" {
  // That's us
}

debug.log of sa01 (SATELLITE) when the icinga2 service on ma01 was shut down at 14:54:47


[2022-02-14 14:54:29 +0100] notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
[2022-02-14 14:54:29 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:54:29 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:54:29 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-4' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '200,15%' '-w' '100,5%': PID 19910
[2022-02-14 14:54:30 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0.0333333
[2022-02-14 14:54:30 +0100] notice/ApiListener: Setting log position for identity 'otn-ac-monq-ma01.localdomain': 2022/02/14 14:54:29
[2022-02-14 14:54:30 +0100] notice/JsonRpcConnection: Received 'event::SetLastCheckStarted' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:30 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:30 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:30 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:32 +0100] information/RemoteCheckQueue: items: 0, rate: 0/s (6/min 30/5min 90/15min);
[2022-02-14 14:54:33 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:33 +0100] notice/JsonRpcConnection: Received 'event::Heartbeat' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:33 +0100] notice/Process: PID 19910 ('/usr/lib64/nagios/plugins/check_ping' '-4' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '200,15%' '-w' '100,5%') terminated with exit code 0
[2022-02-14 14:54:33 +0100] notice/ApiListener: Sending message 'event::CheckResult' to 'otn-ac-monq-ma01.localdomain'
[2022-02-14 14:54:33 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:33 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:54:33 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:33 +0100] debug/Checkable: Update checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' with check interval '20' from last check time at 2022-02-14 14:54:33 +0100 (1.64485e+09) to next check time at 2022-02-14 14:54:53 +0100 (1.64485e+09).
[2022-02-14 14:54:33 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:54:33 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2022-02-14 14:54:35 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'otn-ac-monq-sa01.localdomain'.
[2022-02-14 14:54:35 +0100] debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-ma01.localdomain' because the host/port attributes are missing.
[2022-02-14 14:54:35 +0100] debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-sa01.localdomain' because that's us.
[2022-02-14 14:54:35 +0100] notice/ApiListener: Current zone master: otn-ac-monq-ma01.localdomain
[2022-02-14 14:54:35 +0100] notice/ApiListener: Connected endpoints: otn-ac-monq-ma01.localdomain (1)
[2022-02-14 14:54:35 +0100] notice/ApiListener: Setting log position for identity 'otn-ac-monq-ma01.localdomain': 2022/02/14 14:54:33
[2022-02-14 14:54:35 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0.05
[2022-02-14 14:54:38 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:40 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0.05
[2022-02-14 14:54:40 +0100] notice/ApiListener: Setting log position for identity 'otn-ac-monq-ma01.localdomain': 2022/02/14 14:54:33
[2022-02-14 14:54:43 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:45 +0100] notice/JsonRpcConnection: Received 'event::SetLastCheckStarted' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:45 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:45 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:45 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'otn-ac-monq-ma01.localdomain'.
[2022-02-14 14:54:45 +0100] debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-ma01.localdomain' because the host/port attributes are missing.
[2022-02-14 14:54:45 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'otn-ac-monq-sa01.localdomain'.
[2022-02-14 14:54:45 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0.05
[2022-02-14 14:54:45 +0100] notice/ApiListener: Setting log position for identity 'otn-ac-monq-ma01.localdomain': 2022/02/14 14:54:45
[2022-02-14 14:54:45 +0100] debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-sa01.localdomain' because that's us.
[2022-02-14 14:54:45 +0100] notice/ApiListener: Current zone master: otn-ac-monq-ma01.localdomain
[2022-02-14 14:54:45 +0100] notice/ApiListener: Connected endpoints: otn-ac-monq-ma01.localdomain (1)
[2022-02-14 14:54:47 +0100] notice/JsonRpcConnection: Error while reading JSON-RPC message for identity 'otn-ac-monq-ma01.localdomain': Error: short read

Stacktrace:
 0# __cxa_throw in /usr/lib64/icinga2/sbin/icinga2
 1# icinga::NetString::ReadStringFromStream(boost::intrusive_ptr<icinga::Shared<icinga::AsioTlsStream> > const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) in /usr/lib64/icinga2/sbin/icinga2
 2# icinga::JsonRpc::ReadMessage(boost::intrusive_ptr<icinga::Shared<icinga::AsioTlsStream> > const&, boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >, long) in /usr/lib64/icinga2/sbin/icinga2
 3# icinga::JsonRpcConnection::HandleIncomingMessages(boost::asio::basic_yield_context<boost::asio::executor_binder<void (*)(), boost::asio::executor> >) in /usr/lib64/icinga2/sbin/icinga2
 4# 0x0000000000AFA8B7 in /usr/lib64/icinga2/sbin/icinga2
 5# 0x0000000000B06A89 in /usr/lib64/icinga2/sbin/icinga2
 6# make_fcontext in /lib64/libboost_context.so.1.69.0
[2022-02-14 14:54:47 +0100] warning/JsonRpcConnection: API client disconnected for identity 'otn-ac-monq-ma01.localdomain'
[2022-02-14 14:54:47 +0100] warning/ApiListener: Removing API client for endpoint 'otn-ac-monq-ma01.localdomain'. 0 API clients left.
[2022-02-14 14:54:47 +0100] debug/EndpointDbObject: update is_connected=0 for endpoint 'otn-ac-monq-ma01.localdomain'
[2022-02-14 14:54:50 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0.0333333
[2022-02-14 14:54:55 +0100] debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-ma01.localdomain' because the host/port attributes are missing.
[2022-02-14 14:54:55 +0100] notice/ApiListener: Updating object authority for objects at endpoint 'otn-ac-monq-sa01.localdomain'.
[2022-02-14 14:54:55 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0.0333333
[2022-02-14 14:54:55 +0100] debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-sa01.localdomain' because that's us.
[2022-02-14 14:54:55 +0100] notice/ApiListener: Current zone master: otn-ac-monq-sa01.localdomain
[2022-02-14 14:54:55 +0100] notice/ApiListener: Connected endpoints: 
[2022-02-14 14:54:55 +0100] debug/CheckerComponent: Scheduling info for checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' (2022-02-14 14:54:53 +0100): Object 'otn-ac-monq-sa02.localdomain!ping4onZone', Next Check: 2022-02-14 14:54:53 +0100(1.64485e+09).
[2022-02-14 14:54:55 +0100] debug/CheckerComponent: Executing check for 'otn-ac-monq-sa02.localdomain!ping4onZone'
[2022-02-14 14:54:55 +0100] debug/Checkable: Update checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' with check interval '20' from last check time at 2022-02-14 14:54:33 +0100 (1.64485e+09) to next check time at 2022-02-14 14:55:15 +0100 (1.64485e+09).
[2022-02-14 14:54:55 +0100] notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
[2022-02-14 14:54:55 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:54:55 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-4' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '200,15%' '-w' '100,5%': PID 19912
[2022-02-14 14:54:55 +0100] debug/CheckerComponent: Check finished for object 'otn-ac-monq-sa02.localdomain!ping4onZone'
[2022-02-14 14:54:59 +0100] notice/Process: PID 19912 ('/usr/lib64/nagios/plugins/check_ping' '-4' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '200,15%' '-w' '100,5%') terminated with exit code 0
[2022-02-14 14:54:59 +0100] debug/Checkable: Update checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' with check interval '20' from last check time at 2022-02-14 14:54:59 +0100 (1.64485e+09) to next check time at 2022-02-14 14:55:19 +0100 (1.64485e+09).
[2022-02-14 14:54:59 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:54:59 +0100] notice/ApiListener: Relaying 'event::CheckResult' message
[2022-02-14 14:55:00 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 2; Checks/s: 0.05
[2022-02-14 14:55:02 +0100] debug/CheckerComponent: Scheduling info for checkable 'otn-ac-monq-sa02.localdomain' (2022-02-14 14:55:02 +0100): Object 'otn-ac-monq-sa02.localdomain', Next Check: 2022-02-14 14:55:02 +0100(1.64485e+09).
[2022-02-14 14:55:02 +0100] debug/CheckerComponent: Executing check for 'otn-ac-monq-sa02.localdomain'
[2022-02-14 14:55:02 +0100] debug/Checkable: Update checkable 'otn-ac-monq-sa02.localdomain' with check interval '300' from last check time at 2022-02-14 14:50:06 +0100 (1.64485e+09) to next check time at 2022-02-14 15:00:02 +0100 (1.64485e+09).
[2022-02-14 14:55:02 +0100] notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
[2022-02-14 14:55:02 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2022-02-14 14:55:02 +0100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '5000,100%' '-w' '3000,80%': PID 19914
[2022-02-14 14:55:02 +0100] debug/CheckerComponent: Check finished for object 'otn-ac-monq-sa02.localdomain'

This should not be there.

What version of icinga2 are you running on your systems?
If it is not one of the most recent releases (2.13.x), I suggest you update all nodes prior to troubleshooting.
Afaik there were many bug fixes regarding the cluster communication since at least v2.11.

@log1c
It’s on version 2.13.2-1, so this shouldn’t be the issue.

Hm, then I’m currently out of ideas.
I suggest opening a new thread regarding that JSON-RPC message and the cluster communication problem providing all the info gathered so far, as this thread is quite full and hard to follow with all those different postings :slight_smile: