Migrating from a single master to multiple masters does not work

Hello Community,

I have a distributed monitoring setup with a single master node and about 300 Endpoints and 16k Hosts.

Now I wanted to extend my setup with a second master server for HA. I'm not using the Icinga Director; I configure everything locally in files (using a custom Ansible role). Before migrating our production monitoring, I tried everything in a test environment. For the migration I used the following instructions: How to set up High-Availability Masters and Distributed Monitoring - Scenarios. In the test environment everything worked fine. Then I wanted to migrate our production monitoring with the same configuration and settings. After adjusting all configs and restarting the services, I ran into the following problem:
(icinga01 is our config master, icinga02 is the second master added for HA. icinga01 uses "Top Down Config Sync".)

After the config sync from icinga01 to icinga02 (which took about 10-15 minutes, very slow), icinga01 and icinga02 barely executed any checks anymore (maybe 2 checks per minute, if the service felt like it). All services/hosts changed status to Pending. I stopped the icinga2 service on icinga02 → same scenario. I stopped icinga01 and started only icinga02 → after about 10-20 minutes the failover completed and everything worked. As soon as I started icinga01 again, nothing worked anymore.

On both servers the service was running, but neither instance wrote to the ido-mysql database, and the logfiles contained no critical or warning entries that would have indicated a problem, even though this was written in the log file: information/DbConnection: 'ido-mysql' started.
The only anomaly was that the ido-mysql connection had a lot of pending queries. But the DB is on a separate node, and that server has sufficient performance and was not fully utilized.
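For context: with two masters in one zone, the IDO feature runs in HA mode by default, so only the currently active endpoint writes to the database while the other pauses. A minimal sketch of the relevant settings (host and credentials are placeholders, not my actual values):

```
// /etc/icinga2/features-enabled/ido-mysql.conf (sketch, placeholder values)
object IdoMysqlConnection "ido-mysql" {
  host = "db.example.com"   // separate DB node (placeholder)
  database = "icinga"
  user = "icinga"
  password = "CHANGEME"

  enable_ha = true          // default: only one master writes at a time
  failover_timeout = 30s    // wait time before the paused instance takes over
}
```

If both masters believe the other one is active, neither writes, which matches the "pending queries, nothing in the DB" symptom.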
After adding the second master I followed the documented procedure: Initial Sync for new Endpoints in a Zone
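That initial sync essentially means copying the config master's runtime config and state to the new endpoint while its service is stopped; a sketch, assuming default Debian paths and the endpoint names from this post:

```shell
# On icinga02 (the new endpoint) -- paths assume a default Debian install
systemctl stop icinga2

# Pull the synced zone config from the config master (icinga01)
rsync -av icinga01:/var/lib/icinga2/api/zones/ /var/lib/icinga2/api/zones/

# On Debian, icinga2 runs as the nagios user
chown -R nagios:nagios /var/lib/icinga2/api/zones
systemctl start icinga2
```

This is an operational sketch, not a verbatim copy of the documented commands; check the linked docs section for the exact directories your version expects.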

If you want me to provide logfiles / debuglog or GDB dumps or backtraces, please contact me directly, as there is sensitive information about our infrastructure in there.

The icinga2 MariaDB database is about 700MB - 1GB in size.

Any ideas what the problem could be? I know Icinga quite well and have already invested >30 hours in troubleshooting. Even colleagues who are quite experienced with it are at a loss.

For both master nodes:

  • Version used (2.13.2)
  • Operating System and version (Debian 11)
  • Enabled features (api checker ido-mysql influxdb influxdb2 mainlog notification)
icinga01 config master zones.conf
object Endpoint "icinga01" {
}

object Endpoint "icinga02" {
  host = "IP-ADDRESS"
  port = "5665"
}

object Zone "master" {
  endpoints = [ "icinga01", "icinga02" ]
}

object Zone "global-templates" {
  global = true
}

icinga02 master zones.conf
object Endpoint "icinga02" {
}

object Endpoint "icinga01" {
  host = "IP-ADDRESS"
  port = "5665"
}

object Zone "master" {
  endpoints = [ "icinga01", "icinga02" ]
}

object Zone "global-templates" {
  global = true
}

object Zone "director-global" {
  global = true
}

Does anyone have any ideas? Or should I rebuild the setup, meaning completely rebuild the database and both master servers? Except for the certificates, because redoing those would be an insane amount of work.

I solved the problem. Sending config updates and replaying the log took about 10 minutes.

After reading the documentation again, I found information about the replay log: Distributed Monitoring - Icinga 2. So I disabled the replay log via log_duration, and everything works fine for me now.
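For anyone hitting the same wall: the replay log is configured per Endpoint object, and setting log_duration to 0 disables it. A sketch of the entry in zones.conf on the config master, using the same endpoint names as above:

```
object Endpoint "icinga02" {
  host = "IP-ADDRESS"
  port = "5665"
  log_duration = 0   // disable the cluster replay log for this connection
}
```

With a large zone, a long replay log can keep both masters busy replaying events instead of executing checks, which fits the behavior I described.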
