Hello Community,
I have a distributed monitoring setup with a single master node, about 300 endpoints and 16k hosts.
Now I want to extend the setup with a second master server for HA. I’m not using the Icinga Director; I configure everything locally in files (using a custom Ansible role). Before migrating our production monitoring, I tried everything in a test environment, following these instructions: How to set up High-Availability Masters and Distributed Monitoring - Scenarios. In the test environment everything worked fine. I then migrated our production monitoring with the same configuration and settings. After I adjusted all configs and restarted the services, I had the following problem:
(icinga01 is our config master; icinga02 is the second master, added for HA. icinga01 uses “Top Down Config Sync”.)
After the config sync from icinga01 to icinga02 (which was very slow, taking about 10-15 minutes), icinga01 and icinga02 barely executed checks anymore (maybe 2 checks per minute, seemingly at random). All services and hosts changed their status to pending. I stopped the icinga service on icinga02 → same scenario. I stopped icinga01 and started only icinga02 → after about 10-20 minutes the “failover” was done and everything worked. As soon as I started icinga01 again, nothing worked anymore.
On both servers the service was running, but neither wrote to the ido-mysql database, and the log files contained no critical or warning entries that would have indicated a problem, even though the log did contain this line: information/DbConnection: 'ido-mysql' started.
The only thing I noticed was that the ido-mysql connection had a lot of pending queries. But the DB is on a separate node, and that server has sufficient performance and was not fully utilized.
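For context, this is roughly what an IDO connection object looks like on a master in an HA zone. It is only a sketch with placeholder values: DB-HOST, PASSWORD and the user/database names are assumptions, not our real settings, and enable_ha and failover_timeout are shown at what I understand to be their documented defaults.

```
object IdoMysqlConnection "ido-mysql" {
  // Placeholder connection details (assumptions, not our real values)
  host = "DB-HOST"
  port = 3306
  user = "icinga"
  password = "PASSWORD"
  database = "icinga"

  // With HA enabled, only one endpoint in the zone actively writes
  // to the database; the other one keeps its IDO connection paused.
  enable_ha = true

  // How long the passive endpoint waits before taking over the
  // database writes (must not be lower than 30s).
  failover_timeout = 30s
}
```

As far as I understand it, with this behaviour it is expected that only one of the two masters writes to the database at any given time, but in my case neither of them did.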
After adding the second master, I followed the steps described in: Initial Sync for new Endpoints in a Zone
If you want me to provide log files, debug logs, GDB dumps or backtraces, please contact me directly, as they contain sensitive information about our infrastructure.
The icinga2 MariaDB database is about 700MB - 1GB in size.
Any ideas what the problem could be? I know Icinga well and have already invested more than 30 hours in troubleshooting. Even colleagues who are fairly experienced with it do not know what to do.
For both master nodes:
- Version used: 2.13.2
- Operating System and version: Debian 11
- Enabled features: api checker ido-mysql influxdb influxdb2 mainlog notification
icinga01 (config master) zones.conf:
object Endpoint "icinga01" {
}
object Endpoint "icinga02" {
  host = "IP-ADDRESS"
  port = "5665"
}
object Zone "master" {
  endpoints = [ "icinga01", "icinga02" ]
}
object Zone "global-templates" {
  global = true
}
icinga02 (second master) zones.conf:
object Endpoint "icinga02" {
}
object Endpoint "icinga01" {
  host = "IP-ADDRESS"
  port = "5665"
}
object Zone "master" {
  endpoints = [ "icinga01", "icinga02" ]
}
object Zone "global-templates" {
  global = true
}
object Zone "director-global" {
  global = true
}