Master-Master Cluster Sync Problem

unic · November 19, 2019, 8:08am

I had a similar problem after I created an HA cluster from a normal single-node master. The manual says I need to copy /var/lib/icinga2/icinga2.state from the master to the second one, but every time I tried this, it did not work. So I decided to remove the state file on the second master and clean the api folder.

After that everything worked for me.

PinkFrog · November 19, 2019, 11:17am

This also did not work.

And last night I discovered a further problem; I'm not sure whether it is related to the first one.

I got an "Exception occurred while checking 'master1': Error: Function call 'pipe2' failed with error code 24, 'Too many open files' (0) Executing check for object 'master1'" alert in Icingaweb2, and my service crashed:

● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/lib/systemd/system/icinga2.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/icinga2.service.d
           └─limits.conf
   Active: failed (Result: exit-code) since Sun 2019-11-19 00:01:01 CET; 0 day 12h ago
  Process: 84816 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/default/icinga2 (code=exited, status=0/SUCCESS)
  Process: 84823 ExecStart=/usr/sbin/icinga2 daemon --close-stdio -e ${ICINGA2_ERROR_LOG} (code=exited, status=1/FAIL
 Main PID: 84823 (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 402.1M
   CGroup: /system.slice/icinga2.service

Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ScriptGlobal: Dumping
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/cli: Closing console l
Nov 18 14:50:35 master1 systemd[1]: Started Icinga host/service/network monitoring system.
Nov 19 00:01:01 master1 systemd[1]: icinga2.service: Main process exited, code=exited, status=1/FAILUR
Nov 19 00:01:01 master1 systemd[1]: icinga2.service: Failed with result 'exit-code'.

Everything is really strange… maybe this is caused by the huge number of imported hosts that are not connected to the new Icinga2 cluster yet; I have deactivated them now.

Back to the state files: does anyone know if there is a way to generate them from scratch for both masters?
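
Error code 24 is EMFILE: the process has reached its limit for open file descriptors. The following is a minimal sketch, assuming a Linux host with /proc, for comparing how many descriptors a process currently holds with its "Max open files" limit. The function names are made up for illustration, and reading another user's /proc/<pid>/fd usually requires root. The content of the limits.conf drop-in shown in the status output is not known, so nothing here relies on it.

```python
import os


def open_fd_count(pid):
    """Count the file descriptors a process currently holds (Linux /proc only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def nofile_limits(pid):
    """Return the (soft, hard) 'Max open files' limits as strings."""
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Fields: Max, open, files, <soft>, <hard>, <unit>
                fields = line.split()
                return fields[3], fields[4]
    raise LookupError("no 'Max open files' line found")
```

Called with the PID of the running icinga2 daemon, a descriptor count close to the soft limit would support the theory that the daemon is running out of descriptors rather than hitting a state-file problem.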

PinkFrog · November 20, 2019, 10:41am

Okay, seems like nobody else has an idea how to find a solution. Does anybody know where in the Icinga2 code I can find the following things?

the built-in icinga check -> found it, so I only need the second point
the generation of the icinga2.state file?

I looked into that file on master1, and it starts with

295650:{"name":"api","type":"ApiListener","update":{"last_failed_zones_stage_validation":{"log":"[2019-11-13 18:05:41 +0100] information/cli:

To me this looks as if it reads from the log file, but /var/log/icinga2/icinga2.log no longer contains any line from this date.

So I think I need to check where icinga2 finds that log entry.
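
The excerpt above starts with a byte count followed by a colon and a JSON object, which matches the netstring framing (`<length>:<payload>,`). If that holds for the whole file, the log text may simply be stored inside the state file as an attribute of the ApiListener object rather than read from icinga2.log. Below is a minimal sketch for listing what a state file contains, assuming it is a plain sequence of netstring-framed JSON objects with `type`, `name` and `update` keys as in the excerpt. The function names are made up for illustration.

```python
import json


def iter_netstrings(data: bytes):
    """Yield the payload of each netstring (`<length>:<payload>,`) in data."""
    pos = 0
    while pos < len(data):
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        end = start + length
        if data[end:end + 1] != b",":
            raise ValueError(f"missing ',' terminator at byte {end}")
        yield data[start:end]
        pos = end + 1


def summarize_state(data: bytes):
    """Return (type, name, sorted attribute names) for each stored object."""
    summary = []
    for payload in iter_netstrings(data):
        obj = json.loads(payload)
        attrs = sorted(obj.get("update", {}))
        summary.append((obj.get("type"), obj.get("name"), attrs))
    return summary
```

Reading a copy of the file in binary mode and passing the bytes to `summarize_state` shows which objects carry saved state, without modifying the file the daemon uses.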

Would be great if somebody could help, I don’t have further ideas how to handle this.

anon66228339 · November 20, 2019, 10:56am

Maybe enable debug log and see what happens on startup.

PinkFrog · November 20, 2019, 11:29am

Thanks Carsten,

I tried this again, but I couldn’t find the point where the state file is created, only the point where an old log file is removed; there is nothing about creating a new one:

[2019-11-20 12:02:10 +0100] notice/ApiListener: Removing old log file: /var/lib/icinga2/api/log/1574247559 
[2019-11-20 12:02:10 +0100] notice/ApiListener: Current zone master: master1 
[2019-11-20 12:02:10 +0100] information/ApiListener: New client connection for identity 'master2' to [xxx.xxx.xxx.xxx]:5665 
[2019-11-20 12:02:10 +0100] notice/JsonRpcConnection: Received 'config::Update' message from identity 'master2'. 
[2019-11-20 12:02:10 +0100] information/ApiListener: Applying config update from endpoint 'master2' of zone 'master'. 
[2019-11-20 12:02:10 +0100] notice/ConfigCompiler: Registered authoritative config directories for zone 'director-global': /etc/icinga2/zones.d/director-global and /var/lib/icinga2/api/packages/director/b8437779-e99b-463b-a790-95a1ad2af673/zones.d/director-global 
[2019-11-20 12:02:10 +0100] information/ApiListener: Ignoring config update from endpoint 'master2' for zone 'director-global' because we have an authoritative version of the zone's config. 
[2019-11-20 12:02:10 +0100] notice/ConfigCompiler: Registered authoritative config directories for zone 'master': /etc/icinga2/zones.d/master and /var/lib/icinga2/api/packages/director/b8437779-e99b-463b-a790-95a1ad2af673/zones.d/master
[2019-11-20 12:02:10 +0100] information/ApiListener: Ignoring config update from endpoint 'master2' for zone 'master' because we have an authoritative version of the zone's config.
[2019-11-20 12:02:10 +0100] information/ApiListener: Received configuration updates (0) from endpoint 'master2' do not qualify for production, not triggering reload.
[2019-11-20 12:02:13 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'master2'.
[2019-11-20 12:02:14 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'master2'.
[2019-11-20 12:02:14 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-11-20 12:02:15 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 25; Checks/s: 0
[2019-11-20 12:02:15 +0100] notice/ApiListener: Setting log position for identity 'master2': 2019/11/20 12:02:14
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'event::Heartbeat' message from identity 'server'.
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'master2'.
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'master2'.
[2019-11-20 12:02:18 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'master2
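
Searching a debug log of this size by eye is tedious. Here is a minimal sketch for narrowing it down by component or message text, assuming every line follows the `[timestamp] severity/Component: message` layout seen above. The function name and parameters are made up for illustration.

```python
import re

# Layout assumed from the excerpt: [timestamp] severity/Component: message
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\] "
    r"(?P<severity>\w+)/(?P<component>\w+): (?P<message>.*)$"
)


def filter_log(lines, components=None, contains=None):
    """Yield parsed entries that match the given components and substring."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip continuation lines and anything unparsable
        entry = match.groupdict()
        if components and entry["component"] not in components:
            continue
        if contains and contains not in entry["message"]:
            continue
        yield entry
```

For example, passing an open debug log file with `components={"ApiListener"}` keeps only the cluster-related lines, and `contains="config::Update"` isolates the config sync messages.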

PinkFrog · November 21, 2019, 10:47am

Hi everyone,

I looked again into the debug.log on master1 and found something that seems a little strange to me. No lines have been removed between the following:

[2019-11-21 11:18:00 +0100] notice/JsonRpcConnection: Received 'config::Update' message from identity 'master2.domain.com'.
[2019-11-21 11:18:00 +0100] information/ApiListener: Applying config update from endpoint 'master2.domain.com' of zone 'master'.
[2019-11-21 11:18:00 +0100] information/ApiListener: Received configuration updates (0) from endpoint 'master2.domain.com' do not qualify for production, not triggering reload.

Is that normal behaviour for a config master? On master2 I can see the whole sync as described in https://icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#new-configuration-does-not-trigger-a-reload

anon66228339 · November 21, 2019, 10:51am

That's normal, because master1 is the configuration master.

picard · May 19, 2020, 9:36am

How did you solve it? I have the same problem. I found only one solution that works: I have to do the following on each (Windows) client:
net stop icinga2
del C:\ProgramData\icinga2\var\lib\icinga2\icinga2.state
del C:\ProgramData\icinga2\var\lib\icinga2\modified-attributes.conf
del /s C:\ProgramData\icinga2\var\lib\icinga2\api\zones
del /s C:\ProgramData\icinga2\var\lib\icinga2\api\zones-stages
net start icinga2

Not very nice.
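
If this has to be repeated on many clients, the delete steps could be scripted. What follows is a minimal sketch of the same cleanup with a dry-run default, assuming the icinga2 service has already been stopped and is started again afterwards, as in the commands above. The function name and the `lib_dir` parameter are made up for illustration. Like `del /s`, it removes files but leaves the directories in place.

```python
from pathlib import Path

# Relative paths taken from the commands above
STATE_FILES = ("icinga2.state", "modified-attributes.conf")
ZONE_DIRS = ("api/zones", "api/zones-stages")


def reset_local_state(lib_dir, dry_run=True):
    """Delete the state files and synced zone files below lib_dir.

    Returns the list of affected files; with dry_run=True nothing is deleted.
    """
    base = Path(lib_dir)
    targets = [base / name for name in STATE_FILES if (base / name).is_file()]
    for name in ZONE_DIRS:
        zone_dir = base / name
        if zone_dir.is_dir():
            targets.extend(p for p in sorted(zone_dir.rglob("*")) if p.is_file())
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

On the clients described above, `lib_dir` would be C:\ProgramData\icinga2\var\lib\icinga2. Running with the default `dry_run=True` first shows what would be removed.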

PinkFrog · June 9, 2020, 5:19am

I haven't fixed it yet. My problem is not on an agent; it is on one of the masters. There is a possible bug-fix update on Git, but I haven't had the time or a test zone with the same issue, so I couldn't test it so far.

Deleting these files didn't work for me.

Regards, Alicia