Master-Master Cluster Sync Problem

No, there is nothing with the name of the old zone.

But I found another difference between the two configurations: only on master1 is the file conf.d/api-users.conf included, so there is no api-users configuration on master2. That is the extra line in the output of icinga2 daemon -C on master1. Could this be a problem?

A different api-user configuration should not be a problem (apart from the API users themselves).

Is there a way to run the zone sync stage validation manually, like the icinga check does?
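One way might be to point the config validation at the staged zones directory instead of the production one, along the lines of the cluster config sync troubleshooting chapter. A sketch; the constant name and the stage path are assumptions taken from the docs and from this setup, so double-check them for 2.11:

# icinga2 daemon -C --define System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stages/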

It confuses me that the date has stayed the same since it first came up on 2019-11-13 18:05:41.

Icinga 2 has been running for 8 hours, 11 minutes and 30 seconds. Version: r2.11.2-1; Last zone sync stage validation failed at 2019-11-13 18:05:41 +0100

It is also confusing that there is no startup.log anymore.

What happens if you change a custom variable for a host, and deploy the changes via the Director on master1?

Changing a custom variable or anything else is no problem. The deployment is successful and the change takes effect.

I checked the output of icinga2 daemon -C again, this time with notice messages (-x notice) and grepping for “ignor”, and found this on master1:

# icinga2 daemon -C -x notice | grep -i ignor
notice/config: Ignoring explicit load request for library "db_ido_mysql".
notice/config: Ignoring non local config include for zone 'director-global': We already have an authoritative copy included.
notice/config: Ignoring non local config include for zone 'master': We already have an authoritative copy included.

on master2 only:

# icinga2 daemon -C -x notice | grep -i ignor
notice/config: Ignoring explicit load request for library "db_ido_mysql".

And this change doesn’t change the output of the icinga check when you force a recheck of that service?

No, that’s weird:

Icinga 2 has been running for 8 hours, 42 minutes and 28 seconds. Version: r2.11.2-1; Last zone sync stage validation failed at 2019-11-13 18:05:41 +0100

It seems there was no reload of icinga2.service. If I restart the service manually, the uptime of the service changes, but the date of the failed stage validation does not.
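To rule out the web interface, the recheck can also be forced via the REST API. A sketch, assuming the built-in check is the service named icinga on master2 and an ApiUser root with permission for actions (the credentials are placeholders):

# curl -k -s -u root:icinga -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/actions/reschedule-check' \
    -d '{ "type": "Service", "filter": "host.name==\"master2\" && service.name==\"icinga\"", "force": true }'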

master2 should log something like this:

Copying file ... from config sync staging to production zones directory

and then clear the last failed zone sync stage validation entry.
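Whether those copy messages actually show up on master2 can be checked with something like:

# grep -i 'config sync staging to production' /var/log/icinga2/icinga2.log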

Yes, there are such lines in the log at that time.

Where can I clear that?

That sentence referred to what icinga2 should do, i.e. the core itself clears that entry. It is weird that it does not. The entry should be removed from the icinga2.state file.
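A quick way to see whether the stale entry is still persisted after a restart (path as on this system):

# grep -c last_failed_zones_stage_validation /var/lib/icinga2/icinga2.state

A count greater than zero would mean the entry survived the state dump on shutdown.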


But it doesn’t. Maybe a bug?

Is the built-in “icinga” check a check of the file /var/lib/icinga2/icinga2.state?

I had such a problem after I created an HA cluster from a normal single-node master. The manual says I need to copy /var/lib/icinga2/icinga2.state from the first master to the second one, but every time I tried this it did not work. So I decided to remove the state file on the second master and clean out the api folder.
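Roughly these steps, from memory; the paths are the Linux defaults, adjust them to your installation, and be aware that this wipes the locally cached zone config so the node has to receive it again:

# systemctl stop icinga2
# rm /var/lib/icinga2/icinga2.state
# rm -rf /var/lib/icinga2/api/zones/* /var/lib/icinga2/api/zones-stages/*
# systemctl start icinga2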

After that everything worked for me.

This also did not work.

And since last night I have discovered a further problem; I am not sure whether it is related to the first one.

I got an “Exception occurred while checking 'master1': Error: Function call 'pipe2' failed with error code 24, 'Too many open files' (0) Executing check for object 'master1'” alert in Icinga Web 2, and my service crashed:

● icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/lib/systemd/system/icinga2.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/icinga2.service.d
           └─limits.conf
   Active: failed (Result: exit-code) since Sun 2019-11-19 00:01:01 CET; 0 day 12h ago
  Process: 84816 ExecStartPre=/usr/lib/icinga2/prepare-dirs /etc/default/icinga2 (code=exited, status=0/SUCCESS)
  Process: 84823 ExecStart=/usr/sbin/icinga2 daemon --close-stdio -e ${ICINGA2_ERROR_LOG} (code=exited, status=1/FAIL
 Main PID: 84823 (code=exited, status=1/FAILURE)
    Tasks: 0
   Memory: 402.1M
   CGroup: /system.slice/icinga2.service

Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ConfigItem: Instantiat
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/ScriptGlobal: Dumping
Nov 18 14:50:35 master1 icinga2[84823]: [2019-11-18 14:50:35 +0100] information/cli: Closing console l
Nov 18 14:50:35 master1 systemd[1]: Started Icinga host/service/network monitoring system.
Nov 19 00:01:01 master1 systemd[1]: icinga2.service: Main process exited, code=exited, status=1/FAILUR
Nov 19 00:01:01 master1 systemd[1]: icinga2.service: Failed with result 'exit-code'.

Everything is really strange… maybe this is related to the huge number of imported hosts that are not connected to the new Icinga 2 cluster right now; I have deactivated them for now.
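The “Too many open files” part means the process hit its open file descriptor limit (errno 24). One workaround might be to raise LimitNOFILE in the existing limits.conf drop-in; the value below is only an example:

# cat /etc/systemd/system/icinga2.service.d/limits.conf
[Service]
LimitNOFILE=65536

# systemctl daemon-reload
# systemctl restart icinga2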

Back to the state files: does anyone know whether there is a way to regenerate them from scratch for both masters?

Okay, it seems nobody else has an idea for a solution. Does anybody know where in the Icinga 2 code I can find the following things?

  1. the built-in icinga check -> found it, so I only need the second point
  2. the generation of the icinga2.state file?

I looked into that file on master1, and it starts with

295650:{"name":"api","type":"ApiListener","update":{"last_failed_zones_stage_validation":{"log":"[2019-11-13 18:05:41 +0100] information/cli: 

which to me looks like it starts by looking into the log file, but there is no line from this date in /var/log/icinga2/icinga2.log anymore.

So I think I need to check where icinga2 finds that log entry.
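Instead of digging through the size-prefixed entries of the state file by hand, the runtime status can also be queried via the REST API; whether the last_failed_zones_stage_validation field shows up there is an assumption on my side, but it is easy to check. Again with placeholder ApiUser credentials:

# curl -k -s -u root:icinga 'https://localhost:5665/v1/status/ApiListener'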

It would be great if somebody could help; I have no further ideas for how to handle this.

Maybe enable debug log and see what happens on startup.
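For example:

# icinga2 feature enable debuglog
# systemctl restart icinga2
# tail -f /var/log/icinga2/debug.log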

Thanks Carsten,

I tried this again, but I could not find the point where the state file is created. I only found the point where an old log file is removed, nothing about creating a new one:

[2019-11-20 12:02:10 +0100] notice/ApiListener: Removing old log file: /var/lib/icinga2/api/log/1574247559 
[2019-11-20 12:02:10 +0100] notice/ApiListener: Current zone master: master1 
[2019-11-20 12:02:10 +0100] information/ApiListener: New client connection for identity 'master2' to [xxx.xxx.xxx.xxx]:5665 
[2019-11-20 12:02:10 +0100] notice/JsonRpcConnection: Received 'config::Update' message from identity 'master2'. 
[2019-11-20 12:02:10 +0100] information/ApiListener: Applying config update from endpoint 'master2' of zone 'master'. 
[2019-11-20 12:02:10 +0100] notice/ConfigCompiler: Registered authoritative config directories for zone 'director-global': /etc/icinga2/zones.d/director-global and /var/lib/icinga2/api/packages/director/b8437779-e99b-463b-a790-95a1ad2af673/zones.d/director-global 
[2019-11-20 12:02:10 +0100] information/ApiListener: Ignoring config update from endpoint 'master2' for zone 'director-global' because we have an authoritative version of the zone's config. 
[2019-11-20 12:02:10 +0100] notice/ConfigCompiler: Registered authoritative config directories for zone 'master': /etc/icinga2/zones.d/master and /var/lib/icinga2/api/packages/director/b8437779-e99b-463b-a790-95a1ad2af673/zones.d/master
[2019-11-20 12:02:10 +0100] information/ApiListener: Ignoring config update from endpoint 'master2' for zone 'master' because we have an authoritative version of the zone's config.
[2019-11-20 12:02:10 +0100] information/ApiListener: Received configuration updates (0) from endpoint 'master2' do not qualify for production, not triggering reload.
[2019-11-20 12:02:13 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'master2'.
[2019-11-20 12:02:14 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'master2'.
[2019-11-20 12:02:14 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-11-20 12:02:15 +0100] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 25; Checks/s: 0
[2019-11-20 12:02:15 +0100] notice/ApiListener: Setting log position for identity 'master2': 2019/11/20 12:02:14
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'event::Heartbeat' message from identity 'server'.
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'log::SetLogPosition' message from identity 'master2'.
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'event::SetNextCheck' message from identity 'master2'.
[2019-11-20 12:02:18 +0100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2019-11-20 12:02:18 +0100] notice/JsonRpcConnection: Received 'event::CheckResult' message from identity 'master2

Hi everyone,

I looked into the debug.log on master1 again and found something that seems a little strange to me. These lines follow directly after one another, nothing has been removed in between:

[2019-11-21 11:18:00 +0100] notice/JsonRpcConnection: Received 'config::Update' message from identity 'master2.domain.com'.
[2019-11-21 11:18:00 +0100] information/ApiListener: Applying config update from endpoint 'master2.domain.com' of zone 'master'.
[2019-11-21 11:18:00 +0100] information/ApiListener: Received configuration updates (0) from endpoint 'master2.domain.com' do not qualify for production, not triggering reload.

Is that normal behaviour for a config master? On master2 I can see the whole sync as described in https://icinga.com/docs/icinga2/latest/doc/15-troubleshooting/#new-configuration-does-not-trigger-a-reload
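To follow the sync on master2, something along these lines works:

# tail -f /var/log/icinga2/debug.log | grep -Ei 'zone|stage|config::Update'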

That’s normal, because master1 is the configuration master :slight_smile:

How did you solve it? I have the same problems. I found only one solution that works: I have to do the following on each Windows client:
net stop icinga2
del C:\ProgramData\icinga2\var\lib\icinga2\icinga2.state
del C:\ProgramData\icinga2\var\lib\icinga2\modified-attributes.conf
del /s C:\ProgramData\icinga2\var\lib\icinga2\api\zones
del /s C:\ProgramData\icinga2\var\lib\icinga2\api\zones-stages
net start icinga2

Not very nice.

I have not fixed it yet. My problem is not on an agent, it is on one of the masters. There is a possible bug fix on GitHub, but I have not had the time or a testing zone with the same issue, so I could not test it so far.

Deleting these files did not work for me.

Regards, Alicia