Icinga master not receiving check result, marks as overdue

Sorry for the two posts - I wanted to keep this separate from my other one.

I have a 3 tier multi-master Icinga configuration.
All zones except one appears to be functioning normally, and all of the problem zone checks are appearing as overdue.

They were working previously until a week ago, with no major changes as far as I can tell.

I can see the checks are executing (log altered for privacy reasons):
Host check from satellite:

[2020-07-08 15:15:08 +1000] notice/Dependency: Dependency 'windows-server!customer' passed: Parent host 'satellite' matches state filter.
[2020-07-08 15:15:08 +1000] debug/CheckerComponent: Scheduling info for checkable 'windows-server' (2020-07-08 15:15:08 +1000): Object 'windows-server', Next Check: 2020-07-08 15:15:08 +1000(1.59419e+09).
[2020-07-08 15:15:08 +1000] debug/CheckerComponent: Executing check for 'windows-server'
[2020-07-08 15:15:08 +1000] debug/Checkable: Update checkable 'windows-server' with check interval '300' from last check time at 2020-07-08 15:10:10 +1000 (1.59419e+09) to next check time at 2020-07-08 15:20:06 +1000 (1.59419e+09).
[2020-07-08 15:15:08 +1000] notice/Dependency: Dependency 'windows-server!customer' passed: Parent host 'satellite' matches state filter.
[2020-07-08 15:15:08 +1000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2020-07-08 15:15:08 +1000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'master.fqdn'
[2020-07-08 15:15:08 +1000] notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-H' '10.0.1.236' '-c' '5000,100%' '-w' '3000,80%': PID 6952
[2020-07-08 15:15:08 +1000] debug/CheckerComponent: Check finished for object 'windows-server'

Agent based check from Windows:

[2020-07-08 15:07:54 +1000] debug/CheckerComponent: Scheduling info for checkable 'windows-agent!Disk Free' (2020-07-08 15:07:54 +1000): Object 'windows-agent!Disk Free', Next Check: 2020-07-08 15:07:54 +1000(1.59418e+09).
[2020-07-08 15:07:54 +1000] debug/CheckerComponent: Executing check for 'windows-agent!Disk Free'
[2020-07-08 15:07:54 +1000] debug/Checkable: Update checkable 'windows-agent!Disk Free' with check interval '300' from last check time at 2020-07-08 15:02:54 +1000 (1.59418e+09) to next check time at 2020-07-08 15:12:54 +1000 (1.59419e+09).
[2020-07-08 15:07:54 +1000] notice/ApiListener: Sending message 'event::ExecuteCommand' to 'windows-agent'
[2020-07-08 15:07:54 +1000] debug/CheckerComponent: Check finished for object 'windows-agent!Disk Free'
[2020-07-08 15:07:54 +1000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2020-07-08 15:07:54 +1000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'master1'
[2020-07-08 15:07:54 +1000] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2020-07-08 15:07:54 +1000] notice/ApiListener: Sending message 'event::SetNextCheck' to 'master1'
[2020-07-08 15:07:54 +1000] notice/CheckerComponent: Pending checkables: 0; Idle checkables: 89; Checks/s: 0.533333
[2020-07-08 15:07:54 +1000] notice/ApiListener: Setting log position for identity 'master1': 2020/07/08 14:55:37
[2020-07-08 15:07:54 +1000] notice/ApiListener: Updating object authority for objects at endpoint 'satellite'.

Checks are being scheduled, and executed, but the result never makes it way back to the master.

I’ve checked and the time of all servers is accurate. I’ve removed /var/lib/icinga2/api/* and the state files, upgraded the satellite to the latest version, but still no luck.

I can’t seem to see anything strange happening on the masters either, and all other zones work perfectly.

Any ideas?

Could it be a time-related problem?

Are all the server on the same time(zone)?

Hello @0xliam

As far as i can see you are 3 level cluster: Master => Satellite => Agent

This output tells us that agent sends payload directly to the master, could you please provide listing for zones.conf and describe full cluster setup?

Hi @Solkren,

Sorry I should clarify - all the logs I posted are from the Satellite - for the agent based checks I was just using that to demonstrate that the checks are being scheduled.

We have a 3 tiered dual-master setup -

Master 1 = \
     |     -----=> Satellite => Agent
Master 2 = /

Excuse my bad text diagram!

Hi @log1c I will check that again now - that’s what we initially suspected - all servers are running CentOS, using Hyper-V’s clock sync integration (which in turn syncs from NTP) - the ‘date’ command shows the correct time for both masters and satellite.

What’s strange is this was working fine, up until about 10 days ago - we did a patching cycle and rebooted the Hyper-V host, and then it just stopped working.

I should mention all of our config is managed by Director.

I can see the zone is connected to the masters via the cluster command:

Zone 'customer.fqdn' is connected. Log lag: less than 1 millisecond

Performance data

Label Value Warning Critical
slave_lag 0.00 0.00 0.00
last_messages_sent 1,594,253,618.37 - -
last_messages_received 1,594,253,616.07 - -
sum_messages_sent_per_second 0.30 - -
sum_messages_received_per_second 0.30 - -
sum_bytes_sent_per_second 27.60 - -
sum_bytes_received_per_second 25.40 - -

Checked all times, everything seems to match, but when I was restarting services I noticed a few results trickled through - a few perfdata points appeared in our backend and I noticed a few of the services in the zone came good, but it has since gone back to being overdue.

I think I can rule out the masters as we have 20+ other zones connected with no issues, so it must be the satellite. Will keep tinkering.

Master 1 Time:

[root@master1 ~]# timedatectl
      Local time: Fri 2020-07-10 14:51:49 AEST
  Universal time: Fri 2020-07-10 04:51:49 UTC
        RTC time: Fri 2020-07-10 04:51:49
       Time zone: Australia/Melbourne (AEST, +1000)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: no
 Last DST change: DST ended at
                  Sun 2020-04-05 02:59:59 AEDT
                  Sun 2020-04-05 02:00:00 AEST
 Next DST change: DST begins (the clock jumps one hour forward) at
                  Sun 2020-10-04 01:59:59 AEST
                  Sun 2020-10-04 03:00:00 AEDT

Master 2 Time:

[root@master2 ~]# timedatectl
      Local time: Fri 2020-07-10 14:51:48 AEST
  Universal time: Fri 2020-07-10 04:51:48 UTC
        RTC time: Fri 2020-07-10 04:51:48
       Time zone: Australia/Melbourne (AEST, +1000)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: no
 Last DST change: DST ended at
                  Sun 2020-04-05 02:59:59 AEDT
                  Sun 2020-04-05 02:00:00 AEST
 Next DST change: DST begins (the clock jumps one hour forward) at
                  Sun 2020-10-04 01:59:59 AEST
                  Sun 2020-10-04 03:00:00 AEDT

Satellite Time:

[root@satellite ~]# timedatectl
      Local time: Fri 2020-07-10 14:51:46 AEST
  Universal time: Fri 2020-07-10 04:51:46 UTC
        RTC time: Fri 2020-07-10 06:11:15
       Time zone: Australia/Melbourne (AEST, +1000)
     NTP enabled: yes
NTP synchronized: yes
 RTC in local TZ: no
      DST active: no
 Last DST change: DST ended at
                  Sun 2020-04-05 02:59:59 AEDT
                  Sun 2020-04-05 02:00:00 AEST
 Next DST change: DST begins (the clock jumps one hour forward) at
                  Sun 2020-10-04 01:59:59 AEST
                  Sun 2020-10-04 03:00:00 AEDT

Okay, I ended up stopping both masters, removed the state files, and starting them up again.

This seems to have fixed things. I am not sure what was causing this - if anyone has any ideas I would love to hear them!