SetLogPosition in future?

0xliam · March 21, 2023, 3:34am

One of our satellite zones has randomly stopped checking in, and after enabling debuglog on the master node, I can see the following message:

[2023-03-21 14:13:26 +1100] notice/ApiListener: Setting log position for identity 'satellite.zone': 2023/06/12 21:53:19
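
To put a number on how far ahead that log position is, here is a minimal sketch that compares the timestamp of the log entry with the position it reports. It assumes GNU date is available and that both values are printed in the same local timezone (the UTC offset is dropped and DST changes are ignored). The function name `log_position_drift` is made up for this example.

```shell
#!/bin/sh
# Print how many seconds the reported log position lies ahead of the log
# entry's own timestamp. A large positive number means "in the future".
# Assumptions: GNU date, and both timestamps in the same local timezone.
log_position_drift() {
    line=$1
    # Timestamp of the log entry itself, e.g. "2023-03-21 14:13:26"
    logged=$(printf '%s\n' "$line" | sed -E 's/^\[([0-9-]+ [0-9:]+) [+-][0-9]{4}\].*$/\1/')
    # Reported log position, e.g. "2023/06/12 21:53:19", with "/" turned into "-"
    position=$(printf '%s\n' "$line" | sed -E 's/^.*: ([0-9]{4})\/([0-9]{2})\/([0-9]{2}) ([0-9:]+)$/\1-\2-\3 \4/')
    # Parse both as UTC so only the difference between them matters
    echo $(( $(date -u -d "$position" +%s) - $(date -u -d "$logged" +%s) ))
}

line="[2023-03-21 14:13:26 +1100] notice/ApiListener: Setting log position for identity 'satellite.zone': 2023/06/12 21:53:19"
log_position_drift "$line"   # seconds; divide by 86400 for days
```

For the line above this works out to roughly 83 days, which matches the "three months in the future" symptom.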

However, on the satellite in question, I can see checks are all running correctly and as expected:

[2023-03-21 14:20:20 +1100] debug/Checkable: Update checkable 'REDACTED!CPU Load' with check interval '300' from last check time at 2023-03-21 14:20:20 +1100 (1.67937e+09) to next check time at 2023-03-21 14:25:05 +1100 (1.67937e+09).
[2023-03-21 14:20:20 +1100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2023-03-21 14:20:20 +1100] notice/ApiListener: Sending message 'event::SetNextCheck' to 'master'
[2023-03-21 14:20:20 +1100] notice/ApiListener: Relaying 'event::CheckResult' message
[2023-03-21 14:20:20 +1100] notice/ApiListener: Sending message 'event::CheckResult' to 'master'
[2023-03-21 14:20:20 +1100] notice/Process: PID 6838 ('/usr/lib64/nagios/plugins/thola' '--no-cache' '--snmp-community' 'REDACTED' 'check' 'interface-metrics' '10.0.99.9' '--snmp-version' '2c') terminated with exit code 0
[2023-03-21 14:20:20 +1100] debug/Checkable: Update checkable 'REDACTED!SNMP Interface Utilisation' with check interval '60' from last check time at 2023-03-21 14:20:20 +1100 (1.67937e+09) to next check time at 2023-03-21 14:21:16 +1100 (1.67937e+09).
[2023-03-21 14:20:20 +1100] notice/ApiListener: Relaying 'event::SetNextCheck' message
[2023-03-21 14:20:20 +1100] notice/ApiListener: Sending message 'event::SetNextCheck' to 'master'
[2023-03-21 14:20:20 +1100] notice/ApiListener: Relaying 'event::CheckResult' message
[2023-03-21 14:20:20 +1100] notice/ApiListener: Sending message 'event::CheckResult' to 'master'

I have checked all the usual spots - NTP is configured and working on all hosts, and we have 102 other zones all connected and working correctly - it’s just this specific zone.

I’ve configured the satellite on a new VM, but the same behaviour is occurring - which makes me think the master node is doing this somehow.

The satellite zone appears connected to the master zone, but all services in IcingaWeb appear as overdue.

Master node is running r2.13.6-1 and satellite node is running r2.13.7-1.

0xliam · March 21, 2023, 3:46am

After looking at the forums more, I discovered a post I made in 2020 which appears to be the same issue - Last check in future? Unable to reschedule - #2 by twidhalm - I will try these steps again and see if it fixes it up.

Is this a bug I should raise on GitHub? Essentially what I think has happened is NTP or the HW clock on the host has jumped into the future, a whole heap of checks isolated to one zone got scheduled 3 months into the future, and then the clock recovered, but the checks remain stuck in the future.

rsx · March 21, 2023, 7:00am

Your versions might violate allowed combinations as described here.

0xliam · March 21, 2023, 8:27am

Ah I did not think about that - that satellite would have pulled the latest version - I’ll pin it to the same as the master.

Having said that, this issue was occurring before upgrading to the latest - I believe they were both on the same version previously, but I wanted a clean slate and rebuilt the VM.

I’ve had a look through the code to identify where the timestamp is pulled from, but I can’t seem to find the definition of this function:

github.com

Icinga/icinga2/blob/66b039df9c5460bcd4db4c4774e09a1ba8ca075b/lib/remote/apilistener.cpp#L956


      
            Log(LogNotice, "ApiListener")
                << "Removing old log file: " << path;
            (void)unlink(path.CStr());
        }
    }

    for (const Endpoint::Ptr& endpoint : ConfigType::GetObjectsByType<Endpoint>()) {
        if (!endpoint->GetConnected())
            continue;

        double ts = endpoint->GetRemoteLogPosition();

        if (ts == 0)
            continue;

        Dictionary::Ptr lmessage = new Dictionary({
            { "jsonrpc", "2.0" },
            { "method", "log::SetLogPosition" },
            { "params", new Dictionary({
                { "log_position", ts }
            }) }
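
The `ts` value in this excerpt is read from the endpoint object, so one way to see what the master currently holds is to query that object over the REST API. The sketch below only builds the query URL. It assumes the API listens on the default port 5665 and that the Endpoint object exposes `remote_log_position` and `local_log_position` attributes, which is worth verifying against the object type documentation for your version. The helper name and the credential variables are hypothetical.

```shell
#!/bin/sh
# Build the REST API URL for one endpoint object, limited to its log positions.
# Assumptions: API on the default port 5665; attribute names as described
# in the paragraph above (not confirmed in this thread).
endpoint_log_position_url() {
    api_host=$1
    endpoint=$2
    printf 'https://%s:5665/v1/objects/endpoints/%s?attrs=remote_log_position&attrs=local_log_position\n' \
        "$api_host" "$endpoint"
}

# Example call (API_USER and API_PASS are hypothetical ApiUser credentials;
# -k skips certificate verification and is only suitable for a quick test):
#   curl -k -s -u "$API_USER:$API_PASS" "$(endpoint_log_position_url localhost satellite.zone)"
```

If the attributes exist as assumed, the returned values should be Unix timestamps that can be compared against the current time on the master.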

0xliam · March 22, 2023, 12:11am

I have downgraded the satellite node to the same version as the master node, but I still see the same behaviour:

[root@satellite ~]# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.6-1)

Copyright (c) 2012-2023 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-957.el7.x86_64
  Architecture: x86_64

[root@satellite ~]# date
Wed Mar 22 09:52:20 AEDT 2023

[root@master ~]# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: r2.13.6-1)

Copyright (c) 2012-2023 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1160.15.2.el7.x86_64
  Architecture: x86_64

[root@master ~]# date
Wed Mar 22 09:52:53 AEDT 2023

The behaviour I am seeing is that the zone is connected, but no host or service check results are being accepted by the master node - this includes both checks running from the satellite node, and hosts within the zone running Icinga Agent.

It is also only affecting this one zone.

Any suggestions?

rsx · March 22, 2023, 11:06am

Are you sure this satellite zone is connected? Best practice is to define a zone check (e.g. cluster-zone) for every zone.

0xliam · March 22, 2023, 10:55pm

100% sure - I can see on both the master and the satellite node that they are connected (and all zone checks are passing).

I ended up stopping the master, renaming the icinga2.state file and restarting and everything has come good.

I realised I encountered this same issue in 2020, so I ran the same troubleshooting steps and deleting the state file resolved it.
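
For anyone repeating the fix, the steps can be sketched roughly as below. The helper renames the state file instead of deleting it, so the old file stays available for inspection. The service name and the path in the usage comment assume a default Linux package install, and `backup_state_file` is a name made up for this example. Keep in mind that any runtime state stored only in that file is lost once Icinga 2 starts without it.

```shell
#!/bin/sh
# Move a state file aside under a timestamped name and print the new path.
# The old file is kept (not deleted) so it can be inspected or restored later.
backup_state_file() {
    state_file=$1
    backup="${state_file}.$(date +%Y%m%d-%H%M%S).bak"
    mv -- "$state_file" "$backup" && printf '%s\n' "$backup"
}

# Intended use on the master, run as root while Icinga 2 is stopped.
# The service name and path below assume a default Linux package install:
#   systemctl stop icinga2
#   backup_state_file /var/lib/icinga2/icinga2.state
#   systemctl start icinga2
```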

Is this something I should raise as an issue in GitHub? I’ve had a look through the source code and it appears that the master should be handling checks that have a future timestamp, but in my case I was only seeing that the “log position” was three months into the future.