Incorrect timestamps for last_check and overdue not being set correctly

Hi there,

I'm unsure whether this is a bug or a time issue, but a large number of checks (they appear to be on the same satellites) are setting the last_check value to the wrong timestamp - at least, I think that is what is happening:

While reviewing a Service check executed locally on the host (disk-windows):

last_check 1607536417.295000 (Thursday, 10 December 2020 04:53:37.295 GMT+11:00 DST - 9 hours ago)
next_check 1607566917 (Thursday, 10 December 2020 13:21:57 GMT+11:00 DST - 3 minutes ago)
next_update 1607567217 (Thursday, 10 December 2020 13:26:57 GMT+11:00 DST - In 2 minutes)

I reviewed it again a few minutes later:

last_check 1607537017.328000 (Thursday, 10 December 2020 05:03:37.328 GMT+11:00 DST - 9 hours ago)
next_check 1607567925 (Thursday, 10 December 2020 13:38:45 GMT+11:00 DST - 2 minutes ago)
next_update 1607568224.992000 (Thursday, 10 December 2020 13:43:44.992 GMT+11:00 DST - In 3 minutes)

The last check value has incremented by 600 seconds (despite my check interval being 300) - so the check is being executed, but the wrong timestamp is being set somewhere.
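The arithmetic on the raw values above can be reproduced in a shell as a quick sanity check (GNU date assumed):

```shell
# Timestamps copied from the two object inspections above.
first=1607536417    # last_check at the first inspection
second=1607537017   # last_check a few minutes later

# The gap between the two last_check values:
echo $(( second - first ))    # 600 seconds, vs. a 300s check_interval

# Convert an epoch value to a readable date for comparison:
date -u -d "@${first}" '+%Y-%m-%d %H:%M:%S UTC'
```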

I can't see any reason why this is occurring - both the masters, my 3 SQL servers, the satellite and agent are synced with NTP and have the same timezone and time set.

I am seeing some strange behaviour when I look at the Icinga Web 2 overdue page too - hosts are coming and going: at some moments there are 50+ hosts, then I refresh a few moments later and it is down to 4 or 5.

Service checks appear to also behave in the same way.

This isn't affecting all of my zones/satellites - the majority are fine - which is why I think it is a satellite issue, but I can't see what would cause this to happen.

Another example:

Late Host Check Results
No hosts found matching the filter.

A host check (using hostalive) not appearing in overdue despite being overdue:

last_check 1607541656.825736 (Thursday, 10 December 2020 06:20:56.825 GMT+11:00 DST - 8 hours ago)
next_check 1607570946.677169 (Thursday, 10 December 2020 14:29:06.677 GMT+11:00 DST - 2 minutes ago)
next_update 1607571254.768501 (Thursday, 10 December 2020 14:34:14.768 GMT+11:00 DST - In 2 minutes)
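For context on why this host stays off the overdue page: as far as I can tell, Icinga Web 2 only flags a check as overdue once the current time passes next_update, not next_check. A minimal sketch of that rule, using the timestamps from the host above (the helper name is mine):

```shell
# Hypothetical helper mirroring the overdue rule as I understand it:
# a check counts as overdue only once "now" exceeds next_update.
is_overdue() {
  local now=$1 next_update=$2
  if [ "$now" -gt "$next_update" ]; then echo overdue; else echo ok; fi
}

# next_check (14:29) has already passed, but next_update (14:34) has not,
# so the host is not listed as overdue yet:
is_overdue 1607571120 1607571254    # prints "ok"
```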

Another example:

I click check now on an overdue check, and see the check is executed:

[2020-12-10 15:26:13 +1100] notice/Process: Running command '/usr/lib64/nagios/plugins/check_nwc_health' '--community' '####' '--hostname' '10.35.0.17' '--mode' 'interface-usage' '--multiline': PID 48969
[2020-12-10 15:26:15 +1100] notice/Process: PID 48969 ('/usr/lib64/nagios/plugins/check_nwc_health' '--community' '####' '--hostname' '10.35.0.17' '--mode' 'interface-usage' '--multiline') terminated with exit code 0

Sorry this is a bit of a mess to read - our masters and satellites are on 2.12.2 - some agents on 2.11.3 but majority on 2.12.2.

It appears all of my issues go away if I stop the icinga2 service on one of the masters - it doesn't matter which one on initial tests; as long as one is stopped, everything goes back to normal.

Some more on this…

While master2 is stopped, this SQL query was returning about 37 results:

SELECT count(*) FROM `icinga_hoststatus` INNER JOIN icinga_objects  ON
`icinga_objects`.`object_id` = `icinga_hoststatus`.`host_object_id`
WHERE `last_check` < '2020-12-10 18:30:00' AND `icinga_objects`.`is_active` = 1 
LIMIT 0,1000;

I started master2:

[root@master2 ~]# systemctl status icinga2
ā— icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2020-12-10 18:41:08 AEDT; 17min ago

Master 1 Logs:

[2020-12-10 18:41:14 +1100] information/ApiListener: New client connection for identity 'master2' from [1.2.3.4]:46948
[2020-12-10 18:41:14 +1100] information/ApiListener: Sending config updates for endpoint 'master2' in zone 'master'.
[2020-12-10 18:41:19 +1100] information/DbConnection: Pausing IDO connection: ido-mysql
[2020-12-10 18:41:19 +1100] information/IdoMysqlConnection: Disconnected from 'ido-mysql' database 'icinga-prod'.
[2020-12-10 18:41:19 +1100] information/IdoMysqlConnection: 'ido-mysql' paused
[2020-12-10 18:42:37 +1100] information/ApiListener: Replayed 1468716 messages.
[2020-12-10 18:42:39 +1100] information/ApiListener: Finished sending replay log for endpoint 'master2' in zone 'master'.
[2020-12-10 18:42:39 +1100] information/ApiListener: Finished syncing endpoint 'master2' in zone 'master'.

Master 2 Logs:

[2020-12-10 18:41:54 +1100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 36.9143s ago. Taking over 'ido-mysql' in HA zone 'master'.
[2020-12-10 18:41:54 +1100] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.14.3')
[2020-12-10 18:41:57 +1100] information/IdoMysqlConnection: Finished reconnecting to 'ido-mysql' database 'icinga-prod' in 2.4372 second(s).
[2020-12-10 18:42:04 +1100] information/IdoMysqlConnection: Pending queries: 0 (Input: 280/s; Output: 280/s)

It's now 7:00 PM AEDT, so I run this query again (most of our checks are on 5-minute check periods, so they should well and truly have been checked by now):

SELECT count(*) FROM `icinga_hoststatus` INNER JOIN icinga_objects 
ON `icinga_objects`.`object_id` = `icinga_hoststatus`.`host_object_id`
WHERE `last_check` < '2020-12-10 18:40:00' AND `icinga_objects`.`is_active` = 1
LIMIT 0,1000;


623 host checks are overdue - and if I change this select query to hosts last checked before 6:45 PM, even more show up.

At this point Icinga on master2 has been running for 23 minutes - I stop it at 7:05PM:

[root@master2 ~]# date
Thu Dec 10 19:05:02 AEDT 2020

Majority of my hosts have become overdue:

As you can see, Icinga Web shows two pages of 500 hosts, so 500+ hosts are now overdue.

[root@master2 ~]# date
Thu Dec 10 19:11:08 AEDT 2020

I run the SQL query again, selecting all hosts whose last check was over 5 minutes ago:

SELECT count(*) FROM `icinga_hoststatus` INNER JOIN icinga_objects 
ON `icinga_objects`.`object_id` = `icinga_hoststatus`.`host_object_id`
WHERE `last_check` < '2020-12-10 19:05:00' AND `icinga_objects`.`is_active` = 1
LIMIT 0,1000;


Another 5 minutes pass and we're back to ~37 checks older than 10 minutes - which is what I expect to see, as we have 37 hosts which we only check daily.

[root@master2 ~]# systemctl status icinga2
ā— icinga2.service - Icinga host/service/network monitoring system
   Loaded: loaded (/usr/lib/systemd/system/icinga2.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2020-12-10 19:04:45 AEDT; 12min ago

So Icinga has been stopped for 12 minutes, and all of my hosts have returned to normal.

Any ideas? Is this potentially a bug in the core? We only started noticing this since we upgraded but I presumed there were other issues going on.

I feel like this may be related, but it was fixed in 2.12.1 and we're on 2.12.2:

[root@master1 icinga2]# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.2-1)
[root@master2 ~]# icinga2 -V
icinga2 - The Icinga 2 network monitoring daemon (version: 2.12.2-1)

Sorry to bump this thread, but I was hoping someone might be able to tell me what I've done wrong.

We've been running on one master since December - I've attempted to resolve this myself but I can't work out why it is occurring.

Both of my Icinga 2 masters seem to think the other is the active endpoint - the notification logs suggest both are 'paused' - so no check results are being written to the IDO and no notifications are being sent:

Both servers are constantly logging this message:

[2021-01-22 16:07:18 +1100] notice/NotificationComponent: Reminder notification 'Server!OpsGenie Host Alerts': HA cluster active, this endpoint does not have the authority (paused=true). Skipping. 

After doing some research I've seen mention of the authoritative endpoint - but how does this get set?

Both masters are talking to each other, and both connect to a VIP on our Galera cluster.

@theFeu Sorry to tag you in this - I am hoping to get someone from the Icinga team to weigh in. Essentially, I believe my two masters are in a split-brain scenario where both think the other is the active zone master, and they stop processing alerts and check results.

This only occurred after updating to Icinga 2.12.2.

If I run either master with the other disabled, things return to normal.

We are planning an upgrade to 2.12.3, but none of the bugfixes in that release appear relevant.

Hey there,

If you think you found a bug and you need to get in contact with the development team, the easiest way is to head over to GitHub and open a bug report.
I've forwarded this thread now, but in general our developers tend to check GitHub daily 🙂


Thanks 🙂 - I don't think it's a bug, because I'm sure someone else would have run into it by now, so I'm guessing something in our configuration is wrong - but we've got a pretty standard configuration.

Plain director + IcingaWeb, etc, nothing incredibly special or complicated.


I think I've managed to resolve this, although I am still unsure how to identify the root cause or whether this actually fixed it.

I upgraded our cores to the latest Icinga (from 2.12.2 to 2.12.3) this evening, and purged the /var/lib/icinga2/api folder on our secondary master.

After restarting our first master with its cluster config, and then starting the second master, it seems to be resolved again.

I have noticed a slight dip in performance, but it's likely related to that host (they're distributed across two data centres and run on slightly different hardware).

I'll mark this as the solution, but I'd love to know how I could have identified how a master chooses the authoritative endpoint in a zone.

Hello, are your Icinga instances running in a virtualized environment (KVM, Hyper-V, …)?
If so, depending on the overall load of the hypervisor, it may not have the resources to properly sync its clock with its VMs in due time. The VM's perception of time can then drift abnormally; even if NTP resyncs every minute (and it may not sync at all if the gap is too large), there is still a possibility of timestamps going haywire when an application sees time move backward or forward.
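One quick way to test this theory is to sample the clock on both masters at (nearly) the same moment and compare. A small sketch; the ssh target "master2" is just a placeholder hostname:

```shell
# Absolute difference between two epoch readings, in seconds.
clock_skew() {
  local a=$1 b=$2
  if [ "$a" -gt "$b" ]; then echo $(( a - b )); else echo $(( b - a )); fi
}

# Usage sketch (run by hand; "master2" is a placeholder):
#   clock_skew "$(date +%s)" "$(ssh master2 date +%s)"
# Anything beyond a second or two (allowing for ssh latency) suggests drift.
```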

Unfortunately our issues have come back.

I am again seeing weird issues with times, and check executions.

CPU and Memory checks on our secondary master are not executing at all - they will work for 15-30 minutes at a time, then stop for hours, until the icinga2 service is restarted:

Eventually they will stop again.

Some checks running on satellite hosts are showing:

This check should execute every 5 minutes. The web browser shows it as last checked 41 minutes ago, but inspecting the host check object shows:

Those Unix timestamps are not what is being shown in the web browser. The last_check timestamp was only seconds ago, and the next_check timestamp is a few minutes out - which is exactly what it should be: it was last executed 2 minutes ago, and will check again in 3 minutes (5-minute interval).

Why would Icinga Web show it was last checked 41 minutes ago? Where does it get this timestamp from?

I've checked all VMs at the time of the issue and they are all perfectly in sync with NTP.

I've checked the IDO database and that service has no mention of that time anywhere.

Given that information I can see two possibilities: your masters get out of sync with each other, or one master and the IDO are out of sync.
Can you check that in /etc/icinga2/features-available/ido-mysql.conf you have the enable_ha flag set to true on both of your masters?
(In a more general way, this file should be identical between your masters.)
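For reference, the relevant part of that feature file might look like this (hostname and credentials are placeholders; enable_ha is the point):

```
/* /etc/icinga2/features-available/ido-mysql.conf */
object IdoMysqlConnection "ido-mysql" {
  host = "galera-vip.example.com"   // placeholder: your Galera VIP or node
  database = "icinga-prod"
  user = "icinga"                   // placeholder credentials
  password = "secret"
  enable_ha = true                  // lets the two masters negotiate a single active IDO writer
}
```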

For the other possibility, you have to filter /var/log/icinga2/icinga2.log for events related to the IDO or the zone master.
The events I'm expecting to find should look like this:

[2021-01-06 10:18:32 +0100] notice/ApiListener: Current zone master: master2.localdomain 
[2021-01-06 10:11:33 +0100] information/IdoMysqlConnection: Last update by endpoint 'master2.localdomain' was 64.0388s ago. Taking over 'ido-mysql' in HA zone 'master'.
[2021-01-06 10:10:37 +0100] information/ApiListener: Received configuration updates (8) from endpoint 'master2.localdomain' are different to production, triggering validation and reload.

Try to see if those match the moments where checks stop running.
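A small helper along these lines can make the filtering quicker; it reads from stdin so it can be pointed at any log file (the patterns match the example lines above):

```shell
# Filter HA/IDO-related events out of an Icinga 2 log stream.
ha_events() {
  grep -E "Current zone master|Taking over|'ido-mysql' (paused|resumed)"
}

# usage: ha_events < /var/log/icinga2/icinga2.log
```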

Also, if it is only one type of check that stops working (memory and CPU), this may be closer to a configuration problem.

I can confirm the enable_ha flag is set on both masters, and we are using a Galera MySQL cluster.

After reviewing the logs, I noticed icinga2 crashed at 8:27AM which appears to be memory related (either OOM or Hyper-V is not allocating the requested memory):

Feb 16 08:25:33 master2 systemd[1]: icinga2.service: main process exited, code=exited, status=137/n/a
Feb 16 08:25:33 master2 systemd[1]: Unit icinga2.service entered failed state.
Feb 16 08:25:33 master2 systemd[1]: icinga2.service failed.
Feb 16 08:25:39 master2 systemd[1]: icinga2.service holdoff time over, scheduling restart.
Feb 16 08:25:39 master2 systemd[1]: Stopped Icinga host/service/network monitoring system.
Feb 16 08:25:39 master2 systemd[1]: Starting Icinga host/service/network monitoring system

After restarting, the checks then began working at 8:27AM until about 10:02AM:

Nothing interesting happened on either host at 10:02 or the moments after, but I suspect you may have been correct that this is a resourcing issue.

EDIT: Can confirm the OOM killer is kicking in - I am having our infrastructure team check whether dynamic RAM is working as expected.

Host was set to static RAM, so weā€™ve doubled the resources.

I've left it running for the last 6 hours and issues still seem to be occurring - the CPU and memory checks that run from our secondary master ran from 11:51 AM until 12:26 PM.

The disk and network checks that run on the same server, on the same schedule, have run without any issues.

It's just those two checks that are never being executed.

I'm a bit out of my depth here - I thought I had a pretty good understanding of why checks become overdue, but I can't find anything even with debug logging enabled.

@Someone by IDO out of sync, do you mean the IDO database/tables or the status in memory on both masters?

I don't think the tables could be out of sync, as they are in a 3-node Galera cluster with a floating IP address which routes to its closest local instance - but even pointing both masters at the same node, without the floating IP, has not made a difference.

With all of these issues combined… all hosts and services becoming overdue, random checks starting and stopping, checks showing they last ran 45 minutes ago when the timestamp was seconds ago…

Things have been relatively stable, other than these two CPU and memory checks that are never executed even when they are 5+ hours overdue.

I've also noticed that when I click Check Now in Icinga Web, I can see the API request appear in the log on the primary master, but not the secondary - and it looks like it never gets scheduled (nothing appears in the logs on the secondary master about its overdue checks, or the ones I manually request via Check Now).

As we keep running into these problems, I think we are going to reevaluate whether Icinga is the best product for our use case.

No, I mean that Icinga lost track of which endpoint controls the cluster, and you basically end up in a split-brain situation. That's why I want to confirm through the logs whether both of your masters are trying to take over the cluster and the IDO at the same time (which is one of the first things you suggested). But since you confirm it's just the memory/CPU checks, that now seems less likely to me.

It's not exactly explained in the docs AFAIK, but the authoritative endpoint in a cluster is the one with its zones.d directory filled with config. If both endpoints have config under zones.d, they will both consider themselves authoritative and reject config updates from the other endpoint - so it's basically set manually.

That's a good idea, but I would have done it on both masters, especially since config comes from packages (from Director or handmade) and zones.d from /etc. I'm suggesting it to make sure the master journal files get erased and everything restarts from zero, with no history of previous Icinga runs.

Overall the global Icinga settings look good to me. Can you post the config related to the checks (CheckCommands, Hosts, Services) that have those problems? Maybe I'll find something, but honestly I have trouble seeing why host/service/CheckCommand config would cause this.

The last option, as @theFeu suggested, is to ask for help on GitHub; if it really is a problem on the code side, they will be much better placed to help you than me.

Thanks for your input @Someone - it's much appreciated.

After a config was pushed from Director, I can see from the IDO logs that the primary master establishes a connection, then the secondary, and the secondary takes over - the logs appear normal on both sides.

Here are some logs from yesterday:

Master 1 Logs:

[2021-02-16 09:54:21 +1100] information/DbConnection: 'ido-mysql' stopped.
[2021-02-16 09:54:24 +1100] information/DbConnection: 'ido-mysql' started.
[2021-02-16 09:55:04 +1100] information/IdoMysqlConnection: 'ido-mysql' resumed.
[2021-02-16 09:55:04 +1100] information/DbConnection: Resuming IDO connection: ido-mysql
[2021-02-16 09:55:14 +1100] information/IdoMysqlConnection: Last update by endpoint 'master2' was 35.5522s ago. Taking over 'ido-mysql' in HA zone 'master'.
[2021-02-16 09:55:19 +1100] information/IdoMysqlConnection: Finished reconnecting to 'ido-mysq' database 'icinga-prod' in 4.63813 second(s).
[2021-02-16 09:55:21 +1100] information/DbConnection: Pausing IDO connection: ido-mysql
[2021-02-16 09:55:21 +1100] information/IdoMysqlConnection: Disconnected from 'ido-mysql' database 'icinga-prod'.
[2021-02-16 09:55:21 +1100] information/IdoMysqlConnection: 'ido-mysql' paused.

Master 2 logs:

[2021-02-16 09:54:46 +1100] information/DbConnection: Pausing IDO connection: ido-mysql
[2021-02-16 09:54:46 +1100] information/IdoMysqlConnection: Disconnected from 'ido-mysql' database 'icinga-prod'.
[2021-02-16 09:54:46 +1100] information/IdoMysqlConnection: 'ido-mysql' paused.
[2021-02-16 09:54:46 +1100] information/DbConnection: 'ido-mysql' stopped.
[2021-02-16 09:54:48 +1100] information/DbConnection: 'ido-mysql' started.
[2021-02-16 09:55:20 +1100] information/IdoMysqlConnection: 'ido-mysql' resumed.
[2021-02-16 09:55:20 +1100] information/DbConnection: Resuming IDO connection: ido-mysql
[2021-02-16 09:55:50 +1100] information/IdoMysqlConnection: Last update by endpoint 'master1' was 36.8543s ago. Taking over 'ido-mysql' in HA zone 'master'.
[2021-02-16 09:56:23 +1100] information/IdoMysqlConnection: Finished reconnecting to 'ido-mysq' database 'icinga-prod' in 32.6144 second(s).

It does seem that after a reload of the config, the check begins working for a few minutes - but not all the time - e.g. a user pushed a Director config at 10:03AM this morning, and the check began working again at

I don't think it is related to this check specifically, as it is the ITL load check - and I've also seen the cluster check become overdue too.

We've got the exact same load check template applied to at least 20 other hosts - and with the other issues combined, I don't think it sounds like a template issue - especially as the templates have not changed in the 12+ months we've been using Icinga.

I should be able to purge all config packages on both masters - do you think it is worth also dropping the IDO database and starting from scratch?

We don't need any historical data, as we already write check results into InfluxDB and OpenTSDB.

Given the logs, there doesn't seem to be abnormal behaviour here - this happens when one master reloads after the config gets updated, and the other master then reloads too.

I would say it won't change much, since the IDO is written to by Icinga anyway.

OK, thanks for confirming. This, and the fact that checks work for a short time after a reload, makes me think your scheduler queue is getting full. That leads me to ask a few other things.

How many hosts/services are scheduled to run in each zone?
You can get this information by running icinga2 daemon -C on a poller belonging to each zone.

What is the MaxConcurrentChecks value for each zone?
You can get this information by looking at the constants.conf file; if not set, it defaults to 512.

The idea is to compare how many concurrent checks you can run against how many services/hosts are scheduled to run per zone. By the way, even if MaxConcurrentChecks is equal to or slightly above the number of services/hosts to check, you can still have overdue checks if check execution takes too long.
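If raising the limit turns out to be necessary, it's a one-line constant (512 is the documented default; 1024 below is only an example value):

```
/* /etc/icinga2/constants.conf */
const MaxConcurrentChecks = 1024
```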

Regarding this kind of problem, I made this debug function, which could also help you get an overview of overdue checks.


We have many, many satellite zones, so it really does vary.

We haven't modified anything in constants.conf for any of our satellites, so it will be 512.

In total, we have 2001 hosts, 5769 services, and 66 satellite zones.

Would MaxConcurrentChecks impact the masters? Perhaps this is the problem - if the masters need to schedule all of the checks for the satellites, we would definitely have more than 512 running at the same time across all sites (though never 512 on one satellite, or in the master zone).

REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 5 Hosts.
REDACTED:
    [2021-02-18 10:52:16 +1100] information/ConfigItem: Instantiated 9 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 14 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 4 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 11 Hosts.
REDACTED:
    [2021-02-18 10:51:53 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:51:33 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 10 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 15 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 18 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 9 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 2 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 12 Hosts.
REDACTED:
    [2021-02-18 10:52:18 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:45:14 +1100] information/ConfigItem: Instantiated 10 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 4 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 3 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 10 Hosts.
REDACTED:
    [2021-02-18 10:51:26 +1100] information/ConfigItem: Instantiated 8 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 64 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 3 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 9 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:50:31 +1100] information/ConfigItem: Instantiated 10 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 3 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 316 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 35 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 8 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 21 Hosts.
REDACTED:
    [2021-02-18 10:45:23 +1100] information/ConfigItem: Instantiated 26 Hosts.
REDACTED:
    [2021-02-18 10:54:00 +1100] information/ConfigItem: Instantiated 8 Hosts.
REDACTED:
    [2021-02-18 10:52:17 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 5 Hosts.
REDACTED:
    [2021-02-18 10:52:17 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 9 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 39 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 4 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 4 Hosts.
REDACTED:
    [2021-02-18 10:52:21 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 51 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 8 Hosts.
REDACTED:
    [2021-02-18 10:43:10 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 156 Hosts.
REDACTED:
    [2021-02-18 10:52:21 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 5 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 38 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 11 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 9 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 5 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 7 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 17 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 6 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 36 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 49 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 84 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 68 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 3 Hosts.
REDACTED:
    [2021-02-17 23:52:20 +0000] information/ConfigItem: Instantiated 2 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 298 Hosts.
REDACTED:
    [2021-02-18 10:52:20 +1100] information/ConfigItem: Instantiated 21 Hosts.
REDACTED:
    [2021-02-18 10:52:19 +1100] information/ConfigItem: Instantiated 10 Hosts.
REDACTED:
    [2021-02-17 23:52:22 +0000] information/ConfigItem: Instantiated 55 Hosts.

In the actual master zone, there are a few checks that execute (checking satellite zones are connected), and the two masters also execute their own CPU/Memory/Disk/Network checks.

I think you are onto something with MaxConcurrentChecks - because some zones would have the ~300 host checks, plus the services, all on the same check intervals.

Other than requiring more compute power, does the MaxConcurrentChecks impact anything else?

After doing some more research around the MaxConcurrentChecks constant, this post seems to suggest it would only affect checkables running on one endpoint.

Am I correct to assume if I set this to 1024 on the master zone, it would make no change to the overall amount of checks running, or would this allow for the masters to schedule more checks at a time?

Some more strange behaviour.

I had a server send an alert today, but IcingaWeb shows no record of this.

Logs from Master 1:

[2021-02-18 14:20:03 +1100] information/Notification: Sending reminder 'Problem' notification 'server.fdqn!Memory Usage!OpsGenie Service Alerts' for user 'OpsGenie'
[2021-02-18 14:20:03 +1100] information/Notification: Completed sending 'Problem' notification 'server.fdqn!Memory Usage!OpsGenie Service Alerts' for checkable 'server.fdqn!Memory Usage' and user 'OpsGenie' using command 'icinga2opsgenie'.

I am unsure why it is sending a 'reminder' notification, as the interval is set to 0.
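For what it's worth, the setting in question lives on the Notification object. A sketch of how ours is configured (the apply rule and assign filter here are illustrative, only interval matters):

```
apply Notification "OpsGenie Service Alerts" to Service {
  command = "icinga2opsgenie"
  users = [ "OpsGenie" ]
  interval = 0   // 0 should suppress re-notifications entirely
  assign where service.vars.notify_opsgenie == true   // hypothetical filter
}
```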

Logs from Master 2 show nothing around the same time for that host.

The check says it was last checked at 12:27, but as you can see from my graph there are loads of data points since then - and if it was truly last checked at 12:27, then no alert could have been sent at 2:40 PM.

There is also nothing in the history tab, nor in the Icinga Web > Notifications tab - the last notification showing up is listed as 11:50AM.

I've tried disabling both IDO backends in Icinga Web to see if there is a difference, but they both show the same results.

This is starting to sound like a SQL issue now.