Incorrect last_check timestamps and overdue state not being set correctly

More SQL weirdness going on.

I’ve discovered an old host is still appearing in IcingaWeb:

The host does not exist in Director or the config files anywhere, and when we inspect the host:

But if I check the icinga_objects SQL table, I can see the rows still exist:
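The query was roughly the following (MySQL, with the hostname replaced by a placeholder):

SELECT object_id, objecttype_id, name1, name2, is_active
FROM icinga_objects
WHERE name1 = 'old-hostname';

The rows come back with is_active still set to 1, even though the host no longer exists anywhere in the configuration.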

Depends; it would impact only the checks pinned in the master zone, and it’s usually not the master’s job to poll data.

MaxConcurrentChecks applies only to the checker feature (the scheduler), so raising it on busy satellites can help; raising it on the masters shouldn’t change anything, given you don’t have many checks in your master zone.

You are right, the MaxConcurrentChecks setting applies to the local Icinga instance; it is not distributed with the rest of the configuration since it is set in constants.conf. Also, satellites don’t get their checks scheduled by the masters: they only get config from them, then schedule and load-balance checks by themselves within their own zone.
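For example, on a busy satellite this would be something like the following in constants.conf (512 is the default; the value here is only illustrative):

/* constants.conf on the satellite */
const MaxConcurrentChecks = 1000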

The scheduler is supposed to spread checks over time to avoid load bursts, even if they share the same check interval; having more checks in a zone than MaxConcurrentChecks is fine.
The precise inner workings of the scheduler are documented here, if that helps:
https://icinga.com/docs/icinga-2/latest/doc/19-technical-concepts/#check-scheduler

Having them with the is_active flag enabled makes me think the IDO is not being updated properly. That can happen for various reasons, but the configuration side looks fine, so, as you guessed in your last post, it may be an SQL-level issue.

Given your feedback, I think you have not just one but multiple issues. To find the root causes, I’d advise running a simpler Icinga architecture for a while (a single IDO backend with a single Icinga Web frontend, alternating from one master to the other, and raising MaxConcurrentChecks on satellites with many checks). Switching parts on and off and letting it run like that for a while can help you see which component is breaking the whole.

If you are going to do that, I’d advise purging your IDO database so you can re-analyse from scratch.

Last thing: for the checks that are overdue and causing you problems, you can see their check_source in Icinga Web. If it’s always the same poller/zone, then you may have an issue with that specific endpoint.
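If you want to count the overdue services per source, a rough sketch from the icinga2 console (untested, in the style of the examples in the troubleshooting docs; adapt as needed) would be:

var res = {}; for (s in get_objects(Service)) { if (s.last_check_result && s.last_check < get_time() - 2 * s.check_interval) { res[s.last_check_result.check_source] += 1 } }; res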


It does seem the IDO is the root of the issue.

Strangely enough, when I was troubleshooting earlier, everything was fine with only one master running; it was as soon as the other became active that everything started to fall apart.

We have had an IDO issue once before, where NTP on our Hyper-V host went crazy and briefly set the time a few hundred days into the future, so a lot of host and service checks got their host_next_update set way into the future and never ran until we purged the IDO.
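A query along these lines against the IDO would show such rows (a sketch in MySQL syntax, using the standard IDO tables; the one-day threshold is arbitrary):

SELECT o.name1, o.name2, s.next_check
FROM icinga_servicestatus s
JOIN icinga_objects o ON o.object_id = s.service_object_id
WHERE s.next_check > NOW() + INTERVAL 1 DAY;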

I’ll make a plan to turn off notifications, purge both API folders, purge the IDO and bring them back up.

I do agree it sounds like multiple issues, but we’ve been running with this architecture for over a year, and the issues only started happening after upgrading to 2.12.2.

Is there an easier way to purge the IDO other than dropping the tables and reimporting the schema?
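(The only way I know is the blunt one, roughly along these lines, assuming a MySQL IDO database named icinga; the packaged schema path may differ per distribution:)

# stop Icinga 2 on both masters first
systemctl stop icinga2
mysql -u root -p -e "DROP DATABASE icinga; CREATE DATABASE icinga;"
mysql -u root -p icinga < /usr/share/icinga2-ido-mysql/schema/mysql.sql
systemctl start icinga2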

Regarding the check_source, the current problematic checks are service checks that the masters executed on themselves.

It always seems to be the same checks on the same hosts which become overdue.

Master 1:

Master 2:

These checks run on a 5 minute interval.

As you can see, the primary (left) seems to be working as expected on some checks, but there are a lot of data points missing from the secondary.

Interestingly, they both seem to stop at exactly the same time.

You can set up IDO cleanup so data doesn’t keep stacking up for deactivated objects; I’d advise doing that, especially if you don’t use the historical data. The thing is, in your case objects don’t get deactivated in the database at all (the is_active field was still set for a deleted object), so I’d completely reset the IDO database and set up cleanup on the masters.
You can check the cleanup options in the documentation here:
https://icinga.com/docs/icinga-2/latest/doc/14-features/#db-ido-cleanup
https://icinga.com/docs/icinga-2/latest/doc/09-object-types/#idomysqlconnection
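For example, in features-available/ido-mysql.conf something along these lines (credentials and retention values are only illustrative):

object IdoMysqlConnection "ido-mysql" {
        user = "icinga"
        password = "icinga"
        host = "localhost"
        database = "icinga"

        cleanup = {
                downtimehistory_age = 31d
                notifications_age = 31d
                statehistory_age = 90d
        }
}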

I can try diving into the code (I’m not an Icinga dev, I may miss things) and see if there are differences that could have impacted you. You got this problem when you upgraded your masters to 2.12.2, but from which version? I need a starting point to review the diffs.

I haven’t purged the DB yet, planning on doing it sometime today.

However, I have been reading some more troubleshooting documentation and I am getting some interesting results using the Icinga console:

Master 1:

<1> => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
{
    @false = 1742.000000
    @true = 1715.000000
}
<2> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()
3292.000000

Master 2:

<1> => var res = {}; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res[s.paused] += 1 }; res
{
    @false = 102.000000
    @true = 98.000000
}
<2> => var res = []; for (s in get_objects(Service).filter(s => s.last_check < get_time() - 2 * s.check_interval)) { res.add([s.__name, DateTime(s.last_check).to_string()]) }; res.len()
155.000000

I stopped the second master, as almost all of our services had become overdue.

Now everything has returned to normal.

I’m not sure if this warrants a GitHub issue, but how likely do you think it is that this is related to the IDO?

I would have thought that if the IDO tables were broken, it would be a problem regardless of whether it runs as a cluster or not.

It’s very strange it all returns to normal after the second master is stopped.

Still haven’t purged the DB yet as it’s working fine (and it’s in production so I am trying to minimise downtime).

However I’ve discovered another weird behaviour in IcingaWeb:

[screenshot]

I’ve obfuscated the names, but VMK is the node that is stopped currently.

So at this point I’m certain it is the database.

Well, I finally managed to get around to building a new blank database, and pointing both instances of Icinga to it.

Unfortunately, the issue remains: as soon as the secondary master starts, all checks slowly become overdue, and as soon as I stop it, everything returns to normal within a few minutes. Exactly the same behaviour as I witnessed previously.

I cannot understand how this is happening. I’m still not convinced it is a bug, because I haven’t seen anyone report anything similar with an identical configuration; however, this is such a breaking issue that without developer input we’re pretty lost.

I hate to say it, as I’ve had nothing but great experiences up until this point, but we will have to evaluate whether Icinga is stable enough for us to continue using it.

Sorry for jumping in, but it seems to me it’s a configuration mistake. I have read several times in this thread that the hosts become normal again after you stop the second master, i.e. it only happens when both masters are started. Does the second master even know that the other hosts exist? You can use an API query to poll for some overdue hosts on both master endpoints and see whether you get the same output (see the sketch below). It may be that you have configured the zone and endpoint definitions of the satellites only on the first master, which would explain this behaviour. For all hosts that connect directly to master1, the second master also has to be known to those hosts, and the hosts to master2, so that in case of a failure of the first master, or due to the object authority mechanism, the second master can take over. After all, that is what a cluster means.
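A rough sketch of such a query, to be run against both masters and compared (apiuser:password, master1 and the two-interval threshold are placeholders):

curl -k -s -u apiuser:password \
     -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST \
     'https://master1:5665/v1/objects/hosts' \
     -d '{ "filter": "host.last_check < get_time() - 2 * host.check_interval", "attrs": [ "__name", "last_check" ] }'

If the two masters return different sets of hosts, that points to a zone/endpoint definition problem rather than the IDO.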

Below is an example with two masters and one satellite showing how the configuration should look.

Master1:

object Endpoint "master1" {}

object Endpoint "master2" { }

object Zone "master" {
        endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite" { }

object Zone "sattelite" {
        endpoints = [ "sattelite" ]
        parent = "master"
}

Master2:

object Endpoint "master1" {
       host = "master1's IP-Address"
       port = "Default: 5665"
}

object Endpoint "master2" { }

object Zone "master" {
        endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite" { }

object Zone "sattelite" {
        endpoints = [ "sattelite" ]
        parent = "master"
}

satellite:

object Endpoint "master1" {
       host = "master1's IP-Address"
       port = "Default: 5665"
}

object Endpoint "master2" { }

object Zone "master" {
        endpoints = [ "master1", "master2" ]
}

object Endpoint "satellite" { }

object Zone "sattelite" {
        endpoints = [ "sattelite" ]
        parent = "master"
}

You can also configure the second master to accept configuration from the primary master to keep both in sync, and the second master must have exactly the same features enabled as master1.
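On the second master that part of the api feature would look something like this:

object ApiListener "api" {
        accept_config = true
        accept_commands = true
}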

Have you read the High Availability: Object Authority documentation section?


Sorry for the thread resurrection, but I thought I might post an update to this.

We’re still running a single instance, with the second master disabled.

We had all sorts of issues randomly pop up after months of working as expected, and I think I’ve finally worked out the culprit - the OpenTSDBWriter plugin.

We use both OpenTSDB and InfluxDB - with OpenTSDB enabled, thousands of hosts and services become overdue.

I am going to leave it disabled for a few days and see if the same issues pop up, but this is likely what has caused ALL of my issues since I created this thread.
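(For reference, turning it off is just the following on the node where the feature is enabled:)

icinga2 feature disable opentsdb
systemctl restart icinga2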


Alright, we’ve left it disabled for the past 22 days and everything has worked perfectly: no weird delays, no overdue checks, no overdue alerts.

My team took a look at the OpenTSDB writer feature, and it looks like the problem is caused by each write being performed synchronously, blocking everything else while it runs.

This means that the problem is compounded as we scale - more hosts/checks = more writes, which means more time waiting for OpenTSDB.

We’re going to fully migrate to InfluxDB.
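For anyone landing here later: one reason we’re comfortable with this is that the InfluxdbWriter exposes batching knobs (flush_threshold / flush_interval) rather than writing point by point. A minimal sketch of the feature config, with placeholder host and database and illustrative flush values:

object InfluxdbWriter "influxdb" {
        host = "127.0.0.1"        // placeholder
        port = 8086
        database = "icinga2"
        flush_threshold = 1024    // data points buffered before a write
        flush_interval = 10s      // maximum time between flushes
}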