Check icingadb issues

Caquisse · April 6, 2023, 8:26am

Hello!

I have a question regarding Icinga DB and the Icinga DB check. We run an Icinga DB setup with 2 masters in the recommended configuration (i.e. one local icingadb-redis install per master, plus one database, hosted separately from the masters, to store the events).

Regularly the Icinga DB check from Icinga 2 outputs a warning about the history backlog being greater than the warning threshold.

Digging into the check’s code led me to some Redis and MySQL queries to see what’s going on. From what I’ve seen, events are sent to the Redis instances on both masters. One of the Icinga DB daemons handles the event and stores the result in the DB, and the other silently “ignores” it because it has already been registered by the other master (I suppose it’s something like that).

In certain situations, one of our masters does not process an event in its Redis stream for 10-15 minutes while the other master is fine: I can see the “lagging” event in the Redis of the affected master, while the Redis of the other master is empty as expected. The “lagging” event is nevertheless properly registered in the DB, and the master that is not processing it looks fine in terms of CPU/RAM.

That means this behaviour puts the icingadb check into a warning state on one of our masters, even though everything seems fine in the end since the event is properly registered in the DB. Do you have any idea what could cause this behaviour, or whether it is normal? I don’t see anything that could explain why one of our masters does not process an event for 10-15 minutes.

(Note: the only lead I have is that it’s always the same master that has this issue. The other machine is always fine, but the impacted machine has enough resources and its logs don’t warn about anything either.)
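For anyone who wants to measure how old such a “lagging” entry is, here is a minimal sketch. It relies only on the fact that auto-generated Redis stream entry IDs have the form `<unix-milliseconds>-<sequence>`. The function name `stream_entry_age_seconds` is hypothetical and is not part of Icinga DB or of the check; it is just one way to turn an entry ID into an age.

```python
import time
from typing import Optional


def stream_entry_age_seconds(entry_id: str, now_ms: Optional[int] = None) -> float:
    """Return the age in seconds of a Redis stream entry, derived from its ID.

    Assumes an auto-generated ID of the form '<unix-ms>-<sequence>'.
    `now_ms` can be passed explicitly to make the result reproducible.
    """
    ms_part, _, _sequence = entry_id.partition("-")
    created_ms = int(ms_part)
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp at zero in case of small clock differences between hosts.
    return max(0.0, (now_ms - created_ms) / 1000.0)


if __name__ == "__main__":
    # Hypothetical entry ID, as it would appear in XRANGE output.
    oldest_id = "1680769560000-0"
    age = stream_entry_age_seconds(oldest_id)
    print(f"oldest entry is {age / 60:.1f} minutes old")
```

The ID of the oldest entry can be read with the Redis command `XRANGE <stream> - + COUNT 1` against the local icingadb-redis instance; which stream key to query depends on the setup and is not something this sketch knows about.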

Al2Klimov · April 6, 2023, 11:23am

Hello @Caquisse!

Does increasing the threshold fix the issue?

Best,
A/K

Caquisse · April 6, 2023, 1:45pm

I suppose this can fix it (that’s what I’m trying to integrate into our automation right now). But I’m about to set very high values compared to the default ones (warning 5 minutes / critical 15 minutes). Given what I’ve seen, I suppose we should wait at least 15 minutes before triggering a warning, but I’m wondering whether such a high threshold could be a problem in the long run if a real issue happens some day.
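To make the trade-off concrete, here is a rough sketch of how a backlog age maps to a check state under two threshold sets. The defaults are the 5 and 15 minute values mentioned above; the raised values (20 and 30 minutes) are purely illustrative, and `backlog_state` is a hypothetical helper, not the actual logic of the icingadb check.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def backlog_state(backlog_seconds: float,
                  warning_seconds: float = 5 * 60,
                  critical_seconds: float = 15 * 60) -> int:
    """Classify a history backlog age against warning/critical thresholds.

    Assumption: a state is raised when the backlog is strictly greater
    than the threshold; the real check may compare differently.
    """
    if backlog_seconds > critical_seconds:
        return CRITICAL
    if backlog_seconds > warning_seconds:
        return WARNING
    return OK


if __name__ == "__main__":
    lag = 12 * 60  # a 12 minute lag, in the 10-15 minute range seen here
    print(backlog_state(lag))                    # default thresholds
    print(backlog_state(lag, 20 * 60, 30 * 60))  # illustrative raised thresholds
```

With the defaults, a 12 minute lag yields a warning; with the raised values it stays OK, but a genuine backlog would then also go unnoticed until it passes 20 minutes.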

Caquisse · May 25, 2023, 1:28pm

Hello, just a follow-up in case anyone ends up in this situation. After a few weeks of analysing and testing this, we didn’t find any root cause, so we ended up increasing the threshold as suggested by @Al2Klimov.

That means that master 1 now has the check with the increased threshold and master 2 has the standard check. We have not seen any other issue since this change. This is not a real solution, as we did not find the root cause, but at least it’s working properly and no longer notifying us for nothing.