Hello!
I have a question regarding Icinga DB and the Icinga DB check: we have an Icinga DB setup with 2 masters and the recommended configuration (i.e. one local icingadb-redis install per master, plus one database on a host separate from the masters to store the events).
Regularly the Icinga DB check from Icinga 2 outputs a warning about the history backlog being greater than the warning threshold.
Digging into the check’s code led me to run some Redis and MySQL queries to see what’s going on. From what I’ve seen, events are sent to the Redis instances on both masters; one of the Icinga DB daemons handles the event and stores the result in the DB, while the other silently “ignores” it because it has already been written by the other master (I suppose it works something like that).
In certain situations, one of our masters does not process an event in the Redis stream for 10-15 minutes while the other is fine: I can see the “lagging” event in the Redis of the affected master, while the Redis of the other master is empty as expected. The “lagging” event is nevertheless properly registered in the DB, and the master that is not processing it shows normal CPU/RAM usage.
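For anyone who wants to reproduce the observation, here is a minimal sketch of how the age of the oldest pending history entry can be read from the Redis side. It relies on the fact that auto-generated Redis stream IDs have the form `<millisecond timestamp>-<sequence>`. The stream names in `HISTORY_STREAMS`, the function names, and port 6380 are assumptions on my part, not taken from the check's code, so verify them against your own instance (e.g. with `SCAN 0 MATCH icinga:history:stream:*`).

```python
import time

# Assumed history stream names; check what actually exists in your Redis.
HISTORY_STREAMS = [
    "icinga:history:stream:state",
    "icinga:history:stream:notification",
    "icinga:history:stream:downtime",
    "icinga:history:stream:comment",
    "icinga:history:stream:flapping",
    "icinga:history:stream:acknowledgement",
]


def entry_age_seconds(entry_id, now_ms):
    """Age of a stream entry, derived from its ID (<ms timestamp>-<seq>)."""
    if isinstance(entry_id, bytes):
        entry_id = entry_id.decode()
    ts_ms = int(entry_id.split("-", 1)[0])
    return max(0.0, (now_ms - ts_ms) / 1000.0)


def oldest_entry_ages(client, streams, now_ms=None):
    """Return {stream: age in seconds of its oldest entry, 0.0 if empty}.

    `client` is any object with a redis-py style xrange() method.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    ages = {}
    for stream in streams:
        # Fetch only the first (oldest) entry still in the stream.
        entries = client.xrange(stream, min="-", max="+", count=1)
        ages[stream] = entry_age_seconds(entries[0][0], now_ms) if entries else 0.0
    return ages


# Example usage with redis-py (port 6380 is an assumption for icingadb-redis):
#   import redis
#   client = redis.Redis(host="localhost", port=6380)
#   for stream, age in oldest_entry_ages(client, HISTORY_STREAMS).items():
#       print(f"{stream}: oldest entry is {age:.0f}s old")
```

Running this against the Redis of each master at the same moment should show a non-zero age on the affected master only, which matches what I see when the check goes into warning.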
The net effect is that the icingadb check goes into a warning state on one of our masters, even though everything seems fine in the end because the event is properly registered in the DB. Do you have any idea what could cause this behaviour, or whether it is normal? I don’t see anything that would explain why one of our masters does not process an event for 10-15 minutes.
(Note: the only lead I have is that it’s always the same master that has this issue; the other machine is always fine. The impacted machine has enough resources, though, and the logs don’t show any warnings either.)