Came across this after restarting a system set up as a secondary master, which I had taken down for a few hours for maintenance. Only a small amount of the history was restored compared to what is recorded on the primary system. This is set up WITHOUT HA on icingadb (each system has its own local database). (This system has entirely too many alerts, but that’s a separate issue for us to work out in tuning. It did make it clear, though, that the full replay was not received.)
Anyhow, I set up a lab system to test this out and had the same problem. What I suspect is happening is that once an element is rechecked on the recovering system, it stops accepting any more events from the replay log. Debug log shows something like:
Skipping check result for checkable ‘test14!Ping’ from 2025-04-22 13:41:42 -0600 (1745350902.210829). It is in the past compared to ours at 2025-04-22 13:42:44
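My reading of that message, as a toy Python sketch. This is not Icinga 2's actual code: the Checkable class, process_check_result and simulate are names I made up for illustration, and the exact comparison Icinga uses is an assumption on my part. The timestamps are the two from the log line above plus two invented older ones.

```python
from dataclasses import dataclass, field


@dataclass
class Checkable:
    """Hypothetical stand-in for a host or service on the recovering node."""
    name: str
    last_result_ts: float = 0.0               # time of the newest result held
    history: list = field(default_factory=list)

    def process_check_result(self, ts: float, source: str) -> bool:
        """Accept a result only if it is not older than the newest one held."""
        if ts < self.last_result_ts:
            return False                      # "It is in the past compared to ours"
        self.last_result_ts = ts
        self.history.append((ts, source))
        return True


def simulate(local_check_ts, replay_timestamps):
    """Return the replayed timestamps that would be accepted.

    local_check_ts is the time of a fresh local check made before the
    replay arrives, or None if the checker feature is disabled.
    """
    svc = Checkable("test14!Ping")
    if local_check_ts is not None:
        svc.process_check_result(local_check_ts, "local")
    return [ts for ts in replay_timestamps
            if svc.process_check_result(ts, "replay")]


# Two invented older results, then the one from the log line (13:41:42).
REPLAY = [1745340000.0, 1745345000.0, 1745350902.210829]

# Checker enabled: the local recheck at 13:42:44 lands before the replay.
print(simulate(1745350964.0, REPLAY))

# Checker disabled: nothing local to compare against.
print(simulate(None, REPLAY))
```

In this model, one fresh local result is enough to make every replayed result for that checkable look stale, while with no local result the whole replay is accepted in order.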
To further test that idea, I disabled checker on the secondary system before restarting, and that seemed to allow the full replay to come through.
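To compare the two runs with actual numbers, counting the skip messages per checkable in the debug log would work. A rough Python sketch, assuming the message looks like the line quoted above (the regex is based on that one example only, and count_skipped is just a name I picked):

```python
import re
from collections import Counter

# Matches the message text quoted above; accepts straight or curly quotes
# around the checkable name, since the forum may have altered them.
SKIP_RE = re.compile(
    r"Skipping check result for checkable ['‘](?P<name>.+?)['’] from "
)


def count_skipped(lines):
    """Count 'Skipping check result' messages per checkable name."""
    counts = Counter()
    for line in lines:
        match = SKIP_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Feeding it the lines of the debug log (for example via open() on wherever the debuglog feature writes on your system) and printing counts.most_common() shows which checkables lost the most replayed results. If the suspicion is right, the counts should drop to zero when checker is disabled before the restart.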
I may just not understand this well enough and this is how it is supposed to work, but I’d like to gather as much info as I can so I know what to expect in the future.
Give as much information as you can, e.g.
- Version used (icinga2 --version): 2.14.5
- Operating System and version: CentOS Stream 9
- Enabled features (icinga2 feature list): api checker command debuglog icingadb livestatus mainlog