Secondary master, replay log does not replay full history

I came across this after restarting a system set up as a secondary master, having taken it down for a few hours for maintenance. Only a small amount of the history was restored compared to what is recorded on the primary system. This setup is WITHOUT HA on Icinga DB (each system has its own local database). (This system has entirely too many alerts, which is a separate tuning issue for us to work out; it did make it clear, though, that the full replay was not received.)

Anyhow, I set up a lab system to test this out and had the same problem. What I suspect is happening is that once a checkable is rechecked on the recovering system, it stops accepting any further replay-log events for that checkable. The debug log shows something like:

Skipping check result for checkable 'test14!Ping' from 2025-04-22 13:41:42 -0600 (1745350902.210829). It is in the past compared to ours at 2025-04-22 13:42:44

To test that idea further, I disabled the checker feature on the secondary system before restarting, and that seemed to allow the full replay to come through.
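
For reference, the workaround on the secondary looked roughly like this (a sketch; the exact service commands depend on your init system):

    # On the secondary master, before bringing it back up:
    icinga2 feature disable checker
    systemctl restart icinga2

    # Wait until the replay log from the primary has been fully applied
    # (watch the debug log / history filling in), then re-enable active checks:
    icinga2 feature enable checker
    systemctl restart icinga2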

I may just not understand this well enough and this is how it is supposed to work, but I’d like to gather as much info as I can so I know what to expect in the future.

Give as much information as you can, e.g.

  • Version used (icinga2 --version): 2.14.5
  • Operating System and version: CentOS Stream 9
  • Enabled features (icinga2 feature list): api checker command debuglog icingadb livestatus mainlog

Hi @mdetrano,
Interesting problem. I think Icinga DB was designed with a single database in mind, so your scenario is somewhat outside of the specification. My spontaneous guess is that you are right, and the way the replay log is applied will not be completely reflected in the database on the secondary master.

I guess the proper procedure for a recovery in that scenario is to mirror the database of the other node.
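
If you are on the default MySQL backend, something along these lines should work (the database name "icingadb" and the credentials handling are assumptions to adjust to your setup; with PostgreSQL you would use pg_dump/psql instead):

    # On the primary (healthy) node: dump the Icinga DB database
    mysqldump --single-transaction icingadb > icingadb-dump.sql

    # Copy the dump to the recovering secondary, then restore it there:
    mysql icingadb < icingadb-dump.sql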

OK, that is good to know. (Patching the missing data into the DB is currently our fix for this.)

One note about the replay log: if it were replayed in full (and in time order) for the recovering system, that could benefit any system attached to it that collects the check data, not just Icinga DB. It could help fill in performance data, for instance (another issue we had with a separate InfluxDB collector).

I think the log is replayed in general, so performance data writers might receive all data points, but someone with more knowledge of the code should verify this.