Comment history and retention on a busy system

The question I have is around clean up of comment_history and how to fix it in this system.

The system details are
icingadb: 1.4.0
icinga2: 2.15.0

The system has around 20,000 hosts and 80,000 services and is processing around 1.35 million checks per hour.

Obviously this is a very busy system which generates a lot of history. Clean up is critical.

Icinga DB retention is 90 days

retention:
  history-days: 90
  sla-days: 90
  interval: 1m

Comments are created via the Icinga2 API, we are (now :slight_smile:) very careful around setting expiry times on comments, as a result there is around 24,000 comments in the database with none being more than 90 days old (the value we set comment expiry times to).

But the comment history in the database is showing around 280,000 comments the oldest being September 2024, well past the 90 days.

All the entries in the comment_history table have a entry_time and none of them have a remove_time. This includes comment history that is orphaned, ie it has no comment in the comment table, or not orphaned comment history.

Note only comment history that is less that 90 days old has linked comments

I’m not sure if the excessive retention of comment history is

  • By IcingaDB design
  • Because retention isn’t configured correctly
  • Because comment_history should have a remove_time and isn’t being set (I’m not sure how this would get set)
  • If it is safe to manually clean up orphaned comment_history directly in the database

We have created db clean up script for director because it’s retention clean up can’t keep up, so it would fit in with the processes used to manage this system to manually delete orphaned comment_history if it is safe to do so

Thanks for your detailed report.

The comment retention is relative to the remove_time column[1][2]. Thus, entries without a remove_time are not considered.

This seems to be a bug, meh.

The comment table is being populated from the Icinga 2 data received via Redis. Thus, if Icinga 2 removes the Comment object, Icinga DB does so as well. However, this does not close the history entry.

This seems to be another dimension of Downtime on a removed object are never closed. · Issue #910 · Icinga/icingadb · GitHub. Since I am currently trying to fix this, I would consider this as well.


  1. icingadb/pkg/icingadb/history/retention.go at v1.5.1 · Icinga/icingadb · GitHub ↩︎

  2. icingadb/pkg/icingadb/cleanup.go at v1.5.1 · Icinga/icingadb · GitHub ↩︎

Thanks @apenning

From a users perspective it isn’t illogical to expect a comment and comment history to be different, so the use case to have comment be removed from a object but maintained in the history logical.

I feel that if there is a bug it is more likely that retention shouldn’t be using remove_time.

retention:
  history-days: 90

This config reads as it should have the behaviour “delete [comment] history after 90 days from when it was created” not “delete [comment] history after 90 days but only if something else has set a remove_time

Thought with this there could be a argument to maintain the comment history if the comment hasn’t been deleted so perhaps the safe statement is “delete [comment] history after 90 days from when it was created if the comment is no longer exists”

I useful query I’ve been using to monitor history is


SELECT 'acknowledgement_history' AS type,
       FROM_UNIXTIME(MIN(set_time) DIV 1000) AS oldest,
       FROM_UNIXTIME(MAX(set_time) DIV 1000) AS newest,
       COUNT(*) AS total,
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MIN(set_time) DIV 1000), NOW()) AS oldest_age_days,
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MAX(set_time) DIV 1000), NOW()) AS newest_age_days
FROM acknowledgement_history


UNION ALL
SELECT 'comment_history',
       FROM_UNIXTIME(MIN(entry_time) DIV 1000), FROM_UNIXTIME(MAX(entry_time) DIV 1000), COUNT(*),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MIN(entry_time) DIV 1000), NOW()),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MAX(entry_time) DIV 1000), NOW())
FROM comment_history


UNION ALL
SELECT 'downtime_history',
       FROM_UNIXTIME(MIN(entry_time) DIV 1000), FROM_UNIXTIME(MAX(entry_time) DIV 1000), COUNT(*),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MIN(entry_time) DIV 1000), NOW()),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MAX(entry_time) DIV 1000), NOW())
FROM downtime_history


UNION ALL
SELECT 'flapping_history',
       FROM_UNIXTIME(MIN(start_time) DIV 1000), FROM_UNIXTIME(MAX(start_time) DIV 1000), COUNT(*),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MIN(start_time) DIV 1000), NOW()),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MAX(start_time) DIV 1000), NOW())
FROM flapping_history


UNION ALL
SELECT 'notification_history',
       FROM_UNIXTIME(MIN(send_time) DIV 1000), FROM_UNIXTIME(MAX(send_time) DIV 1000), COUNT(*),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MIN(send_time) DIV 1000), NOW()),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MAX(send_time) DIV 1000), NOW())
FROM notification_history


UNION ALL
SELECT 'state_history',
       FROM_UNIXTIME(MIN(event_time) DIV 1000), FROM_UNIXTIME(MAX(event_time) DIV 1000), COUNT(*),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MIN(event_time) DIV 1000), NOW()),
       TIMESTAMPDIFF(DAY, FROM_UNIXTIME(MAX(event_time) DIV 1000), NOW())
FROM state_history;

Thanks again for your insights. In the meantime, two PR have emerged addressing different points of this issue, https://github.com/Icinga/icingadb/pull/913 and https://github.com/Icinga/icingadb/pull/1066.


After taking a few looks into this, I have a question regarding something you wrote in the original post.

How did you end up with these 24k comment history entries? Have you deleted the Comment objects or are these comments still there? And if so, how did you have deleted them? Please note, that you need to use the remove-comment API endpoint and not just delete the object. This is somewhat of a footgun, also being mentioned during the review of one PR.

24,000 is the number of comment entries in the comment table, they are comments that are correctly still there.
280,000 is the number of comment history entries in the comment_history table.

Comments aren’t removed via api, they have a expiry set on them. The assumption here is a comment with a expiry removes itself.


Around 4/5 months ago somebody decided it would be a good idea to run rm -rf /var/lib/icinga2/api/{packages,zones,zones-stage}/* on both headends simultaneously :scream:. I can confirm this is a bad idea and can result in all comments and downtime being effectively deleted from Icinga (and a bunch of work by me to get critical data back), and has likely left behind a bit of a mess for data past the 4/5 month mark.

This was partly the reason for the question “If it is safe to manually clean up orphaned comment_history directly in the database”.

Note the problem of comment_history exist after this point (I specifically waited so I’d have clean data) but I have a historical mess as well :upside_down_face: