Icinga DB keeps growing to an absurd size

Hello Community,

We are facing a problem with one of our Icinga DB setups where the database keeps growing without end.
The Icinga 2 instance is not that big:

Hosts: ~1600
Services: ~17000

But the Icinga DB database is growing to absurd sizes:
[root@icinga ~]# du -sh /var/lib/mysql/icingadb/
167G /var/lib/mysql/icingadb/

In /etc/icingadb/config.yml the retention option is set to 30 days, but the database keeps growing.
From time to time we clean up the DB manually, but that is not something we want to do every 1-2 months.
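
(To give an idea, that manual cleanup boils down to bulk DELETEs on the large history tables, roughly along these lines; the exact statements and cutoffs vary from run to run, and event_time is assumed to be stored as milliseconds since the epoch as in the Icinga DB schema:)

-- Sketch only: repeated in batches until no more rows are affected.
-- event_time is assumed to be milliseconds since the epoch.
DELETE FROM state_history
WHERE event_time < UNIX_TIMESTAMP(NOW() - INTERVAL 30 DAY) * 1000
LIMIT 10000;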

Has anyone else experienced such an issue and maybe found a solution for it?


About the environment:

This is an HPC cluster that is not yet 100% in production.
So we have a lot of systems that are offline, but they have a downtime set in Icinga.

We have another cluster that we monitor with Icinga 2, where we are not facing these issues.

  • Icinga DB Web version: 1.0.2
  • Icinga Web 2 version: 2.11.4
  • Web browser: Chrome 113
  • Icinga 2 version: r2.13.7-1
  • Icinga DB version : v1.1.0
  • PHP version used : 7.2.24
  • Server operating system and version: Rocky Linux release 8.8 (Green Obsidian)

Can you post your retention config please?

Good point.
Here is the config.yml:

# cat config.yml |grep -v "# "

database:
  type: mysql
  host: localhost
  database: icingadb
  user: icingadb
  password: *****************

redis:
  host: localhost

logging:
  options:
  retention:
  history-days: 30
  sla-days: 30
  options:

You don’t seem to have any retention configured (only logging for the retention component). Have a look at the configuration docs; there is also a config.example.yml.

My retention settings look like this:

retention:
  history-days: 365
  options:
    acknowledgement: 365
    comment: 365
    downtime: 90
    flapping: 365
    notification: 365
    state: 365

Thank you Erik,

I updated the config.yml with a retention of 30 days.
I will give Icinga DB some time and check back whether some cleanup has happened.
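
For reference, the retention block I added looks roughly like this (only history-days for now):

retention:
  history-days: 30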

The current status of the DB files is:

# ll /var/lib/mysql/icingadb/ -Srh | tail
-rw-rw---- 1 mysql mysql  18M Jun  7 19:53 sla_history_downtime.ibd
-rw-rw---- 1 mysql mysql  18M Jun 14 10:04 downtime_history.ibd
-rw-rw---- 1 mysql mysql  23M Apr 17 10:32 service.ibd
-rw-rw---- 1 mysql mysql  26M Jun  7 15:18 service_customvar.ibd
-rw-rw---- 1 mysql mysql  36M Jun  7 16:18 host_customvar.ibd
-rw-rw---- 1 mysql mysql  36M Jun  7 21:15 downtime.ibd
-rw-rw---- 1 mysql mysql  40M Jun 14 10:06 service_state.ibd
-rw-rw---- 1 mysql mysql  13G Jun 14 10:06 sla_history_state.ibd
-rw-rw---- 1 mysql mysql  21G Jun 14 10:06 state_history.ibd
-rw-rw---- 1 mysql mysql  22G Jun 14 10:06 history.ibd

Hello *,

unfortunately, the changes in config.yml did not change anything:

[screenshot of the database size]

About a month has passed since I posted the last status, and the DB is still growing really fast.
Any idea how to find out what is going wrong here?

Hi @pbirokas, just curious, did you restart the icingadb service after changing the config file? If so, you can set the log level for the retention component to debug as well and see what icingadb is doing.

logging:
  options:
    retention: debug

Thank you Yonas for the reply.

The database was restarted after editing the config file.
I have now also added the suggested debug level for the retention logs.

After another restart of the icingadb service I see these logs:

Jul 18 11:21:58 icinga systemd[1]: Starting Icinga DB...
Jul 18 11:21:58 icinga icingadb[862892]: Starting Icinga DB
Jul 18 11:21:58 icinga systemd[1]: Started Icinga DB.
Jul 18 11:21:58 icinga icingadb[862892]: Connecting to database at 'localhost:0'
Jul 18 11:21:58 icinga icingadb[862892]: Connecting to Redis at 'localhost:6380'
Jul 18 11:21:58 icinga icingadb[862892]: Starting history sync
Jul 18 11:21:59 icinga icingadb[862892]: heartbeat: Received Icinga heartbeat
Jul 18 11:21:59 icinga icingadb[862892]: Taking over
Jul 18 11:21:59 icinga icingadb[862892]: Starting config sync
Jul 18 11:21:59 icinga icingadb[862892]: Starting initial state sync
Jul 18 11:21:59 icinga icingadb[862892]: Starting overdue sync
Jul 18 11:21:59 icinga icingadb[862892]: config-sync: Updating 1578 items of type host state
Jul 18 11:21:59 icinga icingadb[862892]: config-sync: Updating 11507 items of type service state
Jul 18 11:21:59 icinga icingadb[862892]: Starting config runtime updates sync
Jul 18 11:21:59 icinga icingadb[862892]: config-sync: Finished config sync in 668.742509ms
Jul 18 11:22:00 icinga icingadb[862892]: Starting history retention
Jul 18 11:22:00 icinga icingadb[862892]: Starting state runtime updates sync
Jul 18 11:22:00 icinga icingadb[862892]: retention: Starting history retention for category acknowledgement
Jul 18 11:22:00 icinga icingadb[862892]: retention: Starting history retention for category comment
Jul 18 11:22:00 icinga icingadb[862892]: retention: Starting history retention for category downtime
Jul 18 11:22:00 icinga icingadb[862892]: retention: Starting history retention for category flapping
Jul 18 11:22:00 icinga icingadb[862892]: retention: Starting history retention for category notification
Jul 18 11:22:00 icinga icingadb[862892]: retention: Starting history retention for category state
Jul 18 11:22:00 icinga icingadb[862892]: retention: Skipping history retention for category sla_downtime
Jul 18 11:22:00 icinga icingadb[862892]: retention: Skipping history retention for category sla_state
Jul 18 11:22:00 icinga icingadb[862892]: config-sync: Finished initial state sync in 694.908834ms
Jul 18 11:22:00 icinga icingadb[862892]: retention: Cleaning up historical data for category state from table state_history older than 2023-06-18 11:22:00.007016177 +0200 CEST
Jul 18 11:22:00 icinga icingadb[862892]: retention: Cleaning up historical data for category downtime from table downtime_history older than 2023-06-18 11:22:00.007102068 +0200 CEST
Jul 18 11:22:00 icinga icingadb[862892]: retention: Cleaning up historical data for category comment from table comment_history older than 2023-06-18 11:22:00.007041397 +0200 CEST
Jul 18 11:22:00 icinga icingadb[862892]: retention: Cleaning up historical data for category flapping from table flapping_history older than 2023-06-18 11:22:00.007075008 +0200 CEST
Jul 18 11:22:00 icinga icingadb[862892]: retention: Cleaning up historical data for category notification from table notification_history older than 2023-06-18 11:22:00.00718281 +020>
Jul 18 11:22:00 icinga icingadb[862892]: retention: Cleaning up historical data for category acknowledgement from table acknowledgement_history older than 2023-06-18 11:22:00.0070616>
Jul 18 11:22:16 icinga icingadb[862892]: retention: Removed 19470 old state history items
Jul 18 11:22:18 icinga icingadb[862892]: history-sync: Synced 172 state history items
Jul 18 11:22:20 icinga icingadb[862892]: runtime-updates: Upserted 176 ServiceState items

And the DB has kept growing since last week. Systems, services, etc. are still the same; no changes to any Icinga 2 objects.

[root@icinga icingadb]# ll -rSh /var/lib/mysql/icingadb/
-rw-rw---- 1 mysql mysql  29G Jul 18 11:28 sla_history_state.ibd
-rw-rw---- 1 mysql mysql  34G Jul 18 11:28 state_history.ibd
-rw-rw---- 1 mysql mysql  52G Jul 18 11:28 history.ibd

Everything looks fine for all history types except sla_downtime and sla_state: you have not set a retention for the SLA history, so by default it is kept forever.

retention:
  # Number of days to retain historical data for SLA reporting. By default, it is retained forever.
  # sla-days:
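
So to actually clean up the SLA history tables as well, sla-days has to be set explicitly, for example:

retention:
  history-days: 30
  sla-days: 30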

Thank you again Yonas,

I added those two options to the config file, but now icingadb stops from time to time:

Jul 18 15:42:49 icinga icingadb[132922]: Error 1206: The total number of locks exceeds the lock table size
                                         can't perform "DELETE FROM sla_history_state WHERE environment_id = :environment_id AND event_time < :time\nORDER BY event_time LIMIT 5000"
                                         github.com/icinga/icingadb/internal.CantPerformQuery
                                                 github.com/icinga/icingadb/internal/internal.go:30
                                         github.com/icinga/icingadb/pkg/icingadb.(*DB).CleanupOlderThan
                                                 github.com/icinga/icingadb/pkg/icingadb/cleanup.go:53
                                         github.com/icinga/icingadb/pkg/icingadb/history.(*Retention).Start.func1
                                                 github.com/icinga/icingadb/pkg/icingadb/history/retention.go:189
                                         github.com/icinga/icingadb/pkg/periodic.Start.func1
                                                 github.com/icinga/icingadb/pkg/periodic/periodic.go:78
                                         runtime.goexit
                                                 runtime/asm_amd64.s:1594
Jul 18 15:42:49 icinga systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Jul 18 15:42:49 icinga systemd[1]: icingadb.service: Failed with result 'exit-code'.

Hi, that’s an Icinga DB bug. Please open an issue here; it may take a while until this gets fixed and released. In the meantime, try whether raising the InnoDB buffer pool size helps (the InnoDB lock table lives in the buffer pool, so a larger pool allows more row locks).
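
For example, something along these lines in the MySQL/MariaDB server configuration (the value is just an illustration, size it to the host's RAM), followed by a restart of the database server:

[mysqld]
# Example value only; pick a size that fits the available memory.
innodb_buffer_pool_size = 2G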

Good, those changes did help keep the service alive.
Unfortunately, the cleanup of the history DB does not look like it is working:

[root@icinga icingadb]# ll -rS | tail -3
-rw-rw---- 1 mysql mysql 31021072384 Jul 18 18:19 sla_history_state.ibd
-rw-rw---- 1 mysql mysql 36427530240 Jul 18 18:19 state_history.ibd
-rw-rw---- 1 mysql mysql 55494836224 Jul 18 18:19 history.ibd
[root@icinga icingadb]# ll -rS | tail -3
-rw-rw---- 1 mysql mysql 31021072384 Jul 19 08:05 sla_history_state.ibd
-rw-rw---- 1 mysql mysql 36427530240 Jul 19 08:05 state_history.ibd
-rw-rw---- 1 mysql mysql 55566139392 Jul 19 08:05 history.ibd

At least sla_history_state.ibd and state_history.ibd did not grow overnight.
But history.ibd, at over 55 GB, is still growing.
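
To check whether rows are actually being removed (the .ibd files themselves do not shrink when InnoDB deletes rows, the freed space is only reused internally), I will watch the oldest entry and the row count instead of the file size, roughly like this (assuming event_time is stored as milliseconds since the epoch):

-- Oldest remaining entry and total row count; file size alone can be misleading.
SELECT FROM_UNIXTIME(MIN(event_time) / 1000) AS oldest_entry,
       COUNT(*) AS total_rows
FROM state_history;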

This happened again. I increased the buffer size, now to 4 GB, but it still fails:

[mysqld]
innodb_buffer_pool_chunk_size=1073741824
innodb_buffer_pool_instances=4
innodb_buffer_pool_size=4294967296

For the moment, I need to deactivate the SLA history retention to keep the service stable.
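
So for now the retention block is back to roughly this, with sla-days commented out again until the bug is fixed:

retention:
  history-days: 30
  # sla-days: 30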