Icinga2 database (postgresql) size issues

t.wagner · March 2, 2020, 3:19pm

Hello everyone!

We are currently having a problem with the postgresql database of our icinga2 (version r2.9.1-1 running on a server with Debian GNU/Linux 7 (wheezy)). Everythings appears to work fine but apparently the database has been constantly gaining in size since we started using icinga.

Our setup uses crm to run icinga on one of two identical VMs. Configuration and database are shared between the two VMs using a drbd synched storage. Icingaweb2 also uses the same database as icinga2 (backend). The drbd drive is relatively small (12GB) and the icinga database has already filled up our testsystem once and is also nearing critical levels on our production system.

We noticed that we had not set up the cleaning mechanism in the ido-pgsql.conf so that was the first thing we tried. This did not affect database size at all, even when we set all options to only retain data for 3 hours. When we look at icingaweb2 we can see that the cleanup seems to be working since all notifications/history is at maximum 3 hours old.

We also couldn’t figure out which table actually stores that much data. The OIDs of icinga tables did not match the large files postgres saves on the disk (that at least appears to identify with OIDs), and as far as we can see there are hardly any objects in any of the tables (the icinga_objects table, for example, has only 8700 objects)

In addition to that, for some reason the database in production has suddenly grown during the weekend by about 1GB (maybe it has something to do with the switch from February to March?). This forced us to come up with a quick solution, meaning we would drop the database, recreate it and import the schema again (as proposed in this closed request https://github.com/Icinga/icinga2/issues/5464 ).

We tested this in our test environment. We started with a backup. To our surprise, the database taking up roughly 8GB on the server was only 50MB as plain text backup and 5MB as the pgadmin default backup format. Having a look at the content we could see everything we would suspect: notifications, custom variables , state changes etc.

We then dropped the icinga2 database, which freed up the expected 8GB. Instead of starting with an empty database we tried to import the backup again, which worked and does not take up more than the size of the backup file. The history (going back as long as cleanup is defined) is also intact.

This leaves us with a couple of questions. Hopefully we can find some answers here

Why was the database so large when the actual data only takes up a few dozen MB? Our guess is that something is being stored (maybe at the end of a month?) that is maybe hidden and doesn’t get exported in a backup but we can’t figure out what. Or maybe something isn’t cleaned up correctly, but we tried to optimize/“vacuum” the database first and that didn’t free up any space.
How can we limit the growth of the database? Are there any mechanisms beyond cleanup in ido-pgsql.conf that we can activate? Are there any mechanisms from icingaweb2 that we may activate? Is sharing the database between icinga2 and icingaweb2 a (or maybe even “the”) problem?
Is there another way to downsize the database when it gets so big? Are there any specific tables that might carry such large amounts of data that we can simply truncate? As far as I can tell from the IDO database model at https://icinga.com/docs/icinga1/latest/en/db_model.html there is nothing that would explain this extremely large amount of data.

We are very unsure how to proceed. Manually reimporting the database every few month is not a great option. Increasing the drive capacity just postpones the issue since the database keeps growing. We are considering moving the database (or possibly even the complete icinga system) to another VM, but that faces some administrative issues in our production system. And since the database would also keep growing in that VM we’d again have the same problem…

Is there any more config-data or information that might be helpful?

Any ideas, pointers or hints are highly appreciated, so thanks in advance!

Best regards,
Thomas

blakehartshorn · March 2, 2020, 3:48pm

I’m not much of a postgres expert so somebody might have better advice than me, but one thing I would suggest is to check the size of the pg_wal folder. This is something that snuck up on me. If it’s having a hard time writing data in, it just kind of sits there. This is less likely a problem since you’re using drbd instead of other replication features, but it’s worth looking at. There’s also a max wal size setting in postgresql.conf that will usually wrangle that in if you identify it as the culprit.

I personally would recommend allocating larger space for the database. My postgres database currently uses 9GB, but that’s data going back to fall of 2018, so I’d definitely go poking around in the folder and seeing if you can find where the garbage data is hiding. I can’t say enough good things about having historical data when trying to explain my on-call week to people.

t.wagner · March 2, 2020, 4:04pm

Thanks for the hint. However there doesn’t seem to be too much WAL data in the pg_xlog folder (we are still using postgresql 9.1).

The “garbage” data was definitely stored with the icinga2 database. That much was clear from the OIDs. The OID of the icinga2 database matched exactly to the folder in /postgresql/9.1/main/base/OID. That folder was 8GB large before we reimported the database. It consisted of a very large number of files. Some few of these were up to almost 1GB large, while others were at 0KB. These seem to be named after OIDs of table, but we could not match any icinga2 table OIDs with these large files.