Icingadb not running after deployment

Hi,

I’m having some issues with my Icinga setup.
We have one master and three satellites. The master only receives and displays the check results (the only thing it checks is itself).
Almost every time a deployment occurs, icingadb seems to fail for 5 to 10 minutes. During that time, the IO delay on the host server increases quite a bit (the master and satellites are virtual machines under Proxmox).
If I look into the logs, I can see this:

Logs from mariadb

2024-11-14 9:13:27 426426 [Warning] Aborted connection 426426 to db: ‘icingadb’ user: ‘icingadb’ host: ‘localhost’ (Got an error reading communication packets)
2024-11-14 9:13:27 426440 [Warning] Aborted connection 426440 to db: ‘icingadb’ user: ‘icingadb’ host: ‘localhost’ (Got an error reading communication packets)
2024-11-14 9:13:27 426444 [Warning] Aborted connection 426444 to db: ‘icingadb’ user: ‘icingadb’ host: ‘localhost’ (Got an error reading communication packets)

Logs from icingadb

2024-11-14T09:12:57.771306+01:00 HOST icingadb[2497003]: heartbeat: Previous heartbeat not read from channel
2024-11-14T09:13:00.771534+01:00 HOST icingadb[2497003]: heartbeat: Previous heartbeat not read from channel
2024-11-14T09:13:03.771369+01:00 HOST icingadb[2497003]: heartbeat: Previous heartbeat not read from channel

  • Version used :
    2.14.3-1
  • Operating System and version :
    Debian 11
  • Enabled features (icinga2 feature list) :
    Enabled features: api checker icingadb influxdb2 mainlog notification
  • Icinga Web 2 version and modules (System - About)
    customdashboards - director - doc - grafana - icingadb - incubator - itop
  • Config validation (icinga2 daemon -C)

[2024-11-14 10:27:17 +0100] information/cli: Icinga application loader (version: r2.14.3-1)
[2024-11-14 10:27:17 +0100] information/cli: Loading configuration file(s).
[2024-11-14 10:27:19 +0100] information/ConfigItem: Committing config item(s).
[2024-11-14 10:27:19 +0100] information/ApiListener: My API identity: icinga-master-p-01.srv-gsi.brgm.recia.net
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 5 NotificationCommands.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 198 HostGroups.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 13708 Hosts.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 4 Downtimes.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 Influxdb2Writer.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 13964 Dependencies.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 34 Comments.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 IcingaDB.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 6 Zones.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 User.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 4 Endpoints.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 3 ApiUsers.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 256 CheckCommands.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 1 UserGroup.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 2 ServiceGroups.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 2 TimePeriods.
[2024-11-14 10:27:27 +0100] information/ConfigItem: Instantiated 37002 Services.
[2024-11-14 10:27:27 +0100] information/ScriptGlobal: Dumping variables to file ‘/var/cache/icinga2/icinga2.vars’
[2024-11-14 10:27:28 +0100] information/cli: Finished validating the configuration file(s).

The master has 16 CPUs and 16 GiB of RAM and doesn’t seem to struggle too much, but the mariadb service does take quite a bit of resources, with many subprocesses; I wonder if that’s normal.
Here’s what I configured on the MariaDB server:

[mysqld]
innodb_buffer_pool_size = 7000M
max_connections = 1000
innodb_log_buffer_size = 64M
key_buffer_size = 256M

Does anybody have an idea about this?

How long does a reload usually take for a deployment? During that time, Icinga 2 is effectively blocked and does not communicate with Icinga DB through Redis. That’s why Icinga DB complains about heartbeats.
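
One rough way to put a number on that window is to measure how long the heartbeat warnings last in the Icinga DB log. Below is a minimal sketch, assuming the log lines look like the ones posted above (ISO 8601 timestamp as the first field). The function name heartbeat_stall_window is made up for illustration, and the result only approximates the reload time, since it covers just the span between the first and last warning.

```python
from datetime import datetime

# Warning text as it appears in the icingadb log excerpt in this thread.
MARKER = "heartbeat: Previous heartbeat not read from channel"


def heartbeat_stall_window(lines):
    """Return (first, last, seconds) for the heartbeat warnings, or None.

    Hypothetical helper: it only looks at lines containing MARKER and
    assumes the first whitespace-separated field is an ISO 8601 timestamp.
    """
    stamps = [
        datetime.fromisoformat(line.split()[0])
        for line in lines
        if MARKER in line
    ]
    if not stamps:
        return None
    first, last = min(stamps), max(stamps)
    return first, last, (last - first).total_seconds()


# The three lines quoted in the original post.
sample_lines = [
    "2024-11-14T09:12:57.771306+01:00 HOST icingadb[2497003]: heartbeat: Previous heartbeat not read from channel",
    "2024-11-14T09:13:00.771534+01:00 HOST icingadb[2497003]: heartbeat: Previous heartbeat not read from channel",
    "2024-11-14T09:13:03.771369+01:00 HOST icingadb[2497003]: heartbeat: Previous heartbeat not read from channel",
]

window = heartbeat_stall_window(sample_lines)
if window is not None:
    first, last, seconds = window
    print(f"heartbeat warnings from {first} to {last} ({seconds:.1f} s)")
```

Feeding it the full log of one deployment instead of the three sample lines would show whether the warnings span seconds or minutes.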

However, this should not directly affect SQL database queries. Is there more information in the Icinga DB logs? Would it be possible to temporarily increase its log level for the next deployment?

The five to ten minutes should come from the internal retry logic in Icinga DB, which retries some operations for up to five minutes before giving up. Does Icinga DB crash, or does it “self-heal” after this time?
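
To illustrate what "retry for up to five minutes, then give up" means in practice, here is a generic sketch of a retry loop with a deadline. It is not Icinga DB's actual code (Icinga DB is written in Go). The function name, the backoff values, and the injectable sleep/clock parameters are all assumptions made for this example; only the five-minute limit comes from the description above.

```python
import time


def retry_with_deadline(op, timeout=300.0, base_delay=0.1, max_delay=10.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Call op() until it succeeds or `timeout` seconds have passed.

    Illustrative only: the delays are assumed values, not Icinga DB's.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return op()
        except Exception as err:
            attempt += 1
            # Give up once the overall deadline (five minutes by default) is reached.
            if clock() - start >= timeout:
                raise TimeoutError(
                    f"gave up after {attempt} attempts"
                ) from err
            # Exponential backoff, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


# Example: an operation that fails twice (e.g. the database is busy), then works.
calls = {"n": 0}


def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("database not reachable")
    return "ok"


print(retry_with_deadline(flaky_query))
```

With a loop like this, a short database hiccup is invisible from the outside, while an outage longer than the deadline ends in an error. That would match a service that either recovers on its own or stops after roughly five minutes.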