Database tables icinga_servicestatus and icinga_hoststatus grow uncontrollably

toerkl · February 6, 2024, 12:20pm

Hi,

Since yesterday we have had a problem where the icinga2 master node updates the tables icinga_servicestatus and icinga_hoststatus continuously at a very high frequency, several times per second.

The SQL statements all look like the following example:

UPDATE icinga_servicestatus SET acknowledgement_type = '0',  active_checks_enabled = '1',  check_command = '_nrpe',  check_source = 'satellite server hostname',  check_timeperiod_object_id = NULL,  check_type = '0',  current_check_attempt = '1',  current_notification_number = '0',  current_state = '0',  endpoint_object_id = 225,  event_handler_enabled = '1',  execution_time = '0.075727',  flap_detection_enabled = '1',  has_been_checked = '1',  instance_id = 1,  is_flapping = '0',  is_reachable = '1',  last_check = TO_TIMESTAMP(1707222804) AT TIME ZONE 'UTC',  last_hard_state = '0',  last_hard_state_change = TO_TIMESTAMP(1705968617) AT TIME ZONE 'UTC',  last_notification = TO_TIMESTAMP(1688098366) AT TIME ZONE 'UTC',  last_state_change = TO_TIMESTAMP(1705968617) AT TIME ZONE 'UTC',  last_time_critical = TO_TIMESTAMP(1705968560) AT TIME ZONE 'UTC',  last_time_ok = TO_TIMESTAMP(1707222804) AT TIME ZONE 'UTC',  last_time_unknown = TO_TIMESTAMP(1619121479) AT TIME ZONE 'UTC',  last_time_warning = NULL,  latency = '0.000309',  long_output = '',  max_check_attempts = '11',  next_check = TO_TIMESTAMP(1707222862) AT TIME ZONE 'UTC',  next_notification = TO_TIMESTAMP(1707221879) AT TIME ZONE 'UTC',  normal_check_interval = '1',  notifications_enabled = '1',  original_attributes = 'null',  output = 'DISK OK - free space: / 1812 MB (29.54% inode=97%);',  passive_checks_enabled = '1',  percent_state_change = '0',  perfdata = '/=4321MB;4907;5520;0;6134',  problem_has_been_acknowledged = '0',  process_performance_data = '1',  retry_check_interval = '1',  scheduled_downtime_depth = '0',  service_object_id = 13310,  should_be_scheduled = '1',  state_type = '1',  status_update_time = TO_TIMESTAMP(1707222804) AT TIME ZONE 'UTC' WHERE service_object_id = 13310

Version used (icinga2 --version)
r2.13.6-1

Operating System and version

  Platform: Amazon Linux
  Platform version: 2
  Kernel: Linux
  Kernel version: 5.15.117-73.143.amzn2.x86_64
  Architecture: x86_64

Enabled features (icinga2 feature list)

Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 livestatus opentsdb perfdata statusdata
Enabled features: api checker ido-pgsql influxdb mainlog notification syslog

Icinga Web version

Icinga Web 2 Version	2.10.5
Git commit	e9f0b266bd62ca01b76177db2fe1c292b3ce859b
PHP Version	8.0.28
Git commit date	2023-01-26

We run a cluster with around 10 satellites and 2 masters

The CPU usage is very high on the database due to this and the system is unstable.

Here is a graph of the table size for icinga_servicestatus over the last 3 days.

We have manually gone in and cleaned up the table after this problem started. Before it happened the table was not growing, so something seems to have triggered this behaviour recently. We have not done any updates for some time that could have caused this.

Does anybody have an idea what the problem could be and how we can resolve it?

godd-mash · February 7, 2024, 10:52am

It looks like I have the same problem in my cluster.

MVahan · February 7, 2024, 2:22pm

It’s the same thing here. It started suddenly, without any changes in the cluster.

I have searched a lot but unfortunately have not found a solution. The problem on the database side is the dead records: the constant updates leave dead rows behind, and these are filling up the database.
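
To check whether dead rows really account for the growth, a query like the one below against PostgreSQL's standard statistics view may help. It is only a sketch and assumes the IDO tables live in the database you are connected to; the counters are estimates maintained by the statistics collector, not exact values.

```sql
-- Sketch: live vs. dead row estimates, update counts and last vacuum times
-- for the two IDO status tables mentioned in this thread.
SELECT relname,
       n_live_tup,                                     -- estimated live rows
       n_dead_tup,                                     -- estimated dead rows not yet vacuumed
       n_tup_upd,                                      -- rows updated since the stats were last reset
       last_vacuum,                                    -- last manual VACUUM
       last_autovacuum,                                -- last autovacuum run
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
WHERE relname IN ('icinga_servicestatus', 'icinga_hoststatus')
ORDER BY n_dead_tup DESC;
```

If n_dead_tup is many times larger than n_live_tup and last_autovacuum is old or empty, that would suggest vacuuming is not keeping up with the update rate. Running the query twice a minute apart and comparing n_tup_upd gives a rough measure of how often the rows are being rewritten.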

toerkl · February 8, 2024, 7:25am

In our case it seems to have been related to patching of the Postgres database. Once we switched over to the secondary db node, and then back to the primary, the issue was resolved. It’s worth mentioning that we also restarted php-fpm.