Monitoring backend 'mysql_backend' is not running

I am getting the error: “Monitoring backend ‘mysql_backend’ is not running.”

The CPU was running at 100% utilization. I doubled the VM's CPUs from 4 to 8, but utilization is still at 100%.
The SQL server is on a local VM. I checked and the icinga2 process is running with 300-500% CPU utilization. There are no errors in the logs.
Any pointers to resolve this?

Hello :slight_smile:

Not with that little information.

Please share more insights into your environment.

  • Versions used
  • Processes/command lines causing the high load
  • What happened just before this behavior began?

I guess the connection worked before, so the config itself (backend config, ido-mysql config) is ok.
A high load can cause the connection to fail or take long to be established.
If the database is unresponsive, the web interface will show that message, because it determines whether the connection is working by reading a specific value from the database.
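Roughly speaking, that check boils down to a heartbeat query against the IDO database, along these lines (illustrative only; the exact query and threshold differ between Icinga Web 2 versions):

-- Illustrative only: if the heartbeat in icinga_programstatus is stale,
-- Icinga Web 2 reports the backend as "not running".
SELECT is_currently_running, status_update_time
  FROM icinga_programstatus
 WHERE status_update_time > NOW() - INTERVAL 1 MINUTE;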

Icinga version: r2.13.4-1
Icinga Web 2 version: 2.9.5
The issue also occurs in an environment with Icinga Web 2 version 2.11.1.
We get this error intermittently in all non-Prod and Prod environments.
In one environment memory utilization was almost 100%; after increasing CPU/memory, memory utilization is still near 100%.

  PID  PPID %MEM %CPU CMD
19640 12208  0.4  166 /usr/lib64/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
 1976  1748  0.8 92.7 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mariadb/mariadb.log --pid-file=/var/run/mariadb/mariadb.pid --socket=/var/lib/mysql/mysql.sock

Prod uses Cloud SQL, while non-Prod uses local VMs.
Cloud SQL Query Insights shows that the query below consumes a lot of resources in Prod:

DELETE FROM icinga_notifications WHERE instance_id = ? AND start_time < FROM_UNIXTIME (?)

But I doubt this is the cause, as it has been running for a long time without any issues…

The behaviour is intermittent, so it is difficult to capture, but I saw the errors below in one of the environments:

[2022-11-14 15:16:56 +0000] warning/PluginUtility: Error evaluating set_if value 'False' used in argument '-S': Can't convert 'False' to a floating point number.
Context:
        (0) Executing check for object 'alto-na-endpoints!check-alto-prod-tenant-kibana'



[2022-11-14 15:16:59 +0000] information/IdoMysqlConnection: Pending queries: 4496 (Input: 35/s; Output: 771/s)
[2022-11-14 15:17:09 +0000] information/IdoMysqlConnection: Pending queries: 4854 (Input: 36/s; Output: 771/s)
[2022-11-14 15:17:17 +0000] information/ConfigObject: Dumping program state to file '/var/lib/icinga2/icinga2.state'
[2022-11-14 15:17:19 +0000] information/IdoMysqlConnection: Pending queries: 5272 (Input: 41/s; Output: 771/s)
[2022-11-14 15:17:29 +0000] information/IdoMysqlConnection: Pending queries: 5823 (Input: 53/s; Output: 771/s)
[2022-11-14 15:17:39 +0000] information/IdoMysqlConnection: Pending queries: 6322 (Input: 49/s; Output: 771/s)
[2022-11-14 15:17:49 +0000] information/IdoMysqlConnection: Pending queries: 6830 (Input: 51/s; Output: 771/s)
[2022-11-14 15:17:59 +0000] information/IdoMysqlConnection: Pending queries: 7424 (Input: 59/s; Output: 771/s)
[2022-11-14 15:18:09 +0000] warning/PluginUtility: Error evaluating set_if value 'resource.label.cluster_name' used in argument '--group-by': Can't convert 'resource.label.cluster_name' to a floating point number.
Context:
        (0) Executing check for object 'mlisa-alto-prod-0-dataproc!hdfs-storage'



[2022-11-14 15:18:09 +0000] information/IdoMysqlConnection: Pending queries: 8082 (Input: 66/s; Output: 771/s)
[2022-11-14 15:18:19 +0000] information/IdoMysqlConnection: Pending queries: 4392 (Input: 93/s; Output: 461/s)
[2022-11-14 15:18:29 +0000] information/IdoMysqlConnection: Pending queries: 4952 (Input: 54/s; Output: 461/s)
[2022-11-14 15:18:34 +0000] warning/PluginUtility: Error evaluating set_if value 'resource.label.cluster_name' used in argument '--group-by': Can't convert 'resource.label.cluster_name' to a floating point number.
Context:
        (0) Executing check for object 'mlisa-sa-prod-0-dataproc!hdfs-storage'



[2022-11-14 15:18:43 +0000] warning/PluginUtility: Error evaluating set_if value 'resource.label.cluster_name' used in argument '--group-by': Can't convert 'resource.label.cluster_name' to a floating point number.
Context:
        (0) Executing check for object 'mlisa-alto-apac-prod-0-dataproc!hdfs-storage'



[2022-11-14 15:19:03 +0000] warning/PluginUtility: Error evaluating set_if value 'resource.label.cluster_name' used in argument '--group-by': Can't convert 'resource.label.cluster_name' to a floating point number.
Context:
        (0) Executing check for object 'mlisa-sa-apac-prod-0-dataproc!hdfs-storage'



[2022-11-14 15:19:19 +0000] information/IdoMysqlConnection: Pending queries: 2576 (Input: 43/s; Output: 460/s)
[2022-11-14 15:19:29 +0000] information/IdoMysqlConnection: Pending queries: 3109 (Input: 52/s; Output: 460/s)

In another environment I am seeing the errors below:

[2022-11-21 16:48:51 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-login-service!login-service-login-784cdf7f55-dg6rj' (PID: 14014, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-login-service&service=login-service-login-784cdf7f55-dg6rj|Failed to create tyk session for tenant 5e4c7f970f244bdab6c2ef69d04384c2> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-21 16:48:51 +0000] information/Notification: Completed sending 'Problem' notification 'alto-na-logstash-could-not-index-event-to-elasticsearch-from-dlq!logstash-could-not-index-event-to-elasticsearch-dynamic-activity!slack-notifications-notification-services' for checkable 'alto-na-logstash-could-not-index-event-to-elasticsearch-from-dlq!logstash-could-not-index-event-to-elasticsearch-dynamic-activity' and user 'icingaadmin' using command 'slack-notifications-command'.
[2022-11-21 16:48:51 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-broker-not-available!opd-collector-kafka-opdcollector-76d978f947-g55ql' (PID: 13989, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-broker-not-available&service=opd-collector-kafka-opdcollector-76d978f947-g55ql|Pod opdcollector-76d978f947-g55ql cannot connect to Kafka broker. Please restart it> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-21 16:48:51 +0000] information/Notification: Completed sending 'Problem' notification 'alto-na-broker-not-available!opd-collector-kafka-opdcollector-548c6b7df-7vpvd!slack-notifications-notification-services' for checkable 'alto-na-broker-not-available!opd-collector-kafka-opdcollector-548c6b7df-7vpvd' and user 'icingaadmin' using command 'slack-notifications-command'.
[2022-11-21 16:48:51 +0000] information/Notification: Completed sending 'Problem' notification 'alto-na-login-service!login-service-login-b865b595c-q9l4z!slack-notifications-notification-services' for checkable 'alto-na-login-service!login-service-login-b865b595c-q9l4z' and user 'icingaadmin' using command 'slack-notifications-command'.
[2022-11-21 16:48:51 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-login-service!login-service-login-7486bc578b-glgwc' (PID: 13976, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-login-service&service=login-service-login-7486bc578b-glgwc|Failed to create tyk session for tenant 162bb84c015c41a88a6996732da5e747> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-21 16:48:51 +0000] information/HttpServerConnection: Request: GET /v1/objects/services (from [::ffff:10.103.32.148]:58378), user: logstash, agent: Manticore 0.9.1, status: OK).
[2022-11-21 16:48:51 +0000] information/HttpServerConnection: Request: GET /v1/objects/services (from [::ffff:10.103.32.148]:58404), user: logstash, agent: Manticore 0.9.1, status: OK).
[2022-11-21 16:48:51 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-login-service!login-service-login-d86f6cdcb-kvd9f' (PID: 14085, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-login-service&service=login-service-login-d86f6cdcb-kvd9f|Failed to create tyk session for tenant 7043f111bb34435c9882b8046ca21ad0> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-21 16:48:51 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-broker-not-available!opd-collector-kafka-opdcollector-77bb9664d9-62shh' (PID: 14006, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-broker-not-available&service=opd-collector-kafka-opdcollector-77bb9664d9-62shh|Pod opdcollector-77bb9664d9-62shh cannot connect to Kafka broker. Please restart it> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-21 16:48:51 +0000] information/HttpServerConnection: Request: GET /v1/objects/services (from [::ffff:10.103.32.148]:58398), user: logstash, agent: Manticore 0.9.1, status: OK).
[2022-11-21 16:48:51 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-logstash-could-not-index-event-to-elasticsearch-from-dlq!logstash-could-not-index-event-to-elasticsearch-parse-failure-flux-helm-operator' (PID: 15320, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-logstash-could-not-index-event-to-elasticsearch-from-dlq&service=logstash-could-not-index-event-to-elasticsearch-parse-failure-flux-helm-operator|[flux-helm-operator]: failed to parse field [version] of type [long] in document with id '1WPK04MBpjs3AWFbgxhC'. Preview of field's value: 'v3'> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 1, output: sh: -c: line 0: unexpected EOF while looking for matching `"'
sh: -c: line 1: syntax error: unexpected end of file

As the errors are not consistent, I could not relate them.

Hm, looks like something is “blocking” the database, hence the pending queries for the IdoMysqlConnection.
I remembered a GitHub issue about the IDO cleanup:
https://github.com/Icinga/icinga2/issues/7753
Maybe you'll find something helpful there.

I also had a look at our /etc/icinga2/features-enabled/ido-mysql.conf, and we put a note in the cleanup options there:

The parameter “statehistory_age” should not be activated, because it causes an IDO error in icinga2.log: “critical/IdoMysqlConnection: Error Lock wait timeout exceeded”

Other than that, we have the following options set:

cleanup = {
  downtimehistory_age = 180d
  commenthistory_age = 180d
  contactnotifications_age = 180d
  contactnotificationmethods_age = 180d
  hostchecks_age = 180d
  logentries_age = 180d
  notifications_age = 180d
  processevents_age = 180d
  #statehistory_age = 180d
  servicechecks_age = 180d
  systemcommands_age = 180d
  flappinghistory_age = 31d
}

Hope this helps in tracking this further :slight_smile:

As for your log messages regarding the notifications:
The reason they are there is stated at the end of the individual lines.

  • unexpected EOF
  • HTTP 429: Too Many Requests

My config:

  enable_ha = true
  cleanup = {
    acknowledgements_age = 60d
    commenthistory_age = 60d
    contactnotifications_age = 60d
    contactnotificationmethods_age = 60d
    downtimehistory_age = 60d
    eventhandlers_age = 60d
    externalcommands_age = 60d
    flappinghistory_age = 60d
    hostchecks_age = 60d
    logentries_age = 60d
    notifications_age    = 60d
    processevents_age = 60d
    statehistory_age = 60d
    servicechecks_age = 60d
    systemcommands_age = 60d
  }
}

Let's focus on the first issue, related to the QA env. I tracked it closely and figured out that in one env, whenever the message below shows up, the disconnection message pops up:

[2022-11-24 11:32:44 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-login-service!login-service-login-65d8ddf6f7-4fr8q' (PID: 20852, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-login-service&service=login-service-login-65d8ddf6f7-4fr8q|Failed to create tyk session for tenant 5aa124b7770847fa8f2d92f297225b30> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-24 11:32:44 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-login-service!login-service-login-5f876c4798-zlf96' (PID: 21002, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-login-service&service=login-service-login-5f876c4798-zlf96|Failed to create tyk session for tenant 524d893fe01e47d798c6bbd17b5bec42> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests
[2022-11-24 11:32:44 +0000] warning/PluginNotificationTask: Notification command for object 'alto-na-login-service!login-service-login-784cdf7f55-z2n7z' (PID: 21062, arguments: 'sh' '-c' 'curl  --fail --connect-timeout 30 --max-time 60 --silent --show-error -X POST -H 'Content-type: application/json' --data '{"channel":"#monitoring_alerts","title":":red_circle: PROBLEM: Service <http://10.103.32.244/monitoring/service/show?host=alto-na-login-service&service=login-service-login-784cdf7f55-z2n7z|Failed to create tyk session for tenant cf58cd720bb14e91b24c8ab4a0f576ac> transitioned from state UNKNOWN to state CRITICAL"}' 'https://hooks.slack.com/workflows/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'') terminated with exit code 22, output: curl: (22) The requested URL returned error: 429 Too Many Requests

The link you shared related to pending queries mentions adding an index; it seems that is already included in the latest Icinga versions. I'm not sure, I need to check.
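For reference, an index matching that DELETE's WHERE clause would look roughly like this (hypothetical index name, purely to illustrate what the linked issue is about; newer IDO schema upgrades are supposed to ship something equivalent, so the schema version needs to be checked first):

-- Hypothetical name, for illustration only; check the IDO schema version
-- before adding anything by hand.
CREATE INDEX idx_notifications_cleanup
    ON icinga_notifications (instance_id, start_time);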

Meanwhile, in a separate scale env a different issue is occurring once or twice a week:

Please stick to the original problem you opened the thread for.
Did disabling the cleanup option have any effect?
You also didn't answer my question about what changes were made just before the issue started to appear.

Side notes to your other problems:

notifications

Not a fault of Icinga.

  • unexpected EOF:
    – check the script you are using and the parameters you pass to it; an unescaped quote in the alert text can break the sh -c command line
  • 429 Too Many Requests:
    – a problem with the API you are querying; here the Slack webhook is rate-limiting you (see the sketch below)
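A minimal sketch of how such a curl-based notification command could be hardened, assuming a wrapper similar to the one visible in your logs (variable names are placeholders, not your actual script):

# Sketch only, not the actual notification script.
# 1) Pass the JSON on stdin instead of interpolating it into the sh -c command line,
#    so apostrophes in the alert text can no longer break the shell quoting.
#    (The title itself still has to be JSON-escaped.)
# 2) Let curl retry transient HTTP errors such as 429 with a short delay.
printf '{"channel":"#monitoring_alerts","title":"%s"}' "$NOTIFICATION_TITLE" | \
  curl --fail --silent --show-error \
       --connect-timeout 30 --max-time 60 \
       --retry 3 --retry-delay 10 \
       -X POST -H 'Content-type: application/json' \
       --data @- "$SLACK_WEBHOOK_URL"   # placeholder for the webhook URL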

icinga stopping

Check the logs.

If you require further assistance, please open new topics for those problems.

Disabling it did not have any effect.

I'm not sure when the issue started to appear, as it was only identified some time later. The problem is that
multiple changes get deployed to icinga2 daily, so I'm not even sure which change caused the issue.

Hm, not sure.

Some ideas that come to mind:
Do you have many dependencies configured?
Those tend to create many IDO-related operations due to their need to determine the parent state.
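For context, a typical dependency object looks like this (names are placeholders):

// Placeholder names; every such object means the parent's state has to be
// determined as well, which multiplies the IDO work when there are many of them.
object Dependency "app-needs-db" {
  parent_host_name = "db-server"
  child_host_name = "app-server"
  disable_notifications = true
  states = [ Up ]
}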

Do you have an HA setup with two masters, or is it a single-node system?

In Prod we have an HA setup, and in the lower environments only single-node systems.

The problem is the same everywhere.

Yes, we have configured many dependencies.

Without having any real starting point, this is hard to determine.

I assume you have a graphing tool (like Graphite or Grafana) where you can see when the CPU load of the Icinga server and the database server (or are both on the same system?) started to increase.
In case you are using the Icinga Director, you can then cross-check that time with changes in the Activity log. Or maybe the date rings a bell as to what was changed then.

Also check the perfdata graph of the check for Icinga itself; the check command is icinga (Icinga Template Library - Icinga 2).
This shows many interesting metrics for Icinga itself, like pending queries, executed checks and so on.
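If that check is not configured yet, a minimal version could look like this ("icinga2-master" is a placeholder for your master's Host object):

// Minimal sketch of the built-in self check; the "icinga" CheckCommand ships with the ITL
// and exposes metrics such as pending IDO queries, executed checks and latency as perfdata.
object Service "icinga" {
  host_name = "icinga2-master"
  check_command = "icinga"
}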

You could also have a look at the debug log, or run Icinga in the foreground with a different log severity (Troubleshooting - Icinga 2).
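The usual two ways would be along these lines:

# Enable the debug log feature (written to /var/log/icinga2/debug.log):
icinga2 feature enable debuglog
systemctl restart icinga2

# Or stop the service and run icinga2 in the foreground with a chosen log severity:
icinga2 daemon -x notice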

I commented out vars.slack_notifications = “enabled” in the templates wherever it was mentioned, for example in the snippet below. After that I did not observe the issue anymore, so it seems to be something to do with the Slack notifications.

template Service "generic-service" {
  retry_interval = 1m
  max_check_attempts = 2
  check_interval = 5m
#  vars.slack_notifications = "enabled"
  vars.enable_pagerduty = true
  vars.routing_servicegroup = ""
  vars.routing_service = ""
  enable_flapping = true
  flapping_threshold_low = 5
  flapping_threshold_high = 10
}
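
Presumably the Slack notifications are attached via an apply rule along these lines (sketch only; the command and user names are taken from the logs above, the rest may differ), which would explain why commenting out the custom variable stops them:

// Sketch only; the actual apply rule in the config may differ.
apply Notification "slack-notifications-notification-services" to Service {
  command = "slack-notifications-command"      // command name as seen in the logs
  users = [ "icingaadmin" ]                    // user as seen in the logs
  assign where service.vars.slack_notifications == "enabled"
}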