IdoMysqlConnection timeout too long, had to restart Icinga - config error on my part?

rivad · May 8, 2020, 8:55am

Hi

I had to restart icinga2 today to get the DB connection working again after a hiccup in the network.
Did I misconfigure something, or is this normal?

The DB is not on the same host. It is a MariaDB configured for Master <-> Master sync behind a load balancer that runs pinned to Master 1 with a failover to Master 2.

Beginning of problem:

[2020-05-08 08:39:22 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate: 51.1167/s (3067/min 189383/5min 231919/15min);
[2020-05-08 08:40:42 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 2, rate: 51.3167/s (3079/min 152108/5min 231933/15min);
[2020-05-08 08:41:02 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 2, rate: 50.85/s (3051/min 152096/5min 231939/15min);
[2020-05-08 08:43:42 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 34, rate: 50.1/s (3006/min 15304/5min 220124/15min);
[2020-05-08 08:43:52 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 485, rate: 42.1167/s (2527/min 14829/5min 219631/15min); empty in 10 seconds
[2020-05-08 08:44:02 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 978, rate: 33.6333/s (2018/min 14268/5min 219078/15min); empty in 19 seconds
[2020-05-08 08:44:12 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 1646, rate: 23.7167/s (1423/min 13721/5min 218482/15min); empty in 24 seconds

Resolution by me restarting icinga2:

[2020-05-08 09:22:42 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 93312, rate:  0/s (0/min 0/5min 0/15min); empty in infinite time, your task handler isn't able to keep up
[2020-05-08 09:22:52 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 93312, rate:  0/s (0/min 0/5min 0/15min); empty in infinite time, your task handler isn't able to keep up
[2020-05-08 09:23:02 +0200] information/IdoMysqlConnection: 'ido-mysql' resumed.
[2020-05-08 09:23:02 +0200] information/IdoMysqlConnection: MySQL IDO instance id: 1 (schema version: '1.15.0')
[2020-05-08 09:23:03 +0200] information/IdoMysqlConnection: Finished reconnecting to 'ido-mysql' database 'icinga2' in 0.851092 second(s).
[2020-05-08 09:23:09 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 0, rate: 8.45/s (507/min 507/5min 507/15min);
[2020-05-08 09:23:59 +0200] information/WorkQueue: #8 (IdoMysqlConnection, ido-mysql) items: 3, rate: 1482.23/s (88934/min 88934/5min 88934/15min);

The last time something like this happened, it resolved itself when Icinga restarted the IdoMysqlConnection after ca. 30 min.
Would it have resolved by itself if I had just waited a few more minutes?
I confirmed that I could connect from the server Icinga is running on to the database with the same credentials that Icinga uses. No locks seen with

SHOW FULL PROCESSLIST;

as icinga db user.

What are your thoughts and recommendations?

Regards,
Dominik

aclark6996 · May 8, 2020, 2:17pm

Hello Dominik,
I hope you are doing well. What attributes do you have set in your IdoMysqlConnection object? Do you have “enable_ha” and “failover_timeout” set? More details found here.

Which node is the active node when both nodes, Master1 and Master2, are online? When both nodes are online, Master2 is usually the active node.

Your post said you are using a load balancer. I do not think that is needed. If HA is configured correctly, when one node goes offline the other node automatically becomes the active node, and checks and notifications continue without problems on the active node. Maybe the load balancer is pointing everything to Master 1.

Regards
Alex

rivad · May 18, 2020, 8:46am

Sorry for the late answer.

object IdoMysqlConnection "ido-mysql" {
  user = "icinga2",
  password = "PW",
  host = "ictmysqlp.domain.tld",
  database = "icinga2"
}

Icinga has only one master.
The load balancer runs as a failover and uses ictmysqllp01 by default.

aclark6996 · May 19, 2020, 8:30pm

You said in your original post.

I was under the assumption that you have high availability configured in your Icinga environment. Now you are saying that you only have one Master server. If that is the case, then the attributes “enable_ha” and “failover_timeout” do not apply in your setup.

I was going to suggest adding a “failover_timeout” attribute to your IdoMysqlConnection object. If the active Icinga node cannot connect to the database because of a network problem, it will automatically fail over to the other Icinga node after the timeout period.
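
For reference, a minimal sketch of what that would look like, assuming two Icinga 2 nodes in one HA zone (which is not the case in a single-master setup). The connection attributes are copied from the object posted above; the `failover_timeout` value is only illustrative, not a recommendation:

```
// Sketch only: the two HA attributes discussed in this thread.
object IdoMysqlConnection "ido-mysql" {
  user = "icinga2"
  password = "PW"
  host = "ictmysqlp.domain.tld"
  database = "icinga2"

  // Only relevant when two Icinga 2 endpoints share the same zone.
  enable_ha = true

  // Illustrative value (assumption): how long the other node waits
  // before taking over the IDO connection.
  failover_timeout = 60s
}
```

With a single Icinga node there is nothing to fail over to, so these attributes would not change how the node reconnects to the database.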

I have never noticed a problem with the Master server not communicating to the DB server yet. Sorry if this was not any help…

Alex

rivad · May 20, 2020, 9:44am

The Master <-> Master sync behind the F5 load balancer / failover was meant in the context of MySQL. Sorry about the confusion.