Monitoring active-active pacemaker clusters

ha3mak · June 9, 2019, 11:23am

Hi!

I have active-active Pacemaker clusters with two or more resource groups containing floating IP addresses, which I want to monitor with the Icinga2 agent. My clients accept configs and commands, and the agents connect to my master(s).

I prepared service checks for certain services. I defined a “services” array in my host definitions, and if it contains e.g. “apache”, then all of my apache2 checks are applied to the host. I’d like to use this method on my clusters too.

With Icinga1 I used NRPE to monitor my cluster services on the floating IP, and it worked great. I’d like to do something similar with Icinga2 clients. What’s the officially recommended way to monitor clusters like these?

I found two solutions, but they are quite ugly…

#1: I assign my service checks to all cluster nodes. One node will show “OK” checks and the other nodes will show “CRITICAL”. Then I create a “virtual” host object, assign dummy checks to it, and use “if” statements in the config to set the status of the dummy checks based on the real checks running on the nodes. I don’t like it because I always have a lot of misleading “CRITICAL” checks… If I could somehow hide the real checks and show only the “virtual” ones, it would be a good solution.

#2: I found this solution: https://www.netways.de/blog/2018/06/08/wie-ueberwache-ich-eine-cluster-applikation-in-icinga-2/
It works too, but I can’t reuse my existing checks with it.

So my question is: what’s the best practice for monitoring Pacemaker clusters with Icinga2?

Thanks!

a1mw · June 13, 2019, 12:00pm

Hi,

I operate several corosync/pacemaker clusters and have settled on the following best practice:

Service checks of clustered services only against the cluster virtual IP (user view).
Health checks (CPU, memory, disks, corosync/pacemaker states,…) against every cluster node (infrastructure view).
Availability checks are modeled with the Business Process module.
[Business Process Docu](https://icinga.com/docs/businessprocess/latest/doc/02-Getting-Started/)
https://www.unixe.de/business-processes-in-icinga-2/
You can monitor the calculated state of each business process with additional service checks, just like any “normal” service check. These checks should be the primary trigger for notifications, not the individual node checks.

Hope that helps.

Greetings,
Manfred

ha3mak · June 17, 2019, 9:09am

Hi,

Thanks for the reply. I will set up my checks exactly that way, but…

For example: I have a cluster running a DB and a webserver instance. One is running on node “A”, the other on node “B”. I want to check whether there’s a process called ‘httpd’. How can I run a process check for this and get only one “OK” result? I tried checking the process on every node and then creating a dummy check with if statements (1 OK and 1 CRITICAL is OK, 2 CRITICAL is CRITICAL)… It’s not the best solution because I will always have some “CRITICAL” results.
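
The rule in the parentheses above could be sketched as a small aggregation function. This is only an illustration in Python, not Icinga2 configuration: the name `aggregate_states` and the idea of passing in the per-node exit codes are assumptions, and how those codes are collected is left open.

```python
"""Sketch: collapse per-node plugin states into one cluster-level state."""

# Standard monitoring plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def aggregate_states(node_states):
    """Return OK if the service is OK on at least one node.

    node_states is an iterable of per-node exit codes (hypothetical input).
    An empty input yields UNKNOWN; anything else without an OK counts as
    CRITICAL, matching "2 CRITICAL is CRITICAL".
    """
    states = list(node_states)
    if not states:
        return UNKNOWN
    if any(state == OK for state in states):
        return OK
    return CRITICAL
```

This does not remove the per-node “CRITICAL” results, it only shows how the combined state is derived from them.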

With NRPE I could do this by connecting to the cluster IP, so the check ran on the right node. I’m looking for a solution where Icinga2 sends the check command only to the right node.
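
One way to approximate the NRPE behaviour, as a rough sketch only: run the same check on every node, but let the plugin first test whether the node currently holds the floating IP, and report OK with a note if it does not. The example assumes Linux (it reads `/proc`), and the names `holds_address`, `process_running` and `check` are made up for illustration. It relies on the fact that binding a socket to an address normally succeeds only when that address is configured locally (unless `net.ipv4.ip_nonlocal_bind` is enabled).

```python
"""Sketch: check a process only on the node that holds the floating IP."""
import os
import socket

OK, CRITICAL = 0, 2  # standard monitoring plugin exit codes


def holds_address(ip):
    """Return True if `ip` appears to be configured on this host.

    Assumption: net.ipv4.ip_nonlocal_bind is disabled, so binding to a
    non-local address raises OSError.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((ip, 0))  # port 0: let the kernel pick a free port
        except OSError:
            return False
    return True


def process_running(name, proc_root="/proc"):
    """Return True if a process with command name `name` exists (Linux only).

    Note: the kernel truncates the name in `comm` to 15 characters.
    """
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                if handle.read().strip() == name:
                    return True
        except OSError:
            continue  # process exited while we were scanning
    return False


def check(vip, name):
    """Return (exit_code, message) for this node."""
    if not holds_address(vip):
        return OK, f"OK - {vip} is not active on this node, skipping {name}"
    if process_running(name):
        return OK, f"OK - {name} is running on the node holding {vip}"
    return CRITICAL, f"CRITICAL - {name} is not running on the node holding {vip}"
```

A wrapper script would call `check()` with the floating IP and the process name, print the message, and exit with the returned code. Every node still reports a result, but the passive nodes report OK instead of a misleading “CRITICAL”.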

Thanks!

a1mw · June 17, 2019, 10:02am

I’m not sure if there is a solution for this within Icinga. Cluster software has its own logic for starting and moving services around, and Icinga is not aware of it - and this can happen very fast.

Do you really need to check for an http or mysql process? These services can be checked directly from Icinga with check_http or check_mysql - no need for local checks via agent or NRPE.
If the cluster is doing things that are not “normal” (e.g. a specific service is not running on its preferred node, a service is not starting up,…), you will see it in your cluster management (CRM shell,…). There are several check scripts for the status of corosync and pacemaker (unfortunately these need to be run locally on the cluster nodes).