Defining checks that cover multiple nodes

Hi,

I am looking to use Icinga2 for monitoring Cassandra and Kafka clusters. I’m wondering if anyone has any advice on how to model checks that operate against a cluster as a whole instead of an individual host. For example, I might have a cluster with 20 nodes, and I want to check that the cluster SSL certs are not going to expire soon. I don’t want 20 alerts all at the same time, just a single one for the cluster. There are quite a few things like this that belong at the cluster level rather than the host level. Ideally I would have some checks at the cluster level, some at the datacentre level, and others at the node level.

Does anyone have any suggestions for how to model this within Icinga? Would I create a “special” host to represent the cluster or datacentre?

I will be using the passive check mechanism to input check results into Icinga via the REST API.

Please forgive me if this is a silly question, I am quite new to Icinga.

Any help would be greatly appreciated!

Dummy hosts and VIPs are your friends in this situation. I actually have to dive into Kafka soon, so I’ll probably have a more specific answer on that one in a couple of months ¯\_(ツ)_/¯

Anyway, this isn’t a definitive answer so much as a few things I’ve had success with.

For clustered things that have a VIP address you can grab information from, I make a host object for that, but I don’t set up endpoints or zones for it. For example, I have some primary/secondary Redis servers at work. They each run the Icinga daemon for the usual suspects (CPU load, swap, process running, etc.), but I also have a host object with an address that I apply remotely run services to, typically Python scripts that do key counts and things of that nature. Nothing has to run on the client node, and by default nothing does, so you can happily query the API of your cluster for whatever you need and not have it apply to every single server.

If an IP address is irrelevant in a particular use case, make a host object but use “dummy” instead of the “hostalive” check command:
https://icinga.com/docs/icinga2/latest/doc/10-icinga-template-library/#dummy
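Since you mentioned feeding passive results in over the REST API, here is a rough Python sketch of what a submission for a service on such a dummy host could look like. It targets the process-check-result action of the Icinga 2 API. The URL, host name, service name and API user are made-up placeholders, and the request is only built here, not sent.

```python
import base64
import json
import urllib.request


def build_check_result_request(base_url, host, service, exit_status, output,
                               user, password):
    """Build (but do not send) a process-check-result request for one service."""
    body = {
        "type": "Service",
        # Select exactly one service on the (dummy) cluster host.
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": exit_status,  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
        "plugin_output": output,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=f"{base_url}/v1/actions/process-check-result",
        data=json.dumps(body).encode(),
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


# All values below are hypothetical placeholders.
req = build_check_result_request(
    "https://icinga.example.com:5665",
    "cassandra-cluster-prod",
    "ssl-cert-expiry",
    1,
    "WARNING - earliest cluster cert expires in 20 days",
    "passive-user",
    "secret",
)
# To send it, pass req to urllib.request.urlopen() together with an SSL
# context that trusts the CA of your Icinga instance.
```

The point is that the result lands on one service of the cluster host, so you get a single alert instead of one per node.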

In this case, I’ve created dummy hosts that represent an entire isolated application stack and run checks representing it. Typically these check whether multiple nodes are running hot at the same time, so it doesn’t wake me up in the middle of the night just because one server has been at full load for 5 minutes. You can scrape that kind of data from the Icinga API or IDO.
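To give an idea of the “multiple nodes running hot” logic, here is a minimal Python sketch. It assumes you have already collected a load figure per node (from the Icinga API, IDO, or wherever); the function name, thresholds and node names are invented for illustration, so adjust them to taste.

```python
def cluster_state(node_loads, hot_threshold=0.9, warn_count=2, crit_count=4):
    """Collapse per-node load values into one plugin-style state for the cluster.

    node_loads maps node name -> load as a fraction (1.0 = fully loaded).
    Returns (exit_status, plugin_output).
    """
    hot = sorted(name for name, load in node_loads.items()
                 if load >= hot_threshold)
    if len(hot) >= crit_count:
        status, label = 2, "CRITICAL"
    elif len(hot) >= warn_count:
        status, label = 1, "WARNING"
    else:
        status, label = 0, "OK"
    output = f"{label} - {len(hot)}/{len(node_loads)} nodes hot"
    if hot:
        output += ": " + ", ".join(hot)
    return status, output


# Hypothetical sample data: two of three nodes are above the threshold.
status, output = cluster_state({"node1": 0.95, "node2": 0.40, "node3": 0.97})
print(status, output)  # prints: 1 WARNING - 2/3 nodes hot: node1, node3
```

A single hot node stays OK with these defaults, which is what keeps the pager quiet.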

Let me know if you need to be pointed anywhere else, I might have ideas.

Hi,

there are different methods which achieve much the same thing, but they require some advanced knowledge.

  • Use the DSL with object accessor methods to create such “clustered checks”.
  • Use the Business Process module and model the processes and their overall state. These can be chained into multiple tree levels, the module supports simulation of nodes, and it provides a check command via icingacli.

One advantage of Business Processes is their visibility inside the web interface and dashboards. You can also test them inside the Vagrant boxes, which come with some pre-defined processes.

Cheers,
Michael

Excellent, thanks for your help.