External watchdog for Icinga2

I’ve been seeing occasional, seemingly random episodes where icinga2 greatly reduces the number of checks it runs, until the UI complains that icinga2 is not running. It is running; it’s just not doing anything.

It happens often enough that I need to implement some kind of watchdog to check on this state, and of course it can’t run from icinga2’s scheduler. My first thought is to use whatever mechanism the UI uses, but I am open to ideas.

So my two questions:

  1. What exactly is the UI looking at (something in the mariadb database, apparently), so I could look at it, too?
  2. Any other ideas for a watchdog trigger?

The solution I implement would probably run often from systemd, and have the authority to restart icinga2.

Hi,

The question is what action should follow once the problem is detected. I would base my choice on that.

I don’t know if you know this blog post: Monitoring the Monitor: How to keep a watch on Icinga 2. If not, it could be a starting point for ideas on what fits your environment.

Also, did you know that systemd has a watchdog as well? See “linux watchdog and systemd watchdog” on Unix & Linux Stack Exchange.

Thank you, but that blog post just says to put up another Icinga instance for this situation. I don’t want to do that.

Systemd watchdog is more for computer reboot. But systemd can restart the icinga2 app… which still leaves me with the question, what is the UI looking at, that it knows that icinga2 is idle (running but doing nothing). I want to look at the same thing.

You’d need to draft it for your own use case, but I wrote a script that runs as a cronjob in a different location that makes sure it can query an object from Icinga’s API on both masters. If either master is unresponsive, it uses PagerDuty’s API directly to page us about Icinga not working.
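A minimal sketch of that kind of external check, not the script described above. It assumes the Icinga 2 REST API is reachable on port 5665 and that the `/v1/status/CIB` reply carries `active_host_checks_1min` and `active_service_checks_1min` counters; verify both against your version. The URL, the `watchdog` API user, the password, and the threshold are placeholders, and the paging or restart action is left out.

```python
import base64
import json
import ssl
import sys
import urllib.request

# Assumptions (not from this thread): endpoint, credentials, threshold.
API_URL = "https://localhost:5665/v1/status/CIB"
API_USER = "watchdog"        # hypothetical ApiUser name
API_PASSWORD = "change-me"   # hypothetical password
MIN_CHECKS_PER_MINUTE = 1    # hypothetical threshold for "doing something"


def fetch_status(url=API_URL, user=API_USER, password=API_PASSWORD, timeout=10):
    """Query the Icinga 2 API and return the decoded JSON reply."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    # Certificate verification is disabled only to keep the sketch short;
    # load the Icinga CA certificate into the context for real use.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
        return json.load(resp)


def checks_last_minute(payload):
    """Add up host and service checks executed during the last minute."""
    status = payload["results"][0]["status"]
    return (status.get("active_host_checks_1min", 0)
            + status.get("active_service_checks_1min", 0))


def is_stalled(payload, minimum=MIN_CHECKS_PER_MINUTE):
    """True when fewer than `minimum` checks ran in the last minute."""
    return checks_last_minute(payload) < minimum


if __name__ == "__main__":
    try:
        stalled = is_stalled(fetch_status())
    except (OSError, ValueError, KeyError, IndexError) as exc:
        print(f"Icinga 2 API unreachable or unexpected reply: {exc}")
        sys.exit(2)
    print("stalled" if stalled else "ok")
    sys.exit(1 if stalled else 0)
```

The exit code (0 ok, 1 stalled, 2 unreachable) lets cron or a systemd unit decide what to do next, whether that is paging someone or restarting the service.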

If you have a secondary monitoring system in place somewhere, it might be better to just leverage that. Also, sometimes icingaweb reporting that Icinga is down just means either the database is running slow or Icinga is delayed on inserts into it (during a reload, for example). It would be good to make sure you’re using IDO and database performance checks to log performance data, so you can see whether the bottleneck is there.

I found the answer to my question here:

I just need to write some code around that, and also make sure I don’t mistake a reload or restart for a zombie state.
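One way the reload/restart guard could look, as a rough sketch only. It assumes the watchdog can obtain the daemon's uptime and a recent check count from whatever source it ends up reading, and that the uptime value resets on a reload (worth verifying). The grace period, the number of consecutive idle observations, and the function names are hypothetical.

```python
import subprocess

GRACE_AFTER_START = 300   # hypothetical: seconds to ignore after a (re)start
REQUIRED_STALLS = 3       # hypothetical: consecutive idle runs before acting


def evaluate(uptime_seconds, checks_last_minute, previous_stalls,
             grace=GRACE_AFTER_START, required=REQUIRED_STALLS):
    """Return (restart_needed, updated_stall_count) for one watchdog run."""
    if uptime_seconds < grace:
        # Recently started or reloaded: low activity is expected, so reset.
        return False, 0
    if checks_last_minute > 0:
        # The scheduler is doing work: healthy, so reset the counter.
        return False, 0
    stalls = previous_stalls + 1
    return stalls >= required, stalls


def restart_icinga():
    """Restart the service through systemd (needs sufficient privileges)."""
    subprocess.run(["systemctl", "restart", "icinga2"], check=True)
```

The stall counter has to survive between runs (a small state file would do), so that a single quiet minute never triggers a restart on its own.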
