How to monitor your monitoring?

I am curious how you monitor your monitoring. :slight_smile:

Cheers,
Marcus

Mostly through high availability. Having two masters with redundant hardware and redundant ways of informing you about problems on the other host should be enough. The only thing this won’t catch is when you mess up your configuration, but you will see that instantly when deploying the new configuration.

This is a reason why some of our customers (even those with “all virtual” strategies) bought 2 hardware hosts, each with their own UPS and SMS modem.
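For illustration, a two-master HA setup in Icinga 2 boils down to two endpoints sharing one zone. A minimal zones.conf sketch along these lines (hostnames are placeholders):

```
// zones.conf on both masters -- hostnames are placeholders
object Endpoint "master1.example.org" {
  host = "master1.example.org"
}

object Endpoint "master2.example.org" {
  host = "master2.example.org"
}

// both endpoints live in the same zone, so they share the load
// and take over for each other if one goes down
object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org" ]
}
```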

It’s like @twidhalm wrote: through high availability (2 master nodes, 2 nodes in zone x, 2 nodes in zone y, etc.). We have virtualized our Icinga servers with KVM on our own hardware, and they are installed in different data centers, so we are independent of our virtualization environment.
You can check each node with the health checks from the ITL. In the end, our Icinga servers are monitored like every other server: disk, load, ping, etc.

You can also create something like a watchdog on your Icinga servers that checks whether the icinga service is running correctly and, if not, triggers an action such as sending an SMS or making a call. This is for situations where something really, really weird is going on and the complete cluster is down. If you work with Icinga(Web) regularly you will see such errors immediately anyway; this is more for times when the office is not staffed around the clock and you don’t want a surprise on the next working day.
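A rough sketch of what such a watchdog could look like, run from cron on each Icinga server; the `sms-send` command is a placeholder for whatever your SMS modem or gateway provides:

```sh
#!/bin/sh
# watchdog.sh -- run from cron, e.g.: */5 * * * * root /usr/local/bin/watchdog.sh
# "sms-send" is a placeholder for your SMS modem/gateway tooling.

if ! systemctl is-active --quiet icinga2; then
    sms-send --to "+491701234567" --text "icinga2 is down on $(hostname)"
fi
```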


I have a watchdog script in a datacenter auxiliary to the masters. It makes sure both that the Icinga API is reachable and that PagerDuty is responding. It runs every 5 minutes and announces in our team chat if things aren’t working as expected.
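Roughly along these lines; the webhook URL, credentials, and hostnames here are placeholders, not the real script:

```sh
#!/bin/sh
# Runs every 5 minutes from cron in the auxiliary datacenter.
CHAT_WEBHOOK="https://chat.example.org/hooks/monitoring"   # placeholder

alert() {
    curl -s -X POST -H 'Content-Type: application/json' \
         -d "{\"text\": \"$1\"}" "$CHAT_WEBHOOK"
}

# Is the Icinga 2 API answering? /v1/status is a standard endpoint.
curl -sfk -u watchdog:secret -o /dev/null \
     "https://master1.example.org:5665/v1/status" \
    || alert "Icinga API on master1 is not reachable"

# Crude reachability check against PagerDuty's status page.
curl -sf -o /dev/null "https://status.pagerduty.com" \
    || alert "PagerDuty is not responding"
```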


This is very important. I’ve seen a lot of customer setups which monitor every tiny detail of their hosts but barely ping their Icinga hosts. Don’t forget to treat your Icinga hosts like every other production system when it comes to monitoring. And don’t forget the Icinga-specific checks from the ITL that @stevie-sy mentioned (icinga, cluster-zone, ido, etc.).
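For reference, a sketch of how those ITL checks can be attached to a master host (hostname, zone, and IDO names are placeholders):

```
// health of the local icinga2 process (ITL CheckCommand "icinga")
object Service "icinga" {
  host_name     = "master1.example.org"
  check_command = "icinga"
}

// is the connection to a given zone alive? (ITL "cluster-zone")
object Service "cluster-zone-satellite" {
  host_name         = "master1.example.org"
  check_command     = "cluster-zone"
  vars.cluster_zone = "satellite"          // placeholder zone name
}

// is the IDO database backend keeping up? (ITL "ido")
object Service "ido" {
  host_name     = "master1.example.org"
  check_command = "ido"
  vars.ido_type = "IdoMysqlConnection"
  vars.ido_name = "ido-mysql"              // must match your IDO object
}
```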


I am curious about the 2 masters. Currently we have 1 master and 2 zones with 2 satellites each, which do the checks (we mostly use WMI and SSH). The master itself also has icingaweb2, the DB, different icingaweb modules, and Grafana installed. I have already thought about a second master. How do you run icingaweb, Grafana, and the other modules (e.g. business process alerting) in HA? How do you keep them synced?

I have more than one monitoring system - don’t ask.

So one of the first things I did was to set them up to monitor each other.

Additionally, the SMSEagles send alert SMS messages if critical hosts of the monitoring stacks stop answering pings.

I just stay awake during the night and watch it.
In the morning I crawl to my coffin and let the “daywalkers” watch it :stuck_out_tongue_winking_eye: :crazy_face:

I have icingaweb on both masters and treat the instance on master2 as a staging environment for new icingaweb components; I tell users to use it only if master1 is down. I have my database on separate servers, as well as Grafana/InfluxDB. You can distribute these things if you like, and having a hot standby of your database isn’t a bad idea either. Having a second master also keeps you from losing monitoring during upgrades.

icingaweb2 is installed on both masters, each with its own database, so we don’t use a cluster like Galera (though I tried it). The DB and the dashboards are copied from master1 to master2 every 2 hours or so. Not much changes at the moment, so we don’t need fancier sync tools for this; we keep it simple.
With keepalived we control which master the users work on. Grafana is installed on an extra server.

Are you just syncing both /usr/share/icingaweb2 and /etc/icingaweb2 via rsync?

At the moment with scp and a cronjob.
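For illustration, that can be as simple as a crontab entry on master1 along these lines (hostnames, paths, and schedule are placeholders):

```sh
# /etc/cron.d/icingaweb2-sync -- copy config and a DB dump to master2
# every 2 hours; all names here are placeholders.
0 */2 * * * root scp -r /etc/icingaweb2 master2.example.org:/etc/
5 */2 * * * root mysqldump icingaweb2 | ssh master2.example.org "mysql icingaweb2"
```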