We have two separate Icinga environments, and we found it useful to have each one monitor the other. In env1, a service check queries the Graphite data of env2 and looks at the various num_services_X metrics, and vice versa. Other checks make sure the “other environment” is healthy. That way, if either environment stops running its checks properly, we’ll know about it.
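For illustration, a cross-environment check like this could look roughly like the sketch below. It uses Graphite's `/render` endpoint with `format=json`, takes the newest non-null datapoint, and maps it to the usual plugin exit codes. The Graphite URL, the metric path (which depends on how your Graphite writer names hosts, services and perfdata) and the thresholds are all hypothetical placeholders, not our actual configuration.

```python
import json
import urllib.parse
import urllib.request

# Standard plugin exit codes understood by Icinga.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = ("OK", "WARNING", "CRITICAL", "UNKNOWN")

# Hypothetical values: replace with the other environment's Graphite host
# and the metric path your Graphite writer actually produces.
GRAPHITE_URL = "http://graphite.env2.example:8080"
METRIC = "icinga2.env2-master.services.icinga.icinga.perfdata.num_services_unknown.value"


def latest_value(render_json):
    """Return the newest non-null datapoint of the first series that has data."""
    for series in render_json:
        # Graphite returns datapoints as [value, timestamp] pairs, oldest first.
        for value, _ts in reversed(series["datapoints"]):
            if value is not None:
                return value
    return None


def fetch(target, window="-10min"):
    """Query the Graphite render API for one target as JSON."""
    query = urllib.parse.urlencode({"target": target, "from": window, "format": "json"})
    with urllib.request.urlopen(f"{GRAPHITE_URL}/render?{query}", timeout=10) as resp:
        return json.load(resp)


def evaluate(value, warn, crit):
    """Map a metric value to (exit code, message) using upper thresholds."""
    if value is None:
        # No recent data usually means the other side stopped writing metrics.
        return UNKNOWN, "no recent datapoints"
    if value >= crit:
        return CRITICAL, f"{value:g} >= {crit}"
    if value >= warn:
        return WARNING, f"{value:g} >= {warn}"
    return OK, f"{value:g}"


def main(warn=10, crit=50):
    try:
        value = latest_value(fetch(METRIC))
    except OSError as exc:
        # Network/HTTP errors: report UNKNOWN rather than a false OK.
        print(f"UNKNOWN - cannot query Graphite: {exc}")
        return UNKNOWN
    status, text = evaluate(value, warn, crit)
    print(f"{LABELS[status]} - {text}")
    return status

# Plugin entry point (not executed here): import sys; sys.exit(main())
```

Returning UNKNOWN when no fresh datapoints arrive is deliberate: a silent Graphite feed is exactly the "other environment is not doing checks" case this setup is meant to catch.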
One of the checks we rely on alerts when num_services_unknown is too high, but we recently noticed that we are exceeding our thresholds because we had to turn off a bunch of monitored servers. Although the hosts and services on these temporarily shut-down boxes are in downtime, num_services_unknown still counts them.
Is there a way to combine the other num_services_X metrics to get what I am looking for, namely the number of UNKNOWN services that are not in downtime?
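The obvious combination is a subtraction, but it is worth checking what it actually yields. Assuming the `icinga` check also writes a num_services_in_downtime metric to Graphite (verify this exists in your Graphite tree), that counter includes downtimed services in every state, not only UNKNOWN ones. So num_services_unknown minus num_services_in_downtime is only a lower bound on the non-downtimed UNKNOWN services, and num_services_unknown itself is the upper bound. The sketch below just spells out that arithmetic; the numbers in the example are made up.

```python
def unknown_not_in_downtime_bounds(num_services_unknown, num_services_in_downtime):
    """Return (lower, upper) bounds on UNKNOWN services that are NOT in downtime.

    num_services_in_downtime counts downtimed services in every state (OK,
    WARNING, ...), so subtracting it may remove more than just the downtimed
    UNKNOWN services. The real value can be higher than the difference, but
    never higher than num_services_unknown itself, and never below zero.
    """
    lower = max(0, num_services_unknown - num_services_in_downtime)
    upper = num_services_unknown
    return lower, upper


# Made-up example: 40 UNKNOWN services, 30 of them on the powered-off servers
# (in downtime), plus 5 OK services in an unrelated maintenance downtime.
# The true answer is 40 - 30 = 10, but the counters only allow bounding it.
print(unknown_not_in_downtime_bounds(40, 35))  # (5, 40)
```

Graphite's `diffSeries()` can perform the same subtraction server-side, but the result has the same limitation: with unrelated downtimes active it underestimates, so an alert based on it may fire late.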
Both environments run Icinga 2 2.12.3, with icingaweb2 (2.8.2) and Director (1.7.2).