I have an Icinga 2 setup in High Availability with master1 and master2. On master1 I have also installed Graphite and Grafana. If I start the icinga2 service on master1 only, I can see the graphs in Grafana successfully. But if I also start the icinga2 service on master2, the graphs are not the same; it seems that not all the data gets written. Finally, if I stop the icinga2 service on master2, the graphs are OK again.
I ran `tcpdump -i eth0 port 2003` on master2 and the traffic looks fine to me.
Is it possible I've misconfigured something on icinga2 master2?
My environment:
master1 and master2: Ubuntu 16.04, icinga2 2.9.1.1
master1: Grafana 5.2.4, graphite-carbon 0.9.15-1
/**
* The GraphiteWriter type writes check result metrics and
* performance data to a graphite tcp socket.
*/
library "perfdata"
object GraphiteWriter "graphite" {
  host = "icingamaster1-FQDN"
  port = 2003
  enable_send_thresholds = true
  enable_send_metadata = true
}
The Icinga 2 side looks OK to me. Did you check that master2 can actually reach the graphite port on master1?
Also check the Graphite log files; I think the problem is there.
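To rule out a connectivity or firewall problem, a quick sketch like this can push one metric to Carbon's plaintext port from master2 (hostname and metric path below are placeholders, not from the original config):

```python
import socket
import time

def send_test_metric(host, port, path="icinga2.test.connectivity", value=1):
    """Send one metric in Carbon's plaintext protocol: '<path> <value> <timestamp>\n'."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
    return line

# Example (hostname is a placeholder for your graphite node):
# send_test_metric("icingamaster1-FQDN", 2003)
```

If the connection is refused or times out here, the problem is the network/firewall, not Icinga 2.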
Current versions with the Graphite feature enabled inside an HA zone will have both sides active, and as such the metrics are written as received from both ends. Your carbon-cache/relay should be able to filter/deal with that, or you'll have to keep two instances. 2.11+ will bring HA awareness here, meaning that only one endpoint will actively write to carbon-cache.
Try enabling the debug log and grep for GraphiteWriter, especially the sent metrics. Compare the timestamps from these logs and work out which graphs source from which metrics.
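A small sketch for that comparison: pull the GraphiteWriter entries out of `debug.log` on both masters and diff the timestamps per metric. Note the line pattern below is an assumption about the debug log format and may differ between Icinga 2 versions, so adjust the regex to what you actually see:

```python
import re

# Assumed format (verify against your own debug.log), roughly:
#   [2018-09-25 10:00:00 +0200] debug/GraphiteWriter: Add to metric list: 'foo.bar' 0.5 1537862400.
LINE_RE = re.compile(
    r"debug/GraphiteWriter: Add to metric list: '(?P<metric>[^']+)' "
    r"(?P<value>\S+) (?P<ts>\d+)"
)

def graphite_metrics(lines):
    """Yield (metric, value, timestamp) tuples from icinga2 debug.log lines."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield m.group("metric"), float(m.group("value")), int(m.group("ts"))
```

Run it over `/var/log/icinga2/debug.log` from both masters; overlapping metric paths with close-but-different timestamps would confirm both endpoints are writing.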
Plus, 2.9.1 might contain bugs in this area; 2.9 is not actively developed anymore.
@anon66228339 I checked master2. I had forgotten to open the security group. Now master2 sends data to Graphite :).
@Michael If I understand you correctly, both icinga masters send data to Graphite? I thought only the active icinga endpoint sent data to Graphite.
And finally, are you telling me to upgrade the icinga version to at least 2.10, right?
Exactly. You can see that with a short tcpdump on master1, splitting the incoming sources (local vs. master2).
In terms of an upgrade: it is desirable, but I would first analyse the problem further. Since you didn't know that there are two writers instead of one, this alone could already lead to the solution.
I would have to check the documentation again, but what I remember is that the carbon-cache daemon literally caches all incoming data for the duration of the flush interval. When it writes the data to Graphite, the whisper library handles the incoming data and aligns it if there are multiple values for the same key within the current interval window (i.e. between timestamp_now - interval and timestamp_now).
So, with an interval of 60 seconds, just one point will be written per minute.
But there are a few exceptions, for example if the carbon-cache daemon flushes the data more often than that. So you can mostly rely on the deduplication by Graphite, but there are a very few edge cases.
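The alignment described above can be illustrated with a toy sketch (this mimics the last-write-wins behaviour of whisper slots; it is an illustration, not whisper's actual code):

```python
def align_to_interval(points, interval=60):
    """Keep only the last value per interval-aligned timestamp slot.

    points: iterable of (timestamp, value) in arrival order.
    Later arrivals that fall into the same slot overwrite earlier ones,
    which is how duplicate writes from two masters collapse to one point.
    """
    slots = {}
    for ts, value in points:
        slots[ts - (ts % interval)] = value  # align down to the slot boundary
    return sorted(slots.items())
```

So two masters writing the same check result within one minute end up as a single stored point, as long as both writes land in the cache before the flush.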
Here is what our prod system looks like:
2 Icinga master nodes
2 nodes with Graphite and Grafana
Both icinga nodes have a local carbon-relay running. carbon-relay allows you to specify multiple targets, so the relay daemon knows about both Graphite nodes and duplicates the data so that both nodes receive the same metrics. The relays also cache the data for a limited time in case of outages or short network interruptions. If you experience larger outages, network partitions or other events that prevent the relay from sending the data to the carbon-caches, you should take a look at the utilities from carbonate: https://github.com/graphite-project/carbonate
They are really good and useful if you want to align data across Graphite instances, or backfill/migrate data, for example.
Edit: this is just my recommendation if you do not need, or do not have the resources for, a whole full-blown Graphite cluster.
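For reference, the relay fan-out described above looks roughly like this in the carbon configuration (hostnames and ports are placeholders; double-check option names against the carbon documentation for your version):

```ini
# carbon.conf on each icinga master -- sketch, not a verbatim prod config
[relay]
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = rules
DESTINATIONS = graphite1:2004:a, graphite2:2004:a

# relay-rules.conf: one catch-all rule that sends everything to both nodes
[default]
default = true
destinations = graphite1:2004:a, graphite2:2004:a
```

The GraphiteWriter on each master then points at `127.0.0.1:2003` (the local relay) instead of a remote Graphite host.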
I am very grateful for your ideas and recommendations, because I was lost on this topic and this opens a way forward for the deployment.
I had configured one icinga2 master with Graphite and Grafana on the same machine and everything ran OK, but since I tried to deploy icinga2 in HA I've stopped understanding what is going on.
Typically I would build it like that. 2.11 provides the capability to have only one endpoint or the other with an active Graphite feature, but that is not released yet, nor does it have a release date.