Icinga2 HA does not show Grafana graphs correctly

Hi.

I have an Icinga2 setup in high availability, with master1 and master2. On master1 I have also installed Graphite and Grafana. If the icinga2 service is started on master1 only, I can see the graphs in Grafana successfully. But if I also start the icinga2 service on master2, the graphs are not the same; it seems that not all the data gets written. Finally, if I stop the icinga2 service on master2, the graphs are OK again.

I’ve run tcpdump -i eth0 port 2003 on master2, and I think the traffic there looks OK.

Is it possible I’ve misconfigured something in Icinga2 on master2?

My environment:
master1 and master2: Ubuntu 16.04, Icinga2 2.9.1.1.
master1: Grafana 5.2.4, graphite-carbon 0.9.15-1

Any help is welcome.
Thanks in advance.

Hello,

you should post your configuration for the GraphiteWriter and the Grafana module.

Regards,
Carsten

Hello Carsten.

Here it is:

master1 (graphite.conf)

library "perfdata"

object GraphiteWriter "graphite" {
  host = "localhost"
  port = 2003
  enable_send_thresholds = true
  enable_send_metadata = true
}

master2 (graphite.conf)

/**
 * The GraphiteWriter type writes check result metrics and
 * performance data to a graphite tcp socket.
 */
library "perfdata"

object GraphiteWriter "graphite" {
  host = "icingamaster1-FQDN"
  port = 2003
  enable_send_thresholds = true
  enable_send_metadata = true
}

master1 grafana.ini

[grafana]
host = "localhost:3000"
protocol = "http"
defaultdashboard = "icinga2-default"
defaultdashboardstore = "db"
theme = "light"
datasource = "graphite"
accessmode = "direct"
directrefresh = "no"
height = "280"
width = "640"
enableLink = "yes"
shadows = "0"
usepublic = "no"

I have no grafana.ini on master2 because I don’t have Icinga Web 2 in HA yet.

I should say I’ve installed Grafana itself, but not as a module. Is it better to use the Grafana module?

Thanks for your time.

The Icinga2 side looks OK to me. Did you check that master2 can write to the Graphite port on master1?
Also check the log files of Graphite; I think the problem is there.
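
For example, a quick connectivity test from master2 (assuming netcat is installed; the FQDN is the placeholder from your config):

nc -vz icingamaster1-FQDN 2003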

Hi,

current versions with the Graphite feature enabled inside an HA zone will have both sides active, and as such, metrics are written as received from both ends. Your carbon-cache/relay should be able to filter/deduplicate that, or you’ll keep two Graphite instances. 2.11+ will bring HA awareness here, meaning that only one endpoint actively writes to carbon-cache.

Try enabling the debug log and grep for GraphiteWriter, especially the sent metrics. Compare the timestamps from these logs, and determine which graphs are sourced from which metrics.
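
For example (assuming the default feature and log paths on Ubuntu):

# enable the debug log on each master and restart
icinga2 feature enable debuglog
systemctl restart icinga2

# then grep the sent metrics
grep GraphiteWriter /var/log/icinga2/debug.log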

Plus, 2.9.1 might contain bugs in this area; 2.9 is not actively developed anymore.

Cheers,
Michael

@anon66228339: I checked master2. I forgot to open the security group :frowning:. Now master2 sends data to Graphite. :)

@Michael: If I understand you correctly, both Icinga masters send data to Graphite? I thought only the active Icinga endpoint sent data to Graphite.

And finally, you’re telling me to upgrade Icinga to at least 2.10, right? :wink:

Regards
Jesús

Exactly. You can see that with a short tcpdump on master1, splitting between the incoming sources (local vs. master2).
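
For example (the master2 address is a placeholder):

# metrics arriving from master2
tcpdump -i eth0 'port 2003 and src host <master2-IP>'

# metrics written locally (master1's GraphiteWriter points at localhost)
tcpdump -i lo port 2003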

In terms of an upgrade: it is desirable, but I would first analyse the problem further. Since you didn’t know there were two writers instead of one, this alone could already lead to the solution.

Cheers,
Michael

Ok. I’m working on it.

Thanks a lot again.

Jesús

Hi Michael,

You told me:

Your carbon-cache/relay should be able to filter/deal with that, or you’ll keep two instances

But do you have an example of this? I haven’t found any.

Regards.
Jesús

That’s an assumption on my part from listening to talks at conferences; I have never used Graphite relays in production myself.

Then, in an Icinga2 HA environment, do you usually have Graphite on each machine?

Regards.
Jesús

I would have to check the documentation again, but what I remember is that the carbon-cache daemon caches all incoming data for the duration of the flush interval. When it writes the data to disk, the whisper library handles the incoming data and aligns it if there are multiple values for the same key within the window [timestamp_now - interval, timestamp_now].

So, if you have an interval of 60 seconds, just one point will be written per minute.

There are a few exceptions, for example if the carbon-cache daemon flushes the data more often than that. So you can generally rely on the deduplication by Graphite, with only a very few exceptions.
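
As an illustration (the path is the Ubuntu default for graphite-carbon; the retention values are just an example, and ^icinga2\. matches the GraphiteWriter’s default metric prefix):

# /etc/carbon/storage-schemas.conf
[icinga2]
pattern = ^icinga2\.
retentions = 60s:7d,5m:90d

With a 60s first archive, whisper stores at most one value per 60-second slot, regardless of how many writers send data.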

Here is what our prod system looks like:

  • 2 Icinga master nodes
  • 2 nodes with Graphite and Grafana

Both Icinga nodes have a local carbon-relay running. carbon-relay allows you to specify multiple targets, so the relay daemon knows of both Graphite nodes and duplicates the data so that both nodes receive the same data. The relays also cache the data for a limited time in case of outages or short network interruptions. If you experience larger outages, network partitions or other events that prevent the relay from sending the data to the carbon-caches, you should take a look at the utilities from carbonate: https://github.com/graphite-project/carbonate

They are really good and useful if you want to align data across Graphite instances, or backfill or migrate data, for example.
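
To illustrate the relay setup (a minimal sketch with the Debian/Ubuntu paths; the Graphite hostnames are placeholders): each Icinga node runs a carbon-relay listening on localhost:2003, and the GraphiteWriter points at it, exactly like the localhost config above.

# /etc/carbon/carbon.conf, [relay] section (sketch)
[relay]
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = rules
DESTINATIONS = graphite1.example.com:2004, graphite2.example.com:2004

# /etc/carbon/relay-rules.conf: the default rule lists both caches,
# so every metric is duplicated to both Graphite nodes
[default]
default = true
destinations = graphite1.example.com:2004, graphite2.example.com:2004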

Edit: this is just my recommendation if you do not need, or do not have the resources for, a whole full-blown Graphite cluster.

Hi Marcel.

I am very grateful for your ideas and recommendations, because I was lost on this topic and this opens a way forward for my deployment.
I had configured one Icinga2 master with Graphite and Grafana on the same machine and everything ran fine, but since I tried to deploy Icinga2 HA I’ve struggled to understand things.

Regards
Jesús

Typically I would build it like that. 2.11 provides the capability to have either one endpoint or the other with an active Graphite feature, but that’s not released yet, nor does it have a release date.
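
To sketch what that could look like (the enable_ha attribute is my assumption about the planned 2.11 syntax, not released yet):

object GraphiteWriter "graphite" {
  host = "localhost"
  port = 2003
  enable_ha = true // assumption: planned 2.11 HA-awareness flag
}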

Hi Michael.

I’m waiting impatiently. :wink:

Seriously, thank you for this great software. Today I’m updating Vagrant to take a look at Icinga reporting.

Regards
Jesús
