Icinga2 HA does not show Grafana graphs correctly

Hi.

I have an Icinga2 setup in high availability, with master1 and master2. On master1 I have also installed Graphite and Grafana. If the icinga2 service is started on master1 only, I can see the graphs in Grafana successfully. But if I also start the icinga2 service on master2, the graphs are not the same; it seems that not all the data gets written. Finally, if I stop the icinga2 service on master2, the graphs are OK again.

I’ve run tcpdump -i eth0 port 2003 on master2, and I think the traffic there looks OK.

Is it possible I’ve misconfigured something in Icinga2 on master2?

My environment:
master1 and master2: Ubuntu 16.04, Icinga2 2.9.1.1.
master1: Grafana 5.2.4, graphite-carbon 0.9.15-1

Any help is welcome.
Thanks in advance.

Hello,

you should post your configuration for the GraphiteWriter and the Grafana module.

Regards,
Carsten

Hello Carsten.

Here it is:

master1 (graphite.conf)

library "perfdata"

object GraphiteWriter "graphite" {
  host = "localhost"
  port = 2003
  enable_send_thresholds = true
  enable_send_metadata = true
}

master2 (graphite.conf)

/**
 * The GraphiteWriter type writes check result metrics and
 * performance data to a graphite tcp socket.
 */
library "perfdata"

object GraphiteWriter "graphite" {
  host = "icingamaster1-FQDN"
  port = 2003
  enable_send_thresholds = true
  enable_send_metadata = true
}

master1 grafana.ini

[grafana]
host = "localhost:3000"
protocol = "http"
defaultdashboard = "icinga2-default"
defaultdashboardstore = "db"
theme = "light"
datasource = "graphite"
accessmode = "direct"
directrefresh = "no"
height = "280"
width = "640"
enableLink = "yes"
shadows = "0"
usepublic = "no"

I have no grafana.ini on master2 because I don’t have Icinga Web 2 in HA yet.

I should say I’ve installed Grafana itself, but not as a module. Is it better to use the Grafana module?

Thanks for your time.

The Icinga2 side looks OK to me. Did you check that master2 can write to the Graphite port on master1?
Also check the log files of Graphite; I think the problem is there.
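
For example, a quick connectivity test from master2 (assuming netcat is installed; the FQDN is the placeholder from your config):

nc -vz icingamaster1-FQDN 2003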

Hi,

current versions with the Graphite feature enabled inside an HA zone will have both sides active, and as such, metrics are written as received from both ends. Your carbon-cache/relay should be able to filter/deduplicate that, or you’ll keep two Graphite instances. 2.11+ will bring HA awareness here, meaning that only one endpoint actively writes to carbon-cache.

Try enabling the debug log and grep for GraphiteWriter, especially the sent metrics. Compare the timestamps from these logs, and determine which graphs are sourced from which metrics.
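
For example (assuming the default feature and log paths on Ubuntu):

# enable the debug log on each master and restart
icinga2 feature enable debuglog
systemctl restart icinga2

# then grep the sent metrics
grep GraphiteWriter /var/log/icinga2/debug.log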

Plus, 2.9.1 might contain bugs in this area; 2.9 is not actively developed anymore.

Cheers,
Michael

@anon66228339: I checked master2. I forgot to open the security group :frowning:. Now master2 sends data to Graphite. :)

@Michael: If I understand you correctly, both Icinga masters send data to Graphite? I thought only the active Icinga endpoint sent data to Graphite.

And finally, you’re telling me to upgrade Icinga to at least 2.10, right? :wink:

Regards
Jesús

Exactly. You can see that with a short tcpdump on master1, splitting between the incoming sources (local vs. master2).
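
For example (the master2 address is a placeholder):

# metrics arriving from master2
tcpdump -i eth0 'port 2003 and src host <master2-IP>'

# metrics written locally (master1's GraphiteWriter points at localhost)
tcpdump -i lo port 2003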

In terms of an upgrade: it is desirable, but I would first analyse the problem further. Since you didn’t know there were two writers instead of one, this alone could already lead to the solution.

Cheers,
Michael

Ok. I’m working on it.

Thanks a lot again.

Jesús

Hi Michael,

You told me:

Your carbon-cache/relay should be able to filter/deal with that, or you’ll keep two instances

But do you have an example of this? I haven’t found any.

Regards.
Jesús

That’s an assumption on my part from listening to talks at conferences; I have never used Graphite relays in production myself.

Then, in an Icinga2 HA environment, do you usually have Graphite on each machine?

Regards.
Jesús

I would have to check the documentation again, but what I remember is that the carbon-cache daemon caches all incoming data for the duration of the flush interval. When it writes the data to disk, the whisper library handles the incoming data and aligns it if there are multiple values for the same key within the window [timestamp_now - interval, timestamp_now].

So, if you have an interval of 60 seconds, just one point will be written per minute.

There are a few exceptions, for example if the carbon-cache daemon flushes the data more often than that. So you can generally rely on the deduplication by Graphite, with only a very few exceptions.
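
As an illustration (the path is the Ubuntu default for graphite-carbon; the retention values are just an example, and ^icinga2\. matches the GraphiteWriter’s default metric prefix):

# /etc/carbon/storage-schemas.conf
[icinga2]
pattern = ^icinga2\.
retentions = 60s:7d,5m:90d

With a 60s first archive, whisper stores at most one value per 60-second slot, regardless of how many writers send data.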

Here is what our prod system looks like:

  • 2 Icinga master nodes
  • 2 nodes with Graphite and Grafana

Both Icinga nodes have a local carbon-relay running. carbon-relay allows you to specify multiple targets, so the relay daemon knows of both Graphite nodes and duplicates the data so that both nodes receive the same data. The relays also cache the data for a limited time in case of outages or short network interruptions. If you experience larger outages, network partitions or other events that prevent the relay from sending the data to the carbon-caches, you should take a look at the utilities from carbonate: https://github.com/graphite-project/carbonate

They are really good and useful if you want to align data across Graphite instances, or backfill or migrate data, for example.
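
To illustrate the relay setup (a minimal sketch with the Debian/Ubuntu paths; the Graphite hostnames are placeholders): each Icinga node runs a carbon-relay listening on localhost:2003, and the GraphiteWriter points at it, exactly like the localhost config above.

# /etc/carbon/carbon.conf, [relay] section (sketch)
[relay]
LINE_RECEIVER_INTERFACE = 127.0.0.1
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = rules
DESTINATIONS = graphite1.example.com:2004, graphite2.example.com:2004

# /etc/carbon/relay-rules.conf: the default rule lists both caches,
# so every metric is duplicated to both Graphite nodes
[default]
default = true
destinations = graphite1.example.com:2004, graphite2.example.com:2004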

Edit: this is just my recommendation if you do not need, or do not have the resources for, a whole full-blown Graphite cluster.

Hi Marcel.

I am very grateful for your ideas and recommendations, because I was lost on this topic and this opens a way forward for my deployment.
I had configured one Icinga2 master with Graphite and Grafana on the same machine and everything ran fine, but since I tried to deploy Icinga2 HA I’ve struggled to understand things.

Regards
Jesús

Typically I would build it like that. 2.11 provides the capability to have either one endpoint or the other with an active Graphite feature, but that’s not released yet, nor does it have a release date.
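
To sketch what that could look like (the enable_ha attribute is my assumption about the planned 2.11 syntax, not released yet):

object GraphiteWriter "graphite" {
  host = "localhost"
  port = 2003
  enable_ha = true // assumption: planned 2.11 HA-awareness flag
}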

Hi Michael.

I’m waiting impatiently. :wink:

Seriously, thank you for this great software. Today I’m updating Vagrant to take a look at Icinga reporting.

Regards
Jesús
