HA/Distributed Architecture ideas for 2 datacenters

Hi,
I think many of you have already been in the same situation: trying to find the best and simplest monitoring setup for multiple locations. Perhaps you can share how you set up your monitoring environment.

Now:
Today we have one offsite datacenter with a highly virtualized environment and one Icinga2 master (VM) on a dedicated ESXi host. The VM lives on a local SSD datastore on that host, so monitoring keeps working even if the central SAN storage fails, and it carries everything Icinga2 needs (idodb, icingaweb2, grafana).

Future:
This year we are building a second offsite datacenter that will host half of our environment and should function as an HA datacenter: if one datacenter fails (completely offline, fire, other disasters), the remaining datacenter takes over. On the VMware side we will use stretched clusters so we can benefit from VMware HA. Each location will have a dedicated ESXi host for the Icinga nodes.

Scenario/Idea 1:
Each location gets an Icinga satellite that monitors only the environment where it is located. In addition there is one Icinga master that holds the idodb, icingaweb and grafana. This VM could be protected with VMware Fault Tolerance across both locations, since it does not carry much check workload of its own.

Scenario/Idea 2:
Two HA masters, one at each location, joined in one master zone. It sounds simple, but I see complexity in making the idodb and icingaweb highly available (maybe a dedicated node just for idodb and icingaweb, as in scenario 1?). The two masters would also distribute checks randomly across both locations, which would cause much more traffic on the dark fiber connections between them.

How have you set up your two-location environments?

Many thanks so far
Robert

I would definitely go for your scenario 1 - a fully self-contained Icinga
Satellite setup in each data centre.

Whether I would put the Icinga Master on a VM shared across the data centres,
or at a third, completely external, location, is something I can’t really
comment on, due to my lack of familiarity with VMware Fault Tolerance.

However, I do like having something which monitors whatever I am
interested in, from the outside (ie: the Icinga master at a third location
would be checking Internet availability of each data centre, as well as
collecting data from the two Satellites).

I think Icinga dual-Master HA is all very well when the two Masters are in one
location, and monitoring the same bunch of servers, but splitting these across
your two data centres would mean you have all sorts of unpredictable
interactions between your Masters, and the monitored Agents, when one of your
locations goes (perhaps partially) down, and I think your monitoring staff
would find it difficult to interpret quickly and accurately what Icinga is
telling them about the state of the two locations.

I would far prefer to have Satellite A telling me everything I need to know
about servers in location A, and Satellite B telling me everything I need to
know about servers in location B, with the Master then giving me an overall
composite view of the state of my network (including external access to A and
to B, and also interconnections between A and B).

Then, if location B starts having problems of any sort (and remember that the
most difficult problems can be the ones where something has not completely
failed, but is unreliable / intermittent / slow / etc), you can rely on
Satellite A still telling you all you need to know about location A, with no
chance that the problems at B are influencing the data.

Antony.

Thank you for your thoughts. One problem with scenario 1 is the takeover of VMs from location 1 to location 2 when one location fails completely. The VMs that were configured in location 1 and moved to location 2 via VMware HA would then not be monitored at location 2, or recognized as recovered. A first idea could be a third Icinga satellite node dedicated to the “movable” VMs.

Assuming those VMs keep their FQDN and IP address(es), you just need to change their zone from satellite A to satellite B. This could be done online via the REST API.
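
As a rough sketch of the first step (finding out which hosts currently sit in satellite A's zone), something like the following Python could build the query against the Icinga 2 REST API. The master hostname, the zone name `satellite-a` and the helper name `build_zone_query` are made up for illustration; the request is only built here, not sent, and authentication is left out. Whether `zone` can then be changed on a live object, or the host has to be re-created in the other zone, is something to verify against the API documentation before relying on it.

```python
import json
import urllib.request


def build_zone_query(base_url: str, zone: str) -> urllib.request.Request:
    """Build (but do not send) a query for all hosts in the given zone."""
    body = {
        # The filter refers to a variable that is supplied via filter_vars.
        "filter": "host.zone == zone_name",
        "filter_vars": {"zone_name": zone},
        "attrs": ["name", "zone", "address"],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/objects/hosts",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            # A query with a JSON body is sent as POST with this override.
            "X-HTTP-Method-Override": "GET",
        },
    )


# Hypothetical master address and zone name, for illustration only.
request = build_zone_query("https://icinga-master.example:5665", "satellite-a")
print(request.full_url)
print(request.data.decode("utf-8"))
```

The list that comes back would be the set of candidates to reassign to satellite B.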

That would mean having only one global zone in which each satellite knows all hosts, instead of one zone per location, right? Would you work with predefined scripts that list the initial location of every host object, or with an additional custom var that records where to put each host once everything has recovered?
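
To make the custom var variant concrete, this is roughly what I have in mind, written as plain Python over a list of host records. The custom var name `home_zone`, the zone names and both function names are invented for this sketch, and it only plans the moves; it does not talk to Icinga at all.

```python
from typing import Dict, List, Tuple

Move = Tuple[str, str, str]  # (host name, current zone, target zone)


def plan_failover(hosts: List[Dict], failed_zone: str, surviving_zone: str) -> List[Move]:
    """Every host currently assigned to the failed zone moves to the survivor."""
    return [
        (host["name"], failed_zone, surviving_zone)
        for host in hosts
        if host["zone"] == failed_zone
    ]


def plan_failback(hosts: List[Dict]) -> List[Move]:
    """Hosts whose current zone differs from vars.home_zone go back home."""
    moves = []
    for host in hosts:
        home = host.get("vars", {}).get("home_zone")
        if home and host["zone"] != home:
            moves.append((host["name"], host["zone"], home))
    return moves


# Hypothetical inventory: one host per location, each tagged with its home zone.
inventory = [
    {"name": "app01", "zone": "satellite-a", "vars": {"home_zone": "satellite-a"}},
    {"name": "db01", "zone": "satellite-b", "vars": {"home_zone": "satellite-b"}},
]
failover = plan_failover(inventory, "satellite-a", "satellite-b")

# Pretend the failover moves were applied, then work out what to undo later.
after_failover = [dict(host, zone="satellite-b") for host in inventory]
failback = plan_failback(after_failover)
```

With the home zone stored on the host itself, no separate list of initial locations would have to be maintained next to the Icinga config.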

I would monitor anything that can move between data centres separately from
the services which run at fixed locations.

This brings me back to my preferred setup of:

  • Icinga Satellite at location A, monitoring stuff which is definitely at A
  • Icinga Satellite at location B, monitoring stuff which is definitely at B
  • Icinga Master at location C, getting data from A & B, and monitoring
    anything which can move

That means Icinga A tells you whether data centre A is running as expected.

Icinga B tells you whether data centre B is running as expected.

Icinga C tells you whether your HA moveable resources are running somewhere,
without being fussy about where they are running; this also tells you they are
accessible from location C, which can be valuable information if the two data
centres can see each other but have external connectivity problems.

Antony.

No, you need a master zone and one zone per satellite. The satellite zones are what I was referring to when I mentioned switching a host to the other satellite zone.

Hi Antony,

Indeed, this was our preferred choice as well, while making sure we have an extra instance per location/zone.

master zone - Master1 + Master2: masters are checking nodes in zone1 (incl Sat’s) and domains and ssl …
zone2 - Sat1 + Sat2: Sat’s are checking endpoints of zone2
zone3 - Sat1 + Sat2: Sat’s are checking endpoints of zone3

Hi Roland,
do you have a short description of how that could be done?
Something like: all failed hosts with check_source = “failed satellite” get a new host_template (import) that defines that they are checked from the other site?
What would be the correct trigger? Certain checks in Icinga on the satellites that fail?
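
To sketch the kind of selection I mean: given host entries shaped like the `results` list that the hosts endpoint of the REST API returns (as far as I understand it), pick the hosts whose last check came from the failed satellite and is older than some threshold. The function name, the endpoint names and the 300 second threshold are my own assumptions, and the sample data is invented.

```python
from typing import Dict, List


def hosts_stuck_on(results: List[Dict], satellite: str, now: float, max_age: float) -> List[str]:
    """Names of hosts last checked by `satellite` more than `max_age` seconds ago."""
    stuck = []
    for entry in results:
        attrs = entry.get("attrs", {})
        # last_check_result can be missing or null for hosts never checked.
        last_result = attrs.get("last_check_result") or {}
        if last_result.get("check_source") != satellite:
            continue
        if now - attrs.get("last_check", 0) > max_age:
            stuck.append(entry["name"])
    return stuck


# Invented sample data with timestamps in seconds; "now" is taken as 1000.
sample = [
    {"name": "vm-a", "attrs": {"last_check": 100, "last_check_result": {"check_source": "sat-a"}}},
    {"name": "vm-b", "attrs": {"last_check": 950, "last_check_result": {"check_source": "sat-a"}}},
    {"name": "vm-c", "attrs": {"last_check": 100, "last_check_result": {"check_source": "sat-b"}}},
    {"name": "vm-d", "attrs": {"last_check": 100, "last_check_result": None}},
]
candidates = hosts_stuck_on(sample, "sat-a", now=1000, max_age=300)
```

That only answers which hosts would be affected; the open question is still what should fire the move in the first place.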