Master node as multi-site aggregator

Hi,

I’m investigating the functionality of Icinga to see if it would be a good replacement for our current setup.
The overall architecture I’d be looking at would be a single master (possibly in HA) and a set of satellites - nothing special here. Due to how we’re doing configuration management, though, we can’t approach the setup in a top-down fashion; it would instead be much more natural for us to store checks on the satellites as plain text files and use the master purely as an aggregator. In other words, I would want something closer to a multi-master + aggregator architecture rather than master and satellites.

I have been reading several threads and docs, and so far I have been able to determine the following:

…and I thus have a few questions:

  • is a master able to not just display checks that are stored on satellites, but also communicate back to them (e.g. to issue downtimes)?
  • if not, could I use a centralized icingaweb2 to connect to multiple IDO DBs?
  • is a multi-master, multi-IDO setup something I should not rely on, because it might be deprecated at some point?

Thanks!

Hello and Welcome.

In your proposed scenario the master is the “aggregator” and is thus the place where you can see the status and control the sending of notifications, so defining downtimes on the master will cover that.
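
For illustration, a rough sketch of one way a downtime can be defined on the master (the host name and schedule are placeholders; an ad-hoc downtime set via Icinga Web 2 or the API for an object checked by a satellite works the same way):

```
// Placeholder host and schedule: the downtime applies to an object the master
// knows about, even if the actual checks run on a satellite.
object ScheduledDowntime "weekly-maintenance" {
  host_name = "remote-host.example.com"
  author    = "ops"
  comment   = "Weekly maintenance window"
  ranges = {
    sunday = "02:00-04:00"
  }
}
```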

I am unsure why you’d want an independent IDO for each satellite. If you want to aggregate the data, a single database is surely the way to go, and you can use the database’s HA functionality for redundancy.
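
As a sketch (host and credentials are placeholders), the ido-mysql feature on the master would then point at that single database, with redundancy handled on the database side (replication, a virtual IP, …):

```
library "db_ido_mysql"

object IdoMysqlConnection "ido-mysql" {
  host     = "db.example.com"   // e.g. a virtual IP in front of a replicated pair
  database = "icinga"
  user     = "icinga"
  password = "CHANGEME"
}
```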

Maybe some interesting stuff to read for you:

As @aflatto said, the master will be able to distribute downtimes etc.
That is basically what the master does: Receive check results from the satellites, distribute the config, dish out commands/downtimes/comments to hosts/services, send notifications.
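
To make that concrete, here is a minimal sketch of the usual zone layout (hostnames and zone names are placeholders), roughly what would go into zones.conf on the nodes:

```
object Endpoint "master1.example.com" {
}

object Zone "master" {
  endpoints = [ "master1.example.com" ]
}

object Endpoint "satellite1.example.com" {
  host = "satellite1.example.com"   // the master connects out to the satellite
}

object Zone "site-a" {
  endpoints = [ "satellite1.example.com" ]
  parent = "master"                 // results flow up, commands/downtimes flow down
}

object Zone "global-templates" {
  global = true                     // synced to every node, e.g. for shared templates
}
```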

Maybe the schema drawings I made some time ago also help you :slight_smile:

The various zones are in separate geographical regions and belong to separate customers. From a configuration management perspective, they are in separate, non-communicating domains. Additionally, all the zones already have properly sized and redundant databases, whereas the master would live in a much more resource-constrained environment and should thus try to be rather lightweight. Finally, connectivity from the master to the satellites can at times fail, so the separate zones must be able to operate in full autonomy.

Effectively, zones should be separate Icinga deployments, but for operational convenience we also need a centralized dashboard (hence my mention of Thruk).

Yeah, I did read that thread and it’s good to hear that a single database can handle that many checks, but the nature of our deployments does require a more distributed approach.

Thanks for the replies! I will run some tests in the lab to see how icinga reacts to having separate IDO DBs.

OK, I think I’ve hit a roadblock: the master node must have an IDO, which is something I would want to avoid. If that’s correct, Livestatus + Thruk would remain the only way for me to aggregate separate sites, but the fact that Livestatus might be deprecated soon gives me a lot of pause :confused:

Reviving this thread as I’m working on the same thing.

The scenario is to replace a setup where we have multiple sites, each locally managed using Juju in the same way folks might do with Puppet or Salt - the hosts to monitor send their config to the config master, which currently sends it to a Nagios host (and uses NRPE). The local Nagios is entirely standalone and links to a pager for notifications, etc. So far, so good; I can see how we can easily replace that with Icinga 2 and Icinga Web 2, and I have built a demo for that which looks great.

The thing we do next, however, is where we struggle. We connect from our tiny management box to all the remote Nagios hosts using mklivestatus and Thruk, so that we can issue commands (downtimes etc.) and view check status from all the remote sites at once. The connectivity to remote sites is slow and unreliable, and we rely heavily on Thruk’s local cache of info to make it usable.

Our remote sites are designed to be independent and able to be disconnected permanently from our central management at short notice, with minimal effort. We would therefore install Icinga Web 2 on each remote site to allow this; hence the need for an IDO on the satellites.

Having attempted to play a little with a master/satellite/agent setup, with the master on our central host, a satellite on each site, and agents linked to the satellites, I have a few queries:

  • All the docs I’ve seen so far suggest that conf.d is no longer the right place for host and service configs, and that we would need to put the host/service configs into the zone directory on the master, to be pushed down to the satellites as appropriate. In our setup, the config files are generated by config management that is not available to the master, only to the satellites. Does this mean we generate configs at each site, then transfer them to the master so it can transfer them back to the satellite? Can I add configs for hosts, services, etc. on a specific satellite and have the master set to “just go grab all the info from this satellite, and issue commands to it”? Unfortunately we’re monitoring a variety of configurations that are all fairly diverse.
  • In my test environment, when I put host/service configs in conf.d on the satellite and enabled reading that dir, I could only see the hosts defined on that satellite. I guess that’s to be expected, but it’s the opposite of what we want the master for. Am I reading this wrong?
  • What’s actually stored in the IDO? We’re faced with tight security requirements about shipping data offsite from the remote sites, so we’d need to ensure that we can quantify what we’re replicating and how we secure that.

What we might be wanting to do could be a lot simpler, and I might be overthinking this. The only real reason we even need a central master is so our team has an easy way to view and silence alerts from many remote sites; that might be something we can overcome another way. I’d prefer to avoid having to log into 30 Icinga web interfaces at once :slight_smile:

So I’ll try to answer parts of both of your questions:

  • The most important part is: Yes, you need all the configuration on the master. The bottom-up approach was more or less never used and was confusing, so it got removed. The master needs to know every object, or it will discard information about it.
  • You don’t need an IDO for a master. If you don’t want to have Icinga Web 2 or anything else that relies on the IDO, you can skip it.
  • You can have “local” IDOs on your satellites and Icinga Web instances which use the data from there. They will only hold the data of the satellite they are connected to and all nodes below it in the tree. One customer would not be able to see the data of another one. This is mostly discouraged because it’s confusing, not for a technical reason (as far as I know).
  • You can have satellites send out notifications. The same rules as for the IDO apply.
  • The conf.d directory is mostly used for default configuration, so the approach of putting everything in zone directories is far better and more flexible (see the sketch after this list).
  • You don’t have to rely on config files. Although many users prefer them, you can also use the Icinga 2 API, various ways of syncing, or the Icinga Director API.
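
As a rough sketch of the zones.d approach from the list above (paths, zone and host names are just examples): objects for a satellite zone live on the master under zones.d/<zone> and are synced down to that satellite automatically.

```
// File on the master: /etc/icinga2/zones.d/site-a/hosts.conf
// (the zone is taken from the directory name, no explicit zone attribute needed)
object Host "remote-host.example.com" {
  check_command = "hostalive"
  address = "192.0.2.10"            // placeholder address
}

// File on the master: /etc/icinga2/zones.d/site-a/services.conf
apply Service "disk" {
  check_command = "disk"
  assign where host.zone == "site-a"
}
```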

Thanks so much for the response - that really does clear it up a lot.

I also found Icinga DB today, along with its nice-looking web module - I’m setting up a test environment for that now, but I have a quick question on the layout of services for our use case. I wonder if you could confirm I’m on the right path:

  • Remote sites: Icinga 2 with icinga-redis-server and, optionally for local access, Icinga Web 2 with an IDO
  • Central site: icingadb, a MySQL IDO, and Icinga Web 2 plus the Icinga DB Web module
  • The central site would connect to each remote Redis to collect info and sync it to the central MySQL instance, which is then read by the web interface

If that’s correct - is there any way (apart from firewalling) to secure the communication to Redis? I’d probably want to use TLS to cover the security requirements of our customers.

I’m struggling to find out how to get icingadb to read multiple remote Redis hosts - am I reading that wrong?

You’re very welcome.

I’m afraid you got it wrong. :slight_smile: Having a local Redis, database and Icinga Web 2 is more or less an unofficial “workaround” and was never considered best practice in real-life setups. It’s just that during development maybe no one thought of the possibility that someone might need local web interfaces that only show part of the data. That’s one reason why this community is so valuable.

The way Icinga 2 works is that all data runs upstream in the tree (agent -> satellite -> master) using the Icinga cluster protocol over port 5665/tcp. All required data takes this path. The masters then make sure they talk to their local Redis and a common (common for both masters, that is) database. Where “common” can mean two nodes that replicate each other, a single instance, a virtual IP…

In short: Every node has its own Redis, but all nodes within one zone share a common database. No instance will connect to the database of another zone.
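
For illustration, a minimal sketch of the icingadb feature as it would look on any node (values are placeholders): it always points at that node’s local Redis, while the icingadb daemon on the masters reads from their Redis and writes into the shared database configured in its own config file.

```
object IcingaDB "icingadb" {
  host = "127.0.0.1"   // the node's local Redis, never a remote one
  port = 6380
}
```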

That explains it totally - many thanks!

So this thread might help some future development discussions, where it’s worth considering a very lightweight (Thruk-like) means of viewing the data from multiple masters in different locations, but without needing to have the database from all of them.


Just remember that Livestatus is deprecated and, if I remember correctly, Thruk needs Livestatus.