Architecture at Icinga (a little bit of criticism ...)

Hello, everybody.

Just a few things from my side:

There have been many exciting ideas and suggestions in the last few years, but unfortunately they often didn’t make it beyond a prototype or a demo. On the one hand this is understandable, given that development capacity is finite. On the other hand it means that everyone has to ask themselves whether a given feature will ever exist or whether it will remain a ghost.

I like many of the features that are now implemented as PHP modules, but I miss the distributed approach, which has always been a big advantage of Icinga. Especially things like import sources (I love them!) in the Director or the vSphere module are hard to use in production environments once you implement strong security and segmentation. I talked to Tom about this at the last OSMC and he showed me examples of what a distributed setup can look like. Without this, most new modules are worthless for many setups.

I understand that much of the Icinga universe has grown organically over the last few years, but we’re at the point where we need a solid roadmap where the role of the community is also addressed. Personally, I’ve had a lot of fun developing modules, but I don’t really feel that any value is being placed on these contributions beyond the official modules, nor is any real effort being made to involve external developers early on to ensure that after major releases everything still works as usual. This makes me tired and leads me to the next point: documentation.

The (developer) documentation for Icingaweb2 is, in fact, not available (apart from a not really maintained GitHub repo)! To understand things or figure out how certain integrations are possible, I have to analyze the code every time and hope it doesn’t break with one of the future versions. There are a few people who willingly answer the majority of questions, but that can’t replace documentation! Without this documentation and communication there can’t be much input from the community.

It would be a pity if this great project with its even greater community, in which many friendships have been made, were to take the wrong path.

Nicolai


Hi,

I want to add a few personal thoughts:

I have been a regular contributor for quite a while now, especially to the Icinga 2 core. I do this in my spare time because I have fun doing it and learn new things, and I also want to give something back.

Lately I lost a big part of my motivation to work on Icinga-related things. In the last few months and weeks I found myself doing more research on changes, trying to find out the reasons for them or trying to understand the context. The quality of the documentation for code changes has decreased. I am aware that many of the Icinga developers meet regularly in person and can therefore exchange information about changes. For “external” contributors, the written-down “why” behind a change is important for following the development.

I also noticed a case where a contributor opened a new pull request and had trouble building due to an error in the changed code. The answer was just “Fix that error.”. Such a harsh and unfriendly tone should be avoided, because it will scare off new contributors.

I sign that, true words!

Best regards
Michael


I’ve been part of the Icinga community for quite some time and I see things from a lot of different angles. I’m part of the Icinga team and I’m an employee at an Icinga partner; I’m not a developer, though, but focus on support and documentation. I took the time to talk to some people from every part of the equation, and there are several things I learnt during the last weeks.

The picture I got is that this problem consists of several levels which overlap and reinforce each other but exist for very different reasons. The following is a very personal view based on what I saw recently, mixed with things I heard. This is in no way something official from the Icinga project.

  • It’s a hard time for developers. First came the mammoth project called Icinga DB, then the JSON-RPC bug which brought us 2.11.3, and then external factors like the Corona lockdown and others. As most of us know, communication, documentation and community work take a lot of effort. Some issues were so pressing that maybe even some of those critically important tasks slipped out of focus. The JSON-RPC bug in particular took so many resources that I can understand why other things lacked attention. It might not have been completely obvious to everyone here, but this bug hit us as a partner hard. And I mean hard. I’m very thankful to the core developers for dedicating so much to solving this bug, and I can understand if they had to reprioritize (which I assume they did). If you weren’t affected by the bug, you might just have noticed the level of communication going down. I can only guess how much the aforementioned factors affected communication and community work, but I can very well believe that there is some connection.
  • I do see some problems in the level of communication. Like Blerim said, there are things to come, and there were events scheduled, like IcingaConf, where they should have been shown to the public. Maybe it’s sometimes problematic to polish presentations to perfection and wait for conferences. Everyone wants presentations to be perfect because it’s part of the experience. Just throwing bits of info at the community wouldn’t satisfy anyone either. Maybe it’s time to take a step back and review how to get information across fast enough without sacrificing professionalism. I personally think it’s extremely hard to satisfy everybody, especially since the community and user group is growing beyond IT-keen people who are interested in the naked truth and raw information. There are needs from other groups to satisfy, too. That makes it even harder to find one way to communicate.
  • What I personally think Icinga lacks is, like Nicolai said, insight into why something was built in a certain way, and even more so insight into what is planned, so people can chime in early. But I also see that this would mean tons of extra work. It would be close to impossible to find solutions which everyone likes, and there would be a lot of hollow discussions to be fought (my personal view). We all know how many trolls every community can attract, and while we are very happy to have very few of them in here, I can imagine that discussing which way something should go architecture- or implementation-wise could end up in endless discussions leading to nothing but frustrated developers and community members. I can see a point in not discussing every little bit in the open, although I personally would still prefer slow but open progress (as long as there still is progress).
  • Not everyone gets all news from every channel. This might be the most important part. I hear a lot that people are complaining about lack of information. Then I ask people responsible for communication and they tell me: it’s all there. And really, there’s blogposts, talks, GitHub Issues and so on. On the other hand I hear people complaining and I tell them: Go, give this feedback to other people, too. And they say: I did. But still they feel unheard. I don’t want to put myself outside of this circle. I complained about not being informed, too. As an answer I get links where I could easily have found the information. Sometimes in places I really should have looked but didn’t. Sometimes in places I wouldn’t have dreamt of looking.
  • Then there’s the part where there are really different approaches to what is communicated where when and if at all. That’s the part to be discussed in the upcoming meeting and in other places, I think.

To sum it all up, I can see several different factors which all end up in one thing: People think they don’t get enough information. People in the community feel like they don’t get enough info from the project but give feedback. The people in the team think they don’t get enough feedback but put lots of information out there. Everyone thinks they give their best and on the other side nobody listens. And it’s hard to blame someone because most people (including the Icinga project) try or at least tried.

I, for myself, took several things to do out of the discussion:

  • Even if it looks like it’s always one problem recurring, there are many, many different reasons why I might not have gotten the information that I needed or why I was not heard. I’ll try to differentiate whether it’s due to something which has to be changed, or just something bad happening where we have to wait until it’s over, or whether I should just retry getting the info across.
  • I will try to give information and especially feedback. Not once, but multiple times. To the same and to different people as well. Sometimes it’s the wrong time, the wrong tone, the wrong wording. If you want to be heard, you have to be sure that you really reach the receiver.
  • I’ll try to cut back on emotions in the discussion. We all want the project to thrive, so ultimately we all want the same thing. Let’s do this together.
  • I’ll try to listen more carefully. Lots of the things you want to know are out there. In the community, in the news from the project.
  • I’ll ask more. Maybe I’ve missed information. Maybe someone didn’t think something was of importance for me. Maybe someone doesn’t want me to know something in the first place but changes their mind when they think twice.
  • And ultimately I’ll take my place in the upcoming discussion about all this. I want to be part of the team and of the community and to help find a way we all can live with.

Hi,

I thought about this for some days because I tried to come up with ideas for improvements and specific ideas but I have to admit that it’s hard. So I’ll just add a few things basically in addition to what Bodo and Nicolai said.

1st topic - Architecture & the Ops view:
I spent most of the past 10+ years with global players in the pharmaceutical industry and mostly ISPs and carriers. A very few of them are open to open source and to solutions provided by smaller and sometimes even very small companies. They realized that there is more than IBM, HPE, Nokia, Siemens, Cisco, Solarwinds and all the other big players. That said, I would tend to say that all of them have hardware and software in place which they have been using for many years. They have highly trained, experienced and therefore skilled engineers in key positions who know every single bit of the applications they are responsible for. They have very strict requirements and expectations regarding new tools, and I totally understand it (most of the time).

Fun fact: You would be surprised how many ISPs still rely on SNMPv2 in their core networks.

So the 1st big challenge was to win them as a customer. The 2nd big challenge was the integration into their environment.
Examples of restrictions and requirements that come to mind:

  • defined routes & communication (i.e. endpoints, who initiates the connection, encryption, …) between all components
  • no outgoing connections from an inner to an outer network (so no Prometheus in the default configuration)
  • as few services as possible
  • High-Availability
  • documentation about running services and handbook for the “first responder” if any alarm comes up and any of the services is not behaving as expected
  • visibility

Icinga2 is actually a great tool for this purpose, which made it relatively easy to convince the customers to approve its usage. We have/had defined endpoints, a great architecture of the components (the core, the possibility to enable and disable modules living in the core, the IDO), low resource usage and a great community to share experience, issues and plugins. I just read up on the more recent modules and am afraid that the additional daemons and dependencies would make this process harder. It may overwhelm the customers a bit and lead them to force suppliers proposing Icinga2 to drop it and stick to what they already have.

2nd topic - missing(?) roadmap:
I specifically searched for information when I read this thread and yes, there is some info available in newsletters, on GitHub, in the community and in other places, but I couldn’t find a real roadmap. To me, this is something that sometimes holds me back from investing in new development of modules, automation or other things related to Icinga2. A roadmap would be really helpful to know where the project is going and to be able to evaluate our own ideas when planning projects. It may also help to inspire and motivate users to share what they have in place when it comes to related topics, or just to contribute to the project. I guess we all love to see new modules, features and an active community. :)

Sorry for the quite long post, but to me it seems important to keep in mind the kind of companies and potential users I mentioned in the first topic, too. I really enjoyed (I only use the past tense because I have barely done it since I changed my job half a year ago) working with Icinga2 and bringing it into new environments. I feel a bit sorry that I can’t really provide any alternatives or specific solutions, but I hope this thread and the upcoming meetup help to unite the community, contributors and developers again and trigger valuable improvements on all sides. :)

Cheers,
Marcel


Hey there!

Why don’t you join us in our jitsi meet in half an hour to discuss?

I’m afraid this is a bit too short notice to incorporate your post from our side, but maybe you can give your input yourself?

Greetings
Feu


No worries, I’ll join. :)

See you there,
Marcel


What is the latest URL to join the meeting, if there still is one?

Update: Looks like we were hijacked properly.

Sincerest apologies, we will have to move this meetup.

It looks like we were poorly prepared for the amount of people.

Let’s not waste everyone’s time by keeping at it today.

Next week, same time, details will follow.


As an example, and so that one can also picture it visually.
This is just a current (and small) customer setup.
The backend / frontend clusters each contain 5-7 Icinga2 satellites.

If I dig deep into my memory, I can also piece together the setup of EOS, which was twice as big and much more complex.

If anyone else here would like to visualize their setup in a similar way, I’m happy to help out.


@bodsch asked me to post a picture of a typical setup I support at my customers, so here you go. It’s ugly and it shows the master zone only; other satellite zones are mostly HA too, without their own databases or web interfaces.

Sometimes everything is virtualized, sometimes one master/db/influx/webinterface/grafana is on its own hardware/storage to avoid downtimes due to data center problems (dogs don’t eat their own food sometimes).


I guess the Icinga DB feature would help to scale the monitoring infrastructure?

From the architecture point of view, I wish for a push based monitoring model in Icinga2

I guess the Icinga DB feature would help to scale the monitoring infrastructure?

Maybe.
In this case it would help to integrate IcingaDB into one of the architecture overviews above.
Also to make communication relationships more visible.

From the architecture point of view, I wish for a push based monitoring model in Icinga2

push based?
You should define that a little bit more exactly! Who should push what where?
The satellites may push their results to the master.

What I meant by the “push model” is that clients run the tests and send the results to the satellites/master, instead of the active/pull model. Something like what BB/Xymon do via a daemon running on the clients.

The existing Icinga2 passive checks are way too complicated to set up and, in my experience, not reliable.

Create a passive service; it’s possible. You can even create a service dynamically over the API.
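
A minimal sketch of what creating such a service over the Icinga 2 REST API could look like in Python. The host name `web01`, the service name `backup-job`, the URL and the helper name `build_create_service_request` are made up for illustration, and `dummy` is used here only as a placeholder check command. The request is built but not sent; sending it would additionally need credentials for an API user that is allowed to create objects.

```python
import json
import urllib.parse
import urllib.request


def build_create_service_request(base_url, host, service):
    """Build (but do not send) a PUT request that creates a passive service."""
    # Services are addressed as "<host>!<service>"; quote the name for the URL path.
    name = urllib.parse.quote(f"{host}!{service}", safe="")
    body = {
        "attrs": {
            "check_command": "dummy",       # placeholder command (assumption)
            "enable_active_checks": False,  # results are pushed in, not polled
            "enable_passive_checks": True,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/objects/services/{name}",
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Accept": "application/json", "Content-Type": "application/json"},
    )


# Hypothetical master address, host and service names.
req = build_create_service_request(
    "https://icinga-master.example:5665", "web01", "backup-job"
)
```

Building the request separately from sending it keeps the example free of network access and makes it easy to inspect what would go over the wire.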


Hey there!
Like Carsten said, passive checks sound like what you described.
I just wanted to share the link to the documentation here too, so you can read up on it:

https://icinga.com/docs/icinga2/latest/doc/08-advanced-topics/#external-passive-check-results
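
As a rough sketch of the client side of this: the `process-check-result` API action takes an exit status, the plugin output and an optional `ttl` in seconds, after which Icinga expects the next result. Host, service, URL and the helper name `build_check_result_request` are assumptions for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_check_result_request(base_url, host, service, exit_status, output, ttl=None):
    """Build (but do not send) a POST request carrying one passive check result."""
    body = {
        "type": "Service",
        "filter": f'host.name=="{host}" && service.name=="{service}"',
        "exit_status": exit_status,  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
        "plugin_output": output,
    }
    if ttl is not None:
        body["ttl"] = ttl  # seconds until the next result is expected
    return urllib.request.Request(
        url=f"{base_url}/v1/actions/process-check-result",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={"Accept": "application/json", "Content-Type": "application/json"},
    )


# Hypothetical example: a client reporting that its hourly backup job succeeded.
result_req = build_check_result_request(
    "https://icinga-master.example:5665", "web01", "backup-job",
    0, "OK - backup finished", ttl=3600,
)
```

A client-side cron job or daemon could build such a request after each local test run and send it with credentials for an API user, which is roughly the push direction discussed in this thread.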

Or did you have something different in mind?

Greetings
Feu

@monigacom if you’re referring to push based when it comes to check results: you can configure which way the communication goes. There are two places where direction comes into play: scheduling and the start of a connection.

  • You can use the host option in Endpoint objects to define which Icinga node will start the connection. It can either be “top down” (master starting to talk with satellite) or “bottom up” (satellite starting to talk with master). Same goes for agents.
  • Schedulers are run on satellites. So they actively tell agents when to run checks and collect the outcome. Satellites themselves push results to the masters
  • Masters only actively trigger checks and collect data when there are no satellites between masters and agents.
  • There is even a way you can run a scheduler on agents but this is discouraged. Especially because it’s most of the time useless and more complicated to configure.
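
To make the connection-direction part of the list above more concrete, here is a toy Python model; it is not Icinga code. It assumes that each node has its own local view of its peers and dials exactly those peers whose Endpoint object carries a `host` attribute in that local view. The node names and the `Endpoint` / `initiated_connections` helpers are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Endpoint:
    name: str
    host: Optional[str] = None  # set => the local node tries to connect to this peer


def initiated_connections(views: Dict[str, List[Endpoint]]) -> List[Tuple[str, str]]:
    """Return (initiator, target) pairs derived from each node's local Endpoint objects."""
    pairs = []
    for local_node, peers in views.items():
        for peer in peers:
            if peer.host is not None:
                pairs.append((local_node, peer.name))
    return pairs


# "Bottom up": only the satellite knows the master's address, so the satellite dials out.
bottom_up = {
    "master1": [Endpoint("satellite1")],
    "satellite1": [Endpoint("master1", host="master1.example")],
}

# "Top down": only the master knows the satellite's address, so the master dials out.
top_down = {
    "master1": [Endpoint("satellite1", host="satellite1.example")],
    "satellite1": [Endpoint("master1")],
}
```

In a segmented network, the "bottom up" variant is the one that avoids connections being opened from the outer network into the inner one, as long as the satellite sits on the inner side.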

Thank you @theFeu and @twidhalm

Yes, I was talking about a scheduler on Icinga2 agent and/or “ncpa/nrpe” daemon which will run the tests locally and send the result to Satellite/Master. If there are no results sent by the agent, the test will report that.

Thanks

If you’re looking for an alternative to NRPE that is more secure, flexible, powerful and overall more sophisticated, just go for the Icinga agent. That’s the one for you.

In most cases you’re fine with having a satellite close to your agents. All satellites run their own scheduler and even if they get disconnected from the master they will keep polling those agents. Most of the time you don’t want a scheduler on every agent. It’s cumbersome to configure and produces definitely more load on your agents than just a regular agent.

Either way, if your masters stop receiving messages from satellites and agents you will see it in Icinga Web 2. If you configure your checks (icinga and cluster-zone) correctly you can even get notified.

Most cases where you want a local scheduler have to do with losing connection to whole network/subnet. This is where a satellite comes in handy. The only case where you might want a scheduler on an agent is when you want to run checks even when the host is completely disconnected from all network connections.

Some Follow-Up Blogposts:
