Architecture at icinga (a little bit of criticism ...)

bodsch · March 19, 2020, 4:11pm

By way of introduction:
This has been weighing on me for a long time.
I was reluctant to talk about it because I haven’t given up hope.
But if I just keep stewing in my own juice, the overall situation won’t get any better.

So, sit down and hold on, there’s a rant…

I am using icinga2 at a few customers.
Surely there are other users who support larger and broader setups than I do.
But I still want to get a few points off my chest.

Since icinga1 there has been a simple, effective and, in my eyes, reasonable separation between the services that execute checks (the core / satellites) and the presentation of the results (icingaweb2).
For reasons that I cannot understand, more and more things are now migrating to icingaweb2, where they have no place from an architectural point of view.
Let me take x509 and vspheredb as examples.

All of a sudden icingaweb2 gets its own daemons and executes its own checks, thus competing directly with the core that was designed for it.

For me as an ops person, this means that I have to drop the classic separation (icingaweb2 on a dedicated, smaller VM) if I want to use these additional modules or if the customer wants them at all costs.
Suddenly I have to plan more resources for this VM (more and more separate daemons). It needs many more permissions (communication with other APIs, access to external resources) and I may suddenly have to make the VM highly available.

And to complete the chaos, there is no standard for saving configurations.
x509 puts everything into an ini file, vspheredb into a (new) database.
(And if you don’t have systemd at hand, you have to do a lot of ugly things to get these modules running.
And some customers actually come up with the idea of running “the piece of PHP software in a docker container”. Then completely different problems come up.)

It looks to me like two competing (developer) teams are trying to undercut each other.
Ops / operations is getting more and more out of focus.

Maybe it makes sense to write modules in another programming language.
But then these modules AFAIK should still be processed by the core.
That’s what it is for.
It can be set up redundantly for load balancing.

I have been working with (even very large) system architectures for ~15 years and I don’t like the development of the last 2 years at all.
(I have rather strong expressions ready, but I’ll spare you that)

I am increasingly faced with the question: where do I take my frustration?
The core team?
The icingaweb team?

Unfortunately I have the feeling that criticism is not accepted or perceived at all.

I lack a valid roadmap with planned developments that I can point customers to,
and on which I can align my own solutions (ansible roles, docker containers, tools).
Many things arrive quietly and without notice, as a complete surprise, and then many things suddenly just break.
For example: with the update to icinga 2.11.2, the handling of the configuration changed.
As far as I could see, without any visible announcement.
I still haven’t managed to adapt my docker container to it and now I plan to leave it behind.

There were times when community contributions (modules, themes, plugins) were highlighted on the Icinga website.
Now you have to search to find something like that.
It’s okay if you bundle them under exchange, but “our” visibility is dwindling.
You feel pushed away.

(Too) many daemons to maintain separately, which are also rather single instances, so not HA capable.
Here a functioning infrastructure is fragmented and turned into an unmaintainable heap.

What I wish for is a more open communication with us, the community.
That people listen to us again and work together with us.

Greetings from Hamburg,
Bodo

anon66228339 · March 19, 2020, 4:15pm

I’ll sign that too; I would only add the lack of communication with partners.

berk · March 19, 2020, 6:06pm

Hi Bodo,

you are right, we should and we can do better, especially at explaining the reasons for some parts of the architecture. I would also be super happy to hear ideas. How would you like to be a part of it? How can we make it better for you? We will go into the details tomorrow, once the team is involved.

We’ll come back tomorrow

Bernd

n0braist · March 19, 2020, 7:56pm

Thanks Bodo for that thread. Here is my point of view: especially in huge environments with separate icingaweb instances running in different security layers, this results in an exorbitant administrative effort. That is the reason I’m not running these modules.

bsheqa · March 20, 2020, 10:45am

Hi Bodo

thank you for your feedback. I wish you hadn’t waited so long, but it’s good that you finally reached out.

Like you stated, communication is key, and the lack of it is an issue we are aware of and eager to solve. Communication has to be bidirectional and we want to have you and others involved in it.

Increasing the visibility of the community is a goal we have set ourselves, and we are currently working on it with a complete relaunch of icinga.com. It will integrate our GitHub projects, this Discourse platform and community modules better, to highlight both the code contributors and the community board contributors.

The requirements for Icinga have grown a lot over time. Especially with new and changing technologies, it’s not always possible or useful to handle those things with hosts and services. Additionally, pushing every aspect of monitoring through one daemon clearly has its limitations, even for Icinga 2.

There is nothing that can replace a face-to-face discussion, not a GitHub issue and not a community board. Sometimes you have to explain things in a larger session and react directly to reactions and feedback. That’s something we had planned for IcingaConf this year: showing how Icinga can evolve over the next few years, discussing it with everyone present and making everything public.

Since this is not possible now, we are already considering different ways to bring this information to everyone. This is something that will take some more time to prepare.

Nevertheless, we want to have discussions with you and other community members as soon as possible. Therefore we will set up a Jitsi Meetup for next week, to which we will invite not only you but everyone in the community, to meet (virtually) and just discuss the things that weigh heavy on your heart. Expect an announcement in our blog and on social media within the next few hours.

nicolaiB · March 20, 2020, 12:14pm

Hello, everybody.

Just a few things from my side:

There have been many exciting ideas / suggestions in the last few years, but unfortunately they often didn’t make it beyond a prototype or a demo. On the one hand this is understandable if you consider that development capacity is finite, but on the other hand it leaves everyone wondering whether a feature will ever exist or will remain a ghost.

I like many of the features that are now implemented as PHP modules, but I miss the distributed approach, which has always been a big advantage of Icinga. Especially things like import sources (I love them!) in the Director or the vSphere module are hard to use in production environments once you implement strong security and segmentation. I talked to Tom about this at the last OSMC and he showed me examples of what a distributed setup can look like. Without this, most new modules are worthless for many setups.

I understand that much of the Icinga universe has grown organically over the last few years, but we’re at the point where we need a solid roadmap where the role of the community is also addressed. Personally, I’ve had a lot of fun developing modules, but I don’t really feel that any value is being placed on these contributions beyond the official modules, nor is any real effort being made to involve external developers early on to ensure that after major releases everything still works as usual. This makes me tired and leads me to the next point: documentation.

The (developer) documentation for Icingaweb2 is in fact (apart from a not really maintained github repo) not available! To understand things or figure out how certain integrations are possible, I have to analyze the code every time and hope it doesn’t break with one of the future versions. There are a few people who willingly answer the majority of questions, but that can’t replace documentation! Without this documentation & communication there can’t be much input from the community.

It would be a pity if this great project with its even greater community, in which many friendships have been made, were to take the wrong path.

Nicolai

bsheqa · March 20, 2020, 12:22pm

mcktr · March 20, 2020, 1:56pm

Hi,

I want to add a few personal thoughts:

I have been a regular contributor for quite a while now, especially to the Icinga 2 core. I do this in my spare time because I had fun doing it and learned new things; I also wanted to give something back.

Lately I have lost a big part of my motivation to work on Icinga-related things. In the last few months and weeks I found myself doing more research on changes, trying to find out the reasons for them or trying to understand the context. The quality of the documentation for code changes has decreased. I am aware that many of the Icinga developers meet regularly in person and can therefore exchange information about changes. For “external” contributors, the written-down “why” behind a change is important in order to keep up with development.

I also noticed a case where a contributor opened a new pull request and had trouble building due to an error in the changed code. The answer was just “Fix that error.” Such a harsh and unfriendly tone should be avoided; it will scare off new contributors.

I sign that, true words!

Best regards
Michael

twidhalm · March 20, 2020, 3:37pm

I’ve been part of the Icinga community for quite some time and I see it from a lot of different angles. I’m part of the Icinga team and an employee at an Icinga partner; I’m not a developer but focus on support and documentation. I took the time to talk to some people from every part of the equation and there are several things I learnt during the last weeks.

The picture I got is that this problem consists of several levels which overlap and reinforce each other but exist for very different reasons. The following is a very personal mix of what I have seen recently and what I have heard. This is in no way something official from the Icinga project.

It’s a hard time for developers. First came the mammoth project called Icinga DB, then the JSON-RPC bug which brought us 2.11.3, and then external factors like the Corona lockdown and others. As most of us know, communication, documentation and community work take a lot of effort. Some issues were so pressing that maybe even some of those vitally important tasks slipped out of focus. The JSON-RPC bug in particular took so many resources that I can understand why other things lacked attention. It might not have been completely obvious to everyone here, but this bug hit us as a partner hard. And I mean hard. I’m very thankful to the core developers that they dedicated so much to solving this bug and I can understand if they had to reprioritize (which I assume they did). If you weren’t affected by the bug, you might just have noticed the level of communication going down. I can only guess how much the aforementioned factors affected communication and community work, but I can very well believe that there is some connection.
I do see some problems in the level of communication. Like Blerim said, there are things to come, and there were events scheduled, like IcingaConf, where they should have been shown to the public. Maybe it’s sometimes problematic to polish presentations to perfection and wait for conferences. Everyone wants presentations to be perfect because it’s part of the experience. Just throwing bits of info at the community wouldn’t satisfy anyone either. Maybe it’s time to take a step back and review how to get information across fast enough without lacking professionalism. I personally think it’s extremely hard to satisfy everybody, especially since the community and user group is growing beyond IT-keen people who are interested in the naked truth and raw information. There are needs from other groups to satisfy, too. That makes it even harder to find one way to communicate.
What I personally think Icinga lacks is, like Nicolai said, insight into why something was built the way it was, and even more so insight into what is planned, so people can chime in early. But I also see that this would mean tons of extra work. It would be close to impossible to find solutions which everyone likes, and there would be a lot of hollow discussions to be fought (my personal view). We all know how many trolls every community can attract, and while we are very happy to have very few of them in here, I can imagine that discussing which way something should go, architecture- or implementation-wise, could end up in endless discussions leading to nothing but frustrated developers and community members. I can see a point in not discussing every little bit in the open, although I personally would still prefer slow but open progress (as long as there still is progress).
Not everyone gets all news from every channel. This might be the most important part. I hear a lot of people complaining about a lack of information. Then I ask the people responsible for communication and they tell me: it’s all there. And really, there are blog posts, talks, GitHub issues and so on. On the other hand I hear people complaining and I tell them: go, give this feedback to other people, too. And they say: I did. But still they feel unheard. I don’t want to put myself outside of this circle. I complained about not being informed, too. As an answer I got links where I could easily have found the information. Sometimes in places I really should have looked but didn’t. Sometimes in places I wouldn’t have dreamt of looking.
Then there’s the part where there are really different approaches to what is communicated where, when, and if at all. That’s the part to be discussed in the upcoming meeting and in other places, I think.

To sum it all up, I can see several different factors which all end up in one thing: people think they don’t get enough information. People in the community feel like they give feedback but don’t get enough info from the project. The people in the team think they put lots of information out there but don’t get enough feedback. Everyone thinks they give their best and that nobody on the other side listens. And it’s hard to blame someone because most people (including the Icinga project) try or at least tried.

I, for myself, took several things to do out of the discussion:

Even if it looks like it’s always one problem recurring, there are many, many different reasons why I might not have gotten the information that I needed or why I was not heard. I’ll try to differentiate whether it’s due to something which has to be changed, or just something bad happening that we have to wait out, or whether I should just retry getting the info across.
I will try to give information and especially feedback. Not once, but multiple times. To the same and to different people as well. Sometimes it’s the wrong time, the wrong tone, the wrong wording. If you want to be heard, you have to be sure that you really reach the receiver.
I’ll try to cut back on emotions in the discussion. We all want the project to thrive, so ultimately we all want the same thing. Let’s do this together.
I’ll try to listen more carefully. Lots of the things you want to know are out there. In the community, in the news from the project.
I’ll ask more. Maybe I’ve missed information. Maybe someone didn’t think something was of importance for me. Maybe someone doesn’t want me to know something in the first place but changes their mind when they think twice.
And ultimately I’ll take my place in the upcoming discussion about all this. I want to be part of the team and of the community and to help find a way we all can live with.

winem · March 27, 2020, 3:15pm

Hi,

I thought about this for some days because I tried to come up with ideas for improvements and specific ideas but I have to admit that it’s hard. So I’ll just add a few things basically in addition to what Bodo and Nicolai said.

1st topic - Architecture & the Ops view:
I spent most of the past 10+ years with global players in the pharmaceutical industry and, mostly, with ISPs & carriers. Only a few of them are open to Open Source and to solutions provided by smaller and sometimes even very small companies. Those few have realized that there is more than IBM, HPE, Nokia, Siemens, Cisco, Solarwinds and all the other big players. That said, I would tend to say that all of them have hard- and software in place which they have been using for many years. They have highly trained, experienced and therefore skilled engineers in key positions who know every single bit of the applications they are responsible for. They have very strict requirements and expectations regarding new tools - and I totally understand it (most of the time).

Fun fact: You would be surprised how many ISPs still rely on SNMPv2 in their core networks.

So the 1st big challenge was to win them as a customer. The 2nd big challenge was the integration into their environment.
Examples of restrictions and requirements that come to mind:

defined routes & communication (i.e. endpoints, who initiates the connection, encryption, …) between all components
no outgoing connections from an inner to an outer network (so no Prometheus in the default configuration)
as few services as possible
High-Availability
documentation about running services and handbook for the “first responder” if any alarm comes up and any of the services is not behaving as expected
visibility

Icinga2 is actually a great tool for this purpose and made it relatively easy to convince the customers to approve its usage. We have/had defined endpoints, great architecture of the components (the core, possibility to en-/disable modules living in the core, the ido), low resource usage and a great community to share experience, issues and plugins. I just read up on the more recent modules and am afraid that the additional daemons and dependencies would make this process harder. It may overwhelm the customers a bit and lead them to force suppliers coming up with Icinga2 to drop it and stick to what they already have.

2nd topic - missing(?) roadmap:
I specifically searched for information when I read this thread and yes, there is some info available in newsletters, on GitHub, in the community and in other places, but I couldn’t find a real roadmap. To me, this is something that sometimes holds me back from investing in new development of modules, automation or other things related to icinga2. A roadmap would be really helpful to know where the project is going and to be able to evaluate our own ideas when planning projects. It may also help to inspire and motivate users to share what they have in place when it comes to related topics, or just to contribute to the project. I guess we all love to see new modules, features and an active community.

Sorry for the quite long post, but to me it seems important to also keep in mind the kind of companies and potential users I mentioned in the first topic. I really enjoyed (I only use the past tense because I have barely done it since I changed my job half a year ago) working with icinga2 and bringing it into new environments. I feel a bit sorry that I can’t really provide any alternatives or specific solutions, but I hope this thread and the upcoming meetup help to unite the community, contributors & developers again and trigger valuable improvements on all sides.

Cheers,
Marcel

theFeu · March 27, 2020, 3:36pm

Hey there!

Why don’t you join us in our jitsi meet in half an hour to discuss?

I’m afraid this is a bit too short notice to incorporate your post from our side, but maybe you can give your input yourself?

Greetings
Feu

winem · March 27, 2020, 4:00pm

No worries, I’ll join.

See you there,
Marcel

aflatto · March 27, 2020, 4:46pm

What is the latest URL to join the meeting, if there still is one?

theFeu · March 27, 2020, 4:53pm

Update: Looks like we were hijacked properly.

Sincerest apologies, we will have to move this meetup.

It looks like we were poorly prepared for the number of people.

Let’s not waste everyone’s time by keeping at it today.

Next week, same time, details will follow.

bodsch · April 1, 2020, 1:10pm

Here is an example, so that one can actually picture it.
This is just a current (and small) customer setup.
The backend / frontend clusters each contain 5-7 Icinga2 satellites.

If I dig deep into my memory, I could also piece together the EOS setup, which was twice as big and much more complex.

If anyone else here would like to visualize their setup in a similar way, I’m happy to help out.

anon66228339 · April 5, 2020, 8:35am

@bodsch asked me to post a picture of a typical setup I support at my customers, so here you go. It’s ugly and it shows the master zone only; the other satellite zones are mostly HA too, without their own databases or web interfaces.

Sometimes everything is virtualized, sometimes one master/db/influx/webinterface/grafana is on its own hardware/storage to avoid downtimes due to data center problems (dogs don’t eat their own food sometimes).

monigacom · April 17, 2020, 5:43am

I guess the Icinga DB feature would help to scale the monitoring infrastructure?

From the architecture point of view, I wish for a push based monitoring model in Icinga2

bodsch · April 17, 2020, 2:37pm

“I guess the Icinga DB feature would help to scale the monitoring infrastructure?”

Maybe.
In this case it would help to integrate IcingaDB into one of the architecture overviews above.
Also to make communication relationships more visible.

“From the architecture point of view, I wish for a push based monitoring model in Icinga2”

Push based?
You should define that a little more precisely! Who should push what, and where?
The satellites may push their results to the master.

monigacom · April 17, 2020, 4:05pm

What I meant by the “push model” is that clients run the tests and send the results to the satellites/master, instead of the active/pull model. Something like what BB/Xymon do via a daemon running on the clients.

The existing Icinga2 passive checks are way too complicated to set up and, in my experience, not reliable.

anon66228339 · April 17, 2020, 4:06pm

Create a passive service; it’s possible. You can even dynamically create a service over the API.
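
For anyone who wants to picture the push model with what Icinga 2 already offers, here is a minimal sketch in Python. It only builds the two HTTP requests: one that creates a passive service via the REST API's config object endpoint, and one that submits a result via the `process-check-result` action. The master URL, host name, service name and API user are made up for illustration, and using the `dummy` check command for the passive service is an assumption, not a recommendation from this thread.

```python
import base64
import json
import urllib.parse
import urllib.request

# Hypothetical master address; 5665 is the default Icinga 2 API port.
API_BASE = "https://icinga-master.example.org:5665"


def _build_request(method, path, body, user, password):
    """Build (but do not send) a JSON request with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=API_BASE + path,
        data=json.dumps(body).encode(),
        method=method,
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def create_passive_service(host, service, user, password):
    """Request that creates a service which is never actively scheduled."""
    name = urllib.parse.quote(f"{host}!{service}", safe="!")
    body = {
        "attrs": {
            "check_command": "dummy",  # assumption: placeholder command
            "enable_active_checks": False,
            "enable_passive_checks": True,
        }
    }
    return _build_request("PUT", f"/v1/objects/services/{name}", body, user, password)


def submit_result(host, service, exit_status, output, user, password):
    """Request that pushes one check result for an existing service."""
    body = {
        "type": "Service",
        # json.dumps adds the double quotes the filter expression needs.
        "filter": f"host.name=={json.dumps(host)} && service.name=={json.dumps(service)}",
        "exit_status": exit_status,  # 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
        "plugin_output": output,
    }
    return _build_request("POST", "/v1/actions/process-check-result", body, user, password)


if __name__ == "__main__":
    req = submit_result("web01", "backup", 0, "OK - backup done", "api-user", "secret")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
```

To actually send a request, it would be passed to `urllib.request.urlopen` together with an SSL context that trusts the Icinga CA. A small agent or cron job on the client could do this after each local test run, which is roughly the push behaviour described above.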