Icinga 2 at large scale

The things mentioned, along with other schema drawbacks, are the drivers for creating Icinga DB as a new backend. Currently its first release is blocked by a critical bug inside Icinga 2 2.12-RC, specifically https://github.com/Icinga/icinga2/issues/7532.


Thanks @Solkren for sharing this great and interesting information. I will use it to scrutinize my Icinga infrastructure, as I remember seeing the IDO messages saying “Your DB can’t keep up with…”

@dnsmichi Does that mean that there will be an improved DB schema in 2.12?
Is enabling the IDO feature on the satellites supported?

Thanks

Hi,

IcingaDB is a whole new backend architecture, and will replace the IDO backend in the future.

In short, the architecture looks like this:

Icinga 2 Core -> IcingaDB feature -> Redis -> IcingaDB daemon -> DB backend (history, config) & Redis (states & volatile data) -> Icinga Web module.

The required components will be provided with Icinga 2 v2.12, Icinga Web 2 v2.8 and two new projects: the IcingaDB daemon and a new Web module.

Currently the JSON-RPC bug is holding up the entire release chain, with Icinga 2 being a key core component required for this new architecture.

To my knowledge - and don't take this for granted, as things may change - the first Release Candidates will add MySQL as a backend, with PostgreSQL following. Also, requirements like data migration from IDO and updating existing web modules to the new backend framework remain ongoing tasks after RC1.

The latest update on this can be watched in the OSMC talk by @berk.

In terms of satellites and IDO - you can do that, but then it's just a local backend. There are no plans to extend this, e.g. as shown above with different instance_ids in a central DB cluster and so on. Also, it will be hard to get support for that, as we do not encourage it as best practice.

Cheers,
Michael


Hello Michael,
I know it’s kinda hard to provide/predict any estimate for fixing this JSON-RPC bug, but do we have any insight into when it could be fixed? Any info is appreciated :slight_smile:

With best regards,
Dmitriy.

Hi,

honestly, asking for an ETA is one of the worst things you can do in this situation - no offense here.

I’ll also tell you why - imagine that you’ve got a problem, and you have tried 20 different things to tackle it. Your mood ranges from frustration to the feeling of having nearly cracked it. You're on that emotional ride, always trying new things based on new findings.

First off, you had trouble reproducing the problem reliably.

Then there is the fact that no-one is there to help you. You need to tackle the problem alone, Google everything, analyse your code, re-consider every change you made in the past year.

You may have a colleague who joins you, and other colleagues who help with testing the “maybe solutions” you provide. Oh, it failed again. Gaining the motivation to push further is hard, even if that’s the job you’re getting paid for.

By this stage you have gained the knowledge to deploy a large-scale Icinga cluster with Ansible and Terraform in AWS, DO and NWS. And you don’t want to look at the monthly bills coming soon.

Then you reach a point where the problem is mitigated by disabling a feature. “Oh yay, it is fixed.” No, actually not. You still need to find a way to fix it properly, re-enabling the feature or the code you just removed to pin down the spot.

You know that you’ve invested 3 months of time already, with the problem always taking 2 long days to show up again, and only in a large-scale clustered environment.

In parallel to all of the above, you get asked nearly daily about the status and when a fix will be available. There are two sorts of users: those who want to have IcingaDB, and those who expect a 2.11.x point release.

And sometimes you are lucky to have your team lead absorb that for you, so you don’t get stressed out by it.

And of course, you need to have plans B, C and D if your current strategy doesn’t pay off. That’s not the best feeling you’ll have during the day.

At this very moment, all 4 core developers and our team lead are assigned to this task, nearly full time.
If there were more people debugging, with different ideas and insights, this would truly lead to faster results and a wider range of perspectives.

Help out

Everyone can start in the development chapter: there’s the possibility to test the snapshot packages, dive deep into the code and add your findings with logs, core dumps, analysis, etc.

There’s even a chance that a programming error can be detected with static/runtime code analysis tools we haven’t tried yet. If you know some, throw Icinga at them and try to extract valuable details.

Last but not least, always consider that everything we do here is within an open source project that no-one initially pays anything for. Sometimes expectations are raised too high, and sometimes it is hard to imagine the other side: developers, project managers and their emotions.

I’d recommend to watch our Icinga 2.11 insights video, which also highlights some of the above: https://icinga.com/2019/10/25/icinga-2-11-insights-video-online/

That being said, we’re doing more than 100% to solve the problem. The more people join this journey, the better. :muscle:

Cheers,
Michael

PS: “You” in the above is @Al2Klimov, who’s been doing a magnificent job of holding out here. I had already given up multiple times, but kept coming back, motivated by @elippmann, @nhilverling and @htriem 's work. :+1:


Hi!

I have a master cluster with 5 satellites, ~5,000 hosts, ~10,000 services, and nothing is crashing here. I only found one crash log from September. So, my question is: how many users actually have that problem? Why do we not have this problem too? Is our environment too small? As we are not using agents on Linux, only on Windows, we do not have many agent connections.

Sadly I’m not a real developer and can’t help much :frowning:

@Solkren, great thread and testing on large-scale Icinga installations.

I am working on a cluster as well, and the final landscape will have 66 zones.
Level 1 with 2 masters (HA), MySQL backend, web
Level 2 with 7 satellite zones (14 servers, HA), with their own MySQL backend and web
Level 3 with 52 satellite zones (some HA… some single)
In load testing, the IDO can no longer keep up on Level 1 when I hit 22K hosts and 290K service checks reporting to the masters from the Level 2 satellites. I am thinking about dropping the Level 1 IDO, running a single Level 1 master, and configuring the Level 2 IDO with the instance option to write to the Level 1 master MySQL like your setup, but how does Icinga Web 2 work then? What does it see as the active node writing to the DB? The Level 1 Icinga Web 2 is meant to be a single pane of glass seeing everything in the landscape.

Hello @powelleb,
thanks for the feedback. The only problem with Icinga Web in this configuration is that icingaweb2/monitoring/health/info will only show you information about the first satellite zone connected to the database; Icinga Web unfortunately reads only the first instance id from the table in the database and shows statistics based on that. But the good news is that all hosts are visible and there are no conflicts inside the database.

Also, I’ve tested moving an Icinga agent from one satellite zone to another; everything works like a charm and it doesn’t create conflicts in the database.

When everything is connected, Icinga Web reads a single database and makes queries based on host/service id + environment id.

Database: with that number of clients, take care to create daily partitions for tables like history and notifications, as they grow over time. The default database cleaner uses DELETE statements and is fairly slow at this scale; with partitions you will be able to drop unnecessary data fast and without service interruptions.
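
As a rough illustration of the idea, a small script can generate the daily partition DDL instead of maintaining it by hand. The table and column names below are just typical IDO-style examples, not taken from this thread; adjust them to your actual schema:

```python
from datetime import date, timedelta

# Hypothetical example: generate daily RANGE partitions for an IDO-style
# history table. Table/column names are assumptions; adjust to your schema.
TABLE = "icinga_statehistory"
COLUMN = "state_time"

def daily_partition_ddl(start: date, days: int) -> str:
    parts = []
    for i in range(days):
        day = start + timedelta(days=i + 1)  # partition holds rows older than this day
        parts.append(
            f"PARTITION p{day:%Y%m%d} VALUES LESS THAN (TO_DAYS('{day:%Y-%m-%d}'))"
        )
    return (
        f"ALTER TABLE {TABLE}\n"
        f"  PARTITION BY RANGE (TO_DAYS({COLUMN})) (\n    "
        + ",\n    ".join(parts)
        + "\n  );"
    )

print(daily_partition_ddl(date.today(), 7))
# Old data is then removed with a fast ALTER TABLE ... DROP PARTITION
# instead of slow row-by-row DELETE statements.
```

Keep in mind that MySQL requires the partitioning column to be part of every unique key of the table, so partitioning stock IDO tables usually means adjusting their keys first.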

Also, 2.12 RC is available and soon (fingers crossed) we will get 2.12 released. 2.12 uses Redis with Redis streams, which is really fast, so I hope it will be enough to have 2 bare-metal masters that can handle all the load. On the other hand, for infinite scaling the current concept will not work in 2.12 with Icinga DB; I hope they will add support for this kind of setup later (modern micro-service architectures don't really favour monsters like that; for simple management/update/upgrade people tend to run a lot of small VMs, which is also much more failure tolerant).
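
To make the Redis streams part a bit more concrete, here is a tiny sketch of the stream handoff pattern using redis-py. The key name and fields are made up for illustration and are not the actual Icinga DB layout:

```python
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

# Producer side: the check source appends a state change to a stream.
# Writes are append-only and cheap, which is why this path is fast.
r.xadd("demo:state-changes", {"host": "web01", "service": "load", "state": "2"})

# Consumer side: a daemon reads new entries in batches and flushes
# them to the database at its own pace.
entries = r.xread({"demo:state-changes": "0-0"}, count=100, block=5000)
for stream, messages in entries:
    for msg_id, fields in messages:
        print(stream, msg_id, fields)
```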

P.S. In your setup you have Level 3 with additional satellites, which creates extreme overhead in terms of data processing and management:
agent sends data => Satellite L3 handles the connection + transfers data => Satellite L2 transfers data + pushes data to the database => Master accepts the data and drops it
I don't see a real reason to use L3; I would just increase the number of satellites on L2.

With best regards,
Dmitriy

Another idea is to use Icinga DB, which was developed exactly for that scenario.

Hello @Solkren

I have been testing some of the RC with Redis as well…fingers crossed.

RE: the level 3 satellites.
Those satellites are within private networks, and it is not possible to have all the clients in those private networks talk directly back to Level 2… that is why the Level 3 zone count is so wide (~52). If there are other thoughts, I am all for it.

BR,
Eric

To All:
Just curious if anyone has any more feedback or experiences now that 2.12.0-1 is released.

Hello Dmitriy, thanks for sharing this. Can you advise how to configure the system for 5,000 events per second? Many thanks.

Regards,
Xueyi

Still, could you please guide me on how to measure the number of events per second?

Hello, sorry for the late reply

Based on my environment, here is some math:

  • 50 dummy service checks with 1 min frequency
  • 1 random service check with 5 min frequency
  • 1 dummy host check with 5 min frequency

Each agent endpoint has 1,000 Host objects configured on it.

50 (dummy checks) / 60 (sec) = 0.833 checks per sec
(1 (random check) + 1 (host check)) / 300 (sec) = 0.007 checks per sec
0.833 + 0.007 = 0.84 checks per sec

1 host with 52 checks generates ~0.84 checks per sec

100 hosts * 0.84 = 84 checks per sec
1,000 hosts * 0.84 = 840 checks per sec
10,000 hosts * 0.84 = 8,400 checks per sec
100,000 hosts * 0.84 = 84,000 checks per sec
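
The same calculation as a small script, in case you want to plug in your own numbers (this is just the arithmetic above, nothing Icinga-specific):

```python
# Per-host check rate based on the intervals listed above.
dummy_services = 50      # checked every 60 s
slow_checks    = 2       # 1 random service + 1 host check, every 300 s

per_host = dummy_services / 60 + slow_checks / 300   # ~0.84 checks/s per host

for hosts in (100, 1_000, 10_000, 100_000):
    print(f"{hosts:>7} hosts -> {hosts * per_host:,.0f} checks per second")
```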

Hope it helps you with your cluster configuration

With best regards,
Dmitriy.


Hi - I have been following this thread since we have a large environment and ran into the IDO backend bottleneck. We just tried Icinga 2 2.13.2 together with Icinga DB 1.0 RC2. Some good news to share: in our dev environment we see ~700x fewer updates to the Icinga DB database than to the old IDO database, which makes perfect sense since Icinga DB only keeps track of state changes and has an extra cache layer (Redis).
My only concern/question now is whether the same thing (reporting only state changes) can be done for the communication between satellites and masters when IDO is disabled. That way Icinga would be much more scalable when using Icinga DB.


I haven’t seen any options for it as of now; I would like to see this feature as well.
From my recent tests with Icinga DB: it syncs data much faster, but still lacks a lot of performance tuning options. Right now, in a large environment, the Icinga DB service just starts to hang and stops syncing objects at some point.


OK, I understand it's not supported/recommended to have satellites write to the central IDO backend, but it does sound attractive. We're running 38,000 hosts and 280,000 services (mostly 5 min, avg 8 services/host), with an HA master zone and 24 HA satellite zones. But I have a question for @Solkren: how do API calls work if IDO is disabled on the masters? We create/remove downtimes for groups of hosts/services (sometimes in the hundreds) by sending API calls to the primary master host. We also use the API for the Icinga Web commandtransport.ini, and the web connects to the IDO in resources.ini. In addition, we sometimes move hosts around the satellite zones to distribute load among them.
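
For context, our downtime automation is essentially a script that POSTs to the schedule-downtime API action on the primary master, roughly like the sketch below (host, credentials, filter and timestamps are placeholders):

```python
import requests

API = "https://master1.example.com:5665"      # placeholder master endpoint
AUTH = ("automation", "secret")               # placeholder ApiUser credentials

payload = {
    "type": "Host",
    "filter": 'match("db*", host.name)',      # placeholder group of hosts
    "author": "automation",
    "comment": "planned maintenance",
    "start_time": 1700000000,                 # Unix timestamps
    "end_time": 1700003600,
}

resp = requests.post(
    f"{API}/v1/actions/schedule-downtime",
    headers={"Accept": "application/json"},
    auth=AUTH,
    json=payload,
    verify="/etc/ssl/icinga-ca.crt",          # placeholder CA bundle
)
print(resp.status_code, resp.json())
```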


With Icinga Camp coming up in a few weeks, let's put together a few questions we can ask the authors of Icinga DB to make these kinds of things a bit clearer.

We already have a similar install with 35k hosts in prod, and API calls work as usual; everything is still connected. Technically, satellites still send all messages to the masters, and the masters just drop IDO messages because they don't have IDO configured. When a satellite accepts messages from agents, it processes its message queue (sends data to IDO) and forwards all those messages to the master.

Moving hosts from one satellite to another is not a problem either; the Icinga IDO database logic identifies a host as hostname + instance_id (the instance_name option in /etc/icinga2/features-available/ido-mysql.conf), so it does not cause duplicates.

The only problem I've seen so far is adding new checks. Adding a lot of objects across several satellites causes them all to take table locks while adding the new objects to the database; during this lock the other satellites have to wait and reconnect, so config distribution can take a lot of time.
Example: you have 30k+ hosts configured in the same way and you add 1 check to a service set, which spawns 30k new objects at once.

Good afternoon Dmitriy. After doing the math, I currently have 36k hosts with 2 masters and 3 satellites, and I intend to increase to 60k hosts. From what I saw, the most that has been tried so far was up to 35k hosts, so we have already exceeded that limit. My question is: is it possible to make this increase in hosts? I have noticed that one of the problems it could cause would be getting the records into the correct database. Making the calculation you posted here, at the moment I have 1,320 checks per second and I will have 2,200 per second; right now we have 400k active services and will go up to 660k active, remembering that the checks are done every 5 minutes. Thanks for listening.