Icinga2 at large scale

powelleb · July 2, 2020, 1:16pm

I have been testing some of the RC with Redis as well…fingers crossed.

RE: the level 3 satellites.
Those satellites are within private networks and it is not possible to have all the clients in those private networks talking directly back to level 2…that is why the level 3 zone count is so wide (~52). If there are other thoughts, I am all for it.

BR,
Eric

powelleb · September 5, 2020, 2:42pm

To All:
Just curious if anyone has any more feedback or experiences now that 2.12.0-1 is released.

masseyxueyi · June 21, 2021, 11:27am

Hello Dmitriy, thanks for sharing this. Can you advise how to configure the 5000 events per sec in the system? Many thanks.

Regards,
Xueyi

masseyxueyi · June 21, 2021, 11:44am

Still, could you please guide me on how to measure the number of events per second?

Solkren · July 12, 2021, 2:50pm

Hello, sorry for the late reply

based on my environment here some math:

50 dummy service checks with 1min frequency

1 random service check with 5min frequency

1 dummy host check with 5min frequency
Each Agent Endpoint have 1000 Host objects configured on it.

50(dummy checks) / 60(sec) = 0.8333 checks per sec
(1 (random check) + 1(host check) ) / 300(sec) = 0,006 checks per sec
0.8333 + 0,006 = 0,8349 checks per sec

1 Host with 52 checks generates 0,8349 checks per sec

100 hosts * 0,8349 = 83,49 checks per sec
1000 hosts * 0,8349 = 834,9 checks per sec
10 000 hosts * 0,8349 = 8349 checks per sec
100 000 hosts * 0,8349 = 83490 checks per sec

Hope it helps you with your cluster configuration

With best regards,
Dmitriy.

jhou4 · March 23, 2022, 2:32pm

Hi - I have been following this thread since we have a large environment and ran into IDO backend bottleneck. We just tried icinga2 2.13.2 together with icingadb 1.0 rc2. Some good news to share - in our dev environment we see ~700x less updates to icingadb database then old IDO database, which makes perfect sense since icingadb only keeps track of state changes and has an extra cache layer(redis).
My only concern/question now, is if same thing(to only report state changes) can be done to the communication between satellites and masters, when ido is disabled. That way icinga will be much more scalable when using icingadb.

Solkren · April 1, 2022, 12:31pm

I haven’t seen any options for it as of now, i would like to see this feature as well.
From my recent tests with icingadb - it sync data much faster, but still lacks with a lot of performance options for it, right now in the large environment icingadb service just starts to hang and stops sync of objects at some point of time.

petew · July 6, 2022, 2:14pm

ok i understand it’s not supported/recommended to have satellites write to the central IDO backend, but it does sound attractive. we’re running 38000 hosts, 280000 services (mostly 5min, avg 8 services/host); ha master zone, 24 ha satellite zones. but have a question for @Solkren - how do API calls work if IDO is disabled on masters? we create/remove downtimes for groups of hosts/services (sometimes in the 100s) by sending API calls to the primary master host. we also use the api for the icingaweb commandtransport.ini and the web connects to the IDO in resources.ini. in addition we sometimes move hosts around the satellite zones to distribute load among them.

davekempe · July 7, 2022, 9:22pm

With Icinga Camp coming up in a few weeks, lets put together a few questions we can ask the authors of Icingadb to get these kind of things a bit clearer.

Solkren · July 10, 2022, 10:08am

We already have similar install with 35k hosts in prod and API calls works as usual, everything is still connected, technically satellites still send all messages to masters and masters just drop IDO messages because don’t have IDO configured, when satellite accept messages from agents it process message queue (sends data to IDO) and forward all those messages to the master.

Moving hosts from 1 satellite to another not a problem as well, icinga ido database logic identify host as hostname + instance_id (instance_name option in /etc/icing2/features-available/ido-mysql.conf), so it does not cause duplicates.

The only problem i’ve seen so far - adding new checks. Adding a lot of objects across several satellites cause them all to make table locks to add new objects in database, during this lock other satellites got to wait and reconnect, so config distribution could take a lot of time
Example: you have 30k+ hosts configured in the same way, you add 1 check to service set which cause to spawn of 30k new objects at once.

Joao · August 17, 2022, 7:11am

Good afternoon Dimitriy after doing the math I currently have 36k hosts, 2 Masters 3 satellites, I intend to increase to 60k hosts, and from what I saw the most they tried was up to 35k hosts in which we have already exceeded that limit. My question is, is there a possibility to make this increase in hosts? have i noticed that one of the problems it could cause would be the registry entry for the correct database? Making the calculation you put here, at the moment I have 1320 checks per second and I will have 2200 per second, at this moment we have 400k active services, we will pass up to 660k active, remembering that the checks are done every 5 min. Thanks for listening

juarezjx · September 9, 2024, 7:17pm

Hi Dmytro,
I recently took over for monitoring in our company. I inherited an environment running RHEL6 Icinga1. I am upgrading to Ubuntu 22.04 Icinga2, primarily to get on a supported OS but also to get Icinga in a database and not the flat files we were using in Icinga 1. We have converted the necessary configuration files and have a working Icinga2. We have about 125,000 check and 3,000 hosts. With a single Icinga server this is not working well and we are reading about the distributed environments you have deployed. Do you offer any consulting services? We would be interested in having someone show us how to configure a Satellite/Zone and using the Icinga Master without ido-mysql enabled.

log1c · September 11, 2024, 7:10am

monigacom · September 11, 2024, 2:48pm

Maybe IcingaDB is the way to go and probably will help running Icinga at a large scale. Has anyone done the scalability and benchmark tests using IcingaDB at all? Not sure if IcingaDB is still being evolved from development perspective. But if it is ready for Production, then results of the tests are needed and would be helpful to consider it for the production implementation.

Thanks

gabrieldslacerda · March 4, 2025, 10:23am

Hello all. We are passing through the same process of planning a large deployment of Icinga (+25000 devices with checks every 2 min).

The issue is: we are limited to use SNMPv2. Has anyone had experience using SNMP in a big environment like this?

Thank you.

rivad · March 4, 2025, 1:26pm

I think T-Systems does Icinga with SNMP, if I remember correctly from Icinga Camp Berlin.

I couldn’t find the video but I’ve found this one: https://www.youtube.com/watch?v=3q0T0akj19g

lorenz · March 4, 2025, 2:25pm

I do not have direct experience, but I see two
sides two this:

The Icinga side (Icinga2 + backend)
The Monitoring Plugin side

Regarding 1. I am mostly certain, that this is quite possible and should not pose a big problem (if you can add some computing and memory easily).

Number 2. is a “depends”. Since every test run starts a new process, overhead adds up fast. If you use natively compiled plugins it might work just fine, if you use something in an interpreted language (lets say Perl) the interpreter gets initialized and the code compiled every time and that is going to be a lot of wasted computing. Also direct queries (requesting one or more OIDs specifically) is quite easy, but doing SNMP walks is not, so one should try to minimize the usage of plugins (or usage modes) which try to do that.

Finally I just suggest to create a little test setup, monitor that meticulously, add more devices incrementally and search for the bottlenecks