Capacity Planning - Best Practice For Icinga Master / Satellite

radioactive9 · June 9, 2020, 9:46am

Hello

I was trying to find some capacity planning guidance in the Icinga docs, but could not find exactly what I am looking for. I know there is no rule of thumb for capacity planning and that it all depends on the environment, which makes it a little vague.

I had a long discussion on this topic with Thilo during our training, but honestly I am still in the dark.

I am running 3 very large environments in 3 geographies (continents). To save cost, management has asked me time and again why I don’t have 1 pair of master / DB / web and use satellites in the respective continents' DCs. I have dodged the question by always giving the excuse of data privacy laws / practices (China, US, etc.), but some day I will get caught giving them the runaround. I need a solid figure: with a pair of masters I can run X nodes (agents) and Y checks (local or remote). The good part is that I have always made the agents talk only to the satellites and never directly to the master. Still, I am a little skeptical about controlling the whole world (~30k nodes) with a pair of masters, even leaving aside the network latency. Please correct me if my thinking is going in the wrong direction.

Q2: I know that at some point I will have to build another pair of satellites to monitor more servers. What is the rule of thumb, or which indicators will tell me that I need more satellites? Again, let's say I have ~2500 servers hosted on a pair of satellites. Only host-alive checks run from the satellites; all the rest are agent-based checks (run on the agent). Given this, how far can I push my luck? Do I say that I should stop after 3000 nodes? Should I say 4000 agents is the limit? How do I know that my satellites cannot take any more? I have already run into the following issue and have followed the documentation to resolve it. The question here is: how far can I push my luck?

theFeu · June 15, 2020, 6:48am

Hello there,
I saw that you posted this topic under the community category, while I think it fits better in core, so I moved it over.
This way people should be able to find it more easily.
Have a nice day!
Feu

unic · June 15, 2020, 11:31am

Take a look here: Number of devices monitored on Icinga

In most environments the bottleneck will be the load from the check commands, so you will never get an exact answer that applies to your environment. For plain Icinga with agents (and no heavy checks) you may find an answer in this post: Icinga2 at large scale
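To make the point about check load concrete, here is a back-of-the-envelope sketch in Python. It only does the arithmetic of hosts times checks divided by the check interval; the function name is made up for illustration, and the 20 checks per host and 60 second interval are assumptions, not figures from this thread (only the ~30k nodes comes from the question). It says nothing about how expensive each check command is, which is the part that really differs between environments.

```python
def checks_per_second(hosts: int, checks_per_host: int, interval_seconds: float) -> float:
    """Average check executions per second, assuming every check runs once per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return hosts * checks_per_host / interval_seconds


if __name__ == "__main__":
    # ~30k nodes is taken from the question above.
    # 20 checks per host and a 60 s interval are illustrative assumptions.
    rate = checks_per_second(hosts=30_000, checks_per_host=20, interval_seconds=60)
    print(f"approx. {rate:.0f} check executions per second")
```

Comparing a figure like this against what the linked benchmark posts report is only a starting point; measuring your own masters and satellites under real load is still needed.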

Edit: Maybe it's a good idea to pin some benchmark/performance posts in an overview thread, as these questions pop up regularly. I can only remember the two I linked here.

theFeu · June 17, 2020, 7:13am

Thanks for the input, that sounds like a good idea!
Maybe we could also have a pinned post as a collection, with all benchmark posts linked there?
I’ll go check if we can compile a nice list on Monday.

Solkren · July 2, 2020, 11:37am

It would be nice to attach it somewhere in the documentation. I’m not sure if/how you do stress testing for new releases; maybe it is worth attaching those benchmarks to the release notes to keep them current. The main points against only pinning user posts on the forum:

Hard to track: my post for Icinga2 at large scale can become outdated with 2.12 and Icinga DB.
Benefits of choosing Icinga 2 compared to other open source solutions are easier to see when engineers and their leadership are researching and deciding which monitoring software to choose.

With best regards,
Dmitriy