We are working on replacing our monitoring stack, and Icinga became a candidate for the SNMP monitoring part. Does anyone use Icinga with 25,000+ devices? How big is the resource consumption of such an environment? How do you scale up quickly (horizontal/vertical scaling)? How do you plan for high availability?
It would be great to talk to someone with this experience.
There are setups with 25k hosts and bigger. In my experience, the more relevant limit is how often each service is checked and how many state changes occur. But in general, 25k hosts with multiple services should be nothing to worry about.
The resource consumption depends heavily on your setup. If you are aiming at 25k hosts, I would advise starting with no less than 32 GiB of memory and a decent number of CPU cores. Less may also work, but eventually you will run into limits.
In a nutshell: Icinga supports HA with two nodes on several levels, e.g. for the Icinga 2 masters or satellites. More information is in the docs:
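For illustration, an HA master zone boils down to two endpoints sharing one zone in `zones.conf`. This is only a minimal sketch; the host names are placeholders:

```
// zones.conf on both masters; endpoint names are placeholders
object Endpoint "master1.example.org" {
  host = "master1.example.org"
}

object Endpoint "master2.example.org" {
  host = "master2.example.org"
}

// Two endpoints in one zone form an HA pair: checks, notifications
// and other features are balanced between the connected endpoints.
object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org" ]
}
```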
Now to the juicy part: SNMP. People do use Icinga 2 for SNMP monitoring, and it is described in the docs: Agent Based Monitoring - Icinga 2. However, since I personally don’t have much SNMP experience, I would like to hear from other people on this topic.
Here is an interesting post about scaling and the IDO, which is the old data backend for Icinga 2. Since Icinga DB should perform better, this can be used as a baseline.
Also keep in mind the timeout of the checks. If your custom checks wait too long for a result and/or the Icinga check timeout is set too long, this can also cause unintended load on your system.
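As a sketch, the timeout can be capped per service in the Icinga 2 DSL via the `check_timeout` attribute; the check command, intervals and assign rule here are just example values:

```
apply Service "snmp-uptime" {
  import "generic-service"
  check_command  = "snmp"   // example check; adjust to your plugin
  check_interval = 5m
  retry_interval = 1m
  check_timeout  = 30s      // kill the plugin after 30s instead of
                            // letting slow SNMP targets pile up
  assign where host.vars.snmp_community
}
```

Without `check_timeout`, the global plugin timeout applies, so one slow SNMP device can tie up a checker slot for much longer than intended.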
Most bigger setups I know are using the Icinga Agent (or sometimes check_by_ssh) instead of SNMP, so I personally cannot say how many resources the SNMP checks will require.
However, in case this is applicable for you, consider using the Icinga Agent instead of SNMP, since check plugins are available for all three things you have listed.
In general, that is not that many checks to execute in parallel.
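For reference, an agent host is usually wired up like this on the master or satellite; the names and address are placeholders, and `vars.agent_endpoint = name` is just the common convention for telling apply rules to run checks on the agent:

```
// The agent gets its own endpoint and a zone that is a child
// of the zone it reports to (here: "master").
object Endpoint "web01.example.org" { }

object Zone "web01.example.org" {
  parent    = "master"
  endpoints = [ "web01.example.org" ]
}

object Host "web01.example.org" {
  check_command = "hostalive"
  address       = "192.0.2.10"   // example address
  vars.agent_endpoint = name     // convention: run agent checks here
}
```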
In addition to what @rivad already wrote: Icinga 2’s HA model works by having two master nodes sharing state. You can, however, add more satellites on demand. Doing so may not solve your issues, though, since Icinga 2 synchronizes its state via its cluster protocol; just adding more connected machines will not always lift resource limits.
If you plan your setup with zones and place satellites in each of them, scaling should not be much of an issue.
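A sketch of such a hierarchy in `zones.conf`, with one satellite zone per location (all names are made up):

```
object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org" ]
}

// Two satellites in one child zone also form an HA pair for that zone
object Endpoint "satellite1-dc1.example.org" { }
object Endpoint "satellite2-dc1.example.org" { }

object Zone "dc1" {
  parent    = "master"
  endpoints = [ "satellite1-dc1.example.org", "satellite2-dc1.example.org" ]
}
```

Hosts assigned to zone "dc1" are then checked by those satellites, which keeps the check execution load off the masters.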
Running the checks on the satellites helps a lot to offload the masters, but as @apenning stated, it will not help with limits on cluster messages per second or with a potential DB bottleneck.
2500 hosts with ping check every 180s (13 cps)
15000 service checks, of which about 90% are via SNMP; the rest are port checks, check_by_ssh, HTTPS etc., every 300s (50 cps)
This is a two-node active-active hardware setup with 32 CPUs and 64 GB of memory. Icinga, Icinga Web, Apache and MariaDB are all running on these two nodes. They each use somewhere between 5% and 10% of the available CPU and about 20% of the available memory; or, if you prefer the Unix load average, it is around 2 (with 32 CPUs).
I could reduce the time spent on SNMP, and with it some of the load, by only loading data from the OID HOST-RESOURCES-TYPES::hrStorageFixedDisk when checking disk capacity. Most plugins walk the whole OID “.1.3.6.1.2.1.25.2.3.1” and thereby also load HOST-RESOURCES-MIB::hrStorageAllocationFailure, which can take quite long on some systems. You can find my modification here: GitHub - lko23/check_usolved_disks: Nagios Plugin for checking all disks on a windows or linux machine