Large Scale Icinga deployment

Hello all, hope you are well.

We are working on replacing our monitoring stack, and Icinga became a candidate for the SNMP monitoring part. Does anyone use Icinga with 25,000+ devices? How big is the resource consumption of such an environment? How do you scale up quickly (horizontal/vertical scaling)? How do you plan for high availability?

It would be great to talk to someone with this experience.

Thank you.

Welcome to Icinga and the Icinga Community forum.

There are setups with 25k hosts and more. In my experience, the more relevant limit is how often each service is checked and how many state changes occur. But in general, 25k hosts with multiple services each should be nothing to worry about.

The resource consumption depends heavily on your setup. If you are aiming at 25k hosts, I would advise starting with no less than 32 GiB of memory and a decent number of CPU cores. Less may also work, but eventually you will run into barriers.

In a nutshell: Icinga supports HA with two nodes on different levels, i.e. for the Icinga 2 masters as well as for satellites. More information is in the docs:
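As an illustration, the two-node master HA pair looks roughly like this in zones.conf (a sketch only; the hostnames are placeholders, not from this thread):

```
// zones.conf on both masters: two endpoints in one "master" zone
// form the built-in HA pair.
object Endpoint "master1.example.org" { }
object Endpoint "master2.example.org" { }

object Zone "master" {
  endpoints = [ "master1.example.org", "master2.example.org" ]
}
```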

Now to the juicy part: SNMP. People are using Icinga 2 for SNMP monitoring and it is described in the docs, Agent Based Monitoring - Icinga 2. However, since I personally don’t have much SNMP experience, I would like to hear from other people on this topic.

I do have some numbers about the usage:

  • ICMP (~208 cps)
    • Ping: 25k checks every 120 s => ~208 cps
  • SNMP (~200 cps)
    • Device [Note: we could develop one single check for uptime+RAM+CPU]
      • Uptime: 20k checks every 300 s => ~67 cps
      • RAM: 20k checks every 300 s => ~67 cps
      • CPU: 20k checks every 300 s => ~67 cps
    • Interfaces [Note: assuming 3 monitored interfaces per device]
      • Status + statistics: 60k checks every 900 s => ~67 cps
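Each rate above is just the check count divided by its interval; a quick Python sketch (numbers copied from the list, purely illustrative) to sum them up:

```python
# Rough checks-per-second estimate for the figures above.
# These are the numbers from the post, not a sizing guarantee.
checks = {
    # name: (number of checks, interval in seconds)
    "ping":       (25_000, 120),
    "uptime":     (20_000, 300),
    "ram":        (20_000, 300),
    "cpu":        (20_000, 300),
    "interfaces": (60_000, 900),
}

cps = {name: n / interval for name, (n, interval) in checks.items()}
total = sum(cps.values())

for name, rate in cps.items():
    print(f"{name:11s} ~{rate:.0f} checks/s")
print(f"{'total':11s} ~{total:.0f} checks/s")   # ~475 checks/s overall
```

So the whole plan comes to roughly 475 checks per second before any retries or rechecks on state changes.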

Is there a way to plan capacity for these numbers?

You mentioned HA, but what about scaling? E.g. if I have a sudden increase in demand, can I scale by deploying more satellites/masters?

Per zone there’s a maximum of two nodes, but AFAIK there’s no horizontal or vertical limit on the number of satellite zones.
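So adding capacity mostly means adding satellite zones under the masters; a rough zones.conf sketch with made-up names:

```
// Each satellite zone can hold up to two endpoints (its own HA pair),
// and you can add as many such child zones as needed.
object Endpoint "sat-dc1-a.example.org" { }
object Endpoint "sat-dc1-b.example.org" { }

object Zone "satellite-dc1" {
  endpoints = [ "sat-dc1-a.example.org", "sat-dc1-b.example.org" ]
  parent = "master"
}
```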

If you want one SNMP plugin per host I can recommend monitoring-plugins/check-plugins/snmp at main · Linuxfabrik/monitoring-plugins · GitHub.
It uses a CSV file to specify the OIDs and manipulate the results with a bit of inline Python - very flexible. I would recommend Ansible and Git to manage the CSV files.

Here is an interesting post about scaling and IDO, which is the old data backend for Icinga 2.

Since Icinga DB should perform better, this could be used as a baseline.

Also keep in mind the timeout of the checks. If your custom check waits too long for a result and/or the Icinga check timeout is set too long, this can trigger unintended load on your system.
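In Icinga 2 the relevant knob is the check_timeout attribute; a hedged config sketch (the service name, command and value below are only examples, not recommendations):

```
// Cap how long the plugin may run before Icinga 2 kills it.
// 30s is an arbitrary example value; tune it to your SNMP targets.
apply Service "snmp-uptime" {
  check_command = "snmp"
  check_timeout = 30s
  assign where host.vars.snmp_community
}
```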

Most bigger setups I know are using the Icinga Agent (or sometimes check_by_ssh) instead of SNMP. So I personally cannot say how much resources the SNMP checks will require.

However, in case this would be applicable for you, consider using the Icinga Agent instead of SNMP since there are check plugins available for all three things you have listed.

In general, these are not that many checks to be executed in parallel.

Next to what @rivad already wrote: Icinga 2’s HA model works by having two master nodes sharing state. However, you are able to add more satellites on demand.

But maybe doing so will not solve your issues since Icinga 2 synchronizes its state via its cluster protocol. Thus, just adding more connected machines will not always resolve resource limits.

When planning your setup with zones, with satellites in each of them, scaling should not be much of an issue.

Running the checks on the satellites helps a lot to offset the load, but as @apenning stated, it will not help with limits on the cluster messages per second or a potential DB bottleneck.

Here is also a post about hardware sizing that might help:

Some more numbers from our setup:

2500 hosts with a ping check every 180 s (~14 cps)
15000 service checks, of which about 90% are SNMP; the rest are port checks, check by ssh, https etc., every 300 s (50 cps)

This is a two-node active-active hardware setup with 32 CPUs and 64 GB of memory each. Icinga, Icinga Web, Apache and MariaDB are all running on these two nodes. They each use somewhere between 5% and 10% of the available CPU and about 20% of the available memory; or, if you prefer Unix load average, it is around 2 (with 32 CPUs).

I could reduce the time spent on SNMP, and with it somewhat the load, by only loading data from the OID HOST-RESOURCES-TYPES::hrStorageFixedDisk when checking disk capacity. Most plugins walk the whole OID “.1.3.6.1.2.1.25.2.3.1” and thereby also load the data for HOST-RESOURCES-MIB::hrStorageAllocationFailure, which can take quite long on some systems. You can find my modification here: GitHub - lko23/check_usolved_disks: Nagios Plugin for checking all disks on a windows or linux machine

Hope this helps.
