I’m in the process of migrating from IDO to IcingaDB, and I’m having trouble finding detailed best practices and recommendations for sizing the database and Icinga-Redis in the official documentation.
In particular, I am seeking guidance on:
Sizing the database and Icinga-Redis: Are there any rules of thumb or guidelines to follow for estimating the required RAM, CPU, and IOPS, especially considering the scale of monitoring required in a multi-master setup?
Self-monitoring: What metrics should be monitored regarding IcingaDB and Icinga-Redis in a multi-master setup to ensure optimal performance and early problem detection?
Any insights or pointers to relevant resources would be greatly appreciated.
It is important to consider the number of hosts and services. My environment has 338 hosts and 2,832 services. I hope this information is of some help to you.
Thank you for your response. Unfortunately, I’m working with a significantly larger setup, so it isn’t directly comparable. My database runs on a 3-node Galera cluster with 4 CPUs and 24 GB RAM each. Both the masters and satellites are equipped with 16 GB RAM and 16 CPUs each.
The IDO database currently stands at approximately 5.5 GB with a 1-year retention policy applied. A configuration reload takes about 2 minutes and 30 seconds. I’m not at liberty to disclose the number of hosts and services monitored by this setup. I just want to avoid unpleasant surprises, because I cannot reproduce the actual load in the development and staging environments.
Server sizing will depend on the size of your setup, so I’m not going to suggest specifics. Beyond server size, though, some of the things you want to consider if you are planning to have thousands or tens of thousands of hosts and services are:
keeping things separate helps: Icinga masters, the database, and IcingaWeb2 each on their own hosts.
don’t run checks on the masters; if that means creating a dedicated satellite to run checks, do that (a minimal zone sketch follows this list).
be careful of how you use apply rules and/or automation. You should use them, but try to break things up a little, so that when you change something only part of your configuration changes rather than all of it (there is an apply-rule sketch of this at the end of this post). The reload bomb is real.
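To make the dedicated-satellite point concrete, here is a minimal zones.conf sketch of a separate check zone. This is only an illustration: the endpoint name, address and zone name are placeholders, and it assumes the master zone is called “master”.

```
// Minimal sketch: a dedicated satellite zone so checks do not execute on the masters.
// Endpoint name, address and zone name are placeholders.
object Endpoint "satellite1.example.com" {
  host = "192.0.2.10"               // placeholder address
}

object Zone "satellite-checks" {
  endpoints = [ "satellite1.example.com" ]
  parent = "master"                 // assumes the master zone is named "master"
}
```

Hosts assigned to that zone (for example via Director’s cluster zone setting, or by placing their config under zones.d/satellite-checks) then have their checks executed by the satellite rather than the masters.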
To expand on the apply rules and automation point: if you import hosts with Director automation, it is better to have multiple small import sources and sync rules rather than a single large import source and sync rule.
This is because a change to a single large import source may result in all of your hosts changing, which is a large reload task for the masters. Multiple small import sources allow you to make more targeted changes, or to spread them out over multiple deployments. It also helps contain human error in sync rules, as the fallout is smaller.
Examples of breaking up import sources might be AWS hosts, VMware hosts, and physical hosts, or splitting by department. The best way to do this depends on your needs.
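The same “break it up” idea applies to plain apply rules as well as Director sync rules. A rough sketch, assuming a made-up host.vars.platform custom variable and the generic-service template from the sample configuration:

```
// Rough sketch: one apply rule per platform instead of a single catch-all rule,
// so editing one rule only re-renders that slice of the configuration.
// host.vars.platform is a made-up custom variable used for illustration.
apply Service "ping-vmware" {
  import "generic-service"          // assumes the sample generic-service template exists
  check_command = "ping4"
  assign where host.vars.platform == "vmware"
}

apply Service "ping-aws" {
  import "generic-service"
  check_command = "ping4"
  assign where host.vars.platform == "aws"
}
```

Editing the “vmware” rule then only touches those services, which keeps individual deployments and reloads smaller.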