I have no experience with such big setups yet, sadly.
Just some figures from the biggest one I have set up so far:
Monitoring ~2100 hosts and ~3200 services.
HW setup:
Master-Cluster: 4vCPU, 8GB RAM, 60GB HDD each.
Satellite-Cluster: 4vCPU, 8GB RAM, 30GB HDD each.
DB: 4vCPU, 8GB RAM, 60GB HDD
Stats:
Load:
- Master
  - is bored, load around 0.6 - 1.1
- Satellites (doing the most checks: ~1650 hosts, ~2900 services)
  - still bored, 0.7 - 1.8 mostly
- DB
  - around 1.0 - 1.3

Memory:
- Master
  - master1 ~60%, master2 ~10% (assuming this is due to master1 being used as the main web interface)
- Satellites (doing the most checks: ~1650 hosts, ~2900 services)
  - ~10%
- DB
  - ~20%

Disk (/var):
- Master
  - around 9 - 11 GB
- DB
  - around 6 GB
The setup has been running for about a year now.
As you plan to run mostly SNMP checks, keep in mind that checks which create temporary cache files (as some checks from check_nwc_health do) can have a considerable performance impact.
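In case it helps, one common way to soften that impact is to keep such cache/state files on a tmpfs so the frequent rewrites never touch the disk. This is only a sketch: the directory /var/tmp/check_nwc_health and the size are assumptions, so check where your plugin version actually writes its state files (check_nwc_health can usually be pointed at a custom directory via --statefilesdir) before adopting it.

```
# /etc/fstab sketch (assumed path and size, adjust to your environment):
# keep the plugin's temporary cache files in RAM instead of on the /var disk
tmpfs  /var/tmp/check_nwc_health  tmpfs  size=256m,mode=1777  0 0
```

After adding the line, `mount /var/tmp/check_nwc_health` activates it without a reboot. The files are gone after a reboot, which for pure cache data is usually fine.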