I am not sure whether I am missing something regarding the log duration and the optimal configuration for distributed setups.
First some notes about our setup:
- Icinga2 2.10.3
- IcingaDirector 1.6.2
- The Graphite module is enabled and we use Grafana to visualize the performance data
So, my understanding was that log_duration as an endpoint attribute tells the endpoint how long to keep a replay log of all check results, etc. on the client if the connection to the parent gets lost. Once the connection recovers, the client replays that log, all the data is sent to the parent, and there should be no gap as long as the network partition lasts less than log_duration.
So I set this parameter but still see gaps in our graphs when the clients cannot connect to the parents for some minutes, and notifications for state changes that happened while the connection was lost were not sent. I will do some more tests and share the results here, because the missing notifications in particular were an issue the last time this happened.
The actual questions are:
- I set the log_duration to 172800 seconds (2 days) in the Icinga2 Director and `icinga2 object list --type endpoints` lists a log duration of 86400 (which is the default value). This confused me in the beginning but it actually makes sense because the log_duration must be set in `/etc/icinga2/zones.conf` to override the attribute for that endpoint, and this file is not one of those that get synced via the config sync. What is part of the replay log sent from a parent to the child? Are configuration changes, downtimes, etc. stored in a replay log, too?
- Is there anything configuration-wise I can improve to avoid the gaps in graphs? I will take a look at the exact data being sent to the carbon-cache to see if anything looks suspicious there, but you might have some hints already.
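
For reference, here is a minimal sketch of the kind of per-endpoint override described in the first question, as it might appear in /etc/icinga2/zones.conf. The endpoint name and host are hypothetical placeholders, not our real values, and `2d` is meant as the duration-literal form of the 172800 seconds mentioned above.

```
// /etc/icinga2/zones.conf (sketch; endpoint name and host are hypothetical)
object Endpoint "icinga2-parent.example.org" {
  host = "icinga2-parent.example.org"

  // Keep the replay log for this endpoint for 2 days (172800 seconds).
  // Without this line the default of 1d (86400 seconds) applies.
  log_duration = 2d
}
```

After changing the file and reloading Icinga 2 on that node, the object list command from the first question should show whether the new value is actually in effect.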