I am not sure whether I am missing something regarding log_duration and the optimal configuration for distributed setups.
First some notes about our setup:
Icinga2 2.10.3
IcingaDirector 1.6.2
The Graphite module is enabled and we use Grafana to visualize the performance data
So, my understanding was that log_duration, as an endpoint attribute, tells the endpoint how long to keep a replay log of all check results etc. on the client if the connection to the parent gets lost. Once the connection recovers, the client replays that log, all the data is sent to the parent, and there should be no gap as long as the network partition is shorter than log_duration.
So I set this parameter but still see gaps in our graphs when the clients cannot connect to the parents for some minutes, and notifications for state changes that happened while the connection was lost were not sent. I will do some more tests and share the results here, because the missing notifications in particular were an issue the last time this happened.
The actual questions are:
I set log_duration to 172800 seconds (2 days) in the Icinga Director, but icinga2 object list --type Endpoint shows a log duration of 86400 (which is the default value). This confused me in the beginning, but it actually makes sense: log_duration must be set in /etc/icinga2/zones.conf to override the attribute for that endpoint, and this file is not among those distributed by the config sync. What is part of the replay log sent from a parent to the child? Are configuration changes, downtimes, etc. stored in a replay log, too?
Is there anything configuration-wise I can improve to avoid the gaps in the graphs? I will look at the exact data being sent to carbon-cache to see whether anything looks suspicious, but you might already have some hints.
It depends on the configuration mode you have chosen. If you don’t use top down config sync, the parents schedule all checks, and if there is no connection, no check is executed.
Hint: Please keep in mind top down config sync is not supported for Windows clients.
I would analyse why specifically the CheckResult messages stored in the replay log are not received on the master and, as a result, are not passed on to the metric writers.
As @rsx already mentioned, if command_endpoint clients are involved, the check execution origin is always the master, and the client/agent doesn’t store check results locally. While the client is not connected/unknown, no check results are written to the replay log, and so no metrics are replayed to Graphite/InfluxDB/etc. later.
The replay log’s main purpose is to solve connection problems between masters and satellites, each with their own scheduler. It also covers multiple masters in a zone where runtime events need to be kept in sync, e.g. notification (last_, next_) and scheduling details in addition to check results.
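To illustrate the difference between the two modes, here is a minimal sketch. The host name, address, file path and the custom variable agent_endpoint are made up for the example and are not taken from this thread.

```
// Variant A: command_endpoint. The master schedules the check and only
// asks the agent to execute it. While the agent is disconnected nothing
// is executed, so there is nothing to put into a replay log.
apply Service "disk" {
  check_command = "disk"
  command_endpoint = host.vars.agent_endpoint  // hypothetical custom variable
  assign where host.vars.agent_endpoint
}

// Variant B: top down config sync. The object is placed in the satellite
// zone directory on the master (e.g. zones.d/satellite/hosts.conf), so the
// satellite runs its own scheduler and can record results in its replay
// log while the master is unreachable.
object Host "app1.example.com" {
  check_command = "hostalive"
  address = "192.0.2.10"
}
```

Only in variant B can check results be produced while the connection is down, which is the case the replay log is meant for.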
We use the top down config sync method, and all checks are executed locally with the node itself as endpoint. One thing that is a bit special about our current setup is that all except for 2 clustered nodes are “dummy hosts” (check command dummy; they just receive passive check results).
Only 7 messages were replayed after the last network split (duration was ~5 minutes) between the clustered satellites and their parents. It should have been way more for the 2 satellites + the 8 dummy hosts with ~35 services each.
However, thanks for the clarification and good to know that my understanding was not totally off. I know how to proceed with the tests of our new architecture where we get rid of all the dummy hosts and each node gets a local icinga2 instance.
There are several places where the “log_duration” option can be set.
If the master is not connected to the satellite, I want the satellite to hold the check data and sync it once it is connected again. Where do I specify the desired “log_duration” to achieve this behavior?
in the master’s zones.conf for the master endpoint?
in the master’s zones.conf for the satellite endpoint?
in the satellite’s zones.conf for the master endpoint?
in the satellite’s zones.conf for the satellite endpoint?
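It is set on the node where the replay log is stored on disk, which is the satellite.
The Endpoint object to configure there is the one that will receive the replayed events from the local log, which is the master. In other words: in the satellite’s zones.conf, for the master endpoint.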
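As a rough sketch, the satellite’s /etc/icinga2/zones.conf could then look like the following. The endpoint and zone names are placeholders and need to match your own setup; the 2d value is just the 172800 seconds mentioned earlier in this thread.

```
// /etc/icinga2/zones.conf on the satellite (names are placeholders)

object Endpoint "master1.example.com" {
  host = "master1.example.com"
  // Keep replay log entries destined for this endpoint for up to
  // 2 days (172800 seconds) while the connection is down.
  log_duration = 2d
}

object Endpoint "satellite1.example.com" {
  // The local node; no log_duration override needed for this use case.
}

object Zone "master" {
  endpoints = [ "master1.example.com" ]
}

object Zone "satellite" {
  endpoints = [ "satellite1.example.com" ]
  parent = "master"
}
```

After restarting Icinga 2 on the satellite, the effective value should show up in the icinga2 object list output for the master endpoint on that node.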