Hello all,
While implementing and testing the Icinga PowerShell Plugins (installed according to the instructions), we were asked to monitor the Windows EventLog as well. So we configured the check “Invoke-IcingaCheckEventlog”.
We ran into the following problems and would like to ask about your experience, because we may be misunderstanding something.
After installing the plugins, the check does not work unless we use the switch “DisableTimeCache”. Without it, Icinga throws a permission error for the cache directory every time. With the switch, everything works fine. This may be a result of our server setup and the permissions set by our server admins.
What we also realized while using the switch “DisableTimeCache” is that this does not really make sense. Let’s assume we have a check interval of 5 minutes. During one interval a program writes a message to the event log and the check goes to warning/critical. After the next check interval the check is OK again, because no new event was written to the log during that interval.
Checking the whole event log makes no sense either, because then we would have to know how many log entries are “normal”. Otherwise our colleagues would have to delete the event after fixing the problem, which defeats the purpose of a log.
Trying the switch “After” improves the situation a little. With the Icinga DSL (using var dt = DateTime() - 24 * 60 * 60; return dt.to_string() ) we can create a timestamp like now - 24h. But our office is closed on weekends and public holidays. So if an event was written during that time, Icinga does not show us on the next working day that something happened. We would first have to check the logs of all servers again, or extend the time range.
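To make the "extend the time range" idea concrete, here is a minimal sketch in Python (not Icinga DSL) of how such a timestamp could be calculated so that it reaches back to the previous working day instead of a fixed 24h. The function name `after_timestamp`, the `holidays` set and the output string format are assumptions made for illustration only; the sketch does not show what “After” actually accepts.

```python
from datetime import date, datetime, timedelta


def after_timestamp(now, holidays=frozenset()):
    """Return the same time of day on the previous working day.

    Weekends (Saturday/Sunday) and any dates listed in `holidays`
    are skipped, so the window covers closed days as well.
    """
    day = now.date() - timedelta(days=1)
    # weekday(): Monday is 0, Saturday is 5, Sunday is 6
    while day.weekday() >= 5 or day in holidays:
        day -= timedelta(days=1)
    return datetime.combine(day, now.time())


# Example: a check running on Monday morning looks back to Friday morning.
monday = datetime(2024, 1, 8, 8, 0, 0)
start = after_timestamp(monday)
# The string format below is an assumption, adjust it to what the check expects.
print(start.strftime("%Y/%m/%d %H:%M:%S"))
```

The list of public holidays would still have to be maintained somewhere, which is the weak point of this approach.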
The next idea would be for Icinga to stop checking the event log once the check becomes critical. After fixing the problem we would have to set the check back to OK manually. But the problem here is that we would miss any further event written in the meantime.
Another possibility would be to have every server send its full event log to our ELK stack. However, we expect problems similar to those described above when checking that log.
So what do you think? What is your experience?