Log Monitoring best practice

Hello All

What is the best practice for monitoring logs? Using check_logfiles or, for that matter, any other tool?

I will describe the problem

  1. Server X - monitoring a log for “A, B … n” patterns - using a single service

There is an event manager which pulls events from Icinga and creates alerts and tickets.

The problem is that after the first event is raised, if there are no more errors in the log on the next polling cycle, the check goes back to OK in Icinga. This triggers an OK state for the service, and the alert gets closed in the event manager before a technician can even see it. We do not want to raise an OK event for this service. How can we keep the state from changing immediately on the next polling cycle?

I’d say the best way currently is to use check_logfiles for quick and simple checks, if you only need very basic log monitoring.
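A minimal sketch of such a check (tag, path and patterns are placeholders, and you should verify the option names against your check_logfiles version); the plugin remembers its read offset between runs, so each run only scans new lines:

```
# Hypothetical example: CRITICAL on ERROR/FATAL, WARNING on WARN
check_logfiles --tag=app \
  --logfile=/var/log/app/app.log \
  --criticalpattern='ERROR|FATAL' \
  --warningpattern='WARN'
```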

A way better approach (with a lot more work involved) is to use the Elastic Stack to collect your logs and use Logstash’s Icinga output to send check results to the Icinga API. If you just want log monitoring and not a full-blown log management solution, you can skip some best practices and go for a minimal setup. You can always scale it up later if you like it.
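Under the hood this boils down to submitting passive check results to the Icinga 2 REST API, which the Logstash output does for you. A rough sketch of the equivalent API call (URL, credentials, host and service names are made up):

```
# Submit a passive check result for service "Logs" on host "serverx"
curl -k -s -u api-user:api-password \
  -H 'Accept: application/json' \
  -X POST 'https://icinga-master.example.com:5665/v1/actions/process-check-result' \
  -d '{
        "type": "Service",
        "filter": "host.name==\"serverx\" && service.name==\"Logs\"",
        "exit_status": 2,
        "plugin_output": "Pattern A found in application log"
      }'
```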


Thank You.

Do you create a sampling service for log checks? The problem with sampling is that one moment the event is open for the chunk of log that was read, and on the next chunk, if there is no matching pattern, it closes the event before someone has seen it. Obviously we want to keep the event open until someone manually closes it, and at the same time be able to open more events on the same server, with the same service, for different patterns.

That’s the problem with every sort of log management. Most developers forget about “all is well again” messages. It’s just “problem - problem - problem - silence”, where silence can mean “all is well” or “I’m actually dead”.

So the way I go most of the time is to combine active and passive (log) monitoring, and have a passive check reset itself back to “OK” when there are no new messages for some time.

I use a Logstash config to identify problematic log messages and only forward those to Icinga. The basic approach is to have one “Logs” service per host, applied via an apply rule, which catches all logs for this host. The more sophisticated approach is to have several log services for specific hosts. All of them are passive and will be reset after some time; see the sketch below.
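A rough sketch of both variants in Icinga 2 DSL (service names, custom variables and assign filters are placeholders):

```
// Basic approach: one catch-all passive "Logs" service per host.
apply Service "Logs" {
  check_command = "passive"        // ITL dummy-based command, only used on freshness timeout
  enable_active_checks = false     // results come in via the API (e.g. from Logstash)
  assign where host.vars.os == "Linux"   // hypothetical assign rule
}

// More sophisticated: one passive service per log source on selected hosts,
// generated from a (hypothetical) dictionary host.vars.logs.
apply Service "Logs-" for (logname => config in host.vars.logs) {
  check_command = "passive"
  enable_active_checks = false
  vars += config
}
```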

I had some customers ask for the workflow you described, i.e. having to manually reset the services once the problem is gone. Very soon they started wishing for a bunch of trained monkeys that do nothing but hit the “OK” button. Some service restarts or updates created so many alerts that they were completely overwhelmed.

Hi,
I think the easiest way would be the passive approach, as @twidhalm mentioned.

You could even use a cronjob that checks the log and, on an error, writes the message via the Icinga 2 API to your passive check. This way you need to set it back to OK manually.
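A minimal sketch of such a cronjob (paths, pattern, credentials and the service name are all hypothetical; offset handling is kept deliberately simple):

```
#!/bin/bash
# Scan only the lines written since the last run and push a passive
# CRITICAL result to Icinga 2 if an error pattern shows up.
LOG=/var/log/app/app.log
STATE=/var/tmp/app-log.offset
PATTERN='ERROR|FATAL'

OFFSET=$(cat "$STATE" 2>/dev/null || echo 0)
SIZE=$(stat -c %s "$LOG")
[ "$SIZE" -lt "$OFFSET" ] && OFFSET=0        # logfile was rotated, start over

MATCHES=$(tail -c +"$((OFFSET + 1))" "$LOG" | grep -E -c "$PATTERN")
echo "$SIZE" > "$STATE"

if [ "$MATCHES" -gt 0 ]; then
  curl -k -s -u api-user:api-password \
    -H 'Accept: application/json' \
    -X POST 'https://icinga-master.example.com:5665/v1/actions/process-check-result' \
    -d "{ \"type\": \"Service\",
          \"filter\": \"host.name==\\\"$(hostname)\\\" && service.name==\\\"Logs\\\"\",
          \"exit_status\": 2,
          \"plugin_output\": \"$MATCHES new error pattern match(es) in $LOG\" }"
fi
```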


Hello Thomas

You nailed it perfectly. So you are using a combination of active (sampled) and passive checks. But I don’t know how you manage to stop the active check from sending an OK event on the next cycle, when it sees that there is no error in the logs.

What I mean is: a sampled service checks the log for a pattern, finds an error, and opens an alert. On the next cycle the same service checks the log, sees no error, and closes the alert. How do you make the active check not close the event, and instead rely on the passive check to close the active check’s event?

I’m speaking of different services.

The active ones are the usual: CPU, memory, services, API, etc.
The passive ones rely solely on passive events coming in from the logs. They do have a CheckCommand like passive configured, so when the freshness threshold is reached it fires once and resets the service to OK. But this only happens if you don’t receive any new log messages over a longer period of time (i.e. the freshness threshold); see the sketch below.
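A minimal sketch of that freshness behaviour (host name, interval and texts are placeholders; the ITL passive CheckCommand is based on dummy, so its result can be overridden via vars):

```
object Service "Logs" {
  host_name = "serverx"              // hypothetical host
  check_command = "passive"
  enable_active_checks = false

  // With active checks disabled, check_interval acts as the freshness threshold:
  // once it elapses without a new passive result, the "passive" CheckCommand
  // runs once and its dummy result is applied to the service.
  check_interval = 2h

  vars.dummy_state = 0               // reset to OK (the default would be 3/UNKNOWN)
  vars.dummy_text = "No new log events within the freshness period"
}
```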


Hi,

Thomas Widhalm’s is certainly the most sophisticated and best solution.

We use a more basic approach with check_logfiles in active mode.
After some unsuccessful experiments with the sticky parameter, we ended up with another solution.
Our approach assumes that, most of the time, a problem reported in a log never resolves or recovers by itself.
We then defined a notification template for check_logfiles where the states are “Warning, Critical”. The service fires an alert for a Warning or Critical state, but no alert for the OK state (which the next run of check_logfiles returns by default).
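In Icinga 2 DSL that could look roughly like this (the notification command, user, time period and assign filter are placeholders for our setup):

```
// Notify only on Warning/Critical; with OK missing from "states" (and Recovery
// from "types") no notification is sent when check_logfiles returns OK again.
template Notification "logfiles-notification" {
  command = "mail-service-notification"   // assumption: the stock mail notification script
  states = [ Warning, Critical ]
  types = [ Problem, Custom ]
  period = "24x7"
}

apply Notification "logfiles-notification" to Service {
  import "logfiles-notification"
  users = [ "oncall" ]                                      // hypothetical user
  assign where service.check_command == "check_logfiles"    // hypothetical command name
}
```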
