Icinga and Prometheus - what's the difference?

Extracted from this discussion.

Simon asks:

Could someone, in short, explain to me the difference between Icinga 2/Icinga Web 2 and Prometheus?

As I see it right now, Icinga executes checks: check that a disk isn’t getting full, that different services are running, that specific ports are open, and that SSL certificates are valid for at least 20 more days. Icinga does not save metric data over time. Prometheus tracks process performance over time, for example memory usage by Passenger, open WebSocket connections, or web request time.

Michael adds:

One difference is that Icinga actively executes check scripts which return state, output and performance data metrics. These values are collected and used for further state history calculation, notifications, dependencies, etc. Metrics can be forwarded to popular TSDB backends for storage.

Prometheus implements its own TSDB afaik. v2.0 has a rewritten one which is not compatible with v1.0. To my knowledge, services need to export metrics via an HTTP /metrics endpoint, and you configure Prometheus to scrape them there.
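
For illustration, here is a minimal sketch of that exporter pattern in Python, using the prometheus_client library; the metric name, port, and update logic are just placeholders, not anything Icinga or Prometheus ships.

```python
# Minimal sketch of the /metrics exporter pattern, assuming prometheus_client.
from prometheus_client import Gauge, start_http_server
import random, time

# A gauge the application would update from its own internals
# (hypothetical example metric).
app_open_connections = Gauge(
    "myapp_open_connections",
    "Currently open connections (hypothetical example metric)"
)

if __name__ == "__main__":
    # Serve the /metrics endpoint on port 8000; Prometheus is then
    # configured to scrape http://<host>:8000/metrics.
    start_http_server(8000)
    while True:
        app_open_connections.set(random.randint(0, 100))  # placeholder value
        time.sleep(15)
```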

Discovered metrics, e.g. from container services, are stored more easily. Based on the stored data, you can create queries for alerts. There is no central host/service model with static configuration.

I haven’t tried Prometheus in detail yet, but I could think of the following questions:

  • Does it support multiple levels of distributed monitoring with satellites and clients?
  • Is it possible to configure the connection direction, e.g. into the DMZ or from inside the DMZ?
  • How do you apply dependencies/reachability prior to alerts?
  • Security: TLS, CN validation, etc.

To me, both worlds follow different approaches and probably can be integrated in common scenarios.

Michael, after following some Twitter discussions and talks, adds:

I had a look into it lately, since I was doing research on tools and their capabilities for SNMP monitoring et al.

If your service doesn’t expose an HTTP endpoint with metrics, you need to write a wrapper or use a converter script to pass these things into Prometheus.

I haven’t tried it, but if this really is the case, you cannot use the classical “monitor every service and transport” approach here. Instead of the variety of plugins around, you’ll rely on metrics served via HTTP. If your services (and devs) don’t provide them, using Prometheus in your environment won’t be fun. No metrics, no alerts, no SLA.

It is highly likely that an integration with Prometheus makes sense, where you put your classical service monitoring with Icinga and variants up front. Then you’ll expose the plugin perfdata metrics via HTTP to Prometheus to allow them to be collected. A similar thing was requested on GitHub already.
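
A rough sketch of that wrapper/converter idea, assuming a classic Nagios-style plugin and the Python prometheus_client library; the plugin path, port, and metric naming are made up for illustration, not an official integration.

```python
# Run a classic check plugin, parse its performance data, and expose it
# via /metrics so Prometheus can scrape it. Everything here is an assumption.
import re
import subprocess
import time

from prometheus_client import Gauge, start_http_server

# Hypothetical plugin invocation; any Nagios-style plugin with perfdata works.
PLUGIN = ["/usr/lib/nagios/plugins/check_load", "-w", "5,4,3", "-c", "10,8,6"]

gauges = {}

def run_plugin():
    out = subprocess.run(PLUGIN, capture_output=True, text=True).stdout
    # Perfdata follows the pipe, e.g. "OK - load average ... | load1=0.10;5;10;0;"
    if "|" not in out:
        return
    for label, value in re.findall(r"(\w+)=([\d.]+)", out.split("|", 1)[1]):
        g = gauges.setdefault(
            label, Gauge(f"plugin_{label}", f"Perfdata value {label}")
        )
        g.set(float(value))

if __name__ == "__main__":
    start_http_server(8001)  # arbitrary port for this sketch
    while True:
        run_plugin()
        time.sleep(60)
```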

Simon replies:

Cool, I think an integration would be great.

Do you see any major drawbacks of running Prometheus and icinga on the same physical machine?

Some of these metrics look very interesting for my use case: https://samsaffron.com/archive/2018/02/02/instrumenting-rails-with-prometheus. I am considering just starting a separate container and installing Prometheus there, without any integration with Icinga.

Regarding an integration: what do you see as the benefits of having them integrated rather than separate? Single responsibility means that if one crashes, it can’t take the other one down.

Michael replies:

I have never run a Prometheus instance myself, so I know nothing about its resource requirements. I wouldn’t run two monitoring applications on the same host though, as the failure of one (OOM or a full disk, for example) could kill the other one.

In terms of integration, I see Prometheus as a metric collector that Icinga could query against, similar to InfluxDB or Graphite. For cluster and container checks with highly volatile data, this sounds like an interesting idea.
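
As a thought experiment, a check plugin in that direction could look roughly like this: it evaluates a PromQL expression via the Prometheus HTTP query API and maps the result to Icinga exit codes. The URL, query, and thresholds are assumptions for illustration, not an existing plugin.

```python
# Hedged sketch: an Icinga/Nagios-style check that queries Prometheus.
import sys
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"  # assumed Prometheus address
QUERY = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))'  # example query
WARN, CRIT = 0.8, 0.9  # example thresholds

def main():
    result = requests.get(PROMETHEUS, params={"query": QUERY}).json()["data"]["result"]
    if not result:
        print("UNKNOWN - query returned no data")
        return 3
    value = float(result[0]["value"][1])
    state, code = "OK", 0
    if value >= CRIT:
        state, code = "CRITICAL", 2
    elif value >= WARN:
        state, code = "WARNING", 1
    # Output with perfdata so Icinga can keep forwarding metrics as usual.
    print(f"{state} - cpu usage {value:.2f}|cpu_usage={value:.2f};{WARN};{CRIT}")
    return code

if __name__ == "__main__":
    sys.exit(main())
```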

On the other hand, if Prometheus collects metrics, why not add the /metrics endpoint as an export and allow all plugin performance data metrics to be collected in Prometheus?

Those are just ideas off the top of my head, nothing I have tried nor designed. Waiting for community members to step up and actually build such things :slight_smile:

Jan adds different monitoring types:

For starters, you could search for the difference between whitebox and blackbox monitoring.

Also this article might be helpful to see the difference: https://insights.sei.cmu.edu/devops/2016/08/whitebox-monitoring-with-prometheus.html

Assaf shares his experience:

I have implemented both systems (at differing scales) and can say that comparing them does not do justice to either.

Icinga is an active system where you actively check the status of whatever you want to monitor.
Prometheus is a more passive system that scrapes data from individual exporter services running on the target nodes at a pre-set interval (which can be altered), but out of the box it will not complain if a metric stops arriving or if it cannot scrape the data from a node.

The micro-services approach of Prometheus also adds to the management (and distribution) overhead, as each piece of functionality is a separate service that has to be managed and configured: Prometheus, Alertmanager, the individual exporters (the services on the remote nodes that expose the metrics), and any other components.

Prometheus’s own graphical interface is lacking, to say the least, and requires the integration of a third-party tool, mainly Grafana, to create dashboards and visualise the metrics.

While Icinga was not built as a time-series metric collector but as a “state probe” tool, Prometheus was, and as such they function with different approaches and methodologies. Granted, they are both monitoring tools, but each was built with a different goal in mind.

What are your two cents on the matter?

I’m revisiting this story after a year full of learning, and my opinions have shifted slightly. I will update this topic with additional work I am planning to do.

Icinga and Prometheus

Classic service monitoring has one big issue:

  • Containers are volatile and may not exist as a “hostname” object
  • A Kubernetes cluster with only 2 of 10 HTTP containers running will still let the site operate
  • At active check polling time the overall state may be critical, but five seconds later the cluster has healed itself. So you’ll generate many false alarms by accident.

Moving this to an event-based approach that calculates metric trends and removes spikes helps here.

So the main idea is not to move to either Icinga or Prometheus exclusively, but to gather the best of both worlds and integrate them, if possible.

Or, likewise, to extract their data sources and combine them into Grafana dashboards, alerts, and specific views.

Integrations

What was the plan?

Add an experimental /metrics endpoint to 2.9 or 2.10 to allow Prometheus to use Icinga as a scrape target.

The idea was not only to provide /v1/status but also to expose host/service-specific metrics for Prometheus.

What happened?

Icinga 2.10 introduced severe issues with the REST API and cluster protocol. The performance was so bad that I was wary of adding a /metrics endpoint. My answer was always to wait for a stable 2.11 with a rewritten network stack. https://icinga.com/2019/09/19/icinga-2-11/

2.11 introduced other issues, and somehow burned me out a little. I’m getting back on track here lately.

What’s the new plan?

Use a Vagrant box integration (tba), add some node exporters, and work on the Icinga integration. Explore the possibilities of both ecosystems.

Incorporate InfluxDB / Telegraf here too; Telegraf can also write to Prometheus.

Pull

Either create /metrics or /probe endpoints, similar to the SSL exporter. Or use the opsdis Python exporter, which pulls from the Icinga API. Or rewrite it in Go as a transparent proxy with buffering/caching up front.
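
To sketch the pull variant, here is roughly what such an exporter could do (in the spirit of the opsdis one): query the Icinga 2 REST API for service objects and re-expose their states as Prometheus gauges. The API URL, credentials, port, and label choices are assumptions for a lab setup, not a finished exporter.

```python
# Pull sketch: Icinga 2 REST API -> Prometheus gauges.
import time
import requests
from prometheus_client import Gauge, start_http_server

ICINGA_API = "https://localhost:5665/v1/objects/services"  # assumed endpoint
AUTH = ("root", "icinga")  # example API user

service_state = Gauge(
    "icinga_service_state",
    "Icinga service state (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN)",
    ["host", "service"],
)

def scrape_icinga():
    # verify=False only for a lab box; use proper TLS validation in production.
    resp = requests.get(ICINGA_API, auth=AUTH, verify=False)
    for obj in resp.json().get("results", []):
        attrs = obj["attrs"]
        service_state.labels(
            host=attrs["host_name"], service=attrs["name"]
        ).set(attrs["state"])

if __name__ == "__main__":
    start_http_server(9666)  # arbitrary port for this sketch
    while True:
        scrape_icinga()
        time.sleep(30)
```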

Push

The push approach can be used to passively send in events and results.

This needs a work queue or buffer that is flushed periodically, similar to the InfluxDB writer.
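
A small sketch of that push direction, assuming the Prometheus Pushgateway and the prometheus_client library: results are buffered in a queue and flushed in one go, loosely mirroring the InfluxDB writer’s behaviour. Gateway address, job name, and metric layout are made up for illustration.

```python
# Push sketch: buffer check results, then flush them to a Pushgateway.
import queue
import time
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "localhost:9091"  # assumed Pushgateway address
results = queue.Queue()

def enqueue_result(host, service, state):
    # Called whenever a new check result arrives.
    results.put((host, service, state))

def flush():
    # Build a fresh registry per flush and push everything buffered so far.
    registry = CollectorRegistry()
    g = Gauge("icinga_check_state", "Buffered check result state",
              ["host", "service"], registry=registry)
    while not results.empty():
        host, service, state = results.get()
        g.labels(host=host, service=service).set(state)
    push_to_gateway(PUSHGATEWAY, job="icinga2", registry=registry)

if __name__ == "__main__":
    enqueue_result("web01", "http", 0)  # example data
    flush()
    time.sleep(1)
```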

There’s one problem with historical data though:

Either we convince the Prometheus authors to allow replaying historical data, or we’ll have to live with the fact that only live data works.


Here’s a first design draft/concept including tasks.


Hello @dnsmichi:

This deserves a standing ovation. Thank you very much for trying this.

I just have one question: do the tasks include monitoring Kubernetes clusters with this integration?

Thanks again.

Hi,

thanks :kissing_heart: Kubernetes in general is on my list when trying to implement the mentioned ideas.

https://github.com/kubernetes/kube-state-metrics and Prometheus, for example, serve as a good starting point.

This is an ongoing effort, so it will take a while until there’s visible progress. For now, I am collecting ideas on how to make this possible without re-inventing the wheel too much.

Cheers,
Michael


Hi,

a personal update on the matter - I will continue looking into this topic as part of my new role as Developer Evangelist at GitLab.

You can read more about my new adventure here:

Cheers,
Michael

Edit: Since there were unforeseen things, I left Icinga and won’t dig any further into integrations. I will be following monitoring tools closely, and sharing cool stuff on Twitter.
