Icinga object writers

Hi experts. I am developing a PoC where we will run two object writers in our environment. We have around 50k service checks and 65 satellites. I set everything up in a test/development environment and everything works perfectly … I have even created 3 InfluxDB object writers and 1 Graphite object writer: two for InfluxDB 1.8 with different tagging rules and one for InfluxDB 2.0, which is a totally different thing. Anyway, in the test environment it works perfectly fine, but there I only have 2 satellites, 2 endpoints and around 20 service checks. I have even tested it with the old version 2.10.4.
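For reference, the writer objects in the test environment look roughly like this (the object names, hosts, databases and tags below are simplified placeholders, not our exact configuration):

```
// Two InfluxdbWriter objects for InfluxDB 1.8 with different tagging,
// plus an Influxdb2Writer for InfluxDB 2.0.
object InfluxdbWriter "influxdb-perf" {
  host = "influxdb18.example.com"
  port = 8086
  database = "icinga2"
  flush_threshold = 1024
  flush_interval = 10s
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      service = "$service.name$"
    }
  }
}

object InfluxdbWriter "influxdb-team" {
  host = "influxdb18.example.com"
  port = 8086
  database = "icinga2_teams"
  service_template = {
    measurement = "$service.check_command$"
    tags = {
      hostname = "$host.name$"
      team = "$service.vars.team$"   // different tagging rule than the writer above
    }
  }
}

object Influxdb2Writer "influxdb2" {
  host = "influxdb2.example.com"
  port = 8086
  organization = "monitoring"
  bucket = "icinga2"
  auth_token = "REPLACE_ME"
}
```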

I am wondering: if I add an additional object writer to the prod environment, which has around 50k service checks, should I expect higher utilization on our Icinga 2 masters? Like much higher?

We want to get rid of Graphite and its carbon-cache processes … so to avoid a data migration we want to just add another object writer and then, when the time is right, turn off Graphite and use an InfluxDB and Grafana solution instead …

Any thoughts/advice?

Kind regards,
Josip

In a normal state the object writers do not cause much CPU or memory load, so it is more or less network I/O which could become a bottleneck. In an error state, when the target is not reachable, output will be queued, which results in memory consumption.

I have no production environment with that many writers, but I do not expect it to be a problem as long as the network is stable and fast and there is some free memory on the server in normal situations.

An option I typically use for Graphite is a local carbon-relay, which should reduce the memory needed by Icinga 2 itself and should be more efficient (and has more options to tweak). I am not sure if a similar option exists for InfluxDB, as I have not dug deeper into this topic yet.
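On the Icinga 2 side this just means pointing the GraphiteWriter at the relay on localhost instead of at the remote carbon-cache, roughly like this (2003 is the usual carbon line receiver port, adjust it to whatever your relay listens on):

```
// GraphiteWriter sending to a carbon-relay on the same host;
// the relay then forwards to the remote carbon-cache(s).
object GraphiteWriter "graphite" {
  host = "127.0.0.1"   // local carbon-relay, not the remote Graphite box
  port = 2003          // carbon line receiver port of the local relay
  enable_send_thresholds = true
}
```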

Hi Dirk,

many thanks for the fast response. Yes, all our “icinga2 master” components are practically sitting next to each other in the same network, so there should be more than enough network throughput.

Do you mean you run the carbon-relay on the Icinga 2 master, or is it located on the Graphite server (side by side with the other carbon-cache processes)?

Directly next to the Icinga 2 instance that is writing, i.e. on the same server, so it should always be reachable and no objects should be queued in Icinga 2.

This is something I started doing when a customer had network problems which filled up the queue in Icinga 2, while the system had plenty of RAM that I could utilize with a carbon-relay with an infinite queue.

ahaam … this is amazing then … Sounds really cool and could be the answer to some problems we had in the past: they appeared with network issues and caused our Icinga 2 to crash … with a message in /var/log/messages … SIG something, complaining about memory and killing icinga2.service … this must be it …

InfluxDB doesn’t have processes like carbon-cache or carbon-relay; it is just a time series DB which sits on a server and listens on port 8086 for writes … so that means it would behave like a carbon-relay installed somewhere else, not on the Icinga 2 server next to the icinga2 service/process itself … so if a network outage appears, Icinga 2 will crash … because there is no InfluxDB relay process which could be installed next to Icinga 2 and do exactly the same thing you explained above.

Then, to avoid such a scenario during a network outage … could a solution be to put InfluxDB on the same server as the Icinga 2 master? … Our masters handle configuration and send alerts; they are not utilized for executing service check plugins. That workload has been delegated to the satellites.

The only thing here which could be a problem is disk I/O … hmm, a very interesting topic to think about …

Sorry for spamming, I just found on the internet that there is actually an InfluxDB relay: GitHub - influxdata/influxdb-relay: Service to replicate InfluxDB data for high availability. It looks pretty similar to the Graphite relay …
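So in theory the same pattern as your carbon-relay should work: run influxdb-relay on the master itself and point the InfluxdbWriter at it. Something like this, assuming the relay’s HTTP listener is bound to 127.0.0.1:9096 as in the project’s sample config (I have not tested this yet):

```
// InfluxdbWriter pointed at a local influxdb-relay instead of the
// remote InfluxDB -- the relay then forwards to the real backend(s).
// Port 9096 is taken from the relay's sample config; adjust as needed.
object InfluxdbWriter "influxdb-via-relay" {
  host = "127.0.0.1"
  port = 9096
  database = "icinga2"
}
```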

Another idea would be whether Telegraf could take that role, but as I said, I am not very familiar with the Influx stack.

Many thanks, this opens up new ideas for me.

Cheers,
Josip