Different Command Endpoints for different checks

Hello!

I have been working on configuring a distributed monitoring setup in a VM sandbox. I am currently struggling with setting up external checks (SSH, Ping) and internal checks (mem, load).

All VMs have Icinga running on them. If I manually set the command endpoint to one of the parent zone hosts, it can properly run the external checks. However, if I then try to configure an internal check, the Director appears to assign the command endpoint to the master host, which is on a separate, isolated network from the agent; the two networks are bridged by the satellite. As a result I either get an UNKNOWN result, since the agent isn't directly linked to the master, or I am given an error stating that the command endpoint must be within the master zone or a direct child zone.

One thing that seems to work is assigning the agent to the parent zone it is reporting into. However, if the satellite goes offline, all child checks simply stay static and show no sign of an issue unless I drill into that agent's detail view, where it shows things like "next check in -1m 30s".

The Director documentation on working with agents indicates that I don't need to set up agent zones, but when I leave the zone empty for any agent host it again defaults to the master zone.

This may be a combination of a number of different problems all at once, but I have thoroughly confused myself at this point and need some guidance on where I may have gone wrong.

TL;DR: I am trying to figure out how to make my satellite server run external checks like ping against my agents and how to use the agent itself for internal checks like mem. I have had some success, but my results are inconsistent.
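For reference, this is roughly the rendered configuration I am hoping to end up with (the address and check names are just placeholders from my sandbox; more on my actual layout further down):

    // Deployed into the satellite zone (zones.d/satzone1/), so ICISAT schedules these checks.
    object Host "ICIAGEN" {
      check_command = "hostalive"   // host check runs from the satellite, which can reach the agent
      address = "192.0.2.10"        // placeholder address
    }

    // "External" check: executed from the satellite, no command endpoint set.
    object Service "ping" {
      host_name = "ICIAGEN"
      check_command = "ping4"
    }

    // "Internal" check: still scheduled by the satellite, but executed on the agent itself.
    object Service "load" {
      host_name = "ICIAGEN"
      check_command = "load"
      command_endpoint = "ICIAGEN"  // the agent's Endpoint object
    }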

Thank you for your help in advance.

Hi,

As far as I understand your setup, you are going with the three-level cluster route, where the satellites actively schedule the checks that are executed remotely via command endpoint. The master only receives the check results and presents everything in Icinga Web.
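In plain configuration terms, such a hierarchy boils down to zone and endpoint objects like the following. This is only a minimal sketch with generic names; the connection details depend on your environment:

    object Endpoint "master1" { }

    object Endpoint "satellite1" {
      host = "satellite1.example.com"   // assumed address, only needed on the connecting side
    }

    object Endpoint "agent1" { }

    object Zone "master" {
      endpoints = [ "master1" ]
    }

    object Zone "satellite" {
      endpoints = [ "satellite1" ]
      parent = "master"
    }

    object Zone "agent1" {
      endpoints = [ "agent1" ]
      parent = "satellite"
    }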

When the connection between the master and satellite breaks, the satellite continues to run checks. Since it cannot send the results back to the master, it stores them inside its replay log (its size can be adjusted via the log_duration setting on the Endpoint object). Once the connection with the master works again, the log, including the old check results, is replayed and sent back to the master. At that point, metric data points and history tables are updated with the past segments.
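log_duration lives on the Endpoint object, e.g. on the satellite endpoint from the sketch above:

    object Endpoint "satellite1" {
      host = "satellite1.example.com"
      log_duration = 2d   // keep up to two days of replay log for this endpoint (default: 1d)
    }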

During the connection failure period you should see late check results. In order to quickly see that the zone connection is not working, I highly recommend using the cluster health checks, e.g. cluster-zone on the master checking against the satellite zone.
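Such a health check could look roughly like this when written as plain config; adjust the zone and host names to your setup, of course:

    // Runs on the master and turns CRITICAL when the satellite zone is not connected.
    apply Service "cluster-health" {
      check_command = "cluster-zone"
      // cluster-zone defaults to checking the zone named after the host,
      // so set vars.cluster_zone explicitly if your zone names differ.
      vars.cluster_zone = "satellite"

      assign where host.name == "master1"
    }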

Cheers,
Michael

Hi Michael,

That is correct, I am running a three-level cluster. It makes sense that the results are delayed, especially when the target machine is down.

I have experimented with the cluster health checks. So far I have only been working with the "external checks" imported with the Director kickstart wizard. I will need to customize the cluster-zone check so that it accurately checks the satellite zones, since I did not use the satellite's host name as the satellite's zone name.
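I assume that boils down to overriding the cluster_zone variable, i.e. something equivalent to this in plain config (I will model it in the Director, of course):

    apply Service "cluster-zone-satzone1" {
      check_command = "cluster-zone"
      vars.cluster_zone = "satzone1"   // zone name differs from the satellite's host name (ICISAT)

      assign where host.name == "ICIMAST"
    }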

Another potentially odd behavior I have noticed is that when creating the host object itself, I need to assign the host to the parent zone of that object. For example, in the attached screenshot I have ICIAGEN manually assigned to the satzone1 cluster zone. For reference, this branch of my setup resembles the following:

Hosts:
ICIMAST -> ICISAT -> ICIAGEN

Zones:
master -> satzone1 -> ICIAGEN

My understanding is that the endpoint agents should be located in a self-named child zone with the satellite zone as the parent. However, if I leave this setting blank, it pushes all checks up to the master server, which is unable to connect to the agent directly. If I set the cluster zone to the ICIAGEN zone, it tells me during deployment that the command endpoint must be located within the ICIAGEN zone or a direct child zone.
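To illustrate what I mean by a self-named child zone, this is roughly how I pictured the hierarchy on the agent side (the address is a placeholder):

    object Endpoint "ICISAT" {
      host = "icisat.example.local"   // placeholder address of the satellite
    }

    object Endpoint "ICIAGEN" { }

    object Zone "satzone1" {
      endpoints = [ "ICISAT" ]
    }

    object Zone "ICIAGEN" {
      endpoints = [ "ICIAGEN" ]
      parent = "satzone1"
    }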

This error appears to be raised because, in the Agent - Zone 1 template, I set the command endpoint to ICISAT, since it would need to run the hostalive check. Setting the hostalive check to run on the ICIAGEN endpoint does not accurately reflect the current status of ICIAGEN when ICIAGEN is brought offline; the check appears to wait for a result forever rather than showing the host as down when the check does not return.

Thank you for your help so far!

No, whenever a command endpoint is used, the parent zone schedules the checks and triggers their execution. command_endpoint then fires the execution on the child host, i.e. the agent.

That being said, inside the Director the host itself needs to be marked as an agent and put into the satellite cluster zone. Then everything works as expected.
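Roughly speaking, with the agent flag set and satzone1 selected as the cluster zone, the configuration the Director renders ends up along these lines (a simplified sketch; your template names and address will differ). Agent-based services then simply point their command endpoint at the agent's endpoint, as in your earlier example:

    // The host lives in the satellite zone, so ICISAT schedules and runs its checks.
    object Host "ICIAGEN" {
      check_command = "hostalive"
      address = "192.0.2.10"   // whatever address you configured
    }

    // Endpoint and zone for the agent, generated because the host is marked as an agent.
    object Endpoint "ICIAGEN" { }

    object Zone "ICIAGEN" {
      endpoints = [ "ICIAGEN" ]
      parent = "satzone1"
    }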

Try that first before continuing with more questions.

Cheers,
Michael

Hi Michael,

After working with the information you provided, my instance appears to be working as intended. I have the agents assigned to the satellite zone and the satellites assigned to the master zone, and all checks, internal and external, are functioning and reporting correct statuses.

Thank you!
