Service failures

cbowden · April 27, 2021, 10:14am

We’ve been seeing regular and consistent failures across multiple services in our portal. They last for approximately one minute and then all recover again. We have 200 hosts / 1900 services and only a portion of them fail - “ping4” checks. Most of the time it’s not much of a problem as it doesn’t trigger any notifications etc due to the quick recovery. However the last few days we have seen one or two, in the early hours, call out to PagerDuty, with downtime lasting approximately 10 minutes. So we are now keen to find the root cause and get this resolved.

I did suspect the issue may be related to this topic from two years ago:

So I replaced the “hostalive” host check within the template.conf with a “dummy” to reduce the execution time for hosts. This is now at 0.365 execution time down from 4.7 but the problem persists. The icinga.log doesn’t seem to provide any clues.

Apologies for the lack of any real detail so please let me know any additional information required to help troubleshoot this one.

Thank you.

dgoetz · April 27, 2021, 12:46pm

While I agree that check_ping plugin is not the best as hostalive check and check_icmp would perform better, I have not seen a similar problem just caused by using check_ping.

What is the output or performance data in the case the checks fail? Because the possible reasons can differ when it is package loss or round trip time. In case of package loss I would look into error count on the system’s and the switch’s interface. When the round trip time increases without package loss I have seen problems with the load of a switch.

Is your master high available or do you use satellites and does the result differ based on the machine executing the check?

cbowden · April 28, 2021, 7:43am

Hi Dirk, thank you for the reply. The trouble we have with this one is there is such a short window to actually do any troubleshooting. I’ve provided a screenshot of the output and performance data.

packet_loss

I can confirm the packet loss isn’t consistent so it can vary per host usually from 30% up to 100% at any given time. A ping on the command line to an affected host also confirms the packet loss and again quickly recovers. We will continue to look at it looking closely at our network switches etc. to see if we can pinpoint a problem in that area. Unfortunately this is not a high available setup and the execution point is always from the master.

One further question, relating to the PagerDuty alert we have been receiving lately. It’s the ping4 check, always within a 15 minute window and the packet loss value is always 16%. As you know the critical threshold value is 15% for ping4. How would we increase the ping_cpl value globally as ping4 is not specified in the template (or host/services)? Do I need to create a new CheckCommand in this case?

Thank you for your assistance, it’s greatly appreciated.

dgoetz · April 28, 2021, 8:33am

With this package loss and the only reply having no increased round trip time I would look into network errors like losing communication completely in one direction, discarded packages, asynchronous routing, mtu or similar. Unfortunately I am no networker and can not provide to much guidance here.

The defaults look like you are directly using ping4, so you could switch to hostalive4 which has higher default thresholds more suited to say “it is really down” than “we have some connectivity problems”. If you want to stick with ping4 as you want your own threshold anyway, you can specify the threshold at your most common host template (like generic-host), simply add vars.ping_cpl = 100 there (or if using the director create a field for it, associate it to the template and then set defaults).

cbowden · April 28, 2021, 9:15am

Thanks Dirk.

I’m still a little confused regarding the entry of the new threshold. As ‘ping4’ isn’t explicitly listed in any configuration file (e.g. services, hosts, templates etc) but automatically associated as a service to all hosts I’m unsure how to add this threshold. Apologies for not understanding. Please could you clarify where this should be set?

dgoetz · April 28, 2021, 10:00am

Ok, let’s see if we find the location of relevant part of the configuration together as I do not know your configuration structure.

Can you provide the output of icinga2 object list --type service --name ONEOFYOURHOSTS and if the failing service is not the host icinga2 object list --type service --name YOURSERVICENAME? This will show the location of the object in the configuration and also some important structure like the templates used.

cbowden · April 28, 2021, 1:36pm

This worked perfectly and helped me to track down and reapply with a new threshold. I was incorrectly reviewing the wrong files earlier! At least this should stop the PagerDuty alerts until we work out the cause of these short outage reports.

As stated earlier we are reviewing our network device logs to track this issue down so I’m happy to resolve this issue as we suspect it’s not specifically the Icinga setup as the root cause. Thanks once again for your assistance.