Host state UP/DOWN based on “host classes” defined by % or number of service states

:white_check_mark: I checked the troubleshooting documentation first, but wasn’t able to find a way forward for my case.

Hi all,

How can we control the thresholds that bring a host to DOWN, or back to UP, based on how many of its services report an OK, warning or critical state?

I am looking for ideas, guidance and suggestions. We have a medium-sized setup and were wondering how we can improve our notifications. We get too many messages, and sometimes important ones get overlooked and slip through our fingers because less important messages flood our Mattermost channels. So we were looking for some kind of “host classes” that prioritize some hosts over others when it comes to the host state and, by that, the notifications.

Example:

  • ultra-high prio: DNS service hosts, network devices and storage servers
  • high prio: AD controllers and HA service hosts (pairs of two) in general
  • medium prio: HA service hosts with more than two service hosts per HA group
  • low prio: for example reachability checks (we use host objects with lists of objects to group endpoints and run checks as services across the group of endpoints)

So the idea was to set a host to DOWN (and notify) if:

  • ultra-high prio: number of services in warning or critical > 1
  • high prio: number of services in critical > 1
  • medium prio: % or number of services in warning or critical > 10% or > 3
  • low prio: % of services in warning or critical > 50%

Any ideas or links where we can see some examples or get some guidance?

################################

Give as much information as you can, e.g.

  • Version used (icinga2 --version)
  • Operating System and version
icinga2 - The Icinga 2 network monitoring daemon (version: 2.13.2-1)

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1160.53.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-hh8q3bz2-project-322-concurrent-0
  OpenSSL version: OpenSSL 1.0.2k-fips  26 Jan 2017
  • Enabled features (icinga2 feature list)
icinga2 feature list
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 livestatus opentsdb perfdata statusdata syslog
Enabled features: api checker ido-pgsql influxdb mainlog notification
  • Config validation (icinga2 daemon -C)
icinga2 daemon -C
[2023-03-22 11:16:52 +0100] information/cli: Icinga application loader (version: 2.13.2-1)
[2023-03-22 11:16:52 +0100] information/cli: Loading configuration file(s).
[2023-03-22 11:16:52 +0100] information/ConfigItem: Committing config item(s).
[2023-03-22 11:16:52 +0100] information/ApiListener: My API identity: #######.########.####.####.###
[2023-03-22 11:16:53 +0100] warning/ApplyRule: Apply rule 'ncpa_ctx_drive - ' (in /etc/icinga2/zones.d/global-templates/services_CTX.conf: 32:1-32:69) for type 'Service' does not match anywhere!
[2023-03-22 11:16:53 +0100] warning/ApplyRule: Apply rule 'ncpa_network_ctx_win_recv_ ' (in /etc/icinga2/zones.d/global-templates/services_CTX.conf: 75:1-75:71) for type 'Service' does not match anywhere!
[2023-03-22 11:16:53 +0100] warning/ApplyRule: Apply rule 'tcp-' (in /etc/icinga2/zones.d/global-templates/services/services.conf: 8:1-8:55) for type 'Service' does not match anywhere!
[2023-03-22 11:16:53 +0100] warning/ApplyRule: Apply rule 'ncpa_mnt_used_gb' (in /etc/icinga2/zones.d/global-templates/services/services.conf: 92:1-92:32) for type 'Service' does not match anywhere!
[2023-03-22 11:16:53 +0100] warning/ApplyRule: Apply rule 'ncpa_mem_used_swap' (in /etc/icinga2/zones.d/global-templates/services/services.conf: 168:1-168:34) for type 'Service' does not match anywhere!
[2023-03-22 11:16:53 +0100] warning/ApplyRule: Apply rule 'hlm-status' (in /etc/icinga2/zones.d/global-templates/services/services.conf: 405:1-405:26) for type 'Service' does not match anywhere!
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 NotificationComponent.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 IdoPgsqlConnection.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 CheckerComponent.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 2 Users.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 UserGroup.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 3 ServiceGroups.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 3 TimePeriods.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 7310 Services.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 9 Zones.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 4 NotificationCommands.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 143 HostGroups.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 8303 Notifications.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 IcingaApplication.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 982 Hosts.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 8 Endpoints.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 20 Comments.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 FileLogger.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 2 ApiUsers.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 293 CheckCommands.
[2023-03-22 11:16:53 +0100] information/ConfigItem: Instantiated 1 ApiListener.
[2023-03-22 11:16:54 +0100] information/ScriptGlobal: Dumping variables to file '/var/cache/icinga2/icinga2.vars'
[2023-03-22 11:16:54 +0100] information/cli: Finished validating the configuration file(s).

That would work with the Icinga DSL.

You could design a check command that iterates over the services and performs these calculations. Based on the result you can set the exit code.
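
A rough, untested sketch of that idea, using the built-in dummy check command as host check; the template name and the custom variable prio_max_problem_pct are only placeholders you would set per “host class”:

template Host "prio-host-state" {
    check_command = "dummy"
    check_interval = 1m

    // evaluated at every host check: count this host's services that are not OK
    vars.dummy_state = {{
        var problems = 0
        var total = 0
        for (svc in get_services(macro("$host.name$"))) {
            total += 1
            if (svc.state > 0) {
                problems += 1
            }
        }
        if (total == 0) {
            return 0
        }
        // prio_max_problem_pct: placeholder custom variable, e.g. 10 or 50
        if (100 * problems / total > macro("$prio_max_problem_pct$")) {
            return 2    // exit state 2/3 maps to host DOWN, 0/1 to UP
        }
        return 0
    }}

    vars.dummy_text = {{
        return "Evaluated service states of " + macro("$host.name$")
    }}
}

Each host class would then just import such a template and set vars.prio_max_problem_pct accordingly.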

Do you use Icinga 2 config files or the Icinga Director for config changes?

If you need something less sophisticated, you can use the Business Process module.

As the DSL only works in the parameters of check commands in the Director, Linuxfabrik built me a dummy check that just returns the message, state and perfdata:

Here is a preview of my “more then halve” check that is used on a virtual host with a load balancer IP.
The cluster nodes are the real servers behind the load balancer, and this code gathers the states of one service on each of them and calculates its own state from those:

object CheckCommand "116-cmd-more-then-halve" {
    import "plugin-check-command"
    command = [ "/usr/lib64/nagios/plugins/dummy" ]
    timeout = 10s
    arguments += {
        "--message" = {
            required = false
            value = {{
                var output_status = ""
                var up_count = 0
                var down_count = 0
                var cluster_nodes = macro("$116_cluster_nodes$")
                var more_then_halve_service_name = macro("$116-cluster-more-then-halve-service$")
            
                for (node in cluster_nodes) {
                  if (get_service(node, more_then_halve_service_name).state > 0) {
                    down_count += 1
                  } else {
                    up_count += 1
                  }
                }
            
                if (up_count > down_count) {
                  output_status = "OK: "
                }
                if (up_count == down_count) {
                  output_status = "WARNING: "
                }
                if (up_count < down_count) {
                  output_status = "CRITICAL: "
                }
            
                var output = output_status
            
                for (node in cluster_nodes) {
                  output += node + ": " + more_then_halve_service_name + ": " + get_service(node, more_then_halve_service_name).last_check_result.output + " "
                }
            
                output += " | count_of_alive_" + more_then_halve_service_name +"="+up_count+";" + string((up_count + down_count) / 2 + 1) + ":;" + string((up_count + down_count) / 2 ) + ":;0;" + string(up_count + down_count)
                log(output)
                return output
            }}
        }
        "--state" = {{
            var up_count = 0
            var down_count = 0
            var cluster_nodes = macro("$116_cluster_nodes$")
            var more_then_halve_service_name = macro("$116-cluster-more-then-halve-service$")
        
            for (node in cluster_nodes) {
              if (get_service(node, more_then_halve_service_name).state > 0) {
                down_count += 1
              } else {
                up_count += 1
              }
            }
        
            if (up_count > down_count) {
              return "ok" // more up then down -> OK
            }
            if (up_count == down_count) {
              return "warn" // same up as down -> Warning
            }
            if (up_count < down_count) {
              return "crit" // less up then down -> Critical 
            }
            return "unk" // should never reach this
        }}
    }
}
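
In plain config-file syntax the wiring would look roughly like this; host names, the address and the service name "http" are only placeholders, and "generic-host"/"generic-service" are the templates from the sample config (I set these custom variables through the Director myself):

object Host "webshop-vip" {
    import "generic-host"
    address = "192.0.2.10"
    vars["116_cluster_nodes"] = [ "web01", "web02" ]
    vars["116-cluster-more-then-halve-service"] = "http"
}

apply Service "more-then-halve" {
    import "generic-service"
    check_command = "116-cmd-more-then-halve"
    assign where host.vars["116_cluster_nodes"]
}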

The other, simpler but more tedious option is to use the “Icingaweb2 Business Process Module”: create your 4 tiers there, add the necessary operators on the nodes below that, and put all services under the host nodes. To finish it up, add icingacli checks to alert your Mattermost channels.

But maybe you are holding it wrong, and you could clean up your alerts by only sending criticals to your Mattermost channels and warnings to email?
Is a disk check critical at 95%, or is it only a warning between 90% and 100% and only critical at 99% or 100%?
I know this is philosophical, but I am starting to push in this direction for us, as it makes the use of the “Icingaweb2 Business Process Module” a lot easier.

Hi Nick,
thanks for the reply.

We are using the plain good old config files. We have a number of scripts that run every few minutes and place new config files automatically; a reload of Icinga every 5 minutes picks them up.

I will have a look at the “business process module”.

Is there something that I can place inside a host template? Are you aware of something like this? Like the thresholds for memory …

vars.warning_val = "80"
vars.critical_val = "90"

just for services of a host :wink: :see_no_evil:

Hi Dominik,
thanks for the details. Interesting approach. This check will be a service in the end, right? So you create an artificial host with this special service, and since the host only has this one service, it will be DOWN or UP based on this one check?

The “Business Process Module” is something I really should have a look at. Let’s see. I need to learn it and then teach it to my guys and girls. As long as we can stay with our hand-automation we should be fine. :wink:

Maybe I will have some time tomorrow and can create some screenshots that will help to understand my current situation. :slight_smile:

But thanks for the ideas and thoughts so far.

I don’t use it for the host status, but you can change it a bit and use it as a host check.

The result of the code I posted is a service on the host representing the load balancer IP; it is the calculated result of two services on the two servers behind the load balancer. More than two servers also work.

Is there something that I can place inside a host template? Are you aware of something like this? Like the thresholds for memory …

vars.warning_val = "80"
vars.critical_val = "90"

just for services of a host :wink: :see_no_evil:

You could use a variable like this in the DSL approach - see var cluster_nodes = macro("$116_cluster_nodes$"), as this is a variable I get from the host.
And as you don’t use the Director, you can use the built-in dummy command directly - no need for the Linuxfabrik variant.
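
For example, roughly like this (untested sketch; the custom variables are the same ones as in my check above):

apply Service "more-then-halve-dummy" {
    check_command = "dummy"
    check_interval = 1m

    // evaluated at check time: count how many cluster nodes report the chosen service as not OK
    vars.dummy_state = {{
        var cluster_nodes = macro("$116_cluster_nodes$")
        var svc_name = macro("$116-cluster-more-then-halve-service$")
        var up_count = 0
        var down_count = 0

        for (node in cluster_nodes) {
            if (get_service(node, svc_name).state > 0) {
                down_count += 1
            } else {
                up_count += 1
            }
        }

        if (up_count > down_count) { return 0 }    // OK
        if (up_count == down_count) { return 1 }   // WARNING
        return 2                                   // CRITICAL
    }}

    vars.dummy_text = {{
        return "Cluster node check for " + macro("$host.name$")
    }}

    assign where host.vars["116_cluster_nodes"]
}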

The syntax of the Business Process module is a bit of a mess, but it could also be generated.