Monitoring Icinga's config file sync in distributed setups

Hi,

I just ran into the problem that some CheckCommands weren’t synced properly to my agents. The reason: I had added a notification command that referenced a constant defined in /etc/icinga2/constants.conf - which, of course, lives outside of /etc/icinga2/zones.d/master and is therefore never synced.
Hence I got a nice error in /var/log/icinga2/icinga2.log and /var/lib/icinga2/api/zones-stage/startup.log on all agents. (If only I had looked at the agents’ logfiles earlier - I was too focused on the master’s logfiles…)

The cluster-health check (as described in Distributed Monitoring - Icinga 2) only checks whether endpoints are connected - which was always the case here; only the new CheckCommands weren’t synced. (And after deleting the synced config files on the agents, all previously synced files/checks/etc. disappeared too.) So this check doesn’t help, as it doesn’t verify that the config file sync is up to date.

Therefore I searched around, thinking there clearly must be some kind of check plugin to monitor this, right? But I found nothing.
Of course I could utilize check_logfiles, no problem.

But is there a better way to do this? How are you all monitoring this?

Regards

Hi @Doolberg,
The icinga CheckCommand (one of the few built-in ones) is exactly what you want: it switches to WARNING if the configuration sync fails.
It has to be executed on the agent/satellite.

That one and the cluster-zone CheckCommand are always used in my distributed setups.
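
A minimal sketch of how that can look - assuming your agent Host objects carry a custom variable like vars.agent_endpoint (adjust to your naming):

apply Service "icinga-health" {
        check_command = "icinga"

        // must run on the agent itself, otherwise it only reports on the executing node
        command_endpoint = host.vars.agent_endpoint

        assign where host.vars.agent_endpoint
}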

Hi @lorenz,

oh, I use both. They are executed every minute. But no alarm was raised, despite my config file sync not working for over a month… I just recreated the error situation to verify it, and: again, no warning, no critical, nothing.

Neither from icinga nor from cluster-zone (and cluster on the Master also gives nothing for that matter).

So what else can I use?
In Icinga Template Library - Icinga 2 there seems to be nothing which would check this.

Oh, and an icinga2 daemon -C on the agent also gives no errors or warnings - the config passes with flying colors. I assume that because the validation of the staged config fails, the broken config never becomes the active config; hence the icinga2 daemon isn’t reloaded/restarted, and therefore the icinga CheckCommand (and others) don’t detect any failure.
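
(For the record: the staged config can be validated manually on an agent, as described in the troubleshooting docs - assuming the default paths:

root@ga:~# icinga2 daemon -C --define System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/

That reproduces exactly the errors from the startup.log.)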

Icinga2 version is 2.12.3-1 (on Master and all Agents), if that matters for any reason.

Regards

Was the Service with the icinga CheckCommand executed on the agent? It really should catch a synchronisation error.

Oh my god… This is embarrassing… Yes, command_endpoint = host.vars.agent_endpoint was missing from the service definition… a leftover from when I started without distributed monitoring…

Added it and the check is executed on the correct host.

But now I get the error that the agent is not connected to its zone:
“Zone ‘ga.domain.tld’ is not connected. Log lag: less than 1 millisecond”
Keep in mind, the config file sync works (if I don’t forget to define constants used in distributed-templates…)

My agent-health service definition is the following:
(With vars.cluster_zone = “master” the service returns OK, but from my understanding it would be wrong to do that?)

apply Service "agent-health" {

        check_command = "cluster-zone"

        display_name = "cluster-health-" + host.name

        // Only works if: Agent zone name is the FQDN which also must be the host object name
        vars.cluster_zone = host.name
        //vars.cluster_zone = "master"

        // For services which should be executed ON the host itself
        command_endpoint = host.vars.agent_endpoint

        assign where host.vars.agent_endpoint
}

The zones.conf on host ga is the following:

root@ga:~# cat /etc/icinga2/zones.conf
/*
 * Generated by Icinga 2 node setup commands
 * on 2023-05-18 19:40:05 +0200
 */

object Endpoint "admin.domain.tld" {
  // No host attribute as else the Agents would try to actively connect to the Master
  // But we want the Master to connect to the Agents

}

object Zone "master" {
        endpoints = [ "admin.domain.tld" ]
}

object Endpoint "ga.domain.tld" {
//      host = "ga.domain.tld"
//      port = "5665"
}

object Zone "ga.domain.tld" {
        endpoints = [ "ga.domain.tld" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Zone "global-commands" {
        global = true
}

And on the master it’s the following:

root@admin ~ # cat /etc/icinga2/zones.conf
/*
 * Generated by Icinga 2 node setup commands
 * on 2023-05-18 19:44:56 +0200
 */

object Endpoint "admin.domain.tld" {
  // No host attribute as else the Agents would try to actively connect to the Master
  // But we want the Master to connect to the Agents
}

object Zone "master" {
        endpoints = [ "admin.domain.tld" ]
}

object Endpoint "ga.domain.tld" {
        host = "ga.domain.tld"
        port = "5665"
}

object Zone "ga.domain.tld" {
        endpoints = [ "ga.domain.tld" ]
        parent = "master"
}

object Zone "global-templates" {
        global = true
}

object Zone "director-global" {
        global = true
}

object Zone "global-commands" {
        global = true
}

From my understanding this should be correct…

Executing root@admin ~ # curl -s -u $ICINGA2_API_USER:$ICINGA2_API_PASSWORD -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' -X POST -k "https://$ICINGA2_API_HOST:$ICINGA2_API_PORT/v1/objects/services?filter=service.name==%22agent-health%22" | json_pp I get:

{
   "results" : [
      {
         "attrs" : {
            "__name" : "ga.domain.tld!agent-health",
            "acknowledgement" : 0,
            "acknowledgement_expiry" : 0,
            "acknowledgement_last_change" : 0,
            "action_url" : "",
            "active" : true,
            "check_attempt" : 1,
            "check_command" : "cluster-zone",
            "check_interval" : 300,
            "check_period" : "",
            "check_timeout" : null,
            "command_endpoint" : "ga.domain.tld",
            "display_name" : "cluster-health-ga.domain.tld",
            "downtime_depth" : 0,
            "enable_active_checks" : true,
            "enable_event_handler" : true,
            "enable_flapping" : false,
            "enable_notifications" : true,
            "enable_passive_checks" : true,
            "enable_perfdata" : true,
            "event_command" : "",
            "flapping" : false,
            "flapping_current" : 10.9,
            "flapping_last_change" : 1721250494.37898,
            "flapping_threshold" : 0,
            "flapping_threshold_high" : 30,
            "flapping_threshold_low" : 25,
            "force_next_check" : false,
            "force_next_notification" : false,
            "groups" : [],
            "ha_mode" : 0,
            "handled" : false,
            "host_name" : "ga.domain.tld",
            "icon_image" : "",
            "icon_image_alt" : "",
            "last_check" : 1721739859.5294,
            "last_check_result" : {
               "active" : true,
               "check_source" : "ga.domain.tld",
               "command" : "cluster-zone",
               "execution_end" : 1721739859.5294,
               "execution_start" : 1721739859.52938,
               "exit_status" : 0,
               "output" : "Zone 'ga.domain.tld' is not connected. Log lag: less than 1 millisecond",
               "performance_data" : [
                  {
                     "counter" : false,
                     "crit" : 0,
                     "label" : "slave_lag",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "s",
                     "value" : 0,
                     "warn" : 0
                  },
                  {
                     "counter" : false,
                     "crit" : null,
                     "label" : "last_messages_sent",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "",
                     "value" : 0,
                     "warn" : null
                  },
                  {
                     "counter" : false,
                     "crit" : null,
                     "label" : "last_messages_received",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "",
                     "value" : 0,
                     "warn" : null
                  },
                  {
                     "counter" : false,
                     "crit" : null,
                     "label" : "sum_messages_sent_per_second",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "",
                     "value" : 0,
                     "warn" : null
                  },
                  {
                     "counter" : false,
                     "crit" : null,
                     "label" : "sum_messages_received_per_second",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "",
                     "value" : 0,
                     "warn" : null
                  },
                  {
                     "counter" : false,
                     "crit" : null,
                     "label" : "sum_bytes_sent_per_second",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "",
                     "value" : 0,
                     "warn" : null
                  },
                  {
                     "counter" : false,
                     "crit" : null,
                     "label" : "sum_bytes_received_per_second",
                     "max" : null,
                     "min" : null,
                     "type" : "PerfdataValue",
                     "unit" : "",
                     "value" : 0,
                     "warn" : null
                  }
               ],
               "schedule_end" : 1721739859.5294,
               "schedule_start" : 1721739859.5294,
               "state" : 2,
               "ttl" : 0,
               "type" : "CheckResult",
               "vars_after" : {
                  "attempt" : 1,
                  "reachable" : true,
                  "state" : 2,
                  "state_type" : 1
               },
               "vars_before" : {
                  "attempt" : 1,
                  "reachable" : true,
                  "state" : 2,
                  "state_type" : 1
               }
            },
            "last_hard_state" : 2,
            "last_hard_state_change" : 1721739561.28873,
            "last_reachable" : true,
            "last_state" : 2,
            "last_state_change" : 1721739444.80542,
            "last_state_critical" : 1721739859.53051,
            "last_state_ok" : 1721739303.94638,
            "last_state_type" : 1,
            "last_state_unknown" : 1721737817.98358,
            "last_state_unreachable" : 1720095736.37306,
            "last_state_warning" : 0,
            "max_check_attempts" : 3,
            "name" : "agent-health",
            "next_check" : 1721740157.77058,
            "next_update" : 1721740457.77058,
            "notes" : "",
            "notes_url" : "",
            "original_attributes" : null,
            "package" : "_etc",
            "paused" : false,
            "previous_state_change" : 1721739444.80542,
            "problem" : true,
            "retry_interval" : 60,
            "severity" : 2176,
            "source_location" : {
               "first_column" : 1,
               "first_line" : 4,
               "last_column" : 28,
               "last_line" : 4,
               "path" : "/etc/icinga2/zones.d/master/services.d/agent-health.conf"
            },
            "state" : 2,
            "state_type" : 1,
            "templates" : [
               "agent-health"
            ],
            "type" : "Service",
            "vars" : {
               "cluster_zone" : "ga.domain.tld"
            },
            "version" : 0,
            "volatile" : false,
            "zone" : "master"
         },
         "joins" : {},
         "meta" : {},
         "name" : "ga.domain.tld!agent-health",
         "type" : "Service"
      }
   ]
}

check_source is ga.domain.tld - which is how I want it (and how it must be).
vars.cluster_zone is “ga.domain.tld” which is also correct.
A zone with that name exists, and the host is part of that zone.

What am I missing?

  • Execute icinga on the agent
  • Execute cluster-zone on the parent

icinga tells you whether the agent detects any errors itself.

cluster-zone tells you whether another Icinga 2 node (vars.cluster_zone) is properly connected to the executing node. So for “AgentA” (zone AND endpoint name) you have a Service “Can I connect to AgentA?” executed on the parent node (let’s call it “SatelliteA”).

Technically you could add a service “Can I talk to my parent?” where you execute cluster-zone with vars.cluster_zone = "SatelliteA" on the agent, but you will, of course, never see that result if they don’t talk to each other anymore.

Ok, now I am somewhat confused.

I followed this approach regarding the Service checks as I want to introduce a second master later: Distributed Monitoring - Icinga 2

And took the health checks from here:
https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#health-checks

There the agent-health service is defined along with the Dependency "agent-health-check".
Therefore it made sense for me that the cluster-zone command (or agent-health check) is executed on the agent.
In fact, the cluster-zone check is only present on ga.domain.tld and not on the master.

And there is no command_endpoint configured for the masters in that documentation.
So if I now change the cluster-zone Service to be executed on the master by changing the assign where… I get loads of configuration errors because, of course, the dependency on the agent-health-check can’t be resolved.

So is there some semantic error in the documentation?

Or can you provide your service definitions for the cluster-zone and icinga services?

Ah… No… I overthought it. Sorry for causing so much confusion…

The assign where rule only ensures that the check is present for all agents.
But of course the check source is the master.
Therefore I have to remove command_endpoint from the agent-health service.

I was stuck in the mantra: “If assign where host.vars.agent_endpoint is present, you need to specify command_endpoint too.” - which is true in 99.9% of normal monitoring cases, but not in this one…

Narf…

Check is now returning OK. Thanks!

apply Service "agent-health" {

        check_command = "cluster-zone"

        display_name = "cluster-health-" + host.name

        // Only works if: Agent zone name is the FQDN which also must be the host object name
        vars.cluster_zone = host.name
       
        // If command_endpoint is not explicitly specified, the endpoint responsible for this zone (in your case both masters) will execute the CheckCommand
        //command_endpoint = host.vars.agent_endpoint

        assign where host.vars.agent_endpoint
}

Ah, I should have read your second post… :slight_smile:

With that fixed, I re-created my problem: I moved a notification template to /etc/icinga2/zones.d/global-templates which needs a constant that is only defined in /etc/icinga2/constants.conf.
That constant will of course not be synced to the agents, as the file is outside of any configured zone.
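
(The fix itself is clear, by the way: either define the constant locally in constants.conf on every node, or ship it in a file inside the global zone so it gets synced along with the commands using it. A rough, untested sketch of the latter - the file name is assumed, and the evaluation order matters, as the constant must be loaded before the command that uses it:

// /etc/icinga2/zones.d/global-commands/constants.conf on the master
const TelegramBotToken = "PLACEHOLDER"

But here I deliberately want the broken state, to see whether any check catches it.)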

In the log I see the following:

[2024-07-23 15:52:18 +0200] information/ApiListener: Applying config update from endpoint 'admin.domain.tld' of zone 'master'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Received configuration for zone 'global-commands' from endpoint 'admin.domain.tld'. Comparing the timestamp and checksums.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-commands//_etc/check_apt_update.conf' for zone 'global-commands'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-commands//_etc/check_linux_memory.conf' for zone 'global-commands'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-commands//_etc/commands.conf' for zone 'global-commands'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-commands//_etc/telegrambot-commands.conf' for zone 'global-commands'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-commands' (8668 Bytes).
[2024-07-23 15:52:18 +0200] information/ApiListener: Received configuration for zone 'global-templates' from endpoint 'admin.domain.tld'. Comparing the timestamp and checksums.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/eventcommands.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/groups.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/host-templates.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/notifications.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/service-templates.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/telegrambot-notifications.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/templates.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/timeperiods.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-templates//_etc/users.conf' for zone 'global-templates'.
[2024-07-23 15:52:18 +0200] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-templates' (6797 Bytes).
[2024-07-23 15:52:18 +0200] information/ApiListener: Received configuration updates (2) from endpoint 'admin.domain.tld' are different to production, triggering validation and reload.
[2024-07-23 15:52:18 +0200] critical/ApiListener: Config validation failed for staged cluster config sync in '/var/lib/icinga2/api/zones-stage/'. Aborting. Logs: '/var/lib/icinga2/api/zones-stage//startup.log'

The startup.log has the following content:

[2024-07-23 15:52:18 +0200] information/cli: Icinga application loader (version: r2.12.3-1)
[2024-07-23 15:52:18 +0200] information/cli: Loading configuration file(s).
[2024-07-23 15:52:18 +0200] information/ConfigItem: Committing config item(s).
[2024-07-23 15:52:18 +0200] critical/config: Error: Error while evaluating expression: Tried to access undefined script variable 'TelegramBotToken'
Location: in /var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf: 20:26-20:41
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(18):     NOTIFICATIONCOMMENT = "$notification.comment$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(19):     HOSTDISPLAYNAME = "$host.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(20):     TELEGRAM_BOT_TOKEN = TelegramBotToken
                                                                                                               ^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(21):     TELEGRAM_CHAT_ID = "$user.vars.telegram_chat_id$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(22):

[2024-07-23 15:52:18 +0200] critical/config: Error: Error while evaluating expression: Tried to access undefined script variable 'TelegramBotToken'
Location: in /var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf: 46:26-46:41
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(44):     HOSTDISPLAYNAME = "$host.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(45):     SERVICEDISPLAYNAME = "$service.display_name$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(46):     TELEGRAM_BOT_TOKEN = TelegramBotToken
                                                                                                               ^^^^^^^^^^^^^^^^
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(47):     TELEGRAM_CHAT_ID = "$user.vars.telegram_chat_id$"
/var/lib/icinga2/api/zones-stage//global-commands/_etc/telegrambot-commands.conf(48):

[2024-07-23 15:52:18 +0200] critical/config: 2 errors
[2024-07-23 15:52:18 +0200] critical/cli: Config validation failed. Re-run with 'icinga2 daemon -C' after fixing the config.

So the new config isn’t applied.
But neither the icinga nor the cluster-zone check goes to warning or critical.
An icinga2 daemon -C on the master or the agents also returns OK.

My simple check whether the logfile /var/lib/icinga2/api/zones-stage/startup.log exists is critical, as this logfile only exists when there is some error.

So again, how can I check for that, apart from a simple check_fileage?
How can I spot a Config validation failed for staged cluster config sync in...?
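
My current stopgap looks roughly like this - a sketch with assumed paths and a hypothetical command name, not a polished plugin; it simply goes critical as soon as the stage startup.log exists:

object CheckCommand "zones-stage-log" {
        // the startup.log is only present when the staged validation failed
        command = [ "/bin/sh", "-c", "if [ -e /var/lib/icinga2/api/zones-stage/startup.log ]; then echo 'CRITICAL: staged config validation failed'; exit 2; else echo 'OK: no staged validation errors'; exit 0; fi" ]
}

apply Service "zones-stage-log" {
        check_command = "zones-stage-log"
        // must run on the agent - the stage dir is local to each node
        command_endpoint = host.vars.agent_endpoint
        assign where host.vars.agent_endpoint
}

But that feels like a workaround, not a solution. For completeness, my current service definitions: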

apply Service "agent-connection" {
        check_command = "cluster-zone"
        display_name = "cluster-health-" + host.name
        // Only works if: Agent zone name is the FQDN which also must be the host object name
        vars.cluster_zone = host.name
        assign where host.vars.agent_endpoint
}

apply Service "agent-health" {

        check_command = "icinga"

        display_name = "cluster-health-" + host.name

        // Only works if: Agent zone name is the FQDN which also must be the host object name
        vars.cluster_zone = host.name
       
        // this must be executed on the agent
        command_endpoint = host.vars.agent_endpoint

        assign where host.vars.agent_endpoint
}

Basically I want a check which returns the same error, just as an icinga2 daemon --validate --define System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/ would do.
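
Something like this sketch is what I have in mind (untested on my side; the exit codes of icinga2 daemon -C aren’t plugin-style, so the mapping to WARNING/CRITICAL would need extra care, and the binary path is assumed):

object CheckCommand "zones-stage-validate" {
        // re-runs the staged validation; a non-zero exit shows up as a problem state
        command = [ "/usr/sbin/icinga2", "daemon", "-C", "--define", "System.ZonesStageVarDir=/var/lib/icinga2/api/zones-stage/" ]
}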

Do your checks really go to Warning/Critical if the new config can’t be validated?
Or only if the config sync is broken completely?

When using the CheckCommand icinga on the agent they do, yes.

NARF!

@lorenz Thanks for staying with me and smashing that wall in my head… I forgot to add command_endpoint for the icinga check.

Now I get a Warning with:
“Icinga 2 has been running for 1 hour, 26 minutes and 38 seconds. Version: r2.12.3-1; Last zone sync stage validation failed at 2024-07-23 16:46:44 +0200”

Fascinating how much one can do wrong… :sweat_smile:

Icinga 2 is a beast :slight_smile:
