We are running a 3-tier distributed setup where CheckCommand objects (and lots of other stuff) are distributed using a global configuration zone. Frequently, after deploying a new version of this zone to the central configuration master and triggering a reload, the zone is missing objects. We most often notice this because checks turn UNKNOWN with an error message like "Check command 'readonly_fs' does not exist."
OS distribution of our setup (all hosts are running Icinga 2.11.2 on CentOS): 726 on EL6, 39 on EL7 and 83 on EL8. The architecture is x86_64 in all cases.
The current host, muc1dev-core-core-3, is a leaf node. The top level zone as well as the global configuration zone look like this:
// managed by Class['icinga2']
// the actual master zone
object Endpoint "fra1pro-infra-master-1.adm.example.com" {
  host = "fra1pro-infra-master-1.infra.fra1pro.example.com"
  log_duration = 2h
}
object Endpoint "fra1pro-infra-master-2.adm.example.com" {
  host = "fra1pro-infra-master-2.infra.fra1pro.example.com"
  log_duration = 2h
}
object Zone "master" {
  endpoints = ["fra1pro-infra-master-1.adm.example.com", "fra1pro-infra-master-2.adm.example.com"]
}
// global configuration data (do NOT add checks here!)
object Zone "global-configuration" {
  global = true
}
The muc1dev zone looks like this:
// managed by Class['icinga2']
object Endpoint "muc1dev-infra-1.adm.example.com" {
  host = "muc1dev-infra-1.infra.muc1dev.example.com"
  log_duration = 2h
}
object Endpoint "muc1dev-infra-2.adm.example.com" {
  host = "muc1dev-infra-2.infra.muc1dev.example.com"
  log_duration = 2h
}
object Zone "muc1dev" {
  endpoints = ["muc1dev-infra-1.adm.example.com", "muc1dev-infra-2.adm.example.com"]
  parent = "master"
}
Last but not least, the host:
// managed by Class['icinga2']
object Endpoint "muc1dev-core-core-3.adm.example.com" {
  host = "10.3.1.77"
  log_duration = 0
}
object Zone "muc1dev-core-core-3.adm.example.com" {
  endpoints = ["muc1dev-core-core-3.adm.example.com"]
  parent = "muc1dev"
}
object Host "muc1dev-core-core-3" {
  import "puppet-host"
  address = "10.3.1.77"
  vars.client_endpoint = "muc1dev-core-core-3.adm.example.com" // NB: host name == endpoint name
  vars.has_client = true
  [...] host details snipped
}
Currently, the CheckCommand definition for readonly_fs is present in /var/lib/icinga2/api/zones-stage/global-configuration/_etc/checkcommands/readonly_fs.conf; however, it’s not present anywhere in /var/lib/icinga2/api/zones/global-configuration/_etc/. The latter directory seems to be missing almost everything that’s present in zones-stage:
[root@muc1dev-core-core-3 api]# ls -al zones/global-configuration/_etc/
total 12
drwx------ 3 icinga icinga 4096 Dec 10 10:37 .
drwx------ 3 icinga icinga 4096 Dec 10 10:37 ..
drwx------ 2 icinga icinga 4096 Dec 10 10:37 checkcommands
[root@muc1dev-core-core-3 api]# ls -al zones-stage/global-configuration/_etc/
total 60
drwxr-xr-x 15 icinga icinga 4096 Dec 10 10:50 .
drwx------ 3 icinga icinga 4096 Dec 10 10:50 ..
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 checkcommands
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 collected-resources
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 commands
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 downtimes
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 eventcommands
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 hostgroups
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 notifications
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 servicegroups
drwxr-xr-x 5 icinga icinga 4096 Dec 10 10:50 services
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 templates
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 timeperiods
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 usergroups
drwxr-xr-x 2 icinga icinga 4096 Dec 10 10:50 users
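To list exactly which files made it into staging but not into production, something like the following could be run on the client. This is only a minimal sketch: the function names are my own, and the paths are simply the ones from the listings above.

```python
import os


def relative_files(root):
    """Return the set of file paths below root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_in_production(stage_root, production_root):
    """Return files that exist in the staging tree but not in production."""
    return sorted(relative_files(stage_root) - relative_files(production_root))


if __name__ == "__main__":
    # Paths taken from the directory listings above; adjust the zone name as needed.
    base = "/var/lib/icinga2/api"
    zone = "global-configuration"
    stage = os.path.join(base, "zones-stage", zone)
    production = os.path.join(base, "zones", zone)
    for path in missing_in_production(stage, production):
        print(path)
```

It needs to run as a user that can read both trees (root or icinga here, given the drwx------ permissions).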
Unfortunately, we don’t have debug-level logging enabled, but here is what I got from the last reload:
[2019-12-10 10:50:10 +0100] information/Application: Received request to shut down.
[2019-12-10 10:50:11 +0100] information/Application: Shutting down...
[2019-12-10 10:50:11 +0100] information/ApiListener: 'api' stopped.
[2019-12-10 10:50:14 +0100] information/FileLogger: 'main-log' started.
[2019-12-10 10:50:14 +0100] information/ApiListener: 'api' started.
[2019-12-10 10:50:14 +0100] information/ApiListener: Started new listener on '[0.0.0.0]:5665'
[2019-12-10 10:50:14 +0100] information/ConfigItem: Activated all objects.
[2019-12-10 10:50:19 +0100] information/ApiListener: New client connection for identity 'muc1dev-infra-2.adm.example.com' from [10.3.0.17]:37236
[2019-12-10 10:50:19 +0100] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'muc1dev-infra-2.adm.example.com'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Sending config updates for endpoint 'muc1dev-infra-2.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Finished sending config file updates for endpoint 'muc1dev-infra-2.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Syncing runtime objects to endpoint 'muc1dev-infra-2.adm.example.com'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'muc1dev-infra-2.adm.example.com'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'muc1dev-infra-2.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Sending replay log for endpoint 'muc1dev-infra-2.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Finished sending replay log for endpoint 'muc1dev-infra-2.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Finished syncing endpoint 'muc1dev-infra-2.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Applying config update from endpoint 'muc1dev-infra-2.adm.example.com' of zone 'muc1dev'.
[2019-12-10 10:50:19 +0100] information/ApiListener: Received configuration for zone 'global-configuration' from endpoint 'muc1dev-infra-2.adm.example.com'. Comparing the timestamp and checksums.
[2019-12-10 10:50:19 +0100] information/ApiListener: Our production configuration is more recent than the received configuration update. Ignoring configuration file update for path '/var/lib/icinga2/api/zones-stage/global-configuration'. Current timestamp '2019-12-10 10:36:43 +0100' (1575970603.515344) >= received timestamp '2019-12-10 10:36:43 +0100' (1575970603.515344).
[2019-12-10 10:50:19 +0100] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-configuration//_etc/checkcommands/activemq.conf' for zone 'global-configuration'.
[ .... tons of other files .... ]
[2019-12-10 10:50:19 +0100] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-configuration//_etc/checkcommands/readonly_fs.conf' for zone 'global-configuration'.
[ .... tons of other files .... ]
[2019-12-10 10:50:19 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-configuration' (282958 Bytes).
[2019-12-10 10:50:19 +0100] information/ApiListener: Received configuration updates (1) from endpoint 'muc1dev-infra-2.adm.example.com' do not qualify for production, not triggering reload.
[2019-12-10 10:50:22 +0100] information/ApiListener: New client connection for identity 'muc1dev-infra-1.adm.example.com' from [10.3.0.16]:54016
[2019-12-10 10:50:22 +0100] information/ApiListener: Requesting new certificate for this Icinga instance from endpoint 'muc1dev-infra-1.adm.example.com'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Sending config updates for endpoint 'muc1dev-infra-1.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Finished sending config file updates for endpoint 'muc1dev-infra-1.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Syncing runtime objects to endpoint 'muc1dev-infra-1.adm.example.com'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Finished syncing runtime objects to endpoint 'muc1dev-infra-1.adm.example.com'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Finished sending runtime config updates for endpoint 'muc1dev-infra-1.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Sending replay log for endpoint 'muc1dev-infra-1.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Finished sending replay log for endpoint 'muc1dev-infra-1.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Finished syncing endpoint 'muc1dev-infra-1.adm.example.com' in zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Applying config update from endpoint 'muc1dev-infra-1.adm.example.com' of zone 'muc1dev'.
[2019-12-10 10:50:22 +0100] information/ApiListener: Received configuration for zone 'global-configuration' from endpoint 'muc1dev-infra-1.adm.example.com'. Comparing the timestamp and checksums.
[2019-12-10 10:50:22 +0100] information/ApiListener: Our production configuration is more recent than the received configuration update. Ignoring configuration file update for path '/var/lib/icinga2/api/zones-stage/global-configuration'. Current timestamp '2019-12-10 10:36:43 +0100' (1575970603.515344) >= received timestamp '2019-12-10 10:36:43 +0100' (1575970603.515344).
[2019-12-10 10:50:22 +0100] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-configuration//_etc/checkcommands/activemq.conf' for zone 'global-configuration'.
[ .... tons of other files .... ]
[2019-12-10 10:50:22 +0100] information/ApiListener: Stage: Updating received configuration file '/var/lib/icinga2/api/zones-stage/global-configuration//_etc/checkcommands/readonly_fs.conf' for zone 'global-configuration'.
[ .... tons of other files .... ]
[2019-12-10 10:50:22 +0100] information/ApiListener: Applying configuration file update for path '/var/lib/icinga2/api/zones-stage/global-configuration' (282958 Bytes).
[2019-12-10 10:50:22 +0100] information/ApiListener: Received configuration updates (1) from endpoint 'muc1dev-infra-1.adm.example.com' do not qualify for production, not triggering reload.
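To spell out how I read the "Our production configuration is more recent" lines: the received update appears to be ignored whenever the current timestamp is greater than or equal to the received one, so two identical timestamps mean it is skipped. The sketch below only restates that comparison with the values from the log; the function name is hypothetical, and this is my interpretation of the log message, not Icinga's actual code.

```python
def qualifies_for_production(current_ts, received_ts):
    """Restate the log's comparison: ignore the update if current >= received."""
    return received_ts > current_ts


# Timestamps copied from the log lines above (identical on both sides).
current = 1575970603.515344
received = 1575970603.515344

print(qualifies_for_production(current, received))  # prints False
```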
We are aware that we can simply reset the client and this will be fixed; however, given the number of clients and the frequency of changes to the global configuration zone, this is not a workable solution. We also reset all Icinga2 machines (except the central configuration master) this morning, and the problem resurfaced an hour later, after the first update to the global-configuration zone.
Can anyone give me a hint on how to debug this?
Thanks,
Stefan