Downtime with type=Host and child_options=DowntimeNonTriggeredChildren also affects services

Hi, my setup is icinga2 r2.14.2-1 on RHEL8.

I am scheduling a downtime via the API with this test script:

icingaweb_user=root
icingaweb_passwd=mypasswd
start_time=$(date +%s -d "+0 hour")
end_time=$(date +%s -d "+2 minute")

env -i /bin/curl -k -s -u "${icingaweb_user}:${icingaweb_passwd}" -H 'Accept: application/json' \
        -X POST 'https://localhost:5665/v1/actions/schedule-downtime' \
        -d "{ \
            \"type\": \"Host\", \
            \"filter\": \"host.name==\\\"myhost.mycompany.corp\\\"\", \
            \"author\": \"icingaadmin\", \
            \"comment\": \"Test\", \
            \"all_services\": false,\
            \"child_options\": \"DowntimeNonTriggeredChildren\", \
            \"start_time\": $start_time, \
            \"end_time\": $end_time, \
            \"pretty\": true \
        }"


The expected result is that only the host myhost.mycompany.corp and its child hosts are affected by this downtime. But what we actually get is that the services are also affected by the downtime, recursively.

Am I misunderstanding or missing something?
The goal is to avoid a downtime on any service and to apply it only to a specific host and its children.

Regards

If the host is down, the service checks can fail, and this will generate notifications and worsen an SLA calculation.
Are you sure you do not want a downtime for the services attached to the host?

A downtime in Icinga is more or less a scheduled maintenance window.

Yes, that is what I need. When a host goes down, all of its services are automatically suspended via disable_checks = true in an ad hoc Dependency, so the attached services cannot fail, simply because their checks are not executed.
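For context, a minimal sketch of such a dependency, with a made-up object name and a hypothetical custom variable in the assign filter (not my actual configuration):

apply Dependency "suspend-checks-while-host-down" to Service {
    disable_checks = true                   // do not execute service checks while the parent host is down
    disable_notifications = true            // and do not notify for those services either
    assign where host.vars.oracle_instance  // hypothetical custom var, illustration only
}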

My scenario is as follows: I have an Oracle cluster made up of 4 Linux hosts and a lot of database instances (each database instance depends on the Linux host where it is running). When maintenance comes, it is done in a rolling fashion: we patch the first Linux node of the cluster while the global database service is carried by the other 3 Linux hosts. When we finish patching the first node, it starts up again, and the patching task iterates over the second Linux host, and so on.

Therefore, the database service is always up and running, and from a user's point of view there is no downtime at all. If, for example, a tablespace fills up during the rolling patching, that incident must be alerted and propagated as usual.

To accomplish this, I only need to set a downtime on the 4 Linux hosts and on all of the dependent Oracle instances (the latter are also Host objects in Icinga, just like the 4 Linux hosts). But I need to guarantee that while an Icinga Host (Linux server or Oracle instance) is up and running, its services behave as usual and incidents are handled as usual by our support team. And obviously, during the maintenance window, I do not want to flood the support team with alerts caused by Oracle instances going down or Linux hosts being restarted.

I hope I have explained my scenario … sorry for my bad English.

Regards

OK, still, a downtime for the services should not hurt.

all_services = false is also the default value, see here

What happens if you remove this line from your JSON?

How do you know the service is in a downtime too?
Because I can't reproduce that.

Could you please give an example of the dependencies? My intuition is that the database instances could be defined as Services attached to a Host defined as the cluster VIP. Plus, of course, each of the 4 servers defined as Host objects with their physical Service objects (for CPU, memory, etc.).

My two cents,
Jean

Hi Moreamazingnick, Jean, thank you for your responses.

The issue can be reproduced as follows.

Declare a Host object named hostA with some services, serviceA and serviceB. Also declare a new host named childHost, dependent on hostA and with some services of its own.

hostA
   | serviceA, serviceB
   |
   | --> childHost
     | services_of_childHost
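
A minimal configuration sketch of that hierarchy (check commands and addresses are placeholders):

object Host "hostA" {
  check_command = "hostalive"
  address = "192.0.2.10"        // placeholder address
}

object Service "serviceA" {
  host_name = "hostA"
  check_command = "dummy"
}

object Service "serviceB" {
  host_name = "hostA"
  check_command = "dummy"
}

object Host "childHost" {
  check_command = "hostalive"
  address = "192.0.2.11"        // placeholder address
}

// make childHost a child of hostA (host-to-host dependency)
object Dependency "childHost-on-hostA" {
  parent_host_name = "hostA"
  child_host_name = "childHost"
}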

As stated in the docs, there is an implicit dependency between the services and their respective hosts. When you declare a downtime on hostA with type=Host and child_options=DowntimeNonTriggeredChildren, the downtime affects only hostA and childHost. So far, so good, no surprise here.

But assume that you need to tweak the default dependency between the services and their host by declaring a new Dependency object, for example:

apply Dependency "contrived_dependency" to Service {
    ignore_soft_states = false
    assign where match("*", host.name)    // without an assign rule the apply would not match anything
}

With this new setup, you have an explicit dependency between the services and their respective hosts. Now, when you declare a downtime with the same settings as before, the downtime affects all of the hosts and the services, regardless of whether the all_services argument is set to true or false.

What happens, I think, is that the argument DowntimeNonTriggeredChildren means: apply the downtime recursively to any dependent object of hostA, regardless of its type (host or service). If there is no explicit dependency declared between the services and the host, then no Dependency object exists, and the downtime declared as before affects only the two hosts. But when such a dependency is declared, the downtime also affects the services, recursively.

In my honest opinion, this behavior is wrong: if you set all_services=false in the definition of the downtime, that downtime must not affect any dependent service, regardless of the existence of an explicit Dependency object between the services and the host.

Regards

Can you post a screenshot? I just want to make sure that we are talking about the same thing.

I know it because the shell script outputs the objects affected by the downtime.
When there is an explicit dependency between the services and their respective hosts, we get the following output:

env -i /bin/curl -k -s -u root:XXXXXXXXXXXXX -H 'Accept: application/json' -X POST https://localhost:5665/v1/actions/schedule-downtime -d '{"type": "Host","filter": "host.name==\"parent_host.mydomain\"",
"author": "icingaadmin", "comment": "Test", "child_options": "DowntimeNonTriggeredChildren", "start_time": 1728075328, "end_time": 1728075388, "pretty": true}'
{
    "results": [
        {
            "child_downtimes": [
                {
                    "legacy_id": 641,
                    "name": "db_instance_1!f34aea44-1356-4677-8591-586aabb9bcc7"  ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 642,
                    "name": "db_instance_2!84bbf10e-b642-4836-861e-122345a1d265" ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 643,
                    "name": "db_instance_3!bb4c4c62-632a-426e-8aa5-fa84078e16f6" ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 644,
                    "name": "parent_host.mydomain!filesystem-usage-/var/log!b308e32d-4c15-427e-be67-34d67786f2f7" ►►►► Downtime on Service ◄◄◄◄◄
                },
                {
                    "legacy_id": 645,
                    "name": "db_instance_2!84b5f03d-3f8f-4ad9-8601-0078179324b5" ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 646,
                    "name": "db_instance_3!time_model-DB_time!0320524b-6ab8-48e9-b5b5-3deef113a563" ►►►► Downtime on Service ◄◄◄◄◄
                }
            ],
            "code": 200,
            "legacy_id": 640,
            "name": "parent_host.mydomain!b66b6a46-0dc9-4a1a-b87d-37740ca2bbe0",
            "status": "Successfully scheduled downtime 'parent_host.mydomain!b66b6a46-0dc9-4a1a-b87d-37740ca2bbe0' for object 'parent_host.mydomain'." ►►►►► Downtime on Parent Host ◄◄◄◄◄◄
        }
    ]
}

As you can see, the downtime affects the hosts and the services recursively.

But when I remove the dependencies declared between the services and their hosts from my DSL code, the output shows only hosts:

env -i /bin/curl -k -s -u root:XXXXXXXXXXXXX -H 'Accept: application/json' -X POST https://localhost:5665/v1/actions/schedule-downtime -d '{"type": "Host","filter":"host.name==\"parent_host.mydomain\"",
"author": "icingaadmin", "comment": "Test", "child_options": "DowntimeNonTriggeredChildren", "start_time": 1728075328, "end_time": 1728075388, "pretty": true}'
{
    "results": [
        {
            "child_downtimes": [
                {
                    "legacy_id": 641,
                    "name": "db_instance_1!f34aea44-1356-4677-8591-586aabb9bcc7"  ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 642,
                    "name": "db_instance_2!84bbf10e-b642-4836-861e-122345a1d265" ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 643,
                    "name": "db_instance_3!bb4c4c62-632a-426e-8aa5-fa84078e16f6" ►►►► Downtime on Host ◄◄◄◄◄
                },
                {
                    "legacy_id": 644,
                    "name": "db_instance_2!84b5f03d-3f8f-4ad9-8601-0078179324b5" ►►►► Downtime on Host ◄◄◄◄◄
                }
            ],
            "code": 200,
            "legacy_id": 640,
            "name": "parent_host.mydomain!b66b6a46-0dc9-4a1a-b87d-37740ca2bbe0",
            "status": "Successfully scheduled downtime 'parent_host.mydomain!b66b6a46-0dc9-4a1a-b87d-37740ca2bbe0' for object 'parent_host.mydomain'." ►►►►► Downtime on Parent Host ◄◄◄◄◄◄
        }
    ]
}

Regards

Apparently, with this dependency you declared your services as children of the host, which makes sense in one way. But I have to admit the behaviour doesn't reflect the GUI, which explicitly says that it triggers a downtime on child hosts.

I could reproduce this with this dependency:

apply Dependency "dependency" to Service {
    disable_notifications = false
    ignore_soft_states = true
    assign where match("*", host.name)
}

Another annoying issue: if you remove the host downtime manually, the service downtimes are not removed automatically.

You could create a GitHub issue for that.

As a workaround (if it is not too complicated and is applicable in your case), you can do the following:

  • create host groups for your dependencies
  • create a downtime with the filter “group-database” in host.groups (see the sketch below)
  • in combination, use DowntimeNoChildren

Of course this doesn't work in every scenario, but with this example you should be fine.
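A hedged sketch of such a call, reusing the example group name "group-database" and placeholder credentials/timestamps from this thread:

env -i /bin/curl -k -s -u root:XXXXXXXXXXXXX -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/actions/schedule-downtime' \
    -d '{ "type": "Host",
          "filter": "\"group-database\" in host.groups",
          "child_options": "DowntimeNoChildren",
          "author": "icingaadmin", "comment": "Rolling maintenance",
          "start_time": 1728075328, "end_time": 1728075388,
          "pretty": true }'

Since the downtime is applied directly to every host in the group and DowntimeNoChildren prevents any recursion, no dependent service should be touched.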

Alternatively you could

  • define a host var that references the parent(s)
  • recursively fetch the hosts that have the previous host as a parent
  • send the downtime with a filter of the form “hostname or hostname” (sketched at the end of this post)

And you can always create a GitHub issue for that :slight_smile:
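
A rough, one-level sketch of that second approach, assuming a hypothetical custom var vars.parent_host on the child hosts and jq for JSON parsing (a real script would repeat the query to recurse over further levels):

# fetch the children of parent_host.mydomain via the hypothetical custom var
children=$(env -i /bin/curl -k -s -u root:XXXXXXXXXXXXX \
    -H 'Accept: application/json' -H 'X-HTTP-Method-Override: GET' \
    -X POST 'https://localhost:5665/v1/objects/hosts' \
    -d '{ "filter": "host.vars.parent_host == \"parent_host.mydomain\"", "attrs": ["name"] }' \
    | jq -r '.results[].name')

# build an OR-ed filter: host.name=="parent" || host.name=="child1" || ...
filter='host.name==\"parent_host.mydomain\"'
for h in $children; do
    filter="$filter || host.name==\\\"$h\\\""
done

# schedule the downtime only on those hosts; DowntimeNoChildren stops any recursion
env -i /bin/curl -k -s -u root:XXXXXXXXXXXXX -H 'Accept: application/json' \
    -X POST 'https://localhost:5665/v1/actions/schedule-downtime' \
    -d "{ \"type\": \"Host\", \"filter\": \"$filter\",
          \"child_options\": \"DowntimeNoChildren\",
          \"author\": \"icingaadmin\", \"comment\": \"Rolling maintenance\",
          \"start_time\": 1728075328, \"end_time\": 1728075388, \"pretty\": true }"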