Implementing HA Master-Master setup

Thank you @log1c for the enlightening answer.
I have successfully created what is needed.
One question though.
If I do not set enable_ha=false, one of the two IDO databases gets disabled. As far as the documentation goes, that is the expected behavior.
Does this mean that the databases do not get synced?
If I create client checks on master1, will the checks get propagated to master2?
Lastly, I want to enable the Director; will the Director database get synced as well?
Sorry if my questions are somewhat “noobie”, but after reading the documentation many, many times, I still find it somewhat chaotic.

Best Regards,
Panagiotis

Hm, can’t answer that as I don’t know. Maybe someone else can share their thoughts.

Yes, checks are distributed between the two masters and the last remaining one takes over in case of a failure (the same goes for satellites in a zone).

No, this DB does not get synced. Also only one of the masters can be the config master.

Hi,

replication inside the MySQL cluster needs to be done by the user. This is a separate HA cluster, and has nothing to do with Icinga and its HA capabilities.

The best thing is a central virtual cluster IP address where both Icinga instances can write their IDO data. With enable_ha being true by default, only one master will actively write to the database backend at a time.

If you don’t have the possibility to create a MySQL cluster with a VIP yourself, you can use local databases on each Icinga master, where each of them writes to its local database. This is what enable_ha=false ensures: both IDO features are active and running.
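
For illustration, a minimal ido-mysql.conf for this local-database variant could look like the sketch below (hostnames and credentials are placeholders); the VIP scenario would instead point host at the cluster address and leave enable_ha at its default of true:

object IdoMysqlConnection "ido-mysql" {
  user = "icinga"
  password = "icinga"
  host = "localhost"   // each master writes to its own local DB
  database = "icinga"
  enable_ha = false    // keep the IDO feature active on both masters
}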

In addition to the two masters with local databases, there are also users who build their own MySQL replication between the nodes. This works™, but if you run into slave synchronisation lag, it will generate a huge kab00m with DB exceptions, huge binlogs, etc. I don’t recommend that scenario unless you are an experienced DB admin.

Instead, a central database on a dedicated host, or a DB cluster behind that IP, works best in an HA-enabled Icinga master scenario. Try it out simply: create two masters and one DB host, and test the runtime behavior.

Cheers,
Michael


So, a MySQL cluster with a VIP is not an option, since the two Icinga master servers with the local databases are in different regions (one in Frankfurt and the other in the UK).

What I’m trying to achieve here is:

  1. Two Icinga masters
  2. Both synced (checks and databases/database)
  3. If one goes down, the other one takes over.

I was under the impression that Icinga 2 HA means the IDO database is replicated automagically. Is Icinga 2 HA only for the Icinga 2 services, not including the databases?

If I enable HA on the IDO databases, I lose one of them, since only one IDO database is active at a time. Will DNS records (active/failover) work if both databases have the same hostname (with different IPs) and I set up the IDO connection to use the hostname instead of the IP? If the one IDO database goes down, will the other one come up?

I’m trying to find the best possible scenario for a truly highly available solution, since these servers will monitor a huge number of servers and I want to make sure the monitoring system meets a 99.99% SLA uptime.

That’s what I thought, too.
I’m running one setup with two masters where each has its own database server and enable_ha=false. Previously this was a two-node (I know…) Galera “cluster”, which I deactivated.
I have never checked whether the databases on the two servers are really the same.

Hi,

Use Vagrant or your cloud provider and build a simple setup with 2 masters and 1 DB host as VMs. Test your scenarios in there and collect your answers. You will see how replication works, and which parts of HA work as described in the documentation. If IDO replication isn’t described in the docs, it doesn’t exist. The docs are complete, and every supported functionality is documented there.

I just wanted to add, since you mentioned the different regions with data centers spread around the globe: ensure that this is a low-latency connection. Otherwise the HA syncs (config, check results, etc. as runtime data) will be slow and you won’t gain any benefit from them.

The cluster functionality takes care of syncing runtime events to every node in the same zone. That is to say, if you schedule a downtime, or a new check result reaches the master zone, both masters receive the message and will forward it to their backends. In that second, you’ll have the check result written to 2 local databases.
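
As a sketch, such a two-master zone is defined in zones.conf on both nodes roughly like this (endpoint names are placeholders):

object Endpoint "master1.example.com" {
  host = "master1.example.com"
}

object Endpoint "master2.example.com" {
  host = "master2.example.com"
}

object Zone "master" {
  endpoints = [ "master1.example.com", "master2.example.com" ]
}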

I have the feeling that you’re mixing up runtime event replication with true built-in MySQL replication. Icinga does not look into both IDO databases and run a replication sync. That is something MySQL does on its own, when configured, e.g. with binlogs and master-master or master-slave replication setups.

MySQL/PostgreSQL as a storage layer is a different application cluster. Icinga only ensures that both masters are kept in sync, allowing a failover with the same historical and runtime data.

Cheers,
Michael

PS: There are numerous topics in this community where this has been discussed. You’ll likely find other opinions there, and you are obviously free to build what’s best for you. Just keep in mind that complicated, non-standard setups will make it harder for others to read, understand, and provide useful answers.


Hi,

I am trying to set up two Icinga 2 servers that both write into their own database, similar to this topic.

I’m stuck at the simplest thing: I cannot find the config file where the “enable_ha=false” setting needs to be put.

Thank you in advance.

Hi,

depending on your installed database:

  • MariaDB/MySQL: /etc/icinga2/features-available/ido-mysql.conf
    See: https://icinga.com/docs/icinga-2/latest/doc/09-object-types/#idomysqlconnection
  • PostgreSQL: /etc/icinga2/features-available/ido-pgsql.conf
    See: https://icinga.com/docs/icinga-2/latest/doc/09-object-types/#idopgsqlconnection


Thank you! I’ve found the file and added the setting.


Hi,

just to be sure: are there only two possibilities regarding an HA master setup with two master instances?

  1. Active-passive mode: there are two Icinga master nodes communicating with one database. If one master instance goes down, the other takes over. The enable_ha attribute must be set to true.

  2. Active-active mode: there are two Icinga master nodes, each communicating with its own database. If one master instance goes down, the other one is still available. The enable_ha attribute must be set to false.

Is this correct?

And another topic: is it possible for the database to run on a different VM than the Icinga master instances, or does the DB have to be installed locally on the Icinga master VM?


Not quite.
Instances in a zone (be it master or satellite) always distribute the hosts and services between them (active-active). If one of the instances goes down, the remaining instance will take over the checks from the failed instance.

Regarding enable_ha for the database feature (IDO):
If you set it to false, each node will write into its own database.
If you set it to true (the default), only one of the nodes will write into the database.
But both nodes will still be “active” regarding check execution!

The DB can be set up on a separate server; you then just need to configure the correct parameters in the IDO config file (or during setup of the web interface).
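
For example, with a dedicated DB server, the IDO config would simply point at that host instead of localhost; a sketch with placeholder hostname and credentials:

object IdoMysqlConnection "ido-mysql" {
  user = "icinga"
  password = "icinga"
  host = "db1.example.com"   // dedicated DB server instead of localhost
  database = "icinga"
  enable_ha = true           // default: only one master writes at a time
}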

The docs have a quite elaborate chapter describing the HA features:
https://icinga.com/docs/icinga-2/latest/doc/19-technical-concepts/#high-availability

Is it possible to set enable_ha=true while each node has its own database?

If you set it to true (the default), only one of the nodes will write into the database.
But both nodes will still be “active” regarding check execution!

Does that mean both master nodes perform checks simultaneously, but only one node stores the results in the DB? If so, this would significantly increase the load on the network and on the nodes being checked…

Possible? Yes.
Would I do that? No. If I am not mistaken, that scenario would leave you with two databases in different states, as there is no replication between the databases. If you want multiple DB servers, you need some form of DB replication. Icinga does not cover that.

Yes, both nodes in the same zone will split the host objects and their checks between them (see the docs).
If enable_ha is set to true, only one of the nodes will write to the DB; the other node will sync its received check results to the node writing to the DB.

Why? Please explain.

6 or 7 years ago I installed a master-master setup in two data centers, one in Nuremberg and one in Berlin. Both masters wrote to a locally running MariaDB. MariaDB had Galera installed, so the data was replicated by Galera across the two locations.
I used a small VM with MariaDB and Galera only for the quorum.
I left the company some years ago, but as far as I know, the setup is still running without any problems.

@log1c
Is it correct that in a master HA scenario one instance (e.g. master1) is a complete Icinga 2 master installation, while the second instance (master2) consists of a satellite installation?

If so, in case of a master1 failure, master2 wouldn’t be able to send data to master1, and no data would be written to the IDO, since only master1 can manage that?

Master2 starts as a satellite setup, but after that you have to do some manual tasks which elevate it from a satellite to a master (zones config, feature config).
If master1 goes down, master2 will take over the checks normally held by master1 and will start writing to the IDO DB, if enable_ha = true. If the HA functionality for the IDO feature isn’t enabled, both masters need their own DB configured and will always write to that DB.

@log1c
Where are the steps for elevating a satellite to a master described?
Or is it just copying the zones and features directories?

Check this section of the docs 🙂
https://icinga.com/docs/icinga-2/latest/doc/06-distributed-monitoring/#high-availability-master-with-agents

@log1c
Thank you, that documentation was very helpful! 🙂

I managed to get master2 to take over the checks when the icinga2.service on master1 goes down (I can see that with tcpdump):

[2022-02-10 14:15:27 +0100] warning/JsonRpcConnection: API client disconnected for identity 'otn-ac-monq-ma01.aircloud.common.airbusds.corp'
[2022-02-10 14:15:27 +0100] warning/ApiListener: Removing API client for endpoint 'otn-ac-monq-ma01.aircloud.common.airbusds.corp'. 0 API clients left.

But sadly, master2 is not writing to the IDO.
Should a specific feature from master1 also be enabled on master2?
Currently there are only:

  • api.conf
  • checker.conf
  • mainlog.conf

The ido-pgsql.conf on master1 is set to:

object IdoPgsqlConnection "ido-pgsql" {
  user = "icinga"
  password = "xxx"
  host = "otn-ac-monq-db01.localdomain"
  database = "icinga"
  enable_ha=true
}

Debug log of master2 (= otn-ac-monq-sa01) when master1 is down on purpose:

notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-ma01.localdomain' because the host/port attributes are missing.
debug/ApiListener: Not connecting to Endpoint 'otn-ac-monq-sa01.localdomain' because that's us.
notice/ApiListener: Current zone master: otn-ac-monq-sa01.localdomain
notice/ApiListener: Updating object authority for objects at endpoint 'otn-ac-monq-sa01.localdomain'.
notice/CheckerComponent: Pending checkables: 0; Idle checkables: 0; Checks/s: 0
debug/CheckerComponent: Scheduling info for checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' (2022-02-10 14:33:08 +0100): Object 'otn-ac-monq-sa02.localdomain!ping4onZone', Next Check: 2022-02-10 14:33:08 +0100(1.6445e+09).
debug/CheckerComponent: Executing check for 'otn-ac-monq-sa02.localdomain!ping4onZone'
notice/ApiListener: Connected endpoints: 
notice/ApiListener: Relaying 'event::SetLastCheckStarted' message
debug/Checkable: Update checkable 'otn-ac-monq-sa02.localdomain!ping4onZone' with check interval '20' from last check time at 2022-02-10 14:32:43 +0100 (1.6445e+09) to next check time at 2022-02-10 14:33:49 +0100 (1.6445e+09).
notice/ApiListener: Relaying 'event::SetNextCheck' message
notice/Process: Running command '/usr/lib64/nagios/plugins/check_ping' '-4' '-H' 'otn-ac-monq-sa02.localdomain' '-c' '200,15%' '-w' '100,5%': PID 4031
debug/CheckerComponent: Check finished for object 'otn-ac-monq-sa02.localdomain!ping4onZone'

Is the ido-pgsql feature enabled on master2 and the config correct as well?

Both masters need to have the same features enabled, or they won’t be able to take over all tasks from one another (e.g. the IDO connection or notifications).

From the docs:

Note: All nodes in the same zone require that you enable the same features for high-availability (HA).
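
In this case that would mean enabling the ido-pgsql feature on master2 as well (icinga2 feature enable ido-pgsql) and giving it the same connection settings as master1. As a sketch, reusing the config shown above and assuming the DB host is reachable from master2, master2’s /etc/icinga2/features-available/ido-pgsql.conf would look like:

object IdoPgsqlConnection "ido-pgsql" {
  user = "icinga"
  password = "xxx"
  host = "otn-ac-monq-db01.localdomain"
  database = "icinga"
  enable_ha = true   // only the active master writes; master2 takes over on failover
}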