Question on Galera MySQL cluster with VIPs & HAProxy

Subscribed for updates… I believe this is what I have been looking for in my Galera config (Icinga2).

A question for the thread:
Is enable_ha = true/false set in /etc/icinga2/features-enabled/ido-mysql.conf? (I'm unsure whether it has a default value if it is not explicitly called out.)

It defaults to enable_ha = true (only one of the two masters actively connects and writes at a time).
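
For reference, a minimal sketch of the feature file with the attribute set explicitly (user/password/host are just the usual placeholders, adjust to your setup):

// /etc/icinga2/features-enabled/ido-mysql.conf (sketch)
object IdoMysqlConnection "ido-mysql" {
  user = "icinga"
  password = "password"
  host = "localhost"
  database = "icinga"

  /* defaults to true when omitted; with true only one endpoint writes at a time */
  enable_ha = true
}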


Awesome, and thanks for the very quick response!

UPDATE: It looks like I am hitting a wall trying to bind to port 3306 as the non-privileged user “haproxy” (RHEL 7).

[user@icinga02 rh-haproxy18 1517]$ sudo /opt/rh/rh-haproxy18/root/sbin/haproxy -f /etc/opt/rh/rh-haproxy18/haproxy/haproxy.cfg
[WARNING] 226/120753 (52452) : parsing [/etc/opt/rh/rh-haproxy18/haproxy/haproxy.cfg:89] : backend 'icinga2_ido' : 'option tcplog' directive is ignored in backends.
[WARNING] 226/120753 (52452) : config : 'option forwardfor' ignored for frontend 'icinga2_ido' as it requires HTTP mode.
[WARNING] 226/120753 (52452) : config : 'option forwardfor' ignored for backend 'icinga2_ido' as it requires HTTP mode.
[ALERT] 226/120753 (52452) : Starting frontend icinga2_ido: cannot bind socket [0.0.0.0:3306]

Where do you run HAProxy? On the same servers as MySQL?
If yes, change the port in the HAProxy config to something other than the MySQL port, or configure MySQL to listen on localhost only and HAProxy to bind only to non-localhost interfaces.
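
Just to illustrate the idea (a rough sketch; addresses are the ones from this thread, file locations depend on your packages): the goal is that MySQL and the HAProxy frontend do not both try to grab 0.0.0.0:3306.

# MariaDB: bind to the node's own address (or 127.0.0.1 if HAProxy should reach it via localhost)
[mysqld]
bind-address = 192.168.1.50

# HAProxy: bind the frontend to the VIP only instead of 0.0.0.0
frontend icinga2_ido
    bind 192.168.1.100:3306
    mode tcp
    default_backend icinga2_ido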

Correct, 2 servers; each server runs Icinga2 / Icingaweb2 / MariaDB Galera / HAProxy / Keepalived.
I changed the port from 3306 to 3309 for the icinga2_ido frontend.
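
The relevant part of my haproxy.cfg now looks roughly like this (sanitized and simplified, same placeholder IPs as below):

frontend icinga2_ido
    bind 0.0.0.0:3309
    mode tcp
    default_backend icinga2_ido

backend icinga2_ido
    mode tcp
    balance roundrobin
    server icinga01 192.168.1.50:3306 check
    server icinga02 192.168.2.50:3306 check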

Since I have never run this configuration, I am still trying to wrap my head around it.
Does this sound correct:

Master 1:
icinga2/icingaweb2/mariadb --> 192.168.1.50
keepalived/haproxy --> 192.168.1.100

Master 2:
icinga2/icingaweb2/mariadb --> 192.168.2.50
keepalived/haproxy --> 192.168.2.100

===========
UPDATE/EDIT: I finally got HAProxy online on the primary (Master1). However, once I start the process on Master2, it locks up the console for the main IP address 192.168.2.50… I believe this is due to HAProxy or keepalived taking that address offline?

[user@icinga01 haproxy 9451]$ sudo ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b0:84:0c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.50/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.2.100/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.100/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

[user@icinga02 ~ 1419]$ sudo ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b0:6a:f3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.50/24 brd 192.168.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.100/24 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.2.100/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

UPDATE: I believe I will need to split out my MariaDB Galera instances onto separate servers.
I could only reconnect to Icingaweb2 after adding:

GRANT SELECT, INSERT, UPDATE, DELETE, DROP, CREATE VIEW, INDEX, EXECUTE ON icinga.* TO 'icinga'@'192.168.1.50' IDENTIFIED BY 'password';
GRANT SELECT, INSERT, UPDATE, DELETE, DROP, CREATE VIEW, INDEX, EXECUTE ON icinga.* TO 'icinga'@'192.168.2.50' IDENTIFIED BY 'password';

…and if I read the documentation correctly, we would want these connections to go over the VIPs, which are 192.168.1.100 and 192.168.2.100.
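
If the connections really are supposed to come in via the VIPs, the equivalent grants would presumably just swap the host part (whether MySQL actually sees the VIP as the source address depends on how HAProxy makes its backend connections):

GRANT SELECT, INSERT, UPDATE, DELETE, DROP, CREATE VIEW, INDEX, EXECUTE ON icinga.* TO 'icinga'@'192.168.1.100' IDENTIFIED BY 'password';
GRANT SELECT, INSERT, UPDATE, DELETE, DROP, CREATE VIEW, INDEX, EXECUTE ON icinga.* TO 'icinga'@'192.168.2.100' IDENTIFIED BY 'password';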

I also had to comment out VIP2 in keepalived.conf on Master1 and VIP1 on Master2.
That leaves the “ip addr show” output with only 2 IPs instead of the 3 shown in the previous update.

As for the HAProxy stats page, I see updates/counts on the icinga2-ido but not for the mysqlread

I am not sure if I understand the quoted statement correctly, but does this mean that your keepalived config on the first node looks like this:

vrrp_instance VI_1 {
...
    virtual_ipaddress {
        192.168.1.100
    }
...
}

and the configured vip on the second node is 192.168.2.100?

Usually the VIP should be the same on both instances, so that you have one virtual IP that is available on one of the nodes at any time; for example, on the first node as the preferred master unless that instance goes down. Keepalived on the 2nd node will then notice the missing VRRP announcements, promote itself to master, and add the VIP that was assigned to the first node to a local interface. That way all client applications have a single endpoint: the VIP.

Can you show us the client error messages or log files for the issues you had when using the same VIP in both keepalived configuration files? Could this be an issue with net.ipv4.ip_nonlocal_bind?


Apologies for the delay; the weekend got ahold of me.

Correct. Here is my keepalived config on Master1:
[screenshot of Master1 keepalived.conf]

and Master2:
[screenshot of Master2 keepalived.conf]

Sysctl output: I do have this enabled.

[user@icinga01 ~ 9577]$ sysctl net.ipv4.ip_nonlocal_bind
net.ipv4.ip_nonlocal_bind = 1
[user@icinga02 ~ 1639]$ sysctl net.ipv4.ip_nonlocal_bind
net.ipv4.ip_nonlocal_bind = 1

IP Addr Show

[user@icinga01 ~ 9582]$ sudo ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b0:84:0c brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.50/24 brd 192.168.1.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.1.100/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

[user@icinga02 ~ 1641]$ sudo ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:b0:6a:f3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.2.50/24 brd 192.168.2.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.2.100/24 scope global secondary eth0
       valid_lft forever preferred_lft forever

So you still have two different IPs (192.168.1.100 and 192.168.2.100), which means the client would have to support more than one endpoint (which is not the case).

So let’s try to fix this.

What was the issue when you configured the VIP to be 192.168.1.100 on both nodes? Can you share the logs of that issue? Or did you just notice that IP missing on the 2nd node while it was active on the 1st one?

Another issue I see is that you use state MASTER on both nodes and both have the same priority. I would recommend a setup with one node as MASTER and the 2nd as BACKUP, or at least with different priorities. With your current configuration I see some potential for frequent state changes and flapping. Defining rise and fall for the vrrp_script could also help avoid too-frequent failovers.
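
Roughly what I mean, as a sketch (virtual_router_id, the check script and the interface name are placeholders you would adapt):

vrrp_script chk_haproxy {
    script "pidof haproxy"
    interval 2
    rise 2
    fall 3
}

vrrp_instance VI_1 {
    state MASTER            # BACKUP on the 2nd node
    interface eth0
    virtual_router_id 51    # must be identical on both nodes
    priority 100            # lower value, e.g. 50, on the 2nd node
    advert_int 1
    virtual_ipaddress {
        192.168.1.100
    }
    track_script {
        chk_haproxy
    }
}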

Quick question… in the guide above, the keepalived.conf shows a VIP_1 and VIP_2 on each server. I noticed both servers there are on the same subnet, so one server assuming the other’s IP would be fine.

However, in my scenario I have 2 different subnets, which is why I commented out the 2nd VIP entry.
I actually want both of my icinga2/icingaweb2 endpoints to be online, and I want HAProxy to round-robin the connections to the MariaDB Galera cluster.

I updated my keepalived config:

  • master1 now has MASTER role w/prio 100
  • master2 now has BACKUP role w/prio 50

As for log sharing, it's a little difficult: I have been masking my hostnames/IPs to sanitize them for this thread.

UPDATE: I may be able to accommodate 2 servers on the same subnet. I just found out we have a stretched LAN from our main site to the DR site; give me a bit to reconfigure the IPs on the MASTER_2 server.

FYI, I've split the topic and moved your specific questions into a new one. This keeps the howto readable and lets us focus specifically on your setup questions.


UPDATE: I was able to re-IP my servers; they are now on the same subnet.

  • updated /etc/keepalived/keepalived.conf to match the example in this thread, now using Master/Backup blocks on each server.

  • updated /etc/opt/rh/rh-haproxy18/haproxy/haproxy.cfg to have both IPs updated for Icinga2_ido & Icingaweb2_ido

  • updated /etc/icinga2/zones.conf and /etc/icinga2/zones.d/master/hosts.conf (for the new IPs)

  • updated /etc/opt/rh/rh-mariadb103/my.cnf.d/galera.cnf (for the new IPs; see the sketch after this list)

  • updated /etc/opt/rh/rh-mariadb103/my.cnf.d/mariadb-server.cnf

  • rebooted, bootstrapped a new Galera cluster instance, and had the 2nd master join…
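
For reference, the galera.cnf now looks roughly like this on the first node (IPs, names and the provider path are sanitized placeholders; the 2nd node uses its own address and name):

[galera]
wsrep_on = ON
wsrep_provider = /opt/rh/rh-mariadb103/root/usr/lib64/galera/libgalera_smm.so   # adjust to wherever libgalera_smm.so lives
wsrep_cluster_name = "icinga_cluster"
wsrep_cluster_address = "gcomm://192.168.1.50,192.168.1.51"
wsrep_node_address = "192.168.1.50"
wsrep_node_name = "icinga01"
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2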

Everything came online correctly, and I am able to view the HAProxy stats page. There are no errors in Icinga at this point.

To test this, I disconnected the vNIC for Master2 in VMware; I wanted to simulate a disruptive network event between the 2 physical sites.

Expectation:
The Master1 node might hiccup briefly, then connect to its local DB instance and keep chugging…

Result:
The Master1 node assumes the VIP; ip addr show now lists 3 IPs.
I am getting a nice error in the Icinga GUI:

SQLSTATE[08S01]: Communication link failure: 1047 WSREP has not yet prepared node for application use, query was: SELECT CASE WHEN (UNIX_TIMESTAMP(programstatus.status_update_time) + 60 > UNIX_TIMESTAMP(NOW())) THEN 1 ELSE 0 END AS is_currently_running, programstatus.process_id, programstatus.endpoint_name, UNIX_TIMESTAMP(programstatus.program_start_time) AS program_start_time, UNIX_TIMESTAMP(programstatus.status_update_time) AS status_update_time, programstatus.program_version, UNIX_TIMESTAMP(programstatus.last_command_check) AS last_command_check, UNIX_TIMESTAMP(programstatus.last_log_rotation) AS last_log_rotation, programstatus.global_service_event_handler, programstatus.global_host_event_handler, programstatus.notifications_enabled, UNIX_TIMESTAMP(programstatus.disable_notif_expire_time) AS disable_notif_expire_time, programstatus.active_service_checks_enabled, programstatus.passive_service_checks_enabled, programstatus.active_host_checks_enabled, programstatus.passive_host_checks_enabled, programstatus.event_handlers_enabled, programstatus.obsess_over_services, programstatus.obsess_over_hosts, programstatus.flap_detection_enabled, programstatus.process_performance_data FROM icinga_programstatus AS programstatus

Unsure if this has any correlation:
sudo cat /mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 2d859d0d-c2af-11e9-b811-d72f4992e5b7
seqno: -1
safe_to_bootstrap: 0

I may have found the answer to my question… I will read up on this and adjust:
https://galeracluster.com/library/kb/best/two-node-clusters.html

UPDATE:
I found 3 possible resolutions:

  1. If you have the resources, run 3 MariaDB servers in the cluster
  2. Follow the recommendation in the link posted above (this seems to be an on-the-fly fix, shown below, for when you are down a node and need to re-enable Galera)
  3. Install the Galera Arbitrator on a 3rd node (I went with this option, on an admin host that lives on the same network)
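
For completeness, the on-the-fly fix from option 2 is just this, run on the surviving node (per the Galera KB article linked above):

SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';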

Option 3 is what I would have recommended as well. It's a simple solution if the available resources are limited.

You might want to get an alert if the error from above appears again. From my experience, the 4 most important status variables to monitor on a Galera cluster are these:
show global status where variable_name in ('wsrep_cluster_size', 'wsrep_cluster_status', 'wsrep_connected', 'wsrep_ready');
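
On a healthy node that query should come back looking something like this (cluster size 3 here because the arbitrator counts as a member):

+----------------------+---------+
| Variable_name        | Value   |
+----------------------+---------+
| wsrep_cluster_size   | 3       |
| wsrep_cluster_status | Primary |
| wsrep_connected      | ON      |
| wsrep_ready          | ON      |
+----------------------+---------+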

In your case, wsrep_ready was OFF.

This plugin (https://github.com/fridim/nagios-plugin-check_galera_cluster) does that for you. It's not perfect, but it does a good job and has already helped us a few times to spot issues before anyone else did. :slight_smile:
