High Availability Cluster with Pacemaker Part 2: Planning

Now let's talk about planning our High Availability Cluster with Pacemaker: What kind of hardware do we need, how many nodes, and what are the network requirements?

https://clusterlabs.org/

This solution is also used by SUSE and Red Hat in their products:

https://www.suse.com/products/highavailability/

https://www.redhat.com/en/store/high-availability-add#?sku=RH00025

JWB-Systems can provide training and consulting on all kinds of Pacemaker clusters:

SLE321v15 Deploying and Administering SUSE Linux Enterprise High Availability 15

SLE321v12 Deploying and Administering SUSE Linux Enterprise High Availability 12

RHEL 8 / CentOS 8 High Availability

RHEL 7 / CentOS 7 High Availability

Principles of Planning
In this part we talk about how to plan our cluster, what hardware is needed for an optimal setup, and where the pitfalls are.

Number of Nodes
Split Brain
2 Nodes or 2+N Nodes and Quorum
STONITH
Network Planning (NICs, Switches, Shared Storage)
Number of Nodes
Pacemaker is a multi-node Active/Active cluster stack, and the distributions support clusters with many nodes; SUSE, for example, supports clusters of up to 32 nodes.
There is no need for a dedicated standby node that just waits until the active node fails; all nodes can be active and carry resources. But you should consider node failures in your planning. What happens in a 2-node cluster if every node is already running at 70% load under normal circumstances and one node fails? Correct: the remaining node now has to carry 140% load and is overloaded!
Conclusion: Plan the load reserves in your cluster so that a node failure does not overload the remaining nodes!
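As a rule of thumb (my own back-of-the-envelope formula, assuming the load spreads evenly after a failover): in an N-node cluster that should survive one node failure, keep the average load per node below (N-1)/N × 100%. That is 50% for 2 nodes, about 66% for 3 nodes, and 80% for 5 nodes.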

Also, don't build clusters that are too big. If you plan to build a 15-node cluster, I would advise against it: if you experience a cluster-wide problem, all 15 nodes will stop running resources. Total failure!
Better to build three 5-node clusters or five 3-node clusters. Then a possible problem in one cluster has a smaller impact footprint.

Split Brain
Split Brain is a scenario in a cluster that we never want to have! It describes a communication problem between the nodes: the nodes cannot see each other anymore on the Corosync layer and no longer know the status of their partner nodes.
And since the cluster tries to achieve high availability, it wants to make sure all resources in the cluster are started.
So every node starts the resources that it cannot see running (because they were running on the partner node before). This leads to resources running twice, which is very bad: most resources write their data to shared storage, and now two instances of a resource try to write to the same shared storage in parallel. Result: your data is destroyed! Super bad!
We must avoid or resolve such a situation immediately. This can be done with quorum and STONITH.

2 Nodes or 2+N Nodes and Quorum
2+N Nodes
There is a big difference between running a 2-node cluster and a 2+N node cluster.
This difference is called quorum.
Quorum is a mathematical majority. If we have, for example, 3 nodes, and there is a problem in the cluster communication between node3 and the other nodes, the result is 2 partial clusters:
– node1 and node2 can communicate with each other and form partial cluster 1; they say: we are 2 out of 3
– node3 cannot see any other node and forms partial cluster 2; it says: I am 1 out of 3
Partial cluster 1 has a mathematical majority (2 out of 3); it has achieved quorum.
Partial cluster 2 has no mathematical majority (1 out of 3) and therefore has no quorum.
The partial cluster with quorum is allowed to start all resources to achieve high availability, while the partial cluster without quorum stops all resources it is running.

This method helps to resolve split brain scenarios, but it is not mandatory to use it. Using quorum also has downsides: imagine a 2-node failure in our 3-node cluster! Yes, correct: the remaining node will lose quorum and stop all services. Not very highly available, right?
So it is up to you whether you want to use quorum in a 2+N node cluster. If you use it, it is also recommended to use an odd number of nodes.
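As a small sketch of how this looks in practice (crmsh syntax; the property value is just an example): you can inspect the quorum state with corosync-quorumtool, and you tell Pacemaker how to react to quorum loss with the no-quorum-policy property.

    # Show the quorum state of the partition this node is in
    corosync-quorumtool -s

    # Tell Pacemaker what to do on quorum loss
    # (stop = stop all resources in the partition without quorum)
    crm configure property no-quorum-policy=stop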

2 Nodes
There is one situation where quorum is not usable anyway: 2-node clusters (and these are nowadays the most commonly used clusters).
If we lose cluster communication between the 2 nodes, we again have 2 partial clusters, each saying: I am 1 out of 2. And that is no mathematical majority! So neither node has quorum, and both nodes would stop all resources.
Conclusion: in a 2-node cluster you have to disable quorum! Pacemaker detects that automatically and disables it.
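On the Corosync layer this is controlled by the two_node switch; here is a minimal excerpt from /etc/corosync/corosync.conf (just a sketch, a real config contains more settings):

    quorum {
        provider: corosync_votequorum
        # special 2-node mode: a single node may keep quorum alone;
        # it implies wait_for_all, so both nodes must be seen once at startup
        two_node: 1
    }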

But what happens without quorum? In the previously described case, all nodes would now start all services, and we have split brain with data corruption!
We need another measure to make sure this does not happen: STONITH.

STONITH
STONITH is the abbreviation for “Shoot The Other Node In The Head!”. And it does exactly that!
I will explain STONITH only briefly now; my next cluster blog article will cover STONITH in depth, because it is very important.
For now we just need to know enough to plan the hardware for the cluster.

Example: node1 cannot communicate with node2 (the reason can be network problems, a kernel freeze on the node, or anything else), so node1 decides to STONITH node2: it executes a hard reset (not a reboot, a real hard reset) on node2. This makes sure node2 is no longer running any resources that access the shared storage. After that, node1 will start all resources.
It is the only really reliable technique to make sure that split brain scenarios cannot happen.
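Once a fencing device is configured (more on that in the next article), you can trigger a fence manually to test it; a sketch, assuming a node named node2:

    # Ask the cluster to fence (hard-reset) node2 via the configured STONITH device
    stonith_admin --reboot node2

    # crmsh equivalent
    crm node fence node2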

STONITH can be achieved in 2 ways: hardware based and software based.

Hardware based STONITH
For hardware based STONITH we need a STONITH device that can control the power supply of every node.
That can be network-controllable power strips, a network-controllable UPS, or the baseboard management controllers of the nodes, for example HP iLO or Dell DRAC. Modern baseboard management controllers understand the IPMI protocol, and the cluster can use that protocol to control the power status of the nodes.
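A sketch of how such a device shows up later in the cluster configuration (crmsh syntax; IP address, user, and password are placeholders, and you should check the exact parameter names in the metadata of the external/ipmi agent):

    # One STONITH resource that can hard-reset node2 through its BMC via IPMI
    crm configure primitive stonith-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.168.100.12 \
        userid=admin passwd=secret interface=lanplus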

Software based STONITH
Software based STONITH (also called SBD) needs no such hardware, but it needs shared SAN storage.
The SAN (Fibre Channel or iSCSI) must provide a small LUN (10 MB is enough) that is visible on every node. And the mainboards of our nodes must have a watchdog chip onboard; every modern server mainboard has such a watchdog chip integrated, as do a lot of desktop mainboards. That is all we need to achieve reliable STONITH.
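A minimal sketch of the SBD setup (the device path is a placeholder; on SUSE the defaults live in /etc/sysconfig/sbd):

    # Initialize the small shared LUN as SBD device (run once, on one node)
    sbd -d /dev/disk/by-id/scsi-SHARED_LUN create

    # Verify the header and the node message slots
    sbd -d /dev/disk/by-id/scsi-SHARED_LUN dump

    # /etc/sysconfig/sbd (excerpt)
    SBD_DEVICE="/dev/disk/by-id/scsi-SHARED_LUN"
    SBD_WATCHDOG_DEV="/dev/watchdog"

    # STONITH resource in the cluster (crmsh syntax)
    crm configure primitive stonith-sbd stonith:external/sbd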

Cluster nodes on virtual machines or in the cloud can be STONITHed by giving the virtualization host the command to reset the virtual machine. Supported hosts are Xen, KVM, and VMware for on-premises systems; if you are in the public cloud, there are STONITH plugins that support AWS, Azure, Google Cloud, and so on.
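For KVM guests, for example, there is the external/libvirt agent; a sketch, assuming the host is reachable via SSH as kvmhost (names and URI are placeholders):

    # Fence VMs by resetting them through the libvirt API on the KVM host
    crm configure primitive stonith-libvirt stonith:external/libvirt \
        params hostlist="node1,node2" \
        hypervisor_uri="qemu+ssh://kvmhost/system"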

STONITH Conclusion
Which kind of STONITH you use depends on your environment. But there MUST be STONITH! Clusters without STONITH are not supported by SUSE or Red Hat, and that is for a very good reason!

Here is a list of supported STONITH variants for SUSE clusters for your planning:
apcmaster
apcmastersnmp
apcsmart
baytech
cyclades
drac3
external/drac5
external/dracmc-telnet
external/ec2
external/hetzner
external/hmchttp
external/ibmrsa
external/ibmrsa-telnet
external/ipmi
external/ippower9258
external/kdumpcheck
external/libvirt
external/nut
external/rackpdu
external/riloe
external/sbd
external/vcenter
external/vmware
external/xen0
external/xen0-ha
ibmhmc
meatware
nw_rpc100s
rcd_serial
rhcs/cisco_ucs
rps10
suicide
wti_mpc
wti_nps

Network Planning (NICs, Switches, Shared Storage)
Now it’s time to think about your network. We need cluster communication, user communication, admin/management communication and storage communication.

The cluster communication network
This network must be redundant. There are two ways to accomplish that: you can build a bond of two NICs on every node and run a single Corosync communication ring over that bond, or you can use two NICs on every node directly and create the redundancy in the Corosync layer by letting Corosync communicate over two rings. Which method you choose is up to you; both are OK. When you connect a two-node cluster directly (without a switch), be careful with the bond modes: if you implement it wrong (for example, both bonds in failover mode), your cluster communication will fail with 50% probability!
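A sketch of the two-ring variant in /etc/corosync/corosync.conf (corosync 2.x syntax, excerpt only; the network addresses are examples, and a complete config also needs the transport and node settings):

    totem {
        version: 2
        # redundant ring protocol: ring 1 stands by and takes over if ring 0 fails
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0
            mcastport: 5407
        }
    }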

User communication Network
The network for user communication should also use a bond of two NICs on every node. The same applies to the admin/management communication.
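For reference, a bond on SUSE looks roughly like this (an ifcfg-style sketch; interface names and the IP address are placeholders):

    # /etc/sysconfig/network/ifcfg-bond0
    STARTMODE='auto'
    BOOTPROTO='static'
    IPADDR='192.168.10.11/24'
    BONDING_MASTER='yes'
    BONDING_MODULE_OPTS='mode=active-backup miimon=100'
    BONDING_SLAVE0='eth0'
    BONDING_SLAVE1='eth1'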

Storage communication network
The storage communication should also be redundant, and it should always be a separate network, because we can expect a high data load when the applications on the nodes communicate with the shared storage.
If your storage communication is FC rather than network based, make sure your FC SAN and FC switches are attached with multipathing.
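You can verify the redundant paths on a node with the multipath tools (assuming multipath-tools / device-mapper-multipath is installed):

    # Every LUN should show at least two active paths
    multipath -ll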

Additional network recommendations
If your cluster nodes are virtual machines (on premises or in the public cloud), and the network redundancy is already guaranteed on the virtualization host layer, you can also use single NICs in your nodes for the communication.
But make sure cluster communication and storage communication each run on a dedicated network.

A good approach is to have two 4-port network cards in every cluster node, so that for every redundant network you can use one port on card1 and one port on card2.

Finished hardware planning
Now we have the necessary knowledge to plan and buy the hardware for our cluster. The next part will cover STONITH in detail.

CU here again soon 🙂