High Availability Cluster with Pacemaker Part 3: STONITH

This part of the blog series covers STONITH in the Pacemaker cluster stack from ClusterLabs:

https://clusterlabs.org/

Pacemaker is also used by SUSE and Red Hat in their products:

https://www.suse.com/products/highavailability/

https://www.redhat.com/en/store/high-availability-add#?sku=RH00025

JWB-Systems can deliver you training and consulting on all kinds of Pacemaker clusters:

SLE321v15 Deploying and Administering SUSE Linux Enterprise High Availability 15

SLE321v12 Deploying and Administering SUSE Linux Enterprise High Availability 12

RHEL 8 / CentOS 8 High Availability

RHEL 7 / CentOS 7 High Availability

Bad things will happen without STONITH
STONITH (Shoot The Other Node In The Head) is designed to prevent the worst thing that can happen in a cluster:
Split brain!
Let’s take a two-node cluster without STONITH as an example:
If the two nodes can no longer communicate with each other over the cluster network (corosync),
both think: “I am alone, the other node is gone. It’s my responsibility to provide high availability, so I will start all resources.”
As a result, every resource runs twice. For some resources this is not that bad, but for others, such as Filesystem resources (mounting a filesystem from shared storage), it is disastrous: a filesystem mounted on two nodes at once gets corrupted very quickly!

How can STONITH help
Take the same example as above (two nodes that cannot communicate via corosync), but with STONITH. Both nodes again think: “I am alone, it’s my responsibility to start all resources.”
But they don’t do that immediately! They know that STONITH is enabled (a setting in the cluster properties; a cluster without it enabled is not supported), so each one thinks: “I must first make sure that the other node is really gone.” So both try to hard reboot (or power off) the other node via STONITH first.
Why hard? During a regular shutdown or reboot the other node still has plenty of time to wreck your data. A regular shutdown is also not reliable: it could get stuck, leaving the Pacemaker service running and doing damage.

Why do both nodes try that?
Because neither node knows the status of the other, and both feel responsible for providing high availability.

Isn’t that bad if both nodes do that?
Theoretically, both nodes could manage to send a hard reboot signal to the other node’s STONITH device. In practice, one node is almost always faster than the other, so there is only one survivor. It’s like a shootout in the Wild West: the faster one survives.

OK, one node is faster and sends the STONITH signal to the other. What happens then?
After the STONITH device has received the signal, it power cycles or powers off the loser node (depending on the configuration in the cluster properties). After that, the STONITH device sends a confirmation of the successful STONITH back to the triggering node.
When the triggering node receives the confirmation in time (there is a configurable timeout for that, again in the cluster properties), it knows:
OK, now it’s guaranteed: the other node is dead!
And it starts all resources.
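
The three cluster properties mentioned so far can be set from the command line. This is a minimal sketch using the crmsh shell from SUSE (on Red Hat systems, pcs property set takes the same name=value pairs). The values shown are examples, not recommendations:

```shell
# Enable STONITH (required for a supported cluster)
crm configure property stonith-enabled=true
# What to do to the loser node: "reboot" (power cycle) or "off"
crm configure property stonith-action=reboot
# How long to wait for the STONITH confirmation (example value)
crm configure property stonith-timeout=60s
# Check the resulting configuration
crm configure show
```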

About the loser node
The loser node either gets a hard shutdown signal and stays powered off, or it gets a power cycle signal and reboots.
In the second case there is one important thing to consider:
If the node reboots and the Pacemaker service is set to start at boot in systemd, the node rejoins the cluster after the reboot.
Sometimes this is wanted, but in most cases a node that was kicked out of the cluster by STONITH should not rejoin automatically. First you have to investigate the reason for the STONITH; otherwise it could happen again and you get a yo-yo effect.
My suggestion:
Use power-cycle STONITH and turn off Pacemaker autostart in systemd.
Then the node reboots and can be investigated, but it will not rejoin the cluster automatically, only manually once the admin decides: yes, the node is OK.
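
On a systemd-based distribution this could look as follows (a sketch, assuming the service unit is called pacemaker, as it is on SUSE and Red Hat systems):

```shell
# Do not start the cluster stack automatically at boot
systemctl disable pacemaker
# Verify the setting; this should report "disabled"
systemctl is-enabled pacemaker
# After the investigation, let the node rejoin the cluster manually
systemctl start pacemaker
```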

SBD STONITH
There is one kind of STONITH that is different from the others:
SBD. This stands for “STONITH Block Device”, also read as “Storage-Based Death”.
It’s also known as “poison pill”.
It works without a special hardware STONITH device. It just needs a small LUN (10 MB is enough) from a SAN that is visible on all nodes. The LUN must be shared block storage, not NAS storage!
That’s all.
The advantage: it is generic. It works the same way in all clusters, so there is no need to handle different hardware STONITH devices.

How SBD works
SBD consists of several components that work hand in hand:

SBD device: this is the shared LUN. It must be formatted with the SBD format from one node. It acts as a mailbox in which every node has a slot.
Watchdog kernel module loaded on every node: most mainboards have a hardware watchdog on board. For VMs without a hardware watchdog you can use the softdog module (attention: not supported on RHEL!). Once a watchdog module is loaded, there is a watchdog device /dev/watchdog.
Config file /etc/sysconfig/sbd on every node: when Pacemaker is started on a node, it looks for this config file. If it exists, the SBD daemon is started automatically and initialises the watchdog device. The SBD daemon is responsible for regularly resetting (“feeding”) the watchdog timer. If it stops doing so, the watchdog hard resets the system.
SBD cluster resource: this STONITH resource needs to run on one of the nodes.
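
Put together, the components could be set up roughly like this. It is a sketch for a SUSE-style cluster with crmsh: /dev/sbddevice is the placeholder device name used in this post, stonith-sbd is a freely chosen resource name, and softdog is only needed where no hardware watchdog exists. Other distributions may ship the SBD agent under a different name.

```shell
# 1. Watchdog: load softdog only if there is no hardware watchdog
modprobe softdog
ls -l /dev/watchdog

# 2. Minimal content of /etc/sysconfig/sbd on every node
SBD_DEVICE="/dev/sbddevice"
SBD_WATCHDOG_DEV="/dev/watchdog"

# 3. SBD cluster resource (agent name as shipped by SUSE)
crm configure primitive stonith-sbd stonith:external/sbd
```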

Possible problems and how SBD handles them
Communication problem on the corosync network: the node running the SBD resource writes a message to the other node’s slot in the SBD device: reset. The other node reads this and does what it’s told: it executes a hard reset.
Node down: again the surviving node sees no communication on the corosync network. It tries to send a STONITH message to the dead node via the SBD device. If the SBD resource was running on the dead node, the remaining node starts it first and then writes the STONITH message to the SBD device.
One node cannot access the SBD device: it commits suicide via a hard reset.
Both nodes cannot access the SBD device: depending on the configuration, either both nodes commit suicide, or they decide: we can still communicate on corosync, no need for a reset.
One node hangs: the SBD daemon stops resetting the watchdog timer, and the watchdog hard resets the node.
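
The mailbox can be inspected and tested by hand with the sbd command line tool. A short sketch, assuming the device /dev/sbddevice from this post and a hypothetical second node called node2:

```shell
# Show all node slots and the last message in each slot
sbd -d /dev/sbddevice list
# Write a harmless test message into node2's slot;
# the SBD daemon on node2 should log that it received it
sbd -d /dev/sbddevice message node2 test
# The real poison pill would be: sbd -d /dev/sbddevice message node2 reset
```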

Special considerations with SBD
With SBD STONITH there is no confirmation that a STONITH was executed successfully!
So the timeout settings for SBD are important!

Watchdog timeout: the time a node waits before it takes action (suicide) if it cannot access the SBD device. This must be higher than the longest possible timeout of the LUN, for example during a SAN switchover.
Messagewait timeout: the time a node waits after writing the reset message (the poison pill) to the other node’s slot in the SBD device. After this waiting time the remaining node considers the STONITH successful (without confirmation) and starts all resources. It must be higher than the watchdog timeout.
These two timeouts are set when the SBD device is formatted, with these parameters: -4 = msgwait, -1 = watchdog timeout.
Example: sbd -d /dev/sbddevice -4 20 -1 10 create

STONITH timeout: it must be higher than the messagewait timeout. It’s configured in the cluster properties.
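
The required ordering (watchdog timeout < messagewait timeout < STONITH timeout) is easy to get wrong, so it can be worth checking before formatting the device. The helper below is a hypothetical sketch, not part of sbd; it only compares three values given in seconds:

```shell
# check_sbd_timeouts WATCHDOG MSGWAIT STONITH (all in seconds)
# Hypothetical helper: prints "ok" or the rule that is violated.
check_sbd_timeouts() {
    watchdog=$1
    msgwait=$2
    stonith=$3
    if [ "$msgwait" -le "$watchdog" ]; then
        echo "msgwait ($msgwait) must be higher than watchdog ($watchdog)"
        return 1
    fi
    if [ "$stonith" -le "$msgwait" ]; then
        echo "stonith-timeout ($stonith) must be higher than msgwait ($msgwait)"
        return 1
    fi
    echo "ok"
}

# Values from the example above plus a STONITH timeout of 30 seconds
check_sbd_timeouts 10 20 30   # prints: ok
```

The values actually stored on a device can be read back with sbd -d /dev/sbddevice dump.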

Multiple SBD devices
It is possible to use more than one LUN as SBD device. Up to 3 devices are supported.

SBD with 2 devices: normally used if you have 2 SANs that cannot replicate between themselves. You can use host-based mirroring so that the node writes your data redundantly to both SANs. In this case it’s recommended to also use a LUN on each SAN as SBD device.
SBD with 3 devices: a node only commits suicide if it loses more than one device. This is mostly used, again, in host-based mirroring scenarios where a third SAN provides a third SBD device as a tie breaker.
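
With several devices, every device is passed to sbd with its own -d option, and the config file lists them separated by semicolons. A sketch with three hypothetical device names and the timeout values from the example above:

```shell
# Format all three devices in one go (destroys existing SBD data on them)
sbd -d /dev/sbddevice1 -d /dev/sbddevice2 -d /dev/sbddevice3 -4 20 -1 10 create

# Matching entry in /etc/sysconfig/sbd on every node
SBD_DEVICE="/dev/sbddevice1;/dev/sbddevice2;/dev/sbddevice3"
```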

Conclusion
For bare metal hosts you can use:

Hardware-based STONITH: a baseboard management controller in every node that can power cycle the server, or a network-manageable power switch that can cut power to the server.
Software-based STONITH: SBD

For VMs (or cloud cluster nodes) you can use:

STONITH by hypervisor: the node that wants to shut down another node tells the hypervisor on which that node is running to power cycle the VM. STONITH resource agents for this are available for several virtualisation solutions, such as KVM, Xen and VMware, and for public clouds. The disadvantage: the STONITH resource running in the cluster needs management access to the hypervisors.
Software-based STONITH: SBD. Note that most virtualisation solutions don’t emulate a watchdog device, so the VM lacks one. In that case you can use the softdog module. But attention: this is not supported on RHEL!

OK, now we have talked a lot about STONITH. Remember: a cluster without thoroughly tested STONITH is worthless!
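
One way to test is to fence a node on purpose and watch what happens. A sketch, assuming a hypothetical node name node2; only do this in a maintenance window, because the node really gets reset:

```shell
# With crmsh (SUSE):
crm node fence node2
# Or directly with the Pacemaker tool:
stonith_admin --reboot node2
# Afterwards, check the fencing history of that node
stonith_admin --history node2
```
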
In the next part of my blog we will talk about how to set up the cluster.
CU soon again here 🙂