STONITH: I maintain your servers' integrity. Part 1

Hello fellow administrators. Today we will be discussing STONITH.

STONITH stands for Shoot The Other Node In The Head. It is a service that helps maintain the integrity of the nodes in an HA (high availability) cluster.

What does STONITH actually do?

Say you have an HA cluster with a primary server and an HA (standby) server in it. If one of the servers is not working correctly, or a failover occurs, the attached HA server automatically comes up as the primary, and the faulty system is stopped and not allowed to start.

Fencing goes hand in hand with STONITH. Fencing is the method used to bring a cluster to a known state.

So what is done in fencing?

A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. Every resource in a cluster has a state attached. The cluster must make sure that every resource is started on only one node at a time (i.e. only the HA server or only the primary is active).

Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.

When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.
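
To make the idea concrete, here is a minimal sketch of that decision in Python. It is only a toy model, not how Pacemaker is implemented: the Cluster class, the NodeState values and the recover() method are names invented for this illustration. It shows that a resource is moved only after the old owner, whose state is uncertain, has been fenced.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"      # node reports its state normally
    UNKNOWN = "unknown"    # node stopped reporting; its state is uncertain
    FENCED = "fenced"      # node was fenced and is known to run nothing


@dataclass
class Cluster:
    # Hypothetical in-memory model; a real cluster manager tracks far more.
    nodes: dict = field(default_factory=dict)       # node name -> NodeState
    resources: dict = field(default_factory=dict)   # resource name -> owner node (or None)

    def fence(self, node):
        """Model fencing: afterwards the node is known to run nothing."""
        self.nodes[node] = NodeState.FENCED
        for res, owner in list(self.resources.items()):
            if owner == node:
                self.resources[res] = None

    def recover(self, resource, target):
        """Start the resource on target, fencing the old owner first if its state is uncertain."""
        owner = self.resources.get(resource)
        if owner is not None and self.nodes[owner] is NodeState.UNKNOWN:
            self.fence(owner)
        # Only start the resource when nobody else can still be running it.
        if self.resources.get(resource) is None and self.nodes[target] is NodeState.ONLINE:
            self.resources[resource] = target
        return self.resources.get(resource)


cluster = Cluster(
    nodes={"node1": NodeState.UNKNOWN, "node2": NodeState.ONLINE},
    resources={"db": "node1"},
)
print(cluster.recover("db", "node2"), cluster.nodes["node1"].value)
```

Note that when the old owner is still online and healthy, recover() leaves the resource where it is, so the resource never ends up on two nodes.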

Fencing is of two types: resource-level fencing and node-level fencing. STONITH is a node-level fencing mechanism.

List the available STONITH devices: stonith -L or crm ra list stonith

Check the status of the nodes: crm status

Let's understand why we said that it maintains integrity:

STONITH protects the cluster from the "split-brain scenario", which can result in bad things happening to the cluster resources. Imagine, for example, a database that is started twice in the cluster, or a file system that is written to by two independent nodes at the same time. So, having a split brain in the cluster is bad, and the only way to ensure that no such scenario can occur in the cluster is by using the STONITH approach.

What actually happens?

Cluster resources are no longer in sync, and each node in the cluster believes it is the only active node.

To avoid this, we can configure SBD (STONITH Block Device) as the node fencing mechanism to shut down a node in case of a split-brain scenario. SBD provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage.
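
The message exchange can be pictured with a small Python sketch. This is a simplified model and not the real on-disk format: SharedDisk, Node and the method names are made up for the illustration, and only the "clear" and "reset" messages are modelled. Each node owns one slot on the shared storage, a peer leaves a message in that slot, and the target node acts on it the next time it reads its slot.

```python
class SharedDisk:
    """Stand-in for the SBD partition: one message slot per node."""

    def __init__(self, node_names):
        self.slots = {name: "clear" for name in node_names}

    def write(self, node, message):
        self.slots[node] = message

    def read(self, node):
        return self.slots[node]


class Node:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.running = True

    def send_fence_request(self, target):
        """Ask another node to stop by leaving a message in its slot."""
        self.disk.write(target, "reset")

    def poll_slot(self):
        """What the node's daemon does periodically: read its own slot and obey it."""
        if self.running and self.disk.read(self.name) == "reset":
            self.running = False   # a real node would reboot here
        return self.running


disk = SharedDisk(["node1", "node2"])
node1, node2 = Node("node1", disk), Node("node2", disk)
node1.send_fence_request("node2")
print(node1.poll_slot(), node2.poll_slot())
```

The point of the design is that the two nodes never need a working network link between them: the shared disk is the only channel, so fencing still works when the cluster network is the thing that failed.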

What does STONITH do in a split-brain scenario?

STONITH (Shoot The Other Node In The Head) is basically a fencing mechanism that powers down the selected server remotely, removing it from the cluster and allowing the other nodes in the cluster to take over.
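
Here is a rough Python sketch of the difference STONITH makes during a split brain. It is a toy model under a loud assumption: the node that reacts first wins the fencing race, and resolve_split_brain() is a hypothetical helper, not a Pacemaker function.

```python
def resolve_split_brain(nodes, stonith_enabled):
    """Return the set of nodes that end up running the resource.

    `nodes` lists the nodes that lost contact with each other, in the
    order in which they react (assumption: first to react wins the race).
    """
    powered_on = set(nodes)
    active = set()
    for node in nodes:
        if node not in powered_on:
            continue                      # already shot by a faster peer
        if stonith_enabled:
            # Power off every peer before taking over the resource.
            for peer in nodes:
                if peer != node:
                    powered_on.discard(peer)
        active.add(node)                  # node starts the resource
    return active


print(sorted(resolve_split_brain(["node1", "node2"], stonith_enabled=False)))
print(sorted(resolve_split_brain(["node1", "node2"], stonith_enabled=True)))
```

Without STONITH both nodes start the resource, which is exactly the double-database situation described earlier. With STONITH only the node that fenced its peer keeps running it.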

Different STONITH approaches

  • Disk-based STONITH: external/sbd (On Premise – Best Practice)
  • Hardware-based STONITH: external/ipmi (On Premise – Second Choice)
  • GCP STONITH: external/gcpstonith (Google Cloud)

SBD Mechanism

SBD needs a disk that is shared between all nodes in the cluster. On this shared disk, we create a small partition that is used for SBD. The size of the partition depends on the block size of the disk.

The SBD daemon, which runs on all nodes in the cluster, monitors the shared storage. When the SBD daemon loses access to the storage device, it terminates itself. Increased protection is offered through a watchdog: the daemon continuously sends it a service pulse, and if the daemon stops feeding the watchdog, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as dying or becoming stuck on an I/O error. In this way, the Pacemaker configuration ensures a safe transition of resources in the cluster when a node is down.
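
The watchdog behaviour described above can be sketched in Python with a simulated clock. This is an illustration only: the Watchdog class, run_sbd_daemon() and the timeout of 5 time units are assumptions made for the example, not real SBD settings.

```python
class Watchdog:
    """Stand-in for a hardware watchdog timer, driven by a simulated clock."""

    def __init__(self, timeout):
        self.timeout = timeout      # time units allowed between two feeds
        self.last_fed = 0
        self.node_reset = False

    def feed(self, now):
        self.last_fed = now

    def tick(self, now):
        """Called by the 'hardware' every time unit; resets the node on expiry."""
        if now - self.last_fed > self.timeout:
            self.node_reset = True
        return self.node_reset


def run_sbd_daemon(disk_ok_at, watchdog, steps):
    """Simulate `steps` time units; disk_ok_at(t) tells whether the disk is reachable at time t.

    Returns the time at which the watchdog reset the node, or None.
    """
    for now in range(1, steps + 1):
        if disk_ok_at(now):
            watchdog.feed(now)      # healthy: keep sending the service pulse
        # If the disk is gone (or the daemon hangs), nobody feeds the watchdog.
        if watchdog.tick(now):
            return now
    return None


print(run_sbd_daemon(lambda t: True, Watchdog(timeout=5), steps=20))
print(run_sbd_daemon(lambda t: t < 4, Watchdog(timeout=5), steps=20))
```

In the second call the disk disappears at time 4, so the last pulse is sent at time 3 and the simulated hardware resets the node once more than 5 time units have passed without a feed.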

SBD STONITH is a simple but effective way to ensure the integrity of data and other nodes in a Linux cluster.

I am still learning this topic. If you find any issue or have any question, please drop it in the comments.

Blog By Dennis