Skip to main content

STONITH : I maintain your servers integrity. Part 1

Hello Fellow Administrators , Today we will be discussing about STONITH 


Shoot The Other Node In The Head it is nothing but a service that helps in maintaining the integrity of nodes in a HA Cluster.

What does STONITH actually does ?

So you have a HA cluster , with Primary and HA server listed in it. In the scenario where one of the server is not working correctly or failover scenario the attached HA server will automatically come up as the primary or the fault system will be stopped and not allowed to start.

Fencing goes side ways with STONITH. Fencing is the method to bring a cluster to a known state.

So what is done in fencing ?

A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. Every resource in a cluster has a state attached . The cluster must make sure that every resource may be started on only one node (i.e. only HA or only Primary is active )

Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.

When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.

Fencing is of two types :-


List of STONITH devices : stonith -L or crm ra list stonith

Check status of nodes : crm status

Let's understand why we said that it maintains the integrity :

“Split brain scenario”, and this may result in bad things happening to the cluster resources. Imagine, for example, a database that starts running twice in the cluster or a file system that starts to be written between two independent nodes. So, having a split brain in the cluster is bad, and the only way to ensure that no such scenario can occur in the cluster is by using the STONITH approach.

What happens actually ?

Cluster resources are not in sync and each node in the cluster believes it is the only active cluster. 

To avoid this, we can configure Split Brain Detection (SBD) as node fencing mechanism to shut down the device in case of split-brain scenario. SBD provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage.

What does STONITH do in case of Split brain scenario ?

STONITH (Shoot the Other Node in The Head), is a basically a fencing mechanism which powers down the selected server remotely, removing it from cluster and allowing other nodes in the cluster to take over. 

Different STONITH approaches

  • Disk-based STONITH: external/sbd (On Premise – Best Practice)
  • Hardware-based STONITH: external/ipmi (On Premise – Second Choice)
  • GCP STONITH: external/gcpstonith (Google Cloud)

Split brain Detection Mechanism

On this shared disk, we create a small partition that is used for SBD. The size of the partition depends on the block size of the used disk.

SBD daemon which runs on all nodes in the cluster, will monitor the shared storage. When SBD daemon loses access to storage devices, it terminates itself in case the disk become unreachable. Increased protection is offered through watchdog, where daemon continuously writes a service pulse – if the daemon stops feeding the watchdog, the hardware will enforce a system restart. This protects against failures of the SBD process itself, such as dying, or becoming stuck on an IO error. So, the pacemaker software configuration ensures a safe transition of resources in the cluster in case when node is down.

SBD STONITH is a simple but effective way to ensure the integrity of data and other nodes in a Linux cluster.


I am learning on this topic if you find any issue or you have any question please drop about it in comment.

References :-

Basic Intro 

Configuring Cluster

Stonith Linux Commands

Blog By Dennis


Comments

You might find these interesting

How to properly Start/Stop SAP system through command line ?

Starting/stopping an SAP system is not a critical task, but the method that most of us follow to achieve this is sometimes wrong. A common mistake that most of the SAP admins do is, making use of the 'startsap' and 'stopsap' commands for starting/stopping the system.  These commands got deprecated in 2015 because the scripts were not being maintained anymore and SAP recommends not to use them as many people have faced errors while executing those scripts. For more info and the bugs in scripts, you can check the sap note 809477.  These scripts are not available in kernel version 7.73 and later. So if these are not the correct commands, then how to start/stop the sap system?  In this post, we will see how to do it in the correct way. SAP SYSTEM VS INSTANCE In SAP, an instance is a group of resources such as memory, work processes and so on, usually in support of a single application server or database server with

sapstartsrv is not started or sapcontrol is not working

 What is sapstartsrv ? The SAP start service runs on every computer where an instance of an SAP system is started. It is implemented as a service on Windows, and as a daemon on UNIX. The process is called  sapstartsrv.exe   on Windows, and   sapstartsrv   on UNIX platforms. The SAP start service provides the following functions for monitoring SAP systems, instances, and processes. Starting and stopping Monitoring the runtime state Reading logs, traces, and configuration files Technical information, such as network ports, active sessions, thread lists, etc. These services are provided on SAPControl SOAP Web Service, and used by SAP monitoring tools (SAP Management Console,  SAP NetWeaver  Administrator, etc.). For more understanding use this link : https://help.sap.com/doc/saphelp_nw73ehp1/7.31.19/enUS/b3/903925c34a45e28a2861b59c3c5623/content.htm?no_cache=true How to check if it is working or not ? In case of linux , you can simply ps -ef | grep sapstartsrv In case of windows, you need

HANA System Replication - Prerequisites & Setup

Hey Folks! Welcome back to Hana high availability blog series. In our last blog we checked out operation & replication modes in hana system replication. If you haven't gone though that blog, you can checkout  this link In this blog we will be talking about the prerequisites of hana replication and it's setup. So let's get started. When we plan to setup hana system replication, we need to make sure that all prerequisite steps have been followed. Let's have a look at these prerequisites. HANA System Replication Prerequisites: Primary & secondary systems should be up & running HDB version of secondary should be greater than or equal to Primary database sever But, for Active/Active(read enabled config), HDB version should be same on both sites. System configuration/ini files should be identical on both sides Replication happe

ST03N : The chapter for all BASIS Admins

This blog is targeted to BASIS ADMINS Transaction for workload analysis statistical data changed over time are monitored using transaction code ST03 , now ST03N (from SAP R/3 4.6C) . With SAP Web AS 6.4 the transaction ST03 is available again. From time to time ST03 and ST03N has seen many changes but later in SAP NW7.0 ST03N has reworked in detail specially processing time is now shown in separate column. Main Use of ST03N  is to get detailed information on performance of any ABAP based SAP system. Workload monitor analyzes the statistical data originally collected by kernel. You can compare or analyze the performance of a single application server or multiple application server. Using this you start checking from the entire system and finding your way to that one application server and narrowing down to exact issue. By Default :- You see data of current day as default view , you can change the default view. Source of the image : sap-perf.ca Let's discuss the WORKLOAD MONITOR By D

HANA hdbuserstore

The hdbuserstore (hana secure user store) is a tool which comes as an executable with the SAP Hana Client package. This secure user store allows you to store SAP HANA connection information, including user passwords, securely on clients. With the help of secure store, the client applications can connect to SAP HANA without the user having to enter host name or logon credentials. You can also use the secure store to configure failover support for application servers in a 3-tier scenario (for example, SAP Business Warehouse) by storing a list of all the hosts that the application server can connect to. To access the system using secure store, there are two connect options: (1)key and (2)virtualHostName. key is the hdbuserstore key that you use to connect to SAP HANA, while virtualHostName specifies the virtual host name. This option allows you to change where the hdbuserstore searches for the data and key files. Note

Unlock the Power of SAP HANA Performance Optimization: A Comprehensive Guide for SAP Basis Administrators [Part 2 : Deep Dive ]

In SAP systems, the performance of the underlying database plays a crucial role in delivering a seamless user experience. SAP HANA, being an in-memory database, offers exceptional speed and agility. However, it's essential to monitor and troubleshoot potential performance bottlenecks to ensure optimal system performance. In this blog post, we will explore key areas to focus on when monitoring the SAP HANA database and provide examples of how to identify and address common performance issues. Threads : Threads in the SAP HANA database represent individual tasks that execute concurrently. Monitoring thread utilization helps identify any thread-related issues, such as high thread utilization or thread exhaustion. You can use the M_SERVICE_THREADS table to gather information about active threads and their status. For example, a high number of waiting threads may indicate resource contention or blocking situations that require investigation and optimization To retrieve  thread based on

Work Process and Memory Management in SAP

Let’s talk about the entire concepts that are related to memory when we talk about SAP Application. Starting with few basic terminologies, Local Memory :  Local process memory, the operating system keeps the two allocation steps transparent. The operating system does the other tasks, such as reserving physical memory, loading and unloading virtual memory into and out of the main memory. Shared Memory :  If several processes are to access the same memory area, the two allocation steps are not transparent. One object is created that represents the physical memory and can be used by various processes. The processes can map the object fully or partially into the address space. The way this is done varies from platform to platform. Memory mapped files, unnamed mapped files, and shared memory are used.  Extended Memory : SAP extended memory is the core of the SAP memory management system. Each SAP work process has a part reserved in its virtual address space for extended memory. You can set

How to resolve Common Error : Standard Template "sap_sm.xls" missing

Hey everyone, putting forward a common error we usually face when we have “ Excel inplace” functionality enabled in our SAP system. This error occurs when validity of the signature of SAP standard templates expired or were incorrectly delivered via support packages. We can reproduce the error by doing as below.. Click on “spreadsheet” icon after any SAP ALV grid view of data is on screen to make this data to export into excel directly from SAP.

ABAP Dumps Analysis

Ever now and then have you heard about ABAP Dumps, We also have a joke everything in temporary in life except ABAP dumps for SAP Consultants. Lets try to understand ABAP dumps from perspective of a SAP BASIS Consultant. Dumps happen when an ABAP program runs and something goes wrong that cannot be handled by the program We have two broad categories of Dumps , In custom program Dumps and SAP provided program Dumps. Dumps that happen in the customer namespace ranges (i.e. own-developed code) or known as Custom Program , can usually be fixed by the ABAP programmer of your team. Dumps that happen in SAP standard code probably need a fix from SAP. You do not have to be an "ABAPer" in order to resolve ABAP dump issues. The common way to deal with them is to look up in ST22 How to correct the error ? Hints are given for the keywords that may be used to search on the note system. Gather Information about the issue  Go to System > Status and Check the Basis SP level as well as info

SAP application log tables: BALHDR (Application Log: Header Data) and BALDAT (Application Log: Detail Data)

  BALHDR (Application Log: Header Data): Usage : The BALHDR table stores the header information for application logs. It serves as a central repository for managing and organizing log entries. Example Data Stored: The table may contain entries for various system activities, such as error messages, warnings, or information logs generated during SAP transactions or custom programs. Columns Involved: LOGNUMBER: Unique log number assigned to each log entry. OBJECT: Identifies the object associated with the log entry (e.g., a program, transaction, or process). SUBOBJECT: Further categorizes the object. USERNAME: User ID of the person who created the log entry. TIME: Date and time when the log entry was created. ADD_OBJECT: Additional information or details related to the log entry. BALDAT (Application Log: Detail Data): Usage : The BALDAT table contains the detailed data for each log entry, linked to the corresponding entry in the BALHDR table. It stores the specific log details and me