STONITH: I maintain your servers' integrity. Part 1

Hello fellow administrators, today we will be discussing STONITH.

STONITH stands for Shoot The Other Node In The Head. It is a service that helps maintain the integrity of nodes in an HA (high availability) cluster.

What does STONITH actually do?

Suppose you have an HA cluster with a primary server and an HA server in it. If one of the servers is not working correctly, or in a failover scenario, the attached HA server automatically comes up as the primary, and the faulty system is stopped and not allowed to start.

Fencing goes hand in hand with STONITH. Fencing is the method of bringing a cluster to a known state.

So what is done in fencing?

A cluster sometimes detects that one of the nodes is behaving strangely and needs to remove it. Every resource in a cluster has a state attached. The cluster must make sure that every resource is started on only one node at a time (i.e. only the HA server or only the primary is active).

Every node must report every change that happens to a resource. The cluster state is thus a collection of resource states and node states.

When the state of a node or resource cannot be established with certainty, fencing comes in. Even when the cluster is not aware of what is happening on a given node, fencing can ensure that the node does not run any important resources.
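
The idea above can be pictured with a minimal sketch. It is only an illustration under assumptions: the state labels, the `Cluster` class and its methods are hypothetical names made up for this example and are not part of Pacemaker or any cluster API.

```python
from dataclasses import dataclass, field

# Hypothetical state labels for this sketch; real cluster managers use their own.
ONLINE, OFFLINE, UNKNOWN = "online", "offline", "unknown"


@dataclass
class Cluster:
    # node name -> last known node state
    nodes: dict = field(default_factory=dict)
    # resource name -> node it was last reported on (None means stopped)
    resources: dict = field(default_factory=dict)

    def nodes_to_fence(self):
        """Nodes whose state cannot be established with certainty."""
        return sorted(n for n, s in self.nodes.items() if s == UNKNOWN)

    def fence(self, node):
        """After fencing, the node is known to be down and runs nothing."""
        self.nodes[node] = OFFLINE
        for res, owner in list(self.resources.items()):
            if owner == node:
                self.resources[res] = None


cluster = Cluster(
    nodes={"node1": ONLINE, "node2": UNKNOWN},
    resources={"database": "node2", "virtual_ip": "node1"},
)
for node in cluster.nodes_to_fence():
    cluster.fence(node)

print(cluster.nodes)
print(cluster.resources)
```

After fencing, the cluster state is known again: node2 is definitely down, and the database is definitely not running anywhere, so it can be started safely on node1.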

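Fencing is of two types: resource-level fencing, which ensures that a node cannot access one or more resources, and node-level fencing, which ensures that a node does not run any resources at all. STONITH is node-level fencing.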

List the available STONITH device types: stonith -L or crm ra list stonith

Check the status of the nodes: crm status

Let's understand why we said that it maintains integrity:

When the nodes of a cluster can no longer communicate with each other, we get what is called a “split brain scenario”, and this may result in bad things happening to the cluster resources. Imagine, for example, a database that starts running twice in the cluster, or a file system that is written to by two independent nodes at the same time. So having a split brain in the cluster is bad, and the only way to ensure that no such scenario can occur in the cluster is by using the STONITH approach.

What actually happens?

Cluster resources are not in sync, and each node in the cluster believes it is the only active node.
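
A small sketch of that situation, assuming a two-node cluster without fencing. The node names, the `peers_visible` field and the `resources_started_by` helper are hypothetical and exist only for this illustration.

```python
from collections import Counter

RESOURCES = ["database", "filesystem"]


def resources_started_by(node_view):
    """A node that sees no peers assumes it is alone and starts everything."""
    return list(RESOURCES) if not node_view["peers_visible"] else []


# The link between the two nodes is broken: neither node sees the other.
views = {
    "node1": {"peers_visible": []},
    "node2": {"peers_visible": []},
}

running = Counter()
for node, view in views.items():
    for res in resources_started_by(view):
        running[res] += 1

# Any resource counted more than once is active on two nodes at the same time.
conflicts = sorted(res for res, count in running.items() if count > 1)
print(conflicts)
```

Both resources end up in the conflict list, which is exactly the "database running twice" problem described above.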

To avoid this, we can configure SBD (STONITH Block Device) as the node fencing mechanism to shut down a node in case of a split-brain scenario. SBD provides a node fencing mechanism for Pacemaker-based clusters through the exchange of messages via shared block storage.

What does STONITH do in a split-brain scenario?

STONITH (Shoot The Other Node In The Head) is basically a fencing mechanism which powers down the selected server remotely, removing it from the cluster and allowing the other nodes in the cluster to take over.
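
The message exchange over shared storage can be pictured roughly like this. It is a simplified model, not the real on-disk format: the `SharedDisk` class, the `sbd_daemon_step` function and the returned strings are hypothetical names for this sketch, and only two message values ("clear" and "reset") are modelled.

```python
class SharedDisk:
    """Stand-in for the shared block device: one message slot per node."""

    def __init__(self, node_names):
        self.slots = {name: "clear" for name in node_names}

    def write(self, target, message):
        # Any node that can reach the disk can leave a message for a peer.
        self.slots[target] = message

    def read(self, node):
        return self.slots[node]


def sbd_daemon_step(node, disk):
    """One polling step of the per-node daemon: act on the message in our slot."""
    if disk.read(node) == "reset":
        return "self-fence"
    return "keep-running"


disk = SharedDisk(["node1", "node2"])
disk.write("node2", "reset")  # node1 asks for node2 to be fenced

print(sbd_daemon_step("node1", disk))
print(sbd_daemon_step("node2", disk))
```

The point of the design is that the fencing request does not need the network between the nodes at all. It only needs both nodes to reach the same disk.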

Different STONITH approaches

Disk-based STONITH: external/sbd (On Premise – Best Practice)
Hardware-based STONITH: external/ipmi (On Premise – Second Choice)
GCP STONITH: external/gcpstonith (Google Cloud)

SBD Mechanism

SBD needs a shared disk that every node in the cluster can access. On this shared disk, we create a small partition that is used for SBD. The size of the partition depends on the block size of the disk used.

The SBD daemon, which runs on all nodes in the cluster, monitors the shared storage. When the SBD daemon loses access to the storage device, it terminates itself. Increased protection is offered through a watchdog: the daemon continuously feeds the watchdog with a service pulse, and if the daemon stops feeding it, the hardware enforces a system restart. This protects against failures of the SBD process itself, such as dying or becoming stuck on an I/O error. In this way, the Pacemaker configuration ensures a safe transition of resources in the cluster when a node is down.
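
The watchdog behaviour can be simulated in a few lines. This is only a sketch under assumptions: time is counted in abstract ticks, and the `Watchdog` class, the `run_node` function and the timeout of 3 ticks are made up for the illustration and do not reflect real SBD timeouts.

```python
class Watchdog:
    """Resets the node unless it is fed within timeout_ticks."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_fed = 0

    def feed(self):
        self.ticks_since_fed = 0

    def tick(self):
        """Advance time by one tick; return True if the node would be reset."""
        self.ticks_since_fed += 1
        return self.ticks_since_fed >= self.timeout_ticks


def run_node(disk_ok_per_tick, timeout_ticks=3):
    """Return the tick at which the node is reset, or None if it survives."""
    watchdog = Watchdog(timeout_ticks)
    for tick, disk_ok in enumerate(disk_ok_per_tick, start=1):
        if disk_ok:
            watchdog.feed()  # the daemon is healthy and sends its pulse
        if watchdog.tick():
            return tick      # nobody fed the watchdog in time
    return None


healthy = run_node([True, True, True, True])
disk_lost = run_node([True, True, False, False, False])
print(healthy, disk_lost)
```

As long as the disk is readable, the node keeps running. Once the pulses stop, the reset happens on its own after the timeout, even if the daemon itself is dead or stuck.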

SBD STONITH is a simple but effective way to ensure the integrity of data and other nodes in a Linux cluster.

I am still learning this topic. If you find any issue or have any question, please drop it in the comments.

