Node health (hardware and software) is critical in a cluster environment. Whether you run many web servers behind a load balancer or compute nodes in a High Performance Computing (HPC) setup, every node that takes on work must be healthy enough to run the processes it is given. I created this script to check a node's health before it is used as a compute node or added to a load balancer. If all checks pass, the node can be marked good and made available for production.

1. License

GPL. You can add, edit, or delete parts of the script to fit your needs. Please share your edits so the script can be improved.

2. Functions

The script is built from functions. Every check is a separate function, so whether a check runs depends only on whether its function is called. To enable or disable a check, add or remove its call in the Main Section at the end of the script. You can also add, edit, or delete functions to adapt the script further.
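A variation on the same idea, shown here only as a sketch: keep the enabled checks in one Bash array in the Main Section and loop over it, so turning a check on or off means editing a single line. The `CHECKS` array is a hypothetical name, and the two small functions are simplified stand-ins for the script's own `loadavg` and `cpu` functions.

```shell
#!/bin/bash
# Sketch: run only the checks listed in CHECKS (a hypothetical array name).
# The two functions are simplified stand-ins for the script's own checks.
Report=$(mktemp)

loadavg() { echo "LoadAvg = $(cut -d ' ' -f 1 /proc/loadavg)" >> "$Report"; }
cpu()     { echo "Processors = $(grep -c '^processor' /proc/cpuinfo)" >> "$Report"; }

# Remove or comment out an entry to skip that check.
CHECKS=(loadavg cpu)

for check in "${CHECKS[@]}"; do
    # declare -F succeeds only if a shell function with this name exists.
    if declare -F "$check" > /dev/null; then
        "$check"
    else
        echo "WARNING = no function named $check" >> "$Report"
    fi
done

cat "$Report"
```

A side benefit of the `declare -F` test is that a typo in the list shows up as a line in the report instead of a "command not found" error.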

3. Checks

The following checks are included, chosen for my own needs:

  • Loadavg
  • Memory / Swap
  • CPU
  • Ethernet Speed
  • Infiniband
  • NFS Mounts
  • NIS Server
  • MCELog (Machine Check Event, hardware error reporting)

You can write your own checks and add them to the script for more thorough checking.
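For example, a check for root filesystem usage could follow the same pattern as the existing functions: compute a value, then append a `Key = Value` line to `$Report`. The function name `rootfs` and the `RootFSUsed` key are hypothetical choices for illustration, not part of the original script.

```shell
#!/bin/bash
# Hypothetical extra check: percentage of the root filesystem in use.
# Follows the script's convention of appending "Key = Value" to $Report.
rootfs() {
    # df -P prints POSIX output (one line per filesystem);
    # line 2, column 5 is the "Capacity" column, e.g. 42%.
    ROOT_USED=$(df -P / | awk 'NR == 2 {print $5}')
    echo "RootFSUsed = $ROOT_USED" >> "$Report"
}

# Stand-alone demo. Inside nodecheck.sh, $Report is already set in the
# Main Section, so you would only add "rootfs" to the list of calls.
Report=$(mktemp)
rootfs
cat "$Report"
```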

4. Script

Here it is

#!/bin/bash
#
# Script: Node Health Script 
# Created By: Sohail Riaz (sohaileo@gmail.com) www.sohailriaz.com
# Created On: 9th April 2013
# Detail: To Check Single Node Health. 
# HowTo Run: chmod +x nodecheck.sh; ./nodecheck.sh
# Checks: loadavg, memory, cpu, ethernet, infiniband, infiniband ipath test, nfs mounts, nis, mcelog
# Functions: Call (or comment out) any check function in the Main Section to run (or skip) it.
# License: GPL
#

## LoadAvg Function
loadavg() {
LoadAvg=$(cut -d ' ' -f 1 /proc/loadavg)   # 1-minute load average
echo "LoadAvg = $LoadAvg" >> $Report
}

## Memory Function
memory() {
# /proc/meminfo reports kB; convert to GB and round to a whole number
TOTAL_MEM=$(awk '/^MemTotal:/ {printf "%.0f", $2/1024/1024}' /proc/meminfo)
FREE_MEM=$(awk '/^MemFree:/ {printf "%.0f", $2/1024/1024}' /proc/meminfo)
TOTAL_SWAP=$(awk '/^SwapTotal:/ {printf "%.0f", $2/1024/1024}' /proc/meminfo)
FREE_SWAP=$(awk '/^SwapFree:/ {printf "%.0f", $2/1024/1024}' /proc/meminfo)

echo "TotalMemory = $TOTAL_MEM GB ($FREE_MEM GB Free)" >> $Report
echo "TotalSwap = $TOTAL_SWAP GB ($FREE_SWAP GB Free)" >> $Report
}

## CPU Function
cpu() {
PROCESSOR=$(grep -c '^processor' /proc/cpuinfo)
# Full model string from the first CPU entry, without the "model name :" prefix
CPU_MODEL=$(grep -m 1 'model name' /proc/cpuinfo | cut -d ':' -f 2 | sed 's/^ *//')

echo "Processors = $PROCESSOR" >> $Report
echo "ProcessorModel = $CPU_MODEL" >> $Report
}

## Ethernet Function (eth1 for me; set ETHER_IF to your interface)
ethernet() {
ETHER_IF=eth1
ETHER_SPEED=$(ethtool $ETHER_IF | grep "Speed:" | awk '{print $2}')

echo "EthernetSpeed = $ETHER_SPEED" >> $Report
}

## IB Function
ib() {
IB_STATE=`cat /sys/class/infiniband/*/ports/1/state | awk -F ":" '{print $2}'`
IB_PHYS_STATE=`cat /sys/class/infiniband/*/ports/1/phys_state | awk -F ":" '{print $2}'`
IB_RATE=`cat /sys/class/infiniband/*/ports/1/rate`

echo "IB_STATE = $IB_STATE" >> $Report
echo "IBLink = $IB_PHYS_STATE" >> $Report
echo "IBRate = $IB_RATE" >> $Report
}

## IB Test Function
ibtest() {
IB_TEST=`ipath_pkt_test -B | awk -F ":" '{print $2}'`
echo "IPathTest = $IB_TEST" >> $Report
}

## NFS Mounts Function
nfs() {
NFS_MOUNTS=$(mount -t nfs,nfs4,panfs,gpfs | wc -l)   # NFSv4 mounts are listed as type nfs4

echo "NFS_MOUNTS = $NFS_MOUNTS" >> $Report
}

## NIS Function
nis() {
NIS_TEST=$(ypwhich 2>&1)   # keep the error message in the report if NIS is not bound

echo "NIS_SERVER = $NIS_TEST" >> $Report
}

## MCELog Test Function
mcelog() {
if [ -s /var/log/mcelog ]; then MCELog="Check MCELog"; else MCELog="No MCELog"; fi

echo "MCE Log = $MCELog" >> $Report
}

### MAIN SCRIPT

## Get Node Name
Hostname=$(hostname -s)
Report=./$Hostname-checks.txt
# The first write uses ">" so each run starts a fresh report
echo " " > $Report
echo "Node = ${Hostname}" >> $Report
echo "----------------" >> $Report

## Get Cluster Name (first 4 characters of the hostname; not used by the checks below)
Cluster=`echo $Hostname | cut -c1-4`

## Call Functions (comment out any check you don't need)
loadavg
memory
cpu
ethernet
ib
ibtest
nfs
nis
mcelog

## Generate Report
echo " " >> $Report
cat $Report
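The script as written only reports values; deciding whether the node is "good" is left to whoever reads the report. One way to automate that, sketched here under assumed rules, is a `verdict` function (a hypothetical name) that reads the variables the existing functions already set (`LoadAvg`, `PROCESSOR`, `MCELog`). The thresholds are examples only; adjust them to your own environment.

```shell
#!/bin/bash
# Hypothetical pass/fail step; the rules are examples, not recommendations.
# It reads LoadAvg, PROCESSOR and MCELog, which nodecheck.sh's functions set.
verdict() {
    local status="GOOD" reasons=""

    # Rule 1: the 1-minute load average should be below the processor count.
    # awk does the comparison because Bash arithmetic is integer-only.
    if ! awk -v l="$LoadAvg" -v p="$PROCESSOR" 'BEGIN { exit !(l + 0 < p + 0) }'; then
        status="BAD"
        reasons+=" load"
    fi

    # Rule 2: any content in /var/log/mcelog flags the node for inspection.
    if [ "$MCELog" != "No MCELog" ]; then
        status="BAD"
        reasons+=" mcelog"
    fi

    echo "NodeStatus = $status${reasons:+ (failed:$reasons)}"
}

# Demo with made-up values. Inside nodecheck.sh you would instead add
# "verdict >> $Report" after the other checks in the Main Section.
LoadAvg=" 0.35"; PROCESSOR=8; MCELog="No MCELog"
verdict    # NodeStatus = GOOD
LoadAvg=" 12.10"; PROCESSOR=8; MCELog="Check MCELog"
verdict    # NodeStatus = BAD (failed: load mcelog)
```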

Or download it from http://www.sohailriaz.com/downloads/nodecheck.sh.

To run the script, use the following commands:

chmod +x nodecheck.sh
./nodecheck.sh
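Each run leaves a `<hostname>-checks.txt` file in the current directory, with one `Key = Value` line per check. If you gather the reports from several nodes into one directory, a small helper can pull a single key out of all of them for comparison. `report_key` is a hypothetical helper, and the demo reports are made-up samples.

```shell
#!/bin/bash
# Hypothetical helper: print one key from every *-checks.txt report in a directory.
report_key() {
    local dir=$1 key=$2
    # Reports hold "Key = Value" lines, so split fields on " = ".
    awk -F ' = ' -v k="$key" '
        $1 == k {
            node = FILENAME
            sub(/.*\//, "", node)            # drop the directory part
            sub(/-checks\.txt$/, "", node)   # drop the file suffix
            print node ": " $2
        }' "$dir"/*-checks.txt
}

# Demo with two made-up reports.
demo=$(mktemp -d)
printf ' \nNode = node01\n----------------\nLoadAvg = 0.35\n' > "$demo/node01-checks.txt"
printf ' \nNode = node02\n----------------\nLoadAvg = 7.80\n' > "$demo/node02-checks.txt"
report_key "$demo" LoadAvg
```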

5. Enhancement

You can add, edit, or delete any function in the script to meet your needs, but please share your edits so we can extend the script with more checks.

By Sohail Riaz

I am the first Red Hat Certified Architect (RHCA, ID # 110-082-666) from Pakistan, with over 14 years of industry experience in several disciplines, including Linux/UNIX system administration, virtualization, networking, storage, load balancers, HA clusters, and High Performance Computing.
