How to set up a shell script to monitor load averages

Here is a handy shell script that sends email notifications when a server is under high load. If no other monitoring solution is in place, the script can be scheduled as a cron job so that it runs at a set interval.

Before we look at the complete script at the end of this post, let’s run through each part to see exactly what it does.

What does it do? Where? Why?

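#! /bin/sh is the shebang, which points to the interpreter that runs the script. Note that /bin/sh is the system's POSIX shell, which is not necessarily Bash (on Debian and Ubuntu, for example, it is dash), so the script should either stick to POSIX syntax or use #! /bin/bash instead.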

#! /bin/sh

The variable NOTIFY is set to the load threshold; an email notification is sent when the load average rises above it.

NOTIFY="1"

TRUE is set to 1. Why? At the end of the script we compare the result of the load check against this value to decide whether to send a notification or to take no action.

TRUE="1"

In this example I have set the notification trigger to 1 because the server it runs on has a single-core CPU. On a single-core CPU a load average of 1 means that the CPU is running at exactly full capacity. Any load above 1 means that processes are queueing for CPU time, which will affect the performance of the service. On a quad-core CPU a load average of 4 means 100% capacity, and so on.
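
Rather than hard-coding the threshold, it could be derived from the number of CPU cores. The following is a minimal sketch, assuming that getconf _NPROCESSORS_ONLN is available (it is on glibc-based Linux and on macOS, but POSIX does not require it). CORES is a name introduced here for illustration and is not part of the original script.

```shell
#!/bin/sh
# Sketch: set the threshold to the number of online CPU cores.
# Assumption: getconf _NPROCESSORS_ONLN exists on this system.
CORES="$(getconf _NPROCESSORS_ONLN 2>/dev/null)"
case "$CORES" in
    ''|*[!0-9]*) CORES=1 ;;   # empty or non-numeric answer: assume a single core
esac
NOTIFY="$CORES"
echo "Cores: $CORES, notification threshold: $NOTIFY"
```

Whether one alert per fully loaded core is the right threshold depends on the workload, so treat the result as a starting point.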

Next, the variable EMAIL is set to the recipient address. Multiple addresses can be separated by a space.

EMAIL="admin2@swoops.co.uk admin3@swoops.co.uk"

TEMPFILE holds the path of a temporary file with a random name, which is created by mktemp.

# Create a temp file
TEMPFILE="$(mktemp)"
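
A script that runs from cron every few minutes creates a new temp file on every run, so it is worth cleaning up afterwards. One possible approach, shown here as a sketch and not as part of the original script, is an EXIT trap that removes the file when the script ends.

```shell
#!/bin/sh
# Sketch: create the temp file and make sure it is removed when the script exits.
TEMPFILE="$(mktemp)" || exit 1
# The EXIT trap runs when the shell exits, whether the script reaches
# its last line or calls exit earlier.
trap 'rm -f "$TEMPFILE"' EXIT
echo "Local Date & Time : $(date)" >> "$TEMPFILE"
cat "$TEMPFILE"
```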

As we are going to use the output from the uptime command, we set up a variable FTEXT that holds the text preceding the load figures, to make life easier later on.

FTEXT='load average:'

Uptime is handy to use here as it shows us the load average over the last one, five and fifteen minutes.

$ uptime
 03:58:13 up 68 days, 11:15,  1 user,  load average: 1.36, 1.06, 0.85

LOAD1MIN is set to run the uptime command and then pipe it through awk, cut and sed to get exactly and only the number of the load during the last one minute.

# Get the load average for the last 1 minute.
LOAD1MIN="$(uptime | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f1 | sed 's/ //g')"

The two variables LOAD5MIN and LOAD15MIN will do the same for the average load during the last five and fifteen minutes.

# Get the load average for the last 5 minutes.
LOAD5MIN="$(uptime | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f2 | sed 's/ //g')"
# Get the load average for the last 15 minutes.
LOAD15MIN="$(uptime | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f3 | sed 's/ //g')"
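
The three assignments differ only in the field number passed to cut, and each one runs uptime again. As a sketch of one way to tidy this up, the pipeline can be wrapped in a function and fed a single captured line. extract_load and UPTIME_LINE are hypothetical names, and the sample line is the uptime output shown earlier so that the example gives the same result everywhere.

```shell
#!/bin/sh
# Sketch: wrap the awk | cut | sed pipeline in a function.
FTEXT='load average:'

# extract_load LINE FIELD
#   LINE  - one line of uptime output
#   FIELD - 1, 2 or 3 for the 1, 5 or 15 minute load average
extract_load() {
    printf '%s\n' "$1" | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f"$2" | sed 's/ //g'
}

# Sample line for illustration; in the real script use UPTIME_LINE="$(uptime)".
UPTIME_LINE=' 03:58:13 up 68 days, 11:15,  1 user,  load average: 1.36, 1.06, 0.85'
LOAD1MIN="$(extract_load "$UPTIME_LINE" 1)"
LOAD5MIN="$(extract_load "$UPTIME_LINE" 2)"
LOAD15MIN="$(extract_load "$UPTIME_LINE" 3)"
echo "$LOAD1MIN $LOAD5MIN $LOAD15MIN"
```

Capturing uptime once also means that all three values come from the same sample.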

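The variable MEMU formats the memory statistics for readability, using the free command and then awk to format the output. Note that the -o option has been removed from newer versions of free; on those systems free -tm still prints the Total: line that awk looks for.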

# awk the memory stats
MEMU="$(free -tom | awk '/Total:/ {print "Total memory: "$2" MB\nUsed memory: "$3" MB\nFree memory: "$4" MB"}')"

Let’s have a quick look at how that looks on the command line:

$ free -tom | awk '/Total:/ {print "Total memory: "$2" MB\nUsed memory: "$3" MB\nFree memory: "$4" MB"}'
Total memory: 2019 MB
Used memory: 928 MB
Free memory: 1090 MB
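
To try the awk formatting without depending on the local version of free, the awk part can be put in a small function that reads from standard input. This is only a sketch: format_mem is a hypothetical name, and the input line is made up from the figures in the example above.

```shell
#!/bin/sh
# Sketch: keep the awk formatting in one place.
# format_mem reads free-style output on stdin and formats the "Total:" line.
format_mem() {
    awk '/Total:/ {print "Total memory: "$2" MB\nUsed memory: "$3" MB\nFree memory: "$4" MB"}'
}

# Illustrative input only: a Total: line with the figures from the example above.
# In the real script the input would come from free, e.g. free -tm | format_mem
MEMU="$(printf 'Total: 2019 928 1090\n' | format_mem)"
echo "$MEMU"
```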

Next, we create a variable for the subject line of the notification email. The subject as we have it here will include the hostname of the server, and the average load for the last five minutes. I use five minutes here because we are going to schedule the script to run every five minutes.

# Email subject
SUBJECT="Alert $(hostname) high load average: $LOAD5MIN"

Next we set up the message body so that it is nicely formatted. Now we see the results of setting up all the variables for the script. The setup of the message body is really just using the echo command to append text and the output of our variables to the temporary file.
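
The complete script builds the body with a series of echo commands. As an alternative, a here-document writes the same kind of text in one go. This is a sketch with placeholder values so that it runs on its own; in the real script the variables are already set by the earlier steps.

```shell
#!/bin/sh
# Sketch: build the message body with a here-document instead of repeated echo calls.
# Placeholder values (assumptions for illustration only).
NOTIFY="1"
LOAD5MIN="2.05"
LOAD15MIN="2.47"
TEMPFILE="$(mktemp)" || exit 1

# Variables and $(...) are expanded inside the here-document because EOF is unquoted.
cat > "$TEMPFILE" <<EOF
Server 5 min load average $LOAD5MIN is above notification threshold $NOTIFY

Local Date & Time : $(date)

Server load for the last five minutes: $LOAD5MIN
Server load for the last fifteen minutes: $LOAD15MIN
EOF

cat "$TEMPFILE"
```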

RESULT is used to check whether the five minute load average is above our notification threshold. bc prints 1 if the comparison is true and 0 if it is false.

# Check whether the 5 minute load average is larger than the specified limit.
# bc prints 1 if the comparison is true and 0 if it is false.
RESULT=$(echo "$LOAD5MIN > $NOTIFY" | bc)

Let’s see an example on the command line:

$ echo "5 > 1" | bc
1
$ echo "5 > 9" | bc
0
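
bc is not installed on every system. If it is missing, the same floating point comparison can be done with awk. The following is a sketch; is_above is a hypothetical helper and not part of the original script.

```shell
#!/bin/sh
# Sketch: floating point comparison with awk, for systems without bc.
# is_above VALUE LIMIT prints 1 if VALUE > LIMIT, otherwise 0.
is_above() {
    # Adding 0 forces awk to compare the values as numbers, not as strings.
    awk -v val="$1" -v lim="$2" 'BEGIN { print ((val + 0 > lim + 0) ? 1 : 0) }'
}

is_above 2.05 1   # prints 1
is_above 0.85 1   # prints 0
```

In the script this would be used as RESULT="$(is_above "$LOAD5MIN" "$NOTIFY")".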

Finally, if the result equals one, the email is sent with the subject we set up and the contents of the temporary file as its body.

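# If the result is true, send the message.
# = is the POSIX string comparison; == is a Bash extension that fails under some /bin/sh.
if [ "$RESULT" = "$TRUE" ]; then
        # $EMAIL is deliberately unquoted so that each address is passed as a separate argument.
        /bin/mail -s "$SUBJECT" $EMAIL < "$TEMPFILE"
fi
# Remove the temp file so that old files do not pile up.
rm -f "$TEMPFILE"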

Set up a cron job to run the script every five minutes:

$ crontab -e

Add:

*/5 * * * * /home/orfels/scripts/loadaverage.sh >/dev/null 2>&1

Make the script executable:

$ chmod u+x /home/orfels/scripts/loadaverage.sh

Here is the complete script:

#! /bin/sh
#
# Script to send email notification if a server exceeds a specified load average.
#
# Selected load average limit.  If above this number a notification message will be emailed.
NOTIFY="1"
TRUE="1"
# Email address to receive alerts.
EMAIL="admin2@swoops.co.uk admin3@swoops.co.uk"
# Create a temp file
TEMPFILE="$(mktemp)"
# The text which will be awk'ed a few times looking for the same text, so we specify it here once.
FTEXT='load average:'
# Get the load average for the last 1 minute.
LOAD1MIN="$(uptime | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f1 | sed 's/ //g')"
# Get the load average for the last 5 minutes.
LOAD5MIN="$(uptime | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f2 | sed 's/ //g')"
# Get the load average for the last 15 minutes.
LOAD15MIN="$(uptime | awk -F "$FTEXT" '{ print $2 }' | cut -d, -f3 | sed 's/ //g')"
# awk the memory stats
MEMU="$(free -tom | awk '/Total:/ {print "Total memory: "$2" MB\nUsed memory: "$3" MB\nFree memory: "$4" MB"}')"
# Email subject
SUBJECT="Alert $(hostname) high load average: $LOAD5MIN"
# Mail message body
echo "Server 5 min load average $LOAD5MIN is above notification threshold $NOTIFY" >> $TEMPFILE
echo " " >> $TEMPFILE
echo "Hostname: $(hostname)" >> $TEMPFILE
echo "Local Date & Time : $(date)" >> $TEMPFILE
echo " " >> $TEMPFILE
echo "Server load for the last five minutes: $LOAD5MIN" >> $TEMPFILE
echo "Server load for the last fifteen minutes: $LOAD15MIN" >> $TEMPFILE
echo " " >> $TEMPFILE
echo "------------------------" >> $TEMPFILE
echo "Memory stats:" >> $TEMPFILE
echo "------------------------" >> $TEMPFILE
echo "$MEMU" >> $TEMPFILE
echo " " >> $TEMPFILE
# Check whether the 5 minute load average is larger than the specified limit.
# bc prints 1 if the comparison is true and 0 if it is false.
RESULT=$(echo "$LOAD5MIN > $NOTIFY" | bc)
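# If the result is true, send the message.
# = is the POSIX string comparison; == is a Bash extension that fails under some /bin/sh.
if [ "$RESULT" = "$TRUE" ]; then
        # $EMAIL is deliberately unquoted so that each address is passed as a separate argument.
        /bin/mail -s "$SUBJECT" $EMAIL < "$TEMPFILE"
fi
# Remove the temp file so that old files do not pile up.
rm -f "$TEMPFILE"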

Example of a notification email:

Server 5 min load average 2.05 is above notification threshold 1

Hostname: vmorfels02
Local Date & Time : Wed May 16 05:20:02 PDT 2012

Server load for the last five minutes: 2.05
Server load for the last fifteen minutes: 2.47

------------------------
Memory stats:
------------------------
Total memory: 8834 MB
Used memory: 7204 MB
Free memory: 1629 MB

Now you know you have problems and will instantly regret checking your mail in the middle of the night on the way to the bathroom… Nice.

I have included all three load averages in the script for the sake of completeness. If the script needs to be amended to send notifications at a different interval, e.g. every fifteen minutes, it can easily be changed. A useful addition to the script might be information about the three running processes that are consuming the most CPU, or free disk space, or something else?
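
As a rough sketch of the additions suggested above, the following collects the three processes with the highest CPU figure and the disk usage of the root file system. top_cpu, TOPCPU and DISK are hypothetical names. It assumes that ps accepts the POSIX -o format names pcpu, pid and comm; on Linux, pcpu is averaged over the lifetime of the process, so it is not an instantaneous reading.

```shell
#!/bin/sh
# Sketch: extra details that could be appended to the message body.
# top_cpu keeps the three input lines with the highest leading number.
top_cpu() {
    sort -rn | head -n 3
}

# Assumption: ps supports -o pcpu, pid and comm; an empty header (pcpu=)
# suppresses the header line.
TOPCPU="$(ps -e -o pcpu= -o pid= -o comm= 2>/dev/null | top_cpu)"
# Disk usage of the root file system in POSIX output format.
DISK="$(df -P / 2>/dev/null)"

echo "Top CPU consumers (%CPU PID COMMAND):"
echo "$TOPCPU"
echo "Disk usage:"
echo "$DISK"
```

In the script these lines would be redirected to $TEMPFILE like the rest of the message body.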
