Marching Penguins: Monitoring Your HPC Cluster

Getting into Ganglia for a scalable and flexible solution to the problem of cluster monitoring.

As the proud system administrator, you have a shiny new cluster sitting in front of you, a nice set of LINPACK results in hand, and a bunch of jobs running through the queues. Things are good, users are happy, and you can catch up on all those other projects you have pending. Then you get an email: “Why are my jobs slow?” Or a project manager comes by and wants to know if the expensive new hardware is actually being used. Perhaps you are trying to plan for the coming year and need to know the recent usage trends. Monitoring your systems to establish a set of baseline figures and current performance information can help with each of these problems.

The idea of “monitoring” is somewhat overloaded (much like the word “clustering” itself). In High Performance Computing (HPC), most attention is paid to the utilization and performance metrics of compute nodes, rather than focusing on service availability and problem notification. This article will focus on the former, while programs such as Nagios and OpenNMS are excellent at handling the latter.

For the purpose of this article, it is assumed that you have access to a working cluster; a functioning web server with mod_php (including GD support); and are comfortable with basic administration tasks, Apache configuration, and using command line tools. The commands below were run on CentOS 5 and Gentoo systems, but the general concepts should apply to all common Linux distributions (and most other UNIX-like operating systems as well).

Collection

Monitoring itself can be broken down into two broad parts. The first is actually determining which metrics are needed and then collecting that data from the hosts. Some of the most commonly tracked metrics are CPU usage, memory consumption, network bandwidth, and disk I/O stats. These provide different indications of how well a system is performing, and may indicate where there are potential problems or performance bottlenecks. Once the data have actually been acquired, the second task is presenting the information in a meaningful way for analysis.

Linux provides a large number of utilities to access data about the system. A few common tools are vmstat, iostat, and netstat, although there are many others. These programs are typically geared toward non-interactive use—an important consideration for continuously monitoring systems. Programs that are interactive in nature (like top) will not work well unless they have some form of batch mode that is easily parsed. Some programs, such as sar and atop, have data collection modes and can produce detailed reports on a wide range of metrics. When all else fails, it is possible to pull numbers directly from the /proc or /sys filesystems, although you may need to manipulate the raw data to get usable information. For example, the various “CPU %” values are actually calculated from values found in /proc/stat.
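As a rough illustration of the idea (this is a simplified sketch, not how vmstat itself works), the following shell fragment samples the aggregate “cpu” line from /proc/stat twice, five seconds apart, and derives user and system percentages from the difference; iowait and interrupt time are deliberately ignored to keep the arithmetic short.

#!/bin/sh
# Sketch: derive user and system CPU% from two samples of /proc/stat.
# The first four fields of the "cpu" line are user, nice, system, and idle
# (in jiffies); iowait and interrupt time are ignored here for simplicity.
read_cpu() { awk '/^cpu /{print $2, $3, $4, $5}' /proc/stat; }

set -- `read_cpu`; U1=$1; N1=$2; S1=$3; I1=$4
sleep 5
set -- `read_cpu`; U2=$1; N2=$2; S2=$3; I2=$4

TOTAL=`expr \( $U2 + $N2 + $S2 + $I2 \) - \( $U1 + $N1 + $S1 + $I1 \)`
USER=`expr \( $U2 - $U1 \) \* 100 / $TOTAL`
SYS=`expr \( $S2 - $S1 \) \* 100 / $TOTAL`
echo "user: ${USER}%  system: ${SYS}%"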

Almost all metrics can be reduced to a number, or collection of numbers, marked with a timestamp; these are then stored for later retrieval. Storage formats can range from the simple (a plain text file) to the complicated (a set of fully normalized SQL tables). The RRD format used by rrdtool was designed to store metric data, and it is very well suited for this task. Ganglia (more below) makes extensive use of RRD files for its data storage.
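As a quick, hypothetical example of the RRD approach (the file name, data-source name, and retention settings below are arbitrary), the following commands create a database that expects one 1-minute load value every 60 seconds, keep a day of averaged samples, and store a single data point:

# Create an RRD with one GAUGE data source ("load1"), a 120-second
# heartbeat, and one day (1440 rows) of 60-second averages.
rrdtool create load1.rrd --step 60 \
    DS:load1:GAUGE:120:0:U \
    RRA:AVERAGE:0.5:1:1440

# Store the current 1-minute load average, timestamped "now" (N).
rrdtool update load1.rrd N:`awk '{print $1}' /proc/loadavg`

# Read the stored averages back out to verify.
rrdtool fetch load1.rrd AVERAGE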

For basic monitoring, vmstat is a natural place to start. It can tell us a great deal about the system, and it reports at regular intervals. The commands in Listing One tell vmstat to report every 5 seconds and print the column header line only once (without the -n flag, it reprints the header roughly once per screenful of output). The first three lines of output are skipped by the awk filter (NR>3), since lines 1 and 2 are column headers and line 3 reports average values since the last system reboot. Unfortunately, vmstat will not show load averages, but we can work around this with help from other programs.

To do this, we tell awk to print a timestamp for each line and to append the contents of /proc/loadavg (which holds the current load averages). Finally, the output is piped to tee so that it is both displayed on STDOUT and saved to the vmstat.dat file; normally, a simple file redirect is sufficient if you only want to store the data. For each of these programs, please check their respective man pages for details on usage, command syntax, and output format.

$ vmstat -n 5 | awk '(NR>3){getline load < "/proc/loadavg"; print systime(), $0, load }' | tee vmstat.dat
1196738925 0 1  188 138820 254824 674948  0  0   2  0 3470 9566 35 11 54 0 0.46 0.38 0.52 1/140 31493
1196738935 1 0  188 123972 263420 674948  0  0  860  0 3648 9876  6 12 0 82 0.63 0.41 0.53 2/140 31493
1196738945 0 1  188 99296 273700 674976  0  0 1029  0 3708 10016 6 13 0 82 0.68 0.43 0.54 1/140 31493
1196738955 1 1  188 89716 281340 674976  0  0  764  0 3654 9921  6 10 0 84 0.73 0.45 0.54 1/140 31493

The example in Listing One stores the data in a single ASCII flat file. The advantage to this format is simplicity: the data are immediately available, and no special tools are needed to extract it. The disadvantages are just about everything else. Storing numbers as raw text is not an efficient use of disk space, and for very large files, reading and processing the data can take significant amounts of time. No metadata is kept, and the files need to be managed in such a way that it is known what data they represent. For example, in the output file above, unless it is known that data came from vmstat, we have no way to be sure what it represents. For very simple (and perhaps temporary) monitoring, flat files work fine. For serious data collection, a more robust solution (such as using RRD files or a true database) is worth considering.

Display

The amount of data collected in even a small cluster can quickly become unwieldy, and making sense of it is the other half of “monitoring” (and indeed, the whole point!). As the saying goes, “a picture is worth a thousand words.”

Taking the data from vmstat, we can send it to gnuplot to get an idea of system performance during this period of time. The vmstat and awk commands above produced a file with twenty-one different metrics, plus a timestamp; it becomes obvious why managing many different metrics can be a challenge. In the present case, we want columns 1, 14, 15, and 18, which refer to the timestamp, user CPU%, system CPU%, and 1-minute load average, respectively. See the sidebar on gnuplot at the end of the article for details on plotting the data.


Figure One: Plot of data from vmstat and /proc/loadavg

As the plot in Figure One shows, the system was fairly busy at the time: several virtual machines were running, and a number of large software packages were being compiled. The high level of system CPU time is due mainly to heavy disk I/O and overhead from the virtual machines. Notice that the load numbers nicely track CPU usage.

Ganglia

As is frequently the case in the open source world, someone else has already written software to solve the problem, and monitoring clusters is no exception.

Ganglia consists of several components designed specifically for the different aspects of monitoring, collecting, and displaying metrics from HPC systems in an efficient and scalable way. It was originally written by Matt Massie at the University of California, Berkeley (unsurprisingly, Ganglia was released under a BSD license), and is actively maintained by a small group of developers. Ganglia is used by commercial, educational, government, and non-profit organizations across the world to monitor some of the largest clusters currently in operation. A partial, but still impressive, list of organizations using Ganglia can be found on the Ganglia homepage. The current stable release is ganglia-3.1.0. This new release has a number of improvements over the previous 3.0.x series, including a new modular interface for adding metrics directly to gmond (with C and Python bindings), the addition of several new core metrics, and a number of display improvements. While the screenshots below were taken using version 3.0.5, setting up a new Ganglia 3.1.0 installation is essentially the same.

Installing Ganglia is straightforward, and most recent distributions provide packages; RPMs are also available for download from the SourceForge website. However, if packages are unavailable or out of date, tarballs can be found on the Ganglia homepage. The web frontend needs Apache and mod_php, and PHP must be built with GD support. Otherwise, the common mantra of “configure; make; make install” should work fine. To install the web frontend, simply copy the “web” directory into the Apache document root and rename it to “ganglia”. See Listing Two below.

$ tar xzf ganglia-3.0.5.tar.gz
$ cd ganglia-3.0.5
$ ./configure --with-gmetad
$ make
$ sudo make install
$ sudo cp -r web html_docroot/ganglia   # html_docroot is your Apache document root

As shown in Figure Two, there are four main parts to Ganglia:

  • gmond is responsible for collecting a basic set of core metrics (CPU usage, basic network and memory stats, etc.) about the local machine. The gmond daemons send data out over multicast (by default) or unicast to other gmond daemons within the cluster. This way, each daemon can track the global state of the cluster at any time, and any one of them can provide a complete report to gmetad.
  • The gmetad daemon is the heart of the system. It collects metrics from one or more gmond daemons and stores them in RRD files for later retrieval. gmetad can also poll other gmetad instances for summary information on other clusters. This is known as “federation”, and is useful for creating summary views of discrete, but related, clusters.
  • A web frontend built on PHP is used to actually display the data. When each page is loaded, the PHP scripts will request relevant data from gmetad in order to generate the page requested. There are a number of pre-made reports that provide very useful views into the workings of the cluster as a whole, and custom reports can be written. The web frontend does not need to run on the same computer as gmetad, but it does make configuration simpler.
  • For metrics not directly supported by gmond, Ganglia includes the command line program gmetric to track additional metrics. These are reported to gmond, which passes them to gmetad along with the built-in statistics. New in the 3.1.0 release is the ability to extend gmond directly by writing modules in C or Python.


Figure Two: Ganglia Overview

Each compute node will need a gmond process to collect data, and the master node (or administration node) will need to run gmetad. Gmond listens for incoming requests from gmetad on TCP port 8649 and sends and receives multicast traffic on address 239.2.11.71, port 8649. The gmetad daemon listens on TCP ports 8651 and 8652. If there are problems using multicast, make sure that the IP routing table is configured to handle multicast traffic correctly.
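If multicast traffic is not being routed, one common fix is to add an explicit route for the multicast range on each node. The interface name below (eth0) is an assumption; use whichever interface faces the cluster's private network.

# Send multicast traffic for Ganglia's default group out the cluster interface.
route add -net 239.2.11.0 netmask 255.255.255.0 dev eth0

# The same thing with the newer iproute2 tools:
ip route add 239.2.11.0/24 dev eth0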

Just about anything that can be quantified can be passed to gmetric. For example, to count the number of unique users logged into a system and feed that into Ganglia, the shell script in Listing Three will work.

#!/bin/sh
# Count the unique users currently logged in and report the value
# to the local gmond as a custom metric named "users_loggedin".
USERS=`who | awk '{print $1}' | sort -u | wc -l`
gmetric -n users_loggedin -v $USERS -t uint8 -u Users

The command line options -n (name of metric), -v (value of the metric) and -t (metric datatype) are required for each call to gmetric. The -u option indicates the units for this metric; its use is optional, but recommended.

For devices that cannot run gmond directly, it is possible to use gmetric to send data on their behalf. This ability to “spoof” reports from other hosts permits Ganglia to monitor data from embedded systems, proprietary hardware, and other devices. For example, many uninterruptible power supplies (UPS) have internal temperature sensors and support network management cards with built-in SNMP agents. Covering SNMP is well beyond the scope of this article, although there are a number of excellent resources online (see the References sidebar). Assuming you have snmpget from the Net-SNMP project, temperature data from the UPS can be fed into Ganglia using the script in Listing Four below:

#!/bin/sh
# Name and IP address of the UPS, and OID.
# The OID is specific to APC hardware,
# but other vendors provide similar support.
UPS=ClusterUPS1
IP=192.168.200.101
OID=.1.3.6.1.4.1.318.1.1.2.1.1.0
TEMPERATURE=`snmpget -O qv -c public -v 1 $IP $OID`
gmetric -n temperature -v $TEMPERATURE -t int8 -u 'deg C' -S $IP:$UPS

This script should be suitable to call from crond on a regular basis. Ganglia expects metrics to be updated once every 60 seconds, but this can be adjusted if needed, depending on the expected behavior.
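A crontab entry along the following lines (the script path is hypothetical) would report the UPS temperature once per minute:

# Run the UPS temperature script every minute (minute hour dom month dow command).
* * * * *  /usr/local/sbin/ups-temperature.sh >/dev/null 2>&1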

Notice that the datatype used in this example is “int8” instead of “uint8” from the previous example. This indicates that Ganglia, via rrdtool, should use a signed 8-bit integer for storing the temperature data, as opposed to an unsigned 8-bit integer for logged-in users. Temperature can be negative, although it hopefully never gets that cold in the datacenter. The number of users will never be negative, so an unsigned integer will work. This script can easily be expanded to poll for other information, such as the UPS’ outgoing amperage or incoming voltage. Integer and floating point datatypes up to 32 bits long are also available.

For this article, I created a small four-node cluster using virtual machines. There is one head node and three compute nodes, each with one single-core CPU. While no powerhouse, it works wonderfully for demonstrations. Each of the compute nodes is running gmond, and the head node runs gmond, gmetad and the web frontend.

The default configuration files for gmond (“/etc/gmond.conf”) and gmetad (“/etc/gmetad.conf”) should work without modification. However, for a production install of Ganglia, change the data_source setting in “gmetad.conf” (see the comments in the file; if a gmond is running on the head node, localhost will work) and the cluster { name = "unspecified" } entry in “gmond.conf” to match. Also look at the various access control measures supported by both daemons, especially if using the spoofing feature.
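As a minimal sketch (the cluster name here is made up), the important point is that the data_source line in gmetad.conf and the cluster name in gmond.conf agree:

# /etc/gmetad.conf -- poll the gmond on localhost every 15 seconds
data_source "Demo Cluster" 15 localhost

# /etc/gmond.conf -- the cluster name must match the data_source above
cluster {
  name  = "Demo Cluster"
  owner = "unspecified"
}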

The screenshot in Figure Three shows the Ganglia installation running on the test cluster. Ganglia displays a great deal of information in a fairly compact space. The top of the webpage provides a general overview of the cluster plus CPU, memory, and network statistics. When a host is offline, that is indicated as well.


Figure Three: Ganglia in Action

The bottom section of the webpage shows per-node statistics, with one graph per system. The default settings show the 1-minute load. The nodes are sorted in descending order, so busy nodes are listed first. This choice can be changed via the “metric” drop-down field displayed at the very top of the page. The coloring of the chart is always based on the current 1-minute load, divided by the number of CPUs. Thus, a load of 1.02 on a single-CPU box is colored red, whereas a load of 2.06 on a box with 8 cores is colored light green. A detailed page on a specific host is available by clicking on the per-node chart for that host or by choosing the node specifically from the “Choose a Node” drop-down menu, also at the top of the page.

The timescale is the same for all of the charts, making it easy to correlate different metrics. For example, at about 01:10, node2 went offline. The number of available CPUs and amount of memory reported by Ganglia both dropped to reflect this. Also in the load report, there was a drop in the 1-minute load from about 3.0 to about 2.0. This makes sense, since the value displayed is the sum of all 1-minute loads across the cluster, and each of the two remaining compute nodes reports a load slightly less than 1.0. The other reports will show similar changes when a node goes offline, although in this test the effects are subtle.

For HPC systems, monitoring performance metrics is an important part of making sure that clusters are running well. There are many different ways of managing and displaying the data, and the specific one chosen will depend on many different factors. Ganglia provides a scalable and flexible solution to the problem and is well worth considering. It is also possible to write a custom solution using command line tools and custom scripts. While perhaps not as quick a solution to deploy, writing such a program will teach you a great deal about your system, and there is much to be said for learning something new.

Sidebar: Gnuplot

Gnuplot is a surprisingly powerful program, and over the course of many years, I have kept coming back to it. I have frequently done data analysis in other programs with built-in graphing functions (including OpenOffice Calc and Microsoft Excel), but often export the data to gnuplot for analysis and presentation.

Below are the commands to plot the data from vmstat and /proc/loadavg. The various set lines define the chart parameters and tell gnuplot how to deal with the data. The plot and replot statements actually draw lines on the chart.

These lines tell the program to pull data from the “vmstat.dat” file, using column 1 as the x-axis value (which, you will recall, is the timestamp) and columns 14, 15, and 18 for the y values, and then connect the points with lines (gnuplot has a very wide range of plotting styles). The two replot lines also include several additional parameters to adjust the line color and appearance.

Consult the Gnuplot documentation for more complete explanations and examples.

set key inside left top
set xdata time
set timefmt "%s"
set format x "%H:%M"
set xlabel "Time"
set ylabel "CPU %"
set ytics nomirror
set y2label "Load average"
set y2range [0:]
set y2tics 1
set my2tics 0.5
plot 'vmstat.dat' using 1:14 \
  title 'user cpu%' with lines
replot 'vmstat.dat' using 1:15 \
  title 'sys cpu%' with lines \
  linecolor 3
replot 'vmstat.dat' using 1:18 title 'load 1' \
  with lines linecolor 7 linewidth 2 \
  axes x1y2 smooth bezier

References
