Cluster Monitoring with Ganglia

Monitoring the status of a Beowulf-style cluster can be a daunting task for any system administrator, especially if the cluster consists of more than a dozen nodes. While Linux is extremely stable, hardware problems can cause nodes to crash or become inaccessible, and chasing down problem nodes in a 500-node cluster is painful. Luckily, some sort of statistical resource monitoring can often yield early warnings of impending hardware failures.

Perhaps even more important than collecting long-term statistics is the need — for both system administrators and cluster users — for real-time resource utilization data. If jobs are queued and waiting to run while half the machine is idle, you may have a problem. A system administrator should be able to quickly determine this sort of condition.

Utilization information is valuable to cluster users as well. If a user needs to run a job immediately with as many processors as possible on nodes with specific memory requirements, he or she should be able to determine a runtime configuration based on present resource availability.

Ideally, what’s needed is a real-time system monitoring tool that can handle a large number of nodes, can store utilization data for long-term analyses, can present this information in an easy-to-read graphical form, and can do this without noticeably consuming precious CPU or network resources.

Enter Ganglia, a scalable, distributed monitoring package for high performance computing systems. Not only can Ganglia monitor a compute cluster (up to 2,000 nodes!), but it can provide a single view of an entire grid of clusters (or “cluster of clusters”) dispersed over a campus or a wide area network (i.e., the Internet).

Ganglia is an Open Source project (available on SourceForge at http://ganglia.sourceforge.net) with a BSD license. It grew out of the University of California, Berkeley, Millennium Cluster Project (see http://www.millennium.berkeley.edu) in collaboration with the National Partnership for Advanced Computational Infrastructure (NPACI) Rocks Cluster Group. NPACI Rocks is a cluster distribution of Linux that contains, in addition to a kernel and normal system software, a number of tools, including Ganglia (many of which have been discussed right here in Linux Magazine).

Ganglia provides a complete, real-time monitoring and execution environment based on a hierarchical design. It uses a multicast listen/announce protocol to monitor node status, and uses a tree of point-to-point connections to coordinate clusters of clusters and aggregate their state information. Ganglia uses the eXtensible Markup Language (XML) to represent data, eXternal Data Representation (XDR) for compact binary data transfers, and an open source package called RRDTool for data storage (in Round Robin databases) and for graphical visualization.

The developers of Ganglia have gone out of their way to make their software as portable as possible. It runs on Linux, Solaris, FreeBSD, AIX, IRIX, Tru64, HP-UX, Mac OS X and (dare I speak its name?) Windows (using Cygwin). Ganglia’s in use on more than 500 clusters around the world. Moreover, PlanetLab has used Ganglia to monitor clusters at over 100 sites in half a dozen countries on a single Linux box.

The Ganglia Monitoring Core consists of the monitoring daemon, gmond, which runs on every node; the meta daemon, gmetad, which runs on a central machine, and collects and stores state information; a command-line status client called gstat which connects to a monitoring daemon and outputs a load-balanced list of cluster nodes; and a command-line tool called gmetric that defines new metrics for the monitoring daemons to track.

In addition, a command-line tool called gexec is available. It’s useful for starting parallel or distributed jobs on a cluster or on the computational grid. A web front-end for displaying real-time statistics and graphics from the meta daemon is also available. Finally, a Python class is available for sorting and classifying large clusters using the monitoring core.

gmond, the Ganglia Monitoring Daemon

The monitoring daemon, gmond, is a multi-threaded program that runs on each node in the cluster. It doesn’t require a common filesystem, special user accounts, or database files. gmond monitors changes in node state, multicasts relevant changes, listens to multicasts of state from other nodes, and answers requests for an XML description of cluster status. gmond consists of a metric scheduler thread, XML output threads, and multicast listening threads, which share a fast in-memory, cluster hash image. The state information is stored very efficiently in memory, making gmond very scalable.

The monitoring daemon multicasts only those metrics that are being monitored; furthermore, it multicasts these only when the metric value exceeds a change threshold or when the time since last transmission exceeds a certain time threshold. These policies keep traffic down to a minimum. For example, the number of CPUs is multicast only once per hour, but the 1-minute load average might be sent as often as every 15 seconds.
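The send-or-stay-quiet policy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not gmond's actual code: the class name, its attributes, and the threshold numbers in the usage note are all hypothetical.

```python
import time


class MetricSender:
    """Decide when a metric is worth multicasting (illustrative only)."""

    def __init__(self, value_threshold, time_threshold, clock=time.monotonic):
        self.value_threshold = value_threshold  # smallest change worth sending
        self.time_threshold = time_threshold    # longest allowed silence, in seconds
        self.clock = clock                      # injectable so the logic can be tested
        self.last_value = None
        self.last_sent = None

    def should_send(self, value):
        """Return True if this value should be announced now."""
        now = self.clock()
        if self.last_sent is None:
            send = True  # nothing has been announced yet
        elif abs(value - self.last_value) > self.value_threshold:
            send = True  # the value moved past the change threshold
        elif now - self.last_sent > self.time_threshold:
            send = True  # too long since the last announcement
        else:
            send = False
        if send:
            self.last_value = value
            self.last_sent = now
        return send
```

A slowly changing metric such as the CPU count would get a long time threshold, while a load average would get a short one, for example MetricSender(value_threshold=0.5, time_threshold=15). Those two numbers are made up for the example.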

Since gmond multicasts its startup time in a heartbeat message, rebooting a node alerts other gmonds, forcing expiration of all metric data from that node so that it’s quickly updated.

The XML output threads process incoming requests on port 8649 (that’s U-N-I-X on your touch-tone keypad!). All requests for XML state information must come either from the local machine or from a trusted host (listed in /etc/gmond.conf); otherwise, the connection is immediately closed. If the host is valid, a complete XML description of the state of all nodes on the local multicast channel is sent.

gmetad, the Ganglia Meta Daemon

The meta daemon, gmetad, works via unicast routes to pull together XML reports from one or more clusters. It stores historical information in Round Robin databases, and exports summary XML information, which the web front-end uses to present data and graphics to a web browser. The behavior of gmetad is controlled through a single configuration file, /etc/gmetad.conf, and the daemon needs to run on only a single host. The host could be a master node in the cluster or an independent computer that can monitor one or more clusters via traditional unicast connections.

The meta daemon uses Tobi Oetiker’s RRDTool (available at http://www.rrdtool.com) to store time series data in Round Robin databases. RRDTool is a separate package, not a part of Ganglia itself. It stores the data in a very compact way that will not expand over time, and it generates graphs by processing the data with a fixed density. gmetad uses the graphing capabilities to feed time series plots to the web front-end for display in web browsers.
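The reason a Round Robin database never grows is that it holds a fixed number of slots and overwrites the oldest sample when it is full. The toy class below sketches just that one idea using Python's collections.deque. It is not RRDTool's file format or API, and the class and method names are invented for illustration.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size time series: new samples push out the oldest ones."""

    def __init__(self, slots):
        # A deque with maxlen silently discards the oldest entry when full,
        # so storage stays constant no matter how long the series runs.
        self.samples = deque(maxlen=slots)

    def update(self, timestamp, value):
        """Record one (timestamp, value) sample."""
        self.samples.append((timestamp, value))

    def average(self):
        """Mean of the values currently held, or None if the series is empty."""
        if not self.samples:
            return None
        return sum(value for _, value in self.samples) / len(self.samples)
```

With slots set to 3, feeding in five samples leaves only the last three in memory, which is the property that keeps long-running monitoring data from filling the disk.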

RRDTool appears to be a useful and extensible package, and is distributed under the GNU General Public License (GPL). People who find it useful are encouraged to participate in the Happy Tobi Project, which appears to amount to buying Tobi a DVD from his eclectic and growing Amazon Wish List. He’ll also settle for a PayPal donation.

The Ganglia Web Front-end

The Ganglia web front-end provides a view of collected data through real-time dynamic web pages. It provides up to three levels of data displays: one for the grid (or multi-cluster view), one for each cluster (physical view), and one for each host or node. Utilization data can be viewed over the past hour, day, week, month, or year. This makes it easy for both system administrators and grid/cluster users to quickly see the status of the resources and their trends through time.

All the information and graphics are generated on-the-fly by parsing a complete Ganglia XML tree — obtained by contacting the local gmetad on port 8651 (by default) — for every page accessed. Therefore, the front-end should run on a fairly powerful computer; if a large grid is monitored this way, a dedicated server may be best. The web front-end is written in the PHP scripting language, and it works well under the Apache web server with the PHP 4.1 module.

Starting Jobs with gexec

Although not part of the Ganglia Monitoring Core, gexec is a command-line tool for starting parallel or distributed jobs on clusters or grids. It relies on two daemons, gexecd and authd, to handle user authentication on clusters and nodes via public/private key pairs. It can start jobs on a fixed list of nodes, or on a dynamically-generated list of under-utilized nodes.

In order to use gexec, gmond must be built to support it. gexec is completely optional, and does not affect Ganglia’s monitoring capabilities. In the example installation that follows, gexec is not enabled since only the monitoring features are of interest.

Installing and Configuring Ganglia

The Ganglia Monitoring Core is available as a source tarball and as an RPM package on the Ganglia website at http://ganglia.sourceforge.net, while the web front-end is available for Linux only as an RPM package. The Ganglia web site contains very clear and explicit instructions about installing the package from either sources or the RPM. The source includes rc-style startup files that can be used with chkconfig. The RPMs install the startup files automatically and start up the daemons.

Remember that the RRDTool is required for building and installing the meta daemon, gmetad. Source code tarballs and RPMs for this package are available at http://www.rrdtool.com.

After building the executables from source code, the monitoring daemon (gmond) should be installed on every node in the cluster, and the meta daemon should be installed on the centralized server or the master node. When installing from RPMs, the gmond RPM should be installed on every node while the gmetad RPM should be installed on the centralized server or the master node. In the example installation presented here (for a single small cluster), gmetad and the web front-end are installed on the master node, which is also running the Apache web server.

The configuration file for the monitoring daemon, /etc/gmond.conf, should be installed, and should be the same on every node within a cluster. The sample configuration file is very well documented: it lists default values and provides commented example settings for overriding the defaults. The most meaningful items to set are the cluster name (name), the owner of the cluster (owner), and the URL for information about the cluster (url).

Most of the other items in the configuration file have reasonable defaults; however, the multicast time-to-live (TTL) defaults to a value of 1. It should be set (via the mcast_ttl parameter) to one greater than the number of hops (routers) between the nodes. If all nodes are on a single switch, the default value of one is fine. If routers must be traversed, they must be configured to pass multicast traffic.

The meta daemon, installed only on a central server or the master node, has a configuration file (/etc/gmetad.conf) that must contain a data_source entry for each cluster being monitored. Each data_source entry looks like this:

data_source "my cluster" [polling interval] address1:port address2:port …

where “my cluster” is the name of the cluster, the polling interval is optional, and each address:port entry names a host by hostname or IP address. The :port part may be omitted unless gmond is running on a port other than 8649 on the specified host. Additional information can be specified in the configuration file, including a list of trusted hosts to which gmetad can provide XML reports. localhost is always trusted.
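To make the data_source syntax concrete, here is a small Python sketch that takes such a line apart. It is not how gmetad reads its configuration; the function name is hypothetical, the polling interval is assumed to be a whole number, and IPv6 addresses are not handled.

```python
import shlex

DEFAULT_GMOND_PORT = 8649  # the gmond port named in the article


def parse_data_source(line):
    """Split a data_source line into (name, interval, [(host, port), ...])."""
    tokens = shlex.split(line)  # honors the quotes around the cluster name
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line: %r" % line)
    name, rest = tokens[1], tokens[2:]
    interval = None
    if rest[0].isdigit():  # the optional polling interval comes first
        interval = int(rest.pop(0))
    if not rest:
        raise ValueError("data_source needs at least one address")
    sources = []
    for item in rest:
        host, sep, port = item.partition(":")
        # Fall back to the default port when no :port suffix is given.
        sources.append((host, int(port) if sep else DEFAULT_GMOND_PORT))
    return name, interval, sources
```

For example, parse_data_source('data_source "Penguins" sci1-1') yields the cluster name, no interval, and a single source on port 8649. Listing more than one address per cluster is worthwhile, since it gives the meta daemon another node to ask if the first is down.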

Once installed, gmond can be tested from any node by running telnet, connecting to the gmond port:

% telnet localhost 8649

That command should generate a page full of XML. Likewise, gmetad can be tested on the master node by running:

% telnet localhost 8651
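The same check can be scripted. The Python sketch below connects to a daemon, reads the XML until the connection closes, and pulls out each host's one-minute load. The element and attribute names used here (HOST, METRIC, NAME, VAL) and the metric name load_one are assumptions about the XML layout, so compare them against the output of the telnet test before relying on them. The function names are made up for this example.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML report a daemon sends on connect, until it hangs up."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # an empty read means the daemon closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


def load_one_by_host(xml_bytes):
    """Map host name -> one-minute load average (element names assumed)."""
    root = ET.fromstring(xml_bytes)
    loads = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                loads[host.get("NAME")] = float(metric.get("VAL"))
    return loads
```

On a node running gmond, load_one_by_host(fetch_xml()) should return a dictionary with one entry per host. Passing port=8651 on the master node queries gmetad instead.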

The gstat command can be used to display node state information. gstat queries the local gmond (by default), and displays the number of CPUs on each node; the number of running and total processes; the one minute, five minute, and fifteen minute load averages; and the user, nice, system, and idle percentages for CPUs as shown in Figure One.

Figure One: Output from gstat

[forrest@sci1-1 forrest]$ gstat -a
Name: Penguins
Hosts: 12
Gexec Hosts: 0
Dead Hosts: 0
Localtime: Thu May 29 22:44:54 2003

Hostname LOAD CPU
CPUs (Procs/Total) [ 1, 5, 15min] [ User, Nice, System, Idle]

2 ( 1/ 79) [ 1.00, 1.00, 0.99] [ 48.7, 0.0, 1.4, 49.9] OFF
2 ( 2/ 83) [ 1.99, 1.97, 1.91] [ 92.7, 0.0, 7.3, 0.0] OFF
2 ( 2/ 83) [ 1.99, 1.97, 1.93] [ 93.0, 0.0, 6.2, 0.0] OFF
2 ( 2/ 82) [ 1.99, 1.98, 1.91] [ 96.4, 0.0, 3.7, 0.1] OFF
2 ( 2/ 82) [ 1.99, 1.97, 1.91] [ 95.8, 0.0, 4.2, 0.0] OFF
2 ( 2/ 84) [ 1.99, 1.97, 1.91] [ 96.5, 0.0, 5.3, -0.1] OFF
2 ( 2/ 84) [ 1.99, 1.98, 1.92] [ 96.4, 0.0, 3.6, 0.0] OFF
2 ( 2/ 82) [ 2.00, 2.00, 1.93] [ 95.2, 0.0, 4.1, -0.0] OFF
2 ( 2/ 64) [ 2.00, 2.00, 2.00] [ 84.5, 0.0, 16.8, 0.0] OFF
2 ( 2/ 82) [ 2.00, 2.00, 1.93] [ 95.1, 0.0, 4.1, 0.1] OFF
2 ( 2/ 80) [ 2.00, 2.00, 2.00] [ 93.1, 0.0, 7.0, 0.0] OFF
2 ( 3/ 178) [ 2.72, 2.56, 2.46] [ 96.2, 0.0, 2.3, 0.1] OFF

The gstat command can also generate a current, dynamic load-balanced list of machines for use with MPI (the Message Passing Interface) if gexec is enabled.

The RPM for the web front-end should be installed on the Linux box hosting gmetad. The front-end consists of a number of PHP scripts — installed in /var/www/html/ganglia-webfrontend/ by default — that query gmetad and display information and graphics over the web. Apache should be running on this server, and the PHP4 module should be enabled by ensuring that the following lines are contained and uncommented in /etc/httpd/conf/httpd.conf:

<IfDefine HAVE_PHP4>
LoadModule php4_module
AddModule mod_php4.c
AddType application/x-httpd-php .php .php4 .php3 .phtml
AddType application/x-httpd-php-source .phps

Figure Two: The upper half of the physical view from the web front-end.

Monitoring Metrics and Web Graphics

Figure Two shows the upper half of the physical (or cluster level) view of metrics from the Ganglia web front-end. The time and date for the report are shown at the top, along with a button labeled “Get Fresh Data” for obtaining instant updates. By default, the Overview section describes the state of the cluster (i.e., 12 nodes, 24 CPUs, no nodes down), and shows graphs of the distribution of CPU load for the last hour, the current and last hour of load averages, and the memory utilization for the last hour. The time period for display can be changed at the top of the page.

From the graphs displayed, one can determine that the cluster had about 70% CPU utilization until around 20:50 when additional jobs were started. The new jobs increased memory utilization, but had little or no effect on the small amount of memory swapped across the entire cluster. At the time of this report, CPU utilization had climbed so that the overall one minute load average slightly exceeded the total number of CPUs.

Next, the Snapshot section shows each node in the cluster, color coded by its CPU utilization. The legend for these colors is shown in Table One. In Figure Two, three nodes are over 100% utilization, eight are at 50-74% utilization, and one is at 25-49% utilization. The next section, labeled “Penguins load one,” displays a graph of the metric selected in the pull-down menu at the top of the page for each node for the selected time period sorted as selected at the top.

Table One: Ganglia node image legend

Color        Meaning
Red          Over 100% Utilization
Orange       75-100% Utilization
Yellow       50-74% Utilization
Green        25-49% Utilization
Blue         0-24% Utilization
Crossbones   The node is dead or has not reported in last 60 seconds

Utilization is: (1 min load) / (number of CPUs) * 100%
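The utilization formula and the legend in Table One translate directly into code. The Python sketch below is an illustration rather than the front-end's own PHP logic. The function names are hypothetical, and since the table gives whole-number bands, treating 75, 50, and 25 as lower cut-offs for fractional values is an assumption. Dead nodes (the crossbones case) are not handled.

```python
def utilization(load_one, num_cpus):
    """Percent utilization: (1 min load) / (number of CPUs) * 100."""
    return load_one / num_cpus * 100.0


def node_color(load_one, num_cpus):
    """Pick the Table One color for a live node."""
    pct = utilization(load_one, num_cpus)
    if pct > 100:
        return "Red"
    if pct >= 75:
        return "Orange"
    if pct >= 50:
        return "Yellow"
    if pct >= 25:
        return "Green"
    return "Blue"
```

Taking two rows from the gstat output in Figure One: a two-CPU node with a one-minute load of 1.00 works out to 50% and is drawn in yellow, while the node with a load of 2.72 is at roughly 136% and is drawn in red.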

Figure Three shows the bottom half of the same page shown in Figure Two. It contains twelve graphs of the one minute load average for the last hour, one graph for each node. The sci1-1 node is (and has been) quite busy. New jobs were recently started on sci2-1, sci1-3, sci2-2, sci3-1, sci3-2, and sci3-3. The nodes sci1-2, sci3-4, sci2-3, and sci1-4 have maintained a load average around two, while sci2-4 has a load average around one.

Figure Three: Screenshot of the lower half of the physical view.

Figure Four: Screenshots of the node view for sci3-1.

On this physical view web page, you can click on either one of the color-coded PCs or on one of the node graphs to display state information about that node. For example, if the load average graph for sci3-1 is selected, a new page is generated containing specific information about that node, as shown in Figure Four. This information includes the time of last boot and uptime, the operating system release, the number and speed of the CPUs, the total memory, and so on.

Graphs of the load average, CPU utilization, and memory utilization (and distribution) are shown at the top of the page. Below that are graphs of “volatile metrics,” a wide variety of parameters that are monitored on every node. On this node, the load average was essentially one until a second job was started around 20:55, causing the load average to climb to two.

A Must Have

Ganglia is a very capable and powerful toolkit for monitoring anything from a small cluster of compute nodes to a large grid of clusters connected via the Internet. The information it provides, particularly when displayed on dynamic web pages, makes it a tool no cluster should be without.

Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@climate.ornl.gov.
