Monitoring Network Services with NetSaint:

It can be a pain to pinpoint trouble spots on your network. Here's how to simplify things.


Like many people, you might not pay as much attention to your car’s fuel gauge as you should. You might only notice the gauge when the gas tank is nearly empty. Some new cars, though, have a nice feature: a dashboard indicator light that reads “Low Fuel” when there is about an eighth of a tank left. This strategy is much better than, say, the “Oil” light; by the time that light comes on, the car is already in trouble.

Network monitoring has tended to be more like the oil light than the low fuel light. Too often, the first indication of a problem is some failure: a disk partition has filled, a process has crashed, or a critical server has lost its Internet connection. And like a neglected car, a failure in your network inevitably requires an immediate and often difficult, costly, or time-consuming repair.

So, how’s your network right now?

That’s probably a tough question to answer. And no wonder. Even the smallest enterprise is likely to depend on dozens of critical services: domain name service, sendmail, POP servers, FTP, Web servers, file servers, printer spoolers, database servers, VPN servers, and a collection of hardware tuned for specialized tasks. While it is possible to monitor your computing resources and services — just as you can always check your car’s oil level — doing so is something that can slip through the cracks in a busy schedule. Keeping track of all that software (and hardware) becomes an increasingly daunting task as the number of computers, devices, and services you’re responsible for increases.

Standard Linux distributions provide little relief. The small number of monitoring utilities that are included — commands like uptime (which reports load averages), vmstat (which reports memory and CPU statistics), and ping, traceroute, and netstat (the latter three test network connectivity) — are generally limited to examining a single machine and its own network connectivity. Fortunately, a class of utilities called network monitors offers a solution.

Network monitors actively track and report the status of groups of computers and other network devices: printers, routers, and so on. Most network monitors will keep you well-informed with a Web interface that is updated frequently. In some cases, you can even track performance data. (Packages that monitor traffic on your network are also referred to as network monitors, but they aren’t covered in this article.)

Most importantly, network monitors keep a vigilant watch over each resource, looking for problems: a system or service that’s become unusable (e.g., basic connectivity tests fail) or the value of some metric has moved outside its acceptable range (e.g., the load average on a computer system rises above some preset level, indicating that CPU resources are becoming scarce). In these situations and others, the network monitor will notify you (via e-mail or with some kind of visual indication) about the potential problem, allowing you to intervene before the situation becomes critical. The most sophisticated programs can also begin fixing some problems as soon as they are detected.

In this article we’ll look at one popular monitoring package called NetSaint (http://netsaint.org), written by Ethan Galstad. NetSaint is designed to run under Linux, but should work on other UNIX variants, too. NetSaint can run as a normal process or as a daemon, and network status can be conveniently viewed through any Web browser. You can easily extend the features of NetSaint and can customize its reports to match the topology of your network.

The (Net) Saint

NetSaint is a full-featured network monitoring package that provides information about system and service status across an entire network. Run as a daemon, NetSaint periodically probes devices and other daemons, gathers data, and summarizes the current state of your network in a number of customizable system status reports.

NetSaint itself is simply a control program; all of the actual system and service monitoring is performed using external plug-ins. Hence, it’s easy to extend NetSaint’s capabilities by adding your own plug-ins. NetSaint can also be configured to send alerts and perform other actions when problems are detected (see “Network Monitoring Solutions” for other tools that you can use).

Installing NetSaint is straightforward. Choose a machine to host NetSaint and download the free software. RPMs are available for many popular Linux distributions and recent SuSE Linux distributions include NetSaint (although it installs the package in a non-standard location). Like most network monitoring packages, NetSaint has several prerequisites (including MySQL and a Web server); be sure to read the “What’s New” section of the NetSaint documentation before you install anything.

These are the most important NetSaint components:

  • The netsaint daemon, which continually collects data, updates displays, and generates and handles alerts. The daemon is usually started at boot time by installing the netsaint startup script in /etc/init.d/ and linking to it from the appropriate runlevel directories.
  • Plug-in programs, which perform the actual device and resource probing. You can create your own plug-ins or re-use those written by others. In any case, you’ll need NetSaint plug-ins before you can actually monitor anything.
  • Configuration files, which define the devices and services you want to monitor.
  • CGI programs, which support Web access to the displays.
Figure One: The NetSaint Network Monitor

Figure One displays NetSaint’s Notifications screen. It provides summary information about the current state of everything being monitored. In this case, we’re monitoring five hosts, of which four have some kind of problem. The screen shows the host, service, and state of the service. Some services have many entries — more than one person can be notified in case of failures. The display shows an abnormally high number of failures to make the discussion more interesting.

As you can see from the figure, NetSaint provides links within each table to more detailed information. A report for a particular host, for example, might indicate that a Microsoft Windows system was down for over two hours in a five-hour period. A system administrator can provide a comment, such as “swapped out disk drive,” to explain any unusual circumstances. The report also displays information about the host’s recent history and its current monitoring configuration.

Figure One also shows the NetSaint menu bar in the window’s leftmost frame. The items listed under “Monitoring” select various status displays. “Status Detail” displays overall status in tabular form; “Status Overview” shows a breakdown of host and service status by the devices’ physical location (the accounting department, for instance) or by how they’re used (printers, for instance). In this way, the location of trouble can be determined quickly.

Figure Two: NetSaint’s Host Menu

NetSaint’s Host Commands menu, shown in Figure Two, lets the administrator change numerous aspects of a host’s monitoring configuration, including enabling and disabling monitoring and/or alert notifications, adding and modifying scheduled downtime for the host (during which monitoring ceases and alerts are not sent), and forcing all defined checks to be run immediately (rather than waiting for their next scheduled instance).

The second menu item allows you to acknowledge any current problem. Acknowledging a problem simply means “I know about the problem, and it’s being handled.” NetSaint marks the corresponding event as such, and future alerts are suppressed until the item returns to its normal state. This process also allows you to enter a comment explaining the situation, which is helpful when more than one administrator is examining the monitoring data.

Canonize Your Own NetSaint

Configuring NetSaint can seem daunting at first, but it’s actually quite straightforward once you understand all the pieces. It has several configuration files:

  • The netsaint.cfg file configures the NetSaint daemon: you can specify parameters such as the directory locations for the package’s various components, the user and group ids for the netsaint daemon, which items to log, log file rotation settings, timeouts and other performance-related settings. The file also contains additional settings for advanced features like event handling and notifications.
  • The commands.cfg and hosts.cfg files define the commands that will test the various hosts and network services being monitored. By convention, commands.cfg defines commands and hosts.cfg defines hosts and services.
  • The nscgi.cfg file configures the NetSaint displays, including paths to Web page items and scripts, and per-item icon image and sound selections. It also authorizes access to NetSaint’s data and commands.
  • The resource.cfg file defines macros that may be used within other settings for clarity and security purposes (e.g., to hide passwords from view).

NetSaint has many nice features. First of all, it can save data between runs (this is the default configuration) so you can analyze trends over time (see the accompanying article “Identifying and Analyzing Trends,” pg. 29). You can also specify whether or not to display the saved status information when the NetSaint page is first opened. The following netsaint.cfg entries control the latter feature:

 retain_state_information=1
 retention_update_interval=60
 use_retained_program_state=1

You can also save the data produced by the status commands for future use outside of NetSaint, using these netsaint.cfg entries:

 process_performance_data=1
 service_perfdata_command=commandname

The arbitrary commandname specified in the second entry is defined in commands.cfg. A typical entry might write the status commands’ output to an external file:

 command[commandname]=/bin/echo $OUTPUT$ >>filename 

The macro $OUTPUT$ is replaced with the entire output returned by the status commands. Once captured, the data in the external file can be sent to a database or processed in any way that you like. We’ll talk more about the syntax of commands.cfg shortly.

The NetSaint daemon can also be configured to accept data from outside sources. In this passive mode, the remote device performs a check and writes the result in a predetermined format to a file on the NetSaint host machine. This mode may be enabled by inserting the following line into the netsaint.cfg file:


Next, you might want to take a look at the nscgi.cfg file, which defines who can access NetSaint and what they can do. Figure Three shows some sample entries from that file. The first entry enables the access control mechanism. The next two entries specify the users who are allowed to view NetSaint’s configuration information and the status of the services NetSaint is monitoring. All users must also be authenticated by the Web server using the Apache htpasswd mechanism.

Figure Three: Sample nscgi.cfg Entries

 use_authentication=1
 authorized_for_configuration_information=netsaintadmin,root,chavez
 authorized_for_all_services=netsaintadmin,root,chavez,maresca
 hostextinfo[bagel]=;redhat.gif;;redhat.gd2;;168,36;,,;

The final entry in Figure Three specifies extended attributes for the host named bagel. The file names in this example specify image files for the host in the status tables (GIF format) and in the status map (GD2 format); the two numeric values specify the device’s location within the status map (consult the man page for complete details). NetSaint’s status maps provide a quick way of accessing information about individual devices.

Figure Four: A NetSaint Status Map

A sample status map is displayed in Figure Four. The map was created with David Kmoch’s saintmap utility (http://netsaint.org/download), which provides a convenient way to create status maps. In this case, we’ve grouped devices by their physical location (although we haven’t bothered to label the groups). The lines from taurus to each device in the bottom group indicate that taurus is the gateway between our NetSaint host and this location. When the map is used by NetSaint, each icon will have a status indication (up or down) added to it, enabling an administrator to get an overall view of things right away, even when the network is very large and complex.

NetSaint Nuts and Bolts

Let’s look in more detail at the syntax of two important NetSaint files, commands.cfg and hosts.cfg. By convention, commands.cfg defines monitoring tasks and associated commands, whereas hosts.cfg defines hosts and services.

Figure Five shows a sample commands.cfg file. It creates two monitoring tasks, one named do_ping and another called check_telnet. Figure Six lists a sample hosts.cfg file; it defines two hosts, ishtar and callisto. Finally, Figure Seven shows some sample services.

Figure Five: Sample Command Definitions

 # command entries define a monitoring task and its associated command.
 # command entries are also used to define commands for other purposes,
 # like sending alerts and event handlers.
 command[do_ping]=/bin/ping -c 1 $HOSTADDRESS$
 command[check_telnet]=/usr/local/netsaint/libexec/check_tcp -H $HOSTADDRESS$ -p 23

Figure Six: Sample Host Definitions

 # host: define a host/device to be monitored.
 # hostgroup: create a list of hosts to be grouped together in displays.
 # service: define an item on a host/device to be checked periodically.
 # contact: specify a list of recipients for alerts.
 # timeperiod: assign a name to a specified time period.
 #
 # host[label]=description;IP address;parent;check command;;;;;;;
 host[ishtar]=ishtar (with printer);;taurus;check-printer-alive;10;120;24x7;1;1;1;
 host[callisto]=callisto;;;check-host-alive;10;120;24x7;1;1;1;

Figure Seven: Sample Service Definitions

 # service[host-label]=service-label;; when;;;; notify;;;;;; check-command
 service[callisto]=TELNET;0;24x7;4;5;1;admins;960;24x7;0;0;0;;check_telnet
 service[callisto]=PROCS;0;24x7;4;5;1;admins;960;24x7;0;0;0;;snmp_nproc!commune!250!400
 service[ingres]=HPJD;0;24x7;4;5;1;locals;960;24x7;0;0;0;;check_hpjd

Note that a service combines a host and a command. To use NetSaint, you define your hosts, create a pool of monitoring tasks, and then tie tasks to a host. For example, host[callisto] defines a device named “callisto”; command[check_telnet] defines a command called “check_telnet”; they are tied together in a definition like service[callisto]=TELNET … check_telnet which tests the telnet capabilities of the device named callisto using the command check_telnet. Let’s look at the contents of each file in more detail.

Creating Commands

The first command entry in Figure Five defines a command called do_ping that runs the ping utility to send a single ICMP Echo packet to an IP address. The IP address is not defined; instead the built-in macro $HOSTADDRESS$ will be replaced with the IP address of the host that we want to ping.

The second entry in Figure Five defines a check_telnet command, which runs the NetSaint plug-in named check_tcp; check_tcp attempts to connect to the TCP port specified by -p on the indicated host. In this case, it’s port 23, the standard Telnet port. Again, this entry uses $HOSTADDRESS$ instead of a fixed IP address.

In addition to these macros, command entries can also use arguments to parameterize commands. Here’s an example:

 command[check_tcp]=/usr/local/netsaint/libexec/check_tcp -H $HOSTADDRESS$ -p $ARG1$

This entry defines the check_tcp command differently than the one shown in Figure Five. If this check_tcp is used in a service entry, NetSaint calls the same plug-in as in the earlier check_telnet command, but uses the first argument passed in the service entry as the desired port number (set by the -p flag). Macros of the form $ARGn$ allow you to use any number of arguments in a command.
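With this parameterized definition, a service entry supplies the port number as an argument, separated by an exclamation mark. For instance, a hypothetical entry to check SSH on callisto (following the field layout used in Figure Seven) might read:

```
service[callisto]=SSH;0;24x7;4;5;1;admins;960;24x7;0;0;0;;check_tcp!22
```

Here 22 replaces $ARG1$, so the same check_tcp command can test any TCP service.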

Many NetSaint plug-ins also use the -w and -c options to define value ranges that should generate warning alerts and critical alerts, respectively. Somewhat counterintuitively, the argument to these options specifies the range which should not be considered problematic.

For example, the following entry defines the command snmp_load5, setting the warning level to values over 150, and the critical level to over 300:

 command[snmp_load5]=/usr/local/netsaint/libexec/check_snmp -H $HOSTADDRESS$ -C $ARG1$ -o . -w 0:150 -c :300 -l load5

This entry calls the check_snmp command for the current host, using the first command argument as the SNMP community name (the -C flag), and retrieves the 5-minute load average value (multiplied by 100, as specified by the SNMP label in the -o option), labeling the data as load5. The value will trigger a warning alert if it is over 150; -w 0:150 means that values from 0 to 150 inclusive are not in the warning range. It will also trigger a critical alert if it is over 300, i.e., not in the range 0 (used as the default if there’s no value before the colon) to 300 inclusive. If a value would trigger both a warning and a critical alert, only the critical alert is issued.
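Since the inverted range logic trips people up, here is a small standalone shell sketch (our illustration, not NetSaint code) of how a min:max range is interpreted:

```shell
#!/bin/sh
# A value is OK when it falls inside the min:max range, and triggers
# an alert when it falls outside. An empty min (as in ":300") defaults to 0.
check_range() {
  value=$1; range=$2
  min=${range%%:*}; max=${range##*:}
  [ -z "$min" ] && min=0
  if [ "$value" -ge "$min" ] && [ "$value" -le "$max" ]; then
    echo OK
  else
    echo ALERT
  fi
}
check_range 120 0:150   # → OK     (inside the non-problematic range)
check_range 200 0:150   # → ALERT  (outside 0:150, so a warning fires)
check_range 350 :300    # → ALERT  (outside 0:300, so a critical alert fires)
```

The same pattern applies to any plug-in that accepts -w and -c ranges.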

Defining Hosts

Figure Six illustrates the definitions for hosts. Let’s take these entries apart, field by field. Individual fields are separated by semicolons.

The first field is the most complicated and has the following syntax: host[label]=description, where label is the label to be used in status displays and description is a (possibly longer) phrase describing the device.

The second field holds the device’s IP address. This field actually identifies the desired device (the preceding items are just arbitrary labels).

The third field specifies the parent device for the item: a list of one or more labels for intermediate devices located between the current system and this one. For example, to reach ishtar, we must go through the device named taurus, and so taurus is specified as its parent. The parent is optional, and the entry for callisto does not use it.

The fourth field specifies the command NetSaint should use to determine whether the host is accessible (“alive”) or not. (The two used here, check-printer-alive and check-host-alive come with the standard NetSaint package.)

The fifth field indicates how many checks must fail before the host is assumed to be down (10 in our examples).

The remaining fields in the example entries are used to customize alert notifications.

Sixth field: the time interval between alerts when a host remains down, in minutes (here, 120 minutes or 2 hours).

Seventh field: the time period during which alerts should be sent. The time period is defined elsewhere in the configuration file as a timeperiod entry. This one, named “24x7,” is pre-defined to mean “all the time.” It’s a convenient choice when you are getting started using NetSaint.

Eighth field: a flag (0 means no; 1 means yes) whether to notify when the host recovers after being down.

Ninth field: flag indicating whether to notify when the host goes down.

Tenth field: flag indicating whether to notify when the host is unreachable due to a failure of an intermediate device.

All the flags are set to yes in our examples.
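Mapping those fields onto the ishtar entry from Figure Six, the layout looks like this (the comment lines are our annotation, not NetSaint syntax):

```
# host[label]=description;IP-address;parent;alive-check;max-failures;
#             alert-interval;alert-period;recovery-flag;down-flag;unreachable-flag;
host[ishtar]=ishtar (with printer);;taurus;check-printer-alive;10;120;24x7;1;1;1;
```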

Monitoring Services

Now that we have defined entries for both commands and hosts, we are ready to define specific items that NetSaint should monitor. These items are known as “services” and are created using the service entry. Examples are shown in Figure Seven.

The most important fields in these entries are the first, third, seventh, and final ones, which hold the following settings:

  • The service definition (field 1), using the syntax service[host-label]=service-label. For example, the first entry in Figure Seven defines a service named TELNET for the host entry labeled callisto.
  • The name of the time period during which this check should be performed (field 3), again defined in a timeperiod entry.
  • The contact name (field 7): this item holds the name of a contact entry defined elsewhere in the file. The contact entry type is used to specify lists of users to be contacted when alerts are generated.
  • The command to run to perform the check (final field), defined via a command entry elsewhere in the configuration file. Arguments to the command are specified as exclamation-mark-separated subfields with the command (as in the PROCS example service). These will be passed through to the $ARGn$ macros in the command definitions.

The other fields hold the volatility flag (field 2), the maximum number of checks before a service is considered down (4), the number of minutes between normal checks (5) and failure rechecks (6), the number of minutes between failure alerts while the service remains down (8), the time period during which to send alerts (9), and three alert flags (10-12) indicating whether to send notifications for service recovery, critical alerts, and warning alerts, respectively. The next-to-last field (13) holds the command name of the event handler for this service (see below); no event handler is specified in these cases. The default values, used in the examples, are good starting points, and these fields are further described in the NetSaint documentation.
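The time periods named in these fields (such as 24x7) are defined with timeperiod entries elsewhere in the configuration file. A sketch of one, with one time range per day of the week (Sunday through Saturday); consult the NetSaint documentation for the exact field layout:

```
timeperiod[24x7]=24 Hours a Day, 7 Days a Week;00:00-24:00;00:00-24:00;00:00-24:00;00:00-24:00;00:00-24:00;00:00-24:00;00:00-24:00
```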

NetSaint displays can summarize status information for a group of devices. You specify this by defining a hostgroup. For example, the following configuration file entry defines the Printers group:

 hostgroup[Printers]=Printers;locals;ingres,lomein,turtle,catprt

The syntax is simple: hostgroup[group-label]=description; contact-group;list-of-host-labels. Keep in mind that the host labels refer to the names of host definitions within the NetSaint configuration file and not necessarily to literal hostnames. The members of the specified contact group will be notified whenever there is a problem with any device in the list.

Creating Event Handlers and Alerts

In addition to sending alert messages, NetSaint also provides support for event handlers: commands to be performed when a service check fails. In this way, you can have NetSaint begin to deal with a problem before you even know about it. Figure Eight shows the entries which correspond to a simple event handler.

Figure Eight: Sample Event Handler

 # event handler for disk full failures
 command[clean]=/usr/local/netsaint/local/clean $STATETYPE$
 service[beulah]=DISK;0;24x7;4;5;1;locals;960;24x7;0;0;0;clean;check_disk!/!15!5

First, we define a command named clean which specifies a script to run. Its sole argument is the value of the $STATETYPE$ NetSaint macro, which is set to “HARD” for critical failures and “SOFT” for warnings. The clean command is then specified as the event handler for the DISK service on beulah. (In this example, the /usr/local/netsaint/local/clean script is one we custom-wrote to use the find command to delete junk files within the root filesystem and use its argument’s value to decide how aggressive to be.)

The three arguments to check_disk are the file system to check (/), the amount of free space there must be in order to not trigger a warning (15, so a warning is sent if the file system is more than 85% full), and the amount to not trigger a critical alert (5, meaning a critical alert is sent if the file system is more than 95% full).
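The clean script itself is site-specific, but a minimal sketch along the lines described above might look like this (the file patterns and age thresholds are our assumptions, not NetSaint's):

```shell
#!/bin/sh
# clean_junk STATE DIR: list junk files to remove under DIR, more
# aggressively when STATE is HARD (a critical failure) than when SOFT.
clean_junk() {
  state=$1; dir=$2
  case "$state" in
    HARD) age=1 ;;   # critical: target anything older than a day
    *)    age=7 ;;   # warning: only target week-old files
  esac
  # Print candidates; a real handler would delete them with -exec rm -f {} \;
  find "$dir" -xdev \( -name core -o -name '*.tmp' \) -mtime +"$age" -print
}
```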

Plan Ahead

Network monitoring software can be a powerful tool for keeping track of system status, both in real-time and over the long haul. However, don’t underestimate the time it will take to implement a monitoring strategy in the real world. Careful planning can minimize the amount of time that it will require, but it’s always a big job. You need to consider not only the installation and configuration issues but also the performance impact on your network and the security ramifications of the daemons and protocols you’re enabling. While this can be a daunting task that cannot be rushed, in the end it’s worth the effort.

Network Monitoring Solutions

There’s no shortage of packages that provide more complex monitoring and event-handling capabilities. While these packages can be very powerful tools for information gathering, their installation and configuration complexity typically grows as the number of features does.


There are many Open Source monitoring programs and projects, including OpenNMS (http://www.opennms.com), Sean MacGuire’s Big Brother (http://www.bb4.com), and Thomas Aeby’s Big Sister (http://bigsister.graeff.com). A more basic Open Source program, the Angel Network Monitor by Marco Paganini (http://www.paganini.net/angel), is also available.


If you can afford it, there’s nothing like a full-scale commercial network monitoring package for enterprise-wide monitoring. The amount of time to set up and learn to use one tends to be less than that of an Open Source solution. These days, all major commercial products can monitor Linux systems, and some even allow a Linux system to be the master monitoring station.

You should expect most or all of the following features from a commercial network monitoring product:

  • Excellent performance, even for large networks. One disadvantage of the Open Source packages is that they generally rely on scripts for data collection (at least in part), an approach that tends to not scale well beyond a certain point (reached by a medium-sized network). Commercial programs can use the performance advantages of compiled code (as opposed to interpreted scripts).
  • Device autodetection: commercial packages will search for and detect new devices that appear on the network and automatically add them to the set being monitored.
  • Support for heterogeneous networks, including mixed UNIX and Windows environments. Some of the Open Source packages also support such environments.
  • Multiple server-based monitoring, designed to distribute the overhead load somewhat and also prevent a single point of failure.
  • More sophisticated authentication techniques and data protection.
  • Built-in report generation and graphing facilities.

The best known of the commercial products are OpenView from Hewlett-Packard, Patrol from BMC, Tivoli from IBM, and Unicenter from Computer Associates; Unicenter won the “Best in Show” award at the January 2002 LinuxWorld Expo.

Identifying and Analyzing Networking Trends

NetSaint is very good about providing up-to-the-minute status information, but there are also times when it’s helpful to compare current data with past values. This is essential for performance tuning and capacity planning.

One of the best-known packages of this type is the Multi-Router Traffic Grapher (MRTG, http://www.mrtg.org), written by Tobi Oetiker and Dave Rand. It collects data and automatically produces graphs of these values over various time periods. As its name suggests, it was designed to track the performance of the routers in a network, but can be used for a wide variety of data — in fact, for any value that can be tracked over time.

When a new data point is added, the oldest is deleted from storage, and the result is referred to as “round-robin data” (RRD). Summary values are also stored over various time periods. This strategy results in small, fixed-size databases that nevertheless offer a wealth of important information.

More recently, MRTG has been supplanted by Oetiker’s newer package, RRDtool (http://rrdtool.eu.org), which has much more powerful and configurable graphing facilities. It does, however, require a separate data collection script or package, such as the popular Cricket (http://cricket.sourceforge.net).

Figure A shows a command to create a database named ping.rrd consisting of two fields, trip and lost, which are defined by the two data set (DS) lines. These fields hold the round-trip travel time and the percentage of lost packets resulting from running the ping command. Both fields are of type GAUGE, meaning that each value is stored as-is, as an independent reading. Alternatives to the GAUGE type are referred to as counters, and their values are interpreted as changes with respect to the preceding value; they include COUNTER for monotonically increasing data and DERIVE for data which can vary up or down.
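The distinction matters when you feed the database: a counter's stored rate is the change since the previous sample divided by the sampling interval. A quick back-of-the-envelope sketch in shell arithmetic:

```shell
# Two successive readings of a monotonically increasing COUNTER,
# taken 300 seconds apart, are stored as a per-second rate:
prev=1000; curr=16000; step=300
echo $(( (curr - prev) / step ))   # → 50 (units per second)
```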

Figure A: Creating a Database with RRDtool

 rrdtool create ping.rrd \
     --step 300 \
     DS:trip:GAUGE:600:U:U \
     DS:lost:GAUGE:600:U:U \
     RRA:AVERAGE:0.5:1:600 \
     RRA:AVERAGE:0.5:6:700 \
     RRA:AVERAGE:0.5:24:775 \
     RRA:AVERAGE:0.5:288:797 \
     RRA:MAX:0.5:1:600 \
     RRA:MAX:0.5:6:700 \
     RRA:MAX:0.5:24:775 \
     RRA:MAX:0.5:288:797

The fourth field in each DS line is the heartbeat: the maximum number of seconds that may pass between data samples before the value is recorded as unknown (here 600 seconds, or 10 minutes). The final two fields hold the valid range of the data. A setting of U stands for “unknown”; specifying it for both bounds allows the data itself to define the valid range (i.e., accept any value).

The remaining lines of the command, labeled RRA, create round-robin archive data within the database. The second field indicates the kind of aggregate value to compute: here, we compute averages and maximums. The remaining fields specify the maximum percentage of the required sampled data that can be missing (50%), the number of raw values to combine and the number of data points of this type to store.

Those final two fields can be confusing at first. For example, values of 6 and 700 mean that the average (or other function) of six raw values will be computed, and the most recent 700 averages will be saved. The default time period between data points is 300 seconds (this can be changed with the --step option). Thus, the second RRA line will calculate a 30-minute average value (6 * 5 minutes), and we will have them going back for 350 hours (700 * 6 * 5 minutes).

Thus, in our example database, we are creating five-minute averages and maximums, 30-minute averages and maximums, 2-hour averages and maximums (24 * 5 minutes) and daily averages and maximums (288 * 5 minutes = 1 day). We will have five-minute data going back for 50 hours (600 * 5 minutes), 30-minute data for 350 hours (700 * .5 hours), 2-hour data for 1,550 hours (775 * 2 hours), and daily data for over 2 years (797 days).
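These retention figures follow directly from the RRA fields; shell arithmetic makes the bookkeeping easy to verify:

```shell
# five-minute data: 600 points at 5 minutes each, expressed in hours
echo $(( 600 * 5 / 60 ))              # → 50
# 30-minute averages: 700 points, each covering 6 samples of 5 minutes
echo $(( 700 * 6 * 5 / 60 ))          # → 350
# daily data: 797 points, each covering 288 samples of 5 minutes
echo $(( 797 * 288 * 5 / (60 * 24) )) # → 797
```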

There are many ways to add data to an RRDtool database. Figure B shows a script illustrating one of the simplest: using the update argument.

Figure B: Adding Data Using RRDtool

 #!/bin/csh
 ping -w 30 -c 10 $1 > /tmp/ping_$1
 set trip=`tail -1 /tmp/ping_$1 | awk -F= '{print $2}' | awk -F/ '{print $2}'`
 set lost=`grep transmitted /tmp/ping_$1 | awk -F, '{print $3}' | awk -F% '{print $1}'`
 rm -f /tmp/ping_$1
 rrdtool update ping.rrd "N:"$trip":"$lost

After the ping command generates the data, the next two lines take apart the output and store the results in two variables. The command rrdtool update then adds it into the database. The final argument to the rrdtool command is a colon-separated list of data values, beginning with the time to be associated with the data (N means “now”), followed by the value for each defined data field, in order.
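To see how the parsing in Figure B works, here is the trip extraction applied to a canned summary line (the exact format of ping's output varies between implementations, so treat this as an illustration):

```shell
#!/bin/sh
# The last line of a Linux ping run summarizes round-trip times;
# splitting on "=" and then "/" isolates the average (second) value.
line='rtt min/avg/max/mdev = 0.041/0.052/0.063/0.011 ms'
trip=$(echo "$line" | awk -F= '{print $2}' | awk -F/ '{print $2}')
echo "$trip"   # → 0.052
```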

In this case, we used normal UNIX commands to obtain the data we needed, but we could also have used SNMP as the source. For more on SNMP, the Simple Network Management Protocol, see “Network Device Interrogation” in the November 2001 issue (available online at http://www.linux-mag.com/2001-11/snmp_01.html) and in this month’s Guru Guidance column (available online at http://www.linux-mag.com/2002-05/guru_01.html).

Once we’ve accumulated data for a while, we can begin to create graphs. We’ll again use RRDtool. For example, the following command creates a simple graph of the data:

 rrdtool graph ping.gif \
     --title "Packet Trip Times" \
     DEF:time=ping.rrd:trip:AVERAGE \
     LINE2:time\#0000FF
Figure C: Creating a Chart With RRDtool

This command defines a graph of a single value, specified via the DEF (definition) line. The graphed variable is named time and it comes from the stored averages of the trip field in the ping.rrd database (raw values cannot be graphed). The LINE2 command is what actually graphs the values. LINE2 refers to a fever chart of the defined variable (time), displayed in the color corresponding to the RGB value #0000FF (blue). The result is seen in Figure C. The backslash before the number sign is required to protect it from the shell, but is not part of the command syntax. The resulting graph is placed in the file ping.gif.

Figure D: The RRGrapher Utility

Creating graphs like these can be tedious. Fortunately, there is RRGrapher (http://net.doit.wisc.edu/~plonka/RRGrapher/), a utility that automates the process. This CGI script, written by Dave Plonka, is illustrated in Figure D.

You can use this tool to create graphs that draw data from multiple RRD databases. In this figure, we’re plotting values from two databases over a specified time period. The latter is one of RRGrapher’s most convenient features: RRDtool requires times to be expressed in standard UNIX format (seconds since January 1, 1970), but RRGrapher lets you enter them in a readable format.
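When you do invoke RRDtool by hand rather than through RRGrapher, GNU date can do the epoch conversion for you. This small example assumes the GNU date syntax found on Linux systems (BSD date uses different flags).

```shell
# Convert a human-readable time to the seconds-since-1970 value
# that RRDtool expects (GNU date; BSD date differs).
start=$(date -d "2002-05-01 00:00 UTC" +%s)
echo $start
```

The resulting number can then be passed to options such as rrdtool graph’s start and end times.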

As mentioned earlier, you’ll want a front-end package, such as Cricket, to automate the process of gathering data for RRDtool. Cricket is written in Perl, and requires a very large number of modules to function (plan on several visits to CPAN), so installing it may take a bit of time. Once it’s up and running, its most important components are:

  • The cricket-config directory tree, which contains specifications for each device to be monitored.
  • The collector script, which runs periodically from cron (usually, every 5 minutes).
  • The grapher.cgi script, which is used to display Cricket graphs within a Web browser.
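A typical way to drive the collector is a crontab entry for the Cricket user, along these lines; the installation path shown is illustrative, so adjust it to wherever Cricket actually lives on your system.

```
# Run the Cricket collector script every five minutes.
# (Path is illustrative; substitute your Cricket installation directory.)
*/5 * * * * /home/cricket/cricket/collector
```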

The cricket-config directory tree contains the configuration files that tell the collector script what data to get from which devices. It holds a hierarchical set of configuration files. Default values set at each level continue to apply to lower levels unless they are explicitly overridden. Once the initial setup is completed, adding additional devices is simple.

The first-level subdirectories within this tree refer to broad classes of devices: routers, switches, and so on. We’ll see an example of the device class hosts, provided as part of the Unix Host Resources package written as a Cricket add-on by James Moore for devices that support SNMP (http://www.certaintysolutions.com/tech-advice/cricket-contrib/).

The file cricket-config/hosts/Defaults supplies default values for entries within the hosts subtree. The Defaults file, for instance, defines a data set called ucd_System that includes CPU and memory usage, as well as load averages over 1, 5, and 15 minutes. The ucd_Storage data set includes information on disk usage.

Every host to be monitored has a subdirectory under hosts that contains a file named Targets, which contains the information necessary to build a round-robin data set for that particular host. Figure E shows some excerpts from the file for the host callisto.

Figure E: The cricket-config/hosts/callisto/Targets config file

 Target --default--
     server               = callisto
     snmp-community       = somesecret

 # Specify data source groups to collect
 target ucd_sys
     target-type          = ucd_System
     short-desc           = "CPU, Memory, and Load"

 target boot
     target-type          = ucd_Storage
     inst                 = 1
     short-desc           = "Bytes used on /boot"
     max-size             = 19487
     storage              = boot

This file instructs Cricket to collect values for all of the items defined in the ucd_System and ucd_Storage groups. Each target will appear as an option within the Cricket Web interface for this host.

Graphs of the Cricket data can be very helpful in determining what the normal range of behavior is for various devices. When you understand the normal status and variation, you are in a much better position to recognize and understand the significance of anomalies that turn up.


NetSaint http://netsaint.org/

SaintMap http://netsaint.org/download/

OpenNMS http://www.opennms.com/

Big Brother (free for non-commercial uses) http://www.bb4.com/

Big Sister http://bigsister.graeff.com/

Angel Network Monitor http://www.paganini.net/angel/

Multi-Router Traffic Grapher (MRTG) http://www.mrtg.org/

RRDtool http://rrdtool.eu.org

Cricket http://cricket.sourceforge.net/

James Moore’s Cricket Add-ons http://www.certaintysolutions.com/tech-advice/cricket-contrib/

RRGrapher http://net.doit.wisc.edu/~plonka/RRGrapher/

OpenView http://www.openview.hp.com/

Patrol http://www.bmc.com/patrol/

Tivoli http://www.tivoli.com/

Unicenter http://www.cai.com/unicenter/

Additional information about programs mentioned in this article can be found online at http://www.linux-mag.org/downloads/2002-05/netmonitor.tgz.

Æleen Frisch is the author of Essential System Administration and writes the “Guru Guidance” column for Linux Magazine. She can be reached at aefrisch@lorentzian.com.
