Whether your network has a single vital server or one hundred, install Nagios 2.0 now — before your phone starts ringing (again).
Surprisingly often, system administrators first learn of problems when the help desk hotline starts ringing off the hook. It’s Bob in Accounting calling to complain he can’t print; it’s Amanda in Sales frantically phoning, trying to find someone to reboot the file server. Such problems turn into crises unnecessarily, because network monitoring tools can readily detect many kinds of problems as they emerge or as soon as they occur.
While most organizations have some sort of network monitoring in place — typically a collection of ad hoc scripts that check on individual services and resources connected to the network — these scripts tend to be an incomplete solution, and can become nightmarishly difficult to maintain as a network grows and knowledgeable staff leaves to join a startup in Maui. In many organizations, maintaining and monitoring the monitors is a chore.
Fortunately, there’s a better way. Nagios (http://www.nagios.org) is an open source network monitoring application available under the GNU General Public License (GPL). Nagios is a modular platform that can easily be customized to monitor just about anything on your network. From web site availability to machine room temperature, from daemon status to the amount of toner in the printer in accounting, Nagios can monitor it all. Better yet, the latest release of the software (now in beta), Nagios 2.0, makes monitoring even easier.
The Nagios System
Every Nagios deployment includes the Nagios core and a combination of plug-ins. The core collects data generated by the plug-ins and determines the overall health of the network. If something is wrong, the Nagios core knows who to contact, when, and how.
Meanwhile, Nagios plug-ins supply the actual intelligence that monitors your systems, networks, and applications. To check the status of a web server, for example, a specialized plug-in will attempt to connect to the server and retrieve an HTML page. If successful, the plug-in signals “OK” to the core; otherwise, the plug-in raises an alert and the core takes over from there, perhaps paging the webmaster.
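To make the hand-off concrete: a plug-in reports its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) plus one line of text on standard output. The shell sketch below is a hypothetical threshold check, not one of the official plug-ins; the function name check_percent and the DISK label are made up for illustration.

```shell
#!/bin/sh
# Hypothetical plug-in sketch: compare a percentage against two thresholds.
# Exit-code convention used by Nagios plug-ins:
#   0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN

# check_percent <label> <value> <warning threshold> <critical threshold>
check_percent() {
    # Anything that is not a plain non-negative integer is reported as UNKNOWN.
    case "$2" in
        ''|*[!0-9]*) echo "$1 UNKNOWN - no numeric value supplied"; return 3 ;;
    esac
    if [ "$2" -ge "$4" ]; then
        echo "$1 CRITICAL - ${2}% used"; return 2
    elif [ "$2" -ge "$3" ]; then
        echo "$1 WARNING - ${2}% used"; return 1
    fi
    echo "$1 OK - ${2}% used"; return 0
}

# A real plug-in would measure something and exit with the result, e.g.:
#   check_percent DISK "$measured_value" 80 90; exit $?
```

The core never needs to know how the value was obtained; it only reads the exit code and the status line, which is why existing ad hoc scripts are usually easy to adapt.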
The separation between checking status and relaying status makes it easier to integrate existing ad hoc scripts into Nagios, a task that takes surprisingly little effort. Additionally, in Nagios 2.0, a lot of effort has been spent improving performance and increasing Nagios’ ability to monitor the large, complex networks that have become more common. Portions of Nagios have been rewritten to leverage multiple threads, and the data parsing methods of the CGI-based web interface have been redesigned with an eye toward performance and the ability to display tens of thousands of services running across tens of thousands of hosts. Even the structure of the configuration files has changed to accommodate larger networks: it’s now easier to use macros and templates to configure entire networks of similar machines.
Table One shows a side-by-side comparison of Nagios 1.x and Nagios 2.0 across some of the most useful features. (The table is by no means exhaustive.)
TABLE ONE: Nagios 2.0 has many new features
| Feature | Nagios 1.x | Nagios 2.0 |
| Native Database Support | | Not available (dropped) |
| Non-Template Style Configuration Files | | Not available (dropped) |
| | | Some service check parameters can be modified dynamically at runtime, without a restart |
| Host and Service State Retention | | Greatly enhanced: Nagios 2.0 keeps scheduling information across restarts, and other improvements |
| Host Freshness Checking | | Nagios 2.0 now checks to see if a host check has been performed within a given time interval |
| Scheduled Host Checks | Not available (aggressive host checking only) | New in 2.0; regular checks of host status |
| | | New in 2.0 |
| Macros | About 30 defined macros, usable in context | 99 defined macros, “On Demand” macros (out of context), and macros as environment variables |
| Wildcard Support for Object Names | Supports “*” to match any name | Supports matching on regular expressions |
In general, Nagios 2.0 has more features, is more flexible, and is faster than the previous 1.x software. The changes to state retention, for example, have greatly improved Nagios’ reports and have minimized the disruption of making changes to the configuration. Regular expression support and some minor tweaks to how escalations are defined make life much easier for those users who set up complex escalation structures. And Nagios 2.0 no longer depends on a database to record data. (See the sidebar “How Nagios Saves State Data” for more information.)
The Nagios 1.4 plug-in architecture also represents a vast improvement. (Revisions of the Nagios core and revisions of the Nagios plug-in architecture are numbered separately.) Plug-ins can now report availability and some numeric performance data. This data can be used with other add-ons, such as perfparse (see the sidebar “The perfparse Add-on for Nagios”), to produce detailed performance reports. Plug-in support for IPv6 is new, and there have also been major improvements in internationalization. Nagios 1.4 plug-ins are compatible with Nagios 1.x and Nagios 2.0.
Getting Started with Nagios 2.0
Depending upon the size and complexity of your network, you may want to dedicate a separate Linux system to hosting the Nagios core. You can obtain Nagios 2.0b2 (the most current version of the software as of this writing) from http://www.nagios.org/download. You can download the Nagios plug-ins by following the links shown on that same page.
It’s critical that you know what you want to monitor. While it’s possible to monitor literally hundreds of parameters, you should only focus on a few. A good rule of thumb is to monitor only those systems and services that are critical — in other words, issues that you want to be alerted about immediately — even at 3:00 a.m. When monitoring your network, less is truly more.
Ideally, your Nagios host should have a web server, so you can view the status displays and interact with Nagios’ web forms and reports. Any web server that supports CGI is suitable.
Like other packages, Nagios has a number of dependencies. In particular, Nagios uses the GD graphics libraries (http://www.boutell.com/gd/). If you’re using Red Hat Linux, you should be able to install GD by installing the gd and gd-devel RPMs. (APT and yum users can just type apt-get install gd-devel or yum install gd-devel, which installs both required packages.) GD can also be built from source.
The Nagios manuals are quite complete and can guide you through setting up the web interface. There are also a number of options for authentication, so choose the one that aligns with your security policy and access methods. Setting up SSL or even an LDAP redirector is possible, and is not specific to Nagios.
The Nagios plug-ins have a variety of dependencies, as they require many kinds of libraries, Perl modules, utilities, clients, and the like to monitor the great variety of services that exist. For instance, check_radius, a useful plug-in that makes sure the RADIUS server is running and permitting logins, requires a RADIUS client and won’t compile unless it’s there. Other plug-ins may compile even if prerequisites are missing, although the plug-in may not function fully.
If you do nothing else, you should install the Net::SNMP packages and Perl modules to enable SNMP-based checks, which are still among the most useful system and network statistics around.
To make Nagios work, you must define the hosts and services you wish to monitor (to understand the difference between the two, see the sidebar “Host vs. Service Monitoring”). Nagios uses configuration files, such as minimal.cfg-sample, which is good to crib from and refer to as you deploy Nagios. (minimal.cfg-sample defines several checks to monitor the monitoring server itself.)
It’s a good idea to read the Nagios documentation to understand how the Nagios files are structured, and to build your configuration methodically, carefully, and incrementally to ensure you can keep it organized.
Host vs. Service Monitoring
Hosts are monitored differently than services in Nagios.
Services run on hosts and can’t exist without being associated with a host. For example, if you have a web server with several virtual hosts, you can create one monitor for the machine itself and one monitor for each top-level URL.
The inherent logic of this structure is leveraged for a kind of low-level event correlation. The services are regularly checked on a host, but by default the host itself is not checked. The implicit assumption is that it’s pointless to check a host if the services on it are running.
However, as soon as one service fails, a definable check, normally an ICMP echo request, or ping, runs against the host. If the host check fails, all service notifications are suppressed and a second sequence of logic starts to determine if the host is down or if an intervening router or switch is down, rendering it unreachable.
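The decision logic described in this sidebar can be sketched as a small shell function. It is a deliberate simplification (the real scheduler also handles retries and parent hosts), and the function name correlate is hypothetical; the two arguments are plug-in exit codes, where 0 means OK.

```shell
#!/bin/sh
# correlate <service check exit code> <host check exit code>
# Simplified sketch: the host check result only matters once a service fails.
correlate() {
    if [ "$1" -eq 0 ]; then
        # A working service implies a working host, so no host check is needed.
        echo "service OK: host check skipped"
    elif [ "$2" -eq 0 ]; then
        # The host answers, so the problem is the service itself.
        echo "service problem: host is up, notify service contacts"
    else
        # The host check failed too: hold back service notifications and
        # work out whether the host is down or merely unreachable.
        echo "host problem: service notifications suppressed"
    fi
}
```

For example, `correlate 2 0` reports a service problem on a healthy host, while `correlate 2 2` suppresses the service notification in favor of the host-level logic.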
To begin, you’ll probably want to install some plug-ins on the Nagios host itself so it can be monitored. Then, you can enable the core.
Fix the problems with the sample. The file minimal.cfg is broken. It contains a contact_groups directive for a hostgroup object that’s no longer valid in version 2.0, and there are other problems. Simply edit this file and remove the contact_groups directive from line 223, placing it in the host definition for localhost above it. minimal.cfg also contains an error in the check-host-alive command definition. Replace /bin/check_ping with the macro $USER1$/check_ping on line 78. This tells Nagios to use the macro definition in the resource.cfg file to point to the check_ping plug-in.
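After the fix, the relevant definitions might look like the following sketch. It is written as a small shell script that drops the fragments into a scratch directory for review rather than touching the live files. The plug-in directory /usr/local/nagios/libexec and the check_ping thresholds are assumptions based on a default source install, not values taken from the sample file.

```shell
#!/bin/sh
# Sketch only: writes example fragments to a scratch directory for review.
OUT=$(mktemp -d)

# resource.cfg: $USER1$ conventionally points at the plug-in directory
# (path assumed from a default source install).
cat > "$OUT/resource-fragment.cfg" <<'EOF'
$USER1$=/usr/local/nagios/libexec
EOF

# minimal.cfg: the host check command, using the macro instead of a fixed
# path (the -w/-c/-p values are illustrative).
cat > "$OUT/command-fragment.cfg" <<'EOF'
define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 1
        }
EOF

echo "fragments written to $OUT"
```

Keeping the path in one macro means that moving the plug-ins later requires a change in resource.cfg only.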
LISTING ONE: Comment out these lines in nagios.cfg-sample
Edit configuration files. In the central file, nagios.cfg-sample, comment out all of the lines shown in Listing One and add a cfg_file= line to include the file minimal.cfg (with the install prefix used in this article, cfg_file=/usr/local/nagios/etc/minimal.cfg).
Also, edit cgi.cfg-sample and uncomment the lines shown in Listing Two. You may want to change the usernames in Listing Two to match what you set up for authentication in your Web server.
LISTING TWO: Uncomment these lines in cgi.cfg-sample.
authorized_for_system_information=nagiosadmin
authorized_for_configuration_information=nagiosadmin
authorized_for_system_commands=nagiosadmin
authorized_for_all_services=nagiosadmin
authorized_for_all_hosts=nagiosadmin
authorized_for_all_service_commands=nagiosadmin
authorized_for_all_host_commands=nagiosadmin
To get the files ready to use, run the following commands as root:
# mv minimal.cfg-sample minimal.cfg
# mv nagios.cfg-sample nagios.cfg
# mv cgi.cfg-sample cgi.cfg
# mv resource.cfg-sample resource.cfg
At this point, you should be able to run the Nagios “pre-flight check” to validate your configuration and detect any errors that would stop Nagios from running. The pre-flight check emits errors with filenames and line numbers, making errors easier to find and fix.
To run the pre-flight check, type:
# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
Now, start Nagios:
# /etc/init.d/nagios start
If all goes well, you should have a running version of Nagios 2.0b2 on your system, monitoring itself with five checks. Figure One shows a sample status screen.
FIGURE ONE: The Nagios status screen
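After the first start, the same pre-flight check is a useful guard before every later restart. The sketch below wraps the two commands shown above in a shell function. It assumes the init script accepts a restart argument, and the function and variable names are illustrative.

```shell
#!/bin/sh
# Paths follow the ones used in this article; override them as needed.
NAGIOS_BIN=${NAGIOS_BIN:-/usr/local/nagios/bin/nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/usr/local/nagios/etc/nagios.cfg}
NAGIOS_INIT=${NAGIOS_INIT:-/etc/init.d/nagios}

# Restart Nagios only if the pre-flight check passes.
verify_then_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null 2>&1; then
        # Assumption: the init script understands "restart".
        "$NAGIOS_INIT" restart
    else
        echo "pre-flight check failed; Nagios left untouched" >&2
        return 1
    fi
}
```

When the check fails, rerun the pre-flight command by hand to see the filenames and line numbers it reports.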
Checking On Hosts and Services
Once Nagios is up and running, you’ll likely want to add more hosts to your configuration. If you’d like, you can separate your configuration into multiple files. Using many (small) files instead of one large file gives you more flexibility, since individual files can be edited, generated by scripts, or copied into and out of configurations more easily than if the entire configuration is in a single file.
To include a new host configuration file, create the file and then reference it from the master file nagios.cfg with a cfg_file= directive. For example, this line…
cfg_file=/usr/local/nagios/etc/webserver.cfg
… would include /usr/local/nagios/etc/webserver.cfg. If you want to see examples of host configuration files, download the samples from http://www.itgroundwork.com/resources/downloads.html.
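A host file of this kind can be very short. The following shell sketch writes a hypothetical webserver.cfg to a scratch directory so it can be reviewed before being copied into place. The host name, alias, and address are invented for illustration, and the sketch assumes that a contact group named admins, a time period named 24x7, and the check-host-alive command are defined elsewhere in your configuration.

```shell
#!/bin/sh
# Sketch only: generates an example host definition for review.
OUT=$(mktemp -d)

cat > "$OUT/webserver.cfg" <<'EOF'
define host{
        host_name               webserver
        alias                   Example Web Server
        address                 192.168.0.10
        check_command           check-host-alive
        max_check_attempts      5
        contact_groups          admins
        notification_interval   30
        notification_period     24x7
        notification_options    d,u,r
        }
EOF

echo "review $OUT/webserver.cfg before copying it to /usr/local/nagios/etc/"
```

Because the file is generated by a script, the same approach scales to many similar hosts by looping over a list of names and addresses.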
Now that you’ve defined an additional host, you can check its status or the status of any of its services with one or more plug-ins. For example, if you’ve added your web server, you can check its condition using the check_http plug-in.
check_http --help lists all of the plug-in’s capabilities, but its most useful feature is searching web pages for known strings. By matching a known string to the downloaded page, you can verify that the page loads with an HTTP 200 code and presents the content that you expect.
For instance, if you run the following…
# ./check_http -H www.itgroundwork.com -u \
-s "GroundWork Monitor"
… you get:
HTTP OK HTTP/1.1 200 OK -
0.140 second response time
But if you run…
% ./check_http -H www.itgroundwork.com -u \
-s "GroundWork Wrangler"
… you get:
CRITICAL - string not found
You can specify a virtual host name with -H, or specify the IP address and URL combinations with -I. As with many other plug-ins, experiment with check_http from the command line before you define service checks for your hosts.
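Once the command line behaves as expected, it can be turned into a command and a service definition. The shell sketch below writes one possible pair to a scratch directory. The names check_http_string and webserver, the URL path /, the check intervals, and the admins and 24x7 objects are all assumptions made for illustration; only the check_http options themselves (-H, -I, -u, -s) come from the plug-in.

```shell
#!/bin/sh
# Sketch only: generates an example command and service definition.
OUT=$(mktemp -d)

cat > "$OUT/webserver-services.cfg" <<'EOF'
define command{
        command_name    check_http_string
        command_line    $USER1$/check_http -H $ARG1$ -I $HOSTADDRESS$ -u $ARG2$ -s "$ARG3$"
        }

define service{
        host_name               webserver
        service_description     Home page content
        check_command           check_http_string!www.itgroundwork.com!/!GroundWork Monitor
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        check_period            24x7
        notification_interval   30
        notification_period     24x7
        notification_options    w,u,c,r
        contact_groups          admins
        }
EOF

echo "review $OUT/webserver-services.cfg before including it"
```

The arguments after each ! in check_command fill $ARG1$, $ARG2$, and $ARG3$ in order, so one command definition can serve many virtual hosts and search strings.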
Nagios' Web Interface
The interaction design of the Nagios interface is a little rough, but it's very functional. The status of hosts and services is quite obvious, but there are some subtle facets.
Nagios' topological maps are not particularly interesting in the minimal configuration shown here, but if you instantiate the example configuration files supplied with this article, you'll appreciate their benefits. A sample map is shown in Figure Two.
FIGURE TWO: A topological map, showing the structure of the network
Drilling down on a host from the service detail or host detail pages takes you to the command screen (shown in Figure Three). The screen lists all of the commands you can run against a host. Nagios 1.x users may notice a few additional items, such as the ability to submit passive check results for the host, which was only available for services in Nagios 1.x. This makes testing notifications simpler, as you can force a host down from the web interface and then follow the Notifications link to see who gets notified and when. The “Locate host on map” icon is also new.
FIGURE THREE: The Nagios command screen
Alas, configuration, one aspect that seems natural for the web interface, is missing. You can view the configuration, enable and disable functions like checks and notifications, and use a large number of features and functions from the web interface, but you must go back to the text configuration files to make changes. There are a few web-based configuration tools (for example, the Nagios Administration Tool at http://nagat.sourceforge.net/), but none (yet) allows you to specify all the options available in the files. (However, one does come close. See the author's “Monitor Architect” at http://www.itgroundwork.com.)
Notifications in Nagios are virtually unlimited — anything you can run at the command-line can be used to notify staff of emergent problems. That said, most users prefer email — which is ironic considering email does go down (and hence is a critical service that's likely to be monitored). So, notifications typically include email and some other “out-of-band” method to get critical alerts to staff.
One of the best techniques is to use a modem to page people via SNPP gateways. Instant messaging, Short Message Service (SMS), and even voice messages are other options, and integrating them into Nagios is usually trivial.
The notification templates can contain whatever boilerplate text you wish, and the particulars can be filled in by macros. Nagios 2.0 greatly expands the use of macros, so notification messages can contain a lot of interesting information.
For instance, in Nagios 1.x, you could easily make your email notifications look like this:
* There has been a: PROBLEM
* Service: cpu_utilization on host: WWW45 at address: 192.168.12.33 is: CRITICAL
* Check results: 99%
Nagios 2.0 adds a lot of statistical macros, so you can add information about the general state of things, like:
* There are 3 hosts down
* There are 27 services that are having problems
* There are 12 services with problems that no one is working on
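Because Nagios 2.0 can also hand macros to a notification command as environment variables (prefixed with NAGIOS_), a message like the ones above can be assembled by an ordinary script. The sketch below is one hypothetical way to do it; the function name format_notification is made up, and the ? fallbacks simply keep the output readable when a variable is not set.

```shell
#!/bin/sh
# Sketch of a notification formatter that reads Nagios macros from the
# environment. Each variable falls back to a placeholder if it is unset.
format_notification() {
    cat <<EOF
There has been a: ${NAGIOS_NOTIFICATIONTYPE:-?}
Service: ${NAGIOS_SERVICEDESC:-?} on host: ${NAGIOS_HOSTNAME:-?} at address: ${NAGIOS_HOSTADDRESS:-?} is: ${NAGIOS_SERVICESTATE:-?}
Check results: ${NAGIOS_SERVICEOUTPUT:-?}
* There are ${NAGIOS_TOTALHOSTSDOWN:-?} hosts down
* There are ${NAGIOS_TOTALSERVICEPROBLEMS:-?} services that are having problems
* There are ${NAGIOS_TOTALSERVICEPROBLEMSUNHANDLED:-?} services with problems that no one is working on
EOF
}

# A notification command could pipe the text to a mailer or pager gateway,
# for example: format_notification | mail -s "Nagios alert" "$NAGIOS_CONTACTEMAIL"
```

The same text can be sent through more than one channel, which makes it easy to pair email with an out-of-band method.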
You could put URLs in these messages in Nagios 1.x, but Nagios 2.0 adds a few more to choose from, such as host and service action URLs, which are definable links associated with the host or service. Also, while Nagios 1.x supported flap detection (a way of shutting off notifications when a given host or service has too many state changes in a given period), it didn't notify you that the host or service had started or stopped flapping, which Nagios 2.0 can do.
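Flap detection boils down to measuring how often recent check results changed state. The sketch below computes a plain percentage over a list of plug-in exit codes. Nagios's own calculation is more elaborate and compares the result against configurable thresholds, so treat this as an illustration of the idea only. The function name flap_percent is hypothetical.

```shell
#!/bin/sh
# flap_percent "<space-separated exit codes, oldest first>"
# Prints the percentage of consecutive check pairs whose state differs.
flap_percent() {
    prev=""
    changes=0
    transitions=0
    for state in $1; do
        if [ -n "$prev" ]; then
            transitions=$((transitions + 1))
            if [ "$state" != "$prev" ]; then
                changes=$((changes + 1))
            fi
        fi
        prev=$state
    done
    # Fewer than two results means there is nothing to compare.
    if [ "$transitions" -eq 0 ]; then
        echo 0
        return 0
    fi
    echo $((changes * 100 / transitions))
}
```

A service that alternates between OK and CRITICAL on every check scores 100, while one that stays in a single state scores 0; a flapping threshold sits somewhere in between.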
Notifications are organized around contacts; contacts are grouped into contact groups; and hosts are grouped into hostgroups. In Nagios 1.x, you had to associate contact groups with hostgroups, but Nagios 2.0 is simpler, associating contact groups directly with hosts. Escalations aside, all of the contacts in a contact group associated with a host will get notifications when that host goes down (or comes back up, or becomes unreachable, as you prefer). Services are analogous, in that there are contact groups for those too, which behave similarly.
You can restrict when a given host or service generates notifications based on time periods, which are filters for monitoring or notification activity. Time periods can be applied to contacts too, so you can restrict when a given person is called. Time periods can now be applied to escalations as well.
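As an illustration, the shell sketch below writes a hypothetical time period covering weekday working hours to a scratch directory. The name workhours and the hours themselves are assumptions; a contact could then refer to it in its service_notification_period or host_notification_period directive.

```shell
#!/bin/sh
# Sketch only: generates an example time period definition for review.
OUT=$(mktemp -d)

cat > "$OUT/timeperiods-fragment.cfg" <<'EOF'
define timeperiod{
        timeperiod_name workhours
        alias           Weekday working hours
        monday          09:00-17:00
        tuesday         09:00-17:00
        wednesday       09:00-17:00
        thursday        09:00-17:00
        friday          09:00-17:00
        }
EOF

echo "review $OUT/timeperiods-fragment.cfg before including it"
```

Days that are not listed (here, Saturday and Sunday) are simply outside the period, so nothing tied to it fires on those days.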
The example configuration files have some basic hostgroups and contact groups defined. Hostgroups are useful for things like scheduling downtime (maintenance periods), filtering reports, segmenting the display, and setting up escalations. Nagios 2.0 adds the concept of the service group, which adds these functions to arbitrary groupings of services on specified hosts. This is useful for reporting on all services of a given type (disk capacity, for example) or in setting up service group dependencies and escalations.
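A service group is defined much like a hostgroup, except that its members are listed as host and service pairs. The sketch below writes a hypothetical disk capacity group to a scratch directory; the host names fileserver and webserver and the service description Disk Usage are invented, and each pair must match a service that already exists in your configuration.

```shell
#!/bin/sh
# Sketch only: generates an example service group definition for review.
OUT=$(mktemp -d)

cat > "$OUT/servicegroups-fragment.cfg" <<'EOF'
define servicegroup{
        servicegroup_name       disk-capacity
        alias                   Disk capacity checks
        members                 fileserver,Disk Usage,webserver,Disk Usage
        }
EOF

echo "review $OUT/servicegroups-fragment.cfg before including it"
```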
Perhaps the single most powerful enhancement in Nagios 2.0 is the addition of regular expression or wildcard character matching in the configuration files. Wisely, Ethan Galstad, the lead developer of Nagios, has implemented this rather cautiously: you can disable it (the default), enable it partially (only when * or ? is used), or enable it completely, and it applies only to a given set of directives (specifically, host and service names). Regular expressions allow very flexible configurations of service escalations, something that's currently a sore spot in Nagios 1.x configurations.
For example, the following service escalation applies to all services that have “cpu” in their description:
The services containing the string cpu (which is a simple regular expression) have to exist on all the hosts in the specified hostgroup for this to work. (Of course, regular expression matching has to be enabled as well.)
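An escalation of this kind might be written roughly as in the sketch below, which generates two fragments in a scratch directory: the nagios.cfg switch that turns on regular expression matching, and a service escalation that uses it. The hostgroup web-servers, the contact group managers, and the notification numbers are hypothetical; a last_notification of 0 keeps the escalation in force for as long as the problem lasts.

```shell
#!/bin/sh
# Sketch only: generates example fragments for review.
OUT=$(mktemp -d)

# nagios.cfg: regular expression matching stays off unless enabled.
cat > "$OUT/nagios-fragment.cfg" <<'EOF'
use_regexp_matching=1
EOF

# Escalate every service whose description contains "cpu" on all hosts
# in the hostgroup, starting with the third notification.
cat > "$OUT/escalations-fragment.cfg" <<'EOF'
define serviceescalation{
        hostgroup_name          web-servers
        service_description     .*cpu.*
        first_notification      3
        last_notification       0
        notification_interval   30
        contact_groups          managers
        }
EOF

echo "fragments written to $OUT"
```

Run the pre-flight check after adding an escalation like this one, since a pattern that matches no service on some host in the group is reported as a configuration error.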
Nagios 2.0 contains an undocumented event broker API. In future versions, this API may well enable integration with many other tools, both open source and proprietary.
In the meantime, 2.0 is ready for you to download and evaluate. The new release is more powerful and flexible than its predecessors, and perhaps than most commercial monitoring applications. This article just scratched the surface of Nagios 2.0, but you should be able to deploy it and start monitoring your network with just a little effort. Bob and Amanda will thank you.