Network management requires attention to faults, configuration, accounting, performance, and security — the so-called FCAPS. You can spend lots of money and lots of time deploying an FCAPS package — or you can deploy the open source OpenNMS. Here's how.
To paraphrase the beginning of M. Scott Peck’s The Road Less Traveled, network management is hard. Worse, traditional enterprise-grade network management has also been expensive. It can take months, even years to deploy solutions such as Hewlett-Packard’s OpenView or IBM’s Tivoli, and it’s estimated that the service cost of these packages is on the order of seven to ten times the software licensing costs.
OpenNMS was created to provide a better alternative. It was designed from its inception to be “enterprise grade,” capable of monitoring thousands to tens of thousands of devices (and ultimately to have no practical limits). Moreover, OpenNMS is open source, so there are no licensing fees, and better yet, you can bend, twist, and shape OpenNMS to your network, hastening deployment.
In 1999, the Internet bubble was still burgeoning, and one of the bubble buzzwords was “service level agreement” (SLA). An SLA dictated how much downtime an Internet service provider (ISP) was allowed before it incurred some sort of financial penalty. For example, an ISP might warrant 99.999 percent uptime, or “five nines” of availability.
Now 99.999 percent uptime may look good in an advertisement, but in practice, near-perfect uptime was much more difficult to verify. No tools existed to corroborate uptime with such precision, and you cannot monitor what can’t be measured.
For example, HP OpenView’s Network Node Manager was able to do two main things: send a “ping” to a network device and collect data from it using the Simple Network Management Protocol (SNMP). With such limited capabilities, trying to measure availability of something like a Web server was akin to measuring the fuel economy of a car by using the rotational speed of the fuel pump, how often the spark plugs fired, and the temperature of the exhaust manifold.
OpenNMS, on the other hand, approaches fault management from the vantage point of a user. To return to the car analogy, drive the car some distance and see how much fuel gets used.
For instance, to determine if a Web server is up, a Web monitor accesses a page and measures the response time or notes an error. Likewise, a DNS monitor attempts to look up the IP address of a domain name. Other monitors probe additional, well-known services, such as FTP, SSH, and DHCP. You can even create your own monitor with scripts, and OpenNMS recently added support for plug-ins to the popular Nagios monitoring application via NRPE (for Unix and Linux servers) and NSClient (for Windows). Of course, you can also use “pings,” which are represented as a device supplying the ICMP Echo Reply service.
Each user-configurable monitor uses at least two factors to determine if a service is “up.” The first is whether or not an expected response was received, and the second is the time the poll required. Either metric can indicate a problem, as in a Web server that is “up” but takes minutes to serve a page.
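The two factors can be sketched in a few lines of Python. This is a minimal illustration, not OpenNMS code: `poll`, `probe`, `slow_probe`, and `timeout_seconds` are hypothetical names, and `probe` stands in for whatever check a monitor performs (fetching a page, resolving a name, and so on).

```python
import time

def poll(probe, timeout_seconds):
    """Run one poll and judge it on both the response and the time it took.

    `probe` is a hypothetical callable that returns True when the expected
    response was received. All names here are assumptions for illustration.
    """
    start = time.monotonic()
    try:
        response_ok = bool(probe())
    except Exception:
        # A probe that raises (connection refused, lookup failure) counts as no response.
        response_ok = False
    elapsed = time.monotonic() - start
    return {
        "response_ok": response_ok,
        "elapsed_seconds": elapsed,
        # "Up" requires the expected response AND an acceptable response time.
        "up": response_ok and elapsed <= timeout_seconds,
    }

def slow_probe():
    time.sleep(0.05)  # simulate a server that answers, but slowly
    return True

print(poll(lambda: True, timeout_seconds=1.0)["up"])  # healthy service
print(poll(slow_probe, timeout_seconds=0.01)["up"])   # responds, but too slowly
```

Treating a slow answer as a failure is what lets a monitor flag the Web server that is technically running but useless to its users.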
Doing Things Better
As open source, OpenNMS has some advantage over established tools like OpenView and Tivoli. However, cost doesn’t always seal the deal: OpenNMS has to be just as capable, or even better. One innovation is the “downtime” model built into the OpenNMS poller. Here’s why.
A 99.999 percent uptime translates to about five minutes of downtime per year, or a little less than thirty seconds per month. Yet it is impractical to realize this level of service (the SLA) as virtually every problem takes more than thirty seconds to resolve.
However, “four nines,” or 99.99 percent uptime, over a month allows roughly four and a half minutes of downtime, a reasonable span in which to solve a number of common problems. Several one- to two-minute outages could occur in a month and not violate the SLA.
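The arithmetic behind these figures is easy to check. The sketch below assumes a 365-day year and a 30-day month; the function name `downtime_budget_seconds` is made up for this example.

```python
def downtime_budget_seconds(availability_percent, period_days):
    """Return the downtime, in seconds, allowed over a period at a given availability."""
    period_seconds = period_days * 24 * 60 * 60
    return period_seconds * (1 - availability_percent / 100)

# Assumed period lengths: 365-day year, 30-day month.
five_nines_year = downtime_budget_seconds(99.999, 365)
five_nines_month = downtime_budget_seconds(99.999, 30)
four_nines_month = downtime_budget_seconds(99.99, 30)

print(f"99.999% over a year:  {five_nines_year / 60:.2f} minutes")
print(f"99.999% over a month: {five_nines_month:.1f} seconds")
print(f"99.99%  over a month: {four_nines_month / 60:.2f} minutes")
```

With these assumptions, five nines allows a little over five minutes a year (about 26 seconds a month), while four nines allows a bit more than four minutes a month; a 31-day month stretches that to just under four and a half.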
But most network management applications have a set polling frequency, usually five minutes. With a five-minute poll, the shortest outage one can measure is five minutes, or the time between an unsuccessful and a successful poll. Yes, you can increase the polling rate, but that only adds unwanted traffic to an otherwise fully functional network.
OpenNMS addresses the problem by implementing a “downtime” model that dynamically changes the polling rate when an outage occurs. By default, OpenNMS uses the same five-minute polling interval as other tools, but when an outage is detected, the interval is temporarily shortened to 30 seconds. This lasts for five minutes (so there are ten polls in the five minutes after an outage is detected). After five minutes, an SLA of four nines has been violated, so there’s no reason to keep polling at that rate, and OpenNMS backs off to a five-minute interval. After twelve hours of downtime, the interval lengthens further to 10 minutes, and after five days, the service is unmanaged. (All of this behavior is configurable, of course.) Thus OpenNMS can detect an outage as short as 30 seconds.
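As a rough sketch, the default schedule described above can be written as a lookup from “how long has the service been down” to “how long until the next poll.” This is not the OpenNMS implementation or its configuration format; `DOWNTIME_MODEL`, `UNMANAGE_AFTER`, and `polling_interval` are hypothetical names used only for illustration.

```python
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

# (outage age at which the step begins, polling interval), both in seconds.
# Values mirror the defaults described in the text.
DOWNTIME_MODEL = [
    (0, 30),                      # first five minutes: poll every 30 seconds
    (5 * MINUTE, 5 * MINUTE),     # then back off to every five minutes
    (12 * HOUR, 10 * MINUTE),     # after twelve hours: every ten minutes
]
UNMANAGE_AFTER = 5 * DAY          # after five days the service is unmanaged

def polling_interval(outage_age_seconds):
    """Return the polling interval in seconds, or None once the service is unmanaged."""
    if outage_age_seconds >= UNMANAGE_AFTER:
        return None
    interval = None
    for begins_at, step_interval in DOWNTIME_MODEL:
        if outage_age_seconds >= begins_at:
            interval = step_interval
    return interval

# Count the polls scheduled at the 30-second rate in the first five minutes of an outage.
fast_polls = sum(1 for age in range(0, 5 * MINUTE, 30) if polling_interval(age) == 30)
print(fast_polls)
print(polling_interval(1 * HOUR), polling_interval(13 * HOUR), polling_interval(6 * DAY))
```

Keeping the schedule in a table rather than in code mirrors the point made in the text: every step is meant to be tunable.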
And since OpenNMS maintains a database of outages, it can calculate an SLA in a number of different ways, such as by device, service, groups of services, and so on, and can generate reports based on categories that the user creates.
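To illustrate how an availability figure falls out of a list of outages, here is a small sketch. The record layout (a pair of “lost” and “regained” timestamps in seconds) and the name `availability_percent` are assumptions for this example, not the OpenNMS schema.

```python
def availability_percent(outages, window_start, window_end):
    """Compute availability over a window from (lost, regained) timestamp pairs.

    A `regained` value of None means the outage is still open. Outages are
    assumed not to overlap one another.
    """
    window = window_end - window_start
    downtime = 0
    for lost, regained in outages:
        begin = max(lost, window_start)
        end = window_end if regained is None else min(regained, window_end)
        if end > begin:
            downtime += end - begin
    return 100 * (1 - downtime / window)

THIRTY_DAYS = 30 * 24 * 60 * 60
# Two hypothetical outages: 90 seconds and 120 seconds.
outages = [(1_000, 1_090), (5_000, 5_120)]
result = availability_percent(outages, 0, THIRTY_DAYS)
print(f"{result:.4f}% available; meets four nines: {result >= 99.99}")
```

Grouping by device, service, or category is then a matter of choosing which outage records to pass in.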
Events and Notifications
Besides monitoring network services, another major part of fault management is the processing of events (both internal and external), as well as generating some form of notification when certain events occur.
OpenNMS generates its own events, such as when a network service is down, and it can also receive events from external sources. One common type of event is an SNMP Trap, an event sent from an SNMP agent on a device to the “manager” (in this case OpenNMS). A trap can be very useful because it allows the remote device to asynchronously message the management station without waiting to be polled.
To alert people to problems, OpenNMS ships with a very robust notification system. Practically any command that can be run from the OpenNMS server command line can be used to generate a notice, although the most common commands send email to a user’s account or to a user’s phone or pager. OpenNMS even includes built-in support for the Extensible Messaging and Presence Protocol (XMPP), the protocol that drives instant messaging clients like Jabber and Google Talk.
The notification system works as follows:
- An event is received by OpenNMS, such as a nodeLostService, which indicates that a network service is no longer available.
- The notification system looks through its list of configured notices to see if anyone is interested in the event. Perhaps the IT group wants to be notified of all service failures, while the database group wants to be notified only when the database has been affected. OpenNMS can create a notice for each interested party based on event type, the device that sent the event, and/or the service affected.
- If an event matches a notice, it “walks a path,” where the path specifies how to handle the notice. For example, the first target along a path might send email to a user or a certain group of users. If the notice is not acknowledged within ten minutes, say, the next target on the path might send a page to that same group of users. An event escalates until either the path is complete or someone acknowledges the notice. The person that acknowledges the notice is associated with it, providing accountability.
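The escalation along a path can be sketched as follows. This is an illustration under stated assumptions rather than OpenNMS code: `Target`, `targets_fired`, and the target names are hypothetical, and times are minutes since the notice was created.

```python
from dataclasses import dataclass

@dataclass
class Target:
    name: str            # hypothetical label for who is contacted, and how
    delay_minutes: int   # minutes after the notice is created

# A path like the one in the text: email first, then a page ten minutes later.
PATH = [Target("email-it-group", 0), Target("page-it-group", 10)]

def targets_fired(path, acknowledged_at, now):
    """Return the targets contacted by `now`, given when (if ever) the notice was acknowledged.

    `path` is assumed to be sorted by delay. Escalation stops as soon as the
    notice has been acknowledged or the path is complete.
    """
    fired = []
    for target in path:
        if target.delay_minutes > now:
            break  # this step is not due yet
        if acknowledged_at is not None and acknowledged_at <= target.delay_minutes:
            break  # acknowledged before this step came due
        fired.append(target.name)
    return fired

print(targets_fired(PATH, acknowledged_at=None, now=5))    # only the email so far
print(targets_fired(PATH, acknowledged_at=None, now=15))   # escalated to the page
print(targets_fired(PATH, acknowledged_at=4, now=15))      # acknowledged: no page sent
```

The same function covers a path with an initial delay: if the first target is due at two minutes and the notice is acknowledged at one minute, nothing fires at all.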
OpenNMS can also associate an “up” event with a “down” event so that some notices can be automatically acknowledged. When this is coupled with the downtime model mentioned in the previous section, it avoids bothering staff with spurious pages. Each destination path can be set with an initial delay, say two minutes. For the first two minutes no notice is sent. If the notice is automatically acknowledged in that time, no one is alerted (yet the outage will remain apparent in the Web console). Since the default downtime model increases polling to once every 30 seconds, there can be up to four attempts to contact a service before a notice is sent. With this scheme, when a pager goes off, it’s a good indication that the problem isn’t due to a minor or intermittent lapse in service.
In the development release of OpenNMS, several new features have been added for event management. One is the addition of alarms. An alarm is a way to view events in aggregate.
For example, suppose a device sends an SNMP trap every ten seconds due to a problem. This can result in 360 events an hour, which can cause a lot of noise in an event list. With an alarm, similar events can be reduced to a single line with the name of the event, the first time it occurred, the last time it occurred, and the total number of events seen. Thousands of events can be reduced to fit on one Web page.
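The reduction can be sketched as grouping events by a key and keeping a running summary. The key used here (event name plus source address), the event name `linkDown`, the address, and the function name `reduce_to_alarms` are all assumptions for illustration; OpenNMS lets the key be configured.

```python
def reduce_to_alarms(events):
    """Collapse (name, source, timestamp) events into one summary row per (name, source)."""
    alarms = {}
    for name, source, timestamp in events:
        key = (name, source)
        alarm = alarms.get(key)
        if alarm is None:
            alarms[key] = {
                "name": name,
                "source": source,
                "first_seen": timestamp,
                "last_seen": timestamp,
                "count": 1,
            }
        else:
            alarm["first_seen"] = min(alarm["first_seen"], timestamp)
            alarm["last_seen"] = max(alarm["last_seen"], timestamp)
            alarm["count"] += 1
    return list(alarms.values())

# One hypothetical trap every ten seconds for an hour (timestamps in seconds).
events = [("linkDown", "10.1.1.1", t) for t in range(0, 3600, 10)]
alarms = reduce_to_alarms(events)
print(len(events), "events reduced to", len(alarms), "alarm row(s)")
print(alarms[0])
```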
It’s also possible to configure automations to operate on an alarm. For example, if a Web server stops, OpenNMS generates a nodeLostService event with a severity of Warning. A few minutes later, the problem is resolved, generating a nodeRegainedService event. The alarm subsystem matches the “up” event with the “down” event and sets the severity of the down event to Cleared. Five minutes later, the cleared event is expunged from the alarm list, leaving just those events that represent current issues. As one would expect, this behavior is highly configurable.
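A minimal sketch of that clearing behavior follows. The event names come from the text; everything else (`handle_event`, `expunge_cleared`, the node and service labels, and the dictionary layout) is hypothetical and not how OpenNMS stores alarms.

```python
CLEARED_RETENTION_SECONDS = 5 * 60  # cleared alarms linger for five minutes

def handle_event(alarms, event_name, node, service, timestamp):
    """Open an alarm on a 'down' event; mark it Cleared on the matching 'up' event."""
    key = (node, service)
    if event_name == "nodeLostService":
        alarms[key] = {"severity": "Warning", "cleared_at": None}
    elif event_name == "nodeRegainedService" and key in alarms:
        alarms[key]["severity"] = "Cleared"
        alarms[key]["cleared_at"] = timestamp

def expunge_cleared(alarms, now):
    """Remove alarms that have been Cleared for at least the retention period."""
    expired = [
        key for key, alarm in alarms.items()
        if alarm["severity"] == "Cleared"
        and now - alarm["cleared_at"] >= CLEARED_RETENTION_SECONDS
    ]
    for key in expired:
        del alarms[key]

alarms = {}
handle_event(alarms, "nodeLostService", "web01", "HTTP", timestamp=0)
handle_event(alarms, "nodeRegainedService", "web01", "HTTP", timestamp=180)
expunge_cleared(alarms, now=200)
print(alarms)  # still listed, with severity Cleared
expunge_cleared(alarms, now=480)
print(alarms)  # five minutes after clearing, the alarm is gone
```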
Performance Data Collection
Outside of traps, the SNMP protocol exchanges performance data, including such data as traffic through a network interface, CPU utilization, and the number of users currently logged on. When a device is discovered by OpenNMS via SNMP, the device is “queried” to determine and record its type. Then, based on the device type, OpenNMS attempts to collect configured data points. If a new, similar device is added, no additional configuration is needed to get OpenNMS to begin collection.
For example, an APC brand uninterruptible power supply (UPS) can support SNMP. If a new APC UPS device is added to the network, OpenNMS identifies the device as an APC UPS and begins collecting information such as the voltage in, the amount of time available on battery power, and even the unit’s temperature, all without additional operator intervention.
OpenNMS can use one of two methods to store time series performance data. One method is via RRDTool, the stalwart library underlying such open source tools as MRTG and Cacti. The other method uses a similar library called JRobin, a Java port of RRDTool. (Since OpenNMS is written in Java, there are performance benefits to using JRobin, although many people use RRDTool since they can then integrate the data that OpenNMS gathers into other tools.)
Collected data can be graphed and displayed via the Web console. In addition, the application can scan the collected data for threshold violations. For example, if available disk space drops below a particular value, or CPU utilization is high for a certain amount of time, events can be generated to notify IT staff.
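One common way to express “high for a certain amount of time” is to require several consecutive samples above a limit before raising an event. The sketch below assumes that approach; `high_threshold_exceeded`, `threshold`, and `trigger` are illustrative names, and the sample values are made up.

```python
def high_threshold_exceeded(samples, threshold, trigger):
    """Return True once `trigger` consecutive samples exceed `threshold`."""
    consecutive = 0
    for value in samples:
        # A sample at or below the threshold resets the run.
        consecutive = consecutive + 1 if value > threshold else 0
        if consecutive >= trigger:
            return True
    return False

# Hypothetical CPU utilization samples (percent), one per five-minute collection.
brief_spike = [35, 97, 40, 38, 42]
sustained_load = [35, 92, 95, 97, 60]

print(high_threshold_exceeded(brief_spike, threshold=90, trigger=3))
print(high_threshold_exceeded(sustained_load, threshold=90, trigger=3))
```

Requiring a run of samples keeps a single busy interval from generating an event, while sustained load still does.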
OpenNMS stores the collected data in discrete files and not in a regular database, so care needs to be taken to size the disk requirements. RAID 5 should be avoided, since the high number of writes to small files can result in a performance hit. For high-end collection, a SAN is recommended. (At one site, OpenNMS is collecting nearly 660,000 data points every five minutes, requiring a NetApp storage system to handle the load.)
In much the same way that the service monitor has various monitor plug-ins, the collector has the ability to add new collection methods. In the development version of OpenNMS, three new collection methods have been added: JMX, HTTP, and NSClient.
- The JMX collector can gather various performance statistics from a Java Virtual Machine (JVM). While the JVM may appear to the operating system as using 512 MB of memory, JMX can tell how much of that memory is actually in use by processes in the JVM, and whether or not more is needed, as well as other JVM specific statistics.
- The HTTP collector can gather numeric data from any Web page. This is especially useful since any application that can write to the document section of the Web server can output data for OpenNMS to collect, without any special SNMP instrumentation. The HTTP collector is a very easy way to integrate a custom application into OpenNMS.
- NSClient is an agent created by the Nagios project. OpenNMS can now connect to that agent and retrieve any arbitrary perfmon counter, which greatly increases the amount of information it is possible to extract from a Windows system (since Microsoft has historically had poor support for SNMP).
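To make the HTTP collection idea concrete, the sketch below pulls a number out of page text with a regular expression. It only illustrates the principle; the page content, the metric names, and `extract_number` are invented for this example, and this is not how the OpenNMS collector is configured.

```python
import re

def extract_number(page_text, pattern):
    """Return the first capture group of `pattern` as a float, or None if absent."""
    match = re.search(pattern, page_text)
    return float(match.group(1)) if match else None

# A hypothetical status page written out by a custom application.
page = "<html><body>active_sessions: 42<br>queue_depth: 7.5</body></html>"

NUMBER = r"([0-9]+(?:\.[0-9]+)?)"
print(extract_number(page, r"active_sessions:\s*" + NUMBER))
print(extract_number(page, r"queue_depth:\s*" + NUMBER))
print(extract_number(page, r"missing_metric:\s*" + NUMBER))
```

Any application that can write a line like these to a Web-accessible file becomes a data source, with no SNMP agent involved.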
OpenNMS is designed to be a very flexible management platform, and many new users are overwhelmed with the amount of custom configuration that is available. But for those new to the application, all that is required is some basic knowledge of installing software on the operating system of choice, and editing of one or two files.
There are always two versions of OpenNMS available: a production (stable) version and a development (unstable) version. The production version has been heavily tested, and the only updates that are made to it concern bug fixes or very small features. The development version often has numerous new features, and the term “unstable” refers not to the relative stability of the application but to the fact that the underlying code may undergo large changes between releases.
Each version has its own installation guide available on the OpenNMS Project Web site. The following is a quick overview of what’s required.
Since OpenNMS is written mainly in Java, a version of Java is required. The release from Sun is recommended, and the application is not known to run under gcj or kaffe. The SDK or JDK should be installed (not just the JRE), since OpenNMS requires the javac compiler. The compiler is required because OpenNMS uses the Apache Tomcat Web servlet engine to create its user interface. Tomcat is a Web server that can use Java code to create dynamic HTML. The installation manuals have a number of suggestions for getting Tomcat installed.
All of the event and node information that OpenNMS gathers is stored in a backend database. PostgreSQL is currently used, although the code is undergoing a transformation to enable it to talk to any database. Once PostgreSQL is installed, a couple of changes must be made to pg_hba.conf (for access control) and postgresql.conf (to set some database variables). Follow the installation guide for the required steps.
The final piece of supporting software is RRDTool.
Once all of the prerequisites are installed, it is pretty easy to install OpenNMS. Go to the download page of the OpenNMS project on SourceForge and choose your distribution. There are usually three packages: opennms (the application itself), opennms-webapp (the Web console), and opennms-docs. The last package is optional.
Once installed, two changes are all that are needed to get OpenNMS up and running. The configuration files for OpenNMS are stored in an etc subdirectory of the OpenNMS home directory, usually /opt/OpenNMS or /opt/opennms.
The first file to change is discovery-configuration.xml. The default looks something like this:
<include-range retries="2" timeout="3000">
The documentation explains most of these options, but the critical parts are the IP addresses in the include-range section. Set begin and end to reflect the network that OpenNMS is to monitor. It is possible to have multiple ranges, as in:
<include-range retries="2" timeout="3000">
<include-range retries="2" timeout="3000">
The first entry scans the Class C network 172.20.1.x, while the second entry scans both 10.1.1.x and 10.1.2.x.
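For readers who want to see what complete entries for those two ranges might look like, the sketch below builds them with Python’s standard XML library. It assumes that each include-range holds its addresses in begin and end child elements, and the specific addresses are chosen here only to match the networks described; check the installation guide for the authoritative format. The helper names `include_range` and `covers` are hypothetical.

```python
import ipaddress
import xml.etree.ElementTree as ET

def include_range(begin, end, retries=2, timeout=3000):
    """Build an include-range element (assumed layout: begin/end child elements)."""
    element = ET.Element("include-range", retries=str(retries), timeout=str(timeout))
    ET.SubElement(element, "begin").text = begin
    ET.SubElement(element, "end").text = end
    return element

def covers(element, address):
    """Return True if `address` falls between the element's begin and end addresses."""
    low = ipaddress.ip_address(element.find("begin").text)
    high = ipaddress.ip_address(element.find("end").text)
    return low <= ipaddress.ip_address(address) <= high

# Illustrative addresses only, chosen to match the networks named in the text.
ranges = [
    include_range("172.20.1.1", "172.20.1.254"),
    include_range("10.1.1.1", "10.1.2.254"),
]

for element in ranges:
    print(ET.tostring(element, encoding="unicode"))

# A single range can span more than one Class C network.
print(covers(ranges[1], "10.1.1.20"), covers(ranges[1], "10.1.2.20"))
```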
The second file, snmp-config.xml, determines which SNMP community strings are used to test if the device supports SNMP. Community strings function like passwords within SNMP to see if a particular request for data should be granted.
The following file attempts to use the community string public by default, but for 192.168.0.5, it uses YrUsonoZ. The string bubba is used for the 192.168.100.x and 192.168.104.x networks.
<range begin="192.168.100.1" end="192.168.100.254"/>
<range begin="192.168.104.1" end="192.168.104.254"/>
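The effect of such a file can be sketched as a lookup from an IP address to a community string. The values below are the ones given in the text; the data layout and the function name `community_for` are assumptions for illustration, not the way OpenNMS resolves the file internally.

```python
import ipaddress

DEFAULT_COMMUNITY = "public"

# Values taken from the example described in the text.
SPECIFIC_HOSTS = {"192.168.0.5": "YrUsonoZ"}
RANGES = [
    ("192.168.100.1", "192.168.100.254", "bubba"),
    ("192.168.104.1", "192.168.104.254", "bubba"),
]

def community_for(address):
    """Return the community string to try for `address`."""
    if address in SPECIFIC_HOSTS:
        return SPECIFIC_HOSTS[address]
    ip = ipaddress.ip_address(address)
    for begin, end, community in RANGES:
        if ipaddress.ip_address(begin) <= ip <= ipaddress.ip_address(end):
            return community
    return DEFAULT_COMMUNITY

for host in ("192.168.0.5", "192.168.100.17", "192.168.104.200", "192.168.1.1"):
    print(host, community_for(host))
```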
These changes are not required to have OpenNMS discover the network and monitor it; the steps are needed only to gather performance data via SNMP. If the community strings on the network are not known, they can be added to this file later.
Now all that is left is to start the application.
$ sudo /etc/init.d/opennms start
Once OpenNMS is running, ensure that Tomcat is running and then point a browser to http://yoursnmphost:8080/opennms. Log in with the username admin and password admin.
Learning More and Getting Help
OpenNMS is supported by a large and helpful community centered around the OpenNMS wiki and the OpenNMS mailing lists. Both resources provide valuable information. In addition, commercial support and services, including training, are provided by The OpenNMS Group.
OpenNMS may lack some of the shine and polish of older commercial applications, but it more than makes up for it in power and flexibility. This has led many to switch from OpenView and Tivoli to OpenNMS. As with most open source applications, it does require one to climb a bit of a learning curve, but the journey is well worth it.