Writing Custom Nagios Plugins with Python

Beef up your sysadmin toolkit by extending the powerful monitoring features of Nagios.

Monitoring the vital signs of more than a handful of servers can rapidly become a full-time job for most systems administrators. Unfortunately, the reality of life as a systems administrator in today’s IT environment is the prospect of having far more projects on your plate than you have the time or energy to tackle. That’s what makes system and network monitoring frameworks like Nagios such a lifesaver — they take away the grunt work of writing and maintaining a localized collection of shell, Perl or Python scripts for systems monitoring. It’s a safe bet to assume that the majority of modern sysadmins tossed their toolkit of scripts into the trash long ago and moved to open-source monitoring solutions like Nagios.

The latest release of Nagios, version 3.2, is a powerhouse of a monitoring system. Properly installed, Nagios lets system and network managers monitor critical components, receive alerts, generate reports and produce trending diagrams, which can be most helpful when trying to convince management to fund systems and network upgrades.

Out of the box, Nagios 3.2 and the accompanying plug-ins distribution contain modules that handle most common monitoring tasks. System availability, service availability, response time, disk and memory utilization — all covered. But what if you want to monitor a custom application or system parameter that Nagios doesn’t know about? No worries. The Nagios plug-in API is just what you’re looking for.

In this hands-on tutorial we’ll walk you through the steps involved in writing a very basic Nagios plug-in for Nagios 3.2. This article is not an introduction to Nagios or Python. We assume you have installed and are comfortable using Nagios and have spent some time developing Python applications. We won’t be tackling any advanced topics, particularly in Python, but we will be touching on subjects that might be a bit unfamiliar for beginners. If you’re new to Nagios or Python, we recommend that you do some reading prior to starting this tutorial. The Quickstart Guide to Nagios is a great intro to Nagios, and for Python novices we suggest the Python Tutorial by Guido van Rossum, the creator of Python.

Distilling it Down

Unlike graphical system monitoring applications like Cacti, Nagios doesn’t deliver hard numerical data to users. Volts, gigabytes and percentages are not spoken by Nagios. Instead, the tool distills information down to the bare essence: ‘OK,’ ‘WARNING’ and ‘CRITICAL’ (plus ‘UNKNOWN’ for checks that cannot determine a state). While this approach may seem counterintuitive at first, its beauty becomes readily apparent with a bit of thought. Rather than getting bogged down by a slew of statistics, system administrators can rapidly drill down to the essence of the issue.
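
Under the hood, each of those states is nothing more than a process exit code, as the listing later in this article shows. Here is a minimal sketch of that contract in Python 3. The names STATES and report are our own illustration, not part of any Nagios API.

```python
# Exit codes that Nagios interprets as service states.
STATES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def report(state, message):
    """Print a one-line status message and return the matching exit code."""
    print("%s: %s" % (state, message))
    return STATES[state]


# A plug-in would typically finish with something like:
#     sys.exit(report("WARNING", "'/' is 85% full"))
```

Keeping the mapping in one place means the rest of a plug-in can talk about states by name and only convert to a number at the very end.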

Your First Plug-in: Free Space Remaining

To demonstrate how to construct a simple plug-in for Nagios we’ll build a short Python script that duplicates the basic job of the ‘check_disk’ plug-in supplied with Nagios. Our script parses the output of ‘df /’ and returns a Nagios-compliant result of ‘OK’, ‘WARNING’, ‘CRITICAL’ or ‘STATE UNKNOWN.’ We’ll be the first to admit that this plug-in is minimal in the extreme. We don’t provide options for submitting run-time parameters and our error handling isn’t as robust as you would want in a production script. For your first attempt at writing a plug-in there’s no need to get bogged down in the details; it’s more important that you grasp the basic concepts. So let’s get started. Here’s the source code for our sample plug-in:

#!/usr/bin/python
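# Written for Python 2: the commands module and the print statement
# used below are not available in Python 3.
import re, sys, commands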

#################
#Set variables
command = "df /"
critical = 95.0
warning = 75.0
#################

#build regex
dfPattern = re.compile('[0-9]+')

#get disk utilization
diskUtil = commands.getstatusoutput(command)

#split out the util %
diskUtil = diskUtil[1].split()[11]

#look for a match. If no match exit and return an
#UNKNOWN (3) state to Nagios

matchobj = dfPattern.match(diskUtil)
if (matchobj):
    diskUtil = float(matchobj.group(0))
else:
    print "STATE UNKNOWN"
    sys.exit(3)

################################
#Uncomment and change
#diskUtil value to test plug-in
#diskUtil = 98.0
################################

#Determine state to pass to Nagios
#CRITICAL = 2
#WARNING = 1
#OK = 0
if diskUtil >= critical:
    print "FREE SPACE CRITICAL: '/' is %.2f%% full" % (float(diskUtil))
    sys.exit(2)
elif diskUtil >= warning:
    print "FREE SPACE WARNING: '/' is %.2f%% full" % (float(diskUtil))
    sys.exit(1)
else:
    print "FREE SPACE OK: '/' is %.2f%% full" % (float(diskUtil))
    sys.exit(0)

Let’s take a moment to review the Python code before we register the plug-in and use it to create a service.

At the top of the application we define the command string called by Python’s commands.getstatusoutput() routine and we set the critical and warning values. Feel free to modify these percentages — these are arbitrary values.

The next section of code executes the command string, splits the returned output on whitespace, picks out the utilization field (the twelfth item, index 11) and uses the regex to strip the ‘%’ sign from the value. If we don’t get a valid return from the regex match we exit with a return value of 3. That’s ‘STATE UNKNOWN’ to Nagios.
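
To see why index 11 lands on the right field, it helps to run the parsing step against a captured sample. The sketch below is written for Python 3 and is only an illustration: parse_use_percent and SAMPLE are hypothetical names, and the numbers in the sample output are made up.

```python
import re

# Matches the leading digits of a field such as "85%".
USE_PATTERN = re.compile(r"[0-9]+")


def parse_use_percent(df_output):
    """Return the Use% value from `df /` output as a float, or None."""
    fields = df_output.split()
    # Seven header words come first, then the data fields:
    # filesystem, blocks, used, available, Use%, mount point.
    if len(fields) < 12:
        return None
    match = USE_PATTERN.match(fields[11])
    return float(match.group(0)) if match else None


# Sample output with invented numbers, used only for illustration.
SAMPLE = """Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda1 10000000 8500000 1500000 85% /
"""

print(parse_use_percent(SAMPLE))  # prints 85.0
```

Returning None instead of exiting keeps the parsing step easy to test; the caller can turn None into the ‘STATE UNKNOWN’ result.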

The final part of the listing is a simple comparison routine that checks the returned value against the defined values for ‘CRITICAL’ and ‘WARNING.’ After doing the comparison the code prints a message to STDOUT and exits with the matching return value for Nagios. That’s all there is to this plug-in. Simple, short and easy to understand. As we mentioned earlier in the tutorial, error checking isn’t as robust as it should be for a production plug-in, and we’ve not provided a way to configure the thresholds at run-time. Those are topics for another tutorial. Meanwhile, let’s get Nagios configured to recognize our new plug-in and attach it to a service so we can see it in action.
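
If you want to experiment with the comparison logic without touching a real disk, one option is to pull it into a function that returns the exit code and message instead of exiting. This is a sketch in Python 3 under our own assumptions; the function name evaluate is hypothetical, and the default thresholds simply mirror the values in the listing.

```python
def evaluate(util, warning=75.0, critical=95.0):
    """Return (exit_code, message) for a utilization percentage."""
    if util is None:
        # No usable reading: report UNKNOWN, as the listing does.
        return 3, "STATE UNKNOWN"
    if util >= critical:
        label, code = "CRITICAL", 2
    elif util >= warning:
        label, code = "WARNING", 1
    else:
        label, code = "OK", 0
    return code, "FREE SPACE %s: '/' is %.2f%% full" % (label, util)


code, message = evaluate(85.0)
print(code, message)
```

A plug-in built this way would print the message and pass the code to sys.exit() as its last step.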

Plug It In

Getting Nagios to utilize your new plug-in is quite easy. Save the script as diskFree.py in the Nagios plug-ins directory (/usr/lib/nagios/plugins in the examples below) and make it executable. Then we’re going to make changes to three files and restart Nagios. That’s all it takes. A note of caution: we’re running Nagios 3.2 on CentOS 5.3. If you’re using an older version of Nagios the installation process will be different.

The first file we’ll edit is /etc/nagios/command-plugins.cfg. Open up the file using your favorite text editor and look for the section beginning with:

# These are some example service check commands.  See the HTML
# documentation on the plugins for examples of how to configure
# command definitions.

command[check_tcp]=/usr/lib/nagios/plugins/check_tcp -H $HOSTADDRESS$ -p $ARG1$
command[check_udp]=/usr/lib/nagios/plugins/check_udp -H $HOSTADDRESS$ -p $ARG1$
...

At the end of this section add the following lines and save the file:

#our diskFree.py example
command[diskFree]=/usr/lib/nagios/plugins/diskFree.py -H $HOSTADDRESS$

Drop down one directory to /etc/nagios/objects and open commands.cfg. Locate the section beginning with:

################################################################################
# NOTE:  The following 'check_...' commands are used to monitor services on
#        both local and remote hosts.
################################################################################

# 'check_ftp' command definition
define command{
        command_name    check_ftp
        command_line    $USER1$/check_ftp -H $HOSTADDRESS$ $ARG1$
        }
...

Insert our command definition:

# 'diskFree' command definition
define command{
        command_name diskFree
        command_line $USER1$/diskFree.py -H $HOSTADDRESS$
        }

Save the file and open up localhost.cfg. Locate this section:

###############################################################################
###############################################################################
#
# SERVICE DEFINITIONS
#
###############################################################################
###############################################################################

and add the entry for our example plug-in:

#define our example diskFree service
define service{
        use                             local-service         ; Name of service template to use
        host_name                       localhost
        service_description             DISK FREE
        check_command                   diskFree
        }

Save the file, and we’re almost done. All that remains is to restart Nagios and to verify that our plug-in is working. Restart Nagios by issuing the following command:

/etc/init.d/nagios restart

If Nagios reports errors, double-check the files we’ve just edited and make sure there are no typos and that you remembered to save your changes. Once Nagios restarts without errors, the final step is to view our new code in action. Using a web browser, navigate to your Nagios installation and select the ‘services’ menu item. You should see the Service Status Details For All Hosts page.

Select ‘localhost’ and make sure that DISK FREE is listed in the Services column. It may take a few minutes for Nagios to execute the command and the status to change from ‘scheduled’ to actively reporting. Below is a screenshot of our test server’s Status Details page. Note that our root filesystem is at 85% capacity, so our new plug-in is reporting a status of ‘WARNING.’ Also note that the information our plug-in is reporting agrees with the distribution-supplied ‘Root Partition’ service. This confirms that our new plug-in is working and reporting accurate results.

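[Screenshot: Status Details page for the test server, showing the DISK FREE service in the ‘WARNING’ state (currier-screenshot.png)]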
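
You can also check a plug-in by hand, without waiting for the web interface, by running it and looking at its exit code and first line of output. The following Python 3 sketch shows one way to do that. The helper run_check is our own invention, and the stand-in command below only imitates a plug-in so that the example runs anywhere.

```python
import subprocess
import sys

STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(argv):
    """Run a plug-in command; return (state name, first line of output)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          universal_newlines=True)
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    # Simplification for this sketch: any unexpected code is shown as UNKNOWN.
    return STATE_NAMES.get(proc.returncode, "UNKNOWN"), first_line


# Stand-in for the real plug-in. On the Nagios host you would pass
# ["/usr/lib/nagios/plugins/diskFree.py"] instead.
fake_plugin = [sys.executable, "-c",
               "import sys; print('FREE SPACE WARNING'); sys.exit(1)"]
print(run_check(fake_plugin))
```

If the state and message you get here match what the Status Details page shows, you know Nagios is passing the plug-in’s result through unchanged.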

Next Steps

We’ve barely scratched the surface of what can be done with custom Nagios plug-ins. As we mentioned previously, we haven’t provided robust error checking or usage messages, and we don’t allow run-time parameter passing. Production plug-ins require significantly more work, but if you’ve successfully completed our tutorial you now have the necessary knowledge to continue your exploration of Nagios plug-ins. The ability to write plug-ins makes a great addition to your resume. And in today’s grim IT jobs market it might just give you the edge you need to grab that new gig.
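
As a starting point for that exploration, here is a rough Python 3 sketch of run-time threshold handling using the standard argparse module. The -w and -c options follow the convention used by the stock plug-ins; the function name parse_args and the defaults are our own assumptions, and -H is accepted only because the command definitions above pass it.

```python
import argparse


def parse_args(argv=None):
    """Parse warning and critical thresholds from the command line."""
    parser = argparse.ArgumentParser(
        description="Check how full a filesystem is (illustrative sketch).")
    parser.add_argument("-w", "--warning", type=float, default=75.0,
                        help="warning threshold in percent (default: 75)")
    parser.add_argument("-c", "--critical", type=float, default=95.0,
                        help="critical threshold in percent (default: 95)")
    parser.add_argument("-H", "--hostname", default=None,
                        help="host address passed by Nagios (unused here)")
    args = parser.parse_args(argv)
    if args.warning > args.critical:
        parser.error("warning threshold must not exceed critical threshold")
    return args


args = parse_args(["-w", "80", "-c", "90", "-H", "127.0.0.1"])
print(args.warning, args.critical)
```

One caveat: argparse exits with status 2 on a usage error, which Nagios would read as ‘CRITICAL.’ A production plug-in would probably want to catch that case and exit with 3 instead.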
