The Health of Your Site, Part 2

In last month’s column, I showed how to create a web site testing tool based on Perl’s own testing framework and the WWW::Mechanize module. For reference, I’ve reproduced the code developed in last month’s article in Listing One. The test code verifies the proper operation of a web site, in this case, http://search.cpan.org.

Listing One: A testing program to diagnose site health

#!/usr/bin/perl
use Test::More tests => 9;
use WWW::Mechanize;
isa_ok(my $a = WWW::Mechanize->new, "WWW::Mechanize");

$a->timeout(1);
$a->get("http://search.cpan.org/");
is($a->status, 200, "fetched /");
like($a->title, qr/The CPAN Search Site/, "/ title matches");

SKIP: {
  ok($a->follow_link( text => 'FAQ' ), "follow FAQ link")
    or skip "missing FAQ link", 2;

  SKIP: {
    is($a->status, 200, "fetched FAQ page")
      or skip "bad FAQ fetch", 1;
    like($a->content, qr/Frequently Asked Questions/, "FAQ content matches");
    $a->back;
  }
}

SKIP: {
  ok($a->form_number(1), "select query form")
    or skip "cannot select query form", 2;
  $a->set_fields(query => "PETDANCE", mode => 'author');
  $a->click();

  SKIP: {
    is($a->status, 200, "query returned good for 'author'")
      or skip "missing author page", 1;
    like($a->content, qr/Andy Lester/, "found Andy Lester");
    $a->back;
  }
}

While invoking this program directly certainly gives an immediate status, it’d be more useful to run this program automatically and frequently. For example, I might invoke this program every five minutes from cron, and mail the results to my cell phone or pager. However, because there’s a lot of output even when everything is OK, I’d get a lot of useless interruptions just to hear that “Everything’s OK.”
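For example, a crontab entry along these lines (the path here is hypothetical) would run the checker every five minutes:

*/5 * * * * /home/merlyn/bin/check-site-health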

Running the program under the standard Test::Harness module helps a bit. Test::Harness interprets the ok and not ok lines appropriately and provides a nice summary at the end, resulting in output something like:

All tests successful.
Files=1, Tests=9
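To try this yourself, a minimal driver using the runtests function that Test::Harness exports (the test filename here is an assumption) could look like:

#!/usr/bin/perl
use Test::Harness qw(runtests);
runtests("health.t");  # prints a summary; dies if any test fails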

However, it’s still hard to reduce the text to pinpoint the errors, or to know whether things were successful, a partial failure, or a total failure. Also, I hate being told the same thing over and over again while a failure persists, and equally hate not being told when something has cleared up. And the text doesn’t squish well into a nice SMS message for my phone or pager.

So, let’s take this one step further. The Test::Harness module inherits its core functionality from Test::Harness::Straps, which is still in development. We can use a Test::Harness::Straps object to invoke the test script and interpret its output, letting us determine programmatically which tests failed. Once we have that, we can tailor the output however we like.
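As a minimal sketch of that (the filename is an assumption, and the result keys are the same ones Listing Two uses):

#!/usr/bin/perl
use Test::Harness::Straps ();

my $strap   = Test::Harness::Straps->new;
my %results = $strap->analyze_file("health.t");
my @details = @{ $results{details} };
my $passed  = grep { $_->{ok} } @details;
printf "%d of %d tests passed\n", $passed, scalar @details;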

One strategy is to test every five minutes (from cron), but send a page only when things are broken, and then only once every thirty minutes. This message can be cut down to precisely show the failing tests and perhaps the associated error output and exit status of the test program. Once the error clears up, the program can page on the next round with a single “All clear,” letting you turn back around and head home instead of finishing your midnight trek into the office to fix the problem.

Of course, there’s more going on than just “All OK” and “Something’s broken.” Each different combination of “Something’s broken” may be worthy of a page. Let’s ensure that only one page per unique combination of events gets sent, and that only one all-clear signal is sent when everything clears up.

Difficult? Not at all, especially when you use Cache::FileCache as a lightweight time-oriented database. The resulting cron-job program is shown in Listing Two.
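If Cache::FileCache is new to you, the core idiom is a set with an expiration time followed by later gets; here’s a tiny sketch (the namespace and key are invented for illustration):

#!/usr/bin/perl
use Cache::FileCache ();

my $cache = Cache::FileCache->new({ namespace => 'demo' });
$cache->set("status", "ok", "10 seconds");   # entry expires automatically
print $cache->get("status"), "\n";           # "ok" while still fresh
sleep 11;
print defined $cache->get("status") ? "fresh\n" : "expired\n";  # "expired"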

Listing Two: Send failure and recovery reports to an administrator

01 #!/usr/bin/perl -w
02 use strict;  # enforce declarations; -w above enables warnings
03 $|++;        #disable buffering of stdout
05 ## CONFIG
07 my $ALL_CLEAR_INTERVAL = "never"; # how often to repeat "all clear" signal
08 my $TEST_FAIL_INTERVAL = "30 minutes"; # how often to repeat test failed
10 sub SEND_REPORT {  # what do I do with a report?
11   ## open STDOUT, "|sendmail 5035551212\@vtext.com" or die "sendmail: $!";
12   @_ = "ALL CLEAR\n" unless @_;
13   print @_;
14 }
18 use File::Temp qw(tempfile);  # core
19 use File::Basename qw(dirname);  # core
20 use Test::Harness::Straps ();  # core
21 use Cache::FileCache ();  # CPAN
23 my $errors = tempfile();
24 open SAVE_STDERR, ">&STDERR" or warn "dup 2 to SAVE_STDERR: $!";
26 my $cache = Cache::FileCache->new({namespace => 'healthcheck_reporter'});
28 chdir dirname($0) or warn "Cannot chdir to dirname of $0: $!";
30 my $strap = Test::Harness::Straps->new;
32 my @failed;
34 for my $test_file (glob "*.t t/*.t") {
35   my %results;
36   {
37     open STDERR, ">&", $errors or print "dup $errors to STDERR: $!";
38     %results = $strap->analyze_file($test_file);
39     open STDERR, ">&", \*SAVE_STDERR or print "dup SAVE_STDERR TO STDERR: $!";
40   };
41   push @failed, map {
42     $results{details}[$_]{ok} ? () :
43       ["$test_file:".($_+1) => $results{details}[$_]{name}]
44   } 0..$#{$results{details}};
45   push @failed, ["$test_file:wait" => $results{wait}] if $results{wait};
46 }
48 if (-s $errors) {
49   seek $errors, 0, 0;
50   local $/;
51   push @failed, ["errors" => <$errors>];
52 }
54 my $key = join " ", map $_->[0], @failed;
56 if ($key) {  # bad report
57   $cache->remove("");  # blow away good report stamp
58   if ($cache->get($key)) {  # seen this recently?
59     ## print "ignoring duplicate report for $key\n";
60   } else {
61     $cache->set($key, 1, $TEST_FAIL_INTERVAL);
63     my @report;
65     for (@failed) {
66       my ($key, $value) = @$_;
67       my @values = split /\n/, $value;
68       @values = ("") unless @values; # ensure at least one line
69       push @report, "$key = $_\n" for @values;
70     }
72     SEND_REPORT(@report);
73   }
74 } else {                        # good report
75   if ($cache->get("")) {        # already said good?
76     ## print "ignoring good report\n";
77   } else {
78     $cache->clear();            # all is forgiven
79     $cache->set("", 1, $ALL_CLEAR_INTERVAL);
81     SEND_REPORT();              # empty means good report
82   }
83 }

Lines 5 to 16 define the configuration section: those things I’m likely to change and therefore want to locate quickly. The two time constants are given in Cache::Cache-compatible units; the module understands values like “15 seconds” or “4 hours.” (See the documentation for details.) $ALL_CLEAR_INTERVAL defines how often a repeat page saying “Everything’s OK” gets sent. If you set this to “1 day,” you’ll get a single page a day saying everything is OK, a nice meta-check that your health check itself is running. By setting it to “never,” you get one page when the monitoring starts the very first time, but never again unless a failure has been fixed. Similarly, $TEST_FAIL_INTERVAL defines how often a page is sent for an identical combination of failures.

Lines 10 to 14 define the callback subroutine that decides what to do when an event of significance occurs. If this subroutine is called with no parameters, it’s the “All clear” signal. Otherwise, the parameters are the current error text. For debugging, I simply display this text on stdout, but by reopening standard output onto a pipe to sendmail, I could just as easily send it to my cell phone or pager, presuming there’s an email gateway.
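For example, to actually page rather than print, you’d uncomment line 11 so the callback becomes something like the following (the address is the gateway example from the listing; exact sendmail behavior varies, so treat this as a sketch):

sub SEND_REPORT {
  open STDOUT, "|sendmail 5035551212\@vtext.com" or die "sendmail: $!";
  @_ = "ALL CLEAR\n" unless @_;
  print @_;
}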

Lines 18 to 21 pull in the four modules needed by the rest of the program. Three of the four modules come with all recent Perl distributions, although you might have to upgrade your Test::Harness from the CPAN if you get an error because of Test::Harness::Straps. The fourth is Cache::FileCache, part of the Cache::Cache distribution.

Lines 23 and 24 create a temporary file and associated filehandle to let me capture the stderr output from the various test programs being run (such as the one in Listing One). The error output is usually diagnostic explanations, and often elaborates on the reasons for failure. The code also saves the current stderr so that it can be restored after every child process, ensuring that die and warn messages end up in the right place.

Line 26 sets up the Cache::FileCache object, creating persistence between invocations.

Line 28 changes the current directory to the directory containing the running script. This permits the paging program to be invoked by an absolute path in cron without having to manage the location of the test scripts, as long as they’re all nearby.

Line 30 creates the Test::Harness::Straps object to interpret the result of one or more testing program invocations.

Line 32 declares the array that collects the failure information and eventually decides what gets reported. The array will contain arrayrefs, where each referenced array has two elements: the first is an identifying (hopefully unique) key for a failing condition, and the second is its associated text. More on this shortly.
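For example, after a bad run, @failed might end up holding something like this (contents invented for illustration):

@failed = (
  ["health.t:4"    => "follow FAQ link"],
  ["health.t:wait" => 256],                 # child exited non-zero
  ["errors"        => "Error GETing http://search.cpan.org/: timeout\n"],
);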

Lines 34 to 46 loop over every test file in either the current directory or a subdirectory named t, similar to the normal Test::Harness module-installation operation. Each of these tests is run separately, although the results are gathered for a single page. Instead of limiting the location of these *.t files to the current directory and one subdirectory, I might also want to use File::Find or a configuration file to define which tests are run. Line 35 declares %results, with the same meaning that the Test::Harness::Straps documentation gives to that variable.

Lines 36 to 40 run a specific test file, using the Test::Harness::Straps object. Because you want to capture the STDERR output from each invocation, you must first redirect stderr into the temporary file, then call the testing harness, then restore stderr from its saved copy. It’s a bit messy, but necessary. Perhaps future versions of Test::Harness::Straps will provide a hook to do this directly.

Once you have the results, there are two things to ask: first, did any of the tests fail? And second, did the test child exit in some unusual manner?

To see if any of the tests failed, look at $results{details}, which references an array of individual tests and their results. Within each element of the array, the ok element of the hashref is true if the test succeeded. If ok is true, simply ignore the test. If it’s false, add a new arrayref that contains an identification name for the test (offset by one because element zero of the array is test one) and the name the test gives itself (usually the text after the comment mark) to @failed. This is all handled nicely by the map in lines 41 to 44.
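If that map is hard on the eyes, here’s an equivalent (if more verbose) rewrite of lines 41 to 44 as a foreach loop:

for my $i (0 .. $#{ $results{details} }) {
  my $test = $results{details}[$i];
  next if $test->{ok};                          # passing tests need no report
  push @failed, ["$test_file:" . ($i + 1) => $test->{name}];
}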

If the child exited badly, add another element with the wait status to @failed in line 45.

Lines 48 to 52 look at the standard error output that’s accumulated from running all of the tests. If any output exists, it’s gathered into another @failed entry, keyed by errors.

When the program hits line 54, all of the individual tests have run, possibly from many different test programs, and the results are in @failed. Next, the code creates a “current problems key” in $key by joining all of the error tags into a space-separated list. If this string is empty, everything went OK; otherwise, you end up with a list like “health.t:4 health.t:wait errors”, showing that test 4 failed, the wait status was bad, and there was some text on stderr from one or more of the children.

Based on this error key string, you can now decide whether to page or not.

Line 56 distinguishes between “Everything’s OK,” and “Something’s wrong.”

If something is wrong, the code in lines 57 to 73 runs. First, remove any marker for “Everything’s OK” in the cache. This ensures that the next time everything is OK, a page is sent that says so.

Line 58 determines whether a page has been sent recently for this particular combination of errors. If so, the cache returns a true value, and nothing further happens on this invocation. (I’ve left the commented-out debugging print at line 59 so you can see where this happens.) Otherwise, it’s time to send a page. First, in line 61, the code sets a cache entry so this particular page isn’t duplicated within the $TEST_FAIL_INTERVAL time window.

Line 63 defines a holder for the report. Lines 65 to 70 process the @failed array, extracting each key/value pair and then prefixing each line of the value with the key for careful labeling. Even if the value is empty, at least one line containing the key is generated.

Line 72 passes this report list into the SEND_REPORT callback, defined at the top of the program. This sends the appropriate report with just the broken pieces of the collective tests.

Lines 75 to 82 deal with an “Everything’s OK” run. First, if an “All OK” signal has already been sent recently enough, there’s nothing to do (again, noted in a commented-out debugging print in line 76). Otherwise, line 78 throws away all of the recently seen failure tags by clearing out the entire cache, and line 79 sets a flag to suppress an additional “Everything’s OK” message until $ALL_CLEAR_INTERVAL has passed.

Line 81 passes an empty list to SEND_REPORT, a signal to the program that it’s time to send the all-clear message.

Although this simple test-reporting tool doesn’t have a lot of fancy features, it illustrates how the basics are accomplished and how reporting can be kept to the essentials. If you want more, there are larger, more complex, and even commercial solutions to notify you when things go wrong. Until next time, enjoy!

Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com.
