
Looking for Lossage

The Linux box currently hosting stonehenge.com is in a rented space at a co-location facility. As a result of the Internet shakeout happening everywhere, the co-lo facility was bought by a larger networking company and we’ve been having network interruptions, including complete loss of service, from time to time. The administrator of the box came to me looking for evidence that these outages had been going on for extended periods of time so that he could take that to the new owner, get some of his money back, and pass the savings along to me.

It occurred to me that I could use the logs from my Web server, and it would have been nice to use the database-driven logs as I did in last month’s column, but I needed data from a year ago. I did have the database dumps, but not enough disk space to restore all the data. Fortunately, I also had the standard Apache access_log files broken up by day, compressed separately and dating all the way back to my first hosting machine. It was from these files that I was able to glean the data.

First, let’s consider the requirements. I wanted an hour-by-hour hit count for a given range of dates. I also wanted to filter out any local hits, since those hits don’t reflect network connectivity issues. I also wanted to be reasonably efficient about date parsing. That ruled out running Date::Manip‘s general date recognition routines or anything else in the inner loop of the analyzer. The output did not have to be pretty — just the hit count for each hour from which we would apply an interpretation. A program meeting all of my requirements is presented in Listing One.




Listing One: webhits.pl – Part 1


1 #!/usr/bin/perl -w
2 use strict;
3 $|++;
4
5 use Date::Calc qw(Decode_Month Mktime Decode_Date_US);
6 use File::Find;
7
8 ## begin config
9
10 my @DIRS = glob "/home/merlyn/Logs/OLD.*/.";
11 my $START = "1/1/2001";
12 my $END = "1/1/2002";
13
14 ## end config
15
16 for ($START, $END) {
17   my @y_m_d = Decode_Date_US($_) or die "Cannot decode $_";
18   $_ = Mktime(@y_m_d, 12, 0, 0); # noon on that day
19 }
20
21 my @files;
22 find sub {
23   return unless /^access_log.*gz$/;
24   return unless -f;
25   my $mtime = (stat(_))[9];
26   return unless $mtime >= $START and $mtime <= $END;
27   push @files, $File::Find::name;
28 }, @DIRS;
29
30 my %count;
31
32 my %month_name_to_number;
33
34 while (my @chunk = splice @files, 0, 10) {
35   open IN, "zcat @chunk |" or die;
36   while (<IN>) {
37     my ($day, $month, $year, $hour, $minute, $second) =
38       m{
39         ^
40         (?!web\.stonehenge\.com)
41         [^\[]*
42         \[
43         (\d+) \/ (\w+) \/ (\d+)
44         :
45         (\d+) : (\d+) : (\d+)
46       }x or next;
47     $month = $month_name_to_number{$month}
48       ||= sprintf "%02d", Decode_Month($month);
49     $count{"$year $month"}{0+$day}{0+$hour}++;
50   }
51 }
52
53 for my $year_month (sort keys %count) {
54   for my $day (sort {$a <=> $b} keys %{$count{$year_month}}) {
55     for my $hour (0..23) {
56       my $hits = $count{$year_month}{$day}{$hour} || 0;
57       printf "%s %02d %02d: %d\n", $year_month, $day, $hour, $hits;
58     }
59   }
60 }


Lines 1 through 3 start nearly every Perl program I write, enabling warnings for development, turning on the compiler restrictions, and unbuffering standard output. Line 5 pulls in a few selected routines from the Date::Calc module (found in the CPAN). Specifically, I needed to convert a month name to a number, a time stamp to an epoch time, and a human-readable date to a month-day-year value. Line 6 pulls in the standard File::Find module.
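
If those routines are unfamiliar, here's a minimal standalone sketch of the three of them (the date values are made up purely for illustration):

use Date::Calc qw(Decode_Month Mktime Decode_Date_US);

my $month_number = Decode_Month("Mar");           # 3
my @y_m_d        = Decode_Date_US("3/14/2002");   # (2002, 3, 14)
my $epoch        = Mktime(@y_m_d, 12, 0, 0);      # epoch seconds for noon (local time) that day
print "$month_number / @y_m_d / $epoch\n";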

Lines 10 through 12 provide a few configuration constants that I was tweaking during development. The @DIRS variable gives the top-level directories that contain my Web logs as starting points for a "find" operation. $START and $END give the approximate range endpoints for the analysis. They aren't precise because my log-roller doesn't run exactly at midnight, and the endpoints are merely used to determine which logfiles to begin parsing with. The endpoints can be given in any format that Date::Calc's Decode_Date_US() routine can interpret.

Lines 16 to 19 convert the human-readable string into an epoch time, first by calling Decode_Date_US(), and if that works, by turning that year-month-day value into a Unix-epoch-based value. I'm using a trick here of assigning to the for loop control variable to alter the original variable being examined, thus changing the values of $START and $END from strings into epoch times.
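
That aliasing trick stands on its own; here's a tiny sketch (with throwaway variable names) showing that assigning to the loop variable really does rewrite the originals:

my ($start, $end) = ("1/1/2001", "1/1/2002");
for ($start, $end) {
    $_ = "converted($_)";   # $_ is an alias, so this modifies $start and $end
}
print "$start\n$end\n";     # both variables now hold the converted strings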

Lines 21 to 28 locate the logfiles to be parsed. If the name matches the right pattern (line 23), and it's a file (line 24), then the stat buffer's modification time value is saved (line 25). (The -f file test on line 24 performs an implicit call to stat() and the results are stored in the stat buffer. "_" refers to the stat buffer.) If the modification time value is within range (line 26), it saves the full pathname to @files (line 27).
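
The "_" filehandle deserves a quick illustration of its own. In this sketch (using an arbitrary path as a stand-in), the file is stat()ed only once, even though we ask two more questions about it:

my $path = "/etc/passwd";                  # any existing file will do for the demo
if (-f $path) {                            # this file test fills the stat buffer
    my ($size, $mtime) = (stat(_))[7, 9];  # reuse the buffer: size and mtime, no new stat()
    print "$path is $size bytes, last modified at epoch time $mtime\n";
}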

Line 30 is the report accumulator, counting the number of hits in a given hour. The data structure within %count will be described shortly. Line 32 is a cache of monthname-to-number, as returned by Decode_Month(). This is necessary because I don't want to call that subroutine for every line, so I cache the results for speed.

Line 34 begins a loop that operates on 10 of the files at a time, pulling them out as a chunk; the final chunk may contain fewer than 10 files. Line 35 launches the zcat command on the chunk of files. The ordering and clustering of the files are not relevant, since we're ultimately going to process all the files and update %count before generating any reports.
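
The splice-in-a-while idiom is handy whenever a list needs to be consumed in batches (here, presumably to keep each zcat command line to a manageable length). A throwaway sketch of just that idiom:

my @items = (1 .. 10);
while (my @chunk = splice @items, 0, 3) {   # remove up to 3 elements from the front
    print "processing: @chunk\n";           # prints 1 2 3, then 4 5 6, then 7 8 9, then 10
}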

Lines 36 to 50 process each Apache access_log-style line from the logs within range. Each log line looks a bit like this:


foo.example.com - - [14/Mar/2002:13:18:55 -0800] "GET / HTTP/1.0" 200 3340

Using an extended-syntax regular expression in lines 38 to 46, I locate the lines that do not begin with my Web host name (to ensure the hit is a remote hit) but contain a date stamp in the normal form. The six memory references (day, month, year, hour, minute, and second) are returned as a list, assigned to the variables in line 37.
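
To see the regular expression in action, here's a self-contained sketch that runs it against the sample line above (a fixed test string standing in for real log data):

my $line = 'foo.example.com - - [14/Mar/2002:13:18:55 -0800] "GET / HTTP/1.0" 200 3340';
if (my ($day, $month, $year, $hour, $minute, $second) = $line =~ m{
      ^
      (?!web\.stonehenge\.com)   # reject hits that begin with the local host name
      [^\[]*                     # skip ahead to the bracketed timestamp
      \[
      (\d+) \/ (\w+) \/ (\d+)    # day / month name / year
      :
      (\d+) : (\d+) : (\d+)      # hour : minute : second
    }x) {
    print "$day $month $year, $hour:$minute:$second\n";   # 14 Mar 2002, 13:18:55
}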

Lines 47 and 48 turn the month name into a number for easier sorting. First, an attempt is made to locate it in the cache hash. If it’s there, that value overwrites the previous $month value. Otherwise, Decode_Month() is called, and the result padded to two digits for easy string-based sorting. The ||= syntax says that if the statement so far evaluates to “false” (that is, there was no corresponding value in the cache hash), then execute the rest of the statement, which assigns the result to the cache, which in turn passes it to the current $month variable.
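
Here's the same caching idiom in isolation, with a hypothetical slow_lookup() standing in for Decode_Month(); the "looking up" message appears only once per distinct name, which is the whole point of the cache:

my %cache;

sub slow_lookup {                       # stand-in for Decode_Month(); pretend it's expensive
    my %table = (Jan => 1, Feb => 2, Mar => 3);
    print "  (looking up $_[0])\n";
    return $table{$_[0]};
}

for my $name (qw(Jan Feb Jan Mar Feb Jan)) {
    my $number = $cache{$name} ||= sprintf "%02d", slow_lookup($name);
    print "$name => $number\n";
}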

Line 49 creates a multi-level data structure. At the top-level, the key is the year and month number. Next we have the day number, converted to a pure number by adding zero, and finally the hour number, once again converted to a pure number by adding zero. The appropriate entry is incremented to show one hit during this period of interest. (I’m using hashes for all the levels because I anticipated sparse data. In fact, nearly every hash element in sequence was being used, so a hash of arrayrefs of arrayrefs would probably have been a better choice. But this was “fast enough,” so I stopped optimizing.)
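
For a picture of the shape of %count, here's a small sketch that fakes a few hits and dumps the result (the keys and counts are invented):

use Data::Dumper;

my %count;
$count{"2001 03"}{14}{13} += 2;   # two hits at 1 pm on March 14, 2001
$count{"2001 03"}{14}{14}++;      # one hit an hour later
$count{"2001 04"}{2}{0}++;        # one hit just after midnight on April 2, 2001
print Dumper(\%count);
# roughly:
#   $VAR1 = {
#     '2001 03' => { '14' => { '13' => 2, '14' => 1 } },
#     '2001 04' => { '2'  => { '0'  => 1 } },
#   };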

Finally, it’s time to show the results using three nested for loops. The outer loop walks through a string-sorted list of keys, representing the year/month pairs for which data has been made available. The middle loop (starting in line 54) figures out which days of that month are present and sorts them numerically. The inner loop (starting in line 55) forces a scan for every hour of that day.

Since some of the hours have 0 hits, we don't want to skip over them in the output; by forcing an hourly scan, we can see the zero-hit hours as well. I knew the outage was never longer than a day, or I'd have already moved my hosting service to another location. Line 57 displays the data and the task is complete.
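
The resulting report looks something like this (the counts here are invented; a real outage shows up as a run of zero-hit hours):

2001 03 14 00: 57
2001 03 14 01: 42
2001 03 14 02: 0
2001 03 14 03: 0
2001 03 14 04: 61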

With the report in hand, my administrator is off to the co-lo owner, hopefully resulting in some returned money. Until next time, enjoy!



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and can be reached at merlyn@stonehenge.com. Code listings for this column can be found at http://www.stonehenge.com/merlyn/LinuxMag/.
