A Web Server Log Database, Part Two

A small amount of code produces a logger process to write web server log entries to a database.

In last month’s “Perl of Wisdom” (available online at http://www.linux-mag.com/2006-01/perl_01.html), I showed how my Apache 2 processes use mod_perl2 to write logging information to a logger process. This month, let’s continue the discussion and look at the inner workings of the logger.

Connecting the Plumbing

The logger is launched with its stdin connected to a pipe created by the master Apache process. When the Apache master process spawns each of its child processes, the pipe is inherited by each child. Thus, each child can write information to the logger. (Conceptually, the plumbing looks something like Figure One.)

FIGURE ONE: All of the Apache server processes can write to the logger

During the logging phase, each child writes a line to the (shared) pipe that looks like:

<field1=hexvalue field2=hexvalue field3=hexvalue>

Each field name (field1, field2,…) in the string corresponds to a column in the logging database, and the hexvalues are the corresponding values, hex-encoded so that arbitrary bytes (including whitespace and angle brackets) can travel safely inside the record.
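For instance, a record carrying a URL and a status might look like this (hypothetical values: 2f696e6465782e68746d6c is the hex encoding of /index.html, and 323030 is 200):

<url=2f696e6465782e68746d6c status=323030>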

Each server child process writes to this pipe asynchronously, without performing any sort of locking. While that sounds problematic, POSIX systems guarantee that a single write() of no more than PIPE_BUF bytes (at least 512, and typically 4,096 on modern Unix-like systems) is performed atomically, so the logger process should get discrete records. To be certain, however, each record is bracketed by < and >, which is easily parsed to verify that one and only one complete record was read. (In the months that I’ve been using this system, though, I have yet to see a failed record, so I believe this system to be reasonably safe.)
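For reference, here’s a minimal sketch of what the writer side (covered last month) might do. The filehandle and field data here are hypothetical, but the hex encoding must mirror the pack("H*", ...) decode in Listing One:

## Minimal writer-side sketch (assumed; mirrors Listing One’s decoder).
## Hex-encoding with unpack "H*" keeps spaces, "=", "<", and ">"
## in the data from breaking the record framing.
my %fields = (url => '/index.html', status => '200');  # hypothetical values
my $record = join ' ',
  map { "$_=" . unpack('H*', $fields{$_}) } sort keys %fields;
syswrite $log_pipe, "<$record>\n";  # one write() call stays atomic under PIPE_BUF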

The logger process needs to remain alive as long as the Apache children are alive. An end-of-file on the pipe indicates that it’s safe to quit. But what about errors? If the logger process dies, there’s no easy way to replace it (other than restart the Apache server), because the pipe from the parent Apache process isn’t a named pipe. (Making the interface a named pipe might be an interesting idea for a future version of this program.)

Because this input pipe was “sacred” and the logger needed to run “forever”, I decided to implement a watchdog process pattern. The logger program forks a child to do the actual work, and watches for the child to exit. If the child exits cleanly, so does the watchdog (such as when the child reads EOF because Apache is shutting down). However, if the child dies a violent death, the watchdog respawns a new child. And because the watchdog retains the stdin from the Apache logging pipe, the new child also inherits this connection, including any unprocessed data, and can pick up where the exited child left off.

As long as the watchdog itself doesn’t have any unusual exit conditions, we’re guaranteed a fair degree of safety, so it’s important to make sure the watchdog does as little as possible. The program that implements both the watchdog and the database writer is found in Listing One.

LISTING ONE: The logger process

01  #![* env.${"PERL"} *] -w
02  use strict;
03  $|++;
04
05  [* USE date; FILTER format('### %-70s ###') -*]
06  Processed automatically at [* date.format(date.now) *] from
07  [* template.name *] edited at [* date.format(component.modtime) *]
08  [*- END *]
09
10
11  ## config
12  my $DSN = 'dbi:Pg:dbname=web';
13  my $DB_TABLE = 'requests';
14  my $DB_AUTH = 'Username:Password';
15  ## end config
16
17  use DBI;
18
19  {
20    defined (my $pid = fork) or do { warn "can't fork: $!"; sleep 5; redo };
21    if ($pid) {
22      ## parent does...
23      waitpid($pid, 0);
24      last unless $?;  # clean exit means I can go away too
25      warn "logger died with $?, restarting...";
26      sleep 5;  # do not thrash!
27      redo;
28    }
29    ## child does...
30
31    my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH),
32                           { RaiseError => 1 });
33
34    while (<STDIN>) {
35      ## warn "$0 saw $_";
36      s/^<(.*)>$/$1/
37        or warn("skipping $_\n"), next;
38      my @fields;
39      for my $field (split) {
40        $field =~ /^(\w+)=([0-9a-fA-F]*)$/
41          or warn("skipping $field in $_\n"), next;
42        my ($key, $value) = ($1, pack("H*", $2));
43        push @fields, [$key, $value];
44      }
45      my $INSERT =
46        "INSERT INTO $DB_TABLE (".
47        (join ",", map $_->[0], @fields).
48        ") VALUES(".
49        (join ",", ("?") x @fields).
50        ")";
51      my $sth = $dbh->prepare_cached($INSERT);
52      $sth->execute(map $_->[1], @fields);
53    }
54
55    $dbh->disconnect;
56    exit 0;
57  }

Canvassing the Code

Lines 1-3 are my traditional program start. Because this program is processed as a Template Toolkit file during the server build process, anything marked with [* ... *] is processed and replaced with the appropriate text. In this case, the PERL variable from the environment provides the appropriate “she-bang” line. This variable is set in the GNU Makefile as:

export PERL ?= $(shell which perl)

Hence, by default, an appropriate path to perl is determined. This is important because I’ve left the Mac OS X Perl alone on my box, preferring instead to install local modules into a local bleeding-edge Perl that I’ve installed in /opt/perl/snap/bin/ on my machine. However, when I push this code up to the server, I need to use the traditional /usr/bin/perl path instead.
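So, on my development box, the template pass renders line 1 of the listing as:

#!/opt/perl/snap/bin/perl -w

while a build on the server renders it as:

#!/usr/bin/perl -w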

Lines 5-8 create a header on the resulting templated file, giving the original file name, processing time, and the modification time. I’ve discovered that when I generate a file that looks like a source file, it’s important to mark the file in some way, lest I try editing it and then wonder why my changes disappear later.
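The rendered header ends up looking something like this (the timestamps and template name here are hypothetical):

### Processed automatically at 14:02:31 06-Feb-2006 from                 ###
### logger.tt edited at 09:15:00 03-Feb-2006                              ###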

Lines 11-15 provide the database information to connect to my PostgreSQL server. Obviously, the username and password here are fictionalized. These items are used in Line 31 to connect to the database.

Line 17 pulls in the DBI module.

Lines 19-57 form a loop for the basic watchdog algorithm.

Line 20 attempts to create a child process to do the real work. If this fails, a message is logged to stderr, which makes the error message appear in the Apache error log. After five seconds, the spawn is tried again. The delay ensures that the logger doesn’t get into a tight loop, burning CPU by trying to fork again and again.

Upon reaching Line 21, both a parent (the watchdog) and its child (the actual logging process) are running in parallel. However, by convention, $pid will be zero in the child, and non-zero in the parent, so it’s time for them to start acting differently. The parent process waits for the child (Line 23).

If the child process exits cleanly, then $? is zero, and the parent process breaks out of the loop (Line 24), resulting in a clean exit. However, if the child broke somehow, Line 25 logs a warning message, and the loop restarts the child. Again, a brief delay of five seconds ensures there’s no thrashing if something really horrible is going wrong out there.

The parent code executes either the last in Line 24 or the redo in Line 27, so it can never make it to Line 29, which marks the beginning of the child code, where the real work gets done.

Lines 31 and 32 set up my database handle (called $dbh for traditional reasons). Because I’ve set RaiseError, any bad operations (including setting up the initial connection) trigger an exception. Because these exceptions aren’t caught (no eval blocks in this code), the entire child process will die, and the watchdog will replace it. Simple, but effective.

Lines 34-53 read a line at a time from stdin. Because this child process inherits stdin from the parent (watchdog) process, and the parent process got its stdin as the pipe connection to Apache for logging, this loop executes once for each log record.

Line 35 was useful during debugging to ensure that the right data was being read by the child logger. The Apache error log showed that I was getting the right information at the right time. Of course, in production, this would be excessively noisy (turning the error log into an access log), so I commented out the line, leaving it in place in case I need it again in the future.

Lines 36 and 37 extract the keyword and hex value payload from the line. If the line doesn’t have the proper angle bracket markers, we report it and ignore it. This might happen if the writes weren’t atomic, for example.

Line 38 sets up an array to hold the list of fields to be inserted.

Lines 39-44 break up the line by whitespace, causing $field to repeatedly contain a column name and hex value for that column. Lines 40 and 41 break up the $field into the column name and hex value as $1 and $2. If that fails, the item is logged and ignored.

Line 42 grabs the column name for $key and the hex value in the original string for $value. Line 43 pushes that into the @fields array for final insertion.

Once the value extraction is complete, it’s time to create the insert handle. Lines 45-50 create a string that looks like:

INSERT INTO requests
(stamp, url, status)
VALUES (?, ?, ?)

The number of question marks corresponds to the number of fields in this request. These question marks are DBI placeholders that will be given values during the execute step.

Line 51 prepares the SQL, using prepare_cached(). Most of these inserts will likely use the same column names, so the prepare_cached() step will be very fast. However, for maximum flexibility, each request is considered separately. Note that the logger doesn’t even have to know the names of the columns.
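As a quick illustration of why this matters (a hypothetical single-column insert, not from the listing), prepare_cached() hands back the same statement handle when it sees the same SQL text again, so the prepare cost is paid only once per distinct column set:

## Hypothetical illustration of DBI's prepare_cached() reuse:
my $sth1 = $dbh->prepare_cached("INSERT INTO requests (url) VALUES (?)");
my $sth2 = $dbh->prepare_cached("INSERT INTO requests (url) VALUES (?)");
print "reused\n" if $sth1 == $sth2;  ## same handle; the second prepare was free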

Once a good statement handle is in $sth, it’s time to perform the insert in Line 52. Each of the values extracted from @fields provides the value for the corresponding column selected in the SQL earlier.

When the loop reading from stdin exits, there is no more input. At that point, the Apache server is no longer talking to us, so it’s time to leave quietly in Line 56. The 0 (“zero”) exit code ensures that the watchdog also knows to exit quietly.

And there you have it: a logging aggregator that can take log information from nearly any number of Apache child processes and write the messages to the database using a single connection handle.

Convenience is King

Now that this logging agent is in place, I can get near-real-time information from my database about the performance of my webserver.

For convenience, I’ve created a couple of views, front and back, to select the front or back server cluster:

CREATE VIEW front AS
  SELECT requests.* FROM requests
  WHERE server = 'P';
CREATE VIEW back AS
  SELECT requests.* FROM requests
  WHERE server = 'B';

From these views, I can ask about all front requests that have resulted in 404s in the past 24 hours:

SELECT referer, url, count(*)
FROM front
WHERE stamp > 'yesterday' AND status = 404
GROUP BY referer, url

Or, I can see which user agents (types of browsers or robots) have burned the most CPU on my system in the past five minutes (possibly pointing out some badly behaved spider):

SELECT SUM(cpuuser+cpusys+cpucuser+cpucsys), useragent
FROM requests
WHERE stamp > now() - interval '5 minutes'
GROUP BY useragent
ORDER BY 1 DESC

I could also implement some bandwidth restrictions by penalizing any heavy downloaders by IP address using a query similar to:

SELECT SUM(bytes), host
FROM front
WHERE stamp > now() - interval '1 minute'
GROUP BY host
ORDER BY 1 DESC

Since my pages are dynamically generated, it’s nice to know which of my pages could use some optimization:

SELECT SUM(cpuuser+cpusys+cpucuser+cpucsys), vhost || url
FROM requests
WHERE stamp > now() - interval '1 day'
GROUP BY vhost || url
ORDER BY 1 DESC

Lots of useful queries! I’m still discovering more every day.

I hope you find the code useful. Until next time, enjoy!

Randal Schwartz is the chief Perl guru at Stonehenge Consulting. You can reach Randal at merlyn@stonehenge.com. You can download the code shown in this month’s column from http://www.linux-mag.com/downloads/2005-06/perl.pl.txt.
