A Web Server Log Database

Learn how to build a better web server log using your own mod_perl handler.
The server for Stonehenge Consulting (http://www.stonehenge.com) and a few other sublease customers like Geek Cruises uses Apache and mod_perl.
Because the mod_perl processes tend to be heavy, I’ve installed the traditional caching, reverse-proxy server stack so that each mod_perl process can be freed the moment it’s completed its task, and so some lightweight hits (such as images and icons) can be served without touching the backend at all.
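The proxy wiring itself is ordinary mod_proxy configuration; a minimal sketch of the front server’s side of it (the backend port and paths here are hypothetical) might be:

## serve images directly from the lightweight front server...
ProxyPass /images !
## ...and relay everything else to the heavyweight mod_perl backend:
ProxyPass / http://localhost:8080/
ProxyPassReverse / http://localhost:8080/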
For many years, I’ve been logging the backend processes directly to a database with DBI, using a custom mod_perl logger-phase handler. However, because the front processes didn’t have mod_perl, the only logging I could get was a traditional flat file.
Recently, I upgraded to Apache 2.0 and mod_perl2 (which is a bit of a story in itself — perhaps for another time). This required updating my DBI logger a bit. At the same time, I was also asked to track users to the Geek Cruises web site, so I wanted to add mod_usertrack and also add a way to link frontend hits with backend hits for complete correlation.
At first, I just enabled the DBI logger in the front servers, but within a few days, things started breaking. Because I use PostgreSQL, each child process in both the front (many) and back (few) servers was maintaining a separate connection to the database, causing both out-of-memory and out-of-connections errors. Even worse, if the number of connections was teetering near the limit, a new connection for the Geek Cruises registration or administration applications would fail, leaving either the customer or the coordinator, respectively, upset and frustrated.
So, I quickly concluded that rather than have separate connections from each child to the database, I needed some sort of aggregator. I noticed that the automatic “log roller” application included with Apache was basically a separate process forked during the startup phase, one whose pipe is shared by all of the children. After studying the C source code to see how that was accomplished, I decided I could do the same thing in a few lines of Perl. The result is that I now have only two connections to the database: one used by the front cluster of servers, and the other used by the back cluster. My PostgreSQL is happy, and so are the folks at Geek Cruises.
The code from the mod_perl handler is in Listing One. The separate logger program will be presented in next month’s column.
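In the meantime, here’s a minimal sketch of what such an aggregator might look like, assuming the framed columnname=hexstring record format described below (the database name and connection parameters are hypothetical, and the real logger appears next month):

#!/usr/bin/perl
## sketch only: one aggregating logger, one database connection,
## reading framed records from all of the Apache children via STDIN
use strict;
use DBI;

my $dbh = DBI->connect("dbi:Pg:dbname=weblog", "", "",
                       { AutoCommit => 1, RaiseError => 1 });

while (<STDIN>) {
  next unless /^<(.*)>$/;    # accept only clean, complete records
  my %data = map { my ($k, $v) = split /=/, $_, 2; ($k, pack "H*", $v) }
    split " ", $1;
  my @cols = sort keys %data;
  $dbh->do("insert into requests (" . join(",", @cols) . ") values (" .
           join(",", ("?") x @cols) . ")", undef, @data{@cols});
}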
Line 1 marks the code as Perl code, which makes Emacs switch to the right editing mode. Line 2 establishes the package for my handler. (I put locally developed modules into the Stonehenge:: prefix to prevent collisions with CPAN modules.) Line 3 enables the traditional compiler restrictions. Line 5 reminds me what I need to do to use this module. I try to leave hints like this to remind myself what I was thinking at the time. In this case, I’m reminded of the necessary line in the httpd.conf file.
Line 9 defines a pnote name that will be used to hold the initial wall-clock and CPU time values. [For more about pnotes, see Recipe 21.13, “Sharing Information Between Handlers,” in The Perl Cookbook, Second Edition.] I’ll set this up as an array ref at the beginning of the service cycle, and when we arrive at the logging phase, I can pull the values back to see what to subtract from the current values to get the deltas.
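In miniature, the pattern used in this module is:

## at PostReadRequest time, store the starting values:
$r->pnotes($PNOTE_IDENTIFIER, [time, times]);

## later, in the Log phase of the same request, fetch them back:
my @start = @{ $r->pnotes($PNOTE_IDENTIFIER) || [] };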
Line 11 defines the external program to accept the logging values and write them into the database. When Listing One is processed through the Template Toolkit (as described in the July 2002 “Perl of Wisdom” column, available online at http://www.linux-mag.com/2002-07/perl_01.html) during the installation process, the value inside the brackets is replaced by the path to the root of the installation tree. Beneath that root, I have the traditional sbin directory for things like apachectl, and I’ve added the logger program as well. (Again, the logger program will be described next month.)
Line 15 and Line 16 import the constants and the ht_time utility function needed later in this module.
Line 18 declares the variable that will hold the handle for the pipe to the logger program. This handle will be set up in the parent process, before Apache forks each child, and will thus be shared with the children. This is the key to getting only one logger to activate for each cluster of front and back servers.
Lines 20-44 define the subroutine that will be triggered in the parent Apache process just after the configuration files have been read, but before Apache begins forking children. Line 21 is copied from the documentation and provides me with things I don’t understand, so I mostly ignore them. However, $s is the server handle and is useful.
Line 23 can be uncommented so that I can see that my handler is being executed in the right process (and only once).
Lines 25-33 set up the handlers for the start-of-transaction (PostReadRequest) and end-of-transaction (Log) phases. I loop through all of the servers, skipping over any that are virtual. There should be just one real server, and this is where I want to add the other phase handlers, which I verified by uncommenting Line 28. The corresponding handlers are located in the current package with names that correspond to the phase name.
Line 35 and Line 36 establish the pipe to the child process for the logger program. Death here means that my master server process won’t start, so this logger is critical to the entire operation. Lines 37-41 unbuffer this pipe.
Line 43 returns an OK status, which is probably ignored anyway, because we’re not in a phase that’s actually serving a response.
Once the child Apache process has accepted a connection and read the headers, the handler beginning in Line 46 is executed.
Line 47 grabs the request object as the only parameter to the handler. Line 48 rejects subrequests, because you don’t want to reset the counters when a page triggers a subrequest, such as when processing a server-side include or checking the MIME type of a file in a directory listing.
Line 50 creates a 5-element list, consisting of the Unix-epoch wall-clock time, and the four accumulated CPU times (user CPU, system CPU, child-user CPU, and child-system CPU). These values form the “start” time for our current hit. Line 52 pushes these values into the request pnotes structure as an array reference, so that we may access the values later.
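For example, with hypothetical values:

my @times = (time, times);
## e.g. (1060000000, 0.52, 0.11, 0.03, 0.01), which is the wall clock,
## user CPU, system CPU, child-user CPU, and child-system CPU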
Line 54 may be uncommented to verify that the right values are being stored.
Line 55 returns DECLINED so that other handlers at this phase also run. (I suspect returning OK would work as well, but DECLINED is safer and more familiar.)
Line 58 starts the handler that is called at the end of the request cycle, during the Log phase. Line 59 grabs the request object to get further information. Line 60 can be uncommented to see that you’re actually getting into this handler.
Lines 62-70 reap any child processes that have already exited before the code performs the logging. Usually there are no children waiting to be reaped, so this is typically a no-op, but if some children have exited and would otherwise be reaped in the cleanup phase, I want to make sure that their child CPU values don’t get charged to the wrong process.
Line 72 extracts the 5-element list of time values saved during the earlier handler. If they don’t exist, Line 74 issues a warning, and Line 75 returns immediately, because there are no valid values for this hit. I’ve seen this message triggered occasionally, such as when a request comes in but the browser disconnects before sending a complete header.
Line 79 subtracts the values just obtained from the current values, using a mapping trick I figured out one day: as map walks through the new list of current values, shift consumes the old list in parallel, pairing each new value with its old counterpart and subtracting the old value from the new.
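The same idiom in isolation, with made-up numbers:

my @old = (1060000000, 0.50, 0.10, 0.00, 0.00); # saved at request start
my @new = (1060000007, 0.75, 0.15, 0.00, 0.02); # current (time, times) values
my @delta = map { $_ - shift @old } @new;       # yields (7, 0.25, 0.05, 0, 0.02)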
Lines 81-82 attempt to extract the $orig (original) and $last request records from the current request record. When a request receives an internal redirect, such as for a 404 handler, an additional request record is added to the chain. I was having difficulty finding the values I wanted, as you’ll see in the code a few lines later, so I decided to try looking at most of the likely elements of the request chain.
Line 83 extracts the server information object from the original request. Similarly, Line 84 extracts the connection object.
Line 85 fetches the server value, so that I can tell whether this is the lightweight reverse-proxy server or the heavyweight backend server. In the httpd.conf, I have…
PerlSetVar "DBILogServer" "[* kind *]"
… where kind expands to P for the proxy server and B for the back server. This string ends up in $server for logging.
Lines 87-108 remind me of the SQL that I used to create the requests table in my database. To load that data up, I create a hash %data in Line 112 and fill it in Lines 114-136. Each key corresponds to a column name. The logger program doesn’t know the column names, but simply takes key-value pairs and creates an insert for those names.
Lines 114-120 try to find my unique-id (added by mod_unique_id). I was having trouble finding it on some of the requests, so this is probably overkill now, but I’m impatient. Similarly, Lines 121-124 try to find the cookie I created with my own Perl version of mod_usertrack.
Line 125 gets the timestamp of the beginning of the request. Similarly, Lines 126-135 fetch other parts of the information about the request. Line 136 uses a hash slice to set the five values for the wallclock time and CPU times all at once.
Once the hash has been created, I need to create a simple string from the data so that it can be written to the logger, which is expecting line-oriented input. Lines 138-143 create a space-separated list of values that look like columnname=hexstring, where the hexstring is the hex-encoded value. I have to encode the value because it might contain spaces and other odd characters.
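A round trip through that encoding, for illustration:

my $encoded = unpack "H*", "two words"; # "74776f20776f726473"
my $decoded = pack "H*", $encoded;      # "two words" again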
Line 146 and Line 147 write the data line to the logger handle, enclosed in angle brackets, which I use as start and end markers in the logger to make sure I got a clean record. Modern Unix-like operating systems guarantee that a write to a pipe smaller than a certain size (PIPE_BUF, typically 4,096 bytes) won’t be interleaved with data from other writes, and this is the key to allowing many child processes to all use the same logger process.
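My code simply trusts that the records stay small, but a defensive version could compare against the POSIX limit:

use POSIX qw(PIPE_BUF);
my $record = "<$data>\n";
warn "record of ", length($record), " bytes may not write atomically\n"
  if length($record) > PIPE_BUF;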
LISTING ONE: A DBI-persisted web log
001#! perl
002package Stonehenge::DBILog;
003use strict;
004
005## usage: PerlPostConfigHandler Stonehenge::DBILog
006
007## config
008
009my $PNOTE_IDENTIFIER = 'DBILog_Times';
010
011my $LOGGER = "[* env.PREFIX *]/sbin/logger";
012
013## end config
014
015use Apache2::Const qw(DECLINED OK);
016use Apache2::Util qw(ht_time);
017
018my $logger_handle;
019
020sub handler {
021 my ($conf_pool, $log_pool, $temp_pool, $s) = @_;
022
023 ## warn "executed in $$\n";
024
025 for (my $vs = $s; $vs; $vs = $vs->next) {
026 ## set this up in the "real" server:
027 next if $vs->is_virtual;
028 ## warn "setting up ", $vs->server_hostname, ":", $vs->port, "\n";
029 $vs->push_handlers(PerlPostReadRequestHandler =>
030 __PACKAGE__ . '::PerlPostReadRequestHandler');
031 $vs->push_handlers(PerlLogHandler =>
032 __PACKAGE__ . '::PerlLogHandler');
033 }
034
035 open $logger_handle, "|$LOGGER"
036 or die "Cannot create |$LOGGER: $!";
037 {
038 my $old = select($logger_handle);
039 $| = 1;
040 select($old);
041 }
042
043 return OK;
044}
045
046sub PerlPostReadRequestHandler {
047 my $r = shift;
048 return DECLINED unless $r->is_initial_req;
049
050 my @times = (time, times);
051
052 $r->pnotes($PNOTE_IDENTIFIER, \@times);
053
054 ## warn "saved @times in pnote\n";
055 return DECLINED;
056}
057
058sub PerlLogHandler {
059 my $r = shift;
060 ## warn "running the handler for ", $r->uri, "\n";
061
062 ## first, reap any zombies so child CPU is proper:
063 {
064
065 my $kid = waitpid(-1, 1); # the 1 is WNOHANG (non-blocking) here
066 if ($kid > 0) {
067 # $r->warn("found kid $kid"); # DEBUG
068 redo;
069 }
070 }
071
072 my @times = @{$r->pnotes($PNOTE_IDENTIFIER) || []};
073 unless (@times) {
074 $r->warn($r->uri, ": DBILog: where is \@times?");
075 return DECLINED;
076 }
077
078 ## delta these times:
079 @times = map { $_ - shift @times } time, times;
080
081 my $orig = $r->main || $r;
082 my $last = $orig; $last = $last->next while $last->next;
083 my $s = $orig->server;
084 my $c = $orig->connection;
085 my $server = $orig->dir_config->get(“DBILogServer”);
086
087=for SQL
088
089create table requests (
090 uid text,
091 cookie text,
092 stamp timestamp with time zone default now(),
093 host text not null,
094 server text,
095 vhost text not null,
096 method text not null,
097 url text not null,
098 basicauth text,
099 referer text,
100 useragent text,
101 status int default 0,
102 bytes int,
103 wall int,
104 cpuuser real,
105 cpusys real,
106 cpucuser real,
107 cpucsys real
108);
109
110=cut
111
112 my %data;
113
114 $data{uid} =
115 $orig->headers_in->{"x-stonehenge-unique-id"} ||
116 $r->headers_in->{"x-stonehenge-unique-id"} ||
117 $last->headers_in->{"x-stonehenge-unique-id"} ||
118 $orig->subprocess_env->{UNIQUE_ID} ||
119 $r->subprocess_env->{UNIQUE_ID} ||
120 $last->subprocess_env->{UNIQUE_ID};
121 $data{cookie} =
122 $orig->notes->{"cookie"} ||
123 $r->notes->{"cookie"} ||
124 $last->notes->{"cookie"};
125 $data{stamp} = Apache2::Util::ht_time($r->pool, $orig->request_time);
126 $data{host} = $c->get_remote_host;
127 $data{server} = $server;
128 $data{vhost} = join(":", $s->server_hostname, $s->port);
129 $data{method} = $orig->method;
130 $data{url} = ($orig->the_request =~ /^\S+\s+(\S+)/)[0];
131 $data{basicauth} = $orig->user;
132 $data{referer} = $orig->headers_in->{'Referer'};
133 $data{useragent} = $orig->headers_in->{'User-agent'};
134 $data{status} = $orig->status;
135 $data{bytes} = $r->bytes_sent;
136 @data{qw(wall cpuuser cpusys cpucuser cpucsys)} = @times;
137
138 my $data = join(" ",
139 map { defined $data{$_} ?
140 "$_=" . unpack("H*", $data{$_}) :
141 () }
142 sort keys %data
143 );
144
145 ## we hope we get an atomic write here...
146 print $logger_handle "<$data>\n"
147 or die "Write error on $logger_handle: $!";
148 ## hopefully, angle bracket markers will tell us if we got scrambled
149
150 return OK;
151}
152
1531;
154
Well, that’s the server-side part of the process. Next time, I’ll talk about the construction of logger and show some typical queries that this database log lets me perform. Until then, enjoy!

Randal Schwartz is the chief Perl guru at Stonehenge Consulting. You can reach Randal at merlyn@stonehenge.com.
