
mod_perl, Part I

Last month, I talked a bit about mod_perl, and how I used it extensively on my web server. But I was reminded by a few of my reviewers that I’ve yet to provide a good overview of mod_perl in any of my columns! Time to fix that.

As its name suggests, mod_perl is an Apache module that can be compiled statically into the Apache server, linked in dynamically, or added later via the APXS mechanism. Generally, if your flavor of Linux includes Apache, you probably already have simple instructions for adding mod_perl. Macintosh computers running OS X have mod_perl built in, and there are a number of pre-built versions of Apache with mod_perl for Windows machines as well.

The Apache server goes through a number of phases when it processes an incoming request: it parses the incoming data stream; it determines what resource is being requested by the URL; it controls access, authentication, and authorization; it determines the MIME type of the response; it serves the content; and it logs what happened.

Most mod_whatevers apply to only one phase of the Apache process. For example, mod_cgi, which handles CGI scripts, deals exclusively with the content phase, and mod_auth_dbm deals with authentication using DBM files during the authentication phase. But mod_perl can alter the behavior of all phases of the request! In other words, mod_perl exposes nearly the entire Apache programming interface to Perl code so that Perl programmers can extend and embed complex behaviors easily. This makes mod_perl far more powerful and useful than, say, mod_cgi or mod_php, which can impact only the content phase.

For example, Perl code can be used to manipulate Apache’s URL-to-filename mapping during the translation phase, allowing different content to be served depending on the time of day or origin of the requestor. Or, during the access phase, Perl code can control whether a given URL is even permitted based on the origin of the requestor or another authentication scheme of your design. If you’re concerned about performance or costs, you can even log the CPU time of each request to a database to see which dynamic requests are burdening your overloaded machine. While some of this can be accomplished through existing, specialized Apache modules, being able to script behaviors in Perl is a lot more flexible and is arguably easier to use and extend.
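To make that concrete, here is a minimal sketch of an access-phase handler using the mod_perl 1.x API. The package name My::BlockLocal and the address range are invented for illustration:

```perl
# Hypothetical access-phase handler: permit only requests whose client
# address is on the 10.x private network; everyone else gets a 403.
package My::BlockLocal;
use strict;
use Apache::Constants qw(OK FORBIDDEN);

sub handler {
    my $r = shift;    # the Apache request object
    return OK if $r->connection->remote_ip =~ /^10\./;
    return FORBIDDEN; # deny the request during the access phase
}
1;
```

Such a handler would be installed with a PerlAccessHandler My::BlockLocal directive inside the relevant Location or Directory block.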

And not to be left out, mod_perl can also deliver dynamic content — and this is probably the primary use of mod_perl. One huge advantage of mod_perl over mod_cgi is that the Apache process does not have to fork() and exec() and parse a new Perl program. Both the fork() and new parsing take additional time and resources, substantially reducing the effectiveness of a given hardware configuration. I’ve heard stories of between 10X and 100X speed improvements just switching from mod_cgi to mod_perl (using Apache::Registry, which I’ll get to in a moment). Additionally, database connections can be re-used in mod_perl, preventing costly authentication handshaking on each hit.
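One common way to get that connection re-use (though not the only one) is the CPAN module Apache::DBI, loaded at server startup. As a sketch, with a made-up DSN:

```perl
# Load Apache::DBI *before* DBI (e.g. from a PerlModule directive or a
# startup file). It overrides DBI's connect method so that each child
# process keeps and reuses its database handles instead of paying the
# connection and authentication cost on every hit.
use Apache::DBI ();
use DBI ();

# Later, in handler code, an ordinary-looking connect is transparently
# cached. The DSN and credentials here are hypothetical:
# my $dbh = DBI->connect("dbi:Pg:dbname=web", "www", "secret");
```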

Though, as they say in the recent Spider-man movies (and in the original comics, or so I’m told), “With great power comes great responsibility.” Perl code that is used and re-used repeatedly must “behave well,” and that requires a particular discipline and cooperation from all involved. Thus, mod_perl is not generally a suitable candidate for shared-hosting situations where you cannot trust every other user of a given Apache process.

By default, installing mod_perl into an Apache server does absolutely nothing. You have to tell Apache to hand some (or all) of the requests to one or more Perl handlers at the appropriate phase, by placing the appropriate configuration directives into httpd.conf and various .htaccess files (if so enabled). Some of these directives control the state of the entire embedded Perl interpreter.

For example, PerlWarn Yes enables the equivalent of the -w command-line flag, turning on warnings for all processed code. And PerlTaintCheck Yes is like -T on the perl command-line, enabling taint checking. Because these two flags affect all code for this embedded Perl interpreter, they should be used cautiously.

When we place PerlModule My::Module into a configuration file, Perl acts as if we said require My::Module;, pulling in the module according to the @INC path (more on that in a moment). And PerlRequire /some/path/foo.pl is like require "/some/path/foo.pl", allowing us to pull in arbitrary code. Note that this code can set @INC or execute anything we desire. Also, be aware that if the server is running as root to be able to bind to a low-numbered port, this code is also executed as root. Caveat executor!
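A common pattern (sketched here; the path and module choices are hypothetical) is to point PerlRequire at a startup file that adjusts @INC and preloads frequently used modules once, in the parent process:

```perl
# startup.pl -- pulled in via: PerlRequire /usr/local/apache/conf/startup.pl
use strict;
use lib "/usr/local/apache/lib/perl";  # add our private module directory to @INC
use CGI ();                            # preload big modules in the parent,
use DBI ();                            # so child processes share the compiled code
1;                                     # a require'd file must return a true value
```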

If you don’t want to create a separate file of Perl code, you can embed Perl code directly into the configuration file:


<Perl>
use lib "/my/place";
$ENV{FOO} = "bar";
</Perl>

This Perl code is executed at the time the server is started (or restarted), again with root privileges, if available.

If the code in one of these <Perl> sections sets either the $PerlConfig or @PerlConfig variables, these variables are then interpreted as if they were lines in the configuration file. Thus, Perl code can generate configuration directives on the fly.

For example, to set the listening port dynamically based on the presence at server startup of an environment variable named TESTING, we could use:


<Perl>
my $port = $ENV{TESTING} ? 8080 : 80;
$PerlConfig = "Port $port\n";
</Perl>

If TESTING is set, we get Port 8080 as a directive. If not, we get Port 80. One cool use I’ve heard of this feature is configuring a series of virtual hosts based on reading a database with DBI.
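As a sketch of that idea (the DSN, table, and column names are all invented for illustration), a <Perl> section could push one <VirtualHost> block onto @PerlConfig for each row in a database:

```perl
<Perl>
# Hypothetical: build the virtual host list from a database at startup.
use DBI;
my $dbh = DBI->connect("dbi:mysql:hosting", "www", "secret")
    or die "connect failed: ", DBI->errstr;
my $sth = $dbh->prepare("SELECT hostname, docroot FROM vhosts");
$sth->execute;
while (my ($host, $root) = $sth->fetchrow_array) {
    push @PerlConfig, <<"END";    # each element becomes configuration text
<VirtualHost *>
    ServerName $host
    DocumentRoot $root
</VirtualHost>
END
}
$dbh->disconnect;
</Perl>
```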

Associating Perl code with content delivery is rather straightforward. Within an .htaccess file, or a Location or Files or Directory section of some configuration file, we simply add both a SetHandler and PerlHandler directive.

One of the most common content handlers is the Apache::Registry handler, which takes a file (typically an entire CGI.pm-based Perl program), turns it into a Perl subroutine, then caches that subroutine to provide dynamic content for that URL. For example…


<Location /perl>
SetHandler perl-script
PerlHandler Apache::Registry
</Location>

… causes all scripts located below the perl directory within the document root to be treated with Apache::Registry. Now, when we visit /perl/myproggy, the file myproggy is turned into a Perl subroutine and executed in a manner similar to a CGI script. However, we do this without forking, and we cache the resulting subroutine in memory. On the next hit to the same Apache process, we’ve already parsed the file and things move much more quickly. If the file changes, Apache::Registry re-parses the file on the next hit transparently.

We can also get the same behavior based on a file extension rather than a particular location using:


<Files *.pl>
SetHandler perl-script
PerlHandler Apache::Registry
</Files>

However, I don’t recommend the use of file extensions as a trigger, as it gives away the implementation technology too easily, inviting a possible security exploit.

Because the subroutine created from the script is used repeatedly, we have to ensure that the code works well when being reused. Package variables aren’t reinitialized on every hit, for example, nor are open filehandles automatically closed and reopened. The most common traps are documented in the cgi_to_mod_perl and mod_perl_traps man pages, included with the mod_perl distribution.
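To see the kind of trap this creates, here is a tiny standalone simulation of what Apache::Registry does: the “script” becomes a subroutine that the same process calls once per hit, so any package variable quietly carries its value from one hit to the next.

```perl
# Simulates Apache::Registry caching: the "script" is wrapped in a
# subroutine, and the same process calls it once per hit.
my $wrapped_script = sub {
    our $hits;        # a package variable: NOT reset on each call
    $hits++;
    print "this is hit number $hits\n";
};
$wrapped_script->();  # prints: this is hit number 1
$wrapped_script->();  # prints: this is hit number 2 -- leftover state!
```

Under mod_cgi, each hit would have started a fresh perl process and always printed hit number 1; under Apache::Registry, initialization you were getting “for free” from process startup becomes your responsibility.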

The caching of the scripts by Apache::Registry is a nice feature. If the file changes, the new script is automatically sucked in. However, once a Perl interpreter has loaded a file with require or use, that file is marked as already loaded (in the %INC hash) and is never examined again.

In a development environment, this can be frustrating, because you might be updating a module during testing and yet only some of your Apache processes will be loading the new code, while others hang on to your old code. One workaround is the use of Apache::StatINC.

Adding PerlInitHandler Apache::StatINC to your configuration file causes Perl, during the initial phase of each request, to walk through the %INC hash (containing the already loaded require files and modules) to see if any files have been updated since they were last loaded. If so, they are flushed and reloaded as needed. While this module is great during development, you shouldn’t use it in production, since it adds a number of additional stat() system calls on each hit.

Another core module that can be useful to determine the state of things during development is Apache::Status, enabled like:


<Location /perl-status>
SetHandler perl-script
PerlHandler Apache::Status
</Location>

Now, when you visit /perl-status, you’ll get some status information about that particular embedded Perl process, including loaded modules, environment variables, and so on. Again, this is too much information for a would-be intruder during production, so be sure to enable this only on development machines.

Another core module is Apache::Resource, which can be used to limit the resources used by a child server process.

For example:


PerlSetEnv PERL_RLIMIT_CPU 120
PerlChildInitHandler Apache::Resource

If a child process now takes more than 120 CPU seconds, it’s aborted immediately. This is a hard abort, returning a 500-type error to the client, but at least you won’t have a runaway Apache process. Also, this is not per-request, but rather per-child, so you’ll want to set the appropriate MaxRequestsPerChild to a low enough number so as not to trigger this limit in normal execution.

I’ve run out of space for this month’s article, so next month, I’ll continue this introduction to mod_perl, including the complete API from Perl back into Apache, and some nice CPAN modules as well. Until then, enjoy!



E-mail Randal Schwartz at merlyn@stonehenge.com.
