Fly High with the Feather

The Apache HTTP server is the most widely used web server in the world. It’s powerful, portable, and extensible. And with the introduction of filters in Apache HTTP 2.0, you’ll find even more ways to customize the server to suit your site. Find out how.

Historically, the architecture of Apache has been centered around extensibility — allowing third-party modules to add or modify the operation and features of the HTTP server. Consequently, a large number and variety of third-party Apache modules have been built to do everything from dynamic web page generation (for example, Perl and PHP) to secure communication (via mod_ssl), to making web site authoring and maintenance easier (the FrontPage module).

While Apache 1.3 was clearly a success, the Apache maintainers wanted Apache 2.0 to provide an even more stable, flexible, and powerful foundation for third-party enhancements. So, in addition to many other new features (see the sidebar “New features in the Apache 2.0 HTTP Server”), the Apache developers introduced filters. You can think of a filter as a kind of plug-in: Apache modules can register a filter, and at the appropriate point in the processing of an HTTP request or response, the filter is executed.

Apache 2.0 filters can be far less complicated to write and deploy than Apache 1.3 modules, and provide for much more flexible content transformations. In fact, the HTTP protocol logic within the Apache 2.0 web server itself is now (mostly) implemented as several input and output filters. Using filters for HTTP processing allows Apache 2.0 to support new protocols or enhancements simply by plugging in replacement filters.

To use filters properly, you must know what they are, how they fit into the big picture, and how to configure them. Let’s take a brief tour of Apache filters, inside and out. As you’ll see, filters can enhance the Apache server in ways never before possible.

Conceptually, a filter is any program that takes some input, modifies it, and emits the changes as output. For example, if you type ls -lR | tr '-' '!' | more into your shell, tr acts as a filter: it takes a stream of characters as input, replaces each dash character with an exclamation point, and emits the now-modified stream as output.

In Apache, filters are conceptually very similar to pipes in the shell. Like pipes, which can affect stdin and stdout, Apache filters can affect incoming requests (Apache’s “input”) and outgoing responses (Apache’s “output”). Correspondingly, there are two types of Apache filters: input filters modify the request, while output filters modify the response. Like piped commands in the shell, each input or output filter is responsible for performing its task and passing the data to the next filter in the chain.

However, unlike piped shell commands, which execute from left to right, chains of Apache filters are ordered by priority (see below for a description of the filter priority classes), and chains of input and output filters operate independently of each other.

Apache and the Apache Software Foundation

In this article, we refer to the Apache HTTP Server 2.0 simply as “Apache.” However, it may be more appropriate to call Apache the “Apache web server” or the “Apache HTTP server” because the web server is just one of many open source projects managed and promoted by the Apache Software Foundation.

Currently, the Apache Software Foundation is home to many other open source projects, including Ant, Cocoon, and James. For a list of “Apache” projects, see http://www.apache.org.

Handlers versus Filters

As mentioned above, a key advantage of the Apache architecture is the ease with which third parties can write modules that add to or extend the existing features of the web server. That extensibility is provided via hooks: places in Apache where a module can register callbacks to express interest in certain kinds of processing. For example, a module can register a callback via the pre_connection hook to be invoked whenever a connection has been created, or can register a callback via the check_user_id hook to be invoked whenever a user should be authenticated.

One of these hooks is called the handler. For any given request, the handler decides what content should be sent back to the client in response. Depending on the request, the handler may decide to serve an HTML page off the disk, or may choose to pass the request along to an executable CGI script. The handler may also need to read the request body to determine what to send to the client. (This is often the case with a POST request, which is how forms are commonly transmitted over HTTP.) If a request body is present, the handler hook will ask for the request body via the ap_get_brigade or ap_get_client_block functions. (Most handlers read small parts of the body one at a time rather than the entire body at once.) At the point where the handlers ask for data, the input filters are invoked, allowing those filters to process the request’s body before the body is sent to the handler.

Apache 2.0’s filters are organized in a linear chain, where each filter processes a chunk of data and passes the (perhaps modified) data along to the next filter in the sequence. This process continues until there are no more filters left.

After the handler finishes scanning the request, it generates data, and that’s when the output filters are invoked. Output filters receive a block of data and are asked to pass it on to the next filter in the chain.

If necessary, a filter can maintain its own internal state to save information as the request body is read. This allows the system to deal with partial data. For example, if a body is being decompressed, but the filter doesn’t have enough data to complete the operation, the filter can defer processing that chunk until the rest of the data arrives.

Now, filters are not for everything. If a task is essentially transformative, it may make sense to implement it as a filter. However, if the task generates data rather than transforming it, a handler may make more sense.
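The distinction can be sketched in configuration terms. The fragment below is only an illustration: the directory path is hypothetical, and it assumes that mod_cgi and mod_include are available in your build. The AddHandler line selects a handler that generates a response, while the AddOutputFilter line attaches a filter that transforms whatever the handler produced.

```apache
# Hypothetical document root; adjust to your own layout.
<Directory "/usr/local/apache2/htdocs">
    # ExecCGI lets the cgi-script handler run; IncludesNoExec enables SSI.
    Options +ExecCGI +IncludesNoExec

    # Handler: a CGI script generates the response body.
    AddHandler cgi-script .cgi

    # Filter: mod_include transforms a body selected by the default handler.
    AddOutputFilter INCLUDES .shtml
</Directory>
```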

New features in Apache HTTP Server 2.0

A major enhancement to Apache 2.0 is that it now supports threading. Prior versions of Apache used a model where each incoming request received a dedicated process that generated a response. After servicing the request, the process would then be reused to serve other incoming requests. However, this required at least one process per expected incoming connection. On some platforms, that was an expensive proposition. Threading was introduced to alleviate these concerns.

In a threaded Apache server, a single process can spawn multiple independent threads. Then, each thread can service a request. However, each thread shares the same global address space with all of its sibling threads. Therefore, all of the code, including every module for Apache 2.0, has to be written in a manner that is thread-safe.

However, it’s not mandatory to run a threaded Apache. A system called the Multi-Processing Module (MPM) selects a policy as to how Apache should handle requests and manage processes and threads.

On Unix-based platforms, the default MPM is “prefork,” which is identical to how Apache 1.3 handled requests — one process per connection.

If threading is available on your platform, the “worker” MPM can be selected. The worker MPM allows multiple processes to be created and each process will spawn threads. Each thread in the worker MPM services requests.
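As a rough sketch of what tuning the worker MPM might look like, the fragment below sets the process and thread limits. The numbers are illustrative assumptions rather than recommendations, and the block takes effect only if the server was actually built with the worker MPM (in Apache 2.0 the MPM is chosen at build time, for example with the --with-mpm=worker configure option).

```apache
# Applies only when the server was built with the worker MPM.
<IfModule worker.c>
    # Number of child processes started initially.
    StartServers          2
    # Upper bound on simultaneous requests across all processes.
    MaxClients          150
    # Idle-thread bounds used to decide when to start or stop processes.
    MinSpareThreads      25
    MaxSpareThreads      75
    # Threads created by each child process.
    ThreadsPerChild      25
    # 0 means a child process is never recycled.
    MaxRequestsPerChild   0
</IfModule>
```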

Another feature of MPMs is that they allow platform-specific implementations to be used. Windows-based platforms have their own MPM, which utilizes the Windows threading API. OS/2 and Netware also have MPMs available that are specifically tuned to their platform and interfaces.

There are also several experimental MPMs available in the core distribution (including “leader,” “threadpool,” and “perchild”) that use slightly different approaches to managing processes, threads, and requests.

Apache 2.0 has also been rewritten using the Apache Portable Runtime (APR). APR provides a set of core software libraries that in turn provide a predictable and consistent interface to the underlying platform. Therefore, a code base that is built upon APR does not have to deal with the nuances of a particular platform.

Apache 1.3 and earlier versions dealt with platforms in an inconsistent manner. The introduction of APR into Apache 2.0 allows Unix-based platforms, Windows, Netware, and OS/2 to all be supported consistently with minimal hassle to the application.

In addition, two third-party modules have been added to the core distribution in Apache 2.0. These modules are now maintained by the core Apache developers rather than by the original maintainers. Ralf Engelschall’s mod_ssl module has been incorporated into the repository, and several configuration and performance enhancements have been added to mod_ssl in 2.0. Greg Stein’s mod_dav has also been added and adapted into the core. Integrating these popular modules makes it easier to use SSL and WebDAV in Apache 2.0.

As examples, here are some of the core modules in Apache 2.0 that were rewritten to use filters. (Understanding how filters are used within the core of Apache 2.0 can help you use those filters and decide when to write your own filters.)

* mod_include. mod_include is responsible for handling server-side includes (SSIs). SSIs are HTML pages with special markup that require special processing, such as the introduction of variables (such as DATE or LAST_MODIFIED), conditionals (if a variable is set, then display “XYZ,” else “ABC”), and including other pages (via the #include option).

mod_include is an excellent, almost canonical use of filters, since mod_include isn’t responsible for deciding what file should be delivered. In prior releases, mod_include had to duplicate some of Apache’s internal features to be able to serve content. Now, it can rely on the default handler to select the page to deliver, and it can operate on the stream as before.

* mod_ext_filter. mod_ext_filter is a module that allows an external program to be invoked as a filter. Any program that can operate as a filter in a Unix environment can be used with mod_ext_filter, as long as it doesn’t require any user interaction. Because this filter requires a new process per request, external programs can be an inefficient way to implement filters. Still, mod_ext_filter may be useful for implementing a quick fix or prototype of a filter. Here’s an example of using mod_ext_filter to replace all text in the response body that says “Hello” with “Goodbye”:

ExtFilterDefine goodbye mode=output cmd="/usr/bin/sed s/Hello/Goodbye/g"
<Location />
    AddOutputFilter goodbye .html
</Location>

* mod_deflate. mod_deflate is a new module that implements on-the-fly gzip content encoding (“deflate” is the compression algorithm on which the gzip format is based). mod_deflate is similar to the third-party mod_gzip module, but is implemented entirely as a filter. mod_deflate has both input and output filters, which means that it can transparently decompress compressed request bodies, and also compress response bodies on-the-fly. This can lead to substantial savings of bandwidth, albeit at the cost of some CPU power.

Clients may indicate their willingness to accept deflated response bodies by adding an Accept-Encoding: gzip header in their request. When mod_deflate is configured and that header is present, the mod_deflate output filter compresses the response body and adds the Content-Encoding: gzip header to the response. Most current web browsers already support this functionality.

* HTTP protocol support.

The HTTP protocol logic within Apache 2.0 is now mainly implemented as several input and output filters. There is a header filter, a content-length filter, a byte range filter, a chunking filter, and a dechunking filter. Since HTTP is the default protocol, no special action is required to enable these filters. And because HTTP is implemented mostly as filters, Apache 2.0 can support new protocols or enhancements by plugging in replacement filters.

* mod_ssl.

As mentioned previously, Apache 2.0 now includes Ralf Engelschall’s mod_ssl. In the process of integrating it into the Apache 2.0 tree, mod_ssl was rewritten to use filters. Now all of the input and output functionality of dealing with the SSL protocol is accomplished with connection-level filters. This has led to a much cleaner integration of SSL with the rest of the server.
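Returning to mod_deflate from the list above, its two directions can be sketched in one fragment. This is a minimal illustration, assuming the module was built as a shared object at the path shown; the /upload location is hypothetical. The input filter decompresses request bodies on the way in, and the output filter compresses responses on the way out.

```apache
# Assumes mod_deflate was built as a shared module at this path.
LoadModule deflate_module modules/mod_deflate.so

# Hypothetical upload location: decompress gzip-encoded request bodies
# before the handler reads them.
<Location /upload>
    SetInputFilter DEFLATE
</Location>

# Compress HTML and plain-text responses for clients that accept gzip.
<Location />
    AddOutputFilterByType DEFLATE text/html text/plain
</Location>
```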

Configuring Filters

Now that we’ve seen some filters and how they work, let’s look at how filters can be configured into Apache. Like other Apache options, you configure filters in the httpd.conf file or in .htaccess files. The canonical documentation on filters is available on the Apache 2.0 web site at http://httpd.apache.org/docs-2.0/filter.html.

Most of the filter directives follow the standard inheritance rules of the Apache directives. For example, a filter can be added at a parent directory or location, and the directive then applies automatically to all descendants. A descendant directory can remove a filter by using the corresponding Remove directive.

Consider this code:

<Location />
    AddOutputFilter INCLUDES .shtml
    AddOutputFilterByType DEFLATE text/html
</Location>

<Location /dir1>
    RemoveOutputFilter .shtml
</Location>

Globally, the filter INCLUDES applies to .shtml files, and DEFLATE applies to MIME-type “text/html.” However, in directory /dir1, the filter INCLUDES does not apply.

Most filter directives look like this:

directive filter[;filter...] extension [extension...]

directive is a keyword like AddInputFilter or AddOutputFilter, filter is the filter’s “friendly” name, and extension is a filename extension like “txt” or “html.” (The RemoveInputFilter and RemoveOutputFilter directives take only the extensions.)

Typically, the friendly name of the filter is similar to the module name, but it does not have to be. For instance, mod_include’s filter name is INCLUDES, while mod_deflate uses DEFLATE. In the majority of the filter directives, there’s also a reference to a filename extension.

These extensions are mapped by the mod_mime module, and mod_mime may apply many different rules if a file has multiple extensions. That is, if a file is called bar.html.txt, and there are rules for “html” and “txt,” both rules are applied in a left-to-right ordering.
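The multiple-extension behavior can be illustrated with a small sketch. The two rules below are assumptions chosen for the example, not a recommended setup: one is keyed on “html” and the other on “txt.”

```apache
<Location />
    # Required for the INCLUDES output filter to do anything.
    Options +IncludesNoExec

    # Rule keyed on the "html" extension.
    AddOutputFilter INCLUDES html

    # Rule keyed on the "txt" extension.
    AddOutputFilter DEFLATE txt
</Location>
```

With these rules, a request for bar.html.txt matches both extensions, so both the INCLUDES and the DEFLATE output filters are attached to the response.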

Here are some examples:

* AddInputFilter DEFLATE html indicates that the DEFLATE input filter should be added for each file that ends with the extension .html.

* RemoveOutputFilter jpg indicates that any filters that were added with AddOutputFilter for the extension jpg should be cleared. Essential filters such as the HTTP and SSL filters cannot be removed by this directive.

* SetInputFilter DEFLATE unconditionally adds the requested filters to the location or directory regardless of file extension. This may be useful if you want all files in a particular directory to be processed by a particular filter.

* AddOutputFilterByType DEFLATE text/html is a bit different from the rest of the filter directives because it operates on the MIME-type rather than the filename extension.

AddOutputFilterByType was added specifically to make it easier to configure mod_deflate. For example, if you have multiple files that map to the same content-type (such as .html and .shtml), the AddOutputFilter directive would have to include all of the possible extensions that could be compressed with mod_deflate.

However, it may be undesirable to use SetOutputFilter with mod_deflate since images (such as JPEG and GIF) may not compress well, and the overhead incurred by mod_deflate for those files could impact performance. Instead, AddOutputFilterByType was added to allow filters to be added by the MIME-type. The syntax for AddOutputFilterByType is:

AddOutputFilterByType filter[;filter...] MIME-type [MIME-type ...]

So, the directive AddOutputFilterByType DEFLATE text/html adds mod_deflate to all of the files that have a “text/html” type.

But there’s a caveat with this directive: if a file has no extension and the DefaultType directive is set, then AddOutputFilterByType doesn’t apply, even if the MIME-type used is the same as the DefaultType. For this directive to work, each file extension must be mapped to its type with the AddType directive. (See http://httpd.apache.org/docs-2.0/mod/core.html#addoutputfilterbytype for more information.)
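The caveat can be sketched as follows. The fragment is an illustration under assumptions: the /pages location is hypothetical, and falling back to SetOutputFilter for extensionless files is just one possible workaround, not something the Apache documentation prescribes.

```apache
# Extensionless files fall back to this type, but per the caveat above
# they are NOT picked up by AddOutputFilterByType.
DefaultType text/html

# Files whose extensions are explicitly mapped are picked up.
AddType text/html .html .shtml
AddOutputFilterByType DEFLATE text/html

# Hypothetical location holding extensionless HTML files: attach the
# filter unconditionally instead of relying on the MIME-type.
<Location /pages>
    SetOutputFilter DEFLATE
</Location>
```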

Classes and Ordering of Filters

One detail that’s not been covered so far is how filters are ordered. Many filter directives can add multiple filters, and there can be many directives. So what gets executed first? Last? It depends on the filter’s class.

When a filter is registered, the filter indicates what class or type of filter it is. Ultimately, it’s the class of the filter that determines where it is inserted into the filter chain. However, the ordering of filters within the same class may be adjusted by the directives. Table One shows the different filter classes in the order that they execute. As examples, an AP_FTYPE_CONTENT_SET filter is always executed after all of the AP_FTYPE_RESOURCE filters, and AP_FTYPE_NETWORK filters are expected to be the last filters in the chain.

Table One: Filter classes (listed from top to bottom in the order of execution)

Filter class           Description                                                Example
AP_FTYPE_RESOURCE      “Normal” filters that transform data in any way            mod_include
AP_FTYPE_CONTENT_SET   Cannot change the content in any identifiable way          mod_deflate
AP_FTYPE_PROTOCOL      Perform the request/response logic for the protocol        HTTP protocol filters
AP_FTYPE_TRANSCODE     Some of the protocols require a separate layer of filters  HTTP chunking filter
AP_FTYPE_CONNECTION    Handles any connection-level protocol issues               mod_ssl’s filters
AP_FTYPE_NETWORK       Reads from, writes to the sockets                          Socket filters

In some of the filter directives, you can specify multiple filters in one directive. If the filters are in the same class, they will be added in a left to right ordering. For instance…

ExtFilterDefine goodbye mode=output cmd="/usr/bin/sed s/Hello/Goodbye/g"
<Location />
    AddOutputFilter INCLUDES;goodbye .shtml
</Location>

… specifies that the mod_include filter is executed before the goodbye filter. However, if the AddOutputFilter line were changed to…

AddOutputFilter goodbye;INCLUDES .shtml

… then the goodbye filter would be executed before the mod_include filter. So, if the original .shtml file looked like…

<!--#include file="Hello.html" -->

… then after the goodbye filter was executed, mod_include would see the file as:

<!--#include file="Goodbye.html" -->

On the other hand, if filters are in different classes, the ordering within the directive is immaterial. Thus the following directive lines are equivalent:

AddOutputFilter DEFLATE;INCLUDES .shtml
AddOutputFilter INCLUDES;DEFLATE .shtml

Putting it All Together

Here’s a simple, yet real-world example of using filters.

AddType text/html .shtml
<Location />
    Options IncludesNoExec
    AddOutputFilter INCLUDES .shtml
    AddOutputFilterByType DEFLATE text/html
</Location>

FIGURE ONE: Filter chains example (input).

This snippet specifies that files ending in .shtml have the “text/html” content-type and should be run through the mod_include filter. As an additional safeguard, Apache 2.0 still requires Options IncludesNoExec (or Options Includes, if you want to expand the #exec keyword in your documents) to use the INCLUDES output filter.

FIGURE TWO: Filter chains example (output).

Additionally, the AddOutputFilterByType directive specifies that all responses that are “text/html” should be processed by the DEFLATE output filter of mod_deflate. Since .shtml files are “text/html,” they are deflated on their way to the client after being expanded by mod_include.
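One way to observe the DEFLATE filter at work is to have mod_deflate record its compression ratio and then log it. The fragment below is a sketch along the lines of the mod_deflate documentation; the note name, the log format nickname, and the log file path are arbitrary choices for this example, and it assumes mod_log_config is available.

```apache
# Ask mod_deflate to record the compression ratio in a request note.
DeflateFilterNote ratio

# Log the request line, the bytes sent, and the ratio note.
LogFormat "\"%r\" %b (%{ratio}n)" deflate
CustomLog logs/deflate_log deflate
```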

Filters in Apache 2.0 provide for much more flexible content transformations, and do so in a much more transparent way than Apache 1.3. Certain types of modules that were extremely complicated in Apache 1.3 can now be reduced in size if adapted to work as filters. In this article, we’ve attempted to provide the foundations required to understand how filters work by providing analogies to other systems, describing how Apache 2.0 itself uses filters, detailing some of the new filter-related configuration directives, and examining the filter classes that are available.

However, the best way to learn more about filters is to try the feature out for yourself. Experiment with your configuration and witness the effects of the filter directives. Examine the source code of the filters that are currently distributed with Apache 2.0. These embody invaluable techniques for developing filters.

The Apache documentation at http://httpd.apache.org/docs-2.0/ is a valuable resource for information on all of Apache, not just the filters. If you’re still confused about how to use filters properly, you might try to ask around on the users@httpd.apache.org mailing list. And, if you’re writing your own filter and desire a more technically-oriented audience, you might find the developer’s mailing list, dev@httpd.apache.org, a helpful resource.

As more users begin to understand filters and more developers start writing filters, it is expected that filters will evolve over time to meet the needs of the community. Apache 2.0 is the first major release that includes this feature. Therefore, feedback from users and module developers will only serve to improve filters in subsequent releases of Apache. So, download Apache 2.0 and get coding!

Justin R. Erenkrantz is a Ph.D. student at the University of California, Irvine. He has contributed to the Apache HTTP Server, Apache Portable Runtime, and Subversion. You can send him email at justin@erenkrantz.com.
