Many large organizations use caching proxy servers to save on network bandwidth utilization (and costs) and improve browsing response times. In fact, an entire industry has grown up around caching proxy appliances. But in the open source world, we've had one of the most advanced proxy servers for many, many years. Squid (http://www.squid-cache.org) is to caching proxy servers as Apache is to web servers -- the hands-down open source winner.
In a typical deployment, the Squid proxy server sits between internal users and a corporate firewall, as shown in Figure One.
Rather than sending all requests directly to remote sites on the Internet, clients connect to the Squid proxy to issue GET and POST requests. If Squid has the requested resource, it simply returns it from its local cache; otherwise, Squid contacts the remote site to request the resource, possibly caching the result.
|Figure One: Squid caches resources, reducing outbound requests.|
To keep track of when objects should expire and to determine if an object should be cached in the first place, Squid pays attention to the Expires, Last-Modified, Pragma, and Cache-Control headers that are sent back in HTTP responses. If Squid already has an up-to-date, local copy of a resource, there’s no need to fetch it from the remote site. This greatly reduces page load times because the largest elements on a page (in terms of file size) are generally static images.
For example, the Yahoo! home page contains several images, but the images are relatively static. The Yahoo! logo at the top of the page may change once every year. But for the other 364 days of the year, there’s no need to ask Yahoo’s image servers for yet another copy. Instead, Squid simply returns the copy it has stored on disk, turning a 50-300 ms request into a 1-5 ms request.
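To make the header discussion concrete, here is roughly what a cacheable response for a static image might look like. Every value below is made up for illustration and is not taken from any real site:

```
HTTP/1.1 200 OK
Date: Tue, 04 Mar 2003 18:00:00 GMT
Content-Type: image/gif
Last-Modified: Wed, 01 Jan 2003 00:00:00 GMT
Expires: Wed, 05 Mar 2003 18:00:00 GMT
Cache-Control: max-age=86400
```

With headers like these, a cache may keep serving its stored copy for a day before checking with the origin server again.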
Turning the Tables
So what’s this all have to do with serving web content? While most people deploy Squid to cache the results of outbound web requests, you can also turn Squid around the other way. If you run a moderately busy web site, Squid can act as a front-end cache for a small cluster of web servers, as shown in Figure Two.
|Figure Two: Squid used as a reverse proxy.|
By using Squid as a reverse proxy in front of your public web site, you can serve much of your static content from Squid’s cache, and do so very efficiently. With Squid configured as a reverse proxy, you end up needing fewer connections to the web server, freeing up resources for other site features. Using Squid like this is often called httpd accelerator mode, since it’s introduced to reduce the load on your real web server and improve response time for the end user.
One downside to using Squid as a reverse proxy is that log analysis is more difficult. Rather than a single set of server logs, your site activity is only partially captured in the web server’s logs. Luckily, the Squid logs can be configured to look just like your existing logs. Simply enable the emulate_httpd_log option in your configuration file. (See http://www.squid-cache.org/Doc/FAQ/FAQ-6.html#ss6.6 for details on Squid’s logging.) You’ll still need to perform log analysis on two hosts rather than one, but at least the logs will have a common format.
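As a minimal sketch, assuming the squid.conf syntax of the 2.5 series, enabling that option is a one-line change:

```
# Write access.log in the common httpd log format instead of Squid's native format
emulate_httpd_log on
```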
Another less common problem is that some Apache modules aren’t as useful with Squid on the front-end. mod_throttle, for example, can’t distinguish among the various hosts requesting content. From Apache’s point of view, all requests are coming from the proxy. Similarly, the output you’ll see in Apache’s /server-status URL (if you’ve enabled it) won’t give you any idea of who the real clients are.
Setting Squid Up
Squid is available for all popular Linux distributions. If your distribution does not offer ready-made packages, you can download the source for version 2.5 from the web at http://www.squid-cache.org/Versions/v2/2.5. Once you have the source code, the standard build and install routine gets you up and running:
$ ./configure --prefix=/usr/local/squid
$ sudo make install
Configuring Squid for reverse proxy support is a simple matter of adding (or changing) a few lines in the squid.conf file. First you need to tell Squid to listen on TCP port 80 rather than its default of 3128:
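The listing that belongs here did not survive in this copy of the article. Assuming Squid 2.5 syntax, the change would look something like this:

```
# Listen on the standard HTTP port instead of the default 3128
http_port 80
```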
Next, you must tell Squid how to contact the back-end server:
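A minimal sketch of that step, assuming the httpd_accel_* directives from Squid 2.5 and a single back-end server named w1.example.com (the host name the article uses later for the real web server):

```
# Forward cache misses to the real web server
httpd_accel_host w1.example.com
httpd_accel_port 80
```

If the back-end server listens on a different port, adjust httpd_accel_port to match.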
After making those changes and pointing the DNS entry for your web site (www.example.com) at your Squid proxy, clients won’t notice a difference, but your web server will end up serving fewer requests over time.
If you’re already using Squid to handle outbound requests, you’ll need to add…
…to your configuration as well. By default, Squid disables traditional caching when configured in httpd accelerator mode.
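The directive itself is missing from this copy. In Squid 2.5, the setting that re-enables ordinary proxy caching alongside accelerator mode is most likely this one:

```
# Act as both an httpd accelerator and a traditional caching proxy
httpd_accel_with_proxy on
```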
Squid on the High-End
Since you’re likely to have multiple back-end web servers, you might also want to use Squid in a load-balancing configuration, as shown in Figure Three.
|Figure Three: Squid can also act as a load balancer.|
Squid doesn’t currently have built-in support for load-balancing, but with the use of a helper program, called a redirector, you can still get the job done. (Or you might apply one of the patches floating around on-line.) In a way, redirectors provide functionality similar to Apache’s mod_rewrite. You can freely manipulate URLs in whatever manner is necessary.
To configure a redirector for Squid, adjust your configuration accordingly:
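The original listing is missing here. A sketch of what it might contain, assuming Squid 2.5 directive names, follows. The install path of redir.pl and the number of helper processes are assumptions, not values from the article:

```
# Run every request URL through the helper script (path is hypothetical)
redirect_program /usr/local/squid/bin/redir.pl
# Number of redirector processes to start (assumed value)
redirect_children 5
# Leave the Host: header untouched when the redirector changes the URL
redirect_rewrites_host_header off
```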
This configures Squid to run every request URL through redir.pl. The program is fed URLs on standard input and should emit the new, modified URL to standard output. The line redirect_rewrites_host_header off specifies that Squid should not rewrite the Host header in the HTTP request that it sends to the back-end server. This is important because servers that use name-based virtual hosting will not work properly if they see an unknown hostname in the request. More on that shortly.
If you have a single back-end web server, a redirector really isn’t necessary. But if you wanted to use one anyway (maybe to re-write URLs for one portion of your site), the script can be as easy as this:
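The script itself is missing from this copy, so what follows is a reconstruction in the spirit of the description, not the author's original listing. It assumes Squid hands the redirector one request per line with the URL as the first field, and it moves the substitution into a small rewrite() helper (a name chosen here for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Disable output buffering so Squid sees each answer immediately.
$| = 1;

# Replace the public host name with the back-end server's name.
# Anything after the URL on the line is passed through unchanged.
sub rewrite {
    my ($line) = @_;
    $line =~ s{^http://www\.example\.com/}{http://w1.example.com/};
    return $line;
}

# Squid writes one request per line and waits for one line in reply.
while (my $line = <STDIN>) {
    print rewrite($line);
}
```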
This assumes that www.example.com is your web site, and w1.example.com is the real web server that’s sitting behind the Squid proxy. When a request for http://www.example.com/blah.html arrives, Squid passes it off to redir.pl, which returns http://w1.example.com/blah.html. Squid then contacts w1.example.com to fetch the content.
Notice that the script never exits; it simply sits there, waiting for Squid to feed it another line of input. Because the script has disabled output buffering via $| (very important!), Squid gets the result back instantly.
The other thing to notice is that the Host: header is not affected by the change to the URL. (Remember, redirect_rewrites_host_header was off.) When Squid contacts w1.example.com, the request contains a line that looks like:
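The header line is missing from this copy. Given the host names used in this example, it would presumably read:

```
Host: www.example.com
```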
By passing Host: through unchanged, virtual hosting on the back-end server isn’t confused. In fact, from the server’s point of view there is no proxy; the server simply thinks that the proxy machine is sending a lot of requests.
Now, if you want to perform some basic load-balancing across multiple web servers, you’d need to teach the script about the servers and introduce some randomness (or round robin or whatever algorithm you’d like). Here’s a randomized approach.
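The original listing is not present in this copy. The sketch below shows one way the randomized approach could look. The server names w2.example.com and w3.example.com and the rewrite() helper are hypothetical, and the script assumes the same one-request-per-line input as before:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Disable output buffering so Squid sees each answer immediately.
$| = 1;

# Back-end web servers to spread requests across (hypothetical names).
my @servers = qw(w1.example.com w2.example.com w3.example.com);

# Pick a back-end at random and substitute it for the public host name.
sub rewrite {
    my ($line) = @_;
    my $server = $servers[ int rand @servers ];
    $line =~ s{^http://www\.example\.com/}{http://$server/};
    return $line;
}

# Squid writes one request per line and waits for one line in reply.
while (my $line = <STDIN>) {
    print rewrite($line);
}
```

Random selection keeps the script stateless, which matters because Squid typically runs several redirector processes at once. A round-robin counter would only be per-process.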
Now all you need to do when adding or removing servers is update the @servers list to reflect the changes and then restart Squid. You could also introduce some additional monitoring with an external script that checks the health of all the back-end web servers and updates information in a place that the redirector checks from time to time.
Here are a few pointers for deploying Squid:
* Squid works best on machines with fast disks and a large amount of system memory. However, you should be sure that all components of the machine deliver good performance and are reliable. If you want to upgrade an existing machine, your money is probably best spent on I/O (disks with fast random-access seek times) and on extra RAM. Generally, Squid is not CPU-intensive, and a multi-processor machine tends to be wasted as a Squid proxy.
* If your cache is critical to the operation of your site (whether as an incoming or outgoing proxy), make sure to have a second capable machine ready to run Squid. The Squid web site explains how to build a standby proxy.
Squid is quite powerful and easy to set up. Its flexibility is limited only by your imagination and your ability to code whatever fancy scheme you’ve dreamed up.
Jeremy Zawodny plays with MySQL by day and spends his spare time flying gliders in California and writing a MySQL book for O’Reilly & Associates. You can reach Jeremy via e-mail at Jeremy@Zawodny.com.