As users and system administrators, we fight spam (unsolicited bulk e-mail) every day. For instance, it’s not uncommon these days to receive more spam than legitimate email. What’s worse, the amount of spam traversing the Internet has grown substantially every year for the past several years. Indeed, spam is becoming a threat to the Internet. The mail servers of ISPs are becoming overburdened, forcing ISPs to buy more hardware and pay for more bandwidth just to handle spam. Spam has become such a serious problem that a conference on the subject was held recently in Cambridge, Massachusetts.
Unfortunately, no spam-fighting method is completely effective. This column reviews some of the problems that must be overcome by any spam-fighting technique, and then focuses on one that’s become popular in the past year: statistical spam filtering. You’ll also see how to configure Bogofilter, a Linux-compatible statistical spam filter that’s become quite popular. With any luck, it’ll help you control your spam.
Figure One: Spam detection can yield false positives and misses
Approaches to Spam Control
In evaluating spam-fighting techniques, many people refer to accuracy, but this term is imprecise. Consider Figure One, which displays hypothetical distributions of ham (non-spam) and spam messages according to some measure of “spamminess.” For example, spamminess might be measured by the number of words from a “spam word list” that appear in a message.
To categorize a message as spam or ham, you must set some criterion above which the message is considered spam. However, as you can see in Figure One, the ham and spam distributions overlap, causing ambiguities. Some ham happens to have a higher spamminess index than some spam, causing those ham messages to be misidentified as spam (that is, as false alarms or false positives). Likewise, some spams will be misclassified as ham, a condition known as a miss, which appears in Figure One as the tail of the spam distribution falling to the left of the criterion line. With any luck, most spams will be correctly identified (hits), and most hams will also be correctly identified (correct rejects).
In theory, you can change the percentage and types of errors in two ways. The easiest way is to simply change the criterion. For instance, rather than saying that, say, ten words from the spam word list must appear before a message is categorized as spam, you can say that eight or twelve words must appear. This method shifts the types of errors and can reduce the overall number of errors. In fact, it’s trivially easy to get 100% perfect spam detection — just move the criterion to 0 and call everything spam! Unfortunately, this also creates a 100% false alarm rate, which is an unacceptable side effect.
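To make the trade-off concrete, here’s a minimal shell sketch. The scores are made-up spam-word counts for a handful of hand-labelled messages, not real measurements, and count_at_least is a hypothetical helper written for this illustration.

```shell
# Made-up spam-word counts for hand-labelled messages (illustrative data only).
ham_scores="0 1 2 3 5 9 11"
spam_scores="7 10 12 14 18 25"

# count_at_least MIN SCORE...: print how many scores are >= MIN,
# that is, how many messages would be called spam at that criterion.
count_at_least() {
    min=$1
    shift
    n=0
    for score in "$@"; do
        if [ "$score" -ge "$min" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# The score lists are deliberately unquoted so the shell splits them into words.
for criterion in 0 8 10 12; do
    hits=$(count_at_least "$criterion" $spam_scores)
    false_alarms=$(count_at_least "$criterion" $ham_scores)
    echo "criterion=$criterion hits=$hits false_alarms=$false_alarms"
done
```

With this toy data, a criterion of 0 catches all six spams but also flags all seven hams, while a criterion of 12 produces no false alarms at the cost of two misses.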
Another approach is to change the sensitivity of your test. This involves moving the two curves in Figure One further apart (or, equivalently, reducing the width of the two curves). However, this approach is easier said than done. It often involves radical changes to your anti-spam software. Instead of measuring the number of words from a spam list, for instance, you might use the IP address of the sender or combine a spamminess metric with the IP address.
Figure One’s spamminess index is an abstraction. In practice, many spam-fighting techniques use binary classifications. For instance, a blackhole list is a database of IP addresses. You can reference a blackhole list from your Simple Mail Transfer Protocol (SMTP) server and reject all messages from the IP addresses on the list. Here, you have no control over the criterion or sensitivity of the spam testing, but behind the scenes these measures still exist: they’re derived from the policies the blackhole list maintainer uses to add IP addresses to the list. A very large repository of blackhole lists is maintained at http://www.declude.com/junkmail/support/ip4r.htm.
Keyword filters have also become very popular in recent years. You can create a procmail filter set, for example, that rejects messages based on criteria such as the presence of the word “Viagra” in a message or a Subject: header that contains more than five consecutive spaces. There’s no easy mathematical mapping of Figure One’s spamminess index to these rules, but you can think of each rule as reflecting a criterion placement in a set of messages. One of the most popular keyword filters is SpamAssassin (http://www.spamassassin.org), although this tool also supports other tests.
The latest approach to spam fighting is statistical filtering. Initially, you must create a database of words that appear in ham and another database of words that appear in spam. The filter then checks a message’s contents against both databases and computes a probability of the message being spam. This approach most closely maps onto Figure One, because that probability is a measure akin to a spamminess index.
The Theory Behind Statistical Spam Filters
Modern statistical spam filters exploded onto the scene as a result of Paul Graham’s essay, “A Plan For Spam” (http://www.paulgraham.com/spam.html). His approach used a statistical principle known as Bayes’ Rule, which enables you to combine the probabilities of events, such as the probability of a message containing a given word being spam. By looking at all of the words in a message (or, as Graham did, the fifteen “most interesting” words), you can compute the probability of the message being spam.
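As a rough sketch of how the per-word probabilities combine, here is the combining formula from Graham’s essay, which treats the words as independent evidence. The three word probabilities in the worked example are made up for illustration, and Bogofilter’s own Fisher algorithm combines probabilities differently.

```latex
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{\prod_{i=1}^{n} p_i}{\prod_{i=1}^{n} p_i + \prod_{i=1}^{n} (1 - p_i)},
  \qquad p_i = P(\text{spam} \mid w_i)

% Worked example with hypothetical values p = (0.99, 0.95, 0.20):
\frac{0.99 \times 0.95 \times 0.20}{0.99 \times 0.95 \times 0.20 + 0.01 \times 0.05 \times 0.80}
  = \frac{0.1881}{0.1881 + 0.0004} \approx 0.998
```

Note that a single low-probability word barely dents the result when the other words are strongly spammy; it takes several hammy words to pull the combined figure down.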
Statistical filters are fundamentally different from keyword filters because keyword filters are very rigid and narrow. For instance, a keyword filter that rejects messages containing the phrase “hot stock tip” rejects all messages with that phrase, disregarding who the sender is (even if the sender is someone you want to hear from). Although all three of the words in this phrase may have high spamminess indexes individually, a statistical filter might let the message through if the rest of the message contained words with sufficiently low spamminess indexes.
One of the drawbacks of statistical filtering is that it requires a large starting corpus of both spam and ham messages. Ideally, these messages should be provided by the individual using the filter.
If you plan to use a statistical spam filter, collect your spams for a few days. (If you can’t collect several hundred spams before setting up a statistical filter, you can find samples online; for instance, look at http://www.spamarchive.org.) You should also save your legitimate email for a while, especially if you don’t already have a collection of archived correspondence. Use both collections to “train” your filter to identify your spam and ham.
As an example of statistical spam filtering, let’s install and configure Bogofilter (http://bogofilter.sourceforge.net). You can download source code or x86 binaries; both are available as tarballs or RPMs. This column is based on a version 0.11.1.3 RPM installed on a SuSE system.
Fortunately, Bogofilter’s default configuration usually works fairly well. The main Bogofilter configuration file is /etc/bogofilter.cf. This file creates a configuration that applies to all local users of the tool. You can override the main configuration with an individual configuration file in your home directory named ~/.bogofilter.cf. (The main configuration file sets the name of this individual configuration file, so you can change .bogofilter.cf to anything you like.)
Bogofilter relies upon databases of ham and spam words, which by default reside in the user’s ~/.bogofilter directory, using the names goodlist.db and badlist.db, respectively. If this directory or these files don’t exist when you try to classify a message, Bogofilter creates them. If you want to run Bogofilter system-wide using a default system spam/ham database, change this directory definition to one in a common area, such as /etc/bogofilter. Alternatively, you can use the wordlist keyword to add global wordlist files while allowing users to build their own wordlists to supplement the global one. Follow the syntax provided in the bogofilter.cf comments.
If you use Bogofilter and find that it’s letting too much spam slip through or flagging too much ham, you can change the program’s criterion by adjusting the spam_cutoff line. For instance, for the Fisher algorithm, this value is set to 0.95. To make the program classify more messages as spam, lower this value (say, to 0.90). To have the program classify fewer messages as spam, raise it (say, to 0.97). You should leave these values alone until you have some experience with the program, though.
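As an illustration, a per-user override might look like the following fragment. This is a sketch that assumes the name=value syntax used in the stock bogofilter.cf; check the comments in your own copy of the file, because the exact form may differ between versions.

```
# ~/.bogofilter.cf (per-user settings; syntax assumed, see lead-in)
# Lower the criterion so that more messages are classified as spam.
spam_cutoff=0.90
```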
The most important part of configuring Bogofilter is feeding it samples of both spam and ham. The program accepts input on standard input, so you should use redirection to do the job:
$ bogofilter -sv < spam-sample.txt
$ bogofilter -nv < ham-sample.txt
After each run, Bogofilter reports the number of words and messages it’s classified. The files you feed to Bogofilter may be individual messages or message collections, such as mailbox files created by many Linux mail programs.
If you want to feed Bogofilter several files in succession, you can do that, too; Bogofilter adds words to the database and doesn’t delete them. Or, if you have a directory filled with messages in individual files, you can simplify matters by using cat to concatenate the messages into a single file, as in cat * > sample.txt. However, be sure to hold back a few spams and hams so that you can use them to test Bogofilter’s effectiveness. Once you’ve added samples to the database, you can test the filter by passing some individual spam and ham messages through Bogofilter:
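If you’d rather not concatenate the files, a loop does the same job one message at a time. The following is a minimal sketch: train_dir is a hypothetical helper, the ~/corpus paths are made up, and the BOGOFILTER variable is simply a convenience for pointing at a different binary. Only the -s and -n options described above are used.

```shell
# train_dir FLAG DIR: feed every regular file in DIR to Bogofilter.
# FLAG is -s (register as spam) or -n (register as ham).
# BOGOFILTER may be set to override the command name (an assumption of
# this sketch, not a Bogofilter feature).
train_dir() {
    flag=$1
    dir=$2
    for msg in "$dir"/*; do
        if [ -f "$msg" ]; then
            "${BOGOFILTER:-bogofilter}" "$flag" < "$msg"
        fi
    done
}

# Hypothetical layout: one message per file, with the held-back test
# messages stored somewhere else.
# train_dir -s "$HOME/corpus/spam"
# train_dir -n "$HOME/corpus/ham"
```

Keeping the held-back messages in a separate directory makes it harder to train on them by accident.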
$ bogofilter -v < message.txt
X-Bogosity: Yes, tests=bogofilter, spamicity=0.995393, version=0.11.1.3
In this case, Bogofilter has classified the message as being spam, with a “spamicity” rating of 0.995393. Try feeding Bogofilter the messages you held back earlier, to see how it handles messages it’s never seen before. If Bogofilter appears to be classifying these messages correctly, you’re ready to begin using it to filter real mail.
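To check a whole directory of held-back messages at once, you can count how many produce an X-Bogosity line beginning with Yes, as in the output shown above. This is a sketch under assumptions: count_flagged is a hypothetical helper, the directory names are made up, and the "|| true" guards against Bogofilter returning a nonzero exit status for messages it doesn’t consider spam.

```shell
# count_flagged DIR: classify each regular file in DIR with "bogofilter -v"
# and report how many were labelled spam.
count_flagged() {
    dir=$1
    flagged=0
    total=0
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue
        total=$((total + 1))
        # Capture the X-Bogosity line; ignore the exit status.
        verdict=$("${BOGOFILTER:-bogofilter}" -v < "$msg") || true
        case $verdict in
            "X-Bogosity: Yes"*) flagged=$((flagged + 1)) ;;
        esac
    done
    echo "$flagged of $total flagged as spam"
}

# Hypothetical directories of held-back messages:
# count_flagged "$HOME/corpus/test-spam"   # ideally all flagged
# count_flagged "$HOME/corpus/test-ham"    # ideally none flagged
```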
As just described, Bogofilter is a text-based tool you can call from the command line. How then do you integrate it into your mail system? You have several choices. One that’s very effective is to use Bogofilter in a procmail recipe. If you’re already using a procmail spam-filtering system, adding Bogofilter to the mix should be fairly straightforward.
If you’re not using procmail, reconfiguring your system to do so is generally fairly simple if you use a local mail queue. If you read all your mail from an ISP’s POP or IMAP server, you may want to reconfigure your system to use Fetchmail (http://catb.org/~esr/fetchmail) to inject that mail into a local mail queue, where you can then use procmail. The Bogofilter documentation includes information on configuring the Mutt mail reader to use Bogofilter directly, and it’s likely that more mail clients will use Bogofilter or similar tools in the future.
To use procmail, you need to create a .procmailrc file in your home directory. Alternatively, you can run procmail system-wide by creating an /etc/procmailrc file. However, running Bogofilter from a system-wide procmail configuration can cause complications in the form of root ownership of the ultimate spam destination.
The procmail configuration files contain a series of recipes that define how procmail is to handle mail. (You can read more about procmail in the July 2001 issue, available online at http://www.linux-mag.com/2001-07/guru_01.html.) A recipe to use Bogofilter is:
:0HB:
* ? bogofilter -u -l
$HOME/Mail/spam
Add this entry to your existing .procmailrc file or create a new .procmailrc file with these lines. This entry causes Bogofilter to automatically add words from the messages to the spam or ham databases (the -u option) and logs information on its activity (the -l option). Change the options as you see fit.
One word of caution: The -u option can cause a message’s words to be added to the wrong databases if Bogofilter misclassifies the message. Such mistakes can cause further misclassifications unless corrected, as described shortly.
The above recipe directs procmail to store messages that Bogofilter identifies as spam in the ~/Mail/spam file, which most mail readers will present as a folder called spam. If the ~/Mail/spam file doesn’t exist, procmail will create it, but if you use a global procmail configuration, the file may be owned by root, so be sure it exists first. You can peruse this folder from time to time and delete your spam. If you become very confident in Bogofilter’s spam classifications, you can change the $HOME/Mail/spam line to read /dev/null to direct procmail to permanently discard the messages that Bogofilter identifies as spam.
Most mail servers use procmail for delivering mail, even if you don’t realize they’re configured in this way. If a ~/.procmailrc or /etc/procmailrc file that calls Bogofilter doesn’t seem to work, you should first check your mail log files for entries from Bogofilter. It’s possible that Bogofilter is simply classifying messages as ham, when in fact they’re spam. You can also check your mail server’s configuration to be sure it uses procmail.
If you want to add messages to the spam and ham databases, you can save them and use the -s and -n options to bogofilter, as described earlier. If you make a mistake, or if you use the -u option and Bogofilter itself misclassifies a message, you can capitalize the options (-S and -N) to undo the old classification.
For instance, if a message has been mistakenly added to the spam list when in fact it’s ham, you can save the message as not-spam.txt and type bogofilter -Nv < not-spam.txt to correct the matter.
The Future of Spam
Bogofilter and other statistical spam filters are very useful tools for detecting spam. Their principles of operation are not without flaws, though. Spammers aren’t stupid, and as statistical spam filtering becomes more popular, spammers will adapt. Possible adaptations include encoding all messages using Base64, using HTML with random comments embedded within words, and sending very short spams that merely point to a web site on which the real message exists. Spam filters are also likely to adapt to at least some of these techniques; for instance, a filter might automatically decode a message that’s encoded with Base64. Such techniques are CPU-intensive, though, so few spam filters do this today.
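To see why these tricks defeat a filter that looks at the raw text, and what undoing them involves, consider this small sketch. Both helpers are hypothetical and are not part of Bogofilter; the example assumes a base64 command that accepts -d, as the GNU version does, and the sed pattern handles only simple comments with no ">" inside them.

```shell
# decode_base64 TEXT: print the decoded form of a Base64 string.
decode_base64() {
    printf '%s' "$1" | base64 -d
}

# strip_html_comments TEXT: remove simple <!-- ... --> comments.
strip_html_comments() {
    printf '%s\n' "$1" | sed 's/<!--[^>]*-->//g'
}

# Neither raw string contains a word a spam database would recognize.
decode_base64 'aG90IHN0b2NrIHRpcA=='      # prints: hot stock tip
echo
strip_html_comments 'Via<!-- xyz -->gra'  # prints: Viagra
```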
For the moment, statistical spam filtering is very effective for individual mail accounts. Many users report a 99% or better hit rate, with 0.1% false positives. Those figures become less impressive when statistical filters are applied to an entire site’s e-mail load, though, and they’re likely to drop, again as spammers adapt.
However you fight spam today, one thing seems certain: While today’s best anti-spam tools can be quite effective, they’ll be as ineffective against tomorrow’s spam as a copper shield is against a nuclear blast. Prepare yourself.
Roderick W. Smith is the author or co-author of ten books, including Advanced Linux Networking. He can be reached at firstname.lastname@example.org.