In Paul Graham’s now-famous article, “A Plan for Spam” (http://www.paulgraham.com/spam.html), Graham argued for a very different and radically simplified approach to spam filtering. Instead of using extensive rule-based schemes, Graham suggested using a statistical approach that learned from your e-mail. Shortly after, Bayesian mail filters began popping up everywhere. This month, let’s look at SpamBayes, one of the most popular and effective Bayesian tools around.
The idea behind Bayesian filtering is quite simple: the system learns from your mail preferences so that it can accurately predict whether a new message is spam, based solely on the content of the new message and what you’ve told the system in the past. Bayesian filtering works by breaking messages down into individual tokens (typically words) and scoring them. The scores are derived from how frequently each token occurs in your spam and non-spam (known as “ham”) messages.
For example, the system may learn that a message that contains “money” has a 70% probability of being spam, and that one containing “Viagra” has a 96% chance of being spam.
The tokens and their associated spam probabilities are stored in a database that the system consults when evaluating new messages. Over time, as you train the system and give it more feedback, the database becomes an increasingly accurate predictor of your personal e-mail preferences.
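To make the scoring idea concrete, here is a minimal Python sketch in the spirit of Graham's article. It is a simplified illustration, not SpamBayes' actual code: the function names, the tiny sample corpora, and the 0.01/0.99 clamping values are all assumptions chosen for the example, and SpamBayes itself combines token evidence in a more sophisticated way.

```python
import re
from collections import Counter


def tokenize(text):
    """Break a message into lowercase word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages):
    """Count how many messages in a corpus contain each token."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts


def token_score(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate the probability that a message containing token is spam."""
    spam_freq = spam_counts[token] / max(n_spam, 1)
    ham_freq = ham_counts[token] / max(n_ham, 1)
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen before: no evidence either way
    score = spam_freq / (spam_freq + ham_freq)
    return min(0.99, max(0.01, score))  # clamp so no single token is decisive


def message_score(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-token scores into one spam probability for the message."""
    spam_product = ham_product = 1.0
    for token in set(tokenize(text)):
        p = token_score(token, spam_counts, ham_counts, n_spam, n_ham)
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


# Hypothetical training data, standing in for your real mail archives.
spam = ["buy viagra now", "cheap money fast"]
ham = ["meeting at noon", "lunch money owed"]
spam_counts, ham_counts = train(spam), train(ham)
```

Calling message_score("viagra money", spam_counts, ham_counts, len(spam), len(ham)) scores the message as very likely spam, because “viagra” appears only in the spam corpus while “money” appears equally in both and contributes no evidence. A real filter would train on thousands of messages and would work with logarithms to avoid numeric underflow on long messages.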
Enough theory. Let’s get started.
To find the latest version of SpamBayes, start on the Unix client page at http://spambayes.sourceforge.net/unix.html, follow the “bundled package” link, and select a recent version from the SourceForge download page. Then download and extract the tarball. Assuming you have Python 2.2 or newer, you can then install SpamBayes using the installation script:
$ sudo python setup.py install
Once installed, create your personal SpamBayes configuration file in ~/.spambayesrc. The file should contain three lines at a bare minimum:
[Storage]
persistent_use_database = True
persistent_storage_file = ~/.hammiedb
With SpamBayes installed and ready, you need to provide some initial training. There are a number of ways to do this, but the easiest is to provide two mail archives: one full of spam and one full of ham.
$ hammie.py -d -g ham -s spam
The archive files ham and spam should each contain more than a few messages. The more e-mail messages you use, the better SpamBayes can classify your mail without additional training.
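If you want to check how much training material you have before you start, a short Python sketch like the following can count the messages in each archive. It assumes the archives are mbox-format files and uses only Python's standard mailbox module. The helper names and the default minimum of 100 messages are hypothetical, picked for illustration rather than taken from any SpamBayes recommendation.

```python
import mailbox


def count_messages(path):
    """Return the number of messages in an mbox-format archive."""
    # create=False raises an error instead of silently creating an empty file.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def undersized_archives(paths, minimum=100):
    """Return the archives holding fewer than `minimum` messages.

    The default threshold is an arbitrary, illustrative value.
    """
    return [path for path in paths if count_messages(path) < minimum]
```

Running undersized_archives(["ham", "spam"]) in the directory that holds your two archives returns the names of any archive that falls below the threshold, which is a hint to collect more mail before training.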
Putting SpamBayes to Work
If you’re already using procmail to filter and sort your e-mail, integrating SpamBayes is quite easy. First, tell procmail to pass messages through SpamBayes for analysis and classification. In your ~/.procmailrc file, add:
:0 fw
| /usr/local/bin/hammie.py -f -d -p
That invokes hammie.py as a filter and instructs it to use the database file generated during the initial training process. (The path to hammie.py may be different on your system.)
Once a message has been passed back to procmail, you can filter the e-mail based on the SpamBayes classification. Every message falls into one of three classes: ham, unsure, or spam. SpamBayes adds an X-SpamBayes-Classification header to each message to make filtering easy:
* ^X-SpamBayes-Classification: spam
That recipe will toss all spam messages into a separate mbox-style mailbox. If you’re using Maildir or some other format, you’ll need to adjust this rule to suit.
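If your mail passes through a script rather than procmail, the same header check is easy to sketch in Python with the standard email module. Treat this as an illustration only: the function names and the folder names in the mapping are hypothetical, and the code assumes the header value begins with one of the three class names, optionally followed by extra detail after a semicolon.

```python
from email import message_from_string

HEADER = "X-SpamBayes-Classification"
CLASSES = {"ham", "unsure", "spam"}

# Hypothetical folder names; adjust them to match your own mail setup.
FOLDERS = {"ham": "inbox", "unsure": "unsure", "spam": "spam"}


def classification(raw_message):
    """Return 'ham', 'unsure', or 'spam', or None if the header is absent."""
    message = message_from_string(raw_message)
    value = message.get(HEADER, "")
    # Keep only the leading class name in case extra detail follows it.
    label = value.split(";")[0].strip().lower()
    return label if label in CLASSES else None


def folder_for(raw_message, default="inbox"):
    """Pick a destination folder from the classification header."""
    return FOLDERS.get(classification(raw_message), default)
```

A message that was never passed through the filter has no header, so folder_for falls back to the default folder instead of guessing.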
If you’d rather not deal with procmail, there are alternatives. SpamBayes also provides POP and IMAP “proxies” that sit between your mail client and your POP or IMAP server(s). When your mail client fetches new messages, the proxy fetches each message, classifies it, and then passes it to your mail client. The same SpamBayes headers are added, so you can create a custom filter in your mail client to move spam messages to the correct folder. These proxies even come with a built-in web interface that simplifies the training process, so there’s no need to touch the command line if you’d rather not.
Do you have an idea for a project we should feature? Drop a note to firstname.lastname@example.org and let us know.