In Paul Graham’s now-famous article, “A Plan for Spam” (http://www.paulgraham.com/spam.html), Graham argued for a very different and radically simplified approach to spam filtering. Instead of using extensive rule-based schemes, Graham suggested a statistical approach that learns from your e-mail. Shortly after, Bayesian mail filters began popping up everywhere. This month, let’s look at SpamBayes, one of the most popular and effective Bayesian tools around.
The idea behind Bayesian filtering is quite simple: the system learns from your mail preferences so that it can accurately predict whether a new message is spam, based solely on the content of the new message and what you’ve told the system in the past. Bayesian filtering works by breaking messages down into individual tokens (typically words) and scoring them. The scores are derived from how frequently each token occurs in your spam and non-spam (known as “ham”) messages.
For example, the system may learn that a message that contains “money” has a 70% probability of being spam, and that one containing “Viagra” has a 96% chance of being spam.
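To make the token scoring concrete, here is a minimal sketch in Python. It is not SpamBayes’s actual code: the function names (tokenize, count_tokens, token_spam_probability) and the six toy messages are assumptions made purely for illustration, and the score is simply the share of a token’s appearances that fall in spam.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a message into lowercase word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def count_tokens(messages):
    """Count how many messages each token appears in."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token) from the token's frequency in each class."""
    spam_freq = spam_counts[token] / n_spam
    ham_freq = ham_counts[token] / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen before: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Toy training data (hypothetical); a real filter learns from your own mail.
spam = ["Make money fast", "Cheap Viagra, save money", "Viagra discount"]
ham = ["Lunch on Friday?", "Can you lend me lunch money", "Meeting notes attached"]

spam_counts = count_tokens(spam)
ham_counts = count_tokens(ham)

for word in ("money", "viagra", "lunch"):
    p = token_spam_probability(word, spam_counts, ham_counts, len(spam), len(ham))
    print(f"{word}: {p:.2f}")
```

With so little training data the estimates are extreme: a token seen only in spam scores 1.0 and one seen only in ham scores 0.0. Figures like the 70% and 96% above emerge only after the system has seen many messages of both kinds.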
The tokens and their associated spam probabilities are stored in a database that the system consults when evaluating new messages. Over time, as you train the system, providing it with more feedback, the database becomes a better and better predictor of your personal e-mail preferences.
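The database and the way it is consulted can be sketched as follows. This is an illustrative assumption, not how SpamBayes stores or combines its data: TokenDatabase is a hypothetical in-memory class, the 0.01 and 0.99 clamps follow the limits Graham describes in his article, and the combining rule is the plain naive-Bayes product.

```python
import math
import re
from collections import Counter


class TokenDatabase:
    """Hypothetical in-memory token store; a real filter would persist this to disk."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        """Return the set of distinct lowercase word tokens in a message."""
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def train(self, text, label):
        """Record one message the user has labeled 'spam' or 'ham'."""
        self.counts[label].update(self.tokenize(text))
        self.messages[label] += 1

    def token_probability(self, token):
        """P(spam | token), clamped so no single token is treated as certain."""
        spam_freq = self.counts["spam"][token] / max(self.messages["spam"], 1)
        ham_freq = self.counts["ham"][token] / max(self.messages["ham"], 1)
        if spam_freq + ham_freq == 0:
            return 0.5  # unseen token: neutral
        return min(0.99, max(0.01, spam_freq / (spam_freq + ham_freq)))

    def score(self, text):
        """Combine the token probabilities into one spam probability."""
        log_spam = log_ham = 0.0
        for token in self.tokenize(text):
            p = self.token_probability(token)
            log_spam += math.log(p)
            log_ham += math.log(1.0 - p)
        # Same as prod(p) / (prod(p) + prod(1 - p)), computed in log space.
        exponent = min(log_ham - log_spam, 700.0)  # keep exp() from overflowing
        return 1.0 / (1.0 + math.exp(exponent))


db = TokenDatabase()
db.train("Make money fast with cheap Viagra", "spam")
db.train("Win money now", "spam")
db.train("Lunch on Friday?", "ham")
db.train("Notes from the Friday meeting", "ham")

print(db.score("Cheap money now"))       # close to 1: looks like spam
print(db.score("Friday lunch meeting"))  # close to 0: looks like ham
```

Each further call to train() adjusts the counts, which is the sense in which the database improves as you give it more feedback.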
Enough theory. Let’s get started.
To find the latest version of SpamBayes, start on the Unix client page…