Electronic mail is both a blessing and a curse. On the one hand, email is so convenient: You can send email at any time from almost any computer, and email is great for staying connected to colleagues, friends, and family members, especially if you’re separated by time zones or great distances. Moreover, email costs nothing and is easily shared with groups of any size. On the other hand, email can be overwhelming. Today, it’s not unusual to receive hundreds of email messages a day. Sure, reading and replying to all of that correspondence has to be done, but filtering and filing all of that email (not to mention fighting spam) can be downright daunting and time-consuming.
To cope with tidal waves of email, users have turned to a variety of tools and techniques. For example, almost all commercial email applications offer programmable filters to automatically sort and file incoming (and outgoing) email. While filters can be effective at blocking certain kinds of mail, say, from a specific user or domain name, filters are only as smart as you make them. Creating new filters and maintaining existing filters often just adds to the work. Worse, it’s almost impossible to take filters from one email application to the next. And spam blockers can be just as maddening: One user’s spam is another user’s treasure.
What’s needed is a personal, flexible, and ideally, adaptive email filtering system. And that’s exactly what POPFile is. POPFile, one of the top five most popular open source projects on SourceForge, scans the content of incoming email messages and classifies each one by comparing each new message to the email you’ve already received. New email messages with words similar to existing messages get assigned to the same category, or “bucket.”
POPFile must initially be trained (you have to sort things manually to start), but can then demonstrate amazing accuracy with no user intervention. And as time goes on, POPFile’s precision only improves. Even better, you can use POPFile with your favorite email application (as long as you receive email via POP3).
Linux Magazine Editor Martin Streicher and SourceForge.net Site Director Pat McGovern recently exchanged email (but of course!) with the leaders of the POPFile team to learn more about the clever creators — project leader and founder John Graham-Cumming, Ph.D., 35, lead developer Stanley Krute, 52, and lead developer Sam Schinke, 22 — behind a very clever solution.
How would you describe POPFile?
Stanley Krute: POPFile is a filter that learns. POPFile starts out as an idiot: Initially, it has no idea how the user wants their email sorted. At first, the user tells POPFile which category, or bucket, particular email messages should go in. After a while, by analyzing the words contained in each categorized message, POPFile is able to sort messages without user input. If a user provides consistent training, POPFile becomes rather smart within a few hundred messages. Within a few thousand messages, it can be better at email sorting than many humans.
John Graham-Cumming: POPFile is a POP3 proxy that automatically sorts email using a Naive Bayes text classifier. POPFile adds a message header to each downloaded message to indicate where the message should be placed. You can have any number of buckets.
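To make the idea concrete, here is a minimal Python sketch of a Naive Bayes bucket classifier that learns from hand-sorted examples and then stamps a header on each new message. It is an illustration only, not POPFile’s actual Perl code: the class name, the tokenizer, the Laplace smoothing, and the header name X-Example-Bucket are all assumptions made for this example, and it handles only simple single-part messages.

```python
import math
import re
from collections import Counter, defaultdict
from email import message_from_string


class BucketClassifier:
    """Tiny multinomial Naive Bayes classifier over message words (illustrative)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> occurrences
        self.message_counts = Counter()          # bucket -> training messages seen

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase alphanumeric runs are good enough as "words".
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        """Record one hand-sorted message as an example of `bucket`."""
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def classify(self, text):
        """Return the most probable bucket, or None if nothing was trained."""
        vocabulary = set()
        for counts in self.word_counts.values():
            vocabulary.update(counts)
        total_messages = sum(self.message_counts.values())
        words = self.tokenize(text)
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # Log prior: how common this bucket was during training.
            score = math.log(self.message_counts[bucket] / total_messages)
            bucket_total = sum(counts.values())
            for word in words:
                # Laplace smoothing keeps unseen words from zeroing the score.
                score += math.log((counts[word] + 1) / (bucket_total + len(vocabulary)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket


def tag_message(raw_message, classifier, header="X-Example-Bucket"):
    """Classify a single-part message and add a (hypothetical) bucket header."""
    msg = message_from_string(raw_message)
    text = f"{msg.get('Subject', '')} {msg.get_payload()}"
    msg[header] = classifier.classify(text) or "unclassified"
    return msg.as_string()
```

An email client could then file messages with an ordinary rule that matches on the added header, which is why the approach is independent of any particular mail program.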
You said that POPFile is a proxy. Do you have to be a system administrator to install or use it?
Sam Schinke: No. Anyone with Perl can use it. You need a web browser to train POPFile, and your email has to be served via POP3. POPFile should run on any system with a bit of RAM to spare. POPFile’s also available in many foreign languages.
What makes POPFile unique?
Graham-Cumming: POPFile is unique in that it allows you to “automagically” sort your email by teaching it like a child. Show POPFile examples of your email and it quickly learns what mail should go where. POPFile is flexible, accurate, and lets you sort incoming email into any number of categories.
John, what led to the creation of POPFile?
Graham-Cumming: While I was working at Scriptics (the Tcl company), I was receiving a large amount of email and went searching for an automatic sorting solution. The only one available at the time was iFile, which required that I use exmh. As a corporate user of Microsoft Outlook, I needed something that would play well in the Microsoft environment, so I ended up downloading the libbow toolkit from CMU and writing a COM+ plugin for Outlook, which I called AutoFile.
Later, I got to thinking about how long it took to download my POP3 email through a 56K line and came to the realization that message downloads ought to be sorted into ascending order of size, not delivery time, so that the small mails arrive first and you can be dealing with them while the big ones download. That led to the creation of POPtimize, a POP3 proxy written in Perl and still available from http://www.extravalent.com.
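The smallest-first idea can be sketched in a few lines of Python. This is not POPtimize’s code; the function name is hypothetical, and it assumes the input is a list of scan listings in the “message-number size” form that a POP3 LIST command returns (for example, the second element of the tuple returned by Python’s poplib.POP3.list()).

```python
def download_order(listing):
    """Return message numbers ordered so the smallest messages come first.

    `listing` holds POP3 scan listings such as b"1 52000" (number, size in
    octets). Ties in size fall back to the original message number.
    """
    sizes = []
    for line in listing:
        number, size = line.split()[:2]
        sizes.append((int(size), int(number)))
    return [number for _size, number in sorted(sizes)]


# Hypothetical mailbox: one large attachment and two short notes.
print(download_order([b"1 52000", b"2 340", b"3 1200"]))
```

A proxy could fetch messages in that order and present them to the mail client as if they had arrived that way, so short notes show up while the large attachment is still downloading.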
Then in August 2002, I realized that I could merge AutoFile and POPtimize to create a POP3 proxy that would work with any email client and do automatic email sorting. Hence, POPFile was born. A month later, I registered POPFile on SourceForge and released the code under the GPL.
Stan, Sam, how did you get involved?
Krute: A little over a year ago, I started to work part-time on spam filtration. I’ve been using email for 26 years, and had finally had it with spam.
Like many folks, I started out thinking that a set of rule-based filters might work quite well, so I began implementing such a set in Outlook Express. Things went well at first, and eventually I was able to catch about 97% of my fairly voluminous spam flow with a set of 800+ rules. But there were a number of vexing problems with the rules-based approach: filter set degradation, filter set maintenance, and the need for large “white lists” to avoid false positives.
In August of 2002, I saw Paul Graham’s paper on Bayesian filtration. It described a statistical approach to spam filtering and appeared to solve the issues I’d come across with rules-based filters. But I couldn’t get my hands on any code to check it out — until I found POPFile. Talk about serendipity. I downloaded it, loved it, was happy to see that the source code was coherent. I quickly decided it would be much more fun to put my energies into POPFile than go and reinvent the wheel on my own.
Schinke: I first exchanged email with John in August 2002. Prior to that, there were undoubtedly some newsgroup conversations. I wasn’t yet developing the software, but was using it during beta and doing bug-hunting, searching specific pieces of code to find the causes of specific problems. In the process, I found myself learning Perl and liking it. So, I went from finding bugs to fixing bugs, which is an easy step in a moderately-sized piece of open software — even if you aren’t proficient in the specific language. I do suggest some programming experience though.
Do any of you work full-time on POPFile?
Graham-Cumming: No. Even for me, POPFile is a part-time project. I’d love for it to be my day job, but it doesn’t pay the bills… yet.
I spend about an hour a day answering email during the week (and thank goodness for POPFile since I need to sort through about 400 emails a day); then I spend a lot of time on the weekend working on the code. I probably spend twenty hours a week total — coding, answering emails, and rearchitecting in my head.
Schinke: I’m a working student, though I have taken this year off school. I work at a local roller-skating arena and also run a computer business. I spend ten to twenty hours on POPFile each week. Some weeks I don’t have time for much work at all though, and other weeks I go well beyond that.
Krute: I probably average twenty hours per week doing POPFile work, but, like Sam, my time varies, depending on how screaming my other tasks are.
I run a small computer hardware/software/training/web development company, and am also quite active in animal rescue — I’ve currently got 18 dogs and half a dozen cats in residence. Doing animal rescue is a bit like being a dairy farmer in terms of time commitment.
John, how do you coordinate the project?
Graham-Cumming: From my secret underground lair I communicate almost exclusively with the team of developers and patchers through the SourceForge forums. POPFile has three special developer-oriented forums: Bleeding Edge Source Code, Bleeding Edge Documentation, and Bleeding Edge UI. I generally assign specific tasks through those forums and also often respond to people who wish to fix a particular bug. Many users submit patches through the SourceForge patch system, which I also look at.
How many people use your software?
Krute: Based on the number of downloads of each new release, my guess is that POPFile currently [as of March 2003] has about 10,000 users. I think we’ve had something like 45,000 downloads since the project was posted.
What has surprised you the most?
Krute: How well the program works. At this point, about a third of POPFile users are getting 98%+ accuracy. And about a sixth of users are getting 99%+ accuracy. That’s truly amazing. Most humans can’t manually sort their email with that level of accuracy. And it’s accomplished in the face of some remarkable efforts by spammers to mask the contents of their messages. John and Sam have done a remarkable job putting code into POPFile that unmasks message content.
Graham-Cumming: A guy wrote to me from Europe to tell me that he’d modified POPFile to work with his cell phone so he could avoid reading spam on the tiny screen. He donated money to [POPFile] because POPFile was directly saving him money: Every spam downloaded onto a cell phone costs money because the subscriber pays by the byte. POPFile helped him take a byte out of his spam. (Stop me before I pun again.)
What’s been your greatest challenge?
Graham-Cumming: Making a Perl script usable by the average Windows user. You can’t expect the average person used to a SETUP.EXE to download ActivePerl, install the scripts, and type perl popfile.pl. That’s too much for 90% of computer users.
Part of POPFile’s world domination is predicated on the need to infect the Windows desktop with simple email classification. All I can say is thank heavens for the NSIS SuperPIMP installer.
Krute: Finding time to work on the project. I’ve got a fairly time-crunched existence, to put it mildly.
Schinke: Right now I am working on a suite to provide an objective test of POPFile’s accuracy so that different classification and mail parsing strategies can be objectively tested.
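As a rough picture of what such an objective test might look like, the Python sketch below scores competing strategies against the same set of hand-labeled messages. It is not the POPFile test suite: the function names, the two toy strategies, and the sample messages are all invented for illustration.

```python
def accuracy(classify, labeled_messages):
    """Fraction of (text, expected_bucket) pairs that `classify` gets right."""
    if not labeled_messages:
        return 0.0
    correct = sum(1 for text, bucket in labeled_messages if classify(text) == bucket)
    return correct / len(labeled_messages)


# Two toy strategies to compare (hypothetical stand-ins for real classifiers).
def keyword_strategy(text):
    return "spam" if "buy" in text.lower() else "inbox"


def always_inbox(text):
    return "inbox"


# Hand-labeled messages held back from training (invented examples).
held_out = [
    ("Buy now", "spam"),
    ("Lunch tomorrow?", "inbox"),
    ("Cheap pills", "spam"),
    ("Meeting notes", "inbox"),
]

for strategy in (keyword_strategy, always_inbox):
    print(strategy.__name__, accuracy(strategy, held_out))
```

Scoring every strategy against the same held-out messages is what makes the comparison objective: the only thing that changes between runs is the classifier.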
What are you most proud of?
Graham-Cumming: An ISP wrote to me to tell me that his company had decided not to use a $30,000 piece of spam removal software, but instead was going to donate $300 per month to the POPFile project for their use of POPFile. That was very cool.
Equally great was migrating an open source project out of the hands of geeks and into the hands of the average Outlook Express user. With all of the talk about Linux’s success, it’s easy to overlook that open source software has had little impact on the general desktop user. POPFile is a true “cross over” project: It’s GPL open source and easy to use.
What’s next? What’s ahead?
Krute: POPFile is a powerful pattern recognition learning engine. It’ll be interesting to see how we can harness that engine to perform all sorts of data-handling chores. I’d like to have POPFile filter RSS news items and NNTP posts, for example.
Graham-Cumming: To infinity and beyond! The road ahead includes support for IMAP, SMTP, and more and more languages, including Asian and Middle Eastern languages. We’d also like to create client plug-ins so that a user of, say, Outlook Express could train POPFile without the web interface. The code needs a total rearchitecture to better leverage POPFile’s object-oriented structure and to simplify how to write POPFile Loadable Modules.
Schinke: I see integration with MTA/MXs at the ISP level, allowing ISPs to replace relatively high-maintenance spam solutions with user-maintainable POPFile accounts.
I’m with John in believing that integration and acceptance at the client/desktop level will be a huge boost for POPFile, but I think acceptance by ISPs would be even bigger and have an even larger impact on junk mail.
How can people help?
Krute: I have a strong belief that people do their best work when they’re working on projects that they love. Folks who try out POPFile and love it will come across things they think can be improved. At that point, they should make some suggestions in the Bleeding Edge forums where we coordinate work on the project. Chances are good someone will say, “Go for it!” We can use help on all aspects of the project: coding, documentation, testing, and user support.
Schinke: To fully handle extended character-set email, it’ll be necessary to use UTF-8/Unicode for our message handling. Help or advice in this area would be welcome. I think we could also use somebody quite experienced in multi-user Linux/BSD environments if ISP use is to become a reality. Further, since the heart of POPFile is a statistical engine, any statisticians would be welcome. We’re examining alternative or supplementary classification strategies and more informed input would be great. We have some tools to give “results based” answers to whether one approach is better than another, but we can’t always tell why.
Graham-Cumming: POPFile has a POPFile Developers Guide available in the Docs section of the POPFile project on SourceForge. Anyone interested in contributing should read that document first and then drop in on the developer forums. Make a suggestion there and then get coding!
Purpose: POPFile is an email filter that learns. Used as a proxy between a user’s email client and email server, POPFile uses statistical methods to sort the user’s email messages into an arbitrary number of categories.
System Requirements: POPFile is written in Perl and works on Windows, Macintosh, Unix, and Linux systems.
License: GNU General Public License (GPL)
Founded: August 2002
Project Leader: John Graham-Cumming
Development Status: Production/Stable