Wednesday, June 13, 2007

Buckets 'o Spam!

I have assets exceeding twenty million dollars in a Nigerian Bank. I don't need to make a withdrawal because I can refinance my home with no money down and get cash back in the process. But this is nothing compared to my 37 inch, um, wallet. And I can stand at attention for hours, much to the delight of the lonely but attractive women who have decided to email me. Oh. You, too, eh? A few years back I decided I should be proactive and try to figure out what to do about spam. I read Paul Graham's `Plan for Spam' and decided to make my own spam filter in Emacs. It took a while, mainly because I really wanted to understand the principles behind the Bayesian approach. Ultimately, I got my filter working, but it just didn't perform as well as I had hoped. I was considering more complex approaches, but the university installed SpamAssassin and it did a pretty good job, and I wasn't getting paid to design spam filters. But now and again I want to pursue an idea that sounded promising to me. Nigerian 419 spam really doesn't have much in common with the discount drug ads, and neither share many features with the cheap mortgage spams. I did a bit of research and found that the bulk of spam falls into about 10 major catagories. A narrow Bayesian filter could easily pick out one of these catagories, but you need a fairly broad filter to handle them all. But a broad Bayesian filter is more prone to false positives. It occurred to me that a set of Bayesian filters, each tuned to a particular catagory of spam, might perform much better than a single filter attempting to cover spam in general. The problem is training such a filter. You need an already classified corpus of examples in order to compute the statistics for a Bayesian filter. But I *hate* this stuff! I'm not going to wade through thousands of spams and try to manually classify them into the various sordid catagories. And I'd have to retrain the filters each time a new scam comes along. So rather than filtering my mail, I want to categorize it. I want a program to look over my incoming mail and simply group it into buckets based on similarity. The categorizer won't know that a set of emails are spam. It doesn't need to. It will simply place all the similar email into the same bucket. When I go to read my mail, I'll take a peek in each bucket. ``V1AGR@" -- nope, not interested. ``Continuation-passing-style'' --- ok, I'll read the email in this bucket. I'll have the added benefit of auto-generated folders for the email I *want* to read. But I don't want to have a pre-defined set of categories. I want the machine to perform `unsupervised clustering'. This is a much harder task than simple Bayesian filtering. I've been thinking about this recently, and I figured I'd share my confusion. My current line of inquiry is looking into Bayesian clustering. Some of the literature is promising, but a friend with more experience in the field said that nothing she saw looked very good. More to come, I hope.

1 comment:

patchworkZombie said...

Set up some defaults on each filter so that it catches some of the appropriate type of email. Any email taken by more than one filter is assigned to the one with higher confidence and both are trained accordingly. On false negatives you assign the that spam to the filter that most likes it (possibly the "miscellaneous" filter if none of the others like it much). This way you get a group of filters each trained to take the emails that the others do not want.