Thursday, June 14, 2007

Pondering about clusters

I don't know much about unsupervised machine learning, so I'm trying to figure it out for myself. This will probably be laughably naive to someone that knows this stuff. Go ahead and laugh. I got on a Bayesian clustering kick yesterday and read a few papers. Using Unlabeled Data to Improve Text Classification by Kamal Paul Nigam seemed relevent. His thesis was that you could use a little bit of labeled data and a lot of unlabeled data to train a text classifier. I was wondering if this works in the limit of no labeled data at all. This looked like a good starting point, and I started thinking of the generative model of email and spam. I spent a few hours pondering this when I realized that I'm approaching this incorrectly. I need to understand what a `cluster' is in the first place before I can figure out how to get the machine to recognize one. There seem to be a number of different theories about how to cluster data. It is hard to decide what is the `best'. They all seem to have a certain amount of `magic', too, usually in the form of inpenetrable math. I thought I knew some math when I went to college, but some of these things look pretty bizarre. It isn't obvious to me what constitutes good clustering. At one limit, we have one cluster for each and every email that holds exactly that email. The clusters are *very* precise, but not very helpful. At the other limit, we have one cluster for the entire corpus of email. It isn't very precise or helpful. Somewhere in the middle we have a set of clusters each of which most likely contain several messages. There has to be some measure that assigns a high value to this middle ground and a low value to the limiting case. If you're like me, you're already thinking about drawing circles around groups of little dots on some graph paper, and how to decide the right size of circle, where to put it, etc. But I'm getting ahead of myself. What exactly *is* a cluster? Why can't I just take every fifth email, put it in a bag and call that a cluster? The point of clustering is to capture some similarity between objects. If I am thinking of a particular email, and I ask you a question about it, you'd have no idea at all what the answer might be. But if I were to tell you that the email I'm thinking of is from my `spam' cluster, then you'd be able to answer a few questions about it. For instance, you might infer that the liklihood that the email solicits money is *much* higher than if you didn't know the clustering. So if you know that an object belongs to a cluster, you now know something about the object itself. A cluster must have a certain amount of information associated with it. This is starting to make a little sense. If I were to randomly assign email to clusters, then the fact that an email was in cluster 5 would be fairly meaningless because we'd expect cluster 5 to be more or less the same as cluster 4 and have the same statistical characteristics as the entire corpus of email. So the measure of a good clustering has to be related to the information content of that cluster. To be continued....

2 comments:

Hadley Wickham said...

Yes, you want objects within a cluster to be as similar as possible, and clusters to be as different as possible. How you define these similarities, and the algorithm you use to find the optimal solution, basically define the different types of clustering.

ed said...

Have you looked into Latent Semantic Analysis at all?

I did some work in the past around a similar topic and found that LSA could certainly be useful when it comes to identifying clusters.