Friday, July 25, 2008

I had intended to write about clustering, but I went off on a tangent about Bayesian probability.

Spectral clustering was disappointing. I learned a bunch about linear algebra, so it wasn't a total waste of time, but the results just didn't appear useful. There are some obvious similarities in the data that you can see just by looking at it. I expect the clustering algorithm to find these and perhaps some I didn't see.

I'm currently exploring Bayesian Hierarchical Clustering. I started with the paper by Heller and Ghahramani. The paper seemed like a reasonable approach and I always wanted to learn more about Bayesian statistics.

There are two general approaches to clustering. ‘Top down’ works by initially putting everything in one cluster, then dividing that cluster into two smaller ones. You then recursively apply the algorithm to the resulting clusters. ‘Bottom up’ works by initially putting each item into its own cluster then selecting clusters that are similar and merging them into larger clusters.

Bottom up has the problem that you have an enormous number of initial clusters and finding candidates to merge is an O(n^2) problem. This puts a limit on how much data you can handle. I want to merge tens of thousands of items, so this doesn't seem attractive.

Top down is very appealing because it appears to be an O(log n) process. However, looks are deceiving. There are about 2^n ways to divide a set into two parts, so it is impossible to do a brute-force search for the ‘best’ division. Furthermore, it is a lot easier to describe how two items are similar (for a bottom-up merge) than it is to describe how mixtures from multiple sets are different.

Bottom-up merging usually works by having some sort of similarity metric between clusters. At each step you find some likely candidates to merge and score the candidates merges by similarity. You then merge the pair with the best score and repeat the process.

Bayesian Hierarchical Clustering is a slight variation on this. But instead of creating a similarity metric between clusters, we compute the probability that the two clusters belong together. At each step we find the pair of clusters most likely to belong together and merge them. This has a few advantages: it is resistant to ‘noise’ and ‘outliers’, and the probabilistic component allows us to model our uncertainty (i.e. ignorance) about the data.

This last point is important. It isn't so much *our* ignorance we want to model, it is the *computer's* ignorance. Suppose we are clustering a set of email messages that contain a certain amount of spam. We don't expect the computer to have perfect knowledge of what constitutes spam, or how spam is created and distributed, or why there is spam in the set of message to cluster. We want the computer to *discover* or *notice* that a bunch of these message are all very similar and should therefore be grouped together.

So what's a cluster?

If we're going to group data together, we need to have sort of idea what it means to be part of the group. A generative model is often appropriate. We assume that there exist a number of machines that generate our data samples. An unknown number of data points are taken from each machine and they are all mixed together to form our final data set. Our goal in clustering is to find a plausible set of generators and determine which data points they generated. Our generative model doesn't need to be accurate, or even remotely realistic. It doesn't matter that the data isn't generated the way we model it, it only matters that the data points are grouped in a useful way when we are done.

I've always preferred learning from concrete examples, so let's make one. I'll borrow the examples from Dave Madigan's ‘Statistics and the War on Spam’. He has four email messages that make up his test corpora, and one additional email to classify. Rather than train a Bayesian classifier on this corpus, I want to perform a hierarchical clustering of all the messages.

As in the paper, we transform the messages into binary term vectors that indicate the presence or absence of terms.

Msg0: #( 1 1 1 0 0 0 ) X_0 = brown X_1 = fox
Msg1: #( 0 0 1 1 0 1 ) X_2 = quick X_3 = rabbit
Msg2: #( 0 0 0 1 0 1 ) X_4 = rest X_5 = run
Msg3: #( 0 0 0 1 1 0 )
Msg4: #( 0 0 1 1 1 0 )
.
What we're going to do is consider all the pairs of messages and ask ourselves whether we think it is more likely that they came from a cluster or more likely they didn't. How on earth do we do that?!

Returning to the clustering problem, I have some data (D) and a cluster model (M) and given the model it is easy to determine the probability of my model generating the observed data. Here's where Bayes theorem comes into the picture. We have the probability of the data conditional upon the model, p(D|M). We use Bayes theorem to obtain the probability of the model conditional upon the data p(M|D).

We're going to apply one more trick. Instead of using this probability directly, we are going to compare two different models to determine which of the two is more likely. We'll compute the probability ratio between the two models to see which one is favored. The ratio of probabilities is called the ‘marginal’ probability. The marginal probability can in some cases be easier to compute than the two probabilities that it is derived from. This is because the normalizing term (the a-priori probability of the data) is common between the two and cancels out in the ratio.

So we start by putting each message in its own cluster. Then we consider all the pairs that can be formed from the clusters to find the pair that is most likely to be combined. Let's suppose we are checking message one against message two.

We're going to make the completely unwarranted assumption that the words in each message are independently generated. That is, our generative model is that every cluster has a set of biased coins, one coin for each word. To generate a message, each coin in the cluster is flipped and the selected word is included if the coin comes up heads, but omitted if it comes up tails. This model is clearly deficient because it ignores word order and word interdependence in sentences, yet it seems to be a pretty good model nonetheless. By assuming independence, we greatly simplify the probability calculations, which are already difficult.

We're going to compare the probability of two hypothesis:
  1. Both messages were generated by a single set of biased tosses.
  2. Each message was generated by its own set of biased tosses.
We don't know the biases, so we have to integrate over all possible biases. Since we make the assumption of independence, we can take the product of integrals for each term rather than dealing with a truly nasty polynomial.

To be continued....