Our first hypothesis is that a single generator produced the data points. Since each term is generated independently, it is as if each term had its own biased coin. We don't know the biases, so we choose a uniform prior probability distribution (any bias from 0 to 1 is equally likely). Fortunately, the integral over this distribution has a closed-form solution:

    gamma(h + 1) * gamma(t + 1)
    ---------------------------
         gamma(h + t + 2)

(Here h is the number of heads, t the number of tails, and gamma is the gamma function.)
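To make this concrete, here is a small Python sketch of that closed form (the function name `term_marginal` is my own):

```python
from math import gamma

def term_marginal(heads, tails):
    """Probability of a particular sequence of heads and tails,
    integrating the coin's unknown bias over a uniform prior:
    gamma(h+1) * gamma(t+1) / gamma(h+t+2)."""
    return gamma(heads + 1) * gamma(tails + 1) / gamma(heads + tails + 2)

print(term_marginal(2, 0))  # two heads          -> 1/3
print(term_marginal(1, 1))  # heads, then tails  -> 1/6
print(term_marginal(0, 2))  # two tails          -> 1/3
```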

If we select a biased coin at random and toss it twice, we expect to see two heads about 1/3 of the time. We expect two tails one third of the time, too. Heads first and tails second comes up 1/6 of the time, tails first and heads second the other sixth. (Note: the idea is that each time we do this we take a fresh coin with an unknown bias and give it two flips. It is *not* to say that if we take a single coin with an unknown bias we will get two heads 1/3 of the time on every pair of flips!)

If we generate messages one and two by selecting a biased coin and flipping it twice (once for each message), repeating this for each of the six terms, we find the following probability of the data being generated by a single cluster with unknown biases:

Msg1: #( 0 0 1 1 0 1 )
Msg2: #( 0 0 0 1 0 1 )

Msg1  Msg2
  0     0   -- 1/3
  0     0   -- 1/3
  1     0   -- 1/6
  1     1   -- 1/3
  0     0   -- 1/3
  1     1   -- 1/3

1/3 * 1/3 * 1/6 * 1/3 * 1/3 * 1/3 = 1/1458

One chance in 1458. That's the probability of generating these exact two messages in this order from a set of six coins with a uniformly distributed bias. Not much of a chance, but we aren't interested in the absolute chance; we are interested in the marginal probability.
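That 1/1458 is easy to check mechanically. A sketch, reusing the closed form above (names are mine):

```python
from math import gamma

def term_marginal(h, t):
    # uniform-prior marginal: gamma(h+1) * gamma(t+1) / gamma(h+t+2)
    return gamma(h + 1) * gamma(t + 1) / gamma(h + t + 2)

msg1 = (0, 0, 1, 1, 0, 1)
msg2 = (0, 0, 0, 1, 0, 1)

p_merged = 1.0
for a, b in zip(msg1, msg2):
    heads = a + b                      # heads seen by this term's coin
    p_merged *= term_marginal(heads, 2 - heads)

print(round(1 / p_merged))  # -> 1458
```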

What's the probability of generating each message separately? For each message, we select six biased coins and toss them. Again, we don't know the biases. But if the bias is uniformly distributed and we select a coin and flip it once, we'll get heads half the time and tails the other half (remember, after one flip we grab a brand-new coin with an unknown bias). The probability of generating these two messages independently is therefore:

Msg1          Msg2
 0 -- 1/2      0 -- 1/2
 0 -- 1/2      0 -- 1/2
 1 -- 1/2      0 -- 1/2
 1 -- 1/2      1 -- 1/2
 0 -- 1/2      0 -- 1/2
 1 -- 1/2      1 -- 1/2

1/64 * 1/64 = 1/4096

The marginal probability of message 1 and 2 forming a cluster is:

 1/1458
-------- ≈ 2.8
 1/4096

So it is almost three times more likely that Msg1 and Msg2 came from the same cluster than from two separate clusters.

Message 0 and message 4 tell a different story.

Msg0  Msg4
  1     0   -- 1/6
  1     0   -- 1/6
  1     1   -- 1/3
  0     1   -- 1/6
  0     1   -- 1/6
  0     0   -- 1/3

1/6 * 1/6 * 1/3 * 1/6 * 1/6 * 1/3 = 1/11664

It is only about 1/3 as likely that Msg0 and Msg4 form a single cluster as it is that they were drawn from two separate clusters.
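Both comparisons can be wrapped up as an odds ratio: the merged marginal divided by the product of the separate marginals. A sketch (function names are mine; the bits of Msg0 and Msg4 are read off the table above):

```python
from math import gamma

def term_marginal(h, t):
    # uniform-prior marginal: gamma(h+1) * gamma(t+1) / gamma(h+t+2)
    return gamma(h + 1) * gamma(t + 1) / gamma(h + t + 2)

def merge_odds(msg_a, msg_b):
    """p(both messages from one shared set of coins) divided by
    p(each message from its own set of coins)."""
    p_merged = p_separate = 1.0
    for a, b in zip(msg_a, msg_b):
        p_merged *= term_marginal(a + b, 2 - (a + b))
        p_separate *= term_marginal(a, 1 - a) * term_marginal(b, 1 - b)
    return p_merged / p_separate

print(merge_odds((0, 0, 1, 1, 0, 1), (0, 0, 0, 1, 0, 1)))  # Msg1, Msg2 -> ~2.81
print(merge_odds((1, 1, 1, 0, 0, 0), (0, 0, 1, 1, 1, 0)))  # Msg0, Msg4 -> ~0.35
```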

This is the essence of Bayesian clustering, but there are two important things I have glossed over. The first is the prior.

In the cases above I assumed a uniform prior because I didn't know what the probability of a particular term was. Most of the time, however, you have some clue what the probability distribution of the term might be. With a sufficient amount of data, the prior probability tends to be less important, but when you are working with small amounts of data, it can have a large effect. A good estimate of the prior probability will greatly improve the accuracy of the clustering.

But there is a problem. We are integrating over the prior probability distribution, and we were lucky that the uniform distribution has a closed-form solution. What if we know the distribution isn't uniform? It turns out that there is a family of distributions called ‘beta’ distributions, parameterized by two values. A beta distribution is typically unimodal, and it is the conjugate prior for binomial sampling (tossing a biased coin): if you start with a beta prior, the data simply updates it to another beta distribution, so computing the effect of the data on the probability distribution is easy. It may well be that the nature of the problem implies a beta prior, but if not, it is usually a good idea to pretend that it does and find a beta distribution close enough that the math stays tractable. Otherwise you'll end up running huge numerical integrations for a long time.
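If I remember the algebra correctly, the beta prior keeps the same closed form: under a Beta(a, b) prior the per-term marginal is B(a + h, b + t) / B(a, b), and a = b = 1 recovers the uniform case. A sketch (names are mine):

```python
from math import lgamma, exp

def log_beta(a, b):
    # log of the Beta function, B(a, b) = gamma(a)*gamma(b)/gamma(a+b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def term_marginal_beta(h, t, a=1.0, b=1.0):
    """Probability of a particular h-heads, t-tails sequence under a
    Beta(a, b) prior on the coin's bias.  a = b = 1 is the uniform
    prior used above."""
    return exp(log_beta(a + h, b + t) - log_beta(a, b))

print(term_marginal_beta(2, 0))          # uniform prior          -> 1/3
print(term_marginal_beta(2, 0, 10, 10))  # prior peaked near fair -> ~0.26
```

Working in log space with `lgamma` avoids overflow when the head and tail counts get large.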

The second important thing is one of the central parts of the paper. To be strictly correct, when you consider merging two clusters you should compare the probability of the merged cluster against *all possible* subclusterings of the data (with the appropriate weights). The number of subclusterings is exponential, which would make this method of clustering hopelessly expensive. Rather than compare against all possible subclusterings, the authors suggest comparing against only those subclusterings that are `tree consistent' with the original cluster. This greatly reduces the number of comparisons, but more importantly, the weighting of the entire set of tree-consistent clusterings can be represented as a single number that is carried through the computation.
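As I understand the paper's bookkeeping, that single number is maintained by a recursion of roughly this shape. A hedged sketch under the coin model above -- the concentration parameter ALPHA, the merge rule, and all the names are my reconstruction, not the paper's code:

```python
from math import gamma

ALPHA = 1.0  # concentration parameter; 1.0 is an arbitrary assumption

def term_marginal(h, t):
    # uniform-prior marginal for one coin: gamma(h+1)*gamma(t+1)/gamma(h+t+2)
    return gamma(h + 1) * gamma(t + 1) / gamma(h + t + 2)

def cluster_marginal(msgs):
    # probability that every message came from one shared set of coins
    p = 1.0
    for bits in zip(*msgs):
        h = sum(bits)
        p *= term_marginal(h, len(bits) - h)
    return p

class Tree:
    def __init__(self, msgs, left=None, right=None):
        self.msgs, self.left, self.right = list(msgs), left, right
        n = len(self.msgs)
        if left is None:                       # leaf: a single message
            self.d = ALPHA
            self.p = cluster_marginal(self.msgs)
        else:
            # d summarizes the weight of every tree-consistent partition
            # below this node -- the "single number" carried along
            self.d = ALPHA * gamma(n) + left.d * right.d
            pi = ALPHA * gamma(n) / self.d     # prior weight on "one cluster"
            p_one = cluster_marginal(self.msgs)
            self.p = pi * p_one + (1 - pi) * left.p * right.p
            self.r = pi * p_one / self.p       # posterior prob. of merging

def merge(a, b):
    return Tree(a.msgs + b.msgs, a, b)

t = merge(Tree([(0, 0, 1, 1, 0, 1)]), Tree([(0, 0, 0, 1, 0, 1)]))
print(t.r > 0.5)  # True: merging Msg1 and Msg2 is favored
```

At each node, `d` folds the weights of all tree-consistent subclusterings into one value, so a merge costs constant work instead of an exponential enumeration.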

Next time, a bit of ranting.