The results of Heller and Ghahramani are interesting, and the technique is actually pretty clever. It is a shame that the paper is so opaque (at least to me). The algorithm is described in section 2 of the paper. Everything goes fairly smoothly until these sentences:

The first hypothesis... is that all the data in D_k were in fact generated independently and identically from the same probabilistic model p(x|theta) with unknown parameters theta. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters theta = (mu, sigma)... we need to specify some prior over the parameters of the model, p(theta|beta) with hyperparameters beta.

Four Greek letters in three sentences and no real definition of any of them. And what is a ‘hyperparameter’ anyway? The fact that these letters are used in the subsequent integral means that they really have to have values associated with them at runtime (they aren't just a conceptual aid). But where do the values come from? Thin air?

In fact, that is exactly where they come from. I suppose it would be clear to someone steeped in Bayesian statistics, but it took me a while to puzzle this out. Recall that we started with a generative model of clustering, that is, there is a set of machines that generate the data. A ‘parameter’ is a setting on a particular machine. For instance, if our method of generating a term is to flip a biased coin, the ‘parameter’ is the bias of the coin. But

*we don't know what that is*! So we model the bias itself as a probability distribution. Again, we use a generative model and we assume some machine manufactured our biased coin based on some settings. These are the ‘hyperparameters’.

So the entire model is this: A cluster is made up from a set of term generators. Each term generator is a machine that flips a biased coin in order to determine whether to generate the term. When we create a term generator, we initialize it with a biased coin we obtained from a machine that manufactures biased coins. We don't know the bias of the coin, but we

*do* know the settings of the machine that makes the coin.
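To make the model concrete, here is a small sketch in Python (the name `make_coin`, and the use of A and B for the machine's settings, are mine, not the paper's): the coin-making machine draws a bias from a beta distribution, and the term generator then flips the resulting coin.

```python
import random

def make_coin(A, B):
    # the coin-making machine: its 'settings' (the hyperparameters)
    # are A and B, and it manufactures a coin by drawing a bias
    # from a Beta(A, B) distribution
    bias = random.betavariate(A, B)

    def flip():
        # the term generator flips the manufactured coin; we never
        # observe 'bias' directly, only the heads and tails it produces
        return 'heads' if random.random() < bias else 'tails'

    return flip
```

Note that each call to `make_coin` manufactures a fresh coin with its own hidden bias; only the machine's settings are shared.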

How do we know

*those* values? We make them up. Right out of thin air. But isn't that cheating? Not exactly. Earlier I mentioned the ‘beta’ distribution that we used as a prior probability distribution. Imagine our biased coin manufacturing machine creates coins based on a beta distribution. There are two dials on the machine, A and B, and the probability distribution of the bias of the coins depends on how we set A and B. If we put A and B both at 1, then the probability distribution of the bias is flat --- we'll get coins of arbitrary bias with equal probability. If we set A to 1000 and B to 10, we'll mostly get coins that are heavily biased towards heads, but we'll get the occasional coin that is tail-heavy.
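You can watch the dials work by sampling from the machine. This little Python experiment (the helper name `sample_biases` is mine) relies on the fact that a Beta(A, B) distribution has mean A/(A+B):

```python
import random
import statistics

def sample_biases(A, B, n=10000):
    # manufacture n coins and record the bias of each one
    return [random.betavariate(A, B) for _ in range(n)]

random.seed(42)
flat = sample_biases(1, 1)        # dials at 1 and 1: biases spread evenly over [0, 1]
heady = sample_biases(1000, 10)   # dials at 1000 and 10: nearly every coin lands near bias 0.99

print(statistics.mean(flat))      # close to 1/2
print(statistics.mean(heady))     # close to 1000/1010
```

With the dials at 1000 and 10 the occasional relatively tail-heavy coin still shows up, but even the most tail-heavy of the ten thousand stays well above a bias of 0.9.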

If we are completely ignorant (or agnostic) of the bias of the term generators, we can set our coin generator ‘hyperparameters’ A and B to 1. If we know a little bit about the distribution, we can tune A and B to get better results.

Returning to the paper, the ‘hyperparameters’ beta are actually the set of variables A and B we use to tune our coin generator. The parameter ‘theta’ that we integrate over is the

*bias* itself --- we integrate across the space of possible biases, that is, the space of outputs of the coin generator.

What about mu and sigma? The paper assumed a multivariate Gaussian, not a coin flipper. Our coin flipper can be described by simply stating the bias of the coin. A multivariate Gaussian needs a mean (mu) and a covariance (sigma). So they say ‘theta = (mu, sigma)’, but we can say ‘theta = bias’.

There is a very special reason we assumed our biased coin came from a machine with a beta distribution. The beta distribution is a ‘conjugate prior’ for our coin-flipping (Bernoulli) likelihood, so the integral we need to perform has an easy closed-form solution, and we can determine the probability that the data came from a single cluster with a simple equation:

    (define (gamma x)
      ;; gamma via factorial --- valid here because we only
      ;; call it with positive integers
      (factorial (- x 1)))

    (define (bernoulli-beta A B heads tails)
      (/ (* (gamma (+ A B))
            (gamma (+ A heads))
            (gamma (+ B tails)))
         (* (gamma A)
            (gamma B)
            (gamma (+ A B heads tails)))))

This tells us the probability of seeing a particular set of coin flips given that we don't know the bias of the coin, but we

*do* have some idea about the kind of biases the coin-making machine generates. If we set A and B each to 1, indicating an even distribution of biased coins, and we observe 2 heads and no tails, we would expect to see that 1/3 of the time. On the other hand, if we set A and B each to 100, indicating a strong preference for fair coins, and we observe 2 heads and no tails, we would expect to see that only about 1/4 of the time.
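Those two numbers are easy to check. Here is a Python translation of bernoulli-beta (my translation, not from the paper) that works in log space via math.lgamma, so settings like A = B = 100 don't overflow the factorials:

```python
import math

def bernoulli_beta(A, B, heads, tails):
    # same formula as the Scheme version, computed with log-gamma
    # to keep the large intermediate gamma values from overflowing
    log_p = (math.lgamma(A + B) + math.lgamma(A + heads) + math.lgamma(B + tails)
             - math.lgamma(A) - math.lgamma(B)
             - math.lgamma(A + B + heads + tails))
    return math.exp(log_p)

print(bernoulli_beta(1, 1, 2, 0))      # 1/3: flat prior, two heads
print(bernoulli_beta(100, 100, 2, 0))  # about 0.2512: fair-coin prior, two heads
```

The second value works out exactly to (101 * 100) / (201 * 200), a little over 1/4.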

It is this value that we compute as the probability that a given data set came from our hypothetical single cluster.

Now we need to figure out the alternative --- that the data set came from more than one hypothetical cluster.

Next time....