Monday, July 28, 2008

Can I rant? (Why not, it's my blog.)

The results of Heller and Ghahramani are interesting, and the technique is actually pretty clever. It is a shame that the paper is so opaque (at least to me). The algorithm is described in section 2 of the paper. Everything goes fairly smoothly until these sentences:
The first hypothesis... is that all the data in D_k were in fact generated independently and identically from the same probabilistic model p(x|theta) with unknown parameters theta. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters theta = (mu, sigma)... we need to specify some prior over the parameters of the model, p(theta|beta) with hyperparameters beta.
Four greek letters in three sentences and no real definition of any of them. And what is a ‘hyperparameter’ anyway? The fact that these letters are used in the subsequent integral means that they really have to have values associated with them at runtime (they aren't just a conceptual aid). But where do the values come from? Thin air?

In fact, that is just exactly where they come from. I suppose it would be clear to someone steeped in Bayesian statisticts, but it took me a while to puzzle this out. Recall that we started with a generative model of clustering, that is, there is a set of machines that generate the data. A ‘parameter’ is a setting on a particular machine. For instance, if our method of generating a term is to flip a biased coin, the ‘parameter’ is the bias of the coin. But we don't know what that is! So we model the bias itself as a probability distribution. Again, we use a generative model and we assume some machine manufactured our biased coin based on some settings. These are the ‘hyperparameters’.

So the entire model is this: A cluster is made up from a set of term generators. Each term generator is a machine that flips a biased coin in order to determine whether to generate the term. When we create a term generator, we initialize it with a biased coin we obtained from a machine that manufactures biased coins. We don't know the bias of the coin, but we do know the settings of the machine that makes the coin.

How do we know those values? We make them up. Right out of thin air. But isn't that cheating? Not exactly. Earlier I mentioned the ‘beta’ distribution that we used as a prior probability distribution. Imagine our biased coin manufacturing machine creates coins based on a beta distribution. There are two dials on the machine, A and B, and the probability distribution of the bias of the coins depends on how we set A and B. If we put A and B both at 1, then the probability distribution of the bias is flat --- we'll get coins of arbitrary bias with equal probability. If we set A to 1000 and B to 10, we'll mostly get coins that are heavily biased towards heads, but we'll get the occasional coin that tail-heavy.

If we are completely ignorant (or agnostic) of the bias of the term generators, we can set our coin generator ‘hyperparameters’ A and B to 1. If we know a little bit about the distribution, we can tune A and B to get better results.

Returning to the paper, the ‘hyperparameters’ beta are actually the set of variables A and B we use to tune our coin generator. The parameter ‘theta’ that we integrate over is the probability distribution of the bias (that is, the space of outputs of the coin generator).

What about mu and sigma? The paper assumed a multivariate gaussian, not a coin flipper. Our coin flipper can be described by simply stating the bias of the coin. A multivariate gaussian needs a mean (mu) and standard deviation (sigma). So they say ‘theta = (mu, sigma)’, but we can say ‘theta = bias’.

There is a very special reason we assumed our biased coin came from machine with a beta distribution. The beta distribution is a ‘conjugate prior’ for the integral we need to perform. It has an easy closed-form solution, so we can determine the probability the data came from a single cluster by a simple equation:
(define (gamma x)
  (factorial (- x 1)))

(define (bernoulli-beta A B heads tails)
  (/ (* (gamma (+ A B))
        (gamma (+ A heads))
        (gamma (+ B tails)))
     (* (gamma A)
        (gamma B)
        (gamma (+ A B heads tails)))))
This tells us the probability of seeing a particular set of coin flips given that we don't know the bias of the coin, but we do have some idea about the kind of biases the coin making machine generates. If we set A and B each to 1, indicating an even distribution of biased coins, and we observe 2 heads and no tails, we would expect to see that 1/3 of the time. On the other hand, if we set A and B each to 100, indicating a high preference for fair coins to be generated, and we observe 2 heads, we see that we would expect that 1/4 of the time.

It is this value that we compute as the probability that a particular given data set came from our hypothetical single cluster.

Now we need to figure out the alternative --- that the data set came from more than one hypothetical cluster.

Next time....