Thursday, July 31, 2008

In the last post, I finally figured out what the equations were saying in the Heller and Ghahramani paper. I coded it up and got a bit of life out of the program, but there are still two more problems. The first was obvious. In the naive implementation, you can cluster about 20 items in a reasonable amount of time. This is a little short of the 50,000 items I wanted to run thruogh the program.

The most immediate problem turned out to be bignums. A quick inspection of the program shows a lot of calls to the gamma function. Gamma (14) is larger than 2^32, so even relatively tiny clusters will start using bignum arithmetic. The product of several calls to gamma will be bigger still, and unless you are very lucky, the ratio of products of gamma will give you rational numbers with bignum numerators and denominators. The answers are perectly exact, if you are willing to wait for them.

The obvious answer is floating point. This gives a definite improvement in speed and space and I could cluster close to 170 items in a reasonable amonut of time. But the gamma function once again starts causing problems. Gamma (172) is too large to be represented as a (double-precision) floating point number.

If you've been following this blog for the past few weeks, you probably know what the solution is. We convert the program to work in log-odds space. This is a tiny bit tricky, but we can use the initial naive implementation as a reference to make sure we don't screw up. Recall that the marginal probability of clustering is given by function R. We simply want to compute log-odds of R.
```
(define (probability->log-odds p)
(10log10 (probability->odds p)))

(define (probability->odds p)
(/ p (- 1 p)))
```

.
Rather than use the natural log, I prefer to use 10log10. This gives me the odds in decibels (not as nutty as you might think!). I ran a few test problems through the naive solution and then computed the log-odds of the answer so I could check that any new code computed the same answer.

The first thing to do is to convert the program to odds space. Here's how we go about it. We start with our definition of R.
```
(define (r cluster)
(/ (* (pi cluster) (ph1 cluster))
(pd cluster)))
```

.
And combine it with our definition of probability->odds:
```
(define (r-odds cluster)
(probability->odds
(/ (* (pi cluster) (ph1 cluster))
(pd cluster))))
```

.
(That was trivial.) Now we inline the definition of probability->odds.
```
(define (r-odds cluster)
((lambda (p)
(/ p (- 1 p)))
(/ (* (pi cluster) (ph1 cluster))
(pd cluster))))
```

.
And beta reduce.
```
(define (r-odds cluster)
(/ (/ (* (pi cluster) (ph1 cluster))
(pd cluster))
(- 1 (/ (* (pi cluster) (ph1 cluster))
(pd cluster)))))
```

.
That's uglier. But we can fiddle around with this a bit. First add some temporaries to help unclutter the code.
```
(define (r-odds cluster)
(let ((t0 (ph1 cluster))
(t1 (pi cluster))
(t2 (pd cluster)))
(/ (/ (* t1 t0) t2)
(- 1 (/ (* t1 t0) t2)))))
```

.
There are some common terms between pi and pd that look like they might cancel, so let's substitute in pd.
```
(define (r-odds cluster)
(let ((t0 (ph1 cluster))
(t1 (pi cluster))
(t2 (if (leaf-node? cluster)
(ph1 cluster)
(let ((pi_k (pi cluster)))
(+ (*      pi_k  (ph1 cluster))
(* (- 1 pi_k) (ph2 cluster)))))))
(/ (/ (* t1 t0) t2)
(- 1 (/ (* t1 t0) t2)))))
```

.
And remove the redundant computations:
```
(define (r-odds cluster)
(let* ((t0 (ph1 cluster))
(t1 (pi cluster))
(t2 (if (leaf-node? cluster)
t0
(+ (*      t1  t0)
(* (- 1 t1) (ph2 cluster))))))
(/ (/ (* t1 t0) t2)
(- 1 (/ (* t1 t0) t2)))))
```

.
And substitute in the definition of pi:
```
(define (r-odds cluster)
(let* ((t0 (ph1 cluster))
(t1 (if (leaf-node? cluster)
1
(/ (* *alpha* (gamma (n cluster)))
(d cluster))))
(t2 (if (leaf-node? cluster)
t0
(+ (*      t1  t0)
(* (- 1 t1) (ph2 cluster))))))
(/ (/ (* t1 t0) t2)
(- 1 (/ (* t1 t0) t2)))))
```

.
And the definition of d.
```
(define (r-odds cluster)
(let* ((t0 (ph1 cluster))
(t1 (if (leaf-node? cluster)
1
(/ (* *alpha* (gamma (n cluster)))
(if (leaf-node? cluster)
*alpha*               ;; tuning parameter
(+ (* *alpha* (gamma (n cluster)))
(* (d (left-child cluster))
(d (right-child cluster))))))))
(t2 (if (leaf-node? cluster)
t0
(+ (*      t1  t0)
(* (- 1 t1) (ph2 cluster))))))
(/ (/ (* t1 t0) t2)
(- 1 (/ (* t1 t0) t2)))))
```

.
The probability of R on a leaf node is 1, so the odds of R on a leaf node would be infinite. But we never call R on a leaf node, so we can simply ignore that branch of the conditionals.
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t1 (/ (* *alpha* (gamma (n cluster)))
(+ (* *alpha* (gamma (n cluster)))
(* (d (left-child cluster))
(d (right-child cluster))))))
(t2 (+ (*      t1  t0)
(* (- 1 t1) (ph2 cluster)))))
(/ (/ (* t1 t0) t2)
(- 1 (/ (* t1 t0) t2))))))
```

.
Let's pull out some common terms:
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(t4 (* t1 t0))
(t2 (+ t4
(* (- 1 t1) (ph2 cluster)))))
(/ (/ t4 t2)
(- 1 (/ t4 t2))))))

a/b           a
And use the identity --------- = -----------
1 - a/b       b - a

(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(t4 (* t1 t0))
(t2 (+ t4
(* (- 1 t1) (ph2 cluster)))))
(/ t4
(- t2 t4)))))
```

.
Simplify the denominator:
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(t4 (* t1 t0)))
(/ t4
(* (- 1 t1) (ph2 cluster))))))
```

.
Factor out t3:
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(t4 (* t1 t0)))
(/ t4
(* (- 1 t1) (ph2 cluster))))))
```

.
Distribute the denominator:
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(t4 (* t1 t0)))
(/ t4
(- (ph2 cluster) (* t1 (ph2 cluster)))))))
```

.
Divide top and bottom by t1
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(t4 (* t1 t0)))
(/ (/ t4 t1)
(- (/ (ph2 cluster) t1) (/ (* t1 (ph2 cluster)) t1))))))

```

.
Simplify
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster))))
(t1 (/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster)))))))
(/ t0
(- (/ (ph2 cluster) t1)  (ph2 cluster))))))

```

.
Substitute t1
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster)))))
(/ t0
(- (/ (ph2 cluster)
(/ t3
(+ t3
(* (d (left-child cluster))
(d (right-child cluster))))))
(ph2 cluster))))))

```

.
Simplify
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster)))))
(/ t0
(- (* (ph2 cluster)
(/ (+ t3
(* (d (left-child cluster))
(d (right-child cluster))))
t3))
(ph2 cluster))))))

```

.
Factor out ph2
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster)))))
(/ t0
(* (ph2 cluster) (- (* 1
(/ (+ t3
(* (d (left-child cluster))
(d (right-child cluster))))
t3))
1))))))

```

.
Multiply top and bottom by t3
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster)))))
(/ (* t3 t0)
(* t3 (ph2 cluster) (- (* 1
(/ (+ t3
(* (d (left-child cluster))
(d (right-child cluster))))
t3))
1))))))

```

.
Simplify
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster)))))
(/ (* t3 t0)
(* (ph2 cluster) (* (d (left-child cluster))
(d (right-child cluster))))))))

```

.
Now substitute ph2:
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(let* ((t0 (ph1 cluster))
(t3 (* *alpha* (gamma (n cluster)))))
(/ (* t3 t0)
(* (pd (left-child cluster))
(pd (right-child cluster))
(d (left-child cluster))
(d (right-child cluster)))))))

```

.
And combine pd and d into a new function:
```
(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(/ (* *alpha* (gamma (n cluster)) (ph1 cluster))
(* (px (left-child cluster))
(px (right-child cluster))))))

(define (px cluster)
(* (pd cluster) (d cluster)))

```

.
Using similar rewriting, we can define px more simply as this:
```
(define (px cluster)
(if (leaf-node? cluster)
(* *alpha* (gamma (n cluster)) (ph1 cluster))
(+ (* *alpha* (gamma (n cluster)) (ph1 cluster))
(* (px (left-child cluster))
(px (right-child cluster))))))

```

.
If we define ph1a as
```
(define (ph1a cluster)
(* *alpha* (gamma (n cluster)) (ph1 cluster)))

```

.
We can further simplify:
```
(define (px cluster)
(if (leaf-node? cluster)
(ph1a cluster)
(+ (ph1a cluster)
(* (px (left-child cluster))
(px (right-child cluster))))))

(define (r-odds cluster)
(if (leaf-node? cluster)
(error "Odds are infinite.")
(/ (ph1a cluster)
(* (px (left-child cluster))
(px (right-child cluster))))))

```

.
Whew! That was work. Fortunately, we can compare the output from the naive version with the output of this version to see if the answers are the same.

Next time, we'll take the logarithm of this code.

Tuesday, July 29, 2008

...continued. I forgot something in the last post. I wanted to illustrate this process in code and I didn't write the code for the first equation. Here it is:
```
(define (ph1 cluster)
(let ((data-points (cluster-data-points cluster)))
(product (lambda (term)
(let ((present (count-occurrances term data-points))
(absent (- (length data-points) present))
(bernoulli-beta A B present absent))))
all-terms)))
```

.
We assume that clusters contain a number of data points. For each term that could appear in a data point, we count how many times it actually appears and how many times it is absent. We compute the probability of the observed term given the model parameters A and B, and we take the product across all the terms.

Why Lisp? Or in this case, why Scheme? Lisp is the best language for doing something when you don't know what you are doing. There are three features I'm depending upon here.
1. Lisp and Scheme dispense with most of the minutae and bookkeeping involved in making the code run. I don't have to declare anything before I can use it, I can refer to routines I haven't written yet, I can run partially written code.
2. Self-descriptive data. All data in Lisp and Scheme have explicit tags. This makes it easy to poke at data structures and the partial results of computations to find out what it looks like and see if things sort of make sense.
3. Mixing compiled and interpreted code. Once I know something is working write, I can compile the know good code and make it fast. Or I can reload the interpreted code if I need to re-examine what's going on. Interpreted code and compiled code mix seamlessly. MIT Scheme has a powerful debugger that keeps track of what source code corresponds to compiled sections, and what stack positions correspond to what variables so it is easy to figure out what is going on. The MIT Scheme compiler generates very fast code and the sophisticated linking system can cache variable and function references so most function calls can avoid indirect calls through a function cell. (Did I mention it was fast?)

The next few paragraphs of Heller and Ghahramani discuss how to determine the alternative to the hypothesis that a selection of data points belong to a cluster. Rather than enumerating all the possible partitions of the data, however, they give an efficient closed-form equation for the marginal probability that the data belongs to a ‘tree consistent’ subset of the partitions. This is a key point to the paper. Unfortunately, this key point is a bit obscure and it took me a while to puzzle it out.

The equation is recursive and the authors seem to think this makes it more difficult to understand. They break the equation up into several parts, but this only makes it harder to put the pieces back together. First they present this equation:
```
p(D_k|H2_k) = p(D_i|T_i) * p(D_j|T_j)
```

.
Huh? This seems easy, we're just multiplying something by something else, but where do the multiplicands come from? That's defined later. Ok, we'll wait a bit. Now the authors define a new variable:
```
pi_k = p(H1_k)
```

.
Which is basically the prior probability of the first hypothesis without taking into account the data.

Finally, the authors present the equation we're waiting for:
```
p(D_k|T_k) = pi_k p(D_k|H1_k) + (1 - pi_k) p(D_i|T_i) p(D_j|T_j)
```

.
“Ah”, you say, “this makes it all clear!” No? I didn't either. Here's part of the problem. By putting this in mathematical form, we've quite effectively hidden what's being recursively defined here. Furthermore, there are still some free variables floating around, and one of them is simply defined as yet a different one! (Note to authors, renaming one free variable as another does not help.) Let me rewrite this.

First, assume that a ‘cluster’ is either a ‘tree’ with a left and right subcluster, or it is a leaf consisting of a single data point. If it is a tree, then these are the equations:
```
(define (ph2 cluster)
(* (pd (cluster-left-child cluster))
(pd (cluster-right-child cluster))))

(define (pd cluster)
"Compute the probability of the observed data given the tree
structure of the cluster."
(let ((pi_k (pi cluster)))
;; pi_k is in the range [0,1]
;; It determines how much weight we give to each
;; alternative.
(+ (*      pi_k  (ph1 cluster))
(* (- 1 pi_k) (ph2 cluster)))))
```

.
So we're recursively descending a tree structure! It turns out that the functions we're computing and the arguments we're using are cleverly disguised as subscripts.

There is a catch in the code above. The recursion in `pd` isn't well grounded. What happens when we hit the leaves of the tree? We'll find out in a moment.

What about this `pi` function? It isn't defined it yet. In fact, it isn't defined until two pages later in the middle of the proof about the validity of using tree-consistent partitioning. Well, better late than never. The authors define `pi_k` through an inductive algorithm:
```
initialize each leaf i to have d_i = alpha, pi_i = 1

for each internal node k do
d_k = alpha * gamma (n_k) + d_leftk * d_rightk

alpha * gamma (n_k)
pi_k = ---------------------
d_k

end for
```

.
Where `n_k` is the number of data points at internal node `k` and `alpha` is a tuning parameter for the whole algorithm.

I'll forgive the sudden introduction of the ‘left’ and ‘right’ subscripts (why now, when ‘i’ and ‘j’ were used before?), and at least they explain the free variables. The big problem is that this since this algorithm is inductive (that is, bottom up) and the equations above are co-inductive (top down), you can't just plug them together.

Fortunately, it isn't hard to recast this algorithm as a set of trivial recursive functions:
```
(define (n cluster)
(if (leaf-node? cluster)
1
(+ (n (left-child cluster))
(n (right-child cluster)))))

(define (d cluster)
(if (leaf-node? cluster)
*alpha*               ;; tuning parameter
(+ (* *alpha* (gamma (n cluster)))
(* (d (left-child cluster))
(d (right-child cluster))))))

(define (pi cluster)
(if (leaf-node? cluster)
1
(/ (* *alpha* (gamma (n cluster)))
(d cluster))))
```

.
Now we can see how to fix the recursion in `pd`. If we are at a leaf node the value of `pi` is 1 and the sum
```
(+ (*      pi_k  (ph1 cluster))
(* (- 1 pi_k) (ph2 cluster)))
```

.
becomes
```
(+ (* 1 (ph1 cluster))
(* 0 (ph2 cluster)))
```

.
So we can simply fix `pd` to avoid the call to `ph2` in the base case:
```
(define (pd cluster)
"Compute the probability of the observed data given the tree
structure of the cluster."
(if (leaf-node? cluster)
(ph1 cluster)
(let ((pi_k (pi cluster)))
(+ (*      pi_k  (ph1 cluster))
(* (- 1 pi_k) (ph2 cluster))))))

```

.
Now we can jump back a page in the paper to where they define `r_k`. This isn't as well disguised:
```

pi_k * p(D_k|H1_k)
r_k = -----------------------------------------------------------
pi_k * p(D_k|H1_k) + (1 - pi_k) * p(D_i|T_i) * p(D_j|T_j)

```

.
We can rewrite this simply as:
```
(define (r cluster)
(/ (* (pi cluster) (ph1 cluster))
(pd cluster)))

```

.
Here again are the core components:
```
(define (r cluster)
"The marginal probability of this cluster given the data.
We use this value to determine which clusters to merge."
(/ (* (pi cluster) (ph1 cluster))
(pd cluster)))

(define (pi cluster)
"Recursively defined weight of merging vs. non-merging hypotheses."
(if (leaf-node? cluster)
1
(/ (* *alpha* (gamma (n cluster)))
(d cluster))))

(define (pd cluster)
"Compute the probability of the observed data given the tree
structure of the cluster."
(if (leaf-node? cluster)
(ph1 cluster)
(let ((pi_k (pi cluster)))
(+ (*      pi_k  (ph1 cluster))
(* (- 1 pi_k) (ph2 cluster))))))

(define (ph1 cluster)
;; Hypothesis 1.  Data is generated uniformly in cluster.

;; User defined depending on the generative model.
;; Possible definition for bernoulli-beta model:

(let ((data-points (cluster-data-points cluster)))
(product (lambda (term)
(let ((present (count-occurrances term data-points))
(absent (- (length data-points) present))
(bernoulli-beta A B present absent))))
all-terms))

;; Other models may be appropriate.
)

(define (ph2 cluster)
;; Hypothesis 2.  Data is generated from two or more of
;;  the tree consistent subclusters.
(* (pd (cluster-left-child cluster))
(pd (cluster-right-child cluster))))

(define (n cluster)
(if (leaf-node? cluster)
1
(+ (n (left-child cluster))
(n (right-child cluster)))))

(define (d cluster)
(if (leaf-node? cluster)
*alpha*               ;; tuning parameter
(+ (* *alpha* (gamma (n cluster)))
(* (d (left-child cluster))
(d (right-child cluster))))))
```

.
The rest of the process is pretty simple. We start with only leaf nodes and we examine them pair by pair. We create a ‘test cluster’ out of each pair of leaf nodes and compute the value of `r` for the test cluster. After examining all pairs, we keep the cluster with the highest `r` and discard the leaves that it contains. We can either continue this process until we have one cluster with everything in it or until every test cluster gives us an `r` value of .5 or less (in which case we'll end up with a set of disjoint clusters).

Some ranting.

It took me weeks to tease apart the equations in the paper and determine what is really going on. My claim is that the code I wrote above should be easily understood by most practiced coders, and even those that can't understand the code should be able to type it in (transliterated to the computer language of their choice), write the support code, and run it to see what it does.

The code is exactly as mathematical as the equations in the paper, but it is much clearer about what is being defined, what is being operated upon, and what results are being generated.

Here are my gripes:
1. The equations are unclear for several reasons. The values computed are encoded in the subscripts, the mutually recursive definitions are spread across three different pages, the free variables aren't defined.
2. The one place where values are computed via an algorithm is using the opposite form of induction of the rest of the paper.
3. The pseudo-code can't run. Choose a good functional language like lisp and the pseudo-code will be actual code.
4. The pseudo-code is gratuitously ‘bash and smash’ imperative code. The values computed are naturally functionally inductive over the tree structure, the code should be that way too.
5. The recursive equations are presented without a base case, and the base case is hidden in the pseudo-code (with the wrong induction so we can't easily see that it grounds the recursion!)
Enough for today.

I got this part of the code working several weeks ago, but I ran into some more problems that are beyond the scope of the paper, but are interesting nonetheless. I'll talk about them soon.

Monday, July 28, 2008

Can I rant? (Why not, it's my blog.)

The results of Heller and Ghahramani are interesting, and the technique is actually pretty clever. It is a shame that the paper is so opaque (at least to me). The algorithm is described in section 2 of the paper. Everything goes fairly smoothly until these sentences:
The first hypothesis... is that all the data in D_k were in fact generated independently and identically from the same probabilistic model p(x|theta) with unknown parameters theta. Let us imagine that this probabilistic model is a multivariate Gaussian, with parameters theta = (mu, sigma)... we need to specify some prior over the parameters of the model, p(theta|beta) with hyperparameters beta.
Four greek letters in three sentences and no real definition of any of them. And what is a ‘hyperparameter’ anyway? The fact that these letters are used in the subsequent integral means that they really have to have values associated with them at runtime (they aren't just a conceptual aid). But where do the values come from? Thin air?

In fact, that is just exactly where they come from. I suppose it would be clear to someone steeped in Bayesian statisticts, but it took me a while to puzzle this out. Recall that we started with a generative model of clustering, that is, there is a set of machines that generate the data. A ‘parameter’ is a setting on a particular machine. For instance, if our method of generating a term is to flip a biased coin, the ‘parameter’ is the bias of the coin. But we don't know what that is! So we model the bias itself as a probability distribution. Again, we use a generative model and we assume some machine manufactured our biased coin based on some settings. These are the ‘hyperparameters’.

So the entire model is this: A cluster is made up from a set of term generators. Each term generator is a machine that flips a biased coin in order to determine whether to generate the term. When we create a term generator, we initialize it with a biased coin we obtained from a machine that manufactures biased coins. We don't know the bias of the coin, but we do know the settings of the machine that makes the coin.

How do we know those values? We make them up. Right out of thin air. But isn't that cheating? Not exactly. Earlier I mentioned the ‘beta’ distribution that we used as a prior probability distribution. Imagine our biased coin manufacturing machine creates coins based on a beta distribution. There are two dials on the machine, A and B, and the probability distribution of the bias of the coins depends on how we set A and B. If we put A and B both at 1, then the probability distribution of the bias is flat --- we'll get coins of arbitrary bias with equal probability. If we set A to 1000 and B to 10, we'll mostly get coins that are heavily biased towards heads, but we'll get the occasional coin that tail-heavy.

If we are completely ignorant (or agnostic) of the bias of the term generators, we can set our coin generator ‘hyperparameters’ A and B to 1. If we know a little bit about the distribution, we can tune A and B to get better results.

Returning to the paper, the ‘hyperparameters’ beta are actually the set of variables A and B we use to tune our coin generator. The parameter ‘theta’ that we integrate over is the probability distribution of the bias (that is, the space of outputs of the coin generator).

What about mu and sigma? The paper assumed a multivariate gaussian, not a coin flipper. Our coin flipper can be described by simply stating the bias of the coin. A multivariate gaussian needs a mean (mu) and standard deviation (sigma). So they say ‘theta = (mu, sigma)’, but we can say ‘theta = bias’.

There is a very special reason we assumed our biased coin came from machine with a beta distribution. The beta distribution is a ‘conjugate prior’ for the integral we need to perform. It has an easy closed-form solution, so we can determine the probability the data came from a single cluster by a simple equation:
```(define (gamma x)
(factorial (- x 1)))

(define (bernoulli-beta A B heads tails)
(/ (* (gamma (+ A B))
(gamma (+ A heads))
(gamma (+ B tails)))
(* (gamma A)
(gamma B)
(gamma (+ A B heads tails)))))
```
.
This tells us the probability of seeing a particular set of coin flips given that we don't know the bias of the coin, but we do have some idea about the kind of biases the coin making machine generates. If we set A and B each to 1, indicating an even distribution of biased coins, and we observe 2 heads and no tails, we would expect to see that 1/3 of the time. On the other hand, if we set A and B each to 100, indicating a high preference for fair coins to be generated, and we observe 2 heads, we see that we would expect that 1/4 of the time.

It is this value that we compute as the probability that a particular given data set came from our hypothetical single cluster.

Now we need to figure out the alternative --- that the data set came from more than one hypothetical cluster.

Next time....
... somewhat continued.

Our first hypothesis is that we have a single generator for the data points. Since each term is generated independently, it is as if we had a biased coin for each flip. We don't know the bias, so we choose a uniform prior probability distribution (any bias from 0 to 1 is equally likely). Fortunately, the integral over this distribution has a closed form solution:
```
gamma(h + t)
------------------
gamma(h + t + 2)
```
.
(Gamma is the gamma function.)

If we select a biased coin at random and toss it twice, we will expect to see two heads about 1/3 of the time. We'd expect 2 tails one third of the time, too. Heads first and tails second comes in 1/6 of the time, tails first heads second the other sixth. (Note: The idea is that each time we do this we have a fresh coin with an unknow bias and give it two flips. It is notto say that if we take a coin with an unknown bias that we will get two heads 1/3 of the time on every pair of flips!)

If we generate messages one and two by selecting a biased coin and flipping it twice, and repeating this six times for each term in the message, we find the following probability of the data being generated by a single cluster with unknown biases:
```  Msg1:  #( 0 0 1 1 0 1 )   Msg2:  #( 0 0 0 1 0 1 )

Msg1   Msg2
0     0  -- 1/3
0     0  -- 1/3
1     0  -- 1/6
1     1  -- 1/3
0     0  -- 1/3
1     1  -- 1/3

1 / 1458
```
.
One chance in 1458. That's the probabilty of generating these exact two messages in this order from a set of six coins with a randomly distributed bias. Not much of a chance, but we aren't interested in the absolute chance, we are interested in the marginal probability.

What's the probability of generating each message separately? For each message, we select six biased coins and toss them. Again, we don't know the bias. But if the bias is evenly distributed and we select a coin and flip it, we'll get heads half the time and tails the other half (remember, after one flip we grab a brand new biased coin with unknow distribution). The probability of getting these two messages is therefore:
```
Msg1             Msg2
0 -- 1/2         0  -- 1/2
0 -- 1/2         0  -- 1/2
1 -- 1/2         0  -- 1/2
1 -- 1/2         1  -- 1/2
0 -- 1/2         0  -- 1/2
1 -- 1/2         1  -- 1/2

1/64       *     1/64  1/4096
```
.
The marginal probability of message 1 and 2 forming a cluster is:
```           1/1458
--------  = 2.8
1/4096
```
.
so it is almost 3 times more likely that Msg1 and Msg2 are from the same cluster.

Message 0 and message 4 have a different story.
```
Msg0   Msg4
1     0  -- 1/6
1     0  -- 1/6
1     1  -- 1/3
0     1  -- 1/6
0     1  -- 1/6
0     0  -- 1/3

1/11664
```
.
It is 1/3 as likely that Msg0 and Msg4 form a cluster than having been drawn from two clusters.

This is the essence of Bayesian clustering, but there are two important things I have glossed over. The first is the prior.

In the cases above I assumed a uniform prior where I didn't know what the probability of a particular term was. Most of the time, however, you have some clue what the probability distribution of the term might be. With a sufficient amount of data, the prior probability tends to be less important, but when you are working with small amounts of data, it can have a large effect. A good estimate of the prior probability will greatly improve the accuracy of the clustering.

But there is a problem. We are integrating over the prior probability distribution. We are lucky in that there is a closed form solution for a uniform distribution. What if we know the distribution isn't uniform? It turns out that there is a family of distributions called ‘beta’ distributions that are parameterized by two values. The ‘beta’ family of distributions is basically a unimodal distribution and it is the distribution you get from a biased binomial selection (tossing a biased coin). If you use a beta prior, it is easy to compute the effect of data on the probability distribution. It may be quite likely that the nature of the problem implies a beta prior, but if not, it is usually a good idea to pretend it does and find a beta distribution close enough so the math is tractable. Otherwise you'll end up running huge numeric integrations for a long time.

The second important thing is one of the central parts of the paper. To be correct, when you consider merging two clusters you should compare the probability of merging against all possible subclusterings of the data (with the appropriate weights). This turns out to be exponential in time, and would make this method of clustering hopelessly expensive. Rather than compare against all possible subclusterings, the authors suggest comparing against only those subclusterings that are `tree consistent' with the original cluster. This greatly reduces the number of comparisons, but more importantly, the weighting of the entire set of tree consistent clusterings can be represented as a single number that is carried through the computation.

Next time, a bit of ranting.

Friday, July 25, 2008

I had intended to write about clustering, but I went off on a tangent about Bayesian probability.

Spectral clustering was disappointing. I learned a bunch about linear algebra, so it wasn't a total waste of time, but the results just didn't appear useful. There are some obvious similarities in the data that you can see just by looking at it. I expect the clustering algorithm to find these and perhaps some I didn't see.

I'm currently exploring Bayesian Hierarchical Clustering. I started with the paper by Heller and Ghahramani. The paper seemed like a reasonable approach and I always wanted to learn more about Bayesian statistics.

There are two general approaches to clustering. ‘Top down’ works by initially putting everything in one cluster, then dividing that cluster into two smaller ones. You then recursively apply the algorithm to the resulting clusters. ‘Bottom up’ works by initially putting each item into its own cluster then selecting clusters that are similar and merging them into larger clusters.

Bottom up has the problem that you have an enormous number of initial clusters and finding candidates to merge is an O(n^2) problem. This puts a limit on how much data you can handle. I want to merge tens of thousands of items, so this doesn't seem attractive.

Top down is very appealing because it appears to be an O(log n) process. However, looks are deceiving. There are about 2^n ways to divide a set into two parts, so it is impossible to do a brute-force search for the ‘best’ division. Furthermore, it is a lot easier to describe how two items are similar (for a bottom-up merge) than it is to describe how mixtures from multiple sets are different.

Bottom-up merging usually works by having some sort of similarity metric between clusters. At each step you find some likely candidates to merge and score the candidates merges by similarity. You then merge the pair with the best score and repeat the process.

Bayesian Hierarchical Clustering is a slight variation on this. But instead of creating a similarity metric between clusters, we compute the probability that the two clusters belong together. At each step we find the pair of clusters most likely to belong together and merge them. This has a few advantages: it is resistant to ‘noise’ and ‘outliers’, and the probabilistic component allows us to model our uncertainty (i.e. ignorance) about the data.

This last point is important. It isn't so much *our* ignorance we want to model, it is the *computer's* ignorance. Suppose we are clustering a set of email messages that contain a certain amount of spam. We don't expect the computer to have perfect knowledge of what constitutes spam, or how spam is created and distributed, or why there is spam in the set of message to cluster. We want the computer to *discover* or *notice* that a bunch of these message are all very similar and should therefore be grouped together.

So what's a cluster?

If we're going to group data together, we need to have sort of idea what it means to be part of the group. A generative model is often appropriate. We assume that there exist a number of machines that generate our data samples. An unknown number of data points are taken from each machine and they are all mixed together to form our final data set. Our goal in clustering is to find a plausible set of generators and determine which data points they generated. Our generative model doesn't need to be accurate, or even remotely realistic. It doesn't matter that the data isn't generated the way we model it, it only matters that the data points are grouped in a useful way when we are done.

I've always preferred learning from concrete examples, so let's make one. I'll borrow the examples from Dave Madigan's ‘Statistics and the War on Spam’. He has four email messages that make up his test corpora, and one additional email to classify. Rather than train a Bayesian classifier on this corpus, I want to perform a hierarchical clustering of all the messages.

As in the paper, we transform the messages into binary term vectors that indicate the presence or absence of terms.
```
Msg0: #( 1 1 1 0 0 0 ) X_0 = brown X_1 = fox
Msg1: #( 0 0 1 1 0 1 ) X_2 = quick X_3 = rabbit
Msg2: #( 0 0 0 1 0 1 ) X_4 = rest X_5 = run
Msg3: #( 0 0 0 1 1 0 )
Msg4: #( 0 0 1 1 1 0 )
```
.
What we're going to do is consider all the pairs of messages and ask ourselves whether we think it is more likely that they came from a cluster or more likely they didn't. How on earth do we do that?!

Returning to the clustering problem, I have some data (D) and a cluster model (M) and given the model it is easy to determine the probability of my model generating the observed data. Here's where Bayes theorem comes into the picture. We have the probability of the data conditional upon the model, p(D|M). We use Bayes theorem to obtain the probability of the model conditional upon the data p(M|D).

We're going to apply one more trick. Instead of using this probability directly, we are going to compare two different models to determine which of the two is more likely. We'll compute the probability ratio between the two models to see which one is favored. The ratio of probabilities is called the ‘marginal’ probability. The marginal probability can in some cases be easier to compute than the two probabilities that it is derived from. This is because the normalizing term (the a-priori probability of the data) is common between the two and cancels out in the ratio.

So we start by putting each message in its own cluster. Then we consider all the pairs that can be formed from the clusters to find the pair that is most likely to be combined. Let's suppose we are checking message one against message two.

We're going to make the completely unwarranted assumption that the words in each message are independently generated. That is, our generative model is that every cluster has a set of biased coins, one coin for each word. To generate a message, each coin in the cluster is flipped and the selected word is included if the coin comes up heads, but omitted if it comes up tails. This model is clearly deficient because it ignores word order and word interdependence in sentences, yet it seems to be a pretty good model nonetheless. By assuming independence, we greatly simplify the probability calculations, which are already difficult.

We're going to compare the probability of two hypothesis:
1. Both messages were generated by a single set of biased tosses.
2. Each message was generated by its own set of biased tosses.
We don't know the biases, so we have to integrate over all possible biases. Since we make the assumption of independence, we can take the product of integrals for each term rather than dealing with a truly nasty polynomial.

To be continued....

Monday, July 21, 2008

A Slightly Less Simple Answer

If you haven't read the problem in the previous post, I invite you to do so before reading the solution here.

We have a coin with an unknown bias and we have flipped it 41 times and observed it come up heads on 29 of those flips. In order to predict whether it will come up heads on the next two tosses, we need to estimate the bias. The standard approach to statistics tells us that the sample mean should approximate the true mean. The sample mean is about 0.7073, and if we toss a coin with that bias twice, we will have a 0.7073^2 = .5003 chance of getting two heads. If you were given even odds on this chance, you should take it because you'd (eventually) make money.

The traditional method is simple, but wrong. It doesn't matter what your philosophy of statistics is, it is simply the case that if you write a program to simulate the problem, it will not converge to this solution.

What did we do wrong? We estimated the bias by integrating over the sample space. (Since the sample space is discrete, it is a sum.) We should have estimated the bias by integrating over the model space. We don't know the bias of the coin. It could be about .7073 and have generated 29 heads out of 41 flips, but it also could have been a completely fair coin (.5) and we were just lucky to have a few more heads than we'd expect. It could have been a very unfair coin (.9) and we were unlucky and got a few fewer heads than expected. There is a continuum of possibilities, and we want to take them all into account. To do this, we have to consider the probability of generating the observed data for all possible biases and integrate over them. The probability of the bias having a value of theta given the observed data is
```              (n + 1)!
p(theta|D) = ------------ * theta^r * (1 - theta)^(n - h)
h!(n - h)!
```
where h is the number heads and n is the number of flips.

Our best estimate of the bias is
```            theta=1                               h + 1
bias = integral   theta * p(theta|D) dtheta = ---------
theta=0                                 n + 2
```
.
For our example, after observing 29 heads in 41 flips, we should estimate the bias to be (29 + 1)/(41 + 2) = .6977 (Intuitively, there are more ways to bias a coin less than 29/41 than there are ways to bias it higher, and each of the extra ways has a possibility of being lucky and generating our observed data.)

Ah, so we estimate our bias at .6977 and square that giving 0.4867 as the probability of two heads. Nope. Close.

Once we have flipped the coin and gotten our first head, we have a new situation. We have now observed 42 flips and gotten 30 heads. This changes our estimate of the bias, which is now (30 + 1)/(42 + 2) = 0.7045 Therefore the probability of getting two heads in a row is (29 + 1)/(41 + 2) * (30 + 1)/(42 + 2) = 0.4915

Are we insane?!! How could the bias change between the flips?! How could the result of the first flip influence the probability we compute before we flip the coin?!! Remember, though, Bayesian probability is not based on physics, it is based on information theory. The coin's bias doesn't change at all throughout the experiment. Our knowledge about the bias changes every time we determine an outcome. We don't know what the result of the first flip will be, but we do know that it must be heads if we are to win the bet. Therefore, if we are lucky, we can infer the higher bias. Of course if we are unlucky we can infer the lower bias, but since we'll be out ten bucks, we don't care to estimate how much lower.

It is easy to write a small program to model the problem. It probably won't be able to run long enough to converge to the correct solution for 41 flips, but it will easily handle small test cases. For example, if you select a bias at random, flip it three times and get 2 heads out of three flips, the probability of the next two flips being heads should be (2 + 1)/(3 + 2) * (3 + 1)/(4 + 2) = 2/3

The original problem was at Panos Ipeirotis's Blog.

A simple puzzle

As you're walking along the street in front of Honest Jake's Biased Coin Factory, you find a quarter on the sidewalk. You suspect it is biased, so you flip it 41 times and observe that it comes up heads 29 times. A bystander notices your interest in probability and offers \$10 even money that your next two flips are not both heads. Do you take the bet?

Clarifications and assumptions:
• Assume the coin is from Honest Jake's factory.
• Honest Jake guarantees that every flip of their coins is independent.
• Honest Jake makes coins with any bias and has no preference for any particular bias. His motto is ‘My coins are biased, I'm not.’
• The bystander pays you \$10 if the next two flips are heads, otherwise you pay the bystander \$10.
• You will be willing to take the bet if you expect to gain money and unwilling if you expect to lose money.
I found this problem on the web (I'll credit it later). The problem was intended to show the difference between Bayesian and traditional reasoning, but the original poster made a slight error and got the wrong result. This problem has a very subtle point to it. I'll post the answer in a few hours.

Friday, July 18, 2008

Bayes theorem is trivially derived from this identity:
```
P(A&B) = P(A|B)P(B) = P(B|A)P(A)
```

.
The probability of A and B both being true is the probability of A being true if B is assumed times the probability of B, or, alternatively, the probability of B being true if A is assumed times the probability of A.

Now we just move a term:
```
P(B|A) P(A)
P(A|B) = ----------------
P(B)
```

.
There's no magic here, it's just the simple algebra you learned in high school. This result is called ‘Bayes Theorem’. Interestingly, there is no controversy at all about Bayes Theorem itself. Everyone agrees it is correct. What is controversial is the ontological meaning of the P(B|A) term on the right hand side.

Bayes Theorem allows you to ‘turn a probability around’. That is, given statement A conditional on statement B, we can create an inverse statement of B conditionalized on statement A. Consider this:
```   A: I am wet.
B: It is raining.
```

.
90% of the time, when it is raining, I get wet. So P(A|B) is .9 Suppose some day I walk in from outside and I'm wet. What is the probability that it is raining? To figure this out we need to know two additional probabilities: how often am I wet for whatever reason, and how often does it rain. If I regularly go swimming, I may be wet, say 2% of the time. If we live in an arid area, it may only rain one day out of 100. So P(A) = .02, P(B) = .01, P(A|B) = .9
```

P(B) P(A|B)
--------------- =       P(B|A)
P(A)

(/ (* .01 .9) .02) = .45
```

.
So despite the fact that I nearly always get wet when it rains, I swim often enough and it is arid enough that I'm slightly more often wet because I've been swimming.

Now let's look at a trickier problem. I'm walking down the street and I find a coin. I idly flip it and it comes up heads. In fact, it comes up heads on the next 4 flips. I begin to wonder if I have a biased coin. How do I figure this out?

The traditional approach is to consider a ‘null hypothesis’ and determine if the data are consistent with it. In this case, our null hypothesis is that the coin is fair. Were the coin fair, getting heads five times in a row would happen once out of every 32 experiments of flipping five times, that is, with a probability of 0.03125 We reject the null hypothisis if the probability is below a certain ‘significance level’ (often taken to be 5%). If we use 5% here, we reject the null hypothesis and decide with ‘95% confidence’ the coin is not fair.

The Bayesian approach is markedly different. We start with a model of a coin with a unknown bias. If the bias is .25, we get heads 1/4 of the time and tails the remaining 3/4 of the time. If the bias is .5, we get heads half the time and tails half the time (the coin is fair). It should be obvious that
```
Given a particular value of the bias, we can determine the
probability of getting 5 heads in a row.
```

.
So if someone told us the bias was .5, we'd expect a 1/32 chance of getting 5 heads in a row. If we knew it was .9999, we'd nearly always expect 5 heads in a row. In other words, we can determine the value of
```
P(5 heads | bias)
```

.
Now we apply Bayes theorem to ‘turn the probability around’.
```
P(5 heads| bias)
P(bias | 5 heads) = --------------------  * P(bias)
P(5 heads)

```

.
This is the heart of the controversy. We've taken a physical probability that is conditioned on an unknown parameter (the probability that heads comes up given the bias) and turned it into a probability of a parameter value that is conditioned on a physical observation. Mathematically we're still ok, but the question is whether this is a meaningful transformation.

The standard point of view is that the bias has a particular, well-defined value. We don't know what it is, but it *doesn't change* and it isn't *distributed* over a range. The ‘randomness’ enters the situation when we flip the coin. The outcome of the flip is random, but by observing a large number of coin tosses we can eventually deduce the true value of the bias (actually, we can come up with a narrow interval in which we are very confident that the true value lies within).

Furthermore, the probability of flipping heads is clearly *dependent* upon the bias. But the bias certainly isn't dependent on flipping heads! How could the bias of the coin change depend on the outcome of a flip? Thus while we can use Bayes theorem to compute a number that we call ‘P(bias| 5 heads)’, this isn't a meaningful thing to do. It suggests a varying bias that is dependent upon flipping the coin, an absurdity.

The Bayesian viewpoint is different. P(bias | 5 heads) is not to be interpreted as a distribution of the value of the bias, but rather a distribution of our *knowledge* about the bias. Recall that standard statistics is based on physics but Bayesian statistics is based on information theory. We are not saying that the bias itself is distributed, but that our information about it can be distributed among different values (or our ‘willingness to bet on it' or ‘degree of belief’, or even ‘our opinion’). It should be clear that what we know about the bias can change if we find out that we have flipped 5 heads in a row.

Returning to the equation:
```
P(5 heads| bias)
P(bias | 5 heads) = --------------------  * P(bias)
P(5 heads)

```

.
we have three quantities on the right-hand side. We already discussed P(5 heads | bias). What is the probabilty of flipping 5 heads?

Note that this is probability of flipping 5 heads *independent* of whatever bias there might be. How can know that?! We integrate over all possible biases.
```
bias = 1.0
P(5 heads) = integral        ( P(5 heads | bias) dbias)
bias = 0.0

```

.
Our final quantity is P(bias). This isn't what we're trying to estimate, it is the ‘a-priori’ probability of bias. In other words, what did we know about the bias of the coin *before* we flipped it (actually, before we *found out* the results of flipping it. Remember, we're talking information, not coin modification.) Our choice of the ‘prior probability’ can influence our degree of belief on the posterior probability. If we had picked up this coin outside the loading dock of ‘Jake's Two-headed Coin Factory’ we might not be very surprised at all to get 5 heads in a row. In our example, however, we have no reason to believe in any particular bias, so we can choose a ‘uniform prior’ for bias.

Choosing a prior is somewhat controversial, too. I say ‘somewhat’ because it is usually fairly easy to find ‘uninformative priors’ that have no or minimal assumptions built-in, but it is possible to use a prior probability that is completely irrational. With the appropriate prior, I can compute the ‘probability’ that leprechauns wear little green hats. With crazy inputs, you'll get crazy results, but with an additional veneer of difficult math.

Wednesday, July 16, 2008

Standard probability theory is based on the idea that although certain phenomena are random and unpredictable when taken in isolation, we can make predictions based on aggregate observations of individual phenomena. For example, suppose we have a biased coin. We flip it and it comes up heads. Based on this single observation, we cannot determine the bias of the coin or predict the outcome of the next toss. But by flipping the coin many, many times, we can expect the ratio of heads to tails to approach the actual bias of the coin, and we can make predictions about the likelihood of certain events, like the outcome of the next toss.

The Bayesian approach treats probability as a mechanism for modeling incomplete knowledge about some event or process or phenomenon. A model for some phenomenon is proposed and each observation of the phenomenon is used to refine the model. Again, suppose we have a biased coin that we flip many times. We imagine that a parameter, call it theta, determines the bias of the coin. As we observe the outcome of the flips, we refine our estimation of theta.

The fundamental difference between the two is this:
• Standard probability theory treats randomness as a physical property.
• Bayesian probability theory treats randomness as an information-theoretical quantity.

Crackpots

I've always enjoyed finding out about crackpots whose odd ideas are years ahead of their time. Their ideas languish in obscurity for decades until some other nut comes along and points out the utility and novelty of the idea. Here are some of my favorites.

Quaternions

William Hamilton invented quaternions in 1843. Although there was some initial excitement about quaternions, they were overshadowed by vector algebra and vector calculus. Oliver Heaviside was openly hostile to the idea of quaternions and spent a fair amount of time pointing out the advantages of vector analysis. J. Willard Gibbs didn't quite ‘get’ quaternions (not that he didn't understand them, he just didn't see why you'd bother with the complexity of quaternions when you could just use a 4-vector). Quaternions generally fell by the wayside, but in recent years they have been adopted by the computer gaming industry as the standard way of dealing with rotations of objects in 3-dimensional space.

Reversible Computing

Rolf Landauer of IBM pointed out that the second law of thermodynamics implies that erasing information generates heat (that is, if you clear a register in a processor, the bits that you erase have to go somewhere, and that somewhere is the heat sink). On the other hand, if the computer never erased information, it could presumably run without generating heat. (In computers these days, each bit is represented by huge piles of electrons, and the vast bulk of the heat generated is because we are trying to move all these electrons around at very high speed. The heat from the logical information content is minuscule.) Although the connection between information theory and thermodynamics seems clear and obvious to me, it seems that many chemists and physicists consider it to be an analogy that has been stretched beyond its limits. Shannon's entropy from information theory has an identical formulation as Boltzmann's entropy from thermodynamics (modulo the conversion constant). The question is whether this is because they are analogous quantities or whether they are the same quantities.

Bayesian Statistics

Ronald Fisher, the father of modern statistics, was a staunch opponent of the Bayesian school of thought. The Bayesian approach to statistics has only recently started to gain traction. One important difference between standard statistics and Bayesian statistics is in the integrals over the likelihood function. The standard approach is to integrate over the sample space, but the Bayesian approach is to integrate over the hypothesis space. It is generally impossible to find a closed-form solution to the Bayesian integral, and attempting a numeric approximation is intractable. The standard integrals over the sample space are much easier, and if they cannot be solved analytically, they easily succumb to numeric techniques. But the Bayesian approach can give much better answers (when you can get them!). Modern computers allow a brute-force approach to numeric integration, and Bayesian statistics has been gaining in popularity.

Tuesday, July 8, 2008

Probability and floating point

Floating point numbers are not evenly distributed. For every value of the exponent, there are 2^53 values that can be represented. In other words, between 1.0 and 2.0, there are 2^53 floating point numbers and between 1024.0 and 2048.0 there are also 2^53 floating point numbers. Since the range grows as the exponent, floating point numbers get sparser and sparser as the exponent goes up and denser and denser as it goes down.

This has a curious effect when you try to use floating point when computing probabilities. Floating point numbers are quite dense near 0, but much sparser near 1. The largest non-certain probability you can represent as a double precision floating point number is 0.9999999999999999d0, but the smallest non-zero probability you can represent is 2.2250738585072014d-308 So when you are working with marginal probibilities (determining which of two things is more likely), you can discriminate between unlikely things to a much greater extent than you can discriminate between likely ones.

Between throwing away 3/4 of your floating point numbers (negative numbers and positive ones greater than 1) and cramming almost all of the remaining ones down near zero, you can run out of precision when dealing with probabilities.

What are the odds?

You can solve part of this problem by computing the odds rather than the probability. The odds are the ratio of success to failure rather than the ratio of success to all tries. The odds are easily computed from the probability by dividing the probability by (1 - probability). Odds can take on any non-negative value, so you have twice as many floating point values to work with. Furthermore, `even' odds, representing a 50% chance, is 1.0 so there are about an equal number of floating point values above 50% as there are below it (part of the range is taken by NaNs and there are extra denormalized floats really close to zero.)

Working in odds space doubles the number of floating point values you can use, but there are still issues with varying precision. Extremely likely odds don't have anywhere near the same precision as extremely unlikely ones. One often computes the probability of an event by noticing that p + (1 - p) = 1, in other words, you can compute the probability of an event occurring by determining the ways it *won't* occur and subtracting that from 1. But when your floating point precision changes dramatically between low and high probability or odds, you can lose a huge amount of precision when relying on that identity.

What we need to do is to compensate for the varying precision by stretching the dense regions of floating point numbers and compressing the sparse regions. Taking the logarithm of the odds accomplishes this nicely. Working in log-odds space has long been known to be convenient for mathematicians, but few people seem aware that log-odds space is a better place to compute with floating point numbers.