Friday, July 18, 2008

Bayes theorem is trivially derived from this identity:

         P(A&B) = P(A|B)P(B) = P(B|A)P(A)

The probability of A and B both being true is the probability of A being true if B is assumed times the probability of B, or, alternatively, the probability of B being true if A is assumed times the probability of A.

Now we just move a term:

                     P(B|A) P(A)
          P(A|B) = ----------------
                        P(B)

There's no magic here; it's just the simple algebra you learned in high school. This result is called ‘Bayes Theorem’. Interestingly, there is no controversy at all about Bayes Theorem itself. Everyone agrees it is correct. What is controversial is the ontological meaning of the P(B|A) term on the right-hand side.

Bayes Theorem allows you to ‘turn a probability around’. That is, given statement A conditional on statement B, we can create an inverse statement of B conditionalized on statement A. Consider this:
   A: I am wet.
   B: It is raining.

90% of the time, when it is raining, I get wet. So P(A|B) is .9. Suppose some day I walk in from outside and I'm wet. What is the probability that it is raining? To figure this out we need to know two additional probabilities: how often I am wet for whatever reason, and how often it rains. If I regularly go swimming, I may be wet, say, 2% of the time. If we live in an arid area, it may only rain one day out of 100. So P(A) = .02, P(B) = .01, and P(A|B) = .9.

                P(B) P(A|B)
              --------------- = P(B|A)
                   P(A)

            (/ (* .01 .9) .02) = .45

So despite the fact that I nearly always get wet when it rains, I swim often enough and it is arid enough that I'm slightly more often wet because I've been swimming.
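
The same arithmetic can be checked in a few lines of Python. This is just a sketch: the variable names are mine, and the numbers are the ones assumed in the example above.

```python
# Probabilities assumed in the wet/rain example.
p_wet = 0.02             # P(A): I am wet, for whatever reason
p_rain = 0.01            # P(B): it is raining
p_wet_given_rain = 0.9   # P(A|B): I get wet when it rains

# Bayes theorem: P(B|A) = P(A|B) P(B) / P(A)
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet

print(round(p_rain_given_wet, 2))  # 0.45
```

So in the remaining 55% of the cases where I'm wet, something other than rain is responsible.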

Now let's look at a trickier problem. I'm walking down the street and I find a coin. I idly flip it and it comes up heads. In fact, it comes up heads on the next 4 flips. I begin to wonder if I have a biased coin. How do I figure this out?

The traditional approach is to consider a ‘null hypothesis’ and determine if the data are consistent with it. In this case, our null hypothesis is that the coin is fair. Were the coin fair, getting heads five times in a row would happen once out of every 32 experiments of flipping five times, that is, with a probability of 0.03125. We reject the null hypothesis if the probability is below a certain ‘significance level’ (often taken to be 5%). If we use 5% here, we reject the null hypothesis and decide with ‘95% confidence’ that the coin is not fair.
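
As a rough sketch of that calculation (the names are mine, and I'm treating the probability of the observed run as the quantity compared against the significance level, just as the paragraph above does):

```python
# Probability of 5 heads in 5 flips if the coin is fair (the null hypothesis).
p_heads_fair = 0.5
n_flips = 5
p_run = p_heads_fair ** n_flips

significance_level = 0.05  # the conventional 5% threshold
reject_null = p_run < significance_level

print(p_run)         # 0.03125
print(reject_null)   # True
```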

The Bayesian approach is markedly different. We start with a model of a coin with an unknown bias. If the bias is .25, we get heads 1/4 of the time and tails the remaining 3/4 of the time. If the bias is .5, we get heads half the time and tails half the time (the coin is fair). It should be obvious that

    Given a particular value of the bias, we can determine the
    probability of getting 5 heads in a row.   

So if someone told us the bias was .5, we'd expect a 1/32 chance of getting 5 heads in a row. If we knew it was .9999, we'd nearly always expect 5 heads in a row. In other words, we can determine the value of

               P(5 heads | bias)
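
If the flips are independent (an assumption I'm making explicit here), this quantity is just the bias raised to the fifth power. A minimal sketch, with a function name of my own choosing:

```python
def likelihood_all_heads(bias, n_flips=5):
    """P(n_flips heads in a row | bias), assuming independent flips."""
    return bias ** n_flips

# The three biases mentioned in the text.
for bias in (0.25, 0.5, 0.9999):
    print(bias, likelihood_all_heads(bias))
```

For a bias of .5 this gives 0.03125, the same 1/32 as before.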

Now we apply Bayes theorem to ‘turn the probability around’.

                               P(5 heads| bias)
         P(bias | 5 heads) = --------------------  * P(bias)
                                 P(5 heads)

This is the heart of the controversy. We've taken a physical probability that is conditioned on an unknown parameter (the probability that heads comes up given the bias) and turned it into a probability of a parameter value that is conditioned on a physical observation. Mathematically we're still ok, but the question is whether this is a meaningful transformation.

The standard point of view is that the bias has a particular, well-defined value. We don't know what it is, but it *doesn't change* and it isn't *distributed* over a range. The ‘randomness’ enters the situation when we flip the coin. The outcome of the flip is random, but by observing a large number of coin tosses we can eventually deduce the true value of the bias (actually, we can come up with a narrow interval within which we are very confident that the true value lies).

Furthermore, the probability of flipping heads is clearly *dependent* upon the bias. But the bias certainly isn't dependent on flipping heads! How could the bias of the coin depend on the outcome of a flip? Thus while we can use Bayes theorem to compute a number that we call ‘P(bias | 5 heads)’, this isn't a meaningful thing to do. It suggests a varying bias that is dependent upon flipping the coin, an absurdity.

The Bayesian viewpoint is different. P(bias | 5 heads) is not to be interpreted as a distribution of the value of the bias, but rather a distribution of our *knowledge* about the bias. Recall that standard statistics is based on physics but Bayesian statistics is based on information theory. We are not saying that the bias itself is distributed, but that our information about it can be distributed among different values (or our ‘willingness to bet on it’ or ‘degree of belief’, or even ‘our opinion’). It should be clear that what we know about the bias can change if we find out that we have flipped 5 heads in a row.

Returning to the equation:

                               P(5 heads| bias)
         P(bias | 5 heads) = --------------------  * P(bias)
                                 P(5 heads)

we have three quantities on the right-hand side. We already discussed P(5 heads | bias). What is the probability of flipping 5 heads?

Note that this is the probability of flipping 5 heads *independent* of whatever bias there might be. How can we know that?! We integrate over all possible biases, weighting each by its prior probability P(bias).

                   bias = 1.0
   P(5 heads) = integral        ( P(5 heads | bias) P(bias) dbias)
                 bias = 0.0
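
To make the integral concrete, here is a small numerical sketch. It assumes independent flips, so that P(5 heads | bias) is bias to the fifth power, and a uniform prior, so that P(bias) is 1 everywhere on [0, 1]. The midpoint rule and the function names are my own choices, not anything special.

```python
def likelihood_all_heads(bias, n_flips=5):
    """P(n_flips heads in a row | bias), assuming independent flips."""
    return bias ** n_flips

def evidence(n_flips=5, steps=100_000):
    """Midpoint-rule estimate of the integral of the likelihood over
    bias in [0, 1], with a uniform prior density of 1."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        bias = (i + 0.5) * width          # midpoint of the i-th slice
        total += likelihood_all_heads(bias, n_flips) * width
    return total

p_five_heads = evidence()
print(p_five_heads)  # close to the exact value, 1/6
```

Under these assumptions the integral can also be done by hand: the integral of bias^5 from 0 to 1 is exactly 1/6.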

Our final quantity is P(bias). This isn't what we're trying to estimate; it is the ‘a-priori’ probability of the bias. In other words, what did we know about the bias of the coin *before* we flipped it? (Actually, before we *found out* the results of flipping it. Remember, we're talking information, not coin modification.) Our choice of the ‘prior probability’ can influence the posterior probability, that is, our degree of belief after seeing the data. If we had picked up this coin outside the loading dock of ‘Jake's Two-headed Coin Factory’, we might not be very surprised at all to get 5 heads in a row. In our example, however, we have no reason to believe in any particular bias, so we can choose a ‘uniform prior’ for the bias.
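
With a uniform prior and independent flips, all three quantities are known and the posterior can be written down directly. The sketch below is my own working of that case, not a general recipe: the likelihood is bias^5, the prior density is 1, and P(5 heads) is 1/6, so the posterior density comes out to 6 * bias^5.

```python
def posterior_density(bias, n_flips=5):
    """Posterior density of the bias after n_flips heads in a row,
    assuming independent flips and a uniform prior on [0, 1]."""
    likelihood = bias ** n_flips
    prior = 1.0  # uniform prior density
    # P(n heads) is 1 / (n_flips + 1), so dividing by it is the same
    # as multiplying by (n_flips + 1).
    return (n_flips + 1) * likelihood * prior

def prob_bias_above(threshold, n_flips=5):
    """Posterior probability that the bias exceeds threshold
    (closed form of the integral of the density above)."""
    return 1.0 - threshold ** (n_flips + 1)

print(posterior_density(0.5))  # 0.1875
print(prob_bias_above(0.5))    # 0.984375
```

Under these assumptions, after 5 heads our degree of belief that the coin favors heads is a bit over 98%.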

Choosing a prior is somewhat controversial, too. I say ‘somewhat’ because it is usually fairly easy to find ‘uninformative priors’ that build in few or no assumptions, but it is possible to use a prior probability that is completely irrational. With the appropriate prior, I can compute the ‘probability’ that leprechauns wear little green hats. With crazy inputs, you'll get crazy results, but with an additional veneer of difficult math.
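
Here is a toy illustration of how much the prior matters. Everything in it is made up for the purpose: I restrict the bias to five candidate values and compare an even-handed prior with a ‘dogmatic’ one that insists the coin is fair no matter what.

```python
def posterior(prior, n_heads=5):
    """Discrete Bayes update. prior maps each candidate bias to its
    prior probability; the data is n_heads heads in a row."""
    unnormalized = {b: p * b ** n_heads for b, p in prior.items()}
    total = sum(unnormalized.values())
    return {b: u / total for b, u in unnormalized.items()}

# Hypothetical candidate biases, chosen only for illustration.
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]

open_minded = {b: 1 / len(candidates) for b in candidates}
dogmatic = {b: (1.0 if b == 0.5 else 0.0) for b in candidates}

print(posterior(open_minded))  # most of the weight shifts toward 0.9
print(posterior(dogmatic))     # still certain the coin is fair
```

The dogmatic prior gives zero weight to everything but a fair coin, so no run of heads, however long, can ever change its mind.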