We have a coin with an unknown bias and we have flipped it 41 times and observed it come up heads on 29 of those flips. In order to predict whether it will come up heads on the next two tosses, we need to estimate the bias. The standard approach to statistics tells us that the sample mean should approximate the true mean. The sample mean is about 0.7073, and if we toss a coin with that bias twice, we will have a 0.7073^2 = .5003 chance of getting two heads. If you were given even odds on this chance, you should take it because you'd (eventually) make money.
The traditional method is simple, but wrong. It doesn't matter what your philosophy of statistics is, it is simply the case that if you write a program to simulate the problem, it will not converge to this solution.
What did we do wrong? We estimated the bias by integrating over the sample space. (Since the sample space is discrete, it is a sum.) We should have estimated the bias by integrating over the model space. We don't know the bias of the coin. It could be about .7073 and have generated 29 heads out of 41 flips, but it also could have been a completely fair coin (.5) and we were just lucky to have a few more heads than we'd expect. It could have been a very unfair coin (.9) and we were unlucky and got a few fewer heads than expected. There is a continuum of possibilities, and we want to take them all into account. To do this, we have to consider the probability of generating the observed data for all possible biases and integrate over them. The probability of the bias having a value of theta given the observed data is
(n + 1)! p(theta|D) = ------------ * theta^r * (1 - theta)^(n - h) h!(n - h)!where h is the number heads and n is the number of flips.
Our best estimate of the bias is
theta=1 h + 1 bias = integral theta * p(theta|D) dtheta = --------- theta=0 n + 2.
For our example, after observing 29 heads in 41 flips, we should estimate the bias to be (29 + 1)/(41 + 2) = .6977 (Intuitively, there are more ways to bias a coin less than 29/41 than there are ways to bias it higher, and each of the extra ways has a possibility of being lucky and generating our observed data.)
Ah, so we estimate our bias at .6977 and square that giving 0.4867 as the probability of two heads. Nope. Close.
Once we have flipped the coin and gotten our first head, we have a new situation. We have now observed 42 flips and gotten 30 heads. This changes our estimate of the bias, which is now (30 + 1)/(42 + 2) = 0.7045 Therefore the probability of getting two heads in a row is (29 + 1)/(41 + 2) * (30 + 1)/(42 + 2) = 0.4915
Are we insane?!! How could the bias change between the flips?! How could the result of the first flip influence the probability we compute before we flip the coin?!! Remember, though, Bayesian probability is not based on physics, it is based on information theory. The coin's bias doesn't change at all throughout the experiment. Our knowledge about the bias changes every time we determine an outcome. We don't know what the result of the first flip will be, but we do know that it must be heads if we are to win the bet. Therefore, if we are lucky, we can infer the higher bias. Of course if we are unlucky we can infer the lower bias, but since we'll be out ten bucks, we don't care to estimate how much lower.
It is easy to write a small program to model the problem. It probably won't be able to run long enough to converge to the correct solution for 41 flips, but it will easily handle small test cases. For example, if you select a bias at random, flip it three times and get 2 heads out of three flips, the probability of the next two flips being heads should be (2 + 1)/(3 + 2) * (3 + 1)/(4 + 2) = 2/3
The original problem was at Panos Ipeirotis's Blog.
This is a little late, but you can get the same result as the two-step inference by the (arguably) more direct method of integrating the distribution function.
ReplyDeleteUsing R, this would just be "integrate(function (x) {x*x*dbeta(x,30,13)}, 0, 1)", using a Beta(1, 1) prior.
Not late at all.
ReplyDeleteI'm quite new to this Bayesian probability stuff and I'm still unsure of my grasp of it at the more abstract level. I sort of understand your direct method (it basically unifies the first two equations) and I know that dealing with Beta priors is so common that it becomes second nature to experienced Bayesians. But I keep having to go back and relate the problem to a Bernoulli Urn situation and re-derive the equations from that because that way I feel I'm on familiar ground and I trust the results.
The point was just that, if you want to know what the probability of getting two heads on a coin with p(head)=mu, it's just mu*mu.
ReplyDeleteIf you're not certain of the probability and, in fact, have a distribution in mu, then you have to integrate over that distribution, giving something like integral(mu*mu*p(mu) dmu).
Here, we know p(mu)=Beta(mu, 30, 13), so we can just do that integral. I find it somewhat amazing (and reassuring) that this gives the same answer as just multiplying the averages. (i.e., in this case, E[mu|29 heads]*E[mu|30 heads].
I'm not much of a Bayesian myself, but I've recently been working through some of this logic, so this caught my eye.