Sunday, July 5, 2009

Having fits

Choosing a model for the data is hard. On the one hand, you want a model that is reasonably faithful to the data. A model that is too different from your data is not going to be too useful for prediction. On the other hand, you want a model simple enough to reason about. A model that has dozens of tuning parameters can be very accurate, but it won't be easy to understand. It's a tradeoff. If the model has a basis in some physical rationale — for example, an exponential model is naturally expected from a radioactive sample — it may offer an insight as to the physics behind what you are measuring. But even if the model is simply an easily described curve with no physical basis, it can still be useful.

A one- or two-parameter model is what I'm looking for. There are a lot of one- and two-parameter models that look like my data. The main characteristic of most of these distributions is that they asymptotically approach zero over a long time.





Eyeballing these is hard, too, so we need a couple more tricks. My favorite trick, the log scale, doesn't work too well here. It does show the tail of the curve nicely, but it also magnifies the variation. Since the data are sparse out at the tail, this makes the variation look that much bigger.

Instead, I'm going to integrate over the distribution and normalize the values so the curve goes from 0 to 1.
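A minimal sketch of that step, assuming the data is a list of histogram bin counts. The counts below are made up for illustration, and `normalized_cumulative` is a hypothetical helper name, not something from the original analysis.

```python
from itertools import accumulate

def normalized_cumulative(counts):
    """Running sum of the bin counts, scaled so the final value is 1.0."""
    running = list(accumulate(counts))
    total = running[-1]
    return [r / total for r in running]

# Made-up bin counts, for illustration only.
counts = [5, 3, 1, 1]
cdf = normalized_cumulative(counts)
print(cdf)
```

The result never decreases and ends at exactly 1, which is what makes it comparable to a model's cumulative distribution.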

Now this graph is obviously bad. The curve is squeezed in near the edges so you can't see it, and the asymptotes are so close you couldn't tell if something fit or not. We'll fix this in a sec. But take a look here and you'll see a number of graphs that display the data in this awful way.

Now I'm going to use my log scale trick.

This is quite a bit better. Now we can see important things like the median (at the .5 mark) and the 90th percentile (at the .9 mark). There is another benefit we got. The data in this graph is the unsmoothed data. If we zoom in on a small part of the graph, we can see that the ‘hair’ has turned into tiny stairsteps.
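Reading the median and the 90th percentile off a cumulative curve amounts to finding the first point where the curve reaches 0.5 or 0.9. Here is a rough sketch, assuming the curve is stored as two parallel lists; `xs`, `cdf`, and `quantile` are hypothetical names and the numbers are made up.

```python
from bisect import bisect_left

def quantile(xs, cdf, p):
    """Return the first x whose cumulative value is at least p (0 < p <= 1)."""
    i = bisect_left(cdf, p)  # cdf is non-decreasing, so bisection works
    return xs[i]

# Made-up sample points and their normalized cumulative values.
xs = [1, 2, 3, 4]
cdf = [0.5, 0.8, 0.9, 1.0]

median = quantile(xs, cdf, 0.5)
p90 = quantile(xs, cdf, 0.9)
print(median, p90)
```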

Although the hair was tall, it was very narrow, so it doesn't contribute much to the integral. (So my attempt to shave the hair off the data was pretty much a waste of time. Oh well.)

But the point of doing this was to make it easier to fit the models. The models usually have an integral form (the cumulative distribution), so we can try them out directly against the integrated data. And since we're using the integral form, the errors add up, so we can more easily see which distributions have a better fit.
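The comparison here is done by eye, but one way to put a number on it is to take the largest vertical gap between the data's cumulative curve and a candidate model's cumulative distribution. The sketch below uses the lognormal as the candidate. The parameters `mu` and `sigma`, the helper `max_gap`, and the sample points are all assumptions made for illustration, not values from the actual data.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """Cumulative lognormal distribution for x > 0."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def max_gap(xs, empirical, model_cdf):
    """Largest absolute difference between the data and the model curves."""
    return max(abs(e - model_cdf(x)) for x, e in zip(xs, empirical))

# Made-up cumulative data and an assumed (not fitted) parameter choice.
xs = [0.5, 1.0, 2.0, 4.0]
empirical = [0.25, 0.5, 0.75, 0.95]
gap = max_gap(xs, empirical, lambda x: lognormal_cdf(x, mu=0.0, sigma=1.0))
print(gap)
```

A smaller gap means the model's curve hugs the data more closely. Trying a few values of `mu` and `sigma` and keeping the smallest gap would be a crude way to tune the fit.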

Cumulative Cauchy:

Cumulative log Pareto:

Cumulative lognormal:

Cumulative log Poisson:

There's one more thing to do....