Using the cumulative distribution makes it a bit easier to fit the
distribution, but where the curve previously approached 0
asymptotically, it now approaches 1. It's still hard to judge how well
the curve fits where both it and the data are very close to the
asymptote. To fix this, we want to apply a stretching function to the
probability axis of the graph. A popular stretching function is the
inverse error function, and the reason is simple: with inverse erf
stretching, a cumulative normal distribution becomes a straight line.
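Here is a minimal sketch of why the stretching works, using only the Python standard library. The values of `mu` and `sigma` are made up for illustration, and `erfinv` is a helper defined here from the normal quantile function, since the `math` module has no inverse erf. The sketch assumes the cumulative probability is rescaled from (0, 1) to (-1, 1) before stretching.

```python
import math
from statistics import NormalDist

def erfinv(y):
    """Inverse error function, built from the standard normal quantile."""
    # erf(z) = 2*Phi(z*sqrt(2)) - 1, so erfinv(y) = Phi^-1((y + 1)/2) / sqrt(2)
    return NormalDist().inv_cdf((y + 1.0) / 2.0) / math.sqrt(2.0)

def stretch(p):
    """Map a cumulative probability p in (0, 1) onto the stretched axis."""
    return erfinv(2.0 * p - 1.0)

# Hypothetical parameters, chosen only for illustration.
mu, sigma = 5.0, 2.0
dist = NormalDist(mu, sigma)

xs = [1.0, 3.0, 5.0, 7.0, 9.0]
stretched = [stretch(dist.cdf(x)) for x in xs]

# Slopes between consecutive points; equal slopes mean a straight line.
slopes = [(stretched[i + 1] - stretched[i]) / (xs[i + 1] - xs[i])
          for i in range(len(xs) - 1)]
print(slopes)
```

Every slope comes out as 1 / (sigma * sqrt(2)), so the stretched curve is a straight line that crosses zero at `mu`. For a lognormal distribution the same thing should hold if log(x) is used in place of x.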
It's pretty clear that my data set does not follow a lognormal
distribution. Although the fit looked pretty good on the original
histogram, the stretched data simply do not form a straight line.
However, the low end of the curve looks pretty straight.
This next plot shows a Pareto distribution with inverse erf stretching:
The Pareto distribution wasn't too bad a fit, but with the stretching applied we can see that it isn't a good match either.
Another stretching function to consider is the inverse logistic
function.
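For reference, the inverse logistic function is usually called the logit. The short sketch below, using only the standard library, shows that it undoes the logistic function and that it spreads out probabilities bunched up near the asymptote. The sample values are arbitrary.

```python
import math

def logistic(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse logistic function: maps p in (0, 1) back to the real line."""
    return math.log(p / (1.0 - p))

# Round trip: logit undoes logistic.
zs = [-4.0, -1.0, 0.0, 1.0, 4.0]
round_trip = [logit(logistic(z)) for z in zs]

# Probabilities crowded near 1 end up evenly spaced after stretching.
near_one = [0.9, 0.99, 0.999]
spread = [logit(p) for p in near_one]
print(round_trip)
print(spread)
```

Each extra "9" in the probability moves the stretched value by roughly the same amount, which is what makes the fit near the asymptote visible.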
Now this is interesting. A good chunk of the
curve is now a straight line. It seems that a log-logistic
distribution is a really good model for the data (at least for the
long tail). Let's see how it looks on the original histogram.
In this graph I plotted the lognormal distribution that fit the low end of the plot and the log-logistic distribution that fit the high end.
So what is a log-logistic distribution? It seems to be common in sociology and biology. The survival curve after a kidney transplant follows a log-logistic distribution. It also seems to show up in insurance risk models. It is used in hydrology to model precipitation rates. This is weird. I can't see why this distribution would arise in what I'm measuring.
The log-logistic distribution has two parameters, α and β. α is the median of the distribution and β is a shape parameter that controls the spread (a larger β gives a narrower distribution). In the stretched cumulative plot, α determines where the line crosses the 50th percentile and β determines the slope of the line.
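To make the roles of α and β concrete, here is a small sketch assuming the standard parameterization of the cumulative distribution, F(x) = 1 / (1 + (x/α)^(-β)). Applying the inverse logistic function gives β·(ln x − ln α), which is a straight line in ln x. The values of `alpha` and `beta` below are hypothetical, and `fit_line` is an ordinary least-squares helper written for this example, not the fitting method used for the plots above.

```python
import math

def loglogistic_cdf(x, alpha, beta):
    """CDF of the log-logistic distribution: 1 / (1 + (x/alpha)^(-beta))."""
    return 1.0 / (1.0 + (x / alpha) ** (-beta))

def logit(p):
    """Inverse logistic function."""
    return math.log(p / (1.0 - p))

def fit_line(us, vs):
    """Ordinary least-squares fit of v = slope * u + intercept."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    sxx = sum((u - mean_u) ** 2 for u in us)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_u

# Hypothetical parameters, chosen only for illustration.
alpha, beta = 10.0, 2.5

xs = [2.0, 5.0, 10.0, 20.0, 50.0]
log_xs = [math.log(x) for x in xs]
stretched = [logit(loglogistic_cdf(x, alpha, beta)) for x in xs]

# The stretched curve is beta * (ln x - ln alpha), so a line fit
# recovers both parameters from the slope and the intercept.
slope, intercept = fit_line(log_xs, stretched)
beta_est = slope
alpha_est = math.exp(-intercept / slope)
print(beta_est, alpha_est)
```

The slope of the fitted line gives back β, and the point where the line crosses zero (the 50th percentile) gives back α. With real data only the straight part of the stretched curve would be used for the fit.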