Abstract Heresies: The long tail

Monday, July 6, 2009

The long tail

Like my time series, my series of posts about it is gradually diminishing. I started with this raw data:

with more than eight-thousand data points and I can summarize it like this:

It's a log-logistic distribution.
α = 1124
β = 1.58

And that's enough information to reconstruct the curve quite accurately. From these parameters, you can compute the mean and the mode (where the peak is), and where the quantiles are (the 80th, 90th percentile, etc.) Interestingly, the variance is only defined when β > 2. With the variance undefined, the ‘standard deviation’ is also undefined. (So I know that people who quote it are blowing smoke!)

I've been using gnuplot to generate the graphs, but I've also been using its fit command to compute the parameters of the curves. It seems to work, but I have no idea how. I don't want to depend on it. I'd like to write some code that determines the best α and β values from the raw data. Once I have that, I'd like to plot how α and β change under different circumstances: over time, under different loads, etc. Perhaps I can figure out why the curve is the way it is. I'll post more information if I discover more.

I'd be interested in seeing if other people find it useful to analyze time series in this way. Let me know.

3 comments:

David McClain said...: Hi Joe,

It seems like you are just plugging in numbers and turning a crank, looking for whatever distribution "fits best".

But why not go back to the nature of the source of the data and examine the physics that produces the time series?

That should produce a more meaningful model estimate for you, even if the "fits" don't look as nice as some other distribution just chosen for closeness of fit.

In nearly all of nature, there are several different distributions convolved together to produce the measured result. For example an infrared detector will have components described by Poisson statistics for the incoming photon field, Exponential distributions to describe the effects of timing jitter on the readout, Rayleigh noise on the readout electronics, and so on.

No one of these individual distributions will accurately describe the observed values. And the convolution of all of them, with suitably chosen parameters, may not look like a real close fit either. And the eigenfunctions of the decomposition will not necessarily exhibit much orthogonality either. But the results will more closely match, as well as your measurements allow, the underlying physics that led to the observed time series.

To ignore this physical basis seems like only a mathematical game with no resultant meaning...

- DM; July 6, 2009 at 4:11 PM
David McClain said...: ... that said, the way to find the associated parameters of the underlying distributions is to perform a linear or nonlinear optimization in the parameters, depending on the functional forms.

That will produce not only the parameter estimates, but also some notion of their uncertainties. And that kind of analysis may point you toward areas of the measurement process that need, and may allow, improvements.; July 6, 2009 at 4:21 PM
Joe Marshall said...: David McClain comments: It seems like you are just plugging in numbers and turning a crank, looking for whatever distribution "fits best".

To some extent, yes.

Why not go back to the nature of the source of the data and examine the physics that produces the time series?

It isn't feasible. Part of the time I'm measuring involves data transfer across the internet. This is going to include the electrical characteristics of various media such as copper wire (DSL), coax, microwave, satellite, WiFi, etc. Then there are the timing features of the various computers and routers along the way.

I'm at the earliest stage of understanding the problem: measuring it. Step two is trying to understand what effect optimizations and tuning have on the overall measurements. Step three is attempting to separate the causal and non-causal factors (the parts of the model I can and cannot control). Step four is to optimize the parts I can control.

With all the variability that comes from networking, it is a bit surprising to me that my model fits. I was expecting a gaussian or log-normal curve because these generally show up when you have a lot of uncontrolled factors. A gamma curve would simply indicate that you cannot get much information from this measurement (gamma curves have low information content).

The way to find the associated parameters of the underlying distributions is to perform a linear or nonlinear optimization in the parameters, depending on the functional forms.

I'm looking at that. Part of the problem is that I want a very good fit to the part that fits well, rather than a mediocre fit to the entire curve.; July 6, 2009 at 6:10 PM