Monday, July 6, 2009

The long tail

Like my time series, my series of posts about it is gradually diminishing. I started with this raw data:

with more than eight thousand data points, and I can summarize it like this:

And that's enough information to reconstruct the curve quite accurately. From these parameters, you can compute the mean, the mode (where the peak is), and the quantiles (the 80th percentile, the 90th, etc.). Interestingly, the variance is only defined when β > 2. When the variance is undefined, the ‘standard deviation’ is undefined as well. (So I know that people who quote it are blowing smoke!)
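Here is a minimal sketch of how those summary numbers could be computed from the two parameters. It rests on an assumption: that the curve is a log-logistic (Fisk) distribution with scale α and shape β, which is one family whose variance exists only when β > 2. The function name `loglogistic_summary` and its return layout are made up for illustration.

```python
import math

def loglogistic_summary(alpha, beta, percentiles=(0.5, 0.8, 0.9)):
    """Summarize a log-logistic curve (ASSUMED family) with scale alpha, shape beta."""
    if alpha <= 0 or beta <= 0:
        raise ValueError("alpha and beta must be positive")
    b = math.pi / beta
    # The mean only exists for beta > 1.
    mean = alpha * b / math.sin(b) if beta > 1 else None
    # The peak sits at zero when beta <= 1.
    mode = alpha * ((beta - 1) / (beta + 1)) ** (1 / beta) if beta > 1 else 0.0
    # The variance only exists for beta > 2.
    if beta > 2:
        variance = alpha ** 2 * (2 * b / math.sin(2 * b) - b ** 2 / math.sin(b) ** 2)
    else:
        variance = None
    # Quantile function: x_p = alpha * (p / (1 - p)) ** (1 / beta).
    quantiles = {p: alpha * (p / (1 - p)) ** (1 / beta) for p in percentiles}
    return {"mean": mean, "mode": mode, "variance": variance, "quantiles": quantiles}
```

Under this assumption the median is simply α, and asking for the variance with β ≤ 2 returns `None` rather than a misleading number.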

I've been using gnuplot to generate the graphs, but I've also been using its fit command to compute the parameters of the curves. It seems to work, but I have no idea how. I don't want to depend on it. I'd like to write some code that determines the best α and β values from the raw data. Once I have that, I'd like to plot how α and β change under different circumstances: over time, under different loads, etc. Perhaps I can figure out why the curve is the way it is. I'll post more information if I discover more.
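As a starting point, here is one rough way to estimate α and β directly from the raw data. It is a sketch, not what gnuplot's fit does, and it again assumes a log-logistic curve. For that family, ln(F / (1 − F)) = β (ln x − ln α), so an ordinary least-squares line through the logit of the empirical CDF against ln x gives β as the slope and α from the intercept. The name `fit_loglogistic` is hypothetical.

```python
import math

def fit_loglogistic(samples):
    """Estimate (alpha, beta) by a straight-line fit on the logit of the empirical CDF.

    ASSUMES the data follow a log-logistic distribution; non-positive
    samples are dropped because their logarithm is undefined.
    """
    xs = sorted(x for x in samples if x > 0)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two positive samples")
    # Plotting positions (i + 0.5) / n keep the empirical CDF strictly inside (0, 1).
    u = [math.log(x) for x in xs]
    v = [math.log(p / (1 - p)) for p in ((i + 0.5) / n for i in range(n))]
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    if sxx == 0:
        raise ValueError("all samples are identical; cannot fit a slope")
    sxy = sum((a - mean_u) * (c - mean_v) for a, c in zip(u, v))
    beta = sxy / sxx                            # slope of the line
    alpha = math.exp(mean_u - mean_v / beta)    # from intercept = -beta * ln(alpha)
    return alpha, beta
```

Running this over successive windows of the data would give α and β as functions of time or load. A maximum-likelihood fit would probably weight the tail differently, so treat these estimates as a first approximation.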

I'd be interested in seeing if other people find it useful to analyze time series in this way. Let me know.