Monday, July 6, 2009

The long tail

Like my time series, my series of posts about it is gradually diminishing. I started with this raw data:

with more than eight-thousand data points and I can summarize it like this: And that's enough information to reconstruct the curve quite accurately. From these parameters, you can compute the mean and the mode (where the peak is), and where the quantiles are (the 80th, 90th percentile, etc.) Interestingly, the variance is only defined when β > 2. With the variance undefined, the ‘standard deviation’ is also undefined. (So I know that people who quote it are blowing smoke!)

I've been using gnuplot to generate the graphs, but I've also been using its fit command to compute the parameters of the curves. It seems to work, but I have no idea how. I don't want to depend on it. I'd like to write some code that determines the best α and β values from the raw data. Once I have that, I'd like to plot how α and β change under different circumstances: over time, under different loads, etc. Perhaps I can figure out why the curve is the way it is. I'll post more information if I discover more.

I'd be interested in seeing if other people find it useful to analyze time series in this way. Let me know.

Sunday, July 5, 2009

10 billion evaluations

While bootstrapping my Scheme interpreter, I ran 10 billion evaluations using unoptimized code. The breakdown of SCode types was this:

A bit of a stretch

Using the cumulative distribution makes it a bit easier to fit the distribution, but where before the curve asymptotically approached 0, it now asymptotically approaches 1. It's still hard to see how well the curve fits when it and the data are very close to the asymptote. To fix this, we want to apply a stretching function to the graph. A popular stretching function is the inverse error function. The reason is obvious:

With inverse erf stretching, a cumulative normal distribution becomes a straight line.

It's pretty clear that my data set does not follow a lognormal distribution. Although the fit looked pretty good on the original histogram, it is simply not a straight line. However, the low end of the curve looks pretty straight.

This next plot is a Pareto distribution plotted with inverse erf stretching:

The Pareto distribution wasn't too bad a fit, but still, we can see it isn't a good match when we look at it here.

Another stretching function to consider is the inverse logistic function.

Now this is interesting. A good chunk of the curve is now a straight line. It seems that a log-logistic distribution is a really good model for the data (at least for the long tail). Let's see how it looks on the original histogram.

In this graph I plotted the lognormal distribution that fit the low-end of the plot and the log-logistic distribution that fit the high-end.

So what is a log-logistic distribution? It seems to be common in sociology and biology. The survival curve after a kidney transplant follows a log-logistic distribution. It also seems to show up in insurance risk models. It is used in hydrology to model precipitation rates. This is weird. I can't see why this distribution would arise in what I'm measuring.

The log-logistic distribution has two tuning parameters, α and β. α is the median of the curve and β determines the amount of spread. In the stretched cumulative plot, α determines where the line intersects the 50th percentile and β determines the slope of the line.

Having fits

Choosing a model for the data is hard. On the one hand, you want a model that is reasonably faithful to the data. A model that is too different from your data is not going to be too useful for prediction. On the other hand, you want a model simple enough to reason about. A model that has dozens of tuning parameters can be very accurate, but it won't be easy to understand. It's a tradeoff. If the model has a basis in some physical rationale — for example, an exponential model is naturally expected from a radioactive sample — it may offer an insight as to the physics behind what you are measuring. But even if the model is simply an easily described curve with no physical basis, it can still be useful.

A one or two parameter model is what I'm looking for. There are a lot of one and two parameter models that look like my data. The main characteristic of most of these distributions is that they asymptotically approach zero over a long time.

Gamma:

Log-normal:

Pareto:

Weibull:

Eyeballing these is hard, too. So we need a couple of more tricks. My favorite trick of log scale doesn't work too well. It does show the tail of the curve nicely, but it also magnifies the variation. Since the data are sparse out at the tail, this makes the variation that much bigger.


Instead, I'm going to integrate over the distribution and normalize the values so the curve goes from 0 to 1.


Now this graph is obviously bad. The curve is squeezed in near the edges so you can't see it, and the asymptotes are so close you couldn't tell if something fit or not. We'll fix this in a sec. But take a look here and you'll see a number of graphs that display the data is this awful way.

Now I'm going to use my log scale trick.


This is quite a bit better. Now we can see important things like the median (at the .5 mark) and the 90th percentile (at the .9 mark). There is another benefit we got. The data in this graph is the unsmoothed data. If we zoom in on a small part of the graph, we can see that the ‘hair’ has turned into tiny stairsteps.

Although the hair was tall, it was very narrow, so it doesn't contribute much to the integral. (So my attempt to shave the hair off the data was pretty much a waste of time. Oh well.)

But the point of doing this was to make it easier to fit the models. The models usually have an integral form (cumulative distribution) so we can try them out. But since we're using the integral form, the errors add up and we can more easily see which distributions have a better fit.

Cumulative Cauchy:

Cumulative log Pareto:

Cumulative lognormal:

Cumulative log Poisson:

There's one more thing to do....

Friday, July 3, 2009

A little Scheme code

Patrick Stein commented on the way that I defined my DTFT function:
(define (dtft samples)
  (lambda (omega)
    (sum 0 (vector-length samples)
         (lambda (n)
           (* (vector-ref samples n)
              (make-polar 1 (* omega n)))))))
Let me tweak it just a bit. The idea is that we have our original function which may or may not be continuous and we sample it at regular intervals.
(define (dtft original sample-interval sample-count)
  (lambda (omega)
    (sum 0 sample-count
         (lambda (n)
           (* (original (* sample-interval n))
              (make-polar 1 (* omega n)))))))
Now we need to turn our vector of samples into a function.
;; Non robust.  Assumes N is a fixnum in the right range.
(define (vector->function sample-vector)
  (lambda (n) (vector-ref sample-vector n)))
I take the original function and return a new function with a single parameter omega which is the transformed version. The original function only needs to be defined at the sample points in the range from 0 to sample-count. I'm treating my vector of samples as a function from integer->real (the samples are in fact integers, but they could be floats).

The function I return is a continuous function from real->complex. It is defined over all the reals, but since omega is an angle, the value computed is going to be repeated every time omega sweeps out an entire circle, so we should limit our interest to the range of 0 to 2π or -π to π.

It turns out to be somewhat computationally expensive to evaluate the transformed function. It isn't very expensive, but with eight thousand samples, we several tens of thousands of floating point operations. If we memoize the function, we can save a huge amount of time if we ever re-evaluate with the same omega. (And we will.)
(define (memoize f)
  (let ((table (make-eqv-hash-table))) ;; eqv works on floats
    (lambda (arg)
      (or (hash-table/get table arg #f)
          (let ((value (f arg)))
            (hash-table/put! table arg value)
            value)))))

(define ym
  (memoize
   (dtft (vector->function *data*) 1 8192)))
The inverse transform is fairly simple:
(define (inverse-dtft y domega)
  (lambda (n)
    (define (f omega)
      (* (y omega)
         (make-polar 1 (* omega (- n)))))

    (/ (real-part (integrate f (- *pi*) *pi* domega))
       (* *pi* 2))))
The integral should return a complex number where the real part cancels out to zero, but since the integration is a numeric approximation, I just lop off the imaginary part.

We need a definition of integrate. I got this simple trapezoid integration from Sussman and Abelson:
(define (trapezoid f a b h)
  (let* ((n (/ 1 h)) 
         (dx (/ (- b a) n)))
    (define (iter j sum)
      (if (>= j n) 
          sum
          (let ((x (+ a (* j dx))))
            (iter (+ j 1) (+ sum (f x))))))
    (* dx (iter 1 (+ (/ (f a) 2) (/ (f b) 2))))))

(define integrate trapezoid)
It converges slowly.
(do ((i 1 (+ i 1))
     (d-omega 1.0 (/ d-omega 2)))
    ((>= i 13))
  (display i)
  (display " ")
  (display d-omega)
  (display " ")
  (display ((inverse-dtft ym d-omega) 250))
  (newline))

1 1. -161.
2 .5 32215.
3 .25 16091.999999999949
4 .125 8074.
5 .0625 4074.000000000024
6 .03125 2048.0000000000246
7 .015625 1050.0000000000236
8 .0078125 571.9999999999932
9 .00390625 336.9999999999946
10 .001953125 235.99999999999997
11 .0009765625 184.9999999999997
12 .00048828125 160.99999999999937
13 .000244140625 153.00000000000196
But once I have it computed for one value of x, the others come fast. The actual value of the integral at this point is 150, but 153 is close enough for me.

Thursday, July 2, 2009

Too much information

I won't keep you in suspense. Here is my time series after removing the hair:

I sliced up the curve into approximate 15 ms segments. I offset the segments so that the spike would be about in the middle, then I simply averaged each segment independently and pasted the result back together. This does remove details smaller than about 15 ms, but I don't need that level of detail.

Actually the whole point of what I'm doing is to remove detail. The original time series had way too much. I want to remove as much detail as possible, but keep just enough so I can easily characterize this data set and compare it with others. At one end of the detail scale I have the original data set of 8192 values. The other end of the scale has no data whatsoever (trivial, but very parsimonious). A number at least allows some basis of comparison. But which number? If I told you that the average value was about 1572 ms, would that be useful? Well, that depends. You might be mislead to expect that if you were to try the experiment right now that it would take around 1572 ms, more or less. You'd be wrong. More than 65% — almost two thirds — of the samples are below the average. The average is the total amount of time to gather all the samples divided by the number of samples gathered. This tells you a lot about the rate at which you can expect to gather samples in bulk, but not very much about the time value of an individual sample.

We need a parameterized model for our data set. The model will tell us qualitatively what the data set is like, the parameters will give us the quantitative information we need to compare a particular data set against others. If you don't know what the model is, the numbers are meaningless! This is no exaggeration. In this chart, there are two distributions. Both distributions have the same average.

Whenever I see a measure of “average latency” a little alarm bell goes off in my head. It's a good indication that the stated value is a meaningless number and that the person who measured it doesn't know what he is measuring. A louder alarm bell goes off when I see a measure of “standard deviation”. Informally, the standard deviation is a measure of how widely the data is distributed. But the way you calculate the standard deviation depends on the model. (The relevance of the standard deviation also depends on the model. If your model is logarithmic, as most time-based models are, you would be better off computing the ‘geometric standard deviation’.) If the model is not specified, it is likely that the reported ‘standard deviation’ was simply calculated with a gaussian model. Some models don't even have a standard deviation — it's simply not defined.

When I see a measure of standard deviation, it's a good indication that the person who measured it not only doesn't know what he is measuring, but also that he is using a tool that he doesn't understand.


In order to figure out how to characterize my data set, I need to find a good model for it. This can be really hard. There are hundreds of models. Some of them have enough tuning parameters that you can make them fit anything. What I'm looking for is a model that is simple, has as few parameters as possible, and fits the data reasonably well. (The reason I shaved the hair off the data is so I could visually compare the fit of a few different models.) There are a number of ways to search for models, but it often comes down to trial and error. There are a couple of tricks, though.

The first trick is see how the data looks in log space. It is frequently the case that you are working with some ‘scale-free’ quantity. You don't care about the absolute time, you care about the relative improvement. When you discuss data in terms of ‘percent’ or ‘factor of two’ and such, you are likely to want to work in log space. Here's a plot of my shaved data set in log space: Now that is starting to look like a bell-shaped curve. This suggests that a log-normal distribution might be a good model:
And now I can state quantitatively that under a log-normal model, my data set has a value of μ= 6.97 and σ=.90 With just these two numbers, you can approximate the mean (the true average is 1572, the average of the model is 1593), the mode (472), and any other quantity of the curve (for example, the geometric standard deviation, which is 2.46).

But the log-normal curve doesn't quite fit. It's close, but it overestimates near the mean and underestimates the long tail. I've been looking for a better model.

Next time.... trying different models.

Wednesday, July 1, 2009

Sometimes you have one of those “Aha!” moments. This isn't one of those times. This is closer to a “D'oh!” moment.

The various filters I tried on my data set were really disappointing. To try to figure out what was going on, I made a fake data set that had spikes every 15.625 milliseconds, but was zero everywhere else. If you perform a Fourier transform on this, you'll see the problem immediately. If my signal was simply a 64 Hz sine wave, I'd have a single point on the Fourier transform. If it were a square wave, I'd have the primary frequency and the odd harmonics. With a pulse, there are harmonics everywhere.

But since I know that the spikes come every 15.625 milliseconds, and I know the phase, it occurs to me that I can simply erase those samples. No filters, no transforms, just a little rounding. In fact, if I walk through the data in 15.625 millisecond chunks, but make my first step exactly half that size, every chunk will have exactly one spike (although it might contribute to more than one sample within the chunk).

If I call this a “learning opportunity” maybe I'll feel better about this.