Thursday, July 30, 2009

Confessions of a Math Idiot

I wasn't always a math idiot. I took the ‘advanced’ math classes in grade school. While I was in high school I took Calculus II over at the community college (the professor was a Harvard grad with a Ph.D. in theoretical physics from Cornell). I took the AP Calc test and my math SAT score was nothing to sneeze at. I went as far as differential equations and linear algebra in college.

But then I found out about computers. On the first day of 6.001 the professor wrote this on the board:

                f(x + dx) - f(x)
f'(x) =  lim  ------------------
        dx->0        dx

(define dx 0.0001)

(define (deriv f)
  (define (f-prime x)
    (/ (- (f (+ x dx)) (f x))
       dx))
  f-prime)

(define (cube x) (* x x x))

(define g (deriv cube))

(g 4)
;Value: 48.00120000993502
It was the beginning of the end.

I started to unlearn math. I began seeing math through the eyes of a computer scientist. Things that I believed in stopped making sense. I used to believe that integration and differentiation were opposites: that one ‘undid’ what the other ‘did’. But look:

(define (deriv f) 
   ...)

(define (integrate f low hi)
   ...)
The arity of deriv is 1 and the arity of integrate is 3. They can't be opposites!

I used to think you couldn't divide by zero. But look:
(/ 2.3 0.0)
;Value: #[+inf]


I started seeing ambiguities in even the simplest math. Is ‘x + 3’ an expression, a function, or a number? Is ‘f(x)’ a function or the value of a function at x? When I see log (f(x)) = dH(u)/du + y, are they stating an identity, do they want a solution for y, or are they defining H?

I started reading weird things like Gosper's Acceleration of Series, Denotational Semantics by Joseph Stoy, and Henry Baker's Archive. They look like math, but they seem to make more sense to me.

But I drifted further away from mathematics. A few years ago a fellow computer scientist explained to me the wedge product. It is a straightforward abstraction of multiplication with the only exception being that you are not permitted to ‘combine like terms’ after you multiply (so you get a lot of irreducible terms). Then he explained the Hodge-* operator as the thing that permits you to combine the like terms and get answers. It's really not very complicated.

But then I looked it up online.
The exterior algebra ∧(V) over a vector space V is defined as the quotient algebra of the tensor algebra by the two-sided ideal I generated by all elements of the form x⊗x such that x ∈ V.
And I discovered that I'm now a math idiot.

I'm probably going to have to live like this the rest of my life. I'm trying to cope by relearning what I used to know but translating it to a style I'm more comfortable with. It will be a long process, but I'm optimistic.

How do you approach a problem?

Douglas Gregor wrote me:

I'm trying to hone my problem solving skills. Do you know any good books written on how to solve complex problems? Did school, i.e., teachers, books, labs, peers, help define the process? How do you approach a problem? How do you break it apart — the fluff from the stuff? How do you formulate a plan?

Wow. These are not easy questions.

Years ago I had a book called How to Solve It by George Pólya. I remember not giving it the attention that I think it deserves. I recall that some of the advice seemed rather obvious. In retrospect, I wish I had paid more attention to it. When someone like Pólya suggests something obvious, it probably isn't as obvious as you think and it is probably worth pondering.

College was a huge influence. The whole focus of the undergraduate curriculum was to “teach you how to think”. The detail of the coursework was almost secondary. It was there in order to be stuff to think about. It had the benefit of being topical. You got to know the jargon of the field and get familiar with the kind of day-to-day problems you would encounter, but that was secondary.

Graduate students would teach the recitations, but the professor was usually a tenured researcher. That made a difference. I found that the professors were amazingly sharp people. They also enjoyed tremendously what they were doing and their enthusiasm showed.

Nearly every course would start with a lecture where the professor would say “we're here to teach you how to think”. Then they would start in on the “first principles”. For example, in electrical engineering we started with Kirchhoff's Laws. Now Kirchhoff's Laws are really obvious: all the current flowing into and out of a node has to add up to zero (otherwise it would pile up somewhere or come out of nowhere), and all the voltages around a loop have to add up to zero (or you'd have made power out of thin air). But the professor would always start from “first principles”.

The next thing the professor would do is introduce a model. Invariably the professor would take a pretty simple and straightforward problem and point out that it is essentially impossible to understand and solve. Again, to use the example from electrical engineering, the action of any electrical circuit whatsoever follows Maxwell's equations (it has no choice), but no one is going to solve those (what's the B field around an irregular lump of solder?). The professor would show us the basic model and how to use it to approximate solutions to real problems.

Finally we'd go over the building blocks. There was a selection of primitive elements and a handful of ways to combine them.

What was interesting is that virtually every course took this approach. It didn't matter whether it was chemistry, electronics, physics, or computer science. It came down to “first principles”, models, and the basic elements.

Another thing that I noticed was the almost absurdly naive way that the professors would approach things. They always wanted to mention their research, so we'd hear about some of the cutting edge of science. The problems that the professors were working on could usually be summed up in a sentence or two: “Why do things fall down instead of up?” “How about if I made a computer chip with really tiny mirrors on it and computed with photons?” “If I run my computer backwards, can I get back the energy I put in?”

Of course someone would object and say “That wouldn't work.” or “You can't do that.” and the professor would immediately counter with “Why not?” and start showing that his ideas were actually plausible if you went back to first principles and used some imagination.

I found that the best professors seemed to ask the most naive questions. I also found that they usually had a long, long record of posing interesting questions and developing astounding solutions. And they were fun to talk to and hang around with.

I really enjoyed the way that certain professors thought about things and I tried to emulate that. I keep telling myself to start with first principles and make a model. (I have to keep reminding myself of this, too. It's too easy to get wrapped up in the detail.)

There's one more thing these professors did: teach. Some of them taught undergrad courses that weren't their specialty because they wanted to learn the material themselves. If you can teach a non-expert enough to understand what it is you are doing, then you probably have a good grasp of it. Otherwise, you might not even understand what problem you are trying to solve. (That's one reason I want to write more blog entries. It helps me organize my thoughts by trying to describe what I'm doing to other people.)

How do you approach a problem? How do you formulate a plan?

This question gave me a lot to think about. I kept trying to think how I formulate a plan and I realized just this moment that I don't. I think maybe I disagree with Pólya on this point. He gives these steps:
  1. First, you have to understand the problem.
  2. After understanding, then make a plan.
  3. Carry out the plan.
  4. Look back on your work. How could it be better?
The interesting problems are the ones where I have no idea at all how to go about solving them. Then there is a good opportunity to learn something (and I love learning things).

I usually don't understand the problem, and that's the biggest hurdle. Of course I often think that I do understand it and get off on the wrong path. I have found myself with a ‘solution’ to something I didn't really want to solve, or the answer to a question that wasn't worth asking. I have to keep reminding myself of what question I want the answer to.

I either don't have good plans, or I can't recognize the good plans amongst the bad ones. I look at a lot of different approaches and try one that looks promising. If that doesn't work, I'll try another. I'll keep refining the plan on the way.

When I get really into a problem, I'm constantly explaining it to other people to try to get some insight. Often I find I don't have a clear enough concept to be able to explain it to someone. Usually the person I'm explaining it to gives me some irrelevant suggestions, but it helps me understand how they are perceiving or understanding what I am asking. Sometimes they say words that trigger an idea.

I also like the oddball approaches. For a lot of problems there seem to be one or two people on the ‘fringe’. They eschew the standard approach to something and advocate something that seems to come out of left field. Sometimes, however, a non-standard approach gives you some insight into the problem that you wouldn't have gotten otherwise. A good example is using Clifford Algebras for computer graphics.

That's about all I can think of right now. I hope I answered your questions.

Wednesday, July 29, 2009

Why I don't like math

You might think it is a contradiction to be a computer scientist and to dislike math. It isn't that I dislike ‘mathematics’. I like the logic, symmetry, and beauty of the ideas. What drives me up the wall is the notation.

Mathematicians like to think they are rigorous. Ha! If you haven't tried to cast your equations into a working computer program I can guarantee you that you left something out. Mathematicians like to omit ‘unnecessary’ detail. I'm a fan of abstraction, but you have to use it in a disciplined manner. If you simply discard the boring parts of the notation, you'll end up with an ambiguous mess. Of course humans can handle ambiguity more gracefully than computers, but it seems to me that if a computer cannot make sense out of what is being expressed, then a poor human like me has very little chance of success.

So here's my current frustration: the distribution I've been studying has an interesting quantile function, F⁻¹(p) = α (p/(1 − p))^(1/β). And I'm already in trouble. This notation has some extra baggage. Let me recast this as a program.
(define (make-quantile-function alpha beta)
  (lambda (p)
    (* alpha (expt (/ p (- 1 p)) (/ 1 beta)))))
alpha and beta are tuning parameters for this kind of distribution. alpha is essentially the median of the distribution; beta determines whether the curve descends steeply or has a long tail. The quantile function takes an argument p which is a number between 0 and 1. If you give it an argument of .9, it will return the value of the 90th percentile (that value of X where 90 percent of the distribution is below X). If you give it an argument of .5, you'll get the median. If you give it an argument of .25, you'll get the first quartile, etc. It's a useful way of describing a distribution.
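For instance, a quick sanity check at the REPL (a sketch; the commented values are approximate):
(define q (make-quantile-function .9 1.3))

(q .5)
;Value: .9

(q .25)  ;; first quartile, about .39
(q .9)   ;; 90th percentile, about 4.88
Note that the median comes back as exactly alpha.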

The reason the fancy equation had a superscript of −1 is because the quantile function is the inverse of the cumulative distribution function. The reason the function was named F⁻¹ is because they named the cumulative distribution function F. The cumulative distribution function could be defined like this:
(define (make-cumulative-distribution-function alpha beta)
  (lambda (x)
    (/ 1 (+ 1 (expt (/ x alpha) (- beta))))))
We can show that these functions are inverses of each other:
(define q (make-quantile-function .9 1.3))
;Value: q

(define cdf (make-cumulative-distribution-function .9 1.3))
;Value: cdf

(q (cdf 3))
;Value: 3.000000000000001

(q (cdf 1.8736214))
;Value: 1.8736213999999998

(cdf (q .2))
;Value: .19999999999999996

(cdf (q .134))
;Value: .13399999999999995
I mentioned earlier that I thought the quantile function was particularly interesting. If you've read my past posts, you know that I think you should work in log-odds space whenever you are working with probability. The quantile function maps a probability to the point in the distribution that gives that probability. The 90th percentile, for example, is the latency that 9 times out of 10 you will meet or beat. So what if we convert the quantile function to a `logile' function? In other words, take the log-odds as an argument.
(define (log-odds->probability lo)
  (odds->probability (exp lo)))

(define (odds->probability o)
  (/ o (+ o 1)))

(define (quantile->logile q)
  (lambda (lo)
    (q (log-odds->probability lo))))

(define (make-logile-function alpha beta)
  (quantile->logile 
    (lambda (p)
      (* alpha (expt (/ p (- 1 p)) (/ 1 beta))))))

(define logile (make-logile-function .9 1.3))

(logile 3)
;Value: 9.046082508750782

(logile 1)
;Value: 1.9422949805536014

(logile 0)
;Value: .9

(logile -3)
;Value: .0895415224453726

But let's simplify that computation:
(quantile->logile 
  (lambda (p)
    (* alpha (expt (/ p (- 1 p)) (/ 1 beta)))))

((lambda (p)
   (* alpha (expt (/ p (- 1 p)) (/ 1 beta))))
 (log-odds->probability lo))

(* alpha (expt (/ (log-odds->probability lo) 
           (- 1 (log-odds->probability lo)))
        (/ 1 beta)))

(* alpha (expt (/ (odds->probability (exp lo))
           (- 1 (odds->probability (exp lo))))
        (/ 1 beta)))

(* alpha (expt (/ (/ (exp lo) (+ (exp lo) 1)) 
           (- 1 (/ (exp lo) (+ (exp lo) 1))))
        (/ 1 beta)))

;; p/(1-p) is just the odds, and the odds are (exp lo),
;; so after a lot of simplification
(* alpha (exp (/ lo beta)))

(define (make-logile-function alpha beta)
  (lambda (lo)
    (* alpha (exp (/ lo beta)))))
My ‘logile’ function is even simpler than the ‘quantile’ function. Now if I want to express the answers in log space as well, I simply take the log of the logile function:
(log (* alpha (exp (/ lo beta))))

(+ (log alpha) (/ lo beta))
Which you will recognize is the equation of a line.
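A quick check against the logile values above (approximate arithmetic, with α = .9 and β = 1.3):
;; slope is 1/beta, intercept is (log alpha)
(log (logile 3))        ;; about 2.2023
(+ (log .9) (/ 3 1.3))  ;; about 2.2023 -- the same point on the same line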

Well so what? With sufficient algebra, any function that has two independent parameters can be turned into a line. However, the cumulative distribution function is an integral, so I have an intuition that there is something interesting going on here. I found some work by Steinbrecher and Shaw where they point out that “Quantile functions may also be characterized as solutions of non-linear ordinary differential equations.” Obviously, the logile function can be characterized this way, too. I'm hoping that looking at the distribution in this way can yield some insight about the distribution.

And this is where I run into the annoying math. Check out these two equations:

[two equations from Steinbrecher and Shaw: the first is the differential equation satisfied by the quantile function Q; the second defines H in terms of the probability density f]

By the first equation, Q is obviously some sort of function (the quantile function). H therefore operates on a function. By the second equation, the argument to H, which is called x, is used as the argument to f, which is the probability distribution function. f maps numbers to numbers, so x is a number. The types don't match.

So what am I supposed to make of this? Well, I'll try to follow the process and see if I can make sense of the result.

Thursday, July 23, 2009

Did you mean....?

http://www.google.com/search?q=recursion

What's wrong with this code? II

(let loop ((ranges (%char-set-ranges cs))
           (value 0))
      (if (null? ranges)
          value
          (loop (cdr ranges)
                (modulo (+ value (caar ranges) (cdar ranges)) 
                        bound))))
Again, the machinery of iterating down the list is woven into the code. And it is incorrect. Rather than testing for null? and assuming that ranges must otherwise be a list, the code should test if ranges is in fact a pair before taking the car or cdr.
;; Common, but incorrect:

  (if (null? foo)
      ... handle base case ...
      (let ((a (car foo)))
         ... handle recursive case ...
         ))

;; Correct:

    (cond ((pair? foo) (let ((a (car foo)))
                          ... handle recursive case ...
                          ))
          ((null? foo)  ... handle base case ...)
          (else (error "Improper list: " foo)))
The second way might be faster if the compiler is reasonably smart about type propagation. The compiler could note that once (pair? foo) is satisfied, then (car foo) does not need an additional type check. This is not the case with the first sample. Just because foo is not null doesn't mean it is therefore a pair.
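In fact, you only need to write the careful version once, inside the traversal abstraction itself. Here is a single-list sketch of fold-left written that way (MIT Scheme already supplies fold-left; this just shows where the type check lives):
(define (fold-left f initial items)
  (cond ((pair? items)
         (fold-left f (f initial (car items)) (cdr items)))
        ((null? items) initial)
        (else (error "Improper list: " items))))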

But in any case, if we abstract away the iterative list traversal, we don't need to bother ourselves about the minutiae of taking car and cdr and checking types:
(fold-left (lambda (value range)
             (modulo (+ value (car range) (cdr range))
                     bound))
           0
           (%char-set-ranges cs))
Here's another example:
(let loop ((syms (make-global-value-list))
           (result '()))
    (if (null? syms)
        result
        (let ((sym-str (symbol->string (caar syms))))
          (if (and (>= (string-length sym-str) (string-length string))
                   (string=? (substring sym-str 0 (string-length string))
                             string))
              (loop (cdr syms)
                    (cons (list (symbol->string (caar syms))) result))
              (loop (cdr syms) result)))))

(fold-left (lambda (result sym)
             (let ((sym-str (symbol->string (car sym))))
               (if (and (>= (string-length sym-str) (string-length string))
                        (string=? (substring sym-str 0 (string-length string))
                                   string))
                   (cons (list sym-str) result)
                   result)))
            '()
            (make-global-value-list))
We don't want to write code that iterates down a list. We want to write code that operates on a single element of a list and then use a higher-order function to make it work across the entire list. map is a great example, but fold-left is even more general than map. map always produces a list of the same length as the input list. fold-left can add or remove elements:
;; Double the items in the list.
(fold-left (lambda (result item) 
             (cons item (cons item result)))
           '() foo)

;; Return a list of the lengths
;; of the non-empty strings.
(fold-left (lambda (result item)
             (let ((x (string-length item)))
               (if (zero? x)
                   result
                   (cons x result))))
           '()
            foo)
(Note that the returned list in each of these is in reverse order. That's ok, we can simply call reverse on the result if we want it in forward order. In most cases, a second traversal to reverse the result is about as cheap as trying to construct the list in forward order, and it is far less complicated.)

fold-left doesn't have to return a list, either.
(fold-left + 0 foo) ;; sum the elements of foo

Monday, July 20, 2009

What's wrong with this code?

(define (& . l)
  (let loop ((l l) (s ""))
    (if (null? l)
        s
        (loop (cdr l) (string-append s (->string (car l))))))
The problem is that the code doesn't say what it does. Consider this version:
(apply string-append (map ->string l))
We're going to take a list of items, convert them all to strings, and paste the strings together. That's pretty much what that one-liner says is going to happen. The loop version has a lot of other cruft in it that is the machinery to iterate down a list. We want to abstract that away.

Perhaps we don't want to use apply here, though. Maybe we're worried about the stack space involved. This will do the trick:
(fold-left (lambda (s e)
             (string-append s (->string e)))
           ""
           l)

Wednesday, July 15, 2009

Variable lookup

After the changes I mentioned in an earlier post I find that about 63% of all variable references are to arguments in the immediately enclosing lambda. In fact, more than half the variable references are to the first argument in the argument list. I'm going to guess that most of the remaining variable references end up referring to global variables, but I could be wrong.

It also turns out that most of the argument references are subproblems of PrimitiveCombination1 and PrimitiveCombination2. Enabling specialization of these forms should be a big win.

Without specialization, these are the steps to evaluating a PrimitiveCombination1:
public override bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    Control unev0 = this.arg0;
    Environment env = environment;
    object ev0;
    while (unev0.EvalStep (out ev0, ref unev0, ref env)) { };

    if (ev0 == Interpreter.UnwindStack) {
        ((UnwinderState) env).AddFrame (new PrimitiveCombination1Frame0 (this, environment));
        answer = Interpreter.UnwindStack;
        environment = env;
        return false;
    }

    // It is expensive to bounce down to invoke the procedure,
    // so we invoke it directly and pass along the ref args.
    if (this.method (out answer, ev0)) {
        TailCallInterpreter tci = answer as TailCallInterpreter;
        if (tci != null) {
            answer = null; // dispose of the evidence
            // set up the interpreter for a tail call
            expression = tci.Expression;
            environment = tci.Environment;
            return true;
        }
        else
            throw new NotImplementedException ();
    }
    else return false;
}
Now if we know that the argument to the primitive is simply going to be an argument bound in the enclosing lambda expression, we can bypass the code that evaluates it and just fetch the value:
public override bool EvalStep (out object answer, ref Control expression, ref Environment environment)
{
    if (this.method (out answer, environment.Argument0Value)) {
        TailCallInterpreter tci = answer as TailCallInterpreter;
        if (tci != null) {
            answer = null; // dispose of the evidence
            // set up the interpreter for a tail call
            expression = tci.Expression;
            environment = tci.Environment;
            return true;
        }
        else
            throw new NotImplementedException ();
    }
    else return false;
}

Keep it simple

After fooling around with trying to find and extract the linear region of my graph, I decided to just punt. I have a number of these graphs and the linear regions are all about at the same place, so the simple thing to do was just grab the data in that region and run least-squares to get a fit. The coefficient of regression is coming in at around .998, so that's a pretty good fit. Sometimes the really simple solution is the best one.

Tuesday, July 14, 2009

Two little things

The curve I was looking at is close to linear when I plot it on a log-logistic scale. I'm trying to find the slope and intercept of the linear part of the curve. I wrote this to compute the least-squares line of a set of data points:
(define (least-squares points receiver)
  (define (iter tail count sigma-x sigma-y sigma-xx sigma-yy sigma-xy)
    (cond ((pair? tail)
           (let* ((point (car tail))
                  (x (car point))
                  (y (cdr point)))
             (iter (cdr tail)
                   (fix:+ count 1)
                   (+ sigma-x x)
                   (+ sigma-y y)
                   (+ sigma-xx (square x))
                   (+ sigma-yy (square y))
                   (+ sigma-xy (* x y)))))
          ((null? tail)
           (let ((xbar (/ sigma-x count))
                 (ybar (/ sigma-y count)))
             (let ((ssxx (- sigma-xx (* count (square xbar))))
                   (ssyy (- sigma-yy (* count (square ybar))))
                   (ssxy (- sigma-xy (* count xbar ybar))))
               (let* ((slope (/ ssxy ssxx))
                      (intercept (- ybar (* slope xbar)))
                      (reg-coeff (/ (square ssxy) (* ssxx ssyy)))
                      (s         (sqrt (/ (- ssyy (* slope ssxy))
                                          (- count 2)))))
                 (receiver slope intercept reg-coeff s)))))
          (else
           (error "Improper data."))))
  (iter points 0 0 0 0 0 0))
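Note that least-squares is written in continuation-passing style: rather than returning four values, it hands the slope, intercept, regression coefficient, and standard error to a receiver procedure. A hypothetical call, with made-up points lying near the line y = 2x + 1:
(least-squares '((0 . 1.1) (1 . 2.9) (2 . 5.2) (3 . 6.8))
               (lambda (slope intercept reg-coeff s)
                 (list slope intercept reg-coeff s)))
;; => slope about 1.94, intercept about 1.09,
;;    reg-coeff about .996, s about .20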
But now my problem is defining what a `linear region' is. At one extreme, the entire graph is linear if your tolerances are wide enough. At the other extreme, the part of the graph between any two points is linear. I've been experimenting with trying to divide the standard deviation by the length of the segment. A long, straight segment will give an unusually low number. Unfortunately, this measure of `linear segmentedness' isn't very well behaved. You can find a good long sloppy match and then look for a better, shorter, more exact match and discover that the two are disjoint. An exhaustive search of all possible segment lengths over all possible positions is too expensive.

As you can see from the pie chart on my earlier post, I've been playing with my Scheme interpreter some more. I have all the optimizations disabled at this point because they don't play well with the notion of saving and restoring the world. (You can save and restore, but if you change the configuration options in-between, things get unhappy.) As you can see, variable lookup is the biggest part of the pie. In order to run fast, variable lookup has to be fast. This means optimizing the way environment structures are built, which in turn means analyzing the lambda expressions that bind variables.

I've divided my lambda expressions into different types. A `standard' lambda is the basic, does-everything lambda expression. It has to create a `standard closure' that is closed over a `standard environment'. A `standard environment' is one that has to support incremental definition and reassignment of variables. Looking up a variable in a standard environment involves checking the argument list of the lambda that bound the environment and checking the auxiliary table for any bindings that were incrementally added after the environment was created. (For instance, if someone did an eval of a define expression in that environment.) Fortunately, only the top-level environments typically have additional bindings like this.

Unless a call is made to the-environment, you cannot easily get your hands on the environment structure. Therefore, we can define an environment type called a `static environment'. A static environment has no way of adding additional bindings. It is called `static' because the location of the variables can be statically determined. (In a standard environment, someone could come in and shadow an existing variable by evaluating a definition.) A static environment supports mutation of variables, however. Every time you fetch a value from a static environment you have to check that the variable has a value. (There is a special `non-assigned' marker.)

Finally, there are `simple' environments. A quick code walk at the time the lambda is constructed can tell us if any calls to the-environment exist or whether any of the variables are the target of a set!. If there are no assignments or calls to the-environment, then we know that whatever value is bound to a variable never changes. We also know it cannot be the `non-assigned' marker. In this case, we can bypass some of the machinery involved in searching for the variable and checking if it is bound.

Fortunately, virtually every lambda expression is a `simple' lambda.

I've re-enabled the code that separates out the different kinds of lambda expressions.

When you fetch a variable, there are a few interesting cases to consider. The most common case is when the variable is bound in the immediately enclosing lambda expression. This is an easy case because there is no way to shadow the variable even if the user has the ability to add bindings. So when we build a lambda expression, we can do a quick code-walk and mark all the variables that are bound in the argument list. I've enabled that case, too.

Tuesday, July 7, 2009

Trauma on the ice

So our skater was spinning and flinging off drops of blood at random angles. We know the position of the stains along the wall, but we want to know the position of the skater on the ice. If we knew where the skater was, we could compute the probability of a blood stain being at a particular offset. (Assume that the probability distribution of the rotation angle θ is uniform.) Let's put a Cartesian coordinate system on this problem:
    | y = 0, left wall
    |
    |
     --------------  
    /              \
    |              |
    *              |
    |              |
    |  8           |
--- |--------------| ------  x = 0  centerline
    *              |
    *              |
    |              |
    |              |
    \              /
     -------------
The skater has left a figure ‘8’ on the ice and the asterisks represent a few of the blood stains.

We'll call the skater's distance from the wall (along the X axis) α and the distance from the centerline β. For any particular point on the wall, the probability of it having a stain is function of α and β.
(define (stain-probability alpha beta)
  (lambda (y-pos)
    (/ alpha
       (* *pi* (+ (square alpha)
                  (square (- y-pos beta)))))))
So if the skater was at (2,3) we'd get these values:
(define demo1 (stain-probability 2 3))
;Value: demo1

(demo1 2)
;Value: .12732395447351627

(demo1 3)
;Value: .15915494309189535

(demo1 4)
;Value: .12732395447351627
We can see that the probability peaks at the position closest to where the skater is. If the skater were further away from the wall, the peak is at the same place, but it is much less distinct.

(define demo2 (stain-probability 5 3))
;Value: demo2

(demo2 2)
;Value: .06121343965072897

(demo2 3)
;Value: .06366197723675814

(demo2 4)
;Value: .06121343965072897
We need to ‘turn around’ this equation. We have the probability of a stain as a function of position; we need the probability of the position as a function of the stain. We can compute α and β separately (notice that the position of the peak doesn't change when we get further away from the wall, only the dispersion changes). So we can use Bayes' theorem to invert our equation: p(β|y,α) ∝ p(y|α,β) · p(β|α). Here, y stands for the proposition that the bloodstain is at a particular offset on the y axis.

Our knowledge of the position of any one blood stain gives us no information about the position of any other, so we can simply multiply the probabilities together: p(Y|α,β) = ∏ p(yᵢ|α,β). Here, Y stands for all the blood stains together, and its probability is simply the probability of each stain multiplied together.

If we substitute this back in, we can compute the likelihood of β: log p(β|Y,α) = constant − ∑ log(α² + (yᵢ − β)²). Our best estimate of β is the β that maximizes the right-hand side. (Or minimizes the second term.)

Are your eyes glazed over? Mine are. I'm paraphrasing from Data Analysis by D. S. Sivia and John Skilling. It seems like a good book, but when I read math books the equations start all looking the same.
Polonius: What do you read, my lord?
Hamlet: Math... math... math...

;; Given α (distance to the wall) and a list of the
;; y-coordinates of the blood stains, compute the second term
;; of the log of p(β)
(define (term2 alpha stains)
  (lambda (beta)
    (sum (lambda (y)
           (log (+ (square alpha)
                   (square (- y beta)))))
         stains)))
The reader can implement sum for fun.
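A minimal sketch, assuming sum takes a function and a list and adds up the results:
(define (sum f items)
  (fold-left (lambda (total item)
               (+ total (f item)))
             0
             items))
So let's look at how this behaves.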
;; Given the position of the skater, return a procedure that generates
;; the y position of the blood stains.  Simulates the problem.
(define (skater alpha beta)
  (lambda ()
    (+ (* (tan (* (random 1.0) *pi*)) alpha)
       beta)))

;; Given a skater and n, return a list of stain positions.
(define (spots skater n)
  (do ((i 0 (+ i 1))
       (spots '() (cons (skater) spots)))
      ((>= i n) spots)))

(define my-spots (spots (skater 1 2) 10))

my-spots
;Value 37: (-.4391382779105797 3.7536812452447545 1.9033520510509812 2.227674346928543 10.229175820018094 8.35132161012094 2.0848923361826426 2.868570305181259 4.926524369307749 -1.1024829548141497)

;; Search for minimum.
(let ((f (term2 1 my-spots)))
  (do ((i -5 (+ i 1)))
      ((>= i 5))
    (display i)
    (display " ")
    (display (f i))
    (newline)))
-5 41.39540398748283
-4 38.485040725146305
-3 35.00661596259841
-2 30.748568382398258
-1 26.161800734404803
0 23.049858617589564
1 20.020797776287406
2 16.5459276648517     ;; note minimum here
3 16.679502775207673
4 19.15407343693434
So we deduce that beta was 2.
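Nothing stops us from re-running the same scan at finer resolution around the coarse minimum. A hypothetical refinement pass:
(let ((f (term2 1 my-spots)))
  (do ((b 1.5 (+ b .05)))
      ((>= b 2.5))
    (display b)
    (display " ")
    (display (f b))
    (newline)))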

I'm not entirely satisfied with this solution. I think I understand it, but I'm not sure I could reproduce it. The reason I was playing with this problem is that I want to automatically fit the linear segment of that data set I was playing with.

An amusing problem

Here's an interesting problem. A skater is performing an upright spin. As she pulls in her arms and is spinning faster and faster, suddenly a shot rings out. Our skater is hit, but she does not fall (but she stops her angular acceleration and now rotates evenly. Go figure.) Droplets of blood are thrown off by centrifugal force and hit the wall of the rink.

Later that week, CSI is called in. They need to determine the exact position on the ice where the skater was. The Zamboni has already been run, so the only clue is the position of the blood stains on the wall. The stains have run, so the vertical measurement is useless. The drops were too small to form ellipsoid shapes. The only measurement is the horizontal position of the blood stains.

So imagine a grid placed on the rink so that one long wall is on the Y axis and the center line is on the X axis. The skater is at position (x,y), and when a drop of blood comes off, it moves in a straight line at a random angle and strikes the wall. (If it goes away from the wall, it hits the ice and the Zamboni erases it.) We can model this:
(define (skater distance-from-wall distance-from-centerline)
  (lambda ()
    (+ (* (tan (* (random 1.0) *pi*)) distance-from-wall)
       distance-from-centerline)))

(define test-skater (skater 3 2)) ;; 3 meters from the wall, 2 above the CL

(test-skater)
;Value: 38.317142690716395

(test-skater)
;Value: .9959730295057432

(test-skater)
;Value: 4.505616575826165

(test-skater)
;Value: -5.781087279218612

(test-skater)
;Value: .08491367951998141
Each time we call test-skater, we generate the position along the wall that a blood spot will be.

How do we find out where the skater is? How about using the average displacement of the spots on the wall to determine one of the axes?
(define (average-x skater count)
  (do ((i 0 (+ i 1))
       (total 0 (+ (skater) total)))
      ((>= i count) (/ total i))))

;; Converges to 2 with enough steps?

(average-x test-skater 100)
;Value: -19.147138944613047

(average-x test-skater 100)
;Value: 1.6702294390626096

(average-x test-skater 100)
;Value: -12.917476155358356
These estimates are terrible. Let's try more data points.
(average-x test-skater 1000)
;Value: 9.664970088915723

(average-x test-skater 1000)
;Value: 1.1197618210432498

(average-x test-skater 1000)
;Value: .6766247404710667

(average-x test-skater 10000)
;Value: -1.335001891411956

(average-x test-skater 100000)
;Value: -1.7106929730299765

(average-x test-skater 1000000)
;Value: -3.076252074037072

(average-x test-skater 1000000)
;Value: 7.717576932627036
The average is simply not converging. This distribution is not Gaussian. The mean of the distribution is pretty much a random meaningless number. In fact, this is one of those distributions that have an undefined variance and standard deviation. Solving this problem will be tricky.

More later...

Monday, July 6, 2009

The long tail

Like my time series, my series of posts about it is gradually diminishing. I started with this raw data:

with more than eight thousand data points, and I can summarize it with just the two parameters of the log-logistic fit, α and β. And that's enough information to reconstruct the curve quite accurately. From these parameters, you can compute the mean and the mode (where the peak is), and where the quantiles are (the 80th, 90th percentile, etc.) Interestingly, the variance is only defined when β > 2. With the variance undefined, the ‘standard deviation’ is also undefined. (So I know that people who quote it are blowing smoke!)

I've been using gnuplot to generate the graphs, but I've also been using its fit command to compute the parameters of the curves. It seems to work, but I have no idea how. I don't want to depend on it. I'd like to write some code that determines the best α and β values from the raw data. Once I have that, I'd like to plot how α and β change under different circumstances: over time, under different loads, etc. Perhaps I can figure out why the curve is the way it is. I'll post more information if I discover more.
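One possibility, sketched under an assumption: if I sample the empirical quantiles as (log-odds . log-value) pairs, the logile derivation above says that log q = log α + lo/β is a line, so the least-squares routine from the other post recovers both parameters. (fit-log-logistic is a hypothetical helper, not something gnuplot provides.)
(define (fit-log-logistic points receiver)
  ;; points are (log-odds . log-value) pairs;
  ;; intercept = log(alpha), slope = 1/beta
  (least-squares points
                 (lambda (slope intercept reg-coeff s)
                   (receiver (exp intercept)  ; alpha
                             (/ 1 slope)))))  ; beta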

I'd be interested in seeing if other people find it useful to analyze time series in this way. Let me know.

Sunday, July 5, 2009

10 billion evaluations

While bootstrapping my Scheme interpreter, I ran 10 billion evaluations using unoptimized code. The breakdown of SCode types was this:

A bit of a stretch

Using the cumulative distribution makes it a bit easier to fit the distribution, but where before the curve asymptotically approached 0, it now asymptotically approaches 1. It's still hard to see how well the curve fits when it and the data are very close to the asymptote. To fix this, we want to apply a stretching function to the graph. A popular stretching function is the inverse error function. The reason is obvious:

With inverse erf stretching, a cumulative normal distribution becomes a straight line. (Stretch a cumulative distribution with its own inverse and you get back a straight line; inverse erf is, up to scaling, the inverse of the cumulative normal.)

It's pretty clear that my data set does not follow a lognormal distribution. Although the fit looked pretty good on the original histogram, it is simply not a straight line. However, the low end of the curve looks pretty straight.

This next plot is a Pareto distribution plotted with inverse erf stretching:

The Pareto distribution wasn't too bad a fit, but still, we can see it isn't a good match when we look at it here.

Another stretching function to consider is the inverse logistic function.

Now this is interesting. A good chunk of the curve is now a straight line. It seems that a log-logistic distribution is a really good model for the data (at least for the long tail). Let's see how it looks on the original histogram.

In this graph I plotted the lognormal distribution that fit the low-end of the plot and the log-logistic distribution that fit the high-end.

So what is a log-logistic distribution? It seems to be common in sociology and biology. The survival curve after a kidney transplant follows a log-logistic distribution. It also seems to show up in insurance risk models. It is used in hydrology to model precipitation rates. This is weird. I can't see why this distribution would arise in what I'm measuring.

The log-logistic distribution has two tuning parameters, α and β. α is the median of the curve and β determines the amount of spread. In the stretched cumulative plot, α determines where the line intersects the 50th percentile and β determines the slope of the line.

Having fits

Choosing a model for the data is hard. On the one hand, you want a model that is reasonably faithful to the data. A model that is too different from your data is not going to be too useful for prediction. On the other hand, you want a model simple enough to reason about. A model that has dozens of tuning parameters can be very accurate, but it won't be easy to understand. It's a tradeoff. If the model has a basis in some physical rationale — for example, an exponential model is naturally expected from a radioactive sample — it may offer an insight as to the physics behind what you are measuring. But even if the model is simply an easily described curve with no physical basis, it can still be useful.

A one or two parameter model is what I'm looking for. There are a lot of one and two parameter models that look like my data. The main characteristic of most of these distributions is that they asymptotically approach zero over a long time.

Gamma:

Log-normal:

Pareto:

Weibull:

Eyeballing these is hard, too. So we need a couple more tricks. My favorite trick of log scale doesn't work too well. It does show the tail of the curve nicely, but it also magnifies the variation. Since the data are sparse out at the tail, this makes the variation that much bigger.


Instead, I'm going to integrate over the distribution and normalize the values so the curve goes from 0 to 1.


Now this graph is obviously bad. The curve is squeezed in near the edges so you can't see it, and the asymptotes are so close you couldn't tell if something fit or not. We'll fix this in a sec. But take a look here and you'll see a number of graphs that display the data in this awful way.

Now I'm going to use my log scale trick.


This is quite a bit better. Now we can see important things like the median (at the .5 mark) and the 90th percentile (at the .9 mark). There is another benefit we got. The data in this graph is the unsmoothed data. If we zoom in on a small part of the graph, we can see that the ‘hair’ has turned into tiny stairsteps.

Although the hair was tall, it was very narrow, so it doesn't contribute much to the integral. (So my attempt to shave the hair off the data was pretty much a waste of time. Oh well.)

But the point of doing this was to make it easier to fit the models. The models usually have an integral form (cumulative distribution) so we can try them out. But since we're using the integral form, the errors add up and we can more easily see which distributions have a better fit.

Cumulative Cauchy:

Cumulative log Pareto:

Cumulative lognormal:

Cumulative log Poisson:

There's one more thing to do....

Friday, July 3, 2009

A little Scheme code

Patrick Stein commented on the way that I defined my DTFT function:
(define (dtft samples)
  (lambda (omega)
    (sum 0 (vector-length samples)
         (lambda (n)
           (* (vector-ref samples n)
              (make-polar 1 (* omega n)))))))
Let me tweak it just a bit. The idea is that we have our original function which may or may not be continuous and we sample it at regular intervals.
(define (dtft original sample-interval sample-count)
  (lambda (omega)
    (sum 0 sample-count
         (lambda (n)
           (* (original (* sample-interval n))
              (make-polar 1 (* omega n)))))))
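This assumes a sum helper where (sum low high f) adds up (f i) for i from low (inclusive) to high (exclusive). A minimal sketch:
(define (sum low high f)
  (do ((i low (+ i 1))
       (total 0 (+ total (f i))))
      ((>= i high) total)))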
Now we need to turn our vector of samples into a function.
;; Not robust.  Assumes n is a fixnum in the right range.
(define (vector->function sample-vector)
  (lambda (n) (vector-ref sample-vector n)))
I take the original function and return a new function with a single parameter, omega, which is the transformed version. The original function only needs to be defined at the sample points in the range from 0 to sample-count. I'm treating my vector of samples as a function from integer->real (the samples are in fact integers, but they could be floats).

The function I return is a continuous function from real->complex. It is defined over all the reals, but since omega is an angle, the value computed is going to be repeated every time omega sweeps out an entire circle, so we should limit our interest to the range of 0 to 2π or -π to π.

It turns out to be somewhat computationally expensive to evaluate the transformed function. It isn't very expensive, but with eight thousand samples, we perform several tens of thousands of floating point operations per evaluation. If we memoize the function, we can save a huge amount of time if we ever re-evaluate with the same omega. (And we will.)
(define (memoize f)
  (let ((table (make-eqv-hash-table))) ;; eqv works on floats
    (lambda (arg)
      (or (hash-table/get table arg #f)
          (let ((value (f arg)))
            (hash-table/put! table arg value)
            value)))))

(define ym
  (memoize
   (dtft (vector->function *data*) 1 8192)))
The inverse transform is fairly simple:
(define (inverse-dtft y domega)
  (lambda (n)
    (define (f omega)
      (* (y omega)
         (make-polar 1 (* omega (- n)))))

    (/ (real-part (integrate f (- *pi*) *pi* domega))
       (* *pi* 2))))
The integral should return a complex number whose imaginary part cancels out to zero, but since the integration is a numeric approximation it doesn't quite cancel, so I just lop off the imaginary part.

We need a definition of integrate. I got this simple trapezoid integration from Sussman and Abelson:
(define (trapezoid f a b h)
  (let* ((n (/ 1 h)) 
         (dx (/ (- b a) n)))
    (define (iter j sum)
      (if (>= j n) 
          sum
          (let ((x (+ a (* j dx))))
            (iter (+ j 1) (+ sum (f x))))))
    (* dx (iter 1 (+ (/ (f a) 2) (/ (f b) 2))))))

(define integrate trapezoid)
It converges slowly.
(do ((i 1 (+ i 1))
     (d-omega 1.0 (/ d-omega 2)))
    ((>= i 13))
  (display i)
  (display " ")
  (display d-omega)
  (display " ")
  (display ((inverse-dtft ym d-omega) 250))
  (newline))

1 1. -161.
2 .5 32215.
3 .25 16091.999999999949
4 .125 8074.
5 .0625 4074.000000000024
6 .03125 2048.0000000000246
7 .015625 1050.0000000000236
8 .0078125 571.9999999999932
9 .00390625 336.9999999999946
10 .001953125 235.99999999999997
11 .0009765625 184.9999999999997
12 .00048828125 160.99999999999937
13 .000244140625 153.00000000000196
But once I have it computed for one value of x, the others come fast. The actual value of the integral at this point is 150, but 153 is close enough for me.

Thursday, July 2, 2009

Too much information

I won't keep you in suspense. Here is my time series after removing the hair:

I sliced up the curve into approximate 15 ms segments. I offset the segments so that the spike would be about in the middle, then I simply averaged each segment independently and pasted the result back together. This does remove details smaller than about 15 ms, but I don't need that level of detail.

Actually the whole point of what I'm doing is to remove detail. The original time series had way too much. I want to remove as much detail as possible, but keep just enough so I can easily characterize this data set and compare it with others. At one end of the detail scale I have the original data set of 8192 values. The other end of the scale has no data whatsoever (trivial, but very parsimonious). A number at least allows some basis of comparison. But which number? If I told you that the average value was about 1572 ms, would that be useful? Well, that depends. You might be misled to expect that if you were to try the experiment right now it would take around 1572 ms, more or less. You'd be wrong. More than 65% — almost two thirds — of the samples are below the average. The average is the total amount of time to gather all the samples divided by the number of samples gathered. This tells you a lot about the rate at which you can expect to gather samples in bulk, but not very much about the time value of an individual sample.

We need a parameterized model for our data set. The model will tell us qualitatively what the data set is like, the parameters will give us the quantitative information we need to compare a particular data set against others. If you don't know what the model is, the numbers are meaningless! This is no exaggeration. In this chart, there are two distributions. Both distributions have the same average.

Whenever I see a measure of “average latency” a little alarm bell goes off in my head. It's a good indication that the stated value is a meaningless number and that the person who measured it doesn't know what he is measuring. A louder alarm bell goes off when I see a measure of “standard deviation”. Informally, the standard deviation is a measure of how widely the data is distributed. But the way you calculate the standard deviation depends on the model. (The relevance of the standard deviation also depends on the model. If your model is logarithmic, as most time-based models are, you would be better off computing the ‘geometric standard deviation’.) If the model is not specified, it is likely that the reported ‘standard deviation’ was simply calculated with a Gaussian model. Some models don't even have a standard deviation — it's simply not defined.

When I see a measure of standard deviation, it's a good indication that the person who measured it not only doesn't know what he is measuring, but also that he is using a tool that he doesn't understand.
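For the record, here is how I'd compute the ‘geometric standard deviation’ mentioned above. A sketch, assuming a list of positive samples; everything happens in log space:
(define (geometric-statistics samples)
  ;; The geometric mean is the exponential of the mean of the logs;
  ;; the geometric standard deviation is the exponential of the
  ;; standard deviation of the logs.
  (let* ((logs (map log samples))
         (n (length logs))
         (mu (/ (fold-left + 0 logs) n))
         (variance (/ (fold-left (lambda (total l)
                                   (+ total (square (- l mu))))
                                 0
                                 logs)
                      n)))
    (list (exp mu) (exp (sqrt variance)))))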


In order to figure out how to characterize my data set, I need to find a good model for it. This can be really hard. There are hundreds of models. Some of them have enough tuning parameters that you can make them fit anything. What I'm looking for is a model that is simple, has as few parameters as possible, and fits the data reasonably well. (The reason I shaved the hair off the data is so I could visually compare the fit of a few different models.) There are a number of ways to search for models, but it often comes down to trial and error. There are a couple of tricks, though.

The first trick is to see how the data looks in log space. It is frequently the case that you are working with some ‘scale-free’ quantity. You don't care about the absolute time, you care about the relative improvement. When you discuss data in terms of ‘percent’ or ‘factor of two’ and such, you are likely to want to work in log space. Plotting my shaved data set in log space, it starts to look like a bell-shaped curve. This suggests that a log-normal distribution might be a good model.
And now I can state quantitatively that under a log-normal model, my data set has μ = 6.97 and σ = .90. With just these two numbers, you can approximate the mean (the true average is 1572, the average of the model is 1593), the mode (472), and any other quantity of the curve (for example, the geometric standard deviation, which is 2.46).
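For a log-normal model those summary numbers come straight from the two parameters. A sketch; the values land close to the figures quoted above:
(define (lognormal-mean mu sigma)
  (exp (+ mu (/ (square sigma) 2))))

(define (lognormal-mode mu sigma)
  (exp (- mu (square sigma))))

(lognormal-mean 6.97 .90)  ;; about 1596
(lognormal-mode 6.97 .90)  ;; about 473
(exp .90)                  ;; about 2.46, the geometric standard deviation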

But the log-normal curve doesn't quite fit. It's close, but it overestimates near the mean and underestimates the long tail. I've been looking for a better model.

Next time.... trying different models.

Wednesday, July 1, 2009

Sometimes you have one of those “Aha!” moments. This isn't one of those times. This is closer to a “D'oh!” moment.

The various filters I tried on my data set were really disappointing. To try to figure out what was going on, I made a fake data set that had spikes every 15.625 milliseconds, but was zero everywhere else. If you perform a Fourier transform on this, you'll see the problem immediately. If my signal was simply a 64 Hz sine wave, I'd have a single point on the Fourier transform. If it were a square wave, I'd have the primary frequency and the odd harmonics. With a pulse, there are harmonics everywhere.

But since I know that the spikes come every 15.625 milliseconds, and I know the phase, it occurs to me that I can simply erase those samples. No filters, no transforms, just a little rounding. In fact, if I walk through the data in 15.625 millisecond chunks, but make my first step exactly half that size, every chunk will have exactly one spike (although it might contribute to more than one sample within the chunk).

If I call this a “learning opportunity” maybe I'll feel better about this.

Going backwards

First a “Thank you!” to all that have pointed me towards notch filters. I expect that I'll end up using some sort of notch filter.

Part of the reason I'm doing this the hard way is to understand what I really want to do. I know the trail has already been blazed by Fourier, Laplace, Butterworth and Chebyshev, but I want to find out why this direction is the easiest and best way to get where I want to be. So back to the code.

Having found that enormous spike at 64 Hz, I want to separate my data into two components: the 64 Hz component and the remainder. I had a clever idea: what if I simply trim that spike out of the Fourier transform and then do an inverse Fourier transform to get the original data, but without the spike? There are a couple of problems with this. The first is that the inverse Fourier transform is harder to compute. Here is the simple code:
(define (inverse-dtft x)
  (lambda (n domega)
    (define (f omega)
      (* (x omega)
         (make-polar 1 (* omega (- n)))))

    (/ (integrate f (- *pi*) *pi* domega)
       (* *pi* 2))))
In this code, x is the Fourier transform. To find the value of element n in the data vector, you have to integrate. So n is the index and domega is the step size for the integration. It turns out that you need really small steps. The integral does not converge quickly.

It's a bit harder to understand why this works. The Fourier transform was easy to explain with the rotating arrows and such. This integral has a rotating arrow, but it rotates in the opposite direction. (I got it wrong at first. It doesn't work that way.) Think of the Fourier transform as a water sprinkler spraying your data across the lawn. The rotational speed of the sprinkler determines what data goes where, and if it matches a frequency component in your data, you get an uneven watering pattern. Now the inverse Fourier transform is you with a bucket running like mad trying to catch the water. That's a bit harder to do.

So with the integral converging so slowly, I did a few tricks. First, I memoized the computation of the transform. This takes a pile of memory, but the speedup is worth it. Second, I used Romberg integration. Nonetheless, I found that it still takes quite a while to invert the transform.

I mentioned a couple of problems. Besides it being slow, it doesn't work. After trimming out the spike at 64 Hz I found that the curve still has its hair. Apparently the harmonics carry a fair amount of weight. I tried trimming a couple. Very little effect on the hair, but I was starting to munge up the other data. I tried a different trick.
(define (y1 x)
  ;; Squash all frequencies above .1 (radians per second) 
  (if (> x .1)
      0
      (y x)))
Since the hair is at about .4, I figured this ought to do it.

It doesn't. I have to admit that I'm a bit surprised at this. I figured it would be trivial to trim the hair if I knew what frequency it was at and simply clipped it out. Sure, a notch filter would be far more efficient to compute, but isn't the concept right? I guess it isn't.

Well, then, back to basics.