Tuesday, July 14, 2009

Two little things

The curve I was looking at is close to linear when I plot it on a log-logistic scale. I'm trying to find the slope and intercept of the linear part of the curve. I wrote this to compute the least-squares line of a set of data points:
(define (least-squares points receiver)
  (define (iter tail count sigma-x sigma-y sigma-xx sigma-yy sigma-xy)
    (cond ((pair? tail)
           (let* ((point (car tail))
                  (x (car point))
                  (y (cdr point)))
             (iter (cdr tail)
                   (fix:+ count 1)
                   (+ sigma-x x)
                   (+ sigma-y y)
                   (+ sigma-xx (square x))
                   (+ sigma-yy (square y))
                   (+ sigma-xy (* x y)))))
          ((null? tail)
           (let ((xbar (/ sigma-x count))
                 (ybar (/ sigma-y count)))
             (let ((ssxx (- sigma-xx (* count (square xbar))))
                   (ssyy (- sigma-yy (* count (square ybar))))
                   (ssxy (- sigma-xy (* count xbar ybar))))
               (let* ((slope (/ ssxy ssxx))
                      (intercept (- ybar (* slope xbar)))
                      (reg-coeff (/ (square ssxy) (* ssxx ssyy)))
                      (s         (sqrt (/ (- ssyy (* slope ssxy))
                                          (- count 2)))))
                 (receiver slope intercept reg-coeff s)))))
           (error "Improper data."))))
  (iter points 0 0 0 0 0 0))
But now my problem is defining what a `linear region' is. At one extreme, the entire graph is linear if your tolerances are wide enough. At the other extreme, the part of the graph between any two points is linear. I've been experimenting with trying to divide the standard deviation by the length of the segment. A long, straight segment will give an unusually low number. Unfortunately, this measure of `linear segmentedness' isn't vey well behaved. You can find a good long sloppy match and then look for a better, shorted, more exact match and discover that the two are disjoint. An exhaustive search of all possible segment lengths over all possible positions is too expensive.

As you can see from the pie chart on my earlier post, I've been playing with my Scheme interpreter some more. I have all the optimizations disabled at this point because they don't play well with the notion of saving and restoring the world. (You can save and restore, but if you change the configuration options in-between, things get unhappy.) As you can see, variable lookup is the biggest part of the pie. In order to run fast, variable lookup has to be fast. This means optimizing the way environment structures are built, which in turn means analyzing the lambda expressions that bind variables. I've divided my lambda expressions into different types. A `standard' lambda is the basic, does-everything, lambda expression. It has to create a `standard closures' that is closed over a `standard environment'. A `standard environment' is one that has to support incremental definition and reassignment of variables. Looking up a variable in a standard environment involves checking the argument list of the lambda that bound the environment and checking the auxiliary table for any bindings that were incrementally added after the environment was created. (For instance, if someone did an eval of a define expression in that environment.) Fortunately, only the top-level environments typically have additional bindings like this. Unless a call is made to the-environment, you cannot easily get your hands on the environment structure. Therefore, we can define an environment type called a `static environment'. A static environment has no way of adding additional bindings. It is called `static' because the location of the variables can be statically determined. (In a standard environment, someone could come in and shadow an existing variable by evaluating a definition.) A static environment supports mutation of variables, however. Every time you fetch a value from a static environment you have to check that the variable has a value. (There is a special `non-assigned' marker.)

Finally, there are `simple' environments. A quick code walk at the time the lambda is constructed can tell us if any calls to the-environment exist or whether any of the variables are the target of a set!. If there are no assignments or calls to the-environment, then we know that whatever value is bound to a variable never changes. We also know it cannot be the `non-assigned' marker. In this case, we can bypass some of the machinery involved in searching for the variable and checking if it is bound.

Fortunately, virtually every lambda expression is a `simple' lambda.

I've re-enabled the code that separates out the different kinds of lambda expressions.

When you fetch a variable, there are a few interesting cases to consider. The most common case is when the variable is bound in the immediately enclosing lambda expression. This is an easy case because there is no way to shadow the variable even if the user has the ability to add bindings. So when we build a lambda expression, we can do a quick code-walk and mark all the variables that are bound in the argument list. I've enabled that case, too.