Tuesday, April 28, 2009

You knew I'd say something (Part II)

Part II
van Rossum writes:
Is it really such a big deal to rewrite your function to use a loop? (After all TRE only addresses recursion that can easily be replaced by a loop. :-)
If this were true, I'd be far, far less concerned about the issue. Sure, a lot of tail recursion really is just looping, and it isn't such a big deal to use a loop. In fact, a loop may be a better construct in many situations. This particular use of tail recursion is not the interesting case.

It is difficult to find a compelling example of code that demonstrates why interprocedural tail recursion is useful. You won't find many examples taken from languages that don't support tail recursion; they are either too small to cause a problem, or the author used some sort of workaround to avoid the space issues. Examples taken from languages that do support tail recursion will inevitably be dismissed as unrealistic and artificial, or particular to Scheme, or simply irrational (‘No real program would ever be written that way. Maybe a Scheme programmer would think in such a way, but...’)

The difficulty is in finding a program that is big enough to illustrate the problem, small enough to post in a blog, easy enough to understand, but complex enough to suggest what problems would occur in the real world.

Continuation passing style needs proper tail recursion to be useful on anything but small problems, but it is a very complex technique which makes it hard to illustrate succinctly. Several people have suggested implementing a state machine through tail calls. I like this technique, but there are so many ways of implementing state machines that someone is bound to claim that the problem is artificial.

Here's an example that I think will be persuasive.

Consider the Visitor Pattern. This is a technique that allows you to abstract away the details of a data structure and the method of traversing it from the algorithm that operates on the data. It is commonly found in object-oriented languages like Java.

In this Python code we implement a binary tree and we provide a means to traverse it with the Visitor Pattern. To make things a bit more interesting, we also implement a DelegateNode class that acts like a Node by delegating the method calls. In this example we pass an accumulator variable with the visitor. We can use the accumulator to collect information about the tree.
import math, random, sys, traceback

class Node:
  def __init__(self, left, right):
      self.left = left
      self.right = right

  def accept(self, visitor, accum):
    return visitor.visitNode (self.left, self.right, accum)

class Leaf:
  def __init__(self, value):
    self.value = value

  def accept(self, visitor, accum):
    return visitor.visitLeaf(self.value, accum)

class DelegateNode(Node):
  def __init__(self, originalNode):
    self.originalNode = originalNode

  def accept(self, visitor, accum):
    return self.originalNode.accept(visitor, accum)

# Generate a tree for us to visit.
def GenerateTree(n):
  if n <= 0:
    return Leaf (1)
    return DelegateNode (Node(GenerateTree(int(math.sqrt(float(n))) - 1),
                              GenerateTree(n - 1)))
Now remember that this is a toy example. A real example would have several kinds of nodes and the nodes would have some interesting functionality and we'd do something more than just traverse them. The important point is this is a simplified version of some code we might actually encounter in a real-world problem.

So let's traverse the tree and count the different kinds of nodes. First we need a Visitor that can do that:
class TreeCountingVisitor:
  def visitNode (self, left, right, accum):
    return right.accept (self, left.accept (self, [accum[0] + 1, accum[1]]))

  def visitLeaf (self, value, accum):
    return [accum[0], accum[1] + 1]
The accumulator is a two-tuple. The first element is the number of nodes, the second element is the number of leaves. When we visit a leaf, we return a new tuple with the leaf count bumped by 1. When we visit a node, we bump the node count by 1 before traversing the left branch. Since the visitor returns an updated accumulator, we pass that in when we traverse the right branch.

And our main function to try this out. We'll start with a small tree and see how it scales.
if __name__=="__main__":

  for i in range(25,1000,25):
    print "Generating tree " + str(i)
    testTree = GenerateTree(i)
    print "Testing"
    total = testTree.accept(TreeCountingVisitor(), [0,0])
    print "Total is " + str(total)
On my machine, this runs out of stack at about 350.

Can the tail recursion in this be easily replaced by a loop? It depends on what you mean by easy. Certainly we could write a looping program that traverses the tree without using the Visitor pattern, and it wouldn't take long at all. But in a real-world situation, this may not be so easy. If the Visitor pattern is the ‘advertised’ API, we'd be deliberately bypassing an important abstraction. Another developer could change a node or add a new node type and provide the appropriate Visitor API, but if we weren't using the Visitor pattern, we'd be out of luck.

Assuming we want to organize our code around the Visitor pattern, it becomes much harder to replace the recursion with a loop.

Part III in a day or two...

Saturday, April 25, 2009

You knew I'd say something.

In a recent blog post, Guido van Rossum gave his rationale for not wanting tail recursion in Python. From the nature of his arguments, it seems he has several misconceptions about tail recursion. I have to admit spending a bit of time thinking up snarky remarks, but I found that some friends of mine — programmers that I respect a great deal — shared some of these misconceptions. I already know I'm not going to change van Rossum's mind, and I'm preaching to the choir for most of the people who read this, but maybe I can clear up these misconceptions for one or two people.

Will Clinger's paper Proper tail recursion and space efficiency in the Proceedings of the 1998 ACM Conference on Programming Language Design and Implementation, June 1998, pages 174-185, is the best reference I've found on the topic.

First of all, let's be clear about what ‘proper tail recursion’ is. Clinger defines it this way:
The essence of proper tail recursion is that a procedure can return by performing a tail call to any procedure, incuding itself.
It is important to recognize that we're not just talking about loops written in tail-recursive style, but any procedure that (tail) calls another in order to compute the return value.

It is a common misconception that tail recursion is primarily a syntactic sleight-of-hand that allows a programmer to use a function call rather than a traditional iteration construct. A Scheme programmer might write this:
(define (factorial x)
  (define (fact-iter n answer)
    (if (> n x)
        (fact-iter (+ n 1) (* answer n))))
  (fact-iter 1 1))

Where a more traditional formulation is this:
int factorial (int x)
    int answer = 1;
    for (int n = 1; n <= x; n++)
      answer *= n;
    return answer;

The Scheme programmer relies on the compiler to recognize that the recursive call to fact-iter is intended to be a loop. He expects that the code will compile to something very close to the traditional formulation. The MIT-Scheme compiler generates this code:
;; Note: some instructions have been omitted for clarity
;; (tagging, interrupt polling, etc.)
        ;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
        (mov w (r 0) (@r 4))
        ;; (assign (register #x12) (offset (register 4) (machine-constant 2)))
        (mov w (r 1) (@ro b 4 8))
        ;; (fixnum-pred-2-args greater-than-fixnum? (register #x11) (register #x13))
        (cmp w (r 0) (r 1))
        (jg (@pcr label-6))

        ;; (assign (register #x15) (offset (register 4) (machine-constant 1)))
        (mov w (r 1) (@ro b 4 4))

        ;; (assign (register #x19) (fixnum-2-args multiply-fixnum (register #x16) (register #x11) #f))
        (imul w (r 1) (r 0))

        ;; (assign (offset (register 4) (machine-constant 1)) (register #x14))
        (mov w (@ro b 4 4) (r 1))
        ;; (assign (register #x1d) (fixnum-1-arg one-plus-fixnum (register #x11) #f))
        (add w (r 0) (& #x5a))

        ;; (assign (offset (register 4) (machine-constant 0)) (register #x1a))
        (mov w (@r 4) (r 0))
        ;; (invocation:jump 2 #f fact-iter-3)
        (jmp (@pcr fact-iter-3))

As you can see, the compiler dutifully turned it into a loop.

Not everyone is aware that this works across procedure boundaries.
(define (even? x)
  (if (= x 0)
      (odd? (- x 1))))

(define (odd? x)
  (even? (- x 1)))
Yeah, I blew it here by forgetting to test for
a base case.  But the conditional wasn't the interesting part anyway.

;; Note: some instructions have been omitted for clarity
        ;; (assign (register #x10) (offset (register 4) (machine-constant 0)))
        (mov w (r 0) (@r 4))
        ;; (eq-test (register #x10) (constant 0))
        (cmp w (r 0) (&u #x68000000))
        (je (@pcr label-4))
        ;; (lap-opt fixnum-add-const-in-place)
        ;; (assign (register #x14) (object->fixnum (register #x10)))
        ;; (assign (register #x15) (fixnum-1-arg minus-one-plus-fixnum (register #x14) #f))
        ;; (assign (register #x12) (fixnum->object (register #x15)))
        (add w (r 0) (& #x3ffffff))
        (and w (r 0) (& #x6bffffff))
        ;; (assign (offset (register 4) (machine-constant 0)) (register #x12))
        (mov w (@r 4) (r 0))
        ;; (invocation:uuo-link 2 #f odd?)
        (jmp odd?-1)
        ;; (assign (offset (register 6) (machine-constant 2)) (constant #t))
        (mov w (@ro b 6 8) (&u #x20000000))
        ;; (pop-return)

        ;; (assign (register #x11) (offset (register 4) (machine-constant 0)))
        (mov w (r 0) (@r 4))
        ;; (lap-opt fixnum-add-const-in-place)
        ;; (assign (register #x12) (object->fixnum (register #x11)))
        ;; (assign (register #x13) (fixnum-1-arg minus-one-plus-fixnum (register #x12) #f))
        ;; (assign (register #x10) (fixnum->object (register #x13)))
        (add w (r 0) (& #x3ffffff))
        (and w (r 0) (& #x6bffffff))
        ;; (assign (offset (register 4) (machine-constant 0)) (register #x10))
        (mov w (@r 4) (r 0))
        ;; (invocation:uuo-link 2 #f even?)
        (jmp even?-2)
And people have pointed out that this may be surprising and baffling to someone who expects to see the history of the computation laid out on the stack.

Scheme programmers make a big deal out of this. They point out that with proper tail recursion, Scheme doesn't need looping constructs like while or for or do because the user can write them himself. But Scheme hackers are a fringe group of the archetypical fringe group: Lisp hackers. They enjoy removing features from the language that they consider ‘unnecessary’. Then they write papers that somehow ‘prove’ that this is better. Of course these academic papers contain an awful lot of math and not very many actual programs.

In fact, Scheme programmers will try to sneak tail recusion into other programming languages. There's hardly a point to doing so because other languages have loops already. But now you get programs written by wet-behind-the-ears neophytes that use tail recursion rather than something more obvious. Furthemore, you cannot debug this code because the tail recursion has erased the backtrace.

And what does this buy you? In theory, when you exit a procedure with a tail-recursive call, you can replace the call/return instruction pair with a jump. In theory, that's not much of an improvement. In practice, however, it isn't quite that simple. The stack usually has to be adjusted one way or the other and this completely wipes out any benefit to be seen from avoiding the push and pop of the return address.

To summarize, this point of view about tail recursion is based on these ideas:
  • Its purpose is to avoid writing looping constructs, which are somehow considered ‘bad’ by ‘purists’. These weirdos ‘think’ in loops, then transform the code to be ‘recursive with tail calls’ when they write it, and then expect the compiler transform it back. This is academic mental gymnastics with no practical purpose.
  • People who like tail recursion are purposefully burdening their code in the name of ‘mathematical’ or ‘theoretical’ purity, but they expect the language implementor to bear the weight.
  • Any code that depends on tail recursion can easily be rewritten to use a looping construct instead. Such a rewrite is a trivial local transformation.
  • Inter-procedural tail recursion may have some performance benefit, but such benefit is very small at best, and the tradeoff is the loss of valuable debugging information.
van Rossum's blog post almost screams each of these points.

If these were in fact the key issues to be argued, then there would be little more to say beyond de gustibus non est disputandum. But these arguments are a sideshow. The true argument about tail recursion is that it reduces the space complexity class of a program and allows a program access to a more powerful computation model. This argument is subtle.

Let me digress for a moment. Remember the Bubble Sort algorithm? If I were to write some code that used Bubble Sort to maintain the order of elements in a set, I would be lucky if I only got laughed at. The Hacker's Jargon file calls Bubble Sort `the generic bad algorithm'.

What makes Bubble Sort so bad? It is extremely inefficient. On the average, Bubble Sort performs O(n^2) operations to sort a set of n elements. The Big-O notation is the asymptotic time complexity of the algorithm. O(n^2) means that when the value of n is large enough, the number of operations grows proportional to the square of the number of elements. So if we were to Bubble Sort a set of one thousand objects, we might expect the number of operations to be in the millions. If our set grew to ten thousand objects, the amount of time to sort it would grow by a hundredfold.

Let's compare this to mergesort. Mergesort has an asymptotic time complexity of O(n log n). A mergesort of one thousand objects would take something like 1000 * 4 operations. This is much less than the millions required for a Bubble Sort of the same set. If we grew the set to ten thousand items, the amount of time to sort it would only grow by a factor of a little more than ten. Certainly far less than the hundredfold growth we see in Bubble Sort.

Bubble Sort is bad because it wastes a valuable resource: time. And when you let the problem grow larger, it wastes a disproportionately larger amount of time compared to other, more efficient algorithms.

When you determine a strategy to solve a problem, you take a close look at the complexity classes of your algorithms. You'd like a solution within a reasonable amount of time, and you'd like to avoid using an unreasonable amount of space (whether memory or disk). Picking an inefficient algorithm for some minor part of your program can have a major impact on the resource usage of your program. Even the small change from O(n^2) to O(n log n) makes a huge difference.

Asymptotic time complexity is not the only measure of an algorithms efficiency. Space (memory usage) can be as much of a limiting factor as time, so asymptotic space complexity should be considered. Consider the diff program that compares two files. The naive implementation of diff using dynamic programming requires a rectangular tableau with a size that is the product of the length of the two files being compared. Diffing a pair of fairly large files in this way will exhaust memory. A tradeoff can be made where multiple passes are needed to complete the diff, but the space complexity, rather than being O(n*m) is O(n + m). It takes longer, but now truly enormous files can be diffed without running out of memory.

So what does this have to do with tail recursion?

A program describes a process. An implementation takes a program and performs the process the program describes. The process that emerges has a space and time complexity that is characteristic of both the implementation and the program.

Programmers usually expect to be able to ignore the implementation's contribution to the time and space complexity of a problem. A programmer would be surprised if he had coded up a mergesort with the expected O(n log n) asymptotic time complexity but found that the actual asymptotic time complexity on his computer was O(n^2). He would be understandably upset if an obviously linear text search required O(n^2) asymptotic space complexity in the size of the corpus.

To quote Clinger:
Proper tail recursion yields a well-defined asymptotic space complexity that can be relied upon by portable programs.
Among these [space consumption] properties of an implementation, proper tail recursion is especially important because it is easy to construct programs that should run in small constant space, but are likely to require at least linear space in implementations that are not properly tail recursive.

That is to say, if your implementation is not properly tail recursive, you will often find that a program that ought to run in O(1) space is actually taking O(n) space or even worse.

How could this be?

Consider this very simple program:
(define (foo x)
  (baz (bar x)))
In a non tail recursive implementation, the call to baz would return back to foo, which would then return to foo's caller. In a tail recursive implementation, foo would directly jump to baz and baz would return directly to foo's caller. The difference is that foo releases its resources before transferring to baz in the tail recursive case, and after baz's computation in the non tail recursive case. There are two resources of interest. The obvious one is the stack frame itself. The less obvious one is the reference to the value of X.

This is a very important. Code like the above is typical where one is performing a computation that involves part of a data structure. The call to bar traverses X to find the relevant sub-structure to operate upon. Once this substructure is located, it is no longer necessary to retain the enclosing structure X. The tail recursive implementation immediately discards the reference to X and allows the garbage collector to reclaim the resources. The non tail recursive implementation unnecessarily retains X during the computation of baz. This is not simply an inconvenience. If X holds substantial resources, we may run out during baz. If baz somehow depends on a recursive call to foo, we'll find our space complexity is not a function of the relevant information sitting in the bar portion of X, but a function of X itself, the container of our information. If we write a program that processes text sentence by sentence, we'd be surprised to find that the implementation uses space proportional to the entire document!

If one objects to Bubble Sort on the grounds of its poor asymptotic time complexity compared to better algorithms, then one clearly should object to a lack of tail recursion. It has even worse asymptotic space complexity, and it acts upon the entire program. (At least Bubble Sort doesn't slow down your program if you don't sort anything.)

Part II on its way....

(By the way, if you have questions or need some clarification on what I've said so far, please comment or email me.)

Tuesday, April 14, 2009

The TI Explorer was a clone of the LMI Lambda. At some point when LMI needed financing, TI and LMI did a technology swap where TI ended up with the Lambda architecture and LMI ended up with some cash and the NuBus. The Explorer was a close enough clone of the Lambda that it could run the same microcode.

Ever run into one of those bugs where you begin to doubt your basic assumptions? I hit one of those on the Explorer. The symptom was simply this: about 15 minutes after booting, the Explorer would crash. The proximate cause of the crash was a bad entry in the page tables. The Lisp machine really had an operating system hiding within it, and there's a virtual memory system with demand paging. There was a table that kept track of what the physical memory was being used for. If physical memory was needed and wasn't available, an existing physical page would be written back to the hard disk. This was when the bug was detected. A check on the page status showed it to be in an undefined state, so the machine halted.

In theory, there was no way for a page to transition to an undefined state, so something really bad must have happened. Fortunately, it was always the same page, so it was clearly a programming logic bug. It was also fortunate that it was in the page fault handler. The Lambda and Explorer microcode is completely single threaded, and interrupts are polled for in known locations. Since the page fault handler would play with the memory mapping hardware, it only polled for interrupts at the very few places that the hardware was in a known, good state. I could pretty much ignore the interrupts for debugging this one.

I put in a couple of quick checks to halt the machine when it wrote the page table entry and rebooted the machine. The machine ran for a while and then halted at my debug break. I single stepped the microcode (from the other processor), and watched it write the page table entry. It read the bits, modified the correct subfield and wrote the bits back correctly. Hmmm. Must be a different write. I let the machine run, but it didn't touch that entry again. The machine halted at the sanity check.

This was weird. The machine correctly modified only the field it was supposed to, yet it crashed. Perhaps the word was smashed after the modification. I added some code to check the value of the word at crash time. When I ran it again, I found that the word contained exactly what was written at my breakpoint.

I had to re-evaluate my assumption that the field being modified had something to do with the crash. Since that page had not previously been in use at all, I assumed it was zeroed out. I added a sanity check to ensure this. My sanity check failed. The pristine page table entry was not zeroed out. This made sense. The code had taken an entry that was not correctly initialized, correctly filled in the right field, but the other bits were wrong and that caused the crash. Clearly there was a bug in the initialization code.

This was sounding like a fencepost error. A fencepost error when initializing a table might leave one entry at the end uninitialized, or it might overwrite an entry in the next table. I didn't see anything obvious in the page table initialization code, though. I put a debug halt in just after the page table was initialized. When the machine halted, I looked at the table. It was fine. Everything was zeroed out correctly, and the next table in memory had no problems, either.

Obviously someone had done something with the entry between the time it was initialized and the time it was first used. I put in some more debug code to halt whenever that particular page table entry was accessed. This was possible because the page table manipulation code was in a single subroutine. I restarted the machine. It first halted when it initialized the page. As expected. It halted a second time when it read the page entry. That wasn't expected. Whoever modified the page table entry didn't go through the normal channels. That would cause a bug, but it was wasn't going to be easy to find.

I wrote a debugging subroutine that would simply validate the contents of that one special address and put a call to that subroutine in the middle of the boot sequence. The machine halted at my new test. I put it one quarter of the way into the boot sequence. It didn't halt. I put it three eights of the way through. Over a period of a couple of hours, I had determined where in the boot sequence the memory location was being overwritten. It was some subroutine that really had nothing to do with the memory system. In fact, it barely used the memory system. I pored over the code, but there seemed to be nothing at all out of the ordinary.

I was baffled. I was beginning to wonder if the memory maps had been corrupted. It is almost inconceivable that the machine would be able to run for more than a few milliseconds with corrupt memory maps, but it was the only thing I could think of that would allow you to make an undetected stray memory write. I wrote some more code to validate the memory maps. They were fine.

Ok, if that memory location started as zero, then later became non-zero, then a memory write cycle must have occurred. There are no two ways about it. I wrote some more microcode that checked the contents of that location and verified that it contained the correct value. I modified the microassembler to insert a microcode subroutine call to my validation code after every memory cycle. I rebooted the machine. It was incredibly slow. Every single memory reference involved an out-of-line microcode call to validate the page map. Nonetheless, the validation code finally caused the machine to halt.

It was in a weird location. Well, weird only in the fact that absolutely nothing interesting was going on. By using the backtrace buffer and looking at the microcode I could see which write instruction must have failed. But there was no reason it should have failed because the nearby writes to the same page worked just fine. I looked at the contents of the virtual address register and the memory data register. Surprisingly, they were the expected values. In fact, the memory data register did not contain the bogus value that was written to the page table. This is, in a word, impossible.

So what were the implicit assumptions I was making? One of them was wrong. Bad memory board? The memory test passed. I swapped the memory cards on the machine, but it still crashed in the same place. It was too deterministic and predictable to be flaky hardware. But the Lisp processor wasn't writing to that location.

On the other hand, the Lisp processor wasn't the only device on the bus. The corrupted memory location contained a byte full of 1s. I asked our hardware people: Is there any reason a device would write a byte full of 1s to another card on the bus?

As a matter of fact, there was. The NuBus doesn't have separate interrupt or I/O channels. It signals an interrupt by doing a memory cycle and plopping a byte full of 1s into another card's address space. For some reason, a device in the Explorer was trying to post an interrupt and ended up dropping a turd right where the Explorer put its physical page tables. This happened fairly early on, but after the page table were initialized. Then, some 15 minutes later, when the Explorer finally got around to using that page, it bombed on the corrupted entry.

I found the resident Explorer experts and mentioned that something was trying to post an interrupt. They looked at the initialization sequence and discovered that the code to initialize one of the devices had a bug and didn't correctly turn off the interrupt. The device would post an interrupt every few seconds. It was a couple of lines to fix it, but it took me several days of intense debugging to find it.

Monday, April 13, 2009

Taking a header

It took a while to iron out the contract between the stack closure implementation and the garbage collector. Timing was a critical factor, and memory integrity bugs have a special ability to survive one iteration of garbage collection. The bug confuses the garbage collector on the first iteration and causes it to corrupt memory as it is collecting, but the collector doesn't notice it at this point. On the next garbage collection it encounters the memory corruption and craps out. The problem is that the source of the corruption is long, long gone.

When Ken Sinclair left LMI to go to Symbolics, I inherited the garbage collector. As I mentioned in an earlier post, the garbage collector tended to get blamed because it discovered memory corruption caused by other modules. It was simply the case that if you ran the collector often enough, you'd eventually trigger a collection when someone wasn't expecting it and cause a crash. My microcode tracer cause the GC to run a lot, so I saw most of the errors.

Here's a good example. In uc-lambda-array.lisp you'll find this microcode:
((m-a) m-array-pointer)
((m-b) q-pointer m-array-header) ;Don't put headers in accumulators!
((m-d) m-array-rank)
(m-e) m-array-origin)
((m-s) m-array-length)
This subroutine copies the pointer, header, rank, origin, and length of an array to the m-a, m-b, m-d, m-e, and m-s registers. (The popj-after-next is the microcode subroutine return. You always put it one instruction ahead of when the return actually happens because of pipelining.) The point of this copying is to cache the information needed by the most recent array operation. If the next array operation uses the same array, it runs much faster because it doesn't have to go to main memory.

Note that second line with the comment. An array header is the marker that appears before the array contents. It is used for various things, but one of the most important is that it contains a count of contiguous words of memory that follow. When the garbage collector is scavenging memory to relocate pointers, it checks the array header to see if the array contains ‘boxed’ pointers, or ‘unboxed’ immediate values. If the array is unboxed, the scavenger skips the block of memory containing the array contents.

Enter the law of unintended consequences.

If you browse the microcode, you'll see comments about ‘sequence breaks’. A sequence break is point where the microcode is allowed to transfer control to a different stack-group. The Lisp machine simulated multitasking through time-division multiplexing. When a sequence break occurred, the current state of the micro-machine was saved in a main memory data structure. The accumulators (the m registers with single letter names) are stored in the data structure. If you were very unlucky, you could take a sequence break right after performing an array operation on an unboxed array. The array header would then be stored in the data structure. This wasn't very likely to happen because the m-b accumulator is one of the most frequently used registers and the array header would usually be overwritten within microseconds.

If you didn't turn on the garbage collector, this rare occurrence would be irrelevant. If you did turn on the garbage collector, it would be mostly irrelevant unless you happened to be very unlucky indeed. If you had a sequence break immediately after performing an array operation on an unboxed array, and the garbage collector decided at that point to scavenge the page that contained the saved m-b accumulator, then it would skip over some of the following pointers.

You might be surprised to learn that even that doesn't usually lead to memory corruption. You had to be really, really unlucky. If control is transferred back to this stack-group, the registers are re-loaded from memory. The memory system enforced a read barrier to support incremental garbage collection, so the pointers in the registers would have been adjusted at that point. The problem only occurs when you
  1. Perform an array operation on an unboxed array.
  2. Immediately swap out the stack.
  3. While the stack is swapped out, start and complete a garbage collection. Some registers (the ones stored in memory locations after where m-b is store) may contain bad pointers at this point.
  4. Start another garbage collection before re-assigning the registers.
This long chain of events is typical of the bugs I encountered. They were quite hard to reproduce. This bug had an easy fix. When we strip the data type off the array-header, the garbage collector no longer recognizes the object as a special case and simply steps over it to the next word in memory.

There were at most a couple dozen of these sort of bugs in the microcode. They were a real bear to find, but over the course of several months I eliminated a lot of them. Within the year you could expect to run the garbage collector without crashing the machine.

A short followup

I got some great feedback on the Cavendish experiment. fuf2 said:
how can you be sure electrical attraction is not interfering? I understand those forces are much stronger and might have contributed to the balance moving.
Short answer: I can't. I'd be sure it was electrical if the masses repelled. I did the experiment in a humid basement during the summer rather than in a carpeted room during a crisp, dry winter, but it's hard to discount electrical attraction.

Lincoln Quirk did the math and came up with a value of G = 1.32 * 10^-10. That's good enough for me (and sounds near what I came up with, but my memory might be fooling me).

I remember the most difficult part of figuring out G was determining the angular momentum of the torsion balance and deriving the restoring force from the period. I had my physics and calculus textbooks, but they didn't give the formula. In fact, the textbooks had this particular configuration as one of the student exercises. I eventually set the problem up as an integral and solved it, but it was a bit painful. (It's not a hard integral, and it's not a hard problem, but when was the last time you did it?) Our local library didn't have very many of the kind of reference books you'd need for this. Inter-library loan works great if you know the title you want, but it isn't as convenient for browsing. The monks at the monastery down the road were in the middle of illuminating Strunk and White and were expecting to take a while before getting to anything scientific.

Thanks to everyone that wrote in!

Sunday, April 12, 2009

Let's do the twist

This post is about amateur physics. No computer science is involved, but it was pretty cool.

In 1798 Henry Cavendish did an experiment to weigh the world. Although Cavendish didn't bother deriving the gravitational constant, his experiment was the first that could produce a direct measurement of it.

In the summer of 1985 I was at home convalescing and being bored. It occurred to me one day that if Cavendish could determine the gravitational constant back in 1798, I ought to be able to do something similar, especially because I had access to a few things that were a little hard to obtain in 1798.
The Cavendish experiment involves a torsion balance, which is like a dumbbell which is suspended in the middle by a thin fiber. After letting the thing settle down for a long time, it will be very sensitive to forces that cause it to rotate in the horizontal plane. But any rotation will be opposed (very slightly) by the torsion on the fiber. By placing a pair of large masses near the weights on the end of the dumbbell, the gravitational attraction between the masses will cause the dumbell to rotate until the force of gravity matches the opposing force of the torsion. If we can measure the angle of rotation and determine the torsion of the fiber, we can derive the gravitational attraction between the masses.

Cavendish cast a pair of 1.61 pound lead weights. I found a couple of 2-pound lead cylinders my dad had lying around. I used duct tape to attach them to a 3-foot wooden dowel. Cavendish used a wire to suspend the balance, I used nylon monofilament. To determine the torsion of the fiber, you wait until the balance stops moving (a day or two) and then you slightly perturb it. The balance will slowly oscillate back and forth. The restoring force is calculated from the period of oscillation. Cavendish had a 7-minute period. My balance had a 40 minute period (nylon is nowhere near as stiff as wire).

Cavendish used a pair of 350 pound lead balls to attract the ends of the balance from about 9 inches away. I put a couple of 8 pound jugs of water about an inch away. The next trick was to measure the rotation of the balance. Cavendish had a small telescope to read the Vernier scale on the balance. I used some modern technology. I borrowed a laser from Tom Knight (Thanks again!), and bounced it off a mirror that I mounted on the middle of the balance. This made a small red dot on the wall about 20 feet away. (I was hoping this would be enough to measure the displacement, but I was considering an interferometer if necessary.)

To my surprise, it all worked. After carefully putting the jugs of water in place, the dot on the wall started to visibly move. Within a few minutes, it had moved an inch or two. I carefully removed the jugs of water and sure enough, the dot on the wall drifted back to its starting position.

This was really cool. Newton's theory of gravity was the first ‘unified theory’ of physics. It took several disparate phenomena — the orbits of the planets, the orbits of moons, tides, and the kinematics of falling objects — and proposed a single theory that explained them all mathematically. But Newton's theory supposed that every object has a slight gravitational attraction to every other. This is a strange phenomenon that hadn't been observed (prior to Cavendish). It's not something you usually see because the force is extraordinarily small.

My dad was astounded. Of course he knew about gravity from high school, and knew it kept the planets in orbit and stuff stuck on the ground, but he hadn't remembered (or perhaps wasn't taught) that there was a very small gravitational force between everything, including the lump of lead on the end of my stick and the jug of water a few inches away.

Even though I knew there was a force I could measure, it was still pretty amazing to watch it happen. Sure, you believe Newton's laws, but after seeing this in action, there is still a `wow' factor.

Now as for the value of G. I think I gave enough information here for someone to derive it. I think I calculated it to be somewhere around 10^-11 plus or minus an order of magnitude. One day I might try to really calculate it.

I have to recommend trying this experiment if you have the room to set it up. It's something to see.

Friday, April 10, 2009

Some Lisp Machine minutia

Bit 59 in the Lisp Machine microcode is the halt bit. If the microcode instruction register ever has bit 59 turned on, the clocks stop and the entire processor freezes. Sprinkled throughout the microcode are sanity checks that call the ILLOP subroutine if they fail. The ILLOP subroutine tries to write a ‘crash record’ in a pre-defined location and then halts the processor. If things are running correctly, this should never happen. Well, hardly ever.

The System Diagnostic Unit (SDU) has a watchdog timer. Every now and then, the processor resets the watchdog timer in the SDU. If the watchdog timer ever times out, the SDU assumes that the Lisp processor has stopped. The SDU will blink the screen to let you know the machine crashed. If you wish, you can tell the SDU to turn the clocks back on. Since ILLOP is a microcode subroutine, it will return back to the instruction immediately following the failed sanity check. This allows you to put debugging ILLOPs in the code to act as very primitive breakpoints.

Down at the bottom of the Lisp Machine screen, there is a status bar called the ‘wholine’. Under the wholine there are three little lines that are 1 pixel high and 16 pixels long. These are the ‘run lights’. The rightmost one is on unless the processor is in an input wait state, the middle one comes on when the processor is waiting for the disk, and the leftmost one comes on when garbage collection is occurring. The GC runlight has two parts, one indicates scavenging, the other indicates transporting.

When the machine is working, the runlights flicker on and off. When it crashes, everything freezes for a second or two. There is a tense pause while you wait for some sign of life from the runlights or for the SDU to blink the screen.

In my previous post I had decided to turn on the garbage collector so I wouldn't run out of memory when trying to trace the microcode. Once I had the GC turned on, I tried to run my code. The machine froze and the screen blinked.

I got very, very familiar with this.

An LMI Lambda 2x2 had two Lisp Machine processor on a single NuBus. If a processor halted, it was possible to read the processor state from other devices on the bus. When you were debugging the microcode, the best way was to have a 2x2 so you could decode the frozen processor state from the other processor and match up machine addresses with symbols, etc.

The Lisp Machine had a hardware ring buffer that recorded the value of the microcode program address register on every cycle. You could get a backtrace of the last thousand instructions that lead to the crash. This is an incredibly cool and useful feature.

When I ran my microcode tracer and crashed the machine, the microcode trace revealed that we had encountered an illegal stack environment while attempting to transport a lexical environment from within the garbage collector loop. RG sent me to talk to Ken Sinclair. The Symbolics 3600 had come out a couple of years before and had the first commercial generational garbage collectors. LMI was trying to play catch-up and Ken Sinclair was in the process of writing a generational GC for the LMI Lambda. Ken had actually completed the garbage collector, but the rest of the company was slow in adopting it.

There was an unforseen consequence of the practice of running the machine with the garbage collector turned off. The microcode had gotten sloppy and people were playing a bit fast and loose with the memory model. Untagged addresses would be squirreled away and dereferenced later with no thought given to whether the GC might move the object referred to. Bogus objects — properly tagged words with invalid addresses that pointed at uninitialized memory or into the middle of object of a different type — which would cause the GC to corrupt memory would be left in registers or on the stack. These sort of problems were everywhere in the microcode. Of course it was the garbage collector that discovered these, and following the ancient tradition, the messenger was blamed for the message.

The stack allocated lexical closure microcode was written in haste by RMS, and repented in liesure by yours truly. With Ken Sinclair's help from the GC side, and a lot of poring over microcode, I managed to fix the stack closure bug.

I don't imagine you really want the gory details, but I'll give them anyway. Here are the comments at the beginning of uc-stack-closure.lisp.

Here is how lexical closures work 
A lexical closure consists of a piece of code and an environment to 
resolve free variables in.  The code is implemented by a FEF and the 
environment is implemented as a list of vectors.  Each car in the 
environment is a lexical environment frame which is a vector of 
variable bindings. As there is no way to extend the number of variables 
in a lexical environment, a vector saves space, but more 
importantly, since the lexically visible variables can be known at 
compile time, the compiler can generate code that simply "arefs" a 
given frame and finds the correct value of a lexical variable.  A 
lexical environment is a list of these vectors.  

A closure (any kind, stack or regular) is implemented like this: 
  (dtp-{stack-}closure         ) 
                /------------/ pointer to closure 
  (cdr-next dtp-fef-pointer  --)---> points to a fef containing the code 
  (cdr-nil  dtp-list           ) 
                /------------/ pointer to lexical environment 
  (cdr-normal dtp-list       --)---> points to lexical frame 
  (cdr-error  dtp-list       --)---> points to next environment  

This would be too easy if it weren't for efficiency hacks.  The 
following things are noted: 

1) Some closures are only used in downward funargs and will never be 
   needed after this dynamic frame is exited.  The extent of these 
   closures is dynamic and follows stack discipline, so they can be 
   consed on the stack giving the garbage collector a break. 
   (Actually, this was done so that you wouldn't have to use a 
   losing garbage collector.)  It is not possible to tell if a 
   funarg is to be used in a downward only direction if it is passed 
   to a procedure external to the one currently running. 

2) The uppermost lexical frame of the closure when it is created is 
   the current dynamic frame.  The args and locals of the frame are 
   the ones that are seen by the code.  They cannot be copied out of 
   the frame. 

3) Not all the args and locals of a dynamic frame need appear in a 
   lexical frame.  Which args and locals are needed can be 
   determined at compile time.  This will save space and allow the 
   garbage collector to reclaim inaccessable objects that would be 
   otherwise saved if we kept the whole dynamic environment. 

4) Nested lexical contexts can be "flattened" if the code that 
   creates them is only run once.  See page 87 - 89 (base 10) of the 
   common lisp book for examples of flattenable and unflattenable 
   contexts.  A corollary to this is the fact that a binding which 
   is lexically visible to different closures and which should be 
   distinct in each closure can be shared among them if the code 
   never mutates the variable. 

The above is taken advantage of by the below. 
Efficiency hacks: 
1) We take an optimistic approach and assume all funargs are 
   downward.  Lexical frames and closures are initially allocated on 
   the stack.  All pointers made to the closure are labeled 
   dtp-stack-closure.  If a dtp-stack-closure is ever moved anywhere 
   but to a frame closer to the top of the stack, it becomes 
   necessary to copy the closure and the lexical frame 
   out of the stack and into the heap.  All closures that 
   are in the heap are labeled dtp-closure. 

2) The lexical frame when it is created contains 
   external-value-cell-pointers (EVCP's) to the actual locations in 
   the stack of the args and locals.  This makes it possible to 
   smash the args and locals from downward funargs. 

3) The lexical frame is created only with those bindings needed by 
   the closure.  This is determined by looking at the FEF of the 
   current procedure (not the one you are closing!).  Two slots 
   before the unboxed instructions is a list which is a template for 
   making lexical frames.  It is a list of fixnums in which the low 
   twelve bits specifies which argument or local appears and the 
   sign bit indicates whether it is an arg or local.  This list is 
   arranged in "reverse" order, i.e. the first argument is the last 
   on the list and the last local is the first element of the list. 
   The list is stored in reverse order because the microcode just 
   happens to make a pointer to the box just after the lexical 
   frame.  It then constructs the lexical frame by decrementing the 
   pointer and cdring down the map in the FEF. 

4) The contexts are in fact flattened by the compiler.  The compiler 
   makes sure variable references go to the right slot, so there are 
   no name conflicts.  In order to take advantage of sharing, we 
   assume that all lexical frames closed in the current dynamic 
   frame can be shared and only cons up one initially.  The compiler 
   issues commands STACK-CLOSURE-DISCONNECT to force a 
   splitting of shared frames.  This copies the current frame into 
   the heap.  Two frames which were EQ are now EQUAL (i.e. identical 
   copies instead of being identical).  Then, the compiler does a 
   STACK-CLOSURE-UNSHARE giving it an argument which specifies which 
   lexical slot to unshare.  Remember that the lexical frame 
   initially contains EVCP's to the args and locals. 
   STACK-CLOSURE-UNSHARE "snaps" the invisible pointer and copies 
   the arg into the lexical frame.  The frame will still share the 
   other args and locals by virtue of the remaining EVCP'S 

   When it finally comes time to exit the stack frame, if there are 
   any outstanding closures in the heap pointing to a lexical frame 
   which is stack consed in the current stack frame, we copy the 
   stack frame to the heap and snap all the EVCP's in it.  We then 
   go to each closure pointing sharing any of the args or locals and 
   make their EVCP's point to the copy we just constructed.  Now we 
   can exit the frame.  Note that in order to find each closure in 
   the heap, we keep around a list of all closures disconnected from 
   this frame.  

How a frame with closures is set up in the first place:  

Frame begins here  (dtp-fix) ;bits controlling return 
                (DTP-FEF-POINTER  --)---> to code for the frame. 
Arguments        (cdr-next ......) 
 cdr codes are   (cdr-next ......) 
 set right               : <more args> 
 last arg        (cdr-nil  ......) 
Locals           (...............) <--- A-LOCALP if this is the current function 
 random boxed    (...............) 
 objects                 : <more locals> 
   Stack closures are allocated in the area for locals and take up 
   four local slots.  They are not really locals, they just live here. 
   The <pointer to next cell> is the pointer to the lexical 
   environment chain which just happens to be in the next cell. 
stack closure   (cdr-next   dtp-fef-pointer --)---> points to closure code 
                (cdr-nil    dtp-list  <points to next cell>) 
                (cdr-normal dtp-list  <points to lexical frame>) 
                (cdr-error  dtp-list  <points to next higher context>) 
                        :  <more stack closures> 
   The lexical frame also takes up locals.  It is constructed from the 
   map found two slots before the instructions in the FEF.  We work 
   here on the assumption that we will not need more than one lexical 
   frame (in the case where all variables are or can be shared). 
                (dtp-list <pointer to beginning of frame>) 
lexical frame   (cdr-next dtp-external-value-cell-pointer <pointer to local or arg>) 
                (cdr-next dtp-external-value-cell-pointer <pointer to local or arg>) 
                        : <more lexical slots> 
next-to-last-local      (dtp-list <pointer to lexical frame>) 
last-local       Contains a list of all copies made of the lexical 
                 frame so we can set them up right when we exit this 
                 dynamic frame and deallocate storage for the 

The top of the stack is here.  

Notes on the above diagram. 

1) The word just before the lexical frame is used by 
   COPY-STACK-ENVIRONMENT-INTO-HEAP to locate the beginning of the stack 

2) The lexical frame need not be there.  All the local slots 
   are set to point to nil when the frame is entered.  When it comes 
   time to make a stack closure, the next to last local slot is 
   checked to see if a lexical frame has been made.  If it has not, 
   (i.e. it is nil) the FEF is looked at to find a list of args and 
   locals to forward the slots in the lexical frame to.  If the list 
   in the FEF is nil, this means that the current frame is empty. 
   In this case, the next to last local is set to T indicating an 
   empty frame. 

3) In the comments above I used the words "lexical frame" to 
   indicate what it is that holds the bindings in an environment. 
   In the below code, it is called a "lexical frame"  I 
   apologize to those of you who tried to figure out what was going 
   on by examining my comments vs. the code before reading this. 
   "Lexical frame" is the use, "lexical frame" is the 

4) A-LEXICAL-ENVIRONMENT will point to the context outside of this 
   frame so all closures created in this frame will contain a copy 

When we copy a stack closure into the heap, we forward the copy on 
the stack.  We can't use HEADER-FORWARD because putting that in a 
structure will confuse other things.  What we do is put 
external-value-cell-forwards in the fef pointer and the environment 
pointer in the stack closure to point to the corresponding cells in 
the heap pointer.  This does not really forward the closure, but it 
will do the trick because the stack closure never moves until it is 
deallocated and anything that remains around after that sees only 
the heap allocated version.  The lexical frame (the stack closure 
vector) is not forwarded.  

The last word in the frame is a pointer to every copy of the lexical 
frame in the world.  If the stack frame is exited, we copy the 
lexical frame and go to each of the other copies and change their 
EVCP's to point to our new copy.  

This is the old comment. 
; Copy a DTP-STACK-CLOSURE when it is stored anywhere but 
; into the same stack it points at, and farther down than where it points. 
; There is no way to forward the stack closure to the copy, 
; because only header-forward works to forward a list's car and cdr, 
; and putting that inside a structure will confuse other things. 
; So we stick an external-value-cell-pointer into the stack closure 
; pointing at the copy.  This does not forward it as far as the 
; low levels of the system is concerned!  But as long as the 
; stack closure still exists, that's ok; the evcp forwards only the car 
; of the stack closure, but forwards it to the car of the copy, 
; which contains the correct value. 
Wow. I vaguely remember writing that.

Thursday, April 9, 2009

An obvious solution

The reason the Lisp Machine doesn't have tail recursion is that there is too much random cruft on the stack. The reason there was so much cruft on the stack was because of the garbage collector. The connection there is probably not obvious.

The PDP-10 on which MACLISP ran had an address space of 2^18 words (a word was 36-bits and cold hold both the car and cdr of a cons cell). The CADR Lisp machine had an address space of 2^24 words. The words were 32 bits, so the Lisp Machine could run an enormous 64 megabyte image. Of corse no one could afford anywhere near that much memory, so the CADR had something like 192K of RAM and the rest was paged to disk. (The LMI Lambda doubled the address space by snarfing a bit from the data tags.)

When the CADR needed to garbage collect, it would stop and copy. and copy... and copy .... and copy .... and copy .... The disk was slow and copying the image meant a lot of disk traffic. It was so slow that it was much faster to reboot the machine and restart your program. In fact, you didn't have to reserve a chunk of address space for GC if you didn't mind rebooting, and then you could run longer between reboots.

People got used to this mode of programming and added a lot of code to the machine to support it. You could allocate in discardable regions of memory, you could dynamically allocate on the stack, you could manage allocation pools and explicitly free objects. Some people became wizards at writing completely non-consing code.

But I always thought that the computer should handle the mundane tasks of memory management, and I wasn't going to contort my code to make up for a lame collector. Besides, I'd heard something about `ephemeral garbage collection' and I thought the machine was supposed to do that.

My microcode tracer tended to cons a lot of lexical closures. It would run out of memory for even moderately sized runs. I had three options for making progress:
  1. Rewrite the code to avoid consing at all costs.
  2. Figure out how to save and restore the intermediate state of the tracer so we could make progress across several reboots.
  3. Turn on the garbage collector.
That's a no-brainer.

Wednesday, April 8, 2009

LMI Flashback

Once I fixed the compiler bug that was causing variable aliasing, I ran into the second (of many) problems. I discovered that there was a rather small limit on the stack frame size. The nested procedures that I used quickly exhausted the local variable space and caused the compiler to error out. Unfortunately, the Lisp Machine macro code (the microcode implemented a stack-oriented virtual machine) only had a small field for the displacement value of stack instructions, and there was no easy way to fix this. Given the choice between rewriting my code and rewriting the compiler, I chose to wimp out.

Next problem: stack overflow. It didn't occur to me that the Lisp Machine, the pinacle of AI Lab engineering, wouldn't optimize tail-recursion. It's a pretty trivial thing to implement. If you ever find the compiler emitting a call instruction followed immediately by a return instruction, you simply change that to a jump (what's tricky is if your calling sequence is such that the compiler never emits a call followed by an immediate return). It's easy to do dynamically in the microcode as well. When you do a function call, you peek at the following instruction. If it is a return instruction and the stack is empty, just omit pushing the return address.

I asked RG about this. He said that there was some microcode to handle tail recursion, but it was commented out because it didn't work. He said I was free to look into it, but he didn't expect that I'd have any luck. He was right.

Although the concept is easy, the devil is in the details. At function call time, the stack is never empty. Most of the cruft on the stack is not in use, so it could be deallocated, but there are things that might be in use. These are things like the dynamic binding level (for special variables), pending catches and unwind-protects, data structures with dynamic extent, and even more esoteric things (microcode state!) that happened to end up on the stack because it was convenient. The return instruction can simply deallocate the stack space because everything that stores stuff on the stack doesn't expect it to last beyond the next return. But for tail recursion, we want to deallocate stuff as soon as possible.

The code was a true mess. There is a status word at the beginning of the stack frame and many of the routines that store stuff on the stack set a flag in the status word to indicate they need the frame to last. But not every routine was this well behaved. In the particular case of lexical closures, the lexical environment was optimistically allocated on the stack and only moved to the heap if pointers to the closure still existed at the time of return. The code that implemented tail-recursion had to grovel around on the stack and try to determine if the stack frame was truly needed. It was easy enough to be conservative and punt, but the particular use case I had in mind (where I used some continuation-passing-style) was the hard case.

I eventually worked around the problem by allocating a particularly deep stack.

Over the time I was at LMI, I occasionally revisited this code to see if I could get it working again, but I never had any luck.

Tuesday, April 7, 2009

No More Boring Code!

I really wish I had thought of this. It is absolutely brilliant. Fortunately, John Clements is a friend and colleague of mine, so I'm glad he thought of it:
Reading and writing Java is, by and large, boring. Boring code is boring to write, and genuinely difficult to read without glossing over details. This is an observation made by Yaron Minsky at his Cam(e)l Trading talk at CMU. This resonated strongly for me, and got pretty close to the root of what I think functional languages are about: providing the abstraction tools to obviate the need for Boring Code.
Well, maybe Yaron Minsky thought of it. Alas it wasn't me. But John came up with a logo and I want to jump on the bandwagon.

Let's stamp out boring code!

Thursday, April 2, 2009

Lisp Conference - Jerry Boetje

Jerry Boetje officially unveiled CLforJava at the conference. Jerry's goal is to ‘intertwine’ Common Lisp and Java so you can build programs out of components written in either or both languages and transparently move between the languages.

I think this is a great idea. I'd like to see something more formal about the semantics of doing this. What gets lost in the intertwining? What is intertwining? It isn't compilation or interpretation.

ObScheme: I'm doing some latency measurements and have adapted my old latency measuring code. It's a trivial SCSH script, but it produces the goods.

Wednesday, April 1, 2009

Lisp Conference - Mike Blair

I was surprised to see Mike Blair at the Lisp conference. Mike is an old friend of mine and we've worked on several projects together. He's very sharp and has some fantastic ideas about dynamic program optimization (examining the statistics of runtime values to guide JIT compilation).

Mike knows computing from the transistor on up through the meta-circular evaluator. (And even the meta-meta-circuluar. I recall that he once got the meta-circular evaluator to run itself. It took about forty minutes to get to the prompt.) I've seen him program a micro-coded Turing machine and I've worked with him on a spreadsheet program. He was at Transmeta for some time and has some interesting stories about debugging Microsoft Windows from below (finding bugs where Windows unintentionally relied on hardware behavior that was known to be probabilistic).