Friday, May 15, 2009

Oldspace and the Transporter

When the garbage collector decides to free a region, it marks the region as being ‘oldspace’. Live objects will be copied out of oldspace into newspace. Once all the live objects have been relocated, the oldspace regions are returned to the memory system. (Since there is no longer anything of interest in oldspace, the virtual memory associated with it is unmapped and any physical pages can be freed.)

A key component of the garbage collector is the transporter. The transporter is responsible for copying objects from oldspace to newspace. This is a fairly straightforward operation, but there are a few complexities. First, the transporter must copy the entire object to newspace, and it must copy only that object. If it doesn't copy the entire object, information will be lost and newspace will likely be corrupted. If it copies more than it should, it may cause a partial copy of the next object in oldspace to end up in newspace. Again, this will likely corrupt memory. The constraint of copying all of the object and only the object is an obvious one, but many GC bugs can be tracked down to a violation of this constraint. An incorrect header can cause the transporter to miscalculate the size of the object. An unboxed word might be mistakenly interpreted as a pointer and cause random words to be copied.

Once the transporter has copied an object to newspace, it writes a forwarding pointer in oldspace that contains the address of the new object. This preserves the sharing in the graph structure of the heap.
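To make this concrete, here is a toy sketch of the transport step in Python. The heap is modeled as plain lists and dicts, and all the names are mine, not the Lambda microcode's. The point is that an object is copied exactly once, and every later reference follows the forwarding pointer to the same copy:

```python
def transport(addr, oldspace, newspace, forwarding):
    """Return the newspace address of the oldspace object at addr,
    copying it on first visit and leaving a forwarding pointer."""
    if addr in forwarding:                 # already moved: just forward
        return forwarding[addr]
    new_addr = len(newspace)
    newspace.append(list(oldspace[addr]))  # copy the entire object
    forwarding[addr] = new_addr            # forwarding pointer in oldspace
    return new_addr

# Two references to the same oldspace object remain shared after GC.
oldspace = {0: ["a", "b"], 1: ["c"]}
newspace, fwd = [], {}
x = transport(0, oldspace, newspace, fwd)
y = transport(0, oldspace, newspace, fwd)  # second reference to object 0
assert x == y                              # sharing is preserved
assert newspace == [["a", "b"]]            # copied exactly once
```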

The transporter is made more complicated by the presence of ‘locatives’. A locative is a first-class pointer to a storage location. To give an example, I could create a locative that points to the third element of a vector. Reading the locative would return the same value as reading the third element of the vector. Writing the locative would modify the third element of the vector. In the Lisp machine, a locative is simply a pointer with a special tag. Most tagged pointers contain the address of the first word of storage allocated for an object. A locative is allowed to point anywhere within the storage allocated for an object.

To support locatives, you must be able to take an arbitrary heap address and discover the first word of the object that contains that address. There are different techniques for this. Here's how the LMI Lambda did it. You can always discover the object boundaries in a region by starting at the very beginning of the region (which must be the start of an object) and scanning forward object by object until you find the containing object. Of course this is terribly slow, not only because you are making a linear scan, but because you may have to fetch the relevant pages from the disk in order to walk the objects. So the Lambda kept track of the address of the first object on each page and stored this value in the ‘structure handles’. To find the object that contains an address, you find the page the object is on, locate the first object on that page, and then scan forward. When you relocate a locative, you have to displace it to the middle of the relocated object.
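A sketch of that lookup, with a tiny hypothetical 4-word page and a region modeled as a Python dict from object start address to size (none of this is the actual Lambda data layout):

```python
PAGE_SIZE = 4  # words per page (toy value, for illustration only)

def build_structure_handles(objects, n_pages):
    """objects: {start_address: size}. For each page, record the start
    address of the object containing the page's first word."""
    handles = [None] * n_pages
    for start, size in sorted(objects.items()):
        first_page = start // PAGE_SIZE
        last_page = (start + size - 1) // PAGE_SIZE
        for page in range(first_page, last_page + 1):
            if handles[page] is None:
                handles[page] = start
    return handles

def object_containing(addr, objects, handles):
    """Jump to the page's first object, then scan forward object by
    object until we find the one containing addr."""
    start = handles[addr // PAGE_SIZE]
    while addr >= start + objects[start]:
        start += objects[start]
    return start

# A region holding three objects of sizes 3, 6, and 2 words.
objects = {0: 3, 3: 6, 9: 2}
handles = build_structure_handles(objects, n_pages=3)
assert object_containing(5, objects, handles) == 3   # middle of 2nd object
assert object_containing(10, objects, handles) == 9
```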

Another thing that complicates the transporter is forwarding pointers. There are various kinds of forwarding pointers used in the Lisp machine. Some forwarding pointers are handled exactly like locatives; the transporter doesn't have to do anything special (beyond the usual locative relocation). Other forwarding pointers might be ‘snapped’ by the transporter. A good example is the DTP-ONE-Q-FORWARD type. This is a special pointer that allows you to deliberately alias memory locations. When the transporter encounters one of these, it dereferences the forwarding address to find the actual storage and transports that object instead. The forwarding pointer is no longer necessary after the GC.

The transporter also has some special code to handle CDR-coded lists. CDR-coding allows you to compress list structure by eliding CDR pointers in the common case of a linear list. It is important for the transporter to maintain the CDR-coding or the list would double in size when it was transported. I was surprised to find that the transporter did not perform cdr-coded compression if it discovered an uncompressed linear list. The rationale was that this would happen only rarely, and the amount of hair it would add to the code was significant (consider what happens if somewhere else in memory someone had taken a locative to the CDR of a list you were about to encode away).
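As a rough illustration of why CDR-coding saves space, here is a toy encoding in Python. Real Lisp machine CDR codes also include a CDR-NORMAL case for non-linear structure; this sketch handles only the compressed linear case:

```python
# CDR codes (names follow the Lisp machine convention; the encoding
# here is a simplification).
CDR_NEXT, CDR_NIL = "cdr-next", "cdr-nil"

def encode_list(items):
    """Compress a linear list: one word per element, no CDR pointers."""
    cells = []
    for i, item in enumerate(items):
        code = CDR_NIL if i == len(items) - 1 else CDR_NEXT
        cells.append((code, item))
    return cells

def decode_list(cells, start=0):
    """Walk the compressed representation back into a Python list."""
    items = []
    i = start
    while True:
        code, item = cells[i]
        items.append(item)
        if code == CDR_NIL:
            return items
        i += 1            # the CDR is simply the next cell

cells = encode_list([1, 2, 3])
assert len(cells) == 3               # vs. 3 two-word cons cells
assert decode_list(cells) == [1, 2, 3]
```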

The Lambda had a ‘read barrier’ that enabled it to do incremental garbage collection in parallel with user processes. Henry Baker came up with the idea that allows this to work. The machine maintains the illusion that there is no oldspace by automatically transporting objects to newspace when references to them are read from memory (the read barrier prevents oldspace pointers from ever reaching a register). Since no user code could manipulate an object without bringing a reference to it into a register, the user process was guaranteed to never see an oldspace pointer. The user process goes about its business completely unaware that a chunk of memory is in the process of being reclaimed.

In theory, the use of a read barrier would make it possible to guarantee real-time response during garbage collection, but the theory depends on the transporter always taking a small, bounded amount of time. In actuality, the amount of time the transporter takes depends upon the size of the object being transported. Arrays naturally take time proportional to their size. Furthermore, it may be necessary to take page faults when reading the object out of oldspace and writing it into newspace. In the worst case, the current region in newspace is full and the allocator will have to find and initialize a new region. So although the Lambda was generally able to run during garbage collection, it didn't have a hard real-time guarantee.

The read barrier had a clever hardware assist. The Lambda had a dispatch microinstruction that performed a computed jump through a dispatch table. You often see this sequence of instructions in the microcode:
((vma-start-read) add m-fef a-1) ;Read the location.
(check-page-read)
(dispatch transport md)
The check-page-read instruction is a conditional call to the page fault handler. The vma register holds the virtual address, and when the read finishes, the md register (for ‘memory data’) holds the contents. The (dispatch transport md) instruction is the one that checks the read barrier. The Lambda barrel shifter is used to rotate the tag bits into place for the dispatch, and the lowest bit of the dispatch table address is taken from the memory maps. The dispatch table itself has entries that specify whether to call the transporter subroutine or to simply drop through.

This sounds more complicated than it is. Basically, once you've read a memory location, you either leave it alone or you call the transporter. You call the transporter only if the word is a pointer and the address it contains is in a block of oldspace.
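In software, the same check might be sketched like this (hypothetical Python, not the microcode: words are modeled as (tag, value) pairs, and the transporter is a stand-in that just relocates addresses by a fixed offset):

```python
PTR = "ptr"  # tag marking a boxed pointer word

def read_with_barrier(heap, addr, oldspace, transporter):
    """Read heap[addr]; if the word is a pointer into oldspace,
    transport the referent and snap the reference before handing
    the word to the mutator."""
    tag, value = heap[addr]
    if tag == PTR and value in oldspace:
        value = transporter(value)   # copy the object, get new address
        heap[addr] = (PTR, value)    # no oldspace pointer escapes
    return (tag, value)

# Oldspace is addresses 100..199; our stand-in transporter relocates
# each object to old address + 100, remembering prior moves.
oldspace = set(range(100, 200))
moved = {}
def transporter(old_addr):
    moved.setdefault(old_addr, old_addr + 100)
    return moved[old_addr]

heap = {0: (PTR, 150), 1: ("fixnum", 150)}
assert read_with_barrier(heap, 0, oldspace, transporter) == (PTR, 250)
assert heap[0] == (PTR, 250)           # reference was snapped
# An unboxed word with the same bit pattern is left alone.
assert read_with_barrier(heap, 1, oldspace, transporter) == ("fixnum", 150)
```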

Conventional wisdom has it that a read barrier implemented in software would be too expensive. Many garbage collected systems eschew the read barrier and simply pause the user process while the garbage collector runs. With generational scavenging, the pauses can be amazingly small, so it hardly seems worth implementing a read barrier. On the other hand, processors these days are so fast that even reading a memory location takes a noticeable amount of time. An oldspace check on read might not add significant overhead, especially if it were cleverly coded.

The read barrier on the LMI Lambda was only partially implemented in hardware. The hardware used the contents of the MD register to address the first-level memory map in order to read the oldspace bit. The microcode still had to issue a dispatch instruction (which included the necessary shift and mask). Modern processors run so much quicker that they can emulate the Lambda hardware faster than the original hardware itself.

Wednesday, May 13, 2009

MIT Scheme on C#

Matthew Swank reports that he can run the C# back end of MIT Scheme with Mono on OS X PPC32 and Linux x86-64. It seems to cold load and get to the prompt. Under the debugger it's bombing out with
Unimplemented primitive: RELOAD-SAVE-STRING

Well, I guess I should look into this.

Tuesday, May 12, 2009

Before I got distracted by the tail recursion brouhaha, I was writing about some Lisp machine esoterica. I want to describe how the LMI Lambda does garbage collection. The Lisp machine had the first commercial generational garbage collector. The implementation was fairly straightforward, but it had some features that I haven't seen in other garbage collectors. There is nothing special that would prevent these from being implemented in another GC; it just seems that no one does it.

To understand the Lambda garbage collector, you have to know a bit about the memory system. Memory in the Lambda is contained in ‘regions’. A region is a contiguous chunk of virtual address space that is aligned on a 64-page block boundary and contains an integral number of 64-page blocks. The reason for the alignment and size restrictions is that the memory mapping hardware uses the top 12 bits of the virtual address (bits [24:13] inclusive) as an index into the level-one memory map. There are usually several regions in use at any time.
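The index extraction the hardware performs can be written out directly. This is just the bit arithmetic described above, nothing Lambda-specific beyond the stated bit positions:

```python
def level1_index(vaddr):
    """Bits [24:13] of the virtual address: a 12-bit index into the
    level-one memory map."""
    return (vaddr >> 13) & 0xFFF

# Any two addresses within the same 2^13-word span share a map entry.
assert level1_index(0x0000) == level1_index(0x1FFF) == 0
assert level1_index(0x2000) == 1
```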

Every region has a free pointer. Addresses within the region but below the free pointer are in use; addresses within the region at or above the free pointer are not in use.

Now let me digress a little. Assuming you're not using one of those broken reference-count hacks that don't actually work, the standard garbage collection model you see everywhere is this:
  1. Garbage collection is triggered when you run out of memory.
  2. The garbage collector starts with a root set of objects, then either
    • recursively walks the tree of objects marking the ones in use.
    • scans the root set and relocates objects in use to a new place in memory.
  3. Step 2 continues until the transitive closure of objects is marked or moved.
  4. The collector frees the memory that is no longer in use (the complement of the memory in use).
  5. Some or all of the freed memory is now available for allocation. In particular, the allocation that triggered the garbage collection can now proceed.
The Lisp machine model of garbage collection is a bit different. Here's how I would characterize it:
  1. The fundamental unit of allocation and deallocation for the memory system is the region.
  2. To allocate an object within a region, you check the free pointer to determine if there is sufficient remaining room.
    • If there is room, you simply move the free pointer and initialize the storage.
    • If the region is full, you look for a different appropriate region. (Some regions are reserved for different uses.) If you find one, allocate there.
    • If there are no appropriate regions with room, you request the creation of a new region from the memory system.
    • If this succeeds, you begin allocation within this new region.
    • If this fails, then you are out of memory and the machine halts. (An error is signalled before you are absolutely out of memory, so you won't be suddenly surprised, but if you continue to ignore the error and not release any storage at all, the machine will eventually halt because it is unable to fulfill your request.)
  3. The garbage collector works by freeing entire regions of memory. The basic plan is this:
    • select one or more of the allocated regions and mark them as ‘condemned’. No new objects can be allocated there.
    • Evacuate (relocate) the live objects from the condemned regions.
    • When there are no more live objects in the region, return the region to the memory system.

    The garbage collector runs as a separate process on par with the other processes. Its job is to free memory and return it to the operating system. It monitors memory usage and initiates a garbage collection cycle when it determines that memory needs to be freed. The decision when to initiate garbage collection is based on the current rate of allocation, the estimated rate of recovery, the estimated machine resources used during collection, and user parameters.

    In other words, allocation is decoupled from garbage collection. The decision of when (or even if) to garbage collect is not made by the allocator. You can turn the garbage collector off at any time, or turn it back on at any time. Of course you run the risk of running out of memory if you leave it off for too long.
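The allocation path in step 2 can be sketched as follows — a toy Python model of regions with free pointers, not the actual Lambda code, with the out-of-memory halt reduced to a single MemoryError:

```python
class Region:
    """A contiguous chunk of address space with a free pointer."""
    def __init__(self, base, size):
        self.base, self.size, self.free = base, size, 0

    def allocate(self, nwords):
        if self.free + nwords > self.size:
            return None                 # region is full
        addr = self.base + self.free
        self.free += nwords             # bump the free pointer
        return addr

def allocate(regions, nwords, make_region):
    """Try each appropriate region; fall back to creating a new one."""
    for region in regions:
        addr = region.allocate(nwords)
        if addr is not None:
            return addr
    new = make_region()                 # ask the memory system
    if new is None:
        raise MemoryError("out of memory")  # the machine would halt
    regions.append(new)
    return new.allocate(nwords)

regions = [Region(base=0, size=4)]
assert allocate(regions, 3, lambda: Region(64, 4)) == 0
assert allocate(regions, 3, lambda: Region(64, 4)) == 64  # new region
```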


Next: Oldspace and the transporter

Tuesday, May 5, 2009

You knew I'd say something, Part V

Is this horse dead, yet? At the beginning of this series I said that my goal was to clear up a few misconceptions about tail recursion:
  • Tail recursion is primarily syntactic sleight-of-hand that allows a programmer to use a function call rather than a traditional iteration construct.
    Tail recursion is a means of reducing the asymptotic space complexity of some programs. Tail recursion unifies function calls and message passing in a synchronous, sequential Actor model.
  • Its purpose is to avoid writing looping constructs.
    Its purpose is to allow access to more powerful programming models.
  • Any code that depends on tail recursion can easily be rewritten to use a looping construct instead. Such a rewrite is a trivial local transformation.
    Certain constructs that depend on tail recursion cannot be rewritten into a looping construct without global changes to the code.
  • People who like tail recursion are burdening their code, but they expect the implementor to bear the weight.
    Proper tail recursion cannot be added within the language without adding devices such as trampolines and imposing a calling convention on all users.
  • Tail recursion causes loss of valuable debugging information.
    Means to retain stack traces while preserving tail recursion have been known for close to thirty years.
My primary intention was to get at least a few people to think a little bit harder about tail recursion and not simply dismiss it as a cute trick. I think I succeeded at this.

I wanted to follow up on some questions that people have raised in the comments.

Mr. Jerf said this about my Visitor Pattern:
You need to show an example where the solution would be more elegant than the canonical Python solution if and only if it could use tail recursion, not a grotesquely less elegant solution than is literally constructed for the purpose of blowing the stack.
and then
I posted in part II, complaining that you need to show something that isn't better written another way. (I still disagree with your later reply, because it isn't fair to say you need to globally rewrite code you specifically chose for the fact nobody would write it that way; it would have been iterators from day one.)
My purpose in part II was to demonstrate that not all tail-recursive code can be easily changed into an equivalent loop. The example is admittedly contrived for that specific purpose, but I don't think that invalidates it as an example. I chose the Visitor pattern because it makes use of callbacks. Callbacks are quite frequently called from tail position and could take advantage of tail recursion. If a Visitor callback happens to perform further visits (a not uncommon thing to do), it opens up the possibility of very deep recursion that could benefit from tail recursion.

The Visitor pattern is simply a form of continuation-passing style. (Shhh! Don't tell anyone.)

I also disagree that ‘nobody would write it that way’. I did a bunch of poking around through Python source code to see whether people would use the Visitor pattern, and whether they would use it in such a way that could lead to stack overflow. I found several examples of visitors: in Zope there is an AST walker, in various XML libraries there are graph walkers, and there even seem to be a few visitors in Python itself. I checked to see if there were any cases where a visitor performed further visits. There are. My very tiny example code is a distilled version of this that only has the interesting part of the code.

The point, however, was not to demonstrate the benefits of tail recursion or to persuade Python programmers to change either their language or programming style. It was simply to point out that not every use of tail recursion could be easily transformed into a loop, and I wished to provide a small, but not completely unrealistic example.

Mr. Prescod said:
I still feel that the last post is incomplete without code examples in TCO versus trampoline style.
That's a fair amount of work! I think you may be in luck, though. My MIT-Scheme interpreter used to be written with a simple trampoline in order to do tail recursion (revision 196). The trampoline was this:
void Run ()
{
    // The etc argument isn't used.
    // We want EvalStep to be an expression rather
    // than a statement so we chain the return value
    // around in case we think of a use in the
    // future.             
    object etc = null;
    while (true)                 
        etc = this.expression.EvalStep (this, etc);
} 
Each node in the AST had an EvalStep method. Here's the method for a conditional expression:
internal override object EvalStep (Interpreter interpreter, object etc)
{
    Conditional.evaluationCount += 1;
    return interpreter.EvalSubproblem (this.predicate, 
            new ConditionalDecide (interpreter.Continuation, 
                                   this,
                                   interpreter.Environment));
}
This is an example of a non-tail-recursive call. The predicate subexpression will be evaluated and its value will be passed to the ConditionalDecide object. The definition of EvalSubproblem is this:
internal object EvalSubproblem (SCode sCode, Continuation continuation)
{
    this.expression = sCode;             
    this.continuation = continuation;             
    this.history.NewSubproblem (sCode, this.environment);             
    return null;
}
So we put the predicate subexpression in interpreter.expression, put the continuation in interpreter.continuation, record the call in the History, and then return null.

The null is the value returned to the trampoline. interpreter.expression is now the predicate subexpression, so evaluation proceeds at that point. By returning to the trampoline on each evaluation step, we avoid pushing the stack. We still have to be careful not to heap allocate on the tail calls, though. So let's look at how that works.

Suppose our predicate was a constant. EvalStep for a constant is this:
internal override object EvalStep (Interpreter interpreter, object etc)
{
    Quotation.evaluationCount += 1;
    return interpreter.Return (this.item);         
} 
and we'll need to see how interpreter.Return works:
internal object Return (object value)
{
    Continuation cont = this.continuation;
    this.continuation = cont.Parent;
    return cont.Invoke (this, value);
}
To return a value, we invoke the continuation. The continuations are chained together in the heap like a stack, so we put the parent continuation in place when we invoke the topmost continuation.

The ConditionalDecide object is the continuation. Continuations have Invoke methods:
internal override object Invoke (Interpreter interpreter, object value)
{
    return interpreter.EvalReduction ((value is bool && (bool) value == false)
                                      ? this.Expression.Alternative
                                      : this.Expression.Consequent,
                                      this.Environment);
}
A conditional will tail call either the alternative or the consequent, so we select which subexpression is to be evaluated. EvalReduction arranges for the tail call:
internal object EvalReduction (SCode sCode, Environment environment)
{
    this.expression = sCode;
    this.environment = environment;
    this.history.NewReduction (this.expression, this.environment);
    return null;         
}
We put the alternative or the consequent into expression, put the environment into environment, record the tail call in the History, and then return null. As before, this null is returned to the trampoline.

The difference between EvalSubproblem and EvalReduction is that EvalSubproblem needs an explicit continuation object and EvalReduction implicitly uses the existing continuation in the interpreter. Each time we evaluate a subproblem, we allocate storage for the continuation. Each time we evaluate a reduction (a tail call), we just use the existing continuation.

Now, just for the sake of argument, suppose I had written this for ConditionalDecide.Invoke:
internal override object Invoke (Interpreter interpreter, object value)
{
    return interpreter.EvalSubproblem
                 ((value is bool && (bool) value == false)
                      ? this.Expression.Alternative
                      : this.Expression.Consequent,
                  this.Environment,
                  new ConditionalFinish (interpreter.Continuation));
}
where ConditionalFinish has this Invoke method:
internal override object Invoke (Interpreter interpreter, object value)
{
    return interpreter.Return (value);         
}
What I've basically done here is take Alternative or Consequent out of tail position. In order to run the code as a subproblem, I have to supply an explicit continuation (which I am calling ConditionalFinish). ConditionalFinish is not a very useful piece of code. It has no state except for the parent continuation, and the Invoke method just unwraps that and invokes it. The effect is virtually identical to using the implicit continuation with the only difference being a waste of time and space.

Now, how would this look if I could just use tail recursion? Here's the EvalStep for conditionals:
internal override object EvalStep (Interpreter interpreter, object etc)
{
    Conditional.evaluationCount += 1;
    object value = interpreter.EvalSubproblem (this.predicate);
    if (value is bool && (bool) value == false)
        return interpreter.EvalReduction (this.Alternative);
    else
        return interpreter.EvalReduction (this.Consequent);
}

Sunday, May 3, 2009

You knew I'd say something, Part IV

I thought I could squeeze the rest in this post, but I think I need one more after this.

van Rossum suggests that tail recursion may be difficult for the compiler if function redefinition is permitted. (Of course we both agree that doing this is not good style. Furthermore, we both agree that it nevertheless ought to ‘work’ in an intuitive way.) He gives this example:
def f(x):
     if x > 0:
            return f(x-1)
     return 0

g = f
def f(x):
      return x
g(5)
and explains
The call to g(5) invokes the earlier f, but the “recursive” call no longer recurses! At run-time, the name ‘f’ is rebound to the later non-recursive definition, so the returned value is 4, not 0.
The direct translation into Scheme is this:
(define f
  (lambda (x)
    (if (> x 0)
        (f (- x 1))
        0)))

;; Prove it is tail recursive.
(f 1000000) => 0

;; Mutate the definition of F.
(define g f)
(set! f (lambda (x) x))

;; Now what happens?
(g 5) => 4
This will work in any version of Scheme. The compiler doesn't need to analyze the target of the function call; it only needs to notice that the caller's stack frame is no longer in use.

What about debugging?

The most common objection to tail recursion is that it makes debugging more difficult because the stack backtrace is missing the frames from the tail call positions. This is a real, pragmatic concern that I think is worth addressing.

MIT-Scheme has an approach that seems promising. At each evaluation step the interpreter records the program state in a structure called the ‘history’. The history essentially contains a copy of the stack frame. (Wait!!! I can hear the objections already and I haven't even posted this!) Obviously, this would have the exact same effect on the space utilization as the stack would in a non-tail-recursive implementation except for one very important thing: the history is a ring buffer with a limited size. When a tail recursive call is recorded in the history, each frame in the history moves down by one. If a sufficient number of calls are made, the oldest record is deleted.

Recording the stack frame in the history does change the space complexity, but in a more benign way than simply foregoing tail recursion altogether. The asymptotic space utilization is changed by a constant factor. Since the oldest frames are dropped from the history, the total storage consumed by the history cannot grow without bound. This is sufficient to avoid changing the asymptotic space complexity class. (Actually, I haven't seen a formal proof of this, so there may be some corner cases. Bear with me for a moment.)

In MIT-Scheme, the history is implemented as a ring buffer of ring buffers. Imagine that you have discovered the rib cage of an animal that was shaped like an inner tube. When a non-tail recursive call is made, we rotate the entire inner tube to get to the next rib. When the non-tail call returns, we rotate it back. When a tail call is made, we rotate the current rib and record the stack frame. The effect is this: if we have, say, 5 ribs, then the top five stack frames will have a backtrace of the tail calls that occurred at that level. If each rib is, say, 10 elements long, then each of the top five stack frames will show up to 10 tail calls.
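Here is a sketch of that rotating structure using Python deques (my own toy model, not MIT Scheme's actual implementation). A new subproblem rotates the backbone to the next rib, returning rotates it back, and a reduction records into the current rib, pushing out the oldest tail call; as a simplification, this model also lets the subproblem frame itself fall off a full rib:

```python
from collections import deque

class History:
    """Ring buffer of ring buffers: one 'rib' per recent subproblem,
    each rib a bounded ring of recent reductions (tail calls)."""
    def __init__(self, depth=5, width=10):
        self.ribs = deque(deque(maxlen=width) for _ in range(depth))

    def new_subproblem(self, frame):
        self.ribs.rotate(-1)        # rotate the backbone to the next rib
        self.ribs[0].clear()        # reuse the oldest rib
        self.ribs[0].append(frame)

    def end_subproblem(self):
        self.ribs.rotate(1)         # rotate back when the call returns

    def new_reduction(self, frame):
        self.ribs[0].append(frame)  # full rib: oldest tail call drops off

h = History(depth=3, width=2)
h.new_subproblem("S0")
for i in range(5):
    h.new_reduction("R%d" % i)
assert list(h.ribs[0]) == ["R3", "R4"]   # only the newest tail calls survive
```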

This sounds much more complicated than it really is. The end result is that when you have an error and enter the debugger, you can get pretty much most of the expected stack trace.

But what if that's not enough?

The size of the history can be controlled by the user. If you need a bigger stack trace, just make the history deeper or wider.

But what about the overhead and the extra complexity?

As it turns out, the description sounds worse than reality. If you implement the ‘backbone’ of the history as a doubly-linked list, then rotating the history one way or the other simply involves dereferencing a pointer. And if you keep each ‘rib’ of the history as a doubly-linked list as well, each tail call involves nothing more than dereferencing a pointer.

I think a simple demonstration might be in order here. This is a buggy version of the Ackermann function:
(define (ack m n)
  (cond ((= m 0) (/ 1 0))  ; deliberate error
        ((= n 0) (ack (- m 1) 1))
        (else (ack (- m 1) (ack m (- n 1))))))
And here is what happens when we try to run it:

(ack 2 2)

  Division by zero signalled by /.

 S0  (ack m (- n 1))
    R0  (ack (- m 1) (ack m (- n 1)))
    R1  (if (= n 0) (ack (- m 1) 1) (ack (- m 1) (ack m (- n 1))))
    R2  (cond ((= m 0) (/ 1 0)) ((= n 0) (ack (- m 1) 1)) (else (ack (- m 1 ...
    R3  (ack (- m 1) 1)
    R4  (if (= n 0) (ack (- m 1) 1) (ack (- m 1) (ack m (- n 1))))
 S1  (ack m (- n 1))
    R0  (ack (- m 1) (ack m (- n 1)))
    R1  (if (= n 0) (ack (- m 1) 1) (ack (- m 1) (ack m (- n 1))))
    R2  (cond ((= m 0) (/ 1 0)) ((= n 0) (ack (- m 1) 1)) (else (ack (- m 1 ...
    R3  (ack m (- n 1))
 S2  (ack m (- n 1))
    R0  (ack (- m 1) (ack m (- n 1)))
    R1  (if (= n 0) (ack (- m 1) 1) (ack (- m 1) (ack m (- n 1))))
    R2  (cond ((= m 0) (/ 1 0)) ((= n 0) (ack (- m 1) 1)) (else (ack (- m 1 ...
    R3  (ack 2 2)
The frames labeled S0, S1, and S2 are ‘subproblem’ frames that result from the non-tail-recursive calls. The frames labeled R0, R1, etc. are the ‘reduction’ frames that result from the tail recursive calls. In this case, the history contains everything that a stack trace would contain. Again, if this is not sufficient for debugging, both the depth and width of the history are under user control.

Keeping a history structure may or may not be the best way to deal with the loss of debugging information that comes from having proper tail recursion. This feature has been in MIT Scheme since the early 80s (possibly the late 70s). It isn't very difficult to implement. My description of it is probably larger than the code itself.

I think this is a reasonable counter to the problem of loss of debugging information.

Friday, May 1, 2009

You knew I'd say something, Part III

Earlier I said that tail recursion allows a program access to a more powerful computation model. I'm referring to Hewitt's Actor model. At its core, the Actor model is extremely simple. Computation agents called ‘Actors’ interact only through message-passing. The full Actor model allows for multiple asynchronous agents, each acting independently. On a single processor computer, it can be much more convenient to use a synchronous, serial Actor model in which only one Actor at a time is running and control of the processing is passed along with the messages.

One important feature of the Actor model is that messages are unidirectional. When an actor sends a message it does not require a reply as part of the protocol. If the actor needs a reply or wishes to handle a reply, it sets up a callback. The callback can have a well-known name that any actor can use in a message, or it can be an unnamed piece of code (a lambda expression) that shares the context of the sender.

Since messages are unidirectional, they don't require a ‘return address’. The message may have a return address as part of a callback, but that's a protocol decision, it isn't built-in to the message itself. The recipient of the message might simply ignore the callback (it is allowed to).

The recipient of a message can pretty much do anything it wants. Of course we expect the program to be written such that something useful happens, but the recipient of a message is allowed to do things like
  • compute something
  • change its state
  • send a message
  • create new actors
Or any combination of the above. For example, upon receiving a message, an actor could compute something based on the message contents, create a new actor, and resend the message (with the newly computed value) on to the newly created actor. This kind of arbitrary delegation is a key feature of the actor model. Once a message is sent, the sender is completely out of the picture (although he could put his return address in a callback). The message recipient is allowed to delegate all or any part of its action to other actors. Control structure emerges from the patterns of message passing.

Function calling = message passing (almost)

A procedure is like an actor. Nothing happens until you invoke it (send it a message), it can compute something, change its state (if it is a closure or has access to something mutable), and call other procedures (create new actors and send them messages). But a procedure will be a limited form of actor if it cannot fully delegate its computation to another procedure.

In virtually every computer language, a procedure call has an implicit continuation (informally, the caller). ‘Returning a value’ is ‘invoking the continuation’. Whether a procedure can fully delegate to another depends on how the language manages the implicit continuation. If the language supports proper tail recursion, then when a procedure tail calls another, the implicit continuation is passed directly along to the callee. If the language does not support proper tail recursion, then the implicit continuation is not passed to the callee. Instead, a new implicit continuation is allocated that contains (wraps around) the old one. The wrapper doesn't actually do anything in the tail call situation, it just takes up space and possibly retains pointers to reclaimable storage, but you can't get rid of it because it is built in to the language.

If you cannot pass the implicit continuation on to the callee, then you cannot fully delegate the computation. If you cannot fully delegate, then you cannot fully model message passing with function calls. On the other hand, if you can fully delegate, then a function call is exactly the same as passing a message (in our synchronous, serial Actor model).
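To make the delegation explicit, here is a toy continuation-passing sketch in Python (which, of course, does not itself eliminate tail calls, so keep the inputs small). With full delegation, each tail call hands the caller's continuation k to the callee unchanged; a language without proper tail calls behaves as if every call wrapped k in a do-nothing frame:

```python
def even_k(n, k):
    """Is n even?  Answered by mutual delegation in continuation-
    passing style: each tail call passes the SAME continuation k."""
    if n == 0:
        return k(True)
    return odd_k(n - 1, k)      # full delegation: k passed along untouched

def odd_k(n, k):
    if n == 0:
        return k(False)
    return even_k(n - 1, k)     # no wrapper accumulates around k

# Without proper tail calls, the language effectively forces
#     return odd_k(n - 1, lambda v: k(v))
# allocating a useless identity wrapper around k at every step.

assert even_k(10, lambda v: v) is True
```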

Tail recursion unifies message passing and function calling. This is the reason tail recursion is cool. The fact that this allows you to write a loop as a tail call is sort of a side benefit. (It directly follows because all control structure can be modeled as message passing, but as I mentioned repeatedly, I have nothing against looping constructs per se.)

With tail recursion, you have message passing embedded in your language. By embedded I mean that you don't have to directly manipulate message objects and inject and extract values from them. That happens automatically as part of the function call syntax.

Without tail recursion, you have to implement message passing as a separate, ‘bolted on’, feature. You will have explicit message objects to allocate, initialize, enqueue, poll for, destructure, and deallocate. The control structure that emerges from the message passing patterns will interfere with the control structure in your code (you'll have to structure all your code around the ‘driver loop’).

Interlude

Part III is coming, but I wanted to make an observation. As I've been writing stuff, I've been poring over a lot of code looking for examples and counter-examples of ideas I want to talk about. Since I've been talking about tail recursion, I've been looking in particular at procedures that return answers that are computed by other procedures. It's fairly easy to find these things — you just look for a return statement immediately followed by a function call: return<whitespace+><identifier><whitespace*> (. This doesn't match perfectly, but it gets more than enough results to look at. In looking at several examples of code I've noticed a common pattern like this:
    ...
    Type foo = computeFoo (a, b);
    return foo;
}
This isn't a tail recursive call, but it is trivially transformed into one. Then I was thinking that I wouldn't write code like that. (At least I don't think I would.) It's like writing this in Scheme:
    (let ((foo (compute-foo a b)))
       foo)
Which is simply syntactic sugar for this:
    ((lambda (foo) foo) (compute-foo a b))
Which trivially beta-reduces. That first term is better known as the identity function, so I could have just written:
    (id (compute-foo a b))
instead of the let expression, but why would I even bother? So this code:
    ...
    Type foo = computeFoo (a, b);
    return foo;
}
is really a verbose way of writing this:
    ...
    return Identity<Type> (computeFoo (a, b));
}
which no sane person would prefer to
    ...
    return computeFoo (a, b);
}
Yet there are apparently reasonably sane people who seem to prefer the first version. (caveat here: I'm assuming that the Identity function is being called for value and that this is not an obscure means of forcing a type coercion. In all the code I've looked at, I haven't seen anything that nasty, but it wouldn't surprise me. I'm not talking about obscure compiler tricks.)