Friday, September 12, 2008

Correction

“Hamiltonian graphs cannot be expressed in existential monadic second order logic because palindromes are not a regular language.”

Apparently, Keith Ellul immediately realized that without MONADIC the statement is false, because it is “trivial” to see that hamiltonian graphs can be expressed in existential second order logic.

I can't wait to tell the guys at the pub.

Did you know?

Hamiltonian graphs cannot be expressed in existential second order logic because palindromes are not a regular language.

I have to say that is one of the most obscure sentences that I've seen in a long time.

Monday, September 8, 2008

In ‘Continuations from Generalized Stack inspection’, and in the Addendum, I showed how to abuse the exception handling mechanism to implement call-with-current-continuation on a virtual machine that doesn't allow access to the stack. The problem with the method is that exception handling is astoundingly slow. Nearly any other method of handling continuations will outperform it.

The key to the method is noticing that the method associated with each stack frame has full access to its own frame even if the other methods do not. The exception mechanism was a way to transfer control back to the method, but to a segment of code that unloads the stack rather than continuing the computation. This is quite similar to what we just did with the calling conventions to EvalStep. The machinery necessary to handle first-class continuations is an almost trivial addition. We create a special singleton object and test the return value for this object. If we see it, we unload the current frame and return. This propagates down the stack until we reach the bottom and the stack is empty. As an example, the code for interpreting a conditional expression is this:
public override bool EvalStep (out object answer, ref SCode expression, ref Environment environment)
{
    SCode unev = this.predicate;
    Environment env = environment;
    while (unev.EvalStep (out answer, ref unev, ref env)) { };
    if (answer == Interpreter.UnwindStack) {
        UnwindState xxx = <get unwind state>
        xxx.AddFrame (new ConditionalState (this, environment));
        return false;
        }

    expression = (answer is bool) && (bool) answer == false
        ? this.alternative
        : this.consequent;
    return true;
}
The special code has a simple task. It should save any relevant state out of the current stack frame into a data structure, then add that data structure to the list of reified stack frames, and finally return the ‘Interpreter.UnwindStack’ object itself.
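
The propagation can be sketched with a self-contained toy; the names UnwindStack, reifiedFrames, and the frame labels here are mine, not the interpreter's:

```csharp
using System;
using System.Collections.Generic;

// Toy sketch of the unwinding protocol: each frame that sees the
// sentinel in 'answer' saves its own state on a list of reified
// frames and simply returns, so its caller sees the sentinel too,
// all the way down to the bottom of the stack.
static class UnwindDemo
{
    static readonly object UnwindStack = new object ();
    static readonly List<string> reifiedFrames = new List<string> ();

    // 'depth' stands in for a chain of nested EvalStep frames; the
    // innermost frame plays the role of call/cc and starts the unwind.
    static void EvalStep (int depth, out object answer)
    {
        if (depth == 0) {
            answer = UnwindStack;          // begin unwinding
            return;
        }
        EvalStep (depth - 1, out answer);  // sub-evaluation
        if (answer == UnwindStack)
            // Save this frame's state before letting the stack pop.
            reifiedFrames.Add ("frame " + depth);
    }

    static void Main ()
    {
        object answer;
        EvalStep (3, out answer);
        // Frames are reified innermost-first on the way out.
        Console.WriteLine (string.Join (", ", reifiedFrames));
    }
}
```

Running this prints `frame 1, frame 2, frame 3`: the reified frames come off the stack innermost-first, which is exactly the order needed to rebuild the continuation later.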

Where do we keep the reified stack frames? We'd like to pass it down from the frames above, but it seems a waste to allocate yet another local variable and ref argument for this purpose because they are used so rarely. We use the ‘Wolf in Sheep's clothing trick’ to accomplish this.
class UnwindState : Environment
{
   <whatever needed for unwinding>
 
   <a bunch of virtual methods that
    make the compiler think we handle the ‘Environment’ 
    protocol, but just raise errors>
}
And now we can stuff the unwind state in the Environment:
     
public override bool EvalStep (out object answer, ref SCode expression, ref Environment environment)
{
    SCode unev = this.predicate;
    Environment env = environment;
    while (unev.EvalStep (out answer, ref unev, ref env)) { };
    if (answer == Interpreter.UnwindStack) {
        ((UnwindState) env).AddFrame (new ConditionalState (this, environment));
        return false;
        }

    expression = (answer is bool) && (bool) answer == false
        ? this.alternative
        : this.consequent;
    return true;
}
Our calling convention is a fair amount more complicated than it was. Each caller to EvalStep must wrap the call in a while loop, then it must check for the answer being ‘Interpreter.UnwindStack’, and if so, handle the stack unwinding via the environment variable. Not pretty, but it works.

Continuations in my version of Scheme are no longer allocated in the heap, but on the stack where we can take advantage of the hardware. Tail recursion is preserved via an unusual calling convention. However, the code is able to load and unload the stack, so call-with-current-continuation works correctly.

With these changes, the C# version of the MIT-Scheme interpreter now outperforms the original C version.

Thursday, September 4, 2008

Now it starts getting interesting.

I have a Scheme system that can interpret the MIT-Scheme S-Code, but it doesn't use the underlying MIT-Scheme runtime system. It can read the fasl binaries that MIT Scheme produces, but the object format is whatever the .NET system is providing. How does it perform? Not particularly well.

It's pretty easy to figure out why. Even the simplest piece of code is creating and invoking continuations at an enormous rate. Each continuation is being allocated on the heap. Now it is the case with modern generational garbage collection that heap allocation has almost no cost, but even the very tiny cost of allocation adds up quickly when you do it for nearly every node you traverse in the abstract syntax tree.

But the real problem is this: the CPU has specialized hardware that is specifically designed to efficiently allocate, deallocate and manipulate continuations, and we aren't using it. Again, it is sitting there managing the continuations used by the virtual functions in our interpreter, while we use the main data paths and the memory management software library to mimic its functionality for the virtual machine. In fact, we go out of our way to get it to discard these useless continuations so we can preserve tail recursion.

Why do we do this? First, the C# continuations don't quite do what we want. C#, like most other languages, unconditionally allocates continuations on procedure entry and deallocates them on exit. Scheme, on the other hand, allocates a continuation only if it is logically needed to regain control. If a routine has completed its work and is simply going to call another routine to get the answer, it just jumps there and the callee transmits the answer directly back to whatever routine initiated the call chain.

Second, C# continuations are too opaque for our purposes. Scheme's call-with-current-continuation primitive reifies the continuation — it takes the stack and turns it into a first-class object that can be manipulated by the user. C#, like Java, does not permit programs to randomly access the stack and certainly doesn't let programs unload and reload the contents of the stack.

But there are techniques to get around these limitations.

Let me describe how I solve the first problem.

The EvalStep method in my interpreter originally had this signature:
void EvalStep (Interpreter interpreter);
and a node in the abstract syntax tree would do something like this:
 sealed class Conditional : SCode, ISystemHunk3
 {
     readonly SCode predicate;
     readonly SCode consequent;
     readonly SCode alternative;

     override void EvalStep (Interpreter interpreter)
     {
         interpreter.Continuation =
             new ConditionalDecide (interpreter.Continuation, 
                                    this,
                                     interpreter.Environment);
         interpreter.Expression = this.predicate;
         return;
      }
 }
Note how the new continuation takes as one of its arguments the existing continuation. We put that into the interpreter continuation register, set the interpreter expression register to the predicate of the conditional and then return control to the caller (which is the trampoline).

The point is to evaluate the predicate and pass the answer on to the newly created ConditionalDecide continuation. It has code like this:
override void Invoke (Interpreter interpreter, object value)
{
    interpreter.Continuation = this.parent;
    interpreter.Environment = this.environment;
    interpreter.Expression = (value is bool && (bool) value == false)
                               ? this.expression.Alternative
                               : this.expression.Consequent;
    return;
}
These two code fragments are part of a single logical routine that is more easily written like this:
override object EvalStep (Interpreter interpreter)
{
    object value = this.predicate.EvalStep (interpreter);
    if (value is bool && (bool) value == false)
        return this.alternative.EvalStep (interpreter);
    else
        return this.consequent.EvalStep (interpreter);
}
This is not only easier to write, it performs much better because we are co-opting the continuations that are used for C# subroutines for our own sub-evaluations. Unfortunately, the tail calls to EvalStep allocate continuations that destroy tail recursion. We need to transfer control to one of these routines without allocating a new continuation.

Here's the trick: we'll get the caller to do the transfer for us. We'll return to the caller with some indication that we're not really done, but that we intend for someone else to finish our job. Essentially, we are taking some of the code that logically belongs in our routine and placing it in the caller. This isn't possible in general because the caller is unknown and the callees can be arbitrary, but in the context of the interpreter, we have an extraordinary special case: nearly every call that needs to be tail recursive is a call to EvalStep. (There are a couple of primitives that call back into the interpreter that we'll deal with in a bit.)

We'll define a new calling convention that requires the caller to handle the transfer to EvalStep for us. It's a little unorthodox, but no more so than requiring every routine to return to a trampoline. The new signature of EvalStep will be this:
    bool EvalStep (out object answer, ref SCode expression);
The boolean return value indicates whether we've returned control with an answer or whether we've returned with a new expression we wish to tail call. The caller must invoke EvalStep like this:
    SCode expr = <expression to evaluate>;
    object answer;
    while (expr.EvalStep (out answer, ref expr)) { };
    // When the while exits, we have an answer.
And the callee should do one of two things. Either assign an answer and return false, or assign a new expression and return true. The EvalStep handler for the conditional expression above would therefore look like this:
override bool EvalStep (out object answer, ref SCode expression)
{
    SCode  expr = expression.Predicate;
    object predicateValue;
    while (expr.EvalStep (out predicateValue, ref expr)) { };

    if (predicateValue is bool && (bool) predicateValue == false)
        expression = expression.Alternative;
    else
        expression = expression.Consequent;

    return true;
}
This isn't quite the whole story. Evaluation needs an environment and that is missing. The EvalStep routines will need to provide this to the caller as well, so it will be handled with a ref argument, too. Here is the code for conditionals:
bool EvalStep (out object answer, ref SCode expression, ref Environment environment)
{
    SCode expr = this.predicate;
    Environment env = environment;
    while (expr.EvalStep (out answer, ref expr, ref env)) { };

    expression = (answer is bool) && (bool) answer == false
        ? this.alternative
        : this.consequent;
    // request tail recursion.
    return true;
}
Here is the code for variables:
override bool EvalStep (out object value, ref SCode expression, ref Environment environment)
{
    if (environment.DeepSearch (out value, this.name))
        throw new UnboundVariableException (this.name);
    // actually returning a value
    return false;
}
If you look at this code you'll notice that what we are doing is severely abusing the language features to change the calling conventions. We allocate some stack for EvalStep in the caller's frame. We then pass control to EvalStep and give it the addresses of the allocated variables. EvalStep eventually returns control to us, but it isn't logically finished. Recall that we've moved some of the code from the callee to the caller. If the return value is true, the while loop performs a method call to EvalStep on the new value of expr. This is the call that the callee would have performed, had it been allowed to clear the stack prior to the call. Since it is not allowed to do this, we just move the code around a little so that it runs after the return.
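
The convention can be exercised outside the interpreter. Here is a self-contained toy (Node, Literal, and Countdown are illustrative names of mine, not the interpreter's classes) that performs a million logical tail calls with constant C# stack depth:

```csharp
using System;

// A minimal sketch of the caller-drives-the-tail-call convention.
abstract class Node
{
    // Return false with 'answer' set, or return true with
    // 'expression' replaced by the node to tail call.
    public abstract bool EvalStep (out object answer, ref Node expression);
}

sealed class Literal : Node
{
    readonly object value;
    public Literal (object value) { this.value = value; }
    public override bool EvalStep (out object answer, ref Node expression)
    {
        answer = this.value;
        return false;            // an actual answer
    }
}

sealed class Countdown : Node
{
    readonly int n;
    public Countdown (int n) { this.n = n; }
    public override bool EvalStep (out object answer, ref Node expression)
    {
        answer = null;
        expression = (this.n == 0)
            ? (Node) new Literal ("done")
            : new Countdown (this.n - 1);
        return true;             // request a tail call
    }
}

static class Demo
{
    public static object Eval (Node expr)
    {
        object answer;
        // The caller's while loop performs the transfers the
        // callees were not allowed to make themselves.
        while (expr.EvalStep (out answer, ref expr)) { };
        return answer;
    }

    static void Main ()
    {
        // A million tail calls without growing the C# stack.
        Console.WriteLine (Demo.Eval (new Countdown (1000000)));
    }
}
```

This prints `done`; the same program written as ordinary recursive calls would overflow the C# stack long before reaching the literal.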

With the code in this form, the interpreter continuations and the C# continuations are one and the same object, and we can take advantage of the processor's stack hardware for performance. This solves the first problem we have with C# continuations. The second problem is harder to solve.

Tuesday, September 2, 2008

S-code

MIT Scheme has an internal format for code called ‘S-Code’. This is a set of data structures that represent the abstract syntax tree of the language with all the surface syntax expanded away. There are nodes for conditionals, nodes for literal values, nodes for variables, nodes for lambda expressions, etc. Each node has a particular evaluation rule. The interpreter traverses the syntax tree and evaluates each node according to the appropriate evaluation rule.

MIT Scheme uses a rather old high-tagged pointer format. This is a legacy from the Scheme 81 chip. A machine word in MIT Scheme is 32-bits, but the top six bits encode the type of word. The remaining twenty-six bits may encode an immediate value, like a small integer (fixnum), or it may encode the address at which a larger data structure may be found. Twenty-six bits of address was more than ample in 1981, but it is quite small by today's standards.

The MIT Scheme interpreter is essentially an emulator for a Scheme 81-like chip (although it has evolved substantially). The core of the interpreter is a loop that reads the expression ‘register’, extracts the tag bits and dispatches (with a switch statement) to the code that implements the evaluation rule for the expression.

My implementation of MIT Scheme uses a different strategy. Each S-Code object is an instance of a C# class. The details of how the instance is represented in memory are unimportant in this implementation because we aren't relying on the ability to inspect the raw bits. The interpreter traverses the S-Code by invoking a virtual function EvalStep defined on the abstract base class. Every concrete S-Code class must provide a concrete method that implements the appropriate evaluation rule. Rather than a big switch statement, the interpreter simply calls EvalStep to dispatch to the appropriate routine.

This requires a few slight changes to the interpreter. In MIT Scheme, literal objects can be embedded directly within the S-Code. If you refer to a number in your program, that number appears directly in the abstract syntax tree. Since every object in MIT Scheme has a tag, the interpreter will dispatch to the ‘self-evaluating’ code when it encounters one of these embedded objects. That code simply returns the object itself.

In my version, it must be the case that every object that the interpreter acts upon is derived from the abstract class SCode and has an EvalStep virtual method. The implementors of the .NET runtime did not have the foresight to add an EvalStep method to the base object class, so the interpreter cannot dispatch on arbitrary objects. There are several solutions, but the easiest is this: self-evaluating objects are never allowed to appear directly in S-Code; they must be wrapped within a Quotation node. The implementation ensures this automatically. When an S-Code node is initialized, any subfield that is not itself an S-Code object is wrapped within a Quotation.
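
The wrapping might look something like the sketch below; EnsureSCode and the exact constructor shapes are my guesses at the idea, not the actual implementation:

```csharp
using System;

abstract class SCode
{
    public abstract bool EvalStep (out object answer, ref SCode expression);

    // Anything that is not already S-Code gets wrapped in a Quotation,
    // so the interpreter can always dispatch through EvalStep.
    public static SCode EnsureSCode (object o)
    {
        return (o as SCode) ?? new Quotation (o);
    }
}

sealed class Quotation : SCode
{
    readonly object quoted;
    public Quotation (object quoted) { this.quoted = quoted; }
    public override bool EvalStep (out object answer, ref SCode expression)
    {
        answer = this.quoted;    // self-evaluating: return the object itself
        return false;
    }
}

sealed class Conditional : SCode
{
    readonly SCode predicate, consequent, alternative;
    public Conditional (object predicate, object consequent, object alternative)
    {
        // Constructors normalize their subfields on initialization.
        this.predicate   = EnsureSCode (predicate);
        this.consequent  = EnsureSCode (consequent);
        this.alternative = EnsureSCode (alternative);
    }
    public override bool EvalStep (out object answer, ref SCode expression)
    {
        SCode expr = this.predicate;
        while (expr.EvalStep (out answer, ref expr)) { };
        expression = (answer is bool) && (bool) answer == false
            ? this.alternative
            : this.consequent;
        return true;
    }
}

static class QuotationDemo
{
    static void Main ()
    {
        // The raw boolean and strings are wrapped for us.
        SCode expr = new Conditional (false, "yes", "no");
        object answer;
        while (expr.EvalStep (out answer, ref expr)) { };
        Console.WriteLine (answer);
    }
}
```

The demo prints `no`: the embedded literals never reach the dispatch loop unwrapped.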

MIT Scheme is naturally tail recursive. The underlying dispatch loop is a while statement. The stack provided by the physical machine is not used, so the code only has to manage a virtual stack.

While the .NET CLR has the ability to perform tail-recursive calls, the C# compiler makes no use of it. This means we have to go out of our way to implement tail recursion. There are several techniques for this, but the only realistic one that allows us to use the normal function calling mechanism is a trampoline. (Baker's Cheney on the M.T.A technique won't work because the CLR prohibits the creation of linked objects on the stack.) In my implementation, the trampoline looks like this:
    public sealed class Interpreter
    {
        SCode expression;
        Environment environment;
        Continuation continuation;

        void Run ()
        {
            while (true)
                this.expression.EvalStep (this);
        }
    }
Each ‘bounce’ through the trampoline fetches the contents of the expression register and invokes the EvalStep method. The interpreter object itself is passed to the method. The method performs its actions and modifies the interpreter's expression register before returning to the trampoline. The various EvalStep methods can ‘return’ a value by passing it on to the continuation object.

There's not much really interesting here although there is one amusing observation. To preserve tail recursion, all calls have to go through the trampoline to empty the C# stack. However, when invoking a Scheme continuation, we do not have to use the trampoline because there are no infinite chains of continuations (so control eventually returns with finite resource usage). Therefore, the routine that implements the continuation is called like a normal function. We have this curious inversion: when calling a Scheme function, we perform a return in C# in order to pop the C# stack and transfer control via the trampoline, but when performing a return in Scheme (invoking a continuation), we use a C# call and push the C# stack.

Next, a heap is fast, but the stack is faster...

Threaded interpreter

I mentioned earlier that I was playing around with my own version of MIT/GNU Scheme. Part of the reason was to understand why interpreters are so much slower than compilers and where the ‘interpretation overhead’ comes from.

Virtual machines are largely like their physical counterparts, for two reasons:
  1. we aren't that imaginative, so computer architectures have the same basic building blocks,
  2. it's easier to implement an instruction if you can just use the physical hardware.
But sometimes you can find a virtual machine performing an operation very similar to the physical one, but in a far less efficient manner. Take for example a simple byte-code interpreter. The byte codes are stored in a vector and are addressed by a virtual program counter. The main dispatch loop is like this:
  while (true) {
    byte instruction = instructionStream[programCounter];
    switch (instruction) {
        case NOP:
            programCounter += 1;
            break;

        case JUMP:
            programCounter = instructionStream[programCounter + 1];
            break;
        }
    }
This is pretty straightforward and we know it doesn't perform that well. We can blame branch prediction, locality, memory traffic, etc. But the real problem is simply this: the CPU has specialized hardware that is specifically designed to efficiently fetch and decode an instruction stream, and we aren't using it. It's sitting in the background repeatedly fetching and decoding the ‘while’ loop, but we're using the main data paths and the program logic to fetch and decode our virtual instruction stream. The physical program counter is being incremented automatically by a specialized adder, but the virtual one has nothing to do with it and is using the main ALU.

Why do we do this? First, the virtual instruction set is quite a bit different from the native instruction set; second, it is somewhat tricky to dynamically install new code in the running image, but quite easy to install new data. But people have come up with other techniques to deal with this problem.

One technique is JIT compilation. JIT compilation translates the virtual instruction stream to a physical instruction stream. In its simplest form, it can simply concatenate the subsections of code that implement each virtual instruction. JIT compilers typically perform very few optimizations, yet the resulting code is substantially faster. This is because we are now co-opting the instruction fetch hardware on the CPU to fetch and decode our translated virtual instructions directly.

Threaded interpretation is a different tactic. Threaded interpretation actually refers to three similar techniques. ‘Subroutine’ threaded interpretation substitutes each virtual instruction with a native CALL instruction to the routine that implements the virtual instruction. So the resulting instruction stream would look like this:
   call push_immediate 3
   call push_variable x
   call add
   .... 
This gives a pretty good speedup because we are again using the instruction fetch hardware to do our dispatch, but we also end up using the stack hardware for no good reason. On the other hand, it is extremely simple to get working.

Direct threaded code works by embedding the actual code that implements the virtual instructions within the data structure that represents the source language. The data structures are laid out in memory with the data first followed by a code block. The data structures are referred to by the pointer to the code block. You interpret the code by literally jumping to it. The data structures in the source code have the literal addresses of their substructures in them, so the native instruction fetching hardware will traverse the data structures directly.

Direct threading has mixed performance. On some machines, this is very fast, but other machines may make use of the fact that code and data are not often directly intermixed. Modern hardware even raises an error condition if it detects this because it is symptomatic of the kind of self-modifying code that malicious software uses. Moving the data structures (for garbage collection) is also problematic because you have to re-link the code blocks that are embedded within them.

Indirect threading works by putting the address of the code that implements the virtual instructions (rather than the code itself) within the data structures that represent the source language. Rather than directly jumping into the data structure itself, you perform an indirect jump to the address stored in the data structure. The extra level of indirection causes some minor delay, but it is still quite fast and fairly easy to implement.
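
In a managed language the indirect jump can be approximated with a table of delegates: each cell of the program holds a reference to the handler for that instruction, so dispatch is one indirect call per cell. A toy sketch (all names are mine; real indirect-threaded code uses an indirect jump, not a call):

```csharp
using System;

// Approximation of indirect threading: the 'program' is an array of
// references to instruction handlers, and the dispatch loop jumps
// (well, calls) through whatever reference the virtual PC points at.
static class IndirectThreaded
{
    delegate void Op (int[] stack, ref int sp, ref int pc);

    static void Push3 (int[] stack, ref int sp, ref int pc) { stack[sp++] = 3; pc++; }
    static void Push4 (int[] stack, ref int sp, ref int pc) { stack[sp++] = 4; pc++; }
    static void Add   (int[] stack, ref int sp, ref int pc) { stack[sp - 2] += stack[sp - 1]; sp--; pc++; }
    static void Halt  (int[] stack, ref int sp, ref int pc) { pc = -1; }

    static void Main ()
    {
        // Computes 3 + 4.
        Op[] program = { Push3, Push4, Add, Halt };
        int[] stack = new int[16];
        int sp = 0, pc = 0;
        while (pc >= 0)
            program[pc] (stack, ref sp, ref pc);
        Console.WriteLine (stack[0]);
    }
}
```

This prints `7`. Note that the main loop contains no switch at all; the instruction's identity lives entirely in the reference stored in the program array.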

There is one more technique that I haven't seen a name for, so allow me to call it ‘virtual threading’. We use a slightly higher level language and implement our interpretation subroutines as ‘virtual methods’. Implementations of languages with ‘virtual methods’ typically have a table of methods that are shared by all instances of a class of data structure. The address of the table is stored somewhere in the data structure and invoking a ‘virtual method’ basically involves dereferencing the table pointer and performing an indirect jump. Since this involves two indirections, and a call and return, it is slower than indirect threading, but there are many advantages.
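
A toy stack machine in this style might look like the sketch below (all names are mine): the program is an array of instruction objects, and dispatch rides on the language's built-in virtual call.

```csharp
using System;

// 'Virtual threading': each instruction is an object whose virtual
// Execute method implements it.  Dispatch is a table-pointer
// dereference plus an indirect call, supplied by the runtime.
abstract class Insn { public abstract void Execute (Machine m); }

sealed class Machine
{
    public int[] Stack = new int[16];
    public int Sp, Pc;
}

sealed class PushConst : Insn
{
    readonly int value;
    public PushConst (int value) { this.value = value; }
    public override void Execute (Machine m) { m.Stack[m.Sp++] = this.value; m.Pc++; }
}

sealed class AddTop : Insn
{
    public override void Execute (Machine m) { m.Stack[m.Sp - 2] += m.Stack[m.Sp - 1]; m.Sp--; m.Pc++; }
}

sealed class Halt : Insn
{
    public override void Execute (Machine m) { m.Pc = -1; }
}

static class VirtualThreadedDemo
{
    static void Main ()
    {
        // Computes 3 + 4 by walking the instruction objects.
        Insn[] program = { new PushConst (3), new PushConst (4), new AddTop (), new Halt () };
        Machine m = new Machine ();
        while (m.Pc >= 0)
            program[m.Pc].Execute (m);
        Console.WriteLine (m.Stack[0]);
    }
}
```

This prints `7`. The instruction operands live in the instruction objects themselves, which is exactly the shape the S-Code interpreter exploits with EvalStep.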

First, and primarily, we are once more using the instruction fetching hardware to fetch our instruction stream directly rather than using the main data paths. This improves the performance.

Secondly, by laying out our interpreter data structures in a format that is nearly ubiquitous these days, we can take advantage of all the software that has been designed to manipulate and manage these sorts of data structures. Additionally, we can more easily interoperate with third-party software because it expects foreign data to be laid out in this manner.

Third, we can use a high-level language like C# or Java to implement the interpreter without incurring too much of a performance penalty. It is very difficult to implement indirect threading in these languages, but virtual method dispatch is built-in.

Finally, because ‘virtual method’ calling is so ubiquitous, there is an incentive to the hardware manufacturers to ensure that it is efficient.

Teaser

I did some interesting things over the long weekend. It's going to take a few blog posts to explain them fully, but I think this will be worth it.