Tuesday, September 2, 2008

Threaded interpreter

I mentioned earlier that I was playing around with my own version of MIT/GNU Scheme. Part of the reason was to understand why interpreters are so much slower than compilers and where the ‘interpretation overhead’ comes from.

Virtual machines are largely like their physical counterparts. This is because of two reasons:
  1. we aren't that imaginative, so computer architectures have the same basic building blocks,
  2. it's easier to implement an instruction if you can just use the physical hardware.
But sometimes you can find a virtual machine performing an operation very similar to the physical one, but in a far less efficient manner. Take for example a simple byte-code interpreter. The byte codes are stored in a vector and are addressed by a virtual program counter. The main dispatch loop is like this:
  while (true) {
    byte instruction = instructionStream[programCounter];
    switch (instruction) {
        case NOP:
            programCounter += 1;

        case JUMP:
            programCounter = instructionStream[programCounter + 1];
This is pretty straightforward and we know it doesn't perform that well. We can blame branch prediction, locality, memory traffic, etc. But the real problem is simply this: the CPU has specialized hardware that is specifically designed to efficiently fetch and decode an instruction stream, and we aren't using it. It's sitting in the background repeatedly fetching and decoding the ‘while’ loop, but we're using the main data paths and the program logic to fetch and decode our virtual instruction stream. The physical program counter is being incremented automatically by a specialized adder, but the virtual one has nothing to do with it and is using the main ALU.

Why do we do this? First of all, the virtual instruction set is quite a bit different from the native instruction set, and second of all it is somewhat tricky to dynamically install new code in the running image, but quite easy to install new data. But people have come up with other techniques to deal with this problem.

One technique is JIT compilation. JIT compilation translates the virtual instruction stream to a physical instruction stream. In it's simplest form, it can simply concatenate the subsections of code that implement each virtual instruction. JIT compilers typically perform very few optimizations, yet the resulting code is substantially faster. This is because we are now co-opting the instruction fetch hardware on the CPU to fetch and decode our translated virtual instructions directly.

Threaded interpretation is a different tactic. Threaded interpretation acutally refers to three similar techniques. ‘Subroutine’ threaded interpretation substitutes each virtual instruction with a native CALL instruction to the routine that implements the virtual instruction. So the resulting instruction stream would look like this:
   call push_immediate 3
   call push_variable x
   call add
This gives a pretty-good speedup because we are again using the instruction fetch hardware to do our dispatch, but we also end up using the stack hardware for no good reason. On the other hand, it is extremely simple to get working.

Direct threaded code works by embedding the actual code that implements the virtual instructions within the data structure that represents the source language. The data structures are laid out in memory with the data first followed by a code block. The data structures are referred to by the pointer to the code block. You interpret the code by literally jumping to it. The data structures in the source code have the literal address of substructure in them, so the native instruction fetching hardware will be traversing the data structures directly.

Direct threading has mixed performance. On some machines, this is very fast, but other machines may make use of the fact that code and data are not often directly intermixed. Modern hardware even raises an error condition if it detects this because it is symptomatic of the kind of self-modifying code that malicious software uses. Moving the data structures (for garbage collection) is also problematic because you have to re-link the code blocks that are embedded within them.

Indirect threading works by putting the address of the code that implements the virtual instructions (rather than the code itself) within the data structures that represent the source language. Rather than directly jumping into the data structure itself, you perform an indirect jump to the address stored in the data structure. The extra level of indirection causes some minor delay, but it is still quite fast and fairly easy to implement.

There is one more technique that I haven't seen a name for, so allow me to call it ‘virtual threading’. We use a slightly higher level language and implement our interpretation subroutines as ‘virtual methods’. Implementations of languages with ‘virtual methods’ typically have a table of methods that are shared by all instances of a class of data structure. The address of the table is stored somewhere in the data structure and invoking a ‘virtual method’ basically involves dereferencing the table pointer and performing an indirect jump. Since this involves two indirections, and a call and return, it is slower than indirect threading, but there are many advantages.

First, and primarily, we are once more using the instruction fetching hardware to fetch our instruction stream directly rather than using the main data paths. This improves the performance.

Secondly, by laying out our interpreter data structures in a format that is nearly ubiquitous these days, we can take advantage of all the software that has been designed to manipulate and manage these sorts of data structures. Additionally, we can more easily interoperate with third-party software because they expect foreign data to be layed out in this manner.

Third, we can use a high-level language like C# or Java to implement the interpreter without incurring too much of a performance penalty. It is very difficult to implement indirect threading in these languages, but virtual method dispatch is built-in.

Finally, because ‘virtual method’ calling is so ubiquitous, there is an incentive to the hardware manufacturers to ensure that it is efficient.

1 comment:

Duncan said...

The paper and code from this presentation is not publicly available yet, afaik, but you might find this presentation on 'PicoScheme' from this year's Workshop on Self-sustaining Systems (S3) 2008 interesting: