Wednesday, November 12, 2008

You want to know what I think? I'll tell you what I think. Here's what I think: Java Java Java is is is too too too damn damn damn verbose verbose verbose. That's what I think. And I'm sticking to it. So there.

You want to know what I think? I'll tell you what I think. Here's what I think: There's There's There's way way way too too too much much much boilerplate boilerplate boilerplate in in in Java Java Java. That's what I think. And I'm sticking to it. So there.

You want to know what I think? I'll tell you what I think. Here's what I think: If If If Charles Charles Charles Dickens Dickens Dickens had had had designed designed designed a a a language language language, it it it would would would be be be a a a lot lot lot like like like Java Java Java. That's what I think. And I'm sticking to it. So there.

I was writing a small helper class to correctly paste together the CGI arguments to an http service. The idea is simple. Rather than simply appending strings in a kludgy way, you'd collect a number of instances of the objects you want to pass (mostly integers, text, and enums) and the code would magically marshal them in to a correctly escaped URI. No big deal.

Two days later...

Sixteen hundred lines of code distributed in twenty-nine files. Is it any wonder that people get carpal tunnel syndrome? And this is mostly `throwaway' code that abstracts only the small amount of functionality I need. I don't even want to think how awful it would be if I had tried to engineer a full-blown class-based utility.

Of course I was asked “Why didn't you just use the mumble class?” A few reasons, but the primary reason was that the suggested class worked like this:

    FooBuilder fooBuilder = new fooBuilder();

    FooPartBuilder fooPartBuilder = new fooPartBuilder();
    fooPartBuilder.addName ("fooPartName");
    fooPartBuilder.addValue ("fooPartValue");

    FooPart fooPart = fooPartBuilder.createPart();
    fooBuilder.addPart (fooPart);
    .... etc. etc. ...

I wanted something like this:

    Foo myFoo = new Foo (new FooPart ("fooPartName", "fooPartValue"),
                         new BellOption (BellOption.Sound.Tinkle),
                         new WhistleOption (), ...)

What's the difference? As I was skimming the code for FooBuilder (and the code for AbstractFoo, FooInterface, AbstractFooBase, etc...) I saw a note that mentioned ‘this isn't thread safe’. Great. Another thing to think about. The problem with the FooBuilder model is that it is just lousy with state. It starts empty and then you smash some state into it. This is stupid. It begins in a state that no one wants it to be in, so the first thing everyone does is smash the state to something more useful.

They way I wanted to do it was to simply construct the object in its final configuration. The object is immutable and stateless. It is guaranteed to be thread safe because there is no way to modify it. The JVM could keep separate copies all over the place for all I care. And because there are no mutators, the code is about 30% smaller. That's 30% fewer javadoc comments that replicate the argument lists.

Another gripe: Having a database that keeps track of where the source code resides is a good idea. Using the file system as a ‘poor man's database’ for that purpose is a bad idea. To give a concrete example, I wanted to say, formally, that ‘every optional CGI argument must provide a marshaling function’. Simple statement. Here goes:

    public abstract class OptionalCgiArgument {
        abstract String marshal() throws MarshalingException;
    }

But it should be in a file all by itself, which means a copyright notice, a package declaration, and, if I'm being pedantic, a Javadoc comment.

    // Copyright (c) 2008

    package com.foo.bar.baz.utility.api.helper;

    /**
     *  Abstract base class for OptionalCgiArguments.
     *  Ensures that all OptionalCgiArguments provide
     *  a marshaling method.
     *
     * @author jmarshall
     */
    public abstract class OptionalCgiArgument {
        abstract String marshal() throws MarshalingException;
    }

Did I forget the unit test?

Tuesday, November 4, 2008

I haven't dropped off the face of the planet, but I don't have much to report. Mostly, I've been hacking on this Scheme interpreter off and on trying to figure out how to minimize the amount of variable lookup. I have a plan, but it is a bit complex, so I'm writing some Scheme code to generate some C# code that will work in the interpreter.

Thursday, October 9, 2008

Now that we have static environments, we can take advantage of the known locations of lexical variables. For instance, if the variable is bound in the immediately enclosing lambda, we can simply fetch it because it cannot be shadowed. If a variable is completely free in a fragment of code, that is, if it is a `global' or `top-level' variable, we can grab the value cell at load time and simply dereference it to evaluate the variable.

Before loading or evaluating a piece of SCode, we make a pre-pass over it to `compile' the variable lookup. (We could do this at lambda-construction time, but then we'd have to make multiple passes over deeply nested lambda expressions. Nonetheless, multiple passes are much simpler to understand. When contructing a lambda, we'd walk the body and replace all the lambda-bound variables that are free in the body with variables that `knew' their lexical depth. And it is unlikely that the lexical depth will exceed a very small number, so the multiple passes wouldn't add up. On the other hand, you would still need a final pass just before evaluation, and if you were too simple-minded about multiple passes you'd end up with an exponential growth factor.)

As I was saying, we make a pre-pass over the code to `compile' the variable lookup. We create a LexicalMap that holds a model of the runtime environment, and as we walk the code, we simulate the variable lookup against the lexical map. As we simulate the lookup, we note how deep we have to search, whether there will be incremental bindings in any of the frames, and where we ultimately find the variable (if at all). If we find the variable, we look at the path we used to get at it and create an instance of a specialized variable class that performs that particular style of lookup at runtime. If we don't find the variable, we create an instance of a specialized variable class that performs a deep search.

Wednesday, October 8, 2008

Look! Up in the environment!

Variable lookup is right up there with continuation management as an important performance bottleneck. In one benchmark, I measured 2.9 billion evaluations of which 991 million (34%) of them were variable lookups. A good chunk of compiler optimizations revolves around getting the variables to where they need to be as quickly as possible and keeping them in easy to find places.

My interpreter uses the standard `environment model' of evaluation where a linked chain of lexical environment chains are maintained during evaluation. Every variable lookup goes through a `deep search' of the frames looking for a binding with a matching name. Needless to say, this is not blazingly fast. There are two parts to solving this. First is to reduce the need for the deep search, second is to optimize the fast path.

MIT Scheme supports first-class environments. The special form the-environment returns the lexical environment at the point of invocation. The returned environment can be used as an argument to eval and that's where the trouble begins. If eval is called, the user is going to expect that the code will run in a context with the same bindings visible as where the environment was captured. This means we cannot move or collapse the environment frames because it may cause changes visible to the user. Even worse, if the user evals a define expression, or loads a file, it may inject bindings that were not previously visible. This means we cannot even cache the locations of the variables. The flexibility of first-class environments comes with a heavy cost. On the other hand, this flexibility is hardly ever used, so we end up paying this cost to no advantage.

So how do we fix this? First, because (the-environment) is the only documented way to obtain a first-class environment, we can walk the SCode and determine whether a first-class environment will be needed. I do this when I construct an SCode lambda expression. If the body contains a call to (the-environment), I construct a StandardLambda object, but if it does not, I construct a StaticLambda object. A StandardLambda, when evaluated, constructs a StandardClosure. A StandardClosure, when applied, constructs a StandardEnvironment. A StandardEnvironment must be layed out exactly as specified by the source code and variable lookups must do a deep search because the user might inject bindings via define or load.

A StaticEnvironment however, has an important property: because the user cannot use load or eval with it (he can't name the object to pass it as an argument, so this does not create a user-visible restriction), we know that there will not be incremental definitions that could hide the bindings in the environment. We know the lexical depth and offset of all variables bound in a StaticEnvironment.

So the first task is to identify and separate StandardLambda and StaticLambda instances.

Monday, October 6, 2008

Weird bug

I found (to some extent) the bug that has been vexing me for the past couple of weeks. It is obscure and I can't think of a good reason that things would work this way, but I do have a reliable way to avoid the bug.

The symptoms were this: After booting Scheme and getting to the REPL, forms typed in the REPL would be incorrectly rewritten with uninterned symbols appearing in body of the form. Naturally, these would not match the bindings of the interned symbols appearing in the binding lists, so unbound variables would be raised.

I first changed the way uninterned symbols printed so I could easily identify them. This showed a second symptom: the forms typed to the REPL would not be rewritten the same each time. Retyping the form to the REPL would give a different answer on the second rewrite.

Despite these problems, enough of Scheme is working to load and initialize the world and get to the prompt. This takes some 6 million evaluations.

A little poking around led me to the guts of the syntaxer. This is the component of MIT Scheme that builds the abstract syntax tree from s-expressions. It has two renaming phases: first, it renames all identifiers to uninterned symbols. This has the effect of alpha-renaming the bound variables so that macro expansion cannot inadvertently capture bindings. After everything is expanded, the second renaming phase renames the indentifiers back to their original names provided that it does not introduce a naming conflict. If there is a naming conflict, a non-conflicting identifier is generated for the name.

The renaming information is kept in a set of hash tables called the rename-database. What appeared to be happening was that during the second phase the code either thought that there was a conflict in renaming the variables in the body of some lambda expressions, or it could not find the entry in the rename-database that contained the original name. Either explanation could account for it. I suspected a problem with hash tables.

Now the first baffling thing is this: most of MIT Scheme is written in Scheme itself, so the hash table implementation is really only dependent on some very trivial arithmetic. Furthermore, since the hash code is computed by calling a primitive procedure that I wrote, I was able to ensure that all symbols would hash to the same bucket, thus turning hash-table lookup into a linear search for symbols. This depends only on checking EQ-ness of symbols.

The CLR has a notion of interned and uninterned immutable strings. These are semantically equivalent to symbols and I used them as such. I thought that perhaps some uninterned strings were being inadvertently interned or that perhaps some strings I thought were interned were not. I added a two-character prefix to my uninterned strings whenever I created them and I added code that checked that if the two characters were present, the symbol was uninterned, and if absent the symbol was interned. Furthermore, I checked that no interned symbol had the prefix and no uninterned symbol was missing the prefix. I performed this check on every call to EQ and on every call to a Scheme primitive that manipulated symbols. I even put a test in the variable lookup code to check the state of any symbol that may have been placed in a variable.

This code showed clearly that I was correctly managing the uninterned and interned symbol sets. No symbol accidentally migrated.

The addition of the prefix characters helped when stepping through the code. It was easy to see which symbols were uninterned and which were interned by simply looking for the prefix. Unfortunately, it is hard to distinguish between two different uninterned symbols with the same name (and there cannot be two interned symbols with the same name). I could see a few cases where two uninterned symbols with the same name were being compared and the answer indicated they were different symbols.

So to make it easier for me to tell which uninterned symbols were which, I changed the Scheme primitive that creates uninterned symbols to add a unique prefix to each one.

The bug went away.

`Aha!' you say, it *was* a case of accidentally confusing two symbols. Well, I *thought* that, too, but it is weirder than that. If giving each uninterned symbol a unique name solves the problem, then we seem to have this kind of bug:

Two symbols that are NOT identical (occupy different locations in memory) are determined to be identical because they contain the same sequence of characters.

Now I suspected a bug of this sort fairly early on in the debugging process, so I was a bit confused why I didn't find it earlier. (After all, it seems obvious)

Here is an obvious fact: the number of uninterned symbols is strictly equal to or less than the number of calls made to the primitive that creates them. There is exactly one routine in my interpreter that creates uninterned symbols, so each one must come from a call to that routine. Of course I had put a breakpoint at that routine so I could see who was generating these symbols.

The call to make-uninterned-symbol happens exactly *once* per identifier. This, too, make sense: you only need one uninterned name per identifier.

So if we are comparing two symbols that are NOT identical, where did the second one come from?

Ah, you say, maybe it was left over from a previous expansion. As it so happens, I instrumented EQ quite heavily because I wanted to be sure I knew what was what. When I use the code that creates each symbol uniquely, I can see what the uninterned symbol is compared against. It is compared against itself and the original interned symbol. Under no circumstances did I see a comparison with any uninterned symbol *other* than itself. That makes sense because I was working with very simple lambda expressions that had only one bound variable.

So if we are creating only *one* uninterned symbol per binding, and only checking against that one uninterned symbol, how could we confuse them?

The explanation I'm stuck with is this: the CLR may not be guaranteeing that uninterned symbols have reference semantics. That is, the `identity' of an uninterned string (as defined as the location in memory where the bits are) might not be preserved, but can even *split* (two strings that were once EQ become not EQ). I can't think of why an implementation would do this, but it does explain what I am seeing. (I'm thinking there might be an explanation involving cross-generational pools of uninterned strings, but I dunno.)

The workaround is trivial: I no longer use uninterned strings as `naked' object but wrap them in a trivial class object. Since class instances are defined to have reference semantics, there is no problem with using object-equality on them.

I suppose I can try to write a program that demonstrates this issue, but I'm getting tired of debugging it and I have a workaround. But this is one weird bug.

Wednesday, October 1, 2008

A curious bug

I've been hacking away on my Scheme system, but I'll give a full update later on. I've encountered a curious bug, though. When I get to the REPL and type (let ((x 3)) x), I get an unbound variable error. It seems that the syntaxer (the part of MIT Scheme that expands macros and converts the s-expressions into s-code) is renaming the bound variable x.

What is curious about this bug is the specific nature of it. Since my version of MIT-Scheme is still incomplete, I often run into missing primitives or as-yet-unimplemented behavior. This just drops me into the C# debugger. But in order to cold-load and initialize Scheme from an empty image, I needed to get a lot of stuff working correctly. Scheme does some 6 million evaluations during the cold load, so I know that large portions of the language work just fine. The unimplemented portions just crash, so it is somewhat baffling that I've encountered something that is so obviously incorrect, yet correct enough to allow booting and even to attempt evaluation.

My guess right now is that there is something slightly wrong with hash tables. It's just a hunch, though. I have tried some simple tests from the REPL, and they look ok. Unfortunately, there is a lot of evaluation going on between hitting the return key and actually running `eval', so I haven't been able to find the problem by tracing the interpretation.

Friday, September 12, 2008

Correction

“Hamiltonian graphs cannot be expressed in existential monadic second order logic because palindromes are not a regular language.”

Apparently, Keith Ellul immediately realized that without MONADIC the statement is false, because it is “trivial” to see that hamiltonian graphs can be expressed in existential second order logic.

I can't wait to tell the guys at the pub.