Monday, October 6, 2008

Weird bug

I found (to some extent) the bug that has been vexing me for the past couple of weeks. It is obscure and I can't think of a good reason that things would work this way, but I do have a reliable way to avoid the bug.

The symptoms were these: after booting Scheme and getting to the REPL, forms typed at the REPL would be incorrectly rewritten, with uninterned symbols appearing in the body of the form. Naturally, these would not match the bindings of the interned symbols appearing in the binding lists, so unbound-variable errors would be raised.

I first changed the way uninterned symbols printed so I could easily identify them. This exposed a second symptom: forms typed at the REPL were not rewritten the same way each time. Retyping a form would produce a different rewrite the second time around.

Despite these problems, enough of Scheme is working to load and initialize the world and get to the prompt. This takes some 6 million evaluations.

A little poking around led me to the guts of the syntaxer, the component of MIT Scheme that builds the abstract syntax tree from s-expressions. It has two renaming phases. First, it renames all identifiers to uninterned symbols; this alpha-renames the bound variables so that macro expansion cannot inadvertently capture bindings. After everything is expanded, the second phase renames the identifiers back to their original names, provided that doing so does not introduce a naming conflict. If there is a conflict, a non-conflicting identifier is generated for the name.
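To make the two phases concrete, here is a minimal sketch, written in Java as an analogy (the real syntaxer is Scheme running on my CLR interpreter; the names `freshen` and `restore` are mine, not MIT Scheme's):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the syntaxer's two renaming phases; the
// class and method names are invented, not MIT Scheme's.
class Renamer {
    private int counter = 0;
    private final Map<String, String> originalOf = new HashMap<>(); // fresh -> original

    // Phase 1: alpha-rename each identifier to a fresh, unique name
    // so macro expansion cannot capture bindings.
    String freshen(String identifier) {
        String fresh = identifier + "%" + counter++;
        originalOf.put(fresh, identifier);
        return fresh;
    }

    // Phase 2: rename back to the original name, provided that does
    // not conflict with a name already restored in this scope.
    String restore(String fresh, Set<String> namesInScope) {
        String original = originalOf.get(fresh);
        if (original == null || !namesInScope.add(original)) {
            return fresh; // conflict: keep a non-conflicting name
        }
        return original;
    }
}
```

The bug described below lives in phase 2: `restore` has to look the fresh name up in the rename-database, and that lookup was going wrong.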

The renaming information is kept in a set of hash tables called the rename-database. What appeared to be happening was that during the second phase the code either thought that there was a conflict in renaming the variables in the body of some lambda expressions, or it could not find the entry in the rename-database that contained the original name. Either explanation could account for it. I suspected a problem with hash tables.

Now the first baffling thing is this: most of MIT Scheme is written in Scheme itself, so the hash table implementation is really only dependent on some very trivial arithmetic. Furthermore, since the hash code is computed by calling a primitive procedure that I wrote, I was able to ensure that all symbols would hash to the same bucket, thus turning hash-table lookup into a linear search for symbols. This depends only on checking EQ-ness of symbols.
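The degenerate table can be sketched like this (again a Java analogy with invented names, not the actual MIT Scheme code): with every key forced into one bucket, lookup is just a linear scan that compares keys by reference, the way EQ does.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the debugging trick: force every key into a single bucket
// so lookup reduces to a linear search using only reference (EQ)
// comparison. Class and method names are invented for illustration.
class OneBucketTable<V> {
    private static final class Entry<V> {
        final Object key; // identity matters; the characters do not
        V value;
        Entry(Object key, V value) { this.key = key; this.value = value; }
    }

    private final List<Entry<V>> bucket = new ArrayList<>(); // the single bucket

    void put(Object key, V value) {
        for (Entry<V> e : bucket) {
            if (e.key == key) { e.value = value; return; } // EQ check
        }
        bucket.add(new Entry<>(key, value));
    }

    V get(Object key) {
        for (Entry<V> e : bucket) {
            if (e.key == key) return e.value; // EQ check
        }
        return null;
    }
}
```

Note that in such a table, a key with the same characters but a different identity misses entirely. The whole arrangement leans on the runtime preserving object identity, which is exactly the assumption that comes into question later.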

The CLR has a notion of interned and uninterned immutable strings. These are semantically equivalent to symbols, and I used them as such. I thought that perhaps some uninterned strings were being inadvertently interned, or that some strings I thought were interned were not. So I added a two-character prefix to every uninterned string when I created it, along with code that checked the invariant both ways: every symbol carrying the prefix had to be uninterned, and every symbol without it had to be interned. I performed this check on every call to EQ and on every call to a Scheme primitive that manipulates symbols. I even put a test in the variable-lookup code to check the state of any symbol that might have been placed in a variable.
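The interned/uninterned distinction and the prefix invariant can be illustrated with Java's analogous string pool (this is only an analogy; the post concerns the CLR, and the prefix and helper names are made up):

```java
// Java analog of interned vs. uninterned immutable strings. The
// two-character prefix and the helper names are invented for
// illustration, not taken from the actual interpreter.
class InternDemo {
    static final String PREFIX = "#["; // marker carried by uninterned symbols

    // `new String` guarantees a fresh object that is not the
    // canonical interned instance, even when the characters match.
    static String makeUninterned(String name) {
        return new String(PREFIX + name);
    }

    // The invariant checked on every EQ call: the prefix is present
    // if and only if the symbol is uninterned.
    static boolean looksUninterned(String s) {
        return s.startsWith(PREFIX);
    }
}
```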

This code showed clearly that I was correctly managing the uninterned and interned symbol sets. No symbol accidentally migrated.

The addition of the prefix characters helped when stepping through the code. It was easy to see which symbols were uninterned and which were interned by simply looking for the prefix. Unfortunately, it is hard to distinguish between two different uninterned symbols with the same name (and there cannot be two interned symbols with the same name). I could see a few cases where two uninterned symbols with the same name were being compared and the answer indicated they were different symbols.

So to make it easier for me to tell which uninterned symbols were which, I changed the Scheme primitive that creates uninterned symbols to add a unique prefix to each one.

The bug went away.

`Aha!' you say, `it *was* a case of accidentally confusing two symbols.' Well, I *thought* that too, but it is weirder than that. If giving each uninterned symbol a unique name solves the problem, then we seem to have this kind of bug:

Two symbols that are NOT identical (occupy different locations in memory) are determined to be identical because they contain the same sequence of characters.

Now I had suspected a bug of this sort fairly early on in the debugging process, so I was a bit confused about why I hadn't found it earlier. (After all, it seems obvious.)

Here is an obvious fact: the number of uninterned symbols can be no greater than the number of calls made to the primitive that creates them. There is exactly one routine in my interpreter that creates uninterned symbols, so each one must come from a call to that routine. Naturally, I had put a breakpoint at that routine so I could see who was generating these symbols.

The call to make-uninterned-symbol happens exactly *once* per identifier. This, too, makes sense: you only need one uninterned name per identifier.

So if we are comparing two symbols that are NOT identical, where did the second one come from?

Ah, you say, maybe it was left over from a previous expansion. As it happens, I had instrumented EQ quite heavily because I wanted to be sure I knew what was what. With the code that names each symbol uniquely, I can see what each uninterned symbol is compared against: it is compared against itself and against the original interned symbol. Under no circumstances did I see a comparison with any uninterned symbol *other* than itself. That makes sense, because I was working with very simple lambda expressions that had only one bound variable.

So if we are creating only *one* uninterned symbol per binding, and only checking against that one uninterned symbol, how could we confuse them?

The explanation I'm stuck with is this: the CLR may not guarantee that uninterned strings have reference semantics. That is, the `identity' of an uninterned string (defined as the location in memory where the bits are) might not be preserved; two strings that were once EQ can even *split* and become not EQ. I can't think of why an implementation would do this, but it does explain what I am seeing. (I'm thinking there might be an explanation involving cross-generational pools of uninterned strings, but I dunno.)

The workaround is trivial: I no longer use uninterned strings as `naked' objects but wrap them in a trivial class instance. Since class instances are defined to have reference semantics, there is no problem with using object equality on them.
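A sketch of that workaround, again in Java terms (the class is hypothetical; the actual interpreter is C# on the CLR): the wrapper object's identity, not the string's, is what EQ now compares, and object identity for class instances is guaranteed.

```java
// Sketch of the workaround: never use the string itself as the
// symbol; wrap it in a trivial class whose instances are guaranteed
// reference semantics. The class name is invented for illustration.
final class Symbol {
    final String name; // the characters, used for printing only

    Symbol(String name) { this.name = name; }

    // Deliberately no equals()/hashCode() overrides: equality is
    // plain object identity, which is exactly what EQ needs.
}
```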

I suppose I can try to write a program that demonstrates this issue, but I'm getting tired of debugging it and I have a workaround. But this is one weird bug.