I found (to some extent) the bug that has been vexing me for the past
couple of weeks. It is obscure and I can't think of a good reason that
things would work this way, but I do have a reliable way to avoid the bug.
The symptoms were this: After booting Scheme and getting to the REPL,
forms typed in the REPL would be incorrectly rewritten with uninterned symbols
appearing in the body of the form. Naturally, these would not match the bindings
of the interned symbols appearing in the binding lists, so unbound-variable
errors would be raised.
I first changed the way uninterned symbols printed so I could easily
identify them. This showed a second symptom: the forms typed to the
REPL would not be rewritten the same each time. Retyping the form to
the REPL would give a different answer on the second rewrite.
Despite these problems, enough of Scheme is working to load and initialize
the world and get to the prompt. This takes some 6 million evaluations.
A little poking around led me to the guts of the syntaxer. This is the
component of MIT Scheme that builds the abstract syntax tree from
s-expressions. It has two renaming phases: first, it renames all identifiers
to uninterned symbols. This has the effect of alpha-renaming the bound
variables so that macro expansion cannot inadvertently capture bindings.
After everything is expanded, the second renaming phase renames the
identifiers back to their original names, provided that doing so does not
introduce a naming conflict. If there is a naming conflict, a non-conflicting
identifier is generated for the name.
The renaming information is kept in a set of hash tables called the
rename-database.
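In outline, the two phases look something like the sketch below. This is a minimal C# sketch (assuming the port is written in C#; the class and method names are hypothetical, not the actual syntaxer or rename-database code):

    using System.Collections.Generic;

    // Hypothetical stand-in for the rename-database: a fresh alias for each
    // identifier on the way in, and a reverse map for renaming back on the way out.
    class RenameDatabase
    {
        readonly Dictionary<string, string> reverse = new Dictionary<string, string>();
        int counter = 0;

        // Phase 1: alpha-rename.  Each identifier gets a fresh, uninterned alias
        // so that macro expansion cannot accidentally capture it.
        public string Rename(string identifier)
        {
            string alias = "g" + (counter++) + ":" + identifier;
            reverse[alias] = identifier;
            return alias;
        }

        // Phase 2: rename back to the original name, unless that would conflict
        // with a name already visible in the surrounding scope; in that case the
        // unique alias serves as the non-conflicting identifier.
        public string RenameBack(string alias, ISet<string> namesInScope)
        {
            string original = reverse[alias];
            return namesInScope.Contains(original) ? alias : original;
        }
    }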
What appeared to be happening was that during the second phase the
code either thought that there was a conflict in renaming the variables
in the body of some lambda expressions, or it could not find the entry in the
rename-database that contained the original name. Either explanation could
account for it. I suspected a problem with hash tables.
Now the first baffling thing is this: most of MIT Scheme is written in Scheme
itself, so the hash table implementation is really only dependent on some
very trivial arithmetic. Furthermore, since the hash code is computed by
calling a primitive procedure that I wrote, I was able to ensure that all
symbols would hash to the same bucket, thus turning hash-table lookup into a
linear search for symbols. This depends only on checking EQ-ness of symbols.
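The actual table is Scheme code, but the effect is easy to show in a C# sketch (hypothetical names; symbols here are CLR strings, which the next paragraph explains): with a constant hash code every key lands in one bucket, and membership is decided purely by reference equality.

    using System;
    using System.Collections.Generic;

    // Every key hashes to the same bucket, so lookup degenerates into a linear
    // scan that succeeds or fails purely on reference equality (EQ).
    sealed class EqOnlyComparer : IEqualityComparer<string>
    {
        public bool Equals(string a, string b) { return ReferenceEquals(a, b); }
        public int GetHashCode(string s) { return 0; }
    }

    class Demo
    {
        static void Main()
        {
            var table = new Dictionary<string, string>(new EqOnlyComparer());
            string sym = new string("lambda".ToCharArray());   // a fresh, uninterned instance
            table[sym] = "binding";

            Console.WriteLine(table.ContainsKey(sym));   // True: the very same object
            Console.WriteLine(table.ContainsKey(
                new string("lambda".ToCharArray())));    // False: same characters, different object
        }
    }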
The CLR has a notion of interned and uninterned immutable strings. These are
semantically equivalent to symbols and I used them as such. I thought that
perhaps some uninterned strings were being inadvertently interned or that
perhaps some strings I thought were interned were not. I added a two-character
prefix to my uninterned strings whenever I created them and I added code that
checked that if the two characters were present, the symbol was uninterned,
and if absent the symbol was interned. Furthermore, I checked that no
interned symbol had the prefix and no uninterned symbol was missing it. I
performed this check on every call to EQ and on every call to a Scheme
primitive that manipulated symbols. I even put a test in the variable-lookup
code to check the state of any symbol that might have been placed in a
variable.
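In C# terms the check amounts to something like the sketch below (the `%u' marker and the names are hypothetical; the actual two characters are not given above). String.IsInterned returns the pooled instance for a given character sequence, or null, so comparing its result by reference tells you whether this particular instance is the interned one.

    using System;

    static class SymbolInvariant
    {
        const string UninternedPrefix = "%u";   // stands in for the real two-character marker

        // True only if this particular instance is the one in the CLR intern pool.
        static bool IsInternedInstance(string s)
        {
            return ReferenceEquals(string.IsInterned(s), s);
        }

        // Marked symbols must be uninterned; unmarked symbols must be interned.
        public static void Check(string symbol)
        {
            bool marked = symbol.StartsWith(UninternedPrefix, StringComparison.Ordinal);
            if (marked == IsInternedInstance(symbol))
                throw new InvalidOperationException(
                    "interned/uninterned invariant violated: " + symbol);
        }
    }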
This code showed clearly that I was correctly managing the uninterned and
interned symbol sets. No symbol accidentally migrated.
The addition of the prefix characters helped when stepping through the code.
It was easy to see which symbols were uninterned and which were interned by
simply looking for the prefix. Unfortunately, it is hard to distinguish
between two different uninterned symbols with the same name (and there cannot
be two interned symbols with the same name). I could see a few cases where
two uninterned symbols with the same name were being compared and the answer
indicated they were different symbols.
So to make it easier for me to tell which uninterned symbols were which, I
changed the Scheme primitive that creates uninterned symbols to add a unique
prefix to each one.
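Roughly, the change amounts to something like this (a C# sketch with hypothetical names, not the actual primitive):

    using System.Threading;

    static class UninternedSymbols
    {
        static int counter;

        // Give every uninterned symbol a unique marker in its printed name, so
        // two distinct uninterned symbols can no longer look alike in the debugger.
        public static string Make(string name)
        {
            int id = Interlocked.Increment(ref counter);
            // new string(char[]) always yields a fresh instance outside the intern pool.
            return new string(("%u" + id + ":" + name).ToCharArray());
        }
    }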
The bug went away.
`Aha!' you say, it *was* a case of accidentally confusing two symbols. Well,
I *thought* that, too, but it is weirder than that. If giving each uninterned
symbol a unique name solves the problem, then we seem to have this kind of
bug:
Two symbols that are NOT identical (occupy different locations in memory) are
determined to be identical because they contain the same sequence of
characters.
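In CLR terms, that is the difference between reference equality and value equality on strings; a small illustration (not the interpreter code) of what such a confusion looks like:

    using System;

    class EqualityDemo
    {
        static void Main()
        {
            string a = new string("loop".ToCharArray());
            string b = new string("loop".ToCharArray());

            Console.WriteLine(ReferenceEquals(a, b));   // False: two different objects (what EQ should report)
            Console.WriteLine(a == b);                  // True: string's == compares the characters
            Console.WriteLine(a.Equals(b));             // True: so does Equals
        }
    }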
Now I suspected a bug of this sort fairly early on in the debugging process,
so I was a bit confused about why I didn't find it earlier. (After all, it
seems obvious.)
Here is an obvious fact: the number of uninterned symbols is less than or
equal to the number of calls made to the primitive that creates them. There
is exactly one routine in my interpreter that creates uninterned symbols, so
each one must come from a call to that routine. Of course I had put a
breakpoint at that routine so I could see who was generating these symbols.
The call to make-uninterned-symbol happens exactly *once* per identifier.
This, too, makes sense: you only need one uninterned name per identifier.
So if we are comparing two symbols that are NOT identical, where did the second
one come from?
Ah, you say, maybe it was left over from a previous expansion. As it so
happens, I instrumented EQ quite heavily because I wanted to be sure I knew
what was what.
When I use the code that creates each symbol uniquely, I can see what the
uninterned symbol is compared against. It is compared against itself and the
original interned symbol. Under no circumstances did I see a comparison with
any uninterned symbol *other* than itself. That makes sense because I was
working with very simple lambda expressions that had only one bound variable.
So if we are creating only *one* uninterned symbol per binding, and only
checking against that one uninterned symbol, how could we confuse them?
The explanation I'm stuck with is this: the CLR may not be guaranteeing that
uninterned strings have reference semantics. That is, the `identity' of an
uninterned string (defined as the location in memory where the bits are)
might not only fail to be preserved, it can even *split* (two strings that
were once EQ become not EQ). I can't think of why an implementation would do
this, but it does explain what I am seeing. (I'm thinking there might be an
explanation involving cross-generational pools of uninterned strings, but I
dunno.)
The workaround is trivial: I no longer use uninterned strings as `naked'
objects but wrap them in a trivial class object. Since class instances are
defined to have reference semantics, there is no problem with using object
equality on them.
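A minimal sketch of the wrapper, assuming a C# implementation (the class name is mine):

    // Identity now lives in the object header, which the CLR guarantees to be
    // stable, rather than in the string instance itself.
    sealed class SchemeSymbol
    {
        public readonly string Name;

        public SchemeSymbol(string name) { Name = name; }

        // Deliberately no Equals/GetHashCode overrides: the default reference
        // semantics are exactly what EQ needs.
        public override string ToString() { return Name; }
    }

Two SchemeSymbol instances are EQ only if they are literally the same object, regardless of the characters in Name.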
I suppose I can try to write a program that demonstrates this issue, but I'm
getting tired of debugging it and I have a workaround. But this is one weird
bug.