The TI Explorer was a clone of the LMI Lambda. At some point when LMI
needed financing, TI and LMI did a technology swap where TI ended up
with the Lambda architecture and LMI ended up with some cash and the
NuBus. The Explorer was a close enough clone of the Lambda that it
could run the same microcode.
Ever run into one of those bugs where you begin to doubt your basic
assumptions? I hit one of those on the Explorer. The symptom was
simply this: about 15 minutes after booting, the Explorer would
crash. The proximate cause of the crash was a bad entry in the page
tables. The Lisp machine really had an operating system hiding within
it, and there's a virtual memory system with demand paging. There was
a table that kept track of what the physical memory was being used
for. If physical memory was needed and wasn't available, an existing
physical page would be written back to the hard disk. This was when
the bug was detected. A check on the page status showed it to be in
an undefined state, so the machine halted.
In theory, there was no way for a page to transition to an undefined
state, so something really bad must have happened. Fortunately, it
was always the same page, so it was clearly a programming logic bug.
It was also fortunate that it was in the page fault handler. The
Lambda and Explorer microcode is completely single threaded, and
interrupts are polled for in known locations. Since the page fault
handler would play with the memory mapping hardware, it only polled
for interrupts at the very few places that the hardware was in a
known, good state. I could pretty much ignore the interrupts for
debugging this one.
I put in a couple of quick checks to halt the machine when it wrote
the page table entry and rebooted the machine. The machine ran for a
while and then halted at my debug break. I single stepped the
microcode (from the other processor), and watched it write the page
table entry. It read the bits, modified the correct subfield and
wrote the bits back correctly. Hmmm. Must be a different
write. I let the machine run, but it didn't touch that entry again.
The machine halted at the sanity check.
This was weird. The machine correctly modified only the field it was
supposed to, yet it crashed. Perhaps the word was smashed after the
modification. I added some code to check the value of the word at
crash time. When I ran it again, I found that the word contained
exactly what was written at my breakpoint.
I had to re-evaluate my assumption that the field being modified had
something to do with the crash. Since that page had not previously
been in use at all, I assumed it was zeroed out. I added a sanity
check to ensure this. My sanity check failed. The pristine page
table entry was not zeroed out. This made sense. The code had taken
an entry that was not correctly initialized, correctly filled in the
right field, but the other bits were wrong and that caused the crash.
Clearly there was a bug in the initialization code.
This was sounding like a fencepost error. A fencepost error when
initializing a table might leave one entry at the end uninitialized,
or it might overwrite an entry in the next table. I didn't see
anything obvious in the page table initialization code, though. I put
a debug halt in just after the page table was initialized. When the
machine halted, I looked at the table. It was fine. Everything was
zeroed out correctly, and the next table in memory had no problems,
either.
Obviously someone had done something with the entry between the time
it was initialized and the time it was first used. I put in some more
debug code to halt whenever that particular page table entry was
accessed. This was possible because the page table manipulation code
was in a single subroutine. I restarted the machine. It first halted
when it initialized the page. As expected. It halted a second time
when it read the page entry. That wasn't expected. Whoever modified
the page table entry didn't go through the normal channels. That
would cause a bug, but it was wasn't going to be easy to find.
I wrote a debugging subroutine that would simply validate the contents
of that one special address and put a call to that subroutine in the
middle of the boot sequence. The machine halted at my new test. I
put it one quarter of the way into the boot sequence. It didn't halt.
I put it three eights of the way through. Over a period of a couple
of hours, I had determined where in the boot sequence the memory
location was being overwritten. It was some subroutine that really
had nothing to do with the memory system. In fact, it barely used the
memory system. I pored over the code, but there seemed to be nothing
at all out of the ordinary.
I was baffled. I was beginning to wonder if the memory maps had been
corrupted. It is almost inconceivable that the machine would be able
to run for more than a few milliseconds with corrupt memory maps, but
it was the only thing I could think of that would allow you to make an
undetected stray memory write. I wrote some more code to validate the
memory maps. They were fine.
Ok, if that memory location started as zero, then later became
non-zero, then a memory write cycle must have occurred.
There are no two ways about it. I wrote some more microcode that
checked the contents of that location and verified that it contained
the correct value. I modified the microassembler to insert a microcode
subroutine call to my validation code after every memory
cycle. I rebooted the machine. It was incredibly slow. Every single
memory reference involved an out-of-line microcode call to validate
the page map. Nonetheless, the validation code finally caused the
machine to halt.
It was in a weird location. Well, weird only in the fact that
absolutely nothing interesting was going on. By using the backtrace
buffer and looking at the microcode I could see which write
instruction must have failed. But there was no reason it should have
failed because the nearby writes to the same page worked just fine. I
looked at the contents of the virtual address register and the memory
data register. Surprisingly, they were the expected values. In fact,
the memory data register did not contain the bogus value that was
written to the page table. This is, in a word, impossible.
So what were the implicit assumptions I was making? One of them was
wrong. Bad memory board? The memory test passed. I swapped the
memory cards on the machine, but it still crashed in the same place.
It was too deterministic and predictable to be flaky hardware. But
the Lisp processor wasn't writing to that location.
On the other hand, the Lisp processor wasn't the only device on the
bus. The corrupted memory location contained a byte full of 1s. I
asked our hardware people: Is there any reason a device would write a
byte full of 1s to another card on the bus?
As a matter of fact, there was. The NuBus doesn't have separate
interrupt or I/O channels. It signals an interrupt by doing a memory
cycle and plopping a byte full of 1s into another card's address
space. For some reason, a device in the Explorer was trying to post
an interrupt and ended up dropping a turd right where the Explorer put
its physical page tables. This happened fairly early on, but after
the page table were initialized. Then, some 15 minutes later, when
the Explorer finally got around to using that page, it bombed on the
corrupted entry.
I found the resident Explorer experts and mentioned
that something was trying to post an interrupt. They looked at the
initialization sequence and discovered that the code to initialize one
of the devices had a bug and didn't correctly turn off the interrupt.
The device would post an interrupt every few seconds. It was a couple
of lines to fix it, but it took me several days of intense debugging
to find it.
very interesting read, as always, joe.
ReplyDeleteThanks!
Fascinating.
ReplyDeleteIt's been a long time since I worked that close to the metal (writing drivers for automated test equipment at Teradyne), and I don't really miss in, but it is nice to hear a tale from the trenches!